Wine Quality

Analysing Chemical Parameters to Improve the Quality

Stages of Analysis

1. Define the problem

2. Collect Data

3. Data Cleaning and Preprocessing

4. Data Analysis Techniques

5. Exploratory Data Analysis (EDA) and Interpretation of Results

6. Conclusion and Recommendations

1. Define the problem ⬆️ Return

In this dataset, we have some chemical variables about types of wine, and the goal is to understand how these parameters are related to the quality and how we can improve the quality of the wine. The variables are:

Alcohol
Volatile Acidity
Fixed Acidity
Citric Acid
Residual Sugar
Chlorides
Density
Total Sulfur Dioxide
pH
Sulphates
Free Sulfur Dioxide

2. Collect Data ⬆️ Return

The data was extracted from the file wine_quality.xlsx, from Kaggle, availabre at: https://www.kaggle.com/datasets/yasserh/wine-quality-dataset.

3. Data Cleaning and Preprocessing ⬆️ Return

Using the library pandas to import and interprete the data:

import pandas as pd

wine_df = pd.read_excel('wine_quality.xlsx')
wine_df.info()

In this case, after analysing the resume of informations (columns name, number of rows and data type) it was noted that there isn't NaNvalues, but the ID column was named wrongly and the Type column can't helps a lot. So:

wine_df = wine_df.rename(columns{'Unnamed 0':'ID'})
wine_df = wine_df.drop(['Type'], axis = 1)

Calculating the initial wine quality's mean ⬆️ Return

With the dataset informations, the initial quality mean is:

initial_mean = wine_df['quality'].mean()

The initial mean is 5.812.

4. Data Analysis Techniques ⬆️ Return

A mean of 5.812 it's very low about quality. So, the challenge is reach out 7.000, at least. To help with this challenge, it will be used two artifices: a Pearson's correlation matrix and histogram graphics, to understand how the variables can interefe in the quality and how they are related to each other.

5. Exploratory Data Analysis (EDA) and Interpretation of Results ⬆️ Return

The Pearson's correlation is very useful in data analysis, because it can show if there's any correlation (weak or strong, direct or indirect) among variables. The libraries pandas and seaborn help us with, using the funtion .corr() from pandas and the heatmap from seaborn.

import seaborn as sns
    
correlation = wine_df.corr()
sns.heatmap(correlation, annot = True, fmt = ".1f", vmax = 1, vmin = -1)

Analysing the matrix, the most part of the correlations are very weak. Only two correlations are relatively strong: density with alcohol (-0.7) and total sulfur dioxide with free sulfur dioxide (+0.7). In this case, we can predict that the lower density is, higher the alcohol will be and vice versa (strong negative correlation) and lower the total sulfur dioxide is, higher the free sulfur dioxide will be and vice versa (strong positive correlation). When looking at the quality column, there isn't any variable who has a strong correlation. The highest value is +0.4 (correlation between quality and alcohol), a weak correlation.

About graphical analysis, some graphics were generated, relating the quality and chemical parameters, using the library plotly.express.

import plotly.express as px
    
for coluna in wine_df.columns:
    grafic = px.histogram(wine_df, x = coluna, color = 'quality')
    grafic.show()

Analysing the informations above, some insights could be get:

Quality x Alcohol: The bests types of wine have more alcohol in its composition compared with the worsts ones. Analysing deeply the graphic, values higher than 10.920 can be interesting.

Quality x Volatile Acidity: The bests types of wine have a volatile acidity between 0.150 and 0.450 in its composition compared with the worsts ones. Analysing deeply the graphic, values between 0.280 and 0.370 can be interesting.

Quality x Fixed Acidity: The bests types of wine have a fixed acidity between 6.000 and 7.500 in its composition compared with the worsts ones. Analysing deeply the graphic, values between 6.600 and 7.300 can be interesting.

Quality x Citric Acid: The bests types of wine have a citric acid between 0.250 and 0.400 in its composition compared with the worsts ones. Analysing deeply the graphic, values between 0.280 and 0.320 can be interesting.

Quality x Residual Sugar: The bests types of wine have a residual sugar between 1.000 and 7.000 in its composition compared with the worsts ones. Analysing deeply the graphic, values between 2.300 and 6.200 can be interesting.

Quality x Chlorides: The bests types of wine have less chlorides in its composition compared with the worsts ones. Analysing deeply the graphic, values lower than 0.034 can be interesting.

Quality x Density: The bests types of wine have density between 0.980 and 0.990 in its composition compared with the worsts ones. Analysing deeply the graphic, values between 0.989 and 0.992 can be interesting.

Quality x Total Sulfur Dioxide: The bests types of wine have total sulfur dioxide between 80 and 150 in its composition compared with the worsts ones. Analysing deeply the graphic, values between 80 and 129 can be interesting.

Quality x pH: The bests types of wine have pH between 2.850 and 3.600 in its composition compared with the worsts ones. Analysing deeply the graphic, values between 2.890 and 3.540 can be interesting.

Quality x Sulphates: The bests types of wine have sulphates between 0.300 and 1.000 in its composition compared with the worsts ones. Analysing deeply the graphic, values between 0.360 and 0.900 can be interesting.

Quality x Free Sulfur Dioxide: The bests types of wine have free sulfur dioxide between 20 and 70 in its composition compared with the worsts ones. Analysing deeply the graphic, values between 20 and 64 can be interesting.

The graphics above reforce the results of Pearson's correlation matrix: the most part of the density are very low and the alcohol high, as well as sulfur dioxide and free sulfur dioxide are low.

Setting chemical parameters and calculating new mean

Now, taking the bests values for each column and filtering them, the tendency is improve the quality. Checking this up:

    wine_df = wine_df[wine_df['alcohol'] >= 10.920]
    wine_df = wine_df[(wine_df['volatile acidity'] <= 0.370) & (wine_df['volatile acidity'] >= 0.280)]
    wine_df = wine_df[(wine_df['fixed acidity'] <= 7.300) & (wine_df['fixed acidity'] >= 6.600)]
    wine_df = wine_df[(wine_df['citric acid'] >= 0.280) & (wine_df['citric acid'] <= 0.320)]
    wine_df = wine_df[(wine_df['residual sugar'] <= 6.200) & (wine_df['residual sugar'] >= 2.300) ]
    wine_df = wine_df[wine_df['chlorides'] <= 0.034]
    wine_df = wine_df[(wine_df['density'] <= 0.992) & (wine_df['density'] >= 0.989)]
    wine_df = wine_df[(wine_df['total sulfur dioxide'] <= 129) & (wine_df['total sulfur dioxide'] >= 80)]
    wine_df = wine_df[(wine_df['pH'] >= 2.890) & (wine_df['pH'] <= 3.540)]
    wine_df = wine_df[(wine_df['sulphates'] >= 0.360) & (wine_df['sulphates'] <= 0.900)]
    wine_df = wine_df[(wine_df['free sulfur dioxide'] <= 64) & (wine_df['total sulfur dioxide'] >= 20)]

New mean:

    new_mean_quality = wine_df['quality'].mean()

The new mean is 7.104.

6. Conclusion and Recommendations ⬆️ Return

Adopting strategical values was able to improve the wine quality by 22.23% (from 5.812 to 7.104). Therefore, the wine-producing company should focus on the range of values for each chemical parameter:

Alcohol: Higher than 10.92
Volatile Acidity: Between 0.280 and 0.370
Fixed Acidity: Between 6.600 and 7.300
Citric Acid: Between 0.280 and 0.370
Residual Sugar: Between 2.300 and 6.200
Chlorides: Lower than 0.034
Density: Between 0.989 and 0.992
Total Sulfur Dioxide: Between 80 and 129
pH: Between 2.890 and 3.540
Sulphates: Between 0.360 and 0.900
Free Sulfur Dioxide: Between 20 and 64

The critical parameter is density, as it has the narrowest range of values (0.989 to 0.992). Therefore, it requires the highest precision in order to cover the range effectively. Consequently, producing wines with these parameters above will likely result in a delightful wine to enjoy.

Name		Name	Last commit message	Last commit date
Latest commit History 76 Commits
Graphics.zip		Graphics.zip
ReadMe.md		ReadMe.md
wine_quality.ipynb		wine_quality.ipynb
wine_quality.py		wine_quality.py
wine_quality.xlsx		wine_quality.xlsx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Wine Quality

Analysing Chemical Parameters to Improve the Quality

Stages of Analysis

1. Define the problem ⬆️ Return

2. Collect Data ⬆️ Return

3. Data Cleaning and Preprocessing ⬆️ Return

Calculating the initial wine quality's mean ⬆️ Return

4. Data Analysis Techniques ⬆️ Return

5. Exploratory Data Analysis (EDA) and Interpretation of Results ⬆️ Return

Setting chemical parameters and calculating new mean

6. Conclusion and Recommendations ⬆️ Return

About

Releases

Packages

Languages

JPedroPy/Data_Analysis_Wine_Quality

Folders and files

Latest commit

History

Repository files navigation

Wine Quality

Analysing Chemical Parameters to Improve the Quality

Stages of Analysis

1. Define the problem ⬆️ Return

2. Collect Data ⬆️ Return

3. Data Cleaning and Preprocessing ⬆️ Return

Calculating the initial wine quality's mean ⬆️ Return

4. Data Analysis Techniques ⬆️ Return

5. Exploratory Data Analysis (EDA) and Interpretation of Results ⬆️ Return

Setting chemical parameters and calculating new mean

6. Conclusion and Recommendations ⬆️ Return

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages