# Wine quality
We want to classify wine quality based on several measured characteristics of the wine. We will use a dataset from [Kaggle](https://www.kaggle.com/datasets/ghassenkhaled/wine-quality-data).
## Importing the libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
from os import stat
from scipy.stats import norm
import numpy as np
import seaborn as sns
%matplotlib inline

## Importing the dataset

In [None]:
filename = 'data/Wine_Quality_Data.csv'

print(f'File size: {stat(filename).st_size / 1024:.1f} kB.')

df = pd.read_csv(filename)

## Exploratory Data Analysis
We are now going to explore and visualize the data.

In [None]:
print(f'Name of columns: {df.columns.values}.')
print(f'Data shape: {df.shape}.')
print(f'Data types: {df.dtypes}.')
print(f'Number of missing values: {df.isna().sum().sum()}.')

display(df.head(10))
display(df.describe())
df.info()  # info() prints its own summary and returns None, so display() is not needed

print(f'Wine colors: {df["color"].unique()}.')
print(f'Quality values: {np.sort(df["quality"].unique())}.')

As we have two wine colors, we might want to split our global dataset into two datasets, one for each color.

In [None]:
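# One DataFrame per wine color; `color` is constant within each subset, so drop it
data = {
    color: df[df['color'] == color].drop('color', axis=1)
    for color in ('red', 'white')
}

for color in data:
    print(f'Wine color: {color}.')
    # Inspect the per-color subset, not the full dataset
    display(data[color].head(10))
    display(data[color].describe())
    data[color].info()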

Now, we are going to visualize the distribution of the quality for each color.

In [None]:
for color in data:
    quality = data[color]['quality']
    # Center one unit-width bin on each integer quality score
    bins = np.arange(quality.min() - 0.5, quality.max() + 1.5)
    x = np.linspace(bins[0], bins[-1], 100)
    plt.figure(figsize=(10, 7))
    plt.title(f'Wine color: {color}.')
    plt.hist(quality, bins=bins, edgecolor='black', linewidth=1.2)
    # Scale the normal density by n * bin width (= n) so it matches the count axis
    plt.plot(x, norm.pdf(x, quality.mean(), quality.std()) * len(quality), color='red')
    plt.show()
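
To complement the visual check, the shape of a distribution can also be summarized numerically. The sketch below is only an illustration: `shape_summary` is a hypothetical helper and the toy scores are made up, not taken from the dataset. It relies on pandas' `Series.skew()` and `Series.kurt()`, both of which are close to 0 for normally distributed data.

```python
import pandas as pd

def shape_summary(quality: pd.Series) -> dict:
    """Return the mean, standard deviation, skewness and excess kurtosis."""
    return {
        'mean': quality.mean(),
        'std': quality.std(),
        'skew': quality.skew(),             # 0 for a symmetric distribution
        'excess_kurtosis': quality.kurt(),  # 0 for a normal distribution
    }

# Toy quality scores, made up for illustration (symmetric around 6)
toy_quality = pd.Series([4, 5, 5, 6, 6, 6, 7, 7, 8])
summary = shape_summary(toy_quality)
print(summary)
```

In the notebook, the same helper could be called as `shape_summary(data[color]['quality'])` inside the loop over colors.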

The quality scores look approximately normally distributed for both colors. Next, let's look at the correlations between all variables.

In [None]:
for color in data:
    print(f'Wine color: {color}.')
    # Use the per-color subset; the full `df` still contains the non-numeric `color` column
    corr = data[color].corr()
    display(corr)
    plt.figure(figsize=(13, 10))
    sns.heatmap(corr, cmap=plt.cm.RdYlBu_r, vmin=-0.25, vmax=0.6, annot=True)
    plt.title(f'Correlation heatmap ({color} wine)')
    plt.show()

The correlation matrix does not reveal any pair of features correlated strongly enough to justify dropping one.
However, the `alcohol` feature is noticeably correlated with the target variable, `quality`.
Let's plot one against the other.

In [None]:
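for color in data:
    plt.figure(figsize=(10,7))
    plt.title(f'Wine color: {color}.')
    # Quality is an integer score, so points overlap; transparency shows density
    plt.scatter(data[color]['alcohol'], data[color]['quality'], alpha=0.3)
    plt.xlabel('alcohol')
    plt.ylabel('quality')
    plt.show()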
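
The relationship between the features and the target can also be read off as a ranked list instead of a heatmap. The following is a minimal sketch, assuming the target column is named `quality` as above; `rank_by_target_corr` is a hypothetical helper, and the toy values (including the `pH` column) are made up for illustration.

```python
import pandas as pd

def rank_by_target_corr(frame: pd.DataFrame, target: str = 'quality') -> pd.Series:
    """Pearson correlation of each numeric feature with the target, strongest first."""
    corr = frame.select_dtypes('number').corr()[target].drop(target)
    # Order by absolute value so strong negative correlations rank high too
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'alcohol': [9.0, 10.0, 11.0, 12.0, 13.0],
    'pH': [3.3, 3.1, 3.4, 3.2, 3.3],
    'quality': [5, 5, 6, 6, 7],
})
ranked = rank_by_target_corr(toy)
print(ranked)
```

On the real data this could be applied per color, for example `rank_by_target_corr(data['red'])`.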

## Data preprocessing
The only categorical column, `color`, was dropped when we split the data, so the remaining features are all numerical and no encoding is needed.
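
This claim can be verified programmatically rather than by reading the `dtypes` output. The sketch below is illustrative: `non_numeric_columns` is a hypothetical helper built on pandas' `DataFrame.select_dtypes`, and the toy frame is made up.

```python
import pandas as pd

def non_numeric_columns(frame: pd.DataFrame) -> list:
    """Return the names of the columns that would need encoding."""
    return frame.select_dtypes(exclude='number').columns.tolist()

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'alcohol': [9.4, 10.2, 11.0],
    'quality': [5, 6, 7],
    'color': ['red', 'white', 'red'],
})

print(non_numeric_columns(toy))                        # the string column is flagged
print(non_numeric_columns(toy.drop(columns='color')))  # nothing left to encode
```

An empty list for `data['red']` and `data['white']` would confirm that no encoding step is required.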