# Wine-o-meter : Exploratory Data Analysis
-------
> For this project, we have a dataset that describes the Portuguese "Vinho Verde" wine. The description involves physico-chemical properties as well as the quality of the wine given by professional sommeliers (wine stewards) as a score between 0 (poor quality) and 10 (excellent quality). 

>> **Hypothesis**: The wine quality is related to the wine physico-chemical properties.

>> **Goal** 🎯: Assuming that there is a relationship between wine quality and physico-chemical properties, we want to build an artificial sommelier that can infer a wine's quality from its physico-chemical properties.

-----------

<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
 Please go to the folder <b>Viz</b>, to see the different visualizations.
</div> </pre> 


### Table of Contents

* [1. Load Data](#section1)
* [2. EDA](#section2)
    * [2.1. Explore Dataset](#section21)
    * [2.2. Unique values](#section21)
    * [2.3. Missing values](#section22)
    * [2.4. Duplicates](#section23)
    * [2.5. Univariate Analysis](#section25)
        * [2.5.1. Quantitative predictors](#section25)
        * [2.5.2. Qualitative predictors](#section25)
        * [2.5.3. Target Distribution](#section25)
    * [2.6. Bivariate Analysis](#section22)

In [1]:
# Generic libraries
import os
import pandas as pd 
from numpy import arange

import warnings
warnings.filterwarnings('ignore')

# Visualization libraries
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "iframe" 

# predefined modules
from modules import MyFunctions as MyFunct

# Global parameters 
file_path = 'data/winequality.csv'

if not os.path.exists("Viz"):
    os.mkdir("Viz")

# Load Data <a class="anchor" id="section1"></a>

In [2]:
dataset = pd.read_csv(file_path)

# EDA <a class="anchor" id="section2"></a>

## Explore Dataset <a class="anchor" id="section2"></a>

In [3]:
MyFunct.explore(dataset)

Shape : (6497, 13)

data types : 
type                     object
fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object

Display of dataset: 


Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,white,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,white,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6



Basics statistics: 


Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,6497,6487.0,6489.0,6494.0,6495.0,6495.0,6497.0,6497.0,6497.0,6488.0,6493.0,6497.0,6497.0
unique,2,,,,,,,,,,,,
top,white,,,,,,,,,,,,
freq,4898,,,,,,,,,,,,
mean,,7.216579,0.339691,0.318722,5.444326,0.056042,30.525319,115.744574,0.994697,3.218395,0.531215,10.491801,5.818378
std,,1.29675,0.164649,0.145265,4.758125,0.035036,17.7494,56.521855,0.002999,0.160748,0.148814,1.192712,0.873255
min,,3.8,0.08,0.0,0.6,0.009,1.0,6.0,0.98711,2.72,0.22,8.0,3.0
25%,,6.4,0.23,0.25,1.8,0.038,17.0,77.0,0.99234,3.11,0.43,9.5,5.0
50%,,7.0,0.29,0.31,3.0,0.047,29.0,118.0,0.99489,3.21,0.51,10.3,6.0
75%,,7.7,0.4,0.39,8.1,0.065,41.0,156.0,0.99699,3.32,0.6,11.3,6.0



Distinct values: 


type                      2
fixed acidity           107
volatile acidity        188
citric acid              90
residual sugar          317
chlorides               215
free sulfur dioxide     135
total sulfur dioxide    276
density                 998
pH                      109
sulphates               112
alcohol                 111
quality                   7
dtype: int64
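
The implementation of `MyFunct.explore` is not shown in this notebook. Judging only from its output above, a helper of this kind could be sketched with plain pandas calls as follows. The function name `explore_sketch` and the toy data are assumptions made for illustration, not the project's actual code.

```python
import pandas as pd


def explore_sketch(df: pd.DataFrame, n_rows: int = 5) -> dict:
    """Hypothetical stand-in for MyFunct.explore (an assumption, not the real helper)."""
    summary = {
        "shape": df.shape,                       # (rows, columns)
        "dtypes": df.dtypes,                     # data type of each column
        "head": df.head(n_rows),                 # first rows of the dataset
        "describe": df.describe(include="all"),  # statistics for all column types
        "nunique": df.nunique(),                 # distinct values per column
    }
    print(f"Shape : {summary['shape']}\n")
    print(f"data types : \n{summary['dtypes']}\n")
    print(f"Display of dataset: \n{summary['head']}\n")
    print(f"Basics statistics: \n{summary['describe']}\n")
    print(f"Distinct values: \n{summary['nunique']}")
    return summary


# Toy data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "type": ["white", "white", "red"],
    "alcohol": [8.8, 9.5, 9.4],
    "quality": [6, 6, 5],
})
result = explore_sketch(toy)
```

Passing `include="all"` to `describe` is what produces the `unique`, `top` and `freq` rows for a categorical column such as `type`.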

## Unique values <a class="anchor" id="section21"></a>

In [4]:
Cols = ['type', 'quality']
MyFunct.unique_count(dataset, Cols)

unique values of type:


white    4898
red      1599
Name: type, dtype: int64

unique values of quality:


6    2836
5    2138
7    1079
4     216
8     193
3      30
9       5
Name: quality, dtype: int64

<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
The dataset is <b>imbalanced</b>. Quality scores 3 and 9 each have fewer than 50 observations. There are far more average wines than excellent or poor ones.</div> </pre> 
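
To put a number on the imbalance, the counts printed above can be turned into percentage shares. The snippet below is a minimal sketch that hard-codes those counts. The threshold of 50 observations comes from the note, and the variable names are illustrative.

```python
import pandas as pd

# Quality counts copied from the output of the cell above
quality_counts = pd.Series(
    {6: 2836, 5: 2138, 7: 1079, 4: 216, 8: 193, 3: 30, 9: 5}, name="quality"
)

# Share of each score, in percent of all observations
shares = (quality_counts / quality_counts.sum() * 100).round(2)

# Scores with fewer than 50 observations (the threshold used in the note)
rare = sorted(quality_counts[quality_counts < 50].index.tolist())

print(shares.sort_index())
print("Scores with fewer than 50 observations:", rare)
```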

## Missing values <a class="anchor" id="section22"></a>

In [5]:
print(f"Missing values over {dataset.shape[0]} observations: ")
MyFunct.missing(dataset)

Missing values over 6497 observations: 


Unnamed: 0,Variable,n_missing,p_missing
0,type,0,0.0
6,free sulfur dioxide,0,0.0
7,total sulfur dioxide,0,0.0
8,density,0,0.0
11,alcohol,0,0.0
12,quality,0,0.0
4,residual sugar,2,0.03
5,chlorides,2,0.03
3,citric acid,3,0.05
10,sulphates,4,0.06


<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
There are only a few missing values, so the simplest option is to drop the affected rows.
</div> </pre> 
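
`MyFunct.missing` is a project helper whose code is not shown here. A table like the one above, followed by the removal suggested in the note, could be sketched as follows. The name `missing_sketch`, the reading of `p_missing` as a percentage, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd


def missing_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for MyFunct.missing: missing count and percentage per column."""
    n_missing = df.isna().sum()
    report = pd.DataFrame({
        "Variable": n_missing.index,
        "n_missing": n_missing.values,
        # Assumption: p_missing is expressed in percent of all rows
        "p_missing": (n_missing.values / len(df) * 100).round(2),
    })
    return report.sort_values("n_missing").reset_index(drop=True)


# Toy data for illustration only
toy = pd.DataFrame({
    "pH": [3.0, np.nan, 3.26, 3.19],
    "sulphates": [0.45, 0.49, np.nan, 0.40],
    "quality": [6, 6, 6, 6],
})
report = missing_sketch(toy)

# Drop every row that has at least one missing value
cleaned = toy.dropna()
print(report)
print(f"Rows before: {len(toy)}, after: {len(cleaned)}")
```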

## Duplicates <a class="anchor" id="section23"></a>

In [6]:
print(f"Duplicates over {dataset.shape[0]} observations: ")
MyFunct.duplicates_count(dataset)

Duplicates over 6497 observations: 


Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,records
0,red,4.6,0.52,0.15,2.1,0.054,8.0,65.0,0.99340,3.90,0.56,13.1,4,1
1,red,4.7,0.60,0.17,2.3,0.058,17.0,106.0,0.99320,3.85,0.60,12.9,6,1
2,red,4.9,0.42,0.00,2.1,0.048,16.0,42.0,0.99154,3.71,0.74,14.0,7,1
3,red,5.0,0.38,0.01,1.6,0.048,26.0,60.0,0.99084,3.70,0.75,14.0,6,1
4,red,5.0,0.40,0.50,4.3,0.046,29.0,80.0,0.99020,3.49,0.66,13.6,6,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5290,white,10.3,0.17,0.47,1.4,0.037,5.0,33.0,0.99390,2.89,0.28,9.6,3,1
5291,white,10.3,0.25,0.48,2.2,0.042,28.0,164.0,0.99800,3.19,0.59,9.7,5,1
5292,white,10.7,0.22,0.56,8.2,0.044,37.0,181.0,0.99800,2.87,0.68,9.5,6,2
5293,white,11.8,0.23,0.38,11.1,0.034,15.0,123.0,0.99970,2.93,0.55,9.7,3,1


<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
For this kind of problem, duplicates are not an issue.
</div> </pre> 
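
The `records` column in the output above suggests that `MyFunct.duplicates_count` counts how often each distinct row occurs. That behaviour could be reproduced roughly as in the sketch below. The function name `duplicates_count_sketch` and the toy data are assumptions, not the project's code.

```python
import pandas as pd


def duplicates_count_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for MyFunct.duplicates_count: occurrences of each distinct row."""
    return (
        df.groupby(list(df.columns), dropna=False)  # keep rows that contain NaN
          .size()
          .reset_index(name="records")
    )


# Toy data for illustration only: the first two rows are identical
toy = pd.DataFrame({
    "type": ["white", "white", "white"],
    "pH": [3.19, 3.19, 3.30],
    "quality": [6, 6, 6],
})
counts = duplicates_count_sketch(toy)

# Number of rows that repeat an earlier row
n_extra = int(toy.duplicated().sum())
print(counts)
print("Repeated rows:", n_extra)
```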

## Univariate Analysis <a class="anchor" id="section25"></a>

### Quantitative predictors 

In [7]:
cols = ['fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']

In [8]:
title = 'Distribution of the different quantitative variables (1)'
fig = make_subplots(rows=4, cols=3)

# Fill the 4x3 grid row by row with one box plot per predictor
for k, name in enumerate(cols):
    i, j = divmod(k, 3)
    fig.add_trace(MyFunct.my_box_plotter(dataset[name]), row=i+1, col=j+1)

fig.update_layout(
    title= title, title_x = 0.5,
    showlegend=False,
    width = 1000,
    height = 700
)

# write_image overwrites any existing file, so no prior removal is needed
fig.write_image("Viz/"+title+".png")

fig.show()

<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
Outliers are detected in all features. We may need to handle them before training.
</div> </pre> 
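
One common way to flag the points that a box plot draws beyond its whiskers is the 1.5 × IQR rule. The sketch below applies it to a single column. The function name, the factor `k` and the toy values are assumptions, and the notebook does not state which outlier treatment is eventually used.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the first or third quartile."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy values for illustration only
toy = pd.Series([1.6, 6.9, 8.5, 8.5, 20.7, 65.0], name="residual sugar")
mask = iqr_outlier_mask(toy)

print("Flagged values:", toy[mask].tolist())
print("Values kept:", toy[~mask].tolist())
```

Whether flagged rows should be removed, capped or left alone depends on the model. Tree-based models, for instance, are generally less sensitive to extreme values than linear ones.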

In [9]:
title = 'Distribution of the different quantitative variables (2)'

fig = make_subplots(rows=4, cols=3)

# Fill the 4x3 grid row by row with one histogram per predictor
for k, name in enumerate(cols):
    i, j = divmod(k, 3)
    fig.add_trace(MyFunct.my_hist_plotter(dataset[name]), row=i+1, col=j+1)
    fig.update_xaxes(title_text=name, row=i+1, col=j+1)

fig.update_layout(
    title= title, title_x = 0.5,
    showlegend=False,
    width = 1000,
    height = 700
)


# write_image overwrites any existing file, so no prior removal is needed
fig.write_image("Viz/"+title+".png")
fig.show() 

<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
Almost all predictors' distributions are skewed.
</div> </pre> 
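
Skewness can also be measured numerically instead of being read off the histograms. The sketch below computes the sample skewness of one column with `Series.skew` and shows how a `log1p` transform might reduce it. The toy values are made up for illustration, and the notebook itself does not apply any transform.

```python
import numpy as np
import pandas as pd

# Toy right-skewed values for illustration only
toy = pd.Series([1.6, 1.8, 2.1, 3.0, 6.9, 8.5, 20.7], name="residual sugar")

# Positive skewness means a long right tail
raw_skew = toy.skew()

# log1p compresses large values and often reduces right skew
log_skew = np.log1p(toy).skew()

print(f"Skewness before transform: {raw_skew:.2f}")
print(f"Skewness after log1p:      {log_skew:.2f}")
```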

### Qualitative predictors 

In [10]:
title = 'Distribution of wine type'
wine_type = dataset.type.value_counts()

fig = go.Figure([MyFunct.my_bar_plotter(wine_type.index.values, wine_type.values)])
fig.update_layout(
    title= title, title_x = 0.5,
    showlegend=False,
    width = 500,
    height = 400
)

# write_image overwrites any existing file, so no prior removal is needed
fig.write_image("Viz/"+title+".png")
fig.show() 

<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
We have comparatively little data on red wine (1599 of 6497 observations). This can bias the learning phase, so we might omit this variable.
</div> </pre> 

### Target Distribution

In [11]:
title = 'Distribution of wine quality'
wine_quality = dataset.quality.value_counts()

fig = go.Figure([MyFunct.my_bar_plotter(wine_quality.index.values.astype(str), wine_quality.values)])
fig.update_layout(
    title= title, title_x = 0.5,
    showlegend=False,
    width = 500,
    height = 400
)

# write_image overwrites any existing file, so no prior removal is needed
fig.write_image("Viz/"+title+".png")
fig.show() 

<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
The dataset is <b>imbalanced</b>. Quality scores 3 and 9 each have fewer than 50 observations. There are far more average wines than excellent or poor ones.</div> </pre> 

## Bivariate Analysis (Correlation)

In [12]:
title = 'Correlation degrees between different variables'
fig = MyFunct.my_heatmap(dataset, title)

# Export to a png image (write_image overwrites any existing file)
fig.write_image("Viz/"+title+".png")
fig.show()

<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
<li>There is a noteworthy correlation between <b><i>quality</i></b> and each of <b><i>alcohol</i></b>, <b><i>density</i></b>, <b><i>chlorides</i></b> and <b><i>volatile acidity</i></b>.   
<li>It seems that not all predictors are relevant. Hence, it would be interesting to run <b><i>feature selection</i></b> methods.

<li>Some predictors seem to be correlated with each other:   
<b><i>free sulfur dioxide</i></b> & <b><i>total sulfur dioxide</i></b>;    
<b><i>density</i></b> & <b><i>alcohol</i></b>;   
<b><i>density</i></b> & <b><i>residual sugar</i></b>...
</div> </pre> 
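
The heatmap itself comes from the project helper `MyFunct.my_heatmap`, whose code is not shown. To rank predictors by their correlation with the target without reading values off the plot, a sketch such as the following could be used. The function name `top_correlations`, the use of Pearson correlation (the pandas default) and the toy data are assumptions.

```python
import pandas as pd


def top_correlations(df: pd.DataFrame, target: str, n: int = 3) -> pd.Series:
    """Rank numeric predictors by absolute Pearson correlation with the target."""
    corr = df.select_dtypes("number").corr()  # non-numeric columns are ignored
    return corr[target].drop(target).abs().sort_values(ascending=False).head(n)


# Toy data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "type": ["white", "white", "red", "red", "white"],
    "alcohol": [9.0, 10.0, 11.0, 12.0, 13.0],
    "density": [0.998, 0.996, 0.995, 0.993, 0.991],
    "pH": [3.2, 3.1, 3.3, 3.2, 3.1],
    "quality": [5, 5, 6, 6, 7],
})
top = top_correlations(toy, target="quality")
print(top)
```

Pearson correlation only captures linear relationships, and `quality` is an ordinal score, so a rank-based measure such as `corr(method="spearman")` may be worth comparing.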