---
# Wine Quality "Warm up" Challenge
### Physicochemical factors that predict good quality wine
---
The "warm up" challenge for this year is adapted from the well-known 'Wine Quality' challenge on Kaggle. In particular, given a dataset containing several attributes describing wine, your task is to make predictions on the quality of as-yet unlisted wine samples. Developing a model which accurately fits the available training data while also generalising to unseen data-points is a multi-faceted challenge that involves a mixture of data exploration, pre-processing, model selection, and performance evaluation.

**IMPORTANT**: please refer to the AML course guidelines concerning grading rules. Pay special attention to the **presentation quality** item, which boils down to: don't dump a zillion lines of code and plots in this notebook. Produce a concise summary of your findings: this notebook can exist in two versions, a "scratch" version that you use to work and debug, and a "presentation" version that you submit. The "presentation" notebook should get to the point and convey the main findings of your work.

---
## Overview
Beyond simply producing a well-performing model for making predictions, in this challenge we would like you to start developing your skills as a machine learning scientist. In this regard, your notebook should be structured so as to explore the following tasks, which are expected to be carried out whenever undertaking such a project. The description below each aspect should serve as a guide for your work, but you can also explore alternative options and directions. Thinking outside the box will be rewarded in these challenges.

### 1. Data preparation:
   
_Data exploration_: The first broad component of your work should enable you to familiarise yourselves with the given data, an outline of which is given at the end of this challenge specification. Among others, you can work on:
   
* Data cleaning, e.g. treatment of categorical variables;
* Data visualisation;
* Computing descriptive statistics, e.g. correlation.

_Data Pre-processing_: The previous step should give you a better understanding of which pre-processing is required for the data. This may include:

* Normalising and standardising the given data;
* Removing outliers;
* Carrying out feature selection, possibly using metrics derived from information theory;
* Handling missing information in the dataset;
* Augmenting the dataset with external information;
* Combining existing features.

Note that, as the name implies, this is a warm-up challenge, which essentially means that data is already put in a convenient format that requires minimal pre-processing.

### 2. Model selection
An important part of the work involves the selection of a model that can successfully handle the given data and yield sensible predictions. Instead of focusing exclusively on your final chosen model, it is also important to share your thought process in this notebook by additionally describing alternative candidate models. There is a wealth of models to choose from, such as decision trees, random forests, (Bayesian) neural networks, Gaussian processes, LASSO regression, and so on. 

Irrespective of your choice, it is highly likely that your model will have one or more parameters that require tuning. There are several techniques for carrying out such a procedure, such as cross-validation.

### 3. Performance evaluation
The evaluation metric for this project is "Log Loss". For the N wines in the test data set, the metric is calculated as:

\\(\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \right]\\)

where \\(y_i\\) is the true (but withheld) quality outcome for wine \\(i\\) in the test data set, and \\(p_i\\) is the predicted probability of good quality for wine \\(i\\). Larger values of \\(\mathcal{L}\\) indicate poorer predictions.
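
As a minimal sketch of how this metric behaves, the snippet below computes the standard binary log loss by hand for a few made-up labels and probabilities. The `log_loss_manual` helper and the sample values are illustrative only and not part of the challenge material; clipping the probabilities away from 0 and 1 is an assumed convention that keeps the logarithm finite.

```python
import math

def log_loss_manual(y_true, p_pred, eps=1e-15):
    """Mean binary log loss; the eps clipping is an assumed convention."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 0, 1]              # made-up labels: 1 = good wine
confident = [0.9, 0.1, 0.2, 0.8]   # probabilities close to the labels
hesitant = [0.6, 0.4, 0.5, 0.5]    # probabilities close to 0.5

print(log_loss_manual(y_true, confident))
print(log_loss_manual(y_true, hesitant))
```

The confident, correct predictions give the smaller loss, while a prediction of 0.5 for every wine would score \\(\log 2\\), which is a useful reference point when reading log-loss values.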

---
## Dataset description
You will be working on two data files, which will be available in ```/mnt/datasets/wine/```, one for red and one for white wines:

* winequality-red.csv 
* winequality-white.csv

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference Cortez et al., 2009. Only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

### Tips
A possible trick is to set an arbitrary cutoff for your dependent variable (wine quality) at e.g. 7 or higher getting classified as 'good/1' and the remainder as 'not good/0'. Note that this can be seen as a data preparation task.

### Training and test sets
We leave it to the students to decide how to carve out training and test sets (and validation sets too, if relevant to your approach). This is not a competition in which the instructors hold a "private" test set to rank students' models.



P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

### Attributes

Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)

### 1. Data preparation:

### 1.1 Data cleaning:

In [155]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn


dataset_r = pd.read_csv('winequality-red.csv')
dataset_w = pd.read_csv('winequality-white.csv')

In [156]:
dataset_w.head()

Unnamed: 0,"fixed acidity;""volatile acidity"";""citric acid"";""residual sugar"";""chlorides"";""free sulfur dioxide"";""total sulfur dioxide"";""density"";""pH"";""sulphates"";""alcohol"";""quality"""
0,7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6
1,6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9...
2,8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;1...
3,7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4...
4,7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4...


In [157]:
# Parse the semicolon-delimited data, which read_csv loaded as a single column

def get_column_list(df):
    # Recover the column names from the single semicolon-delimited header
    return df.columns[0].replace('"', '').split(';')

def convert_to_df(df):
    # Split every row on ';' and build a new dataframe with clean columns
    columns = get_column_list(df)

    rows = [row.split(';') for row in df.iloc[:, 0]]

    # The values are decimals, so cast to float (int8 cannot represent them)
    new_df = pd.DataFrame(rows, columns=columns).astype(float)
    return new_df
    

dataset_r = convert_to_df(dataset_r)
dataset_w = convert_to_df(dataset_w)
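
The single-column output above arises because the files use `;` as the delimiter. As a hedged alternative to splitting the strings by hand, `pd.read_csv` accepts a `sep` argument. The sketch below uses an in-memory sample (two complete rows copied from the outputs shown in this notebook) in place of the real files, so that it runs on its own.

```python
import io
import pandas as pd

# Two complete rows copied from the outputs shown in this notebook,
# laid out like the semicolon-delimited CSV files.
sample = io.StringIO(
    '"fixed acidity";"volatile acidity";"citric acid";"residual sugar";'
    '"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";'
    '"pH";"sulphates";"alcohol";"quality"\n'
    '7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6\n'
    '7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n'
)

# sep=';' tells pandas the delimiter, so column names and numeric
# dtypes are inferred without any manual string splitting.
df = pd.read_csv(sample, sep=';')
print(df.shape)
print(df.dtypes)
```

With the real files, the equivalent call would be `pd.read_csv('winequality-red.csv', sep=';')`, assuming the files are laid out as shown in the output above.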


In [158]:
# Merge the two datasets, adding a 'type' feature that records whether a wine is red or white
dataset_r['type'] = 0 # encode type of wine Red = 0
dataset_w['type'] = 1 # encode type of wine White = 1
dataset = pd.concat([dataset_r, dataset_w], axis = 0).astype(float)




In [159]:
# No missing values
dataset.isnull().sum()


fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
type                    0
dtype: int64

### 1.2 Data visualisation

In [160]:
dataset.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5.0,0.0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5.0,0.0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5.0,0.0
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6.0,0.0
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5.0,0.0


### 1.3 Data pre-processing

In [161]:
# Binary classification problem: quality >= 7 is 'good' (1), anything lower is 'not good' (0)
dataset['quality'] = (dataset['quality'] >= 7).astype(float)

In [162]:
%matplotlib notebook
corr_matrix = dataset.corr().abs()
fig, ax = plt.subplots(figsize=(10,10))
sn.heatmap(corr_matrix, annot=True, ax = ax)
plt.show()




In [39]:
best_features = corr_matrix['quality'].drop('quality').drop('free sulfur dioxide').sort_values(ascending=False).index
print(best_features)

Index(['alcohol', 'density', 'chlorides', 'volatile acidity', 'type',
       'residual sugar', 'citric acid', 'total sulfur dioxide',
       'fixed acidity', 'sulphates', 'pH'],
      dtype='object')


In [163]:
# 'free sulfur dioxide' and 'total sulfur dioxide' are highly correlated, so we remove one of them.
# We choose to drop 'free sulfur dioxide' since it is only weakly correlated with the quality.

# Keep a copy of the full dataset before dropping
dataset_before_droping = dataset.copy()

dataset = dataset.drop('free sulfur dioxide', axis = 1)



In [164]:
# Choose k best features (k = None keeps all remaining features)

k = None

# Start again from the full dataset, then drop 'free sulfur dioxide'
dataset = dataset_before_droping.drop('free sulfur dioxide', axis = 1)

if k is not None:
    print(k)
    # Drop the features least correlated with quality, keeping the top k
    for feature in best_features[k:][::-1]:
        print('drop ' + feature)
        dataset = dataset.drop(feature, axis = 1)
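
The ranking above relies on absolute Pearson correlation, which only captures linear association. As a hedged sketch of an information-theoretic alternative, the snippet below compares correlation with scikit-learn's `mutual_info_classif` on synthetic data (not the wine data), where the label depends on the first feature in a non-linear way. All names here (`X_demo`, `y_demo`) are illustrative.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: the label depends on |feature 0|, feature 1 is pure noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 2))
y_demo = (np.abs(X_demo[:, 0]) > 1).astype(int)

# Absolute Pearson correlation of each feature with the label
corr = [abs(np.corrcoef(X_demo[:, j], y_demo)[0, 1]) for j in range(2)]

# Estimated mutual information between each feature and the label
mi = mutual_info_classif(X_demo, y_demo, random_state=0)

print('abs correlation   :', corr)
print('mutual information:', mi)
```

In this constructed case the correlation of the informative feature stays near zero while its mutual information is clearly the larger of the two, so the two rankings can disagree. Whether that matters for the wine features would need to be checked on the actual data.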

In [165]:
# We need to normalize whenever the dataset contains features with different ranges.
# Useful link to understand : https://medium.com/@swethalakshmanan14/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff

dataset.describe()


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
count,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0
mean,7.215307,0.339666,0.318633,5.443235,0.056034,115.744574,0.994697,3.218501,0.531268,10.491801,0.196552,0.753886
std,1.296434,0.164636,0.145318,4.757804,0.035034,56.521855,0.002999,0.160787,0.148806,1.192712,0.397421,0.430779
min,3.8,0.08,0.0,0.6,0.009,6.0,0.98711,2.72,0.22,8.0,0.0,0.0
25%,6.4,0.23,0.25,1.8,0.038,77.0,0.99234,3.11,0.43,9.5,0.0,1.0
50%,7.0,0.29,0.31,3.0,0.047,118.0,0.99489,3.21,0.51,10.3,0.0,1.0
75%,7.7,0.4,0.39,8.1,0.065,156.0,0.99699,3.32,0.6,11.3,0.0,1.0
max,15.9,1.58,1.66,65.8,0.611,440.0,1.03898,4.01,2.0,14.9,1.0,1.0


In [166]:
from sklearn.model_selection import train_test_split

SEED = 4

target = dataset.quality.values
X = dataset.drop('quality', axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X,target,test_size=0.2,random_state=SEED)
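
Since 'good' wines are a minority after binarisation, a purely random split can leave the train and test sets with slightly different class ratios. One possible refinement, sketched here on synthetic labels rather than the wine data, is the `stratify` argument of `train_test_split`, which preserves the class proportions in both parts. The arrays `X_demo` and `y_demo` are stand-ins introduced for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples, 20% of them labelled 'good' (1)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# stratify=y_demo keeps the share of each class equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4, stratify=y_demo
)

print('share of good wines in train:', y_tr.mean())
print('share of good wines in test :', y_te.mean())
```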


In [167]:
# Normalization 

from sklearn.preprocessing import MinMaxScaler

# fit scaler on training data
norm = MinMaxScaler().fit(X_train)

# transform training data
X_train_norm = norm.transform(X_train)

# transform testing data
X_test_norm = norm.transform(X_test)


In [168]:
# Standardisation

from sklearn.preprocessing import StandardScaler

scale = StandardScaler().fit(X_train_norm)
    
# transform the training data column
X_train_stand = scale.transform(X_train_norm)
    
# transform the testing data column
X_test_stand = scale.transform(X_test_norm)
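
A note on chaining the two scalers: min-max scaling is a per-feature affine map, and standardisation removes any such shift and rescaling, so standardising the normalised data should give the same values as standardising the raw data directly. The sketch below checks this on synthetic data (the arrays `X_tr` and `X_te` are made up for the illustration, not the wine features).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic features on very different scales
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=[7.0, 100.0], scale=[1.0, 50.0], size=(200, 2))
X_te = rng.normal(loc=[7.0, 100.0], scale=[1.0, 50.0], size=(50, 2))

# Option A: standardise the raw data directly
direct = StandardScaler().fit(X_tr)

# Option B: min-max normalise first, then standardise (as in the cells above)
norm_demo = MinMaxScaler().fit(X_tr)
chained = StandardScaler().fit(norm_demo.transform(X_tr))

same_train = np.allclose(direct.transform(X_tr),
                         chained.transform(norm_demo.transform(X_tr)))
same_test = np.allclose(direct.transform(X_te),
                        chained.transform(norm_demo.transform(X_te)))
print(same_train, same_test)
```

If both checks hold, either scaler on its own is enough for the models that follow; which of the two to keep is a modelling choice.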


In [169]:
X_train_norm[1]

array([0.16528926, 0.08      , 0.22289157, 0.00920245, 0.0448505 ,
       0.18862691, 0.02737613, 0.28682171, 0.15340909, 0.68115942,
       1.        ])

### 2. Model selection

In [170]:
# Logistic regression

from sklearn.linear_model import LogisticRegression

log = LogisticRegression(random_state = SEED, solver='lbfgs').fit(X_train_stand, y_train)


In [171]:
# SVM
from sklearn.svm import LinearSVC

svm = LinearSVC(random_state=SEED, tol=1e-9, max_iter = 20000).fit(X_train_stand, y_train)



In [172]:
# LDA 
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis().fit(X_train_stand, y_train)


In [173]:
# Random forest
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=150,max_depth=20, random_state=SEED)
rfc.fit(X_train_stand, y_train)
pred_rfc = rfc.predict(X_test_stand)
           

### 3. Performance evaluation


In [174]:
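def score_compute(pred, y):
    # Accuracy for 0/1 labels: 1 minus the mean absolute error
    # (pred and y are expected to be NumPy arrays of the same length)
    ms = pred- y
    ms = abs(ms)
    return 1 - ms.sum()/ms.shape[0]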

In [175]:
from sklearn.model_selection import cross_val_score
import time

before = time.time()

log_eval = cross_val_score(estimator = LogisticRegression(random_state = SEED, solver='lbfgs'), X = X_train_stand, y = y_train, cv = 5)
svm_eval = cross_val_score(estimator = LinearSVC(random_state=SEED, tol=1e-9, max_iter = 20000), X = X_train_stand, y = y_train, cv = 5)
lda_eval = cross_val_score(estimator = LinearDiscriminantAnalysis(), X = X_train_stand, y = y_train, cv = 5)
rfc_eval = cross_val_score(estimator = rfc, X = X_train_stand, y = y_train, cv = 5)

after = time.time()

duration = after-before

In [176]:
print('Logistic Regression acc = ',log_eval.mean())
print('SVM acc = ',svm_eval.mean())
print('LDA acc = ',lda_eval.mean())
print('Random Forest acc = ',rfc_eval.mean())

Logistic Regression acc =  0.8160468645887317
SVM acc =  0.8168170208040276
LDA acc =  0.8146996002072999
Random Forest acc =  0.876658399348486
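
Two points help put these accuracies in context. First, the `describe()` table gives a mean binarised quality of 0.196552, so always predicting 'not good' would already reach roughly 0.80 accuracy on the full dataset (the training split may differ slightly). Second, the challenge metric is log loss rather than accuracy. The sketch below, on synthetic imbalanced data generated with `make_classification` (an assumption made so the example is self-contained), shows how both metrics can be obtained from `cross_val_score` and compared against a class-prior baseline.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced problem: roughly 80% negatives, 20% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, weights=[0.8, 0.2], random_state=0,
)

models = {
    # Predicts the majority class; its probabilities are the class priors
    'baseline (class prior)': DummyClassifier(strategy='prior'),
    'logistic regression': LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    acc = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy').mean()
    # 'neg_log_loss' is negated so that higher is better; flip the sign back
    ll = -cross_val_score(model, X_demo, y_demo, cv=5, scoring='neg_log_loss').mean()
    results[name] = (acc, ll)
    print(name, 'acc =', acc, 'log loss =', ll)
```

On the wine data, the same `scoring='neg_log_loss'` argument could be passed to the `cross_val_score` calls above for models that expose `predict_proba`; `LinearSVC` does not, so it would need a calibrated wrapper or a different model.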


### 2.1. Other model selection

In [75]:
# Grid search to choose the best hyperparameters for random forest

# The parameters found here are the ones used above when defining the random forest

from sklearn.model_selection import train_test_split, GridSearchCV

n_estimators = [10,30,50,70,100,150,200,300]
max_depths = [3,4,5,6,7,8,10,15,20,None]

before = time.time()
param = {
    'n_estimators': [100,150,200],
    'max_depth': [20,25,None],
}
grid_svc = GridSearchCV(rfc, param_grid=param, scoring='accuracy', cv=10)
grid_svc.fit(X_train_stand, y_train)
after = time.time()

duration = after-before

In [76]:
grid_svc.best_params_

{'max_depth': 20, 'n_estimators': 150}