# Wine-o-meter : Classification
-----
> The prediction problem can be handled either as a classification task or as a regression task. In the current proposal, we will use classification techniques.

----

### Table of Contents

* [1. Data Preparation](#section1)
    * [1.1. Load Data](#section21)
    * [1.2. Predictors and Target](#section21)
    * [1.3. Training and Validation sets](#section22)
    * [1.4. Preprocessing pipeline](#section22)
* [2. Classification](#section22)
    * [2.1. Preliminary Model Selection](#section21)
    * [2.2. RandomForest Classifier](#section21)
    * [2.3. Support Vector Classifier (SVC)](#section22)
    * [2.4. Save the selected model](#section22)


In [14]:
# generic libs
import pandas as pd
import joblib

# ML libs
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

# predefined modules
from modules import MyFunctions as MyFunct

file_path = "data/winequality.csv"
model_path = "models/model.joblib"

seed = 0
train_ratio = 0.8
val_ratio = 0.2
scoring = 'accuracy'
cv = 3

# Data Preparation

## Load data

In [2]:
# load data
df = pd.read_csv(file_path)

## Predictors and Target

In [3]:
# define the predictors and the target
y = df["quality"]
X = df.drop(["quality", "type"], axis=1)

## Training and validation sets

In [4]:
# split the data into training and validation sets
X_train, X_val, y_train, y_val = MyFunct.train_val(X, y, train_ratio, val_ratio, seed)
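
The helper `MyFunct.train_val` lives in a project module that is not shown here. As a rough sketch of what such a helper might do, the function below wraps scikit-learn's `train_test_split`. The name `train_val_sketch`, the stratification, and the synthetic data are assumptions made for illustration, not the actual implementation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


def train_val_sketch(X, y, train_ratio, val_ratio, seed):
    """Hypothetical stand-in for MyFunct.train_val: one shuffled split."""
    return train_test_split(
        X, y,
        train_size=train_ratio,
        test_size=val_ratio,
        random_state=seed,
        stratify=y,  # assumption: keep class proportions similar in both sets
    )


# Synthetic data, used only to show the resulting shapes
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
X_tr, X_va, y_tr, y_va = train_val_sketch(X_demo, y_demo, 0.8, 0.2, seed=0)
print(X_tr.shape, X_va.shape)
```

The return order of `train_test_split` (train features, held-out features, train target, held-out target) matches the unpacking used in the cell above.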

## Preprocessing pipeline

<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;"><ol>
<li><b>Missing-value imputation</b>: we fill in missing values with the median. The median is the safer choice because, when the data distribution is skewed, the mean is pulled toward outliers while the median is not.
<li><b>Standardization</b>: we standardize the numerical features before training so that features with large scales do not dominate the learning phase.</ol>
</div> </pre> 
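
To make the point about skewed data concrete, here is a small sketch on a toy column (made-up values, not taken from the wine dataset) containing one extreme value and one missing entry. It compares what `SimpleImputer` fills in under the mean and median strategies.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column: three ordinary values, one outlier, one missing entry
col = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(col)
median_filled = SimpleImputer(strategy="median").fit_transform(col)

print(mean_filled[-1, 0])    # mean of 1, 2, 3, 100 -> 26.5
print(median_filled[-1, 0])  # median of 1, 2, 3, 100 -> 2.5
```

The mean-based fill lands far from every ordinary observation, whereas the median stays among them.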

In [5]:
preprocessor = Pipeline(steps = [("imputer", SimpleImputer(strategy="median")), 
                                 ("scaler", StandardScaler())])

X_train = preprocessor.fit_transform(X_train)
X_val = preprocessor.transform(X_val)

# Classification

## Preliminary Model Selection

<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
In this part, we establish a preliminary performance evaluation to get first insights into which classification techniques can solve the current prediction problem efficiently. We evaluate the baseline performance of several techniques, using the default settings of the ML library <b>sklearn</b> (except <b>max_iter</b>, raised to 500 for logistic regression), by means of <b>k-fold cross-validation</b> (here k = 3).
</div> </pre> 
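
The evaluation helper `MyFunct.model_validation` is defined outside this notebook. The sketch below shows one plausible way to obtain a name, mean, and standard deviation per classifier with `cross_val_score`. The function name `model_validation_sketch` and the synthetic data are assumptions for illustration only.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def model_validation_sketch(clf, X, y, cv, scoring):
    """Hypothetical stand-in for MyFunct.model_validation."""
    name = type(clf).__name__
    start = time.time()
    # One score per fold; each fold is held out once while the rest is used to fit
    fold_scores = cross_val_score(clf, X, y, cv=cv, scoring=scoring)
    print(f"fitting {name} is done in {time.time() - start}s")
    return name, fold_scores.mean(), fold_scores.std()


# Synthetic data, used only to exercise the function
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)
name, mean_score, std_score = model_validation_sketch(
    LogisticRegression(max_iter=500), X_demo, y_demo, cv=3, scoring="accuracy"
)
```

Returning a `(name, mean, std)` tuple makes it straightforward to collect the results in a DataFrame with those three columns.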

In [7]:
classifiers = [
    LogisticRegression(max_iter=500),
    SVC(),
    GaussianNB(),
    RandomForestClassifier(random_state = seed),
    AdaBoostClassifier(random_state = seed),
    GradientBoostingClassifier(random_state = seed)
]

scores = []
for clf in classifiers:
    scores.append(MyFunct.model_validation(clf, X_train, y_train, cv, scoring))
    
scores_df = pd.DataFrame(scores, columns= ['name', 'mean', 'std'])
scores_df.sort_values(by=['mean','std'], ascending=False)

fitting LogisticRegression is done in 2.9933881759643555s
fitting SVC is done in 2.6053969860076904s
fitting GaussianNB is done in 0.023194313049316406s
fitting RandomForestClassifier is done in 1.6209118366241455s
fitting AdaBoostClassifier is done in 0.5535402297973633s
fitting GradientBoostingClassifier is done in 10.447529792785645s


| | name | mean | std |
|---|---|---|---|
| 3 | RandomForestClassifier | 0.647104 | 0.006674 |
| 5 | GradientBoostingClassifier | 0.576489 | 0.010655 |
| 1 | SVC | 0.560324 | 0.003719 |
| 0 | LogisticRegression | 0.536657 | 0.004098 |
| 4 | AdaBoostClassifier | 0.453719 | 0.040946 |
| 2 | GaussianNB | 0.428707 | 0.005776 |


<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
<li>The multiclass classification problem is handled well by neither the <b>Naive Bayes</b> classifier nor the <b>AdaBoost</b> one. Another poor performer is the <b>LogisticRegression</b> model. The <b>GradientBoosting</b> and <b>SVC</b> classifiers have close mean scores (0.576 and 0.560). However, the standard deviation of the <b>SVC</b> scores (0.004) is much lower than that of the <b>GradientBoosting</b> scores (0.011).

<li>Given this preliminary analysis, we will further examine the <b>RandomForest</b> classifier, which gives the best scores, and the <b>SVC</b>, tuning their <b>hyperparameters</b> with the <b>GridSearchCV</b> technique.
</div> </pre> 

## RandomForest Classifier

<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
We will tune the following key hyperparameters:
<li>the splitting criterion (<b>criterion</b>),
<li>the number of estimators (<b>n_estimators</b>),
<li>the maximum depth of the tree estimators (<b>max_depth</b>),
<li>the minimum number of samples required to split a node (<b>min_samples_split</b>),
<li>the minimum number of samples required at a leaf (<b>min_samples_leaf</b>), and finally,
<li>the maximum number of features considered when searching for the best split (<b>max_features</b>).
</div> </pre> 
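
An exhaustive grid search tries every combination of the listed values, so its cost grows multiplicatively with each hyperparameter. As a quick check, the sketch below counts the candidates with `ParameterGrid`. The values are copied from the grid used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

params = {
    "criterion": ["gini", "entropy"],
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 15, 20],
    "min_samples_split": [5, 10, 20],
    "min_samples_leaf": [3, 5, 10],
    "max_features": [2, 4, 6, 8, 11],
}

# 2 * 3 * 3 * 3 * 3 * 5 combinations
n_candidates = len(ParameterGrid(params))
print(n_candidates)  # 810
```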

In [6]:
params = {
    'criterion': ['gini', 'entropy'],
    'n_estimators' : [50, 100, 200],
    'max_depth' : [10, 15, 20],
    'min_samples_split' : [5, 10, 20],
    'min_samples_leaf' : [3, 5, 10],
    'max_features': [2, 4, 6, 8, 11]
}
rf_classifier = MyFunct.model_selection(RandomForestClassifier(random_state = seed), X_train, y_train, X_val, y_val, params, scoring)

cv =  PredefinedSplit(test_fold=array([-1, -1, ...,  0,  0]))
Fitting 1 folds for each of 810 candidates, totalling 810 fits
Tuning RandomForestClassifier hyperparameters is done in 1031.071202993393s

Best Estimator 

Best Params 

{'criterion': 'entropy', 'max_depth': 15, 'max_features': 8, 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 200}
Best score 

0.6715384615384615
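
The log above shows a `PredefinedSplit` and a single fold per candidate, which suggests that the search scores each candidate once on the validation set rather than with k-fold cross-validation. `MyFunct.model_selection` itself is not shown, so the following is only a sketch of how such a search could be set up. The function name `model_selection_sketch`, the small grid, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit


def model_selection_sketch(clf, X_train, y_train, X_val, y_val, params, scoring):
    """Hypothetical stand-in for MyFunct.model_selection."""
    X_all = np.vstack([X_train, X_val])
    y_all = np.concatenate([y_train, y_val])
    # -1: sample is never held out; 0: sample belongs to the single held-out fold
    test_fold = np.concatenate(
        [np.full(len(X_train), -1), np.zeros(len(X_val), dtype=int)]
    )
    search = GridSearchCV(
        clf, params, scoring=scoring, cv=PredefinedSplit(test_fold), refit=False
    )
    search.fit(X_all, y_all)
    print("Best Params", search.best_params_)
    print("Best score", search.best_score_)
    # Return an unfitted copy configured with the best parameters
    return clone(clf).set_params(**search.best_params_)


# Synthetic data and a deliberately tiny grid, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_va, y_tr, y_va = X_demo[:160], X_demo[160:], y_demo[:160], y_demo[160:]
small_grid = {"n_estimators": [5, 10], "max_depth": [2, 4]}
best_rf = model_selection_sketch(
    RandomForestClassifier(random_state=0), X_tr, y_tr, X_va, y_va, small_grid, "accuracy"
)
```

With `refit=False`, the returned estimator is unfitted, so it still has to be fitted on the training data before use.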


## Support Vector Classifier (SVC)

<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
We will tune the following key hyperparameters:
<li>the kernel function,
<li>the regularization parameter C and
<li>the gamma parameter, the kernel coefficient that controls how far the influence of a single training sample reaches: the larger gamma is, the more local that influence becomes. Note that gamma is only used by some kernel functions (rbf, poly and sigmoid), not by the linear kernel.

We will check two kernel functions: the <b>linear</b> and the <b>rbf</b> kernels.
</div> </pre> 
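
For reference, the rbf kernel is k(x, x') = exp(-gamma * ||x - x'||^2). The sketch below evaluates it for two toy points at squared distance 1 (values chosen for illustration) to show how gamma changes the similarity assigned to the same pair.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 0.0]])  # squared Euclidean distance to a is 1

# Same pair of points, three values of gamma
similarities = {
    gamma: rbf_kernel(a, b, gamma=gamma)[0, 0] for gamma in [0.01, 0.1, 1.0]
}
for gamma, k in similarities.items():
    print(f"gamma={gamma}: k={k:.4f}")
# gamma=0.01: k=0.9900
# gamma=0.1: k=0.9048
# gamma=1.0: k=0.3679
```

As gamma grows, the similarity between distinct points drops faster, so each training sample influences a smaller neighbourhood.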

In [13]:
params = {
    'C': [0.1, 10.0, 50.0, 100.0],
    'class_weight': [None, 'balanced'],
}
sv_classifier = MyFunct.model_selection(SVC(kernel = 'linear', random_state= seed), X_train, y_train, X_val, y_val, params, scoring)

cv =  PredefinedSplit(test_fold=array([-1, -1, ...,  0,  0]))
Fitting 1 folds for each of 8 candidates, totalling 8 fits
Tuning SVC hyperparameters is done in 1074.62743973732s

Best Estimator 

Best Params 

{'C': 0.1, 'class_weight': None}
Best score 

0.5361538461538462


In [21]:
params = {
    'C': [0.1, 1.0, 10.0],
    'gamma': [0.01, 0.1, 1.0]
}
sv_classifier = MyFunct.model_selection(SVC(kernel = 'rbf', random_state= seed), X_train, y_train, X_val, y_val, params, scoring)

cv =  PredefinedSplit(test_fold=array([-1, -1, ...,  0,  0]))
Fitting 1 folds for each of 9 candidates, totalling 9 fits
Tuning SVC hyperparameters is done in 24.949832439422607s

Best Estimator 

Best Params 

{'C': 10.0, 'gamma': 1.0}
Best score 

0.6523076923076923


<pre>
📝 <b>Note</b>
<div style="background-color:#C2F2ED;">
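<li>Hyperparameter tuning increased the accuracy by a few points: from 0.647 to 0.672 for <b>RandomForest</b> and from 0.560 to 0.652 for the <b>SVC</b> with the rbf kernel. Note that the baseline figures are 3-fold cross-validation means, whereas the tuned figures are measured on the validation set.

<li>Clearly, the <b>RandomForest</b> classifier is the most suitable of the algorithms checked for this prediction problem. Hence, we will use it to make the final predictions and save it as a <b>joblib</b> file for later reuse.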
</div> </pre> 

## Save the selected model

In [10]:
rf_classifier

RandomForestClassifier(criterion='entropy', max_depth=15, max_features=8,
                       min_samples_leaf=3, min_samples_split=5,
                       n_estimators=200, random_state=0)

In [9]:
rf_classifier.fit(X_train, y_train)

print("Accuracy: {:.2f}".format(rf_classifier.score(X_val, y_val)))

Accuracy: 0.67


In [15]:
joblib.dump(rf_classifier, model_path)

['models/model.joblib']
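
The saved classifier was trained on imputed and standardized features, so whoever reloads it must apply the same fitted preprocessing to new data. One possible way to keep the two together is to persist a single `Pipeline` holding both. The sketch below illustrates this on synthetic data, with a temporary file path and a small forest chosen only for the demonstration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic raw (unscaled) features, for illustration only
X_raw, y_demo = make_classification(n_samples=120, n_features=5, random_state=0)

# Preprocessing and classifier bundled in one object
full_model = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=0)),
])
full_model.fit(X_raw, y_demo)

# Round trip through a temporary joblib file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "model_with_preprocessing.joblib")
    joblib.dump(full_model, demo_path)
    reloaded = joblib.load(demo_path)

same_predictions = np.array_equal(full_model.predict(X_raw), reloaded.predict(X_raw))
print(same_predictions)  # True
```

The reloaded pipeline accepts raw features directly, which avoids having to ship the fitted imputer and scaler separately.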