# Train, Validate $\rightarrow$ Train, Test 
### Focus: Naive Bayes Classifier & Logistic Regression

## Introduction
When constructing a model, data availability may become an issue. 
To detect overfitting, it is necessary to withhold some portion of the data as a test set. 
However, if the test set is also used to choose between models, overfitting *on the test set* may occur unless there is a separate validation step. 
As such, `scikit-learn` contains a number of methods for cross-validation of data.

## References
1. [Scikit documentation - GaussianNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)

## Setting up the model

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from collections import OrderedDict

# load dataset 

DATASET = '/dsa/data/all_datasets/wine-quality/winequality-red.csv'
dataset = pd.read_csv(DATASET, sep=';').sample(frac = 1).reset_index(drop=True)
X = dataset.iloc[:, [1,2,6,9,10]]
y = dataset.quality



In [2]:
dataset.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,6.8,0.56,0.22,1.8,0.074,15.0,24.0,0.99438,3.4,0.82,11.2,6
1,9.8,0.51,0.19,3.2,0.081,8.0,30.0,0.9984,3.23,0.58,10.5,6
2,8.5,0.655,0.49,6.1,0.122,34.0,151.0,1.001,3.31,1.14,9.3,5
3,6.9,0.56,0.03,1.5,0.086,36.0,46.0,0.99522,3.53,0.57,10.6,5
4,8.7,0.54,0.26,2.5,0.097,7.0,31.0,0.9976,3.27,0.6,9.3,6


In [3]:
# test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)



## Scaling

In [4]:
from sklearn.preprocessing import MinMaxScaler

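scaler = MinMaxScaler().fit(X_train)  
# Fit the scaler on X_train only, so that no information from the test set leaks into preprocessing.
# The targets (y_train, y_test) are class labels and are not scaled.

# Apply the same fitted transformation to both splits
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
# Note: the cells below pass the unscaled X_train / X_test to the models;
# substitute X_train_std / X_test_std to train on the scaled features.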
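
One way to keep the "fit the scaler on training data only" rule intact during cross-validation is to wrap the scaler and the classifier in a single pipeline, so the scaler is refit on the training part of every fold. The following is a minimal sketch, not the workflow used in this notebook; it assumes synthetic stand-in data from `make_classification` instead of the wine-quality file, and the names `X_demo`, `y_demo`, and `scaled_logreg` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: NOT the wine-quality dataset used above)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The pipeline refits MinMaxScaler on the training part of each CV fold,
# so the held-out fold never influences the scaling parameters.
scaled_logreg = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=5000))
pipeline_scores = cross_val_score(scaled_logreg, X_tr, y_tr, cv=5)
print("CV accuracy per fold:", pipeline_scores)

# Final fit on the whole training split, then a single evaluation on the test split
scaled_logreg.fit(X_tr, y_tr)
test_accuracy = scaled_logreg.score(X_te, y_te)
print("test accuracy:", test_accuracy)
```

With this arrangement there is no separate `X_train_std` array to keep track of: the pipeline applies the scaling whenever it is fitted or asked to predict.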

## Define Classifiers

In [5]:
baseline_classifier = DummyClassifier()
nb_classifier = GaussianNB()
regression_classifier = LogisticRegression(max_iter = 5000)
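
To make the role of the baseline concrete: in recent scikit-learn versions, `DummyClassifier()` with default settings always predicts the most frequent class seen during training. The sketch below reproduces that idea in plain Python. It uses the test-set class counts from the classification report shown later in this notebook, and it assumes (an assumption, not something the notebook verifies) that the majority class is the same in the training and test splits.

```python
from collections import Counter

# Test-set class counts ("support" column of the classification report in this notebook)
support = {4: 11, 5: 136, 6: 130, 7: 37, 8: 6}

# Expand the counts into a list of true labels
y_true = [label for label, count in support.items() for _ in range(count)]

# A majority-class baseline predicts the single most common label for every sample
majority_label, majority_count = Counter(y_true).most_common(1)[0]
baseline_accuracy = majority_count / len(y_true)

print("majority class:", majority_label)
print("baseline accuracy:", round(baseline_accuracy, 3))
```

Under these assumptions the result is 136/320 = 0.425, in line with the baseline scores reported in the cells below. Any useful classifier should clear this bar.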

## Cross-validation
Though a manual CV workflow was described in [the cross-validation lab](./CrossValidation.ipynb), the automated `cross_val_score()` will work well enough for this example.

In [6]:
# automated CV step
baseline_scores = cross_val_score(baseline_classifier, X_train, y_train, cv=5)
nb_scores = cross_val_score(nb_classifier, X_train, y_train, cv=5)
regression_scores = cross_val_score(regression_classifier, X_train, y_train, cv=5)
print("baseline score ", baseline_scores) # TODO: visualization of CV process
print("NB score :", nb_scores)
print("Logistic Regression score: ", regression_scores)


baseline score  [0.42578125 0.42578125 0.42578125 0.42578125 0.42745098]
NB score : [0.60546875 0.58203125 0.578125   0.58984375 0.56862745]
Logistic Regression score:  [0.57421875 0.59765625 0.58203125 0.59765625 0.56470588]


Note that we are performing cross-validation on the training set only. Each value is the accuracy (with 1 being a perfect score) that the model, trained on four of the five folds, achieved on the remaining held-out fold.
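
Five numbers per model are easier to compare once they are reduced to a mean and a spread. The sketch below does this for the fold scores printed above, copied in by hand; the dictionary keys and the `summary` name are illustrative.

```python
from statistics import mean, stdev

# Fold scores copied from the cross_val_score output above
cv_scores = {
    "baseline": [0.42578125, 0.42578125, 0.42578125, 0.42578125, 0.42745098],
    "naive Bayes": [0.60546875, 0.58203125, 0.578125, 0.58984375, 0.56862745],
    "logistic regression": [0.57421875, 0.59765625, 0.58203125, 0.59765625, 0.56470588],
}

# Mean and sample standard deviation of the five fold scores per model
summary = {name: (mean(scores), stdev(scores)) for name, scores in cv_scores.items()}

for name, (avg, spread) in summary.items():
    print(f"{name:20s} {avg:.3f} +/- {spread:.3f}")
```

The means come out at roughly 0.426, 0.585, and 0.583. The gap between naive Bayes and logistic regression is small compared with the fold-to-fold spread, so cross-validation alone does not separate the two.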

## Training the new models

Since the CV scores for naive Bayes and logistic regression are higher than the baseline, we can now fit each model on the entire training set and evaluate it on the test set:

In [7]:
# fit new model
baseline_classifier.fit(X_train, y_train)
nb_classifier.fit(X_train, y_train)
regression_classifier.fit(X_train, y_train)

# model.predict() returns class labels (integers)
y_pred_baseline = baseline_classifier.predict(X_test)
y_pred_nb = nb_classifier.predict(X_test)
y_pred_reg = regression_classifier.predict(X_test)


In [8]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred_nb))
confusion_matrix(y_test, y_pred_nb)

              precision    recall  f1-score   support

           4       0.00      0.00      0.00        11
           5       0.65      0.78      0.71       136
           6       0.56      0.50      0.53       130
           7       0.43      0.43      0.43        37
           8       0.00      0.00      0.00         6

    accuracy                           0.58       320
   macro avg       0.33      0.34      0.33       320
weighted avg       0.55      0.58      0.57       320



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


array([[  0,   8,   3,   0,   0],
       [  2, 106,  27,   1,   0],
       [  1,  48,  65,  16,   0],
       [  0,   1,  20,  16,   0],
       [  0,   0,   2,   4,   0]])

## Score comparison between the three models
The scores show that the naive Bayes classifier clearly beats the baseline. 
Logistic regression scores about the same as naive Bayes: its cross-validation and test scores differ from those of naive Bayes by roughly 0.01. 

For this comparison we used the five feature columns selected into `X`, not every column in the dataset. You can try different feature subsets to test the models. 
Adding informative features may improve accuracy, but the models will need more computation to converge. 
Conversely, removing features may lose accuracy. We need to trade off computation against accuracy. 
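
The effect of the feature count can be explored by cross-validating the same classifier on growing column subsets. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification` (not the wine-quality file), arranged so that the first four columns are informative and the remaining four are noise, and the names `X_demo`, `y_demo`, and `mean_accuracy` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data. With shuffle=False the informative columns come first:
# columns 0-3 carry signal, columns 4-7 are pure noise.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=0,
    shuffle=False, random_state=0,
)

mean_accuracy = {}
for n_features in (2, 4, 8):
    # Keep only the first n_features columns and cross-validate on them
    scores = cross_val_score(GaussianNB(), X_demo[:, :n_features], y_demo, cv=5)
    mean_accuracy[n_features] = scores.mean()
    print(f"{n_features} features: mean CV accuracy = {scores.mean():.3f}")
```

On the real data, the same loop could be run over different column lists passed to `dataset.iloc`. Whether extra columns help depends on how much signal they carry, so the comparison is worth running instead of assuming.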

In [9]:
from sklearn.metrics import precision_score, f1_score, recall_score, accuracy_score
from pandas import DataFrame
from IPython.display import display 

data_dict = {
    'Metrics': ["Precision", 'Recall', 'F1', 'Accuracy'],
    'Baseline': [
        precision_score(y_test, y_pred_baseline, average= 'weighted'),  # weighted precision over all the classes
        recall_score(y_test, y_pred_baseline, average='weighted'),
        f1_score(y_test, y_pred_baseline, average='weighted'),
        accuracy_score(y_test, y_pred_baseline)
    ],
    "Naive Bayes": [
        precision_score(y_test, y_pred_nb, average= 'weighted'),
        recall_score(y_test, y_pred_nb, average='weighted'),
        f1_score(y_test, y_pred_nb, average='weighted'),
        accuracy_score(y_test, y_pred_nb)
    ],
    "Logistic Regression": [
        precision_score(y_test, y_pred_reg, average= 'weighted'),
        recall_score(y_test, y_pred_reg, average='weighted'),
        f1_score(y_test, y_pred_reg, average='weighted'),
        accuracy_score(y_test, y_pred_reg)
    ]
}
table = np.around(DataFrame(data_dict), 2)
print(table.to_string(index=False))

  Metrics  Baseline  Naive Bayes  Logistic Regression
Precision      0.18         0.55                 0.55
   Recall      0.42         0.58                 0.59
       F1      0.25         0.57                 0.56
 Accuracy      0.42         0.58                 0.59


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


The above cell may emit a warning: some class labels are never predicted, so computing their precision involves a division by zero, and scikit-learn sets that precision to 0 before averaging over the class labels. You can see the formulas for the performance measures in multiclass and multilabel classification problems [here](https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics). Here is the screenshot: 

<img src="./multi-label-performance-measure.jpg" />

**Comment:**

* The naive Bayes classifier and logistic regression perform about equally well: their scores differ by at most 0.01 on all four measures. 
* The warning is raised because some classes are never predicted, so their precision is undefined and is reported as 0.
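
The weighted averages and the source of the warning can be checked by hand from the naive Bayes confusion matrix printed earlier (rows are true labels, columns are predicted labels). The sketch below is plain Python, not scikit-learn's implementation; `safe_div` is a hypothetical helper that mimics the behaviour of replacing an undefined ratio with 0.

```python
labels = [4, 5, 6, 7, 8]
# Naive Bayes confusion matrix from the output above: cm[i][j] = true labels[i], predicted labels[j]
cm = [
    [0, 8, 3, 0, 0],
    [2, 106, 27, 1, 0],
    [1, 48, 65, 16, 0],
    [0, 1, 20, 16, 0],
    [0, 0, 2, 4, 0],
]
n = len(labels)
support = [sum(row) for row in cm]                               # true samples per class
predicted = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # predictions per class
total = sum(support)


def safe_div(numerator, denominator):
    """Hypothetical helper: return 0.0 when the ratio is undefined (0 / 0)."""
    return numerator / denominator if denominator else 0.0


precision = [safe_div(cm[k][k], predicted[k]) for k in range(n)]
recall = [safe_div(cm[k][k], support[k]) for k in range(n)]

# Weighted average: each class contributes in proportion to its support
weighted_precision = sum(s * p for s, p in zip(support, precision)) / total
weighted_recall = sum(s * r for s, r in zip(support, recall)) / total
accuracy = sum(cm[k][k] for k in range(n)) / total

for label, count, p, r in zip(labels, predicted, precision, recall):
    print(f"class {label}: predicted {count:3d} times, precision {p:.2f}, recall {r:.2f}")
print(f"weighted precision {weighted_precision:.2f}, "
      f"weighted recall {weighted_recall:.2f}, accuracy {accuracy:.2f}")
```

Class 8 is never predicted, so its precision is 0/0; that is the case the warning reports. The sketch also shows why the recall and accuracy rows of the table are identical: support-weighted recall reduces to the number of correct predictions divided by the number of samples. In scikit-learn itself, `precision_score` and `classification_report` accept a `zero_division` argument to control the substituted value and silence the warning.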