# Module 1: Classification

In this lab you will create a classification model on the same red wine quality dataset and practice the same training and validation methodology. 
The classification model is a decision tree (`DecisionTreeClassifier`) provided by scikit-learn.

In [None]:
import os, sys
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
%matplotlib inline  

## Load Dataset

We will load the dataset from file into a pandas DataFrame and investigate its structure. 


In [None]:
# Dataset location
DATASET = '/dsa/data/all_datasets/wine-quality/winequality-red.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET, sep=';').sample(frac = 1).reset_index(drop=True)

# View some metadata of the dataset and see if that makes sense
print('dataset.shape', dataset.shape)

X = np.array(dataset.iloc[:,:-1])[:, [1,2,6,9,10]]
y = np.array(dataset.quality)

print('X', X.shape, 'y', y.shape)
print('Label distribution:', {i: np.sum(y==i) for i in np.unique(dataset.quality)})

In [None]:
dataset.head()

Describe the dataset with summary statistics for each column.

In [None]:
dataset.describe()

## Make the train/test split and then train the model

In [None]:
# You have seen this before!
# If you are so inclined, you may want to tweak the test_size and see how the model performs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

model1 = DecisionTreeClassifier(criterion='gini', max_depth=4)
model1.fit(X_train, y_train)
print(f"Acc with gini: {model1.score(X_test, y_test)}")


model2 = DecisionTreeClassifier(criterion='entropy', max_depth=3)
model2.fit(X_train, y_train)
print(f"Acc with entropy: {model2.score(X_test, y_test)}")


Optionally, you can print out a sample of true and predicted labels and see for yourself how the classifier performs.

In [None]:
print(y[20:50], " (True class value)")
print(model1.predict(X[20:50]), " (Predicted class value)")

## Visualizing Decision Tree

### A text representation of the tree

In [None]:
text_representation = export_text(model1)
print(text_representation)
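
By default `export_text` labels the split features as `feature_0`, `feature_1`, and so on. Its `feature_names` parameter makes the printed rules easier to read. The following is a minimal sketch on synthetic data; the array `X_demo`, the labels `y_demo`, and the two column names are assumptions chosen for illustration, not the lab's actual features.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data (an assumption for this sketch, not the wine dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)  # label depends only on the first column

# Hypothetical column names, one per column of X_demo
feature_names = ['alcohol', 'sulphates']

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Pass the names so the rules read "alcohol <= ..." instead of "feature_0 <= ..."
report = export_text(tree, feature_names=feature_names)
print(report)
```

In the lab itself, the names of the selected columns could presumably be taken from `dataset.columns` using the same indices that were used to build `X`.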

### Graph visualization

In [None]:
plt.figure(figsize=(25,10))
a = plot_tree(model2, 
              filled=True, 
              rounded=True, 
              fontsize=14)

## Model Evaluation

A classifier's performance is usually quantified in terms of precision, recall, F1, and accuracy. These measures are calculated from the [confusion matrix](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html). 

In [None]:
from sklearn.metrics import confusion_matrix
# Compute the confusion matrix from the expected and predicted values (the same inputs RMSE takes)
# Rows are the true classes, columns are the predicted classes
confusion_matrix(y_test, model1.predict(X_test))

### Beyond Confusion Matrix: Precision, Recall, and F1

Here we look at a few additional measures.

First: 
  * _condition positive_ (P) is the number of real positive cases in the data
  * _condition negative_ (N) is the number of real negative cases in the data 

Then: 
  * _true positive_ (TP) is a correct prediction of a class, equivalent to a hit in a Yes / No model
  * _true negative_ (TN) is a correct prediction of not a class, equivalent to a correct rejection in a Yes / No model
  * _false positive_ (FP) is a negative case misclassified as positive, equivalent to a false alarm in a Yes / No model, **Type I error**
  * _false negative_ (FN) is a positive case misclassified as negative, equivalent to a miss in a Yes / No model, **Type II error** 
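
With more than two classes, these four counts are usually taken per class in a one-vs-rest fashion: the class of interest is "positive" and all other classes together are "negative". The sketch below derives them from a confusion matrix. The labels `y_true` and `y_pred` are toy values invented for illustration, not output from the wine model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (assumed for this sketch)
y_true = np.array([5, 5, 5, 6, 6, 6, 6, 7, 7, 5])
y_pred = np.array([5, 5, 6, 6, 6, 5, 6, 7, 6, 5])

labels = np.unique(y_true)
cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true, columns: predicted

# One-vs-rest counts, one entry per class
tp = np.diag(cm)               # predicted as the class and truly the class
fn = cm.sum(axis=1) - tp       # truly the class, predicted as something else
fp = cm.sum(axis=0) - tp       # predicted as the class, truly something else
tn = cm.sum() - tp - fn - fp   # everything not involving the class

for label, counts in zip(labels, zip(tp, fp, fn, tn)):
    print(f"class {label}: TP={counts[0]} FP={counts[1]} FN={counts[2]} TN={counts[3]}")
```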

Metrics:
  * Recall or True Positive Rate:$$ Recall = \frac{TP}{P} = \frac{TP}{TP+FN} $$ 
  * Precision or Positive Predictive Value:$$ Precision = \frac{TP}{TP+FP} $$
  * [F1 is the harmonic mean of precision and recall](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)$$ F_{1} = 2 * \frac{Precision * Recall}{Precision + Recall}$$
  * Accuracy: $$ Accuracy = \frac{TP + TN}{TP+FP+TN+FN} $$
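
To make the formulas concrete, the sketch below computes each metric by hand for a small Yes / No problem and checks the results against scikit-learn's scoring functions. The label arrays are toy values assumed for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy binary labels (assumed for this sketch): 1 = positive, 0 = negative
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

# The formulas from the list above
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + tn + fn)

# The hand-computed values should agree with scikit-learn
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))
assert np.isclose(accuracy, accuracy_score(y_true, y_pred))

print(f"Recall={recall:.2f} Precision={precision:.2f} F1={f1:.2f} Accuracy={accuracy:.2f}")
```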
  
#### More details on scikit-learn model scoring:
http://scikit-learn.org/stable/modules/model_evaluation.html

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

# Check the API of these functions to learn more about the parameters

# Predict once and reuse the result for every metric
y_pred = model1.predict(X_test)

print("Precision  :", np.round(precision_score(y_test, y_pred, average='weighted'), 2))
print("Recall     :", np.round(recall_score(y_test, y_pred, average='weighted'), 2))
print("F1-Score   :", np.round(f1_score(y_test, y_pred, average='weighted'), 2))
print("Accuracy   :", np.round(accuracy_score(y_test, y_pred), 2))


The scores above can also be obtained with a single call to `classification_report`, which additionally breaks them down per class. 

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, model1.predict(X_test)))

**Note:** The F1 score combines the two metrics precision and recall into one.
Because it is a type of mean (the harmonic mean), its value lies between the two metrics.  
In the scikit-learn package, `f1_score()` is generalized to multiclass targets. The `average` parameter therefore selects the method used to average over multiple classes.  
The documentation explains this parameter in more detail; a full discussion of the methodologies for integrating metrics would extend into the separate subject of data fusion: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
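
As a rough illustration of what the `average` parameter does, the sketch below computes the per-class F1 scores on toy labels (assumed values, not the wine predictions) and compares three common averaging modes. `macro` is the plain mean of the per-class scores, `weighted` weights each class by its number of true samples, and `micro` pools all decisions before scoring, which for single-label multiclass data equals accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy, imbalanced multiclass labels (assumed for this sketch)
y_true = np.array([5, 5, 5, 5, 5, 5, 6, 6, 6, 7])
y_pred = np.array([5, 5, 5, 5, 6, 6, 6, 6, 6, 7])

# average=None returns one F1 score per class, in sorted label order
per_class = f1_score(y_true, y_pred, average=None)
labels, support = np.unique(y_true, return_counts=True)

macro = f1_score(y_true, y_pred, average='macro')
weighted = f1_score(y_true, y_pred, average='weighted')
micro = f1_score(y_true, y_pred, average='micro')

print("per class:", dict(zip(labels.tolist(), np.round(per_class, 3).tolist())))
print("macro    :", round(macro, 3), "(unweighted mean of per-class F1)")
print("weighted :", round(weighted, 3), "(mean weighted by class support)")
print("micro    :", round(micro, 3), "(pooled over all samples)")
```

On imbalanced data such as the wine quality labels, `macro` gives rare classes the same influence as common ones, while `weighted` lets the common classes dominate.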

---