# Module 1: Training and Testing

In this lab you will learn how to set up a reliable framework for evaluating the machine learning models you build. 
The **training and testing** workflow involves selecting training and testing datasets, as well as a performance measure that is meaningful for your problem. 

### Tip: 
_We will use the same dataset across several labs, so take a little time to get yourself familiarized with the structure of the dataset._

#### Scikit Learn

Read about scikit-learn as your time permits: http://scikit-learn.org/stable/


Relevant sklearn API references:
 * [sklearn.tree.DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
 * [sklearn.model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

As an overview, we are going to fit a [**Decision Tree Classifier**](https://en.wikipedia.org/wiki/Decision_tree_learning) to the **red wine quality** dataset, 
then develop an understanding of why we need to hold out a test set to validate training, by taking a close look at a counterexample and seeing what can go wrong without this workflow.

In [1]:
import os, sys
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

## Load Dataset

Load the dataset from a file into a pandas DataFrame and get to know its structure.

In [2]:
# Dataset location
DATASET = '/dsa/data/all_datasets/wine-quality/winequality-red.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET, sep=';').sample(frac = 1).reset_index(drop=True)

Before we proceed, it's always good to have at least a rough idea of how many rows and columns the dataset has, and what those columns are.

In [3]:
print(dataset.shape)
dataset.describe() # Show the columns and basic statistics

(1599, 12)


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


We can also preview the dataset.

In [4]:
dataset.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.5 | 0.35 | 0.49 | 3.3 | 0.07 | 10.0 | 37.0 | 1.0003 | 3.32 | 0.91 | 11.0 | 6 |
| 1 | 9.6 | 0.32 | 0.47 | 1.4 | 0.056 | 9.0 | 24.0 | 0.99695 | 3.22 | 0.82 | 10.3 | 7 |
| 2 | 8.0 | 0.38 | 0.44 | 1.9 | 0.098 | 6.0 | 15.0 | 0.9956 | 3.3 | 0.64 | 11.4 | 6 |
| 3 | 6.6 | 0.725 | 0.2 | 7.8 | 0.073 | 29.0 | 79.0 | 0.9977 | 3.29 | 0.54 | 9.2 | 5 |
| 4 | 8.3 | 0.705 | 0.12 | 2.6 | 0.092 | 12.0 | 28.0 | 0.9994 | 3.51 | 0.72 | 10.0 | 5 |


The last column is the quality of the wine (0 to 10); the other columns are features. We are going to build a classifier that predicts the quality of a wine from its features, 
and then come up with a way to evaluate the performance of the classifier. 
The classifier therefore has the following input and output.

~~~
X = all features except last column
y = last column
~~~
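
As a minimal sketch of this split, the block below builds a three-row toy frame from values in the preview above and selects the label by column name rather than by position. The names `toy`, `X_toy` and `y_toy` are hypothetical and used only for illustration; the lab itself works on the full dataset.

```python
import pandas as pd

# Toy frame with two feature columns and the label column (values taken from the preview)
toy = pd.DataFrame({
    "alcohol": [11.0, 10.3, 9.2],
    "pH": [3.32, 3.22, 3.29],
    "quality": [6, 7, 5],
})

# X = every column except the label, y = the label column
X_toy = toy.drop(columns="quality").to_numpy()
y_toy = toy["quality"].to_numpy()

print(X_toy.shape, y_toy.shape)  # (3, 2) (3,)
```

Selecting by name keeps working if the label is not the last column, which positional slicing would silently get wrong.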

In addition, for this lab we are going to binarize `y` into just 0 (ok wine, quality below 6) or 1 (good wine, quality 6 or above) to keep things simple.

In [5]:
np.unique(dataset.quality)

array([3, 4, 5, 6, 7, 8])

In [6]:
X = np.array(dataset.iloc[:,:-1]) # Pull all rows, each column except the last
y = np.array(dataset.quality) # Pull just the quality column

# Binarize wine quality just for simplification using 6 as threshold
y = (y>=6).astype(int)
#          ^^^^^^^^^^ Convert True => 1 and False => 0
#   ^^^^^ boolean array (contains True/False as elements)

# The above is similar to:
#    y[y<6]=0; y[y>=6]=1
# but safer, because the in-place version goes wrong if the
# two assignments are accidentally reversed.

print('X', X.shape, 'y', y.shape)
print('Label distribution:', {0: np.sum(y==0), 1: np.sum(y==1)})

X (1599, 11) y (1599,)
Label distribution: {0: 744, 1: 855}
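
To see why the in-place alternative is fragile, here is a small sketch on a hypothetical five-element `quality` array (not the lab dataset). With the assignments reversed, the freshly written 1s are themselves below 6, so the second assignment turns them into 0.

```python
import numpy as np

quality = np.array([3, 5, 6, 7, 8])  # hypothetical quality scores

# Comparison-based binarization: one expression, no ordering to get wrong
safe = (quality >= 6).astype(int)

# In-place assignment in the intended order
in_place_ok = quality.copy()
in_place_ok[in_place_ok < 6] = 0
in_place_ok[in_place_ok >= 6] = 1

# Same two assignments, accidentally reversed
in_place_bad = quality.copy()
in_place_bad[in_place_bad >= 6] = 1
in_place_bad[in_place_bad < 6] = 0   # the 1s written above are also < 6

print(safe)          # [0 0 1 1 1]
print(in_place_ok)   # [0 0 1 1 1]
print(in_place_bad)  # [0 0 0 0 0]
```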


**NOTE**: Now that we have done some carpentry, re-running cells is best done starting from the top of the notebook!

## Try the simple approach - train and evaluate on the whole dataset

Train a Decision Tree model on the whole dataset and evaluate it on the same dataset.

In [7]:
model = DecisionTreeClassifier(criterion='entropy')  # Create an instance of a model that can be trained
model.fit(X, y)       # fit = "train model parameters using this data and expected outcomes"
model.score(X, y)     # Evaluate a set of data, against the expected outcomes; here score is accuracy

1.0

This score is the **accuracy** (0 to 1) of the classifier on this dataset. Training and evaluating a model on the whole dataset does not address the following questions: 

+ Would the same predictive performance extend to future data?
+ Is there enough data for the model to learn from?
+ Could the classifier be learning from features that happen to correlate with the result without any real connection (noise)?
+ How can we evaluate the classifier more accurately?
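
The question about noise can be made concrete with a sketch on synthetic data. Everything here is an assumption for illustration: the features are random numbers and the labels are coin flips, so there is nothing real to learn. An unconstrained decision tree still scores perfectly on the data it was trained on.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Purely random features and labels: no real relationship between them
X_noise = rng.normal(size=(200, 5))
y_noise = rng.integers(0, 2, size=200)

noise_model = DecisionTreeClassifier(criterion='entropy', random_state=0)
noise_model.fit(X_noise, y_noise)
train_score = noise_model.score(X_noise, y_noise)

# Fresh random data the model has never seen
X_fresh = rng.normal(size=(200, 5))
y_fresh = rng.integers(0, 2, size=200)
fresh_score = noise_model.score(X_fresh, y_fresh)

print('Score on the training data:', train_score)
print('Score on fresh random data:', fresh_score)
```

The training score is 1.0 because the tree keeps splitting until every training row is classified correctly. The score on fresh data should land near chance, which is why a perfect score on the training data alone says little about predictive performance.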

## Counter-example

Pushing this idea to an extreme case, let's train the model on the first 3 rows!

In [8]:
dataset.head(3)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.5 | 0.35 | 0.49 | 3.3 | 0.07 | 10.0 | 37.0 | 1.0003 | 3.32 | 0.91 | 11.0 | 6 |
| 1 | 9.6 | 0.32 | 0.47 | 1.4 | 0.056 | 9.0 | 24.0 | 0.99695 | 3.22 | 0.82 | 10.3 | 7 |
| 2 | 8.0 | 0.38 | 0.44 | 1.9 | 0.098 | 6.0 | 15.0 | 0.9956 | 3.3 | 0.64 | 11.4 | 6 |


In [9]:
model = DecisionTreeClassifier()
model.fit(X[:3], y[:3])
model.score(X[:3], y[:3])

1.0

The model would score 100%!

Let's try predicting some other rows from the dataset.
In other words, when the model is applied to new data that was not part of training, how well does it do?

In [10]:
print('Ground Truth : ', y[100:150])
print('Prediction   : ', model.predict(X[100:150]))
print('Score        : ', model.score(X[100:150], y[100:150]))

Ground Truth :  [1 0 1 0 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 1 0 0 0 1 0 0 1 1 1
 0 0 1 1 1 1 1 1 0 1 1 1 0]
Prediction   :  [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1]
Score        :  0.56


This demonstrates a common phenomenon in building predictive models, known as **overfitting**.

Overfitting occurs when a model fails to generalize to the wider population of data. 
Instead, it has been optimized for the specific instances in its training data.
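
The 0.56 scored by the three-row model is not a sign that it learned anything. The three training rows have qualities 6, 7 and 6, so all three binarized labels are 1 and the fitted tree can only ever answer 1. A constant answer of 1 scores exactly the share of 1s in whatever slice it is evaluated on. The sketch below copies the ground-truth slice printed above and checks this; `ground_truth` and `constant_prediction` are names introduced here for illustration.

```python
import numpy as np

# Ground-truth labels for rows 100..149, copied from the output above
ground_truth = np.array([
    1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,
    1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1,
    0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
])

# A "model" that always answers 1, like the tree trained on three rows
constant_prediction = np.ones_like(ground_truth)

# Accuracy = fraction of positions where prediction and truth agree
accuracy = float(np.mean(constant_prediction == ground_truth))
print(accuracy)  # 0.56
```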

## Hold out 25% for testing only

In [11]:
# scikit-learn has helpers for testing and evaluating models in a proper train/test paradigm.
from sklearn.model_selection import train_test_split

# train_test_split returns four arrays, in this order:
#   X_train (training features), X_test (testing features),
#   y_train (training labels),   y_test (testing labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Get "blank model"
model = DecisionTreeClassifier()
model.fit(X_train, y_train) # Train it 
model.score(X_test, y_test) # Validate its training with some withheld testing data.

0.76

Note: Each run of the above cell produces a different score, because the training and testing sets are drawn at random on every call. The run recorded here scored approximately 0.76.
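
If a repeatable score is wanted, both `train_test_split` and `DecisionTreeClassifier` accept a `random_state` argument. The sketch below uses synthetic data as a stand-in for the wine dataset; `X_demo`, `y_demo` and `holdout_score` are hypothetical names, and the seed values are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: the label depends on the first feature plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

def holdout_score(seed):
    """Split, train and score with every source of randomness tied to `seed`."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=seed)
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)

# Same seed, same split, same score
print(holdout_score(0) == holdout_score(0))  # True

# Different seeds give different splits, so the score varies from run to run
scores = [holdout_score(seed) for seed in range(5)]
print(scores)
```

Fixing the seed makes a notebook reproducible, but the spread across seeds is a reminder that a single hold-out score is only an estimate.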

Compared with the score of 1.0 obtained by training and evaluating on the whole dataset, a lower score on the held-out test set shows that the model generalizes less well than the first evaluation suggested: the test set contains counterexamples that the model gets wrong. This implies that a better model or training procedure is needed. 

Alternatively, if the model had scored similarly on the test set (a necessary but not sufficient condition), it would be more probable that the evaluation is accurate. 
That's why we must adopt a **training and testing** process in order to evaluate the model more accurately. 
In Module 2 we will learn about a more sophisticated evaluation approach, cross-validation.
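
One way to put a number on this comparison is to score the same fitted model on both its training set and its test set. The sketch below does so on synthetic data with some label noise (an assumption standing in for the wine dataset; all names are illustrative).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label follows two features, blurred by noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(600, 6))
signal = X_demo[:, 0] + X_demo[:, 1]
y_demo = (signal + rng.normal(scale=1.0, size=600) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1)

model = DecisionTreeClassifier(random_state=1)
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)  # scored on data the model has seen
test_acc = model.score(X_te, y_te)   # scored on held-out data
print('Training accuracy:', train_acc)
print('Testing accuracy :', test_acc)
print('Gap              :', train_acc - test_acc)
```

A large gap between the two numbers is the signature of overfitting; a small gap is consistent with, but does not prove, good generalization.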

## Conclusion

In this lab we learned about:

+ Training and testing workflow
+ Splitting a dataset into training and testing sets
+ Concept of overfitting
+ Usage of DecisionTreeClassifier() from scikit-learn