# Train, Validate $\rightarrow$ Train, Test

In this exercise, you will empirically compare the results of a ten-fold cross-validated model with those of a fully trained model.

## Notes and Guidelines
* Read a dataset from disk and use it for a classification task.
* Construct a Gaussian Naive Bayes classifier and fit it to the phoneme dataset provided.
* Save and re-load a trained classifier.
* Compare K-fold cross-validation scores with the success rate of a fully-trained model.


### Dataset
* Dataset acquired from [KEEL](http://sci2s.ugr.es/keel/dataset.php?cod=105), an excellent resource for finding 'toy' datasets (and a few more serious ones).
    * A description of the dataset is provided at the above link - **read it.**
    * Excerpt: 
    *The aim of this dataset is to distinguish between nasal (class 0) and oral sounds (class 1).
    The class distribution is 3,818 samples in class 0 and 1,586 samples in class 1.
    The phonemes are transcribed as follows: sh as in she, dcl as in dark, iy as the vowel in she, aa as the vowel in dark, and ao as the first vowel in water.*
    
* It is not necessary to fully understand the nature or context of the values in the dataset, only that there are five columns of input (feature) data and one column of output (class) data.

## Handling imports and dataset inclusion

In [14]:
import os
import pandas as pd
import numpy as np
import joblib

# <import necessary modules> 
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from collections import OrderedDict
from sklearn.metrics import classification_report

# locate dataset
DATASET = '/dsa/data/all_datasets/phoneme.csv'  # phoneme classification dataset
assert os.path.exists(DATASET)  # check if the file actually exists


## Constructing DataFrame from raw dataset

<span style="background:yellow">**Note**</span>: Variable `dataset` should be used for the dataframe.

In [3]:

dataset = pd.read_csv(DATASET, header=0).sample(frac=1)

# verify dataset shape
print("Dataset shape: ", dataset.shape)


Dataset shape:  (5404, 6)
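The call to `.sample(frac=1)` above shuffles every row, so each run produces a different row order (and therefore different splits and scores). If reproducible results are wanted, one option is to pass a `random_state`. The sketch below illustrates this on a small made-up frame; the frame `toy` and the seed value are assumptions for illustration, not part of the exercise.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the phoneme data (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(8, 2)), columns=["Aa", "Ao"])
toy["Class"] = [0, 1, 0, 0, 1, 0, 1, 0]

# frac=1 returns every row in random order; random_state pins that order.
shuffled_a = toy.sample(frac=1, random_state=42)
shuffled_b = toy.sample(frac=1, random_state=42)

print(shuffled_a.index.tolist())       # original row labels, in shuffled order
print(shuffled_a.equals(shuffled_b))   # same seed gives the same order
```

Note that the original index labels travel with the rows, which is why the `head()` output in this notebook shows non-sequential row labels.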


In [4]:
# show first few lines of the dataset
dataset.head()


         Aa     Ao    Dcl     Iy     Sh  Class
1670  0.185  1.416  2.055  0.872  0.000      0
2998  0.157  0.642  0.459  0.761  0.000      1
4251  0.249  1.876  0.914  0.379  0.000      0
3117  0.249  0.569  0.933  2.165 -1.034      1
1636  2.514 -0.238  0.103  0.089  0.090      0


## Splitting data into training and test sets

Split the dataset into training (70%) and testing (30%) sets.

The snippet below is only necessary if you want to visualize
the data or produce neatly labeled output within the program.

```python
# extract labels from column headers
phonemes = dataset.columns[0:5].tolist()  # Feature labels
labels = {0: 'Nasal', 1: 'Oral'}  # Class labels
```

In [5]:
# extract labels from column headers
phonemes = dataset.columns[0:5].tolist()  # Feature labels
labels = {0: 'Nasal', 1: 'Oral'}  # Class labels


In [8]:
# extract features and class data from primary data frame
X = dataset.loc[:, "Aa":"Sh"]  # the five feature columns (everything except Class)
y = dataset.loc[:, "Class"]    # target column: 0 = nasal, 1 = oral


In [11]:
print(X)
print(y)


         Aa     Ao    Dcl     Iy     Sh
1670  0.185  1.416  2.055  0.872  0.000
2998  0.157  0.642  0.459  0.761  0.000
4251  0.249  1.876  0.914  0.379  0.000
3117  0.249  0.569  0.933  2.165 -1.034
1636  2.514 -0.238  0.103  0.089  0.090
...     ...    ...    ...    ...    ...
1956  0.540  3.144 -0.665 -0.261  0.000
3654  0.212  0.746  0.905 -0.393  1.210
655   0.124  0.426  0.948  0.710  0.000
2520  0.651  2.527  0.904  0.336  0.000
3313  0.222  0.433  0.684  1.743  0.000

[5404 rows x 5 columns]
1670    0
2998    1
4251    0
3117    1
1636    0
       ..
1956    0
3654    1
655     1
2520    0
3313    1
Name: Class, Length: 5404, dtype: int64


In [12]:
# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print("Training shapes (X, y): ", X_train.shape, y_train.shape)
print("Testing shapes (X, y): ", X_test.shape, y_test.shape)


Training shapes (X, y):  (3782, 5) (3782,)
Testing shapes (X, y):  (1622, 5) (1622,)
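Because the two classes are imbalanced (3,818 versus 1,586 samples), a purely random split can leave the training and test sets with slightly different class ratios. A minimal sketch of one way to keep the ratios equal is shown below, using the `stratify` argument of `train_test_split`. The synthetic arrays `X_demo` and `y_demo` and the fixed `random_state` are assumptions for illustration; the split above does not use stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 5 features, 70/30 class imbalance.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([0] * 700 + [1] * 300)

# stratify=y_demo asks for the same class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

print("Shapes:", X_tr.shape, X_te.shape)
print("Class-1 share (train, test):", y_tr.mean(), y_te.mean())
```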


## Constructing the classifier and running automated cross-validation

* Run 10-fold cross-validation with a `GaussianNB` classifier.
* Print the accuracy scores for the 10 folds.

In [19]:
# Your code below this line (Question #E101)
# --------------------------
classifier = GaussianNB()

# perform 10-fold *automated* cross-validation on the data
scores = cross_val_score(classifier, X_train, y_train, cv = 10)

print(scores)


[0.78100264 0.77572559 0.76455026 0.73015873 0.76719577 0.77777778
 0.76984127 0.75925926 0.74338624 0.76984127]
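For a classifier, passing an integer `cv` to `cross_val_score` uses stratified K-fold splitting without shuffling. The sketch below spells out that loop by hand and checks that it reproduces the automated scores. It runs on synthetic data from `make_classification` (an assumption for illustration, so the numbers will not match the phoneme results above).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data with five features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Automated 10-fold cross-validation (accuracy is the default score).
auto_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=10)

# The same procedure written out: fit on nine folds, score on the held-out fold.
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10).split(X_demo, y_demo):
    model = GaussianNB().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

print("Manual loop matches cross_val_score:", np.allclose(auto_scores, manual_scores))
print(f"Mean accuracy: {auto_scores.mean():.3f} +/- {auto_scores.std():.3f}")
```

Each fold trains a fresh model, so the classifier passed to `cross_val_score` is not itself fitted afterwards; that is why the next section fits a new one.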


## Training the classifier and pickling to disk
* Fit the model on all the training instances and save it to disk.

In [21]:
# Your code below this line (Question #E102)
# --------------------------
classifier = GaussianNB()

# fit a fresh model on the full training set
classifier.fit(X_train, y_train)

# Test model performance
y_pred = classifier.predict(X_test)

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.87      0.77      0.82      1139
           1       0.57      0.72      0.64       483

    accuracy                           0.76      1622
   macro avg       0.72      0.75      0.73      1622
weighted avg       0.78      0.76      0.76      1622



In [23]:
# pickle model to disk
joblib.dump(classifier, 'GaussianPhonemes.pkl')


['GaussianPhonemes.pkl']
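A quick way to gain confidence in the save step is a round trip: dump the model, load it back, and confirm the restored copy behaves identically. The sketch below does this in a temporary directory with synthetic data; the file name `demo_model.pkl` and the arrays are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data: the label depends on the sign of the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = GaussianNB().fit(X_demo, y_demo)

# Write the fitted model to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should carry the same parameters and predictions.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
same_means = np.allclose(model.theta_, restored.theta_)
print("Same predictions:", same_predictions)
print("Same per-class feature means:", same_means)
```

Pickled models should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.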

## Unpickling the model and making predictions

* Load the saved model 
* Make predictions for the testing set


In [28]:
# Your code below this line (Question #E103)
# --------------------------
# load pickled model
loaded_model = joblib.load('GaussianPhonemes.pkl')

# make predictions with freshly loaded model
y_pred = loaded_model.predict(X_test)

# verify input and output shape are appropriate
print("Input vs. output shape:")
print(X_test.shape, y_pred.shape)


Input vs. output shape:
(1622, 5) (1622,)


## Performing final performance comparison

In [27]:
# tally up right + wrong 'guesses' by model
correct, incorrect = 0, 0
for actual, predicted in zip(y_test, y_pred):
    if actual == predicted:
        correct += 1
    else:
        incorrect += 1

# report results numerically and by percentage
percent_correct = correct / (correct + incorrect) * 100
print("Correct guesses: " + str(correct) + "\nIncorrect guesses: " + str(incorrect))
print("Percent correct: " + str(percent_correct))

# compare to average of cross-validation scores
avg_cv = np.mean(scores) * 100
print("Percent cross-validation score (10 folds, average): " + str(avg_cv))

Correct guesses: 1226
Incorrect guesses: 396
Percent correct: 75.58569667077681
Percent cross-validation score (10 folds, average): 76.38738814200556
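One way to judge whether the gap between the two percentages is meaningful is to compare it with the spread of the fold scores. The sketch below re-enters the ten fold scores and the guess counts reported above (copied by hand, so they are rounded to the printed precision) and checks whether the test accuracy falls within one standard deviation of the cross-validation mean. Using one standard deviation as the yardstick is an assumption for illustration, not a formal test.

```python
import numpy as np

# Fold scores and guess counts copied from the outputs reported in this notebook.
scores = np.array([
    0.78100264, 0.77572559, 0.76455026, 0.73015873, 0.76719577,
    0.77777778, 0.76984127, 0.75925926, 0.74338624, 0.76984127,
])
correct, incorrect = 1226, 396

test_acc = correct / (correct + incorrect)
cv_mean = scores.mean()
cv_std = scores.std()

print(f"CV accuracy:   {cv_mean:.4f} +/- {cv_std:.4f}")
print(f"Test accuracy: {test_acc:.4f}")
print("Within one std of CV mean:", abs(test_acc - cv_mean) <= cv_std)
```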


## Measuring performance using scikit-learn modules

Compute and display the following:
 1. Confusion matrix
 1. Accuracy
 1. Precision
 1. Recall
 1. $F_1$-Score
 
Add additional cells if required. 

In [36]:
# Your code below this line  (Question #E104)
# --------------------------
# create confusion matrix
print("Confusion matrix:\n",confusion_matrix(y_test, loaded_model.predict(X_test)))

# print loaded model's performance
print("\nClassification report:\n",classification_report(y_test, y_pred))

Confusion matrix:
 [[876 263]
 [133 350]]

Classification report:
               precision    recall  f1-score   support

           0       0.87      0.77      0.82      1139
           1       0.57      0.72      0.64       483

    accuracy                           0.76      1622
   macro avg       0.72      0.75      0.73      1622
weighted avg       0.78      0.76      0.76      1622
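The figures in the classification report can be derived directly from the confusion matrix, where rows are true classes and columns are predicted classes. The sketch below recomputes them from the matrix printed above; the `metrics` dictionary and the majority-class baseline are additions for illustration.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class.
cm = np.array([[876, 263],
               [133, 350]])

accuracy = float(np.trace(cm) / cm.sum())

metrics = {}
for cls in (0, 1):
    tp = cm[cls, cls]
    precision = float(tp / cm[:, cls].sum())  # correct among predicted as cls
    recall = float(tp / cm[cls, :].sum())     # correct among truly cls
    f1 = 2 * precision * recall / (precision + recall)
    metrics[cls] = {"precision": precision, "recall": recall, "f1": f1}
    print(f"class {cls}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Accuracy of always predicting the most common true class, for context.
baseline = float(cm.sum(axis=1).max() / cm.sum())
print(f"accuracy={accuracy:.2f}  majority-class baseline={baseline:.2f}")
```

Comparing accuracy with the majority-class baseline helps put the headline figure in context on an imbalanced dataset like this one.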



## Conclusions

How did your trained model perform relative to your expectations based on the cross-validation?
Provide your answer in the cell below.

# Save your notebook!  Then `File > Close and Halt`