# Train, Validate $\rightarrow$ Train, Test

## Introduction
When constructing a model, data availability may become an issue. 
In order to detect overfitting, it is necessary to withhold some portion of the data as a test set. 
However, if the test set is used repeatedly to tune the model, overfitting *on the test set* may also occur unless there is a separate validation step. 
As such, `scikit-learn` contains a number of methods for cross-validation of data.

## References
1. [Scikit documentation - GaussianNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)

## Setting up the model

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB
from collections import OrderedDict

# load dataset 
raw = load_iris()
X = raw.data[:, :2] # keep only the first two features (.data has four columns)
y = raw.target # the target data is a single label, so it can all be kept

# test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# we'll use the Gaussian Naive Bayes classifier
classifier = GaussianNB()

## Cross-validation
Though a manual CV workflow was described in [the cross-validation lab](./CrossValidation.ipynb), the automated `cross_val_score()` will work well enough for this example.

In [2]:
# automated CV step
scores = cross_val_score(classifier, X_train, y_train, cv=5)
print(scores) # TODO: visualization of CV process

[0.77777778 0.77777778 0.72222222 0.94444444 0.66666667]


Note that we are performing cross-validation on the training set only. Each score is the accuracy (with 1 being a perfect score) that the model achieved on one held-out fold of the training data after being fit on the remaining four folds.
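
To make the fold-by-fold process concrete, here is a minimal sketch of roughly what `cross_val_score()` does for a classifier when `cv=5`. It assumes the same two-feature iris setup as above; the fixed `random_state=0` and the names `folds`, `manual_scores` and `auto_scores` are illustrative choices that do not appear in the lab, so the numbers will differ from the output shown earlier.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# same setup as the lab, with a fixed seed (an assumption) so the run is repeatable
X, y = load_iris(return_X_y=True)
X = X[:, :2]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

classifier = GaussianNB()

# for a classifier, an integer cv is treated as an unshuffled StratifiedKFold
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for fit_idx, val_idx in folds.split(X_train, y_train):
    model = clone(classifier)  # fresh, unfitted copy for every fold
    model.fit(X_train[fit_idx], y_train[fit_idx])
    # .score() on a classifier returns accuracy on the held-out fold
    manual_scores.append(model.score(X_train[val_idx], y_train[val_idx]))
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(classifier, X_train, y_train, cv=5)

print(manual_scores)
print(auto_scores)
print(f"mean accuracy: {auto_scores.mean():.2f} (std {auto_scores.std():.2f})")
```

Reporting the mean and standard deviation of the fold scores is a common way to summarize them before deciding whether to move on to the test set.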

## Training the new model

Since the CV values are relatively high, we can create a model using all the data in the training set and test against the testing set:

In [3]:
# fit new model
classifier.fit(X_train, y_train)

# GaussianNB.predict() returns class labels (integers)
y_pred = classifier.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.95      0.98        22
           1       0.59      0.59      0.59        17
           2       0.68      0.71      0.70        21

    accuracy                           0.77        60
   macro avg       0.76      0.75      0.75        60
weighted avg       0.77      0.77      0.77        60
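
The per-class figures in the report can be cross-checked against a confusion matrix, which counts how often each true class (rows) was predicted as each class (columns). This is a minimal sketch under the same two-feature setup; `random_state=0` is an assumption added for repeatability, so the counts will not match the report above exactly.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X[:, :2], y, test_size=0.4, random_state=0  # fixed seed is an assumption
)

y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

# rows are true labels, columns are predicted labels, in the order given by `labels`
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)
```

Off-diagonal entries show which pairs of classes the model confuses; the diagonal holds the correctly classified samples.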



## Saving and loading models through pickling / `joblib`

In many cases, the model being trained will be used more than once for the same or different data. 
For large datasets, training can be computationally expensive as well. 
For these and other reasons it is often necessary to save a trained model so it can later be loaded.
This is relatively easy to do for `scikit-learn` models via `joblib`:

In [4]:
# assuming the model 'classifier' has been initialized and trained (as above):
import joblib
joblib.dump(classifier, 'GaussianIris.pkl')

['GaussianIris.pkl']

`joblib` makes loading in pickled models similarly easy:

In [5]:
# we'll load to a new variable this time, for clarity
loaded_model = joblib.load('GaussianIris.pkl')

# and predict with the 'new' model:
y_pred = loaded_model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.95      0.98        22
           1       0.59      0.59      0.59        17
           2       0.68      0.71      0.70        21

    accuracy                           0.77        60
   macro avg       0.76      0.75      0.75        60
weighted avg       0.77      0.77      0.77        60



Note that the results are exactly the same.
This is no accident: the model loaded from disk is *the same* as the one trained above.
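
Rather than comparing two printed reports by eye, the round trip can be checked programmatically. The sketch below is one way to do that, assuming the same two-feature iris setup; the temporary directory, `random_state=0`, and the names `same_predictions` and `same_means` are illustrative and not part of the lab.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X[:, :2], y, test_size=0.4, random_state=0  # fixed seed is an assumption
)

classifier = GaussianNB().fit(X_train, y_train)

# write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "GaussianIris.pkl")
    joblib.dump(classifier, path)
    loaded_model = joblib.load(path)

# identical predictions and identical fitted per-class feature means
same_predictions = np.array_equal(
    classifier.predict(X_test), loaded_model.predict(X_test)
)
same_means = np.array_equal(classifier.theta_, loaded_model.theta_)
print(same_predictions, same_means)
```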

Please read a little about Pickling (or Serialization) [here](https://docs.python.org/3/library/pickle.html) and [here](https://en.wikipedia.org/wiki/Serialization).

## Notes on pickling - safety, efficiency

`joblib` (and by extension the default `pickle` module) is by no means your only option for model storage.
In Python 2, `cPickle` was a [faster](https://docs.python.org/2.2/lib/module-cPickle.html) 
C-based implementation of the same pickling algorithm. 
In Python 3 it no longer exists as a separate module: the standard `pickle` module uses its C implementation automatically when it is available, so no extra import is needed, even for more significant models (as we may cover in later modules).
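
For comparison with `joblib`, here is a minimal sketch of the same round trip using only the standard-library `pickle` module, serializing to an in-memory `bytes` object instead of a file. The names `payload` and `restored` are illustrative, and the model is fit on the full two-feature iris data purely to keep the example short.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X = X[:, :2]
model = GaussianNB().fit(X, y)

# serialize to bytes and restore; HIGHEST_PROTOCOL picks the newest pickle format
payload = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

predictions_match = bool((restored.predict(X) == model.predict(X)).all())
print(type(payload).__name__, predictions_match)
```

As with any pickle, only load data from a source you trust, since unpickling can execute arbitrary code.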

Moreover, consider reading the `scikit-learn` docs on [persistence](http://scikit-learn.org/stable/modules/model_persistence.html)
for considerations on the security and long-term maintainability of pickled models. 