# STATS 102
## Class 27

Textbook reference: Python Data Science Handbook - Chapter 5

Here are the topics for this lecture:

* Hyperparameters and Model Validation

Let's get started...

<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">

*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*

*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*

# Hyperparameters and Model Validation

In the previous section, we saw the basic recipe for applying a supervised machine learning model:

1. Select: Choose a class of model
2. Instantiate: Choose model hyperparameters
3. Fit: Fit the model to the training data
4. Predict: Use the model to predict labels for new data

### The first two pieces of this—the choice of model and choice of hyperparameters—are perhaps the most important part of using these tools and techniques effectively.

In order to make an informed choice, we need a way to *validate* that our model and our hyperparameters are a good fit to the data. While this may sound simple, there are some pitfalls that you must avoid to do this effectively.

## Thinking about Model Validation

In principle, model validation is very simple: after choosing a model and its hyperparameters, we can estimate how effective it is by applying it to some of the training data and comparing the prediction to the known value.

The following sections first show a naive approach to model validation and why it
fails, before exploring the use of holdout sets and cross-validation for more robust
model evaluation.

### Model validation the wrong way

Let's demonstrate the naive approach to validation using the Iris data, which we saw in the previous section.
We will start by loading the data:

In [40]:
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

Next we choose a model and hyperparameters. Here we'll use a *k*-neighbors classifier with ``n_neighbors=1``.
**This is a very simple and intuitive model that says "the label of an unknown point is the same as the label of its closest training point":**

In [42]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)

Then we train the model, and use it to predict labels for data we already know:

In [43]:
model.fit(X, y)
y_model = model.predict(X)

Finally, we compute the fraction of correctly labeled points:

In [44]:
from sklearn.metrics import accuracy_score
accuracy_score(y, y_model)

1.0

### We see an accuracy score of 1.0, which indicates that 100% of points were correctly labeled by our model! But is this truly measuring the expected accuracy? Have we really come upon a model that we expect to be correct 100% of the time?

As you may have gathered, the answer is no.
In fact, this approach contains a fundamental flaw: *it trains and evaluates the model on the same data*.

Furthermore, the nearest neighbor model is an *instance-based* estimator that simply stores the training data, and predicts labels by comparing new data to these stored points: except in contrived cases, it will get 100% accuracy *every time!*
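To see why the perfect score is essentially guaranteed, it can help to look at what the classifier actually finds. The following is a minimal sketch (the variable names `distances` and `indices` are chosen here for illustration and do not come from the text) that asks the fitted model for the nearest stored point to each training sample via `kneighbors`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# For every training sample, look up its single nearest stored point.
distances, indices = model.kneighbors(X, n_neighbors=1)

# Each sample finds itself (or an exact duplicate) at distance ~0,
# so the "prediction" is only a lookup of a label the model already stores.
print("largest nearest-neighbor distance:", distances.max())
print("looked-up labels equal the true labels:",
      np.array_equal(y[indices[:, 0]], y))
```

Because every query point is already in the stored training set, the nearest-neighbor distance is zero (up to floating-point rounding), and the score says nothing about how the model handles points it has not seen.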

### Model validation the right way: Holdout sets

So what can be done?
A better sense of a model's performance can be found using what's known as a **holdout set**: that is, we hold back some subset of the data from the training of the model, and then use this holdout set to check the model performance.
This splitting can be done using the ``train_test_split`` utility in Scikit-Learn:

In [46]:
from sklearn.model_selection import train_test_split
# split the data with 50% in each set
X1, X2, y1, y2 = train_test_split(X, y, random_state=0,
                                  train_size=0.5)

# fit the model on one set of data
model.fit(X1, y1)

# evaluate the model on the second set of data
y2_model = model.predict(X2)
acc = accuracy_score(y2, y2_model)
print("The model accuracy is {}%".format(np.round(100 * acc, 2)))

The model accuracy is 90.67%


We see here a more reasonable result: the nearest-neighbor classifier is about 91% accurate on this holdout set.
The holdout set is similar to unknown data, because the model has not "seen" it before.
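A single holdout score also depends on which points happen to land in each half. The sketch below repeats the 50/50 split with a few different `random_state` values (the seeds `0` to `4` are arbitrary choices for illustration) to show that the estimate can move from split to split:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

holdout_scores = []
for seed in range(5):  # arbitrary seeds, for illustration only
    # a different random 50/50 split for each seed
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.5, random_state=seed)
    model.fit(X_train, y_train)
    holdout_scores.append(accuracy_score(y_test, model.predict(X_test)))

print(holdout_scores)
```

Any one of these numbers is a reasonable estimate, but none of them is "the" accuracy of the model, so a single holdout score should be read with that spread in mind.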

### Model validation via cross-validation

One disadvantage of using a holdout set for model validation is that a portion of our data is withheld from model training. In the example above, half of the dataset does not contribute to training the model.

One way to address this is to use **cross-validation**; that is, to do a **sequence of fits** where each subset of the data is used both as a training set and as a validation set.

Using the split data from before, we could implement it like this:

In [36]:
y2_model = model.fit(X1, y1).predict(X2) # Fit using first set
y1_model = model.fit(X2, y2).predict(X1) # Fit using second set
accuracy_score(y1, y1_model), accuracy_score(y2, y2_model)

(0.96, 0.9066666666666666)

What comes out are two accuracy scores, which we could combine (by, say, taking the mean) to get a better measure of the global model performance.

This particular form of cross-validation is a **two-fold cross-validation**—that is, one in which we have split the data into two sets and used each in turn as a validation set.

We could expand on this idea to use even more trials, and more folds in the data.  For example, we could split the data into five groups, and use each of them in turn to evaluate the model fit on the other 4/5 of the data.
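As a rough sketch of what that five-group procedure looks like when written out, the loop below uses `StratifiedKFold` to generate the index sets. Stratified folds are an assumption made here so that each group keeps the three classes balanced; the text itself does not say how the groups should be formed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

# Assumption: stratified folds, so every fold contains all three classes.
folds = StratifiedKFold(n_splits=5)

fold_scores = []
for train_idx, test_idx in folds.split(X, y):
    model.fit(X[train_idx], y[train_idx])    # train on 4/5 of the data
    y_pred = model.predict(X[test_idx])      # predict the held-out 1/5
    fold_scores.append(accuracy_score(y[test_idx], y_pred))

print(np.round(fold_scores, 3), "mean:", np.mean(fold_scores))
```

Every sample is used for validation exactly once and for training four times, which is the property that distinguishes this from a single holdout split.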

This would be rather tedious to do by hand, and so we can use **Scikit-Learn's ``cross_val_score``** convenience routine to do it succinctly:

In [37]:
from sklearn.model_selection import cross_val_score
# five-fold cross-validation; returns one accuracy score per fold
cross_val_score(model, X, y, cv=5)

array([0.96666667, 0.96666667, 0.93333333, 0.93333333, 1.        ])

Repeating the validation across different subsets of the data gives us an even better idea of the performance of the algorithm.
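The per-fold scores are usually condensed into a mean and a spread. A minimal sketch, assuming the same Iris data and 1-nearest-neighbor model as above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

scores = cross_val_score(model, X, y, cv=5)

# The mean is the overall estimate; the standard deviation shows how much
# that estimate varies from fold to fold.
print("mean accuracy: {:.3f}".format(scores.mean()))
print("std across folds: {:.3f}".format(scores.std()))
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two models is larger than the fold-to-fold noise.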

### Scikit-Learn implements a number of cross-validation schemes that are useful in particular situations; these are implemented via iterators in the ``model_selection`` module.

For example, we might wish to go to the extreme case in which our number of folds is equal to the number of data points: that is, we train on all points but one in each trial. This type of cross-validation is known as *leave-one-out* cross-validation, and can be used as follows:

In [38]:
from sklearn.model_selection import LeaveOneOut
scores = cross_val_score(model, X, y, cv=LeaveOneOut())

Because we have 150 samples, leave-one-out cross-validation yields scores for 150 trials, and each score indicates either a successful (1.0) or an unsuccessful (0.0) prediction.
Taking the mean of these gives an estimate of the accuracy (one minus the error rate):

In [39]:
scores.mean()

0.96
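Since each leave-one-out score is either 0.0 or 1.0, the mean is the fraction of correct predictions, and the error rate is its complement. The sketch below makes that arithmetic explicit; the names `n_wrong` and `error_rate` are introduced here for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)
scores = cross_val_score(model, X, y, cv=LeaveOneOut())

n_trials = len(scores)                 # one trial per sample
n_wrong = int(np.sum(scores == 0.0))   # trials where the left-out point was mislabeled
accuracy = scores.mean()
error_rate = 1.0 - accuracy            # equals n_wrong / n_trials

print(n_trials, "trials,", n_wrong, "misclassified")
print("accuracy:", accuracy, "error rate:", error_rate)
```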

Other cross-validation schemes can be used similarly.
For a description of what is available in Scikit-Learn, use IPython to explore the ``sklearn.model_selection`` submodule, or take a look at Scikit-Learn's online [cross-validation documentation](http://scikit-learn.org/stable/modules/cross_validation.html).
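As an illustration of "used similarly", the sketch below passes two other splitters from `sklearn.model_selection` to the same `cv` argument. The particular schemes and their settings (`n_splits`, `test_size`, `random_state`) are arbitrary choices for this example, not recommendations from the text:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

# Any splitter object can be handed to the cv argument.
schemes = {
    "shuffled stratified 5-fold": StratifiedKFold(
        n_splits=5, shuffle=True, random_state=0),
    "10 random 80/20 splits": ShuffleSplit(
        n_splits=10, test_size=0.2, random_state=0),
}

scheme_scores = {name: cross_val_score(model, X, y, cv=cv)
                 for name, cv in schemes.items()}

for name, s in scheme_scores.items():
    print("{}: {} scores, mean {:.3f}".format(name, len(s), s.mean()))
```

The call to `cross_val_score` is unchanged in each case; only the splitter passed as `cv` differs.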

## Selecting the Best Model

Now that we've seen the basics of validation and cross-validation, let's get into a little more depth regarding model selection and selection of hyperparameters.

These issues are some of the most important aspects of the practice of machine learning, and I find that this information is often glossed over in introductory machine learning tutorials.

Of core importance is the following question: 

*if our estimator is underperforming, how should we move forward?*

There are several possible answers:

- Use a more complicated/more flexible model
- Use a less complicated/less flexible model
- Gather more training samples
- Gather more data to add features to each sample

The ability to determine what steps will improve your model is what separates the successful machine learning practitioners from the unsuccessful.

### The Bias-variance trade-off

This is perhaps the most important consideration when evaluating the performance of your model.  In order to get a good conceptual understanding, I would ask you to put this video on hold and watch the following Coursera video: 

https://www.coursera.org/learn/deep-neural-network/lecture/ZhclI/bias-variance

Welcome back! As Mr Ng explains, comparing the training error with the validation error (which he calls the development error) helps you understand how your model is performing with respect to *bias* and *variance*. Specifically, the training error indicates how well the model is able to fit the data: a high training error signals high bias. The difference between the validation and training errors indicates how well the model generalizes: a large gap signals high variance. Finally, please note his point about Bayes error (the noise floor). This is a particularly important consideration when the Bayes error is of the same order of magnitude as the training and validation errors, because the training error should then be judged relative to that floor rather than to zero.
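Both errors can be measured in one call with `cross_validate` and `return_train_score=True`. The sketch below does this for the *k*-neighbors classifier on the Iris data at three values of `n_neighbors`; the values 1, 15, and 75 are arbitrary choices made here to span a flexible model through a rigid one, and the error is taken as one minus accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

results = {}
for k in (1, 15, 75):  # illustrative values only
    cv = cross_validate(KNeighborsClassifier(n_neighbors=k), X, y,
                        cv=5, return_train_score=True)
    train_err = 1.0 - cv["train_score"].mean()   # indicator of bias
    val_err = 1.0 - cv["test_score"].mean()      # error on held-out folds
    results[k] = (train_err, val_err)
    print("k={:>2}: training error {:.3f}, validation error {:.3f}, gap {:+.3f}"
          .format(k, train_err, val_err, val_err - train_err))
```

With `n_neighbors=1` the training error is zero by construction, so the whole validation error shows up as the gap. Larger values of `n_neighbors` smooth the model and can raise the training error itself. Iris is a small, easy dataset, so the differences may be modest.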

## Basic Recipe for Machine Learning

Once again, I would ask you to stop my video and go to Coursera, where Mr Ng offers a systematic approach to diagnosing and improving a model that is found to have high bias and/or high variance.

https://www.coursera.org/learn/deep-neural-network/lecture/ZBkx4/basic-recipe-for-machine-learning
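One way to turn the two error measurements into a next step is sketched below. This is not Mr Ng's procedure verbatim: `diagnose` is a hypothetical helper, the threshold `tol` is an arbitrary value, the error figures in the example calls are made up, and the suggested remedies are simply the options listed earlier under "Selecting the Best Model".

```python
def diagnose(train_err, val_err, bayes_err=0.0, tol=0.02):
    """Suggest next steps from training and validation error.

    Hypothetical helper: `tol` is an arbitrary threshold chosen for
    illustration, not a value taken from the lecture or the video.
    """
    suggestions = []
    # Bias: how far the training error sits above the noise floor.
    if train_err - bayes_err > tol:
        suggestions.append(
            "high bias: try a more flexible model or add features")
    # Variance: how much worse the model does on data it has not seen.
    if val_err - train_err > tol:
        suggestions.append(
            "high variance: try a less flexible model or gather more samples")
    if not suggestions:
        suggestions.append("both errors are close to the noise floor")
    return suggestions


# Made-up error values, for illustration only.
print(diagnose(train_err=0.15, val_err=0.16))   # bias only
print(diagnose(train_err=0.01, val_err=0.11))   # variance only
print(diagnose(train_err=0.15, val_err=0.30))   # both
```

The order of the checks follows the reasoning above: first ask whether the model fits the training data well enough, and only then ask whether that fit carries over to held-out data.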

## In Summary...

* We explored the concepts of model validation and hyperparameter selection
* We focused on the bias-variance trade-off and how it comes into play when fitting models to data
* We reviewed Mr Ng's basic recipe for diagnosing your machine learning model