# 05.03 - Hyperparameters and Model Validation

As a recap of the previous episode, these are the basic steps for applying a supervised machine learning model:

1. Choose a class of model
2. Choose model hyperparameters
3. Fit the model to the training data
4. Use the model to predict labels for new data

Generally speaking, the first two steps are the most important. Finding the most appropriate **model** and tuning it with the right **hyperparameters** is of fundamental importance, and this section covers how to validate both.

### Thinking about Model Validation

Model and hyperparameter validation is _deceptively_ simple:

1. Choose a model and hyperparameters 
2. Apply it to training data
3. Compare prediction to known value

However, there are good (and "less good") ways of doing it. Let's have a look at both:

### Model validation the wrong way

In [1]:
# loading the data
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

Next we choose a model and hyperparameters. Here we'll use a k-neighbors classifier with <code>n_neighbors=1</code>.

In [2]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)

In [3]:
# train the model, then predict on the very same data
model.fit(X, y)
y_model = model.predict(X)

In [4]:
# accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y, y_model)

1.0

100%? Wow, we might just have created the perfect model! Or maybe not. The flaw here is that we trained and evaluated the model on the _same data_.

KNN simply stores the training data and predicts labels by comparing new data to these stored points. With <code>n_neighbors=1</code>, every training point is its own nearest neighbor, so the model gets 100% accuracy on its training data (almost) every time; the only exception is identical points that carry different labels.
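To see why the score is perfect, we can ask the fitted model for the nearest stored point of each training sample. This is a minimal sketch using the <code>kneighbors</code> method of <code>KNeighborsClassifier</code>; the variable names are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# distance from every training point to its closest stored point
distances, indices = model.kneighbors(X, n_neighbors=1)

# each point finds itself (or an exact duplicate), so the largest
# nearest-neighbor distance is zero up to floating-point error
print(distances.max())

# the predicted label is therefore just the stored label of that point
train_accuracy = model.score(X, y)
print(train_accuracy)
```

Because the "prediction" is only a lookup of a label the model has already seen, this score says nothing about how the model behaves on unseen data.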

### Model validation the right way: Holdout sets



A better way is to use **holdout sets**: parts of the dataset that we hold back from training and use only to check model performance.

This splitting can be done using the <code>train_test_split</code> utility in Scikit-Learn:

In [6]:
from sklearn.model_selection import train_test_split
# split the data with 50% in each set
X1, X2, y1, y2 = train_test_split(X, y, random_state=0,
                                  train_size=0.5)

# fit the model on one set of data
model.fit(X1, y1)

# evaluate the model on the second set of data
y2_model = model.predict(X2)
accuracy_score(y2, y2_model)



0.9066666666666666
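A single holdout score depends on which points happen to land in each half. As a rough illustration, the same split can be repeated with a few different <code>random_state</code> values; the seeds below are arbitrary choices, not taken from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

holdout_scores = []
for seed in range(5):  # arbitrary seeds, for illustration only
    # a fresh 50/50 split for every seed
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=seed, train_size=0.5)
    model.fit(X_train, y_train)
    holdout_scores.append(accuracy_score(y_test, model.predict(X_test)))

print(holdout_scores)
```

Any spread between these scores is a reminder that one split gives only one estimate of the model's performance.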

### Model validation via cross-validation

The holdout score of about 91% is a more realistic estimate than the 100% we got earlier. An evident drawback, however, is that half of the available data was never used to train the model, so we may be missing out on useful information.

To avoid this issue, and make the validation even more robust, we can use **cross-validation**, where each subset of the data is used both as a training set and as a validation set. 

Here we do two validation trials, alternately using each half of the data as a holdout set.

In [7]:
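# train on the first half, predict the second half
y2_model = model.fit(X1, y1).predict(X2)
# swap roles: train on the second half, predict the first half
y1_model = model.fit(X2, y2).predict(X1)
accuracy_score(y1, y1_model), accuracy_score(y2, y2_model)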

(0.96, 0.9066666666666666)

This particular form of cross-validation is a **two-fold cross-validation**: we split the data into two sets and used each in turn as a validation set.

The concept extends to _n_-fold cross-validation: split the dataset into _n_ parts and, in turn, hold out each part for validation while training on the remaining _n_ - 1.
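Written out by hand, the _n_-fold idea is a loop over the folds. The sketch below assumes five folds and uses <code>StratifiedKFold</code>, so that every fold keeps the class balance. This matters here because the iris labels are stored in class order, and plain consecutive folds would leave some classes out of a fold. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

# assumption: 5 stratified folds (the number of folds is a free choice)
folds = StratifiedKFold(n_splits=5)

fold_scores = []
for train_idx, val_idx in folds.split(X, y):
    # train on four folds, validate on the one left out
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[val_idx])
    fold_scores.append(accuracy_score(y[val_idx], y_pred))

fold_scores = np.array(fold_scores)
print(fold_scores)
```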

To do this conveniently, we can use Scikit-Learn's <code>cross_val_score</code>:

In [9]:
from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, cv=5)

array([0.96666667, 0.96666667, 0.93333333, 0.93333333, 1.        ])
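The per-fold scores are usually condensed into a single figure. One common way, sketched here with illustrative variable names, is to report their mean together with the standard deviation as a rough measure of spread. For the five scores shown above, the mean works out to 0.96.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

# one accuracy value per fold
scores = cross_val_score(model, X, y, cv=5)

# summarize: average performance and fold-to-fold variation
mean_score = scores.mean()
std_score = scores.std()
print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```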