In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Parameter selection, Validation & Testing

Most models have parameters that influence how complex a model they can learn. Recall ``KNeighborsRegressor``:
as we increase the number of neighbors we consider, the prediction becomes smoother and smoother:

In [None]:
from figures import plot_kneighbors_regularization
plot_kneighbors_regularization()

Here, we trade off remembering too much about the particularities and noise of the training data against not modeling enough of the variability. This trade-off has to be made in basically every machine learning application and is a central concept, called the bias-variance tradeoff or "overfitting vs underfitting".

![underfitting and overfitting](figures/overfitting_underfitting_cartoon.svg)
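
The trade-off can be made visible by comparing scores on the training data with scores on data the model has not seen. The following is a minimal sketch, assuming a synthetic noisy dataset and a simple random train/test split; the names ``x_demo``, ``X_demo``, ``y_demo``, ``train_scores`` and ``test_scores`` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data chosen for illustration (an assumption, seeded for repeatability).
rng = np.random.RandomState(0)
x_demo = np.linspace(-3, 3, 100)
y_demo = np.sin(4 * x_demo) + x_demo + rng.normal(size=len(x_demo))
X_demo = x_demo[:, np.newaxis]

# Hold out part of the data so we can measure generalization.
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

train_scores = {}
test_scores = {}
for k in [1, 5, 20, 50]:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    # score() returns the R^2 coefficient for regressors.
    train_scores[k] = model.score(X_train, y_train)
    test_scores[k] = model.score(X_test, y_test)
    print("n_neighbors: %d, train R^2: %f, test R^2: %f"
          % (k, train_scores[k], test_scores[k]))
```

With ``n_neighbors=1`` the model reproduces the training targets exactly, so its training score says nothing about how well it generalizes; with very many neighbors the prediction is averaged over so many points that even the training score drops.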

## Hyperparameters, Over-fitting, and Under-fitting
Unfortunately, there is no general rule for finding the sweet spot, so machine learning practitioners have to find the best trade-off between model complexity and generalization by trying several parameter settings.
Most commonly this is done with a brute-force search, for example over multiple values of ``n_neighbors``:


In [None]:
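# cross_val_score and KFold live in sklearn.model_selection
# (the older sklearn.cross_validation module has been removed)
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsRegressor
# generate toy dataset:
x = np.linspace(-3, 3, 100)
y = np.sin(4 * x) + x + np.random.normal(size=len(x))
X = x[:, np.newaxis]

# 3-fold cross-validation on shuffled data
cv = KFold(n_splits=3, shuffle=True)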

# for each parameter setting do cross_validation:
for n_neighbors in [1, 3, 5, 10, 20]:
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=n_neighbors), X, y, cv=cv)
    print("n_neighbors: %d, average score: %f" % (n_neighbors, np.mean(scores)))

If multiple parameters are important, like the parameters ``C`` and ``gamma`` in an ``SVM`` (more about that later), all possible combinations are tried:

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# for each parameter combination, run cross-validation:
for C in [0.001, 0.01, 0.1, 1, 10]:
    for gamma in [0.001, 0.01, 0.1, 1]:
        scores = cross_val_score(SVR(C=C, gamma=gamma), X, y, cv=cv)
        print("C: %f, gamma: %f, average score: %f" % (C, gamma, np.mean(scores)))

As this is such a common pattern, scikit-learn has a built-in class for it, ``GridSearchCV``. ``GridSearchCV`` takes a model to train and a dictionary that describes the parameters that should be tried.

The grid of parameters is defined as a dictionary, where the keys are the parameter names and the values are the settings to be tested.
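
To make the dictionary format concrete, the following sketch expands such a grid into its individual settings with ``ParameterGrid`` from ``sklearn.model_selection``. The name ``demo_grid`` is illustrative; the values mirror the ``C`` and ``gamma`` ranges used in the loop above.

```python
from sklearn.model_selection import ParameterGrid

# Keys are parameter names, values are the candidate settings.
demo_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]}

# ParameterGrid enumerates every combination of the listed values.
combinations = list(ParameterGrid(demo_grid))
print(len(combinations))  # 5 values of C times 4 values of gamma = 20

# Each entry is a plain dict that could be passed to SVR(**params).
for params in combinations[:3]:
    print(params)
```

Note that the number of combinations grows multiplicatively with every parameter added to the grid, which is why grids are usually kept coarse.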

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]}

grid = GridSearchCV(SVR(), param_grid=param_grid, cv=cv, verbose=3)

One of the great things about ``GridSearchCV`` is that it is a *meta-estimator*. It takes an estimator like ``SVR`` above and creates a new estimator that behaves exactly the same way, in this case like a regressor.
So we can call ``fit`` on it to train it:

In [None]:
grid.fit(X, y)

What ``fit`` does is a bit more involved than what we did above. First, it runs the same loop with cross-validation to find the best parameter combination.
Once it has the best combination, it runs ``fit`` again on all the data passed to ``fit`` (without cross-validation) to build a single new model using the best parameter setting.
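
The two steps can be written out by hand and compared with ``GridSearchCV``. This is a minimal sketch under stated assumptions: a seeded synthetic dataset, a smaller grid than the one above, and a ``KFold`` with a fixed ``random_state`` so that both approaches see identical splits. The names ``demo_cv``, ``demo_grid``, ``manual_model`` and ``demo_search`` are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic data chosen for illustration (an assumption, seeded for repeatability).
rng = np.random.RandomState(0)
x_demo = np.linspace(-3, 3, 100)
y_demo = np.sin(4 * x_demo) + x_demo + rng.normal(size=len(x_demo))
X_demo = x_demo[:, np.newaxis]

# A fixed random_state makes the shuffled folds identical on every use.
demo_cv = KFold(n_splits=3, shuffle=True, random_state=0)
demo_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}

# Step 1: cross-validate every combination and remember the best one.
best_score, best_params = -np.inf, None
for C in demo_grid['C']:
    for gamma in demo_grid['gamma']:
        scores = cross_val_score(SVR(C=C, gamma=gamma), X_demo, y_demo, cv=demo_cv)
        if np.mean(scores) > best_score:
            best_score, best_params = np.mean(scores), {'C': C, 'gamma': gamma}

# Step 2: refit a single model on all the data with the best setting.
manual_model = SVR(**best_params).fit(X_demo, y_demo)

# GridSearchCV performs both steps inside fit.
demo_search = GridSearchCV(SVR(), param_grid=demo_grid, cv=demo_cv)
demo_search.fit(X_demo, y_demo)

print("manual search:", best_params, best_score)
print("GridSearchCV: ", demo_search.best_params_, demo_search.best_score_)
```

After fitting, the selected setting is available as ``best_params_``, its mean cross-validation score as ``best_score_``, and the refitted model as ``best_estimator_``.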

Then, as with all models, we can use ``predict`` or ``score``:


In [None]:
grid.predict(X)
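
Predicting or scoring on the same data that was used for the search gives an optimistic picture, because both the parameters and the final model were chosen using that data. A hedged sketch of one way to check generalization is to set aside a test set before the search; the split, the smaller grid and the names below are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

# Synthetic data chosen for illustration (an assumption, seeded for repeatability).
rng = np.random.RandomState(0)
x_demo = np.linspace(-3, 3, 100)
y_demo = np.sin(4 * x_demo) + x_demo + rng.normal(size=len(x_demo))
X_demo = x_demo[:, np.newaxis]

# Keep a test set that the grid search never sees.
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

demo_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
demo_search = GridSearchCV(SVR(), param_grid=demo_grid, cv=3)
demo_search.fit(X_train, y_train)  # cross-validation happens inside the training set

# score() uses the refitted best model; for regressors it returns R^2.
test_score = demo_search.score(X_test, y_test)
print("best parameters:", demo_search.best_params_)
print("R^2 on held-out test data: %f" % test_score)
```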