## Automatic hyperparameter tuning

This example shows how to use scikit-learn's procedures for automatically tuning hyperparameters. Hyperparameters are inputs to a learning algorithm that control its behavior; they are set before training rather than learned from the data.

We first read the Adult dataset again.

In [1]:
import pandas as pd

train_data = pd.read_csv('adult_train.csv')

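# The last column holds the label; all other columns are features.
n_cols = len(train_data.columns)
# DictVectorizer (used below) expects one dict per row.
Xtrain = train_data.iloc[:, :n_cols-1].to_dict('records')
Ytrain = train_data.iloc[:, n_cols-1]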

test_data = pd.read_csv('adult_test.csv')
Xtest = test_data.iloc[:, :n_cols-1].to_dict('records')
Ytest = test_data.iloc[:, n_cols-1]

We create a `Pipeline` that handles all the preprocessing steps, and we then apply this preprocessing pipeline to the training set.

In [None]:
import numpy as np

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest

preprocessing_pipeline = make_pipeline(
    DictVectorizer(),
    StandardScaler(with_mean=False),
    SelectKBest(k=100),
)

preprocessing_pipeline.fit(Xtrain, Ytrain)
X_vec = preprocessing_pipeline.transform(Xtrain)

We will now carry out a grid search for the best hyperparameters for a `LogisticRegression` classifier.

We'll tune the hyperparameters `C` and `penalty`. See [scikit-learn's documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) of the LogisticRegression classifier for details about these hyperparameters.

For the grid search, we define lists of the values of each hyperparameter that we want to explore. The grid search procedure (which we run by calling `fit`) will then try out all combinations of values of each hyperparameter.
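
To make "all combinations" concrete, the following sketch enumerates the grid with scikit-learn's `ParameterGrid` helper. It uses the same values as the grid defined in the next cell; the variable names `combinations` and `n_combinations` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}

# ParameterGrid enumerates the Cartesian product of the listed values.
combinations = list(ParameterGrid(param_grid))
n_combinations = len(combinations)  # 5 values of C times 2 penalties

print(n_combinations)
for combination in combinations:
    print(combination)
```

With 5 values of `C` and 2 penalties, the grid contains 10 combinations, and each of them is cross-validated separately. The cost of a grid search therefore grows multiplicatively with every hyperparameter added.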

`GridSearchCV` runs a separate cross-validation for each combination of hyperparameter values and selects the combination that gave the highest mean classification accuracy across the folds.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='liblinear')

param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}

gridsearch = GridSearchCV(clf, param_grid)

gridsearch.fit(X_vec, Ytrain);

After carrying out the grid search, we can inspect the hyperparameter values that led to the best results in the cross-validation.

In [None]:
gridsearch.best_params_
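
Besides `best_params_`, a fitted search object also exposes `best_score_` and `cv_results_`, which help to judge how much the choice of hyperparameters actually matters. The sketch below is a minimal illustration on synthetic data from `make_classification` (an assumption, used so that the snippet runs without the Adult files); the names `X_demo`, `y_demo`, and `demo_search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; the notebook uses the Adult dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

demo_search = GridSearchCV(
    LogisticRegression(solver='liblinear'),
    {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']},
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# Mean cross-validated accuracy of the best combination.
print(demo_search.best_score_)

# One entry per combination: the settings and their mean validation score.
results = demo_search.cv_results_
for params, score in zip(results['params'], results['mean_test_score']):
    print(params, round(score, 3))
```

If the scores in `cv_results_` are all nearly identical, the model is not very sensitive to these hyperparameters on this data, and the "best" combination may differ from run to run.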

As an alternative to the grid search, we'll also take a look at a random search. It has been claimed that this procedure finds better hyperparameter values in fewer experiments ([Bergstra and Bengio, 2012](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf)).

We could specify the parameter values to explore, just as for the grid search. However, to take advantage of the random search, it can be better to define a *distribution* for continuous-valued hyperparameters, such as `C` in the case of logistic regression. In this case, we'll use an [exponential distribution](https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.expon.html) for `C`. The `penalty` hyperparameter will still be a discrete choice.
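
To get a feeling for what the random search will try, the sketch below draws a few values from the exponential distribution and then uses scikit-learn's `ParameterSampler` to generate candidate settings from a mix of a distribution and a discrete list. The fixed `random_state` values and the names `samples` and `candidates` are assumptions made for reproducibility and illustration.

```python
from scipy.stats import expon
from sklearn.model_selection import ParameterSampler

# Exponential distribution with mean 2 (scale is the mean for expon).
C_distr = expon(scale=2)

# Draw a few values directly from the distribution; all are positive.
samples = C_distr.rvs(size=5, random_state=0)
print(samples)

# ParameterSampler draws candidate settings: C is sampled from the
# distribution, penalty is picked uniformly from the list.
sampler = ParameterSampler(
    {'C': C_distr, 'penalty': ['l1', 'l2']}, n_iter=5, random_state=0
)
candidates = list(sampler)
for candidate in candidates:
    print(candidate)
```

An exponential distribution with `scale=2` puts most of its mass on small values of `C`. If the search should cover several orders of magnitude, as the grid above does, a log-uniform distribution such as `scipy.stats.loguniform` may be a more suitable choice.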

To find a good set of hyperparameter values, we should run many experiments. This can take a bit of time. I've set the number of iterations to 5 here, but in a real-world setting this number would probably be much higher.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

from scipy.stats import expon

C_distr = expon(scale=2)
param_grid_random = {'C': C_distr, 'penalty': ['l1', 'l2']}

randomsearch = RandomizedSearchCV(clf, param_grid_random, n_iter=5)

randomsearch.fit(X_vec, Ytrain);

Again, we can inspect the best hyperparameter values after running the selection procedure.

In [None]:
randomsearch.best_params_

Finally, we can train a classifier that uses the hyperparameter values found by the random search, and evaluate it on the test set.

In [None]:
best_C = randomsearch.best_params_['C']
best_penalty = randomsearch.best_params_['penalty']

pipeline = make_pipeline(
    DictVectorizer(),
    StandardScaler(with_mean=False),
    SelectKBest(k=100),
    LogisticRegression(C=best_C, penalty=best_penalty, solver='liblinear')
)

pipeline.fit(Xtrain, Ytrain)
accuracy_score(Ytest, pipeline.predict(Xtest))
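
As a variation on the steps above, the search can also be run over a complete pipeline, so that the preprocessing is refitted inside each cross-validation fold and the best model is retrained automatically. The following is a minimal sketch on synthetic data (an assumption, so that it runs without the Adult files); the shortened pipeline and the names `demo_pipeline`, `search`, and `test_accuracy` are hypothetical.

```python
from scipy.stats import expon
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the notebook uses the Adult dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

demo_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='liblinear'),
)

# make_pipeline names each step after its lowercased class name, so the
# classifier's hyperparameters are addressed as 'logisticregression__<name>'.
param_distributions = {
    'logisticregression__C': expon(scale=2),
    'logisticregression__penalty': ['l1', 'l2'],
}

search = RandomizedSearchCV(
    demo_pipeline, param_distributions, n_iter=5, cv=3, random_state=0
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is the whole pipeline,
# retrained on all of X_tr with the best hyperparameters found.
test_accuracy = search.best_estimator_.score(X_te, y_te)
print(search.best_params_, test_accuracy)
```

This avoids copying the best values into a new pipeline by hand. It also keeps label-dependent preprocessing steps, such as `SelectKBest`, from seeing the validation folds during the search.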