### Chapter 12
## Model Selection

### 12.0 Introduction
In machine learning, we use training algorithms to learn the parameters of a model by minimizing some loss function. However, many learning algorithms (e.g., support vector classifiers and random forests) also have hyperparameters that must be defined outside of the learning process. For example, random forests are collections of decision trees (hence the word forest); however, the number of decision trees in the forest is not learned by the algorithm and must be set prior to fitting. Choosing these values is often referred to as hyperparameter tuning, hyperparameter optimization, or model selection. Additionally, we often want to try multiple learning algorithms (for example, trying both a support vector classifier and a random forest to see which learning method produces the best model).
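To make the distinction concrete, here is a minimal sketch of the random forest example. The value `n_estimators=10` and the use of the iris data are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

features, target = load_iris(return_X_y=True)

# n_estimators is a hyperparameter: we choose it before training
forest = RandomForestClassifier(n_estimators=10, random_state=0)

# fitting learns the trees themselves, not how many trees to use
forest.fit(features, target)

# the hyperparameter is unchanged by training
print(forest.get_params()["n_estimators"])

# the learned trees are stored in the fitted attribute estimators_
print(len(forest.estimators_))
```

Both printed values are the number of trees we asked for; training filled in the trees but never reconsidered how many there should be.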

While there is widespread variation in the use of terminology in this area, in this book we refer to both selecting the best learning algorithm and its best hyperparameters as model selection. The reason is straightforward: imagine we have data and want to train a support vector classifier with 10 candidate hyperparameter values and a random forest classifier with 10 candidate hyperparameter values. The result is that we are trying to select the best model from a set of 20 candidate models. In this chapter, we will cover techniques to efficiently select the best model from the set of candidates.
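The counting in this example can be sketched with scikit-learn's `ParameterGrid`. The hyperparameter names and candidate values below are hypothetical and were chosen only so that each algorithm contributes 10 candidates.

```python
from sklearn.model_selection import ParameterGrid

# hypothetical candidate values for a support vector classifier
svc_grid = {"C": [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300]}

# hypothetical candidate values for a random forest classifier
forest_grid = {"n_estimators": [10, 20, 50, 100, 200, 300, 400, 500, 750, 1000]}

# a list of grids is expanded grid by grid, so the candidates add up
candidates = ParameterGrid([svc_grid, forest_grid])

print(len(candidates))
```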

Throughout this chapter we will refer to specific hyperparameters, such as C (the inverse of regularization strength). Do not worry if you don't know what the hyperparameters are; we will cover them in later chapters. Instead, just treat hyperparameters like settings for the learning algorithm that we must choose before starting training.

### 12.1 Selecting Best Models Using Exhaustive Search
#### Problem
You want to select the best model by searching over a range of hyperparameters.

#### Solution
Use scikit-learn's GridSearchCV:

In [11]:
import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV

# load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# create logistic regression
logistic = linear_model.LogisticRegression()

# create range of candidate penalty hyperparameter values
penalty = ['l1', 'l2']

# create range of candidate regularization hyperparameter values
C = np.logspace(0, 4, 10)

# create dictionary of hyperparameter candidates
hyperparameters = dict(C=C, penalty=penalty)

# create grid search
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, verbose=1)

# fit grid search
best_model = gridsearch.fit(features, target)

# view best hyperparameters
print("Best penalty: {}".format(best_model.best_estimator_.get_params()['penalty']))
print("Best C: {}".format(best_model.best_estimator_.get_params()['C']))
print(best_model.predict(features))

Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best penalty: l1
Best C: 7.742636826811269
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.8s finished


#### Discussion
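`GridSearchCV` is a brute-force approach to model selection using cross-validation. We define sets of candidate values for one or more hyperparameters, and `GridSearchCV` trains and evaluates a model for every combination of those values. In our solution there are 10 candidate values of C and 2 candidate penalties, giving 10 × 2 = 20 candidate models; with five-fold cross-validation (`cv=5`), that produces the 100 fits reported in the output. The candidate with the best mean cross-validated score is selected. By default, `GridSearchCV` then retrains that model on the entire dataset, which is why we can call `predict` on the result directly.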
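As a rough sketch of what "brute force" means here, the loop below scores every candidate with cross-validation and keeps the best one. It is not how `GridSearchCV` is implemented internally. The reduced grid over C only and the raised `max_iter` are assumptions made to keep the example short and quiet.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid, cross_val_score

features, target = load_iris(return_X_y=True)

# assumption: a reduced grid over C only, to keep the sketch short
grid = ParameterGrid({"C": np.logspace(0, 4, 5)})

scores = {}
for params in grid:
    # max_iter is raised (an assumption) to give the solver room to converge
    model = LogisticRegression(max_iter=1000, **params)
    # mean accuracy over five folds for this candidate
    scores[params["C"]] = cross_val_score(model, features, target, cv=5).mean()

# the candidate with the highest mean cross-validated score wins
best_C = max(scores, key=scores.get)
print("Best C:", best_C, "score:", scores[best_C])
```

The cost grows with the product of the number of candidates and the number of folds, which is what motivates cheaper search strategies.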

#### See Also
* http://scikit-learn.org/stable/modules/grid_search.html
* http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV

### 12.2 Selecting Best Models Using Randomized Search
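#### Problem
You want a computationally cheaper method than exhaustive search to select the best model.

#### Solution
Use scikit-learn's RandomizedSearchCV: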

In [12]:
from scipy.stats import uniform
from sklearn import linear_model, datasets
from sklearn.model_selection import RandomizedSearchCV

# load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# create logistic regression
logistic = linear_model.LogisticRegression()

# create range of candidate penalty hyperparameter values
penalty = ['l1', 'l2']

# create distribution of candidate regularization hyperparameter values
C = uniform(loc=0, scale=4)

# create dictionary of hyperparameter candidates
hyperparameters = dict(C=C, penalty=penalty)

# create randomized search
randomizedsearch = RandomizedSearchCV(logistic, hyperparameters, random_state=1, n_iter=100, cv=5, verbose=1, n_jobs=-1)

# fit randomized search
best_model = randomizedsearch.fit(features, target)

# view best hyperparameters
print("Best penalty: {}".format(best_model.best_estimator_.get_params()['penalty']))
print("Best C: {}".format(best_model.best_estimator_.get_params()['C']))
print(best_model.predict(features))

Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best penalty: l1
Best C: 1.668088018810296
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:    1.7s finished
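To see what kind of candidates a randomized search draws, the sketch below uses scikit-learn's `ParameterSampler` on the same search space. Drawing only five candidates is an assumption made for readability; the solution above draws 100.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# same search space as the solution above
hyperparameters = dict(C=uniform(loc=0, scale=4), penalty=["l1", "l2"])

# draw five candidate combinations; random_state makes the draws repeatable
candidates = list(ParameterSampler(hyperparameters, n_iter=5, random_state=1))

for candidate in candidates:
    print(candidate)
```

Each candidate pairs a value of C drawn from the uniform distribution on [0, 4] with one of the listed penalties, so the number of models trained is set by `n_iter` rather than by the size of a grid.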


#### See Also
* RandomizedSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
* Random Search for Hyper-Parameter Optimization (http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf)

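### 12.3 Selecting Best Models from Multiple Learning Algorithms
#### Problem
You want to select the best model by searching over a range of learning algorithms and their respective hyperparameters.

#### Solution
Create a dictionary of candidate learning algorithms and their hyperparameters: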

In [9]:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# set random seed
np.random.seed(0)

# load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# create a pipeline
pipe = Pipeline([("classifier", RandomForestClassifier())])

# create dictionary with candidate learning algorithms and their hyperparameters
search_space = [{
                    'classifier': [LogisticRegression()],
                    'classifier__penalty': ['l1', 'l2'],
                    'classifier__C': np.logspace(0, 4, 10)},
                {
                    'classifier': [RandomForestClassifier()],
                    'classifier__n_estimators': [10, 100, 1000],
                    'classifier__max_features': [1, 2, 3]
                }]

# create grid search
gridsearch = GridSearchCV(pipe, search_space, cv=5, verbose=1)

# fit grid search
best_model = gridsearch.fit(features, target)

# view best model
print(best_model.best_estimator_.get_params()['classifier'])

Fitting 5 folds for each of 29 candidates, totalling 145 fits
LogisticRegression(C=7.742636826811269, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)


[Parallel(n_jobs=1)]: Done 145 out of 145 | elapsed:   21.3s finished


In [10]:
best_model.predict(features)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

### 12.4 Selecting Best Models When Preprocessing
#### Problem
You want to include a preprocessing step during model selection.

#### Solution
Create a pipeline that includes the preprocessing step and any of its parameters:

In [14]:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# set random seed
np.random.seed(0)

# load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# create a preprocessing object that includes StandardScaler features and PCA
preprocess = FeatureUnion([("std", StandardScaler()), ("pca", PCA())])

# create a pipeline
pipe = Pipeline([("preprocess", preprocess),
                 ("classifier", LogisticRegression())
                ])

# create a space of candidate values
search_space = [{
    "preprocess__pca__n_components": [1, 2, 3],
    "classifier__penalty": ["l1", "l2"],
    "classifier__C": np.logspace(0, 4, 10)
}]

# create grid search
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0, n_jobs=-1)

# fit grid search
best_model = clf.fit(features, target)

# view the best number of principal components
best_model.best_estimator_.get_params()['preprocess__pca__n_components']

1

### 12.7 Evaluating Performance After Model Selection
#### Problem
You want to evaluate the performance of a model found through model selection.

#### Solution
Use nested cross-validation to avoid biased evaluation:

In [15]:
import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV, cross_val_score

# load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# create logistic regression
logistic = linear_model.LogisticRegression()

# create range of 20 candidate values for C
C = np.logspace(0, 4, 20)

# create hyperparameter options
hyperparameters = dict(C=C)

# create grid search
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, n_jobs=-1, verbose=0)

# conduct nested cross-validation (three outer folds) and output the average score
cross_val_score(gridsearch, features, target, cv=3).mean()

0.9534313725490197

#### Discussion
Nested cross-validation during model selection is a difficult concept for many people to grasp the first time. Remember that in k-fold cross-validation, we train our model on k-1 folds of the data, use this model to make predictions on the remaining fold, and then evaluate our model based on how well its predictions compare to the true values. We then repeat this process k times.
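The k-fold procedure just described can be written out by hand. The sketch below assumes k = 5, shuffled folds, and a logistic regression with a raised `max_iter`; none of these choices comes from the solution above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

features, target = load_iris(return_X_y=True)

# shuffle because the iris rows are ordered by class
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_index, test_index in kfold.split(features):
    # train on k-1 folds
    model = LogisticRegression(max_iter=1000)
    model.fit(features[train_index], target[train_index])
    # evaluate on the held-out fold (mean accuracy)
    fold_scores.append(model.score(features[test_index], target[test_index]))

# one score per fold, averaged into a single estimate
print(np.mean(fold_scores))
```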

In the model selection searches described in this chapter (i.e. `GridSearchCV` and `RandomizedSearchCV`), we used cross-validation to evaluate which hyperparameter values produced the best models. However, a nuanced and generally underappreciated problem arises: since we used the data to select the best hyperparameter values, we cannot use that same data to evaluate the model's performance. The solution? Wrap the cross-validation used for model search in another cross-validation! In nested cross-validation, the "inner" cross-validation selects the best model, while the "outer" cross-validation provides us with an unbiased evaluation of the model's performance. In our solution, the inner cross-validation is our `GridSearchCV` object, which we then wrap in an outer cross-validation using `cross_val_score`.
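One way to picture the nesting is to write the outer loop by hand, which is roughly what `cross_val_score` does when it is given a `GridSearchCV` object. The sketch below assumes a reduced grid of five C values, three stratified outer folds, and a raised `max_iter`, all chosen to keep it quick to run.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

features, target = load_iris(return_X_y=True)

# assumption: a reduced grid of candidate values for C
hyperparameters = dict(C=np.logspace(0, 4, 5))

# outer cross-validation: three folds used only for evaluation
outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

outer_scores = []
for train_index, test_index in outer.split(features, target):
    # inner cross-validation: choose C using only the outer training folds
    inner = GridSearchCV(LogisticRegression(max_iter=1000), hyperparameters, cv=5)
    inner.fit(features[train_index], target[train_index])
    # score the selected model on data the search never saw
    outer_scores.append(inner.score(features[test_index], target[test_index]))

print(np.mean(outer_scores))
```

The held-out outer fold plays no part in choosing C, so its score is not flattered by the selection step.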

If you are confused, try a simple experiment. First set `verbose=1` so we can see what is happening:

In [16]:
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, verbose=1)

Next, run `gridsearch.fit(features, target)`, which is our inner cross-validation used to find the best_model:

In [17]:
best_model = gridsearch.fit(features, target)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.2s finished


From the output you can see the inner cross-validation trained 20 candidate models five times, totaling 100 models. Next, nest `gridsearch` inside a new cross-validation, which defaulted to three folds in the scikit-learn version used here (newer releases default to five, so we set `cv=3` explicitly):

In [18]:
scores = cross_val_score(gridsearch, features, target, cv=3)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.2s finished


Fitting 5 folds for each of 20 candidates, totalling 100 fits
Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.2s finished


The output shows that the inner cross-validation trained 20 models five times to find the best model, and this model was evaluated using an outer three-fold cross-validation, for a total of 300 models trained (20 × 5 × 3).