## Hyperparameter Tuning

In this lesson we will learn how to optimize our model's hyperparameters and how to choose our models.

### New tools in this unit
- [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
- [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

In [1]:
import pandas as pd
from IPython.display import Image

For this unit we will use the [Wisconsin Breast Cancer Dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)). It contains measurements taken from microscopic images of breast tissue cells. The goal is to predict whether a given sample is malignant (cancer) or benign.

The images look like this one:

![](https://i.imgur.com/ElLUPsZ.jpg)

In [2]:
from sklearn.datasets import load_breast_cancer
cancer_data = load_breast_cancer()

cancer = pd.DataFrame(cancer_data["data"],
                           columns=cancer_data["feature_names"]
                          )

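# load_breast_cancer encodes malignant as 0 and benign as 1;
# flip the labels so that 1 means malignant.
cancer["malign"] = cancer_data.target
cancer["malign"] = cancer["malign"].replace({0: 1, 1: 0})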

In [3]:
cancer.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,malign
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,1
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,1
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,1
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,1
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,1


In [4]:
target_variable = "malign"
independent_variables = cancer.drop(target_variable, axis=1).columns

In [5]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

We hold out a portion of the dataset to validate the final model.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
                                        cancer[independent_variables],
                                        cancer[target_variable], 
                                        test_size=0.2,
                                        random_state=42
                                        )

In order to do a search, we need to define a hyperparameter space, that is, all the hyperparameters we want to test and their possible values. Be aware that each hyperparameter is of a different type, so checking the model's documentation is a good idea.

In [7]:
grid_search_parameter_space = {'max_depth': range(1, 10),
                   'max_features': range(1, len(independent_variables))
                  }
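
Before running the search it can help to count how many candidates the grid contains, since every candidate is fitted once per CV fold. A minimal sketch, assuming the dataset's 30 feature columns (the names `n_features` and `example_space` are illustrative and not part of the lesson's code):

```python
from sklearn.model_selection import ParameterGrid

n_features = 30  # assumption: number of feature columns in the breast cancer dataset

example_space = {
    "max_depth": range(1, 10),             # 9 values: 1..9
    "max_features": range(1, n_features),  # 29 values: 1..29 (upper bound is exclusive)
}

# ParameterGrid enumerates the Cartesian product of all listed values
n_candidates = len(ParameterGrid(example_space))
print(n_candidates)  # 9 * 29 combinations
```

With 5-fold cross validation, each of these candidates is fitted 5 times, which is why grid search gets expensive quickly as the space grows.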

We also need to define the model that we are going to use. In this case we will use a simple `DecisionTreeClassifier`.

In [8]:
estimator = DecisionTreeClassifier()

In [9]:
GridSearchCV?

Now we can define the grid search with cross validation. We need to specify the metric that guides the process; in this case we choose the AUC score (`scoring="roc_auc"`). We can also specify how many CV partitions (`cv`) to use when evaluating each hyperparameter combination.

In [10]:
grid_search = GridSearchCV(
                estimator,
                grid_search_parameter_space,
                cv=5,
                scoring="roc_auc"
)

In [11]:
%%timeit -n 1 -r 1

grid_search.fit(X_train, y_train)

5.06 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


We see it takes about 5 seconds to run the grid search.
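
Conceptually, a grid search is a loop over every hyperparameter combination, with each combination scored by cross validation. The following is a simplified sketch of that idea on a much smaller grid, not scikit-learn's actual implementation; `small_grid` and the fixed `random_state=0` are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A deliberately tiny grid: 3 * 2 = 6 candidates
small_grid = ParameterGrid({"max_depth": [1, 2, 3], "max_features": [5, 10]})

results = []
for params in small_grid:
    model = DecisionTreeClassifier(random_state=0, **params)
    # Mean AUC over 5 cross-validation folds for this candidate
    fold_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    results.append((fold_scores.mean(), params))

# Keep the candidate with the highest mean cross-validated score
best_score, best_params = max(results, key=lambda item: item[0])
print(best_params, best_score)
```

`GridSearchCV` adds conveniences on top of this loop, such as refitting the best candidate on the full training data and recording per-fold results.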

We can access the best estimator found by the search with the `best_estimator_` attribute.

In [12]:
grid_search.best_estimator_

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=18, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

We can use the fitted grid search to predict; it delegates to the best estimator.

In [13]:
grid_search.predict(X_test)[:10]

array([0, 1, 1, 0, 0, 1, 1, 1, 0, 0])

We can also see the parameters of the best-performing model.

In [14]:
grid_search.best_params_

{'max_depth': 2, 'max_features': 18}

And the best model's score (its mean cross-validated AUC):

In [15]:
grid_search.best_score_

0.9517267455508088

If we want to dig deeper into the search results, we can access the results obtained for each hyperparameter combination with `cv_results_`.

In [36]:
pd.DataFrame(grid_search.cv_results_).sort_values(by="rank_test_score").head()

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_max_depth,param_max_features,params,rank_test_score,split0_test_score,split0_train_score,...,split2_test_score,split2_train_score,split3_test_score,split3_train_score,split4_test_score,split4_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
46,0.001572,0.000602,0.951727,0.965187,2,18,"{'max_features': 18, 'max_depth': 2}",1,0.940162,0.958025,...,0.964912,0.95114,0.953818,0.976629,0.957204,0.95874,0.000213,0.000104,0.00921,0.011696
59,0.001074,0.000655,0.947362,0.964922,3,2,"{'max_features': 2, 'max_depth': 3}",2,0.940669,0.972742,...,0.960526,0.945706,0.968524,0.97029,0.963052,0.951708,0.000212,0.000159,0.023519,0.014172
93,0.001311,0.000571,0.944734,0.995027,4,7,"{'max_features': 7, 'max_depth': 4}",3,0.934584,0.992284,...,0.981682,0.990927,0.965428,0.998496,0.905901,0.994012,0.000102,7.1e-05,0.026342,0.003367
36,0.001034,0.000513,0.944174,0.966409,2,8,"{'max_features': 8, 'max_depth': 2}",4,0.951826,0.962524,...,0.967492,0.963917,0.93808,0.968122,0.900053,0.968132,0.000111,2.6e-05,0.024141,0.002678
165,0.002929,0.00063,0.943736,0.999001,6,21,"{'max_features': 21, 'max_depth': 6}",5,0.944219,0.999951,...,0.973684,0.999838,0.941434,1.0,0.960659,0.995296,0.000282,0.000128,0.025295,0.001853
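
For a two-parameter grid, one way to read `cv_results_` is to pivot it into a table with one hyperparameter per axis. A minimal sketch on a reduced grid (`demo_search` and `score_table` are illustrative names, and the grid values are chosen only to keep the example fast):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2, 3], "max_features": [5, 10]},
    cv=5,
    scoring="roc_auc",
)
demo_search.fit(X, y)

results = pd.DataFrame(demo_search.cv_results_)

# Rows: max_depth, columns: max_features, cells: mean cross-validated AUC
score_table = results.pivot(
    index="param_max_depth",
    columns="param_max_features",
    values="mean_test_score",
)
print(score_table)
```

A table like this makes it easier to see whether the score changes smoothly across the grid or whether the top-ranked row is an isolated peak.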


### Randomized Search

Unlike grid search, randomized search works by randomly sampling combinations of hyperparameters. This method tends to perform better when the hyperparameter space is huge (and thus impractical to "brute force" via a grid search).

In [17]:
from sklearn.model_selection import RandomizedSearchCV

In [18]:
RandomizedSearchCV?

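To run a randomized search in scikit-learn, it is recommended to describe numeric hyperparameters with statistical distributions (objects with an `rvs` method) rather than fixed lists or ranges. Lists are still fine for categorical options such as `class_weight`.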

In [19]:
from scipy.stats import randint

random_search_parameter_space_dist = {
                   "max_depth": randint(1, 100),
                   "max_features": randint(1, len(independent_variables)),
                   "class_weight": ["balanced", None]
                  }
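
To see what the search will draw from such a space, `ParameterSampler` can be used to generate a few candidates without fitting anything. A minimal sketch (`demo_space` is an illustrative, smaller version of the space above):

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

demo_space = {
    "max_depth": randint(1, 100),        # integers from 1 to 99 (upper bound is exclusive)
    "class_weight": ["balanced", None],  # list entries are picked uniformly at random
}

# Draw 5 candidate hyperparameter combinations
samples = list(ParameterSampler(demo_space, n_iter=5, random_state=42))
for params in samples:
    print(params)
```

Each dictionary printed here corresponds to one candidate that a randomized search would evaluate.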

We set up the randomized search. We fix `random_state=42` to ensure reproducibility (that is, the search running on *your* computer should sample the same hyperparameter combinations as the one running on *my* computer).

In [20]:
randomized_search = RandomizedSearchCV(
                        estimator, 
                        random_search_parameter_space_dist,
                        cv=5, n_iter=250,
                        random_state=42
)

And we run it by fitting it to the data (same as with `GridSearchCV`).

In [21]:
%%timeit -n 1 -r 1

randomized_search.fit(X_train, y_train)

4.27 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


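`RandomizedSearchCV` exposes the same attributes as `GridSearchCV`. Because no `scoring` argument was passed to it, the score below is mean cross-validated accuracy rather than AUC.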

In [22]:
randomized_search.best_estimator_

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=53,
            max_features=24, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [23]:
randomized_search.best_score_

0.9406593406593406
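
To compare `best_score_` values across both searches, they need to optimize the same metric. A minimal sketch that passes `scoring="roc_auc"` to a randomized search, with fewer iterations to keep it quick (`auc_search`, `n_iter=20`, and `random_state=0` on the tree are illustrative choices, not the lesson's settings):

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

auc_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    {
        "max_depth": randint(1, 100),
        "max_features": randint(1, X.shape[1]),
    },
    n_iter=20,
    cv=5,
    scoring="roc_auc",  # same metric as the grid search
    random_state=42,
)
auc_search.fit(X, y)

print(auc_search.best_params_)
print(auc_search.best_score_)  # mean cross-validated AUC of the best candidate
```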

### Evaluating the models

Now we have 2 possible models (the best one found with the grid search and the best one found by the Randomized Search). Which one should we choose?

**Measuring predictive performance**

We can evaluate the predictive performance of the models using the test dataset we held out at the beginning.

In [24]:
from sklearn.model_selection import cross_val_score

In [25]:
grid_search_score = cross_val_score(grid_search.best_estimator_, 
                                    X_test, y_test, scoring="roc_auc", cv=3).mean()

In [26]:
grid_search_score

0.9306418219461697

In [27]:
randomized_search_score = cross_val_score(randomized_search.best_estimator_, 
                                    X_test, y_test, scoring="roc_auc", cv=3).mean()

In [28]:
randomized_search_score

0.9125862663906142

So in terms of predictive power, the grid search model performs better.
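
Note that `cross_val_score` clones the estimator and refits it on folds of the data it receives, so the scores above come from trees retrained on parts of the test set, not from the models fitted during the search. An alternative is to score an already fitted model directly on the held-out data. A minimal sketch with a stand-in tree (`model` is illustrative; in this notebook one would use `grid_search.best_estimator_` or `randomized_search.best_estimator_` instead):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stand-in for a model that was already fitted on the training data
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# Predicted probability of the class labelled 1 for each held-out sample
test_probabilities = model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_probabilities)
print(test_auc)
```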

**Processing time**

How about training time? If one model takes considerably longer to train than another, that is something to bear in mind when choosing between them. We can use the Jupyter cell magic `%%timeit` to measure the training time of the two final estimators.

In [29]:
%%timeit -n 3 -r 100

grid_search.best_estimator_.fit(cancer[independent_variables], cancer[target_variable])

2.26 ms ± 189 µs per loop (mean ± std. dev. of 100 runs, 3 loops each)


In [30]:
%%timeit -n 3 -r 100

randomized_search.best_estimator_.fit(cancer[independent_variables], cancer[target_variable])

5.59 ms ± 303 µs per loop (mean ± std. dev. of 100 runs, 3 loops each)


So we see that the grid search model is faster to train.
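
`%%timeit` only works inside IPython or Jupyter. In a plain Python script, the standard-library `timeit` module gives a comparable measurement. A minimal sketch with two stand-in trees (`shallow_tree`, `deep_tree`, and the helper `mean_fit_seconds` are hypothetical names introduced for illustration):

```python
import timeit

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
deep_tree = DecisionTreeClassifier(max_depth=50, random_state=0)


def mean_fit_seconds(model, repeats=20):
    """Average wall-clock seconds for one call to model.fit(X, y)."""
    total = timeit.timeit(lambda: model.fit(X, y), number=repeats)
    return total / repeats


shallow_seconds = mean_fit_seconds(shallow_tree)
deep_seconds = mean_fit_seconds(deep_tree)
print(f"shallow: {shallow_seconds * 1e3:.2f} ms, deep: {deep_seconds * 1e3:.2f} ms")
```

Timings at the millisecond scale vary between machines and runs, so small differences should be read with caution.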

**Measuring complexity**

How about complexity? Measuring complexity is not as simple as running a function, since each algorithm has a different set of parameters that make it more or less complex.

In the case of decision trees, we can actually use a visual inspection of the trees to decide which one is more complex.

We can export the trees to DOT, the text format used by Graphviz (a tool for drawing graphs). We can then use an online service like [GraphvizOnline](https://dreampuf.github.io/GraphvizOnline/) or [webgraphviz](http://webgraphviz.com/) to visualize the exported trees.

In [31]:
from sklearn.tree import export_graphviz

In [32]:
export_graphviz(grid_search.best_estimator_, "grid_search_winner.dot", 
                feature_names=cancer_data.feature_names)

In [33]:
export_graphviz(randomized_search.best_estimator_, "random_search_winner.dot", 
                feature_names=cancer_data.feature_names)
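
As a complement to visual inspection, a fitted tree also exposes a few numbers that summarize its size. A minimal sketch with two stand-in trees, assuming a scikit-learn version recent enough to provide `get_depth()` and `get_n_leaves()` (`shallow_tree` and `deep_tree` are illustrative; in this notebook one would inspect the two `best_estimator_` objects):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
deep_tree = DecisionTreeClassifier(max_depth=50, random_state=0).fit(X, y)

for name, tree in [("shallow", shallow_tree), ("deep", deep_tree)]:
    # Depth actually reached, number of leaves, and total number of nodes
    print(name, tree.get_depth(), tree.get_n_leaves(), tree.tree_.node_count)
```

A tree with more nodes and greater depth encodes more decision rules, which is a reasonable proxy for complexity when comparing two trees on the same data.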

So in this case we see that the grid search model scores better, is less complex, and is faster to train, so we have a clear winner!