# SLU15 - Hyperparameter Tuning : Learning notebook
In the past two SLUs, you learned how to choose the best model and the best features. Here we'll be looking at how to optimize the parameters that influence the learning process of the model, the hyperparameters.

In [1]:
from IPython.display import Image
import pandas as pd
from sklearn import tree
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from scipy.stats import uniform
from sklearn.model_selection import cross_val_score, cross_validate

## 1. Hyperparameters

What are hyperparameters? Until now we have mostly trained estimators (models) like this:

In [2]:
classifier = tree.DecisionTreeClassifier() 

What this means is we are creating a Decision Tree for a classification problem **using its default settings**. However, every single kind of model we use can be *tweaked* and modified to better adapt to our specific problem. For example, we can specify that we want a decision tree with a maximum depth of 5.

In [3]:
classifier = tree.DecisionTreeClassifier(max_depth=5) 

The parameters we can specify when creating a model are called **hyperparameters**. Part of a Data Scientist's job is to figure out the right set of hyperparameters that make our model learn better from our data. You can see the complete list of available hyperparameters with the `get_params()` method. You can find the description of all hyperparameters in the API reference of the model.

In [4]:
classifier.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 5,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': None,
 'splitter': 'best'}
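
As a small sketch of how these settings can be read and changed after the estimator has been created, the snippet below pairs `get_params()` with its counterpart `set_params()`. The value 3 is an arbitrary choice for illustration.

```python
from sklearn import tree

clf = tree.DecisionTreeClassifier(max_depth=5)

# Read a single hyperparameter from the dictionary returned by get_params()
depth_before = clf.get_params()["max_depth"]

# set_params() updates hyperparameters in place and returns the estimator itself
clf.set_params(max_depth=3)
depth_after = clf.get_params()["max_depth"]

print(depth_before, depth_after)
```

Changing a hyperparameter this way does not retrain anything; the estimator has to be fitted again for the new setting to take effect.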

**You can ask: why can't we include the hyperparameters as a subset of the parameters of the classification problem to be learned?**

In principle, the idea to optimize the hyperparameters using machine learning is good and you can see one such example linked in the further reading section.

However, optimizing parameters and hyperparameters at the same time means mixing the rules of the learning process (the settings) with the learning process itself. The model's optimization process prioritizes minimizing the error of the prediction on the training set which might be in conflict with finding appropriate hyperparameters. 

For instance, you can end up with a huge decision tree that is for sure not the best model in the real world (`max_depth` will explode to minimize training errors and you'll have a huge overfit solution).

The hyperparameters should be tuned after fitting the model to the training data by comparing the performance of several similar models, each of them with different settings, in a process called the validation process.

**Don't forget:**
- Parameters are internal variables whose values are learnt from the data, whereas hyperparameters are external configuration settings that control the learning process itself.
- The hyperparameters are adjusted based on the model's behavior on the validation set or through cross-validation.
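
The following is a minimal sketch of why training error alone is a poor guide for choosing `max_depth`. It assumes a simple train/validation split of the breast cancer data (the split sizes and `random_state` values are illustrative choices, not part of this unit) and compares training and validation accuracy for a few depths.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for depth in (1, 3, None):  # None lets the tree grow until it stops on its own
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    # Store (training accuracy, validation accuracy) for this depth
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_val, y_val))
    print(depth, scores[depth])
```

The unrestricted tree should fit the training data at least as well as the shallow one, while only the validation score tells us whether the extra depth actually helps on unseen data.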

## 2. The data

In this unit, we will use the [Wisconsin Breast Cancer Dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset) from the sklearn datasets. The features in the dataset were extracted from microscopic images of breast tissue cell nuclei from people with and without breast cancer. The data was used to develop a classification model to help the diagnosis of breast cancer.

In [5]:
cancer_data = load_breast_cancer()

cancer = pd.DataFrame(cancer_data["data"],
                      columns=cancer_data["feature_names"])

cancer["malign"] = cancer_data.target

In [6]:
cancer.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,malign
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


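The target is the `malign` column and the 30 features are all numeric. Note that scikit-learn encodes this target as 0 = malignant and 1 = benign, so despite the column name, a value of 1 means the tumor is benign.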

In [7]:
target_variable = "malign"
independent_variables = cancer.drop(target_variable, axis=1).columns

We do the usual train-test split.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
                                        cancer[independent_variables],
                                        cancer[target_variable], 
                                        test_size=0.2,
                                        random_state=42
                                        )
print(X_train.shape, X_test.shape)

(455, 30) (114, 30)


We should actually do a three-way split into train, validation, and test sets. We should first choose the best model by comparing several models on the train set, then tune the hyperparameters of the best model on the validation set, and finally test the predictions on the test set.
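
A three-way split can be sketched by calling `train_test_split` twice. The proportions below (20% test, then 25% of the remainder for validation) are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# First hold out the test set (20% of all rows)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then split what is left into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

In practice, cross-validation on the training portion often plays the role of the validation set, which is what the search classes used later in this unit do.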

In this unit, we suppose that we already selected a model on another part of the dataset (the DecisionTreeClassifier) and now we just want to tune the hyperparameters.

## 3. Hyperparameters search

So we have said that finding the right set of hyperparameters is part of the job of building a good estimator. However, the models often have tons of different hyperparameters. 

Let's check the reference for our classifier. We can do so directly in the Jupyter notebook by appending `?` to the name.

In [9]:
tree.DecisionTreeClassifier?

Init signature:
tree.DecisionTreeClassifier(
    *,
    criterion='gini',
    splitter='best',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features=None,
    random_state=None,
    max_leaf_nodes=None,
    min_impurity_decrease=

How can we tune so many hyperparameters without going crazy? Fortunately, we can search the hyperparameter space automatically! Scikit-learn provides 2 different kinds of hyperparameter search strategies.

In [10]:
classifier.fit(X_train,y_train)

### 3.1 Grid Search

In a grid search, we define the search interval for every hyperparameter and then divide it into regularly spaced subintervals, forming a grid as in the image below. Every point in the grid is a combination of specific values of the hyperparameters. We then train the model with every hyperparameter combination from the grid and evaluate its performance on held-out validation data (with `GridSearchCV`, the validation folds of a cross-validation), not on the test set.

Here is an example grid for 2 hyperparameters. The search interval for both of them is \[0,1\] with a step of 0.1. For more hyperparameters, the grid grows into higher dimensions.

<img src="media/grid_search.png" width=300>

The grid is defined with a dictionary where the keys are parameter names and the values are the search spaces. The search space can be an interval or a list of values. Both hyperparameters we tune here take integer values, so we define the search spaces as ranges.

In [11]:
grid_search_parameter_space = {'max_depth': range(1, 10),
                   'max_features': range(1, len(independent_variables))
                  }
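
To get a feel for the size of this grid, the sketch below enumerates every combination with `itertools.product`. The feature count of 30 comes from the dataset above; the fold count of 5 is an assumption for the cost estimate.

```python
from itertools import product

n_features = 30  # number of feature columns in the breast cancer data
grid = {
    "max_depth": range(1, 10),             # 1..9, the end of a range is exclusive
    "max_features": range(1, n_features),  # 1..29
}

# Every grid point is one combination of hyperparameter values
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

cv_folds = 5  # assumed number of cross-validation folds
print(len(combinations))             # number of grid points
print(len(combinations) * cv_folds)  # model fits needed to evaluate them all
```

The number of grid points is the product of the sizes of the individual search spaces, so it grows quickly with every hyperparameter added.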

The grid search with cross validation is executed with [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

In [12]:
from sklearn.model_selection import GridSearchCV

Let's define the grid search. We need to specify the metric against which we'll measure the model performance. Here we choose the AUC score. We can also specify how many cross validation partitions we want to use to evaluate each hyperparameter combination.

In [13]:
grid_search = GridSearchCV(
                classifier,
                grid_search_parameter_space,
                cv=5,
                scoring="roc_auc",
                return_train_score=True
                )

**NOTE: The %%timeit magic**

In the real world, when doing any kind of data intensive task, such as running a hyperparameter search or training a model, processing time matters. That is the time it takes for the computer to perform the task.

In the Jupyter notebook, we can use the cell magic `%%timeit` to check how long a cell takes to run. `%%timeit` takes two main arguments, **n** (the number of loops per repetition) and **r** (the number of repetitions, default is 7). If you don't specify `n`, Jupyter will figure out a reasonable number to get a fair estimate of how long the cell takes to run. Think of it as cross validation for computing time!

In [14]:
%%timeit -n 1 -r 5

grid_search.fit(X_train, y_train)

16.7 s ± 413 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


We see it takes about 17 s (it will vary depending on your machine) to run the grid search.
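
Outside a notebook, the standard-library `timeit` module offers roughly the same measurement. The sketch below is an assumption-laden stand-in: `slow_task` is a hypothetical function used in place of the actual search.

```python
import timeit


def slow_task():
    # Hypothetical stand-in for an expensive step such as fitting a search object
    return sum(i * i for i in range(100_000))


# number=1 and repeat=5 mirror "%%timeit -n 1 -r 5"
timings = timeit.repeat(slow_task, number=1, repeat=5)

mean = sum(timings) / len(timings)
print(f"{mean:.4f} s per loop (mean of {len(timings)} runs, 1 loop each)")
```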

We can access the best estimator found by the grid search with the `best_estimator_` attribute.

In [15]:
grid_search.best_estimator_

You can grab all the model parameters using `get_params`.

In [16]:
grid_search.best_estimator_.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 4,
 'max_features': 2,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': None,
 'splitter': 'best'}

We can see the hyperparameters optimized by the grid search.

In [17]:
grid_search.best_params_

{'max_depth': 4, 'max_features': 2}

We can use the fitted grid search to get a prediction using the estimator with the best hyperparameters.

In [18]:
grid_search.predict(X_test)[:10]

array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1])

And the best model's score, the value of the metric specified in the grid search (the AUC score).

In [19]:
grid_search.best_score_

np.float64(0.953960713141881)

If we want to dig deeper into the search result, we can access the results obtained on each hyperparameter search iteration with `cv_results_`. It also shows us the time necessary to fit each model.

In [20]:
pd.DataFrame(grid_search.cv_results_).sort_values(by="rank_test_score").head(5)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_max_features,params,split0_test_score,split1_test_score,split2_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
88,0.002747,0.000252,0.003305,0.000352,4,2,"{'max_depth': 4, 'max_features': 2}",0.974138,0.96517,0.982714,...,0.953961,0.025176,1,0.984311,0.987336,0.998075,0.980285,0.982258,0.986453,0.006263
100,0.005207,0.001214,0.003327,0.000347,4,14,"{'max_depth': 4, 'max_features': 14}",0.963166,0.961042,0.981682,...,0.952953,0.021747,2,0.986649,0.998092,0.982387,0.988177,0.995148,0.990091,0.005735
35,0.003945,0.000711,0.004661,0.001413,2,7,"{'max_depth': 2, 'max_features': 7}",0.950888,0.921311,0.989164,...,0.951272,0.025146,3,0.962413,0.964095,0.953291,0.971066,0.962526,0.962678,0.005667
70,0.004432,0.000221,0.003298,0.000339,3,13,"{'max_depth': 3, 'max_features': 13}",0.989551,0.903509,0.984778,...,0.951109,0.033496,4,0.974942,0.988889,0.985913,0.987126,0.986948,0.984764,0.005003
64,0.004203,0.001534,0.003682,0.000871,3,7,"{'max_depth': 3, 'max_features': 7}",0.939916,0.930083,0.963622,...,0.950161,0.013764,5,0.973587,0.983746,0.978894,0.989455,0.97519,0.980174,0.00581


The scores for the best models are very similar. Here it's time to use your common sense and choose the most reasonable model.
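
One possible way to make "most reasonable" concrete is to pick the simplest model whose score is within a small tolerance of the best one. The sketch below applies that idea to the five rows shown above; the tolerance of 0.005 and the notion of "simplest" (shallowest tree, then fewest features) are assumptions for illustration, not a rule from scikit-learn.

```python
# Top five rows of cv_results_ shown above
candidates = [
    {"max_depth": 4, "max_features": 2, "mean_test_score": 0.953961},
    {"max_depth": 4, "max_features": 14, "mean_test_score": 0.952953},
    {"max_depth": 2, "max_features": 7, "mean_test_score": 0.951272},
    {"max_depth": 3, "max_features": 13, "mean_test_score": 0.951109},
    {"max_depth": 3, "max_features": 7, "mean_test_score": 0.950161},
]

tolerance = 0.005  # hypothetical threshold for "practically as good as the best"

best = max(c["mean_test_score"] for c in candidates)
close_enough = [c for c in candidates if best - c["mean_test_score"] <= tolerance]

# Prefer the shallowest tree, then the fewest features
chosen = min(close_enough, key=lambda c: (c["max_depth"], c["max_features"]))
print(chosen)
```

With this tolerance all five candidates qualify, so the rule selects the depth-2 tree even though it ranks third by score.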

### 3.2 Randomized Search

Unlike the Grid Search, Randomized Search works by randomly sampling combinations of hyperparameters instead of evaluating every point of a regular grid. This method tends to perform better than the Grid Search in large hyperparameter spaces, where it is impractical to "brute force" the optimal solution via a Grid Search.

If we had 2 hyperparameters, a Randomized Search could look like this:

<img src="media/random_search.png" width=300>

Why can a Random Search perform better than a Grid Search? In ideal conditions, if time/money were no issue, a sufficiently fine Grid Search would find the best option on its grid (because it tries all of them). However, because of time constraints, a Random Search can explore more diverse combinations of hyperparameters (and find those hyperparameters that matter the most) than a Grid Search in a limited amount of time.

<img src="media/grid_vs_random_search.png">
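
The argument above can be illustrated with a toy example using only the standard library. With the same budget of 9 evaluations, a 3 x 3 grid tries only 3 distinct values of each hyperparameter, while random sampling tries 9. The budget and the unit interval are assumptions for illustration.

```python
import random
from itertools import product

budget = 9  # total number of hyperparameter combinations we can afford

# Grid search: a 3 x 3 grid over two hyperparameters in [0, 1]
grid_axis = [0.0, 0.5, 1.0]
grid_points = list(product(grid_axis, grid_axis))

# Random search: the same budget, but each point is drawn independently
rng = random.Random(42)
random_points = [(rng.random(), rng.random()) for _ in range(budget)]

for name, points in [("grid", grid_points), ("random", random_points)]:
    distinct_first = len({p[0] for p in points})
    print(f"{name}: {len(points)} points, "
          f"{distinct_first} distinct values of the first hyperparameter")
```

If only one of the two hyperparameters really matters, the random search has effectively tested three times as many values of it.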

In [21]:
# Import the Random Search class from sklearn
from sklearn.model_selection import RandomizedSearchCV

To run a randomized search in scikit-learn, we describe each numeric search space with a distribution instead of a fixed range, so that values are picked at random from the given interval. We can do so with `randint`, which gives us a discrete distribution (just integers) in which each value has the same probability of being picked. A plain list of values, as used for `class_weight` below, is also accepted and is sampled uniformly.

In [22]:
from scipy.stats import randint

random_search_parameter_space_dist = {
                   "max_depth": randint(1, 100),
                   "max_features": randint(1, len(independent_variables)),
                   "class_weight": ["balanced", None]
                  }
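
To see what such a distribution produces, the sketch below draws a reproducible sample from `randint(1, 100)`. The sample size and seed are arbitrary choices for illustration.

```python
from scipy.stats import randint

dist = randint(1, 100)  # integers from 1 up to, but not including, 100

# Draw a reproducible sample to see what the search would pick from
samples = dist.rvs(size=1000, random_state=42)

print(samples[:10])
print(samples.min(), samples.max())
```

Note that the upper bound of `randint` is exclusive, just like the end of a `range`.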

Unlike in the grid search, where we supplied the values to be tested, here we input the distribution and the choice of the values from the distribution is done by the random search itself. Note that for illustration purposes, we are using a much larger interval for `max_depth` than in the grid search (1-99 vs. 1-9, since the upper bounds are exclusive), but we will use about the same number of search points (250 vs. 261).

We set up the random search. We fix the random state with `random_state=42` to ensure reproducibility (that is, the random search running on *your* computer should return the same results as the one running on *my* computer). The random search will evaluate the number of hyperparameter combinations given by the `n_iter` parameter.

In [23]:
random_search = RandomizedSearchCV(
                        classifier, 
                        random_search_parameter_space_dist,
                        cv=5, n_iter=250,
                        random_state=42,
                        return_train_score=True
                        )

And we run it by fitting it to the data (same as with the GridSearchCV).

In [24]:
%%timeit -n 1 -r 5 

random_search.fit(X_train, y_train)

14.5 s ± 455 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


The randomized search runs in a similar amount of time as the grid search which makes sense given that it explores about the same number of hyperparameter combinations.

The RandomizedSearch has the same attributes as the GridSearch. Look at the best estimator and the best score:

In [25]:
random_search.best_estimator_

In [26]:
random_search.best_score_

np.float64(0.9494505494505494)

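The score is numerically close to the result from the grid search, but the two are not directly comparable: the random search above was not given a `scoring` argument, so it reports the classifier's default metric (accuracy) rather than AUC. The depth of the best tree also seems quite extreme. When you look at the 10 best models, the depth parameter varies considerably, but the test score is basically the same, and the train score of 1.0 in every row suggests these deep trees overfit the training folds.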

In [27]:
pd.DataFrame(random_search.cv_results_).sort_values(by="rank_test_score").head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_class_weight,param_max_depth,param_max_features,params,split0_test_score,split1_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
194,0.004354,0.0007,0.001916,0.000134,,85,7,"{'class_weight': None, 'max_depth': 85, 'max_f...",0.923077,0.901099,...,0.949451,0.035845,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
81,0.002757,0.00045,0.00254,0.001358,,39,2,"{'class_weight': None, 'max_depth': 39, 'max_f...",0.945055,0.967033,...,0.945055,0.015541,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
166,0.009012,0.00111,0.002602,0.000739,,49,20,"{'class_weight': None, 'max_depth': 49, 'max_f...",0.923077,0.912088,...,0.940659,0.023671,3,1.0,1.0,1.0,1.0,1.0,1.0,0.0
248,0.006112,0.000655,0.002491,0.000662,balanced,40,10,"{'class_weight': 'balanced', 'max_depth': 40, ...",0.923077,0.956044,...,0.938462,0.022628,4,1.0,1.0,1.0,1.0,1.0,1.0,0.0
73,0.004441,0.000282,0.001892,7e-05,,77,9,"{'class_weight': None, 'max_depth': 77, 'max_f...",0.967033,0.923077,...,0.938462,0.016447,4,1.0,1.0,1.0,1.0,1.0,1.0,0.0
77,0.009622,0.002823,0.002886,0.001369,,24,15,"{'class_weight': None, 'max_depth': 24, 'max_f...",0.934066,0.934066,...,0.938462,0.017855,4,1.0,1.0,1.0,1.0,1.0,1.0,0.0
75,0.004514,0.000975,0.002101,0.000387,balanced,9,5,"{'class_weight': 'balanced', 'max_depth': 9, '...",0.923077,0.945055,...,0.936264,0.017582,7,1.0,1.0,1.0,1.0,1.0,1.0,0.0
80,0.006744,0.000237,0.001858,9.5e-05,,58,20,"{'class_weight': None, 'max_depth': 58, 'max_f...",0.945055,0.934066,...,0.936264,0.018906,7,1.0,1.0,1.0,1.0,1.0,1.0,0.0
144,0.00816,0.001651,0.002663,0.000812,balanced,67,18,"{'class_weight': 'balanced', 'max_depth': 67, ...",0.901099,0.967033,...,0.936264,0.025441,7,1.0,1.0,1.0,1.0,1.0,1.0,0.0
106,0.005994,0.000535,0.001878,0.000145,,90,16,"{'class_weight': None, 'max_depth': 90, 'max_f...",0.945055,0.956044,...,0.936264,0.020143,7,1.0,1.0,1.0,1.0,1.0,1.0,0.0
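
The random search above was set up without a `scoring` argument, so its scores are accuracies. A sketch of a variant that uses the same metric as the grid search could look like the following; the smaller `n_iter` and the fixed `random_state` of the tree are assumptions made to keep the example fast and reproducible.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": randint(1, 100), "max_features": randint(1, X.shape[1])},
    n_iter=20,          # small budget to keep the sketch fast
    cv=5,
    scoring="roc_auc",  # same metric as the grid search
    random_state=42,
)
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)
```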


## 4. Model Selection

We now have 2 possible models, the best one found with the grid search and the best one found with the randomized search. Which one should we choose?

Selecting a "final" model that we will use is not only a matter of selecting the model with the highest score. There are other aspects we should consider when evaluating one model versus another:

- Training Time: If one model takes 1 hour to train and another one takes 5 hours, the faster one may be preferable when their scores are similar
- Prediction Time: If we are working on a real time predictive system, we cannot choose a model that takes seconds to perform a prediction!
- Interpretability: We may favor a less complex (or more interpretable) model due to regulations and/or our ability to explain it to clients


**Measuring predictive and computing performance**

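We can evaluate the predictive performance of the models by using the test dataset we held out at the beginning. We will do a cross validation. Note that `cross_validate` re-fits a copy of the estimator on folds of the data it is given, here the test set.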

In [28]:
grid_results = cross_validate(grid_search.best_estimator_, X_test, y_test, scoring="roc_auc", 
                              return_train_score=True, cv=3)

In [29]:
grid_results

{'fit_time': array([0.00354934, 0.00368953, 0.00228858]),
 'score_time': array([0.0042038 , 0.00406647, 0.00331831]),
 'test_score': array([0.96577381, 0.90029762, 0.86811594]),
 'train_score': array([0.99266324, 0.98569332, 1.        ])}

We can turn these results into a dataframe and calculate their means. This way we can see how much time it takes to train the model, how much time it takes to predict (which matters for real time applications), and how the model performs on the training and test folds.

In [30]:
pd.DataFrame(grid_results).mean()

fit_time       0.003176
score_time     0.003863
test_score     0.911396
train_score    0.992786
dtype: float64

We can do the same thing with the randomized search estimator.

In [31]:
random_results = cross_validate(random_search.best_estimator_,
                                X_test, y_test, scoring="roc_auc",
                                return_train_score=True, cv=3)
pd.DataFrame(random_results).mean()

fit_time       0.004034
score_time     0.006357
test_score     0.948301
train_score    1.000000
dtype: float64

Now that we know which model performs better on the train and test sets and which model is the fastest to train, we can make a more informed decision. (Note: take the results with a grain of salt because the test set is not very large.)
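
A common alternative to cross-validating on the test set is to score the model that was fitted on the training data directly on the held-out test set. The sketch below does this for a tree with the hyperparameters found by the grid search; the `random_state` of the tree is an added assumption for reproducibility.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters taken from the grid search result above
model = DecisionTreeClassifier(max_depth=4, max_features=2, random_state=42)
model.fit(X_train, y_train)

# AUC needs scores, so use the predicted probability of the positive class
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(round(test_auc, 3))
```

This way the test rows are never used for fitting, so the score reflects how the trained model behaves on data it has not seen.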

## 5. CheatSheet 

Though we often have several hyperparameters per estimator that we can tune, in practice most of the performance variation can be attributed to just a few hyperparameters [[2](http://proceedings.mlr.press/v32/hutter14.html)]. To make your life easier, the table below suggests a couple of hyperparameters (using sklearn naming convention), for a select group of estimators, that usually have the greatest impact on the performance.

| Estimator        | Hyperparameter          | Notes  |
| ------------- |:-------------:| :-----|
| Logistic Regression      | penalty                  | Used to specify the norm used in the penalization. Can be "l1", "l2", or "elasticnet"  |
| Logistic Regression      | C                            | Inverse of regularization strength. Can go from close to zero (strong regularization, high bias) to large values (weak regularization, high variance) |
| SVM                                 | C                             | Inverse of regularization strength. Can go from close to zero (strong regularization, high bias) to large values (weak regularization, high variance) |
| SVM                        | kernel             | Type of kernel to use. Can be "linear", "poly", "rbf", or "sigmoid" |
| Tree Ensembles      | n_estimators |    Number of estimators to use. In practice up to hundreds of estimators are used  |
| Tree Ensembles             |  max_depth         |    Maximum depth of tree. Small values result in less complexity (1 often works well for Boosting) |
| KNN                                 | n_neighbors         |   Number of neighbors to use. Small values result in higher variance while larger ones in higher bias |
| KNN                                 |  weights                |   Weight function used in prediction. Can be "uniform" or "distance" |

Starting your search with the hyperparameters above is often a good choice. 
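
As a sketch of how a row of the table translates into a search, the example below tunes the two KNN hyperparameters listed above with a grid search. The candidate values and the use of the full breast cancer dataset are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Search space built from the two KNN rows of the cheat sheet
param_grid = {
    "n_neighbors": [3, 5, 11, 21],
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)

print(search.best_params_, search.best_score_)
```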

## 6. Recap

* Hyperparameters define the learning process of the model
* They are different from parameters, which define the model and are learned from the data
* Hyperparameter search options to select the best hyperparameters
  * Grid search
  * Random search
* Final model selection should take into account other factors, not just the score

## 7. Further readings

* http://proceedings.mlr.press/v32/hutter14.html
* https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)