# Hyperparameter Search

Whenever we build an ML model, we have a lot of choices to make about hyperparameters. Sometimes we can rely on our intuition to guide us towards good choices, but one of the main benefits of ML models is that they find surprising results by "thinking" quite differently from the way most people think. Because machine learning is primarily a statistical discipline, its practitioners often use validation data, a technique called "cross validation," and programmatic methods to explore the possible combinations of hyperparameters and find ones that produce good models empirically.

SKLearn provides two very helpful classes for executing hyperparameter search: GridSearchCV and RandomizedSearchCV.

### Cross Validation:

There are many kinds of cross validation; see https://scikit-learn.org/stable/modules/cross_validation.html for more details. In this lab we're going to use something called k-fold cross validation (because 5-fold cross validation is the default for SKLearn's hyperparameter search methods), but we strongly suggest experimenting with other methods, especially "stratified k-fold" cross validation, which can also be easily applied to hyperparameter search using SKLearn. Note that when the estimator is a classifier and the `cv` argument is left at its default or given as an integer, SKLearn's search classes already use the stratified variant.

K-Fold cross validation separates your training dataset into K different training and validation sets, ensuring that each datapoint occurs in exactly one of the validation sets, and therefore appears in K-1 training sets. Take a look at this picture from the SKlearn docs:

![K-Fold Cross Validation Visualized.](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_0041.png)
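To make the picture concrete, here is a minimal sketch of the splits that `KFold` produces. The ten-point toy array is an assumption chosen only so the indices are easy to read; it is not the lab's dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy datapoints, identified by their index (illustrative data only)
X = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5)
validation_counts = np.zeros(len(X), dtype=int)

for fold, (train_idx, val_idx) in enumerate(kfold.split(X)):
    # Count how often each datapoint lands in a validation set
    validation_counts[val_idx] += 1
    print(f"fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")

# Each datapoint is validated on exactly once,
# so it is trained on in the other K-1 folds.
print(validation_counts.tolist())
```

By default `KFold` does not shuffle, so the validation folds are consecutive blocks of the data; passing `shuffle=True` randomizes which points share a fold.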

Cross validation is more expensive computationally than having a single training/validation split, but it provides a more robust estimate of our model's performance and generalization. Generally, when we have found the hyperparameter set we like via cross validation, we retrain the model one more time with ALL the training data before evaluating our results on a held-out test set. 
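The workflow in the previous paragraph can be sketched roughly as follows. The synthetic dataset from `make_classification`, the 80/20 split, and the fixed `max_depth=3` are all assumptions made for illustration; in practice the hyperparameters would come from your own search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the lab's dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hold out a test set that cross validation never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5-fold cross validation on the training data only: five fits, five scores
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %0.3f (+/-%0.03f)" % (cv_scores.mean(), cv_scores.std() * 2))

# Once we are happy with the hyperparameters, retrain on ALL the training data
model.fit(X_train, y_train)

# ...and only then evaluate on the held-out test set
test_accuracy = model.score(X_test, y_test)
print("Test accuracy: %0.3f" % test_accuracy)
```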

Stratified K-Fold cross validation is like K-Fold, but it takes extra steps to ensure that each of the K validation sets contains approximately the same proportion of datapoints from each class in our classification problem as the full dataset does. See this image from the sklearn docs again:

![Stratified K-Fold Cross Validation Visualized](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_0071.png)
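The difference is easiest to see on imbalanced labels. The sketch below compares the fraction of the minority class in each validation fold for `KFold` and `StratifiedKFold`. The 80/20 toy labels are an assumption for illustration, and `positive_fraction_per_fold` is a hypothetical helper, not part of sklearn.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels: 80 examples of class 0 followed by 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # the feature values do not affect how the splits are made


def positive_fraction_per_fold(splitter):
    """Fraction of class-1 labels in each validation fold (hypothetical helper)."""
    return [float(y[val_idx].mean()) for _, val_idx in splitter.split(X, y)]


plain = positive_fraction_per_fold(KFold(n_splits=5))
stratified = positive_fraction_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

Because these labels are sorted and `KFold` does not shuffle by default, each plain fold contains only one class, while every stratified fold keeps the dataset's 20% share of class 1.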

Typically "stratified" tactics are applied to classification problems, but there are similar tactics for regression problems. See this article for more information about that: https://scottclowe.com/2016-03-19-stratified-regression-partitions/

### Hyperparameter search

In SK-Learn we can use a single class to combine the search process with cross validation, automatically finding the best hyperparameters according to our cross-validation scores on whatever metric we specify (e.g. accuracy or r^2). With either search class, sk-learn records the performance of every hyperparameter combination it tries and, by default, refits the best one on all the data we passed in, so we can retrieve the best model without retraining it ourselves.
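As a rough sketch of that interface, the example below passes an explicit metric and cross-validation splitter to `GridSearchCV` and then reads back the results. The synthetic data and the tiny one-parameter grid are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the lab's dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    scoring="accuracy",              # the metric used to rank combinations
    cv=StratifiedKFold(n_splits=5),  # an explicit splitter instead of the default
)
search.fit(X, y)

# Performance of every combination is kept in cv_results_
print(len(search.cv_results_["params"]), "combinations scored")
print(search.best_params_, search.best_score_)

# With the default refit=True, best_estimator_ has already been refit on all of X, y
best_model = search.best_estimator_
predictions = best_model.predict(X[:5])
print(predictions)
```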

The two tactics are:

#### Grid Search

We supply a set of values for each hyperparameter that we want to search across. SK-learn computes all the possible combinations of our selections and performs cross-validation for each combination.
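Because every combination is tried, the cost of a grid search is the product of the number of values per hyperparameter. One way to check this before launching a search is sklearn's `ParameterGrid`, sketched here with the same decision tree grid this lab uses; the choice of 5 folds reflects the default `cv` setting.

```python
from sklearn.model_selection import ParameterGrid

tuned_parameters = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 3, 5, 10, 20],
    'min_samples_split': [2, 4, 8, 16, 32],
    'max_leaf_nodes': [None, 10, 20, 40, 80]
}

grid = ParameterGrid(tuned_parameters)

# 2 * 5 * 5 * 5 combinations
n_combinations = len(grid)
n_folds = 5
n_fits = n_combinations * n_folds

print(n_combinations, "combinations,", n_fits, "cross-validation fits")

# Peek at the first few combinations that would be tried
for params in list(grid)[:3]:
    print(params)
```

Adding one more hyperparameter with five values would multiply the number of fits by five, which is why grid search becomes expensive quickly.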

#### Randomized Search

We supply a range of values (either a list or a sampling distribution) for each hyperparameter we are interested in. SK-learn randomly selects a value for each hyperparameter within our ranges and performs cross-validation on that combination. We specify how many times to perform this process.
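To see what the random draws look like, sklearn's `ParameterSampler` can generate candidate combinations without fitting anything. This is only a sketch: the specific ranges, `n_iter=5`, and the fixed `random_state` are assumptions for illustration.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

param_dist = {
    "max_depth": [None, 3, 5, 10, 20],     # lists are sampled uniformly
    "min_samples_split": randint(2, 33),   # integers from 2 to 32 inclusive
    "criterion": ["gini", "entropy"],
}

# Draw 5 random combinations; random_state makes the draws repeatable
samples = list(ParameterSampler(param_dist, n_iter=5, random_state=0))

for params in samples:
    print(params)
```

Note that `scipy.stats.randint(low, high)` excludes `high`, so the upper bound has to be one more than the largest value you want to allow.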

In [2]:
# Let's do grid search cross validation on a Decision Tree:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


# Load the data
heart_dataset = pd.read_csv('../../datasets/uci-heart-disease/heart.csv')

# Split the data into input and labels
labels = heart_dataset['target']
input_data = heart_dataset.drop(columns=['target'])

# Note, we don't split the data. GridSearchCV will automatically apply 5-fold cross validation by default.

tuned_parameters = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 3, 5, 10, 20],
    'min_samples_split': [2, 4, 8, 16, 32],
    'max_leaf_nodes': [None, 10, 20, 40, 80]
}

# These lines cause every possible combination of the above parameters to be fit and scored,
# which can take a LONG TIME with large datasets.
clf = DecisionTreeClassifier()
grid_tree = GridSearchCV(clf, tuned_parameters)
grid_tree.fit(input_data, labels) 

print("Best parameters set found on development set:")
print()
print(grid_tree.best_params_, grid_tree.best_score_)
print()
print("Grid scores on development set:")
print()
means = grid_tree.cv_results_['mean_test_score']
stds = grid_tree.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid_tree.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))

Best parameters set found on development set:

{'criterion': 'entropy', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_split': 32} 0.8150273224043716

Grid scores on development set:

0.752 (+/-0.104) for {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_split': 2}
0.759 (+/-0.108) for {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_split': 4}
0.772 (+/-0.085) for {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_split': 8}
0.752 (+/-0.108) for {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_split': 16}
0.799 (+/-0.066) for {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_split': 32}
0.792 (+/-0.113) for {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': 10, 'min_samples_split': 2}
0.795 (+/-0.115) for {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': 10, 'min_samples_split': 4}
0.795 (+/-0.115) for {'criterion': 'gin

In [6]:
# We have to use scipy to provide the sampling distributions for our random search
from scipy.stats import randint as sp_randint

# Lists will be uniformly sampled.
# Distributions from scipy will follow the sampling distribution 
# (uniform in this case, but you could use any other provided distribution)
param_dist = {"max_depth": [None, 3, 5, 10, 20],
              "min_samples_split": sp_randint(2, 32),
              "max_leaf_nodes": sp_randint(2, 80),
              "criterion": ["gini", "entropy"]}

# Above, 250 combinations were tried. The main benefit of randomized search
# is to reduce the search time while producing similar results. So let's do
# just 100 iterations instead of 250 and see how similar the results are.
clf = DecisionTreeClassifier()
n_iter_search = 100
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)
random_search.fit(input_data, labels)

print("Best parameters set found on development set:")
print()
print(random_search.best_params_, random_search.best_score_)
print()
print("Scores on development set:")
print()
means = random_search.cv_results_['mean_test_score']
stds = random_search.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, random_search.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))

Best parameters set found on development set:

{'criterion': 'entropy', 'max_depth': 20, 'max_leaf_nodes': 49, 'min_samples_split': 29} 0.8183060109289617

Scores on development set:

0.765 (+/-0.100) for {'criterion': 'gini', 'max_depth': 10, 'max_leaf_nodes': 44, 'min_samples_split': 18}
0.815 (+/-0.109) for {'criterion': 'entropy', 'max_depth': 3, 'max_leaf_nodes': 60, 'min_samples_split': 18}
0.792 (+/-0.106) for {'criterion': 'entropy', 'max_depth': None, 'max_leaf_nodes': 49, 'min_samples_split': 20}
0.765 (+/-0.108) for {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': 18, 'min_samples_split': 11}
0.798 (+/-0.070) for {'criterion': 'gini', 'max_depth': 20, 'max_leaf_nodes': 45, 'min_samples_split': 29}
0.756 (+/-0.099) for {'criterion': 'entropy', 'max_depth': None, 'max_leaf_nodes': 38, 'min_samples_split': 11}
0.808 (+/-0.096) for {'criterion': 'entropy', 'max_depth': 10, 'max_leaf_nodes': 22, 'min_samples_split': 30}
0.805 (+/-0.119) for {'criterion': 'entropy', 'max