# Predictive modeling - Hyperparameter Tuning

In this section we apply techniques for [hyperparameter tuning][1] to a real-world data set, the _adult_ data set. The data set is available on the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) and can be accessed and downloaded [here](https://archive.ics.uci.edu/ml/datasets/Adult).

For the purpose of this tutorial we already downloaded the data set. You may find it in the `data` folder (`./data/adult_data.txt`).

Please note that this tutorial is based on a talk given by [Olivier Grisel](https://github.com/ogrisel) and [Tim Head](https://github.com/betatim) at [EuroScipy 2017](https://www.euroscipy.org/2017/). You can watch their tutorial on YouTube ([Part I](https://www.youtube.com/watch?v=Vs7tdobwj1k&index=3&list=PL55N1lsytpbekFTO5swVmbHPhw093wo0h) and [Part II](https://www.youtube.com/watch?v=0eYOhEF_aK0&list=PL55N1lsytpbekFTO5swVmbHPhw093wo0h&index=2)).


[1]: https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)

**Import libraries**

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

**Global setting**

In [None]:
pd.options.display.max_columns = 200
plt.rcParams["figure.figsize"] = [12,6]

## Load the data

In [None]:
filepath = "./data/adult_data.txt"
names = ("age, workclass, fnlwgt, education, education-num, "
         "marital-status, occupation, relationship, race, sex, "
         "capital-gain, capital-loss, hours-per-week, "
         "native-country, income").split(', ')    
data = pd.read_csv(filepath , names=names)
data = data.drop('fnlwgt', axis=1)

We take a look at the first rows of the data set by calling the `head()` function.

In [None]:
data.head()

> __The goal is to predict whether a person makes more than $50K a year.__

## Training-Test Split

Split the data set into a target vector (`target`) and a feature matrix (`features`). The categorical features are one-hot encoded with `pd.get_dummies`.

In [None]:
target = data['income']
features_data = data.drop('income', axis=1)
features = pd.get_dummies(features_data)

print("Target variable: ", target.shape)
print("Features: ", features.shape)

In [None]:
X = features.values.astype(np.float32)
y = (target.values == ' >50K').astype(np.int32)

In [None]:
X.shape

In [None]:
y

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

print("Training set: ", X_train.shape)
print("Validation set: ", X_val.shape)

## Learning Algorithm - Decision Trees

[__Decision Trees__](https://en.wikipedia.org/wiki/Decision_tree_learning) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
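
To make the notion of "simple decision rules" concrete, the following sketch fits a very shallow tree and prints its rules as text. It uses the iris data set bundled with scikit-learn rather than the _adult_ data, and the depth of 2 is an arbitrary choice for readability.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small bundled data set, used here only for illustration
iris = load_iris()

# A shallow tree keeps the printed rule set short
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the learned splits as nested if/else rules
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each line of the output is a threshold test on one feature; following the tests from top to bottom leads to a predicted class.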


Some advantages of decision trees are:

* Simple to understand and to interpret (white box model). Trees can be visualized.
* Requires little data preparation.
* The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
* Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing data sets that have only one type of variable. (The scikit-learn implementation expects numerical input, which is why the categorical features were one-hot encoded above.)

The disadvantages of decision trees include:

* Decision-tree learners can create over-complex trees that do not generalize well to unseen data. This is called [overfitting](https://en.wikipedia.org/wiki/Overfitting).
* Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
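
The overfitting problem can be illustrated with a small experiment. The sketch below is not part of the original tutorial: it uses a synthetic data set with deliberately noisy labels (`flip_y=0.2`, an arbitrary choice) and compares a shallow tree with a fully grown one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% randomly flipped labels (illustrative assumption)
X_toy, y_toy = make_classification(n_samples=1000, n_features=20,
                                   n_informative=5, flip_y=0.2,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=0)

shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)

# Compare accuracy on the data seen during training and on held-out data
for name, model in [("max_depth=3", shallow), ("unlimited depth", deep)]:
    print("{}: train accuracy {:.3f}, test accuracy {:.3f}".format(
        name, model.score(X_tr, y_tr), model.score(X_te, y_te)))
```

The fully grown tree memorizes the training set, including the flipped labels, so its training accuracy says little about how it behaves on new data. The gap between the training and test accuracy is the symptom to look for.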




In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=8)
clf

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='roc_auc')
print("ROC AUC Decision Tree: {:.4f} +/-{:.4f}".format(
      np.mean(scores), np.std(scores)))

## Tuning your estimator

Hyper-parameters are not directly learned by the classifier or regressor from the data. They have to be set from the outside. An example of a hyper-parameter is `max_depth` for a decision tree classifier. In `scikit-learn` you can spot them as the parameters that are passed to the constructor of your estimator.
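
As a quick illustration, every scikit-learn estimator exposes its constructor parameters through `get_params()` and lets you change them with `set_params()`. The values used below are arbitrary.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=8)

# get_params() returns all constructor arguments, including defaults
params = tree.get_params()
print(sorted(params))
print("max_depth:", params["max_depth"])

# set_params() changes hyper-parameters on an existing estimator;
# this is the mechanism the search classes use for each candidate
tree.set_params(max_depth=4, min_samples_leaf=20)
print("max_depth after set_params:", tree.get_params()["max_depth"])
```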


The best value of a hyper-parameter depends on the kind of problem you are solving:

* how many features and samples do you have?
* mostly numerical or mostly categorical features?
* is it a regression or classification task?

Therefore you should optimize the hyper-parameters for each problem, otherwise the performance of your classifier will not be as good as it could be.

### Search over a grid of parameters

This is the simplest strategy: you try every combination of values for each hyper-parameter.
In scikit-learn __grid search__ is provided by [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), which exhaustively generates candidates from a grid of parameter values specified with the `param_grid` parameter.
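
To see which candidates a grid produces, you can enumerate it with `ParameterGrid` from `sklearn.model_selection`. The grid below is a small made-up example, not the one used in this tutorial.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid with two hyper-parameters
toy_grid = {"max_depth": [1, 2, 4], "max_features": [3, 6]}

# Every combination of the listed values becomes one candidate
candidates = list(ParameterGrid(toy_grid))
print("Number of candidates:", len(candidates))
for candidate in candidates:
    print(candidate)
```

The number of candidates is the product of the number of values per hyper-parameter, and each candidate is fitted once per cross-validation fold, so the cost of a grid search grows quickly as hyper-parameters are added.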

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [1, 2, 4, 8, 16, 32]}

grid_search = GridSearchCV(clf, param_grid=param_grid, 
                           scoring='roc_auc', return_train_score=True)

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
type(grid_search)

Once the `GridSearchCV` object (of type `sklearn.model_selection._search.GridSearchCV`) has been fitted, we can access its attributes using the `.`-notation. For instance, the results of the cross-validation are stored in the `cv_results_` attribute.

In [None]:
grid_search.cv_results_

We print out the values of `max_depth` and the average train and test scores for each candidate.

In [None]:
for n, max_depth in enumerate(grid_search.cv_results_['param_max_depth']):
    print("Max depth: {}, train score: {:.3f}, test score {:.3f}".format(max_depth,
          grid_search.cv_results_['mean_train_score'][n],
          grid_search.cv_results_['mean_test_score'][n],))
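
Because `cv_results_` is a dictionary of equally long arrays, it can also be turned into a `pandas` data frame, which is often easier to read and sort. The sketch below is self-contained and therefore runs its own small search on synthetic data; the data set and the grid are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem so that the example runs quickly
X_toy, y_toy = make_classification(n_samples=300, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [1, 2, 4]},
                      scoring="roc_auc", cv=3, return_train_score=True)
search.fit(X_toy, y_toy)

# One row per candidate, one column per recorded quantity
results = pd.DataFrame(search.cv_results_)
columns = ["param_max_depth", "mean_train_score",
           "mean_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```

In the notebook itself, `pd.DataFrame(grid_search.cv_results_)` gives the same view of the search fitted above.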


For model diagnostics we write a function, `plot_grid_scores`, which lets us compare training and cross-validation performance for each value of a particular hyper-parameter, such as `max_depth`.

In [None]:
def plot_grid_scores(param_name, cv_result):
    # access the parameter
    param_values = np.array(cv_result["param_{}".format(param_name)])
    
    # plotting
    fig, ax = plt.subplots()

    ax.set_title("Scores for {}".format(param_name), size=18)
    ax.grid()
    ax.set_xlabel(param_name)
    ax.set_ylabel("Score")
    
    train_scores_mean = cv_result['mean_train_score']
    test_scores_mean = cv_result['mean_test_score']
    ax.scatter(param_values, train_scores_mean, s=80 ,marker='o', color="r",
                label="Training scores")
    ax.scatter(param_values, test_scores_mean, s=80, marker='o', color="g",
                label="Cross-validation scores")
    ax.legend(loc="best")
    print("Best test score: {:.4f}".format(np.max(test_scores_mean)))


Once it is implemented, we can apply `plot_grid_scores` to the `grid_search.cv_results_` object.

In [None]:
plot_grid_scores("max_depth", grid_search.cv_results_)

>**Challenge:** Extend the parameter grid to also search over different values for the `max_features` hyper-parameter. (Try: 3, 6, 12, 24, 48, and 96). Plot the results using the `plot_grid_scores` function from above.

In [None]:
## your code here ...

In [None]:
# %load ./src/_solutions/grid_search.py

It can also be informative to look at the three best parameter combinations found so far. We write a function called `report` to accomplish this task.

In [None]:
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}\n".format(results['params'][candidate]))            

In [None]:
report(grid_search.cv_results_)

### Randomized search

An alternative to the exhaustive grid search is to sample parameter values at random. This has two main benefits over an exhaustive search:
* A budget can be chosen independent of the number of parameters and possible values.
* Adding parameters that do not influence the performance does not decrease efficiency.

[`RandomizedSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV) implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. In contrast to `GridSearchCV`, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by `n_iter`.
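
To get a feel for what sampling from such distributions produces, the sketch below draws a handful of settings with `ParameterSampler`. The distributions are chosen for illustration; note that `scipy.stats.randint(low, high)` excludes the upper bound, so `randint(1, 32)` yields integers from 1 to 31.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Illustrative distributions; the upper bound of randint is exclusive
toy_distributions = {"max_depth": randint(1, 32),
                     "max_features": randint(1, 96)}

# Draw n_iter independent parameter settings
samples = list(ParameterSampler(toy_distributions, n_iter=5, random_state=0))
for sample in samples:
    print(sample)
```

Each sampled dictionary corresponds to one candidate, so `n_iter` directly controls the computational budget regardless of how many hyper-parameters are searched.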


In [None]:
from scipy.stats import randint as sp_randint

from sklearn.model_selection import RandomizedSearchCV

param_grid = {"max_depth": sp_randint(1, 32),
              "max_features": sp_randint(1, 96),
             }
random_search = RandomizedSearchCV(clf, param_distributions=param_grid,
                                   n_iter=36, scoring='roc_auc', return_train_score=True)
random_search.fit(X_train, y_train)

In [None]:
plot_grid_scores("max_depth", random_search.cv_results_)

For the same number of model evaluations you get a much better view of how the performance varies as a function of `max_depth`. This is a big advantage, especially if one of the hyper-parameters does not influence the performance of the estimator. However, as the number of hyper-parameters grows, projecting the results onto a single one becomes noisier.

In [None]:
param_grid = {"max_depth": sp_randint(1, 32),
              "max_features": sp_randint(1, 96),
              "min_samples_leaf": sp_randint(15, 40)
             }
random_search = RandomizedSearchCV(clf, param_distributions=param_grid,
                                   n_iter=36, scoring='roc_auc', return_train_score=True)
random_search.fit(X_train, y_train)

In [None]:
plot_grid_scores("max_depth", random_search.cv_results_)

In [None]:
plot_grid_scores("max_features", random_search.cv_results_)

In [None]:
plot_grid_scores("min_samples_leaf", random_search.cv_results_)

You may access the best-performing parameter combination using the `best_params_` attribute.

In [None]:
random_search.best_params_

### Bayesian optimization

Neither the exhaustive grid search nor the random search adapts its search for the best hyper-parameters as it evaluates points. For the grid all points are chosen upfront, and for random search all of them are chosen at random, independently of earlier results.

It makes sense to use the knowledge from the first few evaluations to decide what hyper-parameters to try next. This is what tools like [`scikit-optimize`](https://scikit-optimize.github.io/) try to do. The technique is known as Bayesian optimization or sequential model-based optimization.

The basic algorithm goes like this:
* evaluate a new set of hyper-parameters
* fit a regression model to all sets of hyper-parameters evaluated so far and their scores
* use the regression model to predict which set of hyper-parameters is the best
* evaluate that set of hyper-parameters
* repeat.
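
The loop above can be sketched in a few lines. This is a deliberately simplified, greedy illustration and not how `scikit-optimize` is implemented: the objective function is a hypothetical stand-in for a cross-validated score, a random forest serves as the regression model, and the next candidate is simply the one with the highest predicted score. Real tools use an acquisition function that also rewards exploring uncertain regions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def objective(depth):
    # Hypothetical stand-in for an expensive cross-validated score
    return -(depth - 9) ** 2


rng = np.random.default_rng(0)
candidates = np.arange(1, 33)

# Start with a few randomly chosen evaluations
tried = list(rng.choice(candidates, size=3, replace=False))
scores = [objective(depth) for depth in tried]

for _ in range(10):
    # Fit a regression model to everything evaluated so far
    surrogate = RandomForestRegressor(n_estimators=50, random_state=0)
    surrogate.fit(np.array(tried).reshape(-1, 1), scores)

    # Predict the score of all untried values and pick the most promising
    untried = np.array([c for c in candidates if c not in tried])
    predicted = surrogate.predict(untried.reshape(-1, 1))
    next_depth = untried[np.argmax(predicted)]

    # Evaluate that value and add it to the history
    tried.append(next_depth)
    scores.append(objective(next_depth))

best = tried[int(np.argmax(scores))]
print("Evaluated values:", [int(d) for d in tried])
print("Best value found:", int(best))
```

Only 13 of the 32 possible values are evaluated; the regression model decides where the remaining budget is spent.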

`scikit-optimize` provides a drop-in replacement for `GridSearchCV` and `RandomizedSearchCV` that performs all this on the inside:

_Note: if `scikit-optimize` is not yet installed on your machine, run `pip install scikit-optimize` or `conda install -c conda-forge scikit-optimize` in your shell._

In [None]:
from skopt import BayesSearchCV

In [None]:
bayes_search = BayesSearchCV(
    clf,
    {"max_depth": (1, 32),
     "max_features": (1, 96),
     "min_samples_leaf": (15, 40)
    },
    n_iter=15,
    scoring='roc_auc',
    return_train_score=True
)

In [None]:
bayes_search.fit(X_train, y_train)


Once the computation has finished, we can access the results in the same fashion as we did before.


In [None]:
plot_grid_scores("max_depth", bayes_search.cv_results_)

In [None]:
bayes_search.best_params_

In [None]:
bayes_search.best_score_

In [None]:
np.mean(bayes_search.cv_results_["mean_test_score"])

## Using cross validation results for predictions

Once we have finished our hyper-parameter search, we can use the best model for predictions. Note that so far we did not build a separate test set, hence for the purpose of demonstration we use the validation set as the test set:

In [None]:
X_test = np.copy(X_val)
y_test = np.copy(y_val)

We use accuracy as our model evaluation metric.

In [None]:
from sklearn.metrics import accuracy_score

There is more than one way to make predictions for a hold-out set (`X_test`). We may retrieve the best estimator via the `best_estimator_` attribute, or call `predict` directly on the search object. With the default `refit=True`, the best estimator has already been refit on the whole training set.

In [None]:
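# variant 1: best_estimator_ was already refit on the training set (refit=True),
# so it can be used for prediction without calling fit again
m = bayes_search.best_estimator_
y_pred_v1 = m.predict(X_test)
print("Accuracy on the test set: ", accuracy_score(y_true=y_test, y_pred=y_pred_v1))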

In [None]:
# variant 2
y_pred_v2 = bayes_search.predict(X_test)
print("Accuracy on the test set: ", accuracy_score(y_true=y_test, y_pred=y_pred_v2))

The results should be the same, because calling `predict` on the search object delegates to `best_estimator_`.
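
The equivalence of the two variants can be checked on a small example. The sketch below uses `GridSearchCV` on synthetic data to stay self-contained and fast; the data set and the grid are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem with a held-out part
X_toy, y_toy = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 4, 8]},
                      scoring="roc_auc", cv=3)
search.fit(X_tr, y_tr)  # refit=True by default

# Predict via the search object and via the refit best estimator
pred_search = search.predict(X_te)
pred_best = search.best_estimator_.predict(X_te)
print("Identical predictions:", np.array_equal(pred_search, pred_best))
```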