## Model Tuning

**The hyperparameters of a machine learning model are parameters that are not learned from data; they must be set before fitting the model to the training set. This section shows how to tune the hyperparameters of a tree-based model using grid search cross validation.**

#### Tuning a CART's hyperparameters

#### Machine learning model:

* parameters: learned from data
    * CART example: split-point of a node, split-feature of a node, ...
* hyperparameters: not learned from data, set prior to training
    * CART example: max_depth, min_samples_leaf, splitting criterion...
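
The distinction can be seen directly in scikit-learn. The following is a minimal sketch on a synthetic dataset (`make_classification` is an assumption made for illustration, not the liver dataset used later): `get_params()` lists the hyperparameters before any fitting, while the fitted `tree_` attribute holds the learned split features and split points.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

tree = DecisionTreeClassifier(max_depth=2, random_state=1)

# Hyperparameters: set by the user, available before fitting
hyperparams = tree.get_params()
print(hyperparams['max_depth'], hyperparams['min_samples_leaf'])

# Parameters: learned from the data during fit
tree.fit(X_demo, y_demo)
root_feature = tree.tree_.feature[0]      # index of the feature used at the root split
root_threshold = tree.tree_.threshold[0]  # split point of the root node
print(root_feature, root_threshold)
```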

#### What is hyperparameter tuning?

* Problem: search for a set of optimal hyperparameters for a learning algorithm.
* Solution: find a set of optimal hyperparameters that results in an optimal model.
* Optimal model: yields an optimal score.
* Score: in sklearn defaults to accuracy (classification) and R² (regression).
* Cross validation is used to estimate the generalization performance.
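
As a rough illustration of the default scores, the sketch below runs `cross_val_score` without a `scoring` argument, so each estimator's own `.score()` method is used: accuracy for the classifier and R² for the regressor. The synthetic datasets and the `max_depth=3` setting are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Synthetic data, used only for illustration
X_clf, y_clf = make_classification(n_samples=300, n_features=6, random_state=1)
X_reg, y_reg = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=1)

# scoring is left at its default (None), so the estimator's .score() is used
acc_scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=1),
                             X_clf, y_clf, cv=5)
r2_scores = cross_val_score(DecisionTreeRegressor(max_depth=3, random_state=1),
                            X_reg, y_reg, cv=5)

print('Mean CV accuracy: {:.3f}'.format(acc_scores.mean()))
print('Mean CV R^2: {:.3f}'.format(r2_scores.mean()))
```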

#### Why tune hyperparameters?

* In sklearn, a model's default hyperparameters are not optimal for all problems.
* Hyperparameters should be tuned to obtain the best model performance.

#### Grid search cross validation

* Manually set a grid of discrete hyperparameter values.

* Set a metric for scoring model performance.

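* Hyperparameter space = the set of all combinations of the values in the grid.

* Search exhaustively through the grid.

* For each set of hyperparameters, evaluate the corresponding model's CV score.

* CV scores = { one score per hyperparameter combination }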
* The optimal hyperparameters are those of the model achieving the best CV score.
    * optimal hyperparameters = set of hyperparameters corresponding to the best CV score
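
The procedure can be sketched by hand before handing it over to `GridSearchCV`. This is only an illustration under stated assumptions: the data is synthetic, the grid values are made up for the example, and the default accuracy score is used.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

# Illustrative grid of discrete hyperparameter values (assumed values)
max_depths = [2, 3, 4]
min_samples_leaves = [0.05, 0.1]

# Evaluate the mean 5-fold CV score of every combination in the grid
cv_scores = {}
for max_depth, min_samples_leaf in product(max_depths, min_samples_leaves):
    model = DecisionTreeClassifier(max_depth=max_depth,
                                   min_samples_leaf=min_samples_leaf,
                                   random_state=1)
    cv_scores[(max_depth, min_samples_leaf)] = cross_val_score(
        model, X_demo, y_demo, cv=5).mean()

# The optimal hyperparameters are those with the best CV score
best_combo = max(cv_scores, key=cv_scores.get)
print('Best (max_depth, min_samples_leaf):', best_combo)
print('Best mean CV accuracy: {:.3f}'.format(cv_scores[best_combo]))
```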

#### Example:

#### Hyperparameters of DecisionTreeClassifier

* DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best')

=> mean_features is not a hyperparameter of DecisionTreeClassifier.

#### Doing

* Inspecting the hyperparameters of a CART in sklearn
* Extracting the best hyperparameters
    * Set the tree's hyperparameter grid
    * Define params_dt
    * Evaluate the optimal tree
    * Compute the test set ROC AUC score.
* Extracting the best estimator


In [1]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN

from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import mean_squared_error as MSE

from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingRegressor

SEED = 1

In [3]:
# Load the dataset
liver = pd.read_csv('../indian_liver_patient/indian_liver_patient_preprocessed.csv', index_col=0)
X = liver.drop('Liver_disease', axis=1)
y = liver['Liver_disease']

# Untuned classification tree: all hyperparameters other than random_state are left at their defaults
dt = DecisionTreeClassifier(random_state=SEED)

# Split the data into 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)


## Set the tree's hyperparameter grid

In this exercise, you'll manually set the grid of hyperparameters that will be used to tune the classification tree dt and find the optimal classifier in the next exercise.

**Instructions**

* Define a grid of hyperparameters corresponding to a Python dictionary called params_dt with:

    * the key 'max_depth' set to a list of values 2, 3, and 4

    * the key 'min_samples_leaf' set to a list of values 0.12, 0.14, 0.16, 0.18


In [4]:
# Set the tree's hyperparameter grid
# Define params_dt
params_dt = {
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]
}
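
To get a feel for the size of this grid, the combinations can be enumerated with `ParameterGrid` from `sklearn.model_selection`. This is a small sketch; the `n_folds` variable is introduced here only for the arithmetic. Note that in scikit-learn a float value of `min_samples_leaf` is interpreted as a fraction of the training samples rather than an absolute count.

```python
from sklearn.model_selection import ParameterGrid

params_dt = {
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [0.12, 0.14, 0.16, 0.18],  # floats: fraction of the training samples
}

# Every combination of the grid values is one candidate model
candidates = list(ParameterGrid(params_dt))
n_folds = 5  # assumed here to match a 5-fold cross validation

n_candidates = len(candidates)    # 3 * 4 = 12 combinations
n_fits = n_candidates * n_folds   # 12 * 5 = 60 model fits during the search
print(n_candidates, n_fits)
```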

## Search for the optimal tree

In this exercise, you'll perform grid search using 5-fold cross validation to find dt's optimal hyperparameters. Note that because grid search is an exhaustive process, it may take a lot of time to train the model. Here you'll only be instantiating the GridSearchCV object without fitting it to the training set. As discussed in the video, you can train such an object like any scikit-learn estimator by using the .fit() method:

> grid_object.fit(X_train, y_train)

An untuned classification tree dt as well as the dictionary params_dt that you defined in the previous exercise are available in your workspace.

**Instructions**

* Import GridSearchCV from sklearn.model_selection.

* Instantiate a GridSearchCV object using 5-fold CV by setting the parameters:

    * estimator to dt, param_grid to params_dt and

    * scoring to 'roc_auc'.



In [5]:
# Perform the grid search
# (GridSearchCV is already imported in the first cell)

# Instantiate grid_dt
grid_dt = GridSearchCV(estimator=dt,
                       param_grid=params_dt,
                       scoring='roc_auc',
                       cv=5,
                       n_jobs=-1)

# Fit grid_dt to the training set
grid_dt.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=1), n_jobs=-1,
             param_grid={'max_depth': [2, 3, 4],
                         'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]},
             scoring='roc_auc')
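
Once a grid search has been fitted, the best hyperparameters and their cross-validated score can be read from the `best_params_` and `best_score_` attributes. The sketch below is self-contained and therefore uses a synthetic dataset and its own `grid_demo` object (both are assumptions for illustration); the same attributes are available on a fitted `grid_dt`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

grid_demo = GridSearchCV(estimator=DecisionTreeClassifier(random_state=1),
                         param_grid={'max_depth': [2, 3, 4],
                                     'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]},
                         scoring='roc_auc',
                         cv=5)
grid_demo.fit(X_tr, y_tr)

# Hyperparameters of the candidate with the best mean CV score
best_hyperparams = grid_demo.best_params_
# Mean cross-validated ROC AUC of that candidate
best_cv_score = grid_demo.best_score_

print('Best hyperparameters:', best_hyperparams)
print('Best CV ROC AUC: {:.3f}'.format(best_cv_score))
```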

## Evaluate the optimal tree

In this exercise, you'll evaluate the test set ROC AUC score of grid_dt's optimal model.

In order to do so, you will first determine the probability of obtaining the positive label for each test set observation. You can use the method predict_proba() of an sklearn classifier to compute a 2D array containing the probabilities of the negative and positive class labels, respectively, along its columns.
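
The layout of that array can be checked on a toy example. This is a minimal sketch assuming a synthetic binary dataset with labels 0 and 1; the columns of `predict_proba()` follow the order of the classifier's `classes_` attribute, so column 1 holds the positive-class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)
clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo[:5])  # one row per observation, one column per class
print(clf.classes_)                    # column order of proba
print(proba.shape)

# Probabilities of the positive class (second column)
pos_proba = proba[:, 1]
print(pos_proba)
```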

The dataset is already loaded and processed for you (numerical features are standardized); it is split into 80% train and 20% test. X_test, y_test are available in your workspace. In addition, we have also loaded the trained GridSearchCV object grid_dt that you instantiated in the previous exercise. Note that grid_dt was trained as follows:

> grid_dt.fit(X_train, y_train)

**Instructions**
* Import roc_auc_score from sklearn.metrics.
* Extract the .best_estimator_ attribute from grid_dt and assign it to best_model.
* Predict the test set probabilities of obtaining the positive class y_pred_proba.
* Compute the test set ROC AUC score test_roc_auc of best_model.


In [6]:
# Import roc_auc_score from sklearn.metrics 
# from sklearn.metrics import roc_auc_score

# Extract the best estimator
best_model = grid_dt.best_estimator_

# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:,1]

# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

Test set ROC AUC score: 0.731


## Tuning an RF's Hyperparameters

#### Random Forests Hyperparameters

* CART hyperparameters
* number of estimators
* bootstrap

#### Tuning is expensive

* Hyperparameter tuning:
    * is computationally expensive,
    * sometimes leads to only a very slight improvement.
* Weigh the impact of tuning on the whole project.

#### Doing:

* Instantiate RF
* Instantiate GridSearchCV
* Evaluating the test set RMSE of the best model

## Set the hyperparameter grid of RF

In this exercise, you'll manually set the grid of hyperparameters that will be used to tune rf's hyperparameters and find the optimal regressor. For this purpose, you will construct a grid of hyperparameters to tune the number of estimators, the maximum number of features used when splitting each node, and the minimum number of samples (or fraction) per leaf.

In [9]:
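# Define the dictionary 'params_rf'
# Note: 'auto' is only accepted by scikit-learn versions before 1.3;
# for regressors it meant "use all features"
params_rf = {
    'n_estimators': [100, 350, 500],
    'max_features': ['log2', 'auto', 'sqrt'],
    'min_samples_leaf': [2, 10, 30]
}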

## Search for the optimal forest

In this exercise, you'll perform grid search using 3-fold cross validation to find rf's optimal hyperparameters. To evaluate each model in the grid, you'll be using the negative mean squared error metric.
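
scikit-learn scorers follow a "higher is better" convention, which is why the mean squared error is exposed in negated form as `'neg_mean_squared_error'`. The sketch below, on synthetic regression data with an assumed small forest of 25 trees, shows how such scores can be turned back into an RMSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data, used only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
rf_demo = RandomForestRegressor(n_estimators=25, random_state=1)

# Each fold returns the negated MSE, so all values are <= 0
neg_mse = cross_val_score(rf_demo, X_demo, y_demo,
                          scoring='neg_mean_squared_error', cv=3)

# Flip the sign and take the square root to obtain a cross-validated RMSE
cv_rmse = np.sqrt(-neg_mse.mean())
print('CV RMSE: {:.3f}'.format(cv_rmse))
```

The same conversion applies to the `best_score_` attribute of a grid search fitted with this scoring.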

Note that because grid search is an exhaustive search process, it may take a lot of time to train the model. Here you'll only be instantiating the GridSearchCV object without fitting it to the training set. As discussed in the video, you can train such an object like any scikit-learn estimator by using the .fit() method:

> grid_object.fit(X_train, y_train)

The untuned random forest regressor model rf as well as the dictionary params_rf that you defined in the previous exercise are available in your workspace.

In [10]:
# Import GridSearchCV
# from sklearn.model_selection import GridSearchCV
rf = RandomForestRegressor()
# Instantiate grid_rf
grid_rf = GridSearchCV(estimator=rf,
                       param_grid=params_rf,
                       scoring='neg_mean_squared_error',
                       cv=3,
                       verbose=1,
                       n_jobs=-1)

In [11]:
# Load the bikes dataset
bike = pd.read_csv('../bikes.csv')
X = bike[['hr', 'holiday', 'workingday', 'temp', 'hum', 'windspeed', 'instant',
          'mnth', 'yr', 'Clear to partly cloudy', 'Light Precipitation', 'Misty']]
y = bike['cnt']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

# Fit grid_rf to the training set
grid_rf.fit(X_train, y_train)

Fitting 3 folds for each of 27 candidates, totalling 81 fits


GridSearchCV(cv=3, estimator=RandomForestRegressor(), n_jobs=-1,
             param_grid={'max_features': ['log2', 'auto', 'sqrt'],
                         'min_samples_leaf': [2, 10, 30],
                         'n_estimators': [100, 350, 500]},
             scoring='neg_mean_squared_error', verbose=1)

In [12]:
# Import mean_squared_error from sklearn.metrics as MSE 
# from sklearn.metrics import mean_squared_error as MSE

# Extract the best estimator
best_model = grid_rf.best_estimator_

# Predict test set labels
y_pred = best_model.predict(X_test)

# Compute rmse_test
rmse_test = MSE(y_test, y_pred) ** 0.5

# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test)) 

Test RMSE of best model: 51.421
