<a href="https://colab.research.google.com/github/Paulooh007/A-Guide-to-simple-cross-validation/blob/master/Hyperparameter_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn import ensemble
from sklearn import metrics
from sklearn import model_selection
from functools import partial

from sklearn.datasets import load_breast_cancer

In [3]:
X,  y = load_breast_cancer(return_X_y=True)

In [4]:
classifier = ensemble.RandomForestClassifier()

For a detailed explanation of cross-validation, see https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6

### GridSearchCV

Grid search is a hyperparameter tuning method that tries every combination of values in a user-specified grid in order to find the best settings for a given model. This matters because a model's performance depends heavily on the hyperparameter values it is trained with.

Each combination is evaluated with cross-validation, and the combination with the best mean score (accuracy in this notebook) is selected.

How to use GridSearchCV
- Define a dictionary describing the parameter space
- Pass the dictionary to GridSearchCV
- Fit on the feature matrix (X) and target (y)
- Print the best parameters found
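
To make the first step concrete, the sketch below enumerates a small parameter dictionary with `itertools.product`, which is roughly what a grid search iterates over. The values mirror the reduced grid used in this notebook; the fold count of 5 is an assumption matching the default `cv` in recent scikit-learn versions, and the variable names are illustrative only.

```python
from itertools import product

# Same shape as the reduced grid used in this notebook.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [1, 3],
}

# Every combination of values is one candidate for the grid search.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5  # assumed: the default number of CV folds in recent scikit-learn versions
total_fits = len(candidates) * n_folds

for candidate in candidates:
    print(candidate)
print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} fits")
```

The number of fits grows multiplicatively with every parameter added to the dictionary, which is why the grid here is kept small. scikit-learn's `model_selection.ParameterGrid` offers a similar enumeration.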

In [5]:
# param_grid  = {
#     'n_estimators': [100, 200, 300],
#     'max_depth': [1,3,4],
#     'criterion': ['gini', 'entropy']
# }

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [1, 3],
    # 'criterion': ['gini', 'entropy']
}

In [6]:
model = model_selection.GridSearchCV(
    estimator = classifier,
    param_grid = param_grid,
    scoring= 'accuracy',
    verbose= 10
)

In [7]:
model.fit(X, y)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] max_depth=1, n_estimators=100 ...................................
[CV] ....... max_depth=1, n_estimators=100, score=0.877, total=   0.2s
[CV] max_depth=1, n_estimators=100 ...................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s


[CV] ....... max_depth=1, n_estimators=100, score=0.912, total=   0.1s
[CV] max_depth=1, n_estimators=100 ...................................
[CV] ....... max_depth=1, n_estimators=100, score=0.939, total=   0.2s
[CV] max_depth=1, n_estimators=100 ...................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.5s remaining:    0.0s


[CV] ....... max_depth=1, n_estimators=100, score=0.939, total=   0.1s
[CV] max_depth=1, n_estimators=100 ...................................
[CV] ....... max_depth=1, n_estimators=100, score=0.947, total=   0.1s
[CV] max_depth=1, n_estimators=200 ...................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.8s remaining:    0.0s


[CV] ....... max_depth=1, n_estimators=200, score=0.868, total=   0.3s
[CV] max_depth=1, n_estimators=200 ...................................


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    1.1s remaining:    0.0s


[CV] ....... max_depth=1, n_estimators=200, score=0.921, total=   0.3s
[CV] max_depth=1, n_estimators=200 ...................................


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    1.4s remaining:    0.0s


[CV] ....... max_depth=1, n_estimators=200, score=0.939, total=   0.3s
[CV] max_depth=1, n_estimators=200 ...................................


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    1.6s remaining:    0.0s


[CV] ....... max_depth=1, n_estimators=200, score=0.956, total=   0.3s
[CV] max_depth=1, n_estimators=200 ...................................


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    2.0s remaining:    0.0s


[CV] ....... max_depth=1, n_estimators=200, score=0.956, total=   0.3s
[CV] max_depth=3, n_estimators=100 ...................................
[CV] ....... max_depth=3, n_estimators=100, score=0.939, total=   0.2s
[CV] max_depth=3, n_estimators=100 ...................................
[CV] ....... max_depth=3, n_estimators=100, score=0.930, total=   0.2s
[CV] max_depth=3, n_estimators=100 ...................................
[CV] ....... max_depth=3, n_estimators=100, score=0.982, total=   0.2s
[CV] max_depth=3, n_estimators=100 ...................................
[CV] ....... max_depth=3, n_estimators=100, score=0.965, total=   0.2s
[CV] max_depth=3, n_estimators=100 ...................................
[CV] ....... max_depth=3, n_estimators=100, score=0.965, total=   0.2s
[CV] max_depth=3, n_estimators=200 ...................................
[CV] ....... max_depth=3, n_estimators=200, score=0.921, total=   0.4s
[CV] max_depth=3, n_estimators=200 ...................................
[CV] .

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    4.9s finished


GridSearchCV(cv=None, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              ra

In [9]:
print(model.best_score_)
best_params = model.best_estimator_.get_params()

0.956078248719143


In [13]:
print('n_estimators: {}, max_depth: {}'.format(best_params['n_estimators'], best_params['max_depth']))

n_estimators: 100, max_depth: 3
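
As a possible shortcut, a fitted search object also exposes the winning combination directly as `best_params_`, and the per-candidate scores in `cv_results_`. The sketch below uses a deliberately tiny forest and a fixed `random_state` (both assumptions made to keep it fast and repeatable), so its scores will differ from the run above.

```python
from sklearn import ensemble, model_selection
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

search = model_selection.GridSearchCV(
    estimator=ensemble.RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [1, 3]},  # small values for speed
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)

# Only the tuned parameters, not every parameter of the estimator.
print(search.best_params_)

# Mean cross-validated accuracy for each candidate, in the order they were tried.
for params, score in zip(search.cv_results_["params"], search.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
```

Compared with `best_estimator_.get_params()`, `best_params_` contains only the keys that were searched over, which makes it easier to log or reuse.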


### RandomizedSearchCV
Using scikit-learn's RandomizedSearchCV, we define a grid of hyperparameter values and randomly sample combinations from it, running K-fold cross-validation on each sampled combination.

The most important arguments to RandomizedSearchCV are n_iter, the number of parameter combinations to sample, and cv, the number of folds to use for cross-validation. More iterations cover more of the search space, and more folds give a more reliable score estimate for each candidate, both at the cost of longer run time.
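
The trade-off can be sketched in plain Python, without scikit-learn. The ranges below match the grid defined for the randomized search in this notebook; the seed and variable names are illustrative. This sketch picks each value independently (with replacement) for simplicity, which may differ from how RandomizedSearchCV draws its candidates.

```python
import random

# Search space with the same ranges as the grid used for the randomized search.
param_grid = {
    "n_estimators": list(range(100, 1500, 100)),  # 100, 200, ..., 1400
    "max_depth": list(range(1, 20)),              # 1, 2, ..., 19
    "criterion": ["gini", "entropy"],
}

# Size of the full grid an exhaustive search would have to cover.
grid_size = 1
for values in param_grid.values():
    grid_size *= len(values)

n_iter = 5
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Draw n_iter candidates, picking each parameter value independently.
sampled = [
    {name: rng.choice(values) for name, values in param_grid.items()}
    for _ in range(n_iter)
]

print(f"full grid: {grid_size} combinations, sampled: {len(sampled)}")
for candidate in sampled:
    print(candidate)
```

With n_iter = 5, only a handful of the 532 possible combinations are evaluated, which is what keeps a randomized search affordable on large spaces.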

In [18]:
param_grid  = {
    'n_estimators': np.arange(100, 1500, 100),
    'max_depth': np.arange(1,20),
    'criterion': ['gini', 'entropy']
 }

In [15]:
model = model_selection.RandomizedSearchCV(
    estimator = classifier,
    param_distributions = param_grid,
    n_iter= 5,
    scoring = 'accuracy',
    verbose= 10,
    cv = 5
)

In [16]:
model.fit(X, y)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV] n_estimators=900, max_depth=3, criterion=gini ...................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  n_estimators=900, max_depth=3, criterion=gini, score=0.921, total=   1.6s
[CV] n_estimators=900, max_depth=3, criterion=gini ...................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.6s remaining:    0.0s


[CV]  n_estimators=900, max_depth=3, criterion=gini, score=0.939, total=   1.6s
[CV] n_estimators=900, max_depth=3, criterion=gini ...................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    3.1s remaining:    0.0s


[CV]  n_estimators=900, max_depth=3, criterion=gini, score=0.982, total=   1.6s
[CV] n_estimators=900, max_depth=3, criterion=gini ...................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    4.7s remaining:    0.0s


[CV]  n_estimators=900, max_depth=3, criterion=gini, score=0.956, total=   1.6s
[CV] n_estimators=900, max_depth=3, criterion=gini ...................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    6.3s remaining:    0.0s


[CV]  n_estimators=900, max_depth=3, criterion=gini, score=0.965, total=   1.5s
[CV] n_estimators=500, max_depth=18, criterion=gini ..................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    7.8s remaining:    0.0s


[CV]  n_estimators=500, max_depth=18, criterion=gini, score=0.930, total=   1.0s
[CV] n_estimators=500, max_depth=18, criterion=gini ..................


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    8.8s remaining:    0.0s


[CV]  n_estimators=500, max_depth=18, criterion=gini, score=0.947, total=   1.0s
[CV] n_estimators=500, max_depth=18, criterion=gini ..................


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    9.7s remaining:    0.0s


[CV]  n_estimators=500, max_depth=18, criterion=gini, score=0.991, total=   1.0s
[CV] n_estimators=500, max_depth=18, criterion=gini ..................


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   10.7s remaining:    0.0s


[CV]  n_estimators=500, max_depth=18, criterion=gini, score=0.974, total=   1.0s
[CV] n_estimators=500, max_depth=18, criterion=gini ..................


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:   11.7s remaining:    0.0s


[CV]  n_estimators=500, max_depth=18, criterion=gini, score=0.973, total=   1.0s
[CV] n_estimators=100, max_depth=13, criterion=entropy ...............
[CV]  n_estimators=100, max_depth=13, criterion=entropy, score=0.939, total=   0.2s
[CV] n_estimators=100, max_depth=13, criterion=entropy ...............
[CV]  n_estimators=100, max_depth=13, criterion=entropy, score=0.947, total=   0.2s
[CV] n_estimators=100, max_depth=13, criterion=entropy ...............
[CV]  n_estimators=100, max_depth=13, criterion=entropy, score=0.991, total=   0.2s
[CV] n_estimators=100, max_depth=13, criterion=entropy ...............
[CV]  n_estimators=100, max_depth=13, criterion=entropy, score=0.965, total=   0.2s
[CV] n_estimators=100, max_depth=13, criterion=entropy ...............
[CV]  n_estimators=100, max_depth=13, criterion=entropy, score=0.982, total=   0.2s
[CV] n_estimators=700, max_depth=12, criterion=gini ..................
[CV]  n_estimators=700, max_depth=12, criterion=gini, score=0.930, total=

[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:   28.3s finished


RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
               

In [19]:
print(model.best_score_)
best_params = model.best_estimator_.get_params()
print('n_estimators: {}, max_depth: {}'.format(best_params['n_estimators'], best_params['max_depth']))

0.9648812296227295
n_estimators: 100, max_depth: 13


### Scikit-optimize, Hyperopt and Optuna
(All three work in much the same way, with only slight differences in code.)

All three are libraries for minimizing (very) expensive and noisy black-box functions. They rely on sequential model-based optimization, where each new trial is chosen using the results of the previous ones.

Code Summary

- Define the function you want to minimize (in this example we don't want to minimize accuracy, so we minimize ```-1 * accuracy``` instead)
- Define the parameter space
- Use a minimization function (```gp_minimize```, ```fmin```) to find the parameters that minimize the metric
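
The sign flip in the first step can be illustrated with a toy example. Here `toy_accuracy` is a made-up stand-in for a cross-validated accuracy (it is not computed from any data), and a plain exhaustive loop plays the role of the minimizer.

```python
def toy_accuracy(max_depth):
    """Hypothetical accuracy curve that peaks at max_depth = 7."""
    return 0.97 - 0.002 * (max_depth - 7) ** 2


def objective(max_depth):
    # Minimizers look for the smallest value, so return the negated accuracy.
    return -1.0 * toy_accuracy(max_depth)


# Stand-in for gp_minimize / fmin: try every depth and keep the lowest objective.
search_space = range(3, 16)
best_depth = min(search_space, key=objective)

print(best_depth, -objective(best_depth))
```

The lowest value of the objective corresponds to the highest accuracy, so negating the minimum recovers the best score.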

In [20]:
from skopt import space
from skopt import gp_minimize

In [22]:
# The function we want to minimize: the negated mean cross-validated accuracy

def optimize(params, param_names, x, y):
    # gp_minimize passes params as a plain list, so pair each value with its name
    params = dict(zip(param_names, params))
    model = ensemble.RandomForestClassifier(**params)
    kf = model_selection.StratifiedKFold(n_splits=5)  # 5-fold cross-validation
    accuracies = []
    for train_idx, test_idx in kf.split(X=x, y=y):
        xtrain, ytrain = x[train_idx], y[train_idx]
        xtest, ytest = x[test_idx], y[test_idx]

        model.fit(xtrain, ytrain)
        preds = model.predict(xtest)
        fold_acc = metrics.accuracy_score(ytest, preds)
        accuracies.append(fold_acc)

    # Negate so that minimizing this value maximizes accuracy
    return -1.0 * np.mean(accuracies)
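
The manual fold loop above could probably be written more compactly with `model_selection.cross_val_score`, which uses stratified folds when given a classifier and an integer `cv`. This is a sketch of an alternative, not the notebook's code; the name `optimize_compact`, the fixed `random_state`, and the small forest are assumptions made for illustration.

```python
import numpy as np
from sklearn import ensemble, model_selection
from sklearn.datasets import load_breast_cancer


def optimize_compact(params, param_names, x, y):
    """Hypothetical variant of optimize() built on cross_val_score."""
    params = dict(zip(param_names, params))
    model = ensemble.RandomForestClassifier(random_state=0, **params)
    # cv=5 on a classifier uses stratified 5-fold splits.
    scores = model_selection.cross_val_score(model, x, y, cv=5, scoring="accuracy")
    return -1.0 * np.mean(scores)


X, y = load_breast_cancer(return_X_y=True)
loss = optimize_compact([3, 20], ["max_depth", "n_estimators"], X, y)
print(loss)  # a negative number: the negated mean accuracy
```

Because the signature matches the original function, it could be wrapped with `partial` in the same way before being handed to a minimizer.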

In [23]:
param_space = [
    space.Integer(3, 15, name = 'max_depth'), #search for optimal value of max_depth between 3 and 15
    space.Integer(100, 600, name= 'n_estimators'), #search for optimal value of n_estimators between 100 and 600
    space.Categorical(['gini', 'entropy'], name = 'criterion'),
    space.Real(0.01, 1, prior= 'uniform', name =  'max_features')
    ]

In [24]:
param_names = [
               'max_depth',
               'n_estimators',
               'criterion',
               'max_features'
]

In [30]:
optimization_function = partial(
    optimize,
    param_names = param_names,
    x = X,
    y=y)

In [31]:
result = gp_minimize(
    optimization_function,
    dimensions = param_space,
    n_calls = 15,  # Evaluate 15 points; more calls give a better chance of finding a good score
    n_random_starts = 10,  # The first 10 points are sampled at random before the model guides the search
    verbose = 10)

print(dict(zip(param_names,result.x)))

Iteration No: 1 started. Evaluating function at random point.
Iteration No: 1 ended. Evaluation done at random point.
Time taken: 2.4544
Function value obtained: -0.9561
Current minimum: -0.9561
Iteration No: 2 started. Evaluating function at random point.
Iteration No: 2 ended. Evaluation done at random point.
Time taken: 8.6690
Function value obtained: -0.9561
Current minimum: -0.9561
Iteration No: 3 started. Evaluating function at random point.
Iteration No: 3 ended. Evaluation done at random point.
Time taken: 10.7729
Function value obtained: -0.9631
Current minimum: -0.9631
Iteration No: 4 started. Evaluating function at random point.
Iteration No: 4 ended. Evaluation done at random point.
Time taken: 3.8499
Function value obtained: -0.9596
Current minimum: -0.9631
Iteration No: 5 started. Evaluating function at random point.
Iteration No: 5 ended. Evaluation done at random point.
Time taken: 1.5197
Function value obtained: -0.9666
Current minimum: -0.9666
Iteration No: 6 started.

### Hyperopt

In [32]:
from hyperopt import hp, fmin, tpe, Trials
from hyperopt.pyll.base import scope

In [33]:
def optimize(params, x, y):
    # Hyperopt already passes params as a dict, so no zipping with names is needed
    model = ensemble.RandomForestClassifier(**params)
    kf = model_selection.StratifiedKFold(n_splits=5)  # 5-fold cross-validation
    accuracies = []
    for train_idx, test_idx in kf.split(X=x, y=y):
        xtrain, ytrain = x[train_idx], y[train_idx]
        xtest, ytest = x[test_idx], y[test_idx]

        model.fit(xtrain, ytrain)
        preds = model.predict(xtest)
        fold_acc = metrics.accuracy_score(ytest, preds)
        accuracies.append(fold_acc)

    # Negate so that minimizing this value maximizes accuracy
    return -1.0 * np.mean(accuracies)

In [34]:
param_space = {
    'max_depth': scope.int(hp.quniform('max_depth', 3, 15, 1)),
    'n_estimators': scope.int(hp.quniform('n_estimators', 100, 600, 1)),
    'criterion': hp.choice('criterion', ['gini', 'entropy']),
    'max_features': hp.uniform('max_features', 0.01, 1)
}

In [35]:
optimization_function = partial(
    optimize,
    x = X,
    y=y)

trials = Trials()

result = fmin(
    fn = optimization_function,
    space= param_space,
    algo = tpe.suggest,
    max_evals = 15,  # Run 15 evaluations; more evaluations give a better chance of finding a good score
    trials = trials)

print(result)

100%|██████████| 15/15 [00:59<00:00,  3.94s/it, best loss: -0.9683900015525537]
{'criterion': 1, 'max_depth': 11.0, 'max_features': 0.7675116625583787, 'n_estimators': 254.0}
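
Note that the dictionary printed by `fmin` is not directly usable as estimator parameters: for `hp.choice` it holds the index of the selected option (here `1`), and the `hp.quniform` values come back as floats. Below is a minimal sketch of decoding it by hand, using the values printed above. Hyperopt also ships a `space_eval` helper that should perform the same mapping from the search space.

```python
# Result dictionary as printed by fmin in the run above.
result = {
    "criterion": 1,
    "max_depth": 11.0,
    "max_features": 0.7675116625583787,
    "n_estimators": 254.0,
}

# Must match the option order passed to hp.choice('criterion', ...).
criterion_options = ["gini", "entropy"]

best_params = {
    "criterion": criterion_options[result["criterion"]],  # index -> label
    "max_depth": int(result["max_depth"]),                # float -> int
    "n_estimators": int(result["n_estimators"]),          # float -> int
    "max_features": result["max_features"],
}
print(best_params)
```

The decoded dictionary can then be unpacked into `RandomForestClassifier(**best_params)` to train a final model.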


### Optuna

In [36]:
# pip install optuna

In [37]:
import optuna

In [38]:
def optimize(trial, x, y):
    # Ask Optuna to suggest a value for each hyperparameter in this trial
    criterion = trial.suggest_categorical('criterion', ['gini', 'entropy'])
    n_estimators = trial.suggest_int('n_estimators', 100, 1500)
    max_depth = trial.suggest_int('max_depth', 3, 15)
    # suggest_float replaces suggest_uniform, which is deprecated in recent Optuna releases
    max_features = trial.suggest_float('max_features', 0.01, 1.0)

    model = ensemble.RandomForestClassifier(
        n_estimators = n_estimators,
        max_depth = max_depth,
        criterion = criterion,
        max_features = max_features
    )
    kf = model_selection.StratifiedKFold(n_splits=5)  # 5-fold cross-validation
    accuracies = []
    for train_idx, test_idx in kf.split(X=x, y=y):
        xtrain, ytrain = x[train_idx], y[train_idx]
        xtest, ytest = x[test_idx], y[test_idx]

        model.fit(xtrain, ytrain)
        preds = model.predict(xtest)
        fold_acc = metrics.accuracy_score(ytest, preds)
        accuracies.append(fold_acc)

    # Negate so that minimizing this value maximizes accuracy
    return -1.0 * np.mean(accuracies)

In [39]:
optimization_function = partial(
    optimize,
    x = X,
    y=y
)


study = optuna.create_study(direction = 'minimize')
study.optimize(optimization_function, n_trials = 15)  # Run 15 trials; more trials give a better chance of finding a good score

[I 2020-08-15 13:08:18,232] Trial 0 finished with value: -0.9666200900481293 and parameters: {'criterion': 'entropy', 'n_estimators': 1307, 'max_depth': 9, 'max_features': 0.8229456489469515}. Best is trial 0 with value: -0.9666200900481293.
[I 2020-08-15 13:08:25,938] Trial 1 finished with value: -0.9613724576929048 and parameters: {'criterion': 'gini', 'n_estimators': 544, 'max_depth': 12, 'max_features': 0.4514836841906695}. Best is trial 0 with value: -0.9666200900481293.
[I 2020-08-15 13:08:37,271] Trial 2 finished with value: -0.9648657040832169 and parameters: {'criterion': 'entropy', 'n_estimators': 1048, 'max_depth': 9, 'max_features': 0.16168838831165075}. Best is trial 0 with value: -0.9666200900481293.
[I 2020-08-15 13:08:46,565] Trial 3 finished with value: -0.9508150908244062 and parameters: {'criterion': 'gini', 'n_estimators': 1127, 'max_depth': 3, 'max_features': 0.15181765897525304}. Best is trial 0 with value: -0.9666200900481293.
[I 2020-08-15 13:09:17,939] Trial 4 