## All Hyperparameter Optimization

1. GridSearchCV
2. RandomizedSearchCV
3. Bayesian Optimization - Automate Hyperparameter Tuning (Hyperopt)
4. Sequential Model-Based Optimization (Tuning a scikit-learn estimator with skopt)
5. Optuna - Automate Hyperparameter Tuning
6. Genetic Algorithms (TPOT Classifier)
## References
1. https://github.com/fmfn/BayesianOptimization
2. https://github.com/hyperopt/hyperopt
3. https://www.jeremyjordan.me/hyperparameter-tuning/
4. https://optuna.org/
5. https://towardsdatascience.com/hyperparameters-optimization-526348bb8e2d (by Pier Paolo Ippolito)
6. https://scikit-optimize.github.io/stable/auto_examples/hyperparameter-optimization.html

In [46]:
import numpy as np
import pandas as pd


In [47]:
df=pd.read_csv('https://raw.githubusercontent.com/krishnaik06/All-Hyperparamter-Optimization/master/diabetes.csv')

In [48]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [49]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [50]:
# Treat zeros in these columns as missing values and replace them with the column median
df['Glucose']=np.where(df['Glucose']==0,df['Glucose'].median(),df['Glucose'])
df['Pregnancies']=np.where(df['Pregnancies']==0,df['Pregnancies'].median(),df['Pregnancies'])
df['Insulin']=np.where(df['Insulin']==0,df['Insulin'].median(),df['Insulin'])
df['SkinThickness']=np.where(df['SkinThickness']==0,df['SkinThickness'].median(),df['SkinThickness'])

In [51]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,72,35.0,30.5,33.6,0.627,50,1
1,1.0,85.0,66,29.0,30.5,26.6,0.351,31,0
2,8.0,183.0,64,23.0,30.5,23.3,0.672,32,1
3,1.0,89.0,66,23.0,94.0,28.1,0.167,21,0
4,3.0,137.0,40,35.0,168.0,43.1,2.288,33,1


In [52]:
X=df.iloc[:,:-1]
y=df.iloc[:,-1]

In [53]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)


In [54]:
from sklearn.ensemble  import RandomForestClassifier
tree_model=RandomForestClassifier(n_estimators=10)
tree_model.fit(x_train,y_train)



RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [55]:
prediction=tree_model.predict(x_test)

In [56]:

from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
print(confusion_matrix(y_test,prediction))
print(accuracy_score(y_test,prediction))
print(classification_report(y_test,prediction))

[[93 14]
 [21 26]]
0.7727272727272727
              precision    recall  f1-score   support

           0       0.82      0.87      0.84       107
           1       0.65      0.55      0.60        47

    accuracy                           0.77       154
   macro avg       0.73      0.71      0.72       154
weighted avg       0.77      0.77      0.77       154



## RandomizedSearchCV

In [57]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in the random forest
n_estimators=[int(x) for x in np.linspace(start=200,stop=2000,num=10)]
# Maximum number of levels in each tree
max_depth=[int(x) for x in np.linspace(start=10,stop=1000,num=10)]
# Number of features to consider at each split
# ('auto' is only accepted by older scikit-learn releases; it was removed in 1.3)
max_features=['auto','sqrt','log2']
# Minimum number of samples required to split a node
min_samples_split=[2,5,10,14]
# Minimum number of samples required at each leaf node
min_samples_leaf=[1,2,4,6,8]
# Candidate random_state values (not used in the grid below)
random_state=[int(x) for x in np.linspace(start=10,stop=100,num=10)]

In [58]:
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
              'criterion':['entropy','gini']
              }
print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000], 'min_samples_split': [2, 5, 10, 14], 'min_samples_leaf': [1, 2, 4, 6, 8], 'criterion': ['entropy', 'gini']}
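
To give a sense of scale, the sketch below counts how many combinations the grid printed above contains and draws 100 of them at random, which is roughly what a randomized search over list-valued parameters does. It uses only the standard library rather than scikit-learn, and the names `all_candidates` and `sampled` are illustrative assumptions, not part of the notebook.

```python
import itertools
import random

# Same values as the grid printed above.
random_grid = {
    "n_estimators": list(range(200, 2001, 200)),
    "max_features": ["auto", "sqrt", "log2"],
    "max_depth": list(range(10, 1001, 110)),
    "min_samples_split": [2, 5, 10, 14],
    "min_samples_leaf": [1, 2, 4, 6, 8],
    "criterion": ["entropy", "gini"],
}

# An exhaustive grid search would have to evaluate every combination.
names = list(random_grid)
all_candidates = [
    dict(zip(names, values))
    for values in itertools.product(*random_grid.values())
]
print("combinations in the full grid:", len(all_candidates))

# A randomized search only evaluates a fixed number of sampled candidates.
rng = random.Random(0)
sampled = rng.sample(all_candidates, 100)
print("candidates sampled:", len(sampled))
print("model fits with 3-fold CV:", len(sampled) * 3)
```

With 3-fold cross-validation, 100 sampled candidates correspond to the 300 fits reported in the search log, whereas the full grid would need 36,000 fits.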


In [60]:
rf=RandomForestClassifier()
RandomizedSearchcv=RandomizedSearchCV(estimator=rf,param_distributions=random_grid,n_iter=100, n_jobs=-1, cv=3,
    verbose=2)
RandomizedSearchcv.fit(x_train,y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   59.2s


KeyboardInterrupt: 

In [None]:
RandomizedSearchcv.best_params_

In [None]:
best_random_grid=RandomizedSearchcv.best_estimator_
best_random_grid

In [None]:
y_pred=best_random_grid.predict(x_test)

In [None]:
from sklearn.metrics import accuracy_score

print(confusion_matrix(y_test,y_pred))
print("Accuracy Score {}".format(accuracy_score(y_test,y_pred)))
print("Classification report: {}".format(classification_report(y_test,y_pred)))

## GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV


In [None]:
# Build a narrow grid around the best RandomizedSearchCV result
best_params = RandomizedSearchcv.best_params_
param_grid = {
    'criterion': [best_params['criterion']],
    'max_depth': [best_params['max_depth']],
    'max_features': [best_params['max_features']],
    'min_samples_leaf': [best_params['min_samples_leaf'],
                         best_params['min_samples_leaf'] + 2,
                         best_params['min_samples_leaf'] + 4],
    # min_samples_split must be at least 2, so clamp and de-duplicate
    'min_samples_split': sorted({max(2, best_params['min_samples_split'] + d)
                                 for d in (-2, -1, 0, 1, 2)}),
    # n_estimators must stay positive, so clamp at 100 and de-duplicate
    'n_estimators': sorted({max(100, best_params['n_estimators'] + d)
                            for d in (-200, -100, 0, 100, 200)})
}

print(param_grid)

In [None]:
rf=RandomForestClassifier()
GridSearchcv=GridSearchCV(estimator=rf,param_grid=param_grid, n_jobs=-1, cv=3,
    verbose=2)
GridSearchcv.fit(x_train,y_train)

In [None]:
GridSearchcv.best_params_

In [None]:
best_grid=GridSearchcv.best_estimator_

In [None]:
y_pred=best_grid.predict(x_test)
print(confusion_matrix(y_test,y_pred))
print("Accuracy Score {}".format(accuracy_score(y_test,y_pred)))
print("Classification report: {}".format(classification_report(y_test,y_pred)))

## Automate Hyperparameter Tuning (Hyperopt)

### Bayesian Optimization 
Bayesian optimization uses a probabilistic model to find the minimum of a function. The aim is to find the input value that gives the lowest possible output value. It often performs better than random, grid, and manual search, yielding better performance in the testing phase and shorter optimization time. In Hyperopt, Bayesian optimization is implemented by passing three main parameters to the function fmin.

- Objective Function = defines the loss function to minimize.
- Domain Space = defines the range of input values to test (in Bayesian Optimization this space creates a probability distribution for each of the used Hyperparameters).
- Optimization Algorithm = defines the search algorithm to use to select the best input values to use in each new iteration.

In [None]:
pip install hyperopt

In [None]:
from hyperopt import hp,fmin,tpe,STATUS_OK,Trials

In [None]:
space = {'criterion': hp.choice('criterion', ['entropy', 'gini']),
        'max_depth': hp.quniform('max_depth', 10, 1200, 10),
        'max_features': hp.choice('max_features', ['auto', 'sqrt','log2', None]),
        'min_samples_leaf': hp.uniform('min_samples_leaf', 0, 0.5),
        'min_samples_split' : hp.uniform ('min_samples_split', 0, 1),
        'n_estimators' : hp.choice('n_estimators', [10, 50, 300, 750, 1200,1300,1500])
    }

In [None]:
from sklearn.model_selection import cross_val_score

def objective(space):
    # hp.quniform returns a float, so cast max_depth to the int scikit-learn expects
    model = RandomForestClassifier(criterion = space['criterion'], max_depth = int(space['max_depth']),
                                 max_features = space['max_features'],
                                 min_samples_leaf = space['min_samples_leaf'],
                                 min_samples_split = space['min_samples_split'],
                                 n_estimators = space['n_estimators'], 
                                 )
    
    accuracy = cross_val_score(model, x_train, y_train, cv = 5).mean()

    # We aim to maximize accuracy, therefore we return it as a negative value
    return {'loss': -accuracy, 'status': STATUS_OK }

In [None]:
from sklearn.model_selection import cross_val_score
trials = Trials()
best = fmin(fn= objective,
            space= space,
            algo= tpe.suggest,
            max_evals = 80,
            trials= trials)
best

In [None]:
crit = {0: 'entropy', 1: 'gini'}
feat = {0: 'auto', 1: 'sqrt', 2: 'log2', 3: None}
est = {0: 10, 1: 50, 2: 300, 3: 750, 4: 1200,5:1300,6:1500}


print(crit[best['criterion']])
print(feat[best['max_features']])
print(est[best['n_estimators']])

In [None]:
best['min_samples_leaf']

In [None]:
# hp.quniform returns a float, so cast max_depth to int before fitting
trainedforest = RandomForestClassifier(criterion = crit[best['criterion']], max_depth = int(best['max_depth']), 
                                       max_features = feat[best['max_features']], 
                                       min_samples_leaf = best['min_samples_leaf'], 
                                       min_samples_split = best['min_samples_split'], 
                                       n_estimators = est[best['n_estimators']]).fit(x_train,y_train)
predictionforest = trainedforest.predict(x_test)
print(confusion_matrix(y_test,predictionforest))
print(accuracy_score(y_test,predictionforest))
print(classification_report(y_test,predictionforest))
acc5 = accuracy_score(y_test,predictionforest)

## Genetic Algorithms
Genetic algorithms apply natural-selection mechanisms to machine learning.

Let's imagine we create a population of N machine learning models with some predefined hyperparameters. We can then calculate the accuracy of each model and keep just half of the models (the ones that perform best). We can now generate offspring with hyperparameters similar to those of the best models, so that we again have a population of N models. At this point we can again calculate the accuracy of each model and repeat the cycle for a defined number of generations. In this way, only the best models survive at the end of the process.
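
The cycle described above can be sketched in plain Python. This is an illustration under stated assumptions, not how TPOT is implemented: `SPACE` is a made-up search space, `toy_score` is a stand-in for model accuracy (in practice it would be a cross-validated score), and offspring are produced by resampling a single hyperparameter of a surviving parent.

```python
import random

# Hypothetical search space (values chosen for illustration only).
SPACE = {
    "n_estimators": [200, 400, 600, 800, 1000],
    "max_depth": [10, 120, 230, 340, 450],
    "min_samples_leaf": [1, 2, 4, 6, 8],
}


def toy_score(params):
    """Stand-in for model accuracy; higher is better."""
    return -(
        abs(params["n_estimators"] - 600) / 1000
        + abs(params["max_depth"] - 230) / 500
        + abs(params["min_samples_leaf"] - 2) / 10
    )


def random_individual(rng):
    """Draw one random hyperparameter setting from the space."""
    return {name: rng.choice(values) for name, values in SPACE.items()}


def mutate(parent, rng):
    """Copy a parent and resample one of its hyperparameters."""
    child = dict(parent)
    name = rng.choice(list(SPACE))
    child[name] = rng.choice(SPACE[name])
    return child


def evolve(n=12, generations=10, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(n)]
    best_per_generation = []
    for _ in range(generations):
        # Rank the population and keep the better half.
        ranked = sorted(population, key=toy_score, reverse=True)
        survivors = ranked[: n // 2]
        best_per_generation.append(toy_score(ranked[0]))
        # Refill the population with offspring similar to the survivors.
        offspring = [
            mutate(rng.choice(survivors), rng)
            for _ in range(n - len(survivors))
        ]
        population = survivors + offspring
    return population, best_per_generation


population, history = evolve()
best_individual = max(population, key=toy_score)
print(best_individual, toy_score(best_individual))
```

Because the survivors are carried over unchanged, the best score in the population can never get worse from one generation to the next.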

In [61]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in the random forest
n_estimators=[int(x) for x in np.linspace(start=200,stop=2000,num=10)]
# Maximum number of levels in each tree
max_depth=[int(x) for x in np.linspace(start=10,stop=1000,num=10)]
# Number of features to consider at each split
# ('auto' is only accepted by older scikit-learn releases; it was removed in 1.3)
max_features=['auto','sqrt','log2']
# Minimum number of samples required to split a node
min_samples_split=[2,5,10,14]
# Minimum number of samples required at each leaf node
min_samples_leaf=[1,2,4,6,8]
# Candidate random_state values (not used in the parameter dictionary below)
random_state=[int(x) for x in np.linspace(start=10,stop=100,num=10)]

In [66]:
param = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
              'criterion':['entropy','gini']
              }
print(param)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000], 'min_samples_split': [2, 5, 10, 14], 'min_samples_leaf': [1, 2, 4, 6, 8], 'criterion': ['entropy', 'gini']}


In [62]:
pip install tpot

Collecting tpot
  Using cached TPOT-0.11.5-py3-none-any.whl (82 kB)
Collecting deap>=1.2
  Downloading deap-1.3.1-cp37-cp37m-win_amd64.whl (108 kB)
Collecting update-checker>=0.16
  Using cached update_checker-0.17-py2.py3-none-any.whl (7.0 kB)
Collecting stopit>=1.1.1
  Using cached stopit-1.1.2.tar.gz (18 kB)
Building wheels for collected packages: stopit
  Building wheel for stopit (setup.py): started
  Building wheel for stopit (setup.py): finished with status 'done'
  Created wheel for stopit: filename=stopit-1.1.2-py3-none-any.whl size=11959 sha256=9bf296d45ff04c64c67e50d127422e0c34b64a9382cb10460d7f81fded3c818b
  Stored in directory: c:\users\jesal\appdata\local\pip\cache\wheels\e2\d2\79\eaf81edb391e27c87f51b8ef901ecc85a5363dc96b8b8d71e3
Successfully built stopit
Installing collected packages: deap, update-checker, stopit, tpot
Successfully installed deap-1.3.1 stopit-1.1.2 tpot-0.11.5 update-checker-0.17
Note: you may need to restart the kernel to use updated packages.


In [None]:
from tpot import TPOTClassifier


tpot_classifier = TPOTClassifier(generations= 5, population_size= 24, offspring_size= 12,
                                 verbosity= 2, early_stop= 12,
                                 config_dict={'sklearn.ensemble.RandomForestClassifier': param}, 
                                 cv = 4, scoring = 'accuracy')
tpot_classifier.fit(x_train,y_train)


In [None]:

accuracy = tpot_classifier.score(x_test, y_test)
print(accuracy)

## Optimize hyperparameters of the model using Optuna
The random forest used above has hyperparameters such as n_estimators and max_depth, for which we can try different values to see whether the model accuracy improves. With Optuna, the objective function accepts a trial object, which has several methods for sampling hyperparameters. We create a study to run the hyperparameter optimization and finally read the best hyperparameters.

In [None]:

import optuna
import sklearn.ensemble
import sklearn.model_selection
import sklearn.svm
def objective(trial):

    # Let Optuna choose between the two model families
    classifier = trial.suggest_categorical('classifier', ['RandomForest', 'SVC'])
    
    if classifier == 'RandomForest':
        # step is passed by keyword, as newer Optuna releases require
        n_estimators = trial.suggest_int('n_estimators', 200, 2000, step=10)
        max_depth = int(trial.suggest_float('max_depth', 10, 100, log=True))

        clf = sklearn.ensemble.RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth)
    else:
        c = trial.suggest_float('svc_c', 1e-10, 1e10, log=True)
        
        clf = sklearn.svm.SVC(C=c, gamma='auto')

    # Mean cross-validated accuracy on the training split
    return sklearn.model_selection.cross_val_score(
        clf,x_train,y_train, n_jobs=-1, cv=3).mean()

In [None]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

trial = study.best_trial

print('Accuracy: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))

In [None]:

trial

In [None]:

study.best_params