<a href="https://colab.research.google.com/github/Neel7317/Hyper_Optimization_Techniques/blob/main/Automated_HyperTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Automated Hyperparameter Tuning can be done by using techniques such as**

-- Bayesian Optimization

-- Genetic Algorithms

-- Evolutionary Algorithms

##Bayesian Optimization##
Bayesian optimization builds a probabilistic model of the objective function and uses it to decide which input to evaluate next. The aim is to find the input value that gives the lowest possible output value. It often reaches good hyperparameters in fewer evaluations than random, grid, or manual search, which can mean better performance in the testing phase and reduced optimization time. In Hyperopt, Bayesian Optimization is implemented by passing three main arguments to the function fmin:

1. Objective Function: defines the loss function to minimize.

2. Domain Space: defines the range of input values to test (in Bayesian Optimization this space creates a probability distribution for each of the used Hyperparameters).

3. Optimization Algorithm: defines the search algorithm used to select the input values to try in each new iteration.
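
The three ingredients can be seen in isolation on a toy problem. The sketch below uses only the standard library and swaps in plain random search for Hyperopt's TPE algorithm, so it illustrates how the three parts fit together rather than Bayesian optimization itself. The names `minimize` and `random_search_suggest` and the toy quadratic are hypothetical and are not part of Hyperopt.

```python
import random

# 1. Objective function: returns the loss to minimize (toy quadratic, minimum at x = 3).
def objective(params):
    return (params["x"] - 3.0) ** 2

# 2. Domain space: one sampler per hyperparameter.
space = {"x": lambda rng: rng.uniform(-10.0, 10.0)}

# 3. Optimization algorithm: proposes the next candidate.
#    Random search ignores the history; TPE would use it to focus on promising regions.
def random_search_suggest(space, history, rng):
    return {name: sampler(rng) for name, sampler in space.items()}

def minimize(fn, space, algo, max_evals, seed=0):
    rng = random.Random(seed)
    history = []  # list of (loss, params) pairs
    for _ in range(max_evals):
        params = algo(space, history, rng)
        history.append((fn(params), params))
    # Return the evaluation with the lowest loss
    return min(history, key=lambda pair: pair[0])

best_loss, best_params = minimize(objective, space, random_search_suggest, max_evals=200)
print(best_loss, best_params)
```

In Hyperopt the same roles are played by the `fn`, `space`, and `algo` arguments of `fmin`.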

In [40]:
from sklearn.model_selection import cross_val_score,train_test_split

In [41]:
import pandas as pd

In [42]:
df=pd.read_csv('diabetes.csv')
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [43]:
import numpy as np
df['Glucose']=np.where(df['Glucose']==0,df['Glucose'].median(),df['Glucose'])
df['SkinThickness']=np.where(df['SkinThickness']==0,df['SkinThickness'].median(),df['SkinThickness'])
df['Insulin']=np.where(df['Insulin']==0,df['Insulin'].median(),df['Insulin'])

In [44]:
X_train,X_test,y_train,y_test=train_test_split(df.drop('Outcome',axis=1),df['Outcome'],test_size=0.2)

In [57]:
from hyperopt import hp,fmin,tpe,STATUS_OK,Trials

In [58]:
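space = {'criterion': hp.choice('criterion', ['entropy', 'gini']),
        # quniform returns floats in steps of 10; cast to int before passing to the model
        'max_depth': hp.quniform('max_depth', 10, 1200, 10),
        # 'auto' was deprecated in scikit-learn 1.1 and removed in 1.3; drop it on newer versions
        'max_features': hp.choice('max_features', ['auto', 'sqrt','log2', None]),
        # Floats are interpreted as fractions of the number of samples
        'min_samples_leaf': hp.uniform('min_samples_leaf', 0, 0.5),
        'min_samples_split' : hp.uniform ('min_samples_split', 0, 1),
        'n_estimators' : hp.choice('n_estimators', [10, 50, 300, 750, 1200,1300,1500])
    }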

In [59]:
space

{'criterion': <hyperopt.pyll.base.Apply at 0x7f107486d510>,
 'max_depth': <hyperopt.pyll.base.Apply at 0x7f107486d490>,
 'max_features': <hyperopt.pyll.base.Apply at 0x7f107486df10>,
 'min_samples_leaf': <hyperopt.pyll.base.Apply at 0x7f1075265ad0>,
 'min_samples_split': <hyperopt.pyll.base.Apply at 0x7f1074867dd0>,
 'n_estimators': <hyperopt.pyll.base.Apply at 0x7f1074867650>}

In [60]:

from sklearn.ensemble import RandomForestClassifier

def objective(space):
    # Build a random forest from one sampled point of the search space
    model = RandomForestClassifier(criterion = space['criterion'],
                                 max_depth = int(space['max_depth']),  # hp.quniform returns a float
                                 max_features = space['max_features'],
                                 min_samples_leaf = space['min_samples_leaf'],
                                 min_samples_split = space['min_samples_split'],
                                 n_estimators = space['n_estimators'],
                                 )

    # Mean 5-fold cross-validated accuracy on the training set
    accuracy = cross_val_score(model, X_train, y_train, cv = 5).mean()

    # We aim to maximize accuracy, therefore we return it as a negative value
    return {'loss': -accuracy, 'status': STATUS_OK }

In [61]:

# Run 80 TPE iterations; Trials records the result of every evaluation
trials = Trials()
best = fmin(fn = objective,
            space = space,
            algo = tpe.suggest,
            max_evals = 80,
            trials = trials)
best

100%|██████████| 80/80 [07:08<00:00,  5.36s/it, best loss: -0.7442356390777023]


{'criterion': 0,
 'max_depth': 1130.0,
 'max_features': 0,
 'min_samples_leaf': 0.011493030287712619,
 'min_samples_split': 0.08699179630454579,
 'n_estimators': 4}

In [62]:
crit = {0: 'entropy', 1: 'gini'}
feat = {0: 'auto', 1: 'sqrt', 2: 'log2', 3: None}
est = {0: 10, 1: 50, 2: 300, 3: 750, 4: 1200,5:1300,6:1500}


print(crit[best['criterion']])
print(feat[best['max_features']])
print(est[best['n_estimators']])

entropy
auto
1200


In [63]:
best['min_samples_leaf']

0.011493030287712619

In [65]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

In [66]:

# Refit on the training set with the best hyperparameters found by fmin
trainedforest = RandomForestClassifier(criterion = crit[best['criterion']],
                                       max_depth = int(best['max_depth']),  # fmin returns a float here
                                       max_features = feat[best['max_features']],
                                       min_samples_leaf = best['min_samples_leaf'],
                                       min_samples_split = best['min_samples_split'],
                                       n_estimators = est[best['n_estimators']]).fit(X_train,y_train)
predictionforest = trainedforest.predict(X_test)
print(confusion_matrix(y_test,predictionforest))
print(accuracy_score(y_test,predictionforest))
print(classification_report(y_test,predictionforest))
acc5 = accuracy_score(y_test,predictionforest)

[[91  9]
 [20 34]]
0.8116883116883117
              precision    recall  f1-score   support

           0       0.82      0.91      0.86       100
           1       0.79      0.63      0.70        54

    accuracy                           0.81       154
   macro avg       0.81      0.77      0.78       154
weighted avg       0.81      0.81      0.81       154
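
The figures in the classification report follow directly from the confusion matrix. As a worked check using only the counts printed above (rows are true classes and columns are predicted classes, which is scikit-learn's convention):

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class.
tn, fp = 91, 9    # true class 0
fn, tp = 20, 34   # true class 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # of the predicted positives, how many are correct
recall_1 = tp / (tp + fn)      # of the actual positives, how many are found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
# 0.8117 0.79 0.63 0.7
```

The recall of 0.63 for class 1 shows that about a third of the positive cases in the test set are missed, even though overall accuracy is 0.81.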



##Genetic Algorithms##

Genetic Algorithms try to apply natural selection mechanisms to Machine Learning contexts.

Let's imagine we create a population of N Machine Learning models with some predefined Hyperparameters. We can then calculate the accuracy of each model and decide to keep just half of the models (the ones that perform best). We can now generate offspring with Hyperparameters similar to those of the best models, so that we again have a population of N models. At this point we can again calculate the accuracy of each model and repeat the cycle for a defined number of generations. In this way, only the best models survive at the end of the process.
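
The loop described above can be sketched on a toy problem. This is a minimal illustration using only the standard library, not TPOT's implementation: the `fitness` function stands in for a model's cross-validated accuracy, and the names `SPACE`, `mutate`, and `evolve` as well as the candidate values are hypothetical.

```python
import random

# Hypothetical search space: candidate values per hyperparameter.
SPACE = {
    "n_estimators": [10, 50, 100, 200, 400],
    "max_depth": [2, 4, 8, 16, 32],
}

def fitness(ind):
    # Stand-in for cross-validated accuracy; peaks at n_estimators=200, max_depth=8.
    return -abs(ind["n_estimators"] - 200) / 400 - abs(ind["max_depth"] - 8) / 32

def mutate(parent, rng):
    # Offspring copies the parent and resamples one hyperparameter.
    child = dict(parent)
    key = rng.choice(list(SPACE))
    child[key] = rng.choice(SPACE[key])
    return child

def evolve(pop_size=8, generations=10, seed=0):
    rng = random.Random(seed)
    population = [{k: rng.choice(v) for k, v in SPACE.items()} for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half of the population.
        survivors = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        # Reproduction: refill the population with mutated copies of survivors.
        offspring = [mutate(rng.choice(survivors), rng)
                     for _ in range(pop_size - len(survivors))]
        population = survivors + offspring
    return max(population, key=fitness)

best_individual = evolve()
print(best_individual, fitness(best_individual))
```

Because the survivors are carried over unchanged, the best fitness in the population can never decrease from one generation to the next.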

In [67]:

import numpy as np
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split ('auto' was removed in scikit-learn 1.3)
max_features = ['auto', 'sqrt','log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 1000,10)]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10,14]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4,6,8]
# Create the grid of candidate values that TPOT will search over
param = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
              'criterion':['entropy','gini']}
print(param)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000], 'min_samples_split': [2, 5, 10, 14], 'min_samples_leaf': [1, 2, 4, 6, 8], 'criterion': ['entropy', 'gini']}


In [68]:
!pip install tpot


In [69]:
from tpot import TPOTClassifier


In [70]:
tpot_classifier=TPOTClassifier(generations=5,population_size=24,offspring_size=12,cv=5,n_jobs=-1,config_dict={'sklearn.ensemble.RandomForestClassifier': param},
                                verbosity= 2, early_stop= 12,scoring='accuracy')


In [71]:
tpot_classifier.fit(X_train,y_train)



Generation 1 - Current best internal CV score: 0.7540050646408104

Generation 2 - Current best internal CV score: 0.7540050646408104

Generation 3 - Current best internal CV score: 0.7540050646408104

Generation 4 - Current best internal CV score: 0.7540050646408104

Generation 5 - Current best internal CV score: 0.7588964414234306

Best pipeline: RandomForestClassifier(RandomForestClassifier(input_matrix, criterion=entropy, max_depth=670, max_features=auto, min_samples_leaf=6, min_samples_split=5, n_estimators=400), criterion=gini, max_depth=670, max_features=sqrt, min_samples_leaf=8, min_samples_split=5, n_estimators=1600)



In [72]:
accuracy = tpot_classifier.score(X_test, y_test)
print(accuracy)

0.8181818181818182


##Optimize hyperparameters of the model using Optuna##

The random forest hyperparameters tuned here are n_estimators and max_depth; we try different values to see whether the model accuracy can be improved. The objective below also lets Optuna choose between a random forest and an SVC, for which C is tuned. With Optuna, the objective function accepts a trial object, which provides several methods for sampling hyperparameters. We create a study to run the hyperparameter optimization and finally read the best hyperparameters from it.

In [74]:
!pip install optuna


In [75]:
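import optuna
import sklearn.ensemble
import sklearn.model_selection
import sklearn.svm

def objective(trial):
    # Let Optuna choose which model family to try in this trial
    classifier = trial.suggest_categorical('classifier', ['RandomForest', 'SVC'])

    if classifier == 'RandomForest':
        n_estimators = trial.suggest_int('n_estimators', 200, 2000, step=10)
        # Sampled on a log scale as a float, then truncated to an integer depth
        max_depth = int(trial.suggest_float('max_depth', 10, 100, log=True))

        clf = sklearn.ensemble.RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth)
    else:
        # Regularization strength C, sampled on a log scale
        c = trial.suggest_float('svc_c', 1e-10, 1e10, log=True)

        clf = sklearn.svm.SVC(C=c, gamma='auto')

    # Mean 3-fold cross-validated accuracy; the study maximizes this value
    return sklearn.model_selection.cross_val_score(
        clf, X_train, y_train, n_jobs=-1, cv=3).mean()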
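
The `log=True` flag used above asks for values spread evenly across orders of magnitude rather than evenly across the raw range, which matters for a parameter such as `C` that spans `1e-10` to `1e10`. The idea can be sketched with the standard library alone; `sample_log_uniform` is a hypothetical helper written for illustration and is not Optuna's implementation.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    # Sample uniformly in log space, then map back with exp.
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
draws = [sample_log_uniform(1e-10, 1e10, rng) for _ in range(10_000)]

# Roughly half of the draws fall below 1, the geometric midpoint of the range.
share_below_one = sum(d < 1.0 for d in draws) / len(draws)
print(share_below_one)

# Plain uniform sampling over the same range almost never produces a value below 1.
uniform_draws = [rng.uniform(1e-10, 1e10) for _ in range(10_000)]
print(sum(d < 1.0 for d in uniform_draws))
```

Without the log scale, nearly every trial would test a very large `C`, and the small values would go unexplored.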

In [76]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

trial = study.best_trial

print('Accuracy: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))

[I 2021-04-26 13:38:35,305] A new study created in memory with name: no-name-cd5f5abe-ebb9-4391-967d-e43ae486bc25
[I 2021-04-26 13:38:42,234] Trial 0 finished with value: 0.74095329188586 and parameters: {'classifier': 'RandomForest', 'n_estimators': 1440, 'max_depth': 49.336481575551375}. Best is trial 0 with value: 0.74095329188586.
[I 2021-04-26 13:38:49,251] Trial 1 finished with value: 0.7458552526701737 and parameters: {'classifier': 'RandomForest', 'n_estimators': 1480, 'max_depth': 39.22169710858817}. Best is trial 1 with value: 0.7458552526701737.
[I 2021-04-26 13:38:54,725] Trial 2 finished with value: 0.7425872788139646 and parameters: {'classifier': 'RandomForest', 'n_estimators': 1150, 'max_depth': 35.37924248478869}. Best is trial 1 with value: 0.7458552526701737.
[I 2021-04-26 13:39:00,915] Trial 3 finished with value: 0.7393193049577554 and parameters: {'classifier': 'RandomForest', 'n_estimators': 1320, 'max_

Accuracy: 0.7523433763749402
Best hyperparameters: {'classifier': 'RandomForest', 'n_estimators': 1430, 'max_depth': 13.112747031654637}


In [77]:

study.best_params

{'classifier': 'RandomForest',
 'max_depth': 13.112747031654637,
 'n_estimators': 1430}

In [78]:
rf=RandomForestClassifier(n_estimators=1430,max_depth=13)
rf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=13, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1430,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [79]:

y_pred=rf.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[83 17]
 [17 37]]
0.7792207792207793
              precision    recall  f1-score   support

           0       0.83      0.83      0.83       100
           1       0.69      0.69      0.69        54

    accuracy                           0.78       154
   macro avg       0.76      0.76      0.76       154
weighted avg       0.78      0.78      0.78       154

