# Hyperparameter Optimization Techniques:
    
## GridSearchCV
## RandomizedSearchCV
## Bayesian Optimization - Automate Hyperparameter Tuning (Hyperopt)
## Optuna - Automate Hyperparameter Tuning
## Genetic Algorithms (TPOT Classifier)

In [1]:
import pandas as pd
df=pd.read_csv('diabetes.csv')
df.head()

Unnamed: 0,Number of times pregnant,Plasma glucose concentration,Diastolic blood pressure,Triceps skin fold thickness,2-Hour serum insulin,Body mass index,Age,Class
0,6,148,72,35,0,33.6,50,positive
1,1,85,66,29,0,26.6,31,negative
2,8,183,64,0,0,23.3,32,positive
3,1,89,66,23,94,28.1,21,negative
4,0,137,40,35,168,43.1,33,positive


In [2]:
df.isnull().sum()

Number of times pregnant        0
Plasma glucose concentration    0
Diastolic blood pressure        0
Triceps skin fold thickness     0
2-Hour serum insulin            0
Body mass index                 0
Age                             0
Class                           0
dtype: int64

In [3]:
X=df.drop('Class',axis=1)
y=df['Class']

In [4]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=0)

## Random Forest Model Building

In [5]:

from sklearn.ensemble import RandomForestClassifier
rf_classifier=RandomForestClassifier(n_estimators=10).fit(X_train,y_train)
prediction=rf_classifier.predict(X_test)

In [6]:
y.value_counts()

negative    500
positive    268
Name: Class, dtype: int64

In [7]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
print(confusion_matrix(y_test,prediction))
print(accuracy_score(y_test,prediction))
print(classification_report(y_test,prediction))

[[92 15]
 [15 32]]
0.8051948051948052
              precision    recall  f1-score   support

    negative       0.86      0.86      0.86       107
    positive       0.68      0.68      0.68        47

    accuracy                           0.81       154
   macro avg       0.77      0.77      0.77       154
weighted avg       0.81      0.81      0.81       154



## Parameters of random forest

The main parameters used by a Random Forest Classifier are:

criterion = the function used to evaluate the quality of a split.

max_depth = maximum number of levels allowed in each tree.

max_features = maximum number of features considered when splitting a node.

min_samples_leaf = minimum number of samples which can be stored in a tree leaf.

min_samples_split = minimum number of samples necessary in a node to cause node splitting.

n_estimators = number of trees in the ensemble.
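To make the two sample-count parameters concrete, here is a minimal sketch in plain Python of the rule they impose on a single candidate split. The function `split_allowed` and its arguments are hypothetical names used only for illustration; scikit-learn applies this kind of check internally while growing each tree.

```python
def split_allowed(n_node, n_left, n_right, min_samples_split=2, min_samples_leaf=1):
    """Return True if a node with n_node samples may be split into the two children."""
    if n_left + n_right != n_node:
        raise ValueError("the two children must partition the node")
    # A node is only considered for splitting if it holds enough samples.
    if n_node < min_samples_split:
        return False
    # Both children must keep at least min_samples_leaf samples.
    return n_left >= min_samples_leaf and n_right >= min_samples_leaf


# Right child would hold only 3 samples, fewer than min_samples_leaf=4.
print(split_allowed(12, 9, 3, min_samples_split=10, min_samples_leaf=4))  # False
# Node is large enough and both children are large enough.
print(split_allowed(12, 6, 6, min_samples_split=10, min_samples_leaf=4))  # True
# Node holds only 8 samples, fewer than min_samples_split=10.
print(split_allowed(8, 4, 4, min_samples_split=10, min_samples_leaf=4))   # False
```

Larger values of either parameter stop splitting earlier, which gives shallower trees and usually less overfitting.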

### Manual hyperparameter tuning

In [8]:
model=RandomForestClassifier(n_estimators=300,criterion='entropy',
                             max_features='sqrt',min_samples_leaf=10,random_state=100).fit(X_train,y_train)
predictions=model.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(accuracy_score(y_test,predictions))
print(classification_report(y_test,predictions))

[[97 10]
 [18 29]]
0.8181818181818182
              precision    recall  f1-score   support

    negative       0.84      0.91      0.87       107
    positive       0.74      0.62      0.67        47

    accuracy                           0.82       154
   macro avg       0.79      0.76      0.77       154
weighted avg       0.81      0.82      0.81       154



# RandomizedSearchCV

In [9]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; drop it from the list on newer versions)
max_features = ['auto', 'sqrt','log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 1000,10)]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10,14]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4,6,8]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
              'criterion':['entropy','gini']}
print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000], 'min_samples_split': [2, 5, 10, 14], 'min_samples_leaf': [1, 2, 4, 6, 8], 'criterion': ['entropy', 'gini']}


In [10]:

rf=RandomForestClassifier()
rf_randomcv=RandomizedSearchCV(estimator=rf,param_distributions=random_grid,n_iter=100,cv=3,verbose=2,
                               random_state=100,n_jobs=-1)
### fit the randomized model
rf_randomcv.fit(X_train,y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   37.9s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  5.2min finished


RandomizedSearchCV(cv=3, estimator=RandomForestClassifier(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'criterion': ['entropy', 'gini'],
                                        'max_depth': [10, 120, 230, 340, 450,
                                                      560, 670, 780, 890,
                                                      1000],
                                        'max_features': ['auto', 'sqrt',
                                                         'log2'],
                                        'min_samples_leaf': [1, 2, 4, 6, 8],
                                        'min_samples_split': [2, 5, 10, 14],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   random_state=100, verbose=2)
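The numbers in the log above follow from the size of the grid and the search settings. The sketch below, in plain Python, rebuilds the same grid, counts all possible combinations, and draws 100 of them at random without repetition. It only illustrates the idea: it does not reproduce the exact candidates that RandomizedSearchCV draws, and the variable names are chosen for this example.

```python
import itertools
import math
import random

random_grid = {
    'n_estimators': list(range(200, 2001, 200)),   # 200, 400, ..., 2000
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': list(range(10, 1001, 110)),       # 10, 120, ..., 1000
    'min_samples_split': [2, 5, 10, 14],
    'min_samples_leaf': [1, 2, 4, 6, 8],
    'criterion': ['entropy', 'gini'],
}

# Total number of combinations a full grid search would have to try.
grid_size = math.prod(len(values) for values in random_grid.values())

# Enumerate every combination, then sample n_iter of them without repetition.
keys = list(random_grid)
all_candidates = [dict(zip(keys, values))
                  for values in itertools.product(*random_grid.values())]
n_iter, cv = 100, 3
rng = random.Random(100)
candidates = rng.sample(all_candidates, n_iter)

print(grid_size)        # 12000
print(len(candidates))  # 100
print(n_iter * cv)      # 300 model fits
```

Random search therefore fits 300 models here, where an exhaustive grid search with the same 3 folds would fit 36,000.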

In [11]:
rf_randomcv.best_params_

{'n_estimators': 2000,
 'min_samples_split': 2,
 'min_samples_leaf': 8,
 'max_features': 'auto',
 'max_depth': 340,
 'criterion': 'gini'}

In [12]:
best_random_grid=rf_randomcv.best_estimator_

In [13]:
from sklearn.metrics import accuracy_score
y_pred=best_random_grid.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print("Accuracy Score {}".format(accuracy_score(y_test,y_pred)))
print("Classification report: {}".format(classification_report(y_test,y_pred)))

[[97 10]
 [19 28]]
Accuracy Score 0.8116883116883117
Classification report:               precision    recall  f1-score   support

    negative       0.84      0.91      0.87       107
    positive       0.74      0.60      0.66        47

    accuracy                           0.81       154
   macro avg       0.79      0.75      0.76       154
weighted avg       0.81      0.81      0.81       154



# Bayesian Optimization

Bayesian optimization uses a probabilistic model to find the minimum of a function: the goal is to find the input values that give the lowest possible output value. Because each new trial is chosen using the results of the previous trials, it can reach good values in fewer evaluations than grid, random, or manual search, although this is not guaranteed on every problem. In the Hyperopt library, Bayesian optimization is run by passing three main arguments to the function fmin:

- Objective Function = defines the loss function to minimize.

- Domain Space = defines the range of input values to test (in Bayesian Optimization this space creates a probability distribution for each of the used Hyperparameters).

- Optimization Algorithm = defines the search algorithm to use to select the best input values to use in each new iteration.
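How the three ingredients fit together can be sketched with a toy minimizer in plain Python. This is not Hyperopt and not Bayesian: the `random_suggest` function below uses plain random sampling as a stand-in for a smarter algorithm such as TPE, and all names (`minimize`, `random_suggest`, the toy `objective`) are assumptions made for this illustration.

```python
import random


def objective(params):
    # Objective function: a toy loss whose minimum is at x = 3.
    return (params['x'] - 3.0) ** 2


# Domain space: x may take any value between -10 and 10.
space = {'x': (-10.0, 10.0)}


def random_suggest(space, rng):
    # Optimization algorithm (stand-in): draw each parameter uniformly at random.
    return {name: rng.uniform(low, high) for name, (low, high) in space.items()}


def minimize(fn, space, suggest, max_evals, seed=0):
    rng = random.Random(seed)
    best_params, best_loss = None, float('inf')
    for _ in range(max_evals):
        params = suggest(space, rng)   # pick the next point to try
        loss = fn(params)              # evaluate it
        if loss < best_loss:           # remember the best point so far
            best_params, best_loss = params, loss
    return best_params, best_loss


best, loss = minimize(objective, space, random_suggest, max_evals=200)
print(best, loss)
```

A Bayesian method differs only in the suggest step: instead of ignoring history, it uses the losses already observed to decide where to sample next.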

In [14]:
pip install hyperopt

Note: you may need to restart the kernel to use updated packages.


First we define the domain space. It plays the same role as the parameter grid in grid search and random search, where we list the values each parameter may take. hp.choice picks one value from a list of options. hp.quniform('max_depth', 10, 1200, 10) samples values between 10 and 1200 in steps of 10 and returns them as floats. hp.uniform samples a float uniformly between its lower and upper bounds.

Next we define the objective function. It receives one sampled point from the space, so each parameter value is accessed as, for example, space['criterion']. Inside it we build the model and compute the cross-validation score.


Finally we create a Trials object and call fmin, which minimizes the value returned by the objective function. Its arguments are the objective function (fn), the search space (space), the search algorithm (algo, here tpe.suggest, where tpe is imported from hyperopt), the number of evaluations (max_evals), and the Trials object (trials) that records every evaluation.



For parameters defined with hp.choice, fmin returns the index of the selected option rather than the option itself, so the indices have to be mapped back to the actual values.

For example, best['criterion'] returns 1, and a mapping is needed to turn that into 'gini'.

When passing the best values to the random forest, use the mapping, for example crit[best['criterion']] instead of just best['criterion'].
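The index-to-value mapping can be sketched in plain Python as follows. The `decode` helper and the `choices` dictionary are hypothetical names, and the numbers in `best` are illustrative values in the shape that fmin returns, not results from an actual run.

```python
# Option lists in the same order as they were given to hp.choice.
choices = {
    'criterion': ['entropy', 'gini'],
    'max_features': ['auto', 'sqrt', 'log2', None],
    'n_estimators': [10, 50, 300, 750, 1200, 1300, 1500],
}


def decode(best, choices):
    """Replace hp.choice indices in `best` with the options they point to."""
    decoded = dict(best)
    for name, options in choices.items():
        decoded[name] = options[best[name]]
    return decoded


# Illustrative result: hp.choice parameters come back as indices,
# hp.quniform parameters come back as floats.
best = {'criterion': 1, 'max_depth': 700.0, 'max_features': 3,
        'min_samples_leaf': 0.07, 'min_samples_split': 0.09, 'n_estimators': 4}

params = decode(best, choices)
params['max_depth'] = int(params['max_depth'])  # tree depth must be an integer

print(params['criterion'], params['max_features'],
      params['n_estimators'], params['max_depth'])
# gini None 1200 700
```

Hyperopt also ships a helper, space_eval(space, best), that performs this lookup against the original search space, which avoids keeping a second copy of the option lists in sync by hand.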



In [15]:
from hyperopt import hp,fmin,tpe,STATUS_OK,Trials

In [16]:
space = {'criterion': hp.choice('criterion', ['entropy', 'gini']),
        'max_depth': hp.quniform('max_depth', 10, 1200, 10),
        # 'auto' was removed in scikit-learn 1.3; on newer versions drop it here and in the index mapping
        'max_features': hp.choice('max_features', ['auto', 'sqrt','log2', None]),
        'min_samples_leaf': hp.uniform('min_samples_leaf', 0, 0.5),
        'min_samples_split' : hp.uniform ('min_samples_split', 0, 1),
        'n_estimators' : hp.choice('n_estimators', [10, 50, 300, 750, 1200,1300,1500])
    }

In [17]:
from sklearn.model_selection import cross_val_score

def objective(space):
    # hp.quniform returns floats, so max_depth is cast to int before it reaches scikit-learn
    model = RandomForestClassifier(criterion = space['criterion'], max_depth = int(space['max_depth']),
                                 max_features = space['max_features'],
                                 min_samples_leaf = space['min_samples_leaf'],
                                 min_samples_split = space['min_samples_split'],
                                 n_estimators = space['n_estimators'], 
                                 )
    
    # Mean accuracy over 5 cross-validation folds on the training data
    accuracy = cross_val_score(model, X_train, y_train, cv = 5).mean()

    # We aim to maximize accuracy, therefore we return it as a negative value
    return {'loss': -accuracy, 'status': STATUS_OK }

In [18]:
from sklearn.model_selection import cross_val_score
trials = Trials()
best = fmin(fn= objective,
            space= space,
            algo= tpe.suggest,
            max_evals = 80,
            trials= trials)
best

100%|███████████████████████████████████████████████| 80/80 [14:26<00:00, 10.83s/trial, best loss: -0.7605890976942556]


{'criterion': 1,
 'max_depth': 700.0,
 'max_features': 3,
 'min_samples_leaf': 0.07420789936287814,
 'min_samples_split': 0.08919110954462761,
 'n_estimators': 4}

In [19]:
crit = {0: 'entropy', 1: 'gini'}
feat = {0: 'auto', 1: 'sqrt', 2: 'log2', 3: None}
est = {0: 10, 1: 50, 2: 300, 3: 750, 4: 1200,5:1300,6:1500}


print(crit[best['criterion']])
print(feat[best['max_features']])
print(est[best['n_estimators']])

gini
None
1200


In [20]:
trainedforest = RandomForestClassifier(criterion = crit[best['criterion']], max_depth = int(best['max_depth']), 
                                       max_features = feat[best['max_features']], 
                                       min_samples_leaf = best['min_samples_leaf'], 
                                       min_samples_split = best['min_samples_split'], 
                                       n_estimators = est[best['n_estimators']]).fit(X_train,y_train)
predictionforest = trainedforest.predict(X_test)
print(confusion_matrix(y_test,predictionforest))
print(accuracy_score(y_test,predictionforest))
print(classification_report(y_test,predictionforest))
acc5 = accuracy_score(y_test,predictionforest)

[[96 11]
 [23 24]]
0.7792207792207793
              precision    recall  f1-score   support

    negative       0.81      0.90      0.85       107
    positive       0.69      0.51      0.59        47

    accuracy                           0.78       154
   macro avg       0.75      0.70      0.72       154
weighted avg       0.77      0.78      0.77       154



# Optuna

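Optuna is a framework for automated hyperparameter tuning.

Objective function: a function that receives a trial, samples the hyperparameters from it, trains the model, and returns the score to optimize.

create_study: creates the study object that runs the trials. With direction='maximize' it searches for the hyperparameters that maximize the returned accuracy.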

In [21]:
pip install optuna



In [22]:
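import optuna
import sklearn.ensemble
import sklearn.model_selection
import sklearn.svm

def objective(trial):
    # Let Optuna choose which model family to try in this trial
    classifier = trial.suggest_categorical('classifier', ['RandomForest', 'SVC'])
    
    if classifier == 'RandomForest':
        # Integers from 200 to 2000 in steps of 10
        n_estimators = trial.suggest_int('n_estimators', 200, 2000, step=10)
        # Sampled on a log scale as a float, then truncated to an integer depth
        max_depth = int(trial.suggest_float('max_depth', 10, 100, log=True))

        clf = sklearn.ensemble.RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth)
    else:
        # SVC regularization strength, sampled on a log scale
        c = trial.suggest_float('svc_c', 1e-10, 1e10, log=True)
        
        clf = sklearn.svm.SVC(C=c, gamma='auto')

    # The mean 3-fold cross-validated accuracy is the value the study maximizes
    return sklearn.model_selection.cross_val_score(
        clf,X_train,y_train, n_jobs=-1, cv=3).mean()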
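The log=True option used above changes how values are spread over the range. A minimal sketch of log-uniform sampling in plain Python is given here; `sample_log_uniform` is a hypothetical helper written for this illustration, not Optuna's implementation.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(10, 100, rng) for _ in range(10000)]

# On a log scale the midpoint of [10, 100] is the geometric mean, about 31.6,
# so roughly half of the samples should fall below it.
midpoint = math.sqrt(10 * 100)
share_below = sum(s < midpoint for s in samples) / len(samples)
print(round(midpoint, 1), round(share_below, 2))
```

With plain uniform sampling only about a quarter of the draws would fall below 31.6, so the log scale spends more trials on small values. That is why it suits parameters such as the SVC C value, which spans twenty orders of magnitude in the objective above.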

In [23]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

trial = study.best_trial

print('Accuracy: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))

[I 2020-09-26 22:25:52,815] A new study created in memory with name: no-name-10274797-c463-4710-8412-fd37c6058888
[I 2020-09-26 22:25:55,610] Trial 0 finished with value: 0.640068547744301 and parameters: {'classifier': 'SVC', 'svc_c': 2615185.238753893}. Best is trial 0 with value: 0.640068547744301.
[I 2020-09-26 22:25:58,578] Trial 1 finished with value: 0.728024868483979 and parameters: {'classifier': 'RandomForest', 'n_estimators': 500, 'max_depth': 29.626733345352875}. Best is trial 1 with value: 0.728024868483979.
[I 2020-09-26 22:25:58,635] Trial 2 finished with value: 0.640068547744301 and parameters: {'classifier': 'SVC', 'svc_c': 0.001498297110918989}. Best is trial 1 with value: 0.728024868483979.
[I 2020-09-26 22:25:58,684] Trial 3 finished with value: 0.640068547744301 and parameters: {'classifier': 'SVC', 'svc_c': 7.0030105219297375e-09}. Best is trial 1 with value: 0.728024868483979.
[I 2020-09-26 22:25:58,739] Trial 4 finished with value: 0.640068547744301 and paramete

[I 2020-09-26 22:27:13,254] Trial 36 finished with value: 0.7394069823051171 and parameters: {'classifier': 'RandomForest', 'n_estimators': 220, 'max_depth': 35.906656061492335}. Best is trial 36 with value: 0.7394069823051171.
[I 2020-09-26 22:27:14,013] Trial 37 finished with value: 0.7312848716722461 and parameters: {'classifier': 'RandomForest', 'n_estimators': 200, 'max_depth': 34.10913635138121}. Best is trial 36 with value: 0.7394069823051171.
[I 2020-09-26 22:27:14,072] Trial 38 finished with value: 0.640068547744301 and parameters: {'classifier': 'SVC', 'svc_c': 8.136867552771699e-05}. Best is trial 36 with value: 0.7394069823051171.
[I 2020-09-26 22:27:15,988] Trial 39 finished with value: 0.7280408098198629 and parameters: {'classifier': 'RandomForest', 'n_estimators': 500, 'max_depth': 32.068557515566496}. Best is trial 36 with value: 0.7394069823051171.
[I 2020-09-26 22:27:17,203] Trial 40 finished with value: 0.7329029172644668 and parameters: {'classifier': 'RandomForest

[I 2020-09-26 22:28:09,375] Trial 72 finished with value: 0.7312689303363622 and parameters: {'classifier': 'RandomForest', 'n_estimators': 260, 'max_depth': 38.486181117537846}. Best is trial 52 with value: 0.7459190180137095.
[I 2020-09-26 22:28:10,961] Trial 73 finished with value: 0.7345289335246293 and parameters: {'classifier': 'RandomForest', 'n_estimators': 450, 'max_depth': 69.20163289340071}. Best is trial 52 with value: 0.7459190180137095.
[I 2020-09-26 22:28:11,762] Trial 74 finished with value: 0.7345448748605133 and parameters: {'classifier': 'RandomForest', 'n_estimators': 200, 'max_depth': 58.92314193413159}. Best is trial 52 with value: 0.7459190180137095.
[I 2020-09-26 22:28:13,034] Trial 75 finished with value: 0.7312769010043042 and parameters: {'classifier': 'RandomForest', 'n_estimators': 350, 'max_depth': 63.56685531379669}. Best is trial 52 with value: 0.7459190180137095.
[I 2020-09-26 22:28:14,728] Trial 76 finished with value: 0.7296429140761996 and parameters

Accuracy: 0.7459190180137095
Best hyperparameters: {'classifier': 'RandomForest', 'n_estimators': 420, 'max_depth': 59.33230858257117}


In [24]:

trial

FrozenTrial(number=52, value=0.7459190180137095, datetime_start=datetime.datetime(2020, 9, 26, 22, 27, 38, 333142), datetime_complete=datetime.datetime(2020, 9, 26, 22, 27, 39, 815167), params={'classifier': 'RandomForest', 'n_estimators': 420, 'max_depth': 59.33230858257117}, distributions={'classifier': CategoricalDistribution(choices=('RandomForest', 'SVC')), 'n_estimators': IntUniformDistribution(high=2000, low=200, step=10), 'max_depth': LogUniformDistribution(high=100, low=10)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=52, state=TrialState.COMPLETE)

In [25]:
study.best_params

{'classifier': 'RandomForest',
 'n_estimators': 420,
 'max_depth': 59.33230858257117}

In [26]:
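# Note: these values were set by hand and differ from study.best_params above.
# To reuse the tuned values, pass study.best_params['n_estimators'] and int(study.best_params['max_depth']).
rf=RandomForestClassifier(n_estimators=330,max_depth=30)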
rf.fit(X_train,y_train)

RandomForestClassifier(max_depth=30, n_estimators=330)

In [27]:

y_pred=rf.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[91 16]
 [16 31]]
0.7922077922077922
              precision    recall  f1-score   support

    negative       0.85      0.85      0.85       107
    positive       0.66      0.66      0.66        47

    accuracy                           0.79       154
   macro avg       0.76      0.76      0.76       154
weighted avg       0.79      0.79      0.79       154



# Genetic Algorithms

The TPOTClassifier used in this section comes from the tpot package, which must be installed before running the cells below.

Genetic algorithms try to apply natural selection mechanisms to machine learning contexts.

Let's imagine we create a population of N machine learning models with some predefined hyperparameters. We can then calculate the accuracy of each model and decide to keep just half of the models (the ones that perform best). We can now generate some offspring with hyperparameters similar to those of the best models, so that we again have a population of N models. At this point we can again calculate the accuracy of each model and repeat the cycle for a defined number of generations. In this way, only the best models survive at the end of the process.
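The cycle described above can be sketched in plain Python. To keep the example fast and self-contained, the `fitness` function is a toy stand-in for model accuracy (an assumption for this sketch, with its best value at n_estimators=500 and max_depth=20), and offspring are produced by small random mutations of a surviving parent. All function names are hypothetical.

```python
import random


def fitness(individual):
    # Toy stand-in for accuracy: 0 is the best possible score.
    return (-abs(individual['n_estimators'] - 500) / 500
            - abs(individual['max_depth'] - 20) / 20)


def random_individual(rng):
    return {'n_estimators': rng.randrange(100, 1001, 100),
            'max_depth': rng.randint(2, 40)}


def mutate(parent, rng):
    # Offspring keep hyperparameters close to those of the parent.
    child = dict(parent)
    child['n_estimators'] = min(1000, max(100, child['n_estimators'] + rng.choice([-100, 0, 100])))
    child['max_depth'] = min(40, max(2, child['max_depth'] + rng.choice([-2, -1, 0, 1, 2])))
    return child


def evolve(pop_size=20, generations=10, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half of the population.
        population.sort(key=fitness, reverse=True)
        survivors = population[:pop_size // 2]
        # Refill the population with mutated copies of survivors.
        offspring = [mutate(rng.choice(survivors), rng)
                     for _ in range(pop_size - len(survivors))]
        population = survivors + offspring
    return max(population, key=fitness)


best = evolve()
print(best, fitness(best))
```

Because the survivors are carried over unchanged, the best score in the population can never get worse from one generation to the next. In a real search, `fitness` would train a model and return its cross-validated accuracy, which is what makes each generation expensive.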

In [28]:
pip install tpot



Population size: the number of models (each with its own hyperparameters) kept in the population in every generation.

Offspring size: the number of new models produced from the population in each generation.

Config dict: names the algorithm to use and the hyperparameter values it may take. TPOT draws from these values when it creates the models in the population.
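These settings determine how much work the search does. According to the TPOT documentation, the number of pipelines evaluated is population_size + generations × offspring_size. The short calculation below applies that formula to the values used in this notebook (generations=5, population_size=24, offspring_size=12, cv=4); the variable names are chosen for this example.

```python
population_size = 24
offspring_size = 12
generations = 5
cv = 4

# Initial population plus the offspring created in every generation.
pipelines_evaluated = population_size + generations * offspring_size
# Each pipeline is scored with cv-fold cross-validation.
model_fits = pipelines_evaluated * cv

print(pipelines_evaluated)  # 84
print(model_fits)           # 336
```

Raising any of the three search settings increases the run time roughly in proportion, so it helps to estimate this count before starting a long search.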

In [30]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; drop it from the list on newer versions)
max_features = ['auto', 'sqrt','log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 1000,10)]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10,14]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4,6,8]
# Create the random grid
param = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
              'criterion':['entropy','gini']}
print(param)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000], 'min_samples_split': [2, 5, 10, 14], 'min_samples_leaf': [1, 2, 4, 6, 8], 'criterion': ['entropy', 'gini']}


In [31]:
from tpot import TPOTClassifier


tpot_classifier = TPOTClassifier(generations= 5, population_size= 24, offspring_size= 12,
                                 verbosity= 2, early_stop= 12,
                                 config_dict={'sklearn.ensemble.RandomForestClassifier': param}, 
                                 cv = 4, scoring = 'accuracy')
tpot_classifier.fit(X_train,y_train)



Generation 1 - Current best internal CV score: 0.7475808505220269
Generation 2 - Current best internal CV score: 0.7475808505220269
Generation 3 - Current best internal CV score: 0.7475808505220269
Generation 4 - Current best internal CV score: 0.7475808505220269
Generation 5 - Current best internal CV score: 0.7475808505220269
Best pipeline: RandomForestClassifier(CombineDFs(input_matrix, input_matrix), criterion=gini, max_depth=670, max_features=log2, min_samples_leaf=8, min_samples_split=14, n_estimators=2000)


TPOTClassifier(config_dict={'sklearn.ensemble.RandomForestClassifier': {'criterion': ['entropy',
                                                                                      'gini'],
                                                                        'max_depth': [10,
                                                                                      120,
                                                                                      230,
                                                                                      340,
                                                                                      450,
                                                                                      560,
                                                                                      670,
                                                                                      780,
                                                                                 

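tpot_classifier.get_params() returns the settings the TPOTClassifier was created with. The best pipeline found during the search is available as tpot_classifier.fitted_pipeline_.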

In [None]:

accuracy = tpot_classifier.score(X_test, y_test)
print(accuracy)