## All Techniques of Hyperparameter Optimization
- GridSearchCV
- RandomizedSearchCV
- Bayesian Optimization - Automated Hyperparameter Tuning (Hyperopt)
- Sequential Model Based Optimization (Tuning a scikit-learn estimator with skopt)
- Optuna - Automated Hyperparameter Tuning
- Genetic Algorithms (TPOT Classifier)

### References
- https://github.com/fmfn/BayesianOptimization
- https://github.com/hyperopt/hyperopt
- https://www.jeremyjordan.me/hyperparameter-tuning/
- https://optuna.org/
- https://towardsdatascience.com/hyperparameters-optimization-526348bb8e2d
- http://scikit-optimize.github.io/stable/auto_examples/hyperparameter-optimization.html

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd

## Read dataset
df = pd.read_csv('C:/Users/Mohan/Documents/Machine Learning R_27.07.21/Machine Learning R_27.07.21/Machine Learning Project 7 - Advandced Hyperparameter Tunning/diabetes.csv')

In [3]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
import numpy as np

## Replace zero values in the Glucose and Insulin features with the column median
df['Glucose'] = np.where(df.Glucose == 0, df.Glucose.median(), df['Glucose'])
df['Insulin'] = np.where(df.Insulin == 0, df.Insulin.median(), df['Insulin'])
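
A point worth checking with this pattern is that the median is computed over every row, zeros included. The sketch below uses a small made-up column (the values and the `toy` name are illustrative, not taken from the diabetes data) to show the difference between that median and one computed over the non-zero rows only. The second variant is an alternative to consider, not what the cell above does.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values); 0 stands for "missing"
toy = pd.DataFrame({"Insulin": [0, 0, 0, 94, 168]})

# Same pattern as the notebook cell: median over all rows, zeros included
median_all = toy["Insulin"].median()
filled = np.where(toy["Insulin"] == 0, median_all, toy["Insulin"])

# Variant: median over the observed (non-zero) rows only
median_observed = toy.loc[toy["Insulin"] != 0, "Insulin"].median()
filled_observed = np.where(toy["Insulin"] == 0, median_observed, toy["Insulin"])

print(median_all, filled)
print(median_observed, filled_observed)
```

When more than half of a column is zero, the all-rows median is itself zero and the replacement has no effect, which is why it can help to print both medians before choosing.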

In [5]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72,35,30.5,33.6,0.627,50,1
1,1,85.0,66,29,30.5,26.6,0.351,31,0
2,8,183.0,64,0,30.5,23.3,0.672,32,1
3,1,89.0,66,23,94.0,28.1,0.167,21,0
4,0,137.0,40,35,168.0,43.1,2.288,33,1


In [6]:
## Independent And Dependent Features

X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [7]:
X.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148.0,72,35,30.5,33.6,0.627,50
1,1,85.0,66,29,30.5,26.6,0.351,31
2,8,183.0,64,0,30.5,23.3,0.672,32
3,1,89.0,66,23,94.0,28.1,0.167,21
4,0,137.0,40,35,168.0,43.1,2.288,33


In [8]:
## Labels

y.head()

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

In [9]:
## train test split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [10]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)
prediction = rf.predict(X_test)

In [11]:
y.value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64

In [12]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
print("<-------------------Confusion metrics results is ------------->\n : {}".format(confusion_matrix(y_test, prediction)))
print("<------------------Classification report is---------------> \n: {}".format(classification_report(y_test, prediction)))
print("<------------------ Accuracy score----------------> : {}".format(accuracy_score(y_test, prediction)))

<-------------------Confusion metrics results is ------------->
 : [[85 14]
 [21 34]]
<------------------Classification report is---------------> 
:               precision    recall  f1-score   support

           0       0.80      0.86      0.83        99
           1       0.71      0.62      0.66        55

    accuracy                           0.77       154
   macro avg       0.76      0.74      0.74       154
weighted avg       0.77      0.77      0.77       154

<------------------ Accuracy score----------------> : 0.7727272727272727
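
The figures in the classification report follow directly from the confusion matrix. As a minimal sketch in plain Python (the variable names are illustrative), the baseline matrix printed above can be turned into accuracy, precision, recall and F1 for class 1 like this:

```python
# Baseline confusion matrix from the output above
# (scikit-learn convention: rows = true class, columns = predicted class)
tn, fp = 85, 14
fn, tp = 21, 34

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # share of predicted positives that are correct
recall_1 = tp / (tp + fn)      # share of actual positives that are found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The rounded values agree with the `accuracy` row and the class `1` row of the report.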


### The main parameters used by a Random Forest
- criterion = the function used to evaluate the quality of a split
- max_depth = maximum number of levels allowed in each tree
- max_features = maximum number of features considered when splitting a node.
- min_samples_leaf = minimum number of samples which can be stored in a tree leaf.
- min_samples_split = minimum number of samples necessary in a node to cause node splitting.
- n_estimators = number of trees in the ensemble.
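
To see how these parameters map onto the estimator, here is a small sketch on synthetic data. The dataset from `make_classification` and the specific values are assumptions chosen for illustration, not recommended settings for the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

clf = RandomForestClassifier(
    n_estimators=50,        # number of trees in the ensemble
    criterion="gini",       # function used to evaluate split quality
    max_depth=5,            # maximum number of levels in each tree
    max_features="sqrt",    # features considered when splitting a node
    min_samples_leaf=2,     # minimum samples stored in a leaf
    min_samples_split=4,    # minimum samples needed to split a node
    random_state=0,
).fit(X_demo, y_demo)

# The fitted ensemble exposes its individual trees
n_trees = len(clf.estimators_)
deepest = max(tree.get_depth() for tree in clf.estimators_)
print(n_trees, deepest)
```

Checking `deepest` against `max_depth` is a quick way to confirm that a constraint is actually being applied.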

In [13]:
## Manual Hyperparameter Tuning
model = RandomForestClassifier(n_estimators=500, criterion='gini',
                               max_features='sqrt', min_samples_leaf=10, random_state=100).fit(X_train, y_train)
prediction = model.predict(X_test)
print("<-------------------Confusion metrics results is ------------->\n : {}".format(confusion_matrix(y_test, prediction)))
print("<------------------Classification report is---------------> \n: {}".format(classification_report(y_test, prediction)))
print("<------------------ Accuracy score----------------> : {}".format(accuracy_score(y_test, prediction)))

<-------------------Confusion metrics results is ------------->
 : [[83 16]
 [21 34]]
<------------------Classification report is---------------> 
:               precision    recall  f1-score   support

           0       0.80      0.84      0.82        99
           1       0.68      0.62      0.65        55

    accuracy                           0.76       154
   macro avg       0.74      0.73      0.73       154
weighted avg       0.76      0.76      0.76       154

<------------------ Accuracy score----------------> : 0.7597402597402597


### Randomized SearchCV

In [14]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]

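# Number of features to consider at every split
# (note: 'auto' was removed in scikit-learn 1.3; on newer versions candidates using it fail to fit)
max_features = ['auto', 'sqrt', 'log2']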

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 1000, 10)]

# Minimum number of samples required to split a node
# (note: an integer value must be at least 2, so candidates with 1 fail to fit)
min_samples_split = [1, 2, 3, 4, 5, 7, 9]

# Minimum number of samples required at each leaf node
min_samples_leaf =  [1, 2, 4, 6, 8]

# Create the random grid
random_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf,
    'criterion': ['entropy', 'gini']
}

print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000], 'min_samples_split': [1, 2, 3, 4, 5, 7, 9], 'min_samples_leaf': [1, 2, 4, 6, 8], 'criterion': ['entropy', 'gini']}
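
It can help to compare the size of this grid with the number of candidates a randomized search actually tries. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler`; `demo_grid` is a hypothetical grid with the same number of options per parameter as the one above, with the option values written out by hand.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical grid with the same option counts as random_grid above
demo_grid = {
    "n_estimators": list(range(200, 2001, 200)),    # 10 options
    "max_features": ["sqrt", "log2", None],         # 3 options
    "max_depth": list(range(10, 1001, 110)),        # 10 options
    "min_samples_split": [2, 3, 4, 5, 7, 9, 11],    # 7 options
    "min_samples_leaf": [1, 2, 4, 6, 8],            # 5 options
    "criterion": ["entropy", "gini"],               # 2 options
}

# Every combination an exhaustive search would have to evaluate
total = len(ParameterGrid(demo_grid))

# The subset a randomized search with n_iter=100 would draw
sampled = list(ParameterSampler(demo_grid, n_iter=100, random_state=100))

print(total, len(sampled), sampled[0])
```

With 100 draws out of 21000 combinations, the randomized search covers under half a percent of the grid, which is why it is used here to find a promising region before a narrower exhaustive search.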


In [15]:
rf = RandomForestClassifier()
rf_randomcv = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                                n_iter=100, cv=3, verbose=2, random_state=100, n_jobs=-1)

In [16]:
## fit the randomized model
rf_randomcv.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


In [17]:
rf_randomcv.best_params_

{'n_estimators': 1000,
 'min_samples_split': 4,
 'min_samples_leaf': 6,
 'max_features': 'log2',
 'max_depth': 10,
 'criterion': 'gini'}

In [18]:
randomcv_best_params = rf_randomcv.best_estimator_

In [19]:
y_pred = randomcv_best_params.predict(X_test)
print("<-------------------Confusion metrics results is ------------->\n : {}".format(confusion_matrix(y_test, y_pred)))
print("<------------------Classification report is---------------> \n: {}".format(classification_report(y_test, y_pred)))
print("<------------------ Accuracy score----------------> : {}".format(accuracy_score(y_test, y_pred)))

<-------------------Confusion metrics results is ------------->
 : [[81 18]
 [17 38]]
<------------------Classification report is---------------> 
:               precision    recall  f1-score   support

           0       0.83      0.82      0.82        99
           1       0.68      0.69      0.68        55

    accuracy                           0.77       154
   macro avg       0.75      0.75      0.75       154
weighted avg       0.77      0.77      0.77       154

<------------------ Accuracy score----------------> : 0.7727272727272727


### GridSearchCV

In [20]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'criterion': [rf_randomcv.best_params_['criterion']],
    'max_depth': [rf_randomcv.best_params_['max_depth']],
    'max_features': [rf_randomcv.best_params_['max_features']],
    'min_samples_leaf': [rf_randomcv.best_params_['min_samples_leaf'],
                         rf_randomcv.best_params_['min_samples_leaf'] + 2,
                         rf_randomcv.best_params_['min_samples_leaf'] + 4],
    'min_samples_split': [rf_randomcv.best_params_['min_samples_split'] - 2,
                          rf_randomcv.best_params_['min_samples_split'] - 1,
                          rf_randomcv.best_params_['min_samples_split'],
                          rf_randomcv.best_params_['min_samples_split'] + 1,
                          # Note: this last entry is derived from min_samples_leaf, not min_samples_split (6 + 2 = 8)
                          rf_randomcv.best_params_['min_samples_leaf'] + 2],
    'n_estimators': [rf_randomcv.best_params_['n_estimators'] - 200,
                    rf_randomcv.best_params_['n_estimators'] - 100,
                    rf_randomcv.best_params_['n_estimators'],
                    rf_randomcv.best_params_['n_estimators'] + 100,
                    rf_randomcv.best_params_['n_estimators'] + 200,
                    rf_randomcv.best_params_['n_estimators'] - 600,]
}

print(param_grid)

{'criterion': ['gini'], 'max_depth': [10], 'max_features': ['log2'], 'min_samples_leaf': [6, 8, 10], 'min_samples_split': [2, 3, 4, 5, 8], 'n_estimators': [800, 900, 1000, 1100, 1200, 400]}


In [21]:
1 * 1 * 1 * 3 * 5 * 6

90
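
The same count can be derived from the grid itself rather than typed by hand. A minimal sketch, using only the standard library and a copy of the printed `param_grid` (the name `param_grid_demo` is illustrative):

```python
import math

# Copy of the param_grid printed above
param_grid_demo = {
    "criterion": ["gini"],
    "max_depth": [10],
    "max_features": ["log2"],
    "min_samples_leaf": [6, 8, 10],
    "min_samples_split": [2, 3, 4, 5, 8],
    "n_estimators": [800, 900, 1000, 1100, 1200, 400],
}

# Grid search evaluates every combination: multiply the option counts
n_candidates = math.prod(len(options) for options in param_grid_demo.values())

# Each candidate is fitted once per cross-validation fold
cv_folds = 10
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```

Because the cost is the product of the option counts times the number of folds, adding one more value to any list increases the total number of fits multiplicatively.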

In [22]:
rf =  RandomForestClassifier()
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=10, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

Fitting 10 folds for each of 90 candidates, totalling 900 fits


In [23]:
grid_search.best_estimator_

In [24]:
best_grid = grid_search.best_estimator_

In [25]:
y_pred = best_grid.predict(X_test)
print("<-------------------Confusion metrics results is ------------->\n : {}".format(confusion_matrix(y_test, y_pred)))
print("<------------------Classification report is---------------> \n: {}".format(classification_report(y_test, y_pred)))
print("<------------------ Accuracy score----------------> : {}".format(accuracy_score(y_test, y_pred)))

<-------------------Confusion metrics results is ------------->
 : [[80 19]
 [18 37]]
<------------------Classification report is---------------> 
:               precision    recall  f1-score   support

           0       0.82      0.81      0.81        99
           1       0.66      0.67      0.67        55

    accuracy                           0.76       154
   macro avg       0.74      0.74      0.74       154
weighted avg       0.76      0.76      0.76       154

<------------------ Accuracy score----------------> : 0.7597402597402597


### Automated Hyperparameter Tuning
Automated Hyperparameter Tuning can be done by using techniques such as
- Bayesian Optimization
- Gradient Descent
- Evolutionary Algorithms

### Bayesian Optimization
Bayesian optimization uses a probabilistic model to find the minimum of a function: the aim is to find the input value that gives the lowest output value. It usually performs better than random, grid, and manual search, giving better performance in the testing phase and reduced optimization time. In Hyperopt, Bayesian optimization is implemented by passing three main arguments to the function fmin.
- Objective Function = defines the loss function to minimize
- Domain Space = defines the range of input values to test (in Bayesian Optimization this space creates a probability distribution for each of the hyperparameters used)
- Optimization Algorithm = defines the search algorithm used to select the input values to try in each new iteration.

In [26]:
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials

In [27]:
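## hp defines the search space: integer-valued, floating-point, or categorical (choice) parameters
## (note: 'auto' for max_features was removed in scikit-learn 1.3; on newer versions trials that draw it fail)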
space = {
    'criterion': hp.choice('criterion', ['entropy', 'gini']),
    'max_depth': hp.quniform('max_depth', 10, 1200, 10),
    'max_featuers': hp.choice('max_features', ['auto', 'sqrt', 'log2', None]),
    'min_samples_leaf': hp.uniform('min_samples_leaf', 0, 0.5),
    'min_samples_split': hp.uniform('min_samples_split', 0, 1),
    'n_estimators': hp.choice('n_estimators', [10, 50, 300, 750, 1200, 1300, 1500])
}

In [28]:
space

{'criterion': <hyperopt.pyll.base.Apply at 0x1934bc72610>,
 'max_depth': <hyperopt.pyll.base.Apply at 0x1934bc72bd0>,
 'max_featuers': <hyperopt.pyll.base.Apply at 0x1934bc732d0>,
 'min_samples_leaf': <hyperopt.pyll.base.Apply at 0x1934bc73710>,
 'min_samples_split': <hyperopt.pyll.base.Apply at 0x1934bc73b90>,
 'n_estimators': <hyperopt.pyll.base.Apply at 0x1934bc74550>}

In [29]:
def objective(space):
    # hp.quniform returns floats (e.g. 390.0), but max_depth must be an int, so cast it
    model = RandomForestClassifier(criterion=space['criterion'], max_depth=int(space['max_depth']),
                                  max_features=space['max_featuers'], min_samples_leaf=space['min_samples_leaf'],
                                  # also tune min_samples_split, which the space defines and the final model uses
                                  min_samples_split=space['min_samples_split'],
                                  n_estimators=space['n_estimators'])

    accuracy = cross_val_score(model, X_train, y_train, cv=5).mean()

    # We aim to maximize accuracy, therefore we return it as a negative value
    return {'loss': -accuracy, 'status': STATUS_OK}

In [30]:
from sklearn.model_selection import cross_val_score
trials = Trials()  # records the result of every evaluation
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=80, trials=trials)
best
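
The objective above returns the negated mean cross-validation accuracy, because `fmin` minimizes while accuracy should be maximized. A minimal sketch of that step in isolation, on synthetic data (the dataset and the small forest are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# One accuracy score per fold
scores = cross_val_score(
    RandomForestClassifier(n_estimators=20, random_state=0), X_demo, y_demo, cv=5
)

# Negate the mean so that a higher accuracy gives a lower loss
loss = -scores.mean()

print(scores, loss)
```

The loss therefore lies between -1 and 0, and the best trial is the one with the most negative value.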


In [None]:
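# hp.choice returns the index of the selected option, so map each index back to its value
crit = {0: 'entropy', 1: 'gini'}
feat = {0: 'auto', 1: 'sqrt', 2: 'log2', 3: None}
est = {0: 10, 1: 50, 2: 300, 3: 750, 4: 1200, 5: 1300, 6: 1500}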

print(crit[best['criterion']])
print(feat[best['max_features']])
print(est[best['n_estimators']])

entropy
auto
1500
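
Hand-written lookup dictionaries are easy to get out of step with the option lists. One alternative, sketched here in plain Python, is to index directly into the same lists that were passed to `hp.choice`. The `best_demo` dictionary is a hypothetical `fmin` result, not output from this notebook.

```python
# Option lists in the same order as the hp.choice calls in the search space
criterion_options = ['entropy', 'gini']
n_estimators_options = [10, 50, 300, 750, 1200, 1300, 1500]

# Hypothetical fmin result: hp.choice entries are indices, quniform entries are floats
best_demo = {'criterion': 0, 'n_estimators': 6, 'max_depth': 390.0}

decoded = {
    'criterion': criterion_options[best_demo['criterion']],
    'n_estimators': n_estimators_options[best_demo['n_estimators']],
    'max_depth': int(best_demo['max_depth']),  # cast the quniform float to an int
}

print(decoded)
```

Hyperopt also provides `space_eval(space, best)`, which performs this index-to-value translation for a whole search space; the manual version above only makes the mechanism visible.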


In [None]:
# max_depth comes from hp.quniform as a float, so cast it to int before passing it to the model
trainedforest = RandomForestClassifier(criterion=crit[best['criterion']], max_depth=int(best['max_depth']),
                                      max_features=feat[best['max_features']], min_samples_leaf=best['min_samples_leaf'],
                                      min_samples_split=best['min_samples_split'], n_estimators=est[best['n_estimators']]).fit(X_train, y_train)
predictionforest = trainedforest.predict(X_test)
print("<-------------------Confusion metrics results is ------------->\n : {}".format(confusion_matrix(y_test, predictionforest)))
print("<------------------Classification report is---------------> \n: {}".format(classification_report(y_test, predictionforest)))
print("<------------------ Accuracy score----------------> : {}".format(accuracy_score(y_test, predictionforest)))

<-------------------Confusion metrics results is ------------->
 : [[87 12]
 [22 33]]
<------------------Classification report is---------------> 
:               precision    recall  f1-score   support

           0       0.80      0.88      0.84        99
           1       0.73      0.60      0.66        55

    accuracy                           0.78       154
   macro avg       0.77      0.74      0.75       154
weighted avg       0.78      0.78      0.77       154

<------------------ Accuracy score----------------> : 0.7792207792207793
