<a href="https://www.kaggle.com/code/bhavkaur/hyper-parameter-optimizatoin?scriptVersionId=134841815" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **All Techniques of Hyperparameter Optimization**
- GridSearchCV
- RandomizedSearchCV
- Bayesian Optimization - Automate Hyperparameter Tuning (Hyperopt)
- Sequential Model Based Optimization (Tuning a scikit-learn estimator with skopt)
- Optuna - Automate Hyperparameter Tuning
- Genetic Algorithms (TPOT Classifier)

### Why do we need hyperparameter tuning?
Hyperparameters are the settings chosen before training, such as the number of trees in a forest or the maximum depth of each tree. They control model structure and capacity, and therefore performance. Unlike model parameters, they are not learned from the data, so suitable values have to be searched for. Choosing appropriate hyperparameter values is an essential part of getting good results from a machine learning model.

In [35]:
import warnings
warnings.filterwarnings('ignore')

In [36]:
import pandas as pd
df=pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")
df.head()

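|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |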


### The step below replaces the '0' values in 'Glucose', 'Insulin' and 'SkinThickness' with the column median

In [37]:
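import numpy as np
# Replace 0 (treated here as a missing value) with the column median.
# Note: the median is computed over all rows, zeros included (e.g. 30.5 for Insulin).
df['Glucose']=np.where(df['Glucose']==0, df['Glucose'].median(), df['Glucose'])
df['Insulin']=np.where(df['Insulin']==0, df['Insulin'].median(), df['Insulin'])
df['SkinThickness']=np.where(df['SkinThickness']==0, df['SkinThickness'].median(), df['SkinThickness'])
df.head()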

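|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72 | 35.0 | 30.5 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66 | 29.0 | 30.5 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64 | 23.0 | 30.5 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |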
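
Because the median in the cell above is taken over every row, the zeros themselves pull it down. A minimal sketch of one alternative, on a made-up toy column (the values and the names `toy`, `median_all`, `median_nonzero` are illustrative assumptions, not part of the notebook), computes the median from the non-zero rows only:

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) where half of the entries are 0, i.e. missing.
toy = pd.DataFrame({"Insulin": [0, 0, 0, 94, 168, 88]})

# Median over all rows: the zeros drag the value down.
median_all = toy["Insulin"].median()

# Median over the observed (non-zero) rows only.
median_nonzero = toy.loc[toy["Insulin"] != 0, "Insulin"].median()

# Fill the zeros with the non-zero median, using the same np.where pattern as the notebook.
toy["Insulin_filled"] = np.where(toy["Insulin"] == 0, median_nonzero, toy["Insulin"])

print(median_all, median_nonzero)  # 44.0 94.0
print(toy)
```

Whether this is preferable depends on how the zeros arose; it is shown only to make the difference between the two medians visible.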


### Do we need feature scaling when using RandomForest?
### No. A RandomForest is built from decision trees, and each split compares one feature against a threshold, so rescaling a feature does not change which splits are possible.
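
This can be checked empirically. The sketch below uses synthetic data from `make_classification` (an assumption; it does not use the diabetes data) and fits the same seeded forest on raw and on standardized features, then measures how often the two models agree:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data; one feature is deliberately put on a much larger scale.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_demo[:, 0] *= 1000.0

# Standardized copy of the same features.
X_scaled = StandardScaler().fit_transform(X_demo)

# Same hyperparameters and seed, so the only difference is the feature scale.
rf_raw = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)
rf_scaled = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_scaled, y_demo)

# Fraction of samples on which both forests predict the same class.
agreement = np.mean(rf_raw.predict(X_demo) == rf_scaled.predict(X_scaled))
print(f"Share of identical predictions: {agreement:.3f}")
```

The share should be at or very near 1; tiny differences are possible in principle because of floating-point rounding of the split thresholds.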

In [38]:
X = df.drop('Outcome', axis=1)
y=df['Outcome']

In [39]:
X.head()

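|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72 | 35.0 | 30.5 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85.0 | 66 | 29.0 | 30.5 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183.0 | 64 | 23.0 | 30.5 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89.0 | 66 | 23.0 | 94.0 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137.0 | 40 | 35.0 | 168.0 | 43.1 | 2.288 | 33 |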


In [40]:
y.head()

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

# Train Test Split

In [41]:
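from sklearn.model_selection import train_test_split
# Note: train_size=0.20 trains on only 20% of the rows (153) and tests on the other 615,
# which is the support shown in the reports below. test_size=0.20 is the more common choice.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.20, random_state=33)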
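
To make the effect of `train_size` versus `test_size` concrete, here is a small sketch on placeholder arrays with the same number of rows and the same class counts as the dataset (768 rows, 500 negatives, 268 positives). The arrays and the `stratify` argument are additions for illustration and are not used elsewhere in the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape and class balance as the diabetes dataset.
X_demo = np.arange(768 * 2).reshape(768, 2)
y_demo = np.array([0] * 500 + [1] * 268)

# train_size=0.20: 20% of the rows are used for training, 80% for testing.
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X_demo, y_demo, train_size=0.20, random_state=33
)

# test_size=0.20 with stratification: 80% for training, class ratio preserved in both parts.
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=33, stratify=y_demo
)

print(len(X_tr_a), len(X_te_a))  # 153 615
print(len(X_tr_b), len(X_te_b))  # 614 154
```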

In [42]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)
pred = rf.predict(X_test)

In [43]:
y.value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [44]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
print(confusion_matrix(y_test, pred))
print(accuracy_score(y_test, pred))
print(classification_report(y_test, pred))

[[355  50]
 [105 105]]
0.7479674796747967
              precision    recall  f1-score   support

           0       0.77      0.88      0.82       405
           1       0.68      0.50      0.58       210

    accuracy                           0.75       615
   macro avg       0.72      0.69      0.70       615
weighted avg       0.74      0.75      0.74       615



### The main parameters used by a RandomForestClassifier are:
- criterion = the function used to evaluate the quality of a split
- max_depth = maximum number of levels allowed in each tree
- max_features = maximum number of features considered when splitting a node
- min_samples_leaf = minimum number of samples that must remain in a tree leaf
- min_samples_split = minimum number of samples a node must contain before it can be split
- n_estimators = number of trees in the ensemble
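
As a quick illustration, each item in this list is a keyword argument of the `RandomForestClassifier` constructor, and `get_params()` reads the values back. The specific values below are arbitrary choices for the sketch, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier

# Arbitrary example values for the six hyperparameters listed above.
rf_demo = RandomForestClassifier(
    criterion="gini",
    max_depth=None,        # None lets each tree grow until its leaves are pure or too small
    max_features="sqrt",
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=100,
    random_state=0,
)

# Read the same settings back from the (not yet fitted) estimator.
names = ["criterion", "max_depth", "max_features",
         "min_samples_leaf", "min_samples_split", "n_estimators"]
tuned = {name: rf_demo.get_params()[name] for name in names}
print(tuned)
```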

# Manual Hyperparameter Tuning

In [45]:
model = RandomForestClassifier(n_estimators=500, criterion='entropy', max_features='sqrt',
                               min_samples_leaf=10, random_state=100)
model.fit(X_train, y_train)
predictions=model.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))

[[357  48]
 [109 101]]
0.7447154471544716
              precision    recall  f1-score   support

           0       0.77      0.88      0.82       405
           1       0.68      0.48      0.56       210

    accuracy                           0.74       615
   macro avg       0.72      0.68      0.69       615
weighted avg       0.74      0.74      0.73       615



### Start with RandomizedSearchCV because it narrows down the promising region of the search space
### Then refine around that region with GridSearchCV

# RandomizedSearchCV

In [46]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in the random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
# ('auto' was deprecated in scikit-learn 1.1 and removed in 1.3; for classifiers it behaved like 'sqrt')
max_features = ['auto','sqrt','log2']
# Maximum number of levels in each tree
max_depth = [int(x) for x in np.linspace(10,1000,10)]
# Minimum number of samples required to split a node
# (21 is kept as in the recorded run; 2 may have been intended)
min_samples_split = [21,3,4,5,7,9]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1,2,4,6,8]
# Create the random grid
random_grid = {'n_estimators':n_estimators,
              'max_features':max_features,
              'max_depth': max_depth,
              'min_samples_split':min_samples_split,
              'min_samples_leaf':min_samples_leaf,
              'criterion':['entropy', 'gini']}
print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000], 'min_samples_split': [21, 3, 4, 5, 7, 9], 'min_samples_leaf': [1, 2, 4, 6, 8], 'criterion': ['entropy', 'gini']}
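
To see why a randomized search is a sensible first pass, it helps to count how many combinations this grid contains. The sketch below rebuilds the same grid under a separate name (`random_grid_demo`, an illustrative name) and uses `ParameterGrid` to count them:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same candidate values as the grid printed above.
random_grid_demo = {
    "n_estimators": [int(x) for x in np.linspace(start=200, stop=2000, num=10)],
    "max_features": ["auto", "sqrt", "log2"],
    "max_depth": [int(x) for x in np.linspace(10, 1000, 10)],
    "min_samples_split": [21, 3, 4, 5, 7, 9],
    "min_samples_leaf": [1, 2, 4, 6, 8],
    "criterion": ["entropy", "gini"],
}

# ParameterGrid enumerates the full Cartesian product: 10 * 3 * 10 * 6 * 5 * 2.
n_combinations = len(ParameterGrid(random_grid_demo))
n_iter = 100  # number of candidates a randomized search with n_iter=100 would try

print(n_combinations)                      # 18000
print(f"{n_iter / n_combinations:.2%}")    # 0.56%
```

An exhaustive grid search over all 18000 combinations with 3-fold cross-validation would need 54000 fits, whereas sampling 100 candidates needs 300.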


In [47]:
rf=RandomForestClassifier()
rf_randomcv = RandomizedSearchCV(estimator=rf,param_distributions=random_grid, n_iter=100, cv=3, verbose=2, random_state=100)
rf_randomcv.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
[CV] END criterion=entropy, max_depth=670, max_features=auto, min_samples_leaf=8, min_samples_split=21, n_estimators=200; total time=   0.4s
[CV] END criterion=entropy, max_depth=670, max_features=auto, min_samples_leaf=8, min_samples_split=21, n_estimators=200; total time=   0.4s
[CV] END criterion=entropy, max_depth=670, max_features=auto, min_samples_leaf=8, min_samples_split=21, n_estimators=200; total time=   0.4s
[CV] END criterion=gini, max_depth=560, max_features=log2, min_samples_leaf=1, min_samples_split=7, n_estimators=1600; total time=   3.1s
[CV] END criterion=gini, max_depth=560, max_features=log2, min_samples_leaf=1, min_samples_split=7, n_estimators=1600; total time=   3.1s
[CV] END criterion=gini, max_depth=560, max_features=log2, min_samples_leaf=1, min_samples_split=7, n_estimators=1600; total time=   3.1s
[CV] END criterion=gini, max_depth=340, max_features=sqrt, min_samples_leaf=2, min_samples_split=9, 

In [48]:
rf_randomcv.best_params_

{'n_estimators': 400,
 'min_samples_split': 5,
 'min_samples_leaf': 8,
 'max_features': 'log2',
 'max_depth': 340,
 'criterion': 'gini'}
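
The search above samples from fixed lists. `RandomizedSearchCV` also accepts `scipy.stats` distributions, which lets it draw any integer in a range rather than a handful of preset values. The following is a minimal sketch on synthetic data; the ranges, the small `n_iter`, and the dataset are all assumptions chosen to keep the example fast:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem so the sketch runs in a few seconds.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# randint(low, high) draws integers from low (inclusive) to high (exclusive).
param_distributions = {
    "n_estimators": randint(20, 101),
    "max_depth": randint(2, 11),
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 9),
    "max_features": ["sqrt", "log2"],
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # deliberately tiny; a real search would use more iterations
    cv=3,
    random_state=100,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated accuracy of the best candidate
```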

In [49]:
best_random_grid=rf_randomcv.best_estimator_

In [50]:
from sklearn.metrics import accuracy_score
y_pred = best_random_grid.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("Accuracy score: {}".format(accuracy_score(y_test, y_pred)))
print("Classification report: {}".format(classification_report(y_test, y_pred)))

[[351  54]
 [100 110]]
Accuracy score: 0.7495934959349594
Classification report:               precision    recall  f1-score   support

           0       0.78      0.87      0.82       405
           1       0.67      0.52      0.59       210

    accuracy                           0.75       615
   macro avg       0.72      0.70      0.70       615
weighted avg       0.74      0.75      0.74       615



# GridSearchCV

In [51]:
rf_randomcv.best_params_

{'n_estimators': 400,
 'min_samples_split': 5,
 'min_samples_leaf': 8,
 'max_features': 'log2',
 'max_depth': 340,
 'criterion': 'gini'}

In [52]:
from sklearn.model_selection import GridSearchCV

# Build a small grid centred on the best parameters found by the randomized search
best = rf_randomcv.best_params_
param_grid = {
    'criterion': [best['criterion']],
    'max_depth': [best['max_depth']],
    'max_features': [best['max_features']],
    'min_samples_leaf': [best['min_samples_leaf'] + d for d in (0, 2, 4)],
    'min_samples_split': [best['min_samples_split'] + d for d in (-2, -1, 0, 1, 2)],
    'n_estimators': [best['n_estimators'] + d for d in (-200, -100, 0, 100, 200)],
}

print(param_grid)

{'criterion': ['gini'], 'max_depth': [340], 'max_features': ['log2'], 'min_samples_leaf': [8, 10, 12], 'min_samples_split': [3, 4, 5, 6, 7], 'n_estimators': [200, 300, 400, 500, 600]}
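
Subtracting fixed offsets from the best values can produce invalid settings, for example `min_samples_split` below 2 when the randomized search returns 2 or 3. The sketch below shows one way to guard against that and to count the resulting candidates. The helper `neighbourhood` is a hypothetical name introduced here, and `best_params` simply copies the values printed earlier:

```python
from sklearn.model_selection import ParameterGrid

# Best parameters as printed by the randomized search above.
best_params = {
    "n_estimators": 400,
    "min_samples_split": 5,
    "min_samples_leaf": 8,
    "max_features": "log2",
    "max_depth": 340,
    "criterion": "gini",
}

def neighbourhood(center, offsets, minimum):
    """Hypothetical helper: values around `center`, clipped at `minimum`, without duplicates."""
    return sorted({max(minimum, center + offset) for offset in offsets})

param_grid_demo = {
    "criterion": [best_params["criterion"]],
    "max_depth": [best_params["max_depth"]],
    "max_features": [best_params["max_features"]],
    "min_samples_leaf": neighbourhood(best_params["min_samples_leaf"], [0, 2, 4], minimum=1),
    "min_samples_split": neighbourhood(best_params["min_samples_split"], [-2, -1, 0, 1, 2], minimum=2),
    "n_estimators": neighbourhood(best_params["n_estimators"], [-200, -100, 0, 100, 200], minimum=1),
}

# 3 * 5 * 5 = 75 candidates; with 10-fold cross-validation that is 750 fits.
n_candidates = len(ParameterGrid(param_grid_demo))
print(param_grid_demo["min_samples_split"])  # [3, 4, 5, 6, 7]
print(n_candidates, n_candidates * 10)       # 75 750
```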


In [53]:
rf=RandomForestClassifier()
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=10,verbose=2 ).fit(X_train, y_train)

Fitting 10 folds for each of 75 candidates, totalling 750 fits
[CV] END criterion=gini, max_depth=340, max_features=log2, min_samples_leaf=8, min_samples_split=3, n_estimators=200; total time=   0.4s
[CV] END criterion=gini, max_depth=340, max_features=log2, min_samples_leaf=8, min_samples_split=3, n_estimators=200; total time=   0.4s
[CV] END criterion=gini, max_depth=340, max_features=log2, min_samples_leaf=8, min_samples_split=3, n_estimators=200; total time=   0.4s
[CV] END criterion=gini, max_depth=340, max_features=log2, min_samples_leaf=8, min_samples_split=3, n_estimators=200; total time=   0.4s
[CV] END criterion=gini, max_depth=340, max_features=log2, min_samples_leaf=8, min_samples_split=3, n_estimators=200; total time=   0.4s
[CV] END criterion=gini, max_depth=340, max_features=log2, min_samples_leaf=8, min_samples_split=3, n_estimators=200; total time=   0.4s
[CV] END criterion=gini, max_depth=340, max_features=log2, min_samples_leaf=8, min_samples_split=3, n_estimators=20

In [54]:
grid_search.best_estimator_

In [55]:
BestGrid=grid_search.best_estimator_

In [56]:
BestGrid

In [57]:
y_pred = BestGrid.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("Accuracy score: {}".format(accuracy_score(y_test, y_pred)))
print("Classification report: {}".format(classification_report(y_test, y_pred)))

[[354  51]
 [100 110]]
Accuracy score: 0.7544715447154472
Classification report:               precision    recall  f1-score   support

           0       0.78      0.87      0.82       405
           1       0.68      0.52      0.59       210

    accuracy                           0.75       615
   macro avg       0.73      0.70      0.71       615
weighted avg       0.75      0.75      0.75       615
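
The accuracy figures reported in this notebook can be re-derived from the printed confusion matrices, which also makes the four runs easy to compare side by side. The sketch below copies the matrices from the outputs above; the dictionary labels are descriptive names chosen here:

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows: true class, columns: predicted class).
confusion = {
    "baseline (n_estimators=10)": np.array([[355, 50], [105, 105]]),
    "manual tuning": np.array([[357, 48], [109, 101]]),
    "RandomizedSearchCV": np.array([[351, 54], [100, 110]]),
    "GridSearchCV": np.array([[354, 51], [100, 110]]),
}

# Accuracy = correct predictions (diagonal) / all predictions.
accuracy = {name: np.trace(m) / m.sum() for name, m in confusion.items()}

# Recall for class 1 = true positives / all actual positives (second row).
recall_pos = {name: m[1, 1] / m[1].sum() for name, m in confusion.items()}

for name in confusion:
    print(f"{name}: accuracy={accuracy[name]:.4f}, recall(1)={recall_pos[name]:.4f}")
```

On this split the four models land within about one percentage point of each other in accuracy, and recall for the positive class stays near 0.5 throughout.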

