## **Hyperparameter Tuning**

***Hyperparameter tuning is the process of finding the combination of hyperparameters that makes your model perform at its best.***

1. **GridSearchCV**
2. **RandomizedSearchCV**

## **GridSearchCV**

In [None]:
# Suppose a model takes three hyperparameters: a, b and c.

# We want to find its best performance using grid search.

# a = [0.1, 0.5, 0.8, 1.0]
# b = [True, False]
# c = [10, 50, 75, 100, 150]

# Grid search builds and evaluates the model once for every possible combination:
# Model(a=0.1, b=True, c=10)
# Model(a=0.1, b=True, c=50)
# ... and so on through all 4 x 2 x 5 = 40 combinations, ending with

# Model(a=1.0, b=False, c=150)
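
The enumeration described above can be sketched with the standard library alone. This is only an illustration of how the 40 combinations arise; it does not train anything, and the dictionary name `grid` is a placeholder chosen for this sketch.

```python
from itertools import product

# Candidate values from the example above.
grid = {
    "a": [0.1, 0.5, 0.8, 1.0],
    "b": [True, False],
    "c": [10, 50, 75, 100, 150],
}

# Cartesian product of the value lists: one dict per combination.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(combinations))   # 4 * 2 * 5 = 40
print(combinations[0])     # first combination tried
print(combinations[-1])    # last combination tried
```

Grid search would fit and score the model once per entry of `combinations` (and once per cross-validation fold on top of that).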


## **RandomizedSearchCV**

In [None]:
# Instead of trying every combination of the parameters we pass, randomized search
# picks a fixed number of random combinations from them.

# Suppose again that a model takes three hyperparameters: a, b and c.

# We want to find its best performance using randomized search.

# a = [0.1, 0.5, 0.8, 1.0]
# b = [True, False]
# c = [10, 50, 75, 100, 150]


# Model(a=0.5, b=True, c=75)
# Randomized search selects parameter combinations at random and builds the model for each.
# Unlike grid search, it does not try every possible combination.
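
A minimal sketch of the same idea, assuming a budget of 5 draws (the value of `n_iter` here is hypothetical) and sampling without replacement from the full list of combinations. The fixed seed is only there to make the sketch repeatable.

```python
import random
from itertools import product

grid = {
    "a": [0.1, 0.5, 0.8, 1.0],
    "b": [True, False],
    "c": [10, 50, 75, 100, 150],
}
all_combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

rng = random.Random(0)   # seeded generator, for repeatability only
n_iter = 5               # hypothetical budget: how many combinations to try

# Draw n_iter distinct combinations out of the 40 available.
sampled = rng.sample(all_combinations, k=n_iter)

for params in sampled:
    print(params)
```

Only the 5 sampled combinations would be fitted and scored; the other 35 are never looked at.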

In [None]:
# With randomized search there is a possibility that you won't get the best parameters.

# Steps            Grid search                          Randomized search
# Define model     Select a model                       Select a model
# Parameters       Provide a list of values             Provide a list of values
#                  for each parameter                   for each parameter
#
# Process          Tries every possible                 Picks a fixed number of random
#                  combination from the grid:           combinations from the grid:
#                  4 x 2 x 5 = 40 (40 runs)             only N random combinations
#
# Best params      Returns the best combination         Returns the best combination
#                  in the grid                          among the samples

# Computational    High for large grids                 Lower, especially for large
# time             (slow)                               search spaces

# Risk             Cannot miss the best                 Might miss the exact best
#                  combination in the grid              combination, but is often close

# When to choose   When you want the best performance   When you have a time constraint
#                  and have no time constraint          and need a good result

| Feature | GridSearchCV | RandomizedSearchCV |
| :--- | :--- | :--- |
| **Definition** | Searches exhaustively through a specified parameter grid. | Samples a fixed number of parameter combinations from a given distribution. |
| **Steps** | 1. Select Model <br> 2. Provide a list of parameters | 1. Select Model <br> 2. Provide a list of parameters |
| **Process** | Tries every possible combination from the grid (e.g., $4 \times 2 \times 5 = 40$ runs). | Picks a fixed number (N) of random combinations from the grid. |
| **Best Parameters** | Returns the best combination in the grid, as measured by the cross-validated score. | Returns the best combination among the random samples. |
| **Computational Cost** | High for large grids (slow). | Lower, especially for large search spaces. |
| **Time** | Can be very time-consuming. | Generally faster. |
| **Risk** | Cannot miss the best combination in the grid, but never looks outside the grid. | Might miss the exact best combination, but often finds one that is very close. |
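
To put a number on the "might miss the exact best combination" row, here is a small worked calculation. It assumes there is a single best combination in the grid and that the N combinations are drawn uniformly without replacement; the function name is made up for this sketch.

```python
from fractions import Fraction
from math import comb

def prob_best_is_sampled(total, n_iter):
    """Chance that one specific combination is among n_iter distinct random draws."""
    # 1 minus the share of n_iter-sized samples that avoid that combination.
    return 1 - Fraction(comb(total - 1, n_iter), comb(total, n_iter))

total = 4 * 2 * 5  # the 40-combination grid from the example

for n_iter in (5, 10, 20, 40):
    p = prob_best_is_sampled(total, n_iter)
    print(n_iter, p, float(p))
```

Under these assumptions the probability is simply N / 40, so trying a quarter of the grid gives a one-in-four chance of including the single best combination.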

## ***Hands-On***

**Dataset URL:** https://drive.google.com/file/d/1d57DIdFinPqD-zdXsUM05yfAsxYrjeBT/view?usp=drive_link

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('diabetes.csv')
df

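| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |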


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


**Perform EDA, build a logistic regression model and a decision tree model, and check their performance**

In [None]:
df.duplicated().sum()

np.int64(0)

In [None]:
X = df.iloc[:,:-1]
y = df['Outcome']

In [None]:
X

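| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 |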


In [None]:
y

| | Outcome |
| :--- | :--- |
| 0 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |
| ... | ... |
| 763 | 0 |
| 764 | 0 |
| 765 | 0 |
| 766 | 1 |


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X,y,train_size=0.8,random_state=0)

In [None]:
x_train.shape

(614, 8)

In [None]:
x_test.shape

(154, 8)
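
The two shapes above follow from the 768 rows and `train_size=0.8`. A small arithmetic sketch, assuming the training count is rounded down and the remaining rows go to the test set (which is consistent with the shapes shown):

```python
import math

n_samples = 768      # rows reported by df.info()
train_size = 0.8     # fraction passed to train_test_split

n_train = math.floor(train_size * n_samples)  # 614.4 rounded down
n_test = n_samples - n_train                  # everything that is left

print(n_train, n_test)
```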

In [None]:
logistic_model = LogisticRegression(max_iter=500)
logistic_model.fit(x_train,y_train)

In [None]:
log_pred = logistic_model.predict(x_test)
log_pred

array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0])

In [None]:
accuracy_score(y_test, log_pred)  # signature is accuracy_score(y_true, y_pred)

0.8246753246753247
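
Accuracy is just the share of predictions that match the true labels. The sketch below computes it by hand on two short, made-up label lists (not taken from the dataset). With 154 test rows, a score of about 0.8247 corresponds to 127 correct predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels, for illustration only.
y_true = [1, 0, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

print(accuracy(y_true, y_pred))  # 6 of 8 match -> 0.75
print(127 / 154)                 # the logistic regression score above
```

Because only matches are counted, swapping the two arguments gives the same number for accuracy. That is not true for metrics such as precision or recall, so it is a good habit to pass the true labels first.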

In [None]:
dt_model = DecisionTreeClassifier()
dt_model.fit(x_train,y_train)

In [None]:
dt_pred = dt_model.predict(x_test)
accuracy_score(y_test, dt_pred)

0.7662337662337663

In [None]:
# Now we want to improve the performance of our decision tree model,
# so we will use the hyperparameter tuning techniques described above.

In [None]:
param_grid = {
    'criterion':['gini','entropy'],
    'max_depth':[10,50,75,100],
    'min_samples_split':[2,5,10],
    'min_samples_leaf':[1,2,4],
    'max_features':[None,'sqrt','log2']
}

# These are some of the hyperparameters of DecisionTreeClassifier


In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
randomDT_model = RandomizedSearchCV(dt_model,param_grid,cv=3,scoring='accuracy',n_iter=20)

In [None]:
randomDT_model.fit(x_train,y_train)

In [None]:
randomDT_model.best_params_  # The best parameters found among the sampled combinations

{'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 100,
 'criterion': 'gini'}

In [None]:
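# best_params_ above reports criterion='gini'; the recorded run of this cell passed
# criterion='entropy', so the score shown below may not match a re-run.
new_model = DecisionTreeClassifier(min_samples_split=2, min_samples_leaf=2, max_features='sqrt', max_depth=100, criterion='gini')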

In [None]:
new_model.fit(x_train,y_train)

In [None]:
new_pred = new_model.predict(x_test)
accuracy_score(y_test, new_pred)
# It was fast but not effective

0.7207792207792207

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
grid_model = GridSearchCV(dt_model,param_grid,cv=5,scoring='accuracy')

In [None]:
grid_model.fit(x_train,y_train)
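
To see why the grid search takes noticeably longer than the randomized search, it helps to count model fits. The sketch below uses the `param_grid` defined earlier and the `cv` and `n_iter` values from the two search calls; it leaves out the one extra fit each search performs at the end when `refit=True`.

```python
from math import prod

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 50, 75, 100],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2'],
}

# Number of distinct combinations in the grid: 2 * 4 * 3 * 3 * 3.
n_combinations = prod(len(values) for values in param_grid.values())

grid_fits = n_combinations * 5   # GridSearchCV with cv=5
random_fits = 20 * 3             # RandomizedSearchCV with n_iter=20, cv=3

print(n_combinations, grid_fits, random_fits)
```

So the grid search trains 1080 trees where the randomized search trained 60.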

In [None]:
grid_model.best_params_

{'criterion': 'gini',
 'max_depth': 50,
 'max_features': 'log2',
 'min_samples_leaf': 4,
 'min_samples_split': 5}

In [None]:
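# best_params_ above reports max_features='log2'; the recorded run of this cell passed
# max_features='sqrt', so the score shown below may not match a re-run.
final_dt = DecisionTreeClassifier(criterion='gini', max_depth=50, max_features='log2', min_samples_leaf=4, min_samples_split=5)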

In [None]:
final_dt.fit(x_train,y_train)

In [None]:
pred = final_dt.predict(x_test)
accuracy_score(y_test, pred)

0.7402597402597403
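
Retyping the reported parameters by hand is easy to get wrong. A possible alternative, sketched here on synthetic stand-in data (`X_demo`, `y_demo` and `small_grid` are hypothetical names, not part of this notebook), is to take the fitted model directly from the search object. With the default `refit=True`, `best_estimator_` has already been retrained on the full training set using `best_params_`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the notebook you would use x_train / y_train instead.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

# Deliberately tiny grid so the sketch runs quickly.
small_grid = {'max_depth': [3, 5], 'min_samples_leaf': [1, 4]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), small_grid,
                      cv=3, scoring='accuracy')
search.fit(X_tr, y_tr)

# Option 1: use the model the search already refit with the best parameters.
best_tree = search.best_estimator_
print(search.best_params_)
print(best_tree.score(X_te, y_te))

# Option 2: build a fresh model by unpacking best_params_ instead of retyping it.
rebuilt = DecisionTreeClassifier(random_state=0, **search.best_params_).fit(X_tr, y_tr)
```

Fixing `random_state` on the tree also makes the score repeatable, which matters here because `max_features='sqrt'` or `'log2'` introduces randomness into how the tree is built.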

In [None]:
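# Note: 'bfgs' is not a valid LogisticRegression solver name (the valid name is 'lbfgs'),
# and not every penalty works with every solver, so many of these combinations fail to fit.
logistic_param_grid = {
    'penalty':['l1','l2','elasticnet',None],
    'C':[0.01,0.1,1,10,100],
    'solver':['bfgs','liblinear','saga'],
    'max_iter':[200,500,700],
    'l1_ratio':[0,0.5,1]
}

In [None]:
grid_logistic = GridSearchCV(logistic_model,logistic_param_grid,cv=5,scoring='accuracy')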

In [None]:
grid_logistic.fit(x_train,y_train)

1350 fits failed out of a total of 2700.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
900 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.11/dist-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
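
The counts in the warning above can be reproduced by walking through the grid. The sketch below assumes the following compatibility rules, which are my reading of the `LogisticRegression` documentation rather than something stated in this notebook: `'bfgs'` is not a recognised solver name, `'liblinear'` accepts only the `'l1'` and `'l2'` penalties, and `'saga'` accepts all four penalty values.

```python
from itertools import product

grid = {
    'penalty': ['l1', 'l2', 'elasticnet', None],
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['bfgs', 'liblinear', 'saga'],
    'max_iter': [200, 500, 700],
    'l1_ratio': [0, 0.5, 1],
}

# Assumed solver -> supported penalties table (see the lead-in above).
supported = {
    'liblinear': {'l1', 'l2'},
    'saga': {'l1', 'l2', 'elasticnet', None},
}

cv = 5
total = invalid_solver = bad_pair = 0
for values in product(*grid.values()):
    params = dict(zip(grid, values))
    total += cv                      # each combination is fitted once per fold
    if params['solver'] not in supported:
        invalid_solver += cv         # unknown solver name: parameter validation fails
    elif params['penalty'] not in supported[params['solver']]:
        bad_pair += cv               # valid names, but an unsupported pairing

print(total, invalid_solver, bad_pair, invalid_solver + bad_pair)
```

Under these assumptions, 900 of the 2700 fits fail because of the solver name alone and another 450 because of unsupported solver and penalty pairings, which adds up to the 1350 failures reported.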


In [None]:
grid_logistic.best_params_

{'C': 1,
 'l1_ratio': 0,
 'max_iter': 200,
 'penalty': 'l1',
 'solver': 'liblinear'}
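
One way to avoid the failed fits reported earlier is to pass `param_grid` as a list of dictionaries, which `GridSearchCV` accepts, so that each solver is only paired with penalties it supports. The grouping below is a sketch under assumptions: it uses `'lbfgs'` in place of `'bfgs'`, assumes `'lbfgs'` supports `'l2'` and `None`, and leaves `C` out when `penalty=None` because no regularisation is applied in that case. The variable names are hypothetical.

```python
from math import prod

C_values = [0.01, 0.1, 1, 10, 100]
iters = [200, 500, 700]

# Each dict is its own small grid; the search covers the union of all of them.
logistic_grids = [
    {'solver': ['liblinear', 'saga'], 'penalty': ['l1', 'l2'],
     'C': C_values, 'max_iter': iters},
    {'solver': ['lbfgs'], 'penalty': ['l2'],
     'C': C_values, 'max_iter': iters},
    {'solver': ['saga'], 'penalty': ['elasticnet'],
     'C': C_values, 'max_iter': iters, 'l1_ratio': [0, 0.5, 1]},
    {'solver': ['lbfgs', 'saga'], 'penalty': [None], 'max_iter': iters},
]

n_candidates = sum(prod(len(v) for v in g.values()) for g in logistic_grids)
print(n_candidates, n_candidates * 5)   # candidates, and fits with cv=5
```

This list could then be passed to the search, for example `GridSearchCV(logistic_model, logistic_grids, cv=5, scoring='accuracy')`, giving 126 candidates instead of 540.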

In [None]:
final_logistic_model = LogisticRegression(C=1,l1_ratio=0,max_iter=200,penalty='l1',solver='liblinear')

In [None]:
final_logistic_model.fit(x_train,y_train)



In [None]:
final_pred = final_logistic_model.predict(x_test)

In [None]:
accuracy_score(y_test, final_pred)

0.8246753246753247

In [None]:
# 1 -> 5000
# 2 -> 680

In [None]:
help(GridSearchCV)

Help on class GridSearchCV in module sklearn.model_selection._search:

class GridSearchCV(BaseSearchCV)
 |  GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
 |  
 |  Exhaustive search over specified parameter values for an estimator.
 |  
 |  Important members are fit, predict.
 |  
 |  GridSearchCV implements a "fit" and a "score" method.
 |  It also implements "score_samples", "predict", "predict_proba",
 |  "decision_function", "transform" and "inverse_transform" if they are
 |  implemented in the estimator used.
 |  
 |  The parameters of the estimator used to apply these methods are optimized
 |  by cross-validated grid-search over a parameter grid.
 |  
 |  Read more in the :ref:`User Guide <grid_search>`.
 |  
 |  Parameters
 |  ----------
 |  estimator : estimator object
 |      This is assumed to implement the scikit-learn estimator interface.
 |      Either est