## Grid Search Hyperparameter Optimization

This case study uses grid search to identify the optimal parameters for a machine learning algorithm. To complete this case study, you'll use the Pima Indian diabetes dataset from Kaggle and a K-nearest neighbors (KNN) classifier. Follow along with the preprocessing steps of this case study.

#### Load the necessary packages

In [43]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Set a random seed to try to make this exercise and its solutions reproducible (NB: the seed is fixed here for teaching purposes)
random_seed_number = 42
np.random.seed(random_seed_number)

#### Load the diabetes data

In [44]:
diabetes_data = pd.read_csv('diabetes.csv')
diabetes_data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


**<font color='teal'> Start by reviewing the data info.</font>**

In [45]:
diabetes_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


**<font color='teal'> Apply the describe function to the data.</font>**

In [46]:
diabetes_data.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


**<font color='teal'> Currently, the missing values in the dataset are represented as zeros. Replace the zero values in the following columns ['Glucose','BloodPressure','SkinThickness','Insulin','BMI'] with NaN.</font>**

In [47]:
# Columns in which a zero is not a plausible measurement and really means "missing"
columns_to_replace = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Replace 0 with NaN in the specified columns
diabetes_data[columns_to_replace] = diabetes_data[columns_to_replace].replace(0, np.nan)
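
As a minimal sketch of the zero-to-NaN replacement requested above, the snippet below works on a small made-up frame (the `toy` DataFrame and its values are hypothetical, not taken from the dataset). It shows that only the listed columns are touched, so a legitimate zero such as a pregnancy count of 0 is left alone.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the diabetes data (hypothetical values)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BMI": [33.6, 26.6, 0.0],
    "Pregnancies": [6, 0, 8],  # a zero is a valid count here, so it is not replaced
})

# Only these columns use 0 as a placeholder for "missing"
cols = ["Glucose", "BMI"]
toy[cols] = toy[cols].replace(0, np.nan)

# Count the missing values per column after the replacement
missing_counts = toy.isna().sum()
print(missing_counts)
```

Calling `isna().sum()` right after the replacement is a quick way to confirm how many values each column is actually missing before choosing an imputation strategy.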

**<font color='teal'> Plot histograms of each column. </font>**

In [None]:
diabetes_data.hist(bins=10, figsize=(10, 8), color='skyblue', edgecolor='black')

# Add a title to the entire plot
plt.suptitle('Histograms of Dataset Columns', fontsize=16)

# Display the plots
plt.tight_layout(rect=[0, 0, 1, 0.96])  # Adjust layout to fit the title
plt.show()

#### Replace the missing (NaN) values with mean and median values.

In [None]:
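# Assign the result back instead of calling fillna(..., inplace=True) on a column selection,
# which is deprecated in recent pandas versions and may not modify the original DataFrame.
diabetes_data['Glucose'] = diabetes_data['Glucose'].fillna(diabetes_data['Glucose'].mean())
diabetes_data['BloodPressure'] = diabetes_data['BloodPressure'].fillna(diabetes_data['BloodPressure'].mean())
diabetes_data['SkinThickness'] = diabetes_data['SkinThickness'].fillna(diabetes_data['SkinThickness'].median())
diabetes_data['Insulin'] = diabetes_data['Insulin'].fillna(diabetes_data['Insulin'].median())
diabetes_data['BMI'] = diabetes_data['BMI'].fillna(diabetes_data['BMI'].median())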
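
The following is a small illustration of mean versus median imputation on made-up numbers (the `toy` frame is hypothetical). It assumes the missing entries are already NaN, and it uses the assignment form of `fillna` rather than `inplace=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical values; NaN marks a missing measurement
toy = pd.DataFrame({
    "Glucose": [100.0, np.nan, 140.0, 120.0],
    "Insulin": [np.nan, 30.0, 90.0, 600.0],  # one large value skews the mean
})

# Mean imputation for a roughly symmetric column
toy["Glucose"] = toy["Glucose"].fillna(toy["Glucose"].mean())

# Median imputation for a skewed column, since the median is less sensitive to outliers
toy["Insulin"] = toy["Insulin"].fillna(toy["Insulin"].median())

print(toy)
```

Both `mean()` and `median()` skip NaN by default, so the fill value is computed from the observed entries only.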

**<font color='teal'> Plot histograms of each column after replacing nan. </font>**

In [None]:
diabetes_data.hist(bins=10, figsize=(10, 8), color='orange', edgecolor='black')

# Add a title to the entire plot
plt.suptitle('Histograms of Dataset Columns', fontsize=16)

# Display the plots
plt.tight_layout(rect=[0, 0, 1, 0.96])  # Adjust layout to fit the title
plt.show()

#### Plot the correlation matrix heatmap

In [None]:
plt.figure(figsize=(12,10))
print('Correlation between various features')
p=sns.heatmap(diabetes_data.corr(), annot=True,cmap ='Blues')

**<font color='teal'> Define the `y` variable as the `Outcome` column.</font>**

In [None]:
y=diabetes_data['Outcome']
X = diabetes_data.drop('Outcome',axis=1)

**<font color='teal'> Create a 70/30 train and test split. </font>**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 5)

**<font color='teal'> Using Sklearn, standardize the magnitude of the features by scaling the values. </font>**

Note: Don't forget to fit() your scaler on X_train and then use that fitted scaler to transform() X_test. This is to avoid data leakage while you standardize your data.

In [None]:
from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training data only, then reuse it to transform the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
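
To make the leakage note concrete, here is a sketch on randomly generated data (the arrays `X_train_toy` and `X_test_toy` are assumptions made for illustration). It shows that the scaler's statistics come from the training split alone, so the scaled test split is centered only approximately.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for a train/test split (hypothetical data)
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=50, scale=10, size=(100, 2))
X_test_toy = rng.normal(loc=50, scale=10, size=(30, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learns mean and std from the training split
X_test_scaled = scaler.transform(X_test_toy)        # reuses the training statistics

print("Train mean after scaling:", X_train_scaled.mean(axis=0))  # approximately 0
print("Test mean after scaling: ", X_test_scaled.mean(axis=0))   # near 0, but generally not exactly 0
```

The fitted statistics are stored in `scaler.mean_` and `scaler.scale_`, which can be inspected to confirm they match the training split.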

#### Using neighbor values from 1 to 9, apply the K-nearest neighbors classifier to classify the data.

In [None]:
from sklearn.neighbors import KNeighborsClassifier


test_scores = []
train_scores = []

for k in range(1, 10):  # k = 1, 2, ..., 9 (range excludes the upper bound)
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)

    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

**<font color='teal'> Print the train and test scores for each iteration.</font>**

In [None]:
print('Train Scores:', train_scores, '\n')
print('Test Scores:', test_scores)


**<font color='teal'> Identify the number of neighbors that resulted in the max score in the training dataset. </font>**

In [None]:
# Find the maximum training score and its corresponding number of neighbors (k)
max_train_score = max(train_scores)
optimal_k_train = train_scores.index(max_train_score) + 1  # +1 because range starts from 1, not 0

print(f"The maximum training score is {max_train_score:.4f} achieved with {optimal_k_train} neighbors.")


**<font color='teal'> Identify the number of neighbors that resulted in the max score in the testing dataset. </font>**

In [None]:
# Find the index of the maximum testing accuracy score
max_test_accuracy_index = test_scores.index(max(test_scores))

# The corresponding number of neighbors (k) that resulted in the max testing accuracy
best_k_test = max_test_accuracy_index + 1  # +1 because k starts from 1, not 0

# Print the result
print(f"The number of neighbors (k) that resulted in the maximum testing accuracy is: {best_k_test}")
print(f"Maximum Testing Accuracy: {test_scores[max_test_accuracy_index]:.4f}")
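
The index-to-k bookkeeping above can also be sketched by pairing each score with its k explicitly, which avoids the `+ 1` offset. The scores below are invented for illustration, and the names `k_values`, `scores`, and `tied_ks` are hypothetical.

```python
import numpy as np

k_values = list(range(1, 10))  # the k for each fitted model, in order

# Hypothetical accuracy scores, one per k (not results from the dataset)
scores = [0.70, 0.72, 0.75, 0.74, 0.75, 0.73, 0.71, 0.70, 0.69]

# np.argmax returns the first index of the maximum, so ties resolve to the smallest k
best_index = int(np.argmax(scores))
best_k = k_values[best_index]

# List every k that reaches the maximum, to make ties visible
tied_ks = [k for k, s in zip(k_values, scores) if s == max(scores)]

print(f"Best k: {best_k}, score: {scores[best_index]:.4f}, all tied k values: {tied_ks}")
```

Looking up k from a list of candidate values keeps working if the candidates are later changed to something other than consecutive integers starting at 1.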

Plot the train and test model performance by number of neighbors.

In [None]:
plt.figure(figsize=(12,5))
sns.lineplot(x=range(1,10),y=train_scores,marker='*',label='Train Score')
sns.lineplot(x=range(1,10),y=test_scores,marker='o',label='Test Score')

**<font color='teal'> Fit and score the best number of neighbors based on the plot. </font>**

In [None]:
# Best number of neighbors, chosen by reading the plot above (hard-coded, not computed)
best_k = 3

# Initialize the KNN classifier with the best k value
best_knn = KNeighborsClassifier(n_neighbors=best_k)

# Fit the model to the training data
best_knn.fit(X_train, y_train)

# Evaluate the model on the training and test datasets
train_accuracy = best_knn.score(X_train, y_train)
test_accuracy = best_knn.score(X_test, y_test)

# Output the best k and corresponding accuracies
print(f"Best number of neighbors (k): {best_k}")
print(f"Train Accuracy with k={best_k}: {train_accuracy:.4f}")
print(f"Test Accuracy with k={best_k}: {test_accuracy:.4f}")

**<font color='teal'> Plot the confusion matrix for the model fit above. </font>**

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# Make predictions on the test set
y_pred = best_knn.predict(X_test)

# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plotting the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False, 
            xticklabels=['No Diabetes', 'Diabetes'], 
            yticklabels=['No Diabetes', 'Diabetes'])

plt.title(f'Confusion Matrix (k={best_k})')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

**<font color='teal'> Print the classification report </font>**

In [None]:
# Print the classification report
from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred, target_names=['No Diabetes', 'Diabetes'])
print(report)
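
As a worked illustration of how the precision and recall figures in a classification report relate to the confusion matrix, the sketch below derives them by hand for the positive class. The labels are made up for the example (`y_true_toy` and `y_pred_toy` are hypothetical).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (1 = diabetes, 0 = no diabetes)
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 1, 0, 1, 1, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were correct
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
print("sklearn:", precision_score(y_true_toy, y_pred_toy), recall_score(y_true_toy, y_pred_toy))
```

For a screening task such as this one, recall on the positive class is often worth inspecting alongside accuracy, because a false negative is a missed case.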

For the K-nearest neighbors algorithm, the number of neighbors K is one of the most important parameters affecting model performance. The model performance isn't horrible, but what if we didn't consider a wide enough range of neighbor values? An alternative to fitting models in a hand-written loop is to use a grid search to identify the best value. Grid search is a common way to tune the adjustable parameters (hyperparameters) of many machine learning algorithms. First, you define the grid, that is, the set of candidate values for the parameter being optimized, and then you compare model performance across the values in the grid.
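
To show what a grid search does conceptually, here is a hand-rolled sketch: score every candidate value with cross-validation and keep the one with the highest mean score. It runs on synthetic data from `make_classification` rather than the diabetes data, and the names `X_toy`, `y_toy`, `grid`, and `mean_scores` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data (hypothetical stand-in for a scaled training set)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

grid = range(1, 11)  # candidate values for n_neighbors
mean_scores = {}

for k in grid:
    # Five-fold cross-validated accuracy for this candidate
    fold_scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_toy, y_toy, cv=5, scoring="accuracy"
    )
    mean_scores[k] = fold_scores.mean()

# Keep the candidate with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(f"Best k: {best_k}, mean CV accuracy: {mean_scores[best_k]:.4f}")
```

Unlike the earlier loop that scored each k on the test set, this selection relies only on cross-validation within the training data, so the test set stays untouched until the final evaluation.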

Run the code in the next cell to see how to implement grid search for the n_neighbors parameter. Notice that param_grid holds the range of values to test, and that five-fold cross-validation is used to score each candidate value of n_neighbors.

In [None]:
from sklearn.model_selection import GridSearchCV


# Define the parameter grid for n_neighbors
param_grid = {'n_neighbors': range(1, 31)}  # Testing neighbor values from 1 to 30

# Initialize the KNN classifier
knn = KNeighborsClassifier()

# Set up GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the model using Grid Search
grid_search.fit(X_train, y_train)

# Output the best parameters and best score
print(f"Best number of neighbors: {grid_search.best_params_['n_neighbors']}")
print(f"Best cross-validated accuracy: {grid_search.best_score_:.4f}")

#### Print the best score and best parameter for n_neighbors.

In [None]:
# Output the best parameters and best score
print(f"Best number of neighbors: {grid_search.best_params_['n_neighbors']}")
print(f"Best cross-validated accuracy: {grid_search.best_score_:.4f}")

Here you can see that, based on the grid search performed, the best value of n_neighbors for this model is 14.
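
Beyond the single best value, the full table of cross-validated scores is available through the `cv_results_` attribute of a fitted `GridSearchCV`. The sketch below, run on synthetic data (`X_toy`, `y_toy`, and `search` are hypothetical names), shows one way to check how close the runner-up values are.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only to keep the example self-contained
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 11))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

# One row per candidate, with mean and standard deviation across the folds
results = pd.DataFrame(search.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
]
top = results.sort_values("rank_test_score").head(3)
print(top)
```

If several neighbor values score within about one standard deviation of the best, the choice among them is not strongly supported by the data, and a simpler criterion can be used to break the tie.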

**<font color='teal'> Now, following the KNN example, apply this grid search method to find the optimal number of estimators in a Random Forest model.
</font>**

In [None]:
from sklearn.ensemble import RandomForestClassifier
# Define the parameter grid for n_estimators
param_grid = {'n_estimators': range(10, 201, 10)}  # Testing estimators from 10 to 200 in steps of 10

# Initialize the Random Forest classifier
rf = RandomForestClassifier(random_state=0)

# Set up GridSearchCV with 5-fold cross-validation
grid_search_rf = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the model using Grid Search
grid_search_rf.fit(X_train, y_train)

# Output the best parameters and best score
print(f"Best number of estimators: {grid_search_rf.best_params_['n_estimators']}")
print(f"Best cross-validated accuracy: {grid_search_rf.best_score_:.4f}")

In [None]:
best_rf = grid_search_rf.best_estimator_
test_accuracy = best_rf.score(X_test, y_test)
print(f"Test Accuracy with Best Random Forest: {test_accuracy:.4f}")