# Grid Search

In this section, we use the GridSearchCV class provided by Scikit-Learn to search for the best hyperparameters for a Random Forest model.

Please refer to the following link for an overview of GridSearchCV before proceeding with the exercise.

GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

The following link provides information about the Random Forest Classifier and its usage in the Scikit-Learn library.

RandomForest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

The following dataset is used for this task.

Heart Disease Cleveland: https://www.kaggle.com/datasets/ritwikb3/heart-disease-cleveland

In [1]:
# Load the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Your code here

In [2]:
# suppress warning messages
import warnings
warnings.filterwarnings('ignore')

# Your code here

In [4]:
# Load the dataset as a Pandas dataframe and display the head
df = pd.read_csv('Heart_disease_cleveland_new.csv')
df.head()
# Your code here

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   0       145   233    1        2      150      0      2.3      2   0     2       0
1   67    1   3       160   286    0        2      108      1      1.5      1   3     1       1
2   67    1   3       120   229    0        2      129      1      2.6      1   2     3       1
3   37    1   2       130   250    0        0      187      0      3.5      2   0     1       0
4   41    0   1       130   204    0        2      172      0      1.4      0   0     1       0


In [5]:
# Check for the null values
df.isnull().sum()

# Your code here

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [6]:
# Separate the feature columns and the target using pandas functions
X = df.drop('target', axis=1)
y = df['target']

# Your code here

In [7]:
# Split dataset into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Your code here

In [8]:
# Print train dataset size
print(X_train.shape)
print(y_train.shape)

# Your code here

(227, 13)
(227,)


In [9]:
# Print test dataset size
print(X_test.shape)
print(y_test.shape)

# Your code here

(76, 13)
(76,)
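
The sizes above follow from the default `test_size` of 0.25 that `train_test_split` uses when neither `test_size` nor `train_size` is given: with 303 rows, the test set receives ceil(0.25 × 303) = 76 rows and the training set the remaining 227. A minimal sketch that reproduces these sizes with placeholder arrays (the zero-filled `X_demo` and `y_demo` are stand-ins introduced here for illustration, not the heart disease data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the dataset (303 rows, 13 features)
X_demo = np.zeros((303, 13))
y_demo = np.zeros(303)

# No test_size given, so the default of 0.25 applies
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

print(X_tr.shape, X_te.shape)
```

Passing `test_size` explicitly makes the split ratio visible to the reader instead of relying on the default.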


In [10]:
# Scale the data using standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Your code here

In [11]:
# Define the random forest classifier with the default parameters
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(random_state=0)

# Your code here

In [12]:
# Define the parameter grid for the grid search
# Refer to the GridSearchCV Documentation
param_grid = {'n_estimators': [100, 200, 300, 400, 500],
              'max_depth': [3, 4, 5, 6, 7],
              'max_features': [3, 4, 5, 6, 7]}

# Your code here
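
It can help to estimate the cost of a grid before fitting it. The sketch below counts the candidate combinations in the grid above with `ParameterGrid` and multiplies by the number of folds; the variable names `n_candidates`, `n_folds`, and `n_fits` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [3, 4, 5, 6, 7],
    "max_features": [3, 4, 5, 6, 7],
}

# Every combination of the listed values is one candidate model
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is trained once per cross-validation fold
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

With three parameters of five values each, the grid has 125 candidates, which means 625 fits at `cv=5`, plus one final refit of the best candidate on the full training set when `refit=True` (the default).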

In [13]:
# Perform Grid Search to identify optimal parameters
# Use cv = 5
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(rf_clf, param_grid, cv=5, return_train_score=True)

# Your code here

In [14]:
# Fit the training data
grid_search.fit(X_train_scaled, y_train)
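
In the cells above, the scaler is fitted on the whole training set before cross-validation, so the rows used for validation in each fold have already contributed to the scaling statistics. One common alternative, sketched here as an assumption rather than as this notebook's method, is to wrap the scaler and the classifier in a `Pipeline` so that the scaler is refitted inside every fold. The example uses synthetic data from `make_classification` and a deliberately small grid so that it runs quickly; the names `X_demo`, `pipe`, `pipe_grid`, and `pipe_search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (13 features, binary target)
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The scaler is a pipeline step, so it is fitted on the training part of each fold only
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
pipe_grid = {"rf__n_estimators": [10, 20], "rf__max_depth": [3, 5]}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=5)
pipe_search.fit(X_tr, y_tr)

# score() applies the fitted scaler to the test data before predicting
test_accuracy = pipe_search.score(X_te, y_te)
print(pipe_search.best_params_, test_accuracy)
```

Tree-based models are largely insensitive to feature scaling, so the difference for a random forest is likely small; the pattern matters more for estimators that depend on feature magnitudes.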

In [15]:
# Print best hyperparameters detected from the Grid Search
grid_search.best_params_

# Your code here

{'max_depth': 3, 'max_features': 4, 'n_estimators': 100}

In [16]:
# Print the mean cross-validated score of the best estimator
grid_search.best_score_

# Your code here

0.8592270531400967
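
Because the search was created with `return_train_score=True`, the fitted object's `cv_results_` attribute holds training scores as well as validation scores for every candidate, not just the best one. A minimal sketch of inspecting them as a DataFrame, using synthetic data and a small grid so that it runs on its own (`X_demo`, `small_grid`, `demo_search`, and `summary` are hypothetical names, not part of the notebook):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)

small_grid = {"n_estimators": [10, 20], "max_depth": [3, 5]}
demo_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    small_grid,
    cv=5,
    return_train_score=True,
)
demo_search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays with one entry per candidate
results = pd.DataFrame(demo_search.cv_results_)
cols = ["params", "mean_train_score", "mean_test_score",
        "std_test_score", "rank_test_score"]
summary = results[cols].sort_values("rank_test_score")
print(summary)
```

A large gap between `mean_train_score` and `mean_test_score` for a candidate is a hint that it overfits the training folds.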

In [17]:
# Use best estimator to obtain the accuracy for the test set
grid_search.best_estimator_.score(X_test_scaled, y_test)

# Your code here

0.7894736842105263
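
A single accuracy figure does not show which class the errors fall into. The sketch below, again on synthetic data with hypothetical names (`X_demo`, `clf`, `cm`), breaks test-set predictions down with `confusion_matrix` and checks that the accuracy equals the share of predictions on the matrix diagonal.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

clf = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
acc = accuracy_score(y_te, y_pred)

print(cm)
print(acc)
```

On the notebook's own data, the same calls could be applied to `grid_search.best_estimator_.predict(X_test_scaled)` and `y_test`.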