In [1]:
import pandas as pd

In [2]:
dfml = pd.read_csv('./encoded_heart.csv')
dfml

Unnamed: 0,Age,Sex,RestingBP,Cholesterol,FastingBS,MaxHR,ExerciseAngina,Oldpeak,HeartDisease,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_Normal,RestingECG_ST,ST_Slope_Flat,ST_Slope_Up
0,40,1,140,289,0,172,0,0.0,0,1,0,0,1,0,0,1
1,49,0,160,180,0,156,0,1.0,1,0,1,0,1,0,1,0
2,37,1,130,283,0,98,0,0.0,0,1,0,0,0,1,0,1
3,48,0,138,214,0,108,1,1.5,1,0,0,0,1,0,1,0
4,54,1,150,195,0,122,0,0.0,0,0,1,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
913,45,1,110,264,0,132,0,1.2,1,0,0,1,1,0,1,0
914,68,1,144,193,1,141,0,3.4,1,0,0,0,1,0,1,0
915,57,1,130,131,0,115,1,1.2,1,0,0,0,1,0,1,0
916,57,0,130,236,0,174,0,0.0,1,1,0,0,0,0,1,0


In [3]:
# split data
from sklearn.model_selection import train_test_split

X = dfml.drop('HeartDisease', axis=1)
y = dfml['HeartDisease']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [4]:
# normalize data
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

In [5]:
# make some initial predictions as control

from sklearn.metrics import accuracy_score 
from sklearn.ensemble import RandomForestClassifier

# define model
RF_clf = RandomForestClassifier(random_state=42)
# fit model
RF_clf.fit(X_train, y_train)
# make predictions
RF_preds = RF_clf.predict(X_test)
# check overall accuracy %
RF_acc = accuracy_score(y_test, RF_preds)

RF_acc


0.8695652173913043

# Hyperparameter Tuning

Let's try to improve the performance of our model by tuning hyperparameters. We will have a look today at just a few options for Random Forest Classifier (check the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for all options): 
  - `n_estimators` - represents the number of decision trees in the "forest". A higher number will usually give a better response, but will require a lot more processing power. 
  - `max_depth` - defines the maximum path length from the "root" node to the "leaf" nodes in each decision tree. Higher numbers will also generally give better results, but will also increase processing power.
  - `min_sample_split` - the number of observations required before splitting into a new branch node on the tree. A higher number can prevent overfitting on individual decision trees. 
  - `max_features` - the maximum number of features to be considered for a split in a tree. Usually the ideal number lies somewhere around the square root of the number of features present, so this function will accept either a number, or a string value that defines the calculation that should be made for a number.

Knowing which hyperparameters to tune, and the options to try, takes research. For the Random Forest algorithm, I found [this](https://blog.dataiku.com/narrowing-the-search-which-hyperparameters-really-matter), [this](https://www.analyticsvidhya.com/blog/2020/03/beginners-guide-random-forest-hyperparameter-tuning/), and [this](https://medium.com/all-things-ai/in-depth-parameter-tuning-for-random-forest-d67bb7e920d) article helpful while writing this spike. There are also many many guides available online for any model you might want to try. 

## GridSearchCV vs. RandomizedSearchCV

When you have determined the hyperparameters to be tested, SciKit Learn provide some helpful functions to automate the process. These functions will test combinations of hyperparameter values set within your specified boundaries to find the combination that optimizes performance. 

[`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) will test _every_ combination of your defined hyperparameters, which guarantees you will have the best possible result. However, if you are working with a large dataset, and/or are testing large ranges of values for the hyperparameters, then testing every possible combination will require a lot of processing power (and time!!) to achieve. 

[`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) is an alternative that [randomly selects values](https://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-search) in the ranges you set for your hyperparameter testing. You can set the maximum number of iterations which will then limit the time spent making comparisons. It returns the best combination of results from the random combinations it tests, but you cannot guarantee that this is the best result overall. It can potentially be used as a way to narrow the range for individual hyperparameters for a more limited range GridSearch. 

Whichever method you choose, you will need to define a Python dictionary with the hyperparameters you wish to test, and all the values you wish to be tested:

In [6]:
import numpy as np

n_estimators = np.arange(10,201,10)
max_depth = np.arange(5, 55, 5)
min_samples_split = np.arange(2, 15, 2)
max_features = ["sqrt", "log2", None]

param_grid = {
  'n_estimators': n_estimators,
  'max_depth': max_depth,
  'min_samples_split': min_samples_split,
  'max_features': max_features,
}

param_grid

{'n_estimators': array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110, 120, 130,
        140, 150, 160, 170, 180, 190, 200]),
 'max_depth': array([ 5, 10, 15, 20, 25, 30, 35, 40, 45, 50]),
 'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14]),
 'max_features': ['sqrt', 'log2', None]}

Define a blank model, then give the model, together with your parameter grid dictionary, to the grid function. It will return an object that can now be fitted with the same training data variables you use to train a single model:

In [12]:
from sklearn.model_selection import RandomizedSearchCV

# redefine model
RF_clf = RandomForestClassifier(random_state=42)

RS_grid = RandomizedSearchCV(estimator=RF_clf, param_distributions=param_grid, n_iter=10)
RS_grid

In [13]:
RS_grid.fit(X_train, y_train)

After being fitted, the grid object now contains metrics regarding the tests. We can print some of the highlights, and then if we want to see exactly how it arrived to that conclusion, we can create a dataframe from the hyperparameter combinations and the accuracy score for that combination. 

In [None]:
print(
  'best score: ', RS_grid.best_score_,
  '\nparams: ', RS_grid.best_params_
)

'''
attempt 1
best score:  0.8719038300251608 
params:  {'n_estimators': np.int64(100), 'min_samples_split': np.int64(6), 'max_features': 'sqrt', 'max_depth': np.int64(15)}

best score:  0.8760041002702451 
params:  {'n_estimators': np.int64(160), 'min_samples_split': np.int64(10), 'max_features': 'sqrt', 'max_depth': np.int64(40)}
'''

best score:  0.8760041002702451 
params:  {'n_estimators': np.int64(160), 'min_samples_split': np.int64(10), 'max_features': 'sqrt', 'max_depth': np.int64(40)}


"\nattempt 1\nbest score:  0.8719038300251608 \nparams:  {'n_estimators': np.int64(100), 'min_samples_split': np.int64(6), 'max_features': 'sqrt', 'max_depth': np.int64(15)}\n"

In [10]:
RS_grid.cv_results_

{'mean_fit_time': array([0.18846245, 0.13977637, 0.09918504, 0.19162474, 0.38698325,
        0.4282577 , 0.28531446, 0.22063003, 0.04116573, 0.31410756]),
 'std_fit_time': array([0.00181613, 0.00469174, 0.00351446, 0.00956455, 0.01359798,
        0.01616394, 0.01353329, 0.00400522, 0.00109342, 0.02635177]),
 'mean_score_time': array([0.00964899, 0.00823379, 0.00556593, 0.01120305, 0.01339521,
        0.01446695, 0.01137795, 0.01217623, 0.00314322, 0.01026649]),
 'std_score_time': array([1.01435312e-03, 7.09855163e-04, 5.52045200e-04, 1.19552477e-03,
        1.22210540e-03, 1.01312480e-03, 2.18886503e-03, 7.17489765e-04,
        7.99814702e-05, 7.59409373e-04]),
 'param_n_estimators': masked_array(data=[100, 80, 70, 140, 190, 200, 140, 160, 30, 150],
              mask=[False, False, False, False, False, False, False, False,
                    False, False],
        fill_value=999999),
 'param_min_samples_split': masked_array(data=[6, 12, 12, 12, 10, 2, 10, 6, 10, 6],
              mas

In [15]:
import pandas as pd

grid_results = pd.concat([
  pd.DataFrame(RS_grid.cv_results_["params"]),
  pd.DataFrame(RS_grid.cv_results_["mean_test_score"], columns=["Accuracy"])
], axis=1)

grid_results

Unnamed: 0,n_estimators,min_samples_split,max_features,max_depth,Accuracy
0,190,12,sqrt,5,0.866443
1,170,14,,25,0.851458
2,20,14,sqrt,10,0.861029
3,200,6,sqrt,30,0.869183
4,80,12,log2,40,0.871895
5,10,2,log2,20,0.850107
6,30,4,,25,0.839204
7,160,10,sqrt,40,0.876004
8,80,2,log2,15,0.867841
9,30,2,,20,0.835104


When you are performing random searches, you can narrow the range of certain hyperparameters to be closer to where you are getting the best results. When you think you are close, you can run a grid search on the limited range. Be aware that there might actually be _multiple_ optimal combinations! If you are presented with this scenario, it is always best to go for the first result (i.e. the one where the values are lowest), as this reduces the processing power required for the model while training. 

In [None]:
# from sklearn.model_selection import GridSearchCV

# # redefine model
# RF_clf = RandomForestClassifier(random_state=42)

# GS_grid = GridSearchCV(estimator=RF_clf, param_grid=param_grid)
# GS_grid

In [None]:
# GS_grid.fit(X_train, y_train)

# after running 40m 27.9s
# best score:  0.8787345075016308 
# params:  {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_split': 8, 'n_estimators': 170}

Look at other workbook with contour plots for visualizations that showcase multiple peaks. (To run it locally, you will need to install both `plotly` and `nbformat` in your project's Anaconda environment.)