'Ensemble learning': combining the predictions of multiple models to get a more accurate result, e.g. averaging their outputs or taking the most common answer (majority vote). An ensemble can produce stronger results than any of its individual models
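
As a rough illustration of the two combination rules mentioned above, here is a minimal sketch. The prediction arrays are made-up values, not output from any real model.

```python
import numpy as np

# Hypothetical class predictions (0/1) from three models for five samples
preds = np.array([
    [1, 0, 1, 1, 0],  # model A
    [1, 1, 0, 1, 0],  # model B
    [0, 0, 1, 1, 0],  # model C
])

# Majority vote: label a sample 1 when more than half of the models say 1
majority = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)

# Hypothetical predicted probabilities from the same three models
probs = np.array([
    [0.9, 0.4, 0.6, 0.8, 0.1],
    [0.7, 0.6, 0.3, 0.9, 0.2],
    [0.4, 0.2, 0.7, 0.7, 0.3],
])

# Averaging: mean probability per sample, then threshold at 0.5
averaged = probs.mean(axis=0)
avg_label = (averaged >= 0.5).astype(int)

print(majority)
print(avg_label)
```

Each sample's final label depends on the group, so a single model being wrong (e.g. model C on the first sample) does not change the outcome.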

'Base learners' are the individual models that make up an ensemble. They are usually trained on distinct subsets of the training data. Otherwise the models are too strongly correlated with each other and fail for the same reasons

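'Bagging' is 'bootstrap' + 'aggregating': each base learner is trained on a bootstrap sample (rows drawn from the training data with replacement), then the learners' predictions are aggregated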
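
The bootstrap half of bagging can be sketched with NumPy. This is an illustration only: `data` stands in for ten training rows, and the seed and number of bags are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
data = np.arange(10)            # stand-in for ten training rows

# Bootstrap: draw n rows *with replacement*, so some rows repeat
# and others are left out of a given bag
bags = [rng.choice(data, size=len(data), replace=True) for _ in range(3)]

for i, bag in enumerate(bags):
    out_of_bag = np.setdiff1d(data, bag)  # rows this bag never saw
    print(f"bag {i}: {np.sort(bag)}  out-of-bag: {out_of_bag}")
```

One model would be trained per bag, and the aggregating half then combines their predictions by voting or averaging.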

Random forest is an ensemble of decision trees which have been trained on bootstrapped data with randomly selected features. Limits correlated errors

This can work well for big data: each tree is trained only on its own bag, so in principle no single model has to read all of the data at once, and the trees can be trained independently of one another

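When a random forest samples the data, it samples features as well as rows: each split in a tree only considers a random subset of the features, so no single tree can rely on the same few dominant features at every split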

# Tuning

Like decision trees, you can set random forest hyperparameters

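RF models also include **max_features**, which sets the maximum number of features considered when looking for the best split in each decision tree (in scikit-learn the subset is redrawn at every split rather than fixed once per tree)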

**n_estimators** sets how many decision trees will be included in the RF model
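
A small sketch of these two hyperparameters in use. The data is synthetic (from `make_classification`) and the names `X_demo`, `y_demo` and `demo_rf` are made up for this example; none of them come from the churn notebook below.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration, not the churn dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_rf = RandomForestClassifier(
    n_estimators=25,  # number of trees in the forest
    max_features=3,   # features considered at each split
    random_state=0,
)
demo_rf.fit(X_demo, y_demo)

print(len(demo_rf.estimators_))  # one fitted tree per estimator
print(demo_rf.n_features_in_)    # the forest as a whole was given all 8 columns
```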

In [3]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, \
f1_score, confusion_matrix, ConfusionMatrixDisplay

import pickle

# Depending on whether the target is categorical or continuous, you could use
# RandomForestClassifier (categorical) or RandomForestRegressor (continuous)

In [7]:
file_location = '/content/Churn_Modelling.csv'

df_original = pd.read_csv(file_location)
df_original.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [8]:
churn_df = df_original.drop(['RowNumber', 'CustomerId', 'Surname', 'Gender'], axis = 1)
churn_df.head()

Unnamed: 0,CreditScore,Geography,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,41,1,83807.86,1,0,1,112542.58,0
2,502,France,42,8,159660.8,3,1,0,113931.57,1
3,699,France,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,43,2,125510.82,1,1,1,79084.1,0


In [9]:
churn_df2 = pd.get_dummies(churn_df, drop_first = True)
churn_df2.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain
0,619,42,2,0.0,1,1,1,101348.88,1,0,0
1,608,41,1,83807.86,1,0,1,112542.58,0,0,1
2,502,42,8,159660.8,3,1,0,113931.57,1,0,0
3,699,39,1,0.0,2,0,0,93826.63,0,0,0
4,850,43,2,125510.82,1,1,1,79084.1,0,0,1


In [10]:
y = churn_df2['Exited']

x = churn_df2.copy()
x = x.drop('Exited', axis = 1)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, stratify = y)

In [11]:
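%%time
# A line beginning with %% is a cell 'magic' command (a shortcut for a common
# task). %%time reports how long the cell takes to run, and it must be the
# first line of the cell.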

rf = RandomForestClassifier(random_state=0)

# None lets the trees grow without a depth limit
cv_params = {'max_depth': [2,3,4,5, None],
             'min_samples_leaf': [1,2,3],
             'min_samples_split': [2,3,4],
             'max_features': [2,3,4],
             'n_estimators': [75, 100, 125, 150]
             }

# A list of scorer names; recent scikit-learn versions may reject a set here
scoring = ['accuracy', 'precision', 'recall', 'f1']

# cv=5 specifies 5 folds for cross-validation
# GridSearchCV is given several scoring metrics, so refit='f1' tells it to pick
# the best model by F1 score (a combination of precision and recall) and refit it
rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=5, refit='f1')

# rf_cv.fit(x_train, y_train)

CPU times: user 112 µs, sys: 0 ns, total: 112 µs
Wall time: 117 µs
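
The timing above only covers building the search object, since the `fit` call is commented out. To get a feel for how much work the real search involves, the grid can be counted with `ParameterGrid`. This is a minimal sketch; `n_candidates` and `n_fits` are names made up for the example.

```python
from sklearn.model_selection import ParameterGrid

cv_params = {'max_depth': [2, 3, 4, 5, None],
             'min_samples_leaf': [1, 2, 3],
             'min_samples_split': [2, 3, 4],
             'max_features': [2, 3, 4],
             'n_estimators': [75, 100, 125, 150]}

# Every combination of the listed values is one candidate model
n_candidates = len(ParameterGrid(cv_params))

# With cv=5, each candidate is fitted once per fold
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 540 candidates, 2700 fits (plus one final refit)
```

With a single predefined validation split instead of 5 folds, the same grid needs 540 fits.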


In [16]:
path = '/content/drive/MyDrive/Google cert materials/'

# open() with 'wb' creates the pickle file and opens it for writing in binary;
# the 'as' clause assigns the open file to a local name
with open(path + 'rf_cv_model.pickle', 'wb') as to_write:
    pickle.dump(rf_cv, to_write)

In [17]:
# Reading the model (rf_cv) back in after it was saved by the previous cell
with open(path + 'rf_cv_model.pickle', 'rb') as to_read:
    rf_cv = pickle.load(to_read)
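
One way to avoid path and filename slips when saving and loading is to wrap both steps in small helpers. `write_pickle` and `read_pickle` below are hypothetical helpers written for this sketch, and a plain dict stands in for a fitted model.

```python
import os
import pickle
import tempfile


def write_pickle(folder, model_object, save_name):
    """Save model_object to <folder>/<save_name>.pickle (hypothetical helper)."""
    with open(os.path.join(folder, save_name + '.pickle'), 'wb') as to_write:
        pickle.dump(model_object, to_write)


def read_pickle(folder, saved_name):
    """Load the object stored at <folder>/<saved_name>.pickle (hypothetical helper)."""
    with open(os.path.join(folder, saved_name + '.pickle'), 'rb') as to_read:
        return pickle.load(to_read)


# Round trip through a temporary folder, with a dict standing in for a model
with tempfile.TemporaryDirectory() as tmp_dir:
    write_pickle(tmp_dir, {'n_estimators': 100}, 'demo_model')
    restored = read_pickle(tmp_dir, 'demo_model')

print(restored)
```

`os.path.join` inserts the path separator, so it does not matter whether the folder string ends in a slash.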

# Method 1

Haven't run this one, took too long

In [20]:
rf_cv.fit(x_train, y_train)
rf_cv.best_params_

KeyboardInterrupt: ignored

In [None]:
rf_cv.best_score_

In [None]:
# make_results is a helper function (it must be defined before this cell runs)
rf_cv_results = make_results('Random Forest CV', rf_cv)
rf_cv_results
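
`make_results` is not shown anywhere in these notes. A plausible sketch, assuming it builds a one-row table of the mean cross-validated scores for the best-F1 candidate, might look like the following. The column names and the demo search on synthetic data are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def make_results(model_name, model_object):
    """Hypothetical helper: one-row table of mean CV scores for the best-F1 candidate."""
    cv_results = pd.DataFrame(model_object.cv_results_)
    # Row of the candidate with the highest mean validation F1
    best = cv_results.loc[cv_results['mean_test_f1'].idxmax()]
    return pd.DataFrame({
        'Model': [model_name],
        'F1': [best['mean_test_f1']],
        'Recall': [best['mean_test_recall']],
        'Precision': [best['mean_test_precision']],
        'Accuracy': [best['mean_test_accuracy']],
    })


# Tiny demonstration on synthetic data with a deliberately small grid
X_demo, y_demo = make_classification(n_samples=200, random_state=0)
demo_cv = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'max_depth': [2, 3], 'n_estimators': [10]},
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    cv=3,
    refit='f1',
)
demo_cv.fit(X_demo, y_demo)
demo_results = make_results('Demo RF CV', demo_cv)
print(demo_results)
```

The `mean_test_<metric>` columns exist in `cv_results_` because the search was given several named scorers.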

In [None]:
results = pd.read_csv('.../Datasets/result1.csv', index_col = 0)
results

In [None]:
results = pd.concat([rf_cv_results, results])
results

# Method 2

In [24]:
# Split to set validation data
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size = 0.2, \
                                            stratify = y_train, random_state = 10)

In [25]:
# PredefinedSplit convention: -1 keeps a row in training, 0 puts it in the validation fold
split_index = [0 if idx in x_val.index else -1 for idx in x_train.index]
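
To see what such a list of 0s and -1s does, here is a toy sketch with six rows. `toy_split_index` is a made-up example, not the churn data.

```python
from sklearn.model_selection import PredefinedSplit

# -1 = row is always used for training; 0 = row belongs to validation fold 0
toy_split_index = [-1, -1, 0, -1, 0, -1]
toy_split = PredefinedSplit(toy_split_index)

print(toy_split.get_n_splits())  # a single train/validation split

for train_idx, val_idx in toy_split.split():
    print("train positions:", train_idx)
    print("validation positions:", val_idx)
```

Because there is only one fold, a grid search using this splitter fits each candidate once instead of once per cross-validation fold.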

In [30]:
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=0)

cv_params = {'max_depth': [2,3,4,5, None],
             'min_samples_leaf': [1,2,3],
             'min_samples_split': [2,3,4],
             'max_features': [2,3,4],
             'n_estimators': [75, 100, 125, 150]
             }

scoring = ['accuracy', 'precision', 'recall', 'f1']

# Splits the data into training and validation rows using the predefined scheme
custom_split = PredefinedSplit(split_index)

rf_val = GridSearchCV(rf, cv_params, scoring = scoring, cv = custom_split, refit = 'f1')

In [31]:
%%time

rf_val.fit(x_train, y_train)

CPU times: user 5min 59s, sys: 1.25 s, total: 6min 1s
Wall time: 6min 11s
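
Once a validation-split search like this has been fitted, the usual next step is to look at the winning hyperparameters and score the refitted model on held-out data. The sketch below shows this on synthetic data with a deliberately tiny grid. Every name ending in `_d`, plus `fold` and `demo_val`, is an assumption made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

# Synthetic stand-in for the churn data
X_demo, y_demo = make_classification(n_samples=400, random_state=0)
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

# Mark the last fifth of the training rows as the single validation fold
n_train = len(X_train_d)
n_val = n_train // 5
fold = [-1] * (n_train - n_val) + [0] * n_val

demo_val = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'max_depth': [2, None], 'n_estimators': [20]},
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    cv=PredefinedSplit(fold),
    refit='f1',
)
demo_val.fit(X_train_d, y_train_d)

print(demo_val.best_params_)  # hyperparameters with the best validation F1
print(demo_val.best_score_)   # that F1 score on the validation fold

# refit='f1' retrains the best candidate on all training rows,
# so the held-out test set is used for the final score
test_f1 = f1_score(y_test_d, demo_val.predict(X_test_d))
print(test_f1)
```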
