# Predicting Churn

Beta Bank customers are leaving: little by little, chipping away every month. The bankers figured out it’s cheaper to save the existing customers rather than to attract new ones.

We need to predict whether a customer will leave the bank soon. We have the data on clients’ past behavior and termination of contracts with the bank.

We are required to build a model with the maximum possible F1 score which must be at least 0.59. 

## Data preprocessing

In [67]:
#importing libraries
import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt


from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

import joblib

In [2]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

In [3]:
df = pd.read_csv('/datasets/Churn.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [5]:
df.columns = df.columns.str.lower()
df.head()

Unnamed: 0,rownumber,customerid,surname,creditscore,geography,gender,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


There are a number of columns we can use in our model:

-As long as there are no unforeseen issues with the data, `creditscore`, `age`, `balance`, `numofproducts`, `hascrcard`, `isactivemember` and `estimatedsalary` can all be used as features in tree-based models without being modified.

- `geography` and `gender` can be encoded using LabelEncoder before being used as features.

- `tenure` can be used once missing values a dealt with.

- `Exited` can be used 'as is' for our traget variable.

We can now take a quick look at each column in turn, making sure there are no issues with our data.

#### `creditscore`

In [6]:
df['creditscore'].describe()

count    10000.000000
mean       650.528800
std         96.653299
min        350.000000
25%        584.000000
50%        652.000000
75%        718.000000
max        850.000000
Name: creditscore, dtype: float64

All looks fine

#### `geography`

In [7]:
df['geography'].unique()

array(['France', 'Spain', 'Germany'], dtype=object)

We can encode these using LableEncoder

In [8]:
#encoding categorical variables
le = LabelEncoder()
df['geography'] = le.fit_transform(df['geography'])
df['geography'].unique()

array([0, 2, 1])

#### `gender`

Same approach as to `geography`

In [9]:
df['gender'].unique()

array(['Female', 'Male'], dtype=object)

In [10]:
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
df['gender'].unique()

array([0, 1])

#### `age`

In [11]:
df['age'].describe()

count    10000.000000
mean        38.921800
std         10.487806
min         18.000000
25%         32.000000
50%         37.000000
75%         44.000000
max         92.000000
Name: age, dtype: float64

#### `tenure`

Here we have some missing values, let's first see if there are any patterns to in our missing data.

In [12]:
df['tenure'].describe()

count    9091.000000
mean        4.997690
std         2.894723
min         0.000000
25%         2.000000
50%         5.000000
75%         7.000000
max        10.000000
Name: tenure, dtype: float64

In [13]:
#checking absolute differences in key metrics
df[df['tenure'].isnull()].describe() - df[df['tenure'].notnull()].describe()

Unnamed: 0,rownumber,customerid,creditscore,geography,gender,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited
count,-8182.0,-8182.0,-8182.0,-8182.0,-8182.0,-8182.0,-9091.0,-8182.0,-8182.0,-8182.0,-8182.0,-8182.0,-8182.0
mean,-147.523772,-1238.579851,-2.285508,-0.005308,-0.015781,-0.301216,,-405.398541,5.8e-05,0.005688,-0.005114,-1000.825551,-0.002618
std,25.170877,3498.062048,2.66891,0.007571,0.00149,-0.770144,,776.162138,0.007449,-0.002375,0.000381,-1246.691883,-0.001738
min,30.0,105.0,9.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,95.09,0.0
25%,-210.5,-2407.5,-4.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,-1355.415,0.0
50%,-132.0,-4191.0,-5.0,0.0,0.0,0.0,,-643.7,0.0,0.0,0.0,-796.18,0.0
75%,-205.5,3952.5,1.0,1.0,0.0,-1.0,,993.09,0.0,0.0,0.0,-3807.51,0.0
max,1.0,30.0,0.0,0.0,0.0,0.0,,-44234.34,0.0,0.0,0.0,-602.03,0.0


In [14]:
#checking correllations
df.corr()['tenure']

rownumber         -0.007322
customerid        -0.021418
creditscore       -0.000062
geography         -0.000888
gender             0.012634
age               -0.013134
tenure             1.000000
balance           -0.007911
numofproducts      0.011979
hascrcard          0.027232
isactivemember    -0.032178
estimatedsalary    0.010520
exited            -0.016761
Name: tenure, dtype: float64

There do not appear to be any patterns to our missing data, we will therefore replace those missing values with the median value for `tender`.

In [15]:
#replacing missing values
df.loc[df['tenure'].isnull(), 'tenure'] = df[df['tenure'].notnull()]['tenure'].median()
df['tenure'].describe()

count    10000.00000
mean         4.99790
std          2.76001
min          0.00000
25%          3.00000
50%          5.00000
75%          7.00000
max         10.00000
Name: tenure, dtype: float64

#### `balance`

In [16]:
df['balance'].describe()

count     10000.000000
mean      76485.889288
std       62397.405202
min           0.000000
25%           0.000000
50%       97198.540000
75%      127644.240000
max      250898.090000
Name: balance, dtype: float64

No issues.

#### `numofproducts`

In [17]:
df['numofproducts'].describe()

count    10000.000000
mean         1.530200
std          0.581654
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max          4.000000
Name: numofproducts, dtype: float64

No issues.

#### `hascrcard`

In [18]:
df['hascrcard'].describe()

count    10000.00000
mean         0.70550
std          0.45584
min          0.00000
25%          0.00000
50%          1.00000
75%          1.00000
max          1.00000
Name: hascrcard, dtype: float64

No issues.

#### isactivemember`

In [19]:
df['isactivemember'].describe()

count    10000.000000
mean         0.515100
std          0.499797
min          0.000000
25%          0.000000
50%          1.000000
75%          1.000000
max          1.000000
Name: isactivemember, dtype: float64

No issues.

#### `estimatedsalary`

In [20]:
df['estimatedsalary'].describe()

count     10000.000000
mean     100090.239881
std       57510.492818
min          11.580000
25%       51002.110000
50%      100193.915000
75%      149388.247500
max      199992.480000
Name: estimatedsalary, dtype: float64

No issues.

#### `exited`

In [21]:
df['exited'].describe()

count    10000.000000
mean         0.203700
std          0.402769
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: exited, dtype: float64

We can see that our data is currently imbalanced, with only ~20% of data having a positive target class. 
We'll first try evaluating models without rebalancing, then see if we can improve our scores with some rebalancing techniques.

## Splitting data

We will be using k-Fold Cross-Validation to find the ideal hyperparameters for our RandomForestClassifier wheras we'll merely perform a naive search for max_depth on our DecisionTreeClassifier. Cross-validation does not require a pre split set of train and validation data but our naive search for our DecisionTreeClassifer will not.

We will therefore need to split our data twice to end up with:

- X_test, y_test : 10% of dataset
    - these will be 'unseen' and be used to validate our model.
    
- X, y: 90% of our dataset 
    - These will be used in Cross Validation
    
- X_test, y_test, X_val, y_val: 70% and 20% respectively (subsets of X, y)
    - these will be used in our DecisionTreeClassifier but will not be used separately in Cross Validation.    
    


In [22]:
# getting features and targets
X = df.drop(['rownumber', 'customerid', 'surname', 'exited'], axis = 1) 
y = df['exited']

In [23]:
#splitting off test set
X, X_test, y, y_test = train_test_split(X, y, test_size = 0.1, random_state = 1)
#creating 20% validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=(0.2222), random_state = 1) 

In [24]:
#checking correct split
for i in [X, X_test, X_train, X_val, y, y_test, y_train, y_val]:
    print(i.shape)

(9000, 10)
(1000, 10)
(7000, 10)
(2000, 10)
(9000,)
(1000,)
(7000,)
(2000,)


## Model selection

We will start off by looking at how a simple DecisionTreeClassifier performs and then try and improve on that score using a RandomForestClassifier.

Throughout, we will be looking at 2 metric for evaluations: 'f1_score' and 'roc_auc'.
F1 to sum up the predictive performance of a model by combining two otherwise competing metrics — precision and recall. 
ROC-AUC to tell us how much our model is capable of distinguishing between classes.

When we evaluate scores we'll take the predictions from the .predict() method for use in calculating f1 score and for roc_auc, we'll use the probabilities of the positive class from the .predict_proba() method, as calculating the roc_auc score requires examining what happens as the threshold changes.

### DecisionTreeClassifier

We'll take a look at the f1 score over a range of max depths as a benchmark for our future models.

In [25]:
#testing a broad max depths
for depth in list(range(2,25)):
    model = DecisionTreeClassifier(max_depth = depth, random_state = 1)
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)
    f1 = f1_score(y_val, predictions)
    roc_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:,1])
    
    print(f'depth: {depth:.3f}, f1_score: {f1:.3f}, ROC-AUC: {roc_auc:.3f}')

depth: 2.000, f1_score: 0.504, ROC-AUC: 0.733
depth: 3.000, f1_score: 0.373, ROC-AUC: 0.785
depth: 4.000, f1_score: 0.482, ROC-AUC: 0.808
depth: 5.000, f1_score: 0.502, ROC-AUC: 0.804
depth: 6.000, f1_score: 0.490, ROC-AUC: 0.796
depth: 7.000, f1_score: 0.502, ROC-AUC: 0.786
depth: 8.000, f1_score: 0.514, ROC-AUC: 0.784
depth: 9.000, f1_score: 0.495, ROC-AUC: 0.757
depth: 10.000, f1_score: 0.494, ROC-AUC: 0.726
depth: 11.000, f1_score: 0.498, ROC-AUC: 0.712
depth: 12.000, f1_score: 0.473, ROC-AUC: 0.693
depth: 13.000, f1_score: 0.450, ROC-AUC: 0.682
depth: 14.000, f1_score: 0.452, ROC-AUC: 0.682
depth: 15.000, f1_score: 0.449, ROC-AUC: 0.669
depth: 16.000, f1_score: 0.441, ROC-AUC: 0.664
depth: 17.000, f1_score: 0.448, ROC-AUC: 0.666
depth: 18.000, f1_score: 0.444, ROC-AUC: 0.663
depth: 19.000, f1_score: 0.429, ROC-AUC: 0.653
depth: 20.000, f1_score: 0.454, ROC-AUC: 0.666
depth: 21.000, f1_score: 0.433, ROC-AUC: 0.652
depth: 22.000, f1_score: 0.440, ROC-AUC: 0.656
depth: 23.000, f1_sco

In [26]:
#checking model on test set
predictions = model.predict(X_test)
f1 = f1_score(y_test, predictions)
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
print(f1)
print(roc_auc)

0.46265060240963857
0.6590470870199845


A depth of 8 got us an f1 score of 0.462 on our test set, far below what is required and an unimpressive ROC-AUC score of 0.659.
We'll therefore have to use more sophisticated models.

### RandomForestClassifier

We'll start with Random Cross-Validation to get an idea of where roughly the hyperparameters should be tuned to and follow it up with a Grid Search around our best performing models.

We'll be varying the following hyperparameters:

- `n_estimators`
    - The number of trees in the random forest
- `max_features`
    - The number of features to consider at every split
- `max_depth`
    - The maximum number of levels in tree
- `min_samples_split`
    - The minimum number of samples required to split a node
- `min_samples_leaf
    - The minimum number of samples required at each leaf node
- `bootstrap`
    - The method of selecting samples for training each tree

In [27]:
#Setting range for hyperparameters
n_estimators = [int(x) for x in np.linspace(start = 1, stop = 100, num = 100)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(1, 25, num = 25)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 16]
min_samples_leaf = [1, 2, 5]
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap
               }

In [28]:
# Use the random grid to search for best hyperparameters
# Creating the base model to tune
rf = RandomForestClassifier(verbose = 0, random_state = 1)
# random search of parameters, using 5 fold cross validation, 
# search across 200 different combinations, and use all  available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 200, cv = 5, verbose=1, random_state=1, n_jobs = -1, refit = False, scoring = ['f1', 'roc_auc'])
# Fit the random search model
rf_random.fit(X, y)

results = pd.DataFrame(rf_random.cv_results_)

ranked = results.sort_values('rank_test_f1').reset_index(drop = True)
ranked

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_estimators,param_min_samples_split,param_min_samples_leaf,param_max_features,param_max_depth,param_bootstrap,params,split0_test_f1,split1_test_f1,split2_test_f1,split3_test_f1,split4_test_f1,mean_test_f1,std_test_f1,rank_test_f1,split0_test_roc_auc,split1_test_roc_auc,split2_test_roc_auc,split3_test_roc_auc,split4_test_roc_auc,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc
0,1.392059,0.110202,0.095875,0.013825,81,16,1,auto,19.0,False,"{'n_estimators': 81, 'min_samples_split': 16, ...",0.595238,0.609881,0.532663,0.578595,0.56942,0.57716,0.026227,1,0.850445,0.862538,0.836237,0.84123,0.858317,0.849753,0.009926,94
1,1.628661,0.029529,0.106512,0.000586,95,5,1,auto,,False,"{'n_estimators': 95, 'min_samples_split': 5, '...",0.589404,0.608696,0.543372,0.572816,0.566434,0.576144,0.021977,2,0.83744,0.857907,0.833635,0.836771,0.849238,0.842998,0.00915,141
2,1.145187,0.006361,0.078383,0.001262,71,16,1,auto,20.0,False,"{'n_estimators': 71, 'min_samples_split': 16, ...",0.582064,0.616438,0.533557,0.580858,0.566372,0.575858,0.026794,3,0.85029,0.863863,0.833952,0.839422,0.855025,0.84851,0.010732,111
3,0.313063,0.047528,0.026188,0.000236,19,16,2,sqrt,12.0,False,"{'n_estimators': 19, 'min_samples_split': 16, ...",0.593103,0.604167,0.552066,0.584041,0.545455,0.575766,0.023048,4,0.850443,0.867174,0.832497,0.845002,0.853017,0.849627,0.011268,98
4,1.070409,0.01002,0.074213,0.000494,65,10,1,auto,17.0,False,"{'n_estimators': 65, 'min_samples_split': 10, ...",0.582064,0.601709,0.538206,0.583607,0.570423,0.575202,0.021033,5,0.848163,0.862667,0.837663,0.836128,0.855409,0.848006,0.010177,113
5,0.46127,0.003059,0.036976,0.00042,30,16,2,sqrt,14.0,False,"{'n_estimators': 30, 'min_samples_split': 16, ...",0.601351,0.604811,0.532663,0.576792,0.55615,0.574353,0.02732,6,0.851543,0.86453,0.828831,0.839014,0.859921,0.848768,0.013215,107
6,1.25821,0.013284,0.105429,0.039631,78,16,1,auto,23.0,False,"{'n_estimators': 78, 'min_samples_split': 16, ...",0.589041,0.608247,0.537563,0.575707,0.560847,0.574281,0.024093,7,0.849197,0.862811,0.835903,0.840014,0.857714,0.849128,0.010187,102
7,1.538167,0.042689,0.101958,0.002074,90,5,1,sqrt,18.0,False,"{'n_estimators': 90, 'min_samples_split': 5, '...",0.587065,0.59727,0.539216,0.589577,0.554974,0.57362,0.022477,8,0.841445,0.85828,0.830141,0.836741,0.850103,0.843342,0.009902,139
8,1.879469,0.322997,0.149298,0.037753,98,2,1,auto,23.0,False,"{'n_estimators': 98, 'min_samples_split': 2, '...",0.589786,0.60371,0.538088,0.572358,0.558923,0.572573,0.022984,9,0.837355,0.8536,0.825964,0.836116,0.845566,0.83972,0.009324,158
9,0.927464,0.045736,0.063904,0.00078,51,2,1,auto,23.0,False,"{'n_estimators': 51, 'min_samples_split': 2, '...",0.586491,0.595674,0.537459,0.577419,0.56229,0.571867,0.020431,10,0.834585,0.847209,0.817959,0.835762,0.841622,0.835427,0.00983,168


In [29]:
#checking top 10
ranked.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_estimators,param_min_samples_split,param_min_samples_leaf,param_max_features,param_max_depth,param_bootstrap,params,split0_test_f1,split1_test_f1,split2_test_f1,split3_test_f1,split4_test_f1,mean_test_f1,std_test_f1,rank_test_f1,split0_test_roc_auc,split1_test_roc_auc,split2_test_roc_auc,split3_test_roc_auc,split4_test_roc_auc,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc
0,1.392059,0.110202,0.095875,0.013825,81,16,1,auto,19.0,False,"{'n_estimators': 81, 'min_samples_split': 16, ...",0.595238,0.609881,0.532663,0.578595,0.56942,0.57716,0.026227,1,0.850445,0.862538,0.836237,0.84123,0.858317,0.849753,0.009926,94
1,1.628661,0.029529,0.106512,0.000586,95,5,1,auto,,False,"{'n_estimators': 95, 'min_samples_split': 5, '...",0.589404,0.608696,0.543372,0.572816,0.566434,0.576144,0.021977,2,0.83744,0.857907,0.833635,0.836771,0.849238,0.842998,0.00915,141
2,1.145187,0.006361,0.078383,0.001262,71,16,1,auto,20.0,False,"{'n_estimators': 71, 'min_samples_split': 16, ...",0.582064,0.616438,0.533557,0.580858,0.566372,0.575858,0.026794,3,0.85029,0.863863,0.833952,0.839422,0.855025,0.84851,0.010732,111
3,0.313063,0.047528,0.026188,0.000236,19,16,2,sqrt,12.0,False,"{'n_estimators': 19, 'min_samples_split': 16, ...",0.593103,0.604167,0.552066,0.584041,0.545455,0.575766,0.023048,4,0.850443,0.867174,0.832497,0.845002,0.853017,0.849627,0.011268,98
4,1.070409,0.01002,0.074213,0.000494,65,10,1,auto,17.0,False,"{'n_estimators': 65, 'min_samples_split': 10, ...",0.582064,0.601709,0.538206,0.583607,0.570423,0.575202,0.021033,5,0.848163,0.862667,0.837663,0.836128,0.855409,0.848006,0.010177,113
5,0.46127,0.003059,0.036976,0.00042,30,16,2,sqrt,14.0,False,"{'n_estimators': 30, 'min_samples_split': 16, ...",0.601351,0.604811,0.532663,0.576792,0.55615,0.574353,0.02732,6,0.851543,0.86453,0.828831,0.839014,0.859921,0.848768,0.013215,107
6,1.25821,0.013284,0.105429,0.039631,78,16,1,auto,23.0,False,"{'n_estimators': 78, 'min_samples_split': 16, ...",0.589041,0.608247,0.537563,0.575707,0.560847,0.574281,0.024093,7,0.849197,0.862811,0.835903,0.840014,0.857714,0.849128,0.010187,102
7,1.538167,0.042689,0.101958,0.002074,90,5,1,sqrt,18.0,False,"{'n_estimators': 90, 'min_samples_split': 5, '...",0.587065,0.59727,0.539216,0.589577,0.554974,0.57362,0.022477,8,0.841445,0.85828,0.830141,0.836741,0.850103,0.843342,0.009902,139
8,1.879469,0.322997,0.149298,0.037753,98,2,1,auto,23.0,False,"{'n_estimators': 98, 'min_samples_split': 2, '...",0.589786,0.60371,0.538088,0.572358,0.558923,0.572573,0.022984,9,0.837355,0.8536,0.825964,0.836116,0.845566,0.83972,0.009324,158
9,0.927464,0.045736,0.063904,0.00078,51,2,1,auto,23.0,False,"{'n_estimators': 51, 'min_samples_split': 2, '...",0.586491,0.595674,0.537459,0.577419,0.56229,0.571867,0.020431,10,0.834585,0.847209,0.817959,0.835762,0.841622,0.835427,0.00983,168


We can now perform a Grid Search over a narrower range based on the hyperparameters of the best scoring models.

In [30]:
# setting narrower ranges for hyperparameters
n_estimators = [int(x) for x in np.linspace(start = 80, stop = 100, num = 21)]
max_features = ['auto']
max_depth = [int(x) for x in np.linspace(17, 30, num = 14)]
max_depth.append(None)
min_samples_split = [16]
min_samples_leaf = [1]
bootstrap = [False]
# Create the  grid
grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               }

In [31]:
#Setting up Grid Search
rf = RandomForestClassifier(verbose = 0, random_state = 1)
rf_grid = GridSearchCV(rf, grid, verbose=1, n_jobs = -1, refit = False, scoring = ['f1', 'roc_auc'] )
rf_grid.fit(X,y)

Fitting 5 folds for each of 315 candidates, totalling 1575 fits


GridSearchCV(estimator=RandomForestClassifier(random_state=1), n_jobs=-1,
             param_grid={'bootstrap': [False],
                         'max_depth': [17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
                                       27, 28, 29, 30, None],
                         'max_features': ['auto'], 'min_samples_leaf': [1],
                         'min_samples_split': [16],
                         'n_estimators': [80, 81, 82, 83, 84, 85, 86, 87, 88,
                                          89, 90, 91, 92, 93, 94, 95, 96, 97,
                                          98, 99, 100]},
             refit=False, scoring=['f1', 'roc_auc'], verbose=1)

In [32]:
# Gathering results
grid_results = pd.DataFrame(rf_grid.cv_results_)
grid_ranked = grid_results.sort_values('rank_test_f1').reset_index(drop = True)
grid_ranked.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_bootstrap,param_max_depth,param_max_features,param_min_samples_leaf,param_min_samples_split,param_n_estimators,params,split0_test_f1,split1_test_f1,split2_test_f1,split3_test_f1,split4_test_f1,mean_test_f1,std_test_f1,rank_test_f1,split0_test_roc_auc,split1_test_roc_auc,split2_test_roc_auc,split3_test_roc_auc,split4_test_roc_auc,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc
0,1.439074,0.018381,0.098223,0.004281,False,21,auto,1,16,89,"{'bootstrap': False, 'max_depth': 21, 'max_fea...",0.604414,0.608844,0.542714,0.57,0.564831,0.578161,0.025026,1,0.849494,0.862065,0.835928,0.839432,0.857467,0.848877,0.01005,292
1,1.439595,0.011575,0.098254,0.001137,False,21,auto,1,16,91,"{'bootstrap': False, 'max_depth': 21, 'max_fea...",0.602369,0.611205,0.541806,0.57,0.564831,0.578042,0.025471,2,0.849508,0.8622,0.835802,0.839732,0.857551,0.848959,0.010077,264
2,1.634764,0.098008,0.103689,0.000606,False,22,auto,1,16,98,"{'bootstrap': False, 'max_depth': 22, 'max_fea...",0.600683,0.611205,0.531092,0.579035,0.567376,0.577878,0.02804,3,0.849757,0.863978,0.837862,0.841672,0.858989,0.850452,0.009919,3
3,1.522952,0.014671,0.101833,0.000546,False,21,auto,1,16,95,"{'bootstrap': False, 'max_depth': 21, 'max_fea...",0.601351,0.612245,0.54,0.570952,0.564831,0.577876,0.026019,4,0.849749,0.862223,0.83583,0.84019,0.857253,0.849049,0.009946,237
4,1.379086,0.013462,0.093221,0.001603,False,22,auto,1,16,86,"{'bootstrap': False, 'max_depth': 22, 'max_fea...",0.59589,0.611205,0.531773,0.586667,0.56383,0.577873,0.027696,5,0.849386,0.863772,0.837699,0.840589,0.858113,0.849912,0.009957,42
5,1.45972,0.077947,0.092842,0.001237,False,26,auto,1,16,86,"{'bootstrap': False, 'max_depth': 26, 'max_fea...",0.589655,0.607509,0.543947,0.589018,0.558719,0.57777,0.023061,6,0.848787,0.862995,0.836279,0.840407,0.857487,0.849191,0.010034,215
6,1.298512,0.01542,0.088682,0.000437,False,22,auto,1,16,82,"{'bootstrap': False, 'max_depth': 22, 'max_fea...",0.594872,0.614601,0.529313,0.583333,0.566372,0.577698,0.028837,7,0.849052,0.863651,0.836821,0.841024,0.85781,0.849672,0.01002,89
7,1.701947,0.170238,0.102992,0.001088,False,22,auto,1,16,96,"{'bootstrap': False, 'max_depth': 22, 'max_fea...",0.603066,0.612245,0.529313,0.579035,0.564831,0.577698,0.02948,8,0.849658,0.863529,0.837904,0.841567,0.858766,0.850285,0.009767,5
8,1.332597,0.008938,0.095105,0.007492,False,22,auto,1,16,83,"{'bootstrap': False, 'max_depth': 22, 'max_fea...",0.592466,0.614601,0.535117,0.58194,0.56383,0.577591,0.026845,9,0.848986,0.86342,0.837043,0.840868,0.858007,0.849665,0.00996,91
9,1.5386,0.078043,0.108377,0.019217,False,21,auto,1,16,92,"{'bootstrap': False, 'max_depth': 21, 'max_fea...",0.601351,0.609881,0.54,0.572379,0.56383,0.577488,0.025443,10,0.849258,0.862092,0.835679,0.840116,0.85761,0.848951,0.010019,269


Our best model has a mean F1 Score of 0.578161, we can check if this model is generalisable by evaluating it's performance on the test set of data we split earlier that was not part of the data the model was trained on, to see if we get a similar score.

In [33]:
#defining function to evaluate the best performing model on the test set
def get_best(ranked):
    params = ranked.iloc[0]['params']
    rf = RandomForestClassifier(**params, random_state = 1)
    rf.fit(X, y)

    predictions = rf.predict(X_test)
    print(f1_score(y_test, predictions))
    print(roc_auc_score(y_test, rf.predict_proba(X_test)[:,1]))

In [34]:
#retrieving best model
get_best(grid_ranked)

0.5758513931888545
0.8632470161401737


We do get similar scores but these are still below the required threshold for our model.

It's worth noting that our current ROC-AUC score of 0.863 suggests this model is very good at distinguising between classes.

Our data is currently unbalanced, only ~20% of our data is clients who exited.

We can therefore balance our data using two different methods.

- Downsampling
- Balancing Class Weights in our model

## Balancing

### Downsampling

We'll create a downsampling function, downsample our data and then search for the best models fitted on our downsampled dataset.

In [35]:
#defining a downsampling function
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones]) 
    
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones]) 
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345
    )

    return features_downsampled, target_downsampled



In [36]:
#checking shape
Xd, yd = downsample(X, y, 0.3)
print(Xd.shape, yd.shape)

(3978, 10) (3978,)


In [37]:
#Setting range for hyperparameters
n_estimators = [int(x) for x in np.linspace(start = 1, stop = 100, num = 100)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(1, 25, num = 25)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 16]
min_samples_leaf = [1, 2, 5]
bootstrap = [True, False]

# Create the random grid
ds_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap
               }

In [38]:
ds_rf = RandomForestClassifier(verbose = 0, random_state = 1)
ds_rf_random = RandomizedSearchCV(estimator = ds_rf, param_distributions = ds_grid, n_iter = 200, cv = 5, verbose=1, random_state=1, n_jobs = -1, refit = False, scoring = ['f1', 'roc_auc'])
# Fit the random search model
ds_rf_random.fit(X, y)
ds_results = pd.DataFrame(rf_random.cv_results_)
ds_ranked = ds_results.sort_values('rank_test_f1').reset_index(drop = True)
ds_ranked.head(10)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_estimators,param_min_samples_split,param_min_samples_leaf,param_max_features,param_max_depth,param_bootstrap,params,split0_test_f1,split1_test_f1,split2_test_f1,split3_test_f1,split4_test_f1,mean_test_f1,std_test_f1,rank_test_f1,split0_test_roc_auc,split1_test_roc_auc,split2_test_roc_auc,split3_test_roc_auc,split4_test_roc_auc,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc
0,1.392059,0.110202,0.095875,0.013825,81,16,1,auto,19.0,False,"{'n_estimators': 81, 'min_samples_split': 16, ...",0.595238,0.609881,0.532663,0.578595,0.56942,0.57716,0.026227,1,0.850445,0.862538,0.836237,0.84123,0.858317,0.849753,0.009926,94
1,1.628661,0.029529,0.106512,0.000586,95,5,1,auto,,False,"{'n_estimators': 95, 'min_samples_split': 5, '...",0.589404,0.608696,0.543372,0.572816,0.566434,0.576144,0.021977,2,0.83744,0.857907,0.833635,0.836771,0.849238,0.842998,0.00915,141
2,1.145187,0.006361,0.078383,0.001262,71,16,1,auto,20.0,False,"{'n_estimators': 71, 'min_samples_split': 16, ...",0.582064,0.616438,0.533557,0.580858,0.566372,0.575858,0.026794,3,0.85029,0.863863,0.833952,0.839422,0.855025,0.84851,0.010732,111
3,0.313063,0.047528,0.026188,0.000236,19,16,2,sqrt,12.0,False,"{'n_estimators': 19, 'min_samples_split': 16, ...",0.593103,0.604167,0.552066,0.584041,0.545455,0.575766,0.023048,4,0.850443,0.867174,0.832497,0.845002,0.853017,0.849627,0.011268,98
4,1.070409,0.01002,0.074213,0.000494,65,10,1,auto,17.0,False,"{'n_estimators': 65, 'min_samples_split': 10, ...",0.582064,0.601709,0.538206,0.583607,0.570423,0.575202,0.021033,5,0.848163,0.862667,0.837663,0.836128,0.855409,0.848006,0.010177,113
5,0.46127,0.003059,0.036976,0.00042,30,16,2,sqrt,14.0,False,"{'n_estimators': 30, 'min_samples_split': 16, ...",0.601351,0.604811,0.532663,0.576792,0.55615,0.574353,0.02732,6,0.851543,0.86453,0.828831,0.839014,0.859921,0.848768,0.013215,107
6,1.25821,0.013284,0.105429,0.039631,78,16,1,auto,23.0,False,"{'n_estimators': 78, 'min_samples_split': 16, ...",0.589041,0.608247,0.537563,0.575707,0.560847,0.574281,0.024093,7,0.849197,0.862811,0.835903,0.840014,0.857714,0.849128,0.010187,102
7,1.538167,0.042689,0.101958,0.002074,90,5,1,sqrt,18.0,False,"{'n_estimators': 90, 'min_samples_split': 5, '...",0.587065,0.59727,0.539216,0.589577,0.554974,0.57362,0.022477,8,0.841445,0.85828,0.830141,0.836741,0.850103,0.843342,0.009902,139
8,1.879469,0.322997,0.149298,0.037753,98,2,1,auto,23.0,False,"{'n_estimators': 98, 'min_samples_split': 2, '...",0.589786,0.60371,0.538088,0.572358,0.558923,0.572573,0.022984,9,0.837355,0.8536,0.825964,0.836116,0.845566,0.83972,0.009324,158
9,0.927464,0.045736,0.063904,0.00078,51,2,1,auto,23.0,False,"{'n_estimators': 51, 'min_samples_split': 2, '...",0.586491,0.595674,0.537459,0.577419,0.56229,0.571867,0.020431,10,0.834585,0.847209,0.817959,0.835762,0.841622,0.835427,0.00983,168


In [39]:
get_best(ds_ranked)

0.5896656534954408
0.8592014608449113


Already slightly better scores than before.
Let's grid search around the most promising looking hyperparameters.

In [40]:
# setting narrower ranges for hyperparameters
n_estimators = [int(x) for x in np.linspace(start = 80, stop = 100, num = 21)]
max_features = ['auto']
max_depth = [int(x) for x in np.linspace(17, 25, num = 9)]
max_depth.append(None)
min_samples_split = [16]
min_samples_leaf = [1]
bootstrap = [False]
# Create the  grid
ds_grid2 = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               }

In [41]:
#Setting up Grid Search
rf = RandomForestClassifier(verbose = 0, random_state = 1)
ds_rf_grid = GridSearchCV(rf, ds_grid2, verbose=1, n_jobs = -1, refit = False, scoring = ['f1', 'roc_auc'] )
ds_rf_grid.fit(Xd,yd)

Fitting 5 folds for each of 210 candidates, totalling 1050 fits


GridSearchCV(estimator=RandomForestClassifier(random_state=1), n_jobs=-1,
             param_grid={'bootstrap': [False],
                         'max_depth': [17, 18, 19, 20, 21, 22, 23, 24, 25,
                                       None],
                         'max_features': ['auto'], 'min_samples_leaf': [1],
                         'min_samples_split': [16],
                         'n_estimators': [80, 81, 82, 83, 84, 85, 86, 87, 88,
                                          89, 90, 91, 92, 93, 94, 95, 96, 97,
                                          98, 99, 100]},
             refit=False, scoring=['f1', 'roc_auc'], verbose=1)

In [42]:
ds_grid_results = pd.DataFrame(ds_rf_grid.cv_results_)
ds_grid_ranked = ds_grid_results.sort_values('rank_test_f1').reset_index(drop = True)
ds_grid_ranked.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_bootstrap,param_max_depth,param_max_features,param_min_samples_leaf,param_min_samples_split,param_n_estimators,params,split0_test_f1,split1_test_f1,split2_test_f1,split3_test_f1,split4_test_f1,mean_test_f1,std_test_f1,rank_test_f1,split0_test_roc_auc,split1_test_roc_auc,split2_test_roc_auc,split3_test_roc_auc,split4_test_roc_auc,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc
0,0.5689,0.003471,0.057251,0.000464,False,18,auto,1,16,81,"{'bootstrap': False, 'max_depth': 18, 'max_fea...",0.792717,0.722892,0.75,0.707736,0.766434,0.747956,0.030315,1,0.88914,0.846963,0.851703,0.81749,0.859331,0.852925,0.023001,132
1,0.575837,0.039405,0.056284,0.00058,False,18,auto,1,16,80,"{'bootstrap': False, 'max_depth': 18, 'max_fea...",0.79219,0.720721,0.7507,0.709585,0.764706,0.74758,0.02986,2,0.889184,0.847109,0.851684,0.81742,0.859732,0.853026,0.023052,82
2,0.578352,0.006966,0.058578,0.001226,False,18,auto,1,16,82,"{'bootstrap': False, 'max_depth': 18, 'max_fea...",0.787115,0.721805,0.748948,0.712446,0.765778,0.747218,0.027547,3,0.888923,0.846582,0.851716,0.817859,0.858981,0.852812,0.02282,170
3,0.629419,0.045964,0.059153,0.000496,False,18,auto,1,16,84,"{'bootstrap': False, 'max_depth': 18, 'max_fea...",0.787709,0.71988,0.748948,0.712446,0.764045,0.746606,0.027858,4,0.888936,0.846658,0.851677,0.818241,0.859102,0.852923,0.02271,133
4,0.609541,0.041502,0.058321,0.00082,False,17,auto,1,16,84,"{'bootstrap': False, 'max_depth': 17, 'max_fea...",0.783934,0.729198,0.744382,0.70977,0.764457,0.746348,0.025987,5,0.888237,0.846658,0.853037,0.819484,0.860733,0.85363,0.022195,8
5,0.599153,0.041974,0.058268,0.000396,False,18,auto,1,16,83,"{'bootstrap': False, 'max_depth': 18, 'max_fea...",0.788219,0.720965,0.7507,0.710602,0.761236,0.746344,0.027985,6,0.888803,0.846435,0.851913,0.818299,0.859006,0.852891,0.022655,148
6,0.595348,0.016214,0.05816,0.001429,False,17,auto,1,16,85,"{'bootstrap': False, 'max_depth': 17, 'max_fea...",0.785615,0.728097,0.744711,0.70977,0.76338,0.746315,0.026479,7,0.888714,0.847198,0.852904,0.818968,0.861083,0.853773,0.022493,1
7,0.675736,0.04544,0.064457,0.00099,False,17,auto,1,16,94,"{'bootstrap': False, 'max_depth': 17, 'max_fea...",0.786611,0.727273,0.745763,0.708752,0.762712,0.746222,0.027079,8,0.88872,0.846569,0.853285,0.818675,0.860879,0.853626,0.022608,10
8,0.69945,0.00805,0.068072,0.000331,False,17,auto,1,16,100,"{'bootstrap': False, 'max_depth': 17, 'max_fea...",0.782247,0.729198,0.745042,0.708752,0.764873,0.746022,0.025848,9,0.888231,0.846315,0.853222,0.818949,0.86093,0.853529,0.022392,15
9,0.697964,0.049367,0.065896,0.001285,False,17,auto,1,16,98,"{'bootstrap': False, 'max_depth': 17, 'max_fea...",0.784916,0.727273,0.745763,0.706897,0.764873,0.745944,0.027388,10,0.888294,0.846575,0.85284,0.81907,0.861198,0.853595,0.022377,11


In [43]:
params = ds_grid_ranked.iloc[0]['params']
rf = RandomForestClassifier(**params, random_state = 1)
rf.fit(Xd, yd)

predictions = rf.predict(X_test)
print(f1_score(y_test, predictions))
print(roc_auc_score(y_test, rf.predict_proba(X_test)[:,1]))

0.6175869120654397
0.8675989163798437


We have a new best f1 score: 0.618 and a new highest AUC_ROC score 0.868

 ### Balancing `class_weight`

We can start off by merely adding the parameter `class_weight` = `balanced` to our current best performing RandomForestClassifier and seeing if it improves performance.

Changing this `class_weight` parameter from the default of `None` to `balanced` will use the values of y to automatically adjust weights inversely proportional to class frequencies in the input data.


In [44]:
# testing current best model with class weights balanced
rf_balanced = RandomForestClassifier(**params, random_state = 1, class_weight = 'balanced')
rf_balanced.fit(X,y)

#checking score on test set
predictions = rf_balanced.predict(X_test)
print(f1_score(y_test, predictions))
print(roc_auc_score(y_test, rf_balanced.predict_proba(X_test)[:,1]))

0.608695652173913
0.8639287838105707


A clear improvement and better than the 0.59 threshold we were aiming for.
We can try and improve this score slightly by searching over a similar grid as before but now with the balanced class weights.

In [45]:
#choosing random ranges
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 150, num = 101)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 50, num = 40)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 16]
min_samples_leaf = [1, 2, 5]
bootstrap = [True, False]

# Create the random grid
balanced_random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'class_weight' : ['balanced']
               }

In [46]:
# Use the random grid to search for best hyperparameters
# Creating the base model to tune
brf = RandomForestClassifier(verbose = 0, random_state = 1)
# random search of parameters, using 5 fold cross validation, 
# search across 200 different combinations, and use all  available cores
brf_random = RandomizedSearchCV(estimator = brf, param_distributions = balanced_random_grid, n_iter = 200, cv = 5, verbose=1, random_state=1, n_jobs = -1, refit = False, scoring = ['f1', 'roc_auc'])
# Fit the random search model
brf_random.fit(X, y)

balanced_results = pd.DataFrame(brf_random.cv_results_)

balanced_ranked = balanced_results.sort_values('rank_test_f1').reset_index(drop = True)
balanced_ranked

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_estimators,param_min_samples_split,param_min_samples_leaf,param_max_features,param_max_depth,param_class_weight,param_bootstrap,params,split0_test_f1,split1_test_f1,split2_test_f1,split3_test_f1,split4_test_f1,mean_test_f1,std_test_f1,rank_test_f1,split0_test_roc_auc,split1_test_roc_auc,split2_test_roc_auc,split3_test_roc_auc,split4_test_roc_auc,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc
0,1.44059,0.081172,0.166729,0.048017,131,16,5,auto,31.0,balanced,True,"{'n_estimators': 131, 'min_samples_split': 16,...",0.625337,0.65252,0.601078,0.620076,0.620787,0.62396,0.016531,1,0.855801,0.870202,0.844691,0.850268,0.863537,0.8569,0.009113,10
1,1.430356,0.044614,0.135962,0.004226,142,16,5,sqrt,15.0,balanced,True,"{'n_estimators': 142, 'min_samples_split': 16,...",0.623656,0.65073,0.598675,0.624041,0.616457,0.622712,0.01676,2,0.855134,0.869994,0.843824,0.847539,0.862962,0.855891,0.009642,13
2,1.149072,0.018664,0.106179,0.001832,102,16,2,sqrt,23.0,balanced,True,"{'n_estimators': 102, 'min_samples_split': 16,...",0.624309,0.660167,0.589674,0.619792,0.612303,0.621249,0.022817,3,0.85679,0.86549,0.841636,0.844391,0.864119,0.854485,0.009861,37
3,0.94492,0.059969,0.088807,0.001136,91,16,1,auto,12.0,balanced,True,"{'n_estimators': 91, 'min_samples_split': 16, ...",0.629879,0.65508,0.59025,0.623574,0.607242,0.621205,0.021824,4,0.859711,0.870267,0.842429,0.847104,0.859743,0.855851,0.009945,14
4,0.930207,0.012757,0.094259,0.01047,90,16,2,sqrt,29.0,balanced,True,"{'n_estimators': 90, 'min_samples_split': 16, ...",0.628099,0.654596,0.592992,0.614583,0.615385,0.621131,0.02018,5,0.855839,0.865582,0.841109,0.844586,0.863354,0.854094,0.009795,46
5,0.874709,0.039269,0.083875,0.000983,83,16,2,sqrt,23.0,balanced,True,"{'n_estimators': 83, 'min_samples_split': 16, ...",0.626039,0.656467,0.594595,0.612565,0.615819,0.621097,0.020389,6,0.855214,0.864883,0.840764,0.844666,0.862888,0.853683,0.009599,53
6,0.682047,0.04603,0.065389,0.00301,62,16,5,sqrt,12.0,balanced,True,"{'n_estimators': 62, 'min_samples_split': 16, ...",0.630749,0.643411,0.59144,0.609787,0.627876,0.620652,0.01813,7,0.85683,0.872149,0.843851,0.846222,0.862834,0.856377,0.010495,12
7,0.847677,0.083671,0.100063,0.040214,78,16,5,auto,14.0,balanced,True,"{'n_estimators': 78, 'min_samples_split': 16, ...",0.632708,0.645586,0.587927,0.61597,0.619835,0.620405,0.019289,8,0.855908,0.869423,0.840601,0.845369,0.862458,0.854752,0.010623,28
8,1.066753,0.039991,0.098904,0.001258,102,16,5,sqrt,,balanced,True,"{'n_estimators': 102, 'min_samples_split': 16,...",0.624665,0.645418,0.595174,0.618504,0.617978,0.620348,0.016062,9,0.856998,0.870135,0.844244,0.850568,0.863052,0.856999,0.009091,7
9,1.505714,0.014214,0.103723,0.000385,100,16,5,auto,44.0,balanced,False,"{'n_estimators': 100, 'min_samples_split': 16,...",0.617188,0.654592,0.60281,0.61063,0.615797,0.620203,0.017918,10,0.849525,0.867357,0.839927,0.842593,0.858042,0.851489,0.010113,87


In [47]:
#getting top 10 results
balanced_random_results = pd.DataFrame(brf_random.cv_results_)
brf_ranked = balanced_random_results.sort_values('rank_test_f1').reset_index(drop = True)
brf_ranked.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_estimators,param_min_samples_split,param_min_samples_leaf,param_max_features,param_max_depth,param_class_weight,param_bootstrap,params,split0_test_f1,split1_test_f1,split2_test_f1,split3_test_f1,split4_test_f1,mean_test_f1,std_test_f1,rank_test_f1,split0_test_roc_auc,split1_test_roc_auc,split2_test_roc_auc,split3_test_roc_auc,split4_test_roc_auc,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc
0,1.44059,0.081172,0.166729,0.048017,131,16,5,auto,31.0,balanced,True,"{'n_estimators': 131, 'min_samples_split': 16,...",0.625337,0.65252,0.601078,0.620076,0.620787,0.62396,0.016531,1,0.855801,0.870202,0.844691,0.850268,0.863537,0.8569,0.009113,10
1,1.430356,0.044614,0.135962,0.004226,142,16,5,sqrt,15.0,balanced,True,"{'n_estimators': 142, 'min_samples_split': 16,...",0.623656,0.65073,0.598675,0.624041,0.616457,0.622712,0.01676,2,0.855134,0.869994,0.843824,0.847539,0.862962,0.855891,0.009642,13
2,1.149072,0.018664,0.106179,0.001832,102,16,2,sqrt,23.0,balanced,True,"{'n_estimators': 102, 'min_samples_split': 16,...",0.624309,0.660167,0.589674,0.619792,0.612303,0.621249,0.022817,3,0.85679,0.86549,0.841636,0.844391,0.864119,0.854485,0.009861,37
3,0.94492,0.059969,0.088807,0.001136,91,16,1,auto,12.0,balanced,True,"{'n_estimators': 91, 'min_samples_split': 16, ...",0.629879,0.65508,0.59025,0.623574,0.607242,0.621205,0.021824,4,0.859711,0.870267,0.842429,0.847104,0.859743,0.855851,0.009945,14
4,0.930207,0.012757,0.094259,0.01047,90,16,2,sqrt,29.0,balanced,True,"{'n_estimators': 90, 'min_samples_split': 16, ...",0.628099,0.654596,0.592992,0.614583,0.615385,0.621131,0.02018,5,0.855839,0.865582,0.841109,0.844586,0.863354,0.854094,0.009795,46
5,0.874709,0.039269,0.083875,0.000983,83,16,2,sqrt,23.0,balanced,True,"{'n_estimators': 83, 'min_samples_split': 16, ...",0.626039,0.656467,0.594595,0.612565,0.615819,0.621097,0.020389,6,0.855214,0.864883,0.840764,0.844666,0.862888,0.853683,0.009599,53
6,0.682047,0.04603,0.065389,0.00301,62,16,5,sqrt,12.0,balanced,True,"{'n_estimators': 62, 'min_samples_split': 16, ...",0.630749,0.643411,0.59144,0.609787,0.627876,0.620652,0.01813,7,0.85683,0.872149,0.843851,0.846222,0.862834,0.856377,0.010495,12
7,0.847677,0.083671,0.100063,0.040214,78,16,5,auto,14.0,balanced,True,"{'n_estimators': 78, 'min_samples_split': 16, ...",0.632708,0.645586,0.587927,0.61597,0.619835,0.620405,0.019289,8,0.855908,0.869423,0.840601,0.845369,0.862458,0.854752,0.010623,28
8,1.066753,0.039991,0.098904,0.001258,102,16,5,sqrt,,balanced,True,"{'n_estimators': 102, 'min_samples_split': 16,...",0.624665,0.645418,0.595174,0.618504,0.617978,0.620348,0.016062,9,0.856998,0.870135,0.844244,0.850568,0.863052,0.856999,0.009091,7
9,1.505714,0.014214,0.103723,0.000385,100,16,5,auto,44.0,balanced,False,"{'n_estimators': 100, 'min_samples_split': 16,...",0.617188,0.654592,0.60281,0.61063,0.615797,0.620203,0.017918,10,0.849525,0.867357,0.839927,0.842593,0.858042,0.851489,0.010113,87


In [48]:
get_best(brf_ranked)

0.6255924170616114
0.8715333465482132


Some promising scores, let's grid search around the best hyperparameters.

In [49]:
# setting narrower ranges for hyperparameters
n_estimators = [int(x) for x in np.linspace(start = 125, stop = 150, num = 26)]
max_features = ['auto']
max_depth = [int(x) for x in np.linspace(25, 35, num = 11)]
max_depth.append(None)
min_samples_split = [16]
min_samples_leaf = [5]
bootstrap = [True]
# Create the  grid
balanced_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'class_weight' : ['balanced']
               }

In [50]:
#Setting up Grid Search
brf = RandomForestClassifier(verbose = 0, random_state = 1)
brf_grid = GridSearchCV(brf, balanced_grid, verbose=1, n_jobs = -1, refit = False, scoring = ['f1', 'roc_auc'] )
brf_grid.fit(X,y)

Fitting 5 folds for each of 312 candidates, totalling 1560 fits


GridSearchCV(estimator=RandomForestClassifier(random_state=1), n_jobs=-1,
             param_grid={'bootstrap': [True], 'class_weight': ['balanced'],
                         'max_depth': [25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
                                       35, None],
                         'max_features': ['auto'], 'min_samples_leaf': [5],
                         'min_samples_split': [16],
                         'n_estimators': [125, 126, 127, 128, 129, 130, 131,
                                          132, 133, 134, 135, 136, 137, 138,
                                          139, 140, 141, 142, 143, 144, 145,
                                          146, 147, 148, 149, 150]},
             refit=False, scoring=['f1', 'roc_auc'], verbose=1)

In [51]:
#checking top models
balanced_grid_results = pd.DataFrame(brf_grid.cv_results_)
balanced_grid_ranked = balanced_grid_results.sort_values('rank_test_f1').reset_index(drop = True)
balanced_grid_ranked.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_bootstrap,param_class_weight,param_max_depth,param_max_features,param_min_samples_leaf,param_min_samples_split,param_n_estimators,params,split0_test_f1,split1_test_f1,split2_test_f1,split3_test_f1,split4_test_f1,mean_test_f1,std_test_f1,rank_test_f1,split0_test_roc_auc,split1_test_roc_auc,split2_test_roc_auc,split3_test_roc_auc,split4_test_roc_auc,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc
0,1.358028,0.054503,0.14296,0.040786,True,balanced,,auto,5,16,130,"{'bootstrap': True, 'class_weight': 'balanced'...",0.627027,0.65252,0.602703,0.621997,0.617318,0.624313,0.016274,1,0.855835,0.869916,0.844754,0.850264,0.863676,0.856889,0.009033,73
1,1.305755,0.02728,0.124818,0.001615,True,balanced,31.0,auto,5,16,130,"{'bootstrap': True, 'class_weight': 'balanced'...",0.627027,0.65252,0.602703,0.621997,0.617318,0.624313,0.016274,1,0.855835,0.869916,0.844754,0.850264,0.863676,0.856889,0.009033,73
2,1.298935,0.00864,0.124095,0.001466,True,balanced,34.0,auto,5,16,130,"{'bootstrap': True, 'class_weight': 'balanced'...",0.627027,0.65252,0.602703,0.621997,0.617318,0.624313,0.016274,1,0.855835,0.869916,0.844754,0.850264,0.863676,0.856889,0.009033,73
3,1.326152,0.036169,0.127987,0.00525,True,balanced,29.0,auto,5,16,130,"{'bootstrap': True, 'class_weight': 'balanced'...",0.627027,0.65252,0.602703,0.621997,0.617318,0.624313,0.016274,1,0.855835,0.869916,0.844754,0.850264,0.863676,0.856889,0.009033,73
4,1.904586,0.067213,0.169652,0.005873,True,balanced,25.0,auto,5,16,130,"{'bootstrap': True, 'class_weight': 'balanced'...",0.627027,0.65252,0.602703,0.621997,0.617318,0.624313,0.016274,1,0.855835,0.869916,0.844754,0.850264,0.863676,0.856889,0.009033,73
5,1.320409,0.050199,0.127704,0.005034,True,balanced,30.0,auto,5,16,130,"{'bootstrap': True, 'class_weight': 'balanced'...",0.627027,0.65252,0.602703,0.621997,0.617318,0.624313,0.016274,1,0.855835,0.869916,0.844754,0.850264,0.863676,0.856889,0.009033,73
6,1.336322,0.030669,0.128523,0.00821,True,balanced,26.0,auto,5,16,130,"{'bootstrap': True, 'class_weight': 'balanced'...",0.627027,0.65252,0.602703,0.621997,0.617318,0.624313,0.016274,1,0.855835,0.869916,0.844754,0.850264,0.863676,0.856889,0.009033,73
7,1.330177,0.033587,0.124171,0.002632,True,balanced,33.0,auto,5,16,130,"{'bootstrap': True, 'class_weight': 'balanced'...",0.627027,0.65252,0.602703,0.621997,0.617318,0.624313,0.016274,1,0.855835,0.869916,0.844754,0.850264,0.863676,0.856889,0.009033,73
8,1.290419,0.017992,0.12315,0.00126,True,balanced,35.0,auto,5,16,130,"{'bootstrap': True, 'class_weight': 'balanced'...",0.627027,0.65252,0.602703,0.621997,0.617318,0.624313,0.016274,1,0.855835,0.869916,0.844754,0.850264,0.863676,0.856889,0.009033,73
9,1.297834,0.022044,0.126241,0.004774,True,balanced,32.0,auto,5,16,130,"{'bootstrap': True, 'class_weight': 'balanced'...",0.627027,0.65252,0.602703,0.621997,0.617318,0.624313,0.016274,1,0.855835,0.869916,0.844754,0.850264,0.863676,0.856889,0.009033,73


In [52]:
get_best(balanced_grid_ranked)

0.6320754716981133
0.8714492518575916


0.632, a significant improvement in f1 score and a AUC-ROC score of 0.871 which suggests our model is excellent at distinguising between classes.

### Final Model

In [68]:
#checking params and fitting model
params = balanced_grid_ranked.iloc[0]['params']
rf = RandomForestClassifier(**params, random_state = 1)
rf.fit(X, y)

#saving model
joblib.dump(rf , 'churn_rf_jlib')

params

{'bootstrap': True,
 'class_weight': 'balanced',
 'max_depth': None,
 'max_features': 'auto',
 'min_samples_leaf': 5,
 'min_samples_split': 16,
 'n_estimators': 130}

In [65]:
predictions = rf.predict(X_test)
print(f'Final f1 score on test set: {f1_score(y_test, predictions)}')
print(f'Final ROC-AUC score on test set: {roc_auc_score(y_test, rf.predict_proba(X_test)[:,1])}')

Final f1 score on test set: 0.6320754716981133
Final ROC-AUC score on test set: 0.8714492518575916


Our model had a mean f1 score of 0.624 on our training and validation data and a mean ROC-AUC score of 0.857 and also performed well on the unseen test set with an f1 score of 0.632 and an ROC-AUC score of 0.871.

We can see that with `max_depth` being 'None' and `n_estimators` being towards the higher end of those we searched for, it's possible we could see marginally better performance searching a wider grid but at a higher cost to computing power and time.