The task is to predict whether a client will leave the Bank in the near future. We are given historical data on customer behavior and on the termination of contracts with the Bank.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
rndd = 12345

df = pd.read_csv('/kaggle/input/bank-customer-churn-modeling/Churn_Modelling.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             10000 non-null int64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


## **1. DataSet preparation**

As we can see, the dataset is complete: no column contains NaN values.
Now let's take a brief look at the first and the last rows.

In [2]:
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [3]:
df.tail()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
9995,9996,15606229,Obijiaku,771,France,Male,39,5,0.0,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7,0.0,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1
9999,10000,15628319,Walker,792,France,Female,28,4,130142.79,1,1,0,38190.78,0


The 'RowNumber' column just duplicates the index. Let's drop it.

In [4]:
df.drop('RowNumber', axis = 1, inplace = True)
df.head()

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


To make predictions with scikit-learn we need to prepare the dataset: it should consist only of numeric values. We start with the 'Gender' column.

In [5]:
df.Gender.unique()

array(['Female', 'Male'], dtype=object)

The column contains only two values, so we map them to integers.

In [6]:
df.Gender = df.Gender.map({'Female': 0, 'Male':1})
df.head()

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,15634602,Hargrave,619,France,0,42,2,0.0,1,1,1,101348.88,1
1,15647311,Hill,608,Spain,0,41,1,83807.86,1,0,1,112542.58,0
2,15619304,Onio,502,France,0,42,8,159660.8,3,1,0,113931.57,1
3,15701354,Boni,699,France,0,39,1,0.0,2,0,0,93826.63,0
4,15737888,Mitchell,850,Spain,0,43,2,125510.82,1,1,1,79084.1,0


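For the 'Surname' and 'Geography' columns we will use OrdinalEncoder, which maps every string value to an integer code. Note that the cell below fits the encoder on the whole dataframe, so the numeric columns are also replaced by ordinal codes (for example, CreditScore 619 becomes 228). In general, a column like 'Surname' should not be used as a feature in a real dataset, because it should not affect the outcome; looking a little ahead, in our case keeping it had a positive impact on the metrics.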

In [7]:
encoder = OrdinalEncoder()
data = encoder.fit_transform(df)
df_trans = pd.DataFrame(data, columns = df.columns)
df_trans.head()

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,2736.0,1115.0,228.0,0.0,0.0,24.0,2.0,0.0,0.0,1.0,1.0,5068.0,1.0
1,3258.0,1177.0,217.0,2.0,0.0,23.0,1.0,743.0,0.0,0.0,1.0,5639.0,0.0
2,2104.0,2040.0,111.0,0.0,0.0,24.0,8.0,5793.0,2.0,1.0,0.0,5707.0,1.0
3,5435.0,289.0,308.0,0.0,0.0,21.0,1.0,0.0,1.0,0.0,0.0,4704.0,0.0
4,6899.0,1822.0,459.0,2.0,0.0,25.0,2.0,3696.0,0.0,1.0,1.0,3925.0,0.0
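
If we wanted the numeric columns to keep their original values, the encoder could be fitted on the text columns only. The following is a minimal sketch of that idea, not the approach used in this notebook; the toy dataframe and the name text_cols are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with the same kinds of columns as the churn data (values are made up).
toy = pd.DataFrame({
    'Surname': ['Hill', 'Boni', 'Hill', 'Onio'],
    'Geography': ['Spain', 'France', 'Germany', 'France'],
    'CreditScore': [608, 699, 850, 502],
})

# Encode only the text columns; numeric columns keep their original values.
text_cols = ['Surname', 'Geography']
encoded = toy.copy()
encoded[text_cols] = OrdinalEncoder().fit_transform(toy[text_cols])

print(encoded)
```

OrdinalEncoder assigns codes in sorted order of the categories, so 'France', 'Germany' and 'Spain' become 0, 1 and 2, while 'CreditScore' is left untouched.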


The dataset now consists only of numeric values.
<br> Now let's check the value types.

In [8]:
df_trans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
CustomerId         10000 non-null float64
Surname            10000 non-null float64
CreditScore        10000 non-null float64
Geography          10000 non-null float64
Gender             10000 non-null float64
Age                10000 non-null float64
Tenure             10000 non-null float64
Balance            10000 non-null float64
NumOfProducts      10000 non-null float64
HasCrCard          10000 non-null float64
IsActiveMember     10000 non-null float64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null float64
dtypes: float64(13)
memory usage: 1015.8 KB


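The encoder returned float64 for every column. Most columns hold whole numbers, so we can cast them to int32 to reduce memory usage.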

In [9]:
df_trans = df_trans.astype({
    'CustomerId'    : 'int32',
    'Surname'       : 'int32',
    'Geography'     : 'int32',
    'Gender'        : 'int32',
    'Age'           : 'int32',
    'Tenure'        : 'int32',
    'NumOfProducts' : 'int32',
    'HasCrCard'     : 'int32',
    'IsActiveMember': 'int32',
    'Exited'        : 'int32'})
df_trans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
CustomerId         10000 non-null int32
Surname            10000 non-null int32
CreditScore        10000 non-null float64
Geography          10000 non-null int32
Gender             10000 non-null int32
Age                10000 non-null int32
Tenure             10000 non-null int32
Balance            10000 non-null float64
NumOfProducts      10000 non-null int32
HasCrCard          10000 non-null int32
IsActiveMember     10000 non-null int32
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int32
dtypes: float64(3), int32(10)
memory usage: 625.1 KB
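
The two memory figures can be checked with simple arithmetic. This is a small sketch; the 128 bytes for the RangeIndex is an assumption that is not shown in the info() output.

```python
rows = 10_000
index_bytes = 128  # assumed size of the RangeIndex; not stated in the output

# Before: 13 float64 columns at 8 bytes per value.
before_kb = (rows * 13 * 8 + index_bytes) / 1024
# After: 3 float64 columns (8 bytes) and 10 int32 columns (4 bytes).
after_kb = (rows * (3 * 8 + 10 * 4) + index_bytes) / 1024

print(f'{before_kb:.1f} KB -> {after_kb:.1f} KB')
```

Under that assumption the sizes are 1015.75 KB and 625.125 KB, in line with the 1015.8 KB and 625.1 KB reported by the two info() calls.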


## **2. Split data for training**

Now let's create the datasets for training. First we separate the 'Exited' column, which is our target value.

In [10]:
target = df_trans['Exited']
train = df_trans.drop('Exited', axis = 1)

To make the experiments reproducible, we use the constant rndd = 12345 (defined in the first cell) as the random state everywhere.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.25, random_state=rndd)
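
Since the target classes are not balanced, it can help to keep the same share of each class in both parts of the split. The sketch below shows the stratify argument of train_test_split on synthetic data; the data and the 20% positive rate are assumptions for illustration, and the split above does not use stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12345)
# Synthetic stand-in for the churn data: about 20% positives (an assumption).
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.2).astype(int)

# stratify=y keeps the share of each class equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

print('positive share, train:', y_tr.mean())
print('positive share, test: ', y_te.mean())
```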

Now let's train a RandomForestClassifier: we tune its hyperparameters with a randomized search, fit it on the training data and predict the target.

In [12]:
#create a model for prediction
rand_Forest = RandomForestClassifier(random_state = rndd)

# define model parameters and values for tuning
parameters = {
    'n_estimators':np.arange(1,300, 50),
    'max_depth' : np.arange(2, 30, 2),
    'min_samples_split': np.arange(2, 30, 2),
    'min_samples_leaf': np.arange(2, 30, 2)    
}
#create a searchCV to cycle through the possible values
rand_Forest_grid = RandomizedSearchCV(
    estimator = rand_Forest,
    param_distributions  = parameters,
    scoring='f1',
    n_jobs=2,
    cv = 5,
    n_iter = 150,
    verbose=True, refit=True, return_train_score = True, random_state = rndd)
    
#fit the model    
rand_Forest_grid.fit(X_train, y_train)
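#check scores result
#note: best_score_ is the mean cross-validated F1 of the best candidate, not F1 on the full training set
f1_train = rand_Forest_grid.best_score_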
print('Best Estimator: ', rand_Forest_grid.best_estimator_)
print('Best Params: ', rand_Forest_grid.best_params_)
print('f1 =', f1_train)
predicted_train = rand_Forest_grid.predict(X_train)
accuracy_train = accuracy_score(y_train, predicted_train)
print('accuracy =', accuracy_train)
#note: ROC AUC is computed here from hard 0/1 predictions, not from predicted probabilities
roc_auc_score_train =  roc_auc_score(y_train, predicted_train)
print('roc_auc_score',  roc_auc_score_train)

Fitting 5 folds for each of 150 candidates, totalling 750 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   25.6s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:  2.2min
[Parallel(n_jobs=2)]: Done 446 tasks      | elapsed:  4.4min
[Parallel(n_jobs=2)]: Done 750 out of 750 | elapsed:  6.9min finished


Best Estimator:  RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=18, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=20,
                       min_weight_fraction_leaf=0.0, n_estimators=101,
                       n_jobs=None, oob_score=False, random_state=12345,
                       verbose=0, warm_start=False)
Best Params:  {'n_estimators': 101, 'min_samples_split': 20, 'min_samples_leaf': 2, 'max_depth': 18}
f1 = 0.5516498033711694
accuracy = 0.8978666666666667
roc_auc_score 0.7612260012103458
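
To see how much of the parameter grid the randomized search actually visits, we can count the combinations. This is a small sketch that only repeats the grid from the cell above; the names n_combinations and n_fits are introduced here for illustration.

```python
import numpy as np

# Same grid as in the cell above.
parameters = {
    'n_estimators': np.arange(1, 300, 50),
    'max_depth': np.arange(2, 30, 2),
    'min_samples_split': np.arange(2, 30, 2),
    'min_samples_leaf': np.arange(2, 30, 2),
}

n_combinations = int(np.prod([len(v) for v in parameters.values()]))
n_iter, cv = 150, 5
n_fits = n_iter * cv

print('grid size:', n_combinations)
print('share sampled:', n_iter / n_combinations)
print('model fits:', n_fits)
```

The grid has 6 x 14 x 14 x 14 = 16464 combinations, so 150 iterations cover less than 1% of it, and 150 candidates with 5 folds give the 750 fits reported in the log.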


Now let's check our model on the test part of the data.

In [13]:
#predict values on previously trained model
y_predicted = rand_Forest_grid.predict(X_test)

f1_test = f1_score(y_test, y_predicted)
accuracy_test = accuracy_score(y_test, y_predicted)
roc_auc_score_test =  roc_auc_score(y_test, y_predicted)
print('TEST       f1      =', f1_test)
print('TEST accuracy      =', accuracy_test)
print('TEST roc_auc_score =', roc_auc_score_test)

TEST       f1      = 0.5394088669950738
TEST accuracy      = 0.8504
TEST roc_auc_score = 0.6899146274761598
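
The cells above pass hard 0/1 predictions to roc_auc_score. In that case the score equals the balanced accuracy; ROC AUC is usually computed from predicted probabilities instead. The sketch below illustrates the difference on synthetic data, which is an assumption standing in for the churn set, so the numbers are not comparable with the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data as a stand-in for the churn set (an assumption).
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2],
                           random_state=12345)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=12345, stratify=y)

model = RandomForestClassifier(n_estimators=50, random_state=12345).fit(X_tr, y_tr)

labels = model.predict(X_te)              # hard 0/1 predictions
scores = model.predict_proba(X_te)[:, 1]  # estimated probability of class 1

auc_from_labels = roc_auc_score(y_te, labels)
auc_from_scores = roc_auc_score(y_te, scores)

print('AUC from labels:       ', auc_from_labels)
print('balanced accuracy:     ', balanced_accuracy_score(y_te, labels))
print('AUC from probabilities:', auc_from_scores)
```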


To compare the experiments later, we will save all scores in the dataframe results.

In [14]:
#Create empty dataframe with columns
results = pd.DataFrame(columns=['expirement', 'f1_train', 'f1_test', 'accuracy_train', 'accuracy_test', 'roc_auc_train', 'roc_auc_test'])
#add a row with the scores (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
results = pd.concat([results, pd.DataFrame([{'expirement':'simple model',
                           'f1_train':f1_train, 'f1_test': f1_test,
                           'accuracy_train': accuracy_train, 'accuracy_test':accuracy_test,
                           'roc_auc_train':roc_auc_score_train, 'roc_auc_test':roc_auc_score_test}])])
results

Unnamed: 0,expirement,f1_train,f1_test,accuracy_train,accuracy_test,roc_auc_train,roc_auc_test
0,simple model,0.55165,0.539409,0.897867,0.8504,0.761226,0.689915
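
An alternative way to accumulate the scores that works across pandas versions (DataFrame.append was removed in pandas 2.0) is to collect one dict per experiment in a list and build the dataframe in one step. This is only a sketch; the names rows and results_sketch are hypothetical, and the values are copied from the table above.

```python
import pandas as pd

rows = []  # one dict per experiment

# Metric values copied from the first results row above.
rows.append({'expirement': 'simple model',
             'f1_train': 0.55165, 'f1_test': 0.539409,
             'accuracy_train': 0.897867, 'accuracy_test': 0.8504,
             'roc_auc_train': 0.761226, 'roc_auc_test': 0.689915})

# Build the frame in one step; rows get a fresh 0..n-1 index.
results_sketch = pd.DataFrame(rows)
print(results_sketch)
```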


So far the results are not bad. Let's look at our data more closely.

In [15]:
X_train.describe()

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
count,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0,7500.0
mean,5012.062667,1505.762667,260.1676,0.747733,0.543867,20.846133,5.034533,2029.195733,0.523067,0.706667,0.5204,5008.469733
std,2891.50042,849.850407,96.70528,0.827639,0.498105,10.512538,2.887176,2125.905454,0.583249,0.45532,0.499617,2898.56616
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2497.75,764.0,193.0,0.0,0.0,14.0,3.0,0.0,0.0,0.0,0.0,2478.75
50%,5008.5,1537.0,261.5,0.0,1.0,19.0,5.0,1361.5,0.0,1.0,1.0,5039.5
75%,7541.5,2251.25,326.0,1.0,1.0,26.0,8.0,3888.5,1.0,1.0,1.0,7512.25
max,9999.0,2931.0,459.0,2.0,1.0,69.0,10.0,6381.0,3.0,1.0,1.0,9998.0


As you can see, the mean of EstimatedSalary is 5008.469733 and the mean of CreditScore is 260.16760, so the two differ by a factor of about 19 (5008 / 260 ≈ 19.3). Features on such different scales can hurt many models. Let's bring them to a common scale with StandardScaler.

In [16]:
#create scaler
scaler = StandardScaler()
#fit and transform data
X_train_scaled = scaler.fit_transform(X_train)
#transform data based on previous fit process
X_test_scaled = scaler.transform(X_test)

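#wrap the scaled array in a DataFrame to compute summary statistics
#note: after scaling both means are close to 0, so their ratio is not informative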
d = pd.DataFrame(columns=X_train.columns, data=X_train_scaled).describe()
print('order of values', abs(d.loc['mean','EstimatedSalary']/ d.loc['mean','CreditScore']))

order of values 0.05368806754819181
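
After standardization every column has a mean close to 0, so a ratio of two means is a ratio of two rounding errors. A more direct check is that each column has mean 0 and standard deviation 1. The sketch below verifies this on made-up data and compares the scaler with the formula (x - mean) / std; the two synthetic features are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12345)
# Two made-up features on very different scales.
X = np.column_stack([rng.normal(5000, 2900, size=500),
                     rng.normal(260, 97, size=500)])

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# StandardScaler computes (x - mean) / std per column, with the population std (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print('means:', X_scaled.mean(axis=0))  # both close to 0
print('stds: ', X_scaled.std(axis=0))   # both close to 1
```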


Now the features are on the same scale. Let's check the result on our model, reusing the hyperparameters found by the previous search.
<br> Best Params:  {'n_estimators': 101, 'min_samples_split': 20, 'min_samples_leaf': 2, 'max_depth': 18}

In [17]:
#create model with the best parameters found by the previous search
#(only the tuned parameters are set explicitly; the rest keep their defaults)
rand_Forest = RandomForestClassifier(n_estimators=101, max_depth=18,
                       min_samples_split=20, min_samples_leaf=2,
                       random_state=rndd)

#define function for reducing code duplication
def checkModel(X_train, y_train, X_test, y_test, model = rand_Forest):
    
    model.fit(X_train, y_train)
    y_train_predicted = model.predict(X_train)
    f1_train = f1_score(y_train, y_train_predicted)
    accuracy_train = accuracy_score(y_train, y_train_predicted)
    roc_auc_score_train =  roc_auc_score(y_train, y_train_predicted)
    
    print('roc_auc_score',  roc_auc_score_train)
    print('f1 =', f1_train)
    print('accuracy =', accuracy_train)
    
    y_test_predicted = model.predict(X_test)
    f1_test = f1_score(y_test, y_test_predicted)
    accuracy_test = accuracy_score(y_test, y_test_predicted)
    roc_auc_score_test =  roc_auc_score(y_test, y_test_predicted)
    
    print('TEST       f1 =', f1_test)
    print('TEST accuracy =', accuracy_test)
    print('TEST roc_auc_score =', roc_auc_score_test)
    
    return f1_train, accuracy_train, roc_auc_score_train, f1_test, accuracy_test, roc_auc_score_test

#call function
f1_train, accuracy_train, roc_auc_score_train, f1_test, accuracy_test, roc_auc_score_test = checkModel(X_train_scaled, y_train, X_test_scaled, y_test)

roc_auc_score 0.7616422518114118
f1 = 0.6773648648648649
accuracy = 0.8981333333333333
TEST       f1 = 0.5394088669950738
TEST accuracy = 0.8504
TEST roc_auc_score = 0.6899146274761598


Add the scores to the results dataframe.

In [18]:
results = pd.concat([results, pd.DataFrame([{'expirement':'scaled data model',
                           'f1_train':f1_train, 'f1_test': f1_test,
                           'accuracy_train': accuracy_train, 'accuracy_test':accuracy_test,
                           'roc_auc_train':roc_auc_score_train, 'roc_auc_test':roc_auc_score_test}])])
results

Unnamed: 0,expirement,f1_train,f1_test,accuracy_train,accuracy_test,roc_auc_train,roc_auc_test
0,simple model,0.55165,0.539409,0.897867,0.8504,0.761226,0.689915
0,scaled data model,0.677365,0.539409,0.898133,0.8504,0.761642,0.689915
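
The identical test metrics in the two rows are what we would expect: tree-based models split on thresholds, so a per-feature rescaling should leave the predictions essentially unchanged. The sketch below checks this on synthetic data; the data and the scale factors are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=8, random_state=12345)
X = X * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])  # put features on different scales
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=12345)

scaler = StandardScaler().fit(X_tr)

# Same model settings and seed, fitted once on raw and once on scaled features.
raw_model = RandomForestClassifier(n_estimators=50, random_state=12345).fit(X_tr, y_tr)
scaled_model = RandomForestClassifier(n_estimators=50, random_state=12345).fit(
    scaler.transform(X_tr), y_tr)

agreement = np.mean(raw_model.predict(X_te) == scaled_model.predict(scaler.transform(X_te)))
print('share of identical test predictions:', agreement)
```

Small differences can still appear because of floating-point rounding, so the sketch reports the share of matching predictions rather than requiring exact equality.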


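The only apparent improvement is in f1_train, but that comparison is misleading: for the first model f1_train is the mean cross-validated F1 from the search (best_score_), while here it is F1 on the full training set. Training accuracy and ROC AUC are almost unchanged and the test metrics are identical, which is expected because tree-based models are largely insensitive to feature scaling. Now let's check the distribution of our target value.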

In [19]:
y_train.value_counts()

0    5998
1    1502
Name: Exited, dtype: int64
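
The class ratio, and from it a reasonable repeat factor for upsampling, can be computed directly from these counts. A small sketch, with the counts copied from the output above and the names ratio and repeat introduced for illustration:

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 5998, 1: 1502})

ratio = counts[0] / counts[1]
repeat = int(round(ratio))  # how many copies of the minority class to keep

print('majority/minority ratio:', ratio)
print('repeat factor:', repeat)
print('minority rows after upsampling:', counts[1] * repeat)
```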

We observe that the value 0 occurs about 4 times more often than 1 (5998 vs 1502). Let's try to equalize the classes by upsampling: we repeat the rows with target 1 several times and shuffle the result. For this purpose we define the function upsample_1.

In [20]:
def upsample_1(features, target, repeat):
    #feature rows whose target is 0
    features_zeros = features[target == 0]
    #feature rows whose target is 1
    features_ones = features[target == 1]
    
    #target values equal to 0
    target_zeros = target[target == 0]
    #target values equal to 1
    target_ones = target[target == 1]
    
    #keep the class 0 rows once and repeat the class 1 rows `repeat` times
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    
    #do the same for the target so that it stays aligned with the features
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    #shuffle features and target together
    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=rndd)
    
    return features_upsampled, target_upsampled
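
To see what this kind of upsampling does, here is a compact variant applied to a toy example. The function upsample_sketch and the toy data are made up for illustration; it follows the same idea as upsample_1 but is not the same code.

```python
import pandas as pd
from sklearn.utils import shuffle

def upsample_sketch(features, target, repeat, seed=12345):
    """Compact variant of upsample_1: repeat rows whose target is 1."""
    minority = target == 1
    features_up = pd.concat([features[~minority]] + [features[minority]] * repeat)
    target_up = pd.concat([target[~minority]] + [target[minority]] * repeat)
    # Shuffle both objects with the same permutation so they stay aligned.
    return shuffle(features_up, target_up, random_state=seed)

# Toy data: four rows of class 0 and one row of class 1.
toy_X = pd.DataFrame({'Age': [42, 41, 39, 43, 36]})
toy_y = pd.Series([0, 0, 0, 0, 1], name='Exited')

X_up, y_up = upsample_sketch(toy_X, toy_y, repeat=4)
print(y_up.value_counts())
print(X_up.index.tolist())  # index 4 appears four times: the rows are exact copies
```

The repeated rows are exact copies of the original minority rows, so upsampling adds no new information; it only changes how much weight the minority class gets during training.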

Let's check the upsampling result.

In [21]:
X_train_u, y_train_u = upsample_1(X_train, y_train, 4)
X_test_u, y_test_u = upsample_1(X_test, y_test, 4)
y_train_u.value_counts()

1    6008
0    5998
Name: Exited, dtype: int64

Now 0 and 1 occur about equally often. Let's check the result on our model.

In [22]:
f1_train, accuracy_train, roc_auc_score_train, f1_test, accuracy_test, roc_auc_score_test = checkModel(X_train_u, y_train_u, X_test_u, y_test_u)

roc_auc_score 0.9750822400187545
f1 = 0.975513880927033
accuracy = 0.975095785440613
TEST       f1 = 0.728448275862069
TEST accuracy = 0.7544457978075517
TEST roc_auc_score = 0.7599082067013865
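
In the cell above the test set is upsampled as well, which changes the class balance the test metrics are measured on. A common alternative is to upsample only the training part and to evaluate on the untouched test part. The sketch below shows that variant on synthetic data; the data, the column names and the repeat factor are assumptions, so the score is not comparable with the results table.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the churn data (an assumption).
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.8, 0.2],
                           random_state=12345)
X = pd.DataFrame(X, columns=[f'f{i}' for i in range(8)])
y = pd.Series(y, name='Exited')
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=12345, stratify=y)

# Upsample the training part only: repeat the minority rows 4 times.
ones = y_tr == 1
X_tr_up = pd.concat([X_tr[~ones]] + [X_tr[ones]] * 4)
y_tr_up = pd.concat([y_tr[~ones]] + [y_tr[ones]] * 4)

model = RandomForestClassifier(n_estimators=50, random_state=12345).fit(X_tr_up, y_tr_up)

# The test part keeps its original class balance.
f1_original_test = f1_score(y_te, model.predict(X_te))
print('F1 on the untouched test set:', f1_original_test)
```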


Add the scores to the results dataframe.

In [23]:
results = pd.concat([results, pd.DataFrame([{'expirement':'upsmpled data model',
                           'f1_train':f1_train, 'f1_test': f1_test,
                           'accuracy_train': accuracy_train, 'accuracy_test':accuracy_test,
                           'roc_auc_train':roc_auc_score_train, 'roc_auc_test':roc_auc_score_test}])])
results

Unnamed: 0,expirement,f1_train,f1_test,accuracy_train,accuracy_test,roc_auc_train,roc_auc_test
0,simple model,0.55165,0.539409,0.897867,0.8504,0.761226,0.689915
0,scaled data model,0.677365,0.539409,0.898133,0.8504,0.761642,0.689915
0,upsmpled data model,0.975514,0.728448,0.975096,0.754446,0.975082,0.759908


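As we can see, the test F1, the main metric for this classification task, is now much higher. Note, however, that the test set was upsampled as well, so these test metrics are measured on a different class balance than in the previous rows and are not directly comparable. The large gap between the training and test scores also suggests overfitting to the duplicated rows.
<br> Now let's apply the scaler as well and check the result.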

In [24]:
scaler = StandardScaler()
X_train_u_scaled = scaler.fit_transform(X_train_u)
X_test_u_scaled = scaler.transform(X_test_u)

f1_train, accuracy_train, roc_auc_score_train, f1_test, accuracy_test, roc_auc_score_test = checkModel(X_train_u_scaled, y_train_u, X_test_u_scaled, y_test_u)

roc_auc_score 0.9750822400187545
f1 = 0.975513880927033
accuracy = 0.975095785440613
TEST       f1 = 0.7282520872609749
TEST accuracy = 0.7542021924482338
TEST roc_auc_score = 0.7596537537751777


In [25]:
results = pd.concat([results, pd.DataFrame([{'expirement':'upsmpled scaled data model',
                           'f1_train':f1_train, 'f1_test': f1_test,
                           'accuracy_train': accuracy_train, 'accuracy_test':accuracy_test,
                           'roc_auc_train':roc_auc_score_train, 'roc_auc_test':roc_auc_score_test}])])
results

Unnamed: 0,expirement,f1_train,f1_test,accuracy_train,accuracy_test,roc_auc_train,roc_auc_test
0,simple model,0.55165,0.539409,0.897867,0.8504,0.761226,0.689915
0,scaled data model,0.677365,0.539409,0.898133,0.8504,0.761642,0.689915
0,upsmpled data model,0.975514,0.728448,0.975096,0.754446,0.975082,0.759908
0,upsmpled scaled data model,0.975514,0.728252,0.975096,0.754202,0.975082,0.759654


**In this notebook, we have considered two data preparation techniques for a classification task with imbalanced classes: feature scaling (standardization) and upsampling of the minority class.**