# Predicting Customer Bank Churn


## By: José Francisco Medeiros Monteiro 

# OBJECTIVE

- Prepare and clean dataset
- Analyse the dataset
- Engineer features
- Hypothesis testing (A/B-style comparison) on selected features
- Build machine learning models that predict churn
- Compare machine learning models
- Pick the best machine learning model based on a score metric to be determined
- Apply champion model to holdout sample

In [None]:
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install seaborn
!pip install scipy
!pip install scikit-learn
!pip install xgboost

In [None]:
!pip install -U scikit-learn

# Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import roc_auc_score , accuracy_score, recall_score , precision_score, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV , train_test_split

from scipy import stats

from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier , plot_tree
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier , plot_importance

import pickle

# Prepare and clean

In [None]:
path = '/kaggle/input/bank-customer-churn/Customer-Churn-Records.csv'

df0 = pd.read_csv(path)
print(df0.shape)
df0.head()

In [None]:
# Check each column's data type
df0.info()

In [None]:
# Checking NA values
df0.isna().sum()

- There aren't any missing values in the dataset

In [None]:
# List the unique values and their counts for each column, along with its min and max
for column in df0.columns:
    print(df0[column].value_counts())
    print(f'Max: {df0[column].max()} , Min: {df0[column].min()}')
    print('---------------------------------------------')
    print('')


- Our dependent variable is 'Exited'. As the previous cell shows, the classes are imbalanced: about 80% of customers did not exit and 20% did.
- The imbalance is moderate and still acceptable, so I will proceed without resampling the data.
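
The class balance can also be computed directly instead of being read off the printed counts. The sketch below assumes a small made-up frame, `toy` (hypothetical, standing in for `df0`), with the same `Exited` column; the numbers are illustrative and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for df0: 8 customers stayed, 2 exited
toy = pd.DataFrame({'Exited': [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]})

# Share of each class; normalize=True returns proportions instead of counts
class_share = toy['Exited'].value_counts(normalize=True)
print(class_share)

# Ratio of majority to minority class, a quick imbalance indicator
imbalance_ratio = class_share.max() / class_share.min()
print(f'Imbalance ratio: {imbalance_ratio:.1f}')
```

On the notebook's data the same two calls would be applied to `df0['Exited']`.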

In [None]:
# Renaming 'Card Type' column to 'CardType'
df1 = df0.rename(columns = {'Card Type':'CardType'})

In [None]:
# Checking the distributions for each column to check if there is something odd
remove = ['RowNumber','CustomerId','Surname']
plot_columns = list(df1.columns)

for column in remove:
    plot_columns.remove(column)

for column in plot_columns:
    print(column)
    sns.histplot(x = df1[column])
    plt.show()

- Balance is concentrated at 0. This seems plausible: more people open an account without depositing anything than hold any other specific balance.
- Credit score is concentrated at the maximum score. Since the score is capped, every customer who would otherwise score higher ends up at the maximum, so this also seems correct.

In [None]:
# Check for duplicates using the 'CustomerId' column

df1[df1['CustomerId'].duplicated()]

In [None]:
# Leaving just features that are relevant to the problem

df2 = df1[plot_columns]

- There aren't any duplicates

# Analysing

In [None]:
df2.info()

In [None]:
# Check categorical variables against churn
sns.histplot(data = df2, x ='Exited', hue = 'Gender' )
# Select 'Exited' before aggregating so the non-numeric columns are not averaged
df2.groupby(by='Gender')['Exited'].mean()

- The share of people who leave seems to differ by gender: 16.5% of male customers leave, whereas 25.1% of female customers leave.

In [None]:
sns.histplot(data = df2, x ='Exited', hue = 'Geography' )
df2.groupby(by='Geography')['Exited'].mean()

- People in Germany seem to leave more often than people in the other countries, so I will create an 'IsGermany' variable.

In [None]:
sns.histplot(data = df2, x ='Exited', hue = 'CardType' )
df2.groupby(by='CardType')['Exited'].mean()

- There doesn't seem to be any difference in the exit rate between card types.
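
The visual comparisons of exit rates across categories can be backed by a statistical test. As a minimal sketch, a chi-square test of independence on a contingency table is one common choice; it is not something the notebook runs. The frame `toy` below is hypothetical made-up data, not the real dataset, and only shows the shape of the calculation.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data standing in for df2 (not the real dataset)
toy = pd.DataFrame({
    'Gender': ['Male'] * 50 + ['Female'] * 50,
    'Exited': [1] * 8 + [0] * 42 + [1] * 13 + [0] * 37,
})

# Contingency table of category vs. churn outcome
table = pd.crosstab(toy['Gender'], toy['Exited'])

# Chi-square test of independence between the category and churn
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(table)
print(f'chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}')
```

A small p-value would suggest that the exit rate depends on the category; the same call works for 'Geography' or 'CardType' by swapping the first column.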

In [None]:
# Turn Gender into a numerical feature and flag customers from Germany
df3 = pd.get_dummies(data = df2 , columns = ['Gender'], drop_first= True)
df3['IsGermany'] = (df3['Geography'] == 'Germany').astype(int)
df3.drop(columns = 'Geography', inplace = True)
print(df3['IsGermany'].value_counts())

df3.head()


In [None]:
sns.violinplot(data = df3, x = 'Exited' , y = 'Balance')

In [None]:
sns.pairplot(data = df3)

In [None]:
df3.info()

In [None]:
plt.figure(figsize=(16, 6))
# numeric_only=True skips the remaining text column ('CardType')
sns.heatmap(df3.corr(numeric_only=True), vmin=-1, vmax=1, annot=True)

- The best predictor variable is by far 'Complain', but using it may be a problem. The correlation is so high that complaining and exiting appear to happen almost simultaneously. If customers leave right after filing the complaint, predicting churn from complaints has little value, because there is no time left to act.
- Other good predictor variables seem to be 'Age', 'Balance', 'IsActiveMember', 'Gender_Male' and 'IsGermany'.
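
One way to look at the overlap between complaining and exiting more closely is a row-normalized crosstab. The sketch below uses a hypothetical toy frame, `toy`, with invented values; on the notebook's data the same call would be made on `df3['Complain']` and `df3['Exited']`.

```python
import pandas as pd

# Hypothetical toy data standing in for df3 (not the real dataset)
toy = pd.DataFrame({
    'Complain': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    'Exited':   [1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
})

# Row-normalized crosstab: for each Complain value, the share that exited
leak_check = pd.crosstab(toy['Complain'], toy['Exited'], normalize='index')
print(leak_check)
```

If nearly every complaining customer also exited, the feature behaves more like a copy of the target than like an early warning signal.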

In [None]:
for column in df3.columns.drop('Exited'):
    sns.histplot(data = df3 , x = column , hue = 'Exited')
    plt.show()

# Hypothesis testing (balance mean difference if exited or not)

In [None]:
sns.boxplot(data = df3, x ='Exited', y = 'Balance' )
print(df3.groupby(by='Exited')['Balance'].mean())
print(df3.groupby(by='Exited')['Balance'].std())

- Hypothesis 0 : There isn't a mean difference in balance depending on whether the client has exited or not
- Hypothesis 1 : There is a mean difference in balance depending on whether the client has exited or not

In [None]:
alpha = 0.05 # Significance
statistics = stats.ttest_ind(a=df3[df3['Exited']==0]['Balance'], b=df3[df3['Exited']==1]['Balance'], equal_var=False)
statistics

- Since the p-value is below 0.05 (the significance level), we reject the null hypothesis: the mean balance of people who exit differs from that of people who don't.
- Since there is a difference, the Balance feature should help our model classify churn.
- Note: I could do this for other variables, but I just wanted to briefly showcase hypothesis testing.
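
The decision rule can be written out explicitly, together with a rough effect size. The sketch below runs on synthetic balances drawn from normal distributions. The group means, spreads and sizes are assumptions chosen for illustration, not values from the dataset, and Cohen's d is an addition that the notebook does not compute.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical synthetic balances (not the real data) for the two groups
stayed = rng.normal(loc=72_000, scale=60_000, size=4000)
exited = rng.normal(loc=91_000, scale=58_000, size=1000)

alpha = 0.05
# Welch's t-test: does not assume equal variances between the groups
result = stats.ttest_ind(a=stayed, b=exited, equal_var=False)
reject_h0 = result.pvalue < alpha

# Cohen's d using the simple average of the two sample variances
pooled_sd = np.sqrt((stayed.var(ddof=1) + exited.var(ddof=1)) / 2)
cohens_d = (exited.mean() - stayed.mean()) / pooled_sd
print(f'p={result.pvalue:.4g}, reject H0: {reject_h0}, d={cohens_d:.2f}')
```

The p-value says whether a difference is detectable; the effect size gives a sense of how large it is relative to the spread of the data.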

# Train, test, holdout split

In [None]:
df_x = df3[['Age', 'Balance', 'IsActiveMember', 'Gender_Male', 'IsGermany']]
df_y = df3[['Exited']]

x_train, x_holdout, y_train , y_holdout = train_test_split(df_x , df_y , test_size= 0.15 , stratify = df_y , random_state= 42 )
x_train, x_test , y_train, y_test = train_test_split(x_train , y_train , test_size= 0.1764 , stratify = y_train , random_state= 0 )
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
print(x_holdout.shape)
print(y_holdout.shape)
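
The second `test_size` of 0.1764 is applied to what remains after the holdout is removed, so the final shares are easier to see when worked out. A small sketch of that arithmetic, using hypothetical variable names:

```python
# Fractions requested in the two-stage split
holdout_frac = 0.15          # first split: holdout share of the full data
test_frac_of_rest = 0.1764   # second split: test share of the remaining 85%

# Resulting shares of the full dataset
test_frac = (1 - holdout_frac) * test_frac_of_rest
train_frac = (1 - holdout_frac) * (1 - test_frac_of_rest)
print(f'train={train_frac:.4f}, test={test_frac:.4f}, holdout={holdout_frac:.4f}')
```

The result is approximately a 70/15/15 split between train, test and holdout.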

# Constructing Machine Learning Models

In [None]:
# Set train = True to train the models; leave train = False to load the saved models
train = False

xgb = XGBClassifier(random_state = 42)
tree = DecisionTreeClassifier(random_state = 42)
rf = RandomForestClassifier(random_state = 42)

In [None]:
# XGB CV (cross-validation) parameter grid
xgb_grid = {'n_estimators':[50,250,500],
            'max_depth':[3,8,None],
            'min_child_weight':[0.01,0.05,0.1],
            'learning_rate':[0.1,0.3],
            'colsample_bytree':[0.4,0.6],
            'subsample':[0.7,0.5]
            }

# decision tree CV parameters grid
tree_grid = {'max_depth':[3,8,None],
             'min_samples_split':[0.1,0.01],
             'min_samples_leaf':[0.005 , 0.05, 0.01],
             # 'auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
             'max_features':['sqrt',None],
             }

# Random Forest CV parameters grid
rf_grid = {'n_estimators':[50,250,500] ,
            'max_depth': [3,8,None] ,
            'min_samples_split': [0.02,0.04] ,
            'min_samples_leaf': [0.01, 0.02],
            'max_features': [2,4] ,
            'max_samples': [0.5,0.7],
            }
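
Before launching a grid search it can help to count how many model fits each grid implies. The sketch below uses scikit-learn's `ParameterGrid` on grids that mirror the ones above. The `n_fits` dictionary and the `cv_folds` name are hypothetical helpers, and `cv_folds` assumes 5-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

# Grids mirroring the ones defined above (only the number of options per key matters here)
xgb_grid = {'n_estimators': [50, 250, 500], 'max_depth': [3, 8, None],
            'min_child_weight': [0.01, 0.05, 0.1], 'learning_rate': [0.1, 0.3],
            'colsample_bytree': [0.4, 0.6], 'subsample': [0.7, 0.5]}
tree_grid = {'max_depth': [3, 8, None], 'min_samples_split': [0.1, 0.01],
             'min_samples_leaf': [0.005, 0.05, 0.01], 'max_features': ['sqrt', None]}
rf_grid = {'n_estimators': [50, 250, 500], 'max_depth': [3, 8, None],
           'min_samples_split': [0.02, 0.04], 'min_samples_leaf': [0.01, 0.02],
           'max_features': [2, 4], 'max_samples': [0.5, 0.7]}

cv_folds = 5  # assumed number of cross-validation folds
n_fits = {}
for name, grid in [('xgb', xgb_grid), ('tree', tree_grid), ('rf', rf_grid)]:
    n_candidates = len(ParameterGrid(grid))
    n_fits[name] = n_candidates * cv_folds
    print(f'{name}: {n_candidates} candidates, {n_fits[name]} cross-validation fits')
```

Each search also performs one final refit of the best candidate on the full training set, which is not included in these counts.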

In [None]:
score = ['f1','accuracy','precision','recall']

Here we set F1 as the score used to pick the winner. A different metric could be a better choice if false positives and false negatives had different costs, but since we don't have that information I will keep the standard F1 score as the selection metric.
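
If the costs of the two error types were known, one option would be the F-beta score, which generalizes F1. This is a sketch of that alternative, not what the notebook uses; the labels and predictions below are hypothetical and only show how `beta` shifts the score.

```python
import numpy as np
from sklearn.metrics import fbeta_score, make_scorer

# Hypothetical labels and predictions, only to show how beta shifts the score
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

# beta > 1 weights recall more heavily (missed churners are costlier)
# beta < 1 weights precision more heavily (false alarms are costlier)
f1 = fbeta_score(y_true, y_pred, beta=1)
f2 = fbeta_score(y_true, y_pred, beta=2)
print(f'F1={f1:.3f}, F2={f2:.3f}')

# Scorer object that could be passed to GridSearchCV via `scoring`
f2_scorer = make_scorer(fbeta_score, beta=2)
```

With several metrics, `scoring` can be a dictionary such as `{'f2': f2_scorer, 'accuracy': 'accuracy'}`, and `refit` then names the key used to select the winner.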

In [None]:
xgb_cv = GridSearchCV( estimator= xgb , param_grid = xgb_grid , scoring = score , refit = 'f1' , cv = 5)
tree_cv = GridSearchCV(estimator= tree , param_grid = tree_grid, scoring = score , refit = 'f1' , cv = 5)
rf_cv = GridSearchCV(estimator= rf , param_grid = rf_grid, scoring = score , refit = 'f1' , cv = 5)

In [None]:
%%time
if train == True:
    xgb_cv.fit(x_train, y_train)

In [None]:
%%time
if train == True:
    tree_cv.fit(x_train,y_train)

In [None]:
%%time
if train == True:
    rf_cv.fit(x_train,y_train)

In [None]:
if train == True:
    best_xgb = xgb_cv.best_estimator_
    best_tree = tree_cv.best_estimator_
    best_rf = rf_cv.best_estimator_

In [None]:
if train:
    # Persist the tuned estimators; the context manager closes each file after writing
    for name, model in [('best_xgb', best_xgb), ('best_tree', best_tree), ('best_rf', best_rf)]:
        with open(f'{name}.pkl', 'wb') as f:
            pickle.dump(model, f)

In [None]:
if not train:
    # Load the previously tuned estimators; the context manager closes each file after reading
    with open('/kaggle/input/trainedmodels/best_xgb.pkl', 'rb') as f:
        best_xgb = pickle.load(f)
    with open('/kaggle/input/trainedmodels/best_tree.pkl', 'rb') as f:
        best_tree = pickle.load(f)
    with open('/kaggle/input/trainedmodels/best_rf.pkl', 'rb') as f:
        best_rf = pickle.load(f)

In [None]:
y_pred_xgb = best_xgb.predict(x_test)
y_pred_tree = best_tree.predict(x_test)
y_pred_rf = best_rf.predict(x_test)

In [None]:
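def get_scores(name,y_real,y_predict):
    """Return a one-row DataFrame of classification metrics for hard 0/1 predictions."""
    # Note: ROC AUC is computed here from predicted labels, not from probabilities
    roc = roc_auc_score(y_real,y_predict)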
    acc = accuracy_score(y_real,y_predict)
    rec = recall_score(y_real,y_predict)
    prec = precision_score(y_real,y_predict)
    f1 = f1_score(y_real,y_predict)
    dict_for_df = {'model_name':[name],
                   'f1_score':[f1],
                   'roc_auc_score':[roc],
                   'accuracy_score':[acc],
                   'recall_score':[rec],
                   'precision_score':[prec]
                   }
    df_score = pd.DataFrame(data = dict_for_df)
    return df_score
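
ROC AUC is usually computed from predicted probabilities rather than from hard labels, because it measures how well the model ranks positives above negatives. The sketch below contrasts the two on synthetic data with a logistic regression. The dataset, the model and all variable names are assumptions for illustration, not the notebook's tuned estimators.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data with roughly an 80/20 class split
X, y = make_classification(n_samples=1000, n_features=5, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 labels: only one decision threshold is evaluated
auc_labels = roc_auc_score(y_te, model.predict(X_te))
# AUC from predicted probabilities of the positive class: uses the full ranking
auc_proba = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f'AUC (labels)={auc_labels:.3f}, AUC (probabilities)={auc_proba:.3f}')
```

For the notebook's models, the probability-based variant would use `best_xgb.predict_proba(x_test)[:, 1]` in place of the predicted labels.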


In [None]:
score_xgb = get_scores('XGBoost_Model',y_test,y_pred_xgb)
score_tree = get_scores('Tree_Model',y_test,y_pred_tree)
score_rf = get_scores('RandomForest_Model',y_test,y_pred_rf)

df_score = pd.concat([score_xgb,score_tree,score_rf])
df_score

In [None]:
plt.figure(figsize = (16,10))
# Pass feature names as a plain list, which some scikit-learn versions require
plot_tree(best_tree, max_depth = 2 , feature_names = list(x_train.columns))

In [None]:
plot_importance(best_xgb)

In [None]:
feature_names = x_train.columns
importances = best_rf.feature_importances_
std = np.std([single_tree.feature_importances_ for single_tree in best_rf.estimators_], axis = 0)
forest_importances = pd.Series(importances , index = feature_names)
fig, ax= plt.subplots()
forest_importances.plot.bar(yerr = std, ax =ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
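
Impurity-based (MDI) importances are computed on the training data and can favor features with many distinct values. Permutation importance on held-out data is a common complement. The sketch below is an illustration on synthetic data with a small hypothetical forest, not a result from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the notebook's five features
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in F1
perm = permutation_importance(forest, X_te, y_te, scoring='f1',
                              n_repeats=5, random_state=0)
for i, (mean, std) in enumerate(zip(perm.importances_mean, perm.importances_std)):
    print(f'feature_{i}: {mean:.3f} +/- {std:.3f}')
```

On the notebook's data, the call would take `best_rf`, `x_test` and `y_test` instead of the synthetic objects.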

In [None]:
### Since XGBoost won, let's test it on the separate holdout dataset that was set aside

y_final_pred = best_xgb.predict(x_holdout)
final_score = get_scores('XGBoost_Model',y_holdout,y_final_pred)
final_score

# Conclusion


- Our model is better than a random prediction (roc_auc_score > 0.5).
- Some features, when combined, might have a greater impact on predicting churn.
- We didn't use the 'Complain' feature to predict churn because the two seem to be the same thing: the correlation is extremely high, suggesting that people who are already exiting file complaints, so both can be seen as one single act.
- Resampling might add some value and is worth exploring, since there is an imbalance and the resulting F1 score isn't great.
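
As a starting point for that exploration, random oversampling of the minority class can be sketched with `sklearn.utils.resample`. The frame `train_df` below is hypothetical toy data standing in for the training split. Resampling should only ever be applied to the training data, never to the test or holdout sets.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy training set standing in for x_train / y_train
train_df = pd.DataFrame({
    'Age': [25, 40, 33, 51, 46, 29, 38, 60, 45, 52],
    'Exited': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

majority = train_df[train_df['Exited'] == 0]
minority = train_df[train_df['Exited'] == 1]

# Random oversampling: draw minority rows with replacement up to the majority size
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)

# Combine and shuffle so the classes are interleaved
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
print(balanced['Exited'].value_counts())
```

Whether this improves F1 on the untouched test set would need to be checked empirically; class weighting in the estimators is another option to compare against.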