<h1>Creator of Notebook: Shehryar Mallick</h1>
<h1>This notebook implements models that predict whether customers of a US bank will leave the bank (churn)</h1>

<h1>Data Preprocessing Stage</h1>

<h2>Exploratory Data Analysis to gain insights and Data Wrangling techniques</h2>
<h3>Checking for unique values in a column</h3>
<h3>Checking for null values in a column</h3>
<h3>Checking for outliers in specific columns</h3>
<h3>Encoding categorical columns using one-hot encoding</h3>
<h3>Normalizing the data using the MinMax scaler</h3>
<h3>Binning the data columns with high cardinality</h3>

In [1]:
import pandas as pd

print('import successful')

In [2]:
df = pd.read_csv('../input/bank-customer-churn-prediction/Churn_Modelling.csv')
print(df.shape)
df.head(5)   ### kept as the last expression so the notebook actually displays the preview

In [3]:
### print all column names, then the value counts of every column in the dataset
col_names = df.columns
print(col_names,'\n#########################')
for i in col_names:
  print("Column : ",i)
  display(df[i].value_counts())
  print("#######################")

In [4]:
df.isnull().sum()

In [None]:
### From the initial analysis it seems reasonable to drop the columns RowNumber, CustomerId and Surname.
### The high-cardinality columns CreditScore, Age, Balance and EstimatedSalary need to be binned and checked for outliers.
### The columns that need to be encoded are Geography and Gender.

<h3>Dropping columns that are not useful for prediction</h3>

In [5]:
df = df.drop(columns=['RowNumber', 'CustomerId', 'Surname'])
df

<h3>Outlier detection using:</h3>
    <h4>Box plot technique</h4>
    <h4>Z-score technique</h4>
<h3>Outlier removal by setting a threshold</h3>

In [6]:
import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2,figsize=(16,16))

### pass the data as x= explicitly; recent seaborn versions no longer accept it positionally
sns.boxplot(x=df['CreditScore'], ax=axes[0,0]).set(title='CreditScore')
sns.boxplot(x=df['Age'], ax=axes[0,1]).set(title='Age')
sns.boxplot(x=df['EstimatedSalary'], ax=axes[1,0]).set(title='EstimatedSalary')
sns.boxplot(x=df['Balance'], ax=axes[1,1]).set(title='Balance')

In [7]:
### z score to detect outliers

from scipy import stats
import numpy as np
z = np.abs(stats.zscore(df[['CreditScore', 'Age','Balance', 'EstimatedSalary']]))
# print(z)    ###  to check the z score of the mentioned columns

threshold = 3
print(np.where(z > threshold))   ### (row, column) positions whose |z| exceeds the threshold

In [8]:
print(df.shape)
df = df[(z < threshold).all(axis=1)]    ### keep only the rows whose z-scores are below the threshold in every column
print(df.shape)
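
The z-score filter above can be sketched in plain Python to show the arithmetic behind it. This is only an illustrative sketch: the helper names and the toy (Age, CreditScore) rows are hypothetical and not taken from the dataset. It uses the population standard deviation, which is also the default of `scipy.stats.zscore`.

```python
from statistics import fmean, pstdev


def zscores(values):
    """Return the z-score of every value, using the population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def drop_outlier_rows(rows, threshold=3.0):
    """Keep only the rows in which every column has |z| below the threshold."""
    columns = list(zip(*rows))                       # transpose rows -> columns
    z_by_column = [zscores(col) for col in columns]  # z-scores computed per column
    z_by_row = zip(*z_by_column)                     # transpose back to rows
    return [row for row, zs in zip(rows, z_by_row)
            if all(abs(z) < threshold for z in zs)]


# Hypothetical toy data: 20 ordinary ages plus one extreme age of 92.
ages = list(range(35, 45)) * 2 + [92]
credit_scores = [600 + 10 * (i % 5) for i in range(21)]
rows = list(zip(ages, credit_scores))

kept = drop_outlier_rows(rows)
print(len(rows), '->', len(kept))   # prints: 21 -> 20
```

Note that a row is dropped as soon as a single column exceeds the threshold, which is what `.all(axis=1)` expresses in the cell above.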

<h2>Making copies of the data to explore a variety of techniques</h2>

In [133]:
df_o = df.copy()   ### this copy will be binned and left without normalization
df_normal = df.copy() ### this copy will be normalized with the MinMax scaler and left without binning

In [134]:
import category_encoders as ce   #for importing the encoder
from sklearn.preprocessing import MinMaxScaler

df_normal['Gender'] = df_normal['Gender'].replace(['Male','Female'],[1,0])

#Create object for one-hot encoding
OH_encoder=ce.OneHotEncoder(cols=['Geography'],handle_unknown='return_nan',return_df=True,use_cat_names=True) #for geography col; cols is given as a list of column names

# encode dataset
df_normal = OH_encoder.fit_transform(df_normal)
# df_normal

scaler = MinMaxScaler()
df_normal[['CreditScore','Age','Balance','EstimatedSalary']] = scaler.fit_transform(df_normal[['CreditScore','Age','Balance','EstimatedSalary']])
df_normal.head(3)
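
The MinMax scaling step above maps every column to the range [0, 1] with x' = (x - min) / (max - min). A minimal plain-Python sketch of that formula follows; the function name and the sample ages are hypothetical, and mapping a constant column to 0.0 is a choice made for this sketch to avoid dividing by zero.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant column: nothing to spread out (sketch-level choice).
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical ages: the youngest maps to 0.0, the oldest to 1.0.
print(min_max_scale([18, 40, 62]))   # prints: [0.0, 0.5, 1.0]
```

Because the minimum and maximum define the scale, a single extreme value compresses all other values, which is one reason to remove outliers before scaling.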

In [10]:
bin_cols = ['CreditScore', 'Age','Balance', 'EstimatedSalary']
for i in bin_cols:
    print(i,'\n',df[i].min(),'\n',df[i].max(),'\n__________________')

<h2>Binning for three columns was conducted using Sturges' rule</h2>

In [11]:
### bins for CreditScore: the number of bins was chosen with Sturges' rule, K = 1 + 3.322*log10(N),
### where N = 9859 observations, hence K ≈ 14

from sklearn.preprocessing import KBinsDiscretizer   ### import the discretizer from scikit-learn

est = KBinsDiscretizer(n_bins=14, encode='ordinal',strategy='uniform')
df_o['CreditScore'] = est.fit_transform(df_o[['CreditScore']])

fig, axes = plt.subplots(1, 2,figsize=(15,7))
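### sns.distplot is deprecated; histplot with kde=True draws the same histogram-plus-density view
sns.histplot(df['CreditScore'], kde=True, stat='density', ax=axes[0]).set(title='CreditScore before binning')
sns.histplot(df_o['CreditScore'], kde=True, stat='density', ax=axes[1]).set(title='CreditScore after binning')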
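
The bin count used above can be reproduced from Sturges' rule, and uniform-width binning itself is a small calculation. The sketch below is illustrative only: the function names are hypothetical, and the 350 to 850 range is an assumed example range rather than the minimum and maximum printed earlier.

```python
import math


def sturges_bins(n_observations):
    """Sturges' rule: K = 1 + 3.322 * log10(N), rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n_observations))


def uniform_bin_index(value, lo, hi, n_bins):
    """Return the 0-based index of the equal-width bin that contains value."""
    idx = int((value - lo) * n_bins / (hi - lo))
    return min(idx, n_bins - 1)   # the maximum value belongs to the last bin


print(sturges_bins(9859))                    # prints: 14
print(uniform_bin_index(599, 350, 850, 14))  # prints: 6
```

With `encode='ordinal'` and `strategy='uniform'`, each value is replaced by a bin index of this kind, so the binned column holds integers from 0 to K - 1.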


In [12]:
### bins for Age

est = KBinsDiscretizer(n_bins=5, encode='ordinal',strategy='uniform')
df_o['Age'] = est.fit_transform(df_o[['Age']])

fig, axes = plt.subplots(1, 2,figsize=(15,7))
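### sns.distplot is deprecated; histplot with kde=True draws the same histogram-plus-density view
sns.histplot(df['Age'], kde=True, stat='density', ax=axes[0]).set(title='Age before binning')
sns.histplot(df_o['Age'], kde=True, stat='density', ax=axes[1]).set(title='Age after binning')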

In [13]:
### bins for Balance

est = KBinsDiscretizer(n_bins=14, encode='ordinal',strategy='uniform')
df_o['Balance'] = est.fit_transform(df_o[['Balance']])

fig, axes = plt.subplots(1, 2,figsize=(15,7))
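### sns.distplot is deprecated; histplot with kde=True draws the same histogram-plus-density view
sns.histplot(df['Balance'], kde=True, stat='density', ax=axes[0]).set(title='Balance before binning')
sns.histplot(df_o['Balance'], kde=True, stat='density', ax=axes[1]).set(title='Balance after binning')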

In [14]:
### bins for EstimatedSalary

est = KBinsDiscretizer(n_bins=14, encode='ordinal',strategy='uniform')
df_o['EstimatedSalary'] = est.fit_transform(df_o[['EstimatedSalary']])

fig, axes = plt.subplots(1, 2,figsize=(15,7))
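### sns.distplot is deprecated; histplot with kde=True draws the same histogram-plus-density view
sns.histplot(df['EstimatedSalary'], kde=True, stat='density', ax=axes[0]).set(title='EstimatedSalary before binning')
sns.histplot(df_o['EstimatedSalary'], kde=True, stat='density', ax=axes[1]).set(title='EstimatedSalary after binning')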


In [17]:
import category_encoders as ce   #for importing the encoder

df_o['Gender'] = df_o['Gender'].replace(['Male','Female'],[1,0])

#Create object for one-hot encoding
OH_encoder=ce.OneHotEncoder(cols=['Geography'],handle_unknown='return_nan',return_df=True,use_cat_names=True) #for geography col; cols is given as a list of column names

# encode dataset
data_encoded = OH_encoder.fit_transform(df_o)
data_encoded
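
To make the one-hot step above concrete, here is a minimal plain-Python sketch of what the encoding produces for a single categorical column. It is not the `category_encoders` implementation: the function name is hypothetical, the column naming only imitates `use_cat_names=True`, and sorting the categories alphabetically is a choice made for this sketch.

```python
def one_hot(values, prefix):
    """Turn a list of category labels into one dict of 0/1 indicator columns per row."""
    categories = sorted(set(values))   # sketch-level choice: alphabetical column order
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Hypothetical sample of the Geography column.
encoded = one_hot(['France', 'Spain', 'France', 'Germany'], prefix='Geography')
for row in encoded:
    print(row)
```

Each row has exactly one indicator set to 1, so no artificial ordering between the countries is introduced, unlike a single integer code.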

<h1>DATA VISUALIZATION</h1>

<h2>Visualizing the correlation between the feature variables and the target variable</h2>

In [144]:
### plot the heatmap of correlation between different columns
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(14, 8))
corr = data_encoded.corr()
sns.heatmap(corr, annot=True, center=0, linewidths=.5)
plt.title('Bank Churn Dataset Correlation', fontsize=16)
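
Each cell of the heatmap above is a Pearson correlation coefficient, which is the default method of `DataFrame.corr()`. The following sketch computes one such coefficient by hand; the function name and the two toy columns are hypothetical and chosen so that the columns are perfectly anti-correlated.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical toy columns: every active member stayed, every inactive member left.
is_active = [1, 1, 0, 0, 1, 0]
exited = [0, 0, 1, 1, 0, 1]
r = pearson(is_active, exited)
print(round(r, 3))
```

Values near +1 or -1 indicate a strong linear relationship with the target, while values near 0 indicate that a feature alone says little about churn in a linear sense.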

<h2>Visualizing the relationship between churn ("Exited") and each individual feature</h2>

In [63]:
fig, axes = plt.subplots(4, 3,figsize=(25,15))

df_cols_name = list(data_encoded.columns)
df_cols_name.pop()

inc = 0
for i in range(4):
    for j in range(3):
        sns.countplot(x='Exited',data=data_encoded,hue=df_cols_name[inc],palette="pastel",ax=axes[i,j]).legend(fontsize=7,loc='upper right').set_title(df_cols_name[inc],prop={'size':9})
        inc+=1

<h1>Machine Learning Phase</h1>

<h2>Splitting the data into feature set and target variable</h2>

In [135]:
### features and target comprising of bins data
y = data_encoded['Exited']
X = data_encoded.drop(columns=['Exited'])
print(X.shape,y.shape)

### features and target comprising of normalized data
y_norm = df_normal['Exited']
X_norm = df_normal.drop(columns=['Exited'])
print(X_norm.shape,y_norm.shape)

<h2>I used 5-fold cross-validation (cv=5)</h2>
<h2>The evaluation metrics used are:</h2>
    <h3>accuracy</h3>
    <h3>precision</h3>
    <h3>precision_macro</h3>
    <h3>recall</h3>
    <h3>recall_macro</h3>
    <h3>roc_auc</h3>
    <h3>f1</h3>
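
Most of the metrics listed above follow directly from the four confusion-matrix counts. The sketch below is a plain-Python illustration with hypothetical names and toy labels, not scikit-learn's implementation; `roc_auc` is left out because it needs predicted scores rather than hard labels.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, f1 and the macro averages for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = safe_div(tp, tp + fp)       # of the predicted churners, how many churned
    recall = safe_div(tp, tp + fn)          # of the actual churners, how many were found
    precision_neg = safe_div(tn, tn + fn)   # the same two quantities for the negative class
    recall_neg = safe_div(tn, tn + fp)
    return {
        'accuracy': safe_div(tp + tn, len(pairs)),
        'precision': precision,
        'recall': recall,
        'f1': safe_div(2 * precision * recall, precision + recall),
        'precision_macro': (precision + precision_neg) / 2,
        'recall_macro': (recall + recall_neg) / 2,
    }


# Hypothetical toy labels: 4 churners out of 10, of which the model finds only 2.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, ':', round(value, 4))
```

In this toy case accuracy is 0.8 while recall is only 0.5, which shows why accuracy alone can look good on an imbalanced churn dataset even when many churners are missed.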
        
<h2>The machine learning algorithms used are:</h2>
    <h3>K-Nearest Neighbours</h3>
    <h3>Decision Tree</h3>
    <h3>Random Forest</h3>
    <h3>Logistic Regression</h3>
    <h3>Gaussian Naïve Bayes</h3>
    <h3>Gradient Boosting (scikit-learn's GradientBoostingClassifier)</h3>
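
The 5-fold cross-validation mentioned above splits the rows into five parts and uses each part once as the test set. The sketch below shows the idea with consecutive, unshuffled folds; the function name is hypothetical. For classifiers, scikit-learn's `cross_validate` with an integer `cv` uses stratified folds, which additionally preserve the class ratio in each fold, so this plain version is a simplification.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)   # the first folds absorb the remainder
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# 9859 is the number of observations quoted in the binning section.
folds = list(k_fold_indices(9859, n_splits=5))
print([len(test) for _, test in folds])   # prints: [1972, 1972, 1972, 1972, 1971]
```

Every row is used for testing exactly once, and the reported score for each metric is the mean over the five folds.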
        

    

In [88]:
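### all the available scoring options

# sklearn.metrics.SCORERS was removed in recent scikit-learn releases; get_scorer_names() lists the same names
# from sklearn.metrics import get_scorer_names
# get_scorer_names()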

In [136]:
### K-Nearest Neighbours
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate
# from sklearn.metrics import recall_score
# from sklearn.metrics import classification_report
# from sklearn.metrics import confusion_matrix
# from sklearn.metrics import roc_auc_score

# for n in range (10,16):
#     neigh = KNeighborsClassifier(n_neighbors=n)
#     scores = cross_val_score(neigh, X, y,cv=5)
#     print(scores)
#     print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

neigh = KNeighborsClassifier(n_neighbors=12)
# accuracy = cross_val_score(neigh, X, y,cv=5)
# print(accuracy)
# print("%0.2f accuracy with a standard deviation of %0.2f" % (accuracy.mean(), accuracy.std()))

scoring = ['accuracy','precision','precision_macro','recall','recall_macro','roc_auc','f1']
scores = cross_validate(neigh, X, y, scoring=scoring)   ### y is already 1-D, so no ravel() is needed
scoreKeys = sorted(scores.keys())
print('DATA SET WITH BINNING AND NO NORMALIZATION')
for key in scoreKeys:
    print(key,':',scores[key].mean())

print('---------------------------------------------------------------------------')
scores = cross_validate(neigh, X_norm, y_norm, scoring=scoring)
scoreKeys = sorted(scores.keys())
print('DATA SET WITH NO BINNING AND NORMALIZATION')
for key in scoreKeys:
    print(key,':',scores[key].mean())

In [137]:
### Decision Tree

from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier(random_state=0)

scoring = ['accuracy','precision','precision_macro','recall','recall_macro','roc_auc','f1']
scores = cross_validate(DT, X, y, scoring=scoring)   ### y is already 1-D, so no ravel() is needed
scoreKeys = sorted(scores.keys())
print('DATA SET WITH BINNING AND NO NORMALIZATION')
for key in scoreKeys:
    print(key,':',scores[key].mean())
    
print('---------------------------------------------------------------------------')
scores = cross_validate(DT, X_norm, y_norm, scoring=scoring)
scoreKeys = sorted(scores.keys())
print('DATA SET WITH NO BINNING AND NORMALIZATION')
for key in scoreKeys:
    print(key,':',scores[key].mean())

In [138]:
### random forest
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(max_depth=8, random_state=0)
scoring = ['accuracy','precision','precision_macro','recall','recall_macro','roc_auc','f1']
scores = cross_validate(RF, X, y, scoring=scoring)   ### y is already 1-D, so no ravel() is needed
scoreKeys = sorted(scores.keys())
print('DATA SET WITH BINNING AND NO NORMALIZATION')
for key in scoreKeys:
    print(key,':',scores[key].mean())

print('---------------------------------------------------------------------------')
scores = cross_validate(RF, X_norm, y_norm, scoring=scoring)
scoreKeys = sorted(scores.keys())
print('DATA SET WITH NO BINNING AND NORMALIZATION')
for key in scoreKeys:
    print(key,':',scores[key].mean())

In [139]:
### logistic regression

from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(random_state=0,solver='saga',max_iter=500)
scoring = ['accuracy','precision','precision_macro','recall','recall_macro','roc_auc','f1']
scores = cross_validate(LR, X, y, scoring=scoring)   ### y is already 1-D, so no ravel() is needed
scoreKeys = sorted(scores.keys())
print('DATA SET WITH BINNING AND NO NORMALIZATION')
for key in scoreKeys:
    print(key,':',scores[key].mean())

print('---------------------------------------------------------------------------')
scores = cross_validate(LR, X_norm, y_norm, scoring=scoring)
scoreKeys = sorted(scores.keys())
print('DATA SET WITH NO BINNING AND NORMALIZATION')
for key in scoreKeys:
    print(key,':',scores[key].mean())

In [140]:
### Gaussian NB
from sklearn.naive_bayes import GaussianNB

GNB = GaussianNB()
scoring = ['accuracy','precision','precision_macro','recall','recall_macro','roc_auc','f1']
scores = cross_validate(GNB, X, y, scoring=scoring)   ### y is already 1-D, so no ravel() is needed
scoreKeys = sorted(scores.keys())
print('DATA SET WITH BINNING AND NO NORMALIZATION')
for key in scoreKeys:
    print(key,':',scores[key].mean())

    
print('---------------------------------------------------------------------------')
scores = cross_validate(GNB, X_norm, y_norm, scoring=scoring)
scoreKeys = sorted(scores.keys())
print('DATA SET WITH NO BINNING AND NORMALIZATION')
for key in scoreKeys:
    print(key,':',scores[key].mean())

In [141]:
### Gradient Boosting (scikit-learn's GradientBoostingClassifier, not the xgboost library)
from sklearn.ensemble import GradientBoostingClassifier

XGB = GradientBoostingClassifier(n_estimators=100, learning_rate=1.5,max_depth=3, random_state=0)
scoring = ['accuracy','precision','precision_macro','recall','recall_macro','roc_auc','f1']
scores = cross_validate(XGB, X, y, scoring=scoring)
scoreKeys = sorted(scores.keys())
print('DATA SET WITH BINNING AND NO NORMALIZATION')
for key in scoreKeys:
    print(key,':',scores[key].mean())

print('---------------------------------------------------------------------------')
scores = cross_validate(XGB, X_norm, y_norm, scoring=scoring)
scoreKeys = sorted(scores.keys())
print('DATA SET WITH NO BINNING AND NORMALIZATION')
for key in scoreKeys:
    print(key,':',scores[key].mean())

<h1>Of the models above, the best performer was Random Forest</h1>
    <h2>It was evaluated once on the binned, non-normalized dataset</h2>
    <h2>and once on the normalized, non-binned dataset</h2>
    <h2>It yielded the following scores:</h2>
    <h3>DATA SET WITH BINNING AND NO NORMALIZATION</h3>
    <h3>test_accuracy : 0.855562553578614</h3>
    <h3>test_f1 : 0.5242486500790025</h3>
    <h3>test_precision : 0.804694724619009</h3>
    <h3>test_precision_macro : 0.8329500236088124</h3>
    <h3>test_recall : 0.388999582340368</h3>
    <h3>test_recall_macro : 0.6823200913743958</h3>
    <h3>test_roc_auc : 0.8550440033383303</h3>
    <h3>---------------------------------------------------------------------------</h3>
    <h3>DATA SET WITH NO BINNING AND NORMALIZATION</h3>
    <h3>test_accuracy : 0.8620540947182421</h3>
    <h3>test_f1 : 0.5570555332233829</h3>
    <h3>test_precision : 0.8131918112086078</h3>
    <h3>test_precision_macro : 0.8406216257238357</h3>
    <h3>test_recall : 0.4241861779230033</h3>
    <h3>test_recall_macro : 0.6994670418887529</h3>
    <h3>test_roc_auc : 0.8596263124908884</h3>