# Churn for Bank Customers Analysis  
M.Shumskiy 31/12/2020  
## Abstract  
In this project I analyze customer retention, identify the parameters most relevant to whether customers stay, implement several machine learning classification algorithms, and compare them to determine which performs the classification best.  
The approach applies to any enterprise concerned with client retention.  
It can also help plan and target marketing campaigns to attract new customers.  
Both the analysis and the code are presented in this work.

## 1. Introduction  
Client retention is of great importance to many enterprises, because it improves a company's performance in the market, which in turn is reflected in higher revenue and overall growth.  
To extract insights on the topic, data from existing and former customers must be analyzed with the proper tools to determine the parameters most relevant to their decision to leave or stay.  
Since every successful enterprise must look towards the future, machine learning algorithms are implemented to provide predictive power on which the company can rely.
 
## 2. Methodology 
The method of analysis consists of the following phases:  
 1. Data Cleaning,  
 2. Exploratory Data Analysis,  
 3. Machine Learning Classification Algorithms implementation and performance comparison,  
 4. Implementation of the most accurate Classification Algorithm.  

### 2.1. The Data  
The data analyzed is the Kaggle dataset Churn for Bank Customers by Mehmet A. It looks like this:

In [None]:
import pandas as pd
file_path=r'C:\Users\Pc\Desktop\Data Science\Projects\Churn for Bank Customers\churn.csv'
data=pd.read_csv(file_path)
print(data.shape)
data.head()

As you can see, the bank customer data consists of 10000 customer records spread over 14 columns (the first being redundant).  
### 2.2. Data Cleaning
First, I checked whether there were any NaN values.

In [None]:
data.isna().sum()

As it is observed, there are none. So now, some redundant columns need to be discarded.

In [None]:
data.drop(columns=['RowNumber','CustomerId','Surname'],inplace=True)
data_3=data.copy() # copy

Now let's look at the ratio between clients who exited and clients who stayed.

In [None]:
import matplotlib.pyplot as plt

churn_pos=len(data[data['Exited']==1])
churn_neg=len(data[data['Exited']==0])
ratio=(churn_pos/churn_neg)
print('The ratio between the positive and negative outcomes of the variable Exited is ',round(ratio,3))

data['Exited'].hist()
plt.title('Comparison between churn values')
plt.xlabel('Churn')
plt.ylabel('Number of occurrences')

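The ratio of customers who left to customers who stayed is about 0.256, which corresponds to roughly 20.4 % of all customers leaving the bank (0.256 / 1.256).  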
This ratio will serve as one of the criteria to determine parameter significance later on.  
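
Because `ratio` in the cell above divides exits by non-exits, it is an odds-style ratio rather than a share of all customers. A minimal sketch of the conversion, using only the rounded ratio reported in the text (the helper name `ratio_to_share` is illustrative and not part of the original notebook):

```python
def ratio_to_share(ratio):
    """Convert an exited/stayed ratio into the share of all customers who exited."""
    # If r = exited / stayed, then exited / (exited + stayed) = r / (1 + r).
    return ratio / (1 + ratio)

share = ratio_to_share(0.256)  # 0.256 is the rounded ratio reported in the text
print(round(share * 100, 1))   # share of customers who left, in percent
```

With the reported ratio this gives about 20.4 %, which is the figure to compare against when reading the per-category bar charts.
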
Next, columns such as **Credit Score**, **Balance** and **Estimated Salary** must be categorized. To choose the categories, let's examine the following graphs.

In [None]:
column=['CreditScore', 'Balance', 'EstimatedSalary']
for col in column:
    plt.title('Overview of {}'.format(col))
    data.boxplot(column=[col])
    data.hist(column=[col],bins=20)
    plt.show()

From the analysis of these graphs, the categorization of the parameters will follow this method: [x0 ,xf ,dx], where:  
 - x0 represents the lowest value of the parameter,  
 - xf represents the highest value of the parameter,  
 - dx represents the increment, i.e., the width of each category.  
 
And the parameters will be categorized following these values:  
 - Credit Score: [350, 850, 25],  
 - Balance: [0, 260000, 15000],  
 - Estimated Salary: [0, 200000, 25000]  
 
But first, the following functions will be created:  
 - **bin_creator** : creates the slices,  
 - **label_creator** : creates the labels for the slices.  
 
And the values of **Balance** and **Estimated Salary** will be rounded.

In [None]:
def bin_creator(x0,xf,dx):
    bins=[x0]
    n_bins=abs(xf-x0)//dx
    i=1
    while i<=n_bins:
        bins.append(bins[i-1]+dx)
        i+=1
    return bins
def label_creator(x0,xf,dx):
    labels=[]
    n_bins=abs(xf-x0)//dx
    i=1
    while i<=n_bins:
        labels.append('{},{}'.format(str(x0+(i-1)*dx),str(x0+i*dx)))
        i+=1
    return labels
data.Balance=data.Balance.round(0)
data.EstimatedSalary=data.EstimatedSalary.round(0)

Now let's categorize the data.

In [None]:
values=[[350,850,25],[0,260000,15000],[0,200000,25000]]
columns=['CreditScore','Balance','EstimatedSalary']
i=0
for column in columns:
    category=pd.cut(data[column],bins=bin_creator(values[i][0],values[i][1],values[i][2]),labels=label_creator(values[i][0],values[i][1],values[i][2]))
    data.insert(11+i,'{} cat'.format(column),category)
    i+=1
# where Balance is 0 the category is NaN, because the first bin (0, 15000] excludes 0
data['Balance cat']=data['Balance cat'].fillna('0,15000')
data.drop(columns=['CreditScore','Balance','EstimatedSalary'],inplace=True)
data.head()

Now the dataframe is easier to analyze.  
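
The [x0, xf, dx] scheme and the NaN category for zero balances can be reproduced on a handful of values. The sketch below is illustrative only: `make_edges` is a hypothetical compact equivalent of `bin_creator`, and the sample balances are invented. It shows that `pd.cut` builds right-closed intervals by default, so a value equal to the lowest edge falls outside the first bin unless `include_lowest=True` is passed.

```python
import pandas as pd

def make_edges(x0, xf, dx):
    """Return evenly spaced bin edges starting at x0 with width dx, not exceeding xf."""
    n_bins = abs(xf - x0) // dx
    return [x0 + k * dx for k in range(n_bins + 1)]

edges = make_edges(0, 260000, 15000)

# Invented sample balances, chosen to sit on and around bin edges.
balances = pd.Series([0, 1, 15000, 15001, 250000])

# Default: intervals are right-closed, e.g. (0, 15000], so the value 0 is left out.
default_cut = pd.cut(balances, bins=edges)

# include_lowest=True also places the lowest edge in the first bin.
inclusive_cut = pd.cut(balances, bins=edges, include_lowest=True)

print(default_cut.isna().tolist())
print(inclusive_cut.isna().tolist())
```

Filling the missing category afterwards, as done in the notebook, and passing `include_lowest=True` are two routes to the same assignment of zero balances to the first bin.
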
### 2.3. Exploratory Data Analysis  
Let's take a look at the information that lies in the dataframe.

In [None]:
import seaborn as sns

for col in ['Geography', 'Gender', 'Age', 'Tenure',
       'NumOfProducts', 'HasCrCard', 'IsActiveMember','CreditScore cat','Balance cat','EstimatedSalary cat']:
    # one grouped bar chart per parameter: stays in green, exits in red
    pd.crosstab(data[col],data['Exited']).plot(kind='bar',color=['g','r'],figsize=(10, 6))
    plt.title('Client exits in relation to {}'.format(col))
    plt.xlabel(col)
    plt.ylabel('Number of occurrences in Exited')
    plt.show()

From the graphs we can see the following:  
 - Geography: French customers tend to stay with the bank,  
 - Gender: males tend to stay with the bank,  
 - Age: there seem to be 2 distributions, apparently Poisson distributions with 2 different average values, one for the clients who stayed and one for those who left,  
 - Tenure: does not seem to be related to exits,  
 - NumOfProducts: people with 2 products tend to stay,  
 - HasCrCard: having a credit card does not seem to affect exits,  
 - IsActiveMember: active members seem to stay, but the difference is small,  
 - CreditScore cat: credit score does not seem to affect exits (all categories follow the overall ratio between exits and stays),  
 - Balance cat: balance does not seem to affect exits (all categories follow the overall ratio between exits and stays),  
 - EstimatedSalary cat: estimated salary does not seem to affect exits (all categories follow the overall ratio between exits and stays).  
 
With these points in mind, we can predict that the most relevant factors in the decision to stay or leave will be:  
 - Age,  
 - Activity,  
 - Gender,  
 - Geography.  
 
To calculate correlations, the string values must be converted to integers. To do so, the dataframe must be prepared:  
 - the categorical columns (Balance cat, CreditScore cat, EstimatedSalary cat) must be converted to integers,  
 - any NaN values must be dealt with.

In [None]:
data_4=data_3.copy() # work on a copy so that data_3 keeps only the original columns
def bin_creator(x0,xf,dx):
    bins=[x0]
    n_bins=abs(xf-x0)//dx
    i=1
    while i<=n_bins:
        bins.append(bins[i-1]+dx)
        i+=1
    return bins
def label_creator(x0,xf,dx):
    labels=[]
    n_bins=abs(xf-x0)//dx
    i=1
    while i<=n_bins:
        labels.append(str(i-1)) #here is the change. This gives an integer instead of interval
        i+=1
    return labels
values=[[350,850,25],[0,260000,15000],[0,200000,25000]]
columns=['CreditScore','Balance','EstimatedSalary']
i=0
for column in columns:
    category=pd.cut(data_3[column],bins=bin_creator(values[i][0],values[i][1],values[i][2]),labels=label_creator(values[i][0],values[i][1],values[i][2]))
    data_4.insert(11+i,'{} cat'.format(column),category)
    i+=1
    
# values equal to the lowest edge fall outside the first right-closed bin, so assign them to bin 0
data_4['Balance cat']=data_4['Balance cat'].fillna('0')
data_4['CreditScore cat']=data_4['CreditScore cat'].fillna('0')

# encode the string columns as integers
data_map_geo={'France':0,'Germany':1,'Spain':2}
data_map_gender={'Female':0,'Male':1}
data_4['Geography']=data_4['Geography'].map(data_map_geo)
data_4['Gender']=data_4['Gender'].map(data_map_gender)

data_5=data_4[['Geography', 'Gender', 'Age', 'Tenure',
       'NumOfProducts', 'HasCrCard', 'IsActiveMember','CreditScore cat', 'Balance cat', 'EstimatedSalary cat','Exited']]
data_5=data_5.astype({'Balance cat': 'int64','CreditScore cat':'int64','EstimatedSalary cat':'int64'})

data_5.head()

Now the dataframe is suited for correlation calculation.
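
The integer encoding of the binned columns can also be obtained directly from `pd.cut`. The following is a minimal sketch on invented credit scores, not the notebook's own code: `labels=False` returns the integer index of each bin, and the `.cat.codes` accessor gives the same indices from an existing categorical.

```python
import pandas as pd

# Invented sample scores, chosen to sit on and around bin edges.
scores = pd.Series([351, 375, 376, 600, 850])
edges = list(range(350, 851, 25))  # 350, 375, ..., 850

# labels=False makes pd.cut return the integer index of each bin.
codes = pd.cut(scores, bins=edges, labels=False)

# Equivalent route: build the categorical first, then read its integer codes.
codes_from_cat = pd.cut(scores, bins=edges).cat.codes

print(codes.tolist())
print(codes_from_cat.tolist())
```

The two routes differ for values outside every bin: `labels=False` yields NaN, whereas `.cat.codes` yields -1, so missing values still need explicit handling either way.
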

In [None]:
# create a dataframe only for the parameters
data_6=data_5[['Geography', 'Gender', 'Age', 'Tenure',
       'NumOfProducts', 'HasCrCard', 'IsActiveMember','CreditScore cat', 'Balance cat', 'EstimatedSalary cat']]

In [None]:
abscorwithdep=[]
for var in data_6.columns:
    # absolute Pearson correlation between each parameter and Exited
    abscorwithdep.append(abs(data_5['Exited'].corr(data_6[var])))
    
parameters=data_6.columns.to_list()
corr_table={'parameters':parameters,
           'corr':abscorwithdep}

corr_table_df=pd.DataFrame.from_dict(corr_table)
corr_table_df.sort_values('corr', ascending=False,inplace=True)

sns.barplot(x='corr',y='parameters',data=corr_table_df)
plt.title('Absolute Correlation between parameters and Exits')
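
The per-column loop above can also be written with `DataFrame.corrwith`, which correlates every column with a single Series. The sketch below uses a small invented table (not the bank data) purely to show the call; the column names are reused for readability.

```python
import pandas as pd

# Toy data, invented for illustration only; not taken from the bank dataset.
toy = pd.DataFrame({
    'Age':            [25, 40, 55, 30, 60, 45],
    'IsActiveMember': [1, 0, 0, 1, 0, 1],
    'Exited':         [0, 1, 1, 0, 1, 0],
})

features = toy.drop(columns='Exited')

# corrwith computes the Pearson correlation of every column with one Series.
abs_corr = features.corrwith(toy['Exited']).abs().sort_values(ascending=False)
print(abs_corr)
```

Taking the absolute value, as the notebook does, ranks parameters by strength of association but hides the direction, so the sign is worth checking separately when interpreting a parameter.
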

In [None]:
plt.figure()
correlation_matrix = data_6.corr().abs()
sns.heatmap(correlation_matrix,cmap='Blues')
plt.title('Correlation matrix between parameters',fontsize=15)

There appear to be no strong cross-correlations between parameters, so no correction is needed here.  
Given the small share of clients that left the bank, I apply SMOTE (Synthetic Minority Oversampling Technique) to correct the class imbalance.

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X=data_6
Y=data['Exited']

smote=SMOTE(random_state=0)
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.25,random_state=0)
columns=X_train.columns

# oversample the training split only, so the test set keeps the original class balance
# fit_resample is the current method name; fit_sample was removed from imbalanced-learn
os_data_X,os_data_Y=smote.fit_resample(X_train,Y_train)
os_data_X=pd.DataFrame(data=os_data_X,columns=columns)
os_data_Y=pd.DataFrame(data=os_data_Y,columns=['Exited'])

print('length of oversampled data is',len(os_data_X))
print('length of exits=0 in oversampled data',len(os_data_Y[os_data_Y['Exited']==0]))
print('proportion of exits=0 data in oversampled data is ',len(os_data_Y[os_data_Y['Exited']==0])/len(os_data_X))
print('proportion of exits=1 data in oversampled data is ',len(os_data_Y[os_data_Y['Exited']==1])/len(os_data_X))

The data is now clean and balanced with respect to exits, ready for the machine learning algorithms.  
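
The idea behind SMOTE can be illustrated without the library: each synthetic sample is placed on the line segment between a minority sample and one of its minority neighbours. The sketch below is a simplified illustration under stated assumptions, not the imbalanced-learn implementation: it always uses the single nearest neighbour (the library chooses among several), and the function name `interpolate_minority` and the sample points are invented.

```python
import numpy as np

def interpolate_minority(X_min, n_new, seed=0):
    """Create n_new synthetic points between minority samples and their nearest neighbour."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to every minority sample; exclude the sample itself.
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        dist[i] = np.inf
        j = np.argmin(dist)
        # The new point lies on the segment between sample i and its neighbour j.
        lam = rng.random()
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Invented minority-class points in two dimensions.
X_minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]]
new_points = interpolate_minority(X_minority, n_new=5)
print(new_points.shape)
```

Because the synthetic points are interpolated, integer-coded columns can take non-integer values after oversampling, which is worth keeping in mind when inspecting the oversampled dataframe.
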

### 2.4. Machine Learning Classification Algorithms implementation and performance comparison  
The classification algorithms to be used are the following:  
 - Naive Bayes,  
 - Random Forest Classifier,   
 - Logistic Regression Classifier,  
 - K Nearest Neighbors.

#### 2.4.1. Naive Bayes

In [None]:
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(os_data_X, os_data_Y['Exited']) # pass the target as a 1-D Series
y_pred=nb.predict(X_test)

print(classification_report(Y_test,y_pred))

#### 2.4.2. Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf=RandomForestClassifier(n_estimators=100)
clf.fit(os_data_X, os_data_Y['Exited']) # pass the target as a 1-D Series
y_pred=clf.predict(X_test)

print(classification_report(Y_test,y_pred))

#### 2.4.3. Logistic Regression Classifier

In [None]:
from sklearn.linear_model import LogisticRegression

logreg=LogisticRegression()
logreg.fit(os_data_X,os_data_Y['Exited']) # pass the target as a 1-D Series
y_pred=logreg.predict(X_test)

print(classification_report(Y_test,y_pred))

#### 2.4.4. K Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=2).fit(os_data_X, os_data_Y['Exited']) # pass the target as a 1-D Series
y_pred=knn.predict(X_test)

print(classification_report(Y_test,y_pred))

#### 2.4.5. Comparison of the models

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# The AUC is computed from the predicted probabilities so that it matches the plotted curve.
# Naive Bayes
nb_proba=nb.predict_proba(X_test)[:,1]
nb_roc_auc=roc_auc_score(Y_test,nb_proba)
fpr_nb,tpr_nb,thresholds_nb=roc_curve(Y_test,nb_proba)

# Random Forest
rf_proba=clf.predict_proba(X_test)[:,1]
rf_roc_auc=roc_auc_score(Y_test,rf_proba)
fpr_rf,tpr_rf,thresholds_rf=roc_curve(Y_test,rf_proba)

# Logistic Regression
logit_proba=logreg.predict_proba(X_test)[:,1]
logit_roc_auc=roc_auc_score(Y_test,logit_proba)
fpr,tpr,thresholds=roc_curve(Y_test,logit_proba)

# K Nearest Neighbors
knn_proba=knn.predict_proba(X_test)[:,1]
knn_roc_auc=roc_auc_score(Y_test,knn_proba)
fpr_knn,tpr_knn,thresholds_knn=roc_curve(Y_test,knn_proba)

plt.figure(figsize=(10, 6))
plt.plot(fpr_nb,tpr_nb,label='Naive Bayes (area=%0.2f)' % nb_roc_auc)
plt.plot(fpr_rf,tpr_rf,label='Random Forest Classification (area=%0.2f)' % rf_roc_auc)
plt.plot(fpr,tpr,label='Logistic Regression (area=%0.2f)' % logit_roc_auc)
plt.plot(fpr_knn,tpr_knn,label='K Nearest Neighbors (area=%0.2f)' % knn_roc_auc)
plt.plot([0,1],[0,1],'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Comparison')
plt.legend(loc='lower right')
plt.show()

The **Random Forest Classifier** outperforms the other models, so it is the model selected.
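
When reading the legend of a ROC plot, it helps to know that an AUC computed from hard 0/1 predictions generally differs from one computed from predicted probabilities. The sketch below, on invented scores, computes the AUC by hand as the probability that a positive example is ranked above a negative one; the function name `auc_from_scores` is illustrative and the code is not a replacement for `roc_auc_score`.

```python
def auc_from_scores(y_true, scores):
    """AUC as the probability that a positive example outranks a negative one."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 0, 1]
proba = [0.1, 0.4, 0.45, 0.8, 0.6, 0.9]         # invented probability scores
labels = [1 if p >= 0.5 else 0 for p in proba]  # hard predictions at a 0.5 threshold

auc_proba = auc_from_scores(y_true, proba)
auc_labels = auc_from_scores(y_true, labels)
print(round(auc_proba, 3), round(auc_labels, 3))
```

In this toy case the probability-based AUC is 8/9 while the label-based one is 6/9, because thresholding discards the ranking information within each predicted class.
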

## 3. Results  
From the Exploratory Data Analysis, the correlation between parameters and exits is displayed in the following graph:

In [None]:
sns.barplot(x='corr',y='parameters',data=corr_table_df)
plt.title('Absolute Correlation between parameters and Exits')

The performance of the Machine Learning algorithms is displayed in the following graph:

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(fpr_nb,tpr_nb,label='Naive Bayes (area=%0.2f)' % nb_roc_auc)
plt.plot(fpr_rf,tpr_rf,label='Random Forest Classification (area=%0.2f)' % rf_roc_auc)
plt.plot(fpr,tpr,label='Logistic Regression (area=%0.2f)' % logit_roc_auc)
plt.plot(fpr_knn,tpr_knn,label='K Nearest Neighbors (area=%0.2f)' % knn_roc_auc)
plt.plot([0,1],[0,1],'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Comparison')
plt.legend(loc='lower right')
plt.show()

## 4. Conclusion  
This project set out to determine the most relevant parameters in client exits and to implement the best-performing Machine Learning model.  
The parameters most relevant to a client's decision to leave or stay are, in order:  
 1. Age: younger clients tend to stay, and the age distributions are worth examining in the future,  
 2. Level of Activity: more active members tend to stay,  
 3. Account Balance: clients with lower balances tend to stay,  
 4. Gender: men tend to stay more than women.  
 
As for the Machine Learning model, the **Random Forest Classifier** performed best and is therefore the model to implement.  
With this information the bank can determine its focus points for retaining clients and direct its marketing campaigns to acquire new ones. This work can easily be applied to other enterprises.