# **BANKING CHURN ANALYSIS**
___

Churn measures how many customers stop using a product or service. It is common across businesses, and clients may churn for different reasons, such as better offers from competitors, dissatisfaction with the product or service, or a change in personal circumstances.

A business needs to keep churn as low as possible and retain its clients. To do so, it must understand why clients are leaving so that it can take the steps needed to reduce churn.

In this project, we will conduct a churn analysis for a bank to address the ongoing challenge of customer attrition, which results in financial losses and diminished customer satisfaction. Understanding the factors that drive a client's decision to leave the bank will provide valuable insights to enhance retention strategies.

The data was acquired from Kaggle.com.

Objectives:

1. Conduct a comprehensive churn analysis for the bank.
2. Develop a machine learning model to predict customer churn and identify key factors contributing to it.
3. Provide actionable insights and recommend strategies to minimize churn rates and enhance client retention.


In [1]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, roc_auc_score
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Loading the dataset 
DATA = pd.read_csv( 'Churn_Modelling.csv' )

display( DATA.head() )

print( 'Number of Rows' , DATA.shape[0] )
print( 'Number of Columns' , DATA.shape[1] )
print( 'Column Names : ' , DATA.columns.to_list())

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


Number of Rows 10000
Number of Columns 14
Column Names :  ['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited']


From the output, we can see that the DataFrame has 10,000 rows and 14 columns. The columns are:

1. RowNumber : The row number of each customer record.

2. CustomerId : A unique identifier for each customer.

3. Surname : The customer's surname.

4. CreditScore : A numerical estimate of the customer's creditworthiness, such as how likely they are to pay a loan back on time (Numerical).

5. Geography : The location where the customer resides.

6. Gender : The customer's gender (Binary).

7. Age : The customer's age (Numerical).

8. Tenure : The number of years the customer has been associated with the bank.

9. Balance : The amount of money currently in the customer's account (Numerical).

10. NumOfProducts : The number of bank products, such as accounts, that the customer holds (Numerical).

11. HasCrCard : Whether the customer has (1) or does not have (0) a credit card (Binary).

12. IsActiveMember : Whether the customer is (1) or is not (0) an active member of the bank (Binary).

13. EstimatedSalary : An approximation of the customer's income level.

14. Exited : Whether the customer has churned (1) or has not churned (0) (Binary).
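Because `Exited` is coded as 0 and 1, its mean gives the churn rate directly. The sketch below illustrates this using only the `Exited` values of the five preview rows printed by `DATA.head()`; the name `preview` is illustrative, and the resulting figure describes those five rows, not the full dataset.

```python
import pandas as pd

# Exited values of the five preview rows shown above (not the full dataset).
preview = pd.DataFrame({"Exited": [1, 0, 1, 0, 0]})

# With a 0/1 flag, the mean equals the share of customers who churned.
churn_rate = preview["Exited"].mean()
churned_count = int(preview["Exited"].sum())

print(f"Churned customers: {churned_count} of {len(preview)}")
print(f"Churn rate: {churn_rate:.0%}")
```

The same two lines applied to `DATA["Exited"]` would give the overall churn rate for the bank.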

Next, we explore the data.

In [3]:
# Counting the missing values in each column
DATA.isnull().sum().to_frame().rename( columns = { 0 : 'Total No. of Missing Values' } )

Unnamed: 0,Total No. of Missing Values
RowNumber,0
CustomerId,0
Surname,0
CreditScore,0
Geography,0
Gender,0
Age,0
Tenure,0
Balance,0
NumOfProducts,0


There are no missing values in the DataFrame.

In [4]:
# Getting the data type of each column
DATA.dtypes.to_frame().rename( columns = { 0 : 'Data Types' } )

Unnamed: 0,Data Types
RowNumber,int64
CustomerId,int64
Surname,object
CreditScore,int64
Geography,object
Gender,object
Age,int64
Tenure,int64
Balance,float64
NumOfProducts,int64


From the output, we can see that 11 columns hold numerical data and 3 hold non-numerical data.
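The split between numerical and non-numerical columns can also be computed rather than read off the table. A minimal sketch, assuming the five preview rows printed by `DATA.head()` stand in for the full file; `PREVIEW_CSV` and `preview` are illustrative names that do not appear in the notebook.

```python
import io

import pandas as pd

# The five preview rows printed by DATA.head(), used as a stand-in for the full file.
PREVIEW_CSV = """RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0
"""
preview = pd.read_csv(io.StringIO(PREVIEW_CSV))

# select_dtypes splits the columns by dtype family.
numeric_cols = preview.select_dtypes(include="number").columns.tolist()
non_numeric_cols = preview.select_dtypes(exclude="number").columns.tolist()

print(len(numeric_cols), "numeric columns")
print(len(non_numeric_cols), "non-numeric columns:", non_numeric_cols)
```

Replacing `preview` with `DATA` would run the same check on the full dataset.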

In [5]:
# Counting fully duplicated rows
DATA.duplicated().sum()

np.int64(0)

There are no duplicated rows in our DataFrame.
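Note that `DataFrame.duplicated()` compares entire rows by default, so a per-row counter such as `RowNumber` can hide a customer who was recorded twice. The sketch below uses a hypothetical toy frame (not the bank data) to show how the `subset` argument restricts the comparison to the identifier column.

```python
import pandas as pd

# Hypothetical toy frame: the same customer recorded twice under different row numbers.
toy = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [15634602, 15634602, 15647311],
    "Surname": ["Hargrave", "Hargrave", "Hill"],
})

# Whole-row comparison: RowNumber differs, so nothing is flagged.
full_row_duplicates = int(toy.duplicated().sum())

# Comparison on the identifier only: the repeated customer is flagged.
customer_duplicates = int(toy.duplicated(subset=["CustomerId"]).sum())

print("Full-row duplicates:", full_row_duplicates)
print("Duplicated CustomerId values:", customer_duplicates)
```

On the real data, `DATA.duplicated(subset=["CustomerId"]).sum()` would be the equivalent check; its result is not shown in this notebook.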

In [6]:
# Summary of the numeric columns of interest
DATA[['Age','CreditScore','Tenure','Balance','NumOfProducts','EstimatedSalary']].describe().T


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,10000.0,38.9218,10.487806,18.0,32.0,37.0,44.0,92.0
CreditScore,10000.0,650.5288,96.653299,350.0,584.0,652.0,718.0,850.0
Tenure,10000.0,5.0128,2.892174,0.0,3.0,5.0,7.0,10.0
Balance,10000.0,76485.889288,62397.405202,0.0,0.0,97198.54,127644.24,250898.09
NumOfProducts,10000.0,1.5302,0.581654,1.0,1.0,1.0,2.0,4.0
EstimatedSalary,10000.0,100090.239881,57510.492818,11.58,51002.11,100193.915,149388.2475,199992.48


In [7]:
# Summary of the non-numeric columns
DATA[['Surname','Geography','Gender']].describe().T

Unnamed: 0,count,unique,top,freq
Surname,10000,2932,Smith,32
Geography,10000,3,France,5014
Gender,10000,2,Male,5457


From the summary, we can see that:

- The minimum age is 18 while the maximum age is 92.

- The average credit score is about 650; the highest credit score is 850 and the lowest is 350.

- The shortest time a customer has been associated with the bank is 0 years, while the longest is 10 years.

- The average balance is 76,485.89, with a minimum balance of 0.00 and a maximum balance of 250,898.09.

- The maximum number of products a customer holds with the bank is 4, while the minimum is 1.

- The average estimated salary is 100,090.24, with a maximum of 199,992.48 and a minimum of 11.58.

We can also see that:

- There are three unique geographies, and the most common is France with 5,014 customers.

- There are more male than female customers: 5,457 customers are male.

Since our target variable is Exited, we will rename the column to Churned and recode its values from 1 to Yes and from 0 to No. We will also recode HasCrCard and IsActiveMember with descriptive labels.

In [8]:
# Renaming the target column
DATA.rename( columns = { 'Exited' : 'Churned'} , inplace = True ) 

In [9]:
# Recoding the binary flags as descriptive labels
DATA["Churned"] = DATA["Churned"].replace({0: "No", 1: "Yes"})
DATA["HasCrCard"] = DATA["HasCrCard"].replace({0: "Doesn't Have", 1: "Has"})
DATA["IsActiveMember"] = DATA["IsActiveMember"].replace({0: "Not Active", 1: "Active"})

DATA.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Churned
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,Has,Active,101348.88,Yes
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,Doesn't Have,Active,112542.58,No
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,Has,Not Active,113931.57,Yes
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,Doesn't Have,Not Active,93826.63,No
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,Has,Active,79084.1,No
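One property of `replace` is that any value missing from the mapping is left unchanged, so an unexpected code would pass through silently. An alternative, shown here as a sketch rather than as the notebook's method, is `Series.map`, which turns unmapped values into `NaN` and so makes them easy to find. The `flags` series and the stray value `2` are hypothetical.

```python
import pandas as pd

# Hypothetical flag column; the value 2 stands in for an unexpected code.
flags = pd.Series([1, 0, 1, 2], name="Exited")

# map() returns NaN for any value that is not a key of the dictionary.
labels = flags.map({0: "No", 1: "Yes"})

# Rows whose code had no label can then be inspected directly.
unexpected = flags[labels.isna()]

print(labels.tolist())
print("Unmapped codes:", unexpected.tolist())
```

For a column that is known to contain only 0 and 1, as the summary above suggests for this dataset, both approaches give the same labels.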


We will drop the RowNumber, CustomerId and Surname columns since they are not needed for the analysis.

In [10]:
# Dropping the identifier columns in a single call
DATA.drop(columns = ['RowNumber', 'CustomerId', 'Surname'], inplace = True)

In [11]:
# Saving the Cleaned Data 
DATA.to_csv('CLEANED DATA.csv', index=False)
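A quick way to confirm that the saved file matches the cleaned frame is to read it back and compare. The sketch below does this on a small hypothetical frame written to a temporary directory, so it does not touch the real `CLEANED DATA.csv`; the names `cleaned` and `reloaded` are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small hypothetical frame shaped like a slice of the cleaned data.
cleaned = pd.DataFrame({
    "CreditScore": [619, 608],
    "Geography": ["France", "Spain"],
    "HasCrCard": ["Has", "Doesn't Have"],
    "Churned": ["Yes", "No"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "CLEANED DATA.csv"
    # index=False keeps the row index out of the file.
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_columns = reloaded.columns.tolist() == cleaned.columns.tolist()
same_values = reloaded.values.tolist() == cleaned.values.tolist()

print("Columns preserved:", same_columns)
print("Values preserved:", same_values)
```

Without `index=False`, the reloaded frame would gain an extra `Unnamed: 0` column holding the old index.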