<a href="https://colab.research.google.com/github/MuindeEsther/Bank-churn-rate-classification/blob/main/Customer_Churn_Rate_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction
## Binary Classification with Bank Churn Dataset

The goal of this project is to predict whether a customer keeps their account or closes it (i.e., churns).

We will use ensembling techniques to improve our models and also fit a neural network.

Let's start by getting the relevant data.

In [7]:
# Make a directory named .kaggle
!mkdir /root/.kaggle

mkdir: cannot create directory ‘/root/.kaggle’: File exists


In [8]:
# copy the kaggle.json into this new directory
!cp kaggle.json /root/.kaggle/
# Allocate the required permission for this file
!chmod 600 /root/.kaggle/kaggle.json
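
The `mkdir` cell above reports an error when it is re-run because the directory already exists. As a minimal sketch, the same three steps (create the directory, copy the token, restrict permissions) can be made repeatable in Python. The helper name `install_kaggle_token` is hypothetical, and the demo writes a placeholder file into a temporary directory rather than touching real credentials.

```python
import shutil
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, kaggle_dir):
    """Copy a Kaggle API token into kaggle_dir, creating the directory if needed."""
    kaggle_dir = Path(kaggle_dir)
    # exist_ok=True makes the call safe to repeat, unlike a bare `mkdir`.
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    target = kaggle_dir / "kaggle.json"
    shutil.copy(token_path, target)
    # Restrict the token to owner read/write, equivalent to `chmod 600`.
    target.chmod(0o600)
    return target


# Demo in a throwaway directory; in Colab the target would be /root/.kaggle.
with tempfile.TemporaryDirectory() as tmp:
    token = Path(tmp) / "kaggle.json"
    token.write_text("{}")  # placeholder content, not a real credential
    first = install_kaggle_token(token, Path(tmp) / ".kaggle")
    second = install_kaggle_token(token, Path(tmp) / ".kaggle")  # re-run is harmless
    reran_cleanly = first == second and second.exists()
```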

In [9]:
# Download the dataset
# !kaggle datasets download islombekdavronov/creditscoring-data
!kaggle datasets download shubhammeshram579/bank-customer-churn-prediction

Downloading bank-customer-churn-prediction.zip to /content
100% 262k/262k [00:00<00:00, 623kB/s]
100% 262k/262k [00:00<00:00, 622kB/s]


In [10]:
!unzip bank-customer-churn-prediction.zip

Archive:  bank-customer-churn-prediction.zip
  inflating: Churn_Modelling.csv     


Loading necessary libraries

In [11]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


In [12]:
# load the data
churn_df = pd.read_csv("/content/Churn_Modelling.csv")
churn_df.head()

|   | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42.0 | 2 | 0.0 | 1 | 1.0 | 1.0 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41.0 | 1 | 83807.86 | 1 | 0.0 | 1.0 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42.0 | 8 | 159660.8 | 3 | 1.0 | 0.0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39.0 | 1 | 0.0 | 2 | 0.0 | 0.0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43.0 | 2 | 125510.82 | 1 | NaN | 1.0 | 79084.1 | 0 |


The dataset contains 14 columns: 13 candidate features and the target variable `Exited` for our binary classification model. Here's a brief overview of the columns:

RowNumber: The row number.

CustomerId: Unique identifiers for the customers.

Surname: The surname of the customers.

CreditScore: The credit score of the customers.

Geography: The country of the customers.

Gender: The gender of the customers.

Age: The age of the customers.

Tenure: Number of years the customer has been with the bank.

Balance: Bank balance of the customers.

NumOfProducts: Number of bank products the customers are using.

HasCrCard: Indicates whether the customer has a credit card (1) or not (0).

IsActiveMember: Indicates whether the customer is an active member (1) or not (0).

EstimatedSalary: The estimated salary of the customers.

Exited: Whether the customer exited (1) or stayed (0) with the bank.

### Data Cleaning
We will start by inspecting our data and handling missing values.

In [13]:
churn_df.shape

(10002, 14)

In [14]:
churn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10002 entries, 0 to 10001
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10002 non-null  int64  
 1   CustomerId       10002 non-null  int64  
 2   Surname          10002 non-null  object 
 3   CreditScore      10002 non-null  int64  
 4   Geography        10001 non-null  object 
 5   Gender           10002 non-null  object 
 6   Age              10001 non-null  float64
 7   Tenure           10002 non-null  int64  
 8   Balance          10002 non-null  float64
 9   NumOfProducts    10002 non-null  int64  
 10  HasCrCard        10001 non-null  float64
 11  IsActiveMember   10001 non-null  float64
 12  EstimatedSalary  10002 non-null  float64
 13  Exited           10002 non-null  int64  
dtypes: float64(5), int64(6), object(3)
memory usage: 1.1+ MB


In [15]:
# Count the missing values in each column
churn_df.isna().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          1
Gender             0
Age                1
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          1
IsActiveMember     1
EstimatedSalary    0
Exited             0
dtype: int64

In [16]:
# drop the missing values
churn_df = churn_df.dropna()


In [17]:
churn_df.isna().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

### Exploratory Data Analysis

In [18]:
# Basic statistics of the numeric columns
numeric_stats = churn_df.describe()

# Distribution of the target variable `Exited`
target_distribution = churn_df['Exited'].value_counts(normalize=True)

numeric_stats, target_distribution


(          RowNumber    CustomerId  CreditScore          Age       Tenure  \
 count   9998.000000  9.998000e+03  9998.000000  9998.000000  9998.000000   
 mean    5003.497499  1.569094e+07   650.529606    38.920287     5.013003   
 std     2886.321275  7.192399e+04    96.633003    10.487986     2.892152   
 min        1.000000  1.556570e+07   350.000000    18.000000     0.000000   
 25%     2504.250000  1.562854e+07   584.000000    32.000000     3.000000   
 50%     5003.500000  1.569073e+07   652.000000    37.000000     5.000000   
 75%     7502.750000  1.575323e+07   718.000000    44.000000     7.000000   
 max    10000.000000  1.581569e+07   850.000000    92.000000    10.000000   
 
              Balance  NumOfProducts    HasCrCard  IsActiveMember  \
 count    9998.000000    9998.000000  9998.000000     9998.000000   
 mean    76481.490819       1.530206     0.705541        0.514803   
 std     62393.187035       0.581669     0.455822        0.499806   
 min         0.000000       1

**Basic Statistics**

After dropping the rows with missing values, the dataset consists of 9,998 entries (down from 10,002).
The Age variable has a wide range, from 18 to 92 years.

The CreditScore, Balance, EstimatedSalary, and other numeric variables have varying scales, which might require normalization or standardization before modeling.

The Balance feature shows a significant difference between the 25th percentile (0) and the 50th percentile, suggesting a substantial number of customers with zero balance.

**Target Variable Distribution**

The target variable, Exited, indicates that approximately 20.38% of customers have exited. This shows an imbalance in the dataset that might need to be addressed during model training.


### Data Preprocessing
We will tackle preprocessing step by step:

**Handle Missing Values**: Impute or drop rows/columns with missing values. The rows with missing values were already dropped during cleaning, so the imputation below acts as a safeguard.

**Drop Irrelevant Features:** Remove features that are unlikely to be useful for prediction.

**Encode Categorical Variables:** Transform categorical variables into a format that can be provided to ML models.

**Feature Scaling:** Standardize or normalize numerical features so they're on the same scale.

**Address Class Imbalance:** Explore methods to deal with the imbalance in the target variable.


#### Handle Missing Values


In [19]:
# Impute missing values
# For numerical columns like 'Age', we'll use the median value for imputation.
# For categorical columns like 'Geography' and binary columns like 'HasCrCard', 'IsActiveMember', we'll use the mode.
# Note: rows with missing values were already dropped above, so this step is a
# safeguard and leaves the data unchanged.
fill_values = {
    'Age': churn_df['Age'].median(),
    'Geography': churn_df['Geography'].mode()[0],
    'HasCrCard': churn_df['HasCrCard'].mode()[0],
    'IsActiveMember': churn_df['IsActiveMember'].mode()[0],
}

# Fill on the DataFrame itself rather than with chained `inplace=True` calls,
# which operate on a copy and trigger pandas' SettingWithCopyWarning.
churn_df = churn_df.fillna(fill_values)

# Check if there are any missing values left
churn_df.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

No missing values remain in the dataset. Because the rows with missing values were already dropped during cleaning, the imputation step had nothing left to fill.
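
To see what median and mode imputation would do when gaps are still present, here is a minimal sketch on a toy frame. The values are made up and only the column names follow the dataset. Passing a dictionary to `DataFrame.fillna` fills every column in a single call.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names follow the dataset.
toy = pd.DataFrame({
    "Age": [42.0, np.nan, 39.0, 44.0],
    "Geography": ["France", "Spain", None, "France"],
    "HasCrCard": [1.0, 0.0, 1.0, np.nan],
})

# Median for the numeric column, mode for the categorical and binary columns.
fill_values = {
    "Age": toy["Age"].median(),
    "Geography": toy["Geography"].mode()[0],
    "HasCrCard": toy["HasCrCard"].mode()[0],
}

# A single fillna call on the frame returns a new, fully populated frame.
toy_imputed = toy.fillna(fill_values)
```

Whether to impute or drop is a judgment call. With only a handful of incomplete rows out of roughly ten thousand, either choice is unlikely to change the results much.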

#### Drop Irrelevant Features

Next, we'll drop features that are unlikely to influence the prediction. This includes `RowNumber`, `CustomerId`, and `Surname`, as these are identifiers that don't contain information relevant to a customer's likelihood to churn.


In [20]:
# Drop irrelevant features
data_cleaned = churn_df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

# Display the first few rows of the updated dataset to confirm removal
data_cleaned.head()


|   | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42.0 | 2 | 0.0 | 1 | 1.0 | 1.0 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41.0 | 1 | 83807.86 | 1 | 0.0 | 1.0 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42.0 | 8 | 159660.8 | 3 | 1.0 | 0.0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39.0 | 1 | 0.0 | 2 | 0.0 | 0.0 | 93826.63 | 0 |
| 5 | 645 | Spain | Male | 44.0 | 8 | 113755.78 | 2 | 1.0 | 0.0 | 149756.71 | 1 |


#### Encode Categorical Variables
We need to transform the categorical variables (`Geography` and `Gender`) into a numerical format that machine learning models can work with. There are different encoding techniques, but given that `Geography` is nominal without inherent order, one-hot encoding is a suitable choice for both `Geography` and `Gender` to avoid introducing any artificial ordering.

In [21]:
# One-hot encode the categorical variables
data_encoded = pd.get_dummies(data_cleaned, columns=['Geography', 'Gender'], drop_first=True)

# Display the first few rows of the dataset to confirm the encoding
data_encoded.head()


|   | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_Germany | Geography_Spain | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42.0 | 2 | 0.0 | 1 | 1.0 | 1.0 | 101348.88 | 1 | False | False | False |
| 1 | 608 | 41.0 | 1 | 83807.86 | 1 | 0.0 | 1.0 | 112542.58 | 0 | False | True | False |
| 2 | 502 | 42.0 | 8 | 159660.8 | 3 | 1.0 | 0.0 | 113931.57 | 1 | False | False | False |
| 3 | 699 | 39.0 | 1 | 0.0 | 2 | 0.0 | 0.0 | 93826.63 | 0 | False | False | False |
| 5 | 645 | 44.0 | 8 | 113755.78 | 2 | 1.0 | 0.0 | 149756.71 | 1 | False | True | True |
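
The effect of `drop_first=True` is easier to see on a small example. The sketch below uses a made-up four-row frame with the same two categorical columns. The first level of each column (`France`, `Female`) gets no indicator column and serves as the baseline.

```python
import pandas as pd

# Made-up rows; only the column names and category labels follow the dataset.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Female", "Male"],
})

encoded = pd.get_dummies(toy, columns=["Geography", "Gender"], drop_first=True)

# drop_first removes the alphabetically first level of each column, so a
# French female customer is represented by all indicators being False.
encoded_columns = list(encoded.columns)
```

Dropping one level per column avoids perfectly collinear indicators, which matters mostly for linear models. Tree-based models work with or without it.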


#### Feature Scaling

Since machine learning models, especially those based on distance metrics, can be sensitive to the scale of the features, we'll standardize the numerical features. This ensures that each feature contributes equally to the distance calculations. We'll use StandardScaler from scikit-learn, which standardizes features by removing the mean and scaling to unit variance.

In [22]:
from sklearn.preprocessing import StandardScaler

# Selecting numerical features for scaling
features_to_scale = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']
scaler = StandardScaler()

# Scaling the features
data_encoded[features_to_scale] = scaler.fit_transform(data_encoded[features_to_scale])

# Display the first few rows of the scaled dataset
data_encoded.head()


|   | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_Germany | Geography_Spain | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.326298 | 0.293657 | -1.041838 | -1.22586 | -0.91157 | 0.646028 | 0.97082 | 0.02172 | 1 | False | False | False |
| 1 | -0.440137 | 0.198305 | -1.387619 | 0.117428 | -0.91157 | -1.547921 | 0.97082 | 0.216366 | 0 | False | True | False |
| 2 | -1.537125 | 0.293657 | 1.032846 | 1.333214 | 2.526981 | 0.646028 | -1.030057 | 0.240519 | 1 | False | False | False |
| 3 | 0.501618 | 0.007601 | -1.387619 | -1.22586 | 0.807705 | -1.547921 | -1.030057 | -0.109083 | 0 | False | False | False |
| 5 | -0.057226 | 0.484361 | 1.032846 | 0.597439 | 0.807705 | 0.646028 | -1.030057 | 0.863478 | 1 | False | True | True |
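
The cell above fits the scaler on every row before the train/test split, so the test rows influence the means and variances used for scaling. A common alternative, sketched here on synthetic data (the arrays and their distributions are made up for illustration), is to split first, fit the scaler on the training rows only, and reuse those statistics for the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-ins for two numeric features on very different scales.
X_demo = np.column_stack([
    rng.normal(650, 95, size=500),        # credit-score-like values
    rng.normal(76000, 62000, size=500),   # balance-like values
])
y_demo = rng.integers(0, 2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics come from the training rows only
X_te_scaled = scaler.transform(X_te)      # test rows reuse those statistics
```

For tree-based models such as Random Forest and XGBoost, scaling has little effect on the result. It matters more for distance-based models and neural networks.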


#### Address Class Imbalance
The target variable `Exited` has a class imbalance, with a significantly higher number of customers staying than leaving. To address this, we can use techniques like SMOTE (Synthetic Minority Over-sampling Technique) to oversample the minority class, or we can adjust the class weight parameter in many machine learning models. Given that ensemble methods and neural networks are part of the modeling plan, adjusting class weights or using balanced algorithms like Balanced Random Forest might be an efficient approach.


In [23]:
from sklearn.model_selection import train_test_split
# Redefining X and y using the data_encoded dataset
X = data_encoded.drop('Exited', axis=1)
y = data_encoded['Exited']

# Split the data into training and test sets, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Checking the sizes of the training and testing datasets
X_train.shape, X_test.shape, y_train.shape, y_test.shape


((7998, 11), (2000, 11), (7998,), (2000,))
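
The `stratify=y` argument keeps the churn rate the same in both splits. A minimal sketch on made-up labels with an 80/20 class mix shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 800 stayers (0) and 200 churners (1).
y_demo = np.array([0] * 800 + [1] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both splits keep the 20% positive rate of the full label vector.
train_rate = y_tr.mean()
test_rate = y_te.mean()
```

Without stratification, a random split could leave the test set with noticeably more or fewer churners than the training set, which makes the minority-class metrics less comparable across runs.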

In [24]:
from imblearn.over_sampling import SMOTE

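# Oversample the minority class in the training split only, so that the test set
# keeps its original class distribution and contains no synthetic rows.
# Note: the models below rely on class weights and do not use the resampled data.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)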
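
As an alternative to resampling, class weights make errors on the minority class more costly. The sketch below follows the `n_samples / (n_classes * class_count)` heuristic that scikit-learn documents for `class_weight='balanced'`. The helper name `balanced_class_weights` is hypothetical and the label counts are illustrative; in practice the weights are derived from the training labels.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights following the n_samples / (n_classes * class_count) heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Illustrative labels: 1592 stayers (0) and 408 churners (1).
demo_labels = [0] * 1592 + [1] * 408
weights = balanced_class_weights(demo_labels)
```

With these counts the churn class receives roughly four times the weight of the majority class, which mirrors the roughly 80/20 imbalance.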


### Fitting the Models
Starting with a general approach, we'll explore different ensemble models to see how they perform on our binary classification task. Ensemble methods combine the predictions of several base estimators to improve generalizability and robustness over a single estimator. We'll look into two popular ensemble techniques:

**Random Forest:** An ensemble of decision trees, typically trained via the bagging method. It's known for its simplicity, robustness, and good performance on many problems.

**Gradient Boosting:** Builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. We'll use XGBoost, a scalable and accurate implementation of gradient boosting machines, known for its performance in classification tasks.
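
To illustrate the general idea of combining several base estimators, here is a simplified hard-voting sketch in plain Python. The three prediction lists are made up and the function name `majority_vote` is hypothetical. This is not how `RandomForestClassifier` or XGBoost combine their trees internally; it only shows why a combination can outperform its individual members.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample by majority."""
    combined = []
    for sample_votes in zip(*predictions_per_model):
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


# Made-up predictions from three hypothetical base classifiers on five samples.
truth   = [1, 0, 0, 1, 0]
model_a = [1, 0, 1, 1, 0]   # one mistake
model_b = [1, 1, 0, 0, 0]   # two mistakes
model_c = [0, 0, 0, 1, 1]   # two mistakes

ensemble = majority_vote([model_a, model_b, model_c])
```

Each base model errs on different samples, so the majority vote recovers the correct label on all five. Ensembles benefit most when the errors of their members are not strongly correlated.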

#### Random Forest

Let's start by training a Random Forest classifier and evaluating its performance. We'll use accuracy as a primary metric, but since our dataset is imbalanced, we'll also look at the confusion matrix, precision, recall, and F1-score to get a comprehensive view of the model's performance.

In [25]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize the Random Forest classifier
rf_clf = RandomForestClassifier(random_state=42, class_weight='balanced')

# Train the model
rf_clf.fit(X_train, y_train)

# Predictions on the test set
y_pred_rf = rf_clf.predict(X_test)

# Evaluate the model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
confusion_rf = confusion_matrix(y_test, y_pred_rf)
report_rf = classification_report(y_test, y_pred_rf)

accuracy_rf, confusion_rf, report_rf


(0.8595,
 array([[1527,   65],
        [ 216,  192]]),
 '              precision    recall  f1-score   support\n\n           0       0.88      0.96      0.92      1592\n           1       0.75      0.47      0.58       408\n\n    accuracy                           0.86      2000\n   macro avg       0.81      0.71      0.75      2000\nweighted avg       0.85      0.86      0.85      2000\n')

The Random Forest classifier achieved an accuracy of approximately 86.0% on the test set. However, considering the class imbalance, let's look at other metrics:

**Precision** for class 1 (customers who exited) is 0.75, meaning that when the model predicts a customer will exit, it is correct about 75% of the time.

**Recall** for class 1 is 0.47, indicating that the model correctly identifies 47% of the actual exits.

The **F1-score** for class 1, which balances precision and recall, is 0.58. This suggests room for improvement, especially in identifying the minority class more effectively.

The confusion matrix provides additional insight:

**True Negatives (TN)**: 1527 (customers correctly identified as not exiting)

**False Positives (FP)**: 65 (customers incorrectly identified as exiting)

**False Negatives (FN)**: 216 (customers who exited but were not identified by the model)

**True Positives (TP):** 192 (customers correctly identified as exiting)

These results indicate the model's stronger performance in predicting customers who do not exit, which is expected due to the class imbalance.
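
The reported metrics can be re-derived from the confusion matrix alone. The sketch below assumes scikit-learn's layout `[[TN, FP], [FN, TP]]` and uses the matrix printed by the Random Forest cell; the helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(confusion):
    """Return accuracy, precision, recall and F1 for the positive class.

    `confusion` uses scikit-learn's layout: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Confusion matrix printed by the Random Forest cell.
rf_metrics = binary_metrics([[1527, 65], [216, 192]])
```

Rounded to two decimals, the values agree with the classification report. The same function applies to any other binary confusion matrix in this layout.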

#### Gradient Boosting (XGBoost)

Next, we train an XGBoost classifier and compare its performance to that of the Random Forest model. To account for the class imbalance, `scale_pos_weight` is set to the ratio of negative to positive samples in the training labels.

In [26]:
from xgboost import XGBClassifier

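# Weight the positive (churn) class by the negative-to-positive ratio in the training labels
pos_weight = (1 - y_train.mean()) / y_train.mean()

# Initialize the XGBoost classifier
# (`use_label_encoder` is deprecated in recent XGBoost releases, so it is omitted)
xgb_clf = XGBClassifier(
    eval_metric='logloss',
    random_state=42,
    scale_pos_weight=pos_weight,
)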

# Train the model
xgb_clf.fit(X_train, y_train)

# Predictions on the test set
y_pred_xgb = xgb_clf.predict(X_test)

# Evaluate the model
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
confusion_xgb = confusion_matrix(y_test, y_pred_xgb)
report_xgb = classification_report(y_test, y_pred_xgb)

accuracy_xgb, confusion_xgb, report_xgb


(0.8225,
 array([[1381,  211],
        [ 144,  264]]),
 '              precision    recall  f1-score   support\n\n           0       0.91      0.87      0.89      1592\n           1       0.56      0.65      0.60       408\n\n    accuracy                           0.82      2000\n   macro avg       0.73      0.76      0.74      2000\nweighted avg       0.83      0.82      0.83      2000\n')
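
The expression passed to `scale_pos_weight` in the cell above, `(1 - y_train.mean()) / y_train.mean()`, is the ratio of negative to positive samples written in terms of the positive rate. A small sketch on made-up labels (the function name is hypothetical) confirms the equivalence:

```python
def scale_pos_weight_from_labels(labels):
    """Negative-to-positive ratio, written in terms of the positive rate."""
    positive_rate = sum(labels) / len(labels)
    return (1 - positive_rate) / positive_rate


# Made-up labels: 8 stayers (0) and 2 churners (1).
demo_labels = [0] * 8 + [1] * 2
spw = scale_pos_weight_from_labels(demo_labels)

# The same quantity computed directly from the class counts.
neg_over_pos = demo_labels.count(0) / demo_labels.count(1)
```

For a dataset in which about one customer in five churns, this ratio is close to 4, so each churner counts roughly four times as much as a non-churner in the loss.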

### Results

**Accuracy**

Accuracy: 0.8225 - The XGBoost model correctly predicted whether a customer would churn 82.25% of the time on the test set, slightly below the Random Forest's 85.95%. Accuracy alone can be misleading on imbalanced datasets, where one class significantly outnumbers the other.
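
A quick baseline shows why accuracy needs context here. The sketch uses the test-set class counts from the "support" column of the classification report and evaluates a trivial model that always predicts "no churn":

```python
# Class counts of the test set, taken from the "support" column of the report.
n_stayed, n_churned = 1592, 408

# A trivial model that always predicts "stayed" (class 0).
baseline_accuracy = n_stayed / (n_stayed + n_churned)

# It never flags a churner, so its recall on the churn class is zero.
baseline_churn_recall = 0 / n_churned
```

The trivial model already reaches 79.6% accuracy while finding no churners at all, so the per-class metrics are the more informative comparison.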

**Confusion Matrix**

True Negatives (TN): 1381 - The number of customers who were correctly predicted to not churn.

False Positives (FP): 211 - The number of customers who did not churn but were incorrectly predicted to churn.

False Negatives (FN): 144 - The number of customers who churned but were incorrectly predicted to not churn.

True Positives (TP): 264 - The number of customers who were correctly predicted to churn.

**Classification Report**

The classification report provides key metrics in predicting each class:

Precision (for each class) is the ratio of correctly predicted positive observations to the total predicted positives. It tells you the quality of the positive class predictions.

Class 0 (No Churn): 0.91 - High precision indicates that the model is good at identifying customers who won't churn.

Class 1 (Churn): 0.56 - Lower precision for the churn class suggests the model makes more mistakes in identifying customers who will churn.

Recall (for each class) is the ratio of correctly predicted positive observations to all observations that actually belong to the class. It measures the model's ability to find all the positive samples.

Class 0: 0.87 - This means the model correctly identifies 87% of the non-churners.

Class 1: 0.65 - The model correctly identifies 65% of the churners, indicating it misses 35% of the actual churn cases.

F1-Score is the harmonic mean of Precision and Recall, so it takes both false positives and false negatives into account. A high F1 score indicates that a class has a good balance of precision and recall.

Class 0: 0.89 - High F1 score suggests a good balance for the non-churn class.

Class 1: 0.60 - Moderate F1 score for the churn class suggests room for improvement, especially in reducing false positives and false negatives.

Support is the number of actual occurrences of the class in the specified dataset. For non-churners (Class 0), there are 1592 instances, and for churners (Class 1), there are 408 instances, indicating an imbalanced dataset.

### Interpretation and Next Steps

The XGBoost model performs well in identifying non-churners but struggles somewhat with churners, which is common in imbalanced datasets. Compared with the Random Forest, it trades precision for recall on the churn class (precision 0.56 vs. 0.75, recall 0.65 vs. 0.47). While the accuracy is high, the lower precision, recall, and F1 score for the churn class suggest focusing on the model's ability to identify churners. This could involve collecting more data, trying different resampling techniques, adjusting class weights, exploring feature engineering, or experimenting with different models and ensemble techniques.
