# **Bank Customer Churn Predictor**
This notebook builds a bank customer churn predictor using the `Bank Customer Churn Prediction` Kaggle dataset by `shantanudhakadd`.

---
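The following libraries must be installed:

- NumPy
- pandas
- Matplotlib
- seaborn
- scikit-learn

The notebook also uses `pickle`, which ships with the Python standard library and needs no installation.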



## **Import libraries**

The following cell imports all the libraries, classes, and functions used in this notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
import pickle

## **Load dataset**

The following cells download the dataset and load it into a DataFrame.

### Download dataset

The following cell downloads the `Bank Customer Churn Prediction` Kaggle dataset by `shantanudhakadd`, extracts the CSV file from the downloaded .zip archive, and then deletes the archive.

In [None]:
!kaggle datasets download -d shantanudhakadd/bank-customer-churn-prediction
!unzip bank-customer-churn-prediction.zip
!rm bank-customer-churn-prediction.zip

### Load dataset

The following cell loads the dataset from `Churn_Modelling.csv` into the variable `churn_data`.

In [None]:
churn_data = pd.read_csv('Churn_Modelling.csv')
churn_data.head()

## **EDA**

The following cells perform exploratory data analysis.

### Basic info of data

Display the number of rows, the number of columns, the data type of each column, and the number of non-null values in each column.

In [None]:
churn_data.info()

### Unique value count

Display the number of unique values in each column.

In [None]:
for column in churn_data.columns:
    print(f'{column} : {churn_data[column].nunique()}')

### Statistical description of data

Display basic statistics such as mean, standard deviation, min, and max for each numerical column.

In [None]:
churn_data.describe()

### Plot numerical

Plot numerical columns `Age`, `Balance`, `EstimatedSalary`, and `CreditScore` as histograms.

In [None]:
plt.figure(figsize = (15, 6))

to_plot = ['Age', 'Balance', 'EstimatedSalary', 'CreditScore']

# Plot graphs
for i, column in enumerate(to_plot, 1):
    plt.subplot(2, 2, i)
    sns.histplot(x = column, data = churn_data, hue = 'Exited', stat='percent', kde = True, bins = 20, multiple='stack')

plt.show()

### Plot categorical

Plot categorical columns `Tenure`, `Gender`, `HasCrCard`, `IsActiveMember`, `Geography`, and `NumOfProducts` as histograms.

In [None]:
plt.figure(figsize = (15, 8))

to_plot = ['Tenure', 'Gender', 'HasCrCard', 'IsActiveMember', 'Geography', 'NumOfProducts']

# Plot graphs
for i, column in enumerate(to_plot, 1):
    plt.subplot(2, 3, i)
    sns.histplot(x = column, data = churn_data, hue = 'Exited', stat = 'percent', multiple = 'dodge', bins = churn_data[column].nunique(), palette='tab10')

plt.show()

### Plot target

Plot the target column `Exited` as a pie chart.

In [None]:
plt.pie(churn_data.Exited.value_counts(), labels = ['Retained', 'Exited'], autopct='%1.1f%%', colors=sns.color_palette('tab10'), explode=[0, 0.1], shadow=True)
plt.show()

### Check null

Display the number of null values in each column.

In [None]:
churn_data.isnull().sum()

### Check duplicates

Display the number of rows that repeat a `CustomerId` already seen in an earlier row.

In [None]:
churn_data.duplicated(subset = 'CustomerId').sum()

## **Preprocessing**

The following cells select useful features, split the data into train and test sets, encode the categorical columns, and scale selected numerical columns.

### Select features

Drop redundant and uninformative columns. Separate the target from the features.

In [None]:
to_drop = ['RowNumber', 'CustomerId', 'Surname', 'Exited']
X_full = churn_data.drop(to_drop, axis = 1)
y_full = churn_data.Exited

X_full.head()

### Train-Test split

Split the data into training and testing sets, holding out 20% of the rows for testing.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size = 0.2, random_state = 42)

X_train.shape, X_test.shape
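
If the `Exited` classes turn out to be imbalanced, a stratified split keeps the class proportions the same in both sets. The sketch below is not part of the original pipeline: it uses synthetic stand-in data (the names `X_demo` and `y_demo` are hypothetical) to show the effect of the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, 20% positive class.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# stratify=y_demo asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()  # share of positives in the training split
test_rate = y_te.mean()   # share of positives in the testing split
print(f'Positive rate, train: {train_rate:.3f}, test: {test_rate:.3f}')
```

In the notebook this would correspond to passing `stratify = y_full` to the existing call, which changes which rows land in each set.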

### Encode categorical columns

Use one-hot encoding for the categorical column `Geography` and a binary 0/1 mapping for `Gender`.

In [None]:
# One hot encode 'Geography'
X_train = pd.get_dummies(X_train, columns = ['Geography'], dtype = int)
X_test = pd.get_dummies(X_test, columns = ['Geography'], dtype=int)

# Map 'Gender' to a binary column (Male = 0, Female = 1)
X_train.Gender = X_train.Gender.map({'Male': 0, 'Female': 1})
X_test.Gender = X_test.Gender.map({'Male': 0, 'Female': 1})

X_train.head()

### Scale data

Use standard scaling on the `EstimatedSalary` and `Balance` columns. The scaler is fitted on the training set only and then applied to the test set.

In [None]:
scaler = StandardScaler()

scaled_columns = ['EstimatedSalary', 'Balance']

# Fit on the training set only, then apply the same transformation to the test set
X_train[scaled_columns] = scaler.fit_transform(X_train[scaled_columns])
X_test[scaled_columns] = scaler.transform(X_test[scaled_columns])

X_train.describe().apply(lambda s: s.apply('{0:.5f}'.format))
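
Standard scaling subtracts the training mean and divides by the training standard deviation. The small check below, on made-up numbers, illustrates that `StandardScaler.transform` applies the statistics learned during `fit` to new data; `train_values` and `test_values` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values (assumption), one feature column.
train_values = np.array([[10.0], [20.0], [30.0]])
test_values = np.array([[40.0]])

scaler_demo = StandardScaler().fit(train_values)

# Manual z-score using the training mean and population standard deviation.
manual = (test_values - train_values.mean(axis=0)) / train_values.std(axis=0)
scaled = scaler_demo.transform(test_values)

print(scaled, manual)
```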

### Check correlation

Display correlation of each feature with the target.

In [None]:
X_train.corrwith(y_train).abs().sort_values(ascending = False)

## **Train Model**

The following cells train and evaluate different models to find the best-performing one.

### Train and evaluate

The following cell defines the function `train_evaluate`, which trains a model on the training data and reports its accuracy, F1 score, confusion matrix, and classification report on the test data.

In [None]:
def train_evaluate(model):
    """
    Trains a model and evaluates its performance on the test set.
    """
    # Train model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Evaluate model
    print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
    print(f'F1 Score: {f1_score(y_test, y_pred)}')
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
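
To make the reported metrics concrete, the sketch below recomputes the F1 score by hand from a confusion matrix and compares it with `f1_score`. The labels in `y_true_demo` and `y_pred_demo` are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels and predictions (assumption).
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
manual_f1 = 2 * precision * recall / (precision + recall)

print(manual_f1, f1_score(y_true_demo, y_pred_demo))
```

On an imbalanced target, F1 is often more informative than accuracy because it ignores the true negatives that dominate the majority class.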

### Logistic regression

In [None]:
LR = LogisticRegression(max_iter=500, random_state = 3)
train_evaluate(LR)

### Random forest classifier

In [None]:
RFC = RandomForestClassifier(n_estimators=300, random_state=3)
train_evaluate(RFC)

### Gradient boosting classifier

In [None]:
GBC = GradientBoostingClassifier(n_estimators = 1000, random_state=3)
train_evaluate(GBC)

## **Final model**

The following cells retrain the best model, the gradient boosting classifier, on the whole dataset (training and testing data combined) and then save it.

### Train model on full data

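Join the train and test data, retrain the model, and display the final F1 score. Because this score is computed on the same data the model was trained on, it is optimistic and does not measure performance on unseen data.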

In [None]:
X_full = pd.concat([X_train, X_test])
y_full = pd.concat([y_train, y_test])

GBC.fit(X_full, y_full)
print(f'F1 score: {f1_score(y_full, GBC.predict(X_full))}')

### Save model

Save the model to `churn_model.pkl`.

In [None]:
# Use a context manager so the file is closed after writing
with open('churn_model.pkl', 'wb') as model_file:
    pickle.dump(GBC, model_file)
print('Model Saved')
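
A saved model can later be restored with `pickle.load`. The sketch below shows the round trip on a small stand-in model trained on synthetic data and written to a temporary file; `model_demo`, `model_path`, and the data are hypothetical, and the same pattern would apply to `churn_model.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data and model (assumption), kept small for speed.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
model_demo = GradientBoostingClassifier(n_estimators=20, random_state=3).fit(X_demo, y_demo)

# Save to a temporary location, then load the model back.
model_path = os.path.join(tempfile.mkdtemp(), 'demo_model.pkl')
with open(model_path, 'wb') as f:
    pickle.dump(model_demo, f)

with open(model_path, 'rb') as f:
    loaded_model = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = np.array_equal(model_demo.predict(X_demo), loaded_model.predict(X_demo))
print(same_predictions)
```

Note that the notebook saves only the classifier. New rows would need the same encoding and scaling as the training data before being passed to the loaded model, and pickle files should only be loaded from trusted sources.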