# **Bank Customer Churn Prediction Project**

-------------

## **Objective**

This project aims to predict customer churn based on various features such as customer demographics, account information, and engagement with the bank. By building a predictive model, the bank can identify customers likely to churn and implement targeted retention strategies.

## **Data Source**

https://raw.githubusercontent.com/YBI-Foundation/Dataset/main/Bank%20Churn%20Modelling.csv

## **Import Library**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV

## **Import Data**

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/YBI-Foundation/Dataset/main/Bank%20Churn%20Modelling.csv')

## **Describe Data**

The dataset used in this project includes the following columns:

- CustomerId: Unique identifier for each customer.
- Surname: Last name of the customer.
- CreditScore: Customer's credit score.
- Geography: Country of residence.
- Gender: Gender of the customer.
- Age: Age of the customer.
- Tenure: Number of years with the bank.
- Balance: Balance in the customer's account.
- Num Of Products: Number of products held with the bank.
- Has Credit Card: Has a credit card (1 = Yes, 0 = No).
- Is Active Member: Is an active member (1 = Yes, 0 = No).
- Estimated Salary: Customer's estimated annual salary.
- Churn: Churn indicator (1 = Yes, 0 = No).

The target variable for prediction is Churn. Our goal is to build a model that accurately predicts whether a customer will churn based on the other features.
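Because Churn is a binary target, checking its class proportions before modeling indicates whether resampling (for example with the RandomOverSampler imported earlier) is worth considering. A minimal sketch on a small hypothetical frame; `toy` and its values are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Hypothetical stand-in for df; only the Churn column matters here.
toy = pd.DataFrame({"Churn": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]})

# Share of each class; a heavily skewed split suggests resampling or class weights.
churn_share = toy["Churn"].value_counts(normalize=True)
print(churn_share)
```

On the real data the same call would be `df["Churn"].value_counts(normalize=True)`.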

## **Data Visualization**

In [None]:
df.describe()

| | CreditScore | Geography | Gender | Age | Tenure | Balance | Num Of Products | Has Credit Card | Is Active Member | Estimated Salary | Churn | Zero Balance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.00000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
| mean | 650.528800 | 1.253700 | 0.454300 | 38.921800 | 5.012800 | 76485.889288 | 1.530200 | 0.70550 | 0.515100 | 100090.239881 | 0.203700 | 0.638300 |
| std | 96.653299 | 0.827529 | 0.497932 | 10.487806 | 2.892174 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402769 | 0.480517 |
| min | 350.000000 | 0.000000 | 0.000000 | 18.000000 | 0.000000 | 0.000000 | 1.000000 | 0.00000 | 0.000000 | 11.580000 | 0.000000 | 0.000000 |
| 25% | 584.000000 | 1.000000 | 0.000000 | 32.000000 | 3.000000 | 0.000000 | 1.000000 | 0.00000 | 0.000000 | 51002.110000 | 0.000000 | 0.000000 |
| 50% | 652.000000 | 2.000000 | 0.000000 | 37.000000 | 5.000000 | 97198.540000 | 1.000000 | 1.00000 | 1.000000 | 100193.915000 | 0.000000 | 1.000000 |
| 75% | 718.000000 | 2.000000 | 1.000000 | 44.000000 | 7.000000 | 127644.240000 | 2.000000 | 1.00000 | 1.000000 | 149388.247500 | 0.000000 | 1.000000 |
| max | 850.000000 | 2.000000 | 1.000000 | 92.000000 | 10.000000 | 250898.090000 | 4.000000 | 1.00000 | 1.000000 | 199992.480000 | 1.000000 | 1.000000 |
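The summary reports Geography and Gender as numbers and includes a Zero Balance column, so some encoding was applied before it was produced, although those cells are not shown. A minimal sketch of one way to do this; the specific integer codes and the `Balance > 0` rule are assumptions, and `toy` is a hypothetical stand-in for df:

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the raw data.
toy = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain", "France"],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Balance": [0.0, 1200.5, 0.0, 98000.0],
})

# Assumed integer codes; the notebook does not state which mapping it used.
geo_codes = {"France": 2, "Germany": 0, "Spain": 1}
gender_codes = {"Male": 0, "Female": 1}

toy["Geography"] = toy["Geography"].map(geo_codes)
toy["Gender"] = toy["Gender"].map(gender_codes)

# Assumed flag: 1 when the account holds a positive balance, else 0.
toy["Zero Balance"] = np.where(toy["Balance"] > 0, 1, 0)
print(toy)
```

Integer codes impose an arbitrary order on Geography; one-hot encoding with `pd.get_dummies` is a common alternative when that order is not meaningful.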

## **Data Preprocessing**

## **Define Target Variable (y) and Feature Variables (X)**

In [None]:
x = df.drop(['Surname', 'Churn'], axis = 1)
y = df['Churn']

## **Train Test Split**

In [None]:
# Split the original (unresampled) data: 70% train, 30% test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=25)
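StandardScaler is imported at the top of the notebook but is not applied in the cells shown. Since SVC is sensitive to feature scale, a minimal sketch of scaling after the split follows; the data here is synthetic, the names `x_demo`, `y_demo`, `x_tr`, and `x_te` are hypothetical, and `stratify` is an optional addition that the notebook's split does not use:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(25)
# Hypothetical numeric features on very different scales (an age-like and a balance-like column).
x_demo = np.column_stack([rng.integers(18, 90, 200), rng.uniform(0, 250_000, 200)])
y_demo = rng.integers(0, 2, 200)

# stratify keeps the class ratio the same in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=25, stratify=y_demo
)

scaler = StandardScaler()
x_tr_scaled = scaler.fit_transform(x_tr)  # learn mean/std from training rows only
x_te_scaled = scaler.transform(x_te)      # reuse the training statistics on the test rows
```

Fitting the scaler on the training rows only keeps test-set statistics from leaking into the model.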

## **Modeling**

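In [None]:
# Assumption: x_train_ros / y_train_ros come from oversampling the training split only
ros = RandomOverSampler(random_state=25)
x_train_ros, y_train_ros = ros.fit_resample(x_train, y_train)

# Fit the SVC on the oversampled training data, then predict on the untouched test split
svc_ros = SVC()
svc_ros.fit(x_train_ros, y_train_ros)
y_pred = svc_ros.predict(x_test)

## **Model Evaluation**

In [None]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))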

## **Prediction**

In [None]:
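# Compare every model variant on its own test split. Each variable below must
# already be defined by the cell that fit the corresponding model.
print(classification_report(y_test, y_pred))                      # SVC, original split
print(classification_report(y_test, grid_predictions))            # grid-searched SVC, original split
print(classification_report(y_test_rus, y_pred_rus))              # SVC, "rus" (presumably undersampled) split
print(classification_report(y_test_rus, grid_predictions_rus))    # grid-searched SVC, "rus" split
print(classification_report(y_test_ros, y_pred_ros))              # SVC, "ros" (oversampled) split
print(classification_report(y_test_ros, grid_predictions_ros))    # grid-searched SVC, "ros" split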
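The `grid_predictions` variables suggest a GridSearchCV over SVC hyperparameters, but that cell is not shown. A minimal sketch of how such predictions could be produced, assuming a small hypothetical parameter grid and synthetic data in place of the churn dataset; `grid_predictions_demo` and the other `_demo` names are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the churn data (about 20% positives).
x_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=25
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=25, stratify=y_demo
)

# Hypothetical search grid; the values actually searched are not given in the notebook.
param_grid = {"C": [0.1, 1, 10], "gamma": [1, 0.1, 0.01], "kernel": ["rbf"]}

# 3-fold cross-validated search, scored by F1 on the positive class.
grid = GridSearchCV(SVC(), param_grid, cv=3, scoring="f1")
grid.fit(x_tr, y_tr)

# predict() uses the best estimator, refit on the full training split.
grid_predictions_demo = grid.predict(x_te)
print(grid.best_params_)
```

Scoring by F1 rather than accuracy is a design choice for imbalanced targets, where a model that always predicts "no churn" can still look accurate.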

## **Explanation**

We interpret the model's performance by examining key metrics, chiefly precision and recall on the churn class, and by identifying the features that matter most. An SVC with the default RBF kernel does not expose feature importances directly; if a tree-based model such as Random Forest or Gradient Boosting is used instead, its feature importances show the impact of each variable on churn prediction.
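As an illustration of tree-based feature importances, here is a minimal sketch that fits a Random Forest on synthetic data. The notebook itself trains an SVC, so the model, the data, and the feature names `f0` to `f4` are all hypothetical:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["f0", "f1", "f2", "f3", "f4"]  # placeholder names
x_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=25
)

forest = RandomForestClassifier(n_estimators=100, random_state=25)
forest.fit(x_demo, y_demo)

# Impurity-based importances; they are normalized to sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the real data, `feature_names` would be `x.columns` and the forest would be fit on `x_train`, `y_train`.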

We can also use visualization techniques like SHAP values or LIME for model interpretation to gain insights into individual predictions and overall model behavior.
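SHAP and LIME need additional packages that this notebook does not import. As a lighter stand-in that works with the SVC used here, scikit-learn's permutation importance measures how much the test score drops when one feature's values are shuffled. This is a substitute technique, not SHAP or LIME, and the sketch below runs on synthetic data with hypothetical variable names:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

x_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=25
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=25
)

svc = SVC().fit(x_tr, y_tr)

# Shuffle each feature 10 times on the held-out split and record the accuracy drop.
result = permutation_importance(svc, x_te, y_te, n_repeats=10, random_state=25)

# Report features from most to least important.
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

Unlike SHAP or LIME, this gives a global ranking only; it does not explain individual predictions.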