_______________________
<font size=8 color=bisque> WEEK 24 : GRADED MINI PROJECT

_________________

<font size=6 color=chocolate>Bank Customer Churn Analysis

In today’s competitive financial landscape, customer retention is a key driver of profitability. A leading international bank is experiencing a surge in customer churn: clients are closing their accounts even though the bank offers a wide array of financial products. This trend threatens long-term growth and brand loyalty.

<font size=5 color=olive> Objective
As a data analyst, your mission is to:
- Analyze customer data to uncover churn-driving factors.
- Build a predictive model to classify customers as likely to stay or exit.
- Enable targeted retention strategies for at-risk customers.

<font size=5 color=olive>Problem Statement
> “Which customers are at risk of leaving the bank?”

Answering this question empowers the bank’s marketing and customer success teams to proactively engage vulnerable customers and reduce churn.

<font size=5 color=olive>Introduction

This notebook addresses the project objectives: cleaning and preprocessing the dataset, performing EDA to uncover churn patterns, building predictive models for churn, and deriving insights. The dataset contains customer data with the target variable Exited (1 = churned, 0 = stayed).

<font size=5 color=olive>Libraries used:

- pandas for data manipulation
- numpy for numerical operations
- matplotlib and seaborn for visualization
- scikit-learn for preprocessing, modeling, and evaluation
- scipy for hierarchical clustering (explored in EDA for unsupervised insights)


-------------
<font size=6 color=seagreen > Tasks To Do

--------------


<font size=4 color=cyan>1. Data Cleaning & Preprocessing

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
import warnings
warnings.filterwarnings('ignore')


In [3]:
# Load Dataset
df = pd.read_csv('/content/Data.csv')
df.head()

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,15708791.0,Abazu,584,Spain,Male,32.0,9,85534.83,1,0.0,0.0,169137.24,0
1,15576156.0,Abazu,710,Spain,Female,28.0,6,0.0,1,1.0,0.0,48426.98,0
2,15737792.0,Abbie,818,France,Female,31.0,1,186796.37,1,0.0,0.0,178252.63,0
3,15680804.0,Abbott,850,France,Male,29.0,6,0.0,2,1.0,1.0,10672.54,0
4,15723706.0,Abbott,573,France,Female,33.0,0,90124.64,1,1.0,0.0,137476.71,0


In [9]:
print(f'Shape of the dataset\t: {df.shape}')
print(f'Row Labels\t\t:  {df.index}')
print(f'\nColumns:\n {df.columns}')
print(f'\nData types: \n{df.dtypes}')

Shape of the dataset	: (10502, 13)
Row Labels		:  RangeIndex(start=0, stop=10502, step=1)

Columns:
 Index(['CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age',
       'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember',
       'EstimatedSalary', 'Exited'],
      dtype='object')

Data types: 
CustomerId         float64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                float64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard          float64
IsActiveMember     float64
EstimatedSalary    float64
Exited               int64
dtype: object
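
`CustomerId`, `HasCrCard`, and `IsActiveMember` appear as `float64` above even though they hold whole numbers. This is most likely because pandas promotes integer columns to float when they contain missing values. The sketch below, on a small made-up frame (`toy` is a stand-in, not the real `df`), shows one possible way to keep such columns integer-typed by using the nullable `Int64` dtype. Whether this conversion is wanted here is an assumption; imputing or dropping the missing values first is an equally valid choice.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the values are illustrative only.
toy = pd.DataFrame(
    {
        "CustomerId": [15708791.0, np.nan, 15737792.0],
        "HasCrCard": [0.0, 1.0, np.nan],
    }
)

# "Int64" (capital I) is pandas' nullable integer dtype: NaN becomes <NA>
# and the remaining values are stored as integers instead of floats.
converted = toy.astype({"CustomerId": "Int64", "HasCrCard": "Int64"})

print(converted.dtypes)
print(converted)
```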


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10502 entries, 0 to 10501
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerId       9571 non-null   float64
 1   Surname          10502 non-null  object 
 2   CreditScore      10502 non-null  int64  
 3   Geography        10490 non-null  object 
 4   Gender           9560 non-null   object 
 5   Age              9516 non-null   float64
 6   Tenure           10502 non-null  int64  
 7   Balance          10502 non-null  float64
 8   NumOfProducts    10502 non-null  int64  
 9   HasCrCard        10501 non-null  float64
 10  IsActiveMember   10501 non-null  float64
 11  EstimatedSalary  10502 non-null  float64
 12  Exited           10502 non-null  int64  
dtypes: float64(6), int64(4), object(3)
memory usage: 1.0+ MB


In [11]:
df.describe(include='all')

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,9571.0,10502,10502.0,10490,9560,9516.0,10502.0,10502.0,10502.0,10501.0,10501.0,10502.0,10502.0
unique,,2932,,3,2,,,,,,,,
top,,Smith,,France,Male,,,,,,,,
freq,,36,,5264,5234,,,,,,,,
mean,15690880.0,,650.773948,,,38.899015,5.015045,76426.09173,1.530375,0.70498,0.514713,100401.133536,0.204247
std,71971.78,,96.725437,,,10.523426,2.895205,62423.431813,0.58158,0.456073,0.499807,57536.9032,0.403169
min,15565700.0,,350.0,,,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,15628310.0,,584.0,,,32.0,2.0,0.0,1.0,0.0,0.0,51431.7325,0.0
50%,15690590.0,,652.0,,,37.0,5.0,97029.715,1.0,1.0,1.0,100600.355,0.0
75%,15753110.0,,718.0,,,44.0,8.0,127647.84,2.0,1.0,1.0,149643.62,0.0


><font color=olivedrab>Detecting Missing Values

In [13]:
# Boolean mask: True where a value is missing, False otherwise
df.isnull()

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10497,False,False,False,False,False,False,False,False,False,False,False,False,False
10498,False,False,False,False,False,False,False,False,False,False,False,False,False
10499,True,False,False,False,True,True,False,False,False,False,False,False,False
10500,True,False,False,False,True,True,False,False,False,False,False,False,False
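
In the mask above, rows 10499 and 10500 are missing `CustomerId`, `Gender`, and `Age` at the same time, which suggests some rows may be largely empty rather than missing a single field. A minimal sketch of how to check this row-wise is shown below on a small made-up frame (`toy` and its values are assumptions for illustration); the same two expressions could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the values are illustrative only.
toy = pd.DataFrame(
    {
        "CustomerId": [15708791.0, 15576156.0, np.nan, np.nan],
        "Gender": ["Male", "Female", None, None],
        "Age": [32.0, np.nan, np.nan, np.nan],
        "Tenure": [9, 6, 1, 6],
    }
)

# Number of missing fields in each row.
missing_per_row = toy.isna().sum(axis=1)

# Rows where all three columns are missing together.
cols = ["CustomerId", "Gender", "Age"]
all_three_missing = toy[cols].isna().all(axis=1)

# Distribution of "how many fields are missing" across rows.
print(missing_per_row.value_counts().sort_index())
print(f"Rows missing {cols} together: {all_three_missing.sum()}")
```

If most of the missing `CustomerId`, `Gender`, and `Age` values fall in the same rows, dropping those rows is likely to lose less information than the per-column counts alone would suggest.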


In [18]:
# Count of missing values in each column
df.isna().sum(axis=0)

Unnamed: 0,0
CustomerId,931
Surname,0
CreditScore,0
Geography,12
Gender,942
Age,986
Tenure,0
Balance,0
NumOfProducts,0
HasCrCard,1


In [16]:
# Percentage of missing values per column, rounded to 2 decimals
(df.isna().sum() / df.shape[0] * 100).round(2)

Unnamed: 0,0
CustomerId,8.86
Surname,0.0
CreditScore,0.0
Geography,0.11
Gender,8.97
Age,9.39
Tenure,0.0
Balance,0.0
NumOfProducts,0.0
HasCrCard,0.01
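
The counts and percentages above can also be read side by side in a single table. The helper below is a minimal sketch of that idea; `missing_summary` is a hypothetical name introduced here, and the toy frame is made up so the block runs on its own. Calling `missing_summary(df)` would produce the same layout for the real data.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = frame.isna().sum()
    # The mean of a boolean mask is the fraction of True values per column.
    percent = (frame.isna().mean() * 100).round(2)
    summary = pd.DataFrame({"missing_count": counts, "missing_pct": percent})
    # Keep only columns that actually have gaps, sorted by severity.
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )


# Toy frame standing in for df; the values are illustrative only.
toy = pd.DataFrame(
    {
        "Age": [32.0, np.nan, 31.0, np.nan],
        "Gender": ["Male", None, "Female", "Male"],
        "Tenure": [9, 6, 1, 6],
    }
)

summary = missing_summary(toy)
print(summary)
```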


><font color=olivedrab>Handling Missing Values