# 🏦 Customer Churn Analysis - EDA & Insights  

## 📌 Project Overview  
This notebook focuses on **Exploratory Data Analysis (EDA)** for our **European Banking Customer Churn dataset**. The goal is to uncover patterns, trends, and key factors influencing customer churn through visual and statistical analysis.  

## 📂 Dataset Information  
- **Dataset Name:** `european_customer_churn_data.parquet`  
- **Total Customers:** 1,008,530  
- **Key Features:**  
  - **Demographics:** Age, gender, geography, employment status  
  - **Financial Data:** Credit score, balance, salary, transactions  
  - **Behavioral Data:** Login frequency, mobile vs. branch transactions  
  - **Banking Issues:** Account problems, loan concerns, customer complaints  
  - **Sentiment & Small Talk:** Personalized interactions based on past experiences  

## 🔍 Notebook Structure  
### **1️⃣ Data Loading & Preprocessing**  
- Load `parquet` dataset into Pandas  
- Handle missing values, incorrect data entries, and standardize column formats  

### **2️⃣ Exploratory Data Analysis (EDA)**  
#### ✨ **Demographic Insights**  
- Age distribution and its relation to churn  
- Gender-wise and geography-based churn patterns  

#### 📊 **Financial Behavior Analysis**  
- Credit score vs. churn trends  
- Salary & balance distribution for active vs. churned customers  
- Transaction frequency and spending behavior  

#### 🔄 **Customer Interaction & Sentiment Analysis**  
- Common banking issues reported  
- Sentiment distribution across churned vs. non-churned customers  
- Impact of personalized small talk on churn rates  

#### 🏦 **High-Value Customers & Dedicated RMs**  
- Identification of VIP customers  
- Agent performance & escalation trends  

### **3️⃣ Feature Engineering for ML (Future Scope)**  
- Creating new features from transaction patterns  
- Encoding categorical variables for modeling  

### **4️⃣ Insights & Recommendations**  
- Key takeaways from data analysis  
- Business recommendations to reduce churn  

---

## 🎯 Goals & Expected Outcomes  
✔️ Identify **churn drivers** from customer demographics & banking issues  
✔️ Detect **high-risk customers** early for proactive retention strategies  
✔️ Provide **data-backed recommendations** for improving banking services  

---

**🛠️ Let’s dive into the analysis! 🚀**

In [1]:
!pip install pandas numpy matplotlib seaborn plotly scikit-learn pyarrow

Defaulting to user installation because normal site-packages is not writeable
Collecting plotly
  Downloading plotly-6.0.1-py3-none-any.whl.metadata (6.7 kB)
Collecting narwhals>=1.15.1 (from plotly)
  Downloading narwhals-1.33.0-py3-none-any.whl.metadata (9.2 kB)
Downloading plotly-6.0.1-py3-none-any.whl (14.8 MB)


[notice] A new release of pip is available: 24.3.1 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
!pip install missingno

Defaulting to user installation because normal site-packages is not writeable
Collecting missingno
  Downloading missingno-0.5.2-py3-none-any.whl.metadata (639 bytes)
Downloading missingno-0.5.2-py3-none-any.whl (8.7 kB)
Installing collected packages: missingno
Successfully installed missingno-0.5.2





In [21]:
# Import core libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Missing values visualization
import missingno as msno

# Modeling utilities used later to impute missing CreditScore values
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer

In [4]:
# Display settings
plt.style.use("ggplot")
sns.set_theme(style="whitegrid")

In [5]:
# Ignore warnings for better readability
import warnings
warnings.filterwarnings("ignore")

In [9]:
# Load dataset
local_storage_path = "C:/Users/Nandan/GenAIProjects/ChurnData.parquet"
df = pd.read_parquet(local_storage_path)

# Display first few rows
df.head(10)

Unnamed: 0,CustomerID,Name,Age,Tenure,Gender,City,EmploymentStatus,CreditScore,Balance,EstimatedSalary,...,LoanAmount,OverdueLoan,LoginsLastMonth,TransactionsMobile,TransactionsBranch,Referrals,FamilyLinkedAccounts,MobileVsBranch,SupportTickets,CustomerComplaints
0,61059466,Misty Rhodes,69,27,Female,Copenhagen,Employed,758.0,232647.03,68055.33,...,21743.82,0,28,1,10,No,No,Mobile,2,0
1,84119748,Jamie Johnson,75,49,Mle,Berlin,Retired,422.0,68105.58,49612.55,...,21087.43,1,6,44,2,No,No,Branch,8,0
2,27358799,Stephen Rodriguez,49,48,Male,Stockholm,Student,629.0,151449.31,107608.61,...,35437.38,0,3,25,5,No,No,Mobile,5,0
3,70823244,Terri Joseph,78,12,Female,Paris,Employed,483.0,192763.97,53729.1,...,44502.37,0,27,49,13,No,No,Branch,8,0
4,87348217,Lauren Beard,35,16,Mle,Paris,Unemployed,,214415.46,97974.8,...,35591.63,0,11,40,1,No,No,Mobile,9,0
5,43945592,Rhonda Harris,67,26,Femle,London,Self-employed,847.0,23792.45,83918.37,...,9801.23,1,2,40,7,No,No,Mobile,3,0
6,59327365,Joseph Taylor,24,20,Female,Berlin,Unemployed,356.0,200092.44,53356.3,...,31123.28,0,13,25,3,No,No,Mobile,5,0
7,76295275,Douglas Martin,30,2,Female,Stockholm,Unemployed,324.0,225325.78,41011.99,...,13082.36,0,17,39,15,No,No,Mobile,1,0
8,66358792,Mike Meza,38,19,Mle,Amsterdam,Self-employed,479.0,61398.02,36306.95,...,17582.38,0,28,29,7,No,No,Mobile,6,0
9,69790702,Matthew Cole,72,26,Mle,Copenhagen,Self-employed,422.0,104257.09,65920.65,...,14723.09,0,23,29,19,No,No,Mobile,3,0
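With roughly a million rows, column dtypes have a noticeable effect on memory use. The sketch below is one possible way to shrink a frame after loading: it stores low-cardinality text columns (such as `Gender` or `City`) as `category` and downcasts integer columns. The helper name `shrink_dtypes` and the `max_categories` threshold are illustrative assumptions, not part of the original notebook, and the demo frame reuses a few values from the preview above.

```python
import pandas as pd


def shrink_dtypes(frame: pd.DataFrame, max_categories: int = 50) -> pd.DataFrame:
    """Return a copy with low-cardinality text columns stored as 'category'
    and integer columns downcast to the smallest integer type that fits.

    Hypothetical helper for illustration; float columns are left untouched
    so that monetary values keep their full precision.
    """
    out = frame.copy()
    for col in out.columns:
        series = out[col]
        is_text = series.dtype == object or pd.api.types.is_string_dtype(series)
        if is_text and series.nunique(dropna=True) <= max_categories:
            out[col] = series.astype("category")
        elif pd.api.types.is_integer_dtype(series):
            out[col] = pd.to_numeric(series, downcast="integer")
    return out


# Small stand-in frame built from values shown in the preview
demo = pd.DataFrame({
    "Age": [69, 75, 49, 78],
    "City": ["Copenhagen", "Berlin", "Stockholm", "Paris"],
    "Balance": [232647.03, 68105.58, 151449.31, 192763.97],
})
small = shrink_dtypes(demo)
print(small.dtypes)
```

On the real data, `df.memory_usage(deep=True).sum()` before and after the call would show whether the conversion is worth keeping.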


In [10]:
# Basic info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1043828 entries, 0 to 1043827
Data columns (total 25 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   CustomerID            1043828 non-null  int64  
 1   Name                  1043828 non-null  object 
 2   Age                   1043828 non-null  int64  
 3   Tenure                1043828 non-null  int64  
 4   Gender                1043828 non-null  object 
 5   City                  1043828 non-null  object 
 6   EmploymentStatus      1043828 non-null  object 
 7   CreditScore           991491 non-null   float64
 8   Balance               970833 non-null   float64
 9   EstimatedSalary       1001915 non-null  float64
 10  Churn                 1043828 non-null  int64  
 11  MonthlyTransactions   1043828 non-null  int32  
 12  MonthlyDeposits       1043828 non-null  float64
 13  MonthlyWithdrawals    1043828 non-null  float64
 14  LoanType              834789 non-n

In [11]:
# Summary statistics
df.describe()

Unnamed: 0,CustomerID,Age,Tenure,CreditScore,Balance,EstimatedSalary,Churn,MonthlyTransactions,MonthlyDeposits,MonthlyWithdrawals,LoanAmount,OverdueLoan,LoginsLastMonth,TransactionsMobile,TransactionsBranch,SupportTickets,CustomerComplaints
count,1043828.0,1043828.0,1043828.0,991491.0,970833.0,1001915.0,1043828.0,1043828.0,1043828.0,1043828.0,834789.0,1043828.0,1043828.0,1043828.0,1043828.0,1043828.0,1043828.0
mean,55040920.0,53.96053,19.23653,575.109488,124987.042923,109924.5,0.1195925,24.50109,9995.659,9998.493,25502.253241,0.08021436,15.0168,24.99549,9.991653,4.501109,0.0502784
std,25978500.0,21.05741,14.16313,158.88329,72140.002541,51996.23,0.3244846,14.42361,5770.942,5772.967,14135.548199,0.2716249,8.940607,14.71909,6.051749,2.872293,0.2185189
min,10000070.0,18.0,0.0,300.0,0.34,20000.13,0.0,0.0,0.007549152,0.01991612,1000.05,0.0,0.0,0.0,0.0,0.0,0.0
25%,32537500.0,36.0,8.0,438.0,62471.8,64800.75,0.0,12.0,5003.381,5002.47,13266.11,0.0,7.0,12.0,5.0,2.0,0.0
50%,55036270.0,54.0,16.0,575.0,124947.16,109890.3,0.0,24.0,9990.176,9995.976,25510.02,0.0,15.0,25.0,10.0,4.0,0.0
75%,77543370.0,72.0,30.0,713.0,187436.61,154846.1,0.0,37.0,14992.89,14993.41,37730.2,0.0,23.0,38.0,15.0,7.0,0.0
max,99999970.0,90.0,50.0,850.0,249999.91,200000.0,1.0,49.0,19999.99,19999.96,49999.9,1.0,30.0,50.0,20.0,9.0,1.0
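Because `Churn` is coded 0/1, its mean in the summary above (about 0.12) is the overall churn rate. The same idea extends to segments: grouping by a categorical column and averaging `Churn` gives a per-segment rate. The following is a minimal sketch on made-up rows; the helper name `churn_rate_by` is an assumption for illustration.

```python
import pandas as pd


def churn_rate_by(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Customer count and churn rate per value of `column`, highest rate first."""
    summary = frame.groupby(column)["Churn"].agg(customers="size", churn_rate="mean")
    return summary.sort_values("churn_rate", ascending=False)


# Made-up rows, not taken from the dataset
demo = pd.DataFrame({
    "City": ["Paris", "Paris", "Berlin", "Berlin", "Berlin"],
    "Churn": [1, 0, 0, 0, 1],
})
rates = churn_rate_by(demo, "City")
print(rates)
```

Applied to the notebook's frame, a call such as `churn_rate_by(df, "EmploymentStatus")` would give a first look at the demographic patterns listed in the outline.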


In [12]:
# Check missing values
missing_values = df.isnull().sum()
missing_values[missing_values > 0]

CreditScore         52337
Balance             72995
EstimatedSalary     41913
LoanType           209039
LoanAmount         209039
dtype: int64
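Raw counts are easier to judge as a share of all rows. Against the 1,043,828 rows reported by `df.info()`, the counts above come to roughly 5% for `CreditScore`, 7% for `Balance`, 4% for `EstimatedSalary`, and 20% for each loan column. A small sketch of a report that computes this directly (the function name `missing_report` is hypothetical, and the demo frame is made up):

```python
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Missing-value count and percentage for every column that has gaps."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Made-up frame: "a" has two gaps, "c" has one, "b" has none
demo = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": [1, 2, 3, 4],
    "c": [None, 2.0, 3.0, 4.0],
})
report = missing_report(demo)
print(report)
```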

In [13]:
df.columns

Index(['CustomerID', 'Name', 'Age', 'Tenure', 'Gender', 'City',
       'EmploymentStatus', 'CreditScore', 'Balance', 'EstimatedSalary',
       'Churn', 'MonthlyTransactions', 'MonthlyDeposits', 'MonthlyWithdrawals',
       'LoanType', 'LoanAmount', 'OverdueLoan', 'LoginsLastMonth',
       'TransactionsMobile', 'TransactionsBranch', 'Referrals',
       'FamilyLinkedAccounts', 'MobileVsBranch', 'SupportTickets',
       'CustomerComplaints'],
      dtype='object')

In [15]:
df['Gender'].unique()

array(['Female', 'Mle', 'Male', 'Femle'], dtype=object)

In [16]:
# Replace incorrect gender entries with correct ones
df['Gender'] = df['Gender'].replace({'Mle': 'Male', 'Femle': 'Female'})

# Verify the changes
print(df['Gender'].unique())

['Female' 'Male']
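A fixed replacement map only handles the misspellings seen so far. One way to make the step safer, sketched below under the assumption that `Male` and `Female` are the only valid labels, is to validate the result and fail loudly when an unexpected value appears. The names `GENDER_FIXES`, `ALLOWED_GENDERS`, and `clean_gender` are illustrative.

```python
import pandas as pd

# Known misspellings observed in the data, and the labels assumed to be valid
GENDER_FIXES = {"Mle": "Male", "Femle": "Female"}
ALLOWED_GENDERS = {"Male", "Female"}


def clean_gender(series: pd.Series) -> pd.Series:
    """Strip whitespace, apply known fixes, and reject any unexpected label."""
    cleaned = series.str.strip().replace(GENDER_FIXES)
    unexpected = set(cleaned.dropna().unique()) - ALLOWED_GENDERS
    if unexpected:
        raise ValueError(f"Unexpected gender labels: {sorted(unexpected)}")
    return cleaned


demo = pd.Series(["Female", "Mle", "Male", "Femle"])
cleaned = clean_gender(demo)
print(cleaned.tolist())
```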


In [None]:
df['EmploymentStatus'].unique()

array(['Employed', 'Retired', 'Student', 'Unemployed', 'Self-employed'],
      dtype=object)

In [18]:
df['LoanType'].unique()

array(['Auto', 'Student', 'Mortgage', 'Personal', None], dtype=object)

In [19]:
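# Treat missing loan fields as "customer has no loan".
# Assign the result back instead of using inplace=True on a column selection,
# which may not update df in recent pandas versions.
df["LoanType"] = df["LoanType"].fillna("No Loan")
df["LoanAmount"] = df["LoanAmount"].fillna(0)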

In [20]:
# Fill missing balances with 0 (assign back rather than using inplace=True)
df["Balance"] = df["Balance"].fillna(0)
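Filling `Balance` with 0 introduces a value below the observed minimum (0.34 in the summary statistics), so filled rows become indistinguishable from genuine near-empty accounts unless they are marked. A possible refinement, shown here on made-up rows, is to record a missing-value flag before filling. The `BalanceMissing` column name is an assumption for illustration.

```python
import pandas as pd

# Made-up rows: two known balances and two gaps
demo = pd.DataFrame({"Balance": [232647.03, None, 151449.31, None]})

# Record which rows were missing before overwriting them
demo["BalanceMissing"] = demo["Balance"].isna().astype("int8")
demo["Balance"] = demo["Balance"].fillna(0)

print(demo)
```

The flag lets later analysis compare churn between customers with and without a recorded balance, and lets a model treat the filled zeros differently from real ones.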

In [22]:
# Choose predictor features for imputing CreditScore, then split the rows into
# those with a known score (training) and those with a missing score (to predict)
features = ["Balance", "EstimatedSalary", "Age", "Tenure", "LoanAmount"]
df_train = df[df["CreditScore"].notnull()]
df_pred = df[df["CreditScore"].isnull()]

In [None]:
# Prepare training data
X_train = df_train[features]
y_train = df_train["CreditScore"]

In [24]:
# Handle missing values in features
imputer = SimpleImputer(strategy="median")
X_train = imputer.fit_transform(X_train)

In [26]:
# Train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

MemoryError: could not allocate 134217728 bytes
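The `MemoryError` is plausible for 100 fully grown trees fitted on about 991,000 rows, since unrestricted trees can grow very large. One possible workaround, not the notebook's original configuration, is to constrain the forest: fewer trees, a depth cap, a minimum leaf size, and a per-tree row subsample via `max_samples`. The sketch below uses synthetic data and illustrative parameter values that would need tuning on the real frame.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the five predictor features and the CreditScore target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(5_000, 5))
y_demo = 575 + 40 * X_demo[:, 0] + rng.normal(scale=10, size=5_000)

# Parameter values are assumptions chosen to limit memory, not tuned settings
lean_model = RandomForestRegressor(
    n_estimators=30,       # fewer trees to hold in memory
    max_depth=10,          # cap tree depth so trees stay small
    min_samples_leaf=50,   # avoid one leaf per training row
    max_samples=0.2,       # each tree sees a 20% bootstrap sample
    n_jobs=1,              # a single worker keeps peak memory lower
    random_state=42,
)
lean_model.fit(X_demo, y_demo)
preds = lean_model.predict(X_demo[:5])
print(preds)
```

If memory is still tight, fitting on a random sample of the training rows (for example with `df_train.sample(...)`) is a further option, at some cost in accuracy.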

In [None]:
# Predict missing CreditScore values
X_pred = imputer.transform(df_pred[features])
df.loc[df["CreditScore"].isnull(), "CreditScore"] = model.predict(X_pred)

In [None]:
# Check if all missing values are filled
print(df["CreditScore"].isnull().sum())
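Beyond counting remaining nulls, it can help to keep track of which rows were imputed and to confirm that the filled values stay inside the observed 300 to 850 range from the summary statistics. The sketch below shows the idea on made-up rows; the hard-coded fill values stand in for `model.predict(...)`, and the `CreditScoreImputed` column name is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up rows with two missing scores
demo = pd.DataFrame({"CreditScore": [758.0, 422.0, np.nan, 483.0, np.nan]})

# Capture the mask before filling so imputed rows stay identifiable
imputed_mask = demo["CreditScore"].isna()
demo.loc[imputed_mask, "CreditScore"] = [575.0, 601.5]  # placeholder predictions
demo["CreditScoreImputed"] = imputed_mask.astype("int8")

# Sanity checks: nothing left missing, everything inside the observed range
remaining_nulls = int(demo["CreditScore"].isna().sum())
in_range = bool(demo["CreditScore"].between(300, 850).all())

# Compare observed and imputed scores side by side
summary = demo.groupby("CreditScoreImputed")["CreditScore"].agg(["count", "mean"])
print(remaining_nulls, in_range)
print(summary)
```

Model-based imputation tends to pull filled values toward the center of the distribution, so comparing the two groups is a quick way to see how much the imputed scores differ from the observed ones.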