
# 🏦 Bank Customer Segmentation & Regional Transaction Volume Forecasting
### AI in Finance | Economics & Business Analytics | K-Means Clustering + Linear Regression

---

## 📌 Business Problem Statement

Banks need to understand a diverse customer base and anticipate how transaction activity will develop across regions. This project addresses two key questions:

1. **Customer Segmentation**: Who are our customers? What behavioral patterns define distinct clusters? The notebook works with location-level aggregates, so Section 5 segments locations rather than individual customers.
2. **Revenue Forecasting**: How will transaction value and volume develop by region over time?

By answering these questions, banks can optimize:
- **Pricing Strategy**: Tailor fees and interest rates per segment
- **Risk Management**: Identify high-risk segments prone to default or attrition
- **Resource Allocation**: Direct marketing and operational budgets to high-value regions
- **Demand-Supply Optimization**: Match banking services supply with regional demand forecasts

## 🔗 Dataset
[Massive Bank Dataset (1 Million Rows)](https://www.kaggle.com/datasets/ksabishek/massive-bank-dataset-1-million-rows)


## 📦 1. Install & Import Libraries

In [None]:
# Install required packages (uncomment in Colab)
# !pip install kaggle scikit-learn plotly seaborn matplotlib pandas numpy

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# ML
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import silhouette_score, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.2f}'.format)
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.family'] = 'DejaVu Sans'

print("✅ All libraries imported successfully!")

## 📥 2. Data Loading

In [None]:
import pandas as pd
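# Load the raw CSV (expected in the working directory)
df = pd.read_csv('bankdataset.csv')
# format='mixed' infers the date format per value; it requires pandas >= 2.0
df['Date'] = pd.to_datetime(df['Date'], format='mixed')
# Calendar features used later for aggregation
df['Year'] = df['Date'].dt.year
df['Quarter'] = df['Date'].dt.quarter
df['Month'] = df['Date'].dt.month
# Average value per transaction (inf or NaN if Transaction_count is 0)
df['AvgTxnValue'] = (df['Value'] / df['Transaction_count']).round(2)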
print(f"✅ Dataset loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
df.head()
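
The `AvgTxnValue` division above produces `inf` or `NaN` for any row whose `Transaction_count` is 0. Whether such rows occur in this dataset is not checked here, so the following is only a minimal sketch of a guarded version. The `toy` frame and its values are made up for illustration; only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with the notebook's column names; the values are made up.
toy = pd.DataFrame({
    "Value": [1000.0, 250.0, 0.0],
    "Transaction_count": [4, 0, 0],
})

# Replace zero counts with NaN so the division yields NaN instead of inf.
safe_count = toy["Transaction_count"].replace(0, np.nan)
toy["AvgTxnValue"] = (toy["Value"] / safe_count).round(2)

print(toy)
```

Rows with a missing `AvgTxnValue` can then be dropped or imputed before aggregation, depending on how they should count toward a location's average.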

## 🧹 3. Data Cleaning & Preprocessing

In [None]:
print("=" * 60)
print("📊 DATASET OVERVIEW")
print("=" * 60)
print(f"Shape: {df.shape}")
print("\n📌 Missing Values:")
print(df.isnull().sum())

df_clean = df.copy()
# Drop rows that are exact duplicates across all columns
before = len(df_clean)
df_clean.drop_duplicates(inplace=True)
print(f"🔁 Duplicates removed: {before - len(df_clean)}")

## 📊 4. Exploratory Data Analysis (EDA)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# 4.1 Top Locations by Value
loc_stats = df_clean.groupby('Location')['Value'].sum().reset_index().sort_values('Value', ascending=False)
fig = px.bar(loc_stats.head(10), x='Location', y='Value', title='Top 10 Locations by Processing Volume')
fig.show()

In [None]:
# 4.2 Domain Distribution
dom_stats = df_clean.groupby('Domain')['Transaction_count'].sum().reset_index()
fig2 = px.pie(dom_stats, names='Domain', values='Transaction_count', title='Transaction Count by Domain')
fig2.show()

## 🤖 5. K-Means Location Segmentation

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

loc_features = df_clean.groupby('Location').agg(
    TotalValue=('Value', 'sum'),
    TotalTxns=('Transaction_count', 'sum'),
    AvgTxnValue=('AvgTxnValue', 'mean'),
    DistinctDomains=('Domain', 'nunique')
).reset_index()

scaler = StandardScaler()
X_scaled = scaler.fit_transform(loc_features[['TotalValue', 'TotalTxns', 'AvgTxnValue', 'DistinctDomains']])

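# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
loc_features['Cluster'] = kmeans.fit_predict(X_scaled)

# NOTE: K-Means cluster ids are arbitrary, so this id-to-name mapping is a placeholder;
# inspect each cluster's mean features before relying on the names
cluster_names = {0:'Premium Markets', 1:'Volume Hubs', 2:'Balanced Zones', 3:'Emerging Areas'}
loc_features['Segment'] = loc_features['Cluster'].map(cluster_names)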

print("✅ Clustering Complete. Segments:")
print(loc_features['Segment'].value_counts())
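
The cell above fixes `n_clusters=4` without checking that choice, although `silhouette_score` is imported in Section 1. One common way to compare candidate values of k is to compute the silhouette score for each. The sketch below does this on synthetic blobs, not on the bank data; `X_demo`, the blob centers, and the range of k are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the location features: three well-separated blobs.
demo_centers = [[0, 0, 0, 0], [10, 10, 0, 0], [0, 10, 10, 10]]
X_demo, _ = make_blobs(n_samples=60, centers=demo_centers,
                       cluster_std=0.5, random_state=42)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit K-Means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_demo_scaled)
    scores[k] = silhouette_score(X_demo_scaled, labels)

# Higher silhouette means tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k:", best_k)
```

On the real data the same loop would run over `X_scaled`, with the upper bound on k kept below the number of distinct locations.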

## 📉 6. Linear Regression — Time Series Forecasting

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder

reg_df = df_clean.groupby(['Location', 'Domain', 'Year', 'Quarter']).agg(
    TotalValue=('Value', 'sum'),
    TotalTxns=('Transaction_count', 'sum')
).reset_index()

reg_df['TimeIndex'] = (reg_df['Year'] - reg_df['Year'].min())*4 + reg_df['Quarter']

le_loc = LabelEncoder()
le_dom = LabelEncoder()
reg_df['LocEnc'] = le_loc.fit_transform(reg_df['Location'])
reg_df['DomEnc'] = le_dom.fit_transform(reg_df['Domain'])

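# Use a separate scaler: refitting `scaler` here would overwrite the fit used for clustering
reg_scaler = StandardScaler()
X_reg = reg_scaler.fit_transform(reg_df[['TimeIndex', 'LocEnc', 'DomEnc', 'TotalTxns']])
y_reg = reg_df['TotalValue']

lr = LinearRegression()
lr.fit(X_reg, y_reg)
# This R^2 is computed on the training rows, so it overstates out-of-sample accuracy
print(f"✅ Linear Regression Trained. R^2 Score on training data: {lr.score(X_reg, y_reg):.4f}")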
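
The R^2 printed above is measured on the same rows the model was fitted to, and `LabelEncoder` turns `Location` and `Domain` into arbitrary integers that a linear model treats as ordered magnitudes. The sketch below shows one possible alternative, not the notebook's method: one-hot encoding for the categorical columns and a chronological hold-out, so the score reflects quarters the model has not seen. The `demo` frame is synthetic; only its column names come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic quarterly panel with the notebook's column names; values are made up.
rng = np.random.default_rng(0)
rows = []
for loc_i, loc in enumerate(["A", "B", "C"]):
    for dom_i, dom in enumerate(["RETAIL", "PUBLIC"]):
        for t in range(1, 13):
            txns = 100 + 20 * loc_i + 5 * t + int(rng.integers(0, 10))
            value = 50.0 * txns + 1000 * dom_i + rng.normal(0, 50)
            rows.append((loc, dom, t, txns, value))
demo = pd.DataFrame(
    rows, columns=["Location", "Domain", "TimeIndex", "TotalTxns", "TotalValue"]
)

features = ["TimeIndex", "Location", "Domain", "TotalTxns"]
# Chronological split: fit on earlier quarters, evaluate on the latest ones.
train = demo[demo["TimeIndex"] <= 9]
test = demo[demo["TimeIndex"] > 9]

# Scale numeric columns and one-hot encode the categorical ones.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["TimeIndex", "TotalTxns"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Location", "Domain"]),
])
model = Pipeline([("pre", pre), ("lr", LinearRegression())])
model.fit(train[features], train["TotalValue"])

pred = model.predict(test[features])
test_r2 = r2_score(test["TotalValue"], pred)
test_rmse = np.sqrt(mean_squared_error(test["TotalValue"], pred))
print(f"hold-out R^2: {test_r2:.4f}, RMSE: {test_rmse:.2f}")
```

Note that `TotalTxns` for a future quarter is not known in advance, so a model used for genuine forecasting would need lagged values or a separate forecast of that feature.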

## 💼 7. Business Interpretation of Results

---

### 7.1 Location Clusters
- **Premium Markets**: High average transaction value with lower transaction counts. Focus on premium services.
- **Volume Hubs**: Very high transaction counts with lower average values. Focus on operational efficiency.
- **Balanced Zones**: Mid-range value and volume, with neither dominating.
- **Emerging Areas**: Low current volume. These are candidate growth markets, although the clustering features do not measure growth directly.
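
K-Means assigns cluster ids arbitrarily, so the fixed `cluster_names` dictionary in Section 5 matches the descriptions above only if the ids happen to line up. A minimal sketch of deriving the names from cluster averages instead is shown below. The `profile_demo` frame and the naming rules are assumptions made for illustration, not results from the bank data.

```python
import pandas as pd

# Hypothetical per-location features and cluster ids; values are made up.
profile_demo = pd.DataFrame({
    "Location": ["L1", "L2", "L3", "L4", "L5", "L6", "L7", "L8"],
    "TotalTxns": [900, 950, 300, 320, 600, 620, 100, 120],
    "AvgTxnValue": [50, 55, 400, 420, 200, 210, 60, 65],
    "Cluster": [2, 2, 0, 0, 3, 3, 1, 1],
})

# Per-cluster averages in original units.
centroids = profile_demo.groupby("Cluster")[["TotalTxns", "AvgTxnValue"]].mean()

# Illustrative naming rules (assumptions): pick each name from the
# clusters that have not been named yet.
names = {}
remaining = centroids.copy()
premium = remaining["AvgTxnValue"].idxmax()   # highest average value
names[premium] = "Premium Markets"
remaining = remaining.drop(premium)
hub = remaining["TotalTxns"].idxmax()         # highest transaction count
names[hub] = "Volume Hubs"
remaining = remaining.drop(hub)
emerging = remaining["TotalTxns"].idxmin()    # lowest transaction count
names[emerging] = "Emerging Areas"
remaining = remaining.drop(emerging)
names[remaining.index[0]] = "Balanced Zones"  # whatever is left

profile_demo["Segment"] = profile_demo["Cluster"].map(names)
print(centroids)
print(profile_demo[["Location", "Cluster", "Segment"]])
```

With rules like these, rerunning the clustering with a different seed changes the ids but not which locations carry which segment name.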


In [None]:
import joblib
joblib.dump(kmeans, 'kmeans_model.pkl')
# Each saved model needs the scaler that was fitted on its own features
joblib.dump(scaler, 'scaler.pkl')
joblib.dump(lr, 'lr_model.pkl')
print("✅ Saved models for deployment.")
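
Saving a scaler and a model as separate files makes it easy to pair the wrong ones at load time. One option, sketched here under the assumption that a scikit-learn `Pipeline` suits the deployment target, is to fit the scaler and the model together and persist them as a single object. The data and the file name `reg_pipeline.pkl` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numeric data standing in for the regression features.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + 3.0

# The pipeline keeps the scaler and the model fitted together as one object.
reg_pipeline = make_pipeline(StandardScaler(), LinearRegression())
reg_pipeline.fit(X_toy, y_toy)

# Round-trip through joblib; a temp dir is used so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reg_pipeline.pkl")  # hypothetical file name
    joblib.dump(reg_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline scales and predicts in one call.
same = np.allclose(restored.predict(X_toy), reg_pipeline.predict(X_toy))
print("predictions match after reload:", same)
```

The same pattern applies to the clustering step by combining `StandardScaler` and `KMeans` in one pipeline.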