# 🏦 Bank Customer Segmentation & Regional Transaction Volume Forecasting

### AI in Finance | Economics & Business Analytics

**Techniques Used:** K-Means Clustering · Linear Regression · Logistic Regression

---

This notebook includes:

- Revenue & Time-Series Analysis
- Regional & Domain-wise EDA (Exploratory Data Analysis)
- K-Means Clustering for Market Segmentation
- Linear Regression for Revenue Forecasting
- Logistic Regression for High-Volume Classification
- Business Interpretation with Strategic Recommendations

---

## 📌 Business Problem Statement

Banks face the challenge of understanding diverse customer bases and predicting future revenue streams across different regions. This project addresses two key questions:

1. **Customer Segmentation**: Who are our customers? What behavioral patterns define distinct location clusters?
2. **Revenue Forecasting**: How will transaction value (revenue) develop by region in upcoming quarters?

By answering these questions, banks can optimize:
- **Pricing Strategy**: Tailor fees and interest rates per segment
- **Risk Management**: Identify high-risk segments prone to default or attrition
- **Resource Allocation**: Direct marketing and operational budgets to high-value regions
- **Demand-Supply Optimization**: Match banking services supply with regional demand forecasts

## 🔗 Dataset
[Massive Bank Dataset (1 Million Rows)](https://www.kaggle.com/datasets/ksabishek/massive-bank-dataset-1-million-rows)

*(Using the locally cleansed `bankdataset.csv` for specific domain and transaction mapping)*

---

## 📦 1. Install & Import Libraries

> **Why these libraries?**
> - `pandas` & `numpy`: Data manipulation and numerical computation
> - `matplotlib`, `seaborn` & `plotly`: Visualization for EDA
> - `sklearn`: Machine Learning models (KMeans, Linear/Logistic Regression, StandardScaler)
> - `joblib`: Saving trained models for deployment via Streamlit

In [None]:
# Install required packages (uncomment in Colab)
# !pip install scikit-learn plotly seaborn matplotlib pandas numpy joblib

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Machine Learning
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    silhouette_score, mean_squared_error, r2_score,
    classification_report, confusion_matrix, accuracy_score
)
from sklearn.model_selection import train_test_split
import joblib

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.2f}'.format)
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ All libraries imported successfully!")

---

## 📥 2. Data Loading

We load the bank transaction dataset, which contains the **Date, Domain, Location, Value, and Transaction_count** columns.

We also engineer additional time-series features (`Year`, `Quarter`, `Month`) and compute `AvgTxnValue` (average value per transaction).

In [None]:
df = pd.read_csv('bankdataset.csv')
# format='mixed' infers the format for each element individually; it requires pandas >= 2.0
df['Date'] = pd.to_datetime(df['Date'], format='mixed')
df['Year'] = df['Date'].dt.year
df['Quarter'] = df['Date'].dt.quarter
df['Month'] = df['Date'].dt.month
df['YearMonth'] = df['Date'].dt.to_period('M')
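# Guard against division by zero: rows with Transaction_count == 0 get NaN instead of inf
df['AvgTxnValue'] = (df['Value'] / df['Transaction_count'].replace(0, np.nan)).round(2)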

print(f"✅ Dataset loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"   Date Range: {df['Date'].min().date()} to {df['Date'].max().date()}")
print(f"   Locations: {df['Location'].nunique()} | Domains: {df['Domain'].nunique()}")
df.head()

---

## 🧹 3. Data Cleaning & Preprocessing

Before any analysis, we must ensure the data is clean:
1. Check for **null/missing values**
2. Remove **duplicate records**
3. Verify **data types** are correct

In [None]:
print("=" * 60)
print("📊 DATASET OVERVIEW")
print("=" * 60)
print(f"\nShape: {df.shape}")
print(f"\nData Types:\n{df.dtypes}")
print(f"\n📌 Missing Values:")
print(df.isnull().sum())
print(f"\n📌 Descriptive Statistics:")
df.describe()

In [None]:
df_clean = df.copy()

# Remove exact duplicates
before = len(df_clean)
df_clean.drop_duplicates(inplace=True)
print(f"🔁 Duplicates removed: {before - len(df_clean):,}")
print(f"✅ Clean dataset shape: {df_clean.shape}")
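
The three checks listed at the start of this section can also be bundled into one helper. The sketch below is illustrative only: `quality_report` is a hypothetical name, and the toy frame stands in for `bankdataset.csv` with made-up values.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Count null cells, exact duplicate rows, and non-positive transaction counts."""
    return {
        "null_cells": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "nonpositive_counts": int((frame["Transaction_count"] <= 0).sum()),
    }

# Toy data standing in for bankdataset.csv (values are made up)
toy = pd.DataFrame({
    "Location": ["A", "A", "B", "C"],
    "Value": [100.0, 100.0, None, 50.0],
    "Transaction_count": [10, 10, 5, 0],
})
report = quality_report(toy)
print(report)
```

In the notebook, calling `quality_report(df)` before and `quality_report(df_clean)` after cleaning would show what the cleaning step actually changed.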

---

## 📊 4. Exploratory Data Analysis (EDA)

EDA helps us understand patterns, anomalies, and relationships in data before building models.

We explore:
1. **Monthly Revenue Trend**: Time-series view of total transaction value (macroeconomic flow)
2. **Top Locations by Value**: Geographic concentration of revenue
3. **Domain Distribution**: Which business sectors dominate?
4. **Domain Revenue Over Time**: How sectors compete quarterly
5. **Correlation Heatmap**: How the numeric features relate to each other

### 💬 Business Case Study Questions
- Which location generates the highest processing volume? → *Direct capital allocation*
- Which business domain dominates transaction counts? → *Infrastructure scaling decisions*
- How do we optimize infrastructure capacity based on regional hubs? → *Demand-Supply planning*
- Should the company focus on High-Value low-count regions or High-Count low-value regions? → *Pricing strategy*

In [None]:
# 4.1 Monthly Revenue Trend (Time Series)
monthly_sales = df_clean.groupby('YearMonth')['Value'].sum()
plt.figure(figsize=(14, 5))
monthly_sales.plot(kind='line', marker='o', color='#2c3e50', linewidth=2)
plt.title('Monthly Revenue Trend (Macroeconomic Flow)', fontsize=14, fontweight='bold')
plt.ylabel('Total Value (₹)')
plt.xlabel('Month')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\n📊 Business Insight: The line chart shows monthly revenue flow.")
print("   → Spikes indicate seasonal demand shifts requiring operational scaling.")
print("   → Dips may signal market slowdowns or reduced banking activity periods.")

In [None]:
# 4.2 Top 10 Locations by Processing Volume
loc_stats_eda = df_clean.groupby('Location')['Value'].sum().reset_index()
loc_stats_eda = loc_stats_eda.sort_values('Value', ascending=False)

fig = px.bar(
    loc_stats_eda.head(10), x='Location', y='Value',
    title='Top 10 Locations by Processing Volume (₹)',
    color_discrete_sequence=['#2c3e50']
)
fig.show()

print("\n📊 Business Insight: Geographic Revenue Concentration")
print("   → Check whether a few locations contribute a disproportionate share of total revenue.")
print("   → If they do, this resembles the Pareto Principle (80/20 rule) applied to banking geography.")
print("   → Strategy: Focus premium services and dedicated relationship managers in top hubs.")
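
Whether the 80/20 pattern actually holds can be checked rather than assumed. The following is a minimal sketch on made-up totals; `top_share` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def top_share(totals: pd.Series, top_frac: float = 0.2) -> float:
    """Share of the grand total contributed by the top `top_frac` of entries."""
    ranked = totals.sort_values(ascending=False)
    n_top = max(1, int(round(len(ranked) * top_frac)))
    return float(ranked.iloc[:n_top].sum() / ranked.sum())

# Made-up per-location totals (10 locations, summing to 1000)
toy_totals = pd.Series({"A": 500, "B": 300, "C": 80, "D": 50, "E": 40,
                        "F": 10, "G": 8, "H": 6, "I": 4, "J": 2})
share = top_share(toy_totals, top_frac=0.2)
print(f"Top 20% of locations hold {share:.0%} of total value")
```

On the real data, something like `top_share(loc_stats_eda.set_index('Location')['Value'])` would give the corresponding figure; a value far below 0.8 would mean revenue is spread more evenly than the Pareto reading suggests.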

In [None]:
# 4.3 Domain Distribution (Transaction Count)
dom_stats = df_clean.groupby('Domain')['Transaction_count'].sum().reset_index()

fig2 = px.pie(
    dom_stats, names='Domain', values='Transaction_count',
    title='Transaction Count by Business Domain',
    color_discrete_sequence=px.colors.qualitative.Set2
)
fig2.show()

print("\n📊 Business Insight: Domain Market Share")
print("   → Domains with the highest transaction counts need the most server capacity.")
print("   → Domains with fewer but higher-value transactions may be more profitable per unit.")

In [None]:
# 4.4 Domain Revenue Over Time (Stacked Bar)
dom_time = df_clean.groupby(['Year', 'Quarter', 'Domain'])['Value'].sum().reset_index()
dom_time['Period'] = dom_time['Year'].astype(str) + ' Q' + dom_time['Quarter'].astype(str)

fig3 = px.bar(
    dom_time, x='Period', y='Value', color='Domain',
    barmode='stack', title='Domain Revenue Over Time (Quarterly)',
    color_discrete_sequence=px.colors.qualitative.Set2
)
fig3.show()

print("\n📊 Business Insight: Temporal Market Share Dynamics")
print("   → Watch for domains gaining or losing share over quarters.")
print("   → Growing domains = investment opportunity. Shrinking = risk signal.")

In [None]:
# 4.5 Correlation Heatmap
plt.figure(figsize=(8, 5))
numeric_cols = df_clean[['Value', 'Transaction_count', 'AvgTxnValue', 'Year', 'Quarter', 'Month']]
sns.heatmap(numeric_cols.corr(), annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Feature Correlation Matrix', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n📊 Business Insight: Feature Relationships")
print("   → A strong correlation between Value and Transaction_count would indicate volume-driven revenue.")
print("   → Weak correlations with Year/Quarter/Month indicate no strong linear time trend; seasonal patterns can still exist.")

---

## 🤖 5. K-Means Clustering — Location Market Segmentation

### 💡 Why Do We Need StandardScaler?

Many ML algorithms are sensitive to feature scale: K-Means is **distance-based**, and Logistic Regression is typically fit with **gradient-based**, regularized optimization.
If features are on different scales, the algorithm becomes biased toward features with larger magnitudes.

**Example:**
- `TotalTxns` → values might be ~1,000
- `TotalValue` → values might be ~50,000,000

If we apply KMeans directly, `TotalValue` **dominates** the distance calculation and `TotalTxns` has almost no influence on the resulting clusters.

**What StandardScaler Does:**
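It rescales each feature to **zero mean and unit standard deviation** (Mean = 0, Std = 1) by subtracting the feature's mean and dividing by its standard deviation, so all features contribute on a comparable scale. It does not change the shape of a feature's distribution, so a skewed feature stays skewed.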
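
The rescaling can be reproduced by hand. This sketch uses made-up numbers in the same ranges as the example above and compares `StandardScaler` with a manual z-score, $z = (x - \mu) / \sigma$, computed per column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up location features on very different scales
# columns: TotalTxns, TotalValue
X = np.array([
    [1_000, 50_000_000],
    [1_200, 10_000_000],
    [800, 90_000_000],
    [1_500, 30_000_000],
], dtype=float)

X_std = StandardScaler().fit_transform(X)

# Manual z-score per column (population standard deviation, ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("Matches manual z-score:", np.allclose(X_std, X_manual))
print("Column means:", X_std.mean(axis=0).round(6))
print("Column stds: ", X_std.std(axis=0).round(6))
```

Because the transformation is linear within each column, the ordering and relative spacing of values inside a feature are preserved; only the location and scale change.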

### 🎯 Clustering Goal
We aggregate transaction data per location and cluster locations into 4 market segments:
- **Premium Markets**: High value, lower volumes → Focus on elite services
- **Volume Hubs**: Massive counts, lower avg values → Focus on operational efficiency
- **Balanced Zones**: Steady combinations of value and volume
- **Emerging Areas**: Low current volume but high growth opportunity

In [None]:
# Aggregate location-level features (same 5 features as app.py)
loc_features = df_clean.groupby('Location').agg(
    TotalValue=('Value', 'sum'),
    TotalTxns=('Transaction_count', 'sum'),
    AvgTxnValue=('AvgTxnValue', 'mean'),
    DistinctDomains=('Domain', 'nunique'),
    DaysActive=('Date', 'nunique')
).reset_index()

print(f"✅ Location features computed for {len(loc_features)} locations")
loc_features.head(10)

In [None]:
# Scale features using StandardScaler
feats = ['TotalValue', 'TotalTxns', 'AvgTxnValue', 'DistinctDomains', 'DaysActive']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(loc_features[feats])

# Fit KMeans with 4 clusters
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
loc_features['Cluster'] = kmeans.fit_predict(X_scaled)

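# Map cluster numbers to meaningful business segment names.
# NOTE: K-Means label ids are arbitrary, so verify this mapping against the
# cluster profile computed in the next cell before relying on the names.
cluster_names = {0: 'PREMIUM MARKETS', 1: 'VOLUME HUBS', 2: 'BALANCED ZONES', 3: 'EMERGING AREAS'}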
loc_features['Segment'] = loc_features['Cluster'].map(cluster_names)

print("✅ K-Means Clustering Complete!")
print(f"\n📊 Silhouette Score: {silhouette_score(X_scaled, loc_features['Cluster']):.4f}")
print("   (Ranges from -1 to 1. Higher = better separated clusters.)")
print(f"\n📊 Segment Distribution:")
print(loc_features['Segment'].value_counts())
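
The choice of 4 clusters is taken as given here. One common way to sanity-check it is to compare silhouette scores across several values of k. The sketch below does this on synthetic blobs (an assumption for illustration, standing in for `X_scaled`); on the real data the loop would run over `X_scaled`, with k kept below the number of locations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_scaled: three well-separated groups
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[-6, -6], [0, 6], [6, -6]],
    cluster_std=0.8,
    random_state=42,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("Best k by silhouette:", best_k)
```

The silhouette score is only one criterion; a business-driven k can still be reasonable if the resulting segments are interpretable.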

In [None]:
# Cluster Profile Summary
cluster_profile = loc_features.groupby('Segment')[['TotalValue', 'TotalTxns', 'AvgTxnValue', 'DistinctDomains', 'DaysActive']].mean().round(2)
print("📊 Average Profile per Segment:")
cluster_profile
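
Because K-Means assigns label ids arbitrarily, segment names are safer when derived from the cluster profile than from the id. The sketch below shows one possible heuristic; `name_segments` and its rules are hypothetical, assume exactly 4 clusters, and use made-up cluster means.

```python
import pandas as pd

def name_segments(profile: pd.DataFrame) -> dict:
    """Map cluster ids to segment names from a per-cluster mean profile.

    Hypothetical heuristic: expects exactly 4 clusters and the columns
    TotalValue, TotalTxns, AvgTxnValue.
    """
    remaining = profile.copy()
    names = {}
    # Highest average transaction value -> premium
    premium = remaining["AvgTxnValue"].idxmax()
    names[premium] = "PREMIUM MARKETS"
    remaining = remaining.drop(index=premium)
    # Highest transaction count among the rest -> volume hub
    hub = remaining["TotalTxns"].idxmax()
    names[hub] = "VOLUME HUBS"
    remaining = remaining.drop(index=hub)
    # Lowest total value among the rest -> emerging
    emerging = remaining["TotalValue"].idxmin()
    names[emerging] = "EMERGING AREAS"
    remaining = remaining.drop(index=emerging)
    # Whatever is left -> balanced
    names[remaining.index[0]] = "BALANCED ZONES"
    return names

# Made-up cluster means, indexed by the raw KMeans label
toy_profile = pd.DataFrame(
    {
        "TotalValue": [4e7, 9e7, 1e7, 6e7],
        "TotalTxns": [9e4, 2e4, 1e4, 5e4],
        "AvgTxnValue": [450.0, 4500.0, 1000.0, 1200.0],
    },
    index=[0, 1, 2, 3],
)
segment_names = name_segments(toy_profile)
print(segment_names)
```

In the notebook, the input would be something like `loc_features.groupby('Cluster')[feats].mean()`, and the returned dictionary could replace a hard-coded id-to-name mapping.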

In [None]:
# Visualize clusters using PCA (2D projection)
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X_scaled)
pca_df = pd.DataFrame(coords, columns=['PC1', 'PC2'])
pca_df['Segment'] = loc_features['Segment']
pca_df['Location'] = loc_features['Location']

fig = px.scatter(
    pca_df, x='PC1', y='PC2', color='Segment', text='Location',
    title='K-Means Clusters (PCA 2D Projection)',
    opacity=0.8, color_discrete_sequence=px.colors.qualitative.Set1
)
fig.update_traces(marker_size=10, textposition='top center', textfont_size=8)
fig.show()

print("\n📊 Business Insight: Cluster Visualization")
print("   → Well-separated clusters in this projection suggest distinct market segments.")
print("   → Overlap may mean similar locations, or separation along dimensions the 2D projection hides.")
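
How much to trust a 2D PCA plot depends on how much variance the two components retain. This is a small sketch on a synthetic 5-feature matrix (a stand-in for `X_scaled`, not the real data) that reads this from `explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic 5-feature matrix; the first two features are strongly related
base = rng.normal(size=(60, 1))
X_demo = np.hstack([
    base,
    base * 2 + rng.normal(scale=0.1, size=(60, 1)),
    rng.normal(size=(60, 3)),
])
X_demo = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=2, random_state=42).fit(X_demo)
retained = pca_demo.explained_variance_ratio_.sum()
print(f"Variance retained by PC1 + PC2: {retained:.1%}")
```

In the notebook the same attribute is available on the fitted `pca` object. If the retained share is low, apparent overlap in the scatter plot may not reflect overlap in the full feature space.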

---

## 📉 6. Linear Regression — Revenue Volume Forecasting

### Business Question:
*Can we predict how much revenue a specific Location + Domain combination will generate in the next quarter, given its transaction volume and time trend?*

We use **Linear Regression** because:
- Our target (`TotalValue`) is a continuous variable.
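- We want to understand the **linear relationship** between time, location, domain, transaction count, and revenue.
- Each model coefficient tells us how much the predicted revenue changes when that feature increases by one unit and the others stay fixed. Because the features are standardized before fitting, one unit here means one standard deviation of the feature.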
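
Since the notebook standardizes features before fitting, the fitted coefficients are expressed per standard deviation. The sketch below, on synthetic noise-free data, shows how dividing by the scaler's `scale_` converts them back to per-original-unit effects. The feature ranges and the true coefficients (1000 and 50) are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features: a time index and a transaction count (made-up ranges)
X = np.column_stack([
    rng.integers(1, 9, size=200),
    rng.integers(100, 5000, size=200),
]).astype(float)
y = 1_000 * X[:, 0] + 50 * X[:, 1] + 20_000  # exact linear relation, no noise

sc = StandardScaler().fit(X)
model = LinearRegression().fit(sc.transform(X), y)

per_std = model.coef_                # change in y per one standard deviation
per_unit = model.coef_ / sc.scale_   # change in y per one original unit
print("Per standard deviation:", per_std.round(2))
print("Per original unit:     ", per_unit.round(2))
```

The per-unit values recover the 1000 and 50 used to build `y`, which is what makes them easier to communicate to a business audience than the standardized coefficients.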

In [None]:
# Build quarterly aggregated data for regression
reg_df = df_clean.groupby(['Location', 'Domain', 'Year', 'Quarter']).agg(
    TotalValue=('Value', 'sum'),
    TotalTxns=('Transaction_count', 'sum')
).reset_index()

# Create a numeric time index (quarter number from the start)
reg_df['TimeIndex'] = (reg_df['Year'] - reg_df['Year'].min()) * 4 + reg_df['Quarter']

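# Encode categorical variables
# NOTE: LabelEncoder assigns arbitrary integer codes, which a linear model treats
# as an ordered numeric scale; one-hot encoding avoids that assumption.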
le_loc = LabelEncoder()
le_dom = LabelEncoder()
reg_df['LocEnc'] = le_loc.fit_transform(reg_df['Location'])
reg_df['DomEnc'] = le_dom.fit_transform(reg_df['Domain'])

print(f"✅ Regression dataset prepared: {reg_df.shape[0]:,} rows")
reg_df.head()
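
Integer codes from `LabelEncoder` imply an ordering between locations and domains that a linear model will take literally. One alternative is one-hot encoding, sketched here with `pd.get_dummies` on a toy frame; the location and domain names are made up and may not match the dataset.

```python
import pandas as pd

# Toy rows; category names are illustrative only
toy = pd.DataFrame({
    "Location": ["Goa", "Pune", "Goa", "Delhi"],
    "Domain": ["RETAIL", "EDUCATION", "RETAIL", "MEDICAL"],
    "TotalTxns": [120, 80, 150, 60],
})

# drop_first=True removes one level per column to avoid perfectly collinear dummies
encoded = pd.get_dummies(toy, columns=["Location", "Domain"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each remaining dummy column then gets its own coefficient, interpreted relative to the dropped baseline category. Note that switching encodings would change the feature layout that any downstream consumer of the model expects.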

In [None]:
# Train-Test Split for Linear Regression
features_lr = ['TimeIndex', 'LocEnc', 'DomEnc', 'TotalTxns']

X_lr = reg_df[features_lr]
y_lr = reg_df['TotalValue']

X_train_lr, X_test_lr, y_train_lr, y_test_lr = train_test_split(
    X_lr, y_lr, test_size=0.2, random_state=42
)

# Scale features
lr_scaler = StandardScaler()
X_train_lr_sc = lr_scaler.fit_transform(X_train_lr)
X_test_lr_sc = lr_scaler.transform(X_test_lr)

# Train Linear Regression
lr = LinearRegression()
lr.fit(X_train_lr_sc, y_train_lr)

# Predictions
y_pred_lr = lr.predict(X_test_lr_sc)

# Evaluation
r2 = r2_score(y_test_lr, y_pred_lr)
rmse = np.sqrt(mean_squared_error(y_test_lr, y_pred_lr))

print("=" * 50)
print("📊 LINEAR REGRESSION RESULTS")
print("=" * 50)
print(f"R² Score (Test): {r2:.4f}")
print(f"RMSE:           ₹{rmse:,.2f}")
print(f"\n→ R² of {r2:.4f} means the model explains {r2*100:.1f}% of revenue variance.")
print(f"→ RMSE of ₹{rmse:,.0f} is the typical size of a prediction error (large errors are weighted more heavily).")
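
The random split above mixes past and future quarters in both sets. For a forecasting question, a chronological hold-out is often a more honest test. This is a minimal sketch on toy data; `chrono_split` is a hypothetical helper, and the column name `TimeIndex` follows the notebook.

```python
import pandas as pd

def chrono_split(frame: pd.DataFrame, time_col: str = "TimeIndex", n_test_periods: int = 1):
    """Hold out the last `n_test_periods` distinct time values as the test set."""
    periods = sorted(frame[time_col].unique())
    cutoff = periods[-n_test_periods]
    train = frame[frame[time_col] < cutoff]
    test = frame[frame[time_col] >= cutoff]
    return train, test

# Toy quarterly rows (values are made up)
toy = pd.DataFrame({
    "TimeIndex": [1, 1, 2, 2, 3, 3, 4, 4],
    "TotalTxns": [10, 12, 11, 14, 13, 15, 16, 18],
    "TotalValue": [100, 130, 115, 150, 140, 160, 170, 200],
})
train, test = chrono_split(toy, n_test_periods=1)
print(f"Train rows: {len(train)} | Test rows: {len(test)}")
```

With this split, the model is evaluated only on quarters it has never seen, which is closer to how a next-quarter forecast would be used.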

In [None]:
# Visualize: Actual vs Predicted
plt.figure(figsize=(10, 5))
plt.scatter(y_test_lr, y_pred_lr, alpha=0.4, color='#2c3e50', s=15)
plt.plot([y_test_lr.min(), y_test_lr.max()], [y_test_lr.min(), y_test_lr.max()], 'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Value (₹)')
plt.ylabel('Predicted Value (₹)')
plt.title('Linear Regression: Actual vs Predicted Revenue', fontsize=13, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\n📊 Business Insight:")
print("   → Points close to the red line = accurate predictions.")
print("   → Spread from the line = variance the model cannot yet capture.")
print("   → This model is used in the deployed Streamlit app for the Predictor feature.")

---

## 🔍 7. Logistic Regression — High-Volume Hub Classification

### Business Question:
*Can we classify whether a Location–Domain combination will be "High Volume" (above median revenue) or "Low Volume"?*

This is a **binary classification** problem. Instead of predicting an exact number (regression), we predict a **category** (High or Low).

**Why this matters:**
- Banks can proactively allocate server capacity and staffing to predicted high-volume hubs.
- Missing a high-volume hub (Type II error / False Negative) is expensive — the bank loses revenue from inadequate infrastructure.

In [None]:
# 1. Create Target Variable
# High_Volume = 1 if TotalValue > median, else 0
median_val = reg_df['TotalValue'].median()
reg_df['High_Volume'] = (reg_df['TotalValue'] > median_val).astype(int)

print(f"📊 Revenue Median Threshold: ₹{median_val:,.0f}")
print(f"   Above median (High Volume = 1): {reg_df['High_Volume'].sum():,} records")
print(f"   Below median (Low Volume  = 0): {(reg_df['High_Volume'] == 0).sum():,} records")

In [None]:
# 2. Train-Test Split
features_cls = ['TimeIndex', 'LocEnc', 'DomEnc', 'TotalTxns']
X_cls = reg_df[features_cls]
y_cls = reg_df['High_Volume']

X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(
    X_cls, y_cls, test_size=0.3, random_state=42
)

# 3. Scale Features
scaler_cls = StandardScaler()
X_train_cls_sc = scaler_cls.fit_transform(X_train_cls)
X_test_cls_sc = scaler_cls.transform(X_test_cls)

# 4. Train Logistic Regression
log_model = LogisticRegression(random_state=42)
log_model.fit(X_train_cls_sc, y_train_cls)

# 5. Predict
y_pred_cls = log_model.predict(X_test_cls_sc)

# 6. Evaluation
acc = accuracy_score(y_test_cls, y_pred_cls)
cm = confusion_matrix(y_test_cls, y_pred_cls)
cr = classification_report(y_test_cls, y_pred_cls)

print("=" * 50)
print("📊 LOGISTIC REGRESSION RESULTS")
print("=" * 50)
print(f"\nAccuracy: {acc:.4f} ({acc*100:.1f}%)")
print(f"\nConfusion Matrix:")
print(cm)
print(f"\nClassification Report:")
print(cr)

### 🧠 Understanding the Confusion Matrix

```
                    Predicted Low    Predicted High
Actual Low (0)      TN               FP (Type I)
Actual High (1)     FN (Type II)     TP
```

| Metric | Meaning |
|--------|---------|
| **Precision** | When the model says "High Volume", how often is it correct? |
| **Recall** | Out of all actual High Volume hubs, what % does the model find? |
| **F1-Score** | Harmonic mean of Precision and Recall (balanced metric) |
| **Type I Error (FP)** | Model says High, but actually Low → Wasted resources |
| **Type II Error (FN)** | Model says Low, but actually High → **Costly miss!** Lost revenue |
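
The metrics in the table can be computed directly from the four cells of the matrix. The numbers below are made up for illustration and follow scikit-learn's layout (rows are actual classes, columns are predicted classes).

```python
import numpy as np

# Made-up confusion matrix: rows = actual, columns = predicted
cm_demo = np.array([
    [40, 10],  # TN, FP
    [5, 45],   # FN, TP
])
tn, fp, fn, tp = cm_demo.ravel()

precision = tp / (tp + fp)   # of predicted High, how many are truly High
recall = tp / (tp + fn)      # of actual High, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm_demo.sum()

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

Here the 5 false negatives are the costly misses described in the table, and they are what recall penalizes.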

In [None]:
# Confusion Matrix Heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted Low', 'Predicted High'],
            yticklabels=['Actual Low', 'Actual High'])
plt.title('Confusion Matrix: High Volume Classification', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n📊 Business Insight: Classification Performance")
print("   → High recall for class 1 = the model catches most high-volume hubs (good!).")
print("   → Low recall = the model misses real high-volume locations (expensive mistake).")
print("   → This classification helps the bank preemptively scale infrastructure.")
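
If missed high-volume hubs are the expensive error, one option is to lower the decision threshold below the default 0.5 and accept more false positives. The sketch below illustrates this on synthetic data from `make_classification`, which stands in for the notebook's features; the 0.3 threshold is an arbitrary example, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with 4 features (stand-in data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

clf = LogisticRegression().fit(X_tr, y_tr)
proba_high = clf.predict_proba(X_te)[:, 1]  # probability of class 1

recalls = {}
for threshold in (0.5, 0.3):
    pred = (proba_high >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, pred)
    prec = precision_score(y_te, pred, zero_division=0)
    print(f"threshold={threshold}: recall={recalls[threshold]:.3f} precision={prec:.3f}")
```

Lowering the threshold can only keep or increase recall for class 1, usually at the cost of precision, so the right value depends on the relative cost of over-provisioning versus under-provisioning.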

---

## 💼 8. Business Interpretation of Results

### 8.1 Segment Strategy

| Segment | Characteristics | Strategy |
|---------|----------------|----------|
| **Premium Markets** | High value, lower transaction volume | Upsell premium services, assign dedicated relationship managers, white-glove support |
| **Volume Hubs** | Massive transaction counts, lower average values | Optimize server and infrastructure capacity; reduce per-transaction costs through automation |
| **Balanced Zones** | Steady combination of value and volume | Maintain current service levels; cross-sell across domains |
| **Emerging Areas** | Low current volume, high growth potential | Target localized marketing campaigns; capture emerging market share before competitors |

### 8.2 Economic Concepts Applied

| Concept | Application in This Project |
|---------|----------------------------|
| **Demand-Supply** | K-Means identifies demand concentration by location; supply (banking infra) should match |
| **Revenue Optimization** | Linear Regression forecasts revenue; helps plan quarterly targets |
| **Pricing Strategy** | Premium Markets can sustain higher fees; Volume Hubs need competitive low-cost processing |
| **Risk Analysis** | Emerging Areas are high-risk investments; Volume Hubs face operational risk from overload |
| **Market Segmentation** | K-Means provides data-driven segments instead of subjective categories |

### 8.3 Strategic Recommendations

| Finding | Business Strategy Recommendation |
|---------|----------------------------------|
| Premium Markets drive disproportionate value | Upsell premium business services to these regions and provide white-glove support. |
| Volume Hubs dominate transaction counts | Optimize server and technical infrastructure to handle peak load continuously. |
| Emerging Areas show strong temporal growth | Target localized marketing campaigns to capture emerging market territory early. |
| Domain concentration varies quarterly | Diversify service offerings to reduce dependency on any single domain. |
| Revenue is predictable via regression | Use the Linear Regression model to set quarterly revenue targets per region. |

### 8.4 Case Study Answers

1. **Which location generates the highest processing volume?** → See EDA Section 4.2. The top location drives the most ₹ through the system.
2. **Which domain dominates transaction counts?** → See EDA Section 4.3. The largest pie slice indicates the dominant domain.
3. **How to optimize infrastructure?** → Use K-Means segments: scale infra in Volume Hubs, premium services in Premium Markets.
4. **High-Value vs High-Count strategy?** → Both matter. Use segment-specific strategies as outlined in Section 8.1.

---

## 📤 9. Save Models for Deployment

These pickled models are loaded by the **Streamlit** dashboard (`app.py`) for live predictions.

In [None]:
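# Save models for Streamlit deployment
joblib.dump(kmeans, 'kmeans_model.pkl')
joblib.dump(scaler, 'scaler.pkl')   # scaler fitted on the clustering features
joblib.dump(lr, 'lr_model.pkl')

# The regression model was trained on inputs transformed by lr_scaler, le_loc and le_dom,
# so inference needs those fitted objects too. The three file names below are
# illustrative assumptions; match them to whatever app.py actually loads.
joblib.dump(lr_scaler, 'lr_scaler.pkl')
joblib.dump(le_loc, 'le_loc.pkl')
joblib.dump(le_dom, 'le_dom.pkl')

print("✅ Models saved successfully!")
print("   • kmeans_model.pkl  → K-Means (4-cluster segmentation)")
print("   • scaler.pkl        → StandardScaler (clustering feature normalization)")
print("   • lr_model.pkl      → LinearRegression (revenue forecasting)")
print("   • lr_scaler.pkl, le_loc.pkl, le_dom.pkl → preprocessing for the regression model")
print("\n🚀 These are used by the Streamlit app (app.py) for the interactive dashboard and predictor interface.")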
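
A model trained on scaled inputs needs the same fitted scaler at prediction time. The round trip below is a sketch with synthetic data and temporary files; the `demo_*` names are hypothetical and unrelated to the files `app.py` expects.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                      # synthetic features
y = X @ np.array([3.0, -1.0, 0.5, 2.0]) + 10      # synthetic target

demo_scaler = StandardScaler().fit(X)
demo_model = LinearRegression().fit(demo_scaler.transform(X), y)

# Save and reload both objects, as a deployed app would
with tempfile.TemporaryDirectory() as tmp:
    joblib.dump(demo_model, Path(tmp) / "demo_lr.pkl")
    joblib.dump(demo_scaler, Path(tmp) / "demo_scaler.pkl")
    loaded_model = joblib.load(Path(tmp) / "demo_lr.pkl")
    loaded_scaler = joblib.load(Path(tmp) / "demo_scaler.pkl")

new_row = X[:1]
restored = loaded_model.predict(loaded_scaler.transform(new_row))
original = demo_model.predict(demo_scaler.transform(new_row))
print("Round-trip prediction matches:", np.allclose(restored, original))
```

Skipping the scaler, or applying a scaler fitted on different features, would feed the model inputs on the wrong scale and silently produce misleading forecasts.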

---

## 🚀 Deployment

The project is deployed using **Streamlit** and includes:

1. **Interactive Dashboard** — Overview, Market Segments, Domains, Forecast pages
2. **Prediction Interface** — Select Location + Domain + Transaction Volume to get a Revenue forecast
3. **Clear Input-Output Demo** — Users can experiment with different combinations and see real-time results

### To Run Locally:
```bash
pip install -r requirements.txt
streamlit run app.py
```

### GitHub Repository
The repository contains:
- `bank_customer_segmentation.ipynb` — This notebook (analysis + models)
- `app.py` — Streamlit application
- `bankdataset.csv` — Dataset
- `requirements.txt` — Dependencies
- `README.md` — Detailed documentation