# Customer Segmentation using RFM Analysis

## Business Problem Statement

The company treats all customers equally despite significant differences
in purchasing behavior and profitability. Leadership needs a data-driven
way to identify high-value, at-risk, and low-engagement customers.

## Why This Matters

Customer segmentation enables:
- Targeted marketing campaigns
- Smarter discount allocation
- Better customer retention
- Higher lifetime value

In [18]:
import pandas as pd

df = pd.read_csv(
    "../data/processed/featured_data.csv",
    parse_dates=["Order Date", "Ship Date"]
)

In [19]:
snapshot_date = df['Order Date'].max() + pd.Timedelta(days=1)

rfm = df.groupby('Customer ID').agg({
    'Order Date': lambda x: (snapshot_date - x.max()).days,
    'Order ID': 'nunique',
    'Sales': 'sum'
}).reset_index()

rfm.columns = ['Customer ID', 'Recency', 'Frequency', 'Monetary']
rfm.head()

| | Customer ID | Recency | Frequency | Monetary |
|---|---|---|---|---|
| 0 | AA-10315 | 185 | 5 | 5563.560 |
| 1 | AA-10375 | 20 | 9 | 1056.390 |
| 2 | AA-10480 | 260 | 4 | 1790.512 |
| 3 | AA-10645 | 56 | 6 | 5086.935 |
| 4 | AB-10015 | 416 | 3 | 886.156 |
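
To make the three RFM definitions concrete, here is a minimal sketch on a hypothetical five-row order table (the `orders` frame and its values are invented for illustration; only the column names mirror the dataset). It uses pandas named aggregation, which produces the final column names directly instead of renaming afterwards.

```python
import pandas as pd

# Hypothetical toy orders; order o1 has two line items.
orders = pd.DataFrame({
    "Customer ID": ["A", "A", "A", "B", "B"],
    "Order ID": ["o1", "o1", "o2", "o3", "o4"],
    "Order Date": pd.to_datetime(
        ["2024-01-05", "2024-01-05", "2024-03-01", "2024-02-10", "2024-02-20"]
    ),
    "Sales": [100.0, 50.0, 25.0, 10.0, 40.0],
})

# Snapshot is one day after the latest order, so Recency is at least 1.
snapshot = orders["Order Date"].max() + pd.Timedelta(days=1)

toy_rfm = (
    orders.groupby("Customer ID")
    .agg(
        Recency=("Order Date", lambda d: (snapshot - d.max()).days),
        Frequency=("Order ID", "nunique"),  # distinct orders, not line items
        Monetary=("Sales", "sum"),
    )
    .reset_index()
)
print(toy_rfm)
```

Customer A has three rows but only two distinct orders, which is why Frequency counts unique `Order ID` values rather than rows.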


In [20]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[['Recency', 'Frequency', 'Monetary']])
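
RFM features are often right-skewed, and K-means is sensitive to extreme values even after standardization. The notebook scales the raw values directly; the sketch below shows one optional alternative, a `log1p` transform before scaling, on hypothetical data. It is not part of the original pipeline, and whether it helps depends on the actual distributions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical RFM values with one very large spender.
toy = pd.DataFrame({
    "Recency": [5, 12, 30, 45, 90, 150, 300, 400],
    "Frequency": [12, 9, 8, 6, 4, 3, 2, 1],
    "Monetary": [120.0, 200.0, 250.0, 300.0, 400.0, 500.0, 650.0, 25000.0],
})

# log(1 + x) compresses the long right tail and is defined at zero.
logged = np.log1p(toy)

# Standardize the log-transformed features to zero mean, unit variance.
scaled_log = StandardScaler().fit_transform(logged)

# Compare the skewness of Monetary before and after the transform.
print("raw skew:", toy["Monetary"].skew())
print("log skew:", logged["Monetary"].skew())
```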

In [21]:
from sklearn.cluster import KMeans

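# Fit K-means on the standardized features; random_state fixes the initialization.
kmeans = KMeans(n_clusters=4, random_state=42)
# Segment IDs (0-3) are arbitrary cluster labels with no inherent ordering.
rfm['Segment'] = kmeans.fit_predict(rfm_scaled)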
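
The notebook fixes `n_clusters=4` without showing how that value was chosen. One common check is to compare silhouette scores across candidate values of k. The sketch below runs that comparison on synthetic blobs that stand in for `rfm_scaled`; the data, the candidate range, and `n_init=10` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for rfm_scaled: three well-separated blobs in 3-D.
centers = np.array([[0.0, 0.0, 0.0], [6.0, 6.0, 0.0], [0.0, 6.0, 6.0]])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X)
    # Mean silhouette over all samples; higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On real RFM data the curve is usually flatter than on blobs like these, so the score is best read alongside the segment profiles rather than as the sole criterion.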

In [22]:
segment_summary = rfm.groupby('Segment').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'Monetary': 'mean',
    'Customer ID': 'count'
}).rename(columns={'Customer ID': 'Num_Customers'})

segment_summary

| Segment | Recency | Frequency | Monetary | Num_Customers |
|---|---|---|---|---|
| 0 | 72.741611 | 8.516779 | 3322.222985 | 298 |
| 1 | 101.197015 | 4.731343 | 1669.688290 | 335 |
| 2 | 123.718750 | 8.296875 | 9479.545687 | 64 |
| 3 | 559.489583 | 3.697917 | 1470.228226 | 96 |


In [23]:
import joblib

rfm.to_csv("../data/processed/customer_segments.csv", index=False)
joblib.dump(kmeans, "../models/customer_segmentation.pkl")

['../models/customer_segmentation.pkl']
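
The saved file contains only the fitted `KMeans` object, so anyone scoring new customers later would also need the same fitted `StandardScaler`. One way to keep the two together is to wrap them in a scikit-learn `Pipeline` and persist that instead. The sketch below uses a hypothetical six-row table, two clusters, and a temporary file path, none of which come from the notebook.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features = ["Recency", "Frequency", "Monetary"]

# Hypothetical RFM table standing in for `rfm`.
toy_rfm = pd.DataFrame({
    "Recency": [5, 12, 300, 420, 8, 350],
    "Frequency": [10, 8, 1, 2, 9, 1],
    "Monetary": [5000.0, 4200.0, 150.0, 300.0, 4600.0, 90.0],
})

# Scaling and clustering are fitted, saved, and loaded as one object.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
pipeline.fit(toy_rfm[features])

# Temporary path for illustration only.
path = os.path.join(tempfile.mkdtemp(), "segmentation_pipeline.pkl")
joblib.dump(pipeline, path)

# Later: load and score a new customer using raw (unscaled) RFM values.
restored = joblib.load(path)
new_customer = pd.DataFrame([{"Recency": 10, "Frequency": 7, "Monetary": 3900.0}])
print(restored.predict(new_customer[features]))
```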

In [26]:
rfm_summary = (
    rfm.groupby("Segment")
    .agg(
        Customers=("Customer ID", "count"),
        Avg_Recency=("Recency", "mean"),
        Avg_Frequency=("Frequency", "mean"),
        Avg_Monetary=("Monetary", "mean"),
        Total_Revenue=("Monetary", "sum")
    )
    .reset_index()
)

rfm_summary.to_csv(
    "../data/processed/customer_segment_summary.csv",
    index=False
)


In [29]:

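# Simple value score (total spend x number of orders); a heuristic proxy,
# not a modeled customer lifetime value.
rfm["CLV"] = rfm["Monetary"] * rfm["Frequency"]

# Churn proxy: long time since last order (Recency above the 75th percentile)
# combined with few orders (Frequency below the 25th percentile)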
rfm["Churn_Risk"] = (
    (rfm["Recency"] > rfm["Recency"].quantile(0.75)) &
    (rfm["Frequency"] < rfm["Frequency"].quantile(0.25))
).astype(int)

# Revenue concentration
top_20_pct = rfm.sort_values("Monetary", ascending=False).head(int(0.2 * len(rfm)))
revenue_concentration = top_20_pct.Monetary.sum() / rfm.Monetary.sum()

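# Segment actions
# K-means IDs are arbitrary, so this mapping follows the segment_summary means
# from this run and must be re-checked whenever the model is refit.
action_map = {
    0: "Loyal – Grow Share",      # recent, frequent buyers with mid-level spend
    1: "Occasional – Nurture",    # fewer orders and lower spend
    2: "Retain & Upsell",         # highest average spend
    3: "Re-engage Immediately"    # about 559 days since last order on average
}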

rfm["Business_Action"] = rfm["Segment"].map(action_map)

rfm.to_csv("../data/processed/customer_business_insights.csv", index=False)

pd.DataFrame([{
    "total_customers": rfm["Customer ID"].nunique(),
    "revenue_concentration": revenue_concentration,
    "high_risk_customers": rfm["Churn_Risk"].sum()
}]).to_csv("../data/processed/customer_kpis.csv", index=False)
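
Because K-means assigns cluster IDs arbitrarily, a mapping keyed on fixed IDs can silently point at the wrong group after a refit. A minimal sketch of one alternative is to derive the labels from each segment's mean profile. The `label_segments` helper and its rules are hypothetical; the profile values are the rounded means from the segment summary table above.

```python
import pandas as pd

# Segment means copied from the summary table (rounded).
profile = pd.DataFrame(
    {
        "Recency": [72.7, 101.2, 123.7, 559.5],
        "Frequency": [8.5, 4.7, 8.3, 3.7],
        "Monetary": [3322.2, 1669.7, 9479.5, 1470.2],
    },
    index=pd.Index([0, 1, 2, 3], name="Segment"),
)


def label_segments(profile: pd.DataFrame) -> dict:
    """Hypothetical rule set: name segments by their most extreme RFM mean."""
    labels = {}
    remaining = profile.copy()

    # Longest time since last order: treat as lapsed.
    lapsed = remaining["Recency"].idxmax()
    labels[int(lapsed)] = "Re-engage Immediately"
    remaining = remaining.drop(lapsed)

    # Highest average spend among the rest: protect and upsell.
    top = remaining["Monetary"].idxmax()
    labels[int(top)] = "Retain & Upsell"
    remaining = remaining.drop(top)

    # Most frequent buyers among the rest: loyal base.
    loyal = remaining["Frequency"].idxmax()
    labels[int(loyal)] = "Loyal – Grow Share"
    remaining = remaining.drop(loyal)

    # Everything left over gets a generic nurture action.
    for seg in remaining.index:
        labels[int(seg)] = "Nurture"
    return labels


derived_map = label_segments(profile)
print(derived_map)
```

In the notebook, the same function could be fed the output of `rfm.groupby("Segment")[["Recency", "Frequency", "Monetary"]].mean()` so the actions follow the data rather than the cluster numbering.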


## Executive Summary

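K-means clustering on standardized RFM features splits the 793 customers into
four segments with clearly different purchasing behavior. Segment 2 is the
smallest group (64 customers, about 8% of the base) but has the highest average
spend (about 9,480 per customer) and accounts for roughly 26% of total sales.
Together with Segment 0 (298 frequent, recent buyers, about 43% of sales), these
high-frequency, high-monetary customers drive a disproportionate share of
revenue and should be prioritized for retention and loyalty programs. Segment 3
(96 customers) averages about 559 days since the last order and is the main
target for re-engagement.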