# 03 - Customer Segmentation (RFM Analysis)

RFM (Recency, Frequency, Monetary) segmentation groups customers by how recently they bought, how often they buy, and how much they spend. The resulting segments support targeted marketing and resource allocation.

**Objectives:**
1. Calculate RFM scores (1-5 scale) for each customer
2. Define meaningful customer segments based on RFM combinations
3. Profile each segment to understand their characteristics
4. Provide actionable business recommendations per segment

**Data Source:** `customers_with_rfm.csv` - Contains pre-calculated recency, frequency, and monetary values from notebook 01.
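
Notebook 01 is not reproduced here, so the exact derivation of these columns is not shown. As a rough sketch of one common way to compute recency, frequency, and monetary values from an order-level table (the `orders` frame, its column names `order_id`, `order_date`, `payment_value`, and the snapshot date rule are all assumptions for illustration):

```python
import pandas as pd

# Hypothetical order-level data; values and column names are invented.
orders = pd.DataFrame({
    "customer_unique_id": ["a", "a", "b", "c", "c", "c"],
    "order_id": ["o1", "o2", "o3", "o4", "o5", "o6"],
    "order_date": pd.to_datetime([
        "2018-01-05", "2018-03-10", "2018-02-01",
        "2017-11-20", "2018-01-15", "2018-03-28",
    ]),
    "payment_value": [50.0, 70.0, 120.0, 20.0, 35.0, 45.0],
})

# Assumed reference date: one day after the last order in the data.
snapshot_date = orders["order_date"].max() + pd.Timedelta(days=1)

# One row per customer: last order date, number of orders, total spend.
rfm_demo = orders.groupby("customer_unique_id").agg(
    last_order=("order_date", "max"),
    frequency=("order_id", "nunique"),
    monetary=("payment_value", "sum"),
).reset_index()

# Recency = days between the snapshot date and the customer's last order.
rfm_demo["recency"] = (snapshot_date - rfm_demo["last_order"]).dt.days

print(rfm_demo[["customer_unique_id", "recency", "frequency", "monetary"]])
```

The choice of snapshot date shifts every recency value by the same amount, so it does not change the quantile-based scores computed later.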

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Ensure images directory exists
os.makedirs('../images', exist_ok=True)

print("Libraries loaded successfully.")

## 1. Load RFM Data

In [None]:
# Load customers with RFM data
rfm = pd.read_csv('../data/processed/customers_with_rfm.csv')

# Filter to customers with RFM data (those who made purchases)
rfm = rfm.dropna(subset=['recency', 'frequency', 'monetary'])

print(f"Customers with RFM data: {len(rfm):,}")
print("\nRFM Summary Statistics:")
print(rfm[['recency', 'frequency', 'monetary']].describe())

In [None]:
# Preview the data
rfm[['customer_unique_id', 'customer_city', 'customer_state', 'recency', 'frequency', 'monetary']].head(10)

## 2. Calculate RFM Scores (1-5 Scale)

Each RFM dimension is scored from 1-5 using quantile-based binning:
- **Recency**: Lower values are better (recent purchases) - so 5 = most recent
- **Frequency**: Higher values are better - so 5 = most frequent
- **Monetary**: Higher values are better - so 5 = highest spenders
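
To make the binning concrete, here is a minimal sketch on invented toy values (the `demo` frame is hypothetical). It shows the reversed labels for recency and what `rank(method='first')` does when many customers share the same frequency:

```python
import pandas as pd

# Toy values, invented for illustration only.
demo = pd.DataFrame({
    "recency": [3, 10, 25, 40, 60, 90, 120, 200, 300, 400],
    "frequency": [1, 1, 1, 1, 1, 1, 1, 2, 2, 5],
})

# Recency: reversed labels so the smallest (most recent) quintile scores 5.
demo["r_score"] = pd.qcut(demo["recency"], 5, labels=[5, 4, 3, 2, 1]).astype(int)

# Frequency: rank(method="first") turns tied values into distinct ranks
# (ordered by row position), which lets qcut form five equal-sized bins.
freq_rank = demo["frequency"].rank(method="first")
demo["f_score"] = pd.qcut(freq_rank, 5, labels=[1, 2, 3, 4, 5]).astype(int)

print(demo)

# Scores assigned to customers who all have exactly one order.
single_order_scores = sorted(
    int(s) for s in demo.loc[demo["frequency"] == 1, "f_score"].unique()
)
print(single_order_scores)
```

In this toy data the seven single-order customers are spread across f_scores 1 through 4, purely by row order. When most customers have the same frequency, the F score therefore separates them less meaningfully than the R and M scores do.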

In [None]:
# Score each dimension (5 is best)
# Recency: Lower is better, so reverse the labels [5,4,3,2,1]
rfm['r_score'] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1])

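# Frequency: Higher is better, but many customers share the same value,
# which would give qcut duplicate bin edges.
# rank(method='first') breaks ties by row order so quantile binning works.
# Caveat: customers with identical frequency can receive different f_scores.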
rfm['f_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])

# Monetary: Higher is better
rfm['m_score'] = pd.qcut(rfm['monetary'], 5, labels=[1, 2, 3, 4, 5])

# Create combined RFM score string
rfm['rfm_score'] = rfm['r_score'].astype(str) + rfm['f_score'].astype(str) + rfm['m_score'].astype(str)

print("RFM Scores calculated:")
print("\nRecency Score Distribution:")
print(rfm['r_score'].value_counts().sort_index())
print("\nFrequency Score Distribution:")
print(rfm['f_score'].value_counts().sort_index())
print("\nMonetary Score Distribution:")
print(rfm['m_score'].value_counts().sort_index())

In [None]:
# Show sample of RFM scores
rfm[['customer_unique_id', 'recency', 'frequency', 'monetary', 'r_score', 'f_score', 'm_score', 'rfm_score']].head(15)

## 3. Create Customer Segments

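Customers are assigned to segments from their R and F scores; the M score is not used in the rules. The ranges in the table overlap, so rules are checked top to bottom and the first match wins (for example, a customer with R=5 and F=5 is a Champion, not a Loyal Customer):

| Segment | R Score | F Score | Description |
|---------|---------|---------|-------------|
| **Champions** | 4-5 | 4-5 | Best customers - recent and frequent buyers |
| **Loyal Customers** | 3-5 | 3-5 | Consistent purchasers with good engagement (excluding Champions) |
| **New Customers** | 4-5 | 1-2 | Recently acquired, low purchase history |
| **Potential Loyalists** | 3-5 | 1-2 | Recent buyers who could become loyal (in practice R=3, since R 4-5 is matched by New Customers first) |
| **At Risk** | 1-2 | 3-5 | Were frequent buyers, now disengaged |
| **Hibernating** | 1-2 | 1-2 | Low activity, long since last purchase |
| **Need Attention** | Others | Others | Fallback for any combination not matched above; the six rules already cover every R/F pair, so this segment stays empty |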
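
Because the ranges overlap, it helps to check which segment each R/F pair actually receives. The following is a sketch, not the notebook's implementation: it assumes the rules are evaluated in the table's row order with the first match winning, and uses `np.select` on a hypothetical `grid` of all 25 score pairs.

```python
import itertools

import numpy as np
import pandas as pd

# Every possible (r_score, f_score) pair on the 1-5 scale.
grid = pd.DataFrame(
    list(itertools.product(range(1, 6), repeat=2)), columns=["r", "f"]
)
r, f = grid["r"], grid["f"]

# Conditions in the table's row order; np.select keeps the first match.
conditions = [
    (r >= 4) & (f >= 4),
    (r >= 3) & (f >= 3),
    (r >= 4) & (f <= 2),
    (r >= 3) & (f <= 2),
    (r <= 2) & (f >= 3),
    (r <= 2) & (f <= 2),
]
labels = [
    "Champions", "Loyal Customers", "New Customers",
    "Potential Loyalists", "At Risk", "Hibernating",
]
grid["segment"] = np.select(conditions, labels, default="Need Attention")

# Rows = R score, columns = F score, cells = assigned segment.
coverage = grid.pivot(index="r", columns="f", values="segment")
print(coverage)

# Number of score pairs that map to each segment.
pair_counts = grid["segment"].value_counts()
print(pair_counts)
```

Under these assumptions all 25 pairs are matched by one of the six named rules, so the fallback label never appears in the grid.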

In [None]:
def segment_customer(row):
    """
    Assign customer segment based on RFM scores.
    
    Segmentation logic prioritizes recency (R) and frequency (F) as the 
    primary indicators of customer engagement and loyalty.
    """
    # Only R and F drive the rules below; m_score is intentionally not used.
    r, f = int(row['r_score']), int(row['f_score'])
    
    if r >= 4 and f >= 4:
        return 'Champions'
    elif r >= 3 and f >= 3:
        return 'Loyal Customers'
    elif r >= 4 and f <= 2:
        return 'New Customers'
    elif r >= 3 and f <= 2:
        return 'Potential Loyalists'
    elif r <= 2 and f >= 3:
        return 'At Risk'
    elif r <= 2 and f <= 2:
        return 'Hibernating'
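    else:
        # Unreachable for integer scores: the branches above cover every
        # (r, f) pair. Kept as a defensive fallback.
        return 'Need Attention'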

# Apply segmentation
rfm['segment'] = rfm.apply(segment_customer, axis=1)

print("Customer Segment Distribution:")
print(rfm['segment'].value_counts())

## 4. Segment Profile Analysis

Analyze each segment's characteristics to understand customer behavior patterns.

In [None]:
# Segment profile summary
segment_profile = rfm.groupby('segment').agg(
    customer_count=('customer_unique_id', 'count'),
    avg_recency=('recency', 'mean'),
    avg_frequency=('frequency', 'mean'),
    avg_monetary=('monetary', 'mean'),
    total_revenue=('monetary', 'sum')
).round(2)

# Calculate percentage of customers
segment_profile['pct_customers'] = (segment_profile['customer_count'] / segment_profile['customer_count'].sum() * 100).round(1)

# Calculate percentage of revenue
segment_profile['pct_revenue'] = (segment_profile['total_revenue'] / segment_profile['total_revenue'].sum() * 100).round(1)

# Sort by customer count descending
segment_profile = segment_profile.sort_values('customer_count', ascending=False)

print("Segment Profile Summary")
print("="*80)
segment_profile

In [None]:
# Revenue at risk analysis
at_risk_segments = ['At Risk', 'Hibernating']
at_risk_customers = rfm[rfm['segment'].isin(at_risk_segments)]

total_revenue = rfm['monetary'].sum()
at_risk_revenue = at_risk_customers['monetary'].sum()

print("Revenue at Risk Analysis")
print("="*60)
print(f"Total customers at risk: {len(at_risk_customers):,}")
print(f"Percentage of customer base: {len(at_risk_customers)/len(rfm)*100:.1f}%")
print(f"\nHistorical revenue from at-risk customers: ${at_risk_revenue:,.2f}")
print(f"Percentage of total revenue: {at_risk_revenue/total_revenue*100:.1f}%")
print("\nBreakdown:")
for seg in at_risk_segments:
    seg_data = rfm[rfm['segment'] == seg]
    print(f"  {seg}: {len(seg_data):,} customers (${seg_data['monetary'].sum():,.2f})")

## 5. Visualizations

### 5.1 Segment Distribution

In [None]:
# Segment distribution donut chart
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Define colors for segments
segment_colors = {
    'Champions': '#2ecc71',
    'Loyal Customers': '#3498db',
    'New Customers': '#9b59b6',
    'Potential Loyalists': '#1abc9c',
    'At Risk': '#e74c3c',
    'Hibernating': '#95a5a6',
    'Need Attention': '#f39c12'
}

# Customer count by segment
segment_counts = rfm['segment'].value_counts()
colors = [segment_colors.get(seg, '#333') for seg in segment_counts.index]

# Donut chart - Customer Distribution
wedges, texts, autotexts = axes[0].pie(
    segment_counts.values,
    labels=segment_counts.index,
    autopct='%1.1f%%',
    colors=colors,
    pctdistance=0.75,
    startangle=90
)

# Create donut effect
centre_circle = plt.Circle((0, 0), 0.50, fc='white')
axes[0].add_patch(centre_circle)
axes[0].set_title('Customer Distribution by Segment', fontsize=14, fontweight='bold')

# Revenue by segment
segment_revenue = rfm.groupby('segment')['monetary'].sum().reindex(segment_counts.index)

wedges2, texts2, autotexts2 = axes[1].pie(
    segment_revenue.values,
    labels=segment_revenue.index,
    autopct='%1.1f%%',
    colors=colors,
    pctdistance=0.75,
    startangle=90
)

centre_circle2 = plt.Circle((0, 0), 0.50, fc='white')
axes[1].add_patch(centre_circle2)
axes[1].set_title('Revenue Distribution by Segment', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('../images/segment_distribution.png', dpi=150, bbox_inches='tight', facecolor='white')
plt.show()
print("Saved: images/segment_distribution.png")

### 5.2 Monetary Value by Segment

In [None]:
# Box plots of monetary value by segment
fig, ax = plt.subplots(figsize=(12, 6))

# Order segments by median monetary value
segment_order = rfm.groupby('segment')['monetary'].median().sort_values(ascending=False).index.tolist()

# Create box plot
box_colors = [segment_colors.get(seg, '#333') for seg in segment_order]
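bp = ax.boxplot(
    [rfm[rfm['segment'] == seg]['monetary'].values for seg in segment_order],
    patch_artist=True,
    showfliers=False  # Hide outliers for cleaner visualization
)
# Set tick labels separately: boxplot's `labels` keyword is deprecated
# in Matplotlib 3.9+ (renamed to `tick_labels`), so this works on both.
ax.set_xticklabels(segment_order)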

# Color the boxes
for patch, color in zip(bp['boxes'], box_colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

ax.set_xlabel('Customer Segment', fontsize=12)
ax.set_ylabel('Monetary Value ($)', fontsize=12)
ax.set_title('Monetary Value Distribution by Customer Segment', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')

plt.tight_layout()
plt.savefig('../images/segment_monetary_boxplot.png', dpi=150, bbox_inches='tight', facecolor='white')
plt.show()
print("Saved: images/segment_monetary_boxplot.png")

### 5.3 RFM Score Heatmap by Segment

In [None]:
# Heatmap of average RFM scores by segment
# Convert scores to numeric for aggregation
rfm['r_score_num'] = rfm['r_score'].astype(int)
rfm['f_score_num'] = rfm['f_score'].astype(int)
rfm['m_score_num'] = rfm['m_score'].astype(int)

# Calculate average scores per segment
segment_rfm_avg = rfm.groupby('segment').agg({
    'r_score_num': 'mean',
    'f_score_num': 'mean',
    'm_score_num': 'mean'
}).round(2)

segment_rfm_avg.columns = ['Recency', 'Frequency', 'Monetary']

# Reorder by total RFM score
segment_rfm_avg['total'] = segment_rfm_avg.sum(axis=1)
segment_rfm_avg = segment_rfm_avg.sort_values('total', ascending=False)
segment_rfm_avg = segment_rfm_avg.drop('total', axis=1)

# Create heatmap
fig, ax = plt.subplots(figsize=(10, 7))

sns.heatmap(
    segment_rfm_avg,
    annot=True,
    cmap='RdYlGn',
    fmt='.2f',
    linewidths=1,
    linecolor='white',
    cbar_kws={'label': 'Average Score'},
    ax=ax,
    vmin=1,
    vmax=5
)

ax.set_title('Average RFM Scores by Customer Segment', fontsize=14, fontweight='bold')
ax.set_xlabel('RFM Dimension', fontsize=12)
ax.set_ylabel('Segment', fontsize=12)

plt.tight_layout()
plt.savefig('../images/segment_rfm_heatmap.png', dpi=150, bbox_inches='tight', facecolor='white')
plt.show()
print("Saved: images/segment_rfm_heatmap.png")

## 6. Business Recommendations

Based on the RFM segmentation analysis, here are targeted actions for each customer segment:

In [None]:
# Business recommendations by segment
recommendations = {
    'Champions': 'Reward & retain with loyalty programs, early access to new products, VIP treatment',
    'Loyal Customers': 'Upsell premium products, request reviews/referrals, maintain engagement',
    'New Customers': 'Onboard with welcome series, educate about product range, encourage second purchase',
    'Potential Loyalists': 'Nurture with personalized offers, membership benefits, increase engagement',
    'At Risk': 'Win-back campaigns, feedback surveys, special reactivation offers',
    'Hibernating': 'Aggressive win-back with deep discounts, remind of value proposition',
    'Need Attention': 'Targeted research to understand needs, A/B test different offers'
}

# Create recommendations dataframe
segment_counts = rfm['segment'].value_counts()
reco_df = pd.DataFrame({
    'Segment': segment_counts.index,
    'Count': segment_counts.values,
    'Recommended Action': [recommendations.get(seg, 'Review individually') for seg in segment_counts.index]
})

# Calculate percentage
reco_df['% of Customers'] = (reco_df['Count'] / reco_df['Count'].sum() * 100).round(1)
reco_df = reco_df[['Segment', 'Count', '% of Customers', 'Recommended Action']]

print("Business Recommendations by Customer Segment")
print("="*100)
reco_df

## 7. Export Segmented Data

In [None]:
# Save segmented customer data for downstream analysis
output_cols = [
    'customer_unique_id', 'customer_city', 'customer_state',
    'recency', 'frequency', 'monetary',
    'r_score', 'f_score', 'm_score', 'rfm_score', 'segment'
]

rfm_export = rfm[output_cols].copy()
rfm_export.to_csv('../data/processed/customers_segmented.csv', index=False)

print("Saved: data/processed/customers_segmented.csv")
print(f"Total rows: {len(rfm_export):,}")
print("\nColumns exported:")
for col in output_cols:
    print(f"  - {col}")

---

## Key Insights Summary

### Segmentation Results

| Metric | Value |
|--------|-------|
| Total Customers Analyzed | See output above |
| Number of Segments | 6 populated (`Need Attention` is a fallback the rules never reach) |
| Largest Segment | Typically Hibernating/New Customers (one-time buyers) |
| Highest Value Segment | Champions (recent and frequent; confirm spend against `avg_monetary` in the segment profile) |

### Strategic Priorities

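1. **Protect Champions**: Compare their `pct_revenue` with their `pct_customers` in the segment profile to quantify how much revenue depends on them, then invest in retention accordingly.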

2. **Convert Potential Loyalists**: Recent buyers with low frequency represent growth opportunity. Focus on second purchase campaigns.

3. **Reactivate At-Risk Customers**: These customers bought frequently but have not purchased recently. The revenue-at-risk figures above show how much historical revenue a win-back campaign could target.

4. **Manage Hibernating Efficiently**: Large group but low ROI on reactivation. Use low-cost channels only.

### Charts Generated
- `images/segment_distribution.png` - Donut charts of customer and revenue distribution
- `images/segment_monetary_boxplot.png` - Monetary value distribution by segment
- `images/segment_rfm_heatmap.png` - Average RFM scores heatmap

### Next Steps
- Use segmented data (`customers_segmented.csv`) for targeted marketing campaigns
- Feed segments into churn prediction model (notebook 04)
- Track segment migration over time to measure marketing effectiveness
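
Segment migration can be tracked by comparing two snapshots of the segmented export. A minimal sketch, assuming two hypothetical frames `earlier` and `later` with the same `customer_unique_id` and `segment` columns as `customers_segmented.csv` (the rows below are invented):

```python
import pandas as pd

# Two hypothetical snapshots taken at different dates.
earlier = pd.DataFrame({
    "customer_unique_id": ["a", "b", "c", "d", "e"],
    "segment": ["New Customers", "Champions", "At Risk",
                "Loyal Customers", "Hibernating"],
})
later = pd.DataFrame({
    "customer_unique_id": ["a", "b", "c", "d", "e"],
    "segment": ["Loyal Customers", "Champions", "Hibernating",
                "At Risk", "Hibernating"],
})

# Align snapshots on customer ID; suffixes mark the period of each segment.
moves = earlier.merge(later, on="customer_unique_id",
                      suffixes=("_before", "_after"))

# Rows = starting segment, columns = ending segment, cells = customer counts.
migration = pd.crosstab(moves["segment_before"], moves["segment_after"])
print(migration)

# Share of customers whose segment did not change between snapshots.
retained_share = (moves["segment_before"] == moves["segment_after"]).mean()
print(f"Stayed in the same segment: {retained_share:.0%}")
```

Off-diagonal cells show movement between segments, for example from New Customers to Loyal Customers, which is the kind of shift a second-purchase campaign aims for.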