# Notebook 03: RFM Customer Segmentation

**Source:** Kobets & Yashyna (2025), Table 2 - 8-Segment RFM Model

---

## What This Notebook Does

1. Calculates RFM metrics for each customer
2. Assigns scores (1-5) to each metric
3. Classifies customers into 8 segments
4. Saves results for promotion strategy



In [1]:
## Step 1: Import Libraries

import pandas as pd
import numpy as np
import os
from datetime import datetime

print("="*60)
print("NOTEBOOK 03: RFM SEGMENTATION")
print("="*60)

NOTEBOOK 03: RFM SEGMENTATION


In [None]:

## STEP 2: Load Data
DATA_DIR = '../data/processed'
OUTPUT_DIR = '../data/processed'

print("\n[1/4] Loading data...")

# Load optimized data from Notebook 01
orders = pd.read_csv(f'{DATA_DIR}/orders_optimized.csv')
orders_prior = pd.read_csv(f'{DATA_DIR}/orders_prior_optimized.csv')

print(f"  ✓ orders: {len(orders):,} rows")
print(f"  ✓ orders_prior: {len(orders_prior):,} rows")

# Check if pricing data exists
if os.path.exists(f'{DATA_DIR}/products_priced_eur.csv'):
    products_priced = pd.read_csv(f'{DATA_DIR}/products_priced_eur.csv')
    has_prices = True
    print(f"  ✓ pricing data available")
else:
    products_priced = None
    has_prices = False
    print("  Warning: no pricing data found; RFM will be based on order counts only")


[1/4] Loading data...
  ✓ orders: 3,421,083 rows
  ✓ orders_prior: 32,434,489 rows
  ✓ pricing data available


In [7]:
# STEP 3: Calculate RFM Metrics

print("\n[2/4] Calculating RFM metrics...")

# Count products per order
order_counts = orders_prior.groupby('order_id').size().reset_index()
order_counts.columns = ['order_id', 'num_items']

# Merge with orders
order_data = pd.merge(orders, order_counts, on='order_id', how='left')

# Add monetary value if prices available
if has_prices:
    # Merge order products with prices
    order_prices = pd.merge(
        orders_prior[['order_id', 'product_id']],
        products_priced[['product_id', 'price_eur']],
        on='product_id',
        how='left'
    )
    order_prices['price_eur'] = order_prices['price_eur'].fillna(0)
    
    # Sum prices per order
    order_value = order_prices.groupby('order_id')['price_eur'].sum().reset_index()
    order_value.columns = ['order_id', 'order_value_eur']
    
    # Merge with order data
    order_data = pd.merge(order_data, order_value, on='order_id', how='left')
else:
    order_data['order_value_eur'] = 0

# Aggregate to customer level (206K customers)
rfm = order_data.groupby('user_id').agg({
    'order_id': 'nunique',              # Frequency (number of orders)
    'order_value_eur': 'sum',           # Monetary (total spend)
    'days_since_prior_order': 'mean'    # Recency proxy (mean days between a customer's orders)
}).reset_index()

# Rename columns
rfm.columns = ['user_id', 'num_orders', 'total_spent_eur', 'avg_days_between_orders']

print(f"  ✓ RFM table created: {len(rfm):,} customers")


[2/4] Calculating RFM metrics...
  ✓ RFM table created: 206,209 customers
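
The left merges in Step 3 assume that the right-hand table has at most one row per order_id. A minimal sketch of how that assumption could be checked with the `validate` argument of `pd.merge`, using hypothetical toy frames (`orders_toy` and `counts_toy` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy data: three orders, only two of which have item counts.
orders_toy = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 10, 11]})
counts_toy = pd.DataFrame({"order_id": [1, 2], "num_items": [4, 2]})

# validate="one_to_one" raises pandas.errors.MergeError if order_id
# is duplicated on either side, so a silent row explosion is not possible.
merged = pd.merge(
    orders_toy, counts_toy, on="order_id", how="left", validate="one_to_one"
)

# A left merge keeps every row of the left table.
rows_preserved = len(merged) == len(orders_toy)

# Orders without a match on the right end up with NaN in num_items.
unmatched = int(merged["num_items"].isna().sum())

print(rows_preserved, unmatched)
```

If the real orders table contains orders that are absent from orders_prior, the same NaN pattern would appear in num_items and order_value_eur. Whether that applies here depends on how Notebook 01 built the files.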


**What is RFM?**

- Recency: how recently did the customer buy? (fewer days = better)
- Frequency: how often do they buy? (higher = better)
- Monetary: how much do they spend? (higher = better)

In this notebook, recency is approximated by the average number of days between a customer's orders.

We follow the Kobets & Yashyna methodology, which supports a deeper marketing analysis than traditional RFM segmentation.
Source: Kobets & Yashyna (2025), p. 35
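
To make the three metrics concrete, here is a minimal sketch of the same aggregation on a hypothetical five-order table. The values are invented for illustration; only the column names mirror the notebook.

```python
import pandas as pd

nan = float("nan")

# Hypothetical toy data: one row per order. The first order of each
# customer has no prior order, hence NaN in days_since_prior_order.
toy_orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_id": [10, 11, 12, 20, 21],
    "order_value_eur": [12.0, 8.0, 10.0, 50.0, 30.0],
    "days_since_prior_order": [nan, 7.0, 9.0, nan, 30.0],
})

toy_rfm = toy_orders.groupby("user_id").agg(
    num_orders=("order_id", "nunique"),                          # Frequency
    total_spent_eur=("order_value_eur", "sum"),                  # Monetary
    avg_days_between_orders=("days_since_prior_order", "mean"),  # Recency proxy
).reset_index()

print(toy_rfm)
```

Customer 1 orders often and cheaply (3 orders, 30 EUR, 8 days between orders on average), while customer 2 orders rarely but spends more (2 orders, 80 EUR, 30 days). The mean skips NaN values, so the first order of each customer does not distort the recency proxy.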


In [8]:
# STEP 4: Assign RFM Scores
print("\n[3/4] Assigning RFM scores (1-5)...")

# Recency: Lower days = better = higher score
# We use the pandas qcut function to assign quintile scores
rfm['R_score'] = pd.qcut(rfm['avg_days_between_orders'].fillna(0), 
                         q=5, labels=[5,4,3,2,1]).astype(int)

# Frequency: Higher orders = better = higher score
rfm['F_score'] = pd.qcut(rfm['num_orders'], q=5, 
                         labels=[1,2,3,4,5]).astype(int)

# Monetary: Higher spend = better = higher score
if has_prices:
    rfm['M_score'] = pd.qcut(rfm['total_spent_eur'], q=5, 
                             labels=[1,2,3,4,5]).astype(int)
else:
    # If no prices, use frequency as proxy
    rfm['M_score'] = rfm['F_score']

print("  ✓ Scores assigned")
print(f"\nScore distribution:")
print(f"  R_score: {rfm['R_score'].value_counts().sort_index().to_dict()}")


[3/4] Assigning RFM scores (1-5)...
  ✓ Scores assigned

Score distribution:
  R_score: {1: 40114, 2: 42344, 3: 41255, 4: 41254, 5: 41242}


Here, qcut divides the data into quantile-based groups of roughly equal size (quintiles = 5 groups). Tied values can make the group sizes differ slightly, as in the R_score distribution above.
Source: https://pandas.pydata.org/docs/reference/api/pandas.qcut.html

We apply this function to follow the K&Y methodology: "Customers are often assigned scores (for example, from 1 to 5, where 5 is the highest score) for each of the parameters (R, F, M) depending on their place in the quintiles or quartiles of the distribution." (Kobets & Yashyna, 2025, p. 36)
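
A small sketch of how quintile scoring behaves, on hypothetical values. The first part reverses the labels as the recency score does. The second part shows one common workaround (ranking with `method="first"` before binning) for heavily tied data, where `pd.qcut` would otherwise raise a "Bin edges must be unique" error. The workaround is an illustration, not something this notebook requires.

```python
import pandas as pd

# Hypothetical average gaps (in days) for ten customers.
days = pd.Series([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])

# Reversed labels: the lowest quintile (shortest gaps) gets score 5.
r_score = pd.qcut(days, q=5, labels=[5, 4, 3, 2, 1]).astype(int)
print(r_score.tolist())  # [5, 5, 4, 4, 3, 3, 2, 2, 1, 1]

# Hypothetical order counts with many ties at the low end.
num_orders = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 4, 5])

# Ranking first breaks ties by position, so every bin edge is unique.
f_score = pd.qcut(
    num_orders.rank(method="first"), q=5, labels=[1, 2, 3, 4, 5]
).astype(int)
print(f_score.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```

Note the trade-off in the second case: customers with the same order count can land in different score groups, because ties are split by row order.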

In [9]:
# STEP 5: Classify into segments
print("\n[4/4] Classifying into 8 segments...")

# Simple function to classify customers
# Source: Kobets & Yashyna (2025), Table 2, p. 37
def get_segment(row):
    # Premium: High R, High F, High M
    if row['R_score'] >= 4 and row['F_score'] >= 4 and row['M_score'] >= 4:
        return 'Premium'
    
    # Loyal: High F, Medium R/M
    elif row['F_score'] >= 3 and row['R_score'] >= 3 and row['M_score'] >= 3:
        return 'Loyal'
    
    # High Check: High M, Low F
    elif row['M_score'] >= 4 and row['F_score'] <= 2:
        return 'High_Check'
    
    # Frugal: High F, Low M
    elif row['F_score'] >= 4 and row['M_score'] <= 2:
        return 'Frugal'
    
    # Promising: Medium R/F/M
    elif row['R_score'] >= 3 and row['F_score'] >= 2 and row['M_score'] >= 2:
        return 'Promising'
    
    # Lost: Low R, Low F
    elif row['R_score'] <= 2 and row['F_score'] <= 2:
        return 'Lost'
    
    # Sleeping: Low R, High F
    elif row['R_score'] <= 2 and row['F_score'] >= 3:
        return 'Sleeping'
    
    # New: all remaining customers (typically High R, Low F)
    else:
        return 'New'

# Apply classification
rfm['segment'] = rfm.apply(get_segment, axis=1)

# Show distribution
print("\nSegment distribution:")
for segment, count in rfm['segment'].value_counts().items():
    pct = count / len(rfm) * 100
    print(f"  {segment}: {count:,} ({pct:.1f}%)")


[4/4] Classifying into 8 segments...

Segment distribution:
  Lost: 50,282 (24.4%)
  Premium: 45,125 (21.9%)
  Loyal: 34,570 (16.8%)
  Sleeping: 26,772 (13.0%)
  New: 21,555 (10.5%)
  Promising: 13,952 (6.8%)
  Frugal: 7,652 (3.7%)
  High_Check: 6,301 (3.1%)
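
Row-wise `apply` can be slow on roughly 206K customers. As a hedged alternative, the same rules can be sketched with `np.select`, which picks the first matching condition and so preserves the if/elif order of get_segment. The function name `classify_segments` and the toy frame are hypothetical; the thresholds are copied from the cell above.

```python
import numpy as np
import pandas as pd

def classify_segments(df: pd.DataFrame) -> pd.Series:
    """Vectorized sketch of get_segment; condition order matters."""
    r, f, m = df["R_score"], df["F_score"], df["M_score"]
    conditions = [
        (r >= 4) & (f >= 4) & (m >= 4),   # Premium
        (r >= 3) & (f >= 3) & (m >= 3),   # Loyal
        (m >= 4) & (f <= 2),              # High_Check
        (f >= 4) & (m <= 2),              # Frugal
        (r >= 3) & (f >= 2) & (m >= 2),   # Promising
        (r <= 2) & (f <= 2),              # Lost
        (r <= 2) & (f >= 3),              # Sleeping
    ]
    labels = ["Premium", "Loyal", "High_Check", "Frugal",
              "Promising", "Lost", "Sleeping"]
    # Anything that matches no condition falls into 'New', like the else branch.
    return pd.Series(np.select(conditions, labels, default="New"), index=df.index)

# Hypothetical scores for five customers.
toy = pd.DataFrame({
    "R_score": [5, 3, 1, 1, 5],
    "F_score": [5, 3, 1, 4, 1],
    "M_score": [5, 3, 1, 3, 1],
})
toy_segments = classify_segments(toy)
print(toy_segments.tolist())
```

On the real table, comparing `classify_segments(rfm)` against `rfm['segment']` would be a quick way to confirm that both versions agree before switching.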


**We are close to the expected distribution from Kobets & Yashyna (Figure 3) on six segments:**

- Lost: ~22.63% vs 24.4% in our dataset
- Sleeping: ~14.71% vs 13.0% in our dataset
- Premium: ~19.15% vs 21.9% in our dataset
- Loyal: ~17.73% vs 16.8% in our dataset
- High Check: ~5.44% vs 3.1% in our dataset
- Frugal: ~2.61% vs 3.7% in our dataset

Our dataset differs markedly on two segments:

- Promising: ~15.81% vs 6.8% in our dataset: we have fewer Promising customers
- New: ~1.92% vs 10.5% in our dataset: we have more New customers than in their study
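
The comparison can also be expressed as differences in percentage points. A minimal sketch, where the "ours" column is copied from the segment distribution printed by this notebook and the "reference" column holds the approximate Figure 3 values quoted above (the variable names are illustrative):

```python
import pandas as pd

# Shares in percent, indexed by segment name.
shares = pd.DataFrame({
    "ours": {
        "Lost": 24.4, "Premium": 21.9, "Loyal": 16.8, "Sleeping": 13.0,
        "New": 10.5, "Promising": 6.8, "Frugal": 3.7, "High_Check": 3.1,
    },
    "reference": {
        "Lost": 22.63, "Premium": 19.15, "Loyal": 17.73, "Sleeping": 14.71,
        "New": 1.92, "Promising": 15.81, "Frugal": 2.61, "High_Check": 5.44,
    },
})

# Positive values mean the segment is larger in our dataset.
shares["diff_pp"] = (shares["ours"] - shares["reference"]).round(2)

# Largest absolute gaps first.
ranked = shares.sort_values("diff_pp", key=abs, ascending=False)
print(ranked)
```

Promising (about -9.0 points) and New (about +8.6 points) lead the ranking, while every other segment stays within 3 points of the reference.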

In [10]:
print("\nSaving results...")

# Save RFM data
rfm.to_csv(f'{OUTPUT_DIR}/rfm_customer_segments.csv', index=False)
print(f"  ✓ rfm_customer_segments.csv saved")



Saving results...
  ✓ rfm_customer_segments.csv saved
