# Recommended strategy for establishing a baseline RFM segmentation

Below is an overview of one recommended strategy, with a code example, for establishing a baseline RFM segmentation model with k-means and monitoring drift to decide when to retrain. In summary, the strategy involves:

1. **Establishing a Baseline Model:**
   - **Data Aggregation & Preprocessing:**  
     Aggregate your raw orders over a defined “stable” period (for example, the first 12 months of the 2-year period) to compute the RFM metrics (recency, frequency, monetary).
   - **Feature Scaling & Pipeline Consistency:**  
     Standardize (or otherwise scale) your features and save your transformation pipeline. This is crucial so that you can apply the exact same transformation to new data.
   - **Clustering & Model Storage:**  
     Train your k-means clustering model (using techniques like k-means++ initialization and a fixed random state for reproducibility) and compute baseline quality metrics (e.g., inertia, silhouette score, and cluster distribution). Save the cluster centroids and other model parameters.

2. **Monitoring for Drift:**
   - **Regular Aggregation on a Rolling Basis:**  
     For subsequent time periods (e.g., monthly or quarterly), aggregate new orders into RFM metrics for the same customers (or new ones if needed).  
   - **Comparing Distributions & Metrics:**  
     - **Statistical Tests:** Use tests like the Kolmogorov–Smirnov test or calculate the Population Stability Index (PSI) on each RFM feature to detect shifts in their distributions.
     - **Quality Metrics:** Compare the new aggregated data’s clustering quality (e.g., inertia, silhouette score) against the baseline.
     - **Cluster Center Movements:** If possible, compare the positions of the new cluster centers (or the clusters assigned by the baseline model) to the baseline centroids. k-means numbers its clusters arbitrarily, so match each new center to its nearest baseline center before measuring how far it moved.  
     - **Label Comparison (when applicable):** If the same customers are present in both periods, you can use metrics such as the Adjusted Rand Index (ARI) or homogeneity and completeness scores to evaluate whether the cluster assignments remain similar.
     
3. **Deciding When to Retrain:**
   - **Threshold-Based Alerts:**  
     Define thresholds for relative changes in metrics (for example, a 20% change in inertia or silhouette score, or a significant shift in a feature distribution). Exceeding a threshold suggests that customer behavior has shifted and that the model may no longer represent the current state. Because inertia is a sum over all points, compare it on the same set of customers or divide it by the number of customers first.
   - **Evaluate the Frequency:**  
     By monitoring these metrics on a rolling window (say monthly or quarterly), you can analyze how quickly your metrics drift. This analysis can help you determine the optimal frequency for retraining—if you see that significant drift happens every 3–4 months, then quarterly retraining might be ideal.
   - **Business Cycle Considerations:**  
     In some industries, seasonal effects (holidays, promotions, etc.) might suggest aligning model updates with business cycles.
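
The Population Stability Index mentioned in step 2 is not computed in the example further down, which uses only the KS test. Below is a minimal sketch of one common way to compute it, assuming quantile bins taken from the baseline sample; the function name, the bin count, and the `eps` floor are illustrative choices, not part of the original notebook.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a baseline sample (`expected`) and a new sample (`actual`)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from baseline quantiles; ties (common for frequency) collapse bins.
    edges = np.unique(np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1)))
    if len(edges) < 2:
        raise ValueError("baseline feature is constant; PSI is undefined")
    # New values outside the baseline range are counted in the outermost bins.
    actual = np.clip(actual, edges[0], edges[-1])
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) and division by zero for empty bins.
    expected_pct = np.clip(expected_pct, eps, None)
    actual_pct = np.clip(actual_pct, eps, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Simulated monetary values: same distribution vs. a doubled average spend.
rng = np.random.default_rng(0)
baseline_monetary = rng.exponential(scale=100, size=2000)
shifted_monetary = rng.exponential(scale=200, size=2000)

psi_same = population_stability_index(baseline_monetary, baseline_monetary)
psi_shifted = population_stability_index(baseline_monetary, shifted_monetary)
print("PSI (identical samples):", psi_same)
print("PSI (shifted sample):", psi_shifted)
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a large shift, but those cut-offs are conventions and should be checked against your own history.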


---

### Explanation

1. **Baseline Model Establishment:**  
   - We aggregate orders over the first 12 months to create the baseline RFM table.
   - We scale the features and train a k-means model (with fixed random state for reproducibility) on the baseline data.
   - Baseline quality metrics such as inertia, silhouette score, and cluster distribution are computed and stored.

2. **Drift Monitoring:**  
   - On a rolling basis (in this example, every 3 months after the baseline), **we re-aggregate the orders** up to the current period and compute new RFM values.
   - We transform the new data using the same scaler, assign clusters with the baseline model, and also fit a fresh k-means model on the new data for comparison.
   - We compare the new clustering’s **inertia, silhouette score, and distribution** against the baseline.
   - We also compare the clustering labels for the common set of customers using **ARI, homogeneity, and completeness**.
   - The **Kolmogorov–Smirnov** test is used to assess shifts in the distribution of each RFM feature.
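   - The pairwise distances between baseline and new cluster centers are computed to quantify shifts. Because k-means numbers its clusters arbitrarily, read the matrix by matching each baseline center to its closest new center rather than to the one with the same index.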

3. **Evaluating Retraining Frequency:**  
   - By reviewing how these drift metrics change over successive periods, you can identify the cadence at which significant changes occur.
   - If a metric crosses its threshold (for example, a relative change of more than 20% in inertia or silhouette score), that’s a signal to consider retraining.
   - The monitoring frequency (monthly, quarterly, etc.) should be chosen based on the rate of change observed and business cycle considerations.
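
Since cluster indices from two separate k-means fits do not line up, the center comparison in step 2 needs a matching step. The sketch below is one way to do it, using SciPy's `linear_sum_assignment` to pair centers so that the total distance is minimal; the helper name `matched_center_shift` and the toy centers are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def matched_center_shift(baseline_centers, new_centers):
    """Pair each baseline center with one new center and return the per-pair shift."""
    distances = cdist(baseline_centers, new_centers, metric="euclidean")
    # One-to-one assignment that minimises the summed distance over all pairs.
    row_idx, col_idx = linear_sum_assignment(distances)
    return col_idx, distances[row_idx, col_idx]

# Toy example: the new fit found the same three clusters in a different label order.
baseline_centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
new_centers = np.array([[10.1, 0.0], [0.0, 0.2], [5.0, 5.3]])

mapping, shifts = matched_center_shift(baseline_centers, new_centers)
for base_label, (new_label, shift) in enumerate(zip(mapping, shifts)):
    print(f"baseline cluster {base_label} -> new cluster {new_label}, shift = {shift:.2f}")
```

The resulting per-cluster shifts are easier to threshold than the full distance matrix, because each baseline segment gets a single number.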

This strategy ensures that your baseline RFM segmentation model remains representative of your customer behavior and that you have an objective method to decide when retraining is warranted.
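
To turn step 3 into something you can review period by period, one option is to keep the drift metrics in a small table and flag the periods that cross the threshold. The sketch below assumes a 20% relative-change threshold as in the text; the function name `flag_drift` and all numbers in `history` and `baseline` are made up for illustration and are not results from the notebook run.

```python
import pandas as pd

def flag_drift(history, baseline, threshold=0.2):
    """Add relative-change columns and a boolean `retrain` flag to a metrics history."""
    out = history.copy()
    for metric, base_value in baseline.items():
        out[f"{metric}_change"] = (out[metric] - base_value).abs() / abs(base_value)
    change_cols = [f"{metric}_change" for metric in baseline]
    # A period is flagged as soon as any tracked metric exceeds the threshold.
    out["retrain"] = out[change_cols].gt(threshold).any(axis=1)
    return out

# Illustrative values only.
history = pd.DataFrame({
    "period_end": pd.to_datetime(["2024-03-31", "2024-06-29", "2024-09-27"]),
    "inertia": [170.0, 185.0, 210.0],
    "silhouette": [0.33, 0.31, 0.29],
})
baseline = {"inertia": 160.0, "silhouette": 0.34}

report = flag_drift(history, baseline)
first_alert = report.loc[report["retrain"], "period_end"].min()
print(report[["period_end", "inertia_change", "silhouette_change", "retrain"]])
print("First period exceeding the threshold:", first_alert)
```

The gap between the baseline date and the first flagged period gives a rough, data-driven starting point for the retraining interval.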

---

Below is a Python example that demonstrates these concepts.

In [12]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, adjusted_rand_score, homogeneity_score, completeness_score
from scipy.stats import ks_2samp
import matplotlib.pyplot as plt

# ===============================
# STEP 1: Establishing a Baseline Model
# ===============================

# Assume orders data over 2 years with columns: customer_unique_id, order_date, amount
# For demonstration, we simulate data.

np.random.seed(42)
n_customers = 200
n_orders = 1000
start_date = datetime(2023, 1, 1)
end_date = datetime(2024, 12, 31)
date_range = (end_date - start_date).days

# Simulated orders
orders = pd.DataFrame({
    'customer_unique_id': np.random.choice(range(1, n_customers+1), size=n_orders),
    'order_date': [start_date + timedelta(days=int(np.random.rand()*date_range)) for _ in range(n_orders)],
    'amount': np.random.exponential(scale=100, size=n_orders)  # some variability in spend
})

# Define baseline period (for example, first 12 months)
baseline_end = start_date + timedelta(days=365)
baseline_orders = orders[orders['order_date'] <= baseline_end]

# Compute RFM metrics for baseline period
# - Recency: days since last order relative to baseline_end
# - Frequency: number of orders in baseline period
# - Monetary: total spend in baseline period

# Named aggregation avoids aggregating the grouping column itself;
# counting order_date gives the same order count as long as no dates are missing.
rfm_baseline = baseline_orders.groupby('customer_unique_id').agg(
    recency=('order_date', lambda x: (baseline_end - x.max()).days),
    frequency=('order_date', 'count'),
    monetary=('amount', 'sum')
)

print("Baseline RFM (first few rows):")
print(rfm_baseline.head())

Baseline RFM (first few rows):
                    recency  frequency    monetary
customer_unique_id                                
1                        25          3   86.329042
2                       328          1  127.805743
3                       210          2  102.044041
4                        20          3  229.920447
5                         7          4  906.574205


In [26]:
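# Note: the cell counters are out of order. The fourth column is `cluster`,
# added by a run of the clustering cell below before this cell was executed.
rfm_baseline.shape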

(179, 4)

In [27]:
# Preprocessing: scale features
features = ['recency', 'frequency', 'monetary']
scaler = StandardScaler()
X_baseline = scaler.fit_transform(rfm_baseline[features])

# Train k-means on baseline data
k = 4  # or choose via domain knowledge / elbow method
kmeans_baseline = KMeans(n_clusters=k, random_state=42, n_init=10)
rfm_baseline['cluster'] = kmeans_baseline.fit_predict(X_baseline)

# Compute baseline quality metrics
baseline_inertia = kmeans_baseline.inertia_
baseline_silhouette = silhouette_score(X_baseline, rfm_baseline['cluster'])
baseline_distribution = rfm_baseline['cluster'].value_counts(normalize=True).sort_index()

print("\nBaseline Clustering Metrics:")
print("Inertia:", baseline_inertia)
print("Silhouette Score:", baseline_silhouette)
print("Cluster Distribution:\n", baseline_distribution)

# Keep the baseline cluster centers for later comparison.
# The fitted scaler and model stay in memory here; persist them to disk for production use.
baseline_centers = kmeans_baseline.cluster_centers_


Baseline Clustering Metrics:
Inertia: 160.98940763117378
Silhouette Score: 0.3368236192887641
Cluster Distribution:
 cluster
0    0.307263
1    0.223464
2    0.145251
3    0.324022
Name: proportion, dtype: float64
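
The strategy calls for saving the scaler and the model so that the exact same transformation can be applied later, but the cell above only keeps them in memory. A minimal sketch of persisting both with `joblib` follows; the random stand-in data, the temporary directory, and the file name `rfm_baseline.joblib` are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-in for rfm_baseline[features]; replace with the real baseline table.
rng = np.random.default_rng(42)
X_raw = rng.normal(size=(100, 3))

scaler = StandardScaler().fit(X_raw)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10).fit(scaler.transform(X_raw))

# Store the scaler and the model together so they cannot get out of sync.
artifact_path = Path(tempfile.mkdtemp()) / "rfm_baseline.joblib"  # hypothetical location
joblib.dump({"scaler": scaler, "kmeans": kmeans}, artifact_path)

# Later (for example in the monitoring job): reload and score new data.
restored = joblib.load(artifact_path)
labels_before = kmeans.predict(scaler.transform(X_raw))
labels_after = restored["kmeans"].predict(restored["scaler"].transform(X_raw))
same_labels = np.array_equal(labels_before, labels_after)
print("Restored model reproduces the assignments:", same_labels)
```

Artifacts saved this way are generally only safe to reload with the same scikit-learn version, so it helps to record the library version next to the file.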


In [15]:
# ===============================
# STEP 2: Monitoring for Drift on New Data Periods
# ===============================

def aggregate_rfm(orders_df, period_end):
    """
    Aggregate all orders up to and including period_end into one RFM row per customer.
    """
    in_period = orders_df[orders_df['order_date'] <= period_end]
    # Named aggregation avoids aggregating the grouping column itself;
    # counting order_date gives the same order count as long as no dates are missing.
    rfm = in_period.groupby('customer_unique_id').agg(
        recency=('order_date', lambda x: (period_end - x.max()).days),
        frequency=('order_date', 'count'),
        monetary=('amount', 'sum')
    )
    return rfm

In [16]:
baseline_end

datetime.datetime(2024, 1, 1, 0, 0)

In [17]:
# For example, monitor drift every 3 months after the baseline period
monitor_dates = [baseline_end + timedelta(days=90*i) for i in range(1, 5)]  # 90, 180, 270, 360 days later
monitor_dates

[datetime.datetime(2024, 3, 31, 0, 0),
 datetime.datetime(2024, 6, 29, 0, 0),
 datetime.datetime(2024, 9, 27, 0, 0),
 datetime.datetime(2024, 12, 26, 0, 0)]

In [25]:
print("\nBaseline Clustering Metrics:")
print("Inertia:", baseline_inertia)
print("Silhouette Score:", baseline_silhouette)
print("Cluster Distribution:\n", baseline_distribution)

print("\n--- Monitoring for Drift ---")

for monitor_date in monitor_dates:
    print(f"\n--- Monitoring at period end: {monitor_date.date()} ---")
    # Aggregate RFM over all orders up to monitor_date (cumulative, so the baseline period is included)
    rfm_new = aggregate_rfm(orders, monitor_date)
    # print("New RFM Shape:", rfm_new.shape)
    # print(rfm_new.iloc[0,:])
    
    # Align customers: use intersection of baseline and new customers for label-based comparisons
    common_customers = rfm_baseline.index.intersection(rfm_new.index)
    # print("Common Customers:", len(common_customers))
    rfm_new = rfm_new.loc[common_customers]
    # print("New RFM Shape (after alignment):", rfm_new.shape)
    X_new = scaler.transform(rfm_new[features])
    
    # Option 1: Use the baseline k-means model to assign clusters (keeps the original segmentation)
    rfm_new['cluster_baseline'] = kmeans_baseline.predict(X_new)
    
    # Option 2: Retrain a k-means on new data (to see the drift in cluster centers) 
    # Note: In production, you might not retrain until you decide to update.
    kmeans_new = KMeans(n_clusters=k, random_state=42, n_init=10)
    rfm_new['cluster_new'] = kmeans_new.fit_predict(X_new)
    
    new_inertia = kmeans_new.inertia_
    new_silhouette = silhouette_score(X_new, rfm_new['cluster_new'])
    new_distribution = rfm_new['cluster_new'].value_counts(normalize=True).sort_index()
    
    print("New Data Metrics:")
    print("Inertia:", new_inertia)
    print("Silhouette Score:", new_silhouette)
    print("Cluster Distribution:\n", new_distribution)
    
    # Compare baseline vs. new clustering (only on common customers)
    ari = adjusted_rand_score(rfm_baseline.loc[common_customers, 'cluster'], rfm_new['cluster_new'])
    homogeneity = homogeneity_score(rfm_baseline.loc[common_customers, 'cluster'], rfm_new['cluster_new'])
    completeness = completeness_score(rfm_baseline.loc[common_customers, 'cluster'], rfm_new['cluster_new'])
    
    print("Adjusted Rand Index (ARI):", ari)
    print("Homogeneity:", homogeneity)
    print("Completeness:", completeness)
    
    # Compare feature distributions using KS test
    for feature in features:
        stat, p_value = ks_2samp(rfm_baseline.loc[common_customers, feature],
                                  rfm_new[feature])
        print(f"KS test for {feature}: statistic={stat:.3f}, p-value={p_value:.3f}")
    
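    # Compare cluster centers distance
    centers_new = kmeans_new.cluster_centers_
    # Pairwise Euclidean distances: row i = baseline center i, column j = new center j.
    # k-means label order is arbitrary, so look at each row's minimum rather than the diagonal.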
    from scipy.spatial.distance import cdist
    distance_matrix = cdist(baseline_centers, centers_new, metric='euclidean')
    print("Distance Matrix between Baseline and New Cluster Centers:")
    print(distance_matrix)
    
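    # Decide if drift is significant: you might define thresholds based on business rules or historical data.
    # Relative changes versus the baseline; abs() in the denominator keeps the ratio positive
    # even if the baseline silhouette score were negative.
    inertia_change = abs(new_inertia - baseline_inertia) / baseline_inertia
    silhouette_change = abs(new_silhouette - baseline_silhouette) / abs(baseline_silhouette)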
    if inertia_change > 0.2 or silhouette_change > 0.2:
        print("Significant change in cluster quality detected. Consider re-training the model.")
    else:
        print("No significant drift detected yet.")
    print("Inertia Change:", inertia_change)
    print("Silhouette Change:", silhouette_change)

# ===============================
# STEP 3: Determining the Best Training Frequency
# ===============================

# By running the monitoring code above on a sliding window (e.g., every 3 months),
# you can analyze how quickly your drift metrics (inertia, silhouette, feature distributions, cluster center shifts) change.
#
# For instance, if you observe that significant changes (exceeding your defined thresholds) occur every quarter,
# then quarterly retraining might be appropriate. If the drift is very gradual,
# you might lengthen the retraining interval to every 6 months.
#
# You can also supplement these metrics with business insights (e.g., seasonal campaigns, product changes)
# to decide on the retraining frequency.




Baseline Clustering Metrics:
Inertia: 160.98940763117378
Silhouette Score: 0.3368236192887641
Cluster Distribution:
 cluster
0    0.307263
1    0.223464
2    0.145251
3    0.324022
Name: proportion, dtype: float64

--- Monitoring for Drift ---

--- Monitoring at period end: 2024-03-31 ---
New Data Metrics:
Inertia: 214.2919585453768
Silhouette Score: 0.301222347461644
Cluster Distribution:
 cluster_new
0    0.111732
1    0.201117
2    0.374302
3    0.312849
Name: proportion, dtype: float64
Adjusted Rand Index (ARI): 0.23563551366334912
Homogeneity: 0.3968279302122142
Completeness: 0.4102794116018595
KS test for recency: statistic=0.061, p-value=0.889
KS test for frequency: statistic=0.218, p-value=0.000
KS test for monetary: statistic=0.145, p-value=0.046
Distance Matrix between Baseline and New Cluster Centers:
[[2.99187405 3.60334816 1.74446156 0.9287696 ]
 [1.12046353 4.06240232 2.84758721 1.31135911]
 [4.6786521  0.53910127 1.79862873 2.9131685 ]
 [3.48671388 2.13931313 0.55603562

## Intersection of customers

New customers are excluded from the label-based comparison because metrics like the Adjusted Rand Index (ARI), homogeneity, and completeness compare clustering assignments on the same set of data points. Here’s why:

1. **One-to-One Correspondence Requirement:**  
   Metrics such as ARI **compare how each customer is clustered in both periods**. If a customer appears only in the new period (i.e., a new customer), there is no baseline cluster assignment for that customer, so the comparison is undefined for them.

2. **Ensuring Meaningful Comparisons:**  
   By using only the intersection of customers present in both the baseline and new data, you ensure that you’re comparing "apples to apples." This allows you to assess whether the clustering assignments for the same set of customers have changed over time, which is what these metrics are designed to measure.

3. **Handling New Customers Separately:**  
   While new customers are excluded from the label-based drift metrics, they’re still important for your overall analysis. You can monitor new customers by:
   - Checking their distribution across clusters using the baseline model.
   - Analyzing how the overall cluster proportions change when new customers are included.
   - Evaluating business metrics (e.g., average spend, churn rate) for new versus existing customers.

In practice, you might use two different analyses:
- **Label-Based Comparison:** Restrict to customers present in both periods to use ARI, homogeneity, and completeness.
- **Overall Drift Monitoring:** Analyze all customers (including new ones) to see how overall cluster characteristics and business metrics are evolving.

This two-pronged approach ensures that you capture both the stability of the segmentation for existing customers and the integration and behavior of new customers over time.
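
As a sketch of the "overall drift monitoring" prong, the snippet below assigns every current customer to a segment with the baseline scaler and model, then compares how existing and new customers spread across the clusters. The data is simulated and the helper `simulate_rfm` is purely illustrative; in the notebook you would use `rfm_baseline`, `scaler`, and `kmeans_baseline` instead.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ["recency", "frequency", "monetary"]
rng = np.random.default_rng(0)

def simulate_rfm(ids):
    """Random RFM table for illustration only."""
    return pd.DataFrame({
        "recency": rng.integers(0, 365, len(ids)),
        "frequency": rng.integers(1, 6, len(ids)),
        "monetary": rng.exponential(200, len(ids)),
    }, index=pd.Index(ids, name="customer_unique_id"))

rfm_baseline = simulate_rfm(range(1, 151))
rfm_current = simulate_rfm(range(1, 201))  # customers 151-200 are new

# Fit on the baseline only, as in the notebook.
scaler = StandardScaler().fit(rfm_baseline[features])
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
kmeans.fit(scaler.transform(rfm_baseline[features]))

# Score everyone with the baseline model and mark who is new.
rfm_current["cluster"] = kmeans.predict(scaler.transform(rfm_current[features]))
rfm_current["is_new"] = ~rfm_current.index.isin(rfm_baseline.index)

proportions = (
    rfm_current.groupby("is_new")["cluster"]
    .value_counts(normalize=True)
    .unstack(fill_value=0.0)
    .rename(index={False: "existing", True: "new"})
)
print(proportions)
```

If new customers concentrate in one or two baseline clusters, that is a hint the segmentation may need a cluster of its own for them, even when the label-based metrics for existing customers look stable.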

## Aggregating all past data

The choice between **aggregating all past data (2023 + Q1 2024)** versus using a **rolling window (e.g., baseline 2023, new data Q1 2024 only)** for drift monitoring depends on the assumptions you make about customer behavior and the stability of your clusters.  

---

## **1. Why Re-Aggregate Orders (2023 + Q1 2024) When Drift Monitoring?**  
When you aggregate **all past data up to the current period**, you assume that past customer behavior is still relevant and contributes to the current segmentation. This is useful when:  

- **Customer Behavior is Cumulative** → If customers’ purchasing behavior is persistent (e.g., high-spending customers continue to be high spenders), their full historical transaction history remains important for clustering.  
- **Long-Term Trends Matter** → You want to track whether the overall segmentation structure is shifting over time, rather than only detecting short-term fluctuations.  
- **Business Needs a Stable Segmentation** → If the marketing team relies on stable customer segments, you want to monitor drift relative to a long-term reference, rather than reacting to short-term seasonal changes.  

### **Key Benefit**:  
It allows you to detect **gradual drifts** in customer behavior over time, rather than being overly sensitive to short-term fluctuations.  

---

## **2. Why Use a Rolling Window Instead? (Baseline: 2023, New: Q1 2024 Only)**  
Using a rolling window means that you always compare customer behavior over **a fixed recent period** (e.g., last 12 months, or just the last quarter) rather than accumulating all past data.  

### **Use This When:**  
- **Customer Behavior Changes Rapidly** → If customer purchasing patterns change frequently (e.g., due to seasonality, economic shifts, new product releases).  
- **You Want to Detect Sudden Shifts** → This approach is useful for identifying sudden behavioral changes (e.g., if customers stop purchasing or switch to different spending patterns).  
- **Recent Data is More Relevant** → In industries where past customer data loses relevance quickly (e.g., fast-moving consumer goods, subscriptions, fashion).  

### **Key Benefit**:  
It allows you to react **faster** to changes in customer behavior by focusing only on the most recent period.  

---

## **So, Which Strategy is Better?**  

It depends on the business case:  

| **Approach** | **Best For** | **Trade-Offs** |
|-------------|-------------|---------------|
| **Aggregating all past data (2023 + Q1 2024, etc.)** | Businesses with long-term customer relationships where past behavior remains relevant | May be slow to react to sudden changes |
| **Rolling window (e.g., baseline 2023, new Q1 2024 only)** | Fast-changing businesses where recent customer activity matters most | May be overly sensitive to short-term fluctuations |

---

## **How to Choose?**  
You can experiment with both methods and track how drift metrics behave:  

1. **Compute drift metrics (e.g., Adjusted Rand Index, cluster centroid movement) using both methods**  
2. **Compare their sensitivity to changes in customer behavior**  
3. **Align with business needs**—if stability is critical, use the full aggregated approach; if rapid detection of changes is needed, use a rolling window.  

💡 **Hybrid Approach?**  
Some businesses use a **weighted rolling window**, where recent data is given more importance but older data still contributes. For example, weighting transactions so that:  
- Orders from the past **6 months** are given full weight (1.0)  
- Orders from **6-12 months ago** are given partial weight (e.g., 0.5)  
- Orders older than **12 months** are ignored  
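
The weighting scheme above can be sketched as follows. This is one possible reading, assuming "6 months" means 182 days and "12 months" means 365 days, with frequency and monetary computed as weighted sums and recency left unweighted; the function name `weighted_rfm` and its parameters are illustrative.

```python
import numpy as np
import pandas as pd

def weighted_rfm(orders, as_of, full_days=182, partial_days=365, partial_weight=0.5):
    """RFM with step weights: 1.0 for recent orders, partial for older, 0 beyond the window."""
    age_days = (as_of - orders["order_date"]).dt.days
    weight = np.select(
        [age_days < 0, age_days <= full_days, age_days <= partial_days],
        [0.0, 1.0, partial_weight],  # future-dated orders get weight 0
        default=0.0,
    )
    weighted = orders.assign(weight=weight, weighted_amount=orders["amount"] * weight)
    weighted = weighted[weighted["weight"] > 0]  # drop ignored orders entirely
    return weighted.groupby("customer_unique_id").agg(
        recency=("order_date", lambda d: (as_of - d.max()).days),
        frequency=("weight", "sum"),
        monetary=("weighted_amount", "sum"),
    )

orders = pd.DataFrame({
    "customer_unique_id": [1, 1, 1, 2],
    "order_date": pd.to_datetime(["2024-06-01", "2023-12-01", "2023-01-01", "2023-02-01"]),
    "amount": [100.0, 200.0, 400.0, 50.0],
})
rfm = weighted_rfm(orders, as_of=pd.Timestamp("2024-07-01"))
print(rfm)
```

In this toy data, customer 1 keeps one full-weight order and one half-weight order, while customer 2 drops out because their only order is older than 12 months. Note that weighted frequency is no longer an integer count, which matters if downstream code assumes whole numbers.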

---

## **Final Recommendation**  
- If customer behavior is stable over time → **Aggregate all past data**  
- If customer behavior shifts frequently → **Use a rolling window**  
- If unsure, experiment with **both methods** and evaluate their impact using drift metrics  
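
A small experiment along these lines might look like the sketch below. It builds RFM tables both cumulatively and over a trailing 365-day window, then runs the KS test against the baseline for each feature. The data is simulated, the `window_days` parameter is an illustrative extension of the notebook's `aggregate_rfm`, and, unlike the notebook loop, the comparison uses all customers rather than only the common ones.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def aggregate_rfm(orders, period_end, window_days=None):
    """RFM up to `period_end`: cumulative if `window_days` is None, else a trailing window."""
    mask = orders["order_date"] <= period_end
    if window_days is not None:
        mask &= orders["order_date"] > period_end - pd.Timedelta(days=window_days)
    return orders[mask].groupby("customer_unique_id").agg(
        recency=("order_date", lambda d: (period_end - d.max()).days),
        frequency=("order_date", "count"),
        monetary=("amount", "sum"),
    )

# Simulated orders over two years, similar in spirit to the notebook's data.
rng = np.random.default_rng(42)
n_orders = 1000
orders = pd.DataFrame({
    "customer_unique_id": rng.integers(1, 201, n_orders),
    "order_date": pd.Timestamp("2023-01-01")
    + pd.to_timedelta(rng.integers(0, 730, n_orders), unit="D"),
    "amount": rng.exponential(100, n_orders),
})

baseline = aggregate_rfm(orders, pd.Timestamp("2024-01-01"))
monitor_date = pd.Timestamp("2024-03-31")
candidates = {
    "cumulative": aggregate_rfm(orders, monitor_date),
    "rolling_365d": aggregate_rfm(orders, monitor_date, window_days=365),
}

rows = []
for method, rfm in candidates.items():
    for feature in ["recency", "frequency", "monetary"]:
        result = ks_2samp(baseline[feature], rfm[feature])
        rows.append({"method": method, "feature": feature,
                     "ks_stat": result.statistic, "p_value": result.pvalue})
comparison = pd.DataFrame(rows)
print(comparison)
```

With cumulative aggregation, frequency and monetary can only grow for a given customer, so some shift in those features is expected by construction; the rolling window removes that effect at the cost of more period-to-period noise.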
