# Customer Segmentation - Complete Solution
ML Implementation Series for Product Managers - Post #2

### DISCLAIMER: It helps greatly to know Python and ML basics beforehand. If you don't, I strongly urge you to learn them; treat this as non-negotiable. They form the foundation for future posts in this series and for your career as a PM working with ML teams.

### Why this post follows Post #1

In Post #1, we predicted **which** customers are at risk of churning.  
Now, we need to understand **who** they are.  

While the churn model flags high-risk customers, marketing asks:  
**"Are these customers similar? Or do they need different strategies?"**  

Without segmentation, we might treat everyone the same, wasting resources or missing opportunities.  
With data-driven customer profiles, marketing can personalize interventions—saving budget and increasing retention.

This post shows how to find these customer groups using behavior and value metrics, so we can act smarter—based on who they are, not just their risk score.


### Problem statement

The marketing VP walked into the room after seeing our churn model results and said:

"Great, we know who's leaving. But who ARE these people? Are they all the same? Or do we need different strategies for different groups?"

The team had been using a simple manual segmentation:
- "Premium" vs "Standard" vs "Basic" (based on subscription tier)
- "New" vs "Active" vs "Dormant" (based on rough activity guesses)

But nobody trusted these labels. They felt arbitrary.

The real questions we needed to answer:
- Do we have behavioral data beyond just demographics?
- Can we find natural groupings in customer behavior?
- What patterns emerge when we look at purchase history, engagement, and value together?

Marketing was spending the same amount on every "Standard" customer, even though some were worth far more than others.

We needed data-driven segmentation that reflected actual customer behavior.


### Dataset overview

We used the same Customer Data Platform (CDP) with 5,000 customers from Post #1.

#### Tables used:
- **cdp_customers** - Customer demographics
- **cdp_customer_features** - Pre-calculated behavioral metrics (RFM, engagement)

#### Features for segmentation:

We selected 9 key behavioral and value metrics:

1. `recency_days` - Days since last purchase
2. `frequency` - Total number of purchases
3. `monetary_value` - Total customer spend
4. `avg_order_value` - Average transaction size
5. `engagement_score` - Overall platform engagement
6. `email_open_rate` - Email engagement level
7. `email_click_rate` - Click-through behavior
8. `campaign_conversions` - Campaign response history
9. `customer_lifetime_value` - Predicted total value

**Key insight:** We combined RFM (Recency, Frequency, Monetary) with engagement metrics to capture both spending behavior and brand connection.
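
The RFM columns in `cdp_customer_features` arrive pre-calculated, so the post does not show how they were built. As a rough sketch of how such values are commonly derived, assuming a hypothetical transactions table with `customer_id`, `order_date`, and `amount` columns (none of which are part of the dataset described here):

```python
import pandas as pd

# Hypothetical transactions; table and column names are assumptions for illustration.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-10", "2023-11-20",
        "2024-02-01", "2024-02-15", "2024-04-01",
    ]),
    "amount": [50.0, 70.0, 200.0, 20.0, 30.0, 40.0],
})

# Reference date for recency: here, one day after the most recent order.
snapshot_date = transactions["order_date"].max() + pd.Timedelta(days=1)

rfm = transactions.groupby("customer_id").agg(
    last_order=("order_date", "max"),
    frequency=("order_date", "count"),
    monetary_value=("amount", "sum"),
)
rfm["recency_days"] = (snapshot_date - rfm["last_order"]).dt.days
rfm["avg_order_value"] = rfm["monetary_value"] / rfm["frequency"]
rfm = rfm.drop(columns="last_order")
print(rfm)
```

The choice of snapshot date matters: recency is always measured relative to it, so it should be fixed and recorded whenever the features are refreshed.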


### ML approach

**Problem type:** Unsupervised learning - clustering

**Algorithm:** K-means clustering

#### Why K-means?
- Simple and interpretable
- Fast to train on thousands of customers
- Produces clear, distinct segments
- Easy to explain to marketing stakeholders
- Scales to millions of customers in production

#### How it works:
1. Group customers by similarity across all 9 features
2. Each customer is assigned to their nearest cluster center
3. Segments are defined by shared behavioral patterns, not arbitrary rules
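
The steps above can be sketched in a few lines of NumPy. This is a toy illustration of the assign-then-update loop on made-up two-dimensional data, with a simplified random initialization; it is not scikit-learn's implementation, which adds smarter initialization and multiple restarts.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=20, seed=0):
    """Toy K-means: alternate between assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random (simplified initialization).
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated toy groups of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers = kmeans_sketch(X, k=2)
print(np.bincount(labels))
```

Because the result depends on where the centers start, production code usually runs several initializations and keeps the best one, which is what the `n_init=10` argument does in the cells that follow.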


### Let's get into it

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
import warnings
warnings.filterwarnings('ignore')

print("="*70)
print("CUSTOMER SEGMENTATION - ML SOLUTION")
print("="*70)

# Load the datasets
cdp_customers = pd.read_csv('cdp_customers.csv')
cdp_customer_features = pd.read_csv('cdp_customer_features.csv')

# Merge
df = cdp_customers.merge(cdp_customer_features, on='customer_id', how='left')

print(f"\n✓ Loaded data: {len(df)} customers")
print(f"✓ Available features: {len(df.columns)} columns")

print("\n" + "="*70)
print("STEP 1: SELECTING SEGMENTATION FEATURES")
print("="*70)

# Select features for clustering (RFM + Engagement)
segmentation_features = [
    'recency_days',
    'frequency',
    'monetary_value',
    'avg_order_value',
    'engagement_score',
    'email_open_rate',
    'email_click_rate',
    'campaign_conversions',
    'customer_lifetime_value'
]

# Create feature matrix; fill missing values with each column's median
X = df[segmentation_features].copy()
X = X.fillna(X.median())

print(f"\n✓ Selected {len(segmentation_features)} features for segmentation:")
for i, feature in enumerate(segmentation_features, 1):
    print(f"   {i}. {feature}")

print(f"\n✓ Feature matrix shape: {X.shape}")
```


```text
CUSTOMER SEGMENTATION - ML SOLUTION

✓ Loaded data: 5000 customers
✓ Available features: 29 columns

STEP 1: SELECTING SEGMENTATION FEATURES

✓ Selected 9 features for segmentation:
   1. recency_days
   2. frequency
   3. monetary_value
   4. avg_order_value
   5. engagement_score
   6. email_open_rate
   7. email_click_rate
   8. campaign_conversions
   9. customer_lifetime_value

✓ Feature matrix shape: (5000, 9)
```


```python
print("\n" + "="*70)
print("FEATURE SCALING")
print("="*70)

# Scale features (critical for K-Means)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("\n✓ Features scaled using StandardScaler")
print("✓ All features now have mean=0, std=1")

print("\n" + "="*70)
print("DETERMINING OPTIMAL NUMBER OF CLUSTERS")
print("="*70)

# Test different numbers of clusters
inertias = []
silhouette_scores = []
K_range = range(2, 9)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))

print("\n Cluster Evaluation Metrics:")
print("-" * 70)
print(f"{'K':>5} {'Inertia':>15} {'Silhouette Score':>20}")
print("-" * 70)
for k, inertia, sil_score in zip(K_range, inertias, silhouette_scores):
    print(f"{k:>5} {inertia:>15.2f} {sil_score:>20.3f}")

# Choose optimal K (highest silhouette score)
optimal_k = K_range[np.argmax(silhouette_scores)]
print(f"\n✓ Optimal number of clusters: {optimal_k}")
print(f"✓ Best silhouette score: {max(silhouette_scores):.3f}")
```



```text
FEATURE SCALING

✓ Features scaled using StandardScaler
✓ All features now have mean=0, std=1

DETERMINING OPTIMAL NUMBER OF CLUSTERS

 Cluster Evaluation Metrics:
----------------------------------------------------------------------
    K         Inertia     Silhouette Score
----------------------------------------------------------------------
    2        35062.55                0.305
    3        29470.00                0.323
    4        24800.87                0.323
    5        21362.26                0.315
    6        18804.38                0.247
    7        16969.29                0.248
    8        15235.09                0.244

✓ Optimal number of clusters: 4
✓ Best silhouette score: 0.323
```
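
The scaling step is marked as critical for K-Means because the algorithm groups customers by distance, and unscaled dollar amounts would swamp rate-style features. A small sketch with made-up numbers (three toy customers, not rows from the dataset) shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy customers: [monetary_value in dollars, email_open_rate as a fraction].
X = np.array([
    [500.0, 0.05],   # customer A: low engagement
    [520.0, 0.90],   # customer B: similar spend, very high engagement
    [900.0, 0.05],   # customer C: higher spend, same engagement as A
])

def dist(a, b):
    """Euclidean distance, the measure K-means relies on."""
    return float(np.linalg.norm(a - b))

# Unscaled: dollars dominate, so A looks almost identical to B.
print("raw    A-B:", round(dist(X[0], X[1]), 2), " A-C:", round(dist(X[0], X[2]), 2))

# Scaled: both features contribute on a comparable footing.
X_scaled = StandardScaler().fit_transform(X)
print("scaled A-B:", round(dist(X_scaled[0], X_scaled[1]), 2),
      " A-C:", round(dist(X_scaled[0], X_scaled[2]), 2))
```

Before scaling, the large engagement gap between A and B barely registers next to the spending gap between A and C. After scaling, the two differences carry similar weight.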


```python
print("\n" + "="*70)
print("BUILDING FINAL K-MEANS MODEL")
print("="*70)

# Train final model with optimal K
# (pinned to the K chosen by the silhouette comparison above)
optimal_k = 4
kmeans_final = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
df['segment'] = kmeans_final.fit_predict(X_scaled)

print(f"\n✓ K-Means model trained with {optimal_k} clusters")
print(f"✓ Silhouette Score: {silhouette_score(X_scaled, df['segment']):.3f}")
print(f"✓ Davies-Bouldin Index: {davies_bouldin_score(X_scaled, df['segment']):.3f} (lower is better)")

# Segment distribution
print("\n SEGMENT DISTRIBUTION:")
print("-" * 70)
segment_counts = df['segment'].value_counts().sort_index()
for seg, count in segment_counts.items():
    pct = (count / len(df)) * 100
    print(f"  Segment {seg}: {count:>5} customers ({pct:>5.1f}%)")

print("\n" + "="*70)
print("SEGMENT PROFILING")
print("="*70)

# Calculate segment profiles (mean of each feature per segment)
segment_profiles = df.groupby('segment')[segmentation_features].mean()

print("\n SEGMENT CHARACTERISTICS (Average values):\n")
print(segment_profiles.round(2).to_string())
```



```text
BUILDING FINAL K-MEANS MODEL

✓ K-Means model trained with 4 clusters
✓ Silhouette Score: 0.323
✓ Davies-Bouldin Index: 1.169 (lower is better)

 SEGMENT DISTRIBUTION:
----------------------------------------------------------------------
  Segment 0:   737 customers ( 14.7%)
  Segment 1:  2881 customers ( 57.6%)
  Segment 2:  1363 customers ( 27.3%)
  Segment 3:    19 customers (  0.4%)

SEGMENT PROFILING

 SEGMENT CHARACTERISTICS (Average values):

         recency_days  frequency  monetary_value  avg_order_value  engagement_score  email_open_rate  email_click_rate  campaign_conversions  customer_lifetime_value
segment
0              822.81       3.99          643.37           135.88              9.02             0.39              0.17                   0.0                   643.37
1              836.17       3.00          382.
```

```python
print("\n" + "="*70)
print("NAMING AND INTERPRETING SEGMENTS")
print("="*70)

# Analyze and name segments based on their profiles
segment_names = {}
segment_descriptions = {}

# Segment 0: Medium recency, low frequency, medium CLV, high engagement
segment_names[0] = "Engaged Browsers"
segment_descriptions[0] = "Medium spenders with high email engagement but moderate purchase frequency"

# Segment 1: High recency, low frequency, low CLV, very low engagement
segment_names[1] = "At-Risk Dormant"
segment_descriptions[1] = "Largest segment - low value, low engagement, haven't purchased recently"

# Segment 2: Lower recency, high frequency, high CLV, low engagement
segment_names[2] = "High-Value Customers"
segment_descriptions[2] = "Top spenders with high purchase frequency - key revenue drivers"

# Segment 3: Medium recency, medium frequency, medium CLV, very high engagement
segment_names[3] = "Campaign Champions"
segment_descriptions[3] = "Highly engaged with campaigns, responsive to marketing efforts"

# Add segment names to dataframe
df['segment_name'] = df['segment'].map(segment_names)

print("\n SEGMENT PROFILES:\n")
for seg in range(optimal_k):
    count = segment_counts[seg]
    pct = (count / len(df)) * 100
    profile = segment_profiles.loc[seg]

    print(f"{'='*70}")
    print(f"SEGMENT {seg}: {segment_names[seg].upper()}")
    print(f"{'='*70}")
    print(f"Size: {count} customers ({pct:.1f}%)")
    print(f"\n{segment_descriptions[seg]}")
    print(f"\nKey Metrics:")
    print(f"  • Recency: {profile['recency_days']:.0f} days")
    print(f"  • Frequency: {profile['frequency']:.1f} purchases")
    print(f"  • Customer Lifetime Value: ${profile['customer_lifetime_value']:.2f}")
    print(f"  • Email Open Rate: {profile['email_open_rate']:.1%}")
    print(f"  • Engagement Score: {profile['engagement_score']:.1f}")
    print()
```



```text
NAMING AND INTERPRETING SEGMENTS

 SEGMENT PROFILES:

SEGMENT 0: ENGAGED BROWSERS
Size: 737 customers (14.7%)

Medium spenders with high email engagement but moderate purchase frequency

Key Metrics:
  • Recency: 823 days
  • Frequency: 4.0 purchases
  • Customer Lifetime Value: $643.37
  • Email Open Rate: 38.5%
  • Engagement Score: 9.0

SEGMENT 1: AT-RISK DORMANT
Size: 2881 customers (57.6%)

Largest segment - low value, low engagement, haven't purchased recently

Key Metrics:
  • Recency: 836 days
  • Frequency: 3.0 purchases
  • Customer Lifetime Value: $382.20
  • Email Open Rate: 1.1%
  • Engagement Score: 1.3

SEGMENT 2: HIGH-VALUE CUSTOMERS
Size: 1363 customers (27.3%)

Top spenders with high purchase frequency - key revenue drivers

Key Metrics:
  • Recency: 604 days
  • Frequency: 9.0 purchases
  • Customer Lifetime Value: $2369.67
  • Email Open Rate: 3.6%
  • Engagement Score: 2.3

SEGMENT 3: CAMPAIGN CHAMPIONS
Size: 19 customers (0.4%)

Highly engaged with campaigns, res
```

```python
print("\n" + "="*70)
print("BUSINESS INSIGHTS & RECOMMENDATIONS")
print("="*70)

# Calculate revenue contribution by segment
segment_revenue = df.groupby('segment_name')['customer_lifetime_value'].sum()
total_revenue = segment_revenue.sum()

print("\n REVENUE CONTRIBUTION BY SEGMENT:")
print("-" * 70)
for seg_name in segment_names.values():
    revenue = segment_revenue[seg_name]
    pct = (revenue / total_revenue) * 100
    count = len(df[df['segment_name'] == seg_name])
    avg_clv = revenue / count
    print(f"{seg_name:25s} ${revenue:>12,.0f} ({pct:>5.1f}%)  |  Avg CLV: ${avg_clv:>8,.2f}")

print(f"\n{'Total Revenue':25s} ${total_revenue:>12,.0f}")

print("\n" + "="*70)
print("KEY BUSINESS INSIGHTS")
print("="*70)

print("""
1. HIGH-VALUE CUSTOMERS (27.3% of customers) drive 67% of revenue
   → These 1,363 customers are the business's backbone
   → Despite high CLV, their engagement is low (3.6% email open rate)
   → Risk: They could churn without warning

2. AT-RISK DORMANT (57.6%) represent wasted potential
   → Largest segment but lowest value and engagement
   → Haven't purchased in 836 days on average
   → Action needed: Win-back campaigns or deprioritize

3. ENGAGED BROWSERS (14.7%) show promise
   → High email engagement (38.5% open rate) but moderate spending
   → Opportunity to convert engagement into purchases
   → Strategy: Targeted upsell campaigns

4. CAMPAIGN CHAMPIONS (0.4%) are brand advocates
   → Tiny segment but highly responsive
   → 42.8% email open rate, 65% click rate
   → Leverage for referrals and testimonials
""")

print("="*70)
print("RECOMMENDED MARKETING STRATEGIES BY SEGMENT")
print("="*70)

strategies = {
    "High-Value Customers": [
        "VIP loyalty programs with exclusive benefits",
        "Personal account managers for top 100 customers",
        "Early access to new products",
        "Prevent churn through proactive engagement"
    ],
    "At-Risk Dormant": [
        "Aggressive win-back campaigns with steep discounts",
        "Test re-engagement vs. deprioritization",
        "Remove non-responders from email lists (reduce costs)"
    ],
    "Engaged Browsers": [
        "Upsell campaigns based on browsing behavior",
        "Free shipping offers to convert browsers to buyers",
        "Product recommendations based on interests"
    ],
    "Campaign Champions": [
        "Referral program incentives",
        "Beta tester opportunities",
        "User-generated content campaigns",
        "Brand ambassador programs"
    ]
}

for seg_name, actions in strategies.items():
    print(f"\n{seg_name.upper()}:")
    for action in actions:
        print(f"  • {action}")
```



```text
BUSINESS INSIGHTS & RECOMMENDATIONS

 REVENUE CONTRIBUTION BY SEGMENT:
----------------------------------------------------------------------
Engaged Browsers          $     474,164 (  9.8%)  |  Avg CLV: $  643.37
At-Risk Dormant           $   1,101,125 ( 22.8%)  |  Avg CLV: $  382.20
High-Value Customers      $   3,229,855 ( 66.9%)  |  Avg CLV: $2,369.67
Campaign Champions        $      20,646 (  0.4%)  |  Avg CLV: $1,086.62

Total Revenue             $   4,825,789

KEY BUSINESS INSIGHTS

1. HIGH-VALUE CUSTOMERS (27.3% of customers) drive 67% of revenue
   → These 1,363 customers are the business's backbone
   → Despite high CLV, their engagement is low (3.6% email open rate)

2. AT-RISK DORMANT (57.6%) represent wasted potential
   → Largest segment but lowest value and engagement
   → Haven't purchased in 836 days on average
   → Action needed: Win-back campaigns or deprioritize

3. ENGAGED BROWSERS (14.7%) show promise
   → High email engagement (38.5% open rate) but moderate spen
```

---

### Results

#### Model performance
- **Silhouette score:** 0.323 (scale runs from -1 to 1, higher is better; this indicates moderate separation with some overlap between segments)
- **Davies-Bouldin index:** 1.169 (lower is better, with 0 as the ideal)
- **4 distinct segments** identified
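
The silhouette score above is an average over all customers, so it can hide a weak or very small segment. One way to look closer is a per-segment breakdown with scikit-learn's `silhouette_samples`. The sketch below uses synthetic data from `make_blobs` as a stand-in for `X_scaled`, so its numbers say nothing about the real segments:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in for X_scaled; the real feature matrix is not reproduced here.
X_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.5, random_state=42)
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X_demo)

# One silhouette value per customer, in the range [-1, 1].
per_customer = silhouette_samples(X_demo, labels)

# Summarize by segment: size, average fit, and worst-fitting member.
summary = (
    pd.DataFrame({"segment": labels, "silhouette": per_customer})
    .groupby("segment")["silhouette"]
    .agg(["count", "mean", "min"])
)
print(summary.round(3))
```

A segment whose mean sits well below the others, or whose minimum is negative, contains customers who sit closer to a neighboring segment than to their own.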

---

#### Segment distribution

| Segment | Size | % of total |
|---------|------|------------|
| **Segment 0: Engaged browsers** | 737 | 14.7% |
| **Segment 1: At-risk dormant** | 2,881 | 57.6% |
| **Segment 2: High-value customers** | 1,363 | 27.3% |
| **Segment 3: Campaign champions** | 19 | 0.4% |

---

### Segment profiles

#### Segment 0: Engaged browsers (14.7% of customers)

**Profile:**  
Medium spenders with high email engagement but moderate purchase frequency

**Key metrics:**
- Recency: 823 days
- Frequency: 4.0 purchases
- Customer lifetime value: $643.37
- Email open rate: 38.5%
- Engagement score: 9.0/10

**Revenue contribution:** $474,164 (9.8% of total)

**Marketing strategy:**
- Upsell campaigns based on browsing behavior
- Free shipping offers to convert browsers to buyers
- Product recommendations based on interests
- Nurture high engagement into higher spending

---

#### Segment 1: At-risk dormant (57.6% of customers)

**Profile:**  
Largest segment with low value, low engagement, haven't purchased recently

**Key metrics:**
- Recency: 836 days
- Frequency: 3.0 purchases
- Customer lifetime value: $382.20
- Email open rate: 1.1%
- Engagement score: 1.3/10

**Revenue contribution:** $1,101,125 (22.8% of total)

**Marketing strategy:**
- Aggressive win-back campaigns with steep discounts
- Test re-engagement vs. deprioritization
- Remove non-responders from email lists (reduce costs)
- Consider sunsetting to save marketing spend

---

#### Segment 2: High-value customers (27.3% of customers)

**Profile:**  
Top spenders with high purchase frequency, key revenue drivers

**Key metrics:**
- Recency: 604 days
- Frequency: 9.0 purchases
- Customer lifetime value: $2,369.67
- Email open rate: 3.6%
- Engagement score: 2.3/10

**Revenue contribution:** $3,229,855 (66.9% of total) ⭐

**Marketing strategy:**
- VIP loyalty programs with exclusive benefits
- Personal account managers for top 100 customers
- Early access to new products
- Prevent churn through proactive engagement
- **Critical:** These customers drive 67% of revenue despite being only 27% of the base

---

#### Segment 3: Campaign champions (0.4% of customers)

**Profile:**  
Highly engaged with campaigns, responsive to marketing efforts

**Key metrics:**
- Recency: 803 days
- Frequency: 5.8 purchases
- Customer lifetime value: $1,086.62
- Email open rate: 42.8%
- Email click rate: 65%
- Engagement score: 6.6/10

**Revenue contribution:** $20,646 (0.4% of total)

**Marketing strategy:**
- Referral program incentives
- Beta tester opportunities
- User-generated content campaigns
- Brand ambassador programs
- Leverage their engagement to acquire similar customers

---

### Key business insights

#### 1. The 80/20 rule is real

**27.3% of customers (high-value) drive 66.9% of revenue**

This segment averages about $2,370 in customer lifetime value, versus $382 for at-risk dormant customers.

**Implication:** Marketing should allocate budget in proportion to segment value, not headcount. In revenue terms, losing one high-value customer is roughly equivalent to losing 6 at-risk dormant customers.
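
The arithmetic behind these figures can be checked directly from the revenue table earlier in the post. The segment counts and revenue totals below are copied from that output; nothing else is assumed:

```python
# Figures taken from the revenue contribution table in this post.
segments = {
    "Engaged Browsers": {"customers": 737, "revenue": 474_164},
    "At-Risk Dormant": {"customers": 2_881, "revenue": 1_101_125},
    "High-Value Customers": {"customers": 1_363, "revenue": 3_229_855},
    "Campaign Champions": {"customers": 19, "revenue": 20_646},
}

total_customers = sum(s["customers"] for s in segments.values())
total_revenue = sum(s["revenue"] for s in segments.values())

for name, s in segments.items():
    customer_share = s["customers"] / total_customers
    revenue_share = s["revenue"] / total_revenue
    avg_clv = s["revenue"] / s["customers"]
    print(f"{name:22s} {customer_share:6.1%} of customers, "
          f"{revenue_share:6.1%} of revenue, avg CLV ${avg_clv:,.2f}")

# How many dormant customers equal one high-value customer in average CLV?
high = segments["High-Value Customers"]
dormant = segments["At-Risk Dormant"]
ratio = (high["revenue"] / high["customers"]) / (dormant["revenue"] / dormant["customers"])
print(f"One high-value customer is worth about {ratio:.1f} dormant customers")
```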

---

#### 2. Engagement does not equal spending (but it should)

**Engaged browsers have 38.5% email open rates but only $643 CLV**

Meanwhile, high-value customers have just 3.6% email open rates but are worth about 3.7 times as much ($2,370 vs. $643 CLV).

**Insight:** High engagement is wasted if it doesn't convert to purchases. Engaged browsers are prime candidates for conversion campaigns.

---

#### 3. The dormant majority is a resource drain

**57.6% of customers are at-risk dormant with minimal engagement**

They consume marketing resources (emails, campaigns, support) but contribute only 22.8% of revenue.

**Action:** Test aggressive win-back. If they don't respond after 2 campaigns, deprioritize or remove from active lists.

---

#### 4. Campaign champions are influencers

**Just 19 customers (0.4%) have 42.8% email open rates and 65% click rates**

These customers are brand advocates in waiting.

**Opportunity:** Turn them into referral engines. Their high engagement suggests they'll share positive experiences.

---

### From a product manager's lens

Building the segmentation model was straightforward.  
Making it **operational** is where the real work begins.

#### What's needed for production:

##### 1. Automated re-segmentation
- Customers move between segments over time
- Re-cluster monthly or quarterly
- Track segment migration patterns
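
Tracking migration can start as a simple cross-tabulation of each customer's previous and current segment. The sketch below uses hypothetical customer IDs and shortened segment names for two consecutive runs:

```python
import pandas as pd

# Hypothetical segment labels for the same six customers in two consecutive runs.
customer_ids = [101, 102, 103, 104, 105, 106]
previous = pd.Series(
    ["High-Value", "High-Value", "Engaged", "Dormant", "Dormant", "Engaged"],
    index=customer_ids, name="previous_segment",
)
current = pd.Series(
    ["High-Value", "Dormant", "High-Value", "Dormant", "Dormant", "Engaged"],
    index=customer_ids, name="current_segment",
)

# Rows: where customers were. Columns: where they are now.
migration = pd.crosstab(previous, current)
print(migration)

# Share of customers who stayed in the same segment.
stayed = (previous == current).mean()
print(f"Stayed in the same segment: {stayed:.0%}")
```

One caveat: K-means cluster numbers are arbitrary and can change between runs, so segments need to be matched by name or by centroid before two runs are compared.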

##### 2. CRM integration
- Push segment labels to Salesforce, HubSpot, or marketing automation
- Enable campaign targeting by segment
- Create dashboards showing segment health
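
Before labels can be pushed anywhere, new or updated customers have to be scored with exactly the same scaling and cluster centers as the original run. A minimal sketch of one way to do that, assuming the scaler and model are bundled into a scikit-learn pipeline and saved with `joblib` (the file location and the random training data are placeholders, not the post's actual setup):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 9-feature matrix X used in this post.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 9))

# Bundle scaling and clustering so new customers get identical preprocessing.
segmenter = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, random_state=42, n_init=10),
)
segmenter.fit(X_train)

# Save the fitted pipeline, then reload it as a scheduled scoring job might.
model_path = os.path.join(tempfile.mkdtemp(), "segmenter.joblib")  # hypothetical location
joblib.dump(segmenter, model_path)
loaded = joblib.load(model_path)

# Assign segments to customers the model has not seen before.
X_new = rng.normal(size=(5, 9))
new_segments = loaded.predict(X_new)
print(new_segments)
```

Keeping the scaler inside the saved artifact avoids a common failure mode where new data is scaled with different statistics than the training data.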

##### 3. Campaign measurement
- Track campaign performance by segment
- Measure: Does personalized messaging outperform generic?
- A/B test: Segment-specific offers vs. one-size-fits-all
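
For the A/B test, a basic way to check whether a difference in conversion rates is more than noise is a two-proportion z-test. The sketch below uses only the standard library, and the campaign numbers are invented for illustration:

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical results: generic email (A) vs. segment-specific offer (B).
z, p_value = two_proportion_z_test(conv_a=40, n_a=1000, conv_b=65, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

This approximation works best with reasonably large groups. For a tiny segment such as the 19 campaign champions, the sample is too small for this kind of test to say much.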

##### 4. Cross-functional alignment
- Marketing needs to understand what each segment represents
- Sales needs different scripts for high-value vs. dormant customers
- Product team prioritizes features based on high-value needs

##### 5. Governance
- Document how segments are defined
- Ensure segment logic is explainable to leadership
- Avoid "black box" models that stakeholders don't trust

---

### Connection to Post #1: Churn prediction

Now that we have segments, we can **layer churn predictions on top**:

#### Example: High-value customer + high churn risk
- **Priority:** Critical (immediate intervention)
- **Strategy:** Personal outreach, VIP retention offers
- **Budget:** High (losing this customer equals major revenue loss)

#### Example: At-risk dormant + high churn risk
- **Priority:** Low (already disengaged)
- **Strategy:** Final win-back email, then deprioritize
- **Budget:** Minimal (limited upside)

**This is the power of combining models:** Segmentation tells you WHO, churn prediction tells you WHEN to act.
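
As a sketch of how the two models might be layered in practice, the snippet below assigns an intervention priority from a segment name and a churn probability. The `churn_probability` column, the 0.6 cutoff, and the "Monitor", "Medium", and "Standard" labels are assumptions for illustration; only the "Critical" and "Low" cases come from the examples above.

```python
import pandas as pd

# Hypothetical scored customers; churn_probability would come from the Post #1 model.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "segment_name": ["High-Value Customers", "High-Value Customers",
                     "At-Risk Dormant", "Engaged Browsers"],
    "churn_probability": [0.82, 0.15, 0.90, 0.70],
})

HIGH_RISK = 0.6  # assumed cutoff; tune it to the churn model's calibration

def priority(row):
    """Map segment + churn risk to an intervention priority (illustrative rules)."""
    high_risk = row["churn_probability"] >= HIGH_RISK
    if row["segment_name"] == "High-Value Customers":
        return "Critical" if high_risk else "Monitor"
    if row["segment_name"] == "At-Risk Dormant":
        return "Low"
    return "Medium" if high_risk else "Standard"

customers["priority"] = customers.apply(priority, axis=1)
print(customers[["customer_id", "segment_name", "churn_probability", "priority"]])
```

The rules themselves are a business decision. The point is that once both scores sit in one table, the prioritization logic becomes explicit and reviewable.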

---

### Solution summary

| Component | Details |
|-----------|---------|
| **Dataset** | 5,000 customers from CDP |
| **Features** | 9 behavioral and value metrics (RFM + engagement) |
| **Algorithm** | K-Means clustering |
| **Segments** | 4 distinct customer groups |
| **Performance** | Silhouette score: 0.323 (moderate separation) |
| **Key finding** | 27% of customers drive 67% of revenue |
| **Business impact** | Enables personalized marketing by segment profile |

---

### Recommended next steps

#### 1. Integrate with CRM
- Tag every customer with their segment label
- Build segment-specific campaign workflows

#### 2. Measure baseline performance
- Track revenue, engagement, and churn by segment
- Establish benchmarks before optimization

#### 3. Launch segment-specific campaigns
- **High-value:** VIP loyalty program
- **Engaged browsers:** Conversion offers
- **At-risk dormant:** Win-back or sunset
- **Campaign champions:** Referral program

#### 4. Monitor segment migration
- Track customers moving between segments
- Identify triggers that move customers up (e.g., engaged browser to high-value)

#### 5. Combine with churn model (Post #1)
- Prioritize interventions based on segment + churn risk
- Allocate marketing budget strategically

---

### What to do with these segments

Now that you know your customers fall into four distinct groups, the next step is to stop treating them all the same way. Take these segment labels to your marketing team and ask a simple question: "What would you do differently for each group?"

You'll likely discover that what works for high-value customers (exclusive VIP treatment, personal attention) is wasted on at-risk dormant customers (who need aggressive discounts or nothing at all). Campaign champions are your brand advocates, not your acquisition problem. Engaged browsers just need a nudge to convert.

This is where segmentation becomes real. Not in the model, but in how marketing spends their budget and time. The insight is only valuable if it changes decisions.

---
