# PHASE 7: User Segmentation & Retention Strategy

## Big-Tech-Grade User Retention & Churn Prediction System

---

**Author**: Senior Data Scientist  
**Date**: February 2026  
**Objective**: Translate predictions into actionable retention strategies

---

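## Connecting Predictions to Action

A churn probability is only useful once it triggers an intervention. This phase groups customers into risk segments by predicted probability, pairs each segment with a retention playbook, and estimates the budget and ROI of acting on it.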

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import os
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported")

In [None]:
PROJECT_ROOT = '/Users/anuj/Desktop/Churn_Retension/churn-prediction-bigtech'
PROCESSED_DATA_PATH = os.path.join(PROJECT_ROOT, 'data', 'processed')
MODELS_PATH = os.path.join(PROJECT_ROOT, 'models')

test_df = pd.read_parquet(os.path.join(PROCESSED_DATA_PATH, 'test.parquet'))
features_df = pd.read_parquet(os.path.join(PROCESSED_DATA_PATH, 'customer_features.parquet'))

target_col = 'is_churned'
feature_cols = [c for c in test_df.columns if c != target_col]
X_test, y_test = test_df[feature_cols], test_df[target_col]

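try:
    model = joblib.load(os.path.join(MODELS_PATH, 'xgboost_final.pkl'))
    print("✅ Loaded XGBoost model")
except (FileNotFoundError, ImportError) as exc:
    # Fall back only when the XGBoost artifact or package is missing;
    # other errors (e.g. a corrupt file) should surface instead of being hidden.
    print(f"XGBoost model unavailable ({exc}); falling back to Random Forest")
    model = joblib.load(os.path.join(MODELS_PATH, 'random_forest.pkl'))
    print("✅ Loaded Random Forest model")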

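# Cost-optimal decision threshold from the earlier cost analysis.
# Reported for reference only: the risk segments below use fixed cut-offs.
cost_analysis = pd.read_csv(os.path.join(PROCESSED_DATA_PATH, 'cost_analysis.csv'))
optimal_threshold = cost_analysis['optimal_threshold'].iloc[0]
print(f"✅ Optimal threshold: {optimal_threshold:.2f}")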

In [None]:
y_proba = model.predict_proba(X_test)[:, 1]

segment_df = X_test.copy()
segment_df['churn_probability'] = y_proba
segment_df['actual_churned'] = y_test.values

print(f"\nTest set size: {len(segment_df):,} customers")
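
The cost-optimal threshold loaded earlier is printed but not applied anywhere in this phase. One way to carry it alongside the probabilities is a binary flag, sketched below on toy data. The values, the `flag_for_retention` column name, and the `>=` convention are illustrative assumptions, not taken from the earlier phases.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: in the notebook these would be y_proba and optimal_threshold.
toy_proba = np.array([0.05, 0.35, 0.50, 0.90])
toy_threshold = 0.40  # hypothetical value, not the one stored in cost_analysis.csv

toy_scores = pd.DataFrame({'churn_probability': toy_proba})

# Flag customers whose probability is at or above the cut-off (assumed convention).
toy_scores['flag_for_retention'] = toy_scores['churn_probability'] >= toy_threshold

n_flagged = int(toy_scores['flag_for_retention'].sum())
print(f"Flagged {n_flagged} of {len(toy_scores)} customers")
```

Keeping the flag next to the risk segment makes it easy to check how the fixed segment cut-offs line up with the cost-based decision boundary.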

---

## 1. Risk-Based Segmentation

In [None]:
def assign_risk_segment(prob):
    """Assign customers to risk segments based on churn probability"""
    if prob >= 0.7:
        return 'HIGH_RISK'
    elif prob >= 0.4:
        return 'MEDIUM_RISK'
    elif prob >= 0.2:
        return 'LOW_RISK'
    else:
        return 'SAFE'

segment_df['risk_segment'] = segment_df['churn_probability'].apply(assign_risk_segment)

segment_counts = segment_df['risk_segment'].value_counts()
segment_churn = segment_df.groupby('risk_segment')['actual_churned'].agg(['sum', 'count', 'mean'])
segment_churn.columns = ['churned', 'total', 'churn_rate']

print("="*70)
print("                    RISK SEGMENT DISTRIBUTION")
print("="*70)
print(f"""
┌──────────────────────────────────────────────────────────────────────────────┐
│                         CUSTOMER RISK SEGMENTS                               │
├─────────────────┬────────────┬───────────────┬───────────────┬──────────────┤
│ Segment         │ Customers  │ % of Base     │ Actual Churn  │ Churn Rate   │
├─────────────────┼────────────┼───────────────┼───────────────┼──────────────┤""")

for segment in ['HIGH_RISK', 'MEDIUM_RISK', 'LOW_RISK', 'SAFE']:
    if segment in segment_churn.index:
        data = segment_churn.loc[segment]
        pct = data['total'] / len(segment_df) * 100
        # .loc returns a float row, so counts are formatted with .0f to avoid a trailing ".0";
        # field widths match the header columns above
        print(f"│ {segment:<15} │ {data['total']:>10,.0f} │ {pct:>12.1f}% │ {data['churned']:>13,.0f} │ {data['churn_rate']:>12.1%} │")

print(f"├─────────────────┼────────────┼───────────────┼───────────────┼──────────────┤")
print(f"│ TOTAL           │ {len(segment_df):>10,} │        100.0% │ {segment_df['actual_churned'].sum():>13,.0f} │ {segment_df['actual_churned'].mean():>12.1%} │")
print(f"└─────────────────┴────────────┴───────────────┴───────────────┴──────────────┘")
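
The row-wise `apply` above is easy to read but slow on large customer bases. A minimal vectorized sketch with `pd.cut` follows, using the same cut-offs (0.2, 0.4, 0.7) with left-closed bins so that boundary values land in the same segment. The helper name `assign_risk_segments_vectorized` and the toy probabilities are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def assign_risk_segment(prob):
    """Reference implementation, copied from the notebook cell above."""
    if prob >= 0.7:
        return 'HIGH_RISK'
    elif prob >= 0.4:
        return 'MEDIUM_RISK'
    elif prob >= 0.2:
        return 'LOW_RISK'
    else:
        return 'SAFE'

def assign_risk_segments_vectorized(probs):
    """Hypothetical vectorized equivalent built on pd.cut."""
    bins = [-np.inf, 0.2, 0.4, 0.7, np.inf]
    labels = ['SAFE', 'LOW_RISK', 'MEDIUM_RISK', 'HIGH_RISK']
    # right=False makes each bin [low, high), matching the >= comparisons above
    return pd.cut(probs, bins=bins, labels=labels, right=False).astype(str)

# Toy probabilities chosen to hit every boundary
toy_probs = pd.Series([0.0, 0.19, 0.2, 0.39, 0.4, 0.69, 0.7, 1.0])
vectorized = assign_risk_segments_vectorized(toy_probs)
looped = toy_probs.apply(assign_risk_segment)
print((vectorized == looped).all())
```

Note that `astype(str)` would turn a missing probability into the string `'nan'`, so this sketch assumes the scores contain no missing values.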

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

colors = {'HIGH_RISK': '#e74c3c', 'MEDIUM_RISK': '#f39c12', 'LOW_RISK': '#3498db', 'SAFE': '#2ecc71'}
segment_order = ['HIGH_RISK', 'MEDIUM_RISK', 'LOW_RISK', 'SAFE']
available_segments = [s for s in segment_order if s in segment_counts.index]
bar_colors = [colors[s] for s in available_segments]

axes[0].bar(available_segments, [segment_counts[s] for s in available_segments], color=bar_colors)
axes[0].set_xlabel('Risk Segment')
axes[0].set_ylabel('Number of Customers')
axes[0].set_title('Segment Distribution', fontsize=12, fontweight='bold')
axes[0].tick_params(axis='x', rotation=45)

churn_rates = [segment_churn.loc[s, 'churn_rate'] for s in available_segments]
axes[1].bar(available_segments, churn_rates, color=bar_colors)
axes[1].set_xlabel('Risk Segment')
axes[1].set_ylabel('Actual Churn Rate')
axes[1].set_title('Churn Rate by Segment', fontsize=12, fontweight='bold')
axes[1].tick_params(axis='x', rotation=45)
axes[1].axhline(y=segment_df['actual_churned'].mean(), color='black', linestyle='--', label='Overall')
axes[1].legend()

for segment in available_segments:
    data = segment_df[segment_df['risk_segment'] == segment]['churn_probability']
    axes[2].hist(data, bins=20, alpha=0.5, label=segment, color=colors[segment])
axes[2].set_xlabel('Churn Probability')
axes[2].set_ylabel('Count')
axes[2].set_title('Probability Distribution by Segment', fontsize=12, fontweight='bold')
axes[2].legend()

plt.tight_layout()
plt.show()
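
As a numeric companion to the middle panel, the gap between each bar and the dashed overall line can be expressed as lift: segment churn rate divided by overall churn rate. The sketch below uses a made-up toy frame with the same column names as `segment_df`; the numbers are not results from this project.

```python
import pandas as pd

# Toy data only: 8 customers, 4 of whom churned
toy = pd.DataFrame({
    'risk_segment': ['HIGH_RISK', 'HIGH_RISK', 'MEDIUM_RISK', 'MEDIUM_RISK',
                     'SAFE', 'SAFE', 'SAFE', 'SAFE'],
    'actual_churned': [1, 1, 1, 0, 0, 0, 0, 1],
})

overall_rate = toy['actual_churned'].mean()
segment_rate = toy.groupby('risk_segment')['actual_churned'].mean()

# Lift > 1 means the segment churns more often than the base as a whole
lift = (segment_rate / overall_rate).rename('lift')
print(lift)
```

A well-separated segmentation should show lift falling steadily from HIGH_RISK down to SAFE.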

---

## 2. Retention Strategy by Segment

In [None]:
print("="*70)
print("                    RETENTION STRATEGY PLAYBOOK")
print("="*70)

strategies = {
    'HIGH_RISK': {
        'action': 'IMMEDIATE INTERVENTION',
        'tactics': [
            '• Personal outreach (phone/email from account manager)',
            '• Exclusive discount (15-20% off next purchase)',
            '• Win-back campaign with urgency messaging',
            '• Survey to understand pain points',
            '• Free shipping on next 3 orders'
        ],
        'cost_per_customer': 50,
        'expected_save_rate': 0.25
    },
    'MEDIUM_RISK': {
        'action': 'PROACTIVE ENGAGEMENT',
        'tactics': [
            '• Personalized email with product recommendations',
            '• Limited-time offer (10% discount)',
            '• Loyalty points bonus',
            '• "We miss you" email sequence',
            '• Early access to new products'
        ],
        'cost_per_customer': 20,
        'expected_save_rate': 0.40
    },
    'LOW_RISK': {
        'action': 'NURTURE & MONITOR',
        'tactics': [
            '• Regular newsletter with relevant content',
            '• Occasional special offers',
            '• Product update notifications',
            '• Referral program incentives',
            '• Automated engagement triggers'
        ],
        'cost_per_customer': 5,
        'expected_save_rate': 0.60
    },
    'SAFE': {
        'action': 'MAINTAIN RELATIONSHIP',
        'tactics': [
            '• Standard marketing communications',
            '• Loyalty program benefits',
            '• Cross-sell opportunities',
            '• VIP treatment for high-value customers',
            '• Monitor for changes in behavior'
        ],
        'cost_per_customer': 2,
        'expected_save_rate': 0.80
    }
}

for segment, strategy in strategies.items():
    print(f"""
┌──────────────────────────────────────────────────────────────────────────────┐
│  {segment:^74}  │
│  Action: {strategy['action']:<65}  │
├──────────────────────────────────────────────────────────────────────────────┤
│  TACTICS:                                                                    │""")
    for tactic in strategy['tactics']:
        print(f"│    {tactic:<72}  │")
    print(f"│                                                                              │")
    print(f"│  Cost/Customer: ${strategy['cost_per_customer']:<5}    Expected Save Rate: {strategy['expected_save_rate']:.0%}              │")
    print(f"└──────────────────────────────────────────────────────────────────────────────┘")

---

## 3. Budget Allocation & ROI Estimation

In [None]:
AVG_CLV = 200  # Assumed average customer lifetime value, in dollars

print("="*70)
print("                    BUDGET ALLOCATION & ROI")
print("="*70)

budget_analysis = []

for segment in ['HIGH_RISK', 'MEDIUM_RISK', 'LOW_RISK', 'SAFE']:
    if segment not in segment_churn.index:
        continue
        
    strategy = strategies[segment]
    seg_data = segment_churn.loc[segment]
    
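    customers = int(seg_data['total'])
    # Observed churners in the labeled test set, used here as a proxy for expected churn
    expected_churners = int(seg_data['churned'])

    # Every customer in the segment receives the intervention
    total_cost = customers * strategy['cost_per_customer']

    # Only would-be churners can be saved; each save is valued at AVG_CLV
    customers_saved = expected_churners * strategy['expected_save_rate']
    revenue_saved = customers_saved * AVG_CLV

    net_benefit = revenue_saved - total_cost
    roi = (net_benefit / total_cost * 100) if total_cost > 0 else 0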
    
    budget_analysis.append({
        'segment': segment,
        'customers': customers,
        'expected_churners': expected_churners,
        'cost_per_customer': strategy['cost_per_customer'],
        'total_cost': total_cost,
        'expected_save_rate': strategy['expected_save_rate'],
        'customers_saved': customers_saved,
        'revenue_saved': revenue_saved,
        'net_benefit': net_benefit,
        'roi': roi
    })

budget_df = pd.DataFrame(budget_analysis)

print(f"""
┌──────────────────────────────────────────────────────────────────────────────┐
│                    SEGMENT BUDGET ANALYSIS                                   │
├─────────────┬────────────┬────────────┬────────────┬────────────┬────────────┤
│ Segment     │ Customers  │ Budget     │ Saved      │ Net Benefit│ ROI        │
├─────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤""")

for _, row in budget_df.iterrows():
    print(f"│ {row['segment']:<11} │ {row['customers']:>10,} │ ${row['total_cost']:>8,.0f} │ {row['customers_saved']:>10,.0f} │ ${row['net_benefit']:>8,.0f} │ {row['roi']:>9.0f}% │")

print(f"├─────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤")
total_cost = budget_df['total_cost'].sum()
total_benefit = budget_df['net_benefit'].sum()
total_roi = (total_benefit / total_cost * 100) if total_cost > 0 else 0
print(f"│ TOTAL       │ {budget_df['customers'].sum():>10,} │ ${total_cost:>8,.0f} │ {budget_df['customers_saved'].sum():>10,.0f} │ ${total_benefit:>8,.0f} │ {total_roi:>9.0f}% │")
print(f"└─────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘")
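
The net-benefit formula above implies a break-even churn rate for each segment. Since net benefit is `churners * save_rate * AVG_CLV - customers * cost_per_customer`, a segment pays for itself only when its churn rate is at least `cost_per_customer / (save_rate * AVG_CLV)`. The sketch below evaluates that bound using the playbook's cost and save-rate assumptions; the helper name `break_even_churn_rate` is illustrative.

```python
AVG_CLV = 200  # same assumed lifetime value as the budget cell

# Cost and save-rate assumptions copied from the strategy playbook
playbook = {
    'HIGH_RISK':   {'cost_per_customer': 50, 'expected_save_rate': 0.25},
    'MEDIUM_RISK': {'cost_per_customer': 20, 'expected_save_rate': 0.40},
    'LOW_RISK':    {'cost_per_customer': 5,  'expected_save_rate': 0.60},
    'SAFE':        {'cost_per_customer': 2,  'expected_save_rate': 0.80},
}

def break_even_churn_rate(cost_per_customer, save_rate, clv):
    """Minimum segment churn rate at which net benefit is non-negative."""
    return cost_per_customer / (save_rate * clv)

break_even = {
    seg: break_even_churn_rate(p['cost_per_customer'], p['expected_save_rate'], AVG_CLV)
    for seg, p in playbook.items()
}

for seg, rate in break_even.items():
    print(f"{seg:<12} break-even churn rate: {rate:.1%}")
```

Under these assumptions the HIGH_RISK playbook breaks even only if every targeted customer would otherwise churn, so its cost, save rate, or the CLV figure is worth revisiting before committing budget.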

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

colors = {'HIGH_RISK': '#e74c3c', 'MEDIUM_RISK': '#f39c12', 'LOW_RISK': '#3498db', 'SAFE': '#2ecc71'}
bar_colors = [colors[s] for s in budget_df['segment']]

axes[0].bar(budget_df['segment'], budget_df['total_cost'], color=bar_colors)
axes[0].set_xlabel('Segment')
axes[0].set_ylabel('Budget ($)')
axes[0].set_title('Budget by Segment', fontsize=12, fontweight='bold')
axes[0].tick_params(axis='x', rotation=45)

axes[1].bar(budget_df['segment'], budget_df['roi'], color=bar_colors)
axes[1].set_xlabel('Segment')
axes[1].set_ylabel('ROI (%)')
axes[1].set_title('ROI by Segment', fontsize=12, fontweight='bold')
axes[1].tick_params(axis='x', rotation=45)
axes[1].axhline(y=0, color='black', linestyle='-', linewidth=0.5)

axes[2].bar(budget_df['segment'], budget_df['net_benefit'], color=bar_colors)
axes[2].set_xlabel('Segment')
axes[2].set_ylabel('Net Benefit ($)')
axes[2].set_title('Net Benefit by Segment', fontsize=12, fontweight='bold')
axes[2].tick_params(axis='x', rotation=45)
axes[2].axhline(y=0, color='black', linestyle='-', linewidth=0.5)

plt.tight_layout()
plt.show()
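
The budget figures rely on observed churn labels, which exist only for the test set. For customers who have not yet been labeled, one common substitute is the sum of predicted probabilities per segment. This is a sketch on toy values and assumes the model's probabilities are reasonably well calibrated, which this notebook does not verify.

```python
import pandas as pd

# Toy scores with the same column names as segment_df
toy_scores = pd.DataFrame({
    'risk_segment': ['HIGH_RISK', 'HIGH_RISK', 'MEDIUM_RISK', 'SAFE'],
    'churn_probability': [0.9, 0.7, 0.5, 0.1],
})

# Expected number of churners = sum of individual churn probabilities
expected_churners = toy_scores.groupby('risk_segment')['churn_probability'].sum()
print(expected_churners)
```

Comparing this estimate with the observed `churned` counts on the test set is also a quick check on calibration within each segment.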

---

## 4. Segment Profiles

In [None]:
print("="*70)
print("                    SEGMENT PROFILES")
print("="*70)

# Profile segments on the core RFM-style features that are present in the data
candidate_features = ['recency_days', 'frequency_total', 'monetary_total', 'avg_order_value', 'days_since_first_purchase']
key_features = [col for col in candidate_features if col in segment_df.columns]

if key_features:
    segment_profiles = segment_df.groupby('risk_segment')[key_features].mean().round(2)
    print("\nAverage Feature Values by Segment:")
    print(segment_profiles.T)
else:
    print("\nKey features not found. Showing available feature statistics:")
    available_features = [c for c in segment_df.columns[:5] if c not in ['churn_probability', 'actual_churned', 'risk_segment']]
    if available_features:
        segment_profiles = segment_df.groupby('risk_segment')[available_features].mean().round(2)
        print(segment_profiles.T)

In [None]:
segment_summary = segment_df[['risk_segment', 'churn_probability', 'actual_churned']].copy()
segment_summary.to_parquet(os.path.join(PROCESSED_DATA_PATH, 'customer_segments.parquet'))
budget_df.to_csv(os.path.join(PROCESSED_DATA_PATH, 'segment_budget.csv'), index=False)

print(f"\n✅ Segment analysis saved:")
print(f"   • {PROCESSED_DATA_PATH}/customer_segments.parquet")
print(f"   • {PROCESSED_DATA_PATH}/segment_budget.csv")

---

## 5. Implementation Roadmap

In [None]:
print("="*70)
print("                    IMPLEMENTATION ROADMAP")
print("="*70)

print("""
┌──────────────────────────────────────────────────────────────────────────────┐
│                        30-DAY IMPLEMENTATION PLAN                            │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  WEEK 1: HIGH-RISK SEGMENT                                                   │
│  ─────────────────────────────                                               │
│  • Day 1-2: Set up personal outreach workflow                                │
│  • Day 3-4: Create win-back email templates                                  │
│  • Day 5-7: Launch high-risk intervention campaign                           │
│                                                                              │
│  WEEK 2: MEDIUM-RISK SEGMENT                                                 │
│  ───────────────────────────                                                 │
│  • Day 8-9: Configure personalization engine                                 │
│  • Day 10-11: Set up loyalty points bonus                                    │
│  • Day 12-14: Launch engagement campaign                                     │
│                                                                              │
│  WEEK 3: LOW-RISK SEGMENT                                                    │
│  ─────────────────────────                                                   │
│  • Day 15-17: Configure automated email sequences                            │
│  • Day 18-21: Set up behavior monitoring triggers                            │
│                                                                              │
│  WEEK 4: MONITORING & OPTIMIZATION                                           │
│  ─────────────────────────────                                               │
│  • Day 22-25: Monitor campaign performance                                   │
│  • Day 26-28: A/B test variations                                            │
│  • Day 29-30: Report results, iterate                                        │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

KPIs TO TRACK:
──────────────
1. Retention rate by segment (weekly)
2. Campaign response rate (daily)
3. Cost per saved customer (weekly)
4. ROI by segment (monthly)
5. Model accuracy over time (monthly)
""")
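
KPI 3, cost per saved customer, can be derived from the same columns as `budget_df`. The sketch below uses made-up numbers and guards against segments where no customer is saved; the `cost_per_saved_customer` column name is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Toy budget rows with the same column names as budget_df (values are made up)
toy_budget = pd.DataFrame({
    'segment': ['HIGH_RISK', 'SAFE'],
    'total_cost': [5000.0, 2000.0],
    'customers_saved': [20.0, 0.0],
})

# Replace zero saves with NaN so the ratio is undefined rather than infinite
saved = toy_budget['customers_saved'].replace(0, np.nan)
toy_budget['cost_per_saved_customer'] = toy_budget['total_cost'] / saved
print(toy_budget)
```

Tracking this ratio weekly against the assumed lifetime value shows whether a segment's campaign is still worth its spend.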

---

## Phase 7 Checklist

- [x] Created risk segments (High/Medium/Low/Safe)
- [x] Analyzed segment distribution and churn rates
- [x] Defined retention tactics per segment
- [x] Calculated budget allocation
- [x] Estimated ROI per segment
- [x] Saved segment data for deployment
- [x] Created implementation roadmap

**Phase 7 Status: COMPLETE** 

---

**Next**: Phase 8 - Final Insights & Storytelling