### DISCLAIMER: It is greatly beneficial if you know Python and ML basics before hand. If not, I would highly urge you to learn. This should be non-negotiable. This would form the basement for future posts in this series and your career as PM working with ML teams. 

#### Problem Statement
The retention marketing team has noticed that a significant portion of active customers haven't made repeat purchases in 90 days. The CMO wants to know: "Which customers are about to churn—and can we stop them before they do?"

The challenge: No one could agree on what "churn" actually meant. Was it 60 days of inactivity? 90 days? This is where most ML projects stall—not because of the algorithm, but because of the problem definition. 

#### Let's now dive into the solution for Problem 1:

In [1]:
# Load the existing datasets
import pandas as pd
import numpy as np
from datetime import datetime

print("="*70)
print("LOADING EXISTING CDP DATASETS")
print("="*70)

# Load all available datasets
cdp_customers = pd.read_csv('cdp_customers.csv')
cdp_customer_features = pd.read_csv('cdp_customer_features.csv')
cdp_campaign_responses = pd.read_csv('cdp_campaign_responses.csv')
cdp_support_tickets = pd.read_csv('cdp_support_tickets.csv')

print("\n✓ Loaded cdp_customers:", cdp_customers.shape)
print("✓ Loaded cdp_customer_features:", cdp_customer_features.shape)
print("✓ Loaded cdp_campaign_responses:", cdp_campaign_responses.shape)
print("✓ Loaded cdp_support_tickets:", cdp_support_tickets.shape)

print("\n" + "="*70)
print("CHECKING DATA STRUCTURE")
print("="*70)

print("\ncdp_customers columns:")
print(cdp_customers.columns.tolist())
print("\nFirst 3 rows:")
print(cdp_customers.head(3))

print("\n\ncdp_customer_features columns:")
print(cdp_customer_features.columns.tolist())
print("\nFirst 3 rows:")
print(cdp_customer_features.head(3))

LOADING EXISTING CDP DATASETS

✓ Loaded cdp_customers: (5000, 11)
✓ Loaded cdp_customer_features: (5000, 19)
✓ Loaded cdp_campaign_responses: (5849, 10)
✓ Loaded cdp_support_tickets: (3568, 10)

CHECKING DATA STRUCTURE

cdp_customers columns:
['customer_id', 'first_name', 'last_name', 'email', 'age', 'gender', 'city', 'state', 'signup_date', 'customer_segment', 'loyalty_tier']

First 3 rows:
  customer_id first_name  last_name                          email  age  \
0   CUST00001     Robert  Rodriguez  barbara.anderson710@email.com   60   
1   CUST00002  Elizabeth     Garcia      anthony.lewis19@email.com   72   
2   CUST00003     Donald  Hernandez        john.smith237@email.com   31   

  gender       city state signup_date customer_segment loyalty_tier  
0      F    Phoenix    OH  2023-02-22          Unknown     Platinum  
1      F  Charlotte    TX  2023-02-02          Unknown       Bronze  
2  Other    Houston    TX  2020-08-24          Unknown       Silver  


cdp_customer_features 

In [5]:

print("\n" + "="*70)
print("ANALYZING AVAILABLE DATA FOR CHURN PREDICTION")
print("="*70)

# Check campaign responses for email engagement data
print("\ncdp_campaign_responses columns:")
print(cdp_campaign_responses.columns.tolist())
print("\nFirst 3 rows:")
print(cdp_campaign_responses.head(3))




ANALYZING AVAILABLE DATA FOR CHURN PREDICTION

cdp_campaign_responses columns:
['response_id', 'customer_id', 'campaign_name', 'campaign_type', 'sent_date', 'delivered', 'opened', 'clicked', 'converted', 'unsubscribed']

First 3 rows:
  response_id customer_id                 campaign_name campaign_type  \
0  RESP000001   CUST00001                Welcome Series   Display Ads   
1  RESP000002   CUST00001  Product Launch - Electronics         Email   
2  RESP000003   CUST00001                Welcome Series         Email   

    sent_date  delivered  opened  clicked  converted  unsubscribed  
0  2024-05-24       True    True    False      False         False  
1  2023-11-29       True   False    False      False         False  
2  2024-12-28       True   False    False      False         False  


In [11]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')

print("\n" + "="*70)
print("FEATURE ENGINEERING FOR CHURN PREDICTION")
print("="*70)

# Merge customers with features
df = cdp_customers.merge(cdp_customer_features, on='customer_id', how='left')

print(f"\n✓ Combined dataset shape: {df.shape}")
print(f"✓ Columns available: {len(df.columns)}")

# Create binary churn target variable
# Define churn as: High churn risk OR (recency > 90 days AND email_open_rate < 0.2)
df['churned'] = ((df['churn_risk'] == 'High') | 
                 ((df['recency_days'] > 90) & (df['email_open_rate'] < 0.2))).astype(int)

print(f"\n✓ Churn distribution:")
print(df['churned'].value_counts())
print(f"  Churn rate: {df['churned'].mean():.1%}")

# Select features for the model
feature_columns = [
    'recency_days',
    'frequency',
    'monetary_value',
    'avg_order_value',
    'days_since_last_activity',
    'engagement_score',
    'email_open_rate',
    'email_click_rate',
    'num_support_tickets',
    'campaigns_received',
    'campaign_conversions',
    'age'
]

# Create feature matrix
X = df[feature_columns].copy()

# Handle missing values
X = X.fillna(X.median())

# Target variable
y = df['churned']

print(f"\n✓ Feature matrix shape: {X.shape}")
print(f"✓ Features used: {len(feature_columns)}")
print("\nFeatures:")
for i, col in enumerate(feature_columns, 1):
    print(f"  {i}. {col}")



FEATURE ENGINEERING FOR CHURN PREDICTION

✓ Combined dataset shape: (5000, 29)
✓ Columns available: 29

✓ Churn distribution:
churned
1    4165
0     835
Name: count, dtype: int64
  Churn rate: 83.3%

✓ Feature matrix shape: (5000, 12)
✓ Features used: 12

Features:
  1. recency_days
  2. frequency
  3. monetary_value
  4. avg_order_value
  5. days_since_last_activity
  6. engagement_score
  7. email_open_rate
  8. email_click_rate
  9. num_support_tickets
  10. campaigns_received
  11. campaign_conversions
  12. age


In [13]:

print("\n" + "="*70)
print("TRAIN/TEST SPLIT")
print("="*70)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\n✓ Training set: {X_train.shape[0]} customers")
print(f"✓ Test set: {X_test.shape[0]} customers")
print(f"\nTrain churn rate: {y_train.mean():.1%}")
print(f"Test churn rate: {y_test.mean():.1%}")

print("\n" + "="*70)
print("STEP 3: FEATURE SCALING")
print("="*70)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\n✓ Features scaled using StandardScaler")
print("✓ Mean = 0, Std = 1 for all features")



TRAIN/TEST SPLIT

✓ Training set: 4000 customers
✓ Test set: 1000 customers

Train churn rate: 83.3%
Test churn rate: 83.3%

STEP 3: FEATURE SCALING

✓ Features scaled using StandardScaler
✓ Mean = 0, Std = 1 for all features


In [15]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [17]:

print("\n" + "="*70)
print("MODEL TRAINING - LOGISTIC REGRESSION")
print("="*70)

# Train Logistic Regression model
model = LogisticRegression(random_state=42, max_iter=1000, class_weight='balanced')
model.fit(X_train_scaled, y_train)

print("\n✓ Model trained successfully")
print("✓ Algorithm: Logistic Regression")
print("✓ Class weights: balanced (to handle imbalanced classes)")

# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

print("\n" + "="*70)
print("MODEL EVALUATION")
print("="*70)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"\n MODEL PERFORMANCE METRICS:")
print("-" * 70)
print(f"  Accuracy:  {accuracy:.1%}")
print(f"  Precision: {precision:.1%}  (Of predicted churners, {precision:.1%} actually churned)")
print(f"  Recall:    {recall:.1%}  (We caught {recall:.1%} of all actual churners)")
print(f"  ROC-AUC:   {roc_auc:.3f}")

print("\n CONFUSION MATRIX:")
print("-" * 70)
cm = confusion_matrix(y_test, y_pred)
print(f"  True Negatives:  {cm[0,0]:4d}  (Correctly predicted non-churners)")
print(f"  False Positives: {cm[0,1]:4d}  (Incorrectly predicted as churners)")
print(f"  False Negatives: {cm[1,0]:4d}  (Missed churners)")
print(f"  True Positives:  {cm[1,1]:4d}  (Correctly predicted churners)")

print("\n CLASSIFICATION REPORT:")
print("-" * 70)
print(classification_report(y_test, y_pred, target_names=['Not Churned', 'Churned']))



MODEL TRAINING - LOGISTIC REGRESSION

✓ Model trained successfully
✓ Algorithm: Logistic Regression
✓ Class weights: balanced (to handle imbalanced classes)

MODEL EVALUATION

 MODEL PERFORMANCE METRICS:
----------------------------------------------------------------------
  Accuracy:  85.3%
  Precision: 98.7%  (Of predicted churners, 98.7% actually churned)
  Recall:    83.4%  (We caught 83.4% of all actual churners)
  ROC-AUC:   0.933

 CONFUSION MATRIX:
----------------------------------------------------------------------
  True Negatives:   158  (Correctly predicted non-churners)
  False Positives:    9  (Incorrectly predicted as churners)
  False Negatives:  138  (Missed churners)
  True Positives:   695  (Correctly predicted churners)

 CLASSIFICATION REPORT:
----------------------------------------------------------------------
              precision    recall  f1-score   support

 Not Churned       0.53      0.95      0.68       167
     Churned       0.99      0.83      0.

In [21]:
    
print("\n" + "="*70)
print("FEATURE IMPORTANCE ANALYSIS")
print("="*70)

# Get feature importance from coefficients
feature_importance = pd.DataFrame({
    'feature': feature_columns,
    'coefficient': model.coef_[0],
    'abs_coefficient': np.abs(model.coef_[0])
}).sort_values('abs_coefficient', ascending=False)

print("\n TOP 10 MOST IMPORTANT FEATURES FOR CHURN PREDICTION:")
print("-" * 70)
for idx, row in feature_importance.head(10).iterrows():
    direction = "↑ increases" if row['coefficient'] > 0 else "↓ decreases"
    print(f"  {row['feature']:30s} {direction} churn risk (coef: {row['coefficient']:7.3f})")

print("\n" + "="*70)
print("BUSINESS INSIGHTS")
print("="*70)

# Create risk scores for all test customers
test_results = pd.DataFrame({
    'customer_id': df.iloc[X_test.index]['customer_id'].values,
    'actual_churned': y_test.values,
    'predicted_churned': y_pred,
    'churn_probability': y_pred_proba,
    'recency_days': X_test['recency_days'].values,
    'email_open_rate': X_test['email_open_rate'].values,
    'frequency': X_test['frequency'].values
})

# Sort by churn probability
test_results = test_results.sort_values('churn_probability', ascending=False)

print("\n KEY INSIGHTS:")
print("-" * 70)

# Top 20% highest risk
top_20_pct = int(len(test_results) * 0.20)
high_risk_customers = test_results.head(top_20_pct)
actual_churn_in_top_20 = high_risk_customers['actual_churned'].mean()

print(f"1. Top 20% of at-risk customers ({top_20_pct} customers):")
print(f"   → {actual_churn_in_top_20:.1%} actually churned")
print(f"   → Marketing can focus on these {top_20_pct} customers for retention campaigns")

print(f"\n2. Email engagement is a strong predictor:")
avg_email_rate_churned = df[df['churned']==1]['email_open_rate'].mean()
avg_email_rate_active = df[df['churned']==0]['email_open_rate'].mean()
print(f"   → Churned customers: {avg_email_rate_churned:.1%} email open rate")
print(f"   → Active customers: {avg_email_rate_active:.1%} email open rate")

print(f"\n3. Recency is critical:")
avg_recency_churned = df[df['churned']==1]['recency_days'].mean()
avg_recency_active = df[df['churned']==0]['recency_days'].mean()
print(f"   → Churned customers: {avg_recency_churned:.0f} days since last purchase")
print(f"   → Active customers: {avg_recency_active:.0f} days since last purchase")

print("\n ACTIONABLE RECOMMENDATION:")
print("-" * 70)
print("  Trigger re-engagement campaigns when:")
print("  • Recency > 30 days AND")
print("  • Email open rate drops below 20%")
print("  This catches customers BEFORE they fully churn")



FEATURE IMPORTANCE ANALYSIS

 TOP 10 MOST IMPORTANT FEATURES FOR CHURN PREDICTION:
----------------------------------------------------------------------
  recency_days                   ↑ increases churn risk (coef:   1.441)
  days_since_last_activity       ↑ increases churn risk (coef:   1.051)
  engagement_score               ↓ decreases churn risk (coef:  -1.015)
  email_open_rate                ↓ decreases churn risk (coef:  -0.757)
  frequency                      ↓ decreases churn risk (coef:  -0.323)
  campaigns_received             ↑ increases churn risk (coef:   0.127)
  email_click_rate               ↓ decreases churn risk (coef:  -0.101)
  avg_order_value                ↑ increases churn risk (coef:   0.099)
  num_support_tickets            ↓ decreases churn risk (coef:  -0.075)
  monetary_value                 ↑ increases churn risk (coef:   0.044)

BUSINESS INSIGHTS

 KEY INSIGHTS:
----------------------------------------------------------------------
1. Top 20% of at-ri

Below is a snapshot of the solution we built. 

Dataset: 5,000 customers from CDP
Features: 12 behavioral and engagement metrics
Algorithm: Logistic Regression (interpretable, production-ready)
Performance: 85.3% accuracy, 98.7% precision, 0.933 ROC-AUC
Business Value: Focus on top 20% at-risk customers (100% precision)
Key Insight: Email silence predicts churn 30 days before purchase drops

This model transforms reactive marketing into proactive retention.


##### What next?

We built the solution.  How do we take it to production? 

Below are some of the steps that should be established to successfully use the solution we built such that it is used by the marketing team to 

1. DATA PIPELINE AUTOMATION
   → Feature engineering needs to run daily
   → Integrate with data warehouse (Snowflake, BigQuery, Redshift)
   → Set up automated data quality checks

2. MODEL DEPLOYMENT
   → Deploy via REST API or batch scoring system
   → Containerize with Docker for reproducibility
   → Version control models (MLflow, W&B)
   → Set up monitoring for model drift

3. CRM INTEGRATION
   → Push churn scores to marketing automation platform (Salesforce, HubSpot)
   → Create automated workflows for high-risk segments
   → Enable marketing ops to filter by risk score

4. DATA GOVERNANCE & COMPLIANCE
   → Document feature logic for explainability (GDPR requirement)
   → Ensure no PII leakage in model artifacts
   → Get legal/compliance sign-off on automated decisions
   → Set up audit trails

5. BUSINESS VALIDATION
   → Run A/B test: intervention group vs control group
   → Measure: Did targeted campaigns actually reduce churn?
   → Track: Cost per save, retention rate lift, ROI
   → Iterate based on feedback loop

6. STAKEHOLDER ALIGNMENT
   → Marketing needs training on how to use risk scores
   → Data engineering needs SLAs for feature freshness
   → Finance needs to understand ROI calculation
   → Product team owns ongoing model iteration


