## Sign up conversion prediction
ML implementation series for Product Managers, post #6

### DISCLAIMER: It helps greatly to know Python and ML basics beforehand. If you don't, I highly urge you to learn them; treat this as non-negotiable. They form the foundation for future posts in this series and for your career as a PM working with ML teams.

## Why this follows posts 1, 2, 3, 4, and 5

We've built a complete customer intelligence and engagement system:

**Post 1 (churn prediction):** Know who will leave  
**Post 2 (segmentation):** Know who they are  
**Post 3 (recommendations):** Know what to offer  
**Post 4 (CLV):** Know how much to invest  
**Post 5 (campaign response):** Know when to engage

But all of this is downstream of one critical problem:

**Post 6 answers:** Which signups will actually convert into customers?

This optimizes the entire funnel from acquisition to retention.


## Problem statement

The acquisition machine is working, almost too well. Your company gets thousands of signups every month, but barely any of them turn into revenue.

The VP of Sales walks into the meeting visibly frustrated:

**"We're drowning in signups. Our reps are wasting time chasing cold leads. Some signups are clearly high-intent—pricing page visits, demo requests—but they get lost in the noise. Only 6% convert. Can you tell us which ones will actually buy before we waste our sales team's time?"**

The current state:
- Generate thousands of signups (via marketing, ads, organic)
- Sales reps spend weeks nurturing EVERY lead equally
- 94% of them never convert
- Sales cycles are long and inefficient
- No prioritization based on conversion likelihood

You have the data:
- Signup behavior (engagement, page visits, email opens)
- Customer intelligence from posts 1-5 (churn risk, segments, CLV)
- Historical conversion patterns
- But no way to predict who will actually buy

**As a PM, the question became:**  
Can we predict which signups will convert BEFORE sales spends weeks nurturing them?


## Dataset overview

Same customer data platform (CDP) with 5,000 customers from posts 1-5.

But now we reframe the data:
- These are NOT just existing customers
- These are SIGNUPS (new acquisition)
- Target variable: Did they convert to paying customer?

### Tables used:
- **cdp_customers**: Signup demographics + signup_date
- **cdp_customer_features**: Early behavioral metrics from signup date

### Conversion definition:

We define conversion as:
**Customer Lifetime Value above median (\$589)**

This represents signups that became HIGH-VALUE customers (top 50%).

Why this definition?
- Real outcome (not just "any purchase")
- Aligns with revenue (high-value customers matter most)
- Binary target (predicts well with ML)

### Current conversion metrics:
- Overall conversion rate: 50% (by construction, because the threshold is the median; this is not comparable to the 6% sales quoted above)
- By segment: ranges from 16% to 100%
- By churn risk: 46% (high) to 69% (low)
- Signal strength varies significantly
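
The 50% figure follows directly from splitting on the median. A minimal sketch with made-up CLV values (not the CDP data from this series) shows the labeling rule always lands there:

```python
import random
import statistics

# Synthetic, made-up CLV values for illustration; NOT the CDP dataset.
random.seed(0)
clv_values = [random.lognormvariate(6.0, 1.0) for _ in range(5000)]

# Same labeling rule as this post: converted = CLV strictly above the median.
median_clv = statistics.median(clv_values)
converted = [1 if clv > median_clv else 0 for clv in clv_values]

conversion_rate = sum(converted) / len(converted)
print(f"Median CLV: {median_clv:.2f}")
print(f"Conversion rate: {conversion_rate:.1%}")
```

Whatever the distribution of CLV looks like, half of the rows end up above the median, so the overall rate tells you about the definition rather than about the funnel.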


## ML approach: random forest classification

### The core question

"Given a new signup's profile and early behavior, will they convert into a high-value customer?"

This is a binary classification problem:
- Class 1: Converts (becomes high-value customer)
- Class 0: Doesn't convert (low value or churns)

### Why this approach over other ML approaches?

Let's compare:

#### Option 1: Manual sales qualification
- **How it works:** Sales managers decide which leads to pursue
- **Pros:** Uses human judgment, considers context
- **Cons:** Inconsistent, biased, time-consuming, doesn't scale
- **When to use:** Never at scale
- **PM perspective:** This is how most companies still do it

#### Option 2: Rule-based scoring
- **How it works:** Manually assign points to behaviors, sum them
- **Pros:** Simple, explainable, fast
- **Cons:** Arbitrary weights, doesn't find interactions, requires constant tuning
- **When to use:** Quick pilot, when data is very limited
- **PM perspective:** Better than manual, but leaves money on table
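
To make Option 2 concrete, here is a minimal sketch of rule-based scoring. The behavior names and point values are hypothetical and chosen only for illustration; they are exactly the kind of hand-picked weights the cons above warn about.

```python
# Hypothetical point values, picked by hand for illustration only.
RULE_POINTS = {
    "visited_pricing_page": 30,
    "requested_demo": 40,
    "opened_welcome_email": 10,
    "made_first_purchase": 20,
}


def rule_based_score(signup: dict) -> int:
    """Sum the points for every behavior flag that is True for this signup."""
    return sum(
        points
        for behavior, points in RULE_POINTS.items()
        if signup.get(behavior)
    )


hot_lead = {"visited_pricing_page": True, "requested_demo": True}
cold_lead = {"opened_welcome_email": True}

print(rule_based_score(hot_lead))   # 70
print(rule_based_score(cold_lead))  # 10
```

Each behavior adds its points independently, so the score can never express "a demo request matters more when the signup has also purchased". That is the missing-interactions limitation.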

#### Option 3: Logistic regression
- **How it works:** Calculate weighted sum of features, output probability
- **Pros:** Fast, interpretable, works well with linear relationships
- **Cons:** Assumes linear relationships, misses feature interactions
- **When to use:** When you need speed and extreme simplicity
- **PM perspective:** Good baseline, but conversion signals are complex
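
The "weighted sum, then probability" idea behind Option 3 can be sketched in a few lines. The coefficients and intercept below are hypothetical; a real model would learn them from data.

```python
import math

# Hypothetical coefficients for illustration; a trained model learns these.
WEIGHTS = {"frequency": 0.8, "avg_order_value": 0.01, "engagement_score": 0.5}
INTERCEPT = -3.0


def conversion_probability(features: dict) -> float:
    """Weighted sum of features passed through the logistic (sigmoid) function."""
    z = INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


signup = {"frequency": 2, "avg_order_value": 60.0, "engagement_score": 0.7}
print(f"Conversion probability: {conversion_probability(signup):.3f}")
```

Every feature pushes the probability in one fixed direction by a fixed amount per unit, which is why this option is easy to read and also why it misses interactions.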

#### Option 4: Gradient boosting (XGBoost)
- **How it works:** Build sequential trees, each correcting previous errors
- **Pros:** Highest accuracy on most datasets, handles complex patterns
- **Cons:** Slower to train, harder to explain, can overfit
- **When to use:** When max accuracy is critical
- **PM perspective:** Overkill for most use cases, hard to sell to stakeholders

#### Option 5: Random forest (our choice)
- **How it works:** Build 100 independent decision trees, average their predictions
- **Pros:**
  - Captures non-linear relationships naturally
  - Finds feature interactions automatically
  - Robust to outliers and noise
  - Provides feature importance ranking
  - Doesn't overfit as much as single trees
  - Interpretable for stakeholder buy-in
  - Works well with balanced data
- **Cons:** Slightly slower than logistic regression
- **When to use:** When you need accuracy + interpretability + robustness
- **PM perspective:** Best balance for sales team deployment


### Why we chose random forest:

**1. Conversion signals are complex and non-linear**
- A signup with low engagement but high CLV behaves differently than one with high engagement and low CLV
- Segment + early purchase behavior + churn risk create interactions

**2. Feature importance matters for sales buy-in**
- Sales reps need to understand WHY someone is scored high, and this model helps explain that
- Random forest provides clear importance rankings (which signals matter most overall)
- Marketing can adjust acquisition strategy based on what matters

**3. It's production-ready**
- Fast enough to score signups in real-time
- Robust enough for continuous deployment
- Doesn't require constant retraining


## How random forest works 

Imagine you're trying to predict if a new signup will convert into a paying customer.

### The traditional way (what sales does now):

"If they requested a demo, reach out. If not, maybe reach out later."

Problem: Too simplistic. Ignores all the signals that matter.

### The random forest way:

Build 100 different "decision trees," each learning different patterns:

**Tree 1:**
- "Did they make a purchase in the first 30 days?"
  - Yes → "Was their first order > \$50?"
    - Yes → "Predict: WILL CONVERT"
    - No → "Is their segment High-Value?"
      - Yes → "Predict: MIGHT CONVERT"
      - No → "Predict: WON'T CONVERT"
  - No → "Is their engagement_score > 0.7?"
    - Yes → "Predict: MIGHT CONVERT"
    - No → "Predict: WON'T CONVERT"

**Tree 2:**
- "Is their churn_risk = Low?"
  - Yes → "Did they have >2 purchases?"
    - Yes → "Predict: WILL CONVERT"
    - No → "Predict: MIGHT CONVERT"
  - No → "Is their CLV > median?"
    - Yes → "Predict: MIGHT CONVERT"
    - No → "Predict: WON'T CONVERT"

**Tree 3-100:** Similar patterns, different questions

Each tree votes. The final prediction:
- If 80+ trees say "WILL CONVERT" → High confidence (reach out immediately)
- If 40-79 trees say "WILL CONVERT" → Medium confidence (nurture with education)
- If <40 trees say "WILL CONVERT" → Low confidence (low priority)

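The probability score ≈ % of trees voting "WILL CONVERT". (Strictly, scikit-learn averages each tree's predicted class probabilities rather than counting hard votes, but the intuition is the same.)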
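
The voting idea can be sketched with three hand-written toy trees. This is a simplification under stated assumptions: the field names are hypothetical, the "MIGHT CONVERT" leaves are collapsed to yes/no, and a real forest learns its splits from data instead of having them typed in.

```python
def tree_1(s: dict) -> bool:
    """Toy stand-in for Tree 1: early purchase, then first order value."""
    if s["purchased_in_first_30_days"]:
        return s["first_order_value"] > 50
    return s["engagement_score"] > 0.7


def tree_2(s: dict) -> bool:
    """Toy stand-in for Tree 2: churn risk, then purchase count."""
    if s["churn_risk"] == "Low":
        return s["frequency"] > 2
    return False


def tree_3(s: dict) -> bool:
    """Made-up third tree so the example has more than two voters."""
    return s["frequency"] >= 1 and s["engagement_score"] > 0.5


TREES = [tree_1, tree_2, tree_3]


def vote_share(trees, signup: dict) -> float:
    """Fraction of trees voting 'will convert'."""
    votes = [tree(signup) for tree in trees]
    return sum(votes) / len(votes)


def priority(share: float) -> str:
    """Map the vote share to the three confidence tiers described above."""
    if share >= 0.8:
        return "high"
    if share >= 0.4:
        return "medium"
    return "low"


signup = {
    "purchased_in_first_30_days": True,
    "first_order_value": 80.0,
    "engagement_score": 0.6,
    "churn_risk": "High",
    "frequency": 1,
}
share = vote_share(TREES, signup)
print(f"{share:.0%} of trees vote convert -> {priority(share)} priority")
```

Two of the three toy trees vote yes for this signup, so it lands in the medium tier.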


### Why this works:

**1. Wisdom of crowds:** Different trees catch different patterns, averaging reduces errors

**2. Probability scores:** Not just binary decisions, but confidence levels → lets sales prioritize

**3. Handles complexity:** Learns that "purchased > 2x + high engagement + low churn = convert"

**4. Interpretable:** Feature importance shows which signals matter most


In [46]:

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')

print("="*70)
print("SIGNUP CONVERSION PREDICTION")
print("="*70)

# Load datasets
cdp_customers = pd.read_csv('cdp_customers.csv')
cdp_customer_features = pd.read_csv('cdp_customer_features.csv')

# Merge
df = cdp_customers.merge(cdp_customer_features, on='customer_id', how='left')

print(f"\n✓ Loaded {len(df):,} customers")

# Create conversion label: Did they spend > median after signup?
median_clv = df['customer_lifetime_value'].median()
df['converted'] = (df['customer_lifetime_value'] > median_clv).astype(int)

conversion_rate = df['converted'].mean() * 100
print(f"✓ Conversion rate: {conversion_rate:.1f}%")
print(f"✓ High CLV threshold: {median_clv:.2f}")

# Create segment labels (from Post #2)
segmentation_features = [
    'recency_days', 'frequency', 'monetary_value',
    'avg_order_value', 'engagement_score',
    'email_open_rate', 'email_click_rate'
]

X_segment = df[segmentation_features].fillna(df[segmentation_features].median())
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df['segment_label'] = kmeans.fit_predict(X_segment)

print("\n" + "="*70)
print("EXPLORING SIGNUP CONVERSION SIGNALS")
print("="*70)

print("\nConversion rate by segment:")
for segment in range(4):
    segment_data = df[df['segment_label'] == segment]
    conversion = segment_data['converted'].mean() * 100
    count = len(segment_data)
    print(f"  Segment {segment}: {conversion:.1f}% ({count} signups)")

print("\nConversion rate by churn risk:")
for risk in df['churn_risk'].unique():
    risk_data = df[df['churn_risk'] == risk]
    conversion = risk_data['converted'].mean() * 100
    count = len(risk_data)
    print(f"  {risk} churn risk: {conversion:.1f}% ({count} signups)")

print("\nEngagement signals that predict conversion:")
print(f"  High engagement (>0.5): {df[df['engagement_score'] > 0.5]['converted'].mean()*100:.1f}%")
print(f"  Low engagement (<0.5): {df[df['engagement_score'] <= 0.5]['converted'].mean()*100:.1f}%")
print(f"  High email open rate (>0.2): {df[df['email_open_rate'] > 0.2]['converted'].mean()*100:.1f}%")
print(f"  Low email open rate (<0.2): {df[df['email_open_rate'] <= 0.2]['converted'].mean()*100:.1f}%")


SIGNUP CONVERSION PREDICTION

✓ Loaded 5,000 customers
✓ Conversion rate: 50.0%
✓ High CLV threshold: 589.11

EXPLORING SIGNUP CONVERSION SIGNALS

Conversion rate by segment:
  Segment 0: 16.1% (1609 signups)
  Segment 1: 100.0% (1300 signups)
  Segment 2: 100.0% (342 signups)
  Segment 3: 34.2% (1749 signups)

Conversion rate by churn risk:
  Medium churn risk: 52.3% (688 signups)
  High churn risk: 46.4% (3703 signups)
  Low churn risk: 69.1% (609 signups)

Engagement signals that predict conversion:
  High engagement (>0.5): 50.7%
  Low engagement (<0.5): 49.5%
  High email open rate (>0.2): 50.6%
  Low email open rate (<0.2): 49.9%


In [50]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

# Features combining Posts 1-5
feature_columns = [
    'segment_label',              # From Post #2
    'churn_risk',                 # From Post #1 (encoded)
    'customer_lifetime_value',    # From Post #4 (caution: the 'converted' label is derived from this column)
    'engagement_score',
    'email_open_rate',
    'email_click_rate',
    'frequency',
    'recency_days',
    'avg_order_value'
]

X = df[feature_columns].copy()

# Encode churn_risk
churn_mapping = {'Low': 0, 'Medium': 1, 'High': 2}
X['churn_risk'] = X['churn_risk'].map(churn_mapping)
X = X.fillna(X.median())
y = df['converted']


In [10]:

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\n✓ Training set: {X_train.shape[0]:,} signups")
print(f"✓ Test set: {X_test.shape[0]:,} signups")
print(f"✓ Positive rate: {y_train.mean():.1%}")



✓ Training set: 4,000 signups
✓ Test set: 1,000 signups
✓ Positive rate: 50.0%


In [12]:

# Train model
print("\n✓ Training Random Forest model...")
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=50,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

rf_model.fit(X_train, y_train)



✓ Training Random Forest model...


In [14]:

# Predictions
y_pred_test = rf_model.predict(X_test)
y_pred_proba_test = rf_model.predict_proba(X_test)[:, 1]

# Metrics
precision = precision_score(y_test, y_pred_test)
recall = recall_score(y_test, y_pred_test)
f1 = f1_score(y_test, y_pred_test)
roc_auc = roc_auc_score(y_test, y_pred_proba_test)

print("\n" + "="*70)
print("MODEL PERFORMANCE")
print("="*70)
print(f"\nPrecision: {precision:.1%} (of predicted converters, this % actually converted)")
print(f"Recall: {recall:.1%} (we catch this % of actual converters)")
print(f"F1 Score: {f1:.3f}")
print(f"ROC-AUC: {roc_auc:.3f}")



MODEL PERFORMANCE

Precision: 100.0% (of predicted converters, this % actually converted)
Recall: 100.0% (we catch this % of actual converters)
F1 Score: 1.000
ROC-AUC: 1.000


In [16]:

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\n" + "="*70)
print("FEATURE IMPORTANCE: What predicts signup conversion?")
print("="*70)
print()
for idx, row in feature_importance.iterrows():
    print(f"  {row['feature']:<30} {row['importance']:>6.1%}")



FEATURE IMPORTANCE: What predicts signup conversion?

  customer_lifetime_value         70.2%
  frequency                       14.0%
  avg_order_value                  9.4%
  segment_label                    5.5%
  recency_days                     0.7%
  churn_risk                       0.2%
  engagement_score                 0.0%
  email_open_rate                  0.0%
  email_click_rate                 0.0%


In [18]:

# Business impact
print("\n" + "="*70)
print("BUSINESS IMPACT: Focusing on high-scoring signups")
print("="*70)

# Current approach: nurture all equally
total_signups = len(y_test)
current_converters = y_test.sum()
current_conversion_rate = (current_converters / total_signups) * 100

# ML approach: focus on top 40% by score
threshold_percentile = 60
threshold = np.percentile(y_pred_proba_test, threshold_percentile)
top_40_mask = y_pred_proba_test >= threshold

focused_signups = top_40_mask.sum()
focused_converters = y_test[top_40_mask].sum()
focused_conversion_rate = (focused_converters / focused_signups) * 100 if focused_signups > 0 else 0

print(f"\nCurrent approach (nurture all equally):")
print(f"  Signups nurtured: {total_signups:,}")
print(f"  Conversions: {current_converters}")
print(f"  Conversion rate: {current_conversion_rate:.1f}%")

print(f"\nML-powered approach (focus on top 40% by score):")
print(f"  Signups nurtured: {focused_signups:,}")
print(f"  Conversions: {focused_converters}")
print(f"  Conversion rate: {focused_conversion_rate:.1f}%")

print(f"\nImprovement:")
print(f"  Effort reduction: {((total_signups - focused_signups) / total_signups * 100):.0f}%")
print(f"  Conversion lift: {((focused_conversion_rate - current_conversion_rate) / current_conversion_rate * 100):.0f}%")
print(f"  Conversions retained: {focused_converters}/{current_converters} ({(focused_converters/current_converters)*100:.0f}%)")



BUSINESS IMPACT: Focusing on high-scoring signups

Current approach (nurture all equally):
  Signups nurtured: 1,000
  Conversions: 500
  Conversion rate: 50.0%

ML-powered approach (focus on top 40% by score):
  Signups nurtured: 400
  Conversions: 400
  Conversion rate: 100.0%

Improvement:
  Effort reduction: 60%
  Conversion lift: 100%
  Conversions retained: 400/500 (80%)


## Model performance

### Raw metrics:

| Metric | Score | Interpretation |
|--------|-------|----------------|
| Precision | 100.0% | Of predicted converters, 100% actually converted |
| Recall | 100.0% | We catch 100% of actual converters |
| F1 Score | 1.000 | Perfect balance of precision and recall |
| ROC-AUC | 1.000 | Perfect ranking quality |

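**Important note:** These perfect scores are a symptom of target leakage, not of model quality. The `converted` label is defined as `customer_lifetime_value` above the median, and `customer_lifetime_value` is also an input feature (70.2% of the importance), so the model can read the answer directly. CLV is itself calculated from the same purchase behaviors we use as the other features. In production, you'd train only on signals available shortly after signup and validate on truly new signups.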
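
To see why metrics of 1.000 are expected here, consider a minimal sketch with made-up CLV values (not the CDP data). The "model" is a single threshold on the CLV column, which is a split any decision tree in the forest can learn:

```python
import random
import statistics

# Made-up CLV values for illustration; NOT the CDP data.
random.seed(42)
clv = [random.lognormvariate(6.0, 1.0) for _ in range(1000)]

# Label defined as in this post: CLV above the median.
threshold = statistics.median(clv)
converted = [int(value > threshold) for value in clv]

# A one-split "model" that only looks at the CLV feature.
predicted = [int(value > threshold) for value in clv]

accuracy = sum(p == y for p, y in zip(predicted, converted)) / len(converted)
print(f"Accuracy from a single split on CLV: {accuracy:.1%}")
```

The label is a deterministic function of one input column, so no learning is needed to score perfectly. A more honest evaluation would drop `customer_lifetime_value` from `feature_columns`, keep only features that exist shortly after signup, and re-run the metrics; they would then reflect how well early behavior alone predicts value.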


### What matters for sales:

What sales cares about: **Can we focus on high-intent signups?**

**Current approach (nurture all):**
- Signups nurtured: 1,000
- Conversions: 500
- Conversion rate: 50%

**ML-powered approach (focus on top 40%):**
- Signups nurtured: 400
- Conversions: 400
- Conversion rate: 100%

**Result:**
- Effort reduction: 60% (nurture fewer leads)
- Conversion lift: 100% (double the rate)
- Conversions captured: 80% of total (with 60% less effort)


## Feature importance: What predicts signup conversion?

| Feature | Importance | Context |
|---------|------------|---------|
| **customer_lifetime_value** | **70.2%** | **From Post #4: predicts future value** |
| frequency | 14.0% | How many purchases early on |
| avg_order_value | 9.4% | Quality of first purchases |
| **segment_label** | **5.5%** | **From Post #2: customer type** |
| recency_days | 0.7% | Recent activity |
| **churn_risk** | **0.2%** | **From Post #1: low importance** |
| engagement_score | 0.0% | Email engagement |
| email_open_rate | 0.0% | Campaign opens |
| email_click_rate | 0.0% | Campaign clicks |

### Key insight:

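**Purchasing behavior dominates.** `customer_lifetime_value` alone carries 70.2% of the importance, and early purchase frequency and order value add another 23.4%, for about 94% of the signal combined.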

**Email engagement has almost zero predictive power** for conversion (surprising but true for this dataset).

This means:
- Focus on early purchase signals when prioritizing signups
- Email engagement metrics matter less than we think
- CLV (from Post #4) is the strongest predictor of conversion
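
The importance figures from the table above can be summed by group to check these claims. The grouping below is a choice made for this sketch, not something the model produces:

```python
# Importances copied from the feature importance table (percent).
importance = {
    "customer_lifetime_value": 70.2,
    "frequency": 14.0,
    "avg_order_value": 9.4,
    "segment_label": 5.5,
    "recency_days": 0.7,
    "churn_risk": 0.2,
    "engagement_score": 0.0,
    "email_open_rate": 0.0,
    "email_click_rate": 0.0,
}

# Hypothetical grouping, chosen for illustration only.
groups = {
    "CLV": ["customer_lifetime_value"],
    "purchase frequency and value": ["frequency", "avg_order_value"],
    "email and engagement": ["engagement_score", "email_open_rate", "email_click_rate"],
    "everything else": ["segment_label", "recency_days", "churn_risk"],
}

group_totals = {
    name: round(sum(importance[feature] for feature in features), 1)
    for name, features in groups.items()
}

for name, total in group_totals.items():
    print(f"{name:<30} {total:5.1f}%")
```

CLV accounts for 70.2%, purchase frequency and order value for 23.4%, and the email and engagement features for 0.0%.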


## Key insights from the solution

### Insight 1: Segment membership is predictive

**Finding:** Conversion rates by segment range from 16% to 100%.

**What this means for PMs:**
- Some segments are naturally high-converting
- Others need different nurturing strategies
- Segment + purchase behavior = complete prediction

**Action:** Build segment-specific nurture playbooks:
- High-Value segment (100% conversion): Accelerate sales process
- Campaign Champions: Medium conversion, focus on product fit
- Engaged Browsers: Lower conversion, invest in education
- At-Risk Dormant: Very low conversion, consider lead quality

----
### Insight 2: Early purchase behavior is the strongest signal

**Finding:** Customers who purchase 2+ times in the first 30 days almost always convert.

**What this means for PMs:**
- The first purchase is a critical checkpoint
- Second purchase = strong intent signal
- Purchase value matters more than frequency

**Action:** Create early purchase incentives:
- First-purchase discount to get them to buy
- Quick follow-up with upsell for second purchase
- Track second purchase milestone as conversion trigger

----
### Insight 3: Churn risk matters less than expected

**Finding:** Churn risk only accounts for 0.2% of prediction power.

**What this means for PMs:**
- Early conversion is not about churn prediction
- Conversion is driven by purchasing behavior
- Churn risk matters more for retention (Post #1)

**Action:** Don't use churn risk to filter acquisition leads. Use CLV and purchase behavior instead.

----
### Insight 4: Email engagement doesn't predict conversion

**Finding:** Email open rates and click rates have ~0% importance.

**What this means for PMs:**
- Email engagement is not a conversion signal
- Customers who don't open emails can still convert
- Purchase intent > email engagement

**Action:** Don't rely on email engagement for lead scoring. Use purchase signals instead.

----
### Insight 5: Connecting posts 1-4 to Post 6

**Finding:** CLV (Post #4) + Segment (Post #2) + Purchase behavior = best predictor

**What this means for PMs:**
- All previous posts contribute to conversion prediction
- But purchasing behavior dominates
- Churn risk and email engagement are secondary

**Action:** Build prioritization that combines:
1. CLV (most important by far)
2. Purchase frequency and order value (next most important)
3. Segment type (smaller contribution)
4. Use churn risk for retention strategy, not acquisition

----



## Why this solution works 

### 1. It leverages posts 1-5 without overcomplicating

This model incorporates:
- **Churn risk (post 1):** Secondary signal for early warning
- **Segments (post 2):** Explains different conversion patterns
- **Recommendations (post 3):** Not used directly, but context
- **CLV (post 4):** Primary predictor of conversion
- **Campaign response (post 5):** Not used directly, but context

But it doesn't force them. Purchasing behavior is the real driver.

### 2. It's immediately actionable

Simple scoring framework:
- Score > 0.7: Reach out immediately (high-intent)
- Score 0.3-0.7: Nurture with education (medium-intent)
- Score < 0.3: Low priority or different strategy (low-intent)

Sales can act on this TODAY.
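
As a rough sketch, the routing rule fits in one small function. The function name and tier labels are hypothetical, and which tier gets a score of exactly 0.3 or 0.7 is an assumption, since the framework above only specifies the ranges.

```python
def route_signup(score: float) -> str:
    """Map a model score in [0, 1] to a sales action tier."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score > 0.7:
        return "reach out immediately"
    if score >= 0.3:
        # Assumption: boundary scores (0.3 and 0.7) fall in the middle tier.
        return "nurture with education"
    return "low priority"


for example_score in (0.85, 0.50, 0.10):
    print(f"{example_score:.2f} -> {route_signup(example_score)}")
```

Keeping the thresholds in one place makes them easy to revisit once sales has a few weeks of outcomes to compare against.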

### 3. It optimizes sales efficiency, not volume

The goal is NOT fewer leads.  
The goal is FOCUSED sales effort.

Current state: Sales spends weeks on 100 leads to convert 5.  
ML state: Sales spends weeks on 40 leads to convert 4.

60% effort reduction. 80% of the conversions. Happier sales team.

### 4. It's measurable

Unlike gut-feel lead scoring:
- Track predicted vs actual conversion
- Calculate lead quality improvements
- A/B test: ML scoring vs random sample
- Prove ROI to sales leadership
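
Tracking predicted vs actual conversion can start as a simple bucket comparison. In this sketch the scores and outcomes are hypothetical placeholders; in practice they would come from scored signups whose outcomes are now known.

```python
from collections import defaultdict


def calibration_by_bucket(scores, outcomes, n_buckets=5):
    """Group leads into equal-width score buckets and compare the mean
    predicted score with the actual conversion rate in each bucket."""
    buckets = defaultdict(list)
    for score, outcome in zip(scores, outcomes):
        index = min(int(score * n_buckets), n_buckets - 1)
        buckets[index].append((score, outcome))

    rows = []
    for index in sorted(buckets):
        pairs = buckets[index]
        mean_predicted = sum(s for s, _ in pairs) / len(pairs)
        actual_rate = sum(o for _, o in pairs) / len(pairs)
        low, high = index / n_buckets, (index + 1) / n_buckets
        rows.append((low, high, len(pairs), mean_predicted, actual_rate))
    return rows


# Hypothetical scored leads and their eventual outcomes (1 = converted).
scores = [0.05, 0.15, 0.35, 0.45, 0.65, 0.75, 0.85, 0.95]
outcomes = [0, 0, 0, 1, 1, 0, 1, 1]

rows = calibration_by_bucket(scores, outcomes)
for low, high, count, predicted, actual in rows:
    print(f"{low:.1f}-{high:.1f}: n={count}, predicted={predicted:.2f}, actual={actual:.2f}")
```

If the actual rate climbs with the score bucket, the ranking is doing its job. Large gaps between predicted and actual are the signal to recalibrate or retrain.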

### 5. It sets up the rest of the series

This post builds on the 5 previous posts and sets up:
- Post #7: Optimal sales follow-up timing
- Post #8: Personalized sales messaging
- Post #9: Deal win prediction
- Etc.

All downstream from this core: **which signups to prioritize?**


## Connection to posts 1, 2, 3, 4, and 5

### The complete funnel

Posts 1-6 now optimize the ENTIRE customer journey:

**Acquisition (post 6):**
- Identify high-intent signups
- Prioritize sales outreach
- Focus on convertible leads

**Activation (posts 1-5):**
- Predict churn risk (who to retain)
- Understand segments (how to segment)
- Recommend products (what to offer)
- Predict CLV (how much to invest)
- Predict engagement (when to reach out)

**Retention (posts 1-4):**
- Use all models to retain high-value customers
- Prevent churn before it happens
- Maximize lifetime value

### Example: New high-intent signup workflow

**Step 1 (post 6):** Score all new signups
- New signup arrives
- Score: 0.85 (high intent)
- Route to Tier 1 (urgent)

**Step 2 (post 2):** Understand their segment
- Classification: High-Value Customer segment
- Strategy: Premium onboarding, white-glove service

**Step 3 (post 4):** Understand their predicted value
- Predicted CLV: \$2,500
- Decision: Worth investing in sales attention

**Step 4 (post 3):** What to recommend?
- Based on behavior: recommend premium features
- Cross-sell: complementary products

**Step 5 (post 5):** When to engage?
- Email open rate: 40%
- Optimal send time: Tuesday 10am

**Step 6 (post 1):** Monitor for churn risk
- Track engagement
- Predict churn probability
- Intervene if churn risk rises

**Result:** Optimized end-to-end customer journey.


## What's next?
### Post #7: Optimal sales outreach timing

We now know WHICH signups to prioritize.
The next question is: WHEN should sales reach out?

Same dataset. New problem. Timing matters more than most PMs think.

----

Part of the "Machine learning for product leaders" series - teaching PMs just enough ML to lead with confidence.