# **Customer Lifetime Value (CLV) & Churn Prediction**
## ***Churn Prediction Modeling***
**Goal:** Predict the probability that a customer will churn using behavioral features and CLV.

This model enables:
- Proactive retention strategies
- Prioritization of high-value, high-risk customers
- Integration with a What-If dashboard

In [15]:
# Importing necessary libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

In [16]:
features = pd.read_csv("final_rfm_features.csv", dtype={'customer_id': str})
clv = pd.read_csv("clv_predictions.csv", dtype={'customer_id': str})

# Merging CLV predictions with RFM features
df = features.merge(
    clv[['customer_id', 'clv_12m']],
    on='customer_id',
    how='left'
)

df.head()

| | customer_id | recency | frequency | monetary | avg_order_value | purchase_interval_mean | purchase_interval_std | high_value_customer | one_time_buyer | customer_age_days | log_monetary | log_avg_order_values | clv_12m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12346.0 | 326 | 12 | 77556.46 | 77556.46 | 11.969697 | 40.41309 | True | False | 726 | 11.258774 | 11.258774 | 22268.269061 |
| 1 | 12347.0 | 2 | 8 | 4921.53 | 4921.53 | 1.80543 | 10.487359 | True | False | 404 | 8.501578 | 8.501578 | 3468.587381 |
| 2 | 12348.0 | 75 | 5 | 2019.4 | 2019.4 | 7.24 | 28.617506 | False | False | 438 | 7.611051 | 7.611051 | 1502.268172 |
| 3 | 12349.0 | 19 | 4 | 4428.69 | 4428.69 | 3.270115 | 31.898349 | True | False | 589 | 8.396085 | 8.396085 | 2449.322654 |
| 4 | 12350.0 | 310 | 1 | 334.4 | 334.4 | 0.0 | 0.0 | False | True | 310 | 5.815324 | 5.815324 | NaN |
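The last row above has no `clv_12m` value, which is what a left merge produces when a customer is absent from the CLV file. A minimal sketch of how such rows could be surfaced at merge time, using a toy frame (the `_demo` names and rounded values are illustrative, not the notebook's data) together with the `validate` and `indicator` options of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for the two CSV files; values are illustrative only.
features_demo = pd.DataFrame({
    "customer_id": ["12346.0", "12347.0", "12350.0"],
    "recency": [326, 2, 310],
})
clv_demo = pd.DataFrame({
    "customer_id": ["12346.0", "12347.0"],
    "clv_12m": [22268.27, 3468.59],
})

merged = features_demo.merge(
    clv_demo,
    on="customer_id",
    how="left",
    validate="one_to_one",  # raises MergeError if either side has duplicate keys
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

# Customers that have features but no CLV prediction
missing_clv = merged.loc[merged["_merge"] == "left_only", "customer_id"].tolist()
print(missing_clv)
```

Knowing how many customers lack a CLV estimate helps when reading the revenue figures later, since any product with a missing `clv_12m` is also missing.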


### **Section 1:** Defining Churn Label

In [17]:
# Churn threshold: a customer who has not purchased in the last 90 days is labeled as churned
churn_threshold = 90

df['churn'] = (df['recency'] > churn_threshold).astype(int)
df['churn'].value_counts(normalize=True)

churn
1    0.508587
0    0.491413
Name: proportion, dtype: float64
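Because this label is computed directly from `recency`, any model that also receives `recency` as a feature can reconstruct the label exactly. One common way to keep the two apart is to build features from activity before a cutoff date and define churn from activity after it. The sketch below assumes a transaction log with `customer_id` and `invoice_date` columns; that table, the cutoff date, and all values are hypothetical and not part of this notebook.

```python
import pandas as pd

# Hypothetical transaction log; column names and dates are assumptions.
tx = pd.DataFrame({
    "customer_id": ["A", "A", "B", "B", "C"],
    "invoice_date": pd.to_datetime(
        ["2011-01-10", "2011-08-20", "2011-02-01", "2011-05-15", "2011-06-01"]
    ),
})

cutoff = pd.Timestamp("2011-06-30")   # end of the feature window
horizon = pd.Timedelta(days=90)       # length of the label window

history = tx[tx["invoice_date"] <= cutoff]
future = tx[(tx["invoice_date"] > cutoff) & (tx["invoice_date"] <= cutoff + horizon)]

# Features use only pre-cutoff activity.
labels = (
    history.groupby("customer_id")["invoice_date"].max()
    .rename("last_purchase")
    .to_frame()
)
labels["recency"] = (cutoff - labels["last_purchase"]).dt.days

# Label: 1 if the customer made no purchase in the 90 days after the cutoff.
labels["churn"] = (~labels.index.isin(future["customer_id"])).astype(int)
print(labels)
```

In this toy data customer A has the highest recency yet is not labeled as churned, so the label is no longer a deterministic function of the feature.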

### **Section 2:** Selecting Features for ML model

In [18]:
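# Dropping the identifier column and the target.
# Note: 'recency' stays in X even though 'churn' is defined directly from it (Section 1).
drop_cols = ['customer_id']
X = df.drop(columns=drop_cols + ['churn'])
y = df['churn']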

### **Section 3:** Train-Test Split

In [19]:
X_train, X_test, y_train, y_test = train_test_split(
    X,y, test_size=0.25, random_state=42, stratify=y
)
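The `stratify=y` argument asks `train_test_split` to keep the class proportions the same in both partitions. A small check on synthetic labels (the `_demo` arrays are made up for illustration) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 75/25 class balance
y_demo = np.array([1] * 300 + [0] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Share of positive labels in each partition
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```

Running the same comparison on `y_train` and `y_test` from the cell above is a quick sanity check before fitting.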

### **Section 4:** Building ML Pipeline

In [20]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(
        n_estimators=300,
        max_depth=8,
        class_weight='balanced',
        random_state=42
    ))
])
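The merged frame contains missing `clv_12m` values, and this pipeline has no step that handles them. Whether the fit succeeds then depends on the installed scikit-learn version. A minimal sketch of one option, assuming median imputation is acceptable for this use case, is to place a `SimpleImputer` in front of the scaler. The data and the `imputed_pipeline` name below are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 10% of 'clv_12m' set to NaN
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "frequency": rng.integers(1, 20, size=200).astype(float),
    "clv_12m": rng.gamma(2.0, 1000.0, size=200),
})
X_demo.loc[X_demo.sample(frac=0.1, random_state=0).index, "clv_12m"] = np.nan
y_demo = (X_demo["frequency"] < 5).astype(int)

imputed_pipeline = Pipeline([
    # Median fill; add_indicator appends a 0/1 column flagging imputed rows
    ("imputer", SimpleImputer(strategy="median", add_indicator=True)),
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(
        n_estimators=50,
        max_depth=8,
        class_weight="balanced",
        random_state=42,
    )),
])

imputed_pipeline.fit(X_demo, y_demo)
proba = imputed_pipeline.predict_proba(X_demo)[:, 1]
print(proba[:5])
```

The indicator column lets the forest distinguish a genuinely median-valued customer from one whose CLV was never estimated.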

In [22]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('scaler', StandardScaler()),
                ('model',
                 RandomForestClassifier(class_weight='balanced', max_depth=8,
                                        n_estimators=300, random_state=42))])


### **Section 5:** Evaluating the Model

In [23]:
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

print("ROC AUC Score:", roc_auc_score(y_test, y_proba))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

ROC AUC Score: 1.0

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       723
           1       1.00      1.00      1.00       748

    accuracy                           1.00      1471
   macro avg       1.00      1.00      1.00      1471
weighted avg       1.00      1.00      1.00      1471



In [24]:
confusion_matrix(y_test, y_pred)

array([[723,   0],
       [  0, 748]], dtype=int64)
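A perfect ROC AUC and an error-free confusion matrix are what one would expect when a feature encodes the label. Here `churn` is defined as `recency > 90` and `recency` is among the model inputs. The sketch below uses synthetic recency values (not the notebook's data) to show that the raw feature alone, with no model at all, already ranks every churner above every non-churner:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Synthetic recency values in days
rng = np.random.default_rng(0)
recency = pd.Series(rng.integers(0, 365, size=500), name="recency")

# Same labeling rule as Section 1
churn = (recency > 90).astype(int)

# Use the raw feature as the score
auc = roc_auc_score(churn, recency)
print(auc)
```

The scores above therefore measure how well the forest recovers the labeling rule, not how well it anticipates future inactivity. Evaluating on a label taken from a later time window would give a more informative number.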

### **Section 6:** Feature Importance

In [27]:
importances = pipeline.named_steps['model'].feature_importances_

feature_importance = (
    pd.DataFrame({
        'features' : X.columns,
        'importance' : importances
    }).sort_values('importance', ascending=False)
)

# Displaying top 10 important features
feature_importance.head(10)

| | features | importance |
|---|---|---|
| 0 | recency | 0.680541 |
| 11 | clv_12m | 0.106507 |
| 8 | customer_age_days | 0.081394 |
| 1 | frequency | 0.034406 |
| 10 | log_avg_order_values | 0.018377 |
| 9 | log_monetary | 0.01836 |
| 3 | avg_order_value | 0.016043 |
| 2 | monetary | 0.015866 |
| 4 | purchase_interval_mean | 0.011882 |
| 5 | purchase_interval_std | 0.011516 |
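`feature_importances_` is the impurity-based measure computed on the training data, and it tends to spread credit across correlated columns such as `monetary` and `log_monetary`. Permutation importance on held-out data is a common cross-check. The sketch below applies scikit-learn's `permutation_importance` to synthetic data in which only one column carries signal; the column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on 'signal' only
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(400, 3)), columns=["signal", "noise_a", "noise_b"]
)
y_demo = (X_demo["signal"] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Mean drop in held-out ROC AUC when each column is shuffled
result = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=42
)
perm_importance = (
    pd.Series(result.importances_mean, index=X_demo.columns)
    .sort_values(ascending=False)
)
print(perm_importance)
```

The same call should accept the fitted `pipeline` with `X_test` and `y_test`, since it works with any fitted estimator that can be scored.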


### **Section 7:** Predicting Churn Probability for all customers

In [28]:
df['churn_probability'] = pipeline.predict_proba(X)[:,1]

In [29]:
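# Bins are right-closed: (0, 0.3], (0.3, 0.6], (0.6, 1.0].
# include_lowest=True keeps a probability of exactly 0.0 in 'low risk' instead of NaN.
df['risk_segment'] = pd.cut(
    df['churn_probability'],
    bins=[0, 0.3, 0.6, 1.0],
    labels=['low risk', 'medium risk', 'high risk'],
    include_lowest=True
)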
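By default `pd.cut` builds right-closed, left-open intervals, so a probability of exactly 0.0 falls outside `(0, 0.3]` unless `include_lowest=True` is passed. A small check of the boundary behavior on hand-picked values:

```python
import pandas as pd

probs = pd.Series([0.0, 0.3, 0.31, 0.6, 1.0])
bins = [0, 0.3, 0.6, 1.0]
labels = ["low risk", "medium risk", "high risk"]

# Default: the first interval is (0, 0.3], so 0.0 is not assigned to any bin
default_cut = pd.cut(probs, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive_cut = pd.cut(probs, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    "prob": probs,
    "default": default_cut,
    "include_lowest": inclusive_cut,
}))
```

A random forest can return exactly 0.0 when every tree votes for the negative class, so this edge case is reachable in practice.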

### **Section 8:** Revenue at Risk

In [30]:
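# Expected 12-month revenue lost to churn: P(churn) * predicted CLV.
# Customers without a CLV prediction (NaN clv_12m) get NaN here.
df['revenue_at_risk'] = df['churn_probability'] * df['clv_12m']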

df[['customer_id', 'churn_probability','clv_12m','revenue_at_risk']].head()

| | customer_id | churn_probability | clv_12m | revenue_at_risk |
|---|---|---|---|---|
| 0 | 12346.0 | 0.899286 | 22268.269061 | 20025.540618 |
| 1 | 12347.0 | 0.010838 | 3468.587381 | 37.591845 |
| 2 | 12348.0 | 0.022261 | 1502.268172 | 33.441242 |
| 3 | 12349.0 | 0.037921 | 2449.322654 | 92.87988 |
| 4 | 12350.0 | 0.999503 | NaN | NaN |
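Customer 12350.0 has a churn probability near 1 but no CLV estimate, so its revenue at risk is undefined. How to treat such rows is a business decision. The sketch below assumes a missing CLV can be counted as zero for aggregation, keeps a flag so those customers stay visible, and then totals revenue at risk per segment. All values are made up for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d"],
    "churn_probability": [0.9, 0.1, 0.5, 0.95],
    "clv_12m": [1000.0, 2000.0, 400.0, np.nan],
    "risk_segment": ["high risk", "low risk", "medium risk", "high risk"],
})

# Keep track of which customers had no CLV prediction
demo["clv_missing"] = demo["clv_12m"].isna()

# Assumption: a missing CLV contributes 0 to revenue at risk
demo["revenue_at_risk"] = demo["churn_probability"] * demo["clv_12m"].fillna(0.0)

# Total expected loss per risk segment, largest first
summary = (
    demo.groupby("risk_segment")["revenue_at_risk"]
    .sum()
    .sort_values(ascending=False)
)
print(summary)
```

Sorting customers by `revenue_at_risk` within the high-risk segment is one way to build the prioritized retention list described in the goals.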


### **Section 9:** Exporting the Dataset

In [31]:
df.to_csv("churn_predictions.csv", index=False)
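The input files were read with `dtype={'customer_id': str}`, and a consumer of the exported file would likely need the same option to keep the identifiers as text. A minimal round-trip sketch using an in-memory buffer in place of a file, with a two-row frame made up for illustration:

```python
import io
import pandas as pd

out = pd.DataFrame({
    "customer_id": ["12346.0", "12350.0"],
    "churn_probability": [0.899286, 0.999503],
})

# Write to an in-memory buffer instead of disk
buffer = io.StringIO()
out.to_csv(buffer, index=False)

# Default parsing infers a float column for the identifiers
buffer.seek(0)
as_default = pd.read_csv(buffer)

# Passing dtype keeps them as strings
buffer.seek(0)
as_string = pd.read_csv(buffer, dtype={"customer_id": str})

print(as_default.dtypes)
print(as_string["customer_id"].tolist())
```

Note that the categorical `risk_segment` column is written as plain text, so its category order would need to be restored after loading if downstream code depends on it.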