# **Customer Lifetime Value (CLV) & Churn Prediction**
## ***Churn Prediction Modeling***
**Goal:** Predict the probability that a customer will churn using behavioral features and CLV.

This model enables:
- Proactive retention strategies
- Prioritization of high-value, high-risk customers
- Integration with a What-If dashboard

In [34]:
# Importing necessary libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

First, I will import two datasets, "final_rfm_features.csv" and "clv_predictions.csv", and merge them into a single dataframe for churn prediction.

In [35]:
# Importing datasets
features = pd.read_csv("final_rfm_features.csv", dtype={'customer_id': str})
clv = pd.read_csv("clv_predictions.csv", dtype={'customer_id': str})

# Merging CLV predictions with RFM features
df = features.merge(
    clv[['customer_id', 'clv_12m']],
    on='customer_id',
    how='left'
)

df.head()

| | customer_id | recency | frequency | monetary | avg_order_value | purchase_interval_mean | purchase_interval_std | high_value_customer | one_time_buyer | customer_age_days | log_monetary | log_avg_order_values | clv_12m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12346.0 | 326 | 12 | 77556.46 | 77556.46 | 11.969697 | 40.41309 | True | False | 726 | 11.258774 | 11.258774 | 22268.269061 |
| 1 | 12347.0 | 2 | 8 | 4921.53 | 4921.53 | 1.80543 | 10.487359 | True | False | 404 | 8.501578 | 8.501578 | 3468.587381 |
| 2 | 12348.0 | 75 | 5 | 2019.4 | 2019.4 | 7.24 | 28.617506 | False | False | 438 | 7.611051 | 7.611051 | 1502.268172 |
| 3 | 12349.0 | 19 | 4 | 4428.69 | 4428.69 | 3.270115 | 31.898349 | True | False | 589 | 8.396085 | 8.396085 | 2449.322654 |
| 4 | 12350.0 | 310 | 1 | 334.4 | 334.4 | 0.0 | 0.0 | False | True | 310 | 5.815324 | 5.815324 | NaN |
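The last row above has no `clv_12m` value, which is what a left merge produces when a customer has no matching CLV prediction. A minimal sketch of how such rows could be counted is shown here; the two small frames are illustrative stand-ins for the CSV files, not the real data.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are illustrative only).
features_demo = pd.DataFrame({
    "customer_id": ["12346.0", "12347.0", "12350.0"],
    "frequency": [12, 8, 1],
})
clv_demo = pd.DataFrame({
    "customer_id": ["12346.0", "12347.0"],
    "clv_12m": [22268.27, 3468.59],
})

# indicator=True adds a `_merge` column recording where each row came from.
merged = features_demo.merge(clv_demo, on="customer_id", how="left", indicator=True)

# Rows marked 'left_only' had no CLV prediction and carry NaN in clv_12m.
n_missing = int((merged["_merge"] == "left_only").sum())
print(f"{n_missing} of {len(merged)} customers have no CLV prediction")
```

Knowing this count up front helps later, because any quantity multiplied by `clv_12m` will also be missing for those customers.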


### **Section 1:** Defining Churn Label

Churn is defined as customer inactivity beyond 90 days, a commonly used proxy in non-contractual retail businesses.

In [36]:
# Flagging customers with no purchase in the last 90 days as churned
churn_threshold = 90

df['churn'] = (df['recency'] > churn_threshold).astype(int)
df['churn'].value_counts(normalize=True)

churn
1    0.508587
0    0.491413
Name: proportion, dtype: float64
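Because the 90-day cutoff is a modeling choice, it can be useful to see how the churn rate moves under nearby thresholds. The sketch below uses synthetic recency values (an assumption; the notebook would use `df['recency']`) and a hypothetical list of candidate thresholds.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic recency values in days, standing in for df['recency'].
recency_demo = pd.Series(rng.integers(0, 365, size=1000), name="recency")

# Share of customers labelled as churned under each candidate threshold.
thresholds = [60, 90, 120, 180]
churn_rate = {t: float((recency_demo > t).mean()) for t in thresholds}

for t, rate in churn_rate.items():
    print(f"threshold={t:>3} days -> churn rate {rate:.1%}")
```

A threshold that labels roughly half the customer base as churned, as seen above, may be worth revisiting with the business before the labels are used downstream.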

### **Section 2:** Selecting Features for ML model

In [37]:
# Removing identifier column
drop_cols = ['customer_id']

# Defining features and target variable
X = df.drop(columns = drop_cols + ['churn'])
y = df['churn']

In [38]:
# Checking feature set
X.head()

| | recency | frequency | monetary | avg_order_value | purchase_interval_mean | purchase_interval_std | high_value_customer | one_time_buyer | customer_age_days | log_monetary | log_avg_order_values | clv_12m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 326 | 12 | 77556.46 | 77556.46 | 11.969697 | 40.41309 | True | False | 726 | 11.258774 | 11.258774 | 22268.269061 |
| 1 | 2 | 8 | 4921.53 | 4921.53 | 1.80543 | 10.487359 | True | False | 404 | 8.501578 | 8.501578 | 3468.587381 |
| 2 | 75 | 5 | 2019.4 | 2019.4 | 7.24 | 28.617506 | False | False | 438 | 7.611051 | 7.611051 | 1502.268172 |
| 3 | 19 | 4 | 4428.69 | 4428.69 | 3.270115 | 31.898349 | True | False | 589 | 8.396085 | 8.396085 | 2449.322654 |
| 4 | 310 | 1 | 334.4 | 334.4 | 0.0 | 0.0 | False | True | 310 | 5.815324 | 5.815324 | NaN |


#### **Section 3:** Train-Test Split

In [39]:
# Splitting data into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X,y, test_size=0.25, random_state=42, stratify=y
)
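A quick way to confirm that `stratify=y` kept the class balance is to compare the positive rate in each split. This is a small sketch on synthetic data; `X_demo` and `y_demo` are placeholders rather than the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.3).astype(int)  # roughly 30% positives

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# With stratification, both splits should mirror the overall positive rate.
print(f"overall={y_demo.mean():.3f} train={y_tr.mean():.3f} test={y_te.mean():.3f}")
```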

#### **Section 4:** Building ML Pipeline

In [40]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(
        n_estimators=300,
        max_depth=8,
        class_weight='balanced',
        random_state=42
    ))
])

* Random Forest was chosen for its ability to capture non-linear relationships and handle mixed feature types with minimal preprocessing.
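Tree splits depend on the ordering of feature values rather than their magnitude, so the `StandardScaler` step is likely optional for this model (it does no harm). The sketch below illustrates this on synthetic data, which is an assumption and not the notebook's dataset: a forest trained on raw features and one trained on standardized features should agree on all or nearly all predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_demo[:, 0] *= 10_000  # put one feature on a much larger scale

# Same hyperparameters and seed, with and without standardization.
rf_raw = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)
X_scaled = StandardScaler().fit_transform(X_demo)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_scaled, y_demo)

# Fraction of rows where both forests predict the same class.
agreement = float((rf_raw.predict(X_demo) == rf_scaled.predict(X_scaled)).mean())
print(f"prediction agreement: {agreement:.3f}")
```

Keeping the scaler in the pipeline still makes sense if a scale-sensitive model, such as logistic regression, might be swapped in later.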

In [41]:
# Fitting the pipeline to the training data
pipeline.fit(X_train, y_train)

Pipeline(steps=[('scaler', StandardScaler()),
                ('model',
                 RandomForestClassifier(class_weight='balanced', max_depth=8,
                                        n_estimators=300, random_state=42))])

#### **Section 5:** Evaluating the Model

In [42]:
# Making predictions and evaluating the model
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

# Evaluating model performance
print("ROC AUC Score:", roc_auc_score(y_test, y_proba))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

ROC AUC Score: 1.0

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       723
           1       1.00      1.00      1.00       748

    accuracy                           1.00      1471
   macro avg       1.00      1.00      1.00      1471
weighted avg       1.00      1.00      1.00      1471



Also, checking the confusion matrix.

In [43]:
# Displaying the confusion matrix
confusion_matrix(y_test, y_pred)

array([[723,   0],
       [  0, 748]], dtype=int64)

The model shows a ROC-AUC of 1.0 with perfect precision, recall, and accuracy, which indicates data leakage.
Here is what happened:
* Churn was defined as recency > 90.
* Recency is also in the feature set, so the model sees the exact quantity used to build the label.
* The model therefore learns the rule: if recency > 90, churn; otherwise, no churn.

Although recency is a strong indicator of churn, it must be excluded from the feature set because the churn label is defined from it.
Including it causes target leakage and unrealistically high model performance.
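The leakage mechanism can be reproduced in isolation. In this minimal sketch on synthetic data (an assumption, not the notebook's dataset), the label is built from recency exactly as above, and a depth-1 decision tree recovers it perfectly with a single split on that feature.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
recency_demo = rng.integers(0, 365, size=2000)
noise = rng.normal(size=2000)

# Label defined directly from recency, as in the churn definition.
y_demo = (recency_demo > 90).astype(int)
X_leaky = np.column_stack([recency_demo, noise])

# A single split is enough to reproduce the labelling rule.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_leaky, y_demo)
leaky_accuracy = stump.score(X_leaky, y_demo)
split_feature = int(stump.tree_.feature[0])  # 0 = recency, 1 = noise

print(f"accuracy={leaky_accuracy:.3f}, split on column {split_feature}, "
      f"threshold={stump.tree_.threshold[0]:.1f}")
```

A perfect score from a one-split model is a strong hint that a feature encodes the target rather than predicting it.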

#### **Section 6:** Fixing Data Leakage
* Step 1: Removing recency and recency-derived features from the feature set

In [44]:
leakage_cols = ['recency', 'purchase_interval_mean', 'purchase_interval_std']
X = X.drop(columns=leakage_cols)

In [45]:
# Checking feature set again
X.head()

| | frequency | monetary | avg_order_value | high_value_customer | one_time_buyer | customer_age_days | log_monetary | log_avg_order_values | clv_12m |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 77556.46 | 77556.46 | True | False | 726 | 11.258774 | 11.258774 | 22268.269061 |
| 1 | 8 | 4921.53 | 4921.53 | True | False | 404 | 8.501578 | 8.501578 | 3468.587381 |
| 2 | 5 | 2019.4 | 2019.4 | False | False | 438 | 7.611051 | 7.611051 | 1502.268172 |
| 3 | 4 | 4428.69 | 4428.69 | True | False | 589 | 8.396085 | 8.396085 | 2449.322654 |
| 4 | 1 | 334.4 | 334.4 | False | True | 310 | 5.815324 | 5.815324 | NaN |


These features look safe to train the model on.
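One way to double-check the remaining columns is to score each feature on its own against the label: a single-feature ROC-AUC far from 0.5 suggests the column may still carry recency information. In the sample output, customer 12350.0 is a one-time buyer with recency 310 and `customer_age_days` 310, so that column is a natural candidate to inspect. The sketch below uses synthetic data built to mimic that pattern; the data-generating assumptions and the `unrelated` column are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 2000
recency_demo = rng.integers(0, 365, size=n)
one_time = rng.random(n) < 0.3

# Assumption: for one-time buyers, days since first purchase equals recency;
# for repeat buyers it is recency plus some earlier history.
age_days = np.where(one_time, recency_demo, recency_demo + rng.integers(1, 400, size=n))

demo = pd.DataFrame({
    "frequency": np.where(one_time, 1, rng.integers(2, 20, size=n)),
    "customer_age_days": age_days,
    "unrelated": rng.normal(size=n),
})
churn_demo = (recency_demo > 90).astype(int)

# ROC-AUC of each column used alone as a churn score.
single_feature_auc = {col: float(roc_auc_score(churn_demo, demo[col])) for col in demo.columns}
for col, auc in single_feature_auc.items():
    print(f"{col:>18}: {auc:.3f}")
```

Running the same loop on the real `X` and `y` would show whether any retained feature separates the classes suspiciously well by itself.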

* Step 2: Splitting data into train and test sets

In [46]:
# Splitting data into training and testing sets again
X_train, X_test, y_train, y_test = train_test_split(
    X,y, test_size=0.25, random_state=42, stratify=y
)

* Step 3: Re-Training the model

In [47]:
pipeline.fit(X_train, y_train)
y_proba = pipeline.predict_proba(X_test)[:,1]

print("ROC AUC Score after removing leakage features:", roc_auc_score(y_test, y_proba))

ROC AUC Score after removing leakage features: 0.9487892101389782


#### Final Churn Model Performance

After correcting for data leakage, the churn model achieved:

- **ROC-AUC: 0.949**
- Strong recall for high-risk customers
- Stable probability estimates suitable for ranking and prioritization

This confirms the model’s effectiveness in identifying churn-prone customers without relying on the recency features that define the churn label.
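The retraining cell reports only ROC-AUC, so per-class recall would need to be recomputed on the new test split to back the recall statement. A minimal sketch of that check follows, using synthetic data from `make_classification` as a stand-in for the notebook's `X` and `y`, with the same forest hyperparameters as the pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 9 features, matching the width of the cleaned feature set.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=9, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(
    n_estimators=300, max_depth=8, class_weight="balanced", random_state=42
).fit(X_tr, y_tr)

y_hat = model.predict(X_te)
# Recall for the positive (churn) class: share of true churners that were flagged.
churn_recall = recall_score(y_te, y_hat, pos_label=1)
print(f"churn recall: {churn_recall:.3f}")
print(classification_report(y_te, y_hat))
```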

#### **Section 7:** Feature Importance
It is important to check which factors drive churn so that the business can design retention strategies.

In [48]:
# Analyzing feature importance
importances = pipeline.named_steps['model'].feature_importances_

# Creating a DataFrame for feature importance
feature_importance = (
    pd.DataFrame({
        'features' : X.columns,
        'importance' : importances
    }).sort_values('importance', ascending=False)
)

# Displaying top 10 important features
feature_importance.head(10)

| | features | importance |
|---|---|---|
| 8 | clv_12m | 0.356761 |
| 5 | customer_age_days | 0.311831 |
| 0 | frequency | 0.107424 |
| 1 | monetary | 0.059956 |
| 7 | log_avg_order_values | 0.053253 |
| 2 | avg_order_value | 0.050473 |
| 6 | log_monetary | 0.04391 |
| 4 | one_time_buyer | 0.014467 |
| 3 | high_value_customer | 0.001926 |
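Impurity-based importances are computed on the training data and can be split across correlated columns (for example `monetary` and `log_monetary`). Permutation importance on the held-out set is a common cross-check. The sketch below uses scikit-learn's `permutation_importance` on synthetic data; the feature names `f0` to `f4` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
cols = [f"f{i}" for i in range(5)]  # placeholder feature names
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Mean drop in test ROC-AUC when each column is shuffled, averaged over 5 repeats.
result = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=42
)
perm_importance = pd.Series(result.importances_mean, index=cols).sort_values(ascending=False)
print(perm_importance)
```

If the two rankings disagree strongly, the permutation result is usually the safer one to present to stakeholders.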


#### **Section 8:** Predicting Churn Probability for all customers

In [49]:
# Adding churn probability to the original dataframe
df['churn_probability'] = pipeline.predict_proba(X)[:,1]

Each customer is also assigned to a segment based on risk level.

In [50]:
# Segmenting customers based on churn risk
df['risk_segment'] = pd.cut(
    df['churn_probability'],
    bins=[0,0.3,0.6,1.0],
    labels=['low risk', 'medium risk', 'high risk'],
    include_lowest=True  # keep probabilities of exactly 0.0 in 'low risk'
)
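`pd.cut` uses right-closed intervals by default, so the bins are (0, 0.3], (0.3, 0.6], and (0.6, 1.0]. A probability of exactly 0.0 falls outside the first bin unless `include_lowest=True` is passed. The short example below shows the edge behavior on a few hand-picked values.

```python
import pandas as pd

probs = pd.Series([0.0, 0.3, 0.31, 0.6, 1.0])
bins = [0, 0.3, 0.6, 1.0]
labels = ["low risk", "medium risk", "high risk"]

# Default: the lowest edge is excluded, so 0.0 receives no segment (NaN).
default_cut = pd.cut(probs, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well.
inclusive_cut = pd.cut(probs, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"prob": probs, "default": default_cut, "inclusive": inclusive_cut}))
```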

#### **Section 9:** Revenue at Risk

In [51]:
# Calculating how much revenue is at risk due to potential churn
df['revenue_at_risk'] = df['churn_probability'] * df['clv_12m']

# Displaying sample of final dataframe with churn probabilities and revenue at risk
df[['customer_id', 'churn_probability','clv_12m','revenue_at_risk']].head(10)

| | customer_id | churn_probability | clv_12m | revenue_at_risk |
|---|---|---|---|---|
| 0 | 12346.0 | 0.111213 | 22268.269061 | 2476.515354 |
| 1 | 12347.0 | 0.095226 | 3468.587381 | 330.298122 |
| 2 | 12348.0 | 0.20221 | 1502.268172 | 303.773668 |
| 3 | 12349.0 | 0.386454 | 2449.322654 | 946.549929 |
| 4 | 12350.0 | 0.983199 | NaN | NaN |
| 5 | 12351.0 | 0.980348 | NaN | NaN |
| 6 | 12352.0 | 0.050887 | 2162.617254 | 110.048369 |
| 7 | 12353.0 | 0.917096 | 125.331101 | 114.940684 |
| 8 | 12354.0 | 0.971556 | NaN | NaN |
| 9 | 12355.0 | 0.562791 | 517.031755 | 290.980706 |


Customers with high churn probability and high CLV represent the highest revenue risk and should be prioritized for retention campaigns.
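Ranking by `revenue_at_risk` makes the prioritization concrete. The sketch below reuses four rows from the sample output above. It also shows that customers without a CLV prediction get no revenue-at-risk value and sort to the bottom, even when their churn probability is the highest in the sample.

```python
import numpy as np
import pandas as pd

# Four rows copied from the sample output; NaN marks a missing CLV prediction.
scores = pd.DataFrame({
    "customer_id": ["12346.0", "12349.0", "12350.0", "12353.0"],
    "churn_probability": [0.111213, 0.386454, 0.983199, 0.917096],
    "clv_12m": [22268.269061, 2449.322654, np.nan, 125.331101],
})

# Expected loss: probability of churn times predicted 12-month value.
scores["revenue_at_risk"] = scores["churn_probability"] * scores["clv_12m"]

# Highest expected loss first; rows with missing values are placed last.
ranked = scores.sort_values("revenue_at_risk", ascending=False, na_position="last")
unscored = int(ranked["revenue_at_risk"].isna().sum())

print(ranked)
print(f"{unscored} customer(s) could not be ranked because clv_12m is missing")
```

How to treat the unranked customers (for example, filling missing CLV with zero or with a segment average) is a business decision that this sketch deliberately leaves open.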

* Exporting Dataset

In [52]:
# Saving the final dataframe with churn predictions
df.to_csv("churn_predictions.csv", index=False)

### Modeling Notes & Limitations

- Churn was defined using an inactivity-based rule due to the non-contractual nature of the business.
- Recency-based features were intentionally excluded from the churn model to prevent data leakage.
- The resulting churn probabilities represent relative risk scores rather than absolute churn guarantees.

These outputs are suitable for prioritization and decision-making in retention strategies.
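Whether the scores can also be read as absolute probabilities depends on calibration, which the notebook does not measure. One possible check, sketched here on synthetic data with scikit-learn's `calibration_curve` and `brier_score_loss`, compares the mean predicted probability in each bin with the observed positive rate.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn feature set and labels.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=9, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
model = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=42).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Observed positive fraction vs. mean predicted probability in up to 5 bins.
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=5)
brier = brier_score_loss(y_te, proba)  # lower is better; 0 is perfect

for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> observed {f:.2f}")
print(f"Brier score: {brier:.3f}")
```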