# **Customer Lifetime Value (CLV) & Churn Prediction**

## ***Feature Engineering***
The goal of this phase is to transform cleaned transactional data into **customer-level, ML-ready features** that capture purchasing behavior, engagement, and value.

These engineered features will:
- Serve as inputs for **CLV estimation**
- Improve **churn prediction accuracy**
- Enable **business segmentation** (high-value vs low-value customers)

In [20]:
# Importing Necessary Libraries
import numpy as np
import pandas as pd

* Loading Dataset

In [21]:
# Importing dataset
df = pd.read_csv('cleaned_transactions.csv', parse_dates=['invoice_date'], dtype={'customer_id': str, 'invoice': str})
df.head()

Unnamed: 0,invoice,stockcode,description,quantity,price,customer_id,country,invoice_date,total_price
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,6.95,13085.0,United Kingdom,2009-12-01 07:45:00,83.4
1,489434,79323P,PINK CHERRY LIGHTS,12,6.75,13085.0,United Kingdom,2009-12-01 07:45:00,81.0
2,489434,79323W,WHITE CHERRY LIGHTS,12,6.75,13085.0,United Kingdom,2009-12-01 07:45:00,81.0
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2.1,13085.0,United Kingdom,2009-12-01 07:45:00,100.8
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,1.25,13085.0,United Kingdom,2009-12-01 07:45:00,30.0


### **Part A: RFM Calculation**
RFM features summarize each customer's purchase behaviour:
* Recency: number of days since the customer's last purchase
* Frequency: number of unique invoices per customer (that is, the number of purchases)
* Monetary: total amount spent by the customer

In [22]:
# Setting reference date for recency calculation
ref_date = df['invoice_date'].max() + pd.Timedelta(days=1)

# Calculating RFM metrics
rfm = df.groupby('customer_id').agg({
    'invoice_date': lambda x: (ref_date - x.max()).days, # Recency
    'invoice': 'nunique',                                # Frequency
    'total_price':'sum'                                  # Monetary
}).reset_index()

# Renaming columns
rfm.columns = ['customer_id', 'recency','frequency', 'monetary']

# Displaying RFM table
rfm.head()

Unnamed: 0,customer_id,recency,frequency,monetary
0,12346.0,326,12,77556.46
1,12347.0,2,8,4921.53
2,12348.0,75,5,2019.4
3,12349.0,19,4,4428.69
4,12350.0,310,1,334.4


RFM features are calculated at a customer level by aggregating transactional data.
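
To make the aggregation concrete, here is a minimal sketch on a hypothetical toy table (`toy`, `toy_ref_date` and `toy_rfm` are illustrative names, not part of the dataset). It uses pandas named aggregation, which sets the output column names directly, as an alternative to renaming afterwards.

```python
import pandas as pd

# Hypothetical toy transactions: customer A has two invoices, customer B has one.
toy = pd.DataFrame({
    "customer_id": ["A", "A", "A", "B"],
    "invoice": ["1001", "1001", "1002", "2001"],
    "invoice_date": pd.to_datetime(
        ["2024-01-01", "2024-01-01", "2024-01-11", "2024-01-05"]
    ),
    "total_price": [10.0, 5.0, 20.0, 7.5],
})

# Reference date: one day after the latest transaction, as in the cell above.
toy_ref_date = toy["invoice_date"].max() + pd.Timedelta(days=1)

toy_rfm = toy.groupby("customer_id").agg(
    recency=("invoice_date", lambda x: (toy_ref_date - x.max()).days),
    frequency=("invoice", "nunique"),
    monetary=("total_price", "sum"),
).reset_index()

print(toy_rfm)
```

Customer A ends up with a recency of 1 day, a frequency of 2 and a monetary value of 35.0; customer B with 7, 1 and 7.5.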

### **Part B: Additional Features**
Beyond RFM, a few additional customer-level features are created to give the models more signal.

* Average Order Value

In [23]:
rfm['avg_order_value'] = rfm['monetary']
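
Average order value is usually defined as total spend divided by the number of orders. The cell above copies `monetary` unchanged, which is why `avg_order_value` (and later `log_avg_order_values`) equals `monetary` in every output shown in this notebook. A minimal sketch of the per-order version follows, on a hypothetical `rfm_demo` frame that reuses the column names of `rfm`. It assumes `frequency` is never zero, which holds here because every customer has at least one invoice.

```python
import pandas as pd

# Hypothetical customer-level table with the same column names as `rfm`.
rfm_demo = pd.DataFrame({
    "customer_id": ["A", "B"],
    "frequency": [4, 1],
    "monetary": [200.0, 30.0],
})

# Per-order average: total spend divided by the number of distinct invoices.
rfm_demo["avg_order_value"] = rfm_demo["monetary"] / rfm_demo["frequency"]

print(rfm_demo)
```

With this definition a one-time buyer's average order value equals their total spend, while repeat buyers get a smaller per-order figure.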

* Purchase Interval

In [24]:
purchase_gaps = df.sort_values(['customer_id', 'invoice_date']).groupby('customer_id')['invoice_date'].diff().dt.days
interval_stats = purchase_gaps.groupby(df['customer_id']).agg(['mean','std'])

interval_stats.columns = ['purchase_interval_mean', 'purchase_interval_std']

rfm = rfm.merge(interval_stats, on='customer_id', how='left')

rfm.head()

Unnamed: 0,customer_id,recency,frequency,monetary,avg_order_value,purchase_interval_mean,purchase_interval_std
0,12346.0,326,12,77556.46,77556.46,11.969697,40.41309
1,12347.0,2,8,4921.53,4921.53,1.80543,10.487359
2,12348.0,75,5,2019.4,2019.4,7.24,28.617506
3,12349.0,19,4,4428.69,4428.69,3.270115,31.898349
4,12350.0,310,1,334.4,334.4,0.0,0.0
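
Because `df` has one row per line item, `diff()` in the cell above also compares rows inside the same invoice, and those gaps are 0 days. This is consistent with customer 12350.0, who has a single invoice, showing 0.0 rather than a missing value. If the intent is days between distinct orders, one option is to collapse the data to one row per invoice first. The sketch below does this on hypothetical toy data; `orders`, `gap_days` and `order_interval_stats` are illustrative names.

```python
import pandas as pd

# Hypothetical line-item data: A has three invoices, B has one invoice with two rows.
toy = pd.DataFrame({
    "customer_id": ["A", "A", "A", "A", "B", "B"],
    "invoice": ["1001", "1001", "1002", "1003", "2001", "2001"],
    "invoice_date": pd.to_datetime([
        "2024-01-01", "2024-01-01", "2024-01-11", "2024-01-31",
        "2024-01-05", "2024-01-05",
    ]),
})

# One row per (customer, invoice), keeping the invoice's earliest timestamp.
orders = (
    toy.groupby(["customer_id", "invoice"], as_index=False)["invoice_date"]
       .min()
       .sort_values(["customer_id", "invoice_date"])
)

# Days between consecutive invoices of the same customer.
orders["gap_days"] = orders.groupby("customer_id")["invoice_date"].diff().dt.days

order_interval_stats = orders.groupby("customer_id")["gap_days"].agg(
    purchase_interval_mean="mean",
    purchase_interval_std="std",
)

print(order_interval_stats)
```

Customer A's gaps are 10 and 20 days, giving a mean of 15; customer B has no gap at all, so both statistics are missing instead of 0.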


* High Value Customer Flag

In [25]:
threshold = rfm['monetary'].quantile(0.75)
rfm['high_value_customer'] = rfm['monetary'] > threshold

* One-Time Buyer Flag

In [26]:
rfm['one_time_buyer'] = rfm['frequency'] == 1

* Time Since First Purchase

In [27]:
first_purchase = df.groupby('customer_id')['invoice_date'].min()
rfm = rfm.merge(first_purchase.rename('first_purchase_date'), on='customer_id')

rfm['customer_age_days'] = (ref_date - rfm['first_purchase_date']).dt.days
rfm.drop(columns=['first_purchase_date'], inplace=True)

### **Part C: Handling Missing and Skewed Values**
Checking whether there are any missing values in the newly created features.

In [28]:
rfm.isnull().sum()

customer_id                 0
recency                     0
frequency                   0
monetary                    0
avg_order_value             0
purchase_interval_mean    118
purchase_interval_std     176
high_value_customer         0
one_time_buyer              0
customer_age_days           0
dtype: int64

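The purchase interval statistics are null for customers with too few transaction rows: the mean needs at least one gap between rows, and the standard deviation needs at least two. These nulls are filled with 0.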

In [29]:
rfm[['purchase_interval_mean', 'purchase_interval_std']] = rfm[['purchase_interval_mean', 'purchase_interval_std']].fillna(0)
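
Filling with 0 makes a customer with no observed gap look the same as one whose purchases all fall on the same day. One hedged option is to record a missing-value indicator before filling, so a model can still tell the two cases apart. In the sketch below the frame `rfm_demo` and the column `has_purchase_interval` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical customer-level table with the same interval columns as `rfm`.
rfm_demo = pd.DataFrame({
    "customer_id": ["A", "B", "C"],
    "purchase_interval_mean": [15.0, np.nan, 0.0],
    "purchase_interval_std": [7.07, np.nan, np.nan],
})

interval_cols = ["purchase_interval_mean", "purchase_interval_std"]

# Flag customers whose mean interval was actually observed, before filling.
rfm_demo["has_purchase_interval"] = rfm_demo["purchase_interval_mean"].notna()

# Same fill as in the cell above.
rfm_demo[interval_cols] = rfm_demo[interval_cols].fillna(0)

print(rfm_demo)
```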

* Log-transform skewed numeric columns

In [30]:
rfm['log_monetary'] = np.log1p(rfm['monetary'])
rfm['log_avg_order_values'] = np.log1p(rfm['avg_order_value'])
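
`np.log1p(x)` computes log(1 + x), so it is defined at 0 (the minimum of `monetary` in this dataset) and compresses the long right tail. Its inverse is `np.expm1`, which is useful when a model predicts on the log scale and the result has to be reported in currency units. A small sketch on hypothetical values:

```python
import numpy as np

# Hypothetical spend values spanning several orders of magnitude.
monetary_demo = np.array([0.0, 341.9, 2247.72, 580987.04])

# Forward transform: log(1 + x), safe for zero spend.
log_monetary_demo = np.log1p(monetary_demo)

# Inverse transform: exp(y) - 1 recovers the original scale.
recovered = np.expm1(log_monetary_demo)

print(log_monetary_demo)
print(np.allclose(recovered, monetary_demo))
```

Note that `np.log1p` returns NaN for values below -1, so any negative totals (for example from refunds) would need separate handling before this step.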

### **Final Dataset Check**

In [31]:
rfm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5881 entries, 0 to 5880
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   customer_id             5881 non-null   object 
 1   recency                 5881 non-null   int64  
 2   frequency               5881 non-null   int64  
 3   monetary                5881 non-null   float64
 4   avg_order_value         5881 non-null   float64
 5   purchase_interval_mean  5881 non-null   float64
 6   purchase_interval_std   5881 non-null   float64
 7   high_value_customer     5881 non-null   bool   
 8   one_time_buyer          5881 non-null   bool   
 9   customer_age_days       5881 non-null   int64  
 10  log_monetary            5881 non-null   float64
 11  log_avg_order_values    5881 non-null   float64
dtypes: bool(2), float64(6), int64(3), object(1)
memory usage: 471.1+ KB


* The final dataset contains 12 columns and 5881 rows.
* All columns have appropriate datatypes and no missing values.

In [32]:
rfm.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
recency,5881.0,201.457745,209.474135,1.0,26.0,96.0,380.0,739.0
frequency,5881.0,6.287196,13.012879,1.0,1.0,3.0,7.0,398.0
monetary,5881.0,2954.396237,14437.322635,0.0,341.9,865.6,2247.72,580987.04
avg_order_value,5881.0,2954.396237,14437.322635,0.0,341.9,865.6,2247.72,580987.04
purchase_interval_mean,5881.0,5.131778,17.328271,0.0,0.0,1.842105,4.793651,596.0
purchase_interval_std,5881.0,18.776753,25.113909,0.0,0.0,11.747198,26.376723,328.512303
customer_age_days,5881.0,474.698011,223.149927,1.0,313.0,530.0,668.0,739.0
log_monetary,5881.0,6.813501,1.393661,0.0,5.837439,6.764578,7.718116,13.272485
log_avg_order_values,5881.0,6.813501,1.393661,0.0,5.837439,6.764578,7.718116,13.272485


In [33]:
rfm.head()

Unnamed: 0,customer_id,recency,frequency,monetary,avg_order_value,purchase_interval_mean,purchase_interval_std,high_value_customer,one_time_buyer,customer_age_days,log_monetary,log_avg_order_values
0,12346.0,326,12,77556.46,77556.46,11.969697,40.41309,True,False,726,11.258774,11.258774
1,12347.0,2,8,4921.53,4921.53,1.80543,10.487359,True,False,404,8.501578,8.501578
2,12348.0,75,5,2019.4,2019.4,7.24,28.617506,False,False,438,7.611051,7.611051
3,12349.0,19,4,4428.69,4428.69,3.270115,31.898349,True,False,589,8.396085,8.396085
4,12350.0,310,1,334.4,334.4,0.0,0.0,False,True,310,5.815324,5.815324


Feature engineering is now complete.

This feature set forms the foundation for both CLV estimation and churn prediction models.

### **Exporting Dataset**

In [34]:
rfm.to_csv("final_rfm_features.csv", index=False)
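
A CSV file does not store column dtypes, so when the file is read back, IDs such as `12346.0` are parsed as floats unless a dtype is given. The sketch below illustrates this round trip with an in-memory buffer instead of a file; `rfm_demo` is a hypothetical stand-in for `rfm`.

```python
import io

import pandas as pd

# Hypothetical stand-in for a few columns of `rfm`.
rfm_demo = pd.DataFrame({
    "customer_id": ["12346.0", "12347.0"],
    "recency": [326, 2],
    "high_value_customer": [True, True],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
rfm_demo.to_csv(buffer, index=False)
buffer.seek(0)

# Passing dtype keeps the IDs as strings when loading.
reloaded = pd.read_csv(buffer, dtype={"customer_id": str})

print(reloaded.dtypes)
```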