<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h1 class="list-group-item list-group-item-action active" data-toggle="list" style='background:#005097; border:0' role="tab" aria-controls="home"><center>RFM Segmentation and CLV modeling </center></h1>

### Table of Contents

* [1. Theoretical concepts](#section_1)
    * [RFM segmentation](#section_1_1)
    * [Customer Lifetime Value](#section_1_2) 
    ___
* [2. Data Preprocessing](#section_2)
    * [Feature engineering](#section_2_1)
    * [Statistical summary](#section_2_2)
    
    ___
* [3. RFM Segmentation](#section_3)
    * [Recency calculation](#section_3_1)
    * [Frequency calculation](#section_3_2)
    * [Monetary calculation](#section_3_3)
    * [Segment creation](#section_3_4)
    
    ___
* [4. CLV modeling](#section_4)
    * [Deriving RFM Metrics](#section_4_1)
    * [Retention Model fitting](#section_4_2)
    * [Value Model fitting](#section_4_3)
    * [CLV estimates](#section_4_4)
        
    ___
* [5. Conclusion](#section_5)
    
    ___

In [None]:
!pip install Lifetimes
!pip install scikit-learn-extra

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import datetime
from datetime import timedelta
from datetime import date
import squarify
from lifetimes.plotting import *
from lifetimes.utils import *
from lifetimes import BetaGeoFitter
from lifetimes import GammaGammaFitter
from scipy.stats import gamma, beta
from sklearn_extra.cluster import KMedoids
import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
data_folder = "/kaggle/input/arketing-campaign/"

# 1. Theoretical concepts <a class="anchor" id="section_1"></a>

### A. RFM Segmentation <a class="anchor" id="section_1_1"></a>

RFM segmentation is a scoring technique used to quantify customer behavior. During marketing campaigns, not all customers should be contacted with the same effort. Direct marketing segmentation makes it possible to group customers into segments and analyze their profitability accordingly.

RFM metrics are closely related to Customer Lifetime Value: frequency and monetary value directly affect **CLV**, while recency affects **retention**.
>- **Recency** : Time since last order  
>- **Frequency** : Total number of transactions
>- **Monetary** : Total transactions value

These metrics are very important to understand customer behavior :
- The more **recent** the purchase, the more **responsive** the customer is to promotions
- The more **frequently** customers buy, the more **engaged** they are
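
As a minimal sketch of the three definitions above, the snippet below derives Recency, Frequency and Monetary from a small transaction log. The log, the customer IDs and the `snapshot` date are hypothetical and only serve as an illustration; the notebook's dataset already provides these quantities in aggregated form.

```python
from datetime import date

# Hypothetical transaction log: (customer_id, purchase_date, amount)
transactions = [
    ("A", date(2014, 1, 5), 40.0),
    ("A", date(2014, 3, 1), 25.0),
    ("B", date(2013, 11, 20), 90.0),
    ("A", date(2014, 6, 10), 35.0),
    ("B", date(2014, 2, 14), 10.0),
]
snapshot = date(2014, 6, 30)  # assumed analysis date


def rfm_table(rows, today):
    """Return {customer: (recency_in_days, frequency, monetary)}."""
    table = {}
    for customer, day, amount in rows:
        last, count, total = table.get(customer, (day, 0, 0.0))
        # Keep the latest purchase date, the purchase count and the total spend
        table[customer] = (max(last, day), count + 1, total + amount)
    return {c: ((today - last).days, n, total) for c, (last, n, total) in table.items()}


rfm = rfm_table(transactions, snapshot)
for customer, (recency, frequency, monetary) in sorted(rfm.items()):
    print(customer, recency, frequency, monetary)
```

Both customers spent the same total amount here, but customer A bought more often and more recently, which is exactly the difference the R and F scores are meant to capture.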

### B. Customer Lifetime Value <a class="anchor" id="section_1_2"></a>

- **CLV definition**

Customer Lifetime Value can be viewed as the economic value derived from the firm's relationship with its customers. 
CLV is defined as a measure of the present value of the future cash flows attributed to the customer relationship. In other words, CLV measures the net profit a customer will bring to the firm over future periods. Hence, past customer transactions may be used as a predictor of the economic value of a firm's customer relationship.

The CLV formula can be written as :

$$CLV = \sum_{n=1}^{N} \frac {Value_{n}*Retention^{n}}{ (1+ DiscountRate)^{n}}$$
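
The formula can be sketched in a few lines of Python. The function name `clv`, the constant retention rate and the example numbers are assumptions made for illustration; they are not taken from the dataset.

```python
def clv(values, retention, discount_rate):
    """Discounted CLV: sum over n of value_n * retention**n / (1 + discount_rate)**n."""
    return sum(
        value * retention ** n / (1 + discount_rate) ** n
        for n, value in enumerate(values, start=1)
    )


# Hypothetical customer: 100 per period for 3 periods, 80% retention, 10% discount rate
example = clv([100.0, 100.0, 100.0], retention=0.8, discount_rate=0.1)
print(round(example, 2))
```

Each period's value is weighted twice: once by the probability that the customer is still there (`retention ** n`) and once by the time value of money (`(1 + discount_rate) ** n`).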


- **Buy Till You Die model (BTYD model)**

The BTYD model is built on 4 metrics that are closely related to the ones used for RFM segmentation :
- **Frequency** : The number of repeat purchases the customer made after their first purchase
- **Age** (Time) : How long the customer has been with the company, expressed in days, weeks or months. 
   $\textit{Age = Last date in dataset - first customer purchase date }$
- **Recency** : The age of the customer when they made their last purchase  
    $\textit{Recency = Last customer purchase date - first customer purchase date }$
- **Monetary value** : The average amount spent by a customer per transaction
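
To make these four definitions concrete, here is a small sketch for a single customer. The purchase history and the `observation_end` date are hypothetical, and the sketch assumes at most one purchase per day.

```python
from datetime import date


def btyd_summary(purchases, observation_end):
    """purchases: list of (date, amount) for ONE customer, at most one purchase per day."""
    days = sorted(day for day, _ in purchases)
    first, last = days[0], days[-1]
    frequency = len(purchases) - 1              # repeat purchases only
    recency = (last - first).days               # customer's age at the last purchase
    age = (observation_end - first).days        # customer's age at the end of the data
    monetary = sum(amount for _, amount in purchases) / len(purchases)
    return frequency, recency, age, monetary


# Hypothetical purchase history
history = [(date(2014, 1, 1), 30.0), (date(2014, 1, 31), 50.0), (date(2014, 3, 2), 40.0)]
frequency, recency, age, monetary = btyd_summary(history, observation_end=date(2014, 4, 1))

# The RFM-style recency (days since the last purchase) is the gap between the two
days_since_last = age - recency
print(frequency, recency, age, monetary, days_since_last)
```

Note that BTYD recency grows when the customer has bought recently, whereas the RFM recency (days since the last purchase) shrinks.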

While several versions of BTYD models exist, I will use the BG/NBD model here.  
BG/NBD stands for Beta-Geometric/Negative Binomial Distribution. It was introduced by Peter Fader, Bruce Hardie and Ka Lok Lee (paper published in 2005).   
The model splits customer behaviour into two processes:
- The buying process, which models the probability that a customer makes a purchase
- The dying process (or dropout), which models the probability that a customer quits and never purchases again

BG/NBD model is based on 5 assumptions :
>1. While active, the number of transactions made by a customer follows a **Poisson distribution** with transaction rate $\lambda$
>2. Heterogeneity in the transaction rate $\lambda$ follows a **Gamma distribution** (each customer has their own purchase rate)
>3. After any transaction, a customer becomes inactive with probability $p$. The point at which a customer "drops out" (or "dies") is distributed across transactions according to a **Geometric distribution**
>4. Heterogeneity in $p$ (dropout probability) follows a **Beta distribution** 
>5. The transaction rate $\lambda$ and the dropout probability $p$ vary independently across customers
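
The five assumptions can be read as a recipe for simulating customers. The sketch below follows them step by step using only the standard library; the parameter values (`r`, `alpha`, `a`, `b`) and the 52-week horizon are hypothetical and are not the values fitted later in this notebook.

```python
import random


def simulate_customer(r, alpha, a, b, horizon, rng):
    """Simulate one customer's repeat-purchase times on (0, horizon]."""
    # Assumptions 2, 4 and 5: lambda and p are drawn independently for each customer
    lam = rng.gammavariate(r, 1.0 / alpha)   # Gamma with shape r and rate alpha
    p = rng.betavariate(a, b)                # Beta(a, b) dropout probability
    times, t = [], 0.0
    while True:
        # Assumption 1: Poisson purchasing means exponential gaps between purchases
        t += rng.expovariate(lam)
        if t > horizon:
            break
        times.append(t)
        # Assumption 3: after each transaction the customer drops out with probability p
        if rng.random() < p:
            break
    return times


rng = random.Random(0)
# Hypothetical parameter values, chosen only for illustration
histories = [
    simulate_customer(r=1.5, alpha=10.0, a=0.8, b=2.5, horizon=52, rng=rng)
    for _ in range(1000)
]
print("Average number of repeat purchases:", sum(len(h) for h in histories) / len(histories))
```

Fitting the model goes in the opposite direction: given the observed purchase histories, it estimates the four parameters of the Gamma and Beta distributions.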

Once these probability distributions have been fitted, we obtain for each customer :
- $P(X(t)=x| \lambda ,p) $ : the probability of observing $x$ transactions in a time period of length $t$
- $E(X(t)| \lambda ,p) $ : the expected number of transactions in a time period of length $t$
- $P(\tau>t) $ : the probability that a customer is still active at time $t$, where $\tau$ is the (unobserved) time at which the customer becomes inactive
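
For a customer whose $\lambda$ and $p$ are known, two of these quantities have simple closed forms in the BG/NBD paper: $P(\tau>t \mid \lambda, p) = e^{-\lambda p t}$ and $E(X(t) \mid \lambda, p) = \frac{1}{p}\left(1 - e^{-\lambda p t}\right)$. The sketch below evaluates them; the function names and the example values are assumptions for illustration, and in practice $\lambda$ and $p$ are not observed, which is why the fitted model averages over their Gamma and Beta distributions.

```python
import math


def prob_active(lam, p, t):
    """P(tau > t | lam, p): probability that the customer has not dropped out by time t."""
    return math.exp(-lam * p * t)


def expected_transactions(lam, p, t):
    """E[X(t) | lam, p] = (1 - exp(-lam * p * t)) / p."""
    return (1.0 - math.exp(-lam * p * t)) / p


# Hypothetical customer: 0.1 purchases per day while active, 20% dropout probability
lam, p = 0.1, 0.2
for t in (30, 90, 365):
    print(t, round(prob_active(lam, p, t), 3), round(expected_transactions(lam, p, t), 3))
```

The expected number of transactions is always below $\lambda t$ (what a customer who never drops out would make) and levels off at $1/p$ for long horizons.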

# 2. Data Preprocessing <a class="anchor" id="section_2"></a>

We covered the data discovery steps (data types, data shape, data completeness, etc.) in the <a href="https://www.kaggle.com/raphael2711/data-prep-visual-eda-and-statistical-hypothesis">previous notebooks</a>.  
We will therefore start directly with the feature engineering step and the analysis of the statistical metrics relevant to this use case.

### A. Feature Engineering <a class="anchor" id="section_2_1"></a>

In [None]:
dataset=pd.read_csv(data_folder+'marketing_campaign.csv',header=0,sep=';') 
dataset.head(10)

The dataset already has all the variables needed to create the RFM metrics. We just need to prepare the data.

We will create two variables :

>- Variable __*Spending*__ as the sum of the amount spent on the 6 product categories.
>- Variable __*Transactions*__ as the total number of purchases made by the customer.


We will remove the variables that are not used in this analysis and keep only the customers who made at least one repeat purchase (i.e. at least 2 transactions), which is required to calculate the Customer Lifetime Value.

In [None]:
dataset['Spending']=dataset['MntWines']+dataset['MntFruits']+dataset['MntMeatProducts']+dataset['MntFishProducts']+dataset['MntSweetProducts']+dataset['MntGoldProds']
dataset['Transactions']=dataset['NumWebPurchases']+dataset['NumCatalogPurchases']+dataset['NumStorePurchases']
dataset=dataset[['ID','Spending','Transactions','Recency','Dt_Customer']]
dataset = dataset[dataset['Transactions'] > 1] #We keep customers with repeated purchases, implying number of transactions must be at least 2
dataset = dataset[dataset['Spending'] > 0]

### B. Statistical Summary <a class="anchor" id="section_2_2"></a>

In [None]:
print("Summary of the last 2 years spending")
print("Number of transactions: ", dataset['Transactions'].sum())
print("Total sales: ",dataset['Spending'].sum())
print("Number of customers:", dataset['ID'].nunique())

# 3. RFM Segmentation <a class="anchor" id="section_3"></a>

### A. Recency calculation <a class="anchor" id="section_3_1"></a>

In [None]:
recency_df = dataset[['ID','Recency']]
recency_df

### B. Frequency calculation <a class="anchor" id="section_3_2"></a>

In [None]:
frequency_df = dataset[['ID','Transactions']]
temp_df = recency_df.merge(frequency_df,on='ID')
frequency_df

### C. Monetary calculation <a class="anchor" id="section_3_3"></a>

In [None]:
monetary_df = dataset[['ID','Spending']]
monetary_df

#### DataFrame aggregation

In [None]:
tx_user  = temp_df.merge(monetary_df,on='ID')
tx_user.columns = ['ID','Recency','Frequency','Monetary']
tx_user

#### Elbow method

In [None]:
#Select the number of clusters for each attribute
#Step 1 : Clusters for Recency
sse={}
tx_recency = tx_user[['Recency']].copy()
for k in range(1, 10):
    # Fit on the Recency column only: storing the labels in tx_recency would turn them into a feature for the next fits
    kmedoids = KMedoids(n_clusters=k, random_state=0, max_iter=1000,init='k-medoids++',metric='euclidean').fit(tx_recency)
    sse[k] = kmedoids.inertia_
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()

In this analysis we will divide our customers into 5 clusters for each RFM metric, leading to 5x5x5 possible combinations.

#### Recency clusters creation

In [None]:
kmedoids = KMedoids(n_clusters=5, random_state=0, max_iter=1000,init='k-medoids++',metric='euclidean').fit(tx_recency)
tx_user['RecencyCluster'] = kmedoids.predict(tx_recency)

#function for ordering cluster numbers
def order_cluster(cluster_field_name, target_field_name,df,ascending):
    new_cluster_field_name = 'new_' + cluster_field_name
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name,ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df,df_new[[cluster_field_name,'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name],axis=1)
    df_final = df_final.rename(columns={"index":cluster_field_name})
    return df_final

tx_user = order_cluster('RecencyCluster', 'Recency',tx_user,False)
#see details of each cluster
tx_user.groupby('RecencyCluster')['Recency'].describe()
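
The relabelling idea behind `order_cluster` can be checked on a toy example. The sketch below is an alternative formulation, not the function used in this notebook: `relabel_by_mean` and the toy data are hypothetical, and it maps labels in place instead of merging, so the row order is preserved.

```python
import pandas as pd


def relabel_by_mean(df, cluster_col, target_col, ascending=True):
    """Return a copy of df whose cluster labels are renumbered 0..k-1 by the mean of target_col."""
    means = df.groupby(cluster_col)[target_col].mean().sort_values(ascending=ascending)
    mapping = {old: new for new, old in enumerate(means.index)}
    out = df.copy()
    out[cluster_col] = out[cluster_col].map(mapping)
    return out


# Hypothetical clustering output: the raw labels carry no particular order
toy = pd.DataFrame({
    "Recency": [5, 90, 7, 50, 95],
    "RecencyCluster": [0, 2, 0, 1, 2],
})
# ascending=False: the cluster with the highest Recency (least recent buyers) gets label 0
ordered = relabel_by_mean(toy, "RecencyCluster", "Recency", ascending=False)
print(ordered)
```

With this ordering, a higher `RecencyCluster` value always means a more recent customer, which is what the segment map in the next section relies on.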

#### Frequency clusters creation

In [None]:
tx_frequency = tx_user[['Frequency']]

kmedoids = KMedoids(n_clusters=5, random_state=0, max_iter=1000,init='k-medoids++',metric='euclidean').fit(tx_frequency)
tx_user['FrequencyCluster'] = kmedoids.predict(tx_frequency)

#order the frequency cluster
tx_user = order_cluster('FrequencyCluster', 'Frequency',tx_user,True)

#see details of each cluster
tx_user.groupby('FrequencyCluster')['Frequency'].describe()

#### Monetary clusters creation

In [None]:
tx_monetary = tx_user[['Monetary']]

kmedoids = KMedoids(n_clusters=5, random_state=0, max_iter=1000,init='k-medoids++',metric='euclidean').fit(tx_monetary)
tx_user['MonetaryCluster'] = kmedoids.predict(tx_monetary)

#order the cluster numbers
tx_user = order_cluster('MonetaryCluster', 'Monetary',tx_user,True)

#show details of the dataframe
tx_user.groupby('MonetaryCluster')['Monetary'].describe()

### D. Segment creation <a class="anchor" id="section_3_4"></a>

In order to keep a manageable number of segments, the segments are created using only the recency and frequency scores.  
The monetary score is often viewed as an aggregation metric for summarizing transactions.

In [None]:
segt_map = {
    r'30': 'Promising',
    r'23': 'Loyal customers',
    r'24': 'Loyal customers',
    r'33': 'Loyal customers',
    r'34': 'Loyal customers',
    r'43': 'Loyal customers',
    r'32': 'Potential loyalist',
    r'31': 'Potential loyalist',
    r'42': 'Potential loyalist',
    r'41': 'Potential loyalist',
    r'21': 'Need attention',
    r'22': 'Need attention',
    r'12': 'Need attention',
    r'11': 'Need attention',
    r'40': 'New customers',
    r'20': 'About to sleep',
    r'14': "Can't lose them",
    r'04': "Can't lose them",
    r'10': 'Lost',
    r'00': 'Lost',
    r'01': 'Lost',
    r'02': 'At risk',
    r'03': 'At risk',
    r'13': 'At risk',
    r'44': 'Champions',
}

tx_user['Segment'] = tx_user['RecencyCluster'].map(str) + tx_user['FrequencyCluster'].map(str)
tx_user['Segment'] = tx_user['Segment'].replace(segt_map, regex=True)
tx_user.head()
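
The same lookup can be sketched without regular expressions, by keying the map on the pair (recency score, frequency score). Only a handful of entries from the map above are reproduced here, and the `segment_for` helper and its `default` label are hypothetical.

```python
# Partial copy of the segment map, keyed by (recency_score, frequency_score)
SEGMENTS = {
    (4, 4): "Champions",
    (3, 4): "Loyal customers",
    (4, 0): "New customers",
    (0, 2): "At risk",
    (0, 0): "Lost",
}


def segment_for(recency_score, frequency_score, default="Unmapped"):
    """Exact lookup of a score pair; unknown pairs fall back to `default`."""
    return SEGMENTS.get((recency_score, frequency_score), default)


for scores in [(4, 4), (0, 2), (2, 9)]:
    print(scores, "->", segment_for(*scores))
```

An exact lookup makes missing combinations visible through the default label, whereas a regex replacement silently leaves the raw two-digit code in place.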

In [None]:
# count the number of customers in each segment
segments_counts = tx_user['Segment'].value_counts().sort_values(ascending=True)

fig, ax = plt.subplots()

bars = ax.barh(range(len(segments_counts)),
              segments_counts,
              color='silver')
ax.set_frame_on(False)
ax.tick_params(left=False,
               bottom=False,
               labelbottom=False)
ax.set_yticks(range(len(segments_counts)))
ax.set_yticklabels(segments_counts.index)

for i, bar in enumerate(bars):
        value = bar.get_width()
        if segments_counts.index[i] in ['Champions', 'Loyal customers']:
            bar.set_color('firebrick')
        ax.text(value,
                bar.get_y() + bar.get_height()/2,
                '{:,} ({:}%)'.format(int(value),
                                   int(value*100/segments_counts.sum())),
                va='center',
                ha='left'
               )

plt.show()

#### Metrics analysis per segment 

In [None]:
# Calculate average values for each RFM segment, and return a size of each segment 
tx_user_viz = tx_user.groupby('Segment').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'Monetary': ['mean', 'count'],
}).round(1)
# Print the aggregated dataset
tx_user_viz

#### Segment visualization

In [None]:
tx_user_viz.columns = ['Recencymean','Frequencymean', 'Monetarymean','Count']
fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(16, 9)
# Take the labels from the aggregated table so that each label matches its segment size
squarify.plot(sizes=tx_user_viz['Count'], 
              label=tx_user_viz.index.tolist(),
              alpha=.6 )
plt.title("RFM Segments",fontsize=22,fontweight="bold")
ax.set_xlabel('Recency',fontsize=12)
ax.set_ylabel('Frequency',fontsize=12)
plt.axis('on')
plt.show()

# 4. CLV modeling <a class="anchor" id="section_4"></a>

### A. Deriving RFM Metrics <a class="anchor" id="section_4_1"></a>

To calculate CLV, we will retrieve the "Dt_Customer" variable, which will help us calculate the Recency and Age variables of our BTYD model.  
`Note that the Recency used in our BTYD model is different from the one used in our RFM segmentation `

In [None]:
tx_user  = tx_user.merge(dataset[['ID','Dt_Customer']],on='ID')
tx_user.head(10)

#### Metrics calculation

In [None]:
last_date = pd.Timestamp(2014, 10, 4)  # end of the observation period
# Age = number of days between the enrollment date and the end of the observation period
tx_user['Age'] = (last_date - pd.to_datetime(tx_user['Dt_Customer'], format='%Y-%m-%d')).dt.days

tx_user['Recency']=(tx_user['Age']-tx_user['Recency'])

tx_user['Monetary_value']=tx_user['Monetary']/tx_user['Frequency']
tx_user['Frequency']=tx_user['Frequency']-1

In [None]:
tx_user=tx_user[['ID','Frequency','Recency','Age','Monetary_value','Segment']]
tx_user

### B. Retention Model fitting <a class="anchor" id="section_4_2"></a>

In [None]:
bgf = BetaGeoFitter(penalizer_coef=0.000000005)
bgf.fit(tx_user['Frequency'], tx_user['Recency'], tx_user['Age'])

In [None]:
# plot the estimated gamma distribution of λ (customers' propensities to purchase)
plot_transaction_rate_heterogeneity(bgf);

In [None]:
bgf.summary

The summary above shows the distribution parameters estimated from the dataset.   
The model can now use these parameters to predict each customer's future number of transactions and their probability of dropping out.

#### The Frequency/Recency heatmap helps us better understand how the model estimates the probability of a customer still being alive and their expected number of future purchases

In [None]:
# visualize our frequency/recency matrix
fig = plt.figure(figsize=(12,8))
plot_frequency_recency_matrix(bgf, T = 30);

>- The heatmap shows that a customer who has made 30 transactions and whose latest purchase occurred when they were 700 days old is among the **best customers** and is the most likely to buy in the following 30 days. (bottom right)  
>- We can also notice the interesting light-blue area around (5 ; 500), which represents customers who buy infrequently but whom we have seen recently. We are not sure whether they are dead or might purchase again soon. (probability around 0.5) 

In [None]:
fig = plt.figure(figsize=(12,8))
plot_probability_alive_matrix(bgf);

The second interesting heatmap shows the probability of a customer still being alive.
>- If a customer has a high number of transactions (frequency) and the time between their first and last transactions is long (recency), their probability of still being alive is high. (bottom right)  
>- If a customer has a small number of transactions but the recency is low, then their probability of still being alive is also high (top left)

#### Estimating the expected number of repeat purchases for each customer

In this step, we will predict the number of repeat purchases each customer will make in the next 30 days.

In [None]:
t = 30 # to calculate the number of expected repeat purchases over the next 30 days
tx_user['Predicted_purchases'] = bgf.conditional_expected_number_of_purchases_up_to_time(t, tx_user['Frequency'], tx_user['Recency'], tx_user['Age'])
tx_user.sort_values(by='Predicted_purchases').tail(5)

#### Estimating the probability of a customer still being alive

In [None]:
tx_user['p_alive'] = bgf.conditional_probability_alive(tx_user['Frequency'], tx_user['Recency'], tx_user['Age'])
tx_user.sort_values(by='Predicted_purchases').tail(5)

In [None]:
# histplot replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(tx_user['p_alive'], kde=True);

### C. Value Model fitting <a class="anchor" id="section_4_3"></a>

In [None]:
# We fit the Gamma-Gamma model to our data
ggf = GammaGammaFitter(penalizer_coef=0.00005)
ggf.fit(frequency = tx_user['Frequency'], monetary_value = tx_user['Monetary_value'])

#### Estimating the average transaction value for each customer

In [None]:
tx_user['predicted_Sales'] = ggf.conditional_expected_average_profit(tx_user['Frequency'], tx_user['Monetary_value'])
tx_user.head()

We can quickly check that the predicted average sales are not far from the actual average sales.

In [None]:
print(f"Expected Average sales: {tx_user['predicted_Sales'].mean()}")
print(f"Actual Average sales: {tx_user['Monetary_value'].mean()}")

The results are satisfactory. We can now calculate the Customer Lifetime Values.

### D. CLV estimates <a class="anchor" id="section_4_4"></a>

In this final step, we calculate the Lifetime Value (LTV) of each customer over the next 12 months. For the discount rate that appears in the CLV formula of the theoretical part, we assume a monthly rate of 1%.

In [None]:
tx_user['LTV'] = ggf.customer_lifetime_value(bgf,tx_user['Frequency'], tx_user['Recency'], tx_user['Age'], tx_user['Monetary_value'],
    time = 12,freq='D',discount_rate = 0.01)
tx_user.head()

We can list our top 10 customers based on LTV.  
These 10 customers all belong to the Champions, Loyal customers or Potential loyalist segments.

In [None]:
pd.options.display.float_format = "{:.2f}".format
best_projected_cust_LTV = tx_user.sort_values('LTV').tail(10)
best_projected_cust_LTV

# 5. Conclusion <a class="anchor" id="section_5"></a>

We can now calculate the average Lifetime Value over the next 12 months for each RFM segment we defined earlier.

In [None]:
pd.options.display.float_format = "{:.0f}".format
# Calculate average values for each RFM segment
tx_user_clv = tx_user.groupby('Segment').agg({'LTV': 'mean',}).sort_values('LTV',ascending=False)
tx_user_clv

We can see that the segments with the highest LTV are the **Champions**, followed closely by the **Loyal customers**.  
We can use our new RFM segmentation along with the LTV to develop a classification model and determine which customers are most likely to be receptive to our next promotional marketing campaign.