## Overview
The dataset came from: https://statso.io/rfm-analysis-case-study/
- RFM analysis is a technique used by data science professionals, especially in marketing, to understand and segment customers based on their buying behaviour. Using RFM analysis, a business can assess each customer's:
    - recency (how recently they made their last purchase)
    - frequency (how often they make purchases)
    - monetary value (the amount spent on purchases)
- Recency, Frequency, and Monetary value of a customer are three key metrics that provide information about customer engagement, loyalty, and value to a business.

In [44]:
import pandas as pd
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
pio.templates.default = "plotly_white"

data = pd.read_csv('rfm_data.csv')
data.head()

Unnamed: 0,CustomerID,PurchaseDate,TransactionAmount,ProductInformation,OrderID,Location
0,8814,2023-04-11,943.31,Product C,890075,Tokyo
1,2188,2023-04-11,463.7,Product A,176819,London
2,4608,2023-04-11,80.28,Product A,340062,New York
3,2559,2023-04-11,221.29,Product A,239145,London
4,9482,2023-04-11,739.56,Product A,194545,Paris


## Calculating RFM Values

In [45]:
from datetime import datetime
# Convert 'PurchaseDate' to datetime
data['PurchaseDate'] = pd.to_datetime(data['PurchaseDate'])

# Calculate Recency
data['Recency'] = (datetime.now() - data['PurchaseDate']).dt.days

# Calculate Frequency
frequency_data = data.groupby('CustomerID')['OrderID'].count().reset_index()
frequency_data.rename(columns={'OrderID': 'Frequency'}, inplace=True)
data = data.merge(frequency_data, on='CustomerID', how='left')

# Calculate Monetary Value
monetary_data = data.groupby('CustomerID')['TransactionAmount'].sum().reset_index()
monetary_data.rename(columns={'TransactionAmount': 'MonetaryValue'}, inplace=True) # rename() returns a new DataFrame unless inplace=True
data = data.merge(monetary_data, on='CustomerID', how='left')
display(data)

Unnamed: 0,CustomerID,PurchaseDate,TransactionAmount,ProductInformation,OrderID,Location,Recency,Frequency,MonetaryValue
0,8814,2023-04-11,943.31,Product C,890075,Tokyo,461,1,943.31
1,2188,2023-04-11,463.70,Product A,176819,London,461,1,463.70
2,4608,2023-04-11,80.28,Product A,340062,New York,461,1,80.28
3,2559,2023-04-11,221.29,Product A,239145,London,461,1,221.29
4,9482,2023-04-11,739.56,Product A,194545,Paris,461,1,739.56
...,...,...,...,...,...,...,...,...,...
995,2970,2023-06-10,759.62,Product B,275284,London,401,1,759.62
996,6669,2023-06-10,941.50,Product C,987025,New York,401,1,941.50
997,8836,2023-06-10,545.36,Product C,512842,London,401,1,545.36
998,1440,2023-06-10,729.94,Product B,559753,Paris,401,1,729.94


- To calculate recency, we subtracted the purchase date from the current date (datetime.now()) and extracted the number of days with .dt.days. This gives the number of days since each purchase. Because it is computed per transaction, it equals the days since the customer's last purchase only when the customer has a single transaction, and the values change with the day the notebook is run.

- After that, we calculated the frequency for each customer. We grouped the data by 'CustomerID' and counted the 'OrderID' values to determine the number of purchases made by each customer. Note that count() counts rows rather than unique values, so it matches the number of distinct orders only if each order appears once. It gives us the frequency value, representing the total number of purchases made by each customer.

- Finally, we calculated the monetary value for each customer. We grouped the data by ‘CustomerID’ and summed the ‘TransactionAmount’ values to calculate the total amount spent by each customer. It gives us the monetary value, representing the total monetary contribution of each customer.

By performing these calculations, we now have the necessary RFM values (recency, frequency, monetary value) for each customer, which are important indicators for understanding customer behaviour and segmentation in RFM analysis.
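
The three calculations above can also be collapsed into a single per-customer table. The sketch below is a minimal illustration on made-up transactions, not the notebook's own method: `snapshot_date` is a hypothetical fixed reference date (used instead of `datetime.now()` so that results are reproducible), recency is taken from each customer's most recent purchase, and frequency uses `nunique` as an assumption about how repeated order rows should be handled.

```python
import pandas as pd

# Made-up transactions (hypothetical data; column names follow rfm_data.csv)
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "PurchaseDate": pd.to_datetime([
        "2023-04-11", "2023-06-01", "2023-05-20",
        "2023-04-15", "2023-05-01", "2023-06-10",
    ]),
    "OrderID": [101, 102, 103, 104, 105, 106],
    "TransactionAmount": [100.0, 50.0, 80.0, 20.0, 30.0, 40.0],
})

# Assumption: a fixed reference date instead of datetime.now()
snapshot_date = pd.Timestamp("2023-06-11")

# One row per customer: last purchase date, number of orders, total spend
rfm = tx.groupby("CustomerID").agg(
    LastPurchase=("PurchaseDate", "max"),
    Frequency=("OrderID", "nunique"),
    MonetaryValue=("TransactionAmount", "sum"),
)

# Recency = days between the reference date and the last purchase
rfm["Recency"] = (snapshot_date - rfm["LastPurchase"]).dt.days
rfm = rfm[["Recency", "Frequency", "MonetaryValue"]]
print(rfm)
```

Working on one row per customer also avoids counting a customer several times in later segment counts when they have more than one transaction.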

In [46]:
display(data.head())

Unnamed: 0,CustomerID,PurchaseDate,TransactionAmount,ProductInformation,OrderID,Location,Recency,Frequency,MonetaryValue
0,8814,2023-04-11,943.31,Product C,890075,Tokyo,461,1,943.31
1,2188,2023-04-11,463.7,Product A,176819,London,461,1,463.7
2,4608,2023-04-11,80.28,Product A,340062,New York,461,1,80.28
3,2559,2023-04-11,221.29,Product A,239145,London,461,1,221.29
4,9482,2023-04-11,739.56,Product A,194545,Paris,461,1,739.56


## Calculating RFM Scores: Customer Segmentation

- pd.cut: Creates bins with equal width unless specified otherwise.
- pd.qcut: Creates bins based on quantiles, ensuring that each bin has approximately the same number of observations.

In [47]:
# Define scoring criteria for each RFM value
recency_scores = [5, 4, 3, 2, 1]  # Higher score for lower recency (more recent)
frequency_scores = [1, 2, 3, 4, 5]  # Higher score for higher frequency
monetary_scores = [1, 2, 3, 4, 5]  # Higher score for higher monetary value

# Calculate RFM scores
data['RecencyScore'] = pd.cut(data['Recency'], bins = 5, labels = recency_scores)
data['FrequencyScore'] = pd.cut(data['Frequency'], bins=5, labels=frequency_scores)
data['MonetaryScore'] = pd.cut(data['MonetaryValue'], bins=5, labels=monetary_scores)

display(data.head())

Unnamed: 0,CustomerID,PurchaseDate,TransactionAmount,ProductInformation,OrderID,Location,Recency,Frequency,MonetaryValue,RecencyScore,FrequencyScore,MonetaryScore
0,8814,2023-04-11,943.31,Product C,890075,Tokyo,461,1,943.31,1,1,2
1,2188,2023-04-11,463.7,Product A,176819,London,461,1,463.7,1,1,1
2,4608,2023-04-11,80.28,Product A,340062,New York,461,1,80.28,1,1,1
3,2559,2023-04-11,221.29,Product A,239145,London,461,1,221.29,1,1,1
4,9482,2023-04-11,739.56,Product A,194545,Paris,461,1,739.56,1,1,2
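
To see why the oldest purchases (461 days) receive a recency score of 1, it helps to run `pd.cut` on a handful of values. The sketch below uses made-up recency values spanning the same range as the data; the reversed label list assigns 5 to the lowest-recency bin and 1 to the highest.

```python
import pandas as pd

# Made-up recency values in days (hypothetical), spanning 401 to 461
recency_days = pd.Series([401, 410, 420, 430, 440, 450, 461])

# Five equal-width bins of 12 days each; labels are reversed so that
# the most recent purchases (smallest recency) get the highest score
scores = pd.cut(recency_days, bins=5, labels=[5, 4, 3, 2, 1]).astype(int)

print(scores.tolist())  # [5, 5, 4, 3, 2, 1, 1]
```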


### Side Note: pd.qcut() vs pd.cut()

In [48]:
import pandas as pd

# use qcut with 4 quantiles
made_up_data = [1, 7, 5, 4, 6, 3, 8, 2, 9, 0]
df = pd.DataFrame(made_up_data, columns=['data'])
df['qcut'] = pd.qcut(df['data'], q=4, labels=False)
df['cut'] = pd.cut(df['data'], bins=4, labels=[0, 1, 2, 3])
display(df)


Unnamed: 0,data,qcut,cut
0,1,0,0
1,7,3,3
2,5,2,2
3,4,1,1
4,6,2,2
5,3,1,1
6,8,3,3
7,2,0,0
8,9,3,3
9,0,0,0
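
On the evenly spaced values above, the two functions return identical bins, so the example does not reveal how they differ. A skewed sample makes the difference visible. The sketch below uses made-up values with a single outlier: `pd.cut` puts almost everything in the first equal-width bin, while `pd.qcut` keeps the bins equally populated.

```python
import pandas as pd

# Made-up, skewed data (hypothetical): one large outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width bins over the range 1..100 (each about 24.75 wide)
cut_codes = pd.cut(values, bins=4, labels=False)

# Quantile-based bins: each holds roughly the same number of observations
qcut_codes = pd.qcut(values, q=4, labels=False)

print(cut_codes.tolist())   # [0, 0, 0, 0, 0, 0, 0, 3]
print(qcut_codes.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```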


In [49]:
# Series.info() prints its summary and returns None, so it is not wrapped in display()
data['RecencyScore'].info()
# Convert RFM scores from category to int
data['RecencyScore'] = data['RecencyScore'].astype(int)
data['FrequencyScore'] = data['FrequencyScore'].astype(int)
data['MonetaryScore'] = data['MonetaryScore'].astype(int)
data['RecencyScore'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 1000 entries, 0 to 999
Series name: RecencyScore
Non-Null Count  Dtype
--------------  -----
1000 non-null   category
dtypes: category(1)
memory usage: 1.3 KB

<class 'pandas.core.series.Series'>
RangeIndex: 1000 entries, 0 to 999
Series name: RecencyScore
Non-Null Count  Dtype
--------------  -----
1000 non-null   int64
dtypes: int64(1)
memory usage: 7.9 KB

### RFM Value Segmentation

In [50]:
# Calculate RFM score by combining the scores of Recency, Frequency, and MonetaryValue
data['RFM_Score'] = data['RecencyScore'] + data['FrequencyScore'] + data['MonetaryScore']

# Create RFM segments based on the RFM score
segment_labels = ['Low-Value', 'Mid-Value', 'High-Value']
data['Value Segment'] = pd.qcut(data['RFM_Score'], q = 3, labels = segment_labels)
data

Unnamed: 0,CustomerID,PurchaseDate,TransactionAmount,ProductInformation,OrderID,Location,Recency,Frequency,MonetaryValue,RecencyScore,FrequencyScore,MonetaryScore,RFM_Score,Value Segment
0,8814,2023-04-11,943.31,Product C,890075,Tokyo,461,1,943.31,1,1,2,4,Low-Value
1,2188,2023-04-11,463.70,Product A,176819,London,461,1,463.70,1,1,1,3,Low-Value
2,4608,2023-04-11,80.28,Product A,340062,New York,461,1,80.28,1,1,1,3,Low-Value
3,2559,2023-04-11,221.29,Product A,239145,London,461,1,221.29,1,1,1,3,Low-Value
4,9482,2023-04-11,739.56,Product A,194545,Paris,461,1,739.56,1,1,2,4,Low-Value
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,2970,2023-06-10,759.62,Product B,275284,London,401,1,759.62,5,1,2,8,High-Value
996,6669,2023-06-10,941.50,Product C,987025,New York,401,1,941.50,5,1,2,8,High-Value
997,8836,2023-06-10,545.36,Product C,512842,London,401,1,545.36,5,1,2,8,High-Value
998,1440,2023-06-10,729.94,Product B,559753,Paris,401,1,729.94,5,1,2,8,High-Value


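We first sum the three scores to obtain the overall RFM score, which ranges from 3 to 15. We then use pd.qcut() to split customers into three quantile-based segments by that score. Because many customers share the same score, the segments are not necessarily equal in size.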
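
The two steps can be illustrated on a small made-up table. This is a sketch under the assumption that the scores are already integers from 1 to 5; the values are chosen so that the totals run from 3 to 11 without ties.

```python
import pandas as pd

# Made-up scores (hypothetical), one row per customer
df = pd.DataFrame({
    "RecencyScore":   [1, 2, 3, 4, 5, 5, 5, 5, 5],
    "FrequencyScore": [1, 1, 1, 1, 1, 2, 3, 4, 5],
    "MonetaryScore":  [1, 1, 1, 1, 1, 1, 1, 1, 1],
})

# Step 1: overall RFM score is the plain sum of the three scores
df["RFM_Score"] = df["RecencyScore"] + df["FrequencyScore"] + df["MonetaryScore"]

# Step 2: split the scores into three quantile-based segments
segment_labels = ["Low-Value", "Mid-Value", "High-Value"]
df["Value Segment"] = pd.qcut(df["RFM_Score"], q=3, labels=segment_labels)

print(df[["RFM_Score", "Value Segment"]])
```

With nine distinct totals, each segment receives three customers; with heavily tied scores, as in the real data, the split is less even.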

### Check segment distribution

In [51]:
segment_counts = data['Value Segment'].value_counts().reset_index()
segment_counts.columns = ['Value Segment', 'Count']
# segment_counts.rename(columns = {'count': 'Count'}, inplace = True)
display(segment_counts)

pastel_colors = px.colors.qualitative.Pastel

# Create the bar chart
fig_segment_dist = px.bar(segment_counts, x='Value Segment', y='Count', 
                          color='Value Segment', color_discrete_sequence=pastel_colors,
                          title='RFM Value Segment Distribution')

# Update the layout
fig_segment_dist.update_layout(xaxis_title='RFM Value Segment',
                              yaxis_title='Count',
                              showlegend=False)

# Show the figure
fig_segment_dist.show()

Unnamed: 0,Value Segment,Count
0,Low-Value,435
1,Mid-Value,386
2,High-Value,179








### RFM Customer Segments
Now, we create and analyze RFM Customer Segments, which are broader classifications based on the RFM scores. We define five segments by RFM score: "Champions" (9 or higher), "Potential Loyalists" (6 to 8), "At Risk Customers" (5), "Can't Lose" (4), and "Lost" (3).

In [52]:
# Create a new column for RFM Customer Segments
data['RFM Customer Segments'] = ''

# Assign RFM segments based on the RFM score
data.loc[data['RFM_Score'] >= 9, 'RFM Customer Segments'] = 'Champions'
data.loc[(data['RFM_Score'] >= 6) & (data['RFM_Score'] < 9), 'RFM Customer Segments'] = 'Potential Loyalists'
data.loc[(data['RFM_Score'] >= 5) & (data['RFM_Score'] < 6), 'RFM Customer Segments'] = 'At Risk Customers'
data.loc[(data['RFM_Score'] >= 4) & (data['RFM_Score'] < 5), 'RFM Customer Segments'] = "Can't Lose"
data.loc[(data['RFM_Score'] >= 3) & (data['RFM_Score'] < 4), 'RFM Customer Segments'] = "Lost"

data.head()

Unnamed: 0,CustomerID,PurchaseDate,TransactionAmount,ProductInformation,OrderID,Location,Recency,Frequency,MonetaryValue,RecencyScore,FrequencyScore,MonetaryScore,RFM_Score,Value Segment,RFM Customer Segments
0,8814,2023-04-11,943.31,Product C,890075,Tokyo,461,1,943.31,1,1,2,4,Low-Value,Can't Lose
1,2188,2023-04-11,463.7,Product A,176819,London,461,1,463.7,1,1,1,3,Low-Value,Lost
2,4608,2023-04-11,80.28,Product A,340062,New York,461,1,80.28,1,1,1,3,Low-Value,Lost
3,2559,2023-04-11,221.29,Product A,239145,London,461,1,221.29,1,1,1,3,Low-Value,Lost
4,9482,2023-04-11,739.56,Product A,194545,Paris,461,1,739.56,1,1,2,4,Low-Value,Can't Lose
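
The same thresholds can be written as a single `pd.cut` call with explicit bin edges. This is an alternative sketch, not the approach used in the cell above; it assumes integer scores between 3 and 15 and relies on `pd.cut` bins being right-closed by default.

```python
import pandas as pd

# Made-up RFM scores (hypothetical) covering every segment
rfm_score = pd.Series([3, 4, 5, 6, 8, 9, 15])

# Right-closed bins: (2, 3], (3, 4], (4, 5], (5, 8], (8, 15]
edges = [2, 3, 4, 5, 8, 15]
labels = ["Lost", "Can't Lose", "At Risk Customers", "Potential Loyalists", "Champions"]

segments = pd.cut(rfm_score, bins=edges, labels=labels)
print(segments.tolist())
```

Keeping the edges and labels in two lists makes it easier to check that the ranges neither overlap nor leave gaps.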


## RFM Analysis

### RFM Customer Segments by Value

In [53]:
segment_product_counts = data.groupby(['Value Segment', 'RFM Customer Segments']).size().reset_index(name='Count')
segment_product_counts = segment_product_counts.sort_values(by='Count', ascending=False)
display(segment_product_counts) # Note that 'Value Segment' and 'RFM Customer Segments' are both based on 'RFM_Score'

fig_treemap_segment_product = px.treemap(segment_product_counts, 
                                         path = ['Value Segment', 'RFM Customer Segments'],
                                         values = 'Count',
                                         color = 'Value Segment', color_discrete_sequence = px.colors.qualitative.Pastel,
                                         title = 'RFM Customer Segments by Value')

fig_treemap_segment_product.show()





Unnamed: 0,Value Segment,RFM Customer Segments,Count
9,Mid-Value,Potential Loyalists,386
0,Low-Value,At Risk Customers,180
1,Low-Value,Can't Lose,173
14,High-Value,Potential Loyalists,117
3,Low-Value,Lost,82
12,High-Value,Champions,62
2,Low-Value,Champions,0
4,Low-Value,Potential Loyalists,0
5,Mid-Value,At Risk Customers,0
6,Mid-Value,Can't Lose,0
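
The zero-count rows appear because 'Value Segment' is a categorical column: when grouping with `observed=False`, pandas reports every category, including combinations that never occur. The sketch below, on made-up data, shows how passing `observed=True` keeps only the combinations actually present. Both calls set `observed` explicitly because the default differs between pandas versions.

```python
import pandas as pd

# Made-up data (hypothetical); 'Value Segment' is categorical as in the notebook
df = pd.DataFrame({
    "Value Segment": pd.Categorical(
        ["Low-Value", "Low-Value", "Mid-Value", "High-Value"],
        categories=["Low-Value", "Mid-Value", "High-Value"],
    ),
    "RFM Customer Segments": ["Lost", "Lost", "Potential Loyalists", "Champions"],
})

keys = ["Value Segment", "RFM Customer Segments"]

# observed=False: unobserved category combinations show up with Count 0
all_combos = df.groupby(keys, observed=False).size().reset_index(name="Count")

# observed=True: only combinations that occur in the data
seen_only = df.groupby(keys, observed=True).size().reset_index(name="Count")

print(all_combos)
print(seen_only)
```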








### Distribution of RFM values within the Champions segment

In [54]:
champions_segment = data[data['RFM Customer Segments'] == 'Champions']

fig = go.Figure()
# Pass the scores as y so the boxes are vertical and match yaxis_title below
fig.add_trace(go.Box(y=champions_segment['RecencyScore'], name='Recency'))
fig.add_trace(go.Box(y=champions_segment['FrequencyScore'], name='Frequency'))
fig.add_trace(go.Box(y=champions_segment['MonetaryScore'], name='Monetary Value'))

fig.update_layout(title = 'Distribution of RFM Values for Champions',
                  yaxis_title = 'RFM Value',
                  showlegend = True)

fig.show()

### Correlation of the recency, frequency, and monetary scores within the Champions segment

In [55]:
correlation_matrix = champions_segment[['RecencyScore', 'FrequencyScore', 'MonetaryScore']].corr()
display(correlation_matrix)

# Visualize using a heatmap
fig_heatmap = go.Figure(data=go.Heatmap(
                        z=correlation_matrix.values,
                        x=correlation_matrix.columns,
                        y=correlation_matrix.columns,
                        colorscale='RdBu',
                        colorbar=dict(title='Correlation Coefficient')))
fig_heatmap.update_layout(title='Correlation Matrix of RFM Values for Champions')
fig_heatmap.show()
                   

Unnamed: 0,RecencyScore,FrequencyScore,MonetaryScore
RecencyScore,1.0,-0.571727,-0.474715
FrequencyScore,-0.571727,1.0,0.390657
MonetaryScore,-0.474715,0.390657,1.0
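
`DataFrame.corr()` computes the Pearson correlation by default. The sketch below, on made-up scores, checks that against `numpy.corrcoef` and also computes the Spearman rank correlation, which is a possible alternative for ordinal scores like these (an assumption about what may suit the data, not something the notebook does).

```python
import numpy as np
import pandas as pd

# Made-up scores (hypothetical)
scores = pd.DataFrame({
    "RecencyScore":   [5, 4, 3, 5, 2, 4],
    "FrequencyScore": [2, 3, 4, 1, 5, 3],
    "MonetaryScore":  [3, 3, 4, 2, 5, 2],
})

# Pearson correlation (the pandas default)
pearson = scores.corr()

# Same matrix computed with NumPy; rowvar=False treats columns as variables
manual = np.corrcoef(scores.to_numpy(), rowvar=False)

# Rank-based alternative
spearman = scores.corr(method="spearman")

print(pearson.round(3))
print(spearman.round(3))
```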


### Number of Customers in each RFM Segment

In [56]:
import plotly.colors

pastel_colors = px.colors.qualitative.Pastel
segment_counts = data['RFM Customer Segments'].value_counts()
display(segment_counts)

# Create a bar chart to compare segment counts
fig = go.Figure(data=[go.Bar(x=segment_counts.index, y=segment_counts.values, 
                             marker_color=pastel_colors, text=segment_counts.values, 
                             textposition='outside')])

# Set the color of the Champions segment as a different color
champions_color = 'rgb(158, 202, 225)'
fig.update_traces(marker_color=[champions_color if segment == 'Champions' else pastel_colors[i]
                                for i, segment in enumerate(segment_counts.index)],
                  marker_line_color='rgb(8, 48, 107)',
                  marker_line_width=1.5, opacity=0.6)

# Update the layout
fig.update_layout(title='Comparison of RFM Segments',
                  xaxis_title='RFM Segments',
                  yaxis_title='Number of Customers',
                  showlegend=False)

fig.show()

RFM Customer Segments
Potential Loyalists    503
At Risk Customers      180
Can't Lose             173
Lost                    82
Champions               62
Name: count, dtype: int64

### Recency, Frequency, and Monetary Scores of all the Segments

In [63]:
# Calculate the average Recency, Frequency, and Monetary scores for each segment respectively

# segment_scores = data.groupby('RFM Customer Segments').agg({'RecencyScore': 'mean',
#                                                               'FrequencyScore': 'mean',
#                                                               'MonetaryScore': 'mean'}).reset_index()

# Alternative method
segment_scores = data.groupby('RFM Customer Segments')[['RecencyScore', 'FrequencyScore', 'MonetaryScore']].mean().reset_index()
display(segment_scores) # groupby sorts the segments alphabetically

# Reorder the rows by segment rank: Champions, Potential Loyalists, At Risk Customers, Can't Lose, Lost
segment_scores['RFM Customer Segments'] = pd.Categorical(segment_scores['RFM Customer Segments'], 
                                                         categories=['Champions', 'Potential Loyalists', 'At Risk Customers', "Can't Lose", 'Lost'], 
                                                         ordered=True)

segment_scores = segment_scores.sort_values('RFM Customer Segments')
display(segment_scores)

# Create a grouped bar chart to compare segment scores
fig = go.Figure()

# Add bars for Recency score
fig.add_trace(go.Bar(
    x=segment_scores['RFM Customer Segments'],
    y=segment_scores['RecencyScore'],
    name='Recency Score',
    marker_color='rgb(158,202,225)'
))

# Add bars for Frequency score
fig.add_trace(go.Bar(
    x=segment_scores['RFM Customer Segments'],
    y=segment_scores['FrequencyScore'],
    name='Frequency Score',
    marker_color='rgb(94,158,217)'
))

# Add bars for Monetary score
fig.add_trace(go.Bar(
    x=segment_scores['RFM Customer Segments'],
    y=segment_scores['MonetaryScore'],
    name='Monetary Score',
    marker_color='rgb(32,102,148)'
))

# Update the layout
fig.update_layout(
    title='Comparison of RFM Segments based on Recency, Frequency, and Monetary Scores',
    xaxis_title='RFM Segments',
    yaxis_title='Score',
    barmode='group',
    showlegend=True
)

fig.show()



Unnamed: 0,RFM Customer Segments,RecencyScore,FrequencyScore,MonetaryScore
0,At Risk Customers,2.344444,1.011111,1.644444
1,Can't Lose,1.537572,1.0,1.462428
2,Champions,3.806452,3.064516,3.225806
3,Lost,1.0,1.0,1.0
4,Potential Loyalists,3.918489,1.194831,1.741551


Unnamed: 0,RFM Customer Segments,RecencyScore,FrequencyScore,MonetaryScore
2,Champions,3.806452,3.064516,3.225806
4,Potential Loyalists,3.918489,1.194831,1.741551
0,At Risk Customers,2.344444,1.011111,1.644444
1,Can't Lose,1.537572,1.0,1.462428
3,Lost,1.0,1.0,1.0
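
The reordering between the two tables above comes from converting the segment column to an ordered categorical before sorting. A minimal sketch of that step in isolation, using recency means rounded from the table above:

```python
import pandas as pd

# Desired display order, from best to worst segment
order = ["Champions", "Potential Loyalists", "At Risk Customers", "Can't Lose", "Lost"]

# Alphabetical order, as produced by groupby (values rounded from the output above)
means = pd.DataFrame({
    "RFM Customer Segments": ["At Risk Customers", "Can't Lose", "Champions",
                              "Lost", "Potential Loyalists"],
    "RecencyScore": [2.34, 1.54, 3.81, 1.00, 3.92],
})

# An ordered categorical sorts by category position, not alphabetically
means["RFM Customer Segments"] = pd.Categorical(
    means["RFM Customer Segments"], categories=order, ordered=True
)
ranked = means.sort_values("RFM Customer Segments")

print(ranked)
```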


## Summary
In this case study, we performed RFM analysis on a transaction dataset to understand and segment customers by buying behaviour. We calculated the RFM values (recency, frequency, monetary value), converted them to scores from 1 to 5, and summed the scores into an overall RFM score. We then segmented customers by value (Low, Mid, High) and into five RFM Customer Segments, examined the distribution and correlation of scores within the Champions segment, and compared the size and average scores of each segment. RFM analysis helps businesses understand and segment customers for targeted marketing and personalized customer engagement.