**Goal of the project**

Customer segmentation gives us insight into how different groups of customers behave, so we can tailor our marketing to each group and improve results.

The RFM (recency, frequency, monetary value) model is probably one of the best-known and most widely used customer segmentation models among data-driven marketers. It’s used both to measure customer value and to predict future customer behaviour, and has been in regular use since the 1970s, particularly in catalogue marketing.

Despite its age, RFM analysis is still widely used today and remains the subject of ongoing research and development aimed at improving business performance.

In this project, I’ll use a Python package called EcommerceTools, which lets us segment customers using RFM.

**Load the packages**

In [74]:
# Importing libraries
import pandas as pd
from ecommercetools import utilities
from ecommercetools import transactions
from ecommercetools import customers

**Load the data**

For this project I’ve used a [standard transactional dataset](https://www.kaggle.com/datasets/marian447/retail-store-sales-transactions) from Kaggle. This anonymized dataset includes 64,682 transactions covering 5,242 SKUs sold to 22,625 customers over one year.

In [75]:
# Load dataset
filename = '../input/retail-store-sales-transactions/scanner_data.csv'
df = pd.read_csv(filename)

In [76]:
# Examine the data
df.head()

Unnamed: 0.1,Unnamed: 0,Date,Customer_ID,Transaction_ID,SKU_Category,SKU,Quantity,Sales_Amount
0,1,02/01/2016,2547,1,X52,0EM7L,1.0,3.13
1,2,02/01/2016,822,2,2ML,68BRQ,1.0,5.46
2,3,02/01/2016,3686,3,0H2,CZUZX,1.0,6.35
3,4,02/01/2016,3719,4,0H2,549KK,1.0,5.59
4,5,02/01/2016,9200,5,0H2,K8EHH,1.0,6.88


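EcommerceTools requires standardised column names in our transaction items dataframe, so we reload the file with load_transaction_items() and tell it which of our columns maps to each standard name.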

In [77]:
# Read the CSV again, mapping our column names to the ones EcommerceTools expects
transaction_items_df = utilities.load_transaction_items(
    filename,
    date_column='Date',
    order_id_column='Transaction_ID',
    customer_id_column='Customer_ID',
    sku_column='SKU',
    quantity_column='Quantity',
    unit_price_column='Sales_Amount',
)

transaction_items_df.head()

Unnamed: 0.1,Unnamed: 0,order_date,customer_id,order_id,SKU_Category,sku,quantity,unit_price,line_price
0,1,2016-02-01,2547,1,X52,0EM7L,1.0,3.13,3.13
1,2,2016-02-01,822,2,2ML,68BRQ,1.0,5.46,5.46
2,3,2016-02-01,3686,3,0H2,CZUZX,1.0,6.35,6.35
3,4,2016-02-01,3719,4,0H2,549KK,1.0,5.59,5.59
4,5,2016-02-01,9200,5,0H2,K8EHH,1.0,6.88,6.88
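
One thing worth checking in this output: the raw Date value 02/01/2016 has become 2016-02-01, while the customer table further down contains dates such as 2016-01-22. That pattern suggests the file stores dates day-first and that ambiguous values were read month-first. The following is a minimal sketch, not part of EcommerceTools, of parsing such values with an explicit format. The day-first assumption should be verified against the dataset before relying on it.

```python
import pandas as pd

# Two raw values in the style of the Date column (ASSUMED to be day/month/year).
raw_dates = pd.Series(["02/01/2016", "22/01/2016"])

# An explicit format stops pandas from guessing month-first for ambiguous values.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

If the assumption holds, the Date column could be converted this way before the data is handed to the package.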


**Create a transactions dataset**

Next, we take our dataframe of transaction items and create a dataframe of transactions by passing it to the get_transactions() function. This aggregates the data by order_id and returns, for each order, the number of SKUs and items, the total revenue, a flag showing whether it was a replacement, and its order number. A value of 1 in the order_number column denotes an acquisition (the customer’s first order); anything higher is an order from a returning customer.

In [78]:
transactions_df = transactions.get_transactions(transaction_items_df)
transactions_df.head()

Unnamed: 0,order_id,order_date,customer_id,skus,items,revenue,replacement,order_number
0,1,2016-02-01,2547,1,1.0,3.13,0,2
1,2,2016-02-01,822,1,1.0,5.46,0,2
2,3,2016-02-01,3686,1,1.0,6.35,0,5
3,4,2016-02-01,3719,1,1.0,5.59,0,2
4,5,2016-02-01,9200,1,1.0,6.88,0,1
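
To make the aggregation concrete, here is a rough pandas sketch of the same idea on a tiny made-up dataframe. It is not the EcommerceTools implementation: the sample rows are invented, and the replacement flag is left out because the text above does not say how it is derived.

```python
import pandas as pd

# Invented sample in the standardised column layout shown earlier.
items = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "order_date": pd.to_datetime(["2016-01-02", "2016-01-02", "2016-01-05", "2016-02-10"]),
    "customer_id": [10, 10, 20, 10],
    "sku": ["A", "B", "A", "C"],
    "quantity": [1.0, 2.0, 1.0, 3.0],
    "line_price": [3.0, 8.0, 3.0, 9.0],
})

# One row per order: distinct SKUs, total items and total revenue.
orders = (
    items.groupby(["order_id", "order_date", "customer_id"], as_index=False)
    .agg(skus=("sku", "nunique"), items=("quantity", "sum"), revenue=("line_price", "sum"))
    .sort_values(["customer_id", "order_date", "order_id"])
)

# Number each customer's orders chronologically: 1 = acquisition, 2+ = repeat order.
orders["order_number"] = orders.groupby("customer_id").cumcount() + 1

print(orders)
```

Sorting by customer and date before numbering is what makes order_number reflect purchase sequence rather than row order.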


**Create a customer dataset**

Next, we can create a dataframe of customers using the get_customers() function. This also takes the dataframe of transaction items and returns a customer-level dataset containing each customer’s total spend, number of orders, number of SKUs and items purchased, first and last order dates, average items per order, average order value, tenure, recency and cohort.

In [79]:
customers_df = customers.get_customers(transaction_items_df)
customers_df.head()

Unnamed: 0,customer_id,revenue,orders,skus,items,first_order_date,last_order_date,avg_items,avg_order_value,tenure,recency,cohort
0,1,16.29,1,1,2.0,2016-01-22,2016-01-22,2.0,16.29,2391,2391,20161
1,2,22.77,2,1,2.0,2016-03-24,2016-06-19,1.0,11.38,2329,2242,20161
2,3,19.08,1,1,4.0,2016-02-01,2016-02-01,4.0,19.08,2381,2381,20161
3,4,33.29,2,2,5.0,2016-09-11,2016-11-07,2.5,16.64,2158,2101,20163
4,5,248.27,5,1,14.0,2016-02-22,2016-09-02,2.8,49.65,2360,2167,20161
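
The recency and tenure values above run to more than 2,000 days for orders placed in 2016, which suggests they are measured from the date the notebook was run rather than from the end of the dataset. A sketch of the underlying calculation with an explicit reference date might look like this. The as_of variable is hypothetical and is not an EcommerceTools parameter, and the sample orders are invented.

```python
import pandas as pd

# Invented order-level data: one row per order.
orders = pd.DataFrame({
    "customer_id": [10, 10, 20],
    "order_date": pd.to_datetime(["2016-01-02", "2016-02-10", "2016-01-05"]),
    "revenue": [11.0, 9.0, 3.0],
})

# Hypothetical reference date: the day after the last order in the sample.
as_of = orders["order_date"].max() + pd.Timedelta(days=1)

customer_metrics = orders.groupby("customer_id").agg(
    revenue=("revenue", "sum"),
    orders=("order_date", "count"),
    first_order_date=("order_date", "min"),
    last_order_date=("order_date", "max"),
)
customer_metrics["avg_order_value"] = customer_metrics["revenue"] / customer_metrics["orders"]
# Tenure: days since the first order. Recency: days since the most recent order.
customer_metrics["tenure"] = (as_of - customer_metrics["first_order_date"]).dt.days
customer_metrics["recency"] = (as_of - customer_metrics["last_order_date"]).dt.days

print(customer_metrics)
```

Because RFM scoring usually ranks customers against each other, a shifted reference date changes the raw numbers but not their order.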


**Behavioural segmentation using RFM and heterogeneity**

Now that we have this customer-level dataset containing the raw recency, frequency and monetary value data for each customer, we can pass it to the get_rfm_segments() function. This creates a behavioural segmentation of our customer base.

The function returns a few duplicated columns for reference purposes, plus the individual R, F, and M scores (and an H score for heterogeneity), the combined RFM label (from 111 to 555), the RFM score (the sum of the three scores, from 3 to 15), and a label giving the segment name. The segmentation groups customers with common characteristics so they can be targeted with marketing campaigns aimed at influencing their behaviour.

In [80]:
rfm_df = customers.get_rfm_segments(customers_df)
rfm_df.head()

Unnamed: 0,customer_id,acquisition_date,recency_date,recency,frequency,monetary,heterogeneity,tenure,r,f,m,h,rfm,rfm_score,rfm_segment_name
0,1,2016-01-22,2016-01-22,2391,1,16.29,1,2391,1,1,1,1,111,3,Risky
1,3,2016-02-01,2016-02-01,2381,1,19.08,1,2381,1,1,1,1,111,3,Risky
2,9,2016-03-20,2016-03-20,2333,1,15.75,1,2333,1,1,1,1,111,3,Risky
3,11,2016-01-29,2016-01-29,2384,1,6.35,1,2384,1,1,1,1,111,3,Risky
4,18,2016-01-20,2016-01-20,2393,1,1.9,1,2393,1,1,1,1,111,3,Risky
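
The scoring itself can be illustrated with a short sketch. It assumes quintile binning with pd.qcut, which is a common way to produce scores from 1 to 5. The text above does not show how EcommerceTools assigns its scores, so treat the binning and the invented sample values as assumptions.

```python
import pandas as pd

# Invented raw values for ten customers.
rfm = pd.DataFrame({
    "customer_id": range(1, 11),
    "recency": [5, 12, 30, 45, 60, 90, 120, 200, 300, 365],   # days since last order
    "frequency": [12, 9, 8, 6, 5, 4, 3, 2, 1, 1],             # number of orders
    "monetary": [900, 700, 650, 400, 300, 250, 120, 80, 40, 15],
})

# Lower recency is better, so the labels run from 5 down to 1.
rfm["r"] = pd.qcut(rfm["recency"], q=5, labels=[5, 4, 3, 2, 1]).astype(int)
# Ranking first breaks ties, so qcut always finds five distinct bin edges.
rfm["f"] = pd.qcut(rfm["frequency"].rank(method="first"), q=5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["m"] = pd.qcut(rfm["monetary"].rank(method="first"), q=5, labels=[1, 2, 3, 4, 5]).astype(int)

# Combined label (111 to 555) and summed score (3 to 15).
rfm["rfm"] = rfm["r"].astype(str) + rfm["f"].astype(str) + rfm["m"].astype(str)
rfm["rfm_score"] = rfm[["r", "f", "m"]].sum(axis=1)

print(rfm[["customer_id", "r", "f", "m", "rfm", "rfm_score"]])
```

In this sample the most recent, most frequent and highest-spending customer gets the label 555 and a score of 15, while the least engaged gets 111 and a score of 3.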


Next, let’s look at how to interpret the RFM segments, what they tell us about customer behaviour, and which marketing strategies suit each one.

**Creating the customer segments**

Now that everything is set up, we can use groupby() to examine each segment and agg() to calculate summary statistics: the size of each segment and the mean recency, frequency and monetary value within it.

In [81]:
rfm_df.groupby('rfm_segment_name').agg(
    customers=('customer_id', 'count'),
    recency=('recency', 'mean'),
    frequency=('frequency', 'mean'),
    monetary=('monetary', 'mean'),
).round(1).sort_values(by='recency')

rfm_segment_name,customers,recency,frequency,monetary
Star,191,2057.9,29.1,1783.5
Loyal,11391,2099.1,3.6,153.7
Potential loyal,3565,2209.6,1.9,72.8
Hold and improve,3712,2287.6,1.6,54.3
Risky,3766,2365.3,1.3,55.2
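
Because the table reports a customer count and a mean spend per segment, a rough revenue contribution per segment can be estimated as count times mean. The sketch below does that arithmetic in plain Python using the figures from the table. The results are approximate because the means are rounded to one decimal place.

```python
# (customers, mean monetary value) per segment, copied from the table above.
segments = {
    "Star": (191, 1783.5),
    "Loyal": (11391, 153.7),
    "Potential loyal": (3565, 72.8),
    "Hold and improve": (3712, 54.3),
    "Risky": (3766, 55.2),
}

total_customers = sum(count for count, _ in segments.values())
est_revenue = {name: count * mean for name, (count, mean) in segments.items()}
total_revenue = sum(est_revenue.values())

for name, (count, _) in segments.items():
    customer_share = count / total_customers
    revenue_share = est_revenue[name] / total_revenue
    print(f"{name:<17} {customer_share:6.1%} of customers, ~{revenue_share:6.1%} of revenue")
```

The segment counts sum to 22,625, matching the customer total quoted for the dataset. On these estimates, Star customers are under 1% of the base but account for roughly 12% of revenue.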


**Analyzing RFM segmentation**

Let’s look at what characterises each segment:

* **Star customers:** Bought recently, buy often and spend the most.

* **Loyal customers:** Buy on a regular basis. Responsive to promotions.

* **Potential loyal:** Recent customers with average frequency.

* **Hold and improve:** Below average recency and frequency. Will lose them if not reactivated.

* **Risky:** Some time since they’ve purchased. Need to bring them back!

For each of the segments, we could design appropriate actions, for example:

* **Star customers:** Reward them. They can become evangelists and early adopters of new products.

* **Loyal customers:** Up-sell higher value products. Engage them. Ask for reviews.

* **Potential loyal:** Recommend other products. Engage in loyalty programs.

* **Hold and improve:** Reactivate them. Share valuable resources. Recommend popular products. Offer discounts.

* **Risky:** Send personalised emails or other messages to reconnect. Provide good offers and share valuable resources.
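
If these actions are to be applied programmatically, one option is a simple lookup keyed on the values in rfm_segment_name. The mapping below is a sketch: the keys follow the segment names in the summary table, the action strings are shortened versions of the suggestions above, and the action_for helper and its fallback text are hypothetical.

```python
# Segment name (as it appears in rfm_segment_name) -> suggested marketing action.
SEGMENT_ACTIONS = {
    "Star": "Reward them; invite them to try new products early",
    "Loyal": "Up-sell higher-value products; ask for reviews",
    "Potential loyal": "Recommend other products; promote the loyalty program",
    "Hold and improve": "Reactivate with popular products and discounts",
    "Risky": "Send personalised messages with good offers to reconnect",
}


def action_for(segment_name: str) -> str:
    """Return the suggested action, or a fallback for unknown segment names."""
    return SEGMENT_ACTIONS.get(segment_name, "Review manually")


print(action_for("Star"))
```

With the dataframe from earlier, the same lookup could be attached as a column via rfm_df['rfm_segment_name'].map(SEGMENT_ACTIONS).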
