# Customer Segmentation using Recency, Frequency, and Monetary Value (RFM)

 - Recency (R) is the number of days since the customer's last purchase
 - Frequency (F) is the number of purchases the customer made in the last 12 months
 - Monetary Value (M) is how much the customer spent in the last 12 months
 - Each RFM metric can be grouped into percentile-based bins (quartiles in this notebook)
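To make the three definitions concrete, here is a minimal sketch on a hypothetical three-row transaction table. The `toy` frame, the `Amount` column, and the `snapshot` date are assumptions made for illustration; they are not part of the dataset analyzed below.

```python
import pandas as pd

# Hypothetical toy transactions (not the real dataset)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "InvoiceDate": pd.to_datetime(["2011-11-01", "2011-12-01", "2011-06-15"]),
    "Amount": [20.0, 30.0, 5.0],
})
snapshot = pd.Timestamp("2011-12-10")  # assumed reference date for the analysis

toy_rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=("InvoiceDate", "count"),                            # number of purchases
    Monetary=("Amount", "sum"),                                    # total amount spent
)
print(toy_rfm)
```

Customer 1 bought recently and twice, so they get a low Recency and a higher Frequency than customer 2.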

# Goals

  ## 1. Calculate Recency, Frequency, and Monetary Value
  ## 2. Building Recency, Frequency, Monetary segments
  ## 3. Analyze RFM Segments

# Import Modules

In [21]:
# Data Manipulation Libraries: Standard dataframes and array libraries
import pandas as pd
import numpy as np
from pandas import ExcelWriter
from pandas import ExcelFile
# from datetime import datetime
import datetime as dt

# Data Visualization Libraries:
import matplotlib.pyplot as plt
import seaborn as sns

# ignore all future warnings
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

# Display plots inline in the Jupyter notebook
%matplotlib inline
# Displaying pandas columns and rows
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Import Data

In [2]:
# import data
df = pd.read_excel("filepath/online_retail.xlsx", sheet_name="Online Retail")

# Clean Data

 - Inspect Datatypes
 - Drop missing values in key column
 - Change datatypes as needed

In [22]:
df.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object

In [23]:
df.describe()

| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 406829 | 406829 | 406829 |
| mean | 12 | 3 | 15288 |
| std | 249 | 69 | 1714 |
| min | -80995 | 0 | 12346 |
| 25% | 2 | 1 | 13953 |
| 50% | 5 | 2 | 15152 |
| 75% | 12 | 4 | 16791 |
| max | 80995 | 38970 | 18287 |


### <font color="blue">Note: </font>Many missing values in the <code>CustomerID</code> Column

In [12]:
# Drop rows that have missing customerID values
df = df.dropna(subset=['CustomerID'])
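Before dropping rows, it can help to count how many are affected. The sketch below uses a hypothetical four-row frame (`toy`) rather than the real data, so the counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing customer IDs
toy = pd.DataFrame({
    "CustomerID": [17850.0, np.nan, 13047.0, np.nan],
    "Quantity": [6, 1, 8, 2],
})

n_missing = toy["CustomerID"].isna().sum()     # rows without a customer ID
cleaned = toy.dropna(subset=["CustomerID"])    # keep only rows with a customer ID

print(n_missing, len(cleaned))
```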

In [24]:
# Convert the CustomerID column from float to integer. astype returns a new
# DataFrame, so assign the result back before inspecting the dtypes
df = df.astype({'CustomerID': 'int64'})
df.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID              int64
Country                object
dtype: object
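`DataFrame.astype` returns a new DataFrame instead of modifying the original, so the conversion only persists when the result is assigned back. A small sketch on a hypothetical two-row frame shows the difference:

```python
import pandas as pd

# Hypothetical frame with float customer IDs
toy = pd.DataFrame({"CustomerID": [17850.0, 13047.0]})

toy.astype({"CustomerID": "int64"})          # result discarded, toy is unchanged
unchanged = str(toy["CustomerID"].dtype)

toy = toy.astype({"CustomerID": "int64"})    # result assigned back
converted = str(toy["CustomerID"].dtype)

print(unchanged, converted)  # float64 int64
```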

# <font color="blue">Part I: Calculate Recency, Frequency, and Monetary Value</font>

 - Step 1: Define time period and filter data accordingly
 - Step 2: Calculate the sales revenue for each transaction
 - Step 3: Calculate the Recency, Frequency, Monetary Value for a specific day

# Step 1: Filter Data to Specific Time Period

In [14]:
# Create a subset of the dataframe that is filtered for most recent year of activity
subset_df = df[df['InvoiceDate']>'2010-12-10'].copy()

In [15]:
# Confirm subset dates
print('Min: {}; Max: {}'.format(min(subset_df.InvoiceDate),
                              max(subset_df.InvoiceDate)))

Min: 2010-12-10 09:33:00; Max: 2011-12-09 12:50:00
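The cutoff above is hard-coded. As an alternative sketch, the start of the window can be derived from the latest invoice with `pd.DateOffset`. The `toy` frame is hypothetical, and this variant yields a slightly different cutoff (exactly one year before the last invoice) than the fixed date used in this notebook.

```python
import pandas as pd

# Hypothetical invoice dates
toy = pd.DataFrame({"InvoiceDate": pd.to_datetime([
    "2010-12-01 08:26", "2010-12-10 09:33", "2011-12-09 12:50"])})

last = toy["InvoiceDate"].max()
start = last - pd.DateOffset(years=1)        # one year before the latest invoice
recent = toy[toy["InvoiceDate"] > start]     # keep the trailing 12 months

print(start, len(recent))
```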


# Step 2: Calculate the Sales Revenue per Transaction

 - This is calculated by <code>Quantity</code> * <code>UnitPrice</code>

In [25]:
# Create a sales revenue column named TotalSum
subset_df["TotalSum"] = subset_df["Quantity"]*subset_df["UnitPrice"]

In [27]:
# Inspect updated dataframe
subset_df.head(3)

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | TotalSum |
|---|---|---|---|---|---|---|---|---|---|
| 22523 | 538172 | 21562 | HAWAIIAN GRASS SKIRT | 12 | 2010-12-10 09:33:00 | 1 | 15805 | United Kingdom | 15 |
| 22524 | 538172 | 79321 | CHILLI LIGHTS | 8 | 2010-12-10 09:33:00 | 5 | 15805 | United Kingdom | 40 |
| 22525 | 538172 | 22041 | RECORD FRAME 7" SINGLE SIZE | 12 | 2010-12-10 09:33:00 | 3 | 15805 | United Kingdom | 31 |


# Step 3: Calculate the Recency, Frequency, and Monetary Metrics for a snapshot date in dataset

In [28]:
# Create snapshot_date: one day after the most recent invoice in the subset
snapshot_date = max(subset_df.InvoiceDate) + dt.timedelta(days=1)

In [29]:
# Display date
snapshot_date

Timestamp('2011-12-10 12:50:00')

In [36]:
# Aggregate per customer: days between snapshot_date and the last invoice (recency),
# number of invoice rows (frequency), and total revenue (monetary value)
rfm_data = subset_df.groupby(["CustomerID"]).agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo' : 'count',
    'TotalSum' : 'sum'})

In [37]:
# Inspect data
rfm_data.head(3)

| CustomerID | InvoiceDate | InvoiceNo | TotalSum |
|---|---|---|---|
| 12346 | 326 | 2 | 0 |
| 12347 | 2 | 151 | 3598 |
| 12348 | 75 | 31 | 1797 |


#### Notice that InvoiceDate is no longer a date datatype, but an integer that represents the number of days between the customer's last invoice and the snapshot date
 - CustomerID 12346 hasn't made a purchase in almost a year, has had two transactions, and has a net spend of $0.00

In [39]:
# Inspecting this customer we see that they made a purchase
# But then received a refund
subset_df[subset_df["CustomerID"] == 12346]

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | TotalSum |
|---|---|---|---|---|---|---|---|---|---|
| 61619 | 541431 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 74215 | 2011-01-18 10:01:00 | 1 | 12346 | United Kingdom | 77184 |
| 61624 | C541433 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | -74215 | 2011-01-18 10:17:00 | 1 | 12346 | United Kingdom | -77184 |
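The recency of 326 days for this customer can be reproduced by hand from the two timestamps shown in this notebook: the snapshot date and the customer's last invoice. The sketch below uses only the standard library; the variable names are illustrative.

```python
import datetime as dt

snapshot = dt.datetime(2011, 12, 10, 12, 50)      # snapshot_date displayed earlier
last_invoice = dt.datetime(2011, 1, 18, 10, 17)   # customer 12346's last invoice (the refund)

# .days keeps only whole days, which is what the aggregation lambda does
recency_days = (snapshot - last_invoice).days
print(recency_days)  # 326
```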


In [40]:
# Rename columns for easier interpretation
rfm_data.rename(columns = {'InvoiceDate' : 'Recency',
                          'InvoiceNo' : 'Frequency',
                          'TotalSum': 'Monetary Value'}, inplace=True)

In [33]:
# Inspect the relabeled data
rfm_data.head(3)

| CustomerID | Recency | Frequency | Monetary Value |
|---|---|---|---|
| 12346 | 326 | 2 | 0 |
| 12347 | 2 | 151 | 3598 |
| 12348 | 75 | 31 | 1797 |
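One caveat about the Frequency column: aggregating `InvoiceNo` with `'count'` counts rows, and an invoice with several products spans several rows. If distinct invoices are the quantity of interest, `'nunique'` is one option. The sketch below compares the two on a hypothetical four-row frame; it is not a change to the analysis in this notebook.

```python
import pandas as pd

# Hypothetical line items: customer 1 has two invoices, one of them with two products
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
})

freq = toy.groupby("CustomerID")["InvoiceNo"].agg(["count", "nunique"])
print(freq)
```

For customer 1, `count` reports 3 rows while `nunique` reports 2 invoices.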


# <font color="blue">Part II: Building Recency, Frequency, Monetary segments</font>

 - Step 1: Define quartile ranges for recency, frequency, and monetary segments
 - Step 2: Assign each instance in dataframe a quartile value
 - Step 3: Build and Assign an RFM Segment and RFM Score to each instance in dataframe

# Step 1: Define Quartile Range

 - Sort customers based on metric
 - Break customers into a pre-defined number of groups of equal size
 - Assign a label to each group
 - Method used in this post will rely on percentiles
 <br><br>
 <b>Pandas Methods</b>
 - <code>pd.qcut()</code> discretizes a variable into equal-sized buckets based on rank or on sample quantiles.
 - <code>pandas.qcut(x, q, labels=None, retbins: bool = False, precision: int = 3, duplicates: str = 'raise')</code>
 - Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
 <br><br>
 - Each instance will be assigned a quartile based on its value using the <code>dataframe.assign()</code> method.
 - <code>DataFrame.assign(self, **kwargs)</code>
 - Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html
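As a quick illustration of how `pd.qcut()` maps values to quartile labels, the sketch below bins eight hypothetical values into four groups, once with ascending labels and once with descending labels (the ordering used for Recency below).

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])  # hypothetical metric values

# Ascending labels: larger values get larger labels
ascending = pd.qcut(values, 4, labels=list(range(1, 5)))
# Descending labels: smaller values get larger labels
descending = pd.qcut(values, 4, labels=list(range(4, 0, -1)))

print(ascending.tolist())   # [1, 1, 2, 2, 3, 3, 4, 4]
print(descending.tolist())  # [4, 4, 3, 3, 2, 2, 1, 1]
```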

## Step 1a: Define Recency quartile labels

In [41]:
# The recency column will be divided into four equal-sized bins, labeled from 4 down to 1
# (the most recent customers receive a 4)
r_labels = range(4,0,-1)
r_labels

range(4, 0, -1)

## Step 1b: Create an array using the pd.qcut method

In [46]:
# Create a series that contains the quartile assignment for each instance in the dataset 'Recency' column
r_quartiles = pd.qcut(rfm_data["Recency"], 4, labels=r_labels)

In [47]:
# Inspect subset of data
r_quartiles.head()

CustomerID
12346    1
12347    4
12348    2
12349    3
12350    1
Name: Recency, dtype: category
Categories (4, int64): [4 < 3 < 2 < 1]

## Step 1c: Add Recency quartile assignment to dataset using the dataframe.assign() method

In [50]:
# Add Recency quartile assignment to dataset using the dataframe.assign() method
rfm_data = rfm_data.assign(R = r_quartiles.values)

In [51]:
# Inspect subset of data with the new R quartile column
rfm_data.head()

| CustomerID | Recency | Frequency | Monetary Value | R |
|---|---|---|---|---|
| 12346 | 326 | 2 | 0 | 1 |
| 12347 | 2 | 151 | 3598 | 4 |
| 12348 | 75 | 31 | 1797 | 2 |
| 12349 | 19 | 73 | 1758 | 3 |
| 12350 | 310 | 17 | 334 | 1 |


## Repeat Steps 1a-1c for Frequency and Monetary Values

 - The quartile labels for these two metrics will be in ascending order
 - Higher values earn higher labels, so the ranking runs 1 < 2 < 3 < 4 (4 is the best)

### Step 1a: Define quartile labels

In [54]:
# Frequency labels
f_labels = range(1,5)

In [55]:
# Monetary labels
m_labels = range(1,5)

### Step 1b: Create arrays using the pd.qcut method

In [57]:
# Assign a frequency quantile to each instance
f_quantiles = pd.qcut(rfm_data["Frequency"], 4, labels=f_labels)

In [59]:
# Assign a monetary value quantile to each instance
m_quantiles = pd.qcut(rfm_data["Monetary Value"], 4, labels=m_labels)
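`pd.qcut()` raises a `ValueError` when heavy ties make two bin edges identical, which can happen with skewed count data. That did not occur in this notebook, but one common workaround is to rank the values with `method="first"` before binning. The sketch below demonstrates this on a hypothetical tied series.

```python
import pandas as pd

freq = pd.Series([1, 1, 1, 1, 1, 2, 3, 4])  # hypothetical counts with many ties

try:
    pd.qcut(freq, 4, labels=[1, 2, 3, 4])
    raised = False
except ValueError:
    raised = True  # duplicate bin edges

# Ranking breaks ties by order of appearance, so the edges become unique
ranked = freq.rank(method="first")
f_scores = pd.qcut(ranked, 4, labels=[1, 2, 3, 4])

print(raised, f_scores.tolist())
```

The trade-off is that tied customers can land in different quartiles purely because of their row order.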

### Step 1c: Add Frequency and Monetary quartile assignment to dataset using the dataframe.assign() method

In [60]:
# Assign a frequency quantile to each instance
rfm_data = rfm_data.assign(F = f_quantiles.values)

In [61]:
# Assign a monetary value quantile to each instance
rfm_data = rfm_data.assign(M = m_quantiles.values)

In [63]:
# Inspect updated dataframe
rfm_data.head(3)

| CustomerID | Recency | Frequency | Monetary Value | R | F | M |
|---|---|---|---|---|---|---|
| 12346 | 326 | 2 | 0 | 1 | 1 | 1 |
| 12347 | 2 | 151 | 3598 | 4 | 4 | 4 |
| 12348 | 75 | 31 | 1797 | 2 | 2 | 4 |


### <font color="blue">Inspect Data: </font>

<b>Recency Quartile (last invoice date from snapshot date) <code>R</code>: </b>
 - 1 is the lowest quartile ranking for this metric
 - Customer 12346 was assigned a "1" because they have not made a purchase in nearly a year
 - Customer 12347 was assigned a "4" because they recently made a purchase (2 days before the snapshot date)

<b>Frequency Quartile (number of invoice rows per customer) <code>F</code>: </b>
 - 4 is the highest quartile ranking for this metric
 - Customer 12346 was assigned a "1" because they had only two transaction rows
 - Customer 12347 was assigned a "4" because they had more than 150 transaction rows
 
<b>Monetary Value Quartile (total amount spent) <code>M</code>: </b>
 - 4 is the highest quartile ranking for this metric
 - Customer 12346 was assigned a "1" because their net spend was zero: the purchase was fully refunded.
 - Customer 12347 was assigned a "4" because they spent a high amount of $3598.00

# Step 2: Build RFM Segment and RFM Score

 - Final step in the RFM Segmentation pipeline
 - The RFM Segment is a concatenation of the RFM quartile values intended to make it easier to inspect
 - The RFM Score is a sum of these values, and the higher an RFM Score, the more active a customer is for a company.
 
<b>Steps: </b>
 - Step 2a: Create a function that concatenates the 'R', 'F', 'M' values
 - Step 2b: Add an <code>RFM_Segment</code> column to the dataframe by using the <code>dataframe.apply()</code> method.<br>
    - This method will apply the concatenation function to each row in the dataframe
    - Then assign that value to the newly created <code>RFM_Segment</code> column for that instance.
 - Step 2c: Add an <code>RFM_Score</code> column to the dataframe by using the <code>dataframe.sum()</code> method.
    - This method will sum the 'R', 'F', 'M' values in each row
    - Then assign that value to the newly created <code>RFM_Score</code> column for that instance.

In [67]:
# Step 2a: Function that concatenates the R, F, M values
def join_rfm(x):
    return str(x['R']) + str(x['F']) + str(x['M'])

In [68]:
# Step 2b: Applies function to dataframe and creates a new column
rfm_data['RFM_Segment'] = rfm_data.apply(join_rfm, axis=1)

In [69]:
# Step 2c: Creates a new column that contains the sum of the R, F, M values
rfm_data['RFM_Score'] = rfm_data[['R','F','M']].sum(axis=1)
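Steps 2a to 2c can also be written without a row-wise `apply`. The sketch below is an alternative under stated assumptions, not the approach used in this notebook. It rebuilds three rows of R, F, M labels in a hypothetical `toy` frame and converts the categorical labels to integers first, because summing categorical columns may not be supported in every pandas version.

```python
import pandas as pd

# Hypothetical frame holding categorical quartile labels for three customers
toy = pd.DataFrame(
    {"R": [1, 4, 2], "F": [1, 4, 2], "M": [1, 4, 4]},
    index=[12346, 12347, 12348],
).astype("category")

scores = toy[["R", "F", "M"]].astype(int)  # categorical labels -> plain integers

# Concatenate the three labels as strings, column-wise
toy["RFM_Segment"] = (
    scores["R"].astype(str) + scores["F"].astype(str) + scores["M"].astype(str)
)
# Sum the three labels per row
toy["RFM_Score"] = scores.sum(axis=1)

print(toy)
```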

In [70]:
# Inspect data
rfm_data.head(3)

| CustomerID | Recency | Frequency | Monetary Value | R | F | M | RFM_Segment | RFM_Score |
|---|---|---|---|---|---|---|---|---|
| 12346 | 326 | 2 | 0 | 1 | 1 | 1 | 111 | 3 |
| 12347 | 2 | 151 | 3598 | 4 | 4 | 4 | 444 | 12 |
| 12348 | 75 | 31 | 1797 | 2 | 2 | 4 | 224 | 8 |


### <font color="blue">Inspect Data: </font>

<b><code>RFM_Segment</code>: </b>
 - This is just a concatenation of three column values
 - Customer 12346 was assigned a "111" because they have not made a purchase in nearly a year, have had few transactions, and have spent very little
 - Customer 12347 was assigned a "444" because they have made a recent purchase, have had many transactions, and have spent a large amount.
 
<b><code>RFM_Score</code>: </b>
 - Sum of the R, F, M values
 - The min score is 3, the max is 12
 - A score of 3 indicates a customer who is not very active
 - A score of 12 indicates a very active customer
 - Customer 12346 was assigned a "3" because they scored 1 on each RFM value. They are not an active customer.
 - Customer 12347 was assigned a "12" because they scored 4 on each RFM value. They are a very active customer.

# <font color="blue">Part III: Analyze RFM segments</font>

 - a) Identify the largest RFM Segment
 - b) Display summary metrics per RFM Score
 - c) Group into named segments

# a) Identify the largest RFM Segments

<b>Use:</b>
 - <code>dataframe.groupby()</code>: Group the DataFrame using a mapper or by a Series of columns
 - <code>GroupBy.size()</code>: Return the number of rows in each group as a Series (here, the number of customers per segment)
 - <code>Series.sort_values()</code>: Sort the Series by its values

In [71]:
# Count customers per RFM segment with .size(), then show the ten largest segments
rfm_data.groupby('RFM_Segment').size().sort_values(ascending=False)[:10]

RFM_Segment
444    462
111    389
122    199
344    197
211    176
222    170
333    168
233    158
433    154
322    118
dtype: int64

### <font color="blue">Inspect Data: </font>
 - 444 has the highest number of customers at 462
 - 111 has the 2nd highest number of customers at 389
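The same ranking can be obtained with `Series.value_counts()`, which counts and sorts in one call. The sketch below compares both routes on a hypothetical list of segment labels.

```python
import pandas as pd

# Hypothetical segment labels for six customers
segments = pd.Series(["444", "111", "444", "122", "444", "111"], name="RFM_Segment")

by_size = segments.groupby(segments).size().sort_values(ascending=False)
by_counts = segments.value_counts()  # sorted by count, descending, by default

print(by_counts)
```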

# b) Display summary metrics per RFM Score

In [72]:
rfm_data.groupby('RFM_Score').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'Monetary Value' : ['mean', 'count']}).round(1)

| RFM_Score | Recency (mean) | Frequency (mean) | Monetary Value (mean) | Monetary Value (count) |
|---|---|---|---|---|
| 3 | 248 | 8 | 106 | 389 |
| 4 | 163 | 14 | 221 | 376 |
| 5 | 147 | 21 | 350 | 503 |
| 6 | 87 | 28 | 484 | 460 |
| 7 | 80 | 39 | 725 | 438 |
| 8 | 57 | 55 | 954 | 464 |
| 9 | 44 | 77 | 1355 | 403 |
| 10 | 30 | 113 | 1796 | 443 |
| 11 | 20 | 191 | 3942 | 358 |
| 12 | 7 | 364 | 8571 | 462 |


### <font color="blue">Inspect Data: </font>

<b>Min RFM_Score at <code>3</code>: </b>
 - Mean Recency is 248 days
 - Mean Frequency is around 8 invoice rows
 - Mean Monetary Value is 106 dollars in purchases

<b>Max RFM_Score at <code>12</code>: </b>
 - Mean Recency is 7 days
 - Mean Frequency is around 364 invoice rows
 - Mean Monetary Value is over 8500 dollars in purchases
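Passing a dictionary with a list to `agg` produces two-level column headers. If flat column names are preferred, named aggregation is one option. The sketch below applies it to a hypothetical three-row frame; the output column names (`recency_mean` and so on) are illustrative choices.

```python
import pandas as pd

# Hypothetical per-customer RFM rows
toy = pd.DataFrame({
    "RFM_Score": [3, 3, 12],
    "Recency": [300, 200, 5],
    "Frequency": [2, 4, 100],
    "Monetary Value": [10.0, 30.0, 5000.0],
})

summary = toy.groupby("RFM_Score").agg(
    recency_mean=("Recency", "mean"),
    frequency_mean=("Frequency", "mean"),
    monetary_mean=("Monetary Value", "mean"),
    customers=("Monetary Value", "count"),
).round(1)

print(summary)
```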

# c) Group into named segments

This is a method to add a descriptive tag for each segment to make it easier to read<br>

Steps:<br>
 - Step 1: Create segmentation labeling function
 - Step 2: Apply function to dataframe
 - Step 3: Display summary metrics for RFM

### Step 1: Create segmentation labeling function

In [73]:
def segment_name(row):
    # Map a row's RFM_Score (3 to 12) to a named tier
    if row['RFM_Score'] >= 9:
        return 'Gold'
    elif row['RFM_Score'] >= 5:
        return 'Silver'
    else:
        return 'Bronze'

### Step 2: Apply function to dataframe

In [75]:
rfm_data["General_Segment"] = rfm_data.apply(segment_name, axis=1)
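The same thresholds can be expressed with `pd.cut()`, which bins the scores in one vectorized call. This is an alternative sketch, not the method used above; the bin edges `[2, 4, 8, 12]` are chosen so that the right-inclusive intervals match the `segment_name` rules (3 to 4 Bronze, 5 to 8 Silver, 9 to 12 Gold).

```python
import pandas as pd

scores = pd.Series([3, 4, 5, 8, 9, 12])  # hypothetical RFM scores

# Intervals are (2, 4], (4, 8], (8, 12]
tiers = pd.cut(scores, bins=[2, 4, 8, 12], labels=["Bronze", "Silver", "Gold"])

print(tiers.tolist())  # ['Bronze', 'Bronze', 'Silver', 'Silver', 'Gold', 'Gold']
```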

### Step 3: Display summary metrics by segmentation label

In [85]:
rfm_data.groupby('General_Segment').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'Monetary Value' : ['mean', 'count']})

| General_Segment | Recency (mean) | Frequency (mean) | Monetary Value (mean) | Monetary Value (count) |
|---|---|---|---|---|
| Bronze | 206 | 11 | 163 | 765 |
| Gold | 25 | 191 | 4029 | 1666 |
| Silver | 94 | 35 | 621 | 1865 |
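As a sanity check, the segment counts and mean recencies can be recomputed from the per-score summary in part b). The sketch below copies the rounded mean Recency and customer count for each RFM_Score from that table and regroups them with the same thresholds as `segment_name`. Because the inputs are rounded, the recomputed means are approximate.

```python
# (mean Recency, customer count) per RFM_Score, copied from the part b) summary
per_score = {
    3: (248, 389), 4: (163, 376), 5: (147, 503), 6: (87, 460), 7: (80, 438),
    8: (57, 464), 9: (44, 403), 10: (30, 443), 11: (20, 358), 12: (7, 462),
}

def tier(score):
    # Same thresholds as segment_name
    if score >= 9:
        return "Gold"
    if score >= 5:
        return "Silver"
    return "Bronze"

totals = {}
for score, (mean_recency, count) in per_score.items():
    name = tier(score)
    day_sum, customers = totals.get(name, (0, 0))
    totals[name] = (day_sum + mean_recency * count, customers + count)

tier_counts = {name: customers for name, (_, customers) in totals.items()}
tier_recency = {name: round(day_sum / customers)
                for name, (day_sum, customers) in totals.items()}

print(tier_counts)   # {'Bronze': 765, 'Silver': 1865, 'Gold': 1666}
print(tier_recency)  # {'Bronze': 206, 'Silver': 94, 'Gold': 25}
```

Both results agree with the segment table: the counts match exactly and the count-weighted mean recencies round to the same values.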
