# Customer Segmentation Using Recency, Frequency, and Monetary Value (RFM)

 - Recency (R) is based on how long ago the customer's last purchase was
 - Frequency (F) is based on how many purchases the customer made in the last 12 months
 - Monetary Value (M) is based on how much the customer spent in the last 12 months
 - Customers can be grouped on each RFM metric by percentiles
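
A minimal sketch of percentile grouping, assuming quartiles and an equal-weight sum of the three scores. The data, the `rfm_example` name, and the `R`, `F`, `M`, and `RFM_Score` columns are hypothetical and only illustrate how `pd.qcut` could be used for this step.

```python
import pandas as pd

# Hypothetical RFM values for eight customers (illustrative only).
rfm_example = pd.DataFrame(
    {
        "Recency": [2, 10, 35, 60, 90, 150, 250, 326],
        "Frequency": [151, 80, 40, 31, 20, 12, 5, 2],
        "MonetaryValue": [3598, 2500, 1797, 900, 600, 300, 120, 40],
    },
    index=pd.Index(range(1, 9), name="CustomerID"),
)

# Recency: fewer days since the last purchase is better, so labels run 4 -> 1.
rfm_example["R"] = pd.qcut(rfm_example["Recency"], q=4, labels=[4, 3, 2, 1]).astype(int)
# Frequency and monetary value: larger is better, so labels run 1 -> 4.
rfm_example["F"] = pd.qcut(rfm_example["Frequency"], q=4, labels=[1, 2, 3, 4]).astype(int)
rfm_example["M"] = pd.qcut(rfm_example["MonetaryValue"], q=4, labels=[1, 2, 3, 4]).astype(int)

# Combine the three quartile scores into one score (3 = lowest, 12 = highest).
rfm_example["RFM_Score"] = rfm_example[["R", "F", "M"]].sum(axis=1)
print(rfm_example)
```

Other choices, such as quintiles or weighting the three scores differently, would work the same way.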

# Goals

  ## 1. Calculate Recency, Frequency, and Monetary Value
  ## 2. Build Recency, Frequency, and Monetary Value segments
  ## 3. Analyze RFM Segments

# Import Modules

In [21]:
# Data Manipulation Libraries: Standard dataframes and array libraries
import pandas as pd
import numpy as np
from pandas import ExcelWriter
from pandas import ExcelFile
# from datetime import datetime
import datetime as dt

# Data Visualization Libraries:
import matplotlib.pyplot as plt
import seaborn as sns

# K-means clustering
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# ignore all future warnings
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

# Display plots in the Jupyter notebook
%matplotlib inline
# Displaying pandas columns and rows
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Import Data

In [2]:
# import data
df = pd.read_excel("static/data/online_retail.xlsx", sheet_name="Online Retail")

# Clean Data

 - Inspect Datatypes
 - Drop missing values in key column
 - Change datatypes as needed

In [22]:
df.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object

In [23]:
df.describe()

| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 406829 | 406829 | 406829 |
| mean | 12 | 3 | 15288 |
| std | 249 | 69 | 1714 |
| min | -80995 | 0 | 12346 |
| 25% | 2 | 1 | 13953 |
| 50% | 5 | 2 | 15152 |
| 75% | 12 | 4 | 16791 |
| max | 80995 | 38970 | 18287 |


### <font color="blue">Note: </font>Many missing values in the <code>CustomerID</code> Column

In [12]:
# Drop rows that have missing customerID values
df = df.dropna(subset=['CustomerID'])

In [24]:
# Convert the CustomerID column from float to integer
# (possible now that rows with missing CustomerID values are dropped)
df = df.astype({'CustomerID': 'int64'})
df.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID              int64
Country                object
dtype: object

# <font color="blue">Part 1: Calculate Recency, Frequency, and Monetary Value</font>

 - Step 1: Define time period and filter data accordingly
 - Step 2: Calculate the sales revenue for each transaction
 - Step 3: Calculate the Recency, Frequency, Monetary Value for a specific day

# Step 1: Filter Data to Specific Time Period

In [14]:
# Create a subset of the dataframe that is filtered for most recent year of activity
subset_df = df[df['InvoiceDate']>'2010-12-10'].copy()

In [15]:
# Confirm subset dates
print('Min: {}; Max: {}'.format(min(subset_df.InvoiceDate),
                              max(subset_df.InvoiceDate)))

Min: 2010-12-10 09:33:00; Max: 2011-12-09 12:50:00


# Step 2: Calculate the Sales Revenue per Transaction

 - This is calculated by <code>Quantity</code> * <code>UnitPrice</code>

In [25]:
# Create a sales revenue column named TotalSum
subset_df["TotalSum"] = subset_df["Quantity"]*subset_df["UnitPrice"]

In [27]:
# Inspect updated dataframe
subset_df.head(3)

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | TotalSum |
|---|---|---|---|---|---|---|---|---|---|
| 22523 | 538172 | 21562 | HAWAIIAN GRASS SKIRT | 12 | 2010-12-10 09:33:00 | 1 | 15805 | United Kingdom | 15 |
| 22524 | 538172 | 79321 | CHILLI LIGHTS | 8 | 2010-12-10 09:33:00 | 5 | 15805 | United Kingdom | 40 |
| 22525 | 538172 | 22041 | RECORD FRAME 7" SINGLE SIZE | 12 | 2010-12-10 09:33:00 | 3 | 15805 | United Kingdom | 31 |


# Step 3: Calculate the Recency, Frequency, and Monetary Metrics for a snapshot date in dataset

In [28]:
# Create snapshot_date: one day after the most recent invoice in the subset
snapshot_date = max(subset_df.InvoiceDate) + dt.timedelta(days=1)

In [29]:
# Display date
snapshot_date

Timestamp('2011-12-10 12:50:00')

In [36]:
# Aggregate data (Recent day "snapshot_date" - last transaction)
rfm_data = subset_df.groupby(["CustomerID"]).agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo' : 'count',
    'TotalSum' : 'sum'})

In [37]:
# Inspect data
rfm_data.head(3)

| CustomerID | InvoiceDate | InvoiceNo | TotalSum |
|---|---|---|---|
| 12346 | 326 | 2 | 0 |
| 12347 | 2 | 151 | 3598 |
| 12348 | 75 | 31 | 1797 |


#### Notice that InvoiceDate is no longer a date datatype, but an integer representing the number of days since the customer's last invoice
 - CustomerID 12346 has not made a purchase in almost a year, has had two transactions, and has spent a net $0.00

In [39]:
# Inspecting this customer we see that they made a purchase
# But then received a refund
subset_df[subset_df["CustomerID"] == 12346]

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | TotalSum |
|---|---|---|---|---|---|---|---|---|---|
| 61619 | 541431 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 74215 | 2011-01-18 10:01:00 | 1 | 12346 | United Kingdom | 77184 |
| 61624 | C541433 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | -74215 | 2011-01-18 10:17:00 | 1 | 12346 | United Kingdom | -77184 |
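
The refund row above has an `InvoiceNo` that begins with `C` and a negative `Quantity`. A minimal sketch of how such rows could be flagged follows, assuming the `C` prefix always marks a cancellation (an assumption based on the rows shown, not something the data confirms here). The `tx` frame and the `IsCancellation` column are hypothetical names.

```python
import pandas as pd

# Hypothetical transactions mirroring the rows shown above.
tx = pd.DataFrame(
    {
        "InvoiceNo": ["541431", "C541433", "538172"],
        "CustomerID": [12346, 12346, 15805],
        "Quantity": [74215, -74215, 12],
    }
)

# Assumption: an InvoiceNo starting with "C" marks a cancellation or refund.
tx["IsCancellation"] = tx["InvoiceNo"].astype(str).str.startswith("C")

# Count cancellations and total invoice rows per customer.
summary = tx.groupby("CustomerID")["IsCancellation"].agg(
    cancellations="sum", invoices="count"
)
print(summary)
```

Whether cancellations should count toward Frequency is an analysis decision; this sketch only makes them visible.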


In [40]:
# Rename columns for easier interpretation
rfm_data.rename(columns = {'InvoiceDate' : 'Recency',
                          'InvoiceNo' : 'Frequency',
                          'TotalSum': 'Monetary Value'}, inplace=True)

In [33]:
# Inspect the relabeled data
rfm_data.head(3)

| CustomerID | Recency | Frequency | Monetary Value |
|---|---|---|---|
| 12346 | 326 | 2 | 0 |
| 12347 | 2 | 151 | 3598 |
| 12348 | 75 | 31 | 1797 |


# K-means clustering

 - Simple and fast
 - One of the most popular unsupervised learning models

### Assumptions
 - Variables have symmetric (not skewed) distributions
 - Variables have the same average value
 - Variables have the same variance

### Evaluate data skewness
 - Apply logarithmic transformation if skewed
 - Only works on positive values
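
A minimal sketch of how skewness could be checked before and after a log transformation. The `spend` series is synthetic, hypothetical data, and `Series.skew()` is used here as one possible numeric check alongside the plots.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, strictly positive data (illustrative only).
rng = np.random.default_rng(seed=1)
spend = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=1000), name="spend")

# Series.skew() is close to 0 for symmetric data and positive for a long right tail.
skew_before = spend.skew()

# np.log is only defined for positive values; zeros give -inf and negatives give NaN.
spend_log = np.log(spend)
skew_after = spend_log.skew()

print(f"skew before: {skew_before:.2f}, after log: {skew_after:.2f}")
```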

### Calculate statistics of variables

In [None]:
rfm_data.describe()

### Manage data skewness

In [None]:
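# Plot the distribution of each RFM variable
rfm_variables = ["Recency", "Frequency", "Monetary Value"]
fig, axes = plt.subplots(1, 3, figsize=(12,4))
# Use a separate loop variable so the axes array is not overwritten
for ax, var in zip(axes.ravel(), rfm_variables):
    ax.set_title(var)
    # histplot with kde=True replaces distplot, which is deprecated since seaborn 0.11
    sns.histplot(rfm_data[var], kde=True, ax=ax)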

plt.tight_layout()
# plt.savefig("filepath/filename.format", bbox_inches='tight')
plt.show()

### Data transformation

In [None]:
frequency_log = np.log(rfm_data.Frequency)
recency_log = np.log(rfm_data.Recency)

In [None]:
data = {"frequency_log": frequency_log, "recency_log": recency_log}
rf_log = pd.DataFrame.from_dict(data)

In [None]:
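# Plot the distribution of each log-transformed variable
rfm_log_variables = rf_log.columns
fig, axes = plt.subplots(1, 2, figsize=(8,4))
# Use a separate loop variable so the axes array is not overwritten
for ax, var in zip(axes.ravel(), rfm_log_variables):
    ax.set_title(var)
    # histplot with kde=True replaces distplot, which is deprecated since seaborn 0.11
    sns.histplot(rf_log[var], kde=True, ax=ax)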

plt.tight_layout()
# plt.savefig("filepath/filename.format", bbox_inches='tight')
plt.show()

### Dealing with negative values
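
A minimal sketch of two common ways to handle zero or negative values before a log transformation: dropping the affected rows, or shifting each column so its minimum becomes positive. The data and the names `rfm_example`, `positive_only`, and `shifted` are hypothetical, as is the customer ID 99999.

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer values: one customer nets to zero, one is negative.
rfm_example = pd.DataFrame(
    {
        "Recency": [326, 2, 75, 40],
        "Frequency": [2, 151, 31, 3],
        "Monetary Value": [0.0, 3598.0, 1797.0, -25.0],
    },
    index=pd.Index([12346, 12347, 12348, 99999], name="CustomerID"),
)

# Option 1: keep only customers whose values are all strictly positive.
positive_only = rfm_example[(rfm_example > 0).all(axis=1)]

# Option 2: shift each column so that its minimum becomes 1.
shifted = rfm_example - rfm_example.min() + 1

# Both results are now safe to pass to np.log.
log_positive = np.log(positive_only)
log_shifted = np.log(shifted)
print(log_positive)
print(log_shifted)
```

Option 1 discards customers, while option 2 keeps every customer but changes the scale of the values, so the choice depends on the analysis.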

# Centering and Scaling Variables

 - Centering is done by subtracting a variable's average value from each observation
 - Scaling is done by dividing each observation by the variable's standard deviation
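
A minimal sketch of centering and scaling done by hand, compared against scikit-learn's `StandardScaler`. The matrix `X` is hypothetical data used only to show that the two approaches agree.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: rows are customers, columns are attributes.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 700.0], [6.0, 800.0]])

# Centering: subtract each column's mean.
X_centered = X - X.mean(axis=0)
# Scaling: divide by each column's population standard deviation (ddof=0),
# which is the convention StandardScaler uses.
X_manual = X_centered / X.std(axis=0)

X_sklearn = StandardScaler().fit_transform(X)
print(np.allclose(X_manual, X_sklearn))
```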

In [None]:
# Aggregate data (Recent day - last transaction)
rfm_dataset = subset_df.groupby(["CustomerID"]).agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo' : 'count',
    'TotalSum' : 'sum'})

In [None]:
# Rename columns for easier interpretation
rfm_dataset.rename(columns = {'InvoiceDate' : 'Recency',
                          'InvoiceNo' : 'Frequency',
                          'TotalSum': 'Monetary Value'}, inplace=True)

In [None]:
rfm_dataset = rfm_dataset[rfm_dataset["Monetary Value"]>0]

In [None]:
rfm_dataset.head()

# Combining centering and scaling
 - use scaler from <code>scikit-learn</code>

In [None]:
# Fit the scaler on the filtered RFM data, then center and scale it
scaler = StandardScaler()
scaler.fit(rfm_dataset)
rfm_normalized = scaler.transform(rfm_dataset)

In [None]:
print('mean:', rfm_normalized.mean(axis=0).round(2))
print('std:', rfm_normalized.std(axis=0).round(2))

# Sequence of structuring pre-processing steps

 - Unskew the data - log transformation
 - Standardize to the same average values
 - Scale to the same standard deviation
 - Store as a separate array to be used for clustering

In [None]:
# Unskew the data with log transformation
dataset_log = np.log(rfm_dataset)

In [None]:
dataset_log.head()

In [None]:
# Normalize the variables with StandardScaler
scaler = StandardScaler()
scaler.fit(dataset_log)

In [None]:
dataset_normalized = scaler.transform(dataset_log)
df_normalized = pd.DataFrame(dataset_normalized, index=rfm_dataset.index, columns=rfm_dataset.columns)

In [None]:
df_normalized.head()

In [None]:
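# Plot the distribution of each normalized variable
rfm_variables = df_normalized.columns
fig, axes = plt.subplots(1, 3, figsize=(12,4))
# Use a separate loop variable so the axes array is not overwritten
for ax, var in zip(axes.ravel(), rfm_variables):
    ax.set_title(var)
    # histplot with kde=True replaces distplot, which is deprecated since seaborn 0.11
    sns.histplot(df_normalized[var], kde=True, ax=ax)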

plt.tight_layout()
# plt.savefig("filepath/filename.format", bbox_inches='tight')
plt.show()

# Practical Implementation of k-means clustering

 - Data pre-processing
 - Choosing the number of clusters
 - Running k-means clustering on pre-processed data
 - Analyzing average RFM values of each cluster

# Methods to define cluster numbers

 - Visual methods: Elbow criterion
 - Mathematical methods: silhouette coefficient
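
A minimal sketch of the silhouette approach using scikit-learn's `silhouette_score`. The blobs generated by `make_blobs` are hypothetical stand-ins for the pre-processed RFM values, and the range of `k` is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated blobs stand in for the normalized RFM values.
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.8,
    random_state=1,
)

# The silhouette coefficient needs at least 2 clusters; values closer to 1 are better.
silhouette = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    silhouette[k] = silhouette_score(X, labels)

# Pick the k with the highest average silhouette coefficient.
best_k = max(silhouette, key=silhouette.get)
print(silhouette)
print("best k:", best_k)
```

On real customer data the scores are usually much closer together than on this synthetic example, so the result is best read alongside the elbow plot.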

In [None]:
kmeans = KMeans(n_clusters=2, random_state=1)

In [None]:
# compute k-means clustering on pre-processed data
kmeans.fit(df_normalized)

In [None]:
# Extract cluster labels using the labels_ attribute
cluster_labels = kmeans.labels_

### Analyze average RFM values of each cluster
 - Create a cluster label column in the original dataframe
 - Calculate average RFM values and size for each cluster

In [None]:
rfm_dataset_cluster = rfm_dataset.assign(Cluster = cluster_labels)

In [None]:
rfm_dataset_cluster.groupby(["Cluster"]).agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'Monetary Value': ["mean", 'count']
}).round(0)

## Elbow criterion method

 - Plot the number of clusters against the within-cluster sum of squared errors (SSE): the sum of squared distances from every data point to its cluster center
 - The 'elbow', where the decrease in SSE slows sharply, represents an 'optimal' number of clusters

In [None]:
# Create an empty dictionary
sse = {}
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=1)
    kmeans.fit(df_normalized)
    sse[k] = kmeans.inertia_

In [None]:
# Plot SSE for each k 'cluster'
plt.title('The Elbow Method')
plt.xlabel('k'); plt.ylabel('SSE')
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
# plt.savefig("filepath/filename.format", bbox_inches='tight')
plt.show()

### The elbow criterion points to a 2- or 3-cluster solution

## Experimental Approach - Analyze Segments

 - Build clustering at and around elbow solution
 - Analyze average RFM values
 - Compare against other solutions and identify which provides most insight

In [None]:
kmeans3 = KMeans(n_clusters=3, random_state=1)
# compute k-means clustering on pre-processed data
kmeans3.fit(df_normalized)
# Extract cluster labels using the labels_ attribute
cluster_labels = kmeans3.labels_
rfm_dataset_cluster3 = rfm_dataset.assign(Cluster = cluster_labels)
rfm_dataset_cluster3.groupby(["Cluster"]).agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'Monetary Value': ["mean", 'count']
}).round(0)

### Profile and Interpret Segments

Approaches to building customer personas:
 - Summary statistics for each cluster (calculated above)
 - Snake plots
 - Calculate the relative importance of cluster attributes compared to the population

### Snake plots
 - Market research technique to compare different segments
 - Visual representation of each segment's attributes
 - Need to first normalize data (center and scale)
 - Plot each cluster's average normalized value for each attribute

In [None]:
df_normalized['Cluster'] = rfm_dataset_cluster3['Cluster']

In [None]:
df_melt = pd.melt(df_normalized.reset_index(),
                 id_vars=['CustomerID', 'Cluster'],
                 value_vars = ['Recency', 'Frequency', 'Monetary Value'],
                  var_name='Attribute',
                  value_name='Value'
                 )

In [None]:
plt.title('Snake plot of standardized values')
sns.lineplot(x='Attribute', y='Value', hue='Cluster', data=df_melt)
# plt.savefig("filepath/filename.format", bbox_inches='tight')
plt.show()

### Relative importance of segment attributes
 - Identify the relative importance of each segment's attributes
 - Calculate the average values of each cluster
 - Calculate the average values of the population

In [None]:
cluster_avg = rfm_dataset_cluster3.groupby(['Cluster']).mean()
population_avg = rfm_dataset.mean()

In [None]:
relative_imp = cluster_avg/population_avg - 1

### Analyze and plot relative importance

 - The further a ratio is from 0, the more important that attribute is for a segment, relative to the total population
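
A small worked example of the ratio, using hypothetical cluster and population averages, to show how the values are read.

```python
import pandas as pd

# Hypothetical cluster and population averages (illustrative only).
cluster_avg = pd.DataFrame(
    {"Recency": [20.0, 180.0], "Frequency": [150.0, 30.0]},
    index=pd.Index([0, 1], name="Cluster"),
)
population_avg = pd.Series({"Recency": 100.0, "Frequency": 60.0})

# Dividing a DataFrame by a Series aligns on the column names.
relative_imp = cluster_avg / population_avg - 1
print(relative_imp.round(2))
```

In this example, cluster 0 has a Recency ratio of -0.8 (its average is 80% below the population's) and a Frequency ratio of 1.5 (150% above), so both attributes set it apart strongly from the population.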

In [None]:
plt.figure(figsize=(8,8))
plt.title('Relative importance of attributes')
chart = sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='RdYlGn')
# Reset the y-limits so the top and bottom rows are not cut off
# (a heatmap rendering issue in some matplotlib versions)
chart.set_ylim(len(relative_imp)-0.25, -0.25)
# plt.savefig("filepath/filename.format", bbox_inches='tight')
plt.show()