# Customer Segmentation (K-means & RFM modeling)

## Identify the most loyal customers



Customer segmentation divides customers with similar behavior into groups. It allows companies to precisely target customers who have specific needs and desires, and to direct campaigns at the right audience. By using machine learning techniques to create clusters, companies can also identify new, potentially more lucrative market segments to focus on, as well as groups that need urgent attention because their members are on the verge of churning. Segmentation has various benefits, and the right type of segmentation can be chosen based on the business requirements.

In this customer segmentation analysis, we use RFM modeling to calculate RFM scores for each customer and build segments from them. We also apply a machine learning technique, k-means clustering, to create segments.

RFM:
  
R = Recency (how recently a customer purchased an item; the lower the recency, the better the score)
  
F = Frequency (how often a customer purchases; the more frequent the purchases, the better the score)
  
M = Monetary (how much the customer spends; the higher the amount, the better the score)
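
To make the three definitions concrete, here is a minimal worked example for a single customer. The purchase history and the reference date are made-up values for illustration only, not rows from the dataset.

```python
from datetime import date

# Hypothetical purchase history for one customer: (purchase date, amount spent)
purchases = [
    (date(2011, 11, 1), 20.0),
    (date(2011, 11, 20), 35.5),
    (date(2011, 12, 5), 12.0),
]
snapshot = date(2011, 12, 10)  # assumed reference date for measuring recency

recency = (snapshot - max(d for d, _ in purchases)).days  # days since the last purchase
frequency = len(purchases)                                # number of purchases
monetary = sum(amount for _, amount in purchases)         # total amount spent

print(recency, frequency, monetary)  # 5 3 67.5
```

This customer bought 5 days before the reference date, made 3 purchases, and spent 67.5 in total.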

RFM scores can be classified into as many groups as the business requires. In general, customers are classified into three groups:
  
High: customers who purchase often, spend more, and visited the platform recently
  
Medium: customers who spend less than the High group and visit the platform less frequently
  
Low: customers who are on the verge of churning


In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sb

In [None]:
df=pd.read_csv("Online Retail.csv")
df.head()

## Data Preparation

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


In [4]:
# Convert InvoiceDate to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

In [5]:
# Check for missing values
df.isnull().sum(axis=0)

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

In [6]:
# Since we are identifying customers for each segment, CustomerID must not contain null values.
# Remove rows with a null CustomerID
df=df[pd.notnull(df['CustomerID'])]
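
The same filter can also be written with `dropna`. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in for the retail data) and additionally converts the IDs to integers, which is optional and only works once no nulls remain.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the retail data (values are made up)
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367"],
    "CustomerID": [17850.0, np.nan, 13047.0],
})

# Same effect as toy[pd.notnull(toy["CustomerID"])]; .copy() returns an
# independent frame so later column assignments do not raise warnings
clean = toy.dropna(subset=["CustomerID"]).copy()

# With no NaN left, the float IDs can be stored as integers
clean["CustomerID"] = clean["CustomerID"].astype(int)
print(clean)
```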

In [7]:
# Quantity and UnitPrice should never be negative, so check their minimum values.
df.Quantity.min()

-80995

In [8]:
df.UnitPrice.min()

0.0

In [9]:
# Keep only rows with a positive Quantity
# .copy() avoids a SettingWithCopyWarning when TotalAmount is added below
df = df[df['Quantity'] > 0].copy()
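
Before dropping rows, it can help to count how many violate each rule. This is a sketch on a made-up frame (`toy` and its values are hypothetical); it also shows that zero-priced rows survive the quantity filter and simply contribute 0 to the amount spent.

```python
import pandas as pd

# Made-up rows: one negative quantity and one zero unit price
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536379", "536370", "536371"],
    "Quantity": [6, -1, 24, 10],
    "UnitPrice": [2.55, 27.50, 0.0, 1.25],
})

# Count rows that break each rule before deciding what to drop
n_bad_qty = int((toy["Quantity"] <= 0).sum())
n_neg_price = int((toy["UnitPrice"] < 0).sum())
n_zero_price = int((toy["UnitPrice"] == 0).sum())
print(n_bad_qty, n_neg_price, n_zero_price)  # 1 0 1

# Keep only positive quantities; the zero-priced row is still present
valid = toy[toy["Quantity"] > 0].copy()
print(len(valid))  # 3
```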

In [10]:
# Create a new column depicting total amount for an order
df['TotalAmount']=df['Quantity']*df['UnitPrice']

## RFM Modelling

The dataset contains transactions up to 2011-12-09. To calculate the number of days since each customer's most recent purchase, we set the reference date to 2011-12-10, the day after the last transaction.

In [11]:
LatestDate =  dt.datetime(2011,12,10)
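
Rather than hard-coding the reference date, it could be derived from the data as the day after the last recorded purchase. This is a sketch on made-up timestamps; `invoice_dates` and `snapshot` are illustrative names.

```python
import pandas as pd

# Made-up invoice timestamps; the last one falls on 2011-12-09
invoice_dates = pd.to_datetime(pd.Series([
    "2011-12-08 09:15:00",
    "2011-12-09 12:50:00",
]))

# Midnight of the day after the last recorded purchase
snapshot = invoice_dates.max().normalize() + pd.Timedelta(days=1)
print(snapshot)  # 2011-12-10 00:00:00
```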

In [12]:
# RFM values for each customer

# Group by CustomerID
# Recency: LatestDate - x.max(), where x.max() is the date of the customer's last purchase
# Frequency: len(x) of InvoiceNo (counts the customer's rows, i.e. invoice line items, not distinct invoices)
# Monetary: sum of TotalAmount

RFM = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (LatestDate - x.max()).days,
    'InvoiceNo': lambda x: len(x),
    'TotalAmount': lambda x: x.sum(),
})
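
An alternative way to write this aggregation is pandas named aggregation, which names the output columns directly. The sketch below runs on made-up rows and, as an assumption about what "frequency" could mean, also counts distinct invoices with `nunique` next to the row count used in this notebook.

```python
import pandas as pd

# Made-up transactions: customer 1 has three line items across two invoices
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-01", "2011-12-08", "2011-11-10"]
    ),
    "TotalAmount": [10.0, 5.0, 20.0, 7.5],
})
snapshot = pd.Timestamp("2011-12-10")

rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda x: (snapshot - x.max()).days),
    LineItems=("InvoiceNo", "count"),    # rows per customer, as in the notebook
    Invoices=("InvoiceNo", "nunique"),   # distinct invoices per customer
    Monetary=("TotalAmount", "sum"),
)
print(rfm)
```

For customer 1 the two counts differ (3 line items, 2 invoices), which is why the choice of frequency definition matters for customers who buy many items per order.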

In [13]:
RFM

CustomerID,InvoiceDate,InvoiceNo,TotalAmount
12346.0,325,1,77183.60
12347.0,2,182,4310.00
12348.0,75,31,1797.24
12349.0,18,73,1757.55
12350.0,310,17,334.40
...,...,...,...
18280.0,277,10,180.60
18281.0,180,7,80.82
18282.0,7,12,178.05
18283.0,3,756,2094.88


In [14]:
# Rename column names
RFM.rename(columns= {'InvoiceDate': 'Recency',
                    'InvoiceNo':'Frequency',
                    'TotalAmount':'Monetary'},inplace=True)
RFM

CustomerID,Recency,Frequency,Monetary
12346.0,325,1,77183.60
12347.0,2,182,4310.00
12348.0,75,31,1797.24
12349.0,18,73,1757.55
12350.0,310,17,334.40
...,...,...,...
18280.0,277,10,180.60
18281.0,180,7,80.82
18282.0,7,12,178.05
18283.0,3,756,2094.88


In [15]:
# Descriptive statistics
RFM.describe()

,Recency,Frequency,Monetary
count,4339.0,4339.0,4339.0
mean,92.041484,91.708689,2053.793018
std,100.007757,228.792852,8988.248381
min,0.0,1.0,0.0
25%,17.0,17.0,307.245
50%,50.0,41.0,674.45
75%,141.5,100.0,1661.64
max,373.0,7847.0,280206.02


In [20]:
# Split recency, frequency and monetary into four segments using the quartiles (this split depends on the business requirement)
# Any number of segments and any quantile boundaries could be used instead.
quantiles= RFM.quantile(q=[0.25,0.5,0.75])
quantiles=quantiles.to_dict()
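
The scoring functions later index this dictionary as `d[p][0.25]`, so it helps to see its shape: one inner dictionary per column, keyed by quantile. A sketch on a made-up frame (`toy` and `cuts` are illustrative names):

```python
import pandas as pd

# Made-up values chosen so the quartiles are easy to check by hand
toy = pd.DataFrame({
    "Recency": [1, 2, 3, 4, 5],
    "Frequency": [10, 20, 30, 40, 50],
})

# Result has the form {column: {quantile: value}}
cuts = toy.quantile(q=[0.25, 0.5, 0.75]).to_dict()
print(cuts["Recency"])          # {0.25: 2.0, 0.5: 3.0, 0.75: 4.0}
print(cuts["Frequency"][0.75])  # 40.0
```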

In [21]:
# Define functions to assign the segments. Two functions are needed because recency behaves the opposite way to frequency and monetary.
# Segments are labeled 1 to 4, with 1 being the best.
# The higher the recency, the higher (worse) the segment
# The higher the frequency, the lower (better) the segment
# The higher the monetary value, the lower (better) the segment
def RScoring(x,p,d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]: 
        return 3
    else:
        return 4
    
def FnMScoring(x,p,d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]: 
        return 2
    else:
        return 1
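
The same scoring rules can be expressed without per-row function calls. The sketch below swaps in `np.searchsorted` as an alternative technique, not the notebook's method; the cut points are the Recency quartiles from the descriptive statistics, and the input values are made up.

```python
import numpy as np
import pandas as pd

values = pd.Series([5, 17, 18, 50, 141, 300])  # made-up recency values
cuts = [17.0, 50.0, 141.5]                     # 25%, 50%, 75% cut points

# side="left" counts the cut points strictly below each value, so a value
# equal to a cut point stays in the lower bin, matching the <= comparisons
bins = np.searchsorted(cuts, values.to_numpy(), side="left")

r_segment = pd.Series(bins + 1, index=values.index)   # low value -> 1, like RScoring
fm_segment = pd.Series(4 - bins, index=values.index)  # low value -> 4, like FnMScoring

print(r_segment.tolist())   # [1, 1, 2, 2, 3, 4]
print(fm_segment.tolist())  # [4, 4, 3, 3, 2, 1]
```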
    

In [22]:
# Creating the segment values for the data
RFM['RSegment'] = RFM['Recency'].apply(RScoring, args=('Recency',quantiles,))
RFM['FSegment'] = RFM['Frequency'].apply(FnMScoring, args=('Frequency',quantiles,))
RFM['MSegment'] = RFM['Monetary'].apply(FnMScoring, args=('Monetary',quantiles,))
RFM.head()

CustomerID,Recency,Frequency,Monetary,RSegment,FSegment,MSegment
12346.0,325,1,77183.6,4,4,1
12347.0,2,182,4310.0,1,1,1
12348.0,75,31,1797.24,3,3,1
12349.0,18,73,1757.55,2,2,1
12350.0,310,17,334.4,4,4,3


In [25]:
# Create RFM groups by concatenating the segment values; this makes it quick to identify groups (111 being the best group)
RFM['RFMGroup'] = RFM.RSegment.map(str) + RFM.FSegment.map(str) + RFM.MSegment.map(str)

# Calculate the total segment score; the lower the score, the higher the loyalty status
RFM['RFMScore'] = RFM[['RSegment', 'FSegment', 'MSegment']].sum(axis = 1)
RFM.head()

CustomerID,Recency,Frequency,Monetary,RSegment,FSegment,MSegment,RFMGroup,RFMScore
12346.0,325,1,77183.6,4,4,1,441,9
12347.0,2,182,4310.0,1,1,1,111,3
12348.0,75,31,1797.24,3,3,1,331,7
12349.0,18,73,1757.55,2,2,1,221,5
12350.0,310,17,334.4,4,4,3,443,11


In [49]:
#Assign Loyalty Level to each customer based on the RFM score
Loyalty_Level = ['Platinum', 'Gold', 'Silver', 'Bronze']
Scorecuts = pd.qcut(RFM.RFMScore, q = 4, labels = Loyalty_Level)
RFM['LoyaltyLevel'] = Scorecuts.values
RFM.reset_index().head()

,CustomerID,Recency,Frequency,Monetary,RSegment,FSegment,MSegment,RFMGroup,RFMScore,LoyaltyLevel
0,12346.0,325,1,77183.6,4,4,1,441,9,Silver
1,12347.0,2,182,4310.0,1,1,1,111,3,Platinum
2,12348.0,75,31,1797.24,3,3,1,331,7,Gold
3,12349.0,18,73,1757.55,2,2,1,221,5,Platinum
4,12350.0,310,17,334.4,4,4,3,443,11,Bronze
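
To see how `pd.qcut` turns scores into loyalty levels, the sketch below applies it to a toy input containing each possible score from 3 to 12 exactly once and also returns the bin edges with `retbins=True`. On the real data the edges depend on how the scores are distributed, so they may differ from this toy case.

```python
import pandas as pd

scores = pd.Series(range(3, 13))  # toy input: each possible RFMScore once
labels = ["Platinum", "Gold", "Silver", "Bronze"]

# retbins=True additionally returns the quartile edges used for the cut
levels, edges = pd.qcut(scores, q=4, labels=labels, retbins=True)

print(edges)
print(pd.DataFrame({"RFMScore": scores, "LoyaltyLevel": levels}))
```

With this toy input, scores 3 to 5 map to Platinum, 6 and 7 to Gold, 8 and 9 to Silver, and 10 to 12 to Bronze. Because the lowest scores receive the first label, the label list must be ordered from best to worst.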


Based on the RFM modeling alone, companies can implement marketing strategies.
Examples:
  
Customers in RFM group 111 are the best customers. The company can try to cross-sell other products to them and encourage them to sign up for loyalty programs with elite perks such as free shipping or priority access.

Customers in RFM group 444 are at risk of churning. The company can offer a reward or coupon to trigger spending from these customers.

Customers with the Platinum loyalty level can be encouraged to stay at that level with rewards and better discounts.
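
Turning these strategies into target lists amounts to filtering on the group label. A minimal sketch on a made-up table (the customer IDs and the `rfm` frame are hypothetical):

```python
import pandas as pd

# Made-up scored customers, indexed by a hypothetical CustomerID
rfm = pd.DataFrame(
    {"RFMGroup": ["111", "444", "221", "111", "443"]},
    index=pd.Index([101, 102, 103, 104, 105], name="CustomerID"),
)

# RFMGroup is a string column, so compare against string labels
best = rfm.index[rfm["RFMGroup"] == "111"].tolist()     # cross-sell / loyalty program
at_risk = rfm.index[rfm["RFMGroup"] == "444"].tolist()  # reward or coupon campaign

print(best)     # [101, 104]
print(at_risk)  # [102]
```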

In [50]:
# Scatter chart showing the different loyalty segments; a plotly scatter plot is used because it allows zooming into specific areas
import plotly.express as px
# Use a separate name so the transactions DataFrame df is not overwritten
rfm_plot = RFM.reset_index()
# Color by the LoyaltyLevel column created above
fig = px.scatter(rfm_plot, x="Recency", y="Frequency", color="LoyaltyLevel", text='CustomerID', hover_data=['Monetary'])
fig.show()