<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Project-Background" data-toc-modified-id="Project-Background-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Project Background</a></span></li><li><span><a href="#Data-cleaning" data-toc-modified-id="Data-cleaning-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data cleaning</a></span><ul class="toc-item"><li><span><a href="#Missing-values" data-toc-modified-id="Missing-values-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Missing values</a></span></li><li><span><a href="#Duplicated-Items" data-toc-modified-id="Duplicated-Items-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Duplicated Items</a></span></li></ul></li><li><span><a href="#Data-Understanding" data-toc-modified-id="Data-Understanding-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Understanding</a></span></li><li><span><a href="#RFM-Model" data-toc-modified-id="RFM-Model-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>RFM Model</a></span><ul class="toc-item"><li><span><a href="#Recency" data-toc-modified-id="Recency-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Recency</a></span></li><li><span><a href="#Frequency" data-toc-modified-id="Frequency-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Frequency</a></span></li><li><span><a href="#Monetary" data-toc-modified-id="Monetary-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Monetary</a></span></li><li><span><a href="#RFM-Model" data-toc-modified-id="RFM-Model-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>RFM Model</a></span></li><li><span><a href="#RFM-Analysis" data-toc-modified-id="RFM-Analysis-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>RFM Analysis</a></span></li></ul></li></ul></div>

# Project Background
The RFM model is a widely used approach to customer segmentation. In this notebook, I use an e-commerce dataset to implement the RFM model and visualize the result with a 3D scatter plot in Python.

In [160]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import plotly.express as px
from scipy import stats


In [161]:
df = pd.read_csv("./data.csv",encoding = "ISO-8859-1")
df.head(5)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


# Data cleaning 

## Missing values 
The dataset loaded successfully. Next, we check whether it contains missing values and remove them.

In [162]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null object
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


Description and CustomerID contain missing values (1,454 and 135,080 rows respectively). We now drop every row that has a missing value.

In [163]:
df=df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 406829 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      406829 non-null object
StockCode      406829 non-null object
Description    406829 non-null object
Quantity       406829 non-null int64
InvoiceDate    406829 non-null object
UnitPrice      406829 non-null float64
CustomerID     406829 non-null float64
Country        406829 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 27.9+ MB
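
The info output only reports non-null counts, so a per-column tally of missing values can make the gaps easier to read. The following is a minimal sketch on a small made-up frame; the `toy` data is hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "HOLDER"],
    "CustomerID": [17850.0, np.nan, np.nan],
})

# Count the missing values in each column.
missing = toy.isna().sum()
print(missing)

# Drop every row that has at least one missing value, like df.dropna() above.
cleaned = toy.dropna()
print(len(cleaned))
```

Note that dropping on all columns removes a row even when only Description is missing; passing `subset=["CustomerID"]` to `dropna` would be one way to keep such rows, if that were preferred.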


## Duplicated Items

In [164]:
print("The sum of duplicated items: ",df.duplicated().sum())

The sum of duplicated items:  5225


In [165]:
df=df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 401604 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      401604 non-null object
StockCode      401604 non-null object
Description    401604 non-null object
Quantity       401604 non-null int64
InvoiceDate    401604 non-null object
UnitPrice      401604 non-null float64
CustomerID     401604 non-null float64
Country        401604 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 27.6+ MB


# Data Understanding
- InvoiceNo: the unique number that identifies each transaction. If the code starts with "C", it indicates a cancellation.
- StockCode: the unique code of each distinct product (item).
- Description: the name of the product.
- Quantity: the quantity of each product per transaction.
- InvoiceDate: the day and time when each transaction was generated.
- UnitPrice: the unit price of each product.
- CustomerID: the unique number of each customer.
- Country: the country where each customer resides.
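
Because cancellations are marked in InvoiceNo, they can be flagged with a string prefix check. This is a minimal sketch on made-up rows; the `toy` frame and the `IsCancellation` column name are illustrative assumptions, and the check is case-insensitive in case the prefix appears in either case.

```python
import pandas as pd

# Hypothetical rows: the second invoice is a cancellation.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380"],
    "Quantity": [6, -1, 2],
})

# Flag invoices whose number starts with the letter C (either case).
toy["IsCancellation"] = (
    toy["InvoiceNo"].astype(str).str.upper().str.startswith("C")
)
print(toy)
```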


In [166]:
df.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,401604.0,401604.0,401604.0
mean,12.183273,3.474064,15281.160818
std,250.283037,69.764035,1714.006089
min,-80995.0,0.0,12346.0
25%,2.0,1.25,13939.0
50%,5.0,1.95,15145.0
75%,12.0,3.75,16784.0
max,80995.0,38970.0,18287.0


In [167]:
print("Number of transactions: ", df['InvoiceNo'].nunique())
print("Number of products bought: ",df['StockCode'].nunique())
print("Number of customers:", df['CustomerID'].nunique() )

Number of transactions:  22190
Number of products bought:  3684
Number of customers: 4372


In [168]:
df.Country.value_counts().head(10)

United Kingdom    356728
Germany             9480
France              8475
EIRE                7475
Spain               2528
Netherlands         2371
Belgium             2069
Switzerland         1877
Portugal            1471
Australia           1258
Name: Country, dtype: int64

# RFM Model
Most transactions come from the United Kingdom, so we apply the RFM model to UK customers only.

RFM analysis is a customer segmentation method that uses customers' historical purchase behavior to divide them into different groups. This helps the e-commerce company identify its most valuable customers and offer personalized services in the future.

- Recency(R): Days since last purchase
- Frequency(F): Total number of purchases
- Monetary(M): Total amount of money customer spent.
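
To make the three definitions concrete, here is a minimal sketch that computes R, F and M in one pass on a made-up transaction table. The `toy` frame and the `snapshot` date are assumptions for illustration, and the named-aggregation syntax assumes pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical transactions: one row per invoice line.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "Date": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-03-01", "2011-02-01"]
    ),
    "Cost": [10.0, 5.0, 20.0, 7.5],
})
snapshot = pd.Timestamp("2011-03-11")  # assumed analysis date

rfm_toy = toy.groupby("CustomerID").agg(
    # Days between the snapshot and the customer's last purchase.
    Recency=("Date", lambda d: (snapshot - d.max()).days),
    # Number of distinct invoices.
    Frequency=("InvoiceNo", "nunique"),
    # Total amount spent.
    Monetary=("Cost", "sum"),
)
print(rfm_toy)
```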

In [169]:
df_uk=df[df.Country == "United Kingdom"]
print("Number of transactions: ", df_uk['InvoiceNo'].nunique())
print("Number of products bought: ",df_uk['StockCode'].nunique())
print("Number of customers:", df_uk['CustomerID'].nunique() )


Number of transactions:  19857
Number of products bought:  3661
Number of customers: 3950


## Recency

Here we assume the RFM analysis is run 10 days after the latest transaction date in the dataset.

Because recency is calculated from the invoice date only, we add a new column named "Date" that holds the date part of InvoiceDate.

We then calculate the last purchase date of each customer.

In [171]:
df_uk = df_uk.copy()  # explicit copy, so adding a column does not trigger SettingWithCopyWarning
df_uk['Date'] = pd.to_datetime(df_uk['InvoiceDate']).dt.date
df_uk.head(5)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Date
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom,2010-12-01
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom,2010-12-01
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom,2010-12-01
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom,2010-12-01
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom,2010-12-01


In [None]:
r_df=df_uk.groupby(by='CustomerID',as_index=False)['Date'].max()
r_df.columns = ['CustomerID','LastPurchaseDate']
r_df.head()


In [None]:
print("The latest transaction date in this dataset: ",df_uk.Date.max())
now = r_df.LastPurchaseDate.max()+ datetime.timedelta(days=10)
now

In [None]:
r_df["Recency"]=r_df["LastPurchaseDate"].apply(lambda x: (now-x).days)
r_df.head(5)
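
As a possible alternative to the row-wise apply, the same subtraction can be done in a vectorized way when the dates are kept as pandas datetimes. This is a sketch on a hypothetical two-customer frame, not the notebook's actual `r_df`.

```python
import pandas as pd

# Hypothetical last-purchase dates for two customers.
last = pd.DataFrame({
    "CustomerID": [1, 2],
    "LastPurchaseDate": pd.to_datetime(["2011-12-09", "2011-11-29"]),
})

# Analysis date: 10 days after the latest purchase, as assumed above.
now = last["LastPurchaseDate"].max() + pd.Timedelta(days=10)

# Subtracting two datetime columns gives timedeltas; .dt.days extracts whole days.
last["Recency"] = (now - last["LastPurchaseDate"]).dt.days
print(last)
```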

## Frequency
Frequency tells us how many times a customer purchased from this e-commerce company. To get it, we count how many distinct invoices (InvoiceNo) belong to each customer (CustomerID).

In [None]:
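# Keep one row per invoice. drop_duplicates returns a new frame here,
# so df_uk keeps all of its rows for the Monetary step.
temp=df_uk.drop_duplicates(subset=['InvoiceNo','CustomerID'],keep='first')
f_df=temp.groupby(by='CustomerID',as_index=False)['InvoiceNo'].count()
f_df.columns=['CustomerID','Frequency']
f_df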
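
Counting distinct invoices per customer can also be sketched with `nunique`, which avoids building a de-duplicated intermediate frame. The `toy` frame below is hypothetical.

```python
import pandas as pd

# Hypothetical invoice lines: customer 1 has two invoices, customer 2 has one.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
})

# Number of distinct invoices per customer; the source frame is left untouched.
freq = (
    toy.groupby("CustomerID")["InvoiceNo"]
    .nunique()
    .rename("Frequency")
    .reset_index()
)
print(freq)
```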

## Monetary
Monetary measures how much a customer has spent over time, that is, the sum of the cost (Quantity times UnitPrice) over all of that customer's invoice lines.

In [None]:
df_uk['Cost']=df_uk.Quantity * df_uk.UnitPrice
df_uk.head(5)

In [None]:
m_df=df_uk.groupby(by='CustomerID',as_index=False)['Cost'].sum()
m_df.columns=['CustomerID','Monetary']
m_df

## RFM Model
In this section, we build the RFM table to segment customers by Recency, Frequency and Monetary. Each of R, F and M gets a score from 1 to 4 based on quartiles, where 4 is the best score: the most recent, the most frequent, or the highest-spending quartile.

In [None]:
RFM=r_df[['CustomerID','Recency']]
RFM=RFM.merge(f_df,on="CustomerID")
RFM=RFM.merge(m_df,on="CustomerID")
RFM.CustomerID=RFM.CustomerID.astype(int)
RFM.head(5)

In [None]:
quantiles= RFM.quantile(q=[0.25,0.5,0.75])
quantiles.to_dict()

In [None]:
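# x: value to score, p: column name, d: quantile table computed above
def Rscore(x,p,d):
    # Recency: lower is better, so the lowest quartile gets the top score of 4
    if x<= d[p][0.25]:
        return 4
    elif x<= d[p][0.5]:
        return 3
    elif x<= d[p][0.75]:
        return 2
    else:
        return 1
    
def FMscore(x,p,d):
    # Frequency and Monetary: higher is better, so the highest quartile gets 4
    if x<= d[p][0.25]:
        return 1
    elif x<= d[p][0.5]:
        return 2
    elif x<= d[p][0.75]:
        return 3
    else:
        return 4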
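
For comparison, quartile scoring of this kind can also be sketched with `pd.qcut`, which bins values at the quartile edges directly. This is an illustration on made-up values, not a drop-in replacement: `pd.qcut` raises a ValueError when quantile edges coincide, which can happen with heavily tied data such as purchase counts.

```python
import pandas as pd

# Hypothetical values with no ties at the quartile edges.
toy = pd.DataFrame({
    "Recency": [5, 20, 40, 80, 160, 200, 250, 300],
    "Monetary": [900.0, 50.0, 300.0, 120.0, 20.0, 700.0, 10.0, 450.0],
})

# Lower recency is better, so the labels run from 4 down to 1.
toy["R_Score"] = pd.qcut(toy["Recency"], q=4, labels=[4, 3, 2, 1]).astype(int)

# Higher spend is better, so the labels run from 1 up to 4.
toy["M_Score"] = pd.qcut(toy["Monetary"], q=4, labels=[1, 2, 3, 4]).astype(int)
print(toy)
```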

In [None]:
rfm_seg=RFM
rfm_seg['R_Score']=rfm_seg.Recency.apply(Rscore,args=('Recency',quantiles))
rfm_seg['F_Score']=rfm_seg.Frequency.apply(FMscore,args=('Frequency',quantiles))
rfm_seg['M_Score']=rfm_seg.Monetary.apply(FMscore,args=('Monetary',quantiles))
rfm_seg

In [None]:
rfm_seg['RFM_Score']=rfm_seg.R_Score.map(str)+rfm_seg.F_Score.map(str)+rfm_seg.M_Score.map(str)
rfm_seg

## RFM Analysis
In this section, we segment our customers according to their RFM score. The rules are as follows:
- Grand customers: 444
- Loyal customers: x4x
- Big Spenders: xx4
- Almost lost big customers: 244 or 234 or 243 or 233
- Lost Big customers: 144 or 134 or 143 or 133
- Less important lost customers: 111

Alternatively, we can define the segments as follows:
- A-class: 444
- B-class: 344,443,434
- C-class: 334,343,433
- Almost lost big customers: 244 or 234 or 243 or 233
- Lost Big customers: 144 or 134 or 143 or 133
- Less important lost customers: 111
- Regular customers: Others
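
The second scheme assigns each score to exactly one class, so it can be sketched as a simple lookup function. The function name `segment` and the sample scores are assumptions for illustration.

```python
import pandas as pd

def segment(score: str) -> str:
    """Map a three-digit RFM score string to a segment label."""
    if score == "444":
        return "A-class"
    if score in {"344", "443", "434"}:
        return "B-class"
    if score in {"334", "343", "433"}:
        return "C-class"
    if score in {"244", "234", "243", "233"}:
        return "Almost lost big customers"
    if score in {"144", "134", "143", "133"}:
        return "Lost big customers"
    if score == "111":
        return "Less important lost customers"
    return "Regular customers"

# Hypothetical RFM scores.
scores = pd.Series(["444", "343", "233", "111", "212"])
labels = scores.map(segment)
print(labels.tolist())
```

Applied to the notebook's table, the same function could be used with `rfm_seg['RFM_Score'].map(segment)`.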

In [None]:
rfm_seg[rfm_seg['RFM_Score']=="444"]

In [None]:
almost_lost=["244","243","234","233"]
lost=["144","143","134","133"]
print("Number of grand customers: ",rfm_seg[rfm_seg['RFM_Score']=="444"].shape[0])
print("Number of loyal customers: ",rfm_seg[rfm_seg['F_Score']==4].shape[0])
print("Number of big spenders: ",rfm_seg[rfm_seg['M_Score']==4].shape[0])
print("Number of almost lost big customers: ",rfm_seg[rfm_seg['RFM_Score'].isin(almost_lost)].shape[0])
print("Number of lost big customers: ",rfm_seg[rfm_seg['RFM_Score'].isin(lost)].shape[0])
print("Number of less important lost customers: ",rfm_seg[rfm_seg['RFM_Score']=="111"].shape[0])

In [None]:
# Percentile rank of each customer within rfm_seg; recency is inverted so that higher is better
rfm_seg['r_percentile']= rfm_seg.Recency.apply(lambda x : 100-stats.percentileofscore(rfm_seg.Recency, x))
rfm_seg['f_percentile']= rfm_seg.Frequency.apply(lambda x : stats.percentileofscore(rfm_seg.Frequency, x))
rfm_seg['m_percentile']= rfm_seg.Monetary.apply(lambda x : stats.percentileofscore(rfm_seg.Monetary, x))
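
Calling percentileofscore once per row rescans the whole column each time. A vectorized sketch with pandas' `rank(pct=True)` is shown below on made-up values; with average ranking for ties it should give the same numbers as `stats.percentileofscore` with its default `kind='rank'`, though that equivalence is worth checking on the real data.

```python
import pandas as pd

# Hypothetical values; the two customers with Recency 20 are tied.
toy = pd.DataFrame({
    "Recency": [10, 20, 20, 40],
    "Monetary": [5.0, 50.0, 500.0, 5000.0],
})

# rank(pct=True) returns the average rank divided by the row count.
toy["m_percentile"] = toy["Monetary"].rank(pct=True) * 100

# Recency is inverted so that more recent customers get a higher percentile.
toy["r_percentile"] = 100 - toy["Recency"].rank(pct=True) * 100
print(toy)
```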

In [None]:
fig = px.scatter_3d(rfm_seg, x='r_percentile', y='f_percentile', z='m_percentile',color="RFM_Score",opacity=0.8,hover_name="CustomerID")
fig.show()