# Customer Segmentation using RFM analysis

This is a transnational data set containing all the transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

We will create customer segments using Recency, Frequency and Monetary (RFM) analysis to understand our customer base. This knowledge can then be used to target customers, for example to improve retention or to pitch offers.

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from pandas.plotting import scatter_matrix

import time, warnings
import datetime as dt

import seaborn as sns


warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('C:\\Users\\DELL\\Basecamp3\\GLabs_Data_Science_Learn\\Customer_Segmentation_with_RFM analysis\\data\\commercial_data.csv')
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,545220,21955,DOORMAT UNION JACK GUNS AND ROSES,2,3/1/2011 8:30,7.95,14620.0,United Kingdom
1,545220,48194,DOORMAT HEARTS,2,3/1/2011 8:30,7.95,14620.0,United Kingdom
2,545220,22556,PLASTERS IN TIN CIRCUS PARADE,12,3/1/2011 8:30,1.65,14620.0,United Kingdom
3,545220,22139,RETROSPOT TEA SET CERAMIC 11 PC,3,3/1/2011 8:30,4.95,14620.0,United Kingdom
4,545220,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,4,3/1/2011 8:30,3.75,14620.0,United Kingdom


### Read the data

In [3]:
data.shape

(236079, 8)

In [4]:
data.dtypes

InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object

In [5]:
'231324'.isdigit()

True

In [6]:
[val for val in data['InvoiceNo'].values if not str(val).isdigit()]

['A563185', 'A563186', 'A563187']

In [7]:
data.isnull().sum()

InvoiceNo          0
StockCode          0
Description      350
Quantity           0
InvoiceDate        0
UnitPrice          0
CustomerID     59942
Country            0
dtype: int64

In [8]:
data[data['CustomerID'].isnull()]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
73,545230,20960,WATERMELON BATH SPONGE,1,3/1/2011 9:55,2.46,,United Kingdom
74,545230,21082,SET/20 FRUIT SALAD PAPER NAPKINS,1,3/1/2011 9:55,1.63,,United Kingdom
75,545230,21488,RED WHITE SCARF HOT WATER BOTTLE,1,3/1/2011 9:55,8.29,,United Kingdom
76,545230,35970,ZINC FOLKART SLEIGH BELLS,1,3/1/2011 9:55,4.13,,United Kingdom
77,545230,82583,HOT BATHS METAL SIGN,1,3/1/2011 9:55,4.13,,United Kingdom
...,...,...,...,...,...,...,...,...
236074,569202,22486,PLASMATRONIC LAMP,1,9/30/2011 17:22,8.29,,United Kingdom
236075,569202,22495,SET OF 2 ROUND TINS CAMEMBERT,1,9/30/2011 17:22,5.79,,United Kingdom
236076,569202,22539,MINI JIGSAW DOLLY GIRL,2,9/30/2011 17:22,0.83,,United Kingdom
236077,569202,22540,MINI JIGSAW CIRCUS PARADE,2,9/30/2011 17:22,0.83,,United Kingdom


In [9]:
data.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

In [10]:
[data.groupby('InvoiceNo')['CustomerID'].apply(set)]

[InvoiceNo
 545220     {14620.0}
 545221     {14740.0}
 545222     {13880.0}
 545223     {16462.0}
 545224     {17068.0}
              ...    
 565577     {14177.0}
 565579     {18283.0}
 A563185        {nan}
 A563186        {nan}
 A563187        {nan}
 Name: CustomerID, Length: 9974, dtype: object]

## Fill null CustomerID values with Guest

In [11]:
data['CustomerID'].fillna('Guest', inplace=True)

In [12]:
data.isnull().sum()

InvoiceNo        0
StockCode        0
Description    350
Quantity         0
InvoiceDate      0
UnitPrice        0
CustomerID       0
Country          0
dtype: int64

In [13]:
data[data['Description'].isnull()]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
2666,545551,15058A,,20,3/3/2011 15:11,0.0,Guest,United Kingdom
2667,545553,84459B,,16,3/3/2011 15:12,0.0,Guest,United Kingdom
3497,545651,21973,,140,3/4/2011 13:14,0.0,Guest,United Kingdom
3820,545666,37468,,67,3/4/2011 15:33,0.0,Guest,United Kingdom
8166,546026,90200C,,4,3/8/2011 17:43,0.0,Guest,United Kingdom
...,...,...,...,...,...,...,...,...
227681,568365,84663A,,30,9/26/2011 15:45,0.0,Guest,United Kingdom
229074,568544,84921,,5,9/27/2011 14:33,0.0,Guest,United Kingdom
229095,568548,21830,,240,9/27/2011 14:40,0.0,Guest,United Kingdom
234919,569083,23084,,3,9/30/2011 11:56,0.0,Guest,United Kingdom


In [14]:
[data.groupby('StockCode')['Description'].apply(set)]

[StockCode
 10002                          {INFLATABLE POLITICAL GLOBE , nan}
 10080                             {GROOVY CACTUS INFLATABLE, nan}
 10120                                              {DOGGY RUBBER}
 10123C                                    {HEARTS WRAPPING TAPE }
 10124A                              {SPOTS ON RED BOOKCOVER TAPE}
                                       ...                        
 gift_0001_10            {nan, Dotcomgiftshop Gift Voucher £10.00}
 gift_0001_20    {to push order througha s stock was , Dotcomgi...
 gift_0001_30            {nan, Dotcomgiftshop Gift Voucher £30.00}
 gift_0001_40                 {Dotcomgiftshop Gift Voucher £40.00}
 gift_0001_50                 {Dotcomgiftshop Gift Voucher £50.00}
 Name: Description, Length: 3542, dtype: object]

In [15]:
stock_dict = (
    data.groupby('StockCode')['Description']
    .apply(list)
    # keep only real string descriptions, dropping NaN placeholders
    .apply(lambda x: set(i for i in x if isinstance(i, str)))
    .to_dict()
)

In [None]:
clean_stock_description = dict()

In [None]:
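for k, v in stock_dict.items():
    if len(v) == 0:
        # no usable description was recorded for this stock code
        clean_stock_description[k] = 'No description'
    else:
        # several variants may exist; take an arbitrary one
        clean_stock_description[k] = list(v)[0]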

In [None]:
data['Description_filled'] = data['StockCode'].map(clean_stock_description)
data['Description'].fillna(data['Description_filled'], inplace=True)
data.drop('Description_filled', axis=1, inplace=True)

In [None]:
data.isnull().sum()

In [17]:
data[data['Description'] =='No description']

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


In [18]:
guest = data[data['CustomerID']=='Guest']
data = data[data['CustomerID'] !='Guest']

In [19]:
guest

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
73,545230,20960,WATERMELON BATH SPONGE,1,3/1/2011 9:55,2.46,Guest,United Kingdom
74,545230,21082,SET/20 FRUIT SALAD PAPER NAPKINS,1,3/1/2011 9:55,1.63,Guest,United Kingdom
75,545230,21488,RED WHITE SCARF HOT WATER BOTTLE,1,3/1/2011 9:55,8.29,Guest,United Kingdom
76,545230,35970,ZINC FOLKART SLEIGH BELLS,1,3/1/2011 9:55,4.13,Guest,United Kingdom
77,545230,82583,HOT BATHS METAL SIGN,1,3/1/2011 9:55,4.13,Guest,United Kingdom
...,...,...,...,...,...,...,...,...
236074,569202,22486,PLASMATRONIC LAMP,1,9/30/2011 17:22,8.29,Guest,United Kingdom
236075,569202,22495,SET OF 2 ROUND TINS CAMEMBERT,1,9/30/2011 17:22,5.79,Guest,United Kingdom
236076,569202,22539,MINI JIGSAW DOLLY GIRL,2,9/30/2011 17:22,0.83,Guest,United Kingdom
236077,569202,22540,MINI JIGSAW CIRCUS PARADE,2,9/30/2011 17:22,0.83,Guest,United Kingdom


### Remove rows where CustomerID is NA
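The cells above get here indirectly, by filling missing IDs with `'Guest'` and then filtering those rows out. A more direct route is `dropna` on the `CustomerID` column. The sketch below runs on a small illustrative frame (the name `demo` is hypothetical) rather than on the full data set:

```python
import numpy as np
import pandas as pd

# Illustrative rows: one known customer and two rows without a CustomerID.
demo = pd.DataFrame({
    'InvoiceNo': ['545220', '545230', 'A563185'],
    'CustomerID': [14620.0, np.nan, np.nan],
})

# Keep only the rows that can be attributed to a customer.
known = demo.dropna(subset=['CustomerID'])
print(known)
```

Keeping the guest rows in a separate frame, as the notebook does, is still useful if anonymous purchases need to be analyzed later.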

## RFM Analysis
RFM (Recency, Frequency, Monetary) analysis is a customer segmentation technique that uses past purchase behavior to divide customers into groups. These groups help identify the customers who are more likely to respond to promotions, and they can also support future personalization services.

**RECENCY (R):** Days since the last purchase

**FREQUENCY (F):** Total number of purchases

**MONETARY VALUE (M):** Total money the customer spent

We will create these three attributes for each customer.
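To make the three definitions concrete, here is a minimal sketch that derives them in one `groupby` on a toy frame. All values, the frame name `toy`, and the `reference_date` are made up for illustration and are not taken from the data set:

```python
import pandas as pd

# Toy transactions (hypothetical values).
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 2, 3],
    'InvoiceNo': ['1001', '1002', '1003', '1003', '1004', '1005'],
    'date': pd.to_datetime(['2011-09-01', '2011-09-20', '2011-06-05',
                            '2011-06-05', '2011-07-01', '2011-03-15']),
    'TotalCost': [10.0, 20.0, 5.0, 7.5, 12.5, 100.0],
})
reference_date = pd.Timestamp('2011-10-01')  # assumed snapshot date

rfm_toy = toy.groupby('CustomerID').agg(
    LastPurchase=('date', 'max'),        # most recent purchase date
    Frequency=('InvoiceNo', 'nunique'),  # number of distinct invoices
    Monetary=('TotalCost', 'sum'),       # total amount spent
)
# Recency is the number of days between the snapshot and the last purchase.
rfm_toy['Recency'] = (reference_date - rfm_toy['LastPurchase']).dt.days
rfm_toy = rfm_toy[['Recency', 'Frequency', 'Monetary']]
print(rfm_toy)
```

The rest of the notebook builds the same three columns step by step in separate frames.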

## Recency
To calculate recency, we need to choose a reference date from which to measure how many days ago each customer's last purchase was.

### Find the latest date in the data to use as a reference

In [20]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,545220,21955,DOORMAT UNION JACK GUNS AND ROSES,2,3/1/2011 8:30,7.95,14620,United Kingdom
1,545220,48194,DOORMAT HEARTS,2,3/1/2011 8:30,7.95,14620,United Kingdom
2,545220,22556,PLASTERS IN TIN CIRCUS PARADE,12,3/1/2011 8:30,1.65,14620,United Kingdom
3,545220,22139,RETROSPOT TEA SET CERAMIC 11 PC,3,3/1/2011 8:30,4.95,14620,United Kingdom
4,545220,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,4,3/1/2011 8:30,3.75,14620,United Kingdom


In [21]:
data['InvoiceDate'].max()

'9/9/2011 9:52'
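Because `InvoiceDate` is still stored as strings at this point, `max()` compares text character by character rather than by calendar date. That is why `'9/9/2011 9:52'` is returned even though rows dated 9/30/2011 appear in the outputs above. A minimal sketch of parsing before taking the maximum, using three timestamps copied from those outputs (the explicit `format` string is an assumption about the column layout):

```python
import pandas as pd

# Sample strings in the same M/D/YYYY H:MM layout as InvoiceDate.
raw = pd.Series(['3/1/2011 8:30', '9/9/2011 9:52', '9/30/2011 17:22'])

# String comparison: '9/9...' sorts after '9/3...'.
print(raw.max())

# Parse to real timestamps first, then compare chronologically.
parsed = pd.to_datetime(raw, format='%m/%d/%Y %H:%M')
print(parsed.max())
```

On the full data set, the same idea would be `pd.to_datetime(data['InvoiceDate']).max()`.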

In [22]:
now = dt.date(2011,12,9)
print(now)

2011-12-09


In [23]:
data['date'] = pd.DatetimeIndex(data['InvoiceDate']).date

In [25]:
data.head(3)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,date
0,545220,21955,DOORMAT UNION JACK GUNS AND ROSES,2,3/1/2011 8:30,7.95,14620,United Kingdom,2011-03-01
1,545220,48194,DOORMAT HEARTS,2,3/1/2011 8:30,7.95,14620,United Kingdom,2011-03-01
2,545220,22556,PLASTERS IN TIN CIRCUS PARADE,12,3/1/2011 8:30,1.65,14620,United Kingdom,2011-03-01


In [35]:
recency_df = data.groupby('CustomerID')['date'].max().reset_index()
recency_df.columns = ['CustomerID', 'LastPurcDate']
recency_df.head()

Unnamed: 0,CustomerID,LastPurcDate
0,12747.0,2011-08-22
1,12748.0,2011-09-30
2,12749.0,2011-08-01
3,12820.0,2011-09-26
4,12821.0,2011-05-09


In [41]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,date
0,545220,21955,DOORMAT UNION JACK GUNS AND ROSES,2,3/1/2011 8:30,7.95,14620,United Kingdom,2011-03-01
1,545220,48194,DOORMAT HEARTS,2,3/1/2011 8:30,7.95,14620,United Kingdom,2011-03-01
2,545220,22556,PLASTERS IN TIN CIRCUS PARADE,12,3/1/2011 8:30,1.65,14620,United Kingdom,2011-03-01
3,545220,22139,RETROSPOT TEA SET CERAMIC 11 PC,3,3/1/2011 8:30,4.95,14620,United Kingdom,2011-03-01
4,545220,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,4,3/1/2011 8:30,3.75,14620,United Kingdom,2011-03-01


### Frequency of purchases

In [55]:
frequency_df = data.groupby('CustomerID')['InvoiceNo'].nunique().reset_index()  # distinct invoices per customer
frequency_df.columns = ['CustomerID', 'Frquency']
frequency_df

Unnamed: 0,CustomerID,Frquency
0,12747.0,5
1,12748.0,96
2,12749.0,3
3,12820.0,1
4,12821.0,1
...,...,...
2859,18280.0,1
2860,18281.0,1
2861,18282.0,1
2862,18283.0,8


### Create a new column called date that contains only the invoice date

### Find each customer's last purchase date and calculate the recency
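`recency_df` so far holds only the last purchase date, not the number of days. A minimal sketch of the remaining step, using a stand-in frame built from the first three rows shown earlier and the `now` reference date chosen in the notebook (the names `recency_demo` and `Recency` are illustrative):

```python
import datetime as dt
import pandas as pd

# Stand-in for recency_df, using rows shown in the notebook output.
recency_demo = pd.DataFrame({
    'CustomerID': [12747.0, 12748.0, 12749.0],
    'LastPurcDate': [dt.date(2011, 8, 22), dt.date(2011, 9, 30), dt.date(2011, 8, 1)],
})
now = dt.date(2011, 12, 9)  # reference date chosen earlier in the notebook

# Recency = days between the reference date and the last purchase.
recency_demo['Recency'] = recency_demo['LastPurcDate'].apply(lambda d: (now - d).days)
print(recency_demo)
```

Applied to the real frame, the same line would be `recency_df['Recency'] = recency_df['LastPurcDate'].apply(lambda d: (now - d).days)`.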

## Frequency
Frequency tells us how many times a customer purchased from us. To compute it, we count how many distinct invoices are registered to the same customer.

### Drop duplicate rows from the data
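One way to approach this step is `DataFrame.drop_duplicates`. The sketch below uses a small illustrative frame (the row values are made up to show the effect). It shows both dropping fully identical rows and keeping a single row per invoice and customer, which is the shape needed for counting purchases:

```python
import pandas as pd

# Illustrative rows: the first two are exact duplicates.
demo = pd.DataFrame({
    'InvoiceNo': ['545220', '545220', '545220', '545221'],
    'StockCode': ['21955', '21955', '48194', '22556'],
    'Quantity': [2, 2, 2, 12],
    'CustomerID': [14620.0, 14620.0, 14620.0, 14740.0],
})

# Drop rows that are identical in every column.
deduped = demo.drop_duplicates()

# Keep one row per (invoice, customer) pair.
invoices = demo.drop_duplicates(subset=['InvoiceNo', 'CustomerID'])

print(len(demo), len(deduped), len(invoices))
```

Counting distinct invoice numbers per customer, as done earlier with `frequency_df`, gives the same frequency without modifying `data`.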

### Calculate the frequency of purchases

## Monetary

**The Monetary attribute answers the question: How much money did the customer spend over time?**

### To do that, we first create a new column, TotalCost, holding the total price of each invoice line

In [56]:
data.head(3)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,date
0,545220,21955,DOORMAT UNION JACK GUNS AND ROSES,2,3/1/2011 8:30,7.95,14620,United Kingdom,2011-03-01
1,545220,48194,DOORMAT HEARTS,2,3/1/2011 8:30,7.95,14620,United Kingdom,2011-03-01
2,545220,22556,PLASTERS IN TIN CIRCUS PARADE,12,3/1/2011 8:30,1.65,14620,United Kingdom,2011-03-01


In [58]:
data['TotalCost'] = data['Quantity'] * data['UnitPrice']
data['TotalCost']

0         15.90
1         15.90
2         19.80
3         14.85
4         15.00
          ...  
235823    17.00
235824    17.00
235825    25.20
235826    25.20
235827    10.20
Name: TotalCost, Length: 176137, dtype: float64

In [68]:
monetary_df = data.groupby('CustomerID').agg({'TotalCost':'sum'}).reset_index()
monetary_df.columns = ['CustomerID', 'Monetary']
monetary_df.head()

Unnamed: 0,CustomerID,Monetary
0,12747.0,1760.09
1,12748.0,14680.85
2,12749.0,2755.23
3,12820.0,217.77
4,12821.0,92.72


In [70]:
monetary_df = data.groupby('CustomerID').agg({'TotalCost':'sum'})
monetary_df

Unnamed: 0_level_0,TotalCost
CustomerID,Unnamed: 1_level_1
12747.0,1760.09
12748.0,14680.85
12749.0,2755.23
12820.0,217.77
12821.0,92.72
...,...
18280.0,180.60
18281.0,80.82
18282.0,100.21
18283.0,802.77


In [77]:
monetary_df = data.groupby('CustomerID')['TotalCost'].sum()
monetary_df

CustomerID
12747.0     1760.09
12748.0    14680.85
12749.0     2755.23
12820.0      217.77
12821.0       92.72
             ...   
18280.0      180.60
18281.0       80.82
18282.0      100.21
18283.0      802.77
18287.0      765.28
Name: TotalCost, Length: 2864, dtype: float64

### Create RFM Table

In [78]:
temp_df = recency_df.merge(frequency_df, on='CustomerID')
temp_df.head()

Unnamed: 0,CustomerID,LastPurcDate,Frquency
0,12747.0,2011-08-22,5
1,12748.0,2011-09-30,96
2,12749.0,2011-08-01,3
3,12820.0,2011-09-26,1
4,12821.0,2011-05-09,1
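The merge above joins only recency and frequency. A minimal sketch of completing the table with the monetary values follows, on stand-in frames built from the first three customers shown in the outputs. The `Recency` values here are days before the 2011-12-09 reference date, and the `_demo` names are hypothetical:

```python
import pandas as pd

ids = [12747.0, 12748.0, 12749.0]

# Stand-ins for the three per-customer frames built earlier.
recency_demo = pd.DataFrame({'CustomerID': ids, 'Recency': [109, 70, 130]})
frequency_demo = pd.DataFrame({'CustomerID': ids, 'Frquency': [5, 96, 3]})
monetary_demo = pd.DataFrame({'CustomerID': ids, 'Monetary': [1760.09, 14680.85, 2755.23]})

# Chain two merges on the shared key, then tidy the column name and index.
rfm_demo = (
    recency_demo
    .merge(frequency_demo, on='CustomerID')
    .merge(monetary_demo, on='CustomerID')
    .rename(columns={'Frquency': 'Frequency'})
    .set_index('CustomerID')
)
print(rfm_demo)
```

Setting `CustomerID` as the index keeps the three numeric columns together, which is convenient for the quartile step.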


## Customer segments with RFM Model

**The simplest way to create customer segments from the RFM model is to use quartiles. We assign a score from 1 to 4 to each of Recency, Frequency and Monetary, where 4 is the best value and 1 is the worst. The final RFM score is obtained by combining the three individual scores.**

Note: Quintiles (scores from 1 to 5) offer finer granularity if the business needs it, but segments become harder to define because there are 5 × 5 × 5 = 125 possible score combinations (from 111 to 555). So, we will use quartiles.

### Find RFM quartiles
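A minimal sketch of this step uses `DataFrame.quantile`. The frame `rfm_demo` and its eight rows are hypothetical, standing in for the full RFM table:

```python
import pandas as pd

# Hypothetical RFM values for eight customers.
rfm_demo = pd.DataFrame({
    'Recency':   [5, 12, 30, 45, 70, 109, 130, 200],
    'Frequency': [96, 8, 5, 3, 3, 2, 1, 1],
    'Monetary':  [14680.85, 2755.23, 1760.09, 802.77, 765.28, 217.77, 92.72, 80.82],
})

# One row per cut point (25%, 50%, 75%), one column per attribute.
quartiles = rfm_demo.quantile(q=[0.25, 0.5, 0.75])
print(quartiles)
```

The result is indexed by the quantile level, so a cut point can be read with, for example, `quartiles.loc[0.5, 'Recency']`.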

## Creation of RFM Segments

We will create two scoring functions because a high recency value is bad, while high frequency and monetary values are good.



### Create functions based on the quartile values and apply them to create segments
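One possible shape for the two functions is sketched below on hypothetical data. The function names `r_score` and `fm_score`, the frame `rfm_demo`, and the use of `<=` at the cut points are all assumptions. The column names `R_Quartile`, `F_Quartile` and `M_Quartile` follow the heading further down:

```python
import pandas as pd

# Hypothetical RFM values for eight customers.
rfm_demo = pd.DataFrame({
    'Recency':   [5, 12, 30, 45, 70, 109, 130, 200],
    'Frequency': [96, 8, 5, 3, 3, 2, 1, 1],
    'Monetary':  [14680.85, 2755.23, 1760.09, 802.77, 765.28, 217.77, 92.72, 80.82],
})
quartiles = rfm_demo.quantile(q=[0.25, 0.5, 0.75])

def r_score(value, cuts):
    """Recency: fewer days is better, so the lowest quartile scores 4."""
    if value <= cuts.loc[0.25]:
        return 4
    if value <= cuts.loc[0.5]:
        return 3
    if value <= cuts.loc[0.75]:
        return 2
    return 1

def fm_score(value, cuts):
    """Frequency and Monetary: higher is better, so the top quartile scores 4."""
    if value <= cuts.loc[0.25]:
        return 1
    if value <= cuts.loc[0.5]:
        return 2
    if value <= cuts.loc[0.75]:
        return 3
    return 4

rfm_demo['R_Quartile'] = rfm_demo['Recency'].apply(lambda v: r_score(v, quartiles['Recency']))
rfm_demo['F_Quartile'] = rfm_demo['Frequency'].apply(lambda v: fm_score(v, quartiles['Frequency']))
rfm_demo['M_Quartile'] = rfm_demo['Monetary'].apply(lambda v: fm_score(v, quartiles['Monetary']))
print(rfm_demo)
```

Ties at a cut point fall into the lower bucket here. When many customers share the same frequency, the four groups will not be equally sized.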

### Now that each customer has a score, combine the scores (R_Quartile, F_Quartile, M_Quartile) to represent the customer segmentation
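A simple way to combine the three scores is to concatenate them as strings, so that a customer scoring 4 on every attribute gets `'444'`. A minimal sketch on hypothetical scores (the column name `RFMScore` is an assumption):

```python
import pandas as pd

# Hypothetical quartile scores for three customers.
scores = pd.DataFrame(
    {'R_Quartile': [4, 4, 1], 'F_Quartile': [4, 2, 1], 'M_Quartile': [4, 3, 1]},
    index=pd.Index([1, 2, 3], name='CustomerID'),
)

# Concatenate the digits rather than adding them, so 4-2-3 stays distinct from 3-3-3.
scores['RFMScore'] = (
    scores['R_Quartile'].astype(str)
    + scores['F_Quartile'].astype(str)
    + scores['M_Quartile'].astype(str)
)
print(scores)
```

Concatenation keeps the position of each digit meaningful, which makes it easy to filter on a specific pattern later.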

### Find out the best customers

## Learner Activity

**1. Find the following:**
1. Best Customer

2. Loyal Customer

3. Big Spenders

4. Almost lost customers

5. Lost customers
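As a starting point for this activity, the segments can be expressed as boolean masks over the quartile scores. The score patterns used below (for example `'444'` for best customers and `'244'` for almost lost) are one common convention and an assumption here, not a definition given in this notebook. The data is hypothetical:

```python
import pandas as pd

# Hypothetical scored customers.
scored = pd.DataFrame(
    {'R_Quartile': [4, 3, 1, 2, 1],
     'F_Quartile': [4, 4, 1, 4, 4],
     'M_Quartile': [4, 2, 4, 4, 4]},
    index=pd.Index([101, 102, 103, 104, 105], name='CustomerID'),
)
scored['RFMScore'] = (
    scored['R_Quartile'].astype(str)
    + scored['F_Quartile'].astype(str)
    + scored['M_Quartile'].astype(str)
)

# Assumed segment definitions; adjust the patterns to the business context.
segments = {
    'Best customers': scored['RFMScore'] == '444',
    'Loyal customers': scored['F_Quartile'] == 4,
    'Big spenders': scored['M_Quartile'] == 4,
    'Almost lost': scored['RFMScore'] == '244',
    'Lost customers': scored['RFMScore'] == '144',
}
counts = {name: int(mask.sum()) for name, mask in segments.items()}
print(counts)
```

The masks overlap by design: a best customer is also a loyal customer and a big spender. The counts therefore describe segment sizes rather than a partition of the customer base.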

**2. Now that we know our customer segments, how will you target them?**