<a href="https://colab.research.google.com/github/Abdulmujeeb-Taiwo/Customer-Segmentation/blob/main/RFM_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
!wget "https://archive.ics.uci.edu/static/public/502/online+retail+ii.zip"

--2024-12-04 14:20:21--  https://archive.ics.uci.edu/static/public/502/online+retail+ii.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘online+retail+ii.zip’

online+retail+ii.zi     [        <=>         ]  43.51M  28.6MB/s    in 1.5s    

2024-12-04 14:20:23 (28.6 MB/s) - ‘online+retail+ii.zip’ saved [45622418]



In [8]:
!unzip "online+retail+ii.zip"

Archive:  online+retail+ii.zip
 extracting: online_retail_II.xlsx   


In [9]:
df = pd.read_excel("online_retail_II.xlsx")

# **Understanding the data**

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525461 entries, 0 to 525460
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Invoice      525461 non-null  object        
 1   StockCode    525461 non-null  object        
 2   Description  522533 non-null  object        
 3   Quantity     525461 non-null  int64         
 4   InvoiceDate  525461 non-null  datetime64[ns]
 5   Price        525461 non-null  float64       
 6   Customer ID  417534 non-null  float64       
 7   Country      525461 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 32.1+ MB


In [11]:
df.isnull().sum()

Unnamed: 0,0
Invoice,0
StockCode,0
Description,2928
Quantity,0
InvoiceDate,0
Price,0
Customer ID,107927
Country,0


In [12]:
df.nunique()

Unnamed: 0,0
Invoice,28816
StockCode,4632
Description,4681
Quantity,825
InvoiceDate,25296
Price,1606
Customer ID,4383
Country,40


In [13]:
#The invoice has some Cancellation which may stand C and Adjustment which may stand for A
#By Removiing the uncertain invoice, we can have the total sales
df_wc = df[~df['Invoice'].str.contains("C|A", na=False)]

In [14]:
df_wc.dtypes

Unnamed: 0,0
Invoice,object
StockCode,object
Description,object
Quantity,int64
InvoiceDate,datetime64[ns]
Price,float64
Customer ID,float64
Country,object


In [15]:
df_wc['Invoice'] = df_wc['Invoice'].astype(np.int64)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_wc['Invoice'] = df_wc['Invoice'].astype(np.int64)


In [16]:
#To get the total amount of sales for each products that customer purchersed we sum up the qualtity and price

df_wc["Total_amount"] = df_wc["Quantity"] + df_wc["Price"]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_wc["Total_amount"] = df_wc["Quantity"] + df_wc["Price"]


In [17]:
df_wc.describe().T

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
Invoice,515252.0,514496.924179,489434.0,501879.0,514826.0,527301.0,538171.0,14439.209494
Quantity,515252.0,10.956689,-9600.0,1.0,3.0,10.0,19152.0,104.354314
InvoiceDate,515252.0,2010-06-28 17:40:54.093763584,2009-12-01 07:45:00,2010-03-21 13:27:00,2010-07-06 13:13:00,2010-10-15 14:27:00,2010-12-09 20:01:00,
Price,515252.0,4.221416,0.0,1.25,2.1,4.21,25111.09,63.435424
Customer ID,407695.0,15368.504107,12346.0,13997.0,15321.0,16812.0,18287.0,1679.7957
Total_amount,515252.0,15.178105,-9600.0,4.66,8.45,13.25,25112.09,121.907233


# RFM ANALYSIS

**Recency**

In [18]:
import datetime as dt

In [19]:
print("The Starting Date of Data Collection:", df_wc.InvoiceDate.min())
print("The Ending Date of Data Collection:", df_wc.InvoiceDate.max())

The Starting Date of Data Collection: 2009-12-01 07:45:00
The Ending Date of Data Collection: 2010-12-09 20:01:00


In [20]:
day_after_end_date = dt.datetime(2010,12,11)

In [21]:
df_wc["recency"] = (day_after_end_date - df_wc.InvoiceDate).dt.days

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_wc["recency"] = (day_after_end_date - df_wc.InvoiceDate).dt.days


In [22]:
df_wc.groupby("Customer ID")["recency"].min().sort_values(ascending=False)

Unnamed: 0_level_0,recency
Customer ID,Unnamed: 1_level_1
16763.0,374
12636.0,374
17056.0,374
12362.0,374
13526.0,374
...,...
18102.0,1
15811.0,1
17198.0,1
14896.0,1


**Frequency**

In [53]:
frequency = df_wc.Invoice.value_counts().values

In [55]:
df_wc.Invoice.value_counts()

Unnamed: 0_level_0,count
Invoice,Unnamed: 1_level_1
537434,675
538071,652
537638,601
537237,597
536876,593
...,...
527135,1
527134,1
527131,1
499976,1


**Monetary**

In [62]:
Monetary = df_wc.groupby("Customer ID")["Total_amount"].sum().values

In [64]:
df_wc.groupby("Customer ID")["Total_amount"].sum()

Unnamed: 0_level_0,Total_amount
Customer ID,Unnamed: 1_level_1
12346.0,276.36
12347.0,990.95
12348.0,387.39
12349.0,1868.34
12351.0,310.46
...,...
18283.0,834.82
18284.0,585.09
18285.0,245.20
18286.0,894.30
