<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Intro" data-toc-modified-id="Intro-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Intro</a></span></li><li><span><a href="#Import-Packages-+-Data" data-toc-modified-id="Import-Packages-+-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import Packages + Data</a></span></li><li><span><a href="#Explore-+-Clean-Data" data-toc-modified-id="Explore-+-Clean-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Explore + Clean Data</a></span><ul class="toc-item"><li><span><a href="#Explore-Country-Metrics" data-toc-modified-id="Explore-Country-Metrics-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Explore Country Metrics</a></span></li><li><span><a href="#Check-For-Null-Values" data-toc-modified-id="Check-For-Null-Values-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Check For Null Values</a></span></li><li><span><a href="#Check-For-Duplicates" data-toc-modified-id="Check-For-Duplicates-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Check For Duplicates</a></span></li><li><span><a href="#Cancelled-Orders" data-toc-modified-id="Cancelled-Orders-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Cancelled Orders</a></span></li><li><span><a href="#RMF-Variables" data-toc-modified-id="RMF-Variables-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>RMF Variables</a></span></li><li><span><a href="#Summary-Statistics" data-toc-modified-id="Summary-Statistics-3.6"><span class="toc-item-num">3.6&nbsp;&nbsp;</span>Summary Statistics</a></span></li><li><span><a href="#Add-TotalPrice-Column" data-toc-modified-id="Add-TotalPrice-Column-3.7"><span class="toc-item-num">3.7&nbsp;&nbsp;</span>Add TotalPrice Column</a></span></li><li><span><a href="#Remove-Outliers" data-toc-modified-id="Remove-Outliers-3.8"><span class="toc-item-num">3.8&nbsp;&nbsp;</span>Remove Outliers</a></span></li></ul></li><li><span><a href="#Future-Work" data-toc-modified-id="Future-Work-4"><span 
class="toc-item-num">4&nbsp;&nbsp;</span>Future Work</a></span></li></ul></div>

# Intro

We're going to do an RFM analysis. RFM stands for Recency, Frequency and Monetary Value. We will rank customers on these three measures and then use K-Means clustering to segment them. With these segments identified, we can target our marketing efforts at each group to increase revenue while retaining customers.

**Recency** is how recently a customer made a purchase. 

**Frequency** is how often a customer makes a purchase. 

**Monetary Value** is the amount of money a customer spent in a given period.
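
To make the idea of ranking customers on the three measures concrete, here is a minimal sketch. The `rfm` table and its values are invented for illustration and are not taken from the dataset. Each metric is ranked so that a higher rank always means a "better" customer.

```python
import pandas as pd

# Hypothetical per-customer RFM table (values invented for illustration)
rfm = pd.DataFrame(
    {
        "Recency": [5, 40, 200, 12],               # days since last purchase
        "Frequency": [12, 3, 1, 7],                # number of invoices
        "Monetary": [950.0, 120.0, 35.0, 410.0],   # total spend
    },
    index=pd.Index([101, 102, 103, 104], name="CustomerID"),
)

# Rank each metric so that a higher rank is always better:
# a low Recency is good, a high Frequency or Monetary value is good.
ranks = pd.DataFrame(
    {
        "R_rank": rfm["Recency"].rank(ascending=False),
        "F_rank": rfm["Frequency"].rank(ascending=True),
        "M_rank": rfm["Monetary"].rank(ascending=True),
    }
)
print(ranks)
```

Ranks like these (or the raw, scaled metrics) are one possible input to a clustering step; which of the two works better is something to test rather than assume.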

# Import Packages + Data

In [1]:
# Import packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# Import data (read_excel already returns a DataFrame)
df = pd.read_excel('Data/Online_Retail.xlsx')

# Preview
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


We can see here there are multiple items and quantities purchased on each invoice. I will create another column that shows total spent on each item, so Quantity * UnitPrice. That way we can group by invoice number, customer, etc. and see the total they spent per invoice and item.
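
As a small sketch of the per-line total and the per-invoice grouping described above, using a few rows copied from the outputs in this notebook (the `sample` frame itself is just an illustration, not the full dataset):

```python
import pandas as pd

# A few line items copied from the notebook outputs
sample = pd.DataFrame(
    {
        "InvoiceNo": [536365, 536365, 536365, 536412, 536412],
        "Quantity": [6, 6, 8, 1, 6],
        "UnitPrice": [2.55, 3.39, 2.75, 2.95, 1.25],
    }
)

# Total spent on each line item
sample["TotalPrice"] = sample["Quantity"] * sample["UnitPrice"]

# Total spent per invoice
per_invoice = sample.groupby("InvoiceNo")["TotalPrice"].sum()
print(per_invoice.round(2))
```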

We're also going to be adding Recency, Frequency and Monetary columns so we can conduct an RFM analysis and segment customers that way as well. 

Let's take a look at some of the basics before we hop into it. 

# Explore + Clean Data

In [3]:
# Info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null datetime64[ns]
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


**InvoiceNo** is currently an object. I had planned to change it to an integer, but grouping by invoice number works on object columns as well, and as we'll see under Cancelled Orders some invoice numbers start with a 'C', so an integer cast would fail. 

**StockCode** can stay an object; the values look like strings (e.g. 85123A). 

It's great that **InvoiceDate** is already in datetime format, because we can peek at some time series in the EDA to see if we can collect any further insights. 

## Explore Country Metrics

How many countries are there?

In [4]:
len(df.Country.unique())

38

How many unique customers per country?

In [74]:
# Preview first 20 (out of 38)
df.groupby(['Country'])['CustomerID'].nunique().sort_values(ascending=False).head(20)

Country
United Kingdom     3950
Germany              95
France               87
Spain                31
Belgium              25
Switzerland          21
Portugal             19
Italy                15
Finland              12
Austria              11
Norway               10
Channel Islands       9
Netherlands           9
Denmark               9
Australia             9
Cyprus                8
Sweden                8
Japan                 8
Poland                6
Canada                4
Name: CustomerID, dtype: int64

We can see the UK has the most customers by a lot. Now I'm curious to see which countries have NaN values for CustomerID.

Another thing to note is that most of the orders come from the UK. On further exploration, this online retailer looks like a wholesaler (large-volume orders and similar products such as plates, napkins, lunch bags and doilies sold in different lines), so it is most likely B2B and its customers are more niche than general. For these reasons we will keep the data from all countries. We can always come back and select just the UK if we'd like to.

In [111]:
# Rows with a missing CustomerID, counted per country (out of ~540K rows)
df[df.CustomerID.isnull()].groupby(['Country']).size().sort_values(ascending=False)


Country
United Kingdom    133600
EIRE                 711
Hong Kong            288
Unspecified          202
Switzerland          125
France                66
Israel                47
Portugal              39
Bahrain                2
dtype: int64

Rows with a missing CustomerID appear in only nine countries, and the vast majority are from the United Kingdom. The UK also has by far the most rows overall, so this is not surprising on its own. The share of missing IDs per country could be calculated precisely to confirm whether the effect is concentrated in any particular country. 
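
One way to calculate this precisely is to take the mean of the null flag within each country, which gives the share of rows with a missing CustomerID. The sketch below uses a small invented `toy` frame rather than the real data; on the notebook's `df` the same expression would apply before the null rows are dropped.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame(
    {
        "Country": ["United Kingdom"] * 4 + ["EIRE"] * 2 + ["France"] * 2,
        "CustomerID": [17850.0, np.nan, 13047.0, np.nan,
                       np.nan, 14911.0, 12583.0, 12583.0],
    }
)

# Percent of rows per country with a missing CustomerID
null_share = (
    toy["CustomerID"].isnull()
    .groupby(toy["Country"])
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(null_share)
```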

## Check For Null Values

In [116]:
# Check for missing values

df.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

It looks like we have a good amount of Null values for **CustomerID** and **Description**. Let's see how much of the total this accounts for.

In [117]:
# Description
print('Description Percent Null Values:')
print(f"{((df.Description.isnull().sum())/len(df.Description)*100).round(4)} % \n")

# CustomerID
print('CustomerID Percent Null Values:')
print(f"{((df.CustomerID.isnull().sum())/len(df.CustomerID)*100).round(4)} % \n")

print('==============================')

Description Percent Null Values:
0.2683 % 

CustomerID Percent Null Values:
24.9267 % 



The number of missing values for the **Description** column is small, but for the **CustomerID** column it is large at almost 25%. I'm curious how many customers there are. Let's take a look at the number of unique values.

In [118]:
# Unique CustomerIDs

print(f'No. of unique CustomerIDs: \n{len(df.CustomerID.value_counts())}')

No. of unique CustomerIDs: 
4372


Since we still have data from over 4,300 customers, and we have no way of identifying the customers behind the Null **CustomerID** rows, it makes sense to remove those rows. 

Since this is an RFM analysis, we don't require the product descriptions. However, future work on identifying which products and categories each group prefers would need this information. For such an analysis we could impute the missing descriptions or simply drop them, as they represent only a fraction of a percent of the dataset.

In [119]:
# Drop rows with a null CustomerID
df = df[df['CustomerID'].notna()]

In [120]:
df.isnull().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

It looks like dropping the rows with a null CustomerID also removed the rows with a null Description. Well, that worked out perfectly. 

I'm curious whether the nulls came from any particular country. Almost 25% of the rows had a null CustomerID, which is a significant amount of data and could skew the true RFM values for a customer. 

## Check For Duplicates

In [126]:
# Preview duplicates
display(df[df.duplicated()].head())
display(df[df.duplicated()].tail())

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
517,536409,21866,UNION JACK FLAG LUGGAGE TAG,1,2010-12-01 11:45:00,1.25,17908.0,United Kingdom
527,536409,22866,HAND WARMER SCOTTY DOG DESIGN,1,2010-12-01 11:45:00,2.1,17908.0,United Kingdom
537,536409,22900,SET 2 TEA TOWELS I LOVE LONDON,1,2010-12-01 11:45:00,2.95,17908.0,United Kingdom
539,536409,22111,SCOTTIE DOG HOT WATER BOTTLE,1,2010-12-01 11:45:00,4.95,17908.0,United Kingdom
555,536412,22327,ROUND SNACK BOXES SET OF 4 SKULLS,1,2010-12-01 11:49:00,2.95,17920.0,United Kingdom


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
541675,581538,22068,BLACK PIRATE TREASURE CHEST,1,2011-12-09 11:34:00,0.39,14446.0,United Kingdom
541689,581538,23318,BOX OF 6 MINI VINTAGE CRACKERS,1,2011-12-09 11:34:00,2.49,14446.0,United Kingdom
541692,581538,22992,REVOLVER WOODEN RULER,1,2011-12-09 11:34:00,1.95,14446.0,United Kingdom
541699,581538,22694,WICKER STAR,1,2011-12-09 11:34:00,2.1,14446.0,United Kingdom
541701,581538,23343,JUMBO BAG VINTAGE CHRISTMAS,1,2011-12-09 11:34:00,2.08,14446.0,United Kingdom


The head and tail of the flagged rows don't look like duplicates on their own, since each is shown without its matching row. Let's take a closer look.

In [130]:
df[df.InvoiceNo == 536412].duplicated().tail(10)

612    False
613    False
614    False
615    False
616     True
617     True
618     True
619    False
620     True
621    False
dtype: bool

In [131]:
# Select by index label (loc is inclusive), so the slice stays correct after rows were dropped
df.loc[615:621]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
615,536412,85184C,S/4 VALENTINE DECOUPAGE HEART BOX,1,2010-12-01 11:49:00,2.95,17920.0,United Kingdom
616,536412,21708,FOLDING UMBRELLA CREAM POLKADOT,1,2010-12-01 11:49:00,4.95,17920.0,United Kingdom
617,536412,22900,SET 2 TEA TOWELS I LOVE LONDON,2,2010-12-01 11:49:00,2.95,17920.0,United Kingdom
618,536412,21706,FOLDING UMBRELLA RED/WHITE POLKADOT,1,2010-12-01 11:49:00,4.95,17920.0,United Kingdom
619,536412,22988,SOLDIERS EGG CUP,6,2010-12-01 11:49:00,1.25,17920.0,United Kingdom
620,536412,85184C,S/4 VALENTINE DECOUPAGE HEART BOX,1,2010-12-01 11:49:00,2.95,17920.0,United Kingdom
621,536412,20750,RED RETROSPOT MINI CASES,1,2010-12-01 11:49:00,7.95,17920.0,United Kingdom


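At first glance none of these look like duplicates, but rows 615 and 620 are in fact identical. By default (`keep='first'`), `df.duplicated()` flags only the second and later copies of a row, so the first copies of rows 616, 617 and 618 presumably sit earlier in the same invoice, outside the slice shown. We're going to leave these rows in for now, since the same item may simply have been entered twice on one invoice. 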
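
To see how the flagging behaves, here is a small sketch on a `toy` frame built from a subset of the columns of rows 615, 619, 620 and 621 above. Passing `keep=False` marks every copy of a duplicated row, which makes the matching pairs visible side by side.

```python
import pandas as pd

# Subset of columns from rows 615, 619, 620 and 621 of the notebook output
toy = pd.DataFrame(
    {
        "InvoiceNo": [536412, 536412, 536412, 536412],
        "StockCode": ["85184C", "22988", "85184C", "20750"],
        "Quantity": [1, 6, 1, 1],
    },
    index=[615, 619, 620, 621],
)

# Default keep='first': only the later copy (620) is flagged
later_copies = toy.duplicated()

# keep=False: both copies (615 and 620) are flagged
all_copies = toy.duplicated(keep=False)

print(toy[all_copies])
```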

## Cancelled Orders

In [136]:
df[df.Quantity < 0]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
141,C536379,D,Discount,-1,2010-12-01 09:41:00,27.50,14527.0,United Kingdom
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,2010-12-01 09:49:00,4.65,15311.0,United Kingdom
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,2010-12-01 10:24:00,1.65,17548.0,United Kingdom
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548.0,United Kingdom
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548.0,United Kingdom
...,...,...,...,...,...,...,...,...
540449,C581490,23144,ZINC T-LIGHT HOLDER STARS SMALL,-11,2011-12-09 09:57:00,0.83,14397.0,United Kingdom
541541,C581499,M,Manual,-1,2011-12-09 10:28:00,224.69,15498.0,United Kingdom
541715,C581568,21258,VICTORIAN SEWING BOX LARGE,-5,2011-12-09 11:57:00,10.95,15311.0,United Kingdom
541716,C581569,84978,HANGING HEART JAR T-LIGHT HOLDER,-1,2011-12-09 11:58:00,1.25,17315.0,United Kingdom


It looks like negative-quantity orders have a 'C' before the InvoiceNo. I checked the dataset description, and it turns out these are cancelled orders. Ideally, we would remove the initial order associated with each one.

My initial thought was to look up the number after the C, but those invoice numbers don't seem to appear in the data. The cancellations are still associated with a CustomerID, so for now I'm going to leave them in. We want a customer's returns to count towards the total value they bring to the company, as leaving them out would misrepresent the customer.
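
As a sketch of how cancellations could be flagged while still counting towards a customer's value, the example below uses three rows for customer 14527 copied from the output that follows. The `IsCancelled` column name is my own choice for illustration.

```python
import pandas as pd

# Three rows for customer 14527, copied from the notebook output
toy = pd.DataFrame(
    {
        "InvoiceNo": ["C536379", 537159, 537159],
        "Quantity": [-1, 6, 1],
        "UnitPrice": [27.50, 4.95, 4.95],
        "CustomerID": [14527.0, 14527.0, 14527.0],
    }
)

# InvoiceNo mixes integers and strings, so cast to str before testing the prefix
toy["IsCancelled"] = toy["InvoiceNo"].astype(str).str.startswith("C")

toy["TotalPrice"] = toy["Quantity"] * toy["UnitPrice"]

# Net value keeps cancellations and discounts; gross value ignores them
net_value = toy["TotalPrice"].sum()
gross_value = toy.loc[~toy["IsCancelled"], "TotalPrice"].sum()
print(round(net_value, 2), round(gross_value, 2))
```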

In [141]:
df[df.CustomerID == 14527]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
141,C536379,D,Discount,-1,2010-12-01 09:41:00,27.50,14527.0,United Kingdom
8963,537159,22112,CHOCOLATE HOT WATER BOTTLE,6,2010-12-05 13:17:00,4.95,14527.0,United Kingdom
8964,537159,22111,SCOTTIE DOG HOT WATER BOTTLE,1,2010-12-05 13:17:00,4.95,14527.0,United Kingdom
8965,537159,21479,WHITE SKULL HOT WATER BOTTLE,1,2010-12-05 13:17:00,3.75,14527.0,United Kingdom
8966,537159,22114,HOT WATER BOTTLE TEA AND SYMPATHY,6,2010-12-05 13:17:00,3.95,14527.0,United Kingdom
...,...,...,...,...,...,...,...,...
533807,581114,22111,SCOTTIE DOG HOT WATER BOTTLE,1,2011-12-07 12:19:00,4.95,14527.0,United Kingdom
533808,581114,22835,HOT WATER BOTTLE I AM SO POORLY,2,2011-12-07 12:19:00,4.95,14527.0,United Kingdom
533809,581114,22114,HOT WATER BOTTLE TEA AND SYMPATHY,6,2011-12-07 12:19:00,4.25,14527.0,United Kingdom
533810,581114,21479,WHITE SKULL HOT WATER BOTTLE,2,2011-12-07 12:19:00,4.25,14527.0,United Kingdom


We can see this customer has made many orders, and has a discount given. We want to include all purchases, cancelled orders and discounts so we can get a full and accurate picture of each customer. 

## RMF Variables

Now that we've cleaned our data, we're going to add in Recency, Frequency and Monetary Value.
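
A minimal sketch of how the three variables could be derived per customer is shown below. It assumes Recency is measured in days from a snapshot date set one day after the last invoice in the data, Frequency is the number of distinct invoices, and Monetary Value is the sum of Quantity * UnitPrice. These definitions are assumptions for illustration, and the `toy` rows are a handful of lines copied from the outputs above.

```python
import pandas as pd

# A handful of rows copied from the notebook outputs
toy = pd.DataFrame(
    {
        "CustomerID": [17850.0, 17850.0, 14527.0, 14527.0, 14527.0],
        "InvoiceNo": ["536365", "536365", "C536379", "537159", "581114"],
        "InvoiceDate": pd.to_datetime(
            [
                "2010-12-01 08:26:00",
                "2010-12-01 08:26:00",
                "2010-12-01 09:41:00",
                "2010-12-05 13:17:00",
                "2011-12-07 12:19:00",
            ]
        ),
        "Quantity": [6, 8, -1, 6, 2],
        "UnitPrice": [2.55, 2.75, 27.50, 4.95, 4.95],
    }
)
toy["TotalPrice"] = toy["Quantity"] * toy["UnitPrice"]

# Assumed snapshot date: one day after the most recent invoice
snapshot = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

grouped = toy.groupby("CustomerID")
rfm = pd.DataFrame(
    {
        "Recency": (snapshot - grouped["InvoiceDate"].max()).dt.days,
        "Frequency": grouped["InvoiceNo"].nunique(),
        "Monetary": grouped["TotalPrice"].sum(),
    }
)
print(rfm.round(2))
```

Note that under this definition a cancellation invoice counts towards Frequency; whether that is desirable depends on how returns should be treated.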

## Summary Statistics

In [125]:
# Summary statistics

df.describe().round(2)

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,406829.0,406829.0,406829.0
mean,12.06,3.46,15287.69
std,248.69,69.32,1713.6
min,-80995.0,0.0,12346.0
25%,2.0,1.25,13953.0
50%,5.0,1.95,15152.0
75%,12.0,3.75,16791.0
max,80995.0,38970.0,18287.0


## Add TotalPrice Column

In [None]:
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

In [None]:
df.describe().round(2)

**Explore this further. Are these isolated events of extremely high Quantity ordered and then returned?**

In [None]:
df.TotalPrice.hist();

## Remove Outliers

It looks like there are some major outliers in our dataset. Let's remove them. 

In [None]:
# Percentiles for Quantity, UnitPrice and TotalPrice

# Define percentiles
percentiles = [0, 2.5, 97.5, 100]

# Print them out, one column at a time
for col in ['Quantity', 'UnitPrice', 'TotalPrice']:
    for i in percentiles:
        q = i / 100
        print("{} percentile {}: {}".format(q, col, df[col].quantile(q=q)))

I'm going to remove the negative **Quantity** values, which may be returns, since the 1st percentile is -2.0 and the 2.5th percentile is 1.0. We also removed all of the negative **UnitPrice** values when we removed the Null **CustomerID** values. 

We're also going to keep only rows with a **UnitPrice** greater than 0.0, i.e. items that actually have a price. The smallest remaining value is 0.001. 

In [None]:
# Remove extreme outliers using fixed lower and upper cut-offs

# Get original length to see percent removed
orig_tot = len(df)

# Subset to remove extreme outliers
# Quantity
df = df[(df.Quantity > 0.0) & (df.Quantity <= 120.0)]
# UnitPrice
df = df[(df.UnitPrice > 0.0) & (df.UnitPrice <= 15.0)]

# Calculate percent removed
print('Percent removed:', (orig_tot - len(df)) / orig_tot * 100)

We saw that removing the rows with Null **CustomerIDs** also removed the negative **UnitPrices**, and I'm wondering if it would be best to remove the rows with a negative **Quantity** as well. The 1st percentile is -2.0, the 2nd is -1.0 and the 2.5th is 1.0. 

I will keep it standard for now with percentiles, however it ma

It seems returns are rare, which we can see from the 1st percentile of **Quantity** being -2.0. I'm wondering if returns should be removed altogether since they are rare, or if certain segments of customers are more prone to returns.

In [None]:
# Pass the data by keyword; recent seaborn versions no longer accept it positionally
sns.boxplot(x=df.Quantity)
plt.show()

sns.boxplot(x=df.UnitPrice)
plt.show()

sns.boxplot(x=df.TotalPrice)
plt.show()

In [None]:
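# distplot is deprecated; histplot with a KDE overlay is the closest replacement
sns.histplot(df.Quantity, kde=True)
plt.show()
sns.histplot(df.UnitPrice, kde=True)
plt.show()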

It seems that removing the Null **CustomerID** data also removed all of the negative **UnitPrice** values. 

We can see visually that there are some major outliers. They are easy to spot in this dataset, so I could remove them by eye, but I'm going to remove them by cutting off the upper and lower percentiles instead. 
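
A percentile-based cut could look like the sketch below. The 2.5% and 97.5% bounds match the percentiles printed earlier, but the `toy` values are invented apart from the two extremes, which are the minimum and maximum **Quantity** from the summary statistics.

```python
import pandas as pd

# 98 ordinary quantities plus the two extremes seen in the summary statistics
toy = pd.DataFrame({"Quantity": list(range(1, 99)) + [-80995, 80995]})

# Lower and upper bounds at the 2.5th and 97.5th percentiles
lower, upper = toy["Quantity"].quantile([0.025, 0.975])

# Keep only rows inside the bounds (inclusive on both ends)
trimmed = toy[toy["Quantity"].between(lower, upper)]

print(lower, upper, len(trimmed))
```

Unlike fixed cut-offs, bounds computed this way move with the data, so they would need to be recomputed if the filtering steps before them change.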

In [None]:
df[['Quantity','UnitPrice','TotalPrice']]

In [None]:
# Pairplot
sns.pairplot(df[['Quantity','UnitPrice','TotalPrice']]);

We can see there appears to be a clear linear relationship between **TotalPrice** and **Quantity** and **TotalPrice** and **UnitPrice**, with no apparent relationship between **UnitPrice** and **Quantity**. This intuitively makes sense as **TotalPrice** is calculated as **Quantity * UnitPrice**. 

We could also take a look at a correlation matrix.

In [None]:
# Correlation matrix

df[['Quantity','UnitPrice','TotalPrice']].corr().round(2)

Interestingly, the correlation between **UnitPrice** and **TotalPrice** is actually weak, and the correlation between **UnitPrice** and **Quantity** is also weak, though stronger than the former. 

Let's take a look at the **Country** data. 

In [None]:
# Number of unique invoices
len(df.InvoiceNo.unique())

In [None]:
# Number of unique invoices per country
df.groupby(['Country'])['InvoiceNo'].nunique().sort_values(ascending=False)

We can see most of the purchases are from the **United Kingdom**. We could model the UK exclusively, however since this is from an online retailer, I'm interested to see if there are similar groups across countries. 

We can always come back and model for the top country or countries (in terms of orders) if we find the model works better that way. 

On that note, I'm curious to see what types of items these customers are purchasing online. Let's take a look at the **Descriptions**.

In [None]:
df.Description.value_counts(ascending=False)[0:20]

In [None]:
len(df.Description.unique())

There are 3,833 unique descriptions. 

We can see there are a lot of lunch bags. I'm curious whether it would be worth creating a category, for example setting **Description** to 'LUNCH BAG' and moving the specific type to a different field, or removing the specific type altogether. 

This wouldn't be as important for an RFM analysis with segmentation. However, for a segmentation that also looks at the types of items purchased, this information would be valuable. 

Let's explore potential categories a bit more. 

In [None]:
df[df.Description.str.contains('LUNCH BAG') == True]

Lots of lunch bag purchases. 

In [None]:
df[df.Description.str.contains('LUNCH BAG') == True].Description.value_counts()

It looks like 'VINTAGE DOILEY' is meant to be 'VINTAGE DOILY'. I can correct that here.

In [None]:
# Rename to match category
# Use .loc to assign in place and avoid chained-assignment problems
df.loc[df.Description == 'LUNCH BAG VINTAGE DOILEY ', 'Description'] = 'LUNCH BAG VINTAGE DOILY '

One customer purchased 40! I wonder what these are for. Businesses, parties, special events? 

In [None]:
df[df.Description.str.contains('LUNCH BAG') == True].Description.value_counts()

Perfect! We can see the 4 rows have been added. 

We can see there are more DOILY/DOILEY categories. I imagine these could all be combined. Let's keep looking and see what other products we have. 

In [None]:
df[df.Description.str.contains('LUNCH BAG') != True]

In [None]:
# Category of 'CHILDRENS CUTLERY'
df[df.Description.str.contains('CHILDRENS CUTLERY') == True]

In [None]:
# Line called 'CIRCUS PARADE'
df[df.Description.str.contains('CIRCUS PARADE') == True]

We can see the descriptions contain both categories and, it appears, product lines. For example, lines such as 'SPACE BOY', 'DOLLY GIRL' and 'CIRCUS PARADE' run across multiple categories such as 'CHILDRENS CUTLERY', 'NAPKINS', 'APRON' and more. 
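
One simple way to split a description into a category and a line is keyword matching, sketched below. The keyword lists are short illustrative samples, the helper `first_match` and the column names `Category` and `Line` are my own, and the `toy` descriptions are examples in the style of the dataset.

```python
import pandas as pd

# Example descriptions in the style of the dataset
toy = pd.DataFrame(
    {
        "Description": [
            "LUNCH BAG VINTAGE DOILY ",
            "PLASTERS IN TIN CIRCUS PARADE",
            "CHILDRENS CUTLERY CIRCUS PARADE",
            "SCOTTIE DOG HOT WATER BOTTLE",
            "WICKER STAR",
        ]
    }
)

# Illustrative keyword lists, not an exhaustive taxonomy
categories = ["LUNCH BAG", "PLASTERS IN TIN", "CHILDRENS CUTLERY", "HOT WATER BOTTLE"]
lines = ["CIRCUS PARADE", "SPACE BOY", "DOLLY GIRL"]


def first_match(text, keywords):
    """Return the first keyword found in text, or None if nothing matches."""
    for keyword in keywords:
        if keyword in text:
            return keyword
    return None


toy["Category"] = toy["Description"].apply(lambda s: first_match(s, categories))
toy["Line"] = toy["Description"].apply(lambda s: first_match(s, lines))
print(toy)
```

Descriptions that match no keyword are left empty, which makes it easy to see how much of the catalogue a given keyword list covers.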

In [None]:
df.Description[df.Description.str.contains('CIRCUS PARADE') == True].value_counts()

In [None]:
df.Description[df.Description.str.contains('PLASTERS IN TIN') == True].value_counts()

In [None]:
df.Description[df.Description.str.contains('LUNCH BOX') == True].value_counts()

In [None]:
df.Description[df.Description.str.contains('NAPKINS') == True].value_counts()

In [None]:
df.Description[df.Description.str.contains('FAIRY CAKES') == True].value_counts()

In [None]:
df.Description[df.Description.str.contains('NOTEBOOK') == True].value_counts()

In [None]:
df.Description[df.Description.str.contains('BABUSHKA') == True].value_counts()

In [None]:
df.Description[df.Description.str.contains('HOT WATER BOTTLE') == True].value_counts()

In [None]:
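# Match either 'CHARLIE' or 'LOLA'; a second positional argument would be read as `case`, not as another pattern
df.Description[df.Description.str.contains('CHARLIE|LOLA') == True].value_counts()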

As we can see, there are numerous categories and lines of products. These could be broken down into categories and even colors for deeper insights and analyses, which would be a project in itself. For now I will continue with the RFM analysis. However, for future work I believe it would be valuable to also group buyers by the types of categories they purchase. 

# Future Work

We can see there are many categories of products, such as napkins, aprons, notebooks, water bottles, lunch bags, etc. And across those categories there are many different lines of products, such as 'SPACE BOY', 'DOLLY GIRL', 'CIRCUS PARADE', 'CHARLIE + LOLA', and even 'BABUSHKA'. 

Greater insights could be gained by adding categories and lines. This could support more targeted advertising directly to current customers, which would improve customer experience, engagement and revenue. It could also support targeted digital advertising such as Facebook and Google Ads. 

If the retailer doesn't already have one in place, it could add a recommendation system to increase price per transaction, revenue and customer lifetime value. 