## Overview

<a href="https://archive.ics.uci.edu/ml/datasets/online+retail">Online Retail</a> is a transnational dataset containing all transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

## Source

UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/online+retail

## Business Goal

Segment customers by RFM (Recency, Frequency, Monetary) so that the company can target them efficiently.


## Methodology

1. [Reading and Understanding the Data](#1) <br>
   a. Creating a Data Dictionary
2. [Data Cleaning](#2)
3. [Data Preparation](#3) <br>
   a. Scaling Variables
4. [Model Building](#4) <br>
   a. K-means Clustering <br>
   b. Finding the Optimal K
5. [Final Analysis](#5)


<a id="1"></a> <br>

### 1 : Reading and Understanding Data


In [1]:
# import required libraries for dataframe and visualization

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import plotly as py 
import plotly.graph_objs as go

from plotly.offline import init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# import required libraries for clustering
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

In [2]:
pd.set_option('display.max_columns', 999)

In [4]:
# Reading the data on which analysis needs to be done
df = pd.read_csv('OnlineRetail.csv', encoding = 'unicode_escape')
df.head(10)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01-12-2010 08:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,01-12-2010 08:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,01-12-2010 08:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,01-12-2010 08:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,01-12-2010 08:26,3.39,17850.0,United Kingdom
5,536365,22752,SET 7 BABUSHKA NESTING BOXES,2,01-12-2010 08:26,7.65,17850.0,United Kingdom
6,536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,01-12-2010 08:26,4.25,17850.0,United Kingdom
7,536366,22633,HAND WARMER UNION JACK,6,01-12-2010 08:28,1.85,17850.0,United Kingdom
8,536366,22632,HAND WARMER RED POLKA DOT,6,01-12-2010 08:28,1.85,17850.0,United Kingdom
9,536367,84879,ASSORTED COLOUR BIRD ORNAMENT,32,01-12-2010 08:34,1.69,13047.0,United Kingdom


#### Data Dictionary

| Column       | Definition            | Description                                                                                                                        | Data Type |
| ------------ | --------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | --------- |
| InvoiceNo    | Invoice number        | A 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'C', it indicates a cancellation. | Nominal   |
| StockCode    | Product (item) code   | A 5-digit integral number uniquely assigned to each distinct product.                                                              | Nominal   |
| Description  | Product (item) name   | Name of Product                                                                                                                    | Nominal   |
| Quantity     | Quantity              | The quantities of each product (item) per transaction                                                                              | Numeric   |
| InvoiceDate  | Invoice Date and time | The day and time when each transaction was generated.                                                                              | Numeric   |
| UnitPrice    | Unit price            | Product price per unit in sterling.                                                                                                | Numeric   |
| CustomerID   | Customer number       | A 5-digit integral number uniquely assigned to each customer.                                                                      | Nominal   |
| Country      | Country name          | The name of the country where each customer resides.                                                                               | Nominal   |
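
The cancellation rule in the data dictionary can be checked directly on `InvoiceNo`. The sketch below uses a small hypothetical frame (`invoices` and the `IsCancelled` column are illustrative names, not part of the dataset) and assumes the prefix is an uppercase `C`, as in the sample rows shown later.

```python
import pandas as pd

# Hypothetical invoice numbers; a leading "C" is assumed to mark a cancellation
invoices = pd.DataFrame({
    "InvoiceNo": ["536365", "C537234", "536367", "C574512"],
    "Quantity": [6, -20, 32, -6],
})

# Cast to str first, since purely numeric invoice numbers may be parsed as integers
invoices["IsCancelled"] = invoices["InvoiceNo"].astype(str).str.startswith("C")

# Number of cancelled rows in the toy frame
n_cancelled = int(invoices["IsCancelled"].sum())
print(invoices)
print("Cancelled rows:", n_cancelled)
```

On the real data the same mask could be compared against the negative-quantity rows to see whether the two sets coincide.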


In [6]:
# shape of df
df.shape

(541909, 8)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


In [7]:
# df description
df.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


In [9]:
df.describe(include="O")

Unnamed: 0,InvoiceNo,StockCode,Description,InvoiceDate,Country
count,541909,541909,540455,541909,541909
unique,25900,4070,4223,23260,38
top,573585,85123A,WHITE HANGING HEART T-LIGHT HOLDER,31-10-2011 14:41,United Kingdom
freq,1114,2313,2369,1114,495478


The Description and CustomerID columns contain null values.
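
One way to quantify the null values is the share of missing entries per column. This is a minimal sketch on a hypothetical toy frame (`toy` and `missing_pct` are illustrative names); the same expression should apply to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the retail data
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "BUTTON BOX", "MIRROR"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
    "Quantity": [6, 1, 3, 2],
})

# isna() gives a boolean mask; its column-wise mean is the fraction missing
missing_pct = (toy.isna().mean() * 100).round(2)
print(missing_pct)
```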


In [10]:
df[df["Description"].isna()].sample(5)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
374538,569378,72799F,,-5,03-10-2011 16:31,0.0,,United Kingdom
268307,560407,35653,,9,18-07-2011 14:21,0.0,,United Kingdom
344435,567063,23135,,-27,16-09-2011 11:50,0.0,,United Kingdom
395172,571027,72816,,-36,13-10-2011 12:19,0.0,,United Kingdom
130953,547531,82605,,1,23-03-2011 15:06,0.0,,United Kingdom


In [15]:
len(df[df["Description"].isna()])

1454

In [12]:
df[df["CustomerID"].isna()].sample(5)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
500203,578746,22736,RIBBON REEL MAKING SNOWMEN,5,25-11-2011 11:36,4.13,,United Kingdom
15546,537638,84006,MAGIC TREE -PAPER FLOWERS,3,07-12-2010 15:28,1.66,,United Kingdom
82403,543201,48116,DOORMAT MULTICOLOUR STRIPE,1,04-02-2011 13:03,14.13,,United Kingdom
435136,574074,22167,OVAL WALL MIRROR DIAMANTE,1,02-11-2011 15:33,8.29,,United Kingdom
250825,559051,22196,SMALL HEART MEASURING SPOONS,1,05-07-2011 16:47,1.63,,United Kingdom


In [14]:
len(df[df["CustomerID"].isna()])

135080

In [21]:
len(df)

541909

In [23]:
df = df.dropna(subset="CustomerID")
len(df)

406829

In [13]:
df[(df["Description"].isna()) & (df["CustomerID"].isna())]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
622,536414,22139,,56,01-12-2010 11:52,0.0,,United Kingdom
1970,536545,21134,,1,01-12-2010 14:32,0.0,,United Kingdom
1971,536546,22145,,1,01-12-2010 14:33,0.0,,United Kingdom
1972,536547,37509,,1,01-12-2010 14:33,0.0,,United Kingdom
1987,536549,85226A,,1,01-12-2010 14:34,0.0,,United Kingdom
...,...,...,...,...,...,...,...,...
535322,581199,84581,,-2,07-12-2011 18:26,0.0,,United Kingdom
535326,581203,23406,,15,07-12-2011 18:31,0.0,,United Kingdom
535332,581209,21620,,6,07-12-2011 18:35,0.0,,United Kingdom
536981,581234,72817,,27,08-12-2011 10:33,0.0,,United Kingdom


##### Null values handling


##### Checking of invalid values


Quantity should not be negative, so we check for negative values.


In [24]:
len(df[df["Quantity"] < 0])

8905

In [25]:
df[df["Quantity"] < 0].sample(5)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
440173,C574512,23085,ANTIQUE SILVER BAUBLE LAMP,-6,04-11-2011 13:28,10.4,12577.0,France
10204,C537234,22653,BUTTON BOX,-20,06-12-2010 09:40,1.95,16161.0,United Kingdom
33619,C539273,22791,T-LIGHT GLASS FLUTED ANTIQUE,-1,16-12-2010 15:30,1.25,13209.0,United Kingdom
273909,C560866,47563A,RETRO LONGBOARD IRONING BOARD COVER,-4,21-07-2011 14:04,1.25,14410.0,United Kingdom
472194,C576670,23344,JUMBO BAG 50'S CHRISTMAS,-1,16-11-2011 11:55,2.08,16393.0,United Kingdom


As with Quantity, UnitPrice should have no negative values.


In [26]:
len(df[df["UnitPrice"] < 0])

0

In [27]:
df[df["UnitPrice"] < 0]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


#### Drop rows with negative Quantity


In [28]:
df = df[df["Quantity"] >= 0]
len(df)

397924

In [29]:
len(df[df["Quantity"] < 0])

0

<a id="2"></a> <br>

### 2 : Data Cleaning


In [None]:
# Calculating the Missing Values % contribution in DF


In [None]:
# Dropping rows having missing values


In [None]:
# Changing the datatype of Customer Id as per Business understanding


<a id="3"></a> <br>

### 3 : Data Preparation


#### Customers will be analyzed based on 3 factors:

- R (Recency): Number of days since last purchase
- F (Frequency): Number of transactions
- M (Monetary): Total amount of transactions (revenue contributed)
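
The three factors above can be sketched on a hypothetical toy frame as follows. The column names follow the data dictionary, while `tx`, `Amount`, and `rfm` are illustrative names. Two assumptions are made here: recency is measured from the latest transaction date in the data, and frequency counts distinct invoices rather than invoice lines.

```python
import pandas as pd

# Hypothetical toy transactions; dates use the day-first format seen in the data
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2],
    "InvoiceNo": ["A1", "A2", "B1", "B1", "B2"],
    "Quantity": [2, 1, 5, 1, 3],
    "UnitPrice": [3.0, 10.0, 1.0, 4.0, 2.0],
    "InvoiceDate": ["01-12-2010 08:26", "05-12-2010 09:00",
                    "02-12-2010 10:00", "02-12-2010 10:00",
                    "09-12-2010 12:30"],
})

# Parse dates explicitly so day and month are not swapped
tx["InvoiceDate"] = pd.to_datetime(tx["InvoiceDate"], format="%d-%m-%Y %H:%M")

# Revenue per invoice line
tx["Amount"] = tx["Quantity"] * tx["UnitPrice"]

# Reference date for recency (assumption: the latest transaction in the data)
max_date = tx["InvoiceDate"].max()

grouped = tx.groupby("CustomerID")
rfm = pd.DataFrame({
    "Recency": (max_date - grouped["InvoiceDate"].max()).dt.days,  # whole days
    "Frequency": grouped["InvoiceNo"].nunique(),                   # distinct invoices
    "Monetary": grouped["Amount"].sum(),                           # total revenue
})
print(rfm)
```

Counting rows instead of distinct invoices (`.count()` in place of `.nunique()`) would treat each invoice line as a transaction; which definition suits the business goal is a judgment call.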


In [None]:
# New Attribute : Monetary


In [None]:
# New Attribute : Frequency


In [None]:
# Merging the two dfs


In [None]:
# New Attribute : Recency

# Convert InvoiceDate to a proper datetime datatype


In [None]:
# Compute the maximum date to know the last transaction date



In [None]:
# Compute the difference between max date and transaction date


In [None]:
# Compute last transaction date to get the recency of customers


In [None]:
# Extract number of days only



In [None]:
# Merge the dataframes to get the final RFM dataframe


#### Rescaling the Attributes

It is extremely important to rescale the variables so that they have a comparable scale.<br>
There are two common ways of rescaling:

1. Min-Max scaling
2. Standardization (mean 0, standard deviation 1)

Here we apply standardization using `StandardScaler`.
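
To make the difference between the two rescaling options concrete, here is a minimal sketch on hypothetical RFM-like values (`X`, `X_std`, and `X_minmax` are illustrative names). It uses scikit-learn's `StandardScaler` and `MinMaxScaler` with their default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical values: rows are customers, columns are Recency, Frequency, Monetary
X = np.array([[10.0, 1.0, 100.0],
              [20.0, 3.0, 300.0],
              [30.0, 5.0, 500.0]])

# Standardization: each column ends up with mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: each column is mapped linearly onto the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_minmax)
```

Both scalers work column by column, so a feature with a large numeric range such as Monetary no longer dominates distance computations.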


In [None]:
# Rescaling the attributes

# Instantiate

# fit_transform


## <span style="color: red;">Execute MinMax Scaling in the next box</span>


<a id="4"></a> <br>

### 4 : Building the Model


### K-Means Clustering


K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.<br>

The algorithm works as follows:

- First, we randomly initialize k points, called means.
- We assign each item to its closest mean and update that mean’s coordinates, which are the averages of the items assigned to it so far.
- We repeat the process for a given number of iterations; at the end, we have our clusters.
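
The steps above can be tried out with scikit-learn's `KMeans`, which by default runs a batch form of this assign-and-update loop (all items are assigned first, then every mean is recomputed). The data below is a hypothetical toy set with two well-separated groups, and the parameter values are illustrative choices rather than recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two clearly separated groups
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# n_clusters is k; max_iter caps the number of assign/update rounds;
# n_init repeats the run from different starting means and keeps the best result
kmeans = KMeans(n_clusters=2, n_init=10, max_iter=50, random_state=42)
labels = kmeans.fit_predict(X)

print("Labels:", labels)
print("Final means:")
print(kmeans.cluster_centers_)
```

The numeric label given to each group is arbitrary, so only the grouping itself should be interpreted.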


In [None]:
# k-means with some arbitrary k


In [None]:
#create a K_means function here


In [None]:
#plot your clusters


## <span style="color: red;">Finding the Optimal Number of Clusters</span>


#### Elbow Curve to get the right number of Clusters

A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k.
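
A minimal sketch of the elbow method, assuming the within-cluster sum of squares (`inertia_` in scikit-learn) as the quantity to track. The data is a hypothetical set of three synthetic blobs, and the range of k values is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three blobs of 30 points each, centred at 0, 4 and 8
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in (0.0, 4.0, 8.0)])

# Fit k-means for each candidate k and record the within-cluster sum of squares
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

Plotting these values against k (for example with `plt.plot`) gives the elbow curve; the k after which the decrease flattens is the usual candidate, which for this toy data should be 3.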


## <span style="color: red;">Box Plots of Clusters created</span>


<a id="5"></a> <br>

### 5 : Final Analysis


## <span style="color: red;">Findings</span>


#### Student Name:
