# Clustering e-commerce customers

## Business understanding

You have been hired by an e-commerce company that is looking to better understand its customers' behavior in order to personalize its marketing campaigns. To achieve this, the company has provided a CSV database containing data on customers, products, and store transactions carried out between 2010 and 2011.

Based on this data, you need to group customers into clusters according to their purchasing behavior. This will help identify patterns and common characteristics among customers, such as:

- Customers who buy the same products;

- Customers with the same purchase frequency;

- Customers who spend more money on their purchases.

Using these clusters, generate insights that will allow the company to better segment its customer base and personalize its marketing campaigns, directing promotions and offers to customers based on their purchasing behavior.

## Data understanding

This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 (DD/MM/YYYY) for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers (https://www.kaggle.com/datasets/carrie1/ecommerce-data).

The table below provides a detailed description of each column.
<table style='border: 1px solid; margin-left: 0'>
    <thead>
        <tr>
            <th>Column</th>
            <th>Description</th>
            <th>Data Type</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><strong>InvoiceNo</strong></td>
            <td>Transaction ID</td>
            <td>String</td>
        </tr>
        <tr>
            <td><strong>StockCode</strong></td>
            <td>Product stock code</td>
            <td>String</td>
        </tr>
        <tr>
            <td><strong>Description</strong></td>
            <td>Product description</td>
            <td>String</td>
        </tr>
        <tr>
            <td><strong>Quantity</strong></td>
            <td>Number of products per transaction</td>
            <td>Int</td>
        </tr>
        <tr>
            <td><strong>InvoiceDate</strong></td>
            <td>Transaction date</td>
            <td>Datetime</td>
        </tr>
        <tr>
            <td><strong>UnitPrice</strong></td>
            <td>Unit price of the product</td>
            <td>Float</td>
        </tr>
        <tr>
            <td><strong>CustomerID</strong></td>
            <td>Customer ID</td>
            <td>Int</td>
        </tr>
        <tr>
            <td><strong>Country</strong></td>
            <td>Country of transaction origin</td>
            <td>String</td>
        </tr>
    </tbody>
</table>

#### Setup

In [147]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

#### Descriptive analysis

In [148]:
# Load data
data = pd.read_csv('../data/raw/data.csv', encoding='latin-1')
data.tail()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,12/9/2011 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/2011 12:50,2.1,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/2011 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/2011 12:50,4.15,12680.0,France
541908,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,12/9/2011 12:50,4.95,12680.0,France


In [149]:
# Get basic information about our data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


#### Handle data types

First, we need to identify and correct any wrong data types in the data being analysed.

In [151]:
# InvoiceDate should be handled as a datetime;
# this will allow future time series analysis
data['InvoiceDate'] = data.InvoiceDate.astype('datetime64[ns]')

# CustomerID, despite being a number, is an identifier and
# does not represent a real numeric value, so it is stored
# with the generic object dtype instead of float
data['CustomerID'] = data.CustomerID.astype('object')

# Check information
data.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID             object
Country                object
dtype: object
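
Note that `astype('object')` leaves the underlying float values in place, so an ID still prints as `12680.0`. The sketch below shows one possible way to get clean string IDs and to make the date format explicit. It runs on a hypothetical `toy` frame rather than the real file, and the `%m/%d/%Y %H:%M` format is an assumption inferred from the `data.tail()` output (month first).

```python
import pandas as pd

# Toy rows mimicking two raw CSV columns (hypothetical values)
toy = pd.DataFrame({
    'InvoiceDate': ['12/1/2010 8:26', '12/9/2011 12:50'],
    'CustomerID': [17850.0, None],
})

# Parse dates with an explicit month/day/year format (assumed from the raw data)
toy['InvoiceDate'] = pd.to_datetime(toy['InvoiceDate'], format='%m/%d/%Y %H:%M')

# Float -> nullable integer -> string drops the trailing ".0"
# and keeps missing IDs as <NA>
toy['CustomerID'] = toy['CustomerID'].astype('Int64').astype('string')

print(toy.dtypes)
print(toy)
```

Stating the format explicitly avoids any day/month ambiguity for dates such as `12/9/2011`.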

#### Handle missing data

Second, let's check for missing data and, where necessary, drop it or fill it with a suitable value.

In [170]:
# Check for missing values in each column
data.isna().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
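
Raw counts are easier to judge as a share of the data set. Here is a small sketch on a hypothetical `toy` frame. The same expression could be applied to `data`, where the counts above correspond to roughly 0.27% of rows for `Description` and 24.93% for `CustomerID`.

```python
import pandas as pd

# Hypothetical toy data with a few gaps
toy = pd.DataFrame({
    'Description': ['MUG', None, 'BAG', 'PEN'],
    'CustomerID': [1.0, None, None, 2.0],
})

# isna().mean() gives the fraction of missing values per column
missing_pct = (toy.isna().mean() * 100).round(2)
print(missing_pct)
```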

In [None]:
# Check whether a missing description can be recovered from another row.
# This is done by comparing StockCodes: our assumption is that
# the same StockCode always represents the same item
missing_descriptions_stock_codes = data[data['Description'].isna()]['StockCode'].unique()

# Find the StockCodes in `missing_descriptions_stock_codes` that have
# a description elsewhere, then rename the columns so that
# we can join the result with our original table
missing_description_values = (
	data[
		(data['StockCode'].isin(missing_descriptions_stock_codes)) &
		(~data['Description'].isna())
	][['StockCode', 'Description']]
	.drop_duplicates(subset='StockCode')
)
missing_description_values.rename(
	{'StockCode': 'new_StockCode', 'Description': 'new_Description'},
	axis='columns', inplace=True
)

# Join Tables and replace values
merged_stock_code = data.merge(
	missing_description_values,
	left_on='StockCode',
	right_on='new_StockCode',
	how='left'
)
merged_stock_code['Description'] = merged_stock_code['Description'].fillna(merged_stock_code['new_Description'])

# Drop the helper columns and check missing values again
data = merged_stock_code.copy().drop(['new_StockCode', 'new_Description'], axis='columns')
data.isna().sum()

InvoiceNo           0
StockCode           0
Description       112
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

Using this approach, we reduced the number of missing values in `Description` from 1454 to 112.
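
The same idea can be written more compactly with a lookup `Series` instead of a merge. The sketch below is a minimal illustration on hypothetical toy data. The `lookup` name is introduced here only for the example, and like the cell above it keeps the first known description per `StockCode`.

```python
import pandas as pd

# Hypothetical toy data: 'A1' has a known description, 'C3' has none
toy = pd.DataFrame({
    'StockCode': ['A1', 'A1', 'B2', 'C3'],
    'Description': ['MUG', None, 'BAG', None],
})

# One known description per StockCode (first non-missing occurrence)
lookup = (
    toy.dropna(subset=['Description'])
    .drop_duplicates(subset='StockCode')
    .set_index('StockCode')['Description']
)

# Fill gaps by looking up each row's StockCode; codes without
# any known description (here 'C3') stay missing
toy['Description'] = toy['Description'].fillna(toy['StockCode'].map(lookup))
print(toy)
```

Because `map` returns a result aligned with the original index, no helper columns need to be dropped afterwards.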

In [221]:
# Check relevant unit prices
md_mask = data['Description'].isna()
md_priced_count = data[md_mask & (data['UnitPrice'] > 0)].shape[0]
print(f'Missing descriptions with `UnitPrice` > 0: {md_priced_count}')

# Count the invoices that contain items with missing descriptions
md_invoices = data[md_mask]['InvoiceNo'].unique()
md_invoices_count = data[data['InvoiceNo'].isin(md_invoices)]['InvoiceNo'].nunique()
print(f'Number of invoices that contains missing descriptions: {md_invoices_count}')

# Check which customers have items with missing descriptions
md_customers = data[md_mask]['CustomerID'].unique()
print(f'Customers with missing description items: {md_customers}')


Missing descriptions with `UnitPrice` > 0: 0
Number of invoices that contains missing descriptions: 112
Customers with missing description items: [nan]


We've decided to drop all remaining items with missing descriptions because:
- All of them have a unit price of 0
- They are the only items in the invoices they belong to
- None of them has any customer ID information
- They represent only 0.02% of the dataset
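
The "only items in their invoices" point can be checked directly by counting rows per invoice. The sketch below uses hypothetical toy data, and the names `rows_per_invoice`, `only_items` and `dropped_pct` are introduced just for this example.

```python
import pandas as pd

# Hypothetical toy data: invoices '101' and '102' hold one undescribed row each
toy = pd.DataFrame({
    'InvoiceNo': ['100', '100', '101', '102'],
    'Description': ['MUG', 'BAG', None, None],
    'UnitPrice': [2.5, 1.0, 0.0, 0.0],
})

missing = toy['Description'].isna()

# Rows per invoice, counted over the whole frame
rows_per_invoice = toy.groupby('InvoiceNo').size()

# Invoices that contain at least one row without a description
md_invoices = toy.loc[missing, 'InvoiceNo'].unique()

# True only if every such invoice consists of a single row
only_items = bool((rows_per_invoice.loc[md_invoices] == 1).all())

# Share of the data set that would be dropped, in percent
dropped_pct = missing.mean() * 100
print(only_items, f'{dropped_pct:.2f}%')
```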

In [225]:
# Drop missing description items
data = data.dropna(subset=['Description'])

We still have many missing values in the `CustomerID` column, so we need to understand how relevant this column is to our analysis. It is needed to identify top customers, and it matters when we build the data for the RFM analysis, because every transaction must be attributed to a specific client. Rows without a `CustomerID` are therefore less helpful than they first appear. The problem is that they represent almost 25% of the whole dataset. Let's analyze them and check whether we can reduce this number.
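
To make the dependency on `CustomerID` concrete, here is a minimal sketch of an RFM (recency, frequency, monetary) table on hypothetical toy data. The `TotalPrice` column, the choice of `reference_date` and the result column names are assumptions for illustration, not necessarily what this notebook will use later. Note that `groupby` drops rows with a missing key by default, so transactions without a `CustomerID` cannot contribute.

```python
import pandas as pd

# Hypothetical toy transactions; the last one has no customer
toy = pd.DataFrame({
    'InvoiceNo': ['1', '2', '3', '4'],
    'InvoiceDate': pd.to_datetime(
        ['2011-12-01', '2011-12-05', '2011-12-08', '2011-12-09']
    ),
    'Quantity': [2, 1, 4, 10],
    'UnitPrice': [5.0, 10.0, 2.5, 1.0],
    'CustomerID': ['A', 'A', 'B', None],
})
toy['TotalPrice'] = toy['Quantity'] * toy['UnitPrice']

# Assumed reference point: the day after the last transaction
reference_date = toy['InvoiceDate'].max() + pd.Timedelta(days=1)

# One row per customer; the row with a missing CustomerID is excluded
rfm = toy.groupby('CustomerID').agg(
    recency=('InvoiceDate', lambda d: (reference_date - d.max()).days),
    frequency=('InvoiceNo', 'nunique'),
    monetary=('TotalPrice', 'sum'),
)
print(rfm)
```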

In [None]:
# Check whether a missing CustomerID can be recovered from another row.
# This is done by comparing InvoiceNo: our assumption is that
# all rows of the same InvoiceNo belong to the same customer
missing_customer_ids = data[data['CustomerID'].isna()]['InvoiceNo'].unique()

# Find the invoices in `missing_customer_ids` that have
# a CustomerID on some other row; these could be used
# to fill the gaps in our original table
customer_ids_values = (
	data[
		(data['InvoiceNo'].isin(missing_customer_ids)) &
		(~data['CustomerID'].isna())
	][['InvoiceNo', 'CustomerID']]
	.drop_duplicates(subset='InvoiceNo')
)

print(f'Number of invoices with values for missing CustomerIDs: {customer_ids_values.shape[0]}')

# Check how many invoices these missing CustomerIDs represent
mc_invoices_count = data[(data['CustomerID'].isna())]['InvoiceNo'].nunique()
all_invoices_count = data['InvoiceNo'].nunique()

print(f'Missing CustomerIDs invoices: {mc_invoices_count} of {all_invoices_count} ({(mc_invoices_count/all_invoices_count)*100:.2f}%)')

Number of invoices with values for missing CustomerIDs: 0
Missing CustomerIDs invoices: 3598 of 25788 (13.95%)


This is a tough decision. Invoices with a missing `CustomerID` seem to bring little value to the RFM analysis, as we can't attribute them to any client. Some invoices in this category also have more than 1k entries in the dataset. On the other hand, dropping them means losing 14% of all invoices and 25% of all available data.
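
The size of the affected invoices can be inspected by counting rows per invoice among the rows without a `CustomerID`. The sketch below runs on hypothetical toy data, and `row_share` and `invoice_share` are names introduced only for this example.

```python
import pandas as pd

# Hypothetical toy data: invoices '1' and '3' have no customer
toy = pd.DataFrame({
    'InvoiceNo': ['1', '1', '1', '2', '3', '3'],
    'CustomerID': [None, None, None, 'A', None, None],
})

# Rows per invoice, restricted to rows without a CustomerID,
# largest invoices first
rows_per_invoice = (
    toy[toy['CustomerID'].isna()]
    .groupby('InvoiceNo')
    .size()
    .sort_values(ascending=False)
)
print(rows_per_invoice.head())

# Share of rows and share of invoices affected, in percent
row_share = toy['CustomerID'].isna().mean() * 100
invoice_share = rows_per_invoice.size / toy['InvoiceNo'].nunique() * 100
print(f'{row_share:.2f}% of rows, {invoice_share:.2f}% of invoices')
```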

In [None]:
# Check missing CustomerID with high UnitPrice
data[(data['CustomerID'].isna())].sort_values(by='UnitPrice', ascending=False).head(10)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
524602,C580605,AMAZONFEE,AMAZON FEE,-1,2011-12-05 11:36:00,17836.46,,United Kingdom
43702,C540117,AMAZONFEE,AMAZON FEE,-1,2011-01-05 09:55:00,16888.02,,United Kingdom
43703,C540118,AMAZONFEE,AMAZON FEE,-1,2011-01-05 09:57:00,16453.71,,United Kingdom
16356,C537651,AMAZONFEE,AMAZON FEE,-1,2010-12-07 15:49:00,13541.33,,United Kingdom
15016,C537630,AMAZONFEE,AMAZON FEE,-1,2010-12-07 15:04:00,13541.33,,United Kingdom
15017,537632,AMAZONFEE,AMAZON FEE,1,2010-12-07 15:08:00,13541.33,,United Kingdom
16232,C537644,AMAZONFEE,AMAZON FEE,-1,2010-12-07 15:34:00,13474.79,,United Kingdom
524601,C580604,AMAZONFEE,AMAZON FEE,-1,2011-12-05 11:35:00,11586.5,,United Kingdom
299982,A563185,B,Adjust bad debt,1,2011-08-12 14:50:00,11062.06,,United Kingdom
446533,C574902,AMAZONFEE,AMAZON FEE,-1,2011-11-07 15:21:00,8286.22,,United Kingdom


In [315]:
# Count rows per description among items with missing CustomerID
(
	data[data['CustomerID'].isna()]
	.groupby('Description')[['Description']]
	.count()
	.rename({'Description': 'Count'}, axis='columns')
	.sort_values(by='Description', ascending=False)
)

Description,Count
wrongly sold sets,1
wrongly sold as sets,1
wrongly sold (22719) barcode,1
wrongly marked. 23343 in box,1
wrongly marked carton 22804,1
...,...
NINE DRAWER OFFICE TIDY,3
I LOVE LONDON MINI BACKPACK,18
DOLLY GIRL BEAKER,41
50'S CHRISTMAS GIFT BAG LARGE,20


Many of the descriptions in this category relate to fees, manual entries, postage, or damaged items, so we've decided to drop all rows with a missing `CustomerID`. This category also contains many cancelled invoices, which are the ones whose `InvoiceNo` starts with 'C'.
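
A sketch of how the cancelled invoices could be flagged and their share measured, assuming the 'C' prefix is the only marker of a cancellation. The toy rows reuse a few `InvoiceNo` values from the output above, and `is_cancelled` and `cancelled_share` are names introduced for this example.

```python
import pandas as pd

# Toy rows based on invoice numbers shown in the output above
toy = pd.DataFrame({
    'InvoiceNo': ['C580605', '537632', 'A563185', 'C540117'],
    'Description': ['AMAZON FEE', 'AMAZON FEE', 'Adjust bad debt', 'AMAZON FEE'],
    'CustomerID': [None, None, None, None],
})

# Cast to str first: InvoiceNo mixes purely numeric and prefixed codes
is_cancelled = toy['InvoiceNo'].astype(str).str.startswith('C')

# Share of cancellations among rows without a CustomerID, in percent
no_customer = toy['CustomerID'].isna()
cancelled_share = is_cancelled[no_customer].mean() * 100
print(f'Cancelled among rows without CustomerID: {cancelled_share:.2f}%')
```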

In [317]:
# Drop rows with missing CustomerID
data = data.dropna(subset=['CustomerID'])

# Check missing values
data.isna().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

In [318]:
# Check final statistics
data.describe()

Unnamed: 0,Quantity,InvoiceDate,UnitPrice
count,406829.0,406829,406829.0
mean,12.061303,2011-07-10 16:30:57.879207424,3.460471
min,-80995.0,2010-12-01 08:26:00,0.0
25%,2.0,2011-04-06 15:02:00,1.25
50%,5.0,2011-07-31 11:48:00,1.95
75%,12.0,2011-10-20 13:06:00,3.75
max,80995.0,2011-12-09 12:50:00,38970.0
std,248.69337,,69.315162


**Approach**

- Types - Handle wrong datatypes (ok)
- Missing values - Handle missing values
- Duplicates - Drop duplicates
- Errors - Handle wrong values
- Engineering - New column with total price
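
The remaining steps could look roughly like the sketch below, shown on hypothetical toy data. The error rule used here (keep only positive `Quantity` and `UnitPrice`) is an assumption for illustration. How cancellations and zero-priced rows should really be treated is still an open decision at this point in the analysis.

```python
import pandas as pd

# Hypothetical toy data: one exact duplicate, one cancellation, one zero price
toy = pd.DataFrame({
    'InvoiceNo': ['1', '1', 'C2', '3'],
    'StockCode': ['A1', 'A1', 'A1', 'B2'],
    'Quantity': [2, 2, -2, 3],
    'UnitPrice': [1.5, 1.5, 1.5, 0.0],
})

# Duplicates: drop rows that are identical in every column
clean = toy.drop_duplicates()

# Errors (assumed rule): keep only positive quantities and prices
clean = clean[(clean['Quantity'] > 0) & (clean['UnitPrice'] > 0)].copy()

# Engineering: total price per row
clean['TotalPrice'] = clean['Quantity'] * clean['UnitPrice']
print(clean)
```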