#### About
Customer segmentation using LightGBM, XGBoost and CatBoost.

Dataset - https://www.kaggle.com/datasets/carrie1/ecommerce-data

#### About

* K-means clustering is a method of vector quantization that aims to partition n observations into K clusters, in which each observation belongs to the cluster with the nearest mean (the cluster center, or centroid).
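To make the definition above concrete, here is a minimal sketch of the classic K-means loop (Lloyd's algorithm) in plain Python. The function name `kmeans`, the toy points and the random initialisation are assumptions made for illustration only; in practice a library implementation would normally be used.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to the nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Toy data: two visibly separated groups of 2-D points
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

The two alternating steps (assign to nearest mean, recompute the mean) are exactly what "each observation belongs to the cluster with the nearest mean" refers to.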

>>> The primary differences between CatBoost, XGBoost and LightGBM are as follows:

>> Boosting algorithms work by combining multiple models sequentially, i.e. the residual of one model becomes the target of the next model.

> A. Decision tree structure over iterations - CatBoost builds symmetric decision trees. LightGBM grows its trees leaf-wise, whereas XGBoost grows its trees depth-wise (level by level).

> B. Handling of categorical variables - CatBoost uses ordered encoding, whereas LightGBM uses a bin- or bucket-based approach. XGBoost has traditionally had no predefined approach, so categorical features have to be encoded manually, e.g. one-hot encoded.

> C. Selection of samples - CatBoost uses minimum variance sampling and uniform sampling, LightGBM uses gradient-based one-side sampling (GOSS), and XGBoost uses bootstrap-based sampling.
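The principle that the residual of one model becomes the target of the next can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions (squared-error loss, one-split "stump" trees, 1-D toy data); the helper names `fit_stump` and `boost` are made up here, and none of the three libraries is implemented this way internally.

```python
def fit_stump(x, y):
    """Fit the best single-split regression stump on 1-D inputs (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # every split that leaves both sides non-empty
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lmean) ** 2 for v in left) + sum((v - rmean) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda xi: lmean if xi <= threshold else rmean

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Additive model: each new stump is trained on the current residuals."""
    base = sum(y) / len(y)  # start from the mean prediction
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # target for the next model
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

# Toy data with a step between x = 4 and x = 5
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
model = boost(x, y)

mse = sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(x)
baseline_mse = sum((sum(y) / len(y) - yi) ** 2 for yi in y) / len(y)
print(mse, baseline_mse)
```

The tree shape, categorical handling and row sampling listed in A, B and C are the places where the three libraries differ around this common loop.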

In [11]:
# necessary imports
import seaborn as sns
import pandas as pd
import numpy as np
import datetime as dt

In [2]:
dataset = '/home/suraj/ClickUp/Jan-Feb/data/data.csv'
df = pd.read_csv(dataset)

In [3]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


#### 1. Preprocessing data


In [4]:
# converting invoice date to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [5]:
#checking null values
df.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

In [6]:
#dropping null values
df.dropna(axis=0, inplace=True)
df.isnull().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64
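Note that `dropna(axis=0)` removes every row with a missing value in any column. Since the later steps are keyed on `CustomerID`, one alternative is to drop only the rows where that column is missing. The sketch below shows the idea on a made-up three-row frame (the `sample` data is hypothetical, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one complete, one without a description, one without a customer
sample = pd.DataFrame({
    "Description": ["WHITE METAL LANTERN", None, "Discount"],
    "CustomerID": [17850.0, 17850.0, np.nan],
})

# Keep rows that have a CustomerID, even if Description is missing
kept = sample.dropna(subset=["CustomerID"])
print(kept)
```

Whether rows without a description should be kept is a judgment call; this only shows how to make the choice explicit.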

In [7]:
# let's reduce the number of input columns kept for each CustomerID
# 1. Replace UnitPrice and Quantity with a single column - TotalPurchase

df['TotalPurchase'] = df['UnitPrice'] * df['Quantity']
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,TotalPurchase
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34


In [8]:
#dropping the two columns quantity and unit price
cols_to_drop = ['Quantity','UnitPrice']
df = df.drop(cols_to_drop,axis=1)
df.head()


Unnamed: 0,InvoiceNo,StockCode,Description,InvoiceDate,CustomerID,Country,TotalPurchase
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,2010-12-01 08:26:00,17850.0,United Kingdom,15.3
1,536365,71053,WHITE METAL LANTERN,2010-12-01 08:26:00,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,2010-12-01 08:26:00,17850.0,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,2010-12-01 08:26:00,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,2010-12-01 08:26:00,17850.0,United Kingdom,20.34


In [9]:
# checking whether there are any zero or negative total purchases
df[df.TotalPurchase<=0]

Unnamed: 0,InvoiceNo,StockCode,Description,InvoiceDate,CustomerID,Country,TotalPurchase
141,C536379,D,Discount,2010-12-01 09:41:00,14527.0,United Kingdom,-27.50
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,2010-12-01 09:49:00,15311.0,United Kingdom,-4.65
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,2010-12-01 10:24:00,17548.0,United Kingdom,-19.80
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,2010-12-01 10:24:00,17548.0,United Kingdom,-6.96
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,2010-12-01 10:24:00,17548.0,United Kingdom,-6.96
...,...,...,...,...,...,...,...
540449,C581490,23144,ZINC T-LIGHT HOLDER STARS SMALL,2011-12-09 09:57:00,14397.0,United Kingdom,-9.13
541541,C581499,M,Manual,2011-12-09 10:28:00,15498.0,United Kingdom,-224.69
541715,C581568,21258,VICTORIAN SEWING BOX LARGE,2011-12-09 11:57:00,15311.0,United Kingdom,-54.75
541716,C581569,84978,HANGING HEART JAR T-LIGHT HOLDER,2011-12-09 11:58:00,17315.0,United Kingdom,-1.25
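Every non-positive row shown above has an invoice number starting with `C`. As far as I know, in the UCI Online Retail data that this Kaggle file is based on, that prefix marks a cancellation. A small sketch of flagging such rows directly, using a made-up `sample` frame rather than the real data:

```python
import pandas as pd

# Toy rows modelled on the output above (only two columns kept)
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "C536383", "536390"],
    "TotalPurchase": [15.30, -27.50, -4.65, 22.00],
})

# InvoiceNo can be parsed with mixed types, so cast to str before matching
is_cancelled = sample["InvoiceNo"].astype(str).str.startswith("C")
print(sample[is_cancelled])

# Check whether the flag agrees with the sign of TotalPurchase on this sample
agrees = (is_cancelled == (sample["TotalPurchase"] <= 0)).all()
print(agrees)
```

On the full dataset the two conditions are not guaranteed to match, so it is worth running the same check before relying on either one.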


In [13]:
# dropping all zero or negative total purchases and keeping only the entries dated before today
today = dt.datetime(2023,2,13)
df1 = df[(df.TotalPurchase>0) & (df.InvoiceDate<today)].copy()
df1.head()

Unnamed: 0,InvoiceNo,StockCode,Description,InvoiceDate,CustomerID,Country,TotalPurchase
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,2010-12-01 08:26:00,17850.0,United Kingdom,15.3
1,536365,71053,WHITE METAL LANTERN,2010-12-01 08:26:00,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,2010-12-01 08:26:00,17850.0,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,2010-12-01 08:26:00,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,2010-12-01 08:26:00,17850.0,United Kingdom,20.34


In [21]:
#let's create a new feature named days since purchase
df1['days_since_purchase'] = (today - df1["InvoiceDate"]).dt.days

In [37]:
# grouping by customer to get the days since each customer's most recent purchase
df1_dop = df1.groupby("CustomerID").agg({"days_since_purchase":"min"})
df1_dop.reset_index(inplace=True)
df1_dop.columns = ["CustomerID","days_since_purchase"]
df1_dop.head()

# let's put this feature on hold for now!

Unnamed: 0,CustomerID,days_since_purchase
0,12346.0,4408
1,12347.0,4085
2,12348.0,4158
3,12349.0,4101
4,12350.0,4393
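Because `today` is fixed at 2023-02-13 while the invoices end in December 2011, every value above is offset by more than ten years. A common alternative, sketched here on made-up rows, is to measure recency against a snapshot date derived from the data itself. The choice of "one day after the last invoice" and the names `tx`, `snapshot` and `recency` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical transactions for three customers
tx = pd.DataFrame({
    "CustomerID": [12346.0, 12347.0, 12347.0, 12348.0],
    "InvoiceDate": pd.to_datetime([
        "2011-01-18 10:01", "2010-12-07 14:57", "2011-12-07 15:52", "2011-09-25 13:13",
    ]),
})

# Snapshot date: midnight of the day after the most recent invoice in the data
snapshot = tx["InvoiceDate"].max().normalize() + pd.Timedelta(days=1)

# Same computation as in the notebook, with snapshot in place of today
tx["days_since_purchase"] = (snapshot - tx["InvoiceDate"]).dt.days
recency = tx.groupby("CustomerID", as_index=False)["days_since_purchase"].min()
print(recency)
```

The ranking of customers stays essentially the same under either reference date; the snapshot version just keeps the numbers on a scale that is easier to read.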


In [29]:
# creating a purchase frequency for each customer ID: the number of unique invoice numbers
df1_freq = df1.groupby("CustomerID").agg({"InvoiceNo":"nunique"})
df1_freq.reset_index(inplace=True)
df1_freq.columns = ["CustomerID","Frequency"]
df1_freq.head()


Unnamed: 0,CustomerID,Frequency
0,12346.0,1
1,12347.0,7
2,12348.0,4
3,12349.0,1
4,12350.0,1


In [30]:
# total purchase amount for each customer ID
df1_purchase = df1.groupby("CustomerID").agg({"TotalPurchase":"sum"})
df1_purchase.reset_index(inplace=True)
df1_purchase.columns = ["CustomerID","BillingAmount"]
df1_purchase.head()

Unnamed: 0,CustomerID,BillingAmount
0,12346.0,77183.6
1,12347.0,4310.0
2,12348.0,1797.24
3,12349.0,1757.55
4,12350.0,334.4


In [28]:
# merging frequency and billing amount on CustomerID
df1_new = df1_freq.merge(df1_purchase, on="CustomerID")
df1_new

Unnamed: 0,CustomerID,Frequency,BillingAmount
0,12346.0,1,77183.60
1,12347.0,7,4310.00
2,12348.0,4,1797.24
3,12349.0,1,1757.55
4,12350.0,1,334.40
...,...,...,...
4333,18280.0,1,180.60
4334,18281.0,1,80.82
4335,18282.0,2,178.05
4336,18283.0,16,2094.88
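The three per-customer tables built above (recency, frequency, billing amount) can also be produced in a single pass with pandas named aggregation, which avoids the separate `reset_index`, column renaming and `merge` steps. The sketch below uses a small made-up `tx` frame; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy transactions: customer 12347 has two invoices, one of them with two lines
tx = pd.DataFrame({
    "CustomerID": [12346.0, 12347.0, 12347.0, 12347.0, 12348.0],
    "InvoiceNo": ["A1", "B1", "B1", "B2", "C1"],
    "TotalPurchase": [10.0, 5.0, 7.5, 20.0, 3.0],
    "days_since_purchase": [4408, 4450, 4450, 4085, 4158],
})

summary = (
    tx.groupby("CustomerID")
      .agg(
          days_since_purchase=("days_since_purchase", "min"),  # most recent purchase
          Frequency=("InvoiceNo", "nunique"),                  # distinct invoices
          BillingAmount=("TotalPurchase", "sum"),              # total spend
      )
      .reset_index()
)
print(summary)
```

This is only a more compact way of writing the same aggregations; the separate data frames used in the notebook give the same result.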


In [33]:
df1_new.describe()

Unnamed: 0,CustomerID,Frequency,BillingAmount
count,4338.0,4338.0,4338.0
mean,15300.408022,4.272015,2054.26646
std,1721.808492,7.697998,8989.230441
min,12346.0,1.0,3.75
25%,13813.25,1.0,307.415
50%,15299.5,2.0,674.485
75%,16778.75,5.0,1661.74
max,18287.0,209.0,280206.02


In [34]:
# Total number of unique customers
df1_new['CustomerID'].nunique()
# we have data for 4338 unique customers.

4338

In [36]:
# number of distinct purchase-frequency values across customers
df1_new['Frequency'].nunique()

59

In [38]:
# number of distinct billing amounts across customers
# (not the number of purchases; assuming each invoice represents a unique purchase, that would be the sum of Frequency)
df1_new['BillingAmount'].nunique()

4284

In [39]:
# total number of rows, one per customer
len(df1_new)

4338

Now we will filter the columns needed for clustering.

1. Removing Customer ID