## UCI Online Retail II – Customer Purchase Behavior Analysis

This project analyzes customer purchasing behavior for a UK-based online retailer using data mining and machine learning techniques. The objective is to extract meaningful customer segments, identify purchasing patterns, and generate actionable marketing recommendations that support data-driven business decisions.

The analysis combines traditional clustering, deep embedding representations, and association rule mining to uncover both high-level customer groups and detailed product-level insights.

By:
**Ainedembe Denis**
- Master's student in Information Systems (2024/2026)
- LinkedIn: https://www.linkedin.com/in/ainedembe-denis-2b329615a/


## Environment Setup
The cells below install and import all required libraries.


In [1]:
# Install required libraries (run once)

%pip install pandas numpy matplotlib seaborn scikit-learn
%pip install tensorflow
%pip install mlxtend
%pip install tqdm
print("Successfully installed required libraries")

Successfully installed required libraries

In [1]:
# Core Libraries
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing & Feature Engineering
from sklearn.preprocessing import StandardScaler

# Clustering Algorithms
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

# Dimensionality Reduction (Visualization)
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Deep Learning - Autoencoder for embeddings
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam

# Association Rule Mining
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, apriori, association_rules

# Utilities
import warnings
warnings.filterwarnings("ignore")

# Plot Styling
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (8, 6)
print("Successfully imported libraries")


Successfully imported Libraries


### Part A – Data Cleaning & Clustering

#### A1.1. Load dataset

In [12]:
# Dataset path
file_path = "dataset/online_retail_II.csv"

# Load the dataset using ISO-8859-1 encoding
df = pd.read_csv(file_path, encoding='ISO-8859-1')

# Display Shape & the first 10 rows of the DataFrame
print("Shape:", df.shape)
df.head(10)


Shape: (1067371, 8)


| | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.1 | 13085.0 | United Kingdom |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom |
| 5 | 489434 | 22064 | PINK DOUGHNUT TRINKET POT | 24 | 2009-12-01 07:45:00 | 1.65 | 13085.0 | United Kingdom |
| 6 | 489434 | 21871 | SAVE THE PLANET MUG | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom |
| 7 | 489434 | 21523 | FANCY FONT HOME SWEET HOME DOORMAT | 10 | 2009-12-01 07:45:00 | 5.95 | 13085.0 | United Kingdom |
| 8 | 489435 | 22350 | CAT BOWL | 12 | 2009-12-01 07:46:00 | 2.55 | 13085.0 | United Kingdom |
| 9 | 489435 | 22349 | DOG BOWL , CHASING BALL DESIGN | 12 | 2009-12-01 07:46:00 | 3.75 | 13085.0 | United Kingdom |
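
The preview shows `InvoiceDate` as timestamp text and stock codes that mix digits and letters (for example `79323P`). As a minimal sketch, assuming it is useful to fix these types at load time, `pd.read_csv` can parse the dates and keep the codes as strings while reading. The in-memory CSV below reuses two rows from the preview and stands in for the real file; the variable names are illustrative only.

```python
import io
import pandas as pd

# Two rows copied from the preview above, in CSV form, stand in for the file.
csv_text = """Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
489435,22350,CAT BOWL,12,2009-12-01 07:46:00,2.55,13085.0,United Kingdom
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Invoice": str, "StockCode": str},  # keep codes such as "79323P" as text
    parse_dates=["InvoiceDate"],               # parse timestamps while reading
)

print(sample.dtypes)
print(sample["InvoiceDate"].dt.year.tolist())
```

With a datetime column, time-based features (such as the date of a customer's last purchase) can be derived through the `.dt` accessor without a separate conversion step.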


#### A1.2. Dataset Information and Statistics

In [None]:
# Display basic information about the dataset
df.info()

# Generate descriptive statistics for the numeric columns
df.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067371 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   Invoice      1067371 non-null  object 
 1   StockCode    1067371 non-null  object 
 2   Description  1062989 non-null  object 
 3   Quantity     1067371 non-null  int64  
 4   InvoiceDate  1067371 non-null  object 
 5   Price        1067371 non-null  float64
 6   Customer ID  824364 non-null   float64
 7   Country      1067371 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 65.1+ MB


| | Quantity | Price | Customer ID |
|---|---|---|---|
| count | 1067371.0 | 1067371.0 | 824364.0 |
| mean | 9.938898 | 4.649388 | 15324.638504 |
| std | 172.7058 | 123.5531 | 1697.46445 |
| min | -80995.0 | -53594.36 | 12346.0 |
| 25% | 1.0 | 1.25 | 13975.0 |
| 50% | 3.0 | 2.1 | 15255.0 |
| 75% | 10.0 | 4.15 | 16797.0 |
| max | 80995.0 | 38970.0 | 18287.0 |
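
The summary points to several data-quality issues: the minimum `Quantity` and `Price` are negative, and `Description` and `Customer ID` have fewer non-null values than there are rows. The sketch below shows one way to count such rows explicitly before cleaning. The helper name `quality_report` and the sample rows are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Count rows affected by each data-quality issue (hypothetical helper)."""
    invoice = frame["Invoice"].astype(str)
    return {
        "missing_description": int(frame["Description"].isna().sum()),
        "missing_customer_id": int(frame["Customer ID"].isna().sum()),
        "non_positive_quantity": int((frame["Quantity"] <= 0).sum()),
        "negative_price": int((frame["Price"] < 0).sum()),
        "cancelled_rows": int(invoice.str.startswith("C").sum()),
    }

# Made-up rows that use the same column names as the raw dataset.
sample = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489450", "489451"],
    "Description": ["CAT BOWL", "CAT BOWL", None, "DOG BOWL"],
    "Quantity": [12, -12, 5, 0],
    "Price": [2.55, 2.55, 1.25, -3.75],
    "Customer ID": [13085.0, 13085.0, None, 13090.0],
})

report = quality_report(sample)
print(report)
```

Calling `quality_report(df)` on the full frame would give the corresponding counts for the real data.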


#### A2. Cleaning the data
Removing rows with missing descriptions or customer IDs, non-positive quantities, and cancelled invoices (invoice numbers starting with "C").


In [21]:
df_clean = df.copy()

# Standardise column names by removing spaces (e.g. "Customer ID" becomes "CustomerID")
df_clean.columns = [col.strip().replace(" ", "") for col in df_clean.columns]

# Drop rows with a missing Description or CustomerID
df_clean = df_clean.dropna(subset=["Description", "CustomerID"])

# Remove negative or zero quantities
df_clean = df_clean[df_clean["Quantity"] > 0]

# Remove cancelled invoices (invoice codes starting with "C");
# .copy() avoids a chained-assignment warning when TotalPrice is added below
df_clean = df_clean[~df_clean["Invoice"].astype(str).str.startswith("C")].copy()

# Create total price column
df_clean["TotalPrice"] = df_clean["Quantity"] * df_clean["Price"]

# Display Shape & the first 5 rows of the cleaned Data
print("Original shape:", df.shape)
print("Cleaned shape:", df_clean.shape)
df_clean.head()


Original shape: (1067371, 8)
Cleaned shape: (805620, 9)


| | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | CustomerID | Country | TotalPrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom | 83.4 |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 81.0 |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom | 81.0 |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.1 | 13085.0 | United Kingdom | 100.8 |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom | 30.0 |
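
The shape comparison reports only the overall reduction in rows. To see how much each filter contributes, the same three filters can be applied one at a time while recording the row count. This is a sketch on made-up rows, assuming the same filter order as in the cell above; the `steps` list and its labels are illustrative.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the cleaned frame.
sample = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489450", "489451", "489452"],
    "Description": ["CAT BOWL", "CAT BOWL", None, "DOG BOWL", "MUG"],
    "Quantity": [12, -12, 5, 0, 24],
    "Price": [2.55, 2.55, 1.25, 3.75, 1.25],
    "CustomerID": [13085.0, 13085.0, 13086.0, 13090.0, None],
})

# Each step is a label plus a function that returns the filtered frame.
steps = [
    ("drop missing Description/CustomerID",
     lambda d: d.dropna(subset=["Description", "CustomerID"])),
    ("keep Quantity > 0",
     lambda d: d[d["Quantity"] > 0]),
    ("drop cancelled invoices",
     lambda d: d[~d["Invoice"].astype(str).str.startswith("C")]),
]

cleaned = sample
removed = {}
for label, step in steps:
    before = len(cleaned)
    cleaned = step(cleaned)
    removed[label] = before - len(cleaned)
    print(f"{label}: removed {removed[label]} rows, {len(cleaned)} remain")
```

In this made-up sample the cancelled row also has a negative quantity, so the quantity filter removes it before the last step runs. Whether the same overlap holds for the full dataset would need to be checked on the real data.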


#### A3. Creating Customer-level features
Aggregating total spending, transaction count, and average basket size per customer as inputs for the clustering analysis.


In [27]:
# Group data by CustomerID & Calculate total spending, transaction count, and total quantity

customer_df = (
    df_clean
    .groupby("CustomerID")
    .agg(
        TotalSpending=("TotalPrice", "sum"),     
        TransactionCount=("Invoice", "nunique"), 
        TotalQty=("Quantity", "sum")
    )
    .reset_index()
)

# Compute Average basket size = total items / number of invoices
customer_df["AvgBasketSize"] = (
    customer_df["TotalQty"] / customer_df["TransactionCount"]
)

# Display the first 10 rows of the Customer DataFrame
customer_df.head(10)


| | CustomerID | TotalSpending | TransactionCount | TotalQty | AvgBasketSize |
|---|---|---|---|---|---|
| 0 | 12346.0 | 77556.46 | 12 | 74285 | 6190.416667 |
| 1 | 12347.0 | 5633.32 | 8 | 3286 | 410.75 |
| 2 | 12348.0 | 2019.4 | 5 | 2714 | 542.8 |
| 3 | 12349.0 | 4428.69 | 4 | 1624 | 406.0 |
| 4 | 12350.0 | 334.4 | 1 | 197 | 197.0 |
| 5 | 12351.0 | 300.93 | 1 | 261 | 261.0 |
| 6 | 12352.0 | 2849.84 | 10 | 724 | 72.4 |
| 7 | 12353.0 | 406.76 | 2 | 212 | 106.0 |
| 8 | 12354.0 | 1079.4 | 1 | 530 | 530.0 |
| 9 | 12355.0 | 947.61 | 2 | 543 | 271.5 |
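
Because each invoice spans several rows (one per product line), the transaction count uses `nunique` on `Invoice` and not a plain row count. The worked example below applies the same aggregation to a few made-up line items so the arithmetic can be checked by hand. The `LineCount` column is added here only for contrast and is not part of the original feature set.

```python
import pandas as pd

# Made-up line items: customer 1 has two invoices, customer 2 has one.
lines = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "Invoice": ["A1", "A1", "A2", "B1", "B1"],
    "Quantity": [2, 3, 5, 4, 6],
    "Price": [1.0, 2.0, 1.0, 0.5, 1.5],
})
lines["TotalPrice"] = lines["Quantity"] * lines["Price"]

features = (
    lines.groupby("CustomerID")
    .agg(
        TotalSpending=("TotalPrice", "sum"),
        TransactionCount=("Invoice", "nunique"),  # distinct invoices
        LineCount=("Invoice", "count"),           # rows, shown for contrast
        TotalQty=("Quantity", "sum"),
    )
    .reset_index()
)

# Average basket size is measured in items per invoice, not in money.
features["AvgBasketSize"] = features["TotalQty"] / features["TransactionCount"]
print(features)
```

Customer 1 has three rows but only two invoices, so the average basket size is 10 / 2 = 5 items. Dividing by the row count would give about 3.33, which is the average line quantity and not the basket size.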
