# Customer Purchase Behavior Analysis - A Real Case Study

### Probabilistic Modeling + Statistics on the Online Retail Dataset
This notebook explores customer purchase patterns using:
- Descriptive statistics: mean, median, mode, variance, std
- Distributions: Gaussian (normal) & Binomial
- Classifier: Naive Bayes to predict “high value” customers
- Visuals: histograms, KDEs, QQ plots, confusion matrix


## 1) Setup
Install the required packages.

In [2]:
%%capture
!pip install pandas ucimlrepo numpy matplotlib seaborn scikit-learn scipy openpyxl --quiet

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


from pathlib import Path
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, ConfusionMatrixDisplay


sns.set(style="whitegrid") # Specifies a visual style for the plots. It sets the background to white with a gray grid
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE) # When you use the same seed, the sequence of "random" numbers generated will be identical every time


## 2) Load data

In [4]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
online_retail = fetch_ucirepo(id=352) 
  
# data (as pandas dataframes) 
X = online_retail.data.features 
y = online_retail.data.targets 

In [5]:
# metadata 
print(online_retail.metadata)

{'uci_id': 352, 'name': 'Online Retail', 'repository_url': 'https://archive.ics.uci.edu/dataset/352/online+retail', 'data_url': 'https://archive.ics.uci.edu/static/public/352/data.csv', 'abstract': 'This is a transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.', 'area': 'Business', 'tasks': ['Classification', 'Clustering'], 'characteristics': ['Multivariate', 'Sequential', 'Time-Series'], 'num_instances': 541909, 'num_features': 6, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': None, 'index_col': ['InvoiceNo', 'StockCode'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2015, 'last_updated': 'Mon Oct 21 2024', 'dataset_doi': '10.24432/C5BW33', 'creators': ['Daqing Chen'], 'intro_paper': {'ID': 361, 'type': 'NATIVE', 'title': 'Data mining for the online retail industry: A case study of RFM model-based customer segmenta

In [6]:
# variable information 
print(online_retail.variables) 

          name     role         type demographic  \
0    InvoiceNo       ID  Categorical        None   
1    StockCode       ID  Categorical        None   
2  Description  Feature  Categorical        None   
3     Quantity  Feature      Integer        None   
4  InvoiceDate  Feature         Date        None   
5    UnitPrice  Feature   Continuous        None   
6   CustomerID  Feature  Categorical        None   
7      Country  Feature  Categorical        None   

                                         description     units missing_values  
0  a 6-digit integral number uniquely assigned to...      None             no  
1  a 5-digit integral number uniquely assigned to...      None             no  
2                                       product name      None             no  
3  the quantities of each product (item) per tran...      None             no  
4  the day and time when each transaction was gen...      None             no  
5                             product price per uni

## 3) Basic cleaning & feature engineering
Every dataset needs its own cleaning and filtering, so it helps to understand what the data contains before modeling. For this dataset we:
- Keep positive quantities & prices
- Drop rows with missing CustomerID
- Parse datetimes
- Create TotalPrice and per-invoice features


In [9]:
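# Work on a copy of the features so the original frame stays untouched
df = X.copy()
# Note: the metadata reports target_col as None, so y may be None and this column entirely missing
df["target"] = y
df.describe()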

| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 541909.0 | 541909.0 | 406829.0 |
| mean | 9.55225 | 4.611114 | 15287.69057 |
| std | 218.081158 | 96.759853 | 1713.600303 |
| min | -80995.0 | -11062.06 | 12346.0 |
| 25% | 1.0 | 1.25 | 13953.0 |
| 50% | 3.0 | 2.08 | 15152.0 |
| 75% | 10.0 | 4.13 | 16791.0 |
| max | 80995.0 | 38970.0 | 18287.0 |
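The summary shows negative minimum values for Quantity and UnitPrice, and a CustomerID count well below the row count, which is what motivates the filters listed for this section. Before dropping anything, it can help to count how many rows each rule would remove. The sketch below does that on a small hypothetical frame; `toy` and `cleaning_report` are illustrative names, not part of the dataset or the notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real transactions.
toy = pd.DataFrame({
    "Quantity":   [6, -1, 3, 10, 2],
    "UnitPrice":  [2.55, 1.25, 0.0, 4.13, 1.95],
    "CustomerID": [17850.0, 13047.0, None, 12583.0, None],
})

def cleaning_report(frame: pd.DataFrame) -> pd.Series:
    """Count rows violating each cleaning rule, plus rows that pass all rules."""
    missing_id = frame["CustomerID"].isna()
    bad_qty = frame["Quantity"] <= 0
    bad_price = frame["UnitPrice"] <= 0
    keep = ~(missing_id | bad_qty | bad_price)
    return pd.Series({
        "missing CustomerID": int(missing_id.sum()),
        "non-positive Quantity": int(bad_qty.sum()),
        "non-positive UnitPrice": int(bad_price.sum()),
        "rows kept": int(keep.sum()),
    })

report = cleaning_report(toy)
print(report)
```

A row can violate more than one rule, so the individual counts need not add up to the number of rows removed. Calling `cleaning_report(df)` on the real frame before filtering would give the corresponding breakdown.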


In [11]:
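# Drop rows without a customer and rows with non-positive quantity or price
df = df.dropna(subset=["CustomerID"])
# .copy() makes the filtered frame independent, so the column assignments below do not raise SettingWithCopyWarning
df = df[(df["Quantity"] > 0) & (df["UnitPrice"] > 0)].copy()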


# Parse datetime & basic features
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
df["TotalPrice"] = df["Quantity"] * df["UnitPrice"]
df["InvoiceMonth"] = df["InvoiceDate"].dt.to_period("M").dt.to_timestamp()
df["Country"] = df["Country"].astype("category")


df.shape, df.isna().sum().sum()


((397884, 9), 397884)
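The total number of missing values in this output equals the number of rows, which is the pattern a single, entirely missing column would produce. Since the metadata lists `target_col` as None, the `target` column is a plausible candidate, but that is an assumption worth checking rather than a confirmed fact. A minimal sketch of the check, using a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame; "target" mimics a column assigned from a missing (None) target.
toy = pd.DataFrame({
    "Quantity": [6, 3, 10],
    "UnitPrice": [2.55, 1.25, 4.13],
})
toy["target"] = None  # every entry is missing

na_per_column = toy.isna().sum()      # missing values per column
total_na = int(na_per_column.sum())   # same quantity as toy.isna().sum().sum()
print(na_per_column)

# Columns that are entirely missing carry no information and can be dropped.
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Running `df.isna().sum()` on the real frame shows which column accounts for the missing values.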

## 4) Descriptive statistics
Compute the mean, median, mode, variance, and standard deviation of Quantity, UnitPrice, and TotalPrice.


In [12]:
summary_stats = df[["Quantity", "UnitPrice", "TotalPrice"]].agg(
    ["mean", "median", lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan, "var", "std"]
).T
summary_stats.columns = ["mean", "median", "mode", "variance", "std"]
summary_stats

| | mean | median | mode | variance | std |
|---|---|---|---|---|---|
| Quantity | 12.988238 | 6.0 | 1.0 | 32159.885511 | 179.331775 |
| UnitPrice | 3.116488 | 1.95 | 1.25 | 488.316152 | 22.097877 |
| TotalPrice | 22.397 | 11.8 | 15.0 | 95524.908641 | 309.071041 |
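One detail worth keeping in mind when reading the variance and std columns: pandas computes the sample versions by default (`ddof=1`, dividing by n - 1), whereas `np.var` and `np.std` default to the population versions (`ddof=0`). The sketch below illustrates the difference on hypothetical toy values and then checks that the reported std column is the square root of the reported variance column.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 2.0, 3.0, 12.0])  # hypothetical toy data

sample_var = values.var()             # pandas default: ddof=1, divides by n - 1
population_var = values.var(ddof=0)   # divides by n, matching NumPy's default
sample_std = values.std()             # also ddof=1

print(sample_var, population_var, sample_std)

# Values copied from the summary table for Quantity, UnitPrice, TotalPrice.
reported_var = np.array([32159.885511, 488.316152, 95524.908641])
reported_std = np.array([179.331775, 22.097877, 309.071041])
table_is_consistent = bool(np.allclose(np.sqrt(reported_var), reported_std, rtol=1e-6))
print(table_is_consistent)
```

With several hundred thousand rows the choice of `ddof` barely changes the numbers, but it matters when comparing pandas results against NumPy on small samples.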


### Interpretation 
1. Mean vs. median
- Quantity: the mean (≈ 12.99) is about twice the median (6.0), which points to a right-skewed distribution in which a few very large quantities inflate the average.
- UnitPrice: mean ≈ 3.12 vs. median 1.95 shows the same skew: most items are cheap, and a few expensive ones pull the average up.
- TotalPrice: the mean (≈ 22.4) is also about twice the median (11.8), so high-value lines occur but are infrequent.
2. Mode
- Quantity mode = 1: the most common line is a single unit.
- UnitPrice mode = 1.25: the most frequent price among transaction lines, which is not necessarily the most common price among distinct catalog products.
- TotalPrice mode = 15.0: the most frequent line total. Several quantity and price combinations can produce this value, so it does not by itself identify one typical purchase.
3. Variance & standard deviation
- The standard deviations of Quantity (≈ 179) and TotalPrice (≈ 309) are more than ten times their means, so line sizes and values vary enormously.
- UnitPrice has a smaller std (≈ 22) in absolute terms, but it is still about seven times its mean, which suggests a few very expensive items.
4. Business insight
- Each row is one product line on an invoice, so these statistics describe line items rather than whole orders.
- Most lines combine a small quantity with a low unit price.
- Rare but very large lines distort the averages, so the median is the more robust summary of a typical line.
- The price spread is consistent with a segmented catalog (many low-cost, frequently bought items and a few premium ones), although these statistics alone do not establish a pricing strategy.
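The skew and spread claims above can be quantified. The sketch below computes the coefficient of variation (std divided by mean) from the values in the summary table, then shows on a hypothetical log-normal sample, not the retail data, how `scipy.stats.skew` and the share of total value held by the largest 1% of rows capture a heavy right tail. The sample parameters are arbitrary assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Mean and std copied from the summary table.
table = pd.DataFrame(
    {"mean": [12.988238, 3.116488, 22.397],
     "std": [179.331775, 22.097877, 309.071041]},
    index=["Quantity", "UnitPrice", "TotalPrice"],
)
# Coefficient of variation: spread expressed in units of the mean.
table["cv"] = table["std"] / table["mean"]
print(table.round(2))

# Hypothetical right-skewed sample (log-normal); parameters are arbitrary.
rng = np.random.default_rng(42)
sample = pd.Series(rng.lognormal(mean=2.0, sigma=1.5, size=10_000))

skewness = stats.skew(sample)  # positive values indicate a longer right tail
top_1pct_share = sample.nlargest(100).sum() / sample.sum()  # largest 1% of rows

print(f"mean={sample.mean():.2f}, median={sample.median():.2f}, skew={skewness:.2f}")
print(f"share of total from top 1% of rows: {top_1pct_share:.1%}")
```

Applying the same two measures to `df["Quantity"]` and `df["TotalPrice"]` would show how strongly the real data is dominated by a few large lines.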
