# Customer Purchase Behavior Analysis - A Real Case Study

### Probabilistic Modeling + Statistics on the Online Retail Dataset
This notebook explores customer purchase patterns using:
- Descriptive statistics: mean, median, mode, variance, std
- Distributions: Gaussian (normal) & Binomial
- Classifier: Naive Bayes to predict “high value” customers
- Visuals: histograms, KDEs, QQ plots, confusion matrix


## 1) Setup
Install the required packages.

In [1]:
%%capture
!pip install pandas ucimlrepo numpy matplotlib seaborn scikit-learn scipy openpyxl --quiet

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


from pathlib import Path
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, ConfusionMatrixDisplay


sns.set(style="whitegrid")  # white background with a light gray grid
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)  # fixed seed so NumPy's random draws are identical on every run
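
As a quick check of what seeding does, the sketch below re-seeds NumPy's global generator and confirms that the same draws come back. The variable names `first` and `second` are illustrative only and are not used elsewhere in the notebook.

```python
import numpy as np

RANDOM_STATE = 42

# Seed, draw three numbers, then re-seed and draw again.
np.random.seed(RANDOM_STATE)
first = np.random.rand(3)

np.random.seed(RANDOM_STATE)
second = np.random.rand(3)

# Both draws come from the same seeded sequence, so they match exactly.
print(np.array_equal(first, second))
```

For scikit-learn calls such as `train_test_split`, passing `random_state=RANDOM_STATE` explicitly is usually the more robust choice, since it does not depend on the state of the global generator.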


## 2) Load data

In [4]:
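from ucimlrepo import fetch_ucirepo

# Fetch the Online Retail dataset (UCI id 352)
online_retail = fetch_ucirepo(id=352)

# Data (as pandas DataFrames)
X = online_retail.data.features
y = online_retail.data.targets  # the metadata lists target_col as None, so no target is defined here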

In [5]:
# metadata 
print(online_retail.metadata)

{'uci_id': 352, 'name': 'Online Retail', 'repository_url': 'https://archive.ics.uci.edu/dataset/352/online+retail', 'data_url': 'https://archive.ics.uci.edu/static/public/352/data.csv', 'abstract': 'This is a transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.', 'area': 'Business', 'tasks': ['Classification', 'Clustering'], 'characteristics': ['Multivariate', 'Sequential', 'Time-Series'], 'num_instances': 541909, 'num_features': 6, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': None, 'index_col': ['InvoiceNo', 'StockCode'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2015, 'last_updated': 'Mon Oct 21 2024', 'dataset_doi': '10.24432/C5BW33', 'creators': ['Daqing Chen'], 'intro_paper': {'ID': 361, 'type': 'NATIVE', 'title': 'Data mining for the online retail industry: A case study of RFM model-based customer segmenta

In [6]:
# variable information 
print(online_retail.variables) 

          name     role         type demographic  \
0    InvoiceNo       ID  Categorical        None   
1    StockCode       ID  Categorical        None   
2  Description  Feature  Categorical        None   
3     Quantity  Feature      Integer        None   
4  InvoiceDate  Feature         Date        None   
5    UnitPrice  Feature   Continuous        None   
6   CustomerID  Feature  Categorical        None   
7      Country  Feature  Categorical        None   

                                         description     units missing_values  
0  a 6-digit integral number uniquely assigned to...      None             no  
1  a 5-digit integral number uniquely assigned to...      None             no  
2                                       product name      None             no  
3  the quantities of each product (item) per tran...      None             no  
4  the day and time when each transaction was gen...      None             no  
5                             product price per uni

## 3) Basic cleaning & feature engineering
Every dataset needs its own cleaning and filtering, so it helps to understand what the data contains before transforming it. The steps here are:
- Keep positive quantities & prices
- Drop rows with missing CustomerID
- Parse datetimes
- Create TotalPrice and per-invoice features
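
The steps above can be sketched on a few made-up rows. This is a minimal illustration, not the notebook's actual cleaning code: the toy `raw` frame, the `%m/%d/%Y %H:%M` date format, and the per-invoice feature names (`InvoiceTotal`, `ItemCount`, `LineCount`) are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical toy rows that mimic the Online Retail columns (not real data).
raw = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3", "A4"],
    "Quantity": [2, 3, -1, 5, 4],
    "InvoiceDate": ["12/1/2010 8:26", "12/1/2010 8:26", "12/1/2010 9:00",
                    "12/2/2010 10:15", "12/3/2010 11:30"],
    "UnitPrice": [2.50, 1.00, 4.00, 0.0, 3.00],
    "CustomerID": [17850.0, 17850.0, 13047.0, 13047.0, None],
})

# 1) Keep positive quantities and prices (drops returns and zero-price lines).
df = raw[(raw["Quantity"] > 0) & (raw["UnitPrice"] > 0)]

# 2) Drop rows with a missing CustomerID.
df = df.dropna(subset=["CustomerID"]).copy()

# 3) Parse datetimes (the format string is an assumption about the raw text).
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"], format="%m/%d/%Y %H:%M")

# 4) Create TotalPrice per line, then aggregate to one row per invoice.
df["TotalPrice"] = df["Quantity"] * df["UnitPrice"]
invoice_features = (
    df.groupby(["CustomerID", "InvoiceNo"])
      .agg(InvoiceTotal=("TotalPrice", "sum"),
           ItemCount=("Quantity", "sum"),
           LineCount=("Quantity", "size"),
           InvoiceDate=("InvoiceDate", "min"))
      .reset_index()
)
print(invoice_features)
```

Only invoice `A1` survives the filters in this toy example: `A2` has a negative quantity, `A3` has a zero price, and `A4` has no customer.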
