This project curates and analyzes the Dunnhumby Complete Journey dataset to study customer purchasing patterns and market segmentation. The work follows the Digital Curation Centre (DCC) Data Lifecycle Model and includes cleaning, metadata creation, and synthetic data augmentation to improve sample diversity.
- Clean and curate the Dunnhumby dataset
- Analyze customer behavior and segmentation
- Analyze campaign effectiveness
- Document a reproducible curation workflow
- Dataset: Dunnhumby – The Complete Journey
- Source: Kaggle
- License: Open data for research use
Refer to the metadata documentation for the dataset's structure, variables, and data types.
See the workflow provenance for a detailed overview of the project's data processing steps.
- Python 3.10+ (tested)
- Create virtual env:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
- Obtain raw data: Dunnhumby — The Complete Journey (Kaggle)
- Place raw CSVs in data/raw/ (filenames: transaction_data.csv, hh_demographic.csv, product.csv, coupon.csv, etc.)
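Before running the workflow, it can help to confirm that the raw files are where the pipeline expects them. The following is a minimal sketch, not part of the project code: the helper name `missing_raw_files` is hypothetical, and the file list covers only the names given above (the dataset ships further CSVs).

```python
from pathlib import Path

# Filenames named in this README; the full dataset contains additional CSVs.
EXPECTED_FILES = [
    "transaction_data.csv",
    "hh_demographic.csv",
    "product.csv",
    "coupon.csv",
]


def missing_raw_files(raw_dir="data/raw", expected=EXPECTED_FILES):
    """Return the expected filenames that are absent from raw_dir."""
    raw_path = Path(raw_dir)
    return [name for name in expected if not (raw_path / name).is_file()]


if __name__ == "__main__":
    missing = missing_raw_files()
    if missing:
        print("Missing raw files:", ", ".join(missing))
    else:
        print("All expected raw files found.")
```

An empty list means every listed file is present; anything else names the files still to be downloaded from Kaggle.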
Data were obtained from Kaggle (Dunnhumby — The Complete Journey). We do not redistribute raw files. Household IDs are anonymized; no PII is present. Use of the data follows Kaggle's terms of service and any downstream redistribution is restricted.
python3 -m src.customer_workflow
This performs:
- Loading and verification of raw data
- Cleaning (duplicates, invalid rows, normalization)
- Demographic mapping & heuristics
- Merge & aggregation of household-level metrics
- Figure generation, saved to data/customer_segmentation/figures/
- Logging to logs/workflow.txt
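The cleaning, aggregation, and merge steps listed above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in src.customer_workflow: the function names are hypothetical, the column names (household_key, BASKET_ID, QUANTITY, SALES_VALUE) are assumed to match the raw CSV headers, and the rule for "invalid rows" (non-positive quantity or sales value) is a guess at one reasonable definition. The rows in the demo are toy values, not real data.

```python
import pandas as pd


def clean_transactions(tx):
    """Drop exact duplicate rows and rows with non-positive quantity or sales."""
    tx = tx.drop_duplicates()
    valid = (tx["QUANTITY"] > 0) & (tx["SALES_VALUE"] > 0)
    return tx.loc[valid].reset_index(drop=True)


def household_metrics(tx):
    """Aggregate cleaned transactions to one row per household."""
    return (
        tx.groupby("household_key")
        .agg(
            total_spend=("SALES_VALUE", "sum"),
            n_baskets=("BASKET_ID", "nunique"),
            n_items=("QUANTITY", "sum"),
        )
        .reset_index()
    )


def merge_demographics(metrics, demo):
    """Left-join so households without a demographic record are kept."""
    return metrics.merge(demo, on="household_key", how="left")


if __name__ == "__main__":
    # Toy rows for illustration only.
    tx = pd.DataFrame({
        "household_key": [1, 1, 1, 2, 2],
        "BASKET_ID": [10, 10, 11, 20, 21],
        "QUANTITY": [1, 1, 2, 0, 3],
        "SALES_VALUE": [2.5, 2.5, 4.0, 0.0, 6.0],
    })
    demo = pd.DataFrame({"household_key": [1], "INCOME_DESC": ["50-74K"]})
    print(merge_demographics(household_metrics(clean_transactions(tx)), demo))
```

A left join is used here because demographic records may not cover every household that appears in the transactions; an inner join would silently drop the rest.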
- Outputs are deterministic given the same raw data and environment.
- To re-run from scratch, remove the generated outputs first:
rm -rf data/cleaned/* data/customer_segmentation/*
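One way to check the determinism claim is to hash the output files after two separate runs and compare the results. The sketch below is a suggestion rather than part of the project; the helper names `file_digest`, `snapshot`, and `changed_files` are hypothetical. Note that some figure formats may embed timestamps or version metadata, so a byte-level comparison is most meaningful for tabular outputs.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(directory):
    """Map each file path (relative to directory) to its SHA-256 digest."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(before, after):
    """Return paths whose digest differs or that exist in only one snapshot."""
    return sorted(
        key for key in before.keys() | after.keys()
        if before.get(key) != after.get(key)
    )
```

A typical use would be to call `snapshot("data/cleaned")` after the first run, clear the outputs, re-run the workflow, take a second snapshot, and confirm that `changed_files(first, second)` returns an empty list.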
See /data, /src, /notebooks, /scripts, and /docs for workflow components.
Python (pandas, numpy, matplotlib)
GitHub for versioning and documentation
Dunnhumby. (n.d.). The Complete Journey. Kaggle.
https://www.kaggle.com/datasets/frtgnn/dunnhumby-the-complete-journey