CPG Private Label Opportunity Engine

A data-driven framework that identifies food categories where European retailers can launch health-positioned private label products into underserved nutritional gaps.

Problem Statement

European grocery retailers earn 25-30% margins on private label vs. ~1% on national brands. The question: which product categories have the largest gap between consumer demand for healthier options and what is currently on shelves, and where is a private label best positioned to fill that gap?

Business Impact

Analysis of 2.57M food products across 27 EU markets reveals that 73% of products score Nutri-Score C or worse (unhealthy), while private label (PL) penetration among healthy (A/B) alternatives is only ~10%. The top opportunities:

Snacks (119K products, 88% unhealthy, gap 0.88): largest of the three by product count, with almost no healthy PL
Condiments & Sauces (31K products, gap 0.77): salt reduction path is technically feasible
Breakfast (8K products, gap 0.93): highest nutritional gap, but requires aggressive reformulation

At published private label margin rates (25-30%), these categories represent a significant annual revenue opportunity per retail chain.

Key Findings

Massive nutritional gap: 30 of 45 food categories have >70% unhealthy products with minimal healthy PL alternatives
Reformulation is feasible in many categories: Crepes & Galettes needs only a 37% sugar reduction to reach Nutri-Score B, whereas Meat Products needs an 80% salt reduction
Markets are fragmented: almost all categories have low brand concentration (HHI < 0.05), which lowers the barrier to entry
Cross-country variation: the same category can be 95% Nutri-Score C/D/E in one country but 70% in another, enabling targeted launches
PL leaders have slightly better nutrition: lower sugar (2.8 vs 3.6 g/100 g) than average products
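
The brand concentration figure quoted above is a Herfindahl-Hirschman Index (HHI): the sum of squared brand shares within a category. A minimal sketch follows, assuming shares are measured by product count per brand (an assumption; the README does not say how shares are defined). The function name `hhi` and the brand labels are hypothetical.

```python
from collections import Counter


def hhi(brands):
    """Herfindahl-Hirschman Index on a 0-1 scale.

    `brands` holds one brand label per product in a single category.
    Shares here are product-count shares, which is an assumption.
    """
    counts = Counter(brands)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one product")
    # Sum of squared shares: 1.0 for a single brand, near 0 for many small brands.
    return sum((n / total) ** 2 for n in counts.values())


# Toy category: 4 products across 3 made-up brands (shares 0.5, 0.25, 0.25).
example = ["BrandA", "BrandA", "BrandB", "BrandC"]
print(round(hhi(example), 3))  # 0.375
```

On this 0-1 scale, a category with 40 equally sized brands scores 0.025, below the 0.05 threshold quoted above.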

Methodology

Analysed 2,568,269 food products from the Open Food Facts database across 27 EU countries, enriched with pricing scraped from Mercadona (Spain, 3,225 products) and Albert Heijn (Netherlands, 11,209 products)
Built multi-retailer scraping pipeline handling different API structures (REST, mobile auth), joined to nutritional database via EAN barcode and fuzzy name matching (51% and 41% match rates)
Computed Nutri-Score for 706K products missing official scores using the published 2023 algorithm, increasing coverage from 32% to 59%
Built brand concentration (HHI), nutritional gap, and reformulation feasibility metrics per product category
Ranked categories by 6-component composite opportunity score with Monte Carlo sensitivity analysis (1,000 Dirichlet weight simulations)
Trained an interpretable gradient-boosted classifier to identify attributes of successful private label products (CV AUC 0.65), plus a brand classifier (CV F1 0.996)
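
The Monte Carlo weight sensitivity step can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: it draws weights from a flat Dirichlet distribution using only the standard library (normalised Gamma draws), whereas the project's stack includes numpy, where `numpy.random.Generator.dirichlet` would do the same job. The function names, the toy categories, and the "share of simulations in the top N" summary are all hypothetical.

```python
import random


def dirichlet_weights(rng, k):
    """One draw from a flat Dirichlet(1, ..., 1) via normalised Gamma(1, 1) draws."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def rank_stability(scores, n_sims=1000, top_n=3, seed=0):
    """Share of simulations in which each category lands in the top_n.

    `scores` maps category name -> list of component scores, assumed to be
    already normalised to the 0-1 range.
    """
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    k = len(next(iter(scores.values())))
    hits = {name: 0 for name in scores}
    for _ in range(n_sims):
        w = dirichlet_weights(rng, k)
        composite = {
            name: sum(wi * ci for wi, ci in zip(w, comps))
            for name, comps in scores.items()
        }
        ranked = sorted(composite, key=composite.get, reverse=True)
        for name in ranked[:top_n]:
            hits[name] += 1
    return {name: h / n_sims for name, h in hits.items()}


# Made-up scores for four categories across six components.
toy = {
    "A": [0.9, 0.9, 0.9, 0.9, 0.9, 0.9],  # strong on every component
    "B": [0.8, 0.2, 0.7, 0.3, 0.6, 0.4],
    "C": [0.2, 0.8, 0.3, 0.7, 0.4, 0.6],
    "D": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],  # weak on every component
}
stability = rank_stability(toy, n_sims=1000, top_n=2, seed=42)
for name, share in stability.items():
    print(name, share)
```

A category that stays in the top N across most weight draws is a robust recommendation; one that appears only under a narrow range of weights depends on how the components are prioritised.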

Tech Stack

Python (pandas, numpy, scikit-learn, matplotlib, seaborn, requests, rapidfuzz, DuckDB), Open Food Facts bulk data, Mercadona/Albert Heijn scraped data

Project Structure

├── src/
│   ├── data/               # Data loading, scrapers, joining, cleaning, Nutri-Score
│   │   ├── scrapers/       # Mercadona, Albert Heijn API clients
│   │   ├── load_off.py     # Open Food Facts loader
│   │   ├── clean.py        # Brand normalisation, category mapping, PL flagging
│   │   ├── join.py         # EAN + fuzzy name matching
│   │   └── nutriscore.py   # Vectorised Nutri-Score computation
│   ├── analysis/           # Category landscape, nutritional gaps, pricing, scoring
│   ├── models/             # Brand classifier, PL success predictor
│   └── visualisation/      # Streamlit dashboard (dashboard.py)
├── notebooks/              # Numbered analysis notebooks (01–06 + 05b)
├── scripts/                # Pipeline scripts (build_dataset.py, scrapers)
├── data/sample/            # Small sample for reviewers
├── results/                # 18 PNG charts + analysis outputs
└── notes.md                # Detailed data findings log

Setup

# Clone and install
git clone <repo-url>
cd PrivateLabelOpportunities
pip install -e ".[dev]"

# Download data (4.4GB Open Food Facts Parquet)
python scripts/download_off.py

# Run scrapers
python scripts/scrape_mercadona.py
python scripts/scrape_albert_heijn.py

# Build dataset (cleaning + Nutri-Score + joining, ~6 min)
PYTHONPATH=. python scripts/build_dataset.py

# Run notebooks
jupyter notebook notebooks/

Interactive Dashboard

Live demo: joshuaprettyman.com/projects/pl-dashboard

Four tabs: Category Landscape (treemap by product count coloured by PL penetration, HHI bar chart), Nutritional Gaps (scatter of % Nutri-Score C/D/E vs PL penetration with quadrant annotations, top-10 gap table), Opportunity Ranking (sortable composite score table, stacked component breakdown for top 10), and Data Quality (dataset overview, coverage notes, products-per-category bar chart). Loads from the Parquet files in data/sample/, so no raw data is processed at runtime.

Run locally:

streamlit run src/visualisation/dashboard.py

Or via Docker:

docker build -t pl-dashboard .
docker run -p 8501:8501 pl-dashboard

Tests

pytest tests/ -v

51 tests covering Nutri-Score computation (boundary cases, sodium conversion, vectorised column), EAN/fuzzy joining, nutritional gap analysis, opportunity scoring (normalisation, weight sensitivity, Monte Carlo determinism), category landscape metrics (HHI, PL penetration, assortment depth), and brand classification. All tests use synthetic data, so no large datasets are required.
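
To give a flavour of a synthetic-data test, here is a hypothetical pytest-style sketch for the sodium conversion case. The helper `sodium_mg_from_salt_g` is an illustrative stand-in, not the function in src/data/nutriscore.py; it relies only on the standard labelling convention that salt equals sodium times 2.5.

```python
import math

SALT_PER_SODIUM = 2.5  # labelling convention: salt (g) = sodium (g) x 2.5


def sodium_mg_from_salt_g(salt_g: float) -> float:
    """Convert salt in g per 100 g to sodium in mg per 100 g."""
    return salt_g * 1000.0 / SALT_PER_SODIUM


def test_sodium_conversion_known_value():
    # 1 g of salt corresponds to 400 mg of sodium.
    assert math.isclose(sodium_mg_from_salt_g(1.0), 400.0)


def test_sodium_conversion_zero():
    assert sodium_mg_from_salt_g(0.0) == 0.0


def test_sodium_conversion_is_linear():
    # Doubling the salt content doubles the sodium content.
    assert math.isclose(sodium_mg_from_salt_g(2.4), 2 * sodium_mg_from_salt_g(1.2))
```

Tests of this shape need no data files, which is what keeps the suite runnable without the 4.4GB download.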

Notebooks

#	Notebook	Description
01	Data Loading & Cleaning	OFF data profiling, format discoveries, cleaning pipeline
02	Supermarket Scraping	Scraper results, API structures, join quality
03	Category Landscape EDA	Nutri-Score landscape, HHI, PL penetration, price gaps
04	Nutritional Gap Analysis	Top-10 deep dives, nutrient heatmap, reformulation paths
05	Opportunity Scoring	Composite scores, Monte Carlo sensitivity, weight scenarios
05b	Predictive Model	PL success predictor, brand classifier, feature importances
06	Findings & Recommendations	Executive summary, strategic recommendations
