# 02 — Supermarket Scraping & Data Joining

**Objective:** Scrape product catalogues from European supermarkets and join them to Open Food Facts (OFF).

## Scrapers Built

### Mercadona (Spain) — `src/data/scrapers/mercadona.py`
- REST API at `tienda.mercadona.es/api/`; no authentication required beyond a `postal_code` cookie
- Three levels of category nesting: top category → subcategory → sub-sub-category, with products at the lowest level
- **No EAN barcodes, no nutrition data** in the API
- Brand extracted from `display_name` by heuristic; unreliable for products that are not private label (PL)
- **Result: 3,225 food products** (1,954 PL / 1,271 branded)
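
The three-level nesting can be flattened with a small recursive walk. This is a minimal sketch only: the key names `categories`, `products`, and `name` are assumptions about the response shape rather than confirmed fields of the Mercadona API, and the tree below is made-up data.

```python
from typing import Iterator


def iter_products(node: dict, path: tuple = ()) -> Iterator[dict]:
    """Yield every product under `node`, tagged with its category path."""
    # Extend the path only for nodes that carry a name (the root may not).
    here = path + (node["name"],) if "name" in node else path
    for product in node.get("products", []):
        yield {**product, "category_path": " > ".join(here)}
    # Recurse into child categories; the lowest level simply has none.
    for child in node.get("categories", []):
        yield from iter_products(child, here)


# Toy tree with three category levels (hypothetical keys, made-up data).
toy_tree = {
    "categories": [
        {
            "name": "Lácteos",
            "categories": [
                {
                    "name": "Leche",
                    "categories": [
                        {
                            "name": "Leche entera",
                            "products": [
                                {"id": "1", "display_name": "Leche entera Hacendado"}
                            ],
                        }
                    ],
                }
            ],
        }
    ]
}

flat = list(iter_products(toy_tree))
print(flat[0]["category_path"])
```

Carrying the category path on each product keeps the flattened table self-describing, which may help later when filtering to food categories.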

### Albert Heijn (Netherlands) — `src/data/scrapers/albert_heijn.py`
- Mobile API at `api.ah.nl/mobile-services/`, auth via `{"clientId": "appie"}`
- Products paginated at max 100/page across 47 food taxonomy categories
- Provides Nutri-Score grade, brand, categories, and prices, but **no EAN and no detailed nutrition**
- Products appear in multiple categories → deduplication required
- **Result: 11,209 food products** (2,937 PL / 8,272 branded)
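
Pagination and deduplication can be combined in one pass. The sketch below is an illustration, not the scraper's actual code: `fetch_page` is a hypothetical callable standing in for the HTTP request, and the `id` key is an assumed product identifier.

```python
from typing import Callable, Iterable, Iterator

PAGE_SIZE = 100  # maximum page size noted above


def iter_category_products(
    fetch_page: Callable[[int, int, int], list], taxonomy_id: int
) -> Iterator[dict]:
    """Yield all products of one taxonomy category, page by page."""
    page = 0
    while True:
        batch = fetch_page(taxonomy_id, page, PAGE_SIZE)
        if not batch:
            break
        yield from batch
        if len(batch) < PAGE_SIZE:  # a short page means it was the last one
            break
        page += 1


def collect_unique(fetch_page, taxonomy_ids: Iterable[int]) -> list:
    """Deduplicate on product id, remembering every category a product was seen in."""
    unique: dict = {}
    for tid in taxonomy_ids:
        for product in iter_category_products(fetch_page, tid):
            entry = unique.setdefault(product["id"], {**product, "taxonomy_ids": []})
            entry["taxonomy_ids"].append(tid)
    return list(unique.values())


# Fake in-memory catalogue: ids 100-119 appear in both categories.
catalogue = {
    1: [{"id": i} for i in range(150)],
    2: [{"id": i} for i in range(100, 120)],
}


def fake_fetch(taxonomy_id: int, page: int, size: int) -> list:
    return catalogue[taxonomy_id][page * size:(page + 1) * size]


products = collect_unique(fake_fetch, [1, 2])
print(len(products))
```

Keeping the list of categories per product, instead of dropping duplicates outright, preserves the multi-category information for later analysis.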

### Carrefour (France/Spain) — stub, not yet implemented

## Join Strategy
- No EAN available from either scraper → fuzzy name+brand matching only
- Pre-filter OFF to the retailer's country to reduce the candidate pool
- rapidfuzz `token_sort_ratio` with a score threshold of 75 (scores run from 0 to 100)
- **Mercadona: 51% match rate** (1,644 / 3,225), mean score 82.9
- **AH: 41% match rate** (4,556 / 11,209), mean score 83.7
- Some false positives near the threshold; raising it to 80 would give stricter matching

In [None]:
import pandas as pd
from pathlib import Path

from src.data.scrapers.mercadona import MercadonaScraper
from src.data.scrapers.albert_heijn import AlbertHeijnScraper
from src.data.join import join_supermarket_to_off

In [None]:
# Load pre-scraped data (scraping takes ~5-10 min per retailer)
df_merc = pd.read_parquet("data/scraped/mercadona_products.parquet")
df_ah = pd.read_parquet("data/scraped/albert_heijn_products.parquet")

print(f"Mercadona: {len(df_merc):,} products ({df_merc['is_private_label'].sum():,} PL)")
print(f"Albert Heijn: {len(df_ah):,} products ({df_ah['is_private_label'].sum():,} PL)")

# Quick brand distribution
print("\nMercadona top brands:")
print(df_merc["brand"].value_counts().head(10).to_string())
print("\nAH top brands:")
print(df_ah["brand"].value_counts().head(10).to_string())

## Key Observations

### Mercadona
- Hacendado dominates (~60% of all products), typical for Mercadona's PL-heavy model
- The brand extraction heuristic catches known PL brands well but falls back to the "last capitalised word" for national brands, which produces false positives such as "Filetes" and "Zero"
- The API provides no standardised size or weight, so unit price comparison needs the `reference_price` field
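
The false positives are easy to reproduce with a small reimplementation of the heuristic as described. This is a sketch, not the code in `mercadona.py`: the PL brand list is illustrative (only Hacendado is named in this notebook), and the product names are invented examples.

```python
from typing import Optional, Tuple

# Illustrative list; the scraper's real list is an assumption here.
KNOWN_PL_BRANDS = ("Hacendado", "Deliplus", "Bosque Verde")


def extract_brand(display_name: str) -> Tuple[Optional[str], bool]:
    """Return (brand, is_private_label) guessed from a product display name."""
    lowered = display_name.lower()
    # Step 1: a known private-label brand anywhere in the name wins.
    for brand in KNOWN_PL_BRANDS:
        if brand.lower() in lowered:
            return brand, True
    # Step 2: fall back to the last word that starts with a capital letter.
    capitalised = [word for word in display_name.split() if word[:1].isupper()]
    return (capitalised[-1] if capitalised else None), False


for name in (
    "Leche entera Hacendado",
    "Filetes de caballa en aceite",
    "Refresco de cola Coca-Cola Zero",
):
    print(name, "->", extract_brand(name))
```

The second and third names show the failure mode: a sentence-initial capital ("Filetes") or a product variant ("Zero") is taken for the brand. A lookup against a list of known national brands before the fallback is one possible mitigation.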

### Albert Heijn
- More balanced PL/brand ratio (26%/74%) than Mercadona
- AH PL brands: AH, AH Biologisch, AH Excellent, AH Terra, AH Basic
- `nutriscore_grade` comes directly from the API, which is useful for cross-validation against OFF
- `unit_price_description` needs parsing; values look like "prijs per liter €0.95"
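
A regular expression is probably enough for the unit price strings. The sketch below assumes every value follows the "prijs per <unit> €<amount>" pattern of the single example above; other formats may exist in the data and would return `None` here. Accepting a decimal comma is also an assumption.

```python
import re
from typing import Optional, Tuple

_UNIT_PRICE_RE = re.compile(
    r"prijs per\s+(?P<unit>[^\s€]+)\s*€\s*(?P<price>\d+(?:[.,]\d+)?)",
    re.IGNORECASE,
)


def parse_unit_price(description: Optional[str]) -> Optional[Tuple[str, float]]:
    """Parse 'prijs per liter €0.95' into ('liter', 0.95); None if no match."""
    if not description:
        return None
    match = _UNIT_PRICE_RE.search(description)
    if match is None:
        return None
    # Accept both decimal point and decimal comma.
    price = float(match.group("price").replace(",", "."))
    return match.group("unit").lower(), price


print(parse_unit_price("prijs per liter €0.95"))
```

Applied to the DataFrame, this could look like `df_ah["unit_price_description"].map(parse_unit_price)`, after which the unit and price can be split into separate columns.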

### Join Quality
- The cap of 50K OFF candidates per retailer may miss some matches; it could be raised for production
- False positive rate at threshold 75 is non-trivial (e.g., "Canela en rama" → "Helado Crema de Nata")
- Products without brands in OFF or with very generic names are hardest to match
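
Before changing the threshold, it may help to see how the match rate responds to it. The sketch below assumes the join output exposes a numeric score per matched product; the scores and product count are toy values, not results from this project.

```python
import pandas as pd


def match_rate_by_threshold(
    scores: pd.Series, n_products: int, thresholds=(75, 80, 85, 90)
) -> pd.DataFrame:
    """Count how many matches survive each score threshold."""
    rows = [
        {"threshold": t, "matched": int((scores >= t).sum())} for t in thresholds
    ]
    out = pd.DataFrame(rows)
    out["match_rate"] = out["matched"] / n_products
    return out


# Toy scores for 5 matched products out of a catalogue of 10.
toy_scores = pd.Series([76.0, 79.0, 81.0, 90.0, 95.0])
sweep = match_rate_by_threshold(toy_scores, n_products=10)
print(sweep.to_string(index=False))
```

Reading a sample of the pairs that fall between 75 and 80 alongside this table gives a rough sense of how many true matches a stricter threshold would cost.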