Maldini is one of Spain's most prominent football journalists. Every week on his YouTube channel @mundomaldini he makes explicit, probabilistic predictions about upcoming matches. This project captures every prediction, scores it objectively with a Brier score, and surfaces the answer in a live dashboard.
A Brier score measures the accuracy of probabilistic predictions: lower is better, and 0 is perfect.
| Benchmark | Brier Score |
|---|---|
| Naive baseline (guess 1/3 each outcome) | 0.222 |
| Betting markets | ~0.19 |
| Superforecaster threshold | < 0.20 |
| Perfect forecaster | 0.00 |
Maldini earns the superforecaster badge only when his all-time average Brier score drops below 0.20, and only once he has 100+ scored predictions for statistical reliability.
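The badge rule combines two conditions, which can be sketched as a small check. This is a minimal illustration only: the function and constant names are hypothetical and are not taken from the project's code.

```python
from statistics import mean

SUPERFORECASTER_THRESHOLD = 0.20  # threshold from the benchmark table
MIN_SCORED_PREDICTIONS = 100      # minimum sample size stated above


def earns_badge(brier_scores: list[float]) -> bool:
    """Hypothetical helper: True only when both badge conditions hold."""
    # Too few scored predictions: the average is not considered reliable yet.
    if len(brier_scores) < MIN_SCORED_PREDICTIONS:
        return False
    # The all-time average must be strictly below the threshold.
    return mean(brier_scores) < SUPERFORECASTER_THRESHOLD
```

Under this sketch, 99 excellent predictions still do not earn the badge, because the sample-size condition is checked first.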
The project tracks 1,500 predictions from 2022-Q4 to 2026-Q2.
┌──────────────────────────────────┐
│  data/videos.csv                 │
│  data/results_overrides.csv (*)  │
└─────────────────┬────────────────┘
                  ▼
┌────────────────────────────────────────────────────────────┐
│                      maldini.pipeline                      │
│                                                            │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐  │
│  │ ingest  │───►│ extract │───►│ results │───►│ scoring │  │
│  └────┬────┘    └────┬────┘    └────┬────┘    └────┬────┘  │
│       ▼              ▼              ▼              ▼       │
│    YouTube        Claude       TheSportsDB      DuckDB     │
│  transcript      Haiku LLM     match data     Brier 2/3w   │
└──────────────────────────────┬─────────────────────────────┘
                               ▼
┌──────────────────────────────────────┐
│  data/predictions.parquet            │
│  one row per prediction; single      │
│  source of truth, committed to git   │
└─────────────────┬────────────────────┘
                  ▼
┌──────────────────────────────────────┐
│  maldini.render                      │
│  DuckDB CTEs → summary stats         │
│  Jinja2 → dist/index.html            │
│           dist/index.en.html         │
└─────────────────┬────────────────────┘
                  ▼
         GitHub Pages (auto)
(*) Manual scoreline fixups for matches TheSportsDB can't auto-resolve.
Schedule: GitHub Actions cron, Sundays 08:00 UTC (.github/workflows/weekly.yml).
Parquet is the single source of truth. It lives in git, so every dashboard build is reproducible from a commit hash. The pipeline is idempotent: re-running it on the same videos.csv only processes new video_ids, and pending predictions (matches not yet played) are persisted with null results so the next run picks them up.
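The idempotency rule can be illustrated with plain Python. This is a sketch under assumptions, not the pipeline's actual code: `plan_run` and the row keys `video_id` and `result` are hypothetical names chosen for the example.

```python
def plan_run(video_ids, stored_rows):
    """Decide what a re-run has to do (hypothetical helper).

    video_ids:   ids listed in videos.csv
    stored_rows: rows already persisted, as dicts with the assumed keys
                 "video_id" and "result" (None while the match is unplayed)
    """
    seen = {row["video_id"] for row in stored_rows}
    # Only videos that have never been processed need ingest + extract.
    new_videos = [v for v in video_ids if v not in seen]
    # Pending predictions are retried so their results can be filled in.
    pending = [row for row in stored_rows if row["result"] is None]
    return new_videos, pending
```

Running this twice on the same inputs yields the same plan, and once a new video has been stored it no longer appears in `new_videos`.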
All SQL runs in DuckDB in-process, embedded inside src/maldini/pipeline.py (scoring) and src/maldini/render.py (summary stats).
Brier score variants:
- 3-outcome (league matches): ((p_home - I_home)² + (p_draw - I_draw)² + (p_away - I_away)²) / 3
- 2-outcome (knockout, where pred_draw_pct = 0): renormalise home + away to sum to 1, then ((p_home - I_home)² + (p_away - I_away)²) / 2
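Both variants can be written as one small function. The following is a sketch, assuming probabilities are given as fractions in [0, 1] and the outcome as one of the strings "home", "draw" or "away"; the function name and signature are illustrative, and the project itself computes the score in DuckDB SQL.

```python
def brier_score(p_home: float, p_draw: float, p_away: float, outcome: str) -> float:
    """Brier score for one prediction (illustrative signature)."""
    if outcome not in ("home", "draw", "away"):
        raise ValueError(f"unknown outcome: {outcome!r}")
    # Indicator variables: 1 for the outcome that happened, 0 otherwise.
    i_home = 1.0 if outcome == "home" else 0.0
    i_draw = 1.0 if outcome == "draw" else 0.0
    i_away = 1.0 if outcome == "away" else 0.0

    if p_draw == 0:
        # 2-outcome variant: renormalise home + away to sum to 1.
        # A drawn result is not expected here; this sketch does not special-case it.
        total = p_home + p_away
        if total <= 0:
            raise ValueError("home and away probabilities must sum to > 0")
        p_home, p_away = p_home / total, p_away / total
        return ((p_home - i_home) ** 2 + (p_away - i_away) ** 2) / 2

    # 3-outcome variant.
    return (
        (p_home - i_home) ** 2 + (p_draw - i_draw) ** 2 + (p_away - i_away) ** 2
    ) / 3
```

As a sanity check, `brier_score(1/3, 1/3, 1/3, "home")` evaluates to 2/9, which is the 0.222 naive baseline in the benchmark table.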
Summary statistics (all-time average, accuracy, monthly trend, competition breakdown, Brier distribution) are computed by maldini.render from the parquet at render time: a few short CTEs, no separate materialised tables.
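To give a feel for what one such CTE might look like, here is a monthly-trend query. It is a sketch with several assumptions: it uses the standard library's sqlite3 as a stand-in for DuckDB so that it runs without extra dependencies, the table and column names (`predictions`, `match_date`, `competition`, `brier`) are hypothetical, and the sample rows are invented for illustration.

```python
import sqlite3

# Invented sample rows: (match_date, competition, brier). None = not yet scored.
rows = [
    ("2024-01-07", "LaLiga", 0.10),
    ("2024-01-21", "LaLiga", 0.30),
    ("2024-02-04", "Champions League", 0.20),
    ("2024-02-18", "LaLiga", None),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE predictions (match_date TEXT, competition TEXT, brier REAL)")
con.executemany("INSERT INTO predictions VALUES (?, ?, ?)", rows)

# One short CTE keeps only scored rows; the outer query aggregates by month.
MONTHLY_TREND = """
WITH scored AS (
    SELECT substr(match_date, 1, 7) AS month, brier
    FROM predictions
    WHERE brier IS NOT NULL
)
SELECT month, COUNT(*) AS n, AVG(brier) AS avg_brier
FROM scored
GROUP BY month
ORDER BY month
"""
monthly = con.execute(MONTHLY_TREND).fetchall()
```

Because the statistics are derived on demand from the stored rows, there is nothing to keep in sync when the scoring logic changes.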
| Layer | Technology |
|---|---|
| Pipeline | Python package (src/maldini/) |
| Transformations | DuckDB (in-process SQL) |
| Storage | Parquet file in git (data/predictions.parquet) |
| LLM | Anthropic Claude Haiku |
| External APIs | YouTube Data API v3, youtube-transcript-api, TheSportsDB |
| Dashboard | Jinja2 → static HTML |
| Schedule | GitHub Actions (weekly cron) |
| Hosting | GitHub Pages |
For a full step-by-step guide, see docs/SETUP.md. The summary below is enough to get going.
git clone https://github.com/tomas-ravalli/maldini-stats.git
cd maldini-stats
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
cp .env.example .env   # fill in YOUTUBE_API_KEY and ANTHROPIC_API_KEY

# 1. Ingest, extract, fetch results, score
python -m maldini.pipeline --file data/videos.csv
# 2. Generate static HTML from the parquet
python -m maldini.render
# 3. View
open dist/index.html

Equivalently, the package exposes maldini-pipeline and maldini-render console scripts. Run pytest for the unit tests.
To add new videos: append rows to data/videos.csv and re-run.
- Parquet lives in git: every dashboard build is reproducible from a commit hash. If scoring logic changes, rebuild from data/videos.csv.
- DuckDB for everything SQL: no warehouse, no credentials, no quotas; the whole pipeline runs on a laptop or a free-tier GitHub Actions runner in under a minute.
- Fuzzy team matching: normalisation strips accents, common prefixes (Real, Atlético), and applies Spanish-to-English word substitutions before substring matching against TheSportsDB results.
- No-date window: predictions without a match_date use a 45-day window from publish_date to find the matching fixture.
- No-draw handling: when pred_draw_pct == 0, a 2-outcome Brier formula is applied automatically.
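The fuzzy matching and the no-date window described in the list above can be sketched as follows. All names here are hypothetical, the substitution table holds example entries only, and the window is assumed to run forward from publish_date with both ends inclusive; the project's real rules may differ.

```python
import unicodedata
from datetime import date, timedelta

PREFIXES = ("real ", "atletico ")  # the two prefixes named in the list above
SUBSTITUTIONS = {"espana": "spain", "alemania": "germany"}  # example entries only
WINDOW = timedelta(days=45)


def normalise(name: str) -> str:
    """Strip accents and a leading prefix, then apply word substitutions."""
    # NFKD splits "é" into "e" plus a combining mark, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", name)
    text = "".join(c for c in decomposed if not unicodedata.combining(c))
    text = text.lower().strip()
    for prefix in PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    return " ".join(SUBSTITUTIONS.get(word, word) for word in text.split())


def teams_match(predicted: str, candidate: str) -> bool:
    """Substring match in either direction on the normalised names."""
    a, b = normalise(predicted), normalise(candidate)
    return bool(a) and bool(b) and (a in b or b in a)


def in_window(publish_date: date, fixture_date: date) -> bool:
    """Assumed reading: fixture falls within 45 days after publication."""
    return publish_date <= fixture_date <= publish_date + WINDOW
```

One consequence of prefix stripping worth noting: in this sketch "Real Madrid" and "Atlético Madrid" both normalise to "madrid", so a real implementation would need another signal, such as the opponent or the date, to tell them apart.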
- docs/SETUP.md: step-by-step local setup and verification
- docs/DATA_FORMAT.md: input/output schemas for the pipeline
MIT
Built with AI.
© trm