A modern data lake for agricultural analytics — tracking tonnage, costs, and productivity across Moroccan tomato farms.
This project ingests agricultural data from Excel spreadsheets and transforms them into analytics-ready Parquet files. It supports:
- 📊 Tonnage Tracking — Daily harvest data by farm, variety, and greenhouse
- 💰 Cost Analysis — Domain-level expense breakdown (labor, equipment, inputs)
- 📈 Productivity Metrics — Cost per ton, yield per hectare, QNZ performance
Quinzaine (QNZ) = 15-day agricultural period. There are 24 QNZ per year:
- QNZ 1: June 1-15 → QNZ 24: May 16-31
- Agricultural year runs June 1 → May 31
data-leak/
├── data/ # 📁 Raw data (Excel files)
│ ├── *QNZ*.xlsx # Tonnage files (QNZ 21, 22, ...)
│ └── 1-QUINZAINE*.xlsx # Cost data files
│
├── data/processed/ # 📁 Processed Parquet files
│ ├── tonnage_combined.parquet
│ └── costs_*.parquet
│
├── data/output/ # 📁 Analytics output
│ └── analytics.duckdb
│
├── scripts/
│ ├── ingest.py # Data ingestion pipeline
│ └── analytics.py # Analytics queries
│MIT License — See [LICENSE](LICENSE) for details.
├── frontend/ # React + Vite dashboard
│ └── src/pages/ # Dashboard, Domains, Varieties, etc.
│
└── backend/ # FastAPI API
| Tool | Version | Purpose |
|---|---|---|
| Python | 3.14+ | Runtime |
| pandas | — | Data processing |
| duckdb | — | Analytics engine |
| openpyxl | — | Excel reading |
| pyarrow | — | Parquet output |
Install dependencies:
pip install pandas duckdb openpyxl pyarrowOr use the included .venv:
source .venv/bin/activateCreate these directories manually if they don't exist:
mkdir -p data data/processed data/outputPlace your Excel files in data/:
| File Pattern | Description | Example |
|---|---|---|
*QNZ*.xlsx |
Tonnage data (one per QNZ) | 15-04-26 V2 -SUIVI JOURNALIER DES TONNAGES QNZ N° 21.xlsx |
1-QUINZAINE*.xlsx |
Cost data | 1-QUINZAINE 25-26-19.xlsx |
Excel sheet name: SUIVI JOUR, TONNAGE QNZ N° {N}
Required columns:
GROUPE— Group nameFERME— Farm nameCODE— Farm codeCLUBS— ClubVARIETE— VarietyTYPE— Crop typeSERRE N°— Greenhouse numberSUP— Surface area (ha)DATE PLT°.— Planting dateGLOBAL 1— Total tonnage- Date columns (DD/MM/YYYY) — Daily harvest tonnage
Excel with one sheet per domain (sheet name = domain ID).
Required columns:
Domaine— Domain nameSuper— Surface area (ha)Main D'oeuvrs— Labor costECHASSIER— Equipment costPoste Fixe— Fixed costsDépences externe— External expensesAutre Dépences interne— Internal expensesMontant Total— Total amount
Run the ingestion pipeline to convert Excel → Parquet:
python scripts/ingest.pyOutput:
=== Ingesting Tonnage Data ===
Processing 15-04-26 V2 -SUIVI JOURNALIER DES TONNAGES QNZ N° 21.xlsx (QNZ 21)...
Processing 30-04-26 V2 -SUIVI JOURNALIER DES TONNAGES QNZ N° 22.xlsx (QNZ 22)...
saved combined tonnage data: 15420 rows -> data/processed/tonnage_combined.parquet
=== Ingesting Cost Data ===
saved cost data: 484 rows -> data/processed/costs_qnz19_d1_22.parquet
=== Data Lake Ready ===
Tonnage records: 15420
Cost records: 484
Generate analytics with DuckDB:
python scripts/analytics.pyOutput:
============================================================
AGRICULTURAL DATA LAKE - COST/TON ANALYSIS
============================================================
--- Tonnage Summary (QNZ 22, Apr 2026) ---
ferme total_tonnage harvest_days
FERME A 12450.0 15
FERME B 9820.0 14
FERME C 7650.0 12
...
--- Cost Summary by Domain (QNZ 19, 2019) ---
domain_id Domaine total_superficie total_cost cost_per_ha
1 DOMAINE1 125.5 2450000 19521.98
2 DOMAINE2 98.2 1890000 19246.44
...
✓ Analytics complete - database saved to data/output/analytics.duckdb
Open DuckDB shell for custom queries:
duckdb data/output/analytics.duckdb-- Top 10 farms by yield
SELECT
ferme,
SUM(tonnage) / SUM(superficie) as yield_per_ha
FROM tonnage
GROUP BY ferme
ORDER BY yield_per_ha DESC
LIMIT 10;Start the FastAPI backend:
cd backend
uvicorn main:app --reloadStart the React frontend:
cd frontend
npm run devFREE TO USE
Built with 🫓 and ☕ in Morocco