Also available in French.
Context. Coursework template from iamdataeng.com/projects/ingestion.csv-loader. Fork, complete the TODO blocks, push, receive a pedagogical CI verdict. Not a maintained open-source project, an evaluated exercise.
Level: beginner · Estimated time: ~4 h · Free teaser IamDataEngineer Framework axis:
ingestion
This project is your first real contact with data engineering ingestion: a flat file dropped by a vendor, two vintages that don't agree on the schema, and a non-negotiable requirement — produce a typed, deduplicated, idempotent table that matches a written contract.
Not a tutorial. An exercise. You read, you code, you push, and CI tells you whether it passes — with a clear message on every failure pointing at what to fix.
Vélia is a bike-sharing operator. They drop a CSV of rides on an SFTP every day. Between January and February 2026, their dev did two things without telling anyone:
- Reordered the columns —
start_station_idnow comes beforestart_time. - Changed the date format — ISO-8601 in January
(
2026-01-12T08:34:11), European format in February (12/02/2026 08:34). - Bonus on the house: prices switched to French decimal notation
(
3,50) in February.
The BI analyst walked over to your desk because her GROUP BY day dashboard
has been broken since February 1st. You're going to hand her back a clean
table.
Nobody is asking you for Spark, an orchestrator, or a lakehouse. Just a
Python script that produces data/velia.duckdb with a trips table that
strictly matches the contract in contracts/trips.json.
| Deliverable | Where |
|---|---|
| The ingestion code | src/ingest.py (a typed load(path) -> DataFrame function + a main() that orchestrates) |
| The generated DuckDB file | data/velia.duckdb (never committed — it's in .gitignore) |
| A schema note | docs/schema_changes.md (template provided, fill it in) |
docs/schema_changes.md is not an afterthought. It's what you'd write in
production when a vendor changes an export without warning. CI doesn't read
it, but a recruiter who lands on your fork will.
If you're in GitHub Codespaces (one-click open from the IamDataEngineer app), fixtures are already generated and dependencies are installed. Otherwise, locally:
# 1. Install dependencies
pip install -r requirements.txt
# 2. Generate the test CSVs (deterministic — seed = 42)
python -m fixtures.generate_fixtures
# 3. Run your ingestion (fails until you implement src/ingest.py)
python -m src.ingest
# 4. Run the assessment rubric
pytest tests/ -vOnce your 4 tests pass locally, commit + push to your fork. GitHub Actions CI replays the same rubric and the IamDataEngineer app displays the verdict in your dashboard.
Defined in tests/test_evaluate.py. Each failing check prints a plain
pedagogical message.
| # | Id | What we check |
|---|---|---|
| 1 | schema_matches_contract |
The trips table exists with exactly the columns and DuckDB types from the contract. TIMESTAMP for dates, DECIMAL(10,2) for price, nullable INTEGER for end_station_id. |
| 2 | row_count_both_vintages |
The total row count in trips equals the sum of both vintages, minus duplicates on trip_id. If you loaded only one file, this check fails. |
| 3 | no_null_primary_key |
SELECT COUNT(*) WHERE trip_id IS NULL = 0. A null primary key doesn't exist. Filter before insert. |
| 4 | idempotent_rerun |
Re-running python -m src.ingest doesn't change the row count. That's the production invariant: an orchestrator retry must never duplicate. |
Seen a thousand times in junior code review:
-
Reading the CSV with pandas defaults and getting
objecteverywhere.pd.read_csvwithoutdtype=orparse_dates=hands youobjectcolumns even when the content is obviously numeric. CI checks DuckDB types on the table, so if you cast wrong at insert time, you break check 1. -
Missing that February reordered the columns. Easy fix: read by column name, not by index. pandas aligns automatically if you then do
df = df[LOGICAL_ORDER]. -
Treating
""as a valid station. The CSV writes an empty destination station for ~5% of rides (a bike abandoned off-station). It needs to becomeNULLin DuckDB, not0or an empty string. Hint:na_values=[""]on read. -
Breaking on February's
3,50price. Either passdecimal=","toread_csvfor the February file, or do a.str.replace(",", ".").astype(float)after reading. The contract wantsDECIMAL(10,2)in DuckDB — aCAST(price AS DECIMAL(10,2))at insert time is enough. -
Approximate idempotence. Concretely, two options in DuckDB:
DELETE FROM tripsbefore each insert (truncate-then-insert), ORINSERT INTO trips ... ON CONFLICT (trip_id) DO NOTHING(DuckDB ≥ 0.9 supportsON CONFLICT, provided you have aPRIMARY KEYontrip_idor aUNIQUEindex). Pick one, justify it indocs/schema_changes.md.
-
Breaking on commas inside
description. The standardcsvmodule (andpandas.read_csv, which builds on it) handles quoted fields on its own. Never split a CSV line by hand with.split(",").
No reading is required to pass this project, but if you want to put these patterns in context, the sources that shaped the rubric:
- Joe Reis & Matt Housley, Fundamentals of Data Engineering (O'Reilly, 2022) — ch. 7 "Ingestion", pp. 222-240 on batch and file ingestion patterns.
- pandas docs:
read_csv— read thedtype,parse_dates,na_values,decimal,dayfirstparameters carefully. - DuckDB docs: Data Types
and
INSERT ... ON CONFLICT. - Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017) — ch. 3 on storage formats, just to understand why CSV is not a production format.
The point is for you to struggle a bit — that's what production is. But if you've been spinning on the same check for more than an hour:
- Re-read the test error message — it almost always points at the cause.
- Eyeball a raw row of each CSV with
head -3 fixtures/velia_trips_*.csv. - Open an issue on your fork with the
help-wantedlabel — the IamDataEngineer community hangs out there.
Good luck.