From CSV to typed table — `ingestion.csv-loader`

Also available in French.

Context. Coursework template from iamdataeng.com/projects/ingestion.csv-loader. Fork, complete the TODO blocks, push, receive a pedagogical CI verdict. Not a maintained open-source project, an evaluated exercise.

From CSV to typed table — `ingestion.csv-loader`

Level: beginner · Estimated time: ~4 h · Free teaser IamDataEngineer Framework axis: ingestion

This project is your first real contact with data engineering ingestion: a flat file dropped by a vendor, two vintages that don't agree on the schema, and a non-negotiable requirement — produce a typed, deduplicated, idempotent table that matches a written contract.

Not a tutorial. An exercise. You read, you code, you push, and CI tells you whether it passes — with a clear message on every failure pointing at what to fix.

The context

Vélia is a bike-sharing operator. They drop a CSV of rides on an SFTP every day. Between January and February 2026, their dev did two things without telling anyone:

Reordered the columns — start_station_id now comes before start_time.
Changed the date format — ISO-8601 in January (2026-01-12T08:34:11), European format in February (12/02/2026 08:34).
Bonus on the house: prices switched to French decimal notation (3,50) in February.

The BI analyst walked over to your desk because her GROUP BY day dashboard has been broken since February 1st. You're going to hand her back a clean table.

Nobody is asking you for Spark, an orchestrator, or a lakehouse. Just a Python script that produces data/velia.duckdb with a trips table that strictly matches the contract in contracts/trips.json.

What you ship

Deliverable	Where
The ingestion code	`src/ingest.py` (a typed `load(path) -> DataFrame` function + a `main()` that orchestrates)
The generated DuckDB file	`data/velia.duckdb` (never committed — it's in `.gitignore`)
A schema note	`docs/schema_changes.md` (template provided, fill it in)

docs/schema_changes.md is not an afterthought. It's what you'd write in production when a vendor changes an export without warning. CI doesn't read it, but a recruiter who lands on your fork will.

Getting started

If you're in GitHub Codespaces (one-click open from the IamDataEngineer app), fixtures are already generated and dependencies are installed. Otherwise, locally:

# 1. Install dependencies
pip install -r requirements.txt

# 2. Generate the test CSVs (deterministic — seed = 42)
python -m fixtures.generate_fixtures

# 3. Run your ingestion (fails until you implement src/ingest.py)
python -m src.ingest

# 4. Run the assessment rubric
pytest tests/ -v

Once your 4 tests pass locally, commit + push to your fork. GitHub Actions CI replays the same rubric and the IamDataEngineer app displays the verdict in your dashboard.

The 4 rubric checks

Defined in tests/test_evaluate.py. Each failing check prints a plain pedagogical message.

#	Id	What we check
1	`schema_matches_contract`	The `trips` table exists with exactly the columns and DuckDB types from the contract. `TIMESTAMP` for dates, `DECIMAL(10,2)` for `price`, nullable `INTEGER` for `end_station_id`.
2	`row_count_both_vintages`	The total row count in `trips` equals the sum of both vintages, minus duplicates on `trip_id`. If you loaded only one file, this check fails.
3	`no_null_primary_key`	`SELECT COUNT(*) WHERE trip_id IS NULL = 0`. A null primary key doesn't exist. Filter before insert.
4	`idempotent_rerun`	Re-running `python -m src.ingest` doesn't change the row count. That's the production invariant: an orchestrator retry must never duplicate.

The traps beginners fall into

Seen a thousand times in junior code review:

Reading the CSV with pandas defaults and getting object everywhere. pd.read_csv without dtype= or parse_dates= hands you object columns even when the content is obviously numeric. CI checks DuckDB types on the table, so if you cast wrong at insert time, you break check 1.
Missing that February reordered the columns. Easy fix: read by column name, not by index. pandas aligns automatically if you then do df = df[LOGICAL_ORDER].
Treating "" as a valid station. The CSV writes an empty destination station for ~5% of rides (a bike abandoned off-station). It needs to become NULL in DuckDB, not 0 or an empty string. Hint: na_values=[""] on read.
Breaking on February's 3,50 price. Either pass decimal="," to read_csv for the February file, or do a .str.replace(",", ".").astype(float) after reading. The contract wants DECIMAL(10,2) in DuckDB — a CAST(price AS DECIMAL(10,2)) at insert time is enough.
Approximate idempotence. Concretely, two options in DuckDB:
- DELETE FROM trips before each insert (truncate-then-insert), OR
- INSERT INTO trips ... ON CONFLICT (trip_id) DO NOTHING (DuckDB ≥ 0.9 supports ON CONFLICT, provided you have a PRIMARY KEY on trip_id or a UNIQUE index). Pick one, justify it in docs/schema_changes.md.
Breaking on commas inside description. The standard csv module (and pandas.read_csv, which builds on it) handles quoted fields on its own. Never split a CSV line by hand with .split(",").

Going further (references)

No reading is required to pass this project, but if you want to put these patterns in context, the sources that shaped the rubric:

Joe Reis & Matt Housley, Fundamentals of Data Engineering (O'Reilly, 2022) — ch. 7 "Ingestion", pp. 222-240 on batch and file ingestion patterns.
pandas docs: read_csv — read the dtype, parse_dates, na_values, decimal, dayfirst parameters carefully.
DuckDB docs: Data Types and INSERT ... ON CONFLICT.
Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017) — ch. 3 on storage formats, just to understand why CSV is not a production format.

If you're stuck

The point is for you to struggle a bit — that's what production is. But if you've been spinning on the same check for more than an hour:

Re-read the test error message — it almost always points at the cause.
Eyeball a raw row of each CSV with head -3 fixtures/velia_trips_*.csv.
Open an issue on your fork with the help-wanted label — the IamDataEngineer community hangs out there.

Good luck.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

From CSV to typed table — `ingestion.csv-loader`

The context

What you ship

Getting started

The 4 rubric checks

The traps beginners fall into

Going further (references)

If you're stuck

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.devcontainer		.devcontainer
.github/workflows		.github/workflows
contracts		contracts
data		data
docs		docs
fixtures		fixtures
src		src
tests		tests
.gitignore		.gitignore
README.fr.md		README.fr.md
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

From CSV to typed table — ingestion.csv-loader

The context

What you ship

Getting started

The 4 rubric checks

The traps beginners fall into

Going further (references)

If you're stuck

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

From CSV to typed table — `ingestion.csv-loader`

Packages