Skip to content

ImDataEngineer/csv-loader

Repository files navigation

Also available in French.

Template iamdataeng.com

Context. Coursework template from iamdataeng.com/projects/ingestion.csv-loader. Fork, complete the TODO blocks, push, receive a pedagogical CI verdict. Not a maintained open-source project, an evaluated exercise.

From CSV to typed table — ingestion.csv-loader

Level: beginner · Estimated time: ~4 h · Free teaser IamDataEngineer Framework axis: ingestion

This project is your first real contact with data engineering ingestion: a flat file dropped by a vendor, two vintages that don't agree on the schema, and a non-negotiable requirement — produce a typed, deduplicated, idempotent table that matches a written contract.

Not a tutorial. An exercise. You read, you code, you push, and CI tells you whether it passes — with a clear message on every failure pointing at what to fix.


The context

Vélia is a bike-sharing operator. They drop a CSV of rides on an SFTP every day. Between January and February 2026, their dev did two things without telling anyone:

  1. Reordered the columnsstart_station_id now comes before start_time.
  2. Changed the date format — ISO-8601 in January (2026-01-12T08:34:11), European format in February (12/02/2026 08:34).
  3. Bonus on the house: prices switched to French decimal notation (3,50) in February.

The BI analyst walked over to your desk because her GROUP BY day dashboard has been broken since February 1st. You're going to hand her back a clean table.

Nobody is asking you for Spark, an orchestrator, or a lakehouse. Just a Python script that produces data/velia.duckdb with a trips table that strictly matches the contract in contracts/trips.json.


What you ship

Deliverable Where
The ingestion code src/ingest.py (a typed load(path) -> DataFrame function + a main() that orchestrates)
The generated DuckDB file data/velia.duckdb (never committed — it's in .gitignore)
A schema note docs/schema_changes.md (template provided, fill it in)

docs/schema_changes.md is not an afterthought. It's what you'd write in production when a vendor changes an export without warning. CI doesn't read it, but a recruiter who lands on your fork will.


Getting started

If you're in GitHub Codespaces (one-click open from the IamDataEngineer app), fixtures are already generated and dependencies are installed. Otherwise, locally:

# 1. Install dependencies
pip install -r requirements.txt

# 2. Generate the test CSVs (deterministic — seed = 42)
python -m fixtures.generate_fixtures

# 3. Run your ingestion (fails until you implement src/ingest.py)
python -m src.ingest

# 4. Run the assessment rubric
pytest tests/ -v

Once your 4 tests pass locally, commit + push to your fork. GitHub Actions CI replays the same rubric and the IamDataEngineer app displays the verdict in your dashboard.


The 4 rubric checks

Defined in tests/test_evaluate.py. Each failing check prints a plain pedagogical message.

# Id What we check
1 schema_matches_contract The trips table exists with exactly the columns and DuckDB types from the contract. TIMESTAMP for dates, DECIMAL(10,2) for price, nullable INTEGER for end_station_id.
2 row_count_both_vintages The total row count in trips equals the sum of both vintages, minus duplicates on trip_id. If you loaded only one file, this check fails.
3 no_null_primary_key SELECT COUNT(*) WHERE trip_id IS NULL = 0. A null primary key doesn't exist. Filter before insert.
4 idempotent_rerun Re-running python -m src.ingest doesn't change the row count. That's the production invariant: an orchestrator retry must never duplicate.

The traps beginners fall into

Seen a thousand times in junior code review:

  • Reading the CSV with pandas defaults and getting object everywhere. pd.read_csv without dtype= or parse_dates= hands you object columns even when the content is obviously numeric. CI checks DuckDB types on the table, so if you cast wrong at insert time, you break check 1.

  • Missing that February reordered the columns. Easy fix: read by column name, not by index. pandas aligns automatically if you then do df = df[LOGICAL_ORDER].

  • Treating "" as a valid station. The CSV writes an empty destination station for ~5% of rides (a bike abandoned off-station). It needs to become NULL in DuckDB, not 0 or an empty string. Hint: na_values=[""] on read.

  • Breaking on February's 3,50 price. Either pass decimal="," to read_csv for the February file, or do a .str.replace(",", ".").astype(float) after reading. The contract wants DECIMAL(10,2) in DuckDB — a CAST(price AS DECIMAL(10,2)) at insert time is enough.

  • Approximate idempotence. Concretely, two options in DuckDB:

    • DELETE FROM trips before each insert (truncate-then-insert), OR
    • INSERT INTO trips ... ON CONFLICT (trip_id) DO NOTHING (DuckDB ≥ 0.9 supports ON CONFLICT, provided you have a PRIMARY KEY on trip_id or a UNIQUE index). Pick one, justify it in docs/schema_changes.md.
  • Breaking on commas inside description. The standard csv module (and pandas.read_csv, which builds on it) handles quoted fields on its own. Never split a CSV line by hand with .split(",").


Going further (references)

No reading is required to pass this project, but if you want to put these patterns in context, the sources that shaped the rubric:

  • Joe Reis & Matt Housley, Fundamentals of Data Engineering (O'Reilly, 2022) — ch. 7 "Ingestion", pp. 222-240 on batch and file ingestion patterns.
  • pandas docs: read_csv — read the dtype, parse_dates, na_values, decimal, dayfirst parameters carefully.
  • DuckDB docs: Data Types and INSERT ... ON CONFLICT.
  • Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017) — ch. 3 on storage formats, just to understand why CSV is not a production format.

If you're stuck

The point is for you to struggle a bit — that's what production is. But if you've been spinning on the same check for more than an hour:

  1. Re-read the test error message — it almost always points at the cause.
  2. Eyeball a raw row of each CSV with head -3 fixtures/velia_trips_*.csv.
  3. Open an issue on your fork with the help-wanted label — the IamDataEngineer community hangs out there.

Good luck.

About

From flat file to typed table

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages