From CSV to Dashboard — Building a Mini Data Pipeline in Pure Python
This workshop shows how to build a complete data engineering workflow using open-source tools:
pandas • pyarrow • uv • streamlit.
You’ll ingest, clean, model, and visualize the classic Northwind dataset — all locally, with no cloud or database setup.
- A hands-on sandbox for learning how data engineers think and structure pipelines.
- A mini data warehouse built entirely with open-source tools.
- A practical example of the modern data-engineering mindset:
 Extract → Load → Transform → Build → Visualize.
- Designed for:
  - Beginners exploring data pipelines and analytics engineering.
  - Educators or mentors leading workshops and bootcamps.
  - Teams wanting a lightweight demo of ETL + Data Quality in Python.
 
- ❌ Not a production-grade data platform or big-data tool.
- ❌ Not a replacement for Airflow, dbt, or Spark.
- ❌ Not built for parallel or distributed execution.
- ❌ Not a full data-engineering curriculum — this is the first step.
A teachable microcosm of modern data engineering — small enough for your laptop, structured enough to mirror real-world pipelines.
py-dataengineering-workshop/
├─ data/
│  ├─ 00-raw/       # raw source CSVs (auto-downloaded)
│  ├─ 01-clean/     # cleaned + validated Parquet
│  │  └─ _dq/       # data quality logs
│  ├─ 02-model/     # modeled & aggregated tables
│  └─ 03-sandbox/   # scratch area
├─ etl/
│  ├─ extract.py    # download + read CSVs
│  ├─ load.py       # clean + validate
│  ├─ transform.py  # create fact/dim tables
│  ├─ build.py      # aggregate to gold layer
│  ├─ dq.py         # data-quality rules + logs
│  ├─ paths.py      # central folder definitions
│  └─ run.py        # orchestrator (CLI)
├─ view_data.py     # inspect Parquet data files
└─ app.py           # Streamlit dashboard
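The layer folders above are defined once in etl/paths.py so every stage reads and writes the same locations. A minimal sketch of what such a module might contain (the constant names are assumptions, not the repository's exact code):

```python
# etl/paths.py -- hypothetical sketch: central folder definitions for the pipeline layers
from pathlib import Path

# Repository root (two levels up from this file: etl/paths.py -> etl/ -> project root)
ROOT = Path(__file__).resolve().parents[1]

DATA = ROOT / "data"
RAW = DATA / "00-raw"         # raw source CSVs
CLEAN = DATA / "01-clean"     # cleaned + validated Parquet
DQ = CLEAN / "_dq"            # data-quality logs
MODEL = DATA / "02-model"     # modeled & aggregated tables
SANDBOX = DATA / "03-sandbox" # scratch area

def ensure_dirs() -> None:
    """Create the layer folders if they don't exist yet."""
    for folder in (RAW, CLEAN, DQ, MODEL, SANDBOX):
        folder.mkdir(parents=True, exist_ok=True)
```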
git clone https://github.com/YOURNAME/py-dataengineering-workshop.git
cd py-dataengineering-workshop
uv venv && uv sync
# Run the full pipeline
uv run python -m etl.run
# Or run by stage
uv run python -m etl.run --stage extract
uv run python -m etl.run --stage load
uv run python -m etl.run --stage transform
uv run python -m etl.run --stage build
# Launch the dashboard
uv run streamlit run app.py

On first run, extract.py downloads the Northwind CSVs into data/00-raw/.
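To illustrate that first-run download, here is a rough sketch of the kind of helper extract.py could use. The source URL, file list, and function name are placeholders, not the repository's actual code:

```python
# Hypothetical sketch of a first-run CSV download (not the repo's exact extract.py)
from urllib.request import urlretrieve

from etl.paths import RAW, ensure_dirs  # assumed helpers, see the paths.py sketch above

BASE_URL = "https://example.com/northwind"  # placeholder: substitute the real CSV source
TABLES = ["customers", "orders", "order_details", "products"]

def run() -> None:
    ensure_dirs()
    for table in TABLES:
        target = RAW / f"{table}.csv"
        if not target.exists():              # skip files that were already downloaded
            urlretrieve(f"{BASE_URL}/{table}.csv", str(target))

if __name__ == "__main__":
    run()
```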
You may also run the code inside a dev container using Docker and VS Code. This can simplify setup by providing a consistent environment. Quick steps:
- Install the "Dev Containers" extension in VS Code.
 
- Open this project folder in VS Code (File → Open Folder...).
- Open the command palette (F1 or Ctrl+Shift+P or Cmd+Shift+P) and choose: "Dev Containers: Reopen in Container".
 
- The container should start building. Wait for it to finish.
 
Notes:
- On the first run VS Code will build the container image using devcontainer.json. This may take several minutes depending on your network and machine.
- After the container starts you can run the pipeline and the Streamlit app from the container terminal (the same commands shown in the Quickstart work inside the container).
- If you don't have Docker or prefer a local virtualenv, the earlier Quickstart steps (using uv) will work without the dev container.
| Stage | Script | Purpose | 
|---|---|---|
| Extract | etl/extract.py | Download + load raw CSVs | 
| Load | etl/load.py | Clean, standardize, and log data-quality issues | 
| Transform | etl/transform.py | Create fact and dimension tables | 
| Build | etl/build.py | Aggregate gold-layer outputs (customer, country, product) | 
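The orchestrator simply chains these four stages. A minimal sketch of how etl/run.py could implement the --stage flag from the Quickstart, assuming each stage module exposes a run() function (an assumption, not confirmed by the repository):

```python
# Hypothetical sketch of etl/run.py: run the full pipeline or a single stage
import argparse

from etl import extract, load, transform, build  # assumed: each module exposes run()

STAGES = {
    "extract": extract.run,
    "load": load.run,
    "transform": transform.run,
    "build": build.run,
}

def main() -> None:
    parser = argparse.ArgumentParser(description="Run the mini data pipeline")
    parser.add_argument("--stage", choices=STAGES, help="run one stage instead of the full pipeline")
    args = parser.parse_args()

    selected = [args.stage] if args.stage else list(STAGES)  # dict order = pipeline order
    for name in selected:
        print(f"Running stage: {name}")
        STAGES[name]()

if __name__ == "__main__":
    main()
```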
Data-quality results are saved under:
data/01-clean/_dq/
├─ dq_runs.parquet
└─ dq_issues.parquet
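Both files are ordinary Parquet tables that dq.py can maintain with pandas alone. A rough sketch of the append logic (the column names and exact schema are assumptions):

```python
# Hypothetical sketch of how dq.py might persist run summaries and row-level issues
from datetime import datetime, timezone

import pandas as pd

from etl.paths import DQ  # assumed constant pointing at data/01-clean/_dq

def log_dq_results(run_name: str, issues: pd.DataFrame) -> None:
    """Append one run summary to dq_runs.parquet and its issues to dq_issues.parquet."""
    DQ.mkdir(parents=True, exist_ok=True)

    run_row = pd.DataFrame([{
        "run_name": run_name,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "issue_count": len(issues),
    }])

    for path, new_rows in [(DQ / "dq_runs.parquet", run_row), (DQ / "dq_issues.parquet", issues)]:
        if path.exists():  # append to the existing history instead of overwriting it
            new_rows = pd.concat([pd.read_parquet(path), new_rows], ignore_index=True)
        new_rows.to_parquet(path, index=False)
```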
You can quickly inspect any Parquet file generated by the ETL pipeline using pandas + pyarrow.
uv run python view_data.py

This utility prints:
- 📂 The file being read
- 👀 A preview of the first 5 rows (head())
- 🧾 Schema and data types (info())
- 📊 Summary statistics (describe())
To view another file, edit the target path inside view_data.py
(e.g., switch from sales_by_customer.parquet to fact_sales.parquet).
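view_data.py is essentially a thin wrapper around pandas. A minimal sketch of what it might look like, assuming the file lives in the gold layer (data/02-model/):

```python
# Hypothetical sketch of view_data.py: quick inspection of one Parquet output
import pandas as pd

# Edit this path to inspect a different file (e.g. fact_sales.parquet)
TARGET = "data/02-model/sales_by_customer.parquet"  # assumed location

df = pd.read_parquet(TARGET)  # needs pyarrow (or fastparquet) installed

print(f"📂 Reading: {TARGET}")
print("👀 First 5 rows:")
print(df.head())
print("🧾 Schema and data types:")
df.info()
print("📊 Summary statistics:")
print(df.describe())
```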
uv run streamlit run app.py

Tabs:
- 📊 Sales (Customers) – Top customers & products
- 🌍 Sales by Country – Regional aggregates
- 🧪 Data Quality – Run summaries & issue details
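A condensed sketch of how app.py could wire up those three tabs with st.tabs; the Parquet paths and column names below are assumptions, so adjust them to the actual gold-layer outputs:

```python
# Hypothetical sketch of app.py's tab layout (paths and column names are assumptions)
import pandas as pd
import streamlit as st

st.title("Northwind Mini Warehouse")

tab_sales, tab_country, tab_dq = st.tabs(
    ["📊 Sales (Customers)", "🌍 Sales by Country", "🧪 Data Quality"]
)

with tab_sales:
    sales = pd.read_parquet("data/02-model/sales_by_customer.parquet")
    st.dataframe(sales.sort_values("total_sales", ascending=False).head(10))

with tab_country:
    by_country = pd.read_parquet("data/02-model/sales_by_country.parquet")
    st.bar_chart(by_country.set_index("country")["total_sales"])

with tab_dq:
    st.dataframe(pd.read_parquet("data/01-clean/_dq/dq_runs.parquet"))
    st.dataframe(pd.read_parquet("data/01-clean/_dq/dq_issues.parquet"))
```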
- Study the Northwind dataset.
- Modify extract.py to ingest all available Northwind CSVs (Employees, Shippers, Suppliers, Categories, etc.).
- Add new data-quality checks in dq.py (e.g., missing employee names, invalid postal codes).
- Extend transform.py to include dim_product, dim_supplier, and fact_orders (see the sketch after this list).
- Enhance build.py with new aggregates (e.g., sales by category, supplier, year).
- Visualize additional metrics in Streamlit (e.g., monthly trends, top-selling categories).
- Use another dataset (e.g., Chinook) and run it through the full pipeline end to end.
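For the dim_product exercise, one possible starting point is sketched below. The column names follow the standard Northwind products table and the layer constants come from the paths.py sketch above, so verify both against the actual files:

```python
# Hypothetical sketch of a dim_product builder for transform.py
import pandas as pd

from etl.paths import CLEAN, MODEL  # assumed layer constants (see paths.py sketch)

def build_dim_product() -> pd.DataFrame:
    # Assumes the load stage already wrote a cleaned products.parquet
    products = pd.read_parquet(CLEAN / "products.parquet")
    dim_product = (
        products[["productID", "productName", "categoryID", "unitPrice", "discontinued"]]
        .drop_duplicates(subset="productID")
        .rename(columns={
            "productID": "product_id",
            "productName": "product_name",
            "categoryID": "category_id",
            "unitPrice": "unit_price",
        })
    )
    dim_product.to_parquet(MODEL / "dim_product.parquet", index=False)
    return dim_product
```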
| Stage | What it does | Recommended tools & patterns | 
|---|---|---|
| Extract | Pull data from CSV/API/DB/stream | dlt (Python) → sources (CSV, REST, DB). For streams: Kafka (later). | 
| Load | Land raw/staging into warehouse | dlt → Postgres (Dockerized). For dev-only: optionally DuckDB parquet. | 
| Transform | Create clean staging + marts | dbt (SQL or Python models) over Postgres; incremental models for scale. | 
| Store | Persist analytical outputs | Postgres (marts schemas), optional Parquet in /02-model for ad hoc analysis. | 
| Serve | BI / apps / ad-hoc queries | Streamlit app; Metabase (Docker) for dashboards; programmatic Ibis/DuckDB. | 
| Orchestrate | Schedule & chain runs | Prefect (flows locally or Cloud). For simple prod cron: crontab + shell. | 
| Validate / DQ | Schemas, ranges, FKs | Pandera (Python; see the sketch below the table), selective Great Expectations; dbt tests (not_null, unique). | 
| Observe | Runs, metrics, alerts | dlt run metrics + Prefect run states; ship DQ summaries to Grafana/Metabase/Slack. | 
| Version & Lineage | Repro, history, docs | git for code, optional dvc for large artifacts; dbt docs + lineage. | 
| Semantic Layer | Uniform query interface | Ibis to query Postgres/Parquet with one API; (optional) dbt metrics. | 
| ML / Features | Gold → features | Notebook lab or feature store later; start with pandas/Polars over marts. | 
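As a concrete taste of the Validate / DQ row, here is a minimal Pandera sketch that declares expectations for a fact table and validates a DataFrame against them (the column names are assumptions):

```python
# Minimal Pandera sketch: declare column rules, then validate a DataFrame
import pandas as pd
import pandera as pa

fact_sales_schema = pa.DataFrameSchema({
    "order_id": pa.Column(int, nullable=False),
    "customer_id": pa.Column(str, nullable=False),
    "quantity": pa.Column(int, checks=pa.Check.gt(0)),
    "unit_price": pa.Column(float, checks=pa.Check.ge(0)),
})

df = pd.DataFrame({
    "order_id": [10248],
    "customer_id": ["VINET"],
    "quantity": [12],
    "unit_price": [14.0],
})

# Raises an error when a rule is violated; lazy=True collects all failures at once
validated = fact_sales_schema.validate(df, lazy=True)
print(validated)
```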
Explore these tools to build a deeper appreciation for how data engineering scales: experiment with various data sources and implement your own end-to-end projects.
MIT © 2025 Myk Ogbinar / Data Engineering Pilipinas
- Neo4j Northwind Dataset — Sample dataset used for this project
- pandas • pyarrow — Core data processing and Parquet handling
- streamlit — Interactive data app framework
- uv — Fast Python environment and dependency manager
- dltHub • dbt • DuckDB — Modern ELT and analytical data stack components
- Ibis • ClickHouse — Unified analytical querying and columnar storage engines
- DurianPy • PyCon Davao 2025 — Community and conference initiatives supporting open-source learning
- Data Engineering Pilipinas — Open-source community promoting data literacy and collaboration in the Philippines