🧠 Py-DataEngineering Workshop

From CSV to Dashboard — Building a Mini Data Pipeline in Pure Python

This workshop shows how to build a complete data engineering workflow using open-source tools:
pandas • pyarrow • uv • streamlit.

You’ll ingest, clean, model, and visualize the classic Northwind dataset — all locally, with no cloud or database setup.


🧭 What This Repo Is — and Isn’t

✅ What It Is

  • A hands-on sandbox for learning how data engineers think and structure pipelines.
  • A mini data warehouse built entirely with open-source tools.
  • A practical example of the modern data-engineering mindset:
    Extract → Load → Transform → Build → Visualize.
  • Designed for:
    • Beginners exploring data pipelines and analytics engineering.
    • Educators or mentors leading workshops and bootcamps.
    • Teams wanting a lightweight demo of ETL + Data Quality in Python.

🚫 What It Isn’t

  • ❌ Not a production-grade data platform or big-data tool.
  • ❌ Not a replacement for Airflow, dbt, or Spark.
  • ❌ Not built for parallel or distributed execution.
  • ❌ Not a full data-engineering curriculum — this is the first step.

💡 In One Line

A teachable microcosm of modern data engineering — small enough for your laptop, structured enough to mirror real-world pipelines.


📦 Project Structure


py-dataengineering-workshop/
├─ data/
│  ├─ 00-raw/       # raw source CSVs (auto-downloaded)
│  ├─ 01-clean/     # cleaned + validated Parquet
│  │  └─ _dq/       # data quality logs
│  ├─ 02-model/     # modeled & aggregated tables
│  └─ 03-sandbox/   # scratch area
├─ etl/
│  ├─ extract.py    # download + read CSVs
│  ├─ load.py       # clean + validate
│  ├─ transform.py  # create fact/dim tables
│  ├─ build.py      # aggregate to gold layer
│  ├─ dq.py         # data-quality rules + logs
│  ├─ paths.py      # central folder definitions
│  └─ run.py        # orchestrator (CLI)
├─ view_data.py     # inspect Parquet data files
└─ app.py           # Streamlit dashboard
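To give a feel for the layout, the central path module etl/paths.py can be pictured as a handful of pathlib constants, one per layer, so every stage agrees on folder names. This is only a sketch of the idea; the real file may differ in names and details.

# Sketch of what etl/paths.py might contain; the actual module may differ.
from pathlib import Path

# Repository root, resolved relative to this file.
ROOT = Path(__file__).resolve().parent.parent

# One constant per pipeline layer.
DATA = ROOT / "data"
RAW = DATA / "00-raw"          # raw source CSVs
CLEAN = DATA / "01-clean"      # cleaned + validated Parquet
DQ = CLEAN / "_dq"             # data-quality logs
MODEL = DATA / "02-model"      # modeled & aggregated tables
SANDBOX = DATA / "03-sandbox"  # scratch area


def ensure_dirs() -> None:
    """Create all layer folders if they do not exist yet."""
    for folder in (RAW, CLEAN, DQ, MODEL, SANDBOX):
        folder.mkdir(parents=True, exist_ok=True)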


⚙️ Quickstart

git clone https://github.com/YOURNAME/py-dataengineering-workshop.git
cd py-dataengineering-workshop
uv venv && uv sync

# Run the full pipeline
uv run python -m etl.run

# Or run by stage
uv run python -m etl.run --stage extract
uv run python -m etl.run --stage load
uv run python -m etl.run --stage transform
uv run python -m etl.run --stage build

# Launch the dashboard
uv run streamlit run app.py

On first run, extract.py downloads the Northwind CSVs into data/00-raw/.
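For orientation, that download-then-read step boils down to something like the sketch below. BASE_URL and the table list are placeholders, not the workshop's actual source; see etl/extract.py for the real implementation.

# Minimal sketch of a CSV download step; BASE_URL and TABLES are placeholders.
import urllib.request
from pathlib import Path

import pandas as pd

RAW_DIR = Path("data/00-raw")
BASE_URL = "https://example.com/northwind"  # placeholder, not the real source
TABLES = ["customers", "orders", "order_details", "products"]


def extract() -> dict[str, pd.DataFrame]:
    """Download each CSV once, then read it into a DataFrame."""
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    frames = {}
    for name in TABLES:
        target = RAW_DIR / f"{name}.csv"
        if not target.exists():  # skip the download on repeat runs
            urllib.request.urlretrieve(f"{BASE_URL}/{name}.csv", target)
        frames[name] = pd.read_csv(target)
    return frames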


👍 Alternative Setup

You may also run the code inside a dev container using Docker and VS Code. This can simplify setup by providing a consistent environment. Quick steps:

  1. Install the "Dev Containers" extension in VS Code.
  2. Open this project folder in VS Code (File → Open Folder...).
  3. Open the command palette (F1, Ctrl+Shift+P, or Cmd+Shift+P) and choose "Dev Containers: Reopen in Container".
  4. The container starts building; wait for it to finish.

Notes:

  • On the first run VS Code will build the container image using devcontainer.json. This may take several minutes depending on your network and machine.
  • After the container starts you can run the pipeline and the Streamlit app from the container terminal (the same commands shown in the Quickstart work inside the container).
  • If you don't have Docker or prefer a local virtualenv, the earlier Quickstart steps (using uv) will work without the dev container.

🧩 Pipeline Overview

| Stage     | Script           | Purpose                                                   |
|-----------|------------------|-----------------------------------------------------------|
| Extract   | etl/extract.py   | Download + load raw CSVs                                  |
| Load      | etl/load.py      | Clean, standardize, and log data-quality issues           |
| Transform | etl/transform.py | Create fact and dimension tables                          |
| Build     | etl/build.py     | Aggregate gold-layer outputs (customer, country, product) |
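The orchestrator behind the --stage flag shown in the Quickstart can be pictured as a small argparse dispatcher. The stage function names below are illustrative assumptions; the real etl/run.py may be structured differently.

# Sketch of a stage dispatcher; function names are illustrative.
import argparse

from etl import build, extract, load, transform

STAGES = {
    "extract": extract.run,
    "load": load.run,
    "transform": transform.run,
    "build": build.run,
}


def main() -> None:
    parser = argparse.ArgumentParser(description="Run the mini ETL pipeline.")
    parser.add_argument("--stage", choices=STAGES, help="run a single stage")
    args = parser.parse_args()

    # Run one stage if requested, otherwise the whole pipeline in order.
    selected = [args.stage] if args.stage else list(STAGES)
    for name in selected:
        print(f"Running stage: {name}")
        STAGES[name]()


if __name__ == "__main__":
    main()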

Data-quality results are saved under:

data/01-clean/_dq/
├─ dq_runs.parquet
└─ dq_issues.parquet
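These are ordinary Parquet files, so you can open them directly with pandas; the column layout is whatever dq.py writes.

# Peek at the data-quality logs produced by etl/dq.py.
import pandas as pd

runs = pd.read_parquet("data/01-clean/_dq/dq_runs.parquet")
issues = pd.read_parquet("data/01-clean/_dq/dq_issues.parquet")

print(runs.tail())    # most recent pipeline runs
print(issues.head())  # individual data-quality issues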

🧰 Quick Parquet Viewer

You can quickly inspect any Parquet file generated by the ETL pipeline using pandas + pyarrow.

uv run python view_data.py

This utility prints:

  • 📂 The file being read
  • 👀 A preview of the first 5 rows (head())
  • 🧾 Schema and data types (info())
  • 📊 Summary statistics (describe())

To view another file, edit the target path inside view_data.py (e.g., switch from sales_by_customer.parquet to fact_sales.parquet).
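Under the hood the inspection is just pandas; a minimal equivalent looks like the sketch below (not the exact contents of view_data.py, and the TARGET path is only an example).

# Rough equivalent of what view_data.py does; edit TARGET to point at any
# Parquet file produced by the pipeline. The path below is an example.
import pandas as pd

TARGET = "data/02-model/sales_by_customer.parquet"

df = pd.read_parquet(TARGET)

print(f"📂 Reading: {TARGET}")
print("👀 Head:")
print(df.head())
print("🧾 Schema / dtypes:")
df.info()
print("📊 Summary statistics:")
print(df.describe())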


📊 Streamlit Dashboard

uv run streamlit run app.py

Open http://localhost:8501

Tabs

  1. 📊 Sales (Customers) – Top customers & products
  2. 🌍 Sales by Country – Regional aggregates
  3. 🧪 Data Quality – Run summaries & issue details
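To show how little code a tabbed dashboard needs, here is a stripped-down sketch with the same three tabs. The Parquet file names are assumptions based on the repo layout; see app.py for the real thing.

# Stripped-down sketch of a tabbed Streamlit dashboard over the gold-layer
# Parquet files. File names are illustrative, not guaranteed to match app.py.
import pandas as pd
import streamlit as st

st.title("Northwind Mini Warehouse")

tab_sales, tab_country, tab_dq = st.tabs(
    ["📊 Sales (Customers)", "🌍 Sales by Country", "🧪 Data Quality"]
)

with tab_sales:
    sales = pd.read_parquet("data/02-model/sales_by_customer.parquet")
    st.dataframe(sales)

with tab_country:
    by_country = pd.read_parquet("data/02-model/sales_by_country.parquet")
    # Assumes the first column holds the country name.
    st.bar_chart(by_country.set_index(by_country.columns[0]))

with tab_dq:
    issues = pd.read_parquet("data/01-clean/_dq/dq_issues.parquet")
    st.dataframe(issues)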

🌻 Challenge Yourself!

  1. Study the Northwind dataset.
  2. Modify extract.py to ingest all available Northwind CSVs (Employees, Shippers, Suppliers, Categories, etc.).
  3. Add new data-quality checks in dq.py (e.g., missing employee names, invalid postal codes); see the sketch after this list for a starting point.
  4. Extend transform.py to include dim_product, dim_supplier, and fact_orders.
  5. Enhance build.py with new aggregates (e.g., sales by category, supplier, year).
  6. Visualize additional metrics in Streamlit (e.g., monthly trends, top-selling categories).
  7. Use another dataset (e.g., Chinook) and run it through the pipeline end to end.
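As a starting point for item 3, a new data-quality rule can be as small as a function that filters the offending rows. The column names below are assumptions for illustration; adapt the check to however you extend dq.py and to the actual Northwind column names.

# Sketch of one extra data-quality check (challenge item 3).
# Column names are illustrative; adjust to the real Employees table.
import pandas as pd


def check_missing_employee_names(employees: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose first or last name is missing."""
    mask = employees["first_name"].isna() | employees["last_name"].isna()
    return employees.loc[mask]


# Quick demo with a tiny in-memory frame:
demo = pd.DataFrame(
    {"employee_id": [1, 2], "first_name": ["Nancy", None], "last_name": ["Davolio", "Fuller"]}
)
print(check_missing_employee_names(demo))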

🌱 From Demo → Real-World Data Stack

| Stage             | What it does                     | Recommended tools & patterns                                                       |
|-------------------|----------------------------------|------------------------------------------------------------------------------------|
| Extract           | Pull data from CSV/API/DB/stream | dlt (Python) → sources (CSV, REST, DB). For streams: Kafka (later).                 |
| Load              | Land raw/staging into warehouse  | dlt → Postgres (Dockerized). For dev-only: optionally DuckDB parquet.               |
| Transform         | Create clean staging + marts     | dbt (SQL or Python models) over Postgres; incremental models for scale.             |
| Store             | Persist analytical outputs       | Postgres (marts schemas), optional Parquet in /02-model for ad hoc.                 |
| Serve             | BI / apps / ad-hoc queries       | Streamlit app; Metabase (Docker) for dashboards; programmatic Ibis/DuckDB.          |
| Orchestrate       | Schedule & chain runs            | Prefect (flows locally or Cloud). For simple prod cron: crontab + shell.            |
| Validate / DQ     | Schemas, ranges, FKs             | Pandera (Python), selective Great Expectations; dbt tests (not_null, unique).       |
| Observe           | Runs, metrics, alerts            | dlt run metrics + Prefect run states; ship DQ summaries to Grafana/Metabase/Slack.  |
| Version & Lineage | Repro, history, docs             | git for code, optional dvc for large artifacts; dbt docs + lineage.                 |
| Semantic Layer    | Uniform query interface          | Ibis to query Postgres/Parquet with one API; (optional) dbt metrics.                |
| ML / Features     | Gold → features                  | Notebook lab or feature store later; start with pandas/Polars over marts.           |

Explore these tools to build a deeper appreciation for how Data Engineering scales—experiment with various data sources and implement your own end-to-end projects.
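As one concrete taste of that table, the Pandera row in "Validate / DQ" amounts to declaring a schema once and validating every DataFrame against it. The column names below are made up for illustration, not the workshop's actual tables.

# Minimal Pandera example for the "Validate / DQ" row; columns are illustrative.
import pandas as pd
import pandera as pa

order_schema = pa.DataFrameSchema(
    {
        "order_id": pa.Column(int, pa.Check.ge(1), unique=True),
        "country": pa.Column(str, nullable=False),
        "amount": pa.Column(float, pa.Check.ge(0)),
    }
)

orders = pd.DataFrame(
    {"order_id": [1, 2], "country": ["PH", "DE"], "amount": [120.5, 89.0]}
)
validated = order_schema.validate(orders)  # raises SchemaError on violations
print(validated)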


🧾 License

MIT © 2025 Myk Ogbinar / Data Engineering Pilipinas


🙌 Acknowledgments
