A data ingestion pipeline that simulates real-time e-commerce data. This project demonstrates Data Contracts, Batch Processing, and Columnar Storage.
- Data Synthesis: Generates e-commerce events using Faker.
- Data Validation: Implements strict data contracts with Pydantic so malformed records are rejected before they enter the pipeline.
- Efficient Processing: Utilizes Polars for lightning-fast micro-batch processing.
- Storage Optimization: Saves data in Apache Parquet format for superior compression and analytical performance.
- Observability: Integrated logging system for monitoring pipeline health and debugging.
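The data-contract idea can be sketched with a minimal Pydantic model. The field names and constraints below are illustrative assumptions, not the project's actual schema:

```python
from datetime import datetime

from pydantic import BaseModel, Field, ValidationError


class OrderEvent(BaseModel):
    """Hypothetical event schema acting as the data contract."""
    order_id: str
    user_id: str
    amount: float = Field(gt=0)  # reject non-positive amounts
    currency: str = Field(min_length=3, max_length=3)  # e.g. "USD"
    created_at: datetime  # ISO-8601 strings are parsed automatically


# A well-formed event passes validation
valid = OrderEvent(
    order_id="o-1",
    user_id="u-9",
    amount=19.99,
    currency="USD",
    created_at="2024-01-01T00:00:00",
)

# A malformed event (negative amount) is rejected at the boundary
rejected = False
try:
    OrderEvent(
        order_id="o-2",
        user_id="u-9",
        amount=-5.0,
        currency="USD",
        created_at="2024-01-01T00:00:00",
    )
except ValidationError:
    rejected = True
```

Validating at ingestion time means downstream Polars code can assume every record already satisfies the contract.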
- Language: Python 3.12+
- Data Validation: Pydantic
- Data Processing: Polars
- Storage: Apache Parquet
```text
.
├── data/            # Parquet output files (untracked)
├── src/
│   ├── models.py    # Pydantic schemas (The Data Contract)
│   ├── generator.py # Synthetic data logic (Faker)
│   └── main.py      # App entry point (Polars logic)
└── analytics.py     # Analytics script (Lazy API demo)
```
```bash
git clone https://github.com//e-commerce_dat-sim.git
cd e-commerce_dat-sim
pip install -r requirements.txt
python src/main.py
python analytics.py
```