Rifanism/e-commerce_dat-sim

Real-Time E-commerce Data Pipeline Simulation

A data ingestion pipeline that simulates real-time e-commerce data. This project demonstrates Data Contracts, Batch Processing, and Columnar Storage.

Key Features

  • Data Synthesis: Generates e-commerce events using Faker.
  • Data Validation: Enforces strict data contracts with Pydantic so that records failing validation are rejected before they enter the pipeline.
  • Efficient Processing: Uses Polars for fast micro-batch processing.
  • Storage Optimization: Saves data in Apache Parquet format for columnar compression and efficient analytical reads.
  • Observability: Integrated logging system for monitoring pipeline health and debugging.
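
To illustrate the data contract and logging features listed above, here is a minimal sketch of how a Pydantic model can reject malformed events and log each rejection. The `OrderEvent` model, its field names, and the `validate_batch` helper are assumptions made for illustration; the project's actual schema lives in `src/models.py` and may differ.

```python
import logging
from datetime import datetime
from typing import Literal

from pydantic import BaseModel, Field, ValidationError

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")


class OrderEvent(BaseModel):
    """Hypothetical contract; the field names here are illustrative only."""

    event_id: str
    user_id: int = Field(gt=0)
    event_type: Literal["view", "add_to_cart", "purchase"]
    price: float = Field(ge=0)
    timestamp: datetime


def validate_batch(raw_events: list[dict]) -> tuple[list[OrderEvent], int]:
    """Return the events that satisfy the contract and the number rejected."""
    valid: list[OrderEvent] = []
    rejected = 0
    for raw in raw_events:
        try:
            valid.append(OrderEvent(**raw))
        except ValidationError as exc:
            # Log the rejection instead of letting the record through.
            rejected += 1
            logger.warning("Rejected event with %d error(s)", len(exc.errors()))
    return valid, rejected
```

Counting rejections rather than raising on the first bad record keeps a single malformed event from stopping the whole batch, while the warning log leaves a trace for debugging.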

Tech Stack

  • Language: Python 3.12+
  • Data Validation: Pydantic
  • Data Processing: Polars
  • Storage: Apache Parquet

Project Structure

.
├── data/               # Parquet output files (untracked)
├── src/
│   ├── models.py       # Pydantic schemas (The Data Contract)
│   ├── generator.py    # Synthetic data logic (Faker)
│   └── main.py         # App entry point (Polars logic)
└── analytics.py        # Analytics script (Lazy API demo)

How to Run

1. Clone the repository

git clone https://github.com//e-commerce_dat-sim.git
cd e-commerce_dat-sim

2. Install dependencies

pip install -r requirements.txt

3. Run the pipeline

python src/main.py

4. Run analytics

python analytics.py

Results

Analytics Report
