An End-to-End Data Engineering Project
This project demonstrates a real-time data engineering pipeline built from scratch, covering everything from ingestion to storage with a modern, scalable tech stack. It fetches Bitcoin price updates from the CoinGecko API, then streams, processes, and stores the data using Airflow, Kafka, Spark, and Cassandra, all containerized with Docker for seamless orchestration and deployment.
Pipeline Flow:
- Airflow fetches Bitcoin data from the CoinGecko API and stores it in PostgreSQL.
- Data is streamed to Apache Kafka, coordinated by Zookeeper.
- Spark Streaming consumes and processes data in real-time.
- Transformed data is stored in a Cassandra database.
- Monitoring and schema evolution are handled via Kafka Control Center and Schema Registry.
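The first hop of the flow above can be sketched as a small fetch-and-parse step. This is a minimal illustration only: the payload shape matches CoinGecko's public `/simple/price` endpoint, but the function names (`parse_price`, `to_kafka_message`) and record fields are assumptions, not the project's actual code.

```python
import json
from datetime import datetime, timezone

# CoinGecko's /simple/price endpoint returns JSON shaped like:
#   {"bitcoin": {"usd": 67000.12}}
COINGECKO_URL = (
    "https://api.coingecko.com/api/v3/simple/price"
    "?ids=bitcoin&vs_currencies=usd"
)

def parse_price(payload: dict) -> dict:
    """Flatten a CoinGecko response into a record for downstream streaming."""
    return {
        "symbol": "BTC",
        "price_usd": float(payload["bitcoin"]["usd"]),
        "fetched_at": datetime.now(timezone.utc).isoformat(),
    }

def to_kafka_message(record: dict) -> bytes:
    """Serialize the record as JSON bytes, ready for a Kafka producer."""
    return json.dumps(record).encode("utf-8")

# Example with a canned payload (no network call):
record = parse_price({"bitcoin": {"usd": 67000.12}})
message = to_kafka_message(record)
```

In the real pipeline this logic would live inside an Airflow task, with the serialized message handed to a Kafka producer rather than a local variable.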
| Layer | Tool |
|---|---|
| Orchestration | Apache Airflow |
| Messaging | Apache Kafka, Zookeeper |
| Processing | Apache Spark (Structured Streaming) |
| Storage | Cassandra, PostgreSQL |
| Monitoring | Kafka Control Center, Schema Registry |
| Infrastructure | Docker, Docker Compose |
| Programming | Python |
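To illustrate the storage layer, a Cassandra table for the transformed price records might look like the following. This is a hypothetical schema sketch; the keyspace, table, and column names used by the actual project are defined in its own setup scripts.

```sql
-- Hypothetical keyspace and table for transformed Bitcoin price records
CREATE KEYSPACE IF NOT EXISTS crypto_stream
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS crypto_stream.btc_prices (
  symbol      text,
  fetched_at  timestamp,
  price_usd   double,
  PRIMARY KEY (symbol, fetched_at)
) WITH CLUSTERING ORDER BY (fetched_at DESC);
```

Partitioning by `symbol` and clustering by `fetched_at` keeps each asset's price history together and makes recent-first time-range queries cheap.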
Clone and spin up the project in just a few steps:

- Clone the repository:

  ```bash
  git clone https://github.com/0xpradish/e2e-data-engineering.git
  ```

- Navigate to the project directory:

  ```bash
  cd e2e-data-engineering
  ```

- Run Docker Compose to spin up the services:

  ```bash
  docker compose up -d
  ```
