⚡ Bitcoin Real-Time Data Streaming Pipeline

An End-to-End Data Engineering Project

Overview

This project builds a real-time data engineering pipeline from scratch, covering every stage from ingestion to storage on a modern, scalable tech stack. It fetches Bitcoin price updates from the CoinGecko API, then streams, processes, and stores the data using Airflow, Kafka, Spark, and Cassandra. All services are containerized with Docker, so the whole stack can be started and orchestrated from a single Compose file.


System Architecture

Pipeline Flow:

  1. Airflow fetches Bitcoin data from the CoinGecko API and stores it in PostgreSQL.
  2. Data is streamed to Apache Kafka, coordinated by Zookeeper.
  3. Spark Streaming consumes and processes data in real-time.
  4. Transformed data is stored in a Cassandra database.
  5. Monitoring and schema evolution are handled via Kafka Control Center and Schema Registry.
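
Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the repository's actual DAG code: it assumes CoinGecko's public `/simple/price` endpoint, and the function names (`fetch_price`, `build_record`, `to_kafka_value`) and record fields are hypothetical. The network call is only defined, not executed, so the snippet runs offline against a sample payload.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Assumption: the public CoinGecko "simple price" endpoint.
# The project itself may query a different endpoint or currency.
COINGECKO_URL = (
    "https://api.coingecko.com/api/v3/simple/price"
    "?ids=bitcoin&vs_currencies=usd"
)


def fetch_price(url: str = COINGECKO_URL, timeout: float = 10.0) -> dict:
    """Fetch the raw JSON payload, shaped like {"bitcoin": {"usd": 64000.0}}."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def build_record(payload: dict, fetched_at: datetime) -> dict:
    """Flatten the payload into a row-like record (hypothetical field names)."""
    return {
        "coin": "bitcoin",
        "price_usd": float(payload["bitcoin"]["usd"]),
        "fetched_at": fetched_at.astimezone(timezone.utc).isoformat(),
    }


def to_kafka_value(record: dict) -> bytes:
    """Serialize a record to UTF-8 JSON bytes, a common Kafka message format."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


# Offline example with a sample payload instead of a live API call.
sample = {"bitcoin": {"usd": 64000.0}}
record = build_record(sample, datetime(2024, 1, 1, tzinfo=timezone.utc))
message = to_kafka_value(record)
```

In a running pipeline, an Airflow task would presumably call something like `fetch_price()` on a schedule, insert the record into PostgreSQL, and hand the serialized bytes to a Kafka producer client.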

Tech Stack

| Layer | Tool |
|---|---|
| Orchestration | Apache Airflow |
| Messaging | Apache Kafka, Zookeeper |
| Processing | Apache Spark (Structured Streaming) |
| Storage | Cassandra, PostgreSQL |
| Monitoring | Kafka Control Center, Schema Registry |
| Infrastructure | Docker, Docker Compose |
| Programming | Python |

🏁 Getting Started

Clone and spin up the project in just a few steps:

  1. Clone the repository

    git clone https://github.com/0xpradish/e2e-data-engineering.git
    
  2. Navigate to the project directory

    cd e2e-data-engineering
    
  3. Run Docker Compose to spin up the services:

    docker compose up -d
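
Once the containers are up, a quick reachability check can confirm that each service is accepting connections. The sketch below is an assumption-laden helper, not part of the repository: it assumes every service is published on localhost at its upstream default port, which the project's Compose file may override, and the names `DEFAULT_PORTS`, `is_port_open`, and `check_services` are hypothetical.

```python
import socket

# Assumption: each service is exposed on its upstream default port.
# Adjust to match the port mappings in the project's Compose file.
DEFAULT_PORTS = {
    "airflow-webserver": 8080,
    "postgres": 5432,
    "zookeeper": 2181,
    "kafka": 9092,
    "schema-registry": 8081,
    "control-center": 9021,
    "cassandra": 9042,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host: str = "localhost", ports: dict = DEFAULT_PORTS) -> dict:
    """Map each service name to whether its port is currently reachable."""
    return {name: is_port_open(host, port) for name, port in ports.items()}
```

Calling `check_services()` returns a dictionary of service names to booleans. An open port only shows that something is listening; it does not prove the service has finished initializing.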
