Real-Time Financial Data Lakehouse
A production-grade data engineering platform that ingests real-time cryptocurrency trade data, performs streaming analytics, and stores results for both real-time queries and historical research.
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Binance WS    │────▶│      Kafka      │────▶│  Spark Stream   │
│   Trade Feed    │     │  market_trades  │     │    Processor    │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
                                         ┌───────────────┼───────────────┐
                                         ▼               ▼               ▼
                                    ┌──────────┐    ┌──────────┐    ┌──────────┐
                                    │  Delta   │    │ClickHouse│    │  Kafka   │
                                    │   Lake   │    │  (OLAP)  │    │ (OHLCV)  │
                                    └──────────┘    └──────────┘    └──────────┘
```
| Component | Description | Technology |
|---|---|---|
| Ingestor | WebSocket client connecting to Binance trade streams | Scala, Java-WebSocket |
| Kafka | Message bus for raw trade data | Apache Kafka 3.6 |
| Processor | Streaming aggregation (OHLCV) | Spark Structured Streaming 3.5 |
| Storage | Historical data lake | Delta Lake 3.0 + MinIO (S3) |
| OLAP | Real-time analytics queries | ClickHouse |
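For orientation, the shared trade model in `common/Models.scala` might look roughly like the sketch below. The field names are assumptions based on what Binance publishes on its trade stream, not the project's actual definitions.

```scala
// Hypothetical shape of the shared trade model (common/Models.scala).
// Field names mirror Binance's trade-stream payload and are assumptions.
package com.marketstream.common

import java.time.Instant

final case class Trade(
  symbol: String,        // e.g. "BTCUSDT"
  tradeId: Long,         // exchange-assigned trade id
  price: BigDecimal,     // executed price
  quantity: BigDecimal,  // executed quantity
  eventTime: Instant,    // exchange event timestamp
  isBuyerMaker: Boolean  // true if the buyer was the maker
)
```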
- Docker Desktop (with Docker Compose)
- Java 11-17 (recommended for Spark compatibility)
  - Note: Java 25+ has compatibility issues with Spark/Hadoop. Set `JAVA_HOME` to point to Java 11-17.
- SBT 1.9+
- 8GB+ RAM allocated to Docker
```bash
# Clone and navigate
cd MarketStream

# Start all services
docker-compose up -d

# Verify services are running
docker-compose ps
```

Services available:
- Kafka UI: http://localhost:8082
- Spark Master: http://localhost:8080
- MinIO Console: http://localhost:9001 (login: minioadmin/minioadmin123)
- ClickHouse: http://localhost:8123
```bash
# Compile all modules
sbt compile

# Run tests
sbt test

# Build fat JARs
sbt assembly
```

```bash
# Start ingesting trades from Binance
sbt "ingestor/run"
```

You should see output like:

```
MarketStream Trade Ingestor Starting
Symbols to ingest: btcusdt, ethusdt, solusdt
WebSocket connected to Binance
Metrics: received=1000, published=998, dropped=2
```
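Under the hood, the ingest path is a WebSocket client that forwards each trade message to Kafka. Below is a minimal sketch assuming the Java-WebSocket and Kafka producer APIs listed in the component table; the object name, stream URL, and producer settings are illustrative, not the project's actual code.

```scala
// Illustrative ingest loop: Java-WebSocket client -> Kafka producer.
import java.net.URI
import java.util.Properties
import org.java_websocket.client.WebSocketClient
import org.java_websocket.handshake.ServerHandshake
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object IngestSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", sys.env.getOrElse("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092"))
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("enable.idempotence", "true") // idempotent producer for exactly-once on the produce side
    props.put("acks", "all")
    val producer = new KafkaProducer[String, String](props)

    // Binance combined-streams endpoint for the configured symbols
    val uri = new URI("wss://stream.binance.com:9443/stream?streams=btcusdt@trade/ethusdt@trade/solusdt@trade")
    val client = new WebSocketClient(uri) {
      override def onOpen(handshake: ServerHandshake): Unit = println("WebSocket connected to Binance")
      override def onMessage(message: String): Unit =
        // Forward the raw JSON payload; the real ingestor would parse it and key by symbol
        producer.send(new ProducerRecord[String, String]("market_trades_raw", message))
      override def onClose(code: Int, reason: String, remote: Boolean): Unit = println(s"WebSocket closed: $reason")
      override def onError(ex: Exception): Unit = ex.printStackTrace()
    }
    client.connectBlocking()
    sys.addShutdownHook { client.close(); producer.close() }
  }
}
```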
```bash
# In a new terminal
docker exec -it marketstream-kafka kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic market_trades_raw \
  --from-beginning \
  --max-messages 10
```

```bash
# In a new terminal
sbt "processor/run"
```

```bash
# Start Spark shell
spark-shell --packages io.delta:delta-spark_2.12:3.0.0
```

```scala
// In Spark shell
spark.read.format("delta").load("s3a://marketstream-raw/trades").show()
spark.read.format("delta").load("s3a://marketstream-aggregated/ohlcv_1m").show()
```
sbt "processor/runMain com.marketstream.quality.QualityCheck"Checks performed:
- Price/Volume positive values
- No null symbols
- OHLCV consistency (high >= low)
- VWAP within price range
- Data freshness
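As an illustration, checks like these can be expressed as simple filter counts over the aggregated table. The sketch below is not the actual `QualityCheck` implementation, and the column names (`symbol`, `high`, `low`, `vwap`, ...) are assumptions.

```scala
// Illustrative batch quality checks over the 1-minute OHLCV Delta table.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object QualitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("quality-sketch").getOrCreate()
    val ohlcv = spark.read.format("delta").load("s3a://marketstream-aggregated/ohlcv_1m")

    val badPrices   = ohlcv.filter(col("close") <= 0 || col("volume") < 0).count()
    val nullSymbols = ohlcv.filter(col("symbol").isNull).count()
    val inverted    = ohlcv.filter(col("high") < col("low")).count()
    val badVwap     = ohlcv.filter(col("vwap") < col("low") || col("vwap") > col("high")).count()

    println(s"non-positive prices/volumes: $badPrices")
    println(s"null symbols: $nullSymbols")
    println(s"high < low rows: $inverted")
    println(s"vwap outside [low, high]: $badVwap")
    spark.stop()
  }
}
```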
```
MarketStream/
├── build.sbt                    # Multi-module build configuration
├── docker-compose.yml           # Infrastructure setup
├── common/                      # Shared models and configuration
│   └── src/main/scala/
│       └── com/marketstream/common/
│           ├── Models.scala
│           └── Configuration.scala
├── ingestor/                    # WebSocket -> Kafka ingestion
│   └── src/
│       ├── main/scala/
│       │   └── com/marketstream/ingestor/
│       │       ├── TradeIngestor.scala
│       │       ├── BinanceWebSocketClient.scala
│       │       └── KafkaTradeProducer.scala
│       └── test/scala/
├── processor/                   # Spark streaming + Delta Lake
│   └── src/
│       ├── main/scala/
│       │   └── com/marketstream/processor/
│       │       ├── StreamProcessor.scala
│       │       └── quality/QualityCheck.scala
│       └── test/scala/
└── .github/workflows/ci.yml     # CI/CD pipeline
```
Environment variables override defaults:

| Variable | Description | Default |
|---|---|---|
| `KAFKA_BOOTSTRAP_SERVERS` | Kafka brokers | `localhost:9092` |
| `S3_ENDPOINT` | MinIO/S3 endpoint | `http://localhost:9000` |
| `S3_ACCESS_KEY` | S3 access key | `minioadmin` |
| `S3_SECRET_KEY` | S3 secret key | `minioadmin123` |
| `SPARK_MASTER` | Spark master URL | `local[*]` |
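The override pattern is the usual environment-first lookup; a minimal sketch of how `common/Configuration.scala` might read these variables (assumed, not the actual implementation):

```scala
// Illustrative environment-first configuration lookup with the defaults above.
object ConfigSketch {
  val kafkaBootstrap = sys.env.getOrElse("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")
  val s3Endpoint     = sys.env.getOrElse("S3_ENDPOINT", "http://localhost:9000")
  val s3AccessKey    = sys.env.getOrElse("S3_ACCESS_KEY", "minioadmin")
  val s3SecretKey    = sys.env.getOrElse("S3_SECRET_KEY", "minioadmin123")
  val sparkMaster    = sys.env.getOrElse("SPARK_MASTER", "local[*]")
}
```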
```bash
# Run all tests
sbt test

# Run specific module tests
sbt "ingestor/test"
sbt "processor/test"

# Run with coverage
sbt coverage test coverageReport
```

- High Throughput: Handles 10,000+ trades/second
- Exactly-Once Semantics: Kafka idempotent producer + Spark checkpointing
- Late Data Handling: 2-minute watermark for out-of-order data
- Data Quality: Automated validation suite
- Scalable: Designed for horizontal scaling on Kubernetes
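The watermarking, windowed OHLCV aggregation, and checkpointed Delta sink mentioned above fit together roughly as in the sketch below. The schema, column names, and paths are assumptions rather than the project's actual `StreamProcessor` code.

```scala
// Rough shape of the streaming job: Kafka source, 2-minute watermark,
// 1-minute OHLCV windows, checkpointed Delta sink.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object ProcessorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("processor-sketch").getOrCreate()

    // Assumed JSON layout of the raw trade messages
    val tradeSchema = new StructType()
      .add("symbol", StringType)
      .add("price", DoubleType)
      .add("quantity", DoubleType)
      .add("eventTime", TimestampType)

    val trades = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "market_trades_raw")
      .load()
      .select(from_json(col("value").cast("string"), tradeSchema).as("t"))
      .select("t.*")

    val ohlcv = trades
      .withWatermark("eventTime", "2 minutes") // tolerate late / out-of-order trades
      .groupBy(window(col("eventTime"), "1 minute"), col("symbol"))
      .agg(
        first("price").as("open"),
        max("price").as("high"),
        min("price").as("low"),
        last("price").as("close"),
        sum("quantity").as("volume")
      )

    ohlcv.writeStream
      .format("delta")
      .outputMode("append") // rows emitted once the watermark passes the window end
      .option("checkpointLocation", "s3a://marketstream-aggregated/_checkpoints/ohlcv_1m")
      .start("s3a://marketstream-aggregated/ohlcv_1m")
      .awaitTermination()
  }
}
```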
This project demonstrates proficiency in:
| Skill | Implementation |
|---|---|
| Scala | Functional patterns, case classes, implicits |
| Spark | Structured Streaming, watermarks, aggregations |
| Kafka | Producer API, consumer groups, exactly-once |
| Delta Lake | ACID transactions, time travel, schema evolution |
| Data Pipelines | End-to-end streaming architecture |
| CI/CD | GitHub Actions, automated testing |
| Docker/K8s | Containerized infrastructure |
MIT License - feel free to use for your own portfolio!
- Binance for public WebSocket API
- Apache Spark & Delta Lake communities
- Confluent for Kafka Docker images