Batch-Processing Data Architecture for ML Applications

Project Overview

A scalable, reliable, and maintainable batch-processing data infrastructure designed to support data-intensive machine learning applications. The system ingests multi-gigabyte datasets, processes them in batches, and prepares aggregated datasets for quarterly ML model training.

Architecture Components

Microservices

  • Data Ingestion: Apache Kafka
  • Data Storage: Hadoop HDFS + PostgreSQL
  • Data Processing: Apache Spark
  • Workflow Orchestration: Apache Airflow
  • Monitoring: Prometheus + Grafana
  • API Gateway: Flask REST API
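
The services above are wired together with Docker Compose. A minimal sketch of what the compose file might look like for the Kafka and PostgreSQL pieces (image tags, ports, and environment values are illustrative assumptions, not the project's actual docker-compose.yml):

```yaml
version: "3.8"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on:
      - zookeeper
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: example
```

Each remaining service (Spark, Airflow, Prometheus, Grafana, the Flask API) would be added the same way, which is what makes the whole stack reproducible as Infrastructure as Code.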

Key Features

  • ✅ Batch processing with configurable schedules
  • ✅ Containerized microservices architecture
  • ✅ Scalable and fault-tolerant design
  • ✅ Data governance and security
  • ✅ Infrastructure as Code (Docker Compose)
  • ✅ Version-controlled codebase

Data Source

Dataset: NYC Taxi Trip Data (>1M records)

  • Source: Kaggle / NYC OpenData
  • Size: Multiple GB with timestamped records
  • Processing: Monthly ingestion, Quarterly aggregation

Project Structure

.
├── docker-compose.yml          # Container orchestration
├── infrastructure/             # Infrastructure configuration
│   ├── kafka/
│   ├── spark/
│   ├── hadoop/
│   ├── airflow/
│   └── monitoring/
├── data-ingestion/            # Kafka producers and data loaders
├── data-processing/           # Spark jobs for transformation
├── data-storage/              # Storage schemas and utilities
├── api/                       # REST API for data delivery
├── scripts/                   # Utility scripts
├── docs/                      # Architecture documentation
└── tests/                     # Integration tests

Quick Start

Prerequisites

  • Docker Desktop (20.x or later)
  • Docker Compose (v2.x or later)
  • Git
  • Minimum 16GB RAM, 50GB free disk space

Setup Instructions

  1. Clone the repository
git clone <repository-url>
cd Project1
  2. Start the infrastructure
docker-compose up -d
  3. Verify services are running
docker-compose ps
  4. Access the service UIs
  5. Ingest sample data
python data-ingestion/ingest_data.py --source data/sample.csv
  6. Trigger batch processing
# The Airflow DAG triggers quarterly processing automatically
# Or trigger manually: python scripts/trigger_batch_processing.py

System Architecture

Data Flow

  1. Ingestion: Data files → Kafka → HDFS (Raw Zone)
  2. Processing: Spark reads from HDFS → Transforms → Writes to HDFS (Processed Zone)
  3. Aggregation: Spark aggregates → Writes to PostgreSQL (Analytics Zone)
  4. Delivery: REST API serves data to ML applications
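
The aggregation step (3) can be sketched in plain Python. The real pipeline runs this as a Spark job; the field names (`pickup_ts`, `fare`) and the quarterly grouping below are illustrative assumptions based on the NYC taxi dataset, not the project's actual job code:

```python
from collections import defaultdict
from datetime import datetime

def quarter_of(ts: str) -> str:
    """Map an ISO timestamp to a quarter label like '2023-Q1'."""
    dt = datetime.fromisoformat(ts)
    return f"{dt.year}-Q{(dt.month - 1) // 3 + 1}"

def aggregate_trips(records):
    """Roll raw trip records up into per-quarter totals, mirroring
    the aggregation that lands in the Analytics Zone (PostgreSQL)."""
    totals = defaultdict(lambda: {"trips": 0, "revenue": 0.0})
    for rec in records:
        q = quarter_of(rec["pickup_ts"])
        totals[q]["trips"] += 1
        totals[q]["revenue"] += rec["fare"]
    return dict(totals)

trips = [
    {"pickup_ts": "2023-01-15T08:30:00", "fare": 12.5},
    {"pickup_ts": "2023-02-03T17:10:00", "fare": 8.0},
    {"pickup_ts": "2023-04-09T11:45:00", "fare": 20.0},
]
print(aggregate_trips(trips))
# {'2023-Q1': {'trips': 2, 'revenue': 20.5}, '2023-Q2': {'trips': 1, 'revenue': 20.0}}
```

In Spark this becomes a `groupBy` on the derived quarter column followed by `count` and `sum` aggregations.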

Reliability Features

  • Kafka message persistence and replication
  • HDFS data replication (factor 3)
  • Spark checkpoint and recovery
  • Airflow retry mechanisms
  • Database backups

Scalability Features

  • Horizontal scaling of Spark workers
  • Kafka partitioning
  • HDFS distributed storage
  • Containerized services

Security & Governance

  • Role-based access control (RBAC)
  • Data encryption at rest
  • Audit logging
  • Data lineage tracking
  • Schema validation
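
Schema validation (the last item above) can be as simple as checking each incoming record against a declared field/type map before it enters the Raw Zone. A minimal sketch; the field names below are assumptions based on the NYC taxi dataset, not the project's actual schema files:

```python
# Declared schema: field name -> expected Python type.
TRIP_SCHEMA = {
    "pickup_ts": str,
    "dropoff_ts": str,
    "fare": float,
    "passenger_count": int,
}

def validate(record, schema=TRIP_SCHEMA):
    """Return a list of violations; an empty list means the record is valid."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors

good = {"pickup_ts": "2023-01-15T08:30:00", "dropoff_ts": "2023-01-15T08:55:00",
        "fare": 12.5, "passenger_count": 1}
bad = {"pickup_ts": "2023-01-15T08:30:00", "fare": "12.5"}
print(validate(good))  # []
print(validate(bad))   # three violations
```

Records that fail validation would typically be routed to a quarantine topic or table, which also feeds the audit logging and lineage tracking listed above.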

Development Workflow

Adding a new data source

  1. Create Kafka producer in data-ingestion/
  2. Define schema in data-storage/schemas/
  3. Create Spark job in data-processing/
  4. Update Airflow DAG

Running tests

pytest tests/ -v

Monitoring and Maintenance

Health Checks

All services expose health endpoints monitored by Prometheus:

  • Kafka: /health
  • Spark: /api/v1/applications
  • Airflow: /health

Logs

# View logs for specific service
docker-compose logs -f <service-name>

Future Enhancements

  • Stream processing pipeline (Kafka Streams / Flink)
  • Real-time dashboard
  • ML model versioning integration
  • Cloud deployment (AWS/Azure)
  • Advanced data quality checks

Contributing

This is an academic project. For questions or suggestions, please open an issue.

License

MIT License

Author

Bhavyashree Prakash - Data Engineering Portfolio Project

References

  • Apache Kafka Documentation
  • Apache Spark Documentation
  • Apache Airflow Documentation
  • Docker Best Practices
  • Microsoft Azure Reference Architecture
