A scalable, reliable, and maintainable batch-processing data infrastructure designed to support data-intensive machine learning applications. The system ingests massive amounts of data, processes it in batches, and prepares aggregated datasets for quarterly ML model training.
- Data Ingestion: Apache Kafka
- Data Storage: Hadoop HDFS + PostgreSQL
- Data Processing: Apache Spark
- Workflow Orchestration: Apache Airflow
- Monitoring: Prometheus + Grafana
- API Gateway: Flask REST API
- ✅ Batch processing with configurable schedules
- ✅ Containerized microservices architecture
- ✅ Scalable and fault-tolerant design
- ✅ Data governance and security
- ✅ Infrastructure as Code (Docker Compose)
- ✅ Version-controlled codebase
Dataset: NYC Taxi Trip Data (>1M records)
- Source: Kaggle / NYC OpenData
- Size: Several gigabytes of timestamped records
- Processing: Monthly ingestion, Quarterly aggregation
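As a rough sketch of how the monthly/quarterly cadence maps onto the timestamped records, the snippet below derives ingestion-month and aggregation-quarter keys with pandas. The column name `tpep_pickup_datetime` is the yellow-taxi pickup field and may differ for other sources; this is an illustration, not the project's actual ingestion code.

```python
# Sketch: derive monthly ingestion and quarterly aggregation keys from trip timestamps.
# Assumes a CSV with a pickup timestamp column named "tpep_pickup_datetime".
import pandas as pd

trips = pd.read_csv("data/sample.csv", parse_dates=["tpep_pickup_datetime"])

# Monthly ingestion key (e.g. "2023-07") and quarterly aggregation key (e.g. "2023Q3").
trips["ingest_month"] = trips["tpep_pickup_datetime"].dt.to_period("M").astype(str)
trips["agg_quarter"] = trips["tpep_pickup_datetime"].dt.to_period("Q").astype(str)

print(trips.groupby("agg_quarter").size())
```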
Project structure:

```
.
├── docker-compose.yml       # Container orchestration
├── infrastructure/          # Infrastructure configuration
│   ├── kafka/
│   ├── spark/
│   ├── hadoop/
│   ├── airflow/
│   └── monitoring/
├── data-ingestion/          # Kafka producers and data loaders
├── data-processing/         # Spark jobs for transformation
├── data-storage/            # Storage schemas and utilities
├── api/                     # REST API for data delivery
├── scripts/                 # Utility scripts
├── docs/                    # Architecture documentation
└── tests/                   # Integration tests
```
- Docker Desktop (20.x or later)
- Docker Compose (v2.x or later)
- Git
- Minimum 16GB RAM, 50GB free disk space
- Clone the repository
  ```bash
  git clone <repository-url>
  cd Project1
  ```
- Start the infrastructure
  ```bash
  docker-compose up -d
  ```
- Verify services are running
  ```bash
  docker-compose ps
  ```
- Access service UIs:
- Airflow: http://localhost:8080 (admin/admin)
- Spark Master: http://localhost:8081
- Grafana: http://localhost:3000 (admin/admin)
- Kafka UI: http://localhost:9000
- Ingest sample data
  ```bash
  python data-ingestion/ingest_data.py --source data/sample.csv
  ```
- Trigger batch processing
  ```bash
  # The Airflow DAG triggers quarterly processing automatically
  # Or trigger it manually:
  python scripts/trigger_batch_processing.py
  ```

Data flow through the pipeline:

- Ingestion: Data files → Kafka → HDFS (Raw Zone)
- Processing: Spark reads from HDFS → Transforms → Writes to HDFS (Processed Zone)
- Aggregation: Spark aggregates → Writes to PostgreSQL (Analytics Zone)
- Delivery: REST API serves data to ML applications
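To make the Processing and Aggregation steps concrete, here is a minimal PySpark sketch. The HDFS path, PostgreSQL connection details, table name, and column names are illustrative assumptions, not the project's actual configuration, and the JDBC write assumes the PostgreSQL driver is on the Spark classpath.

```python
# Minimal sketch of the quarterly aggregation step (Processed Zone -> Analytics Zone).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("quarterly-aggregation").getOrCreate()

# Read cleaned trips from the Processed Zone on HDFS (path is an assumption).
trips = spark.read.parquet("hdfs://namenode:9000/data/processed/trips/")

# Aggregate per pickup quarter.
agg = (
    trips.withColumn("quarter", F.date_trunc("quarter", F.col("pickup_datetime")))
         .groupBy("quarter")
         .agg(F.count("*").alias("trip_count"),
              F.avg("trip_distance").alias("avg_distance"))
)

# Write the aggregates to PostgreSQL for the REST API to serve
# (connection details are placeholders).
(agg.write.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/analytics")
    .option("dbtable", "quarterly_trip_stats")
    .option("user", "analytics_user")
    .option("password", "change-me")
    .mode("overwrite")
    .save())
```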
- Kafka message persistence and replication
- HDFS data replication (factor 3)
- Spark checkpoint and recovery
- Airflow retry mechanisms
- Database backups
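As an illustration of the Airflow retry mechanism listed above, a DAG can declare retry behavior in its default arguments. The DAG id, schedule, and command below are assumptions for the sketch, not the project's actual DAG.

```python
# Sketch of retry configuration in an Airflow DAG (names are illustrative).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-platform",
    "retries": 3,                        # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=10),
    "retry_exponential_backoff": True,
}

with DAG(
    dag_id="quarterly_batch_processing",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 0 1 */3 *",     # first day of every third month (quarterly)
    catchup=False,
    default_args=default_args,
) as dag:
    run_spark_job = BashOperator(
        task_id="run_spark_aggregation",
        bash_command="spark-submit data-processing/quarterly_aggregation.py",
    )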
- Horizontal scaling of Spark workers
- Kafka partitioning
- HDFS distributed storage
- Containerized services
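For example, Kafka partitioning and replication are set when a topic is created. The sketch below uses the `kafka-python` admin client; the topic name and broker address are assumptions.

```python
# Sketch: create a partitioned, replicated topic (requires at least three brokers
# for replication factor 3).
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Six partitions allow up to six consumers in a group to read in parallel;
# replication factor 3 keeps each partition on three brokers.
admin.create_topics([
    NewTopic(name="taxi-trips", num_partitions=6, replication_factor=3)
])
```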
- Role-based access control (RBAC)
- Data encryption at rest
- Audit logging
- Data lineage tracking
- Schema validation
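A minimal sketch of the schema-validation idea using the `jsonschema` package; the field names below are illustrative, not the project's actual schema definitions.

```python
# Sketch: validate an incoming trip record against a JSON schema.
from jsonschema import ValidationError, validate

TRIP_SCHEMA = {
    "type": "object",
    "required": ["pickup_datetime", "dropoff_datetime", "trip_distance", "fare_amount"],
    "properties": {
        "pickup_datetime": {"type": "string"},
        "dropoff_datetime": {"type": "string"},
        "trip_distance": {"type": "number", "minimum": 0},
        "fare_amount": {"type": "number"},
    },
}

def is_valid_trip(record: dict) -> bool:
    """Return True if the record conforms to the trip schema."""
    try:
        validate(instance=record, schema=TRIP_SCHEMA)
        return True
    except ValidationError:
        return False
```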
To add a new data source:

- Create a Kafka producer in `data-ingestion/` (a minimal sketch follows this list)
- Define the schema in `data-storage/schemas/`
- Create a Spark job in `data-processing/`
- Update the Airflow DAG
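A minimal sketch of the producer step, assuming the `kafka-python` client; the topic name, broker address, and CSV layout are placeholders for illustration.

```python
# Sketch: publish one Kafka message per CSV record for a new data source.
import csv
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

with open("data/new_source.csv", newline="") as fh:
    for row in csv.DictReader(fh):
        producer.send("new-source-topic", row)  # one message per record

producer.flush()  # block until all buffered messages are delivered
producer.close()
```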
Run the integration tests:

```bash
pytest tests/ -v
```

All services expose health endpoints monitored by Prometheus:
- Kafka: `/health`
- Spark: `/api/v1/applications`
- Airflow: `/health`
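A sketch of an integration test (runnable with `pytest tests/ -v`) that probes some of these endpoints over HTTP; the hosts and ports are assumptions based on the service UIs listed in the quick start.

```python
# Sketch: smoke-test service health endpoints from the host machine.
import pytest
import requests

ENDPOINTS = {
    "airflow": "http://localhost:8080/health",
    "spark": "http://localhost:8081/api/v1/applications",
}

@pytest.mark.parametrize("service,url", list(ENDPOINTS.items()))
def test_service_is_healthy(service, url):
    response = requests.get(url, timeout=5)
    assert response.status_code == 200, f"{service} health check failed"
```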
```bash
# View logs for a specific service
docker-compose logs -f <service-name>
```

Planned future enhancements:

- Stream processing pipeline (Kafka Streams / Flink)
- Real-time dashboard
- ML model versioning integration
- Cloud deployment (AWS/Azure)
- Advanced data quality checks
This is an academic project. For questions or suggestions, please open an issue.
MIT License
Bhavyashree Prakash - Data Engineering Portfolio Project
- Apache Kafka Documentation
- Apache Spark Documentation
- Apache Airflow Documentation
- Docker Best Practices
- Microsoft Azure Reference Architecture