This repository processes large-scale aviation data from the US Bureau of Transportation Statistics. It demonstrates an end-to-end data lifecycle, from ingesting raw batch and streaming data to training and serving a predictive machine learning model for flight delays.
Dataset: Data Expo 2009: Airline on-time data (2008.csv)
- Local WSL Environment: Full Linux execution environment running natively on Windows.
- Data Engineering: PySpark for heavy-duty big data aggregations and Parquet file optimization.
- Containerization: Fully containerized Apache Kafka environments using Docker & Docker Compose.
- MLOps Integration: Model tracking, metric logging, and registry using MLflow (Local Mode).
- Business Intelligence: Automated visualization generation using Matplotlib.
The pipeline is split into three core components:
- Batch Processing Pipeline (/src/batch):
  - Ingests the historical 2008 flight dataset (CSV to Parquet).
  - Cleans, transforms, and aggregates data using PySpark to extract historical delay patterns.
  - Generates business intelligence visualizations.
- MLOps Pipeline (/src/mlops):
  - Consumes the processed batch features to train an XGBoost predictive model for flight delays.
  - Tracks experiments, hyperparameters, and model artifacts locally via MLflow.
- Streaming Pipeline (/src/streaming):
  - Simulates real-time flight telemetry using Apache Kafka.
  - Utilizes a Producer/Consumer architecture for high-throughput live data processing.
- Environment: WSL2 (Ubuntu), Docker Desktop
- Language: Python 3.10+
- Data Processing: Pandas, PySpark, Java 11 (OpenJDK)
- Machine Learning: Scikit-Learn, XGBoost
- MLOps: MLflow
- Streaming: Apache Kafka (kafka-python)
- Visualizations: Matplotlib
│ .gitignore
│ docker-compose.yml
│ Dockerfile
│ LICENSE.md
│ README.md
│ requirements.txt
│
├───.github
│ └───workflows
├───data
│ ├───analytics
│ │ └───visuals # Auto-generated BI charts (.png)
│ ├───processed # Cleaned Parquet files
│ └───raw # Place 2008.csv here
├───infrastructure
│ serverless.yml
├───mlruns # Auto-generated MLflow tracking data
├───src
│ │ __init__.py
│ │
│ ├───batch
│ │ analytics.py # PySpark aggregations
│ │ clean_data.py # CSV to Parquet conversion
│ │ get_zipfiles.py # AWS S3 ingestion (Cloud mode)
│ │ visualize.py # Generates business dashboards
│ │ __init__.py
│ │
│ ├───mlops
│ │ train_model.py # XGBoost training & MLflow logging
│ │ __init__.py
│ │
│ └───streaming
│ consumer.py # Kafka listener
│ producer.py # Kafka data sender
│ __init__.py
│
└───tests
test_clean_data.py
__init__.py
This project is designed to run in a Linux environment using WSL (Windows Subsystem for Linux).
- Install Ubuntu under WSL (run from Windows PowerShell if WSL is not already set up):
wsl --install -d Ubuntu
- Open your Ubuntu WSL terminal and navigate to the project folder.
- Create the data directories expected by the pipeline:
mkdir -p data/raw data/processed data/analytics/visuals
touch data/raw/.gitkeep data/processed/.gitkeep data/analytics/.gitkeep data/analytics/visuals/.gitkeep
- Install the required Java engine for PySpark:
sudo apt update
sudo apt install openjdk-11-jre-headless -y
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
- Create and activate a Python Virtual Environment:
python3 -m venv venv
source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
(Ensure pandas, pyspark, mlflow, xgboost, scikit-learn, matplotlib, and kafka-python are installed).
- Download the 2008.csv file from the Kaggle link above.
- Place it exactly at: data/raw/2008.csv
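With the environment and dataset in place, a quick sanity check along these lines confirms that Java, PySpark, and the raw CSV are all wired up correctly (a minimal sketch; this script is not part of the repository):

```python
# sanity check: can PySpark start a local session and read the raw CSV?
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sanity-check")
    .master("local[*]")   # run Spark locally inside WSL
    .getOrCreate()
)
print("Spark version:", spark.version)

# a failure here usually means Java or the file path is misconfigured
flights = spark.read.csv("data/raw/2008.csv", header=True)
print("Columns:", flights.columns[:8])
flights.show(3)

spark.stop()
```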
Ensure your virtual environment is active (source venv/bin/activate) before running these scripts.
1. Clean the Data (CSV -> Parquet):
python src/batch/clean_data.py
2. Run PySpark Analytics:
python src/batch/analytics.py
This answers the Capstone analytics questions and generates ml_features.parquet; a sketch of the clean-and-aggregate flow follows this list.
3. Generate Visualizations:
python src/batch/visualize.py
Check data/analytics/visuals/ for the generated charts.
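The scripts above contain the actual logic; the general pattern looks roughly like this (a minimal sketch, not the repository's exact code — the intermediate Parquet path and the aggregation columns such as UniqueCarrier, Month, and ArrDelay are assumptions based on the public 2008.csv schema):

```python
# illustrative sketch of the batch flow: clean CSV -> Parquet, then aggregate
# (see src/batch/clean_data.py and src/batch/analytics.py for the real code)
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-sketch").getOrCreate()

# 1. Clean: cast delay columns to numbers and drop rows without delay info
raw = spark.read.csv("data/raw/2008.csv", header=True, inferSchema=True)
clean = (
    raw.withColumn("ArrDelay", F.col("ArrDelay").cast("double"))
       .withColumn("DepDelay", F.col("DepDelay").cast("double"))
       .dropna(subset=["ArrDelay", "DepDelay"])
)
clean.write.mode("overwrite").parquet("data/processed/flights.parquet")  # path assumed

# 2. Aggregate: average arrival delay per carrier and month
flights = spark.read.parquet("data/processed/flights.parquet")
features = (
    flights.groupBy("UniqueCarrier", "Month")
           .agg(F.avg("ArrDelay").alias("avg_arr_delay"),
                F.count("*").alias("num_flights"))
)
features.write.mode("overwrite").parquet("data/processed/ml_features.parquet")

spark.stop()
```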
Train the XGBoost model using the features generated in Phase 3; a sketch of the training logic follows the steps below.
- Train the Model:
python src/mlops/train_model.py
- View the MLflow Dashboard:
MLflow saves all metrics and model artifacts locally to the mlruns/ folder. To view the UI, start the server:
mlflow ui --port 2000
Open your Windows web browser and navigate to: http://localhost:2000
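The training script follows the standard XGBoost-plus-MLflow pattern; roughly (a minimal sketch, not the exact contents of train_model.py — the feature file path, the avg_arr_delay label name, and the hyperparameter values are assumptions; reading Parquet with pandas also requires pyarrow):

```python
# illustrative sketch of XGBoost training with local MLflow tracking
# (see src/mlops/train_model.py for the real implementation)
import mlflow
import mlflow.xgboost
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

features = pd.read_parquet("data/processed/ml_features.parquet")      # path assumed
y = features["avg_arr_delay"]                                         # label name assumed
X = features.drop(columns=["avg_arr_delay"]).select_dtypes("number")  # numeric features only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

params = {"n_estimators": 200, "max_depth": 6, "learning_rate": 0.1}  # example values

with mlflow.start_run(run_name="xgboost-flight-delay"):
    model = xgb.XGBRegressor(**params)
    model.fit(X_train, y_train)

    mae = mean_absolute_error(y_test, model.predict(X_test))
    mlflow.log_params(params)
    mlflow.log_metric("mae", mae)
    mlflow.xgboost.log_model(model, "model")   # artifacts land under mlruns/
```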
To simulate live flight telemetry, we use Apache Kafka running in Docker.
1. Start the Kafka Infrastructure:
docker compose up -d
Wait 15-30 seconds for the Kafka broker to fully boot.
2. Start the Consumer (Terminal 1): In your current WSL terminal, activate the environment and start listening for data:
source venv/bin/activate
python src/streaming/consumer.py
3. Start the Producer (Terminal 2): Open a new WSL terminal window, activate the environment, and start streaming the CSV data into Kafka:
cd ~/Cloud-Computing-Project
source venv/bin/activate
python src/streaming/producer.py
You will immediately see the Producer sending messages in Terminal 2, and the Consumer processing them in Terminal 1.
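Both scripts follow the standard kafka-python producer/consumer pattern; roughly (a minimal sketch, not the repository's exact scripts — the flight-telemetry topic name and the localhost:9092 broker address are assumptions, though 9092 is Kafka's default port):

```python
# illustrative sketch of the streaming pair with kafka-python
# (see src/streaming/producer.py and src/streaming/consumer.py for the real code)
import json
from kafka import KafkaConsumer, KafkaProducer

TOPIC = "flight-telemetry"   # topic name assumed
BROKER = "localhost:9092"    # broker address assumed (Kafka default port)

# Producer side: serialize each flight record as JSON and publish it
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)
producer.send(TOPIC, {"FlightNum": 1234, "Origin": "JFK", "Dest": "LAX", "DepDelay": 12})
producer.flush()

# Consumer side: deserialize and process messages as they arrive
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print("received:", message.value)   # loops until interrupted
```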