Advanced BD ITMO University Project Work
A system that trains a machine learning model and serves incoming requests by making predictions for them with the trained model.
Dataset: https://www.kaggle.com/datasets/ananaymital/us-used-cars-dataset
Requirements:
- Spark ML should be used to implement the training ETL and model training.
- Spark Streaming should be used to implement the serving application.
- Live metrics should be calculated using Spark Streaming.
- All services and applications should be deployed in Kubernetes.
- Use Kubernetes CronJobs to implement periodic checks for new data uploads and new-model events (a sketch of such a check script is shown after this list).
- Cross-validation has to be used for the ensemble models.
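
A minimal sketch of the check script that a Kubernetes CronJob could run periodically, assuming PySpark is not needed here and only the `hdfs` and `kubectl` CLIs plus the MLflow client are available in the job's container. The HDFS landing directory, state-marker files, registered model name, deployment name and `spark-submit` arguments are all placeholders; state markers are kept in local files for brevity, while a real CronJob would persist them externally (e.g. in HDFS).

```python
# Sketch of the CronJob-driven check (hypothetical paths/names throughout):
# 1) look for new train-data batches in HDFS and submit a training job,
# 2) look for a new registered model version and restart the serving deployment.
import subprocess
from mlflow.tracking import MlflowClient

HDFS_INCOMING = "hdfs:///data/incoming"          # assumed landing directory
PROCESSED_MARKER = "/state/last_processed.txt"   # assumed local state file
MODEL_NAME = "used-cars-price"                   # assumed registered model name
MODEL_MARKER = "/state/last_model_version.txt"   # assumed local state file


def read_marker(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except FileNotFoundError:
        return ""


def check_new_data():
    """Trigger training if an unseen batch directory appeared in HDFS."""
    ls = subprocess.run(["hdfs", "dfs", "-ls", "-C", HDFS_INCOMING],
                        capture_output=True, text=True, check=True)
    batches = sorted(ls.stdout.split())
    if batches and batches[-1] != read_marker(PROCESSED_MARKER):
        # spark-submit arguments are placeholders for the actual training job
        subprocess.run(["spark-submit", "--master", "k8s://https://kubernetes:443",
                        "train_job.py", "--input", batches[-1]], check=True)
        with open(PROCESSED_MARKER, "w") as f:
            f.write(batches[-1])


def check_new_model():
    """Restart the serving deployment if a newer model version was registered."""
    client = MlflowClient()  # MLFLOW_TRACKING_URI is assumed to be set in the pod
    versions = client.search_model_versions(f"name='{MODEL_NAME}'")
    latest = max((int(v.version) for v in versions), default=0)
    if latest and str(latest) != read_marker(MODEL_MARKER):
        # zero-downtime rolling restart of the serving app (see Notes below)
        subprocess.run(["kubectl", "rollout", "restart",
                        "deployment/serving-app"], check=True)
        with open(MODEL_MARKER, "w") as f:
            f.write(str(latest))


if __name__ == "__main__":
    check_new_data()
    check_new_model()
```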
It is assumed that there is a source that periodically dumps a new batch of training data to HDFS. The system should detect the appearance of such a batch and trigger a training workflow consisting of:
1. an ETL part that cleans the data and prepares features;
2. an ensemble part that trains two or more ML models (say, gradient boosting, SVM, linear/logistic regression);
3. a stacking part that trains an ML model (linear/logistic regression) on top of the ensemble and produces the final predictions.

Upon finishing the training and cross-validation of the models, the train set is exported to HDFS. The trained model should be registered with the MLflow Tracking service (as an artifact). The quality metrics (for both the ensemble models and the stacking model), the training and validation times, and the model parameters (for all models) should all be logged to the MLflow Tracking service as well.
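
Below is a minimal sketch of such a training workflow, assuming PySpark with MLflow. The input and output paths, the feature and label columns, and the registered model name (used-cars-price) are placeholders; for brevity only two base models from the list above (gradient boosting and linear regression) are shown, and only the stacking pipeline is registered, whereas a full solution would chain the base models and the meta-model into one pipeline so they can be applied together at serving time.

```python
# Sketch of ETL -> ensemble (with cross-validation) -> stacking -> MLflow logging.
# Paths, column names and the model name are placeholders.
import time
import mlflow
import mlflow.spark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor, LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("train-ensemble").getOrCreate()
train = spark.read.parquet("hdfs:///data/incoming/batch_001")    # placeholder path

# (1) ETL: assemble the (already cleaned) numeric columns into a feature vector.
feature_cols = ["horsepower", "mileage", "year"]                 # placeholder columns
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
prepared = assembler.transform(train).select("features", "price")


def fit_with_cv(estimator, grid, prediction_col):
    """Cross-validate one base model and return the fitted CrossValidatorModel."""
    evaluator = RegressionEvaluator(labelCol="price", predictionCol=prediction_col,
                                    metricName="rmse")
    cv = CrossValidator(estimator=estimator, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3, parallelism=2)
    return cv.fit(prepared)


with mlflow.start_run(run_name="ensemble-training"):
    t0 = time.time()

    # (2) Ensemble: two base models, each tuned with cross-validation.
    gbt = GBTRegressor(labelCol="price", predictionCol="gbt_pred")
    lr = LinearRegression(labelCol="price", predictionCol="lr_pred")
    gbt_cv = fit_with_cv(gbt, ParamGridBuilder().addGrid(gbt.maxDepth, [3, 5]).build(),
                         "gbt_pred")
    lr_cv = fit_with_cv(lr, ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.1]).build(),
                        "lr_pred")

    # (3) Stacking: base-model predictions become the features of a meta-model.
    with_base_preds = lr_cv.bestModel.transform(gbt_cv.bestModel.transform(prepared))
    meta_assembler = VectorAssembler(inputCols=["gbt_pred", "lr_pred"],
                                     outputCol="meta_features")
    meta = LinearRegression(featuresCol="meta_features", labelCol="price")
    stacking_model = Pipeline(stages=[meta_assembler, meta]).fit(with_base_preds)

    # Export the prepared train set and log everything to MLflow Tracking.
    prepared.write.mode("overwrite").parquet("hdfs:///data/exported/train_set")
    mlflow.log_metric("gbt_cv_rmse", min(gbt_cv.avgMetrics))
    mlflow.log_metric("lr_cv_rmse", min(lr_cv.avgMetrics))
    mlflow.log_metric("train_seconds", time.time() - t0)
    mlflow.log_param("feature_cols", ",".join(feature_cols))
    mlflow.spark.log_model(stacking_model, artifact_path="stacking-model",
                           registered_model_name="used-cars-price")
```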
There is also a model serving application that works constantly (even while training of a new model is in progress), starting once training of the very first model has completed. This application listens to incoming data from Kafka. The app filters incoming records and performs all the preparation required to represent a record in a form suitable for the model; then the model is applied to each record and the resulting predictions are written back to Kafka. It is known that, besides prediction requests, there may be incoming messages that contain the eventual price set by a manager for records whose predictions were generated earlier. This makes it possible to compute live metrics for the serving application (over a time-limited window).
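
A minimal sketch of the serving application and the live-metric computation, assuming Spark Structured Streaming, a Kafka broker at kafka:9092, and the topics requests, predictions, manager-prices and live-metrics (all placeholder names). It also assumes the registered artifact is a single PipelineModel that maps the raw request columns to a final "prediction" column, and that the "models:/used-cars-price/Production" URI, the record schemas and the checkpoint paths are likewise placeholders. For brevity the join is not bounded by watermarks; a real implementation would add them so the live metric covers only a limited time window, as the task requires.

```python
# Sketch of the streaming serving app plus live-metric computation.
# Broker address, topic names, schema fields and the model URI are placeholders.
import mlflow.spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, to_json, struct, col, abs as sql_abs
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("serve-model").getOrCreate()
model = mlflow.spark.load_model("models:/used-cars-price/Production")  # assumed URI

request_schema = StructType([
    StructField("request_id", StringType()),
    StructField("horsepower", DoubleType()),
    StructField("mileage", DoubleType()),
    StructField("year", DoubleType()),
])
price_schema = StructType([
    StructField("request_id", StringType()),
    StructField("manager_price", DoubleType()),
])


def read_topic(topic, schema):
    """Read a Kafka topic as a streaming DataFrame of parsed JSON records."""
    return (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "kafka:9092")
            .option("subscribe", topic)
            .load()
            .select(from_json(col("value").cast("string"), schema).alias("r"))
            .select("r.*")
            .dropna())  # filter out malformed / incomplete records


# Apply the model to each prediction request and write the result back to Kafka.
requests = read_topic("requests", request_schema)
predictions = model.transform(requests).select("request_id", "prediction")
(predictions
 .select(to_json(struct("request_id", "prediction")).alias("value"))
 .writeStream.format("kafka")
 .option("kafka.bootstrap.servers", "kafka:9092")
 .option("topic", "predictions")
 .option("checkpointLocation", "hdfs:///checkpoints/predictions")
 .start())

# Live metrics: join manager-set prices with the predictions made earlier and
# emit the absolute error per record (watermarks omitted in this sketch).
manager_prices = read_topic("manager-prices", price_schema)
errors = (predictions.join(manager_prices, "request_id")
          .select("request_id",
                  sql_abs(col("prediction") - col("manager_price")).alias("abs_error")))
(errors
 .select(to_json(struct("request_id", "abs_error")).alias("value"))
 .writeStream.format("kafka")
 .option("kafka.bootstrap.servers", "kafka:9092")
 .option("topic", "live-metrics")
 .option("checkpointLocation", "hdfs:///checkpoints/live-metrics")
 .start())

spark.streams.awaitAnyTermination()
```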
The serving application should be redeployed to serve with the new model each time a new model is registered in the MLflow Tracking service. However, there should be no downtime for the serving app, even during redeployment.
Notes:
- To train the ensemble models in parallel you may use either Spark's parallel capabilities (FIFO scheduler, submitting parallel jobs from multiple threads, etc.) or Apache Airflow (in the latter case, use HDFS for intermediate data).
- Take the dataset from Kaggle: https://www.kaggle.com/datasets/ananaymital/us-used-cars-dataset
- Split the dataset into train and test parts. The test part should be used for sending data to the inference application.
- Simulate the periodic appearance of new data by splitting the train set into many sub-parts (see the sketch after this list).
- Take a look at the ‘kubectl rollout restart’ command to redeploy without downtime.
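
One possible way to prepare the simulation data, assuming PySpark; the local Kaggle CSV file name, the HDFS target paths, the split ratio and the number of sub-batches are placeholders.

```python
# Sketch of the one-off data preparation: split the Kaggle dump into a held-out
# test part (to be replayed to the inference app) and N train sub-batches in HDFS.
# All paths and the batch count are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prepare-batches").getOrCreate()
raw = spark.read.csv("used_cars_data.csv", header=True, inferSchema=True)

# Hold out a test part for sending to the inference (serving) application.
train, test = raw.randomSplit([0.8, 0.2], seed=42)
test.write.mode("overwrite").parquet("hdfs:///data/test")

# Split the train part into N sub-batches to simulate periodic data arrival;
# the CronJob-driven check picks them up from hdfs:///data/incoming one by one.
n_batches = 10
for i, batch in enumerate(train.randomSplit([1.0] * n_batches, seed=42)):
    batch.write.mode("overwrite").parquet(f"hdfs:///data/incoming/batch_{i:03d}")
```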