This repository implements a production‑grade MLOps platform that automates the entire machine learning lifecycle:
- Feature Engineering with Feast (offline & online store)
- Experiment Tracking & Model Registry with MLflow
- Orchestration of Training Pipelines using Kubeflow Pipelines
- Advanced Drift Detection (KS‑test, JS divergence, PCA, window‑based statistics)
- Model Serving via KServe with GPU‑accelerated transformers
- Infrastructure as Code with Terraform (EKS, GPU nodes, networking)
- CI/CD for data validation, model training, deployment & rollback
The platform is cloud‑agnostic (demonstrated on AWS, but can be adapted to GCP/Azure) and designed for high scalability and cost efficiency.
graph TD
subgraph "Data Sources"
DS1[Batch Data Lake]
DS2[Streaming Events]
DS3[Feature Store Feast]
end
subgraph "Orchestration - Kubeflow Pipelines"
PIPE[Training Pipeline]
DRIFT[Drift Detection Pipeline]
end
subgraph "Experiment & Registry - MLflow"
TRACK[Experiment Tracking]
REG[Model Registry]
end
subgraph "Model Serving - KServe"
SERV[InferenceService]
TRANS[Transformer]
PRED[Predictor]
end
subgraph "Infrastructure - AWS EKS"
K8S[Kubernetes Cluster]
GPU[GPU Node Group]
CPU[CPU Node Group]
MON[Prometheus/Grafana]
end
subgraph "CI/CD - GitHub Actions"
CI[Data Validation]
CD[Deploy & Canary]
end
DS1 --> PIPE
DS2 --> DRIFT
DS3 --> PIPE
DS3 --> SERV
PIPE --> TRACK
PIPE --> REG
DRIFT --> PIPE
DRIFT --> REG
REG --> SERV
SERV --> TRANS
TRANS --> PRED
PRED --> USERS
K8S --> GPU
K8S --> CPU
MON --> K8S
CI --> PIPE
CD --> SERV
mlops-platform/
├── .github/workflows/                 # CI/CD pipelines
│   ├── ci-cd.yml                      # Main training & deployment workflow
│   └── retrain-trigger.yml            # Triggered on drift detection
├── kubernetes/                        # Kubernetes manifests
│   ├── kubeflow/                      # Kubeflow installation kustomize
│   ├── kserve/                        # KServe operator & inference service
│   ├── mlflow/                        # MLflow deployment & service
│   └── feast/                         # Feast feature server
├── pipelines/                         # Kubeflow pipeline definitions
│   ├── training_pipeline.py           # End‑to‑end training with drift check
│   ├── drift_detection_pipeline.py    # Periodic drift monitoring
│   └── components/                    # Reusable KFP components
│       ├── data_validation.py
│       ├── train_model.py
│       ├── drift_detector.py
│       └── deploy_model.py
├── model/                             # Model definition & config
│   ├── model.py
│   ├── train.py
│   └── config.yaml
├── serving/                           # KServe transformer & config
│   ├── transformer.py
│   └── inference_service.yaml
├── terraform/                         # Infrastructure as Code
│   ├── main.tf
│   ├── eks.tf
│   ├── networking.tf
│   ├── iam.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── terraform.tfvars
├── docker/                            # Dockerfiles for components
│   ├── Dockerfile.transformer
│   └── Dockerfile.predictor
├── requirements.txt                   # Python dependencies
└── README.md                          # You are here

- AWS account (with permissions for EKS, S3, IAM)
- terraform >= 1.6
- kubectl configured for the created EKS cluster
- docker for building images
- python 3.10 & pip
- GitHub repository secrets configured (AWS credentials, etc.)
cd terraform
terraform init
terraform plan -out plan.out
terraform apply plan.out

This will create:
- VPC with public/private subnets
- EKS cluster with GPU and CPU node groups
- Necessary IAM roles and S3 buckets
Install the required Kubernetes operators and tools:
# NVIDIA GPU operator (if not installed by terraform)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator
# KServe
kubectl apply -f kubernetes/kserve/kserve-install.yaml
# MLflow
kubectl apply -f kubernetes/mlflow/
# Feast (offline/online store)
kubectl apply -f kubernetes/feast/
# Kubeflow Pipelines (via kustomize)
kubectl apply -k kubernetes/kubeflow/

Configure the following secrets in your GitHub repository:
| Secret Name | Description |
|---|---|
| AWS_ACCESS_KEY_ID | AWS access key for S3 / ECR |
| AWS_SECRET_ACCESS_KEY | AWS secret key |
| AWS_ACCOUNT_ID | 12‑digit AWS account ID |
| ECR_REGISTRY | ECR registry URL |
| KUBEFLOW_ENDPOINT | Kubeflow Pipelines API endpoint |
| KUBEFLOW_USERNAME | Kubeflow user |
| KUBEFLOW_PASSWORD | Kubeflow password |
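A workflow step can fail fast when one of these secrets is missing. The sketch below is only an illustration, not code from the repository: the helper name `missing_secrets` is hypothetical, and it assumes the secrets are exposed to the job as environment variables with the same names.

```python
import os

# Secret names as listed in the table above.
REQUIRED_SECRETS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_ACCOUNT_ID",
    "ECR_REGISTRY",
    "KUBEFLOW_ENDPOINT",
    "KUBEFLOW_USERNAME",
    "KUBEFLOW_PASSWORD",
)


def missing_secrets(env=None):
    """Return the required secret names that are unset or empty in `env`.

    Hypothetical helper; `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {name: "dummy" for name in REQUIRED_SECRETS if name != "ECR_REGISTRY"}
print(missing_secrets(example_env))
```

A CI step could call `missing_secrets()` with no argument and exit with a non-zero status when the returned list is not empty.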
Push to main to trigger the full CI/CD pipeline.
Once Kubeflow is running, you can submit a pipeline directly:
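import kfp

# Assumed import path, based on pipelines/training_pipeline.py in the repository layout.
from pipelines.training_pipeline import training_pipeline

client = kfp.Client(host='https://your-kubeflow-endpoint')
client.create_run_from_pipeline_func(
    training_pipeline,
    arguments={'train_data_path': 's3://mlops-data/train/latest.parquet'}
)

The drift detection component uses multiple statistical tests to catch data drift early: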
- Kolmogorov–Smirnov test (KS) – compares cumulative distributions
- Anderson‑Darling test – sensitive to distribution tails
- Earth Mover’s Distance (Wasserstein) – geometric shift
- Jensen‑Shannon Divergence – information‑theoretic measure
- Window‑based analysis – detects transient drifts
- PCA projection comparison – high‑dimensional overview
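To make two of the listed measures concrete, here is a minimal, dependency-free sketch of the two-sample KS statistic (with an asymptotic p-value) and the Jensen-Shannon divergence. It is not the repository's `drift_detector.py`; the function names are hypothetical, inputs are assumed to be non-empty, and the p-value is an approximation. In practice a library routine such as `scipy.stats.ks_2samp` would likely be used instead.

```python
import math


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(ref[i], cur[j])
        # Step past every observation equal to x in both samples (handles ties).
        while i < n and ref[i] == x:
            i += 1
        while j < m and cur[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value from the Kolmogorov series (an approximation)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        return 1.0  # the series is ~1 in this range and converges slowly
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) of two discrete distributions."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


# Illustrative data: the current window is the reference shifted by 50.
reference = list(range(100))
current = [x + 50 for x in reference]
d = ks_statistic(reference, current)
print(d, ks_pvalue(d, len(reference), len(current)))
```

The KS test needs only the raw samples, while `js_divergence` expects two histograms over the same bins that each sum to 1, so binning choices affect its value.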
When the drift checks signal drift (for the hypothesis tests, a p‑value below the significance level alpha = 0.05), the pipeline automatically triggers retraining and, if configured, a canary deployment of the new model.
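The decision step can be sketched as follows, assuming per-feature p-values are already available. The feature names, the helper names, and the `drift-detected` event type are all hypothetical; the payload shape follows GitHub's repository dispatch API (`POST /repos/{owner}/{repo}/dispatches`), and the sketch only builds the request body without sending it.

```python
import json

ALPHA = 0.05  # significance level quoted in the text


def drifted_features(p_values, alpha=ALPHA):
    """Return the sorted names of features whose p-value falls below alpha."""
    return sorted(name for name, p in p_values.items() if p < alpha)


def build_dispatch_payload(p_values, alpha=ALPHA, event_type="drift-detected"):
    """Build a repository_dispatch request body, or None when nothing drifted.

    `event_type` is a hypothetical name; it has to match whatever the
    retraining workflow is configured to listen for.
    """
    drifted = drifted_features(p_values, alpha)
    if not drifted:
        return None
    return {
        "event_type": event_type,
        "client_payload": {"drifted_features": drifted, "alpha": alpha},
    }


# Made-up p-values for three made-up features.
example = {"age": 0.40, "income": 0.003, "session_length": 0.049}
print(json.dumps(build_dispatch_payload(example), indent=2))
```

Testing many features at alpha = 0.05 raises the chance of a false alarm on at least one of them, so the per-feature threshold may need tuning for wide feature sets.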
- Training uses TensorFlow MirroredStrategy across multiple GPUs.
- Inference runs on NVIDIA Triton Inference Server, deployed through KServe, with dynamic batching.
- NVIDIA GPU Operator automates installation and lifecycle management of the NVIDIA driver, container toolkit, and device plugin on GPU nodes.
- Multi‑Instance GPU (MIG) profiles available for cost‑efficient sharing.
The GitHub Actions workflow implements:
- Data validation with Great Expectations
- Automated model training on a self‑hosted GPU runner
- Model quality gates (accuracy, AUC) before promotion
- Docker image build & push to ECR
- KServe InferenceService rollout with canary traffic splitting
- Drift‑triggered retraining via repository dispatch
- Rollback command for instant revert
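The quality-gate and canary steps above might look roughly like the sketch below. It is an assumption about the shape of the logic, not the workflow's actual code: the threshold values and helper names are hypothetical. `canaryTrafficPercent` is the KServe InferenceService predictor field that controls the share of traffic sent to the newest revision.

```python
import math

# Hypothetical minimum scores; the real gates live in the workflow configuration.
QUALITY_GATES = {"accuracy": 0.90, "auc": 0.85}


def passes_quality_gates(metrics, gates=QUALITY_GATES):
    """True only if every gated metric is present, finite, and meets its minimum."""
    for name, minimum in gates.items():
        value = metrics.get(name)
        if value is None or math.isnan(value) or value < minimum:
            return False
    return True


def canary_patch(percent):
    """Build a merge patch that sends `percent` of traffic to the latest revision."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return {"spec": {"predictor": {"canaryTrafficPercent": percent}}}


candidate = {"accuracy": 0.93, "auc": 0.91}  # made-up evaluation results
if passes_quality_gates(candidate):
    print(canary_patch(10))
```

Setting the percentage to 100 promotes the new revision fully, while 0 keeps all traffic on the previous one, which is one way a rollback could be expressed.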
The platform integrates with Prometheus and Grafana for:
- Model metrics: latency (p50, p99), error rate, throughput
- Data drift scores over time
- GPU utilisation, memory per node and per model
- Autoscaling events on KServe
Pre‑built dashboards and alerting rules are located in prometheus/.
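For readers unfamiliar with the model metrics listed above, the sketch below computes them from raw request data using the nearest-rank percentile definition. The helper names and sample values are hypothetical; in a Prometheus setup these figures would normally be derived from histogram and counter metrics (for example with PromQL's `histogram_quantile`) rather than from raw samples.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(latencies_ms, error_count, window_seconds):
    """Summarize one observation window of requests."""
    total = len(latencies_ms)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "error_rate": error_count / total,
        "throughput_rps": total / window_seconds,
    }


# Made-up window: 100 requests in 10 seconds, 2 of them failed.
print(summarize(list(range(1, 101)), error_count=2, window_seconds=10))
```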
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
Your Name – @Linkedin – official.parvizi@gmail.com
Project Link: https://github.com/Awrsha/mlops-platform
