🧠 MLOps Platform

End‑to‑end Machine Learning pipelines with Kubeflow, KServe, MLflow & Feast

📖 Overview

This repository implements a production‑grade MLOps platform that automates the entire machine learning lifecycle:

Feature Engineering with Feast (offline & online store)
Experiment Tracking & Model Registry with MLflow
Orchestration of Training Pipelines using Kubeflow Pipelines
Advanced Drift Detection (KS‑test, JS divergence, PCA, window‑based statistics)
Model Serving via KServe with GPU‑accelerated transformers
Infrastructure as Code with Terraform (EKS, GPU nodes, networking)
CI/CD for data validation, model training, deployment & rollback

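The platform is demonstrated on AWS. The Kubernetes components (Kubeflow, KServe, MLflow, Feast) can be adapted to GCP or Azure, while the Terraform configuration targets EKS and would need an equivalent for other clouds. The design goals are high scalability and cost efficiency.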

🏗️ Architecture

graph TD
    subgraph "Data Sources"
        DS1[Batch Data Lake]
        DS2[Streaming Events]
        DS3[Feature Store Feast]
    end

    subgraph "Orchestration - Kubeflow Pipelines"
        PIPE[Training Pipeline]
        DRIFT[Drift Detection Pipeline]
    end

    subgraph "Experiment & Registry - MLflow"
        TRACK[Experiment Tracking]
        REG[Model Registry]
    end

    subgraph "Model Serving - KServe"
        SERV[InferenceService]
        TRANS[Transformer]
        PRED[Predictor]
    end

    subgraph "Infrastructure - AWS EKS"
        K8S[Kubernetes Cluster]
        GPU[GPU Node Group]
        CPU[CPU Node Group]
        MON[Prometheus/Grafana]
    end

    subgraph "CI/CD - GitHub Actions"
        CI[Data Validation]
        CD[Deploy & Canary]
    end

    DS1 --> PIPE
    DS2 --> DRIFT
    DS3 --> PIPE
    DS3 --> SERV

    PIPE --> TRACK
    PIPE --> REG
    DRIFT --> PIPE
    DRIFT --> REG

    REG --> SERV
    SERV --> TRANS
    TRANS --> PRED
    PRED --> USERS

    K8S --> GPU
    K8S --> CPU
    MON --> K8S
    CI --> PIPE
    CD --> SERV

📁 Repository Structure

mlops-platform/
├── .github/workflows/             # CI/CD pipelines
│   ├── ci-cd.yml                  # Main training & deployment workflow
│   └── retrain-trigger.yml        # Triggered on drift detection
├── kubernetes/                    # Kubernetes manifests
│   ├── kubeflow/                  # Kubeflow installation kustomize
│   ├── kserve/                    # KServe operator & inference service
│   ├── mlflow/                    # MLflow deployment & service
│   └── feast/                     # Feast feature server
├── pipelines/                     # Kubeflow pipeline definitions
│   ├── training_pipeline.py       # End‑to‑end training with drift check
│   ├── drift_detection_pipeline.py # Periodic drift monitoring
│   └── components/                # Reusable KFP components
│       ├── data_validation.py
│       ├── train_model.py
│       ├── drift_detector.py
│       └── deploy_model.py
├── model/                         # Model definition & config
│   ├── model.py
│   ├── train.py
│   └── config.yaml
├── serving/                       # KServe transformer & config
│   ├── transformer.py
│   └── inference_service.yaml
├── terraform/                     # Infrastructure as Code
│   ├── main.tf
│   ├── eks.tf
│   ├── networking.tf
│   ├── iam.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── terraform.tfvars
├── docker/                        # Dockerfiles for components
│   ├── Dockerfile.transformer
│   └── Dockerfile.predictor
├── prometheus/                    # Dashboards & alerting rules
├── requirements.txt               # Python dependencies
└── README.md                      # You are here

🚀 Getting Started

Prerequisites

AWS account (with permissions for EKS, S3, IAM)
terraform >= 1.6
kubectl (configure it for the EKS cluster once step 1 has created it)
docker for building images
python 3.10 & pip
GitHub repository secrets configured (AWS credentials, etc.)

1. Provision Infrastructure

cd terraform
terraform init
terraform plan -out plan.out
terraform apply plan.out

This will create:

VPC with public/private subnets
EKS cluster with GPU and CPU node groups
Necessary IAM roles and S3 buckets

2. Install Cluster Services

Install the required Kubernetes operators and tools:

# NVIDIA GPU operator (if not installed by terraform)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator

# KServe
kubectl apply -f kubernetes/kserve/kserve-install.yaml

# MLflow
kubectl apply -f kubernetes/mlflow/

# Feast (offline/online store)
kubectl apply -f kubernetes/feast/

# Kubeflow Pipelines (via kustomize)
kubectl apply -k kubernetes/kubeflow/

3. Set Up CI/CD

Configure the following secrets in your GitHub repository:

| Secret Name | Description |
| --- | --- |
| `AWS_ACCESS_KEY_ID` | AWS access key for S3 / ECR |
| `AWS_SECRET_ACCESS_KEY` | AWS secret key |
| `AWS_ACCOUNT_ID` | 12‑digit AWS account ID |
| `ECR_REGISTRY` | ECR registry URL |
| `KUBEFLOW_ENDPOINT` | Kubeflow Pipelines API endpoint |
| `KUBEFLOW_USERNAME` | Kubeflow user |
| `KUBEFLOW_PASSWORD` | Kubeflow password |

Push to main to trigger the full CI/CD pipeline.

4. Manual Run of a Training Pipeline

Once Kubeflow is running, you can submit a pipeline directly:

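import kfp
# Assumes the repository root is on PYTHONPATH and that
# pipelines/training_pipeline.py exposes a function named training_pipeline.
from pipelines.training_pipeline import training_pipeline

# Replace the host with your Kubeflow Pipelines API endpoint.
client = kfp.Client(host='https://your-kubeflow-endpoint')
client.create_run_from_pipeline_func(
    training_pipeline,
    arguments={'train_data_path': 's3://mlops-data/train/latest.parquet'},
)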

🧪 Key Features Explained

🔍 Drift Detection

The drift detection component uses multiple statistical tests to catch data drift early:

Kolmogorov–Smirnov test (KS) – compares cumulative distributions
Anderson‑Darling test – sensitive to distribution tails
Earth Mover’s Distance (Wasserstein) – geometric shift
Jensen‑Shannon Divergence – information‑theoretic measure
Window‑based analysis – detects transient drifts
PCA projection comparison – high‑dimensional overview
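
As a rough illustration of two of the tests listed above, the sketch below computes a two-sample KS statistic with an asymptotic p-value and a histogram-based Jensen-Shannon divergence, using only the standard library. It is not the code in pipelines/components/drift_detector.py: the function names, the bin count, and the 0.05 cut-off in the example are assumptions, and a real component would more likely rely on SciPy's implementations.

```python
import math
from bisect import bisect_right


def ks_two_sample(reference, current):
    """Return (statistic, asymptotic p-value) for a two-sample KS test."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    # The statistic is the largest gap between the two empirical CDFs.
    d = max(
        abs(bisect_right(ref, x) / n - bisect_right(cur, x) / m)
        for x in ref + cur
    )
    # Asymptotic Kolmogorov distribution with a small-sample correction;
    # an approximation that is adequate for reasonably large samples.
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.05:  # the truncated series is unreliable this close to zero
        return d, 1.0
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(1.0, max(0.0, 2.0 * series))


def js_divergence(reference, current, bins=10):
    """Jensen-Shannon divergence (base 2, in [0, 1]) between two histograms."""
    lo = min(min(reference), min(current))
    hi = max(max(reference), max(current))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant data

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(values) for c in counts]

    p, q = histogram(reference), histogram(current)
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Bins with zero mass in `a` contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


# Synthetic example: the current window is the reference shifted by 0.5.
reference = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in reference]
stat, p_value = ks_two_sample(reference, shifted)
drift_detected = p_value < 0.05  # assumed significance level
```

The KS test yields a p-value that can be compared with a significance level, whereas the JS divergence is a bounded score that needs its own cut-off, which is one reason to track both.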

When drift is detected (for the hypothesis tests, a p‑value below the significance level alpha = 0.05), the pipeline automatically triggers retraining and, if configured, a canary deployment of the new model.
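
One way the drift pipeline could hand off to the retrain-trigger.yml workflow is GitHub's repository dispatch API. The sketch below only builds the HTTP request and does not send it. The event name `drift-detected`, the payload fields, and the helper name are hypothetical; the endpoint and headers follow GitHub's REST API for creating a repository dispatch event.

```python
import json
import urllib.request


def build_retrain_dispatch(owner, repo, token, drift_report):
    """Build (but do not send) a GitHub repository_dispatch request."""
    payload = {
        # Hypothetical event name; it must match the `types` filter of the
        # `repository_dispatch` trigger in the receiving workflow.
        "event_type": "drift-detected",
        "client_payload": drift_report,
    }
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/dispatches",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_retrain_dispatch(
    owner="Awrsha",
    repo="mlops-platform",
    token="<token>",  # placeholder; read a real token from a secret store
    drift_report={"feature": "example_feature", "p_value": 0.01},
)
# urllib.request.urlopen(request) would send it; GitHub answers 204 on success.
```

Keeping request construction separate from sending makes the payload easy to inspect in a unit test without network access.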

⚡ GPU‑Optimised Training & Inference

Training uses TensorFlow MirroredStrategy across multiple GPUs.
Inference runs on NVIDIA Triton Inference Server, deployed through KServe, with dynamic batching.
NVIDIA GPU Operator installs and manages the drivers, CUDA components, and device plugin on the GPU nodes.
Multi‑Instance GPU (MIG) profiles available for cost‑efficient sharing.

🔁 CI/CD & GitOps

The GitHub Actions workflow implements:

Data validation with Great Expectations
Automated model training on a self‑hosted GPU runner
Model quality gates (accuracy, AUC) before promotion
Docker image build & push to ECR
KServe InferenceService rollout with canary traffic splitting
Drift‑triggered retraining via repository dispatch
Rollback command for instant revert
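
The quality gate step in the list above can be pictured as a simple threshold check that runs before promotion. This is a minimal sketch, not the workflow's actual logic: the function name, the metric keys, and the threshold values are assumptions chosen for illustration.

```python
def failed_gates(metrics, thresholds):
    """Return the gates a candidate model fails, as {name: (value, minimum)}."""
    failures = {}
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        # A missing metric counts as a failure rather than a silent pass.
        if value is None or value < minimum:
            failures[name] = (value, minimum)
    return failures


# Hypothetical thresholds; real values belong in configuration.
THRESHOLDS = {"accuracy": 0.90, "auc": 0.85}

candidate = {"accuracy": 0.93, "auc": 0.81}
failures = failed_gates(candidate, THRESHOLDS)
promote = not failures  # promote only when every gate passes
```

In a CI job, a non-empty result would typically end the step with a non-zero exit code so that the image build and rollout steps do not run.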

📊 Monitoring & Observability

The platform integrates with Prometheus and Grafana for:

Model metrics: latency (p50, p99), error rate, throughput
Data drift scores over time
GPU utilisation, memory per node and per model
Autoscaling events on KServe

Pre‑built dashboards and alerting rules are located in prometheus/.
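
For readers unfamiliar with the latency percentiles listed above, the sketch below computes p50 and p99 from raw samples with the nearest-rank method. The sample values and the helper name are made up for illustration. Prometheus usually estimates quantiles from histogram buckets rather than raw samples, so dashboard values are approximations of what this computes.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct percent
    of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Ten request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 14]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

The outlier barely moves p50 but fully determines p99, which is why tail percentiles are tracked alongside the median.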

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📜 License

Distributed under the MIT License. See LICENSE for more information.

📧 Contact

Your Name – @Linkedin – official.parvizi@gmail.com

Project Link: https://github.com/Awrsha/mlops-platform
