Awrsha/MLOps-Platform

🧠 MLOps Platform

End‑to‑end Machine Learning pipelines with Kubeflow, KServe, MLflow & Feast

CI/CD · License · Python 3.10 · Kubernetes 1.28 · Terraform


📖 Overview

This repository implements a production‑grade MLOps platform that automates the entire machine learning lifecycle:

  • Feature Engineering with Feast (offline & online store)
  • Experiment Tracking & Model Registry with MLflow
  • Orchestration of Training Pipelines using Kubeflow Pipelines
  • Advanced Drift Detection (KS‑test, JS divergence, PCA, window‑based statistics)
  • Model Serving via KServe with GPU‑accelerated transformers
  • Infrastructure as Code with Terraform (EKS, GPU nodes, networking)
  • CI/CD for data validation, model training, deployment & rollback

The platform is demonstrated on AWS: the Terraform code targets EKS, while the Kubernetes-level components (Kubeflow, KServe, MLflow, Feast) can be adapted to GCP or Azure. It is designed for scalability and cost efficiency.


🏗️ Architecture

graph TD
    subgraph "Data Sources"
        DS1[Batch Data Lake]
        DS2[Streaming Events]
        DS3[Feature Store Feast]
    end

    subgraph "Orchestration - Kubeflow Pipelines"
        PIPE[Training Pipeline]
        DRIFT[Drift Detection Pipeline]
    end

    subgraph "Experiment & Registry - MLflow"
        TRACK[Experiment Tracking]
        REG[Model Registry]
    end

    subgraph "Model Serving - KServe"
        SERV[InferenceService]
        TRANS[Transformer]
        PRED[Predictor]
    end

    subgraph "Infrastructure - AWS EKS"
        K8S[Kubernetes Cluster]
        GPU[GPU Node Group]
        CPU[CPU Node Group]
        MON[Prometheus/Grafana]
    end

    subgraph "CI/CD - GitHub Actions"
        CI[Data Validation]
        CD[Deploy & Canary]
    end

    DS1 --> PIPE
    DS2 --> DRIFT
    DS3 --> PIPE
    DS3 --> SERV

    PIPE --> TRACK
    PIPE --> REG
    DRIFT --> PIPE
    DRIFT --> REG

    REG --> SERV
    SERV --> TRANS
    TRANS --> PRED
    PRED --> USERS

    K8S --> GPU
    K8S --> CPU
    MON --> K8S
    CI --> PIPE
    CD --> SERV

📁 Repository Structure

mlops-platform/
├── .github/workflows/             # CI/CD pipelines
│   ├── ci-cd.yml                  # Main training & deployment workflow
│   └── retrain-trigger.yml        # Triggered on drift detection
├── kubernetes/                    # Kubernetes manifests
│   ├── kubeflow/                  # Kubeflow installation kustomize
│   ├── kserve/                    # KServe operator & inference service
│   ├── mlflow/                    # MLflow deployment & service
│   └── feast/                     # Feast feature server
├── pipelines/                     # Kubeflow pipeline definitions
│   ├── training_pipeline.py       # End‑to‑end training with drift check
│   ├── drift_detection_pipeline.py  # Periodic drift monitoring
│   └── components/                # Reusable KFP components
│       ├── data_validation.py
│       ├── train_model.py
│       ├── drift_detector.py
│       └── deploy_model.py
├── model/                         # Model definition & config
│   ├── model.py
│   ├── train.py
│   └── config.yaml
├── serving/                       # KServe transformer & config
│   ├── transformer.py
│   └── inference_service.yaml
├── terraform/                     # Infrastructure as Code
│   ├── main.tf
│   ├── eks.tf
│   ├── networking.tf
│   ├── iam.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── terraform.tfvars
├── docker/                        # Dockerfiles for components
│   ├── Dockerfile.transformer
│   └── Dockerfile.predictor
├── requirements.txt               # Python dependencies
└── README.md                      # You are here

🚀 Getting Started

Prerequisites

  • AWS account (with permissions for EKS, S3, IAM)
  • terraform >= 1.6
  • kubectl configured for the created EKS cluster
  • docker for building images
  • python 3.10 & pip
  • GitHub repository secrets configured (AWS credentials, etc.)

1. Provision Infrastructure

cd terraform
terraform init
terraform plan -out plan.out
terraform apply plan.out

This will create:

  • VPC with public/private subnets
  • EKS cluster with GPU and CPU node groups
  • Necessary IAM roles and S3 buckets

2. Install Cluster Services

Install the required Kubernetes operators and tools:

# NVIDIA GPU operator (if not installed by terraform)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator

# KServe
kubectl apply -f kubernetes/kserve/kserve-install.yaml

# MLflow
kubectl apply -f kubernetes/mlflow/

# Feast (offline/online store)
kubectl apply -f kubernetes/feast/

# Kubeflow Pipelines (via kustomize)
kubectl apply -k kubernetes/kubeflow/

3. Set Up CI/CD

Configure the following secrets in your GitHub repository:

Secret Name              Description
AWS_ACCESS_KEY_ID        AWS access key for S3 / ECR
AWS_SECRET_ACCESS_KEY    AWS secret key
AWS_ACCOUNT_ID           12‑digit AWS account ID
ECR_REGISTRY             ECR registry URL
KUBEFLOW_ENDPOINT        Kubeflow Pipelines API endpoint
KUBEFLOW_USERNAME        Kubeflow user
KUBEFLOW_PASSWORD        Kubeflow password

Push to main to trigger the full CI/CD pipeline.
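Before pushing, it can help to confirm that every secret listed above is actually available to the workflow. The following is a minimal sketch, assuming the secrets are exposed to a job step as environment variables of the same name; the helper names `missing_secrets` and `account_id_looks_valid` are hypothetical and not part of this repository.

```python
import os

# Secret names copied from the table above; the check itself is illustrative.
REQUIRED_SECRETS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_ACCOUNT_ID",
    "ECR_REGISTRY",
    "KUBEFLOW_ENDPOINT",
    "KUBEFLOW_USERNAME",
    "KUBEFLOW_PASSWORD",
)


def missing_secrets(env=None):
    """Return the required names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


def account_id_looks_valid(value):
    """AWS account IDs are exactly 12 decimal digits."""
    return len(value) == 12 and value.isascii() and value.isdigit()


if __name__ == "__main__":
    missing = missing_secrets()
    if missing:
        raise SystemExit("Missing secrets: " + ", ".join(missing))
    print("All required secrets are set.")
```

Running this as an early workflow step would fail the job with a readable message instead of a later, harder-to-diagnose authentication error.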

4. Manual Run of a Training Pipeline

Once Kubeflow is running, you can submit a pipeline directly:

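import kfp
# Assumption: the pipeline function lives in pipelines/training_pipeline.py
# and this script is run from the repository root.
from pipelines.training_pipeline import training_pipeline

client = kfp.Client(host='https://your-kubeflow-endpoint')
# Submit the pipeline function as a one-off run.
client.create_run_from_pipeline_func(
    training_pipeline,
    arguments={'train_data_path': 's3://mlops-data/train/latest.parquet'}
)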

🧪 Key Features Explained

🔍 Drift Detection

The drift detection component uses multiple statistical tests to catch data drift early:

  • Kolmogorov–Smirnov test (KS) – compares cumulative distributions
  • Anderson‑Darling test – sensitive to distribution tails
  • Earth Mover’s Distance (Wasserstein) – geometric shift
  • Jensen‑Shannon Divergence – information‑theoretic measure
  • Window‑based analysis – detects transient drifts
  • PCA projection comparison – high‑dimensional overview
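To make two of the measures above concrete, here is a standard-library sketch of the two-sample Kolmogorov–Smirnov test and the Jensen‑Shannon divergence. It is not the code in `drift_detector.py`: the function names are hypothetical, the p-value uses the asymptotic Kolmogorov distribution with a common small-sample correction, and the divergence assumes both histograms were counted over the same bin edges. In practice a library routine such as `scipy.stats.ks_2samp` would normally be used instead.

```python
import math
from bisect import bisect_right


def ks_two_sample(reference, current):
    """Return the two-sample KS statistic D and an asymptotic p-value."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    # D is the largest gap between the two empirical CDFs. The CDFs only
    # change at observed values, so checking those points is sufficient.
    d = max(
        abs(bisect_right(ref, x) / n - bisect_right(cur, x) / m)
        for x in ref + cur
    )
    # Asymptotic approximation with a small-sample correction term.
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        # The series below converges slowly here; the true value is ~1.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
        for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence in bits (range 0 to 1) of two histograms."""
    p_total, q_total = sum(p_counts), sum(q_counts)
    p = [x / p_total for x in p_counts]
    q = [x / q_total for x in q_counts]
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 whenever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
```

The KS test yields a p-value that can be compared with a significance level, whereas the Jensen‑Shannon divergence is a bounded distance, so it needs its own threshold chosen from historical data.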

When a test detects drift (for the hypothesis tests, a p-value below the significance level alpha = 0.05), the pipeline automatically triggers retraining and, if configured, a canary deployment of the new model.
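The retraining rule can be sketched as a small decision function. This is an illustration under stated assumptions rather than the pipeline's actual logic: it takes one p-value per feature, and the `min_drifted` parameter and the `DriftDecision` type are hypothetical additions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftDecision:
    drifted_features: tuple
    retrain: bool


def decide_retraining(p_values, alpha=0.05, min_drifted=1):
    """Flag features whose p-value is below alpha and decide on retraining.

    p_values maps a feature name to the p-value of its drift test.
    """
    drifted = tuple(sorted(name for name, p in p_values.items() if p < alpha))
    return DriftDecision(drifted, len(drifted) >= min_drifted)


if __name__ == "__main__":
    decision = decide_retraining({"age": 0.40, "income": 0.01})
    print(decision.drifted_features, decision.retrain)
```

Note the direction of the comparison: a smaller p-value means stronger evidence of drift, so a feature is flagged when its p-value falls below alpha. Distance-style scores such as Wasserstein or Jensen‑Shannon work the other way round and are flagged when they rise above a chosen threshold.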

⚡ GPU‑Optimised Training & Inference

  • Training uses TensorFlow MirroredStrategy across multiple GPUs.
  • Inference uses NVIDIA Triton Inference Server, run as a KServe serving runtime, with dynamic batching.
  • NVIDIA GPU Operator manages the installation and lifecycle of the GPU driver, container toolkit, and device plugin on each node.
  • Multi‑Instance GPU (MIG) profiles available for cost‑efficient sharing.
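Dynamic batching, mentioned above, groups individual inference requests into one GPU call. The sketch below is a simplified, deterministic simulation of the idea and not Triton's scheduler: the function name and the `max_batch_size` and `max_queue_delay` parameters are hypothetical stand-ins for the corresponding server-side settings.

```python
def form_batches(arrivals, max_batch_size=4, max_queue_delay=0.005):
    """Group (arrival_time_seconds, request_id) pairs into batches.

    A batch is closed when it is full, or when the next request arrives
    more than max_queue_delay seconds after the batch's first request.
    """
    batches, current, opened_at = [], [], None
    for arrival_time, request_id in sorted(arrivals):
        # Close a waiting batch whose delay budget has run out.
        if current and arrival_time - opened_at > max_queue_delay:
            batches.append(current)
            current = []
        if not current:
            opened_at = arrival_time
        current.append(request_id)
        # Close the batch as soon as it is full.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    requests = [(0.000, "a"), (0.001, "b"), (0.002, "c"),
                (0.003, "d"), (0.004, "e"), (0.020, "f")]
    print(form_batches(requests))
```

The two parameters express the usual trade-off: a larger batch size improves GPU throughput, while a shorter queue delay bounds the latency added to each request.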

🔁 CI/CD & GitOps

The GitHub Actions workflow implements:

  • Data validation with Great Expectations
  • Automated model training on a self‑hosted GPU runner
  • Model quality gates (accuracy, AUC) before promotion
  • Docker image build & push to ECR
  • KServe InferenceService rollout with canary traffic splitting
  • Drift‑triggered retraining via repository dispatch
  • Rollback command for instant revert
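Drift-triggered retraining via repository dispatch can be sketched with the standard library alone. The function below builds, but does not send, a request to GitHub's `POST /repos/{owner}/{repo}/dispatches` endpoint. The event name `drift-detected` and the payload fields are assumptions for illustration; the names that `retrain-trigger.yml` actually listens for are not shown in this README.

```python
import json
import urllib.request


def build_dispatch_request(owner, repo, token, event_type, payload):
    """Build a repository_dispatch request without sending it."""
    body = json.dumps({"event_type": event_type, "client_payload": payload})
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/dispatches",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_dispatch_request(
        owner="Awrsha",
        repo="mlops-platform",
        token="<token with access to the repository>",
        event_type="drift-detected",          # hypothetical event name
        payload={"feature": "income", "p_value": 0.01},  # hypothetical fields
    )
    print(request.get_method(), request.full_url)
    # To send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the trigger easy to unit test without network access or a real token.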

📊 Monitoring & Observability

The platform integrates with Prometheus and Grafana for:

  • Model metrics: latency (p50, p99), error rate, throughput
  • Data drift scores over time
  • GPU utilisation, memory per node and per model
  • Autoscaling events on KServe

Pre‑built dashboards and alerting rules are located in prometheus/.
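As an illustration of the latency metrics listed above, the sketch below computes p50 and p99 from raw samples using the nearest-rank definition. It is an assumption-laden example, not the platform's implementation: Prometheus typically estimates quantiles from histogram buckets rather than raw samples, and the function names here are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


def summarise_requests(latencies_ms, error_count):
    """Summarise one scrape window; one latency sample per request."""
    total = len(latencies_ms)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "error_rate": error_count / total,
    }


if __name__ == "__main__":
    print(summarise_requests(list(range(1, 101)), error_count=2))
```

The gap between p50 and p99 is often more informative than either number alone, since a stable median can hide a slow tail of requests.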


🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📜 License

Distributed under the MIT License. See LICENSE for more information.


📧 Contact

Your Name · LinkedIn · official.parvizi@gmail.com

Project Link: https://github.com/Awrsha/mlops-platform
