# Kubernetes ML Infrastructure

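This notebook covers production ML infrastructure on Kubernetes: model serving, pipeline orchestration, GPU scheduling, auto-scaling, and packaging. Each cell models the relevant resource as a plain Python class that emits the corresponding YAML or JSON, so the examples run without a cluster.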

## Topics Covered
1. **KServe** - Serverless model inference on Kubernetes
2. **Ray Serve** - Distributed model serving
3. **Kubeflow Pipelines** - ML workflow orchestration
4. **GPU Scheduling** - Resource management for ML workloads
5. **Auto-scaling** - Scaling patterns for ML services
6. **Helm Charts** - Packaging ML deployments

In [None]:
import torch
import torch.nn as nn
import numpy as np
from typing import Dict, List, Any, Optional, Tuple
from dataclasses import dataclass, field
from enum import Enum
from abc import ABC, abstractmethod
import json
import time
import yaml
from datetime import datetime
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

## 1. KServe - Serverless Model Inference

KServe provides serverless inference on Kubernetes with auto-scaling, canary deployments, and multi-framework support.

In [None]:
@dataclass
class KServeInferenceService:
    """
    Represents a KServe InferenceService configuration.
    """
    name: str
    namespace: str
    framework: str  # pytorch, tensorflow, sklearn, xgboost, custom
    storage_uri: str
    runtime_version: str = "latest"
    
    # Resource configuration
    min_replicas: int = 1
    max_replicas: int = 10
    target_utilization: int = 70
    
    # Container resources
    cpu_request: str = "1"
    cpu_limit: str = "2"
    memory_request: str = "2Gi"
    memory_limit: str = "4Gi"
    gpu_count: int = 0
    
    # Canary configuration
    canary_traffic_percent: int = 0
    
    def to_yaml(self) -> str:
        """Generate KServe InferenceService YAML"""
        spec = {
            "apiVersion": "serving.kserve.io/v1beta1",
            "kind": "InferenceService",
            "metadata": {
                "name": self.name,
                "namespace": self.namespace,
                "annotations": {
                    "autoscaling.knative.dev/minScale": str(self.min_replicas),
                    "autoscaling.knative.dev/maxScale": str(self.max_replicas),
                    # "target" alone sets the per-replica metric target (concurrency by default);
                    # the utilization percentage has its own annotation.
                    "autoscaling.knative.dev/target-utilization-percentage": str(self.target_utilization)
                }
            },
            "spec": {
                "predictor": {
                    self.framework: {
                        "storageUri": self.storage_uri,
                        "runtimeVersion": self.runtime_version,
                        "resources": {
                            "requests": {
                                "cpu": self.cpu_request,
                                "memory": self.memory_request
                            },
                            "limits": {
                                "cpu": self.cpu_limit,
                                "memory": self.memory_limit
                            }
                        }
                    }
                }
            }
        }
        
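        # Add GPU if requested
        if self.gpu_count > 0:
            spec["spec"]["predictor"][self.framework]["resources"]["limits"]["nvidia.com/gpu"] = str(self.gpu_count)
        
        # Apply the canary split, which was previously accepted but never emitted
        if self.canary_traffic_percent > 0:
            spec["spec"]["predictor"]["canaryTrafficPercent"] = self.canary_traffic_percent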
        
        return yaml.dump(spec, default_flow_style=False)


@dataclass
class KServeTransformer:
    """
    Pre/post processing transformer for KServe.
    """
    name: str
    image: str
    
    def to_spec(self) -> Dict[str, Any]:
        return {
            "transformer": {
                "containers": [{
                    "name": self.name,
                    "image": self.image,
                    "resources": {
                        "requests": {"cpu": "100m", "memory": "256Mi"},
                        "limits": {"cpu": "500m", "memory": "512Mi"}
                    }
                }]
            }
        }


# Example: Create KServe InferenceService
pytorch_service = KServeInferenceService(
    name="bert-classifier",
    namespace="ml-production",
    framework="pytorch",
    storage_uri="s3://models/bert-classifier/v1",
    min_replicas=2,
    max_replicas=20,
    gpu_count=1,
    memory_request="4Gi",
    memory_limit="8Gi"
)

print("KServe InferenceService YAML:")
print(pytorch_service.to_yaml())
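
Once an InferenceService like the one above is ready, clients usually call it over HTTP. The sketch below only assembles a request for KServe's V1 inference protocol (`POST /v1/models/<name>:predict` with an `instances` list) and sends nothing. The hostname and the input schema are hypothetical placeholders, and whether a given runtime accepts this exact payload depends on how the model handler is written.

```python
import json
from typing import Any, Dict, List


def build_v1_predict_request(host: str, model_name: str, instances: List[Any]) -> Dict[str, Any]:
    """Assemble the URL and JSON body for a V1-protocol predict call (no network I/O)."""
    return {
        "method": "POST",
        "url": f"http://{host}/v1/models/{model_name}:predict",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"instances": instances}),
    }


request = build_v1_predict_request(
    host="bert-classifier.ml-production.example.com",  # hypothetical hostname
    model_name="bert-classifier",
    instances=[{"text": "great product"}],  # hypothetical input schema
)
print(request["url"])
print(request["body"])
```

Keeping request construction separate from transport makes the payload easy to unit test before any cluster exists.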

In [None]:
class KServeCanaryDeployment:
    """
    Manages canary deployments for KServe InferenceServices.
    """
    
    def __init__(
        self,
        service_name: str,
        namespace: str = "default"
    ):
        self.service_name = service_name
        self.namespace = namespace
        self.deployment_history: List[Dict] = []
    
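    def create_canary_spec(
        self,
        default_storage_uri: str,
        canary_storage_uri: str,
        canary_traffic_percent: int = 10,
        framework: str = "pytorch"
    ) -> Dict[str, Any]:
        """
        Create a canary deployment specification.
        
        In the v1beta1 API a canary is rolled out by updating the predictor in
        place: the spec carries the new storage URI plus canaryTrafficPercent,
        and KServe keeps the remaining traffic on the previously rolled-out
        revision. default_storage_uri is therefore recorded in
        deployment_history only and does not appear in the spec.
        """
        self.deployment_history.append({
            "stable_storage_uri": default_storage_uri,
            "canary_storage_uri": canary_storage_uri,
            "canary_traffic_percent": canary_traffic_percent
        })
        return {
            "apiVersion": "serving.kserve.io/v1beta1",
            "kind": "InferenceService",
            "metadata": {
                "name": self.service_name,
                "namespace": self.namespace
            },
            "spec": {
                "predictor": {
                    "canaryTrafficPercent": canary_traffic_percent,
                    framework: {
                        "storageUri": canary_storage_uri
                    }
                }
            }
        }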
    
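    def progressive_rollout(
        self,
        stages: Optional[List[int]] = None
    ) -> List[Dict[str, Any]]:
        """
        Generate progressive rollout stages.
        Each stage increases canary traffic.
        """
        # Avoid a mutable default argument, which would be shared across calls
        if stages is None:
            stages = [10, 25, 50, 75, 100]
        rollout_plan = []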
        
        for percentage in stages:
            stage = {
                "traffic_percent": percentage,
                "duration_minutes": 15 if percentage < 100 else 0,
                "success_criteria": {
                    "error_rate_threshold": 0.01,
                    "latency_p99_threshold_ms": 500
                },
                "rollback_on_failure": True
            }
            rollout_plan.append(stage)
        
        return rollout_plan


# Example canary deployment
canary = KServeCanaryDeployment("bert-classifier", "ml-production")

canary_spec = canary.create_canary_spec(
    default_storage_uri="s3://models/bert-classifier/v1",
    canary_storage_uri="s3://models/bert-classifier/v2",
    canary_traffic_percent=10
)

print("Canary Deployment Spec:")
print(yaml.dump(canary_spec, default_flow_style=False))

print("\nProgressive Rollout Plan:")
for stage in canary.progressive_rollout():
    print(f"  - {stage['traffic_percent']}% traffic for {stage['duration_minutes']} minutes")
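
The rollout plan above attaches success criteria to each stage but does not show how they might be checked. The following is a minimal sketch, assuming observed metrics arrive as a plain dict with the hypothetical keys `error_rate` and `latency_p99_ms`, and that stages are listed in ascending order. A real rollout controller would pull these values from a metrics backend.

```python
from typing import Dict, List


def stage_passes(observed: Dict[str, float], criteria: Dict[str, float]) -> bool:
    """Return True when the observed canary metrics meet a stage's criteria."""
    return (
        observed["error_rate"] <= criteria["error_rate_threshold"]
        and observed["latency_p99_ms"] <= criteria["latency_p99_threshold_ms"]
    )


def next_traffic_percent(
    stages: List[int],
    current: int,
    observed: Dict[str, float],
    criteria: Dict[str, float],
) -> int:
    """Advance to the next stage on success, or roll back to 0% on failure."""
    if not stage_passes(observed, criteria):
        return 0  # rollback: all traffic returns to the stable revision
    later = [s for s in stages if s > current]
    return later[0] if later else current


criteria = {"error_rate_threshold": 0.01, "latency_p99_threshold_ms": 500}
stages = [10, 25, 50, 75, 100]

healthy = {"error_rate": 0.002, "latency_p99_ms": 310.0}
degraded = {"error_rate": 0.03, "latency_p99_ms": 310.0}

print(next_traffic_percent(stages, 10, healthy, criteria))   # 25
print(next_traffic_percent(stages, 10, degraded, criteria))  # 0
```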

## 2. Ray Serve - Distributed Model Serving

Ray Serve enables distributed, scalable model serving with Python-native APIs.

In [None]:
class RayServeDeployment:
    """
    Simulates Ray Serve deployment configuration.
    
    In production, this would use:
    from ray import serve
    @serve.deployment
    """
    
    def __init__(
        self,
        name: str,
        num_replicas: int = 1,
        max_concurrent_queries: int = 100,
        ray_actor_options: Optional[Dict[str, Any]] = None,
        autoscaling_config: Optional[Dict[str, Any]] = None
    ):
        self.name = name
        self.num_replicas = num_replicas
        # Newer Ray Serve releases call this option max_ongoing_requests.
        self.max_concurrent_queries = max_concurrent_queries
        self.ray_actor_options = ray_actor_options or {}
        # e.g. {"min_replicas": 1, "max_replicas": 10,
        #       "target_num_ongoing_requests_per_replica": 10}
        self.autoscaling_config = autoscaling_config
    
    def to_config(self) -> Dict[str, Any]:
        """Generate Ray Serve deployment config"""
        config = {
            "name": self.name,
            "max_concurrent_queries": self.max_concurrent_queries,
            "ray_actor_options": self.ray_actor_options
        }
        # Ray Serve treats a fixed replica count and autoscaling as mutually
        # exclusive, so emit only one of the two.
        if self.autoscaling_config is not None:
            config["autoscaling_config"] = self.autoscaling_config
        else:
            config["num_replicas"] = self.num_replicas
        return config


class RayServeModelComposition:
    """
    Demonstrates Ray Serve model composition patterns.
    """
    
    def __init__(self):
        self.deployments: Dict[str, RayServeDeployment] = {}
        self.dag_edges: List[Tuple[str, str]] = []
    
    def add_deployment(self, deployment: RayServeDeployment) -> None:
        """Add a deployment to the composition"""
        self.deployments[deployment.name] = deployment
    
    def add_edge(self, from_deployment: str, to_deployment: str) -> None:
        """Add a DAG edge between deployments"""
        self.dag_edges.append((from_deployment, to_deployment))
    
    def to_serve_dag(self) -> Dict[str, Any]:
        """Generate Ray Serve DAG configuration"""
        return {
            "deployments": {
                name: dep.to_config() 
                for name, dep in self.deployments.items()
            },
            "dag": {
                "edges": self.dag_edges
            }
        }


# Example: Ray Serve ML Pipeline
composition = RayServeModelComposition()

# Add deployments
composition.add_deployment(RayServeDeployment(
    name="preprocessor",
    num_replicas=2,
    ray_actor_options={"num_cpus": 1}
))

composition.add_deployment(RayServeDeployment(
    name="encoder",
    num_replicas=4,
    ray_actor_options={"num_gpus": 1}
))

composition.add_deployment(RayServeDeployment(
    name="classifier",
    num_replicas=2,
    ray_actor_options={"num_gpus": 0.5}
))

# Define DAG
composition.add_edge("preprocessor", "encoder")
composition.add_edge("encoder", "classifier")

print("Ray Serve DAG Configuration:")
print(json.dumps(composition.to_serve_dag(), indent=2))
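
The composition above stores its DAG as a list of `(upstream, downstream)` edges. One way to check that such a list is a valid DAG and to recover an execution order is the standard library's `graphlib` module (Python 3.9+). This is a sketch over plain tuples and is independent of how Ray Serve wires deployments together internally.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple


def execution_order(edges: List[Tuple[str, str]]) -> List[str]:
    """Order deployments so each one appears after everything upstream of it."""
    predecessors: Dict[str, Set[str]] = {}
    for upstream, downstream in edges:
        predecessors.setdefault(upstream, set())
        predecessors.setdefault(downstream, set()).add(upstream)
    # static_order() raises graphlib.CycleError if the edges contain a cycle.
    return list(TopologicalSorter(predecessors).static_order())


edges = [("preprocessor", "encoder"), ("encoder", "classifier")]
print(execution_order(edges))  # ['preprocessor', 'encoder', 'classifier']
```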

In [None]:
# Simulated Ray Serve deployment class (production would use @serve.deployment)
class SimulatedRayServeModel:
    """
    Simulates a Ray Serve model deployment.
    
    Production code would look like:
    
    @serve.deployment(
        num_replicas=2,
        ray_actor_options={"num_gpus": 1}
    )
    class BertClassifier:
        def __init__(self):
            self.model = load_model()
        
        async def __call__(self, request):
            # request.json() is a coroutine on Starlette requests and must be awaited
            payload = await request.json()
            return self.model(payload)
    """
    
    def __init__(self, model_name: str):
        self.model_name = model_name
        self.model = nn.Linear(768, 2)  # Simulated model
        self.request_count = 0
    
    def __call__(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """Handle inference request"""
        self.request_count += 1
        
        # Simulate model inference
        batch_size = inputs.get("batch_size", 1)
        
        with torch.no_grad():
            dummy_input = torch.randn(batch_size, 768)
            output = self.model(dummy_input)
            predictions = torch.softmax(output, dim=-1)
        
        return {
            "predictions": predictions.tolist(),
            "model": self.model_name,
            "request_id": self.request_count
        }


# Test simulated deployment
model = SimulatedRayServeModel("bert-classifier")
result = model({"batch_size": 4})
print(f"Inference result: {result}")

## 3. Kubeflow Pipelines

Kubeflow Pipelines enables ML workflow orchestration on Kubernetes.

In [None]:
@dataclass
class PipelineComponent:
    """
    Represents a Kubeflow Pipeline component.
    """
    name: str
    image: str
    command: List[str]
    arguments: List[str] = field(default_factory=list)
    
    # Resource requirements
    cpu_request: str = "1"
    memory_request: str = "2Gi"
    gpu_limit: int = 0
    
    # Inputs/Outputs
    inputs: Dict[str, str] = field(default_factory=dict)
    outputs: Dict[str, str] = field(default_factory=dict)
    
    def to_component_spec(self) -> Dict[str, Any]:
        """Generate component specification"""
        spec = {
            "name": self.name,
            "implementation": {
                "container": {
                    "image": self.image,
                    "command": self.command,
                    "args": self.arguments
                }
            },
            "inputs": [
                {"name": k, "type": v} for k, v in self.inputs.items()
            ],
            "outputs": [
                {"name": k, "type": v} for k, v in self.outputs.items()
            ]
        }
        return spec


class KubeflowPipeline:
    """
    Builds Kubeflow ML pipelines.
    """
    
    def __init__(self, name: str, description: str = ""):
        self.name = name
        self.description = description
        self.components: List[PipelineComponent] = []
        self.dependencies: Dict[str, List[str]] = {}
    
    def add_component(
        self,
        component: PipelineComponent,
        depends_on: Optional[List[str]] = None
    ) -> None:
        """Add component to pipeline"""
        self.components.append(component)
        self.dependencies[component.name] = depends_on or []
    
    def to_pipeline_spec(self) -> Dict[str, Any]:
        """Generate pipeline specification"""
        return {
            "apiVersion": "argoproj.io/v1alpha1",
            "kind": "Workflow",
            "metadata": {
                "name": self.name,
                "annotations": {
                    "pipelines.kubeflow.org/pipeline_name": self.name
                }
            },
            "spec": {
                "entrypoint": "main",
                "templates": [
                    self._create_dag_template(),
                    *[self._create_component_template(c) for c in self.components]
                ]
            }
        }
    
    def _create_dag_template(self) -> Dict[str, Any]:
        """Create DAG template for pipeline"""
        tasks = []
        for component in self.components:
            task = {
                "name": component.name,
                "template": component.name
            }
            if self.dependencies[component.name]:
                task["dependencies"] = self.dependencies[component.name]
            tasks.append(task)
        
        return {
            "name": "main",
            "dag": {"tasks": tasks}
        }
    
    def _create_component_template(self, component: PipelineComponent) -> Dict[str, Any]:
        """Create component template"""
        template = {
            "name": component.name,
            "container": {
                "image": component.image,
                "command": component.command,
                "args": component.arguments,
                "resources": {
                    "requests": {
                        "cpu": component.cpu_request,
                        "memory": component.memory_request
                    }
                }
            }
        }
        
        if component.gpu_limit > 0:
            template["container"]["resources"]["limits"] = {
                "nvidia.com/gpu": str(component.gpu_limit)
            }
        
        return template


# Example: ML Training Pipeline
pipeline = KubeflowPipeline(
    name="bert-training-pipeline",
    description="End-to-end BERT training pipeline"
)

# Data preprocessing
pipeline.add_component(PipelineComponent(
    name="data-preprocessing",
    image="ml-images/preprocessor:latest",
    command=["python", "preprocess.py"],
    arguments=["--input", "/data/raw", "--output", "/data/processed"],
    cpu_request="2",
    memory_request="8Gi"
))

# Model training
pipeline.add_component(
    PipelineComponent(
        name="model-training",
        image="ml-images/trainer:latest",
        command=["python", "train.py"],
        arguments=["--data", "/data/processed", "--epochs", "10"],
        cpu_request="4",
        memory_request="16Gi",
        gpu_limit=4
    ),
    depends_on=["data-preprocessing"]
)

# Model evaluation
pipeline.add_component(
    PipelineComponent(
        name="model-evaluation",
        image="ml-images/evaluator:latest",
        command=["python", "evaluate.py"],
        cpu_request="2",
        memory_request="4Gi",
        gpu_limit=1
    ),
    depends_on=["model-training"]
)

# Model deployment
pipeline.add_component(
    PipelineComponent(
        name="model-deployment",
        image="ml-images/deployer:latest",
        command=["python", "deploy.py"],
        cpu_request="1",
        memory_request="2Gi"
    ),
    depends_on=["model-evaluation"]
)

print("Kubeflow Pipeline Spec:")
print(yaml.dump(pipeline.to_pipeline_spec(), default_flow_style=False))
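
The builder above accepts any string in `depends_on`, so a typo in a component name would only surface when the workflow is submitted. A small pre-flight check can catch this earlier. The sketch assumes a mapping with the same shape as `KubeflowPipeline.dependencies` (component name to list of upstream names); the function name is illustrative.

```python
from typing import Dict, List


def find_missing_dependencies(dependencies: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each component to the dependencies that are not defined in the pipeline."""
    known = set(dependencies)
    missing: Dict[str, List[str]] = {}
    for component, deps in dependencies.items():
        unknown = [d for d in deps if d not in known]
        if unknown:
            missing[component] = unknown
    return missing


deps = {
    "data-preprocessing": [],
    "model-training": ["data-preprocessing"],
    "model-evaluation": ["model-training"],
    "model-deployment": ["model-evaluation"],
}
print(find_missing_dependencies(deps))  # {}

# A misspelled upstream name is reported against the component that uses it.
broken = {**deps, "model-deployment": ["model-evaluaton"]}
print(find_missing_dependencies(broken))  # {'model-deployment': ['model-evaluaton']}
```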

## 4. GPU Scheduling & Resource Management

Efficient GPU scheduling is critical for ML workloads on Kubernetes.

In [None]:
@dataclass
class GPUNode:
    """Represents a GPU node in the cluster"""
    name: str
    gpu_type: str  # nvidia-a100, nvidia-v100, nvidia-t4
    gpu_count: int
    gpu_memory_gb: int
    available_gpus: int
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class GPURequest:
    """Represents a GPU resource request"""
    job_name: str
    gpu_count: int
    gpu_memory_gb: int
    preferred_gpu_type: Optional[str] = None
    priority: int = 0  # Higher = more important


class GPUScheduler:
    """
    Simulates GPU scheduling for ML workloads.
    """
    
    def __init__(self):
        self.nodes: List[GPUNode] = []
        self.pending_requests: List[GPURequest] = []
        self.scheduled_jobs: Dict[str, str] = {}  # job -> node mapping
        self.allocated_gpus: Dict[str, int] = {}  # job -> number of GPUs held
    
    def add_node(self, node: GPUNode) -> None:
        """Add a GPU node to the cluster"""
        self.nodes.append(node)
    
    def submit_request(self, request: GPURequest) -> None:
        """Submit a GPU request"""
        self.pending_requests.append(request)
        self.pending_requests.sort(key=lambda x: x.priority, reverse=True)
    
    def schedule(self) -> List[Tuple[str, str]]:
        """
        Schedule pending requests to available nodes.
        Returns list of (job_name, node_name) tuples.
        """
        scheduled = []
        remaining_requests = []
        
        for request in self.pending_requests:
            best_node = self._find_best_node(request)
            
            if best_node:
                # Allocate GPUs and remember how many this job holds
                best_node.available_gpus -= request.gpu_count
                self.scheduled_jobs[request.job_name] = best_node.name
                self.allocated_gpus[request.job_name] = request.gpu_count
                scheduled.append((request.job_name, best_node.name))
            else:
                remaining_requests.append(request)
        
        self.pending_requests = remaining_requests
        return scheduled
    
    def _find_best_node(self, request: GPURequest) -> Optional[GPUNode]:
        """Find the best node for a request using best-fit scoring"""
        candidates = []
        
        for node in self.nodes:
            # Check availability
            if node.available_gpus < request.gpu_count:
                continue
            
            # Check per-GPU memory requirement
            if node.gpu_memory_gb < request.gpu_memory_gb:
                continue
            
            # Calculate score (prefer nodes with matching GPU type and less fragmentation)
            score = 0
            
            if request.preferred_gpu_type and node.gpu_type == request.preferred_gpu_type:
                score += 100
            
            # Best-fit: prefer nodes with just enough resources
            score -= (node.available_gpus - request.gpu_count) * 10
            
            candidates.append((score, node))
        
        if candidates:
            # The sort is stable, so ties go to the node that was added first
            candidates.sort(key=lambda x: x[0], reverse=True)
            return candidates[0][1]
        
        return None
    
    def release(self, job_name: str) -> None:
        """Release GPUs when job completes"""
        if job_name not in self.scheduled_jobs:
            return
        
        node_name = self.scheduled_jobs.pop(job_name)
        gpu_count = self.allocated_gpus.pop(job_name)
        
        for node in self.nodes:
            if node.name == node_name:
                # Return exactly what was allocated, capped at node capacity
                node.available_gpus = min(node.gpu_count, node.available_gpus + gpu_count)
                break
    
    def get_cluster_status(self) -> Dict[str, Any]:
        """Get cluster GPU status"""
        total_gpus = sum(n.gpu_count for n in self.nodes)
        available_gpus = sum(n.available_gpus for n in self.nodes)
        
        return {
            "total_gpus": total_gpus,
            "available_gpus": available_gpus,
            "utilization": (total_gpus - available_gpus) / total_gpus if total_gpus > 0 else 0,
            "pending_jobs": len(self.pending_requests),
            "running_jobs": len(self.scheduled_jobs),
            "nodes": [
                {
                    "name": n.name,
                    "gpu_type": n.gpu_type,
                    "available": f"{n.available_gpus}/{n.gpu_count}"
                }
                for n in self.nodes
            ]
        }


# Example: GPU Scheduling
scheduler = GPUScheduler()

# Add nodes
scheduler.add_node(GPUNode("node-1", "nvidia-a100", 8, 80, 8))
scheduler.add_node(GPUNode("node-2", "nvidia-a100", 8, 80, 8))
scheduler.add_node(GPUNode("node-3", "nvidia-v100", 4, 32, 4))

# Submit requests
scheduler.submit_request(GPURequest("llm-training", 4, 80, "nvidia-a100", priority=10))
scheduler.submit_request(GPURequest("inference-1", 1, 16, priority=5))
scheduler.submit_request(GPURequest("inference-2", 1, 16, priority=5))
scheduler.submit_request(GPURequest("batch-job", 2, 32, "nvidia-v100", priority=1))

# Schedule
scheduled = scheduler.schedule()
print("Scheduled jobs:")
for job, node in scheduled:
    print(f"  {job} -> {node}")

print(f"\nCluster status: {json.dumps(scheduler.get_cluster_status(), indent=2)}")
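
To make the scoring rule in `_find_best_node` concrete, the snippet below recomputes it as a standalone function and applies it to the cluster state from the example, just after `llm-training` has taken 4 GPUs on `node-1`. The function name is illustrative; the arithmetic mirrors the scheduler above (+100 for a preferred GPU type match, -10 per GPU left idle on the node).

```python
from typing import Optional


def best_fit_score(
    available_gpus: int,
    requested_gpus: int,
    node_gpu_type: str,
    preferred_gpu_type: Optional[str] = None,
) -> int:
    """Score a candidate node: reward a type match, penalise leftover GPUs."""
    score = 0
    if preferred_gpu_type and node_gpu_type == preferred_gpu_type:
        score += 100
    score -= (available_gpus - requested_gpus) * 10
    return score


# (name, gpu_type, available_gpus) after llm-training landed on node-1
nodes = [
    ("node-1", "nvidia-a100", 4),
    ("node-2", "nvidia-a100", 8),
    ("node-3", "nvidia-v100", 4),
]

# inference-1 asks for 1 GPU with no type preference
scores = {name: best_fit_score(avail, 1, gpu_type) for name, gpu_type, avail in nodes}
print(scores)  # {'node-1': -30, 'node-2': -70, 'node-3': -30}

# batch-job asks for 2 V100s; the type bonus dominates on node-3
print(best_fit_score(4, 2, "nvidia-v100", "nvidia-v100"))  # 80
```

`node-1` and `node-3` tie at -30 for `inference-1`. Because Python's sort is stable, the scheduler picks whichever was added first, so the untouched `node-2` stays free for large jobs.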

In [None]:
class GPUFractionalScheduling:
    """
    Fractional GPU scheduling using MIG (Multi-Instance GPU) or time-slicing.
    Enables multiple workloads to share a single GPU.
    """
    
    def __init__(self):
        self.gpu_slices: Dict[str, List[Dict]] = {}  # gpu_id -> list of slices
    
    def create_mig_profiles(self, gpu_id: str, gpu_memory_gb: int = 80) -> List[Dict]:
        """
        Create MIG profiles for A100 GPU.
        A100 80GB supports various MIG configurations.
        """
        # Example A100 MIG profiles
        profiles = [
            {"profile": "1g.10gb", "memory_gb": 10, "compute_fraction": 1/7},
            {"profile": "2g.20gb", "memory_gb": 20, "compute_fraction": 2/7},
            {"profile": "3g.40gb", "memory_gb": 40, "compute_fraction": 3/7},
            {"profile": "4g.40gb", "memory_gb": 40, "compute_fraction": 4/7},
            {"profile": "7g.80gb", "memory_gb": 80, "compute_fraction": 1.0},
        ]
        
        self.gpu_slices[gpu_id] = profiles
        return profiles
    
    def allocate_slice(
        self,
        gpu_id: str,
        required_memory_gb: int
    ) -> Optional[Dict]:
        """Allocate a GPU slice for a workload"""
        if gpu_id not in self.gpu_slices:
            return None
        
        for profile in self.gpu_slices[gpu_id]:
            if profile["memory_gb"] >= required_memory_gb:
                return profile
        
        return None
    
    def generate_pod_spec_with_mig(
        self,
        name: str,
        mig_profile: str
    ) -> Dict[str, Any]:
        """Generate pod spec with MIG resource request"""
        return {
            "apiVersion": "v1",
            "kind": "Pod",
            "metadata": {"name": name},
            "spec": {
                "containers": [{
                    "name": "ml-container",
                    "resources": {
                        "limits": {
                            f"nvidia.com/mig-{mig_profile}": "1"
                        }
                    }
                }]
            }
        }


# Example: MIG scheduling
mig_scheduler = GPUFractionalScheduling()
profiles = mig_scheduler.create_mig_profiles("gpu-0")

print("Available MIG profiles:")
for p in profiles:
    print(f"  {p['profile']}: {p['memory_gb']}GB, {p['compute_fraction']:.1%} compute")

# Allocate for inference workload needing 15GB
allocation = mig_scheduler.allocate_slice("gpu-0", 15)
if allocation:
    print(f"\nAllocated profile: {allocation['profile']}")
    pod_spec = mig_scheduler.generate_pod_spec_with_mig("inference-pod", allocation['profile'])
    print(f"Pod spec: {yaml.dump(pod_spec)}")
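
`allocate_slice` above picks a profile but does not track what is already carved out of the GPU. A rough capacity check can be derived from the profile names themselves, since `Ng.Mgb` encodes compute slices and memory. This sketch assumes an A100 80GB with 7 compute slices. It is a necessary condition only: real MIG placement has further rules about which combinations and positions are valid.

```python
import re
from typing import List, Tuple

_PROFILE = re.compile(r"^(\d+)g\.(\d+)gb$")


def parse_mig_profile(profile: str) -> Tuple[int, int]:
    """Split a name like '3g.40gb' into (compute_slices, memory_gb)."""
    match = _PROFILE.match(profile)
    if match is None:
        raise ValueError(f"unrecognised MIG profile: {profile!r}")
    return int(match.group(1)), int(match.group(2))


def fits_on_gpu(profiles: List[str], compute_slices: int = 7, memory_gb: int = 80) -> bool:
    """Check that the requested profiles do not exceed total compute or memory."""
    parsed = [parse_mig_profile(p) for p in profiles]
    total_compute = sum(compute for compute, _ in parsed)
    total_memory = sum(memory for _, memory in parsed)
    return total_compute <= compute_slices and total_memory <= memory_gb


print(fits_on_gpu(["3g.40gb", "4g.40gb"]))  # True: 7 slices, 80 GB
print(fits_on_gpu(["4g.40gb", "4g.40gb"]))  # False: 8 compute slices
print(fits_on_gpu(["7g.80gb", "1g.10gb"]))  # False: exceeds both limits
```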

## 5. Auto-scaling Patterns

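ML services can scale horizontally (more replicas) or vertically (larger resource requests per pod). This section simulates both: a Horizontal Pod Autoscaler driven by CPU, memory, and request-rate metrics, and a Vertical Pod Autoscaler that recommends requests from usage history.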

In [None]:
class HorizontalPodAutoscaler:
    """
    Simulates Kubernetes HPA for ML workloads.
    """
    
    def __init__(
        self,
        deployment_name: str,
        min_replicas: int = 1,
        max_replicas: int = 10,
        target_cpu_utilization: int = 70,
        target_memory_utilization: int = 80,
        target_requests_per_second: Optional[int] = None,
        scale_up_stabilization_seconds: int = 0,
        scale_down_stabilization_seconds: int = 300
    ):
        self.deployment_name = deployment_name
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.target_cpu = target_cpu_utilization
        self.target_memory = target_memory_utilization
        self.target_rps = target_requests_per_second
        self.scale_up_stabilization = scale_up_stabilization_seconds
        self.scale_down_stabilization = scale_down_stabilization_seconds
        
        self.current_replicas = min_replicas
        self.scaling_history: List[Dict] = []
    
    def calculate_desired_replicas(
        self,
        current_cpu: float,
        current_memory: float = None,
        current_rps: float = None
    ) -> int:
        """
        Calculate desired replica count based on metrics.
        Uses the formula: desired = ceil(current_replicas * (current / target))
        """
        desired_counts = []
        
        # CPU-based scaling
        cpu_desired = int(np.ceil(
            self.current_replicas * (current_cpu / self.target_cpu)
        ))
        desired_counts.append(cpu_desired)
        
        # Memory-based scaling
        if current_memory is not None:
            memory_desired = int(np.ceil(
                self.current_replicas * (current_memory / self.target_memory)
            ))
            desired_counts.append(memory_desired)
        
        # RPS-based scaling (custom metric)
        if current_rps is not None and self.target_rps:
            rps_desired = int(np.ceil(
                self.current_replicas * (current_rps / self.target_rps)
            ))
            desired_counts.append(rps_desired)
        
        # Take maximum (most aggressive scaling)
        desired = max(desired_counts)
        
        # Apply bounds
        desired = max(self.min_replicas, min(self.max_replicas, desired))
        
        return desired
    
    def scale(self, desired_replicas: int) -> Dict[str, Any]:
        """Execute scaling action"""
        previous = self.current_replicas
        self.current_replicas = desired_replicas
        
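        action = {
            "timestamp": time.time(),
            "previous_replicas": previous,
            "new_replicas": desired_replicas,
            # Report an unchanged count explicitly instead of labelling it scale_down
            "action": (
                "scale_up" if desired_replicas > previous
                else "scale_down" if desired_replicas < previous
                else "no_change"
            )
        }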
        
        self.scaling_history.append(action)
        return action
    
    def to_hpa_spec(self) -> Dict[str, Any]:
        """Generate HPA YAML specification"""
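        spec = {
            "apiVersion": "autoscaling/v2",
            "kind": "HorizontalPodAutoscaler",
            "metadata": {"name": f"{self.deployment_name}-hpa"},
            "spec": {
                "scaleTargetRef": {
                    "apiVersion": "apps/v1",
                    "kind": "Deployment",
                    "name": self.deployment_name
                },
                "minReplicas": self.min_replicas,
                "maxReplicas": self.max_replicas,
                "metrics": [
                    {
                        "type": "Resource",
                        "resource": {
                            "name": "cpu",
                            "target": {
                                "type": "Utilization",
                                "averageUtilization": self.target_cpu
                            }
                        }
                    },
                    # Emit the memory target too, so the spec matches
                    # what calculate_desired_replicas evaluates
                    {
                        "type": "Resource",
                        "resource": {
                            "name": "memory",
                            "target": {
                                "type": "Utilization",
                                "averageUtilization": self.target_memory
                            }
                        }
                    }
                ],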
                "behavior": {
                    "scaleUp": {
                        "stabilizationWindowSeconds": self.scale_up_stabilization,
                        "policies": [{
                            "type": "Percent",
                            "value": 100,
                            "periodSeconds": 15
                        }]
                    },
                    "scaleDown": {
                        "stabilizationWindowSeconds": self.scale_down_stabilization,
                        "policies": [{
                            "type": "Percent",
                            "value": 10,
                            "periodSeconds": 60
                        }]
                    }
                }
            }
        }
        
        # Add custom RPS metric if specified
        if self.target_rps:
            spec["spec"]["metrics"].append({
                "type": "Pods",
                "pods": {
                    "metric": {"name": "requests_per_second"},
                    "target": {
                        "type": "AverageValue",
                        "averageValue": str(self.target_rps)
                    }
                }
            })
        
        return spec


# Example: HPA for ML inference service
hpa = HorizontalPodAutoscaler(
    deployment_name="bert-inference",
    min_replicas=2,
    max_replicas=20,
    target_cpu_utilization=60,
    target_requests_per_second=100
)

# Simulate scaling decisions
print("Scaling simulation:")
for cpu_util in [30, 50, 80, 120, 90, 60, 40]:
    desired = hpa.calculate_desired_replicas(current_cpu=cpu_util)
    if desired != hpa.current_replicas:
        action = hpa.scale(desired)
        print(f"  CPU {cpu_util}% -> {action['action']}: {action['previous_replicas']} -> {action['new_replicas']} replicas")
    else:
        print(f"  CPU {cpu_util}% -> no change ({hpa.current_replicas} replicas)")

print(f"\nHPA Spec:")
print(yaml.dump(hpa.to_hpa_spec(), default_flow_style=False))
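
The simulation above rescales on any deviation from the target. The Kubernetes HPA controller additionally skips scaling when the ratio of current to target metric is close to 1.0, with a default tolerance of 0.1. The sketch below adds that step to the same formula; the function name and argument names are illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    tolerance: float = 0.1,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """ceil(current_replicas * current / target), skipped when within tolerance."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: avoid churn
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 63, 60))  # 4  (ratio 1.05 is inside the tolerance band)
print(desired_replicas(4, 90, 60))  # 6  (ratio 1.5)
print(desired_replicas(4, 30, 60))  # 2  (ratio 0.5)
```

Without the tolerance band, the first case would scale from 4 to 5 replicas for a 3-point overshoot, which is the kind of flapping the stabilization windows in the spec are also meant to dampen.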

In [None]:
class VerticalPodAutoscaler:
    """
    Simulates Kubernetes VPA for right-sizing ML workloads.
    """
    
    def __init__(
        self,
        deployment_name: str,
        update_mode: str = "Auto",  # Off, Initial, Recreate, Auto
        min_cpu: str = "100m",
        max_cpu: str = "8",
        min_memory: str = "256Mi",
        max_memory: str = "32Gi"
    ):
        self.deployment_name = deployment_name
        self.update_mode = update_mode
        self.min_cpu = min_cpu
        self.max_cpu = max_cpu
        self.min_memory = min_memory
        self.max_memory = max_memory
        
        self.recommendations: List[Dict] = []
    
    def analyze_usage(
        self,
        cpu_usage_samples: List[float],
        memory_usage_samples: List[float]
    ) -> Dict[str, Any]:
        """
        Analyze historical usage to generate recommendations.
        """
        cpu_array = np.array(cpu_usage_samples)
        memory_array = np.array(memory_usage_samples)
        
        recommendation = {
            "target": {
                "cpu": f"{int(np.percentile(cpu_array, 90))}m",
                "memory": f"{int(np.percentile(memory_array, 90))}Mi"
            },
            "lower_bound": {
                "cpu": f"{int(np.percentile(cpu_array, 50))}m",
                "memory": f"{int(np.percentile(memory_array, 50))}Mi"
            },
            "upper_bound": {
                "cpu": f"{int(np.percentile(cpu_array, 99))}m",
                "memory": f"{int(np.percentile(memory_array, 99))}Mi"
            }
        }
        
        self.recommendations.append(recommendation)
        return recommendation
    
    def to_vpa_spec(self) -> Dict[str, Any]:
        """Generate VPA YAML specification"""
        return {
            "apiVersion": "autoscaling.k8s.io/v1",
            "kind": "VerticalPodAutoscaler",
            "metadata": {"name": f"{self.deployment_name}-vpa"},
            "spec": {
                "targetRef": {
                    "apiVersion": "apps/v1",
                    "kind": "Deployment",
                    "name": self.deployment_name
                },
                "updatePolicy": {
                    "updateMode": self.update_mode
                },
                "resourcePolicy": {
                    "containerPolicies": [{
                        "containerName": "*",
                        "minAllowed": {
                            "cpu": self.min_cpu,
                            "memory": self.min_memory
                        },
                        "maxAllowed": {
                            "cpu": self.max_cpu,
                            "memory": self.max_memory
                        }
                    }]
                }
            }
        }


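# Example: VPA for an ML training deployment.
# Note: in "Auto" mode the VPA may evict pods to apply new requests, which
# restarts a running training process; "Off" (recommendation only) or
# "Initial" avoids that.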
vpa = VerticalPodAutoscaler(
    deployment_name="ml-trainer",
    update_mode="Auto",
    min_cpu="500m",
    max_cpu="16",
    min_memory="1Gi",
    max_memory="64Gi"
)

# Simulate usage analysis with a seeded generator so the output is reproducible
rng = np.random.default_rng(42)
cpu_samples = rng.normal(2000, 500, 100).tolist()  # millicores
memory_samples = rng.normal(4000, 1000, 100).tolist()  # MiB

recommendation = vpa.analyze_usage(cpu_samples, memory_samples)
print("VPA Recommendation:")
print(json.dumps(recommendation, indent=2))

print("\nVPA Spec:")
print(yaml.dump(vpa.to_vpa_spec(), default_flow_style=False))
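
The recommendation produced by `analyze_usage` is not bounded by the `min_*`/`max_*` values passed to the constructor, whereas the VPA spec declares `minAllowed`/`maxAllowed`. The following is a minimal sketch of that clamping step, assuming CPU quantities are either plain cores or use the `m` suffix and memory quantities use `Mi` or `Gi`. The helper names (`cpu_to_millicores`, `memory_to_mib`, `clamp_recommendation`) are hypothetical and not part of any Kubernetes client library.

```python
from typing import Dict


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '500m' or '16' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_to_mib(quantity: str) -> int:
    """Convert a memory quantity such as '256Mi' or '64Gi' to MiB."""
    if quantity.endswith("Gi"):
        return int(float(quantity[:-2]) * 1024)
    if quantity.endswith("Mi"):
        return int(float(quantity[:-2]))
    raise ValueError(f"unsupported memory unit in {quantity!r}")


def clamp_recommendation(
    target: Dict[str, str],
    min_allowed: Dict[str, str],
    max_allowed: Dict[str, str],
) -> Dict[str, str]:
    """Clamp a {'cpu': ..., 'memory': ...} target into [min_allowed, max_allowed]."""
    cpu = cpu_to_millicores(target["cpu"])
    cpu = max(cpu_to_millicores(min_allowed["cpu"]), cpu)
    cpu = min(cpu_to_millicores(max_allowed["cpu"]), cpu)

    mem = memory_to_mib(target["memory"])
    mem = max(memory_to_mib(min_allowed["memory"]), mem)
    mem = min(memory_to_mib(max_allowed["memory"]), mem)

    return {"cpu": f"{cpu}m", "memory": f"{mem}Mi"}


# CPU is below the minimum and memory is above the maximum (illustrative values)
clamped = clamp_recommendation(
    target={"cpu": "120m", "memory": "90000Mi"},
    min_allowed={"cpu": "500m", "memory": "1Gi"},
    max_allowed={"cpu": "16", "memory": "64Gi"},
)
print(clamped)
```

Here the CPU target is raised to the 500m floor and the memory target is lowered to the 64Gi ceiling, expressed in MiB.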

## 6. Helm Charts for ML Deployments

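Helm packages the Kubernetes manifests for an ML service into a versioned chart, so the same deployment can be reproduced across environments by changing only `values.yaml`.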

In [None]:
class MLHelmChart:
    """
    Generates Helm chart structure for ML deployments.
    """
    
    def __init__(
        self,
        name: str,
        version: str = "1.0.0",
        app_version: str = "1.0.0"
    ):
        self.name = name
        self.version = version
        self.app_version = app_version
    
    def generate_chart_yaml(self) -> str:
        """Generate Chart.yaml"""
        return yaml.dump({
            "apiVersion": "v2",
            "name": self.name,
            "description": f"Helm chart for {self.name} ML deployment",
            "type": "application",
            "version": self.version,
            "appVersion": self.app_version,
            "dependencies": [
                {
                    "name": "redis",
                    "version": "17.0.0",
                    "repository": "https://charts.bitnami.com/bitnami",
                    "condition": "redis.enabled"
                }
            ]
        })
    
    def generate_values_yaml(self) -> str:
        """Generate default values.yaml"""
        return yaml.dump({
            "replicaCount": 2,
            "image": {
                "repository": f"ml-images/{self.name}",
                "tag": "latest",
                "pullPolicy": "IfNotPresent"
            },
            "model": {
                "name": "bert-base",
                "version": "v1",
                "storageUri": "s3://models/bert-base/v1"
            },
            "resources": {
                "requests": {
                    "cpu": "1",
                    "memory": "4Gi"
                },
                "limits": {
                    "cpu": "4",
                    "memory": "8Gi",
                    "nvidia.com/gpu": "1"
                }
            },
            "autoscaling": {
                "enabled": True,
                "minReplicas": 2,
                "maxReplicas": 20,
                "targetCPUUtilizationPercentage": 70
            },
            "service": {
                "type": "ClusterIP",
                "port": 8080
            },
            "ingress": {
                "enabled": True,
                "className": "nginx",
                "hosts": [
                    {"host": f"{self.name}.ml.example.com", "paths": ["/"]}
                ]
            },
            "monitoring": {
                "enabled": True,
                "prometheus": {
                    "scrape": True,
                    "port": 9090
                }
            },
            "redis": {
                "enabled": True,
                "auth": {"enabled": False}
            }
        })
    
    def generate_deployment_template(self) -> str:
        """Generate deployment.yaml template"""
        # Helm templates are Go-template text, not plain YAML, so the template
        # is emitted verbatim. Passing it through yaml.dump would quote every
        # {{ ... }} action, turning replicas and containerPort into strings and
        # breaking the multi-line include/toYaml blocks.
        # The "chart.*" helpers are assumed to be defined in
        # templates/_helpers.tpl, which this class does not generate.
        return """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "chart.fullname" . }}
  labels:
    {{- include "chart.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "chart.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "chart.selectorLabels" . | nindent 8 }}
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          ports:
            - containerPort: {{ .Values.service.port }}
          env:
            - name: MODEL_NAME
              value: {{ .Values.model.name | quote }}
            - name: MODEL_VERSION
              value: {{ .Values.model.version | quote }}
            - name: MODEL_STORAGE_URI
              value: {{ .Values.model.storageUri | quote }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
"""
    
    def get_chart_structure(self) -> Dict[str, str]:
        """Get complete chart file structure"""
        return {
            "Chart.yaml": self.generate_chart_yaml(),
            "values.yaml": self.generate_values_yaml(),
            "templates/deployment.yaml": self.generate_deployment_template()
        }


# Example: Generate Helm chart
chart = MLHelmChart(
    name="bert-inference",
    version="1.2.0",
    app_version="2.0.0"
)

print("=" * 50)
print("Chart.yaml:")
print("=" * 50)
print(chart.generate_chart_yaml())

print("\n" + "=" * 50)
print("values.yaml:")
print("=" * 50)
print(chart.generate_values_yaml())
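
`get_chart_structure()` returns a mapping from relative path to file content, but nothing above writes it to disk. The following is a minimal sketch of that step using only the standard library. `write_chart` is a hypothetical helper, and the file contents are short placeholders standing in for the generated YAML.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def write_chart(files: Dict[str, str], root: Path) -> List[Path]:
    """Write {relative_path: content} under root, creating subdirectories."""
    written = []
    for relative_path, content in files.items():
        target = root / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        written.append(target)
    return written


# Placeholder content standing in for chart.get_chart_structure()
example_files = {
    "Chart.yaml": "apiVersion: v2\nname: bert-inference\nversion: 1.2.0\n",
    "values.yaml": "replicaCount: 2\n",
    "templates/deployment.yaml": "# deployment template goes here\n",
}

with tempfile.TemporaryDirectory() as tmp:
    chart_dir = Path(tmp) / "bert-inference"
    paths = write_chart(example_files, chart_dir)
    relative = sorted(p.relative_to(chart_dir).as_posix() for p in paths)
    print(relative)
```

Once written to a real directory, the chart can be checked with `helm lint` before it is packaged.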

## FAANG Interview Questions

### Q1: How would you design a multi-region ML inference system on Kubernetes?

**Answer:**
I would design a federated Kubernetes architecture:

1. **Multi-Cluster Setup**: Deploy identical inference services in each region using GitOps (ArgoCD/Flux)
2. **Global Load Balancer**: Use cloud provider's global LB (GCP GLB, AWS Global Accelerator) for geo-routing
3. **Model Synchronization**: Use a model registry (MLflow) with replication to regional object stores
4. **Caching Layer**: Regional Redis/Memcached clusters for embedding/feature caching
5. **Observability**: Unified monitoring with Prometheus federation and Grafana
6. **Failover**: Automatic traffic shifting when regional health checks fail
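
The geo-routing and failover points above can be illustrated with a toy routing decision: send each request to the lowest-latency region whose health check passes. This is only a sketch of the policy. The region names and latencies are made up, `choose_region` is a hypothetical helper, and a real global load balancer makes this decision itself.

```python
from typing import Dict, Optional


def choose_region(
    latency_ms: Dict[str, float],
    healthy: Dict[str, bool],
) -> Optional[str]:
    """Return the lowest-latency healthy region, or None if all are down."""
    candidates = [r for r in latency_ms if healthy.get(r, False)]
    if not candidates:
        return None
    return min(candidates, key=lambda r: latency_ms[r])


# Illustrative client-to-region latencies
latency_ms = {"us-east": 20.0, "eu-west": 95.0, "ap-south": 210.0}

normal = choose_region(
    latency_ms, {"us-east": True, "eu-west": True, "ap-south": True}
)
failover = choose_region(
    latency_ms, {"us-east": False, "eu-west": True, "ap-south": True}
)
outage = choose_region(latency_ms, {region: False for region in latency_ms})
print(normal, failover, outage)
```

When the nearest region fails its health check, traffic shifts to the next-closest healthy region; if no region is healthy, the function returns `None`, and the caller must decide how to degrade.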

### Q2: How do you handle GPU resource fragmentation in a shared Kubernetes cluster?

**Answer:**
Multiple strategies:

1. **MIG (Multi-Instance GPU)**: Use NVIDIA MIG on supported GPUs (e.g., A100, H100) to partition each GPU into isolated slices (up to seven instances on an A100)
2. **Time-Slicing**: Configure NVIDIA device plugin for time-sharing on older GPUs
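3. **Bin-Packing Scheduler**: Configure the scheduler's `NodeResourcesFit` plugin with the `MostAllocated` scoring strategy (or run a custom scheduler) so GPU pods pack onto fewer nodes and whole nodes stay free for large jobs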
4. **Priority Classes**: Use PriorityClass to preempt low-priority jobs when needed
5. **Job Queuing**: Use Kueue or Volcano for batch job scheduling with fair-share
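6. **Right-Sizing**: Profile GPU memory and utilization (e.g., with NVIDIA DCGM metrics) to match workloads to appropriate GPU types or MIG profiles; VPA recommendations cover only CPU and memory requests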
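
To see why bin-packing reduces fragmentation, the toy simulation below places the same GPU requests with two policies: packing (prefer the node with the fewest free GPUs that still fits) and spreading (prefer the node with the most free GPUs). It assumes two hypothetical 8-GPU nodes and is not the Kubernetes scheduler's actual scoring code.

```python
from typing import Dict, List, Optional


def place(free_gpus: Dict[str, int], request: int, pack: bool) -> Optional[str]:
    """Place one request; return the chosen node or None if nothing fits."""
    fitting = [node for node, free in free_gpus.items() if free >= request]
    if not fitting:
        return None
    # pack=True  -> fewest free GPUs first (best fit)
    # pack=False -> most free GPUs first (spread)
    sign = 1 if pack else -1
    node = min(fitting, key=lambda n: (sign * free_gpus[n], n))
    free_gpus[node] -= request
    return node


def schedule(requests: List[int], pack: bool) -> List[Optional[str]]:
    """Schedule requests in order onto two fresh 8-GPU nodes."""
    free_gpus = {"node-a": 8, "node-b": 8}
    return [place(free_gpus, request, pack) for request in requests]


# Four small jobs followed by one job that needs a whole node
requests = [2, 2, 2, 2, 8]
packed = schedule(requests, pack=True)
spread = schedule(requests, pack=False)
print("packed:", packed)
print("spread:", spread)
```

With packing, the four small jobs fill `node-a` and the 8-GPU job still fits on `node-b`. With spreading, each node is left with 4 free GPUs, so the large job cannot be placed even though 8 GPUs are free in total.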

### Q3: What's your strategy for zero-downtime ML model updates?

**Answer:**
Progressive rollout strategy:

1. **Canary Deployment**: Start with 5% traffic to new model version
2. **Automated Metrics Check**: Monitor latency, error rate, prediction distribution
3. **Shadow Mode**: Optionally mirror production traffic to the new model and compare its predictions offline; shadow responses are never returned to users
4. **Progressive Traffic Shift**: 5% -> 25% -> 50% -> 75% -> 100% with stabilization periods
5. **Automatic Rollback**: Revert if SLO violations detected
6. **Blue-Green for Critical Models**: Maintain warm standby for instant rollback

Tools: KServe's `canaryTrafficPercent`, Argo Rollouts, or Istio traffic management.
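
The progressive traffic shift with automatic rollback can be sketched as a small control loop. This is an illustration, not how KServe or Argo Rollouts implement it: `run_rollout`, the SLO thresholds, and the two fake metric sources are all hypothetical, and a real controller would wait for a stabilization period and query a monitoring system at each step.

```python
from typing import Callable, Dict, List, Tuple

TRAFFIC_STEPS = [5, 25, 50, 75, 100]  # percent of traffic on the new model


def run_rollout(
    get_metrics: Callable[[int], Dict[str, float]],
    max_p99_latency_ms: float = 200.0,
    max_error_rate: float = 0.01,
) -> Tuple[str, List[int]]:
    """Walk through TRAFFIC_STEPS; roll back to 0% on the first SLO violation."""
    history: List[int] = []
    for percent in TRAFFIC_STEPS:
        history.append(percent)
        metrics = get_metrics(percent)
        violated = (
            metrics["p99_latency_ms"] > max_p99_latency_ms
            or metrics["error_rate"] > max_error_rate
        )
        if violated:
            history.append(0)  # shift all traffic back to the old model
            return "rolled_back", history
    return "promoted", history


def healthy_model(percent: int) -> Dict[str, float]:
    """Fake metrics: stable latency and error rate at every traffic level."""
    return {"p99_latency_ms": 120.0, "error_rate": 0.002}


def degrading_model(percent: int) -> Dict[str, float]:
    """Fake metrics: latency grows with the share of traffic served."""
    return {"p99_latency_ms": 100.0 + 3.0 * percent, "error_rate": 0.002}


ok_result = run_rollout(healthy_model)
bad_result = run_rollout(degrading_model)
print(ok_result)
print(bad_result)
```

The healthy model passes every step and is promoted. The degrading model breaches the 200 ms latency threshold at 50% traffic, so the rollout stops there and traffic returns to 0%.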

## Summary

This notebook covered:

1. **KServe**: Serverless model inference with auto-scaling and canary deployments
2. **Ray Serve**: Distributed model serving with DAG composition
3. **Kubeflow Pipelines**: ML workflow orchestration with component DAGs
4. **GPU Scheduling**: Resource management including MIG fractional scheduling
5. **Auto-scaling**: HPA and VPA for ML workloads with custom metrics
6. **Helm Charts**: Packaging ML deployments for reproducibility

### Key Takeaways for FAANG Interviews:
- Kubernetes is a widely used platform for production ML serving and training
- GPU scheduling requires careful bin-packing and fragmentation management
- Canary deployments are essential for safe model updates
- Auto-scaling should consider ML-specific metrics (latency, throughput), not only CPU utilization
- Infrastructure-as-Code (Helm, GitOps) enables reproducible deployments