# Lab 4.4.5: Kubernetes for ML Deployments

**Module:** 4.4 - Containerization & Cloud Deployment  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐ (Advanced)

---

## Learning Objectives

By the end of this lab, you will:
- [ ] Understand core Kubernetes concepts (Pods, Deployments, Services)
- [ ] Create deployment manifests for ML inference servers
- [ ] Configure GPU scheduling in Kubernetes
- [ ] Implement Horizontal Pod Autoscaler (HPA)
- [ ] Monitor and troubleshoot K8s deployments

---

## Prerequisites

- Docker image ready (from Lab 4.4.1)
- minikube or kind installed (or access to a K8s cluster)
- kubectl CLI installed

**Note:** This lab can be completed without a running cluster using manifest generation.

---

## Real-World Context

**58% of organizations use Kubernetes for AI workloads.**

Why Kubernetes over Docker Compose?

| Feature | Docker Compose | Kubernetes |
|---------|---------------|------------|
| Scale | Single host | Multi-node cluster |
| Failover | Manual | Automatic |
| Updates | Downtime | Rolling updates |
| Scheduling | Simple | GPU-aware |
| Production | Dev/test | Enterprise-ready |

---

## ELI5: What is Kubernetes?

> **Imagine you're running a fleet of food trucks...**
>
> Docker Compose is like having one food truck you manually park and manage.
>
> **Kubernetes is like having a dispatcher** who:
> - Sends trucks where they're needed (scheduling)
> - Calls in backup trucks when one breaks (failover)
> - Adds more trucks during lunch rush (auto-scaling)
> - Upgrades trucks one at a time without closing (rolling updates)
>
> **K8s concepts:**
> - **Pod** = Food truck
> - **Deployment** = Fleet management rules
> - **Service** = How customers find your trucks
> - **Node** = Parking lot for trucks

---

## Part 1: Kubernetes Architecture

### Core Components

```
┌─────────────────────────────────────────────────────────────┐
│                    Kubernetes Cluster                        │
├─────────────────────────────────────────────────────────────┤
│  Control Plane                                               │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐       │
│  │   API    │ │  etcd    │ │Scheduler │ │Controller│       │
│  │  Server  │ │          │ │          │ │ Manager  │       │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘       │
├─────────────────────────────────────────────────────────────┤
│  Worker Nodes                                                │
│  ┌─────────────────────┐  ┌─────────────────────┐          │
│  │      Node 1         │  │      Node 2         │          │
│  │  ┌─────┐ ┌─────┐   │  │  ┌─────┐ ┌─────┐   │          │
│  │  │Pod 1│ │Pod 2│   │  │  │Pod 3│ │Pod 4│   │          │
│  │  └─────┘ └─────┘   │  │  └─────┘ └─────┘   │          │
│  │  [GPU]             │  │  [GPU]             │          │
│  └─────────────────────┘  └─────────────────────┘          │
└─────────────────────────────────────────────────────────────┘
```

| Component | Description |
|-----------|-------------|
| **Pod** | Smallest deployable unit (1+ containers) |
| **Deployment** | Manages Pod replicas and updates |
| **Service** | Exposes Pods to network traffic |
| **Node** | Physical/virtual machine running Pods |

In [None]:
# Check Kubernetes environment
import os
import shutil
import subprocess


def run(cmd):
    """Run a command; return None instead of raising if the binary is missing."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, capture_output=True, text=True)


print("Kubernetes Environment Check")
print("=" * 60)

# Check kubectl (--short is omitted because newer kubectl releases no longer accept it)
result = run(["kubectl", "version", "--client"])
if result is not None and result.returncode == 0:
    print(f"kubectl: {result.stdout.strip()}")
else:
    print("kubectl not installed")
    print("   Install: https://kubernetes.io/docs/tasks/tools/")

# Check cluster connection
result = run(["kubectl", "cluster-info"])
if result is not None and result.returncode == 0:
    print("Cluster: Connected")
    # Check for GPU nodes (raw string keeps the backslash that escapes the dot in the key)
    result = run(
        ["kubectl", "get", "nodes", "-o", r"jsonpath={.items[*].status.capacity.nvidia\.com/gpu}"]
    )
    if result.stdout.strip():
        print(f"GPUs detected: {result.stdout.strip()}")
else:
    print("No cluster connected")
    print("   For local testing, install minikube or kind")

# Check minikube (optional)
result = run(["minikube", "version"])
if result is not None and result.returncode == 0:
    print(f"minikube: {result.stdout.strip().split()[2]}")

print("\n" + "=" * 60)

In [None]:
# Import our K8s utilities
import sys
sys.path.insert(0, '..')

from scripts.k8s_utils import (
    K8sDeploymentManager,
    generate_deployment_manifest,
    generate_service_manifest,
    generate_hpa_manifest,
    generate_complete_ml_stack,
    save_manifests,
)

print("K8s utilities loaded!")

---

## Part 2: Creating a Deployment Manifest

### Deployment YAML Structure

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2              # Number of Pods
  selector:
    matchLabels:
      app: my-app
  template:                # Pod template
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-image:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # GPU request
```

In [None]:
# Generate a deployment manifest for LLM inference
deployment = generate_deployment_manifest(
    name="llm-inference",
    image="llm-inference:latest",
    replicas=2,
    port=8000,
    gpu_count=1,
    memory_request="16Gi",
    memory_limit="32Gi",
    cpu_request="4",
    cpu_limit="8",
    env_vars={
        "MODEL_PATH": "/models/llama-8b",
        "MAX_BATCH_SIZE": "8",
        "CUDA_VISIBLE_DEVICES": "0",
    },
    health_path="/health",
)

print("Generated Deployment Manifest:")
print("=" * 60)
print(deployment.to_yaml())

### Understanding the Manifest

| Field | Purpose |
|-------|--------|
| `replicas` | How many Pod copies to run |
| `selector.matchLabels` | How Deployment finds its Pods |
| `resources.limits.nvidia.com/gpu` | Request GPU access |
| `livenessProbe` | Restart the container if the check fails |
| `readinessProbe` | Don't send traffic until ready |
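
The YAML skeleton earlier in this part does not show the two probes listed in the table. As a rough sketch of where they sit inside a container spec, the snippet below builds one as a plain Python dict and prints it as JSON, which `kubectl apply -f` also accepts. `build_container_spec` and its parameters are hypothetical names used only for illustration (they are not part of `scripts.k8s_utils`), and the delay and period values are placeholders to tune for your model.

```python
import json


def build_container_spec(name, image, port, health_path="/health", startup_seconds=120):
    """Hypothetical helper: container spec with liveness and readiness probes."""
    http_check = {"httpGet": {"path": health_path, "port": port}}
    return {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Liveness: the kubelet restarts the container when this check keeps failing.
        "livenessProbe": {
            **http_check,
            "initialDelaySeconds": startup_seconds,
            "periodSeconds": 10,
        },
        # Readiness: the Pod receives no Service traffic until this check passes.
        "readinessProbe": {
            **http_check,
            "initialDelaySeconds": 10,
            "periodSeconds": 5,
        },
    }


container = build_container_spec("llm-inference", "llm-inference:latest", 8000)
print(json.dumps(container, indent=2))
```

Both probes hit the same endpoint here; a real server might expose a cheaper liveness check and a readiness check that also confirms the model is loaded.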

---

## Part 3: Creating a Service

### Service Types

| Type | Use Case | Access |
|------|----------|--------|
| **ClusterIP** | Internal only | Within cluster |
| **NodePort** | Development | Node IP + port |
| **LoadBalancer** | Production | External IP |
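
To make the table concrete, here is a minimal sketch of what a Service manifest contains, written as a Python dict. `build_service` is a hypothetical helper for illustration, not the lab's `generate_service_manifest`, and it assumes the Pods carry an `app: <name>` label as in the Deployment example from Part 2. Only the three types from the table are accepted.

```python
import json

# Service types covered in the table above.
VALID_SERVICE_TYPES = {"ClusterIP", "NodePort", "LoadBalancer"}


def build_service(name, port, target_port, service_type="ClusterIP"):
    """Hypothetical helper: build a Service manifest as a dict."""
    if service_type not in VALID_SERVICE_TYPES:
        raise ValueError(f"unsupported service type: {service_type}")
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": service_type,
            # The selector must match the Pod labels set by the Deployment.
            "selector": {"app": name},
            # `port` is what clients connect to; `targetPort` is the container port.
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


svc = build_service("llm-inference", 80, 8000, "LoadBalancer")
print(json.dumps(svc, indent=2))
```

The selector is the piece that most often goes wrong: if it does not match the Pod labels, the Service exists but has no endpoints.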

In [None]:
# Generate a Service manifest
service = generate_service_manifest(
    name="llm-inference",
    port=80,
    target_port=8000,
    service_type="LoadBalancer",
)

print("Generated Service Manifest:")
print("=" * 60)
print(service.to_yaml())

---

## Part 4: Horizontal Pod Autoscaler (HPA)

### ELI5: Auto-Scaling

> **Like a restaurant adding tables...**
>
> When customers are waiting (high CPU), the manager opens more tables (adds Pods).
> When the rush ends (low CPU), tables are closed (Pods removed).
>
> **HPA watches:**
> - CPU utilization
> - Memory usage
> - Custom metrics (requests/second)

In [None]:
# Generate HPA manifest
hpa = generate_hpa_manifest(
    deployment_name="llm-inference",
    min_replicas=1,
    max_replicas=5,
    cpu_target=70,  # Target average CPU utilization (% of requested CPU)
)

print("Generated HPA Manifest:")
print("=" * 60)
print(hpa.to_yaml())

In [None]:
# Explain HPA behavior
hpa_explanation = '''
# HPA Scaling Behavior
# ====================

Given:
  - Target CPU: 70%
  - Min replicas: 1
  - Max replicas: 5

Scaling Logic:
  
  desiredReplicas = ceil[currentReplicas × (currentMetric / targetMetric)]

Example:
  - Current: 2 replicas at 85% CPU each
  - Desired = ceil[2 × (85/70)] = ceil[2.43] = 3 replicas
  
Stabilization Windows (autoscaling/v2 defaults):
  - Scale UP: 0 seconds (reacts immediately to load)
  - Scale DOWN: 300 seconds (5 minutes, avoids flapping)

Best Practices for ML:
  1. Set higher min replicas (model loading is slow)
  2. Use custom metrics (requests/sec better than CPU for inference)
  3. Increase stabilization window (models take time to warm up)
'''

print(hpa_explanation)
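
The scaling formula can be checked with a few lines of Python. This is a simplified sketch, not the controller's actual implementation: `desired_replicas` is a hypothetical function, it clamps the result to the min/max bounds, and it assumes the default 10% tolerance band within which the HPA leaves the replica count unchanged. It ignores stabilization windows and pods that are not yet ready.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=5, tolerance=0.1):
    """Simplified HPA calculation: ceil(current * metric/target), then clamp."""
    ratio = current_metric / target_metric
    # Close enough to the target: do not scale.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, 85, 70))   # 3: ceil(2 * 85/70) = ceil(2.43)
print(desired_replicas(2, 72, 70))   # 2: within the tolerance band
print(desired_replicas(4, 140, 70))  # 5: ceil(8.0) clamped to max_replicas
print(desired_replicas(3, 20, 70))   # 1: ceil(0.86) = min_replicas
```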

---

## Part 5: GPU Scheduling

### NVIDIA Device Plugin

Kubernetes uses the NVIDIA device plugin to:
1. Discover GPUs on nodes
2. Advertise GPU resources
3. Allocate GPUs to Pods
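
Once the plugin advertises GPUs, they show up under each node's `status.allocatable` as `nvidia.com/gpu`. The sketch below reads that field from the JSON structure returned by `kubectl get nodes -o json`. The `sample_nodes` data is made up for illustration, and `gpus_per_node` is a hypothetical helper; on a live cluster you would pass in the parsed command output instead.

```python
def gpus_per_node(nodes_doc, resource="nvidia.com/gpu"):
    """Map node name -> allocatable GPU count from a `kubectl get nodes -o json` document."""
    counts = {}
    for node in nodes_doc.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Quantities are strings in the API; nodes without the plugin lack the key.
        counts[node["metadata"]["name"]] = int(allocatable.get(resource, "0"))
    return counts


# Made-up sample shaped like the kubectl output (only the fields used here).
sample_nodes = {
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "16", "nvidia.com/gpu": "2"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "8"}}},
    ]
}

counts = gpus_per_node(sample_nodes)
print(counts)
print("Total GPUs:", sum(counts.values()))
```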

In [None]:
# GPU scheduling configuration examples
gpu_scheduling = '''
# GPU Scheduling in Kubernetes
# ============================

# 1. Request specific number of GPUs
resources:
  limits:
    nvidia.com/gpu: 1  # Request 1 GPU

# 2. Request specific GPU type (with node labels)
nodeSelector:
  gpu-type: a100

# 3. Tolerate GPU node taints
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule

# 4. GPU affinity (prefer GPU nodes; matches node labels, so nodes must carry a label with this key)
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: nvidia.com/gpu
          operator: Exists

# 5. Check GPU availability
kubectl describe nodes | grep -A10 "Allocatable:"
# Look for: nvidia.com/gpu: 1

# 6. Install NVIDIA device plugin (if not installed)
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
'''

print(gpu_scheduling)
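
The fragments above live at different levels of a Pod spec: the GPU limit belongs to a container, while `nodeSelector` and `tolerations` belong to the Pod. The following sketch shows how they fit together. `with_gpu_scheduling` is a hypothetical helper, the `gpu-type` label is the custom label assumed in the example above, and the GPU limit is applied to the first container only.

```python
import copy
import json


def with_gpu_scheduling(pod_spec, gpu_count=1, gpu_type=None):
    """Return a copy of pod_spec with a GPU limit, optional node selector, and toleration."""
    spec = copy.deepcopy(pod_spec)

    # Container level: GPUs are requested through resource limits.
    limits = spec["containers"][0].setdefault("resources", {}).setdefault("limits", {})
    limits["nvidia.com/gpu"] = gpu_count

    # Pod level: pin to a GPU model via a custom node label (assumed to exist).
    if gpu_type:
        spec.setdefault("nodeSelector", {})["gpu-type"] = gpu_type

    # Pod level: allow scheduling onto nodes tainted for GPU workloads.
    spec.setdefault("tolerations", []).append(
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
    )
    return spec


base = {"containers": [{"name": "llm-inference", "image": "llm-inference:latest"}]}
spec = with_gpu_scheduling(base, gpu_count=1, gpu_type="a100")
print(json.dumps(spec, indent=2))
```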

---

## Part 6: Complete ML Stack

Let's generate a complete ML inference stack.

In [None]:
# Generate complete ML stack
resources = generate_complete_ml_stack(
    name="llm-inference",
    image="my-registry/llm-inference:v1.0",
    port=8000,
    replicas=2,
    gpu_count=1,
    enable_hpa=True,
    enable_pvc=True,
    storage_size="50Gi",
    env_vars={
        "MODEL_PATH": "/models/llama-8b",
        "MAX_BATCH_SIZE": "32",
    },
)

print(f"Generated {len(resources)} Kubernetes resources:")
for r in resources:
    print(f"  - {r.kind}: {r.name}")

In [None]:
# Save manifests to files
import os

output_dir = "../configs/k8s"
os.makedirs(output_dir, exist_ok=True)

saved_files = save_manifests(resources, output_dir, single_file=False)

print("Saved manifest files:")
for f in saved_files:
    print(f"  - {f}")

In [None]:
# Show combined manifest
print("Combined Manifests (all-in-one):")
print("=" * 60)

for resource in resources:
    print(resource.to_yaml())
    print("---")

---

## Part 7: Deploying and Managing

### Common kubectl Commands

In [None]:
# Common kubectl commands for ML deployments
kubectl_commands = '''
# Kubernetes Commands Cheatsheet for ML
# =====================================

# Deploy
kubectl apply -f deployment.yaml
kubectl apply -f ./configs/k8s/  # Apply all manifests in directory

# Check status
kubectl get pods -w                     # Watch pod status
kubectl get deployment llm-inference    # Check deployment
kubectl describe pod <pod-name>         # Detailed pod info

# View logs
kubectl logs <pod-name>                 # Current logs
kubectl logs -f <pod-name>              # Follow logs
kubectl logs <pod-name> --previous      # Previous container logs

# Debug
kubectl exec -it <pod-name> -- bash     # Shell into pod
kubectl port-forward svc/llm-inference 8000:80  # Local access

# Scale
kubectl scale deployment llm-inference --replicas=3
kubectl autoscale deployment llm-inference --min=1 --max=5 --cpu-percent=70

# Update
kubectl set image deployment/llm-inference llm-inference=my-image:v2
kubectl rollout status deployment/llm-inference  # Watch rollout
kubectl rollout undo deployment/llm-inference    # Rollback

# Delete
kubectl delete -f deployment.yaml
kubectl delete deployment llm-inference

# GPU-specific
kubectl describe nodes | grep -A5 nvidia.com/gpu  # Check GPU availability
kubectl top nodes                                 # Resource usage
'''

print(kubectl_commands)

In [None]:
# Create a deployment helper script
deploy_script = '''#!/bin/bash
# ML Deployment Script
# Usage: ./deploy.sh {apply|delete|status|logs|port-forward}

NAMESPACE=${NAMESPACE:-default}
DEPLOYMENT="llm-inference"
# Resolve the manifest directory relative to this script, not the caller's working directory
MANIFEST_DIR="$(cd "$(dirname "$0")" && pwd)"

case "$1" in
    apply)
        echo "Deploying ML stack..."
        kubectl apply -f "$MANIFEST_DIR" -n "$NAMESPACE"
        echo
        echo "Waiting for rollout..."
        kubectl rollout status "deployment/$DEPLOYMENT" -n "$NAMESPACE"
        ;;
    delete)
        echo "Deleting ML stack..."
        kubectl delete -f "$MANIFEST_DIR" -n "$NAMESPACE"
        ;;
    status)
        echo "Deployment Status:"
        kubectl get deployment,svc,hpa,pvc -n "$NAMESPACE"
        echo
        echo "Pod Status:"
        kubectl get pods -n "$NAMESPACE" -l "app=$DEPLOYMENT"
        ;;
    logs)
        POD=$(kubectl get pods -n "$NAMESPACE" -l "app=$DEPLOYMENT" -o jsonpath="{.items[0].metadata.name}")
        kubectl logs -f "$POD" -n "$NAMESPACE"
        ;;
    port-forward)
        echo "Port forwarding to localhost:8000..."
        kubectl port-forward "svc/${DEPLOYMENT}-service" 8000:80 -n "$NAMESPACE"
        ;;
    *)
        echo "Usage: $0 {apply|delete|status|logs|port-forward}"
        exit 1
        ;;
esac
'''

with open("../configs/k8s/deploy.sh", "w") as f:
    f.write(deploy_script)

os.chmod("../configs/k8s/deploy.sh", 0o755)
print("Created: configs/k8s/deploy.sh")

---

## Part 8: Monitoring and Troubleshooting

In [None]:
# Troubleshooting guide
troubleshooting = '''
# Kubernetes Troubleshooting for ML
# ==================================

## Pod stuck in Pending

# Check events
kubectl describe pod <pod-name>

# Common causes:
# - No GPU nodes available
# - Insufficient memory
# - PVC not bound

## Pod in CrashLoopBackOff

# Check logs from crashed container
kubectl logs <pod-name> --previous

# Common causes:
# - Model loading failure (OOM)
# - Missing dependencies
# - Health check failing

## GPU not detected

# Verify NVIDIA device plugin
kubectl get pods -n kube-system | grep nvidia

# Check node GPU resources
kubectl describe node <node-name> | grep nvidia

# Fix: Install device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

## High Latency

# Check if HPA is scaling
kubectl get hpa

# Check pod resource usage
kubectl top pods

# Consider:
# - Increase replicas
# - Use larger instance type
# - Enable request batching

## Memory OOM

# Check pod events
kubectl describe pod <pod-name> | grep -A5 "Events:"

# Solutions:
# - Increase memory limits
# - Use model quantization
# - Reduce batch size or maximum sequence length
'''

print(troubleshooting)
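
Several of the symptoms above can be read directly from a Pod's `status` block. As a rough sketch, the function below inspects one Pod object (one item from `kubectl get pods -o json`) and reports the three conditions covered in the guide. `diagnose_pod` is a hypothetical helper, and the sample Pods are made up and contain only the fields the function reads.

```python
def diagnose_pod(pod):
    """Return human-readable findings for a Pod object from `kubectl get pods -o json`."""
    status = pod.get("status", {})
    findings = []

    # Pending: the PodScheduled condition carries the scheduler's reason.
    if status.get("phase") == "Pending":
        for cond in status.get("conditions", []):
            if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
                findings.append(f"Unschedulable: {cond.get('message', 'no message')}")

    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            findings.append(f"{cs['name']}: CrashLoopBackOff (restarts={cs.get('restartCount', 0)})")
        terminated = cs.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            findings.append(f"{cs['name']}: last run was OOMKilled")
    return findings


# Made-up sample Pods for illustration.
pending_pod = {"status": {"phase": "Pending", "conditions": [
    {"type": "PodScheduled", "status": "False",
     "message": "0/3 nodes are available: 3 Insufficient nvidia.com/gpu."}]}}
crashing_pod = {"status": {"phase": "Running", "containerStatuses": [
    {"name": "llm-inference", "restartCount": 4,
     "state": {"waiting": {"reason": "CrashLoopBackOff"}},
     "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}]}}

for pod in (pending_pod, crashing_pod):
    print(diagnose_pod(pod))
```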

---

## Common Mistakes

### Mistake 1: Not Setting Resource Limits

```yaml
# BAD - Pod can consume all node resources
containers:
- name: inference
  image: my-image

# GOOD - Requests and limits prevent noisy-neighbor problems
containers:
- name: inference
  image: my-image
  resources:
    requests:
      memory: "16Gi"
      cpu: "4"
    limits:
      memory: "32Gi"
      nvidia.com/gpu: 1
```
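
A check like this can be automated before manifests are applied. The sketch below flags container specs that lack the settings shown in the GOOD example. `missing_resource_settings` is a hypothetical helper, and the set of required fields (CPU and memory requests plus a memory limit) is an assumption chosen to mirror that example, not a Kubernetes rule.

```python
def missing_resource_settings(container):
    """List the resource fields a container spec does not set."""
    resources = container.get("resources", {})
    required = (("requests", ("cpu", "memory")), ("limits", ("memory",)))
    missing = []
    for section, keys in required:
        for key in keys:
            if key not in resources.get(section, {}):
                missing.append(f"{section}.{key}")
    return missing


bad = {"name": "inference", "image": "my-image"}
good = {
    "name": "inference",
    "image": "my-image",
    "resources": {
        "requests": {"memory": "16Gi", "cpu": "4"},
        "limits": {"memory": "32Gi", "nvidia.com/gpu": 1},
    },
}

print(missing_resource_settings(bad))
print(missing_resource_settings(good))
```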

---

### Mistake 2: Short Liveness Probe Initial Delay

```yaml
# BAD - LLMs can take minutes to load
livenessProbe:
  initialDelaySeconds: 10  # Too short!

# GOOD - Allow time for model loading
livenessProbe:
  initialDelaySeconds: 120  # 2 minutes
  periodSeconds: 10
  timeoutSeconds: 5
```
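
An alternative to a long fixed delay is a `startupProbe`: liveness and readiness checks are held back until it succeeds, and the container is restarted only after `failureThreshold × periodSeconds` seconds without success. The arithmetic is sketched below. `startup_probe` is a hypothetical helper, and the 300-second budget is an assumed worst-case model load time, not a measured value.

```python
import math


def startup_probe(path, port, max_startup_seconds, period_seconds=10):
    """Size a startupProbe so the container gets max_startup_seconds to come up."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        # Allowed startup time = failureThreshold * periodSeconds.
        "failureThreshold": math.ceil(max_startup_seconds / period_seconds),
    }


probe = startup_probe("/health", 8000, max_startup_seconds=300)
print(probe)
print("Startup budget (s):", probe["failureThreshold"] * probe["periodSeconds"])
```

Unlike a large `initialDelaySeconds`, this does not delay health checking for containers that happen to start quickly.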

---

### Mistake 3: Ignoring Pod Disruption Budgets

```yaml
# Add PDB for production deployments
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-inference-pdb
spec:
  minAvailable: 1  # Keep at least 1 Pod running during voluntary disruptions
  selector:
    matchLabels:
      app: llm-inference
```
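
How a budget plays out depends on the replica count. As a simplified sketch of the rule for `minAvailable`, the number of Pods that may be evicted at once is the number of healthy Pods minus the minimum. `allowed_disruptions` is a hypothetical function for illustration; the real controller also handles percentages and `maxUnavailable`.

```python
def allowed_disruptions(healthy_pods, min_available):
    """Pods that can be voluntarily evicted without violating minAvailable."""
    return max(0, healthy_pods - min_available)


for replicas in (1, 2, 5):
    print(f"{replicas} healthy, minAvailable=1 -> "
          f"{allowed_disruptions(replicas, 1)} evictable")
```

With a single replica and `minAvailable: 1`, no eviction is allowed, so a node drain will block until another replica is running. This is worth keeping in mind when an autoscaler is allowed to scale down to one replica.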

---

## Checkpoint

You've learned:
- Core Kubernetes concepts for ML
- Creating Deployment, Service, and HPA manifests
- GPU scheduling and configuration
- Common kubectl commands
- Troubleshooting techniques

---

## Challenge (Optional)

Create a complete production-ready K8s setup with:
1. Blue-green deployment strategy
2. Ingress with SSL termination
3. Prometheus ServiceMonitor
4. Pod Disruption Budget
5. Network Policy for security

---

## Further Reading

- [Kubernetes Documentation](https://kubernetes.io/docs/)
- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/)
- [Kube-scheduler GPU Support](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/)
- [HPA Custom Metrics](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)

---

## Cleanup

In [None]:
# Cleanup commands
print("Kubernetes Cleanup")
print("=" * 60)
print("\n# Delete all resources")
print("kubectl delete -f ./configs/k8s/")
print("\n# Or use the deploy script")
print("./configs/k8s/deploy.sh delete")
print("\n# Delete persistent volumes (if needed)")
print("kubectl delete pvc llm-inference-storage")