# Lab 4.4.5: Kubernetes Deployment - SOLUTION

**Module:** 4.4 - Containerization & Cloud Deployment  
**This is the complete solution notebook with all exercises solved.**

---

## Exercise 1 Solution: Complete ML Inference Stack

In [None]:
import sys
sys.path.insert(0, '..')

from scripts.k8s_utils import (
    generate_deployment_manifest,
    generate_service_manifest,
    generate_hpa_manifest,
    generate_configmap_manifest,
    generate_pvc_manifest,
    save_manifests,
)

# Generate all components for ML inference stack
resources = []

# 1. ConfigMap for model configuration
configmap = generate_configmap_manifest(
    name="llm-config",
    data={
        "MODEL_PATH": "/models/llama-8b",
        "MAX_BATCH_SIZE": "32",
        "MAX_INPUT_LENGTH": "2048",
        "MAX_OUTPUT_LENGTH": "512",
        "TEMPERATURE": "0.7",
        "TOP_P": "0.9",
    },
)
resources.append(configmap)

# 2. PVC for model storage
pvc = generate_pvc_manifest(
    name="model-storage",
    storage_size="100Gi",
    storage_class="fast-ssd",
)
resources.append(pvc)

# 3. Deployment with GPU
deployment = generate_deployment_manifest(
    name="llm-inference",
    image="my-registry/llm-inference:v1.0",
    replicas=2,
    port=8000,
    gpu_count=1,
    memory_request="32Gi",
    memory_limit="64Gi",
    cpu_request="8",
    cpu_limit="16",
    env_vars={
        "CUDA_VISIBLE_DEVICES": "0",
    },
    health_path="/health",
    volumes=[{
        "name": "model-storage",
        "persistentVolumeClaim": {"claimName": "model-storage"},
    }, {
        "name": "config",
        "configMap": {"name": "llm-config"},
    }],
    volume_mounts=[{
        "name": "model-storage",
        "mountPath": "/models",
    }, {
        "name": "config",
        "mountPath": "/etc/config",
    }],
)
resources.append(deployment)

# 4. Service (LoadBalancer for external access)
service = generate_service_manifest(
    name="llm-inference",
    port=80,
    target_port=8000,
    service_type="LoadBalancer",
)
resources.append(service)

# 5. HPA for auto-scaling
hpa = generate_hpa_manifest(
    deployment_name="llm-inference",
    min_replicas=1,
    max_replicas=5,
    cpu_target=70,
)
resources.append(hpa)

print(f"Generated {len(resources)} resources:")
for r in resources:
    print(f"  - {r.kind}: {r.name}")

# Print all manifests
print("\n" + "=" * 60)
print("COMPLETE K8s MANIFESTS:")
print("=" * 60)
for resource in resources:
    print(f"\n# {resource.kind}: {resource.name}")
    print(resource.to_yaml())
    print("---")
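
The cell above prints each manifest followed by `---`. To apply the stack with `kubectl apply -f`, the same documents can be written to a single multi-document YAML file. The sketch below assumes only that each resource yields a YAML string, as `to_yaml()` does above. `join_yaml_documents` and `write_stack` are hypothetical helper names and are not part of `scripts.k8s_utils`; the imported `save_manifests` may already cover this, but its signature is not shown in this notebook.

```python
from pathlib import Path
from typing import Iterable


def join_yaml_documents(documents: Iterable[str]) -> str:
    """Join YAML documents into one stream separated by '---' lines."""
    # Drop empty entries and surrounding whitespace so separators stay clean.
    cleaned = [doc.strip() for doc in documents if doc.strip()]
    return "\n---\n".join(cleaned) + "\n"


def write_stack(documents: Iterable[str], path: str) -> Path:
    """Write the joined documents to `path`, creating parent directories."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(join_yaml_documents(documents), encoding="utf-8")
    return target
```

In this notebook it might be called as `write_stack([r.to_yaml() for r in resources], "manifests/llm-stack.yaml")`, where the output path is only an example.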

## Exercise 2 Solution: Pod Disruption Budget

In [None]:
# Pod Disruption Budget for high availability

pdb_manifest = '''
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-inference-pdb
spec:
  # Ensure at least 1 pod is always available during disruptions
  # (node maintenance, upgrades, etc.)
  minAvailable: 1
  
  # Alternative: maxUnavailable: 1
  # (allow only 1 pod to be down at a time)
  
  selector:
    matchLabels:
      app: llm-inference
'''

print("POD DISRUPTION BUDGET:")
print("=" * 60)
print(pdb_manifest)

print("\nWHY PDB IS IMPORTANT:")
print("  - Prevents all pods being evicted at once during a node drain")
print("  - Keeps capacity available during cluster upgrades (voluntary disruptions only)")
print("  - Applies to evictions only; direct pod or Deployment deletion bypasses it")
print("  - Strongly recommended for production workloads")
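
How many evictions a `minAvailable` budget permits depends on how many pods are currently healthy. The sketch below is a simplified model of that arithmetic (it ignores percentage values and pods that are not ready); `allowed_disruptions` is a hypothetical helper name. It shows why the HPA floor of `min_replicas=1` from Exercise 1 matters: with a single replica, `minAvailable: 1` leaves no room for a node drain to evict the pod.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions a PDB with `minAvailable` permits right now."""
    # The budget never goes negative: below the floor, no eviction is allowed.
    return max(0, healthy_pods - min_available)


# The Deployment starts with 2 replicas; its HPA may scale it down to 1.
for healthy in (2, 1):
    budget = allowed_disruptions(healthy, min_available=1)
    print(f"healthy={healthy} minAvailable=1 -> allowed disruptions: {budget}")
```

One way to avoid a blocked drain is to keep the HPA minimum above the PDB's `minAvailable`.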

## Exercise 3 Solution: Network Policy

In [None]:
# Network Policy for security

network_policy = '''
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llm-inference-network-policy
spec:
  podSelector:
    matchLabels:
      app: llm-inference
  policyTypes:
    - Ingress
    - Egress
  
  ingress:
    # Allow traffic from API gateway
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8000
    
    # Allow traffic from monitoring
    # (kubernetes.io/metadata.name is set automatically on every namespace)
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 9090  # Prometheus metrics
  
  egress:
    # Allow DNS resolution (UDP, plus TCP for large responses)
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    
    # Allow access to model storage (NFS, or S3 through a private endpoint in this range)
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8  # Internal network
      ports:
        - protocol: TCP
          port: 2049  # NFS
        - protocol: TCP
          port: 443   # S3/HTTPS
'''

print("NETWORK POLICY:")
print("=" * 60)
print(network_policy)

print("\nSECURITY BENEFITS:")
print("  - Limits attack surface (only allowed traffic)")
print("  - Restricts lateral movement in the cluster")
print("  - Supports a zero-trust networking model")
print("  - Helps meet compliance requirements (SOC2, HIPAA)")
print("  - Note: takes effect only with a CNI plugin that enforces NetworkPolicy")
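
The storage egress rule matches on destination IP range and port. The sketch below models just that one rule with the standard `ipaddress` module, so you can check whether a given endpoint would fall inside it. It is a simplified illustration, not how a CNI plugin evaluates policies; `egress_ip_allowed` is a hypothetical helper name and the sample addresses are made up.

```python
import ipaddress

# CIDR and ports copied from the storage egress rule in the policy above.
INTERNAL_CIDR = ipaddress.ip_network("10.0.0.0/8")
STORAGE_PORTS = {2049, 443}  # NFS and HTTPS


def egress_ip_allowed(ip: str, port: int) -> bool:
    """True if a TCP destination matches the storage egress rule."""
    return ipaddress.ip_address(ip) in INTERNAL_CIDR and port in STORAGE_PORTS


print(egress_ip_allowed("10.12.0.5", 2049))   # internal NFS server
print(egress_ip_allowed("52.216.0.1", 443))   # public address, outside 10.0.0.0/8
```

A destination outside `10.0.0.0/8` is not matched by this rule, so storage reached over a public address would need its own `ipBlock` entry.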

## Exercise 4 Solution: Custom HPA with GPU Metrics

In [None]:
# HPA with custom GPU metrics

gpu_hpa_manifest = '''
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 1
  maxReplicas: 5
  
  metrics:
    # Standard CPU metric
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    
    # Custom GPU utilization metric
    # Requires DCGM exporter + Prometheus adapter
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "80"  # 80% GPU utilization
    
    # Custom GPU memory metric
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_FB_USED_PERCENT
        target:
          type: AverageValue
          averageValue: "85"  # 85% GPU memory usage
    
    # Requests per second (from Prometheus)
    - type: Object
      object:
        metric:
          name: http_requests_per_second
        describedObject:
          apiVersion: v1
          kind: Service
          name: llm-inference-service  # Must match the deployed Service name (Exercise 1 passes name="llm-inference")
        target:
          type: Value
          value: "100"  # Scale at 100 RPS
  
  behavior:
    scaleDown:
      # Slow scale down (models take time to load)
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
    scaleUp:
      # Scale up faster than down: up to 2 pods per minute.
      # The scale-up default window is 0; 60 seconds damps brief spikes.
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
'''

print("GPU-AWARE HPA:")
print("=" * 60)
print(gpu_hpa_manifest)

print("\nREQUIREMENTS FOR GPU METRICS:")
print("  1. Install DCGM exporter:")
print("     kubectl apply -f https://github.com/NVIDIA/dcgm-exporter/...")
print("  2. Install Prometheus adapter:")
print("     helm install prometheus-adapter prometheus-community/prometheus-adapter")
print("  3. Configure custom metrics API")
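
When several metrics are listed, the HPA computes a replica proposal for each one and uses the largest. The sketch below follows the documented core formula, `ceil(currentReplicas * currentValue / targetValue)`, with the default 10% tolerance. It is a simplified model: it ignores pod readiness, missing metrics, and the `behavior` policies above. The function names and the sample readings are hypothetical.

```python
import math
from typing import Iterable, Tuple


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Replica proposal for a single metric."""
    ratio = current_value / target_value
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def hpa_recommendation(current_replicas: int,
                       metrics: Iterable[Tuple[float, float]],
                       min_replicas: int = 1, max_replicas: int = 5) -> int:
    """Largest per-metric proposal, clamped to [min_replicas, max_replicas]."""
    proposals = [desired_replicas(current_replicas, cur, tgt) for cur, tgt in metrics]
    return max(min_replicas, min(max_replicas, max(proposals)))


# 2 replicas, CPU at 35% (target 70) and GPU utilization at 95 (target 80).
print(hpa_recommendation(2, [(35, 70), (95, 80)]))  # GPU metric dominates -> 3
```

Because the maximum wins, a saturated GPU triggers a scale-up even while CPU utilization is well below its target.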

## Challenge Solution: Blue-Green Deployment

In [None]:
# Blue-Green deployment strategy

blue_green_manifests = '''
# ==============================================
# Blue-Green Deployment for LLM Inference
# ==============================================

# Blue Deployment (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-blue
  labels:
    app: llm-inference
    version: blue
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
      version: blue
  template:
    metadata:
      labels:
        app: llm-inference
        version: blue
    spec:
      containers:
      - name: llm-inference
        image: my-registry/llm-inference:v1.0
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
---
# Green Deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-green
  labels:
    app: llm-inference
    version: green
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
      version: green
  template:
    metadata:
      labels:
        app: llm-inference
        version: green
    spec:
      containers:
      - name: llm-inference
        image: my-registry/llm-inference:v2.0  # New version
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
---
# Service (points to blue by default)
apiVersion: v1
kind: Service
metadata:
  name: llm-inference-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8000
  selector:
    app: llm-inference
    version: blue  # Change to "green" to switch
'''

print("BLUE-GREEN DEPLOYMENT:")
print("=" * 60)
print(blue_green_manifests)

# Switching script
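switch_script = '''#!/bin/bash
# switch-traffic.sh - Switch between blue and green
set -euo pipefail

TARGET="${1:-green}"  # Default to green

# Accept only the two known versions
if [[ "$TARGET" != "blue" && "$TARGET" != "green" ]]; then
  echo "Usage: $0 [blue|green]" >&2
  exit 1
fi

# Wait until the target Deployment is fully rolled out before switching
kubectl rollout status "deployment/llm-inference-$TARGET"

echo "Switching traffic to: $TARGET"

# Update service selector (the variable is quoted so the JSON stays intact)
kubectl patch service llm-inference-service -p '{"spec":{"selector":{"version":"'"$TARGET"'"}}}'

# Verify
echo "Current service target:"
kubectl get service llm-inference-service -o jsonpath='{.spec.selector.version}'
echo ""
'''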

print("\nTRAFFIC SWITCHING SCRIPT:")
print("=" * 60)
print(switch_script)
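
The patch payload can also be built in Python rather than by shell string concatenation, which sidesteps quoting problems. The sketch below is one way to do that with the standard `json` module; `build_selector_patch` and `VALID_TARGETS` are hypothetical names introduced for illustration.

```python
import json

VALID_TARGETS = ("blue", "green")


def build_selector_patch(target: str) -> str:
    """Return the JSON merge patch that points the Service at `target`."""
    if target not in VALID_TARGETS:
        raise ValueError(f"target must be one of {VALID_TARGETS}, got {target!r}")
    patch = {"spec": {"selector": {"version": target}}}
    # Compact separators keep the payload free of extra whitespace.
    return json.dumps(patch, separators=(",", ":"))


print(build_selector_patch("green"))
# {"spec":{"selector":{"version":"green"}}}
```

The resulting string could then be passed as the `-p` argument of `kubectl patch service llm-inference-service`, for example through `subprocess.run` with an argument list.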

---

## Summary

This solution demonstrated:

1. **Complete ML Stack**
   - ConfigMap for configuration
   - PVC for model storage
   - GPU-enabled Deployment
   - LoadBalancer Service
   - HPA for auto-scaling

2. **Production Features**
   - Pod Disruption Budget for HA
   - Network Policy for security
   - GPU-aware HPA with custom metrics
   - Blue-Green deployment strategy