# Chapter 8: Production Kubernetes Deployment

## 🎯 Learning Objectives

By the end of this chapter, you will:
- **Master Kubernetes GPU resource management** for LLM workloads
- **Implement auto-scaling strategies** with custom metrics
- **Deploy production-grade monitoring** and observability
- **Build fault-tolerant LLM services** with high availability
- **Optimize costs** through intelligent resource allocation

---

## 🏗️ Why Kubernetes for LLM Production?

### **The Production Reality**

Running LLMs in production requires solving complex operational challenges:

#### **Scale Challenges**
- **Dynamic demand**: Traffic can vary by 10x or more between peak and off-peak hours
- **GPU scarcity**: Limited, expensive compute resources
- **Multi-model serving**: Different models for different use cases
- **Geographic distribution**: A global user base calls for multi-region or edge deployment

#### **Reliability Requirements**
- **99.9% uptime**: Leaves a budget of roughly 8.8 hours of downtime per year, and every outage costs revenue and user trust
- **Graceful degradation**: Maintain service during partial failures
- **Rolling updates**: Deploy new models without service interruption
- **Disaster recovery**: Multi-region failover capabilities
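
To make the uptime target above concrete, the arithmetic below converts an availability percentage into an allowed-downtime budget. This is a minimal sketch: the function name `downtime_budget_hours` and the 365-day year are illustrative assumptions, not part of any Kubernetes API.

```python
def downtime_budget_hours(availability_pct: float, period_hours: float = 365 * 24) -> float:
    """Return the downtime (in hours) that an availability target allows over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_hours * (1 - availability_pct / 100)


# Compare a few common targets over one year
for target in (99.0, 99.9, 99.99):
    hours = downtime_budget_hours(target)
    print(f"{target}% uptime allows {hours:.2f} h/year ({hours * 60 / 12:.1f} min/month)")
```

Each extra "nine" shrinks the budget by a factor of ten, which is why rolling updates and failover have to be automated rather than handled manually.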

#### **Cost Optimization**
- **GPU costs**: Roughly $2-8/hour per GPU in cloud environments, depending on GPU type and provider
- **Utilization optimization**: Keep expensive GPUs busy
- **Right-sizing**: Match resource allocation to actual needs
- **Spot instances**: Savings of up to roughly 70%, provided the workload handles preemption
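
The cost points above can be turned into a quick back-of-the-envelope model. The sketch below is illustrative only: the class `GpuFleetCost`, the 730-hour month, and the example prices are assumptions chosen to match the ranges quoted in the list, not measured figures.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


@dataclass
class GpuFleetCost:
    gpu_count: int
    hourly_rate_usd: float      # assumed on-demand price per GPU-hour
    spot_discount: float = 0.0  # fraction saved on spot capacity, e.g. 0.7
    utilization: float = 1.0    # fraction of paid GPU time doing useful work

    def effective_rate(self) -> float:
        """Price actually paid per GPU-hour after the spot discount."""
        return self.hourly_rate_usd * (1 - self.spot_discount)

    def monthly_cost(self) -> float:
        """Total monthly bill for the whole fleet."""
        return self.gpu_count * self.effective_rate() * HOURS_PER_MONTH

    def cost_per_useful_gpu_hour(self) -> float:
        """Price per GPU-hour of useful work; idle time inflates this number."""
        if not 0 < self.utilization <= 1:
            raise ValueError("utilization must be in (0, 1]")
        return self.effective_rate() / self.utilization


on_demand = GpuFleetCost(gpu_count=4, hourly_rate_usd=4.0, utilization=0.5)
spot = GpuFleetCost(gpu_count=4, hourly_rate_usd=4.0, spot_discount=0.7, utilization=0.5)
print(f"On-demand: ${on_demand.monthly_cost():,.0f}/month, "
      f"${on_demand.cost_per_useful_gpu_hour():.2f} per useful GPU-hour")
print(f"Spot:      ${spot.monthly_cost():,.0f}/month, "
      f"${spot.cost_per_useful_gpu_hour():.2f} per useful GPU-hour")
```

In this model, halving utilization doubles the cost of useful work, so improving utilization and using spot capacity are complementary levers.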

### **Kubernetes Advantages**

**Kubernetes** provides production-grade orchestration that maps well onto AI workloads:

#### **Native GPU Support**
- **Device plugins**: Expose GPUs as schedulable extended resources such as `nvidia.com/gpu`
- **Resource quotas**: Prevent GPU resource monopolization
- **Node affinity**: Schedule workloads on appropriate hardware
- **Multi-GPU allocation**: Support for complex model architectures
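
As a small illustration of the GPU resource model, the sketch below builds a container `resources` block and checks the rules Kubernetes applies to extended resources: GPUs are requested in whole units under `limits`, and if `requests` is also given it must equal the limit. The helper names `gpu_container_resources` and `validate_gpu_resources` are hypothetical, and the CPU and memory values are placeholders.

```python
from typing import Any, Dict

GPU_RESOURCE = "nvidia.com/gpu"  # resource name advertised by the NVIDIA device plugin


def gpu_container_resources(gpu_count: int, cpu: str = "4", memory: str = "16Gi") -> Dict[str, Any]:
    """Build a container resources block that requests whole GPUs."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be a positive whole number")
    gpus = str(gpu_count)
    return {
        "requests": {GPU_RESOURCE: gpus, "cpu": cpu, "memory": memory},
        "limits": {GPU_RESOURCE: gpus, "memory": memory},
    }


def validate_gpu_resources(resources: Dict[str, Any]) -> None:
    """Raise ValueError if the GPU request/limit pair would be rejected."""
    request = resources.get("requests", {}).get(GPU_RESOURCE)
    limit = resources.get("limits", {}).get(GPU_RESOURCE)
    if limit is None:
        raise ValueError("GPUs must be specified under limits")
    if request is not None and request != limit:
        raise ValueError("GPU requests must equal GPU limits")


resources = gpu_container_resources(gpu_count=2)
validate_gpu_resources(resources)
print(resources)
```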

#### **Auto-scaling Excellence**
- **HPA**: Horizontal scaling on resource, custom, and external metrics
- **VPA**: Vertical scaling for right-sizing resources
- **Cluster autoscaler**: Dynamic node provisioning
- **Custom controllers**: Domain-specific scaling logic
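
To show how horizontal scaling decides on a replica count, the sketch below applies the formula documented for the HPA: `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, skipping changes when the ratio is within a tolerance of 1.0 (0.1 by default). It is a simplified model: the function name `desired_replicas` is an assumption, and the real controller also accounts for pod readiness, missing metrics, and scaling behavior policies.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA replica calculation for a single metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within the tolerance band the controller leaves the replica count alone
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds
    return max(min_replicas, min(max_replicas, desired))


# 4 pods each handling 150 RPS against a 100 RPS target
print(desired_replicas(4, current_metric=150, target_metric=100))
```

When several metrics are configured, the HPA evaluates each one and uses the largest resulting replica count.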

#### **Production Features**
- **Service mesh**: Load balancing, circuit breaking, observability
- **Secrets management**: Secure API keys and credentials
- **Configuration management**: Environment-specific settings
- **Health checks**: Automatic failure detection and recovery
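
Health-check settings translate directly into how long Kubernetes waits before restarting a container, which matters for LLM servers that take minutes to load weights. The sketch below estimates that window from the probe fields `initialDelaySeconds`, `periodSeconds`, and `failureThreshold`. It is an approximate upper bound that ignores `timeoutSeconds`; the function name and the example values are illustrative assumptions.

```python
def probe_failure_window_seconds(period: int, failure_threshold: int, initial_delay: int = 0) -> int:
    """Approximate time a probe may keep failing before Kubernetes acts on it."""
    if period <= 0 or failure_threshold <= 0 or initial_delay < 0:
        raise ValueError("period and failure_threshold must be positive, initial_delay non-negative")
    return initial_delay + period * failure_threshold


# Startup probe: how long the model may take to load before the container is restarted
startup_budget = probe_failure_window_seconds(period=15, failure_threshold=20, initial_delay=30)
# Liveness probe: how long a hung server goes undetected once it has started
liveness_window = probe_failure_window_seconds(period=30, failure_threshold=3)

print(f"Startup budget:  {startup_budget}s (~{startup_budget / 60:.1f} min)")
print(f"Liveness window: {liveness_window}s")
```

If model loading regularly approaches the startup budget, raising `failureThreshold` on the startup probe is usually safer than loosening the liveness probe, because the liveness settings keep applying for the life of the container.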

---

## 📋 Production-Ready Kubernetes Manifests

Let's build comprehensive Kubernetes manifests for LLM deployment:

In [None]:
import yaml
import json
from typing import Dict, List, Any
from dataclasses import dataclass, asdict
import os

@dataclass
class KubernetesManifestGenerator:
    """
    Production-grade Kubernetes manifest generator for LLM deployments
    
    Educational Focus:
    This class demonstrates best practices for deploying ML workloads
    in production Kubernetes environments with proper resource management,
    monitoring, and scalability configurations.
    """
    
    # Deployment configuration
    namespace: str = "llm-inference"
    app_name: str = "llm-server"
    model_name: str = "llama2-7b-chat"
    image: str = "vllm/vllm-openai:v0.2.0"
    
    # Resource configuration
    gpu_count: int = 1
    gpu_type: str = "nvidia-tesla-t4"
    cpu_request: str = "4"
    cpu_limit: str = "8"
    memory_request: str = "16Gi"
    memory_limit: str = "32Gi"
    
    # Scaling configuration
    min_replicas: int = 2
    max_replicas: int = 10
    
    def generate_namespace(self) -> Dict[str, Any]:
        """Generate the Namespace (the quota is created separately by generate_resource_quota)"""
        
        return {
            "apiVersion": "v1",
            "kind": "Namespace",
            "metadata": {
                "name": self.namespace,
                "labels": {
                    "purpose": "ai-inference",
                    "tier": "production",
                    "team": "ml-platform"
                }
            }
        }
    
    def generate_resource_quota(self) -> Dict[str, Any]:
        """Generate resource quota to prevent resource monopolization"""
        
        return {
            "apiVersion": "v1",
            "kind": "ResourceQuota",
            "metadata": {
                "name": f"{self.app_name}-resource-quota",
                "namespace": self.namespace
            },
            "spec": {
                "hard": {
                    # Quotas on extended resources such as GPUs accept only the "requests." prefix
                    "requests.nvidia.com/gpu": "20",  # Max 20 GPUs
                    "requests.cpu": "80",  # Max 80 CPU cores
                    "requests.memory": "320Gi",  # Max 320GB RAM
                    "persistentvolumeclaims": "10",  # Max 10 PVCs
                    "pods": "50"  # Max 50 pods
                }
            }
        }
    
    def generate_configmap(self) -> Dict[str, Any]:
        """Generate ConfigMap for application configuration"""
        
        return {
            "apiVersion": "v1",
            "kind": "ConfigMap",
            "metadata": {
                "name": f"{self.app_name}-config",
                "namespace": self.namespace
            },
            "data": {
                # Model configuration
                "MODEL_NAME": self.model_name,
                "MAX_SEQUENCE_LENGTH": "4096",
                "MAX_BATCH_SIZE": "32",
                "TENSOR_PARALLEL_SIZE": "1",
                
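                # vLLM configuration (the stock vLLM server takes these as CLI flags such as
                # --gpu-memory-utilization, so the container command must map them)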
                "GPU_MEMORY_UTILIZATION": "0.9",
                "MAX_NUM_BATCHED_TOKENS": "8192",
                "MAX_NUM_SEQS": "256",
                "BLOCK_SIZE": "16",
                
                # Server configuration
                "SERVER_PORT": "8000",
                "HEALTH_CHECK_PATH": "/health",
                "METRICS_PORT": "9090",
                
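                # Performance tuning
                # Expose every GPU allocated to the container (indices 0..gpu_count-1)
                "CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in range(self.gpu_count)),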
                "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:128",
                "VLLM_ENGINE_ITERATION_TIMEOUT_S": "60",
                
                # Logging and monitoring
                "LOG_LEVEL": "INFO",
                "ENABLE_PROMETHEUS_METRICS": "true",
                "ENABLE_DISTRIBUTED_TRACING": "true"
            }
        }
    
    def generate_deployment(self) -> Dict[str, Any]:
        """Generate production-ready Deployment manifest"""
        
        return {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {
                "name": f"{self.app_name}-deployment",
                "namespace": self.namespace,
                "labels": {
                    "app": self.app_name,
                    "component": "inference",
                    "version": "v1.0",
                    "model": self.model_name
                }
            },
            "spec": {
                "replicas": self.min_replicas,
                "strategy": {
                    "type": "RollingUpdate",
                    "rollingUpdate": {
                        "maxSurge": 1,
                        "maxUnavailable": 0  # Zero downtime deployment
                    }
                },
                "selector": {
                    "matchLabels": {
                        "app": self.app_name
                    }
                },
                "template": {
                    "metadata": {
                        "labels": {
                            "app": self.app_name,
                            "component": "inference",
                            "version": "v1.0"
                        },
                        "annotations": {
                            # Prometheus scraping
                            "prometheus.io/scrape": "true",
                            "prometheus.io/port": "9090",
                            "prometheus.io/path": "/metrics",
                            
                            # Istio injection (if using service mesh)
                            "sidecar.istio.io/inject": "true",
                            
                            # Keep the cluster autoscaler from evicting this pod during node scale-down
                            "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
                        }
                    },
                    "spec": {
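                        # Node selection for GPU nodes ("accelerator" is a cluster-specific label;
                        # GKE, for example, uses cloud.google.com/gke-accelerator)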
                        "nodeSelector": {
                            "accelerator": self.gpu_type
                        },
                        
                        # Tolerations for GPU node taints
                        "tolerations": [
                            {
                                "key": "nvidia.com/gpu",
                                "operator": "Exists",
                                "effect": "NoSchedule"
                            },
                            {
                                "key": "kubernetes.io/arch",
                                "operator": "Equal",
                                "value": "amd64",
                                "effect": "NoSchedule"
                            }
                        ],
                        
                        # Affinity rules for better scheduling
                        "affinity": {
                            "podAntiAffinity": {
                                "preferredDuringSchedulingIgnoredDuringExecution": [
                                    {
                                        "weight": 100,
                                        "podAffinityTerm": {
                                            "labelSelector": {
                                                "matchExpressions": [
                                                    {
                                                        "key": "app",
                                                        "operator": "In",
                                                        "values": [self.app_name]
                                                    }
                                                ]
                                            },
                                            "topologyKey": "kubernetes.io/hostname"
                                        }
                                    }
                                ]
                            }
                        },
                        
                        "containers": [
                            {
                                "name": "vllm-server",
                                "image": self.image,
                                "imagePullPolicy": "Always",
                                
                                # Resource requirements (CRITICAL for GPU scheduling)
                                "resources": {
                                    "requests": {
                                        "nvidia.com/gpu": str(self.gpu_count),
                                        "cpu": self.cpu_request,
                                        "memory": self.memory_request
                                    },
                                    "limits": {
                                        "nvidia.com/gpu": str(self.gpu_count),
                                        "cpu": self.cpu_limit,
                                        "memory": self.memory_limit
                                    }
                                },
                                
                                # Environment variables from ConfigMap
                                "envFrom": [
                                    {
                                        "configMapRef": {
                                            "name": f"{self.app_name}-config"
                                        }
                                    }
                                ],
                                
                                # Additional environment variables
                                "env": [
                                    {
                                        "name": "POD_NAME",
                                        "valueFrom": {
                                            "fieldRef": {
                                                "fieldPath": "metadata.name"
                                            }
                                        }
                                    },
                                    {
                                        "name": "POD_IP",
                                        "valueFrom": {
                                            "fieldRef": {
                                                "fieldPath": "status.podIP"
                                            }
                                        }
                                    },
                                    {
                                        "name": "NODE_NAME",
                                        "valueFrom": {
                                            "fieldRef": {
                                                "fieldPath": "spec.nodeName"
                                            }
                                        }
                                    }
                                ],
                                
                                # Container ports
                                "ports": [
                                    {
                                        "name": "http",
                                        "containerPort": 8000,
                                        "protocol": "TCP"
                                    },
                                    {
                                        "name": "metrics",
                                        "containerPort": 9090,
                                        "protocol": "TCP"
                                    }
                                ],
                                
                                # Health checks (ESSENTIAL for production)
                                "livenessProbe": {
                                    "httpGet": {
                                        "path": "/health",
                                        "port": 8000
                                    },
                                    "initialDelaySeconds": 120,  # Extra margin; liveness checks only begin once the startup probe succeeds
                                    "periodSeconds": 30,
                                    "timeoutSeconds": 10,
                                    "failureThreshold": 3,
                                    "successThreshold": 1
                                },
                                
                                "readinessProbe": {
                                    "httpGet": {
                                        "path": "/health",
                                        "port": 8000
                                    },
                                    "initialDelaySeconds": 60,
                                    "periodSeconds": 10,
                                    "timeoutSeconds": 5,
                                    "failureThreshold": 3,
                                    "successThreshold": 1
                                },
                                
                                # Startup probe for slow model loading
                                "startupProbe": {
                                    "httpGet": {
                                        "path": "/health",
                                        "port": 8000
                                    },
                                    "initialDelaySeconds": 30,
                                    "periodSeconds": 15,
                                    "timeoutSeconds": 10,
                                    "failureThreshold": 20  # 30s delay + 20 x 15s = about 5.5 minutes to finish loading
                                },
                                
                                # Volume mounts
                                "volumeMounts": [
                                    {
                                        "name": "model-cache",
                                        "mountPath": "/root/.cache/huggingface"
                                    },
                                    {
                                        "name": "tmp-dir",
                                        "mountPath": "/tmp"
                                    }
                                ],
                                
                                # Security context
                                "securityContext": {
                                    "runAsNonRoot": False,  # The image is assumed to run as root and cache models under /root
                                    "allowPrivilegeEscalation": False,
                                    "readOnlyRootFilesystem": False,
                                    "capabilities": {
                                        "drop": ["ALL"]
                                    }
                                }
                            }
                        ],
                        
                        # Shared volumes
                        "volumes": [
                            {
                                "name": "model-cache",
                                "persistentVolumeClaim": {
                                    "claimName": "model-cache-pvc"
                                }
                            },
                            {
                                "name": "tmp-dir",
                                "emptyDir": {
                                    "sizeLimit": "1Gi"
                                }
                            }
                        ],
                        
                        "restartPolicy": "Always",
                        "terminationGracePeriodSeconds": 60,
                        
                        # Pod security context
                        "securityContext": {
                            "fsGroup": 1000
                        }
                    }
                }
            }
        }
    
    def generate_service(self) -> Dict[str, Any]:
        """Generate Service for load balancing"""
        
        return {
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {
                "name": f"{self.app_name}-service",
                "namespace": self.namespace,
                "labels": {
                    "app": self.app_name
                },
                "annotations": {
                    # Cloud provider annotations
                    "service.beta.kubernetes.io/aws-load-balancer-type": "nlb",
                    "service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled": "true",
                    
                    # Prometheus scraping
                    "prometheus.io/scrape": "true",
                    "prometheus.io/port": "9090",
                    "prometheus.io/path": "/metrics"
                }
            },
            "spec": {
                "type": "LoadBalancer",
                "ports": [
                    {
                        "name": "http",
                        "port": 80,
                        "targetPort": 8000,
                        "protocol": "TCP"
                    },
                    {
                        # Forwards to the same plain-HTTP port; TLS must be terminated at the load balancer
                        "name": "https",
                        "port": 443,
                        "targetPort": 8000,
                        "protocol": "TCP"
                    },
                    {
                        "name": "metrics",
                        "port": 9090,
                        "targetPort": 9090,
                        "protocol": "TCP"
                    }
                ],
                "selector": {
                    "app": self.app_name
                },
                # Session affinity for consistent routing
                "sessionAffinity": "ClientIP",
                "sessionAffinityConfig": {
                    "clientIP": {
                        "timeoutSeconds": 3600  # 1 hour
                    }
                }
            }
        }

# Initialize the manifest generator
manifest_generator = KubernetesManifestGenerator(
    namespace="llm-inference",
    app_name="llm-server",
    model_name="llama2-7b-chat",
    gpu_type="nvidia-tesla-t4"
)

print("✅ Kubernetes Manifest Generator Initialized!")
print(f"📦 Ready to generate production manifests for {manifest_generator.model_name}")
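
Once manifests exist as Python dictionaries, they still need to be written out in a form `kubectl` accepts. One option, sketched below, is to bundle them into a single JSON document of kind `List`, ordered so the Namespace comes before the objects that live in it. The function `bundle_manifests`, the `KIND_ORDER` list, and the stand-in manifests are illustrative assumptions; the notebook's own export code may differ.

```python
import json
from typing import Any, Dict, List

# Assumed apply order: the namespace and its quota first, then config, storage, workloads
KIND_ORDER = ["Namespace", "ResourceQuota", "ConfigMap", "PersistentVolumeClaim", "Deployment", "Service"]


def bundle_manifests(manifests: List[Dict[str, Any]]) -> str:
    """Serialize manifests as one v1 List document, sorted by KIND_ORDER."""

    def rank(manifest: Dict[str, Any]) -> int:
        kind = manifest.get("kind", "")
        # Unknown kinds keep their relative order at the end (sorted() is stable)
        return KIND_ORDER.index(kind) if kind in KIND_ORDER else len(KIND_ORDER)

    ordered = sorted(manifests, key=rank)
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": ordered}, indent=2)


# Stand-in manifests; in practice these would come from the generator methods
example = [
    {"apiVersion": "v1", "kind": "Service", "metadata": {"name": "llm-server-service"}},
    {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": "llm-inference"}},
]
bundle = bundle_manifests(example)
print(bundle)
```

Saving the returned string to a file such as `bundle.json` lets it be applied with `kubectl apply -f bundle.json`. With the generator above, the input list would be built from calls such as `manifest_generator.generate_namespace()` and `manifest_generator.generate_deployment()`.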

## ⚖️ Advanced Auto-Scaling Configuration

Let's implement sophisticated auto-scaling with custom metrics:

In [None]:
def generate_advanced_autoscaling_manifests(generator: KubernetesManifestGenerator) -> Dict[str, Dict[str, Any]]:
    """
    Generate advanced auto-scaling manifests with custom metrics
    
    Educational Focus:
    This demonstrates how to implement intelligent scaling based on
    LLM-specific metrics like queue depth, tokens/second, and GPU utilization.
    """
    
    manifests = {}
    
    # 1. Horizontal Pod Autoscaler with multiple metrics
    manifests['hpa'] = {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {
            "name": f"{generator.app_name}-hpa",
            "namespace": generator.namespace
        },
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": f"{generator.app_name}-deployment"
            },
            "minReplicas": generator.min_replicas,
            "maxReplicas": generator.max_replicas,
            
            # Multi-metric scaling strategy
            "metrics": [
                # CPU-based scaling
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {
                            "type": "Utilization",
                            "averageUtilization": 70
                        }
                    }
                },
                # Memory-based scaling
                {
                    "type": "Resource",
                    "resource": {
                        "name": "memory",
                        "target": {
                            "type": "Utilization",
                            "averageUtilization": 80
                        }
                    }
                },
                # Custom metric: Requests per second (custom metrics need a metrics adapter such as prometheus-adapter)
                {
                    "type": "Pods",
                    "pods": {
                        "metric": {
                            "name": "http_requests_per_second"
                        },
                        "target": {
                            "type": "AverageValue",
                            "averageValue": "100"  # Scale when > 100 RPS per pod
                        }
                    }
                },
                # Custom metric: Queue depth
                {
                    "type": "Object",
                    "object": {
                        "metric": {
                            "name": "llm_queue_depth"
                        },
                        "describedObject": {
                            "apiVersion": "v1",
                            "kind": "Service",
                            "name": f"{generator.app_name}-service"
                        },
                        "target": {
                            "type": "Value",
                            "value": "50"  # Scale when queue depth > 50
                        }
                    }
                },
                # Custom metric: GPU utilization (the name must match what the exporter and adapter expose)
                {
                    "type": "Pods",
                    "pods": {
                        "metric": {
                            "name": "nvidia_gpu_utilization"
                        },
                        "target": {
                            "type": "AverageValue",
                            "averageValue": "85"  # Scale when GPU util > 85%
                        }
                    }
                }
            ],
            
            # Advanced scaling behavior
            "behavior": {
                "scaleUp": {
                    "stabilizationWindowSeconds": 300,  # Wait 5 min before scaling up
                    "policies": [
                        {
                            "type": "Percent",
                            "value": 50,  # Scale up by max 50%
                            "periodSeconds": 60
                        },
                        {
                            "type": "Pods",
                            "value": 2,  # Or add max 2 pods
                            "periodSeconds": 60
                        }
                    ],
                    "selectPolicy": "Min"  # Choose more conservative
                },
                "scaleDown": {
                    "stabilizationWindowSeconds": 600,  # Wait 10 min before scaling down
                    "policies": [
                        {
                            "type": "Percent",
                            "value": 25,  # Scale down by max 25%
                            "periodSeconds": 60
                        }
                    ]
                }
            }
        }
    }
    
    # 2. Vertical Pod Autoscaler for right-sizing
    manifests['vpa'] = {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {
            "name": f"{generator.app_name}-vpa",
            "namespace": generator.namespace
        },
        "spec": {
            "targetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": f"{generator.app_name}-deployment"
            },
            "updatePolicy": {
                "updateMode": "Off"  # Only provide recommendations
            },
            "resourcePolicy": {
                "containerPolicies": [
                    {
                        "containerName": "vllm-server",
                        "minAllowed": {
                            "cpu": "2",
                            "memory": "8Gi"
                        },
                        "maxAllowed": {
                            "cpu": "16",
                            "memory": "64Gi"
                        },
                        "controlledResources": ["cpu", "memory"]
                    }
                ]
            }
        }
    }
    
    # 3. Pod Disruption Budget for high availability
    manifests['pdb'] = {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {
            "name": f"{generator.app_name}-pdb",
            "namespace": generator.namespace
        },
        "spec": {
            "minAvailable": "50%",  # Keep at least 50% of pods running
            "selector": {
                "matchLabels": {
                    "app": generator.app_name
                }
            }
        }
    }
    
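    # 4. Storage configuration (ReadWriteMany needs a storage class backed by a shared
    #    filesystem such as NFS; "fast-ssd" is a cluster-specific class name)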
    manifests['pvc'] = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {
            "name": "model-cache-pvc",
            "namespace": generator.namespace
        },
        "spec": {
            "accessModes": ["ReadWriteMany"],  # Multiple pods can access
            "resources": {
                "requests": {
                    "storage": "200Gi"  # 200GB for model cache
                }
            },
            "storageClassName": "fast-ssd"  # Use high-performance storage
        }
    }
    
    return manifests

# Generate advanced auto-scaling manifests
autoscaling_manifests = generate_advanced_autoscaling_manifests(manifest_generator)

print("✅ Advanced Auto-Scaling Manifests Generated!")
print(f"📈 Generated {len(autoscaling_manifests)} advanced configurations:")
for manifest_type in autoscaling_manifests.keys():
    print(f"   • {manifest_type.upper()}: Advanced {manifest_type} configuration")
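
To see what the `scaleUp` policies in the HPA above mean in practice, the sketch below simulates how fast the replica count can grow under a 50% policy and a 2-pod policy combined with `selectPolicy: Min`. It is a simplified model under stated assumptions: load stays high the whole time, each step is one 60-second policy period, the percentage is rounded up, and the stabilization window is ignored. The function names are hypothetical.

```python
from typing import List


def scale_up_limit(current: int, percent: int = 50, pods: int = 2, select_policy: str = "Min") -> int:
    """Largest replica count allowed after one policy period."""
    by_percent = current + (current * percent + 99) // 100  # integer ceil of current * percent / 100
    by_pods = current + pods
    if select_policy == "Min":
        return min(by_percent, by_pods)
    return max(by_percent, by_pods)


def simulate_scale_up(start: int, max_replicas: int, **policy) -> List[int]:
    """Replica count per period while demand stays above target."""
    steps = [start]
    while steps[-1] < max_replicas:
        nxt = min(max_replicas, scale_up_limit(steps[-1], **policy))
        if nxt == steps[-1]:  # policies allow no growth; stop to avoid looping forever
            break
        steps.append(nxt)
    return steps


print(simulate_scale_up(start=2, max_replicas=10))
print(simulate_scale_up(start=2, max_replicas=10, select_policy="Max"))
```

With `Min`, growing from 2 to 10 replicas takes five periods in this model, on top of the 5-minute stabilization window. For bursty LLM traffic, that trade-off between stability and responsiveness is worth checking against how quickly queues build up.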

## 📊 Production Monitoring and Observability

Let's implement comprehensive monitoring for LLM services:

In [None]:
def generate_monitoring_manifests(generator: KubernetesManifestGenerator) -> Dict[str, Dict[str, Any]]:
    """
    Generate comprehensive monitoring and observability manifests
    
    Educational Focus:
    This demonstrates production-grade monitoring setup for LLM services
    including metrics collection, alerting, and distributed tracing.
    """
    
    manifests = {}
    
    # 1. ServiceMonitor for Prometheus
    manifests['servicemonitor'] = {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": f"{generator.app_name}-metrics",
            "namespace": generator.namespace,
            "labels": {
                "app": generator.app_name,
                "release": "prometheus"
            }
        },
        "spec": {
            "selector": {
                "matchLabels": {
                    "app": generator.app_name
                }
            },
            "endpoints": [
                {
                    "port": "metrics",
                    "path": "/metrics",
                    "interval": "30s",
                    "scrapeTimeout": "10s",
                    "honorLabels": True
                }
            ]
        }
    }
    
    # 2. PrometheusRule for alerting
    manifests['prometheusrule'] = {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            "name": f"{generator.app_name}-alerts",
            "namespace": generator.namespace,
            "labels": {
                "app": generator.app_name,
                "release": "prometheus"
            }
        },
        "spec": {
            "groups": [
                {
                    "name": f"{generator.app_name}.rules",
                    "rules": [
                        # High latency alert (buckets are aggregated per service before taking the quantile)
                        {
                            "alert": "LLMHighLatency",
                            "expr": 'histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 10',
                            "for": "5m",
                            "labels": {
                                "severity": "warning",
                                "service": generator.app_name
                            },
                            "annotations": {
                                "summary": "LLM inference latency is high",
                                "description": "95th percentile latency is {{ $value }}s for service {{ $labels.service }}"
                            }
                        },
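                        # High error rate alert (both sides are summed so the 5xx series divide by total traffic)
                        {
                            "alert": "LLMHighErrorRate",
                            "expr": 'sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m])) > 0.05',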
                            "for": "5m",
                            "labels": {
                                "severity": "critical",
                                "service": generator.app_name
                            },
                            "annotations": {
                                "summary": "LLM service has high error rate",
                                "description": "Error rate is {{ $value | humanizePercentage }} for service {{ $labels.service }}"
                            }
                        },
                        # GPU utilization alert
                        {
                            "alert": "LLMGPUUtilizationLow",
                            "expr": 'nvidia_gpu_utilization < 50',
                            "for": "10m",
                            "labels": {
                                "severity": "info",
                                "service": generator.app_name
                            },
                            "annotations": {
                                "summary": "LLM GPU utilization is low",
                                "description": "GPU utilization is {{ $value }}% on {{ $labels.instance }} - consider scaling down"
                            }
                        },
                        # Memory pressure alert
                        {
                            "alert": "LLMMemoryPressure",
                            "expr": 'container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9',
                            "for": "2m",
                            "labels": {
                                "severity": "critical",
                                "service": generator.app_name
                            },
                            "annotations": {
                                "summary": "LLM container memory pressure",
                                "description": "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.pod }}"
                            }
                        },
                        # Pod crash loop alert
                        {
                            "alert": "LLMPodCrashLooping",
                            "expr": 'rate(kube_pod_container_status_restarts_total[15m]) > 0',
                            "for": "5m",
                            "labels": {
                                "severity": "critical",
                                "service": generator.app_name
                            },
                            "annotations": {
                                "summary": "LLM pod is crash looping",
                                "description": "Pod {{ $labels.pod }} is restarting frequently"
                            }
                        },
                        # Queue depth alert
                        {
                            "alert": "LLMQueueDepthHigh",
                            "expr": 'llm_queue_depth > 100',
                            "for": "2m",
                            "labels": {
                                "severity": "warning",
                                "service": generator.app_name
                            },
                            "annotations": {
                                "summary": "LLM request queue depth is high",
                                "description": "Queue depth is {{ $value }} requests - consider scaling up"
                            }
                        },
                        # Service availability alert
                        {
                            "alert": "LLMServiceDown",
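                            # Scoped to this namespace so unrelated scrape targets do not trigger the alert
                            # (assumes the namespace target label that Prometheus Operator attaches)
                            "expr": f'up{{namespace="{generator.namespace}"}} == 0',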
                            "for": "1m",
                            "labels": {
                                "severity": "critical",
                                "service": generator.app_name
                            },
                            "annotations": {
                                "summary": "LLM service is down",
                                "description": "Service {{ $labels.service }} is not responding on {{ $labels.instance }}"
                            }
                        }
                    ]
                }
            ]
        }
    }
    
    # 3. Grafana Dashboard ConfigMap
    dashboard_json = {
        "dashboard": {
            "id": None,
            "title": f"LLM Inference - {generator.model_name}",
            "tags": ["llm", "inference", "ai"],
            "timezone": "UTC",
            "panels": [
                {
                    "id": 1,
                    "title": "Request Rate",
                    "type": "graph",
                    "targets": [
                        {
                            "expr": "rate(http_requests_total[5m])",
                            "legendFormat": "{{ instance }}"
                        }
                    ]
                },
                {
                    "id": 2,
                    "title": "Response Time (P95)",
                    "type": "graph",
                    "targets": [
                        {
                            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
                            "legendFormat": "P95 Latency"
                        }
                    ]
                },
                {
                    "id": 3,
                    "title": "GPU Utilization",
                    "type": "graph",
                    "targets": [
                        {
                            "expr": "nvidia_gpu_utilization",
                            "legendFormat": "GPU {{ gpu }}"
                        }
                    ]
                },
                {
                    "id": 4,
                    "title": "Memory Usage",
                    "type": "graph",
                    "targets": [
                        {
                            "expr": "container_memory_usage_bytes / container_spec_memory_limit_bytes",
                            "legendFormat": "{{ pod }}"
                        }
                    ]
                },
                {
                    "id": 5,
                    "title": "Queue Depth",
                    "type": "graph",
                    "targets": [
                        {
                            "expr": "llm_queue_depth",
                            "legendFormat": "Queue Depth"
                        }
                    ]
                },
                {
                    "id": 6,
                    "title": "Tokens per Second",
                    "type": "graph",
                    "targets": [
                        {
                            "expr": "rate(llm_tokens_generated_total[5m])",
                            "legendFormat": "Tokens/sec"
                        }
                    ]
                }
            ],
            "time": {
                "from": "now-1h",
                "to": "now"
            },
            "refresh": "30s"
        }
    }
    
    manifests['grafana_dashboard'] = {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"{generator.app_name}-dashboard",
            "namespace": generator.namespace,
            "labels": {
                "grafana_dashboard": "1"  # Auto-discovery by Grafana
            }
        },
        "data": {
            "dashboard.json": json.dumps(dashboard_json, indent=2)
        }
    }
    
    # 4. Network Policy for security
    manifests['networkpolicy'] = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": f"{generator.app_name}-netpol",
            "namespace": generator.namespace
        },
        "spec": {
            "podSelector": {
                "matchLabels": {
                    "app": generator.app_name
                }
            },
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [
                {
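                    "from": [
                        {
                            "namespaceSelector": {
                                "matchLabels": {
                                    # Label set automatically on every namespace (Kubernetes 1.21+)
                                    "kubernetes.io/metadata.name": "ingress-nginx"
                                }
                            }
                        },
                        {
                            "namespaceSelector": {
                                "matchLabels": {
                                    "kubernetes.io/metadata.name": "monitoring"
                                }
                            }
                        }
                    ],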
                    "ports": [
                        {
                            "protocol": "TCP",
                            "port": 8000
                        },
                        {
                            "protocol": "TCP",
                            "port": 9090
                        }
                    ]
                }
            ],
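            "egress": [
                {
                    "to": [],  # Any destination, but only on the ports listed below
                    "ports": [
                        {
                            "protocol": "UDP",
                            "port": 53   # DNS (without it, model download hosts cannot be resolved)
                        },
                        {
                            "protocol": "TCP",
                            "port": 53   # DNS over TCP
                        },
                        {
                            "protocol": "TCP",
                            "port": 443  # HTTPS
                        },
                        {
                            "protocol": "TCP",
                            "port": 80   # HTTP
                        }
                    ]
                }
            ]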
        }
    }
    
    return manifests

# Generate monitoring manifests
monitoring_manifests = generate_monitoring_manifests(manifest_generator)

print("✅ Production Monitoring Manifests Generated!")
print(f"📊 Generated {len(monitoring_manifests)} monitoring configurations:")
for manifest_type in monitoring_manifests.keys():
    print(f"   • {manifest_type.upper()}: Production-grade {manifest_type}")

print("\n🎯 Monitoring Features:")
print("   📈 Prometheus metrics collection with 30s intervals")
print("   🚨 7 comprehensive alerts for production scenarios")
print("   📊 Grafana dashboard with 6 key performance panels")
print("   🔒 Network policies for security isolation")
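
Before applying alerting rules to a cluster, it can help to sanity-check their structure. The sketch below is one possible check, not part of the generator: it assumes the standard `spec.groups[].rules[]` layout of a PrometheusRule, and the function name `validate_alert_rules`, the required-key list, and the sample rule are all hypothetical. The allowed severities are simply the three used in this chapter.

```python
REQUIRED_RULE_KEYS = ("alert", "expr", "for", "labels", "annotations")
VALID_SEVERITIES = {"info", "warning", "critical"}  # the levels used in this chapter


def validate_alert_rules(prometheus_rule):
    """Return a list of human-readable problems found in a PrometheusRule dict."""
    problems = []
    seen_names = set()
    for group in prometheus_rule.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            name = rule.get("alert", "<unnamed>")
            for key in REQUIRED_RULE_KEYS:
                if key not in rule:
                    problems.append(f"{name}: missing '{key}'")
            severity = rule.get("labels", {}).get("severity")
            if severity not in VALID_SEVERITIES:
                problems.append(f"{name}: unexpected severity {severity!r}")
            if name in seen_names:
                problems.append(f"{name}: duplicate alert name")
            seen_names.add(name)
    return problems


# Hypothetical sample: the second rule is deliberately malformed
sample_rule = {
    "spec": {
        "groups": [
            {
                "name": "llm.rules",
                "rules": [
                    {
                        "alert": "LLMQueueDepthHigh",
                        "expr": "llm_queue_depth > 100",
                        "for": "2m",
                        "labels": {"severity": "warning"},
                        "annotations": {"summary": "LLM request queue depth is high"},
                    },
                    {
                        "alert": "LLMServiceDown",
                        "expr": "up == 0",
                        "labels": {"severity": "page"},
                        "annotations": {"summary": "LLM service is down"},
                    },
                ],
            }
        ]
    }
}

problems = validate_alert_rules(sample_rule)
for problem in problems:
    print(problem)
```

An empty result means every rule carries the fields the alerts in this chapter rely on; it does not validate the PromQL expressions themselves.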

## 🚀 Complete Manifest Generation and Deployment Guide

Let's generate all manifests and create a deployment guide:

In [None]:
def generate_complete_deployment_package():
    """
    Generate complete deployment package with all manifests and deployment guide
    
    Educational Focus:
    This demonstrates how to organize and package a complete production
    deployment with proper documentation and deployment procedures.
    """
    
    print("🏗️ Generating Complete Production Deployment Package")
    print("=" * 60)
    
    # Generate all manifest categories
    all_manifests = {}
    
    # Core manifests
    print("📦 Generating core manifests...")
    all_manifests['namespace'] = manifest_generator.generate_namespace()
    all_manifests['resource_quota'] = manifest_generator.generate_resource_quota()
    all_manifests['configmap'] = manifest_generator.generate_configmap()
    all_manifests['deployment'] = manifest_generator.generate_deployment()
    all_manifests['service'] = manifest_generator.generate_service()
    
    # Auto-scaling manifests
    print("⚖️ Generating auto-scaling manifests...")
    autoscaling = generate_advanced_autoscaling_manifests(manifest_generator)
    all_manifests.update(autoscaling)
    
    # Monitoring manifests
    print("📊 Generating monitoring manifests...")
    monitoring = generate_monitoring_manifests(manifest_generator)
    all_manifests.update(monitoring)
    
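    # Generate YAML files, keyed by the hyphenated file names that deploy.sh applies
    # (e.g. resource_quota -> resource-quota.yaml)
    manifest_yamls = {}
    for name, manifest in all_manifests.items():
        file_stem = name.replace('_', '-')
        manifest_yamls[file_stem] = yaml.dump(manifest, default_flow_style=False, sort_keys=False)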
    
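    # Generate deployment script
    # The timestamp is resolved here: $(date) inside a bash comment is never evaluated
    from datetime import datetime, timezone
    generated_at = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    deployment_script = f'''#!/bin/bash
# Production LLM Deployment Script
# Model: {manifest_generator.model_name}
# Generated: {generated_at}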

set -e  # Exit on error

echo "🚀 Starting LLM Production Deployment"
echo "Model: {manifest_generator.model_name}"
echo "Namespace: {manifest_generator.namespace}"
echo "{"=" * 50}"

# Check prerequisites
echo "🔍 Checking prerequisites..."
kubectl version --client=true
if ! kubectl get nodes -l accelerator={manifest_generator.gpu_type} | grep -qw "Ready"; then  # -w so that "NotReady" does not match
    echo "❌ No GPU nodes found with accelerator={manifest_generator.gpu_type}"
    echo "Please ensure GPU nodes are available and properly labeled"
    exit 1
fi

# Deploy in correct order
echo "\n📦 Phase 1: Core Infrastructure"
kubectl apply -f namespace.yaml
kubectl apply -f resource-quota.yaml
kubectl apply -f pvc.yaml
kubectl apply -f configmap.yaml

echo "\n🚀 Phase 2: Application Deployment"
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f pdb.yaml

echo "\n⚖️ Phase 3: Auto-scaling Setup"
kubectl apply -f hpa.yaml
kubectl apply -f vpa.yaml

echo "\n📊 Phase 4: Monitoring Setup"
kubectl apply -f servicemonitor.yaml
kubectl apply -f prometheusrule.yaml
kubectl apply -f grafana-dashboard.yaml

echo "\n🔒 Phase 5: Security Setup"
kubectl apply -f networkpolicy.yaml

echo "\n✅ Deployment Complete!"
echo "\n🔍 Checking deployment status..."
kubectl get pods -n {manifest_generator.namespace} -l app={manifest_generator.app_name}
kubectl get svc -n {manifest_generator.namespace}
kubectl get hpa -n {manifest_generator.namespace}

echo "\n📊 Waiting for pods to be ready..."
kubectl wait --for=condition=ready pod -l app={manifest_generator.app_name} -n {manifest_generator.namespace} --timeout=600s

echo "\n🎉 LLM Service is now running!"
echo "\n📋 Next steps:"
echo "1. Check service health: kubectl logs -n {manifest_generator.namespace} -l app={manifest_generator.app_name}"
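echo "2. Monitor metrics: kubectl port-forward -n {manifest_generator.namespace} svc/{manifest_generator.app_name}-service 9090:9090"
echo "3. Test inference: kubectl port-forward -n {manifest_generator.namespace} svc/{manifest_generator.app_name}-service 8080:80, then POST to http://localhost:8080/v1/chat/completions"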
echo "4. View Grafana dashboard: Import the generated dashboard JSON"
'''
    
    # Generate cleanup script
    cleanup_script = f'''#!/bin/bash
# LLM Deployment Cleanup Script

echo "🧹 Cleaning up LLM deployment..."
echo "Namespace: {manifest_generator.namespace}"
echo "Model: {manifest_generator.model_name}"

read -p "Are you sure you want to delete the entire deployment? [y/N] " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
    echo "Deleting all resources..."
    kubectl delete namespace {manifest_generator.namespace}
    echo "✅ Cleanup complete!"
else
    echo "❌ Cleanup cancelled"
fi
'''
    
    # Generate README
    readme_content = f'''# LLM Production Deployment: {manifest_generator.model_name}

## 📋 Overview

This deployment package contains production-ready Kubernetes manifests for deploying {manifest_generator.model_name} with:

- **High Availability**: Multi-replica deployment with pod disruption budgets
- **Auto-scaling**: HPA and VPA for dynamic resource management
- **Monitoring**: Comprehensive Prometheus metrics and Grafana dashboards
- **Security**: Network policies and resource quotas
- **Performance**: GPU-optimized scheduling and resource allocation

## 🏗️ Architecture

```
Internet → Load Balancer → Service → Pods (GPU-enabled)
                ↓
         Prometheus ← Metrics
                ↓
            Grafana Dashboard
```

## 📦 Components

### Core Resources
- `namespace.yaml`: Isolated namespace for the deployment
- `resource-quota.yaml`: Resource quotas for the namespace
- `configmap.yaml`: Application configuration
- `deployment.yaml`: Main application deployment
- `service.yaml`: Load balancer service
- `pvc.yaml`: Persistent volume for model cache

### Auto-scaling
- `hpa.yaml`: Horizontal Pod Autoscaler with custom metrics
- `vpa.yaml`: Vertical Pod Autoscaler for right-sizing
- `pdb.yaml`: Pod Disruption Budget for availability

### Monitoring
- `servicemonitor.yaml`: Prometheus metrics collection
- `prometheusrule.yaml`: Alerting rules
- `grafana-dashboard.yaml`: Performance dashboard

### Security
- `networkpolicy.yaml`: Network isolation policies

## 🚀 Quick Start

### Prerequisites
- Kubernetes cluster with GPU nodes
- NVIDIA GPU Operator installed
- Prometheus Operator (for monitoring)
- At least {manifest_generator.gpu_count} × {manifest_generator.gpu_type} GPU available

### Deploy
```bash
chmod +x deploy.sh
./deploy.sh
```

### Verify
```bash
# Check pods
kubectl get pods -n {manifest_generator.namespace}

# Check service
kubectl get svc -n {manifest_generator.namespace}

# Check logs
kubectl logs -n {manifest_generator.namespace} -l app={manifest_generator.app_name} --tail=100
```

### Test
```bash
# Port forward
kubectl port-forward -n {manifest_generator.namespace} svc/{manifest_generator.app_name}-service 8080:80

# Test inference
curl -X POST http://localhost:8080/v1/chat/completions \\
  -H "Content-Type: application/json" \\
  -d '{{
    "model": "{manifest_generator.model_name}",
    "messages": [{{"role": "user", "content": "Hello!"}}],
    "max_tokens": 100
  }}'
```

## 📊 Monitoring

### Key Metrics
- Request rate and latency (P95, P99)
- GPU utilization and memory usage
- Queue depth and throughput
- Error rates and availability

### Alerts
- High latency (>10s P95)
- High error rate (>5%)
- Low GPU utilization (<50%)
- Memory pressure (>90%)
- Pod crash looping
- High queue depth (>100 requests)
- Service down

## 🔧 Configuration

### Scaling
- Min replicas: {manifest_generator.min_replicas}
- Max replicas: {manifest_generator.max_replicas}
- CPU target: 70% utilization
- Memory target: 80% utilization

### Resources
- GPU: {manifest_generator.gpu_count} × {manifest_generator.gpu_type}
- CPU: {manifest_generator.cpu_request}-{manifest_generator.cpu_limit} cores
- Memory: {manifest_generator.memory_request}-{manifest_generator.memory_limit}

## 🧹 Cleanup

```bash
chmod +x cleanup.sh
./cleanup.sh
```

## 📞 Support

For issues or questions:
1. Check pod logs: `kubectl logs -n {manifest_generator.namespace} -l app={manifest_generator.app_name}`
2. Check events: `kubectl get events -n {manifest_generator.namespace}`
3. Check resource usage: `kubectl top pods -n {manifest_generator.namespace}`
'''
    
    # Create deployment package
    deployment_package = {
        'manifests': manifest_yamls,
        'scripts': {
            'deploy.sh': deployment_script,
            'cleanup.sh': cleanup_script
        },
        'documentation': {
            'README.md': readme_content
        }
    }
    
    return deployment_package

# Generate complete deployment package
print("🚀 Generating Complete Production Deployment Package...")
deployment_package = generate_complete_deployment_package()

print(f"\n✅ Production Deployment Package Generated!")
print(f"\n📦 Package Contents:")
print(f"   📄 {len(deployment_package['manifests'])} Kubernetes manifests")
print(f"   🚀 {len(deployment_package['scripts'])} deployment scripts")
print(f"   📚 {len(deployment_package['documentation'])} documentation files")

print(f"\n🔧 Generated Manifests:")
for manifest_name in deployment_package['manifests'].keys():
    print(f"   • {manifest_name}.yaml")

print(f"\n📋 Generated Scripts:")
for script_name in deployment_package['scripts'].keys():
    print(f"   • {script_name}")

print(f"\n🎯 Ready for Production Deployment!")
print(f"   Model: {manifest_generator.model_name}")
print(f"   Namespace: {manifest_generator.namespace}")
print(f"   GPU Type: {manifest_generator.gpu_type}")
print(f"   Scaling: {manifest_generator.min_replicas}-{manifest_generator.max_replicas} replicas")
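
The package above lives only in memory. A minimal sketch of writing it to disk follows, assuming the `manifests` / `scripts` / `documentation` layout returned by `generate_complete_deployment_package()`. The helper name `write_deployment_package` and the tiny `demo_package` are hypothetical and only illustrate the shape.

```python
import stat
import tempfile
from pathlib import Path


def write_deployment_package(package, output_dir):
    """Write manifests, scripts, and docs to output_dir; return the written paths."""
    root = Path(output_dir)
    root.mkdir(parents=True, exist_ok=True)
    written = []

    for name, content in package["manifests"].items():
        path = root / f"{name}.yaml"
        path.write_text(content, encoding="utf-8")
        written.append(path)

    for name, content in package["scripts"].items():
        path = root / name
        path.write_text(content, encoding="utf-8")
        # Add the executable bits so ./deploy.sh runs without a separate chmod
        path.chmod(path.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
        written.append(path)

    for name, content in package["documentation"].items():
        path = root / name
        path.write_text(content, encoding="utf-8")
        written.append(path)

    return written


# Hypothetical stand-in for the real package, written to a throwaway directory
demo_package = {
    "manifests": {"namespace": "apiVersion: v1\nkind: Namespace\n"},
    "scripts": {"deploy.sh": "#!/bin/bash\necho deploy\n"},
    "documentation": {"README.md": "# Demo\n"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    paths = write_deployment_package(demo_package, tmp_dir)
    written_names = sorted(p.name for p in paths)

print(written_names)
```

In the notebook, the same helper could be called with `deployment_package` and a real output directory; keeping the file names identical to the ones `deploy.sh` applies is what makes the script usable as generated.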

## 🎯 Key Takeaways from Production Kubernetes Deployment

### **Kubernetes is Essential for LLM Production**
- **Resource management**: Native GPU support and intelligent scheduling
- **Auto-scaling**: Dynamic scaling based on custom LLM metrics
- **High availability**: Multi-replica deployments with disruption budgets
- **Operational excellence**: Health checks, monitoring, and automated recovery

### **Production-Grade Features**
- **Zero-downtime deployments**: Rolling updates with proper readiness checks
- **Comprehensive monitoring**: 7 alerts covering latency, error rate, GPU utilization, memory pressure, crash loops, queue depth, and availability
- **Security isolation**: Network policies and resource quotas
- **Cost optimization**: Right-sizing through VPA and intelligent scaling

### **Operational Considerations**
- **GPU scheduling**: Proper node selection and resource allocation
- **Storage management**: Persistent volumes for model caching
- **Network policies**: Security isolation without breaking functionality
- **Resource quotas**: Prevent resource monopolization and cost runaway

### **Monitoring and Alerting**
- **Custom metrics**: LLM-specific metrics like queue depth and tokens/sec
- **Multi-dimensional scaling**: CPU, memory, GPU utilization, and custom metrics
- **Proactive alerting**: Detect issues before they impact users
- **Performance tracking**: Comprehensive dashboards for operational visibility
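
To make multi-metric scaling concrete, the sketch below reproduces the replica calculation documented for the Horizontal Pod Autoscaler: `ceil(currentReplicas × currentValue / targetValue)` per metric, no change inside a tolerance band (0.1 by default), and the largest proposal across metrics wins. The function names and the metric readings are hypothetical; the 70% CPU target echoes the README above, while the queue-depth target of 20 is an assumption.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Replica count proposed by a single metric, clamped to the HPA bounds."""
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left unchanged
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def scale_decision(current_replicas, metrics, min_replicas, max_replicas):
    """With several metrics, take the largest per-metric proposal."""
    return max(
        desired_replicas(current_replicas, current, target, min_replicas, max_replicas)
        for current, target in metrics.values()
    )


# Hypothetical readings: (current average per pod, target average per pod)
metrics = {
    "cpu_utilization_percent": (60.0, 70.0),
    "llm_queue_depth": (45.0, 20.0),
}
decision = scale_decision(current_replicas=4, metrics=metrics, min_replicas=2, max_replicas=10)
print(decision)  # 9: queue depth proposes ceil(4 * 45 / 20), CPU proposes 4
```

Here CPU alone would keep the deployment at 4 replicas, but the queue-depth metric drives it to 9, which is why LLM-specific metrics often matter more than CPU for inference workloads.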

---

## 💡 Advanced Production Patterns

### **Multi-Model Serving**
```yaml
# Deploy different model sizes for different use cases
# (selector and pod template omitted for brevity)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-small-fast
spec:
  replicas: 10  # More replicas for fast responses
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-large-accurate
spec:
  replicas: 2   # Fewer replicas for accuracy-focused requests
```
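
Serving two model tiers also needs something in front of them to decide where each request goes. The following is a minimal sketch of such a routing rule, not a prescribed design: the backend names mirror the two Deployments above, while `InferenceRequest`, `choose_backend`, the `needs_high_accuracy` flag, and the 2000-character threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceRequest:
    prompt: str
    needs_high_accuracy: bool = False


# Backend names mirror the two Deployments above; the threshold is an assumption
SMALL_MODEL = "llm-small-fast"
LARGE_MODEL = "llm-large-accurate"
LONG_PROMPT_CHARS = 2000


def choose_backend(request):
    """Send flagged or long requests to the large model, everything else to the small one."""
    if request.needs_high_accuracy or len(request.prompt) > LONG_PROMPT_CHARS:
        return LARGE_MODEL
    return SMALL_MODEL


print(choose_backend(InferenceRequest("Summarize this sentence.")))
print(choose_backend(InferenceRequest("Review this contract.", needs_high_accuracy=True)))
```

In practice the rule would more likely depend on token counts, tenant tier, or latency budgets, but the shape stays the same: a cheap decision up front that keeps the scarce large-model replicas for requests that need them.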

### **Canary Deployments**
```yaml
# Gradual rollout of new model versions
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10    # 10% traffic to new version
      - pause: {duration: 10m}
      - setWeight: 50    # 50% traffic
      - pause: {duration: 10m}
      - setWeight: 100   # Full rollout
```
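
To see what a `setWeight` step means for individual requests, the sketch below simulates a weighted split by hashing request IDs into 100 buckets. This only illustrates the idea of percentage-based traffic shifting; it is not how Argo Rollouts or any particular ingress implements it, and `route` and `canary_share` are hypothetical helpers.

```python
import hashlib


def route(request_id, canary_weight):
    """Return 'canary' or 'stable' for a request, given a weight in percent."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # the same request id always lands in the same bucket
    return "canary" if bucket < canary_weight else "stable"


def canary_share(canary_weight, n_requests=10_000):
    """Fraction of simulated requests that land on the canary."""
    hits = sum(route(f"req-{i}", canary_weight) == "canary" for i in range(n_requests))
    return hits / n_requests


for weight in (10, 50, 100):  # the setWeight steps from the Rollout above
    print(f"setWeight {weight:>3}: {canary_share(weight):.1%} of requests on canary")
```

Hashing a stable key (such as a user or session ID) instead of picking randomly keeps each caller on one version during a step, which makes quality regressions in the new model easier to attribute.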

### **Cost Optimization**
```yaml
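# Use spot instances with proper handling
# Spot nodes are labeled and tainted by the node pool; the resulting Node looks like this
apiVersion: v1
kind: Node
metadata:
  labels:
    lifecycle: spot   # illustrative label; node.kubernetes.io/instance-type holds the machine type, not the pricing model
spec:
  taints:
  - key: spot-instance
    value: "true"
    effect: NoSchedule
---
# Pod spec fragment: only workloads that can handle preemption tolerate the taint
tolerations:
- key: spot-instance
  operator: Equal
  value: "true"
  effect: NoSchedule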
```
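
A quick back-of-the-envelope calculation shows why a spot mix is worth the extra preemption handling. The sketch below uses the figures quoted at the start of this chapter ($2-8/hour per GPU, roughly 70% spot savings); the $4/hour rate, the 8-GPU fleet, and the function name `monthly_gpu_cost` are assumptions for illustration only.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def monthly_gpu_cost(gpu_count, hourly_rate, spot_fraction=0.0, spot_discount=0.70):
    """Estimated monthly cost when spot_fraction of the GPUs run on spot capacity."""
    on_demand_gpus = gpu_count * (1.0 - spot_fraction)
    spot_gpus = gpu_count * spot_fraction
    hourly = on_demand_gpus * hourly_rate + spot_gpus * hourly_rate * (1.0 - spot_discount)
    return hourly * HOURS_PER_MONTH


baseline = monthly_gpu_cost(gpu_count=8, hourly_rate=4.0)
mixed = monthly_gpu_cost(gpu_count=8, hourly_rate=4.0, spot_fraction=0.5)

print(f"On-demand only: ${baseline:,.0f}/month")
print(f"50% spot:       ${mixed:,.0f}/month ({1 - mixed / baseline:.0%} saved)")
```

Moving half of the fleet to spot capacity cuts the bill by about a third under these assumptions, while the on-demand half keeps a baseline of capacity that cannot be preempted.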

---

## 📈 Production Checklist

### **Before Deployment**
- ✅ GPU nodes labeled and available
- ✅ NVIDIA GPU Operator installed
- ✅ Prometheus Operator configured
- ✅ Storage classes defined
- ✅ Network policies tested

### **During Deployment**
- ✅ Resource quotas applied
- ✅ Health checks responding
- ✅ Metrics being collected
- ✅ Alerts configured
- ✅ Auto-scaling working

### **After Deployment**
- ✅ Load testing completed
- ✅ Failover scenarios tested
- ✅ Monitoring dashboards validated
- ✅ Runbooks documented
- ✅ On-call procedures established

---

**Next: Chapter 9 - Cost Optimization & Operations** 💰

*In the final chapter, we'll explore advanced cost optimization techniques, FinOps practices, and SRE methodologies for running LLMs efficiently at scale.*