# 🚀 Week 13: Production Deployment

This notebook covers deploying ML models to production: serving, containerization, scaling, and monitoring.

## Table of Contents
1. [Deployment Fundamentals](#1-deployment-fundamentals)
2. [Model Serving](#2-model-serving)
3. [Containerization](#3-containerization)
4. [Scaling Strategies](#4-scaling-strategies)
5. [Monitoring](#5-monitoring)
6. [Production Checklist](#6-production-checklist)

---

## 1. Deployment Fundamentals

### 1.1 Deployment Options

| Option | Pros | Cons | Use Case |
|--------|------|------|----------|
| **REST API** | Simple, standard | Latency overhead | General purpose |
| **gRPC** | Fast, typed | More complex | Internal services |
| **Serverless** | No infra, auto-scale | Cold starts | Low traffic |
| **Edge** | Low latency | Limited compute | IoT, mobile |

### 1.2 Deployment Architecture

```
┌─────────────────────────────────────────────────────────┐
│                    LOAD BALANCER                        │
└─────────────────────────┬───────────────────────────────┘
                          │
        ┌─────────────────┼─────────────────┐
        ▼                 ▼                 ▼
  ┌──────────┐     ┌──────────┐     ┌──────────┐
  │ API Pod  │     │ API Pod  │     │ API Pod  │
  │  + Model │     │  + Model │     │  + Model │
  └──────────┘     └──────────┘     └──────────┘
        │                 │                 │
        └─────────────────┼─────────────────┘
                          ▼
                   ┌──────────────┐
                   │   Model      │
                   │   Registry   │
                   └──────────────┘
```

---

## 2. Model Serving

### 2.1 FastAPI Production Setup

In [None]:
# Production FastAPI Application
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import time
import uuid

app = FastAPI(
    title="ML Model API",
    description="Production ML model serving API",
    version="1.0.0"
)

# Request/Response Models
class PredictionRequest(BaseModel):
    text: str
    model_version: str = "latest"

class PredictionResponse(BaseModel):
    request_id: str
    prediction: str
    confidence: float
    model_version: str
    latency_ms: float

@app.get("/health")
async def health():
    """Health check endpoint for load balancer."""
    return {"status": "healthy", "timestamp": time.time()}

@app.get("/ready")
async def ready():
    """Readiness check - is the model loaded?"""
    # Check if model is loaded
    model_loaded = True  # Replace with actual check
    if not model_loaded:
        raise HTTPException(status_code=503, detail="Model not loaded")
    return {"status": "ready"}

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    """Main prediction endpoint."""
    # perf_counter is monotonic, so it is better suited to measuring durations
    start_time = time.perf_counter()
    
    # Generate prediction (replace with actual model)
    prediction = "positive"
    confidence = 0.95
    
    latency = (time.perf_counter() - start_time) * 1000
    
    return PredictionResponse(
        request_id=str(uuid.uuid4()),
        prediction=prediction,
        confidence=confidence,
        model_version=request.model_version,
        latency_ms=latency
    )

print("✅ Production API defined!")
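
The `model_loaded = True` placeholder in `/ready` can be backed by a flag that the loading code sets. A minimal sketch, assuming a hypothetical `ReadinessState` helper built on `threading.Event`; the endpoint would return the status code and body it reports:

```python
import threading


class ReadinessState:
    """Tracks whether the model has finished loading (hypothetical helper)."""

    def __init__(self) -> None:
        self._ready = threading.Event()

    def mark_ready(self) -> None:
        # Call once model loading has completed successfully.
        self._ready.set()

    def mark_not_ready(self) -> None:
        # Call before reloading or swapping a model.
        self._ready.clear()

    def status(self) -> tuple[int, dict]:
        # Returns an (HTTP status code, JSON body) pair for the /ready endpoint.
        if self._ready.is_set():
            return 200, {"status": "ready"}
        return 503, {"detail": "Model not loaded"}


readiness = ReadinessState()
print(readiness.status())  # before loading: 503
readiness.mark_ready()
print(readiness.status())  # after loading: 200
```

Using an `Event` rather than a plain boolean keeps the flag safe to set from a background loading thread.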

### 2.2 Model Loading Optimization

In [None]:
import threading

class ModelManager:
    """
    Singleton model manager with thread-safe creation and explicit version loading.
    """
    _instance = None
    _lock = threading.Lock()
    
    def __new__(cls):
        # Double-checked locking: only the first caller creates the instance
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    instance = super().__new__(cls)
                    # Initialize state under the lock so it happens exactly once
                    instance.models = {}
                    instance.current_version = None
                    cls._instance = instance
        return cls._instance
    
    def load_model(self, version: str, model_path: str):
        """Load a model version."""
        print(f"Loading model v{version} from {model_path}...")
        # Simulate model loading
        self.models[version] = f"model_{version}"
        self.current_version = version
        print(f"✅ Model v{version} loaded")
    
    def get_model(self, version: str = "latest"):
        """Get a model by version."""
        if version == "latest":
            version = self.current_version
        return self.models.get(version)
    
    def switch_version(self, version: str):
        """Hot-swap to a different model version."""
        if version in self.models:
            self.current_version = version
            print(f"Switched to model v{version}")
        else:
            raise ValueError(f"Model v{version} not loaded")

# Example usage
manager = ModelManager()
manager.load_model("1.0", "/models/v1")
manager.load_model("2.0", "/models/v2")
manager.switch_version("2.0")
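
The manager above loads each model as soon as `load_model` is called. A lazy variant defers loading until a version is first requested. The sketch below is one way to do that; `LazyModelRegistry` and the loader callables are hypothetical names, not part of the notebook's code:

```python
import threading
from typing import Any, Callable, Dict


class LazyModelRegistry:
    """Loads each registered model version on first access (hypothetical sketch)."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._models: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def register(self, version: str, loader: Callable[[], Any]) -> None:
        # Store the loader only; nothing is loaded yet.
        with self._lock:
            self._loaders[version] = loader

    def get(self, version: str) -> Any:
        # Load under the lock so concurrent callers trigger a single load.
        with self._lock:
            if version not in self._models:
                if version not in self._loaders:
                    raise KeyError(f"Model v{version} is not registered")
                self._models[version] = self._loaders[version]()
            return self._models[version]


registry = LazyModelRegistry()
registry.register("1.0", lambda: "model_1.0")  # stand-in for a real loader
print(registry.get("1.0"))
```

The trade-off is that the first request for a version pays the loading cost, so a readiness probe should only report ready once the default version has been fetched at least once.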

---

## 3. Containerization

### 3.1 Dockerfile for ML

In [None]:
dockerfile_content = '''
# Multi-stage build for smaller image
FROM python:3.10-slim AS builder

WORKDIR /app
COPY requirements.txt .

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Final stage
FROM python:3.10-slim

WORKDIR /app

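# Copy installed packages and their console scripts (e.g. the uvicorn executable)
COPY --from=builder /usr/local/lib/python3.10/site-packages /usr/local/lib/python3.10/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin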

# Copy application
COPY . .

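# Health check (python:3.10-slim does not ship curl, so use the standard library)
HEALTHCHECK --interval=30s --timeout=10s \\
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)" || exit 1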

# Non-root user
RUN useradd -m appuser && chown -R appuser /app
USER appuser

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
'''

print("Optimized Dockerfile:")
print(dockerfile_content)
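
The health probe can also be a small Python script, which is useful when the image does not include `curl`, as is the case for `python:3.10-slim`. A minimal sketch using only the standard library; the function name `check_health` and its defaults are illustrative assumptions:

```python
import urllib.error
import urllib.request


def check_health(url: str = "http://localhost:8000/health", timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status all count as unhealthy.
        return False
```

A probe script would call `check_health()` and exit with code 0 when it returns `True` and 1 otherwise, which is the convention `HEALTHCHECK` expects.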

### 3.2 Docker Compose for Development

In [None]:
docker_compose = '''
version: "3.8"

services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - MODEL_PATH=/models/latest
      - LOG_LEVEL=info
    volumes:
      - ./models:/models:ro
    deploy:
      resources:
        limits:
          memory: 4G
        reservations:
          memory: 2G
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)"]
      interval: 30s
      timeout: 10s
      retries: 3

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
'''

print("Docker Compose:")
print(docker_compose)
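
The Compose file passes `MODEL_PATH` and `LOG_LEVEL` to the container as environment variables. One way the application might read them is sketched below; the `Settings` dataclass and `load_settings` function are hypothetical, and the defaults simply mirror the values in the Compose file:

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    model_path: str
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from environment variables, falling back to the Compose values."""
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/latest"),
        log_level=env.get("LOG_LEVEL", "info").lower(),
    )


settings = load_settings()
print(settings)
```

Passing the environment in as a mapping keeps the function easy to test without modifying the real process environment.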

---

## 4. Scaling Strategies

### 4.1 Horizontal Scaling

| Strategy | Description | When to Use |
|----------|-------------|-------------|
| **Replicas** | Multiple identical pods | Stateless services |
| **Load Balancing** | Distribute requests | High traffic |
| **Auto-scaling** | Scale based on metrics | Variable load |

In [None]:
# Kubernetes HPA configuration
hpa_config = '''
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Pods metrics need a custom metrics adapter (for example, Prometheus Adapter)
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: 100
'''

print("Kubernetes HPA:")
print(hpa_config)
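
The Kubernetes documentation gives the HPA scaling rule as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, and when several metrics are configured the largest proposal is used. The sketch below applies that rule to the configuration above. It deliberately ignores the controller's tolerance band and stabilization windows, and the function name is illustrative:

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int = 2, max_replicas: int = 10) -> int:
    """Replica count one metric would request, clamped to the HPA bounds."""
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


# 4 pods at 90% average CPU against the 70% target
cpu_proposal = desired_replicas(4, 90, 70)    # 6
# 4 pods averaging 250 requests/s each against the 100 requests/s target
rps_proposal = desired_replicas(4, 250, 100)  # 10
# With several metrics, the largest proposal is used
print(max(cpu_proposal, rps_proposal))        # 10
```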

### 4.2 Optimization Techniques

| Technique | Typical Speedup (workload-dependent) | Effort | Notes |
|-----------|---------|--------|-------|
| **Batching** | 2-10x | Low | Combine requests |
| **Caching** | 10-100x | Low | Cache frequent queries |
| **Quantization** | 2-4x | Medium | Reduce precision |
| **Distillation** | 2-10x | High | Train smaller model |
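
Batching, the first row of the table, amounts to grouping inputs so the model is called once per group rather than once per input. A minimal sketch, assuming a hypothetical `predict_batch` callable that accepts a list of texts and returns one prediction per text:

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(texts: Sequence[str],
                       predict_batch: Callable[[Sequence[str]], List[str]],
                       batch_size: int = 32) -> List[str]:
    """Run predict_batch once per batch and return results in input order."""
    results: List[str] = []
    for batch in batched(texts, batch_size):
        results.extend(predict_batch(batch))
    return results


def fake_predict_batch(batch: Sequence[str]) -> List[str]:
    # Hypothetical stand-in for a real model's batch inference call.
    return ["positive" for _ in batch]


print(predict_in_batches(["a", "b", "c"], fake_predict_batch, batch_size=2))
```

This covers offline batching of a known list. Batching concurrent online requests additionally needs a queue and a short wait window, which is not shown here.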

In [None]:
# Simple caching with TTL
from datetime import datetime, timedelta
import hashlib

class PredictionCache:
    def __init__(self, ttl_seconds: int = 3600, max_size: int = 1000):
        self.cache = {}
        self.ttl = timedelta(seconds=ttl_seconds)
        self.max_size = max_size
    
    def _hash_input(self, text: str) -> str:
        return hashlib.md5(text.encode()).hexdigest()
    
    def get(self, text: str):
        key = self._hash_input(text)
        if key in self.cache:
            entry, timestamp = self.cache[key]
            if datetime.now() - timestamp < self.ttl:
                return entry
            del self.cache[key]
        return None
    
    def set(self, text: str, value):
        key = self._hash_input(text)
        # Evict the oldest entry only when adding a new key to a full cache
        if key not in self.cache and len(self.cache) >= self.max_size:
            oldest = min(self.cache, key=lambda k: self.cache[k][1])
            del self.cache[oldest]
        
        self.cache[key] = (value, datetime.now())

cache = PredictionCache(ttl_seconds=3600)
print("✅ Caching layer ready")
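
A cache like this is usually consulted with a cache-aside pattern: look up the input, and only run the model on a miss. The sketch below is self-contained, so it uses a small hypothetical `TTLCache` stand-in with an injectable clock (which makes expiry testable) instead of the class above; `cached_predict` and `predict_fn` are also illustrative names:

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Minimal TTL cache with an injectable clock (hypothetical stand-in)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # entry has expired
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._entries[key] = (value, self._clock())


def cached_predict(text: str, cache: TTLCache, predict_fn: Callable[[str], str]) -> str:
    """Return a cached prediction if present, otherwise compute and store it."""
    cached = cache.get(text)
    if cached is not None:
        return cached
    result = predict_fn(text)
    cache.set(text, result)
    return result


demo_cache = TTLCache(ttl_seconds=3600)
print(cached_predict("great product", demo_cache, lambda text: "positive"))
```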

---

## 5. Monitoring

### 5.1 Key Metrics

In [None]:
# Prometheus metrics with custom buckets
metrics_code = '''
from prometheus_client import Counter, Histogram, Gauge

# Request metrics
REQUEST_COUNT = Counter(
    "model_requests_total",
    "Total prediction requests",
    ["model_version", "status"]
)

REQUEST_LATENCY = Histogram(
    "model_request_latency_seconds",
    "Request latency in seconds",
    ["model_version"],
    buckets=[.01, .025, .05, .075, .1, .25, .5, .75, 1.0, 2.5]
)

# Model metrics
MODEL_LOAD_TIME = Gauge(
    "model_load_time_seconds",
    "Time to load model",
    ["model_version"]
)

PREDICTION_CONFIDENCE = Histogram(
    "prediction_confidence",
    "Distribution of prediction confidence scores",
    buckets=[.1, .2, .3, .4, .5, .6, .7, .8, .9, 1.0]
)
'''

print("Prometheus Metrics:")
print(metrics_code)
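
Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit `+Inf` bucket counts all observations. The pure-Python sketch below illustrates that counting rule only; it is not how `prometheus_client` is implemented, and the function name is an assumption:

```python
from bisect import bisect_left
from typing import Dict, List, Sequence


def cumulative_bucket_counts(observations: Sequence[float],
                             buckets: Sequence[float]) -> Dict[str, int]:
    """Count observations per bucket using cumulative 'less than or equal' semantics."""
    bounds: List[float] = sorted(buckets)
    per_bucket = [0] * (len(bounds) + 1)  # final slot holds values above every bound
    for value in observations:
        # bisect_left finds the first bound >= value, i.e. the smallest matching bucket.
        per_bucket[bisect_left(bounds, value)] += 1

    counts: Dict[str, int] = {}
    running = 0
    for bound, count in zip(bounds, per_bucket):
        running += count
        counts[str(bound)] = running
    counts["+Inf"] = running + per_bucket[-1]
    return counts


latencies = [0.05, 0.1, 0.3, 0.7, 2.0]
print(cumulative_bucket_counts(latencies, [0.1, 0.5, 1.0]))
# {'0.1': 2, '0.5': 3, '1.0': 4, '+Inf': 5}
```

This is why bucket boundaries should bracket the latency targets you care about: a percentile estimate can only be as precise as the nearest bucket edges.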

---

## 6. Production Checklist

### Pre-Deployment
- [ ] Model validated on production-like data
- [ ] Performance benchmarks meet SLAs
- [ ] Docker image built and tested
- [ ] Health/readiness endpoints working
- [ ] Rollback plan documented

### Deployment
- [ ] Canary deployment configured
- [ ] Auto-scaling policies set
- [ ] Load balancer configured
- [ ] SSL/TLS enabled

### Post-Deployment
- [ ] Monitoring dashboards set up
- [ ] Alerts configured
- [ ] Logging aggregation working
- [ ] Model drift detection enabled
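
For the drift detection item, one common starting point is the Population Stability Index (PSI), which compares the binned distribution of a feature or score in production against a reference sample. The sketch below is an illustration under assumptions: the function names, the bin edges, and the small floor `eps` used to avoid division by zero are all choices made here, and the alert threshold is something to calibrate per model:

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float],
                  eps: float = 1e-6) -> List[float]:
    """Fraction of values in each bin; values outside the edges are ignored."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        if edges[0] <= value <= edges[-1]:
            # Bins are [edges[i], edges[i+1]); the top edge is clamped into the last bin.
            index = min(bisect_right(edges, value) - 1, len(counts) - 1)
            counts[index] += 1
    total = sum(counts) or 1
    return [max(count / total, eps) for count in counts]


def population_stability_index(expected: Sequence[float], actual: Sequence[float],
                               edges: Sequence[float]) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    e = bin_fractions(expected, edges)
    a = bin_fractions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [0.1, 0.2, 0.6, 0.9]   # e.g. confidence scores at validation time
production = [0.6, 0.7, 0.8, 0.9]  # e.g. confidence scores seen in production
print(population_stability_index(reference, production, [0.0, 0.5, 1.0]))
```

Identical distributions give a PSI of zero, and the value grows as the production distribution moves away from the reference.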

---

## 📝 Summary

### Key Takeaways

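1. **Use health checks** - Load balancers and orchestrators rely on them to route traffic only to healthy instances
2. **Cache aggressively** - Repeated inputs can skip inference entirely
3. **Monitor everything** - You can't fix what you can't see
4. **Plan for failure** - Design for graceful degradation and keep a rollback path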
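
Graceful degradation, the last point, can be as simple as returning a safe default when the primary model call fails. A minimal sketch, assuming hypothetical `primary` and `fallback` callables; the boolean in the result tells the caller whether the response was degraded:

```python
from typing import Callable, Tuple


def predict_with_fallback(text: str,
                          primary: Callable[[str], str],
                          fallback: Callable[[str], str]) -> Tuple[str, bool]:
    """Return (prediction, degraded), using the fallback if the primary raises."""
    try:
        return primary(text), False
    except Exception:
        # In a real service, log the exception and increment an error counter here.
        return fallback(text), True


def failing_model(text: str) -> str:
    # Hypothetical primary model that is currently unavailable.
    raise RuntimeError("model unavailable")


def default_answer(text: str) -> str:
    # Hypothetical safe default returned while the primary model is down.
    return "unknown"


print(predict_with_fallback("great product", failing_model, default_answer))
```

Surfacing the `degraded` flag in the response or in a metric keeps fallbacks visible, so a silent outage does not look like normal traffic.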

### Production Architecture

```
Users → CDN → Load Balancer → API Pods → Model
                    ↓
              Cache Layer
                    ↓
              Monitoring
```