
# **CHAPTER 22: MODEL DEPLOYMENT & SERVING**

*From Training Artifacts to Production APIs*

## **Chapter Overview**

A trained model provides no business value until it serves predictions. This chapter covers the engineering patterns for deploying models reliably: containerization strategies, API design, serverless vs. dedicated infrastructure, and optimization techniques for low-latency inference. You will learn to bridge the gap between data science artifacts and production software systems.

**Estimated Time:** 35-45 hours (3 weeks)  
**Prerequisites:** Chapter 19 (System Design), Chapter 21 (Training), Docker and Kubernetes basics

---

## **22.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Serialize and package models using standardized formats (ONNX, TorchScript, SavedModel)
2. Design REST/gRPC APIs for model serving with FastAPI and Triton Inference Server
3. Deploy containerized models to Kubernetes with auto-scaling and rolling updates
4. Optimize inference latency through batching, caching, and model compression
5. Implement A/B testing and canary deployments for model updates
6. Architect edge deployment strategies for IoT and mobile applications

---

## **22.1 Model Serialization & Formats**

#### **22.1.1 Framework-Native Formats**

**PyTorch TorchScript:**
```python
# tracing.py
import torch
from model import MyModel

model = MyModel()
model.load_state_dict(torch.load("model.pth"))
model.eval()

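# Method 1: Tracing (records the ops executed for the example input;
# data-dependent control flow is not captured)
example_input = torch.rand(1, 3, 224, 224)
traced_script_module = torch.jit.trace(model, example_input)
traced_script_module.save("model_traced.pt")

# Method 2: Scripting (compiles the Python source, so if/for logic is preserved)
scripted_model = torch.jit.script(model)
scripted_model.save("model_scripted.pt")

# Loading in production: the MyModel class is not needed, and the same file
# can be loaded from C++ via libtorch when a Python-free runtime is required
loaded_model = torch.jit.load("model_traced.pt")
input_tensor = torch.rand(1, 3, 224, 224)  # Placeholder for a preprocessed batch
with torch.no_grad():
    output = loaded_model(input_tensor)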
```

**TensorFlow SavedModel:**
```python
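import tensorflow as tf

# Export (the trailing "1" is the version directory TensorFlow Serving expects)
tf.saved_model.save(model, "saved_model/1/")

# Load for serving
loaded = tf.saved_model.load("saved_model/1/")
infer = loaded.signatures["serving_default"]
# The keyword must match the signature's input name; check
# infer.structured_input_signature if "input_1" is rejected.
predictions = infer(input_1=tf.constant([[1.0, 2.0]]))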
```

#### **22.1.2 Framework-Agnostic: ONNX**

Open Neural Network Exchange enables interoperability between frameworks.

```python
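# export_onnx.py
import torch

# `model` is the trained torch.nn.Module from the previous section.
# The model and the dummy input must live on the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
dummy_input = torch.randn(1, 3, 224, 224, device=device)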

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=14,
    do_constant_folding=True,  # Optimize constant expressions
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={  # Variable batch sizes
        "input": {0: "batch_size"},
        "output": {0: "batch_size"}
    }
)

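# Inference with ONNX Runtime
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]  # In priority order
)

# Run inference (float32 input; the batch dimension may vary thanks to dynamic_axes)
input_numpy = np.random.rand(4, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": input_numpy})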
```

**TensorRT Optimization (NVIDIA GPUs):**
```python
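import tensorrt as trt

# Parse ONNX and build a TensorRT engine (FP16 here; INT8 also needs calibration data)
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse model.onnx")

config = builder.create_builder_config()
# TensorRT 8.4+ API; older releases used config.max_workspace_size
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB
config.set_flag(trt.BuilderFlag.FP16)  # Enable FP16

# The ONNX export above has a dynamic batch axis, so a shape profile is required
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224))
config.add_optimization_profile(profile)

# TensorRT 8+ API; builder.build_engine is deprecated and absent from newer releases
serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:  # Output file name is an arbitrary choice
    f.write(serialized_engine)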
```

---

## **22.2 Serving Architectures**

#### **22.2.1 REST API with FastAPI**

Production-grade Python API for model serving.

```python
# api/main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import numpy as np
import onnxruntime as ort
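from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI(title="ML Model Server")

# Metrics (served at /metrics; with several Uvicorn workers, prometheus_client's
# multiprocess mode is needed for aggregated values)
prediction_counter = Counter('model_predictions_total', 'Total predictions')
latency_histogram = Histogram('model_latency_seconds', 'Inference latency')
app.mount("/metrics", make_asgi_app())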

# Load model on startup
@app.on_event("startup")
async def load_model():
    global session
    session = ort.InferenceSession("model.onnx")
    app.state.model_ready = True

class PredictionRequest(BaseModel):
    features: list[float]
    return_proba: bool = False

class PredictionResponse(BaseModel):
    prediction: int
    probability: float | None = None
    model_version: str

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    # Plain `def`: FastAPI runs it in a thread pool, so the blocking
    # session.run() call does not stall the event loop.
    if not app.state.model_ready:
        raise HTTPException(status_code=503, detail="Model not loaded")
    
    with latency_histogram.time():
        input_array = np.array([request.features], dtype=np.float32)
        outputs = session.run(None, {"input": input_array})
        prediction = int(np.argmax(outputs[0]))
        # Assumes the model ends in a softmax; raw logits are not probabilities
        probability = float(np.max(outputs[0])) if request.return_proba else None
    
    prediction_counter.inc()
    
    return PredictionResponse(
        prediction=prediction,
        probability=probability,
        model_version="1.0.0"
    )

@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": app.state.model_ready}

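# Dockerfile (shown here as a string; save it as a separate file named "Dockerfile")
"""
# Python 3.10+ is required for the `float | None` annotation used above
FROM python:3.11-slim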

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY model.onnx .
COPY main.py .

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
"""
```
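
The `/predict` handler reports `np.max(outputs[0])` as a probability, which only holds if the exported model already ends in a softmax layer. If the model emits raw logits instead, a conversion step is needed first. The following is a minimal, dependency-free sketch of that step; the helper names `softmax` and `top_class` are illustrative and not part of the API above.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    # Subtract the max before exponentiating to avoid overflow on large logits
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_class(logits):
    """Return (class_index, probability) for the highest-scoring class."""
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    return index, probs[index]

index, prob = top_class([2.0, 0.5, -1.0])
print(index, round(prob, 3))
```

In a NumPy-based server the same logic would normally be vectorized; the pure-Python form is used here only to keep the example self-contained.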

#### **22.2.2 gRPC for High Performance**

A binary protocol over HTTP/2 that typically has lower serialization overhead and latency than JSON over REST, which makes it a good fit for internal microservices.

```protobuf
// service.proto
syntax = "proto3";

service ModelService {
  rpc Predict(PredictRequest) returns (PredictResponse);
  rpc StreamPredict(stream PredictRequest) returns (stream PredictResponse);
}

message PredictRequest {
  repeated float features = 1;
  bool return_proba = 2;
}

message PredictResponse {
  int32 prediction = 1;
  float probability = 2;
  string model_version = 3;
}
```

```python
# server.py
from concurrent import futures
import grpc
import numpy as np
import onnxruntime as ort
import service_pb2        # Generated from service.proto by grpcio-tools
import service_pb2_grpc

class ModelServicer(service_pb2_grpc.ModelServiceServicer):
    def __init__(self):
        self.session = ort.InferenceSession("model.onnx")
    
    def Predict(self, request, context):
        input_array = np.array([request.features], dtype=np.float32)
        outputs = self.session.run(None, {"input": input_array})
        
        return service_pb2.PredictResponse(
            prediction=int(np.argmax(outputs[0])),
            probability=float(np.max(outputs[0])),
            model_version="1.0.0"
        )

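# StreamPredict is not overridden above, so calls to it return UNIMPLEMENTED.
server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
service_pb2_grpc.add_ModelServiceServicer_to_server(ModelServicer(), server)
server.add_insecure_port("[::]:50051")  # No TLS; acceptable only on a trusted network
server.start()
server.wait_for_termination()  # Block so the process keeps serving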
```

#### **22.2.3 Dedicated Inference Servers**

**TensorFlow Serving:**
```bash
docker run -p 8501:8501 \
  --mount type=bind,source=/models,target=/models \
  tensorflow/serving \
  --model_name=fraud_model \
  --model_base_path=/models/fraud_model
```

**TorchServe:**
```python
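# handler.py
import json

import torch
from ts.torch_handler.base_handler import BaseHandler

class ModelHandler(BaseHandler):
    def preprocess(self, data):
        # This sketch reads only the first request in the batch
        body = data[0]["body"]
        if isinstance(body, (bytes, bytearray, str)):
            body = json.loads(body)
        return torch.tensor(body).float()
    
    def postprocess(self, inference_output):
        return [inference_output.tolist()]

# Create the MAR file (shell command; run it in a terminal, not in Python):
#   torch-model-archiver \
#     --model-name resnet \
#     --version 1.0 \
#     --model-file model.py \
#     --serialized-file model.pth \
#     --handler handler.py \
#     --export-path model_store/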
```

**NVIDIA Triton Inference Server:**
- Supports multiple frameworks (ONNX, TensorRT, PyTorch, TF)
- Dynamic batching and model ensemble
- GPU sharing between models

---

## **22.3 Kubernetes Deployment**

#### **22.3.1 KServe (Kubernetes ML Serving)**

Standardized inference platform on Kubernetes.

```yaml
# inference_service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detection
spec:
  predictor:
    serviceAccountName: kserve-sa
    pytorch:
      storageUri: s3://models/fraud-detection
      resources:
        limits:
          memory: "4Gi"
          cpu: "2"
          nvidia.com/gpu: "1"
        requests:
          memory: "2Gi"
          cpu: "1"
      runtimeVersion: "1.13.0"
    containerConcurrency: 10  # Max concurrent requests per pod
    timeout: 60
  transformer:  # Pre/post processing
    containers:
      - image: fraud-transformer:latest
        name: transformer
```

#### **22.3.2 Auto-scaling Configuration**

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods  # Custom metric: queue length
    pods:
      metric:
        name: inference_queue_length
      target:
        type: AverageValue
        averageValue: "5"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
```
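
To reason about how the manifest above behaves, it helps to apply the HPA's documented scaling rule by hand: for each metric, `desiredReplicas = ceil(currentReplicas * currentValue / targetValue)`, with no change when the ratio is within a tolerance of 1.0 (10% by default), and the largest proposal wins when several metrics are configured. The sketch below implements only that rule; the function name is illustrative, and the `behavior.scaleDown` rate limits and stabilization window are deliberately not modeled.

```python
import math

def desired_replicas(current, metrics, min_replicas=2, max_replicas=50, tolerance=0.1):
    """metrics: list of (current_value, target_value) pairs, one per HPA metric."""
    proposals = []
    for value, target in metrics:
        ratio = value / target
        if abs(ratio - 1.0) <= tolerance:
            proposals.append(current)  # Close enough to target: no change
        else:
            proposals.append(math.ceil(current * ratio))
    # With several metrics, the HPA acts on the largest proposal
    wanted = max(proposals)
    return max(min_replicas, min(max_replicas, wanted))

# 4 pods at 90% CPU (target 70) with 12 queued requests per pod (target 5)
print(desired_replicas(4, [(90, 70), (12, 5)]))  # → 10
```

Here the CPU metric alone would ask for 6 replicas, but the queue-length metric asks for 10, so the queue drives the decision.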

#### **22.3.3 Canary & Blue-Green Deployments**

```yaml
# canary.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-route
spec:
  hosts:
  - model-service
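  http:
  - match:  # Header opt-in: testers always reach the new version
    - headers:
        canary:
          exact: "true"
    route:
    - destination:
        host: model-service
        subset: v2  # New version
  - route:  # Everyone else: weighted split (weights in one route sum to 100)
    - destination:
        host: model-service
        subset: v1  # Current version
      weight: 90
    - destination:
        host: model-service
        subset: v2  # New version
      weight: 10
  # Subsets v1 and v2 must be defined in a matching DestinationRule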
```
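
A canary setup typically combines two rules: requests carrying an opt-in header go to the new version, and the remaining traffic is split by weight. In Istio the sidecar proxies make this choice per request; the simulation below only mimics the decision logic so the resulting traffic share can be sanity-checked. The `pick_subset` function and the seeded generator are assumptions made for illustration.

```python
import random

def pick_subset(headers, rng, canary_weight=10):
    """Mimic the two routing rules: header opt-in first, then a weighted split."""
    if headers.get("canary") == "true":
        return "v2"  # Explicit opt-in always reaches the new version
    # rng.random() is uniform in [0, 1); scale it to a percentage
    return "v2" if rng.random() * 100 < canary_weight else "v1"

rng = random.Random(42)  # Seeded so the simulation is repeatable
hits = [pick_subset({}, rng) for _ in range(10_000)]
canary_share = hits.count("v2") / len(hits)
print(f"canary share: {canary_share:.1%}")
```

With 10,000 simulated requests the share lands close to the configured 10%, which is the behavior to expect from the weighted route over a long enough window.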

---

## **22.4 Optimization Strategies**

#### **22.4.1 Dynamic Batching**

Combine individual requests to improve GPU utilization.

```python
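# Minimal asyncio batching middleware (a sketch, not production code)
import asyncio

class Batcher:
    def __init__(self, model, max_batch_size=32, max_wait_ms=10):
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait_ms / 1000
        self.queue = []
        self._timer = None  # Pending flush task for the current batch
    
    async def predict(self, input_data):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((input_data, future))
        
        if len(self.queue) >= self.max_batch_size:
            self._process_batch()  # Batch is full: flush immediately
        elif self._timer is None:
            # The first request of a new batch starts the flush timer
            self._timer = asyncio.create_task(self._timeout_process())
        
        return await future
    
    async def _timeout_process(self):
        await asyncio.sleep(self.max_wait)
        self._timer = None
        self._process_batch()
    
    def _process_batch(self):
        if self._timer is not None:
            self._timer.cancel()  # A full batch pre-empts the timer
            self._timer = None
        if not self.queue:
            return
        batch, self.queue = self.queue, []
        try:
            # Batch inference; this call is synchronous and blocks the event loop
            results = self.model([item[0] for item in batch])
        except Exception as exc:
            for _, future in batch:
                if not future.done():
                    future.set_exception(exc)
            return
        for (_, future), result in zip(batch, results):
            if not future.done():  # Skip callers that were cancelled
                future.set_result(result)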
```
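
The cost of batching is easiest to see with rough numbers. The sketch below estimates batch size, worst-case latency, and peak throughput for one batching window. It assumes evenly spaced arrivals at a positive rate and a service time of `base_ms + per_item_ms * batch_size`; both are simplifications, and the function name and parameters are illustrative rather than taken from any serving framework.

```python
def batching_estimate(rate_rps, max_batch_size, max_wait_ms, base_ms, per_item_ms):
    """Rough steady-state estimate for one batching window (rate_rps must be > 0)."""
    # Requests that arrive while the first request in the window is waiting
    arrivals_during_wait = int((rate_rps * max_wait_ms) // 1000)
    batch_size = min(max_batch_size, 1 + arrivals_during_wait)

    if batch_size == max_batch_size:
        # The batch fills before the timer fires
        wait_ms = (max_batch_size - 1) * 1000 / rate_rps
    else:
        wait_ms = max_wait_ms

    service_ms = base_ms + per_item_ms * batch_size
    return {
        "batch_size": batch_size,
        "worst_case_latency_ms": wait_ms + service_ms,
        "max_throughput_rps": batch_size * 1000 / service_ms,
    }

batched = batching_estimate(1000, 32, 10, base_ms=5.0, per_item_ms=0.5)
single = batching_estimate(1000, 1, 10, base_ms=5.0, per_item_ms=0.5)
print(batched)
print(single)
```

Under these assumed costs, batching raises peak throughput several times over while the first request in each window waits the full 10 ms, which is the trade-off the tuning parameters control.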

#### **22.4.2 Model Quantization for Serving**

**Post-Training Quantization (PTQ):**
```python
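# PyTorch Dynamic Quantization
import torch

quantized_model = torch.quantization.quantize_dynamic(
    model, 
    {torch.nn.Linear},  # Layers to quantize
    dtype=torch.qint8
)
# INT8 weights take 1 byte instead of 4, so quantized layers shrink by up to 4x.
# CPU speedups depend on the model and hardware; benchmark before and after.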

# ONNX Runtime Quantization
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "model.onnx",
    "model_int8.onnx",
    weight_type=QuantType.QInt8
)
```
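
To make the size and accuracy trade-off concrete, the sketch below quantizes a handful of weights by hand using symmetric per-tensor INT8 quantization. It illustrates the general idea only: the schemes PyTorch and ONNX Runtime actually apply differ in detail (zero points, per-channel scales), and the function names are chosen for this example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.0, 1.0, 0.333]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one byte plus a shared scale, and the rounding error per weight is bounded by half the scale, which is why accuracy usually degrades only slightly when the weight range is well behaved.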

#### **22.4.3 Caching Strategies**

```python
import hashlib
import json

import redis.asyncio as redis  # Async client (redis-py 4.2+); the sync client cannot be awaited

class PredictionCache:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        self.ttl = 300  # 5 minutes
    
    def get_key(self, features):
        # Deterministic hash of input (a cache key, not a security boundary)
        feature_str = json.dumps(features, sort_keys=True)
        return f"pred:{hashlib.sha256(feature_str.encode()).hexdigest()}"
    
    async def predict_with_cache(self, model, features):
        key = self.get_key(features)
        
        # Check cache
        cached = await self.redis.get(key)
        if cached is not None:
            return json.loads(cached)
        
        # Compute and cache (the result must be JSON-serializable,
        # e.g. call .tolist() on NumPy output first)
        result = model.predict(features)
        await self.redis.setex(key, self.ttl, json.dumps(result))
        return result
```
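
Two details decide whether a prediction cache earns its keep: the key must be identical for logically identical inputs, and entries must expire. The sketch below shows both with an in-memory stand-in for Redis so it runs without a server. It assumes features arrive as a name-to-float mapping; `cache_key`, `TTLCache`, and the rounding precision are illustrative choices, not part of redis-py.

```python
import hashlib
import json
import time

def cache_key(features, precision=4):
    """Build a deterministic key; rounding makes near-identical floats share a key."""
    normalized = {name: round(value, precision) for name, value in features.items()}
    payload = json.dumps(normalized, sort_keys=True)
    return "pred:" + hashlib.sha256(payload.encode()).hexdigest()

class TTLCache:
    """In-memory stand-in for Redis GET/SETEX, for illustration and tests only."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}

    def get(self, key):
        entry = self.store.get(key)
        if entry is None or entry[1] <= self.clock():
            self.store.pop(key, None)  # Drop expired entries lazily
            return None
        return entry[0]

    def set(self, key, value):
        self.store[key] = (value, self.clock() + self.ttl)

key_a = cache_key({"amount": 120.5, "age_days": 30.0})
key_b = cache_key({"age_days": 30.0, "amount": 120.5})
print(key_a == key_b)
```

Rounding trades a small amount of precision for a higher hit rate, so the precision should be chosen so that inputs sharing a key would receive the same prediction anyway.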

---

## **22.5 Workbook Labs**

### **Lab 1: Multi-Format Model Export**
Export a trained PyTorch model to multiple formats:

1. **Native:** Save as `.pt` (state_dict and TorchScript)
2. **ONNX:** Export with dynamic axes, verify with ONNX Runtime
3. **TensorRT:** Convert ONNX to TensorRT engine (FP16)
4. **Benchmark:** Compare inference latency (PyTorch vs ONNX vs TensorRT)

**Deliverable:** Benchmark report showing throughput and latency for each format.

### **Lab 2: Production API Development**
Build a FastAPI serving application:

1. **Endpoints:** `/predict` (sync), `/predict/batch`, `/health`, `/metrics`
2. **Validation:** Pydantic models for input validation, error handling
3. **Observability:** Prometheus metrics (request count, latency histograms)
4. **Load Testing:** Test with locust/k6, identify bottlenecks (CPU vs I/O bound)

**Deliverable:** Dockerized API with docker-compose and load test results.

### **Lab 3: Kubernetes Deployment**
Deploy to K8s with advanced patterns:

1. **KServe:** Deploy InferenceService with canary splitting (10% traffic to v2)
2. **Auto-scaling:** Configure HPA based on custom metrics (GPU utilization)
3. **A/B Testing:** Route traffic based on headers (internal vs. external users)
4. **Rolling Update:** Deploy new version with zero-downtime rolling strategy

**Deliverable:** Kubernetes manifests, deployment runbook, rollback procedures.

### **Lab 4: Edge Deployment**
Optimize model for mobile/IoT:

1. **Quantization:** INT8 quantization with calibration dataset
2. **Mobile:** Convert to CoreML (iOS) or TFLite (Android)
3. **Size Optimization:** Pruning to 50% sparsity, measure accuracy loss
4. **On-Device Test:** Deploy to mobile simulator, measure inference time

**Deliverable:** Mobile-optimized model with size/latency comparison vs. cloud API.

---

## **22.6 Common Pitfalls**

1. **GIL Contention:** Python's Global Interpreter Lock limits multi-threading for CPU-bound inference. **Solution:** Use multi-processing (Uvicorn workers) or run inference through ONNX Runtime/TensorRT, whose native inference calls release the GIL.

2. **Memory Leaks:** Loading new model versions without unloading old ones in long-running containers. **Solution:** Implement proper lifecycle management, use separate pods for model updates (immutable infrastructure).

3. **Cold Start Latency:** Serverless (Lambda) or scaled-to-zero K8s adds 2-10s startup. **Solution:** Keep minimum replicas warm, use provisioned concurrency, optimize container image size (distroless images).

4. **Resource Starvation:** Not setting resource requests and limits causes noisy-neighbor issues. **Solution:** Always set CPU/memory requests and limits in K8s; for latency-critical services, set requests equal to limits so the pod is assigned the Guaranteed QoS class.

5. **Synchronous I/O Blocking:** Fetching features from a database with a synchronous client inside an `async` handler blocks the event loop. **Solution:** Use async database drivers (asyncpg, `redis.asyncio`) or offload the call to a thread pool (`run_in_executor`).
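
The thread-pool remedy from pitfall 5 can be sketched in a few lines. The example offloads a blocking lookup with `loop.run_in_executor` while a heartbeat coroutine counts how often the event loop still gets a turn. `blocking_feature_lookup` is a hypothetical stand-in for a synchronous database call, and the returned fields are made up.

```python
import asyncio
import time

def blocking_feature_lookup(user_id):
    """Stand-in for a synchronous database call (hypothetical helper)."""
    time.sleep(0.2)
    return {"user_id": user_id, "txn_count_24h": 3}

async def heartbeat(stop, ticks):
    # Counts how often the event loop gets a turn while the lookup runs
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)

async def handle_request(user_id):
    loop = asyncio.get_running_loop()
    stop, ticks = asyncio.Event(), []
    beat = asyncio.create_task(heartbeat(stop, ticks))
    # None selects the loop's default ThreadPoolExecutor
    features = await loop.run_in_executor(None, blocking_feature_lookup, user_id)
    stop.set()
    await beat
    return features, len(ticks)

features, tick_count = asyncio.run(handle_request(42))
print(features, tick_count)
```

If the lookup were called directly inside the coroutine, the heartbeat would record a single tick at most, because nothing else can run until the blocking call returns.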

---

## **22.7 Interview Questions**

**Q1:** When would you choose gRPC over REST for model serving?
*A: gRPC for: (1) Internal microservices (binary protocol, schema enforcement), (2) High throughput/low latency requirements (HTTP/2 multiplexing, protobuf serialization faster than JSON), (3) Streaming (bidirectional streaming for real-time features). REST for: (1) External/public APIs (browser compatibility, easier debugging), (2) Simple request/response patterns, (3) When human-readable payloads needed. Many systems use both: gRPC internally, REST gateway externally.*

**Q2:** How do you handle model versioning and rollback in production?
*A: (1) Semantic versioning for models (v1.2.3), stored in a model registry (MLflow), (2) Progressive delivery: run old and new versions side by side, then either switch all traffic at once (blue-green) or shift it gradually (canary), (3) Kubernetes: label selectors for version routing, (4) Database: store model_version with predictions for audit/debugging, (5) Rollback: revert traffic to the previous stable version via Istio/Flagger or a K8s deployment rollback, (6) Immutable models: never overwrite deployed artifacts; always deploy a new version with a new ID.*

**Q3:** Explain the trade-offs between dynamic batching and single-request serving.
*A: Dynamic batching increases throughput (better GPU utilization) at the cost of latency (requests wait while a batch forms). It suits high-volume, latency-tolerant workloads (budgets of 100 ms or more). Single-request serving has lower latency (immediate processing) but lower throughput (the GPU is underutilized, especially by small models). It suits strict real-time budgets (<50 ms). Hybrid: use single-request serving for a premium tier and batching for the standard tier. Tuning parameters: max_batch_size (higher means more throughput and more latency) and a maximum wait such as max_wait_ms (caps how long a request waits for a batch to fill).*

**Q4:** How do you optimize a model that is CPU-bound vs. GPU-bound?
*A: CPU-bound optimizations: (1) Quantization to INT8 (vectorized CPU instructions), (2) ONNX Runtime/OpenVINO (optimized kernels), (3) Batch processing to amortize overhead, (4) Multi-processing to bypass the GIL. GPU-bound optimizations: (1) TensorRT/ONNX Runtime GPU providers (kernel fusion), (2) Mixed precision (FP16/TF32), (3) Dynamic batching (higher GPU utilization), (4) Model parallelism for models too large for one device, (5) Overlapping host-to-device data transfer with compute (CUDA streams, pinned memory). Profile with NVIDIA Nsight or PyTorch Profiler to identify bottlenecks before optimizing.*

**Q5:** Design a deployment strategy for a critical fraud detection model requiring 99.99% uptime.
*A: Architecture: Multi-region active-active with global load balancer. Each region: Blue-green deployment with canary analysis (5% → 25% → 100%). Circuit breakers to fallback to rule-based system if model fails. Health checks: Deep health (actual inference test) not just shallow (process up). Database: Predictions logged asynchronously to avoid blocking. Rollback: Automated via flagger/Istio if error rate > threshold or latency p99 > SLA. Shadow mode: New versions run in parallel (log only, don't serve) for 24h before traffic shift. Data: Feature store with multi-region replication, offline mode with cached features.*
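
The circuit-breaker fallback mentioned in Q5 can be illustrated with a small sketch. After a configurable number of consecutive model failures the breaker opens and routes straight to a rule-based fallback, then allows a single trial call once a cool-down has passed. The class, the thresholds, and the `rule_based_score` rule are all assumptions for illustration; production systems often rely on a service mesh or a dedicated library for this.

```python
import time

class CircuitBreaker:
    """Route to a fallback after repeated model failures (illustrative sketch)."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, primary, fallback, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback(*args)  # Open: skip the model entirely
            self.opened_at = None  # Half-open: allow one trial call
            self.failures = self.max_failures - 1
        try:
            result = primary(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback(*args)
        self.failures = 0  # Any success closes the circuit fully
        return result

def rule_based_score(amount):
    # Hypothetical fallback rule: flag large transactions
    return 1 if amount > 10_000 else 0

breaker = CircuitBreaker(max_failures=2)
print(breaker.call(lambda amount: 0, rule_based_score, 50_000))
```

While the breaker is open the failing model is not called at all, which protects latency during an outage; the fallback keeps the service answering with reduced accuracy.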

---

## **22.8 Further Reading**

**Books:**
- *Kubeflow for Machine Learning* (Holden Karau et al.) - K8s for ML workflows
- *gRPC: Up and Running* (Kasun Indrasiri and Danesh Kuruppu) - Building and operating gRPC services

**Papers:**
- "Serving Machine Learning Models with Apache Flink" (Uber)
- "Clipper: A Low-Latency Online Prediction Serving System" (UC Berkeley)

**Tools:**
- **KServe:** Standardized inference on Kubernetes
- **Seldon Core:** Advanced ML deployments (A/B tests, ensembles)
- **BentoML:** Model packaging and serving framework
- **Cortex:** Serverless model serving

---

## **22.9 Checkpoint Project: Multi-Model Serving Platform**

Build a production serving platform handling 3 different model types: image classification (CNN), tabular fraud detection (XGBoost), and NLP sentiment (Transformer).

**Requirements:**

1. **Model Packaging:**
   - CNN: TensorRT optimized, batch size 8-32 dynamic
   - XGBoost: ONNX format, CPU-only (lightweight)
   - Transformer: TorchScript, GPU with FP16

2. **API Design:**
   - Unified gateway routing `/api/v1/cnn`, `/api/v1/fraud`, `/api/v1/sentiment`
   - Authentication via JWT tokens
   - Rate limiting: 100 req/min per API key

3. **Infrastructure:**
   - Kubernetes deployment with KServe
   - Separate node pools: GPU nodes for CNN/Transformer, CPU nodes for XGBoost
   - Horizontal Pod Autoscaler per model type based on queue depth

4. **Optimization:**
   - Redis cache for fraud model (high hit rate on repeated users)
   - Dynamic batching for CNN (max wait 20ms)
   - Model warm-up on startup (dummy inference to initialize GPU)

5. **Observability:**
   - Prometheus metrics: latency histograms (p50, p95, p99), throughput, GPU memory
   - Distributed tracing (Jaeger) across gateway → model → feature store
   - Alerting: P95 latency > 100ms, error rate > 0.1%

6. **Deployment Strategy:**
   - Canary releases: 5% traffic to new versions for 1 hour before full rollout
   - Automatic rollback if error rate increases
   - Circuit breaker pattern to fallback to default responses

**Deliverables:**
- `serving_platform/` with Kubernetes manifests
- Load testing results (achieve 1000 TPS aggregate across models)
- Runbook: "Adding a new model to the platform"
- Cost analysis: $/1000 predictions per model type

**Success Criteria:**
- Zero-downtime deployment demonstrated
- P99 latency < 50ms for fraud, < 200ms for transformer
- Auto-scaling tested (traffic spike from 10 → 1000 TPS)
- Failed canary automatically rolled back
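
Checking the latency criteria above against raw load-test samples needs a percentile calculation. Prometheus estimates percentiles from histogram buckets with `histogram_quantile`, which is approximate; for a list of recorded latencies, a nearest-rank percentile is a simple exact alternative. The helper names and sample values below are made up for illustration.

```python
def percentile(samples_ms, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Ceiling of p * n / 100, done in integers to avoid float rounding
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]

def meets_slo(samples_ms, p99_budget_ms):
    """True when the observed P99 is strictly under the budget."""
    return percentile(samples_ms, 99) < p99_budget_ms

latencies = [12, 15, 11, 48, 14, 13, 16, 12, 95, 14]  # Made-up load-test samples
print(percentile(latencies, 50), percentile(latencies, 99), meets_slo(latencies, 50))
# → 14 95 False
```

The median looks healthy here while a single slow request pushes P99 past the 50 ms budget, which is why the success criteria are stated in tail percentiles rather than averages.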

---

**End of Chapter 22**

*You can now deploy models reliably at scale. Chapter 23 covers Monitoring & Maintenance—ensuring these deployed models remain accurate and healthy over time.*

---
