# **CHAPTER 19: ML SYSTEM DESIGN & ARCHITECTURE**

*Designing for Scale and Reliability*

## **Chapter Overview**

Building a model is the first 10% of the journey; deploying it reliably at scale is the remaining 90%. This chapter transitions from data science to machine learning engineering, covering the architectural patterns, design trade-offs, and system components required to serve millions of predictions per second. You will learn to navigate the fundamental tensions: latency vs. throughput, cost vs. accuracy, batch vs. real-time, and consistency vs. availability.

**Estimated Time:** 40-50 hours (3 weeks)  
**Prerequisites:** Chapters 4 (Tools), 9 (Evaluation), and all modeling chapters (understanding what you're deploying)

---

## **19.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Design end-to-end ML systems with clear separation of concerns (feature platform, training, serving, monitoring)
2. Select appropriate serving patterns (batch, online, streaming) based on latency and freshness requirements
3. Architect for horizontal scalability using distributed systems principles and cloud-native technologies
4. Optimize inference latency and throughput through caching, batching, and model compression
5. Perform capacity planning and cost optimization for cloud-based ML infrastructure
6. Design resilient systems that handle cascading failures, model degradation, and data drift

---

## **19.1 ML System Components**

A production ML system is not just a model; it's a pipeline of interconnected services.

#### **19.1.1 The ML Platform Architecture**

```
┌─────────────────────────────────────────────────────────────┐
│                        Data Sources                          │
│  (Databases, Logs, Streams, External APIs)                   │
└──────────────────────┬──────────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────┐
│                  Data Ingestion Layer                        │
│  (Kafka, Kinesis, Airflow, Spark Streaming)                 │
└──────────────────────┬──────────────────────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        │              │              │
┌───────▼──────┐ ┌────▼─────┐ ┌──────▼──────┐
│   Feature    │ │ Training │ │  Monitoring │
│    Store     │ │ Pipeline │ │   & Logging │
│ (Feast,      │ │(Kubeflow,│ │ (Evidently, │
│  Tecton)     │ │ Vertex)  │ │  WhyLabs)   │
└───────┬──────┘ └────┬─────┘ └──────┬──────┘
        │             │              │
        └─────────────┼──────────────┘
                      │
┌─────────────────────▼───────────────────────────────────────┐
│                    Model Registry                            │
│        (MLflow, Weights & Biases, Vertex AI)                │
└─────────────────────┬───────────────────────────────────────┘
                      │
       ┌──────────────┼────────────────┐
       │              │                │
┌──────▼──────┐ ┌─────▼──────┐ ┌───────▼────────┐
│   Batch     │ │   Online   │ │   Streaming    │
│  Inference  │ │  Serving   │ │   Inference    │
│ (Spark,     │ │(TF Serving,│ │ (Flink,        │
│  Ray)       │ │ TorchServe)│ │ Kafka Streams) │
└─────────────┘ └────────────┘ └────────────────┘
```

#### **19.1.2 The Feature Store**

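A feature store is a centralized service for defining, storing, and serving features. Its main job is to prevent **training-serving skew**: features computed one way for training and another way at serving time.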

**Components:**
- **Offline Store:** Historical data for training (Data warehouse: BigQuery, Snowflake)
- **Online Store:** Low-latency serving (Redis, DynamoDB, Cassandra)
- **Feature Registry:** Metadata, versioning, lineage tracking

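**Why it matters:** If training uses `avg_price_last_7_days` from a batch job that ran at 2 AM, but serving recomputes it from live data at request time, the model sees different inputs in production than it was trained on. A feature store avoids this by serving both paths from a single feature definition.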

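```python
# Feature retrieval with Feast. Assumes a configured feature repo in the
# current directory that defines a `user_transactions` feature view.
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

feature_refs = [
    "user_transactions:avg_spend_30d",
    "user_transactions:transaction_count_7d",
]

# Online retrieval (for real-time prediction)
features = store.get_online_features(
    features=feature_refs,
    entity_rows=[{"user_id": "user_123"}],
).to_df()

# Offline retrieval (for training): one row per entity plus the timestamp
# at which the prediction was made (example value below)
entity_df = pd.DataFrame({
    "user_id": ["user_123"],
    "event_timestamp": [datetime(2024, 1, 15)],
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=feature_refs,  # same feature references as the online path
).to_df()
```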
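
To make the role of the entity timestamps concrete, here is a minimal standard-library sketch of a point-in-time lookup. It is not how Feast implements it; the `feature_log` data and the `value_as_of` helper are made up for illustration.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical materialization log for one user, sorted by time:
# (timestamp the value became available, avg_spend_30d)
feature_log = [
    (datetime(2024, 1, 1), 40.0),
    (datetime(2024, 1, 8), 55.0),
    (datetime(2024, 1, 15), 70.0),
]

def value_as_of(log, event_time):
    """Return the latest feature value recorded at or before event_time."""
    timestamps = [ts for ts, _ in log]
    idx = bisect_right(timestamps, event_time)
    return log[idx - 1][1] if idx > 0 else None  # None: nothing known yet

# Times at which predictions were made (one precedes all feature values)
event_times = [datetime(2024, 1, 5), datetime(2024, 1, 10), datetime(2023, 12, 31)]
training_values = [value_as_of(feature_log, t) for t in event_times]
```

Each training row only ever sees values that already existed at its own timestamp, which is what prevents future information from leaking into training data.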

---

## **19.2 Design Patterns**

#### **19.2.1 Batch Prediction**

Process large datasets periodically (hourly, daily).

**When to use:** 
- No strict latency requirements (minutes to hours acceptable)
- Large volume (millions of predictions)
- Features don't change between runs (e.g., overnight risk scoring)

**Architecture:**
```
S3/Data Lake → Spark Job → Model Inference → Results → Database/Notifications
```

**Pros:** Simple, scalable, cost-effective (spot instances)  
**Cons:** Stale predictions, high latency

#### **19.2.2 Real-Time (Online) Inference**

Synchronous API calls with millisecond latency requirements.

**When to use:**
- User-facing applications (search ranking, recommendations, fraud detection)
- Features available at request time
- SLA: <100ms p99 latency

**Architecture:**
```
Client → API Gateway → Load Balancer → Model Server (K8s) → Response
                    ↓
            Feature Store (Redis cache)
```

**Challenges:**
- Cold start (container startup time)
- Scaling to zero vs. keeping warm (cost vs. latency trade-off)
- Feature freshness

#### **19.2.3 Streaming Inference**

Process events as they arrive, maintaining state.

**When to use:**
- Real-time anomaly detection (network intrusion)
- Session-based recommendations (update as user clicks)
- IoT sensor monitoring

**Architecture:**
```
Kafka/Kinesis → Stream Processor (Flink/Spark Streaming) → State Store → Predictions
                    ↓
            Feature updates to online store
```

**Windowing:** 
- **Tumbling:** Fixed-size, non-overlapping (e.g., every 5 minutes)
- **Sliding:** Overlapping windows (e.g., last 5 minutes, computed every minute)
- **Session:** Dynamic based on user activity gaps
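
The three window types can be sketched in plain Python over a list of event timestamps (in seconds). This only illustrates the semantics; a stream processor such as Flink manages these windows incrementally, and the function names here are hypothetical.

```python
from collections import defaultdict

def tumbling_windows(timestamps, size):
    """Count events per fixed, non-overlapping window, keyed by window start."""
    counts = defaultdict(int)
    for ts in timestamps:
        counts[(ts // size) * size] += 1
    return dict(counts)

def sliding_window_count(timestamps, now, size):
    """Count events in the interval (now - size, now]."""
    return sum(1 for ts in timestamps if now - size < ts <= now)

def session_windows(timestamps, gap):
    """Group events into sessions; a pause longer than `gap` starts a new one."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

events = [5, 20, 290, 310, 320, 900]  # made-up event times in seconds

per_five_minutes = tumbling_windows(events, size=300)
last_five_minutes = sliding_window_count(events, now=320, size=300)
sessions = session_windows(events, gap=60)
```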

---

## **19.3 Scalability Patterns**

#### **19.3.1 Horizontal Scaling**

Add more machines rather than bigger machines.

**Stateless Serving:** Multiple model server pods behind load balancer. Any pod can handle any request.

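**Sharding:** Partition data by key (for example, `user_id % num_shards`) to spread load and improve cache locality. Plain modulo remaps most keys whenever `num_shards` changes, so it works best when the shard count is stable.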

```yaml
# Kubernetes HPA (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: custom_queue_length
      target:
        type: AverageValue
        averageValue: "10"
```

#### **19.3.2 Load Balancing Strategies**

- **Round Robin:** Even distribution (simple, ignores capacity)
- **Least Connections:** To pods with fewest active requests (good for varying latency)
- **Consistent Hashing:** Same user always hits the same pod (improves cache hit rates, but popular keys can create hot spots)
- **Model Sharding:** Different pods serve different model partitions (for massive models)
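
As an illustration of consistent hashing, here is a minimal hash ring with virtual nodes. The `HashRing` class and pod names are hypothetical, and production load balancers ship their own implementations; the point is only to show why adding a pod moves a minority of keys.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring many times to even out the load
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring position clockwise from the key's hash (wrapping around)
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]

ring = HashRing(["pod-a", "pod-b", "pod-c"])
bigger = HashRing(["pod-a", "pod-b", "pod-c", "pod-d"])

users = [f"user_{i}" for i in range(1000)]
moved = sum(ring.node_for(u) != bigger.node_for(u) for u in users)
```

With four equally weighted pods, roughly a quarter of the keys move to the new pod, and every key that moves lands on `pod-d`; with `user_id % num_shards`, most keys would change pods.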

#### **19.3.3 Caching Strategies**

**Feature Cache:** Cache expensive feature lookups (Redis).
```
Cache Hit: 1ms
Cache Miss: 50ms (compute from raw data)
```
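
A quick back-of-the-envelope calculation shows how much the hit rate matters, using the 1 ms and 50 ms figures above as defaults. The function name is made up for this sketch.

```python
def expected_latency_ms(hit_rate, hit_ms=1.0, miss_ms=50.0):
    """Mean lookup latency for a given cache hit rate."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

for hit_rate in (0.5, 0.9, 0.99):
    print(f"hit rate {hit_rate:.0%}: mean {expected_latency_ms(hit_rate):.2f} ms")
```

A 90% hit rate gives a mean of about 5.9 ms, but the mean hides the tail: unless the hit rate exceeds 99%, the p99 lookup still pays the full miss cost.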

**Prediction Cache:** Cache model outputs for identical inputs (high hit rate for popular items).
- **TTL (Time To Live):** How long to keep cached predictions (trading freshness for speed)
- **Cache Invalidation:** When model updates, clear relevant cache entries
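
One simple way to combine TTL and invalidation is to make the model version part of the cache key, so a new version never reads entries written by the old one. The sketch below is an in-process stand-in for Redis; the `PredictionCache` class and its injectable `clock` are assumptions made for illustration and testing.

```python
import time

class PredictionCache:
    """In-memory prediction cache with TTL, keyed by (model_version, input key)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (model_version, key) -> (expires_at, value)

    def get(self, model_version, key):
        entry = self._entries.get((model_version, key))
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[(model_version, key)]  # expired: evict lazily
            return None
        return value

    def put(self, model_version, key, value):
        self._entries[(model_version, key)] = (self._clock() + self._ttl, value)

# Usage with a fake clock so expiry can be demonstrated without sleeping
now = [0.0]
cache = PredictionCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("v1", "item_42", 0.87)
fresh = cache.get("v1", "item_42")        # hit
other_version = cache.get("v2", "item_42")  # miss: different model version
now[0] = 61.0
expired = cache.get("v1", "item_42")      # miss: TTL elapsed
```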

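**CDN for Edge Inference:** Deploy small models to CDN edge locations (Cloudflare Workers, Lambda@Edge) to target <50ms latency worldwide. Serving close to the user removes most of the network round trip, but edge runtimes impose tight limits on memory, package size, and execution time, so this suits lightweight models only.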

---

## **19.4 Latency and Throughput Optimization**

#### **19.4.1 Model Optimization**

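**Quantization:** Store weights (and often activations) as INT8 instead of FP32; typically a 2-4x speedup on hardware with INT8 support, with a slight accuracy loss  
**Pruning:** Remove low-importance weights (for example, 50% of them) to shrink the model; real speedups need structured pruning or sparse-aware kernels  
**Knowledge Distillation:** Train a small "student" model to mimic a large "teacher"  
**ONNX Runtime/TensorRT:** Optimized inference engines (kernel fusion, memory optimization)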
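
To show where the "slight accuracy loss" of quantization comes from, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. Real toolchains add calibration and per-channel scales; the weights below are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]

weights = [0.02, -0.51, 0.33, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is at most half a quantization step
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs 8 bits instead of 32, and the error introduced is bounded by `scale / 2`, which is why accuracy usually degrades only slightly.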

#### **19.4.2 Dynamic Batching**

Combine multiple individual requests into a batch for GPU efficiency.

```
Without Batching: [1], [1], [1] → 3 forward passes (GPU underutilized)
With Batching: [1,1,1] → 1 forward pass (up to 3x throughput)
```

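**Trade-off:** Waiting to form a batch adds latency (batching delay). Tune the maximum batch size and the maximum wait time together (`max_batch_size` and `batch_timeout_micros` in TensorFlow Serving) so the added delay fits inside your latency budget.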

```text
# TensorFlow Serving batching parameters
max_batch_size { value: 64 }
batch_timeout_micros { value: 50000 }  # Wait max 50ms to fill batch
max_enqueued_batches { value: 100 }
```
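
The same size-or-timeout rule can be sketched in application code with `asyncio`. This is an illustration of the idea, not how TensorFlow Serving implements it; `MicroBatcher` and `fake_model` are hypothetical names.

```python
import asyncio

class MicroBatcher:
    """Collects single requests and runs them through the model as one batch."""

    def __init__(self, predict_batch, max_batch_size=64, timeout_s=0.05):
        self._predict_batch = predict_batch
        self._max_batch_size = max_batch_size
        self._timeout_s = timeout_s
        self._queue = asyncio.Queue()

    async def predict(self, item):
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future  # resolved by run() once the batch is scored

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # wait for the first request
            deadline = loop.time() + self._timeout_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break  # timeout reached: send a partial batch
            outputs = self._predict_batch([item for item, _ in batch])
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)

batch_sizes = []

def fake_model(batch):
    """Stand-in for a real model: records batch size, doubles each input."""
    batch_sizes.append(len(batch))
    return [x * 2 for x in batch]

async def demo():
    batcher = MicroBatcher(fake_model, max_batch_size=4, timeout_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.predict(i) for i in range(8)))
    worker.cancel()
    return results

results = asyncio.run(demo())
```

Eight concurrent requests are served in batches of at most four, so the model is called far fewer times than there are requests, at the cost of each request waiting up to `timeout_s` for company.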

#### **19.4.3 Async Processing**

For non-blocking operations:
1. Client sends request, gets `request_id` immediately
2. Request queued (Kafka/SQS)
3. Worker processes asynchronously
4. Client polls or receives webhook notification

**Use case:** Heavy inference (video analysis), email generation, report creation.
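
The four steps above can be sketched with the standard library, using an in-process queue as a stand-in for Kafka/SQS and a dictionary as a stand-in for a job-status store. All names here (`submit`, `poll`, `worker`) are hypothetical.

```python
import queue
import threading
import uuid

jobs = {}                   # request_id -> {"status": ..., "result": ...}
work_queue = queue.Queue()  # stand-in for Kafka/SQS

def submit(payload):
    """Steps 1-2: register the job, enqueue it, return an ID immediately."""
    request_id = str(uuid.uuid4())
    jobs[request_id] = {"status": "queued", "result": None}
    work_queue.put((request_id, payload))
    return request_id

def worker(process):
    """Step 3: pull jobs off the queue and process them in the background."""
    while True:
        request_id, payload = work_queue.get()
        jobs[request_id] = {"status": "done", "result": process(payload)}
        work_queue.task_done()

def poll(request_id):
    """Step 4: the client checks the job status by ID."""
    return jobs[request_id]

# A trivial "heavy" task for demonstration purposes
threading.Thread(target=worker, args=(str.upper,), daemon=True).start()

rid = submit("analyze video 17")
work_queue.join()  # in a real client this would be repeated polling
status = poll(rid)
```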

---

## **19.5 Cost Optimization**

#### **19.5.1 Infrastructure Cost Drivers**

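1. **Compute:** GPU instances (on-demand prices range from under $1/hour for a single entry-level GPU to $30+/hour for large multi-GPU instances, depending on type, provider, and region)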
2. **Storage:** Model artifacts, feature logs, training data
3. **Network:** Egress fees (moving data between regions or clouds)
4. **Managed-service premium:** Managed ML platforms (SageMaker, Vertex AI) cost more than self-managed infrastructure and increase vendor lock-in, in exchange for less operational work

#### **19.5.2 Spot/Preemptible Instances**

Use spot instances for:
- Batch training (checkpoint frequently)
- Batch inference
- Development/experimentation

**Risk:** Instances can be reclaimed on short notice (about 2 minutes on AWS, about 30 seconds on GCP). Architect for fault tolerance.
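
For training jobs, fault tolerance mostly means checkpointing and resuming. The sketch below assumes a toy training loop and a JSON checkpoint; the `stop_at` parameter exists only to simulate an instance being reclaimed.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write atomically so a reclaim mid-write cannot corrupt the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)

def load_checkpoint(path):
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)

def train(path, total_steps, checkpoint_every=100, stop_at=None):
    state = load_checkpoint(path)  # resume from the last saved step, if any
    for step in range(state["step"], total_steps):
        if stop_at is not None and step == stop_at:
            return state  # simulated reclaim: progress since last save is lost
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}  # placeholder work
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state

with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "checkpoint.json")
    train(ckpt, total_steps=1000, stop_at=250)    # interrupted at step 250
    resumed_from = load_checkpoint(ckpt)["step"]  # last checkpoint on disk
    final_step = train(ckpt, total_steps=1000)["step"]
```

The checkpoint interval sets the maximum amount of work lost per interruption: here, at most 100 steps.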

#### **19.5.3 Multi-Cloud and Hybrid**

Avoid cloud lock-in:
- **Kubernetes:** Portable orchestration
- **KServe:** Standardized model serving on any cloud
- **Feature Store:** Abstraction layer over underlying storage

#### **19.5.4 Model Right-Sizing**

Don't use A100 GPUs for simple logistic regression. Match hardware to model complexity:
- **CPU:** Scikit-learn models and small neural nets (<10MB), where single-request latency is already low and inexpensive replicas provide the throughput
- **GPU:** Deep learning (CNNs, Transformers), batch processing
- **TPU:** Massive matrix multiplications (training large LLMs)
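
A rough way to compare hardware options is cost per million predictions. The prices and throughputs below are invented for illustration; substitute measured numbers from your own load tests.

```python
def cost_per_million_predictions(hourly_price_usd, throughput_rps, utilization=1.0):
    """Serving cost per 1M predictions for one instance at a given utilization."""
    predictions_per_hour = throughput_rps * 3600 * utilization
    return hourly_price_usd / predictions_per_hour * 1_000_000

# Hypothetical instances: a small CPU box and a single-GPU box
cpu_cost = cost_per_million_predictions(hourly_price_usd=0.20, throughput_rps=500)
gpu_cost = cost_per_million_predictions(hourly_price_usd=3.00, throughput_rps=2000)

# The same GPU kept only 25% busy costs four times as much per prediction
gpu_cost_idle = cost_per_million_predictions(3.00, 2000, utilization=0.25)
```

Under these made-up numbers the CPU instance is cheaper per prediction even though the GPU is faster, and low utilization makes the gap worse.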

---

## **19.6 Reliability and Fault Tolerance**

#### **19.6.1 Circuit Breakers**

Circuit breakers prevent cascading failures. If the model service is down, fail fast and return a default response instead of letting every caller wait for a timeout.

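```python
from circuitbreaker import CircuitBreakerError, circuit

FALLBACK = {"score": 0.5, "fallback": True}  # neutral default response

@circuit(failure_threshold=5, recovery_timeout=60)
def get_prediction(features):
    # `model` is assumed to be loaded at service startup
    return model.predict(features)

def predict_or_fallback(features):
    try:
        return get_prediction(features)
    except CircuitBreakerError:
        # After 5 consecutive failures the circuit opens for 60 seconds and
        # calls fail immediately instead of waiting on the model service
        return FALLBACK
```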

#### **19.6.2 Graceful Degradation**

When the system is overloaded:
- **Shed load:** Return 503 with a `Retry-After` header
- **Simplify model:** Route to a lighter model (e.g., switch from GPT-4 to GPT-3.5, or from a CNN to logistic regression)
- **Cache staleness:** Serve slightly stale cached predictions when fresh computation is failing
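
These three tactics compose naturally into a single request handler. The sketch below is one possible ordering under simple assumptions: models are plain callables, the stale cache is a dictionary, and the function name and response shape are invented for illustration.

```python
def handle_request(key, features, in_flight, max_in_flight, primary, light, stale_cache):
    """Return (status, headers, body), degrading step by step."""
    if in_flight >= max_in_flight:
        return 503, {"Retry-After": "1"}, None  # shed load

    for name, model in (("primary", primary), ("light", light)):
        try:
            score = model(features)
        except Exception:
            continue  # fall through to the next, cheaper option
        stale_cache[key] = score  # remember the last good answer
        return 200, {}, {"score": score, "served_by": name}

    if key in stale_cache:
        return 200, {}, {"score": stale_cache[key], "served_by": "stale_cache"}
    return 503, {"Retry-After": "1"}, None

# Example: the primary model is failing, the lighter model still works
cache = {}
status, headers, body = handle_request(
    "user_123", [1.0, 2.0], in_flight=3, max_in_flight=10,
    primary=lambda f: 1 / 0, light=lambda f: 0.5, stale_cache=cache,
)
```

Including `served_by` in the response (or in logs) makes it possible to measure how often the system is running in a degraded state.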

#### **19.6.3 Shadow Mode Deployment**

Route production traffic to new model version without returning its predictions (log only). Compare new model outputs to current production model to validate before switching traffic.
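
A minimal sketch of shadow mode, assuming both models are callables that return a numeric score. In a real service the shadow call would run off the request path (a background task or a mirrored traffic stream) so it cannot add latency; it is inline here only to keep the example short.

```python
import logging

logger = logging.getLogger("shadow")

def predict_with_shadow(features, production_model, shadow_model, comparisons):
    """Serve the production prediction; record the shadow prediction for analysis."""
    result = production_model(features)
    try:
        shadow_result = shadow_model(features)
        comparisons.append({
            "production": result,
            "shadow": shadow_result,
            "abs_diff": abs(result - shadow_result),
        })
    except Exception:
        # A failing shadow model must never affect the user-facing response
        logger.exception("shadow model failed")
    return result

comparisons = []
served = predict_with_shadow([0.1, 0.2], lambda f: 0.8, lambda f: 0.6, comparisons)
```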

---

## **19.7 Workbook Labs**

### **Lab 1: System Design Document**
Design an ML system for fraud detection at a payment processor (10,000 TPS, <50ms latency):

1. **Requirements:** Functional and non-functional (latency, availability, throughput)
2. **Data Flow:** From transaction event to score returned
3. **Component Diagram:** Feature store, model registry, serving infrastructure
4. **Trade-off Analysis:** Why Redis vs Cassandra? Why online vs batch features?
5. **Failure Modes:** What happens if feature store is down? Model version mismatch?

**Deliverable:** Architecture document with diagrams (draw.io or Excalidraw) and reasoning.

### **Lab 2: Load Testing**
Deploy a simple model (scikit-learn or PyTorch) behind a FastAPI service:

1. **Baseline:** Single instance throughput (requests/sec, latency p50/p99)
2. **Bottleneck Identification:** CPU bound? I/O bound? Memory bound?
3. **Optimization:** Add Redis caching for features, implement batching
4. **Scaling:** Deploy to Kubernetes, configure HPA, test under load (k6 or Locust)

**Deliverable:** Performance report showing latency distribution before/after optimizations.

### **Lab 3: Cost Analysis**
Given a cloud bill for ML training ($10k/month), optimize it:

1. **Identify waste:** Idle GPUs, oversized instances, unused storage
2. **Spot instance migration:** Which workloads can tolerate interruptions?
3. **Model compression:** Quantization to reduce GPU requirements
4. **ROI Calculation:** Cost per prediction, break-even analysis

**Deliverable:** Cost reduction proposal with 30% savings target.

### **Lab 4: Disaster Recovery Plan**
Design for region failure:

1. **Multi-region deployment:** Active-active or active-passive?
2. **Data replication:** RPO (Recovery Point Objective) and RTO (Recovery Time Objective)
3. **Model artifact backup:** Versioning across regions
4. **Failover testing:** Simulate us-east-1 outage, measure failover time

**Deliverable:** Runbook for on-call engineers with step-by-step recovery procedures.

---

## **19.8 Common Pitfalls**

1. **Training-Serving Skew:** Different code paths for feature engineering in training vs. serving. **Solution:** Shared feature transformation libraries (same Docker image for both).

2. **Underestimating Cold Start:** Serverless (Lambda) or scaled-to-zero Kubernetes adds 1-10s latency on first request. **Solution:** Keep minimum replicas warm, use provisioned concurrency.

3. **Ignoring Backpressure:** When a downstream service is slow, the request queue grows without bound until the process runs out of memory. **Solution:** Bounded queues with load shedding, circuit breakers.

4. **Synchronous Feature Computation:** Computing features in request path adds latency. **Solution:** Pre-materialize features, use streaming feature computation.

5. **Not Versioning Everything:** Model v2 expects different features than v1, causing crashes. **Solution:** Version APIs, features, and models together (immutable infrastructure).
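
For the backpressure pitfall, the essential idea fits in a few lines: give the queue a maximum size and reject work when it is full. The queue size and status codes below are illustrative choices.

```python
import queue

requests = queue.Queue(maxsize=100)  # bounded: memory use has a ceiling

def accept(request):
    """Enqueue if there is room; otherwise tell the caller to back off."""
    try:
        requests.put_nowait(request)
        return 202
    except queue.Full:
        return 503  # shed load instead of queueing without limit

# No consumer is draining the queue here, so the last 50 requests are rejected
statuses = [accept(i) for i in range(150)]
```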

---

## **19.9 Interview Questions**

**Q1:** Design a recommendation system for YouTube (1 billion users, 1 million videos). How do you handle the scale?
*A: Two-stage approach: (1) Candidate generation: Two-tower neural network retrieves ~100 videos from 1M using FAISS (approximate nearest neighbors), runs in <10ms. (2) Ranking: Heavy model (deep neural net with hundreds of features) scores the 100 candidates, runs on GPU in <50ms. Features pre-computed and cached in Redis. Model sharded by user geography for locality.*
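
The two-stage structure in this answer can be sketched on toy data. Brute-force dot products stand in for the FAISS index, a hand-written weighted score stands in for the ranking model, and all embeddings are random; none of this reflects YouTube's actual system.

```python
import heapq
import random

random.seed(0)
DIM = 8
video_embeddings = {
    f"video_{i}": [random.gauss(0, 1) for _ in range(DIM)] for i in range(1000)
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def generate_candidates(user_embedding, k=100):
    """Stage 1: cheap similarity search over the whole catalog."""
    return heapq.nlargest(
        k, video_embeddings, key=lambda v: dot(user_embedding, video_embeddings[v])
    )

def rank(user_embedding, candidates, freshness, top_n=10):
    """Stage 2: a more expensive scorer applied only to the candidates."""
    def score(video):
        similarity = dot(user_embedding, video_embeddings[video])
        return 0.8 * similarity + 0.2 * freshness.get(video, 0.0)
    return sorted(candidates, key=score, reverse=True)[:top_n]

user = [1.0] * DIM
candidates = generate_candidates(user, k=100)
top_videos = rank(user, candidates, freshness={}, top_n=10)
```

The expensive scorer runs on 100 items instead of the full catalog, which is what keeps the second stage within a tight latency budget.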

**Q2:** When would you choose batch inference over real-time inference?
*A: Batch when: (1) No strict latency SLA (minutes/hours OK), (2) Large volume (millions of predictions), (3) Features don't change rapidly (e.g., credit risk updated nightly), (4) Cost-sensitive (can use spot instances, no need for 24/7 running servers). Real-time when: (1) User-facing latency requirements (<100ms), (2) Features depend on real-time context (current cart contents, location), (3) Immediate action required (fraud block, dynamic pricing).*

**Q3:** How do you prevent training-serving skew?
*A: (1) Shared feature engineering code (one library used in both training and serving pipelines), (2) A feature store with point-in-time-correct joins (each training row gets exactly the feature values that were available at its prediction time, which matches what serving sees), (3) Immutable, versioned data pipelines, (4) Integration tests that validate feature parity between environments.*

**Q4:** Explain the difference between horizontal and vertical scaling in ML serving.
*A: Vertical scaling: A bigger machine (more CPU/GPU/RAM) for a single instance. Limited by the hardware ceiling, expensive, and a single point of failure. Horizontal scaling: More machines behind a load balancer. Better fault tolerance and near-linear capacity growth, but requires a stateless design. For ML serving, horizontal scaling with model replication is the default; scale vertically when a model needs more memory than the current instance offers, and use model parallelism across several GPUs when it no longer fits on one.*

**Q5:** How do you handle model updates without downtime?
*A: Run the new version alongside the old one and move traffic in a controlled way. Blue-green: stand up a full copy with the new model and switch all traffic at once, keeping the old copy for instant rollback. Canary: shift traffic gradually (5% → 25% → 100%) and roll back automatically if the error rate spikes. Shadow mode can validate the new model before either. Implement with Kubernetes rolling updates or traffic splitting (Istio/Linkerd). The key is keeping the API contract backward compatible or versioning endpoints (/v1/predict, /v2/predict).*
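
A canary split is often implemented by hashing a stable identifier into a bucket, so each user consistently sees one version and raising the percentage only adds users to the canary. A minimal sketch, with a hypothetical `route` function and made-up user IDs:

```python
import hashlib

def route(user_id, canary_percent):
    """Deterministically assign a user to 'v2' (canary) or 'v1' (stable)."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "v2" if bucket < canary_percent else "v1"

users = [f"user_{i}" for i in range(10000)]

# Share of traffic on the canary at the 5% stage
canary_share = sum(route(u, 5) == "v2" for u in users) / len(users)

# Users already on the canary stay on it when the rollout widens to 25%
sticky = all(route(u, 25) == "v2" for u in users if route(u, 5) == "v2")
```

In practice a service mesh such as Istio can do the split at the network layer; hashing in application code is useful when the assignment must be sticky per user.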

---

## **19.10 Further Reading**

**Books:**
- *Designing Machine Learning Systems* (Chip Huyen) - Comprehensive ML system design
- *Building Machine Learning Pipelines* (Hannes Hapke, Catherine Nelson) - TensorFlow Extended
- *Site Reliability Engineering* (Google) - For reliability concepts applied to ML

**Papers:**
- "Hidden Technical Debt in Machine Learning Systems" (Sculley et al., 2015) - Classic on MLOps complexity
- "Machine Learning: The High Interest Credit Card of Technical Debt" (Sculley et al., 2014) - Earlier workshop paper making the same argument

**Tools:**
- **Feast:** Open source feature store
- **KServe:** Kubernetes-native model serving
- **MLflow:** Model registry and experiment tracking
- **Evidently:** ML monitoring

---

## **19.11 Checkpoint Project: Production-Grade Fraud Detection System**

Build a complete fraud detection system for a fictional payment company processing 10,000 transactions/second.

**Requirements:**

1. **Latency SLA:** p99 < 50ms end-to-end (feature retrieval + inference)
2. **Throughput:** Handle 10k TPS with burst to 50k TPS (Black Friday)
3. **Features:**
   - Real-time: Velocity features (txns/minute from card), device fingerprint matching
   - Batch: Customer historical risk score (updated hourly)
   - Streaming: Aggregate merchant chargeback rate (5-minute window)

4. **Architecture:**
   - **Ingestion:** Kafka for transaction events
   - **Feature Store:** Redis (online) + S3 (offline) with Feast
   - **Model:** XGBoost (lightweight, explainable) with option to shadow-test deep model
   - **Serving:** FastAPI with asyncio, deployed on EKS with Karpenter (auto-scaling)
   - **Monitoring:** Prometheus metrics, Grafana dashboards, Evidently for data drift

5. **Reliability:**
   - Circuit breaker: If feature store down, use cached/default features and alert
   - Fallback model: Simple rules-based if ML model fails
   - Multi-AZ deployment

6. **Testing:**
   - Load test: 50k TPS sustained for 1 hour
   - Chaos engineering: Randomly kill pods, verify automatic recovery
   - A/B test: New model version gets 5% traffic, compare fraud catch rate vs false positive rate

**Deliverables:**
- `fraud_system/` with Terraform/Kubernetes manifests
- Architecture diagram with data flow
- Runbook: "Incident Response: Feature Store Outage"
- Cost estimate: Monthly AWS bill breakdown

**Success Criteria:**
- System handles load test without errors
- Latency p99 < 50ms at 10k TPS
- Zero downtime deployment of new model version
- Automatic failover demonstrated in chaos test

---

**End of Chapter 19**

*You now understand how to design ML systems for production. Chapter 20 will cover Data Engineering for ML — building the pipelines that feed these systems.*

---