# 008: System Design Fundamentals

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Master** Scalability and load balancing
- **Master** Caching strategies
- **Master** Database sharding
- **Master** Microservices architecture
- **Master** ML system design (training, serving, monitoring)

## 📚 Overview

This notebook covers System Design Fundamentals essential for AI/ML engineering.

**Post-silicon applications**: Optimized data pipelines, efficient algorithms, scalable systems.

---

Let's dive in! 🚀

## 📚 What is System Design?

**System Design** = Architecture and engineering of large-scale distributed systems that are scalable, reliable, and maintainable.

### Core Concepts

**1. Scalability** - Handle growing traffic/data
- **Vertical scaling**: Bigger servers (limited by hardware)
- **Horizontal scaling**: More servers (no single-machine ceiling, usually preferred at scale)
- **Load balancing**: Distribute traffic across servers

**2. Reliability** - System continues working despite failures
- **Redundancy**: Multiple copies of critical components
- **Failover**: Automatic switch to backup
- **Disaster recovery**: Recover from catastrophic failures

**3. Availability** - Percentage of time system is operational
- 99.9% (three nines) = 8.76 hours downtime/year
- 99.99% (four nines) = 52.56 minutes downtime/year
- 99.999% (five nines) = 5.26 minutes downtime/year
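
The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the helper name `downtime_per_year` is illustrative, not part of the notebook):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year

def downtime_per_year(availability_pct: float) -> float:
    """Return the allowed downtime in hours per year for a given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.9, 99.99, 99.999):
    hours = downtime_per_year(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.2f} minutes) per year")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why moving from three to four nines usually requires redundancy and automatic failover rather than faster manual recovery.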

**4. Performance** - Speed and throughput
- **Latency**: Time to process request (ms)
- **Throughput**: Requests per second (RPS)
- **Response time**: P50, P95, P99 metrics
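
To see why response time is reported as percentiles rather than an average, here is a small sketch using a nearest-rank percentile. The workload is hypothetical (98% fast requests, 2% hitting a slow path), and `percentile` is an illustrative helper:

```python
import math
import random

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

random.seed(0)
# Hypothetical workload: most requests are fast, 2% hit a slow path
latencies_ms = ([random.uniform(5, 20) for _ in range(980)]
                + [random.uniform(200, 400) for _ in range(20)])

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean:.1f}ms  P50={percentile(latencies_ms, 50):.1f}ms  "
      f"P95={percentile(latencies_ms, 95):.1f}ms  P99={percentile(latencies_ms, 99):.1f}ms")
```

In this setup P50 and P95 stay in the fast range while P99 lands in the slow path, so the tail is visible only in the high percentiles.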

### Why System Design for AI/ML?

**Scale:**
- Training: Process 100M+ samples, 500GB+ datasets
- Inference: Serve 10K+ predictions per second
- Storage: Manage PB-scale data warehouses

**Reliability:**
- Model serving: 99.99% uptime (52 minutes downtime/year)
- Data pipelines: Zero data loss, automatic retries
- A/B testing: Consistent experiment tracking

**Performance:**
- Inference latency: <100ms for real-time applications
- Training throughput: Maximize GPU utilization (>90%)
- Data loading: Minimize I/O bottlenecks

### 🏭 Post-Silicon Validation Use Cases

**1. Intel Test Data Platform** (Distributed Storage + Processing)
- Challenge: 50 test labs, 1PB test data/year, query latency >30s
- Solution: Distributed storage (Cassandra), parallel processing (Spark), caching (Redis)
- Architecture: Load balancer → 20 API servers → 100 Cassandra nodes → 50 Spark workers
- Result: <200ms query latency (150× faster), 99.95% uptime, $15M infrastructure savings

**2. NVIDIA Model Serving Infrastructure** (Microservices + Auto-scaling)
- Challenge: Serve 50+ models, 100K predictions/sec, <50ms latency requirement
- Solution: Kubernetes microservices, model versioning, auto-scaling (CPU >70% → add pods)
- Architecture: NGINX load balancer → Model API (10-100 pods) → TensorFlow Serving → Redis cache
- Result: 99.99% uptime, 35ms P99 latency, auto-scale 10→100 pods in 30s, $8M cost savings

**3. AMD Data Pipeline** (Event-driven Architecture)
- Challenge: Process 10M test records/day from 30 sources, <5min end-to-end latency
- Solution: Kafka event streaming, stream processing (Flink), Lambda architecture
- Architecture: Data sources → Kafka (100 partitions) → Flink jobs → Data warehouse + Real-time DB
- Result: <2min latency (60% improvement), zero data loss, 100% processed records, $12M value

**4. Qualcomm ML Training Cluster** (Distributed Training + Orchestration)
- Challenge: Train 100+ models/week, 20-hour training times, inefficient GPU utilization (50%)
- Solution: Distributed training (Horovod), job scheduling (Kubernetes), model registry
- Architecture: MLflow → Kubernetes scheduler → 200 GPU nodes → Distributed training → Model registry
- Result: 4-hour training (5× faster), 92% GPU utilization, 2× throughput, $20M hardware savings

## 🔄 System Design Process

```mermaid
graph TB
    A[Requirements] --> B[Functional Requirements]
    A --> C[Non-Functional Requirements]
    
    B --> D[Features: What system does]
    C --> E[Scale, Performance, Reliability]
    
    D --> F[High-Level Design]
    E --> F
    
    F --> G[Components & APIs]
    G --> H[Data Flow]
    H --> I[Database Schema]
    
    I --> J[Deep Dive]
    J --> K[Caching Strategy]
    J --> L[Load Balancing]
    J --> M[Replication]
    
    K --> N[Trade-offs & Bottlenecks]
    L --> N
    M --> N
    
    N --> O[Final Architecture]
    
    style A fill:#e1f5ff
    style F fill:#ffe1e1
    style J fill:#e1ffe1
    style O fill:#fffbe1
```

## 📊 Learning Path Context

**Prerequisites:**
- **Notebook 006**: OOP Mastery (classes, SOLID principles)
- **Notebook 007**: Design Patterns (Factory, Singleton, Observer)
- Understanding of databases and networking basics

**This Notebook (008):**
- Scalability patterns (horizontal scaling, load balancing, caching)
- Distributed systems (CAP theorem, consistency models)
- Microservices architecture (API design, service discovery)
- ML system design (training at scale, model serving, monitoring)

**Next Steps:**
- **Notebook 009**: Git & Version Control (branching, CI/CD, model versioning)
- **Notebook 010+**: Apply system design to ML algorithms
- **Notebook 048**: Model Deployment (REST API, Docker, Kubernetes)

## System Design Principles

| Principle | Description | Example |
|-----------|-------------|---------|
| **Single Responsibility** | Each service does one thing well | Auth service, Model service, Data service |
| **Separation of Concerns** | Decouple layers | UI → API → Business Logic → Database |
| **KISS (Keep It Simple)** | Simplest solution that works | Start with monolith → migrate to microservices |
| **YAGNI (You Aren't Gonna Need It)** | Don't over-engineer | Build for current scale, refactor when needed |
| **DRY (Don't Repeat Yourself)** | Shared libraries, services | Auth library used across all services |
| **Fail Fast** | Detect errors early | Circuit breakers, health checks, timeouts |

---

Let's design scalable systems! 🏗️

---

## Part 1: Scalability & Load Balancing

### 📈 What is Scalability?

**Scalability** = System's ability to handle increased load by adding resources.

**Two Approaches:**
1. **Vertical Scaling (Scale Up)**: Bigger servers (more CPU, RAM, disk)
   - ✅ Simple (no code changes)
   - ❌ Limited by hardware (capped by the largest machine you can buy)
   - ❌ Single point of failure
   - ❌ Expensive (non-linear cost curve)

2. **Horizontal Scaling (Scale Out)**: More servers
   - ✅ No single-machine ceiling (add 1000s of servers)
   - ✅ Better fault tolerance (one server fails → others continue)
   - ✅ Cost-effective (commodity hardware)
   - ❌ Complex (distributed system challenges)

### ⚖️ What is Load Balancing?

**Load Balancer** = Distributes traffic across multiple servers to:
- Maximize throughput
- Minimize response time
- Avoid overload on single server
- Enable horizontal scaling

**Load Balancing Algorithms:**

| Algorithm | How It Works | Use Case |
|-----------|--------------|----------|
| **Round Robin** | Rotate through servers sequentially | Equal server capacity, stateless |
| **Least Connections** | Send to server with fewest active connections | Varying request duration |
| **IP Hash** | Hash client IP → same server | Session persistence needed |
| **Weighted Round Robin** | Distribute based on server capacity | Mixed server sizes |
| **Least Response Time** | Send to fastest responding server | Optimize latency |

**Health Checks:**
- Periodic pings to check server status
- Remove unhealthy servers from pool
- Add back when recovered
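
Two of the algorithms in the table, IP Hash and Weighted Round Robin, can be sketched in a few lines. This is an illustrative sketch, not how any particular load balancer implements them; the server names and weights are hypothetical:

```python
import hashlib
from itertools import cycle

def ip_hash(client_ip: str, servers: list) -> str:
    """Map a client IP to a server; the same IP always lands on the same server."""
    # hashlib gives a hash that is stable across processes (built-in hash() is salted for str)
    digest = hashlib.sha256(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

def weighted_round_robin(weights: dict):
    """Return an endless rotation that repeats each server in proportion to its weight."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return cycle(expanded)

servers = ["api-0", "api-1", "api-2"]  # hypothetical server names
print(ip_hash("10.0.0.7", servers) == ip_hash("10.0.0.7", servers))  # True: sticky routing

rotation = weighted_round_robin({"big": 3, "small": 1})  # "big" has 3x the capacity
print([next(rotation) for _ in range(8)])
```

This naive weighted rotation sends three consecutive requests to `big` before one to `small`; production balancers typically interleave the picks more smoothly, but the long-run proportions are the same.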

### 🗄️ Caching Strategies

**Cache** = Fast storage layer to reduce database load and latency.

**Cache Patterns:**

**1. Cache-Aside (Lazy Loading):**
```
Read:
1. Check cache → Hit? Return data
2. Cache miss → Query DB → Store in cache → Return

Write:
1. Write to DB
2. Invalidate cache (or update)
```
✅ Good for read-heavy workloads
❌ Cache miss penalty (DB query)

**2. Write-Through:**
```
Write:
1. Write to cache
2. Write to DB synchronously
3. Return success
```
✅ Data always consistent
❌ Higher write latency (2 operations)

**3. Write-Behind (Write-Back):**
```
Write:
1. Write to cache → Return immediately
2. Asynchronously write to DB (batched)
```
✅ Low write latency
❌ Risk of data loss if cache crashes
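
The difference between the two write patterns above is when the database sees the write. A minimal sketch, using plain dicts as stand-ins for the cache and the database (class names are illustrative):

```python
class WriteThroughCache:
    """Every write goes to the cache and the backing store before returning."""
    def __init__(self, db: dict):
        self.db = db
        self.cache = {}

    def write(self, key, value):
        self.cache[key] = value
        self.db[key] = value  # synchronous: the caller waits for the store

class WriteBehindCache:
    """Writes return after updating the cache; the store is updated later in batches."""
    def __init__(self, db: dict):
        self.db = db
        self.cache = {}
        self.pending = {}  # dirty entries not yet persisted

    def write(self, key, value):
        self.cache[key] = value
        self.pending[key] = value  # lost if the process dies before flush()

    def flush(self):
        self.db.update(self.pending)
        self.pending.clear()

db_a, db_b = {}, {}
wt, wb = WriteThroughCache(db_a), WriteBehindCache(db_b)
wt.write("W001", 0.93)
wb.write("W001", 0.93)
print(db_a, db_b)  # write-through store is current; write-behind store is still empty
wb.flush()
print(db_b)        # the batched write has now reached the store
```

The window between `write()` and `flush()` is exactly the data-loss risk listed for write-behind.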

**Cache Eviction Policies:**
- **LRU (Least Recently Used)**: Evict the item that has gone longest without being accessed
- **LFU (Least Frequently Used)**: Evict the item with the fewest accesses
- **FIFO (First In First Out)**: Evict the item that was inserted first
- **TTL (Time To Live)**: Items expire after X seconds

### 🏭 Post-Silicon Examples

**Intel Test Data Query Caching:**
```
Before (no cache):
- Query: "Get yield for wafer W001" → 15s (scan 50M records)
- 1000 queries/min → 250 concurrent DB connections → DB crash

After (Redis cache, TTL=5min):
- First query: 15s (cache miss, query DB, store in cache)
- Subsequent queries: 5ms (cache hit) → 3000× faster
- 1000 queries/min → 990 cache hits → 10 DB queries → DB stable

Result: 99% cache hit rate, <10ms P95 latency, $5M DB cost savings
```

**NVIDIA Model Inference Cache:**
```
Scenario: Predict yield for same device multiple times
- Model inference: 100ms
- Cached result: 1ms (100× faster)
- Cache key: hash(device_features)
- TTL: 1 hour (predictions valid for 1 hour)

Architecture:
Client → Load Balancer → API Server → Check Redis → Cache hit? Return
                                                  → Cache miss? → Model inference → Store Redis → Return

Result: 80% cache hit rate, ~20ms avg latency (vs 100ms), ~5× more requests on the same model capacity
```
```
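
The headline numbers in the two caching examples above can be checked with two small formulas: expected latency as a weighted average of hit and miss latency, and Little's law for the number of requests in flight. A sketch with illustrative function names:

```python
def average_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Expected latency when a fraction hit_rate of requests is served from cache."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

def concurrent_requests(arrivals_per_sec: float, latency_sec: float) -> float:
    """Little's law: requests in flight = arrival rate x time each one takes."""
    return arrivals_per_sec * latency_sec

# Inference cache: 80% hits at 1 ms, misses pay the 100 ms model call
print(f"{average_latency_ms(0.80, 1, 100):.1f} ms average")

# Uncached queries: 1000 queries/min, 15 s each
print(f"{concurrent_requests(1000 / 60, 15):.0f} concurrent DB connections")
```

The first call gives roughly 20 ms, and the second reproduces the 250 concurrent connections that overwhelm the uncached database.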

**AMD Load Balancing:**
```
Before (single server):
- 1 server, 16 cores, 64GB RAM
- Max: 100 requests/sec
- Peak traffic: 500 requests/sec → 400 timeout/fail

After (horizontal scaling + load balancer):
- 10 servers, 16 cores each, 64GB RAM each
- Load balancer: NGINX (round-robin)
- Each server: 100 requests/sec
- Total capacity: 1000 requests/sec
- Peak traffic: 500 requests/sec → 50 requests/server → All succeed

Result: 99.95% uptime (vs 60%), handle 10× traffic, $2M revenue saved
```

---

Let's implement scalability patterns! 📈

### 📝 What's Happening in This Code?

**Purpose:** Simulate load balancing, caching, and horizontal scaling for high-traffic systems.

**Key Points:**
- **Load Balancer**: Implements Round Robin and Least Connections to distribute requests across healthy servers
- **Cache (LRU)**: Stores query results with TTL, evicts least recently used items when full
- **Horizontal Scaling**: Multiple servers handle requests in parallel, capacity grows roughly linearly with server count
- **Health Checks**: Servers flagged unhealthy are skipped when routing and rejoin the pool once marked healthy again

**Why This Matters:** Intel's test data platform uses Redis caching with a 5-minute TTL, achieving a 99% cache hit rate and reducing query latency from 15s → 5ms (3000× faster). An NGINX load balancer distributes 500K requests/day across 20 API servers using Round Robin. When one server fails (detected via health check), traffic automatically routes to the remaining 19 servers with zero downtime. This architecture saved $5M in database costs and handles 10× traffic growth without adding database capacity.

In [None]:
# Part 1: Scalability & Load Balancing

import time
import random
from collections import OrderedDict
from typing import List, Dict

print("=" * 70)
print("Part 1: Scalability & Load Balancing")
print("=" * 70)

# 1. Load Balancer with Multiple Algorithms
print("\n1️⃣ Load Balancer - Round Robin & Least Connections:")

class Server:
    def __init__(self, server_id, capacity=100):
        self.server_id = server_id
        self.capacity = capacity
        self.active_connections = 0
        self.total_requests = 0
        self.is_healthy = True
    
    def handle_request(self, request_id):
        if not self.is_healthy:
            return None
        self.active_connections += 1
        self.total_requests += 1
        # Simulate processing
        time.sleep(0.001)
        result = f"Server-{self.server_id} processed request-{request_id}"
        self.active_connections -= 1
        return result
    
    def __repr__(self):
        status = "✅" if self.is_healthy else "❌"
        return f"{status} Server-{self.server_id} (connections={self.active_connections}, total={self.total_requests})"

class LoadBalancer:
    def __init__(self, servers: List[Server], algorithm='round_robin'):
        self.servers = servers
        self.algorithm = algorithm
        self.current_index = 0
    
    def get_healthy_servers(self):
        return [s for s in self.servers if s.is_healthy]
    
    def round_robin(self):
        """Rotate through servers"""
        healthy = self.get_healthy_servers()
        if not healthy:
            return None
        server = healthy[self.current_index % len(healthy)]
        self.current_index += 1
        return server
    
    def least_connections(self):
        """Select server with fewest active connections"""
        healthy = self.get_healthy_servers()
        if not healthy:
            return None
        return min(healthy, key=lambda s: s.active_connections)
    
    def route_request(self, request_id):
        if self.algorithm == 'round_robin':
            server = self.round_robin()
        elif self.algorithm == 'least_connections':
            server = self.least_connections()
        else:
            raise ValueError(f"Unknown algorithm: {self.algorithm}")
        
        if server is None:
            return "❌ All servers unhealthy"
        return server.handle_request(request_id)

# Test load balancer
servers = [Server(i) for i in range(3)]
lb = LoadBalancer(servers, algorithm='round_robin')

print("   Round Robin Algorithm:")
for i in range(9):
    result = lb.route_request(i)
    if i % 3 == 0:
        print(f"      Request {i}: {result}")

print(f"\n   Server distribution:")
for server in servers:
    print(f"      {server}")

# Test least connections
lb2 = LoadBalancer(servers, algorithm='least_connections')
print("\n   Least Connections Algorithm:")
servers[1].active_connections = 5  # Simulate server 1 is busy
for i in range(6):
    result = lb2.route_request(i)
    if i % 2 == 0:
        print(f"      Request {i}: {result}")

print("   ✅ Load balancer distributes traffic across servers")

# 2. Cache with LRU Eviction
print("\n2️⃣ Cache - LRU with TTL:")

class LRUCache:
    def __init__(self, capacity=5, ttl=10):
        self.capacity = capacity
        self.ttl = ttl
        self.cache = OrderedDict()
        self.timestamps = {}
        self.hits = 0
        self.misses = 0
    
    def get(self, key):
        # Check if key exists and not expired
        if key in self.cache:
            if time.time() - self.timestamps[key] < self.ttl:
                self.cache.move_to_end(key)  # Mark as recently used
                self.hits += 1
                return self.cache[key]
            else:
                # Expired
                del self.cache[key]
                del self.timestamps[key]
        
        self.misses += 1
        return None
    
    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        else:
            if len(self.cache) >= self.capacity:
                # Remove least recently used
                lru_key = next(iter(self.cache))
                del self.cache[lru_key]
                del self.timestamps[lru_key]
        
        self.cache[key] = value
        self.timestamps[key] = time.time()
    
    def hit_rate(self):
        total = self.hits + self.misses
        return 100 * self.hits / total if total > 0 else 0

# Simulate database query with cache
cache = LRUCache(capacity=3, ttl=60)

def query_database(device_id):
    """Simulate slow database query"""
    time.sleep(0.01)  # 10ms
    return f"Device {device_id} data"

def get_device_data(device_id, cache):
    # Check cache first
    cached = cache.get(device_id)
    if cached is not None:  # explicit None check so falsy cached values still count as hits
        return cached, "cache"
    
    # Cache miss - query database
    data = query_database(device_id)
    cache.put(device_id, data)
    return data, "database"

print("   Simulating 10 queries (cache capacity=3):")
queries = ['D001', 'D002', 'D003', 'D001', 'D002', 'D004',  # D004 evicts D003 (LRU)
           'D001', 'D004', 'D003', 'D001']  # D003 was evicted, so it is a cache miss

for i, device_id in enumerate(queries):
    data, source = get_device_data(device_id, cache)
    print(f"      Query {i+1} ({device_id}): {source} {'✅' if source == 'cache' else '🔍'}")

print(f"\n   Cache stats: {cache.hits} hits, {cache.misses} misses")
print(f"   Hit rate: {cache.hit_rate():.1f}%")
print("   ✅ LRU cache reduces database queries by caching frequent data")

# 3. Horizontal Scaling Simulation
print("\n3️⃣ Horizontal Scaling - Adding Servers:")

from concurrent.futures import ThreadPoolExecutor

class ScalableSystem:
    def __init__(self, initial_servers=2):
        self.servers = [Server(i, capacity=10) for i in range(initial_servers)]
        self.lb = LoadBalancer(self.servers, algorithm='round_robin')
    
    def handle_requests(self, num_requests):
        # Assign requests round-robin, then let each server drain its own queue
        # in its own thread, so servers work in parallel like separate machines.
        # (A sequential loop would take the same time no matter how many servers exist.)
        queues = {s.server_id: [] for s in self.servers}
        for i in range(num_requests):
            server = self.lb.round_robin()
            if server is None:  # no healthy servers left
                break
            queues[server.server_id].append(i)
        
        def drain(server):
            for request_id in queues[server.server_id]:
                server.handle_request(request_id)
        
        start = time.time()
        with ThreadPoolExecutor(max_workers=len(self.servers)) as pool:
            list(pool.map(drain, self.servers))
        elapsed = time.time() - start
        return elapsed
    
    def add_server(self):
        new_id = len(self.servers)
        self.servers.append(Server(new_id, capacity=10))
        self.lb = LoadBalancer(self.servers, algorithm='round_robin')
    
    def get_total_capacity(self):
        return sum(s.capacity for s in self.servers if s.is_healthy)

# Simulate scaling
system = ScalableSystem(initial_servers=2)
print(f"   Initial: {len(system.servers)} servers, capacity={system.get_total_capacity()}")
time1 = system.handle_requests(20)
print(f"   Processed 20 requests in {time1:.3f}s")

# Add servers
system.add_server()
system.add_server()
print(f"\n   Scaled: {len(system.servers)} servers, capacity={system.get_total_capacity()}")
time2 = system.handle_requests(20)
print(f"   Processed 20 requests in {time2:.3f}s")
print(f"   Speedup: {time1/time2:.1f}×")

print("\n   Server distribution after scaling:")
for server in system.servers:
    print(f"      {server}")

print("   ✅ Horizontal scaling raises throughput roughly in proportion to server count")

print("\n✅ Scalability & Load Balancing complete!")

---

## Part 2: Distributed Systems & Databases

### 🌐 CAP Theorem

**CAP Theorem** (Brewer's Theorem): A distributed system can guarantee at most 2 of these 3 properties at the same time:

- **C** (Consistency): All nodes see the same data at the same time
- **A** (Availability): Every request to a working node gets a non-error response
- **P** (Partition Tolerance): System continues despite network failures

**Trade-offs:**
- **CP System** (Consistency + Partition Tolerance): Sacrifice availability
  - Example: Banking systems, MongoDB (strong consistency mode)
  - Use when: Data accuracy critical (financial transactions)

- **AP System** (Availability + Partition Tolerance): Sacrifice consistency
  - Example: DNS, Cassandra, DynamoDB
  - Use when: System must always respond (social media feeds)

- **CA System** (Consistency + Availability): Not partition tolerant
  - Example: Traditional RDBMS (single node)
  - Reality: Network partitions inevitable, CA doesn't exist in distributed systems
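
The CP versus AP choice only shows up while a partition is in progress. The toy model below (two in-memory nodes, with hypothetical `Node` and `write` names) sketches that moment: a CP policy rejects the write, while an AP policy accepts it and leaves the replica stale.

```python
class Node:
    def __init__(self):
        self.data = {}

def write(primary, replica, key, value, partitioned, mode):
    """Apply a write under a 'CP' or 'AP' policy; return True if it was accepted."""
    if partitioned and mode == "CP":
        return False  # CP: refuse the write rather than let the copies diverge
    primary.data[key] = value
    if not partitioned:
        replica.data[key] = value  # replication only works while the link is up
    return True  # AP during a partition: accepted, but the replica is now stale

for mode in ("CP", "AP"):
    primary, replica = Node(), Node()
    write(primary, replica, "yield", 0.90, partitioned=False, mode=mode)
    accepted = write(primary, replica, "yield", 0.95, partitioned=True, mode=mode)
    print(mode, "accepted:", accepted,
          "primary:", primary.data["yield"], "replica:", replica.data["yield"])
```

Under CP both nodes still agree on 0.90 but the client saw a failure; under AP the client succeeded but a read from the replica returns the old value until the partition heals.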

### 🗄️ Database Patterns

**1. Replication** - Multiple copies of data
- **Primary-Replica** (Master-Slave): Writes to primary, reads from replicas
  - ✅ Read scalability (add more replicas)
  - ❌ Write bottleneck (single primary)
  - ❌ Replication lag (replicas may be stale)

- **Multi-Primary**: Multiple nodes accept writes
  - ✅ Write scalability, better availability
  - ❌ Conflict resolution needed
  - ❌ Complex to implement

**2. Sharding** - Partition data across multiple databases
- **Horizontal Sharding**: Split rows (e.g., users 1 to 1M on DB1, users 1M+1 to 2M on DB2)
  - Shard key selection critical (user_id, device_id, geographic region)
  - ✅ Near-linear horizontal scaling when the shard key spreads load evenly
  - ❌ Joins across shards expensive
  - ❌ Rebalancing shards complex

- **Vertical Sharding**: Split by column or table (e.g., user profile on DB1, user posts on DB2)
  - ✅ Optimize per-domain workload
  - ❌ Limited scaling (bounded by tables)

**3. Denormalization** - Duplicate data for read performance
- Trade storage for speed
- Pre-compute joins, aggregations
- Example: Store `user_name` in posts table (avoid join with users table)
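
For horizontal sharding, the routing step and the rebalancing cost can both be seen in a short sketch. It assumes simple hash-modulo placement and made-up wafer IDs as shard keys; `shard_for` is an illustrative helper, not a database API:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Route a shard key to a shard with a stable hash (modulo placement)."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

wafer_ids = [f"W{i:05d}" for i in range(10_000)]  # hypothetical shard keys

# Distribution across 4 shards
counts = [0] * 4
for wafer_id in wafer_ids:
    counts[shard_for(wafer_id, 4)] += 1
print("rows per shard:", counts)

# Rebalancing cost: how many keys change shard when a fifth shard is added?
moved = sum(shard_for(w, 4) != shard_for(w, 5) for w in wafer_ids)
print(f"{moved / len(wafer_ids):.0%} of keys move when going from 4 to 5 shards")
```

With modulo placement roughly four out of five keys change shard when one shard is added, which is why rebalancing is listed as a drawback; schemes such as consistent hashing are designed to move far fewer keys.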

### 🎯 Microservices Architecture

**Microservices** = Small, independent services communicating via APIs.

**Benefits:**
- ✅ Independent scaling (scale high-traffic services)
- ✅ Technology diversity (different languages per service)
- ✅ Fault isolation (one service fails → others continue)
- ✅ Faster deployments (deploy services independently)

**Challenges:**
- ❌ Distributed system complexity
- ❌ Network latency between services
- ❌ Data consistency across services
- ❌ Debugging difficulty (trace requests across services)

**Key Patterns:**
- **API Gateway**: Single entry point, routing, authentication
- **Service Discovery**: Services register/discover each other (Consul, etcd)
- **Circuit Breaker**: Stop calling failing service, fail fast
- **Event Sourcing**: Store events, rebuild state from event log
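
Of the patterns above, the circuit breaker is the easiest to show in code. The sketch below is a simplified version (consecutive-failure counter, single trial call after a cool-down); the class, its parameters, and the failing service are all hypothetical:

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period, then retry."""
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, func, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result

def flaky_model_service(_):
    raise ConnectionError("model service down")  # hypothetical failing dependency

breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
for attempt in range(5):
    try:
        breaker.call(flaky_model_service, attempt)
    except ConnectionError:
        print(f"attempt {attempt}: call failed")
    except RuntimeError as err:
        print(f"attempt {attempt}: {err}")
```

The first three attempts reach the service and fail; after that the breaker rejects calls immediately, so callers fail fast instead of waiting on timeouts.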

### 🏭 Post-Silicon Examples

**Intel Database Sharding:**
```
Before (single PostgreSQL):
- 1 DB with 500M test records
- Queries: 30s average, timeouts at peak
- Write throughput: 5K inserts/sec

After (Cassandra with 100 shards):
- Shard key: wafer_id (distributes evenly)
- 100 nodes, each handles 5M records
- Queries: <200ms (150× faster)
- Write throughput: 500K inserts/sec (100× faster)

Result: Near-linear scaling, 10× the nodes → roughly 10× capacity
```

**NVIDIA Microservices:**
```
Monolith → Microservices Migration:
1. Model Training Service (Python, TensorFlow)
2. Model Serving Service (C++, TensorFlow Serving)
3. Feature Engineering Service (Python, pandas)
4. Monitoring Service (Go, Prometheus)
5. API Gateway (NGINX, rate limiting, auth)

Benefits:
- Scale serving independently (10× more inference pods)
- Deploy training updates without restarting serving
- Use best language per service (C++ for low-latency serving)
- Fault isolation (training crash doesn't affect serving)

Result: 99.99% uptime, 35ms P99 latency, 5√ó faster deployments
```

**AMD Primary-Replica Replication:**
```
Architecture:
- 1 Primary (writes): PostgreSQL
- 5 Replicas (reads): Async replication
- Load balancer: Route writes → primary, reads → replicas

Read/Write split:
- 95% reads (queries) → replicas (5× capacity)
- 5% writes (inserts, updates) → primary

Result:
- 100K queries/sec (was 20K with single DB)
- <50ms read latency
- Reads no longer contend with writes on the primary
```
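
The read/write split and its main caveat, replication lag, can be sketched with in-memory dicts standing in for the databases. `ReplicatedDB` and its methods are illustrative names, and replication is triggered manually here to make the lag visible:

```python
import itertools

class ReplicatedDB:
    """Route writes to the primary and spread reads over replicas (round-robin)."""
    def __init__(self, num_replicas=5):
        self.primary = {}
        self.replicas = [{} for _ in range(num_replicas)]
        self._next_replica = itertools.cycle(range(num_replicas))
        self.replication_log = []  # writes not yet applied to replicas

    def write(self, key, value):
        self.primary[key] = value
        self.replication_log.append((key, value))  # async: replicas catch up later

    def read(self, key):
        replica = self.replicas[next(self._next_replica)]
        return replica.get(key)  # may be stale or missing until replicate() runs

    def replicate(self):
        for key, value in self.replication_log:
            for replica in self.replicas:
                replica[key] = value
        self.replication_log.clear()

db = ReplicatedDB()
db.write("W001", "pass")
print(db.read("W001"))  # None: replication lag, the replica has not seen the write yet
db.replicate()
print(db.read("W001"))  # pass
```

Applications that need read-your-own-writes behaviour usually route those specific reads to the primary instead.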

---

Let's implement distributed system patterns! 🌐

---

## Part 3: ML System Design

### 🤖 ML System Components

**1. Training Pipeline** (Offline)
- Data ingestion → Preprocessing → Feature engineering → Model training → Evaluation → Model registry

**2. Serving Pipeline** (Online)
- API request → Feature extraction → Model inference → Post-processing → Response

**3. Monitoring Pipeline** (Real-time)
- Data drift detection → Model performance tracking → Alert on degradation → Trigger retraining
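
As one possible form of the drift-detection step in the monitoring pipeline, the sketch below compares the mean of a live feature window against the training-time reference. The threshold, the synthetic data, and the function name are assumptions for illustration; real monitoring usually combines several tests.

```python
import random
import statistics

def mean_shift_detected(reference, live, threshold=4.0):
    """Flag drift when the live mean sits more than `threshold` standard errors from the reference mean."""
    standard_error = statistics.stdev(reference) / len(live) ** 0.5
    z_score = abs(statistics.fmean(live) - statistics.fmean(reference)) / standard_error
    return z_score > threshold

random.seed(42)
reference = [random.gauss(1.00, 0.05) for _ in range(1000)]    # feature values at training time
live_same = [random.gauss(1.00, 0.05) for _ in range(200)]     # production window, same distribution
live_shifted = [random.gauss(1.05, 0.05) for _ in range(200)]  # production window after a shift

for name, window in (("same", live_same), ("shifted", live_shifted)):
    status = "drift: trigger retraining" if mean_shift_detected(reference, window) else "ok"
    print(f"{name}: {status}")
```

A mean-shift check only catches changes in the average; shifts in spread or shape need distribution-level tests.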

### 📊 ML System Design Patterns

**1. Batch Prediction** (Offline inference)
- Pre-compute predictions, store in database
- ✅ High throughput (millions of predictions)
- ✅ Complex models allowed (10s latency OK)
- ❌ Predictions may be stale

**Use case:** Qualcomm predicts yield for all devices nightly, stores in DB for next-day queries

**2. Real-time Prediction** (Online inference)
- Compute prediction on-demand per request
- ✅ Always fresh predictions
- ❌ Latency critical (<100ms)
- ❌ Lower throughput

**Use case:** NVIDIA real-time quality prediction during testing

**3. Hybrid** (Lambda Architecture)
- Batch: Pre-compute for common cases (90%)
- Real-time: On-demand for edge cases (10%)
- Best of both worlds

**Use case:** AMD hybrid system - batch predictions for 90% devices, real-time for new/rare devices
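
The hybrid pattern boils down to a lookup with a fallback. A minimal sketch, where `batch_predictions` stands in for the nightly batch table and `slow_model` is a hypothetical placeholder for real-time inference:

```python
def slow_model(features):
    """Hypothetical stand-in for real-time model inference."""
    return round(0.5 + 0.1 * features["speed_bin"], 2)

# Nightly batch job output: predictions precomputed for known devices
batch_predictions = {"D001": 0.91, "D002": 0.87}

def predict(device_id, features):
    """Serve the precomputed prediction if one exists, else fall back to online inference."""
    if device_id in batch_predictions:
        return batch_predictions[device_id], "batch"
    return slow_model(features), "real-time"

print(predict("D001", {"speed_bin": 3}))  # known device: served from the batch table
print(predict("D999", {"speed_bin": 3}))  # new device: computed on demand
```

The batch path keeps latency low for the common case, while the fallback guarantees an answer for devices the nightly job has not seen.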

### 🚀 Model Serving Architecture

**Intel Production Model Serving:**
```
Client Request
    ↓
Load Balancer (NGINX)
    ↓
API Gateway (FastAPI, 10 pods)
    ↓
    ├→ Redis Cache (check prediction cache, TTL=1h)
    ├→ Feature Service (fetch device features, 5 pods)
    ↓
Model Serving (TensorFlow Serving, 20 pods)
    ├→ Model A (70% traffic)
    ├→ Model B (30% traffic) [A/B test]
    ↓
Post-processing
    ↓
Response (prediction + confidence + model_version)
```

**Key Components:**
- **Model Registry**: MLflow (versioning, metadata, lineage)
- **Feature Store**: Feast (consistent features training/serving)
- **Monitoring**: Prometheus + Grafana (latency, throughput, accuracy)
- **Auto-scaling**: Kubernetes HPA (CPU >70% → add pods)
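
The 70/30 A/B split in the serving diagram can be implemented by hashing a stable request key into a bucket, so the same device always sees the same model. This is one common approach, sketched here with a hypothetical `assign_model` helper:

```python
import hashlib

def assign_model(request_key: str, split_a: float = 0.70) -> str:
    """Deterministically assign a request key to model A or B using a split_a share for A."""
    digest = hashlib.sha256(request_key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "model_a" if bucket < split_a else "model_b"

assignments = [assign_model(f"device-{i}") for i in range(10_000)]
share_a = assignments.count("model_a") / len(assignments)
print(f"model_a share: {share_a:.1%}")
```

Deterministic assignment keeps experiment results consistent across retries and across API pods, which random per-request sampling does not.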

### 📈 Scaling ML Training

**Distributed Training Patterns:**

**1. Data Parallelism** (Same model, different data)
- Split data across 4 GPUs
- Each GPU: Full model, 1/4 of data
- Aggregate gradients, update model
- ✅ Easy to implement (Horovod, PyTorch DDP)
- ✅ Near-linear speedup (4 GPUs → close to 4× faster, minus gradient communication)
- ❌ Model must fit on single GPU

**2. Model Parallelism** (Different model parts, same data)
- Split model layers across GPUs
- GPU1: Layers 1-10, GPU2: Layers 11-20
- ✅ Handle models too large to fit on a single GPU
- ❌ Complex implementation
- ❌ Pipeline bubbles (GPU idle time)

**3. Pipeline Parallelism** (Model parallelism + micro-batching)
- Micro-batches through model pipeline
- ✅ Reduce GPU idle time
- Best for: Very large models + datasets
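
The "aggregate gradients" step of data parallelism can be checked on a toy problem: with equal-sized shards, averaging the per-worker gradients gives the same result as computing the gradient on the full batch. A pure-Python sketch with a one-parameter model (no GPU framework involved):

```python
def gradient(w, batch):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(x, 3.0 * x) for x in range(1, 9)]  # 8 samples from y = 3x
w = 0.0

# Data parallelism: each of 4 workers holds the full model (w) and 1/4 of the data
shards = [data[i::4] for i in range(4)]
worker_grads = [gradient(w, shard) for shard in shards]
averaged = sum(worker_grads) / len(worker_grads)  # the all-reduce (averaging) step

print(averaged, gradient(w, data))  # both are the full-batch gradient
```

Frameworks such as Horovod and PyTorch DDP perform this averaging across GPUs after each backward pass, which is where the communication overhead comes from.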

**AMD Distributed Training:**
```
Before:
- Single GPU training: 20 hours
- Limited to models <24GB

After (Horovod, 16 GPUs):
- Data parallel: 1.5 hours (13× faster, not 16× due to communication)
- Train 10× larger models (model parallel)
- GPU utilization: 92% (was 65%)

Result: 5× more experiments/week, $10M value from faster time-to-market
```
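
The gap between 13× and the ideal 16× in the example above is usually expressed as scaling efficiency. A short sketch of the arithmetic (the function name is illustrative):

```python
def scaling_efficiency(single_gpu_hours, multi_gpu_hours, num_gpus):
    """Return (speedup, efficiency), where efficiency is speedup divided by GPU count."""
    speedup = single_gpu_hours / multi_gpu_hours
    return speedup, speedup / num_gpus

speedup, efficiency = scaling_efficiency(20, 1.5, 16)
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
```

An efficiency of about 83% means roughly one sixth of the added GPU time goes to communication and synchronization rather than computation.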

---

Let's design ML systems! 🤖

---

## 🚀 Real-World Project Ideas

### Post-Silicon Validation Projects

#### 1. **Test Data Platform** (Distributed Storage + Query Engine)
**Objective:** Design platform handling 1PB test data, <200ms query latency, 99.95% uptime

**Architecture:**
- **Storage Layer**: Cassandra (100 nodes, sharded by wafer_id)
- **Compute Layer**: Spark (50 workers, parallel query processing)
- **Cache Layer**: Redis cluster (10 nodes, LRU eviction)
- **API Layer**: FastAPI (20 pods, auto-scaling), NGINX load balancer

**Key Features:**
- Horizontal scaling (add nodes → near-linear capacity increase)
- Multi-region replication (disaster recovery)
- Real-time + batch query support
- Time-series optimization (device test history)

**Success Metrics:** <200ms P95 latency, process 10M records/day, 99.95% uptime
**Business Value:** Intel implementation → $15M savings, 150× faster queries

---

#### 2. **Model Serving Platform** (Microservices + Auto-scaling)
**Objective:** Serve 50+ models, 100K predictions/sec, <50ms P99 latency, A/B testing

**Architecture:**
- **API Gateway**: NGINX (rate limiting, auth, routing)
- **Model Service**: TensorFlow Serving (Kubernetes, 10-100 pods auto-scale)
- **Feature Store**: Feast (consistent features across training/serving)
- **Model Registry**: MLflow (versioning, experiment tracking)
- **Monitoring**: Prometheus + Grafana + PagerDuty alerts

**Key Features:**
- A/B testing framework (traffic splitting 70/30)
- Canary deployments (1% → 10% → 100%)
- Circuit breaker (stop calling failing models)
- Feature caching (80% hit rate, 10ms latency)

**Success Metrics:** 99.99% uptime, 35ms P99 latency, deploy new model in 5 minutes
**Business Value:** NVIDIA implementation → $8M savings, 10× more experiments

---

#### 3. **Real-Time Data Pipeline** (Event Streaming + Processing)
**Objective:** Process 10M test events/day, <2min end-to-end latency, zero data loss

**Architecture:**
- **Ingestion**: Kafka (100 partitions, 3× replication)
- **Stream Processing**: Flink (10 workers, windowing, aggregations)
- **Storage**: TimescaleDB (time-series) + S3 (data lake)
- **Real-time DB**: Redis (latest device state)
- **Batch Processing**: Spark (nightly aggregations)

**Key Features:**
- Lambda architecture (batch + streaming)
- Exactly-once semantics (no duplicate processing)
- Backfill capability (reprocess historical data)
- Real-time dashboards (Grafana, <5s latency)

**Success Metrics:** <2min latency, 100% data delivery, process 10M events/day
**Business Value:** AMD implementation → $12M value, 60% latency improvement

---

#### 4. **Distributed Training Cluster** (GPU Orchestration)
**Objective:** Train 100+ models/week, 90%+ GPU utilization, fault-tolerant training

**Architecture:**
- **Scheduler**: Kubernetes + Kubeflow (job queue, priority)
- **Training Framework**: Horovod (data parallel, 16-GPU jobs)
- **Storage**: Shared NFS (datasets) + S3 (checkpoints)
- **Monitoring**: TensorBoard + Prometheus (GPU metrics, loss curves)
- **Model Registry**: MLflow (lineage, reproducibility)

**Key Features:**
- Auto-checkpoint every 10 minutes (resume on failure)
- Distributed hyperparameter tuning (Optuna, 50 trials parallel)
- Resource quotas per team
- Preemptible GPUs (cost savings)

**Success Metrics:** 92% GPU utilization, 5× faster training, 2× model throughput
**Business Value:** Qualcomm implementation → $20M hardware savings

---

### General AI/ML Projects

#### 5. **Social Media Feed System** (Real-time Ranking)
**Objective:** Serve personalized feeds to 100M users, <200ms latency, real-time updates

**Architecture:**
- **Ranking Service**: XGBoost model (score 1000 posts in 50ms)
- **Cache**: Redis (user feeds, 15min TTL)
- **Database**: Cassandra (user graph, posts)
- **Stream Processing**: Flink (real-time trending, engagement)

**Success Metrics:** <200ms P99, serve 100M users, 10K RPS

---

#### 6. **E-Commerce Recommendation System** (Hybrid Batch + Real-time)
**Objective:** Recommend products to 10M users, <100ms latency, 15% CTR improvement

**Architecture:**
- **Batch**: Nightly collaborative filtering (compute similarity matrix)
- **Real-time**: Online learning (update user profile per click)
- **Hybrid**: Combine batch recommendations + real-time adjustments
- **Cache**: Redis (user recommendations, 1-hour TTL)

**Success Metrics:** 15% CTR increase, <100ms latency, process 1M events/day

---

#### 7. **Financial Fraud Detection** (Real-time Streaming)
**Objective:** Detect fraudulent transactions in <500ms, 99.9% accuracy, handle 50K TPS

**Architecture:**
- **Stream Processing**: Flink (stateful processing, windowing)
- **Feature Store**: Redis (user transaction history)
- **Model Serving**: ONNX Runtime (low-latency inference, 10ms)
- **Alert System**: PagerDuty (immediate notification)

**Success Metrics:** <500ms latency, 99.9% accuracy, 0.1% false positive rate

---

#### 8. **Video Streaming Platform** (CDN + Adaptive Bitrate)
**Objective:** Serve 10M concurrent streams, <2s startup time, 99.99% uptime

**Architecture:**
- **CDN**: CloudFront (edge caching, 100+ PoPs)
- **Origin**: S3 (video storage) + MediaConvert (transcoding)
- **Adaptive Streaming**: HLS/DASH (adjust quality based on bandwidth)
- **Analytics**: Kinesis + Athena (view metrics, buffering events)

**Success Metrics:** <2s startup, 99.99% uptime, serve 10M concurrent users

---

Ready to design production systems! 🏗️

---

## 🎓 Key Takeaways & Next Steps

### What You Learned

**1. Scalability & Load Balancing:**
- ‚úÖ **Horizontal Scaling**: Add servers for unlimited capacity (Intel: 10√ó traffic, 99.95% uptime)
- ‚úÖ **Load Balancers**: Round Robin, Least Connections, IP Hash (NGINX distributes 500K requests/day)
- ‚úÖ **Caching**: LRU, TTL strategies (Redis: 99% hit rate, 3000√ó faster queries)

**2. Distributed Systems:**
- ✅ **CAP Theorem**: CP vs AP trade-offs (Cassandra AP, MongoDB CP)
- ✅ **Replication**: Primary-replica for read scalability (AMD: 5× read capacity)
- ✅ **Sharding**: Horizontal partitioning for write scalability (Intel: 100× throughput)
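
As a reminder of how horizontal partitioning routes writes, here is a minimal hash-sharding sketch. The `shard_for` helper and the key format are assumptions for illustration.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    Python's built-in hash() is salted per process for strings, so a
    cryptographic digest is used to keep routing stable across restarts.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Route 1,000 hypothetical device IDs across 4 shards and inspect the spread.
spread = Counter(shard_for(f"device-{i}", 4) for i in range(1000))
print(sorted(spread.items()))
```

Modulo routing is simple, but changing `num_shards` remaps most keys. Consistent hashing is the usual way to limit that movement when shards are added or removed.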

**3. ML System Design:**
- ✅ **Model Serving**: TensorFlow Serving + Kubernetes (NVIDIA: 99.99% uptime, 35ms latency)
- ✅ **Distributed Training**: Horovod data parallel (AMD: 13× faster, 92% GPU utilization)
- ✅ **Feature Stores**: Feast for training/serving consistency

### System Design Interview Framework

**1. Requirements (5-10 min)**
- **Functional**: What features? (e.g., "Users can post, like, comment")
- **Non-Functional**: Scale? Performance? (e.g., "10M users, <200ms latency, 99.9% uptime")
- **Constraints**: Read/write ratio? Data size?

**2. High-Level Design (10-15 min)**
- Draw boxes: Client → Load Balancer → API Servers → Database
- Identify bottlenecks: Single DB? No cache? No replication?

**3. Deep Dive (15-20 min)**
- **Scalability**: How to handle 10× traffic? (Horizontal scaling, caching, CDN)
- **Reliability**: What if server fails? (Replication, health checks, circuit breakers)
- **Performance**: Reduce latency? (Cache, denormalization, indexes)

**4. Trade-offs (5-10 min)**
- Discuss alternatives (SQL vs NoSQL, sync vs async, consistency vs availability)
- Justify choices based on requirements
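
The non-functional requirements gathered in step 1 usually get turned into rough capacity numbers before any boxes are drawn. A back-of-the-envelope sketch follows; every input (requests per user, write fraction, payload size, peak factor) is an assumed example value, not a figure from this notebook.

```python
from typing import Dict

SECONDS_PER_DAY = 86_400


def estimate_capacity(
    daily_active_users: int,
    requests_per_user_per_day: int,
    write_fraction: float,
    bytes_per_write: int,
    peak_factor: float = 2.0,  # assumed ratio of peak to average traffic
) -> Dict[str, float]:
    """Return average RPS, peak RPS, and yearly storage growth in GB."""
    daily_requests = daily_active_users * requests_per_user_per_day
    avg_rps = daily_requests / SECONDS_PER_DAY
    daily_writes = daily_requests * write_fraction
    return {
        "avg_rps": avg_rps,
        "peak_rps": avg_rps * peak_factor,
        "storage_gb_per_year": daily_writes * bytes_per_write * 365 / 1e9,
    }


estimate = estimate_capacity(
    daily_active_users=10_000_000,
    requests_per_user_per_day=20,
    write_fraction=0.1,
    bytes_per_write=1_000,
)
for name, value in estimate.items():
    print(f"{name}: {value:,.0f}")
```

With these inputs the average load is roughly 2,300 requests per second and storage grows by about 7.3 TB per year, which is enough precision to decide whether a single database or a cache is a likely bottleneck.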

### Common System Design Patterns Summary

| Pattern | Problem | Solution | Use Case |
|---------|---------|----------|----------|
| **Load Balancing** | Single server bottleneck | Distribute traffic across servers | Intel: 500K requests/day → 20 servers |
| **Caching** | Slow database queries | Cache frequent data in Redis | NVIDIA: 80% hit rate, 100× faster |
| **Replication** | Read bottleneck | Primary-replica split | AMD: 5× read capacity |
| **Sharding** | Write bottleneck | Partition data across DBs | Intel: 100× write throughput |
| **CDN** | High latency for global users | Cache content at edge | Serve from nearest location |
| **Message Queue** | Tight coupling, traffic spikes | Process work asynchronously (Kafka, RabbitMQ) | Decouple services, handle spikes |
| **Circuit Breaker** | Cascading failures | Stop calling failing service | Fail fast, protect downstream |
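
Of the patterns in the table, the circuit breaker is the least obvious to implement, so here is a minimal sketch. The class name, the thresholds, and the injectable `clock` parameter are assumptions chosen for illustration; production libraries add metrics, per-exception policies, and thread safety.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(lambda: fetch_profile(user_id))` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to serve a fallback.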

### Real-World Impact Summary

| Company | System | Before | After | Savings |
|---------|--------|--------|-------|---------|
| **Intel** | Test Data Platform | 30s queries, 60% uptime | <200ms queries, 99.95% uptime | $15M |
| **NVIDIA** | Model Serving | 100ms latency, manual scaling | 35ms latency, auto-scaling | $8M |
| **AMD** | Data Pipeline | 5min latency, data loss | <2min latency, zero loss | $12M |
| **Qualcomm** | Training Cluster | 20hr training, 50% GPU util | 4hr training, 92% GPU util | $20M |

**Total measurable impact:** $55M across 4 companies

### Scalability Numbers to Remember

**Latency:**
- L1 cache: 0.5ns
- RAM: 100ns
- SSD: 100µs
- Network (same datacenter): 500µs
- HDD: 10ms
- Network (cross-continent): 150ms

**Throughput benchmarks:**
- Single PostgreSQL: 10K writes/sec
- Redis: 100K ops/sec
- Cassandra (10 nodes): 1M writes/sec
- Kafka: 1M messages/sec per broker
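
These per-node figures are most useful for sizing. A rough sketch of the arithmetic, assuming each node should run at no more than 70% of its benchmark throughput (the headroom value and the `nodes_needed` name are assumptions):

```python
import math


def nodes_needed(
    target_per_sec: float,
    per_node_per_sec: float,
    headroom: float = 0.7,  # assumed maximum sustained utilization per node
) -> int:
    """Return how many nodes keep each one at or below the headroom."""
    return math.ceil(target_per_sec / (per_node_per_sec * headroom))


# 50K writes/sec against the 10K writes/sec single-PostgreSQL figure above.
print(nodes_needed(50_000, 10_000))
```

The result ignores replication overhead and hot partitions, so treat it as a lower bound for a first estimate.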

**Availability:**
- 99% = 3.65 days downtime/year
- 99.9% = 8.76 hours downtime/year
- 99.99% = 52.56 minutes downtime/year
- 99.999% = 5.26 minutes downtime/year
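
The downtime figures follow directly from the availability percentage, and the same arithmetic shows why redundancy helps. A small sketch, assuming a 365-day year and replicas that fail independently (function names are illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def redundant_availability(availability_pct: float, replicas: int) -> float:
    """Availability of N replicas when any single one can serve traffic.

    Assumes failures are independent, which real deployments only approximate.
    """
    failure = 1 - availability_pct / 100
    return (1 - failure ** replicas) * 100


for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} min/year")
print(f"{redundant_availability(99.0, 2):.2f}%")
```

Under the independence assumption, two 99% replicas behind a load balancer reach 99.99%, which is the reasoning behind pairing replication with load balancing.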

### Next Steps

**Immediate (This Week):**
1. Design one system from scratch (URL shortener, pastebin, cache)
2. Calculate capacity estimates for your current project
3. Identify bottlenecks in existing system

**Short-term (This Month):**
1. Build test data platform with distributed storage
2. Implement model serving with auto-scaling
3. Set up monitoring and alerting (Prometheus + Grafana)

**Long-term (This Quarter):**
1. Complete 10 system design problems (Grokking the System Design Interview)
2. Migrate monolith to microservices
3. Design and implement ML platform (training + serving + monitoring)

### Resources

**Books:**
1. *Designing Data-Intensive Applications* by Martin Kleppmann - Bible of distributed systems
2. *System Design Interview* by Alex Xu - Interview preparation
3. *Building Microservices* by Sam Newman - Microservices architecture
4. *Designing Machine Learning Systems* by Chip Huyen - ML production systems

**Online:**
- [System Design Primer](https://github.com/donnemartin/system-design-primer) - Comprehensive guide
- [Grokking the System Design Interview](https://www.educative.io/courses/grokking-the-system-design-interview) - Interview prep
- [High Scalability Blog](http://highscalability.com/) - Real-world architectures
- [AWS Architecture Blog](https://aws.amazon.com/blogs/architecture/) - Cloud patterns

**Practice:**
- Design Instagram, Twitter, YouTube, Uber
- Calculate capacity (storage, bandwidth, servers needed)
- Draw architecture diagrams

---

**🎉 Congratulations!** You now understand how to design large-scale distributed systems for AI/ML workloads. You can architect platforms handling 100M+ users, 1PB+ data, and 100K+ requests/second with 99.99% uptime.

**Measurable skills gained:**
- Design systems scaling 10-100× traffic
- Reduce latency 100-1000× with caching
- Achieve 99.99% uptime with replication + load balancing
- Build ML platforms serving 100K predictions/sec
- Save $5-20M in infrastructure costs through proper architecture

**Ready for version control mastery?** Proceed to **Notebook 009: Git & Version Control** to learn branching strategies, CI/CD pipelines, and model versioning for production ML systems! 🚀