# **Chapter 17: Infrastructure & Platform Systems**

While user-facing applications deliver direct value to customers, infrastructure systems are the invisible engines that power them. These systems handle data storage, caching, rate limiting, and background processing at massive scale. They require extreme reliability: if your cache goes down, every service that depends on it slows down or fails outright.

This chapter covers seven foundational systems that appear in almost every modern architecture. We progress from data collection (web crawler) to storage (key-value store, cache) to traffic management (rate limiter, ID generation) to search (autocomplete) and communication (notifications).

---

## **17.1 Design a Web Crawler**

A web crawler systematically browses the internet to index web pages for search engines, archive content, or extract data. Googlebot, Bingbot, and site-specific crawlers (like those monitoring prices or news) all follow similar patterns.

### **Step 1: Scope (Requirements)**

**Functional Requirements**:
1. **Crawl the web**: Start from a seed URL, follow links, download pages
2. **Parse content**: Extract links, index text, handle different formats (HTML, PDF, images)
3. **Politeness**: Respect robots.txt and crawl rate limits (don't overwhelm websites)
4. **Deduplication**: Don't crawl the same page twice
5. **Storage**: Store crawled content for processing

**Non-Functional Requirements**:
1. **Scalability**: Crawl 1 billion pages per day
2. **Politeness**: Wait at least 1 second between requests to same domain
3. **Extensibility**: Easy to add new content types
4. **Fault tolerance**: Handle network failures, server errors gracefully

**Constraints**:
- Average page size: 500KB HTML
- Average links per page: 100
- Must respect `robots.txt` rules
- Must handle JavaScript-rendered content (modern SPAs)

### **Step 2: Sketch (Back-of-the-Envelope)**

**Traffic**:
```
1 billion pages/day = 11,574 pages/second (average)
Peak: 100,000 pages/second

Storage:
1B pages × 500KB = 500 TB/day
Yearly: 182 PB (raw, before compression)

Network:
500 TB/day = 46 Gbps sustained download
```

**URL Frontier Size**:
```
If each page has 100 links, but 70% are duplicates:
New URLs discovered per page: 30
URL queue growth: 30B URLs/day

Storage for URL queue:
30B URLs × 100 chars × 2 bytes = 6 TB of URLs to process
```

### **Step 3: Solidify (Data Model)**

**URL Frontier** (Message Queue - Kafka/RabbitMQ):
```
Topic: url_queue
Message: {
    "url": "https://example.com/page",
    "priority": 1,           // 0 = high (news sites), 5 = low (archives)
    "depth": 2,              // BFS depth from seed
    "discovered_at": "2024-01-15T10:30:00Z"
}
```

**Visited URLs** (Bloom Filter + Database):
- **Bloom Filter**: In-memory check (~12 GB RAM for 10B URLs at a 1% false-positive rate, about 9.6 bits per URL)
- **Database**: Cassandra for permanent storage of visited URLs with timestamp

```sql
CREATE TABLE crawled_pages (
    url_hash text PRIMARY KEY,  -- SHA-256 of URL (64 hex chars)
    url text,
    crawled_at timestamp,
    content_hash text,          -- To detect changes
    status_code int,
    page_size int,
    server_domain text
);
```
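
The in-memory visited-URL check can be sketched as a small Bloom filter. This is a minimal illustration, not a production structure: the class and function names are made up for this example, the sizing uses the standard optimal formulas, and the double-hashing scheme over SHA-256 is one common choice among several.

```python
import hashlib
import math


def bloom_parameters(n_items, fp_rate):
    """Return (bits, hash_count) from the standard optimal Bloom filter formulas."""
    bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


class BloomFilter:
    def __init__(self, n_items, fp_rate=0.01):
        self.bits, self.hashes = bloom_parameters(n_items, fp_rate)
        self.array = bytearray((self.bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit slices of SHA-256
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.bits for i in range(self.hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.array[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.array[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


seen = BloomFilter(n_items=1000)
seen.add("https://example.com/page")
print("https://example.com/page" in seen)  # True: an added URL is always found
```

A "not present" answer is always correct, so the crawler only needs to consult the database when the filter answers "possibly present".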

**Content Storage** (Object Storage):
```
s3://crawler-bucket/raw/{domain}/{year}/{month}/{day}/{url_hash}.html.gz
```

### **Step 4: Scale (Architecture)**

**High-Level Design**:
```
┌──────────────┐
│   Seed URLs  │ (Initial list: top 1000 websites)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│   URL        │
│   Frontier   │ (Priority Queue - Kafka)
└──────┬───────┘
       │
       ▼
┌──────────────┐      ┌──────────────┐
│   Crawler    │─────>│   Politeness │
│   Workers    │      │   Checker    │
│   (1000s)    │<─────│   (Redis)    │
└──────┬───────┘      └──────────────┘
       │
       ├─► Download page
       ├─► Store in S3
       ├─► Extract links
       └─► Add new URLs to Frontier
```

**The Politeness Constraint (Critical Design Challenge)**:

We cannot hit the same server with 1000 requests simultaneously. We must rate-limit per domain.

**Solution - Token Bucket per Domain**:
```
Redis key: "domain:example.com:tokens"
Value: Integer (remaining tokens)
Refill rate: 1 token/second (configurable per robots.txt)
Capacity: 5 tokens (burst allowance)

Before crawling example.com/page2:
    DECR domain:example.com:tokens
    If result >= 0: Crawl allowed
    Else: INCR to restore the token, then re-queue URL with delay
```
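
The same idea can be sketched in-process. The example below is an assumption-laden illustration rather than the Redis version: `DomainTokenBucket` is a hypothetical name, state lives in a local dict, and refill is computed lazily from elapsed time instead of by a background job. A shared Redis deployment would need the check-and-decrement to run atomically (for example inside a Lua script).

```python
import time


class DomainTokenBucket:
    """In-process token bucket per domain; refill is computed lazily on each check."""

    def __init__(self, rate=1.0, capacity=5, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # burst allowance
        self.clock = clock        # injectable so tests can control time
        self.state = {}           # domain -> (tokens, last_refill_time)

    def try_acquire(self, domain):
        now = self.clock()
        tokens, last = self.state.get(domain, (self.capacity, now))
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.state[domain] = (tokens - 1, now)
            return True   # crawl allowed
        self.state[domain] = (tokens, now)
        return False      # caller should re-queue the URL with a delay
```

With the defaults, a domain can absorb a burst of five requests and then settles at one request per second.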

**Distributed Crawler Coordination**:
```
Hash ring of crawler nodes:
    Domain A → Node 1 (only Node 1 crawls example.com)
    Domain B → Node 2
    Domain C → Node 3

This ensures only one crawler hits a specific domain at a time,
naturally enforcing politeness without complex coordination.
```

**Content Processing Pipeline**:
```
Raw HTML downloaded
    │
    ▼
┌──────────────┐
│   Parser     │──┐
│   Service    │  │
└──────┬───────┘  │
       │          │
       ▼          ▼
┌──────────┐  ┌──────────┐
│ Link     │  │ Content  │
│ Extractor│  │ Indexer  │
└────┬─────┘  └────┬─────┘
     │             │
     ▼             ▼
┌──────────┐  ┌──────────┐
│ URL      │  │ Search   │
│ Frontier │  │ Index    │
│ (New     │  │ (Elastic)│
│  URLs)   │  │          │
└──────────┘  └──────────┘
```

**Handling JavaScript (Modern Web)**:
Many sites (React, Vue, Angular) require JavaScript execution to render content.

**Solution**:
- Use headless Chrome (Puppeteer/Playwright) for JavaScript-heavy sites
- Trade-off: 10x slower, 10x more CPU/memory
- Detection: Try static crawl first; if no content detected, escalate to headless

**Deduplication Strategy**:
1. **URL-level**: Bloom filter prevents crawling same URL twice
2. **Content-level**: SHA-256 hash of content; if same as previous crawl, skip re-indexing
3. **Near-duplicate**: SimHash to detect pages with minor changes (ads, timestamps)
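
For the near-duplicate step, a SimHash fingerprint can be sketched as follows. This is a simplified version under stated assumptions: tokens are whitespace-separated words with equal weight, and MD5 serves only as a convenient 64-bit token hash. Production systems usually weight features and use shingles rather than single words.

```python
import hashlib


def simhash(text, bits=64):
    """Compute a SimHash fingerprint over whitespace-separated tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two pages would be treated as near-duplicates when the Hamming distance between their fingerprints falls below a small threshold; the exact threshold is a tuning choice.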

**Fault Tolerance**:
- **Retry with backoff**: 5xx errors → retry in 1 min, 5 min, 25 min (exponential)
- **Dead Letter Queue**: After 3 failures, move to DLQ for manual inspection
- **Circuit Breaker**: If example.com is down, stop trying for 10 minutes

### **Deep Dive: Distributed politeness**

**Problem**: With 1000 crawler nodes, how do we ensure we don't exceed 1 req/sec per domain across all nodes?

**Solution 1: Domain Sharding**
```
Consistent hash of domain name → determines which crawler node handles it
example.com always goes to Node 5
Node 5 maintains local rate limiter for its assigned domains
```

**Solution 2: Centralized Redis (Simpler, but bottleneck)**
```
Each crawler asks Redis: "Can I crawl example.com now?"
Redis manages token bucket per domain
Trade-off: Network round-trip for every check, Redis hotspot
```

**Hybrid Approach**:
- Assign domains to crawler nodes (sharding)
- If Node 5 fails, redistribute to other nodes (consistent hashing)
- Each node maintains local token buckets for its domains
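
A minimal consistent hash ring for the domain-to-node assignment might look like this. It is a sketch under assumptions: `HashRing` and its methods are hypothetical names, each node is placed at 100 virtual positions, and SHA-256 is used purely as a uniform hash.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(pos, n) for pos, n in self.ring if n != node]

    def node_for(self, domain):
        # First virtual node clockwise from the domain's position, wrapping around
        idx = bisect.bisect_left(self.ring, (self._hash(domain), ""))
        return self.ring[idx % len(self.ring)][1]
```

When a node is removed, only the domains it owned move to other nodes; every other domain keeps its owner, so the local token buckets on surviving nodes stay valid.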

### **Deep Dive: Freshness vs. Coverage**

**Freshness**: How often to re-crawl?
- News sites (cnn.com): Every 5 minutes
- Blogs: Every day
- Static documentation: Every month

**Adaptive Re-crawling**:
```
If page changes frequently (detected by content hash comparison):
    Decrease interval (crawl more often)
If page rarely changes:
    Increase interval up to max (30 days)
```
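
One simple way to implement this policy is multiplicative adjustment. The halving and doubling factors and the 5-minute floor are assumptions chosen for illustration; only the 30-day ceiling comes from the text.

```python
MIN_INTERVAL = 5 * 60           # 5 minutes, in seconds (assumed floor)
MAX_INTERVAL = 30 * 24 * 3600   # 30 days, in seconds


def next_interval(current, changed):
    """Halve the interval when the content hash changed, double it otherwise."""
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))
```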

---

## **17.2 Design a Distributed Key-Value Store (DynamoDB-style)**

Amazon DynamoDB revolutionized NoSQL databases by offering consistent single-digit millisecond latency at any scale. Let's design a similar system.

### **Step 1: Scope**

**Functional Requirements**:
1. **Put/Get/Delete**: Basic CRUD operations by key
2. **Range queries**: Query by key prefix (if using composite keys)
3. **TTL**: Automatic expiration of keys

**Non-Functional Requirements**:
1. **High availability**: 99.999% uptime (5 nines)
2. **Partition tolerance**: System works despite network partitions
3. **Scalability**: Unlimited storage and throughput
4. **Low latency**: P99 < 10ms for reads/writes

**CAP Trade-off**: Choose AP (Availability + Partition tolerance) with eventual consistency, tunable to strong consistency per request.

### **Step 2: Sketch**

**Scale Target**:
```
10 trillion key-value pairs
Average value size: 10KB
Total storage: 100 PB

Request rate: 100 million operations/second
Read:Write ratio: 80:20
```

**Partition Calculation**:
```
Each node handles 10,000 QPS comfortably
100M QPS / 10K = 10,000 nodes needed
```

### **Step 3: Data Model**

**Logical Model**:
```
Key: "user:123:profile"
Value: {name: "Alice", age: 30, city: "NYC"}
Metadata: Version vector, timestamp, TTL
```

**Physical Storage**:
- **MemTable**: In-memory sorted structure (skip list or red-black tree) for recent writes
- **SSTable**: Immutable sorted string tables on disk
- **Commit Log**: Append-only log for durability (WAL - Write Ahead Log)

### **Step 4: Scale (Architecture)**

**Consistent Hashing Ring**:
```
┌─────────────────────────────────────────┐
│           Consistent Hash Ring          │
│                                         │
│  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐   │
│  │NodeA│  │NodeB│  │NodeC│  │NodeD│   │
│  │ 0-25│  │25-50│  │50-75│  │75-100│  │
│  └──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘   │
│     └─────────┴────────┴────────┘       │
│           Replication                   │
└─────────────────────────────────────────┘

Key "user:123" hashes to 42 → stored on NodeB
Also replicated to NodeC and NodeD (next 2 nodes in ring)
```

**Write Path**:
```
Client → Load Balancer → Coordinator Node
    │
    ├─► Hash key to determine partition (Node B, C, D)
    │
    ├─► Send write to all 3 replicas; each appends to its Commit Log (WAL)
    │
    ├─► Each replica updates its MemTable (in-memory)
    │
    └─► Acknowledge write to client once W replicas confirm (e.g., W=2 of 3)
        (Data not yet in an SSTable, but durable in the commit log)

Async: MemTable flushed to SSTable when threshold reached
```

**Read Path**:
```
Client → Coordinator
    │
    ├─► Query all 3 replicas in parallel
    │
    ├─► If versions differ:
    │     - Return latest version to client
    │     - Initiate read repair (update stale replicas)
    │
    └─► Return result
```
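
The read path with read repair can be sketched as below. To keep the example short, a plain integer version stands in for the version vector, and `Replica` is an in-memory stand-in for a storage node; both are simplifications, not how a real store tracks causality.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def get(self, key):
        return self.data.get(key, (0, None))

    def put(self, key, version, value):
        # Accept only newer versions so a repair never rolls data back
        if version > self.get(key)[0]:
            self.data[key] = (version, value)


def read_with_repair(replicas, key):
    """Query every replica, return the newest value, and repair stale replicas."""
    responses = [(rep, *rep.get(key)) for rep in replicas]
    _, latest_version, latest_value = max(responses, key=lambda item: item[1])
    for rep, version, _ in responses:
        if version < latest_version:
            rep.put(key, latest_version, latest_value)  # read repair
    return latest_value
```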

**Storage Engine Details**:

**MemTable** (In-Memory):
```
Data structure: Concurrent skip list
Insertion: O(log n)
Lookup: O(log n)
Flush trigger: 128MB
When flushed: Written to SSTable file, MemTable cleared
```

**SSTable** (On-Disk):
```
Immutable, sorted file format:
[data block 1][data block 2]...[index block][footer]

Index block: Sparse index of keys to file offsets
Compression: Snappy or LZ4 per block
Bloom filter: To check if key might exist in this SSTable
```

**Compaction** (Background Process):
```
As writes continue, SSTables accumulate:
sstable_1: [a-c], sstable_2: [d-f], sstable_3: [a-b] (newer)

Problem: Multiple versions of same key across files
Solution: Periodically merge SSTables (like merge sort)
         Keep only latest version of each key
         Remove deleted keys (tombstones) after retention period
```
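
The merge step can be illustrated with dictionaries standing in for SSTables. This is a toy model: a real compaction streams a k-way merge over sorted files rather than loading them into memory, and the `TOMBSTONE` marker is a hypothetical representation of a delete.

```python
TOMBSTONE = object()  # marker written in place of a deleted value


def compact(sstables):
    """Merge SSTables (ordered oldest to newest) into one sorted table.

    Each SSTable is modeled as a dict of key -> value.
    """
    merged = {}
    for table in sstables:   # newer tables overwrite older entries
        merged.update(table)
    return {
        key: value
        for key, value in sorted(merged.items())
        if value is not TOMBSTONE   # assumes the tombstone retention period has passed
    }
```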

### **Deep Dive: Conflict Resolution (Vector Clocks)**

When a network partition occurs, different replicas may receive different updates.

**Scenario**:
```
Partition occurs: Node A isolated from Node B

Client 1 writes to Node A: {name: "Alice"}
Client 2 writes to Node B: {name: "Bob"}

Network heals. Node A and B reconcile.

Options:
1. Last-Write-Wins (LWW): Use timestamp. Risk: Clock skew causes data loss
2. Vector Clocks: Track which node wrote which version
   Result: Both versions kept, client must resolve (siblings)
```

**Vector Clock Example**:
```
Initial: {[A:1], value: "Alice"}

Update at A: {[A:2], value: "Alicia"}
Update at B (during partition): {[A:1, B:1], value: "Bob"}

On reconciliation:
- [A:2] and [A:1,B:1] are concurrent (neither descends from other)
- Keep both: ["Alicia", "Bob"] → return to client for resolution
```
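
The comparison rule is short enough to write out. Here a clock is modeled as a dict from node name to counter, which is one convenient representation among several.

```python
def descends(a, b):
    """True if clock `a` has seen every event recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "concurrent"   # siblings: keep both values for the client to resolve


print(compare({"A": 2}, {"A": 1, "B": 1}))  # concurrent
print(compare({"A": 2}, {"A": 1}))          # a is newer
```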

### **Deep Dive: Gossip Protocol**

How do nodes know about each other and detect failures?

**Gossip Protocol**:
```
Every 1 second, each node picks random peer and exchanges state:
    - "I know about keys 0-1000"
    - "Node C failed 30 seconds ago"
    - "New node D joined at position 75"

Properties:
- Information spreads like epidemic (exponential)
- Eventually consistent membership
- Scales to thousands of nodes (O(log n) rounds to propagate)
```
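
The logarithmic spread can be observed with a small simulation. This models only push-style rumor spreading from a single node, with one random peer per round; real gossip implementations also exchange heartbeats and version numbers, which are omitted here.

```python
import random


def gossip_rounds(n_nodes, seed=0):
    """Count push-gossip rounds until every node has heard a rumor from node 0."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n_nodes:
        # Each informed node pushes the rumor to one uniformly random peer
        targets = {rng.randrange(n_nodes) for _ in informed}
        informed |= targets
        rounds += 1
    return rounds
```

Because the informed set can at most double per round, a cluster of 1,024 nodes needs at least 10 rounds; in practice the count lands modestly above that lower bound.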

---

## **17.3 Design a Distributed Cache (Redis-style)**

Caching reduces database load and improves latency by storing hot data in memory. A distributed cache scales this across multiple servers.

### **Step 1: Scope**

**Requirements**:
1. **Key-Value operations**: Get, Set, Delete, Expire (TTL)
2. **Data structures**: Strings, Lists, Sets, Hashes, Sorted Sets
3. **Pub/Sub**: Real-time messaging between clients
4. **Persistence**: Optional snapshotting to disk
5. **High availability**: Replication and automatic failover

### **Step 2: Sketch**

**Scale**:
```
Keys per node: 100 million
Average value size: 5KB
Data per node: 100M × 5KB = 500 GB RAM
Cluster size: 100 nodes = 10 billion keys, 50 TB cache capacity

Throughput: 10 million ops/sec cluster-wide
```

### **Step 3: Architecture**

**Sharding Strategy** (Redis Cluster):
```
Hash Slot Algorithm:
- 16,384 hash slots total
- Key hashed to slot: slot = CRC16(key) % 16384
- Slots assigned to nodes: Node A (0-5500), Node B (5501-11000), etc.

Client caches slot-to-node mapping for efficiency
```
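
Slot lookup can be sketched with the standard library. `binascii.crc_hqx` with an initial value of 0 computes the CRC16 variant that Redis Cluster specifies. The node ranges follow the example above, with Node C's range assumed to cover the remaining slots; hash tags are left out for brevity.

```python
import binascii
import bisect

TOTAL_SLOTS = 16384
# Inclusive upper slot bound per node; Node C's range is an assumed remainder
SLOT_UPPER_BOUNDS = [5500, 11000, 16383]
NODES = ["Node A", "Node B", "Node C"]


def key_slot(key):
    """Hash a key to one of the 16,384 slots."""
    return binascii.crc_hqx(key.encode("utf-8"), 0) % TOTAL_SLOTS


def node_for_key(key):
    """Find the node whose slot range contains the key's slot."""
    return NODES[bisect.bisect_left(SLOT_UPPER_BOUNDS, key_slot(key))]
```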

**Replication** (Master-Replica):
```
Each master has 1-2 replicas
Write to master, replicate async to replicas
If master fails: Replica promoted by majority vote of the remaining masters (Raft-like election)
```

**Eviction Policies** (When memory full):
```
LRU (Least Recently Used): Remove the keys that were accessed longest ago
LFU (Least Frequently Used): Remove the keys with the fewest accesses
TTL: Remove expired keys first
Random: Fast but suboptimal
No eviction: Return errors on write
```
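
As an illustration of the LRU policy, here is an exact LRU cache built on `OrderedDict`. It is a teaching sketch: Redis itself approximates LRU by sampling keys rather than maintaining a full ordering.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest access first, newest last

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used key
```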

### **Deep Dive: Cache Consistency**

**Cache-Aside Pattern** (Most Common):
```
Read:
    1. Check cache
    2. If miss: Read DB, write to cache, return value

Write:
    1. Update DB
    2. Delete the cache entry (don't update it: concurrent writers can
       race and leave a stale value; the next read repopulates it)
       Alternative: coordinate both updates with a Saga-style workflow
```

**Thundering Herd Prevention**:
```
Scenario: Cache expires, 1000 clients request same key simultaneously
Result: 1000 DB queries (database dies)

Solutions:
1. Lease mechanism: Only one client allowed to regenerate cache
2. Probabilistic early expiration: Each request may refresh early, with probability rising as expiry nears (e.g., past 90% of TTL)
3. Lock + Retry: First client acquires lock, others wait
```

**Code Example** (Cache-Aside with Lease):
```python
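import time


def get_with_lease(key, max_retries=50):
    """Cache-aside read where only the lease holder queries the database.

    `cache` and `db` are placeholder clients. `try_acquire_lease` and
    `release_lease` stand in for an atomic primitive such as Redis SET NX EX.
    """
    for _ in range(max_retries):
        value = cache.get(key)
        if value is not None:
            return value

        if cache.try_acquire_lease(key, ttl=10):
            try:
                # Only the lease holder regenerates the entry
                value = db.query(key)
                cache.set(key, value)
                return value
            finally:
                cache.release_lease(key)

        # Another process is refreshing: back off, then re-check the cache
        time.sleep(0.1)

    # Lease holder is slow or crashed: fall back to the database
    return db.query(key)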
```

---

## **17.4 Design a Rate Limiter**

Rate limiting prevents abuse by restricting how many requests a client can make in a time window (e.g., 100 requests/minute).

### **Step 1: Scope**

**Requirements**:
1. **Limit by**: IP address, User ID, API key, or combination
2. **Algorithms**: Token bucket, Sliding window, Fixed window
3. **Response**: HTTP 429 (Too Many Requests) when limit exceeded
4. **Headers**: Return remaining quota (X-RateLimit-Remaining)

### **Step 2: Algorithms**

**Token Bucket**:
```
Bucket capacity: 10 tokens
Refill rate: 1 token/second

Request arrives:
    If tokens > 0:
        tokens -= 1
        Allow request
    Else:
        Reject request

Pros: Allows bursts up to bucket size
Cons: Requires storing token count per client
```

**Sliding Window Log**:
```
Store timestamp of each request in sorted set
When new request arrives:
    Remove entries older than window (e.g., 1 minute ago)
    Count remaining entries
    If count < limit: Allow and add timestamp
    Else: Reject

Pros: Precise
Cons: O(n) memory per client (one entry per request in the window), high storage cost
```

**Sliding Window Counter** (Hybrid):
```
Divide time into buckets (e.g., 1 minute buckets)
Store count per bucket in Redis Hash

Estimated count = Current bucket count + Previous bucket count × (1 - weight)
Weight = time elapsed in current bucket / bucket size

Approximate but memory efficient
```
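
A sliding window counter can be sketched in-process as follows. The estimate counts the current bucket in full and weights the previous bucket by the fraction of it that still overlaps the sliding window. The class name and the dict-based storage are illustrative; a Redis-backed version would rely on key TTLs to discard old buckets, which this sketch does not do.

```python
import time


class SlidingWindowCounter:
    def __init__(self, limit, window=60, clock=time.time):
        self.limit = limit
        self.window = window   # bucket size in seconds
        self.clock = clock     # injectable so tests can control time
        self.counts = {}       # (client, bucket index) -> request count

    def allow(self, client):
        now = self.clock()
        bucket = int(now // self.window)
        elapsed = (now % self.window) / self.window  # fraction of current bucket elapsed
        previous = self.counts.get((client, bucket - 1), 0)
        current = self.counts.get((client, bucket), 0)
        # The previous bucket contributes only the part still inside the window
        estimate = current + previous * (1 - elapsed)
        if estimate >= self.limit:
            return False
        self.counts[(client, bucket)] = current + 1
        return True
```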

### **Step 3: Architecture**

**Distributed Rate Limiter**:
```
API Gateway / Load Balancer
    │
    ▼
┌──────────────┐
│   Rate       │────> Redis (Centralized counter)
│   Limiter    │      Key: "ratelimit:{user_id}:{minute}"
│   Service    │      Value: Counter
└──────┬───────┘      TTL: 2 minutes
       │
       ▼
   Application
```

**Edge Rate Limiting** (CloudFlare style):
```
Rate limit at CDN edge (PoP) to block attacks before they reach origin
Synchronize counters between edge nodes periodically (eventual consistency)
```

### **Deep Dive: Burst Handling**

**Token Bucket vs. Leaky Bucket**:
```
Token Bucket (typically used for policing):
    - Bursts allowed up to bucket size
    - Then limited to refill rate
    - Excess requests rejected immediately
    
Leaky Bucket (traffic shaping):
    - Requests enter queue at any rate
    - Processed at constant rate (leak rate)
    - Queue full → drops requests
    - Smooths traffic but adds latency
```

---

## **17.5 Design a Unique ID Generator**

Distributed systems need unique IDs without central coordination (to avoid a single point of failure). Twitter's Snowflake is the most widely copied design.

### **Requirements**
- 64-bit integers (fits in database BIGINT)
- Roughly sortable by time (newer IDs > older IDs)
- Unique across distributed nodes without coordination
- High throughput (10,000+ IDs/second per node)

### **Snowflake Structure** (64 bits):
```
0 | 0000000000 0000000000 0000000000 0000000000 0 | 00000 | 00000 | 000000000000
^ |                    41 bits                     | 5bits | 5bits |   12 bits
| |                    (Timestamp)                 |(DC ID)|(Node) |  Sequence
Sign
```

**Breakdown**:
1. **1 bit**: Sign (always 0 for positive)
2. **41 bits**: Milliseconds since epoch (69 years)
3. **5 bits**: Data center ID (32 data centers)
4. **5 bits**: Machine ID (32 machines per DC)
5. **12 bits**: Sequence number (4096 IDs per millisecond per machine)

**Total capacity**: 32 DCs × 32 machines × 4096 IDs/ms ≈ 4.2 million IDs/ms cluster-wide (about 4.1 million IDs/second per machine)

**Implementation**:
```python
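import threading
import time

# Custom epoch in milliseconds (assumed here: 2020-01-01T00:00:00Z)
EPOCH = 1577836800000


class Snowflake:
    def __init__(self, datacenter_id, machine_id):
        if not 0 <= datacenter_id < 32 or not 0 <= machine_id < 32:
            raise ValueError("datacenter_id and machine_id must fit in 5 bits")
        self.datacenter_id = datacenter_id
        self.machine_id = machine_id
        self.sequence = 0
        self.last_timestamp = -1
        self.lock = threading.Lock()  # generate() may be called from many threads

    @staticmethod
    def current_time_millis():
        return time.time_ns() // 1_000_000

    def wait_next_millis(self):
        # Spin until the clock advances past the last issued millisecond
        timestamp = self.current_time_millis()
        while timestamp <= self.last_timestamp:
            timestamp = self.current_time_millis()
        return timestamp

    def generate(self):
        with self.lock:
            timestamp = self.current_time_millis()

            if timestamp < self.last_timestamp:
                raise RuntimeError("Clock moved backwards!")

            if timestamp == self.last_timestamp:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12 bits
                if self.sequence == 0:  # Overflow, wait for next millisecond
                    timestamp = self.wait_next_millis()
            else:
                self.sequence = 0

            self.last_timestamp = timestamp

            # Build ID: timestamp | datacenter | machine | sequence
            return (
                ((timestamp - EPOCH) << 22)
                | (self.datacenter_id << 17)
                | (self.machine_id << 12)
                | self.sequence
            )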
```
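
Decoding an ID back into its fields is a useful way to check the bit layout. The sketch below assumes a custom epoch of 2020-01-01 UTC; any real deployment picks its own epoch, and `decode_snowflake` is a hypothetical helper name.

```python
EPOCH = 1577836800000  # assumed custom epoch in ms (2020-01-01T00:00:00Z)


def decode_snowflake(snowflake_id):
    """Split a 64-bit ID back into its four fields."""
    return {
        "timestamp_ms": (snowflake_id >> 22) + EPOCH,   # upper 41 bits
        "datacenter_id": (snowflake_id >> 17) & 0x1F,   # 5 bits
        "machine_id": (snowflake_id >> 12) & 0x1F,      # 5 bits
        "sequence": snowflake_id & 0xFFF,               # lower 12 bits
    }
```

Because the timestamp occupies the most significant bits, sorting IDs numerically also sorts them by creation time, to millisecond precision.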

**Alternative: UUID v4**
- 128 bits (twice the size of a BIGINT; random values also fragment B-tree indexes)
- Not sortable (random)
- No coordination needed (random)
- Use when you don't need sortability

---

## **17.6 Design a Search Autocomplete System (Typeahead)**

Autocomplete suggests completions as users type (e.g., "sys" → "system design", "system architecture").

### **Step 1: Scope**

**Requirements**:
1. **Fast**: Suggestions in < 50ms
2. **Relevant**: Sorted by popularity/frequency
3. **Fresh**: Reflect recent trends (breaking news)
4. **Scale**: Support 10 million queries/day

### **Step 2: Data Structure - Trie (Prefix Tree)**

```
                root
                 |
                 s
                 |
                 y
                 |
           ┌─────┴─────┐
           s           m
           |           |
           t           p
           |           |
           e           t
           |           |
           m*          o
           |           |
           d           m*

(* marks the end of a complete term; "system" and longer terms share one path)

Each node stores:
- Character
- Children pointers
- Top K completions (cached at node)
- Frequency count

**Optimization**: Each node caches its top K suggestions (e.g., K = 5), so a lookup walks only to the prefix node and never scans the subtree below it.
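
A trie with cached suggestions at each node can be sketched as follows. The class names are illustrative, and the sketch rebuilds each cached list on insert, which is fine for a nightly batch build but would be too slow for high-rate incremental updates.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top = []  # cached (frequency, term) pairs, highest frequency first


class Trie:
    def __init__(self, k=5):
        self.root = TrieNode()
        self.k = k

    def insert(self, term, frequency):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
            # Refresh this prefix's cached top-k with the new term
            entries = [e for e in node.top if e[1] != term] + [(frequency, term)]
            node.top = sorted(entries, key=lambda e: (-e[0], e[1]))[: self.k]

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [term for _, term in node.top]
```

A lookup costs one dictionary hop per typed character, independent of how many terms share the prefix.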

### **Step 3: Architecture**

```
User types "sys"
    │
    ▼
CDN / Edge Cache (Frequently requested prefixes cached)
    │
    ▼
Autocomplete Service
    │
    ├─► Check Redis (prefix "sys" → ["system design", "system architecture"])
    │
    └─► Cache miss? Query Trie Service
            │
            ▼
        Shard by first 2 characters ("sy")
        Trie stored in memory (100GB RAM for 10M terms)
```

**Data Collection**:
```
Search Analytics (Kafka) → Aggregator (Spark/Flink) → Update Trie nightly
Recent trends (last hour): Separate hot-trie updated in real-time
```

### **Deep Dive: Personalization**

Generic: "sys" → "system design"
Personalized: "sys" → "system design book" (if user bought books before)

**Approach**:
- Maintain user-specific suffixes in separate index
- Merge generic + personalized results at serving time
- Weight: 70% global popularity, 30% personal history
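
The merge step can be sketched as a weighted blend. The example assumes both inputs are maps from term to a score already normalized to the range 0 to 1; how those scores are produced is outside the scope of this sketch, and `merge_suggestions` is a hypothetical name.

```python
def merge_suggestions(global_scores, personal_scores, k=5,
                      global_weight=0.7, personal_weight=0.3):
    """Blend two {term: score} maps and return the top-k terms."""
    terms = set(global_scores) | set(personal_scores)
    blended = {
        term: global_weight * global_scores.get(term, 0.0)
        + personal_weight * personal_scores.get(term, 0.0)
        for term in terms
    }
    # Highest blended score first; ties broken alphabetically for stable output
    return sorted(blended, key=lambda term: (-blended[term], term))[:k]
```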

---

## **17.7 Design a Notification System (Push, Email, SMS)**

A notification system delivers messages across multiple channels (iOS Push, Android Push, Email, SMS) with prioritization and batching.

### **Step 1: Scope**

**Requirements**:
1. **Multi-channel**: iOS (APNs), Android (FCM), Email (SMTP), SMS (Twilio)
2. **Prioritization**: Critical (2FA) vs. Marketing (batch at night)
3. **Rate limiting**: Don't overwhelm user (max 10 push/hour)
4. **Retry**: Exponential backoff for failures
5. **Tracking**: Delivered, Opened, Clicked metrics

### **Step 2: Architecture**

```
API Request (Send notification to user 123)
    │
    ▼
┌──────────────┐
│  Notification│──┐
│  Service     │  │
└──────┬───────┘  │
       │          │
       ▼          ▼
┌──────────┐  ┌──────────┐
│ Priority │  │ Template │
│ Queue    │  │ Engine   │
│(RabbitMQ)│  │(i18n)    │
└────┬─────┘  └────┬─────┘
     │             │
     └──────┬──────┘
            ▼
     ┌──────────────┐
     │    Router    │
     │(Channel sel.)│
     └──────┬───────┘
            │
   ┌────────┼────────┬───────┐
   ▼        ▼        ▼       ▼
┌──────┐┌───────┐┌───────┐┌──────┐
│ iOS  ││Android││ Email ││ SMS  │
│ Push ││ Push  ││       ││      │
└──────┘└───────┘└───────┘└──────┘
```

**Priority Queues**:
```
Critical: 2FA, password reset → Immediate, dedicated workers
High: Direct messages → < 1 second delay
Normal: Likes, comments → Batched (5 min window)
Low: Marketing, digests → Night time, rate limited
```

**Batching Strategy**:
```
User receives 50 likes in 5 minutes:
    Don't: Send 50 push notifications
    Do: Send "Alice and 49 others liked your post"
    
Implementation:
    Aggregate in Redis for 5 minutes
    Key: "pending:user_123"
    Value: List of notifications
    On flush: Render template with count
```
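
The flush-time rendering can be sketched with a small helper. `render_batch` is a hypothetical function; a real template engine would also handle localization and pluralization rules.

```python
def render_batch(actors, action="liked your post"):
    """Collapse a list of actor names into a single notification line."""
    unique = list(dict.fromkeys(actors))  # de-duplicate, keep arrival order
    if not unique:
        return None
    if len(unique) == 1:
        return f"{unique[0]} {action}"
    if len(unique) == 2:
        return f"{unique[0]} and {unique[1]} {action}"
    return f"{unique[0]} and {len(unique) - 1} others {action}"
```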

**Delivery Guarantees**:
```
At-least-once delivery:
    1. Write to database (notification status: PENDING)
    2. Push to Queue
    3. Worker processes, sends to APNs
    4. APNs acknowledges → Mark DELIVERED
    5. If no ACK in 30s → Retry (max 3 times)
    6. After 3 failures → Mark FAILED, alert admin
```

**Idempotency**:
```
Notification ID generated by client or service
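If a retry occurs, the worker checks the ID against a dedup store before resending;
a collapse key (e.g., apns-collapse-id) lets a repeat replace the original on the device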
Prevents duplicate notifications on network timeout retry
```

### **Deep Dive: Third-Party Integration**

**APNs (Apple Push Notification service)**:
- HTTP/2 connection maintained
- JWT token authentication (expires every hour)
- Invalid tokens (app uninstalled): Reported in the response as HTTP 410 (`Unregistered`); stop sending to them

**FCM (Firebase Cloud Messaging)**:
- Similar to APNs but Google's implementation
- Topic subscriptions (pub/sub model for broadcast)

**Circuit Breakers**:
```
If APNs is down:
    Circuit breaker opens after 10 failures
    Queue notifications in "delayed" queue
    Retry every 5 minutes
    When APNs recovers, drain queue
```
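
A minimal circuit breaker matching these numbers might look like the sketch below. The class and method names are illustrative, and the half-open state is simplified: after the cooldown the breaker lets requests through, and the next recorded failure re-opens it.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=10, retry_after=300, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.retry_after = retry_after  # seconds to wait before probing again
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at = None

    def allow_request(self):
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Open: allow a probe only once the cooldown has elapsed
        return self.clock() - self.opened_at >= self.retry_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open (or re-open) the circuit
```

While `allow_request()` returns `False`, the worker would place notifications on the delayed queue instead of calling the provider.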

---

## **17.8 Chapter Summary**

Infrastructure systems share common patterns:

1. **Sharding**: Consistent hashing or fixed hash slots for data distribution (key-value store, cache, crawler)
2. **Replication**: Leader-follower for durability, leaderless for availability
3. **Backpressure**: When overwhelmed, shed load (rate limiting) or buffer (queues)
4. **Idempotency**: Design for retries (notifications, ID generation)
5. **Eventual Consistency**: Accept temporary inconsistency for availability

**Key Trade-offs**:
- **Consistency vs. Availability**: Dynamo chooses availability; traditional SQL chooses consistency
- **Memory vs. Disk**: Redis chooses speed (memory); Cassandra chooses cost (disk)
- **Precision vs. Performance**: Sliding window log is exact but stores every request; sliding window counter is approximate but needs only two counters per client

**Interview Tips for Infrastructure**:
- Always mention CAP theorem implications
- Discuss failure modes: What happens when a node dies? Network partitions?
- Calculate capacity: Memory per node, QPS per shard, replication factor
- Know the math: 2^10 ≈ 1000, 2^20 ≈ 1 million, 2^32 ≈ 4 billion

---

**Exercises**:

1. **Web Crawler**: How would you detect and avoid crawler traps (infinite URL spaces like calendars)?

2. **Key-Value Store**: Design a consistent read operation that guarantees reading the latest write (R+W > N quorum).

3. **Cache**: How would you implement a distributed cache that maintains strong consistency (all nodes see same value) versus eventual consistency?

4. **Rate Limiter**: Design a rate limiter that supports "100 requests per minute per user, but max 1000 per hour per IP" (multi-dimensional limits).

5. **ID Generator**: What happens if the Snowflake machine's clock goes backwards by 1 second? How do you handle it?

6. **Autocomplete**: How would you implement fuzzy matching (typo tolerance) in the autocomplete system?

7. **Notification System**: Design a "quiet hours" feature where users don't receive non-critical notifications from 10 PM to 8 AM, but queue them for delivery at 8 AM.

---

The next chapter will cover **Enterprise-Grade Systems**—the heavy-duty infrastructure that powers the world's largest companies, including distributed message queues, payment systems, and collaborative editors.