# **Chapter 18: Enterprise-Grade Systems**

Enterprise systems operate at massive scale while maintaining strict correctness guarantees. Unlike consumer apps where occasional inconsistency is acceptable, financial systems, inventory management, and coordination services must be bulletproof. This chapter covers the architecture behind the world's most critical infrastructure.

---

## **18.1 Design a Distributed Message Queue (Kafka-style)**

Apache Kafka revolutionized event streaming by treating messages as immutable logs rather than transient queue items. It's the backbone of real-time analytics, event sourcing, and microservice communication.

### **Step 1: Scope (Requirements)**

**Functional Requirements**:
1. **Publish/Subscribe**: Producers send messages; consumers read them
2. **Topics**: Logical channels for message categories
3. **Partitions**: Parallel processing within a topic
4. **Persistence**: Messages retained for configurable time (days/weeks)
5. **Consumer Groups**: Multiple consumers coordinate to process messages without duplication
6. **Replay**: Ability to re-read messages from beginning or specific offset

**Non-Functional Requirements**:
1. **Throughput**: 1 million messages/second per cluster
2. **Latency**: P99 < 10ms for publish, < 100ms for consume
3. **Durability**: Zero message loss once acknowledged
4. **Availability**: 99.95% uptime
5. **Scalability**: Horizontal scaling by adding brokers

**Key Challenge**: Maintaining order within partitions while allowing parallel consumption across partitions.

### **Step 2: Sketch (Back-of-the-Envelope)**

**Traffic**:
```
1M messages/sec
Average message size: 1KB
Throughput: 1 GB/sec incoming

Replication factor: 3
Network traffic: 3 GB/sec (write to 3 brokers)

Storage:
1M msg/sec × 1KB × 86400 sec/day = 86.4 TB/day
Retention: 7 days ≈ 605 TB of unique data (≈ 1.8 PB on disk at replication factor 3)
```
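The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch using decimal units (1 KB = 1,000 bytes); the constant names are illustrative, not part of any Kafka tooling. Without rounding the daily figure, one copy of the retained data comes to roughly 605 TB.

```python
# Back-of-the-envelope capacity estimate for the figures above (decimal units).
MSGS_PER_SEC = 1_000_000
MSG_SIZE_BYTES = 1_000          # 1 KB
REPLICATION_FACTOR = 3
RETENTION_DAYS = 7
SECONDS_PER_DAY = 86_400
TB = 10**12

ingest_bytes_per_sec = MSGS_PER_SEC * MSG_SIZE_BYTES
daily_bytes = ingest_bytes_per_sec * SECONDS_PER_DAY
retained_bytes = daily_bytes * RETENTION_DAYS
replicated_bytes = retained_bytes * REPLICATION_FACTOR

print(f"ingest:           {ingest_bytes_per_sec / 10**9:.1f} GB/s")
print(f"per day:          {daily_bytes / TB:.1f} TB")
print(f"7-day retention:  {retained_bytes / TB:.1f} TB (one copy)")
print(f"with replication: {replicated_bytes / TB:.1f} TB ({REPLICATION_FACTOR} copies)")
```

Changing `RETENTION_DAYS` or `REPLICATION_FACTOR` shows how quickly disk requirements grow with either knob.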

**Partition Math**:
```
To consume 1M msg/sec with 10 consumers:
Need 10 partitions minimum (100K msg/sec per consumer)

For 100 consumers: 100 partitions
Each partition is a log file on disk
```

### **Step 3: Solidify (Data Model)**

**Topic Partition Log**:
```
Physical file: /data/kafka/topic-orders/partition-0/00000000000000000000.log

Format: [Offset: 8 bytes][Message Size: 4 bytes][Message Data: N bytes]

Immutable append-only log:
Offset 0: {order_id: 100, amount: 50.00}
Offset 1: {order_id: 101, amount: 75.00}
Offset 2: {order_id: 102, amount: 120.00}
...
```

**Index File** (for O(log n) lookups):
```
Sparse index: Every 4096 bytes, record offset → file position
Offset 0 → Position 0
Offset 100 → Position 4096
Offset 200 → Position 8192

To find offset 150:
    Binary search index: offset 150 is between 100 and 200
    Seek to position 4096, scan forward to offset 150
```
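The lookup described above can be sketched with Python's `bisect` module. The index contents mirror the example; `find_scan_start` is a hypothetical helper name, and a real broker would memory-map the index file rather than hold a Python list.

```python
from bisect import bisect_right

# Sparse index: sorted (offset, file_position) pairs, roughly one per 4096 bytes.
INDEX = [(0, 0), (100, 4096), (200, 8192)]
INDEX_OFFSETS = [offset for offset, _ in INDEX]

def find_scan_start(target_offset: int) -> tuple[int, int]:
    """Return the last index entry whose offset is <= target_offset."""
    i = bisect_right(INDEX_OFFSETS, target_offset) - 1
    if i < 0:
        raise KeyError(f"offset {target_offset} precedes this segment")
    return INDEX[i]

# Seek to the returned file position, then scan forward to the exact offset.
print(find_scan_start(150))  # (100, 4096)
```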

**Consumer Offset Storage**:
```
Topic: __consumer_offsets
Key: consumer_group + topic + partition
Value: last_committed_offset

Stored in Kafka itself (not external DB) for consistency
```

### **Step 4: Scale (Architecture)**

**Kafka Cluster Topology**:
```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│  Producer   │      │  Producer   │      │  Producer   │
│  (App A)    │      │  (App B)    │      │  (App C)    │
└──────┬──────┘      └──────┬──────┘      └──────┬──────┘
       │                     │                     │
       └─────────────────────┼─────────────────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │   Load Balancer │
                    │   (Discovery)   │
                    └────────┬────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
   ┌──────────┐        ┌──────────┐        ┌──────────┐
   │ Broker 1 │◄──────►│ Broker 2 │◄──────►│ Broker 3 │
   │ (Leader  │        │ (Leader  │        │ (Leader  │
   │  P0,P3)  │        │  P1,P4)  │        │  P2,P5)  │
   └────┬─────┘        └────┬─────┘        └────┬─────┘
        │                   │                   │
        └───────────────────┼───────────────────┘
                            │
                            ▼
                    ┌─────────────────┐
                    │   ZooKeeper /   │
                    │   Raft Cluster  │
                    │   (Coordination)│
                    └─────────────────┘
                            │
         ┌──────────────────┼──────────────────┐
         │                  │                  │
         ▼                  ▼                  ▼
   ┌──────────┐       ┌──────────┐       ┌──────────┐
   │Consumer 1│       │Consumer 2│       │Consumer 3│
   │(Group A) │       │(Group A) │       │(Group A) │
   └──────────┘       └──────────┘       └──────────┘
   (Each processes different partitions)
```

**Replication Flow**:
```
Producer writes to Topic "orders", Partition 0
    │
    ▼
Broker 1 (Leader for P0) receives message
    │
    ├─► Append to local log
    │
    ├─► Send to Broker 2 (Follower)
    │      Broker 2 appends to its log, sends ACK
    │
    ├─► Send to Broker 3 (Follower)
    │      Broker 3 appends, sends ACK
    │
    └─► With acks=all: once every in-sync replica has the record
        (and at least min.insync.replicas = 2 replicas are in sync),
        send ACK to Producer
```

**Consumer Group Coordination**:
```
Topic "orders" has 6 partitions (P0-P5)
Consumer Group "payment-processors" has 3 consumers

Kafka assigns:
    Consumer 1: P0, P1
    Consumer 2: P2, P3
    Consumer 3: P4, P5

If Consumer 2 dies:
    Rebalance: Consumer 1 gets P0,P1,P2; Consumer 3 gets P3,P4,P5

Guarantee: Each partition consumed by exactly one consumer in group
```
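The assignment and rebalance shown above resemble a range-style split. The following is a simplified sketch of that idea, not Kafka's actual assignor implementation; `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(partitions: list[str], consumers: list[str]) -> dict[str, list[str]]:
    """Split partitions into contiguous ranges, one range per live consumer."""
    if not consumers:
        return {}
    members = sorted(consumers)
    base, extra = divmod(len(partitions), len(members))
    assignment: dict[str, list[str]] = {}
    start = 0
    for i, member in enumerate(members):
        count = base + (1 if i < extra else 0)  # first `extra` members get one more
        assignment[member] = partitions[start:start + count]
        start += count
    return assignment

partitions = [f"P{i}" for i in range(6)]
print(assign_partitions(partitions, ["c1", "c2", "c3"]))
print(assign_partitions(partitions, ["c1", "c3"]))  # rebalance after c2 dies
```

Each partition appears in exactly one consumer's list, which is the guarantee the group protocol provides.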

### **Deep Dive: Exactly-Once Semantics**

**Problem**: Consumer processes message, crashes before committing offset. On restart, message re-processed. Duplicate charge!

**Solution 1: Idempotent Consumers**
```
Database schema includes message_id:
    INSERT INTO payments (order_id, amount, kafka_msg_id)
    VALUES (123, 50.00, 'msg_abc_123')
    ON CONFLICT (kafka_msg_id) DO NOTHING;
```
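To see the deduplication in action, here is a small sketch that uses SQLite in place of the production database (an assumption made so the example is self-contained; it needs SQLite 3.24 or newer for `ON CONFLICT ... DO NOTHING`). The `handle_message` function is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        order_id     INTEGER,
        amount       TEXT,         -- stored as text to avoid float rounding
        kafka_msg_id TEXT UNIQUE   -- deduplication key
    )
""")

def handle_message(msg_id: str, order_id: int, amount: str) -> bool:
    """Record the payment once; return False if the message was seen before."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT INTO payments (order_id, amount, kafka_msg_id) VALUES (?, ?, ?) "
            "ON CONFLICT (kafka_msg_id) DO NOTHING",
            (order_id, amount, msg_id),
        )
        return cur.rowcount == 1

first = handle_message("msg_abc_123", 123, "50.00")
second = handle_message("msg_abc_123", 123, "50.00")  # redelivery after a crash
print(first, second)
```

The second call is a no-op, so a redelivered message cannot produce a duplicate charge.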

**Solution 2: Transactions (Kafka 0.11+)**:
```
producer.initTransactions()
producer.beginTransaction()
producer.send(record1)
producer.send(record2)
// offsets: Map<TopicPartition, OffsetAndMetadata> holding the next offset to consume
producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata())
producer.commitTransaction()
```

Atomic: Either all messages sent and offset committed, or none.

### **Deep Dive: Partition Strategy**

**Choosing partition key**:
```
Bad: Random partitioning
    - Even load
    - But: No ordering guarantee for related messages
    
Good: User ID as partition key
    - All messages for user 123 go to partition 5
    - Ordering preserved per user
    - But: Hot partition if one user is super active
    
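Better (for hot keys): Salted key, e.g. "user_id#bucket" with bucket = hash(event_id) % N
    - Spreads a super-user across up to N partitions
    - Lose strict per-user ordering, gain distribution
    - Note: plain user_id % num_partitions does not help; one user still maps to one partition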
```
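The trade-off between keyed partitioning and spreading a hot key can be sketched as follows. This example uses MD5 for a stable hash (Kafka's default partitioner uses murmur2 for keyed records); `NUM_PARTITIONS`, `SALT_BUCKETS`, and both function names are assumptions for illustration.

```python
import hashlib

NUM_PARTITIONS = 12
SALT_BUCKETS = 4  # hypothetical: how many keys a hot user is spread over

def _stable_hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")

def partition_for(key: str) -> int:
    """Same key always maps to the same partition, preserving per-key order."""
    return _stable_hash(key) % NUM_PARTITIONS

def salted_key(user_id: str, event_id: str) -> str:
    """Derive a bucket from the event so one user maps to several keys."""
    return f"{user_id}#{_stable_hash(event_id) % SALT_BUCKETS}"

keys = {salted_key("user_123", f"evt_{i}") for i in range(1000)}
print(sorted(keys))
print(sorted({partition_for(k) for k in keys}))
```

Ordering still holds within each salted key, so consumers that need a per-user view must merge across the buckets.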

### **System Characteristics**

| Feature | Implementation |
|---------|---------------|
| Ordering | Per-partition only |
| Retention | Time-based (7 days) or size-based |
| Replication | Leader-follower, configurable |
| Consumer offset | Stored in Kafka (__consumer_offsets) |
| Compression | LZ4, Snappy, GZIP per topic |

---

## **18.2 Design an E-commerce Platform (Amazon)**

E-commerce combines catalog management, inventory tracking, shopping carts, and payment processing into a complex distributed transaction system.

### **Step 1: Scope**

**Functional Requirements**:
1. **Product catalog**: Browse/search millions of products
2. **Shopping cart**: Add/remove items, persist across sessions
3. **Inventory**: Track stock levels, prevent overselling
4. **Checkout**: Payment, shipping calculation, order confirmation
5. **Order management**: Track status, returns, refunds

**Non-Functional Requirements**:
1. **Consistency**: Inventory and payments must be accurate (no overselling)
2. **Availability**: 99.99% uptime (downtime = lost revenue)
3. **Scale**: 100M products, 1M concurrent users, 10K orders/minute peak

**Key Challenge**: The "Overselling Problem"—ensuring two users don't buy the last item simultaneously.

### **Step 2: Sketch**

**Traffic**:
```
Catalog views: 10M/day = 115/sec
Cart operations: 1M/day = 11/sec
Checkout: 100K/day = 1.2/sec average (peak: 10K orders/min ≈ 167/sec)
```

**Storage**:
```
Products: 100M × 5KB = 500 GB
Orders: 100K/day × 1KB × 5 years = 182 GB
Inventory updates: High write volume
```

### **Step 3: Data Model**

**Product Catalog** (Elasticsearch for search + PostgreSQL for source):
```sql
CREATE TABLE products (
    product_id BIGINT PRIMARY KEY,
    name VARCHAR(255),
    description TEXT,
    price DECIMAL(10,2),
    category_id INT,
    seller_id INT,
    attributes JSONB,  -- Flexible schema: {color: "red", size: "XL"}
    created_at TIMESTAMP
);

-- Search index (Elasticsearch)
{
  "product_id": 123,
  "name": "Wireless Headphones",
  "description": "Noise cancelling...",
  "price": 299.99,
  "category": "Electronics",
  "attributes": {
    "brand": "Sony",
    "color": "black"
  }
}
```

**Inventory** (Redis for speed + PostgreSQL for audit):
```
Key: inventory:product_123
Value: 42 (current stock)

Key: inventory:product_123:reserved
Value: 5 (in carts but not purchased)

Available = 42 - 5 = 37
```

**Shopping Cart** (Redis - session based):
```
Key: cart:user_456
Value: Hash {
    "product_123": {qty: 2, added_at: "2024-01-15T10:30:00Z"},
    "product_789": {qty: 1, added_at: "2024-01-15T10:35:00Z"}
}
TTL: 30 days
```

**Orders** (PostgreSQL with ACID):
```sql
CREATE TABLE orders (
    order_id UUID PRIMARY KEY,
    user_id UUID,
    status VARCHAR(20),  -- pending, paid, shipped, delivered
    total_amount DECIMAL(10,2),
    tax_amount DECIMAL(10,2),
    shipping_amount DECIMAL(10,2),
    created_at TIMESTAMP,
    paid_at TIMESTAMP,
    
    -- Consistency check
    CONSTRAINT valid_amount CHECK (total_amount >= 0)
);

CREATE TABLE order_items (
    order_item_id UUID PRIMARY KEY,
    order_id UUID REFERENCES orders(order_id),
    product_id BIGINT,
    quantity INT,
    unit_price DECIMAL(10,2),
    
    CONSTRAINT valid_qty CHECK (quantity > 0)
);
```

### **Step 4: Scale (Architecture)**

**Checkout Flow (The Critical Path)**:
```
1. Cart Validation
   - Verify items still in stock (Redis WATCH/MULTI/EXEC)
   - Calculate totals, tax, shipping
   
2. Payment Authorization
   - Call Stripe/PayPal (external API)
   - If fails: Release inventory, abort
   
3. Order Creation (ACID Transaction)
   BEGIN TRANSACTION
     INSERT INTO orders
     INSERT INTO order_items
     UPDATE inventory (decrement)
   COMMIT
   
4. Async Processing
   - Send confirmation email
   - Notify warehouse
   - Update search index
   - Analytics
```

**Inventory Reservation Pattern** (Preventing Overselling):
```
When item added to cart:
    1. WATCH inventory:product_123 inventory:product_123:reserved
    2. GET both values; available = stock - reserved
    3. If available > 0:
         MULTI
         INCR inventory:product_123:reserved
         EXEC
    4. If EXEC returns nil (a watched key changed), retry from step 1

When checkout completes:
    MULTI
    DECR inventory:product_123            (stock leaves the warehouse)
    DECR inventory:product_123:reserved   (reservation consumed)
    EXEC

When reservation expires (30 min):
    DECR inventory:product_123:reserved   (release back to available)
```
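The reservation life cycle can be simulated without Redis. The sketch below assumes the counter layout from the data model (available = stock minus reserved) and uses a lock where Redis would use WATCH/MULTI/EXEC; the `Inventory` class is hypothetical.

```python
import threading

class Inventory:
    """In-memory stand-in for the stock and reserved counters."""

    def __init__(self, stock: int) -> None:
        self.stock = stock
        self.reserved = 0
        self._lock = threading.Lock()  # plays the role of WATCH/MULTI/EXEC

    def reserve(self) -> bool:
        with self._lock:
            if self.stock - self.reserved <= 0:
                return False
            self.reserved += 1
            return True

    def commit(self) -> None:   # checkout completed
        with self._lock:
            self.stock -= 1
            self.reserved -= 1

    def release(self) -> None:  # reservation expired
        with self._lock:
            self.reserved -= 1

inv = Inventory(stock=1)
results: list[bool] = []
threads = [threading.Thread(target=lambda: results.append(inv.reserve())) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results.count(True))  # only one shopper gets the last item
```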

**Search Architecture**:
```
Product updates → Kafka → Elasticsearch indexing
Query: Elasticsearch → Product IDs → Cache/DB lookup for details

Faceted search:
    Category: Electronics
    Brand: [Sony, Samsung]
    Price: $100-$500
    Rating: 4+ stars
```

**Recommendation Engine**:
```
Batch processing (nightly):
    Spark job on historical data
    Matrix factorization for collaborative filtering
    Store recommendations in Redis

Real-time:
    "Customers who bought X also bought Y" (item-item similarity)
    Recently viewed items
```

### **Deep Dive: Distributed Transactions**

**The Saga Pattern** (for long-running transactions):
```
Order Creation Saga:
Step 1: Create Order (local DB) ─┐
                                  ├─► If any step fails, run compensating
Step 2: Reserve Payment           │   transactions (undo previous steps)
                                  │
Step 3: Reserve Inventory         │
                                  │
Step 4: Confirm Shipment          │
                                  │
Step 5: Complete Order ◄──────────┘

Compensating transactions (run in reverse order of completed steps):
    If Step 3 fails → Release payment hold (refund), cancel order
    If Step 4 fails → Release inventory, release payment hold, cancel order
```
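A saga orchestrator can be reduced to a loop that remembers which steps finished and undoes them in reverse on failure. This is a minimal in-process sketch; `run_saga` and the step names are illustrative, and a production orchestrator would persist its progress so it can resume after a crash.

```python
from typing import Callable

# (name, action, compensation)
Step = tuple[str, Callable[[], None], Callable[[], None]]

def run_saga(steps: list[Step], log: list[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for prev_name, _, compensate in reversed(done):
                compensate()
                log.append(f"undone:{prev_name}")
            return False
        log.append(f"done:{name}")
        done.append(step)
    return True

def noop() -> None:
    pass

def out_of_stock() -> None:
    raise RuntimeError("out of stock")

log: list[str] = []
ok = run_saga(
    [("create_order", noop, noop),
     ("reserve_payment", noop, noop),
     ("reserve_inventory", out_of_stock, noop)],
    log,
)
print(ok, log)
```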

**Two-Phase Commit (2PC)** (for ACID across services):
```
Coordinator                    Participants
    │                              │
    ├─► Phase 1: PREPARE ────────►│
    │                              │
    │◄─────────── YES/NO ◄────────┤
    │                              │
    ├─► Phase 2: COMMIT/ABORT ───►│ (if all YES)
    │                              │
    │◄────────── ACK ◄─────────────┤
```

**Trade-offs**:
- 2PC: Blocking, coordinator is SPOF, slow (2 round trips)
- Saga: Eventually consistent, requires compensation logic, complex debugging

### **Deep Dive: Inventory Consistency**

**Eventual Consistency with Conflict Resolution**:
```
Two warehouses sell last item simultaneously:

Warehouse A: Sells item, decrements inventory 1 → 0
Warehouse B: Sells item, decrements inventory 1 → 0

Conflict detected during sync (both claim last item):
    Resolution: Check timestamps, cancel later order
    Or: Accept both, mark as backorder
    
Compensation: Email customer B, offer discount
```

**Optimistic Locking**:
```
SELECT inventory, version FROM products WHERE id=123;
-- inventory=5, version=10

UPDATE products 
SET inventory=4, version=11 
WHERE id=123 AND version=10;

If rows affected = 0:
    Someone else updated it (version is 11 now)
    Retry with fresh read
```
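The read, compare-and-update, retry loop can be exercised end to end with SQLite standing in for the production database. The table here is trimmed to the three columns the example needs, and `decrement_stock` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, inventory INTEGER, version INTEGER)")
conn.execute("INSERT INTO products VALUES (123, 5, 10)")
conn.commit()

def decrement_stock(product_id: int, max_retries: int = 3) -> bool:
    """Decrement inventory only if the row version is unchanged since the read."""
    for _ in range(max_retries):
        row = conn.execute(
            "SELECT inventory, version FROM products WHERE id = ?", (product_id,)
        ).fetchone()
        if row is None or row[0] <= 0:
            return False
        inventory, version = row
        cur = conn.execute(
            "UPDATE products SET inventory = ?, version = ? WHERE id = ? AND version = ?",
            (inventory - 1, version + 1, product_id, version),
        )
        conn.commit()
        if cur.rowcount == 1:
            return True  # our version check won
        # rowcount == 0: another writer bumped the version, so re-read and retry
    return False

updated = decrement_stock(123)
print(updated, conn.execute("SELECT inventory, version FROM products WHERE id = 123").fetchone())
```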

### **System Characteristics**

| Component | Pattern |
|-----------|---------|
| Product catalog | CQRS (Command: PostgreSQL, Query: Elasticsearch) |
| Inventory | Redis (speed) + PostgreSQL (audit) |
| Cart | Redis (session) |
| Orders | PostgreSQL (ACID) + Event sourcing |
| Payments | External PCI-compliant gateway |
| Recommendations | Spark batch + Redis cache |

---

## **18.3 Design a Payment Processing System (Stripe-style)**

Payment systems handle the most sensitive data with the strictest consistency requirements. A lost message is acceptable; a lost dollar is not.

### **Step 1: Scope**

**Functional Requirements**:
1. **Payment Methods**: Store credit cards securely (tokenization)
2. **Charges**: Process payments, handle currencies
3. **Refunds**: Reverse charges partially or fully
4. **Disputes**: Handle chargebacks
5. **Payouts**: Transfer funds to merchant accounts
6. **Webhooks**: Notify merchants of events

**Non-Functional Requirements**:
1. **Security**: PCI DSS Level 1 compliance (no raw card data in application)
2. **Idempotency**: Same charge request twice → one charge only
3. **Consistency**: ACID transactions. No partial payments.
4. **Auditability**: Immutable log of every financial event
5. **Availability**: 99.99% uptime (money never sleeps)

**Key Challenge**: Double-spending prevention and idempotency across distributed systems.

### **Step 2: Sketch**

**Scale**:
```
10 million transactions/day = 115/sec average, 1000/sec peak
Average transaction size: $50
Daily volume: $500M

Storage:
Transaction records: 10M × 500 bytes = 5 GB/day
Audit logs: Immutable, compressed = 1 GB/day
Retention: 7 years (regulatory)
```

### **Step 3: Data Model**

**Idempotency Keys** (Critical for exactly-once):
```sql
CREATE TABLE idempotency_keys (
    key VARCHAR(255) PRIMARY KEY,
    request_hash VARCHAR(64),  -- Hash of request body
    response_body JSONB,       -- Cached response
    created_at TIMESTAMP,
    expires_at TIMESTAMP       -- TTL 24 hours
);

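-- Safety net: a unique index on the charges table (assumed to carry an
-- idempotency_key column) rejects a second charge even if the cache is bypassed
CREATE UNIQUE INDEX idx_idempotency ON charges(idempotency_key);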
```

**Ledger Pattern** (Double-entry bookkeeping):
```sql
-- Every transaction creates two ledger entries (debit/credit)
CREATE TABLE ledger_entries (
    entry_id UUID PRIMARY KEY,
    transaction_id UUID,
    account_id UUID,           -- Customer, Merchant, Platform
    entry_type VARCHAR(10),    -- DEBIT or CREDIT
    amount DECIMAL(19,4),
    currency VARCHAR(3),
    status VARCHAR(20),        -- PENDING, POSTED, FAILED
    created_at TIMESTAMP,
    
    CONSTRAINT valid_amount CHECK (amount >= 0)
);

-- Accounts table tracks balances (materialized view of ledger)
CREATE TABLE accounts (
    account_id UUID PRIMARY KEY,
    balance DECIMAL(19,4),
    currency VARCHAR(3),
    version INT,  -- Optimistic locking
    last_updated TIMESTAMP
);
```
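The core invariant of double-entry bookkeeping is that debits and credits cancel out within every transaction. The sketch below models ledger entries as dictionaries with the same field names as the table; the helper functions and the sign convention (credits increase a balance) are assumptions for illustration.

```python
from collections import defaultdict
from decimal import Decimal

def post_transfer(ledger, txn_id, from_account, to_account, amount, currency="USD"):
    """Append one DEBIT and one CREDIT entry of equal amount."""
    amount = Decimal(amount)
    if amount <= 0:
        raise ValueError("amount must be positive")
    for account, entry_type in ((from_account, "DEBIT"), (to_account, "CREDIT")):
        ledger.append({"transaction_id": txn_id, "account_id": account,
                       "entry_type": entry_type, "amount": amount, "currency": currency})

def is_balanced(ledger) -> bool:
    """True if debits equal credits within every transaction."""
    net = defaultdict(Decimal)
    for e in ledger:
        sign = 1 if e["entry_type"] == "CREDIT" else -1
        net[e["transaction_id"]] += sign * e["amount"]
    return all(total == 0 for total in net.values())

def balance(ledger, account_id) -> Decimal:
    """Credits minus debits for one account."""
    return sum((e["amount"] if e["entry_type"] == "CREDIT" else -e["amount"]
                for e in ledger if e["account_id"] == account_id), Decimal("0"))

ledger: list[dict] = []
post_transfer(ledger, "txn_1", "customer", "merchant", "50.00")
post_transfer(ledger, "txn_2", "merchant", "platform", "1.75")  # platform fee
print(is_balanced(ledger), balance(ledger, "merchant"))
```

Because balances are derived from the entries, the `accounts` table can always be rebuilt from the ledger if it drifts.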

**State Machine for Payments**:
```
CREATED → PENDING → PROCESSING → CAPTURED → SETTLED
   │          │           │           │          │
   └──────────┴───────────┘           └────┬─────┘
              ↓                            ↓
       FAILED (terminal)        REFUNDED (full or partial)
```
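One way to enforce a payment state machine is a transition table checked on every status change. The table below is one reading of the diagram, assuming that failures can only occur before capture and that refunds apply only to captured or settled payments; `ALLOWED` and `transition` are hypothetical names.

```python
ALLOWED: dict[str, set[str]] = {
    "CREATED": {"PENDING", "FAILED"},
    "PENDING": {"PROCESSING", "FAILED"},
    "PROCESSING": {"CAPTURED", "FAILED"},
    "CAPTURED": {"SETTLED", "REFUNDED"},
    "SETTLED": {"REFUNDED"},
    "FAILED": set(),    # terminal
    "REFUNDED": set(),  # terminal
}

def transition(current: str, new: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new

state = "CREATED"
for nxt in ("PENDING", "PROCESSING", "CAPTURED", "REFUNDED"):
    state = transition(state, nxt)
print(state)
```

Rejecting illegal transitions in one place keeps a retried or out-of-order webhook from moving a payment backwards.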

### **Step 4: Scale (Architecture)**

**PCI Compliance Architecture** (Security First):
```
┌─────────────┐
│   Client    │ (Mobile/Web - Non-PCI)
└──────┬──────┘
       │
       ▼
┌──────────────┐
│   API        │ (Your servers - Non-PCI)
│   Gateway    │ (Never see raw card data)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│   Token      │ (PCI Level 1 Service)
│   Service    │ (Stripe, Braintree, Adyen)
│   (iFrame/   │
│   SDK)       │
└──────────────┘
       │
       ▼
┌──────────────┐
│   Card       │ (Bank Networks - Visa/MC)
│   Networks   │
└──────────────┘
```

**Payment Processing Flow**:
```
1. Client requests payment intent from API
2. API creates pending transaction in DB (status: PENDING)
3. Client tokenizes card with Stripe (PCI handled by Stripe)
4. Client sends token + amount to API
5. API calls Stripe API with idempotency key
6. Stripe processes with bank, returns success/failure
7. API updates DB (status: CAPTURED or FAILED)
8. API publishes event to Kafka (order.completed)
9. Webhook service notifies merchant
```

**Idempotency Implementation**:
```
Client generates UUID: "pay_req_abc_123"
Sends in header: Idempotency-Key: pay_req_abc_123

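Server:
1. Atomically claim the key in Redis:
   SET "idempotency:pay_req_abc_123" "in_progress" NX EX 86400
   - If SET succeeds: This is the first request; process payment
   - If key exists with a response: Return cached response
   - If key exists as "in_progress": Return 409 Conflict (or wait and retry)

2. After processing:
   SET "idempotency:pay_req_abc_123" response_json EX 86400

3. Database unique constraint as safety net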
```
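The essential property is that checking for a key and recording its result cannot be interleaved with another request carrying the same key. This in-memory sketch uses a coarse lock to get that property; `IdempotencyStore` is a hypothetical class, and a real deployment would rely on an atomic store operation instead of a process-wide lock.

```python
import threading
from typing import Callable

class IdempotencyStore:
    """Runs a handler at most once per key and replays its response afterwards."""

    def __init__(self) -> None:
        self._responses: dict[str, dict] = {}
        self._lock = threading.Lock()

    def execute(self, key: str, handler: Callable[[], dict]) -> dict:
        with self._lock:  # check and record under one lock: no double processing
            if key in self._responses:
                return self._responses[key]
            response = handler()
            self._responses[key] = response
            return response

charges: list[int] = []

def charge_card() -> dict:
    charges.append(5000)  # side effect we must not repeat
    return {"status": "succeeded", "amount": 5000}

store = IdempotencyStore()
r1 = store.execute("pay_req_abc_123", charge_card)
r2 = store.execute("pay_req_abc_123", charge_card)  # client retry
print(r1 == r2, len(charges))
```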

**Saga Pattern for Distributed Payments**:
```
Order Creation Saga:
Step 1: Reserve Inventory
    ├─► Success: Continue
    └─► Failure: Abort (Compensate: nothing)
    
Step 2: Process Payment
    ├─► Success: Continue
    └─► Failure: Compensate (Release inventory)
    
Step 3: Create Shipment
    ├─► Success: Complete
    └─► Failure: Compensate (Refund payment, release inventory)
```

### **Deep Dive: Fraud Detection**

**Real-time Scoring**:
```
Transaction features:
    - Amount ($5000 is suspicious if average is $50)
    - Velocity (5 transactions in 1 minute)
    - Location (IP geolocation vs. shipping address)
    - Device fingerprint (new device?)
    - Merchant category (high-risk categories)
    
ML Model (Random Forest or Gradient Boosting):
    Input: Feature vector
    Output: Risk score 0-100
    
Rules:
    Score > 80: Block immediately
    50 < Score ≤ 80: 3D Secure challenge (additional auth)
    Score ≤ 50: Approve
```
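The threshold rules translate directly into a small decision function. This sketch assumes a score of exactly 50 is approved; the function name and the returned labels are illustrative.

```python
def decide(score: float) -> str:
    """Map a 0-100 risk score to an action using the thresholds above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score > 80:
        return "BLOCK"
    if score > 50:
        return "CHALLENGE_3DS"
    return "APPROVE"

for s in (12, 50, 65, 95):
    print(s, decide(s))
```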

### **System Characteristics**

| Component | Technology |
|-----------|-----------|
| Payment Gateway | Stripe/Adyen (PCI compliance) |
| Transaction DB | PostgreSQL (ACID) |
| Idempotency cache | Redis |
| Event streaming | Kafka |
| Fraud detection | Spark Streaming + ML |
| Ledger | Append-only `ledger_entries` table (PostgreSQL); optional archive to an immutable store such as Cassandra |

---

## **18.4 Design a Multiplayer Game Backend**

Real-time multiplayer games require stateful servers, low-latency networking, and authoritative game logic to prevent cheating.

### **Step 1: Scope**

**Functional Requirements**:
1. **Real-time gameplay**: < 50ms latency for player actions
2. **Matchmaking**: Pair players of similar skill level
3. **Game state**: Authoritative server prevents client cheating
4. **Persistence**: Save progress, achievements, leaderboards
5. **Social**: Friends lists, guilds, chat

**Non-Functional Requirements**:
1. **Latency**: < 50ms for competitive games (FPS, MOBA)
2. **Throughput**: 100,000 concurrent matches
3. **Availability**: 99.9% (scheduled maintenance acceptable)
4. **Consistency**: Authoritative server is source of truth

### **Step 2: Architecture**

**Game Server Architecture**:
```
┌─────────────┐
│   Client    │ (Unity, Unreal, Mobile)
│   (Visual   │
│   Only)     │
└──────┬──────┘
       │ UDP (game state) + TCP (reliable events)
       ▼
┌──────────────┐
│   Game       │ (Authoritative)
│   Server     │ - Validates all actions
│   (Stateful) │ - Runs physics simulation
└──────┬───────┘       - Broadcasts state to clients
       │
       ▼
┌──────────────┐
│   Game State │
│   Database   │ (Redis for active, PostgreSQL for persistence)
└──────────────┘
```

**Matchmaking Service**:
```
Players enter queue with:
    - Skill rating (ELO/Glicko)
    - Latency preference (region)
    - Game mode preference
    
Algorithm:
    1. Sort by wait time (longest waiting first)
    2. For each player, find others within:
        - Skill range: ±200 rating (expands over time)
        - Latency: < 100ms to same game server
    3. Form match when 10 players found (5v5)
    4. Spin up game server (or use pre-warmed pool)
    5. Notify clients of server IP/port
```
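Steps 1 to 3 of the algorithm can be sketched as a single pass over the queue. In this simplified version the region stands in for the latency check, and the window growth rate of 10 rating points per second is an assumed value; `QueuedPlayer`, `skill_window`, and `form_match` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueuedPlayer:
    player_id: str
    rating: int
    region: str
    waited_sec: float

def skill_window(waited_sec: float, base: int = 200, growth_per_sec: int = 10) -> int:
    """Acceptable rating gap; widens the longer a player has waited."""
    return base + int(waited_sec * growth_per_sec)

def form_match(queue: list[QueuedPlayer], match_size: int = 10) -> Optional[list[QueuedPlayer]]:
    """Try to build one match around the longest-waiting players first."""
    by_wait = sorted(queue, key=lambda p: p.waited_sec, reverse=True)
    for anchor in by_wait:
        window = skill_window(anchor.waited_sec)
        candidates = [p for p in by_wait
                      if p.region == anchor.region and abs(p.rating - anchor.rating) <= window]
        if len(candidates) >= match_size:
            return candidates[:match_size]
    return None

queue = [QueuedPlayer(f"p{i}", 1500 + 10 * i, "eu-west", waited_sec=float(i)) for i in range(10)]
match = form_match(queue)
print([p.player_id for p in match])
```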

**State Synchronization**:
```
Authoritative Server (30Hz tick rate, one tick ≈ 33ms):
    Initial state: Player A at (10, 0, 20), Player B at (15, 0, 25)
    Every tick: 1. Collect inputs received since the last tick
                2. Validate inputs, update physics
                3. Broadcast state delta to all clients
    
Client-side prediction:
    - Client predicts movement locally (instant response)
    - Server corrects if wrong (rubber-banding)
    - Interpolate between server states for smoothness
```

### **Deep Dive: Anti-Cheat**

**Client-Side (Easily bypassed)**:
- Memory scanning for known cheat signatures
- Integrity checks of game files

**Server-Side (Authoritative)**:
- Validate all movement speeds (if player moves too fast = teleport hack)
- Verify line-of-sight (if player shoots through wall = wallhack)
- Statistical analysis (impossible reaction times = aimbot)

**Replay System**:
- Record all inputs server-side
- Replay to verify physics consistency
- Ban wave based on detected patterns

### **Deep Dive: Regional Distribution**

**Latency Optimization**:
```
Edge POPs (Cloudflare/AWS Global Accelerator)
    │
    ▼
Regional Game Server Clusters (us-east, eu-west, ap-south)
    │
    ├─► Matchmaking within region (< 50ms)
    │
    └─► Cross-region only if low population (expand search)
```

**State Transfer**:
If player moves from US to Europe:
- Account data replicated globally (eventually consistent)
- Game progress in central database
- Real-time state (current match) ends, new match in EU region

---

## **18.5 Design a Collaborative Document Editor (Google Docs)**

Real-time collaborative editing requires Operational Transformation (OT) or Conflict-free Replicated Data Types (CRDTs) to merge concurrent edits.

### **Step 1: Scope**

**Functional Requirements**:
1. **Real-time editing**: Multiple users edit simultaneously
2. **Cursor presence**: See others' cursors and selections
3. **Version history**: Track changes, revert to previous versions
4. **Offline editing**: Edit without connection, sync later
5. **Formatting**: Rich text, images, comments

**Non-Functional Requirements**:
1. **Latency**: < 100ms for local edits, < 500ms for remote edits
2. **Consistency**: All users see same document eventually
3. **Availability**: 99.9% uptime
4. **Scale**: 1 billion documents, 100M daily active users

### **Step 2: Architecture**

**Operational Transformation (OT)**:
```
Client A inserts "H" at position 0: "H"
Client B inserts "i" at position 0: "i"

Without transformation, replicas diverge:
    Replica A applies "H" at 0, then "i" at 0 → "iH"
    Replica B applies "i" at 0, then "H" at 0 → "Hi"

Server picks one order (by arrival order, with client ID as tie-breaker),
transforms the later op (shift "i" to position 1), and broadcasts,
so every replica converges on "Hi"
```
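For the insert-versus-insert case, the transformation is a position shift with a deterministic tie-break. The sketch below covers only concurrent inserts (deletes need additional cases); `Insert`, `apply_op`, and `transform` are illustrative names, and the tie-break on client ID is an assumption.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    client_id: str  # tie-breaker for equal positions

def apply_op(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]

def transform(op: Insert, against: Insert) -> Insert:
    """Adjust `op` so it can be applied after `against` has been applied."""
    goes_first = against.pos < op.pos or (
        against.pos == op.pos and against.client_id < op.client_id
    )
    if goes_first:
        return replace(op, pos=op.pos + len(against.text))
    return op

a = Insert(0, "H", "A")
b = Insert(0, "i", "B")

replica_a = apply_op(apply_op("", a), transform(b, a))  # A's op, then B transformed
replica_b = apply_op(apply_op("", b), transform(a, b))  # B's op, then A transformed
print(replica_a, replica_b)
```

Both replicas apply the operations in a different order yet end with the same text, which is the convergence property OT is designed to provide.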

**CRDTs (Alternative)**:
```
Each character has unique ID (position is relative)
Insertions don't conflict because each char has unique ID
Merging is commutative and associative
No central server needed for transformation

Example:
    Client A: Insert "H" with ID 1
    Client B: Insert "i" with ID 2
    Merge: Both exist, order by ID or position vector
```
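A toy sequence CRDT makes the merge properties concrete. This sketch gives each character an ID of (fractional position, site) and merges by set union; it is a simplified illustration rather than a production algorithm, and the `SequenceCRDT` class is hypothetical. Callers are assumed to pass the positions of the neighbouring characters so that IDs from one site stay unique.

```python
from fractions import Fraction

class SequenceCRDT:
    def __init__(self, site: str) -> None:
        self.site = site
        self.chars: dict[tuple[Fraction, str], str] = {}   # id -> character
        self.tombstones: set[tuple[Fraction, str]] = set()  # deleted ids

    def insert_between(self, left: Fraction, right: Fraction, char: str):
        """Insert at the midpoint of two neighbouring positions (0 and 1 bound the doc)."""
        char_id = ((left + right) / 2, self.site)
        self.chars[char_id] = char
        return char_id

    def delete(self, char_id) -> None:
        self.tombstones.add(char_id)  # mark as removed, never erase

    def merge(self, other: "SequenceCRDT") -> None:
        """Union of state: commutative, associative, and idempotent."""
        self.chars.update(other.chars)
        self.tombstones |= other.tombstones

    def text(self) -> str:
        return "".join(c for cid, c in sorted(self.chars.items())
                       if cid not in self.tombstones)

a, b = SequenceCRDT("A"), SequenceCRDT("B")
a.insert_between(Fraction(0), Fraction(1), "H")
b.insert_between(Fraction(0), Fraction(1), "i")
a.merge(b)
b.merge(a)
print(a.text(), b.text())
```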

**System Architecture**:
```
Client (Browser)
    │
    ├─► WebSocket (real-time collaboration)
    │
    └─► HTTP API (save, load, history)
    
WebSocket Server (Stateful)
    │
    ├─► Document Service (OT/CRDT logic)
    │
    ├─► Presence Service (who's editing)
    │
    └─► Persistence Layer
    
Document Storage:
    - Real-time: Redis (current document state)
    - History: S3 (version snapshots every 30 seconds)
    - Metadata: PostgreSQL (permissions, sharing)
```

### **Step 3: Deep Dives**

**Conflict Resolution**:
```
Two users edit same word simultaneously:
    User A: "color" → "colour" (British spelling)
    User B: "color" → "shade" (different meaning)
    
Server receives both ops:
    If using OT: Transform based on operation type
        If both replace same range: Last-write-wins or branch
    
    Better: Granular locking
        Lock at paragraph level, not document level
        User A editing paragraph 1, User B editing paragraph 2 → no conflict
```

**Offline Support**:
```
Client goes offline:
    - Queue local changes in IndexedDB
    - Continue editing (optimistic UI)
    
Client comes online:
    - Sync server state
    - Replay local changes
    - Resolve conflicts using CRDT merge
    - If unresolvable: Show conflict UI to user
```

**Presence and Cursors**:
```
WebSocket broadcast:
    {
        type: "cursor_move",
        user_id: "alice",
        document_id: "doc_123",
        position: {paragraph: 5, offset: 42},
        selection: {start: 42, end: 50},
        color: "#FF5733"
    }
    
Throttling: Send cursor updates max 10/sec (interpolate on client)
```

---

## **18.6 Design a Distributed Lock Service (ZooKeeper/etcd-style)**

Distributed systems need coordination: leader election, configuration management, and distributed locking. ZooKeeper and etcd provide these primitives.

### **Step 1: Scope**

**Requirements**:
1. **Distributed Lock**: Exclusive access to resource across processes
2. **Leader Election**: One master among many candidates
3. **Configuration**: Dynamic config shared across cluster
4. **Service Discovery**: Register/find service endpoints
5. **Barriers**: Coordinate multiple nodes (wait for N to arrive)

**Non-Functional**:
1. **Consistency**: Linearizable reads/writes (strong consistency)
2. **Availability**: Survive minority of node failures (majority quorum)
3. **Latency**: < 10ms for coordination primitives

### **Step 2: Consensus Algorithm (Raft)**

**Raft Basics**:
```
Leader Election:
    - Nodes start as Followers
    - If no heartbeat from leader, become Candidate
    - Request votes from all nodes
    - If majority votes received: Become Leader
    - If split vote: Randomized timeout, retry

Log Replication:
    - All writes go through Leader
    - Leader appends to log, replicates to Followers
    - When majority acknowledges: Commit entry
    - Leader tells Followers to commit
```
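The "majority acknowledges" rule can be expressed as a small calculation over how far each node's log has replicated. This sketch omits Raft's additional requirement that the entry belong to the leader's current term; `majority` and `commit_index` are illustrative names.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a quorum."""
    return cluster_size // 2 + 1

def commit_index(match_index: list[int]) -> int:
    """Highest log index stored on a majority of nodes.

    match_index holds the last replicated index per node, leader included.
    """
    ranked = sorted(match_index, reverse=True)
    return ranked[majority(len(match_index)) - 1]

# Five nodes: the leader and one follower are at 7, the others lag behind.
print(majority(5), commit_index([7, 7, 5, 4, 3]))  # 3 5
```

Index 5 is the highest entry present on three of the five nodes, so it is the latest entry that survives any minority failure.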

**ZooKeeper Architecture**:
```
Ensemble (cluster of 3, 5, or 7 nodes)
    │
    ├─► Leader (handles writes)
    │
    └─► Followers (handle reads, replicate writes)

Client connects to any node
    - Read: Served locally (fast, possibly stale)
    - Write: Forwarded to Leader, waits for consensus
```

### **Step 3: Primitives Implementation**

**Distributed Lock**:
```
Algorithm (ZooKeeper ephemeral sequential nodes):
    1. Create ephemeral sequential node: /locks/resource/lock-
    2. List all children of /locks/resource/
    3. If my node has lowest sequence number:
           - Acquire lock
       Else:
           - Watch the node just before mine
           - Wait for it to be deleted
    4. On unlock: Delete my node (ephemeral = auto-delete on disconnect)
```
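The ordering logic of the lock recipe can be simulated in memory, without a ZooKeeper client. The `LockQueue` class below is a hypothetical model of the children under `/locks/resource/`: it shows who holds the lock and which node each waiter would watch.

```python
from typing import Optional

class LockQueue:
    """Models sequential lock nodes: sequence number -> client id."""

    def __init__(self) -> None:
        self._next_seq = 0
        self.nodes: dict[int, str] = {}

    def enqueue(self, client: str) -> int:
        """Create the next sequential node and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self.nodes[seq] = client
        return seq

    def holder(self) -> Optional[str]:
        """The client owning the lowest sequence number holds the lock."""
        return self.nodes[min(self.nodes)] if self.nodes else None

    def watch_target(self, seq: int) -> Optional[int]:
        """The node just before `seq`, or None if `seq` is the lowest."""
        lower = [s for s in self.nodes if s < seq]
        return max(lower) if lower else None

    def remove(self, seq: int) -> None:
        """Unlock, or session expiry removing an ephemeral node."""
        self.nodes.pop(seq, None)

q = LockQueue()
a, b, c = q.enqueue("client-a"), q.enqueue("client-b"), q.enqueue("client-c")
print(q.holder(), q.watch_target(c))
q.remove(a)  # client-a unlocks or disconnects
print(q.holder(), q.watch_target(c))
```

Because each waiter watches only its predecessor, a release wakes one client instead of the whole queue.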

**Leader Election**:
```
All candidates create ephemeral sequential nodes: /election/candidate-
Node with lowest sequence number becomes leader
Each other candidate watches the node just before its own (avoids a thundering herd)
If leader dies (its ephemeral node is deleted), the next lowest becomes leader
```

**Configuration Management**:
```
Store config in znode: /config/database_url
All clients watch this node
When config changes, all clients notified immediately
No polling needed
```

### **Step 4: Scale Considerations**

**Read Scaling**:
- Followers handle read traffic
- Linearizable read: Forward to leader (slow but consistent)
- Sequential read: Read from follower (fast, may be slightly stale)

**Write Scaling**:
- All writes through leader (bottleneck)
- Partition data across multiple ensembles
- Sharding: User A-M on cluster 1, N-Z on cluster 2

**etcd vs ZooKeeper vs Consul**:
- **ZooKeeper**: Java, mature, complex, strong ordering guarantees
- **etcd**: Go, simpler, HTTP/gRPC API, Kubernetes uses it
- **Consul**: Service discovery focus, built-in health checking

### **Deep Dive: The FLP Impossibility**

**Fischer, Lynch, Paterson (1985)**:
In an asynchronous network with even one faulty process, no deterministic consensus algorithm can guarantee termination.

**Implications**:
- We must use timeouts (making it partially synchronous)
- Randomization helps (Raft uses randomized election timeouts)
- FLP rules out guaranteed termination, not safety: Raft and Paxos never commit conflicting values, but they can stall while the network misbehaves and resume once it stabilizes

### **System Characteristics**

| Feature | Implementation |
|---------|---------------|
| Consensus | Raft (etcd) or ZAB (ZooKeeper) |
| Storage | etcd: MVCC on bbolt (B+ tree); ZooKeeper: in-memory tree + transaction log and snapshots |
| Watch | Long-polling or gRPC streaming |
| Sessions | Ephemeral nodes + heartbeats |

---

## **18.4 (continued) Multiplayer Game Backend at MMO Scale**

This section extends 18.4 to enterprise scale: MMOs and battle royales.

**MMO Specifics (Massively Multiplayer Online)**:
- **Spatial Partitioning**: World divided into zones/shards
- **Interest Management**: Only send updates for entities "near" player
- **Entity Component System**: Efficient processing of thousands of NPCs

**Battle Royale (100 players)**:
- **Deterministic Lockstep**: All clients simulate, server validates
- **State Compression**: Delta compression between frames
- **Relay Server**: P2P with NAT traversal fallback

---

## **18.5 (continued) Collaborative Editing Algorithms in Detail**

This section expands on the OT and CRDT overview in 18.5.

**Operational Transformation Details**:
```
Operation types:
    - Insert(text, position)
    - Delete(length, position)
    - Retain(count)  // Skip count characters
    
Transformation function T:
    Given op A and op B concurrent:
    A' = T(A, B)  // A transformed against B
    B' = T(B, A)  // B transformed against A
    
    Apply A then B' = Apply B then A' (convergence)
```

**CRDTs (Conflict-free Replicated Data Types)**:
```
Sequence CRDT (RGA - Replicated Growable Array):
    Each character has unique ID (author + sequence number)
    Insertions reference predecessor ID
    Tombstones for deletions (never truly delete, mark as removed)
    
Merge: Union of all characters, ordered by predecessor links (ties broken by ID); tombstoned characters stay in the structure but are hidden when rendering
```

---

## **18.7 Chapter Summary**

Enterprise systems differ from consumer apps in three key ways:

1. **Correctness over Speed**: Payment systems sacrifice latency for ACID guarantees. A slow correct answer is better than a fast wrong one.

2. **Compliance as Architecture**: PCI DSS, GDPR, and SOX aren't checkboxes—they dictate data flow, encryption, and audit trails.

3. **Failure is Normal**: Distributed systems components fail constantly. Design for graceful degradation (circuit breakers, bulkheads, retries).

**The Saga Pattern** emerged as the solution for distributed transactions, trading atomic isolation for availability and partition tolerance.

**Consensus algorithms** (Raft, Paxos) provide the foundation for coordination, but at the cost of write throughput.

**Event sourcing** (storing events rather than state) enables audit trails, temporal queries, and system evolution—essential for financial and compliance-heavy domains.

---

**Exercises**:

1. **Payment System**: How would you handle currency conversion fluctuations between authorization (hold) and capture (charge)?

2. **Message Queue**: Design a "dead letter queue" strategy for messages that fail processing after 3 retries.

3. **Distributed Lock**: Implement a "read-write lock" using ZooKeeper primitives (multiple readers, exclusive writer).

4. **E-commerce**: Design a "flash sale" system where 100,000 users try to buy 1,000 items simultaneously without overselling.

5. **Game Backend**: How would you prevent "lag switching" (players intentionally delaying packets to gain advantage)?

6. **Document Editor**: Implement the "undo" feature in an OT-based system where other users may have edited the same text.

---

The next chapter will cover **Production Readiness**—observability, monitoring, performance optimization, and deployment strategies that separate prototype systems from production-grade infrastructure.

