# **Chapter 16: User-Facing Applications**

Now that we've covered the theoretical foundations and building blocks of distributed systems, it's time to apply this knowledge to real-world problems. This chapter walks through the design of six widely-used systems, progressing from simple to complex.

Each case study follows the **4S Framework** introduced in Chapter 10:
- **S**cope: Requirements and constraints
- **S**ketch: Back-of-the-envelope calculations
- **S**olidify: Data models and APIs
- **S**cale: Deep dives into bottlenecks and trade-offs

By the end of this chapter, you'll understand how to approach any system design interview or real-world architecture challenge.

---

## **16.1 Design a URL Shortener (TinyURL)**

Let's start with the classic URL shortener—a system that takes a long URL like `https://example.com/articles/how-to-design-systems` and converts it to `https://tinyurl.com/x7y9z2`.

### **Step 1: Scope (Requirements)**

**Functional Requirements** (Must-haves):
1. **Shorten URL**: Given a long URL, generate a unique short alias
2. **Redirect**: Given a short URL, redirect to the original long URL
3. **Custom aliases** (optional): Users can specify custom short codes
4. **Expiration** (optional): URLs expire after a set time
5. **Analytics** (optional): Track click counts

**Non-Functional Requirements**:
1. **High availability**: 99.9% uptime (short links must work)
2. **Low latency**: Redirect should happen in < 100ms
3. **Scalability**: Handle 100 million new URLs per month, 10 billion redirects per month
4. **Uniqueness**: Each short code maps to exactly one long URL; the same long URL may receive several short codes unless deduplication is enabled

**Out of Scope** (for this design):
- User authentication/management
- URL editing after creation
- Malware detection in URLs

### **Step 2: Sketch (Back-of-the-Envelope)**

**Traffic Estimates**:
```
New URL creations:
- 100 million per month
- 100M / 30 days / 86,400 seconds ≈ 40 URLs/second (average)
- Peak traffic: 10x average = 400 URLs/second

URL redirects:
- 10 billion per month
- 10B / 30 / 86,400 ≈ 3,800 redirects/second (average)
- Peak: 38,000 redirects/second

Read-to-write ratio: 100:1 (very read-heavy)
```

**Storage Estimates**:
```
Per URL record:
- short_code: 6 bytes (e.g., "x7y9z2")
- long_url: 500 bytes average
- created_at: 8 bytes (timestamp)
- expiration: 8 bytes
- user_id: 16 bytes (UUID)

Total per record: ~550 bytes

5-year storage:
- 100M URLs/month × 12 months × 5 years = 6 billion URLs
- 6B × 550 bytes = 3.3 TB

With replicas (3x) and overhead: ~10 TB
```

**Bandwidth Estimates**:
```
Incoming (writes): 400 URLs/sec × 550 bytes = 220 KB/sec
Outgoing (reads): 38,000 redirects/sec × 550 bytes = 21 MB/sec
```
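
The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the variable names are illustrative, and the inputs are the figures stated in this section.

```python
SECONDS_PER_MONTH = 30 * 86_400  # 30-day month, as in the estimates above


def per_second(monthly_count: float) -> float:
    """Average events per second for a monthly total."""
    return monthly_count / SECONDS_PER_MONTH


writes_avg = per_second(100e6)   # about 38.6/s, rounded to 40 above
reads_avg = per_second(10e9)     # about 3,858/s, rounded to 3,800 above

record_bytes = 550
five_year_records = 100e6 * 12 * 5                         # 6 billion records
raw_storage_tb = five_year_records * record_bytes / 1e12   # about 3.3 TB before replication

peak_read_mb_per_s = 38_000 * record_bytes / 1e6           # about 20.9 MB/s
```

Keeping the inputs as named values makes it easy to rerun the estimate when a requirement changes, for example a 20x peak factor instead of 10x.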

### **Step 3: Solidify (API and Data Model)**

**API Design**:

```
POST /api/v1/shorten
Request:
{
  "long_url": "https://example.com/articles/how-to-design-systems",
  "custom_alias": "system-design",  // optional
  "expiration_days": 30             // optional, default 365
}

Response:
{
  "short_url": "https://tinyurl.com/x7y9z2",
  "created_at": "2024-01-15T10:30:00Z",
  "expires_at": "2024-02-14T10:30:00Z"
}

GET /{short_code}
Response: HTTP 302 Redirect to original URL
         (or 404 if not found/expired)

GET /api/v1/stats/{short_code}
Response:
{
  "short_code": "x7y9z2",
  "long_url": "https://...",
  "click_count": 15000,
  "created_at": "..."
}
```

**Database Schema**:

**URL Table** (Primary storage):
```sql
CREATE TABLE urls (
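    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    -- UNIQUE already creates an index, so no separate index on short_code is needed.
    -- Sized to fit custom aliases such as "system-design"; nullable because the
    -- write path inserts the row first and fills in the code afterwards.
    short_code VARCHAR(32) UNIQUE,
    long_url VARCHAR(2048) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expires_at TIMESTAMP,
    click_count BIGINT DEFAULT 0,
    user_id VARCHAR(36),
    
    INDEX idx_expires_at (expires_at)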
);
```

**Why separate `id` and `short_code`?**
- `id` is for internal database operations (B-tree indexing efficiency)
- `short_code` is what users see (base62 encoded)

### **Step 4: Scale (High-Level Design)**

**Architecture Overview**:
```
┌─────────────┐      ┌──────────────┐      ┌─────────────┐
│   Client    │──────>│ Load Balancer│──────>│   Web       │
│             │      │ (Round Robin)│      │   Servers   │
└─────────────┘      └──────────────┘      └──────┬──────┘
                                                  │
                    ┌─────────────────────────────┼─────────────────────────────┐
                    │                             │                             │
                    ▼                             ▼                             ▼
            ┌──────────────┐             ┌──────────────┐              ┌──────────────┐
            │   Cache      │             │   Database   │              │   Analytics  │
            │   (Redis)    │<───────────>│   (MySQL/    │              │   (Kafka)    │
            │              │   Cache miss│   Postgres)  │              │              │
            └──────────────┘             └──────────────┘              └──────────────┘
```

**The Encoding Strategy (Critical Design Decision)**

How do we generate unique short codes?

**Option 1: Hash-based (MD5/SHA256)**
- Hash the long URL, take first 6 characters
- Problem: Collisions! Two different URLs might have same first 6 chars
- Solution: If collision, add counter or rehash
- Problem: Can't support custom aliases easily
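
Option 1 can be sketched as follows. This is an illustration under assumptions, not a recommended implementation: `hash_short_code` and the in-memory `taken` dictionary are hypothetical stand-ins for the application code and the database's unique index, and the code is taken from a hex digest (16 symbols per character) rather than base62.

```python
import hashlib


def hash_short_code(long_url: str, taken: dict, length: int = 6) -> str:
    """Derive a short code from a hash of the URL, rehashing with a counter on collision."""
    attempt = 0
    while True:
        # Appending the attempt number changes the digest on each retry.
        salted = long_url if attempt == 0 else f"{long_url}#{attempt}"
        code = hashlib.sha256(salted.encode("utf-8")).hexdigest()[:length]
        owner = taken.get(code)
        if owner is None:
            taken[code] = long_url   # code is free: claim it
            return code
        if owner == long_url:
            return code              # same URL shortened before: reuse its code
        attempt += 1                 # code belongs to a different URL: retry
```

Every collision costs an extra lookup, which is one reason the counter-based approach in Option 2 is simpler to reason about.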

**Option 2: Base62 Encoding of Auto-increment ID (Recommended)**
- Database generates auto-increment ID (1, 2, 3...)
- Convert ID to base62 (a-z, A-Z, 0-9)
- Examples:
  - ID 1 → "1"
  - ID 100 → "1C"
  - ID 1,000,000 → "4c92"

**Base62 Encoding Logic**:
```python
BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(num):
    if num == 0:
        return "0"
    
    result = []
    while num > 0:
        num, remainder = divmod(num, 62)
        result.append(BASE62[remainder])
    
    return ''.join(reversed(result))

def decode_base62(code):
    result = 0
    for char in code:
        result = result * 62 + BASE62.index(char)
    return result

# Examples
print(encode_base62(1000000))  # "4c92"
print(decode_base62("4c92"))   # 1000000
```

**Capacity of 6-character codes**:
```
62^6 = 56.8 billion unique URLs
62^7 = 3.5 trillion unique URLs

With 100M URLs/month, 6 chars lasts 568 months (~47 years)
```
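
The capacity figures follow directly from the alphabet size and code length. A small sketch (the function name is illustrative):

```python
def months_until_exhausted(code_length: int, urls_per_month: int, alphabet_size: int = 62) -> float:
    """How long a code space lasts at a constant creation rate."""
    return alphabet_size ** code_length / urls_per_month


six_char_months = months_until_exhausted(6, 100_000_000)    # about 568 months
seven_char_months = months_until_exhausted(7, 100_000_000)  # about 35,216 months
```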

**Write Path (Creating Short URL)**:
```
1. Client sends long_url to POST /shorten
2. Application checks if URL already exists (optional deduplication)
3. Database inserts new record, gets auto-increment ID
4. Application converts ID to base62 short_code
5. Update database row with short_code
6. Return short_url to client
```

**Read Path (Redirect)**:
```
1. Client requests /x7y9z2
2. Application checks Redis cache:
   - Hit: Return long_url immediately
   - Miss: Query database, store in cache, return long_url
3. Issue 302 Redirect to long_url
4. Async: Increment click count (write to Kafka, batch update DB)
```
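
The read path is a cache-aside lookup. The sketch below uses an in-memory `TTLCache` class as a stand-in for Redis and a plain dictionary as a stand-in for the database; both, along with the `resolve` function, are assumptions made for illustration.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for a cache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]   # entry has expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)


def resolve(short_code, cache, database, ttl_seconds=24 * 3600):
    """Return (status, location) for a redirect request."""
    long_url = cache.get(short_code)
    if long_url is None:
        long_url = database.get(short_code)   # cache miss: fall back to the database
        if long_url is None:
            return 404, None
        cache.set(short_code, long_url, ttl_seconds)
    return 302, long_url
```

The click-count update is left out on purpose, since it happens asynchronously and should not delay the redirect.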

**Why 302 Redirect and not 301?**
- **301 (Permanent)**: Browser caches forever. If you change the destination, browsers won't check again.
- **302 (Temporary)**: Browser checks every time. Allows updating the long URL later, and allows analytics tracking.

### **Deep Dive: Handling the Read-Heavy Traffic**

With a 100:1 read-to-write ratio, reads dominate. Here's how to handle 38,000 redirects/second:

**Caching Strategy**:
- **Cache layer**: Redis cluster with 99% hit rate
- **TTL**: 24 hours (or until expiration)
- **Cache warming**: Pre-populate cache for popular URLs
- **Cache eviction**: LRU (Least Recently Used)

**Database Optimization**:
- **Read replicas**: 3-5 replicas to distribute read load
- **Sharding**: By short_code (consistent hashing) if single DB can't handle load
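- **Covering index**: Index on (short_code, long_url) to avoid table lookups, where the engine's index key-size limit allows it (a 2,048-character URL column can exceed that limit)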
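
Sharding by `short_code` with consistent hashing can be sketched with a hash ring. The `HashRing` class, the shard names, and the choice of 100 virtual nodes per shard are assumptions for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring mapping keys to shards via virtual nodes."""

    def __init__(self, shards, vnodes=100):
        # Each shard is placed on the ring many times to even out the load.
        self._ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def shard_for(self, short_code: str) -> str:
        # Walk clockwise to the first ring point after the key's hash.
        idx = bisect.bisect_right(self._points, self._hash(short_code))
        return self._ring[idx % len(self._ring)][1]
```

The useful property is that adding a shard only moves keys onto the new shard; keys never shuffle between existing shards, which keeps resharding traffic proportional to the capacity added.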

**CDN for Static Assets**:
If serving analytics dashboards or documentation, use CloudFlare/AWS CloudFront.

### **Deep Dive: Custom Aliases**

If user wants `tinyurl.com/system-design` instead of random code:
```
1. Check if "system-design" already exists
2. If not, insert with custom short_code instead of generated one
3. Reserve specific keywords in application layer (don't allow "api", "admin", etc.)
```
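
Step 3 is typically enforced before the database is touched. A minimal validation sketch follows; the reserved list beyond "api" and "admin", the allowed character set, and the 3 to 32 character length policy are assumptions, not requirements from this design.

```python
import re

RESERVED_ALIASES = {"api", "admin"}                   # extend as new routes are added
ALIAS_PATTERN = re.compile(r"[A-Za-z0-9_-]{3,32}")    # assumed policy: URL-safe, 3 to 32 chars


def validate_alias(alias: str) -> str:
    """Return the alias if it is acceptable, otherwise raise ValueError."""
    if not ALIAS_PATTERN.fullmatch(alias):
        raise ValueError("alias must be 3-32 characters: letters, digits, '-' or '_'")
    if alias.lower() in RESERVED_ALIASES:
        raise ValueError(f"alias is reserved: {alias}")
    return alias
```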

**Collision Handling**:
```python
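class AliasTakenError(Exception):
    """Raised when a requested custom alias is already in use."""


def create_short_url(db, long_url, custom_alias=None):
    # `db` is a placeholder data-access object, not a real library API.
    if custom_alias:
        # This check is only a fast path. The UNIQUE constraint on short_code
        # is what stops two concurrent requests from claiming the same alias.
        if db.exists(short_code=custom_alias):
            raise AliasTakenError(f"Alias already taken: {custom_alias}")
        db.insert(long_url=long_url, short_code=custom_alias)
        return custom_alias

    # No alias: insert first, then derive the code from the new row's ID.
    row_id = db.insert(long_url=long_url)
    short_code = encode_base62(row_id)
    db.update(row_id, short_code=short_code)
    return short_code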
```

### **Deep Dive: Analytics at Scale**

Tracking 10 billion clicks/month:
- Don't write to database synchronously (would kill DB)
- Use message queue (Kafka/RabbitMQ)
- Consumer batch writes to analytics database (ClickHouse/Druid for OLAP)
- Store: timestamp, short_code, user_agent, referrer, geo_location

**Data retention**: Raw data 30 days, aggregated data 5 years.
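
The consumer side of this pipeline reduces many click events to a few increments. In this sketch the event dictionaries and the `click_totals` mapping are in-memory stand-ins for queue messages and the analytics store; the function names are illustrative.

```python
from collections import Counter


def aggregate_clicks(events):
    """Collapse a batch of click events into per-code counts."""
    return Counter(event["short_code"] for event in events)


def flush(batch_counts, click_totals):
    """Apply one batch as a single increment per short code."""
    for code, count in batch_counts.items():
        click_totals[code] = click_totals.get(code, 0) + count
```

A batch of thousands of clicks on one viral link becomes a single write, which is what keeps the database out of the hot path.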

### **System Characteristics Summary**

| Aspect | Approach |
|--------|----------|
| Database | Relational (MySQL/Postgres) - strong consistency needed |
| Caching | Redis for hot URLs |
| Scaling | Horizontal scaling of app servers, read replicas for DB |
| Encoding | Base62 of auto-increment ID |
| Analytics | Async via message queue |

---

## **16.2 Design Twitter/X News Feed (Fan-out Problem)**

Now we move to a significantly more complex problem: designing a social media feed system where users see posts from people they follow, in real-time.

### **Step 1: Scope (Requirements)**

**Functional Requirements**:
1. **Create tweet**: Users can post tweets (text, images, videos)
2. **News feed**: Users see tweets from people they follow, sorted by time (reverse chronological)
3. **Follow/unfollow**: Users can follow other users
4. **Timeline**: View any user's profile and their tweets

**Non-Functional Requirements**:
1. **Latency**: News feed should load in < 200ms
2. **Consistency**: Eventual consistency is acceptable (a post may take a few seconds to appear in followers' feeds)
3. **Scale**: 
   - 500 million daily active users (DAU)
   - 100 million new tweets per day
   - Average user follows 500 people, has 1000 followers
   - Some users have 50 million followers (celebrities)

**Key Challenge**: The "Fan-out Problem"—when a celebrity posts, we must deliver to millions of followers instantly.

### **Step 2: Sketch (Back-of-the-Envelope)**

**Traffic**:
```
Tweets: 100M/day = 1,200 tweets/second (average), 12,000/sec (peak)
Timeline reads: 500M DAU × 10 feeds/day = 5B reads/day = 58,000/sec (peak 580,000/sec)
```

**Storage**:
```
Tweet metadata: 100 bytes
Media: Average 200KB per tweet (5% of tweets have media)
Daily storage: 
  - Metadata: 100M × 100 bytes = 10 GB
  - Media: 5M × 200KB = 1 TB
  
5-year storage: 1.8 PB (media) + 18 TB (metadata)
```

**Fan-out Analysis**:
```
Average user: 1,000 followers
- Posting tweet: Write to 1,000 timelines = 1,000 writes

Celebrity user: 50 million followers
- Posting tweet: Write to 50M timelines = 50 million writes!
- If 100 celebrities post simultaneously: 5 billion writes
```

### **Step 3: Solidify (Data Model)**

**Tweet Table**:
```sql
CREATE TABLE tweets (
    tweet_id BIGINT PRIMARY KEY,
    user_id BIGINT NOT NULL,
    content VARCHAR(280),
    media_urls JSON,
    created_at TIMESTAMP,
    likes_count INT DEFAULT 0,
    retweets_count INT DEFAULT 0
);
```

**User Table**:
```sql
CREATE TABLE users (
    user_id BIGINT PRIMARY KEY,
    username VARCHAR(50) UNIQUE,
    email VARCHAR(255),
    follower_count BIGINT DEFAULT 0,
    following_count BIGINT DEFAULT 0
);
```

**Follow Relationship** (Many-to-Many):
```sql
CREATE TABLE follows (
    follower_id BIGINT,  -- who is following
    following_id BIGINT, -- who is being followed
    created_at TIMESTAMP,
    PRIMARY KEY (follower_id, following_id)
);
```

**News Feed Cache** (Redis):
```
Key: feed:user_id:{user_id}
Value: Sorted Set (zset) of tweet_ids, scored by timestamp
TTL: 7 days (older feeds fetched from database)
```

### **Step 4: Scale (Architecture)**

**Two Approaches to News Feed Generation**:

#### **Approach 1: Fan-out on Write (Push Model)**

When user posts a tweet, immediately push to all followers' feeds.

```
User posts tweet
    │
    ▼
┌──────────────┐
│   Tweet      │
│   Service    │
└──────┬───────┘
       │
       ├─► Save tweet to DB
       │
       ├─► Get follower list from User Service
       │         (User A has 1,000 followers)
       │
       └─► Write tweet_id to 1,000 Redis feeds
             (fan-out)
```

**Pros**:
- Read is O(1): Just fetch pre-computed feed from Redis
- Low latency for timeline reads

**Cons**:
- Celebrity problem: Writing to 50M followers takes minutes/hours
- Waste of resources: Inactive users still get feeds updated

#### **Approach 2: Fan-out on Read (Pull Model)**

Don't pre-compute feeds. When user opens app, fetch tweets from everyone they follow and merge them.

```
User requests timeline
    │
    ▼
┌──────────────┐
│   Timeline   │
│   Service    │
└──────┬───────┘
       │
       ├─► Get list of 500 people user follows
       │
       ├─► Fetch recent tweets from each (parallel)
       │     - Query 500 database shards
       │     - Or query 500 cache keys
       │
       └─► Merge and sort by time
            Return top 100 tweets
```

**Pros**:
- No celebrity problem
- No wasted computation for inactive users

**Cons**:
- High latency: Querying 500 sources takes time (300-500ms)
- Database overload: 500 queries per timeline request
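
The "merge and sort" step of the pull model is a k-way merge. A minimal sketch, assuming each followed account's tweets arrive as a list of `(timestamp, tweet_id)` tuples already sorted newest first (the function name and tuple layout are illustrative):

```python
import heapq
from itertools import islice


def merge_timelines(per_author_tweets, limit=100):
    """Merge per-author lists (each sorted newest first) and keep the newest `limit`."""
    # reverse=True tells heapq.merge the inputs are in descending order.
    merged = heapq.merge(*per_author_tweets, reverse=True)
    return list(islice(merged, limit))
```

Because the merge is lazy, only about `limit` items are pulled from the inputs, so the cost is dominated by fetching the 500 lists rather than by sorting them.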

#### **Hybrid Approach (The Real Solution)**

Twitter uses a hybrid: **Fan-out on Write for normal users, Fan-out on Read for celebrities**.

```
Define threshold: If user has > 1 million followers, they are "celebrity"

For Normal User (1,000 followers):
    - Push to all followers' Redis feeds immediately
    - Read timeline: O(1) fetch from Redis

For Celebrity (50M followers):
    - Don't push to followers
    - When follower requests timeline:
        1. Fetch their normal feed from Redis (people they follow)
        2. Fetch celebrity tweets separately (recent 100 tweets from each celebrity)
        3. Merge results
```
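
The hybrid rule can be exercised end to end with an in-memory model. The `FeedService` class below is a sketch under assumptions: dictionaries stand in for Redis and the follow graph, and tweets are `(timestamp, tweet_id)` tuples.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000


class FeedService:
    def __init__(self, threshold=CELEBRITY_THRESHOLD):
        self.threshold = threshold
        self.followers = defaultdict(set)   # author -> follower ids
        self.following = defaultdict(set)   # user -> author ids
        self.feeds = defaultdict(list)      # user -> pushed (timestamp, tweet_id)
        self.authored = defaultdict(list)   # author -> own (timestamp, tweet_id)

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) > self.threshold

    def post(self, author, tweet_id, timestamp):
        self.authored[author].append((timestamp, tweet_id))
        if not self.is_celebrity(author):
            # Fan-out on write: push into every follower's feed.
            for follower in self.followers[author]:
                self.feeds[follower].append((timestamp, tweet_id))

    def timeline(self, user, limit=100):
        # A set removes duplicates if an author crossed the threshold after pushing.
        items = set(self.feeds[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                # Fan-out on read: pull celebrity tweets at request time.
                items.update(self.authored[author])
        return sorted(items, reverse=True)[:limit]
```

Lowering `threshold` in a test makes the celebrity branch easy to trigger with a handful of followers.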

**Architecture Diagram**:
```
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Client    │─────│ Load Balancer│─────│   API       │
└─────────────┘     └──────────────┘     └──────┬──────┘
                                                │
                    ┌─────────────────────────────┼─────────────────────────────┐
                    │                             │                             │
                    ▼                             ▼                             ▼
            ┌──────────────┐            ┌──────────────┐              ┌──────────────┐
            │ Tweet Service│            │ Timeline     │              │ User Service │
            │ - Post tweet │            │ Service      │              │ - Followers  │
            │ - Store media│            │ - Build feed │              │ - Profiles   │
            └──────┬───────┘            └──────┬───────┘              └──────┬───────┘
                   │                           │                             │
                   ▼                           ▼                             ▼
            ┌──────────────┐            ┌──────────────┐              ┌──────────────┐
            │   Database   │            │   Redis      │              │   Database   │
            │   (Tweets)   │            │   Cluster    │              │   (Users)    │
            └──────────────┘            └──────────────┘              └──────────────┘
                   │                           ▲
                   │                           │
                   └───────────────────────────┘
                         Fan-out on Write
```

### **Deep Dive: The Timeline Generation Algorithm**

**For Normal Users**:
```python
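def get_timeline(user_id):
    # `redis` is assumed to be a connected client; `fetch_tweet_details` is a placeholder.
    # One range read of the pre-computed feed. Indices are inclusive, so 0-99 is 100 items.
    tweet_ids = redis.zrevrange(f"feed:user_id:{user_id}", 0, 99)
    tweets = fetch_tweet_details(tweet_ids)  # Batch fetch from DB
    return tweets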
```

**For Celebrity Content**:
```python
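def get_timeline_hybrid(user_id):
    # `redis` is assumed to be a connected client; `get_followed_celebrities` is a placeholder.
    # 1. Get pre-computed feed from Redis (normal users).
    #    withscores=True keeps each tweet's timestamp so step 4 can sort on it.
    normal_tweets = redis.zrevrange(f"feed:user_id:{user_id}", 0, 99, withscores=True)

    # 2. Get list of celebrities this user follows
    celebrities = get_followed_celebrities(user_id)  # Cached

    # 3. Fetch the 100 most recent tweets from each celebrity.
    #    Shown sequentially for clarity; in practice these reads would be pipelined
    #    or issued in parallel. Assumes each celebrity's tweets are also kept in a
    #    sorted set scored by timestamp.
    celeb_tweets = []
    for celeb in celebrities:
        tweets = redis.zrevrange(f"tweets:user:{celeb}", 0, 99, withscores=True)
        celeb_tweets.extend(tweets)

    # 4. Merge and sort by timestamp (newest first), then keep the top 100
    all_tweets = sorted(normal_tweets + celeb_tweets, key=lambda pair: pair[1], reverse=True)
    return all_tweets[:100]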
```

### **Deep Dive: Media Storage**

Tweets with images/videos:
- **Object Storage**: AWS S3, Google Cloud Storage
- **CDN**: CloudFront, CloudFlare for serving images
- **Processing Pipeline**: 
  - Upload → S3 → Lambda triggers → Resize to thumbnails → Store variants
  - Video transcoding: Multiple resolutions (480p, 720p, 1080p)

**URL Structure**:
```
https://media.twitter.com/{user_id}/{tweet_id}/image_1024x768.jpg
```

### **Deep Dive: Caching Strategy**

**Redis Data Structures**:
1. **User Feed**: Sorted Set (`zset`) with tweet_id as member, timestamp as score
   ```
   Key: feed:user_id:12345
   Value: [(1699123456, tweet_987), (1699123400, tweet_986), ...]
   ```

2. **Tweet Content**: Hash with tweet details
   ```
   Key: tweet:987
   Value: {content: "Hello", user_id: 123, likes: 50, ...}
   ```

3. **User Profile**: Hash with user info
   ```
   Key: user:123
   Value: {name: "Alice", followers: 1000, ...}
   ```

**Cache Warming**: Pre-populate feeds for active users during low-traffic hours.

### **System Characteristics**

| Feature | Implementation |
|---------|---------------|
| Normal tweets | Fan-out on write (push) |
| Celebrity tweets | Fan-out on read (pull) |
| Feed storage | Redis Sorted Sets |
| Media | Object storage (S3) + CDN |
| Timeline latency | < 100ms for normal users, < 200ms with celebrities |

---

## **16.3 Design a Chat Application (WhatsApp/Slack)**

Chat systems combine real-time messaging with persistent storage, presence detection, and group management.

### **Step 1: Scope**

**Functional Requirements**:
1. **One-on-one messaging**: Direct messages between users
2. **Group messaging**: Chat rooms with multiple participants
3. **Online presence**: See who's online/offline
4. **Message history**: Access to previous messages
5. **Read receipts**: Double checkmarks (delivered/read)
6. **Media sharing**: Images, files, voice messages
7. **Typing indicators**: "Alice is typing..."

**Non-Functional Requirements**:
1. **Real-time**: Messages delivered in < 500ms
2. **Ordered**: Messages appear in correct chronological order
3. **Exactly-once delivery**: No duplicate messages shown to the user (in practice, at-least-once delivery plus deduplication)
4. **Persistence**: Messages stored indefinitely (or user-defined retention)
5. **Scale**: 1 billion daily active users, 100 billion messages/day

### **Step 2: Sketch**

**Traffic**:
```
100B messages/day = 1.2M messages/second (average)
Peak: 10M messages/second

Media: 20% of messages include media (20B/day)
```

**Storage**:
```
Text message: 200 bytes
Media: 1 MB average
Daily: (80B × 200 bytes) + (20B × 1 MB) = 16 TB text + 20 PB media
Yearly: 5.8 PB text + 7.3 EB media (exabytes!)
```

**Connections**:
```
1B active users, 20% online at once = 200M concurrent connections
Each connection maintains persistent WebSocket
```

### **Step 3: Data Model**

**Messages Table** (Cassandra/ScyllaDB - wide column store):
```sql
CREATE TABLE messages (
    chat_id TEXT,           -- user1_user2 for 1:1, group_id for groups
    message_id TIMEUUID,    -- contains timestamp, unique
    sender_id TEXT,
    content TEXT,
    media_url TEXT,
    created_at TIMESTAMP,
    status TEXT,            -- sent, delivered, read
    
    PRIMARY KEY (chat_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
```
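
For 1:1 chats, the `user1_user2` key has to come out the same no matter who sends first. One way to sketch that, assuming user IDs never contain an underscore (the function name is illustrative):

```python
def direct_chat_id(user_a: str, user_b: str) -> str:
    """Order-independent chat ID for a 1:1 conversation."""
    first, second = sorted((user_a, user_b))
    return f"{first}_{second}"
```

Sorting the pair means both participants read and write the same partition.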

**Why Cassandra?**
- Optimized for high write throughput (append-only, LSM-tree storage), which suits write-heavy chat traffic
- Linear scalability
- Time-series data fits wide-column model well
- Tunable consistency (can trade consistency for availability)

**User Sessions** (Redis):
```
Key: session:{user_id}
Value: {server_id: "ws-server-42", status: "online", last_seen: 1699123456}
TTL: 5 minutes (refreshed on activity)
```

**Recent Chats** (Redis):
```
Key: recent:{user_id}
Value: Sorted Set of chat_ids with last_message_timestamp
```

### **Step 4: Scale (Architecture)**

**High-Level Design**:
```
┌─────────────┐         ┌──────────────┐         ┌─────────────┐
│   Mobile    │◄───────►│  Load Balancer│◄───────►│   WebSocket │
│   Apps      │         │  (Layer 4)    │         │   Servers   │
└─────────────┘         └──────────────┘         └──────┬──────┘
                                                        │
                    ┌───────────────────────────────────┼───────────────────────────┐
                    │                                   │                           │
                    ▼                                   ▼                           ▼
            ┌──────────────┐                  ┌──────────────┐              ┌──────────────┐
            │   Presence   │                  │   Chat       │              │   Message    │
            │   Service    │                  │   Service    │              │   History    │
            │   (Redis)    │                  │   (API)      │              │   Service    │
            └──────────────┘                  └──────┬───────┘              └──────┬───────┘
                                                     │                           │
                                                     ▼                           ▼
                                             ┌──────────────┐            ┌──────────────┐
                                             │   Kafka      │            │   Cassandra  │
                                             │   (Queue)    │            │   Cluster    │
                                             └──────────────┘            └──────────────┘
```

**WebSocket Connection Flow**:
```
1. Client authenticates via HTTP API, gets JWT token
2. Client connects to WebSocket server with token
3. WebSocket server validates token, stores mapping: user_id → connection
4. Heartbeat every 30 seconds to keep connection alive
5. On disconnect, mark user as "offline" after timeout
```

**Message Flow (User A sends to User B)**:
```
User A's Phone
     │
     ▼
WebSocket Server (A)
     │
     ├─► Save message to Cassandra (async)
     │
     ├─► Check if User B is online (Redis lookup)
     │     ├─► Online: Route to WebSocket Server (B) → Push to User B
     │     └─► Offline: Store for later delivery (Push Notification)
     │
     └─► Update recent chats for both users (Redis)
```

**Handling Message Order**:
Problem: Network latency varies. Message 2 might arrive before Message 1.

Solutions:
1. **Server-side timestamps**: Server assigns timestamp, not client
2. **Sequence numbers**: Per-chat incrementing counter
3. **Vector clocks**: For distributed systems (complex, usually overkill for chat)

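**Cassandra helps here**: `message_id` is a TimeUUID (a version 1 UUID that embeds a timestamp), so the clustering key sorts messages by time. This gives a consistent order only if IDs are generated server-side on hosts with reasonably synchronized clocks.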
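
Per-chat sequence numbers (solution 2) let the receiving side restore order even when the network delivers out of order. The `InOrderDelivery` buffer below is a sketch; the class name and the choice to start numbering at 1 are assumptions.

```python
class InOrderDelivery:
    """Releases one chat's messages strictly in sequence-number order."""

    def __init__(self, first_seq=1):
        self._next_seq = first_seq
        self._pending = {}   # seq -> message held back until its turn

    def receive(self, seq, message):
        """Accept a message and return the messages that are now deliverable."""
        if seq < self._next_seq or seq in self._pending:
            return []        # duplicate or already delivered
        self._pending[seq] = message
        ready = []
        while self._next_seq in self._pending:
            ready.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return ready
```

A gap that never fills would stall the chat, so a real client would also time out and request the missing message from the server.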

### **Deep Dive: Group Chats**

**Fan-out in Groups**:
- Group has 1,000 members
- 1 message sent = 1,000 deliveries needed

**Optimization**:
- Write once to `messages` table (chat_id = group_id)
- Write to `unread` table for each member (lightweight pointer)
- Don't fan-out to WebSockets immediately for large groups
- Instead, maintain "last read message_id" per user

**Group Membership and Unread Tables**:
```sql
CREATE TABLE group_members (
    group_id TEXT,
    user_id TEXT,
    joined_at TIMESTAMP,
    role TEXT,  -- admin, member
    
    PRIMARY KEY (group_id, user_id)
);

CREATE TABLE unread_messages (
    user_id TEXT,
    chat_id TEXT,
    count INT,
    last_message_id TIMEUUID,
    
    PRIMARY KEY (user_id, chat_id)
);
```

### **Deep Dive: Presence and Typing Indicators**

**Presence (Online/Offline)**:
- **Heartbeat approach**: Client pings every 30 seconds
- **Server updates Redis**: `SET user:123:status online EX 60`
- **Broadcast to friends**: When status changes, notify all friends via WebSocket

**Optimization**: Don't broadcast to all 1,000 friends immediately. Batch updates or only broadcast to friends currently looking at contact list.

**Typing Indicators**:
- Client sends "typing_start" event
- Server broadcasts to chat participants
- Throttle: Send at most one "typing" event every 3 seconds to prevent spam
- Auto-expire: If no "typing_stop" received in 10 seconds, clear indicator
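
The throttle and auto-expire rules fit in a small tracker. In this sketch `TypingTracker` is a hypothetical server-side helper, and the caller passes the current time in seconds so the behavior is easy to test.

```python
class TypingTracker:
    def __init__(self, min_interval=3.0, expire_after=10.0):
        self.min_interval = min_interval
        self.expire_after = expire_after
        self._last_broadcast = {}   # (chat_id, user_id) -> time of last broadcast
        self._last_seen = {}        # (chat_id, user_id) -> time of last typing event

    def on_typing(self, chat_id, user_id, now):
        """Record a typing event; return True if it should be broadcast."""
        key = (chat_id, user_id)
        self._last_seen[key] = now
        last = self._last_broadcast.get(key)
        if last is not None and now - last < self.min_interval:
            return False            # throttled: a broadcast went out recently
        self._last_broadcast[key] = now
        return True

    def on_stop(self, chat_id, user_id):
        key = (chat_id, user_id)
        self._last_seen.pop(key, None)
        self._last_broadcast.pop(key, None)

    def is_typing(self, chat_id, user_id, now):
        """False once no typing event has been seen for `expire_after` seconds."""
        seen = self._last_seen.get((chat_id, user_id))
        return seen is not None and now - seen < self.expire_after
```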

### **Deep Dive: Media Messages**

**Upload Flow**:
```
1. Client requests upload URL from Media Service
2. Media Service generates pre-signed S3 URL (valid 15 minutes)
3. Client uploads directly to S3 (bypassing our servers)
4. S3 triggers Lambda to generate thumbnail
5. Client sends message with S3 URL
6. Receivers download directly from S3/CloudFront
```
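
The idea behind step 2 is a URL that carries its own expiry and a signature over it. The sketch below is a simplified stand-in built on HMAC, not the signing scheme S3 actually uses; the function names, the query parameter names, and the host `uploads.example.com` are all hypothetical.

```python
import hashlib
import hmac
from urllib.parse import urlencode


def sign(object_key: str, expires: int, secret: bytes) -> str:
    """Signature binding an object key to an expiry time."""
    payload = f"{object_key}:{expires}".encode("utf-8")
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def make_upload_url(base_url, object_key, secret, now, ttl_seconds=15 * 60):
    """Build a time-limited upload URL (valid 15 minutes by default)."""
    expires = now + ttl_seconds
    query = urlencode({"expires": expires, "signature": sign(object_key, expires, secret)})
    return f"{base_url}/{object_key}?{query}"


def is_valid(object_key, expires, signature, secret, now):
    """Reject expired URLs and any URL whose key or expiry was altered."""
    if now > expires:
        return False
    return hmac.compare_digest(sign(object_key, expires, secret), signature)
```

Because the storage service can verify the signature on its own, the application servers never see the upload itself.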

**Benefits**:
- Our servers don't handle large file uploads (bandwidth savings)
- Upload capacity scales with S3 rather than with our own server fleet
- Resumable uploads possible

### **Deep Dive: Message Delivery Guarantees**

**Exactly-Once Delivery** (achieved as at-least-once delivery plus deduplication):
1. Client generates UUID for message before sending
2. Server stores message with UUID as idempotency key
3. If client retries (network timeout), server recognizes duplicate UUID
4. Acknowledgment: Server sends ACK with server-assigned ID
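
The four steps above can be sketched with an in-memory store. `MessageStore` and its field names are assumptions for illustration; in the real system the idempotency check would be backed by the message database.

```python
import itertools


class MessageStore:
    def __init__(self):
        self._acks_by_client_id = {}   # idempotency key -> ACK already issued
        self._ids = itertools.count(1)
        self.messages = []

    def send(self, client_message_id, chat_id, content):
        """Store a message once; a retry with the same client ID gets the original ACK."""
        existing = self._acks_by_client_id.get(client_message_id)
        if existing is not None:
            return existing            # duplicate: nothing is stored again
        server_id = next(self._ids)
        self.messages.append((server_id, chat_id, content))
        ack = {"client_message_id": client_message_id, "server_id": server_id}
        self._acks_by_client_id[client_message_id] = ack
        return ack
```

The client can then retry as often as it needs after a timeout, and the recipient still sees the message once.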

**Offline Delivery**:
- Messages stored in Cassandra with TTL (e.g., 30 days)
- When user comes online, fetch all messages where `message_id > last_seen_message_id`
- Push Notification via Firebase/APNs for critical messages

**Read Receipts**:
- Client sends "read" event with message_id
- Server updates the message `status` to `read` and records the time it was read
- Broadcasts to sender: "Your message was read"

### **System Characteristics**

| Feature | Technology |
|---------|-----------|
| Real-time transport | WebSockets |
| Message storage | Cassandra (write-heavy, time-series) |
| Recent chats/Session | Redis |
| Media storage | S3 + CloudFront |
| Presence | Redis with pub/sub |
| Delivery guarantee | Idempotency keys + at-least-once delivery |

---

## **16.4 Design a Video Streaming Service (YouTube/Netflix)**

Video streaming combines massive storage requirements with complex encoding pipelines and adaptive bitrate streaming.

### **Step 1: Scope**

**Functional Requirements**:
1. **Video upload**: Users upload videos in various formats
2. **Streaming**: Watch videos with adaptive quality (auto-adjust based on bandwidth)
3. **Search**: Find videos by title, description, tags
4. **Recommendations**: Suggested videos based on viewing history
5. **Subtitles/Captions**: Multi-language support

**Non-Functional Requirements**:
1. **Latency**: Start playback in < 2 seconds, with minimal rebuffering during playback
2. **Availability**: 99.99% uptime (people binge-watch at 2 AM)
3. **Scale**: 
   - 2 billion users
   - 500 hours of video uploaded per minute
   - 1 billion hours watched per day

**Key Challenge**: Video files are huge. A 10-minute 1080p video is ~500MB. Storage and transmission at scale is the core problem.

### **Step 2: Sketch**

**Upload Traffic**:
```
500 hours of video uploaded per minute = 30,000 minutes of video per wall-clock minute
30,000 min × 60 sec × 5 Mbps (average upload bitrate) = 9 Tb ≈ 1.1 TB per minute
1.1 TB/minute × 60 ≈ 68 TB/hour of uploaded video
```

**Streaming Traffic**:
```
1B hours/day watched
Assume average bitrate 2 Mbps (adaptive, varies)
1B hours × 3600 sec × 2 Mbps = 7.2 × 10^18 bits ≈ 900 PB/day of outbound traffic
Average ≈ 83 Tbps; peak traffic: 100+ Tbps during evening hours
```

**Storage**:
```
Raw uploads: 68 TB/hour × 24 ≈ 1.6 PB/day
Encoded variants: Each video encoded in 5 qualities (144p to 4K) ≈ 3x raw storage
Yearly raw: 1.6 PB × 365 ≈ 590 PB
Yearly encoded: ≈ 1.8 EB
```

### **Step 3: Data Model**

**Video Metadata** (SQL - requires ACID):
```sql
CREATE TABLE videos (
    video_id UUID PRIMARY KEY,
    title VARCHAR(255),
    description TEXT,
    user_id UUID,
    status VARCHAR(20),  -- processing, active, deleted
    duration INT,        -- seconds
    thumbnail_url VARCHAR(255),
    upload_date TIMESTAMP,
    view_count BIGINT DEFAULT 0,
    
    INDEX idx_upload_date (upload_date),
    INDEX idx_user (user_id)
);
```

**Video Files** (Object Storage):
```
s3://video-bucket/raw/{video_id}/original.mp4
s3://video-bucket/processed/{video_id}/144p.mp4
s3://video-bucket/processed/{video_id}/360p.mp4
s3://video-bucket/processed/{video_id}/720p.mp4
s3://video-bucket/processed/{video_id}/1080p.mp4
s3://video-bucket/processed/{video_id}/4k.mp4
```

**CDN Cache** (Edge locations):
```
edge-cdn.net/videos/{video_id}/720p/segment_001.ts
```

### **Step 4: Scale (Architecture)**

**Upload Pipeline**:
```
User uploads video
    │
    ▼
┌──────────────┐
│   Upload     │
│   Service    │──┐
└──────────────┘  │
                  ▼
          ┌──────────────┐
          │   Raw        │
          │   Storage    │
          │   (S3)       │
          └──────┬───────┘
                 │
                 ▼
          ┌──────────────┐
          │   Encoding   │────┐
          │   Queue      │    │
          │   (Kafka)    │    │
          └──────────────┘    │
                              ▼
                       ┌──────────────┐
                       │   Encoding   │
                       │   Workers    │
                       │   (FFmpeg)   │
                       └──────┬───────┘
                              │
                              ▼
                       ┌──────────────┐
                       │   Processed  │
                       │   Storage    │
                       │   (S3)       │
                       └──────────────┘
```

**Streaming Architecture**:
```
User requests video
    │
    ▼
CDN Edge Location (closest to user)
    │
    ├─► Cache Hit? Serve immediately
    │
    └─► Cache Miss? Fetch from Origin
            │
            ▼
      Origin Storage (S3)
            │
            ▼
      Transcoding on the fly (if needed)
```

**Adaptive Bitrate Streaming (ABR)**:

Instead of one video file, we break it into small chunks (2-10 seconds) at multiple bitrates.

```
Manifest file (playlist.m3u8):
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=1280000,RESOLUTION=720x480
480p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2560000,RESOLUTION=1280x720
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=7680000,RESOLUTION=1920x1080
1080p/playlist.m3u8

480p/playlist.m3u8 contains:
segment_001.ts (2 seconds)
segment_002.ts
segment_003.ts
...
```

**How ABR Works**:
```
Player starts with lowest quality (fast start)
    │
    ▼
Measures download speed:
  - If speed > current bitrate × 1.5 → switch up
  - If speed < current bitrate × 0.8 → switch down
    │
    ▼
Seamlessly switches quality between segments
User sees: "Auto (720p)" in settings
```
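
The switching rule above can be written as a small function. This is a sketch of that simple threshold rule only; production players use more elaborate heuristics such as buffer level. The function name is illustrative and the ladder values are the `BANDWIDTH` figures from the example manifest.

```python
def next_rendition(ladder_bps, current_index, measured_bps, up=1.5, down=0.8):
    """Return the index of the rendition to request for the next segment."""
    current = ladder_bps[current_index]
    if measured_bps > current * up and current_index + 1 < len(ladder_bps):
        return current_index + 1   # comfortable headroom: step up one level
    if measured_bps < current * down and current_index > 0:
        return current_index - 1   # falling behind: step down one level
    return current_index


LADDER_BPS = [1_280_000, 2_560_000, 7_680_000]   # 480p, 720p, 1080p
```

Moving one step at a time and leaving a gap between the up and down thresholds keeps the player from oscillating between two qualities.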

**Protocols**:
- **HLS (HTTP Live Streaming)**: Apple's protocol, .m3u8 manifests, .ts segments
- **DASH (Dynamic Adaptive Streaming over HTTP)**: MPEG standard, .mpd manifests
- **WebRTC**: For live streaming (lower latency)

### **Deep Dive: Video Encoding**

**Why Encode?**
- Raw video from camera: 1 Gbps bitrate (unstreamable)
- Encoded 1080p: 5 Mbps (manageable)
- Compression ratio: 200:1 using H.264/H.265/AV1

**Encoding Ladder**:
```
Resolution  Bitrate    Use case
360p        400 Kbps   Mobile, slow 3G
480p        1 Mbps     Mobile, 4G
720p        2.5 Mbps   Desktop, good WiFi
1080p       5 Mbps     Desktop, fast WiFi
4K          20 Mbps    Premium users, fiber
```

**Encoding Process**:
```
1. Download raw video from S3
2. Extract audio stream
3. Create multiple video streams:
   - Scale to each resolution
   - Compress using H.264 (or VP9, AV1 for better compression but slower)
4. Package into HLS/DASH format
5. Upload to S3
6. Invalidate CDN cache
```

**Distributed Encoding**:
One video split into segments, encoded in parallel by hundreds of servers, then stitched back together.
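
The split, encode, and stitch pattern looks like this in miniature. `encode_chunk` is a placeholder that returns a label instead of invoking a real encoder, and a thread pool stands in for a fleet of encoding workers; both are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(duration_s, chunk_s):
    """Cut [0, duration_s) into consecutive (start, end) ranges."""
    return [(start, min(start + chunk_s, duration_s)) for start in range(0, duration_s, chunk_s)]


def encode_chunk(chunk):
    start, end = chunk
    # Placeholder: a real worker would run the encoder on this time range.
    return f"encoded[{start}-{end}]"


def encode_video(duration_s, chunk_s=10, workers=4):
    chunks = split_into_chunks(duration_s, chunk_s)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so stitching is a plain concatenation.
        return list(pool.map(encode_chunk, chunks))
```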

### **Deep Dive: CDN Strategy**

**Multi-tier CDN**:
```
Tier 1 (Edge): 10,000+ locations worldwide
  - Cache popular videos (top 1%)
  - Serve 95% of traffic

Tier 2 (Regional): 100 locations
  - Cache long-tail videos
  - Fill from Origin when needed

Origin (Central): 3-5 data centers
  - All videos stored here
  - Source of truth
```

**Cache Eviction**:
- **LRU (Least Recently Used)**: Remove videos not watched recently
- **LFU (Least Frequently Used)**: Remove videos with few views
- **Pre-positioning** (the opposite of eviction): Popular videos (TikTok videos, viral YouTube content) are pushed to all edge servers ahead of demand
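
To make the LRU policy concrete, here is a minimal sketch of an edge cache built on `collections.OrderedDict`. The class name and the entry-count capacity are assumptions for illustration; a real CDN cache would bound bytes rather than entries.

```python
from collections import OrderedDict
from typing import Optional


class LRUSegmentCache:
    """Tiny LRU cache keyed by segment path; capacity counts entries."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._items:
            return None  # cache miss: the caller would fetch from the next tier
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


cache = LRUSegmentCache(capacity=2)
cache.put("720p/segment_001.ts", b"a")
cache.put("720p/segment_002.ts", b"b")
cache.get("720p/segment_001.ts")        # touch segment_001
cache.put("720p/segment_003.ts", b"c")  # evicts segment_002
print(cache.get("720p/segment_002.ts"))  # None
```

An LFU variant would track a hit counter per key and evict the smallest count instead of the oldest access.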

### **Deep Dive: Handling Seek**

When the user drags the progress bar to the middle of the video:
```
Player maps the target timestamp to a segment using the manifest it already has
    │
    ▼
Manifest lists segments in order with their durations
    │
    ▼
Request specific segment (e.g., segment_450.ts)
    │
    ▼
If CDN has it → serve
If not → fetch from origin → serve → cache for next user
```

**Keyframes**: Videos have a keyframe every 2-10 seconds, and decoding can only start at a keyframe. If the user seeks to a position between keyframes, the player jumps to the previous keyframe and decodes forward to the requested frame.
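
The seek arithmetic can be sketched as follows, assuming fixed-length segments numbered from 1 and evenly spaced keyframes. Both are simplifications, and the function names are hypothetical.

```python
import math


def segment_for_timestamp(seek_seconds: float, segment_seconds: float) -> int:
    """1-based number of the segment that contains the seek position."""
    return math.floor(seek_seconds / segment_seconds) + 1


def decode_start(seek_seconds: float, keyframe_interval: float) -> float:
    """Timestamp of the nearest keyframe at or before the seek position."""
    return math.floor(seek_seconds / keyframe_interval) * keyframe_interval


seek = 899.0  # user drags to 14:59
segment = segment_for_timestamp(seek, segment_seconds=2.0)
print(f"segment_{segment:03d}.ts")                # segment_450.ts
print(decode_start(seek, keyframe_interval=4.0))  # 896.0
```

With a 4-second keyframe interval, the player would decode from 896 s and discard frames until it reaches 899 s.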

### **System Characteristics**

| Component | Technology |
|-----------|-----------|
| Raw storage | S3 / GCS |
| Encoding | FFmpeg on EC2/Kubernetes |
| Streaming | HLS/DASH via CDN |
| Metadata | PostgreSQL / Cassandra |
| Search | Elasticsearch |
| Recommendations | ML Pipeline (TensorFlow) |

---

## **16.5 Design a Ride-Sharing Service (Uber/Lyft)**

Ride-sharing is a real-time, location-heavy system matching drivers with riders using geospatial indexing.

### **Step 1: Scope**

**Functional Requirements**:
1. **Ride request**: Passenger requests ride, sees ETA and price
2. **Driver matching**: System finds nearest available driver
3. **Real-time tracking**: See driver approaching on map
4. **Payment**: Automatic payment after ride
5. **Rating**: Rate driver and passenger

**Non-Functional Requirements**:
1. **Real-time**: Driver location updates every 5 seconds, matching in < 5 seconds
2. **Reliability**: Must work during high demand (New Year's Eve, concerts)
3. **Scale**: 100 million monthly active users, 20 million trips/day

**Key Challenge**: Geospatial queries—finding drivers within 5 miles of a location, fast.

### **Step 2: Sketch**

**Traffic**:
```
20M trips/day = 230 trips/second (average), 2,000/sec (peak)

Location updates:
- 5M drivers online
- Update every 5 seconds = 1M updates/second
- Each update: driver_id, lat, long, timestamp (50 bytes)
- 50 MB/sec of location data
```

**Storage**:
```
Trips: 20M/day × 1KB metadata = 20 GB/day
Locations: 1M updates/sec × 50 bytes × 86,400 sec = 4.3 TB/day (≈ 30 TB at 7-day retention, then archived)
```
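
The estimates above are easy to reproduce, which helps when an interviewer changes an input. The sketch below recomputes them from the stated assumptions; the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

trips_per_day = 20_000_000
drivers_online = 5_000_000
update_interval_s = 5
bytes_per_update = 50
bytes_per_trip = 1_000  # ~1 KB of trip metadata

avg_trips_per_sec = trips_per_day / SECONDS_PER_DAY
updates_per_sec = drivers_online / update_interval_s
ingest_bytes_per_sec = updates_per_sec * bytes_per_update
location_bytes_per_day = ingest_bytes_per_sec * SECONDS_PER_DAY
trip_bytes_per_day = trips_per_day * bytes_per_trip

print(f"{avg_trips_per_sec:.0f} trips/s")                      # 231 trips/s
print(f"{updates_per_sec:,.0f} updates/s")                     # 1,000,000 updates/s
print(f"{ingest_bytes_per_sec / 1e6:.0f} MB/s")                # 50 MB/s
print(f"{location_bytes_per_day / 1e12:.2f} TB/day")           # 4.32 TB/day
print(f"{trip_bytes_per_day / 1e9:.0f} GB/day")                # 20 GB/day
```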

### **Step 3: Data Model**

**Active Drivers** (Redis with Geospatial indexes):
```
Redis Geo Commands:
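GEOADD drivers -122.4194 37.7749 "driver_123"  # Add driver (longitude first, then latitude)
GEORADIUS drivers -122.4194 37.7749 5 mi       # Find drivers within 5 miles
GEOSEARCH drivers FROMLONLAT -122.4194 37.7749 BYRADIUS 5 mi  # Redis 6.2+ (GEORADIUS is deprecated)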
```

**Why Redis Geo?**
- Built-in geospatial indexing using sorted sets
- O(log n) for adding points
- O(log n + m) for radius queries (m = number of results)
- In-memory = extremely fast (< 10ms)

**Trips Table** (PostgreSQL):
```sql
CREATE TABLE trips (
    trip_id UUID PRIMARY KEY,
    rider_id UUID NOT NULL,
    driver_id UUID,
    status VARCHAR(20),  -- requested, accepted, ongoing, completed
    pickup_lat DECIMAL(10,8),
    pickup_long DECIMAL(11,8),
    dropoff_lat DECIMAL(10,8),
    dropoff_long DECIMAL(11,8),
    requested_at TIMESTAMP,
    accepted_at TIMESTAMP,
    completed_at TIMESTAMP,
    fare DECIMAL(10,2)
);
```

**Driver Locations** (Time-series database, e.g., InfluxDB):
```
Measurement: driver_location
Tags: driver_id
Fields: lat, long, speed
Timestamp: 2024-01-15T10:30:00Z
Retention: 7 days (then aggregated or deleted)
```

### **Step 4: Scale (Architecture)**

**System Components**:
```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Rider App  │      │  Driver App  │      │   Dispatch   │
│              │      │              │      │   Service    │
└──────┬───────┘      └──────┬───────┘      └──────┬───────┘
       │                     │                     │
       └─────────────────────┼─────────────────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │  WebSocket      │
                    │  Gateway        │
                    │  (Location      │
                    │   Streaming)    │
                    └────────┬────────┘
                             │
            ┌────────────────┼────────────────┐
            │                │                │
            ▼                ▼                ▼
      ┌──────────┐    ┌──────────┐    ┌──────────┐
      │  Redis   │    │  Kafka   │    │ PostgreSQL│
      │  (Geo    │    │  (Events)│    │ (Trips)   │
      │  Index)  │    │          │    │           │
      └──────────┘    └──────────┘    └──────────┘
```

**The Matching Algorithm**:
```
Rider requests trip at (lat, long)
    │
    ▼
Dispatch Service queries Redis:
    GEORADIUS drivers long lat 5 mi WITHDIST ASC   # longitude first; ASC = nearest first
    
    Returns: [
        ("driver_123", 0.5 mi),
        ("driver_456", 1.2 mi),
        ("driver_789", 2.1 mi)
    ]
    │
    ▼
Filter out drivers:
    - Currently on trip
    - Rating below threshold
    - Not accepting trips in this area
    
Select top 3 nearest drivers
    │
    ▼
Send push notification to Driver 1 (timeout: 15 seconds)
If declined/no response → Driver 2 → Driver 3
    │
    ▼
When accepted:
    - Update trip status in DB
    - Notify rider via WebSocket
    - Start location streaming
```
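
The matching flow above can be sketched in plain Python, with an in-memory list standing in for the Redis geo index. Everything here is an assumption for illustration: the `Driver` fields, the 4.0 rating threshold, and the `accepts` callback that stands in for the push notification and its 15-second timeout.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_MILES = 3958.8


@dataclass
class Driver:
    driver_id: str
    lat: float
    lon: float
    on_trip: bool = False
    rating: float = 5.0


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def candidate_drivers(drivers, lat, lon, radius_miles=5.0, min_rating=4.0, limit=3):
    """Return up to `limit` (driver_id, miles) pairs, nearest first."""
    scored = []
    for d in drivers:
        if d.on_trip or d.rating < min_rating:
            continue  # filter step from the flow above
        dist = haversine_miles(lat, lon, d.lat, d.lon)
        if dist <= radius_miles:
            scored.append((d.driver_id, dist))
    scored.sort(key=lambda pair: pair[1])
    return scored[:limit]


def dispatch(candidates, accepts):
    """Offer the trip to each candidate in order; return the first who accepts."""
    for driver_id, _ in candidates:
        if accepts(driver_id):
            return driver_id
    return None  # nobody accepted: the caller could widen the radius


fleet = [
    Driver("d1", 37.7800, -122.4194),
    Driver("d2", 37.7900, -122.4194, on_trip=True),
    Driver("d3", 37.8000, -122.4194),
    Driver("d4", 38.5000, -122.4194),  # roughly 50 miles away
]
nearby = candidate_drivers(fleet, 37.7749, -122.4194)
print([driver_id for driver_id, _ in nearby])     # ['d1', 'd3']
print(dispatch(nearby, lambda d: d == "d3"))      # d3
```

In production the radius query would come from the geo index, and the sequential offers would be driven by timers rather than a synchronous loop.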

**Surge Pricing** (Dynamic Pricing):
```
If demand > supply in area:
    multiplier = 1.5x, 2x, etc.
    
Calculation:
    requests_per_minute / available_drivers > threshold
```
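
One way to turn the demand/supply ratio into a multiplier is a capped step function. The sketch below is only an illustration: the threshold, step size, and cap are made-up parameters, and real pricing systems use far richer signals.

```python
def surge_multiplier(
    requests_per_minute: float,
    available_drivers: int,
    threshold: float = 1.0,  # hypothetical ratio at which surge starts
    step: float = 0.5,       # hypothetical multiplier increment
    cap: float = 3.0,        # hypothetical maximum multiplier
) -> float:
    """Map the demand/supply ratio to a capped multiplier in `step` increments."""
    if available_drivers <= 0:
        return cap  # no supply at all: use the maximum multiplier
    ratio = requests_per_minute / available_drivers
    if ratio <= threshold:
        return 1.0
    # The first step applies as soon as the ratio exceeds the threshold;
    # each further full `threshold` of excess adds another step.
    steps = int((ratio - threshold) / threshold) + 1
    return min(1.0 + steps * step, cap)


print(surge_multiplier(50, 100))    # 1.0
print(surge_multiplier(150, 100))   # 1.5
print(surge_multiplier(200, 100))   # 2.0
print(surge_multiplier(1000, 100))  # 3.0
```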

**Location Streaming**:
```
Driver app sends location every 5 seconds over its WebSocket connection:
    {driver_id, lat, long, timestamp}
    
Flow:
    Driver App → Load Balancer → WebSocket Gateway → Kafka → Consumers
    
Consumers:
    1. Update Redis Geo index (for matching)
    2. Update time-series DB (for analytics)
    3. Broadcast to Rider (if on trip): "Driver is at lat, long"
```

### **Deep Dive: Geospatial Indexing**

**Why not just SQL?**
```sql
-- Naive approach: Euclidean distance in degrees (5 miles ≈ 0.072° of latitude)
SELECT * FROM drivers
WHERE SQRT(POW(lat - 37.7749, 2) + POW(long - (-122.4194), 2)) < 0.072;
```
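- The distance expression cannot use an ordinary B-tree index, so every row is scanned (O(n))
- Slow with millions of drivers
- Treating degrees as planar distance is also inaccurate, because a degree of longitude shrinks away from the equator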

**Redis GeoHash**:
- Earth divided into grid cells
- Each cell encoded as hash string
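- Nearby locations usually share a hash prefix (points on opposite sides of a cell boundary may not)
- Stored in sorted set by hash value
- Radius query becomes a handful of range queries on the sorted set (the center cell plus its neighbors), followed by an exact distance filter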
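
To show why nearby points end up with a shared prefix, here is a minimal sketch of the standard base32 geohash encoding. Redis itself stores a 52-bit integer form of the hash as the sorted-set score rather than this string, so treat the code as an illustration of the idea, not of Redis internals.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a point by interleaving longitude and latitude bisection bits."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # geohash starts with a longitude bit
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(37.7749, -122.4194, 5))  # 9q8yy (San Francisco)
print(geohash_encode(37.7800, -122.4100, 5))  # 9q8yy (a few blocks away)
```

Each extra character narrows the cell, so a prefix match of length k means both points fall inside the same level-k cell.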

**Alternative: Google S2 Geometry**:
- Sphere divided into hierarchical cells
- Each cell has 64-bit ID
- Nearby cells have similar IDs
- Can use Bigtable/DynamoDB with cell ID as key

### **Deep Dive: Handling Peak Demand**

**Problem**: Concert ends, 10,000 people request rides simultaneously in 1 square mile.

**Solutions**:
1. **Queueing**: Riders enter virtual queue, matched FIFO
2. **Radius expansion**: If no drivers in 1 mile, expand to 2 miles, then 5 miles
3. **Batch matching**: Collect requests for 10 seconds, optimize matching algorithmically (minimize total distance)
4. **Surge pricing**: Reduce demand by increasing price

### **System Characteristics**

| Component | Technology |
|-----------|-----------|
| Real-time location | WebSockets |
| Geospatial index | Redis Geo |
| Matching logic | State machine in application layer |
| Event streaming | Kafka |
| Trip storage | PostgreSQL |
| ETA calculation | ML model + historical traffic data |

---

## **16.6 Design a Food Delivery App (DoorDash/UberEats)**

Food delivery combines ride-sharing's real-time location tracking with inventory management (restaurant menus) and a three-sided marketplace (customers, restaurants, drivers).

### **Step 1: Scope**

**Functional Requirements**:
1. **Restaurant browsing**: Search/filter by cuisine, location, rating
2. **Menu management**: Restaurants update availability
3. **Order placement**: Cart, payment, special instructions
4. **Driver assignment**: Match drivers to pick up food
5. **Real-time tracking**: Track order status (preparing, picked up, en route)
6. **ETA calculation**: When will food arrive?

**Non-Functional Requirements**:
1. **Consistency**: Orders must not be lost (financial transactions)
2. **Real-time**: Status updates within seconds
3. **Scale**: 50M orders/month, 500K restaurants, 5M drivers

### **Step 2: Sketch**

**Traffic**:
```
50M orders/month = 1.7M orders/day = 20 orders/sec (average), 200/sec (peak)

Menu views: 10x orders = 200/sec
Location updates: 5M drivers × every 10 sec = 500K/sec
```

**Storage**:
```
Orders: 50M/month × 2KB = 100 GB/month
Restaurant data: 500K restaurants × 10MB (menus, images) = 5 TB
```

### **Step 3: Data Model**

**Order State Machine**:
```
CREATED → PAID → CONFIRMED → PREPARING → READY_FOR_PICKUP → 
PICKED_UP → EN_ROUTE → DELIVERED → COMPLETED
```
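
A small sketch of how the application layer might enforce this chain. It models only the linear happy path listed above; cancellation and refund states are deliberately left out because the chain does not define them. `advance` and `InvalidTransition` are hypothetical names.

```python
ORDER_FLOW = [
    "CREATED", "PAID", "CONFIRMED", "PREPARING", "READY_FOR_PICKUP",
    "PICKED_UP", "EN_ROUTE", "DELIVERED", "COMPLETED",
]

# Each state may only advance to the next one in the chain.
ALLOWED = {state: {nxt} for state, nxt in zip(ORDER_FLOW, ORDER_FLOW[1:])}
ALLOWED["COMPLETED"] = set()  # terminal state


class InvalidTransition(Exception):
    """Raised when a status update would skip or reverse a step."""


def advance(current: str, target: str) -> str:
    """Return `target` if the transition is legal, otherwise raise."""
    if target not in ALLOWED.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target


status = advance("CREATED", "PAID")
print(status)  # PAID
try:
    advance(status, "PREPARING")  # skips CONFIRMED
except InvalidTransition as exc:
    print(exc)  # PAID -> PREPARING is not allowed
```

Validating transitions in one place keeps out-of-order or duplicated status events from corrupting the order record.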

**Orders Table** (ACID required - PostgreSQL):
```sql
CREATE TABLE orders (
    order_id UUID PRIMARY KEY,
    customer_id UUID NOT NULL,
    restaurant_id UUID NOT NULL,
    driver_id UUID,
    status VARCHAR(30),
    items JSONB,  -- [{item_id, quantity, price, modifications}]
    total_amount DECIMAL(10,2),
    delivery_address TEXT,
    lat DECIMAL(10,8),
    long DECIMAL(11,8),
    placed_at TIMESTAMP,
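    estimated_delivery TIMESTAMP
);

-- PostgreSQL has no inline INDEX clause; create indexes separately
CREATE INDEX idx_customer ON orders (customer_id, placed_at);
CREATE INDEX idx_restaurant ON orders (restaurant_id, status);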
```

**Inventory Management** (Redis):
```
Key: restaurant:{id}:inventory
Value: Hash { "item_123": 5, "item_456": 0 }  -- quantity available

Key: restaurant:{id}:menu_version
Value: Integer (incremented when menu changes)
```

### **Step 4: Scale (Architecture)**

**Order Flow**:
```
Customer places order
    │
    ▼
Order Service validates:
    - Restaurant open?
    - Items available? (check Redis inventory)
    - Address within delivery range?
    │
    ▼
Payment Service processes payment
    │
    ▼
Restaurant notified (tablet app/SMS)
    │
    ▼
Kitchen Display System shows order
    │
    ▼
When food ready (~15-30 min later):
    ├─► Find nearest driver (same as Uber)
    └─► Driver picks up and delivers
```

**ETA Calculation Service**:
```
Inputs:
    - Food prep time (ML model based on restaurant, time of day, current load)
    - Driver distance to restaurant (real-time GPS)
    - Traffic conditions (Google Maps API or internal data)
    - Distance restaurant to customer
    
Output: "Your order will arrive in 32-38 minutes"
```

**Inventory Management**:
```
Challenge: Item sells out while customer is browsing

Solution:
1. Customer opens app → Cache menu in Redis (TTL: 5 minutes)
2. Customer adds to cart → Reserve inventory for 10 minutes (decrement Redis)
3. If not checked out in 10 min → Release inventory (increment Redis)
4. If checked out → Permanent decrement
```
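
The reserve/release cycle can be sketched with an in-memory stand-in for the Redis counters. Assumptions: one hold per cart, a caller-supplied clock (`now`, in seconds) so the example is deterministic, and a hypothetical `ReservationBook` class.

```python
class ReservationBook:
    """Holds inventory for carts and releases it when a hold expires."""

    def __init__(self, stock: dict, hold_seconds: int = 600) -> None:
        self.stock = dict(stock)
        self.hold_seconds = hold_seconds  # 10-minute hold from the text
        self.holds = {}  # cart_id -> (item_id, quantity, expires_at)

    def expire(self, now: float) -> None:
        """Release every hold whose deadline has passed (step 3)."""
        for cart_id, (item_id, qty, expires_at) in list(self.holds.items()):
            if expires_at <= now:
                self.stock[item_id] += qty
                del self.holds[cart_id]

    def reserve(self, cart_id: str, item_id: str, qty: int, now: float) -> bool:
        """Decrement stock and record a hold (step 2)."""
        self.expire(now)
        if self.stock.get(item_id, 0) < qty:
            return False
        self.stock[item_id] -= qty
        self.holds[cart_id] = (item_id, qty, now + self.hold_seconds)
        return True

    def checkout(self, cart_id: str, now: float) -> bool:
        """Drop the hold so the decrement becomes permanent (step 4)."""
        self.expire(now)
        return self.holds.pop(cart_id, None) is not None


book = ReservationBook({"item_123": 1})
print(book.reserve("cart_a", "item_123", 1, now=0))    # True
print(book.reserve("cart_b", "item_123", 1, now=60))   # False (held by cart_a)
print(book.reserve("cart_b", "item_123", 1, now=600))  # True (cart_a expired)
print(book.checkout("cart_b", now=700))                # True
```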

**Concurrency Control**:
```
Two customers try to order last pizza simultaneously:

Customer A: Read inventory = 1
Customer B: Read inventory = 1 (same time)

Both try to decrement:
    Optimistic locking: version number check, the loser retries and sees 0
    OR
    Redis DECR (atomic): check the returned value; if it is negative,
    INCR it back and reject the order
```
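
The essential property is that the check and the decrement happen as one atomic step. The sketch below demonstrates this with a lock-protected in-process counter standing in for Redis. In Redis the same effect could be achieved with a Lua script or by inspecting the value `DECR` returns. The `Inventory` class is hypothetical.

```python
import threading


class Inventory:
    """In-process stand-in for an atomic counter store."""

    def __init__(self, stock: dict) -> None:
        self._stock = dict(stock)
        self._lock = threading.Lock()

    def try_reserve(self, item_id: str, quantity: int = 1) -> bool:
        """Check and decrement inside one critical section; never goes below zero."""
        with self._lock:
            available = self._stock.get(item_id, 0)
            if available < quantity:
                return False
            self._stock[item_id] = available - quantity
            return True

    def available(self, item_id: str) -> int:
        with self._lock:
            return self._stock.get(item_id, 0)


inventory = Inventory({"last_pizza": 1})
results = []

# Customers A and B race for the same item.
threads = [
    threading.Thread(target=lambda: results.append(inventory.try_reserve("last_pizza")))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))                    # [False, True]
print(inventory.available("last_pizza"))  # 0
```

Exactly one customer wins regardless of thread scheduling, which is the guarantee a read-then-write sequence without atomicity cannot give.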

**Driver Matching Nuances**:
Unlike ride-sharing:
- Timing: the driver should arrive just as the food is ready (too early wastes driver time, too late means cold food)
- Stack orders: Driver can pick up 2-3 orders from same restaurant
- Batch optimization: Route driver to pick up Order A, then Order B, deliver A, then B (a variant of the traveling salesman problem)

### **Deep Dive: The "Prepare Time" Problem**

**Problem**: If driver arrives too early, food isn't ready (wasted time). If too late, food gets cold.

**Solution**:
```
When order placed:
    estimated_prep_time = ML_model(restaurant, items, current_load)
    
At estimated_prep_time - 5 minutes:
    Trigger driver search
    
Matching goal (pick the driver for whom):
    driver_travel_time_to_restaurant ≈ food_prep_time_remaining
```
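
A minimal sketch of the timing rule, assuming the prep-time estimate and per-driver travel times (in minutes) are already available from other services. The function names and the one-minute buffer are hypothetical.

```python
from typing import Optional


def should_dispatch(prep_remaining_min: float, driver_travel_min: float,
                    buffer_min: float = 1.0) -> bool:
    """Dispatch once travel time plus a small buffer covers the remaining prep time."""
    return driver_travel_min + buffer_min >= prep_remaining_min


def pick_driver(prep_remaining_min: float, travel_times: dict) -> Optional[str]:
    """Choose the driver whose arrival is closest to the moment the food is ready."""
    if not travel_times:
        return None
    return min(travel_times, key=lambda d: abs(travel_times[d] - prep_remaining_min))


print(should_dispatch(prep_remaining_min=20, driver_travel_min=8))  # False
print(should_dispatch(prep_remaining_min=9, driver_travel_min=8))   # True
print(pick_driver(10, {"driver_a": 3, "driver_b": 9, "driver_c": 18}))  # driver_b
```

Here `driver_a` would wait about seven minutes at the restaurant and `driver_c` would arrive eight minutes late, so `driver_b` is the best fit.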

**ML Model Features**:
- Restaurant historical prep times
- Current number of active orders at restaurant
- Day of week/time of day
- Item complexity (salad vs. well-done steak)

### **Deep Dive: Search and Discovery**

**Restaurant Search** (Elasticsearch):
```
Query: "Italian food under $30, 4+ stars, within 5 miles"
    │
    ▼
Elasticsearch index:
    - Geo-point for location
    - Cuisine tags (array)
    - Price range (integer 1-4)
    - Average rating (float)
    - Currently open (boolean)
    - Delivery time estimate (integer minutes)
```

**Ranking Algorithm**:
```
Score = (rating_weight × rating) +
        (popularity_weight × order_count) +
        (promoted_weight × paid_promotion) -
        (distance_weight × distance)      # farther restaurants score lower
```
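
The weighted sum can be sketched as below. The weights are invented for illustration, and the distance weight is negative on the assumption that closer restaurants should rank higher. A production ranker would normalize each feature and tune the weights from data.

```python
from dataclasses import dataclass


@dataclass
class Restaurant:
    name: str
    distance_miles: float
    rating: float
    order_count: int
    promoted: bool = False


# Hypothetical weights; distance is negative so that farther means lower score.
WEIGHTS = {"distance": -0.5, "rating": 2.0, "popularity": 0.001, "promoted": 1.0}


def score(r: Restaurant) -> float:
    """Linear combination of the ranking features."""
    return (
        WEIGHTS["distance"] * r.distance_miles
        + WEIGHTS["rating"] * r.rating
        + WEIGHTS["popularity"] * r.order_count
        + WEIGHTS["promoted"] * (1.0 if r.promoted else 0.0)
    )


def rank(restaurants: list) -> list:
    """Highest score first."""
    return sorted(restaurants, key=score, reverse=True)


results = rank([
    Restaurant("far", 5.0, 4.5, 1000),
    Restaurant("near", 1.0, 4.5, 1000),
    Restaurant("promoted_far", 5.0, 4.5, 1000, promoted=True),
])
print([r.name for r in results])  # ['near', 'promoted_far', 'far']
```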

### **Deep Dive: Handling Failures**

**Restaurant cancels order** (out of ingredients):
1. Refund customer automatically
2. Send push notification with apology + coupon
3. Suggest similar restaurants

**Driver cancels after pickup**:
1. Emergency re-assignment to nearest driver
2. Because the food has already left the restaurant, the new driver meets the original driver en route for a handoff
3. Customer notified of delay

**Payment fails**:
1. Order held for 5 minutes while customer fixes payment
2. If not fixed, release inventory and cancel

### **System Characteristics**

| Component | Technology |
|-----------|-----------|
| Order management | PostgreSQL (ACID for payments) |
| Real-time location | Redis Geo + WebSockets |
| Search | Elasticsearch |
| Inventory | Redis (atomic operations) |
| ETA/ML | Python microservices |
| Notifications | Firebase/APNs + SMS gateway |

---

## **16.7 Chapter Summary**

In this chapter, we designed six production-scale systems, each with unique challenges:

1. **URL Shortener**: Taught us about encoding strategies, read-heavy architectures, and the importance of caching. The key decision was base62 encoding of auto-increment IDs.

2. **Twitter News Feed**: Introduced the fan-out problem and the hybrid push/pull model. We learned that one size doesn't fit all—treat celebrities differently from normal users.

3. **Chat Application**: Covered real-time communication via WebSockets, message ordering guarantees, and handling the "n-squared" problem of group chats.

4. **Video Streaming**: Demonstrated the complexity of media processing pipelines, adaptive bitrate streaming, and multi-tier CDN strategies for massive files.

5. **Ride-Sharing**: Focused on geospatial indexing with Redis Geo, real-time location streaming, and the dispatch matching algorithm.

6. **Food Delivery**: Combined elements of ride-sharing with inventory management and three-way marketplace coordination, emphasizing ETA prediction and state machines.

**Common Patterns Across All Systems**:
- **Caching is crucial**: Redis appears in every design for different use cases
- **Asynchronous processing**: Kafka/RabbitMQ for decoupling heavy operations
- **Database per use case**: SQL for transactions, NoSQL for scalability, Time-series for metrics
- **CDN for static assets**: Offload bandwidth and reduce latency
- **Graceful degradation**: When components fail, the system continues operating with reduced functionality

**Key Takeaway**: System design is about trade-offs. A URL shortener prioritizes read speed over write speed. Twitter prioritizes read speed for normal users but write speed for celebrities. Chat prioritizes availability over temporary consistency. There are no perfect solutions, only solutions optimized for specific constraints.

---

**Exercises**:

1. **URL Shortener**: How would you modify the design to support URL expiration with automatic cleanup of expired entries?

2. **Twitter**: Calculate the storage requirements if we kept the last 1000 tweets for each of 500 million users in Redis. Is this feasible?

3. **Chat**: Design a "message recall" feature that allows users to delete messages within 5 minutes of sending. What are the consistency challenges?

4. **Video Streaming**: How would you design a "live streaming" feature (like Twitch) differently from video-on-demand (like YouTube)?

5. **Ride-Sharing**: Design an algorithm to detect fraudulent drivers (GPS spoofing, fake trips).

6. **Food Delivery**: How would you handle a flash sale where a popular restaurant offers 50% off for one hour, causing a 100x traffic spike?

---

**Interview Tips**:
- Always clarify functional vs. non-functional requirements first
- Do the math early (QPS, storage, bandwidth) to guide your design
- Identify the "hardest problem" in the system (fan-out, geospatial, etc.) and focus your deep dive there
- Discuss trade-offs explicitly: "If we choose X, we get benefits A and B, but drawbacks C and D"
- Never propose a design you can't scale—if you suggest a single database, be ready to explain how to shard it when it fills up

