# Chapter 10: The 4S Framework for System Design Interviews

---

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Structure any system design interview conversation using the 4S Framework
- Elicit and document functional and non-functional requirements systematically
- Perform back-of-the-envelope estimations with confidence
- Design data models and APIs that scale
- Identify bottlenecks and articulate trade-offs effectively
- Draw clear, professional architecture diagrams
- Navigate a 45-60 minute system design interview from start to finish

---

## **Introduction: Why a Framework Matters**

Imagine you're an architect. Someone asks you to design a house. Do you start by drawing the bathroom tiles? Of course not. You start by understanding the family's needs, then estimate the budget, draw the floor plan, and only then dive into the details.

System design is the same. Without a framework, you might:
- Forget critical requirements
- Spend too much time on irrelevant details
- Miss the interviewer's hidden expectations
- Run out of time before reaching the solution

**The 4S Framework** gives you a roadmap. It follows the progression that experienced engineers at large technology companies tend to use when structuring a design discussion: requirements first, then sizing, then interfaces, then architecture.

---

## **The 4S Framework Overview**

```
┌─────────────────────────────────────────────────────────────────────┐
│                        THE 4S FRAMEWORK                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   1. SCOPE    →  What are we building? (Requirements)              │
│                                                                     │
│   2. SKETCH   →  How big is this? (Estimations)                    │
│                                                                     │
│   3. SOLIDIFY →  How does data flow? (Data Model + API)            │
│                                                                     │
│   4. SCALE    →  How do we make it handle millions? (Design)       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

Each 'S' builds on the previous one. You don't jump to the SCALE phase until you've defined the SCOPE, sized the system in the SKETCH, and SOLIDIFIED the data model and API.

---

## **S1: SCOPE — Understanding Requirements**

The SCOPE phase is the foundation. Without clear requirements, you'll design the wrong system.

### **What is Scope?**

Scope defines:
- What you're building
- What you're NOT building
- How well it needs to perform
- What constraints you're working under

### **The Requirement Collection Process**

Start by asking questions. A good system design interview is a conversation, not a monologue.

#### **Step 1: Clarify the Problem**

Never assume you understand the problem. Ask clarifying questions:

**Example: Design a URL Shortener**

```
❌ BAD assumption: "Okay, I'll design a URL shortener like bit.ly"

✅ GOOD approach: 
"What kind of URLs are we shortening?"
"Does the short URL expire?"
"Do we need custom short URLs?"
"Is this for public use or internal company use?"
```

#### **Step 2: Document Functional Requirements**

Functional requirements describe **what the system does**. These are features and behaviors.

**How to Document:**

1. **List Use Cases**: Who are the users and what do they do?
2. **Define Features**: What capabilities does the system have?
3. **Identify Input/Output**: What goes in and what comes out?

**Example URL Shortener — Functional Requirements:**

| User Action | Description |
|-------------|-------------|
| Create short URL | User submits a long URL, receives a short URL |
| Redirect | User visits short URL, gets redirected to original |
| Delete URL | Owner can delete their short URL |
| Custom alias | User can choose custom short URL (if available) |
| Analytics | Track number of clicks per short URL |

**Functional Requirements Summary (Write this down):**
```
Functional Requirements:
- Create and return a unique short URL for any given long URL
- When a user accesses a short URL, redirect them to the original long URL
- Short URLs should be at most 7 characters long (e.g., abc1234)
- Users should be able to optionally specify a custom alias
- Track click counts for each short URL
```
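
As a quick sanity check on the 7-character limit, the sketch below counts how many distinct codes fit in a given length. It assumes a base62 alphabet (letters and digits), which the requirements above do not state explicitly, and `code_space` is a hypothetical helper name.

```python
import string

# Assumption: codes use letters and digits only (base62).
# The requirement above fixes the length, not the alphabet.
ALPHABET = string.ascii_letters + string.digits

def code_space(length: int, alphabet_size: int = len(ALPHABET)) -> int:
    """Number of distinct codes of exactly `length` characters."""
    return alphabet_size ** length

if __name__ == "__main__":
    for n in (6, 7, 8):
        print(f"{n} chars -> {code_space(n):,} possible codes")
```

Under that assumption, seven characters give about 3.5 trillion codes, which leaves plenty of headroom for most URL shortener workloads.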

#### **Step 3: Document Non-Functional Requirements**

Non-functional requirements describe **how well** the system performs. These are quality attributes.

**The Big 5 Non-Functional Requirements:**

| Category | What It Means | Example Questions |
|----------|---------------|-------------------|
| **Scalability** | How many users can it handle? | "10K concurrent users?" "10M URLs total?" |
| **Availability** | How often must it be up? | "99.9% uptime?" "Is downtime acceptable?" |
| **Latency** | How fast must it respond? | "Redirect in < 200ms?" "Create URL in < 100ms?" |
| **Consistency** | How up-to-date must data be? | "Is eventual consistency okay?" "Must reads be latest?" |
| **Durability** | Can we lose data? | "Can we afford to lose 0.1% of URLs?" |

**Example URL Shortener — Non-Functional Requirements:**
```
Non-Functional Requirements:
Scalability:
  - Support 10 million URL creations per day
  - Support 100 million redirects per day
  - Store up to 1 billion URLs total

Availability:
  - 99.9% uptime (redirect service must be highly available)

Latency:
  - Create URL: < 100ms (p95)
  - Redirect: < 50ms (p95)
  - URL details and analytics reads: < 200ms (p95)

Consistency:
  - Strong consistency for URL creation
  - Eventual consistency for analytics (click counts)

Durability:
  - Zero data loss for stored URLs
  - Analytics can tolerate slight data loss
```
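
An availability target is easier to reason about as a downtime budget. The sketch below converts a percentage into allowed downtime per year; the function name and the 365-day year are assumptions made for illustration.

```python
def downtime_budget_hours(availability_pct: float, period_hours: float = 365 * 24) -> float:
    """Allowed downtime (in hours) over a period for a given availability target."""
    return period_hours * (1 - availability_pct / 100)

if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = downtime_budget_hours(target)
        print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.0f} min/year)")
```

At 99.9%, the budget is a little under nine hours per year, which is a useful number to quote when the interviewer asks what "highly available" means.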

#### **Step 4: Define Out of Scope**

Explicitly state what you're NOT building. This shows you're managing scope wisely.

**Example URL Shortener — Out of Scope:**
```
Out of Scope:
- User authentication (assume anonymous users)
- URL preview (no thumbnail generation)
- Social sharing features
- QR code generation
- Domain-specific shortening
- URL expiration (short URLs don't expire)
- API rate limiting (for this phase)
```

### **Pro Tip: The "Do You Want Me to Consider" Checklist**

After documenting requirements, ask:
- "Do you want me to consider authentication?"
- "Do you want me to consider caching?"
- "Do you want me to consider privacy/anonymization?"

This shows you're thinking ahead but deferring to the interviewer's priorities.

---

## **S2: SKETCH — Back-of-the-Envelope Estimation**

Now that you know WHAT you're building, estimate HOW BIG it needs to be. This is the "SKETCH" phase.

### **Why Estimation Matters**

Estimations help you:
- Choose the right technologies
- Identify early bottlenecks
- Design appropriate data structures
- Calculate resource needs (storage, bandwidth, servers)

### **What to Estimate**

You typically estimate three things:
1. **QPS (Queries Per Second)** — Traffic volume
2. **Storage** — Data size
3. **Bandwidth** — Data transfer rate

### **The Estimation Process**

#### **Step 1: Start with Daily Volume**

Begin with what's given or make reasonable assumptions.

**Example URL Shortener:**
```
Given: 10 million URL creations per day
Given: 100 million redirects per day
```

#### **Step 2: Convert to Peak Per Second**

Daily numbers don't help you size for peak. Convert to per-second and add a peak multiplier.

**The Formula:**
```
Peak QPS = (Daily Volume / 86400 seconds) × Peak Multiplier

Where:
- 86400 = 24 hours × 60 minutes × 60 seconds
- Peak Multiplier = 2 to 10 (depends on traffic pattern)
```

**Example Calculation:**

```
URL Creations:
  Daily: 10 million
  Average QPS = 10,000,000 / 86,400 ≈ 115 QPS (rounded down)
  Peak QPS = 115 × 5 = 575 QPS (peak multiplier of 5)

Redirects:
  Daily: 100 million
  Average QPS = 100,000,000 / 86,400 ≈ 1,157 QPS (rounded down)
  Peak QPS = 1,157 × 10 = 11,570 QPS (peak multiplier of 10)

Total Peak QPS = 575 (create) + 11,570 (redirect) = 12,145 QPS
```
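
The same arithmetic can be captured in a few lines, which makes it easy to try different multipliers. This is a minimal sketch; the helper names are hypothetical, and the integer division mirrors the rounding-down used in the hand calculation.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def average_qps(daily_volume: int) -> int:
    """Average requests per second, truncated as in the hand calculation."""
    return daily_volume // SECONDS_PER_DAY

def peak_qps(daily_volume: int, peak_multiplier: int) -> int:
    """Peak requests per second for a given traffic multiplier."""
    return average_qps(daily_volume) * peak_multiplier

if __name__ == "__main__":
    create_peak = peak_qps(10_000_000, peak_multiplier=5)
    redirect_peak = peak_qps(100_000_000, peak_multiplier=10)
    print(f"create: {create_peak}, redirect: {redirect_peak}, "
          f"total: {create_peak + redirect_peak}")
```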

**Pro Tip:** Peak multipliers:
- Social media/news: 5-10x (viral spikes)
- Business apps: 2-3x (work hours)
- E-commerce: 3-5x (events, sales)

#### **Step 3: Estimate Storage**

Calculate how much data you'll store over the system's lifetime.

**The Formula:**
```
Storage = (Data per item × Items stored) + (Growth margin)

For multi-year storage:
Total Storage = (Data per item × Items per year × Years)
```

**Example URL Shortener — Storage Estimation:**

**What do we store per URL?**
```
Entry: {
  short_url: 7 bytes (e.g., "abc1234")
  long_url: 2000 bytes (conservative upper bound; typical URLs are a few hundred bytes)
  creation_time: 8 bytes (Unix timestamp)
  click_count: 4 bytes (integer)
  user_id: 8 bytes (optional, for future features)
}

Total per URL ≈ 2,027 bytes ≈ 2 KB per URL
```

**Storage Calculation:**
```
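Given: 1 billion URLs total
(At 10 million creations/day this cap is reached in ~100 days,
 so confirm the retention period with the interviewer.)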

Primary Storage:
  = 1 billion URLs × 2 KB
  = 2,000,000,000 KB
  = 2,000,000 MB
  = 2,000 GB
  = 2 TB

Add 50% for overhead (indexes, metadata):
  = 2 TB × 1.5
  = 3 TB total storage
```
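
The storage estimate can be reproduced with a small helper. This is a sketch under the same assumptions as the hand calculation: decimal units (1 TB = 10^12 bytes) and a flat 50% overhead. The names are hypothetical.

```python
BYTES_PER_TB = 10 ** 12  # decimal units, matching the hand calculation

# Field sizes in bytes, copied from the per-URL entry above
URL_ENTRY_BYTES = {
    "short_url": 7,
    "long_url": 2000,
    "creation_time": 8,
    "click_count": 4,
    "user_id": 8,
}

def storage_tb(items: int, bytes_per_item: int, overhead: float = 0.5) -> float:
    """Total storage in TB, including a fractional overhead for indexes and metadata."""
    return items * bytes_per_item * (1 + overhead) / BYTES_PER_TB

if __name__ == "__main__":
    per_url = sum(URL_ENTRY_BYTES.values())
    print(f"{per_url} bytes per URL")
    print(f"{storage_tb(1_000_000_000, per_url):.2f} TB for 1 billion URLs")
```

Changing `items` or the `long_url` size shows how sensitive the total is to each assumption, which is often the more interesting discussion in an interview.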

**Analytics Storage (if tracking detailed clicks):**
```
If we store every click:
  Click entry = {
    url_id: 8 bytes
    timestamp: 8 bytes
    ip_address: 4 bytes
    user_agent: 50 bytes
  } = 70 bytes per click

100 million clicks/day × 70 bytes = 7 GB/day
7 GB/day × 365 days = 2.5 TB/year
For 5 years: = 12.5 TB
```

#### **Step 4: Estimate Bandwidth**

Bandwidth is the amount of data flowing in and out of your system per second.

**The Formula:**
```
Bandwidth (GB/s) = (Request size + Response size) × Peak QPS / 1,000,000,000
```

**Example URL Shortener — Bandwidth Estimation:**

**Create URL:**
```
Request: POST long URL (2000 bytes) + optional alias (7 bytes)
       ≈ 2 KB

Response: Short URL (7 bytes) + metadata
        ≈ 100 bytes

Bandwidth (create) = (2 KB + 100 bytes) × 575 QPS
                   = 2,100 bytes × 575
                   = 1,207,500 bytes/second
                   ≈ 1.2 MB/s
```

**Redirect:**
```
Request: GET short URL (7 bytes)
       ≈ 10 bytes of payload (real HTTP headers add a few hundred bytes; ignored here)

Response: 301 Redirect to long URL (2000 bytes)
        ≈ 2 KB

Bandwidth (redirect) = (10 bytes + 2000 bytes) × 11,570 QPS
                     = 2,010 bytes × 11,570
                     = 23,255,700 bytes/second
                     ≈ 23 MB/s
```

**Total Bandwidth:**
```
= 1.2 MB/s (create) + 23 MB/s (redirect)
= 24.2 MB/s
= 0.024 GB/s

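Per day (upper bound, assuming peak load all day):
  = 0.024 GB/s × 86,400 s
  ≈ 2,074 GB/day
  ≈ 2 TB/day bandwidth

Per day (from daily request counts):
  = 10 million × 2,100 bytes + 100 million × 2,010 bytes
  = 21 GB + 201 GB
  ≈ 222 GB/day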
```
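
The bandwidth figures follow the same pattern, so a short sketch helps keep peak rate and daily volume apart. The helper names are hypothetical; request and response sizes are the ones assumed above.

```python
def bandwidth_bytes_per_sec(request_bytes: int, response_bytes: int, qps: float) -> float:
    """Bytes per second moved in both directions at a given request rate."""
    return (request_bytes + response_bytes) * qps

def daily_transfer_gb(request_bytes: int, response_bytes: int, daily_requests: int) -> float:
    """Total data moved per day in GB (decimal), from daily request counts."""
    return (request_bytes + response_bytes) * daily_requests / 1e9

if __name__ == "__main__":
    create = bandwidth_bytes_per_sec(2000, 100, 575)        # peak create traffic
    redirect = bandwidth_bytes_per_sec(10, 2000, 11_570)    # peak redirect traffic
    print(f"peak: {(create + redirect) / 1e6:.1f} MB/s")

    daily = (daily_transfer_gb(2000, 100, 10_000_000)
             + daily_transfer_gb(10, 2000, 100_000_000))
    print(f"daily: {daily:.0f} GB")
```

Deriving daily volume from daily request counts, rather than multiplying the peak rate by 86,400, avoids assuming that peak load lasts all day.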

### **The Estimation Cheat Sheet**

```
┌──────────────────────────────────────────────────────────────────┐
│                    QUICK ESTIMATION RULES                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1 day     = 86,400 seconds                                      │
│  1 KB      = 1,024 bytes                                         │
│  1 MB      = 1,024 KB                                            │
│  1 GB      = 1,024 MB                                            │
│  1 TB      = 1,024 GB                                            │
│                                                                  │
│  Peak multiplier: 2-10x (use 5x as default)                      │
│                                                                  │
│  Average URL length: 200-500 bytes                               │
│  Short URL length: 6-8 characters                                │
│                                                                  │
│  Typical user record: 1-5 KB                                     │
│  Typical transaction: 500 bytes - 2 KB                           │
│                                                                  │
│  1 year of daily data = 365 × daily amount                       │
│  5 years of daily data ≈ 2,000 × daily amount                    │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```

### **Summary Document — Sketch Phase**

After the SKETCH phase, write down your estimations:

```
Sketch — Estimations:

Traffic (QPS):
  - URL creations: 575 QPS (peak)
  - Redirects: 11,570 QPS (peak)
  - Total: ~12,145 QPS

Storage:
  - URL storage: 3 TB (for 1 billion URLs)
  - Analytics storage: 12.5 TB (5 years of click data)
  - Total: ~15.5 TB

Bandwidth:
  - Inbound: ~1.2 MB/s
  - Outbound: ~23 MB/s
  - Total: ~24 MB/s at peak (at most ~2 TB/day if peak load were sustained)
```

---

## **S3: SOLIDIFY — Data Model and API Design**

Now that you know the scope and scale, design how data flows through the system.

### **Data Model Design**

The data model defines how you'll store and structure your data.

#### **Step 1: Identify Entities**

What "things" exist in your system?

**Example URL Shortener — Entities:**
```
Entities:
- URL
- User (optional, for future features)
- Click (for analytics)
```

#### **Step 2: Define Relationships**

How do entities relate to each other?

```
User  ────< creates >────  URL
                            │
                            │< has many >
                            ↓
                          Click
```

#### **Step 3: Define Attributes per Entity**

What properties does each entity have?

**URL Entity:**
| Attribute | Type | Description |
|-----------|------|-------------|
| id | VARCHAR(7) | The short URL code (unique, indexed) |
| long_url | VARCHAR(2048) | The original URL |
| created_at | TIMESTAMP | When it was created |
| expires_at | TIMESTAMP | Optional expiration (NULL = never) |
| user_id | BIGINT | Optional owner ID |
| click_count | INT | Total number of redirects |

**Click Entity (for analytics):**
| Attribute | Type | Description |
|-----------|------|-------------|
| id | BIGINT | Auto-increment unique ID |
| url_id | VARCHAR(7) | Foreign key to URL |
| timestamp | TIMESTAMP | When the click occurred |
| ip_address | VARCHAR(45) | IP address of visitor |
| user_agent | VARCHAR(255) | Browser/device info |
| country | VARCHAR(2) | 2-letter country code |
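
The two entities can also be written down as plain Python types, which is a convenient way to check that every attribute has a clear type and default. This is an illustrative sketch, not a prescribed implementation; the class names and the `_utc_now` helper are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

def _utc_now() -> datetime:
    """Timezone-aware current time, used as the default creation timestamp."""
    return datetime.now(timezone.utc)

@dataclass
class Url:
    id: str                                  # short code, primary key
    long_url: str                            # original URL
    created_at: datetime = field(default_factory=_utc_now)
    expires_at: Optional[datetime] = None    # None = never expires
    user_id: Optional[int] = None            # optional owner
    click_count: int = 0

@dataclass
class Click:
    url_id: str                              # references Url.id
    ip_address: str
    user_agent: str
    country: str                             # 2-letter country code
    timestamp: datetime = field(default_factory=_utc_now)
    id: Optional[int] = None                 # assigned by the database on insert

if __name__ == "__main__":
    url = Url(id="abc1234", long_url="https://example.com/very/long/path")
    click = Click(url_id=url.id, ip_address="192.0.2.1",
                  user_agent="Mozilla/5.0", country="US")
    print(url)
    print(click)
```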

#### **Step 4: Choose Database Type**

Based on your data model and requirements:

| Factor | Consideration | Decision |
|--------|---------------|----------|
| Data structure | Simple key-value lookups | Key-Value Store |
| Query patterns | Lookup by short URL only | Key-Value Store |
| Consistency | Strong consistency required | DynamoDB (strongly consistent reads) or MySQL with primary-key lookups |
| Scale | 1 billion URLs | Distributed NoSQL (DynamoDB, Cassandra) |
| Analytics | Heavy write load | Separate time-series database |

**Recommendation for URL Shortener:**
```
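Primary storage: DynamoDB (AWS) or a sharded relational database (e.g., MySQL)
  - Key: short_url
  - Value: long_url + metadata
  - Strongly consistent reads (opt-in per request in DynamoDB)
  - Horizontal scaling
  - Redis Cluster belongs in front as the cache, not as the system of record:
    it replicates asynchronously and does not guarantee strong consistency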

Analytics storage: Time-series database (TimescaleDB, InfluxDB)
  - Optimized for time-series queries
  - High write throughput
```

#### **Step 5: Design Schema (SQL Example)**

If using a relational database (MySQL syntax shown):

```sql
CREATE TABLE urls (
    id VARCHAR(7) PRIMARY KEY,           -- The short URL code
    long_url VARCHAR(2048) NOT NULL,     -- Original URL
    created_at TIMESTAMP DEFAULT NOW(),   -- Creation timestamp
    user_id BIGINT,                      -- Optional owner
    click_count INT DEFAULT 0,           -- Redirect count
    INDEX idx_user_id (user_id)
);

CREATE TABLE clicks (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    url_id VARCHAR(7) NOT NULL,
    timestamp TIMESTAMP DEFAULT NOW(),
    ip_address VARCHAR(45),
    user_agent VARCHAR(255),
    country VARCHAR(2),
    INDEX idx_url_id (url_id),
    INDEX idx_timestamp (timestamp)
);

-- Foreign key relationship
ALTER TABLE clicks
ADD FOREIGN KEY (url_id) REFERENCES urls(id);
```
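
To see how this schema serves the redirect path, the sketch below runs the two statements a redirect needs: a primary-key lookup and a counter increment. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in, so the table definition is a simplified SQLite version of the schema above, and `resolve` is a hypothetical function name.

```python
import sqlite3

# In-memory stand-in for the real database (SQLite dialect, simplified columns)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE urls (
        id TEXT PRIMARY KEY,
        long_url TEXT NOT NULL,
        click_count INTEGER DEFAULT 0
    )
""")
conn.execute(
    "INSERT INTO urls (id, long_url) VALUES (?, ?)",
    ("abc1234", "https://example.com/very/long/path"),
)
conn.commit()

def resolve(short_code: str):
    """Redirect hot path: one primary-key lookup plus an in-place counter bump."""
    row = conn.execute(
        "SELECT long_url FROM urls WHERE id = ?", (short_code,)
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE urls SET click_count = click_count + 1 WHERE id = ?", (short_code,)
    )
    conn.commit()
    return row[0]

if __name__ == "__main__":
    print(resolve("abc1234"))
    print(resolve("missing"))
```

Both statements touch a single row by primary key, which is why the access pattern maps so naturally onto a key-value store.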

#### **Step 6: Design Schema (NoSQL Example)**

For DynamoDB (document/key-value style):

```json
// URL Table (Key: short_url)
{
  "short_url": "abc1234",
  "long_url": "https://example.com/very/long/path",
  "created_at": 1709251200,
  "user_id": null,
  "click_count": 42,
  "metadata": {
    "user_agent": "Mozilla/5.0...",
    "ip": "192.168.1.1"
  }
}

// Clicks Table (Partition Key: url_id, Sort Key: timestamp)
{
  "url_id": "abc1234",
  "timestamp": 1709254800,
  "ip_address": "192.168.1.2",
  "user_agent": "Mozilla/5.0...",
  "country": "US"
}
```

### **API Design**

The API defines how external systems interact with your system.

#### **API Design Principles**

1. **Use HTTP Methods Correctly**
   - `GET` — Read data
   - `POST` — Create data
   - `PUT` — Update/replace data
   - `DELETE` — Remove data

2. **Use Resource-Based URLs**
   - Noun-based, not verb-based
   - RESTful conventions

3. **Use Appropriate Status Codes**
   - `200` — Success
   - `201` — Created
   - `400` — Bad Request
   - `404` — Not Found
   - `500` — Server Error

#### **API Endpoints — URL Shortener**

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/v1/shorten` | Create a new short URL |
| GET | `/api/v1/urls/{short_url}` | Get URL details |
| GET | `/{short_url}` | Redirect to original URL |
| DELETE | `/api/v1/urls/{short_url}` | Delete a URL |
| GET | `/api/v1/stats/{short_url}` | Get analytics for a URL |

#### **API Details with Examples**

**1. Create Short URL**

```http
POST /api/v1/shorten HTTP/1.1
Host: shortener.example.com
Content-Type: application/json

{
  "long_url": "https://example.com/very/long/path/to/resource",
  "custom_alias": null  // Optional: "my-custom-link"
}
```

**Success Response:**
```http
HTTP/1.1 201 Created
Content-Type: application/json

{
  "short_url": "https://short.example.com/abc1234",
  "long_url": "https://example.com/very/long/path/to/resource",
  "created_at": "2024-02-01T00:00:00Z",
  "expires_at": null
}
```

**Error Response:**
```http
HTTP/1.1 400 Bad Request
Content-Type: application/json

{
  "error": "Invalid URL format",
  "code": "INVALID_URL"
}
```

**2. Redirect to Original URL**

```http
GET /abc1234 HTTP/1.1
Host: shortener.example.com
```

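**Response** (browsers may cache a `301`, so repeat visits can skip the server and go uncounted; return `302 Found` instead if accurate click tracking matters more than saving a round trip):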
```http
HTTP/1.1 301 Moved Permanently
Location: https://example.com/very/long/path/to/resource
```

**3. Get URL Details**

```http
GET /api/v1/urls/abc1234 HTTP/1.1
Host: shortener.example.com
```

**Response:**
```http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "short_url": "abc1234",
  "long_url": "https://example.com/very/long/path/to/resource",
  "created_at": "2024-02-01T00:00:00Z",
  "click_count": 142,
  "last_accessed": "2024-02-05T15:30:00Z"
}
```

**4. Get Analytics**

```http
GET /api/v1/stats/abc1234?from=2024-02-01&to=2024-02-05 HTTP/1.1
Host: shortener.example.com
```

**Response:**
```http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "short_url": "abc1234",
  "total_clicks": 142,
  "period": {
    "from": "2024-02-01",
    "to": "2024-02-05"
  },
  "clicks_by_country": {
    "US": 85,
    "UK": 32,
    "DE": 15,
    "JP": 10
  },
  "clicks_by_day": [
    {"date": "2024-02-01", "count": 30},
    {"date": "2024-02-02", "count": 45},
    {"date": "2024-02-03", "count": 25},
    {"date": "2024-02-04", "count": 20},
    {"date": "2024-02-05", "count": 22}
  ]
}
```
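
One way to produce a response of this shape is to aggregate raw click records by country and by day. The sketch below assumes each click is a dict with a Unix `timestamp` (in seconds) and a `country` code, as in the Clicks table; `summarize_clicks` is a hypothetical helper, and a real system would more likely run this aggregation inside the analytics database.

```python
from collections import Counter
from datetime import datetime, timezone

def summarize_clicks(short_url, clicks):
    """Aggregate click records into totals per country and per UTC day."""
    clicks = list(clicks)
    by_country = Counter(c["country"] for c in clicks)
    by_day = Counter(
        datetime.fromtimestamp(c["timestamp"], tz=timezone.utc).date().isoformat()
        for c in clicks
    )
    return {
        "short_url": short_url,
        "total_clicks": len(clicks),
        "clicks_by_country": dict(by_country),
        "clicks_by_day": [
            {"date": day, "count": count} for day, count in sorted(by_day.items())
        ],
    }

if __name__ == "__main__":
    sample = [
        {"timestamp": 1706745600, "country": "US"},  # 2024-02-01 00:00:00 UTC
        {"timestamp": 1706745660, "country": "DE"},  # one minute later
        {"timestamp": 1706832000, "country": "US"},  # 2024-02-02 00:00:00 UTC
    ]
    print(summarize_clicks("abc1234", sample))
```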

#### **Code Example — API Implementation (Python/Flask)**

```python
from flask import Flask, request, jsonify, redirect
import datetime
import random
import string

app = Flask(__name__)

# Simulated database (in production, use real DB)
url_database = {}

def generate_short_url(length=7):
    """Generate a random short URL code."""
    characters = string.ascii_letters + string.digits
    return ''.join(random.choice(characters) for _ in range(length))

@app.route('/api/v1/shorten', methods=['POST'])
def shorten_url():
    """Create a new short URL."""
    data = request.get_json()
    
    if not data or 'long_url' not in data:
        return jsonify({
            "error": "long_url is required",
            "code": "MISSING_URL"
        }), 400
    
    long_url = data['long_url']
    custom_alias = data.get('custom_alias')
    
    # Validate URL (simplified)
    if not long_url.startswith(('http://', 'https://')):
        return jsonify({
            "error": "Invalid URL format",
            "code": "INVALID_URL"
        }), 400
    
    # Generate or use custom short URL
    if custom_alias:
        if custom_alias in url_database:
            return jsonify({
                "error": "Custom alias already taken",
                "code": "ALIAS_TAKEN"
            }), 409
        short_url = custom_alias
    else:
        # Generate unique short URL
        while True:
            short_url = generate_short_url()
            if short_url not in url_database:
                break
    
    # Store in database
    url_database[short_url] = {
        "long_url": long_url,
        "created_at": datetime.datetime.now().isoformat(),
        "click_count": 0,
        "last_accessed": None
    }
    
    return jsonify({
        "short_url": f"https://short.example.com/{short_url}",
        "long_url": long_url,
        "created_at": url_database[short_url]["created_at"],
        "expires_at": None
    }), 201

@app.route('/<short_url>')
def redirect_to_original(short_url):
    """Redirect to the original URL."""
    if short_url not in url_database:
        return jsonify({
            "error": "Short URL not found",
            "code": "NOT_FOUND"
        }), 404
    
    url_data = url_database[short_url]
    
    # Update click count
    url_data["click_count"] += 1
    url_data["last_accessed"] = datetime.datetime.now().isoformat()
    
    # Redirect
    return redirect(url_data["long_url"], code=301)

@app.route('/api/v1/urls/<short_url>', methods=['GET'])
def get_url_details(short_url):
    """Get details about a short URL."""
    if short_url not in url_database:
        return jsonify({
            "error": "Short URL not found",
            "code": "NOT_FOUND"
        }), 404
    
    url_data = url_database[short_url]
    
    return jsonify({
        "short_url": short_url,
        "long_url": url_data["long_url"],
        "created_at": url_data["created_at"],
        "click_count": url_data["click_count"],
        "last_accessed": url_data["last_accessed"]
    }), 200

if __name__ == '__main__':
    app.run(debug=True)
```
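
The prefix check in `shorten_url` is deliberately simplified. A slightly stricter validator, sketched below with the standard library's `urllib.parse`, also requires a host and enforces the 2,048-character limit from the data model. The function name and the exact rules are assumptions; production systems usually add further checks such as blocklists.

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 2048  # matches the VARCHAR(2048) column in the data model

def is_valid_long_url(url) -> bool:
    """Accept only http(s) URLs that have a host and fit the storage limit."""
    if not isinstance(url, str) or len(url) > MAX_URL_LENGTH:
        return False
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse raises ValueError for some malformed inputs (e.g. bad IPv6 hosts)
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

if __name__ == "__main__":
    for candidate in ("https://example.com/a", "https://", "ftp://example.com", "not a url"):
        print(candidate, "->", is_valid_long_url(candidate))
```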

### **Summary Document — Solidify Phase**

After the SOLIDIFY phase, you should have:

```
Solidify — Data Model:

Entities:
  - URL: short_url (PK), long_url, created_at, click_count
  - Click: id, url_id (FK), timestamp, ip_address, user_agent

Database Choice:
  - Primary: DynamoDB (strong consistency)
  - Analytics: Time-series DB

Schema:
  [Include your schema diagram or SQL here]

Solidify — API:

Endpoints:
  POST   /api/v1/shorten          - Create short URL
  GET    /{short_url}             - Redirect to original
  GET    /api/v1/urls/{short_url} - Get URL details
  DELETE /api/v1/urls/{short_url} - Delete URL
  GET    /api/v1/stats/{short_url} - Get analytics
```

---

## **S4: SCALE — High-Level Design and Deep Dives**

The final phase is where you design the system architecture. This is where most interviewers spend the majority of time.

### **The Two-Level Design Approach**

System design has two levels:

```
┌─────────────────────────────────────────────────────────────────┐
│                    LEVEL 1: HIGH-LEVEL DESIGN                    │
│  - Big picture architecture                                      │
│  - Component diagram                                             │
│  - Data flow                                                     │
│  - Key technologies                                              │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                    LEVEL 2: DEEP DIVE                           │
│  - Detailed design of specific components                        │
│  - Bottleneck analysis                                          │
│  - Trade-off discussion                                          │
│  - Alternative approaches                                        │
└─────────────────────────────────────────────────────────────────┘
```

### **Level 1: High-Level Design**

#### **Step 1: Identify Key Components**

Based on your requirements and data model, what building blocks do you need?

**Example URL Shortener — Key Components:**

| Component | Purpose |
|-----------|---------|
| Load Balancer | Distribute traffic across servers |
| API Server | Handle HTTP requests |
| URL Database | Store URL mappings |
| Cache | Speed up redirects |
| ID Generator | Create unique short URLs |
| Analytics Service | Track clicks |

#### **Step 2: Draw the High-Level Architecture**

```
                    ┌─────────────────────────┐
                    │        Client           │
                    │  (Browser, Mobile App)  │
                    └────────────┬────────────┘
                                 │ HTTPS
                                 ↓
                    ┌─────────────────────────┐
                    │      Load Balancer      │
                    │    (AWS ELB / Nginx)    │
                    └────────────┬────────────┘
                                 │
                ┌────────────────┼────────────────┐
                ↓                ↓                ↓
    ┌──────────────────┐ ┌──────────────┐ ┌──────────────────┐
    │   API Server 1   │ │ API Server 2 │ │   API Server N   │
    │ (Stateless)      │ │ (Stateless)  │ │    (Stateless)    │
    └────────┬─────────┘ └──────┬───────┘ └────────┬─────────┘
             │                    │                    │
             └────────────────────┼────────────────────┘
                                  ↓
                    ┌─────────────────────────┐
                    │       Cache Layer       │
                    │    (Redis Cluster)      │
                    │  - Short URL lookups    │
                    └────────────┬────────────┘
                                  │ Cache miss
                                  ↓
                    ┌─────────────────────────┐
                    │    URL Database         │
                    │  (DynamoDB / MySQL)     │
                    │  - Short URL mappings   │
                    │  - Strong consistency   │
                    └────────────┬────────────┘
                                  │
                    ┌─────────────┴─────────────┐
                    │                           │
                    ↓                           ↓
         ┌──────────────────┐        ┌──────────────────┐
         │ ID Generator     │        │ Analytics DB     │
         │ (Snowflake ID)   │        │ (Time-series)    │
         └──────────────────┘        └──────────────────┘
```

#### **Step 3: Explain Data Flow**

Walk through the flow for each major operation:

**Flow 1: Create Short URL**

```
1. Client POSTs /api/v1/shorten with long_url
2. Load balancer routes to API Server
3. API Server calls ID Generator for unique short_url
4. API Server writes to URL Database
5. API Server writes to Cache (optional)
6. API Server returns short_url to client
```

**Flow 2: Redirect**

```
1. Client GETs /abc1234
2. Load balancer routes to API Server
3. API Server checks Cache for short_url
4. Cache HIT → Use the cached long_url and skip to step 8
5. Cache MISS → Query URL Database
6. URL Database returns long_url
7. API Server stores in Cache for next time
8. API Server returns 301 redirect to long_url
9. API Server triggers analytics update (async)
```
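
The redirect flow is a cache-aside read, and it can be sketched in a few lines. Plain dictionaries and a list stand in for Redis, the URL database, and the analytics queue here; all names are hypothetical, and the step numbers in the comments refer to Flow 2 above.

```python
cache = {}                                                     # stand-in for Redis
database = {"abc1234": "https://example.com/very/long/path"}   # stand-in for the URL store
analytics_queue = []                                           # stand-in for an async event queue

def handle_redirect(short_url):
    """Return (status_code, location) following the cache-aside redirect flow."""
    long_url = cache.get(short_url)            # step 3: check the cache
    if long_url is None:                       # step 5: cache miss
        long_url = database.get(short_url)     # step 6: query the database
        if long_url is None:
            return 404, None                   # unknown short URL
        cache[short_url] = long_url            # step 7: populate the cache
    analytics_queue.append(short_url)          # step 9: record the click out of band
    return 301, long_url                       # step 8: redirect

if __name__ == "__main__":
    print(handle_redirect("abc1234"))   # first call misses the cache
    print(handle_redirect("abc1234"))   # second call is served from the cache
    print(handle_redirect("missing"))
```

Because the mapping from short URL to long URL rarely changes, cached entries can be kept for a long time, so most redirects never reach the database.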

#### **Step 4: Choose Technologies**

Based on your requirements:

| Component | Deciding Factors | Choice |
|-----------|------------------|--------|
| Load Balancer | TLS termination, health checks | AWS ALB / Nginx |
| API Server | Stateless, auto-scalable | Docker + Kubernetes |
| Database | Key-value, high throughput, strong consistency | DynamoDB |
| Cache | Fast lookups, distributed | Redis Cluster |
| ID Generator | Unique, sortable, collision-free | Snowflake (Twitter's ID generator) |
| Analytics | Time-series, high write volume | TimescaleDB / ClickHouse |

### **Level 2: Deep Dive**

Now, dive deep into specific components and decisions.

#### **Deep Dive 1: Unique Short URL Generation**

**Problem:** How do we generate unique short URLs that won't collide?

**Solution Options:**

| Option | Pros | Cons | Verdict |
|--------|------|------|---------|
| Random strings | Simple to implement | Collision risk, not sortable | ❌ Not ideal |
| UUID v4 | Collisions practically impossible | Too long (36 chars), not sortable | ❌ Too long |
| Auto-increment DB ID | Guaranteed unique | Predictable, doesn't scale well | ❌ Reveals growth |
| Snowflake ID | Unique, sortable, distributed | Slight complexity | ✅ Best choice |
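
To put a number on the collision risk of random strings, the sketch below estimates it for 7-character base62 codes at the 1 billion URL mark. It assumes codes are drawn uniformly at random and uses the standard birthday approximation n²/(2N); the function names are hypothetical.

```python
def collision_probability(existing_codes: int, code_space: int) -> float:
    """Chance that one freshly drawn random code is already taken."""
    return existing_codes / code_space

def expected_collisions(total_codes: int, code_space: int) -> float:
    """Approximate number of collisions while inserting total_codes random codes."""
    return total_codes ** 2 / (2 * code_space)

if __name__ == "__main__":
    space = 62 ** 7          # 7-character base62 codes
    n = 1_000_000_000        # 1 billion stored URLs
    print(f"per-insert collision chance at {n:,} URLs: {collision_probability(n, space):.4%}")
    print(f"expected collisions over all inserts: {expected_collisions(n, space):,.0f}")
```

Any single insert is unlikely to collide, but across a billion inserts collisions do happen, so a random scheme needs a uniqueness check and a retry on every write.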

**Snowflake ID Explained:**

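Snowflake is Twitter's distributed ID generator. It creates 64-bit unique IDs that are:
- Globally unique across multiple servers
- Time-ordered (sort by creation time)
- Compact once encoded: a 64-bit ID takes up to 11 base62 characters, which is URL-friendly but longer than the 7-character limit set in the SCOPE phase, so raise that trade-off with the interviewer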

**Snowflake ID Structure:**

```
┌──────────┬─────────────┬──────────────┬──────────────┐
│  1 bit   │  41 bits    │   10 bits    │   12 bits    │
│  Unused  │  Timestamp  │  Machine ID  │   Sequence   │  Total: 64 bits
└──────────┴─────────────┴──────────────┴──────────────┘
  (sign)     ~69 years     1024 nodes    4096 IDs/ms
             from epoch                  per node
```

**How it works:**
- **Timestamp (41 bits)**: Milliseconds since a chosen epoch (the code below uses January 1, 2024). Gives ~69 years of IDs
- **Machine ID (10 bits)**: Identifies which server generated the ID. Up to 1024 servers
- **Sequence (12 bits)**: Counter for IDs generated in the same millisecond. 4096 IDs per ms per server

**Code Example — Snowflake ID Generator (Python):**

```python
import time
import threading

class SnowflakeIDGenerator:
    """
    Snowflake ID Generator
    
    Generates unique 64-bit IDs that are:
    - Time-ordered (newer IDs have larger values)
    - Globally unique across multiple machines
    - Distributed and scalable
    """
    
    # Epoch: January 1, 2024 (custom epoch)
    EPOCH = 1704067200000  # Timestamp in milliseconds
    
    # Bit allocation
    TIMESTAMP_BITS = 41
    MACHINE_ID_BITS = 10
    SEQUENCE_BITS = 12
    
    # Maximum values
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1  # 4095
    MAX_MACHINE_ID = (1 << MACHINE_ID_BITS) - 1  # 1023
    
    def __init__(self, machine_id):
        """
        Initialize the generator.
        
        Args:
            machine_id: Unique ID for this machine (0-1023)
        """
        if machine_id < 0 or machine_id > self.MAX_MACHINE_ID:
            raise ValueError(f"Machine ID must be 0-{self.MAX_MACHINE_ID}")
        
        self.machine_id = machine_id
        self.sequence = 0
        self.last_timestamp = -1
        self.lock = threading.Lock()
    
    def _current_timestamp(self):
        """Get current timestamp in milliseconds."""
        return int(time.time() * 1000)
    
    def _wait_next_millis(self, last_timestamp):
        """Wait until the next millisecond."""
        timestamp = self._current_timestamp()
        while timestamp <= last_timestamp:
            timestamp = self._current_timestamp()
        return timestamp
    
    def generate_id(self):
        """
        Generate a new unique ID.
        
        Returns:
            int: 64-bit unique ID
        """
        with self.lock:
            timestamp = self._current_timestamp()
            
            # Clock moved backwards, throw exception
            if timestamp < self.last_timestamp:
                raise Exception("Clock moved backwards")
            
            # Same millisecond, increment sequence
            if timestamp == self.last_timestamp:
                self.sequence = (self.sequence + 1) & self.MAX_SEQUENCE
                
                # Overflow, wait for next millisecond
                if self.sequence == 0:
                    timestamp = self._wait_next_millis(self.last_timestamp)
            else:
                # New millisecond, reset sequence
                self.sequence = 0
            
            self.last_timestamp = timestamp
            
            # Build the ID
            # Shift each component to its position and combine
            timestamp_part = (timestamp - self.EPOCH) << (self.MACHINE_ID_BITS + self.SEQUENCE_BITS)
            machine_part = self.machine_id << self.SEQUENCE_BITS
            sequence_part = self.sequence
            
            snowflake_id = timestamp_part | machine_part | sequence_part
            
            return snowflake_id
    
    def id_to_short_url(self, snowflake_id):
        """
        Convert Snowflake ID to short URL string.
        
        Since Snowflake IDs are large numbers, we encode them 
        in base62 to get short strings.
        """
        alphabet = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
        
        if snowflake_id == 0:
            return alphabet[0]
        
        base = len(alphabet)
        chars = []
        
        while snowflake_id > 0:
            snowflake_id, remainder = divmod(snowflake_id, base)
            chars.append(alphabet[remainder])
        
        return ''.join(reversed(chars))

# Example usage
generator = SnowflakeIDGenerator(machine_id=1)

# Generate 5 short URLs
for i in range(5):
    snowflake_id = generator.generate_id()
    short_url = generator.id_to_short_url(snowflake_id)
    print(f"ID: {snowflake_id}, Short URL: {short_url}")
```

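**Output (shape only; exact values depend on the current time and `EPOCH`):**
```
ID: <integer of up to 63 bits>, Short URL: <base62 string of up to 11 characters>
... (4 more lines in the same format)
```

IDs generated within the same millisecond differ only in the sequence bits, so their base62 strings usually differ only in the last character. Because the full ID is encoded, these strings are longer than the 7-character codes such as `abc1234` used elsewhere in this design.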

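**Key Insight:** Snowflake IDs are unique across machines because each machine has its own `machine_id`, and within one machine the timestamp and sequence pair never repeats. The guarantee holds only if machine IDs are assigned without duplicates and the clock never moves backwards, which is why `generate_id` raises an error in that case.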
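
To make the bit layout concrete, here is a minimal sketch that packs and unpacks an ID using the same 41/10/12 split. The `EPOCH_MS` value and the helper names `pack_id` and `unpack_id` are assumptions made for illustration; substitute the epoch your generator actually uses.

```python
# Bit widths match the generator above. EPOCH_MS is an assumed custom epoch
# (2024-01-01 00:00:00 UTC in milliseconds), not a value from the chapter.
EPOCH_MS = 1704067200000
MACHINE_ID_BITS = 10
SEQUENCE_BITS = 12

def pack_id(timestamp_ms, machine_id, sequence):
    """Combine the three fields into a single integer."""
    return (
        ((timestamp_ms - EPOCH_MS) << (MACHINE_ID_BITS + SEQUENCE_BITS))
        | (machine_id << SEQUENCE_BITS)
        | sequence
    )

def unpack_id(snowflake_id):
    """Split an ID back into (timestamp_ms, machine_id, sequence)."""
    sequence = snowflake_id & ((1 << SEQUENCE_BITS) - 1)
    machine_id = (snowflake_id >> SEQUENCE_BITS) & ((1 << MACHINE_ID_BITS) - 1)
    timestamp_ms = (snowflake_id >> (MACHINE_ID_BITS + SEQUENCE_BITS)) + EPOCH_MS
    return timestamp_ms, machine_id, sequence

# Two machines generating in the same millisecond with the same sequence
# number still produce different IDs because the machine bits differ.
id_a = pack_id(1704067200123, machine_id=1, sequence=0)
id_b = pack_id(1704067200123, machine_id=2, sequence=0)
print(id_a != id_b)
print(unpack_id(id_a))
```

Being able to unpack an ID is also useful operationally: the creation time of a short URL can be recovered from the ID alone, without a database lookup.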

#### **Deep Dive 2: Caching Strategy**

**Problem:** How do we cache URL lookups efficiently?

**Cache Architecture:**

```
┌─────────────┐
│  API Server │
└──────┬──────┘
       │
       ├───────┐
       ↓       ↓
  ┌────────┐ ┌────────┐
  │ Redis  │ │ Redis  │  (Redis Cluster with sharding)
  │ Shard 1│ │ Shard 2│
  └────────┘ └────────┘
       │           │
       └─────┬─────┘
             ↓
      ┌─────────────┐
      │   URL DB    │
      │ (DynamoDB)  │
      └─────────────┘
```

**Cache Pattern: Cache-Aside**

Cache-aside is a common choice for read-heavy workloads: the application checks the cache first and queries the database only on a miss.

```
Cache-Aside Read Flow:
┌────────────────────────────────────────────────────────────┐
│                                                             │
│  1. Application needs data                                  │
│         ↓                                                    │
│  2. Check cache                                             │
│         ↓                                                    │
│     ┌─────┴─────┐                                           │
│     ↓           ↓                                           │
│   HIT         MISS                                           │
│     │           │                                           │
│     ↓           ↓                                           │
│  3. Return   4. Query database                             │
│     data          │                                           │
│                    ↓                                         │
│                 5. Update cache                             │
│                    │                                         │
│                    ↓                                         │
│                 6. Return data                              │
│                                                             │
└────────────────────────────────────────────────────────────┘
```

**Code Example — Cache-Aside Pattern:**

```python
import redis
import json

class URLCache:
    """Cache layer for URL lookups using Cache-Aside pattern."""
    
    def __init__(self, redis_host='localhost', redis_port=6379):
        """Initialize Redis connection."""
        self.redis_client = redis.Redis(
            host=redis_host,
            port=redis_port,
            decode_responses=True
        )
        self.cache_ttl = 3600  # 1 hour
    
    def get_long_url(self, short_url):
        """
        Get long URL from cache or database.
        
        Implements Cache-Aside pattern.
        """
        # 1. Try to get from cache
        cache_key = f"url:{short_url}"
        cached_data = self.redis_client.get(cache_key)
        
        if cached_data:
            print(f"✓ Cache HIT for {short_url}")
            return json.loads(cached_data)
        
        # 2. Cache miss - get from database
        print(f"✗ Cache MISS for {short_url}")
        long_url = self._get_from_database(short_url)
        
        if long_url:
            # 3. Update cache
            self.redis_client.setex(
                cache_key,
                self.cache_ttl,
                json.dumps(long_url)
            )
        
        return long_url
    
    def _get_from_database(self, short_url):
        """Simulate database lookup."""
        # In production, this would query DynamoDB or MySQL
        # This is a mock implementation
        mock_database = {
            "abc1234": {
                "short_url": "abc1234",
                "long_url": "https://example.com/path"
            },
            "xyz5678": {
                "short_url": "xyz5678",
                "long_url": "https://google.com"
            }
        }
        return mock_database.get(short_url)
    
    def invalidate(self, short_url):
        """Remove URL from cache (after update/delete)."""
        cache_key = f"url:{short_url}"
        self.redis_client.delete(cache_key)
        print(f"✓ Cache invalidated for {short_url}")

# Example usage
cache = URLCache()

# First call - cache miss
print("\nFirst lookup:")
result1 = cache.get_long_url("abc1234")
print(f"Result: {result1}")

# Second call - cache hit
print("\nSecond lookup:")
result2 = cache.get_long_url("abc1234")
print(f"Result: {result2}")

# Invalidate cache
print("\nInvalidating cache:")
cache.invalidate("abc1234")

# Third call - cache miss again
print("\nThird lookup:")
result3 = cache.get_long_url("abc1234")
print(f"Result: {result3}")
```

**Output:**
```
First lookup:
✗ Cache MISS for abc1234
Result: {'short_url': 'abc1234', 'long_url': 'https://example.com/path'}

Second lookup:
✓ Cache HIT for abc1234
Result: {'short_url': 'abc1234', 'long_url': 'https://example.com/path'}

Invalidating cache:
✓ Cache invalidated for abc1234

Third lookup:
✗ Cache MISS for abc1234
Result: {'short_url': 'abc1234', 'long_url': 'https://example.com/path'}
```

**Cache Eviction Policy:**

Since URLs are read-heavy (many redirects, few creations), use **LRU (Least Recently Used)** with TTL.

| Setting | Value | Reason |
|---------|-------|--------|
| Eviction Policy | `allkeys-lru` (Redis `maxmemory-policy`) | URLs not accessed recently are less valuable |
| TTL | 1 hour to 1 day | Balances freshness with cache hit rate |
| Max Memory (`maxmemory`) | Sized to the hot URL set, below physical RAM | Leaves headroom for Redis overhead and other data |
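
To show what LRU eviction combined with a TTL does, here is a small in-process sketch built on `collections.OrderedDict`. It is not how Redis implements `allkeys-lru` (Redis uses an approximated LRU based on sampling); the class name `LRUTTLCache` and its parameters are hypothetical.

```python
import time
from collections import OrderedDict

class LRUTTLCache:
    """Toy cache: evicts the least recently used key when full, and
    treats entries older than ttl_seconds as missing."""

    def __init__(self, max_items, ttl_seconds, clock=time.monotonic):
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self.items = OrderedDict()  # key -> (value, expires_at)

    def get(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self.items[key]      # expired: behave like a miss
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = (value, self.clock() + self.ttl_seconds)
        if len(self.items) > self.max_items:
            self.items.popitem(last=False)  # drop least recently used

cache = LRUTTLCache(max_items=2, ttl_seconds=3600)
cache.set("abc1234", "https://example.com/path")
cache.set("xyz5678", "https://google.com")
cache.get("abc1234")                         # abc1234 is now most recent
cache.set("def9999", "https://example.org")  # evicts xyz5678
print(cache.get("xyz5678"))  # None
print(cache.get("abc1234"))  # https://example.com/path
```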

#### **Deep Dive 3: Database Sharding**

**Problem:** How do we store 1 billion URLs in a distributed database?

**Sharding Strategy: Hash-Based Sharding**

```
┌─────────────────────────────────────────────────────────────────┐
│                    HASH-BASED SHARDING                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  short_url "abc1234"                                            │
│         ↓                                                       │
│  hash("abc1234") = 1234567890                                   │
│         ↓                                                       │
│  1234567890 % 4 = 2                                            │
│         ↓                                                       │
│  Shard 2                                                        │
│         ↓                                                       │
│  Stored on Shard 2                                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
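
The diagram above can be sketched in a few lines. This version uses MD5 rather than Python's built-in `hash()`, because `hash()` on strings is randomized per process and would route the same key to different shards after a restart. The function names and the synthetic keys are illustrative.

```python
import hashlib

def shard_for(short_url, num_shards):
    """Map a short URL to a shard index with hash(key) % N."""
    digest = hashlib.md5(short_url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def fraction_moved(keys, old_shards, new_shards):
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_shards) != shard_for(key, new_shards)
    )
    return moved / len(keys)

keys = [f"url{i:07d}" for i in range(10_000)]
print(shard_for("abc1234", 4))
print(f"Moved when going from 4 to 5 shards: {fraction_moved(keys, 4, 5):.0%}")
```

With a uniform hash, a key keeps its shard only when `h % 4 == h % 5`, which happens for about 20% of keys, so roughly 80% are expected to move. That cost is the main weakness of mod-based sharding when the shard count changes.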

**Consistent Hashing (Better for dynamic scaling):**

```
┌─────────────────────────────────────────────────────────────────┐
│                   CONSISTENT HASHING                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Hash Ring:                                                    │
│                                                                 │
│        0                                                        │
│      /   \                                                      │
│     /     \                                                     │
│    ↓       ↓                                                    │
│   240     60                                                    │
│    │       │                                                    │
│    │       ├─── Shard 1                                        │
│    │       │                                                    │
│   120      │                                                    │
│    │       │                                                    │
│    │       ├─── Shard 2                                        │
│    │       │                                                    │
│    └───────┘                                                    │
│      180                                                        │
│         │                                                        │
│         └─── Shard 3                                            │
│                                                                 │
│   Adding a shard only affects 1/N of the keys                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

**Code Example — Consistent Hashing:**

```python
import hashlib
import bisect

class ConsistentHash:
    """
    Implements consistent hashing for distributed systems.
    
    Benefits over simple mod-based hashing:
    - Adding/removing a node remaps only about 1/N of the keys
    - Virtual nodes spread keys more evenly as `replicas` grows
    - Can be extended to weighted nodes by giving larger nodes
      more virtual nodes (not implemented here)
    """
    
    def __init__(self, replicas=100):
        """
        Initialize the hash ring.
        
        Args:
            replicas: Number of virtual nodes per physical node
                      Higher = better distribution but more memory
        """
        self.replicas = replicas
        self.ring = []  # Sorted list of hash positions
        self.nodes = {}  # Hash position -> Node ID
    
    def _hash(self, key):
        """Compute hash of a key."""
        md5 = hashlib.md5()
        md5.update(key.encode('utf-8'))
        return int(md5.hexdigest(), 16)
    
    def add_node(self, node):
        """
        Add a node to the hash ring.
        
        Creates multiple virtual nodes for better distribution.
        """
        for i in range(self.replicas):
            virtual_node_key = f"{node}:{i}"
            hash_value = self._hash(virtual_node_key)
            
            # Insert at correct position to keep ring sorted
            pos = bisect.bisect_left(self.ring, hash_value)
            self.ring.insert(pos, hash_value)
            self.nodes[hash_value] = node
        
        print(f"✓ Added node: {node} (with {self.replicas} virtual nodes)")
    
    def remove_node(self, node):
        """Remove a node from the hash ring."""
        removed = 0
        
        for i in range(self.replicas):
            virtual_node_key = f"{node}:{i}"
            hash_value = self._hash(virtual_node_key)
            
            # Find and remove the virtual node
            pos = bisect.bisect_left(self.ring, hash_value)
            
            if pos < len(self.ring) and self.ring[pos] == hash_value:
                self.ring.pop(pos)
                del self.nodes[hash_value]
                removed += 1
        
        print(f"✓ Removed node: {node} (removed {removed} virtual nodes)")
    
    def get_node(self, key):
        """
        Get the node responsible for a given key.
        
        Uses binary search for O(log N) lookup.
        """
        if not self.ring:
            raise Exception("No nodes in hash ring")
        
        hash_value = self._hash(key)
        
        # Find first node with hash >= key hash
        pos = bisect.bisect_left(self.ring, hash_value)
        
        # If past end, wrap around to first node
        if pos == len(self.ring):
            pos = 0
        
        node_hash = self.ring[pos]
        return self.nodes[node_hash]
    
    def get_distribution(self, test_keys):
        """Test distribution of keys across nodes."""
        distribution = {}
        
        for key in test_keys:
            node = self.get_node(key)
            distribution[node] = distribution.get(node, 0) + 1
        
        return distribution

# Example usage
print("=== Consistent Hashing Demo ===\n")

# Create consistent hash ring
ch = ConsistentHash(replicas=3)

# Add nodes
ch.add_node("shard1")
ch.add_node("shard2")
ch.add_node("shard3")

# Distribute some URLs
test_urls = [
    "abc1234", "xyz5678", "def9999", "ghi0000",
    "jkl1111", "mno2222", "pqr3333", "stu4444",
    "vwx5555", "yza6666"
]

print("\nDistributing URLs across shards:")
for url in test_urls:
    shard = ch.get_node(url)
    print(f"  {url} → {shard}")

# Show distribution
distribution = ch.get_distribution(test_urls)
print("\nDistribution:")
for node, count in sorted(distribution.items()):
    print(f"  {node}: {count} URLs")

# Add a new shard
print("\nAdding a new shard...")
ch.add_node("shard4")

# Show how distribution changed
new_distribution = ch.get_distribution(test_urls)
print("\nNew distribution:")
for node, count in sorted(new_distribution.items()):
    print(f"  {node}: {count} URLs")
```

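**Output (illustrative):**

Actual shard assignments depend on the MD5 values, and with only 3 virtual nodes per shard the split is often less even than shown here.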
```
=== Consistent Hashing Demo ===

✓ Added node: shard1 (with 3 virtual nodes)
✓ Added node: shard2 (with 3 virtual nodes)
✓ Added node: shard3 (with 3 virtual nodes)

Distributing URLs across shards:
  abc1234 → shard3
  xyz5678 → shard2
  def9999 → shard1
  ghi0000 → shard3
  jkl1111 → shard1
  mno2222 → shard2
  pqr3333 → shard3
  stu4444 → shard2
  vwx5555 → shard1
  yza6666 → shard3

Distribution:
  shard1: 3 URLs
  shard2: 3 URLs
  shard3: 4 URLs

Adding a new shard...
✓ Added node: shard4 (with 3 virtual nodes)

New distribution:
  shard1: 2 URLs
  shard2: 2 URLs
  shard3: 3 URLs
  shard4: 3 URLs
```

**Key Insight:** Adding `shard4` moved only a subset of the URLs, and every URL that moved went to the new shard. With simple mod-based hashing (`hash % N`), changing N remaps most keys: going from 3 to 4 shards moves about 75% of them. Consistent hashing keeps that figure to about a quarter in this example, which makes scaling far cheaper.
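
To put numbers on this, the sketch below rebuilds a compact ring (the same MD5 hashing and `node:i` virtual-node naming as the class above, with 100 virtual nodes per shard) and counts how many of 10,000 synthetic keys change owner when a fourth shard is added. The helper names are illustrative.

```python
import bisect
import hashlib

def ring_hash(key):
    """Position of a key or virtual node on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def build_ring(nodes, replicas=100):
    """Return (positions, owners): sorted virtual-node hashes and their nodes."""
    pairs = sorted(
        (ring_hash(f"{node}:{i}"), node)
        for node in nodes
        for i in range(replicas)
    )
    positions = [pos for pos, _ in pairs]
    owners = [node for _, node in pairs]
    return positions, owners

def lookup(ring, key):
    """First virtual node at or after the key's position, wrapping around."""
    positions, owners = ring
    idx = bisect.bisect_left(positions, ring_hash(key)) % len(positions)
    return owners[idx]

keys = [f"url{i:07d}" for i in range(10_000)]
before = build_ring(["shard1", "shard2", "shard3"])
after = build_ring(["shard1", "shard2", "shard3", "shard4"])

moved = [k for k in keys if lookup(before, k) != lookup(after, k)]
print(f"Moved: {len(moved) / len(keys):.0%}")
# Every key that moved landed on the new shard; none shuffled between old ones
print(all(lookup(after, k) == "shard4" for k in moved))  # True
```

In expectation about a quarter of the keys move; the exact percentage varies with the hash values and the number of virtual nodes.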

#### **Deep Dive 4: Analytics Storage**

**Problem:** Storing billions of click events efficiently for analytics.

**Challenges:**
- High write volume (100 million clicks/day)
- Time-series queries (clicks per day, trends)
- Large data size over time

**Solution: Time-Series Database**

**TimescaleDB Schema Example:**

```sql
-- Create a regular table first; it is converted to a hypertable below
CREATE TABLE clicks (
    time TIMESTAMPTZ NOT NULL,
    url_id VARCHAR(7) NOT NULL,
    ip_address VARCHAR(45),
    user_agent VARCHAR(255),
    country VARCHAR(2)
);

-- Convert to hypertable for automatic time-based partitioning
SELECT create_hypertable('clicks', 'time', 
    chunk_time_interval => INTERVAL '1 day');

-- Create indexes for common queries
CREATE INDEX idx_clicks_url_id ON clicks (url_id, time DESC);
CREATE INDEX idx_clicks_country ON clicks (country, time DESC);

-- Example query: Clicks per URL in the last 7 days
SELECT 
    url_id,
    COUNT(*) as click_count,
    date_trunc('day', time) as day
FROM clicks
WHERE time >= NOW() - INTERVAL '7 days'
GROUP BY url_id, date_trunc('day', time)
ORDER BY url_id, day DESC;

-- Example query: Top countries for a specific URL
SELECT 
    country,
    COUNT(*) as click_count
FROM clicks
WHERE url_id = 'abc1234'
    AND time >= NOW() - INTERVAL '30 days'
GROUP BY country
ORDER BY click_count DESC
LIMIT 10;
```
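
For readers less familiar with SQL, the aggregation in the first query (clicks per URL per day) can be expressed in plain Python. This is only a sketch of what the query computes over a handful of in-memory events, not a replacement for the database; the sample events are made up.

```python
from collections import Counter
from datetime import datetime, timezone

events = [
    {"url_id": "abc1234", "time": datetime(2024, 3, 1, 9, 15, tzinfo=timezone.utc)},
    {"url_id": "abc1234", "time": datetime(2024, 3, 1, 17, 40, tzinfo=timezone.utc)},
    {"url_id": "abc1234", "time": datetime(2024, 3, 2, 8, 5, tzinfo=timezone.utc)},
    {"url_id": "xyz5678", "time": datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)},
]

def clicks_per_url_per_day(click_events):
    """Equivalent of GROUP BY url_id, date_trunc('day', time)."""
    counts = Counter()
    for event in click_events:
        counts[(event["url_id"], event["time"].date())] += 1
    return counts

for (url_id, day), count in sorted(clicks_per_url_per_day(events).items()):
    print(url_id, day, count)
```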

---

## **Trade-off Analysis**

Throughout your design, constantly identify and discuss trade-offs. This shows you understand that every decision has pros and cons.

### **The Trade-off Matrix**

For every major decision, use this framework:

```
┌─────────────────────────────────────────────────────────────────┐
│                    TRADE-OFF ANALYSIS                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Decision: [What you chose]                                     │
│                                                                 │
│  Alternative 1: [Option A]                                      │
│    ✓ Pros:                                                      │
│    ✗ Cons:                                                      │
│                                                                 │
│  Alternative 2: [Option B]                                      │
│    ✓ Pros:                                                      │
│    ✗ Cons:                                                      │
│                                                                 │
│  Chosen Alternative: [Option A or B]                            │
│    Reason: [Why you picked it]                                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### **Example Trade-offs for URL Shortener**

#### **Trade-off 1: SQL vs. NoSQL Database**

```
Decision: DynamoDB (NoSQL)

Alternative 1: PostgreSQL (SQL)
  ✓ Pros:
    - Strong ACID guarantees
    - Complex queries supported
    - Mature tooling and ecosystem
  ✗ Cons:
    - Harder to horizontally scale
    - Single region by default
    - Write throughput limited by a single primary node

Alternative 2: DynamoDB (NoSQL)
  ✓ Pros:
    - Horizontal scaling built-in
    - Multi-region active-active
    - Managed service (no operational overhead)
    - Consistent single-digit millisecond reads
  ✗ Cons:
    - Limited query capabilities
    - Higher cost at low scale
    - Vendor lock-in (AWS-specific)

Chosen Alternative: DynamoDB
  Reason:
    - Our read-heavy workload (100M redirects/day) needs horizontal scaling
    - Strong consistency option available for URL lookups
    - Multi-region deployment needed for high availability (99.9% SLA)
    - Managed service reduces operational complexity
```

#### **Trade-off 2: Cache Location**

```
Decision: Distributed Redis Cluster

Alternative 1: In-memory cache per API server
  ✓ Pros:
    - Zero network latency
    - Simple architecture
  ✗ Cons:
    - Data inconsistency across servers
    - Memory wasted with duplicate data
    - Not scalable (each server has limited memory)

Alternative 2: Distributed Redis Cluster
  ✓ Pros:
    - Single source of truth
    - Horizontal scaling possible
    - Data shared across all servers
    - Eviction policies (LRU, LFU) built-in
  ✗ Cons:
    - Network latency (1-5ms per call)
    - Added complexity (cluster management)
    - Need to handle cache failures

Chosen Alternative: Distributed Redis Cluster
  Reason:
    - Consistency is critical for URL redirects
    - 12,145 QPS is spread across many API servers; a shared cache gives them one warm data set
    - Cache hit rate is more important than eliminating 1-2ms latency
    - Redis cluster supports auto-failover
```

#### **Trade-off 3: Synchronous vs. Asynchronous Analytics**

```
Decision: Asynchronous (message queue)

Alternative 1: Synchronous (update on every redirect)
  ✓ Pros:
    - Real-time analytics
    - Simpler architecture
  ✗ Cons:
    - Redirect latency increases (writes to DB)
    - Database becomes bottleneck
    - Analytics can take down the main service

Alternative 2: Asynchronous (message queue + batch writes)
  ✓ Pros:
    - Redirect latency unchanged (non-blocking)
    - Main service resilient to analytics issues
    - Can batch writes for efficiency
    - Can use separate, optimized storage
  ✗ Cons:
    - Analytics not real-time (delay of seconds to minutes)
    - Added complexity (Kafka, consumer)
    - Need to handle data loss scenarios

Chosen Alternative: Asynchronous
  Reason:
    - 50ms redirect SLA can't be compromised
    - Analytics can tolerate slight delay (not user-facing)
    - Separation of concerns improves reliability
    - Can scale analytics independently from redirect service
```
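
The asynchronous option can be sketched with Python's standard library, using an in-process `queue.Queue` as a stand-in for Kafka. The function names (`handle_redirect`, `analytics_worker`) and the in-memory URL table are hypothetical; the point is only that the redirect path enqueues the click and returns without waiting for the analytics write.

```python
import queue
import threading

URLS = {"abc1234": "https://example.com/path"}  # stand-in for cache/DB lookup
click_queue = queue.Queue(maxsize=10_000)       # stand-in for a Kafka topic
click_counts = {}                               # stand-in for analytics storage

def handle_redirect(short_url):
    """Hot path: look up the URL, enqueue the click, return immediately."""
    long_url = URLS.get(short_url)
    if long_url is not None:
        try:
            click_queue.put_nowait({"url_id": short_url})
        except queue.Full:
            pass  # drop the event rather than slow down the redirect
    return long_url

def analytics_worker():
    """Background consumer: drains the queue and records clicks."""
    while True:
        event = click_queue.get()
        if event is None:  # sentinel: shut down
            click_queue.task_done()
            break
        url_id = event["url_id"]
        click_counts[url_id] = click_counts.get(url_id, 0) + 1
        click_queue.task_done()

worker = threading.Thread(target=analytics_worker, daemon=True)
worker.start()

for _ in range(3):
    handle_redirect("abc1234")

click_queue.put(None)  # ask the worker to stop
click_queue.join()     # wait until every queued event is processed
print(click_counts)    # {'abc1234': 3}
```

Dropping events when the queue is full is one possible policy. It trades analytics completeness for redirect latency, which is the data-loss concern listed under the cons above.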

---

## **Bottleneck Identification**

A good designer proactively identifies where the system might fail.

### **The Bottleneck Checklist**

Go through each component and ask: "What could fail here?"

```
┌─────────────────────────────────────────────────────────────────┐
│                   BOTTLENECK CHECKLIST                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Component        Potential Bottleneck      Mitigation          │
│  ─────────────────────────────────────────────────────────────  │
│  Load Balancer    Single point of failure   Multiple LBs        │
│                   SSL termination overhead   Use L4 LB          │
│                                                                 │
│  API Server       CPU-bound (ID generation)  Horizontal scaling │
│                   Memory-bound (cache)      Stateless design    │
│                   Connection limits         Keep-alive          │
│                                                                 │
│  Cache            Memory limits             Shard cache         │
│                   Network latency           Local cache fallback│
│                   Eviction storms           Proactive refresh   │
│                                                                 │
│  Database         Write throughput          Sharding            │
│                   Hot partition             Salting             │
│                   Read latency              Read replicas       │
│                                                                 │
│  Message Queue    Producer backlog          Multiple partitions │
│                   Consumer lag              Scale consumers     │
│                   Message ordering          Per-key partitioning│
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### **Example Bottleneck Analysis**

#### **Bottleneck 1: Hot Partition in DynamoDB**

**Problem:** Popular URLs get accessed much more often, creating a hot partition.

```
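Scenario:
- "abc1234" is a viral URL
- Gets 1 million hits/day, most of them in short bursts
- All reads hit the single DynamoDB partition that holds the key
- Bursts exceed the per-partition limit (about 3,000 read units/second)
- Reads/writes to that partition are throttled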
```

**Mitigation:**

1. **Add a random suffix:**
   ```
   Original: abc1234
   With suffix: abc1234#1, abc1234#2, ..., abc1234#N
   
   Reads:
     - Choose random suffix on redirect
     - Distribute load across N partitions
   ```
   
2. **Use DAX (DynamoDB Accelerator):**
   ```
   Cache reads before hitting DynamoDB
   - Up to 10x read performance improvement for cached items
   - Reduces load on hot partitions
   ```

3. **Fan-out writes:**
   ```
   On URL creation, write to multiple partitions:
     - abc1234 stored in partition 1, 2, and 3
   On redirect:
     - Read from any one of the three partitions
   ```

**Code Example — Hot Partition Mitigation:**

```python
import random
import redis

class URLServiceWithPartitioning:
    """
    URL service that handles hot partitions by adding
    random suffixes to popular URLs.
    """
    
    def __init__(self, redis_client):
        self.redis = redis_client
        self.partition_count = 10  # Number of partitions per hot URL
    
    def store_url(self, short_url, long_url):
        """
        Store URL with multiple partitions for hot URLs.
        """
        # Check if URL is "hot" (already has multiple reads)
        read_count = int(self.redis.get(f"reads:{short_url}") or 0)
        
        if read_count > 1000:
            # This is a hot URL - store in multiple partitions
            for i in range(self.partition_count):
                partitioned_url = f"{short_url}#{i}"
                self.redis.set(partitioned_url, long_url)
            
            # Mark as hot
            self.redis.set(f"hot:{short_url}", "true")
            print(f"✓ Stored hot URL {short_url} in {self.partition_count} partitions")
        else:
            # Regular URL - single partition
            self.redis.set(short_url, long_url)
            print(f"✓ Stored URL {short_url}")
    
    def get_url(self, short_url):
        """
        Get URL, using random partition if hot.
        """
        # Increment read counter
        self.redis.incr(f"reads:{short_url}")
        
        # Check if hot
        is_hot = self.redis.get(f"hot:{short_url}")
        
        if is_hot:
            # Choose random partition
            partition = random.randint(0, self.partition_count - 1)
            partitioned_url = f"{short_url}#{partition}"
            long_url = self.redis.get(partitioned_url)
            print(f"✓ Got hot URL {short_url} from partition {partition}")
        else:
            long_url = self.redis.get(short_url)
            print(f"✓ Got URL {short_url}")
        
        return long_url

# Example usage
redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)
service = URLServiceWithPartitioning(redis_client)

# Simulate a URL becoming hot
print("=== Simulating Hot URL ===\n")

# Store URL (not hot yet)
service.store_url("viral1234", "https://example.com/viral")

# Read a few times
print("\nFirst 50 reads:")
for i in range(50):
    service.get_url("viral1234")

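# Simulate the rest of the viral traffic: push the read counter past the
# 1000-read threshold that store_url() checks, then store again
redis_client.incrby("reads:viral1234", 1000)
print("\nRe-storing URL:")
service.store_url("viral1234", "https://example.com/viral")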

# Read more times
print("\nNext 50 reads (should use partitions):")
for i in range(50):
    service.get_url("viral1234")
```

**Output:**
```
=== Simulating Hot URL ===

✓ Stored URL viral1234

First 50 reads:
✓ Got URL viral1234
✓ Got URL viral1234
... (48 more times)

Re-storing URL:
✓ Stored hot URL viral1234 in 10 partitions

Next 50 reads (should use partitions):
✓ Got hot URL viral1234 from partition 3
✓ Got hot URL viral1234 from partition 7
✓ Got hot URL viral1234 from partition 1
... (47 more times)
```

#### **Bottleneck 2: Message Queue Consumer Lag**

**Problem:** Redirects happen faster than analytics can be processed.

```
Scenario:
- 100 million redirects/day = 1,157 QPS average
- 11,570 QPS peak (10x multiplier)
- Analytics consumer can only process 500 QPS
- Consumer falls behind → messages pile up
```

**Mitigation:**

1. **Scale consumers horizontally:**
   ```
   Required consumers = ceil(11,570 / 500) = ceil(23.14) = 24

   Run at least 24 consumer instances to handle peak load
   ```

2. **Use partitioning:**
   ```
   Partition by url_id:
     - Same URL always goes to same partition
     - Enables per-URL aggregation
     - Parallel processing across partitions
   ```

3. **Batch writes:**
   ```
   Instead of writing each click:
     - Buffer 1000 clicks in memory
     - Write as single batch to database
     - Reduces DB round trips by up to 1000x
   ```
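
The sizing arithmetic from the scenario and the first mitigation can be checked with a short helper. The function name and parameter names are illustrative; the inputs are the figures used above.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

def required_consumers(events_per_day, peak_multiplier, per_consumer_qps):
    """Return (average_qps, peak_qps, consumers needed at peak)."""
    average_qps = events_per_day / SECONDS_PER_DAY
    peak_qps = average_qps * peak_multiplier
    consumers = math.ceil(peak_qps / per_consumer_qps)
    return average_qps, peak_qps, consumers

avg, peak, consumers = required_consumers(
    events_per_day=100_000_000,
    peak_multiplier=10,
    per_consumer_qps=500,
)
print(f"average: {avg:,.0f} QPS, peak: {peak:,.0f} QPS, consumers: {consumers}")
```

The unrounded peak is about 11,574 QPS rather than 11,570, but the consumer count is 24 either way. With Kafka, each partition is read by at most one consumer in a group, so the topic needs at least as many partitions as consumers for all of them to work in parallel.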

**Code Example — Batching Consumer:**

```python
import time
from collections import defaultdict

class AnalyticsConsumer:
    """
    Analytics consumer that batches writes for efficiency.
    """
    
    def __init__(self, batch_size=1000, max_wait_time=5):
        """
        Initialize consumer.
        
        Args:
            batch_size: Number of events to buffer before writing
            max_wait_time: Maximum seconds to wait before flushing
        """
        self.batch_size = batch_size
        self.max_wait_time = max_wait_time
        self.buffer = []
        self.url_stats = defaultdict(int)
        self.last_flush = time.time()
    
    def process_event(self, event):
        """
        Process a single click event.
        
        Adds to buffer and flushes if necessary.
        """
        self.buffer.append(event)
        self.url_stats[event['url_id']] += 1
        
        # Check if we should flush
        should_flush = (
            len(self.buffer) >= self.batch_size or
            time.time() - self.last_flush >= self.max_wait_time
        )
        
        if should_flush:
            self.flush()
    
    def flush(self):
        """
        Write buffered events to database.
        """
        if not self.buffer:
            return
        
        start_time = time.time()
        
        # In production, this would be a bulk insert to DB
        print(f"Flushing {len(self.buffer)} events...")
        
        # Simulate batch write (aggregated by URL)
        for url_id, count in self.url_stats.items():
            # Update analytics for this URL
            print(f"  {url_id}: +{count} clicks")
        
        # Clear buffer
        self.buffer = []
        self.url_stats = defaultdict(int)
        self.last_flush = time.time()
        
        elapsed = time.time() - start_time
        print(f"Flush completed in {elapsed:.2f}s\n")
    
    def close(self):
        """Flush any remaining events before shutdown."""
        print("Closing consumer...")
        self.flush()

# Example usage
print("=== Batching Consumer Demo ===\n")

consumer = AnalyticsConsumer(batch_size=5, max_wait_time=10)

# Simulate stream of events
events = [
    {"url_id": "abc1234", "timestamp": 1709251200},
    {"url_id": "abc1234", "timestamp": 1709251201},
    {"url_id": "xyz5678", "timestamp": 1709251202},
    {"url_id": "abc1234", "timestamp": 1709251203},
    {"url_id": "def9999", "timestamp": 1709251204},  # Triggers flush (5 events)
    {"url_id": "xyz5678", "timestamp": 1709251205},
    {"url_id": "abc1234", "timestamp": 1709251206},
]

for event in events:
    print(f"Processing: {event['url_id']}")
    consumer.process_event(event)
    time.sleep(0.1)  # Simulate processing delay

# Force final flush
consumer.close()
```

**Output:**
```
=== Batching Consumer Demo ===

Processing: abc1234
Processing: abc1234
Processing: xyz5678
Processing: abc1234
Processing: def9999
Flushing 5 events...
  abc1234: +3 clicks
  xyz5678: +1 clicks
  def9999: +1 clicks
Flush completed in 0.00s

Processing: xyz5678
Processing: abc1234
Closing consumer...
Flushing 2 events...
  xyz5678: +1 clicks
  abc1234: +1 clicks
Flush completed in 0.00s
```

---

## **The Art of Drawing Architecture Diagrams**

A picture is worth a thousand words. Good diagrams communicate complex systems clearly.

### **Diagramming Best Practices**

#### **Rule 1: Start Simple, Add Detail**

Don't draw everything at once. Build up your diagram progressively.

```
Step 1: Client → Server (basic flow)
Step 2: Add Load Balancer
Step 3: Add Cache Layer
Step 4: Add Database
Step 5: Add Message Queue
Step 6: Add Analytics
```

#### **Rule 2: Use Standard Symbols**

```
┌─────────────────────────────────────────────────────────────────┐
│                    ARCHITECTURE SYMBOLS                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Client:          ┌─────────────────┐                           │
│                   │      📱          │                           │
│                   │    Browser/App   │                           │
│                   └────────┬─────────┘                           │
│                                                                 │
│  Load Balancer:   ┌─────────────────┐                           │
│                   │  ⚖️  Load Bal.   │                           │
│                   └────────┬─────────┘                           │
│                                                                 │
│  Server:          ┌─────────────────┐                           │
│                   │  🖥️  API Server  │                           │
│                   │    (Stateless)   │                           │
│                   └────────┬─────────┘                           │
│                                                                 │
│  Cache:           ┌─────────────────┐                           │
│                   │   💾  Cache      │                           │
│                   │    (Redis)       │                           │
│                   └────────┬─────────┘                           │
│                                                                 │
│  Database:        ┌─────────────────┐                           │
│                   │   🗄️  Database   │                           │
│                   │   (DynamoDB)     │                           │
│                   └────────┬─────────┘                           │
│                                                                 │
│  Message Queue:   ┌─────────────────┐                           │
│                   │   📨  MQ         │                           │
│                   │   (Kafka)        │                           │
│                   └────────┬─────────┘                           │
│                                                                 │
│  External:        ☁️  External Service                         │
│                                                                 │
│  Data Flow:       ───────────→ (Unidirectional)                │
│  Bidirectional:   ←──────────→ (Both ways)                      │
│  Async:           ──────────→> (Async)                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

#### **Rule 3: Label Everything**

Every box, arrow, and connection should have a label.

```
✗ BAD (no labels):
┌────┐   →   ┌─────┐
│ LB │       │ API │
└────┘       └─────┘

✓ GOOD (labeled):
┌─────────────┐         HTTPS (443)          ┌─────────────────┐
│   Load      │─────────────────────────────→│      API        │
│  Balancer   │←─────────────────────────────│    Server       │
│  (ALB)      │      Response (200 OK)       │   (K8s Pod)     │
└─────────────┘                              └─────────────────┘
```

#### **Rule 4: Show Data Flow**

Don't just draw boxes. Show how data moves through the system.

```
┌─────────────────────────────────────────────────────────────────┐
│                     REQUEST FLOW DIAGRAM                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Client                                                         │
│    │                                                            │
│    │  1. POST /api/v1/shorten                                   │
│    │     { "long_url": "https://..." }                          │
│    ↓                                                            │
│  Load Balancer                                                  │
│    │                                                            │
│    │  2. Route to healthy server                                │
│    ↓                                                            │
│  API Server                                                     │
│    │                                                            │
│    │  3. Generate short URL (Snowflake ID)                      │
│    │                                                            │
│    │  4. Write to DynamoDB                                      │
│    ↓                                                            │
│  DynamoDB                                                       │
│    │                                                            │
│    │  5. Store: { "abc1234": "https://..." }                    │
│    │                                                            │
│    │  6. ACK success                                            │
│    ↑                                                            │
│  API Server                                                     │
│    │                                                            │
│    │  7. Cache in Redis (optional)                              │
│    │                                                            │
│    │  8. Return: { "short_url": "https://short.url/abc1234" }   │
│    ↑                                                            │
│  Load Balancer                                                  │
│    │                                                            │
│    │  9. Response (201 Created)                                 │
│    ↑                                                            │
│  Client                                                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
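Step 3 of the flow above turns a numeric ID into the short code. The following is a minimal sketch of that encoding, not the chapter's reference implementation: the alphabet ordering and the names `base62_encode` and `base62_decode` are assumptions made for illustration. It also makes one sizing fact visible: 7 Base62 characters cover IDs below 62^7 (about 3.5 trillion), while a full 64-bit Snowflake-style ID can need up to 11 characters.

```python
import string

# Assumed alphabet ordering: digits, lowercase, uppercase (62 symbols).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(chars))


def base62_decode(code: str) -> int:
    """Invert base62_encode."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n


if __name__ == "__main__":
    print(base62_encode(125))               # 125 = 2*62 + 1
    print(len(base62_encode(62**7 - 1)))    # largest ID that fits in 7 chars
    print(len(base62_encode(2**63 - 1)))    # largest positive 64-bit ID
```

If the short code must stay within 7 characters, the ID source has to stay below 62^7, for example a range-allocated counter rather than a full 64-bit ID.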

#### **Rule 5: Indicate Scale**

Show when components have multiple instances.

```
┌─────────────────────────────────────────────────────────────────┐
│                   MULTI-INSTANCE DIAGRAM                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                     Load Balancer                                │
│                           │                                      │
│     ┌─────────────────────┼─────────────────────┐               │
│     ↓                     ↓                     ↓               │
│  ┌────────┐           ┌────────┐           ┌────────┐           │
│  │ API-1  │           │ API-2  │           │ API-N  │           │
│  │ (K8s)  │           │ (K8s)  │           │ (K8s)  │           │
│  └────────┘           └────────┘           └────────┘           │
│     │                     │                     │               │
│     └─────────────────────┼─────────────────────┘               │
│                           │                                      │
│                     Redis Cluster                                │
│                   (Master + 3 Slaves)                            │
│                           │                                      │
│                     ┌─────┴─────┐                                │
│                     ↓           ↓                                │
│                  Shard-1     Shard-2                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### **Drawing Order for Interviews**

When drawing during an interview, follow this sequence:

```
1. Draw the client and first server (5 seconds)
   ┌──────┐    ┌──────────┐
   │User  │───→│  Server  │
   └──────┘    └──────────┘

2. Add load balancer (10 seconds)
   ┌──────┐    ┌──────────┐    ┌──────────┐
   │User  │───→│    LB    │───→│  Server  │
   └──────┘    └──────────┘    └──────────┘

3. Add database (15 seconds)
   ┌──────┐    ┌──────────┐    ┌──────────┐
   │User  │───→│    LB    │───→│  Server  │───→┌─────────┐
   └──────┘    └──────────┘    └──────────┘    │  DB     │
                                               └─────────┘

4. Add cache (20 seconds)
   ┌──────┐    ┌──────────┐    ┌──────────┐
   │User  │───→│    LB    │───→│  Server  │───→┌─────────┐
   └──────┘    └──────────┘    └──────────┘    │  DB     │
                                           ↑  └─────────┘
                                           │
                                      ┌─────────┐
                                      │  Cache  │
                                      └─────────┘

5. Add message queue (30 seconds)
   ┌──────┐    ┌──────────┐    ┌──────────┐
   │User  │───→│    LB    │───→│  Server  │───→┌─────────┐
   └──────┘    └──────────┘    └──────────┘    │  DB     │
                              │                └─────────┘
                              ↓
                         ┌─────────┐
                         │   MQ    │
                         └────┬────┘
                              ↓
                         ┌─────────┐
                         │Consumer │
                         └─────────┘

6. Add multiple instances (final diagram)
   (See complete architecture diagram above)
```

### **Diagramming Tools**

**For Interviews:**
- **Whiteboard** — Most common in on-site interviews
- **Excalidraw** — Free online tool (hand-drawn style)
- **Draw.io** — Free, feature-rich
- **Lucidchart** — Paid, very popular

**For Documentation:**
- **Mermaid** — Code-based diagrams (great for Git)
- **PlantUML** — Code-based UML diagrams
- **Cloudcraft** — AWS-specific diagrams

**Example Mermaid Code:**

```mermaid
graph TD
    User[User] -->|HTTPS| LB[Load Balancer]
    LB -->|Route| API1[API Server 1]
    LB -->|Route| API2[API Server 2]
    LB -->|Route| APIN[API Server N]
    
    API1 -->|Read/Write| Cache[Redis Cluster]
    API2 -->|Read/Write| Cache
    APIN -->|Read/Write| Cache
    
    API1 -->|Read/Write| DB[DynamoDB]
    API2 -->|Read/Write| DB
    APIN -->|Read/Write| DB
    
    API1 -->|Publish| MQ[Kafka]
    API2 -->|Publish| MQ
    APIN -->|Publish| MQ
    
    MQ -->|Consume| Consumer[Analytics Consumer]
    Consumer -->|Write| Analytics[TimescaleDB]
    
    style LB fill:#f9f,stroke:#333,stroke-width:2px
    style Cache fill:#bbf,stroke:#333,stroke-width:2px
    style DB fill:#bfb,stroke:#333,stroke-width:2px
```

---

## **Putting It All Together: The 4S Interview Flow**

Here's how you would use the 4S framework in a 45-minute interview:

```
┌─────────────────────────────────────────────────────────────────┐
│                   45-MINUTE INTERVIEW TIMELINE                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  0-5 min:     SCOPE                                             │
│                 • Clarify the problem                           │
│                 • Ask questions                                  │
│                 • Define functional requirements                │
│                 • Define non-functional requirements             │
│                 • Define out-of-scope                           │
│                                                                 │
│  5-10 min:    SKETCH                                            │
│                 • Calculate QPS (peak and average)              │
│                 • Estimate storage                              │
│                 • Estimate bandwidth                            │
│                 • Identify scale constraints                    │
│                                                                 │
│  10-15 min:   SOLIDIFY                                          │
│                 • Identify entities                             │
│                 • Design data model                              │
│                 • Choose database type                          │
│                 • Design API endpoints                          │
│                 • Define request/response formats               │
│                                                                 │
│  15-25 min:   SCALE: High-Level Design                          │
│                 • Identify key components                       │
│                 • Draw high-level architecture                  │
│                 • Explain data flow                             │
│                 • Choose technologies                           │
│                                                                 │
│  25-40 min:   SCALE: Deep Dive                                  │
│                 • Dive into 2-3 critical components             │
│                 • Analyze bottlenecks                           │
│                 • Discuss trade-offs                            │
│                 • Propose alternatives                          │
│                                                                 │
│  40-45 min:   FOLLOW-UP                                         │
│                 • Handle follow-up questions                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### **Sample Interview Script**

**Interviewer:** "Design a URL shortener."

**You (SCOPE - 0-5 min):**

```
You: "Great! Let me clarify a few things first:

1. Is this for public use or internal company use?
   Int: Public use.

2. Should short URLs expire, or do they last forever?
   Int: They don't expire.

3. Can users specify custom aliases?
   Int: Yes, if available.

4. Do we need analytics (click tracking, geo location)?
   Int: Yes, basic click counts.

5. What's the expected scale?
   Int: 10 million URL creations per day, 100 million redirects per day.

Based on that, here are my requirements:

Functional Requirements:
- Create and return unique short URLs (≤7 characters)
- Redirect short URLs to original long URLs
- Optional custom aliases
- Track click counts per URL

Non-Functional Requirements:
- Support 10M URL creations/day, 100M redirects/day
- 99.9% availability
- Create URL: <100ms, Redirect: <50ms
- Strong consistency for URL lookups
- Zero data loss for URLs

Out of Scope:
- User authentication (assume anonymous)
- URL preview thumbnails
- Social sharing features
- QR code generation

Does this sound right to you?"
```

**You (SKETCH - 5-10 min):**

```
You: "Now let me estimate the scale:

Traffic (QPS):
- URL creations: 10M/day = 115 QPS avg, 575 QPS peak (5x multiplier)
- Redirects: 100M/day = 1,157 QPS avg, 11,570 QPS peak (10x multiplier)
- Total: ~12,145 QPS peak

Storage:
- Per URL: short_url (7B) + long_url (2KB) + metadata = ~2KB
- 1 billion URLs (about 100 days of creations at 10M/day) × 2KB = 2TB
- With 50% overhead: ~3TB

Bandwidth:
- Create: (2KB req + 100B res) × 575 QPS = 1.2 MB/s
- Redirect: (10B req + 2KB res) × 11,570 QPS = 23 MB/s
- Total: ~24 MB/s at peak (~2.6 MB/s average, about 220GB/day)

This tells me we need a horizontally scalable database and 
distributed caching to handle the 11K+ QPS for redirects."
```
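The arithmetic in this step can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the helper names `qps` and `bandwidth_mb_per_s` are illustrative, the payload sizes and peak multipliers are the ones quoted in the script, and 1 KB is treated as 1,000 bytes. Because it does not round before multiplying, its results differ from the spoken figures by a few QPS.

```python
from typing import Tuple

SECONDS_PER_DAY = 86_400


def qps(events_per_day: float, peak_multiplier: float) -> Tuple[float, float]:
    """Return (average QPS, peak QPS) for a daily event count."""
    avg = events_per_day / SECONDS_PER_DAY
    return avg, avg * peak_multiplier


def bandwidth_mb_per_s(rate: float, request_bytes: int, response_bytes: int) -> float:
    """Request plus response bytes moved per second, in MB/s (1 MB = 10**6 bytes)."""
    return rate * (request_bytes + response_bytes) / 1e6


# Traffic figures from the interviewer's answers.
create_avg, create_peak = qps(10_000_000, peak_multiplier=5)
redirect_avg, redirect_peak = qps(100_000_000, peak_multiplier=10)

# Storage: 1 billion URLs at roughly 2 KB each, then 50% overhead.
storage_tb = 1_000_000_000 * 2_000 / 1e12
storage_with_overhead_tb = storage_tb * 1.5

# Peak bandwidth per endpoint.
create_bw = bandwidth_mb_per_s(create_peak, request_bytes=2_000, response_bytes=100)
redirect_bw = bandwidth_mb_per_s(redirect_peak, request_bytes=10, response_bytes=2_000)

print(f"create:   {create_avg:,.0f} avg / {create_peak:,.0f} peak QPS")
print(f"redirect: {redirect_avg:,.0f} avg / {redirect_peak:,.0f} peak QPS")
print(f"storage:  {storage_tb:.1f} TB raw, {storage_with_overhead_tb:.1f} TB with overhead")
print(f"peak bandwidth: {create_bw + redirect_bw:.1f} MB/s")
```

Keeping the inputs as named arguments makes it easy to redo the estimate when the interviewer changes an assumption mid-conversation.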

**You (SOLIDIFY - 10-15 min):**

```
You: "Let me design the data model:

Entities:
- URL: short_url (PK), long_url, created_at, click_count
- Click: id, url_id (FK), timestamp, ip_address

I'll use DynamoDB for the main URL storage:
- Key: short_url (7 chars)
- Strong consistency for redirects
- Horizontal scaling for 1B+ URLs

For analytics, I'll use a time-series database:
- TimescaleDB for click events
- Optimized for time-series queries

API Design:
- POST /api/v1/shorten - Create short URL
- GET /{short_url} - Redirect (301; use 302 if every click must reach the server to be counted)
- GET /api/v1/urls/{short_url} - Get details
- GET /api/v1/stats/{short_url} - Get analytics

The redirect endpoint needs to be ultra-fast, so we'll 
cache heavily at that layer."
```

**You (SCALE - High-Level - 15-25 min):**

```
You: [Draws diagram]

"Here's the high-level architecture:

[Draws and explains each component]

Key components:
1. Load Balancer - Distributes traffic
2. API Servers - Stateless, horizontally scalable
3. Redis Cluster - Cache for fast redirects
4. DynamoDB - Primary storage for URLs
5. Snowflake ID Generator - Unique, sortable IDs
6. Kafka - Async analytics pipeline
7. TimescaleDB - Analytics storage

Data flow for redirect:
1. Client GETs /abc1234
2. Load balancer routes to API server
3. API server checks Redis cache
4. Cache HIT → Return redirect immediately
5. Cache MISS → Query DynamoDB
6. Update cache for next time
7. Async publish click event to Kafka

For URL creation:
1. Client POSTs to /api/v1/shorten
2. API server generates Snowflake ID
3. Encodes to short URL
4. Writes to DynamoDB
5. Returns short URL to client"
```
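The redirect data flow described in the script can be sketched in a few lines. This is an illustration under assumptions, not a production handler: plain dictionaries stand in for Redis and DynamoDB, a list stands in for the Kafka topic, and the function name `resolve` is hypothetical.

```python
from typing import Dict, List, Optional

# In-memory stand-ins (assumptions): dicts for Redis and DynamoDB,
# and a list for the Kafka topic that receives click events.
cache: Dict[str, str] = {}
database: Dict[str, str] = {"abc1234": "https://example.com/some/long/path"}
click_events: List[Dict[str, str]] = []


def resolve(short_code: str) -> Optional[str]:
    """Cache-aside lookup: try the cache, fall back to the database,
    then populate the cache for the next request."""
    long_url = cache.get(short_code)            # check the cache first
    if long_url is None:                        # cache miss
        long_url = database.get(short_code)     # query primary storage
        if long_url is None:
            return None                         # unknown code; caller returns 404
        cache[short_code] = long_url            # update cache for next time
    # Stand-in for the asynchronous publish; a real handler would not
    # block the redirect on this step.
    click_events.append({"url_id": short_code})
    return long_url
```

Calling `resolve("abc1234")` twice exercises both paths: the first call misses the cache and reads the database, the second is served from the cache.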

**You (SCALE - Deep Dive - 25-40 min):**

```
You: "Let me dive deeper into a few critical areas:

1. Short URL Generation:
   I'm using Snowflake IDs because:
   - Globally unique across servers
   - Time-ordered (sortable)
   - No collisions
   - Base62 encoding keeps URLs short
   
   [Explains Snowflake structure and shows code]

2. Caching Strategy:
   Using Cache-Aside pattern with Redis Cluster:
   - Cache hit reduces latency from ~50ms to ~5ms
   - LRU eviction with 1-hour TTL
   - Hash-slot sharding (Redis Cluster's 16,384 slots) for distribution
   
   [Shows cache-aside code]

3. Hot Partition Mitigation:
   Viral URLs could create hot partitions in DynamoDB.
   Solution:
   - Add random suffix to popular URLs
   - Store in multiple partitions
   - Use DAX for caching
   
   [Shows mitigation code]

Trade-offs I considered:

DynamoDB vs PostgreSQL:
   Chose DynamoDB for managed horizontal scaling and multi-region replication
   PostgreSQL would be simpler at first, but scaling writes means sharding it ourselves

Synchronous vs Asynchronous Analytics:
   Chose async (Kafka) to keep analytics writes off the redirect path
   Synchronous would be simpler but adds a write to every redirect, risking the 50ms target"
```
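The hot-partition mitigation from the deep dive (suffixing a popular key and storing it in multiple partitions) could look roughly like the sketch below. Everything here is assumed for illustration: the `#` separator, the fan-out of 4, the function names, and the dictionary that stands in for a partitioned key-value table.

```python
import random
from typing import Dict

REPLICAS = 4  # assumed fan-out for a hot key

# Stand-in for a partitioned key-value table; in a real store each
# suffixed key would hash to a (likely) different partition.
table: Dict[str, str] = {}


def write_hot(short_code: str, long_url: str, replicas: int = REPLICAS) -> None:
    """Store the same item under several suffixed keys."""
    for i in range(replicas):
        table[f"{short_code}#{i}"] = long_url


def read_hot(short_code: str, replicas: int = REPLICAS) -> str:
    """Read one randomly chosen copy, spreading reads across the copies."""
    suffix = random.randrange(replicas)
    return table[f"{short_code}#{suffix}"]
```

The cost is write amplification: each hot item is written `replicas` times. That is usually acceptable here because a short URL mapping is written once and read many times.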

**Interviewer:** "What if the cache fails?"

**You (40-45 min):**

```
You: "Good question. If Redis fails:

1. Immediate fallback:
   - API server detects cache failure
   - Directly queries DynamoDB
   - Latency rises to ~50ms (at the edge of the <50ms redirect target)

2. Failover:
   - Redis Cluster has automatic failover
   - One replica becomes master
   - Service restored in seconds

3. Local cache:
   - Each API server has local LRU cache
   - Temporary buffer until Redis recovers
   - Reduces load on DynamoDB during outage

4. Degradation mode:
   - If both caches fail and DB is overloaded
   - Serve 503 with Retry-After header
   - Better than serving stale/wrong data

This defense in depth helps us hold the 99.9% availability 
target even with cache failures."
```
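The fallback chain in that answer can be sketched as follows. This is a simplified illustration under assumptions: `shared_cache_get` simulates a Redis outage, `LocalLRU` is a hypothetical per-server cache, the `db_overloaded` flag stands in for a real load signal, and the `Retry-After` value of 5 seconds is arbitrary.

```python
from collections import OrderedDict
from typing import Dict, Optional, Tuple


class LocalLRU:
    """Small per-server LRU cache (hypothetical helper)."""

    def __init__(self, capacity: int = 1000) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)          # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict least recently used


class CacheUnavailable(Exception):
    """Raised by the stand-in shared-cache client when Redis is down."""


def shared_cache_get(key: str) -> Optional[str]:
    raise CacheUnavailable("simulated Redis outage")


database: Dict[str, str] = {"abc1234": "https://example.com/long"}
local = LocalLRU(capacity=2)


def redirect(short_code: str, db_overloaded: bool = False) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers), degrading step by step when caches fail."""
    try:
        url = shared_cache_get(short_code)
    except CacheUnavailable:
        url = local.get(short_code)           # fall back to the local cache
    if url is None:
        if db_overloaded:
            return 503, {"Retry-After": "5"}  # shed load instead of piling on
        url = database.get(short_code)        # direct database query
        if url is None:
            return 404, {}
        local.put(short_code, url)            # buffer until Redis recovers
    return 301, {"Location": url}
```

Checking the local cache before the database is a design choice in this sketch: it keeps repeat lookups off the database while the shared cache is unavailable.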

---

## **Chapter Summary**

### **Key Takeaways**

| Concept | Summary |
|---------|---------|
| **4S Framework** | Structured approach: Scope → Sketch → Solidify → Scale |
| **Scope** | Define functional/non-functional requirements and out-of-scope |
| **Sketch** | Estimate QPS, storage, and bandwidth using simple math |
| **Solidify** | Design data model, choose database, define APIs |
| **Scale** | Draw architecture, explain data flow, deep-dive into components |
| **Trade-offs** | Every decision has pros and cons—discuss them explicitly |
| **Bottlenecks** | Proactively identify potential failure points and mitigations |
| **Diagrams** | Draw progressively, label everything, show data flow |

### **Common Mistakes to Avoid**

| Mistake | Why It's Bad | How to Fix It |
|---------|--------------|---------------|
| Jumping straight to design | You might solve the wrong problem | Always start with SCOPE and clarifying questions |
| Ignoring estimations | You'll under- or over-engineer | Always do back-of-the-envelope calculations |
| Over-complicating early | Wastes time and confuses interviewer | Start simple, add complexity only if needed |
| No trade-off discussion | Shows you don't understand alternatives | Always discuss 2-3 options and your reasoning |
| Forgetting bottlenecks | System might fail at scale | Proactively identify and mitigate bottlenecks |
| Poor diagramming | Harder for interviewer to follow | Use clear labels, standard symbols, show data flow |

### **4S Framework Checklist**

Use this during your interview:

```
□ SCOPE:
  □ Clarified the problem
  □ Asked relevant questions
  □ Documented functional requirements
  □ Documented non-functional requirements
  □ Defined out-of-scope

□ SKETCH:
  □ Calculated QPS (peak and average)
  □ Estimated storage
  □ Estimated bandwidth
  □ Identified scale constraints

□ SOLIDIFY:
  □ Identified entities
  □ Designed data model
  □ Chose database type
  □ Defined API endpoints
  □ Specified request/response formats

□ SCALE:
  □ Identified key components
  □ Drew high-level architecture
  □ Explained data flow
  □ Chose technologies with reasoning
  □ Dived into 2-3 critical components
  □ Identified bottlenecks
  □ Discussed trade-offs
  □ Proposed alternatives
```

---

## **Exercises**

### **Exercise 1: Apply 4S to a New Problem**

Apply the 4S framework to "Design a chat application."

**Hint:** Consider:
- Real-time messaging (WebSockets)
- Message ordering
- Offline message delivery
- Read receipts
- Group chats vs. direct messages

### **Exercise 2: Estimation Practice**

Calculate QPS, storage, and bandwidth for:

- A photo-sharing app with 1M users, each uploading 5 photos/day
- Average photo size: 2MB
- Each photo viewed 10 times on average
- Assume 5x peak multiplier

**Solution:**
```
Daily uploads: 1M users × 5 photos = 5M photos/day
Upload QPS: 5M / 86,400 = 58 QPS avg, 290 QPS peak
Daily views: 5M × 10 = 50M views/day
View QPS: 50M / 86,400 = 578 QPS avg, 2,890 QPS peak

Storage per day: 5M photos × 2MB = 10TB/day
Per year: 10TB × 365 = 3.65PB/year
5 years: ~18PB

Upload bandwidth: 2MB × 290 QPS = 580 MB/s
View bandwidth: 2MB × 2,890 QPS = 5.78 GB/s
Total: ~6.4 GB/s
```

### **Exercise 3: Trade-off Analysis**

For each decision, list 2 alternatives and explain your choice:

1. SQL vs. NoSQL for photo metadata
2. Blob storage (S3) vs. database for photo files
3. CDN vs. direct storage for serving photos

---

## **Further Reading**

| Resource | Description |
|----------|-------------|
| [System Design Primer](https://github.com/donnemartin/system-design-primer) | Comprehensive GitHub repo with diagrams |
| [Designing Data-Intensive Applications](https://dataintensive.net/) | Martin Kleppmann's book on distributed systems |
| [DDIA - Chapters 5-7](https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903119/) | Replication, partitioning, transactions |
| [Google File System Paper](https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf) | Classic distributed file system paper |
| [Dynamo Paper](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) | Amazon's key-value store design |

---

**Next:** In Chapter 11, we'll dive deep into **Reliability & Fault Tolerance**, exploring how to design systems that survive failures, disasters, and unexpected events. We'll cover redundancy patterns, retry strategies, chaos engineering, and disaster recovery planning.


