# **Chapter 26: Mock Interviews & Solutions**

This chapter applies the 4S framework to the most common system design interview questions. Each walkthrough demonstrates exactly how to structure your response, what details to include, and how to navigate trade-offs. Study these patterns, but remember: the goal isn't memorization—it's understanding how to reason through ambiguity.

---

## **Problem 1: Design a URL Shortener (TinyURL)**

**Difficulty**: Easy to Medium  
**Key Concepts**: Hashing, Database Sharding, Caching, Analytics  
**Estimated Time**: 45 minutes

### **SCOPE: Requirements Gathering**

> **Candidate**: "Before I begin, let me clarify the requirements. For functional requirements, I'm assuming:
> 1. Users submit a long URL and receive a short alias (e.g., `short.io/abc123`)
> 2. Visiting the short URL redirects to the original (301 permanent redirect)
> 3. Optional custom aliases (`short.io/my-brand`)
> 4. Links expire after a configurable time (default 1 year)
> 5. Basic analytics: click count, referrer, geographic distribution
>
> For non-functional requirements, what scale should I target?"
>
> **Interviewer**: "Design for 100 million new URLs per month, 10 billion redirects per month. Read-heavy."
>
> **Candidate**: "Got it. And for latency—I'm assuming sub-100ms for redirects is acceptable? Should we support global distribution or start single-region?"
>
> **Interviewer**: "Global distribution for latency, 99.9% availability."

### **SKETCH: Back-of-the-Envelope Estimation**

> "Let me calculate the scale:
>
> **Write throughput**: 100M URLs/month ÷ 2.6M seconds ≈ **40 URLs/second**. Peak 5× = **200/s**.
>
> **Read throughput**: 10B redirects/month ÷ 2.6M seconds ≈ **4,000/s**. Peak 10× = **40,000/s**.
>
> **Storage calculations**:
> - Average URL: 500 bytes (long) + 50 bytes (short) + 100 bytes (metadata) = **650 bytes/URL**
> - 100M/month × 12 months × 5 years = 6B URLs
> - 6B × 650 bytes = **3.9 TB** (plus 30% overhead for indexes = **~5 TB**)
>
> **Bandwidth**: 40,000 redirects/s × 500 bytes = **20 MB/s** (160 Mbps)—well within 1 Gbps.
>
> **Analytics storage**: If we log every click with IP, timestamp, user-agent (~200 bytes):
> - 10B clicks/month × 200 bytes = **2 TB/month** of clickstream data.
>
> These numbers tell me that a single database can handle the writes, but we need caching for reads and a separate analytics pipeline."
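
The arithmetic above is easy to reproduce in a few lines, which is a useful habit when practicing estimates. The sketch below recomputes the averages using a 30-day month; the variable and function names are illustrative and not part of the design.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000, i.e. the ~2.6M used above

def per_second(events_per_month: float) -> float:
    """Average events per second over a 30-day month."""
    return events_per_month / SECONDS_PER_MONTH

new_urls_per_month = 100_000_000
redirects_per_month = 10_000_000_000

write_qps = per_second(new_urls_per_month)
read_qps = per_second(redirects_per_month)

bytes_per_url = 500 + 50 + 100            # long URL + short URL + metadata
total_urls = new_urls_per_month * 12 * 5  # five years of links
storage_tb = total_urls * bytes_per_url / 1e12
clickstream_tb_per_month = redirects_per_month * 200 / 1e12

print(f"writes/s ~ {write_qps:.0f}, reads/s ~ {read_qps:.0f}")
print(f"URL storage ~ {storage_tb:.1f} TB, clickstream ~ {clickstream_tb_per_month:.0f} TB/month")
```

The exact averages land slightly below the rounded figures quoted in the walkthrough, which is fine: interview estimates only need the right order of magnitude.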

### **SOLIDIFY: API and Data Model**

**API Design**:

```http
POST /api/v1/shorten
Content-Type: application/json

{
  "long_url": "https://example.com/very/long/path?query=params",
  "custom_alias": "mylink",        // optional
  "expires_in_days": 30            // optional, default 365
}

Response 201 Created:
{
  "short_code": "mylink",          // or auto-generated "a7x9k2"
  "short_url": "https://short.io/mylink",
  "long_url": "https://example.com/...",
  "created_at": "2024-01-15T10:30:00Z",
  "expires_at": "2024-02-14T10:30:00Z"
}

Response 409 Conflict:
{
  "error": "Custom alias already taken",
  "suggested": ["mylink123", "mylink2024"]
}
```

```http
GET /{short_code}
Response 301 Moved Permanently:
Location: https://example.com/very/long/path
[Sets tracking cookie for analytics]

Response 410 Gone:
{
  "error": "Link expired or deleted"
}
```

**Database Schema**:

```sql
-- Main URL table (sharded by short_code hash)
CREATE TABLE url_mappings (
    short_code VARCHAR(10) PRIMARY KEY,
    long_url TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expires_at TIMESTAMP,
    user_id UUID,
    click_count BIGINT DEFAULT 0,
    is_custom BOOLEAN DEFAULT FALSE
);

-- Indexes
CREATE INDEX idx_expires_at ON url_mappings(expires_at) 
WHERE expires_at IS NOT NULL;

-- Analytics table (time-series, separate database)
CREATE TABLE click_events (
    event_id BIGSERIAL,
    short_code VARCHAR(10),
    clicked_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    ip_address INET,
    country_code CHAR(2),
    referrer TEXT,
    user_agent TEXT
) PARTITION BY RANGE (clicked_at);
```

**URL Generation Strategy**:

> "For generating short codes, I have two options:
>
> **Option A: Hash-based (MD5/SHA)**
> - Hash the long URL, take first 7 characters
> - Check for collision, if exists append counter
> - Pros: Deterministic (same URL → same short code)
> - Cons: Collisions increase with scale, complex collision handling
>
> **Option B: Base62 Counter**
> - Use auto-increment ID (1, 2, 3...) → convert to Base62 (a-z, A-Z, 0-9)
> - 7 characters = 62^7 ≈ 3.5 trillion unique URLs
> - Pros: No collisions, sequential (good for B-tree inserts), simple
> - Cons: Predictable (sequential IDs expose creation rate), need distributed counter
>
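> **Decision**: I'll use Base62 over IDs from a distributed ID generator (Snowflake-style) to avoid a single point of failure in the counter. A full 64-bit Snowflake ID encodes to as many as 11 Base62 characters, so keeping codes at 7 characters means capping IDs at about 41 bits, for example by having each server lease ID ranges from a shared counter. For custom aliases, we reserve those in a separate namespace."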
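
To make Option B concrete, here is a minimal sketch of Base62 encoding and decoding for a numeric ID. The alphabet ordering (digits, then lowercase, then uppercase) and the function names are assumptions; any fixed ordering of the 62 symbols works as long as encoder and decoder agree.

```python
import string

# Alphabet order is an arbitrary choice for this sketch.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)  # 62

def base62_encode(n: int) -> str:
    """Convert a non-negative integer ID to its Base62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

def base62_decode(code: str) -> int:
    """Convert a Base62 string back to the integer ID."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

Every ID below 62^7 fits in at most 7 characters, which is where the 3.5 trillion figure comes from.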

### **SCALE: Architecture and Deep Dives**

**High-Level Architecture**:

```
[User] → [DNS/GeoDNS] → [CDN] → [Load Balancer]
                              ↓
                   [API Servers] (Stateless, Auto-scaling)
                    /          \
   [Write Path]  /              \  [Read Path]
                ↓                ↓
        [ID Generator]      [Redis Cluster]
        (Snowflake)         (URL Cache)
                ↓                ↓
        [Database] ←──────────────┘
     (PostgreSQL/MySQL)   (Cache miss fetch)
                ↓
        [Kafka] → [ClickHouse] (Analytics)
```

**Deep Dive 1: Database Sharding**

> "With 5TB over 5 years, we might need to shard. I'll use **hash-based sharding** on `short_code` because:
> 1. **Uniform distribution**: Hashing spreads short codes evenly, even when the underlying IDs are sequential
> 2. **Direct lookup**: No routing table needed; the shard is computed from `hash(code)`
> 3. **No hot spots**: Unlike time-based sharding where recent data is hot
>
> **Rebalancing strategy**: A plain `hash(code) % N` remaps almost every key when N changes, so I'd place shards on a consistent-hash ring with virtual nodes. Adding a shard to N existing ones then moves only about 1/(N+1) of the data."
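
The rebalancing behavior is easier to trust once you have seen it in code. Below is a minimal sketch of a consistent-hash ring with virtual nodes; the class name, the MD5-based hash, and the default of 100 virtual nodes per shard are all assumptions chosen for illustration.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, vnodes: int = 100):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> shard name

    def add_shard(self, shard: str) -> None:
        # Each shard owns many small arcs of the ring, which evens out load.
        for i in range(self.vnodes):
            point = _hash(f"{shard}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = shard

    def shard_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

When a shard is added, the only keys that change owner are the ones the new shard takes over; no key moves between two existing shards.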

**Deep Dive 2: Caching Strategy**

> "For 40,000 reads/second at peak with a 100:1 read-to-write ratio:
> - **Cache tier**: Redis Cluster sized for the hot 20% of URLs. That is 1.2B URLs × ~650 bytes ≈ 780 GB, so roughly 800 GB of RAM across the cluster, likely serving 95% of traffic. A 100 GB tier would hold only about 150M entries.
> - **TTL strategy**: 24 hours for popular URLs, 1 hour for others
> - **Cache population (cache-aside)**: On a miss, read from the DB and write the entry to the cache so the next request hits
> - **Thundering herd**: Use lease tokens so only one thread fetches from the DB; others wait or serve stale data temporarily"
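
The lease idea can be sketched in-process with a lock and an event per key. This is an illustration only: `SingleFlightCache` and its `loader` callback are hypothetical names, a plain dict stands in for Redis, and TTLs and stale serving are left out.

```python
import threading

class SingleFlightCache:
    def __init__(self, loader):
        self._loader = loader   # function: key -> value (the database lookup)
        self._cache = {}        # stands in for Redis
        self._inflight = {}     # key -> Event set when the current load finishes
        self._lock = threading.Lock()

    def get(self, key):
        while True:
            with self._lock:
                if key in self._cache:
                    return self._cache[key]
                event = self._inflight.get(key)
                leader = event is None
                if leader:
                    # This caller takes the lease and performs the load.
                    event = threading.Event()
                    self._inflight[key] = event
            if leader:
                try:
                    value = self._loader(key)
                    with self._lock:
                        self._cache[key] = value
                    return value
                finally:
                    with self._lock:
                        del self._inflight[key]
                    event.set()  # wake the waiters
            else:
                event.wait()     # then loop and re-check the cache
```

If the loader raises, the waiters wake up, find the cache still empty, and one of them takes a new lease, so a single failed fetch does not wedge the key.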

**Deep Dive 3: Analytics Pipeline**

> "10B clicks/month = 3,800 clicks/second average, 38,000 peak.
>
> **Problem**: Writing every click to SQL database would kill performance.
>
> **Solution**: 
> 1. **Click logs**: Append-only Kafka topic (high throughput, retention 7 days)
> 2. **Stream processing**: Flink jobs aggregate real-time metrics (clicks per URL per hour)
> 3. **Data warehouse**: Copy to ClickHouse for ad-hoc analytics (fast columnar queries)
> 4. **Pre-aggregation**: Update counter cache in Redis for 'click_count' displayed on dashboard"
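
The "clicks per URL per hour" aggregation in step 2 is a tumbling-window count. A real deployment would express it as a Flink job; the pure-Python sketch below only shows the logic, with hypothetical function names and events reduced to `(short_code, unix_timestamp)` pairs.

```python
from collections import Counter
from datetime import datetime, timezone

def hour_bucket(ts: float) -> str:
    """Truncate a Unix timestamp to its UTC hour, e.g. '2024-01-15T10'."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H")

def aggregate_clicks(events):
    """Count clicks per (short_code, hour) from (short_code, unix_timestamp) pairs."""
    counts = Counter()
    for short_code, ts in events:
        counts[(short_code, hour_bucket(ts))] += 1
    return counts
```

Storing one row per URL per hour instead of one row per click is what keeps dashboard queries cheap.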

**Deep Dive 4: Global Distribution**

> "For global low latency:
> 1. **Multi-region deployment**: US-East, US-West, EU, Asia-Pacific
> 2. **Database**: Single primary for writes (US-East), read replicas in other regions
> 3. **Cache**: Redis in each region; entries are invalidated on write via invalidation messages
> 4. **CDN**: CloudFront/Cloudflare caches 301 redirects, with the cache lifetime capped at the link's expiry time
>
> **Write latency**: Acceptable—writes are rare (200/s) and don't need to be instant.
> **Read latency**: <50ms via CDN + regional cache."

**Trade-offs Discussed**:

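> "I chose 301 (permanent) over 302 (temporary) redirects because:
> - **Pros**: Browsers cache 301s (faster subsequent visits, less server load)
> - **Cons**: Cannot reliably change the destination once issued; if we delete or expire a URL, cached 301s still work in browsers; and repeat visits served from the browser cache never reach us, so click analytics undercount. If accurate analytics mattered more than load, 302 would be the better choice.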
>
> I chose eventual consistency for analytics (real-time not required) over strong consistency to achieve higher write throughput."

---

## **Problem 2: Design Twitter News Feed**

**Difficulty**: Hard  
**Key Concepts**: Fan-out Problem, Timeline Generation, Push vs. Pull  
**Estimated Time**: 45-50 minutes

### **SCOPE: Requirements Gathering**

> "For functional requirements:
> - Users post tweets (text, images, video)
> - Users follow other users
> - Home timeline shows tweets from followed users (reverse chronological)
> - User timeline shows user's own tweets
> - Like, retweet, reply functionality
>
> Scale targets?"
>
> **Interviewer**: "10 million DAU, average 200 followers per user, 100 million tweets per day."
>
> "Non-functional: Timeline generation <200ms, post tweet <500ms, available 99.9%."

### **SKETCH: Estimations**

> **Tweet volume**: 100M/day ÷ 86,400s = **1,150 tweets/second** (peak 5× = 5,750/s)
>
> **Timeline reads**: 10M DAU × 10 timeline views/day = 100M views/day = **1,150 reads/second** (peak 10× = 11,500/s)
>
> **Storage**:
> - Tweet: 140 chars = 280 bytes + metadata = 500 bytes
> - Media: Average 200KB per tweet (20% have media)
> - 100M tweets × (500 bytes + 0.2×200KB) ≈ **4 TB/day** (~1.5 PB/year)
>
> **Fan-out**: 10M users × 200 followers = 2 billion follow edges. At 1,150 tweets/second × 200 followers, a pure push model performs about **230,000 timeline writes/second**.

### **SOLIDIFY: Data Model**

**Relational Schema**:

```sql
-- Users table
CREATE TABLE users (
    user_id BIGINT PRIMARY KEY,
    username VARCHAR(15) UNIQUE NOT NULL,
    follower_count INT DEFAULT 0,
    following_count INT DEFAULT 0,
    created_at TIMESTAMP
);

-- Follows relationship (graph edge)
CREATE TABLE follows (
    follower_id BIGINT REFERENCES users(user_id),
    following_id BIGINT REFERENCES users(user_id),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (follower_id, following_id)
);

-- Tweets
CREATE TABLE tweets (
    tweet_id BIGINT PRIMARY KEY,
    user_id BIGINT REFERENCES users(user_id),
    content TEXT,
    media_urls TEXT[], -- Array of URLs
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    like_count INT DEFAULT 0,
    retweet_count INT DEFAULT 0
);

-- Indexes for timeline queries
CREATE INDEX idx_tweets_user_time ON tweets(user_id, created_at DESC);
CREATE INDEX idx_follows_follower ON follows(follower_id);
```

**NoSQL Alternative (Cassandra)**:

> "For the timeline itself, I'll use Cassandra with wide rows:
> ```sql
> CREATE TABLE user_timeline (
>     user_id BIGINT,
>     tweet_id BIGINT,
>     author_id BIGINT,
>     content TEXT,
>     created_at TIMESTAMP,
>     PRIMARY KEY (user_id, created_at, tweet_id)
> ) WITH CLUSTERING ORDER BY (created_at DESC);
> ```
> This serves a user's timeline by time range from a single partition, already sorted newest first."

### **SCALE: The Fan-Out Problem**

**The Core Challenge**:

> "When Kim Kardashian (100M followers) tweets, do we:
> 1. **Push model**: Write the tweet to all 100M followers' timelines immediately (write-heavy)
> 2. **Pull model**: Fetch tweets from all followed users when timeline is requested (read-heavy)
> 3. **Hybrid**: Push to normal users, pull for celebrities?"

**Hybrid Approach (The Solution)**:

```
Celebrity Threshold: 1 million followers

Normal User (500 followers) tweets:
  → Push to 500 timelines immediately (fast, cheap)
  → Read timeline: O(1) fetch from Redis/cache

Celebrity (Kim K) tweets:
  → Don't push to anyone
  → When user requests timeline:
     1. Fetch tweets from normal follows (from cache)
     2. Fetch recent celebrity tweets separately (merge on read)
     3. Merge and sort by time
```

**Architecture Diagram**:

```
[Post Tweet] → [Load Balancer] → [Tweet Service]
                     ↓
              [Fan-out Service]
               /            \
   [Normal User]              [Celebrity]
         ↓                         ↓
   [Write to Redis]          [Skip push]
   (User timelines)           
         ↓
   [Async to DB]
   
[Read Timeline] → [Timeline Service]
       ↓
   [Redis: User timeline]
       ↓
   [Merge with celeb tweets]
       ↓
   [Return sorted]
```

**Deep Dive 1: Timeline Generation Algorithm**

> **For normal users (Push model)**:
> ```python
> def post_tweet(user_id, content):
>     # 1. Store tweet
>     tweet_id = save_to_db(user_id, content)
>
>     # 2. Get followers (cached)
>     followers = get_followers(user_id)  # e.g., 500 users
>
>     # 3. Fan out to their timelines in one pipelined round trip
>     pipe = redis.pipeline()
>     for follower in followers:
>         key = f"timeline:{follower}"
>         pipe.lpush(key, tweet_id)
>         pipe.ltrim(key, 0, 999)  # Keep the newest 1,000 entries
>     pipe.execute()
>
>     return tweet_id
> ```
>
> **For the read path (merging in celebrity tweets)**:
> ```python
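> import heapq
> from datetime import datetime, timedelta, timezone
> from itertools import islice
>
> def get_timeline(user_id, cursor=None, limit=100):
>     # cursor-based pagination is omitted in this sketch
>     # 1. Precomputed timeline from Redis: tweet IDs, newest first
>     tweet_ids = redis.lrange(f"timeline:{user_id}", 0, limit - 1)
>     normal_tweets = get_tweets_by_ids(tweet_ids)  # hypothetical helper: IDs -> rows, order kept
>
>     # 2. Get followed celebrities
>     celebs = get_followed_celebrities(user_id)
>
>     # 3. Fetch each celebrity's recent tweets (last 24h), newest first
>     since = datetime.now(timezone.utc) - timedelta(hours=24)
>     celeb_tweets = [
>         db.query("SELECT * FROM tweets WHERE user_id = ? AND created_at > ? "
>                  "ORDER BY created_at DESC LIMIT 10", celeb, since)
>         for celeb in celebs
>     ]
>
>     # 4. Heap-merge the already-sorted lists, newest first
>     merged = heapq.merge(normal_tweets, *celeb_tweets,
>                          key=lambda t: t["created_at"], reverse=True)
>
>     return list(islice(merged, limit))  # Return the newest `limit` tweets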
> ```

**Deep Dive 2: Media Storage**

> "4TB/day of media requires object storage:
> - **Storage**: S3 with lifecycle policies (move to Glacier after 1 year)
> - **CDN**: CloudFront for global distribution
> - **Processing**: Lambda@Edge to resize images (thumbnail vs full)
> - **URL structure**: `https://media.twitter.com/{user_id}/{tweet_id}/{size}.jpg`
>   - Sizes: thumb (150×150), medium (600×400), large (1200×800)"

**Deep Dive 3: Counter Consistency**

> "Like counts and retweet counts must converge to the correct value but don't need real-time precision.
>
> **Strategy**:
> 1. **Write path**: Increment counter in Redis (fast)
> 2. **Sync**: Every 10 seconds, flush Redis counters to persistent DB
> 3. **Read path**: Read from Redis (approximate) or DB (exact) based on use case
>
> **Failure recovery**: If Redis restarts and loses unflushed counts, recalculate them from the event log (Kafka) or accept minor inaccuracy temporarily."
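
The write path, periodic sync, and approximate read can be sketched as a write-behind buffer. In this illustration a `Counter` stands in for Redis and a dict stands in for the tweets table; `BufferedCounters` and its method names are hypothetical.

```python
from collections import Counter

class BufferedCounters:
    """Accumulate increments in memory and flush deltas to a durable store."""

    def __init__(self, store: dict):
        self._pending = Counter()  # stands in for Redis
        self._store = store        # stands in for the persistent DB

    def incr(self, tweet_id: int, amount: int = 1) -> None:
        # Write path: cheap in-memory increment.
        self._pending[tweet_id] += amount

    def read_approx(self, tweet_id: int) -> int:
        # Read path: durable value plus whatever has not been flushed yet.
        return self._store.get(tweet_id, 0) + self._pending[tweet_id]

    def flush(self) -> int:
        """Apply pending deltas to the store; return how many rows were touched."""
        pending, self._pending = self._pending, Counter()
        for tweet_id, delta in pending.items():
            self._store[tweet_id] = self._store.get(tweet_id, 0) + delta
        return len(pending)
```

Calling `flush()` on a timer (every 10 seconds in the strategy above) turns many small increments into one write per tweet per interval.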

---

## **Problem 3: Design a Chat System (WhatsApp/Slack)**

**Difficulty**: Hard  
**Key Concepts**: WebSockets, Message Ordering, Presence, Read Receipts  
**Estimated Time**: 50 minutes

### **SCOPE**

> "Requirements clarification:
> - 1-on-1 messaging and group chats (up to 500 people)
> - Online status (last seen)
> - Read receipts (double checkmarks)
> - Message history (searchable)
> - Media sharing (images, files)
> - End-to-end encryption (optional but mentioned)
>
> Scale: 1 billion daily active users, 100 billion messages/day."

### **SKETCH**

> **Messages**: 100B/day ÷ 86,400 = **1.16 million messages/second** (peak 5M/s)
>
> **Storage**: Average message 100 bytes + metadata = 200 bytes
> - 100B × 200 bytes = **20 TB/day** (7.3 PB/year)
>
> **Connections**: 1B users, 20% online simultaneously = **200 million concurrent connections**

### **SOLIDIFY: Data Model**

**Message Table (Cassandra)**:

```sql
CREATE TABLE messages (
    chat_id TEXT,           -- user1_user2 (sorted) or group_id
    message_id BIGINT,      -- Snowflake ID (time-based)
    sender_id BIGINT,
    content TEXT,
    media_url TEXT,
    created_at TIMESTAMP,
    status TINYINT,         -- 0:sent, 1:delivered, 2:read
    PRIMARY KEY (chat_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
```

**User Sessions (Redis)**:

```
# Online status
SET user:{user_id}:status "online" EX 60  # Expires in 60s unless refreshed

# Active connections (for routing)
SADD user:{user_id}:connections "ws_server_42" "ws_server_15"

# Last seen
SET user:{user_id}:last_seen "2024-01-15T10:30:00Z"
```

### **SCALE: Real-Time Architecture**

**Connection Handling**:

> "200M concurrent WebSocket connections cannot fit on one server.
>
> **Architecture**:
> - **Gateway Layer**: HAProxy with sticky sessions (IP hash) routes to WebSocket servers
> - **WebSocket Servers**: 10,000 servers × 20,000 connections each = 200M capacity
> - **No durable state**: WebSocket servers keep only in-memory state (the connection mapping). If a server crashes, clients reconnect to a different server and re-fetch missed messages.
>
> **Connection Management**:
> ```python
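> from websockets.exceptions import ConnectionClosed  # assumes the `websockets` library
>
> class ChatServer:
>     def __init__(self, server_id):
>         self.server_id = server_id  # e.g., "ws_server_42"
>         self.connections = {}       # user_id -> WebSocket connection
>
>     async def handle_connection(self, ws, user_id):
>         # Register connection locally and in the routing table (Redis)
>         self.connections[user_id] = ws
>         await redis.sadd(f"user:{user_id}:connections", self.server_id)
>
>         try:
>             # Send undelivered messages (offline messages)
>             pending = await db.get_pending_messages(user_id)
>             for msg in pending:
>                 await ws.send(msg)
>
>             while True:
>                 msg = await ws.recv()
>                 await handle_message(user_id, msg)
>         except ConnectionClosed:
>             pass  # Normal disconnect; cleanup happens below
>         finally:
>             # Always deregister, even if an unexpected error ends the loop
>             await redis.srem(f"user:{user_id}:connections", self.server_id)
>             self.connections.pop(user_id, None)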
> ```

**Message Flow**:

```
User A (Server 1) → "Hello" 
    ↓
Message Service validates, stores in DB (Cassandra)
    ↓
Pub/Sub (Redis/Kafka): "new_message:{chat_id}"
    ↓
User B Online? 
    ├─ Yes (Server 5): Push via WebSocket immediately
    └─ No: Store in "pending:{user_id}" queue, send push notification
```

**Deep Dive 1: Message Ordering**

> "Problem: Network latency causes messages to arrive out of order.
>
> **Solution**:
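> 1. **Server-assigned sequence numbers**: Not client timestamps (clocks drift)
> 2. **Snowflake IDs**: Embed timestamp + sequence for ordering (one unused sign bit plus the 63 bits below)
>    - 41 bits: timestamp (ms since a custom epoch)
>    - 10 bits: machine ID
>    - 12 bits: sequence number (4096 messages/ms per machine)
>    - IDs from different machines are ordered only to within clock skew, so route each chat through one sequencer, or keep a per-chat counter, when strict order matters
> 3. **Causality tracking**: Vector clocks are needed only if several servers accept writes to the same chat concurrently; ephemeral signals such as 'User is typing...' need no ordering guarantees"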
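
The 41/10/12 bit layout translates directly into shifts and masks. The generator below is a sketch, not a production implementation: the epoch constant, class name, and injectable `clock` parameter are assumptions, and a backwards clock step simply raises an error.

```python
import threading
import time

EPOCH_MS = 1_288_834_974_657  # custom epoch (assumption; any fixed past instant works)

class SnowflakeGenerator:
    def __init__(self, machine_id: int, clock=lambda: int(time.time() * 1000)):
        if not 0 <= machine_id < 1024:
            raise ValueError("machine_id must fit in 10 bits")
        self._machine_id = machine_id
        self._clock = clock  # returns current time in milliseconds
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # All 4096 IDs for this millisecond are used; wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - EPOCH_MS) << 22) | (self._machine_id << 12) | self._sequence

def timestamp_ms(snowflake_id: int) -> int:
    """Recover the creation time (Unix ms) embedded in an ID."""
    return (snowflake_id >> 22) + EPOCH_MS
```

Because the timestamp occupies the high bits, sorting IDs from one generator sorts messages by creation time, which is why `message_id` works as a clustering key.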

**Deep Dive 2: Group Chats**

> "500 people in a group each sending 1 msg/sec = 500 messages/sec inbound, and each message goes to the other 499 members, so about 250,000 deliveries/sec.
>
> **Optimization**:
> - **Fan-out on read vs write**: For small groups (<100), fan-out on write (push to all). For large groups, fan-out on read (pull when user opens app).
> - **Presence optimization**: Only send 'typing' indicators to users who have app open (check Redis presence set)
> - **Message synchronization**: Use 'sync tokens'—client sends last received message ID, server sends only newer messages (efficient for reconnection)"
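
The sync-token idea reduces to "give me everything after this ID". A minimal sketch, assuming the chat's message IDs are available as an ascending list (time-ordered IDs sort this way); the function name and the `limit` default are illustrative.

```python
import bisect

def sync(message_ids, last_seen_id, limit=100):
    """Return (batch, next_token) for a client that last saw `last_seen_id`.

    `message_ids` must be sorted ascending.
    """
    start = bisect.bisect_right(message_ids, last_seen_id)
    batch = message_ids[start:start + limit]
    # The newest ID delivered becomes the token for the next call.
    next_token = batch[-1] if batch else last_seen_id
    return batch, next_token
```

A reconnecting client calls this in a loop, passing back `next_token` each time, until it receives an empty batch.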

**Deep Dive 3: Read Receipts**

> "Double checkmark logic:
> 1. **Sent**: Message stored in sender's outbox
> 2. **Delivered**: Recipient's server ACK received (store delivery receipt in DB)
> 3. **Read**: Recipient opened chat and fetched messages (update status)
>
> **Scaling read receipts**:
> - Don't update DB for every read (write amplification)
> - Batch updates: Update 'read' status in Redis, flush to DB every 5 seconds or when user leaves chat
> - For group chats: Store 'read_by' as bitmap or separate table (message_id, user_id, read_at)"
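
For the bitmap option, each group member gets a fixed bit position and a message's read state is a single integer. The class below is a sketch with hypothetical names; it assumes membership is fixed for the lifetime of the message.

```python
class GroupReadReceipt:
    """Track which members have read one message, using an int as a bitmap."""

    def __init__(self, member_ids):
        # A fixed member order gives each user a bit position.
        self._bit = {uid: i for i, uid in enumerate(member_ids)}
        self._bitmap = 0

    def mark_read(self, user_id) -> None:
        self._bitmap |= 1 << self._bit[user_id]

    def has_read(self, user_id) -> bool:
        return bool((self._bitmap >> self._bit[user_id]) & 1)

    def read_count(self) -> int:
        return bin(self._bitmap).count("1")

    def read_by_all(self) -> bool:
        return self.read_count() == len(self._bit)
```

For a 500-member group this costs 500 bits (about 63 bytes) per message, compared with up to 500 rows in a separate receipts table.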

---

## **Problem 4: Design Uber (Ride Sharing)**

**Difficulty**: Hard  
**Key Concepts**: Geo-Spatial Indexing, Matching Algorithm, Supply-Demand  
**Estimated Time**: 50 minutes

### **SCOPE**

> "Functional:
> - Riders request rides (real-time tracking)
> - Drivers accept/decline requests
> - ETA calculation and route optimization
> - Surge pricing
> - Payment processing (out of scope, just mention)
>
> Scale: 100 million rides per month, 5 million drivers."

### **SKETCH**

> **Rides**: 100M/month ÷ 2.6M seconds = **38 rides/second** (peak 200/s)
>
> **Location updates**: 5M drivers × every 4 seconds = **1.25M updates/second**
>
> **Storage**: GPS points (lat, long, timestamp) = 24 bytes
> - 1.25M × 24 bytes × 86400 seconds ≈ **2.6 TB/day of location data**

### **SCALE: Geo-Spatial Architecture**

**The Dispatcher Service**:

```
[Rider Request] → [Load Balancer] → [API Gateway]
                                          ↓
                                    [Dispatch Service]
                                          ↓
                    ┌─────────────────────┼─────────────────────┐
                    ↓                     ↓                     ↓
            [Geospatial DB]      [Matching Engine]      [Notification]
            (Driver locations)   (Find nearest driver)  (APNS/FCM)
```

**Geospatial Indexing**:

> "Problem: Find nearest 10 drivers to rider (lat, lng) efficiently.
>
> **Solution 1: Geohash** (simpler)
> - Encode lat/lng to 8-character string (precision ~20 meters)
> - Use as Redis key: `drivers:geohash:{hash}`
> - Query adjacent 9 cells (center + 8 neighbors) to ensure coverage
>
> **Solution 2: S2 Geometry** (Google's library, more accurate)
> - Hilbert curve space-filling curve
> - Cells at multiple levels (level 10 = ~10km, level 16 = ~150m)
> - Range queries on cell IDs
>
> **Implementation**:
> ```python
> def find_nearest_drivers(rider_lat, rider_lng, radius_m=5000, limit=10):
>     # Pick a precision whose cells are at least as wide as the search radius
>     precision = geohash_precision(radius_m)  # e.g., 5 chars (~4.9 km cells)
>     center_hash = geohash.encode(rider_lat, rider_lng, precision)
>
>     # Center cell plus its 8 neighbors
>     cells = [center_hash] + geohash.neighbors(center_hash)
>
>     drivers = []
>     for cell in cells:
>         # Fetch drivers in this cell; Redis GEO commands take longitude before latitude
>         drivers.extend(redis.georadius(
>             f"drivers:geohash:{cell}", rider_lng, rider_lat, radius_m,
>             unit="m", withdist=True))
>
>     # Each entry is [driver_id, distance_m]; sort by distance, return the closest
>     return sorted(drivers, key=lambda d: d[1])[:limit]
> ```
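
The snippet above relies on a library for `geohash.encode`. To show what the encoding actually does, here is a from-scratch sketch of the standard algorithm: repeatedly halve the longitude and latitude ranges, emit one bit per halving, and map each group of five bits to a base-32 character. The function name is an assumption.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat: float, lng: float, precision: int = 8) -> str:
    lat_range = [-90.0, 90.0]
    lng_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lng = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lng_range, lng) if use_lng else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half
        use_lng = not use_lng
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

Because each extra character only refines the same cell, a shorter hash is always a prefix of a longer one for the same point. That prefix property is what lets a lookup trade precision for coverage by truncating the key.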

**Matching Algorithm**:

> "Simple: Nearest driver. But reality is complex:
> - Driver rating
> - Driver preferences (UberX vs Pool)
> - Direction (don't match driver going opposite way)
> - ETA vs distance (highway vs city streets)
>
> **Scoring function**:
> ```python
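> def score_driver(driver, rider):
>     eta_s = calculate_eta(driver.location, rider.location)    # seconds
>     distance_km = haversine(driver.location, rider.location)  # assumed to return km
>     rating_bonus = (driver.rating - 4.5) * 100  # seconds reduction
>
>     # Weighted score (lower is better). The weights are illustrative and the
>     # terms mix units, so normalize them before using this for real ranking.
>     return (eta_s * 0.6) + (distance_km * 0.2) - rating_bonus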
> ```

**Location Update Optimization**:

> "1.25M updates/second is massive. Optimizations:
>
> 1. **Batching**: Drivers send location every 4 seconds, but we batch 10 updates before writing to database (reduce DB writes 10x)
> 2. **In-memory storage**: Keep recent locations in Redis (TTL 1 minute), archive to Cassandra for history
> 3. **Delta updates**: Only update if moved >50 meters (reduce noise)
> 4. **Regional sharding**: US-West drivers don't need to update servers in Asia"
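
The delta-update rule (forward a location only after the driver has moved more than 50 meters) can be sketched with a great-circle distance check. `LocationFilter` and `haversine_m` are hypothetical names, and the filter keeps its state in a local dict rather than a shared store.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius

def haversine_m(lat1, lng1, lat2, lng2) -> float:
    """Great-circle distance in meters between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

class LocationFilter:
    def __init__(self, min_move_m: float = 50.0):
        self._min_move_m = min_move_m
        self._last = {}  # driver_id -> (lat, lng) of the last forwarded update

    def should_forward(self, driver_id, lat, lng) -> bool:
        prev = self._last.get(driver_id)
        if prev is not None and haversine_m(*prev, lat, lng) <= self._min_move_m:
            return False  # too close to the last forwarded point: drop as noise
        self._last[driver_id] = (lat, lng)
        return True
```

One design caveat: a parked driver never passes the filter, so if locations in Redis expire after a minute, the filter needs to be paired with a periodic heartbeat that refreshes the TTL.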

**Surge Pricing**:

> "When demand > supply in a geo-fence (hexagonal grid):
> - Calculate supply/demand ratio per cell
> - If ratio < 0.5 (2 riders per driver), apply multiplier (1.2x, 1.5x, 2.0x)
> - Update every 2 minutes
> - Publish price updates through Kafka to the services that push them to clients"
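
The per-cell rule can be written as a small pure function. Only the 0.5 threshold and the three multipliers come from the discussion above; the intermediate tier boundaries (0.33 and 0.2) are assumptions made up for this sketch.

```python
def surge_multiplier(available_drivers: int, open_requests: int) -> float:
    """Return the price multiplier for one cell from its supply/demand ratio."""
    if open_requests == 0:
        return 1.0  # no demand, no surge
    ratio = available_drivers / open_requests
    if ratio >= 0.5:
        return 1.0
    if ratio >= 0.33:  # assumed tier boundary
        return 1.2
    if ratio >= 0.2:   # assumed tier boundary
        return 1.5
    return 2.0
```

Recomputing this per cell every two minutes keeps prices responsive without making them jitter on every individual request.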

---

## **The Interviewer's Perspective**

**What Interviewers Actually Look For**:

1. **Structured Thinking**: Did you follow a framework or jump randomly? Candidates who say "First, let me understand the requirements" score higher than those who immediately draw boxes.

2. **Trade-off Awareness**: The best candidates say "I could do X or Y. X gives us consistency but higher latency, Y gives speed but potential staleness. Given our requirements for financial transactions, I choose X."

3. **Practicality**: Junior candidates design for 1 billion users when asked for a startup MVP. Senior candidates say "Start with PostgreSQL on a single server, shard when we exceed 10k QPS."

4. **Deep Knowledge**: When you mention Kafka, be ready to explain partition replication, ISR (In-Sync Replicas), and exactly-once semantics. Surface-level buzzwords hurt more than help.

5. **Communication**: Do you check if the interviewer is following? Do you adapt when they hint ("Would that handle the write amplification?")?

**Red Flags**:
- Ignoring scale constraints ("The database will handle it")
- No discussion of failure modes
- Inability to calculate basic throughput/storage
- Over-complicating simple problems (microservices for a URL shortener)

**Green Flags**:
- Asking clarifying questions before designing
- Back-of-the-envelope math within 2x of correct answer
- Discussing monitoring and observability
- Admitting uncertainty but reasoning from first principles

---

## **System Design Checklist in Practice**

Apply this checklist to any problem:

**Before Drawing**:
- [ ] Clarified functional requirements (core features)
- [ ] Clarified non-functional (scale, latency, availability)
- [ ] Estimated QPS, storage, bandwidth
- [ ] Identified read-heavy vs write-heavy

**During Design**:
- [ ] Defined API contracts (REST/gRPC)
- [ ] Designed data schema (tables, NoSQL structures)
- [ ] Explained sharding strategy (if needed)
- [ ] Justified technology choices (Redis vs Memcached, SQL vs NoSQL)
- [ ] Addressed single points of failure
- [ ] Discussed caching strategy (what, where, eviction)

**Deep Dives**:
- [ ] Database: Indexing, replication, partitioning
- [ ] Caching: Consistency, thundering herd, cache warming
- [ ] Scalability: Horizontal scaling, load balancing, auto-scaling triggers
- [ ] Reliability: Circuit breakers, retries, dead letter queues, graceful degradation

**Production Readiness**:
- [ ] Monitoring: Metrics (latency, errors, saturation), alerting thresholds
- [ ] Security: Authentication, authorization, data encryption
- [ ] Deployment: Rolling updates, feature flags, rollback strategy

---

## **Final Advice**

**The Day Before**:
- Review latency numbers (L1 cache, SSD, network)
- Practice one estimation problem (Twitter, Uber, or YouTube)
- Get a full night's sleep (poor sleep noticeably impairs reasoning)

**During the Interview**:
- Bring a notebook (write down requirements so you don't forget)
- Speak slowly (nervousness makes people rush)
- Validate with interviewer ("Does this approach make sense for the scale we discussed?")

**Remember**: You're not building the perfect system. You're demonstrating that you can think systematically about complex problems, make justified trade-offs, and communicate technical concepts clearly.

---

## **Chapter Summary**

We walked through four distinct problems—URL shortener (hashing/caching), Twitter (fan-out), WhatsApp (real-time messaging), and Uber (geo-spatial)—demonstrating how to apply the 4S framework to diverse domains. Each solution emphasized different architectural patterns while maintaining the same structured approach.

The system design interview is a conversation about constraints, not a test of memorization. Master the fundamentals, practice the framework, and approach each problem with curiosity rather than anxiety.

**Congratulations**: You've completed the System Design Handbook. You now possess the architectural knowledge, communication frameworks, and analytical tools to design systems that power the modern internet.

---

**Final Exercise**: Pick one problem from this chapter. Record yourself explaining it in 45 minutes. Watch the recording and ask: "Would I hire this person?" Iterate until the answer is yes.

