# **Chapter 25: Interview Strategy & Communication**

The system design interview is fundamentally different from coding interviews. You're not proving you can write a sorting algorithm; you're demonstrating that you can architect complex systems, make trade-offs, and communicate technical decisions effectively. This chapter provides the framework, techniques, and communication strategies to excel in these high-stakes conversations.

---

## **25.1 The 4S Framework for System Design Interviews**

Most candidates fail not because they lack technical knowledge, but because they approach the interview haphazardly. The **4S Framework** provides a structured narrative that guides the interviewer through your thinking process.

### **The Framework Overview**

```
┌─────────────────────────────────────────────────────────────┐
│  SCOPE  →  SKETCH  →  SOLIDIFY  →  SCALE                     │
│  (5 min)   (10 min)    (15 min)     (20 min)                 │
│                                                              │
│  What to    Rough math    Data model    Deep dives,          │
│  build?     & capacity    & API design  trade-offs,          │
│             estimates                   bottlenecks          │
└─────────────────────────────────────────────────────────────┘
```

### **Phase 1: SCOPE (Requirements Gathering)**

**The Goal**: Before designing anything, understand what you're building. A beautiful design for the wrong requirements is worthless.

**Functional Requirements (What the system does)**:
- **Explicit**: Stated directly in the question
- **Implicit**: Assumed but not stated (user authentication, error handling)
- **Discovery**: Ask clarifying questions

**Non-Functional Requirements (How the system performs)**:
- **Scale**: Users, requests per second, data volume
- **Performance**: Latency (p50, p99), throughput
- **Availability**: Uptime SLA (99.9%, 99.99%)
- **Durability**: Data loss tolerance
- **Security**: Authentication, encryption, compliance
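
An availability target is easier to discuss once it is converted into a downtime budget. The sketch below does that conversion; the function name `allowed_downtime_minutes` is illustrative and not part of any library.

```python
def allowed_downtime_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for sla in (99.9, 99.99):
    print(f"{sla}% -> {allowed_downtime_minutes(sla):.1f} min/year")
# 99.9% -> 525.6 min/year   (about 8.8 hours)
# 99.99% -> 52.6 min/year
```

Each extra "nine" cuts the budget by a factor of ten, which is why it is worth asking the interviewer which target applies before choosing a replication strategy.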

**Example Interview Dialogue**:

> **Interviewer**: "Design a URL shortener like TinyURL."
>
> **Candidate**: "Before I start designing, let me clarify the requirements. For functional requirements, I'm assuming:
> - Users can submit a long URL and get a short alias
> - Visiting the short URL redirects to the original
> - Users can optionally specify a custom alias
> - We should track click analytics
>
> For non-functional requirements, what scale should I design for? Are we talking hundreds or millions of URLs per day? And what's the expected read-to-write ratio?"
>
> **Interviewer**: "Let's design for 100 million new URLs per month, 10 billion redirects per month. Read-heavy, 100:1 ratio."
>
> **Candidate**: "Got it. And for latency, I'm assuming sub-100ms for redirects is acceptable? For availability, standard 99.9%?"

**Why This Works**: The candidate demonstrates structured thinking, states assumptions out loud instead of making them silently, and establishes concrete constraints before designing.

### **Phase 2: SKETCH (Back-of-the-Envelope Estimation)**

**The Goal**: Prove the system is feasible and guide architectural decisions with rough calculations.

**Key Numbers to Memorize**:
```
Latency:
- L1 cache reference: 0.5 ns
- Main memory reference: 100 ns
- SSD read: 100 µs
- Network (same datacenter): 0.5 ms
- Network (cross-country): 50 ms
- Network (intercontinental): 150 ms

Throughput:
- Sequential disk read: 200 MB/s
- SSD read: 500 MB/s
- 1 Gbps network: 125 MB/s
- 10 Gbps network: 1.25 GB/s

Scale:
- 1 million seconds ≈ 11.5 days
- 1 billion seconds ≈ 31.7 years
```

**Estimation Example: URL Shortener**

> **Candidate**: "Let me estimate the scale. We said 100 million new URLs per month, 10 billion redirects.
>
> **Write throughput**: 100M/month ÷ 30 days ÷ 86400 seconds ≈ 40 URLs/second. Peak might be 5x, so 200 writes/second.
>
> **Read throughput**: 10B/month ÷ 2.6M seconds ≈ 4000 reads/second. With the same 5x peak factor, 20,000/second.
>
> **Storage**: Assuming average URL is 500 bytes (original) + 50 bytes (short URL) + 100 bytes (metadata) = 650 bytes per URL.
> 100M URLs/month × 650 bytes × 12 months × 5 years retention = ~4 TB of URL mappings.
>
> **Bandwidth**: 20,000 redirects/second × 500 bytes average = 10 MB/s = 80 Mbps. Well within 1 Gbps.
>
> So we need a system handling 200 writes/second, 20,000 reads/second, storing 4 TB. This is well within single-server capabilities, but for high availability, I'll design for distributed deployment."

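**Why This Matters**: These numbers tell us:
- Caching is worthwhile: reads outnumber writes 100:1 and peak near 20,000/second
- Writes are low enough (200/second at peak) for any mainstream database
- Storage is manageable (4 TB fits on one SSD)
- Capacity is not the constraint, so going multi-region is justified by availability and global latency (a 150 ms intercontinental round trip alone exceeds the 100 ms redirect target), not by load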
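
The arithmetic from the estimation dialogue can be checked in a few lines. This is a sketch using the figures quoted above; the 5x peak factor and the 30-day month are the candidate's stated assumptions, and the variable names are illustrative.

```python
SECONDS_PER_MONTH = 30 * 86_400            # 2,592,000, i.e. about 2.6M

new_urls_per_month = 100_000_000
redirects_per_month = 10_000_000_000
peak_factor = 5                            # assumed peak-to-average ratio
bytes_per_record = 500 + 50 + 100          # long URL + short URL + metadata
retention_months = 12 * 5

avg_writes = new_urls_per_month / SECONDS_PER_MONTH
avg_reads = redirects_per_month / SECONDS_PER_MONTH
storage_tb = new_urls_per_month * bytes_per_record * retention_months / 1e12
# Bandwidth uses the rounded peak of 20,000 redirects/second x 500 bytes
peak_bandwidth_mbps = 20_000 * 500 * 8 / 1e6

print(f"writes/s: avg {avg_writes:.0f}, peak {avg_writes * peak_factor:.0f}")
print(f"reads/s:  avg {avg_reads:.0f}, peak {avg_reads * peak_factor:.0f}")
print(f"storage:  {storage_tb:.1f} TB over 5 years")
print(f"bandwidth at peak: {peak_bandwidth_mbps:.0f} Mbps")
```

The exact results (about 39 writes/second and 3,858 reads/second on average) are slightly below the rounded figures in the dialogue, which is the point of rounding up in an interview: the design targets are conservative.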

### **Phase 3: SOLIDIFY (Data Model and API Design)**

**The Goal**: Define the contracts—how data is stored and how clients interact with the system.

**Database Schema Design**:

For the URL shortener:
```sql
-- URL mappings table
CREATE TABLE url_mappings (
    short_code VARCHAR(10) PRIMARY KEY,  -- 'abc123'
    long_url TEXT NOT NULL,             -- 'https://example.com/very/long/url'
    created_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP,                -- NULL = never expires
    user_id UUID,                        -- Who created it
    click_count BIGINT DEFAULT 0         -- Cached for fast lookup
);

-- Analytics table (time-series)
CREATE TABLE click_analytics (
    id BIGSERIAL PRIMARY KEY,
    short_code VARCHAR(10) REFERENCES url_mappings(short_code),
    clicked_at TIMESTAMP DEFAULT NOW(),
    ip_address INET,
    user_agent TEXT,
    referrer TEXT,
    country_code CHAR(2)
);

-- Indexes for common queries
CREATE INDEX idx_analytics_shortcode_time ON click_analytics(short_code, clicked_at);
-- A partial index cannot filter on NOW() (index predicates must be immutable),
-- so recent-by-country queries use a composite index instead
CREATE INDEX idx_analytics_country_time ON click_analytics(country_code, clicked_at);
```
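
The schema leaves open how `short_code` values are produced. One common option, assumed here for illustration and not prescribed by the design above, is to base62-encode a unique integer ID (for example from a database sequence). The helper names below are hypothetical.

```python
import string

# 0-9, a-z, A-Z: 62 URL-safe symbols
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_base62(n: int) -> str:
    """Encode a non-negative integer as a base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode_base62(code: str) -> int:
    """Inverse of encode_base62."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n


print(encode_base62(62))        # 10
print(62 ** 7)                  # 3521614606208 possible 7-character codes
```

Seven characters give roughly 3.5 trillion codes, far more than the 6 billion URLs expected over five years, so `VARCHAR(10)` leaves comfortable headroom. Sequential IDs produce guessable codes, which is a trade-off worth raising if the interviewer cares about enumeration.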

**API Design (RESTful)**:

```yaml
# URL Shortener API

POST /api/v1/urls
Request:
  {
    "long_url": "https://example.com/very/long/path?query=param",
    "custom_alias": "mylink",  # Optional
    "expires_in_days": 30      # Optional, default 365
  }

Response 201:
  {
    "short_code": "mylink",
    "short_url": "https://short.io/mylink",
    "long_url": "https://example.com/very/long/path?query=param",
    "created_at": "2024-01-15T10:30:00Z",
    "expires_at": "2024-02-14T10:30:00Z"
  }

Response 409:
  {
    "error": "Custom alias already exists",
    "suggested_aliases": ["mylink123", "mylink2024"]
  }

---

GET /{short_code}
Response 302:
  Location: https://example.com/very/long/path?query=param
  
  (Also logs analytics asynchronously)

Response 404:
  {
    "error": "Short URL not found or expired"
  }

---

GET /api/v1/urls/{short_code}/analytics
Query params:
  - start_date: ISO 8601
  - end_date: ISO 8601
  - granularity: hour | day | month

Response:
  {
    "short_code": "mylink",
    "total_clicks": 15420,
    "unique_visitors": 12300,
    "clicks_by_country": {
      "US": 8000,
      "GB": 3000,
      "DE": 2000
    },
    "clicks_over_time": [
      {"timestamp": "2024-01-15T10:00:00Z", "clicks": 150},
      {"timestamp": "2024-01-15T11:00:00Z", "clicks": 230}
    ],
    "top_referrers": [
      {"url": "https://twitter.com", "clicks": 5000},
      {"url": "https://facebook.com", "clicks": 3000}
    ]
  }
```
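
To check that the contract above is internally consistent, it can help to sketch the create and redirect paths against an in-memory store. Everything here is a stand-in: `ShortenerService` is a hypothetical class, a dict replaces the database, and the `u<N>` code generator is a placeholder.

```python
from datetime import datetime, timedelta, timezone


class ShortenerService:
    """In-memory stand-in for the URL service; a dict replaces the database."""

    def __init__(self):
        self._store = {}      # short_code -> (long_url, expires_at)
        self._next_id = 1

    def create(self, long_url, custom_alias=None, expires_in_days=365, now=None):
        now = now or datetime.now(timezone.utc)
        if custom_alias is not None:
            if custom_alias in self._store:
                return 409, {"error": "Custom alias already exists"}
            code = custom_alias
        else:
            # Placeholder generator: skip codes already taken by custom aliases
            while True:
                code = f"u{self._next_id}"
                self._next_id += 1
                if code not in self._store:
                    break
        expires_at = now + timedelta(days=expires_in_days)
        self._store[code] = (long_url, expires_at)
        return 201, {
            "short_code": code,
            "short_url": f"https://short.io/{code}",
            "long_url": long_url,
            "created_at": now.isoformat(),
            "expires_at": expires_at.isoformat(),
        }

    def resolve(self, short_code, now=None):
        now = now or datetime.now(timezone.utc)
        entry = self._store.get(short_code)
        if entry is None or entry[1] <= now:
            return 404, {"error": "Short URL not found or expired"}
        return 302, {"Location": entry[0]}


svc = ShortenerService()
created_at = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
print(svc.create("https://example.com/very/long/path", "mylink", 30, now=created_at))
print(svc.resolve("mylink", now=created_at))
```

Passing `now` explicitly keeps the expiry logic easy to test; a real handler would read the clock itself.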

### **Phase 4: SCALE (High-Level Design and Deep Dives)**

**The Goal**: Draw the architecture, identify bottlenecks, and demonstrate how to scale.

**High-Level Architecture Diagram**:

```
┌─────────────────────────────────────────────────────────────┐
│                         DNS                                  │
│              (GeoDNS: Route to nearest region)              │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                      CDN (CloudFront)                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ Static Assets│  │ Cached       │  │ Edge         │       │
│  │ (JS, CSS)    │  │ Redirects    │  │ Logic        │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                   Load Balancer (ALB/NGINX)                  │
│              Health checks, SSL termination                  │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                 Application Servers (Auto-scaling)           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │ API      │  │ URL      │  │ Analytics│  │ Health   │     │
│  │ Handler  │  │ Generator│  │ Worker   │  │ Check    │     │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘     │
│                                                              │
│  Stateless, horizontal scaling (10-1000 instances)          │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                    Caching Layer (Redis Cluster)             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ URL Mappings │  │ Rate Limit   │  │ Session      │       │
│  │ (short→long) │  │ Counters     │  │ Cache        │       │
│  │ TTL: 24h     │  │ TTL: 1h      │  │ TTL: 1h      │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
│                                                              │
│  Cache hit: <1ms, Cache miss: Query database                 │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                 Primary Database (PostgreSQL/MySQL)          │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  URL Mappings Table (sharded by short_code hash)    │    │
│  │  - short_code (PK)                                   │    │
│  │  - long_url (indexed)                                │    │
│  │  - created_at, expires_at                           │    │
│  │  - user_id, click_count                               │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                              │
│  Read replicas: 3 (for read scaling)                        │
│  Write capacity: 2000 TPS                                   │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│              Analytics Pipeline (Kafka → ClickHouse)         │
│  ┌──────────┐    ┌──────────┐    ┌────────────┐            │
│  │ Kafka    │ →  │ Flink    │ →  │ ClickHouse │            │
│  │ (Raw     │    │ (Enrich, │    │ (Analytics │            │
│  │  clicks) │    │  Window) │    │  DB)       │            │
│  └──────────┘    └──────────┘    └────────────┘            │
│                                                              │
│  Real-time dashboards, hourly reports, billing             │
└─────────────────────────────────────────────────────────────┘
```

**Deep Dive: Database Sharding Strategy**

> **Interviewer**: "How would you shard the database if you exceed single-node capacity?"

> **Candidate**: "Currently we're at 4TB with 5 years of data. If we grow 10x, we'd need to shard. I'd use hash-based sharding on `short_code` because:
> 
> 1. **Even distribution**: Short codes are random, so hash mod N gives uniform distribution
> 2. **Direct lookup**: Given a short code, we know exactly which shard to query
> 3. **No hot spots**: Unlike range sharding on time (recent data is hot), hash spreads load
> 
> Implementation: `shard_id = hash(short_code) % num_shards`
> 
> For cross-shard operations (analytics), we'd aggregate results from all shards or use a separate OLAP database (ClickHouse) fed by CDC (Change Data Capture)."
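
The `hash(short_code) % num_shards` rule is short enough to sketch directly. Two details are assumptions of this sketch and not part of the answer above: it uses SHA-256 for the hash, because Python's built-in `hash()` for strings is salted per process and would route the same code to different shards on different servers, and it measures how many keys move when a shard is added.

```python
import hashlib


def shard_for(short_code: str, num_shards: int) -> int:
    """Stable shard assignment: the same code maps to the same shard everywhere."""
    digest = hashlib.sha256(short_code.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


codes = [f"code{i}" for i in range(10_000)]

# Distribution across 4 shards
counts = [0] * 4
for code in codes:
    counts[shard_for(code, 4)] += 1
print("keys per shard:", counts)

# Cost of growing from 4 to 5 shards with plain modulo placement
moved = sum(shard_for(code, 4) != shard_for(code, 5) for code in codes)
print(f"{moved / len(codes):.0%} of keys change shard")
```

The counts come out close to even, but roughly 80% of keys change shard when going from 4 to 5, since a key stays put only when its hash gives the same remainder for both moduli. That resharding cost is the usual argument for consistent hashing or a fixed number of virtual shards, and it is a good follow-up to volunteer.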

**Deep Dive: Cache Eviction Strategy**

> **Interviewer**: "What happens when the cache is full and a new popular URL emerges?"

> **Candidate**: "We'd use an LRU (Least Recently Used) eviction policy with some modifications:
>
> 1. **Tiered caching**: Hot URLs (top 1%) in L1 (in-memory), warm in L2 (Redis), cold in DB
> 2. **TTL variation**: Popular URLs get longer TTL (24h), unpopular shorter (1h)
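> 3. **Prefetching**: When a cache miss occurs, we don't just cache that one URL; we also cache related URLs (same user, same domain) anticipating they'll be accessed
> 4. **Write-through**: Every database write also updates the cache in the same request path. The two stores cannot be updated truly atomically, so if the cache update fails we invalidate the key and let the TTL bound any remaining staleness"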
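
The LRU-plus-TTL behaviour described in the answer can be sketched with an `OrderedDict`. This is a single-process illustration under stated assumptions, not how Redis implements eviction; `LRUCacheWithTTL` and `resolve` are hypothetical names, and a plain dict stands in for the database.

```python
import time
from collections import OrderedDict


class LRUCacheWithTTL:
    def __init__(self, capacity, clock=time.monotonic):
        self.capacity = capacity
        self._clock = clock
        self._items = OrderedDict()   # key -> (value, expires_at), oldest first

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at <= self._clock():
            del self._items[key]      # expired: treat as a miss
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value, ttl_seconds):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = (value, self._clock() + ttl_seconds)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict least recently used


def resolve(short_code, cache, db, ttl_seconds=3600):
    """Cache-aside read: try the cache, fall back to the database, then populate."""
    long_url = cache.get(short_code)
    if long_url is None:
        long_url = db.get(short_code)
        if long_url is not None:
            cache.put(short_code, long_url, ttl_seconds)
    return long_url


cache = LRUCacheWithTTL(capacity=2)
db = {"abc123": "https://example.com/a", "xyz789": "https://example.com/b"}
print(resolve("abc123", cache, db))   # miss, then cached
print(resolve("abc123", cache, db))   # served from cache
```

A newly popular URL enters the cache on its first miss and stays as long as it keeps being read; entries nobody requests drift to the front of the ordering and are evicted first.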

---

## **25.2 The Art of Drawing Architecture Diagrams**

Your diagrams are your primary communication tool. A good diagram conveys the architecture instantly; a bad one confuses the interviewer.

### **Diagramming Principles**

**1. Layered Layout**
```
Top:    Clients (Web, Mobile, Third-party)
        ↓
Middle: API Gateway, Load Balancers, Application Logic
        ↓
Bottom: Data Layer (Caches, Databases, Storage)
```

**2. Direction of Flow**
- Left-to-right for request flow
- Top-to-bottom for dependency hierarchy
- Arrows indicate data direction

**3. Grouping**
- Use boxes to group related components (e.g., "Availability Zone A")
- Color coding (if digital) or line styles (if whiteboard) for different service types

### **Whiteboard Technique**

Since most interviews use whiteboards (physical or virtual):

**Step 1: Start with the Client**
```
[User/Browser] → ?
```

**Step 2: Add the Entry Point**
```
[User] → [DNS] → [CDN] → [Load Balancer] → ?
```

**Step 3: Build the Core**
```
... → [LB] → [App Servers] → [Cache] → [Database]
                ↓
           [Message Queue] → [Workers]
```

**Step 4: Add Data Flows**
- Solid lines: Synchronous requests
- Dashed lines: Asynchronous/background
- Double arrows: Two-way communication

**Step 5: Annotate**
Write key metrics next to components:
```
[Redis Cluster]
- p99 latency: 1 ms
- Hit rate: 95%
- Size: 100GB
```

### **Common Diagramming Mistakes**

1. **The "Spaghetti"**: Crossing lines everywhere. Use layers to organize.
2. **The "Monolith"**: One box labeled "The System." Break it down.
3. **Missing Data Stores**: Showing logic but forgetting where data lives.
4. **Scale Ambiguity**: Not indicating if components are single instances or clusters.
5. **Over-engineering**: Drawing Kubernetes clusters when a single server suffices.

---

## **25.3 Communication Strategies**

### **Thinking Out Loud**

Your internal monologue should be external. The interviewer can't read your mind.

**Bad**: *Silence for 2 minutes while drawing, then "So we use a hash ring."*

**Good**: "I'm considering how to distribute data across servers. Consistent hashing would work well here because it minimizes data movement when we add nodes. Let me draw that..."

### **Handling Uncertainty**

When you don't know something:

**Don't**: Guess confidently or pretend you know.
**Do**: Acknowledge, reason from first principles, and offer to research.

> "I'm not intimately familiar with Cassandra's compaction strategy, but I know LSM trees generally merge sorted files in the background. For this design, I'd need to verify if Cassandra's write amplification fits our SSD endurance requirements, or if we should consider RocksDB instead. For now, I'll assume we can achieve our write throughput targets."

### **Managing Time**

45-60 minutes goes fast. Check your watch or ask about time.

**Time Checkpoints** (aligned with the 4S phases):
- 0-5 min: Scope (requirements confirmed)
- 5-15 min: Sketch (capacity estimates)
- 15-30 min: Solidify (data model and API design)
- 30-50 min: Scale (high-level architecture, deep dives, trade-offs, failure scenarios, monitoring)
- 50-60 min: Questions for interviewer

If running behind: "I see we have 10 minutes left. Rather than diving into the replication protocol, let me summarize the architecture and discuss how we'd handle a datacenter failure. Does that work?"

---

## **25.4 The System Design Checklist**

Before saying you're done, verify you've covered:

### **Functional Requirements**
- [ ] Core features implemented
- [ ] API endpoints defined with request/response formats
- [ ] Edge cases handled (duplicate requests, invalid inputs)

### **Non-Functional Requirements**
- [ ] Latency targets met (p50, p99 specified)
- [ ] Throughput calculated (QPS, bandwidth)
- [ ] Availability SLA (99.9%, 99.99%)
- [ ] Durability guarantees (data loss tolerance)

### **Data Layer**
- [ ] Schema designed (SQL/NoSQL choice justified)
- [ ] Sharding strategy (if needed)
- [ ] Replication factor and consistency level
- [ ] Caching strategy (what to cache, eviction policy)

### **Scalability**
- [ ] Horizontal scaling path (stateless services)
- [ ] Load balancing strategy
- [ ] Auto-scaling triggers
- [ ] Database scaling (read replicas, sharding)

### **Reliability**
- [ ] Single points of failure eliminated
- [ ] Circuit breakers for external dependencies
- [ ] Retry strategies with exponential backoff
- [ ] Dead letter queues for failed operations
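
For the retry item, a minimal sketch of exponential backoff with full jitter is shown below. The function name, defaults, and the injectable `sleep` and `rng` parameters are illustrative choices made so the behaviour can be tested without waiting.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds or `max_attempts` is exhausted."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: surface the error
            # Delay ceiling doubles each attempt, capped at max_delay
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random fraction of the ceiling so that
            # many clients retrying at once do not synchronize
            sleep(ceiling * rng())


attempts = {"count": 0}

def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(retry_with_backoff(flaky, sleep=lambda seconds: None))   # ok
```

In a real service the `except` clause would catch only errors known to be transient, and operations that exhaust their retries would go to the dead letter queue.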

### **Monitoring**
- [ ] Metrics to track (QPS, latency, errors)
- [ ] Alerting thresholds
- [ ] Logging strategy (what to log, retention)
- [ ] Distributed tracing for request flow

### **Security**
- [ ] Authentication (who can access)
- [ ] Authorization (what they can do)
- [ ] Data encryption (at rest and in transit)
- [ ] Input validation and sanitization

---

## **25.5 Common Pitfalls and How to Avoid Them**

### **Pitfall 1: Jumping to Solutions**
**Mistake**: Starting with "We'll use Kubernetes and microservices..."
**Fix**: Always start with requirements. The best architecture depends on constraints.

### **Pitfall 2: Over-engineering**
**Mistake**: Designing for 1 billion users when the requirement is 1 million.
**Fix**: Design for 10x current scale, but mention how to evolve. "For phase 1, a single PostgreSQL instance handles 10,000 QPS. When we exceed that, we'll shard by user_id..."

### **Pitfall 3: Ignoring Failure Modes**
**Mistake**: "The database never goes down."
**Fix**: Explicitly discuss failure scenarios. "If the primary database fails, the replica promotes automatically within 30 seconds. During that window, we serve stale cache data..."

### **Pitfall 4: Neglecting the Data Model**
**Mistake**: Detailed discussion of load balancers but vague hand-waving about "the database."
**Fix**: Spend time on schema design. Draw the tables, explain primary keys, discuss indexes.

### **Pitfall 5: Silent Assumptions**
**Mistake**: Assuming the interviewer knows why you chose a technology.
**Fix**: Explain trade-offs. "I'm choosing Cassandra over PostgreSQL because the workload is write-heavy and we need tunable consistency, though this sacrifices complex query capabilities..."

---

## **25.6 Sample Interview Walkthrough**

**Question**: "Design a rate limiter for an API."

### **SCOPE (5 minutes)**

> "Let me clarify the requirements. For functional requirements, I'm assuming:
> - Limit requests per user per time window (e.g., 100 requests/minute)
> - Return 429 status when limit exceeded
> - Support different limits for different API tiers (free vs. paid)
> - Distributed across multiple API servers
>
> For non-functional:
> - Latency: Check should add <1ms overhead
> - Accuracy: Can tolerate slight over-limit (sliding window approximation)
> - Availability: Rate limiter shouldn't block API if it fails (fail open)
>
> Does that cover the main requirements, or are there specific algorithms you'd like me to focus on?"

### **SKETCH (10 minutes)**

> "Let me estimate the scale. Assuming:
> - 10,000 API servers
> - 1 million active users
> - Average 10 requests/user/minute = 10 million requests/minute ≈ 170,000 QPS
>
> Storage: We need to track counters per user per window.
> - User ID: 16 bytes
> - Window timestamp: 8 bytes
> - Counter: 4 bytes
> - 1M users × 28 bytes ≈ 28 MB per time window
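> - With 1-minute windows and 1-hour retention: 28 MB × 60 = 1.68 GB, a conservative upper bound (raw payload only, before Redis per-key overhead; a 2-minute TTL would keep just two windows live, about 56 MB)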
>
> This fits easily in memory. I'll use Redis for fast counter increments with TTL."
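
The same estimate in code, using only the figures stated in the answer (variable names are illustrative):

```python
active_users = 1_000_000
requests_per_user_per_minute = 10
bytes_per_counter = 16 + 8 + 4            # user ID + window timestamp + counter

qps = active_users * requests_per_user_per_minute / 60
mb_per_window = active_users * bytes_per_counter / 1e6
gb_for_one_hour = mb_per_window * 60 / 1e3

print(f"{qps:,.0f} QPS")                       # 166,667 QPS
print(f"{mb_per_window:.0f} MB per window")    # 28 MB per window
print(f"{gb_for_one_hour:.2f} GB for 60 windows")
```

These are raw payload sizes. Redis stores additional metadata per key, so actual memory use would be higher, but the conclusion that the working set fits in memory has a wide margin.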

### **SOLIDIFY (15 minutes)**

> "For the algorithm, I'll use the **Sliding Window Counter** approach for better accuracy than fixed windows, without the memory cost of true sliding windows.
>
> **Data Model**:
> - Key: `rate_limit:{user_id}:{api_tier}:{minute_bucket}`
> - Value: counter (integer)
> - TTL: 2 minutes (covers current and previous window)
>
> **API Design**:
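> ```python
> import time
>
> import redis
>
> # Assumption: a Redis server reachable at the default localhost:6379
> redis_client = redis.Redis()
>
> LIMITS = {'free': 100, 'pro': 1000, 'enterprise': 10000}
> WINDOW_SECONDS = 60
>
> def check_rate_limit(user_id, tier='free'):
>     now = time.time()
>     current_minute = int(now // WINDOW_SECONDS)
>     current_key = f"rate_limit:{user_id}:{tier}:{current_minute}"
>     prev_key = f"rate_limit:{user_id}:{tier}:{current_minute - 1}"
>     limit = LIMITS[tier]
>
>     # Read both window counts in one round trip; GET returns bytes or None
>     pipe = redis_client.pipeline()
>     pipe.get(current_key)
>     pipe.get(prev_key)
>     current_raw, prev_raw = pipe.execute()
>     current_count = int(current_raw or 0)
>     prev_count = int(prev_raw or 0)
>
>     # The current window counts in full; the previous window is weighted
>     # by the fraction of it that still overlaps the sliding window
>     elapsed_fraction = (now % WINDOW_SECONDS) / WINDOW_SECONDS
>     estimated = current_count + prev_count * (1 - elapsed_fraction)
>
>     if estimated >= limit:
>         return False, 429  # Too many requests
>
>     # Increment the current window and refresh its TTL in one round trip
>     pipe = redis_client.pipeline()
>     pipe.incr(current_key)
>     pipe.expire(current_key, 2 * WINDOW_SECONDS)  # 2 minute TTL
>     pipe.execute()
>
>     return True, 200
> ```
>
> **Why this works**:
> - Redis INCR is atomic, so concurrent increments are never lost. The read-then-increment sequence is not atomic, so simultaneous requests can overshoot slightly, which the accuracy requirement tolerates
> - Pipelining keeps each check to two round trips (one read, one write)
> - Weighting the previous window by its remaining overlap prevents burst attacks at window boundaries
> - TTL auto-cleans old data"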
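
The boundary behaviour is easier to see without Redis in the way. The sketch below is an in-memory version of the sliding window counter in its usual formulation, where the previous window is weighted by the fraction that still overlaps the last 60 seconds. `SlidingWindowCounter` is a hypothetical class, and it never deletes old windows, which a real implementation would handle with TTLs.

```python
class SlidingWindowCounter:
    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self._counts = {}   # (user_id, window_index) -> request count

    def allow(self, user_id, now):
        index = int(now // self.window)
        elapsed_fraction = (now % self.window) / self.window
        current = self._counts.get((user_id, index), 0)
        previous = self._counts.get((user_id, index - 1), 0)
        # Previous window contributes only its still-overlapping share
        estimated = current + previous * (1 - elapsed_fraction)
        if estimated >= self.limit:
            return False
        self._counts[(user_id, index)] = current + 1
        return True


limiter = SlidingWindowCounter(limit=10)
# Use the whole quota at the very end of the first minute
print(sum(limiter.allow("u1", 59.0) for _ in range(12)))   # 10
# One second later, in a new window, the previous burst still counts in full
print(limiter.allow("u1", 60.0))                           # False
# Halfway through the new window, half of the old burst has aged out
print(sum(limiter.allow("u1", 90.0) for _ in range(12)))   # 5
```

A fixed-window limiter would have admitted another 10 requests at second 60, allowing 20 requests within two seconds. The weighted estimate is what closes that gap.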

### **SCALE (20 minutes)**

> "For high availability, I'll deploy Redis in cluster mode with 3 master nodes and 3 replicas. If a master fails, its replica is promoted automatically.
>
> **Bottleneck Analysis**:
> 1. **Redis CPU**: 170k checks/second, each issuing several Redis commands, might saturate a single Redis node. I'll shard by user_id hash and, if 3 masters prove too few, grow the cluster to about 10 shards.
> 2. **Network**: 170k QPS × 100 bytes = 17 MB/s, well within 1 Gbps.
> 3. **Memory**: Sharding divides the 1.68 GB rather than multiplying it: about 170 MB per shard across 10 shards, or roughly 3.4 GB cluster-wide once each shard has a replica.
>
> **Failure Scenarios**:
> - **Redis down**: Fail open (allow request) rather than block API. Log for later analysis.
> - **Network partition**: An API server cut off from Redis continues with local in-memory counters and syncs when reconnected.
> - **Hot key**: One user gets huge traffic. Use local in-app caching for rate limit checks to reduce Redis load.
>
> **Monitoring**:
> - Metrics: Rate limit hits/misses, Redis latency, 429 error rate
> - Alerts: Redis memory >80%, rate check latency >5ms, sudden spike in 429s (possible attack)
>
> **Trade-offs**:
> - I chose sliding window counter over token bucket for stricter, smoother enforcement (a token bucket deliberately admits bursts up to its capacity), at the cost of rejecting short bursts that a token bucket would let through.
> - I chose Redis over per-server in-memory counters so that limits are shared across all API servers, accepting a network round trip per check.
> - I chose weighted window over true sliding window for O(1) memory per user vs O(requests in the window) for a log of timestamps."
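
For comparison, here is a minimal token bucket, the alternative named in the trade-offs. It is a single-process sketch with a hypothetical `TokenBucket` class and caller-supplied timestamps; a distributed version would keep the token count and last-update time in shared storage.

```python
class TokenBucket:
    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)     # start full
        self.updated_at = now

    def allow(self, now):
        # Refill in proportion to elapsed time, never above capacity
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(capacity=10, refill_per_second=1.0)
print(sum(bucket.allow(0.0) for _ in range(15)))   # 10: the full burst is admitted at once
print(bucket.allow(0.5))                           # False: only half a token has refilled
print(bucket.allow(1.0))                           # True: one token is available again
```

The bucket admits an instantaneous burst equal to its capacity and then settles to the refill rate. Whether that burst is a feature or a risk depends on what the API behind the limiter can absorb, which is the question to put to the interviewer.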

---

## **25.7 Key Takeaways**

1. **Structure beats improvisation**: The 4S framework (Scope, Sketch, Solidify, Scale) ensures you cover all dimensions systematically.

2. **Requirements first, solutions second**: Never start with "We'll use Kubernetes." Start with "We need to handle 10k QPS with 100ms latency."

3. **Show your work**: Explain why you chose Redis over Memcached, or sharding over replication. The reasoning matters more than the choice.

4. **Estimate confidently**: Use round numbers, state assumptions clearly, and sanity-check results ("4TB for 5 years of URLs sounds reasonable").

5. **Address the constraints**: Every design has trade-offs. Explicitly discuss consistency vs. availability, latency vs. cost, complexity vs. maintainability.

6. **Plan for failure**: The best designs include circuit breakers, fallbacks, and graceful degradation. Never assume 100% uptime.

---

## **Chapter Summary**

This chapter transformed technical knowledge into interview performance. We established the 4S framework as a narrative structure for any design question. We practiced requirements gathering, back-of-the-envelope calculations, and API design. We walked through a complete rate limiter example, demonstrating how to discuss trade-offs, failure modes, and monitoring.

The key insight: The interviewer is evaluating your thought process, not your memory. A candidate who explains why they chose eventual consistency with clear reasoning scores higher than one who recites CAP theorem without context.

**Coming up next**: In Chapter 26, we'll put it all together with detailed walkthroughs of 10 classic system design problems, showing exactly how to apply the 4S framework to the most common interview questions.

---

## **Exercises**

1. **Mock Interview Practice**: Record yourself designing a chat application for 15 minutes using the 4S framework. Watch the recording and check:
   - Did you ask clarifying questions before designing?
   - Did you estimate scale before choosing technologies?
   - Did you explain trade-offs when selecting databases?
   - Did you discuss failure scenarios?

2. **Estimation Drills**: Practice these back-of-the-envelope calculations:
   - Design Twitter: 500M daily active users, 500M tweets/day. Calculate storage for 5 years, QPS for home timeline, and cache size.
   - Design Uber: 100M monthly users, 10M trips/day. Calculate GPS update frequency, storage for location history, and matching algorithm QPS.

3. **Trade-off Analysis**: For each pair, argue for each side:
   - SQL vs. NoSQL for a social media feed
   - Strong consistency vs. eventual consistency for shopping cart
   - Microservices vs. monolith for a startup's MVP
   - Kafka vs. RabbitMQ for event streaming

4. **Failure Scenario Planning**: Pick any system you've designed. List:
   - 3 single points of failure and how to eliminate them
   - What happens when the primary database crashes
   - How to handle a 10x traffic spike in 5 minutes
   - Recovery procedure for a corrupted cache

5. **API Design**: Design the API for a distributed rate limiter that works across multiple data centers with shared state. Include endpoints for checking limits, querying current quota, and admin override. Specify request/response formats and error codes.

