# **Chapter 1: Introduction to System Design**

## **1.1 What is System Design?**

Imagine you're asked to build a bridge. You wouldn't just start pouring concrete, right? You'd first ask: *How many cars will cross it daily?* *What weather must it withstand?* *Should it have pedestrian lanes?* **System design is the software engineering equivalent of that planning process.**

System design is the art of translating ambiguous requirements into concrete technical specifications. It's the bridge between "We need an app that lets people share photos" and the specific architecture involving databases, servers, load balancers, and caching layers that makes Instagram possible.

### **Architecture vs. Design: The Big Picture vs. The Blueprint**

These terms are often used interchangeably, but there's a subtle distinction:

- **Software Architecture** is the high-level *structure*—the load-bearing walls of your application. It answers: *What are the major components? How do they communicate? What patterns govern their interaction?*
  
- **System Design** is the detailed *planning*—the electrical wiring, plumbing, and material specifications. It answers: *Which database should we use? How many servers do we need? What's our backup strategy?*

Think of architecture as designing the city's road layout, while system design is figuring out the traffic light timing and lane markings.

**Example:** Building a video streaming service like Netflix:
- **Architecture decision**: "We'll use a microservices pattern with separate services for recommendations, user profiles, and video delivery."
- **Design decision**: "The recommendation service will use Redis for caching with a 24-hour TTL, running on 3 AWS EC2 instances behind an Application Load Balancer."

---

## **1.2 The System Design Interview (SDI) Framework**

In technical interviews at companies like Google, Meta, Amazon, and Netflix, system design questions aren't just tests of knowledge—they're simulations of how you'd collaborate on building real products.

### **Why Companies Care**

When you design a system in an interview, interviewers are evaluating:

1. **Structured Thinking**: Can you break down an ambiguous problem logically?
2. **Breadth of Knowledge**: Do you know the tools available (databases, caches, queues)?
3. **Trade-off Analysis**: Can you explain *why* you chose PostgreSQL over MongoDB for this specific use case?
4. **Operational Awareness**: Are you thinking about failure modes, monitoring, and maintenance?

### **The Interview Structure**

A typical 45-60 minute system design interview follows this flow:

**Phase 1: Requirements Gathering (5-10 minutes)**
You: "How many daily active users are we expecting? Is this a read-heavy or write-heavy application? Do we need real-time updates?"

**Phase 2: Back-of-the-Envelope Estimation (5 minutes)**
You: "If we have 10 million daily users posting 5 tweets each, that's 50 million writes per day... roughly 580 writes per second."
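
The Phase 2 arithmetic can be reproduced in a few lines. This is a minimal sketch: the function name `average_per_second` and the 2x peak factor are illustrative assumptions, not part of the interview framework itself.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def average_per_second(daily_users: int, actions_per_user: float) -> float:
    """Average request rate implied by a daily volume."""
    return daily_users * actions_per_user / SECONDS_PER_DAY

# 10 million daily users posting 5 tweets each
avg_writes = average_per_second(10_000_000, 5)

# Assumption: peak traffic is about 2x the average; confirm with the interviewer.
PEAK_FACTOR = 2
peak_writes = avg_writes * PEAK_FACTOR

print(f"average: {avg_writes:.0f} writes/s, peak: {peak_writes:.0f} writes/s")
# average: 579 writes/s, peak: 1157 writes/s
```

Rounding 579 to "roughly 580" is fine in an interview; the order of magnitude is what drives the design.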

**Phase 3: High-Level Design (10-15 minutes)**
Draw the boxes and arrows: Client → Load Balancer → Application Servers → Cache → Database.

**Phase 4: Deep Dive (15-20 minutes)**
Pick one component: "Let's talk about how we'll scale the database..."

**Phase 5: Trade-offs and Bottlenecks (5-10 minutes)**
"What happens if the cache goes down? How do we handle a celebrity with 50 million followers posting?"

---

## **1.3 The Evolution Mental Model: From 0 to 10 Million Users**

One of the most powerful mental models in system design is understanding how architectures evolve. You don't build for 10 million users on day one—that's wasteful and slows you down. Instead, systems evolve through distinct stages.

### **Stage 1: The Single Server (1-1,000 users)**
*The "Just Make It Work" Phase*

```
┌─────────────┐
│   Client    │
│  (Browser)  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Server     │
│ (App + DB)  │
└─────────────┘
```

**Setup**: One machine running your application and database (maybe a simple Python Flask app with SQLite).
**Focus**: Product-market fit, iterating on features.
**Reality**: As a rough rule of thumb, this handles on the order of 10-100 concurrent users, depending on how heavy each request is. Perfect for MVPs (Minimum Viable Products).

**Code Example**: A simple monolithic setup
```python
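# app.py - Everything in one place
from flask import Flask, abort, jsonify
import sqlite3

app = Flask(__name__)

@app.route('/user/<int:user_id>')
def get_user(user_id):
    # Direct database connection on the same machine
    conn = sqlite3.connect('local.db')
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))
        row = cursor.fetchone()
    finally:
        conn.close()  # Always release the connection
    if row is None:
        abort(404)  # No such user
    # A raw row tuple is not a valid Flask response, so send it as JSON
    return jsonify(list(row))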
```

### **Stage 2: Separation of Concerns (1,000-10,000 users)**
*The "Database is Slowing Down" Phase*

You notice the database queries are making your app sluggish. Time to separate:

```
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Server    │
│  (App Only) │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Database   │
│   Server    │
└─────────────┘
```

**Key Change**: Application server and database server are now separate machines.
**Why**: Databases need different hardware optimization (fast SSDs, lots of RAM) than application servers (CPU for business logic).
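**New Concern**: Network latency between app and DB. A query that used to be a local call now adds a network round trip (roughly 1-5ms, often less within a single data center), and that cost is paid on every query.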
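
To see why the round trip matters, the sketch below multiplies it by the number of queries a single request issues one after another. The figure of 20 queries per request is an assumption chosen for illustration; the 1 ms and 5 ms values are the ends of the range quoted above.

```python
def added_latency_ms(sequential_queries: int, round_trip_ms: float) -> float:
    """Extra time a request spends waiting on the network when its
    queries run one after another against a remote database."""
    return sequential_queries * round_trip_ms

for rtt in (1.0, 5.0):
    extra = added_latency_ms(sequential_queries=20, round_trip_ms=rtt)
    print(f"20 queries at {rtt} ms each -> +{extra:.0f} ms per request")
```

Fetching the same data in fewer, larger queries is the usual way to keep this overhead small.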

### **Stage 3: Load Balancing & Horizontal Scaling (10,000-100,000 users)**
*The "One Server Can't Handle the Traffic" Phase*

```
                    ┌─────────────┐
                    │   Client    │
                    └──────┬──────┘
                           │
                           ▼
                    ┌─────────────┐
                    │Load Balancer│
                    │   (Nginx)   │
                    └──────┬──────┘
                           │
           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
      ┌─────────┐     ┌─────────┐     ┌─────────┐
      │ Server 1│     │ Server 2│     │ Server 3│
      └────┬────┘     └────┬────┘     └────┬────┘
           │               │               │
           └───────────────┼───────────────┘
                           ▼
                    ┌─────────────┐
                    │  Database   │
                    └─────────────┘
```

**Key Change**: Multiple application servers behind a load balancer.
**Concept**: Horizontal scaling (adding more machines) vs. Vertical scaling (buying a bigger machine).
**Challenge**: Session management—if User A logs in on Server 1, Server 2 doesn't know about it. Solution: External session store (Redis) or sticky sessions.
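
The session problem and its external-store fix can be sketched in plain Python. Everything here is a stand-in: `shared_sessions` is an ordinary dict playing the role of Redis, and `AppServer` and `RoundRobinBalancer` are hypothetical classes, not the API of Nginx or any real load balancer.

```python
import itertools

# Stand-in for an external store such as Redis: one dict shared by all servers.
shared_sessions = {}

class AppServer:
    def __init__(self, name):
        self.name = name

    def login(self, session_id, user):
        # Write the session to the shared store, not to server-local memory.
        shared_sessions[session_id] = user

    def whoami(self, session_id):
        user = shared_sessions.get(session_id, "anonymous")
        return f"{self.name} sees {user}"

class RoundRobinBalancer:
    """Hands out servers in a fixed rotation."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def next_server(self):
        return next(self._cycle)

balancer = RoundRobinBalancer([
    AppServer("server-1"),
    AppServer("server-2"),
    AppServer("server-3"),
])
balancer.next_server().login("abc123", "user-a")   # handled by server-1
print(balancer.next_server().whoami("abc123"))     # handled by server-2
# server-2 sees user-a
```

If each server kept sessions in its own memory instead, the second request would come back as "anonymous". That is the failure sticky sessions work around by pinning a user to one server.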

### **Stage 4: Caching & Content Delivery (100,000-1M users)**
*The "Database is Dying" Phase*

Your database is getting hammered with the same queries repeatedly. "Who is User 123?" shouldn't require a disk lookup every time.

```
Client
  │
  ▼
CDN (Static assets: images, JS, CSS)
  │
  ▼
Load Balancer
  │
  ▼
App Servers
  │
  ├─► Cache (Redis/Memcached): on a hit, return immediately
  │
  ▼   (on a miss)
Database (Primary)
  │
  ▼
Read Replicas (for scaling reads)
```

**New Components**:
- **CDN (Content Delivery Network)**: Stores images/videos close to users (Cloudflare, Amazon CloudFront)
- **Cache Layer**: Stores hot data in memory (Redis, Memcached)
- **Read Replicas**: Database copies for read queries, reducing load on the primary database. Replicas can lag slightly behind the primary, so reads from them may be briefly stale

### **Stage 5: Microservices & Data Partitioning (1M-10M users)**
*The "One Database Can't Store Everything" Phase*

```
                    ┌───────────────┐
                    │      CDN      │
                    └───────┬───────┘
                            ▼
                    ┌───────────────┐
                    │  API Gateway  │
                    └───────┬───────┘
                            │
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│ User Service  │   │ Feed Service  │   │Search Service │
└───────┬───────┘   └───────┬───────┘   └───────┬───────┘
        ▼                   ▼                   ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│    User DB    │   │    Feed DB    │   │ Search Index  │
│ (PostgreSQL)  │   │  (Cassandra)  │   │(Elasticsearch)│
└───────────────┘   └───────────────┘   └───────────────┘
```

**Key Changes**:
- **Service Separation**: User service, Feed service, Search service (different teams own different services)
- **Database per Service**: User data in PostgreSQL, Feed in Cassandra (better for write-heavy timelines), Search in Elasticsearch
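- **Message Queues**: Where the caller doesn't need an immediate answer, services communicate asynchronously via Kafka/RabbitMQ instead of direct calls, so a slow or failed consumer doesn't block the producer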
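
Asynchronous communication can be illustrated with the standard library alone. In this sketch a `queue.Queue` stands in for a Kafka topic or RabbitMQ queue, and a background thread plays the Feed service. The event shape and the function names are assumptions made for the example.

```python
import queue
import threading

events = queue.Queue()   # stand-in for a Kafka topic or RabbitMQ queue
feed_updates = []        # work the hypothetical Feed service has completed

def feed_service_worker():
    """Consumer: processes events whenever they arrive."""
    while True:
        event = events.get()
        if event is None:            # sentinel value: shut down
            events.task_done()
            break
        feed_updates.append(
            f"add post {event['post_id']} to followers of {event['user_id']}"
        )
        events.task_done()

worker = threading.Thread(target=feed_service_worker, daemon=True)
worker.start()

def publish_post_created(user_id, post_id):
    """Producer: returns as soon as the event is enqueued."""
    events.put({"type": "post_created", "user_id": user_id, "post_id": post_id})

publish_post_created(user_id=1, post_id=42)
events.put(None)
events.join()            # wait here only so the demo can print a result
print(feed_updates)      # ['add post 42 to followers of 1']
```

The producer never calls the consumer directly. If the Feed service is slow or temporarily down, events simply wait in the queue.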

### **Stage 6: Global Scale (10M+ users)**
*The "Worldwide Distribution" Phase*

- **Multi-region deployment**: Data centers in US, Europe, Asia
- **Geo-routing**: Users hit the closest data center
- **Data sovereignty**: EU data stays in EU (GDPR compliance)
- **Event sourcing**: Complete audit trail of all changes
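
Geo-routing combined with a data-residency rule might look like the sketch below. The latency table, region names, and the `pick_region` function are all hypothetical, and the residency check is the simplified "EU data stays in EU" rule from the list above rather than a statement of what GDPR requires.

```python
# Hypothetical round-trip latencies (ms) from each user location to each region.
LATENCY_MS = {
    "paris":    {"us-east": 85,  "eu-west": 12,  "ap-south": 140},
    "new_york": {"us-east": 8,   "eu-west": 80,  "ap-south": 210},
    "mumbai":   {"us-east": 200, "eu-west": 120, "ap-south": 15},
}
EU_REGIONS = {"eu-west"}

def pick_region(user_location, eu_resident):
    """Route to the lowest-latency region, restricted to EU regions
    for users whose data must stay in the EU."""
    candidates = LATENCY_MS[user_location]
    if eu_resident:
        candidates = {r: ms for r, ms in candidates.items() if r in EU_REGIONS}
    return min(candidates, key=candidates.get)

print(pick_region("mumbai", eu_resident=False))    # ap-south
print(pick_region("new_york", eu_resident=True))   # eu-west
```

The second call shows the trade-off: an EU resident travelling in New York is served from a slower region because residency takes priority over latency.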

**The Golden Rule**: *Design for the next order of magnitude, not the next decade.* Build for 100k users when you have 10k, not 10 million.

---

## **1.4 Trade-offs in Engineering: The CAP Theorem Preview**

In distributed systems, you can't have everything. The CAP theorem (conjectured by Eric Brewer in 2000 and formally proved by Seth Gilbert and Nancy Lynch in 2002) states that when a network partition occurs (a communication break between nodes), a system must choose between **Consistency** and **Availability**. It cannot guarantee both until the partition heals.

### **The Three Properties**

**C - Consistency**: Every read receives the most recent write or an error. Like a bank account—your balance must be accurate.

**A - Availability**: Every request receives a response, without guarantee it contains the most recent write. Like Twitter—seeing a slightly old tweet is better than seeing an error.

**P - Partition Tolerance**: The system continues to operate despite network failures. This is mandatory in distributed systems—networks always fail eventually.

### **The Trade-off in Practice**

**Scenario**: You have a database in New York and a replica in London. The network cable between them is cut (Partition).

- **Option A (Consistency)**: The London database refuses to answer queries until it syncs with New York. Users in London see errors, but data is never stale.
  
- **Option B (Availability)**: The London database answers queries using its last known data. Users see old data, but the site works.
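
The two options can be simulated with a toy replica. This is a sketch of the scenario only: the `Replica` class and its `mode` flag are hypothetical and do not model how any real database handles partitions.

```python
class Replica:
    """Toy London replica of a New York primary."""
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.data = {}
        self.partitioned = False

    def replicate(self, key, value):
        """Called by the primary; updates never arrive while partitioned."""
        if not self.partitioned:
            self.data[key] = value

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            # Option A: refuse rather than risk returning stale data.
            raise ConnectionError("cannot confirm latest value; refusing to answer")
        # Option B (or no partition): answer with the last known value.
        return self.data.get(key)

for mode in ("CP", "AP"):
    london = Replica(mode)
    london.replicate("balance", 100)
    london.partitioned = True          # the cable is cut
    london.replicate("balance", 50)    # the write in New York never arrives
    try:
        print(mode, "->", london.read("balance"))
    except ConnectionError as err:
        print(mode, "-> error:", err)
# CP -> error: cannot confirm latest value; refusing to answer
# AP -> 100
```

The AP replica stays up but reports a balance of 100 when the true value is 50. The CP replica is correct by refusing, at the cost of an error for London users.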

**Real-World Examples**:
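- **CP-leaning systems**: Bank transactions (HBase, MongoDB with majority read and write concerns; PostgreSQL is usually grouped here too, although a single-node PostgreSQL is not a distributed system in the CAP sense)
- **AP-leaning systems**: Social media feeds, shopping carts (DynamoDB and Cassandra with their default, eventually consistent settings; Riak)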

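**PACELC Theorem Extension**: CAP only describes behavior during a partition. PACELC adds the normal case: if there is a Partition (P), choose between Availability (A) and Consistency (C); Else (E), choose between Latency (L) and Consistency (C). Even on a healthy network, waiting for replicas to agree costs latency.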

---

## **1.5 Measuring System Success: SLIs, SLOs, and SLAs**

How do you know if your system is "good"? We use quantitative targets.

### **SLI (Service Level Indicator)**
*What we measure*

A quantitative measure of service quality. Examples (the values in parentheses are sample targets for each indicator):
- **Latency**: How long does a request take? (95th percentile < 200ms)
- **Throughput**: How many requests per second? (10,000 RPS)
- **Error Rate**: What percentage of requests fail? (< 0.1% HTTP 5xx responses)
- **Availability**: What percentage of the time is the service up? (99.9%, or "three nines")

### **SLO (Service Level Objective)**
*What we aim for*

A target value for an SLI that indicates acceptable service. This is an internal goal.
- "Our API will respond to 99% of requests in under 100ms over a 30-day window."
- "Database availability will be 99.95%."

**The 99th Percentile Trap**: Don't just measure averages. If 99% of requests take 50ms but 1% take 10 seconds, the average is about 150ms, which looks healthy while 1 in 100 users waits 10 seconds and may leave. Always measure percentiles (p50, p95, p99).
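
A short experiment makes the gap between averages and percentiles concrete. The sample below is synthetic (980 requests at 50 ms and 20 at 10 seconds), and `percentile` is a simple nearest-rank implementation written for this example; monitoring systems often use interpolating or approximate methods instead.

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Synthetic data: 98% of requests are fast, 2% are very slow.
latencies_ms = [50] * 980 + [10_000] * 20

print(f"mean: {statistics.mean(latencies_ms):.0f} ms")
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
# mean: 249 ms
# p50: 50 ms
# p95: 50 ms
# p99: 10000 ms
```

The mean suggests a mildly slow service, and p50 and p95 look perfect. Only p99 reveals that some users are waiting ten seconds.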

### **SLA (Service Level Agreement)**
*What we promise*

The business contract with consequences. If you miss it, you pay.
- "If availability drops below 99.9%, customers get a 10% service credit."
- Used by cloud providers (AWS, GCP) and enterprise SaaS companies.

### **Error Budgets: The SRE Approach**

Google's Site Reliability Engineering (SRE) teams use "error budgets." If your SLO is 99.9% availability, you have a 0.1% "error budget" (about 43 minutes of downtime per month).

**Philosophy**: If you haven't spent your error budget, you can take risks—deploy new features, experiment. If you're close to exceeding it, freeze deployments and focus on reliability.
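
The error-budget arithmetic is simple enough to script. This sketch assumes a 30-day window by default; the function names `error_budget_minutes` and `budget_remaining` are illustrative, not taken from any SRE tooling.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed unavailability in the window for a given SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100

def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return 1 - downtime_minutes / budget

print(f"{error_budget_minutes(99.9):.1f} minutes")   # 43.2 minutes
print(f"{budget_remaining(99.9, downtime_minutes=30):.0%} of budget left")
```

A team could check a number like this before each release: plenty of budget left means room to ship riskier changes, and a value near zero or below means it is time to freeze and stabilize.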

---

## **1.6 Key Takeaways**

1. **System design is iterative**: Start simple, evolve based on actual bottlenecks, not hypothetical ones.

2. **Every decision is a trade-off**: Speed vs. Cost, Consistency vs. Availability, Complexity vs. Maintainability. There are no perfect solutions, only appropriate ones.

3. **Scale changes everything**: The architecture that serves 1,000 users is often unsuitable for 1 million. Be prepared to refactor.

4. **Measure what matters**: Define your SLIs and SLOs early. "Fast" and "reliable" aren't measurable—"p95 latency under 100ms" is.

5. **Design for failure**: Networks partition, servers crash, databases corrupt. The question isn't *if* it will fail, but *how* your system responds when it does.

---

## **Chapter Summary**

In this chapter, we established that system design is the process of making intentional decisions about how software components interact to meet business requirements. We explored the evolution from single-server setups to globally distributed architectures, understanding that each stage introduces new complexity to solve specific bottlenecks.

We learned that distributed systems require accepting trade-offs, particularly around the CAP theorem, and that success must be measured through concrete SLIs, SLOs, and SLAs rather than vague feelings of "fast" or "slow."

**Coming up next**: In Chapter 2, we'll build the foundation with networking fundamentals, latency numbers every programmer should know, and the data structures specifically relevant to distributed systems.

---

**Exercises**:

1. **Evolution Planning**: Take a simple blog application. Describe what changes you'd make at each stage (1k, 10k, 100k, 1M users). What breaks first at each level?

2. **CAP Analysis**: For each scenario below, would you choose CP or AP? Why?
   - An online banking transaction system
   - A Netflix movie recommendation feed
   - A real-time multiplayer game leaderboard
   - An e-commerce shopping cart

3. **SLO Calculation**: If your service promises 99.99% availability (four nines), how many minutes of downtime are allowed per year? Calculate the same for 99.9% and 99%.

