# **Chapter 2: Prerequisites & Core Concepts**

Before we can design systems that scale to millions of users, we need to understand the fundamental building blocks that make distributed systems possible. This chapter covers the essential concepts that form the bedrock of system design knowledge.

---

## **2.1 Networking Fundamentals**

Understanding how computers communicate over networks is crucial because, in distributed systems, almost every operation involves network communication. Your application is rarely faster than the network allows it to be.

### **The OSI Model Simplified**

The Open Systems Interconnection (OSI) model conceptualizes how network communication works. You don't need to memorize all 7 layers, but knowing which layer each protocol operates on is essential for debugging performance issues.

```
┌─────────────────────────────────────────┐
│ Layer 7: Application                    │ ← HTTP, HTTPS, SMTP, DNS, SSH
├─────────────────────────────────────────┤
│ Layer 6: Presentation                    │ ← TLS/SSL, JSON/XML serialization
├─────────────────────────────────────────┤
│ Layer 5: Session                         │ ← Session setup and teardown (e.g., RPC, NetBIOS)
├─────────────────────────────────────────┤
│ Layer 4: Transport                       │ ← TCP (reliable), UDP (fast)
├─────────────────────────────────────────┤
│ Layer 3: Network                         │ ← IP addressing, routing
├─────────────────────────────────────────┤
│ Layer 2: Data Link                       │ ← MAC addresses, Ethernet
├─────────────────────────────────────────┤
│ Layer 1: Physical                        │ ← Cables, radio waves
└─────────────────────────────────────────┘
```

**Key Insight**: In system design, we mostly care about Layers 3-7. You don't need to understand physical cables to design distributed systems, but you do need to understand how TCP/IP works.

### **TCP vs. UDP: The Transport Layer Trade-off**

**TCP (Transmission Control Protocol)**: The reliable option. Like certified mail—you get delivery confirmation.

**Characteristics**:
- **Connection-oriented**: Three-way handshake before data transfer
- **Reliable delivery**: Retransmits lost packets, acknowledges successful delivery
- **Ordered delivery**: Delivers bytes to the application in the order they were sent, even if packets arrive out of order
- **Flow control**: Prevents overwhelming the receiver

**Three-way handshake visualization**:
```
Client                              Server
  │                                   │
  │── SYN (Synchronize) ─────────────>│ "Can I connect?"
  │                                   │
  │<── SYN-ACK (Synchronize-Ack) ─────┤ "Yes. Can you hear me too?"
  │                                   │
  │── ACK (Acknowledgment) ──────────>│ "Confirmed! Let's talk."
  │                                   │
  ▼                                   ▼
Connection Established
```

**Use cases**: Web browsing (HTTP/HTTPS), email (SMTP), file transfers (SFTP), databases (MySQL, PostgreSQL connections).

**UDP (User Datagram Protocol)**: The fast option. Like shouting in a crowded room—you speak, but there's no guarantee anyone heard you.

**Characteristics**:
- **Connectionless**: No setup, no teardown
- **Unreliable**: No guarantees—packets may be lost, duplicated, or arrive out of order
- **Fast**: No overhead from acknowledgments or retransmissions
- **Minimal headers**: Smaller packet size

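**Use cases**: Live video and voice (VoIP, video conferencing), online gaming, DNS queries. On-demand video streaming usually runs over HTTP, and therefore over TCP or QUIC.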

**System Design Decision**: Use TCP for anything where data integrity matters more than speed (transactions, user data). Use UDP where speed matters more than occasional data loss (live video, real-time gaming where a dropped frame is acceptable).
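To make the trade-off concrete, here is a minimal sketch that sends the same payload over UDP and over TCP on the loopback interface, using only Python's standard `socket` module. The function names `udp_roundtrip` and `tcp_roundtrip` are illustrative, and a production service would add error handling and loop on `recv` until the full message arrives.

```python
import socket

def udp_roundtrip(payload: bytes) -> bytes:
    """Send one datagram to ourselves: no handshake, no delivery guarantee."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))  # Port 0 lets the OS pick a free port
        receiver.settimeout(1.0)         # A lost datagram would otherwise block forever
        sender.sendto(payload, receiver.getsockname())
        data, _addr = receiver.recvfrom(1024)
        return data
    finally:
        sender.close()
        receiver.close()

def tcp_roundtrip(payload: bytes) -> bytes:
    """Send bytes over TCP: connecting performs the three-way handshake."""
    listener = socket.create_server(("127.0.0.1", 0))
    try:
        client = socket.create_connection(listener.getsockname(), timeout=1.0)
        conn, _addr = listener.accept()
        try:
            conn.settimeout(1.0)
            client.sendall(payload)      # TCP retransmits lost segments for us
            return conn.recv(1024)       # Small payload: one recv is enough here
        finally:
            conn.close()
            client.close()
    finally:
        listener.close()
```

Note how the UDP path needs a timeout to cope with loss, while the TCP path needs a listener, a connection, and teardown before any data moves.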

### **HTTP Versions: Evolution of Web Communication**

**HTTP/1.1 (1997)**: The foundation of the modern web.

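**How it works**: Text-based protocol. Connections are persistent by default (Keep-Alive), so several requests can reuse one TCP connection, but each connection handles one request at a time. Requests can be pipelined, but responses must come back in order, and pipelining is rarely enabled in practice.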

**Request Example**:
```http
GET /api/users/123 HTTP/1.1
Host: api.example.com
User-Agent: Mozilla/5.0
Accept: application/json

Response:
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 58

{"id": 123, "name": "Alice", "email": "alice@example.com"}
```

**Problems**: 
- **Head-of-line blocking**: The response for the first request must complete before the second request starts (even if the second could be served faster)
- **Inefficient headers**: Repeated headers with every request
- **Connection overhead**: Browsers work around the one-request-at-a-time limit by opening several parallel TCP connections per host (typically six), and each one pays for its own handshake

**HTTP/2 (2015)**: Binary protocol for efficiency.

**Key Features**:
- **Multiplexing**: Multiple streams share a single TCP connection, removing HTTP-level head-of-line blocking (a lost TCP packet still stalls every stream)
- **Binary framing**: More efficient parsing (not text-based like HTTP/1.1)
- **Header compression**: HPACK compression reduces header size by up to 90%
- **Server push**: Server can proactively send resources (e.g., CSS file along with HTML); rarely used in practice, and Chrome removed support in 2022

**Visualization**:
```
HTTP/1.1:                            HTTP/2:
Request 1 ─┐                        Stream 1 ──┐
Response 1─┤                        Stream 2 ──┤ → All on 1 connection
Request 2 ─┤ (Serial)               Stream 3 ──┤ (Parallel)
Response 2─┤                        Stream 4 ──┘
Request 3 ─┘
Response 3─┘
```

**HTTP/3 (2022)**: Replaces TCP with QUIC (UDP-based).

**Key Innovation**: HTTP/3 runs on QUIC, which is built on UDP (not TCP). This eliminates TCP-level head-of-line blocking—if one packet is lost, only that stream is affected, not all streams.

**When to use each**:
- **HTTP/1.1**: Still widely supported, simpler, adequate for most use cases
- **HTTP/2**: Better for browsers loading multiple resources from a single domain
- **HTTP/3**: Best for high-latency networks (mobile, satellite), unreliable connections

### **gRPC: RPC for the Modern Era**

**RPC (Remote Procedure Call)**: Code that calls a function on another computer as if it were local. You write `user = getUser(123)` and the call happens over the network.

**gRPC**: A high-performance RPC framework, originally developed at Google, that uses Protocol Buffers (protobuf) for serialization and HTTP/2 for transport.

**Advantages over REST + JSON**:
1. **Smaller payloads**: Protobuf binary is often several times smaller than the equivalent JSON
2. **Faster serialization**: Protobuf encoding/decoding is significantly faster
3. **Strongly typed**: Schema-first approach catches errors at compile time
4. **Bidirectional streaming**: Both client and server can send messages continuously

**Comparison Example**:

**REST + JSON**:
```json
// Request
GET /api/users/123

// Response (~100 bytes as compact JSON, before HTTP headers)
{
  "id": 123,
  "name": "Alice Johnson",
  "email": "alice@example.com",
  "created_at": "2024-01-15T08:30:00Z"
}
```

**gRPC + Protobuf**:
```protobuf
// Schema definition (.proto file)
syntax = "proto3";

message User {
  int32 id = 1;
  string name = 2;
  string email = 3;
  int64 created_at = 4;
}

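// Request message for GetUser (needed for the schema to compile)
message GetUserRequest {
  int32 id = 1;
}

service UserService {
  rpc GetUser(GetUserRequest) returns (User);
}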
```
```go
// Binary response: roughly 40 bytes for the same four fields (about half the JSON size)
// Serialization is typically faster too; measure with your own payloads
```

**Use cases**: Microservices communication (Google uses it extensively), mobile-to-backend communication, browser-to-backend communication (via gRPC-Web).
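The size difference can be checked by hand. The sketch below encodes the `User` message with a small hand-rolled implementation of the protobuf wire format (varints plus length-delimited strings) and compares it to compact JSON. The helper names are illustrative assumptions; real services would use classes generated by `protoc`, and `1705307400` is the Unix timestamp for `2024-01-15T08:30:00Z`.

```python
import json

def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # High bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def varint_field(field_number: int, value: int) -> bytes:
    # Tag = (field number << 3) | wire type; wire type 0 is varint
    return encode_varint((field_number << 3) | 0) + encode_varint(value)

def string_field(field_number: int, text: str) -> bytes:
    # Wire type 2 is length-delimited: tag, byte length, then the bytes
    data = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data

def encode_user(user_id: int, name: str, email: str, created_at: int) -> bytes:
    return (varint_field(1, user_id) + string_field(2, name)
            + string_field(3, email) + varint_field(4, created_at))

json_bytes = json.dumps(
    {"id": 123, "name": "Alice Johnson", "email": "alice@example.com",
     "created_at": "2024-01-15T08:30:00Z"},
    separators=(",", ":"),
).encode("utf-8")
proto_bytes = encode_user(123, "Alice Johnson", "alice@example.com", 1705307400)

print(len(json_bytes), len(proto_bytes))  # 97 42
```

Most of the savings come from dropping field names and quoting; string contents cost the same in both formats, so the ratio depends heavily on the shape of the message.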

### **WebSockets: Real-Time Bidirectional Communication**

HTTP was designed for request-response. What if you need the server to send messages to the client without being asked?

**WebSocket**: A persistent, bidirectional communication channel over a single TCP connection.

**Handshake Process**:
```
Client                                          Server
  │                                               │
  │── HTTP GET (Upgrade: websocket) ────────────>│
  │  "I want a WebSocket connection"             │
  │                                               │
  │<── HTTP 101 (Switching Protocols) ───────────┤
  │  "Switching to WebSocket protocol"           │
  │                                               │
  ▼                                               ▼
           Persistent bidirectional connection
              (No new HTTP requests needed)
```

**Key Features**:
- **Full-duplex**: Both client and server can send messages anytime
- **Low overhead**: Only 2-14 bytes per message (compared to HTTP headers)
- **Persistent**: Connection stays open until either party closes it
- **Efficient for real-time**: Perfect for chat, notifications, live updates

**Code Example** (JavaScript Client):
```javascript
function connect() {
  // Establish WebSocket connection
  const socket = new WebSocket('wss://api.example.com/notifications');

  // Send messages only after the connection is open;
  // calling send() while still connecting throws an error
  socket.onopen = () => {
    socket.send(JSON.stringify({
      type: 'subscribe',
      channel: 'user-updates'
    }));
  };

  // Listen for messages from server
  socket.onmessage = (event) => {
    const notification = JSON.parse(event.data);
    showNotification(notification);  // App-specific UI handler
  };

  // Handle connection close by reconnecting after a short delay
  socket.onclose = () => {
    console.log('WebSocket closed. Reconnecting...');
    setTimeout(connect, 1000);
  };
}

connect();
```

**Use cases**: Chat applications (Slack, WhatsApp), live collaboration (Google Docs), real-time notifications (GitHub, Jira), live sports scores, multiplayer games.

---

## **2.2 Latency Numbers Every Programmer Should Know**

In system design, having an intuitive sense of how long different operations take helps you make better decisions. These numbers, popularized by Jeff Dean (Google engineer), should be memorized.

### **The Hierarchy of Speed**

```
Operation                          Time (approx)  Human Scale
──────────────────────────────────────────────────────────────
CPU L1 cache access                1 ns            Light travels 30 cm
CPU L2 cache access                4 ns            Light travels 1.2 m
CPU L3 cache access                10 ns           Light travels 3 m
Main memory RAM access             100 ns          Light travels 30 m
Context switch (OS)                10,000 ns       100x slower than RAM
SSD random read                    100,000 ns      1,000x slower than RAM
Read from network (same DC)        250,000 ns      2,500x slower than RAM
SSD sequential read (1 MB)         1,000,000 ns    10,000x slower than RAM
Disk (HDD) random read             10,000,000 ns   100,000x slower than RAM
Disk (HDD) sequential read (1 MB)  50,000,000 ns   500,000x slower than RAM
Read from network (global)         150,000,000 ns  1.5 million x slower than RAM
```

**Key Insight**: Using the table's numbers, a RAM access is roughly **1,000x faster than an SSD random read**, and an SSD random read is roughly **100x faster than an HDD random read**. Network round trips are **2,500x to over 1,000,000x slower** than a RAM access.

### **Practical Implications**

**1. The Caching Imperative**

If a database query takes 100ms, but reading the same result from an in-memory cache takes 0.1ms, a cache hit is 1,000x faster for that lookup. This is why distributed systems invest heavily in caching layers.
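As a rough illustration, the sketch below wraps a slow lookup in an in-process dictionary cache. `ReadThroughCache` and `slow_query` are hypothetical names, the 50 ms sleep stands in for a database round trip, and the sketch deliberately omits eviction and invalidation, which any real cache needs.

```python
import time

class ReadThroughCache:
    """Serve repeated reads from memory; fall back to the loader on a miss."""

    def __init__(self, load):
        self._load = load    # Function that fetches a value from the slow source
        self._store = {}     # In-memory cache: key -> value
        self.misses = 0

    def get(self, key):
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._load(key)  # Pay the slow path once
        return self._store[key]

def slow_query(user_id):
    time.sleep(0.05)  # Stand-in for a 50 ms database query
    return {"id": user_id}

# Usage: the first get() takes about 50 ms, later ones are dictionary lookups
# cache = ReadThroughCache(slow_query)
# cache.get(123); cache.get(123)
```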

**2. The Network Is the Enemy**

Every network call is an eternity in computer time. A single network round-trip to a database on another continent takes about 150ms, enough time to:
- Read from L1 cache 150 million times
- Read from RAM 1.5 million times
- Switch CPU contexts 15,000 times

**Design consequence**: Minimize network calls. Batch operations, use in-memory data when possible, and cache aggressively.

**3. Memory vs. Disk Trade-offs**

If you can fit your working set in RAM, do it. Even an expensive database server with 1TB of RAM is often cheaper than the engineering cost of optimizing disk access.

### **Scaling These Numbers Over Distance**

**Network Latency** (typical round-trip times):
- Same data center: 0.1-0.5 ms
- Same city: 5-10 ms
- East Coast US to West Coast US: 40-50 ms
- US to Europe: 80-100 ms
- US to Asia: 150-200 ms
- US to South America: 100-120 ms

**Why this matters for global systems**:
- A user in Tokyo waiting for data from US servers experiences roughly 200ms of round-trip latency
- If you do 10 sequential cross-region database calls per request, that's 2 seconds just in network latency
- This is why companies deploy data centers globally
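The arithmetic behind these bullets is simple enough to sketch. The helper below is a back-of-the-envelope estimate only: `network_time_ms` is an illustrative name, and it assumes calls are either fully sequential or fully overlapped, ignoring server processing time.

```python
def network_time_ms(round_trip_ms: float, calls: int, sequential: bool = True) -> float:
    """Estimate time spent waiting on the network for one request."""
    if sequential:
        return round_trip_ms * calls  # Each call waits for the previous one
    return round_trip_ms              # Overlapped or batched: one round trip

print(network_time_ms(200, 10))                    # 2000 (cross-region, sequential)
print(network_time_ms(200, 10, sequential=False))  # 200 (batched or parallel)
print(network_time_ms(0.5, 10))                    # 5.0 (same data center)
```

The same ten calls cost 2 seconds across an ocean, 200 ms if batched, and about 5 ms inside one data center, which is the case for both batching and regional deployment.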

**Optimization Techniques**:
- **CDN (Content Delivery Network)**: Serve static assets from edge locations close to users (often tens of milliseconds or less)
- **Read replicas**: Database replicas in each region (local reads)
- **Edge computing**: Run application logic close to users (Amazon CloudFront Functions, Cloudflare Workers)

---

## **2.3 Data Structures for System Design**

General-purpose data structures you learned in computer science (arrays, hash maps, trees) have specialized variants optimized for distributed systems. These are the building blocks of scalable systems.

### **Hash Functions & Consistent Hashing**

**Hash Function**: A mathematical function that maps data of arbitrary size to fixed-size values.

```
hash("hello") → 8c7d9b...
hash("hello") → 8c7d9b... (always the same result)
hash("world") → 2d4f6a... (different result)
```

**Properties of Good Hash Functions**:
- **Deterministic**: Same input always produces same output
- **Uniform distribution**: Inputs are spread evenly across output space
- **Avalanche effect**: Small input changes produce completely different outputs
- **Fast**: Computationally inexpensive to calculate

**Simple Hash Example**:
```python
def simple_hash(key, num_servers):
    return sum(ord(c) for c in key) % num_servers

# This determines which server stores the data
server_id = simple_hash("user_123", 10)  # Returns 0-9
```

**Problem with Simple Hashing**: What happens when we add or remove servers?

```python
# Initially 10 servers
server_id = simple_hash("user_123", 10)  # Returns 2 (data on server 2)

# We add a new server (now 11 servers)
server_id = simple_hash("user_123", 11)  # Returns 10 (data moved to server 10)
```

**The Rehashing Problem**: Changing the server count changes the result of the modulo for most keys. Going from 10 to 11 servers remaps roughly 10 out of every 11 keys, so in a system with 1 billion keys about 900 million records must move, a catastrophic operation.
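The scale of the problem is easy to measure. This sketch counts how many keys change servers under modulo placement when the cluster grows from 10 to 11 nodes. It uses an MD5-based `stable_hash` (an illustrative helper, chosen for its uniform spread) and synthetic keys of the form `user_<i>`.

```python
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic, well-distributed hash of a string key."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def fraction_moved(num_keys: int, old_servers: int, new_servers: int) -> float:
    """Fraction of keys whose server changes under hash(key) % num_servers."""
    moved = 0
    for i in range(num_keys):
        h = stable_hash(f"user_{i}")
        if h % old_servers != h % new_servers:
            moved += 1
    return moved / num_keys

print(f"{fraction_moved(10_000, 10, 11):.1%} of keys moved")  # roughly 91%
```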

**Consistent Hashing**: Solves this by hashing both servers and keys onto the same ring.

```
           Key: "user_123"
                  ↓
    ┌────────────────────────────────┐
    │     ◄─────── Ring ───────►     │
    │                                │
    │  Key: "user_456"    Server C  │
    │       ↓                ↑       │
    │  Server A ───► Server B ───►  │
    │      ↑                   ↓     │
    │  Key: "user_789"   Key: "user_000"
    │                                │
    └────────────────────────────────┘

Key assignment rule: Each key is assigned to the next server clockwise.
```

**Key Advantages**:
1. **Minimal disruption**: Adding a server only moves ~1/N of keys (N = number of servers)
2. **Automatic scaling**: Can add/remove servers without massive data movement
3. **Load balancing**: Even distribution of keys across servers

**Virtual Nodes**: In practice, each physical server is represented by multiple virtual nodes on the ring (e.g., 100-1000) to ensure even distribution when the number of servers is small.

**Code Example** (Conceptual):
```python
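import bisect
import hashlib

def stable_hash(value):
    # Python's built-in hash() is salted per process for strings, so use a
    # stable digest to keep ring positions identical on every machine
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class ConsistentHash:
    def __init__(self, num_virtual_nodes=100):
        self.ring = []  # Sorted list of (hash_value, server) tuples
        self.num_virtual_nodes = num_virtual_nodes

    def add_server(self, server):
        # Add virtual nodes for this server
        for i in range(self.num_virtual_nodes):
            self.ring.append((stable_hash(f"{server}_{i}"), server))
        self.ring.sort()  # Sort by hash value

    def get_server(self, key):
        if not self.ring:
            raise ValueError("No servers in the ring")
        # Binary search for the first virtual node clockwise from this hash
        index = bisect.bisect_left(self.ring, (stable_hash(key),))
        # Wrap around to the first node if we ran off the end of the ring
        return self.ring[index % len(self.ring)][1]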
```
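To check the "~1/N of keys" claim, the following self-contained sketch builds a ring for 10 servers and another for 11, then lists the keys whose owner changed. The helper names (`ring_hash`, `build_ring`, `lookup`, `moved_keys`), the server names, and the choice of MD5 with 100 virtual nodes are all assumptions made for the illustration.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Stable position on the ring for a server replica or a key."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

def build_ring(servers, vnodes=100):
    """Return a sorted list of (position, server) virtual nodes."""
    return sorted((ring_hash(f"{s}_{i}"), s) for s in servers for i in range(vnodes))

def lookup(ring, key: str) -> str:
    """Owner of a key: the first virtual node clockwise from its hash."""
    index = bisect.bisect_left(ring, (ring_hash(key),))
    return ring[index % len(ring)][1]  # Wrap around past the last node

def moved_keys(num_keys: int = 10_000):
    """Keys whose owner changes when server_10 joins a 10-server ring."""
    old_ring = build_ring([f"server_{n}" for n in range(10)])
    new_ring = build_ring([f"server_{n}" for n in range(11)])
    moved = []
    for i in range(num_keys):
        key = f"user_{i}"
        before, after = lookup(old_ring, key), lookup(new_ring, key)
        if before != after:
            moved.append((key, before, after))
    return moved

moved = moved_keys()
print(f"{len(moved) / 10_000:.1%} of keys moved")  # close to 1/11
```

Every moved key lands on the new server; no key shuffles between the existing ten, which is the property that modulo placement lacks.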

**Use cases**: 
- **Distributed caches**: Memcached clusters
- **Distributed databases**: Cassandra, Riak
- **Load balancing**: Deterministic routing of requests

---

### **Bloom Filters: Space-Efficient Probabilistic Data Structures**

**Question**: How can you efficiently check if an element is in a very large set without storing the entire set?

**Answer**: Bloom filters—space-efficient probabilistic data structures that test whether an element is a member of a set.

**Properties**:
- **Space-efficient**: 1-2 bytes per element (vs. 20+ bytes for hash maps)
- **False positives possible**: Might say "yes" when answer is "no"
- **False negatives impossible**: If it says "no", element is definitely not in set
- **No deletion**: Standard bloom filters don't support deletion (though variants do)

**How It Works**:
```
1. Choose k hash functions and a bit array of m bits
2. To add an element, compute k hash values and set those bits to 1
3. To check membership, compute k hash values; if all bits are 1, element might be in set
```

**Visualization**:
```
Initial state (m=10 bits):
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Add "alice":
hash1("alice") = 2 → set bit[2] = 1
hash2("alice") = 5 → set bit[5] = 1
hash3("alice") = 9 → set bit[9] = 1

Result:
[0, 0, 1, 0, 0, 1, 0, 0, 0, 1]

Check "bob":
hash1("bob") = 2 → bit[2] = 1 ✓
hash2("bob") = 3 → bit[3] = 0 ✗ → Definitely not in set

Check "charlie":
hash1("charlie") = 2 → bit[2] = 1 ✓
hash2("charlie") = 5 → bit[5] = 1 ✓
hash3("charlie") = 9 → bit[9] = 1 ✓
→ Possibly in set (might be false positive)
```

**False Positive Rate**:
```
P(false positive) ≈ (1 - e^(-kn/m))^k

Where:
- n = number of elements in set
- m = size of bit array
- k = number of hash functions
```

**Optimal Configuration**:
- For given n and m, optimal k ≈ (m/n) * ln(2)
- For 1% false positive rate, need ~10 bits per element
- For 0.1% false positive rate, need ~15 bits per element
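These rules of thumb follow from the standard sizing formula m = -n ln(p) / (ln 2)^2. A small sketch, with `bloom_parameters` as an illustrative name:

```python
import math

def bloom_parameters(n: int, p: float):
    """Return (m, k): bits and hash functions for n elements at false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

for p in (0.01, 0.001):
    m, k = bloom_parameters(1_000_000, p)
    print(f"p={p}: {m / 1_000_000:.1f} bits per element, k={k}")
```

For one million elements this works out to about 9.6 bits per element with 7 hash functions at 1%, and about 14.4 bits with 10 hash functions at 0.1%.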

**Code Example**:
```python
import mmh3  # MurmurHash3, a fast non-cryptographic hash
from bitarray import bitarray

class BloomFilter:
    def __init__(self, size, hash_count):
        self.size = size
        self.hash_count = hash_count
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
        
    def add(self, item):
        for i in range(self.hash_count):
            # Different seed for each hash function
            index = mmh3.hash(str(item), i) % self.size
            self.bit_array[index] = 1
            
    def might_contain(self, item):
        for i in range(self.hash_count):
            index = mmh3.hash(str(item), i) % self.size
            if not self.bit_array[index]:
                return False  # Definitely not in set
        return True  # Possibly in set
```

**Real-World Use Cases**:
- **Google Chrome**: Earlier versions used a Bloom filter to check URLs against the local Safe Browsing list of malicious sites before asking the server
- **Databases**: Query optimization (bloom filter before reading from disk)
- **Bitcoin**: Lightweight (SPV) clients send a Bloom filter to full nodes (BIP 37) so they receive only the transactions relevant to their wallet, without storing the entire blockchain
- **Web crawlers**: Avoid re-crawling URLs
- **Spell checkers**: Quickly determine that a word is definitely not in the dictionary

**System Design Pattern**:
```
Request → Bloom Filter Check
    ├─→ Negative (definitely not in DB) → Return 404
    └─→ Positive (might be in DB) → Query Database → Return result
```

With a filter sized for a 1% false positive rate, this stops about 99% of lookups for non-existent keys before they reach the database.
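The pattern can be sketched with the standard library alone. In the example below, a dictionary stands in for the database, `TinyBloom` and `GuardedStore` are hypothetical names, and the k index positions are derived from one SHA-256 digest by double hashing rather than from k independent hash functions.

```python
import hashlib

class TinyBloom:
    """Minimal Bloom filter backed by a bytearray."""

    def __init__(self, m_bits: int, k: int):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _indexes(self, key: str):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # Odd step so probes differ
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for idx in self._indexes(key):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(key))

class GuardedStore:
    """Consult the Bloom filter before touching the (simulated) database."""

    def __init__(self, records: dict, m_bits: int = 10_000, k: int = 7):
        self.records = records  # Stands in for the database
        self.db_queries = 0
        self.bloom = TinyBloom(m_bits, k)
        for key in records:
            self.bloom.add(key)

    def get(self, key: str):
        if not self.bloom.might_contain(key):
            return None         # Definitely absent: skip the database
        self.db_queries += 1    # Present, or a false positive
        return self.records.get(key)
```

With 1,000 stored keys and 10,000 bits, only a small share of lookups for missing keys should reach `records`; the exact share depends on the keys and the hash.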

---

### **Skip Lists: Probabilistic Search Structures**

**Problem**: Sorted arrays support O(1) access by index and O(log n) search but O(n) insertion/deletion. Balanced trees support O(log n) operations but are complex to implement and maintain. What if we want something simpler?

**Skip Lists**: Probabilistic data structures that provide O(log n) search, insertion, and deletion with simpler implementation than balanced trees.

**How It Works**: A multi-level linked list where higher levels skip many elements, similar to how you might look at the chapter titles before reading paragraphs.

**Visualization**:
```
Level 2:      1 ────────────→ 10 ────────────→ 30
Level 1:      1 ────→ 5 ────→ 10 ────→ 20 ──→ 30
Level 0: 0 ──→ 1 ──→ 5 ──→ 8 ──→ 10 ──→ 20 ──→ 30 ──→ ∞
```

**Search for 20**:
1. Start at Level 2: 1 → 10 → 30 (20 is between 10 and 30)
2. Drop to Level 1: 10 → 20 (found!)
3. Only 4 node visits instead of 6 in Level 0

**Insertion Process**:
```
1. Insert node into Level 0 (like normal linked list)
2. Randomly decide whether to promote to Level 1 (usually 50% chance)
3. If promoted to Level 1, flip again to decide on Level 2 (another 50% chance, so about 25% of nodes reach Level 2)
4. Continue until node isn't promoted or reaches max level
```

**Why This Works**: The probability distribution ensures that higher levels have exponentially fewer nodes, mimicking the structure of balanced trees.
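A quick simulation shows the geometric distribution of node heights. This sketch reuses the coin-flip rule with p = 0.5; `level_histogram` and the fixed seed are illustrative choices, and the printed counts are approximate by nature.

```python
import random
from collections import Counter

def random_level(rng: random.Random, probability: float = 0.5, max_levels: int = 16) -> int:
    """Height of a new node: keep promoting while the coin flip succeeds."""
    level = 1
    while rng.random() < probability and level < max_levels:
        level += 1
    return level

def level_histogram(num_nodes: int, seed: int = 42) -> Counter:
    """Count how many of num_nodes nodes end up with each height."""
    rng = random.Random(seed)
    return Counter(random_level(rng) for _ in range(num_nodes))

hist = level_histogram(100_000)
for level in sorted(hist)[:5]:
    print(level, hist[level])  # Each height has about half as many nodes as the one before
```

About half the nodes stop at height 1, a quarter at height 2, and so on, so each level up has roughly half as many nodes to step through.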

**Code Example** (Conceptual):
```python
import random

class SkipListNode:
    def __init__(self, value, levels):
        self.value = value
        self.next = [None] * levels

class SkipList:
    def __init__(self, max_levels=16, probability=0.5):
        self.max_levels = max_levels
        self.probability = probability
        self.head = SkipListNode(-float('inf'), max_levels)
        
    def random_level(self):
        level = 1
        while random.random() < self.probability and level < self.max_levels:
            level += 1
        return level
        
    def search(self, value):
        current = self.head
        for level in reversed(range(self.max_levels)):
            while current.next[level] and current.next[level].value < value:
                current = current.next[level]
        current = current.next[0]
        return current is not None and current.value == value
        
    def insert(self, value):
        update = [None] * self.max_levels
        current = self.head
        
        for level in reversed(range(self.max_levels)):
            while current.next[level] and current.next[level].value < value:
                current = current.next[level]
            update[level] = current
            
        new_level = self.random_level()
        new_node = SkipListNode(value, new_level)
        
        for level in range(new_level):
            new_node.next[level] = update[level].next[level]
            update[level].next[level] = new_node
```

**Real-World Use Cases**:
- **Redis**: Implementing sorted sets (ZSET) data structure
- **Apache Cassandra**: Memtable (built on Java's `ConcurrentSkipListMap`)
- **LevelDB/RocksDB**: Memtable structure

**Advantages over Balanced Trees**:
- Simpler to implement correctly
- Cheap ordered range scans (walk the bottom-level linked list)
- No need for rebalancing rotations
- Well suited to concurrent access (an insert touches only a few neighboring pointers, so fine-grained locking or lock-free updates are practical)

---

### **Bitmasks: Efficient Flag Storage**

**Problem**: How do you efficiently store and check multiple boolean flags for millions of items?

**Answer**: Bitmasks—using individual bits within integers to represent boolean values.

**Basic Concept**: A 64-bit integer can store 64 boolean flags.
```
Integer:  0000 0010 0100 1001 (binary; bit 0 is the rightmost bit)

Bit 0 = 1 → Flag 0 is True
Bit 3 = 1 → Flag 3 is True
Bit 6 = 1 → Flag 6 is True
Bit 9 = 1 → Flag 9 is True
All other bits = 0 → Other flags are False
```

**Bitwise Operations**:

**Set a bit** (turn flag on):
```python
flags = 0b00001010  # Initial flags
flags |= 0b00000100  # Set bit 2 (| is OR)
# Result: 0b00001110
```

**Clear a bit** (turn flag off):
```python
flags = 0b00001110
flags &= ~0b00000100  # Clear bit 2 (& is AND, ~ is NOT)
# Result: 0b00001010
```

**Check a bit** (is flag set?):
```python
flags = 0b00001110
is_set = bool(flags & 0b00000100)  # Check bit 2
# Result: True
```

**Toggle a bit** (flip flag):
```python
flags = 0b00001110
flags ^= 0b00000100  # Toggle bit 2 (^ is XOR)
# Result: 0b00001010
```

**Real-World Example: File Permissions in Unix/Linux**:
```python
# Unix file permissions (rwxrwxrwx)
# Read = 4 (100), Write = 2 (010), Execute = 1 (001)

# Owner permissions: read + write = 4 + 2 = 6 (110)
# Group permissions: read only = 4 (100)
# Others: no permissions = 0 (000)

permissions = 0o640  # Octal notation

# Check if owner can write
owner_can_write = bool((permissions >> 6) & 0o2)  # True

# Check if group can execute
group_can_execute = bool((permissions >> 3) & 0o1)  # False

# Grant execute permission to others
permissions |= 0o1  # Now 0o641
```

**System Design Use Cases**:
- **User permissions**: Store up to 64 permission flags in a single 64-bit integer per user
- **Feature flags**: Efficient A/B testing and gradual rollout
- **Caching**: Dirty bits indicating which cache entries need to be written to disk
- **Indexing**: Bitmap indexes in databases
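As a sketch of the feature-flag use case, the helpers below pack per-user flags into one integer. The flag names and helper functions are hypothetical, chosen only for illustration.

```python
# Each flag occupies one bit position
FLAG_DARK_MODE = 1 << 0
FLAG_BETA_SEARCH = 1 << 1
FLAG_NEW_CHECKOUT = 1 << 2

def enable(flags: int, flag: int) -> int:
    return flags | flag

def disable(flags: int, flag: int) -> int:
    return flags & ~flag

def is_enabled(flags: int, flag: int) -> bool:
    return (flags & flag) != 0

def has_all(flags: int, required: int) -> bool:
    # Test several flags at once with a single mask comparison
    return (flags & required) == required

user_flags = enable(enable(0, FLAG_DARK_MODE), FLAG_NEW_CHECKOUT)
print(bin(user_flags))                             # 0b101
print(is_enabled(user_flags, FLAG_BETA_SEARCH))    # False
print(has_all(user_flags, FLAG_DARK_MODE | FLAG_NEW_CHECKOUT))  # True
```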

**Performance Advantage**:
- 64 flags stored in 8 bytes
- All operations are CPU-native (single instruction)
- Can test all flags simultaneously with bitmask comparisons

---

## **2.4 Concurrency Basics**

In distributed systems, multiple operations often happen simultaneously. Understanding concurrency is essential for designing correct and performant systems.

### **Processes vs. Threads: Understanding the Hierarchy**

**Process**: An instance of a running program. Each process has its own memory space, file descriptors, and resources.

**Characteristics**:
- **Isolated memory**: One process cannot directly access another's memory
- **Heavyweight**: Creating a process requires a new address space and kernel bookkeeping (on Unix, `fork` copies pages lazily via copy-on-write)
- **Communication**: Requires IPC (Inter-Process Communication) mechanisms
- **Safety**: Process crashes don't affect other processes (usually)

**Thread**: The smallest unit of execution within a process. Threads share the same memory space and resources.

**Characteristics**:
- **Shared memory**: All threads in a process share the same memory
- **Lightweight**: Creating a thread is much faster than creating a process
- **Direct communication**: Can access shared variables directly
- **Vulnerable**: One thread can corrupt memory for all threads

**Visualization**:
```
Process A                          Process B
┌─────────────────────────┐       ┌─────────────────────────┐
│                         │       │                         │
│  Memory Space           │       │  Memory Space           │
│  ┌─────────────────────┐│       │  ┌─────────────────────┐│
│  │ Thread 1            ││       │  │ Thread 1            ││
│  │ (Shared memory)     ││       │  │ (Shared memory)     ││
│  │                     ││       │  │                     ││
│  │ Thread 2            ││       │  │ Thread 2            ││
│  └─────────────────────┘│       │  └─────────────────────┘│
│                         │       │                         │
│  File Descriptors       │       │  File Descriptors       │
│  Network Sockets        │       │  Network Sockets        │
└─────────────────────────┘       └─────────────────────────┘

No direct access between processes
```

**Performance Comparison**:
```
Operation                    Time (approx)
──────────────────────────────────────────────
Process creation            10-50 ms
Thread creation             0.1-1 ms
Context switch (same proc)  1-5 µs
Context switch (diff proc)  10-100 µs
```

**Use Cases**:
- **Multi-process**: Separate services (web server, database, cache), isolation (browser tabs), CPU-bound parallelism
- **Multi-thread**: Shared data structures, responsive UI (background tasks), I/O-bound work (handling multiple connections)

---

### **Race Conditions and Data Races**

**Race Condition**: The behavior of software depends on the relative timing of events. Results are non-deterministic.

**Example**: The "Bank Account Problem"
```python
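import threading
import time

# Shared balance variable
balance = 1000

# Thread 1: Deposit $100
def deposit():
    global balance
    temp = balance      # Read: 1000
    time.sleep(0.001)   # Simulates a context switch between read and write
    balance = temp + 100

# Thread 2: Deposit $200
def deposit_large():
    global balance
    temp = balance      # Read: 1000 (stale once Thread 1 writes)
    time.sleep(0.002)   # Wakes after Thread 1 has written 1100
    balance = temp + 200

t1 = threading.Thread(target=deposit)
t2 = threading.Thread(target=deposit_large)
t1.start()
t2.start()
t1.join()
t2.join()

# Expected: balance = 1300
# Actual: typically 1200 (Thread 2 overwrites Thread 1's update)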
```

**Timeline of the Race**:
```
Time  Thread 1          Thread 2          Balance
─────────────────────────────────────────────────────
0     Read 1000                           1000
1                       Read 1000          1000
2     Write 1100                          1100
3                       Write 1200         1200  ← Wrong!
```

**Types of Race Conditions**:
1. **Check-then-act**: Check a condition, then act (but condition changed)
2. **Read-modify-write**: Read value, modify, write (race between read and write)
3. **Lost update**: Multiple threads update same value based on stale reads

---

### **Synchronization Mechanisms: Preventing Race Conditions**

**Mutex (Mutual Exclusion)**: Ensures only one thread can access shared resource at a time.

**Concept**:
```
Room (shared resource) with one key (mutex)

┌─────────────────┐
│                 │
│   Thread 1      │ ← Has key, enters room
│   🔑 Key        │   (Other threads wait)
│                 │
└─────────────────┘

When Thread 1 leaves, it returns the key.
Next waiting thread gets the key and enters.
```

**Code Example**:
```python
import threading
import time

balance = 1000
balance_lock = threading.Lock()  # Mutex

def deposit(amount):
    global balance
    with balance_lock:  # Acquire lock
        temp = balance
        time.sleep(0.001)  # Even with delay, no race condition
        balance = temp + amount
    # Lock automatically released when exiting 'with' block

# Only one thread at a time runs the critical section; other code still runs concurrently
```

**Performance Impact**:
- Locks introduce overhead (tens to hundreds of nanoseconds)
- Contention (threads waiting for locks) can become bottleneck
- Deadlock risk if locks are held incorrectly

**Best Practices**:
- Hold locks for minimal time
- Avoid holding locks while calling unknown functions
- Use lock ordering to prevent deadlocks
- Consider lock-free data structures for high-contention scenarios
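To see the "hold locks for minimal time" advice in a runnable form, here is a small sketch in which several threads make many deposits and the lock covers only the read-modify-write. The function name `run_deposits` and its parameters are illustrative.

```python
import threading

def run_deposits(num_threads: int, deposits_per_thread: int, amount: int = 1) -> int:
    """Run concurrent deposits against a shared balance and return the final value."""
    balance = 0
    lock = threading.Lock()

    def worker():
        nonlocal balance
        for _ in range(deposits_per_thread):
            with lock:             # Critical section: one thread at a time
                balance += amount  # Read-modify-write is now atomic to other threads

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return balance

print(run_deposits(8, 10_000))  # 80000: no updates are lost
```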

---

### **Semaphores: Counting Resources**

**Semaphore**: A more general synchronization primitive that allows multiple threads to access a resource up to a limit.

**Binary Semaphore**: Acts like a mutex (limit = 1).

**Counting Semaphore**: Allows up to N threads to access resource simultaneously.

**Example**: Database Connection Pool
```python
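import queue
import threading

def create_database_connection():
    # Placeholder: replace with your database driver's connect() call
    return object()

class ConnectionPool:
    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.semaphore = threading.Semaphore(max_connections)
        # Open the connections once so they can be reused
        self.idle = queue.SimpleQueue()
        for _ in range(max_connections):
            self.idle.put(create_database_connection())

    def get_connection(self):
        self.semaphore.acquire()  # Decrements semaphore
        # If semaphore = 0, blocks until a connection is released
        return self.idle.get()

    def return_connection(self, connection):
        self.idle.put(connection)  # Keep the connection open for reuse
        self.semaphore.release()  # Increments semaphore

# Usage
pool = ConnectionPool(max_connections=10)

# Up to 10 threads can have connections simultaneously
# 11th thread blocks until a connection is returned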
```

**Use Cases**:
- **Resource pools**: Database connections, thread pools
- **Rate limiting**: Limit concurrent API calls
- **Producer-consumer**: Bounded buffer synchronization
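The rate-limiting use case can be demonstrated by measuring peak concurrency. In this sketch, 20 worker threads compete for a semaphore with a limit of 3, and a short sleep stands in for the API call; `run_limited` is an illustrative name.

```python
import threading
import time

def run_limited(num_workers: int, limit: int) -> int:
    """Run workers behind a semaphore and return the peak number running at once."""
    semaphore = threading.BoundedSemaphore(limit)
    state_lock = threading.Lock()  # Protects the counters below
    active = 0
    peak = 0

    def worker():
        nonlocal active, peak
        with semaphore:            # Blocks while `limit` workers are inside
            with state_lock:
                active += 1
                peak = max(peak, active)
            time.sleep(0.01)       # Stand-in for the rate-limited call
            with state_lock:
                active -= 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak

print(run_limited(num_workers=20, limit=3))  # Never more than 3
```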

---

### **Deadlocks: When Threads Wait Forever**

**Deadlock**: A situation where multiple threads are blocked waiting for each other, resulting in no progress.

**Example**: The "Dining Philosophers Problem"
```
Philosopher 1: Fork A → Fork B  (waiting for Fork B)
Philosopher 2: Fork B → Fork A  (waiting for Fork A)

Both are waiting for the other to release their fork.
Neither can proceed. Deadlock!
```

**Four Necessary Conditions for Deadlock** (Coffman conditions):
1. **Mutual exclusion**: Resources cannot be shared
2. **Hold and wait**: Thread holds a resource while waiting for another
3. **No preemption**: Resources cannot be forcibly taken
4. **Circular wait**: Thread A waits for Thread B, which waits for Thread A...

**Prevention Strategies**:

**1. Lock Ordering**: Always acquire locks in a consistent order.
```python
def transfer(account1, account2, amount):
    # Always lock the account with the smaller ID first
    first, second = sorted((account1, account2), key=lambda a: a.id)
    # 'with' releases both locks even if the transfer raises
    with first.lock, second.lock:
        # Perform transfer
        pass
```

**2. Timeouts**: Give up if lock acquisition takes too long.
```python
def transfer_with_timeout(account1, account2, amount):
    if not account1.lock.acquire(timeout=1.0):
        raise TimeoutError("Could not acquire lock on account1")
    try:
        if not account2.lock.acquire(timeout=1.0):
            # Fail loudly rather than silently skipping the transfer
            raise TimeoutError("Could not acquire lock on account2")
        try:
            # Perform transfer
            pass
        finally:
            account2.lock.release()
    finally:
        account1.lock.release()
```

**3. Try-Lock**: Attempt to acquire all locks; if any fails, release all and retry.
```python
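import random
import time

def transfer_try_lock(account1, account2, amount, max_attempts=10):
    # max_attempts is an illustrative cap so the retry loop cannot spin forever
    for _ in range(max_attempts):
        if account1.lock.acquire(blocking=False):
            try:
                if account2.lock.acquire(blocking=False):
                    try:
                        # Perform transfer
                        return True
                    finally:
                        account2.lock.release()
            finally:
                # Runs on success and on failure, so account1 is never left locked
                account1.lock.release()
        # Back off for a random interval so competing threads don't retry in lockstep
        time.sleep(random.uniform(0.001, 0.01))
    return False  # Caller decides what to do after giving up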
```

---

### **The Event Loop: Asynchronous I/O Without Threads**

**Problem**: Threads have overhead (stack memory, context switching). What if we want to handle thousands of concurrent connections with minimal overhead?

**Solution**: An event loop with asynchronous I/O, in which a single thread multiplexes all I/O operations.

**How It Works**:
```
1. Event loop maintains a queue of tasks
2. When task performs I/O (network read, disk write), it yields control
3. Event loop schedules next task from queue
4. When I/O completes, original task is resumed
```
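
The four steps above can be sketched as a toy scheduler built on Python generators. This is a teaching model under stated assumptions, not how asyncio or Node.js are implemented: `yield` stands in for "I started an I/O operation", and the loop simply treats that I/O as finished by the time the task's next turn comes around. The names `handle_request` and `run_event_loop` are hypothetical.

```python
from collections import deque

def handle_request(name, log):
    log.append(f"{name}: start, issue DB read")
    yield  # Step 2: the task hits I/O and gives control back to the loop
    log.append(f"{name}: DB read done, write response")

def run_event_loop(tasks):
    queue = deque(tasks)          # Step 1: queue of runnable tasks
    while queue:
        task = queue.popleft()    # Step 3: schedule the next task
        try:
            next(task)            # run until the task yields or finishes
            queue.append(task)    # Step 4: resume it later (I/O assumed complete)
        except StopIteration:
            pass                  # the task ran to completion

log = []
run_event_loop([handle_request(f"request-{i}", log) for i in range(1, 4)])
for line in log:
    print(line)
```

All three requests start before any of them finishes, even though only one thread is running: the interleaving comes entirely from tasks yielding at their I/O points.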

**Comparison**:
```
Threaded Model:              Event Loop Model (1 thread):
Thread 1:                    Task 1: Read from DB → I/O wait → yield
  Handle Request 1           Task 2: Handle Request 2
  Read from DB               Task 3: Handle Request 3
  Write response             ...
Thread 2:                    I/O completes → Task 1 resumed
  Handle Request 2           Task 1: Write response
  Read from DB
  Write response
... (1000 threads)
```

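**Memory Usage** (illustrative; real figures depend on the runtime, stack size settings, and per-task state):
```
Threaded (1000 threads, 2MB stack per thread): ~2GB of stack space reserved
Event Loop (1 thread, task objects): ~50MB memory
```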

**Code Example (Node.js/JavaScript)**:
```javascript
// Node.js uses an event loop by default
const fs = require('fs');

// Non-blocking file read
fs.readFile('large_file.txt', (err, data) => {
  // This callback runs once the read finishes (or fails)
  if (err) {
    console.error('File read failed:', err.message);
    return;
  }
  console.log('File read complete!');
});

console.log('This prints BEFORE file read completes');

// Output:
// "This prints BEFORE file read completes"
// "File read complete!"
```

**Code Example (Python asyncio)**:
```python
import asyncio

async def fetch_data(url):
    print(f"Fetching {url}...")
    await asyncio.sleep(2)  # Simulate network call
    print(f"Data from {url} received!")
    return f"Data from {url}"

async def main():
    # Execute three fetches concurrently
    tasks = [
        fetch_data('http://api1.com'),
        fetch_data('http://api2.com'),
        fetch_data('http://api3.com')
    ]
    results = await asyncio.gather(*tasks)
    return results

# Total time: 2 seconds (not 6 seconds!)
asyncio.run(main())
```

**Use Cases**:
- **High-concurrency web servers**: Node.js, Python asyncio
- **Real-time applications**: Chat servers, multiplayer games
- **Microservices**: API gateways, proxy servers

**Trade-offs**:
- **Pros**: Low memory usage, no lock contention, efficient I/O handling
- **Cons**: CPU-bound tasks block entire event loop, requires async-compatible libraries, steeper learning curve
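
The first drawback has a common workaround: hand blocking work to a worker thread or process so the loop stays responsive. Below is a minimal sketch using `loop.run_in_executor` from the standard library. `blocking_work` and `heartbeat` are hypothetical names, and `time.sleep` stands in for any call that would otherwise stall the loop. For genuinely CPU-bound Python code, a `ProcessPoolExecutor` is usually the better executor because threads share the interpreter lock.

```python
import asyncio
import time

def blocking_work():
    time.sleep(0.2)  # stands in for a blocking call that would stall the loop
    return "done"

async def heartbeat(ticks):
    # Keeps ticking only while the event loop is free to schedule it
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)

async def main():
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks))
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default thread pool executor
    result = await loop.run_in_executor(None, blocking_work)
    beat.cancel()
    try:
        await beat
    except asyncio.CancelledError:
        pass
    return result, len(ticks)

result, tick_count = asyncio.run(main())
print(result, "- heartbeat ticks while waiting:", tick_count)
```

If `blocking_work()` were called directly inside `main()`, the heartbeat would not tick at all until it returned.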

---

## **2.5 Basic Math for Capacity Planning**

System design often requires back-of-the-envelope calculations to make quick decisions. You don't need complex math, but you do need to understand the basics of estimating capacity.

### **Throughput and QPS (Queries Per Second)**

**QPS**: Number of queries/requests processed per second.

**Basic Calculation**:
```
QPS = Total Users × Requests per User / Time Window

Example:
- 1 million daily active users (DAU)
- Each user makes 10 requests per day
- Total daily requests = 10 million
- Average QPS = 10,000,000 / 86,400 ≈ 116 requests/second
- Peak hour has 20% of traffic = 2 million requests
- Peak QPS = 2,000,000 / 3,600 ≈ 556 requests/second
```
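
The same estimate as a small helper, so the inputs can be varied quickly. This is a sketch: the function name and parameters are assumptions, and the peak-hour share is whatever fraction of daily traffic you expect in the busiest hour.

```python
def estimate_qps(daily_active_users, requests_per_user, peak_hour_share):
    """Return (average_qps, peak_qps) for one day of traffic."""
    daily_requests = daily_active_users * requests_per_user
    average_qps = daily_requests / 86_400                 # seconds in a day
    peak_qps = daily_requests * peak_hour_share / 3_600   # seconds in an hour
    return average_qps, peak_qps

average, peak = estimate_qps(1_000_000, 10, 0.20)
print(f"average: {average:.0f} QPS, peak: {peak:.0f} QPS")
# average: 116 QPS, peak: 556 QPS
```

The peak is nearly 5x the average here, which is why sizing for the daily average alone under-provisions.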

**Real-World QPS Estimates** (rough orders of magnitude; published figures vary by source and year):
```
Service              Peak QPS
─────────────────────────────────────
Twitter              30,000+ (tweets per second)
Google Search        70,000+
YouTube              100,000+ (video starts per second)
AWS                  1,000,000+ (API calls)
Small SaaS app       10-100
Medium app           1,000-10,000
Large app            10,000-100,000
```

**Server Capacity Planning**:
```
If each server can handle 1,000 QPS:
- Required for 5,000 QPS: 5 servers
- Required for 50,000 QPS: 50 servers
- Required for 500,000 QPS: 500 servers

Add buffer for:
- Peak traffic (2-5x average)
- Server failures (1-2 extra servers)
- Maintenance capacity (another 10-20%)
```
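
A sketch of how the buffers might be folded into one number. The defaults (20% headroom, 2 spare servers) are assumptions picked from the ranges above, and the input is expected to be peak QPS, so the peak multiplier is already applied.

```python
import math

def servers_needed(peak_qps, qps_per_server, headroom_pct=20, spares=2):
    """Servers for peak load, plus percentage headroom and spare machines."""
    base = math.ceil(peak_qps / qps_per_server)
    # Integer percentage keeps the arithmetic exact before rounding up
    with_headroom = math.ceil(base * (100 + headroom_pct) / 100)
    return with_headroom + spares

print(servers_needed(5_000, 1_000))    # 5 base -> 6 with headroom -> 8
print(servers_needed(50_000, 1_000))   # 50 base -> 60 with headroom -> 62
```

Rounding up at each step is deliberate: capacity comes in whole servers, and under-provisioning is the more expensive mistake.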

---

### **Storage Estimation**

**Formula**: Storage = Records × Size per Record

**Example: Social Media Posts**
```
Assumptions:
- Post ID: 8 bytes (64-bit integer)
- User ID: 8 bytes
- Timestamp: 8 bytes
- Content: 1,000 bytes (500 characters, 2 bytes each)
- Metadata: 100 bytes

Total per post: ~1,124 bytes

Daily posts:
- 1 million users
- 2 posts per user per day
- 2 million posts per day

Daily storage: 2,000,000 × 1,124 bytes ≈ 2.2 GB
Yearly storage: 2.2 GB × 365 ≈ 800 GB

5-year storage: 4 TB (plus ~20% overhead = 5 TB)
```

**Example: Chat Messages**
```
Assumptions:
- Message ID: 16 bytes (UUID)
- Sender ID: 8 bytes
- Recipient ID: 8 bytes
- Timestamp: 8 bytes
- Message text: 200 bytes
- Metadata: 50 bytes

Total per message: ~290 bytes

Daily messages:
- 500,000 users
- 50 messages per user per day
- 25 million messages per day

Daily storage: 25,000,000 × 290 bytes ≈ 7.25 GB
Yearly storage: 7.25 GB × 365 ≈ 2.6 TB
```
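
Both examples reduce to the same multiplication, sketched below with a hypothetical `storage_estimate` helper. The totals come out slightly higher than the rounded figures above because no intermediate value is rounded. Units are decimal (1 GB = 10^9 bytes).

```python
def storage_estimate(records_per_day, bytes_per_record, days=365):
    """Return raw storage in bytes, before indexes, replicas, or backups."""
    return records_per_day * bytes_per_record * days

GB = 10**9   # decimal units, as used in the examples above
TB = 10**12

posts_per_year = storage_estimate(2_000_000, 1_124)
chat_per_year = storage_estimate(25_000_000, 290)

print(f"posts: {posts_per_year / GB:.0f} GB/year")   # posts: 821 GB/year
print(f"chat:  {chat_per_year / TB:.2f} TB/year")    # chat:  2.65 TB/year
```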

**Database Storage Overhead**:
- Add 20-30% for database overhead (indexes, page headers)
- Multiply by the replication factor for replicas (3 copies of the data = 3x storage)
- Add roughly 50% for backups (incremental backups reduce this)

**Compression Savings** (typical ranges relative to uncompressed data; actual ratios depend on the content):
- Text: 50-70% reduction (gzip, LZ4)
- Images: 50-90% reduction (WebP, AVIF)
- Video: 90-99% reduction (H.264, VP9, AV1)

---

### **Network Bandwidth**

**Formula**: Bandwidth = Requests per Second × Average Response Size

**Example: Image Serving**
```
Assumptions:
- 10,000 image requests per second
- Average image size: 500 KB

Bandwidth: 10,000 × 500 KB = 5 GB/s = 40 Gbps

Network requirements:
- 40 Gbps of bandwidth
- With redundancy: 2 × 40 Gbps = 80 Gbps total
- For 3 availability zones: 3 × 80 Gbps = 240 Gbps
```

**Example: API Response Bandwidth**
```
Assumptions:
- 5,000 API requests per second
- Average response size: 10 KB

Bandwidth: 5,000 × 10 KB = 50 MB/s = 400 Mbps

With CDN:
- 80% of bandwidth served from CDN (cached responses)
- 20% served from origin
- Origin bandwidth: 400 Mbps × 0.2 = 80 Mbps
```
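
The byte-to-bit conversion is the step most often dropped in these estimates, so it helps to put it in one place. A sketch with a hypothetical `bandwidth_bps` helper; `origin_share` is an assumed parameter for the fraction of traffic that reaches the origin after CDN caching.

```python
def bandwidth_bps(requests_per_second, response_bytes, origin_share=1.0):
    """Bits per second needed; origin_share < 1 models a CDN absorbing traffic."""
    return requests_per_second * response_bytes * 8 * origin_share

KB = 1_000  # decimal kilobyte, matching the arithmetic above

images = bandwidth_bps(10_000, 500 * KB)
api_origin = bandwidth_bps(5_000, 10 * KB, origin_share=0.2)

print(f"images: {images / 1e9:.0f} Gbps")          # images: 40 Gbps
print(f"API origin: {api_origin / 1e6:.0f} Mbps")  # API origin: 80 Mbps
```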

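**Cost Estimation** (illustrative tiered rates modeled on AWS us-east-1 data transfer out; check the current pricing page before relying on them):
```
Data Transfer Out (per month):
- First 10 TB:   $0.09/GB
- Next 40 TB:    $0.085/GB
- Next 100 TB:   $0.07/GB
- Beyond 150 TB: $0.05/GB

Example: 1 PB (1,000 TB) of data transfer per month, with 1 TB = 1,000 GB
= 10 TB × $0.09/GB + 40 TB × $0.085/GB + 100 TB × $0.07/GB + 850 TB × $0.05/GB
= $900 + $3,400 + $7,000 + $42,500
≈ $53,800/month (high traffic!)
```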

---

### **Latency Budget**

**Latency Budget**: The maximum acceptable time allotted to each component of a request, chosen so that the parts sum to the end-to-end target.

**Example: 200ms Total Budget**
```
Component                Budget     Actual    Status
─────────────────────────────────────────────────────
Client processing       10 ms      8 ms      ✓
Network (CDN edge)      30 ms      25 ms     ✓
CDN cache lookup        20 ms      15 ms     ✓
Application processing  50 ms      45 ms     ✓
Database query          40 ms      60 ms     ✗ (over budget)
Cache fallback          10 ms      5 ms      ✓
Response serialization  10 ms      8 ms      ✓
Network return          30 ms      25 ms     ✓
─────────────────────────────────────────────────────
Total                   200 ms     191 ms    ✓

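The total is within 200 ms only because other components came in under budget.
To bring the database back within its own budget, options: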
1. Add more read replicas
2. Optimize queries and add indexes
3. Increase cache hit rate
4. Denormalize data
```
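
A budget table like this is easy to check automatically. The sketch below uses the figures from the table; the tuple layout and variable names are assumptions.

```python
# (component, budget_ms, actual_ms), taken from the table above
measurements = [
    ("Client processing", 10, 8),
    ("Network (CDN edge)", 30, 25),
    ("CDN cache lookup", 20, 15),
    ("Application processing", 50, 45),
    ("Database query", 40, 60),
    ("Cache fallback", 10, 5),
    ("Response serialization", 10, 8),
    ("Network return", 30, 25),
]

TOTAL_BUDGET_MS = 200

over_budget = [name for name, budget, actual in measurements if actual > budget]
total_budget = sum(budget for _, budget, _ in measurements)
total_actual = sum(actual for _, _, actual in measurements)

print("over budget:", over_budget)                        # ['Database query']
print(f"total: {total_actual} ms of {TOTAL_BUDGET_MS} ms")  # total: 191 ms of 200 ms
```

Checking each component separately matters: the end-to-end total passes here even though one component is 50% over its allotment.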

**P99 Latency Calculation**:
```
If 98% of requests finish in ~150ms, but 2% take ~5 seconds:
- Average latency: ~250ms (misleading: almost no request actually takes this long)
- P50 latency: ~150ms
- P95 latency: ~150ms
- P99 latency: ~5,000ms (users notice!)
- P99.9 latency: ~5,000ms (users give up!)

Always track tail percentiles such as P99, not just averages.
```
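
To see how an average hides the tail, the sketch below computes percentiles over synthetic data using the nearest-rank method. The sample (980 requests at 150 ms and 20 at 5,000 ms) is an assumption chosen for illustration, and `percentile` is a hypothetical helper; monitoring systems often use interpolating or approximate estimators instead.

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic data (assumed): 980 fast requests and 20 slow ones
latencies_ms = [150] * 980 + [5000] * 20

print("mean:", statistics.mean(latencies_ms))
for p in (50, 95, 99):
    print(f"P{p}:", percentile(latencies_ms, p))
```

The mean lands at 247 ms, a latency no request in the sample actually experienced, while P50 and P95 report 150 ms and P99 reports 5,000 ms.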

---

## **2.6 Key Takeaways**

1. **Networking is the slowest component**: Network calls are 1,000-1,000,000x slower than memory operations. Minimize them.

2. **Protocol selection matters**: Use TCP for reliability, UDP for speed. Use HTTP/2 for efficient multiplexing over a single connection, and HTTP/3 (QUIC over UDP) for better performance on lossy or high-latency networks.

3. **Data structure selection is critical**: Bloom filters for membership testing, consistent hashing for distribution, skip lists for efficient indexing.

4. **Concurrency is complex**: Understand race conditions, deadlocks, and synchronization. Choose threads vs. event loops based on your workload.

5. **Back-of-the-envelope calculations**: Quick math (QPS, storage, bandwidth) guides architectural decisions and prevents costly mistakes.

6. **Measure first, optimize second**: Before optimizing, know your baseline. Effort spent optimizing components that are not the bottleneck is largely wasted.

---

## **Chapter Summary**

In this chapter, we built the foundation for system design by understanding networking fundamentals, the hierarchy of speed, specialized data structures, concurrency concepts, and capacity planning math.

We learned that network latency is the enemy of performance, that different protocols serve different needs, and that specialized data structures like Bloom filters and consistent hashing are essential for distributed systems.

We explored concurrency from processes to event loops, understanding how to prevent race conditions and deadlocks. Finally, we learned to calculate requirements quickly using basic math—skills that will guide our architectural decisions in every subsequent chapter.

**Coming up next**: In Chapter 3, we'll dive deep into databases—the heart of most systems. We'll explore relational and NoSQL databases, indexing strategies, sharding, and the CAP theorem in detail.

---

**Exercises**:

1. **Network Latency**: Calculate the total latency for a request that:
   - Travels from user in London to server in New York (80ms round-trip)
   - Makes 3 database queries (each 10ms)
   - Processes 50ms of business logic
   - What's the total? Where would you optimize?

2. **Bloom Filter Design**: You need a Bloom filter for 1 billion URLs with a 1% false positive rate. How many bits do you need? How many hash functions?

3. **Capacity Planning**: You're building a video streaming service with 1 million users, each watching 2 hours of video per day (average 3 Mbps). Calculate:
   - Daily storage requirements
   - Peak bandwidth requirements (assuming 20% of users watch during peak hour)
   - How many 10 Gbps network connections do you need?

4. **Concurrency**: Identify the race condition in this code and fix it:
```python
import time

counter = 0

def increment():
    global counter
    temp = counter
    time.sleep(0.001)
    counter = temp + 1

# What happens if 10 threads call increment() simultaneously?
```

