# **Chapter 3: Databases – The Persistence Layer**

Databases are the heart of most systems. They store and retrieve data reliably, ensuring data survives crashes, power failures, and disasters. Understanding database internals, capabilities, and trade-offs is essential for every system designer.

---

## **3.1 The Role of Databases in System Design**

**Definition**: A database is an organized collection of structured information, or data, typically stored electronically in a computer system. A database is usually controlled by a database management system (DBMS).

**Why Databases Matter**:
- **Data persistence**: Survive application restarts and system crashes
- **Data integrity**: ACID guarantees ensure consistent state
- **Data querying**: Efficiently retrieve and filter large datasets
- **Data relationships**: Enforce relationships between related entities
- **Data security**: Control access and permissions

**Database Evolution Timeline**:
```
1960s: Hierarchical and Network Models (IMS, IDMS)
1970s: Relational Model (System R, Ingres)
1980s: Commercial SQL Databases (Oracle, DB2)
1990s: Open Source SQL (PostgreSQL, MySQL)
2000s: NoSQL Revolution (MongoDB, Cassandra, DynamoDB)
2010s: NewSQL and Cloud Databases (Spanner, Aurora)
2020s: Multi-model and Serverless Databases (CockroachDB, FaunaDB)
```

---

## **3.2 Relational Databases (RDBMS)**

Relational databases organize data into tables with rows and columns, enforcing relationships through foreign keys. They use SQL (Structured Query Language) for data manipulation.

### **ACID Properties: The Gold Standard of Data Integrity**

**ACID** stands for Atomicity, Consistency, Isolation, Durability—guarantees that database transactions are reliable.

**Atomicity**: A transaction is all-or-nothing. Either all operations complete, or none do.

**Example**:
```sql
-- Bank transfer: Transfer $100 from Account A to Account B
BEGIN TRANSACTION;

UPDATE accounts SET balance = balance - 100 WHERE id = 'A';
-- What if this succeeds but the next update fails?
UPDATE accounts SET balance = balance + 100 WHERE id = 'B';

COMMIT;  -- Both updates become permanent together

-- If either update fails, issue ROLLBACK instead of COMMIT to undo both:
-- ROLLBACK;
```

**Atomicity ensures**: The money never disappears—it's either transferred or stays where it is.
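
The same all-or-nothing behavior can be sketched from application code. The following is a minimal illustration using Python's built-in `sqlite3` module rather than a production database; the `transfer` and `balances` helpers and the unknown account `'Z'` are hypothetical names introduced only for this example.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            cur = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
            if cur.rowcount != 1:  # the credit matched no row: abort the whole transfer
                raise LookupError(f"unknown account {dst!r}")
        return True
    except LookupError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()

ok = transfer(conn, "A", "B", 100)     # both updates applied
failed = transfer(conn, "A", "Z", 50)  # credit fails, so the debit is rolled back
```

After the failed call, account A still holds its previous balance: the debit that ran inside the aborted transaction never becomes visible.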

**Consistency**: Database transitions from one valid state to another valid state, respecting all constraints.

**Example**:
```sql
CREATE TABLE accounts (
    id VARCHAR(10) PRIMARY KEY,
    balance DECIMAL(10, 2) CHECK (balance >= 0)  -- Constraint: no negative balances
);

-- Assume account A currently holds 50.
-- The UPDATE below is rejected because it violates the CHECK constraint:
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 'A';  -- Would make the balance -50
-- The database raises a constraint violation, so this change can never be committed
ROLLBACK;
```

**Consistency ensures**: Business rules are always enforced, even in failure scenarios.
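
As a rough sketch of how an application observes this, the snippet below reproduces the constraint with Python's built-in `sqlite3` module (an assumption made for portability; the error class differs in other drivers). The starting balance of 50 is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        id TEXT PRIMARY KEY,
        balance INTEGER CHECK (balance >= 0)
    )
""")
conn.execute("INSERT INTO accounts VALUES ('A', 50)")
conn.commit()

try:
    with conn:  # rolls back automatically when the constraint fails
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 'A'")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

balance = conn.execute("SELECT balance FROM accounts WHERE id = 'A'").fetchone()[0]
```

The rule lives in the schema, so every client is held to it, not only the ones that remember to check the balance first.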

**Isolation**: Concurrent transactions don't interfere with each other. How strongly each transaction is shielded from the others depends on the isolation level it runs at.

**Isolation Levels** (PostgreSQL example):
```sql
-- Read Uncommitted: May see uncommitted changes from other transactions (dangerous!)
-- PostgreSQL accepts this level but treats it as Read Committed
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

-- Read Committed: Can only see committed changes (default in PostgreSQL, SQL Server)
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Repeatable Read: If you read a row twice during a transaction, it's the same
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Serializable: Result is equivalent to running the transactions one at a time (safest, slowest)
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
```

**Example: Lost Update Problem** (without proper isolation):
```
Time  Transaction 1          Transaction 2          Balance
────────────────────────────────────────────────────────────
0     Read balance: 100                             100
1                            Read balance: 100      100
2     Add 50 → Write 150                            150
3                            Add 20 → Write 120     120
4                            Commit                 120
5     Commit                                        120

Expected: 170 (100 + 50 + 20)
Actual: 120 (Transaction 2 overwrote Transaction 1)
```

**With Isolation**: Transaction 2 would be blocked until Transaction 1 commits, or would detect conflict and retry.
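
One common way to "detect conflict and retry" at the application level is optimistic concurrency control with a version column. The sketch below assumes a hypothetical `version` column and helper names, and uses Python's built-in `sqlite3` module; it replays the timeline above on a single connection to keep the example deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER, version INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('A', 100, 0)")
conn.commit()

def read(conn, account_id):
    return conn.execute(
        "SELECT balance, version FROM accounts WHERE id = ?", (account_id,)).fetchone()

def write_if_unchanged(conn, account_id, new_balance, seen_version):
    """Write only if nobody committed since we read; returns True on success."""
    with conn:
        cur = conn.execute(
            "UPDATE accounts SET balance = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_balance, account_id, seen_version))
    return cur.rowcount == 1

# Both writers read the same starting state, as in the timeline.
bal1, ver1 = read(conn, "A")
bal2, ver2 = read(conn, "A")

t1_ok = write_if_unchanged(conn, "A", bal1 + 50, ver1)          # succeeds
t2_first_try = write_if_unchanged(conn, "A", bal2 + 20, ver2)   # stale version: rejected

t2_ok = t2_first_try
if not t2_ok:  # retry from a fresh read instead of overwriting
    bal2, ver2 = read(conn, "A")
    t2_ok = write_if_unchanged(conn, "A", bal2 + 20, ver2)

final_balance, _ = read(conn, "A")
```

The stale write is refused rather than silently applied, so the second writer recomputes from the current balance and both increments survive.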

**Durability**: Once a transaction commits, changes are permanent—even if power fails immediately after.

**How databases achieve durability**:
- **Write-ahead logging (WAL)**: Changes are written to a log file before being written to data files
- **Checkpointing**: Periodically flushing modified pages to the data files, which limits how much of the log must be replayed after a crash
- **Replication**: Copying data to multiple machines

**Durability sequence**:
```
1. Client sends transaction
2. Database writes transaction to WAL log (synchronously, confirmed by disk)
3. Database tells client "transaction committed"
4. Later, database lazily updates actual data files from WAL

If power fails after step 2 but before step 4:
- On restart, database replays WAL log
- All committed transactions are recovered
- Client was told success, so they expect success
```
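
The sequence above can be illustrated with a toy write-ahead log. This is a deliberately simplified sketch, not how any particular database implements its WAL: the `TinyStore` class, the JSON record format, and the file name are all assumptions, and it ignores torn writes, checkpoints, and log truncation.

```python
import json
import os
import tempfile

class TinyStore:
    """Toy key-value store: each write is logged and fsynced before it is acknowledged."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        self._replay()
        self.wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self):
        # Recovery: re-apply every logged record in order.
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        self.wal.write(json.dumps({"key": key, "value": value}) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())  # step 2: the log record is on disk
        self.data[key] = value       # in-memory state stands in for the data files
        return "committed"           # step 3: acknowledge the client

    def close(self):
        self.wal.close()

wal_file = os.path.join(tempfile.mkdtemp(), "store.wal")
store = TinyStore(wal_file)
store.put("balance:A", 0)
store.put("balance:B", 100)
store.close()                      # simulate a crash: in-memory state is lost

recovered = TinyStore(wal_file)    # restart: state is rebuilt from the log
```

The acknowledgement is returned only after `os.fsync`, which is what lets the restarted store promise that every acknowledged write is still there.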

---

### **Schema Design: Normalization and Denormalization**

**Normalization**: Organizing data to minimize redundancy and dependency.

**Normalization Levels**:
- **1NF**: Each cell contains a single value (no multi-valued attributes)
- **2NF**: 1NF + no partial dependencies (no non-key attribute depends on part of a primary key)
- **3NF**: 2NF + no transitive dependencies (no non-key attribute depends on another non-key attribute)

**Example**: Denormalized → Normalized

**Denormalized (violates 3NF)**:
```sql
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_name VARCHAR(100),
    customer_address VARCHAR(200),
    customer_email VARCHAR(100),
    order_date DATE,
    total_amount DECIMAL(10, 2)
);

-- Problems:
-- Customer info repeated for every order (redundancy)
-- Update customer address → must update all their orders
```

**Normalized (3NF)**:
```sql
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name VARCHAR(100),
    address VARCHAR(200),
    email VARCHAR(100)
);

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_date DATE,
    total_amount DECIMAL(10, 2),
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

-- Benefits:
-- Customer info stored once
-- Update address → single row update
-- Add customer → no need to create order first
```
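
To see the "single row update" benefit in action, here is a small sketch of the normalized schema above using Python's built-in `sqlite3` module. The sample customer, addresses, and orders are made up for illustration, and the column types are SQLite equivalents of the ones in the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name TEXT, address TEXT, email TEXT
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        order_date TEXT,
        total_amount REAL
    );
    INSERT INTO customers VALUES (1, 'Alice', '1 Old Road', 'alice@example.com');
    INSERT INTO orders VALUES (10, 1, '2024-01-15', 125.50);
    INSERT INTO orders VALUES (11, 1, '2024-02-20', 89.99);
""")

# A one-row update in customers...
conn.execute("UPDATE customers SET address = '2 New Street' WHERE customer_id = 1")

# ...is visible on every order through the join.
rows = conn.execute("""
    SELECT o.order_id, c.address
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
```

The price is the join itself: every read that needs the address now touches two tables.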

**Trade-offs**:
- **Normalized**: Less redundancy, better data integrity, more joins required (slower reads)
- **Denormalized**: Faster reads, more redundancy, risk of data inconsistencies

**When to use which**:
- **Normalized**: Transactional systems (banking, CRM), data integrity critical
- **Denormalized**: Analytical systems (data warehouses), read-heavy workloads

---

### **Indexing: The Speed Dial for Databases**

**Index**: Data structure that allows efficient lookups without scanning entire tables.

**Problem**: Without an index, finding a row requires scanning the entire table.
```sql
-- Table with 1 million rows
SELECT * FROM users WHERE email = 'alice@example.com';

-- Without index: Sequential scan checks all 1 million rows → ~100ms (illustrative; depends on hardware and caching)
-- With index on email: B-tree traversal reads only a few pages → ~0.1ms
```
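
A quick way to confirm whether a query uses an index is to ask the database for its plan. The sketch below does this with SQLite's `EXPLAIN QUERY PLAN` through Python's built-in `sqlite3` module (PostgreSQL's equivalent is `EXPLAIN`); the table contents and the `plan` helper are assumptions made for the example, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(10_000)])

def plan(conn, sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # 4th column holds the readable detail

query = "SELECT * FROM users WHERE email = ?"
args = ("user42@example.com",)

plan_before = plan(conn, query, args)   # reports a scan of the whole table
conn.execute("CREATE INDEX idx_users_email ON users(email)")
plan_after = plan(conn, query, args)    # reports a search using idx_users_email
```

Checking the plan is more reliable than timing a single query, because caching can make a full scan look fast on small tables.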

**B-Tree Index**: The most common index type.

**B-Tree Structure**:
```
                     [100 | 300]
                    /     |     \
          [25 | 50]  [150 | 200]  [350 | 400]

Search for 200:
1. Root: 100 < 200 < 300 → follow the middle child
2. Child node [150 | 200]: key 200 found
3. Return the row pointer stored with the key (O(log n) node visits)
```

**Properties of B-Trees**:
- **Balanced**: All leaves are at the same level (guaranteed O(log n) operations)
- **Multi-way**: Each node has multiple keys and children (not just 2 like binary trees)
- **Disk-friendly**: Optimized for disk access patterns (nodes match disk block size)

**Creating Indexes**:
```sql
-- Single-column index
CREATE INDEX idx_users_email ON users(email);

-- Composite index (order matters!)
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);

-- Unique index (also enforces uniqueness constraint)
CREATE UNIQUE INDEX idx_users_username ON users(username);

-- Partial index (index only subset of rows)
CREATE INDEX idx_active_users ON users(last_login) WHERE active = true;
```

**Index Selection** (PostgreSQL example):
```sql
-- Query
SELECT * FROM orders 
WHERE customer_id = 123 AND order_date >= '2024-01-01';

-- Which index is used?
-- If only idx_customer_id exists: Use it, filter by date manually
-- If only idx_date exists: Use it, filter by customer_id manually
-- If idx_customer_date exists: Use it, both conditions satisfied (best!)
-- If idx_date_customer exists: Use it, but less efficient (wrong order)
```

**When Indexes Are Used vs. Not Used**:
```sql
-- Index used: WHERE clause on indexed column
SELECT * FROM users WHERE email = 'alice@example.com';  -- Uses email index

-- Index can be used: ORDER BY on indexed column
SELECT * FROM users ORDER BY email;  -- Planner may read rows in index order instead of sorting

-- Index NOT used: Function on indexed column
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';  -- A plain index on email doesn't apply
-- Solution: Create an expression index on LOWER(email) or store a lowercase version

-- Index sometimes NOT used: OR condition across columns
SELECT * FROM users WHERE email = 'alice@example.com' OR id = 123;
-- If both columns are indexed, PostgreSQL (for example) can combine the two indexes with a bitmap OR

-- Index NOT used: Leading wildcard in LIKE
SELECT * FROM users WHERE email LIKE '%@example.com';  -- A B-tree index can't help
-- Prefix patterns CAN use a B-tree index
-- (in PostgreSQL this requires the C collation or a text_pattern_ops index):
SELECT * FROM users WHERE email LIKE 'alice@%';
```

**Trade-offs**:
- **Pros**: 100-1,000x faster lookups
- **Cons**: Slower writes (each INSERT/UPDATE/DELETE must update index), more storage (20-100% of table size)

**Rule of thumb**: Index columns frequently used in WHERE clauses and JOINs, but don't over-index.

---

### **Foreign Keys and Relationships**

**Foreign Key**: A column (or set of columns) in one table that references the primary key, or another unique key, of a row in another table, establishing a relationship between the two.

**Types of Relationships**:

**One-to-One**: Each row in Table A relates to at most one row in Table B.
```sql
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    username VARCHAR(50),
    email VARCHAR(100)
);

CREATE TABLE user_profiles (
    profile_id INT PRIMARY KEY,
    user_id INT NOT NULL UNIQUE,  -- UNIQUE enforces one-to-one; NOT NULL ties every profile to a user
    bio TEXT,
    avatar_url VARCHAR(200),
    FOREIGN KEY (user_id) REFERENCES users(user_id)
);

-- Each user has at most one profile
-- Each profile belongs to exactly one user
```

**One-to-Many**: Each row in Table A relates to many rows in Table B.
```sql
CREATE TABLE departments (
    dept_id INT PRIMARY KEY,
    dept_name VARCHAR(100)
);

CREATE TABLE employees (
    emp_id INT PRIMARY KEY,
    emp_name VARCHAR(100),
    dept_id INT NOT NULL,  -- NOT NULL: every employee must belong to a department
    FOREIGN KEY (dept_id) REFERENCES departments(dept_id)
);

-- Each department has many employees
-- Each employee belongs to exactly one department
```

**Many-to-Many**: Each row in Table A relates to many rows in Table B, and vice versa. Requires junction table.
```sql
CREATE TABLE students (
    student_id INT PRIMARY KEY,
    student_name VARCHAR(100)
);

CREATE TABLE courses (
    course_id INT PRIMARY KEY,
    course_name VARCHAR(100)
);

-- Junction table
CREATE TABLE enrollments (
    enrollment_id INT PRIMARY KEY,
    student_id INT,
    course_id INT,
    enrollment_date DATE,
    FOREIGN KEY (student_id) REFERENCES students(student_id),
    FOREIGN KEY (course_id) REFERENCES courses(course_id),
    UNIQUE (student_id, course_id)  -- Prevent duplicate enrollments
);

-- Each student can enroll in many courses
-- Each course can have many students
```

**Referential Integrity**: Foreign keys enforce that relationships remain consistent.
```sql
-- Cannot delete a department that still has employees
DELETE FROM departments WHERE dept_id = 10;
-- ERROR: update or delete on table "departments" violates foreign key constraint
-- Detail: Key (dept_id)=(10) is still referenced from table "employees".

-- Solutions:
-- 1. Delete (or reassign) the employees first, then the department
DELETE FROM employees WHERE dept_id = 10;
DELETE FROM departments WHERE dept_id = 10;

-- 2. Define ON DELETE CASCADE (automatically deletes related rows)
ALTER TABLE employees 
DROP CONSTRAINT employees_dept_id_fkey,
ADD CONSTRAINT employees_dept_id_fkey 
FOREIGN KEY (dept_id) REFERENCES departments(dept_id) 
ON DELETE CASCADE;

-- Now deleting department also deletes its employees
DELETE FROM departments WHERE dept_id = 10;  -- Also deletes all employees in dept 10
```
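
The difference between the default behavior and `ON DELETE CASCADE` can be tried out with a small sketch. It uses Python's built-in `sqlite3` module, where foreign keys must be switched on per connection; the `build`, `delete_department`, and `employee_count` helpers and the sample rows are assumptions for illustration.

```python
import sqlite3

def build(on_delete_clause):
    """Create the departments/employees schema with the given ON DELETE behavior."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
    conn.executescript(f"""
        CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
        CREATE TABLE employees (
            emp_id INTEGER PRIMARY KEY,
            emp_name TEXT,
            dept_id INTEGER REFERENCES departments(dept_id) {on_delete_clause}
        );
        INSERT INTO departments VALUES (10, 'Engineering');
        INSERT INTO employees VALUES (1, 'Alice', 10);
        INSERT INTO employees VALUES (2, 'Bob', 10);
    """)
    return conn

def delete_department(conn, dept_id):
    try:
        with conn:
            conn.execute("DELETE FROM departments WHERE dept_id = ?", (dept_id,))
        return "deleted"
    except sqlite3.IntegrityError:
        return "blocked"

def employee_count(conn):
    return conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

restrict = build("")                   # default: referenced rows cannot be deleted
cascade = build("ON DELETE CASCADE")   # deleting a department removes its employees

restrict_result = delete_department(restrict, 10)
cascade_result = delete_department(cascade, 10)
```

Cascading deletes are convenient but easy to trigger by accident, so many teams reserve them for rows that have no meaning without their parent.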

---

### **Transactions: Multi-Operation Atomicity**

**Transaction**: A sequence of operations performed as a single logical unit of work.

**Transaction Lifecycle**:
```
1. BEGIN TRANSACTION
   - Start the transaction
   - No locks are taken yet; they are acquired as statements touch rows

2. Execute operations (SELECT, INSERT, UPDATE, DELETE)
   - Changes are visible only within this transaction
   - Other transactions keep seeing the last committed data (at READ COMMITTED or higher)
   - Optional SAVEPOINTs allow rolling back part of the work

3. COMMIT or ROLLBACK
   - COMMIT: Make changes permanent and visible to other transactions
   - ROLLBACK: Revert all changes, as if this transaction never happened
```

**Example**: Transfer money between accounts
```sql
-- Transaction ensures atomicity
BEGIN TRANSACTION;

-- Check sufficient funds (for consistency)
SELECT balance FROM accounts WHERE id = 'A' FOR UPDATE;
-- FOR UPDATE locks the row, preventing other transactions from modifying it

-- If balance >= 100, proceed with transfer
UPDATE accounts SET balance = balance - 100 WHERE id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE id = 'B';

-- Record the transfer
INSERT INTO transfers (from_account, to_account, amount, timestamp)
VALUES ('A', 'B', 100.00, NOW());

-- All operations succeeded: make permanent
COMMIT;

-- If any operation failed (e.g., insufficient funds):
-- ROLLBACK;  -- Revert all changes
```

**Transaction Isolation Issues**:

**Dirty Read**: Reading uncommitted changes from another transaction.
```
Time  Transaction A                      Transaction B
──────────────────────────────────────────────────────────────
0     BEGIN;
1     UPDATE accounts SET balance = 50 WHERE id = 'A';
2                                         SELECT balance FROM accounts WHERE id = 'A';
                                          → Returns 50 (uncommitted!)
3     ROLLBACK;  -- Revert back to 100
4                                         -- Transaction B saw 50, but actual is 100
                                          → Dirty read!
```

**Prevention**: Use READ COMMITTED or higher isolation level.

**Non-Repeatable Read**: Reading same row twice within a transaction, getting different results.
```
Time  Transaction A                      Transaction B
──────────────────────────────────────────────────────────────
0     BEGIN;
1     SELECT balance FROM accounts WHERE id = 'A';
      → Returns 100
2                                         BEGIN;
3                                         UPDATE accounts SET balance = 150 WHERE id = 'A';
4                                         COMMIT;
5     SELECT balance FROM accounts WHERE id = 'A';
      → Returns 150 (different from first read!)
      → Non-repeatable read!
```

**Prevention**: Use REPEATABLE READ or SERIALIZABLE isolation level.

**Phantom Read**: Same query returns different sets of rows within a transaction.
```
Time  Transaction A                      Transaction B
──────────────────────────────────────────────────────────────
0     BEGIN;
1     SELECT * FROM accounts WHERE balance > 1000;
      → Returns 5 rows
2                                         BEGIN;
3                                         INSERT INTO accounts VALUES ('E', 2000);
4                                         COMMIT;
5     SELECT * FROM accounts WHERE balance > 1000;
      → Returns 6 rows (phantom row appeared!)
      → Phantom read!
```

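**Prevention**: Use SERIALIZABLE isolation level. The SQL standard permits phantom reads at REPEATABLE READ, although some engines (PostgreSQL, for example) already prevent them at that level.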

---

### **Popular RDBMS Systems**

**PostgreSQL**: Open-source, feature-rich, extensible.
- **Strengths**: Advanced data types (JSON, UUID, arrays), extensive indexing, excellent query optimizer
- **Best for**: Complex queries, geospatial data, JSON workloads
- **Companies**: Apple, Spotify, Reddit, Instagram

**MySQL**: Open-source, widely used, good for web applications.
- **Strengths**: Easy to set up, strong community support, excellent for read-heavy workloads
- **Best for**: Web applications, e-commerce, CMS
- **Companies**: Facebook, YouTube, Uber (after migrating from PostgreSQL)

**SQL Server**: Microsoft's enterprise database.
- **Strengths**: Tight integration with Microsoft ecosystem, excellent tooling
- **Best for**: Windows environments, enterprise applications
- **Companies**: Various enterprises, Microsoft products

**Oracle**: Commercial, enterprise-grade, high performance.
- **Strengths**: Extreme scalability, advanced features, excellent for very large datasets
- **Best for**: Very large enterprises, mission-critical systems
- **Companies**: Banks, airlines, large corporations

---

## **3.3 NoSQL Databases: The Flexible Alternative**

NoSQL databases emerged to address limitations of relational databases in scalability, schema flexibility, and support for specialized data models. "NoSQL" is now usually read as "Not Only SQL": the point is to use the right tool for the job, not to abandon SQL.

### **The NoSQL Landscape**

NoSQL databases are categorized by data model:
```
NoSQL Databases
    │
    ├─→ Key-Value Stores (Redis, DynamoDB, Memcached)
    ├─→ Document Stores (MongoDB, Couchbase, RavenDB)
    ├─→ Wide-Column Stores (Cassandra, HBase, ScyllaDB)
    └─→ Graph Databases (Neo4j, ArangoDB, Amazon Neptune)
```

---

### **Key-Value Stores**

**Concept**: Store data as key-value pairs, like a hash map distributed across multiple servers.

**Characteristics**:
- **Simple API**: GET key, PUT key, DELETE key
- **Extreme performance**: Typically O(1) lookups by key (hash-based, so access time stays roughly constant as data grows)
- **Flexible values**: Can store any data (strings, numbers, JSON, binary)
- **Schemaless**: No predefined structure for values

**Example Use Cases**:
```python
# Redis example (Python)
import json

import redis

# Connect to Redis
r = redis.Redis(host='localhost', port=6379, db=0)

# Store simple value
r.set('user:123:name', 'Alice')
name = r.get('user:123:name')  # Returns b'Alice'

# Store JSON value
user_data = {'id': 123, 'name': 'Alice', 'email': 'alice@example.com'}
r.set('user:123', json.dumps(user_data))
user = json.loads(r.get('user:123'))

# Set expiration (key automatically deleted after 1 hour)
r.setex('session:abc123', 3600, 'user:123')

# Atomic increment (perfect for counters)
r.incr('page_views:home')  # Increment by 1
r.incrby('page_views:home', 10)  # Increment by 10

# Lists (perfect for queues)
r.lpush('user:123:notifications', 'Welcome to our service!')
r.lpush('user:123:notifications', 'You have a new follower')
notifications = r.lrange('user:123:notifications', 0, -1)  # Get all
```

**Key-Value Use Cases**:
- **Session storage**: Web application sessions (fast lookups, automatic expiration)
- **Caching**: Store expensive query results (reduce database load)
- **Rate limiting**: Track request counts per user (atomic operations)
- **Leaderboards**: High-performance counters and rankings
- **Message queues**: Lightweight pub/sub and list-based queues
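
To make the rate-limiting use case concrete, here is a sketch of a fixed-window limiter. It keeps its counters in a plain Python dict so that it runs without a server; with Redis, the same logic is usually built from the `INCR` and `EXPIRE` commands. The class name, key format, and injectable clock are assumptions for the example.

```python
import time

class FixedWindowRateLimiter:
    """Allows at most `limit` requests per user per `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counters = {}  # key -> (count, expires_at); stands in for the key-value store

    def allow(self, user_id):
        now = self.clock()
        key = f"rate:{user_id}"
        count, expires_at = self.counters.get(key, (0, 0.0))
        if now >= expires_at:        # key expired: start a new window
            count, expires_at = 0, now + self.window
        count += 1                   # the atomic-increment step
        self.counters[key] = (count, expires_at)
        return count <= self.limit

fake_now = [0.0]  # controllable clock so the example is deterministic
limiter = FixedWindowRateLimiter(limit=3, window_seconds=60, clock=lambda: fake_now[0])

first_four = [limiter.allow("u1") for _ in range(4)]  # 4th request exceeds the limit
fake_now[0] = 61.0                                     # the window has expired
after_window = limiter.allow("u1")
```

In a real deployment the increment must be atomic across application servers, which is exactly what a shared key-value store provides and a local dict does not.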

**Popular Key-Value Stores**:

**Redis**: In-memory key-value store with persistence.
- **Strengths**: Extremely fast, rich data structures, atomic operations
- **Weaknesses**: Memory-limited (dataset must fit in RAM), commands execute on a single main thread
- **Best for**: Caching, session storage, real-time analytics

**Amazon DynamoDB**: Fully managed, serverless key-value and document database.
- **Strengths**: Serverless (auto-scaling), single-digit millisecond latency, global tables
- **Weaknesses**: Expensive for high-throughput write workloads, limited query capabilities
- **Best for**: Serverless applications, mobile backends, gaming leaderboards

**Memcached**: Simple, high-performance distributed memory object caching system.
- **Strengths**: Simplicity, extremely fast for caching, multi-threaded
- **Weaknesses**: Limited data structures, no persistence, no built-in replication
- **Best for**: Simple caching layer

---

### **Document Stores**

**Concept**: Store, retrieve, and manage document-oriented information. Documents are typically JSON (or BSON in MongoDB) format.

**Characteristics**:
- **Flexible schema**: Each document can have different structure
- **Nested documents**: Store related data together (no joins needed)
- **Rich querying**: Query on any field, including nested fields
- **Schema validation**: Optional validation rules for document structure

**Example Document (MongoDB)**:
```javascript
{
  "_id": ObjectId("64c8e1234567890abcdef123"),
  "firstName": "Alice",
  "lastName": "Johnson",
  "age": 28,
  "email": "alice@example.com",
  "address": {
    "street": "123 Main St",
    "city": "San Francisco",
    "state": "CA",
    "zipCode": "94102"
  },
  "orders": [
    {
      "orderId": "ORD001",
      "date": "2024-01-15T08:30:00Z",
      "total": 125.50,
      "items": [
        {"productId": "P123", "name": "Widget", "quantity": 2, "price": 25.00},
        {"productId": "P456", "name": "Gadget", "quantity": 1, "price": 75.50}
      ]
    },
    {
      "orderId": "ORD002",
      "date": "2024-02-20T14:15:00Z",
      "total": 89.99,
      "items": [
        {"productId": "P789", "name": "Thingamajig", "quantity": 3, "price": 29.99}
      ]
    }
  ],
  "createdAt": "2023-08-01T10:00:00Z",
  "updatedAt": "2024-02-20T14:15:00Z"
}
```

**MongoDB Example**:
```javascript
// Insert document
db.users.insertOne({
  firstName: "Alice",
  lastName: "Johnson",
  age: 28,
  email: "alice@example.com"
});

// Query documents
db.users.find({ age: { $gte: 25 } })  // Find users 25 or older
db.users.find({ 
  firstName: "Alice",
  "address.city": "San Francisco"  // Query nested field
})

// Update document (upsert = insert if not exists)
db.users.updateOne(
  { email: "alice@example.com" },
  { $set: { lastLogin: new Date() } },
  { upsert: true }
)

// Array operations
db.users.updateOne(
  { email: "alice@example.com" },
  { $push: { orders: { orderId: "ORD003", total: 199.99 } } }
)

// Aggregation pipeline (like SQL GROUP BY)
db.users.aggregate([
  { $match: { age: { $gte: 18 } } },  // Filter
  { $group: { 
    _id: "$address.state",  // Group by state
    count: { $sum: 1 },     // Count users per state
    avgAge: { $avg: "$age" }  // Average age per state
  }},
  { $sort: { count: -1 } }  // Sort by count descending
])
```
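
For readers more familiar with application code than with pipelines, the aggregation stages map onto ordinary list processing. The sketch below mirrors `$match`, `$group`, and `$sort` in plain Python over a handful of made-up documents; it illustrates the semantics only and says nothing about how MongoDB executes the pipeline.

```python
from collections import defaultdict

users = [
    {"firstName": "Alice", "age": 28, "address": {"state": "CA"}},
    {"firstName": "Bob",   "age": 35, "address": {"state": "CA"}},
    {"firstName": "Carol", "age": 41, "address": {"state": "NY"}},
    {"firstName": "Dan",   "age": 16, "address": {"state": "NY"}},
]

def users_by_state(docs, min_age=18):
    groups = defaultdict(list)
    for doc in docs:
        if doc["age"] >= min_age:                               # $match
            groups[doc["address"]["state"]].append(doc["age"])  # $group key
    summary = [
        {"_id": state, "count": len(ages), "avgAge": sum(ages) / len(ages)}
        for state, ages in groups.items()
    ]
    return sorted(summary, key=lambda g: g["count"], reverse=True)  # $sort

result = users_by_state(users)
```

Running the pipeline inside the database avoids shipping every document to the application just to count them.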

**Document Store Use Cases**:
- **Content management**: Blog posts, product catalogs (flexible schema)
- **User profiles**: Social media profiles (nested data, varying fields)
- **Mobile applications**: Offline-first apps (JSON sync)
- **Product catalogs**: E-commerce (nested variants, flexible attributes)
- **Event logging**: Application logs, audit trails (structured JSON)

**Popular Document Stores**:

**MongoDB**: Most popular document database.
- **Strengths**: Rich query language, flexible schema, strong community
- **Weaknesses**: Multi-document ACID transactions arrived only in version 4.0 and cost more than single-document writes; joins (`$lookup`) are expensive
- **Best for**: Content management, product catalogs, mobile backends

**Couchbase**: Enterprise-grade document database with SQL-like query language.
- **Strengths**: N1QL (SQL for JSON), excellent performance, built-in caching
- **Weaknesses**: Steeper learning curve, smaller community
- **Best for**: Enterprise applications, real-time analytics

**RavenDB**: ACID-compliant document database with built-in indexing.
- **Strengths**: Strong consistency, automatic indexing, easy setup
- **Weaknesses**: Smaller ecosystem, less battle-tested than MongoDB
- **Best for**: Applications requiring strong consistency

---

### **Wide-Column Stores**

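**Concept**: Store each row under a row key, with its columns grouped into column families. Rows can be very wide (many columns), and tables are designed to scale to petabytes across many nodes. Despite the name, these systems are not columnar (column-oriented) analytics stores.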

**Characteristics**:
- **Column-family organization**: Columns are grouped into families, and each family's data is stored together
- **Scalable writes**: Optimized for write-heavy workloads
- **Flexible schema**: Rows can have different columns
- **Distributed**: Designed for massive horizontal scaling

**Data Model**:
```
Table: users
─────────────────────────────────────────────────────────────
Row Key       Column Family                Column Family
─────────────────────────────────────────────────────────────
user123       profile                      activity
              ───────────                  ──────────────
              name: "Alice"                login_ts: 1704067200
              age: 28                      last_action: "view_profile"
              email: "alice@..."           page_views: 1500
              ───────────
              preferences
              ───────────
              theme: "dark"
              notifications: "email"

user456       profile                      activity
              ───────────                  ──────────────
              name: "Bob"                  login_ts: 1704070800
              age: 35                      last_action: "purchase"
              email: "bob@..."             total_spent: 1250.50
              ───────────
              preferences
              ───────────
              theme: "light"
              notifications: "sms"
```

**Key insight**: Related columns are stored together (column families). In stores that keep each family in separate files (HBase, for example), reading only name and age touches just the profile column family, not the entire row.

**Cassandra Example**:
```sql
-- Create keyspace (like a database)
CREATE KEYSPACE myapp 
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Use keyspace
USE myapp;

-- Create table (column family)
CREATE TABLE users (
  user_id UUID PRIMARY KEY,
  name TEXT,
  email TEXT,
  age INT,
  created_at TIMESTAMP
);

-- Insert data
INSERT INTO users (user_id, name, email, age, created_at)
VALUES (uuid(), 'Alice', 'alice@example.com', 28, toTimestamp(now()));

-- Query data (WHERE must include partition key)
SELECT * FROM users WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Create table with clustering key (sort order)
CREATE TABLE messages (
  user_id UUID,
  message_id TIMEUUID,
  content TEXT,
  created_at TIMESTAMP,
  PRIMARY KEY (user_id, message_id)  -- Composite primary key
) WITH CLUSTERING ORDER BY (message_id DESC);  -- Sort messages by time, descending

-- Insert messages
INSERT INTO messages (user_id, message_id, content, created_at)
VALUES (123e4567-e89b-12d3-a456-426614174000, now(), 'Hello!', toTimestamp(now()));

-- Query messages (efficiently retrieves in sorted order)
SELECT * FROM messages WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
-- Returns messages sorted by message_id (which is time-based) in descending order
```

**Cassandra Data Model Concepts**:

**Partition Key**: Determines which node stores the data (like hashing in distributed hash tables).
```sql
-- Single partition key
CREATE TABLE users (
  user_id UUID PRIMARY KEY  -- Partition key
);

-- Composite partition key (multiple columns)
CREATE TABLE user_events (
  user_id UUID,
  event_type TEXT,
  event_id TIMEUUID,
  event_data TEXT,
  PRIMARY KEY ((user_id, event_type), event_id)  -- (user_id, event_type) is partition key
);
```

**Clustering Key**: Determines how data is sorted within a partition (like ORDER BY in SQL).
```sql
-- Event timeline example
CREATE TABLE events (
  user_id UUID,
  timestamp TIMESTAMP,
  event_type TEXT,
  event_data TEXT,
  PRIMARY KEY (user_id, timestamp)  -- user_id is partition key, timestamp is clustering key
) WITH CLUSTERING ORDER BY (timestamp DESC);  -- Most recent events first
```
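
The roles of the two key parts can be sketched with a toy in-memory model. This is an illustration of the idea only: the node names are hypothetical, SHA-256 with a modulo stands in for Cassandra's Murmur3 partitioner and token ring, and replication is ignored.

```python
import hashlib
from bisect import insort

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster

def node_for(partition_key):
    """Hash the partition key to pick the owning node (simplified: no token ring)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

class ToyTable:
    def __init__(self):
        self.partitions = {}  # partition key -> rows sorted by clustering key

    def insert(self, partition_key, clustering_key, value):
        rows = self.partitions.setdefault(partition_key, [])
        insort(rows, (clustering_key, value))  # keep the partition sorted on write

    def select(self, partition_key, descending=True):
        rows = self.partitions.get(partition_key, [])
        return list(reversed(rows)) if descending else list(rows)

events = ToyTable()
events.insert("user-1", 1704070800, "purchase")
events.insert("user-1", 1704067200, "login")
events.insert("user-2", 1704067300, "login")

latest_first = events.select("user-1")  # one partition, already in sorted order
owner = node_for("user-1")              # the same key always maps to the same node
```

Because rows are sorted when they are written, reading a user's latest events is a sequential read of one partition rather than a sort at query time.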

**Query Patterns**: Design your schema based on query patterns (data modeling is query-driven).
```sql
-- Good: Efficient query (includes partition key)
SELECT * FROM events WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Bad: Query without the partition key
-- Cassandra rejects it unless ALLOW FILTERING is added, which scans the whole cluster (very slow!)
SELECT * FROM events WHERE timestamp > '2024-01-01';

-- Solution: Denormalize into separate table for this query
CREATE TABLE events_by_date (
  event_date DATE,
  timestamp TIMESTAMP,
  user_id UUID,
  event_type TEXT,
  event_data TEXT,
  PRIMARY KEY (event_date, timestamp, user_id)  -- Partition by date; user_id keeps same-instant events distinct
);

-- Now efficient:
SELECT * FROM events_by_date WHERE event_date = '2024-01-01';
```

**Wide-Column Use Cases**:
- **Time series data**: IoT sensor readings, application metrics
- **Log data**: Application logs, audit trails (append-only)
- **Messaging systems**: Chat messages, notifications (time-ordered)
- **Product catalogs**: E-commerce with many attributes (flexible schema)
- **User activity tracking**: Social media feeds, analytics events

**Popular Wide-Column Stores**:

**Apache Cassandra**: Most popular open-source wide-column store.
- **Strengths**: Linear scalability, multi-datacenter replication, tunable consistency
- **Weaknesses**: Complex data modeling (query-driven), limited query capabilities
- **Best for**: Time series data, write-heavy workloads, global applications

**HBase**: Hadoop-based wide-column store.
- **Strengths**: Integrates with Hadoop ecosystem, strong consistency, real-time random access
- **Weaknesses**: Requires ZooKeeper, complex setup, Java-centric client API (other languages go through the REST or Thrift gateways)
- **Best for**: Big data analytics, real-time processing on Hadoop

**ScyllaDB**: C++ rewrite of Cassandra for extreme performance.
- **Strengths**: Vendor benchmarks report up to roughly 10x the throughput of Cassandra, shard-per-core shared-nothing architecture, low latency
- **Weaknesses**: Smaller community, fewer integrations
- **Best for**: High-throughput, low-latency workloads

---

### **Graph Databases**

**Concept**: Store data as nodes (entities), edges (relationships), and properties. Optimized for querying complex relationships.

**Characteristics**:
- **Nodes**: Entities (users, products, locations)
- **Edges**: Relationships (follows, purchased, located_in)
- **Properties**: Key-value pairs on nodes and edges
- **Graph traversals**: Efficiently navigate relationships

**Data Model**:
```
┌─────────────┐                      ┌─────────────┐
│   Person    │                      │   Person    │
│   ────────  │                      │   ────────  │
│ name: Alice │                      │ name: Bob   │
│ age: 28     │                      │ age: 32     │
└──────┬──────┘                      └──────┬──────┘
       │                                    │
       │ knows                              │ knows
       │ (since: 2019)                      │ (since: 2020)
       │                                    │
┌──────▼────────────────────────────────────▼──────┐
│                     Person                       │
│                     ────────                     │
│                     name: Carol                  │
│                     age: 35                      │
└──────────────────────────────────────────────────┘
```

**Neo4j Example**:
```cypher
// Create nodes and relationships
CREATE (alice:Person {name: 'Alice', age: 28})
CREATE (bob:Person {name: 'Bob', age: 32})
CREATE (carol:Person {name: 'Carol', age: 35})
CREATE (alice)-[:KNOWS {since: 2019}]->(bob)
CREATE (bob)-[:KNOWS {since: 2020}]->(carol)
CREATE (alice)-[:KNOWS {since: 2021}]->(carol);

// Find friends of friends (2 hops away)
MATCH (a:Person {name: 'Alice'})-[:KNOWS]->(friend:Person)-[:KNOWS]->(fof:Person)
WHERE NOT (a)-[:KNOWS]->(fof)  // Not already friends
RETURN fof.name, friend.name;

// Find the shortest path between two people
MATCH p = shortestPath(
  (alice:Person {name: 'Alice'})-[*]-(bob:Person {name: 'Bob'})
)
RETURN p;

// Find common friends
MATCH (alice:Person {name: 'Alice'})-[:KNOWS]->(friend:Person)<-[:KNOWS]-(bob:Person {name: 'Bob'})
RETURN friend.name;

// Recommend products based on purchases (recommendation engine)
MATCH (user:Person {name: 'Alice'})-[:PURCHASED]->(product:Product)<-[:PURCHASED]-(other:Person)-[:PURCHASED]->(recommendation:Product)
WHERE NOT (user)-[:PURCHASED]->(recommendation)  // Alice hasn't purchased
RETURN recommendation.name, count(*) AS frequency
ORDER BY frequency DESC
LIMIT 10;
```
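
To make the traversal concrete without a graph database, the friends-of-friends query can be approximated over an in-memory adjacency map. This is a minimal sketch, not how Neo4j executes Cypher: the `knows` dictionary and the `friends_of_friends` helper are illustrative names, and Dave is an extra person assumed here so that the result is non-empty.

```python
# Directed KNOWS edges as an adjacency map (illustrative data; Dave is an assumption).
knows = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Carol", "Dave"},
    "Carol": set(),
    "Dave": set(),
}

def friends_of_friends(graph, person):
    """Return people reachable in two KNOWS hops who are not already direct friends."""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())
    # Exclude the person themselves and anyone they already know.
    return two_hops - direct - {person}

print(friends_of_friends(knows, "Alice"))  # {'Dave'}
```

Carol is reachable through Bob but is filtered out because Alice already knows her, which mirrors the `WHERE NOT` clause in the Cypher query.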

**Graph Database Use Cases**:
- **Social networks**: Finding friends of friends, influence analysis
- **Recommendation engines**: Product recommendations, content suggestions
- **Fraud detection**: Finding suspicious patterns, money laundering networks
- **Identity management**: Access control, permission hierarchies
- **Network topology**: IT infrastructure, dependency graphs

**Popular Graph Databases**:

**Neo4j**: Most popular graph database.
- **Strengths**: Cypher query language (graph-focused), excellent performance for graph traversals, strong ecosystem
- **Weaknesses**: Scaling requires Enterprise edition, horizontal scaling is challenging
- **Best for**: Social networks, recommendation engines, fraud detection

**Amazon Neptune**: Fully managed graph database service.
- **Strengths**: Serverless, supports both property graphs and RDF graphs, integrated with AWS
- **Weaknesses**: Proprietary, limited to AWS, higher cost for high throughput
- **Best for**: AWS-based applications requiring graph capabilities

**ArangoDB**: Multi-model database (supports graphs, documents, key-value).
- **Strengths**: Multi-model (one database for multiple use cases), flexible, ACID transactions
- **Weaknesses**: Jack of all trades (not specialized), smaller community
- **Best for**: Applications needing multiple data models

---

## **3.4 Database Scaling: From Single Server to Global Distribution**

### **Vertical Scaling vs. Horizontal Scaling**

**Vertical Scaling (Scale Up)**: Adding more resources to a single server.

**Example**:
```
Before: 1 server with 16 cores, 64GB RAM, 1TB SSD
After:  1 server with 64 cores, 256GB RAM, 4TB SSD

Pros:
- Simple: No application changes required
- Consistent performance: No data synchronization issues
- Easier operations: Single point of maintenance

Cons:
- Expensive: High-end hardware costs more per unit of performance
- Limited: Eventually hit hardware limits
- Single point of failure: Entire system goes down if server fails
- Downtime: Requires reboot for upgrades
```

**Horizontal Scaling (Scale Out)**: Adding more servers.

**Example**:
```
Before: 1 server with 16 cores, 64GB RAM, 1TB SSD
After:  4 servers with 16 cores, 64GB RAM, 1TB SSD each

Pros:
- Cost-effective: Commodity hardware is cheaper
- Unlimited: Theoretically infinite scaling
- Resilient: System continues if one server fails
- No downtime: Add/remove servers without affecting application

Cons:
- Complex: Requires application changes (data partitioning, sharding)
- Consistency challenges: Distributed systems complexity
- Increased operations: More servers to manage
- Distributed transactions: Harder to maintain ACID properties
```

**When to Use Which**:
```
Use Vertical Scaling When:
- Data fits in memory (< 100GB)
- Traffic is manageable (< 10,000 QPS)
- Team is small (limited operations capacity)
- Consistency is critical (banking, finance)

Use Horizontal Scaling When:
- Data exceeds single machine capacity (> 10TB)
- Traffic is high (> 10,000 QPS)
- Team is experienced (can manage complexity)
- Availability is critical (can tolerate eventual consistency)
```

---

### **Read Replicas: Scaling Reads**

**Problem**: Application is read-heavy (90% reads, 10% writes), but the single database can't handle the load.

**Solution**: Create read replicas—copies of the primary database that handle read queries.

**Architecture**:
```
Application Servers
    │
    ├─→ [Primary Database] ←── Writes go here
    │      (Master)         │
    │      │                │
    │      ├─→ Replica 1    │
    │      ├─→ Replica 2    │
    │      └─→ Replica 3    │
    │                       │
    └─→ [Load Balancer] ←── Reads go here
          (Routes to any replica)
```

**How It Works**:
1. **Write operation**: Application writes to primary database
2. **Replication**: Primary asynchronously propagates changes to replicas
3. **Read operation**: Application reads from any replica (load balanced)
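
The routing described in these steps can be sketched in a few lines. This is an illustrative example only: `ReplicaRouter` is a hypothetical class, round-robin is just one possible balancing policy, and plain strings stand in for real connection objects.

```python
import itertools

class ReplicaRouter:
    """Route writes to the primary and spread reads across replicas (round-robin)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary when no replicas are configured.
        self._read_cycle = itertools.cycle(replicas or [primary])

    def for_write(self):
        return self.primary

    def for_read(self):
        return next(self._read_cycle)

# Plain strings stand in for real connection objects.
router = ReplicaRouter("primary", ["replica-1", "replica-2", "replica-3"])
reads = [router.for_read() for _ in range(4)]
print(router.for_write(), reads)
```

In practice this logic usually lives in a load balancer, proxy, or database driver rather than in application code.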

**Replication Lag**: The time between a write being applied to the primary and being visible on replicas.

**Example**:
```sql
-- Primary database (write)
INSERT INTO users (name, email) VALUES ('Alice', 'alice@example.com');
-- Transaction committed at 10:00:00.000

-- Read replica at 10:00:00.100
SELECT * FROM users WHERE email = 'alice@example.com';
-- Returns 0 rows (not yet replicated)

-- Read replica at 10:00:00.500
SELECT * FROM users WHERE email = 'alice@example.com';
-- Returns 1 row (now replicated)

-- Replication lag: between 100ms and 500ms in this example
```

**Handling Replication Lag**:

**Strategy 1: Read your own writes (pin recent writers to the primary)**
```python
# After a write, read that user from the primary for a short window
def create_user(user_data):
    # Write to primary; insert() is assumed to return the new row's ID
    user_id = db_primary.insert(user_data)
    
    # For the next 2 seconds, route reads for this user to the primary
    redis.setex(f"read_primary:{user_id}", 2, "true")
    return user_id

def get_user(user_id):
    if redis.exists(f"read_primary:{user_id}"):
        return db_primary.get(user_id)
    else:
        return db_replica.get(user_id)
```

**Strategy 2: Always read from primary for critical data**
```python
# Banking transactions always read from primary
def get_account_balance(account_id):
    # Financial data must be consistent
    return db_primary.query("SELECT balance FROM accounts WHERE id = ?", account_id)

def get_profile(user_id):
    # Profile data can be eventually consistent
    return db_replica.query("SELECT * FROM profiles WHERE id = ?", user_id)
```

**Setting Up Read Replicas** (PostgreSQL example):
```sql
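-- On the primary server:
-- 1. Create a replication user
CREATE USER replicator WITH REPLICATION ENCRYPTED PASSWORD 'password';

-- 2. Set in postgresql.conf (configuration, not SQL):
--      wal_level = replica
--      max_wal_senders = 5
--      max_replication_slots = 5
--    and allow the replicator user to connect for replication in pg_hba.conf.

-- 3. Restart PostgreSQL. These settings only take effect at server start,
--    so SELECT pg_reload_conf() is not enough.

-- On the replica server:
-- 1. Take a base backup from the primary (shell command, empty data directory):
--      pg_basebackup -h primary_host -D /var/lib/postgresql/data -U replicator -P -v -R
--    The -R flag writes standby.signal and the primary connection settings.

-- 2. Optionally tune postgresql.conf:
--      hot_standby = on
--      max_standby_streaming_delay = 30s

-- 3. Start PostgreSQL; it comes up as a streaming standby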
```

**Scaling Calculations**:
```
Scenario: 10,000 QPS with 90% reads (9,000 read QPS, 1,000 write QPS)

Single database: Can handle 2,000 QPS total
-- Insufficient! Need scaling.

Option 1: Read Replicas
- Primary: Handles 1,000 write QPS + some reads
- Replicas needed for 9,000 read QPS: 9,000 / 2,000 = 4.5 → 5 replicas
- Total servers: 1 primary + 5 replicas = 6 servers
- Write capacity: Unchanged; all 1,000 write QPS still go through the single primary
- Read capacity: 10,000 QPS (5 replicas × 2,000 QPS each)

Option 2: Horizontal Partitioning (Sharding)
- Shard by user_id (10 shards)
- Each shard handles ~10% of traffic
- Combined capacity: 20,000 QPS (10 shards × 2,000 QPS each), shared by reads and writes
- Per-shard load: ~1,000 QPS (about 900 reads + 100 writes)
- Total servers: At least 10 (one primary per shard, plus any replicas)

Decision: Read replicas are simpler for read-heavy workloads.
```
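
The replica arithmetic for Option 1 can be reproduced with a short calculation. This is a sketch of the estimate above only; `replicas_needed` is an illustrative helper, and it ignores headroom for failures and traffic spikes.

```python
import math

def replicas_needed(read_qps, per_server_qps):
    """Number of replicas required to absorb the read load (rounded up)."""
    return math.ceil(read_qps / per_server_qps)

total_qps, read_ratio, per_server_qps = 10_000, 0.9, 2_000
read_qps = round(total_qps * read_ratio)  # 9,000
write_qps = total_qps - read_qps          # 1,000

replicas = replicas_needed(read_qps, per_server_qps)
print(replicas, replicas * per_server_qps)  # 5 replicas, 10,000 read QPS capacity
```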

---

### **Connection Pooling: Managing Database Connections**

**Problem**: Each HTTP request opens a new database connection → thousands of connections → database overwhelmed.

**Solution**: Connection pool—a cache of database connections that can be reused.

**Without Connection Pooling**:
```
Request 1: Open connection → Query → Close connection (50ms overhead)
Request 2: Open connection → Query → Close connection (50ms overhead)
Request 3: Open connection → Query → Close connection (50ms overhead)
...
Request 1000: Open connection → Query → Close connection (50ms overhead)

Total overhead: 1000 × 50ms = 50 seconds just opening/closing connections!
```

**With Connection Pooling**:
```
Application Startup:
- Create pool of 20 connections
- All connections established and authenticated

Request 1:    Get connection from pool → Query → Return to pool (0ms overhead)
Request 2:    Get connection from pool → Query → Return to pool (0ms overhead)
Request 3:    Wait for available connection → Query → Return to pool
Request 4:    Get connection from pool → Query → Return to pool
...
Request 1000: Wait for available connection → Query → Return to pool

Total connection overhead: ~0ms (connections reused; requests may still queue for a free connection)
```
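
The reuse pattern can be demonstrated with a toy pool built on a thread-safe queue. This is a minimal sketch for illustration, not a production pool: `TinyPool` and `fake_connect` are hypothetical names, and real pools also handle broken connections, timeouts, and health checks.

```python
import queue

class TinyPool:
    """A toy connection pool: connections are created once and then reused."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())  # Pay the connection cost up front.

    def acquire(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        return self._idle.get(timeout=timeout)

    def release(self, conn):
        self._idle.put(conn)

opened = []

def fake_connect():
    # Stand-in for an expensive connect + authenticate round trip.
    opened.append(object())
    return opened[-1]

pool = TinyPool(fake_connect, size=2)
for _ in range(1000):
    conn = pool.acquire()
    try:
        pass  # Run a query here.
    finally:
        pool.release(conn)

print(len(opened))  # 2: a thousand requests shared two connections
```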

**Connection Pool Configuration**:
```python
# PostgreSQL connection pool (Python using psycopg2)
from psycopg2 import pool

# Create connection pool. ThreadedConnectionPool is safe to share between threads;
# SimpleConnectionPool is meant for single-threaded applications.
connection_pool = pool.ThreadedConnectionPool(
    minconn=5,      # Connections opened when the pool is created
    maxconn=20,     # Maximum connections allowed
    host='localhost',
    database='mydb',
    user='user',
    password='password'
)

# Get connection from pool
def get_user(user_id):
    # getconn() raises PoolError (it does not wait) if all connections are in use
    connection = connection_pool.getconn()
    try:
        # The with-block closes the cursor when the query is done
        with connection.cursor() as cursor:
            cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
            return cursor.fetchone()
    finally:
        # Always return connection to pool
        connection_pool.putconn(connection)

# When shutting down application
connection_pool.closeall()
```

**Pool Sizing**:
```
Rule of thumb: pool_size = (core_count × 2) + effective_spindle_count

Example: 8-core database server with 1 disk (1 spindle)
- pool_size = (8 × 2) + 1 = 17 connections

But consider:
- More connections = More concurrency, but context switching overhead
- Too few connections = Requests wait for available connections
- Connections in use ≈ requests_per_second × request_time (Little's law)

If request_time = 100ms, requests_per_second = 1000, pool_size = 20:
- Total connections needed = 1000 × 0.1 = 100
- But pool only has 20 → on average about 80 requests are waiting

Solution: Scale horizontally (add more application servers with their own pools), within the database's own connection limit
```
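
Both estimates are simple enough to compute directly. The helpers below are an illustrative sketch (the function names are assumptions); the second applies Little's law, which says average concurrency equals arrival rate multiplied by time in the system.

```python
def rule_of_thumb_pool_size(core_count, spindle_count):
    """Starting point from the rule of thumb above; tune it with load tests."""
    return core_count * 2 + spindle_count

def connections_in_use(requests_per_second, request_time_ms):
    """Little's law: average concurrency = arrival rate x time in system."""
    return requests_per_second * request_time_ms / 1000

pool_size = 20
needed = connections_in_use(1000, 100)
waiting = max(0, needed - pool_size)

print(rule_of_thumb_pool_size(8, 1))  # 17
print(needed, waiting)                # 100.0 80.0
```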

**Connection Pool Libraries**:
- **PostgreSQL**: PgBouncer (external connection pooler), connection pool built into drivers
- **MySQL**: ProxySQL (external connection pooler), connection pool built into drivers
- **MongoDB**: Built-in connection pooling in driver
- **Redis**: Built-in connection pooling in client

---

## **3.5 Indexing Strategies: Making Queries Fast**

We've covered basic indexes earlier. Now, let's dive deeper into advanced indexing strategies.

### **Composite Indexes: Multiple Columns**

**Composite Index**: An index on multiple columns. The column order determines which queries can use it.

**Example**:
```sql
-- Table: orders (customer_id, order_date, amount, status)
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);

-- This query uses the index efficiently:
SELECT * FROM orders
WHERE customer_id = 123 AND order_date >= '2024-01-01';
-- Uses index: Equality on customer_id, then a range scan on order_date

-- This query also uses the index (leading column only):
SELECT * FROM orders
WHERE customer_id = 123;
-- Uses index: customer_id is the leftmost column, so this is still an efficient index seek

-- This query cannot seek on the index:
SELECT * FROM orders
WHERE order_date >= '2024-01-01';
-- No index seek: The leading column (customer_id) is not constrained,
-- so most planners fall back to a table scan or a full index scan

-- This query DOES use the index:
SELECT * FROM orders
WHERE order_date >= '2024-01-01' AND customer_id = 123;
-- Uses index: The order of predicates in the WHERE clause does not matter;
-- only the order of columns in the index definition does
```
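
Claims about index usage are easy to check by asking the planner directly. The sketch below uses SQLite from Python's standard library as a stand-in (an assumption made for portability; PostgreSQL and MySQL expose similar information through `EXPLAIN`, and their planners may choose differently). It compares the plans for the same two predicates written in both orders, and prints the plan for a query that leaves the leading index column unconstrained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INT, order_date TEXT, amount REAL, status TEXT)"
)
conn.execute(
    "CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date)"
)

def plan(sql):
    """Return SQLite's query plan as one string (the detail text is the last column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

index_order = plan(
    "SELECT * FROM orders WHERE customer_id = 123 AND order_date >= '2024-01-01'"
)
reversed_order = plan(
    "SELECT * FROM orders WHERE order_date >= '2024-01-01' AND customer_id = 123"
)
date_only = plan("SELECT * FROM orders WHERE order_date >= '2024-01-01'")

print(index_order)     # Names idx_orders_customer_date
print(reversed_order)  # Same plan as above
print(date_only)       # Plan for the query without the leading column
```

Running the equivalent `EXPLAIN` in the target database is the reliable way to confirm whether a given query can use a composite index.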

**Rule of Thumb**: For composite index (A, B, C), queries can use index for:
- WHERE A = value ✓
- WHERE A = value AND B = value ✓
- WHERE A = value AND B = value AND C = value ✓
- WHERE B = value ✗ (unless A is also specified)
- WHERE C = value ✗ (unless A and B are also specified)
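
The leftmost-prefix rule can be expressed as a small function. This is a simplified sketch covering equality predicates only; `usable_prefix` is an illustrative helper, not part of any database API.

```python
def usable_prefix(index_columns, equality_columns):
    """Return the leading index columns an equality-only query can seek on."""
    prefix = []
    for column in index_columns:
        if column not in equality_columns:
            break  # A gap ends the usable prefix.
        prefix.append(column)
    return prefix

index = ["A", "B", "C"]
print(usable_prefix(index, {"A"}))       # ['A']
print(usable_prefix(index, {"A", "B"}))  # ['A', 'B']
print(usable_prefix(index, {"B"}))       # []
print(usable_prefix(index, {"A", "C"}))  # ['A']
```

The last case shows that a query on A and C can still seek on A, but C must then be checked as a filter on the rows that A selects.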

**Index Order Strategy**: Put columns compared with equality before range columns; among equality columns, put the most selective first.
```sql
-- Scenario 1: customer_id is highly selective (many distinct values)
-- Query: WHERE customer_id = 123 AND status = 'pending'
-- Better: INDEX(customer_id, status)
-- Reason: customer_id narrows down to few rows, then status filters further

-- Scenario 2: status is not selective (only 3 values: pending, shipped, delivered)
-- Query: WHERE status = 'pending' AND order_date >= '2024-01-01'
-- Better: INDEX(status, order_date)
-- Reason: status narrows to 1/3 of data, then date filters further
-- Alternative: Separate index on order_date if queries often search by date only
```

---

### **Covering Indexes: Index-Only Scans**

**Covering Index**: Index that contains all columns needed by query, eliminating table access.

**Example**:
```sql
-- Table: users (id, name, email, age, city)
CREATE INDEX idx_users_name_email ON users(name, email);

-- Query:
SELECT name, email FROM users WHERE name LIKE 'Ali%';

-- With an index on name only (not covering):
-- 1. Use index to find rows matching name LIKE 'Ali%'
-- 2. For each matching row, access table to get email
-- 3. Return results

-- With the covering index above (name and email both in the index):
-- 1. Use index to find rows matching name LIKE 'Ali%'
-- 2. Email is already in index (no table access needed!)
-- 3. Return results

-- Performance improvement: Table I/O is skipped entirely; the speedup (often
-- quoted as 2-10x) depends on how many rows match and how much data is cached
```
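
Whether a query is actually served from the index alone can be read from the query plan. The sketch below again uses SQLite from Python's standard library as a stand-in (an assumption; PostgreSQL reports the equivalent as an "Index Only Scan"). An equality predicate is used instead of `LIKE` to keep the plan simple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT, age INT, city TEXT)"
)
conn.execute("CREATE INDEX idx_users_name_email ON users(name, email)")

def plan(sql):
    """Return SQLite's query plan as one string (the detail text is the last column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

covered = plan("SELECT name, email FROM users WHERE name = 'Alice'")
not_covered = plan("SELECT * FROM users WHERE name = 'Alice'")

print(covered)      # Mentions COVERING INDEX: no table access
print(not_covered)  # Uses the index, then visits the table for age and city
```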

**When to Use**:
- Frequently executed queries
- Queries that only select a few columns
- Columns often used in WHERE clauses

---

### **Partial Indexes: Indexing Subset of Data**

**Partial Index**: Index that only includes rows matching a condition.

**Example**:
```sql
-- Table: orders (customer_id, order_date, amount, status)
-- Query: SELECT * FROM orders WHERE status = 'pending';

-- Regular index:
CREATE INDEX idx_orders_status ON orders(status);
-- Indexes ALL rows (including shipped, delivered, cancelled)

-- Partial index:
CREATE INDEX idx_orders_pending ON orders(status) WHERE status = 'pending';
-- Only indexes pending orders

-- Benefits:
-- 1. Smaller index (only pending orders, maybe 10% of data)
-- 2. Faster index maintenance (only updates pending rows)
-- 3. Faster queries (index is smaller, fits in memory)
-- 4. Less storage (index is 90% smaller)

-- Query uses partial index:
SELECT * FROM orders WHERE status = 'pending';  -- Uses partial index

-- Query does NOT use partial index (condition doesn't match):
SELECT * FROM orders WHERE status = 'shipped';  -- Cannot use partial index
```

**Use Cases**:
- Filtering on status (pending, active, deleted)
- Time-based queries (recent data only)
- User-specific data (only active users)
- Hot/cold data separation

---

### **Functional Indexes: Indexing Computed Values**

**Functional Index**: Index on expression or function result.

**Example**:
```sql
-- Table: users (id, name, email)
-- Query: SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';

-- Without functional index:
-- Cannot use regular index (email indexed, but LOWER(email) not indexed)
-- Full table scan required

-- With functional index:
CREATE INDEX idx_users_email_lower ON users(LOWER(email));

-- Now query uses index (compare against a lowercase literal,
-- since LOWER(email) can never equal an uppercase string):
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- Index is used (matches functional index)

-- Another example: Full-text search
-- Query: SELECT * FROM articles WHERE title LIKE '%database%';
-- Slow: Leading wildcard prevents index usage

-- With functional index (using PostgreSQL's full-text search):
CREATE INDEX idx_articles_search ON articles USING gin(to_tsvector('english', title || ' ' || content));

-- Query uses index:
SELECT * FROM articles 
WHERE to_tsvector('english', title || ' ' || content) @@ to_tsquery('english', 'database');
-- Fast: Full-text search using index
```

**Use Cases**:
- Case-insensitive searches (LOWER, UPPER)
- Date calculations (DATE_TRUNC, EXTRACT)
- Full-text search (to_tsvector, to_tsquery)
- Mathematical operations (ABS, ROUND)

---

### **B-Tree vs. Hash Indexes**

**B-Tree Index**: Default index type in most databases.
- **Supports**: Range queries, ORDER BY, pattern matching (LIKE 'abc%')
- **Structure**: Balanced tree (O(log n) operations)
- **Use cases**: Most queries, range searches, sorting

**Hash Index**: Hash-based index.
- **Supports**: Equality queries only (=)
- **Structure**: Hash table (O(1) average operations)
- **Use cases**: Exact lookups, key-value operations

**Comparison**:
```sql
-- Table: users (id, name, email, username)

-- B-Tree index:
CREATE INDEX idx_users_email ON users(email);

-- Hash index:
CREATE INDEX idx_users_username_hash ON users USING HASH(username);

-- Query 1: Equality
SELECT * FROM users WHERE username = 'alice';
-- Hash index on username works; a B-Tree index on username would also work
-- (hash might be slightly faster)

-- Query 2: Range (prefix) query
SELECT * FROM users WHERE email LIKE 'alice%';
-- B-Tree index on email works (in PostgreSQL this needs the C collation or a
-- text_pattern_ops index); a hash index could not serve this (full table scan)

-- Query 3: ORDER BY
SELECT * FROM users ORDER BY email LIMIT 10;
-- B-Tree index works (entries are already sorted); a hash index cannot help with sorting
```
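
The difference between the two structures shows up even in plain Python. In this sketch a dictionary plays the role of a hash index and a sorted list searched with `bisect` stands in for a B-tree; both are simplifications, and the sample emails and the `prefix_scan` helper are illustrative.

```python
import bisect

emails = ["alice@example.com", "bob@example.com", "alicia@example.com", "carol@example.com"]

# Hash-style index: O(1) average exact lookups, no useful ordering.
hash_index = {email: row_id for row_id, email in enumerate(emails)}

# Sorted index (stand-in for a B-tree): O(log n) lookups plus ordered range scans.
sorted_index = sorted(emails)

def prefix_scan(sorted_keys, prefix):
    """Return keys starting with prefix using two binary searches."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]

print(hash_index["bob@example.com"])     # 1
print(prefix_scan(sorted_index, "ali"))  # ['alice@example.com', 'alicia@example.com']
```

Answering the prefix query from the dictionary would require checking every key, which is the equivalent of a full table scan.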

**When to Use Hash Indexes**:
- Only equality queries needed
- Very high selectivity (many distinct values)
- In-memory or small datasets

---

## **3.6 Sharding: Distributing Data Across Multiple Databases**

**Sharding**: Horizontal partitioning—splitting data across multiple database instances.

### **Why Shard?**

**Problem**: Single database can't handle the load.
```
Single Database:
- Data: 10TB (exceeds single machine storage)
- QPS: 50,000 (exceeds single machine capacity)
- Connections: 10,000 (exceeds single machine limits)

Solution: Shard across multiple databases
```

**Sharding Benefits**:
- **Scalability**: Each shard handles subset of data (linear scalability)
- **Performance**: Smaller datasets per shard (faster queries)
- **Availability**: One shard failure doesn't affect others

**Sharding Challenges**:
- **Complexity**: Application must know which shard to query
- **Joins**: Cross-shard joins are expensive or impossible
- **Transactions**: Distributed transactions are complex
- **Rebalancing**: Adding/removing shards is expensive (data movement)

---

### **Sharding Strategies**

**Strategy 1: Hash-based Sharding**

**Concept**: Hash a key (e.g., user_id) to determine shard.

**Example**:
```python
def get_shard(user_id, num_shards):
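    # Hash user_id and take it modulo the number of shards.
    # Note: Python's built-in hash() returns small ints unchanged and is salted per
    # process for strings, so production code should use a stable hash (e.g. hashlib).
    hash_value = hash(user_id)
    shard_id = hash_value % num_shards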
    return shard_id

# Example with 4 shards
user_ids = [123, 456, 789, 101112, 131415]
for user_id in user_ids:
    shard = get_shard(user_id, 4)
    print(f"User {user_id} → Shard {shard}")

# Output:
# User 123 → Shard 3
# User 456 → Shard 0
# User 789 → Shard 1
# User 101112 → Shard 0
# User 131415 → Shard 3
```

**Pros**:
- **Even distribution**: Data evenly spread across shards (if hash function is good)
- **Simple**: Easy to implement
- **Predictable**: Same key always maps to same shard

**Cons**:
- **No range queries**: Cannot efficiently query ranges (e.g., users 1000-2000)
- **Rebalancing**: Changing the shard count with simple modulo hashing remaps most keys (expensive); consistent hashing reduces the movement
- **Hot spots**: If one key has much more data, that shard is overloaded

**Use cases**: User data, key-value lookups, even distribution required
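
The rebalancing cost of modulo-based placement can be measured directly. The sketch below counts how many keys change shards when a cluster grows from 4 to 5 shards; `moved_fraction` is an illustrative helper, and integer user IDs are used directly as hash values.

```python
def moved_fraction(keys, old_shards, new_shards):
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(1 for key in keys if key % old_shards != key % new_shards)
    return moved / len(keys)

keys = range(10_000)  # Integer user IDs used directly as hash values.
print(moved_fraction(keys, 4, 5))  # 0.8
```

Adding a single shard relocates 80% of the keys in this example, which is why growing a modulo-sharded cluster is usually planned as a full data migration.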

---

**Strategy 2: Range-based Sharding**

**Concept**: Shard by value ranges (e.g., user_id 0-9999 on shard 0, 10000-19999 on shard 1).

**Example**:
```python
# Shard configuration
shard_ranges = [
    (0, 9999),        # Shard 0: user_id 0-9999
    (10000, 19999),   # Shard 1: user_id 10000-19999
    (20000, 29999),   # Shard 2: user_id 20000-29999
    (30000, 39999),   # Shard 3: user_id 30000-39999
]

def get_shard(user_id):
    for shard_id, (min_id, max_id) in enumerate(shard_ranges):
        if min_id <= user_id <= max_id:
            return shard_id
    return None  # User_id outside range

# Examples
print(f"User 5000 → Shard {get_shard(5000)}")   # Shard 0
print(f"User 15000 → Shard {get_shard(15000)}") # Shard 1
print(f"User 25000 → Shard {get_shard(25000)}") # Shard 2
```

**Pros**:
- **Range queries**: Efficient range queries within shard
- **Ordered data**: Data naturally ordered (useful for time-series)
- **Predictable**: Easy to understand distribution

**Cons**:
- **Uneven distribution**: Some ranges may have more data than others
- **Hot spots**: Popular ranges overload shards
- **Rebalancing**: Changing ranges requires moving data

**Use cases**: Time-series data, geographic data, range queries required
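
With many ranges, a linear scan over the range table becomes wasteful; because the ranges are sorted, the lookup can use binary search. This is a sketch under the assumption that ranges are contiguous and start at 0; `shard_upper_bounds` and `get_shard_by_range` are illustrative names.

```python
import bisect

# Upper bound (inclusive) of each shard's user_id range, in ascending order.
shard_upper_bounds = [9999, 19999, 29999, 39999]

def get_shard_by_range(user_id):
    """Binary-search the range table; None if the ID falls outside every range."""
    if user_id < 0:
        return None
    shard_id = bisect.bisect_left(shard_upper_bounds, user_id)
    return shard_id if shard_id < len(shard_upper_bounds) else None

print(get_shard_by_range(5000), get_shard_by_range(10000), get_shard_by_range(40000))
# 0 1 None
```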

---

**Strategy 3: Directory-based Sharding**

**Concept**: Maintain a directory (lookup table) mapping keys to shards.

**Example**:
```python
# Directory (could be in Redis, database, or service)
shard_directory = {
    123: 0,      # User 123 is on shard 0
    456: 1,      # User 456 is on shard 1
    789: 2,      # User 789 is on shard 2
    101112: 0,   # User 101112 is also on shard 0
    131415: 3,   # User 131415 is on shard 3
}

def get_shard(user_id):
    return shard_directory.get(user_id)

# When creating new user:
def create_user(user_id, user_data):
    # Choose shard based on current load
    shard_id = choose_least_loaded_shard()
    
    # Store on shard
    shard_databases[shard_id].insert(user_id, user_data)
    
    # Update directory
    shard_directory[user_id] = shard_id

# When querying user:
def get_user(user_id):
    shard_id = get_shard(user_id)
    return shard_databases[shard_id].get(user_id)
```

**Pros**:
- **Flexible**: Can manually control placement
- **Rebalancing**: Moving a user between shards only requires copying their data and updating the directory; no other keys are affected
- **Load-aware**: Can place users based on current load

**Cons**:
- **Directory overhead**: Need to maintain and query directory
- **Single point of failure**: Directory failure breaks entire system
- **Complex**: Additional infrastructure to manage

**Use cases**: When flexibility and manual control over placement are needed
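
Moving a single user between shards can be sketched as copy, repoint, clean up. This is a simplified illustration with in-memory dictionaries standing in for shards and the directory (`shards`, `directory`, and `move_user` are assumed names); a real migration also has to handle writes that arrive while the move is in progress.

```python
shards = {0: {123: {"name": "Alice"}}, 1: {}}
directory = {123: 0}

def move_user(user_id, target_shard):
    """Copy the row, flip the directory entry, then delete the old copy."""
    source_shard = directory[user_id]
    if source_shard == target_shard:
        return
    shards[target_shard][user_id] = shards[source_shard][user_id]  # 1. Copy
    directory[user_id] = target_shard                              # 2. Repoint
    del shards[source_shard][user_id]                              # 3. Clean up

move_user(123, 1)
print(directory[123], 123 in shards[0], 123 in shards[1])  # 1 False True
```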

---

### **Sharding Key Selection**

**Sharding Key**: Column used to determine shard placement (also called partition key).

**Criteria for Good Sharding Key**:
1. **High cardinality**: Many distinct values (even distribution)
2. **Access pattern**: Most queries include sharding key (avoid cross-shard queries)
3. **No hotspots**: No single key has disproportionate data
4. **Stable**: Key doesn't change (no need to move data)

**Examples**:

**Good Sharding Keys**:
```sql
-- User ID: High cardinality, stable, most queries include user_id
Shard key: user_id

-- Order ID: High cardinality, stable
Shard key: order_id

-- Customer ID + Date: Composite sharding key for time-series
Shard key: (customer_id, order_date)
```

**Bad Sharding Keys**:
```sql
-- Status: Low cardinality (only 3 values: active, inactive, deleted)
-- Uneven distribution: 90% of users might be active
Shard key: status  → BAD

-- Timestamp: Range-based sharding causes hot spots (recent data popular)
Shard key: created_at  → BAD (unless range-based sharding is desired)

-- Country: Uneven distribution (some countries have many more users)
Shard key: country  → BAD
```

---

### **Cross-Shard Queries: The Join Problem**

**Problem**: Querying data across multiple shards requires joining data from multiple databases.

**Example**:
```sql
-- Sharded by user_id (4 shards)
-- Shard 0: users 0-9999
-- Shard 1: users 10000-19999
-- Shard 2: users 20000-29999
-- Shard 3: users 30000-39999

-- Table: orders (order_id, user_id, product_id, amount, order_date)
-- Table: products (product_id, name, price, category)

-- Query: Get all orders for user 12345 (on shard 1) with product names
SELECT o.order_id, o.amount, p.name, p.price
FROM orders o
JOIN products p ON o.product_id = p.product_id
WHERE o.user_id = 12345;

-- Problem: orders is sharded by user_id, but products is not, so the matching
-- product rows may live on a different shard (or on several shards)
```

**Solutions**:

**Solution 1: Denormalization (Duplicate Data)**
```sql
-- Duplicate product name in orders table
CREATE TABLE orders (
    order_id INT,
    user_id INT,      -- Shard key
    product_id INT,
    product_name VARCHAR(100),  -- Denormalized!
    product_price DECIMAL(10, 2),  -- Denormalized!
    amount DECIMAL(10, 2),
    order_date DATE
);

-- Query: No join needed!
SELECT order_id, amount, product_name, product_price
FROM orders
WHERE user_id = 12345;

-- Trade-offs:
-- + No cross-shard joins (fast queries)
-- - Data redundancy (product name stored multiple times)
-- - Update anomalies (must update product name in all orders)
```

**Solution 2: Application-side joins**
```python
def get_orders_with_products(user_id):
    # Step 1: Get orders from correct shard
    shard_id = get_shard(user_id)
    orders = shard_databases[shard_id].query(
        "SELECT * FROM orders WHERE user_id = ?", user_id
    )
    
    # Step 2: Collect the distinct product_ids from the orders
    product_ids = {order['product_id'] for order in orders}
    
    # Step 3: Get products from product shard (or all shards)
    # Assume products are sharded by product_id
    products = {}
    for product_id in product_ids:
        product_shard = get_shard(product_id)
        products[product_id] = shard_databases[product_shard].get_product(product_id)
    
    # Step 4: Join in application
    result = []
    for order in orders:
        product = products[order['product_id']]
        result.append({
            'order_id': order['order_id'],
            'amount': order['amount'],
            'product_name': product['name'],
            'product_price': product['price']
        })
    
    return result
```

**Solution 3: Two-phase commit (distributed transaction)**
```python
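# Expensive and complex, but keeps cross-shard writes consistent.
# The prepare/reserve/commit/rollback methods are hypothetical shard-client calls.
def create_order(user_id, product_data):
    product_id = product_data['product_id']
    order_shard = shard_databases[get_shard(user_id)]
    product_shard = shard_databases[get_shard(product_id)]
    
    order_id = None
    try:
        # Phase 1: Prepare (reserve on both shards)
        order_id = order_shard.prepare_order(user_id, product_data)
        product_shard.reserve_product(product_id, order_id)
    except Exception:
        # If any prepare step fails, roll back both shards
        if order_id is not None:
            order_shard.rollback_order(order_id)
            product_shard.rollback_reservation(product_id, order_id)
        raise
    
    # Phase 2: Commit (a real coordinator logs its decision and retries until both succeed)
    order_shard.commit_order(order_id)
    product_shard.commit_reservation(product_id)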
```

---

## **3.7 The CAP Theorem Deep Dive**

We introduced CAP briefly in Chapter 1. Now, let's explore it in detail.

### **CAP Theorem: Consistency, Availability, Partition Tolerance**

**CAP Theorem**: In a distributed system, during a network partition, you must choose between Consistency (C) and Availability (A). Partition Tolerance (P) is mandatory in distributed systems.

**Three Properties**:

**C - Consistency**: Every read receives the most recent write or an error.
- **Strong consistency**: All nodes see same data at same time
- **Example**: Bank account balance—always accurate

**A - Availability**: Every request receives a response, without guarantee it contains the most recent write.
- **High availability**: System always responds (even with stale data)
- **Example**: Twitter feed—better to see old tweets than error

**P - Partition Tolerance**: System continues to operate despite network failures.
- **Network partition**: Communication break between nodes
- **Mandatory**: Networks always fail eventually

**The Trade-off**: When a network partition occurs (P), you must choose between C and A.

```
                    Network Partition
                         │
                         ▼
         ┌───────────────────────────┐
         │                           │
    Choose C                   Choose A
         │                           │
         ▼                           ▼
    Reject writes              Accept writes
    (return error)           (might be inconsistent)
         │                           │
         ▼                           ▼
   Data always               System always
   consistent               available
   (but unavailable)        (but possibly stale)
```
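
The two branches of the diagram can be simulated with a toy replica. This sketch is illustrative only: the `Node` class and its `mode` flag are assumptions made for the example, and real systems detect partitions through timeouts and quorum checks rather than a boolean.

```python
class Node:
    """One replica that behaves as CP or AP when it loses contact with its peers."""

    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.partitioned = False
        self.data = {}

    def write(self, key, value):
        if self.partitioned and self.mode == "CP":
            # CP: refuse rather than risk diverging from unreachable peers.
            raise ConnectionError("unavailable during partition")
        # AP (or healthy): accept locally; peers may not see this yet.
        self.data[key] = value
        return "ok"

cp, ap = Node("CP"), Node("AP")
cp.partitioned = ap.partitioned = True

print(ap.write("email", "alice@example.com"))  # ok
try:
    cp.write("email", "alice@example.com")
except ConnectionError as err:
    print("CP node:", err)  # CP node: unavailable during partition
```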

---

### **CP Systems: Choosing Consistency**

**CP Systems**: Prioritize consistency over availability. During partition, system rejects writes or reads until partition is resolved.

**Examples**:
- **Banking systems**: Data accuracy is critical
- **Databases**: HBase, MongoDB (with specific settings), PostgreSQL (with synchronous replication)
- **Blockchain**: Immutable ledger requires consistency

**Example Scenario**:
```
Primary Database: New York (active)
Replica Database: London (sync replica)

Network partition: New York ↔ London communication breaks

Client request: Write transaction to London replica

CP System Response:
- London replica: Cannot accept write (not sure if it conflicts with New York)
- Return error: "Service unavailable, please try later"
- Data remains consistent (no conflicting writes)
- But system is unavailable (London users cannot write)

When partition heals:
- Sync changes from New York to London
- London becomes available again
```

**PostgreSQL (CP Configuration)**:
```sql
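-- Wait for the standby to confirm each commit (strong consistency across nodes).
-- This only involves a replica if synchronous_standby_names is set in postgresql.conf.
SET synchronous_commit = on;

-- Use serializable isolation for transactions
SET default_transaction_isolation = 'serializable';

-- Result: Strong consistency, but commits block (lower availability) while the synchronous standby is unreachable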
```

---

### **AP Systems: Choosing Availability**

**AP Systems**: Prioritize availability over consistency. During partition, system accepts reads and writes, serving potentially stale data.

**Examples**:
- **Social media**: Better to see old feed than error
- **E-commerce shopping cart**: Better to add item than lose sale
- **Databases**: DynamoDB, Cassandra, CouchDB

**Example Scenario**:
```
Database: DynamoDB global table (multi-region, multi-active)
Regions: US-East-1, US-West-2, EU-West-1 (each replica accepts writes)

Network partition: US-East-1 ↔ US-West-2 communication breaks

Client request: Write user profile (update email) to US-West-2

AP System Response:
- US-West-2: Accept write (stores locally)
- Return success: "Profile updated"
- System remains available
- But data is inconsistent (US-East-1 doesn't have new email)

When partition heals:
- US-East-1 syncs with US-West-2
- Conflict resolution: Last-write-wins (based on timestamp)
- Data becomes consistent again
```
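
The conflict-resolution step can be sketched as a merge function. This is a minimal illustration of last-write-wins, not DynamoDB's internal implementation; the `updated_at` field, the sample values, and the tie-breaking rule (prefer the local copy) are assumptions.

```python
def last_write_wins(local, remote):
    """Merge two replicas' versions of an item: the higher timestamp wins."""
    return local if local["updated_at"] >= remote["updated_at"] else remote

us_east = {"email": "alice@old.example.com", "updated_at": 1000}
us_west = {"email": "alice@new.example.com", "updated_at": 1005}

merged = last_write_wins(us_east, us_west)
print(merged["email"])  # alice@new.example.com
```

Last-write-wins silently discards the losing update and depends on reasonably synchronized clocks, so it suits data where losing an occasional concurrent write is acceptable.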

**DynamoDB (AP Configuration)**:
```javascript
// Write: PutItem has no consistency option; replication to other replicas is asynchronous
const putParams = {
  TableName: 'Users',
  Item: {
    userId: '123',
    name: 'Alice',
    email: 'alice@example.com'
  }
};

// Read with eventual consistency (AP)
const getParams = {
  TableName: 'Users',
  Key: { userId: '123' },
  ConsistentRead: false  // Eventual consistency (the default)
};
```

---

### **CA Systems: The Impossibility**

**CA Systems**: Theoretically possible, but only in non-distributed systems (single machine). In distributed systems, P is mandatory, so CA is impossible.

**Example**: Single database instance on one machine (no network partitions).
- **Consistent**: All reads see latest writes
- **Available**: Always responds (unless machine crashes)
- **No partition tolerance**: Not distributed (single point of failure)

**Reality**: Any distributed system must be either CP or AP. CA is only possible in non-distributed systems.

---

### **PACELC Theorem: Extension of CAP**

**PACELC**: An extension of CAP. In case of **P**artition (P), trade off between **A**vailability (A) and **C**onsistency (C); else (E), when there is no partition, trade off between **L**atency (L) and **C**onsistency (C).

**Visualization**:
```
                    Network Partition?
                    │
              ┌─────┴─────┐
              │ No        │ Yes
              ▼           ▼
         Trade off    Trade off
         Latency      Availability
         vs           vs
         Consistency  Consistency
```

**Examples**:
```
Scenario 1: No partition (normal operation)
- Choose: Latency vs. Consistency
- Option 1: Strong consistency (slower due to synchronous replication)
- Option 2: Weak consistency (faster due to asynchronous replication)

Scenario 2: Partition occurs
- Choose: Availability vs. Consistency
- Option 1: CP (reject writes to maintain consistency)
- Option 2: AP (accept writes, eventual consistency)
```
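
The latency side of the trade-off in Scenario 1 can be approximated with a simple model. This is a sketch under simplifying assumptions: synchronous replication waits for every replica to acknowledge, and the round-trip times are hypothetical numbers chosen for illustration.

```python
def write_latency_ms(local_ms, replica_rtts_ms, synchronous):
    """Commit latency: sync waits for the slowest replica ack, async does not."""
    if synchronous and replica_rtts_ms:
        return local_ms + max(replica_rtts_ms)
    return local_ms

rtts = [2, 70, 140]  # Hypothetical round trips to three replicas, in ms.
print(write_latency_ms(5, rtts, synchronous=True))   # 145
print(write_latency_ms(5, rtts, synchronous=False))  # 5
```

Quorum-based systems sit between the two extremes, because they wait only for a majority of replicas rather than the slowest one.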

**Real-World Systems**:

**PostgreSQL (PC/EC)**:
```sql
-- Synchronous commit (strong consistency, higher latency)
SET synchronous_commit = on;
-- Result: Strong consistency, but higher latency (synchronous writes)

-- Asynchronous commit (weak consistency, lower latency)
SET synchronous_commit = off;
-- Result: Lower latency, but risk of data loss if crash
```

**DynamoDB (PA/EL)**:
```javascript
// Strongly consistent read (latest data, but higher latency and read cost)
const strongReadParams = {
  TableName: 'Users',
  Key: { userId: '123' },
  ConsistentRead: true  // Reflects all writes acknowledged before the read
};

// Eventually consistent read (lower latency, but possibly stale data)
const eventualReadParams = {
  TableName: 'Users',
  Key: { userId: '123' },
  ConsistentRead: false  // Default; may briefly return an older value
};
```

---

### **Choosing CP or AP: Decision Framework**

**Decision Framework**:
```
Question 1: Can your application tolerate inconsistent data?
- No → CP (banking, financial transactions, inventory)
- Yes → Question 2

Question 2: How critical is availability?
- Extremely critical → AP (social media, shopping carts, user profiles)
- Moderately critical → Question 3

Question 3: What's your primary concern?
- Data accuracy → CP (analytics, reporting, inventory)
- User experience → AP (news feeds, notifications, search)
```
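
The three questions above translate directly into a small function. This is a sketch of the checklist as written, not a substitute for requirements analysis; the names `Availability`, `Concern`, and `choose_model` are made up for this example.

```python
from enum import Enum


class Availability(Enum):
    EXTREME = "extremely critical"
    MODERATE = "moderately critical"


class Concern(Enum):
    DATA_ACCURACY = "data accuracy"
    USER_EXPERIENCE = "user experience"


def choose_model(tolerates_inconsistency,
                 availability=Availability.MODERATE,
                 concern=Concern.DATA_ACCURACY):
    """Walk the three questions in order and return "CP" or "AP"."""
    # Question 1: can the application tolerate inconsistent data?
    if not tolerates_inconsistency:
        return "CP"
    # Question 2: how critical is availability?
    if availability is Availability.EXTREME:
        return "AP"
    # Question 3: what is the primary concern?
    if concern is Concern.DATA_ACCURACY:
        return "CP"
    return "AP"


if __name__ == "__main__":
    print("banking:", choose_model(tolerates_inconsistency=False))
    print("shopping cart:",
          choose_model(True, availability=Availability.EXTREME))
    print("news feed:",
          choose_model(True, concern=Concern.USER_EXPERIENCE))
```

The function returns as soon as a question is decisive, mirroring the framework: later questions only matter when earlier ones do not settle the choice.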

**Real-World Examples**:

**CP Use Cases**:
- **Banking**: Account balances must be accurate
- **Inventory**: Product counts must be accurate (no overselling)
- **Payments**: Transaction records must be consistent
- **Auctions**: Bids must be processed in order
- **Authentication**: Login state must be consistent

**AP Use Cases**:
- **Social media feeds**: Better to show old posts than error
- **Shopping carts**: Better to save cart than lose sale
- **User profiles**: Better to show old profile than error
- **Notifications**: Better to deliver a notification late than fail to accept it
- **Search results**: Better to show old results than error

**Hybrid Approaches**:
```python
# Some data CP, some data AP (same application)
# Banking app example. `db_primary`, `db_replica`, and `queue` are
# illustrative client objects, not a specific library's API.

class BankService:
    def __init__(self, db_primary, db_replica, queue):
        self.db_primary = db_primary
        self.db_replica = db_replica
        self.queue = queue

    # Account balances: CP (consistency critical)
    def get_balance(self, account_id):
        # Strongly consistent read (always served by the primary)
        return self.db_primary.query("SELECT balance FROM accounts WHERE id = ?", account_id)

    def transfer(self, from_id, to_id, amount):
        # Strongly consistent transaction: both updates commit, or neither does
        self.db_primary.begin_transaction()
        try:
            self.db_primary.update("UPDATE accounts SET balance = balance - ? WHERE id = ?", amount, from_id)
            self.db_primary.update("UPDATE accounts SET balance = balance + ? WHERE id = ?", amount, to_id)
            self.db_primary.commit()
        except Exception:
            # Undo the partial transfer before surfacing the error
            self.db_primary.rollback()
            raise

    # Transaction history: AP (availability critical)
    def get_transaction_history(self, account_id):
        # Eventually consistent read (from a replica; may lag the primary)
        return self.db_replica.query("SELECT * FROM transactions WHERE account_id = ?", account_id)

    # Notifications: AP (availability critical)
    def send_notification(self, user_id, message):
        # Write to a queue; delivery happens later (eventually consistent)
        self.queue.publish({'user_id': user_id, 'message': message})
```

---

## **3.8 Key Takeaways**

1. **Database choice is critical**: RDBMS for consistency and complex queries, NoSQL for scale and flexibility. Choose based on your requirements.

2. **ACID guarantees come with trade-offs**: Strong consistency can reduce availability and performance, especially once data is replicated. Not all systems need ACID.

3. **Indexing is performance**: A well-designed index can speed up selective queries by orders of magnitude (100-1,000x on large tables), but every index slows writes and consumes storage.

4. **Scaling requires architectural changes**: Vertical scaling has limits; horizontal scaling (read replicas, sharding) requires application changes.

5. **Sharding is complex**: Simple to understand, hard to implement correctly. Consider database services that handle sharding for you (DynamoDB, CockroachDB, Azure Cosmos DB).

6. **CAP theorem is a decision framework**: Choose CP for data accuracy, AP for availability. Understand your requirements before choosing a database.

7. **Hybrid approaches are common**: Many systems use CP for critical data and AP for non-critical data. Don't feel compelled to choose one for everything.

---

## **Chapter Summary**

In this chapter, we explored databases—the foundation of data persistence in system design. We started with relational databases, understanding ACID properties, schema design, indexing, and transactions.

We then explored the NoSQL landscape, examining key-value stores, document databases, wide-column stores, and graph databases. Each has strengths for specific use cases.

We covered database scaling strategies, from vertical scaling to horizontal scaling, read replicas, connection pooling, and sharding. We understood that scaling introduces complexity but is necessary for high-traffic systems.

Finally, we took a deep dive into the CAP theorem, examining the trade-offs between consistency and availability in distributed systems and how to choose between CP and AP based on application requirements.

**Coming up next**: In Chapter 4, we'll explore Caching—Speed at Scale. We'll cover caching patterns, eviction policies, distributed caching, CDNs, and cache consistency strategies.

---

**Exercises**:

1. **Database Selection**: For each scenario, recommend a database (RDBMS or NoSQL) and explain why:
   - A banking application processing financial transactions
   - A social media platform storing user posts and comments
   - An IoT platform collecting sensor readings from millions of devices
   - A fraud detection system analyzing transaction patterns

2. **Index Design**: For this schema, design indexes for these queries:
```sql
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    product_id INT,
    order_date DATE,
    amount DECIMAL(10, 2),
    status VARCHAR(20)
);

-- Queries:
-- a. Get all orders for a customer
-- b. Get all orders in a date range
-- c. Get all pending orders for a customer
-- d. Get total amount spent by a customer
```

3. **Sharding Strategy**: You're building a messaging app with 100 million users. Each user has 10,000 messages on average. Design a sharding strategy:
   - What sharding key would you use? Why?
   - How many shards would you start with?
   - How would you handle cross-shard queries (e.g., search all messages containing a keyword)?

4. **CAP Analysis**: For each system, would you choose CP or AP? Why?
   - A stock trading platform
   - A social media news feed
   - A collaborative document editor (Google Docs-style)
   - A multiplayer game leaderboard

5. **Read Replica Strategy**: You have an application with 20,000 QPS (90% reads). Each database server can handle 5,000 QPS. How many read replicas do you need? How would you handle replication lag?

