# Chapter 36: Scaling Strategies (What Actually Works)

Scaling PostgreSQL is fundamentally about choosing the right tradeoffs between consistency, availability, and operational complexity. This chapter provides a decision framework for those tradeoffs. Its central claim is that vertical scaling and query optimization cover the large majority of use cases (roughly 80% as a rule of thumb), while horizontal partitioning (sharding) should be the option of last resort. We cover practical patterns for read scaling, sharding when it is unavoidable, and caching strategies that protect the database while keeping staleness bounded.

---

## 36.1 Vertical Scaling (The First 80%)

Before considering complex distributed architectures, exhaust vertical scaling. On modern hardware, a single PostgreSQL server can handle tens of thousands of transactions per second and terabytes of data.

### 36.1.1 Hardware Sizing Priorities

**The Hierarchy of Needs**:

1. **Storage IOPS** (Most Critical)
   - Random read/write performance determines transaction throughput
   - NVMe SSDs: 100K+ IOPS per device
   - Cloud: Provisioned IOPS (AWS io2, Azure Premium SSD v2)
   - RAID 10 for critical workloads (striping + mirroring)

2. **Memory** (Second Critical)
   - Rule: `shared_buffers` + OS page cache ≥ working set size
   - For OLTP: 64-256 GB RAM typical
   - For analytics: 512 GB+ to cache hot partitions

3. **CPU** (Third)
   - PostgreSQL scales well to 32-64 cores
   - Beyond 64 cores: Diminishing returns due to lock contention
   - Prefer faster cores over more cores for OLTP

4. **Network** (Often Overlooked)
   - 10 Gbps minimum for replicas
   - 25-100 Gbps for high-throughput OLAP

```sql
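-- Determine whether you need more memory (table cache hit ratio)
SELECT
    sum(heap_blks_read) AS disk_read,
    sum(heap_blks_hit)  AS cache_hit,
    -- nullif() avoids division by zero on a freshly reset or idle database
    sum(heap_blks_hit)::float
        / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS ratio
FROM pg_statio_user_tables;

-- Target: ratio > 0.99 for OLTP
-- If ratio < 0.95, you need more RAM or query optimization
-- Note: a "read" here may still be served from the OS page cache, not disk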
```
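
The thresholds above can be turned into a small helper for a monitoring script. This is a minimal sketch: the function name `classify_cache_hit_ratio` and its verdict labels are illustrative assumptions, and the inputs are assumed to be the `cache_hit` and `disk_read` totals returned by the query.

```python
from typing import Optional, Tuple

def classify_cache_hit_ratio(cache_hit: int, disk_read: int) -> Tuple[Optional[float], str]:
    """Apply the OLTP thresholds from the text to pg_statio_user_tables totals.

    Hypothetical helper: the name and the verdict strings are illustrative.
    """
    total = cache_hit + disk_read
    if total == 0:
        # No block activity recorded yet (for example, right after a stats reset)
        return None, "no data"
    ratio = cache_hit / total
    if ratio > 0.99:
        verdict = "healthy"
    elif ratio >= 0.95:
        verdict = "watch"
    else:
        verdict = "add RAM or optimize queries"
    return ratio, verdict
```

The middle band (0.95 to 0.99) is left as "watch" because the text only gives a target and a lower alarm threshold.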

### 36.1.2 Configuration for Vertical Scale

```ini
# postgresql.conf for high-end hardware (64+ cores, 256GB+ RAM)

# Memory
shared_buffers = 64GB                  # ~25% of RAM as a starting point; benchmark before going higher,
                                       # since the OS page cache buffers the same pages (double buffering)
effective_cache_size = 192GB           # 75% of RAM (planner hint)
work_mem = 256MB                       # Per-operation (sorts, hashes)
                                       # 256MB × max_connections = danger zone
                                       # Better: 32MB with moderate connections + pooling
maintenance_work_mem = 2GB             # VACUUM, CREATE INDEX operations

# Parallelism
max_parallel_workers_per_gather = 8    # Per query parallel workers
max_parallel_workers = 32              # Total parallel workers cluster-wide
max_parallel_maintenance_workers = 8   # CREATE INDEX, VACUUM parallel workers

# WAL and Checkpointing (high write throughput)
max_wal_size = 16GB                    # Allow more WAL before forced checkpoint
checkpoint_completion_target = 0.9     # Spread writes over 90% of checkpoint interval
wal_buffers = 16MB                     # Increase with high write volume

# Connection limits (with pooling!)
max_connections = 500                  # Hard limit
superuser_reserved_connections = 5     # Emergency access
```
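
The "danger zone" comment on `work_mem` is simple arithmetic, sketched below. The helper name `worst_case_work_mem_gb` is hypothetical, and `nodes_per_query` is an assumption standing in for the number of sort or hash operations a single query runs at once (each one may use up to `work_mem`).

```python
def worst_case_work_mem_gb(work_mem_mb: int, max_connections: int,
                           nodes_per_query: int = 1) -> float:
    """Upper bound on sort/hash memory if every connection is busy at once.

    Hypothetical helper for illustration; real usage is usually far lower.
    """
    return work_mem_mb * max_connections * nodes_per_query / 1024

print(worst_case_work_mem_gb(256, 500))  # 125.0 (GB) on a 256 GB server
print(worst_case_work_mem_gb(32, 500))   # 15.625 (GB)
```

With `shared_buffers` already taking 64 GB, the first configuration could exhaust a 256 GB machine under load, which is why the smaller value combined with pooling is the safer default.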

### 36.1.3 When to Stop Vertical Scaling

**Practical Limits** (rules of thumb, not hard ceilings):
- Single-server PostgreSQL: roughly 100K TPS and 10-50 TB of data, depending on workload and hardware
- Network bandwidth saturation on replicas
- Cost-efficiency: Cloud instances beyond 64 vCPUs / 512 GB RAM have diminishing price/performance

**Decision Point**:
If CPU > 80% sustained AND queries are already optimized AND you cannot partition by tenant/date, consider horizontal scaling.

---

## 36.2 Read Scaling with Replicas

Read replicas are the standard way to scale read-heavy workloads (many web applications are roughly 80% reads, 20% writes).

### 36.2.1 Replication Topologies

**Single Primary, Multiple Standbys**:
```text
Primary (Write)
    ├─ Hot Standby 1 (Read)
    ├─ Hot Standby 2 (Read)
    └─ Hot Standby 3 (Read + Reporting)
```

**Cascading Replication** (for many replicas):
```text
Primary
    ├─ Standby 1
    │     ├─ Standby 1a
    │     └─ Standby 1b
    └─ Standby 2
```
- Reduces load on the primary (it streams WAL to only 2 standbys, which relay it to the others)
- Increases lag for the cascaded nodes

### 36.2.2 Load Balancing Strategies

**Application-Level Routing** (Recommended for precision):

```python
# Python with SQLAlchemy
import random

from sqlalchemy import create_engine, text

# Write engine (primary)
write_engine = create_engine(
    "postgresql://user:pass@primary.internal:5432/production",
    pool_size=20,
    max_overflow=10,
)

# Read engines (standbys); pool_pre_ping discards dead connections on checkout
read_engines = [
    create_engine("postgresql://user:pass@replica1.internal:5432/production",
                  pool_size=10, pool_pre_ping=True),
    create_engine("postgresql://user:pass@replica2.internal:5432/production",
                  pool_size=10, pool_pre_ping=True),
]

def get_engine(read_only=False):
    """Return a replica engine for reads, the primary engine otherwise."""
    if read_only:
        # Random selection; swap in round-robin or least-connections if needed
        return random.choice(read_engines)
    return write_engine

# Usage
category = "books"  # example value
with get_engine(read_only=True).connect() as conn:
    result = conn.execute(
        text("SELECT * FROM products WHERE category = :category"),
        {"category": category},
    )
```
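
Where strict rotation is preferred over random choice, a round-robin selector could look like the sketch below. The class name `RoundRobin` and the replica labels are illustrative assumptions; in the routing code above, the targets would be the engines in `read_engines`.

```python
import itertools
import threading

class RoundRobin:
    """Thread-safe round-robin selection over a fixed list of targets.

    Hypothetical helper; it does no health checking.
    """

    def __init__(self, targets):
        targets = list(targets)
        if not targets:
            raise ValueError("at least one target is required")
        self._cycle = itertools.cycle(targets)
        self._lock = threading.Lock()

    def pick(self):
        # The lock keeps concurrent request threads from racing on the iterator
        with self._lock:
            return next(self._cycle)

replicas = RoundRobin(["replica1", "replica2", "replica3"])
picks = [replicas.pick() for _ in range(4)]
print(picks)  # ['replica1', 'replica2', 'replica3', 'replica1']
```

Round-robin spreads new checkouts evenly, but it does not account for how long each query runs; least-connections needs per-target counters instead.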

**PgPool-II or HAProxy for Transparent Routing**:

```ini
# HAProxy configuration for read scaling
listen postgres_read
    bind *:5433
    mode tcp

    # Health check: send a PostgreSQL startup packet and expect a valid
    # server response. "haproxy_check" is a placeholder role name.
    # This confirms the server answers; it does not verify that the node
    # is a replica or how far it lags.
    option pgsql-check user haproxy_check

    server replica1 10.0.2.10:5432 check inter 2s rise 2 fall 3
    server replica2 10.0.2.11:5432 check inter 2s rise 2 fall 3
    server replica3 10.0.2.12:5432 check inter 2s rise 2 fall 3 backup  # Used only if the others are down
```

### 36.2.3 Handling Replication Lag

**The Stale Read Problem**:
User writes comment → Primary accepts → Page refresh hits replica → Comment missing (lag = 200ms)

**Solutions**:

1. **Session Stickiness** (Post-Write Redirect):
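```python
import random
import time

from sqlalchemy import text

# `session` is the per-user web session (for example, Flask's `session`)

# After a write, force subsequent reads to the primary for a few seconds
def create_comment(user_id, content):
    with write_engine.begin() as conn:  # begin() commits on success
        comment_id = conn.execute(
            text("INSERT INTO comments (user_id, content) "
                 "VALUES (:user_id, :content) RETURNING id"),
            {"user_id": user_id, "content": content},
        ).scalar()

    # Flag session: next 5 seconds read from primary
    session['read_from_primary_until'] = time.time() + 5
    return comment_id

def get_comments(user_id):
    if session.get('read_from_primary_until', 0) > time.time():
        engine = write_engine  # Primary
    else:
        engine = random.choice(read_engines)  # Replicas
    with engine.connect() as conn:
        return conn.execute(
            text("SELECT * FROM comments WHERE user_id = :user_id"),
            {"user_id": user_id},
        ).fetchall()
```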
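
The same stickiness rule can be isolated from the web framework so that it is easy to test. The sketch below is an assumption-laden illustration: `StickyReadRouter` is a hypothetical class, the deadlines live in a per-process dictionary (a real deployment with several app servers would need a shared store or the session cookie), and the clock is injectable only to make the behavior verifiable.

```python
import time

class StickyReadRouter:
    """Route a user's reads to the primary for `window` seconds after a write.

    Hypothetical helper; state is in-process only.
    """

    def __init__(self, window: float = 5.0, clock=time.monotonic):
        self._window = window
        self._clock = clock
        self._sticky_until = {}  # user_id -> deadline on the injected clock

    def record_write(self, user_id):
        self._sticky_until[user_id] = self._clock() + self._window

    def target_for_read(self, user_id) -> str:
        deadline = self._sticky_until.get(user_id)
        if deadline is not None and self._clock() < deadline:
            return "primary"
        self._sticky_until.pop(user_id, None)  # expired: forget the entry
        return "replica"
```

A monotonic clock is used by default so that wall-clock adjustments cannot shorten or extend the window.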

2. **Replication Lag Monitoring**:
```sql
-- Run on the primary: one row per connected standby
-- Don't route to replicas with more than 5 seconds (or 10 MB) of replay lag
SELECT
    client_addr,
    application_name,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
    extract(epoch FROM replay_lag) AS lag_seconds  -- PostgreSQL 10+
FROM pg_stat_replication
WHERE application_name = 'replica_1';

-- Application logic: if lag_seconds > 5 or lag_bytes > 10MB, remove from rotation
```
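
The rotation rule can be sketched in application code. `pg_wal_lsn_diff` subtracts two WAL positions; the helper below does the same arithmetic on the text form of an LSN (`high/low` in hexadecimal, where the high part counts 4 GiB units). The function names, the replica names, and the sample LSNs are illustrative assumptions.

```python
from typing import Dict, List

MAX_LAG_BYTES = 10 * 1024 * 1024  # 10 MB, the threshold used in the text

def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def replicas_in_rotation(primary_lsn: str, replay_lsns: Dict[str, str],
                         max_lag_bytes: int = MAX_LAG_BYTES) -> List[str]:
    """Return the replicas whose replay position is within max_lag_bytes."""
    current = lsn_to_int(primary_lsn)
    return [name for name, lsn in replay_lsns.items()
            if current - lsn_to_int(lsn) <= max_lag_bytes]

# Hypothetical sample: replica_1 is 1 MB behind, replica_2 is 16 MB behind
healthy = replicas_in_rotation(
    "0/2000000", {"replica_1": "0/1F00000", "replica_2": "0/1000000"}
)
print(healthy)  # ['replica_1']
```

If the returned list is empty, falling back to the primary is the conservative choice, at the cost of extra load there.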

3. **Synchronous Commit for Critical Reads**:
```sql
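-- Force the commit to wait until synchronous standbys have applied it
-- (only for critical operations). Requires synchronous_standby_names to be
-- set on the primary; without it, this setting does not wait for any standby.
SET synchronous_commit = remote_apply;
INSERT INTO payments ...;
-- Now safe to read from a synchronous standby; other replicas may still lag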
```

### 36.2.4 Read Replica Anti-Patterns

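**Anti-Pattern**: Treating replicas as identical to the primary for all queries.
- Long-running analytics queries on a replica can hold back WAL replay, which increases replication lag for every other reader of that replica
- Vacuum cleanup replayed from the primary conflicts with open snapshots on a hot standby, so long queries may be canceled

**Solution**: Use a dedicated reporting replica with `hot_standby_feedback = on` (at the cost of delayed vacuum cleanup, and therefore possible bloat, on the primary) and a tuned `max_standby_streaming_delay`, or use logical replication to a data warehouse.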

---

## 36.3 Sharding (The Nuclear Option)

Sharding splits data across multiple PostgreSQL instances (nodes). It solves write scalability limits but introduces massive operational complexity.

### 36.3.1 When to Shard

**Valid Reasons**:
- Single-node write throughput exceeded (~50K-100K writes/sec)
- Data size exceeds single-node storage (10+ TB with growth)
- Geographic latency requirements (EU data in EU, US data in US)
- Compliance (data residency requirements)

**Invalid Reasons** (Don't Shard Yet):
- "We might need it later" (Premature optimization)
- CPU at 20% (Optimize queries first)
- Just learned about microservices (Don't use DB sharding for service boundaries)

### 36.3.2 Sharding Strategies

**1. Hash Sharding (Distributed evenly)**

```python
# Shard key: user_id % 4
# Shard 0: user_id % 4 == 0
# Shard 1: user_id % 4 == 1
# etc.

# Application logic
def get_shard(user_id):
    shard_id = user_id % 4
    return shards[shard_id]  # List of 4 connection pools (e.g. psycopg2.pool)

def get_user(user_id):
    pool = get_shard(user_id)
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM users WHERE user_id = %s", (user_id,))
            return cur.fetchone()
    finally:
        pool.putconn(conn)  # Always return the connection to the pool
```

- **Pros**: Even distribution
- **Cons**: Range queries scan all shards; resharding required when adding nodes
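
The resharding cost of modulo-based routing can be measured directly. The sketch below counts how many keys land on a different shard when the shard count changes; the function name `fraction_moved` and the use of sequential integer ids as keys are assumptions made for illustration.

```python
def fraction_moved(num_keys: int, old_shards: int, new_shards: int) -> float:
    """Share of keys whose shard changes when the modulus changes."""
    moved = sum(1 for key in range(num_keys)
                if key % old_shards != key % new_shards)
    return moved / num_keys

print(fraction_moved(100_000, 4, 5))  # 0.8
print(fraction_moved(100_000, 4, 8))  # 0.5
```

Adding a single node to a four-shard cluster relocates 80% of the rows, while doubling the cluster relocates half of them, which is one reason shard counts are often grown by doubling.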

**2. Range Sharding (Time or ID ranges)**

```python
# Shard 0: user_id 1 - 1,000,000
# Shard 1: user_id 1,000,001 - 2,000,000
# Or time-based:
# Shard 0: orders_2023
# Shard 1: orders_2024

# Application router
def get_shard(user_id):
    if user_id <= 1000000:
        return shard_0
    elif user_id <= 2000000:
        return shard_1
    # etc.
```

- **Pros**: Efficient range queries; easy to archive old shards
- **Cons**: Hot spots (newest shard gets all writes); uneven growth
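
As the number of ranges grows, the `if`/`elif` chain in the router can be replaced by a sorted list of boundaries and a binary search. This is a minimal sketch using the same two ranges as above; `UPPER_BOUNDS`, `SHARD_NAMES`, and `get_shard_name` are hypothetical names.

```python
import bisect

# Inclusive upper bound of each shard's user_id range, in ascending order
UPPER_BOUNDS = [1_000_000, 2_000_000]
SHARD_NAMES = ["shard_0", "shard_1"]

def get_shard_name(user_id: int) -> str:
    """Find the first range whose upper bound is >= user_id."""
    if user_id < 1:
        raise ValueError(f"user_id must be positive, got {user_id}")
    index = bisect.bisect_left(UPPER_BOUNDS, user_id)
    if index == len(UPPER_BOUNDS):
        # No range covers this id yet: a new shard has to be provisioned
        raise LookupError(f"no shard covers user_id {user_id}")
    return SHARD_NAMES[index]
```

Adding a shard then means appending one boundary and one name, with no change to the routing logic.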

**3. Directory-Based Sharding (Lookup table)**

```python
# Central lookup table (small, cached); run this DDL once on the directory database
SHARD_DIRECTORY_DDL = """
CREATE TABLE shard_directory (
    tenant_id BIGINT PRIMARY KEY,
    shard_node VARCHAR(50),  -- 'shard_001', 'shard_002'
    created_at TIMESTAMPTZ
);
"""

# Application checks the directory before each query (cached in Redis/memcached)
def get_shard(tenant_id):
    shard_node = redis.get(f"shard:{tenant_id}")
    if shard_node is None:
        shard_node = query_directory_table(tenant_id)
        redis.setex(f"shard:{tenant_id}", 3600, shard_node)
    elif isinstance(shard_node, bytes):
        shard_node = shard_node.decode()  # redis-py returns bytes by default
    return connections[shard_node]
```

- **Pros**: Flexible; can move tenants between shards
- **Cons**: Directory is single point of failure; extra lookup latency
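
Moving a tenant interacts with the cached lookup: with a one-hour TTL, a cached entry can keep routing queries to the old shard unless the move also invalidates it. The sketch below models this with plain dictionaries standing in for the directory table and for Redis; `ShardDirectory` and its methods are hypothetical names.

```python
class ShardDirectory:
    """In-memory stand-in for the directory table plus its cache."""

    def __init__(self, assignments: dict):
        self._table = dict(assignments)  # plays the role of shard_directory
        self._cache = {}                 # plays the role of Redis

    def lookup(self, tenant_id):
        if tenant_id not in self._cache:
            self._cache[tenant_id] = self._table[tenant_id]
        return self._cache[tenant_id]

    def move_tenant(self, tenant_id, new_node, invalidate=True):
        self._table[tenant_id] = new_node
        if invalidate:
            # Without this step, readers keep the old node until the TTL expires
            self._cache.pop(tenant_id, None)

directory = ShardDirectory({42: "shard_001"})
directory.lookup(42)                                     # warms the cache
directory.move_tenant(42, "shard_002", invalidate=False)
print(directory.lookup(42))  # shard_001 (stale)
directory.move_tenant(42, "shard_002")
print(directory.lookup(42))  # shard_002
```

In practice the data copy, the directory update, and the invalidation have to be ordered carefully so that no write reaches the old shard after the move.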

### 36.3.3 Cross-Shard Operations (The Hard Part)

**JOINs across shards don't work** with plain application-level sharding: each PostgreSQL node sees only its own tables.

```sql
-- If users in shard_1 and orders in shard_2, this fails:
SELECT u.*, o.* 
FROM users u 
JOIN orders o ON u.user_id = o.user_id;
-- Must fetch users from shard_1, orders from shard_2, join in application
```

**Application-Level Join**:
```python
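def get_user_with_orders(user_id):
    # Route by shard key; `shard` is assumed to be a connection-like object
    # with execute() (for example, a psycopg 3 connection)
    shard = get_shard(user_id)

    # Get user from shard
    user = shard.execute("SELECT * FROM users WHERE user_id = %s", (user_id,)).fetchone()

    # Get orders (same shard if orders are co-located by user_id)
    orders = shard.execute("SELECT * FROM orders WHERE user_id = %s", (user_id,)).fetchall()

    # Manual join (assumes dict-like rows)
    return {**user, 'orders': orders}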
```
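
When users and orders really do live on different shards, the application has to join two result sets itself. One common approach is a hash join in memory, sketched below under the assumption that rows arrive as dictionaries with a `user_id` key; the function name `join_users_with_orders` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

def join_users_with_orders(users: List[dict], orders: List[dict]) -> List[dict]:
    """Hash join: group orders by user_id, then attach them to each user."""
    orders_by_user: Dict[int, list] = defaultdict(list)
    for order in orders:
        orders_by_user[order["user_id"]].append(order)
    # .get() avoids inserting empty lists for users without orders
    return [{**user, "orders": orders_by_user.get(user["user_id"], [])}
            for user in users]

users = [{"user_id": 1, "name": "Ana"}, {"user_id": 2, "name": "Bo"}]
orders = [{"order_id": 10, "user_id": 1}, {"order_id": 11, "user_id": 1}]
joined = join_users_with_orders(users, orders)
```

This behaves like a LEFT JOIN: users without orders are kept with an empty list. Both result sets must fit in application memory, which is the practical limit of the approach.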

**Global Tables (Replicated to all shards)**:
Small lookup tables (countries, currencies) replicated to every shard to avoid cross-shard joins.

**Aggregation across shards**:
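This requires a scatter-gather (map-reduce style) pattern: run the aggregate on every shard, then combine the partial results in the application.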
```python
# Count total users across all shards
total = 0
for shard in shards:
    count = shard.execute("SELECT COUNT(*) FROM users").scalar()
    total += count
```
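
Counts and sums combine by simple addition, but not every aggregate does. An average, for example, has to be rebuilt from per-shard sums and counts rather than from per-shard averages. The sketch below illustrates this with made-up partial results; `combine_averages` is a hypothetical helper.

```python
from typing import List, Tuple

def combine_averages(partials: List[Tuple[float, int]]) -> float:
    """Merge per-shard (sum, count) pairs into one global average."""
    total_sum = sum(s for s, _ in partials)
    total_count = sum(c for _, c in partials)
    if total_count == 0:
        raise ValueError("no rows on any shard")
    return total_sum / total_count

# Assume each shard ran: SELECT sum(amount), count(*) FROM orders
partials = [(10.0, 2), (100.0, 8)]
print(combine_averages(partials))  # 11.0

# Averaging the per-shard averages gives a different, wrong answer
naive = sum(s / c for s, c in partials) / len(partials)
print(naive)  # 8.75
```

The same idea applies to other aggregates: each shard returns a partial state that can be merged, and the final value is computed once in the application.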

### 36.3.4 Sharding Tools (PostgreSQL-Specific)

**1. Citus (Extension)**
- Transforms PostgreSQL into a distributed database
- Coordinator node routes queries to worker shards
- Supports distributed joins, aggregates

```sql
-- Citus setup
CREATE EXTENSION citus;

-- Add worker nodes
SELECT * FROM citus_add_node('worker-1', 5432);
SELECT * FROM citus_add_node('worker-2', 5432);

-- Distribute table
SELECT create_distributed_table('users', 'user_id');

-- Queries automatically routed; aggregates handled by coordinator
```

**2. PostgreSQL FDW (Foreign Data Wrappers)**
- Manual sharding with postgres_fdw
- Joins are possible but often slow: whatever cannot be pushed down to the remote server is pulled to the coordinator and processed there

```sql
-- On coordinator
CREATE SERVER shard_1 FOREIGN DATA WRAPPER postgres_fdw 
OPTIONS (host 'shard1.internal', dbname 'production');

CREATE USER MAPPING FOR coordinator SERVER shard_1 
OPTIONS (user 'shard_user', password 'secret');

CREATE FOREIGN TABLE users_shard_1 (...) SERVER shard_1 OPTIONS (table_name 'users');

-- Create view unioning all shards (users_shard_2 is defined like users_shard_1, against a second server)
CREATE VIEW all_users AS 
SELECT * FROM users_shard_1
UNION ALL
SELECT * FROM users_shard_2;
```

**3. Application Sharding (DIY)**
- Most common for SaaS (tenant ID-based)
- No magic, just routing logic in app

---

## 36.4 Caching Strategies

Caching reduces database load but introduces consistency challenges. Use caching for read-heavy, low-mutation data.

### 36.4.1 Cache Layers

**L1: Application Memory** (Fastest, smallest)
```python
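# LRU cache in Python (per process, no TTL: entries stay until evicted
# or until get_user_by_id.cache_clear() is called)
from functools import lru_cache

@lru_cache(maxsize=10000)
def get_user_by_id(user_id):
    # fetchone() so the cached value is a row, not a consumed cursor
    return db.execute("SELECT * FROM users WHERE id = %s", (user_id,)).fetchone()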
```
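
Because `lru_cache` never expires entries, mutable rows usually need a time bound as well. The sketch below shows one way to add a TTL to an in-process cache. `TTLCache` is a hypothetical class, it has no size limit (unlike `lru_cache`), and the injectable clock exists only to make expiry testable.

```python
import time

class TTLCache:
    """In-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]           # fresh hit
        value = loader(key)           # miss or expired: reload from the source
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        self._entries.pop(key, None)
```

A call such as `cache.get_or_load(user_id, load_user_from_db)` would replace the decorated function, where `load_user_from_db` is whatever function runs the query.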

**L2: Redis/Memcached** (Fast, network round-trip)
```python
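import json

def get_user(user_id):
    # Check cache
    cached = redis.get(f"user:{user_id}")
    if cached is not None:
        return json.loads(cached)

    # Cache miss: query DB
    row = db.execute("SELECT * FROM users WHERE id = %s", (user_id,)).fetchone()
    if row is None:
        return None  # Don't cache missing users
    user = dict(row)  # Assumes dict-like rows (e.g. a dict row factory)

    # Write to cache (TTL = 5 minutes); default=str handles dates and decimals
    redis.setex(f"user:{user_id}", 300, json.dumps(user, default=str))
    return user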
```

**L3: CDN** (Static assets, rarely changing API responses)

### 36.4.2 Cache Invalidation Strategies

**Cache-Aside (Lazy Loading)**:
- Read: Check cache → miss → read DB → write cache
- Write: Write DB → delete cache (not update, to avoid race conditions)

```python
def update_user(user_id, new_data):
    # Update database first (source of truth)
    db.execute("UPDATE users SET ... WHERE id = %s", (user_id, ...))
    
    # Invalidate cache (delete, don't update)
    redis.delete(f"user:{user_id}")
    # Next read will refresh from DB
```

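**Why Delete Instead of Update?** If writers update the cache, two concurrent writes can reach the cache in a different order than they reached the database:
```
Writer A: Updates user to v1 in DB
Writer B: Updates user to v2 in DB, updates cache to v2
Writer A: Updates cache to v1 (stale value overwrites v2)
```

With deletes, the order does not matter: the key ends up absent either way, and the next read reloads from the database. A narrow race remains (a reader that loaded the old row just before the write can repopulate the cache after the delete), which is why the TTL stays as a backstop.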
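
One interleaving where updating the cache goes wrong involves two concurrent writers whose cache steps arrive in the opposite order of their database commits. The sketch below replays that fixed interleaving for both strategies. It is a deterministic simulation with dictionaries standing in for the database and the cache, not a concurrency test, and `simulate_two_writers` is a hypothetical name.

```python
def simulate_two_writers(strategy):
    """Replay one fixed interleaving; return (db_value, cached_value)."""
    db = {"user": "v0"}
    cache = {"user": "v0"}

    def touch_cache(value):
        if strategy == "update":
            cache["user"] = value
        else:  # "delete"
            cache.pop("user", None)

    db["user"] = "v1"   # Writer A commits v1
    db["user"] = "v2"   # Writer B commits v2
    touch_cache("v2")   # Writer B reaches the cache first
    touch_cache("v1")   # Writer A's cache step arrives late
    return db["user"], cache.get("user")

print(simulate_two_writers("update"))  # ('v2', 'v1')
print(simulate_two_writers("delete"))  # ('v2', None)
```

With the update strategy the cache ends up holding `v1` while the database holds `v2`, and it stays wrong until the TTL expires. With the delete strategy the key is simply absent, so the next read reloads `v2`.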

**Write-Through**:
- Write: Update DB and cache synchronously
- Slower writes, consistent reads
- Risk: Cache write fails → inconsistency

**Write-Behind (Async)**:
- Write: Update cache immediately, queue DB write
- Fastest but risk of data loss on crash

### 36.4.3 PostgreSQL LISTEN/NOTIFY for Cache Invalidation

Use database triggers to notify cache layer of changes.

```sql
-- Function to broadcast changes
CREATE OR REPLACE FUNCTION cache_invalidation() RETURNS TRIGGER AS $$
BEGIN
    PERFORM pg_notify('cache_invalidation', 
        json_build_object(
            'table', TG_TABLE_NAME,
            'id', COALESCE(NEW.id, OLD.id),
            'operation', TG_OP
        )::text
    );
    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

-- Trigger on users table
CREATE TRIGGER user_cache_invalidation
AFTER INSERT OR UPDATE OR DELETE ON users
FOR EACH ROW EXECUTE FUNCTION cache_invalidation();
```

**Application Listener**:
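```python
# Python async listener
import asyncio
import json

import asyncpg

# Map table names to the key prefixes used when caching (keys look like "user:42")
CACHE_KEY_PREFIX = {"users": "user"}

def handle_notification(connection, pid, channel, payload):
    data = json.loads(payload)
    prefix = CACHE_KEY_PREFIX.get(data['table'], data['table'])
    cache_key = f"{prefix}:{data['id']}"
    redis.delete(cache_key)  # `redis` is the same client used for caching
    print(f"Invalidated {cache_key}")

async def cache_invalidation_listener():
    conn = await asyncpg.connect(database='production')
    await conn.add_listener('cache_invalidation', handle_notification)

    # Keep the coroutine (and the connection) alive; notifications sent while
    # the listener is disconnected are lost, so keep TTLs as a backstop
    while True:
        await asyncio.sleep(3600)
```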

### 36.4.4 Materialized Views as Caches

For complex aggregations that are expensive to compute:

```sql
-- Materialized view (snapshot)
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT 
    date_trunc('day', created_at) as sale_date,
    region,
    sum(amount) as total_sales,
    count(*) as transaction_count
FROM orders
GROUP BY 1, 2;

-- Unique index: required by REFRESH ... CONCURRENTLY
CREATE UNIQUE INDEX idx_daily_sales_date_region
    ON daily_sales_summary (sale_date, region);

-- Refresh strategy (CONCURRENTLY = readers are not blocked during refresh)
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales_summary;

-- Automation: Run every 5 minutes via pg_cron or external scheduler
```

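**Tradeoff**: Data can be stale by up to the refresh interval (about 5 minutes here, plus the refresh time), but queries read a small precomputed table instead of scanning and aggregating `orders` on every request.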

---

## 36.5 The Scaling Decision Tree

```
Is CPU/Memory saturated?
├── No → Optimize queries (indexes, EXPLAIN ANALYZE)
└── Yes → Is it read-heavy (>80% reads)?
    ├── Yes → Add read replicas (Chapter 33)
    └── No → Is it write-heavy?
        ├── Yes → Can you partition by time/tenant?
        │   ├── Yes → Partitioning (Declarative, Chapter 10)
        │   └── No → Consider sharding (Citus or DIY)
        └── No → Vertical scaling (bigger instance)
            └── Maxed out vertical?
                └── Sharding (last resort)
```
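
The tree above can be encoded as a function, which makes the branch order explicit and testable. This is a literal transcription under stated assumptions: the parameter names and the returned labels are illustrative, and `read_fraction` is the share of reads in the workload (0.0 to 1.0).

```python
def recommend_scaling_step(saturated: bool, read_fraction: float,
                           write_heavy: bool, can_partition: bool,
                           vertical_maxed: bool) -> str:
    """Walk the scaling decision tree from top to bottom."""
    if not saturated:
        return "optimize queries"
    if read_fraction > 0.80:
        return "add read replicas"
    if write_heavy:
        return "partition" if can_partition else "consider sharding"
    # Mixed workload: grow the instance until that option is exhausted
    return "sharding (last resort)" if vertical_maxed else "scale vertically"

print(recommend_scaling_step(True, 0.9, False, False, False))  # add read replicas
```

Note that query optimization is the outcome whenever the server is not saturated, regardless of the other inputs.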

**Golden Rule**: Sharding is a business decision (data residency, compliance) as much as a technical one. If you shard for performance alone, you will likely regret the operational burden.

---

## Chapter Summary

In this chapter, you learned:

1. **Vertical Scaling**: Maximize single-node performance before distributing. Prioritize NVMe storage IOPS, then RAM (for cache hit ratios >99%), then CPU cores. Modern cloud instances can handle 100K+ TPS before requiring horizontal scaling.

2. **Read Scaling**: Use streaming replicas for read-heavy workloads (80/20 read/write split). Implement application-level routing to direct queries to replicas, with lag monitoring to prevent stale reads. Use session stickiness or synchronous commit for critical read-after-write consistency.

3. **Sharding**: Only shard when single-node write throughput (roughly 50K-100K writes/sec) or storage (10+ TB with growth) is exhausted, or for data residency compliance. Use hash sharding for even distribution, range sharding for time-series archival, or directory-based routing for tenant isolation. Avoid cross-shard joins, which require application-level coordination.

4. **Caching**: Implement multi-layer caching (application LRU, Redis) using cache-aside pattern (invalidate on write, lazy load on read). Use PostgreSQL `LISTEN/NOTIFY` for cache invalidation triggers. Materialized views provide database-native caching for expensive aggregations with controlled staleness.

5. **Operational Reality**: Sharding multiplies operational complexity (backups, monitoring, schema changes) by the number of shards. Exhaust query optimization (indexes, partitioning), connection pooling, and read replicas before sharding. When sharding is unavoidable, use an extension like Citus, or a distributed SQL database that offers PostgreSQL compatibility (Google Spanner, CockroachDB, YugabyteDB), rather than DIY unless absolutely necessary.

**Next**: In Chapter 37, we will explore Local Development Workflows—covering reproducible development environments with Docker Compose, database seeding strategies, managing multiple PostgreSQL versions locally, and testing patterns that ensure production parity.

