# Chapter 31: Schema Migrations in Real Teams

Database schema migrations are among the most dangerous routine operations in application development. A failed migration can cause extended downtime, data corruption, or cascading failures across services. This chapter establishes industry-standard patterns for executing migrations safely in production environments, emphasizing zero-downtime techniques, backward compatibility, and team workflows that prevent the 3 AM "migration stuck" page.

---

## 31.1 Migration Philosophy and Constraints

### 31.1.1 The Forward-Only Mandate

**Industry Standard**: Production migrations must be **forward-only** and **idempotent**. "Down" or "rollback" migration scripts are prohibited in automated pipelines.

**Rationale**:
- Data loss is asymmetric: a down migration that drops a column permanently destroys everything written to that column
- Application code may have already written new-format data that down migrations cannot interpret
- Down migrations are rarely tested in production-like environments
- Recovery should use database backups (PITR) or compensating transactions, not schema reversal

```sql
-- ANTI-PATTERN: Down migration (DO NOT USE)
-- 0002_add_user_status.down.sql
ALTER TABLE users DROP COLUMN status;

-- PROBLEM: If any user was marked 'inactive' in production, 
-- that information is permanently lost when this runs

-- INDUSTRY STANDARD: Forward-only with compensation
-- 0003_remove_user_status.sql
-- Step 1: Deprecate (keep column, stop using in app)
-- Step 2: Later migration drops after 30-day grace period
ALTER TABLE users ALTER COLUMN status DROP DEFAULT;
-- Application ignores column, but data preserved if rollback needed
```

### 31.1.2 Idempotency and Safety Checks

Every migration must be safe to run multiple times without error.

```sql
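-- Safe pattern: Check before creating
DO $$
BEGIN
    IF NOT EXISTS (
        SELECT 1 FROM information_schema.columns 
        WHERE table_schema = current_schema()  -- avoid matching a same-named table in another schema
        AND table_name = 'users' 
        AND column_name = 'status'
    ) THEN
        ALTER TABLE users ADD COLUMN status VARCHAR(20) DEFAULT 'active';
    END IF;
END $$;

-- Alternative: PostgreSQL-native IF NOT EXISTS
ALTER TABLE users ADD COLUMN IF NOT EXISTS status VARCHAR(20) DEFAULT 'active';  -- PostgreSQL 9.6+
CREATE INDEX IF NOT EXISTS idx_users_email ON users(email);                      -- PostgreSQL 9.5+
CREATE TABLE IF NOT EXISTS audit_log (...);                                      -- PostgreSQL 9.1+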
```

**Migration Metadata Table** (Framework-agnostic):

```sql
-- Track applied migrations
CREATE TABLE IF NOT EXISTS schema_migrations (
    version VARCHAR(255) PRIMARY KEY,
    applied_at TIMESTAMPTZ DEFAULT NOW(),
    checksum VARCHAR(64),        -- SHA-256 of file contents
    execution_time_ms INTEGER,
    applied_by VARCHAR(100)      -- Who/what ran it
);

-- Check before execution
SELECT 1 FROM schema_migrations WHERE version = '20241002143000_add_user_status';
-- If row exists, skip (already applied)
```
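The `checksum` column is only useful if something compares it. As a minimal sketch, assuming migrations are available as a mapping from version to SQL text, the following computes the SHA-256 digest and separates pending migrations from ones whose file changed after being applied. The names `checksum` and `plan` are hypothetical helpers for illustration, not part of any framework.

```python
import hashlib

def checksum(sql_text: str) -> str:
    """SHA-256 hex digest of a migration's contents (64 hex chars, fits VARCHAR(64))."""
    return hashlib.sha256(sql_text.encode("utf-8")).hexdigest()

def plan(files: dict[str, str], applied: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (pending, drifted).

    files:   version -> SQL text found on disk
    applied: version -> checksum recorded in schema_migrations
    """
    pending = sorted(v for v in files if v not in applied)
    drifted = sorted(
        v for v in files if v in applied and checksum(files[v]) != applied[v]
    )
    return pending, drifted

files = {
    "20241002143000_add_user_status": "ALTER TABLE users ADD COLUMN status VARCHAR(20);",
    "20241002144500_add_user_indexes": "CREATE INDEX CONCURRENTLY idx_users_email ON users(email);",
}
# Simulate a file that was edited after it had been applied
applied = {"20241002143000_add_user_status": checksum("-- original contents")}

pending, drifted = plan(files, applied)
print(pending)  # ['20241002144500_add_user_indexes']
print(drifted)  # ['20241002143000_add_user_status']
```

A runner built this way would typically refuse to proceed while `drifted` is non-empty, since an edited, already-applied file means the recorded history no longer matches the repository.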

---

## 31.2 Migration Categories and Risk Profiles

### 31.2.1 Metadata-Only Changes (Low Risk)

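Operations that update only catalog entries and never touch heap data. They still take a brief `ACCESS EXCLUSIVE` lock, so pair them with `lock_timeout` (section 31.7.2).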

```sql
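-- No table rewrite; the lock is held only for the catalog update
ALTER TABLE users RENAME COLUMN email TO email_address;  -- Fast, but breaks code still using the old name (see 31.3.1)
ALTER TABLE users ALTER COLUMN status SET DEFAULT 'pending';
ALTER TYPE user_status ADD VALUE 'suspended';  -- Appended at the end by default; cannot run inside a transaction block before PG 12

-- Not metadata-only, but non-blocking: concurrent index builds (see 31.3.3)
CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders(created_at);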
```

### 31.2.2 Locking Changes (High Risk)

Operations that acquire `ACCESS EXCLUSIVE` locks and rewrite tables.

```sql
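-- DANGEROUS: Holds ACCESS EXCLUSIVE for the duration of a full table rewrite
ALTER TABLE users ADD COLUMN nickname TEXT DEFAULT '';    -- Rewrites table (PG < 11)
ALTER TABLE users ALTER COLUMN login_count TYPE BIGINT;   -- Rewrites table (e.g. INTEGER -> BIGINT)

-- NOT rewrites, although each still takes a brief ACCESS EXCLUSIVE lock:
ALTER TABLE users ADD COLUMN bio TEXT;                    -- Nullable, no default: metadata-only in all versions
ALTER TABLE users ALTER COLUMN email TYPE VARCHAR(300);   -- Widening VARCHAR: no rewrite (PG 9.2+)
ALTER TABLE users DROP COLUMN phone;                      -- Marks column as dropped; space is reclaimed as rows are rewritten later

-- PostgreSQL 11+ optimization:
-- ADD COLUMN with a non-volatile default is metadata-only (fast)
ALTER TABLE users ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT NOW();
-- A volatile default (e.g. clock_timestamp(), random()) still rewrites the table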
```

**Mitigation Strategy**: When a change would force a rewrite, add a new nullable column and populate it in the background (section 31.4), or rebuild with a tool such as `pg_repack`, instead of running a blocking `ALTER`.

### 31.2.3 Constraint Changes (Blocking Risk)

```sql
-- BLOCKING: Validates all rows immediately while holding SHARE ROW EXCLUSIVE, which blocks writes on both tables
ALTER TABLE orders ADD CONSTRAINT fk_orders_user 
    FOREIGN KEY (user_id) REFERENCES users(user_id);

-- NON-BLOCKING: Add constraint as NOT VALID, then validate separately
-- Step 1: Create constraint without validation (fast, minimal locking)
ALTER TABLE orders ADD CONSTRAINT fk_orders_user 
    FOREIGN KEY (user_id) REFERENCES users(user_id) 
    NOT VALID;

-- Step 2: Validate in background (acquires SHARE UPDATE EXCLUSIVE, not ACCESS EXCLUSIVE)
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_user;
-- This can take hours on large tables but doesn't block reads/writes
```
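Rules like these are easy to forget under deadline pressure, so some teams check migration files automatically before merge. The following is a rough sketch of such a check, assuming a simple regex scan is acceptable. The rule names and the `lint` function are hypothetical, and the comment stripping is naive (it would also strip `--` inside a string literal).

```python
import re

# Hypothetical rule set: (rule name, regex matching the risky form)
RULES = [
    ("index-without-concurrently",
     re.compile(r"\bCREATE\s+(?:UNIQUE\s+)?INDEX\s+(?!CONCURRENTLY\b)\S", re.I)),
    ("fk-without-not-valid",
     # FOREIGN KEY with no NOT VALID before the statement's closing semicolon
     re.compile(r"\bFOREIGN\s+KEY\b(?![^;]*\bNOT\s+VALID\b)", re.I)),
]

def strip_comments(sql: str) -> str:
    """Remove `-- ...` line comments so commented-out SQL is not flagged."""
    return re.sub(r"--[^\n]*", "", sql)

def lint(sql: str) -> list[str]:
    """Return the names of the rules violated anywhere in `sql`."""
    body = strip_comments(sql)
    return [name for name, pattern in RULES if pattern.search(body)]

print(lint("CREATE INDEX idx_orders_total ON orders(total);"))
# ['index-without-concurrently']
print(lint("ALTER TABLE orders ADD CONSTRAINT fk_orders_user "
           "FOREIGN KEY (user_id) REFERENCES users(user_id) NOT VALID;"))
# []
```

A regex scan cannot understand SQL structure, so treat a clean result as a hint rather than proof that a migration is non-blocking.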

---

## 31.3 Zero-Downtime Migration Patterns

### 31.3.1 The Expand-Contract Pattern

The gold standard for structural changes without downtime or backward incompatibility.

**Scenario**: Rename column `email` to `email_address`.

```sql
-- Phase 1: EXPAND (Deploy 1)
-- Add new column, keep old column
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);

-- Backfill (one statement shown for brevity; on large tables run it in batches, see section 31.4)
UPDATE users SET email_address = email WHERE email_address IS NULL;

-- Application writes to BOTH columns (dual write)
-- Application reads from OLD column (backward compatible)
```

```python
# Application code (Deploy 1)
class User:
    def save(self):
        # Dual write
        db.execute("""
            UPDATE users 
            SET email = %s, email_address = %s 
            WHERE user_id = %s
        """, (self.email, self.email, self.id))
    
    @property
    def get_email(self):
        return self.email  # Read from old column
```

```sql
-- Phase 2: MIGRATE (Deploy 2)
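-- Backfill complete, switch reads to new column
-- Add trigger to keep old column synced (for rollback safety)
CREATE OR REPLACE FUNCTION sync_email() RETURNS TRIGGER AS $$
BEGIN
    NEW.email := NEW.email_address;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Fire on INSERT as well, so rows created by Deploy 2 also populate the old column.
-- A WHEN clause that references OLD is not allowed on INSERT triggers, so it is omitted.
-- EXECUTE FUNCTION requires PostgreSQL 11+ (use EXECUTE PROCEDURE on older versions).
CREATE TRIGGER trg_sync_email 
    BEFORE INSERT OR UPDATE ON users 
    FOR EACH ROW 
    EXECUTE FUNCTION sync_email();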
```

```python
# Application code (Deploy 2)
class User:
    def save(self):
        db.execute("""
            UPDATE users 
            SET email_address = %s 
            WHERE user_id = %s
        """, (self.email, self.id))
        # Trigger keeps old column updated
    
    @property
    def get_email(self):
        return self.email_address  # Read from new column
```

```sql
-- Phase 3: CONTRACT (Deploy 3, days later)
-- Drop old column after grace period
DROP TRIGGER trg_sync_email ON users;
DROP FUNCTION sync_email();
ALTER TABLE users DROP COLUMN email;
```
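The three phases work because, at every step, the table still contains every column that either the current or the immediately preceding application version touches. A small model can make that invariant explicit. This is an illustrative sketch only: `Step` and `broken_steps` are hypothetical names, and the model tracks column presence, not data correctness.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    name: str
    app_columns: frozenset     # columns the application reads or writes
    schema_columns: frozenset  # columns that exist in the table

OLD, NEW = "email", "email_address"

EXPAND_CONTRACT = [
    Step("before",   frozenset({OLD}),      frozenset({OLD})),
    Step("expand",   frozenset({OLD, NEW}), frozenset({OLD, NEW})),
    Step("migrate",  frozenset({NEW}),      frozenset({OLD, NEW})),
    Step("contract", frozenset({NEW}),      frozenset({NEW})),
]

DIRECT_RENAME = [
    Step("before",  frozenset({OLD}), frozenset({OLD})),
    Step("renamed", frozenset({NEW}), frozenset({NEW})),
]

def broken_steps(steps):
    """Names of steps whose schema lacks a column needed by the current
    or the immediately preceding application version."""
    broken = []
    for prev, cur in zip(steps, steps[1:]):
        needed = prev.app_columns | cur.app_columns
        if not (needed <= cur.schema_columns):
            broken.append(cur.name)
    return broken

print(broken_steps(EXPAND_CONTRACT))  # []
print(broken_steps(DIRECT_RENAME))    # ['renamed']
```

The direct rename fails the check because instances of the old application are still running when the column disappears. That window is exactly what the dual-write phase closes.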

### 31.3.2 Adding Constraints Without Downtime

**Adding a NOT NULL Constraint**:

```sql
-- ANTI-PATTERN: Blocks table while checking millions of rows
ALTER TABLE users ALTER COLUMN email SET NOT NULL;

-- SAFE PATTERN:
-- Step 1: Add CHECK constraint as NOT VALID (no scan)
ALTER TABLE users ADD CONSTRAINT chk_email_not_null 
    CHECK (email IS NOT NULL) NOT VALID;

-- Step 2: Backfill any NULLs in batches (if migration missed any)
UPDATE users SET email = 'legacy@example.com' WHERE email IS NULL;

-- Step 3: Validate constraint (scan table, but doesn't block writes)
ALTER TABLE users VALIDATE CONSTRAINT chk_email_not_null;

-- Step 4: Make it a proper NOT NULL (metadata only, PostgreSQL 12+)
-- Or keep as CHECK constraint (functionally equivalent)
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT chk_email_not_null;  -- Optional cleanup
```

### 31.3.3 Index Creation Strategies

**Concurrent Index Creation** (Essential for production):

```sql
-- BLOCKING: Locks table for entire index build
CREATE INDEX idx_orders_total ON orders(total);

-- NON-BLOCKING: Allows reads/writes during build
CREATE INDEX CONCURRENTLY idx_orders_total ON orders(total);

-- Caveats:
-- 1. Takes noticeably longer than a plain build (two table scans, plus waits for in-flight transactions)
-- 2. Cannot run inside transaction block
-- 3. If build fails (e.g., unique constraint violation), leaves "invalid" index
-- 4. Must be run outside standard transactional migration frameworks
```

**Handling Invalid Indexes**:

```sql
-- Check for invalid indexes (failed concurrent builds)
SELECT c.relname AS index_name,
       pg_get_indexdef(i.indexrelid) AS indexdef
FROM pg_index i
JOIN pg_class c ON i.indexrelid = c.oid
WHERE NOT i.indisvalid;

-- Drop and retry
DROP INDEX CONCURRENTLY idx_orders_total;
CREATE INDEX CONCURRENTLY idx_orders_total ON orders(total);
```

---

## 31.4 Long-Running Migrations (Batch Processing)

### 31.4.1 The Problem with Single-Transaction Updates

```sql
-- ANTI-PATTERN: Locks rows for hours, fills WAL, blocks vacuum
UPDATE users SET status = 'migrated' WHERE legacy_flag = true;
-- Locks all matching rows until commit
-- Generates massive WAL traffic (replication lag)
-- Autovacuum can't clean dead tuples while transaction runs
```

### 31.4.2 Batch Update Pattern

```python
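# Python implementation of batched migration
import os
import time

import psycopg2

# Connection string is read from the environment (variable name is an assumption)
DATABASE_URL = os.environ["DATABASE_URL"]

def migrate_in_batches(batch_size=1000, sleep_interval=0.1):
    conn = psycopg2.connect(DATABASE_URL)
    conn.autocommit = False
    try:
        while True:
            with conn.cursor() as cur:
                # Update one small batch per transaction
                cur.execute("""
                    UPDATE users 
                    SET status = 'migrated'
                    WHERE user_id IN (
                        SELECT user_id 
                        FROM users 
                        WHERE legacy_flag = true 
                        AND status IS DISTINCT FROM 'migrated'  -- also matches NULL status
                        LIMIT %s
                        FOR UPDATE SKIP LOCKED  -- Critical: skip locked rows
                    )
                    RETURNING user_id;
                """, (batch_size,))
                updated = cur.rowcount
            conn.commit()  # Release row locks after every batch

            # 0 can also mean every remaining row was locked by another
            # session, so re-run the job later to pick up stragglers
            if updated == 0:
                break

            print(f"Migrated {updated} rows")
            time.sleep(sleep_interval)  # Allow other queries between batches
    finally:
        conn.close()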
```

**Key Features**:
- `LIMIT` controls batch size
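- `FOR UPDATE SKIP LOCKED` keeps the batch from waiting on rows that concurrent transactions have locked; skipped rows are picked up by a later batch
- Frequent `COMMIT` releases locks and allows vacuum to proceed
- Sleep interval leaves CPU and I/O headroom for production traffic and gives replicas time to catch up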
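The control flow of the loop can be separated from the SQL so that it is testable without a database. The sketch below assumes a caller-supplied `do_batch` callable that runs one batched `UPDATE`, commits, and returns the number of rows changed. `run_batches` and `max_batches` are hypothetical names introduced here.

```python
import time
from typing import Callable

def run_batches(
    do_batch: Callable[[], int],
    sleep_interval: float = 0.1,
    max_batches: int = 1_000_000,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call do_batch() until it reports 0 rows; return total rows processed.

    max_batches guards against a predicate that never shrinks, for example
    an UPDATE that does not change the column its WHERE clause tests.
    """
    total = 0
    for _ in range(max_batches):
        updated = do_batch()
        if updated == 0:
            return total
        total += updated
        sleep(sleep_interval)
    raise RuntimeError(f"gave up after {max_batches} batches ({total} rows)")

# Stand-in for a real batch: pretend 2500 rows need migrating
remaining = [2500]

def fake_batch(batch_size: int = 1000) -> int:
    n = min(batch_size, remaining[0])
    remaining[0] -= n
    return n

print(run_batches(fake_batch, sleep=lambda seconds: None))  # 2500
```

Injecting `sleep` keeps tests fast, and the batch cap turns a silent infinite loop into a visible failure.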

### 31.4.3 Background Migration Jobs

For very large tables (billions of rows), use background job processors instead of DDL migrations.

```sql
-- Migration table tracks progress
CREATE TABLE migration_jobs (
    job_id SERIAL PRIMARY KEY,
    table_name VARCHAR(100),
    status VARCHAR(20),  -- pending, running, completed, failed
    last_processed_id BIGINT DEFAULT 0,
    batch_size INTEGER DEFAULT 1000,
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ
);

-- Application worker processes chunks
WITH batch AS (
    SELECT user_id 
    FROM users 
    WHERE user_id > (SELECT last_processed_id FROM migration_jobs WHERE job_id = 1)
    ORDER BY user_id
    LIMIT 1000
),
updated AS (
    UPDATE users 
    SET new_column = calculate_value(old_column)
    WHERE user_id IN (SELECT user_id FROM batch)
    RETURNING user_id
)
UPDATE migration_jobs 
SET last_processed_id = COALESCE((SELECT MAX(user_id) FROM updated), last_processed_id)  -- keep the old value when the batch is empty
WHERE job_id = 1;
```
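A worker can also walk the key space in fixed-width ID ranges rather than using `LIMIT`. This is a variant of the query above, not the same technique. The sketch below assumes integer keys and a known maximum ID. `id_ranges` is a hypothetical helper, and each pair it yields is meant for a predicate of the form `user_id > lower AND user_id <= upper`.

```python
from typing import Iterator

def id_ranges(last_processed_id: int, max_id: int, batch_size: int) -> Iterator[tuple[int, int]]:
    """Yield (lower, upper) bounds covering (last_processed_id, max_id]."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    lower = last_processed_id
    while lower < max_id:
        upper = min(lower + batch_size, max_id)
        yield lower, upper
        lower = upper

print(list(id_ranges(0, 2500, 1000)))
# [(0, 1000), (1000, 2000), (2000, 2500)]

# Resuming after a crash: start from the checkpoint stored in migration_jobs
print(list(id_ranges(2000, 2500, 1000)))
# [(2000, 2500)]
```

Range-based chunks are cheap to compute and resume, but when IDs are sparse the number of rows per chunk varies. The `LIMIT`-based query gives evenly sized batches at the cost of an ordered scan per batch.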

---

## 31.5 Version Compatibility and Deployment Coordination

### 31.5.1 N-1 Compatibility Rule

**Industry Standard**: Database schema must be compatible with both the current application version (N) and the previous version (N-1) during deployment.

**Deployment Sequence**:
1. Deploy migration (schema N+1 compatible with app N)
2. Deploy application code (N+1)
3. Verify health
4. Deploy migration (schema N+2, cleanup of N compatibility)

```sql
-- Example: Removing a column safely
-- App N uses column 'legacy_data'
-- App N+1 ignores column 'legacy_data'

-- Deploy 1: App N is running
-- Schema must have legacy_data (App N needs it)

-- Deploy 2: Migration (App N still running)
-- Stop writing to legacy_data (App N+1 ready)
-- But don't drop column yet (App N might read it)
ALTER TABLE events ALTER COLUMN legacy_data DROP DEFAULT;

-- Deploy 3: App N+1 deployed
-- App no longer uses legacy_data

-- Deploy 4: (One day later) Migration to drop column
-- Now safe to drop because App N is gone
ALTER TABLE events DROP COLUMN legacy_data;
```

### 31.5.2 Feature Flags for Schema Changes

Coordinate schema changes with application behavior using feature flags.

```sql
-- Add column but don't enforce constraints yet
ALTER TABLE orders ADD COLUMN priority_score INTEGER;
```

```python
# Application checks feature flag before using new column
if feature_flags.get('use_priority_scoring'):
    # NULLS LAST keeps not-yet-scored rows from sorting to the top
    query = "SELECT * FROM orders ORDER BY priority_score DESC NULLS LAST"
else:
    query = "SELECT * FROM orders ORDER BY created_at DESC"
```

---

## 31.6 Migration Tooling Patterns

### 31.6.1 Migration File Naming Conventions

```text
migrations/
├── 20241002143000_create_users_table.sql
├── 20241002144500_add_user_indexes.sql
└── 20241002150000_populate_user_defaults.sql
```

**Naming**: `YYYYMMDDhhmmss_descriptive_name.sql`
- Timestamp ensures ordering
- Description aids debugging
- No version numbers (conflict-prone in teams)
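A naming convention holds up better when it is checked mechanically. As a sketch, the following validates filenames against the pattern above and orders them by timestamp. Restricting the description to lowercase `snake_case` is an assumption made here, and `parse_name` and `ordered` are hypothetical helper names.

```python
import re
from datetime import datetime

# 14-digit timestamp, underscore, snake_case description, .sql extension
NAME_RE = re.compile(r"^(\d{14})_([a-z0-9]+(?:_[a-z0-9]+)*)\.sql$")

def parse_name(filename: str) -> tuple[datetime, str]:
    """Split a migration filename into (timestamp, description) or raise ValueError."""
    match = NAME_RE.match(filename)
    if not match:
        raise ValueError(f"bad migration name: {filename!r}")
    # strptime also rejects impossible timestamps such as month 13
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S"), match.group(2)

def ordered(filenames: list[str]) -> list[str]:
    """Sort by timestamp; refuse duplicates, which would make the order ambiguous."""
    parsed = sorted((parse_name(name)[0], name) for name in filenames)
    stamps = [stamp for stamp, _ in parsed]
    if len(set(stamps)) != len(stamps):
        raise ValueError("two migrations share a timestamp")
    return [name for _, name in parsed]

files = [
    "20241002150000_populate_user_defaults.sql",
    "20241002143000_create_users_table.sql",
    "20241002144500_add_user_indexes.sql",
]
print(ordered(files)[0])  # 20241002143000_create_users_table.sql
```

Timestamps reduce merge conflicts but do not eliminate ordering problems: two branches can still produce migrations that only work in one order, which is why the duplicate check fails loudly instead of picking a winner.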

### 31.6.2 Transaction Control

```sql
-- migrations/20241002143000_add_constraint.sql

-- Wrap related changes in transaction
BEGIN;

-- Step 1: Add column
ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT false;

-- Step 2: Create index concurrently (CANNOT be in transaction)
-- This must be separate migration file run outside transaction

COMMIT;

-- migrations/20241002143500_add_index_concurrently.sql
-- NO BEGIN/COMMIT - run as single statement
CREATE INDEX CONCURRENTLY idx_users_email_verified ON users(email_verified);
```
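A migration runner has to decide, per file, whether to wrap execution in a transaction. One possible heuristic, sketched below, is to look for `INDEX CONCURRENTLY` after stripping comments and to reject files that mix it with an explicit `BEGIN`. `run_mode` is a hypothetical helper and the scan is regex-based, so it is an approximation.

```python
import re

def strip_comments(sql: str) -> str:
    """Remove `-- ...` line comments (naive: ignores string literals)."""
    return re.sub(r"--[^\n]*", "", sql)

def run_mode(sql: str) -> str:
    """Return 'autocommit' or 'transaction' for a migration file's contents."""
    body = strip_comments(sql)
    concurrent = re.search(r"\bINDEX\s+CONCURRENTLY\b", body, re.I)
    explicit_txn = re.search(r"^\s*BEGIN\s*;", body, re.I | re.M)
    if concurrent and explicit_txn:
        raise ValueError("CONCURRENTLY cannot run inside BEGIN/COMMIT; split the file")
    return "autocommit" if concurrent else "transaction"

add_column = """
BEGIN;
ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT false;
-- Step 2: Create index concurrently (CANNOT be in transaction)
COMMIT;
"""
add_index = "CREATE INDEX CONCURRENTLY idx_users_email_verified ON users(email_verified);"

print(run_mode(add_column))  # transaction
print(run_mode(add_index))   # autocommit
```

With psycopg2, the `autocommit` case corresponds to setting `conn.autocommit = True` before executing the statement. Stripping comments first matters here, because the first file mentions "concurrently" only in a comment.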

### 31.6.3 Pre-Deployment Validation

```sql
-- Add to migration preamble
DO $$
DECLARE
    row_count BIGINT;
BEGIN
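    -- Safety check: Ensure table size is expected
    -- Use the planner's row estimate; COUNT(*) would itself scan the whole table
    SELECT reltuples::BIGINT INTO row_count FROM pg_class WHERE oid = 'users'::regclass;
    
    IF row_count > 10000000 THEN
        RAISE EXCEPTION 'Table too large for this migration. Use background job instead.';
    END IF;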
    
    -- Check for conflicting data before adding constraint
    IF EXISTS (SELECT 1 FROM users WHERE email IS NULL) THEN
        RAISE EXCEPTION 'NULL emails exist. Clean data before adding NOT NULL constraint.';
    END IF;
END $$;
```

---

## 31.7 Safety Mechanisms and Checklists

### 31.7.1 The Migration Safety Checklist

Before running any migration in production:

- [ ] **Tested** on production-sized dataset in staging
- [ ] **Timing**: Run during lowest traffic window (monitor with `pg_stat_activity`)
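- [ ] **Locks**: Verify no long-running transactions, including sessions that are idle in transaction (`SELECT pid, state, xact_start FROM pg_stat_activity WHERE xact_start < NOW() - INTERVAL '5 minutes'`)
- [ ] **Disk Space**: For a rewrite, ensure free disk space of at least the table's total size including indexes (`SELECT pg_size_pretty(pg_total_relation_size('users'))`), since the old and new copies coexist until commit
- [ ] **Replication**: Check that replication lag is under 5 minutes (`SELECT client_addr, replay_lag FROM pg_stat_replication`, PostgreSQL 10+)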
- [ ] **Backup**: Fresh base backup completed within last 24 hours
- [ ] **Rollback Plan**: PITR target time documented, not "down migration"
- [ ] **Monitoring**: Alerting configured for lock waits > 30 seconds

### 31.7.2 Lock Timeout and Statement Timeout

Prevent runaway migrations from hanging indefinitely:

```sql
-- At start of migration session
SET lock_timeout = '5s';        -- Fail fast if can't acquire lock quickly
SET statement_timeout = '1h';   -- Kill migration if it runs too long
SET idle_in_transaction_session_timeout = '10min';
```
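A short `lock_timeout` makes DDL fail fast, which is only useful if the migration is retried. The sketch below shows one way to retry with exponential backoff. `LockTimeout` is a stand-in exception defined here for illustration; with psycopg2 the corresponding error is expected to be `psycopg2.errors.LockNotAvailable` (SQLSTATE 55P03), which is an assumption to verify against your driver version. `with_lock_retries` is a hypothetical helper name.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class LockTimeout(Exception):
    """Stand-in for the driver's lock-timeout error (SQLSTATE 55P03)."""

def with_lock_retries(
    attempt: Callable[[], T],
    retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run attempt(); on LockTimeout wait base_delay * 2**n, then try again."""
    for n in range(retries):
        try:
            return attempt()
        except LockTimeout:
            if n == retries - 1:
                raise  # out of attempts: surface the error to the operator
            sleep(base_delay * 2 ** n)
    raise ValueError("retries must be at least 1")

# Simulated migration that fails twice before acquiring its lock
calls = []

def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise LockTimeout("canceling statement due to lock timeout")
    return "applied"

delays = []
print(with_lock_retries(flaky, sleep=delays.append))  # applied
print(delays)                                         # [1.0, 2.0]
```

Each failed attempt releases its place in the lock queue, so queries that had queued behind the waiting `ALTER` can proceed before the next try.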

### 31.7.3 Emergency Cancellation

If a migration is blocking production:

```bash
# Find the migration process (excluding this monitoring query itself)
psql -c "SELECT pid, query, state, now() - query_start AS duration 
         FROM pg_stat_activity 
         WHERE query ILIKE '%ALTER TABLE%' 
         AND state = 'active'
         AND pid <> pg_backend_pid();"

# Cancel gracefully (SIGINT); replace 12345 with the pid found above
psql -c "SELECT pg_cancel_backend(12345);"

# If unresponsive, terminate (SIGTERM)
psql -c "SELECT pg_terminate_backend(12345);"
```

---

## 31.8 Handling Migration Failures

### 31.8.1 Failed Partial Migrations

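When a migration fails midway (network blip, constraint violation), first establish what was applied. PostgreSQL DDL is transactional, so a migration wrapped in a single transaction rolls back cleanly; partial application happens with multi-transaction files or `CONCURRENTLY` steps: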

```sql
-- Check what actually got applied
SELECT * FROM schema_migrations WHERE version = '20241002143000_add_user_status';
SELECT column_name FROM information_schema.columns WHERE table_name = 'users';

-- If partially applied, determine if safe to rerun (idempotent) or needs manual cleanup
-- Option 1: Rerun (if idempotent)
-- Option 2: Manual cleanup + mark as applied (if cleanup is safer than rerunning)
INSERT INTO schema_migrations (version, applied_at) 
VALUES ('20241002143000_add_user_status', NOW());
-- Then apply corrected migration as new version
```

### 31.8.2 Hotfix Migrations (Out-of-Band)

Emergency fixes that skip the queue:

```sql
-- Naming: prefix with timestamp + 'hotfix'
-- 20241002150000_hotfix_add_missing_index.sql

-- Must be backward compatible with pending migrations
-- If other migrations added columns this uses, ensure they run first or check existence
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_emergency_lookup 
ON users(status, created_at) 
WHERE deleted_at IS NULL;
```

---

## Chapter Summary

In this chapter, you learned:

1. **Forward-Only Philosophy**: Never use "down" migrations in production. Recover from bad migrations using PITR or compensating transactions. Schema changes must be additive and backward compatible during the transition window.

2. **Risk Classification**: Distinguish between metadata-only changes (fast, but still briefly locking), blocking changes (use `CONCURRENTLY` or `NOT VALID` patterns), and table rewrites (use background population or `pg_repack`). Prefer adding constraints as `NOT VALID` and running `VALIDATE CONSTRAINT` separately over direct constraint addition.

3. **Expand-Contract Pattern**: For structural changes (column renames, type changes), use three-phase deployment: (1) Add new structure and dual-write, (2) Migrate reads to new structure, (3) Remove old structure after grace period. This maintains N-1 compatibility.

4. **Batch Processing**: Never update millions of rows in a single transaction. Use batched updates with `LIMIT` and `FOR UPDATE SKIP LOCKED`, committing frequently to release locks and allow vacuum to proceed. For billion-row tables, use background job tables instead of DDL migrations.

5. **Concurrent Indexing**: Always use `CREATE INDEX CONCURRENTLY` in production. Never run it inside a transaction block. Monitor for invalid indexes and rebuild them. Expect longer build times in exchange for not blocking reads or writes.

6. **Safety Mechanisms**: Use `lock_timeout` to prevent indefinite waiting, validate preconditions in `DO $$` blocks, check table sizes before rewrites, and maintain N-1 version compatibility during deployment windows. Run migrations during low-traffic periods with replication lag monitoring.

**Next**: In Chapter 32, we will explore Upgrades and Major Version Changes—covering planning strategies for PostgreSQL major version upgrades, the `pg_upgrade` utility with its copy vs. link modes, logical replication migration strategies, and extension compatibility management.

