# Chapter 30: Point-in-Time Recovery (PITR) and WAL

Point-in-Time Recovery (PITR) turns PostgreSQL backups from periodic snapshots into continuous data protection. By combining a physical base backup with an unbroken archive of Write-Ahead Log (WAL) files, you can recover a database to any point between the end of that base backup and the last archived WAL record, whether that point is a timestamp, a specific transaction, an LSN, or a named restore point. This chapter covers the architectural foundations of WAL, the mechanics of continuous archiving, and the operational procedures for achieving granular recovery objectives.

---

## 30.1 WAL Architecture and MVCC

### 30.1.1 The Write-Ahead Log Foundation

PostgreSQL's durability guarantee rests on the WAL (Write-Ahead Log), a sequential record of all changes made to the database's data files. Before any data page is written to disk, the corresponding WAL record describing that change must be durably stored.

**Core Principles**:
1. **Sequential Write Performance**: WAL is append-only, avoiding random I/O penalties of data file updates
2. **Crash Recovery**: On startup, PostgreSQL replays WAL from the last checkpoint to bring data files to a consistent state
3. **Replication**: Streaming replication transmits WAL records to standby servers
4. **PITR**: Archived WAL files extend recovery beyond the last checkpoint to any historical point

```sql
-- Check WAL configuration and current position
SELECT 
    current_setting('wal_level') as wal_level,
    current_setting('archive_mode') as archive_mode,
    pg_current_wal_lsn() as current_lsn,
    pg_walfile_name(pg_current_wal_lsn()) as current_wal_file,
    pg_walfile_name_offset(pg_current_wal_lsn()) as file_offset;
```

**WAL File Format**:
- Default size: 16MB (chosen at cluster creation with `initdb --wal-segsize`, PG11+; the read-only parameter `wal_segment_size` reports the value in use)
- Naming convention: `TTTTTTTTXXXXXXXXYYYYYYYY` (timeline, logical ID, segment)
- Location: `pg_wal/` (formerly `pg_xlog/` in PG 9.6 and earlier)
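
The naming convention can be made concrete by deriving a segment name from a timeline ID and an LSN. The following is a minimal sketch in portable shell; the function name `wal_segment_name` is hypothetical, and it assumes the default 16MB segment size unless another size is passed. On a running server, `pg_walfile_name()` remains the authoritative source.

```bash
# wal_segment_name TIMELINE LSN [SEGMENT_SIZE_BYTES]
# Hypothetical helper: prints the 24-character WAL segment file name.
wal_segment_name() {
    local tli="$1" lsn="$2" seg_size="${3:-16777216}"
    # An LSN is a 64-bit byte position printed as two 32-bit hex halves: HIGH/LOW.
    local hi="${lsn%%/*}" lo="${lsn##*/}"
    local pos=$(( (0x$hi << 32) + 0x$lo ))
    # Segments are numbered consecutively; the name splits that number
    # into a "logical ID" part and a "segment" part.
    local segno=$(( pos / seg_size ))
    local segs_per_id=$(( 4294967296 / seg_size ))
    printf '%08X%08X%08X\n' "$tli" "$(( segno / segs_per_id ))" "$(( segno % segs_per_id ))"
}

wal_segment_name 1 1/AB123456   # prints 0000000100000001000000AB
```

With 16MB segments there are 256 segments per logical ID, which is why the last field never exceeds `000000FF` on a default cluster.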

### 30.1.2 WAL Generation and Checkpointing

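WAL segments are written continuously. Once a checkpoint completes, segments older than its redo point are no longer needed for crash recovery, so PostgreSQL recycles them (renames them for reuse) or removes them, provided they have been archived and no replication slot or `wal_keep_size` setting still requires them.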

```sql
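-- Current size of pg_wal (sample pg_current_wal_lsn() over time to derive a generation rate)
SELECT 
    pg_size_pretty(sum(size)) as total_wal_size,
    count(*) as num_files
FROM pg_ls_waldir();

-- Checkpoint statistics (PG16 and earlier; PG17 moved these counters to pg_stat_checkpointer)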
SELECT 
    checkpoints_timed,
    checkpoints_req,
    checkpoint_write_time,
    checkpoint_sync_time
FROM pg_stat_bgwriter;

-- Force a checkpoint (maintenance only, not for normal operations)
CHECKPOINT;
```

**Key Parameters** (postgresql.conf):

```ini
# WAL retention for replication (minimum)
wal_keep_size = 1GB          # Retain at least 1GB of WAL files (PG13+)
                             # Previously wal_keep_segments in PG12

# Checkpoint frequency (balance between recovery time and I/O)
checkpoint_timeout = 5min    # Maximum time between checkpoints
max_wal_size = 4GB           # Target WAL size before forced checkpoint
min_wal_size = 1GB           # Minimum WAL to retain for recycling
checkpoint_completion_target = 0.9  # Spread writes over 90% of interval
```

### 30.1.3 WAL Inspection (Debugging)

Use `pg_waldump` to inspect WAL contents for forensic analysis or debugging replication issues.

```bash
# Read a specific WAL file
pg_waldump /var/lib/postgresql/data/pg_wal/0000000100000001000000AB

# Filter by specific resource manager (e.g., heap operations)
pg_waldump --rmgr=Heap /var/lib/postgresql/data/pg_wal/0000000100000001000000AB

# Statistics only
pg_waldump --stats /var/lib/postgresql/data/pg_wal/0000000100000001000000AB
```

---

## 30.2 WAL Archiving Configuration

### 30.2.1 Archive Mode Fundamentals

To enable PITR, PostgreSQL must copy completed WAL segments to archival storage before recycling them. This is controlled by `archive_mode` and `archive_command`.

**Configuration** (postgresql.conf):

```ini
wal_level = replica          # Minimum for archiving (logical if using logical replication)
archive_mode = on            # Enable archiving (always|off|on)
                             # 'always' for standby servers to archive their own WAL

# Archive command (must return 0 on success; on non-zero PostgreSQL keeps the segment and retries)
# The alternatives below are examples: set exactly ONE archive_command (the last assignment in the file wins).

# Simple local copy (testing only):
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f'

# Local copy with compression, written under a temporary name and renamed once complete:
archive_command = 'test ! -f /backups/wal/%f.gz && gzip -c %p > /backups/wal/%f.gz.tmp && chmod 600 /backups/wal/%f.gz.tmp && mv /backups/wal/%f.gz.tmp /backups/wal/%f.gz'

# Cloud storage (using wal-g, a widely used archiving tool):
archive_command = 'wal-g wal-push %p'

# AWS S3 using AWS CLI (ensure instance profile or env vars configured):
archive_command = 'aws s3 cp %p s3://company-postgres-wal/production/%f --storage-class STANDARD_IA'

# Azure Blob:
archive_command = 'azcopy copy %p https://companybackup.blob.core.windows.net/wal/%f'

# With syslog logging (no retry loop is needed: PostgreSQL re-runs the command after a failure):
archive_command = 'bash -c ''test ! -f /backups/wal/%f && (cp %p /backups/wal/%f && logger -t pg_archive "Archived %f") || (logger -t pg_archive "Failed to archive %f" && exit 1)'''
```

**Critical Requirements**:
1. **Zero Exit Code**: The command must return 0 only if the file is durably stored (fsynced to disk or confirmed in object storage)
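2. **Idempotency**: Never overwrite an existing archive file (hence `test ! -f`). After a crash PostgreSQL may resubmit a segment it already archived, so a more robust command succeeds when the existing file has identical contents and fails otherwise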
3. **Permissions**: The `postgres` OS user must have write access to the archive destination
4. **Atomicity**: Use temporary files and atomic moves to prevent partial files: `cp %p /backups/wal/%f.tmp && mv /backups/wal/%f.tmp /backups/wal/%f`
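
The four requirements above can be combined in a small wrapper script. This is a sketch under stated assumptions, not a hardened tool: the function name `archive_wal` and the install path in the usage note are hypothetical, and `sync FILE` relies on a reasonably recent coreutils (the code falls back to a global `sync` otherwise).

```bash
# archive_wal SRC DEST_DIR
# Hypothetical helper: copies one WAL segment into DEST_DIR.
# Returns 0 only when the file is in place; never overwrites a different file.
archive_wal() {
    local src="$1" dest_dir="$2"
    local name; name=$(basename "$src")
    local final="$dest_dir/$name"
    local tmp="$final.tmp.$$"

    if [ -f "$final" ]; then
        # Already archived: succeed only if the contents are identical.
        cmp -s "$src" "$final"
        return $?
    fi

    # Copy under a temporary name, flush to disk, then rename atomically.
    if cp "$src" "$tmp" && { sync "$tmp" 2>/dev/null || sync; } && mv "$tmp" "$final"; then
        return 0
    fi
    rm -f "$tmp"
    return 1
}
```

Saved as a script that ends with `archive_wal "$@"`, it could be wired in as `archive_command = '/usr/local/bin/archive_wal.sh %p /backups/wal'` (path assumed for illustration).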

### 30.2.2 Archive Monitoring

A lagging archive command will cause `pg_wal` directory bloat and eventual disk space exhaustion.

```sql
-- Check for archiving lag
SELECT 
    archived_count,
    last_archived_wal,
    last_archived_time,
    failed_count,
    last_failed_wal,
    last_failed_time,
    now() - last_archived_time as time_since_last_archive
FROM pg_stat_archiver;

-- If archiving is failing, WAL files accumulate in pg_wal
-- List segments newer than the last archived one (last_archived_wal is already a file name, not an LSN).
-- Recycled segments pre-allocated for future use also sort after it, so treat the list as an upper bound.
SELECT 
    name, 
    modification,
    pg_size_pretty(size) as size
FROM pg_ls_waldir()
WHERE name ~ '^[0-9A-F]{24}$'
  AND name > (SELECT last_archived_wal FROM pg_stat_archiver)
ORDER BY name
LIMIT 5;
```
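
When the server is down or unreachable, the backlog can also be read from the filesystem: the archiver marks each segment awaiting archiving with a `.ready` file under `pg_wal/archive_status/`. A small sketch that counts them (the function name `count_ready_segments` is hypothetical):

```bash
# count_ready_segments STATUS_DIR
# Hypothetical helper: prints the number of segments still waiting to be archived.
count_ready_segments() {
    local status_dir="$1" n=0 f
    for f in "$status_dir"/*.ready; do
        # When nothing matches, the glob stays literal; skip that case.
        if [ -e "$f" ]; then
            n=$(( n + 1 ))
        fi
    done
    echo "$n"
}

# Example (data directory path as used elsewhere in this chapter):
# count_ready_segments /var/lib/postgresql/data/pg_wal/archive_status
```

A count that keeps growing means `archive_command` is failing or cannot keep up with WAL generation.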

**Alerting Thresholds**:
- Archive lag > 15 minutes: Warning
- Archive lag > 1 hour: Critical
- Failed archives > 3 consecutive: Critical
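
These thresholds translate directly into a check that a monitoring job could run. The sketch below assumes the caller supplies the lag in seconds (for example from `extract(epoch FROM now() - last_archived_time)` on `pg_stat_archiver`) and tracks consecutive failures itself, since `failed_count` is cumulative; the function name is hypothetical.

```bash
# classify_archive_lag LAG_SECONDS [CONSECUTIVE_FAILURES]
# Hypothetical helper: prints OK, WARNING, or CRITICAL using the thresholds above.
classify_archive_lag() {
    local lag_seconds="$1" consecutive_failures="${2:-0}"
    if [ "$lag_seconds" -gt 3600 ] || [ "$consecutive_failures" -gt 3 ]; then
        echo CRITICAL          # more than 1 hour behind, or more than 3 failures in a row
    elif [ "$lag_seconds" -gt 900 ]; then
        echo WARNING           # more than 15 minutes behind
    else
        echo OK
    fi
}

classify_archive_lag 1200      # prints WARNING
```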

### 30.2.3 Archive Storage Architecture

**Immutable Storage (Ransomware Protection)**:

```bash
# AWS S3 with Object Lock (WORM - Write Once Read Many)
aws s3api put-object-lock-configuration \
    --bucket company-postgres-wal \
    --object-lock-configuration='ObjectLockEnabled=Enabled,Rule={DefaultRetention={Mode=COMPLIANCE,Days=30}}'

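# Uploads to this bucket inherit the default retention configured above.
# To set retention explicitly on a single object, use s3api put-object
# (GNU date syntax shown for computing the retain-until timestamp):
aws s3api put-object \
    --bucket company-postgres-wal \
    --key 0000000100000001000000AB \
    --body 0000000100000001000000AB \
    --object-lock-mode COMPLIANCE \
    --object-lock-retain-until-date "$(date -u -d '+30 days' +%Y-%m-%dT%H:%M:%SZ)"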
```

**Tiered Storage Strategy**:
- **Hot** (0-7 days): Local SSD/NFS for fast recovery
- **Warm** (7-30 days): S3 Standard or Azure Hot tier
- **Cold** (30-365 days): S3 Glacier or Azure Archive (retrieval 1-12 hours acceptable)

---

## 30.3 Point-in-Time Recovery Workflow

PITR restores a physical base backup and replays WAL up to a specific target time, transaction ID, or LSN.

### 30.3.1 Preparing for Recovery

**Step 1: Base Backup Restoration**:

```bash
# Restore the base backup to new data directory
mkdir -p /var/lib/postgresql/data.new
tar -xzf /backups/base/base_20241002.tar.gz -C /var/lib/postgresql/data.new

# If using pg_basebackup plain format:
# cp -r /backups/base/20241002 /var/lib/postgresql/data.new
```

**Step 2: Configure Recovery Parameters** (PostgreSQL 12+):

In PostgreSQL 12 and later, recovery is configured via `postgresql.conf` (or `postgresql.auto.conf`) and signal files, not `recovery.conf`.

```ini
# postgresql.conf in the new data directory

# Restore command (fetch WAL from archive)
restore_command = 'cp /backups/wal/%f %p'
# Or for compressed: 'gunzip -c /backups/wal/%f.gz > %p'
# Or for S3: 'wal-g wal-fetch %f %p'

# Recovery target (set exactly ONE; PostgreSQL raises an error if more than one is set)
recovery_target_time = '2024-10-02 14:30:00+00'      # Timestamp with timezone
# recovery_target_xid = '1234567'                    # Transaction ID
# recovery_target_lsn = '1/AB123456'                 # Log Sequence Number
# recovery_target_name = 'before_accident'           # Named restore point

# What to do when target reached:
recovery_target_action = 'pause'      # pause|promote|shutdown
                                      # 'pause' allows inspection before opening
                                      # 'promote' ends recovery and accepts connections
                                      # 'shutdown' stops server (for frozen snapshots)

# Boundary behavior for time, xid, and lsn targets:
recovery_target_inclusive = on        # on (default) = stop just after the target, off = stop just before it
```
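
If the archive may hold both plain and gzip-compressed segments, `restore_command` can call a small wrapper instead of a one-liner. The sketch below is illustrative only: the function name `restore_wal` and the script path in the usage note are assumptions. The non-zero exit for a missing file matters, because that is how PostgreSQL learns it has reached the end of the archive.

```bash
# restore_wal FILE_NAME DEST_PATH ARCHIVE_DIR
# Hypothetical helper: fetches one WAL file (plain or .gz) from ARCHIVE_DIR.
restore_wal() {
    local name="$1" dest="$2" archive_dir="$3"
    if [ -f "$archive_dir/$name" ]; then
        cp "$archive_dir/$name" "$dest"
    elif [ -f "$archive_dir/$name.gz" ]; then
        # Remove a partial file if decompression fails midway.
        gunzip -c "$archive_dir/$name.gz" > "$dest" || { rm -f "$dest"; return 1; }
    else
        # Not in the archive: a normal outcome at the end of available WAL.
        return 1
    fi
}
```

Saved as a script that ends with `restore_wal "$@"`, it could be configured as `restore_command = '/usr/local/bin/restore_wal.sh %f %p /backups/wal'` (path assumed for illustration).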

**Step 3: Create Recovery Signal File**:

```bash
# For PITR (PostgreSQL 12+):
touch /var/lib/postgresql/data.new/recovery.signal

# For standby/streaming replica (different mode):
# touch /var/lib/postgresql/data.new/standby.signal

# Ensure proper ownership
chown -R postgres:postgres /var/lib/postgresql/data.new
chmod 700 /var/lib/postgresql/data.new
```

### 30.3.2 Recovery Execution

```bash
# Start PostgreSQL in recovery mode
pg_ctl -D /var/lib/postgresql/data.new start -l logfile

# Monitor recovery progress
tail -f /var/lib/postgresql/data.new/log/postgresql-*.log

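# Typical log messages, in order (LSNs and times are illustrative):
# LOG:  starting point-in-time recovery to 2024-10-02 14:30:00+00
# LOG:  restored log file "00000001000000010000008A" from archive
# LOG:  redo starts at 1/8A000028
# LOG:  consistent recovery state reached at 1/8A000138
# ...
# LOG:  recovery stopping before commit of transaction 1234567, time 2024-10-02 14:30:05.123456+00
# LOG:  pausing at the end of recovery
# HINT:  Execute pg_wal_replay_resume() to promote.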
```

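**If recovery_target_action = 'pause'**, replay stops at the target and the server stays in recovery. With `hot_standby = on` (the default) it accepts read-only connections, so you can inspect the data before deciding to promote: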

```sql
-- Connect read-only for inspection (run from a shell, not inside psql):
--   psql -h /var/run/postgresql -U postgres postgres

-- Check recovery status
SELECT 
    pg_is_in_recovery(),           -- true while recovering
    pg_last_wal_receive_lsn(),
    pg_last_wal_replay_lsn(),
    pg_last_xact_replay_timestamp();

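-- If satisfied, end recovery and open the database for writes (irreversible).
-- pg_wal_replay_resume() is the documented way to leave the paused state at the target:
SELECT pg_wal_replay_resume();
-- Promotion can also be requested with SELECT pg_promote();
-- or from shell: pg_ctl promote -D /var/lib/postgresql/data.new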
```

### 30.3.3 Recovery Conflicts

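`recovery_target_inclusive` controls whether replay stops just after the recovery target or just before it. With an XID target, this decides whether that transaction's commit is replayed. Transactions that have not committed by the stop point are never visible after promotion, whichever setting you choose:

```ini
# postgresql.conf
recovery_target_inclusive = on     # Default: stop just after the target (the target transaction is included)
# vs
# recovery_target_inclusive = off  # Stop just before the target (the target transaction is excluded)
```

**Best Practice**: Use the `pause` action and inspect the database state before promoting. If replay stopped too early, shut down, set a later target, and restart to continue replay. If it went too far, WAL cannot be un-applied: restore the base backup again and recover to an earlier target.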

---

## 30.4 Timeline Management

### 30.4.1 Understanding Timelines

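Timelines keep divergent histories apart. Each time recovery ends and the server is promoted, PostgreSQL starts a new timeline whose ID becomes part of every subsequent WAL file name, so WAL written after a recovery never overwrites archived WAL from the original history.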

- **Timeline 1**: Original production history
- **Timeline 2**: First recovery (diverged at recovery point)
- **Timeline 3**: Second recovery (diverged at different point), etc.

**Visual Representation**:
```
Timeline 1: ----A----B----C----D----E   (original production history)
                           \
Timeline 2:                 \----D'----E'   (recovery to point C)
                                   \
Timeline 3:                         \----E''   (second recovery, to point D')
```

### 30.4.2 Timeline History Files

When you create a new timeline, PostgreSQL generates a `.history` file describing the divergence.

```bash
# List timeline history
ls -la /backups/wal/*.history

# Example: 00000002.history
# 1	0/150000A0	no recovery target specified
# Explanation: Timeline 2 forked from Timeline 1 at LSN 0/150000A0
```
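
History files are plain text with one tab-separated line per ancestor timeline (parent timeline ID, switch LSN, reason), so they are easy to summarize when choosing a `recovery_target_timeline`. A sketch in portable shell, with a hypothetical function name:

```bash
# print_timeline_forks HISTORY_FILE
# Hypothetical helper: prints one readable line per ancestor timeline.
print_timeline_forks() {
    local file="$1" parent lsn reason
    local tab; tab=$(printf '\t')
    # Fields are tab-separated; the reason may itself contain spaces.
    while IFS="$tab" read -r parent lsn reason || [ -n "$parent" ]; do
        case "$parent" in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        printf 'timeline %s ended at %s (%s)\n' "$parent" "$lsn" "$reason"
    done < "$file"
}

# Example: print_timeline_forks /backups/wal/00000002.history
```

For the example file above, this reports that timeline 1 ended at `0/150000A0`, which is where timeline 2 begins.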

**Switching Timelines**:
If recovering to a specific timeline (e.g., you accidentally recovered too far on timeline 2 and want to try timeline 1 again):

```ini
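# postgresql.conf (set one)
recovery_target_timeline = '1'          # Follow a specific timeline ID
# recovery_target_timeline = 'latest'   # Default since PG12: follow the newest timeline found in the archive
# recovery_target_timeline = 'current'  # Stay on the timeline that was current when the base backup was taken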
```

### 30.4.3 Named Restore Points

Create named markers in WAL during normal operations for easier recovery targeting:

```sql
-- During normal operations (superuser by default); returns the LSN of the restore point
SELECT pg_create_restore_point('before_schema_migration');

-- PostgreSQL keeps no catalog of restore points, so record the returned LSN and name yourself.
-- They can be located in WAL later, e.g.: pg_waldump --rmgr=XLOG <segment> | grep RESTORE_POINT

-- Later, recover to this exact point (these are postgresql.conf settings, not SQL):
-- recovery_target_name = 'before_schema_migration'
-- recovery_target_action = 'pause'
```

---

## 30.5 Recovery Targets in Detail

### 30.5.1 Recovery Target Options

| Target Type | Use Case | Precision | Syntax Example |
|-------------|----------|-----------|----------------|
| **Time** | "Oops, I dropped the table at 2:30 PM" | Microsecond | `2024-10-02 14:30:00.000000+00` |
| **XID** | "Stop just before (or after) transaction 1234567" | Transaction boundary | `1234567` |
| **LSN** | "Stop at exact log position" | Byte offset | `1/AB123456` |
| **Name** | "Before the deployment" | Restore point | `before_deploy_v2` |

**Target Time Recovery** (Most Common):

```ini
recovery_target_time = '2024-10-02 14:30:00+05:30'  # ISO 8601 with timezone
recovery_target_action = 'pause'
```

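**Important**: Always include an explicit UTC offset. A timestamp without one is interpreted using the server's `timezone` setting, which may differ from the zone you had in mind. PostgreSQL stores and compares the target in UTC internally.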

### 30.5.2 Recovery Target Pitfalls

**The "Recovery Target Not Found" Error**:

If the target lies beyond the end of the available WAL (for example, a target time later than the last archived segment), PostgreSQL 13 and later stop with:
```log
FATAL:  recovery ended before configured recovery target was reached
```

**Solutions**:
1. Check whether `archive_command` was succeeding around the target time (look for gaps in the archive)
2. Verify `restore_command` can reach the WAL files (permissions, paths)
3. Read the server log for the last segment restored to see how far replay got, then choose a target within the archived range

**Ambiguous Timestamps**:
Multiple transactions may share the same microsecond timestamp. Use `recovery_target_inclusive` to control boundary behavior, or switch to XID targeting for precision.

---

## 30.6 RPO/RTO Considerations

### 30.6.1 Defining Recovery Objectives

- **RPO (Recovery Point Objective)**: Maximum acceptable data loss (e.g., "We can lose 15 minutes of data")
- **RTO (Recovery Time Objective)**: Maximum acceptable downtime (e.g., "We must be online within 1 hour")

**Achieving RPO = 0 (Zero Data Loss)**:
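Requires synchronous replication: list at least one standby in `synchronous_standby_names` and keep `synchronous_commit = on` (the default), so a commit is acknowledged only after the standby has flushed its WAL. `remote_apply` is stricter still (the commit is also visible on the standby) but adds latency. WAL archiving alone cannot reach RPO = 0, because a segment is archived only once it is complete; `pg_receivewal --synchronous` can act as a synchronous WAL receiver where a full standby is not wanted.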

**Achieving RPO < 5 Minutes**:
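- Continuous archiving with `archive_timeout = 5min` (forces a segment switch at least every 5 minutes when there is activity; each forced segment is still archived at full size, so very short timeouts bloat the archive)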
- Streaming replication to standby (WAL applied continuously)
- `pg_receivewal` running in parallel to stream WAL to secondary storage

### 30.6.2 Calculating Recovery Time

Recovery duration consists of:
1. **Base restore**: Copy data files (network/disk bandwidth limited)
2. **WAL replay**: Proportional to time since backup and write volume
3. **Warmup**: Cache warming, index validation (if applicable)

```bash
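# Summarize WAL volume and record mix between a start and an end segment.
# Replay throughput varies widely with hardware and workload, so measure it in a restore test.
pg_waldump --stats /backups/wal/00000001000000010000008A /backups/wal/0000000100000001000000AB | tail -20

# Check how much WAL needs replay:
ls /backups/wal/*.gz | wc -l  # Number of archived segments (count only those newer than the base backup)
# Multiply by 16MB (uncompressed) for total bytes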
```
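
The first two components of recovery time can be combined into a rough estimate. The sketch below uses integer arithmetic only; the function name and every throughput figure are hypothetical placeholders that should be replaced with numbers measured in your own restore tests. Warmup time is not included.

```bash
# estimate_recovery_seconds BASE_GB RESTORE_MB_PER_S WAL_GB REPLAY_MB_PER_S
# Hypothetical helper: base restore time plus WAL replay time, in seconds.
estimate_recovery_seconds() {
    local base_gb="$1" restore_mbps="$2" wal_gb="$3" replay_mbps="$4"
    local restore_s=$(( base_gb * 1024 / restore_mbps ))
    local replay_s=$(( wal_gb * 1024 / replay_mbps ))
    echo $(( restore_s + replay_s ))
}

# 500GB base backup restored at 200MB/s, then 100GB of WAL replayed at 50MB/s:
estimate_recovery_seconds 500 200 100 50   # prints 4608 (about 77 minutes)
```

Comparing this figure against the RTO shows quickly whether more frequent base backups (less WAL to replay) or a standby is needed.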

**Optimization**:
- Take base backups frequently (daily) to reduce WAL replay window
- Use `pg_basebackup` with `--wal-method=stream` (the default) or `--wal-method=fetch` so the backup itself contains the WAL needed to make it consistent
- For critical systems, maintain a warm standby (streaming replica) for immediate failover (RTO ≈ 0)

### 30.6.3 Archive Retention Policy

Calculate required WAL retention from the base backup schedule and the recovery window (how far back you must be able to restore):

```bash
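# If daily base backups and a 7-day recovery window:
# Keep: 7 days of base backups + 7 days of WAL + safety margin
# With 16MB WAL segments and 100GB daily WAL generation:
# 100GB / 16MB = 6,400 files per day; 7 days * 6,400 = 44,800 files = 700GB uncompressed

# Automated cleanup (run after a successful base backup).
# Age-based deletion is only safe while every retained base backup is newer than the cutoff:
find /backups/wal -name "00000001*" -mtime +7 -delete
# Safer: pg_archivecleanup /backups/wal <oldest WAL segment still needed>
# Or use wal-g or pgBackRest retention policies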
```
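
The sizing arithmetic is easy to parameterize for other retention windows and WAL volumes. A minimal sketch (hypothetical function name; sizes are uncompressed, and compression typically reduces the storage figure):

```bash
# wal_retention_estimate DAYS DAILY_WAL_GB [SEGMENT_MB]
# Hypothetical helper: prints the number of segments and total size to retain.
wal_retention_estimate() {
    local days="$1" daily_gb="$2" seg_mb="${3:-16}"
    local files=$(( days * daily_gb * 1024 / seg_mb ))
    local total_gb=$(( days * daily_gb ))
    echo "$files files, $total_gb GB"
}

wal_retention_estimate 7 100   # prints: 44800 files, 700 GB
```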

**Critical Warning**: Never delete WAL files that might be needed for existing base backups. The `backup_label` file in a base backup references the START WAL location required for recovery.
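
One way to respect this warning is to derive the cleanup cutoff from the oldest base backup you still keep, rather than from file age. The sketch below extracts the starting segment from a `backup_label` file; the function name is hypothetical, and the backup path in the usage note assumes a plain-format backup like the one shown earlier in this chapter.

```bash
# backup_start_segment BACKUP_LABEL_FILE
# Hypothetical helper: prints the first WAL segment the backup needs, taken from
# the line "START WAL LOCATION: <lsn> (file <segment>)".
backup_start_segment() {
    sed -n 's/^START WAL LOCATION: .*(file \([0-9A-F]\{24\}\))$/\1/p' "$1"
}

# Dry run: list archive files older than that segment without deleting anything
# pg_archivecleanup -n /backups/wal "$(backup_start_segment /backups/base/20241002/backup_label)"
```

Dropping `-n` performs the deletion. When several base backups are retained, take the segment from the oldest one, and add `-x .gz` if the archive stores compressed segments.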

---

## Chapter Summary

In this chapter, you learned:

1. **WAL Architecture**: The Write-Ahead Log provides durability, replication, and PITR capabilities. WAL files (16MB default) are generated continuously and must be archived before recycling to enable historical recovery. Use `pg_stat_archiver` to monitor archiving lag.

2. **Archive Configuration**: Set `wal_level = replica`, `archive_mode = on`, and configure `archive_command` to copy WAL to durable storage (S3, NFS, Azure Blob). Commands must be idempotent, return exit code 0 only on success, and use atomic file operations. Use tools like `wal-g` or `pgBackRest` for cloud-native archiving.

3. **PITR Workflow**: Restore a base backup, create `recovery.signal`, configure `restore_command` to fetch WAL, and set `recovery_target_time` (or XID/LSN/Name) in `postgresql.conf`. Start PostgreSQL; it replays WAL until the target, then pauses, promotes, or shuts down based on `recovery_target_action`.

4. **Timeline Management**: Each recovery creates a new timeline (numbered history files) to prevent divergence conflicts. Use `recovery_target_timeline` to specify which branch to follow. Create named restore points with `pg_create_restore_point()` for precise recovery markers during maintenance windows.

5. **Recovery Targets**: Specify exact recovery points via timestamp (microsecond precision), transaction ID, LSN, or named restore point. Use `recovery_target_inclusive` to control boundary transaction inclusion. Always use `pause` action initially to verify state before `pg_promote()`.

6. **RPO/RTO Strategy**: Achieve aggressive RPO (minutes) via continuous archiving with a short `archive_timeout`, or WAL streaming; frequent base backups shorten WAL replay and therefore RTO. For RTO ≈ 0, maintain streaming replicas rather than relying on PITR. Calculate WAL retention as `(recovery_window + safety_margin) * daily_wal_volume`. Store archives in immutable storage (S3 Object Lock) to prevent ransomware corruption of backups.

**Next**: In Chapter 31, we will cover Schema Migrations in Real Teams—exploring migration frameworks, zero-downtime deployment patterns, handling long-running migrations safely, and maintaining backward compatibility across application versions.

