# Chapter 38: Vacuum, Analyze, and Bloat

PostgreSQL's Multi-Version Concurrency Control (MVCC) architecture provides excellent concurrency semantics without read locks, but it creates a maintenance burden: dead tuple accumulation. Without aggressive maintenance, tables and indexes bloat indefinitely, performance degrades, and the database risks catastrophic transaction ID wraparound failure. This chapter provides the operational knowledge to manage PostgreSQL's vacuum processes effectively, detect bloat before it becomes critical, and remediate existing bloat safely.

## 38.1 MVCC Fundamentals and the Vacuum Imperative

Understanding why vacuum exists requires understanding PostgreSQL's concurrency model. Unlike databases that use undo logs or read locks, PostgreSQL keeps multiple versions of rows in the table itself.

### 38.1.1 Tuple Visibility and Row Versions

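When a row is updated or deleted, PostgreSQL does not immediately remove the old data. Instead, it stamps the old row version with the ID of the deleting transaction (`xmax`) and, for an update, writes a new row version whose `xmin` is that same transaction ID. The old version becomes "dead" once the transaction has committed and no remaining snapshot can still see it.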

```sql
-- Transaction ID visibility demonstration
BEGIN;
SELECT txid_current();  -- Returns current transaction ID, e.g., 1500000

UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
-- This creates:
-- 1. A new row version with xmin = 1500000 (creating transaction)
-- 2. The old row marked with xmax = 1500000 (deleting transaction)

COMMIT;
-- The old row is now "dead" - invisible to new transactions but physically remains
```

**Tuple Header Fields (Simplified):**
- `xmin`: Transaction ID that inserted the row
- `xmax`: Transaction ID that deleted/updated the row (0 = live)
- `cmin/cmax`: Command ID within transaction
- `ctid`: Physical location (block, offset)

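**The Visibility Rule (simplified):**
A row version is visible to a transaction's snapshot if:
1. `xmin` belongs to a transaction that committed before the snapshot was taken (or is the querying transaction itself)
2. `xmax` is either 0 (not deleted), or belongs to a transaction that aborted or had not yet committed when the snapshot was taken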
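The header fields can be inspected directly, because PostgreSQL exposes them as hidden system columns. The following is a minimal sketch using a throwaway table (`mvcc_demo` is a hypothetical name, not part of the chapter's schema) and assuming each statement runs in its own autocommit transaction:

```sql
-- Throwaway table for inspecting tuple header fields
CREATE TEMP TABLE mvcc_demo (id int PRIMARY KEY, balance int);
INSERT INTO mvcc_demo VALUES (1, 500);

-- System columns are hidden from SELECT * and must be named explicitly
SELECT ctid, xmin, xmax, balance FROM mvcc_demo;
-- One row version: xmin is the inserting transaction, xmax is 0

UPDATE mvcc_demo SET balance = balance - 100 WHERE id = 1;

SELECT ctid, xmin, xmax, balance FROM mvcc_demo;
-- The visible row now has a different ctid and a newer xmin.
-- The old version still occupies its slot in the page, but no ordinary
-- query can see it any more.
```

Ordinary queries only ever return the version visible to their snapshot, so the dead version cannot be selected this way; seeing it requires a page-level tool such as the `pageinspect` extension.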

### 38.1.2 The Dead Tuple Problem

Without vacuum, dead tuples accumulate indefinitely:

```sql
-- Simulate bloat generation
CREATE TABLE bloat_demo (
    id serial PRIMARY KEY,
    data text
);

INSERT INTO bloat_demo (data) SELECT 'data' || gs FROM generate_series(1, 100000) gs;

-- Check live vs dead tuples
SELECT 
    n_live_tup,
    n_dead_tup,
    pg_size_pretty(pg_total_relation_size('bloat_demo')) as total_size
FROM pg_stat_user_tables 
WHERE relname = 'bloat_demo';
-- Result: 100000 live, 0 dead, ~4MB

-- Generate dead tuples without vacuum
UPDATE bloat_demo SET data = data || '_updated';
-- Now we have 100000 live + 100000 dead tuples = ~8MB

UPDATE bloat_demo SET data = data || '_again';
-- 100000 live + 200000 dead = ~12MB

-- This growth continues indefinitely without vacuum intervention
```

**Consequences of Unvacuumed Tables:**
1. **Storage bloat**: Tables 5x-50x larger than necessary
2. **Index bloat**: Indexes point to dead tuples; each non-HOT update creates new index entries
3. **Query slowdown**: Sequential scans read dead tuples; index scans follow pointers to dead rows
4. **Transaction ID wraparound**: XID is 32-bit; without freezing, the database stops accepting writes at 2^31 transactions

### 38.1.3 Vacuum Operations Overview

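PostgreSQL provides three related maintenance commands:

1. **Standard VACUUM**: Removes dead tuples, marks their space for reuse, and updates the visibility map. It runs alongside normal reads and writes, but it does not shrink table files (apart from truncating empty pages at the end) and does not refresh planner statistics unless combined with ANALYZE.
2. **VACUUM FULL**: Reclaims storage by rewriting the entire table (takes an ACCESS EXCLUSIVE lock and needs free disk space for a complete new copy, up to 2x the table size in total).
3. **ANALYZE**: Updates planner statistics without removing dead tuples.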

```sql
-- Standard vacuum (maintenance command, doesn't lock table for reads/writes)
VACUUM bloat_demo;

-- Verbose output (shows details)
VACUUM (VERBOSE, ANALYZE) bloat_demo;

-- Full vacuum (blocking, reclaims disk space)
VACUUM FULL bloat_demo;

-- Analyze only (updates statistics for query planning)
ANALYZE bloat_demo;
```

## 38.2 Standard Vacuum Mechanics

Standard vacuum is the workhorse operation. Understanding its phases helps diagnose performance issues.

### 38.2.1 Vacuum Phases

```sql
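-- Monitoring vacuum progress (view available since PostgreSQL 9.6)
SELECT 
    pid,
    phase,
    heap_blks_total,
    heap_blks_scanned,
    heap_blks_vacuumed,
    index_vacuum_count,
    num_dead_tuples,   -- PostgreSQL 9.6-16; renamed num_dead_item_ids in 17
    max_dead_tuples    -- PostgreSQL 9.6-16; replaced by max_dead_tuple_bytes in 17
FROM pg_stat_progress_vacuum;
-- PostgreSQL 17+ also reports dead_tuple_bytes, which can be wrapped in pg_size_pretty()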
```

**Phase Descriptions:**

1. **initializing**: Preparing to scan
2. **scanning heap**: Reading table blocks sequentially, identifying dead tuples
3. **vacuuming indexes**: For each dead tuple found, remove index entries pointing to it (expensive phase)
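4. **vacuuming heap**: Removing dead tuples from heap pages and recording the freed space in the free space map
5. **cleaning up indexes**: Final index cleanup
6. **truncating heap**: Attempting to return empty pages at end of file to OS (requires a brief ACCESS EXCLUSIVE lock)
7. **performing final cleanup**: Vacuuming the free space map and updating `pg_class` and cumulative statistics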

**The Visibility Map (VM):**
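PostgreSQL maintains a visibility map with two bits per heap page: *all-visible* (every tuple on the page is visible to all transactions) and *all-frozen* (every tuple on the page is frozen). Vacuum sets these bits; index-only scans check the all-visible bit to decide whether they can skip the heap visit for a page, and later vacuums use both bits to skip pages.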

```sql
-- Visibility map reduces vacuum workload
-- Pages marked "all-visible" are skipped in future vacuum scans
-- Pages marked "all-frozen" are skipped in anti-wraparound vacuums
```

### 38.2.2 Freezing and Transaction ID Wraparound

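Transaction IDs are 32-bit integers (approx 4 billion values) compared using modulo arithmetic: from any given XID, about 2 billion IDs count as "in the past" and 2 billion as "in the future". If a row's `xmin` fell more than 2 billion transactions behind, it would suddenly appear to be in the future and the row would vanish. Vacuum prevents this by "freezing" old tuples, setting a flag that marks them as visible to every transaction regardless of `xmin`, so their original XIDs can be reused safely.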

```sql
-- Check database age (distance to wraparound)
SELECT 
    datname,
    age(datfrozenxid) as xid_age,
    2000000000 - age(datfrozenxid) as transactions_until_emergency,
    pg_database.datfrozenxid::text
FROM pg_database
WHERE datallowconn
ORDER BY age(datfrozenxid) DESC;

-- Critical thresholds:
-- < 500M: Healthy
-- > 1B:  Warning (vacuum working hard)
-- > 1.5B: Critical (forced vacuum, performance impact)
-- > 2B:  Shutdown (database stops accepting writes)
```

**Freeze Strategies:**

1. **Routine Freezing** (during any vacuum):
   - Tuples older than `vacuum_freeze_min_age` (50M transactions by default) are frozen when vacuum processes their page
   
2. **Anti-Wraparound Vacuum**:
   - Forced when table age exceeds `autovacuum_freeze_max_age` (200M default), even if autovacuum is disabled
   - Scans every page not already marked all-frozen, regardless of dead tuple count
   - Non-blocking but I/O intensive

```sql
-- Check table-specific freeze ages
SELECT 
    c.relname,
    age(c.relfrozenxid) as table_age,
    pg_size_pretty(pg_table_size(c.oid)) as table_size,
    s.last_vacuum,
    s.last_autovacuum
FROM pg_class c
JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.relkind = 'r'
  AND age(c.relfrozenxid) > 100000000  -- Tables older than 100M
ORDER BY age(c.relfrozenxid) DESC;
```
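To see how close each table is to a forced anti-wraparound vacuum, table age can be expressed as a percentage of `autovacuum_freeze_max_age`. This is a sketch under one stated assumption: it reads only the server-wide setting and ignores any per-table `autovacuum_freeze_max_age` storage parameter.

```sql
-- Table age as a percentage of the forced-vacuum threshold.
-- Assumption: only the global autovacuum_freeze_max_age is considered.
SELECT
    c.oid::regclass AS table_name,
    age(c.relfrozenxid) AS table_age,
    current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age,
    round(100.0 * age(c.relfrozenxid)
          / current_setting('autovacuum_freeze_max_age')::bigint, 1) AS pct_toward_forced_vacuum
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')   -- tables, materialized views, TOAST tables
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;
```

Values at or above 100 mean autovacuum will launch an anti-wraparound vacuum for that table on its next pass.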

### 38.2.3 Vacuum Cost-Based Delay

To prevent vacuum from monopolizing I/O, PostgreSQL implements cost-based throttling:

```conf
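vacuum_cost_page_hit = 1        # Cost when page found in shared buffers (default 1)
vacuum_cost_page_miss = 10      # Cost when page must be read in (default 10 through PostgreSQL 13, 2 since 14)
vacuum_cost_page_dirty = 20     # Cost when vacuum dirties a previously clean page (default 20)
vacuum_cost_limit = 10000       # Cost points accumulated before vacuum sleeps (default 200; 10000 is the maximum)
vacuum_cost_delay = 0           # Sleep time once the limit is reached (default 0 = manual VACUUM is not throttled)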
```

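**Calculation Example:**
If `vacuum_cost_limit = 10000` and `vacuum_cost_delay = 2ms`:
- Vacuum accumulates cost points during operation
- When 10,000 points reached, sleep 2ms
- Upper bound, ignoring processing time: 10,000 points / 2ms = 5,000 points per millisecond. At a cost of 10 per page miss, that is at most 500 page reads per millisecond; real throughput is lower because the work between sleeps also takes time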
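The same arithmetic can be run for each page cost. This is a back-of-the-envelope sketch, not a benchmark: it assumes the 8kB default block size, uses the cost values from the configuration block above, and treats the time spent working between sleeps as zero, so the results are theoretical ceilings.

```sql
-- Theoretical throttling ceiling per page type (integer arithmetic).
-- Assumptions: 8192-byte blocks; processing time between sleeps ignored.
SELECT
    page_kind,
    cost_limit / page_cost                        AS pages_per_cycle,
    (cost_limit / page_cost) * (1000 / delay_ms)  AS max_pages_per_sec,
    pg_size_pretty((cost_limit / page_cost)::bigint * (1000 / delay_ms) * 8192) AS max_throughput
FROM (VALUES
    ('buffer hit',   10000, 2, 1),
    ('page miss',    10000, 2, 10),
    ('page dirtied', 10000, 2, 20)
) AS t(page_kind, cost_limit, delay_ms, page_cost);
-- max_pages_per_sec: 5000000 (hit), 500000 (miss), 250000 (dirtied)
```

With a limit this high the throttle is effectively off; substituting the default `vacuum_cost_limit` of 200 shows how much tighter the default ceiling is.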

**Modern SSD Recommendation:**
```conf
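# For SSD storage, reduce or eliminate delay
# (autovacuum is governed by autovacuum_vacuum_cost_delay, default 2ms since PostgreSQL 12,
#  and autovacuum_vacuum_cost_limit, default -1 = use vacuum_cost_limit)
vacuum_cost_delay = 0           # Allow manual vacuum to run at full speed (this is the default)
# OR
vacuum_cost_delay = 1ms         # Minimal throttling
vacuum_cost_limit = 10000       # 10000 is the maximum allowed value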
```

## 38.3 Autovacuum Deep Dive

Autovacuum is a background worker system that automates vacuum operations. Proper tuning prevents manual intervention.

### 38.3.1 Autovacuum Triggering

Autovacuum starts when:
```
-- For vacuum (dead tuple cleanup):
dead_tuples > autovacuum_vacuum_threshold + (autovacuum_vacuum_scale_factor × table_rows)

-- For analyze (statistics update):
modified_tuples > autovacuum_analyze_threshold + (autovacuum_analyze_scale_factor × table_rows)
```

**Default Problem:**
On a 100GB table with 100M rows (about 1KB per row) and the default `autovacuum_vacuum_scale_factor = 0.2`:
- Vacuum triggers at: 50 + (0.2 × 100,000,000) = 20,000,050 dead tuples
- That's roughly 20GB of dead rows before vacuum starts!
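The trigger point can be computed for every table in the current database. The sketch below applies the formula using the server-wide settings only; it is an approximation that assumes no per-table overrides, and it relies on `pg_class.reltuples`, which is itself an estimate.

```sql
-- Dead-tuple count at which autovacuum would trigger, per table.
-- Assumption: global settings only; per-table storage parameters are ignored.
SELECT
    s.schemaname,
    s.relname,
    greatest(c.reltuples, 0)::bigint AS estimated_rows,
    s.n_dead_tup,
    (current_setting('autovacuum_vacuum_threshold')::bigint
     + current_setting('autovacuum_vacuum_scale_factor')::numeric
       * greatest(c.reltuples, 0)::numeric
    )::bigint AS vacuum_trigger_point
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
ORDER BY c.reltuples DESC
LIMIT 20;
```

Tables where `n_dead_tup` sits far below a very large `vacuum_trigger_point` are the candidates for the per-table settings described next.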

**Industry Solution - Per-Table Settings:**

```sql
-- Large tables: Use absolute thresholds instead of scale factors
ALTER TABLE large_table SET (
    autovacuum_vacuum_scale_factor = 0,
    autovacuum_vacuum_threshold = 10000,    -- Vacuum at 10k dead tuples
    autovacuum_analyze_scale_factor = 0,
    autovacuum_analyze_threshold = 5000
);

-- High-churn medium tables
ALTER TABLE order_items SET (
    autovacuum_vacuum_scale_factor = 0.05,  -- 5% instead of 20%
    autovacuum_vacuum_cost_limit = 1000,    -- Allow faster vacuuming
    autovacuum_vacuum_cost_delay = 2ms,
    fillfactor = 85                         -- Leave room for HOT updates
);
```

### 38.3.2 Autovacuum Worker Configuration

```conf
# postgresql.conf
autovacuum_max_workers = 3          # Parallel vacuum processes (increase to 6 on large systems)
autovacuum_naptime = 1min           # How often to check for tables needing vacuum

# Per-worker memory
autovacuum_work_mem = -1            # -1 = fall back to maintenance_work_mem (default 64MB; often raised to 1GB)
# Or set explicitly:
# autovacuum_work_mem = 512MB
```

**Worker Allocation Strategy:**
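- The launcher starts one worker at a time, visiting each database roughly once per `autovacuum_naptime`
- Up to `autovacuum_max_workers` workers run at once, and several can vacuum different tables in the same database concurrently
- A long-running vacuum of one table doesn't block others, but the cost limit is shared: `autovacuum_vacuum_cost_limit` is divided among active workers, so adding workers without raising it makes each one slower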

**Monitoring Worker Activity:**

```sql
-- Current autovacuum activity
SELECT 
    pid,
    now() - pg_stat_activity.query_start as duration,
    query,
    wait_event_type,
    wait_event
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%';

-- Tables waiting for vacuum (backlog)
SELECT 
    schemaname,
    relname,
    n_dead_tup,
    n_live_tup,
    round(n_dead_tup::numeric/nullif(n_live_tup,0)*100, 2) as dead_pct,
    last_vacuum,
    last_autovacuum,
    last_analyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
  AND (last_autovacuum IS NULL OR last_autovacuum < now() - interval '1 hour')
ORDER BY n_dead_tup DESC;
```

### 38.3.3 Autovacuum Logging and Monitoring

Enable detailed logging:

```conf
log_autovacuum_min_duration = 0     # Log all autovacuum actions (0 = all, -1 = disable, 1000 = >1s)
```

**Log Interpretation:**
```
LOG:  automatic vacuum of table "mydb.public.large_table": index scans: 1
    pages: 0 removed, 123450 remain, 0 skipped due to pins, 0 skipped frozen
    tuples: 50000 removed, 1000000 remain, 0 are dead but not yet removable
    buffer usage: 4567 hits, 1234 misses, 789 dirtied
    avg read rate: 25.123 MB/s, avg write rate: 8.456 MB/s
    system usage: CPU 2.34s/5.67u sec elapsed 45.89 sec
```

**Key Metrics to Watch:**
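- `index scans`: Should ideally be 0 or 1. More than one means the dead tuple list overflowed `autovacuum_work_mem` (or `maintenance_work_mem`), forcing vacuum to scan every index again for each batch; raise the memory setting or vacuum more often.
- `tuples removed`: If this is consistently a large fraction of `tuples remain`, dead tuples are piling up between runs and vacuum should be triggered sooner on this table.
- `dead but not yet removable`: Tuples still visible to an old snapshot (a long-running transaction, an abandoned replication slot, or a prepared transaction); vacuum cannot remove them until that snapshot is gone.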
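When `dead but not yet removable` stays high, the next step is finding what is holding the oldest snapshot. The query below is one possible sketch that lists the three usual suspects side by side; the `holder` and `id` column names are illustrative choices, and only the `xmin` of replication slots is checked (not `catalog_xmin`).

```sql
-- Who is holding back the xmin horizon? Oldest first.
SELECT 'backend' AS holder,
       pid::text AS id,
       age(backend_xmin) AS xmin_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
UNION ALL
SELECT 'replication slot',
       slot_name::text,
       age(xmin)
FROM pg_replication_slots
WHERE xmin IS NOT NULL
UNION ALL
SELECT 'prepared transaction',
       gid,
       age(transaction)
FROM pg_prepared_xacts
ORDER BY xmin_age DESC;
```

The row with the largest `xmin_age` marks the point before which vacuum can remove dead tuples; nothing deleted after that snapshot was taken can be reclaimed until it goes away.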

## 38.4 Table and Index Bloat Detection

Bloat is the difference between actual table size and "minimum" size (what VACUUM FULL would produce).

### 38.4.1 Bloat Estimation Queries

**Standard Approach (pgstattuple extension):**

```sql
-- Install extension (superuser)
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Check specific table
SELECT 
    table_len,
    tuple_count,
    tuple_len,
    dead_tuple_count,
    dead_tuple_len,
    free_space,
    free_percent,
    pg_size_pretty(table_len) as total_size,
    pg_size_pretty(tuple_len) as live_data,
    pg_size_pretty(dead_tuple_len) as dead_data
FROM pgstattuple('large_table');

-- Estimate bloat percentage
-- > 20% bloat warrants investigation
-- > 50% bloat requires remediation
```
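The percentage itself is a simple ratio of the columns above. The sketch below is self-contained: it builds a throwaway table (`bloat_pct_demo`, a hypothetical name), generates dead tuples, and then measures it. Treating "dead bytes plus free space" as bloat is a simplifying assumption, since some free space is intentional (for example, `fillfactor` headroom).

```sql
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Throwaway table with one round of dead tuples (temp tables are not autovacuumed)
CREATE TEMP TABLE bloat_pct_demo AS
SELECT gs AS id, 'data' || gs AS data FROM generate_series(1, 10000) gs;
UPDATE bloat_pct_demo SET data = data || '_updated';

-- Exact measurement: reads the whole table
SELECT
    pg_size_pretty(table_len) AS total_size,
    round(100.0 * (dead_tuple_len + free_space) / nullif(table_len, 0), 2) AS bloat_pct
FROM pgstattuple('bloat_pct_demo');

-- Cheaper approximation: skips pages the visibility map marks all-visible
SELECT
    pg_size_pretty(table_len) AS total_size,
    scanned_percent,
    round(100.0 * (dead_tuple_len + approx_free_space) / nullif(table_len, 0), 2) AS approx_bloat_pct
FROM pgstattuple_approx('bloat_pct_demo');
```

On large production tables `pgstattuple()` performs a full scan, so `pgstattuple_approx()` is usually the better choice for routine monitoring.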

**Non-Extension Estimation (using statistics):**

```sql
-- Estimated bloat query (works without pgstattuple)
-- Rough estimate from planner statistics: run ANALYZE first. Alignment padding,
-- NULL bitmaps, and TOAST storage are ignored, so treat the result as a hint.
WITH constants AS (
    SELECT current_setting('block_size')::numeric AS block_size,
           24 AS page_header_size,    -- page header
           24 AS tuple_header_size,   -- heap tuple header (approximate, padded)
           4  AS item_id_size         -- line pointer
),
row_width AS (
    -- Average data width per row, summed over analyzed columns
    SELECT schemaname, tablename, sum(avg_width) AS datawidth
    FROM pg_stats
    WHERE NOT inherited
    GROUP BY schemaname, tablename
),
bloat_info AS (
    SELECT
        s.schemaname,
        s.relname AS tablename,
        c.oid AS relid,
        c.reltuples,
        c.relpages,
        -- rows / rows-per-page, rounded up
        ceil(c.reltuples
             / nullif(floor((block_size - page_header_size)
                            / (w.datawidth + tuple_header_size + item_id_size)), 0)) AS expected_pages
    FROM pg_class c
    JOIN pg_stat_user_tables s ON s.relid = c.oid
    JOIN row_width w ON w.schemaname = s.schemaname AND w.tablename = s.relname
    CROSS JOIN constants
    WHERE c.relkind = 'r'
      AND c.relpages > 0
      AND c.reltuples > 0
)
SELECT
    tablename,
    round(reltuples::numeric) as rows,
    relpages as actual_pages,
    expected_pages,
    round((relpages - expected_pages) / relpages * 100) as bloat_pct,
    pg_size_pretty(pg_relation_size(relid)) as table_size,
    pg_size_pretty(((relpages - expected_pages) * current_setting('block_size')::numeric)::bigint) as wasted_space
FROM bloat_info
WHERE relpages > expected_pages * 1.2  -- 20% bloat threshold
ORDER BY (relpages - expected_pages) DESC;
```

### 38.4.2 Index Bloat Detection

Indexes bloat more aggressively than tables because each non-HOT UPDATE creates a new entry in every index on the table, and old entries aren't reclaimed until vacuum runs and index cleanup completes.

```sql
-- pgstatindex() is provided by the pgstattuple extension (B-tree indexes only)
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT 
    s.indexrelname,
    pg_size_pretty(pg_relation_size(s.indexrelid)) as index_size,
    s.idx_scan as index_scans,
    s.idx_tup_read,
    s.idx_tup_fetch,
    p.tree_level,
    p.internal_pages,
    p.leaf_pages,
    p.empty_pages,
    p.deleted_pages,
    p.avg_leaf_density,
    p.leaf_fragmentation
FROM pg_stat_user_indexes s
CROSS JOIN LATERAL pgstatindex(s.indexrelid::regclass) p
WHERE s.indexrelname = 'large_table_pkey'
  AND p.avg_leaf_density < 50;  -- Freshly built B-trees sit near 90%; below 50% indicates bloat
```

**Index Bloat Remediation:**
```sql
-- Non-blocking index rebuild (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_large_table_column;

-- For versions before 12 (not possible for indexes that back a PRIMARY KEY or UNIQUE constraint):
CREATE INDEX CONCURRENTLY idx_new ON large_table(column_name);
DROP INDEX CONCURRENTLY idx_old;
ALTER INDEX idx_new RENAME TO idx_old;
```

## 38.5 Bloat Remediation Strategies

When autovacuum falls behind or massive bloat accumulates, manual intervention is required.

### 38.5.1 Aggressive Manual Vacuum

```sql
-- Step 1: Vacuum with freeze and verbose
VACUUM (VERBOSE, ANALYZE, FREEZE) bloated_table;

-- Step 2: If still bloated, vacuum with disabled cost limits (full speed)
SET vacuum_cost_delay = 0;
SET vacuum_cost_limit = 10000;
VACUUM (VERBOSE) bloated_table;
RESET vacuum_cost_delay;
RESET vacuum_cost_limit;
```

### 38.5.2 VACUUM FULL and Alternatives

**VACUUM FULL** is the nuclear option—it rewrites the table completely, reclaiming all bloat, but requires exclusive locks.

```sql
-- Standard VACUUM FULL (blocking)
VACUUM FULL bloated_table;
-- Lock duration: proportional to table size
-- Disk space: Requires 2x table size during operation
```

**pg_repack (Industry Standard Alternative):**
`pg_repack` is an extension that performs online table repacking without long exclusive locks.

```bash
# Installation
pgxn install pg_repack
# Or package manager: apt install postgresql-16-repack

# Usage (creates shadow table, copies data, swaps)
pg_repack -d mydb -t bloated_table --no-order

# With specific ordering (like CLUSTER but online)
pg_repack -d mydb -t large_table -o "created_at DESC"
```

**How pg_repack works:**
1. Creates shadow table with same structure
2. Adds trigger to original table to capture changes during copy
3. Copies data in batches (non-blocking)
4. Applies captured changes
5. Swaps tables using exclusive lock (brief, usually milliseconds)

### 38.5.3 Partitioning as Bloat Prevention

For tables with predictable data lifecycles (time-series), partitioning prevents bloat accumulation:

```sql
-- Instead of deleting old rows (creates dead tuples):
DELETE FROM logs WHERE created_at < NOW() - INTERVAL '1 year';

-- Use partition dropping:
DROP TABLE logs_2023_q1;  -- Instant, no dead tuples, no vacuum needed
```
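For context, a partitioned layout that makes this possible might look like the following. This is a minimal sketch using declarative range partitioning (PostgreSQL 10+); the `logs_demo` table, its columns, and the quarterly boundaries are illustrative assumptions rather than the chapter's actual schema.

```sql
-- Parent table: holds no rows itself, only routes them to partitions
CREATE TABLE logs_demo (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    message    text
) PARTITION BY RANGE (created_at);

-- One partition per quarter (lower bound inclusive, upper bound exclusive)
CREATE TABLE logs_demo_2023_q1 PARTITION OF logs_demo
    FOR VALUES FROM ('2023-01-01') TO ('2023-04-01');
CREATE TABLE logs_demo_2023_q2 PARTITION OF logs_demo
    FOR VALUES FROM ('2023-04-01') TO ('2023-07-01');

INSERT INTO logs_demo VALUES
    (1, '2023-02-15', 'old entry'),
    (2, '2023-05-20', 'newer entry');

-- Retire the old quarter: detach first, then drop (or archive) at leisure
ALTER TABLE logs_demo DETACH PARTITION logs_demo_2023_q1;
DROP TABLE logs_demo_2023_q1;
```

Detaching before dropping keeps the step that touches the parent table short and leaves room to archive the old partition before removing it.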

## 38.6 ANALYZE and Statistics Management

While vacuum cleans dead tuples, ANALYZE updates the statistics used by the query planner. Stale statistics cause poor execution plans.

### 38.6.1 How Statistics Work

```sql
-- View table statistics
SELECT 
    attname,
    n_distinct,
    most_common_vals::text,
    most_common_freqs,
    histogram_bounds::text
FROM pg_stats
WHERE tablename = 'users' 
  AND schemaname = 'public';

-- n_distinct: 
--   >0 = estimated number of distinct values
--   <0 = negated fraction of total rows (e.g., -0.5 means distinct values ≈ 50% of rows)
-- most_common_vals: List of most frequent values
-- histogram_bounds: Value distribution for range queries
```

### 38.6.2 When to Analyze

1. **After bulk loads**: COPY or INSERT of >10% of table data
2. **After creating an expression index**: Statistics on the indexed expression are only gathered by the next ANALYZE
3. **After significant DML**: Large UPDATE/DELETE operations
4. **When plans degrade**: Sudden plan changes often indicate stale stats

```sql
-- Analyze specific columns (faster than full table)
ANALYZE users (email, status);

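-- ANALYZE always samples: it reads roughly 300 × statistics target rows
-- (30,000 rows at the default target of 100), regardless of table size.
-- There is no sampling-percentage option; adjust the statistics target instead.
ANALYZE VERBOSE large_table;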
```

### 38.6.3 Statistics Targets

Increase statistics granularity for columns used in complex WHERE clauses:

```sql
-- Default is 100 (number of most_common_vals and histogram buckets)
ALTER TABLE users ALTER COLUMN email SET STATISTICS 1000;
ANALYZE users;

-- This increases accuracy of selectivity estimates for:
-- WHERE email = 'specific@value'  (most_common_vals)
-- WHERE email BETWEEN 'a' AND 'b'  (histogram_bounds)
```

## 38.7 Routine Maintenance Checklist

### 38.7.1 Daily Monitoring

```sql
-- 1. Check for wraparound risk
SELECT datname, age(datfrozenxid) FROM pg_database 
WHERE age(datfrozenxid) > 500000000;

-- 2. Identify tables needing vacuum attention
SELECT schemaname, relname, n_dead_tup, n_live_tup,
       round(n_dead_tup*100.0/nullif(n_live_tup,0), 2) as pct_dead
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 10;

-- 3. Check for long-running transactions blocking vacuum
SELECT pid, usename, state, xact_start, now() - xact_start as duration, left(query, 50)
FROM pg_stat_activity
WHERE xact_start < now() - interval '1 hour'
  AND state != 'idle'
ORDER BY xact_start;
```

### 38.7.2 Weekly Tasks

```sql
-- Check bloat on largest tables
-- (Use pgstattuple or estimation query from section 38.4)

-- Review autovacuum effectiveness
SELECT 
    relname,
    vacuum_count + autovacuum_count as total_vacuums,
    analyze_count + autoanalyze_count as total_analyzes,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    n_tup_ins + n_tup_upd + n_tup_del as total_changes
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;
```

### 38.7.3 Monthly/Quarterly

1. **Index maintenance**: REINDEX CONCURRENTLY on heavily updated indexes
2. **Statistics refresh**: Increase statistics targets on problem columns
3. **Partition management**: Drop old partitions, create new ones
4. **Vacuum Full assessment**: For non-partitioned tables with >50% bloat that can't be repacked

## Chapter Summary

In this chapter, you learned:

1. **MVCC Fundamentals**: PostgreSQL keeps old row versions in the table until vacuum removes them. Transaction IDs determine visibility, and the 32-bit XID space requires periodic "freezing" of old tuples to prevent wraparound shutdown.

2. **Vacuum Mechanics**: Standard vacuum scans tables to identify dead tuples, removes index entries pointing to them, marks space as reusable, and updates the visibility map. It runs non-blocking but requires cleanup of indexes (the expensive phase).

3. **Autovacuum Tuning**: Default `scale_factor` settings (20%) are inappropriate for large tables. Configure per-table thresholds (`scale_factor = 0`, absolute thresholds) for tables >10GB. Increase `autovacuum_max_workers` to 4-6 on busy systems and reduce `vacuum_cost_delay` for SSD storage.

4. **Bloat Detection**: Use `pgstattuple` for accurate measurements or statistical estimation queries for monitoring. Watch for dead tuple ratios >20% or tables approaching transaction ID age limits (1.5B transactions).

5. **Remediation**: Use manual `VACUUM (FREEZE)` for urgent cleanup, `pg_repack` for online bloat removal without downtime, and `VACUUM FULL` only as last resort (requires exclusive lock). For time-series data, use partitioning and DROP instead of DELETE to avoid bloat entirely.

6. **ANALYZE Importance**: Statistics determine query plans. Analyze after bulk loads and set higher statistics targets (ALTER TABLE ... SET STATISTICS) for columns with skewed distributions or complex filtering.

7. **Monitoring**: Track `pg_stat_user_tables` for dead/live tuple ratios, `pg_stat_progress_vacuum` for active vacuum phases, and `age(datfrozenxid)` for wraparound prevention. Terminate long-running transactions that block vacuum progress.

---

**Next:** In Chapter 39, we will explore Monitoring and Diagnostics—covering the essential metrics to watch (latency, locks, WAL, vacuum), slow query logging and sampling approaches, useful system views (`pg_stat_*`), and investigation workflows for production incidents.

