# Chapter 14: How PostgreSQL Executes Queries

Understanding PostgreSQL's query execution architecture is essential for performance tuning. This chapter demystifies the planner and executor, explains how PostgreSQL chooses between scan types and join strategies, and establishes the foundation for interpreting execution plans. Mastering these internals allows you to predict query behavior, diagnose performance issues, and write SQL that aligns with the optimizer's strengths.

## 14.1 The Query Processing Pipeline

Before PostgreSQL executes a statement, the statement passes through several transformation stages. Knowing which stage does what helps you tell syntax errors, planning problems, and execution bottlenecks apart.

### 14.1.1 The Four Stages of Query Execution

```sql
-- Example query that will traverse all stages:
SELECT u.user_id, u.email, COUNT(o.order_id) as order_count
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE u.created_at > '2024-01-01'
  AND u.status = 'active'
GROUP BY u.user_id, u.email
HAVING COUNT(o.order_id) > 5
ORDER BY order_count DESC
LIMIT 10;
```

**Stage 1: Parser**
- Checks SQL syntax against grammar rules
- Resolves table and column names against the system catalogs (the analysis step that follows raw parsing)
- Produces a **query tree** (internal representation)
- Errors here: `syntax error at or near "FROM"` (grammar), `column "emial" does not exist` (name resolution)

**Stage 2: Rewriter**
- Handles view expansion (replaces view names with underlying query)
- Applies **rule system** (rarely used in modern apps)
- Rewrites `INSERT`/`UPDATE`/`DELETE` on updatable views to target the underlying tables
- Handles row-level security (RLS) predicate injection

**Stage 3: Planner/Optimizer**
- The most complex stage: generates execution plans and selects the cheapest
- Considers scan methods, join orders, join algorithms
- Estimates costs using statistics (covered in 14.5)
- Output: **Plan Tree** (nodes like Seq Scan, Index Scan, Hash Join)

**Stage 4: Executor**
- Walks the plan tree and executes nodes
- Calls storage manager to fetch pages from disk or buffer cache
- Applies predicates, performs joins, sorts results
- Returns rows to client
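
To watch the stages in action, PostgreSQL can log the tree produced after each one. The following is a minimal sketch for exploration only, assuming the `users` table from the example above and a role that is allowed to change these settings. The output is very verbose.

```sql
-- Send LOG-level messages to the client as well as the server log
SET client_min_messages = log;

SET debug_print_parse = on;      -- query tree after parsing and analysis
SET debug_print_rewritten = on;  -- query tree after the rewriter
SET debug_print_plan = on;       -- plan tree chosen by the planner
SET debug_pretty_print = on;     -- indent the dumped trees

SELECT user_id FROM users WHERE user_id = 123;

-- Restore the defaults afterwards
RESET debug_print_parse;
RESET debug_print_rewritten;
RESET debug_print_plan;
RESET debug_pretty_print;
RESET client_min_messages;
```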

### 14.1.2 Understanding Plan Nodes

Every execution plan consists of **nodes** arranged in a tree. Data flows from leaf nodes (scans) up through intermediate nodes (joins, sorts) to the root.

```sql
-- Simple plan structure
EXPLAIN (FORMAT TEXT, ANALYZE) 
SELECT * FROM users WHERE user_id = 123;

-- Typical output:
-- Index Scan using users_pkey on users  (cost=0.29..8.30 rows=1 width=72) (actual time=0.012..0.013 rows=1 loops=1)
--   Index Cond: (user_id = 123)
-- Planning Time: 0.123 ms
-- Execution Time: 0.025 ms

-- Node breakdown:
-- "Index Scan" = Node type (operation)
-- "using users_pkey" = Specific index used
-- "on users" = Target table
-- (cost=0.29..8.30...) = Planner's cost estimate (startup..total)
-- (actual time=0.012..0.013...) = Measured time in ms (first row..all rows), per loop
-- rows=1 = Appears twice: the planner's estimate in the cost group, the measured count in the actual group
-- width=72 = Estimated average bytes per row
-- loops=1 = How many times this node executed; actual time and rows are per-loop averages
--           (important for nested loops)
```
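
Because `EXPLAIN ANALYZE` really executes the statement, measuring a data-modifying statement changes data. A common precaution, sketched here against the `users` table used throughout this chapter, is to wrap the measurement in a transaction and roll it back. The `'inactive'` status value is only an illustration.

```sql
BEGIN;

-- The UPDATE runs for real so that actual times and row counts can be reported
EXPLAIN (ANALYZE, BUFFERS)
UPDATE users SET status = 'inactive' WHERE user_id = 123;

-- Discard the change made while measuring
ROLLBACK;
```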

## 14.2 The Planner: How Decisions Are Made

The planner's job is to find the fastest execution path. It uses a **cost-based optimizer** that assigns arbitrary cost units to operations and selects the plan with the lowest total cost.

### 14.2.1 Cost Model Fundamentals

PostgreSQL's cost model uses abstract units, not milliseconds. The planner estimates disk I/O and CPU operations.

```sql
-- View current cost settings (arbitrary units relative to seq_page_cost)
SHOW seq_page_cost;        -- Default: 1.0 (cost of sequential page fetch)
SHOW random_page_cost;     -- Default: 4.0 (cost of random page fetch, higher due to seek)
SHOW cpu_tuple_cost;       -- Default: 0.01 (processing one row)
SHOW cpu_index_tuple_cost; -- Default: 0.005 (processing index entry)
SHOW cpu_operator_cost;    -- Default: 0.0025 (processing operator/function)

-- Cost calculation example for sequential scan:
-- Table: 10,000 rows, 100 pages (100 rows/page)
-- Seq Scan cost = (pages * seq_page_cost) + (rows * cpu_tuple_cost)
--               = (100 * 1.0) + (10000 * 0.01)
--               = 100 + 100 = 200

-- Index Scan cost calculation (simplified):
-- Index pages to traverse + random page fetches + tuple processing
-- Higher random_page_cost makes index scans more expensive unless very selective

-- When to adjust random_page_cost:
-- SSD storage: Lower to 1.1 or 1.2 (random reads almost as fast as sequential)
-- RAID arrays with large cache: 2.0-2.5
-- Spinning disks with heavy load: Keep at 4.0 or higher
```
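
The sequential scan formula can be checked against the catalog. The sketch below assumes `users` has been analyzed recently, so `relpages` and `reltuples` in `pg_class` are current. The alias `est_seq_scan_cost` is just a label chosen for this example.

```sql
-- Recompute pages * seq_page_cost + rows * cpu_tuple_cost from catalog data
SELECT relpages,
       reltuples,
       relpages  * current_setting('seq_page_cost')::float8
     + reltuples * current_setting('cpu_tuple_cost')::float8 AS est_seq_scan_cost
FROM pg_class
WHERE relname = 'users';

-- Compare with the total cost the planner reports for an unfiltered scan
EXPLAIN SELECT * FROM users;
```

The two numbers should be close but need not match exactly: the planner scales the stored counts to the table's current size, and a `WHERE` clause adds `cpu_operator_cost` per row for each operator evaluated.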

### 14.2.2 Selectivity and Cardinality Estimation

The planner estimates how many rows match a condition (**selectivity**) to choose between scan types.

```sql
-- Statistics overview
SELECT 
    schemaname, tablename, attname as column,
    n_distinct, 
    null_frac,
    correlation,
    most_common_vals::text::text[] as common_values,
    most_common_freqs as frequencies
FROM pg_stats 
WHERE tablename = 'users' 
  AND attname = 'status';

-- n_distinct: 
--   > 0: Estimated number of distinct values
--   < 0: Fraction of total rows (e.g., -0.5 means 50% of rows are distinct)
-- null_frac: Fraction of rows that are NULL
-- correlation: How closely physical row order follows column order (-1.0 to 1.0; near +1 or -1 = well ordered)
-- most_common_vals: Array of most frequent values (histogram for others)

-- Selectivity calculation example:
-- Table has 1,000,000 rows
-- status column: n_distinct = 5 (active, pending, deleted, suspended, archived)
-- Query: WHERE status = 'active'
-- If 'active' is 60% of rows (from most_common_freqs), selectivity = 0.6
-- Estimated rows = 1,000,000 * 0.6 = 600,000 rows

-- Planner decisions based on the estimated fraction of rows:
-- Small fraction (few rows match): Index Scan is usually cheapest
-- Large fraction (many rows match): Sequential Scan usually wins (sequential I/O beats many random reads)

-- When statistics are wrong:
ANALYZE users;  -- Update statistics immediately
-- Or for specific columns:
ANALYZE users (status, created_at);

-- Check actual vs estimated rows:
EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM users WHERE status = 'active';
-- Compare estimate and measurement, e.g. "(cost=... rows=1000 ...) (actual time=... rows=50000 loops=1)"
-- A large gap usually points to stale or insufficient statistics
```
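
When a fresh `ANALYZE` still yields poor estimates for a skewed column, collecting a larger sample for that column sometimes helps. This is a sketch under the assumption that `status` is the problematic column; the target of 500 is an arbitrary illustration, not a recommendation.

```sql
-- Session-wide default used when a column has no explicit target
SHOW default_statistics_target;

-- Keep more most-common values and histogram buckets for one column
ALTER TABLE users ALTER COLUMN status SET STATISTICS 500;
ANALYZE users (status);

-- Revert to the default target later if it does not help
ALTER TABLE users ALTER COLUMN status SET STATISTICS -1;
```

Larger targets make `ANALYZE` slower and planning slightly more expensive, so it is usually better to raise them per column than globally.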

## 14.3 Scan Types: How PostgreSQL Reads Data

**Scan** nodes form the leaves of almost every execution plan. PostgreSQL chooses between sequential scans, index scans, index-only scans, and bitmap scans based on cost estimates.

### 14.3.1 Sequential Scan (Seq Scan)

Reads every row in the table sequentially. Often faster than index scans for large portions of the table due to sequential I/O efficiency.

```sql
-- Force a sequential scan (for testing)
SET enable_indexscan = off;
SET enable_bitmapscan = off;

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM users WHERE status = 'active';

-- Typical output:
-- Seq Scan on users  (cost=0.00..15406.00 rows=50000 width=72) 
--                    (actual time=0.015..125.432 rows=50000 loops=1)
--   Filter: (status = 'active')
--   Rows Removed by Filter: 50000
--   Buffers: shared read=10834

-- Analysis:
-- "Seq Scan on users": Reading table sequentially
-- "Filter": Applied after reading row (not index lookup)
-- "Rows Removed by Filter": 100k rows read, 50k kept, 50k discarded (50% of rows match)
-- "Buffers: shared read=10834": 10834 pages were not in shared buffers and were read from the OS
--   (disk or OS page cache); pages found in shared buffers would appear as "shared hit"

-- When Seq Scan is optimal:
-- 1. Table is small (< few thousand rows)
-- 2. Query matches large percentage of rows (>5-10%)
-- 3. No suitable index exists
-- 4. Index would require random I/O for many rows (slower than sequential read)

-- Parallel sequential scans (PostgreSQL 9.6+)
EXPLAIN (ANALYZE) SELECT * FROM large_table WHERE x > 100;
-- Workers Planned: 2
-- Workers Launched: 2
-- Multiple processes scan different portions of the table simultaneously
```
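
The `enable_*` switches stay in effect for the rest of the session, which makes it easy to forget them. One way to keep an experiment contained, shown here as a sketch using the same `users` query, is `SET LOCAL` inside a transaction, so the settings vanish at `ROLLBACK`.

```sql
BEGIN;

-- These settings last only until the end of this transaction
SET LOCAL enable_indexscan = off;
SET LOCAL enable_bitmapscan = off;
SET LOCAL enable_indexonlyscan = off;

EXPLAIN
SELECT * FROM users WHERE status = 'active';

ROLLBACK;  -- planner settings return to their previous values
```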

### 14.3.2 Index Scan

Uses an index to find rows, then fetches the actual table data (heap). Efficient for retrieving small numbers of rows.

```sql
-- Restore the planner defaults changed in the previous example
RESET enable_indexscan;
RESET enable_bitmapscan;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE user_id = 123;

-- Typical output:
-- Index Scan using users_pkey on users  
--   (cost=0.29..8.30 rows=1 width=72) 
--   (actual time=0.008..0.009 rows=1 loops=1)
--   Index Cond: (user_id = 123)
--   Buffers: shared hit=3

-- Analysis:
-- "Index Scan": Uses index to find row location, then fetches from heap
-- "using users_pkey": B-tree index on user_id
-- "Index Cond": Predicate applied to index (efficient)
-- "Buffers: shared hit=3": 3 buffer cache hits (index root, leaf, heap page)

-- Index Scan with multiple conditions
EXPLAIN (ANALYZE)
SELECT * FROM users 
WHERE status = 'active' 
  AND created_at > '2024-01-01';

-- If index exists on (status) or (created_at), planner chooses based on selectivity
-- May see: Index Scan using idx_status, then Filter: (created_at > '2024-01-01')
-- Or: Bitmap Index Scan combining both indexes (see next section)

-- Backward index scan (for DESC order)
EXPLAIN (ANALYZE)
SELECT * FROM users 
WHERE user_id < 1000 
ORDER BY user_id DESC 
LIMIT 10;
-- Index Scan Backward using users_pkey
-- B-trees can be scanned in either direction, so an ascending index serves ORDER BY ... DESC without a sort
```

### 14.3.3 Bitmap Index Scan

Combines multiple indexes or handles moderate selectivity efficiently by avoiding random I/O for many rows.

```sql
-- Bitmap scan example (moderate selectivity, multiple conditions)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users 
WHERE status = 'active' 
  AND email_verified = true
  AND created_at > '2023-01-01';

-- Typical output:
-- Bitmap Heap Scan on users  
--   (cost=24.50..1500.80 rows=500 width=72)
--   (actual time=0.523..5.234 rows=450 loops=1)
--   Recheck Cond: ((status = 'active') AND (email_verified = true))
--   Filter: (created_at > '2023-01-01')
--   Heap Blocks: exact=200
--   ->  BitmapAnd  
--       ->  Bitmap Index Scan on idx_status  
--             Index Cond: (status = 'active')
--       ->  Bitmap Index Scan on idx_email_verified  
--             Index Cond: (email_verified = true)

-- How Bitmap Scans work:
-- 1. Scan each index to create a bitmap of matching row locations
-- 2. Combine bitmaps (AND/OR) in memory
-- 3. Visit only the heap pages marked in the bitmap, in physical block order
-- 4. "Recheck Cond": Re-evaluated only when the bitmap became lossy (page-level, to stay
--    within work_mem); "Heap Blocks: exact=200" above means no recheck was needed

-- Advantages over Index Scan:
-- 1. Reduces random I/O when fetching many rows (each heap page is visited once, in physical order)
-- 2. Can combine multiple indexes efficiently (BitmapAnd/BitmapOr)
-- 3. Better cache utilization for moderate selectivity (5-20% of table)

-- When Bitmap Scan is chosen:
-- 1. Moderate selectivity (too many rows for Index Scan random I/O, too few for Seq Scan)
-- 2. Multiple index conditions combined with AND
-- 3. OR conditions across different indexes

-- Bitmap Index Scan vs Bitmap Heap Scan:
-- Bitmap Index Scan: Builds the bitmap from index pages
-- Bitmap Heap Scan: Uses the bitmap to fetch the matching table pages in physical order
```
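
The `OR` case mentioned above can be tried with the same two indexes. This is a sketch assuming `idx_status` and `idx_email_verified` exist as in the previous example. Whether the planner actually picks a `BitmapOr` depends on the data distribution, so no output is shown.

```sql
-- Each OR branch can be answered by its own index; a BitmapOr node may combine them
EXPLAIN
SELECT * FROM users
WHERE status = 'suspended'
   OR email_verified = false;
```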

### 14.3.4 Index-Only Scans (Covering Indexes)

When all requested columns exist in the index, PostgreSQL can avoid accessing the heap entirely.

```sql
-- Index-only scan example
-- Table: users(user_id PK, email, status)
-- Index: CREATE INDEX idx_email_status ON users(email, status);

EXPLAIN (ANALYZE, BUFFERS)
SELECT email, status FROM users 
WHERE email = 'alice@example.com';

-- Output:
-- Index Only Scan using idx_email_status on users  
--   (cost=0.29..4.30 rows=1 width=36)
--   (actual time=0.008..0.009 rows=1 loops=1)
--   Index Cond: (email = 'alice@example.com')
--   Heap Fetches: 0  <-- Key indicator: no heap access needed
--   Buffers: shared hit=2

-- Visibility Map Check:
-- Index entries carry no visibility information, so PostgreSQL consults the Visibility Map (VM)
-- A page marked all-visible holds only tuples that are visible to every transaction
-- If the VM says "all visible", the heap fetch is skipped
-- If the page was modified since the last VACUUM, the row is fetched from the heap ("Heap Fetches" > 0)

-- Check vacuum activity (an indirect indicator of visibility map freshness):
SELECT 
    relname,
    n_live_tup,
    n_dead_tup,
    last_vacuum,
    last_autovacuum
FROM pg_stat_user_tables 
WHERE relname = 'users';
-- High n_dead_tup or an old last_(auto)vacuum suggests many pages are not all-visible, so index-only scans degrade

-- Covering index with INCLUDE (PostgreSQL 11+)
-- Include non-key columns to enable index-only scans for more queries
CREATE INDEX idx_users_email_covering ON users(email) 
INCLUDE (full_name, created_at, status);
-- email is index key (tree structure)
-- full_name, created_at, status are payload (stored in leaf pages only)
-- Query can be index-only: SELECT full_name, created_at FROM users WHERE email = 'x'

-- Index-only scan limitations:
-- 1. Requires all columns in index or INCLUDE
-- 2. Visibility map must be reasonably current (recent VACUUM)
-- 3. Cannot work if query references system columns (ctid, xmin, etc.)
-- 4. Expressions in SELECT must match index expressions exactly
```
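
A more direct view of visibility map coverage is available in `pg_class`. The sketch below assumes the `users` table; the alias `pct_all_visible` is a label chosen for this example. Both counters are estimates maintained by `VACUUM` and `ANALYZE`.

```sql
-- Refresh the visibility map and the statistics (cannot run inside a transaction block)
VACUUM (ANALYZE) users;

-- Share of heap pages currently marked all-visible
SELECT relname,
       relpages,
       relallvisible,
       round(100.0 * relallvisible / NULLIF(relpages, 0), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'users';
```

The closer the percentage is to 100, the fewer heap fetches an index-only scan on this table should need.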

## 14.4 Join Strategies and Algorithms

PostgreSQL implements three join algorithms: Nested Loop, Hash Join, and Merge Join. The planner chooses based on table sizes, indexes, and selectivity.

### 14.4.1 Nested Loop Join

Best for small outer tables with indexed inner tables. Conceptually: for each outer row, scan inner table.

```sql
-- Nested Loop example: Small table driving large indexed table
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.email, o.order_id, o.total
FROM users u
JOIN orders o ON u.user_id = o.user_id
WHERE u.status = 'active'
  AND u.created_at > NOW() - INTERVAL '7 days';
-- Assumption: users table has 100 recent active users
-- orders table has 1,000,000 rows with index on user_id

-- Typical plan (abridged; assumes idx_users_created_at covers (status, created_at)):
-- Nested Loop  (cost=0.29..1234.56 rows=500 width=50)
--   ->  Index Scan using idx_users_created_at on users u
--         Index Cond: ((status = 'active') AND (created_at > ...))
--         (about 100 rows from the outer side)
--   ->  Index Scan using idx_orders_user_id on orders o
--         Index Cond: (user_id = u.user_id)
--         (about 5 rows per loop, 100 loops)

-- How Nested Loop works:
-- 1. Scan outer table (users) to get 100 rows
-- 2. For each of the 100 rows:
--    a. Take the user_id
--    b. Probe inner table (orders) using index on user_id
--    c. Return matching rows
-- 3. Total inner probes: 100 index scans

-- Cost calculation:
-- Outer scan cost: ~10 (index scan on users)
-- Inner probe cost: ~4 per iteration (index probe on orders)
-- Total: 10 + (100 * 4) = 410 (plus tuple processing)

-- When Nested Loop is chosen:
-- 1. Outer table is small (low number of rows)
-- 2. Inner table has index on join column
-- 3. Join condition is selective (few matches per outer row)
-- 4. No better alternative (hash/merge not viable)

-- Nested Loop with Materialize (caching inner results)
-- If inner side is expensive and outer is small, PostgreSQL may materialize (cache) inner results
-- ->  Materialize  (cost=0.00..234.00 rows=1000 width=20)
--       ->  Seq Scan on small_lookup_table
-- Then nested loop probes the materialized result instead of rescanning
```

### 14.4.2 Hash Join

Best when joining large tables without suitable indexes. Builds a hash table from the smaller relation, then probes with the larger.

```sql
-- Hash Join example: Two large tables, no suitable index
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.user_id, u.email, o.order_id
FROM users u
JOIN orders o ON u.user_id = o.user_id
WHERE u.created_at BETWEEN '2023-01-01' AND '2023-12-31'
  AND o.total > 1000;
-- Assumption: 50,000 users in date range, 200,000 large orders (out of 1,000,000)
-- No index on o.total (or planner decides it's not selective enough)

-- Typical plan:
-- Hash Join  (cost=12345.67..67890.12 rows=50000 width=40)
--   Hash Cond: (o.user_id = u.user_id)
--   ->  Seq Scan on orders o
--         Filter: (total > 1000)
--         Rows Removed by Filter: 800000
--   ->  Hash
--         ->  Index Scan using idx_users_created_at on users u
--               Index Cond: ((created_at >= '2023-01-01') AND (created_at <= '2023-12-31'))

-- How Hash Join works:
-- 1. Build Phase: Scan smaller relation (users) and build hash table in memory
--    - Hash key: user_id
--    - Value: entire row (or needed columns)
-- 2. Probe Phase: Scan larger relation (orders)
--    - For each order row, hash the user_id
--    - Look up in hash table
--    - If match found, emit joined row

-- Memory considerations:
-- work_mem (default 4MB) limits hash table size (multiplied by hash_mem_multiplier on PostgreSQL 13+)
-- If the hash table exceeds the limit, it is split into batches that spill to temp files (slower)
-- EXPLAIN ANALYZE shows, under the Hash node: "Buckets: 131072  Batches: 2  Memory Usage: 3072kB"
-- If Batches > 1, the join spilled to disk (performance concern)

-- When Hash Join is chosen:
-- 1. Joining large tables (no small outer table for nested loop)
-- 2. No suitable index on inner table (or index not selective enough)
-- 3. Equality join condition (hash requires exact match)
-- 4. Can fit in work_mem (or acceptable spill to disk)

-- Hash Join vs Merge Join:
-- Hash Join: Good when one table is much smaller, no index requirement
-- Merge Join: Good when both tables are sorted or sortable efficiently
```
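
To check whether a spilling hash join would fit in memory with a larger allowance, `work_mem` can be raised for a single session before re-running the plan. This sketch reuses the query from this section; `64MB` is an arbitrary illustrative value.

```sql
-- Raise the per-operation memory limit for this session only
SET work_mem = '64MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT u.user_id, u.email, o.order_id
FROM users u
JOIN orders o ON u.user_id = o.user_id
WHERE u.created_at BETWEEN '2023-01-01' AND '2023-12-31'
  AND o.total > 1000;
-- Under the Hash node, look for "Batches: 1" (the hash table stayed in memory)

RESET work_mem;
```

Keep in mind that `work_mem` applies per sort or hash operation, so a large server-wide value can multiply across concurrent queries.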

### 14.4.3 Merge Join (Sort-Merge Join)

Best when joining large, sorted datasets or when input is already ordered. Requires sortable data types and equality conditions.

```sql
-- Merge Join example: Pre-sorted data or efficient sorting
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.user_id, u.email, o.order_id, o.total
FROM users u
JOIN orders o ON u.user_id = o.user_id
ORDER BY u.user_id;
-- Assumption: Both tables large, joining on sorted key (user_id)
-- Query asks for output ordered by user_id (same as join key)

-- Typical plan:
-- Merge Join  (cost=0.57..34567.89 rows=1000000 width=60)
--   Merge Cond: (u.user_id = o.user_id)
--   ->  Index Scan using users_pkey on users u
--   ->  Index Scan using idx_orders_user_id on orders o

-- How Merge Join works:
-- 1. Sort Phase: Get both inputs sorted on join key (user_id)
--    - If inputs come from index scans, already sorted (cheapest)
--    - Otherwise, sort nodes added (expensive for large data)
-- 2. Merge Phase: 
--    - Start with first row from both inputs
--    - If match, emit joined row, advance inner pointer
--    - If outer < inner, advance outer
--    - If inner < outer, advance inner
--    - Continue until one input exhausted

-- Visual:
-- Users (sorted): [1, 2, 3, 5, 7, 9]
-- Orders (sorted): [1, 1, 2, 4, 5, 5, 5, 8]
-- Matches: (1,1), (1,1), (2,2), (5,5), (5,5), (5,5)

-- When Merge Join is chosen:
-- 1. Joining large tables on sorted columns
-- 2. Inputs naturally ordered (index scans) or sortable within work_mem
-- 3. Equality join conditions
-- 4. Query requires output sorted by join key (merge preserves order, avoids final sort)

-- Sort node costs:
-- If inputs not pre-sorted:
-- Sort  (cost=10000.00..10500.00 rows=200000 width=40)
--   Sort Key: u.user_id
--   Sort Method: quicksort  Memory: 25000kB
--   ->  Seq Scan on users u
-- "Memory" indicates in-memory sort (fast)
-- If "Sort Method: external merge  Disk: ...kB" is shown, the sort spilled to disk (slow; consider raising work_mem)
```

### 14.4.4 Join Order and Join Tree Shapes

For multi-table joins, the planner chooses the order of joining and the shape of the join tree (left-deep, bushy, right-deep).

```sql
-- Three-way join example
EXPLAIN (ANALYZE)
SELECT u.email, o.order_id, p.amount
FROM users u
JOIN orders o ON u.user_id = o.user_id
JOIN payments p ON o.order_id = p.order_id
WHERE u.status = 'active';

-- Possible plans:
-- 1. ((users JOIN orders) JOIN payments) - Left-deep tree
--    - Join users->orders (the filter on users shrinks the input first)
--    - Join result->payments
--    - Often cheapest here because the intermediate result stays small

-- 2. (users JOIN (orders JOIN payments)) - Right-deep tree
--    - Join orders->payments first (might be large)
--    - Then join with users
--    - Rarely optimal unless orders->payments is highly selective
--    (A bushy tree joins two join results and needs at least four tables)

-- Planner uses dynamic programming (DP) for join ordering:
-- - For < 12 tables: Exhaustive search (all possible orders)
-- - For >= 12 tables: Genetic algorithm (GEQO) to avoid combinatorial explosion
-- Controlled by: geqo_threshold (default 12), geqo_effort

-- Forcing join order (rarely needed, for testing):
SET join_collapse_limit = 1;  -- Prevents reordering of explicit JOINs
-- Or use CTEs (materialize intermediate results)
WITH active_users AS MATERIALIZED (
    SELECT * FROM users WHERE status = 'active'
)
SELECT * FROM active_users u
JOIN orders o ON u.user_id = o.user_id;
-- MATERIALIZED forces the CTE to be computed separately (like a temp table)
-- Without MATERIALIZED, PostgreSQL 12+ inlines a non-recursive, side-effect-free CTE that is
-- referenced only once and optimizes it together with the main query
```

## 14.5 Statistics: The Foundation of Good Plans

The planner is only as good as its statistics. Understanding how PostgreSQL collects and uses statistics is crucial for diagnosing plan quality.

### 14.5.1 How Statistics Are Collected

```sql
-- Manual statistics collection
ANALYZE users;  -- Update stats for table
ANALYZE users (status, created_at);  -- Specific columns only

-- Automatic statistics collection
-- Autovacuum daemon runs ANALYZE automatically:
-- - When 10% of table changes (default: autovacuum_analyze_threshold = 50, 
--   autovacuum_analyze_scale_factor = 0.1)
-- - For large tables: 10% of 100M rows = 10M changes before analyze!
--   Consider lowering scale_factor for large tables:
ALTER TABLE big_table SET (autovacuum_analyze_scale_factor = 0.01);  -- 1% for big tables

-- Viewing statistics
SELECT * FROM pg_stats WHERE tablename = 'users';
SELECT * FROM pg_statistic WHERE starelid = 'users'::regclass;  -- Raw binary stats

-- Extended statistics (PostgreSQL 10+)
-- For correlations between columns
CREATE STATISTICS stats_users_status_date ON status, created_at FROM users;
ANALYZE users;
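-- Helps the planner estimate combined selectivity when status and created_at are correlated
-- (functional dependencies cover equality conditions; multivariate MCV lists, PostgreSQL 12+, also help with ranges)
-- Without this, planner assumes independence (multiplies selectivities)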
```
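
To see which tables are drifting away from their last `ANALYZE`, the cumulative statistics views can be compared with the autoanalyze trigger point. This is a sketch: the alias `analyze_trigger_rows` is a label for this example, and the calculation uses the global settings, ignoring any per-table storage parameter overrides.

```sql
-- Tables with the most modifications since their statistics were last refreshed
SELECT s.relname,
       s.n_mod_since_analyze,
       s.last_analyze,
       s.last_autoanalyze,
       current_setting('autovacuum_analyze_threshold')::int
     + current_setting('autovacuum_analyze_scale_factor')::float8 * c.reltuples
         AS analyze_trigger_rows
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
ORDER BY s.n_mod_since_analyze DESC
LIMIT 10;
```

A table whose `n_mod_since_analyze` sits just below `analyze_trigger_rows` for long periods is a candidate for a lower per-table scale factor.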

### 14.5.2 Histograms and Selectivity

```sql
-- Most Common Values (MCV) list
-- For columns with few distinct values (e.g., status, boolean flags)
-- Stores top N values and their frequencies (default N=100)

-- Histogram bounds
-- For high-cardinality columns (timestamps, IDs, text)
-- Divides range into buckets (default 100 buckets)
SELECT histogram_bounds::text::text[] 
FROM pg_stats 
WHERE tablename = 'users' AND attname = 'created_at';
-- Returns array of timestamp boundaries

-- Selectivity estimation example:
-- Query: WHERE created_at > '2024-06-01'
-- The histogram stores 101 boundaries that split the non-MCV rows into 100 equal-population buckets
-- Suppose '2024-06-01' sits at the upper edge of bucket 6
-- Selectivity = (100 - 6) / 100 = 0.94 (94% of rows)
-- Planner will likely choose Seq Scan (most of the table matches)

-- Query: WHERE created_at > '2024-11-01'
-- Suppose it sits at the upper edge of bucket 91
-- Selectivity = (100 - 91) / 100 = 0.09 (9% of rows)
-- Planner may choose an Index Scan or Bitmap Scan (small fraction of the table)
-- (For a value inside a bucket, the planner interpolates linearly)
```

## 14.6 Specialized Scan Types

### 14.6.1 Tid Scan (Tuple ID Scan)

Directly accesses rows by their physical location (ctid). Used when PostgreSQL knows exactly which rows to fetch.

```sql
-- Tid Scan occurs with:
-- 1. ctid = constant (rare in apps)
-- 2. Current-of cursor operations
-- 3. Some subquery optimizations

EXPLAIN SELECT * FROM users WHERE ctid = '(0,1)';
-- Tid Scan on users
--   Tid Cond: (ctid = '(0,1)')

-- Practical use: Efficient updates of specific rows found via subquery
UPDATE users SET status = 'archived'
WHERE ctid IN (
    SELECT ctid FROM users 
    WHERE last_login < '2023-01-01' 
    LIMIT 1000
);
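-- Fetches each target row directly by ctid instead of re-running an index lookup per row
-- Caveat: a row's ctid changes when the row is updated, so never store ctids across statements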
```

### 14.6.2 Subquery Scans and Materialization

```sql
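-- SubPlan (uncorrelated subquery in WHERE)
-- Note: a plain IN (subquery) is normally turned into a semi join; NOT IN keeps the SubPlan form
EXPLAIN SELECT * FROM users 
WHERE user_id NOT IN (SELECT user_id FROM premium_users);

-- Plan (when the subquery result is too large to hash within work_mem):
-- Seq Scan on users
--   Filter: (NOT (SubPlan 1))
--   SubPlan 1
--     ->  Materialize  <-- Caches subquery result
--           ->  Seq Scan on premium_users
-- If the result fits, you see "Filter: (NOT (hashed SubPlan 1))" and no Materialize node

-- Materialize node:
-- - Executes subquery once, stores the rows
-- - Subsequent scans read from the stored rows (fast)
-- - If result exceeds work_mem, spills to disk (slow)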

-- InitPlan (uncorrelated scalar subquery, executes once)
EXPLAIN SELECT * FROM users 
WHERE created_at > (SELECT MAX(created_at) FROM deleted_users);
-- Seq Scan on users
--   Filter: (created_at > $0)  -- $0 is the result of the InitPlan
--   InitPlan 1 (returns $0)
--     ->  Aggregate
--           ->  Seq Scan on deleted_users
-- (Recent PostgreSQL releases label the parameter differently in this output)

-- Correlated subquery (conceptually runs once per outer row; EXISTS is usually planned as a semi join instead)
EXPLAIN SELECT * FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o 
    WHERE o.user_id = u.user_id  -- Correlated: references outer query
      AND o.total > 1000
);
-- Nested Loop Semi Join  <-- Semi Join = EXISTS optimization
--   ->  Seq Scan on users u
--   ->  Index Scan using idx_orders_user_id on orders o
--         Index Cond: (user_id = u.user_id)
--         Filter: (total > 1000)
```

## 14.7 Understanding Cost Estimates

Cost units are arbitrary but based on sequential page reads. Understanding cost components helps you identify why the planner made specific choices.

### 14.7.1 Cost Components Explained

```sql
-- Cost format: startup_cost..total_cost
-- startup_cost: Cost to produce first row (e.g., sorting must complete first)
-- total_cost: Cost to produce all rows

EXPLAIN (FORMAT JSON)
SELECT * FROM users ORDER BY email;

-- Sort node:
-- "Startup Cost": 15406.00 (must sort all rows before returning first)
-- "Total Cost": 17906.00 (sorting cost + sequential scan cost)
-- Sort Cost formula: (N * log2(N)) * cpu_operator_cost + comparison costs

-- Join cost estimation:
-- Nested Loop: outer_startup + (outer_cardinality * inner_cost_per_row)
-- Hash Join: inner_hash_build_cost + (outer_cardinality * probe_cost)
-- Merge Join: sort_outer_cost + sort_inner_cost + merge_cost

-- Worked example (back-of-envelope; real costing has more terms):
EXPLAIN SELECT * FROM users u JOIN orders o ON u.user_id = o.user_id;

-- If users has 10,000 rows (100 pages), orders has 1,000,000 rows (10,000 pages):
-- Nested Loop cost: 
--   Scan users: 100 + (10000 * 0.01) = 200 (seq scan)
--   Index probe orders: 4 per row * 10,000 = 40,000
--   Total: ~40,200

-- Hash Join cost:
--   Build hash on users (smaller): 100 + (10000 * 0.01) = 200
--   Scan orders (probe): 10,000 (seq scan) + (1000000 * 0.01) = 20,000
--   Total: ~20,200 (wins over nested loop)

-- Merge Join cost:
--   Sort users: 100 + (10000 * log2(10000) * 0.0025) ≈ 400
--   Sort orders: 10000 + (1000000 * log2(1000000) * 0.0025) ≈ 60,000
--   Merge: 1000000 * 0.01 = 10,000
--   Total: ~70,400 (loses due to sort cost on large table)
```

### 14.7.2 When Plans Go Wrong

```sql
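-- Statistics mismatch example
-- Stats were gathered at 1M rows; the table has since grown to 10M, mostly recent rows
EXPLAIN (ANALYZE) SELECT * FROM users WHERE created_at > '2024-06-01';
-- Illustrative symptom: Index Scan ... rows=100 ... (actual ... rows=2000000 loops=1)
-- The stale histogram knows little about the new rows, so the planner picked a plan
-- suited to about 100 rows (EXPLAIN always shows the plan that is actually run)

-- Fix:
ANALYZE users;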

-- Correlation and clustering
-- If table is physically ordered by user_id (high correlation), 
-- index scans are faster (sequential prefetch)
-- If table is random (low correlation), index scans cause random I/O

SELECT attname, correlation 
FROM pg_stats 
WHERE tablename = 'users' AND attname = 'user_id';
-- correlation=1.0: Perfectly sorted (index scan very fast)
-- correlation=0.0: Random order (index scan slower, random I/O)

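-- Partition pruning
-- If a table is partitioned by date, the planner skips partitions that cannot match
-- (enable_partition_pruning for declarative partitioning; constraint_exclusion for inheritance-based setups)
EXPLAIN SELECT * FROM events WHERE event_date = '2024-01-01';
-- Partition pruning: Only scans events_2024_01 partition
-- Shows: only the relevant child scans (under an "Append" node when more than one partition remains)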
```
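
Pruning is easiest to observe on a small declaratively partitioned table. The schema below is hypothetical (column names and partition bounds are assumptions for illustration) and is meant for a scratch database where no `events` table exists yet.

```sql
-- Parent table partitioned by month (hypothetical layout)
CREATE TABLE events (
    event_id   bigint,
    event_date date NOT NULL,
    payload    text
) PARTITION BY RANGE (event_date);

CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Pruning is on by default (PostgreSQL 11+)
SHOW enable_partition_pruning;

-- Only events_2024_01 should appear in the plan
EXPLAIN SELECT * FROM events WHERE event_date = '2024-01-01';
```

Range bounds are inclusive at the lower end and exclusive at the upper end, which is why adjacent partitions can share a boundary date.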

## 14.8 Practical Plan Reading

### 14.8.1 Identifying Bottlenecks

```sql
-- High buffer reads (disk I/O)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM large_table WHERE unindexed_column = 'value';
-- Buffers: shared read=100000  <-- High physical reads
-- Solution: Add index or improve query

-- High rows removed by filter (bad index selectivity)
EXPLAIN (ANALYZE)
SELECT * FROM users WHERE status = 'active' AND age > 25;
-- Index Scan using idx_status
--   Index Cond: (status = 'active')
--   Filter: (age > 25)
--   Rows Removed by Filter: 95000  <-- Index returned 100k rows, filter kept 5k
-- Solution: Composite index on (status, age) or (age, status) depending on selectivity

-- High planning time relative to execution time
EXPLAIN (ANALYZE)
SELECT * FROM complex_view WHERE id = 1;
-- Planning Time: 45.000 ms  <-- Too long (complex view, many tables)
-- Execution Time: 0.500 ms
-- Solution: Materialized view, or simplify view, or use plan caching (prepared statements)

-- Memory usage (hash operations)
EXPLAIN (ANALYZE)
SELECT col_a, col_b, COUNT(*) FROM large_table GROUP BY col_a, col_b;
-- HashAggregate
--   Batches: 5  Memory Usage: 4145kB  Disk Usage: 500000kB  <-- Exceeded work_mem, spilled to disk (slow)
-- (Illustrative PostgreSQL 13+ output; col_a and col_b stand in for the grouping columns)
-- Solution: Increase work_mem, or better indexing, or reduce GROUP BY complexity
```
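
For the planning-time case, a prepared statement lets the server reuse a plan instead of planning on every call. The sketch below assumes `complex_view.id` is an integer; the statement name `view_lookup` is made up for this example. By default PostgreSQL decides on its own when to switch to a reusable generic plan, and `plan_cache_mode` (PostgreSQL 12+) makes that choice explicit for testing.

```sql
-- Parse and analyze once; $1 is supplied at execution time
PREPARE view_lookup (int) AS
    SELECT * FROM complex_view WHERE id = $1;

-- Ask for the parameter-independent plan so it can be cached and reused
SET plan_cache_mode = force_generic_plan;

-- The first call builds the generic plan; later calls reuse it
EXECUTE view_lookup(1);
EXPLAIN (ANALYZE) EXECUTE view_lookup(1);

RESET plan_cache_mode;
DEALLOCATE view_lookup;
```

A generic plan cannot take the specific parameter value into account, so it is worth confirming that execution time does not get worse for skewed values.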

### 14.8.2 Plan Node Reference

```sql
-- Common node types you'll see:

-- 1. Scan Nodes (leaf nodes)
-- Seq Scan: Sequential table read
-- Index Scan: Index lookup + heap fetch
-- Index Only Scan: Index lookup only (no heap)
-- Bitmap Index Scan + Bitmap Heap Scan: Bitmap combination
-- Tid Scan: Direct ctid access
-- Subquery Scan: Wrapper for subquery results

-- 2. Join Nodes
-- Nested Loop: Iterate outer, probe inner
-- Hash Join: Build hash on inner, probe with outer
-- Merge Join: Sort both, merge
-- Nested Loop Semi Join: EXISTS optimization
-- Nested Loop Anti Join: NOT EXISTS optimization

-- 3. Aggregation Nodes
-- Aggregate: Single result row for aggregates without GROUP BY
-- GroupAggregate: Sorted input grouping
-- HashAggregate: Hash table grouping (usually faster)
-- MixedAggregate: For GROUPING SETS

-- 4. Sorting and Limiting
-- Sort: QuickSort or external merge sort
-- Limit: Stop after N rows
-- Limit (WITH TIES, PostgreSQL 13+): Same Limit node, also returns rows tied with the last one
-- Unique: Removes adjacent duplicates from sorted input (DISTINCT may also use HashAggregate)

-- 5. Set Operations
-- Append: UNION ALL, partitioned table scans
-- Merge Append: Merges pre-sorted inputs and keeps their order (ordered scans over partitions, UNION ALL with ORDER BY)
-- SetOp: INTERSECT/EXCEPT (via hashing or sorting)
-- HashSetOp: Hash-based INTERSECT/EXCEPT

-- 6. Modification Nodes
-- ModifyTable: Node that performs INSERT/UPDATE/DELETE
-- Insert on t, Update on t, Delete on t: How ModifyTable appears in text-format EXPLAIN output
-- LockRows: FOR UPDATE/SHARE locking
-- Result: Constant projection or one-time evaluation
```

## 14.9 Configuration Parameters Affecting Planning

```sql
-- Enable/disable specific plan types (for testing or hints)
SET enable_seqscan = off;        -- Heavily penalize seq scans so other plans win when possible (testing only)
SET enable_indexscan = on;       -- Allow index scans
SET enable_bitmapscan = on;      -- Allow bitmap scans
SET enable_tidscan = on;         -- Allow tid scans

SET enable_nestloop = off;       -- Discourage nested loops (testing only; not a production fix)
SET enable_hashjoin = on;        -- Allow hash joins
SET enable_mergejoin = on;       -- Allow merge joins

-- Cost constants (calibration)
SET seq_page_cost = 1.0;         -- Cost of sequential page read
SET random_page_cost = 4.0;      -- Cost of random page read (lower for SSDs: 1.1)
SET cpu_tuple_cost = 0.01;       -- Processing cost per row
SET cpu_index_tuple_cost = 0.005; -- Index entry processing cost
SET cpu_operator_cost = 0.0025;  -- Operator evaluation cost

-- Memory constraints
SET work_mem = '4MB';            -- Per-operation memory (sorts, hashes, bitmaps)
-- Hash joins: Build table must fit in work_mem or spill to disk
-- Sorts: External merge sort uses work_mem per sort operation
-- Bitmap scans: Bitmap size limited by work_mem (or uses "lossy" bitmaps)

-- Parallelism (PostgreSQL 9.6+)
SET max_parallel_workers_per_gather = 2;  -- Parallel workers for scans/joins
SET parallel_tuple_cost = 0.1;              -- Cost of passing tuples between processes
SET parallel_setup_cost = 1000;             -- Cost of starting worker processes

-- Genetic Query Optimizer (GEQO)
SET geqo = on;                    -- Enable for complex joins (geqo_threshold or more FROM items; default 12)
SET geqo_threshold = 12;          -- Switch to genetic algorithm above this
SET geqo_effort = 5;              -- Search effort (1-10, higher = better plans, slower planning)
```

## 14.10 Industry Best Practices and Anti-Patterns

### 14.10.1 When to Trust the Planner

```sql
-- Good: Sequential scan on small table
EXPLAIN SELECT * FROM countries WHERE continent = 'Europe';
-- Seq Scan is correct for 200 rows, even with index available
-- Random I/O of index scan would be slower than reading 2 pages sequentially

-- Good: Sequential scan on large unselective query
EXPLAIN SELECT * FROM logs WHERE level IN ('INFO', 'DEBUG');
-- If 90% of logs are INFO/DEBUG, reading 90% of table via index is wasteful
-- Seq Scan reads sequentially, better cache utilization

-- Bad: Sequential scan due to stale statistics
EXPLAIN SELECT * FROM users WHERE status = 'admin';
-- Seq Scan on 1M rows, but only 10 admins exist!
-- Fix: ANALYZE users;
-- Then: Index Scan using idx_status

-- Bad: Sequential scan due to function on column
EXPLAIN SELECT * FROM users WHERE EXTRACT(YEAR FROM created_at) = 2024;
-- Seq Scan with Filter: (EXTRACT(year FROM created_at) = 2024)
-- A plain index on created_at cannot be used: the predicate is on an expression, not the column
-- Fix: Rewrite as a range query on the bare column (SARGable)
EXPLAIN SELECT * FROM users 
WHERE created_at >= '2024-01-01' 
  AND created_at < '2025-01-01';
-- Index Scan using idx_created_at
```
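
Before concluding that a sequential scan is caused by stale statistics, it helps to look at what the planner actually knows. The queries below are a sketch that assumes the `users` table lives in the `public` schema; they read the standard `pg_stat_user_tables` and `pg_stats` views and then compare estimated and actual row counts.

```sql
-- When were statistics last refreshed, and how many rows changed since?
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'users';

-- What does the planner believe about the status column?
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE schemaname = 'public'          -- assumption: default schema
  AND tablename = 'users'
  AND attname = 'status';

-- Refresh statistics, then compare estimated rows with actual rows
ANALYZE users;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE status = 'admin';
```

A large `n_mod_since_analyze` together with an old `last_autoanalyze` timestamp is a reasonable signal that the estimates in the plan may no longer reflect the data.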

### 14.10.2 Join Strategy Selection Guidelines

```sql
-- Nested Loop: Small outer, indexed inner
-- Best for: OLTP lookups, recent data filters
SELECT * FROM users u
JOIN orders o ON u.user_id = o.user_id
WHERE u.user_id = 123;  -- Single user (1 row outer)
-- Plan: Nested Loop with Index Scan on orders

-- Hash Join: Large tables, no suitable index, equality join
-- Best for: Analytics, reporting, star schema joins
SELECT * FROM fact_sales f
JOIN dim_product p ON f.product_id = p.product_id
WHERE f.sale_date BETWEEN '2024-01-01' AND '2024-12-31';
-- Plan: Seq Scan on fact_sales (large), Hash on dim_product (small), Hash Join

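-- Merge Join: Large tables, equality join, inputs sorted on the join key
-- Best for: Pre-sorted data (e.g., B-tree indexes on both join columns) or when sorted output is needed anyway
SELECT * FROM users u
JOIN orders o ON u.user_id = o.user_id
ORDER BY u.user_id;
-- Likely plan: Merge Join over two Index Scans (or explicit Sort nodes if inputs are unsorted)

-- Caution: Merge Join and Hash Join both require an equality join condition.
-- An inequality-only join such as a temporal overlap runs as a Nested Loop:
SELECT * FROM events e1
JOIN events e2 ON e1.start_time <= e2.end_time 
               AND e1.end_time >= e2.start_time;
-- Plan: Nested Loop with the overlap condition as a Join Filter (or an inner Index Scan if a suitable index exists)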

-- Anti-Join patterns (NOT EXISTS)
SELECT * FROM users u
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.user_id = u.user_id
);
-- Plan: Hash Anti Join or Nested Loop Anti Join, depending on table sizes and indexes
-- A user is discarded as soon as one matching order is found
-- Never returns columns from the orders side (anti-join semantics)
```
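
To check whether the planner's chosen join strategy really is the fastest, you can run the same query twice inside one transaction, once as planned and once with the chosen strategy discouraged. This is a diagnostic sketch using the `fact_sales` and `dim_product` tables from the example above, not a tuning recommendation; note that `EXPLAIN ANALYZE` executes the query.

```sql
BEGIN;

-- Baseline: let the planner choose
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM fact_sales f
JOIN dim_product p ON f.product_id = p.product_id
WHERE f.sale_date BETWEEN '2024-01-01' AND '2024-12-31';

-- Alternative: discourage hash joins for this transaction only
SET LOCAL enable_hashjoin = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM fact_sales f
JOIN dim_product p ON f.product_id = p.product_id
WHERE f.sale_date BETWEEN '2024-01-01' AND '2024-12-31';

ROLLBACK;  -- enable_hashjoin reverts to its previous value
```

If the alternative plan is consistently faster, the usual causes are inaccurate row estimates or cost constants that do not match the hardware, both of which are better fixed at the source than by leaving the switch off.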

### 14.10.3 Plan Instability and Management

```sql
-- Plan caching for prepared statements (custom plans for the first 5 executions, then possibly a generic plan)
PREPARE get_user_orders(BIGINT) AS
SELECT * FROM orders WHERE user_id = $1;

-- First 5 executions: Custom plan generated for the specific parameter values
-- 6th+ execution: Generic plan is used if its estimated cost is not much higher than the average custom-plan cost
-- Check with:
EXPLAIN EXECUTE get_user_orders(123);

-- Forcing custom plans (PostgreSQL 12+; useful if the generic plan is bad for specific values):
SET plan_cache_mode = 'force_custom_plan';

-- Plan hints (extension required: pg_hint_plan)
-- PostgreSQL doesn't support hints natively (by design), but extension available:
-- /*+ SeqScan(users) IndexScan(orders idx_orders_user_id) NestLoop(users orders) */

-- Better approach: Use query structure to guide the planner
-- Encourage index usage by writing index-friendly (SARGable) conditions
-- Hash joins tend to be chosen when both inputs are large and joined on equality
-- Nested loops tend to be chosen when the outer side is small (e.g., restricted with LIMIT)

-- Join order hints via CTEs (PostgreSQL 12+):
WITH active_users AS MATERIALIZED (
    SELECT * FROM users WHERE status = 'active' LIMIT 100
)
SELECT * FROM active_users u
JOIN orders o ON u.user_id = o.user_id;
-- MATERIALIZED makes the CTE an optimization fence: it is computed on its own (limits join order flexibility)
-- Use sparingly; it prevents the optimizer from pushing outer predicates into the CTE
```
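
The difference between custom and generic plans can be observed directly by switching `plan_cache_mode` and comparing the `EXPLAIN` output. The sketch below uses a hypothetical statement name, `orders_by_user`, to avoid clashing with the statement prepared above; the `generic_plans` and `custom_plans` columns of `pg_prepared_statements` are available from PostgreSQL 14.

```sql
-- Hypothetical statement name, chosen to avoid clashing with get_user_orders
PREPARE orders_by_user(BIGINT) AS
SELECT * FROM orders WHERE user_id = $1;

-- Custom plan: the planner sees the actual parameter value
SET plan_cache_mode = force_custom_plan;
EXPLAIN EXECUTE orders_by_user(123);   -- condition shows the literal value

-- Generic plan: planned without knowing the value
SET plan_cache_mode = force_generic_plan;
EXPLAIN EXECUTE orders_by_user(123);   -- condition shows $1

-- PostgreSQL 14+: how often each kind of plan has been chosen
SELECT name, generic_plans, custom_plans
FROM pg_prepared_statements
WHERE name = 'orders_by_user';

-- Clean up
RESET plan_cache_mode;
DEALLOCATE orders_by_user;
```

If the two plans differ in scan or join type, the column's value distribution is probably skewed, and that is the situation in which forcing custom plans for the affected statement may be worth its extra planning cost.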

---

## Chapter Summary

In this chapter, you learned:

1. **Query Processing Pipeline**: SQL passes through Parser (syntax), Rewriter (views/rules), Planner (optimization), and Executor (runtime). The Planner generates the critical execution plan using cost estimates.

2. **Cost Model**: PostgreSQL uses abstract cost units based on `seq_page_cost` (1.0) and `random_page_cost` (4.0). Costs include I/O (pages) and CPU (tuples, operators). Costs are relative, not milliseconds.

3. **Statistics**: The planner relies on `pg_stats` (populated by `ANALYZE`) containing most common values, histograms, and correlation. Stale statistics cause bad plans (Seq Scans on selective queries). High-churn tables need adjusted `autovacuum_analyze_scale_factor`.

4. **Scan Types**:
   - **Seq Scan**: Sequential read of all pages. Best for small tables or predicates that match a large share of rows (roughly >10%).
   - **Index Scan**: Index lookup followed by heap fetch. Best for highly selective predicates (roughly <5% of rows) or when the required order matches the index.
   - **Bitmap Index Scan + Bitmap Heap Scan**: Builds a bitmap of matching rows from the index, then fetches heap pages in physical order. Best when a moderate share of rows matches (roughly 5-20%) or when combining multiple indexes.
   - **Index-Only Scan**: Reads only index, no heap access. Requires covering index and clean visibility map.

5. **Join Algorithms**:
   - **Nested Loop**: Iterate outer, probe inner. Best for small outer with indexed inner (OLTP).
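   - **Hash Join**: Build hash table on smaller input, probe with larger. Best for large tables joined on equality without useful indexes (Analytics).
   - **Merge Join**: Merge two inputs sorted on the join key, sorting them first if necessary. Best for equality joins on pre-sorted data. Inequality-only joins fall back to Nested Loop.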

6. **Plan Reading**: Look for actual vs estimated row discrepancies (indicates bad statistics), buffer counts (I/O intensity), and node timing (where time is spent). A high "Rows Removed by Filter" count means many rows are read and then discarded, which often points to a missing or poorly matched index.

7. **Configuration**: `work_mem` affects hash and sort operations (spill to disk if exceeded). `random_page_cost` should be lowered for SSDs (1.1 vs default 4.0). `geqo_threshold` switches to genetic optimizer for many-table joins.

**Next:** In Chapter 15, we will explore Index Fundamentals—covering B-tree structure, index selection criteria, covering indexes, and the critical distinction between index scans and index-only scans in production environments.