# Chapter 21: Sequences and ID Generation

Reliable, high-performance ID generation is foundational to database applications. This chapter examines PostgreSQL's sequence infrastructure, contrasts legacy and modern approaches to surrogate key generation, and addresses distributed system challenges including the trade-offs between sequential and random identifiers.

## 21.1 Sequence Fundamentals

Sequences are database objects that generate unique integer values according to defined rules, providing the mechanism behind auto-incrementing columns.

### 21.1.1 Sequence Architecture

```sql
-- Creating a standalone sequence (uncommon but illustrative)
CREATE SEQUENCE order_id_seq
    AS BIGINT  -- bigint is the default; the AS clause itself was added in PostgreSQL 10
    START WITH 1
    INCREMENT BY 1
    NO MINVALUE
    NO MAXVALUE
    CACHE 1;  -- Default cache size

-- Sequence functions:
SELECT nextval('order_id_seq');  -- Advances the sequence and returns the new value (1, then 2, then 3...)
SELECT currval('order_id_seq');  -- Value most recently returned by nextval() for this sequence in this session
SELECT lastval();                -- Value most recently returned by nextval() for any sequence in this session
SELECT setval('order_id_seq', 1000);  -- Sets the current value; the next nextval() returns 1001 (use carefully)

-- How sequences work internally:
-- 1. Each sequence is a special single-row relation that holds its state (last_value);
--    its parameters (increment, bounds, cache) live in the pg_sequence catalog (PostgreSQL 10+)
-- 2. Next value is calculated: current_value + increment
-- 3. Changes are WAL-logged (durable) but not transactional (rolled back values are lost)
-- 4. With CACHE > 1, each session keeps its preallocated values in backend memory

-- Demonstrating non-transactional behavior (values shown assume a freshly created sequence):
BEGIN;
SELECT nextval('order_id_seq');  -- Returns 1
SELECT nextval('order_id_seq');  -- Returns 2
ROLLBACK;

SELECT nextval('order_id_seq');  -- Returns 3! (1 and 2 are never reissued)
-- This is by design: sequences guarantee uniqueness, not gap-free allocation
```

### 21.1.2 Sequence Durability and Caching

```sql
-- With the default CACHE 1, every nextval() takes a lock on the sequence's shared buffer page
-- (WAL is written only about once per 32 values, so the cost is contention, not disk I/O)
-- For tables with many concurrent inserters, a larger cache reduces that contention:
CREATE SEQUENCE high_throughput_seq
    AS BIGINT
    CACHE 1000;  -- Each session pre-allocates 1000 values in its own memory

-- Performance impact (qualitative; measure on your own hardware, e.g. with pgbench):
-- CACHE 1: usually fast enough; contention appears only with many sessions calling nextval() at once
-- CACHE 100: each session touches the shared sequence once per 100 values
-- CACHE 1000: shared access becomes negligible; gains beyond this are typically small

-- Important caveats with CACHE > 1:
-- 1. Values are not handed out in order across sessions
--    Session A gets 1-1000, Session B gets 1001-2000
--    If A disconnects after using 3 values, 4-1000 are never used (gaps)
-- 2. On server crash, cached values are lost (more gaps)
-- 3. Still guarantees uniqueness, but gaps are larger

-- Checking sequence settings and usage:
SELECT
    sequencename,
    data_type,
    start_value,
    increment_by,
    min_value,
    max_value,
    cache_size,
    cycle,
    last_value
FROM pg_sequences
WHERE schemaname = 'public';

-- Monitor sequence exhaustion (most relevant for INTEGER sequences):
-- pg_sequences already exposes max_value, so there is no need to hard-code type limits
SELECT
    sequencename,
    data_type,
    last_value,            -- NULL until nextval() has been called
    max_value,
    round((max_value - last_value)::numeric / max_value * 100, 2) AS remaining_pct
FROM pg_sequences
WHERE sequencename = 'order_id_seq';
-- For BIGINT, exhaustion is practically impossible (about 292,000 years at one million IDs per second)
-- For INTEGER, alert well before last_value approaches 2,147,483,647
```
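The caching caveats above can be made concrete with a toy model. The following Python sketch is not PostgreSQL's implementation; `SimulatedSequence` and `Session` are hypothetical names used only to show how per-session blocks produce out-of-order values and gaps.

```python
class SimulatedSequence:
    """Toy model of a sequence created with CACHE n (illustration only)."""

    def __init__(self, cache: int = 1) -> None:
        self.cache = cache
        self.last_allocated = 0  # shared state: highest value reserved by any session

    def allocate_block(self) -> range:
        """Reserve the next `cache` values for one session."""
        start = self.last_allocated + 1
        self.last_allocated += self.cache
        return range(start, self.last_allocated + 1)


class Session:
    """One database session holding its own cached block of values."""

    def __init__(self, seq: SimulatedSequence) -> None:
        self.seq = seq
        self.block = iter(())  # cached values not yet handed out

    def nextval(self) -> int:
        try:
            return next(self.block)
        except StopIteration:
            # Cache exhausted: touch the shared sequence once, then serve locally.
            self.block = iter(self.seq.allocate_block())
            return next(self.block)


seq = SimulatedSequence(cache=1000)
a, b = Session(seq), Session(seq)
used = [a.nextval(), b.nextval(), a.nextval(), b.nextval()]
print(used)  # [1, 1001, 2, 1002]

# If session A disconnects now, values 3..1000 are never handed out.
c = Session(seq)
print(c.nextval())  # 2001
```

In this model the shared counter is touched once per 1000 values, which is the reason a larger cache reduces contention, and the unused remainder of each block is exactly the gap the caveats describe.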

## 21.2 SERIAL vs IDENTITY: The Modern Standard

PostgreSQL offers two auto-increment mechanisms. IDENTITY (SQL standard) is the modern preference over the legacy SERIAL type.

### 21.2.1 The Legacy SERIAL Approach

```sql
-- Old approach (still widely used, but not recommended):
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,  -- Creates sequence + default + NOT NULL
    total DECIMAL(10,2)
);

-- What SERIAL actually does:
-- 1. CREATE SEQUENCE orders_order_id_seq
-- 2. ALTER TABLE orders ALTER order_id SET DEFAULT nextval('orders_order_id_seq')
-- 3. ALTER SEQUENCE orders_order_id_seq OWNED BY orders.order_id

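-- Problems with SERIAL:
-- 1. Implicit sequence creation (a macro, not a real type; the link is only a default expression)
-- 2. Sequence name derived from the original table/column names (goes stale after renames)
-- 3. Sequence privileges are separate from table privileges
--    (INSERT on the table also needs USAGE on the sequence)
-- 4. Not part of the SQL standard
-- 5. CREATE TABLE ... (LIKE orders INCLUDING DEFAULTS) copies the default,
--    so the new table silently shares the original table's sequence
-- Note: the sequence is OWNED BY the column, so dropping the column does drop it

-- Example of SERIAL fragility:
ALTER TABLE orders RENAME COLUMN order_id TO id;
-- Sequence is still named orders_order_id_seq (now mismatched)
-- The default keeps working (it references the sequence by OID), but code that
-- derives the name by convention breaks; use pg_get_serial_sequence('orders', 'id')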
```

### 21.2.2 The Modern IDENTITY Approach

```sql
-- SQL standard compliant (PostgreSQL 10+):
CREATE TABLE orders (
    order_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    -- or: GENERATED BY DEFAULT AS IDENTITY (allows manual override)
    total DECIMAL(10,2)
);

-- What IDENTITY does:
-- 1. Creates sequence internally
-- 2. Links sequence to column with metadata (pg_attribute.attidentity)
-- 3. Handles privileges automatically (table owner controls generation)
-- 4. Drops sequence automatically when column dropped
-- 5. Survives column/table renames intact

-- OVERRIDING clause for IDENTITY:
-- GENERATED ALWAYS: Prevents manual insertion (error if tried)
INSERT INTO orders (order_id, total) VALUES (999, 100.00);
-- ERROR: cannot insert a non-DEFAULT value into column "order_id"
-- (older releases word this as: cannot insert into column "order_id")
-- DETAIL: Column "order_id" is an identity column defined as GENERATED ALWAYS.
-- HINT: Use OVERRIDING SYSTEM VALUE to override.

-- To force specific ID (e.g., data migration):
INSERT INTO orders (order_id, total) 
OVERRIDING SYSTEM VALUE 
VALUES (999, 100.00);

-- GENERATED BY DEFAULT: Allows manual insertion (like old SERIAL)
CREATE TABLE orders_legacy (
    order_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    total DECIMAL(10,2)
);

-- Manual values are accepted without error, but the sequence does not advance past them:
INSERT INTO orders_legacy (order_id, total) VALUES (1, 100.00);
INSERT INTO orders_legacy (total) VALUES (200.00);
-- ERROR: duplicate key value violates unique constraint "orders_legacy_pkey"
-- (the sequence also hands out 1; after manual loads, resync with RESTART WITH or setval())

-- Managing IDENTITY sequences:
-- Alter generation mode and sequence options in one statement (this switches orders to BY DEFAULT):
ALTER TABLE orders ALTER COLUMN order_id 
SET GENERATED BY DEFAULT 
SET INCREMENT BY 1 
SET CACHE 1000;

-- Reset sequence (after truncate or data load):
ALTER TABLE orders ALTER COLUMN order_id RESTART WITH 1000;
```

### 21.2.3 Migration from SERIAL to IDENTITY

```sql
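-- Convert existing SERIAL to IDENTITY (PostgreSQL 10+).
-- ADD GENERATED ... AS IDENTITY fails while the column still has a default,
-- so drop the SERIAL default first. Run the steps in one transaction.
BEGIN;

-- Step 1: Drop the old default (from SERIAL)
ALTER TABLE orders
ALTER COLUMN order_id
DROP DEFAULT;

-- Step 2: Drop the old sequence (still owned by the column, no longer used)
DROP SEQUENCE orders_order_id_seq;

-- Step 3: Add IDENTITY property (creates a new internal sequence starting at 1)
ALTER TABLE orders
ALTER COLUMN order_id
ADD GENERATED BY DEFAULT AS IDENTITY;

-- Step 4: Move the new sequence past the existing data
SELECT setval(pg_get_serial_sequence('orders', 'order_id'),
              COALESCE(MAX(order_id), 0) + 1, false)
FROM orders;

COMMIT;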

-- Verification:
SELECT 
    column_name,
    is_identity,
    identity_generation,
    identity_start,
    identity_increment
FROM information_schema.columns
WHERE table_name = 'orders' AND column_name = 'order_id';
```

## 21.3 High-Throughput Sequence Strategies

In high-concurrency environments (thousands of connections), sequence generation becomes a bottleneck without proper caching.

### 21.3.1 Sequence Caching for Connection Pooling

```sql
-- Problem: 1000 application servers, each with connection pool of 10
-- = 10,000 concurrent connections requesting IDs
-- With CACHE 1: Each nextval() locks the same sequence buffer page = contention

-- Solution: Aggressive caching for write-heavy tables
CREATE TABLE events (
    event_id BIGINT GENERATED ALWAYS AS IDENTITY (CACHE 10000) PRIMARY KEY,
    event_type TEXT,
    payload JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

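-- Trade-offs:
-- CACHE 10000: Each session pre-allocates 10,000 IDs
-- - Pros: Shared sequence touched once per 10,000 IDs, so ID generation stops being a bottleneck
-- - Cons: Gaps of up to 10,000 whenever a session ends (frequent with short-lived connections)
-- - Cons: IDs not strictly chronological across sessions

-- For IDs that follow allocation order across sessions:
-- Use CACHE 1 (more contention) or application-layer ID generation (Snowflake, etc.)
-- Even then, commit order can differ from ID order, so readers may see a lower ID appear later

-- Monitoring sequence contention:
-- Sample pg_stat_activity for sessions waiting with wait_event_type = 'LWLock'
-- (buffer content locks) while running INSERTs or nextval(); confirm with a benchmark
-- before raising CACHE, since other hot pages produce the same wait event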
```

### 21.3.2 Handling Gaps

```sql
-- Gaps are inevitable and acceptable in production:
-- Causes: Rollbacks, cache pre-allocation, crashes, failed inserts

-- Never try to "fill gaps" (anti-pattern):
-- Bad: SELECT MIN(id)+1 FROM table WHERE id+1 NOT IN (SELECT id FROM table)
-- This is race-condition prone and slow

-- If gap-free required (rarely necessary):
-- Solution: MAX(id)+1 in serializable transaction (extremely slow, high contention)
-- Or use gapless sequence (counter table with pessimistic locking - bottleneck)

-- Better approach: Accept gaps, use separate "display sequence" if needed
-- Database ID = surrogate (can have gaps)
-- Display Number = separate column populated by application logic (business key)
CREATE TABLE invoices (
    invoice_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- Surrogate (gaps OK)
    invoice_number TEXT UNIQUE,  -- Business key (gap-free, generated carefully)
    amount DECIMAL
);

-- Generate invoice_number in application with retry logic for gaps:
-- invoice_number = 'INV-' || year || '-' || sequential_counter(year)
-- Store counter in separate table per year with row-level locking (acceptable bottleneck for invoices)
```

## 21.4 UUID Strategies and Performance

UUIDs provide globally unique identifiers without central coordination, essential for distributed systems. However, they have significant performance implications for B-tree indexes.

### 21.4.1 UUID Types and Generation

```sql
-- PostgreSQL native UUID type (128-bit, stored as 16 bytes):
CREATE TABLE events_distributed (
    event_id UUID PRIMARY KEY,  -- Not generated automatically
    data JSONB
);

-- Generation methods:
-- 1. UUID v4 (random) - Most common, but worst for B-tree index locality
-- gen_random_uuid() is built in since PostgreSQL 13 (earlier versions: pgcrypto extension)
INSERT INTO events_distributed VALUES (gen_random_uuid(), '{"key": "value"}');
-- Pros: No coordination needed, practically unguessable (122 random bits)
-- Cons: Inserts land on random index pages = more page splits, cache misses, and WAL volume

-- 2. UUID v1 (timestamp + MAC address):
-- Requires uuid-ossp extension (less secure, contains MAC address)
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
INSERT INTO events_distributed VALUES (uuid_generate_v1(), '{}');
-- Pros: Time-ordered (better locality than v4)
-- Cons: Reveals hardware info, clock sequence issues

-- 3. UUID v7 (time-ordered, standardized in RFC 9562):
-- Built in as uuidv7() from PostgreSQL 18; earlier versions need an extension or app-side generation
-- (pgcrypto does not provide it)
-- Structure: 48-bit timestamp (millis) + 4 version bits + 2 variant bits + 74 random bits
-- Pros: Time-ordered like v1, no MAC address, high entropy
-- Cons: Creation time is readable from the ID (timestamp exposed)
```
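For PostgreSQL versions without a built-in generator, version-7 style UUIDs are typically produced in the application. The sketch below follows the bit layout listed above (48-bit millisecond timestamp, version, variant, 74 random bits). It is a minimal illustration in Python, assumed here as the application language; `uuid7_sketch` is a hypothetical helper name, and it omits the optional monotonic counter some implementations add within a millisecond.

```python
import os
import time
import uuid
from typing import Optional


def uuid7_sketch(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Build a version-7 style UUID: 48-bit ms timestamp followed by random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits; 74 are kept
    rand_a = (rand >> 64) & 0x0FFF                  # 12 random bits
    rand_b = rand & ((1 << 62) - 1)                 # 62 random bits

    value = (unix_ms & ((1 << 48) - 1)) << 80       # bits 127..80: timestamp
    value |= 0x7 << 76                              # bits 79..76: version 7
    value |= rand_a << 64                           # bits 75..64
    value |= 0b10 << 62                             # bits 63..62: RFC variant
    value |= rand_b                                 # bits 61..0
    return uuid.UUID(int=value)


earlier = uuid7_sketch(1_700_000_000_000)
later = uuid7_sketch(1_700_000_000_001)
print(earlier, later, earlier < later)
```

Because PostgreSQL compares `uuid` values byte by byte from the left, values built this way sort by creation time at millisecond granularity, which is what keeps index inserts near the right edge.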

### 21.4.2 UUID Index Performance Implications

```sql
-- B-tree indexes perform best with monotonically increasing inserts
-- UUID v4 inserts go to random pages = high page splits, disk I/O

-- Demonstration of write amplification:
-- Sequential BIGINT: Insert 1, 2, 3... all go to rightmost leaf page (minimal splits)
-- Random UUID: Each insert likely goes to different page (read, split, write)

-- Solutions:

-- 1. Use BIGINT for single-database systems (best performance)
-- Only use UUID if distributed/multi-master required

-- 2. Sequential UUIDs (v7 or custom "COMB"):
-- If using UUID, ensure time-ordered prefix
-- Example: UUID with first 6 bytes as timestamp
-- New keys land on the right edge of the index, keeping pages well filled

-- 3. Separate surrogate key from business UUID:
CREATE TABLE distributed_events (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- Local sequential surrogate
    event_uuid UUID UNIQUE NOT NULL,  -- Global business identifier
    payload JSONB
);
-- (Not partitioned by id: a UNIQUE constraint on a partitioned table must include the partition key)
-- PostgreSQL tables are heaps, so no index decides physical row order
-- Foreign keys and joins use the compact 8-byte id; only one index holds random UUIDs
-- Queries by UUID use the unique index, which points directly at the heap row
-- Range queries by time use id range (if id is time-ordered) or separate timestamp index
```
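The locality difference described above can be illustrated without a database. The following Python sketch is only a model of sorted key order, not of PostgreSQL's B-tree: it counts how often a new key sorts after every existing key, which is the case where an index insert lands on the rightmost leaf page. The function name is hypothetical.

```python
import bisect
import uuid


def rightmost_insert_fraction(keys) -> float:
    """Fraction of inserts whose key sorts after all keys inserted so far."""
    ordered = []
    at_end = 0
    total = 0
    for key in keys:
        pos = bisect.bisect_right(ordered, key)
        if pos == len(ordered):
            at_end += 1          # would append to the rightmost leaf
        ordered.insert(pos, key)
        total += 1
    return at_end / total


n = 5000
sequential = rightmost_insert_fraction(range(1, n + 1))
random_v4 = rightmost_insert_fraction(uuid.uuid4().int for _ in range(n))
print(f"sequential keys: {sequential:.1%} of inserts hit the right edge")
print(f"random UUID v4:  {random_v4:.1%} of inserts hit the right edge")
```

Sequential keys always append, while random keys almost never do, so a random-key index touches pages across its whole width. How much that costs in practice depends on whether the index fits in memory.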

### 21.4.3 UUID Storage Optimization

```sql
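-- UUID as 16 bytes vs TEXT (36 characters with dashes, plus a length header):
-- Always use UUID type, not TEXT/VARCHAR

-- Extracting timestamp from UUID v7 (if stored):
-- PostgreSQL 17 added uuid_extract_timestamp(); v7 support arrived with PostgreSQL 18
-- (check the documentation for your version). On older versions, a manual function works: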
CREATE OR REPLACE FUNCTION uuid_v7_to_timestamp(uuid_val UUID)
RETURNS TIMESTAMPTZ AS $$
DECLARE
    hex_str TEXT;
    epoch_millis BIGINT;
BEGIN
    -- Extract first 48 bits (12 hex chars) from UUID
    hex_str = REPLACE(uuid_val::TEXT, '-', '');
    epoch_millis = ('x' || SUBSTRING(hex_str, 1, 12))::BIT(48)::BIGINT;
    RETURN TO_TIMESTAMP(epoch_millis / 1000.0);
END;
$$ LANGUAGE plpgsql IMMUTABLE;
```
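The same extraction is often needed on the application side, for example when logging or debugging IDs. A minimal Python sketch mirroring the PL/pgSQL function above follows; the function name and the example UUID are hypothetical, and the example value was constructed so that its first 12 hex digits encode 1,700,000,000,000 ms.

```python
import uuid
from datetime import datetime, timezone


def uuid_v7_timestamp(value: uuid.UUID) -> datetime:
    """Read the leading 48 bits of a version-7 UUID as Unix milliseconds."""
    unix_ms = value.int >> 80
    return datetime.fromtimestamp(unix_ms / 1000, tz=timezone.utc)


# Hypothetical example value (timestamp prefix 0x018bcfe56800).
example = uuid.UUID("018bcfe5-6800-7000-8000-000000000000")
print(uuid_v7_timestamp(example))  # 2023-11-14 22:13:20+00:00
```

The result is only meaningful for version-7 values; applied to a random v4 UUID, it returns an arbitrary date.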

## 21.5 Distributed ID Generation

When multiple databases or services must generate IDs without coordination, sequences are insufficient. Industry patterns provide ordered, unique IDs without central bottlenecks.

### 21.5.1 Snowflake IDs (Twitter Algorithm)

```sql
-- 64-bit integer composed of:
-- 41 bits: timestamp (milliseconds since epoch, ~69 years)
-- 10 bits: machine ID (1024 machines)
-- 12 bits: sequence number (4096 IDs per millisecond per machine)

-- Structure: 0|timestamp|machine|sequence
-- Example: 1387263847263847263

-- PostgreSQL implementation (application usually generates, but can store):
CREATE TABLE snowflake_ids (
    id BIGINT PRIMARY KEY,  -- Snowflake ID generated by app
    data TEXT
);

-- Benefits:
-- 1. Roughly time-ordered (good for B-tree indexes)
-- 2. No coordination needed between machines
-- 3. 64-bit (fits in BIGINT, half the size of UUID)
-- 4. No shared counter or database round-trip (IDs are sparse, but gaps are irrelevant here)

-- Trade-offs:
-- 1. Requires machine ID assignment (infrastructure complexity)
-- 2. Clock sync critical: a clock stepping backwards can reissue IDs unless the generator waits or errors
-- 3. Only 4096 IDs/ms per machine, about 4 million/sec (rarely limiting)

-- Extracting timestamp from Snowflake ID (for debugging):
-- timestamp_ms = (id >> 22) + custom_epoch
```
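The bit layout above translates into a small generator. The Python sketch below is one possible application-side implementation, not Twitter's code. The custom epoch, the class name, and the choice to raise an error when the clock moves backwards are all assumptions made for illustration.

```python
import threading
import time

CUSTOM_EPOCH_MS = 1_577_836_800_000  # 2020-01-01T00:00:00Z; an arbitrary assumed epoch
MACHINE_BITS, SEQUENCE_BITS = 10, 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    """41-bit timestamp | 10-bit machine ID | 12-bit per-millisecond sequence."""

    def __init__(self, machine_id: int, clock=lambda: time.time_ns() // 1_000_000):
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id must fit in 10 bits")
        self.machine_id = machine_id
        self.clock = clock          # injectable so the logic can be tested
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards; refusing to issue IDs")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    # All 4096 values used in this millisecond: wait for the next one.
                    while now <= self.last_ms:
                        now = self.clock()
            else:
                self.sequence = 0
            self.last_ms = now
            timestamp = now - CUSTOM_EPOCH_MS
            return (timestamp << 22) | (self.machine_id << SEQUENCE_BITS) | self.sequence


def decode(snowflake_id: int):
    """Split an ID into (unix_ms, machine_id, sequence)."""
    return (
        (snowflake_id >> 22) + CUSTOM_EPOCH_MS,
        (snowflake_id >> SEQUENCE_BITS) & ((1 << MACHINE_BITS) - 1),
        snowflake_id & MAX_SEQUENCE,
    )


generator = SnowflakeGenerator(machine_id=1)
first, second = generator.next_id(), generator.next_id()
print(first, second, decode(first))
```

The resulting integers fit in a `BIGINT` column and increase over time on a single machine, so they can be inserted directly into a table such as `snowflake_ids` above.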

### 21.5.2 Hi/Lo Algorithm

```sql
-- Application-side batch allocation to reduce DB round-trips:
-- 1. Database sequence with increment_by = 100 hands out block starts (hi)
-- 2. Application allocates lo = 0-99 locally
-- 3. Combined ID = hi + lo (the increment already spaces the blocks 100 apart)

-- PostgreSQL setup:
CREATE SEQUENCE global_id_seq START WITH 1000 INCREMENT BY 100 CACHE 1;

-- Application logic:
-- Request nextval from DB: gets 1000 (this reserves 1000-1099)
-- App uses 1000, 1001, 1002... 1099 without hitting DB
-- When exhausted, requests nextval: gets 1100 (reserves 1100-1199)

-- Benefits:
-- 1. Only 1 DB call per 100 IDs (vs 100 calls)
-- 2. Roughly sequential (good for B-tree)
-- 3. Survives app restarts (sequence persists; the unused rest of a block is skipped)

-- PostgreSQL-specific note:
-- CACHE already gives each session a hi/lo-style block for database-generated IDs.
-- Explicit hi/lo is useful when the application must know IDs before inserting,
-- or when one sequence feeds several databases (sharding)
```
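The application logic above can be sketched as a small allocator. In this Python sketch, `HiLoAllocator` and `fake_nextval` are hypothetical names. The fake function stands in for a call to `nextval('global_id_seq')` and is assumed to return block starts 1000, 1100, 1200, and so on, as in the walk-through.

```python
import itertools
from typing import Callable, List


class HiLoAllocator:
    """Hands out IDs from a locally held block; refills with one fetch per block."""

    def __init__(self, fetch_block_start: Callable[[], int], block_size: int = 100):
        self.fetch_block_start = fetch_block_start
        self.block_size = block_size
        self.next_value = 0
        self.block_end = 0  # exclusive upper bound; 0 forces a fetch on first use

    def next_id(self) -> int:
        if self.next_value >= self.block_end:
            start = self.fetch_block_start()      # the only database round-trip
            self.next_value = start
            self.block_end = start + self.block_size
        value = self.next_value
        self.next_value += 1
        return value


# Stand-in for the database sequence (assumption: increments of 100 from 1000).
_block_starts = itertools.count(start=1000, step=100)
fetches: List[int] = []


def fake_nextval() -> int:
    start = next(_block_starts)
    fetches.append(start)  # record each simulated round-trip
    return start


allocator = HiLoAllocator(fake_nextval, block_size=100)
ids = [allocator.next_id() for _ in range(250)]
print(ids[0], ids[99], ids[100], ids[249], len(fetches))  # 1000 1099 1100 1249 3
```

The `block_size` passed to the allocator must match the sequence's `INCREMENT BY`; a larger local block would overlap with blocks handed to other application instances.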

### 21.5.3 COMB GUIDs (Combined Timestamp and GUID)

```sql
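-- COMB ("combined" GUID): a random UUID with some bytes replaced by a timestamp
-- The original SQL Server variant puts the timestamp in the last 6 bytes (matching its sort order);
-- PostgreSQL compares UUIDs byte by byte from the left, so the timestamp must be the prefix

-- Example generation (pseudo-code, usually app-side):
-- guid = uuidv4()  -- random
-- timestamp_bytes = current_timestamp_millis()  -- 6 bytes, big-endian
-- comb = timestamp_bytes || guid[6:]  -- Prefix timestamp, keep random suffix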

-- PostgreSQL storage:
CREATE TABLE comb_events (
    event_id UUID PRIMARY KEY,  -- Comb GUID (time-ordered prefix)
    data JSONB
);

-- Index behavior: Close to BIGINT (inserts cluster at the right edge), avoids UUID v4 fragmentation
-- Cost: generation logic lives in the application; the database just sees ordered inserts
```
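The pseudo-code above maps directly onto a few lines of application code. The following Python sketch assumes a millisecond Unix timestamp as the 6-byte prefix; `comb_uuid` is a hypothetical helper name.

```python
import time
import uuid
from typing import Optional


def comb_uuid(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Random UUID whose first 6 bytes are replaced by a big-endian ms timestamp."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    prefix = (unix_ms & ((1 << 48) - 1)).to_bytes(6, "big")
    random_part = uuid.uuid4().bytes[6:]   # keeps the v4 version and variant bits
    return uuid.UUID(bytes=prefix + random_part)


older = comb_uuid(1_700_000_000_000)
newer = comb_uuid(1_700_000_000_001)
print(older, newer, older < newer)
```

Values produced this way still report version 4, because the version nibble sits in the retained suffix. That is the main practical difference from UUID v7, which uses the same timestamp prefix but labels itself as version 7.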

## 21.6 ID Generation Anti-Patterns

Avoid these common mistakes that cause performance degradation or correctness issues.

### 21.6.1 The MAX(id)+1 Anti-Pattern

```sql
-- Never do this (race conditions and serialization bottlenecks):
INSERT INTO orders (order_id, total)
SELECT COALESCE(MAX(order_id), 0) + 1, 100.00 FROM orders;

-- Problems:
-- 1. Race condition: Two transactions read MAX(id)=100 simultaneously
--    Both try to insert 101 -> one fails with a duplicate key error
--    (without a unique constraint, both succeed and the ID is duplicated)
-- 2. Under SERIALIZABLE isolation, conflicting transactions abort and must be retried
-- 3. MAX() is cheap with an index on order_id, but a full scan without one

-- Even with locking (still bad):
BEGIN;
LOCK TABLE orders IN EXCLUSIVE MODE;  -- FOR UPDATE is not allowed with aggregates like MAX()
INSERT INTO orders (order_id, total)
SELECT COALESCE(MAX(order_id), 0) + 1, 100.00 FROM orders;
COMMIT;
-- Correct, but every writer waits for the previous transaction to commit (fully serialized inserts)

-- Always use sequences or UUIDs instead
```
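The race in problem 1 is easiest to see when the two steps of each transaction are interleaved by hand. The Python sketch below models the table as a dictionary keyed by primary key. It is a deterministic illustration of the interleaving, not a concurrency test against a real database, and all names are hypothetical.

```python
class DuplicateKey(Exception):
    """Stands in for a unique-constraint violation."""


orders = {100: "existing order"}  # primary key -> row


def read_next_id(table) -> int:
    """Step 1 of the anti-pattern: SELECT COALESCE(MAX(order_id), 0) + 1."""
    return max(table, default=0) + 1


def insert(table, order_id: int, row: str) -> None:
    """Step 2: INSERT, with the primary key enforced."""
    if order_id in table:
        raise DuplicateKey(f"duplicate key value: order_id={order_id}")
    table[order_id] = row


# Interleaving: both transactions read before either one inserts.
id_a = read_next_id(orders)   # 101
id_b = read_next_id(orders)   # 101 as well; A's row is not visible yet
insert(orders, id_a, "order from A")
try:
    insert(orders, id_b, "order from B")
    outcome = "both inserted"
except DuplicateKey as exc:
    outcome = f"B failed: {exc}"
print(outcome)  # B failed: duplicate key value: order_id=101
```

A sequence avoids this because `nextval()` never returns the same value to two callers, regardless of what either transaction can see of the other's rows.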

### 21.6.2 Timestamp-Only IDs

```sql
-- Using timestamp as primary key (collisions likely under load):
CREATE TABLE events_bad (
    event_id TIMESTAMPTZ PRIMARY KEY,  -- Microsecond resolution; many clients supply only milliseconds
    data TEXT
);

-- Problem:
-- INSERT 1: '2024-01-01 10:00:00.123'
-- INSERT 2: '2024-01-01 10:00:00.123'  -- Same timestamp = DUPLICATE KEY!

-- Even at full microsecond precision, collisions occur with concurrent writers
-- (now() is fixed for a whole transaction; clock_timestamp() can still repeat across sessions)
-- Also, clock adjustments (NTP) can cause time to go backwards = duplicate errors

-- Solution:
CREATE TABLE events_good (
    event_id TIMESTAMPTZ NOT NULL,
    sequence INT NOT NULL,
    data TEXT,
    PRIMARY KEY (event_id, sequence)
);
-- Or just use BIGINT/BIGSERIAL
```

### 21.6.3 Integer Overflow Risks

```sql
-- Using INTEGER (32-bit) for high-volume tables:
CREATE TABLE logs (
    log_id SERIAL PRIMARY KEY,  -- INTEGER, max 2.1 billion
    message TEXT
);

-- 2.1 billion seems large, but:
-- 10,000 inserts/sec = 864M/day = overflow in about 2.5 days!

-- Always use BIGINT for high-throughput or long-lived tables
-- (shown under a different name so both definitions can coexist):
CREATE TABLE logs_bigint (
    log_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- 9.2 quintillion max
    message TEXT
);

-- Monitoring for overflow:
SELECT 
    schemaname,
    sequencename,
    last_value,
    CASE 
        WHEN last_value > 2000000000 THEN 'WARNING: Approaching INTEGER max'
        ELSE 'OK'
    END as status
FROM pg_sequences
WHERE data_type = 'integer';
```
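The overflow estimate above is simple division, and it is worth redoing with a table's real insert rate. A minimal Python sketch follows; the function name is hypothetical, and the rates are the chapter's illustrative figures rather than measurements.

```python
INT_MAX = 2**31 - 1
BIGINT_MAX = 2**63 - 1
SECONDS_PER_DAY = 86_400


def days_until_exhausted(max_value: int, ids_per_second: float, last_value: int = 0) -> float:
    """Days left before a sequence reaches max_value at a constant rate."""
    return (max_value - last_value) / ids_per_second / SECONDS_PER_DAY


int_days = days_until_exhausted(INT_MAX, 10_000)
bigint_years = days_until_exhausted(BIGINT_MAX, 1_000_000) / 365.25

print(f"INTEGER at 10,000 IDs/s: {int_days:.1f} days")          # 2.5 days
print(f"BIGINT at 1,000,000 IDs/s: {bigint_years:,.0f} years")  # 292,271 years
```

Passing the sequence's current `last_value` from `pg_sequences` as the third argument turns this into a remaining-time estimate for an existing table.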

## 21.7 Best Practices and Operational Considerations

### 21.7.1 Choosing ID Types

```sql
-- Decision matrix:

-- Single database, no replication:
-- -> BIGINT GENERATED ALWAYS AS IDENTITY (best performance, smallest, fastest indexes)

-- Multi-master replication, distributed systems:
-- -> UUID v7 (time-ordered) or Snowflake ID (64-bit)
-- -> Avoid UUID v4 (random) due to index fragmentation

-- Public-facing IDs (hard to guess, not a substitute for authorization checks):
-- -> UUID v4 (unguessable) or a keyed (HMAC/encrypted) encoding of the sequential ID
-- -> Avoid exposing raw sequential IDs in URLs (enables enumeration and reveals volume)

-- Time-series data (partitioning):
-- -> BIGINT with time component (Snowflake) or UUID v7
-- -> Enables time-range partitioning without separate timestamp column

-- High-throughput (100k+ inserts/sec):
-- -> CACHE 10000+ on sequences
-- -> Or application-generated Snowflake IDs (no DB coordination)
```

### 21.7.2 Sequence Maintenance

```sql
-- Reserving ID ranges for data migration:
-- Set sequence ahead of imported data to avoid collisions
SELECT setval('orders_order_id_seq', (SELECT MAX(order_id) FROM imported_orders) + 1000);

-- Checking for sequence exhaustion (rare with BIGINT, but possible):
SELECT 
    sequencename,
    last_value,
    max_value,
    max_value - last_value as remaining
FROM pg_sequences
WHERE last_value > max_value * 0.8;  -- Alert at 80% usage

-- Restarting sequence (only for new/empty tables or specific business logic):
ALTER SEQUENCE order_id_seq RESTART WITH 1;
-- DANGER: Only if table truncated or no risk of duplicate keys
```

### 21.7.3 ID Generation in Application Layers

```sql
-- When to generate IDs in application vs database:

-- Database generation (preferred):
-- Pros: Guaranteed uniqueness, simple, no coordination needed
-- Cons: Round-trip required to get ID before use (unless RETURNING)

-- Application generation (UUID, Snowflake):
-- Pros: No DB round-trip for ID, batch inserts without ID fetch
-- Cons: Risk of collision if algorithm flawed, clock sync issues

-- Hybrid approach:
-- Use database sequences for most tables (simplicity)
-- Use application Snowflake for write-heavy sharded tables (performance)
-- Use UUID for external identifiers (security)

-- Always use RETURNING to fetch generated IDs:
INSERT INTO orders (total) VALUES (100.00) RETURNING order_id;
-- Single round-trip, atomic, no race conditions
```

---

## Chapter Summary

In this chapter, you learned:

1. **Sequence Architecture**: Sequences provide unique integer values via `nextval()`. They are non-transactional (rolled-back values are lost), guaranteeing uniqueness at the cost of gaps. The `CACHE` parameter lets each session preallocate a block of values; for tables with many concurrent inserters, values of 1000 or more largely remove contention on the shared sequence, at the price of larger gaps.

2. **IDENTITY vs SERIAL**: `GENERATED ALWAYS AS IDENTITY` (PostgreSQL 10+) is the SQL-standard replacement for SERIAL. It ties the sequence to the column as catalog metadata rather than a default expression, survives renames, needs no separate sequence privileges, and supports `OVERRIDING SYSTEM VALUE` for data migration. Prefer IDENTITY over SERIAL for new designs.

3. **UUID Performance**: UUID v4 (random) scatters inserts across the whole B-tree, causing page splits and poor cache locality. Use UUID v7 (time-ordered) or 64-bit alternatives (Snowflake) for distributed systems. If a UUID is required for business reasons but performance matters, use a BIGINT surrogate primary key for joins and foreign keys while maintaining a separate unique index on the UUID.

4. **Distributed ID Generation**: Snowflake IDs (64-bit time-ordered) provide 4096 IDs/millisecond per machine without database coordination. The Hi/Lo algorithm batches sequence allocation in application memory to reduce round-trips. COMB GUIDs modify UUIDs to have time-ordered prefixes for better index locality.

5. **Anti-Patterns**: Never use `MAX(id)+1` (race conditions and serialization bottlenecks). Never use timestamps alone as primary keys (collisions under concurrency and clock adjustments). Monitor INTEGER sequences for overflow (use BIGINT for high-volume tables). Accept that gaps are inevitable; reserve gap-free numbering for the rare business keys that require it, because it serializes writers.

**Next:** In Chapter 22, we will explore SQL Functions and Procedures—covering creation of immutable, stable, and volatile functions, security definer patterns, and when to use SQL vs PL/pgSQL for database logic.