# Chapter 7: CRUD Operations and Filtering

Mastering Create, Read, Update, and Delete (CRUD) operations is fundamental to PostgreSQL proficiency. This chapter covers industry-standard patterns for data manipulation, emphasizing safety, performance, and correctness at scale.

## 7.1 INSERT Operations: Beyond Basic Insertion

### 7.1.1 Single Row Insertion with RETURNING

The `RETURNING` clause retrieves values generated during insertion without requiring a second query. This is essential for IDs, timestamps, and computed columns.

```sql
-- Basic insert (anti-pattern: requires second query to get ID)
INSERT INTO users (email, full_name) 
VALUES ('alice@example.com', 'Alice Johnson');
-- Now you have to query: SELECT user_id FROM users WHERE email = 'alice@example.com';
-- Race condition risk: another insert might happen between statements

-- Industry standard: RETURNING clause
INSERT INTO users (email, full_name) 
VALUES ('alice@example.com', 'Alice Johnson')
RETURNING user_id, created_at, email;
-- Returns the generated ID and defaults in one atomic operation

-- Detailed explanation:
-- RETURNING works with:
-- 1. Auto-generated keys (SERIAL, IDENTITY, UUID defaults)
-- 2. DEFAULT values (NOW()) and values set by BEFORE triggers
-- 3. Stored generated columns (GENERATED ALWAYS AS ... STORED)
-- 4. Any column of the inserted row, or an expression over those columns

-- Application usage (pseudo-code):
-- user = db.query("INSERT INTO users ... RETURNING user_id, created_at")
-- application.user_id = user.user_id  -- Immediately available
```
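
To make the `RETURNING` cases concrete, here is a minimal sketch using a hypothetical `demo_accounts` temp table (an assumption for illustration, not part of this chapter's schema). It combines an identity key, a default timestamp, and a stored generated column, so it assumes PostgreSQL 12 or later.

```sql
-- Hypothetical demo table; every name here is an illustrative assumption
CREATE TEMP TABLE demo_accounts (
    account_id  BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email       TEXT NOT NULL,
    email_lower TEXT GENERATED ALWAYS AS (lower(email)) STORED,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- One statement inserts the row and reports every server-computed value
INSERT INTO demo_accounts (email)
VALUES ('Alice@Example.com')
RETURNING account_id, email_lower, created_at;

-- RETURNING can also feed a follow-up query through a data-modifying CTE
WITH new_account AS (
    INSERT INTO demo_accounts (email)
    VALUES ('Bob@Example.com')
    RETURNING account_id, email_lower
)
SELECT account_id, email_lower
FROM new_account;
```

The CTE form is useful when the generated key is needed by another statement in the same round trip.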

### 7.1.2 Bulk Insertion

Inserting multiple rows in a single statement reduces network round trips and transaction overhead.

```sql
-- Single statement bulk insert
INSERT INTO users (email, full_name, status) 
VALUES 
    ('user1@example.com', 'User One', 'active'),
    ('user2@example.com', 'User Two', 'active'),
    ('user3@example.com', 'User Three', 'pending'),
    ('user4@example.com', 'User Four', 'active')
RETURNING user_id, email;

-- Performance benefits:
-- 1. One network round trip (vs 4 separate inserts)
-- 2. Single statement, so all four rows succeed or fail together
-- 3. Less per-statement overhead (one commit and WAL flush in autocommit mode)
-- 4. Single parse/plan cycle

-- Bulk insert with SELECT (insert from another table)
INSERT INTO user_archive (user_id, email, deleted_at)
SELECT user_id, email, NOW()
FROM users
WHERE status = 'deleted' 
  AND deleted_at < NOW() - INTERVAL '1 year'
RETURNING user_id;

-- Insert with DEFAULTS for omitted columns
INSERT INTO products (sku, name) 
VALUES ('SKU-001', 'Widget');
-- price_cents uses DEFAULT, created_at uses DEFAULT NOW()

-- Insert specific columns (omitting others)
INSERT INTO users (email) 
VALUES ('minimal@example.com');
-- full_name is NULL (if nullable), created_at is DEFAULT
```

### 7.1.3 Upsert: ON CONFLICT (Merge Operations)

`ON CONFLICT` handles duplicate key violations gracefully, enabling "insert or update" semantics atomically.

```sql
-- Basic upsert: Insert or update existing
INSERT INTO users (email, full_name, updated_at) 
VALUES ('existing@example.com', 'New Name', NOW())
ON CONFLICT (email) 
DO UPDATE SET 
    full_name = EXCLUDED.full_name,
    updated_at = EXCLUDED.updated_at
RETURNING user_id, email, xmax;

-- Detailed explanation:
-- ON CONFLICT (email): Conflict target; PostgreSQL infers a unique index or constraint on email
-- DO UPDATE: Action when conflict detected
-- EXCLUDED: Refers to the row proposed for insertion (the new values)
-- xmax: System column; 0 for a newly inserted row, nonzero when an existing row was updated
--       (this relies on an implementation detail rather than documented behavior)

-- Conflict on constraint name (alternative syntax)
INSERT INTO users (email, full_name) 
VALUES ('test@example.com', 'Test User')
ON CONFLICT ON CONSTRAINT users_email_unique
DO UPDATE SET full_name = EXCLUDED.full_name;

-- Conditional upsert (update only if different)
INSERT INTO inventory (product_id, quantity, last_updated)
VALUES (101, 50, NOW())
ON CONFLICT (product_id)
DO UPDATE SET 
    quantity = EXCLUDED.quantity,
    last_updated = EXCLUDED.last_updated
WHERE inventory.quantity IS DISTINCT FROM EXCLUDED.quantity;
-- Only updates if quantity actually changed (prevents unnecessary writes)
-- IS DISTINCT FROM also handles NULL, where != would yield NULL and skip the update

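-- Upsert against a partial unique index
-- Assumes a hypothetical index (revoked_at is an illustrative column):
--   CREATE UNIQUE INDEX idx_user_sessions_active ON user_sessions (user_id) WHERE revoked_at IS NULL;
-- Index predicates must be immutable, so they cannot reference NOW()
INSERT INTO user_sessions (user_id, session_token, expires_at)
VALUES (1, 'abc123', NOW() + INTERVAL '1 day')
ON CONFLICT (user_id) 
WHERE revoked_at IS NULL  -- Must imply the partial index predicate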
DO UPDATE SET 
    session_token = EXCLUDED.session_token,
    expires_at = EXCLUDED.expires_at;

-- ON CONFLICT DO NOTHING (ignore duplicates)
INSERT INTO event_log (event_type, payload)
VALUES ('page_view', '{"url": "/home"}')
ON CONFLICT DO NOTHING;
-- Silently skips the insert if any unique or exclusion constraint would be violated
-- RETURNING is allowed with DO NOTHING, but it returns no row for a skipped insert
```
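
The upsert mechanics above can be tried end to end with a small sketch. The `demo_inventory` temp table is a hypothetical stand-in (an assumption, not the chapter's `inventory` table). It shows that inside `DO UPDATE` the existing row is addressed by the table name while the incoming row is addressed as `EXCLUDED`.

```sql
-- Hypothetical demo table; names are illustrative assumptions
CREATE TEMP TABLE demo_inventory (
    product_id   INTEGER PRIMARY KEY,
    quantity     INTEGER NOT NULL,
    last_updated TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- First run: no conflict, so the row is inserted with quantity 50
INSERT INTO demo_inventory (product_id, quantity)
VALUES (101, 50)
ON CONFLICT (product_id)
DO UPDATE SET
    quantity = demo_inventory.quantity + EXCLUDED.quantity,
    last_updated = NOW()
RETURNING product_id, quantity;

-- Second run: product 101 already exists, so the stored quantity (50)
-- and the incoming quantity (25) are added together
INSERT INTO demo_inventory (product_id, quantity)
VALUES (101, 25)
ON CONFLICT (product_id)
DO UPDATE SET
    quantity = demo_inventory.quantity + EXCLUDED.quantity,
    last_updated = NOW()
RETURNING product_id, quantity;  -- quantity is now 75
```

Accumulating rather than overwriting is only one possible merge rule; the chapter's examples overwrite with `EXCLUDED` values instead.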

### 7.1.4 Insert Performance Optimization

```sql
-- 1. Disable triggers temporarily (bulk load scenario)
ALTER TABLE users DISABLE TRIGGER ALL;
-- Insert millions of rows...
ALTER TABLE users ENABLE TRIGGER ALL;
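-- Caution: ALL includes the internal triggers that enforce foreign keys (requires superuser),
-- and rows loaded in the meantime are not re-checked when triggers are re-enabled
-- Only use for initial data loads, not production operations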

-- 2. Use COPY instead of INSERT for bulk loads
COPY users (email, full_name) FROM '/path/to/users.csv' WITH (FORMAT CSV, HEADER);
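-- Usually much faster than row-by-row INSERTs (far less per-statement overhead)
-- Constraints and triggers still apply; the path is read by the server process
-- (use psql's \copy to load a file from the client machine)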

-- 3. Batch inserts in transactions
BEGIN;
INSERT INTO logs ...;
INSERT INTO logs ...;
-- ... 1000 rows ...
COMMIT;
-- One commit per batch vs one per row = massive speedup

-- 4. Unlogged tables for temporary data (PostgreSQL 9.1+)
CREATE UNLOGGED TABLE temp_import (
    data TEXT
);
-- No WAL overhead, but the table is truncated after a crash and is not replicated
-- Use for ETL staging tables

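-- 5. Table locking for bulk inserts (exclusive write access during the load)
BEGIN;
LOCK TABLE users IN SHARE MODE;  -- LOCK TABLE only works inside a transaction block
-- Insert rows...
COMMIT;
-- Blocks INSERT/UPDATE/DELETE from other sessions until commit; reads continue
-- Also conflicts with autovacuum on this table, so keep the transaction short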
```

## 7.2 UPDATE Strategies: Safe and Efficient Modifications

### 7.2.1 Basic Updates with RETURNING

```sql
-- Update single row with confirmation
UPDATE users 
SET full_name = 'Alice Smith', updated_at = NOW()
WHERE user_id = 123
RETURNING user_id, full_name, updated_at;
-- No row returned means no row matched the WHERE clause

-- Update with correlated subquery (aliases keep the inner and outer tables distinct)
UPDATE products p
SET price_cents = (
    SELECT MAX(p2.price_cents) * 0.9 
    FROM products p2 
    WHERE p2.category_id = p.category_id
)
WHERE p.product_id = 456;

-- Update from another table (JOIN update)
UPDATE users u
SET status = 'inactive'
FROM user_sessions s
WHERE u.user_id = s.user_id
  AND s.last_activity < NOW() - INTERVAL '1 year'
RETURNING u.user_id, u.email;

-- Detailed JOIN update explanation:
-- UPDATE target_table
-- SET ...
-- FROM source_table
-- WHERE join_condition
-- Only updates rows in target_table where join_condition matches
-- If several source rows match one target row, only one is used (which one is unpredictable)
```

### 7.2.2 Batch Updates (Avoiding Long Transactions)

Large updates should be batched to prevent long-running transactions and table bloat.

```sql
-- Anti-pattern: Updating millions of rows at once
UPDATE large_table SET status = 'archived' WHERE created_at < '2023-01-01';
-- Problems:
-- 1. Holds row locks on every affected row until commit (blocks other writers to those rows)
-- 2. Generates massive WAL (replication lag)
-- 3. All work is lost if cancelled
-- 4. Bloats table (old row versions cannot be vacuumed while the transaction is open)

-- Industry standard: Batch updates with LIMIT
-- COMMIT inside DO requires PostgreSQL 11+ and must not run inside an explicit transaction block
DO $$
DECLARE
    rows_updated INTEGER;
BEGIN
    LOOP
        UPDATE large_table 
        SET status = 'archived'
        WHERE ctid IN (
            SELECT ctid 
            FROM large_table 
            WHERE status != 'archived' 
              AND created_at < '2023-01-01'
            LIMIT 1000
        );
        
        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        EXIT WHEN rows_updated = 0;
        
        COMMIT;  -- Commit each batch
        PERFORM pg_sleep(0.1);  -- Brief pause to let other queries in
    END LOOP;
END $$;

-- Alternative: Cursor-based batching (one batch of up to 1000 rows per run)
DO $$
DECLARE
    cur CURSOR FOR 
        SELECT ctid FROM large_table 
        WHERE status != 'archived' 
        AND created_at < '2023-01-01';
    rec RECORD;  -- Holds the row fetched from the cursor
BEGIN
    OPEN cur;
    FOR i IN 1..1000 LOOP
        FETCH cur INTO rec;
        EXIT WHEN NOT FOUND;
        
        UPDATE large_table 
        SET status = 'archived' 
        WHERE ctid = rec.ctid;
    END LOOP;
    CLOSE cur;
END $$;

-- ctid explanation:
-- ctid is the physical row location (block_number, offset)
-- Fastest way to identify a row for update
-- Changes whenever the row is updated, and after VACUUM FULL or pg_repack, so never store it long-term
```

### 7.2.3 Update Pitfalls and Safety

```sql
-- Safety check: Verify row count before update
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
-- Check the command tag: it should read "UPDATE 1"
-- If UPDATE 0: Wrong account_id
-- If more than 1: WHERE clause does not pin down a single row (dangerous!)
ROLLBACK;  -- or COMMIT if correct

-- Common mistake: Missing WHERE clause
UPDATE users SET status = 'deleted';
-- Updates EVERY row in table!
-- Recovery: Restore from backup or point-in-time recovery

-- Safe update pattern: Use primary key always
UPDATE users SET status = 'deleted' WHERE user_id = 123;

-- Optimistic locking (prevent lost updates)
UPDATE users 
SET full_name = 'New Name', version = version + 1
WHERE user_id = 123 AND version = 5;
-- If another transaction updated first, version is now 6, update affects 0 rows
-- Application checks: If rows_affected = 0, throw concurrency error

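-- Returning old values for audit (OLD/NEW in RETURNING requires PostgreSQL 18+)
UPDATE users 
SET email = 'new@example.com'
WHERE user_id = 123
RETURNING user_id, 
          old.email AS old_email, 
          new.email AS new_email,
          NOW() AS changed_at;

-- Before PostgreSQL 18: join the pre-update row through FROM
UPDATE users u
SET email = 'new@example.com'
FROM users prev
WHERE u.user_id = 123
  AND prev.user_id = u.user_id
RETURNING u.user_id, prev.email AS old_email, u.email AS new_email;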
```
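
The optimistic locking pattern can be walked through with two competing writers. This is a minimal sketch on a hypothetical `demo_profiles` temp table (an assumption for illustration). Both writers are assumed to have read the row at `version = 1` before attempting their update.

```sql
-- Hypothetical demo table; names are illustrative assumptions
CREATE TEMP TABLE demo_profiles (
    profile_id INTEGER PRIMARY KEY,
    full_name  TEXT NOT NULL,
    version    INTEGER NOT NULL DEFAULT 1
);

INSERT INTO demo_profiles (profile_id, full_name)
VALUES (1, 'Alice');

-- Writer A updates first: the version check passes, one row is returned,
-- and the stored version becomes 2
UPDATE demo_profiles
SET full_name = 'Alice A', version = version + 1
WHERE profile_id = 1 AND version = 1
RETURNING profile_id, version;

-- Writer B still holds version 1: the check fails, no row is returned,
-- and the application should re-read the row and retry or report a conflict
UPDATE demo_profiles
SET full_name = 'Alice B', version = version + 1
WHERE profile_id = 1 AND version = 1
RETURNING profile_id, version;

-- The row keeps Writer A's change
SELECT profile_id, full_name, version
FROM demo_profiles;
```

Using `RETURNING` here lets the application treat "no row returned" as the conflict signal instead of inspecting the affected row count separately.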

## 7.3 DELETE Operations: Safe Removal Patterns

### 7.3.1 Soft Deletes (Logical Deletion)

Hard deletes are permanent. A widely used alternative is the soft delete: mark the row as deleted instead of removing it, which keeps the data available for recovery and audit trails.

```sql
-- Add soft delete columns to table
ALTER TABLE users ADD COLUMN deleted_at TIMESTAMPTZ;
ALTER TABLE users ADD COLUMN deleted_by BIGINT REFERENCES users(user_id);
ALTER TABLE users ADD COLUMN is_deleted BOOLEAN DEFAULT FALSE;

-- Create view for active records (hides deleted)
CREATE VIEW active_users AS
SELECT * FROM users 
WHERE deleted_at IS NULL;

-- Soft delete operation
UPDATE users 
SET deleted_at = NOW(), 
    is_deleted = TRUE,
    deleted_by = $1  -- Bind parameter: acting user's ID from application context
WHERE user_id = 123
RETURNING user_id, deleted_at;

-- Hard delete (only after soft delete and confirmation period)
DELETE FROM users 
WHERE user_id = 123 
  AND deleted_at < NOW() - INTERVAL '30 days';

-- Unique constraints with soft deletes (allow re-use of email after delete)
-- Partial unique index excludes deleted records
-- (any plain UNIQUE constraint on email must be dropped first, or it still blocks re-use)
CREATE UNIQUE INDEX idx_users_email_active 
ON users(email) 
WHERE deleted_at IS NULL;

-- This allows:
-- INSERT email='a@b.com' (succeeds)
-- Soft delete that user
-- INSERT email='a@b.com' (succeeds again, because deleted record excluded from index)
```
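
The email re-use sequence described above can be reproduced with a short sketch. The `demo_members` temp table and its index name are hypothetical (assumptions for illustration). The last insert also shows how an upsert targets a partial unique index by repeating the index predicate after the conflict columns.

```sql
-- Hypothetical demo table; names are illustrative assumptions
CREATE TEMP TABLE demo_members (
    member_id  BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email      TEXT NOT NULL,
    deleted_at TIMESTAMPTZ
);

-- Uniqueness is enforced only among rows that are not soft-deleted
CREATE UNIQUE INDEX demo_members_email_active
ON demo_members (email)
WHERE deleted_at IS NULL;

-- Insert, soft delete, then insert the same address again
INSERT INTO demo_members (email) VALUES ('a@b.com');

UPDATE demo_members
SET deleted_at = NOW()
WHERE email = 'a@b.com' AND deleted_at IS NULL;

INSERT INTO demo_members (email) VALUES ('a@b.com');  -- allowed: old row is excluded

-- A further insert collides with the active row and is skipped
INSERT INTO demo_members (email) VALUES ('a@b.com')
ON CONFLICT (email) WHERE deleted_at IS NULL DO NOTHING;

-- Two rows remain: one soft-deleted, one active
SELECT member_id, email, deleted_at IS NOT NULL AS is_deleted
FROM demo_members
ORDER BY member_id;
```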

### 7.3.2 Cascading Deletes

```sql
-- Manual cascade (controlled deletion)
BEGIN;
-- 1. Delete child records first (or archive them)
INSERT INTO deleted_order_items_archive 
SELECT * FROM order_items WHERE order_id = 123;

DELETE FROM order_items WHERE order_id = 123;

-- 2. Delete parent
DELETE FROM orders WHERE order_id = 123;
COMMIT;

-- Using foreign key ON DELETE CASCADE (automatic)
-- Defined in schema:
-- CONSTRAINT fk_order_items_order 
--     FOREIGN KEY (order_id) REFERENCES orders(order_id) 
--     ON DELETE CASCADE

-- Check what will be deleted (dry run)
SELECT 
    'orders' as table_name, 
    count(*) as row_count 
FROM orders 
WHERE order_id = 123
UNION ALL
SELECT 
    'order_items', 
    count(*) 
FROM order_items 
WHERE order_id = 123
UNION ALL
SELECT 
    'order_payments', 
    count(*) 
FROM order_payments 
WHERE order_id = 123;

-- TRUNCATE for complete table wipe (fast)
TRUNCATE TABLE temp_logs;
-- Faster than DELETE for entire table (no per-row triggers, minimal WAL)
-- Transactional in PostgreSQL: can be rolled back until the transaction commits
-- Takes an ACCESS EXCLUSIVE lock and requires the TRUNCATE privilege
```

### 7.3.3 Delete Performance

```sql
-- Mass delete optimization
-- Anti-pattern:
DELETE FROM huge_table WHERE created_at < '2022-01-01';
-- Holds row locks until commit, generates tons of WAL, bloats table

-- Better: Batch deletes
DELETE FROM huge_table 
WHERE ctid IN (
    SELECT ctid 
    FROM huge_table 
    WHERE created_at < '2022-01-01' 
    LIMIT 1000
);
-- Repeat until no rows left

-- Even better: Drop old partitions
-- If table partitioned by date, drop old partitions:
DROP TABLE huge_table_2021_q4;
-- Near-instant, no bloat, minimal WAL (briefly takes an ACCESS EXCLUSIVE lock on the parent)
```

## 7.4 WHERE Clause Optimization and Filtering

### 7.4.1 Index-Friendly WHERE Clauses

```sql
-- SARGable predicates (Search ARGument ABLE)
-- Good: Can use index
SELECT * FROM users WHERE user_id = 123;
SELECT * FROM users WHERE email = 'test@example.com';
SELECT * FROM users WHERE created_at > '2024-01-01';
SELECT * FROM users WHERE status IN ('active', 'pending');

-- Bad: A plain index on email cannot be used (function applied to the column)
SELECT * FROM users WHERE LOWER(email) = 'test@example.com';
-- Fix: Create an expression index
CREATE INDEX idx_users_email_lower ON users(LOWER(email));

-- Bad: Leading wildcard
SELECT * FROM users WHERE email LIKE '%@example.com';
-- Fix: Use trigram index (pg_trgm extension)
CREATE INDEX idx_users_email_trgm ON users USING GIN(email gin_trgm_ops);

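-- Bad: Type mismatch that forces a cast on the column (prevents index usage)
SELECT * FROM users WHERE user_id = 123.0;
-- user_id is BIGINT, 123.0 is NUMERIC
-- PostgreSQL compares as NUMERIC (user_id::numeric = 123.0), so a B-tree index on user_id is skipped
-- A quoted literal such as '123' is fine: it starts untyped and is resolved to BIGINT
-- Fix: Use correct type
SELECT * FROM users WHERE user_id = 123::BIGINT;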

-- Bad: OR across different columns (a single index cannot serve both conditions)
SELECT * FROM users WHERE email = 'a@b.com' OR phone = '555-1234';
-- Fix: Use UNION
SELECT * FROM users WHERE email = 'a@b.com'
UNION
SELECT * FROM users WHERE phone = '555-1234';

-- Or create separate indexes and let bitmap scan handle it
```

### 7.4.2 NULL Handling (Three-Valued Logic)

SQL uses three-valued logic: TRUE, FALSE, and NULL. NULL comparisons require special operators.

```sql
-- NULL comparison (correct way)
SELECT * FROM users WHERE deleted_at IS NULL;
SELECT * FROM users WHERE deleted_at IS NOT NULL;

-- Wrong: Never use = NULL or != NULL
SELECT * FROM users WHERE deleted_at = NULL;  -- Returns nothing (always UNKNOWN)
SELECT * FROM users WHERE deleted_at != NULL; -- Returns nothing (always UNKNOWN)

-- NULL in expressions
SELECT NULL = NULL;    -- NULL (unknown if two unknowns are equal)
SELECT NULL IS NULL;   -- TRUE
SELECT TRUE AND NULL;  -- NULL (result depends on the unknown value)
SELECT FALSE AND NULL; -- FALSE (FALSE AND anything is FALSE)
SELECT TRUE OR NULL;   -- TRUE (TRUE OR anything is TRUE)
SELECT FALSE OR NULL;  -- NULL (depends on other value)

-- COALESCE: Replace NULL with default
SELECT COALESCE(phone, 'No phone') FROM users;
SELECT COALESCE(deleted_at, created_at) as relevant_date FROM users;

-- NULLIF: Return NULL if equal
SELECT NULLIF(status, 'pending') as non_pending_status FROM orders;
-- Returns NULL if status is 'pending', else returns status

-- Handling NULL in aggregates
SELECT 
    COUNT(*) as total_rows,           -- Counts all rows
    COUNT(email) as non_null_emails,  -- Counts non-NULL emails only
    AVG(age) as average_age           -- Ignores NULL ages
FROM users;

-- Filter out NULLs explicitly when needed
SELECT * FROM users 
WHERE email IS NOT NULL 
  AND phone IS NOT NULL;
```
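
Three-valued logic has practical consequences in list membership and inequality tests. The statements below are a minimal sketch that needs no tables. They show how a single NULL in a `NOT IN` list removes every row, and how `IS DISTINCT FROM` gives a NULL-safe comparison.

```sql
-- 3 <> NULL is unknown, so the whole NOT IN test is unknown and the row is filtered out
SELECT 'kept' AS result WHERE 3 NOT IN (1, 2, NULL);   -- returns no rows
SELECT 'kept' AS result WHERE 3 NOT IN (1, 2);         -- returns one row

-- IN can still be TRUE when a definite match exists alongside the NULL
SELECT 'kept' AS result WHERE 1 IN (1, NULL);          -- returns one row

-- IS DISTINCT FROM treats NULL as an ordinary value and never yields NULL
SELECT NULL IS DISTINCT FROM NULL AS both_null,        -- false
       1 IS DISTINCT FROM NULL    AS one_null,         -- true
       1 IS DISTINCT FROM 1       AS same_value;       -- false
```

The same reasoning applies to `NOT IN (subquery)` when the subquery can return NULL, which is why `NOT EXISTS` is generally the safer form.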

### 7.4.3 Advanced Filtering

```sql
-- Range conditions (BETWEEN is inclusive)
SELECT * FROM orders 
WHERE created_at BETWEEN '2024-01-01' AND '2024-01-31';
-- Includes 2024-01-01 00:00:00 through 2024-01-31 00:00:00, so the rest of January 31 is missed
-- Better for timestamps: half-open range
SELECT * FROM orders 
WHERE created_at >= '2024-01-01' 
  AND created_at < '2024-02-01';

-- IN clause (optimize large lists with JOIN)
SELECT * FROM users WHERE user_id IN (1, 2, 3, 4, 5);

-- For very large lists (thousands), use temporary table:
CREATE TEMP TABLE tmp_user_ids (user_id BIGINT PRIMARY KEY);
INSERT INTO tmp_user_ids VALUES (1), (2), (3), ...;
SELECT u.* FROM users u
JOIN tmp_user_ids t ON u.user_id = t.user_id;

-- EXISTS vs IN (PostgreSQL usually plans both as a semi-join)
SELECT * FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o 
    WHERE o.user_id = u.user_id 
      AND o.total > 1000
);
-- vs
SELECT * FROM users 
WHERE user_id IN (
    SELECT user_id FROM orders 
    WHERE total > 1000
);
-- Performance is typically equivalent; compare the plans with EXPLAIN
-- The forms differ when negated: NOT IN returns no rows if the subquery yields a NULL, NOT EXISTS does not

-- ALL/ANY comparisons
SELECT * FROM products 
WHERE price > ALL (SELECT price FROM competitor_products);
-- Price higher than ALL competitor prices

-- Row comparisons
SELECT * FROM users 
WHERE (last_name, first_name) > ('Smith', 'John');
-- Composite comparison (useful for keyset pagination)
```

## 7.5 Pagination: Handling Large Result Sets

### 7.5.1 OFFSET/LIMIT (Simple but Slow)

```sql
-- Basic pagination (anti-pattern for large offsets)
SELECT * FROM users 
ORDER BY user_id 
LIMIT 10 OFFSET 1000;
-- Problem: PostgreSQL must scan and discard 1000 rows to return 10
-- Time increases linearly with offset

-- Better: Cursor-based pagination (Keyset Pagination)
SELECT * FROM users 
WHERE user_id > 1000  -- Last seen ID from previous page
ORDER BY user_id 
LIMIT 10;
-- Cost does not grow with page depth (given an index on user_id)
-- Requires ordered unique column (or composite)

-- Implementation with tie-breaker
SELECT * FROM users 
WHERE (created_at, user_id) > ('2024-01-15 10:00:00', 1000)
ORDER BY created_at, user_id 
LIMIT 10;
-- Handles duplicate created_at values
```

### 7.5.2 Seek Method (Keyset Pagination)

```sql
-- Page 1
SELECT * FROM users 
ORDER BY created_at DESC, user_id DESC 
LIMIT 10;
-- Returns rows with created_at: [Jan 15, Jan 14, Jan 14, Jan 13...]
-- Last row: created_at='2024-01-13 10:00:00', user_id=500

-- Page 2 (use last row values)
SELECT * FROM users 
WHERE (created_at, user_id) < ('2024-01-13 10:00:00', 500)
ORDER BY created_at DESC, user_id DESC 
LIMIT 10;

-- Advantages:
-- 1. Cost independent of page depth (index descent plus LIMIT rows, given an index on the sort columns)
-- 2. Stable results (new inserts don't shift page boundaries)
-- 3. Works with live data

-- Disadvantages:
-- 1. Cannot jump to arbitrary page number (must navigate sequentially)
-- 2. Requires unique sort key (or composite)
-- 3. Complex implementation for multi-column sorting

-- Total count for UI (expensive, consider estimates)
SELECT COUNT(*) FROM users WHERE status = 'active';
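-- For large tables, use the planner's whole-table estimate (it ignores the status filter,
-- is refreshed by VACUUM/ANALYZE, and is -1 on PostgreSQL 14+ if the table was never analyzed):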
SELECT reltuples::BIGINT as estimated_count 
FROM pg_class 
WHERE relname = 'users';
```
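
The seek method can be exercised on generated data. This sketch uses a hypothetical `demo_events` temp table (an assumption for illustration) with 100 rows in which pairs of rows share a `created_at` value, so the `event_id` tie-breaker actually matters.

```sql
-- Hypothetical demo table; names are illustrative assumptions
CREATE TEMP TABLE demo_events (
    event_id   BIGINT PRIMARY KEY,
    created_at TIMESTAMPTZ NOT NULL
);

-- 100 rows; integer division g / 2 gives neighbouring rows the same timestamp
INSERT INTO demo_events (event_id, created_at)
SELECT g, TIMESTAMPTZ '2024-01-01 00:00:00+00' + INTERVAL '1 minute' * (g / 2)
FROM generate_series(1, 100) AS g;

-- Index matching the sort order, so each page starts with an index descent
CREATE INDEX demo_events_keyset
ON demo_events (created_at DESC, event_id DESC);

-- Page 1: event_id 100 down to 91
SELECT event_id, created_at
FROM demo_events
ORDER BY created_at DESC, event_id DESC
LIMIT 10;

-- Page 2: seek past the last row of page 1 (event_id 91 at 00:45 UTC),
-- which yields event_id 90 down to 81
SELECT event_id, created_at
FROM demo_events
WHERE (created_at, event_id) < ('2024-01-01 00:45:00+00', 91)
ORDER BY created_at DESC, event_id DESC
LIMIT 10;
```

Row 90 shares its timestamp with row 91, and the row comparison still returns it exactly once because ties on `created_at` are resolved by `event_id`.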

### 7.5.3 Cursors (For Large Processing)

```sql
-- Server-side cursor for processing millions of rows
BEGIN;
DECLARE user_cursor CURSOR FOR 
    SELECT user_id, email FROM users WHERE status = 'active';

-- Fetch batches
FETCH 1000 FROM user_cursor;
-- Process 1000 rows...
FETCH 1000 FROM user_cursor;
-- Process next 1000...

CLOSE user_cursor;
COMMIT;

-- Cursor considerations:
-- 1. Holds the transaction open (delays vacuum cleanup and keeps any locks taken)
-- 2. Must be declared inside a transaction block (explicit BEGIN), unless WITH HOLD is used
-- 3. WITH HOLD cursors survive COMMIT (the result set is materialized; CLOSE them when done)
```

---

## Chapter Summary

In this chapter, you learned:

1. **INSERT Operations**: Use `RETURNING` to retrieve generated values atomically; bulk insert with multi-row VALUES or `COPY` for performance; leverage `ON CONFLICT` for upsert patterns (DO UPDATE for merges, DO NOTHING for idempotency); disable triggers temporarily only for initial data loads.

2. **UPDATE Strategies**: Always use `RETURNING` to verify changes; batch large updates using `ctid` with `LIMIT` to prevent long transactions and table bloat; use optimistic locking (`version` column) to prevent lost updates; never update without a `WHERE` clause (use transactions to verify row counts first).

3. **DELETE Safety**: Implement soft deletes (`deleted_at` timestamp) for data recovery; use partial unique indexes (`WHERE deleted_at IS NULL`) to allow reuse of unique values after soft delete; batch hard deletes to limit lock duration and bloat; use `TRUNCATE` for complete table wipes (fast, but it takes an `ACCESS EXCLUSIVE` lock and is permanent once committed).

4. **WHERE Clause Optimization**: Write SARGable predicates (avoid functions on indexed columns, type mismatches that cast the column, and leading wildcards); understand three-valued logic (`IS NULL`, not `= NULL`); prefer `NOT EXISTS` over `NOT IN` for subqueries that can return NULL; consider `UNION` instead of `OR` for disparate conditions.

5. **NULL Handling**: SQL uses TRUE/FALSE/UNKNOWN logic; `NULL = NULL` is UNKNOWN (not TRUE); use `IS NULL`/`IS NOT NULL` for comparisons; use `COALESCE` to provide defaults; use `NULLIF` to convert values to NULL; remember aggregates ignore NULLs.

6. **Pagination**: Avoid `OFFSET` for large result sets (cost grows linearly with the offset); use keyset pagination (seek method) with `WHERE (col1, col2) > (last_val1, last_val2)` so that cost does not grow with page depth; use server-side cursors for processing large datasets in batches; consider estimated counts instead of exact `COUNT(*)` for large tables.

---

**Next:** In Chapter 8, we will explore joins, aggregation, and set logic, covering inner and outer join mechanics, aggregation with GROUP BY and HAVING, window functions for analytical queries, and set operations (UNION, INTERSECT, EXCEPT) that combine result sets.