# Chapter 8: Joins, Aggregation, and Set Logic

Mastering data combination and summarization is essential for complex analytical queries and application logic. This chapter covers SQL's most powerful operations for combining tables, summarizing data, and performing analytical calculations across row sets.

## 8.1 Joins: Combining Data from Multiple Tables

Joins are fundamental to relational databases, allowing you to query across normalized tables. Understanding join mechanics, algorithms, and pitfalls is critical for both correctness and performance.

### 8.1.1 Join Fundamentals and Syntax

PostgreSQL supports several join syntaxes. The ANSI SQL-92 syntax (using `JOIN` keywords) is the industry standard and should be used exclusively.

```sql
-- ANSI SQL-92 Syntax (Industry Standard)
SELECT u.user_id, u.email, o.order_id, o.total
FROM users u
INNER JOIN orders o ON u.user_id = o.user_id
WHERE u.status = 'active';

-- Anti-pattern: Older implicit joins (SQL-89 style: comma-separated tables, join condition in WHERE)
SELECT u.user_id, u.email, o.order_id, o.total
FROM users u, orders o
WHERE u.user_id = o.user_id  -- This is a join condition
  AND u.status = 'active';

-- Why ANSI-92 is superior:
-- 1. Join logic is separated from filter logic (WHERE clause)
-- 2. Easier to read complex multi-table joins
-- 3. Outer join syntax is standardized (old syntax was database-specific)
-- 4. Less prone to accidental Cartesian products (missing join condition)
```

### 8.1.2 INNER JOIN: The Default Join

`INNER JOIN` returns only rows for which the join condition matches in both tables. It is the most common join type, and a bare `JOIN` (with `INNER` omitted) means the same thing.

```sql
-- Basic INNER JOIN
SELECT 
    u.user_id,
    u.email,
    o.order_id,
    o.created_at,
    o.total_cents
FROM users u
INNER JOIN orders o ON u.user_id = o.user_id
WHERE o.created_at > '2024-01-01';

-- Multiple INNER JOINs
SELECT 
    u.user_id,
    u.email,
    o.order_id,
    oi.product_id,
    oi.quantity,
    p.name as product_name
FROM users u
INNER JOIN orders o ON u.user_id = o.user_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id
WHERE u.status = 'active';

-- Joining on multiple columns (composite keys)
SELECT *
FROM table_a a
INNER JOIN table_b b 
    ON a.org_id = b.org_id 
   AND a.department_id = b.department_id;

-- Non-equi joins (join on range conditions)
SELECT 
    e.employee_name,
    s.salary_grade
FROM employees e
INNER JOIN salary_brackets s 
    ON e.salary >= s.min_salary 
   AND e.salary <= s.max_salary;

-- Self-join (joining a table to itself)
SELECT 
    e.name as employee,
    m.name as manager
FROM employees e
INNER JOIN employees m ON e.manager_id = m.employee_id;

-- Detailed explanation:
-- INNER JOIN filters out rows where the join condition is not met
-- If a user has no orders, they won't appear in the result
-- Order of tables doesn't matter for INNER JOIN (optimizer chooses)
-- But order matters for readability (driving table first convention)
```

### 8.1.3 OUTER JOINs: Preserving Unmatched Rows

Outer joins return matching rows plus unmatched rows from one or both tables, filling with NULLs where data is missing.

```sql
-- LEFT OUTER JOIN (or just LEFT JOIN)
-- Returns all rows from left table (users), matched rows from right (orders)
-- Users with no orders will have NULL for order columns
SELECT 
    u.user_id,
    u.email,
    o.order_id,
    o.total_cents
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE u.created_at > '2024-01-01';

-- Common pattern: Find users with NO orders
SELECT 
    u.user_id,
    u.email
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE o.order_id IS NULL;  -- Filters to unmatched rows only

-- RIGHT OUTER JOIN (rarely used, usually rewritten as LEFT JOIN)
-- Returns all rows from right table, matched rows from left
SELECT 
    u.user_id,
    u.email,
    o.order_id
FROM users u
RIGHT JOIN orders o ON u.user_id = o.user_id;
-- Better rewritten as:
-- FROM orders o LEFT JOIN users u ON o.user_id = u.user_id

-- FULL OUTER JOIN: Returns all rows from both tables
-- Matches where possible, NULLs where no match
SELECT 
    COALESCE(u.user_id, o.user_id) as user_id,
    u.email,
    o.order_id,
    o.total_cents,
    CASE 
        WHEN u.user_id IS NULL THEN 'Order without user'
        WHEN o.order_id IS NULL THEN 'User without orders'
        ELSE 'Matched'
    END as match_status
FROM users u
FULL OUTER JOIN orders o ON u.user_id = o.user_id;

-- Practical use case: Data reconciliation between systems
SELECT 
    a.id as system_a_id,
    b.id as system_b_id,
    a.value as value_a,
    b.value as value_b
FROM system_a_table a
FULL OUTER JOIN system_b_table b ON a.id = b.id
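WHERE a.id IS NULL                        -- Present only in system B
   OR b.id IS NULL                        -- Present only in system A
   OR a.value IS DISTINCT FROM b.value;   -- Present in both, but values differ
-- IS DISTINCT FROM handles NULL comparisons properly (NULL != NULL is NULL, not TRUE)
-- The id checks keep one-sided rows whose value is NULL, which the value comparison alone would drop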
```

### 8.1.4 Join Pitfalls and Common Mistakes

```sql
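-- Pitfall 1: Accidental Cartesian Product (Cross Join)
-- Missing join condition with comma-separated tables
SELECT * FROM users u, orders o;  -- No WHERE u.user_id = o.user_id!
-- Returns count(users) * count(orders) rows (disaster for large tables)
-- With explicit syntax, "users u JOIN orders o" without ON or USING is a syntax error instead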

-- Pitfall 2: Joining on the wrong column
SELECT * FROM users u
JOIN orders o ON u.email = o.customer_email;  -- Might work but fragile
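-- Better: Join on key columns (user_id), which are stable identifiers
-- Note: PostgreSQL indexes the referenced primary key automatically, but NOT the
-- referencing foreign key column (orders.user_id); create that index yourself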

-- Pitfall 3: Filtering in WHERE vs ON clause (LEFT JOIN)
-- Wrong: Filters out users with no orders (effectively becomes INNER JOIN)
SELECT u.user_id, o.order_id
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE o.status = 'completed';  -- Eliminates NULL rows from LEFT JOIN!

-- Correct: Move filter to ON clause to preserve unmatched rows
SELECT u.user_id, o.order_id
FROM users u
LEFT JOIN orders o 
    ON u.user_id = o.user_id 
   AND o.status = 'completed';
-- Now users with no completed orders still appear (with NULL order_id)

-- Pitfall 4: Duplicate rows from one-to-many relationships
-- If a user has 3 orders, they appear 3 times in the result
SELECT u.user_id, u.email, o.order_id
FROM users u
JOIN orders o ON u.user_id = o.user_id;
-- User appears once per order

-- Solution if you only want user data:
SELECT u.user_id, u.email
FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.user_id = u.user_id
);
-- Or join and GROUP BY u.user_id if you need per-user order aggregates

-- Pitfall 5: NULL handling in joins
-- NULL never equals NULL, so these rows won't join
SELECT * FROM table_a a
JOIN table_b b ON a.optional_id = b.optional_id;
-- Rows where a.optional_id IS NULL won't match even if b.optional_id IS NULL

-- Solution for matching NULLs (if business logic requires):
SELECT * FROM table_a a
JOIN table_b b ON (
    a.optional_id = b.optional_id OR 
    (a.optional_id IS NULL AND b.optional_id IS NULL)
);
-- Shorter equivalent: ON a.optional_id IS NOT DISTINCT FROM b.optional_id
-- Either way, prefer non-nullable join columns where the schema allows it
```
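The difference between filtering in `WHERE` and in `ON` (Pitfall 3) is easiest to see on a tiny data set. The sketch below uses inline `VALUES` lists in place of real tables; the contents of the `users` and `orders` CTEs are made up purely for illustration.

```sql
-- Hypothetical sample data standing in for the real tables
WITH users(user_id) AS (
    VALUES (1), (2), (3)
),
orders(order_id, user_id, status) AS (
    VALUES (10, 1, 'completed'),
           (11, 2, 'pending')
)
-- Variant A: the filter in WHERE discards the NULL-extended rows
SELECT 'filter in WHERE' AS variant, u.user_id, o.order_id
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE o.status = 'completed'
UNION ALL
-- Variant B: the filter in ON only restricts which orders can match
SELECT 'filter in ON', u.user_id, o.order_id
FROM users u
LEFT JOIN orders o
    ON u.user_id = o.user_id
   AND o.status = 'completed'
ORDER BY variant, user_id;
```

With this data, the `WHERE` variant returns only user 1, while the `ON` variant returns all three users, with a NULL `order_id` for users 2 and 3.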

### 8.1.5 CROSS JOIN and NATURAL JOIN

```sql
-- CROSS JOIN: Cartesian product of both tables
-- Every row from A paired with every row from B
SELECT 
    c.color_name,
    s.size_name
FROM colors c
CROSS JOIN sizes s;
-- Useful for generating combinations (e.g., all color/size variants)

-- Implicit CROSS JOIN (anti-pattern, avoid)
SELECT * FROM colors, sizes;  -- Same as CROSS JOIN but less clear

-- Practical use: Generating date series for reporting
WITH date_range AS (
    -- With an interval step, generate_series returns timestamps, so cast back to date
    SELECT generate_series('2024-01-01'::date, '2024-01-31'::date, '1 day'::interval)::date as date
),
categories AS (
    SELECT category_id FROM product_categories
)
SELECT 
    d.date,
    c.category_id,
    COALESCE(s.sales_count, 0) as sales_count
FROM date_range d
CROSS JOIN categories c
LEFT JOIN sales_summary s ON s.date = d.date AND s.category_id = c.category_id;

-- NATURAL JOIN (AVOID in production)
-- Automatically joins on columns with same names in both tables
SELECT * FROM users NATURAL JOIN orders;
-- Joins on user_id if both tables have it, but also on any other matching column names
-- Dangerous: Schema changes (adding matching column names) silently changes query logic
-- Explicit join conditions are always preferred
```
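If the appeal of `NATURAL JOIN` is brevity, `JOIN ... USING` is a middle ground worth considering: the join columns are named explicitly, so a later schema change cannot silently alter the join. A minimal sketch, assuming `users` and `orders` both have a `user_id` column as in the earlier examples:

```sql
-- USING names the shared column explicitly; unlike NATURAL JOIN,
-- other same-named columns (e.g. created_at) play no part in the join.
SELECT user_id, u.email, o.order_id
FROM users u
JOIN orders o USING (user_id);
-- user_id appears once in the output and can be referenced unqualified.
```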

### 8.1.6 LATERAL Joins: Row-by-Row Subqueries

`LATERAL` allows subqueries in the FROM clause to reference columns from preceding tables, effectively creating correlated subqueries that can return multiple rows.

```sql
-- Without LATERAL: Get top 3 orders per user (inefficient subquery)
SELECT u.user_id, u.email, o.order_id, o.total
FROM users u
JOIN orders o ON o.user_id = u.user_id
WHERE o.order_id IN (
    SELECT order_id 
    FROM orders o2 
    WHERE o2.user_id = u.user_id  -- Correlated subquery
    ORDER BY total DESC 
    LIMIT 3
);

-- With LATERAL: More efficient, cleaner
SELECT u.user_id, u.email, o.order_id, o.total
FROM users u
LEFT JOIN LATERAL (
    SELECT order_id, total
    FROM orders o
    WHERE o.user_id = u.user_id
    ORDER BY total DESC
    LIMIT 3
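) o ON true;  -- LEFT JOIN still needs a condition; the correlation lives inside the subquery
-- Returns up to 3 rows per user; users with no orders appear once with NULL order columns
-- (use CROSS JOIN LATERAL, which takes no ON clause, to drop users without orders)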

-- LATERAL with functions
SELECT 
    u.user_id,
    u.email,
    recent_orders.*
FROM users u
LEFT JOIN LATERAL get_recent_orders(u.user_id, 5) recent_orders ON true;
-- Where get_recent_orders is a set-returning function

-- Practical pattern: Calculate running totals per category
SELECT 
    p.category_id,
    p.product_id,
    p.price,
    running.total_price
FROM products p
LEFT JOIN LATERAL (
    SELECT SUM(price) as total_price
    FROM products p2
    WHERE p2.category_id = p.category_id
      AND p2.product_id <= p.product_id
) running ON true;

-- LATERAL vs JOIN: Performance considerations
-- LATERAL executes the subquery for each row of the left table
-- Good when the subquery is selective (uses index on user_id)
-- Bad if the left table is large and subquery is expensive
-- Always verify with EXPLAIN ANALYZE
```

### 8.1.7 Join Algorithms and Performance

Understanding how PostgreSQL executes joins helps you optimize queries and choose proper indexes.

```sql
-- Nested Loop Join
-- For each row in outer table, scan inner table for matches
-- Fast when outer table is small and inner table has index on join key
-- Example: User has 10 orders, orders table has index on user_id
SET enable_hashjoin = off; SET enable_mergejoin = off;
EXPLAIN SELECT * FROM users u JOIN orders o ON u.user_id = o.user_id;
-- -> Nested Loop
--    -> Seq Scan on users
--    -> Index Scan using idx_orders_user_id on orders

-- Hash Join
-- Build hash table from smaller table, probe with larger table
-- Good for large tables without suitable indexes
-- Requires memory (work_mem) to build hash table
SET enable_hashjoin = on; SET enable_mergejoin = off;
EXPLAIN SELECT * FROM large_table_a a JOIN large_table_b b ON a.fk = b.id;
-- Hash Join
--   Hash Cond: (a.fk = b.id)
--   -> Seq Scan on large_table_a a
--   -> Hash
--        -> Seq Scan on large_table_b b

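-- Merge Join
-- Both inputs sorted on the join key, then walked in step with each other
-- Efficient for large pre-sorted datasets (e.g., primary key joins)
-- Requires indexes or sort operations
SET enable_mergejoin = on; SET enable_hashjoin = off; SET enable_nestloop = off;
EXPLAIN SELECT * FROM orders o JOIN order_items oi ON o.order_id = oi.order_id;
-- Merge Join
--   Merge Cond: (o.order_id = oi.order_id)
--   -> Index Scan using orders_pkey on orders o
--   -> Index Scan using idx_order_items_order_id on order_items oi

-- The enable_* settings are for experimentation only; restore the defaults afterwards
RESET enable_hashjoin; RESET enable_mergejoin; RESET enable_nestloop;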

-- Optimization guidelines:
-- 1. Ensure foreign keys have indexes (for Nested Loop efficiency)
-- 2. For large table joins, ensure work_mem is adequate for hash joins
-- 3. Avoid joining on calculated expressions (pre-compute or index)
-- 4. Consider denormalization if joins are too expensive for read-heavy workloads
```

## 8.2 Aggregation and Grouping

Aggregation summarizes data across multiple rows. PostgreSQL provides powerful grouping capabilities from simple `GROUP BY` to advanced `GROUPING SETS`.

### 8.2.1 Basic Aggregation with GROUP BY

```sql
-- Simple aggregation: Count orders per user
SELECT 
    user_id,
    COUNT(*) as order_count,
    SUM(total_cents) as total_spent,
    AVG(total_cents) as avg_order_value,
    MIN(created_at) as first_order,
    MAX(created_at) as last_order
FROM orders
WHERE status = 'completed'
GROUP BY user_id;

-- Multiple column grouping
SELECT 
    DATE_TRUNC('month', created_at) as month,
    status,
    COUNT(*) as order_count,
    SUM(total_cents) as revenue
FROM orders
WHERE created_at > '2024-01-01'
GROUP BY 
    DATE_TRUNC('month', created_at),
    status
ORDER BY month, status;

-- GROUP BY with expressions
SELECT 
    CASE 
        WHEN total_cents < 10000 THEN 'small'
        WHEN total_cents < 50000 THEN 'medium'
        ELSE 'large'
    END as order_size,
    COUNT(*) as count,
    AVG(total_cents) as avg_amount
FROM orders
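GROUP BY 1;  -- Group by the first SELECT expression (positional reference)
-- Positional references are terse but break silently if the SELECT list is reordered.
-- Clearer alternatives for that last line:
--   GROUP BY order_size   -- Output-column alias (accepted by PostgreSQL, not by every database)
--   GROUP BY CASE WHEN total_cents < 10000 THEN 'small'
--                 WHEN total_cents < 50000 THEN 'medium'
--                 ELSE 'large' END   -- Repeat the full expression (portable)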

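-- Important rule: Every column in SELECT must be either:
-- 1. An aggregate function (COUNT, SUM, etc.), OR
-- 2. Listed in the GROUP BY clause (or functionally dependent on it)
-- This query fails:
SELECT user_id, status, COUNT(*) FROM orders GROUP BY user_id;
-- ERROR: column "orders.status" must appear in the GROUP BY clause or be used in an aggregate function
-- Fix: SELECT user_id, MAX(status), COUNT(*) ... or add status to GROUP BY
-- Exception: when you group by a table's primary key, its other columns may be selected, so
-- SELECT user_id, email, COUNT(*) FROM users GROUP BY user_id works if user_id is the primary key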
```

### 8.2.2 The HAVING Clause: Filtering Aggregates

`WHERE` filters rows before aggregation; `HAVING` filters groups after aggregation.

```sql
-- Find users with more than 5 orders
SELECT 
    user_id,
    COUNT(*) as order_count
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5;
-- HAVING filters the grouped results

-- Combining WHERE and HAVING
SELECT 
    user_id,
    COUNT(*) as order_count,
    SUM(total_cents) as total_spent
FROM orders
WHERE status = 'completed'  -- Filter rows first (eliminates pending/cancelled)
GROUP BY user_id
HAVING COUNT(*) > 5          -- Then filter groups
   AND SUM(total_cents) > 100000;  -- Can use aggregates in HAVING

-- Common mistake: Using WHERE for aggregate conditions
SELECT user_id, COUNT(*) 
FROM orders
WHERE COUNT(*) > 5  -- ERROR: aggregate functions are not allowed in WHERE
GROUP BY user_id;

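-- Common mistake: Comparing a row to its group's average in HAVING
SELECT 
    department_id,
    employee_id,
    salary
FROM employees
GROUP BY department_id, employee_id, salary
HAVING salary > AVG(salary);  -- Never true: each group holds a single salary value
-- AVG here is computed per (department_id, employee_id, salary) group, not per department,
-- and window functions are not allowed in HAVING. Compute the department average
-- with a window function in a subquery and filter outside it (see section 8.4):
SELECT department_id, employee_id, salary
FROM (
    SELECT department_id, employee_id, salary,
           AVG(salary) OVER (PARTITION BY department_id) as dept_avg
    FROM employees
) e
WHERE salary > dept_avg;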
```

### 8.2.3 Advanced Grouping: GROUPING SETS, ROLLUP, CUBE

These features generate multiple grouping levels in a single query, useful for reporting.

```sql
-- GROUPING SETS: Specify exact grouping combinations
SELECT 
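    -- Wrap the grouped expression; raw created_at is not in any grouping set and would raise an error
    COALESCE(TO_CHAR(DATE_TRUNC('month', created_at), 'YYYY-MM'), 'ALL MONTHS') as month,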
    COALESCE(status::text, 'ALL STATUSES') as status,
    COUNT(*) as order_count,
    SUM(total_cents) as revenue
FROM orders
GROUP BY GROUPING SETS (
    (DATE_TRUNC('month', created_at), status),  -- By month and status
    (DATE_TRUNC('month', created_at)),           -- By month only
    (status),                                    -- By status only
    ()                                           -- Grand total
)
ORDER BY month, status;  -- COALESCE has already replaced the subtotal NULLs with labels

-- ROLLUP: Hierarchical aggregation (right-to-left)
-- Generates: (month, status), (month), ()
SELECT 
    DATE_TRUNC('month', created_at) as month,
    status,
    COUNT(*) as order_count,
    SUM(total_cents) as revenue
FROM orders
GROUP BY ROLLUP (DATE_TRUNC('month', created_at), status);

-- CUBE: All combinations of grouping columns
-- Generates: (month, status), (month), (status), ()
SELECT 
    DATE_TRUNC('month', created_at) as month,
    status,
    COUNT(*) as order_count
FROM orders
GROUP BY CUBE (DATE_TRUNC('month', created_at), status);

-- GROUPING function: Identify which level of aggregation
SELECT 
    DATE_TRUNC('month', created_at) as month,
    status,
    COUNT(*) as order_count,
    GROUPING(DATE_TRUNC('month', created_at)) as is_month_total,
    GROUPING(status) as is_status_total
FROM orders
GROUP BY ROLLUP (DATE_TRUNC('month', created_at), status);
-- Returns 1 if the column is being aggregated (NULL in that dimension), 0 otherwise
```
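One caveat with labelling subtotal rows through `COALESCE` is that it cannot tell a subtotal NULL from a genuine NULL in the data. The following sketch shows how `GROUPING` can drive the label instead; the inline `orders` rows and the `'(no status)'` label are made up for illustration.

```sql
-- Hypothetical sample data, including an order whose status is genuinely NULL
WITH orders(status, total_cents) AS (
    VALUES ('completed', 1000),
           ('completed', 2500),
           (NULL,        400)
)
SELECT
    CASE WHEN GROUPING(status) = 1 THEN 'ALL STATUSES'   -- subtotal row
         ELSE COALESCE(status, '(no status)')            -- real NULL in the data
    END AS status_label,
    COUNT(*) AS order_count,
    SUM(total_cents) AS revenue
FROM orders
GROUP BY ROLLUP (status)
ORDER BY GROUPING(status), status_label;
```

This yields three rows: one for `completed` (2 orders, 3500), one for `(no status)` (1 order, 400), and the `ALL STATUSES` grand total (3 orders, 3900), which sorts last because its `GROUPING` value is 1.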

### 8.2.4 Aggregate Functions Reference

```sql
-- Standard aggregates
SELECT
    COUNT(*) as row_count,                    -- Count all rows
    COUNT(email) as non_null_emails,          -- Count non-NULL values
    COUNT(DISTINCT status) as unique_statuses, -- Count distinct values
    SUM(amount) as total,
    AVG(amount) as average,
    MIN(created_at) as earliest,
    MAX(created_at) as latest,
    STRING_AGG(name, ', ' ORDER BY name) as names_list  -- Concatenate with separator
FROM table_name;

-- Statistical aggregates
SELECT
    STDDEV(amount) as standard_deviation,
    VARIANCE(amount) as variance,
    CORR(amount, discount) as correlation,  -- PostgreSQL's correlation aggregate is named CORR
    REGR_R2(amount, discount) as r_squared  -- Coefficient of determination
FROM sales;

-- Ordered-set aggregates (percentiles)
SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) as median,
    PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY amount) as p90,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY amount) as p99,
    MODE() WITHIN GROUP (ORDER BY status) as most_common_status
FROM orders;

-- Array aggregates (useful for application consumption)
SELECT 
    user_id,
    ARRAY_AGG(order_id ORDER BY created_at) as order_ids,
    JSON_AGG(json_build_object('id', order_id, 'total', total) ORDER BY created_at) as orders_json
FROM orders
GROUP BY user_id;

-- FILTER clause (conditional aggregation)
SELECT
    COUNT(*) as total_orders,
    COUNT(*) FILTER (WHERE status = 'completed') as completed_orders,
    COUNT(*) FILTER (WHERE status = 'pending') as pending_orders,
    SUM(total) FILTER (WHERE status = 'completed') as completed_revenue,
    AVG(total) FILTER (WHERE status = 'completed') as avg_completed_order
FROM orders;
-- Cleaner than CASE WHEN inside aggregates; performance is typically comparable
```

## 8.3 Set Operations: Combining Query Results

Set operations combine the results of two or more queries into a single result set.

### 8.3.1 UNION and UNION ALL

```sql
-- UNION: Combines results, removes duplicates (needs an extra sort or hash step)
SELECT user_id, email FROM active_users
UNION
SELECT user_id, email FROM pending_users;
-- Returns unique rows from both tables

-- UNION ALL: Combines results, keeps duplicates (faster, preferred if possible)
SELECT user_id, email FROM active_users
UNION ALL
SELECT user_id, email FROM pending_users;
-- Returns all rows from both, including duplicates if present in both tables

-- Column count and types must match
SELECT user_id, email FROM users
UNION
SELECT customer_id, email_address FROM customers;  -- OK if types compatible

-- Ordering applies to final result
SELECT user_id, created_at FROM orders_2023
UNION ALL
SELECT user_id, created_at FROM orders_2024
ORDER BY created_at DESC;  -- Sorts combined result

-- Practical use: Partitioned table query simulation
SELECT * FROM orders_2024_q1
UNION ALL
SELECT * FROM orders_2024_q2
UNION ALL
SELECT * FROM orders_2024_q3
UNION ALL
SELECT * FROM orders_2024_q4;
-- Note: PostgreSQL 10+ has native declarative partitioning, but UNION ALL is still useful for combining separately managed tables (for example, foreign tables from other databases)
```

### 8.3.2 INTERSECT and EXCEPT

```sql
-- INTERSECT: Rows present in both queries
SELECT email FROM newsletter_subscribers
INTERSECT
SELECT email FROM customers;
-- Emails of people who are both subscribers AND customers

-- INTERSECT ALL: Keeps duplicates based on minimum occurrence
SELECT email FROM subscribers  -- 'a@b.com' appears twice
INTERSECT ALL
SELECT email FROM customers;    -- 'a@b.com' appears three times
-- Result: 'a@b.com' appears twice (minimum)

-- EXCEPT: Rows in first query not in second (set difference)
SELECT email FROM all_users
EXCEPT
SELECT email FROM unsubscribed_users;
-- Active email addresses only

-- EXCEPT ALL: Accounts for duplicates
SELECT email FROM orders  -- 'a@b.com' ordered 5 times
EXCEPT ALL
SELECT email FROM refunds; -- 'a@b.com' refunded 2 times
-- Result: 'a@b.com' appears 3 times (5 - 2)

-- Finding data discrepancies between systems
SELECT id, hash FROM system_a_records
EXCEPT
SELECT id, hash FROM system_b_records;
-- Shows records in system A that are missing from, or differ in, system B
-- Swap the two queries to check the other direction
```

### 8.3.3 Set Operation Rules and Best Practices

```sql
-- Rules:
-- 1. Column count must match in all queries
-- 2. Corresponding columns must have compatible types
-- 3. Column names come from first query
-- 4. ORDER BY applies to final result only (use column numbers or names from first query)

-- Best practice: Always use UNION ALL unless you specifically need deduplication
-- UNION (without ALL) must deduplicate, using a sort or a hash table over the full result
-- UNION ALL simply appends the inputs and can stream rows as they arrive

-- Complex set operations with parentheses
(SELECT * FROM urgent_orders)
UNION
(SELECT * FROM high_value_orders
 INTERSECT
 SELECT * FROM recent_orders);
-- Parentheses make the evaluation order explicit. By default, INTERSECT binds more tightly
-- than UNION and EXCEPT, which are otherwise evaluated left to right

-- With CTEs for clarity
WITH premium_customers AS (
    SELECT user_id FROM users WHERE tier = 'premium'
),
recent_buyers AS (
    SELECT DISTINCT user_id FROM orders WHERE created_at > NOW() - INTERVAL '30 days'
)
SELECT * FROM premium_customers
INTERSECT
SELECT * FROM recent_buyers;
```
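The precedence rule for set operators is easy to check with constant queries. This is a small sketch that needs no tables; the column alias `n` is arbitrary.

```sql
-- INTERSECT binds more tightly than UNION, so this is 1 UNION (2 INTERSECT 3)
SELECT 1 AS n
UNION
SELECT 2
INTERSECT
SELECT 3;
-- yields a single row: 1

-- Parentheses force the UNION to run first: (1 UNION 2) INTERSECT 3
(SELECT 1 AS n
 UNION
 SELECT 2)
INTERSECT
SELECT 3;
-- yields no rows
```

Because the two forms give different results, adding parentheses whenever `INTERSECT` is mixed with `UNION` or `EXCEPT` is a cheap way to make the intent unambiguous.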

## 8.4 Window Functions: Analytical Power

Window functions perform calculations across sets of rows related to the current row without collapsing groups like `GROUP BY` does.

### 8.4.1 Window Function Syntax and OVER Clause

```sql
-- Basic syntax: function OVER (window_definition)
SELECT
    user_id,
    order_id,
    total_cents,
    RANK() OVER (ORDER BY total_cents DESC) as revenue_rank,
    SUM(total_cents) OVER () as grand_total,  -- Empty OVER = all rows
    SUM(total_cents) OVER (PARTITION BY user_id) as user_total
FROM orders;

-- Window definition components:
-- PARTITION BY: Divides rows into groups (like GROUP BY but keeps rows separate)
-- ORDER BY: Defines order for functions that care about sequence (rank, lead/lag)
-- Frame specification: Which rows within partition to include

-- Detailed breakdown
SELECT
    department_id,
    employee_id,
    salary,
    
    -- Aggregate as window function (keeps individual rows)
    AVG(salary) OVER (PARTITION BY department_id) as dept_avg,
    salary - AVG(salary) OVER (PARTITION BY department_id) as diff_from_avg,
    
    -- Ranking functions
    RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) as dept_rank,
    ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) as row_num,
    DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) as dense_rank,
    
    -- Navigation functions
    LAG(salary, 1) OVER (PARTITION BY department_id ORDER BY hire_date) as prev_salary,
    LEAD(salary, 1) OVER (PARTITION BY department_id ORDER BY hire_date) as next_salary,
    FIRST_VALUE(salary) OVER (PARTITION BY department_id ORDER BY hire_date) as first_hire_salary,
    LAST_VALUE(salary) OVER (
        PARTITION BY department_id 
        ORDER BY hire_date 
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as last_hire_salary
    
FROM employees
ORDER BY department_id, salary DESC;
```

### 8.4.2 Ranking Functions

```sql
-- RANK(): Ties get same rank, next rank skips numbers (1, 2, 2, 4)
-- DENSE_RANK(): Ties get same rank, next rank is sequential (1, 2, 2, 3)
-- ROW_NUMBER(): Unique sequential number even with ties (1, 2, 3, 4)
-- PERCENT_RANK(): Relative rank (0.0 to 1.0)
-- CUME_DIST(): Cumulative distribution (greater than 0, up to 1.0)
-- NTILE(n): Divides rows into n buckets

SELECT
    product_id,
    sales_amount,
    RANK() OVER (ORDER BY sales_amount DESC) as rank,
    DENSE_RANK() OVER (ORDER BY sales_amount DESC) as dense_rank,
    ROW_NUMBER() OVER (ORDER BY sales_amount DESC) as row_num,
    PERCENT_RANK() OVER (ORDER BY sales_amount DESC) as percentile,
    NTILE(4) OVER (ORDER BY sales_amount DESC) as quartile
FROM product_sales;

-- Top-N per category pattern
WITH ranked AS (
    SELECT
        category_id,
        product_id,
        sales,
        RANK() OVER (PARTITION BY category_id ORDER BY sales DESC) as rank
    FROM sales_by_product
)
SELECT * FROM ranked
WHERE rank <= 3;  -- Top 3 products per category (ties can return more than 3 rows)

-- Handling ties: If you need exactly N rows per group with tie-breaking
WITH ranked AS (
    SELECT
        category_id,
        product_id,
        sales,
        ROW_NUMBER() OVER (
            PARTITION BY category_id 
            ORDER BY sales DESC, product_id ASC  -- Tie-breaker
        ) as row_num
    FROM sales_by_product
)
SELECT * FROM ranked WHERE row_num <= 3;
```
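To see how the three ranking functions treat ties, it helps to run them side by side on a few rows. The sketch below uses made-up inline data with one tie at 90.

```sql
-- Hypothetical sample data: products 2 and 3 tie on sales_amount
WITH product_sales(product_id, sales_amount) AS (
    VALUES (1, 100), (2, 90), (3, 90), (4, 80)
)
SELECT
    product_id,
    sales_amount,
    RANK()       OVER w AS rank,         -- 1, 2, 2, 4
    DENSE_RANK() OVER w AS dense_rank,   -- 1, 2, 2, 3
    -- product_id breaks the tie so the numbering is deterministic: 1, 2, 3, 4
    ROW_NUMBER() OVER (ORDER BY sales_amount DESC, product_id) AS row_num
FROM product_sales
WINDOW w AS (ORDER BY sales_amount DESC)
ORDER BY sales_amount DESC, product_id;
```

Without the tie-breaker, `ROW_NUMBER()` would still assign 2 and 3 to the tied products, but which product gets which number would not be guaranteed.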

### 8.4.3 Navigation Functions: LAG, LEAD, FIRST_VALUE, LAST_VALUE

```sql
-- LAG/LEAD: Access previous/next rows
SELECT
    date,
    daily_revenue,
    LAG(daily_revenue, 1) OVER (ORDER BY date) as prev_day,
    daily_revenue - LAG(daily_revenue, 1) OVER (ORDER BY date) as day_over_day_change,
    
    LEAD(daily_revenue, 7) OVER (ORDER BY date) as revenue_in_7_days,  -- Look ahead
    daily_revenue - LAG(daily_revenue, 7) OVER (ORDER BY date) as week_over_week_change
FROM daily_metrics
ORDER BY date;

-- LAG/LEAD with default values
SELECT
    order_id,
    created_at,
    LAG(created_at, 1, created_at) OVER (ORDER BY created_at) as prev_time,
    -- If no previous row, use current row's value (default is NULL)
    EXTRACT(EPOCH FROM (created_at - LAG(created_at) OVER (ORDER BY created_at))) / 60 
        as minutes_since_last_order
FROM orders;

-- FIRST_VALUE/LAST_VALUE/NTH_VALUE
SELECT
    user_id,
    order_id,
    total_cents,
    FIRST_VALUE(total_cents) OVER (
        PARTITION BY user_id 
        ORDER BY created_at
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as first_order_amount,
    NTH_VALUE(total_cents, 2) OVER (
        PARTITION BY user_id 
        ORDER BY created_at
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as second_order_amount
FROM orders;
```

### 8.4.4 Aggregate Window Functions and Frame Specifications

```sql
-- Running totals and moving averages
SELECT
    date,
    daily_sales,
    SUM(daily_sales) OVER (ORDER BY date) as running_total,  -- Default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (includes ties)
    AVG(daily_sales) OVER (
        ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as moving_avg_7_day,  -- Unquoted identifiers cannot start with a digit
    AVG(daily_sales) OVER (
        ORDER BY date 
        ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
    ) as centered_7_day_avg
FROM daily_sales
ORDER BY date;

-- Frame specification options:
-- ROWS: Physical row offset (1 row before, 2 rows after)
-- RANGE: Logical value offset (rows with the same ORDER BY value are peers and enter the frame together)
-- UNBOUNDED PRECEDING: All rows from start of partition
-- UNBOUNDED FOLLOWING: All rows to end of partition
-- CURRENT ROW: Current row
-- n PRECEDING/FOLLOWING: n rows before/after

-- Practical: Year-to-date calculation
SELECT
    date,
    amount,
    SUM(amount) OVER (
        PARTITION BY DATE_TRUNC('year', date) 
        ORDER BY date 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as ytd_amount
FROM transactions;

-- Window functions with FILTER
SELECT
    department_id,
    employee_id,
    salary,
    AVG(salary) FILTER (WHERE tenure_years > 2) OVER (PARTITION BY department_id) as avg_salary_senior
FROM employees;
```
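The practical difference between the default `RANGE` frame and an explicit `ROWS` frame only shows up when the `ORDER BY` column has ties. A minimal sketch on made-up inline data, where two rows share the same date:

```sql
-- Hypothetical sample data: two rows on 2024-01-01
WITH daily(d, amount) AS (
    VALUES (DATE '2024-01-01', 10),
           (DATE '2024-01-01', 20),
           (DATE '2024-01-02', 5)
)
SELECT
    d,
    amount,
    -- Default frame with ORDER BY: RANGE up to CURRENT ROW, which includes all peers
    SUM(amount) OVER (ORDER BY d) AS range_total,
    -- ROWS frame: strictly row by row; amount is added as a tie-breaker
    SUM(amount) OVER (
        ORDER BY d, amount
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS rows_total
FROM daily
ORDER BY d, amount;
```

Here `range_total` is 30, 30, 35 (both January 1 rows see the full day's total), while `rows_total` is 10, 30, 35. For a per-row running total, spell out the `ROWS` frame and make the ordering unique.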

### 8.4.5 Named Window Definitions

```sql
-- Define window once, reuse in multiple functions
SELECT
    department_id,
    employee_id,
    salary,
    RANK() OVER w_dept_salary as rank_in_dept,
    salary - AVG(salary) OVER w_dept as diff_from_avg,
    SUM(salary) OVER w_dept as dept_payroll
FROM employees
WINDOW 
    w_dept AS (PARTITION BY department_id),
    w_dept_salary AS (w_dept ORDER BY salary DESC);
-- w_dept_salary inherits PARTITION BY from w_dept and adds ORDER BY

-- Complex example with multiple windows
SELECT
    region,
    store_id,
    sales,
    
    -- Regional metrics
    RANK() OVER (PARTITION BY region ORDER BY sales DESC) as regional_rank,
    SUM(sales) OVER (PARTITION BY region) as regional_total,
    
    -- Global metrics
    RANK() OVER (ORDER BY sales DESC) as global_rank,
    sales::numeric / SUM(sales) OVER () as pct_of_company  -- Cast avoids integer division
    
FROM store_sales;
```

### 8.4.6 Practical Window Function Patterns

```sql
-- Pattern 1: Gaps and Islands (finding consecutive sequences)
WITH ordered AS (
    SELECT 
        id,
        value,
        id - ROW_NUMBER() OVER (ORDER BY id) as grp  -- Same grp = consecutive
    FROM sequences
    WHERE value IS NOT NULL
)
SELECT 
    MIN(id) as start_id,
    MAX(id) as end_id,
    COUNT(*) as length
FROM ordered
GROUP BY grp;

-- Pattern 2: Sessionization (grouping events into sessions)
WITH sessions AS (
    SELECT 
        user_id,
        event_time,
        event_time - LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) as time_diff,
        CASE 
            WHEN event_time - LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) > INTERVAL '30 minutes'
            THEN 1 
            ELSE 0 
        END as is_new_session
    FROM user_events
),
session_ids AS (
    SELECT 
        user_id,
        event_time,
        SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_time) as session_id
    FROM sessions
)
SELECT * FROM session_ids;

-- Pattern 3: Percentage of total within group
SELECT
    category,
    product,
    sales,
    sales::numeric / SUM(sales) OVER (PARTITION BY category) as pct_of_category,
    sales::numeric / SUM(sales) OVER () as pct_of_total
FROM product_sales;

-- Pattern 4: Moving difference (delta detection)
SELECT
    sensor_id,
    reading_time,
    temperature,
    temperature - LAG(temperature) OVER (PARTITION BY sensor_id ORDER BY reading_time) as temp_change,
    CASE 
        WHEN ABS(temperature - LAG(temperature) OVER (PARTITION BY sensor_id ORDER BY reading_time)) > 5 
        THEN 'ALERT' 
        ELSE 'OK' 
    END as status
FROM sensor_readings;
```

## 8.5 Advanced Query Patterns

### 8.5.1 Pivoting and Unpivoting

```sql
-- Pivot: Transform rows to columns (using CASE aggregation)
SELECT
    date,
    SUM(CASE WHEN product = 'Widget' THEN amount ELSE 0 END) as widget_sales,
    SUM(CASE WHEN product = 'Gadget' THEN amount ELSE 0 END) as gadget_sales,
    SUM(CASE WHEN product = 'Tool' THEN amount ELSE 0 END) as tool_sales,
    SUM(amount) as total_sales
FROM sales
GROUP BY date
ORDER BY date;

-- Dynamic pivot (requires dynamic SQL in application or plpgsql)
-- PostgreSQL has crosstab function in tablefunc extension for complex pivots

-- Unpivot: Transform columns to rows (using UNION ALL)
SELECT date, 'Widget' as product, widget_sales as amount FROM pivoted_sales WHERE widget_sales > 0
UNION ALL
SELECT date, 'Gadget' as product, gadget_sales as amount FROM pivoted_sales WHERE gadget_sales > 0
UNION ALL
SELECT date, 'Tool' as product, tool_sales as amount FROM pivoted_sales WHERE tool_sales > 0;

-- Better unpivot using LATERAL (PostgreSQL 9.3+): reads pivoted_sales only once
SELECT s.date, u.product, u.amount
FROM pivoted_sales s
CROSS JOIN LATERAL (VALUES 
    ('Widget', s.widget_sales),
    ('Gadget', s.gadget_sales),
    ('Tool', s.tool_sales)
) AS u(product, amount)
WHERE u.amount > 0;
```

### 8.5.2 Recursive CTEs for Hierarchical Data

```sql
-- Organizational chart traversal
WITH RECURSIVE org_tree AS (
    -- Anchor: Top-level employees (no manager)
    SELECT 
        employee_id,
        name,
        manager_id,
        1 as level,
        CAST(name AS TEXT) as path
    FROM employees
    WHERE manager_id IS NULL
    
    UNION ALL
    
    -- Recursive: Employees who report to previous level
    SELECT 
        e.employee_id,
        e.name,
        e.manager_id,
        ot.level + 1,
        ot.path || ' > ' || e.name
    FROM employees e
    INNER JOIN org_tree ot ON e.manager_id = ot.employee_id
)
SELECT 
    REPEAT('  ', level - 1) || name as indented_name,
    level,
    path
FROM org_tree
ORDER BY path;

-- Bill of materials (exploding components)
WITH RECURSIVE components AS (
    SELECT 
        parent_part_id,
        child_part_id,
        quantity,
        quantity as total_quantity,
        1 as depth
    FROM bill_of_materials
    WHERE parent_part_id = 'PRODUCT-123'
    
    UNION ALL
    
    SELECT 
        bom.parent_part_id,
        bom.child_part_id,
        bom.quantity,
        c.total_quantity * bom.quantity,
        c.depth + 1
    FROM bill_of_materials bom
    INNER JOIN components c ON bom.parent_part_id = c.child_part_id
)
SELECT * FROM components;
```
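Recursive CTEs like the bill-of-materials query never terminate if the data contains a cycle. One common guard is to carry the list of parts already visited on the current path and skip any row that would revisit one. The sketch below assumes a simplified, hypothetical `bom` edge list (inline `VALUES`, no quantities) with a deliberate cycle.

```sql
-- Hypothetical edge list: C contains B, which contains C (a cycle)
WITH RECURSIVE bom(parent_part_id, child_part_id) AS (
    VALUES ('A', 'B'),
           ('B', 'C'),
           ('C', 'B')
),
components AS (
    -- Anchor: direct children of the starting part
    SELECT
        b.parent_part_id,
        b.child_part_id,
        1 AS depth,
        ARRAY[b.parent_part_id, b.child_part_id] AS visited
    FROM bom b
    WHERE b.parent_part_id = 'A'

    UNION ALL

    -- Recursive step: descend one level, remembering the path taken
    SELECT
        b.parent_part_id,
        b.child_part_id,
        c.depth + 1,
        c.visited || b.child_part_id
    FROM bom b
    INNER JOIN components c ON b.parent_part_id = c.child_part_id
    WHERE b.child_part_id <> ALL (c.visited)   -- skip parts already on this path
)
SELECT parent_part_id, child_part_id, depth
FROM components
ORDER BY depth;
```

With this data the query returns two rows, (A, B) at depth 1 and (B, C) at depth 2, and stops instead of looping on the C to B edge. PostgreSQL 14 and later also provide a built-in `CYCLE` clause for the same purpose.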

---

## Chapter Summary

In this chapter, you mastered:

1. **Joins**: Use ANSI-92 `JOIN` syntax exclusively. `INNER JOIN` returns matches only; `LEFT JOIN` preserves left table rows; `FULL JOIN` preserves both. Avoid Cartesian products by ensuring all tables are joined. Use `LATERAL` for row-by-row subqueries that need to reference outer tables.

2. **Aggregation**: `GROUP BY` collapses rows into summaries. Every selected column must be aggregated or grouped. `HAVING` filters groups after aggregation (unlike `WHERE` which filters before). Use `GROUPING SETS`, `ROLLUP`, and `CUBE` for multi-dimensional reporting.

3. **Set Operations**: `UNION ALL` combines results efficiently (keeps duplicates); `UNION` removes duplicates (expensive). `INTERSECT` finds common rows; `EXCEPT` finds differences. Column counts and types must match across queries.

4. **Window Functions**: Perform calculations across related rows without collapsing groups. `PARTITION BY` divides data; `ORDER BY` sequences it; frame specifications (`ROWS/RANGE`) define which rows to include. Ranking functions (`RANK`, `ROW_NUMBER`) assign positions; navigation functions (`LAG`, `LEAD`) access other rows; aggregates can run as window functions for running totals.

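5. **Performance**: Ensure join columns are indexed for Nested Loop efficiency; PostgreSQL does not index foreign key columns automatically. Use `EXISTS` for existence checks so that one-to-many joins do not duplicate rows. Use `UNION ALL` instead of `UNION` when possible. Window functions usually require sorting; ensure adequate `work_mem` for large partitions, and verify plans with `EXPLAIN ANALYZE`.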

---

**Next:** In Chapter 9, we will explore functions, expressions, and common patterns, including SQL functions, CASE expressions, CTEs for query organization, and advanced pattern matching with regular expressions.