# **Chapter 30: SQL for Testers**

---

## **30.1 Introduction to Advanced SQL for Testing**

While basic SQL (SELECT, INSERT, UPDATE, DELETE) allows testers to verify data existence, advanced SQL techniques enable sophisticated validation of data relationships, anomaly detection, and complex business rule verification. Mastering these techniques transforms database testing from simple record-counting into comprehensive data quality assurance.

**Why Advanced SQL Matters for Testers:**

1. **Complex Relationship Validation:** Multi-table joins and subqueries verify intricate business relationships that simple queries cannot capture
2. **Anomaly Detection:** Window functions identify outliers, duplicates, and sequence breaks that indicate bugs
3. **Regression Prevention:** Data comparison queries detect unintended changes between releases
4. **Performance Validation:** Query execution analysis ensures database changes don't degrade performance
5. **ETL Verification:** Complex transformations require sophisticated SQL to validate source-to-target mappings

---

## **30.2 Complex Join Operations**

Joins combine data from multiple tables. Understanding join types ensures testers retrieve complete datasets for validation.

### **30.2.1 Join Types Deep Dive**

**Visual Join Representation:**
```
Table A                    Table B
┌─────┬──────┐          ┌─────┬──────┐
│ ID  │ Name │          │ ID  │ City │
├─────┼──────┤          ├─────┼──────┤
│ 1   │ John │          │ 1   │ NYC  │
│ 2   │ Jane │          │ 3   │ LA   │
│ 4   │ Bob  │          └─────┴──────┘
└─────┴──────┘

INNER JOIN (A ∩ B):       LEFT JOIN (A + matching B):
┌─────┬──────┬──────┐    ┌─────┬──────┬──────┐
│ ID  │ Name │ City │    │ ID  │ Name │ City │
├─────┼──────┼──────┤    ├─────┼──────┼──────┤
│ 1   │ John │ NYC  │    │ 1   │ John │ NYC  │
└─────┴──────┴──────┘    │ 2   │ Jane │ NULL │
                         │ 4   │ Bob  │ NULL │
                         └─────┴──────┴──────┘

RIGHT JOIN:               FULL OUTER JOIN:
┌─────┬──────┬──────┐    ┌─────┬──────┬──────┐
│ ID  │ Name │ City │    │ ID  │ Name │ City │
├─────┼──────┼──────┤    ├─────┼──────┼──────┤
│ 1   │ John │ NYC  │    │ 1   │ John │ NYC  │
│ 3   │ NULL │ LA   │    │ 2   │ Jane │ NULL │
└─────┴──────┴──────┘    │ 3   │ NULL │ LA   │
                         │ 4   │ Bob  │ NULL │
                         └─────┴──────┴──────┘
```

**Testing Scenarios for Each Join Type:**

```sql
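-- INNER JOIN: Return only orders that have a matching user
-- Test: Ensure all orders have valid users (no orphaned orders)
SELECT o.order_id, o.total, u.username
FROM orders o
INNER JOIN users u ON o.user_id = u.user_id;
-- Expected: Row count equals SELECT COUNT(*) FROM orders.
-- An INNER JOIN silently drops orphaned orders, so a lower count means some
-- orders point at a missing user; list them with a LEFT JOIN ... IS NULL filter.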

-- LEFT JOIN: Find missing relationships
-- Test: Identify users who never placed orders (onboarding issue?)
SELECT u.user_id, u.username, o.order_id, o.total
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE o.order_id IS NULL;
-- Expected: List of users with no order history

-- RIGHT JOIN: Verify referential integrity from child perspective
-- Test: Ensure all order items reference valid products
SELECT oi.item_id, oi.product_id, p.product_name
FROM products p
RIGHT JOIN order_items oi ON oi.product_id = p.product_id
WHERE p.product_id IS NULL;
-- Expected: Zero rows; any row is an order item whose product does not exist

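-- FULL OUTER JOIN: Complete data comparison (PostgreSQL, SQL Server, Oracle)
-- MySQL has no FULL OUTER JOIN; emulate it by combining a LEFT JOIN with the
-- unmatched right-side rows via UNION ALL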
-- Test: Sync verification between source and target systems
SELECT 
    s.id as source_id, 
    t.id as target_id,
    s.checksum as source_checksum,
    t.checksum as target_checksum
FROM source_table s
FULL OUTER JOIN target_table t ON s.id = t.id
WHERE s.checksum != t.checksum OR s.id IS NULL OR t.id IS NULL;

-- CROSS JOIN: Generate test combinations (Cartesian product)
-- Test: Validate all permission combinations work
SELECT r.role_name, p.permission_name
FROM roles r
CROSS JOIN permissions p;
-- Use with caution: 10 roles × 20 permissions = 200 rows
```
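
For engines without `FULL OUTER JOIN`, the same result can be assembled from two left joins. The following is a minimal sketch, assuming two hypothetical scratch tables (`demo_a`, `demo_b`) that mirror Table A and Table B from the diagram above; the names are illustrative and not part of any real schema.

```sql
-- Hypothetical scratch tables mirroring Table A and Table B from the diagram
CREATE TABLE demo_a (id INT PRIMARY KEY, name VARCHAR(20));
CREATE TABLE demo_b (id INT PRIMARY KEY, city VARCHAR(20));

INSERT INTO demo_a (id, name) VALUES (1, 'John'), (2, 'Jane'), (4, 'Bob');
INSERT INTO demo_b (id, city) VALUES (1, 'NYC'), (3, 'LA');

-- Part 1: every row of A, with B's city where the IDs match
SELECT a.id AS id, a.name, b.city
FROM demo_a a
LEFT JOIN demo_b b ON a.id = b.id

UNION ALL

-- Part 2: rows of B that have no partner in A (the right-only rows)
SELECT b.id AS id, NULL AS name, b.city
FROM demo_b b
LEFT JOIN demo_a a ON a.id = b.id
WHERE a.id IS NULL

ORDER BY id;
```

With this sample data the query returns four rows (IDs 1, 2, 3, 4), matching the FULL OUTER JOIN panel of the diagram. `UNION ALL` is safe here because Part 2 only contributes rows that Part 1 cannot contain.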

### **30.2.2 Multi-Table Join Chains**

Real-world testing often requires traversing multiple relationships:

```sql
-- Complex e-commerce validation: Order → User → Shipping → Product Category
SELECT 
    o.order_id,
    o.order_date,
    u.username,
    u.email,
    s.tracking_number,
    s.status as shipping_status,
    p.product_name,
    oi.quantity,
    oi.unit_price,
    c.category_name,
    (oi.quantity * oi.unit_price) as line_total,
    pay.payment_method,
    pay.status as payment_status
FROM orders o
INNER JOIN users u ON o.user_id = u.user_id
LEFT JOIN shipments s ON o.order_id = s.order_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id
INNER JOIN categories c ON p.category_id = c.category_id
LEFT JOIN payments pay ON o.order_id = pay.order_id
WHERE o.order_date >= DATE_SUB(NOW(), INTERVAL 30 DAY)  -- MySQL date arithmetic
ORDER BY o.order_date DESC, p.product_name;

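-- Testing validation points:
-- 1. Orders without a matching user are silently excluded by the INNER JOIN;
--    compare the distinct order count against the orders table to detect them
-- 2. Orders may not have shipments yet (LEFT JOIN allows NULL)
-- 3. Order items with a missing product or category are also dropped by the INNER JOINs
-- 4. Payment might be pending (LEFT JOIN)
-- 5. Several shipments or payments per order multiply the rows, so sum line totals with care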
```

### **30.2.3 Self-Joins for Hierarchical Data**

Testing organizational structures or category trees:

```sql
-- Test: Verify employee-manager relationships and hierarchy depth
SELECT 
    e.employee_id,
    e.name as employee_name,
    e.title,
    m.name as manager_name,
    m.title as manager_title,
    mm.name as skip_level_manager_name,
    CASE 
        WHEN e.manager_id IS NULL THEN 'C-Level'   -- no manager: top of the hierarchy
        WHEN m.manager_id IS NULL THEN 'VP-Level'  -- reports directly to a top-level manager
        ELSE 'Individual Contributor'
    END as hierarchy_level
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id  -- Self-join
LEFT JOIN employees mm ON m.manager_id = mm.employee_id -- Second level
WHERE e.status = 'active';

-- Test: Find circular references (manager is own subordinate - data corruption)
-- Catches self-references and two-person loops; longer cycles need a recursive CTE
SELECT 
    e1.employee_id,
    e1.name,
    e2.employee_id as manager_id,
    e2.name as manager_name
FROM employees e1
JOIN employees e2 ON e1.manager_id = e2.employee_id
JOIN employees e3 ON e2.manager_id = e3.employee_id
WHERE e1.employee_id = e3.employee_id; -- Circular reference detected
```

---

## **30.3 Window Functions for Testing**

Window functions perform calculations across sets of rows related to the current row, enabling sophisticated data analysis without subqueries.

### **30.3.1 Row Numbering and Ranking**

```sql
-- Test: Identify duplicate records for cleanup
-- Assign row number to each duplicate group
WITH DuplicateCheck AS (
    SELECT 
        email,
        user_id,
        created_at,
        ROW_NUMBER() OVER (
            PARTITION BY email 
            ORDER BY created_at ASC
        ) as rn
    FROM users
)
SELECT * FROM DuplicateCheck WHERE rn > 1;
-- Returns: All duplicate emails except the earliest created

-- Test: Top N per category (e.g., top 3 products by revenue per category)
-- RANK keeps ties, so a category can return more than 3 rows; use ROW_NUMBER for exactly 3
WITH RankedProducts AS (
    SELECT 
        category_id,
        product_name,
        revenue,
        RANK() OVER (
            PARTITION BY category_id 
            ORDER BY revenue DESC
        ) as revenue_rank
    FROM product_sales
)
SELECT * FROM RankedProducts WHERE revenue_rank <= 3;

-- DENSE_RANK vs RANK difference:
-- Values: 100, 100, 90, 80
-- RANK:    1,  1,  3,  4 (skips 2 due to tie)
-- DENSE_RANK: 1,  1,  2,  3 (no gaps)
```
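
The tie-handling difference is easiest to see side by side. Here is a small sketch, assuming a hypothetical `demo_scores` table loaded with the four values from the comment above; the table and column names are illustrative only.

```sql
-- Hypothetical scratch table holding the values 100, 100, 90, 80
CREATE TABLE demo_scores (player VARCHAR(10) PRIMARY KEY, score INT);

INSERT INTO demo_scores (player, score)
VALUES ('A', 100), ('B', 100), ('C', 90), ('D', 80);

SELECT
    player,
    score,
    -- Unique sequence; the player name breaks the tie deterministically
    ROW_NUMBER() OVER (ORDER BY score DESC, player) AS row_num,
    -- Ties share a rank and the next rank is skipped
    RANK()       OVER (ORDER BY score DESC)         AS rank_val,
    -- Ties share a rank and no rank is skipped
    DENSE_RANK() OVER (ORDER BY score DESC)         AS dense_rank_val
FROM demo_scores
ORDER BY score DESC, player;
```

For player C (score 90) the three columns come out as 3, 3, and 2. When validating a leaderboard, check which of these behaviors the requirement actually specifies, because all three are "correct" rankings of the same data.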

### **30.3.2 Lag and Lead (Time-Series Analysis)**

```sql
-- Test: Verify price changes don't exceed 10% between versions
-- Window results cannot be filtered in the same query's WHERE clause,
-- so compute LAG in a CTE first and filter in the outer query
WITH PriceChanges AS (
    SELECT 
        product_id,
        price_date,
        price,
        LAG(price) OVER (
            PARTITION BY product_id 
            ORDER BY price_date
        ) as previous_price
    FROM product_prices
)
SELECT 
    product_id,
    price_date,
    price,
    previous_price,
    ROUND(((price - previous_price) / previous_price) * 100, 2) as percent_change
FROM PriceChanges
WHERE previous_price > 0   -- skips each product's first row (NULL) and avoids division by zero
  AND ABS(price - previous_price) / previous_price * 100 > 10;
-- Flags: Products with price jumps exceeding 10%

-- Test: Sessionization - Identify gaps in user activity > 30 minutes
WITH UserActivity AS (
    SELECT 
        user_id,
        activity_time,
        LAG(activity_time) OVER (
            PARTITION BY user_id 
            ORDER BY activity_time
        ) as previous_activity,
        TIMESTAMPDIFF(MINUTE, 
            LAG(activity_time) OVER (PARTITION BY user_id ORDER BY activity_time),
            activity_time
        ) as minutes_since_last
    FROM user_sessions
)
SELECT * FROM UserActivity WHERE minutes_since_last > 30;
-- Finds: New sessions (potential tracking issues or actual returns)

-- Test: Lead for predictive validation (next scheduled maintenance)
SELECT 
    equipment_id,
    maintenance_date,
    LEAD(maintenance_date) OVER (
        PARTITION BY equipment_id 
        ORDER BY maintenance_date
    ) as next_maintenance,
    DATEDIFF(
        LEAD(maintenance_date) OVER (PARTITION BY equipment_id ORDER BY maintenance_date),
        maintenance_date
    ) as days_until_next
FROM maintenance_log;
```
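
`LAG` also covers the sequence-break checks mentioned in the introduction. The sketch below assumes a hypothetical `demo_invoices` table whose numbers should be consecutive; it reports each missing range rather than each missing number.

```sql
-- Hypothetical scratch table: invoice numbers are expected to be consecutive
CREATE TABLE demo_invoices (invoice_no INT PRIMARY KEY);

INSERT INTO demo_invoices (invoice_no)
VALUES (1001), (1002), (1003), (1006), (1007), (1010);

WITH Sequenced AS (
    SELECT
        invoice_no,
        -- Previous invoice number in ascending order (NULL for the first row)
        LAG(invoice_no) OVER (ORDER BY invoice_no) AS previous_no
    FROM demo_invoices
)
SELECT
    previous_no + 1              AS gap_start,
    invoice_no - 1               AS gap_end,
    invoice_no - previous_no - 1 AS missing_count
FROM Sequenced
WHERE invoice_no - previous_no > 1   -- the first row drops out because previous_no is NULL
ORDER BY gap_start;
```

With the sample rows this returns two gaps: 1004 to 1005 and 1008 to 1009, each with a `missing_count` of 2.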

### **30.3.3 Aggregate Window Functions**

```sql
-- Test: Running totals (verify cumulative calculations in reports)
SELECT 
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY order_date 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as running_total,
    AVG(daily_revenue) OVER (
        ORDER BY order_date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as seven_day_avg
FROM daily_revenue_summary
ORDER BY order_date;

-- Test: Percentage of total (verify pie chart calculations)
SELECT 
    category_name,
    category_revenue,
    SUM(category_revenue) OVER () as total_revenue,
    ROUND(
        (category_revenue / SUM(category_revenue) OVER ()) * 100, 
        2
    ) as percent_of_total
FROM category_revenue;

-- Test: Moving averages for trend analysis
SELECT 
    date,
    temperature,
    AVG(temperature) OVER (
        ORDER BY date 
        ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
    ) as five_day_moving_avg
FROM weather_data;
```
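
A running total becomes a test once it is compared against the value the report actually stored. The following sketch assumes a hypothetical `demo_daily_report` table with a `reported_running_total` column (not part of the schema used above) and one deliberately wrong value.

```sql
-- Hypothetical scratch table: the report stores its own running total
CREATE TABLE demo_daily_report (
    report_date            DATE PRIMARY KEY,
    daily_revenue          DECIMAL(10,2),
    reported_running_total DECIMAL(12,2)
);

INSERT INTO demo_daily_report (report_date, daily_revenue, reported_running_total)
VALUES
    ('2025-01-01', 100.00, 100.00),
    ('2025-01-02', 250.00, 350.00),
    ('2025-01-03',  50.00, 410.00),  -- deliberately wrong: should be 400.00
    ('2025-01-04', 200.00, 600.00);

WITH Recomputed AS (
    SELECT
        report_date,
        reported_running_total,
        -- Independent recalculation of the cumulative sum
        SUM(daily_revenue) OVER (
            ORDER BY report_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS expected_running_total
    FROM demo_daily_report
)
SELECT report_date, reported_running_total, expected_running_total
FROM Recomputed
WHERE reported_running_total <> expected_running_total
ORDER BY report_date;
```

Only the 2025-01-03 row is returned (410.00 reported against 400.00 expected). An empty result is the passing condition, which makes the query easy to wire into an automated check.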

---

## **30.4 Common Table Expressions (CTEs)**

CTEs improve query readability and enable recursive queries for hierarchical testing.

### **30.4.1 Non-Recursive CTEs**

```sql
-- Test: Complex validation with modular steps
WITH 
-- Step 1: Identify active users
ActiveUsers AS (
    SELECT user_id, username, email
    FROM users
    WHERE status = 'active' 
    AND last_login >= DATE_SUB(NOW(), INTERVAL 90 DAY)
),

-- Step 2: Get their recent orders
RecentOrders AS (
    SELECT 
        o.order_id,
        o.user_id,
        o.total,
        o.order_date,
        ROW_NUMBER() OVER (
            PARTITION BY o.user_id 
            ORDER BY o.order_date DESC
        ) as order_rank
    FROM orders o
    WHERE o.order_date >= DATE_SUB(NOW(), INTERVAL 30 DAY)
),

-- Step 3: Calculate user metrics
UserMetrics AS (
    SELECT 
        au.user_id,
        au.username,
        COUNT(ro.order_id) as order_count,
        SUM(CASE WHEN ro.order_rank = 1 THEN ro.total ELSE 0 END) as last_order_value,
        AVG(ro.total) as avg_order_value
    FROM ActiveUsers au
    LEFT JOIN RecentOrders ro ON au.user_id = ro.user_id
    GROUP BY au.user_id, au.username
)

-- Final validation: Users with declining order values
SELECT 
    um.*,
    CASE 
        WHEN um.last_order_value < um.avg_order_value * 0.8 THEN 'At Risk'
        ELSE 'Healthy'
    END as customer_health
FROM UserMetrics um
WHERE um.order_count > 0
ORDER BY um.avg_order_value DESC;
```

### **30.4.2 Recursive CTEs for Hierarchy Testing**

```sql
-- Test: Verify category tree integrity (no orphaned categories)
WITH RECURSIVE CategoryTree AS (
    -- Anchor: Root categories (no parent)
    SELECT 
        category_id,
        category_name,
        parent_id,
        0 as level,
        CAST(category_name AS CHAR(1000)) as path  -- MySQL cannot CAST to VARCHAR; PostgreSQL can use VARCHAR(1000)
    FROM categories
    WHERE parent_id IS NULL
    
    UNION ALL
    
    -- Recursive: Children of previous level
    SELECT 
        c.category_id,
        c.category_name,
        c.parent_id,
        ct.level + 1,
        CONCAT(ct.path, ' > ', c.category_name)
    FROM categories c
    INNER JOIN CategoryTree ct ON c.parent_id = ct.category_id
    WHERE ct.level < 10  -- Prevent infinite loops
)
SELECT 
    category_id,
    category_name,
    level,
    path
FROM CategoryTree
ORDER BY path;
-- Any category absent from this result is orphaned: its parent chain never reaches a root

-- Test: Find circular references in category hierarchy
-- Walk upward from every category; a cycle exists if the walk returns to its start
WITH RECURSIVE CategoryCheck AS (
    SELECT 
        category_id as start_id,
        parent_id as current_id,
        CAST(category_id AS CHAR(255)) as path,
        1 as depth
    FROM categories
    WHERE parent_id IS NOT NULL
    
    UNION ALL
    
    SELECT 
        cc.start_id,
        c.parent_id,
        CONCAT(cc.path, ',', c.category_id),
        cc.depth + 1
    FROM CategoryCheck cc
    JOIN categories c ON cc.current_id = c.category_id
    WHERE cc.depth < 50                   -- Safety limit for walks that enter a cycle elsewhere
      AND cc.current_id <> cc.start_id    -- Stop once the walk is back at its start
)
SELECT start_id, path, depth FROM CategoryCheck WHERE current_id = start_id;
-- Any row returned is a category that is its own ancestor (circular reference)
```
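
The tree walk can be turned into a direct orphan check by listing the categories it never reaches. This is a sketch under the assumption of a hypothetical `demo_categories` table shaped like `categories` above, with one parent ID (99) that does not exist.

```sql
-- Hypothetical scratch table shaped like the categories table
CREATE TABLE demo_categories (
    category_id   INT PRIMARY KEY,
    category_name VARCHAR(50),
    parent_id     INT
);

INSERT INTO demo_categories (category_id, category_name, parent_id)
VALUES
    (1, 'Electronics',      NULL),
    (2, 'Laptops',          1),
    (3, 'Gaming Laptops',   2),
    (4, 'Lost Accessories', 99),  -- parent 99 does not exist
    (5, 'Lost Cables',      4);   -- child of an orphan, so unreachable too

WITH RECURSIVE Reachable AS (
    -- Anchor: root categories
    SELECT category_id
    FROM demo_categories
    WHERE parent_id IS NULL

    UNION ALL

    -- Recursive step: children of anything already reached
    SELECT c.category_id
    FROM demo_categories c
    INNER JOIN Reachable r ON c.parent_id = r.category_id
)
-- Anti-join: categories the walk from the roots never visited
SELECT c.category_id, c.category_name, c.parent_id
FROM demo_categories c
LEFT JOIN Reachable r ON c.category_id = r.category_id
WHERE r.category_id IS NULL
ORDER BY c.category_id;
```

The query returns categories 4 and 5. Members of a circular reference are also reported, since a cycle has no path to a root, so this check and the cycle check complement each other.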

---

## **30.5 Stored Procedures and Functions Testing**

Database business logic encapsulated in stored procedures requires specific testing approaches.

### **30.5.1 Testing Stored Procedures**

```sql
-- Example Procedure: Transfer funds between accounts
DELIMITER //
CREATE PROCEDURE TransferFunds(
    IN from_account INT,
    IN to_account INT,
    IN amount DECIMAL(10,2),
    OUT success BOOLEAN,
    OUT message VARCHAR(255)
)
BEGIN
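    DECLARE from_balance DECIMAL(10,2);
    DECLARE to_exists INT DEFAULT 0;
    DECLARE EXIT HANDLER FOR SQLEXCEPTION
    BEGIN
        ROLLBACK;
        SET success = FALSE;
        SET message = 'Transaction failed: Rolled back';
    END;
    
    START TRANSACTION;
    
    -- Lock the sender row and read its balance (stays NULL if the account is missing)
    SELECT balance INTO from_balance 
    FROM accounts 
    WHERE account_id = from_account FOR UPDATE;
    
    -- Confirm the receiver exists before any money moves
    SELECT COUNT(*) INTO to_exists
    FROM accounts
    WHERE account_id = to_account;
    
    IF from_balance IS NULL OR to_exists = 0 THEN
        -- Without this guard a missing sender would still credit the receiver
        SET success = FALSE;
        SET message = 'Invalid account';
        ROLLBACK;
    ELSEIF amount <= 0 THEN
        SET success = FALSE;
        SET message = 'Invalid amount';
        ROLLBACK;
    ELSEIF from_balance < amount THEN
        SET success = FALSE;
        SET message = 'Insufficient funds';
        ROLLBACK;
    ELSE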
        -- Deduct from sender
        UPDATE accounts 
        SET balance = balance - amount 
        WHERE account_id = from_account;
        
        -- Add to receiver
        UPDATE accounts 
        SET balance = balance + amount 
        WHERE account_id = to_account;
        
        -- Log transaction
        INSERT INTO transactions 
        (from_account, to_account, amount, transaction_date)
        VALUES (from_account, to_account, amount, NOW());
        
        SET success = TRUE;
        SET message = 'Transfer successful';
        COMMIT;
    END IF;
END //
DELIMITER ;

-- Test Script for Procedure
-- Test Case 1: Successful transfer
CALL TransferFunds(1, 2, 100.00, @success, @message);
SELECT @success, @message;
-- Expected: TRUE, 'Transfer successful'
SELECT balance FROM accounts WHERE account_id IN (1, 2);
-- Verify: Account 1 decreased by 100, Account 2 increased by 100

-- Test Case 2: Insufficient funds
CALL TransferFunds(1, 2, 999999.00, @success, @message);
SELECT @success, @message;
-- Expected: FALSE, 'Insufficient funds'
-- Verify: Balances unchanged (transaction rolled back)

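-- Test Case 3: Invalid account
CALL TransferFunds(999, 2, 100.00, @success, @message);
SELECT @success, @message;
SELECT balance FROM accounts WHERE account_id = 2;
-- Expected: @success = FALSE and account 2's balance unchanged
-- (a procedure that skips the existence check would credit account 2 from nowhere)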
```
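
Beyond checking individual calls, a ledger reconciliation can confirm that every balance is explained by logged transfers. The sketch below assumes hypothetical `demo_accounts` and `demo_transfers` tables; the `opening_balance` column is an assumption introduced for the example and does not appear in the `accounts` table above.

```sql
-- Hypothetical scratch tables; opening_balance is an assumed snapshot column
CREATE TABLE demo_accounts (
    account_id      INT PRIMARY KEY,
    opening_balance DECIMAL(10,2),
    balance         DECIMAL(10,2)
);
CREATE TABLE demo_transfers (
    transfer_id  INT PRIMARY KEY,
    from_account INT,
    to_account   INT,
    amount       DECIMAL(10,2)
);

INSERT INTO demo_accounts (account_id, opening_balance, balance)
VALUES
    (1, 500.00, 400.00),
    (2, 200.00, 300.00),
    (3,  50.00,  75.00);  -- credited 25.00 with no logged transfer

INSERT INTO demo_transfers (transfer_id, from_account, to_account, amount)
VALUES (1, 1, 2, 100.00);

WITH Expected AS (
    SELECT
        a.account_id,
        a.balance AS actual_balance,
        -- opening balance, minus money sent, plus money received
        a.opening_balance
          - COALESCE((SELECT SUM(t.amount) FROM demo_transfers t
                      WHERE t.from_account = a.account_id), 0)
          + COALESCE((SELECT SUM(t.amount) FROM demo_transfers t
                      WHERE t.to_account = a.account_id), 0) AS expected_balance
    FROM demo_accounts a
)
SELECT account_id, actual_balance, expected_balance
FROM Expected
WHERE actual_balance <> expected_balance
ORDER BY account_id;
```

Only account 3 is returned (75.00 actual against 50.00 expected). Running a check like this after a batch of procedure tests can surface money that was created or lost without a matching transaction row.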

### **30.5.2 Testing Functions**

```sql
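-- Example Function: Calculate discount based on customer tier
DELIMITER //
CREATE FUNCTION CalculateDiscount(
    p_customer_id INT,           -- prefixed so it cannot shadow the customer_id column
    p_order_total DECIMAL(10,2)
) RETURNS DECIMAL(10,2)
READS SQL DATA                   -- result depends on table data, so it is not DETERMINISTIC
BEGIN
    DECLARE tier VARCHAR(20);
    DECLARE discount_rate DECIMAL(4,2);
    
    -- With an unprefixed parameter, "customer_id = customer_id" would match every row
    SELECT membership_tier INTO tier 
    FROM customers 
    WHERE customer_id = p_customer_id;
    
    SET discount_rate = CASE tier
        WHEN 'Platinum' THEN 0.20
        WHEN 'Gold' THEN 0.15
        WHEN 'Silver' THEN 0.10
        ELSE 0.00
    END;
    
    RETURN p_order_total * discount_rate;
END //
DELIMITER ;

-- Test Function with various inputs
SELECT 
    customer_id,
    membership_tier,
    CalculateDiscount(customer_id, 1000.00) as discount_amount
FROM customers
WHERE customer_id IN (1, 2, 3);
-- Verify: Correct discount per tier

-- Non-existent customer: call the function directly, because an ID that is
-- missing from customers produces no row in the query above
SELECT CalculateDiscount(999, 1000.00) as discount_amount;
-- Expected: 0.00 (tier stays NULL, so the CASE falls through to ELSE)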

---

## **30.6 Trigger Testing**

Triggers execute automatically on data changes. Testing ensures they fire correctly and don't cause cascading issues.

### **30.6.1 Audit Trigger Validation**

```sql
-- Example: Audit trigger on users table
-- Writes one audit row per changed field, so an update that touches both
-- email and status logs two rows; other columns are not audited here
DELIMITER //
CREATE TRIGGER user_audit_trigger
AFTER UPDATE ON users
FOR EACH ROW
BEGIN
    -- <=> is MySQL's NULL-safe equality, so changes to or from NULL are logged too
    IF NOT (OLD.email <=> NEW.email) THEN
        INSERT INTO user_audit_log
            (user_id, field_changed, old_value, new_value, changed_at, changed_by)
        VALUES
            (NEW.user_id, 'email', OLD.email, NEW.email, NOW(), CURRENT_USER());
    END IF;

    IF NOT (OLD.status <=> NEW.status) THEN
        INSERT INTO user_audit_log
            (user_id, field_changed, old_value, new_value, changed_at, changed_by)
        VALUES
            (NEW.user_id, 'status', OLD.status, NEW.status, NOW(), CURRENT_USER());
    END IF;
END //
DELIMITER ;

-- Test Trigger
-- Step 1: Note current state
SELECT * FROM users WHERE user_id = 1;

-- Step 2: Perform update
UPDATE users 
SET email = 'newemail@example.com', status = 'inactive' 
WHERE user_id = 1;

-- Step 3: Verify audit log
SELECT * FROM user_audit_log 
WHERE user_id = 1 
ORDER BY changed_at DESC;
-- Expected: Two rows (email change and status change)

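-- Step 4: Verify a rolled-back update leaves no audit entries
SELECT COUNT(*) INTO @audit_before FROM user_audit_log WHERE user_id = 1;

START TRANSACTION;
UPDATE users SET email = 'test@test.com' WHERE user_id = 1;
ROLLBACK;

SELECT COUNT(*) = @audit_before AS audit_unchanged
FROM user_audit_log 
WHERE user_id = 1;
-- Expected: 1. The trigger does fire inside the transaction, but its insert is
-- rolled back with it (this holds only for transactional tables such as InnoDB)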
```

### **30.6.2 Cascade Trigger Testing**

```sql
-- Test: Ensure cascading updates maintain referential integrity
-- When product category changes, update search index

-- Insert test data
INSERT INTO products (product_id, category_id, name) 
VALUES (9999, 1, 'Test Product');

-- Update category
UPDATE products SET category_id = 2 WHERE product_id = 9999;

-- Verify trigger fired (check search_index table)
SELECT * FROM search_index WHERE product_id = 9999;
-- Expected: category_id updated to 2

-- Cleanup
DELETE FROM products WHERE product_id = 9999;
-- Verify cascade: search_index entry also removed if ON DELETE CASCADE
```

---

## **30.7 Index and Query Performance Testing**

Slow queries indicate missing indexes or poor query design. Testers should identify performance bottlenecks.

### **30.7.1 Execution Plan Analysis**

```sql
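-- MySQL/MariaDB: EXPLAIN shows the estimated plan without running the query
-- (MySQL 8.0.18+ also offers EXPLAIN ANALYZE, which executes the query and reports actual timings)
EXPLAIN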
SELECT 
    u.username,
    COUNT(o.order_id) as order_count,
    SUM(o.total) as lifetime_value
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE u.created_at >= '2025-01-01'
  AND u.status = 'active'
GROUP BY u.user_id
HAVING COUNT(o.order_id) > 5
ORDER BY lifetime_value DESC
LIMIT 100;

-- Key metrics to check:
-- type: ALL (full table scan - BAD) vs range/ref/eq_ref (index usage - GOOD)
-- rows: High numbers indicate scanning too many rows
-- Extra: Using filesort (needs optimization), Using index (covered query - GOOD)

-- PostgreSQL: EXPLAIN ANALYZE (executes the statement, so wrap data-modifying queries in a transaction and roll back)
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM orders 
WHERE order_date BETWEEN '2025-01-01' AND '2025-12-31'
AND status = 'completed';
```

### **30.7.2 Index Validation Tests**

```sql
-- Test: Verify indexes exist on foreign keys (performance best practice)
-- Uses MySQL's INFORMATION_SCHEMA layout
SELECT 
    KCU.TABLE_NAME,
    KCU.COLUMN_NAME,
    KCU.CONSTRAINT_NAME,
    CASE 
        WHEN EXISTS (
            SELECT 1 FROM INFORMATION_SCHEMA.STATISTICS S
            WHERE S.TABLE_SCHEMA = KCU.TABLE_SCHEMA   -- avoid matching same-named tables in other schemas
              AND S.TABLE_NAME = KCU.TABLE_NAME 
              AND S.COLUMN_NAME = KCU.COLUMN_NAME
        ) THEN 'Indexed'
        ELSE 'MISSING INDEX - PERFORMANCE RISK'
    END as index_status
FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE KCU
WHERE KCU.REFERENCED_TABLE_NAME IS NOT NULL   -- foreign keys only, regardless of naming convention
  AND KCU.TABLE_SCHEMA = 'your_database';

-- Test: Identify missing indexes on frequently queried columns
SELECT 
    TABLE_NAME,
    COLUMN_NAME,
    COUNT(*) as usage_in_queries
FROM query_log  -- Hypothetical query log table
WHERE COLUMN_NAME NOT IN (
    SELECT COLUMN_NAME 
    FROM INFORMATION_SCHEMA.STATISTICS 
    WHERE TABLE_NAME = query_log.TABLE_NAME
)
GROUP BY TABLE_NAME, COLUMN_NAME
HAVING COUNT(*) > 100;
```

---

## **30.8 Data Comparison and Validation**

Testing data migrations, ETL processes, and synchronization requires comparing datasets between environments.

### **30.8.1 Row-by-Row Comparison**

```sql
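-- Compare Production vs Staging for data integrity
-- A separator keeps adjacent fields from blending: without it, ('ab', 'c') and
-- ('a', 'bc') concatenate to the same string and produce the same hash
WITH ProdData AS (
    SELECT 
        user_id,
        username,
        email,
        MD5(CONCAT_WS('|', username, email, created_at)) as row_hash
    FROM production.users
),
StageData AS (
    SELECT 
        user_id,
        username,
        email,
        MD5(CONCAT_WS('|', username, email, created_at)) as row_hash
    FROM staging.users
)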
-- Find rows that differ
SELECT 
    COALESCE(p.user_id, s.user_id) as user_id,
    CASE 
        WHEN p.user_id IS NULL THEN 'Missing in Production'
        WHEN s.user_id IS NULL THEN 'Missing in Staging'
        WHEN p.row_hash != s.row_hash THEN 'Data Mismatch'
    END as difference_type,
    p.username as prod_username,
    s.username as stage_username
FROM ProdData p
FULL OUTER JOIN StageData s ON p.user_id = s.user_id
WHERE p.row_hash IS DISTINCT FROM s.row_hash;
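-- Note: IS DISTINCT FROM handles NULL comparisons correctly, so rows missing on
-- either side are returned too. This query uses PostgreSQL syntax; MySQL lacks
-- FULL OUTER JOIN and writes the NULL-safe test as NOT (a <=> b)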
```
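
Where hashing is not needed, set difference gives the same answer with less machinery. The sketch below assumes two hypothetical tables, `demo_prod_users` and `demo_stage_users`, standing in for the production and staging copies. `EXCEPT` is available in PostgreSQL and SQLite, and in MySQL from version 8.0.31.

```sql
-- Hypothetical scratch tables standing in for production.users and staging.users
CREATE TABLE demo_prod_users  (user_id INT PRIMARY KEY, username VARCHAR(50), email VARCHAR(100));
CREATE TABLE demo_stage_users (user_id INT PRIMARY KEY, username VARCHAR(50), email VARCHAR(100));

INSERT INTO demo_prod_users (user_id, username, email)
VALUES
    (1, 'alice', 'alice@example.com'),
    (2, 'bob',   'bob@example.com'),
    (3, 'carol', 'carol@example.com');

INSERT INTO demo_stage_users (user_id, username, email)
VALUES
    (1, 'alice', 'alice@example.com'),
    (2, 'bob',   'robert@example.com'),  -- same ID, different email
    (4, 'dave',  'dave@example.com');    -- exists only in staging

-- Rows in production with no identical row in staging
SELECT 'only_in_production' AS side, d.user_id, d.username, d.email
FROM (
    SELECT user_id, username, email FROM demo_prod_users
    EXCEPT
    SELECT user_id, username, email FROM demo_stage_users
) d

UNION ALL

-- Rows in staging with no identical row in production
SELECT 'only_in_staging' AS side, d.user_id, d.username, d.email
FROM (
    SELECT user_id, username, email FROM demo_stage_users
    EXCEPT
    SELECT user_id, username, email FROM demo_prod_users
) d

ORDER BY user_id, side;
```

The result has four rows: user 2 appears once per side (a data mismatch), user 3 only in production, and user 4 only in staging. `EXCEPT` treats two NULLs as equal, so nullable columns need no special handling here.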

### **30.8.2 Aggregate Comparison**

```sql
-- Quick health check: Compare table statistics
SELECT 
    'Production' as environment,
    COUNT(*) as row_count,
    COUNT(DISTINCT email) as unique_emails,
    MIN(created_at) as earliest_record,
    MAX(created_at) as latest_record,
    SUM(CASE WHEN status = 'active' THEN 1 ELSE 0 END) as active_count
FROM production.users

UNION ALL

SELECT 
    'Staging' as environment,
    COUNT(*) as row_count,
    COUNT(DISTINCT email) as unique_emails,
    MIN(created_at) as earliest_record,
    MAX(created_at) as latest_record,
    SUM(CASE WHEN status = 'active' THEN 1 ELSE 0 END) as active_count
FROM staging.users;
-- Expected: Row counts should match (or be within expected variance)
```

---

## **30.9 ETL Testing Patterns**

Extract, Transform, Load processes require validation at each stage.

### **30.9.1 Source-to-Target Validation**

```sql
-- ETL Test: Verify all source records reached target
-- Source System (Operational DB)
WITH SourceCounts AS (
    SELECT 
        DATE(created_at) as data_date,
        COUNT(*) as source_count,
        SUM(amount) as source_total
    FROM source_db.transactions
    WHERE created_at >= CURDATE() - INTERVAL 7 DAY
    GROUP BY DATE(created_at)
),

-- Target System (Data Warehouse)
TargetCounts AS (
    SELECT 
        DATE(transaction_date) as data_date,
        COUNT(*) as target_count,
        SUM(amount) as target_total
    FROM data_warehouse.fact_transactions
    WHERE transaction_date >= CURDATE() - INTERVAL 7 DAY
    GROUP BY DATE(transaction_date)
)

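-- Reconciliation (FULL OUTER JOIN needs PostgreSQL, SQL Server, or Oracle, while
-- CURDATE() above is MySQL syntax; adapt the date functions or the join to your engine)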
SELECT 
    COALESCE(s.data_date, t.data_date) as data_date,
    s.source_count,
    t.target_count,
    s.source_count - t.target_count as count_variance,
    s.source_total,
    t.target_total,
    ABS(s.source_total - t.target_total) as amount_variance
FROM SourceCounts s
FULL OUTER JOIN TargetCounts t ON s.data_date = t.data_date
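WHERE s.data_date IS NULL          -- date present only in the target
   OR t.data_date IS NULL          -- date never loaded into the target
   OR s.source_count != t.target_count 
   OR ABS(s.source_total - t.target_total) > 0.01;
-- Returns: Any dates with mismatched counts or totals, including dates missing on one side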
```

### **30.9.2 Transformation Logic Validation**

```sql
-- Test: Verify business rules in ETL transformations
-- Rule: VIP customers = >$10,000 lifetime spend OR >50 orders

-- Check if classification logic is correct
WITH CustomerMetrics AS (
    SELECT 
        customer_id,
        SUM(order_total) as lifetime_spend,
        COUNT(DISTINCT order_id) as order_count
    FROM orders
    GROUP BY customer_id
),
ExpectedClassification AS (
    SELECT 
        customer_id,
        CASE 
            WHEN lifetime_spend > 10000 OR order_count > 50 
            THEN 'VIP' 
            ELSE 'Standard' 
        END as expected_tier
    FROM CustomerMetrics
)
SELECT 
    e.customer_id,
    e.expected_tier,
    c.actual_tier,
    'MISMATCH' as status
FROM ExpectedClassification e
LEFT JOIN customer_tiers c ON e.customer_id = c.customer_id
WHERE c.actual_tier IS NULL              -- customer was never classified
   OR e.expected_tier != c.actual_tier;
-- Returns: Customers with a missing or incorrect tier classification
```
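
Threshold rules such as this one tend to fail at the boundary, so it helps to seed test data that sits exactly on and just past each limit. The sketch below assumes hypothetical `demo_customer_metrics` and `demo_customer_tiers` tables; the first tier row simulates an ETL job that used `>=` where the rule says `>`.

```sql
-- Hypothetical scratch tables with pre-aggregated metrics and the ETL's output
CREATE TABLE demo_customer_metrics (
    customer_id    INT PRIMARY KEY,
    lifetime_spend DECIMAL(12,2),
    order_count    INT
);
CREATE TABLE demo_customer_tiers (
    customer_id INT PRIMARY KEY,
    actual_tier VARCHAR(20)
);

INSERT INTO demo_customer_metrics (customer_id, lifetime_spend, order_count)
VALUES
    (1, 10000.00, 10),  -- exactly on the spend threshold: Standard
    (2, 10000.01, 10),  -- just over the spend threshold: VIP
    (3,   500.00, 50),  -- exactly on the order threshold: Standard
    (4,   500.00, 51);  -- just over the order threshold: VIP

INSERT INTO demo_customer_tiers (customer_id, actual_tier)
VALUES
    (1, 'VIP'),         -- simulated off-by-one bug (>= instead of >)
    (2, 'VIP'),
    (3, 'Standard'),
    (4, 'VIP');

WITH Expected AS (
    SELECT
        customer_id,
        CASE
            WHEN lifetime_spend > 10000 OR order_count > 50 THEN 'VIP'
            ELSE 'Standard'
        END AS expected_tier
    FROM demo_customer_metrics
)
SELECT e.customer_id, e.expected_tier, t.actual_tier
FROM Expected e
JOIN demo_customer_tiers t ON e.customer_id = t.customer_id
WHERE e.expected_tier <> t.actual_tier
ORDER BY e.customer_id;
```

Only customer 1 is returned (expected `Standard`, actual `VIP`), which pinpoints the boundary where the simulated transformation disagrees with the rule.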

---

## **Chapter Summary**

### **Key Takeaways from Chapter 30:**

**Complex Joins:**
- **INNER JOIN:** Returns only matching rows; compare its row count with the child table's to verify referential integrity (all orders have users)
- **LEFT JOIN:** Returns all left table rows; filter on `IS NULL` to find unmatched rows (users without orders, orphaned child records)
- **RIGHT JOIN:** Returns all right table rows; use for reverse perspective validation
- **FULL OUTER JOIN:** Returns all rows from both; use for complete data comparison between environments
- **CROSS JOIN:** Cartesian product; use for permission matrix testing (all role-permission combinations)
- **SELF JOIN:** Join table to itself; use for hierarchical data (employees and managers)

**Window Functions:**
- **ROW_NUMBER():** Assigns unique sequential integers; use for deduplication (keep first, remove duplicates)
- **RANK()/DENSE_RANK():** Handle ties in ordering; use for leaderboard/reporting validations
- **LAG()/LEAD():** Access previous/next row values; use for time-series analysis (price changes, session gaps)
- **SUM()/AVG() OVER:** Running totals and moving averages; use for financial report validation
- **PARTITION BY:** Resets calculation per group; essential for per-category or per-user analysis

**Common Table Expressions (CTEs):**
- **Non-recursive:** Break complex queries into readable steps; modular validation logic
- **Recursive:** Query hierarchical data (org charts, category trees); detect circular references
- **WITH clause:** Improves query organization; enables multi-step data preparation

**Stored Procedures/Functions:**
- **Input validation:** Test boundary values (zero, negative, maximum)
- **Transaction handling:** Verify COMMIT on success, ROLLBACK on failure
- **Error handling:** Ensure SQL exceptions are caught and handled gracefully
- **Output verification:** Check OUT parameters and return values match expected business logic

**Triggers:**
- **Audit validation:** Verify triggers create accurate audit logs on data changes
- **Cascade testing:** Ensure triggers maintain referential integrity across related tables
- **Performance impact:** Test that triggers don't significantly slow bulk operations
- **Rollback behavior:** Verify that changes made by a trigger are rolled back together with the transaction that fired it

**Query Performance:**
- **EXPLAIN/EXPLAIN ANALYZE:** Read execution plans to identify full table scans (type=ALL)
- **Index verification:** Ensure foreign keys and frequently queried columns have indexes
- **Slow query identification:** Queries with high row counts or filesort operations need optimization

**Data Comparison:**
- **Hash comparison:** Use MD5/SHA hashes of delimiter-separated fields for efficient row comparison
- **Aggregate reconciliation:** Compare row counts, sums, and distinct counts between environments
- **Variance reporting:** Identify specific rows causing differences between source and target

**ETL Testing:**
- **Source-to-target reconciliation:** Verify record counts and monetary totals match
- **Transformation validation:** Confirm business rules correctly classify/calculate derived fields
- **Incremental loading:** Test that only changed/new data is processed in incremental runs

---

## **📖 Next Chapter: Chapter 31 - Database Testing Techniques**

Having mastered advanced SQL, **Chapter 31** will focus on practical **database testing methodologies and techniques** for comprehensive validation.

In **Chapter 31**, you'll learn:

- **Schema Testing:** Validating table structures, data types, constraints, and default values against specifications
- **Data Integrity Testing:** Techniques for verifying entity, referential, and domain integrity at scale
- **Database Security Testing:** Privilege escalation testing, SQL injection detection, and sensitive data exposure validation
- **Backup and Recovery Testing:** Disaster recovery validation, point-in-time recovery testing, and replication lag verification
- **Concurrency Testing:** Deadlock detection, isolation level validation, and race condition testing
- **Migration Testing:** Schema change validation, data migration verification, and rollback procedure testing
- **Database API Testing:** Testing stored procedure interfaces, REST APIs with database persistence, and ORM validation
- **NoSQL Testing:** Document schema validation, consistency model testing, and sharding verification for MongoDB/Cassandra

**Chapter 31** provides the methodological framework for systematically testing database systems, ensuring data reliability, security, and performance in production environments.

**Continue to Chapter 31 to master the complete database testing methodology and ensure your data layer is production-ready!**

