# SQL Common Table Expressions (CTEs): Simplifying Complex Queries

## Introduction

**Common Table Expressions (CTEs)** are a powerful SQL feature that allows you to create temporary, named result sets that exist only for the duration of a single query. Think of CTEs as **"temporary views"** that make complex queries more readable and maintainable.

**Why CTEs Matter:**
- 📖 **Readability** - Break complex queries into logical, named parts
- 🔄 **Reusability** - Reference the same subquery multiple times
- 🧹 **Maintainability** - Easier to debug and modify complex logic
- 🎯 **Recursive Queries** - Enable recursive operations (hierarchies, trees)
- 🔍 **Duplicate Detection** - Essential for finding and removing duplicate records

**Database:** This course uses the **Snowflake** database. All examples are Snowflake-compatible.

**What you'll learn:**
- Understanding what CTEs are and when to use them
- Basic CTE syntax and structure
- Multiple CTEs in a single query
- Using CTEs to find duplicates
- Using CTEs to delete duplicates
- Recursive CTEs (brief introduction)
- Real-world scenarios with practical examples

**Prerequisites:**
- Understanding of SELECT statements
- Knowledge of JOINs and subqueries
- Basic SQL aggregation functions

---

## What is a Common Table Expression (CTE)?

A **CTE** is a temporary named result set that exists only within the scope of a single SQL statement. It's defined using the `WITH` clause and can be referenced multiple times in the main query.

### Simple Analogy

Think of a CTE like a **"helper table"** that you create on-the-fly:
- You define it once with `WITH`
- You can use it multiple times in your query
- It disappears after the query completes
- It makes your query easier to read and understand

### Basic Syntax

```sql
WITH cte_name AS (
    SELECT ...
    FROM ...
    WHERE ...
)
SELECT ...
FROM cte_name
WHERE ...;
```

**Key Points:**
- CTE starts with `WITH`
- The CTE name comes right after `WITH` and is followed by `AS`
- CTE definition is in parentheses
- Main query references the CTE by name
- CTE only exists for that one query
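
To make the template concrete, here is a minimal sketch that runs without any setup. The `sample_orders` rows are made-up inline values (an assumption for illustration, not part of the course tables):

```sql
-- Hypothetical inline data; the course tables are not needed to run this.
WITH sample_orders AS (
    SELECT *
    FROM (VALUES
        (1, 120.00),
        (1,  80.00),
        (2,  40.00)
    ) AS t (customer_id, total_amount)
)
-- The main query reads from the CTE exactly like it would from a table.
SELECT
    customer_id,
    SUM(total_amount) AS total_spent
FROM sample_orders
GROUP BY customer_id
ORDER BY customer_id;
-- Expected: customer 1 -> 200.00, customer 2 -> 40.00
```

Once the statement finishes, `sample_orders` no longer exists; a second query would have to define it again.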

---

## Why Use CTEs Instead of Subqueries?

### Example: Without CTE (Using Subquery)

```sql
SELECT 
    customer_id,
    total_orders,
    total_spent
FROM (
    SELECT 
        c.customer_id,
        COUNT(o.order_id) AS total_orders,
        SUM(o.total_amount) AS total_spent
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id
) AS customer_summary
WHERE total_spent > 500
ORDER BY total_spent DESC;
```

**Problems:**
- Hard to read (nested subquery)
- Can't reuse the subquery
- Difficult to debug

### Example: With CTE (Much Cleaner!)

```sql
WITH customer_summary AS (
    SELECT 
        c.customer_id,
        COUNT(o.order_id) AS total_orders,
        SUM(o.total_amount) AS total_spent
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id
)
SELECT 
    customer_id,
    total_orders,
    total_spent
FROM customer_summary
WHERE total_spent > 500
ORDER BY total_spent DESC;
```

**Benefits:**
- ✅ Much more readable
- ✅ Easy to understand the logic
- ✅ Can reference `customer_summary` multiple times
- ✅ Easier to debug and modify
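
The "multiple references" benefit is the one a derived-table subquery cannot offer. A small sketch, using made-up inline totals in place of the real `customer_summary` aggregation (the values are assumptions for illustration):

```sql
-- Hypothetical totals standing in for the aggregated customer_summary.
WITH customer_summary AS (
    SELECT *
    FROM (VALUES
        (1, 900.00),
        (2, 300.00),
        (3, 600.00)
    ) AS t (customer_id, total_spent)
)
SELECT
    cs.customer_id,
    cs.total_spent
FROM customer_summary cs                        -- first reference
WHERE cs.total_spent > (
    SELECT AVG(total_spent)
    FROM customer_summary                       -- second reference
)
ORDER BY cs.total_spent DESC;
-- The average is 600.00, so only customer 1 (900.00) is returned.
```

With a nested subquery, the aggregation would have to be written out twice to achieve the same result.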

---

## Dataset Setup

We'll use the same tables from the SQL Joins notebook (`customers`, `orders`, `products`, `order_items`). Let's first verify the tables exist and then add some duplicate data for our examples.


## Step 1: Verify Tables Exist

Before we start, let's make sure we have the tables from the SQL Joins notebook. If you haven't run that notebook yet, you'll need to create the tables first.
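
If you are unsure whether the tables exist, a catalog query is a gentler check than selecting from each table, because it returns fewer rows rather than raising an error. This is a sketch that assumes the tables were created with unquoted names in the current schema:

```sql
-- Lists which of the four expected tables exist in the current schema.
-- Snowflake stores unquoted identifiers in upper case, hence the upper-case names.
SELECT table_name
FROM information_schema.tables
WHERE table_schema = CURRENT_SCHEMA()
  AND table_name IN ('CUSTOMERS', 'ORDERS', 'PRODUCTS', 'ORDER_ITEMS')
ORDER BY table_name;
```

Four rows back means everything is in place; any missing name points to a table that still has to be created from the SQL Joins notebook.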


In [None]:
-- Check if tables exist and view their structure
SELECT * FROM customers LIMIT 5;
SELECT * FROM orders LIMIT 5;
SELECT * FROM products LIMIT 5;
SELECT * FROM order_items LIMIT 5;


## Step 2: Adding Duplicate Data for Examples

To demonstrate finding and deleting duplicates, we'll add some duplicate records to our tables. This simulates real-world data quality issues.

**Note:** We'll create a separate table for duplicate examples to avoid corrupting the original data.


In [None]:
-- Create a table with duplicate customer records for demonstration
-- This simulates data quality issues where duplicates might exist
CREATE OR REPLACE TABLE customers_with_duplicates AS
SELECT * FROM customers
UNION ALL
-- Add duplicate records (same email, different customer_id)
SELECT 7, 'John', 'Doe', 'john.doe@email.com', 'New York', 'USA' UNION ALL
SELECT 8, 'Jane', 'Smith', 'jane.smith@email.com', 'London', 'UK' UNION ALL
SELECT 9, 'Bob', 'Johnson', 'bob.johnson@email.com', 'Toronto', 'Canada' UNION ALL
-- Add near-duplicates (similar but not identical)
SELECT 10, 'John', 'Doe', 'john.doe@email.com', 'Boston', 'USA' UNION ALL
SELECT 11, 'Jane', 'Smith', 'jane.smith@email.com', 'Manchester', 'UK';


In [None]:
-- View the duplicate data
SELECT * FROM customers_with_duplicates ORDER BY email, customer_id;


---

## Part 1: Basic CTE Examples

Let's start with simple examples to understand how CTEs work.


### Example 1: Simple CTE - Customer Summary

**Question:** Show all customers with their total order count and total amount spent.


In [None]:
-- Using CTE to calculate customer summary
WITH customer_summary AS (
    SELECT 
        c.customer_id,
        c.first_name || ' ' || c.last_name AS customer_name,
        COUNT(o.order_id) AS total_orders,
        COALESCE(SUM(o.total_amount), 0) AS total_spent
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.first_name, c.last_name
)
SELECT 
    customer_id,
    customer_name,
    total_orders,
    total_spent
FROM customer_summary
ORDER BY total_spent DESC;


### Example 2: Multiple CTEs - Chaining CTEs

You can define multiple CTEs in a single query. Each CTE can reference previous CTEs.

**Question:** Find customers who spent more than the average customer spending.


In [None]:
-- Multiple CTEs: First calculate customer totals, then compare to average
WITH customer_totals AS (
    SELECT 
        c.customer_id,
        c.first_name || ' ' || c.last_name AS customer_name,
        COALESCE(SUM(o.total_amount), 0) AS total_spent
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.first_name, c.last_name
),
average_spending AS (
    SELECT AVG(total_spent) AS avg_spent
    FROM customer_totals
)
SELECT 
    ct.customer_id,
    ct.customer_name,
    ct.total_spent,
    ROUND(a.avg_spent, 2) AS average_customer_spending,
    ROUND(ct.total_spent - a.avg_spent, 2) AS difference_from_avg
FROM customer_totals ct
CROSS JOIN average_spending a
WHERE ct.total_spent > a.avg_spent
ORDER BY ct.total_spent DESC;


**Key Points:**
- Multiple CTEs are separated by commas
- Second CTE (`average_spending`) can reference the first CTE (`customer_totals`)
- Each CTE is defined as its own query but can build on the CTEs defined before it
- Main query can reference any of the CTEs
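
As a design alternative, the separate averaging CTE and the `CROSS JOIN` can be replaced by a window function inside a chained CTE. The sketch below uses made-up inline totals (an assumption for illustration) so that it runs on its own:

```sql
-- Hypothetical totals standing in for the customer_totals aggregation.
WITH customer_totals AS (
    SELECT *
    FROM (VALUES
        (1, 1000.00),
        (2,  400.00),
        (3,  100.00)
    ) AS t (customer_id, total_spent)
),
with_average AS (
    -- AVG(...) OVER () attaches the overall average to every row.
    SELECT
        customer_id,
        total_spent,
        AVG(total_spent) OVER () AS avg_spent
    FROM customer_totals
)
SELECT
    customer_id,
    total_spent,
    ROUND(avg_spent, 2) AS average_customer_spending
FROM with_average
WHERE total_spent > avg_spent
ORDER BY total_spent DESC;
-- The average is 500.00, so only customer 1 is returned.
```

Both versions give the same answer; the two-CTE form tends to read more clearly when the average is needed in several places.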

---

## Part 2: Finding Duplicates Using CTEs

CTEs are particularly powerful for finding duplicates. Let's explore different scenarios.


### Example 1: Find Duplicate Customers by Email

**Problem:** Find all duplicate customer records based on email address.

**Approach:**
1. Use CTE to count occurrences of each email
2. Filter for emails that appear more than once
3. Join back to original table to show all duplicate records


In [None]:
-- Find duplicate customers by email using CTE
WITH duplicate_emails AS (
    SELECT 
        email,
        COUNT(*) AS duplicate_count
    FROM customers_with_duplicates
    GROUP BY email
    HAVING COUNT(*) > 1
)
SELECT 
    c.customer_id,
    c.first_name,
    c.last_name,
    c.email,
    c.city,
    c.country,
    de.duplicate_count
FROM customers_with_duplicates c
INNER JOIN duplicate_emails de ON c.email = de.email
ORDER BY c.email, c.customer_id;


**What this does:**
- `duplicate_emails` CTE finds all emails that appear more than once
- Main query joins back to show all records with duplicate emails
- Results are ordered by email and customer_id for easy review
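
A variation that avoids the join back: count with a window function and filter with Snowflake's `QUALIFY` clause. The `sample_customers` rows are made-up inline values (an assumption for illustration):

```sql
-- Hypothetical inline data so the sketch runs without the course tables.
WITH sample_customers AS (
    SELECT *
    FROM (VALUES
        (1, 'john.doe@email.com'),
        (2, 'jane.smith@email.com'),
        (7, 'john.doe@email.com')
    ) AS t (customer_id, email)
)
SELECT
    customer_id,
    email,
    COUNT(*) OVER (PARTITION BY email) AS duplicate_count
FROM sample_customers
-- QUALIFY filters on a window function result, like HAVING does for aggregates.
QUALIFY COUNT(*) OVER (PARTITION BY email) > 1
ORDER BY email, customer_id;
-- Returns customer_id 1 and 7, each with duplicate_count = 2.
```

`QUALIFY` is not available in every database, so the `GROUP BY` plus join version remains the more portable choice.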

### Example 2: Find Duplicate Orders (Same Customer, Same Date, Same Amount)

**Problem:** Find duplicate orders that might have been accidentally inserted twice.


In [None]:
-- First, let's add some duplicate orders for demonstration
INSERT INTO orders (order_id, customer_id, order_date, total_amount, status) VALUES
(1008, 1, '2024-01-15', 1029.98, 'delivered'),  -- Duplicate of order 1001
(1009, 2, '2024-01-16', 379.98, 'shipped'),     -- Duplicate of order 1002
(1010, 1, '2024-01-20', 149.99, 'pending');     -- Duplicate of order 1003


In [None]:
-- Find duplicate orders using CTE
WITH duplicate_orders AS (
    SELECT 
        customer_id,
        order_date,
        total_amount,
        COUNT(*) AS duplicate_count
    FROM orders
    GROUP BY customer_id, order_date, total_amount
    HAVING COUNT(*) > 1
)
SELECT 
    o.order_id,
    o.customer_id,
    c.first_name || ' ' || c.last_name AS customer_name,
    o.order_date,
    o.total_amount,
    o.status,
    do.duplicate_count
FROM orders o
INNER JOIN duplicate_orders do 
    ON o.customer_id = do.customer_id 
    AND o.order_date = do.order_date 
    AND o.total_amount = do.total_amount
INNER JOIN customers c ON o.customer_id = c.customer_id
ORDER BY o.customer_id, o.order_date, o.order_id;


### Example 3: Find Duplicate Order Items

**Problem:** Find duplicate order items (same order_id, same product_id).


In [None]:
-- Add duplicate order items for demonstration
INSERT INTO order_items (order_id, product_id, quantity, unit_price) VALUES
(1001, 101, 1, 999.99),  -- Duplicate: order 1001 already has product 101
(1002, 104, 1, 299.99),  -- Duplicate: order 1002 already has product 104
(1003, 107, 1, 149.99);  -- Duplicate: order 1003 already has product 107


In [None]:
-- Find duplicate order items using CTE
WITH duplicate_order_items AS (
    SELECT 
        order_id,
        product_id,
        COUNT(*) AS duplicate_count
    FROM order_items
    GROUP BY order_id, product_id
    HAVING COUNT(*) > 1
)
SELECT 
    oi.order_id,
    oi.product_id,
    p.product_name,
    oi.quantity,
    oi.unit_price,
    (oi.quantity * oi.unit_price) AS line_total,
    doi.duplicate_count
FROM order_items oi
INNER JOIN duplicate_order_items doi 
    ON oi.order_id = doi.order_id 
    AND oi.product_id = doi.product_id
INNER JOIN products p ON oi.product_id = p.product_id
ORDER BY oi.order_id, oi.product_id;


### Example 4: Find Duplicates with ROW_NUMBER() Window Function

**Advanced Approach:** Use ROW_NUMBER() to identify which records to keep and which to delete.

**Problem:** Find duplicate customers and mark which one to keep (e.g., keep the one with the lowest customer_id).


In [None]:
-- Find duplicates and mark which ones to keep/delete using ROW_NUMBER()
WITH ranked_customers AS (
    SELECT 
        customer_id,
        first_name,
        last_name,
        email,
        city,
        country,
        ROW_NUMBER() OVER (
            PARTITION BY email 
            ORDER BY customer_id ASC
        ) AS row_num
    FROM customers_with_duplicates
)
SELECT 
    customer_id,
    first_name,
    last_name,
    email,
    city,
    country,
    row_num,
    CASE 
        WHEN row_num = 1 THEN 'KEEP'
        ELSE 'DELETE'
    END AS action
FROM ranked_customers
ORDER BY email, row_num;


**What this does:**
- `ROW_NUMBER()` numbers the rows within each group 1, 2, 3, ...
- `PARTITION BY email` groups by email (creates separate numbering for each email)
- `ORDER BY customer_id ASC` keeps the lowest customer_id as row_num = 1
- Records with `row_num = 1` should be KEPT
- Records with `row_num > 1` should be DELETED
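
When only the rows to keep are needed, the ranking and the filter can be combined with Snowflake's `QUALIFY` clause. A minimal sketch on made-up inline rows (an assumption for illustration):

```sql
-- Hypothetical inline data so the sketch runs without the course tables.
WITH sample_customers AS (
    SELECT *
    FROM (VALUES
        (1,  'john.doe@email.com'),
        (2,  'jane.smith@email.com'),
        (7,  'john.doe@email.com'),
        (10, 'john.doe@email.com')
    ) AS t (customer_id, email)
)
SELECT
    customer_id,
    email
FROM sample_customers
-- Keep only the first row of each email group (lowest customer_id).
QUALIFY ROW_NUMBER() OVER (PARTITION BY email ORDER BY customer_id ASC) = 1
ORDER BY customer_id;
-- Returns customer_id 1 and 2.
```

Changing `= 1` to `> 1` lists the rows that would be deleted instead.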

---

## Part 3: Deleting Duplicates Using CTEs

Now let's see how to actually delete duplicates. **Important:** Always test your delete queries with SELECT first!

### Example 1: Delete Duplicate Customers (Keep First Occurrence)

**Problem:** Delete duplicate customer records, keeping only the first occurrence (lowest customer_id).


**⚠️ WARNING: Always test DELETE queries with SELECT first!**

Let's first see what would be deleted:


In [None]:
-- STEP 1: Preview what will be deleted (ALWAYS DO THIS FIRST!)
WITH ranked_customers AS (
    SELECT 
        customer_id,
        first_name,
        last_name,
        email,
        ROW_NUMBER() OVER (
            PARTITION BY email 
            ORDER BY customer_id ASC
        ) AS row_num
    FROM customers_with_duplicates
)
SELECT 
    customer_id,
    first_name,
    last_name,
    email,
    row_num,
    'WILL BE DELETED' AS status
FROM ranked_customers
WHERE row_num > 1  -- Keep row_num = 1, delete others
ORDER BY email, customer_id;


Now, if the preview looks correct, we can delete:


In [None]:
-- STEP 2: Delete duplicates (keeping the first occurrence)
-- The WITH ... DELETE form below works in some databases (for example PostgreSQL),
-- but not in Snowflake, so it is shown commented out for reference only.

/*
WITH ranked_customers AS (
    SELECT 
        customer_id,
        ROW_NUMBER() OVER (
            PARTITION BY email 
            ORDER BY customer_id ASC
        ) AS row_num
    FROM customers_with_duplicates
)
DELETE FROM customers_with_duplicates
WHERE customer_id IN (
    SELECT customer_id 
    FROM ranked_customers 
    WHERE row_num > 1
);
*/

-- Note: Snowflake does not accept a WITH clause placed in front of DELETE,
-- so the statement above stays commented out. A working approach follows.


**Note:** Because Snowflake does not accept `WITH` in front of `DELETE`, one approach that works is to rebuild the table from the rows you want to keep:


In [None]:
-- Correct way to delete duplicates in Snowflake
-- First, create a temporary table with records to keep
CREATE OR REPLACE TEMPORARY TABLE customers_to_keep AS
WITH ranked_customers AS (
    SELECT 
        customer_id,
        first_name,
        last_name,
        email,
        city,
        country,
        ROW_NUMBER() OVER (
            PARTITION BY email 
            ORDER BY customer_id ASC
        ) AS row_num
    FROM customers_with_duplicates
)
SELECT 
    customer_id,
    first_name,
    last_name,
    email,
    city,
    country
FROM ranked_customers
WHERE row_num = 1;

-- Then, delete all records and re-insert only the ones to keep.
-- Run both statements in one transaction so a failure between them
-- cannot leave the table empty.
BEGIN;
DELETE FROM customers_with_duplicates;
INSERT INTO customers_with_duplicates 
SELECT * FROM customers_to_keep;
COMMIT;

-- Verify the result
SELECT * FROM customers_with_duplicates ORDER BY email, customer_id;
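
When the duplicate rows can be told apart by a unique column such as `customer_id`, a targeted delete is another option: the ranking moves into a derived table inside the `WHERE` clause. The sketch below works on a hypothetical scratch table (`dedupe_demo` is an assumed name) so that it does not touch the course tables:

```sql
-- Hypothetical scratch table for the demonstration.
CREATE OR REPLACE TEMPORARY TABLE dedupe_demo (
    customer_id INTEGER,
    email       VARCHAR
);

INSERT INTO dedupe_demo (customer_id, email) VALUES
    (1,  'john.doe@email.com'),
    (2,  'jane.smith@email.com'),
    (7,  'john.doe@email.com'),
    (10, 'john.doe@email.com');

-- The ranking lives in a derived table instead of a leading WITH clause.
DELETE FROM dedupe_demo
WHERE customer_id IN (
    SELECT customer_id
    FROM (
        SELECT
            customer_id,
            ROW_NUMBER() OVER (
                PARTITION BY email
                ORDER BY customer_id ASC
            ) AS row_num
        FROM dedupe_demo
    ) AS ranked
    WHERE row_num > 1
);

SELECT * FROM dedupe_demo ORDER BY customer_id;
-- Remaining rows: customer_id 1 and 2.
```

This only works because `customer_id` differs between the copies; for rows that are identical in every column, rebuilding the table is the safer route.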


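### Example 2: Delete Duplicate Orders

**Problem:** Delete duplicate orders (same customer, same date, same amount), keeping only the first occurrence (lowest order_id).


In [None]: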
-- STEP 1: Preview duplicate orders to be deleted
WITH ranked_orders AS (
    SELECT 
        order_id,
        customer_id,
        order_date,
        total_amount,
        status,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id, order_date, total_amount
            ORDER BY order_id ASC
        ) AS row_num
    FROM orders
)
SELECT 
    order_id,
    customer_id,
    order_date,
    total_amount,
    status,
    row_num,
    CASE 
        WHEN row_num = 1 THEN 'KEEP'
        ELSE 'DELETE'
    END AS action
FROM ranked_orders
ORDER BY customer_id, order_date, order_id;


In [None]:
-- STEP 2: Delete duplicate orders
-- Using MERGE, one approach that works in Snowflake. Because order_id is unique,
-- the join matches only the extra copies (row_num > 1), never the row we keep.

MERGE INTO orders AS target
USING (
    WITH ranked_orders AS (
        SELECT 
            order_id,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id, order_date, total_amount
                ORDER BY order_id ASC
            ) AS row_num
        FROM orders
    )
    SELECT order_id
    FROM ranked_orders
    WHERE row_num > 1
) AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN DELETE;

-- Verify: Check remaining orders
SELECT * FROM orders ORDER BY customer_id, order_date, order_id;


### Example 3: Delete Duplicate Order Items

**Problem:** Delete duplicate order items, keeping only the first occurrence.


In [None]:
-- STEP 1: Preview duplicate order items
WITH ranked_order_items AS (
    SELECT 
        order_id,
        product_id,
        quantity,
        unit_price,
        ROW_NUMBER() OVER (
            PARTITION BY order_id, product_id
            ORDER BY order_id ASC, product_id ASC
        ) AS row_num
    FROM order_items
)
SELECT 
    oi.order_id,
    oi.product_id,
    p.product_name,
    oi.quantity,
    oi.unit_price,
    oi.row_num,
    CASE 
        WHEN oi.row_num = 1 THEN 'KEEP'
        ELSE 'DELETE'
    END AS action
FROM ranked_order_items oi
INNER JOIN products p ON oi.product_id = p.product_id
ORDER BY oi.order_id, oi.product_id;


In [None]:
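-- STEP 2: Delete duplicate order items
-- A MERGE keyed on (order_id, product_id) would match every copy of a duplicated
-- row, including the one we want to keep, because nothing in that key tells the
-- copies apart. Instead, rebuild the table from the rows to keep.
CREATE OR REPLACE TEMPORARY TABLE order_items_to_keep AS
WITH ranked_order_items AS (
    SELECT 
        *,
        -- The ORDER BY columns are the partition keys, so which copy
        -- gets row_num = 1 is arbitrary.
        ROW_NUMBER() OVER (
            PARTITION BY order_id, product_id
            ORDER BY order_id ASC, product_id ASC
        ) AS row_num
    FROM order_items
)
SELECT * EXCLUDE (row_num)
FROM ranked_order_items
WHERE row_num = 1;

-- Swap the contents inside one transaction so a failure cannot leave the table empty
BEGIN;
DELETE FROM order_items;
INSERT INTO order_items 
SELECT * FROM order_items_to_keep;
COMMIT;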

-- Verify: Check remaining order items
SELECT * FROM order_items ORDER BY order_id, product_id;


---

## Part 4: Advanced CTE Examples

### Example 1: Complex Analysis with Multiple CTEs

**Problem:** Find customers who have ordered products from multiple categories, showing their category diversity.


In [None]:
-- Complex query using multiple CTEs
WITH customer_orders AS (
    SELECT DISTINCT
        o.customer_id,
        o.order_id,
        p.category
    FROM orders o
    INNER JOIN order_items oi ON o.order_id = oi.order_id
    INNER JOIN products p ON oi.product_id = p.product_id
),
customer_categories AS (
    SELECT 
        customer_id,
        COUNT(DISTINCT category) AS category_count,
        LISTAGG(DISTINCT category, ', ') WITHIN GROUP (ORDER BY category) AS categories
    FROM customer_orders
    GROUP BY customer_id
)
SELECT 
    c.customer_id,
    c.first_name || ' ' || c.last_name AS customer_name,
    cc.category_count,
    cc.categories
FROM customers c
INNER JOIN customer_categories cc ON c.customer_id = cc.customer_id
WHERE cc.category_count > 1
ORDER BY cc.category_count DESC, c.customer_id;


### Example 2: Finding and Summarizing Duplicates

**Problem:** Create a summary report of all duplicates in the system.


In [None]:
-- Comprehensive duplicate summary using CTEs
WITH customer_duplicates AS (
    SELECT 
        'customers' AS table_name,
        email AS duplicate_key,
        COUNT(*) AS duplicate_count
    FROM customers_with_duplicates
    GROUP BY email
    HAVING COUNT(*) > 1
),
order_duplicates AS (
    SELECT 
        'orders' AS table_name,
        customer_id || '-' || order_date || '-' || total_amount AS duplicate_key,
        COUNT(*) AS duplicate_count
    FROM orders
    GROUP BY customer_id, order_date, total_amount
    HAVING COUNT(*) > 1
),
order_item_duplicates AS (
    SELECT 
        'order_items' AS table_name,
        order_id || '-' || product_id AS duplicate_key,
        COUNT(*) AS duplicate_count
    FROM order_items
    GROUP BY order_id, product_id
    HAVING COUNT(*) > 1
)
SELECT * FROM customer_duplicates
UNION ALL
SELECT * FROM order_duplicates
UNION ALL
SELECT * FROM order_item_duplicates
ORDER BY table_name, duplicate_count DESC;
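
### Example 3: Recursive CTE (Brief Introduction)

The introduction lists recursive CTEs for hierarchies and trees; here is a brief sketch of the idea. A recursive CTE has an anchor query, then `UNION ALL`, then a query that refers back to the CTE itself. The `employees` rows and the `org_chart` name are made-up inline data (assumptions for illustration), not part of the course tables:

```sql
-- Hypothetical employee hierarchy; manager_id points at another employee_id.
WITH RECURSIVE employees AS (
    SELECT *
    FROM (VALUES
        (2, 'Ben',  1),
        (3, 'Cara', 1),
        (4, 'Dan',  2),
        (1, 'Ava',  NULL)
    ) AS t (employee_id, employee_name, manager_id)
),
org_chart AS (
    -- Anchor: start at the top of the hierarchy (no manager).
    SELECT employee_id, employee_name, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive step: add the direct reports of rows found so far.
    SELECT e.employee_id, e.employee_name, oc.depth + 1
    FROM employees e
    INNER JOIN org_chart oc ON e.manager_id = oc.employee_id
)
SELECT employee_id, employee_name, depth
FROM org_chart
ORDER BY depth, employee_id;
-- Expected: Ava (1), Ben (2), Cara (2), Dan (3)
```

The recursion stops on its own once a step finds no new rows, which here happens after Dan.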


---

## Part 5: Best Practices and Common Patterns

### Best Practices for Using CTEs

1. **Always Test DELETE Queries First**
   - Use SELECT to preview what will be deleted
   - Verify the logic before executing DELETE

2. **Use Meaningful CTE Names**
   - `customer_summary` is better than `cte1`
   - Makes queries self-documenting

3. **Keep CTEs Focused**
   - Each CTE should have a single, clear purpose
   - Don't make CTEs too complex

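4. **Order Matters**
   - A CTE must be defined before it is referenced
   - Later CTEs can reference earlier CTEs

5. **Performance Considerations**
   - A CTE is not guaranteed to be materialized; depending on the optimizer, a CTE referenced several times may be computed once or re-evaluated for each reference
   - For very large datasets, consider materialized views or temp tables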
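
Practice 1 can be taken one step further by running the `DELETE` inside a transaction and rolling it back, which shows the real effect without keeping it. This is a sketch on a hypothetical scratch table (`delete_preview_demo` is an assumed name), and it assumes all statements run in the same session:

```sql
-- Hypothetical scratch table for the demonstration.
CREATE OR REPLACE TEMPORARY TABLE delete_preview_demo (
    order_id INTEGER,
    status   VARCHAR
);

INSERT INTO delete_preview_demo (order_id, status) VALUES
    (1001, 'delivered'),
    (1002, 'cancelled'),
    (1003, 'cancelled');

BEGIN;

DELETE FROM delete_preview_demo
WHERE status = 'cancelled';

-- Inside the transaction the delete is visible: 1 row remains.
SELECT COUNT(*) AS rows_remaining FROM delete_preview_demo;

-- Undo the delete; use COMMIT instead once the result looks right.
ROLLBACK;

-- After the rollback all 3 rows are back.
SELECT COUNT(*) AS rows_after_rollback FROM delete_preview_demo;
```

A plain `SELECT` preview is still the first step; the transaction adds a second safety net for the statement that actually modifies data.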

### Common CTE Patterns

#### Pattern 1: Finding Duplicates
```sql
WITH duplicates AS (
    SELECT column_name, COUNT(*) AS cnt
    FROM table_name
    GROUP BY column_name
    HAVING COUNT(*) > 1
)
SELECT * FROM table_name
WHERE column_name IN (SELECT column_name FROM duplicates);
```

#### Pattern 2: Ranking and Filtering
```sql
WITH ranked AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY key_column ORDER BY id) AS rn
    FROM table_name
)
SELECT * FROM ranked WHERE rn = 1;
```

#### Pattern 3: Multi-Step Calculations
```sql
WITH step1 AS (SELECT ...),
     step2 AS (SELECT ... FROM step1),
     step3 AS (SELECT ... FROM step2)
SELECT * FROM step3;
```

---

## Summary: Key Takeaways

✅ **CTEs make complex queries readable** - Break down complex logic into named parts  
✅ **CTEs enable reusability** - Reference the same subquery multiple times  
✅ **CTEs are perfect for duplicates** - Essential tool for finding and deleting duplicates  
✅ **Always test DELETE queries** - Use SELECT first to preview what will be deleted  
✅ **Use ROW_NUMBER() for duplicate handling** - Assign ranks to identify which records to keep  
✅ **Multiple CTEs can chain together** - Each CTE can reference previous CTEs  

**When to use CTEs:**
- 📊 Complex queries with multiple steps
- 🔍 Finding and removing duplicates
- 📈 Multi-level aggregations
- 🔄 Queries that need to reference the same subquery multiple times
- 🧹 Data cleaning and transformation

**CTE vs Subquery:**
- **CTE:** More readable, can be referenced multiple times, easier to debug
- **Subquery:** More compact for simple cases, but harder to read when nested

---

## Practice Exercises

Try these exercises:

1. **Find duplicate products** by name and category
2. **Delete duplicate order items**, keeping the one with the highest quantity
3. **Create a customer report** showing total orders, total spent, and average order value using CTEs
4. **Find customers** who have duplicate orders on the same day
5. **Create a summary** of all duplicate records across all tables

---
