# SQL Joins and Union: Combining Data from Multiple Tables

## Introduction

In real-world databases, data is spread across multiple tables. **Joins** and **Union** are essential SQL operations that allow you to combine data from different tables to answer complex business questions.

**Database:** This course uses the **Snowflake** database. All examples are written for Snowflake, with additional notes for SQL Server where relevant.

**What you'll learn:**
- Understanding relationships between tables
- INNER JOIN - Finding matching records
- LEFT JOIN - Keeping all records from the left table
- RIGHT JOIN - Keeping all records from the right table
- FULL OUTER JOIN - Keeping all records from both tables
- UNION - Combining result sets vertically
- Practical problem-solving with joins
- Production-grade examples for real-world scenarios

**Prerequisites:**
- Understanding of SELECT statements
- Knowledge of Primary Keys and Foreign Keys
- Basic SQL syntax

---

## Why Joins Matter

**Real-world scenario:**
- Customer information is in the `customers` table
- Order information is in the `orders` table
- Product information is in the `products` table

**Question:** "Which customers bought which products?"

**Answer:** You need to JOIN these tables together!

**Without joins:** You'd need multiple queries and manual data combination (inefficient and error-prone)  
**With joins:** One query gives you all the information you need (efficient and accurate)

---

## Dataset Setup

Before we dive into joins, let's create some sample tables with realistic data. This will help you understand and practice joins effectively.

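**Note:** In Snowflake, you can use `CREATE OR REPLACE TABLE` to recreate tables easily during learning. Keep in mind that Snowflake enforces only `NOT NULL` on standard tables: `PRIMARY KEY`, `UNIQUE`, and `FOREIGN KEY` constraints are stored as metadata but not enforced, so they document the relationships rather than guard the data.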
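
The `CREATE TABLE` cells in Step 1 raise an error on a second run because the tables already exist. One way to reset, sketched here on the assumption that you own all four tables and are happy to lose their contents, is to drop them before re-running the setup:

```sql
-- Reset script (a sketch): drop child tables first so the order also works
-- on engines that enforce foreign keys, such as SQL Server.
DROP TABLE IF EXISTS order_items;
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS products;
DROP TABLE IF EXISTS customers;
```

Changing each `CREATE TABLE` to `CREATE OR REPLACE TABLE` has the same effect in Snowflake without a separate drop step.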


## Snowflake-Specific Notes

**Database Compatibility:**
- ✅ **Snowflake** - All examples in this notebook are tested for Snowflake
- ✅ **SQL Server** - Most examples work with minor syntax adjustments (noted below)
- ⚠️ **MySQL** - FULL OUTER JOIN not supported (use UNION workaround)

**Key Snowflake Syntax:**
- **String Concatenation:** Use `||` operator or `CONCAT()` function
  - `'Hello' || ' ' || 'World'` or `CONCAT('Hello', ' ', 'World')`
  - SQL Server: Use `+` or `CONCAT()`
- **Date Functions:**
  - `CURRENT_DATE()` or `CURRENT_DATE` - Current date
  - `DATEADD(day, -30, CURRENT_DATE())` - Add/subtract days
  - `DATEDIFF(day, date1, date2)` - Difference in days
  - SQL Server: `DATEADD` and `DATEDIFF` use the same syntax; for the current date use `CAST(GETDATE() AS DATE)`
- **FULL OUTER JOIN:** Fully supported in Snowflake
- **Table Creation:** Use `CREATE OR REPLACE TABLE` for easy recreation during learning
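
The syntax notes above can be tried in a single statement. This is a minimal sketch that assumes a Snowflake session; the two literal dates are illustrative values borrowed from the sample orders:

```sql
SELECT
    'Hello' || ' ' || 'World'             AS concat_operator,   -- Hello World
    CONCAT('Hello', ' ', 'World')         AS concat_function,   -- Hello World
    CURRENT_DATE()                        AS today,
    DATEADD(day, -30, CURRENT_DATE())     AS thirty_days_ago,
    DATEDIFF(day, '2024-01-15'::DATE, '2024-02-01'::DATE) AS days_between;  -- 17
```

`DATEDIFF` subtracts the first date from the second, so swapping the arguments returns -17.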

**Performance Tips for Snowflake:**
- Use appropriate join types to minimize data scanned
- Consider clustering keys for large tables
- Use WHERE clauses to filter early in the query


## Step 1: Creating the Database Schema

Let's create tables for an e-commerce scenario:
- **customers**: Customer information
- **orders**: Order information
- **products**: Product catalog
- **order_items**: Items in each order (junction table)


In [None]:
-- Create customers table
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    email VARCHAR(100) UNIQUE NOT NULL,
    city VARCHAR(50),
    country VARCHAR(50)
);


In [None]:
-- Create products table
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL,
    category VARCHAR(50),
    price DECIMAL(10, 2) NOT NULL,  -- must be > 0; Snowflake has no CHECK constraints (SQL Server: add CHECK (price > 0))
    stock_quantity INT DEFAULT 0
);


In [None]:
-- Create orders table
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT NOT NULL,
    order_date DATE NOT NULL,
    total_amount DECIMAL(10, 2),
    status VARCHAR(20) DEFAULT 'pending',
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);


In [None]:
-- Create order_items table (junction table)
CREATE TABLE order_items (
    order_id INT NOT NULL,
    product_id INT NOT NULL,
    quantity INT NOT NULL,  -- must be > 0; Snowflake has no CHECK constraints (SQL Server: add CHECK (quantity > 0))
    unit_price DECIMAL(10, 2) NOT NULL,
    PRIMARY KEY (order_id, product_id),
    FOREIGN KEY (order_id) REFERENCES orders(order_id) ON DELETE CASCADE,
    FOREIGN KEY (product_id) REFERENCES products(product_id)
);


## Step 2: Inserting Sample Data

Now let's populate these tables with realistic sample data.


In [None]:
-- Insert customers
INSERT INTO customers (customer_id, first_name, last_name, email, city, country) VALUES
(1, 'John', 'Doe', 'john.doe@email.com', 'New York', 'USA'),
(2, 'Jane', 'Smith', 'jane.smith@email.com', 'London', 'UK'),
(3, 'Bob', 'Johnson', 'bob.johnson@email.com', 'Toronto', 'Canada'),
(4, 'Alice', 'Williams', 'alice.williams@email.com', 'Sydney', 'Australia'),
(5, 'Charlie', 'Brown', 'charlie.brown@email.com', 'Berlin', 'Germany'),
(6, 'Diana', 'Davis', 'diana.davis@email.com', 'Paris', 'France');


In [None]:
-- Insert products
INSERT INTO products (product_id, product_name, category, price, stock_quantity) VALUES
(101, 'Laptop', 'Electronics', 999.99, 50),
(102, 'Mouse', 'Electronics', 29.99, 200),
(103, 'Keyboard', 'Electronics', 79.99, 150),
(104, 'Monitor', 'Electronics', 299.99, 75),
(105, 'Desk Chair', 'Furniture', 199.99, 30),
(106, 'Desk', 'Furniture', 349.99, 20),
(107, 'Headphones', 'Electronics', 149.99, 100),
(108, 'Webcam', 'Electronics', 89.99, 80);


In [None]:
-- Insert orders
INSERT INTO orders (order_id, customer_id, order_date, total_amount, status) VALUES
(1001, 1, '2024-01-15', 1029.98, 'delivered'),
(1002, 2, '2024-01-16', 379.98, 'shipped'),
(1003, 1, '2024-01-20', 149.99, 'pending'),
(1004, 3, '2024-01-22', 1299.97, 'delivered'),
(1005, 4, '2024-01-25', 89.99, 'shipped'),
(1006, 2, '2024-01-28', 199.99, 'pending'),
(1007, 5, '2024-02-01', 429.98, 'delivered');


In [None]:
-- Insert order_items
INSERT INTO order_items (order_id, product_id, quantity, unit_price) VALUES
(1001, 101, 1, 999.99),  -- John bought Laptop
(1001, 102, 1, 29.99),   -- John bought Mouse
(1002, 104, 1, 299.99),  -- Jane bought Monitor
(1002, 107, 1, 149.99),  -- Jane bought Headphones (order 1002's items sum to 479.97, not the stored total of 379.98)
(1002, 102, 1, 29.99),   -- Jane bought Mouse
(1003, 107, 1, 149.99),  -- John bought Headphones
(1004, 101, 1, 999.99),  -- Bob bought Laptop
(1004, 104, 1, 299.99),  -- Bob bought Monitor
(1005, 108, 1, 89.99),   -- Alice bought Webcam
(1006, 105, 1, 199.99),  -- Jane bought Desk Chair
(1007, 103, 1, 79.99),   -- Charlie bought Keyboard
(1007, 102, 2, 29.99),   -- Charlie bought 2 Mice
(1007, 108, 1, 89.99);   -- Charlie bought Webcam


## Visual Example: Understanding Joins with Sample Data

Let's use a simplified example to visualize how different joins work. This will make the concept crystal clear!

### Sample Data for Visualization

**Table A: customers (simplified)**
| customer_id | customer_name |
|-------------|---------------|
| 1 | John Doe |
| 2 | Jane Smith |
| 3 | Diana Davis |

**Table B: orders (simplified)**
| order_id | customer_id | order_date |
|----------|-------------|------------|
| 101 | 1 | 2024-01-15 |
| 102 | 2 | 2024-01-16 |
| 103 | 5 | 2024-01-20 |

**Key Observations:**
- Customer 1 (John) has Order 101 ✅
- Customer 2 (Jane) has Order 102 ✅
- Customer 3 (Diana) has NO orders ❌
- Order 103 belongs to customer_id 5, but customer 5 doesn't exist in our customers table ❌

### Join Results Visualization

**Query:** `SELECT c.customer_id, c.customer_name, o.order_id, o.order_date FROM customers c [JOIN_TYPE] orders o ON c.customer_id = o.customer_id;`

#### 1. INNER JOIN Result
| customer_id | customer_name | order_id | order_date |
|-------------|---------------|----------|------------|
| 1 | John Doe | 101 | 2024-01-15 |
| 2 | Jane Smith | 102 | 2024-01-16 |

**What happened:**
- ✅ Only matching rows (customers 1 and 2 with their orders)
- ❌ Diana (customer 3) excluded - no orders
- ❌ Order 103 excluded - no matching customer

#### 2. LEFT JOIN Result
| customer_id | customer_name | order_id | order_date |
|-------------|---------------|----------|------------|
| 1 | John Doe | 101 | 2024-01-15 |
| 2 | Jane Smith | 102 | 2024-01-16 |
| 3 | Diana Davis | **NULL** | **NULL** |

**What happened:**
- ✅ ALL customers included (left table)
- ✅ Orders added where they exist
- ✅ Diana included with NULL values (no orders)
- ❌ Order 103 still excluded (not in left table)

#### 3. RIGHT JOIN Result
| customer_id | customer_name | order_id | order_date |
|-------------|---------------|----------|------------|
| 1 | John Doe | 101 | 2024-01-15 |
| 2 | Jane Smith | 102 | 2024-01-16 |
| **NULL** | **NULL** | 103 | 2024-01-20 |

**What happened:**
- ✅ ALL orders included (right table)
- ✅ Customer info added where it exists
- ✅ Order 103 included with NULL customer values
- ❌ Diana excluded (not in right table)

#### 4. FULL OUTER JOIN Result
| customer_id | customer_name | order_id | order_date |
|-------------|---------------|----------|------------|
| 1 | John Doe | 101 | 2024-01-15 |
| 2 | Jane Smith | 102 | 2024-01-16 |
| 3 | Diana Davis | **NULL** | **NULL** |
| **NULL** | **NULL** | 103 | 2024-01-20 |

**What happened:**
- ✅ ALL customers included
- ✅ ALL orders included
- ✅ Matches connected where possible
- ✅ NULL values where no match exists

**Key Takeaway:** 
- **INNER JOIN** = Only matches (intersection)
- **LEFT JOIN** = All left + matches from right
- **RIGHT JOIN** = All right + matches from left  
- **FULL OUTER JOIN** = Everything from both tables (union)
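
To reproduce the four result tables without touching the course tables, the simplified data can be built inline. This is a sketch: the names `demo_customers`, `demo_orders`, and `returned_by` are made up for this illustration, and `order_date` is kept as plain text for brevity.

```sql
WITH demo_customers AS (
    SELECT * FROM (VALUES
        (1, 'John Doe'),
        (2, 'Jane Smith'),
        (3, 'Diana Davis')
    ) AS t (customer_id, customer_name)
),
demo_orders AS (
    SELECT * FROM (VALUES
        (101, 1, '2024-01-15'),
        (102, 2, '2024-01-16'),
        (103, 5, '2024-01-20')
    ) AS t (order_id, customer_id, order_date)
)
SELECT
    c.customer_id,
    c.customer_name,
    o.order_id,
    o.order_date,
    -- Label each row with the join types that would return it
    CASE
        WHEN c.customer_id IS NOT NULL AND o.order_id IS NOT NULL THEN 'INNER, LEFT, RIGHT, FULL'
        WHEN o.order_id IS NULL THEN 'LEFT, FULL'
        ELSE 'RIGHT, FULL'
    END AS returned_by
FROM demo_customers c
FULL OUTER JOIN demo_orders o
    ON c.customer_id = o.customer_id
ORDER BY COALESCE(c.customer_id, o.customer_id);
```

Replacing `FULL OUTER JOIN` with `INNER JOIN`, `LEFT JOIN`, or `RIGHT JOIN` in the final query should return exactly the rows whose label names that join type.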

---


## Step 3: Verify the Data

Let's quickly check our data to understand what we're working with:


In [None]:
-- View all customers
SELECT * FROM customers;


In [None]:
-- View all products
SELECT * FROM products;


In [None]:
-- View all orders
SELECT * FROM orders;


In [None]:
-- View all order items
SELECT * FROM order_items;


---

## Step 4: Adding Duplicate Data for CTE Exercises

**Note:** The following duplicate data is added to demonstrate Common Table Expressions (CTEs) for finding and deleting duplicates. This data simulates real-world data quality issues that you'll learn to handle in the CTE notebook.

**Important:** These duplicates are intentionally added for learning purposes. In production, you would want to prevent duplicates through proper constraints and data validation.


### Adding Duplicate Orders

Let's add some duplicate orders to demonstrate duplicate detection and removal techniques.


In [None]:
-- Add duplicate orders for CTE exercises
-- These simulate accidental duplicate inserts
INSERT INTO orders (order_id, customer_id, order_date, total_amount, status) VALUES
(1008, 1, '2024-01-15', 1029.98, 'delivered'),  -- Duplicate of order 1001
(1009, 2, '2024-01-16', 379.98, 'shipped'),     -- Duplicate of order 1002
(1010, 1, '2024-01-20', 149.99, 'pending'),    -- Duplicate of order 1003
(1011, 3, '2024-01-22', 1299.97, 'delivered'),  -- Duplicate of order 1004
(1012, 4, '2024-01-25', 89.99, 'shipped');      -- Duplicate of order 1005


### Adding Duplicate Order Items

Let's add some duplicate order items to demonstrate duplicate detection in junction tables.


In [None]:
-- Add duplicate order items for CTE exercises
-- Note: Snowflake does not enforce PRIMARY KEY constraints on standard tables, so these inserts
-- would succeed even against order_items; on SQL Server they would fail with a key violation.
-- A separate test table keeps the duplicates out of the main data either way.
-- CREATE TABLE ... AS SELECT copies the rows but not the primary key constraint.

-- Create a test table for duplicate order items demonstration
CREATE OR REPLACE TABLE order_items_test AS
SELECT * FROM order_items;

-- Add duplicate order items
INSERT INTO order_items_test (order_id, product_id, quantity, unit_price) VALUES
(1001, 101, 1, 999.99),  -- Duplicate: order 1001 already has product 101
(1002, 104, 1, 299.99),  -- Duplicate: order 1002 already has product 104
(1003, 107, 1, 149.99),  -- Duplicate: order 1003 already has product 107
(1004, 101, 1, 999.99),  -- Duplicate: order 1004 already has product 101
(1007, 102, 2, 29.99);   -- Duplicate: order 1007 already has product 102


### Verify Duplicate Data

Let's check the duplicate data we've added:


In [None]:
-- Check for duplicate orders
SELECT 
    customer_id,
    order_date,
    total_amount,
    COUNT(*) AS duplicate_count
FROM orders
GROUP BY customer_id, order_date, total_amount
HAVING COUNT(*) > 1
ORDER BY customer_id, order_date;

-- Check for duplicate order items in test table
SELECT 
    order_id,
    product_id,
    COUNT(*) AS duplicate_count
FROM order_items_test
GROUP BY order_id, product_id
HAVING COUNT(*) > 1
ORDER BY order_id, product_id;


**Note:** In the CTE notebook (2h), you'll learn how to:
- Use CTEs to identify these duplicates
- Delete duplicates while keeping the correct records
- Handle various duplicate scenarios in production environments

---



## Understanding Table Relationships

Before we learn joins, let's understand how our tables are related:

```
customers (1) ──────── (many) orders
                           │
                           │ (1)
                           │
                           │ (many)
                           │
                    order_items
                           │
                           │ (many)
                           │
                           │ (1)
                    products
```

**Relationships:**
- One customer can have many orders (1-to-many)
- One order can have many order_items (1-to-many)
- One product can appear in many order_items (1-to-many)
- `order_items` is a **junction table** connecting orders and products

**Key Points:**
- `orders.customer_id` → references `customers.customer_id`
- `order_items.order_id` → references `orders.order_id`
- `order_items.product_id` → references `products.product_id`

---

## Types of Joins

There are several types of joins, each serving a different purpose:

1. **INNER JOIN** - Returns only matching records from both tables
2. **LEFT JOIN** (LEFT OUTER JOIN) - Returns all records from the left table + matching records from the right table
3. **RIGHT JOIN** (RIGHT OUTER JOIN) - Returns all records from the right table + matching records from the left table
4. **FULL OUTER JOIN** - Returns all records from both tables
5. **CROSS JOIN** - Returns Cartesian product (all combinations)

Let's explore each one with examples!
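
CROSS JOIN is the one type in this list that takes no `ON` condition, so a short sketch is enough to show its behavior. Assuming the sample tables from Step 2:

```sql
-- Every customer paired with every product: 6 customers x 8 products = 48 rows
SELECT
    c.customer_id,
    c.first_name,
    p.product_id,
    p.product_name
FROM customers c
CROSS JOIN products p
ORDER BY c.customer_id, p.product_id;
```

Row counts multiply, so cross joins on large tables grow quickly. They are mostly used deliberately, for example to build a grid of all possible combinations.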


## 1. INNER JOIN: Finding Matching Records

### What is INNER JOIN?

**INNER JOIN** returns only the rows where there is a match in both tables. If a row in one table doesn't have a matching row in the other table, it won't appear in the result.

**Think of it as:** "Show me only the records that exist in BOTH tables."

### Syntax

```sql
SELECT columns
FROM table1
INNER JOIN table2
ON table1.column = table2.column;
```

**Note:** `INNER JOIN` and `JOIN` are the same thing. `JOIN` is shorthand for `INNER JOIN`.

### Visual Representation

```
Table A          Table B          INNER JOIN Result
--------         --------         -----------------
A1  B1           B1  C1           A1  B1  C1
A2  B2           B2  C2           A2  B2  C2
A3  B3           B3  C3           A3  B3  C3
A4  B4           B4  C4           A4  B4  C4
                 B5  C5           (B5 has no match in A, so excluded)
```

### Understanding INNER JOIN with Sample Data

Let's see what happens when we join our sample tables. Here's a simplified example:

**Sample Data:**

**customers table:**
| customer_id | first_name | last_name |
|-------------|------------|-----------|
| 1 | John | Doe |
| 2 | Jane | Smith |
| 6 | Diana | Davis |

**orders table:**
| order_id | customer_id | order_date | total_amount |
|----------|-------------|------------|--------------|
| 1001 | 1 | 2024-01-15 | 1029.98 |
| 1002 | 2 | 2024-01-16 | 379.98 |
| 1007 | 5 | 2024-02-01 | 429.98 |

**INNER JOIN Result:**
```sql
SELECT c.customer_id, c.first_name, c.last_name, o.order_id, o.order_date
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id;
```

**Output:**
| customer_id | first_name | last_name | order_id | order_date |
|------------|------------|-----------|----------|------------|
| 1 | John | Doe | 1001 | 2024-01-15 |
| 2 | Jane | Smith | 1002 | 2024-01-16 |

**What happened?**
- ✅ Customer 1 (John) matched with Order 1001 → **INCLUDED**
- ✅ Customer 2 (Jane) matched with Order 1002 → **INCLUDED**
- ❌ Customer 6 (Diana) has no orders → **EXCLUDED**
- ❌ Order 1007 (customer_id=5) has no matching customer in our sample → **EXCLUDED**

**Key Takeaway:** Only rows with matches in BOTH tables appear in the result!

### Example 1: Customers and Their Orders

**Question:** Which customers have placed orders?

```sql
SELECT 
    c.customer_id,
    c.first_name,
    c.last_name,
    o.order_id,
    o.order_date,
    o.total_amount
FROM customers c
INNER JOIN orders o
ON c.customer_id = o.customer_id;
```

**What this does:**
- Joins the `customers` and `orders` tables
- Matches rows on `customer_id`
- Returns one row per order, so a customer with several orders appears once for each order
- Excludes customers without orders


In [None]:
-- Example 1: Customers and their orders
SELECT 
    c.customer_id,
    c.first_name,
    c.last_name,
    o.order_id,
    o.order_date,
    o.total_amount
FROM customers c
INNER JOIN orders o
ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_date;


### Example 2: Orders with Product Details

**Question:** What products are in each order?

```sql
SELECT 
    o.order_id,
    o.order_date,
    p.product_name,
    oi.quantity,
    oi.unit_price,
    (oi.quantity * oi.unit_price) AS line_total
FROM orders o
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id
ORDER BY o.order_id;
```

**What this does:**
- Joins three tables: `orders`, `order_items`, and `products`
- First joins orders with order_items
- Then joins the result with products
- Calculates line total for each item


In [None]:
-- Example 2: Orders with product details
SELECT 
    o.order_id,
    o.order_date,
    p.product_name,
    oi.quantity,
    oi.unit_price,
    (oi.quantity * oi.unit_price) AS line_total
FROM orders o
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id
ORDER BY o.order_id;
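
The `line_total` expression from Example 2 can also be summed per order and compared with the stored `orders.total_amount`. The following is a sketch of such a reconciliation check, not part of the original exercises; the column aliases are chosen here for illustration.

```sql
SELECT
    o.order_id,
    o.total_amount                                    AS stored_total,
    SUM(oi.quantity * oi.unit_price)                  AS computed_total,
    SUM(oi.quantity * oi.unit_price) - o.total_amount AS difference
FROM orders o
INNER JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id, o.total_amount
-- Keep only orders whose stored total disagrees with their line items
HAVING SUM(oi.quantity * oi.unit_price) <> o.total_amount
ORDER BY o.order_id;
```

With the sample data exactly as inserted in Step 2, this should flag orders 1002, 1004, and 1007. Treat `total_amount` in this dataset as illustrative rather than as a value derived from `order_items`.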


### Example 3: Complete Order Information

**Question:** Show complete order information including customer name and product details.

```sql
SELECT 
    c.first_name || ' ' || c.last_name AS customer_name,
    o.order_id,
    o.order_date,
    p.product_name,
    oi.quantity,
    oi.unit_price,
    o.total_amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id
ORDER BY o.order_id;
```

**Note:** `||` is the standard SQL concatenation operator and works in Snowflake, PostgreSQL, and Oracle. In MySQL, use the `CONCAT()` function instead; in SQL Server, use `+` or `CONCAT()`.


In [None]:
-- Example 3: Complete order information
SELECT 
    c.first_name || ' ' || c.last_name AS customer_name,
    o.order_id,
    o.order_date,
    p.product_name,
    oi.quantity,
    oi.unit_price,
    o.total_amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id
ORDER BY o.order_id;


### Key Points about INNER JOIN

✅ **Returns only matching rows** from both tables  
✅ **Most common type of join** in real-world queries  
✅ **Excludes rows** that don't have matches  
✅ **Use when:** You only want records that exist in both tables  

**When to use INNER JOIN:**
- Finding customers who have orders
- Finding products that have been ordered
- Getting complete information that requires multiple tables
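
Because an INNER JOIN returns one row per matching pair, the customers-and-orders query in Example 1 lists a customer once for every order they placed. If the question is only which customers have ordered, one option is to select customer columns only and remove the repeats. A sketch against the sample tables:

```sql
SELECT DISTINCT
    c.customer_id,
    c.first_name,
    c.last_name
FROM customers c
INNER JOIN orders o
    ON c.customer_id = o.customer_id
ORDER BY c.customer_id;
```

With the sample data this should list customers 1 through 5 once each. Diana (customer 6) has no orders and is excluded.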


## 2. LEFT JOIN: Keeping All Left Table Records

### What is LEFT JOIN?

**LEFT JOIN** (or **LEFT OUTER JOIN**) returns all rows from the left table, plus matching rows from the right table. If there's no match in the right table, the result will have NULL values for right table columns.

**Think of it as:** "Show me ALL records from the left table, and add matching information from the right table if it exists."

### Visual Representation

```
Table A          Table B          LEFT JOIN Result
--------         --------         -----------------
A1  B1           B1  C1           A1  B1  C1
A2  B2           B2  C2           A2  B2  C2
A3  B3           B3  C3           A3  B3  C3
A4  B4           B4  C4           A4  B4  C4
A5  B5           B6  C6           A5  B5  NULL (B5 has no match in B)
                                  (B6 has no match in A, so excluded)
```

### Understanding LEFT JOIN with Sample Data

**Sample Data:**

**customers table (LEFT table):**
| customer_id | first_name | last_name |
|-------------|------------|-----------|
| 1 | John | Doe |
| 2 | Jane | Smith |
| 6 | Diana | Davis |

**orders table (RIGHT table):**
| order_id | customer_id | order_date | total_amount |
|----------|-------------|------------|--------------|
| 1001 | 1 | 2024-01-15 | 1029.98 |
| 1002 | 2 | 2024-01-16 | 379.98 |

**LEFT JOIN Result:**
```sql
SELECT c.customer_id, c.first_name, c.last_name, o.order_id, o.order_date
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id;
```

**Output:**
| customer_id | first_name | last_name | order_id | order_date |
|------------|------------|-----------|----------|------------|
| 1 | John | Doe | 1001 | 2024-01-15 |
| 2 | Jane | Smith | 1002 | 2024-01-16 |
| 6 | Diana | Davis | **NULL** | **NULL** |

**What happened?**
- ✅ Customer 1 (John) matched with Order 1001 → **INCLUDED with order data**
- ✅ Customer 2 (Jane) matched with Order 1002 → **INCLUDED with order data**
- ✅ Customer 6 (Diana) has no orders → **INCLUDED with NULL values**

**Key Takeaway:** ALL rows from the LEFT table appear, even if there's no match in the RIGHT table!

### Example 1: All Customers and Their Orders (Including Customers Without Orders)

**Question:** Show all customers, and their orders if they have any.

```sql
SELECT 
    c.customer_id,
    c.first_name,
    c.last_name,
    o.order_id,
    o.order_date,
    o.total_amount
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id;
```

**What this does:**
- Returns ALL customers
- Adds order information if the customer has orders
- Customers without orders will have NULL values for order columns


In [None]:
-- Example 1: All customers and their orders (including customers without orders)
SELECT 
    c.customer_id,
    c.first_name,
    c.last_name,
    o.order_id,
    o.order_date,
    o.total_amount
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_date;


### Example 2: Finding Customers Without Orders

**Question:** Which customers have never placed an order?

```sql
SELECT 
    c.customer_id,
    c.first_name,
    c.last_name,
    c.email
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;
```

**What this does:**
- Uses LEFT JOIN to get all customers
- Filters for rows where `order_id IS NULL`
- This identifies customers who don't have any orders


In [None]:
-- Example 2: Customers who have never placed an order
SELECT 
    c.customer_id,
    c.first_name,
    c.last_name,
    c.email
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;


### Example 3: Products and Their Sales (Including Unsold Products)

**Question:** Show all products and how many times they've been ordered.

```sql
SELECT 
    p.product_id,
    p.product_name,
    p.category,
    COALESCE(SUM(oi.quantity), 0) AS total_quantity_sold,
    COUNT(o.order_id) AS number_of_orders
FROM products p
LEFT JOIN order_items oi ON p.product_id = oi.product_id
LEFT JOIN orders o ON oi.order_id = o.order_id
GROUP BY p.product_id, p.product_name, p.category
ORDER BY total_quantity_sold DESC;
```

**What this does:**
- LEFT JOIN ensures all products are included
- `COALESCE(SUM(oi.quantity), 0)` returns 0 if no sales (instead of NULL)
- `COUNT(o.order_id)` counts orders (NULL values are not counted)


In [None]:
-- Example 3: Products and their sales (including unsold products)
SELECT 
    p.product_id,
    p.product_name,
    p.category,
    COALESCE(SUM(oi.quantity), 0) AS total_quantity_sold,
    COUNT(o.order_id) AS number_of_orders
FROM products p
LEFT JOIN order_items oi ON p.product_id = oi.product_id
LEFT JOIN orders o ON oi.order_id = o.order_id
GROUP BY p.product_id, p.product_name, p.category
ORDER BY total_quantity_sold DESC;


### Key Points about LEFT JOIN

✅ **Returns all rows** from the left table  
✅ **Adds matching rows** from the right table  
✅ **NULL values** for right table columns when no match exists  
✅ **Use when:** You want all records from the left table, regardless of matches  

**When to use LEFT JOIN:**
- Finding customers without orders
- Finding products that haven't been sold
- Including all records from the "main" table
- Reporting scenarios where you need complete coverage
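
A common pitfall with LEFT JOIN is where a filter on the right table goes. The two queries below are a sketch using the sample tables and the `status` column: the first keeps every customer, while the second quietly behaves like an INNER JOIN.

```sql
-- Filter in ON: all customers are kept; customers with no delivered order
-- appear once with NULL order columns.
SELECT c.customer_id, c.first_name, o.order_id, o.status
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id
   AND o.status = 'delivered'
ORDER BY c.customer_id, o.order_id;

-- Filter in WHERE: rows where o.status is NULL fail the test and are removed,
-- so customers without a delivered order disappear from the result.
SELECT c.customer_id, c.first_name, o.order_id, o.status
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id
WHERE o.status = 'delivered'
ORDER BY c.customer_id, o.order_id;
```

In the sample data, customers 2, 4, and 6 have no delivered orders, so they should appear only in the first result.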


## 3. RIGHT JOIN: Keeping All Right Table Records

### What is RIGHT JOIN?

**RIGHT JOIN** (or **RIGHT OUTER JOIN**) returns all rows from the right table, plus matching rows from the left table. If there's no match in the left table, the result will have NULL values for left table columns.

**Think of it as:** "Show me ALL records from the right table, and add matching information from the left table if it exists."

**Note:** RIGHT JOIN is less commonly used than LEFT JOIN. You can achieve the same result by swapping the tables and using LEFT JOIN. However, it's good to understand it.

### Visual Representation

```
Table A          Table B          RIGHT JOIN Result
--------         --------         -----------------
A1  B1           B1  C1           A1  B1  C1
A2  B2           B2  C2           A2  B2  C2
A3  B3           B3  C3           A3  B3  C3
A4  B4           B4  C4           A4  B4  C4
                 B5  C5           NULL B5  C5 (no match in A)
                 B6  C6           NULL B6  C6 (no match in A)
```

### Example: All Orders and Their Customers (Including Orders Without Valid Customers)

**Question:** Show all orders, and customer information if available.

```sql
SELECT 
    o.order_id,
    o.order_date,
    o.total_amount,
    c.first_name,
    c.last_name,
    c.email
FROM customers c
RIGHT JOIN orders o
ON c.customer_id = o.customer_id;
```

**Note:** This is equivalent to:
```sql
SELECT ... FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id;
```


In [None]:
-- Example: All orders and their customers (RIGHT JOIN)
SELECT 
    o.order_id,
    o.order_date,
    o.total_amount,
    c.first_name,
    c.last_name,
    c.email
FROM customers c
RIGHT JOIN orders o
ON c.customer_id = o.customer_id
ORDER BY o.order_id;


### Key Points about RIGHT JOIN

✅ **Returns all rows** from the right table  
✅ **Adds matching rows** from the left table  
✅ **NULL values** for left table columns when no match exists  
⚠️ **Less commonly used** - LEFT JOIN is preferred (just swap tables)  
✅ **Use when:** You want all records from the right table, regardless of matches  

**Best Practice:** Prefer LEFT JOIN over RIGHT JOIN for better readability. You can always swap the table order.
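
Swapping the tables also gives a readable way to look for orders whose `customer_id` has no matching customer. This check is worth knowing on Snowflake, where foreign keys on standard tables are not enforced. A sketch using the sample tables:

```sql
SELECT
    o.order_id,
    o.customer_id,
    o.order_date
FROM orders o
LEFT JOIN customers c
    ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL   -- keep only orders with no matching customer
ORDER BY o.order_id;
```

With the sample data every order references an existing customer, so the query should return no rows.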


## 4. FULL OUTER JOIN: Keeping All Records from Both Tables

### What is FULL OUTER JOIN?

**FULL OUTER JOIN** returns all rows from both tables. If there's no match, NULL values are filled in for the missing side.

**Think of it as:** "Show me ALL records from BOTH tables, matching them where possible."

### Visual Representation

```
Table A          Table B          FULL OUTER JOIN Result
--------         --------         -----------------
A1  B1           B1  C1           A1  B1  C1
A2  B2           B2  C2           A2  B2  C2
A3  B3           B3  C3           A3  B3  C3
A4  B4           B4  C4           A4  B4  C4
A5  B5           B6  C6           A5  B5  NULL (B5 has no match in B)
                                  NULL B6  C6 (B6 has no match in A)
```

### Example: All Customers and All Orders

**Question:** Show all customers and all orders, matching them where possible.

```sql
SELECT 
    c.customer_id,
    c.first_name,
    c.last_name,
    o.order_id,
    o.order_date,
    o.total_amount
FROM customers c
FULL OUTER JOIN orders o
ON c.customer_id = o.customer_id;
```

**What this does:**
- Returns all customers (even without orders)
- Returns all orders (even without valid customers)
- Matches them where customer_id matches
- NULL values where there's no match

**Note:** Not all databases support FULL OUTER JOIN. MySQL doesn't support it natively. Snowflake, PostgreSQL, SQL Server, and Oracle do.


In [None]:
-- Example: All customers and all orders (FULL OUTER JOIN)
-- Note: This may not work in MySQL. Use UNION of LEFT and RIGHT JOINs instead.
SELECT 
    c.customer_id,
    c.first_name,
    c.last_name,
    o.order_id,
    o.order_date,
    o.total_amount
FROM customers c
FULL OUTER JOIN orders o
ON c.customer_id = o.customer_id
ORDER BY COALESCE(c.customer_id, 0), o.order_date;


### Alternative: FULL OUTER JOIN using UNION (MySQL Compatible)

Since MySQL doesn't support FULL OUTER JOIN, you can simulate it:

```sql
SELECT 
    c.customer_id,
    c.first_name,
    c.last_name,
    o.order_id,
    o.order_date,
    o.total_amount
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id

-- UNION ALL is safe here: the second query returns only orders with no matching customer
UNION ALL

SELECT 
    c.customer_id,
    c.first_name,
    c.last_name,
    o.order_id,
    o.order_date,
    o.total_amount
FROM customers c
RIGHT JOIN orders o ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;
```

**Note:** We'll learn about UNION in the next section!


### Key Points about FULL OUTER JOIN

✅ **Returns all rows** from both tables  
✅ **Matches rows** where possible  
✅ **NULL values** where there's no match  
⚠️ **Not supported in MySQL** - use UNION of LEFT and RIGHT JOINs  
✅ **Use when:** You need complete coverage from both tables  

**When to use FULL OUTER JOIN:**
- Comparing two datasets
- Finding records that exist in one table but not the other
- Complete data reconciliation
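
For reconciliation, the matched rows are usually the uninteresting part. One way to keep only the mismatches is to filter a FULL OUTER JOIN for NULLs on either side. This is a sketch against the sample tables; the `issue` label is invented for illustration.

```sql
SELECT
    COALESCE(c.customer_id, o.customer_id) AS customer_id,
    o.order_id,
    CASE
        WHEN o.order_id IS NULL THEN 'customer without orders'
        ELSE 'order without customer'
    END AS issue
FROM customers c
FULL OUTER JOIN orders o
    ON c.customer_id = o.customer_id
-- Keep only rows that failed to match on one side
WHERE c.customer_id IS NULL
   OR o.order_id IS NULL
ORDER BY customer_id;
```

With the sample data this should return a single row for customer 6 (Diana), who has no orders.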


## 5. UNION: Combining Result Sets Vertically

### What is UNION?

**UNION** combines the results of two or more SELECT statements into a single result set. Unlike joins (which combine columns), UNION combines rows.

**Think of it as:** "Stack one result set on top of another."

### Key Rules for UNION

1. **Same number of columns** in all SELECT statements
2. **Compatible data types** in corresponding columns
3. **Column names** come from the first SELECT statement
4. **Duplicate rows are removed** by default (use UNION ALL to keep duplicates)

### Syntax

```sql
SELECT column1, column2 FROM table1
UNION
SELECT column1, column2 FROM table2;
```

### UNION vs UNION ALL

- **UNION**: Removes duplicate rows
- **UNION ALL**: Keeps all rows, including duplicates (faster)
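
The difference is easiest to see on two tiny lists that share one value. The sketch below builds them inline, so it does not depend on the course tables; `list_a`, `list_b`, and the city values are made up for the demonstration.

```sql
WITH list_a AS (
    SELECT * FROM (VALUES ('London'), ('Paris')) AS t (city)
),
list_b AS (
    SELECT * FROM (VALUES ('Paris'), ('Berlin')) AS t (city)
)
-- UNION removes the repeated 'Paris': 3 rows
SELECT 'UNION' AS operator, COUNT(*) AS row_count
FROM (SELECT city FROM list_a UNION SELECT city FROM list_b) AS u
UNION ALL
-- UNION ALL keeps both copies of 'Paris': 4 rows
SELECT 'UNION ALL' AS operator, COUNT(*) AS row_count
FROM (SELECT city FROM list_a UNION ALL SELECT city FROM list_b) AS ua;
```

When the inputs cannot overlap, or when duplicates are wanted, `UNION ALL` avoids the extra de-duplication work.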

### Example 1: Combining Customer and Product Names

**Question:** Get a list of all customer names and product names together.

```sql
SELECT 
    'Customer' AS type,
    first_name || ' ' || last_name AS name
FROM customers

UNION

SELECT 
    'Product' AS type,
    product_name AS name
FROM products
ORDER BY type, name;
```


In [None]:
-- Example 1: Combining customer names and product names
SELECT 
    'Customer' AS type,
    first_name || ' ' || last_name AS name
FROM customers

UNION

SELECT 
    'Product' AS type,
    product_name AS name
FROM products
ORDER BY type, name;


### Example 2: UNION ALL - Keeping Duplicates

**Question:** Get all order dates and customer creation dates together (keeping duplicates).

```sql
SELECT 
    'Order Date' AS date_type,
    order_date AS date_value
FROM orders

UNION ALL

SELECT 
    'Customer Created' AS date_type,
    CURRENT_DATE AS date_value  -- Placeholder: the customers table has no created_date column
FROM customers
ORDER BY date_value;
```

**Note:** UNION ALL is faster because it doesn't need to check for duplicates.


In [None]:
-- Example 2: UNION ALL - keeping duplicates
-- This cell stacks the order dates on themselves, so every date appears at least twice
SELECT 
    'Order Date' AS date_type,
    order_date AS date_value
FROM orders

UNION ALL

SELECT 
    'Order Date' AS date_type,
    order_date AS date_value
FROM orders
ORDER BY date_value;


### Example 3: Combining Multiple Tables

**Question:** Get a unified list of all IDs (customer IDs, order IDs, and product IDs).

```sql
SELECT 
    'Customer' AS id_type,
    customer_id AS id_value
FROM customers

UNION

SELECT 
    'Order' AS id_type,
    order_id AS id_value
FROM orders

UNION

SELECT 
    'Product' AS id_type,
    product_id AS id_value
FROM products
ORDER BY id_type, id_value;
```


In [None]:
-- Example 3: Combining multiple tables
SELECT 
    'Customer' AS id_type,
    customer_id AS id_value
FROM customers

UNION

SELECT 
    'Order' AS id_type,
    order_id AS id_value
FROM orders

UNION

SELECT 
    'Product' AS id_type,
    product_id AS id_value
FROM products
ORDER BY id_type, id_value;


### Key Points about UNION

✅ **Combines rows** (not columns like joins)  
✅ **Requires same number of columns** in all SELECT statements  
✅ **Removes duplicates** by default (use UNION ALL to keep them)  
✅ **Column names** from first SELECT are used  
✅ **Use when:** You need to combine similar data from different sources  

**When to use UNION:**
- Combining similar data from different tables
- Creating unified reports
- Merging historical and current data
- Creating master lists from multiple sources

**UNION vs JOIN:**
- **JOIN**: Combines columns horizontally (side by side)
- **UNION**: Combines rows vertically (one on top of another)


---

## Production-Grade Examples: Real-World Scenarios

Now that you understand the basics, let's look at some production-grade examples that you'll encounter in real data engineering projects.

### Example 1: Customer Lifetime Value (CLV) Analysis

**Business Question:** Calculate the total lifetime value of each customer, including those who haven't made purchases yet.

**Production Query:**
```sql
SELECT 
    c.customer_id,
    c.first_name || ' ' || c.last_name AS customer_name,
    c.email,
    c.country,
    COUNT(DISTINCT o.order_id) AS total_orders,
    COALESCE(SUM(o.total_amount), 0) AS lifetime_value,
    COALESCE(AVG(o.total_amount), 0) AS average_order_value,
    MIN(o.order_date) AS first_order_date,
    MAX(o.order_date) AS last_order_date,
    CASE 
        WHEN COUNT(o.order_id) = 0 THEN 'New Customer'
        WHEN MAX(o.order_date) < CURRENT_DATE - INTERVAL '90 days' THEN 'Churned'
        WHEN MAX(o.order_date) < CURRENT_DATE - INTERVAL '30 days' THEN 'At Risk'
        ELSE 'Active'
    END AS customer_status
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name, c.email, c.country
ORDER BY lifetime_value DESC;
```

**Key Production Techniques:**
- ✅ LEFT JOIN to include all customers
- ✅ COALESCE to handle NULL values
- ✅ CASE statement for business logic
- ✅ Multiple aggregations in one query
- ✅ Date calculations for customer segmentation
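
One detail behind the `'New Customer'` branch is worth a sketch. After a LEFT JOIN, a customer with no orders still produces one NULL-padded row, so `COUNT(*)` would report 1 while `COUNT(o.order_id)` reports 0. The tables `sample_customers` and `sample_orders` and their rows are illustrative stand-ins, not the course data.

```sql
WITH sample_customers AS (
    SELECT * FROM (VALUES (1, 'John'), (6, 'Diana')) AS t (customer_id, first_name)
),
sample_orders AS (
    SELECT * FROM (VALUES (1001, 1, 1029.98), (1003, 1, 59.99)) AS t (order_id, customer_id, total_amount)
)
SELECT
    c.first_name,
    COUNT(*)                         AS count_star,      -- counts joined rows, including the NULL-padded one
    COUNT(o.order_id)                AS count_order_id,  -- skips NULLs
    SUM(o.total_amount)              AS raw_sum,         -- NULL when there are no orders
    COALESCE(SUM(o.total_amount), 0) AS lifetime_value   -- NULL replaced by 0
FROM sample_customers c
LEFT JOIN sample_orders o ON c.customer_id = o.customer_id
GROUP BY c.first_name
ORDER BY c.first_name;
-- Diana: count_star = 1, count_order_id = 0, raw_sum = NULL,    lifetime_value = 0
-- John:  count_star = 2, count_order_id = 2, raw_sum = 1089.97, lifetime_value = 1089.97
```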

### Example 2: Product Performance Dashboard

**Business Question:** Generate a comprehensive product performance report for inventory and sales teams.

**Production Query:**
```sql
SELECT 
    p.product_id,
    p.product_name,
    p.category,
    p.price AS current_price,
    p.stock_quantity,
    -- Sales Metrics (delivered orders only: o.order_id is NULL when the line's
    -- order did not match the join, so the CASE guards skip those lines)
    COALESCE(SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity END), 0) AS total_units_sold,
    COUNT(DISTINCT o.order_id) AS number_of_orders,
    COALESCE(SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity * oi.unit_price END), 0) AS total_revenue,
    COALESCE(AVG(CASE WHEN o.order_id IS NOT NULL THEN oi.unit_price END), p.price) AS average_selling_price,
    -- Inventory Metrics
    CASE 
        WHEN p.stock_quantity = 0 THEN 'Out of Stock'
        WHEN p.stock_quantity < 10 THEN 'Low Stock'
        WHEN p.stock_quantity < 50 THEN 'Medium Stock'
        ELSE 'In Stock'
    END AS stock_status,
    -- Performance Indicators
    CASE 
        WHEN COALESCE(SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity END), 0) = 0 THEN 'Not Selling'
        WHEN SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity END) < 5 THEN 'Slow Moving'
        WHEN SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity END) < 20 THEN 'Normal'
        ELSE 'Fast Moving'
    END AS sales_velocity
FROM products p
LEFT JOIN order_items oi ON p.product_id = oi.product_id
LEFT JOIN orders o ON oi.order_id = o.order_id AND o.status = 'delivered'
GROUP BY p.product_id, p.product_name, p.category, p.price, p.stock_quantity
ORDER BY total_revenue DESC, total_units_sold DESC;
```

**Key Production Techniques:**
- ✅ Multiple LEFT JOINs for comprehensive data
- ✅ Business logic with CASE statements
- ✅ Filtering in JOIN condition (status = 'delivered')
- ✅ Multiple calculated metrics
- ✅ Categorization for business insights
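
A caveat worth checking when chaining LEFT JOINs: a filter in the second join's `ON` clause only decides whether the `orders` columns are filled in. It does not remove `order_items` rows. The sketch below, on made-up inline rows (`sample_order_items` and `sample_orders` are hypothetical), contrasts a sum over every order line with a sum restricted to lines whose order actually matched.

```sql
WITH sample_order_items AS (
    SELECT * FROM (VALUES (1001, 10, 2), (1002, 10, 5)) AS t (order_id, product_id, quantity)
),
sample_orders AS (
    SELECT * FROM (VALUES (1001, 'delivered'), (1002, 'cancelled')) AS t (order_id, status)
)
SELECT
    oi.product_id,
    -- 7: the cancelled order's line is still summed
    SUM(oi.quantity) AS units_all_lines,
    -- 2: only lines whose order matched the delivered filter
    SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity END) AS units_delivered
FROM sample_order_items oi
LEFT JOIN sample_orders o
    ON oi.order_id = o.order_id
   AND o.status = 'delivered'
GROUP BY oi.product_id;
```

Any metric that should respect the filter has to check a column from the filtered table, or the filtering has to happen before the join.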

### Example 3: Monthly Sales Report with Year-over-Year Comparison

**Business Question:** Create a monthly sales report comparing current year with previous year.

**Production Query:**
```sql
WITH current_year_sales AS (
    SELECT 
        EXTRACT(MONTH FROM o.order_date) AS month,
        EXTRACT(YEAR FROM o.order_date) AS year,
        COUNT(DISTINCT o.order_id) AS order_count,
        COUNT(DISTINCT o.customer_id) AS unique_customers,
        SUM(o.total_amount) AS total_revenue,
        AVG(o.total_amount) AS avg_order_value
    FROM orders o
    WHERE o.status = 'delivered'
        AND EXTRACT(YEAR FROM o.order_date) = EXTRACT(YEAR FROM CURRENT_DATE)
    GROUP BY EXTRACT(MONTH FROM o.order_date), EXTRACT(YEAR FROM o.order_date)
),
previous_year_sales AS (
    SELECT 
        EXTRACT(MONTH FROM o.order_date) AS month,
        EXTRACT(YEAR FROM o.order_date) AS year,
        COUNT(DISTINCT o.order_id) AS order_count,
        COUNT(DISTINCT o.customer_id) AS unique_customers,
        SUM(o.total_amount) AS total_revenue,
        AVG(o.total_amount) AS avg_order_value
    FROM orders o
    WHERE o.status = 'delivered'
        AND EXTRACT(YEAR FROM o.order_date) = EXTRACT(YEAR FROM CURRENT_DATE) - 1
    GROUP BY EXTRACT(MONTH FROM o.order_date), EXTRACT(YEAR FROM o.order_date)
)
SELECT 
    COALESCE(c.month, p.month) AS month,
    COALESCE(c.total_revenue, 0) AS current_year_revenue,
    COALESCE(p.total_revenue, 0) AS previous_year_revenue,
    COALESCE(c.total_revenue, 0) - COALESCE(p.total_revenue, 0) AS revenue_change,
    CASE 
        WHEN COALESCE(p.total_revenue, 0) = 0 THEN NULL
        ELSE ROUND(
            ((COALESCE(c.total_revenue, 0) - COALESCE(p.total_revenue, 0)) / p.total_revenue) * 100, 
            2
        )
    END AS revenue_growth_percent,
    COALESCE(c.order_count, 0) AS current_year_orders,
    COALESCE(p.order_count, 0) AS previous_year_orders
FROM current_year_sales c
FULL OUTER JOIN previous_year_sales p ON c.month = p.month
ORDER BY month;
```

**Key Production Techniques:**
- ✅ CTEs (Common Table Expressions) for complex queries
- ✅ FULL OUTER JOIN for complete comparison
- ✅ Date extraction and filtering
- ✅ Percentage calculations
- ✅ Handling division by zero
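
The CASE-based guard against division by zero can also be written with `NULLIF`, which turns a zero denominator into NULL so that the whole expression yields NULL. A minimal sketch on hypothetical monthly figures (the column names `month_no`, `current_revenue`, and `previous_revenue` are assumptions for this example):

```sql
SELECT
    t.month_no,
    t.current_revenue,
    t.previous_revenue,
    ROUND(
        (t.current_revenue - t.previous_revenue) / NULLIF(t.previous_revenue, 0) * 100,
        2
    ) AS revenue_growth_percent
FROM (VALUES (1, 1200.00, 1000.00), (2, 500.00, 0.00)) AS t (month_no, current_revenue, previous_revenue)
ORDER BY t.month_no;
-- month 1 -> 20 (a 20% increase)
-- month 2 -> NULL (no previous-year revenue to compare against)
```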

### Example 4: Customer Segmentation with Purchase Behavior

**Business Question:** Segment customers based on their purchase behavior and demographics.

**Production Query:**
```sql
WITH customer_metrics AS (
    SELECT 
        c.customer_id,
        c.first_name || ' ' || c.last_name AS customer_name,
        c.country,
        COUNT(DISTINCT o.order_id) AS order_count,
        SUM(o.total_amount) AS total_spent,
        AVG(o.total_amount) AS avg_order_value,
        MAX(o.order_date) AS last_order_date,
        MIN(o.order_date) AS first_order_date,
        -- Snowflake needs DATEDIFF here; see "Snowflake Date Function Updates" later in this notebook
        EXTRACT(DAY FROM (CURRENT_DATE - MAX(o.order_date))) AS days_since_last_order
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.first_name, c.last_name, c.country
)
SELECT 
    customer_id,
    customer_name,
    country,
    order_count,
    total_spent,
    avg_order_value,
    last_order_date,
    -- Customer Segmentation
    CASE 
        WHEN order_count = 0 THEN 'Prospect'
        WHEN total_spent >= 1000 THEN 'VIP'
        WHEN total_spent >= 500 THEN 'Premium'
        WHEN total_spent >= 200 THEN 'Regular'
        ELSE 'Casual'
    END AS customer_tier,
    -- Recency Segmentation
    CASE 
        WHEN order_count = 0 THEN 'Never Purchased'
        WHEN days_since_last_order <= 30 THEN 'Active'
        WHEN days_since_last_order <= 90 THEN 'At Risk'
        ELSE 'Churned'
    END AS recency_status,
    -- Frequency Segmentation
    CASE 
        WHEN order_count = 0 THEN 'No Orders'
        WHEN order_count = 1 THEN 'One-Time Buyer'
        WHEN order_count <= 3 THEN 'Occasional'
        ELSE 'Frequent'
    END AS frequency_segment
FROM customer_metrics
ORDER BY total_spent DESC NULLS LAST;
```

**Key Production Techniques:**
- ✅ CTEs for complex calculations
- ✅ Multiple CASE statements for segmentation
- ✅ Date arithmetic
- ✅ NULL handling with NULLS LAST
- ✅ Business logic implementation
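
The tier logic relies on CASE evaluating its branches top to bottom and stopping at the first match, which is why the thresholds are listed from highest to lowest. A small sketch with made-up totals shows what happens when the same thresholds are listed in ascending order:

```sql
SELECT
    t.total_spent,
    CASE
        WHEN t.total_spent >= 1000 THEN 'VIP'
        WHEN t.total_spent >= 500  THEN 'Premium'
        WHEN t.total_spent >= 200  THEN 'Regular'
        ELSE 'Casual'
    END AS correct_tier,
    -- Ascending order: the first branch catches everything >= 200
    CASE
        WHEN t.total_spent >= 200  THEN 'Regular'
        WHEN t.total_spent >= 500  THEN 'Premium'
        WHEN t.total_spent >= 1000 THEN 'VIP'
        ELSE 'Casual'
    END AS misordered_tier
FROM (VALUES (1500), (650), (50)) AS t (total_spent)
ORDER BY t.total_spent DESC;
-- 1500 -> VIP     / Regular
-- 650  -> Premium / Regular
-- 50   -> Casual  / Casual
```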

### Example 5: Inventory Reorder Point Analysis

**Business Question:** Identify products that need reordering based on sales velocity and current stock.

**Production Query:**
```sql
WITH product_sales_velocity AS (
    SELECT 
        p.product_id,
        p.product_name,
        p.category,
        p.stock_quantity,
        p.price,
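        -- Only lines whose order matched the join (delivered, last 30 days) count as sales;
        -- o.order_id is NULL for every other line
        COALESCE(SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity END), 0) AS total_sold,
        COUNT(DISTINCT o.order_id) AS order_count,
        -- Average daily sales over the 30-day window
        COALESCE(SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity END), 0) / 30.0 AS avg_daily_sales,
        -- Days of inventory remaining (999 is a sentinel for "no recent sales")
        CASE 
            WHEN COALESCE(SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity END), 0) = 0 THEN 999
            ELSE p.stock_quantity / (SUM(CASE WHEN o.order_id IS NOT NULL THEN oi.quantity END) / 30.0)
        END AS days_of_inventory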
    FROM products p
    LEFT JOIN order_items oi ON p.product_id = oi.product_id
    LEFT JOIN orders o ON oi.order_id = o.order_id 
        AND o.order_date >= CURRENT_DATE - INTERVAL '30 days'
        AND o.status = 'delivered'
    GROUP BY p.product_id, p.product_name, p.category, p.stock_quantity, p.price
)
SELECT 
    product_id,
    product_name,
    category,
    stock_quantity,
    total_sold,
    avg_daily_sales,
    ROUND(days_of_inventory, 1) AS days_of_inventory_remaining,
    CASE 
        WHEN days_of_inventory < 7 THEN 'URGENT - Reorder Now'
        WHEN days_of_inventory < 14 THEN 'Low Stock - Reorder Soon'
        WHEN days_of_inventory < 30 THEN 'Monitor Closely'
        WHEN total_sold = 0 THEN 'No Sales - Review Product'
        ELSE 'Stock OK'
    END AS reorder_recommendation,
    -- Suggested reorder quantity (30 days of sales + safety stock)
    CASE 
        WHEN avg_daily_sales = 0 THEN 0
        ELSE CEIL(avg_daily_sales * 30 * 1.5) -- 30 days + 50% safety stock
    END AS suggested_reorder_quantity
FROM product_sales_velocity
ORDER BY days_of_inventory ASC, total_sold DESC;
```

**Key Production Techniques:**
- ✅ CTEs for complex calculations
- ✅ Date filtering in JOIN conditions
- ✅ Mathematical calculations (days of inventory)
- ✅ Business logic for recommendations
- ✅ CEIL function for rounding up
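
To make the arithmetic concrete, here is a worked sketch for a single hypothetical product that sold 45 units in the last 30 days and has 12 units in stock. The numbers are invented; only the formulas come from the query above.

```sql
SELECT
    45 / 30.0                  AS avg_daily_sales,            -- 1.5 units per day
    12 / (45 / 30.0)           AS days_of_inventory,          -- 8 days left at that pace
    CEIL(45 / 30.0 * 30 * 1.5) AS suggested_reorder_quantity; -- CEIL(67.5) = 68
```

With 8 days of inventory left, this product would fall into the 'Low Stock - Reorder Soon' band (at least 7 but fewer than 14 days).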

---

## Quick Reference: Join Types Comparison

| Join Type | Returns | Use Case |
|-----------|---------|----------|
| **INNER JOIN** | Only matching rows from both tables | Most common - get related data |
| **LEFT JOIN** | All rows from left + matching from right | Include all left records |
| **RIGHT JOIN** | All rows from right + matching from left | Include all right records (rarely used) |
| **FULL OUTER JOIN** | All rows from both tables | Complete coverage from both sides |
| **CROSS JOIN** | Cartesian product (all combinations) | Rarely used intentionally |
| **UNION** | Combined rows from multiple SELECTs | Stack result sets vertically |
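
CROSS JOIN is the only row in this table without an `ON` clause, so a tiny sketch may help. The size and colour lists are invented for the example.

```sql
SELECT s.size_code, c.color_name
FROM (VALUES ('S'), ('M')) AS s (size_code)
CROSS JOIN (VALUES ('red'), ('blue'), ('green')) AS c (color_name)
ORDER BY s.size_code, c.color_name;
-- 2 sizes x 3 colors = 6 rows, one per combination
```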

---

## Practice Problems

Now it's time to test your understanding! Try to solve these problems on your own before looking at the solutions.


## Problem 1: Customer Order Summary

**Problem Statement:**
Write a query to show all customers with their total number of orders and total amount spent. Include customers who have never placed an order (show 0 for their counts).

**Expected Output:**
- Customer ID, Name, Email
- Number of orders (0 if none)
- Total amount spent (0 if none)
- Order by total amount spent (descending)


In [None]:
-- Write your solution here for Problem 1


### Solution to Problem 1

**Approach:**
1. Use LEFT JOIN to include all customers (even those without orders)
2. Use COUNT(DISTINCT order_id) to count orders
3. Use COALESCE to convert NULL to 0
4. Group by customer attributes

```sql
SELECT 
    c.customer_id,
    c.first_name || ' ' || c.last_name AS customer_name,
    c.email,
    COUNT(DISTINCT o.order_id) AS number_of_orders,
    COALESCE(SUM(o.total_amount), 0) AS total_amount_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name, c.email
ORDER BY total_amount_spent DESC;
```

**Explanation:**
- `LEFT JOIN` ensures all customers are included
- `COUNT(DISTINCT o.order_id)` counts unique orders; COUNT ignores NULLs and returns 0 when there is nothing to count, so it needs no COALESCE
- `SUM(o.total_amount)` sums the order amounts (returns NULL if no orders, so we use COALESCE)
- `GROUP BY` is required when using aggregate functions


## Problem 2: Product Sales Analysis

**Problem Statement:**
Show all products with the following information:
- Product ID, Name, Category, Price
- Total quantity sold (sum of all quantities from order_items)
- Number of orders containing this product
- Total revenue generated (sum of quantity * unit_price)
- Include products that have never been sold (show 0 for their metrics)


In [None]:
-- Write your solution here for Problem 2


### Solution to Problem 2

**Approach:**
1. Use LEFT JOIN to include all products (even unsold ones)
2. Join products → order_items → orders (if needed for filtering)
3. Use aggregate functions with COALESCE for NULL handling
4. Calculate revenue as sum of (quantity * unit_price)

```sql
SELECT 
    p.product_id,
    p.product_name,
    p.category,
    p.price,
    COALESCE(SUM(oi.quantity), 0) AS total_quantity_sold,
    COUNT(DISTINCT oi.order_id) AS number_of_orders,
    COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS total_revenue
FROM products p
LEFT JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY p.product_id, p.product_name, p.category, p.price
ORDER BY total_revenue DESC;
```

**Explanation:**
- `LEFT JOIN` ensures all products are included
- `SUM(oi.quantity)` calculates total units sold
- `COUNT(DISTINCT oi.order_id)` counts unique orders containing this product
- `SUM(oi.quantity * oi.unit_price)` calculates total revenue
- Both `SUM` aggregates use `COALESCE` to return 0 instead of NULL; `COUNT` already returns 0 for unsold products


## Problem 3: Order Details with Customer and Products

**Problem Statement:**
Create a detailed order report showing:
- Order ID, Order Date, Order Status
- Customer Name (First + Last)
- Customer Email
- Product Name, Quantity, Unit Price, Line Total (quantity * unit_price)
- Order Total Amount
- Only show orders that have been delivered
- Order by order date (newest first), then by product name


In [None]:
-- Write your solution here for Problem 3


### Solution to Problem 3

**Approach:**
1. Use INNER JOIN (every report row needs a matching customer, order item, and product)
2. Start from orders and join customers, order_items, and products
3. Filter for status = 'delivered'
4. Calculate line_total as quantity * unit_price
5. Order by order_date DESC, then product_name

```sql
SELECT 
    o.order_id,
    o.order_date,
    o.status AS order_status,
    c.first_name || ' ' || c.last_name AS customer_name,
    c.email AS customer_email,
    p.product_name,
    oi.quantity,
    oi.unit_price,
    (oi.quantity * oi.unit_price) AS line_total,
    o.total_amount AS order_total_amount
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id
WHERE o.status = 'delivered'
ORDER BY o.order_date DESC, p.product_name;
```

**Explanation:**
- `INNER JOIN` is used because every row in the report needs a matching customer, order item, and product; the delivered-only requirement is handled by the `WHERE` clause
- Multiple joins connect all four tables
- `WHERE o.status = 'delivered'` filters for delivered orders only
- `(oi.quantity * oi.unit_price)` calculates line total for each item
- `ORDER BY` sorts by date (newest first), then product name
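
One thing to watch in this report: `order_total_amount` comes from `orders`, so it repeats on every line of the same order. Summing that column over the joined rows double-counts, while summing `line_total` does not. A sketch on made-up rows (`sample_orders` and `sample_order_items` are hypothetical):

```sql
WITH sample_orders AS (
    SELECT * FROM (VALUES (1001, 100.00)) AS t (order_id, total_amount)
),
sample_order_items AS (
    SELECT * FROM (VALUES (1001, 'Mouse', 2, 20.00), (1001, 'Keyboard', 1, 60.00))
        AS t (order_id, product_name, quantity, unit_price)
)
SELECT
    SUM(o.total_amount)              AS summed_order_totals, -- 200.00: the 100.00 total is counted once per line
    SUM(oi.quantity * oi.unit_price) AS summed_line_totals   -- 100.00: each line is counted once
FROM sample_orders o
INNER JOIN sample_order_items oi ON o.order_id = oi.order_id;
```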


## Problem 4: Customers Without Orders

**Problem Statement:**
Find all customers who have never placed an order. Show their customer ID, full name, email, and city.


In [None]:
-- Write your solution here for Problem 4


### Solution to Problem 4

**Approach:**
1. Use LEFT JOIN to get all customers
2. Filter for rows where order_id IS NULL
3. This identifies customers without any orders

```sql
SELECT 
    c.customer_id,
    c.first_name || ' ' || c.last_name AS full_name,
    c.email,
    c.city
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;
```

**Explanation:**
- `LEFT JOIN` includes all customers
- `WHERE o.order_id IS NULL` filters for customers who have no matching orders
- This is a common pattern for finding "missing" relationships
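
The same anti-join can be written with `NOT EXISTS`, which some readers find closer to the question being asked. This is an alternative sketch on inline rows, not the course's reference solution; `sample_customers` and `sample_orders` are hypothetical.

```sql
WITH sample_customers AS (
    SELECT * FROM (VALUES (1, 'John'), (2, 'Jane'), (6, 'Diana')) AS t (customer_id, first_name)
),
sample_orders AS (
    SELECT * FROM (VALUES (1001, 1), (1002, 2)) AS t (order_id, customer_id)
)
SELECT c.customer_id, c.first_name
FROM sample_customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM sample_orders o
    WHERE o.customer_id = c.customer_id  -- correlated with the outer customer row
);
-- Returns only Diana (customer_id 6)
```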


## Problem 5: Products Never Ordered

**Problem Statement:**
Find all products that have never been ordered. Show product ID, name, category, price, and stock quantity.


In [None]:
-- Write your solution here for Problem 5


### Solution to Problem 5

**Approach:**
1. Use LEFT JOIN to get all products
2. Filter for rows where order_id IS NULL in order_items
3. This identifies products that have never been ordered

```sql
SELECT 
    p.product_id,
    p.product_name,
    p.category,
    p.price,
    p.stock_quantity
FROM products p
LEFT JOIN order_items oi ON p.product_id = oi.product_id
WHERE oi.order_id IS NULL;
```

**Explanation:**
- `LEFT JOIN` includes all products
- `WHERE oi.order_id IS NULL` filters for products that have no matching order_items
- This pattern is similar to Problem 4 but for products instead of customers


## Problem 6: Top Customers by Revenue

**Problem Statement:**
Find the top 3 customers by total revenue (sum of all their order amounts). Show:
- Customer ID, Full Name, Email
- Total number of orders
- Total revenue
- Average order value
- Order by total revenue (descending)


In [None]:
-- Write your solution here for Problem 6


### Solution to Problem 6

**Approach:**
1. Use INNER JOIN to get customers with orders
2. Use aggregate functions to calculate metrics
3. Use LIMIT 3 to get top 3 customers
4. Calculate average order value with AVG(total_amount), rounded to 2 decimals

```sql
SELECT 
    c.customer_id,
    c.first_name || ' ' || c.last_name AS full_name,
    c.email,
    COUNT(DISTINCT o.order_id) AS total_orders,
    SUM(o.total_amount) AS total_revenue,
    ROUND(AVG(o.total_amount), 2) AS average_order_value
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name, c.email
ORDER BY total_revenue DESC
LIMIT 3;
```

**Explanation:**
- `INNER JOIN` ensures we only get customers with orders
- `COUNT(DISTINCT o.order_id)` counts unique orders
- `SUM(o.total_amount)` calculates total revenue
- `AVG(o.total_amount)` calculates average order value
- `LIMIT 3` restricts results to top 3 customers
- `ORDER BY total_revenue DESC` sorts by highest revenue first
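
For SQL Server, which has no `LIMIT` clause, `TOP n` plays the same role; Snowflake accepts this form as well. A sketch on invented customer totals:

```sql
SELECT TOP 3 t.customer_name, t.total_revenue
FROM (VALUES ('John', 1089.97), ('Jane', 379.98), ('Diana', 0.00), ('Raj', 512.40))
    AS t (customer_name, total_revenue)
ORDER BY t.total_revenue DESC;
-- John, Raj, Jane
```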


## Problem 7: Category Sales Summary

**Problem Statement:**
Show sales summary by product category:
- Category name
- Number of different products in this category
- Total quantity sold across all products in this category
- Total revenue from this category
- Order by total revenue (descending)


In [None]:
-- Write your solution here for Problem 7


### Solution to Problem 7

**Approach:**
1. Join products → order_items to get sales data
2. Group by category
3. Use aggregate functions to calculate category-level metrics
4. Calculate total revenue as sum of (quantity * unit_price)

```sql
SELECT 
    p.category,
    COUNT(DISTINCT p.product_id) AS number_of_products,
    COALESCE(SUM(oi.quantity), 0) AS total_quantity_sold,
    COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS total_revenue
FROM products p
LEFT JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY p.category
ORDER BY total_revenue DESC;
```

**Explanation:**
- `LEFT JOIN` ensures all categories are included (even if no products sold)
- `COUNT(DISTINCT p.product_id)` counts unique products in each category
- `SUM(oi.quantity)` calculates total units sold across all products in category
- `SUM(oi.quantity * oi.unit_price)` calculates total revenue for the category
- `COALESCE` handles NULL values for categories with no sales
- `GROUP BY p.category` groups results by category
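
The `DISTINCT` in the product count matters because the join repeats a product once per order line. A sketch on made-up rows (`sample_products` and `sample_order_items` are hypothetical):

```sql
WITH sample_products AS (
    SELECT * FROM (VALUES (10, 'Electronics'), (11, 'Electronics')) AS t (product_id, category)
),
sample_order_items AS (
    SELECT * FROM (VALUES (1001, 10, 1), (1002, 10, 2), (1003, 10, 1)) AS t (order_id, product_id, quantity)
)
SELECT
    p.category,
    COUNT(p.product_id)           AS joined_rows,         -- 4: product 10 three times, product 11 once
    COUNT(DISTINCT p.product_id)  AS number_of_products,  -- 2
    COALESCE(SUM(oi.quantity), 0) AS total_quantity_sold  -- 4
FROM sample_products p
LEFT JOIN sample_order_items oi ON p.product_id = oi.product_id
GROUP BY p.category;
```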


## Problem 8: Using UNION

**Problem Statement:**
Create a unified list showing:
- All customer emails
- All product names (as if they were emails, for comparison purposes)
- Label each row as either 'Customer Email' or 'Product Name'
- Order by the label, then by the value


In [None]:
-- Write your solution here for Problem 8


### Solution to Problem 8

**Approach:**
1. Use UNION to combine customer emails and product names
2. Add a label column to identify the source
3. Ensure both SELECT statements have the same number and type of columns
4. Order by label, then by value

```sql
SELECT 
    'Customer Email' AS label,
    email AS value
FROM customers

UNION

SELECT 
    'Product Name' AS label,
    product_name AS value
FROM products

ORDER BY label, value;
```

**Explanation:**
- First SELECT gets all customer emails with label 'Customer Email'
- Second SELECT gets all product names with label 'Product Name'
- `UNION` combines them vertically (removes duplicates if any)
- Both SELECT statements have the same structure: label and value columns
- `ORDER BY label, value` sorts first by label (Customer Email, then Product Name), then by value alphabetically
- If you want to keep duplicates, use `UNION ALL` instead of `UNION`


## Snowflake Date Function Updates for Production Examples

**Important:** The production examples above use generic SQL date syntax. Snowflake accepts `INTERVAL` constants and `CURRENT_DATE` without parentheses, so most of the changes below are stylistic. The exception is Example 4: subtracting two dates yields a plain number of days, and `EXTRACT(DAY FROM ...)` cannot be applied to that number, so the `DATEDIFF` form is required there.

### Date Function Conversions

**In Production Example 1 (CLV Analysis):**
```sql
-- Replace:
WHEN MAX(o.order_date) < CURRENT_DATE - INTERVAL '90 days' THEN 'Churned'
WHEN MAX(o.order_date) < CURRENT_DATE - INTERVAL '30 days' THEN 'At Risk'

-- With Snowflake syntax:
WHEN MAX(o.order_date) < DATEADD(day, -90, CURRENT_DATE()) THEN 'Churned'
WHEN MAX(o.order_date) < DATEADD(day, -30, CURRENT_DATE()) THEN 'At Risk'
```

**In Production Example 3 (Monthly Sales Report):**
```sql
-- Replace:
EXTRACT(YEAR FROM CURRENT_DATE)

-- With Snowflake syntax:
EXTRACT(YEAR FROM CURRENT_DATE())
```

**In Production Example 4 (Customer Segmentation):**
```sql
-- Replace:
EXTRACT(DAY FROM (CURRENT_DATE - MAX(o.order_date))) AS days_since_last_order

-- With Snowflake syntax:
DATEDIFF(day, MAX(o.order_date), CURRENT_DATE()) AS days_since_last_order
```

**In Production Example 5 (Inventory Reorder):**
```sql
-- Replace:
AND o.order_date >= CURRENT_DATE - INTERVAL '30 days'

-- With Snowflake syntax:
AND o.order_date >= DATEADD(day, -30, CURRENT_DATE())
```

**Note:** Snowflake supports both `CURRENT_DATE` and `CURRENT_DATE()`, but using the function form `CURRENT_DATE()` is more consistent with other date functions.

**SQL Server Compatibility:**
- SQL Server uses similar syntax: `DATEADD(day, -30, GETDATE())` and `DATEDIFF(day, date1, date2)`
- `GETDATE()` and `CURRENT_TIMESTAMP` return a datetime in SQL Server; use `CAST(GETDATE() AS date)` for a date-only equivalent of `CURRENT_DATE`


---

## Summary: Join Comparison with Sample Outputs

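Let's see a side-by-side comparison of different join types using a small subset of our sample data. Customer 5, the owner of order 1007, is not among the customers rows shown, so that order has no match here: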

### Sample Data for Comparison

**customers table:**
| customer_id | first_name | last_name |
|-------------|------------|-----------|
| 1 | John | Doe |
| 2 | Jane | Smith |
| 6 | Diana | Davis |

**orders table:**
| order_id | customer_id | order_date | total_amount |
|----------|-------------|------------|--------------|
| 1001 | 1 | 2024-01-15 | 1029.98 |
| 1002 | 2 | 2024-01-16 | 379.98 |
| 1007 | 5 | 2024-02-01 | 429.98 |

### Join Results Comparison

**Query:**
```sql
SELECT c.customer_id, c.first_name, o.order_id, o.order_date
FROM customers c
[JOIN TYPE] orders o ON c.customer_id = o.customer_id;
```

| Join Type | Result | Explanation |
|-----------|--------|-------------|
| **INNER JOIN** | 2 rows<br>• John → Order 1001<br>• Jane → Order 1002 | Only matching records. Diana (no orders) and Order 1007 (no customer) excluded. |
| **LEFT JOIN** | 3 rows<br>• John → Order 1001<br>• Jane → Order 1002<br>• Diana → NULL | All customers included. Diana has NULL for order columns. Order 1007 excluded. |
| **RIGHT JOIN** | 3 rows<br>• John → Order 1001<br>• Jane → Order 1002<br>• NULL → Order 1007 | All orders included. Order 1007 has NULL for customer columns. Diana excluded. |
| **FULL OUTER JOIN** | 4 rows<br>• John → Order 1001<br>• Jane → Order 1002<br>• Diana → NULL<br>• NULL → Order 1007 | All records from both tables. NULLs where no match exists. |
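
The row counts in this table can be reproduced without the course tables. The sketch below rebuilds the two sample tables inline (as `sample_customers` and `sample_orders`, names chosen for this example, with only the join-relevant columns) and counts the rows each join type returns.

```sql
WITH sample_customers AS (
    SELECT * FROM (VALUES (1, 'John'), (2, 'Jane'), (6, 'Diana')) AS t (customer_id, first_name)
),
sample_orders AS (
    SELECT * FROM (VALUES (1001, 1), (1002, 2), (1007, 5)) AS t (order_id, customer_id)
)
SELECT 'INNER JOIN' AS join_type, COUNT(*) AS row_count
FROM sample_customers c INNER JOIN sample_orders o ON c.customer_id = o.customer_id
UNION ALL
SELECT 'LEFT JOIN', COUNT(*)
FROM sample_customers c LEFT JOIN sample_orders o ON c.customer_id = o.customer_id
UNION ALL
SELECT 'RIGHT JOIN', COUNT(*)
FROM sample_customers c RIGHT JOIN sample_orders o ON c.customer_id = o.customer_id
UNION ALL
SELECT 'FULL OUTER JOIN', COUNT(*)
FROM sample_customers c FULL OUTER JOIN sample_orders o ON c.customer_id = o.customer_id;
-- INNER JOIN 2, LEFT JOIN 3, RIGHT JOIN 3, FULL OUTER JOIN 4
```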

---

## Key Takeaways

### When to Use Each Join Type

1. **INNER JOIN** - Use when you only need records that exist in both tables
   - ✅ Most common join type
   - ✅ Usually the smallest result set, since unmatched rows are dropped
   - ✅ Use for: Getting related data, filtering by relationships

2. **LEFT JOIN** - Use when you need all records from the left table
   - ✅ Essential for finding "missing" relationships
   - ✅ Use for: Including all main records, finding records without matches
   - ✅ Pattern: `LEFT JOIN ... WHERE right_table.id IS NULL` finds missing relationships

3. **RIGHT JOIN** - Rarely used (prefer LEFT JOIN by swapping tables)
   - ⚠️ Less readable than LEFT JOIN
   - ✅ Use only when it makes logical sense in your query structure

4. **FULL OUTER JOIN** - Use when you need complete coverage from both tables
   - ✅ Great for data reconciliation
   - ⚠️ Not supported in MySQL (use UNION workaround)
   - ✅ Use for: Comparing datasets, finding records in one but not the other

5. **UNION** - Use when combining similar data from different sources
   - ✅ Combines rows vertically (not columns horizontally)
   - ✅ Use for: Merging similar datasets, creating unified reports

### Best Practices

1. **Always use table aliases** for better readability
2. **Be explicit with JOIN types** - Don't rely on default behavior
3. **Handle NULL values** with COALESCE when needed
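4. **Use WHERE for filtering**, except for conditions on the right table of a LEFT JOIN: put those in the `ON` clause, because a WHERE condition on right-table columns (other than an `IS NULL` check) discards the NULL-padded rows and turns the join into an INNER JOIN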
5. **Group by all non-aggregated columns** when using aggregate functions
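
Practice 4 is the one most often tripped over, so here is a minimal sketch of the difference. The inline tables `sample_customers` and `sample_orders` are hypothetical; the only order in them is a cancelled one.

```sql
WITH sample_customers AS (
    SELECT * FROM (VALUES (1, 'John'), (6, 'Diana')) AS t (customer_id, first_name)
),
sample_orders AS (
    SELECT * FROM (VALUES (1001, 1, 'cancelled')) AS t (order_id, customer_id, status)
)
-- Filter in ON: both customers are kept; neither has a delivered order, so order_id is NULL
SELECT 'filter in ON' AS variant, c.first_name, o.order_id
FROM sample_customers c
LEFT JOIN sample_orders o
    ON c.customer_id = o.customer_id
   AND o.status = 'delivered'
UNION ALL
-- Filter in WHERE: NULL-padded rows fail the test, so the LEFT JOIN behaves like an INNER JOIN
SELECT 'filter in WHERE', c.first_name, o.order_id
FROM sample_customers c
LEFT JOIN sample_orders o
    ON c.customer_id = o.customer_id
WHERE o.status = 'delivered';
-- 'filter in ON' contributes 2 rows (John, Diana); 'filter in WHERE' contributes 0 rows
```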

---

## Next Steps

Now that you've mastered SQL Joins and UNION:

1. **Practice** - Try solving the practice problems on your own
2. **Experiment** - Modify the queries to see how results change
3. **Apply** - Use these concepts in your data engineering projects
4. **Explore** - Learn about advanced topics like self-joins, subqueries, and window functions

**Congratulations!** You now have a solid understanding of SQL Joins and UNION operations. These are fundamental skills for any data engineer!
