# SQL Fundamentals with DuckDB

This notebook covers essential SQL operations using **DuckDB** - a fast, embedded analytical database.

Topics covered:
1. Basic queries: SELECT, WHERE, ORDER BY, LIMIT
2. Aggregations: GROUP BY, HAVING, COUNT, SUM, AVG, MIN, MAX
3. Joins: INNER JOIN, LEFT JOIN, multi-table joins
4. Window functions: RANK, DENSE_RANK, ROW_NUMBER, LAG, LEAD, running totals
5. CTEs (Common Table Expressions)
6. Subqueries
7. CASE statements

**Why DuckDB?**
- Fast analytical queries over pandas DataFrames and Parquet files
- No server setup required (it runs embedded in the Python process)
- Well suited to data science workflows: query results convert to DataFrames with `.df()`

In [None]:
import duckdb
import pandas as pd
import numpy as np
from pathlib import Path

print(f"DuckDB version: {duckdb.__version__}")

# Create in-memory database
con = duckdb.connect(':memory:')

## 1. Create Sample Data

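We'll create synthetic e-commerce tables filled with randomly generated values:
- **customers**: Customer information
- **products**: Product catalog
- **orders**: Order transactions
- **order_items**: Items in each order (`unit_price` is drawn at random, independent of `price` in **products**)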

In [None]:
np.random.seed(42)

# Customers table
customers = pd.DataFrame({
    'customer_id': range(1, 101),
    'name': [f'Customer {i}' for i in range(1, 101)],
    'email': [f'customer{i}@example.com' for i in range(1, 101)],
    'country': np.random.choice(['USA', 'UK', 'Canada', 'Germany', 'France'], 100),
    'signup_date': pd.date_range('2022-01-01', periods=100, freq='3D'),
    'loyalty_tier': np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], 100, p=[0.4, 0.3, 0.2, 0.1])
})

# Products table
product_names = [
    'Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones',
    'Webcam', 'USB Cable', 'HDMI Cable', 'SSD Drive', 'RAM',
    'Graphics Card', 'Motherboard', 'Power Supply', 'CPU', 'Case',
    'Cooling Fan', 'Thermal Paste', 'Wi-Fi Card', 'Ethernet Cable', 'Router',
    'Printer', 'Scanner', 'External HDD', 'USB Hub', 'Docking Station',
    'Speakers', 'Microphone', 'Drawing Tablet', 'Stylus', 'Phone Stand',
    'Laptop Bag', 'Screen Protector', 'Cleaning Kit', 'Cable Organizer', 'Desk Mat',
    'LED Strip', 'Smart Plug', 'Surge Protector', 'KVM Switch', 'Network Switch',
    'Wireless Charger', 'Power Bank', 'Phone Case', 'Tablet Stand', 'Monitor Arm',
    'Ergonomic Chair', 'Desk Lamp', 'Footrest', 'Wrist Rest', 'Laptop Cooler'
]

products = pd.DataFrame({
    'product_id': range(1, 51),
    'product_name': product_names,
    'category': np.random.choice(['Electronics', 'Accessories', 'Components'], 50),
    'price': np.round(np.random.uniform(10, 1500, 50), 2),
    'stock_quantity': np.random.randint(0, 500, 50),
})

# Orders table
n_orders = 500
orders = pd.DataFrame({
    'order_id': range(1, n_orders + 1),
    'customer_id': np.random.randint(1, 101, n_orders),
    'order_date': pd.date_range('2023-01-01', periods=n_orders, freq='h'),
    'status': np.random.choice(['Pending', 'Shipped', 'Delivered', 'Cancelled'], n_orders, p=[0.1, 0.3, 0.5, 0.1]),
    'shipping_cost': np.round(np.random.uniform(5, 50, n_orders), 2)
})

# Order items table
order_items_data = []
item_id = 1
for order_id in range(1, n_orders + 1):
    n_items = np.random.randint(1, 6)  # 1-5 items per order
    for _ in range(n_items):
        order_items_data.append({
            'item_id': item_id,
            'order_id': order_id,
            'product_id': np.random.randint(1, 51),
            'quantity': np.random.randint(1, 5),
            'unit_price': np.round(np.random.uniform(10, 1500), 2)
        })
        item_id += 1

order_items = pd.DataFrame(order_items_data)

print("Created tables:")
print(f"  Customers: {len(customers)} rows")
print(f"  Products: {len(products)} rows")
print(f"  Orders: {len(orders)} rows")
print(f"  Order Items: {len(order_items)} rows")

In [None]:
# Preview tables
print("Customers:")
display(customers.head())

print("\nProducts:")
display(products.head())

print("\nOrders:")
display(orders.head())

print("\nOrder Items:")
display(order_items.head())

## 2. Basic SELECT Queries

DuckDB can query pandas DataFrames directly: a table name in the SQL (such as `customers`) resolves to the DataFrame variable of the same name.

In [None]:
# Select all columns
result = con.execute("""
    SELECT *
    FROM customers
    LIMIT 5
""").df()

result

In [None]:
# Select specific columns
result = con.execute("""
    SELECT customer_id, name, country, loyalty_tier
    FROM customers
    LIMIT 10
""").df()

result

In [None]:
# WHERE clause - filtering
result = con.execute("""
    SELECT *
    FROM customers
    WHERE country = 'USA' AND loyalty_tier = 'Gold'
""").df()

print(f"Found {len(result)} Gold-tier customers from the USA")
result.head()

In [None]:
# ORDER BY - sorting
result = con.execute("""
    SELECT product_name, category, price
    FROM products
    ORDER BY price DESC
    LIMIT 10
""").df()

print("Top 10 most expensive products:")
result

In [None]:
# IN clause
result = con.execute("""
    SELECT *
    FROM customers
    WHERE country IN ('USA', 'UK', 'Canada')
    ORDER BY signup_date DESC
    LIMIT 10
""").df()

result

In [None]:
# LIKE pattern matching
result = con.execute("""
    SELECT product_id, product_name, price
    FROM products
    WHERE product_name LIKE '%Cable%'
""").df()

print("Products containing 'Cable':")
result

## 3. Aggregations and GROUP BY

Aggregate functions (COUNT, SUM, AVG, MIN, MAX) collapse many rows into a single value, either for the whole table or once per group.

In [None]:
# Count total orders
result = con.execute("""
    SELECT COUNT(*) as total_orders
    FROM orders
""").df()

result

In [None]:
# GROUP BY - aggregations per group
result = con.execute("""
    SELECT 
        status,
        COUNT(*) as num_orders,
        AVG(shipping_cost) as avg_shipping,
        MIN(shipping_cost) as min_shipping,
        MAX(shipping_cost) as max_shipping
    FROM orders
    GROUP BY status
    ORDER BY num_orders DESC
""").df()

result

In [None]:
# GROUP BY with multiple columns
result = con.execute("""
    SELECT 
        country,
        loyalty_tier,
        COUNT(*) as customer_count
    FROM customers
    GROUP BY country, loyalty_tier
    ORDER BY country, customer_count DESC
""").df()

result.head(15)

In [None]:
# HAVING - filter after aggregation
result = con.execute("""
    SELECT 
        country,
        COUNT(*) as customer_count
    FROM customers
    GROUP BY country
    HAVING COUNT(*) > 15
    ORDER BY customer_count DESC
""").df()

print("Countries with more than 15 customers:")
result

In [None]:
# Calculate total revenue per product
result = con.execute("""
    SELECT 
        product_id,
        COUNT(*) as times_ordered,
        SUM(quantity) as total_quantity_sold,
        SUM(quantity * unit_price) as total_revenue,
        AVG(unit_price) as avg_unit_price
    FROM order_items
    GROUP BY product_id
    ORDER BY total_revenue DESC
    LIMIT 10
""").df()

print("Top 10 products by revenue:")
result

## 4. JOINs

JOINs combine rows from multiple tables by matching them on related columns.

In [None]:
# INNER JOIN - get orders with customer details
result = con.execute("""
    SELECT 
        o.order_id,
        c.name,
        c.country,
        o.order_date,
        o.status,
        o.shipping_cost
    FROM orders o
    INNER JOIN customers c ON o.customer_id = c.customer_id
    LIMIT 10
""").df()

result

In [None]:
# Join multiple tables - get order details with product names
result = con.execute("""
    SELECT 
        o.order_id,
        c.name as customer_name,
        p.product_name,
        oi.quantity,
        oi.unit_price,
        oi.quantity * oi.unit_price as line_total
    FROM order_items oi
    INNER JOIN orders o ON oi.order_id = o.order_id
    INNER JOIN customers c ON o.customer_id = c.customer_id
    INNER JOIN products p ON oi.product_id = p.product_id
    LIMIT 15
""").df()

result

In [None]:
# LEFT JOIN - include customers even if they have no orders
result = con.execute("""
    SELECT 
        c.customer_id,
        c.name,
        COUNT(o.order_id) as num_orders,
        COALESCE(SUM(o.shipping_cost), 0) as total_shipping
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY num_orders DESC
    LIMIT 10
""").df()

print("Top 10 customers by number of orders:")
result

In [None]:
# Find customers who never placed an order
result = con.execute("""
    SELECT 
        c.customer_id,
        c.name,
        c.signup_date
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_id IS NULL
""").df()

print(f"Found {len(result)} customers with no orders")
result.head()

## 5. Window Functions

Window functions perform calculations across rows related to the current row. Unlike GROUP BY, they keep every input row and add the computed value as an extra column.

Common window functions:
- **ROW_NUMBER()**: Assign unique sequential number
- **RANK()**: Ranking with gaps for ties
- **DENSE_RANK()**: Ranking without gaps
- **LAG()/LEAD()**: Access previous/next row
- **SUM/AVG/COUNT OVER()**: Running aggregates
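
The three ranking functions differ only in how they treat ties, which the randomly generated prices in this notebook will rarely produce. The plain-Python sketch below mimics what each function returns for values ordered from highest to lowest. It illustrates the semantics only, not how DuckDB computes them, and `rank_variants` is a hypothetical helper:

```python
def rank_variants(values):
    """Return (value, row_number, rank, dense_rank) tuples, highest value first."""
    ordered = sorted(values, reverse=True)
    rows = []
    rank = dense_rank = 0
    previous = None
    for row_number, value in enumerate(ordered, start=1):
        if value != previous:
            rank = row_number   # RANK jumps to the current position (leaves gaps)
            dense_rank += 1     # DENSE_RANK advances by exactly one (no gaps)
            previous = value
        rows.append((value, row_number, rank, dense_rank))
    return rows


ranked = rank_variants([500, 300, 500, 100])
for row in ranked:
    print(row)
```

The two tied values of 500 share rank 1 under both RANK and DENSE_RANK; the next value gets 3 from RANK but 2 from DENSE_RANK, while ROW_NUMBER stays unique throughout.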

In [None]:
# ROW_NUMBER - assign sequential numbers
result = con.execute("""
    SELECT 
        product_name,
        category,
        price,
        ROW_NUMBER() OVER (ORDER BY price DESC) as price_rank
    FROM products
    ORDER BY price_rank
    LIMIT 10
""").df()

result

In [None]:
# RANK and DENSE_RANK - handle ties differently
result = con.execute("""
    SELECT 
        product_name,
        price,
        RANK() OVER (ORDER BY price DESC) as rank,
        DENSE_RANK() OVER (ORDER BY price DESC) as dense_rank,
        ROW_NUMBER() OVER (ORDER BY price DESC) as row_num
    FROM products
    ORDER BY row_num
    LIMIT 15
""").df()

result

In [None]:
# PARTITION BY - rank within groups
result = con.execute("""
    SELECT 
        product_name,
        category,
        price,
        RANK() OVER (PARTITION BY category ORDER BY price DESC) as rank_in_category
    FROM products
    ORDER BY category, rank_in_category
    LIMIT 20
""").df()

print("Products ranked by price within each category (first 20 rows):")
result

In [None]:
# LAG and LEAD - access adjacent rows
result = con.execute("""
    SELECT 
        order_id,
        order_date,
        shipping_cost,
        LAG(shipping_cost) OVER (ORDER BY order_date) as previous_shipping,
        LEAD(shipping_cost) OVER (ORDER BY order_date) as next_shipping
    FROM orders
    ORDER BY order_date
    LIMIT 10
""").df()

result

In [None]:
# Running total - cumulative sum
result = con.execute("""
    SELECT 
        order_id,
        order_date,
        shipping_cost,
        SUM(shipping_cost) OVER (
            ORDER BY order_date 
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) as cumulative_shipping
    FROM orders
    ORDER BY order_date
    LIMIT 15
""").df()

result

In [None]:
# Moving average - last 3 orders
result = con.execute("""
    SELECT 
        order_id,
        order_date,
        shipping_cost,
        AVG(shipping_cost) OVER (
            ORDER BY order_date 
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) as moving_avg_3
    FROM orders
    ORDER BY order_date
    LIMIT 15
""").df()

result
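
The frame clause `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` averages the current row together with up to two rows before it, so the first two rows average over fewer than three values. A plain-Python sketch of that frame logic; the helper `moving_average` and the sample costs are illustrative:

```python
def moving_average(values, preceding=2):
    """Average each value with up to `preceding` earlier values."""
    averages = []
    for i in range(len(values)):
        start = max(0, i - preceding)  # the frame is clipped at the first row
        window = values[start:i + 1]   # rows in the frame, current row included
        averages.append(sum(window) / len(window))
    return averages


costs = [10.0, 20.0, 30.0, 40.0]
averages = moving_average(costs)
print(averages)
```

Setting `preceding` to `len(values)` would reproduce the `UNBOUNDED PRECEDING` frame used for the running total, except that it averages instead of sums.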

## 6. CTEs (Common Table Expressions)

CTEs make complex queries more readable by breaking them into named subqueries.

In [None]:
# Single CTE - calculate order totals, then aggregate
result = con.execute("""
    WITH order_totals AS (
        SELECT 
            oi.order_id,
            SUM(oi.quantity * oi.unit_price) as total_amount
        FROM order_items oi
        GROUP BY oi.order_id
    )
    SELECT 
        o.status,
        COUNT(*) as num_orders,
        AVG(ot.total_amount) as avg_order_value,
        SUM(ot.total_amount) as total_revenue
    FROM orders o
    JOIN order_totals ot ON o.order_id = ot.order_id
    GROUP BY o.status
    ORDER BY total_revenue DESC
""").df()

result

In [None]:
# Multiple CTEs
result = con.execute("""
    WITH order_totals AS (
        SELECT 
            order_id,
            SUM(quantity * unit_price) as total_amount
        FROM order_items
        GROUP BY order_id
    ),
    customer_stats AS (
        SELECT 
            c.customer_id,
            c.name,
            c.loyalty_tier,
            COUNT(o.order_id) as num_orders,
            COALESCE(SUM(ot.total_amount), 0) as total_spent
        FROM customers c
        LEFT JOIN orders o ON c.customer_id = o.customer_id
        LEFT JOIN order_totals ot ON o.order_id = ot.order_id
        GROUP BY c.customer_id, c.name, c.loyalty_tier
    )
    SELECT 
        loyalty_tier,
        COUNT(*) as num_customers,
        AVG(num_orders) as avg_orders_per_customer,
        AVG(total_spent) as avg_total_spent,
        SUM(total_spent) as tier_total_revenue
    FROM customer_stats
    GROUP BY loyalty_tier
    ORDER BY tier_total_revenue DESC
""").df()

print("Customer statistics by loyalty tier:")
result

## 7. Subqueries

In [None]:
# Subquery in WHERE clause - find above average
result = con.execute("""
    SELECT product_name, price
    FROM products
    WHERE price > (SELECT AVG(price) FROM products)
    ORDER BY price DESC
""").df()

print("Products priced above average:")
result.head(10)

In [None]:
# Subquery in SELECT - calculate difference from average
result = con.execute("""
    SELECT 
        product_name,
        price,
        (SELECT AVG(price) FROM products) as avg_price,
        price - (SELECT AVG(price) FROM products) as diff_from_avg
    FROM products
    ORDER BY diff_from_avg DESC
    LIMIT 10
""").df()

result

In [None]:
# Correlated subquery - find most expensive product per category
result = con.execute("""
    SELECT 
        p1.product_name,
        p1.category,
        p1.price
    FROM products p1
    WHERE p1.price = (
        SELECT MAX(p2.price)
        FROM products p2
        WHERE p2.category = p1.category
    )
    ORDER BY p1.category, p1.price DESC
""").df()

print("Most expensive product in each category:")
result

## 8. CASE Statements

CASE expressions add conditional logic to SQL queries.
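
A searched CASE checks its WHEN branches from top to bottom and returns the first one that matches, much like an if/elif chain. A plain-Python analogue of price bucketing, using the same thresholds as the first query in this section (the function name `price_category` is illustrative):

```python
def price_category(price):
    """Mimic a searched CASE: the first condition that holds wins."""
    if price < 50:
        return "Budget"
    elif price < 200:   # reached only when price >= 50
        return "Mid-range"
    elif price < 500:   # reached only when price >= 200
        return "Premium"
    else:               # plays the role of ELSE
        return "Luxury"


labels = [price_category(p) for p in (49.99, 50, 199.99, 500)]
print(labels)
```

Because earlier branches have already excluded lower prices, the explicit lower bounds in the SQL version (`price >= 50 AND ...`) are optional; they are kept there for readability. One difference from Python: in SQL, a NULL price fails every comparison and falls through to ELSE.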

In [None]:
# Searched CASE - categorize prices (branches are checked top to bottom)
result = con.execute("""
    SELECT 
        product_name,
        price,
        CASE 
            WHEN price < 50 THEN 'Budget'
            WHEN price >= 50 AND price < 200 THEN 'Mid-range'
            WHEN price >= 200 AND price < 500 THEN 'Premium'
            ELSE 'Luxury'
        END as price_category
    FROM products
    ORDER BY price DESC
    LIMIT 15
""").df()

result

In [None]:
# CASE in aggregation
result = con.execute("""
    SELECT 
        category,
        COUNT(*) as total_products,
        SUM(CASE WHEN price < 100 THEN 1 ELSE 0 END) as budget_count,
        SUM(CASE WHEN price >= 100 AND price < 500 THEN 1 ELSE 0 END) as mid_count,
        SUM(CASE WHEN price >= 500 THEN 1 ELSE 0 END) as premium_count
    FROM products
    GROUP BY category
""").df()

result

## 9. Complex Real-World Query

Combine everything: CTEs, JOINs, window functions, aggregations.

In [None]:
# Find top customers by revenue with running totals and rankings
result = con.execute("""
    WITH order_totals AS (
        -- Calculate total for each order
        SELECT 
            order_id,
            SUM(quantity * unit_price) as order_total
        FROM order_items
        GROUP BY order_id
    ),
    customer_revenue AS (
        -- Calculate revenue per customer
        SELECT 
            c.customer_id,
            c.name,
            c.country,
            c.loyalty_tier,
            COUNT(DISTINCT o.order_id) as num_orders,
            SUM(ot.order_total) as total_revenue,
            AVG(ot.order_total) as avg_order_value
        FROM customers c
        INNER JOIN orders o ON c.customer_id = o.customer_id
        INNER JOIN order_totals ot ON o.order_id = ot.order_id
        WHERE o.status = 'Delivered'
        GROUP BY c.customer_id, c.name, c.country, c.loyalty_tier
    )
    SELECT 
        name,
        country,
        loyalty_tier,
        num_orders,
        ROUND(total_revenue, 2) as total_revenue,
        ROUND(avg_order_value, 2) as avg_order_value,
        RANK() OVER (ORDER BY total_revenue DESC) as revenue_rank,
        ROUND(SUM(total_revenue) OVER (
            ORDER BY total_revenue DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ), 2) as running_total_revenue
    FROM customer_revenue
    ORDER BY total_revenue DESC
    LIMIT 20
""").df()

print("Top 20 customers by revenue (delivered orders only):")
result

## 10. Summary

### Key SQL Concepts Covered:

1. **Basic Queries**
   - SELECT, WHERE, ORDER BY, LIMIT
   - Filtering with IN, LIKE, comparison operators

2. **Aggregations**
   - COUNT, SUM, AVG, MIN, MAX
   - GROUP BY, HAVING

3. **Joins**
   - INNER JOIN, LEFT JOIN
   - Multi-table joins

4. **Window Functions**
   - ROW_NUMBER, RANK, DENSE_RANK
   - LAG, LEAD
   - Running totals, moving averages
   - PARTITION BY

5. **CTEs**
   - Named subqueries for readability
   - Multiple CTEs in sequence

6. **Subqueries**
   - In WHERE, SELECT clauses
   - Correlated subqueries

7. **CASE Statements**
   - Conditional logic
   - In SELECT and aggregations

### Best Practices:

- Use **CTEs** for complex queries instead of nested subqueries
- **Filter early** with WHERE to reduce data volume
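- **Indexes** on JOIN and WHERE columns matter most in row-oriented databases; the DataFrame-backed tables in this notebook have none, and DuckDB scans them directly
- **Window functions** are powerful but can be expensive on large tables, since each distinct `OVER` clause may need its own partitioning and sort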
- **EXPLAIN** queries to understand execution plans
- Use **column aliases** for clarity

In [None]:
# Close connection
con.close()
print("Notebook complete!")