# Day 1, Block B: Window Functions Primer

**Duration:** 25-30 minutes  
**Course:** ECBS5294 - Introduction to Data Science: Working with Data  
**Instructor:** Eduardo Ariño de la Rubia

---

## Learning Objectives

By the end of this session, you will be able to:

1. **Explain the mental model:** Windows preserve rows; GROUP BY collapses rows
2. **Decide when to use** window functions vs GROUP BY
3. **Use ROW_NUMBER()** for "latest record per group" problems
4. **Use LAG()** for period-over-period comparisons
5. **Create moving averages** with ROWS BETWEEN
6. **Understand window function syntax** (PARTITION BY, ORDER BY, frame)

---

## 1. Introduction: The Power of Windows

### The Challenge

You just learned GROUP BY. It's powerful:
- "Total revenue per product" ✅
- "Count of transactions per month" ✅

But sometimes GROUP BY has a **limitation:**

**Problem:** "I want to see each transaction AND the total for that product."

With GROUP BY:
- You can see the total per product (1 row per product)
- OR you can see all transactions (many rows)
- But **not both at the same time!**

GROUP BY **collapses** rows. What if you want the calculation **without collapsing**?

**Enter: Window Functions**

> **Window functions let you add aggregates to your data WITHOUT collapsing rows.**

This is incredibly powerful for analytics!
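
To make the "each transaction AND the total" idea concrete, here is a minimal sketch on a made-up three-row table. It uses Python's built-in `sqlite3` module purely as a stand-in engine so it runs without any data files; the table `tx` and its values are hypothetical, and the same SQL is expected to behave the same way in DuckDB. The `PARTITION BY` clause it uses is explained later in this notebook.

```python
import sqlite3

# Hypothetical toy data: three transactions across two products
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE tx (id INTEGER, product TEXT, amount REAL)")
demo.executemany(
    "INSERT INTO tx VALUES (?, ?, ?)",
    [(1, "Coffee", 3.0), (2, "Coffee", 4.0), (3, "Tea", 2.5)],
)

# Every transaction stays, and each row also carries its product's total
tx_with_totals = demo.execute("""
    SELECT id, product, amount,
           SUM(amount) OVER (PARTITION BY product) AS product_total
    FROM tx
    ORDER BY id
""").fetchall()

for row in tx_with_totals:
    print(row)
```

Three rows go in and three rows come out; the only change is the extra `product_total` column.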

---

## 2. Setup

### Why a New Dataset?

We're switching from cafe sales to **Superstore** data because:
- Better **time series** (4 years of data)
- Multiple orders **per customer** (great for ROW_NUMBER examples)
- Daily data (perfect for moving averages)

Superstore is a classic teaching dataset - it's clean, realistic, and perfect for window functions!

In [None]:
# Imports
import duckdb
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

print("✅ Libraries imported!")

In [None]:
# Connect to DuckDB
con = duckdb.connect(':memory:')

print("✅ Connected to DuckDB!")

In [None]:
# Load Superstore data
# Note: encoding='latin-1' because this CSV is not UTF-8 encoded
superstore = pd.read_csv('../data/day1/Sample - Superstore.csv', encoding='latin-1')

# Cast Order Date to datetime for proper DATE_TRUNC support
superstore['Order Date'] = pd.to_datetime(superstore['Order Date'])

# Register with DuckDB
con.register('superstore', superstore)

print(f"✅ Loaded {len(superstore):,} rows!")

In [None]:
# Explore the data
con.execute("""
    SELECT 
        "Order ID",
        "Order Date",
        "Customer ID",
        "Customer Name",
        Category,
        "Product Name",
        Sales
    FROM superstore
    LIMIT 5
""").df()

In [None]:
# Check date range
con.execute("""
    SELECT 
        MIN("Order Date") as first_order,
        MAX("Order Date") as last_order,
        COUNT(DISTINCT "Customer ID") as unique_customers,
        COUNT(*) as total_orders
    FROM superstore
""").df()

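**Perfect!** ~10,000 rows across 4 years from ~800 customers. Note that `COUNT(*)` counts rows, and each row here is one line item of an order, so `total_orders` is really a count of order lines. This is great data for learning window functions.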

---

## 3. The Mental Model: Windows vs GROUP BY

> **🚨 THIS IS THE MOST IMPORTANT CONCEPT**

### The Core Difference

| | GROUP BY | Window Functions |
|---|---|---|
| **What happens to rows?** | Collapses to summary | Keeps all rows |
| **Output row count** | Fewer rows (one per group) | Same row count as input |
| **Use when** | You want summary only | You want detail + calculation |
| **Example** | "Total sales per category" | "Each order + category total" |

Let's see this in action with real queries.

### Example: GROUP BY (Collapses Rows)

In [None]:
# ==============================================================================
# THE CRITICAL DIFFERENCE: Side-by-Side Comparison
# ==============================================================================

print("="*70)
print("APPROACH 1: GROUP BY (Collapses Rows)")
print("="*70)
result_groupby = con.execute("""
    SELECT 
        Category,
        COUNT(*) AS order_count
    FROM superstore
    GROUP BY Category
    ORDER BY order_count DESC
""").df()

print(f"\n📊 Input: {len(superstore):,} rows")
print(f"📉 Output: {len(result_groupby)} rows (one per category)")
print("❌ We LOST all the details! Which products? Which customers? When?\n")
display(result_groupby)

print("\n" + "="*70)
print("APPROACH 2: WINDOW FUNCTION (Preserves Rows)")
print("="*70)
result_window = con.execute("""
    SELECT 
        "Order ID",
        "Product Name",
        Category,
        Sales,
        COUNT(*) OVER (PARTITION BY Category) AS category_order_count
    FROM superstore
    LIMIT 10
""").df()

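# The query above uses LIMIT 10 for display; without it, every input row comes back
print(f"\n📊 Input: {len(superstore):,} rows")
print(f"📈 Output without LIMIT: {len(superstore):,} rows (all kept!)")
print("✅ We KEPT everything AND added the count!\n")
print(f"(Showing first {len(result_window)} rows)\n")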
display(result_window)

print("\n" + "="*70)
print("🔑 KEY INSIGHT:")
print("="*70)
print("   GROUP BY:  9,994 rows  →  3 rows      (COLLAPSED)")
print("   Window:    9,994 rows  →  9,994 rows  (PRESERVED)")
print("="*70)
print("\n💡 This is why window functions are powerful:")
print("   You get the DETAIL + the AGGREGATE in the same result!")
print("="*70)

### Example: Window Function (Keeps All Rows)

**See the difference?**
- All detail rows are still there!
- But we've added a new column: `category_order_count`
- Every row in "Furniture" shows the same count (how many total Furniture orders)
- Every row in "Technology" shows its count

**The calculation happened "over a window" of rows, but we kept all rows!**

---

## 4. When to Use Each

### Decision Guide

**Use GROUP BY when:**
- ✅ You want summary only (one row per group)
- ✅ You don't need row-level details
- ✅ Example: "What's our revenue per region?" (just the totals)

**Use Window Functions when:**
- ✅ You want detail + calculation
- ✅ You need ranking (1st, 2nd, 3rd...)
- ✅ You need row-to-row comparisons ("this month vs last month")
- ✅ You need to filter AFTER calculating ("show me the top 3 per category")
- ✅ Example: "Show me all orders, with each order's rank within its category"

**Key insight:** If GROUP BY loses information you need, use window functions!
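
As a small illustration of the "each order's rank within its category" case, here is a sketch on invented data. It runs on Python's built-in `sqlite3` only so that it needs no files; the `orders` table, its columns, and its values are hypothetical, and the query is meant to carry over to DuckDB unchanged.

```python
import sqlite3

# Hypothetical toy data: five orders in two categories
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (order_id TEXT, category TEXT, sales REAL)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("A1", "Furniture", 200.0),
        ("A2", "Furniture", 500.0),
        ("A3", "Technology", 900.0),
        ("A4", "Technology", 300.0),
        ("A5", "Technology", 650.0),
    ],
)

# All five orders are kept; each one gets its rank by sales inside its category
ranked = demo.execute("""
    SELECT order_id, category, sales,
           ROW_NUMBER() OVER (
               PARTITION BY category
               ORDER BY sales DESC
           ) AS rank_in_category
    FROM orders
    ORDER BY category, rank_in_category
""").fetchall()

for row in ranked:
    print(row)
```

A GROUP BY on `category` could report the largest sale per category, but it could not show where every individual order stands.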

---

## 5. Basic Syntax: The OVER Clause

### Simple Example: No PARTITION or ORDER

In [None]:
# Add total order count to every row
con.execute("""
    SELECT 
        "Order ID",
        "Product Name",
        COUNT(*) OVER () AS total_orders_in_dataset
    FROM superstore
    LIMIT 5
""").df()

**What happened:** `COUNT(*) OVER ()` with empty `()` means "count ALL rows" and add that number to every row.
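
One common use of an empty `OVER ()` is a share-of-total column. The sketch below assumes a hypothetical `regional_sales` table with made-up numbers and uses Python's built-in `sqlite3` as a stand-in engine so it runs anywhere; the SQL itself should work the same in DuckDB.

```python
import sqlite3

# Hypothetical toy data: sales for three regions
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE regional_sales (region TEXT, sales REAL)")
demo.executemany(
    "INSERT INTO regional_sales VALUES (?, ?)",
    [("East", 10.0), ("South", 30.0), ("West", 60.0)],
)

# SUM(...) OVER () is the grand total, repeated on every row,
# so each row can be divided by it without a separate query
shares = demo.execute("""
    SELECT region,
           sales,
           SUM(sales) OVER () AS grand_total,
           100.0 * sales / SUM(sales) OVER () AS pct_of_total
    FROM regional_sales
    ORDER BY region
""").fetchall()

for row in shares:
    print(row)
```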

### With PARTITION BY

In [None]:
# Add count PER CATEGORY to every row
con.execute("""
    SELECT 
        "Order ID",
        "Product Name",
        Category,
        COUNT(*) OVER (PARTITION BY Category) AS category_count
    FROM superstore
    LIMIT 10
""").df()

**PARTITION BY is like GROUP BY for window functions!**
- "PARTITION BY Category" = "For each category..."
- COUNT happens within each partition
- But all rows are kept!

---

## 6. Use Case 1: ROW_NUMBER() for "Latest Per Group"

### The Business Problem

> **"I want the most recent order for each customer."**

This is a VERY common pattern in data analysis:
- Latest transaction per customer
- Most recent login per user
- Current status per order

### Why GROUP BY Fails

In [None]:
# Try with GROUP BY: Get latest date per customer
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        MAX("Order Date") AS latest_order_date
    FROM superstore
    GROUP BY "Customer ID", "Customer Name"
    LIMIT 5
""").df()

**Problem:** We got the date, but we **lost the order details!**
- What was ordered?
- What category?
- Order ID?
- Sales amount?

GROUP BY collapsed everything. We need a different approach.

### Solution: ROW_NUMBER()

> **ROW_NUMBER() assigns a sequential number to each row within a group**

Strategy:
1. For each customer, rank orders by date (newest = 1)
2. Keep all the row details
3. Filter to rank = 1

In [None]:
# Step 1: Add row numbers
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        "Order ID",
        "Order Date",
        Category,
        Sales,
        ROW_NUMBER() OVER (
            PARTITION BY "Customer ID" 
            ORDER BY "Order Date" DESC
        ) AS row_num
    FROM superstore
    LIMIT 20
""").df()

**Breaking it down:**
- `PARTITION BY "Customer ID"` = For each customer...
- `ORDER BY "Order Date" DESC` = Sort by date, newest first
- `ROW_NUMBER()` = Assign 1, 2, 3, ...

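**Result:** Each customer's rows are numbered, with 1 = most recent! When several rows share the same date (for example, multiple line items in one order), which of them gets 1 is arbitrary unless you add a tie-breaker column to the `ORDER BY`.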
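
Here is a minimal sketch of ROW_NUMBER() with a second sort key, which keeps the numbering repeatable when two rows have the same date. The `orders` table and its values are invented for illustration, and Python's built-in `sqlite3` stands in for DuckDB so the example needs no data files. Using `order_id` as the tie-breaker is an assumption; any column that makes the ordering unique would do.

```python
import sqlite3

# Hypothetical toy data: customer C1 has two rows on the same date
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE orders (customer_id TEXT, order_id TEXT, order_date TEXT)"
)
demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("C1", "O-1", "2024-01-05"),
        ("C1", "O-2", "2024-03-10"),
        ("C1", "O-3", "2024-03-10"),
        ("C2", "O-4", "2024-02-01"),
    ],
)

# Newest date first; among rows with equal dates, the highest order_id wins
latest = demo.execute("""
    SELECT customer_id, order_id, order_date
    FROM (
        SELECT customer_id, order_id, order_date,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY order_date DESC, order_id DESC
               ) AS row_num
        FROM orders
    )
    WHERE row_num = 1
    ORDER BY customer_id
""").fetchall()

for row in latest:
    print(row)
```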

### Step 2: Filter to Latest Only

In [None]:
# Now filter to row_num = 1 using a subquery
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        "Order ID",
        "Order Date",
        Category,
        "Product Name",
        Sales
    FROM (
        SELECT 
            "Customer ID",
            "Customer Name",
            "Order ID",
            "Order Date",
            Category,
            "Product Name",
            Sales,
            ROW_NUMBER() OVER (
                PARTITION BY "Customer ID" 
                ORDER BY "Order Date" DESC
            ) AS row_num
        FROM superstore
    )
    WHERE row_num = 1
    ORDER BY "Order Date" DESC
    LIMIT 10
""").df()

---

### ⏸️ Pause and Try!

**Your task:** Modify the ROW_NUMBER query from Step 2 above to get the **3 most recent** orders per customer (not just the latest).

**Requirements:**
1. Use the same ROW_NUMBER pattern from the example above
2. Change the `WHERE` filter to get top 3 instead of latest (hint: `<= 3`)
3. Keep all the same columns in the output
4. Order by Customer ID and row_num
5. Limit to 15 rows total

Replace the placeholder query in the cell below with your complete SQL query.

In [None]:
# Your turn! Write your TOP 3 query here:
con.execute("SELECT 1 AS todo").df()  # Replace this entire query with your answer

### Alternative: Top 3 Orders Per Customer

In [None]:
# Get top 3 most recent orders per customer
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        "Order Date",
        Sales,
        row_num
    FROM (
        SELECT 
            "Customer ID",
            "Customer Name",
            "Order Date",
            Sales,
            ROW_NUMBER() OVER (
                PARTITION BY "Customer ID" 
                ORDER BY "Order Date" DESC
            ) AS row_num
        FROM superstore
    )
    WHERE row_num <= 3
    ORDER BY "Customer ID", row_num
    LIMIT 15
""").df()

**Just change the filter!** `WHERE row_num <= 3` gives top 3 per customer.

**Use cases:**
- Top 5 products per category
- Latest 10 transactions per account
- Most recent 3 logins per user

---

## 7. Use Case 2: LAG() for Period-over-Period Comparison

### The Business Problem

> **"What's the month-over-month change in sales?"**

You want to compare:
- This month vs last month
- This quarter vs last quarter
- Today vs yesterday

You need to access the **previous row's value**. That's what LAG() does!

### Step 1: Aggregate to Monthly Sales

In [None]:
# First, get monthly totals
monthly_sales = con.execute("""
    SELECT 
        DATE_TRUNC('month', "Order Date") AS month,
        ROUND(SUM(Sales), 2) AS monthly_sales
    FROM superstore
    GROUP BY month
    ORDER BY month
""").df()

print(f"Monthly sales for {len(monthly_sales)} months:")
monthly_sales.head(10)

**Good!** We have one row per month. Now let's add the previous month's sales.

### Step 2: Add LAG() for Previous Month

In [None]:
# Add previous month's sales using LAG()
con.execute("""
    SELECT 
        DATE_TRUNC('month', "Order Date") AS month,
        ROUND(SUM(Sales), 2) AS monthly_sales,
        LAG(ROUND(SUM(Sales), 2), 1) OVER (ORDER BY DATE_TRUNC('month', "Order Date")) AS prev_month_sales
    FROM superstore
    GROUP BY month
    ORDER BY month
    LIMIT 12
""").df()

**See that?**
- `prev_month_sales` is the value from the row **before**
- First row is NULL (no previous month)
- Second row shows first month's value
- And so on...

**Syntax:**
- `LAG(column, 1)` = Get value from 1 row before
- `LAG(column, 2)` = Get value from 2 rows before
- `ORDER BY month` = Defines what "before" means!
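
The offsets are easiest to see on a tiny table. This sketch uses a hypothetical four-row `monthly` table and Python's built-in `sqlite3` as a stand-in engine. It also shows an optional third argument to `LAG()`, a default value returned when there is no earlier row; that argument is not used elsewhere in this notebook, but both SQLite and DuckDB accept it.

```python
import sqlite3

# Hypothetical toy data: four months of sales
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE monthly (month INTEGER, sales INTEGER)")
demo.executemany(
    "INSERT INTO monthly VALUES (?, ?)",
    [(1, 100), (2, 120), (3, 90), (4, 150)],
)

lagged = demo.execute("""
    SELECT month,
           sales,
           LAG(sales, 1)    OVER (ORDER BY month) AS prev_1,         -- one row back
           LAG(sales, 2)    OVER (ORDER BY month) AS prev_2,         -- two rows back
           LAG(sales, 1, 0) OVER (ORDER BY month) AS prev_1_or_zero  -- 0 instead of NULL
    FROM monthly
    ORDER BY month
""").fetchall()

for row in lagged:
    print(row)
```

Rows without enough history come back as `None` in Python (SQL NULL), except in the column that supplies a default.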

### Step 3: Calculate Change

In [None]:
# Calculate month-over-month change
con.execute("""
    WITH monthly AS (
        SELECT 
            DATE_TRUNC('month', "Order Date") AS month,
            ROUND(SUM(Sales), 2) AS monthly_sales
        FROM superstore
        GROUP BY month
    )
    SELECT 
        month,
        monthly_sales,
        LAG(monthly_sales, 1) OVER (ORDER BY month) AS prev_month,
        ROUND(monthly_sales - LAG(monthly_sales, 1) OVER (ORDER BY month), 2) AS change,
        ROUND(
            100.0 * (monthly_sales - LAG(monthly_sales, 1) OVER (ORDER BY month)) / 
            LAG(monthly_sales, 1) OVER (ORDER BY month), 
            2
        ) AS pct_change
    FROM monthly
    ORDER BY month
    LIMIT 12
""").df()

**Business insights!**
- See which months grew vs declined
- Calculate percent change
- Spot trends

**Note:** We used a `WITH` clause (a Common Table Expression, or CTE) to compute the monthly totals first, which keeps the main query cleaner. This is more advanced, but useful!
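
One possible way to tidy the month-over-month query further is to compute `LAG()` once in its own CTE and then reuse that column. The sketch below is a variation, not the query used above: the three-row `monthly` table is made up, Python's built-in `sqlite3` stands in for DuckDB, and the `NULLIF(prev_month, 0)` guard against a zero denominator is an added assumption.

```python
import sqlite3

# Hypothetical toy data: three months of already-aggregated sales
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE monthly (month INTEGER, monthly_sales REAL)")
demo.executemany(
    "INSERT INTO monthly VALUES (?, ?)",
    [(1, 100.0), (2, 125.0), (3, 100.0)],
)

mom = demo.execute("""
    WITH with_prev AS (
        -- Compute the previous month's value exactly once
        SELECT month,
               monthly_sales,
               LAG(monthly_sales, 1) OVER (ORDER BY month) AS prev_month
        FROM monthly
    )
    SELECT month,
           monthly_sales,
           prev_month,
           ROUND(monthly_sales - prev_month, 2) AS change,
           -- NULLIF turns a zero denominator into NULL instead of dividing by zero
           ROUND(100.0 * (monthly_sales - prev_month) / NULLIF(prev_month, 0), 2) AS pct_change
    FROM with_prev
    ORDER BY month
""").fetchall()

for row in mom:
    print(row)
```

Month 2 rises from 100 to 125 (+25%), and month 3 falls from 125 to 100 (-20%): the same absolute change gives different percentages because the base differs.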

### Why ORDER BY Matters

**Without ORDER BY, LAG() doesn't know what "previous" means!**

```sql
-- ❌ Wrong - undefined order
LAG(sales) OVER ()

-- ✅ Correct - ordered by time
LAG(sales) OVER (ORDER BY month)
```

**Always ORDER BY the dimension you're comparing across** (usually time).

### LEAD(): The Opposite

`LEAD()` gets the value from rows **after** instead of before:

In [None]:
# Compare to NEXT month instead of previous
con.execute("""
    WITH monthly AS (
        SELECT 
            DATE_TRUNC('month', "Order Date") AS month,
            ROUND(SUM(Sales), 2) AS monthly_sales
        FROM superstore
        GROUP BY month
    )
    SELECT 
        month,
        monthly_sales,
        LEAD(monthly_sales, 1) OVER (ORDER BY month) AS next_month_sales
    FROM monthly
    ORDER BY month
    LIMIT 10
""").df()

**Use case:** Looking ahead from each row, for example "How long until this customer's next order?" or "Did sales rise or fall in the month that followed?"
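
`LEAD()` combines with `PARTITION BY` just like the other window functions. The sketch below looks up each customer's next order date and then computes the gap in days in Python. The `orders` table and its dates are hypothetical, and Python's built-in `sqlite3` is used as a stand-in engine; the date arithmetic is done in Python because date functions differ between SQL engines.

```python
import sqlite3
from datetime import date

# Hypothetical toy data: three orders for C1, one for C2
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (customer_id TEXT, order_date TEXT)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("C1", "2024-01-01"),
        ("C1", "2024-01-11"),
        ("C1", "2024-02-10"),
        ("C2", "2024-03-01"),
    ],
)

# For each order, fetch the same customer's next order date (NULL if none)
next_orders = demo.execute("""
    SELECT customer_id,
           order_date,
           LEAD(order_date, 1) OVER (
               PARTITION BY customer_id
               ORDER BY order_date
           ) AS next_order_date
    FROM orders
    ORDER BY customer_id, order_date
""").fetchall()

# Turn the pair of ISO date strings into a gap in days
gaps = []
for customer_id, order_date, next_order_date in next_orders:
    if next_order_date is None:
        days_to_next = None
    else:
        days_to_next = (
            date.fromisoformat(next_order_date) - date.fromisoformat(order_date)
        ).days
    gaps.append((customer_id, order_date, days_to_next))

for row in gaps:
    print(row)
```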

---

## 8. Use Case 3: Moving Average

### The Business Problem

> **"Daily sales are noisy - smooth them with a 7-day moving average."**

**Why moving averages?**
- Remove day-to-day volatility
- See underlying trends
- Common in time series analysis

**What's a moving average?**
- For each day, average that day + the 6 days before it
- "Window" slides forward each day
- Smooths out spikes and dips
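
The arithmetic behind a trailing moving average can be sketched in a few lines of plain Python, with no SQL involved. The numbers in `toy_sales` are invented; the point is only to show how the window slides and why the first few values average fewer than 7 numbers.

```python
# Hypothetical daily sales for 8 consecutive days
toy_sales = [10, 20, 30, 40, 50, 60, 70, 80]
window_size = 7

moving_avg = []
for i in range(len(toy_sales)):
    # Take the current day plus up to 6 days before it
    start = max(0, i - (window_size - 1))
    window = toy_sales[start:i + 1]
    moving_avg.append(sum(window) / len(window))

print(moving_avg)
# [10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 50.0]
```

Only the last two values are averages over a full 7 days; the earlier ones use whatever history exists so far.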

### Step 1: Aggregate to Daily Sales

In [None]:
# Get daily sales
daily_sales = con.execute("""
    SELECT 
        "Order Date" AS date,
        ROUND(SUM(Sales), 2) AS daily_sales
    FROM superstore
    GROUP BY date
    ORDER BY date
    LIMIT 20
""").df()

print("Daily sales (first 20 dates with orders):")
daily_sales

**See the volatility?** Some days high, some low. Hard to see the trend.

### Step 2: Add 7-Day Moving Average

In [None]:
# Add 7-day moving average
con.execute("""
    WITH daily AS (
        SELECT 
            "Order Date" AS date,
            ROUND(SUM(Sales), 2) AS daily_sales
        FROM superstore
        GROUP BY date
    )
    SELECT 
        date,
        daily_sales,
        ROUND(
            AVG(daily_sales) OVER (
                ORDER BY date
                ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
            ), 
            2
        ) AS moving_avg_7day
    FROM daily
    ORDER BY date
    LIMIT 20
""").df()

**Breaking down the syntax:**

```sql
AVG(daily_sales) OVER (
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
)
```

- `AVG(daily_sales)` - Calculate average
- `ORDER BY date` - Order matters! (need to know which rows are "before")
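- `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` - The magic part!
  - `6 PRECEDING` = 6 rows before current
  - `CURRENT ROW` = current row
  - Total: up to 7 rows (6 before + current); the first 6 rows of the result average fewer values because fewer rows precede them
  - `ROWS` counts rows, not calendar days: if some dates have no orders, the window spans more than 7 calendar days. When you need a true calendar window, DuckDB also accepts a `RANGE` frame with an interval offset, e.g. `RANGE BETWEEN INTERVAL 6 DAYS PRECEDING AND CURRENT ROW`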

**Visual:**
```
For row at day 10:
[day 4][day 5][day 6][day 7][day 8][day 9][day 10]
 ↑                                            ↑
 6 preceding                          current row
 
 Average these 7 days
```
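
One thing the diagram takes for granted is that every calendar day has a row. A `ROWS` frame counts rows, so a gap in the dates stretches the window. The sketch below shows this with a 3-row window on a hypothetical `daily` table that skips from January 3 to January 10; Python's built-in `sqlite3` is used as a stand-in engine, and `MIN(day)` over the same frame is added only to reveal where each window starts.

```python
import sqlite3

# Hypothetical toy data: note the gap between Jan 3 and Jan 10
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE daily (day TEXT, daily_sales REAL)")
demo.executemany(
    "INSERT INTO daily VALUES (?, ?)",
    [
        ("2024-01-01", 10.0),
        ("2024-01-02", 20.0),
        ("2024-01-03", 30.0),
        ("2024-01-10", 40.0),
    ],
)

windowed = demo.execute("""
    SELECT day,
           daily_sales,
           MIN(day) OVER (
               ORDER BY day
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS window_start,
           AVG(daily_sales) OVER (
               ORDER BY day
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS moving_avg_3row
    FROM daily
    ORDER BY day
""").fetchall()

for row in windowed:
    print(row)
```

For the January 10 row the window reaches back to January 2, so the "3-row" average covers nine calendar days.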

### The Smoothing Effect

In [None]:
# Let's see more data to see the smoothing
con.execute("""
    WITH daily AS (
        SELECT 
            "Order Date" AS date,
            ROUND(SUM(Sales), 2) AS daily_sales
        FROM superstore
        GROUP BY date
    )
    SELECT 
        date,
        daily_sales,
        ROUND(
            AVG(daily_sales) OVER (
                ORDER BY date
                ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
            ), 
            2
        ) AS moving_avg_7day,
        ROUND(daily_sales - AVG(daily_sales) OVER (
                ORDER BY date
                ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
            ), 2) AS deviation_from_avg
    FROM daily
    ORDER BY date
    LIMIT 30
""").df()

**Notice:**
- Daily sales jumps around (high volatility)
- Moving average is smoother (less volatile)
- You can see the trend more clearly

**Business use:** "Is our sales trend up or down?" Moving average makes it clear.

### Other Frame Options

**Frame syntax:** `ROWS BETWEEN <start> AND <end>`

Common patterns:

```sql
-- 7-day moving average
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW

-- 3-day centered average (1 before, current, 1 after)
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING

-- Running total (all rows from start to current)
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

-- Current row plus the next 5 rows (6 rows total)
ROWS BETWEEN CURRENT ROW AND 5 FOLLOWING
```

**Key:** ORDER BY defines what "preceding" and "following" mean!
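
Two of these frames, the running total and the centered average, can be tried on a four-row toy table. The `daily` table and its values are hypothetical, and Python's built-in `sqlite3` stands in for DuckDB so the sketch runs without data files.

```python
import sqlite3

# Hypothetical toy data: four consecutive days
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE daily (day INTEGER, sales INTEGER)")
demo.executemany(
    "INSERT INTO daily VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30), (4, 40)],
)

frames = demo.execute("""
    SELECT day,
           sales,
           -- Everything from the first row up to this one
           SUM(sales) OVER (
               ORDER BY day
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total,
           -- One row back, this row, one row ahead
           AVG(sales) OVER (
               ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
           ) AS centered_avg
    FROM daily
    ORDER BY day
""").fetchall()

for row in frames:
    print(row)
```

At the edges the centered frame simply has fewer rows: day 1 averages days 1 and 2, and day 4 averages days 3 and 4.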

---

## 9. Summary: Window Functions

### The Three Use Cases We Learned

1. **ROW_NUMBER()** - Ranking/Latest per group
   - "Latest order per customer"
   - "Top 3 products per category"
   - Requires: PARTITION BY + ORDER BY

2. **LAG()/LEAD()** - Row-to-row comparison
   - "Month-over-month change"
   - "This year vs last year"
   - Requires: ORDER BY

3. **Moving Average** - Smoothing/Rolling calculations
   - "7-day moving average"
   - "Running total"
   - Requires: ORDER BY + ROWS BETWEEN

### Syntax Template

```sql
SELECT 
    column1,
    column2,
    <function>() OVER (
        PARTITION BY group_column    -- Optional: "for each..."
        ORDER BY sort_column         -- When order matters
        ROWS BETWEEN ... AND ...     -- For frames (moving avg)
    ) AS result_column
FROM table
```
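
Here is one filled-in instance of the template that uses all three clauses at once: a running total that restarts for each category. The `sales` table, its columns, and its values are invented for illustration, and Python's built-in `sqlite3` serves as a stand-in engine.

```python
import sqlite3

# Hypothetical toy data: two categories, a few days each
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE sales (category TEXT, day INTEGER, amount INTEGER)")
demo.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("A", 1, 10),
        ("A", 2, 5),
        ("A", 3, 1),
        ("B", 1, 7),
        ("B", 2, 3),
    ],
)

per_category = demo.execute("""
    SELECT category,
           day,
           amount,
           SUM(amount) OVER (
               PARTITION BY category                              -- "for each category..."
               ORDER BY day                                       -- in day order
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW   -- from the start to this row
           ) AS running_total
    FROM sales
    ORDER BY category, day
""").fetchall()

for row in per_category:
    print(row)
```

The running total for category B starts again at 7 rather than continuing from A's 16, because each partition gets its own window.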

### Key Takeaways

1. ✅ **Windows preserve rows, GROUP BY collapses**
2. ✅ **Use PARTITION BY for groups** (like GROUP BY)
3. ✅ **Use ORDER BY when order matters** (almost always!)
4. ✅ **Use ROWS BETWEEN for custom frames** (moving averages)
5. ✅ **Pattern for filtering:** Use subquery to filter window results

---

## 10. Common Gotchas

### ❌ Gotcha 1: Forgetting ORDER BY

```sql
-- ❌ Risky - runs without error, but the numbering order is arbitrary
ROW_NUMBER() OVER (PARTITION BY customer_id)

-- ✅ Correct
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC)
```

### ❌ Gotcha 2: Expecting Windows to Filter Rows

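Window functions **add columns**, they don't filter rows! And because they are evaluated after `WHERE`, you can't filter on a window result at the same query level; wrap the query in a subquery and filter there.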

```sql
-- ❌ This doesn't filter - it adds a column
SELECT *, ROW_NUMBER() OVER (...) AS rn
FROM table
-- Still get all rows!

-- ✅ Use subquery to filter
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (...) AS rn
    FROM table
)
WHERE rn = 1  -- Now we filter
```
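
A quick way to convince yourself is to try both forms. The sketch below runs on Python's built-in `sqlite3` as a stand-in engine with a hypothetical `orders` table; the exact error text differs between engines, so the code only checks that the first query is rejected.

```python
import sqlite3

# Hypothetical toy data: two orders for C1, one for C2
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (customer_id TEXT, order_date TEXT)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("C1", "2024-01-01"), ("C1", "2024-02-01"), ("C2", "2024-03-01")],
)

# Attempt 1: filter on the window result in the same query level
bad_sql = """
    SELECT customer_id, order_date,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id ORDER BY order_date DESC
           ) AS rn
    FROM orders
    WHERE rn = 1
"""
try:
    demo.execute(bad_sql).fetchall()
    where_failed = False
except sqlite3.Error as exc:
    where_failed = True
    print("Rejected:", exc)

# Attempt 2: compute rn in a subquery, then filter in the outer query
latest = demo.execute("""
    SELECT customer_id, order_date
    FROM (
        SELECT customer_id, order_date,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY order_date DESC
               ) AS rn
        FROM orders
    )
    WHERE rn = 1
    ORDER BY customer_id
""").fetchall()

print(latest)
```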

### ❌ Gotcha 3: Confusing PARTITION BY with GROUP BY

```sql
-- GROUP BY: Collapses rows
SELECT category, COUNT(*)
FROM sales
GROUP BY category  -- 3 rows output (one per category)

-- PARTITION BY: Keeps rows
SELECT *, COUNT(*) OVER (PARTITION BY category)
FROM sales  -- 10,000 rows output (all rows kept)
```

### ❌ Gotcha 4: Frame Definition Errors

```sql
-- ❌ Wrong - can't have FOLLOWING before PRECEDING
ROWS BETWEEN 5 FOLLOWING AND 1 PRECEDING

-- ✅ Correct - start must come before end
ROWS BETWEEN 1 PRECEDING AND 5 FOLLOWING
```

---

## 11. Decision Guide: GROUP BY vs Windows

### Use GROUP BY when:
- ✅ You want **summary only** (total, average, count)
- ✅ You don't need detail rows
- ✅ Result: Fewer rows (one per group)
- ✅ Example: "What's our revenue per region?"

### Use Window Functions when:
- ✅ You want **detail + calculation**
- ✅ You need **ranking** (1st, 2nd, 3rd per group)
- ✅ You need **row-to-row comparison** (this vs previous)
- ✅ You need **running totals** or moving averages
- ✅ You need to **filter after calculation** ("top 3 per group")
- ✅ Result: Same row count as input
- ✅ Example: "Show all orders with each order's rank in its category"

### Quick Test:

**Ask:** "Do I need to see individual rows, or just summaries?"
- Just summaries → GROUP BY
- Individual rows → Window functions

---

## 12. Preview: HW1

In your homework, you'll apply these concepts to a **525,000-row dataset**!

You'll use:
1. **Basic queries** - SELECT, WHERE, ORDER BY
2. **Aggregations** - GROUP BY, HAVING
3. **Window functions** - ROW_NUMBER, LAG, moving averages

**Tips for success:**
1. Use `LIMIT` while developing queries
2. Build incrementally (start simple, add complexity)
3. Check for NULLs (use `IS NULL`, not `= NULL`)
4. Remember: WHERE filters rows, HAVING filters groups
5. For window functions, always check ORDER BY
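
Tip 3 is worth seeing once, because `= NULL` fails silently rather than raising an error. This is a minimal sketch on a hypothetical three-row table `t`, run with Python's built-in `sqlite3` as a stand-in engine; the same comparison rules apply in DuckDB.

```python
import sqlite3

# Hypothetical toy data: one of three rows has a missing discount
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE t (id INTEGER, discount REAL)")
demo.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 0.1), (2, None), (3, 0.2)],
)

# "= NULL" is never true, so this matches nothing
eq_null_count = demo.execute(
    "SELECT COUNT(*) FROM t WHERE discount = NULL"
).fetchone()[0]

# "IS NULL" is the test that actually finds missing values
is_null_count = demo.execute(
    "SELECT COUNT(*) FROM t WHERE discount IS NULL"
).fetchone()[0]

print(eq_null_count, is_null_count)
# 0 1
```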

**Why 525K rows?** To show you SQL + DuckDB's power!
- Aggregations on 525K rows: ~0.1 seconds
- This is why we use SQL for data analysis

---

## Summary: What We Learned Today

### Notebook 1: SQL Foundations
- SELECT, WHERE, ORDER BY
- NULL handling (IS NULL, not = NULL)
- Calculated columns
- Pattern matching with LIKE

### Notebook 2: Aggregations
- COUNT, SUM, AVG, MIN, MAX
- GROUP BY (collapses rows)
- HAVING (filter groups)
- WHERE vs HAVING (critical difference!)

### Notebook 3: Window Functions
- Windows preserve rows, GROUP BY collapses
- ROW_NUMBER() for latest per group
- LAG()/LEAD() for row-to-row comparison
- Moving averages with ROWS BETWEEN

### Most Important Concepts

1. **NULL handling:** Use `IS NULL`, never `= NULL`
2. **WHERE vs HAVING:** Rows vs groups
3. **Windows vs GROUP BY:** Preserve vs collapse
4. **ORDER BY in windows:** Required when order matters

**You're now ready for HW1!** 🎉

---

**Excellent work!** Window functions are advanced SQL - the fact that you understand them puts you ahead of many data analysts. Practice these patterns - you'll use them constantly in real work.