# Day 1, Block B: Window Functions Primer

**Duration:** 25-30 minutes  
**Course:** ECBS5294 - Introduction to Data Science: Working with Data  
**Instructor:** Eduardo Ariño de la Rubia

---

## Learning Objectives

By the end of this session, you will be able to:

1. **Explain the mental model:** Windows preserve rows; GROUP BY collapses rows
2. **Decide when to use** window functions vs GROUP BY
3. **Use ROW_NUMBER()** for "latest record per group" problems
4. **Use LAG()** for period-over-period comparisons
5. **Create moving averages** with ROWS BETWEEN
6. **Understand window function syntax** (PARTITION BY, ORDER BY, frame)

---

## 1. Introduction: The Power of Windows

### The Challenge

You just learned GROUP BY. It's powerful:
- "Total revenue per product" ✅
- "Count of transactions per month" ✅

But sometimes GROUP BY has a **limitation:**

**Problem:** "I want to see each transaction AND the total for that product."

With GROUP BY:
- You can see the total per product (1 row per product)
- OR you can see all transactions (many rows)
- But **not both at the same time!**

GROUP BY **collapses** rows. What if you want the calculation **without collapsing**?

**Enter: Window Functions**

> **Window functions let you add aggregates to your data WITHOUT collapsing rows.**

This is incredibly powerful for analytics!

---

## 2. Setup

### Why a New Dataset?

We're switching from cafe sales to **Superstore** data because:
- Better **time series** (4 years of data)
- Multiple orders **per customer** (great for ROW_NUMBER examples)
- Daily data (perfect for moving averages)

Superstore is a classic teaching dataset - it's clean, realistic, and perfect for window functions!

In [1]:
# Imports
import duckdb
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

print("✅ Libraries imported!")

✅ Libraries imported!


In [2]:
# Connect to DuckDB
con = duckdb.connect(':memory:')

print("✅ Connected to DuckDB!")

✅ Connected to DuckDB!


In [3]:
# Load Superstore data
# Note: Using encoding='latin-1' due to file encoding
superstore = pd.read_csv('../data/day1/Sample - Superstore.csv', encoding='latin-1')

# Cast Order Date to datetime for proper DATE_TRUNC support
superstore['Order Date'] = pd.to_datetime(superstore['Order Date'])

# Register with DuckDB
con.register('superstore', superstore)

print(f"✅ Loaded {len(superstore):,} rows!")

✅ Loaded 9,994 rows!


In [4]:
# Explore the data
con.execute("""
    SELECT 
        "Order ID",
        "Order Date",
        "Customer ID",
        "Customer Name",
        Category,
        "Product Name",
        Sales
    FROM superstore
    LIMIT 5
""").df()

Unnamed: 0,Order ID,Order Date,Customer ID,Customer Name,Category,Product Name,Sales
0,CA-2016-152156,2016-11-08,CG-12520,Claire Gute,Furniture,Bush Somerset Collection Bookcase,261.96
1,CA-2016-152156,2016-11-08,CG-12520,Claire Gute,Furniture,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,CA-2016-138688,2016-06-12,DV-13045,Darrin Van Huff,Office Supplies,Self-Adhesive Address Labels for Typewriters b...,14.62
3,US-2015-108966,2015-10-11,SO-20335,Sean O'Donnell,Furniture,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,US-2015-108966,2015-10-11,SO-20335,Sean O'Donnell,Office Supplies,Eldon Fold 'N Roll Cart System,22.368


In [5]:
# Check date range
con.execute("""
    SELECT 
        MIN("Order Date") as first_order,
        MAX("Order Date") as last_order,
        COUNT(DISTINCT "Customer ID") as unique_customers,
        COUNT(*) as total_orders
    FROM superstore
""").df()

Unnamed: 0,first_order,last_order,unique_customers,total_orders
0,2014-01-03,2017-12-30,793,9994


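**Perfect!** ~10,000 rows across 4 years from ~800 customers. Note that each row is one product line within an order (the same Order ID can appear on several rows), so `total_orders` above counts rows, not distinct orders. This is great data for learning window functions.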

---

## 3. The Mental Model: Windows vs GROUP BY

> **🚨 THIS IS THE MOST IMPORTANT CONCEPT**

### The Core Difference

| | GROUP BY | Window Functions |
|---|---|---|
| **What happens to rows?** | Collapses to summary | Keeps all rows |
| **Output row count** | Fewer rows (one per group) | Same row count as input |
| **Use when** | You want summary only | You want detail + calculation |
| **Example** | "Total sales per category" | "Each order + category total" |

Let's see this in action with real queries.

### Example: GROUP BY (Collapses Rows)

In [6]:
# ==============================================================================
# THE CRITICAL DIFFERENCE: Side-by-Side Comparison
# ==============================================================================

print("="*70)
print("APPROACH 1: GROUP BY (Collapses Rows)")
print("="*70)
result_groupby = con.execute("""
    SELECT 
        Category,
        COUNT(*) AS order_count
    FROM superstore
    GROUP BY Category
    ORDER BY order_count DESC
""").df()

print(f"\n📊 Input: 9,994 rows")
print(f"📉 Output: {len(result_groupby)} rows (one per category)")
print(f"❌ We LOST all the details! Which products? Which customers? When?\n")
display(result_groupby)

print("\n" + "="*70)
print("APPROACH 2: WINDOW FUNCTION (Preserves Rows)")
print("="*70)
result_window = con.execute("""
    SELECT 
        "Order ID",
        "Product Name",
        Category,
        Sales,
        COUNT(*) OVER (PARTITION BY Category) AS category_order_count
    FROM superstore
    LIMIT 10
""").df()

print(f"\n📊 Input: 9,994 rows")
print(f"📈 Output: 9,994 rows (all kept!)")
print(f"✅ We KEPT everything AND added the count!\n")
print("(Showing first 10 rows)\n")
display(result_window)

print("\n" + "="*70)
print("🔑 KEY INSIGHT:")
print("="*70)
print("   GROUP BY:  9,994 rows  →  3 rows      (COLLAPSED)")
print("   Window:    9,994 rows  →  9,994 rows  (PRESERVED)")
print("="*70)
print("\n💡 This is why window functions are powerful:")
print("   You get the DETAIL + the AGGREGATE in the same result!")
print("="*70)

APPROACH 1: GROUP BY (Collapses Rows)

📊 Input: 9,994 rows
📉 Output: 3 rows (one per category)
❌ We LOST all the details! Which products? Which customers? When?



Unnamed: 0,Category,order_count
0,Office Supplies,6026
1,Furniture,2121
2,Technology,1847



APPROACH 2: WINDOW FUNCTION (Preserves Rows)

📊 Input: 9,994 rows
📈 Output: 9,994 rows (all kept!)
✅ We KEPT everything AND added the count!

(Showing first 10 rows)



Unnamed: 0,Order ID,Product Name,Category,Sales,category_order_count
0,CA-2016-107104,"GE 48"" Fluorescent Tube, Cool White Energy Sav...",Furniture,595.38,2121
1,CA-2014-156160,"Computer Room Manger, 14""",Furniture,97.44,2121
2,CA-2014-156160,Office Star - Mid Back Dual function Ergonomic...,Furniture,579.528,2121
3,CA-2017-157448,Eldon Radial Chair Mat for Low to Medium Pile ...,Furniture,119.94,2121
4,CA-2017-157448,Eldon Image Series Black Desk Accessories,Furniture,12.42,2121
5,CA-2016-137393,"Executive Impressions 8-1/2"" Career Panel/Part...",Furniture,41.6,2121
6,CA-2017-122770,"Eldon Executive Woodline II Desk Accessories, ...",Furniture,201.04,2121
7,CA-2015-130183,"Atlantic Metals Mobile 5-Shelf Bookcases, Cust...",Furniture,613.9992,2121
8,CA-2016-122511,"DAX Charcoal/Nickel-Tone Document Frame, 5 x 7",Furniture,30.336,2121
9,CA-2016-161746,Office Star Flex Back Scooter Chair with Alumi...,Furniture,242.136,2121



🔑 KEY INSIGHT:
   GROUP BY:  9,994 rows  →  3 rows      (COLLAPSED)
   Window:    9,994 rows  →  9,994 rows  (PRESERVED)

💡 This is why window functions are powerful:
   You get the DETAIL + the AGGREGATE in the same result!


### Example: Window Function (Keeps All Rows)

**See the difference?**
- All detail rows are still there!
- But we've added a new column: `category_order_count`
- Every row in "Furniture" shows the same count (how many total Furniture orders)
- Every row in "Technology" shows its count

**The calculation happened "over a window" of rows, but we kept all rows!**
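If you want to check the row-count difference without loading Superstore, here is a minimal sketch. It rests on a few assumptions: it uses Python's built-in `sqlite3` module instead of DuckDB, and the five-row `orders` table is made up for illustration. The window-function SQL itself is standard, so it should behave the same way in DuckDB.

```python
import sqlite3

# Hypothetical five-row stand-in for the superstore table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id TEXT, category TEXT, sales REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("A-1", "Furniture", 100.0),
        ("A-2", "Furniture", 50.0),
        ("A-3", "Technology", 200.0),
        ("A-4", "Office Supplies", 10.0),
        ("A-5", "Office Supplies", 20.0),
    ],
)

# GROUP BY: one output row per category.
grouped = con.execute(
    "SELECT category, COUNT(*) FROM orders GROUP BY category"
).fetchall()

# Window function: every input row survives, with the count attached.
windowed = con.execute(
    """
    SELECT
        order_id,
        category,
        COUNT(*) OVER (PARTITION BY category) AS category_count
    FROM orders
    ORDER BY order_id
    """
).fetchall()

print(len(grouped), "rows after GROUP BY")
print(len(windowed), "rows after the window function")
for row in windowed:
    print(row)
```

Five rows go in; GROUP BY returns three and the window query returns all five.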

---

## 4. When to Use Each

### Decision Guide

**Use GROUP BY when:**
- ✅ You want summary only (one row per group)
- ✅ You don't need row-level details
- ✅ Example: "What's our revenue per region?" (just the totals)

**Use Window Functions when:**
- ✅ You want detail + calculation
- ✅ You need ranking (1st, 2nd, 3rd...)
- ✅ You need row-to-row comparisons ("this month vs last month")
- ✅ You need to filter AFTER calculating ("show me the top 3 per category")
- ✅ Example: "Show me all orders, with each order's rank within its category"

**Key insight:** If GROUP BY loses information you need, use window functions!

---

## 5. Window Function Syntax

### Simple Example: No PARTITION or ORDER

In [7]:
# Add total order count to every row
con.execute("""
    SELECT 
        "Order ID",
        "Product Name",
        COUNT(*) OVER () AS total_orders_in_dataset
    FROM superstore
    LIMIT 5
""").df()

Unnamed: 0,Order ID,Product Name,total_orders_in_dataset
0,CA-2016-152156,Bush Somerset Collection Bookcase,9994
1,CA-2016-152156,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",9994
2,CA-2016-138688,Self-Adhesive Address Labels for Typewriters b...,9994
3,US-2015-108966,Bretford CR4500 Series Slim Rectangular Table,9994
4,US-2015-108966,Eldon Fold 'N Roll Cart System,9994


**What happened:** An empty `OVER ()` treats the whole result set as one window, so `COUNT(*) OVER ()` counts ALL rows and attaches that number to every row.

### With PARTITION BY

In [8]:
# Add count PER CATEGORY to every row
con.execute("""
    SELECT 
        "Order ID",
        "Product Name",
        Category,
        COUNT(*) OVER (PARTITION BY Category) AS category_count
    FROM superstore
    LIMIT 10
""").df()

Unnamed: 0,Order ID,Product Name,Category,category_count
0,CA-2014-115812,Mitel 5320 IP Phone VoIP phone,Technology,1847
1,CA-2014-115812,Konftel 250 Conference phone - Charcoal black,Technology,1847
2,CA-2014-143336,Cisco SPA 501G IP Phone,Technology,1847
3,CA-2016-121755,Imation 8GB Mini TravelDrive USB 2.0 Flash Drive,Technology,1847
4,CA-2016-117590,GE 30524EE4,Technology,1847
5,CA-2015-117415,Plantronics HL10 Handset Lifter,Technology,1847
6,CA-2017-120999,Panasonic Kx-TS550,Technology,1847
7,CA-2016-118255,Verbatim 25 GB 6x Blu-ray Single Layer Recorda...,Technology,1847
8,CA-2016-169194,Imation 8gb Micro Traveldrive Usb 2.0 Flash Drive,Technology,1847
9,CA-2016-169194,"LF Elite 3D Dazzle Designer Hard Case Cover, L...",Technology,1847


**PARTITION BY is like GROUP BY for window functions!**
- "PARTITION BY Category" = "For each category..."
- COUNT happens within each partition
- But all rows are kept!
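A common reason to put a total next to each row is to compute a share. The sketch below places `OVER ()` and `OVER (PARTITION BY ...)` side by side. Assumptions: it runs on Python's built-in `sqlite3` rather than DuckDB, and the three-row `orders` table and its values are hypothetical.

```python
import sqlite3

# Hypothetical three-row table; the numbers are chosen to make the shares easy to read.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id TEXT, category TEXT, sales REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("A-1", "Furniture", 100.0),
        ("A-2", "Furniture", 300.0),
        ("A-3", "Technology", 600.0),
    ],
)

rows = con.execute(
    """
    SELECT
        order_id,
        category,
        sales,
        SUM(sales) OVER () AS total_sales,                          -- whole table
        SUM(sales) OVER (PARTITION BY category) AS category_sales,  -- per category
        ROUND(100.0 * sales / SUM(sales) OVER (PARTITION BY category), 1)
            AS pct_of_category
    FROM orders
    ORDER BY order_id
    """
).fetchall()

for row in rows:
    print(row)
```

Every row carries the same `total_sales`, while `category_sales` and `pct_of_category` change from partition to partition.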

---

## 6. Use Case 1: ROW_NUMBER() for "Latest Per Group"

### The Business Problem

> **"I want the most recent order for each customer."**

This is a VERY common pattern in data analysis:
- Latest transaction per customer
- Most recent login per user
- Current status per order

### Why GROUP BY Fails

In [9]:
# Try with GROUP BY: Get latest date per customer
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        MAX("Order Date") AS latest_order_date
    FROM superstore
    GROUP BY "Customer ID", "Customer Name"
    LIMIT 5
""").df()

Unnamed: 0,Customer ID,Customer Name,latest_order_date
0,RA-19885,Ruben Ausman,2017-11-17
1,PS-18970,Paul Stevenson,2017-10-22
2,KD-16345,Katherine Ducich,2017-11-19
3,ER-13855,Elpida Rittenbach,2016-11-04
4,RD-19900,Ruben Dartt,2017-09-07


**Problem:** We got the date, but we **lost the order details!**
- What was ordered?
- What category?
- Order ID?
- Sales amount?

GROUP BY collapsed everything. We need a different approach.

### Solution: ROW_NUMBER()

> **ROW_NUMBER() assigns a sequential number to each row within a group**

Strategy:
1. For each customer, rank orders by date (newest = 1)
2. Keep all the row details
3. Filter to rank = 1

In [10]:
# Step 1: Add row numbers
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        "Order ID",
        "Order Date",
        Category,
        Sales,
        ROW_NUMBER() OVER (
            PARTITION BY "Customer ID" 
            ORDER BY "Order Date" DESC
        ) AS row_num
    FROM superstore
    LIMIT 20
""").df()

Unnamed: 0,Customer ID,Customer Name,Order ID,Order Date,Category,Sales,row_num
0,AA-10375,Allen Armold,CA-2017-100230,2017-12-11,Office Supplies,14.952,1
1,AA-10375,Allen Armold,CA-2017-100230,2017-12-11,Office Supplies,17.94,2
2,AA-10375,Allen Armold,CA-2017-100230,2017-12-11,Technology,116.98,3
3,AA-10375,Allen Armold,US-2017-169488,2017-09-07,Office Supplies,39.96,4
4,AA-10375,Allen Armold,US-2017-169488,2017-09-07,Office Supplies,16.9,5
5,AA-10375,Allen Armold,CA-2016-131065,2016-11-14,Office Supplies,5.28,6
6,AA-10375,Allen Armold,CA-2016-131065,2016-11-14,Office Supplies,8.26,7
7,AA-10375,Allen Armold,CA-2016-131065,2016-11-14,Technology,499.98,8
8,AA-10375,Allen Armold,CA-2016-126613,2016-07-10,Office Supplies,16.768,9
9,AA-10375,Allen Armold,CA-2015-114503,2015-11-13,Office Supplies,84.96,10


**Breaking it down:**
- `PARTITION BY "Customer ID"` = For each customer...
- `ORDER BY "Order Date" DESC` = Sort by date, newest first
- `ROW_NUMBER()` = Assign 1, 2, 3, ...

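**Result:** Each customer's rows are numbered, 1 = most recent! One caveat: rows that share the same Order Date (for example, several line items in one order) are tied, so which of them gets number 1 is arbitrary unless you add a tie-breaker to `ORDER BY`.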
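Ties on Order Date are visible in the output above: the three 2017-12-11 rows belong to the same order, and nothing in the query says which one should be number 1. One way to make the numbering repeatable is a second sort key. The sketch below is an illustration under assumptions: it uses Python's built-in `sqlite3` instead of DuckDB, it copies four of Allen Armold's rows from the output above into a small table, and the choice of `sales DESC` as the tie-breaker is arbitrary.

```python
import sqlite3

# Four rows taken from the output above (customer AA-10375).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (customer_id TEXT, order_id TEXT, order_date TEXT, sales REAL)"
)
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        ("AA-10375", "CA-2017-100230", "2017-12-11", 14.952),
        ("AA-10375", "CA-2017-100230", "2017-12-11", 17.94),
        ("AA-10375", "CA-2017-100230", "2017-12-11", 116.98),
        ("AA-10375", "US-2017-169488", "2017-09-07", 39.96),
    ],
)

numbered = con.execute(
    """
    SELECT
        customer_id,
        order_id,
        order_date,
        sales,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY order_date DESC, sales DESC   -- second key breaks date ties
        ) AS row_num
    FROM orders
    ORDER BY customer_id, row_num
    """
).fetchall()

for row in numbered:
    print(row)
```

With the tie-breaker, row 1 is always the largest line item on the most recent date, no matter how the engine happens to scan the table.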

### Step 2: Filter to Latest Only

In [11]:
# Now filter to row_num = 1 using a subquery
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        "Order ID",
        "Order Date",
        Category,
        "Product Name",
        Sales
    FROM (
        SELECT 
            "Customer ID",
            "Customer Name",
            "Order ID",
            "Order Date",
            Category,
            "Product Name",
            Sales,
            ROW_NUMBER() OVER (
                PARTITION BY "Customer ID" 
                ORDER BY "Order Date" DESC
            ) AS row_num
        FROM superstore
    )
    WHERE row_num = 1
    ORDER BY "Order Date" DESC
    LIMIT 10
""").df()

Unnamed: 0,Customer ID,Customer Name,Order ID,Order Date,Category,Product Name,Sales
0,EB-13975,Erica Bern,CA-2017-115427,2017-12-30,Office Supplies,"Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl",13.904
1,JM-15580,Jill Matthias,CA-2017-156720,2017-12-30,Office Supplies,Bagged Rubber Bands,3.024
2,CC-12430,Chuck Clark,CA-2017-126221,2017-12-30,Office Supplies,Eureka The Boss Plus 12-Amp Hard Box Upright V...,209.3
3,PO-18865,Patrick O'Donnell,CA-2017-143259,2017-12-30,Furniture,"Bush Westfield Collection Bookcases, Fully Ass...",323.136
4,JG-15160,James Galang,CA-2017-118885,2017-12-29,Furniture,"Global High-Back Leather Tilter, Burgundy",393.568
5,BP-11185,Ben Peterman,CA-2017-146626,2017-12-29,Furniture,Nu-Dell Executive Frame,101.12
6,MC-17845,Michael Chen,US-2017-102638,2017-12-29,Office Supplies,Ideal Clamps,6.03
7,KH-16360,Katherine Hughes,US-2017-158526,2017-12-29,Furniture,DMI Arturo Collection Mission-style Design Woo...,1207.84
8,BS-11755,Bruce Stewart,CA-2017-130631,2017-12-29,Furniture,Hand-Finished Solid Wood Document Frame,68.46
9,KB-16600,Ken Brennan,CA-2017-158673,2017-12-29,Office Supplies,Xerox 1915,209.7


---

### ⏸️ Pause and Try!

**Your task:** Modify the ROW_NUMBER query from the Step 2 cell above to get the **TOP 3** orders per customer (not just the latest).

**Requirements:**
1. Use the same ROW_NUMBER pattern from the example above
2. Change the `WHERE` filter to get top 3 instead of latest (hint: `<= 3`)
3. Keep all the same columns in the output
4. Order by Customer ID and row_num
5. Limit to 15 rows total

Replace the placeholder query in the cell below with your complete SQL query.

In [12]:
# Your turn! Write your TOP 3 query here:
con.execute("SELECT 1 AS todo").df()  # Replace this entire query with your answer

Unnamed: 0,todo
0,1


---

### Alternative: Top 3 Orders Per Customer

In [13]:
# Get top 3 most recent orders per customer
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        "Order Date",
        Sales,
        row_num
    FROM (
        SELECT 
            "Customer ID",
            "Customer Name",
            "Order Date",
            Sales,
            ROW_NUMBER() OVER (
                PARTITION BY "Customer ID" 
                ORDER BY "Order Date" DESC
            ) AS row_num
        FROM superstore
    )
    WHERE row_num <= 3
    ORDER BY "Customer ID", row_num
    LIMIT 15
""").df()

Unnamed: 0,Customer ID,Customer Name,Order Date,Sales,row_num
0,AA-10315,Alex Avila,2017-06-29,362.94,1
1,AA-10315,Alex Avila,2017-06-29,11.54,2
2,AA-10315,Alex Avila,2016-03-03,3930.072,3
3,AA-10375,Allen Armold,2017-12-11,14.952,1
4,AA-10375,Allen Armold,2017-12-11,17.94,2
5,AA-10375,Allen Armold,2017-12-11,116.98,3
6,AA-10480,Andrew Allen,2017-04-15,15.552,1
7,AA-10480,Andrew Allen,2016-08-26,11.56,2
8,AA-10480,Andrew Allen,2016-08-26,8.64,3
9,AA-10645,Anna Andreadi,2017-11-05,12.96,1


**Just change the filter!** `WHERE row_num <= 3` gives top 3 per customer.

**Use cases:**
- Top 5 products per category
- Latest 10 transactions per account
- Most recent 3 logins per user

---

## 7. Use Case 2: LAG() for Period-over-Period Comparison

### The Business Problem

> **"What's the month-over-month change in sales?"**

You want to compare:
- This month vs last month
- This quarter vs last quarter
- Today vs yesterday

You need to access the **previous row's value**. That's what LAG() does!

### Step 1: Aggregate to Monthly Sales

In [14]:
# First, get monthly totals
monthly_sales = con.execute("""
    SELECT 
        DATE_TRUNC('month', "Order Date") AS month,
        ROUND(SUM(Sales), 2) AS monthly_sales
    FROM superstore
    GROUP BY month
    ORDER BY month
""").df()

print(f"Monthly sales for {len(monthly_sales)} months:")
monthly_sales.head(10)

Monthly sales for 48 months:


Unnamed: 0,month,monthly_sales
0,2014-01-01,14236.89
1,2014-02-01,4519.89
2,2014-03-01,55691.01
3,2014-04-01,28295.34
4,2014-05-01,23648.29
5,2014-06-01,34595.13
6,2014-07-01,33946.39
7,2014-08-01,27909.47
8,2014-09-01,81777.35
9,2014-10-01,31453.39


**Good!** We have one row per month. Now let's add the previous month's sales.

### Step 2: Add LAG() for Previous Month

In [15]:
# Add previous month's sales using LAG()
con.execute("""
    SELECT 
        DATE_TRUNC('month', "Order Date") AS month,
        ROUND(SUM(Sales), 2) AS monthly_sales,
        LAG(ROUND(SUM(Sales), 2), 1) OVER (ORDER BY DATE_TRUNC('month', "Order Date")) AS prev_month_sales
    FROM superstore
    GROUP BY month
    ORDER BY month
    LIMIT 12
""").df()

Unnamed: 0,month,monthly_sales,prev_month_sales
0,2014-01-01,14236.89,
1,2014-02-01,4519.89,14236.89
2,2014-03-01,55691.01,4519.89
3,2014-04-01,28295.34,55691.01
4,2014-05-01,23648.29,28295.34
5,2014-06-01,34595.13,23648.29
6,2014-07-01,33946.39,34595.13
7,2014-08-01,27909.47,33946.39
8,2014-09-01,81777.35,27909.47
9,2014-10-01,31453.39,81777.35


**See that?**
- `prev_month_sales` is the value from the row **before**
- First row is NULL (no previous month)
- Second row shows first month's value
- And so on...

**Syntax:**
- `LAG(column, 1)` = Get value from 1 row before
- `LAG(column, 2)` = Get value from 2 rows before
- `ORDER BY month` = Defines what "before" means!
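The offsets are easier to see side by side. This sketch uses the first four monthly totals from the output above. Assumptions: it runs on Python's built-in `sqlite3` rather than DuckDB, the `monthly` table is built by hand, and the optional third argument to `LAG()` (a default used when there is no earlier row) is shown only as an illustration.

```python
import sqlite3

# First four monthly totals, copied from the output above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE monthly (month TEXT, monthly_sales REAL)")
con.executemany(
    "INSERT INTO monthly VALUES (?, ?)",
    [
        ("2014-01-01", 14236.89),
        ("2014-02-01", 4519.89),
        ("2014-03-01", 55691.01),
        ("2014-04-01", 28295.34),
    ],
)

rows = con.execute(
    """
    SELECT
        month,
        monthly_sales,
        LAG(monthly_sales, 1) OVER (ORDER BY month) AS prev_1,       -- 1 row back
        LAG(monthly_sales, 2) OVER (ORDER BY month) AS prev_2,       -- 2 rows back
        LAG(monthly_sales, 1, 0.0) OVER (ORDER BY month) AS prev_1_or_zero
    FROM monthly
    ORDER BY month
    """
).fetchall()

for row in rows:
    print(row)
```

Be careful with the default: replacing the missing value with 0 hides the fact that there is no previous month, which matters as soon as you divide by it.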

### Step 3: Calculate Change

In [16]:
# Calculate month-over-month change
con.execute("""
    WITH monthly AS (
        SELECT 
            DATE_TRUNC('month', "Order Date") AS month,
            ROUND(SUM(Sales), 2) AS monthly_sales
        FROM superstore
        GROUP BY month
    )
    SELECT 
        month,
        monthly_sales,
        LAG(monthly_sales, 1) OVER (ORDER BY month) AS prev_month,
        ROUND(monthly_sales - LAG(monthly_sales, 1) OVER (ORDER BY month), 2) AS change,
        ROUND(
            100.0 * (monthly_sales - LAG(monthly_sales, 1) OVER (ORDER BY month)) / 
            LAG(monthly_sales, 1) OVER (ORDER BY month), 
            2
        ) AS pct_change
    FROM monthly
    ORDER BY month
    LIMIT 12
""").df()

Unnamed: 0,month,monthly_sales,prev_month,change,pct_change
0,2014-01-01,14236.89,,,
1,2014-02-01,4519.89,14236.89,-9717.0,-68.25
2,2014-03-01,55691.01,4519.89,51171.12,1132.13
3,2014-04-01,28295.34,55691.01,-27395.67,-49.19
4,2014-05-01,23648.29,28295.34,-4647.05,-16.42
5,2014-06-01,34595.13,23648.29,10946.84,46.29
6,2014-07-01,33946.39,34595.13,-648.74,-1.88
7,2014-08-01,27909.47,33946.39,-6036.92,-17.78
8,2014-09-01,81777.35,27909.47,53867.88,193.01
9,2014-10-01,31453.39,81777.35,-50323.96,-61.54


**Business insights!**
- See which months grew vs declined
- Calculate percent change
- Spot trends

**Note:** We used a `WITH` clause (a Common Table Expression, or CTE) to compute the monthly totals once and keep the main query readable. This is advanced but useful!
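The query above repeats the same `LAG(...) OVER (...)` expression several times. A possible variation is to compute it once in its own CTE and reuse the column. The sketch below does that on the first three monthly totals from the output. Assumptions: it runs on Python's built-in `sqlite3` rather than DuckDB, the `monthly` table is built by hand, and the `NULLIF` guard against a zero previous month is an addition of this sketch.

```python
import sqlite3

# First three monthly totals, copied from the output above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE monthly (month TEXT, monthly_sales REAL)")
con.executemany(
    "INSERT INTO monthly VALUES (?, ?)",
    [
        ("2014-01-01", 14236.89),
        ("2014-02-01", 4519.89),
        ("2014-03-01", 55691.01),
    ],
)

rows = con.execute(
    """
    WITH with_prev AS (
        SELECT
            month,
            monthly_sales,
            LAG(monthly_sales) OVER (ORDER BY month) AS prev_month
        FROM monthly
    )
    SELECT
        month,
        monthly_sales,
        prev_month,
        ROUND(monthly_sales - prev_month, 2) AS change,
        -- NULLIF turns a zero denominator into NULL instead of an error or infinity
        ROUND(100.0 * (monthly_sales - prev_month) / NULLIF(prev_month, 0), 2)
            AS pct_change
    FROM with_prev
    ORDER BY month
    """
).fetchall()

for row in rows:
    print(row)
```

The February and March rows should match the `change` and `pct_change` values in the output above, and January stays NULL because it has no previous month.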

### Why ORDER BY Matters

**Without ORDER BY, LAG() doesn't know what "previous" means!**

```sql
-- ❌ Wrong - undefined order
LAG(sales) OVER ()

-- ✅ Correct - ordered by time
LAG(sales) OVER (ORDER BY month)
```

**Always ORDER BY the dimension you're comparing across** (usually time).

### LEAD(): The Opposite

`LEAD()` gets the value from rows **after** instead of before:

In [17]:
# Compare to NEXT month instead of previous
con.execute("""
    WITH monthly AS (
        SELECT 
            DATE_TRUNC('month', "Order Date") AS month,
            ROUND(SUM(Sales), 2) AS monthly_sales
        FROM superstore
        GROUP BY month
    )
    SELECT 
        month,
        monthly_sales,
        LEAD(monthly_sales, 1) OVER (ORDER BY month) AS next_month_sales
    FROM monthly
    ORDER BY month
    LIMIT 10
""").df()

Unnamed: 0,month,monthly_sales,next_month_sales
0,2014-01-01,14236.89,4519.89
1,2014-02-01,4519.89,55691.01
2,2014-03-01,55691.01,28295.34
3,2014-04-01,28295.34,23648.29
4,2014-05-01,23648.29,34595.13
5,2014-06-01,34595.13,33946.39
6,2014-07-01,33946.39,27909.47
7,2014-08-01,27909.47,81777.35
8,2014-09-01,81777.35,31453.39
9,2014-10-01,31453.39,78628.72


**Use case:** Looking ahead from each row, e.g. "What were sales in the month that followed?" or "How long until this customer's next order?"

---

## 8. Use Case 3: Moving Average

### The Business Problem

> **"Daily sales are noisy - smooth them with a 7-day moving average."**

**Why moving averages?**
- Remove day-to-day volatility
- See underlying trends
- Common in time series analysis

**What's a moving average?**
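- For each day, average that day + the 6 days before it (with `ROWS BETWEEN`, this means the 6 previous *rows*, so it is only a true 7-day window when every date has a row)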
- "Window" slides forward each day
- Smooths out spikes and dips

### Step 1: Aggregate to Daily Sales

In [18]:
# Get daily sales
daily_sales = con.execute("""
    SELECT 
        "Order Date" AS date,
        ROUND(SUM(Sales), 2) AS daily_sales
    FROM superstore
    GROUP BY date
    ORDER BY date
    LIMIT 20
""").df()

print(f"Daily sales:")
daily_sales

Daily sales:


Unnamed: 0,date,daily_sales
0,2014-01-03,16.45
1,2014-01-04,288.06
2,2014-01-05,19.54
3,2014-01-06,4407.1
4,2014-01-07,87.16
5,2014-01-09,40.54
6,2014-01-10,54.83
7,2014-01-11,9.94
8,2014-01-13,3553.8
9,2014-01-14,61.96


**See the volatility?** Some days high, some low. Hard to see the trend.

### Step 2: Add 7-Day Moving Average

In [19]:
# Add 7-day moving average
con.execute("""
    WITH daily AS (
        SELECT 
            "Order Date" AS date,
            ROUND(SUM(Sales), 2) AS daily_sales
        FROM superstore
        GROUP BY date
    )
    SELECT 
        date,
        daily_sales,
        ROUND(
            AVG(daily_sales) OVER (
                ORDER BY date
                ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
            ), 
            2
        ) AS moving_avg_7day
    FROM daily
    ORDER BY date
    LIMIT 20
""").df()

Unnamed: 0,date,daily_sales,moving_avg_7day
0,2014-01-03,16.45,16.45
1,2014-01-04,288.06,152.26
2,2014-01-05,19.54,108.02
3,2014-01-06,4407.1,1182.79
4,2014-01-07,87.16,963.66
5,2014-01-09,40.54,809.81
6,2014-01-10,54.83,701.95
7,2014-01-11,9.94,701.02
8,2014-01-13,3553.8,1167.56
9,2014-01-14,61.96,1173.62


**Breaking down the syntax:**

```sql
AVG(daily_sales) OVER (
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
)
```

- `AVG(daily_sales)` - Calculate average
- `ORDER BY date` - Order matters! (need to know which rows are "before")
- `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` - The magic part!
  - `6 PRECEDING` = 6 rows before current
  - `CURRENT ROW` = current row
  - Total: 7 rows (6 before + current)

**Visual:**
```
For row at day 10:
[day 4][day 5][day 6][day 7][day 8][day 9][day 10]
 ↑                                            ↑
 6 preceding                          current row
 
 Average these 7 days
```
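One caveat worth checking: `ROWS BETWEEN 6 PRECEDING` counts rows, not calendar days. The daily output above has no row for 2014-01-08, so the seven rows ending on 2014-01-10 actually span eight calendar days. The sketch below compares the two readings on the first seven daily totals shown above. Assumptions: it runs on Python's built-in `sqlite3` rather than DuckDB, it uses SQLite's `julianday()` to turn dates into day numbers for a `RANGE` frame (this needs SQLite 3.28 or newer), and dividing by 7 treats dates with no rows as zero-sales days.

```python
import sqlite3

# First seven daily totals from the output above (note: no row for 2014-01-08).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily (date TEXT, daily_sales REAL)")
con.executemany(
    "INSERT INTO daily VALUES (?, ?)",
    [
        ("2014-01-03", 16.45),
        ("2014-01-04", 288.06),
        ("2014-01-05", 19.54),
        ("2014-01-06", 4407.1),
        ("2014-01-07", 87.16),
        ("2014-01-09", 40.54),
        ("2014-01-10", 54.83),
    ],
)

rows = con.execute(
    """
    SELECT
        date,
        daily_sales,
        -- Row-based frame: the current row plus the 6 rows before it
        ROUND(AVG(daily_sales) OVER (
            ORDER BY date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ), 2) AS avg_last_7_rows,
        -- Value-based frame: rows dated within the last 7 calendar days
        ROUND(SUM(daily_sales) OVER (
            ORDER BY julianday(date)
            RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
        ) / 7.0, 2) AS avg_last_7_days
    FROM daily
    ORDER BY date
    """
).fetchall()

for row in rows:
    print(row)
```

On 2014-01-10 the row-based average still includes 2014-01-03, while the calendar-based one does not, so the two columns differ. DuckDB is expected to support an interval-based frame such as `RANGE BETWEEN INTERVAL 6 DAYS PRECEDING AND CURRENT ROW` for the same purpose; treat that as something to confirm in the DuckDB documentation.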

### The Smoothing Effect

In [20]:
# Let's see more data to see the smoothing
con.execute("""
    WITH daily AS (
        SELECT 
            "Order Date" AS date,
            ROUND(SUM(Sales), 2) AS daily_sales
        FROM superstore
        GROUP BY date
    )
    SELECT 
        date,
        daily_sales,
        ROUND(
            AVG(daily_sales) OVER (
                ORDER BY date
                ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
            ), 
            2
        ) AS moving_avg_7day,
        ROUND(daily_sales - AVG(daily_sales) OVER (
                ORDER BY date
                ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
            ), 2) AS deviation_from_avg
    FROM daily
    ORDER BY date
    LIMIT 30
""").df()

Unnamed: 0,date,daily_sales,moving_avg_7day,deviation_from_avg
0,2014-01-03,16.45,16.45,0.0
1,2014-01-04,288.06,152.26,135.81
2,2014-01-05,19.54,108.02,-88.48
3,2014-01-06,4407.1,1182.79,3224.31
4,2014-01-07,87.16,963.66,-876.5
5,2014-01-09,40.54,809.81,-769.27
6,2014-01-10,54.83,701.95,-647.12
7,2014-01-11,9.94,701.02,-691.08
8,2014-01-13,3553.8,1167.56,2386.24
9,2014-01-14,61.96,1173.62,-1111.66


**Notice:**
- Daily sales jump around (high volatility)
- Moving average is smoother (less volatile)
- You can see the trend more clearly

**Business use:** "Is our sales trend up or down?" Moving average makes it clear.

### Other Frame Options

**Frame syntax:** `ROWS BETWEEN <start> AND <end>`

Common patterns:

```sql
-- 7-day moving average
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW

-- 3-day centered average (1 before, current, 1 after)
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING

-- Running total (all rows from start to current)
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

-- Next 5 days average
ROWS BETWEEN CURRENT ROW AND 5 FOLLOWING
```

**Key:** ORDER BY defines what "preceding" and "following" mean!
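Two of these frames can be sketched on the first three monthly totals from the LAG section. Assumptions: it runs on Python's built-in `sqlite3` rather than DuckDB, and the `monthly` table is built by hand from those three values.

```python
import sqlite3

# First three monthly totals from the LAG section.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE monthly (month TEXT, monthly_sales REAL)")
con.executemany(
    "INSERT INTO monthly VALUES (?, ?)",
    [
        ("2014-01-01", 14236.89),
        ("2014-02-01", 4519.89),
        ("2014-03-01", 55691.01),
    ],
)

rows = con.execute(
    """
    SELECT
        month,
        -- Running total: everything from the first row up to the current row
        ROUND(SUM(monthly_sales) OVER (
            ORDER BY month
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ), 2) AS running_total,
        -- Centered average: previous row, current row, next row
        ROUND(AVG(monthly_sales) OVER (
            ORDER BY month
            ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
        ), 2) AS centered_avg_3
    FROM monthly
    ORDER BY month
    """
).fetchall()

for row in rows:
    print(row)
```

At the edges the frame simply shrinks: the first and last rows of `centered_avg_3` average two values instead of three.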

---

## 9. Summary: Window Functions

### The Three Use Cases We Learned

1. **ROW_NUMBER()** - Ranking/Latest per group
   - "Latest order per customer"
   - "Top 3 products per category"
   - Requires: PARTITION BY + ORDER BY

2. **LAG()/LEAD()** - Row-to-row comparison
   - "Month-over-month change"
   - "This year vs last year"
   - Requires: ORDER BY

3. **Moving Average** - Smoothing/Rolling calculations
   - "7-day moving average"
   - "Running total"
   - Requires: ORDER BY + ROWS BETWEEN

### Syntax Template

```sql
SELECT 
    column1,
    column2,
    <function>() OVER (
        PARTITION BY group_column    -- Optional: "for each..."
        ORDER BY sort_column         -- When order matters
        ROWS BETWEEN ... AND ...     -- For frames (moving avg)
    ) AS result_column
FROM table
```

### Key Takeaways

1. ✅ **Windows preserve rows, GROUP BY collapses**
2. ✅ **Use PARTITION BY for groups** (like GROUP BY)
3. ✅ **Use ORDER BY when order matters** (almost always!)
4. ✅ **Use ROWS BETWEEN for custom frames** (moving averages)
5. ✅ **Pattern for filtering:** Use subquery to filter window results

---

## 10. Common Gotchas

### ❌ Gotcha 1: Forgetting ORDER BY

```sql
-- ❌ Wrong - undefined order
ROW_NUMBER() OVER (PARTITION BY customer_id)

-- ✅ Correct
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC)
```

### ❌ Gotcha 2: Expecting Windows to Filter Rows

Window functions **add columns**, they don't filter rows!

```sql
-- ❌ This doesn't filter - it adds a column
SELECT *, ROW_NUMBER() OVER (...) AS rn
FROM table
-- Still get all rows!

-- ✅ Use subquery to filter
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (...) AS rn
    FROM table
)
WHERE rn = 1  -- Now we filter
```
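The subquery is needed because `WHERE` is evaluated before window functions are computed, so a window function cannot appear directly in `WHERE`. The sketch below shows the failure and the fix. Assumptions: it uses Python's built-in `sqlite3` rather than DuckDB (the exact error message differs between engines), and the `orders` table with customers `C1` and `C2` is hypothetical.

```python
import sqlite3

# Hypothetical table: two customers, three order rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer_id TEXT, order_date TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("C1", "2017-01-05"), ("C1", "2017-03-09"), ("C2", "2016-12-01")],
)

# Attempt 1: window function directly in WHERE. The engine rejects this.
try:
    con.execute(
        """
        SELECT * FROM orders
        WHERE ROW_NUMBER() OVER (
            PARTITION BY customer_id ORDER BY order_date DESC
        ) = 1
        """
    )
    rejected = False
except sqlite3.Error as exc:
    rejected = True
    print("Rejected:", exc)

# Attempt 2: compute the row number in a subquery, then filter outside it.
latest = con.execute(
    """
    SELECT customer_id, order_date
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id ORDER BY order_date DESC
            ) AS rn
        FROM orders
    )
    WHERE rn = 1
    ORDER BY customer_id
    """
).fetchall()

print(latest)
```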

### ❌ Gotcha 3: Confusing PARTITION BY with GROUP BY

```sql
-- GROUP BY: Collapses rows
SELECT category, COUNT(*)
FROM sales
GROUP BY category  -- 3 rows output (one per category)

-- PARTITION BY: Keeps rows
SELECT *, COUNT(*) OVER (PARTITION BY category)
FROM sales  -- 10,000 rows output (all rows kept)
```

### ❌ Gotcha 4: Frame Definition Errors

```sql
-- ❌ Wrong - can't have FOLLOWING before PRECEDING
ROWS BETWEEN 5 FOLLOWING AND 1 PRECEDING

-- ✅ Correct - start must come before end
ROWS BETWEEN 1 PRECEDING AND 5 FOLLOWING
```

---

## 11. Decision Guide: GROUP BY vs Windows

### Use GROUP BY when:
- ✅ You want **summary only** (total, average, count)
- ✅ You don't need detail rows
- ✅ Result: Fewer rows (one per group)
- ✅ Example: "What's our revenue per region?"

### Use Window Functions when:
- ✅ You want **detail + calculation**
- ✅ You need **ranking** (1st, 2nd, 3rd per group)
- ✅ You need **row-to-row comparison** (this vs previous)
- ✅ You need **running totals** or moving averages
- ✅ You need to **filter after calculation** ("top 3 per group")
- ✅ Result: Same row count as input
- ✅ Example: "Show all orders with each order's rank in its category"

### Quick Test:

**Ask:** "Do I need to see individual rows, or just summaries?"
- Just summaries → GROUP BY
- Individual rows → Window functions

---

## 12. Preview: HW1

In your homework, you'll apply these concepts to a **525,000 row dataset**!

You'll use:
1. **Basic queries** - SELECT, WHERE, ORDER BY
2. **Aggregations** - GROUP BY, HAVING
3. **Window functions** - ROW_NUMBER, LAG, moving averages

**Tips for success:**
1. Use `LIMIT` while developing queries
2. Build incrementally (start simple, add complexity)
3. Check for NULLs (use `IS NULL`, not `= NULL`)
4. Remember: WHERE filters rows, HAVING filters groups
5. For window functions, always check ORDER BY
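Tip 3 is easy to verify for yourself. A minimal sketch, assuming Python's built-in `sqlite3` instead of DuckDB and a made-up three-row `menu` table:

```python
import sqlite3

# Hypothetical table with two missing prices.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE menu (item TEXT, price REAL)")
con.executemany(
    "INSERT INTO menu VALUES (?, ?)",
    [("coffee", 3.5), ("tea", None), ("cake", None)],
)

# "= NULL" never evaluates to true, so no row passes the filter.
eq_null = con.execute("SELECT COUNT(*) FROM menu WHERE price = NULL").fetchone()[0]

# "IS NULL" is the correct test for missing values.
is_null = con.execute("SELECT COUNT(*) FROM menu WHERE price IS NULL").fetchone()[0]

print("price = NULL  matches", eq_null, "rows")
print("price IS NULL matches", is_null, "rows")
```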

**Why 525K rows?** To show you SQL + DuckDB's power!
- Aggregations on 525K rows: ~0.1 seconds
- This is why we use SQL for data analysis

---

## Summary: What We Learned Today

### Notebook 1: SQL Foundations
- SELECT, WHERE, ORDER BY
- NULL handling (IS NULL, not = NULL)
- Calculated columns
- Pattern matching with LIKE

### Notebook 2: Aggregations
- COUNT, SUM, AVG, MIN, MAX
- GROUP BY (collapses rows)
- HAVING (filter groups)
- WHERE vs HAVING (critical difference!)

### Notebook 3: Window Functions
- Windows preserve rows, GROUP BY collapses
- ROW_NUMBER() for latest per group
- LAG()/LEAD() for row-to-row comparison
- Moving averages with ROWS BETWEEN

### Most Important Concepts

1. **NULL handling:** Use `IS NULL`, never `= NULL`
2. **WHERE vs HAVING:** Rows vs groups
3. **Windows vs GROUP BY:** Preserve vs collapse
4. **ORDER BY in windows:** Required when order matters

**You're now ready for HW1!** 🎉

---

**Excellent work!** Window functions are advanced SQL - the fact that you understand them puts you ahead of many data analysts. Practice these patterns - you'll use them constantly in real work.