# Day 1, Block B: Window Functions Primer

**Duration:** 20 minutes  
**Course:** ECBS5294 - Introduction to Data Science: Working with Data  
**Instructor:** Eduardo Ariño de la Rubia

**Note:** This is a focused primer covering the essentials. For advanced topics like LAG(), LEAD(), and moving averages, see the **Window Functions Deep Dive** notebook.

---

## Learning Objectives

By the end of this primer, you will be able to:

1. **Explain the mental model:** Windows preserve rows; GROUP BY collapses rows
2. **Decide when to use** window functions vs GROUP BY
3. **Use ROW_NUMBER()** for "latest record per group" problems
4. **Understand basic window function syntax** (PARTITION BY, ORDER BY)

---

## 1. Introduction: The Problem Window Functions Solve

### The Challenge

You just learned GROUP BY. It's powerful:
- "Total revenue per product" ✅
- "Count of transactions per month" ✅

But GROUP BY has a **limitation:**

**Problem:** "I want to see each transaction AND the total for that product."

With GROUP BY:
- You can see the total per product (1 row per product)
- OR you can see all transactions (many rows)
- But **not both at the same time!**

GROUP BY **collapses** rows. What if you want the calculation **without collapsing**?

**Enter: Window Functions**

> **Window functions let you add calculations to your data WITHOUT collapsing rows.**

That makes them a natural fit for rankings, row-to-row comparisons, and "detail plus total" reports.

---

## 2. Setup

We're using **Superstore** data because:
- Multiple orders **per customer** (great for ROW_NUMBER examples)
- 4 years of time series data
- ~10,000 rows - perfect for learning

In [None]:
# Imports
import duckdb
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

print("✅ Libraries imported!")

In [None]:
# Connect to DuckDB
con = duckdb.connect(':memory:')

print("✅ Connected to DuckDB!")

In [None]:
# Load Superstore data
# Note: encoding='latin-1' because the CSV is not saved as UTF-8
superstore = pd.read_csv('../../data/day1/Sample - Superstore.csv', encoding='latin-1')

# Cast Order Date to datetime so DuckDB sorts and compares it as a date, not as text
# (this also enables date functions such as DATE_TRUNC)
superstore['Order Date'] = pd.to_datetime(superstore['Order Date'])

# Register with DuckDB
con.register('superstore', superstore)

print(f"✅ Loaded {len(superstore):,} rows!")

In [None]:
# Explore the data
con.execute("""
    SELECT 
        "Order ID",
        "Order Date",
        "Customer ID",
        "Customer Name",
        Category,
        "Product Name",
        Sales
    FROM superstore
    LIMIT 5
""").df()

In [None]:
# Check date range
con.execute("""
    SELECT 
        MIN("Order Date") as first_order,
        MAX("Order Date") as last_order,
        COUNT(DISTINCT "Customer ID") as unique_customers,
        COUNT(*) as total_orders
    FROM superstore
""").df()

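**Perfect!** ~10,000 rows across 4 years from ~800 customers. One caveat: each row is a product line within an order, so `total_orders` above really counts order lines, not distinct orders. Great data for learning window functions.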

---

## 3. The Mental Model: Windows vs GROUP BY

> **🚨 THIS IS THE MOST IMPORTANT CONCEPT**

### The Core Difference

| | GROUP BY | Window Functions |
|---|---|---|
| **What happens to rows?** | Collapses to summary | Keeps all rows |
| **Output row count** | Fewer rows (one per group) | Same row count as input |
| **Use when** | You want summary only | You want detail + calculation |
| **Example** | "Total sales per category" | "Each order + category total" |

Let's see this in action with real queries.

In [None]:
# ==============================================================================
# THE CRITICAL DIFFERENCE: Side-by-Side Comparison
# ==============================================================================

from IPython.display import display

print("="*70)
print("APPROACH 1: GROUP BY (Collapses Rows)")
print("="*70)
result_groupby = con.execute("""
    SELECT 
        Category,
        COUNT(*) AS order_count
    FROM superstore
    GROUP BY Category
    ORDER BY order_count DESC
""").df()

print(f"\n📊 Input: {len(superstore):,} rows")
print(f"📉 Output: {len(result_groupby)} rows (one per category)")
print("❌ We LOST all the details! Which products? Which customers? When?\n")
display(result_groupby)

print("\n" + "="*70)
print("APPROACH 2: WINDOW FUNCTION (Preserves Rows)")
print("="*70)
result_window = con.execute("""
    SELECT 
        "Order ID",
        "Product Name",
        Category,
        Sales,
        COUNT(*) OVER (PARTITION BY Category) AS category_order_count
    FROM superstore
    LIMIT 10
""").df()

print(f"\n📊 Input: {len(superstore):,} rows")
print(f"📈 Output: {len(superstore):,} rows before the LIMIT (all kept!)")
print("✅ We KEPT everything AND added the count!\n")
print(f"(Showing first {len(result_window)} rows)\n")
display(result_window)

print("\n" + "="*70)
print("🔑 KEY INSIGHT:")
print("="*70)
print(f"   GROUP BY:  {len(superstore):,} rows  →  {len(result_groupby)} rows      (COLLAPSED)")
print(f"   Window:    {len(superstore):,} rows  →  {len(superstore):,} rows  (PRESERVED)")
print("="*70)
print("\n💡 This is why window functions are powerful:")
print("   You get the DETAIL + the AGGREGATE in the same result!")
print("="*70)

**See the difference?**
- All detail rows are still there!
- But we've added a new column: `category_order_count`
- Every row in "Furniture" shows the same count
- Every row in "Technology" shows its count

**The calculation happened "over a window" of rows, but we kept all rows!**
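
One way to convince yourself of this: a window aggregate gives the same numbers as a GROUP BY summary joined back onto the detail rows. The sketch below checks that on a tiny made-up table. It uses Python's built-in `sqlite3` purely so it runs without the Superstore CSV (an assumption for illustration; the `orders` table and its rows are hypothetical), and the same SQL should also run through `con.execute(...)` in DuckDB.

```python
import sqlite3

# Toy stand-in for the Superstore table (hypothetical rows, not real data).
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE orders (order_id TEXT, category TEXT, sales REAL)")
con_demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("A1", "Furniture", 100.0),
        ("A2", "Furniture", 50.0),
        ("A3", "Technology", 300.0),
        ("A4", "Office Supplies", 20.0),
        ("A5", "Technology", 80.0),
    ],
)

# Window version: every detail row is kept and gets its category's count.
window_rows = con_demo.execute("""
    SELECT order_id, category,
           COUNT(*) OVER (PARTITION BY category) AS category_order_count
    FROM orders
    ORDER BY order_id
""").fetchall()

# Same result without a window: summarize with GROUP BY, then join back.
join_rows = con_demo.execute("""
    SELECT o.order_id, o.category, g.category_order_count
    FROM orders AS o
    JOIN (
        SELECT category, COUNT(*) AS category_order_count
        FROM orders
        GROUP BY category
    ) AS g ON g.category = o.category
    ORDER BY o.order_id
""").fetchall()

print(window_rows == join_rows)  # both approaches agree
for row in window_rows:
    print(row)
```

The window version says the same thing in one SELECT, which is why it is usually the more readable choice once you need detail and aggregate together.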

---

## 4. When to Use Each

### Decision Guide

**Use GROUP BY when:**
- ✅ You want summary only (one row per group)
- ✅ You don't need row-level details
- ✅ Example: "What's our revenue per region?" (just the totals)

**Use Window Functions when:**
- ✅ You want detail + calculation
- ✅ You need ranking (1st, 2nd, 3rd...)
- ✅ You need row-to-row comparisons ("this month vs last month")
- ✅ You need to filter AFTER calculating ("show me the top 3 per category")
- ✅ Example: "Show me all orders, with each order's rank within its category"

**Key insight:** If GROUP BY loses information you need, use window functions!
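
As a small illustration of "detail + calculation", the sketch below shows each order next to its category total and its share of that total. It is a toy example under stated assumptions: the `orders` table and its rows are made up, and it uses the standard-library `sqlite3` module only so it runs without the Superstore file. The SQL is plain window syntax and should behave the same in DuckDB.

```python
import sqlite3

# Hypothetical mini-table standing in for Superstore.
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE orders (order_id TEXT, category TEXT, sales REAL)")
con_demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("A1", "Furniture", 100.0),
        ("A2", "Furniture", 300.0),
        ("A3", "Technology", 50.0),
        ("A4", "Technology", 150.0),
    ],
)

# Each row keeps its own sales figure AND sees the total for its category,
# so a per-row percentage is a single expression.
share_rows = con_demo.execute("""
    SELECT
        order_id,
        category,
        sales,
        SUM(sales) OVER (PARTITION BY category) AS category_total,
        ROUND(100.0 * sales / SUM(sales) OVER (PARTITION BY category), 1)
            AS pct_of_category
    FROM orders
    ORDER BY order_id
""").fetchall()

for row in share_rows:
    print(row)
```

A GROUP BY query could produce `category_total`, but not on the same row as the individual `sales` value, which is what the percentage needs.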

---

## 5. Basic Window Function Syntax

### Simple Example: No PARTITION or ORDER

In [None]:
# Add total order count to every row
con.execute("""
    SELECT 
        "Order ID",
        "Product Name",
        COUNT(*) OVER () AS total_orders_in_dataset
    FROM superstore
    LIMIT 5
""").df()

**What happened:** `COUNT(*) OVER ()` with an empty `OVER ()` treats the whole result set as one window: it counts ALL rows and attaches that number to every row.

### With PARTITION BY

In [None]:
# Add count PER CATEGORY to every row
con.execute("""
    SELECT 
        "Order ID",
        "Product Name",
        Category,
        COUNT(*) OVER (PARTITION BY Category) AS category_count
    FROM superstore
    LIMIT 10
""").df()

**PARTITION BY is like GROUP BY for window functions!**
- "PARTITION BY Category" = "For each category..."
- COUNT happens within each partition
- But all rows are kept!

### Window Function Syntax Template

```sql
<function>() OVER (
    PARTITION BY group_column    -- Optional: "for each..."
    ORDER BY sort_column         -- When order matters (required for some functions)
)
```
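
To see the pieces of the template side by side, here is a minimal sketch that puts three windows in one query: an empty `OVER ()`, one with `PARTITION BY`, and one with both `PARTITION BY` and `ORDER BY`. The `orders` table and its rows are hypothetical, and `sqlite3` from the standard library is used only so the example runs on its own; the SQL is not specific to SQLite.

```python
import sqlite3

# Hypothetical rows; dates are ISO strings so they sort chronologically.
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE orders (order_id TEXT, category TEXT, order_date TEXT)")
con_demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("A1", "Furniture", "2024-01-05"),
        ("A2", "Furniture", "2024-01-02"),
        ("A3", "Technology", "2024-01-03"),
    ],
)

template_rows = con_demo.execute("""
    SELECT
        order_id,
        category,
        order_date,
        COUNT(*) OVER () AS n_all,                              -- whole table
        COUNT(*) OVER (PARTITION BY category) AS n_in_category, -- per category
        ROW_NUMBER() OVER (
            PARTITION BY category
            ORDER BY order_date
        ) AS seq_in_category                                    -- oldest = 1
    FROM orders
    ORDER BY category, order_date
""").fetchall()

for row in template_rows:
    print(row)
```

Note that the `ORDER BY` inside `OVER (...)` only controls how rows are numbered within each partition; the final `ORDER BY` at the end of the query is what sorts the output.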

---

## 6. The Most Common Use Case: ROW_NUMBER()

### The Business Problem

> **"I want the most recent order for each customer."**

This is a VERY common pattern in data analysis:
- Latest transaction per customer
- Most recent login per user
- Current status per order

### Why GROUP BY Fails

In [None]:
# Try with GROUP BY: Get latest date per customer
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        MAX("Order Date") AS latest_order_date
    FROM superstore
    GROUP BY "Customer ID", "Customer Name"
    LIMIT 5
""").df()

**Problem:** We got the date, but we **lost the order details!**
- What was ordered?
- What category?
- Order ID?
- Sales amount?

GROUP BY collapsed everything. We need a different approach.

### Solution: ROW_NUMBER()

> **ROW_NUMBER() assigns a sequential number to each row within a group**

Strategy:
1. For each customer, rank orders by date (newest = 1)
2. Keep all the row details
3. Filter to rank = 1

### Step 1: Add Row Numbers

In [None]:
# Add row numbers
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        "Order ID",
        "Order Date",
        Category,
        Sales,
        ROW_NUMBER() OVER (
            PARTITION BY "Customer ID" 
            ORDER BY "Order Date" DESC
        ) AS row_num
    FROM superstore
    LIMIT 20
""").df()

**Breaking it down:**
- `PARTITION BY "Customer ID"` = For each customer...
- `ORDER BY "Order Date" DESC` = Sort by date, newest first
- `ROW_NUMBER()` = Assign 1, 2, 3, ...

**Result:** Each customer's orders are numbered, 1 = most recent!
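
One caveat worth knowing: the numbering is only reproducible when the `ORDER BY` columns pick out a unique row within each partition. Superstore has one row per product line, so a customer can have several rows with the same `Order Date`, and which of those gets `row_num = 1` is then arbitrary. A common fix is to add a tie-breaking column to the sort. The sketch below shows the idea on a made-up table; `row_id` is a hypothetical unique column, and `sqlite3` is used only to keep the example self-contained.

```python
import sqlite3

# Hypothetical data: customer C1 has two rows on the same latest date.
con_demo = sqlite3.connect(":memory:")
con_demo.execute("""
    CREATE TABLE orders (
        row_id INTEGER, customer_id TEXT, order_id TEXT,
        order_date TEXT, sales REAL
    )
""")
con_demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, "C1", "O-1", "2024-03-01", 10.0),
        (2, "C1", "O-2", "2024-05-10", 25.0),
        (3, "C1", "O-2", "2024-05-10", 40.0),  # same date as row 2: a tie
        (4, "C2", "O-3", "2024-02-02", 5.0),
    ],
)

# Sorting by date alone leaves rows 2 and 3 tied; adding row_id DESC
# as a second sort key makes the choice of "row 1" deterministic.
latest_rows = con_demo.execute("""
    SELECT customer_id, order_id, order_date, sales, row_num
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY order_date DESC, row_id DESC
               ) AS row_num
        FROM orders
    )
    WHERE row_num = 1
    ORDER BY customer_id
""").fetchall()

for row in latest_rows:
    print(row)
```

Which tie-breaker makes sense depends on the question: the highest `row_id`, the largest `sales`, or something else. The point is to choose one explicitly rather than leave it to chance.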

### Step 2: Filter to Latest Only

In [None]:
# Now filter to row_num = 1 using a subquery
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        "Order ID",
        "Order Date",
        Category,
        "Product Name",
        Sales
    FROM (
        SELECT 
            "Customer ID",
            "Customer Name",
            "Order ID",
            "Order Date",
            Category,
            "Product Name",
            Sales,
            ROW_NUMBER() OVER (
                PARTITION BY "Customer ID" 
                ORDER BY "Order Date" DESC
            ) AS row_num
        FROM superstore
    )
    WHERE row_num = 1
    ORDER BY "Order Date" DESC
    LIMIT 10
""").df()

**Perfect!** Now we have:
- ✅ Latest order per customer
- ✅ All order details (Product, Category, Sales, etc.)
- ✅ No information lost!

**Key pattern:** Window functions add columns; they don't filter. `WHERE` is evaluated before window functions are computed, so it cannot see `row_num`. Use a **subquery** to filter on the window result.
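
The sketch below shows why the subquery is needed: putting the window function straight into `WHERE` is rejected, while wrapping it in a subquery works. It runs on a hypothetical `orders` table with the standard-library `sqlite3` module so it is self-contained. DuckDB should reject the direct version as well, though its error message will differ.

```python
import sqlite3

# Hypothetical mini-table.
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE orders (customer_id TEXT, order_date TEXT, sales REAL)")
con_demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("C1", "2024-03-01", 10.0),
        ("C1", "2024-05-10", 25.0),
        ("C2", "2024-02-02", 5.0),
    ],
)

# Attempt 1: window function directly in WHERE. This is not allowed,
# because WHERE runs before window functions are evaluated.
try:
    con_demo.execute("""
        SELECT customer_id, order_date
        FROM orders
        WHERE ROW_NUMBER() OVER (
            PARTITION BY customer_id ORDER BY order_date DESC
        ) = 1
    """).fetchall()
    where_worked = True
except sqlite3.OperationalError as err:
    where_worked = False
    print("WHERE rejected the window function:", err)

# Attempt 2: compute row_num in a subquery, then filter outside it.
latest = con_demo.execute("""
    SELECT customer_id, order_date, sales
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY order_date DESC
               ) AS row_num
        FROM orders
    )
    WHERE row_num = 1
    ORDER BY customer_id
""").fetchall()

print(latest)
```

DuckDB also offers a `QUALIFY` clause that filters on window results without a subquery; this primer sticks with the subquery form because it works in most SQL databases.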

---

### ⏸️ Pause and Try!

**Your task:** Modify the query above to get the **3 most recent** orders per customer (not just the latest).

**Requirements:**
1. Use the same ROW_NUMBER pattern
2. Change the `WHERE` filter to get top 3 instead of latest (hint: `<= 3`)
3. Keep all the same columns in the output, plus `row_num` so you can see the ranking
4. Order by Customer ID and row_num
5. Limit to 15 rows total

Replace the placeholder query in the cell below with your complete SQL query.

In [None]:
# Your turn! Write your TOP 3 query here:
con.execute("SELECT 1 AS todo").df()  # Replace this entire query with your answer

### Solution: Top 3 Orders Per Customer

In [None]:
# Get top 3 most recent orders per customer
con.execute("""
    SELECT 
        "Customer ID",
        "Customer Name",
        "Order ID",
        "Order Date",
        Category,
        "Product Name",
        Sales,
        row_num
    FROM (
        SELECT 
            "Customer ID",
            "Customer Name",
            "Order ID",
            "Order Date",
            Category,
            "Product Name",
            Sales,
            ROW_NUMBER() OVER (
                PARTITION BY "Customer ID" 
                ORDER BY "Order Date" DESC
            ) AS row_num
        FROM superstore
    )
    WHERE row_num <= 3
    ORDER BY "Customer ID", row_num
    LIMIT 15
""").df()

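**Just change the filter!** `WHERE row_num <= 3` keeps the 3 most recent rows per customer. The outer `ORDER BY "Customer ID", row_num` only arranges the output so each customer's rows appear together.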

**This pattern works for:**
- Top N products per category
- Latest N transactions per account
- Most recent N logins per user

---

## 7. Summary: What You Learned

### Key Concepts

1. ✅ **Windows preserve rows; GROUP BY collapses them**
   - GROUP BY: ~10,000 rows → 3 rows (summary)
   - Windows: ~10,000 rows → ~10,000 rows (detail + calculation)

2. ✅ **Use PARTITION BY for groups** (like GROUP BY for windows)
   - `PARTITION BY Category` = "For each category..."
   - But all rows are kept!

3. ✅ **Use ORDER BY when order matters**
   - Needed for ROW_NUMBER() to know what "first" means (without it, the numbering is arbitrary)
   - `ORDER BY date DESC` = Newest first

4. ✅ **ROW_NUMBER() for "latest/top N per group"**
   - Add row numbers with PARTITION BY + ORDER BY
   - Filter with subquery: `WHERE row_num = 1`

5. ✅ **Pattern for filtering:** Use subquery
   - Window functions ADD columns
   - To FILTER on those columns, wrap in subquery

### Syntax You Know

```sql
-- Basic window
COUNT(*) OVER (PARTITION BY category)

-- ROW_NUMBER for ranking
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date DESC)

-- Filter pattern (subquery)
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (...) AS row_num
    FROM table
)
WHERE row_num = 1
```

### Ready for More?

This primer covered the essentials. If you want to learn:
- **LAG()/LEAD()** for period-over-period comparisons ("this month vs last month")
- **Moving averages** with ROWS BETWEEN frame specifications
- **Advanced patterns** and edge cases

→ See the **Window Functions Deep Dive** notebook!

### You're Ready for HW1! 🎉

You now know:
- SELECT, WHERE, ORDER BY
- GROUP BY, HAVING
- Window functions with ROW_NUMBER()

That's everything you need for Homework 1. Practice these patterns - you'll use them constantly in real data work!

---