# Homework 1: SQL Foundations with DuckDB

**Name:** Anton Shestakov
**Due:** Day 2, Start of Class  
**Total Points:** 100 (+ 10 bonus)

---

## Instructions

1. Complete all TODO sections below
2. Write SQL queries to answer each question
3. Add markdown explanations where requested
4. Before submitting: **Kernel → Restart & Run All Cells**
5. Verify all outputs are visible
6. Rename file to `hw1_[your_name].ipynb`

**Read the README.md for full assignment details, rubric, and tips!**

---

## Setup

Run these cells to set up your environment.

In [82]:
# Install DuckDB (if not already installed)
!pip install duckdb -q

In [83]:
# Import libraries
import duckdb
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


In [84]:
# Connect to DuckDB
con = duckdb.connect(':memory:')

print("✅ Connected to DuckDB!")

✅ Connected to DuckDB!


In [85]:
# Load the Online Retail dataset
# This creates a table called 'retail' that you'll query
con.execute("""
    CREATE TABLE retail AS 
    SELECT * FROM 'data/online_retail_hw1.csv'
""")

print("✅ Dataset loaded!")

✅ Dataset loaded!


### Dataset Exploration

Let's explore the data before starting the assignment.

In [86]:
# Check row count
con.execute("""SELECT COUNT(*) as total_rows 
            FROM retail""").df()

Unnamed: 0,total_rows
0,525461


In [87]:
# View first few rows
con.execute("SELECT * FROM retail LIMIT 5").df()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


In [88]:
# Check for NULL values in each column
con.execute("""
    SELECT 
        COUNT(*) - COUNT(Invoice) AS invoice_nulls,
        COUNT(*) - COUNT(StockCode) AS stockcode_nulls,
        COUNT(*) - COUNT(Description) AS description_nulls,
        COUNT(*) - COUNT(Quantity) AS quantity_nulls,
        COUNT(*) - COUNT(InvoiceDate) AS date_nulls,
        COUNT(*) - COUNT(Price) AS price_nulls,
        COUNT(*) - COUNT("Customer ID") AS customerid_nulls,
        COUNT(*) - COUNT(Country) AS country_nulls
    FROM retail
""").df()

Unnamed: 0,invoice_nulls,stockcode_nulls,description_nulls,quantity_nulls,date_nulls,price_nulls,customerid_nulls,country_nulls
0,0,0,2928,0,0,0,107927,0


In [89]:
# Check date range
con.execute("""
    SELECT 
        MIN(InvoiceDate) as first_transaction,
        MAX(InvoiceDate) as last_transaction
    FROM retail
""").df()

Unnamed: 0,first_transaction,last_transaction
0,2009-12-01 07:45:00,2010-12-09 20:01:00


**Good!** Now you know:
- Total row count
- Which columns have NULLs (Customer ID and Description)
- Date range covered

Keep this in mind as you write queries!

---

## Part 1: Basic Queries (30 points)

This section tests: SELECT, WHERE, ORDER BY, NULL handling, LIKE

#### Note
I counted the rows whose Invoice contains the letter "C", which according to the description in the README marks cancelled transactions. For a reliable analysis this category has to be handled explicitly. The first query below counts all rows that belong to cancelled transactions; the second counts cancelled rows that still have a positive quantity and price. These rows matter because a filter on positive quantity and price alone does not remove them, so they would be included in, for example, the total revenue even though the transaction was actually cancelled.

In total, 10,206 rows belong to cancelled transactions. With the conditions Quantity > 0 and Price > 0 applied, only 1 row remains.

In [90]:
con.execute("""
    SELECT 
        SUM(transaction_count) AS canceled_transaction_count
        FROM (
        SELECT
        Invoice,
        COUNT(*) AS transaction_count
        FROM retail
        WHERE Invoice LIKE '%C%'
        GROUP BY Invoice)
""").df()

Unnamed: 0,canceled_transaction_count
0,10206.0


In [91]:
con.execute("""
            SELECT
            Invoice,
            COUNT(*) as canceled_transaction_count
            FROM retail
            WHERE Quantity > 0
            AND Price > 0
            AND Invoice LIKE '%C%'
            GROUP BY Invoice
            """).df()

Unnamed: 0,Invoice,canceled_transaction_count
0,C496350,1


### Question 1.1: Guest Checkouts (8 points)

**Business question:** How many transactions were guest checkouts (no Customer ID)?

**Requirements:**
- Count transactions where Customer ID is NULL
- Also calculate what percentage of total transactions this represents
- Your result should have two columns: `guest_transactions` and `pct_of_total`

**Hint:** Remember to use `IS NULL`, not `= NULL`!

In [92]:
# TODO: Write your query here
# Cancelled transactions are not filtered out here; they are kept in both counts.
con.execute("""
            SELECT 
            COUNT(*) as guest_transactions,
            guest_transactions * 100 / (SELECT COUNT(*) FROM retail) as pct_of_total
            FROM retail
            WHERE "Customer ID" IS NULL
            """).df()

Unnamed: 0,guest_transactions,pct_of_total
0,107927,20.539488


**TODO: Explain your result in 1-2 sentences:**

I counted the rows without a Customer ID using COUNT(*) with a WHERE "Customer ID" IS NULL filter. I then computed the percentage by dividing guest_transactions by the total number of rows, which comes from a scalar subquery. The result is 107,927 guest rows, or about 20.54% of all rows.
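
As a quick sanity check, the percentage can be reproduced in plain Python from the two counts reported in the outputs above (107,927 guest rows out of 525,461 rows in total). This is only a sketch of the arithmetic, not part of the query:

```python
# Counts copied from the query outputs above.
guest_transactions = 107_927
total_rows = 525_461

# Multiply by 100 before dividing, mirroring the SQL expression.
pct_of_total = guest_transactions * 100 / total_rows
print(f"{pct_of_total:.2f}%")
```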

---

### Question 1.2: High-Value Transactions (7 points)

**Business question:** Show the top 20 highest-value transactions (revenue = Quantity * Price).

**Requirements:**
- Calculate revenue as Quantity * Price
- Show: Invoice, Description, Quantity, Price, Revenue
- Only include rows where Quantity and Price are both positive
- Filter out NULL values appropriately
- Sort by revenue descending
- Limit to top 20

**Hint:** Use calculated column with AS to name it `revenue`

In [93]:
# TODO: Write your query here
# Cancelled transactions are excluded with the condition: Invoice NOT LIKE '%C%'.
con.execute(""" SELECT
            Invoice, 
            Description,
            Quantity,
            Price,
            Quantity*Price as revenue
            FROM retail
            WHERE "Quantity" > 0
            AND "Price" > 0
            AND "Description" IS NOT NULL
            AND Invoice NOT LIKE '%C%'
            ORDER BY "revenue" DESC
            LIMIT 20
            """).df()

Unnamed: 0,Invoice,Description,Quantity,Price,revenue
0,512771,Manual,1,25111.09,25111.09
1,530715,ROTATING SILVER ANGELS T-LIGHT HLDR,9360,1.69,15818.4
2,537632,AMAZON FEE,1,13541.33,13541.33
3,502263,Manual,1,10953.5,10953.5
4,502265,Manual,1,10953.5,10953.5
5,522796,Manual,1,10468.8,10468.8
6,524159,Manual,1,10468.8,10468.8
7,525399,Manual,1,10468.8,10468.8
8,496115,Manual,1,8985.6,8985.6
9,511465,PINK PAPER PARASOL,3500,2.55,8925.0


---

### Question 1.3: Product Search (7 points)

**Business question:** Find all products with "CHRISTMAS" in the description.

**Requirements:**
- Use LIKE with wildcard pattern matching
- Show: StockCode, Description
- Get distinct products only (no duplicates)
- Sort alphabetically by Description
- Limit to first 15 results

**Hint:** LIKE is case-sensitive in DuckDB; use ILIKE if you need case-insensitive matching

In [94]:
# TODO: Write your query here
con.execute(""" SELECT DISTINCT
            StockCode,
            Description
            FROM retail
            WHERE Description LIKE '%CHRISTMAS%'
            ORDER BY Description ASC
            LIMIT 15
            """).df()

Unnamed: 0,StockCode,Description
0,35962,12 ASS ZINC CHRISTMAS DECORATIONS
1,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS
2,72815,3 WICK CHRISTMAS BRIAR CANDLE
3,22950,36 DOILIES VINTAGE CHRISTMAS
4,22731,3D CHRISTMAS STAMPS STICKERS
5,22731,3D STICKERS CHRISTMAS STAMPS
6,22733,3D STICKERS TRADITIONAL CHRISTMAS
7,22732,3D STICKERS VINTAGE CHRISTMAS
8,22733,3D TRADITIONAL CHRISTMAS STICKERS
9,22732,3D VINTAGE CHRISTMAS STICKERS


---

### Question 1.4: Multi-Country Orders (8 points)

**Business question:** Show transactions from France, Germany, or Spain, with quantity greater than 10.

**Requirements:**
- Use IN operator for country filtering
- Filter for Quantity > 10
- Show: Invoice, Country, Description, Quantity, Price
- Sort by Country, then Quantity descending
- Limit to 25 rows
- Handle NULLs appropriately

**Hint:** Combine IN with AND for multiple conditions

In [95]:
# TODO: Write your query here
# Cancelled transactions are excluded with the condition: Invoice NOT LIKE '%C%'.
con.execute("""
            SELECT
            Invoice,
            Country,
            Description,
            Quantity,
            Price
            FROM retail
            WHERE Country IN('France', 'Germany', 'Spain')
            AND Quantity > 10
            AND Description IS NOT NULL
            AND Invoice NOT LIKE '%C%'
            ORDER BY Country DESC,
            Quantity DESC
            LIMIT 25
            """).df()

Unnamed: 0,Invoice,Country,Description,Quantity,Price
0,531561,Spain,FRENCH WC SIGN BLUE METAL,576,1.06
1,531561,Spain,LOVE GARLAND PAINTED ZINC,576,1.45
2,527113,Spain,FRENCH WC SIGN BLUE METAL,576,1.25
3,527113,Spain,LOVE GARLAND PAINTED ZINC,576,1.65
4,527796,Spain,FRENCH WC SIGN BLUE METAL,576,1.06
5,527796,Spain,LOVE GARLAND PAINTED ZINC,576,1.45
6,506839,Spain,LARGE HANGING GLASS+ZINC LANTERN,560,1.6
7,531561,Spain,WOODEN HAPPY BIRTHDAY GARLAND,288,2.55
8,531561,Spain,MINI WOODEN HAPPY BIRTHDAY GARLAND,288,1.45
9,527796,Spain,WOODEN HAPPY BIRTHDAY GARLAND,288,2.55


**TODO: Why did you include (or not include) NULL checks in this query?**

The task requires Quantity > 10. A comparison with NULL never evaluates to TRUE, so this condition already excludes rows where Quantity is NULL, and I did not add a separate IS NOT NULL filter for the Quantity column. I did filter out NULL values in Description, because that column is part of the requested output and I considered it important. If I were only interested in the number of transactions, Description would not matter and the filter could be dropped.
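
The NULL behaviour described above can be checked on a toy table. The sketch below uses Python's built-in `sqlite3` as a stand-in for DuckDB (an assumption, made only so that it runs without the dataset); the table `toy` and its rows are hypothetical, and the NULL comparison rules shown are standard SQL:

```python
import sqlite3

# sqlite3 is only a stand-in for DuckDB here; the toy rows are made up.
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE toy (Description TEXT, Quantity INTEGER)")
con_demo.executemany(
    "INSERT INTO toy VALUES (?, ?)",
    [("LANTERN", 12), ("GARLAND", None), (None, 48), ("SIGN", 5)],
)

# Quantity > 10 evaluates to NULL (not TRUE) for the NULL quantity,
# so that row is dropped without an explicit IS NOT NULL check.
rows_quantity_only = con_demo.execute(
    "SELECT Description, Quantity FROM toy "
    "WHERE Quantity > 10 ORDER BY Quantity"
).fetchall()

# Description is not compared to anything, so its NULL survives
# unless it is filtered out explicitly.
rows_with_description = con_demo.execute(
    "SELECT Description, Quantity FROM toy "
    "WHERE Quantity > 10 AND Description IS NOT NULL ORDER BY Quantity"
).fetchall()

print(rows_quantity_only)
print(rows_with_description)
```

The first result still contains the row with a missing description; only the second query removes it.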

---

## Part 2: Aggregations (40 points)

This section tests: COUNT, SUM, AVG, GROUP BY, HAVING, WHERE vs HAVING

### Question 2.1: Revenue by Country (10 points)

**Business question:** What's our total revenue and transaction count for each country?

**Requirements:**
- Calculate total revenue (Quantity * Price) per country
- Count transactions per country
- Only include positive quantities and non-NULL prices
- Show: Country, total_revenue, transaction_count
- Sort by total_revenue descending
- Show all countries

**Hint:** Use SUM() and COUNT() with GROUP BY

In [96]:
# TODO: Write your query here


# Note: each row is counted as a separate transaction. If one transaction should instead mean
# one unique invoice, replace COUNT(Invoice) with COUNT(DISTINCT Invoice) in transaction_count.
# Cancelled transactions are excluded with the condition Invoice NOT LIKE '%C%': total revenue
# should not include cancelled rows that still have a positive quantity and price.
con.execute("""
            SELECT 
            Country,
            SUM(Quantity*Price) as total_revenue,
            COUNT(Invoice) as transaction_count
            FROM retail
            WHERE Quantity > 0
            AND Price IS NOT NULL
            AND Invoice NOT LIKE '%C%'
            GROUP BY Country
            ORDER BY total_revenue DESC
            """).df()

Unnamed: 0,Country,total_revenue,transaction_count
0,United Kingdom,8709204.0,474937
1,EIRE,380977.8,9460
2,Netherlands,268786.0,2730
3,Germany,202395.3,7661
4,France,147211.5,5532
5,Sweden,53525.39,887
6,Denmark,50906.85,418
7,Spain,47601.42,1235
8,Switzerland,43921.39,1170
9,Australia,31446.8,630


**TODO: Which country generates the most revenue? Does this surprise you?**

As stated in the README.md for this assignment, the data comes from a UK-based retail company, so it is not surprising that the UK has the highest revenue as well as the largest number of transactions (where I count each row, that is, each item line, as a transaction rather than each unique invoice).
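
The difference between counting rows and counting unique invoices can be illustrated on a toy table. This is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for DuckDB (an assumption, so that it runs without the dataset); the invoice labels `INV-1` to `INV-3` are hypothetical:

```python
import sqlite3

# sqlite3 is only a stand-in for DuckDB; the invoice labels are made up.
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE toy (Invoice TEXT, Country TEXT)")
con_demo.executemany(
    "INSERT INTO toy VALUES (?, ?)",
    [
        ("INV-1", "United Kingdom"),  # two item lines on the same invoice
        ("INV-1", "United Kingdom"),
        ("INV-2", "United Kingdom"),
        ("INV-3", "France"),
    ],
)

# COUNT(Invoice) counts item lines; COUNT(DISTINCT Invoice) counts invoices.
line_vs_invoice = con_demo.execute("""
    SELECT Country,
           COUNT(Invoice) AS line_count,
           COUNT(DISTINCT Invoice) AS invoice_count
    FROM toy
    GROUP BY Country
    ORDER BY Country
""").fetchall()

print(line_vs_invoice)
```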

---

### Question 2.2: Popular Products (10 points)

**Business question:** Which products have been ordered more than 1,000 times?

**Requirements:**
- Group by StockCode and Description
- Count how many times each product appears
- Calculate total quantity sold for each product
- Filter to products with MORE than 1,000 transactions (use HAVING!)
- Show: StockCode, Description, transaction_count, total_quantity_sold
- Sort by transaction_count descending

**Hint:** This requires HAVING, not WHERE, because you're filtering on an aggregate

In [97]:
# TODO: Write your query here
con.execute("""
            SELECT 
            StockCode, 
            Description,
            COUNT(*) as transaction_count,
            SUM(Quantity) as total_quantity_sold
            FROM retail
            GROUP BY StockCode, Description
            HAVING COUNT(*) > 1000
            ORDER BY transaction_count DESC
            """).df()

Unnamed: 0,StockCode,Description,transaction_count,total_quantity_sold
0,85123A,WHITE HANGING HEART T-LIGHT HOLDER,3515,57428.0
1,22423,REGENCY CAKESTAND 3 TIER,2212,13093.0
2,21232,STRAWBERRY CERAMIC TRINKET BOX,1843,26563.0
3,21212,PACK OF 72 RETRO SPOT CAKE CASES,1466,46106.0
4,84879,ASSORTED COLOUR BIRD ORNAMENT,1457,44925.0
5,84991,60 TEATIME FAIRY CAKE CASES,1400,36326.0
6,21754,HOME BUILDING BLOCK WORD,1386,5048.0
7,85099B,JUMBO BAG RED RETROSPOT,1285,30308.0
8,20725,LUNCH BAG RED SPOTTY,1274,17608.0
9,21034,REX CASH+CARRY JUMBO SHOPPER,1232,2296.0


**TODO: Explain why you used HAVING instead of WHERE for the >1000 filter:**

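The filter is on an aggregate, `transaction_count` (that is, COUNT(*)), so it has to go into HAVING. WHERE is evaluated on individual rows before grouping, when the count does not exist yet; HAVING is evaluated after GROUP BY.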
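
A small sketch of the evaluation order, again on a hypothetical toy table and with Python's built-in `sqlite3` standing in for DuckDB (an assumption, so that it runs without the dataset):

```python
import sqlite3

# sqlite3 is only a stand-in for DuckDB; the toy rows are made up.
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE toy (StockCode TEXT, Quantity INTEGER)")
con_demo.executemany(
    "INSERT INTO toy VALUES (?, ?)",
    [("A", 5), ("A", 7), ("A", 1), ("B", 9)],
)

# WHERE runs first and removes individual rows (the row with Quantity = 1).
# HAVING runs after GROUP BY and removes whole groups (product B, count = 1).
popular = con_demo.execute("""
    SELECT StockCode, COUNT(*) AS transaction_count
    FROM toy
    WHERE Quantity > 2
    GROUP BY StockCode
    HAVING COUNT(*) > 1
    ORDER BY StockCode
""").fetchall()

print(popular)
```

Product A keeps two of its three rows after WHERE and passes the HAVING threshold; product B is dropped as a group.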

---

### Question 2.3: High-Value Customers (10 points)

**Business question:** Which customers have spent more than £5,000 total?

**Requirements:**
- Calculate total spending (SUM of Quantity * Price) per customer
- Count their number of transactions
- Only include customers with Customer ID (exclude guest checkouts)
- Only include positive quantities and prices
- Filter to customers with total spending > 5000
- Show: Customer ID, total_spent, transaction_count
- Sort by total_spent descending

**Hint:** Use WHERE for row-level filtering (NULLs, positive values) and HAVING for aggregate filtering (>5000)

In [98]:
# TODO: Write your query here
con.execute("""
            SELECT
            "Customer ID",
            SUM(Quantity*Price) as total_spent,
            COUNT(*) as transaction_count
            FROM retail
            WHERE Quantity > 0
            AND Price > 0
            AND "Customer ID" IS NOT NULL
            GROUP BY "Customer ID"
            HAVING SUM(Quantity*Price) > 5000
            ORDER BY total_spent DESC
            """).df()

Unnamed: 0,Customer ID,total_spent,transaction_count
0,18102.0,349164.35,627
1,14646.0,248396.50,1773
2,14156.0,196566.74,2648
3,14911.0,152147.57,5570
4,13694.0,131443.19,957
...,...,...,...
282,12474.0,5048.66,286
283,16186.0,5019.17,307
284,13599.0,5013.96,163
285,13869.0,5006.62,553


---

### Question 2.4: Monthly Revenue Trend (10 points)

**Business question:** What's our revenue and transaction count by month?

**Requirements:**
- Extract month from InvoiceDate (use DATE_TRUNC('month', InvoiceDate))
- Calculate total revenue per month
- Count transactions per month
- Calculate average transaction value per month
- Only include positive quantities and prices
- Show: month, total_revenue, transaction_count, avg_transaction_value
- Sort by month chronologically

**Hint:** DATE_TRUNC('month', date_column) gives you the first day of each month

In [99]:
# TODO: Write your query here
# Cancelled transactions are excluded with the condition: Invoice NOT LIKE '%C%'.
con.execute("""
            SELECT
            DATE_TRUNC('month', InvoiceDate) as month,
            SUM(Quantity*Price) as total_revenue,
            COUNT(*) as transaction_count,
            AVG(Quantity*Price) as avg_transaction_value
            FROM retail
            WHERE Quantity > 0
            AND Price > 0
            AND Invoice NOT LIKE '%C%'
            GROUP BY month
            ORDER BY month ASC
            """).df()

Unnamed: 0,month,total_revenue,transaction_count,avg_transaction_value
0,2009-12-01,825685.76,43957,18.783942
1,2010-01-01,652708.502,30638,21.303887
2,2010-02-01,553339.736,28281,19.565777
3,2010-03-01,833570.131,40364,20.651326
4,2010-04-01,681528.992,33268,20.486022
5,2010-05-01,659858.86,33795,19.52534
6,2010-06-01,752270.14,38900,19.338564
7,2010-07-01,650712.94,32503,20.020089
8,2010-08-01,697274.91,32473,21.472451
9,2010-09-01,924333.011,41109,22.484931


**TODO: Do you see any seasonal patterns in the revenue?**

The highest revenue occurs in the pre-Christmas months: October, November and December. Revenue also tends to decline during the summer months.

---

## Part 3: Window Functions (30 points)

This section tests: ROW_NUMBER, LAG, moving averages

### Question 3.1: Latest Purchase Per Customer (8 points)

**Business question:** What was each customer's most recent purchase?

**Requirements:**
- Use ROW_NUMBER() to rank transactions per customer by date
- Partition by Customer ID
- Order by InvoiceDate descending (most recent first)
- Filter to only the most recent transaction (row_num = 1)
- Only include customers with Customer ID (no guest checkouts)
- Show: Customer ID, Invoice, InvoiceDate, Description, Quantity, Price
- Sort by InvoiceDate descending
- Show first 20 customers

**Hint:** You'll need a subquery - use ROW_NUMBER() in inner query, filter in outer query

In [100]:
# TODO: Write your query here
# Structure: 
# SELECT ... FROM (
#     SELECT ..., ROW_NUMBER() OVER (...) as row_num
#     FROM retail
# )
# WHERE row_num = 1

con.execute("""
            SELECT "Customer ID",
                Invoice,
                InvoiceDate,
                Description,
                Quantity,
                Price 
            FROM (
                SELECT 
                    "Customer ID",
                    Invoice,
                    InvoiceDate,
                    Description,
                    Quantity,
                    Price,
                    ROW_NUMBER() OVER (PARTITION BY "Customer ID" ORDER BY InvoiceDate DESC) as row_num
                FROM retail
            )
            WHERE row_num = 1
            AND "Customer ID" IS NOT NULL
            ORDER BY InvoiceDate DESC
            LIMIT 20
""").df()

Unnamed: 0,Customer ID,Invoice,InvoiceDate,Description,Quantity,Price
0,17530.0,538171,2010-12-09 20:01:00,PACK OF 60 DINOSAUR CAKE CASES,2,0.55
1,13969.0,538170,2010-12-09 19:32:00,JAM MAKING SET PRINTED,4,1.45
2,13230.0,538169,2010-12-09 19:28:00,HEART DECORATION WITH PEARLS,2,0.85
3,14702.0,538168,2010-12-09 19:23:00,RIBBON REEL LACE DESIGN,5,2.1
4,14713.0,538167,2010-12-09 18:58:00,SET OF 4 NAPKIN CHARMS STARS,3,2.55
5,17965.0,538166,2010-12-09 18:09:00,CHOCOLATE HOT WATER BOTTLE,1,4.95
6,14031.0,538165,2010-12-09 17:34:00,SOLDIERS EGG CUP,72,1.25
7,17841.0,538163,2010-12-09 17:27:00,CIRCUS PARADE LUNCH BOX,1,1.95
8,17576.0,538157,2010-12-09 16:57:00,PENCIL CASE LIFE IS BEAUTIFUL,5,2.95
9,15555.0,538156,2010-12-09 16:53:00,6 RIBBONS ELEGANT CHRISTMAS,8,1.65


**TODO: Why did you use a window function instead of GROUP BY for this question?**

GROUP BY collapses each group into a single summary row. Within one query it can return each customer's most recent (or earliest) transaction date, but not the other columns of that same row, such as Invoice, Description, Quantity and Price. If I add those columns to the SELECT list, I also have to add them to the GROUP BY clause, and the result is then grouped by every unique combination of values rather than by customer.

**For example:** 
```python
con.execute("""
    SELECT
        "Customer ID",
        Invoice,
        Description,
        MIN(InvoiceDate) AS earliest_date
    FROM retail
    GROUP BY "Customer ID", Invoice, Description
""").df()
```
If I run this query, I get the earliest date for each unique combination of customer, invoice and description, not one earliest date per customer; the same happens with MAX for the latest date. A window function lets me return exactly one row per customer while keeping all columns, without aggregating over unique combinations. The next code block shows the result of this query: more than half a million unique combinations, including rows with NaN in the "Customer ID" column :)

In [101]:
con.execute("""SELECT 
"Customer ID",
    Invoice,
    Description,
    MIN(InvoiceDate) AS earliest_date
FROM retail
GROUP BY "Customer ID", Invoice, Description
""").df()

Unnamed: 0,Customer ID,Invoice,Description,earliest_date
0,12540.0,535396,PACK OF 72 RETROSPOT CAKE CASES,2010-11-26 10:43:00
1,18031.0,535398,MAGIC DRAWING SLATE LEAP FROG,2010-11-26 11:00:00
2,18031.0,535398,REX CASH+CARRY JUMBO SHOPPER,2010-11-26 11:00:00
3,15550.0,535399,EIGHT PIECE DINOSAUR SET,2010-11-26 11:01:00
4,15550.0,535399,GREY FLORAL FELTCRAFT SHOULDER BAG,2010-11-26 11:01:00
...,...,...,...,...
511819,17754.0,501129,BIRDHOUSE DECORATION MAGIC GARDEN,2010-03-12 16:18:00
511820,17754.0,501129,6 EGG HOUSE PAINTED WOOD,2010-03-12 16:18:00
511821,17754.0,501129,VICTORIAN GLASS HANGING T-LIGHT,2010-03-12 16:18:00
511822,14031.0,C501131,TEA CUP AND SAUCER RETRO SPOT,2010-03-12 17:04:00


---

### Question 3.2: Week-over-Week Revenue Change (12 points)

**Business question:** How is our weekly revenue changing week-over-week?

**Requirements:**
- First, aggregate to weekly level (use DATE_TRUNC('week', InvoiceDate))
- Calculate total revenue per week
- Use LAG() to get previous week's revenue
- Calculate the change (current week - previous week)
- Calculate percent change
- Only include positive quantities and prices
- Show: week, weekly_revenue, prev_week_revenue, revenue_change, pct_change
- Sort by week chronologically
- Show all weeks

**Hint:** Build this incrementally - first get weekly totals, then add LAG, then calculate changes

In [102]:
# TODO: Write your query here
# Consider using a WITH clause (CTE) to make it cleaner:
# WITH weekly AS (
#     SELECT ... GROUP BY week
# )
# SELECT ..., LAG(...) OVER (ORDER BY week) FROM weekly

con.execute("""
            WITH weekly AS (
                SELECT 
                    DATE_TRUNC('week', InvoiceDate) AS week,
                    SUM(Quantity*Price) AS weekly_revenue
                FROM retail
                WHERE 
                    Quantity > 0
                    AND Price > 0
                GROUP BY week
            )
            SELECT 
                week,
                weekly_revenue,
                LAG(weekly_revenue, 1) OVER (ORDER BY week) as prev_week_revenue,
                ROUND(weekly_revenue - LAG(weekly_revenue, 1) OVER (ORDER BY week), 2) as revenue_change,
                ROUND(
                    (weekly_revenue - LAG(weekly_revenue, 1) OVER (ORDER BY week)) *100 /
                    LAG(weekly_revenue, 1) OVER (ORDER BY week), 2
                    ) as pct_change
            FROM weekly
            ORDER BY week
""").df()

Unnamed: 0,week,weekly_revenue,prev_week_revenue,revenue_change,pct_change
0,2009-11-30,267053.53,,,
1,2009-12-07,241631.47,267053.53,-25422.06,-9.52
2,2009-12-14,261738.39,241631.47,20106.92,8.32
3,2009-12-21,55262.37,261738.39,-206476.02,-78.89
4,2010-01-04,168520.11,55262.37,113257.74,204.95
5,2010-01-11,163999.72,168520.11,-4520.39,-2.68
6,2010-01-18,154221.301,163999.72,-9778.42,-5.96
7,2010-01-25,165967.371,154221.301,11746.07,7.62
8,2010-02-01,125037.642,165967.371,-40929.73,-24.66
9,2010-02-08,91110.61,125037.642,-33927.03,-27.13


**TODO: Which week had the biggest increase in revenue? What might explain this?**

The biggest increase, about 205%, occurs in the week starting 2010-01-04, the first full week of 2010. The previous row (the week of 2009-12-21) has unusually low revenue, probably because customers had already ordered the gifts and other things they needed before the holidays; after the holidays they started ordering actively again. The output has no row for the week of 2009-12-28, so LAG compares 2010-01-04 with the week of 2009-12-21.
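
To make the LAG step concrete, the sketch below repeats the calculation in plain Python for the first five weekly totals copied from the output above. It only illustrates the arithmetic; the list names are chosen to match the column aliases in the query:

```python
# First five weekly totals, copied from the output above.
weekly_revenue = [267053.53, 241631.47, 261738.39, 55262.37, 168520.11]

# LAG(weekly_revenue, 1): shift the series down by one row.
# The first row has no predecessor, which SQL reports as NULL.
prev_week_revenue = [None] + weekly_revenue[:-1]

# Percent change relative to the previous row, rounded as in the query.
pct_change = [
    None if prev is None else round((cur - prev) * 100 / prev, 2)
    for cur, prev in zip(weekly_revenue, prev_week_revenue)
]

print(pct_change)
```

The last value corresponds to the week starting 2010-01-04.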

---

### Question 3.3: 7-Day Moving Average (10 points)

**Business question:** What's the 7-day moving average of daily revenue?

**Requirements:**
- First, aggregate to daily level (DATE_TRUNC('day', InvoiceDate) or just InvoiceDate::DATE)
- Calculate total revenue per day
- Use window function with ROWS BETWEEN to calculate 7-day moving average
- The moving average should include current day + 6 days before
- Only include positive quantities and prices
- Show: date, daily_revenue, moving_avg_7day
- Sort by date
- Show first 30 days

**Hint:** ROWS BETWEEN 6 PRECEDING AND CURRENT ROW gives you 7 days total

In [103]:
# TODO: Write your query here
# WITH daily AS (
#     SELECT date, SUM(...) as daily_revenue
#     FROM retail
#     GROUP BY date
# )
# SELECT 
#     date,
#     daily_revenue,
#     AVG(daily_revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as moving_avg
# FROM daily

con.execute("""
        WITH daily AS (
            SELECT 
                DATE_TRUNC('day', InvoiceDate) AS date,
                SUM(Quantity*Price) AS daily_revenue
            FROM retail
            WHERE Quantity > 0
                AND Price > 0
            GROUP BY DATE_TRUNC('day', InvoiceDate)
        )
        SELECT
            date,
            daily_revenue,
            ROUND(AVG(daily_revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), 2) AS moving_avg 
        FROM daily
        ORDER BY date ASC
        LIMIT 30
""").df()

Unnamed: 0,date,daily_revenue,moving_avg
0,2009-12-01,54513.5,54513.5
1,2009-12-02,63352.51,58933.01
2,2009-12-03,74037.91,63967.97
3,2009-12-04,40732.92,58159.21
4,2009-12-05,9803.05,48487.98
5,2009-12-06,24613.64,44508.92
6,2009-12-07,45083.35,44590.98
7,2009-12-08,49517.23,43877.23
8,2009-12-09,40616.09,40629.17
9,2009-12-10,44442.11,36401.2


**TODO: Why is a moving average useful for analyzing daily revenue?**

A moving average smooths out day-to-day volatility, such as the sharp drop on 2009-12-05 in the output above, which makes the underlying trend easier to see in time series analysis.
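
The window `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` can be sketched in plain Python using the first eight daily totals from the output above. The helper name `trailing_mean` is hypothetical and only illustrates what the window computes:

```python
# First eight daily totals, copied from the output above.
daily_revenue = [
    54513.5, 63352.51, 74037.91, 40732.92,
    9803.05, 24613.64, 45083.35, 49517.23,
]

def trailing_mean(values, window=7):
    """Mean of the current value and up to window - 1 values before it."""
    result = []
    for i in range(len(values)):
        start = max(0, i - window + 1)  # fewer rows at the start of the series
        chunk = values[start:i + 1]
        result.append(sum(chunk) / len(chunk))
    return result

moving_avg = trailing_mean(daily_revenue)
print([round(value, 2) for value in moving_avg])
```

Note that the window counts rows, not calendar days: a date with no sales has no row in `daily`, so it is skipped rather than counted as zero.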

---

## Bonus Question (10 points)

This tests: Synthesis of multiple concepts (window functions + GROUP BY)

### Bonus: Top Product Per Country (10 points)

**Business question:** What's the #1 best-selling product (by revenue) in each country?

**Requirements:**
- Calculate total revenue per product per country
- Rank products within each country by revenue
- Show only the #1 product for each country
- Only include positive quantities and prices
- Show: Country, StockCode, Description, total_revenue, rank
- Sort by Country

**Strategy:**
1. First: GROUP BY country and product to get revenue per product per country
2. Then: Use ROW_NUMBER() to rank products within each country
3. Finally: Filter to rank = 1

**Hint:** This combines aggregation (GROUP BY) with window functions (ROW_NUMBER)

In [104]:
# TODO: Write your query here (BONUS)
# This is challenging! Break it into steps:
# 1. Inner query: GROUP BY country and product
# 2. Middle query: Add ROW_NUMBER() partitioned by country
# 3. Outer query: Filter to row_num = 1

con.execute("""
            WITH table1 AS (SELECT 
                Country,
                StockCode,
                Description,
                SUM(Quantity*Price) AS total_revenue
            FROM retail
            WHERE Quantity > 0
            AND Price > 0
            GROUP BY Country, StockCode, Description
            )
            SELECT 
                Country,
                StockCode,
                Description,
                total_revenue,
                rank
            FROM (
                SELECT
                    Country,
                    StockCode,
                    Description,
                    total_revenue,
                    ROW_NUMBER() OVER (PARTITION BY Country ORDER BY total_revenue DESC) AS rank
                FROM table1
            )
            WHERE rank = 1
""").df()

Unnamed: 0,Country,StockCode,Description,total_revenue,rank
0,Lithuania,22750,FELTCRAFT PRINCESS LOLA DOLL,180.0,1
1,Austria,POST,POSTAGE,1600.0,1
2,Greece,85123A,WHITE HANGING HEART T-LIGHT HOLDER,408.0,1
3,Iceland,84558A,3D DOG PICTURE PLAYING CARDS,88.5,1
4,Israel,48187,DOORMAT NEW ENGLAND,202.5,1
5,Switzerland,POST,POSTAGE,2739.0,1
6,Unspecified,21108,FAIRY CAKE FLANNEL ASSORTED COLOUR,206.55,1
7,France,POST,POSTAGE,9558.0,1
8,Korea,21201,TROPICAL HONEYCOMB PAPER GARLAND,100.8,1
9,Norway,M,Manual,13916.34,1


**TODO: (Bonus) Explain your approach to this question:**

First, in the CTE `table1`, I aggregate: I calculate total_revenue per Country, StockCode and Description, keeping only rows with positive quantity and price. Then, in a query on `table1`, I use the window function ROW_NUMBER() to assign a rank to each product within its country.

The last step is the condition rank = 1. The result of a window function cannot be filtered in the WHERE clause of the same query, so I wrap the previous query in a new outer query. The query with the window function becomes the inner query, and the outer query can filter on the `rank` column that it exposes.
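
The three steps can be sketched on a toy table. This uses Python's built-in `sqlite3` as a stand-in for DuckDB (an assumption, so that it runs without the dataset; window functions need SQLite 3.25 or newer). The table `toy`, its rows, and the names `product_revenue`, `ranked` and `revenue_rank` are hypothetical:

```python
import sqlite3

# sqlite3 is only a stand-in for DuckDB; the toy rows are made up.
con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE toy (Country TEXT, StockCode TEXT, Quantity INTEGER, Price REAL)"
)
con_demo.executemany(
    "INSERT INTO toy VALUES (?, ?, ?, ?)",
    [
        ("France", "A", 2, 5.0),
        ("France", "A", 1, 5.0),
        ("France", "B", 1, 12.0),
        ("Spain", "B", 3, 12.0),
        ("Spain", "A", 1, 5.0),
    ],
)

top_products = con_demo.execute("""
    WITH product_revenue AS (
        -- Step 1: revenue per product per country
        SELECT Country, StockCode, SUM(Quantity * Price) AS total_revenue
        FROM toy
        WHERE Quantity > 0 AND Price > 0
        GROUP BY Country, StockCode
    ),
    ranked AS (
        -- Step 2: rank products within each country
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY Country
                   ORDER BY total_revenue DESC
               ) AS revenue_rank
        FROM product_revenue
    )
    -- Step 3: keep only the top product per country
    SELECT Country, StockCode, total_revenue, revenue_rank
    FROM ranked
    WHERE revenue_rank = 1
    ORDER BY Country
""").fetchall()

print(top_products)
```

Writing the ranking step as a second CTE instead of a nested subquery is a design choice; both forms expose the rank column to the final filter.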

---

## Submission Checklist

Before submitting, verify:

- [✅] All TODO sections completed
- [✅] All queries produce results (no errors)
- [✅] All query outputs are visible
- [✅] All markdown explanations completed
- [✅] SQL formatted nicely (uppercase keywords, indented)
- [✅] NULL values handled appropriately (IS NULL, not = NULL)
- [✅] **CRITICAL:** Kernel → Restart & Run All Cells (no errors)
- [✅] File renamed to `hw1_[your_name].ipynb`

---

## Reflection (Optional but Recommended)

**What was the most challenging part of this assignment?**

Remembering the names of the variables.

**What concept do you feel most confident about now?**

I split the task over several days, and the last thing I did was window functions, so I'd say I feel quite confident with them.

**What would you like more practice with?**

I liked the hints, so I didn't feel like I was left alone in the middle of nowhere :)

---

**Great work! 🎉** You've completed queries on 525,000 rows of real data. This is the kind of work data professionals do every day. Be proud!