# Lab Module 7: Subqueries and CTEs

(Run the cell below first to ensure connectivity.)

In [None]:
%load_ext sql

%sql postgresql://admin:password@postgres:5432/postgres

## Challenge 1: The Scalar Subquery (WHERE Clause)
- **Context**: The product team wants to identify premium products. They need a list of products that weigh more than the average product in the catalog. 
- **Task**: Write a query to select the `product_id` and `product_weight_g` for all `products` where the weight is greater than the average weight of all `products`.
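
As a warm-up on toy data, here is a minimal sketch of a scalar subquery in a WHERE clause. It runs on Python's built-in `sqlite3` module with an in-memory database rather than the lab's PostgreSQL connection, and the `parcels` table and its columns are made up for illustration, so treat it as a pattern and not as the solution.

```python
import sqlite3

# Toy stand-in data (hypothetical table and column names, not the lab schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcels (parcel_id INTEGER, weight_g REAL)")
conn.executemany(
    "INSERT INTO parcels VALUES (?, ?)",
    [(1, 100.0), (2, 200.0), (3, 600.0)],
)

# The inner query returns exactly one value (the average, 300.0),
# and the outer WHERE compares every row against it.
heavy = conn.execute(
    """
    SELECT parcel_id, weight_g
    FROM parcels
    WHERE weight_g > (SELECT AVG(weight_g) FROM parcels)
    ORDER BY parcel_id
    """
).fetchall()
print(heavy)  # [(3, 600.0)]
```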

In [None]:
%%sql
-- Write your solution here


## Challenge 2: The Scalar Subquery (SELECT Clause)
- **Context**: We want to compare individual item prices against the global average price to create a "Price Ratio" metric. 
- **Task**: Select the `order_id`, `product_id`, and `price`. Add a fourth column `global_avg_price` that displays the average `price` of all items in `order_items` for every row.
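
The same kind of single-value subquery can also sit in the SELECT list, where its result is repeated on every output row. The sketch below assumes a made-up `tickets` table and uses SQLite through Python's `sqlite3` module instead of the lab database.

```python
import sqlite3

# Toy stand-in data (hypothetical names, not the lab schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (ticket_id INTEGER, fare REAL)")
conn.executemany("INSERT INTO tickets VALUES (?, ?)", [(1, 10.0), (2, 30.0)])

# The subquery in the SELECT list yields one value (20.0),
# which appears next to each individual fare.
with_avg = conn.execute(
    """
    SELECT ticket_id,
           fare,
           (SELECT AVG(fare) FROM tickets) AS overall_avg_fare
    FROM tickets
    ORDER BY ticket_id
    """
).fetchall()
print(with_avg)  # [(1, 10.0, 20.0), (2, 30.0, 20.0)]
```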

In [None]:
%%sql
-- Write your solution here


## Challenge 3: Subquery in the FROM Clause (Derived Tables)
- **Context**: Management needs the average "Order Value" (total revenue per order). To do this, we first need to calculate the total for each order, then average those totals. 
- **Task**: Write a query that first calculates the sum of `price` per `order_id` (subquery) from the `order_items` table, and then calculates the average of those totals in the outer query.
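
To see why the two-step shape matters, the following sketch contrasts the average of per-group totals with a plain average over rows. The `receipts` table is hypothetical and the code runs on in-memory SQLite via Python's `sqlite3` module, so it only illustrates the derived-table pattern.

```python
import sqlite3

# Toy stand-in data (hypothetical names): receipt 1 has two lines, receipt 2 has one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE receipts (receipt_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO receipts VALUES (?, ?)",
    [(1, 10.0), (1, 20.0), (2, 50.0)],
)

# Step 1 (inner query): total per receipt -> 30.0 and 50.0.
# Step 2 (outer query): average of those totals -> 40.0.
avg_of_totals = conn.execute(
    """
    SELECT AVG(receipt_total)
    FROM (
        SELECT receipt_id, SUM(amount) AS receipt_total
        FROM receipts
        GROUP BY receipt_id
    ) AS per_receipt
    """
).fetchone()[0]

# A plain AVG over the rows answers a different question (average line amount).
avg_of_rows = conn.execute("SELECT AVG(amount) FROM receipts").fetchone()[0]

print(avg_of_totals, avg_of_rows)
```

The derived table needs an alias (`per_receipt` here); PostgreSQL versions before 16 reject a subquery in FROM without one.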

In [None]:
%%sql
-- Write your solution here


## Challenge 4: Multi-Row Subquery (IN)
- **Context**: Marketing wants to target customers living in the top 3 states with the highest number of customers. 
- **Task**: Select `customer_id` and `customer_state` for all `customers` who live in the top 3 states (by customer count). Use a subquery with IN.
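
A multi-row subquery can itself group, sort, and limit before feeding IN. The sketch below picks the top 2 cities in a made-up `visitors` table; it runs on SQLite via Python's `sqlite3` module, and the data has no ties so the ranking is unambiguous.

```python
import sqlite3

# Toy stand-in data (hypothetical names): city A has 3 visitors, B has 2, C has 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visitors (visitor_id INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO visitors VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "A"), (4, "B"), (5, "B"), (6, "C")],
)

# The subquery returns a list of values (the 2 most common cities);
# IN keeps every outer row whose city is in that list.
in_top_cities = conn.execute(
    """
    SELECT visitor_id, city
    FROM visitors
    WHERE city IN (
        SELECT city
        FROM visitors
        GROUP BY city
        ORDER BY COUNT(*) DESC
        LIMIT 2
    )
    ORDER BY visitor_id
    """
).fetchall()
print(in_top_cities)
```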

In [None]:
%%sql
-- Write your solution here


## Challenge 5: Multi-Row Subquery (NOT IN)
- **Context**: The logistics team is auditing sellers. They want to find sellers who are not located in the same states as our customers. 
- **Task**: Select `seller_id` and `seller_state` for `sellers` whose state does not appear in the `customers` table. Use a subquery with NOT IN.
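
One thing worth knowing about NOT IN: if the subquery returns even one NULL, the comparison is never true and the outer query returns no rows. Whether the lab tables contain NULL states is not something this notebook tells us, so the sketch below only demonstrates the behavior on made-up `suppliers` and `clients` tables, using SQLite via Python's `sqlite3` module (PostgreSQL follows the same three-valued logic).

```python
import sqlite3

# Toy stand-in data (hypothetical names); one client has an unknown (NULL) region.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE suppliers (supplier_id INTEGER, region TEXT)")
conn.execute("CREATE TABLE clients (client_id INTEGER, region TEXT)")
conn.executemany("INSERT INTO suppliers VALUES (?, ?)", [(1, "north"), (2, "south")])
conn.executemany("INSERT INTO clients VALUES (?, ?)", [(10, "north"), (11, None)])

# 'south' NOT IN ('north', NULL) evaluates to NULL, not TRUE, so no row passes.
naive = conn.execute(
    """
    SELECT supplier_id FROM suppliers
    WHERE region NOT IN (SELECT region FROM clients)
    """
).fetchall()

# Filtering NULLs out of the subquery restores the intuitive result.
null_safe = conn.execute(
    """
    SELECT supplier_id FROM suppliers
    WHERE region NOT IN (SELECT region FROM clients WHERE region IS NOT NULL)
    """
).fetchall()

print(naive, null_safe)  # [] [(2,)]
```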

In [None]:
%%sql
-- Write your solution here


## Challenge 6: Filtering by Category (IN)
- **Context**: We need a list of orders that contain products from "Tech" categories (computers, electronics, telephony). 
- **Task**: Select `order_id` from `order_items` where the `product_id` belongs to a product whose `product_category_name` is 'telefonia', 'informatica_acessorios', or 'eletronicos'. Use a subquery against the `products` table.

In [None]:
%%sql
-- Write your solution here


## Challenge 7: Correlated Subquery (EXISTS)
- **Context**: We want to verify which products have actually been sold at least once.
- **Task**: Select `product_id` and `product_category_name` from the `products` table. Use an EXISTS clause to filter for products that appear in the `order_items` table.

In [None]:
%%sql
-- Write your solution here


## Challenge 8: Correlated Subquery (NOT EXISTS)
- **Context**: Inventory analysis: Find products that have never been sold. 
- **Task**: Select `product_id` and `product_category_name` from `products` where no matching record exists in `order_items`.
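
EXISTS and NOT EXISTS are two sides of the same correlated check: the inner query references the current outer row and only needs to find (or fail to find) one match. Here is a small sketch on made-up `items` and `sale_lines` tables, run on SQLite via Python's `sqlite3` module instead of the lab database.

```python
import sqlite3

# Toy stand-in data (hypothetical names): item 2 never appears in sale_lines.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (item_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE sale_lines (line_id INTEGER, item_id INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, "pen"), (2, "ink"), (3, "pad")])
conn.executemany("INSERT INTO sale_lines VALUES (?, ?)", [(100, 1), (101, 1), (102, 3)])

# The inner query is correlated: it refers to i.item_id from the outer row.
# SELECT 1 is a convention; EXISTS only cares whether any row comes back.
query = """
    SELECT i.item_id, i.name
    FROM items AS i
    WHERE {negate} EXISTS (
        SELECT 1 FROM sale_lines AS s WHERE s.item_id = i.item_id
    )
    ORDER BY i.item_id
"""
sold = conn.execute(query.format(negate="")).fetchall()
unsold = conn.execute(query.format(negate="NOT")).fetchall()

print(sold, unsold)  # [(1, 'pen'), (3, 'pad')] [(2, 'ink')]
```

Item 1 appears twice in `sale_lines` but only once in `sold`, because EXISTS filters outer rows and never multiplies them the way a join would.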

In [None]:
%%sql
-- Write your solution here


## Challenge 9: Correlated Subquery with Logic
- **Context**: Find customers who have placed an order that has NOT yet been delivered.
- **Task**: Select `customer_id` from `customers`. Use EXISTS to check the `orders` table for orders belonging to that customer where `order_status` is not 'delivered'.

In [None]:
%%sql
-- Write your solution here


## Challenge 10: Comparison against Category Average (Correlated)
- **Context**: Identify "expensive for their class" products. We want products that cost more than the average price of their specific category. 
- **Task**: Select `product_id`, `product_category_name`, and `price` (from `order_items` joined to `products`). Filter for items where the `price` is greater than the average price of that specific category. (This query may take some time to run.)
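
A correlated scalar subquery computes a different comparison value for each outer row, here the average of the row's own group. Conceptually the inner query is re-evaluated per row, which is one reason this style can be slow on large tables. The sketch uses a single made-up `gadgets` table on SQLite via Python's `sqlite3` module, so it shows the pattern without the join the task asks for.

```python
import sqlite3

# Toy stand-in data (hypothetical names): average price is 20.0 for kind 'a', 100.0 for 'b'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gadgets (gadget_id INTEGER, kind TEXT, price REAL)")
conn.executemany(
    "INSERT INTO gadgets VALUES (?, ?, ?)",
    [(1, "a", 10.0), (2, "a", 30.0), (3, "b", 100.0), (4, "b", 100.0)],
)

# The inner query filters on g.kind from the outer row,
# so each gadget is compared with the average of its own kind.
above_kind_avg = conn.execute(
    """
    SELECT g.gadget_id, g.kind, g.price
    FROM gadgets AS g
    WHERE g.price > (
        SELECT AVG(g2.price) FROM gadgets AS g2 WHERE g2.kind = g.kind
    )
    ORDER BY g.gadget_id
    """
).fetchall()
print(above_kind_avg)  # [(2, 'a', 30.0)]
```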

In [None]:
%%sql
-- Write your solution here


## Challenge 11: Intro to CTEs (Refactoring Derived Tables)
- **Context**: Code readability is important. Let's calculate the average order value (from Challenge 3) again, but using a Common Table Expression (CTE). 
- **Task**: Define a CTE named `OrderTotals` that sums `price` by `order_id` (from the `order_items` table). Then, select the average of those totals from the CTE.

In [None]:
%%sql
-- Write your solution here

## Challenge 12: CTE for Two-Step Aggregation
- **Context**: We want to find the state with the highest average freight cost per order. 
- **Task**:
    1. Create a CTE `OrderFreight` that sums `freight_value` (from the `order_items` table) by `order_id` (from the `orders` table), and include `customer_state` by joining the `customers` table.
    2. Create a second CTE (or use the main query) to average that freight by `customer_state`.
    3. Order the results by average freight, highest first.

In [None]:
%%sql
-- Write your solution here

## Challenge 13: Multiple CTEs
- **Context**: Who are our "VIP" Sellers? We define them as sellers with more than $50,000 in sales. We want to see how many VIP sellers are in each state. 
- **Task**:
    - CTE 1 `SellerRevenue`: Calculate total sales (`price`) per `seller_id` from the `order_items` table.
    - CTE 2 `VIPSellers`: Filter `SellerRevenue` for totals > 50000.
    - Main Query: Join `VIPSellers` to the `sellers` table (to get `seller_state`) and count them by state.

In [None]:
%%sql
-- Write your solution here

## Challenge 14: CTE vs. HAVING
- **Context**: Sometimes CTEs are easier to debug than complex HAVING clauses. Let's find orders with more than 5 items. 
- **Task**: Use a CTE to select `order_id` and a count as `item_count` from `order_items`, grouped by `order_id`. Filter the result in the main query for counts > 5.
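
The sketch below shows the two styles side by side on a made-up `ticket_lines` table, with a threshold of 2 instead of 5. It runs on SQLite via Python's `sqlite3` module; the point is only that a CTE turns the aggregate into an ordinary column that the main query can filter with WHERE.

```python
import sqlite3

# Toy stand-in data (hypothetical names): ticket 1 has 3 lines, ticket 2 has 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticket_lines (ticket_id INTEGER, line_no INTEGER)")
conn.executemany(
    "INSERT INTO ticket_lines VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 1)],
)

# Style 1: filter the groups directly with HAVING.
via_having = conn.execute(
    """
    SELECT ticket_id, COUNT(*) AS line_count
    FROM ticket_lines
    GROUP BY ticket_id
    HAVING COUNT(*) > 2
    """
).fetchall()

# Style 2: name the aggregated result in a CTE, then filter it like a table.
via_cte = conn.execute(
    """
    WITH line_counts AS (
        SELECT ticket_id, COUNT(*) AS line_count
        FROM ticket_lines
        GROUP BY ticket_id
    )
    SELECT ticket_id, line_count
    FROM line_counts
    WHERE line_count > 2
    """
).fetchall()

print(via_having, via_cte)  # [(1, 3)] [(1, 3)]
```

When debugging, the CTE body can be run on its own to inspect every group's count before the filter is applied.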

In [None]:
%%sql
-- Write your solution here

## Challenge 15: Filtering with CTEs
- **Context**: We want to analyze orders strictly from the year 2017. 
- **Task**: Create a CTE `Orders2017` that selects `order_id` from the `orders` table where `order_purchase_timestamp` is in 2017. Then, join this CTE to `order_items` to calculate total revenue for that year.

In [None]:
%%sql
-- Write your solution here

## Challenge 16: Recursive CTE - Number Series
- **Context**: In data engineering, we often need to generate rows of numbers (e.g., for pagination or testing).
- **Task**: Write a Recursive CTE to generate a list of numbers from 1 to 10.
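
A recursive CTE has two parts joined by UNION ALL: an anchor query that produces the first row, and a recursive query that builds the next row from the previous one until its WHERE condition stops matching. The sketch below generates multiples of 5 rather than the series the task asks for, and runs on SQLite via Python's `sqlite3` module; the `steps` name is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Anchor member: SELECT 5 produces the starting row.
# Recursive member: adds 5 to the previous row while n is still below 25,
# so the last row produced is 25 and the recursion then stops.
multiples = conn.execute(
    """
    WITH RECURSIVE steps(n) AS (
        SELECT 5
        UNION ALL
        SELECT n + 5 FROM steps WHERE n < 25
    )
    SELECT n FROM steps ORDER BY n
    """
).fetchall()
print(multiples)  # [(5,), (10,), (15,), (20,), (25,)]
```

Leaving out the WHERE condition in the recursive member would make the query run without end, so it is worth writing the stop condition first.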

In [None]:
%%sql
-- Write your solution here

## Challenge 17: Recursive CTE - Date Series
- **Context**: We need to report on daily sales, but some days have zero sales. We need a "Calendar" table to left join against. 
- **Task**: Generate a list of all dates for the month of January 2017 using a Recursive CTE.

In [None]:
%%sql
-- Write your solution here

## Challenge 18: Recursive CTE - Finding Gaps (Concept)
- **Context**: We want to visualize sales for the first week of Jan 2017. 
- **Task**: Reuse the DateSeries CTE from the previous challenge, rewritten so that it covers only January 1 to January 7, 2017. LEFT JOIN it with the `orders` table (on `cte.calendar_date = CAST(o.order_purchase_timestamp AS DATE)`) to count how many `order_id` values occurred on each date. Group and order the results by `calendar_date`.
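
The calendar-plus-LEFT-JOIN pattern can be sketched on toy data as follows. This runs on SQLite via Python's `sqlite3` module, where dates are text and `date(day, '+1 day')` does the stepping; PostgreSQL date arithmetic is written differently, so only the overall shape carries over. The `visits` table and the `calendar` CTE name are made up.

```python
import sqlite3

# Toy stand-in data (hypothetical names): no visits on 2020-03-02.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visit_id INTEGER, visited_at TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        (1, "2020-03-01 09:00:00"),
        (2, "2020-03-01 17:30:00"),
        (3, "2020-03-03 12:00:00"),
    ],
)

# The calendar CTE supplies one row per day, so days without visits survive
# the LEFT JOIN. COUNT(v.visit_id) ignores the NULLs on unmatched days and
# reports 0, whereas COUNT(*) would count the calendar row itself and report 1.
daily = conn.execute(
    """
    WITH RECURSIVE calendar(day) AS (
        SELECT '2020-03-01'
        UNION ALL
        SELECT date(day, '+1 day') FROM calendar WHERE day < '2020-03-03'
    )
    SELECT c.day, COUNT(v.visit_id) AS visit_count
    FROM calendar AS c
    LEFT JOIN visits AS v ON date(v.visited_at) = c.day
    GROUP BY c.day
    ORDER BY c.day
    """
).fetchall()
print(daily)  # [('2020-03-01', 2), ('2020-03-02', 0), ('2020-03-03', 1)]
```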

In [None]:
%%sql
-- Write your solution here

## Challenge 19: Complex Logic - "The High Value Customer"
- **Context**: For this exercise, a "High Value Customer" is someone who has placed at least one order whose total is larger than the store-wide average order total.
- **Task**:
    - CTE 1: Calculate total value (`SUM(price)`) per `order_id` from the `order_items` table (group by `order_id`).
    - CTE 2: Calculate the global average of those order totals.
    - Main Query: Select `customer_id` from `orders`, join CTE 1 on `order_id`, and keep only the rows where that order's total is greater than the global average (from CTE 2).

In [None]:
%%sql
-- Write your solution here

## Challenge 20: The Final Boss - Revenue Contribution
- **Context**: We want to see the % contribution of each seller to the total revenue of their state. 
- **Task**:
    - CTE 1 `SellerSales`: Select `seller_id`, `seller_state`, and total `price` by joining `sellers` and `order_items`, grouped by `seller_id` and `seller_state`.
    - CTE 2 `StateSales`: Select `seller_state` and total sales from CTE 1, grouped by `seller_state`.
    - Main Query: Join the two CTEs on `seller_state` and select `seller_id`, `seller_state`, and (seller total / state total) * 100 as `pct_contribution`. Order by `pct_contribution` descending.
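
The share-of-group calculation can be sketched on toy data like this. The `vendors` and `vendor_lines` tables are made up, and the code runs on SQLite via Python's `sqlite3` module rather than the lab database. The `100.0 *` factor is a precaution: if both totals were integers, the division would truncate in SQLite and PostgreSQL alike.

```python
import sqlite3

# Toy stand-in data (hypothetical names): region 'east' totals 40.0, 'west' totals 50.0.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vendors (vendor_id INTEGER, region TEXT)")
conn.execute("CREATE TABLE vendor_lines (vendor_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO vendors VALUES (?, ?)",
    [(1, "east"), (2, "east"), (3, "west")],
)
conn.executemany(
    "INSERT INTO vendor_lines VALUES (?, ?)",
    [(1, 20.0), (1, 10.0), (2, 10.0), (3, 50.0)],
)

# First CTE: one total per vendor. Second CTE: one total per region,
# built from the first. Main query: join them and compute each share.
shares = conn.execute(
    """
    WITH vendor_sales AS (
        SELECT v.vendor_id, v.region, SUM(l.amount) AS total
        FROM vendors AS v
        JOIN vendor_lines AS l ON l.vendor_id = v.vendor_id
        GROUP BY v.vendor_id, v.region
    ),
    region_sales AS (
        SELECT region, SUM(total) AS total
        FROM vendor_sales
        GROUP BY region
    )
    SELECT vs.vendor_id, vs.region, 100.0 * vs.total / rs.total AS pct
    FROM vendor_sales AS vs
    JOIN region_sales AS rs ON rs.region = vs.region
    ORDER BY pct DESC
    """
).fetchall()
print(shares)  # [(3, 'west', 100.0), (1, 'east', 75.0), (2, 'east', 25.0)]
```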

In [None]:
%%sql
-- Write your solution here