# Module 04: Aggregation & Grouping - Summarizing Data

**Estimated Time:** 60 minutes

## Learning Objectives

By the end of this module, you will be able to:
- Use aggregate functions (COUNT, SUM, AVG, MIN, MAX)
- Group data with GROUP BY clause
- Filter grouped data with HAVING clause
- Combine aggregations with JOINs
- Understand the difference between WHERE and HAVING
- Create meaningful business reports using aggregation

In [None]:
# Setup
import sqlite3
import pandas as pd
from pathlib import Path

%load_ext sql

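# Connect to database (fail early: sqlite3.connect would silently create an empty file)
DB_PATH = Path.cwd().parent / "data" / "databases" / "ecommerce.db"
if not DB_PATH.exists():
    raise FileNotFoundError(f"Database not found: {DB_PATH}")
conn = sqlite3.connect(DB_PATH)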
%sql sqlite:///$DB_PATH

print("✓ Connected to ecommerce.db")

## 1. Aggregate Functions Overview

Aggregate functions perform calculations on a set of values and return a single value.

### Common Aggregate Functions:
- **COUNT()** - Count rows (`COUNT(*)`) or non-NULL values in a column (`COUNT(column)`)
- **SUM()** - Calculate the total sum
- **AVG()** - Calculate the average
- **MIN()** - Find the minimum value
- **MAX()** - Find the maximum value

In [None]:
%%sql
-- COUNT - Total number of products
SELECT COUNT(*) AS total_products
FROM products

In [None]:
%%sql
-- COUNT(*) vs COUNT(column) - COUNT(*) counts every row, COUNT(column) skips NULLs
SELECT 
    COUNT(*) AS total_rows,
    COUNT(product_id) AS count_product_id,
    COUNT(DISTINCT category_id) AS unique_categories
FROM products
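
If `product_id` is the table's primary key it is never NULL, so the query above cannot show the difference between the two forms of COUNT. The sketch below makes the difference visible on a throwaway in-memory database. The `demo` connection, the `demo_products` table, and its rows are invented for illustration and are not part of `ecommerce.db`.

```python
import sqlite3

# Throwaway in-memory database; table name and rows are hypothetical.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE demo_products (product_id INTEGER PRIMARY KEY, category_id INTEGER)"
)
demo.executemany(
    "INSERT INTO demo_products (product_id, category_id) VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 20), (4, None)],  # product 4 has no category
)

total_rows, with_category, unique_categories = demo.execute(
    """
    SELECT
        COUNT(*),                    -- counts every row
        COUNT(category_id),          -- skips rows where category_id IS NULL
        COUNT(DISTINCT category_id)  -- distinct non-NULL values only
    FROM demo_products
    """
).fetchone()

print(total_rows, with_category, unique_categories)  # 4 3 2
demo.close()
```

The same rule applies to SUM, AVG, MIN, and MAX: they skip NULL values, so an average is taken over the non-NULL rows only.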

In [None]:
%%sql
-- SUM - Total inventory value
SELECT 
    SUM(price * stock_quantity) AS total_inventory_value
FROM products

In [None]:
%%sql
-- AVG - Average product price
SELECT 
    ROUND(AVG(price), 2) AS average_price,
    ROUND(AVG(stock_quantity), 2) AS average_stock
FROM products

In [None]:
%%sql
-- MIN and MAX - Price range
SELECT 
    MIN(price) AS lowest_price,
    MAX(price) AS highest_price,
    MAX(price) - MIN(price) AS price_range
FROM products

In [None]:
%%sql
-- Multiple aggregates in one query
SELECT 
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_revenue,
    ROUND(AVG(total_amount), 2) AS average_order_value,
    MIN(total_amount) AS smallest_order,
    MAX(total_amount) AS largest_order
FROM orders

## 2. GROUP BY: Grouping Data

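GROUP BY divides rows into groups that share the same value in the grouping column(s). Each aggregate function is then evaluated once per group, so the result contains one row per group.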

### Syntax
```sql
SELECT column, aggregate_function(column)
FROM table
GROUP BY column;
```
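
As a minimal sketch of that syntax, the snippet below groups three made-up orders by status. It runs against an in-memory SQLite database rather than `ecommerce.db`; the `demo` connection, the `demo_orders` table, and its values are assumptions made only for this example.

```python
import sqlite3

# Hypothetical toy data in an in-memory database.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE demo_orders (order_id INTEGER PRIMARY KEY, status TEXT, total_amount REAL)"
)
demo.executemany(
    "INSERT INTO demo_orders (status, total_amount) VALUES (?, ?)",
    [("Delivered", 40.0), ("Delivered", 60.0), ("Pending", 25.0)],
)

# Three input rows collapse into one output row per distinct status.
rows = demo.execute(
    """
    SELECT status, COUNT(*), SUM(total_amount)
    FROM demo_orders
    GROUP BY status
    ORDER BY status
    """
).fetchall()

print(rows)  # [('Delivered', 2, 100.0), ('Pending', 1, 25.0)]
demo.close()
```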

In [None]:
%%sql
-- Products count by category
SELECT 
    category_id,
    COUNT(*) AS product_count
FROM products
GROUP BY category_id
ORDER BY product_count DESC

In [None]:
%%sql
-- Average price by category
SELECT 
    category_id,
    COUNT(*) AS products,
    ROUND(AVG(price), 2) AS avg_price,
    MIN(price) AS min_price,
    MAX(price) AS max_price
FROM products
GROUP BY category_id
ORDER BY avg_price DESC

In [None]:
%%sql
-- Orders by status
SELECT 
    status,
    COUNT(*) AS order_count,
    SUM(total_amount) AS total_value,
    ROUND(AVG(total_amount), 2) AS avg_value
FROM orders
GROUP BY status
ORDER BY order_count DESC

In [None]:
%%sql
-- Customer order counts
SELECT 
    customer_id,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_spent,
    ROUND(AVG(total_amount), 2) AS avg_order_value
FROM orders
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 10

## 3. GROUP BY with Multiple Columns

You can group by multiple columns to create more detailed aggregations. The result then has one row per distinct combination of values in those columns.
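
To illustrate how adding a grouping column changes the result, here is a small sketch on invented data. The `demo` connection and the `demo_orders` table are hypothetical and exist only in memory.

```python
import sqlite3

# Hypothetical toy data in an in-memory database.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE demo_orders (customer_id INTEGER, status TEXT, total_amount REAL)"
)
demo.executemany(
    "INSERT INTO demo_orders VALUES (?, ?, ?)",
    [
        (1, "Delivered", 40.0),
        (1, "Delivered", 60.0),
        (1, "Pending", 25.0),
        (2, "Delivered", 10.0),
    ],
)

# One grouping column: one row per customer.
by_customer = demo.execute(
    """
    SELECT customer_id, COUNT(*)
    FROM demo_orders
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()

# Two grouping columns: one row per (customer, status) combination.
by_customer_status = demo.execute(
    """
    SELECT customer_id, status, COUNT(*)
    FROM demo_orders
    GROUP BY customer_id, status
    ORDER BY customer_id, status
    """
).fetchall()

print(by_customer)         # [(1, 3), (2, 1)]
print(by_customer_status)  # [(1, 'Delivered', 2), (1, 'Pending', 1), (2, 'Delivered', 1)]
demo.close()
```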

In [None]:
%%sql
-- Orders by customer and status
SELECT 
    customer_id,
    status,
    COUNT(*) AS order_count,
    SUM(total_amount) AS total_amount
FROM orders
GROUP BY customer_id, status
ORDER BY customer_id, status
LIMIT 20

In [None]:
%%sql
-- Monthly order summary (grouping by an expression built with a date function)
SELECT 
    strftime('%Y-%m', order_date) AS month,
    COUNT(*) AS orders,
    SUM(total_amount) AS revenue,
    ROUND(AVG(total_amount), 2) AS avg_order
FROM orders
GROUP BY month
ORDER BY month DESC
LIMIT 12
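
The monthly query relies on `strftime` being able to parse `order_date`. The sketch below assumes the dates are stored as ISO-8601 text (`YYYY-MM-DD`), which is an assumption about `ecommerce.db` and not something the notebook states. It shows that a value SQLite cannot parse yields NULL and ends up in a NULL group of its own. The `demo` connection and `demo_orders` table are hypothetical.

```python
import sqlite3

# Hypothetical toy data in an in-memory database.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_orders (order_date TEXT, total_amount REAL)")
demo.executemany(
    "INSERT INTO demo_orders VALUES (?, ?)",
    [
        ("2024-01-05", 10.0),
        ("2024-01-20", 30.0),
        ("2024-02-01", 5.0),
        ("03/15/2024", 99.0),  # not ISO-8601, so strftime returns NULL
    ],
)

monthly = demo.execute(
    """
    SELECT strftime('%Y-%m', order_date) AS month, COUNT(*), SUM(total_amount)
    FROM demo_orders
    GROUP BY month
    ORDER BY month
    """
).fetchall()

# SQLite sorts NULL before other values in ascending order.
print(monthly)  # [(None, 1, 99.0), ('2024-01', 2, 40.0), ('2024-02', 1, 5.0)]
demo.close()
```

If a NULL month shows up in a report like this, the date column probably contains values in an unexpected format.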

## 4. HAVING: Filtering Grouped Data

HAVING filters groups after aggregation (WHERE filters rows before aggregation).

### WHERE vs HAVING:
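- **WHERE**: Filters individual rows before grouping, so it cannot test the result of an aggregate function
- **HAVING**: Filters groups after aggregation, so it can test aggregates such as `COUNT(*)` or `SUM(total_amount)`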
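
The order of the two filters changes the result. The following sketch runs the same HAVING condition with and without a WHERE clause on invented data. The `demo` connection and the `demo_orders` table are hypothetical and live only in memory.

```python
import sqlite3

# Hypothetical toy data in an in-memory database.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE demo_orders (customer_id INTEGER, status TEXT, total_amount REAL)"
)
demo.executemany(
    "INSERT INTO demo_orders VALUES (?, ?, ?)",
    [
        (1, "Delivered", 300.0),
        (1, "Cancelled", 400.0),
        (2, "Delivered", 600.0),
    ],
)

# HAVING only: every order counts toward the total, then groups are filtered.
having_only = demo.execute(
    """
    SELECT customer_id, SUM(total_amount)
    FROM demo_orders
    GROUP BY customer_id
    HAVING SUM(total_amount) > 500
    ORDER BY customer_id
    """
).fetchall()

# WHERE + HAVING: cancelled rows are removed before the totals are computed.
where_then_having = demo.execute(
    """
    SELECT customer_id, SUM(total_amount)
    FROM demo_orders
    WHERE status = 'Delivered'
    GROUP BY customer_id
    HAVING SUM(total_amount) > 500
    ORDER BY customer_id
    """
).fetchall()

print(having_only)        # [(1, 700.0), (2, 600.0)]
print(where_then_having)  # [(2, 600.0)]
demo.close()
```

Customer 1 passes the first query only because the cancelled order is included in the sum.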

In [None]:
%%sql
-- Categories with more than 10 products
SELECT 
    category_id,
    COUNT(*) AS product_count
FROM products
GROUP BY category_id
HAVING COUNT(*) > 10
ORDER BY product_count DESC

In [None]:
%%sql
-- Customers who spent more than $500
SELECT 
    customer_id,
    COUNT(*) AS orders,
    SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(total_amount) > 500
ORDER BY total_spent DESC

In [None]:
%%sql
-- Combining WHERE and HAVING
SELECT 
    customer_id,
    COUNT(*) AS delivered_orders,
    SUM(total_amount) AS total_spent
FROM orders
WHERE status = 'Delivered'
GROUP BY customer_id
HAVING COUNT(*) >= 3
ORDER BY total_spent DESC
LIMIT 10

In [None]:
%%sql
-- Categories with average price above $50
SELECT 
    category_id,
    COUNT(*) AS products,
    ROUND(AVG(price), 2) AS avg_price
FROM products
GROUP BY category_id
HAVING AVG(price) > 50
ORDER BY avg_price DESC

## 5. Combining GROUP BY with JOINs

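Aggregates can also be computed over joined tables, so a report can group by a descriptive column (such as a category name) from one table while summarizing rows from another.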
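
One detail matters when the join is a LEFT JOIN: `COUNT(*)` counts the padded row that an unmatched parent produces, while `COUNT(child_column)` does not. The sketch below shows this on invented data. The `demo` connection and the `demo_categories` and `demo_products` tables are hypothetical and exist only in memory.

```python
import sqlite3

# Hypothetical toy data in an in-memory database.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE demo_categories (category_id INTEGER PRIMARY KEY, category_name TEXT)"
)
demo.execute(
    "CREATE TABLE demo_products (product_id INTEGER PRIMARY KEY, category_id INTEGER, price REAL)"
)
demo.executemany(
    "INSERT INTO demo_categories VALUES (?, ?)", [(1, "Books"), (2, "Toys")]
)
demo.executemany(
    "INSERT INTO demo_products VALUES (?, ?, ?)",
    [(1, 1, 12.0), (2, 1, 8.0)],  # no product belongs to Toys
)

rows = demo.execute(
    """
    SELECT
        c.category_name,
        COUNT(*),              -- counts the NULL-padded row for Toys
        COUNT(p.product_id),   -- counts only real products
        AVG(p.price)           -- NULL when a category has no products
    FROM demo_categories c
    LEFT JOIN demo_products p ON c.category_id = p.category_id
    GROUP BY c.category_id, c.category_name
    ORDER BY c.category_name
    """
).fetchall()

print(rows)  # [('Books', 2, 2, 10.0), ('Toys', 1, 0, None)]
demo.close()
```

This is why the queries in this section count a column from the joined table, such as `p.product_id` or `o.order_id`.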

In [None]:
%%sql
-- Product count and avg price by category name
SELECT 
    c.category_name,
    COUNT(p.product_id) AS product_count,
    ROUND(AVG(p.price), 2) AS avg_price,
    SUM(p.stock_quantity) AS total_stock
FROM categories c
LEFT JOIN products p ON c.category_id = p.category_id
GROUP BY c.category_id, c.category_name
ORDER BY product_count DESC

In [None]:
%%sql
-- Customer spending summary with names
SELECT 
    c.customer_id,
    c.first_name || ' ' || c.last_name AS customer_name,
    c.city,
    COUNT(o.order_id) AS total_orders,
    ROUND(SUM(o.total_amount), 2) AS total_spent,
    ROUND(AVG(o.total_amount), 2) AS avg_order_value
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, customer_name, c.city
ORDER BY total_spent DESC
LIMIT 15

In [None]:
%%sql
-- Product sales performance
SELECT 
    p.product_name,
    c.category_name,
    COUNT(oi.order_item_id) AS times_sold,
    SUM(oi.quantity) AS total_quantity_sold,
    ROUND(SUM(oi.quantity * oi.unit_price), 2) AS total_revenue
FROM products p
INNER JOIN categories c ON p.category_id = c.category_id
LEFT JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY p.product_id, p.product_name, c.category_name
ORDER BY total_revenue DESC
LIMIT 15

In [None]:
%%sql
-- Category revenue report
SELECT 
    c.category_name,
    COUNT(DISTINCT p.product_id) AS products,
    COUNT(DISTINCT o.order_id) AS orders,
    SUM(oi.quantity) AS items_sold,
    ROUND(SUM(oi.quantity * oi.unit_price), 2) AS total_revenue
FROM categories c
LEFT JOIN products p ON c.category_id = p.category_id
LEFT JOIN order_items oi ON p.product_id = oi.product_id
LEFT JOIN orders o ON oi.order_id = o.order_id
GROUP BY c.category_id, c.category_name
ORDER BY total_revenue DESC

## 6. Real-World Business Reports

Let's create comprehensive business reports using aggregation.

In [None]:
%%sql
-- Report 1: Customer Segmentation by Spending
SELECT 
    CASE 
        WHEN total_spent >= 1000 THEN 'VIP'
        WHEN total_spent >= 500 THEN 'Gold'
        WHEN total_spent >= 100 THEN 'Silver'
        ELSE 'Bronze'
    END AS customer_tier,
    COUNT(*) AS customer_count,
    ROUND(SUM(total_spent), 2) AS total_revenue,
    ROUND(AVG(total_spent), 2) AS avg_spent_per_customer
FROM (
    SELECT 
        customer_id,
        SUM(total_amount) AS total_spent
    FROM orders
    GROUP BY customer_id
) AS customer_totals
GROUP BY customer_tier
ORDER BY 
    CASE customer_tier
        WHEN 'VIP' THEN 1
        WHEN 'Gold' THEN 2
        WHEN 'Silver' THEN 3
        ELSE 4
    END
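
Report 1 aggregates twice: the inner query computes one total per customer, and the outer query groups those totals into tiers. The sketch below reproduces that two-step shape on invented data, using the same tier thresholds. The `demo` connection, the `demo_orders` table, and its values are hypothetical, and the tiers are sorted alphabetically here to keep the example short.

```python
import sqlite3

# Hypothetical toy data in an in-memory database.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_orders (customer_id INTEGER, total_amount REAL)")
demo.executemany(
    "INSERT INTO demo_orders VALUES (?, ?)",
    [
        (1, 700.0), (1, 400.0),  # 1100 total -> VIP
        (2, 600.0),              # 600 total  -> Gold
        (3, 50.0),               # 50 total   -> Bronze
        (4, 120.0), (4, 30.0),   # 150 total  -> Silver
        (5, 20.0),               # 20 total   -> Bronze
    ],
)

tiers = demo.execute(
    """
    SELECT
        CASE
            WHEN total_spent >= 1000 THEN 'VIP'
            WHEN total_spent >= 500 THEN 'Gold'
            WHEN total_spent >= 100 THEN 'Silver'
            ELSE 'Bronze'
        END AS customer_tier,
        COUNT(*) AS customer_count
    FROM (
        -- Step 1: one row per customer
        SELECT customer_id, SUM(total_amount) AS total_spent
        FROM demo_orders
        GROUP BY customer_id
    ) AS customer_totals
    -- Step 2: one row per tier
    GROUP BY customer_tier
    ORDER BY customer_tier
    """
).fetchall()

print(tiers)  # [('Bronze', 2), ('Gold', 1), ('Silver', 1), ('VIP', 1)]
demo.close()
```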

In [None]:
%%sql
-- Report 2: Monthly Revenue Trends
SELECT 
    strftime('%Y', order_date) AS year,
    strftime('%m', order_date) AS month,
    COUNT(*) AS orders,
    COUNT(DISTINCT customer_id) AS unique_customers,
    ROUND(SUM(total_amount), 2) AS revenue,
    ROUND(AVG(total_amount), 2) AS avg_order_value
FROM orders
WHERE status != 'Cancelled'
GROUP BY year, month
ORDER BY year DESC, month DESC
LIMIT 12

In [None]:
%%sql
-- Report 3: Low Stock Alert by Category
SELECT 
    c.category_name,
    COUNT(*) AS low_stock_products,
    ROUND(AVG(p.stock_quantity), 2) AS avg_stock,
    MIN(p.stock_quantity) AS min_stock,
    ROUND(SUM(p.price * p.stock_quantity), 2) AS inventory_value_at_risk
FROM products p
INNER JOIN categories c ON p.category_id = c.category_id
WHERE p.stock_quantity < 50
GROUP BY c.category_id, c.category_name
ORDER BY low_stock_products DESC

## 7. Exercises

Practice what you've learned with these exercises.

### Exercise 1: Order Status Summary
Create a summary showing each order status, the count of orders, total revenue, and average order value. Sort by total revenue descending.

In [None]:
%%sql
-- Your code here

### Exercise 2: Top 10 Customers by Order Count
Find the top 10 customers by number of orders. Include customer name, order count, and total spent.

In [None]:
%%sql
-- Your code here

### Exercise 3: Category Performance
For each category, show the number of products, total stock quantity, and average price. Only include categories with more than 5 products.

In [None]:
%%sql
-- Your code here

### Exercise 4: High-Value Customers
Find customers who have placed at least 5 orders AND spent more than $300 total. Show customer name, order count, and total spent.

In [None]:
%%sql
-- Your code here

### Exercise 5: Product Sales Ranking
Create a report showing the top 20 products by total revenue. Include product name, category name, quantity sold, and total revenue.

In [None]:
%%sql
-- Your code here

## Summary

In this module, you learned:
- ✓ Using aggregate functions (COUNT, SUM, AVG, MIN, MAX)
- ✓ Grouping data with GROUP BY
- ✓ Grouping by multiple columns
- ✓ Filtering grouped data with HAVING
- ✓ Difference between WHERE and HAVING
- ✓ Combining GROUP BY with JOINs

**Key Takeaways:**
- Aggregate functions collapse multiple rows into a single result
- GROUP BY creates groups of rows for aggregation
- WHERE filters rows before grouping
- HAVING filters groups after aggregation
- Use COUNT(DISTINCT column) to count unique values
- Combine aggregation with JOINs for powerful reports

**Next:** Module 05 - Subqueries & CTEs

In [None]:
# Cleanup
conn.close()
print("✓ Database connection closed")