# Homework: SQL SELECT Statements with Northwind Database
## **SOLUTIONS**

**Objective:** Practice writing SQL SELECT statements, filtering data with WHERE clauses, using JOINs to combine tables, and performing aggregate operations.

**Database:** Northwind (classic sales database with customers, orders, products, employees)

---

## Instructions
1. Complete all exercises below
2. Run each cell to verify your queries work
3. Ensure your output matches the expected results
4. Submit your completed notebook

## Part 0: Database Setup

This cell will:
1. Import necessary libraries
2. Set database parameters
3. Terminate any active connections to the database
4. Drop and recreate the Northwind database
5. Load the Northwind SQL file
6. Create a SQLAlchemy engine and test the connection

In [2]:
# Import libraries
import pandas as pd
import psycopg2
from sqlalchemy import create_engine, text
import subprocess

# Database parameters
db_params = {
    'host': 'localhost',
    'database': 'northwind',
    'user': 'student',
    'password': ''
}

# Step 1: Terminate active connections and recreate database
print("Step 1: Setting up database...")
terminate_cmd = f"psql -U {db_params['user']} -d postgres -c \"SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = '{db_params['database']}' AND pid <> pg_backend_pid();\""
drop_cmd = f"psql -U {db_params['user']} -d postgres -c 'DROP DATABASE IF EXISTS {db_params['database']};'"
create_cmd = f"psql -U {db_params['user']} -d postgres -c 'CREATE DATABASE {db_params['database']};'"

subprocess.run(terminate_cmd, shell=True, capture_output=True)
subprocess.run(drop_cmd, shell=True, capture_output=True)
result = subprocess.run(create_cmd, shell=True, capture_output=True, text=True)
print(f"Database created: {result.stdout.strip()}")

# Step 2: Load Northwind SQL file
print("\nStep 2: Loading Northwind database...")
sql_file = "/workspaces/Fall2025-MS3083-Base_Template/databases/northwind.sql"
load_cmd = f"psql -U {db_params['user']} -d {db_params['database']} -f {sql_file}"
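result = subprocess.run(load_cmd, shell=True, capture_output=True, text=True)
# Report success only if psql exited cleanly (e.g. the SQL file was found and the connection worked)
if result.returncode != 0:
    raise RuntimeError(f"Failed to load Northwind: {result.stderr.strip()}")
print("Northwind database loaded successfully!")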

# Step 3: Create SQLAlchemy engine
print("\nStep 3: Creating database connection...")
engine = create_engine(
    f"postgresql://{db_params['user']}@{db_params['host']}/{db_params['database']}"
)

# Test connection
with engine.connect() as conn:
    result = conn.execute(text("SELECT version();"))
    version = result.fetchone()[0]
    print(f"✓ Connected to: {version[:50]}...")

print("\n✓ Setup complete! Ready to run queries.")

Step 1: Setting up database...
Database created: CREATE DATABASE

Step 2: Loading Northwind database...
Northwind database loaded successfully!

Step 3: Creating database connection...
✓ Connected to: PostgreSQL 18.0 on x86_64-conda-linux-gnu, compile...

✓ Setup complete! Ready to run queries.


## Part 1: Exploring the Database

Before writing queries, let's explore what tables and data are available.

### Exercise 1.1: List all tables

Write a query to show all tables in the `northwind` schema.

In [3]:
# SOLUTION
query = """
SELECT table_name 
FROM information_schema.tables 
WHERE table_schema = 'northwind'
ORDER BY table_name;
"""

df = pd.read_sql(text(query), engine)
print(f"Total tables: {len(df)}\n")
df

Total tables: 8



Unnamed: 0,table_name
0,categories
1,customers
2,employees
3,order_details
4,orders
5,products
6,shippers
7,suppliers


### Exercise 1.2: Explore the Products table

Write a query to show the first 5 products with all their columns.

In [4]:
# SOLUTION
query = """
SELECT * 
FROM northwind.products 
LIMIT 5;
"""

pd.read_sql(text(query), engine)

Unnamed: 0,product_id,product_name,supplier_id,category_id,quantity_per_unit,unit_price,units_in_stock,units_on_order,reorder_level,discontinued
0,1,Chai,1,1,,18.0,39,0,0,False
1,2,Chang,1,1,,19.0,17,0,0,False
2,3,Aniseed Syrup,1,2,,10.0,13,0,0,False
3,4,Chef Anton's Cajun Seasoning,2,2,,22.0,53,0,0,False
4,5,Chef Anton's Gumbo Mix,2,2,,21.35,0,0,0,False


### Exercise 1.3: Count records in each table

Write queries to count how many records are in the `products`, `customers`, `orders`, and `employees` tables.

In [5]:
# SOLUTION
tables = ['products', 'customers', 'orders', 'employees']
counts = {}

for table in tables:
    query = f"SELECT COUNT(*) as count FROM northwind.{table};"
    result = pd.read_sql(text(query), engine)
    counts[table] = result['count'][0]

pd.DataFrame(list(counts.items()), columns=['Table', 'Record Count'])

Unnamed: 0,Table,Record Count
0,products,10
1,customers,5
2,orders,5
3,employees,5


## Part 2: Basic SELECT Statements

Practice selecting specific columns and filtering data.

### Exercise 2.1: Select specific columns

Select only the `product_name`, `unit_price`, and `units_in_stock` from the products table.

In [6]:
# SOLUTION
query = """
SELECT product_name, unit_price, units_in_stock
FROM northwind.products;
"""

df = pd.read_sql(text(query), engine)
print(f"Total products: {len(df)}\n")
df.head(10)

Total products: 10



Unnamed: 0,product_name,unit_price,units_in_stock
0,Chai,18.0,39
1,Chang,19.0,17
2,Aniseed Syrup,10.0,13
3,Chef Anton's Cajun Seasoning,22.0,53
4,Chef Anton's Gumbo Mix,21.35,0
5,Grandma's Boysenberry Spread,25.0,120
6,Uncle Bob's Organic Dried Pears,30.0,15
7,Northwoods Cranberry Sauce,40.0,6
8,Mishi Kobe Niku,97.0,29
9,Ikura,31.0,31


### Exercise 2.2: Filter with WHERE clause

Find all products where the `unit_price` is greater than 50.

In [7]:
# SOLUTION
query = """
SELECT product_name, unit_price, category_id
FROM northwind.products
WHERE unit_price > 50
ORDER BY unit_price DESC;
"""

df = pd.read_sql(text(query), engine)
print(f"Products with price > $50: {len(df)}\n")
df

Products with price > $50: 1



Unnamed: 0,product_name,unit_price,category_id
0,Mishi Kobe Niku,97.0,6
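
The threshold in the query above is written directly into the SQL string. When the value comes from a variable or user input, a bound parameter is usually safer than string formatting. The sketch below is an illustration only: it uses Python's built-in `sqlite3` with a small in-memory stand-in for `products` (three rows copied from the outputs above) instead of the course's PostgreSQL database, and the parameter name `min_price` is an assumption. SQLAlchemy's `text()` accepts the same `:name` placeholder style.

```python
import sqlite3

# In-memory stand-in for northwind.products (assumption: three sample rows only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_name TEXT, unit_price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("Chai", 18.0), ("Northwoods Cranberry Sauce", 40.0), ("Mishi Kobe Niku", 97.0)],
)

# The value is passed separately from the SQL text, never formatted into it
query = """
SELECT product_name, unit_price
FROM products
WHERE unit_price > :min_price
ORDER BY unit_price DESC;
"""
rows = conn.execute(query, {"min_price": 50}).fetchall()
print(rows)
conn.close()
```

Only the row for Mishi Kobe Niku passes the filter, matching the result above.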


### Exercise 2.3: Multiple conditions

Find products where `unit_price` is between 20 and 50 AND `units_in_stock` is greater than 0.

In [8]:
# SOLUTION
query = """
SELECT product_name, unit_price, units_in_stock
FROM northwind.products
WHERE unit_price BETWEEN 20 AND 50
  AND units_in_stock > 0
ORDER BY unit_price;
"""

df = pd.read_sql(text(query), engine)
print(f"Products matching criteria: {len(df)}\n")
df

Products matching criteria: 5



Unnamed: 0,product_name,unit_price,units_in_stock
0,Chef Anton's Cajun Seasoning,22.0,53
1,Grandma's Boysenberry Spread,25.0,120
2,Uncle Bob's Organic Dried Pears,30.0,15
3,Ikura,31.0,31
4,Northwoods Cranberry Sauce,40.0,6
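
`BETWEEN` includes both endpoints, so `unit_price BETWEEN 20 AND 50` is the same test as `unit_price >= 20 AND unit_price <= 50`. The following sketch checks this on a throwaway in-memory SQLite table; the product names and prices are hypothetical values chosen to sit on and around the boundaries, not rows from Northwind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_name TEXT, unit_price REAL)")
# Hypothetical rows placed just below, on, and just above each boundary
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("Low", 19.99), ("Edge20", 20.0), ("Mid", 31.0), ("Edge50", 50.0), ("High", 50.01)],
)

between_rows = conn.execute(
    "SELECT product_name FROM products "
    "WHERE unit_price BETWEEN 20 AND 50 ORDER BY unit_price"
).fetchall()
range_rows = conn.execute(
    "SELECT product_name FROM products "
    "WHERE unit_price >= 20 AND unit_price <= 50 ORDER BY unit_price"
).fetchall()

print(between_rows)
print(between_rows == range_rows)
conn.close()
```

Both queries keep the rows priced exactly 20 and exactly 50 and drop the two rows just outside the range.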


### Exercise 2.4: Using LIKE for pattern matching

Find all customers whose `company_name` starts with the letter 'A'.

In [9]:
# SOLUTION
query = """
SELECT customer_id, company_name, city, country
FROM northwind.customers
WHERE company_name LIKE 'A%'
ORDER BY company_name;
"""

df = pd.read_sql(text(query), engine)
print(f"Customers starting with 'A': {len(df)}\n")
df

Customers starting with 'A': 4



Unnamed: 0,customer_id,company_name,city,country
0,ALFKI,Alfreds Futterkiste,Berlin,Germany
1,ANATR,Ana Trujillo Emparedados y helados,México D.F.,Mexico
2,ANTON,Antonio Moreno Taquería,México D.F.,Mexico
3,AROUT,Around the Horn,London,UK


### Exercise 2.5: Using IN for multiple values

Find all customers located in 'USA', 'Canada', or 'Mexico'.

In [10]:
# SOLUTION
query = """
SELECT company_name, city, country
FROM northwind.customers
WHERE country IN ('USA', 'Canada', 'Mexico')
ORDER BY country, city;
"""

df = pd.read_sql(text(query), engine)
print(f"North American customers: {len(df)}\n")
df

North American customers: 2



Unnamed: 0,company_name,city,country
0,Ana Trujillo Emparedados y helados,México D.F.,Mexico
1,Antonio Moreno Taquería,México D.F.,Mexico


## Part 3: JOINs - Combining Tables

Practice joining multiple tables to get related information.

### Exercise 3.1: INNER JOIN - Products with Categories

Join the `products` and `categories` tables to show product names with their category names.

In [11]:
# SOLUTION
query = """
SELECT 
    p.product_name,
    p.unit_price,
    c.category_name
FROM northwind.products p
INNER JOIN northwind.categories c ON p.category_id = c.category_id
ORDER BY c.category_name, p.product_name;
"""

df = pd.read_sql(text(query), engine)
print(f"Total products with categories: {len(df)}\n")
df.head(10)

Total products with categories: 10



Unnamed: 0,product_name,unit_price,category_name
0,Chai,18.0,Beverages
1,Chang,19.0,Beverages
2,Aniseed Syrup,10.0,Condiments
3,Chef Anton's Cajun Seasoning,22.0,Condiments
4,Chef Anton's Gumbo Mix,21.35,Condiments
5,Grandma's Boysenberry Spread,25.0,Condiments
6,Northwoods Cranberry Sauce,40.0,Condiments
7,Ikura,31.0,Confections
8,Mishi Kobe Niku,97.0,Produce
9,Uncle Bob's Organic Dried Pears,30.0,Seafood


### Exercise 3.2: Multiple JOINs - Orders with Customer and Employee Info

Join `orders`, `customers`, and `employees` to show:
- Order ID
- Customer company name
- Employee first and last name (concatenated)
- Order date

In [12]:
# SOLUTION
query = """
SELECT 
    o.order_id,
    c.company_name,
    e.first_name || ' ' || e.last_name as employee_name,
    o.order_date
FROM northwind.orders o
INNER JOIN northwind.customers c ON o.customer_id = c.customer_id
INNER JOIN northwind.employees e ON o.employee_id = e.employee_id
ORDER BY o.order_date DESC;
"""

df = pd.read_sql(text(query), engine)
print(f"Total orders: {len(df)}\n")
df.head(10)

Total orders: 5



Unnamed: 0,order_id,company_name,employee_name,order_date
0,5,Berglunds snabbköp,Steven Buchanan,1996-07-09
1,3,Antonio Moreno Taquería,Janet Leverling,1996-07-08
2,4,Around the Horn,Margaret Peacock,1996-07-08
3,2,Ana Trujillo Emparedados y helados,Andrew Fuller,1996-07-05
4,1,Alfreds Futterkiste,Nancy Davolio,1996-07-04


### Exercise 3.3: JOIN with ORDER BY - Products by Supplier

Join `products` and `suppliers` to show products sorted by supplier name.

In [13]:
# SOLUTION
query = """
SELECT 
    s.company_name as supplier,
    p.product_name,
    p.unit_price
FROM northwind.products p
INNER JOIN northwind.suppliers s ON p.supplier_id = s.supplier_id
ORDER BY s.company_name, p.product_name;
"""

df = pd.read_sql(text(query), engine)
print(f"Total products: {len(df)}\n")
df.head(15)

Total products: 10



Unnamed: 0,supplier,product_name,unit_price
0,Exotic Liquids,Aniseed Syrup,10.0
1,Exotic Liquids,Chai,18.0
2,Exotic Liquids,Chang,19.0
3,Grandma Kelly's Homestead,Grandma's Boysenberry Spread,25.0
4,Grandma Kelly's Homestead,Northwoods Cranberry Sauce,40.0
5,Grandma Kelly's Homestead,Uncle Bob's Organic Dried Pears,30.0
6,New Orleans Cajun Delights,Chef Anton's Cajun Seasoning,22.0
7,New Orleans Cajun Delights,Chef Anton's Gumbo Mix,21.35
8,Tokyo Traders,Ikura,31.0
9,Tokyo Traders,Mishi Kobe Niku,97.0


### Exercise 3.4: Complex JOIN - Order Details with Full Information

Join `order_details`, `orders`, `products`, and `customers` to show:
- Order ID
- Customer company name
- Product name
- Quantity
- Unit price
- Line total (quantity × unit_price)

In [14]:
# SOLUTION
query = """
SELECT 
    o.order_id,
    c.company_name,
    p.product_name,
    od.quantity,
    od.unit_price,
    (od.quantity * od.unit_price) as line_total
FROM northwind.order_details od
INNER JOIN northwind.orders o ON od.order_id = o.order_id
INNER JOIN northwind.customers c ON o.customer_id = c.customer_id
INNER JOIN northwind.products p ON od.product_id = p.product_id
ORDER BY o.order_id, line_total DESC;
"""

df = pd.read_sql(text(query), engine)
print(f"Total order line items: {len(df)}\n")
df.head(15)

Total order line items: 5



Unnamed: 0,order_id,company_name,product_name,quantity,unit_price,line_total
0,1,Alfreds Futterkiste,Chai,12,18.0,216.0
1,1,Alfreds Futterkiste,Chang,10,19.0,190.0
2,2,Ana Trujillo Emparedados y helados,Aniseed Syrup,5,10.0,50.0
3,3,Antonio Moreno Taquería,Chef Anton's Cajun Seasoning,9,22.0,198.0
4,4,Around the Horn,Chef Anton's Gumbo Mix,40,21.35,854.0


## Part 4: Aggregate Functions and GROUP BY

Practice using aggregate functions to summarize data.

### Exercise 4.1: Count products by category

Show how many products are in each category.

In [15]:
# SOLUTION
query = """
SELECT 
    c.category_name,
    COUNT(p.product_id) as product_count
FROM northwind.categories c
LEFT JOIN northwind.products p ON c.category_id = p.category_id
GROUP BY c.category_name
ORDER BY product_count DESC;
"""

pd.read_sql(text(query), engine)

Unnamed: 0,category_name,product_count
0,Condiments,5
1,Beverages,2
2,Produce,1
3,Confections,1
4,Seafood,1
5,Dairy Products,0
6,Meat/Poultry,0
7,Grains/Cereals,0
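
The zero counts above depend on two choices: the `LEFT JOIN` keeps categories that have no products, and `COUNT(p.product_id)` skips the NULLs that the join produces for them. A minimal sketch of the difference, using Python's built-in `sqlite3` as a stand-in for PostgreSQL and a two-category sample (the `category_id` values are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT, category_id INTEGER);
INSERT INTO categories VALUES (1, 'Beverages'), (4, 'Dairy Products');
INSERT INTO products VALUES (1, 'Chai', 1), (2, 'Chang', 1);
""")

# LEFT JOIN: every category survives; COUNT(column) ignores NULLs, COUNT(*) does not
left_rows = conn.execute("""
SELECT c.category_name, COUNT(p.product_id) AS product_count, COUNT(*) AS row_count
FROM categories c
LEFT JOIN products p ON c.category_id = p.category_id
GROUP BY c.category_name
ORDER BY c.category_name;
""").fetchall()

# INNER JOIN: categories without products disappear from the result
inner_rows = conn.execute("""
SELECT c.category_name, COUNT(p.product_id) AS product_count
FROM categories c
INNER JOIN products p ON c.category_id = p.category_id
GROUP BY c.category_name
ORDER BY c.category_name;
""").fetchall()

print(left_rows)
print(inner_rows)
conn.close()
```

With the `LEFT JOIN`, the empty category is reported with a `product_count` of 0, while `COUNT(*)` would report 1 for it because the NULL-padded row still counts as a row. The `INNER JOIN` version omits the category entirely.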


### Exercise 4.2: Average, Min, and Max prices by category

Calculate the average, minimum, and maximum price for products in each category.

In [16]:
# SOLUTION
query = """
SELECT 
    c.category_name,
    COUNT(p.product_id) as product_count,
    ROUND(AVG(p.unit_price)::numeric, 2) as avg_price,
    MIN(p.unit_price) as min_price,
    MAX(p.unit_price) as max_price
FROM northwind.categories c
LEFT JOIN northwind.products p ON c.category_id = p.category_id
GROUP BY c.category_name
ORDER BY avg_price DESC;
"""

pd.read_sql(text(query), engine)

Unnamed: 0,category_name,product_count,avg_price,min_price,max_price
0,Grains/Cereals,0,,,
1,Meat/Poultry,0,,,
2,Dairy Products,0,,,
3,Produce,1,97.0,97.0,97.0
4,Confections,1,31.0,31.0,31.0
5,Seafood,1,30.0,30.0,30.0
6,Condiments,5,23.67,10.0,40.0
7,Beverages,2,18.5,18.0,19.0
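
The three empty categories appear at the top of the output above because PostgreSQL places NULLs first by default when sorting in `DESC` order. A possible variant, which is not part of the original solution, adds `NULLS LAST` so that categories with actual prices come first. The block only defines the query string and assumes the same schema; run it with `pd.read_sql(text(query), engine)` as in the cell above.

```python
# Variant of the Exercise 4.2 query (assumption: same northwind schema as above).
# NULLS LAST moves categories without products to the bottom of the result.
query = """
SELECT 
    c.category_name,
    COUNT(p.product_id) as product_count,
    ROUND(AVG(p.unit_price)::numeric, 2) as avg_price,
    MIN(p.unit_price) as min_price,
    MAX(p.unit_price) as max_price
FROM northwind.categories c
LEFT JOIN northwind.products p ON c.category_id = p.category_id
GROUP BY c.category_name
ORDER BY avg_price DESC NULLS LAST;
"""
```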


### Exercise 4.3: Total sales by customer

Calculate the total sales amount for each customer (sum of quantity × unit_price from order_details).

In [17]:
# SOLUTION
query = """
SELECT 
    c.company_name,
    COUNT(DISTINCT o.order_id) as order_count,
    SUM(od.quantity * od.unit_price) as total_sales
FROM northwind.customers c
INNER JOIN northwind.orders o ON c.customer_id = o.customer_id
INNER JOIN northwind.order_details od ON o.order_id = od.order_id
GROUP BY c.company_name
ORDER BY total_sales DESC
LIMIT 10;
"""

df = pd.read_sql(text(query), engine)
print("Top 10 Customers by Total Sales\n")
df

Top 10 Customers by Total Sales



Unnamed: 0,company_name,order_count,total_sales
0,Around the Horn,1,854.0
1,Alfreds Futterkiste,1,406.0
2,Antonio Moreno Taquería,1,198.0
3,Ana Trujillo Emparedados y helados,1,50.0


### Exercise 4.4: HAVING clause - Categories with high average price

Find categories where the average product price is greater than 30.

In [18]:
# SOLUTION
query = """
SELECT 
    c.category_name,
    COUNT(p.product_id) as product_count,
    ROUND(AVG(p.unit_price)::numeric, 2) as avg_price
FROM northwind.categories c
INNER JOIN northwind.products p ON c.category_id = p.category_id
GROUP BY c.category_name
HAVING AVG(p.unit_price) > 30
ORDER BY avg_price DESC;
"""

pd.read_sql(text(query), engine)

Unnamed: 0,category_name,product_count,avg_price
0,Produce,1,97.0
1,Confections,1,31.0


### Exercise 4.5: Orders per employee

Show the number of orders handled by each employee, sorted by order count from highest to lowest. Employees with no orders should still appear, with a count of 0.

In [19]:
# SOLUTION
query = """
SELECT 
    e.first_name || ' ' || e.last_name as employee_name,
    e.title,
    COUNT(o.order_id) as order_count
FROM northwind.employees e
LEFT JOIN northwind.orders o ON e.employee_id = o.employee_id
GROUP BY e.employee_id, e.first_name, e.last_name, e.title
ORDER BY order_count DESC;
"""

pd.read_sql(text(query), engine)

Unnamed: 0,employee_name,title,order_count
0,Margaret Peacock,Sales Representative,1
1,Andrew Fuller,"Vice President, Sales",1
2,Janet Leverling,Sales Representative,1
3,Nancy Davolio,Sales Representative,1
4,Steven Buchanan,Sales Manager,1


## Part 5: Advanced Queries

Combine multiple concepts to answer business questions.

### Exercise 5.1: Products that need reordering

Find products where `units_in_stock` is less than or equal to `reorder_level` and the product is not discontinued.

In [21]:
# SOLUTION
query = """
SELECT 
    p.product_name,
    c.category_name,
    p.units_in_stock,
    p.reorder_level,
    p.units_on_order
FROM northwind.products p
INNER JOIN northwind.categories c ON p.category_id = c.category_id
WHERE p.units_in_stock <= p.reorder_level
  AND p.discontinued = FALSE
ORDER BY (p.reorder_level - p.units_in_stock) DESC;
"""

df = pd.read_sql(text(query), engine)
print(f"Products needing reorder: {len(df)}\n")
df

Products needing reorder: 1



Unnamed: 0,product_name,category_name,units_in_stock,reorder_level,units_on_order
0,Chef Anton's Gumbo Mix,Condiments,0,0,0


### Exercise 5.2: Most expensive order

Find the five orders with the highest total value (sum of quantity × unit_price). The first row is the most expensive order.

In [None]:
# SOLUTION
query = """
SELECT 
    o.order_id,
    c.company_name,
    o.order_date,
    SUM(od.quantity * od.unit_price) as order_total
FROM northwind.orders o
INNER JOIN northwind.customers c ON o.customer_id = c.customer_id
INNER JOIN northwind.order_details od ON o.order_id = od.order_id
GROUP BY o.order_id, c.company_name, o.order_date
ORDER BY order_total DESC
LIMIT 5;
"""

df = pd.read_sql(text(query), engine)
print("Top 5 Most Expensive Orders\n")
df

### Exercise 5.3: Customer order frequency

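Show customers who have placed more than 10 orders, with their total order count and total sales. Note that the sample data loaded in Part 0 contains only 5 orders (see Exercise 1.3), so no customer reaches this threshold and the query returns an empty result.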

In [None]:
# SOLUTION
query = """
SELECT 
    c.company_name,
    c.country,
    COUNT(DISTINCT o.order_id) as order_count,
    ROUND(SUM(od.quantity * od.unit_price)::numeric, 2) as total_sales
FROM northwind.customers c
INNER JOIN northwind.orders o ON c.customer_id = o.customer_id
INNER JOIN northwind.order_details od ON o.order_id = od.order_id
GROUP BY c.customer_id, c.company_name, c.country
HAVING COUNT(DISTINCT o.order_id) > 10
ORDER BY order_count DESC;
"""

df = pd.read_sql(text(query), engine)
print(f"Customers with more than 10 orders: {len(df)}\n")
df

### Exercise 5.4: Product popularity

Find the top 10 most frequently ordered products (by total quantity sold).

In [None]:
# SOLUTION
query = """
SELECT 
    p.product_name,
    c.category_name,
    SUM(od.quantity) as total_quantity_sold,
    COUNT(DISTINCT od.order_id) as times_ordered,
    ROUND(SUM(od.quantity * od.unit_price)::numeric, 2) as total_revenue
FROM northwind.products p
INNER JOIN northwind.categories c ON p.category_id = c.category_id
INNER JOIN northwind.order_details od ON p.product_id = od.product_id
GROUP BY p.product_id, p.product_name, c.category_name
ORDER BY total_quantity_sold DESC
LIMIT 10;
"""

df = pd.read_sql(text(query), engine)
print("Top 10 Most Popular Products\n")
df

### Exercise 5.5: Sales by country

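Calculate total sales for each country, showing only countries with total sales over 10000. Note that the five line items shown in Exercise 3.4 add up to 1,508 in total, so with this sample data no country passes the filter and the result is empty.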

In [None]:
# SOLUTION
query = """
SELECT 
    c.country,
    COUNT(DISTINCT c.customer_id) as customer_count,
    COUNT(DISTINCT o.order_id) as order_count,
    ROUND(SUM(od.quantity * od.unit_price)::numeric, 2) as total_sales
FROM northwind.customers c
INNER JOIN northwind.orders o ON c.customer_id = o.customer_id
INNER JOIN northwind.order_details od ON o.order_id = od.order_id
GROUP BY c.country
HAVING SUM(od.quantity * od.unit_price) > 10000
ORDER BY total_sales DESC;
"""

df = pd.read_sql(text(query), engine)
print("Countries with Total Sales > $10,000\n")
df

## Summary

Great work! You've practiced:
- ✓ Basic SELECT statements with specific columns
- ✓ Filtering data with WHERE, LIKE, IN, and BETWEEN
- ✓ INNER and LEFT JOINs to combine related tables
- ✓ Multiple JOINs across 3-4 tables
- ✓ Aggregate functions (COUNT, SUM, AVG, MIN, MAX)
- ✓ GROUP BY for summarizing data
- ✓ HAVING clause for filtering grouped results
- ✓ Complex queries combining multiple concepts

These are essential SQL skills for data analysis and database work!