# Module 09: Performance & Optimization

**Estimated Time:** 60 minutes

## Learning Objectives

By the end of this module, you will be able to:
- Create and use indexes to speed up queries
- Analyze query execution plans with EXPLAIN QUERY PLAN
- Identify and fix slow queries
- Apply optimization best practices
- Understand when to use indexes
- Benchmark query performance

In [None]:
# Setup
import sqlite3
import pandas as pd
from pathlib import Path
import time

%load_ext sql

DB_PATH = Path.cwd().parent / "data" / "databases" / "ecommerce.db"
conn = sqlite3.connect(DB_PATH)
%sql sqlite:///$DB_PATH

print("✓ Connected to ecommerce.db")

## 1. Understanding Query Performance

### What Makes Queries Slow?
- Full table scans (reading every row)
- Missing indexes
- Inefficient JOIN operations
- Using SELECT *
- Complex subqueries
- Large result sets

## 2. EXPLAIN: Query Execution Plans

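`EXPLAIN QUERY PLAN` shows the high-level strategy SQLite will use for a query: which tables it scans, which indexes it searches, and in what order. Plain `EXPLAIN` instead lists the low-level bytecode, which is rarely what you want when tuning.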
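
To make the two forms concrete outside the `%%sql` magic, the sketch below runs both `EXPLAIN QUERY PLAN` and plain `EXPLAIN` from Python. It assumes a throwaway in-memory table named `products` with made-up rows, and it uses a separate connection called `demo` (a hypothetical name) so the notebook's `conn` is left untouched.

```python
import sqlite3

# Hypothetical in-memory stand-in for the products table (not ecommerce.db)
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE products ("
    "product_id INTEGER PRIMARY KEY, product_name TEXT, price REAL)"
)
demo.executemany(
    "INSERT INTO products (product_name, price) VALUES (?, ?)",
    [(f"item {i}", float(i)) for i in range(200)],
)

query = "SELECT * FROM products WHERE price > 100"

# EXPLAIN QUERY PLAN: one row per plan step; the last column is the readable detail
plan_rows = demo.execute("EXPLAIN QUERY PLAN " + query).fetchall()
plan_details = [row[-1] for row in plan_rows]

# Plain EXPLAIN: one row per virtual-machine instruction (much longer output)
bytecode_rows = demo.execute("EXPLAIN " + query).fetchall()

print("Plan steps:", plan_details)
print("Bytecode instructions:", len(bytecode_rows))
demo.close()
```

With no index on `price`, the plan detail should describe a scan of `products`; the exact wording varies between SQLite versions.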

In [None]:
%%sql
/* EXPLAIN a simple query */
EXPLAIN QUERY PLAN
SELECT * FROM products WHERE price > 100

In [None]:
%%sql
/* EXPLAIN a JOIN query */
EXPLAIN QUERY PLAN
SELECT p.product_name, c.category_name
FROM products p
JOIN categories c ON p.category_id = c.category_id
WHERE p.price > 50

## 3. Indexes: Speeding Up Queries

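An **index** is a separate, sorted structure (a B-tree in SQLite) that maps column values to rows, so the database can find matching rows without scanning the whole table.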

**Benefits:**
- Faster data retrieval
- Speeds up WHERE, JOIN, and ORDER BY

**Costs:**
- Uses disk space
- Slows down INSERT/UPDATE/DELETE
- Requires maintenance
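
The effect of an index on the query plan can be sketched in plain Python. The example assumes a hypothetical in-memory `products` table with made-up rows; `plan_details` is an illustrative helper, not part of `sqlite3`.

```python
import sqlite3

def plan_details(connection, sql):
    """Return the readable detail column of EXPLAIN QUERY PLAN for `sql`."""
    return [row[-1] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]

# Hypothetical in-memory stand-in for the products table
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE products ("
    "product_id INTEGER PRIMARY KEY, product_name TEXT, price REAL)"
)
demo.executemany(
    "INSERT INTO products (product_name, price) VALUES (?, ?)",
    [(f"item {i}", float(i)) for i in range(500)],
)

query = "SELECT * FROM products WHERE price > 100"

plan_before = plan_details(demo, query)  # no index yet: full table scan
demo.execute("CREATE INDEX idx_products_price ON products(price)")
plan_after = plan_details(demo, query)   # the planner can now search the index

print("Before:", plan_before)
print("After: ", plan_after)
demo.close()
```

The "before" plan should report a scan, while the "after" plan should name `idx_products_price`.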

In [None]:
%%sql
/* Check existing indexes */
SELECT name, tbl_name, sql
FROM sqlite_master
WHERE type = 'index'
LIMIT 10

In [None]:
%%sql
/* Create an index on the price column */
CREATE INDEX IF NOT EXISTS idx_products_price ON products(price)

In [None]:
%%sql
/* Re-run the plan now that idx_products_price exists and compare it with the earlier output */
EXPLAIN QUERY PLAN
SELECT * FROM products WHERE price > 100

In [None]:
%%sql
/* Create a composite index (multiple columns) */
CREATE INDEX IF NOT EXISTS idx_products_category_price 
ON products(category_id, price)

In [None]:
%%sql
/* Index on a foreign key */
CREATE INDEX IF NOT EXISTS idx_orders_customer_id ON orders(customer_id)

## 4. When to Use Indexes

**Create indexes on:**
- Primary keys (automatic)
- Foreign keys (SQLite does not index these automatically)
- Columns frequently used in WHERE
- Columns frequently used in JOIN
- Columns frequently used in ORDER BY

**Avoid indexes on:**
- Small tables (roughly under 1,000 rows), where a full scan is already fast
- Columns with low cardinality (few unique values)
- Columns rarely used in queries
- Write-heavy tables, since every INSERT/UPDATE/DELETE must also update each index
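
Column order matters for composite indexes such as `idx_products_category_price`: SQLite can generally use the index only when the query constrains its leading column. The sketch below checks this on a hypothetical, empty in-memory `products` table. Without collected statistics the planner does not need any rows to pick a plan.

```python
import sqlite3

# Hypothetical in-memory table; column names mirror the ones used in this module
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE products ("
    "product_id INTEGER PRIMARY KEY, product_name TEXT, "
    "category_id INTEGER, price REAL)"
)
demo.execute(
    "CREATE INDEX idx_products_category_price ON products(category_id, price)"
)

def plan(sql):
    """Return the detail text of each EXPLAIN QUERY PLAN row."""
    return [row[-1] for row in demo.execute("EXPLAIN QUERY PLAN " + sql)]

# Leading column only: the index applies
by_category = plan("SELECT product_name FROM products WHERE category_id = 1")
# Leading column plus second column: the index applies to both constraints
by_both = plan(
    "SELECT product_name FROM products WHERE category_id = 1 AND price > 100"
)
# Second column only: the index cannot be searched, so SQLite scans the table
by_price = plan("SELECT product_name FROM products WHERE price > 100")

for label, details in [("category", by_category), ("both", by_both), ("price", by_price)]:
    print(f"{label:>8}: {details}")
demo.close()
```

If filtering by `price` alone is common, a separate single-column index on `price` is the usual remedy.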

## 5. Benchmarking Queries

In [None]:
# Simple benchmarking function
def benchmark_query(query, iterations=10):
    """Run a query multiple times and measure average time."""
    cursor = conn.cursor()
    times = []

    for _ in range(iterations):
        start = time.perf_counter()  # monotonic, high-resolution timer
        cursor.execute(query)
        cursor.fetchall()  # include the cost of fetching every row
        end = time.perf_counter()
        times.append(end - start)

    avg_time = sum(times) / len(times)
    print(f"Average time: {avg_time*1000:.2f}ms over {iterations} runs")
    return avg_time


# Test query performance
query = "SELECT * FROM products WHERE price > 50"
benchmark_query(query)
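
An average can be skewed by a single slow run, so it can help to report the best and median times as well. The following is a minimal sketch on a hypothetical in-memory table with synthetic rows; `time_query` and `demo` are illustrative names, and the timings will differ from machine to machine.

```python
import sqlite3
import statistics
import time

def time_query(connection, sql, repeats=20):
    """Return (best, median) wall-clock seconds over `repeats` runs of `sql`."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        connection.execute(sql).fetchall()
        samples.append(time.perf_counter() - start)
    return min(samples), statistics.median(samples)

# Hypothetical in-memory table with synthetic rows
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE products ("
    "product_id INTEGER PRIMARY KEY, product_name TEXT, price REAL)"
)
demo.executemany(
    "INSERT INTO products (product_name, price) VALUES (?, ?)",
    [(f"item {i}", float(i)) for i in range(20_000)],
)

sql = "SELECT product_name FROM products WHERE price > 19900"

before = time_query(demo, sql)  # full table scan
demo.execute("CREATE INDEX idx_products_price ON products(price)")
after = time_query(demo, sql)   # index search

print(f"no index  : best {before[0] * 1000:.3f} ms, median {before[1] * 1000:.3f} ms")
print(f"with index: best {after[0] * 1000:.3f} ms, median {after[1] * 1000:.3f} ms")
demo.close()
```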

## 6. Optimization Best Practices

### Practice 1: Avoid SELECT *

In [None]:
%%sql
/* Bad: SELECT * */
SELECT * FROM products LIMIT 5

In [None]:
%%sql
/* Good: select only the columns you need */
SELECT product_id, product_name, price FROM products LIMIT 5
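
One concrete payoff of naming columns: when every requested column is stored in an index, SQLite can answer the query from the index alone (a covering index) and skip the table lookup. The sketch below compares the two plans on a hypothetical in-memory table; the `description` column is made up for illustration.

```python
import sqlite3

# Hypothetical in-memory table with an index on price
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE products ("
    "product_id INTEGER PRIMARY KEY, product_name TEXT, "
    "description TEXT, price REAL)"
)
demo.execute("CREATE INDEX idx_products_price ON products(price)")

def plan(sql):
    """Return the detail text of each EXPLAIN QUERY PLAN row."""
    return [row[-1] for row in demo.execute("EXPLAIN QUERY PLAN " + sql)]

# SELECT * needs columns that live only in the table, so each match costs a lookup
wide_plan = plan("SELECT * FROM products WHERE price > 100")

# product_id (the rowid) and price are both stored in the index itself
narrow_plan = plan("SELECT product_id, price FROM products WHERE price > 100")

print("SELECT *      :", wide_plan)
print("Named columns :", narrow_plan)
demo.close()
```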

### Practice 2: Use LIMIT for Testing

In [None]:
%%sql
/* Always use LIMIT when testing queries */
SELECT product_name, price 
FROM products 
WHERE price > 50
LIMIT 10  -- Prevents accidentally retrieving millions of rows

### Practice 3: Use EXISTS Instead of IN for Subqueries

In [None]:
%%sql
/* Can be slower: IN with a subquery (check the plan, the optimizer often handles this well) */
SELECT product_name
FROM products
WHERE product_id IN (
    SELECT product_id FROM order_items WHERE quantity > 2
)
LIMIT 10

In [None]:
%%sql
/* Often faster when order_items.product_id is indexed: EXISTS can stop at the first match */
SELECT product_name
FROM products p
WHERE EXISTS (
    SELECT 1 FROM order_items oi 
    WHERE oi.product_id = p.product_id AND oi.quantity > 2
)
LIMIT 10
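
Which form wins depends on the data and the available indexes, so it is worth confirming that both return the same rows and then comparing their plans. A small sketch, assuming hypothetical in-memory `products` and `order_items` tables with made-up rows:

```python
import sqlite3

# Hypothetical in-memory tables with a handful of made-up rows
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE order_items (
        order_item_id INTEGER PRIMARY KEY, product_id INTEGER, quantity INTEGER
    );
    CREATE INDEX idx_order_items_product_id ON order_items(product_id);
    INSERT INTO products VALUES (1, 'keyboard'), (2, 'mouse'), (3, 'monitor');
    INSERT INTO order_items VALUES (1, 1, 3), (2, 2, 5), (3, 3, 1), (4, 1, 4);
""")

in_query = """
    SELECT product_name FROM products
    WHERE product_id IN (SELECT product_id FROM order_items WHERE quantity > 2)
"""
exists_query = """
    SELECT product_name FROM products p
    WHERE EXISTS (
        SELECT 1 FROM order_items oi
        WHERE oi.product_id = p.product_id AND oi.quantity > 2
    )
"""

# Both forms should return the same set of products
in_rows = sorted(demo.execute(in_query).fetchall())
exists_rows = sorted(demo.execute(exists_query).fetchall())

# Print each plan so the two strategies can be compared side by side
for label, sql in [("IN", in_query), ("EXISTS", exists_query)]:
    details = [row[-1] for row in demo.execute("EXPLAIN QUERY PLAN " + sql)]
    print(label, "->", details)
demo.close()
```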

### Practice 4: Use JOINs Instead of Subqueries When Possible

In [None]:
%%sql
/* Often slower: the correlated subquery in SELECT runs once per product row */
SELECT 
    p.product_name,
    (SELECT category_name FROM categories c WHERE c.category_id = p.category_id) AS category
FROM products p
LIMIT 10

In [None]:
%%sql
/* Usually faster: JOIN (note that an inner JOIN drops products with no matching category) */
SELECT p.product_name, c.category_name
FROM products p
JOIN categories c ON p.category_id = c.category_id
LIMIT 10
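
The two queries above are not strictly equivalent: the scalar subquery returns `NULL` for a product with no category, while an inner `JOIN` drops that product. A `LEFT JOIN` preserves the subquery's behavior. The sketch below shows the difference on hypothetical in-memory tables with made-up rows.

```python
import sqlite3

# Hypothetical in-memory tables; product 2 deliberately has no category
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE categories (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY, product_name TEXT, category_id INTEGER
    );
    INSERT INTO categories VALUES (1, 'Audio');
    INSERT INTO products VALUES (1, 'headphones', 1), (2, 'mystery box', NULL);
""")

subquery_rows = demo.execute("""
    SELECT p.product_name,
           (SELECT category_name FROM categories c
            WHERE c.category_id = p.category_id) AS category
    FROM products p
    ORDER BY p.product_id
""").fetchall()

inner_join_rows = demo.execute("""
    SELECT p.product_name, c.category_name
    FROM products p
    JOIN categories c ON p.category_id = c.category_id
    ORDER BY p.product_id
""").fetchall()

left_join_rows = demo.execute("""
    SELECT p.product_name, c.category_name
    FROM products p
    LEFT JOIN categories c ON p.category_id = c.category_id
    ORDER BY p.product_id
""").fetchall()

print("Subquery  :", subquery_rows)
print("JOIN      :", inner_join_rows)
print("LEFT JOIN :", left_join_rows)
demo.close()
```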

### Practice 5: Filter Early in Subqueries and CTEs

In [None]:
%%sql
/* Good: filter inside the CTE */
WITH expensive_products AS (
    SELECT product_id, product_name, price
    FROM products
    WHERE price > 100  -- Filter early
)
SELECT ep.product_name, COUNT(oi.order_item_id) AS order_count
FROM expensive_products ep
LEFT JOIN order_items oi ON ep.product_id = oi.product_id
GROUP BY ep.product_id, ep.product_name

## 7. Common Performance Issues and Solutions

### Issue 1: Slow JOINs

**Solution**: Index foreign key columns

In [None]:
%%sql
CREATE INDEX IF NOT EXISTS idx_order_items_product_id ON order_items(product_id);
CREATE INDEX IF NOT EXISTS idx_order_items_order_id ON order_items(order_id)

### Issue 2: Slow WHERE Clauses

**Solution**: Index frequently filtered columns

In [None]:
%%sql
CREATE INDEX IF NOT EXISTS idx_orders_status ON orders(status);
CREATE INDEX IF NOT EXISTS idx_orders_date ON orders(order_date)

### Issue 3: Functions in WHERE Clause

**Problem**: Wrapping a column in a function stops SQLite from using an ordinary index on that column, so the query falls back to a full scan

In [None]:
%%sql
/* Bad: function on an indexed column */
SELECT * FROM orders WHERE UPPER(status) = 'COMPLETED' LIMIT 5

In [None]:
%%sql
/* Good: compare the stored value directly (assumes status is stored in lowercase) */
SELECT * FROM orders WHERE status = 'completed' LIMIT 5
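
When the stored values really are mixed-case and the function cannot be dropped, SQLite (3.9 and later) also supports indexes on expressions. The sketch below is one possible approach on a hypothetical in-memory `orders` table; the index name `idx_orders_status_upper` is made up for illustration.

```python
import sqlite3

# Hypothetical in-memory orders table with mixed-case status values
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT
    );
    CREATE INDEX idx_orders_status ON orders(status);
    INSERT INTO orders VALUES (1, 10, 'completed'), (2, 11, 'Completed'), (3, 12, 'pending');
""")

def plan(sql):
    """Return the detail text of each EXPLAIN QUERY PLAN row."""
    return [row[-1] for row in demo.execute("EXPLAIN QUERY PLAN " + sql)]

sql = "SELECT * FROM orders WHERE UPPER(status) = 'COMPLETED'"

# The plain index on status cannot be searched through UPPER(...)
plan_plain = plan(sql)

# An index on the expression itself matches the WHERE clause exactly
demo.execute("CREATE INDEX idx_orders_status_upper ON orders(UPPER(status))")
plan_expr = plan(sql)

matching_ids = sorted(row[0] for row in demo.execute(sql))

print("Plain index only :", plan_plain)
print("Expression index :", plan_expr)
print("Matching orders  :", matching_ids)
demo.close()
```

The expression in the query has to match the indexed expression for the planner to use it, and the extra index carries the same write and storage costs as any other.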

## 8. Monitoring and Maintenance

In [None]:
# View database statistics
cursor = conn.cursor()
cursor.execute("PRAGMA database_list")
print("Database Info:")
for row in cursor.fetchall():
    print(row)

# Get page count (database size indicator)
cursor.execute("PRAGMA page_count")
print(f"\nPage Count: {cursor.fetchone()[0]}")

cursor.execute("PRAGMA page_size")
print(f"Page Size: {cursor.fetchone()[0]} bytes")
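
The two PRAGMA values above multiply to give the database size in bytes, and `PRAGMA index_list` reports which indexes a table has. A minimal sketch against a hypothetical in-memory database (the same calls should work on the notebook's `conn`):

```python
import sqlite3

# Hypothetical in-memory database with one table and one index
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE products ("
    "product_id INTEGER PRIMARY KEY, product_name TEXT, price REAL)"
)
demo.execute("CREATE INDEX idx_products_price ON products(price)")
demo.commit()

# Database size = number of pages * bytes per page
page_count = demo.execute("PRAGMA page_count").fetchone()[0]
page_size = demo.execute("PRAGMA page_size").fetchone()[0]
size_bytes = page_count * page_size

# Each index_list row is (seq, name, unique, origin, partial); keep the name
index_names = [row[1] for row in demo.execute("PRAGMA index_list('products')")]

print(f"Database size: {size_bytes} bytes ({page_count} pages x {page_size} bytes)")
print("Indexes on products:", index_names)
demo.close()
```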

## 9. Exercises

### Exercise 1: Analyze Query Performance
Use EXPLAIN QUERY PLAN to analyze this query and suggest improvements:

```sql
SELECT * FROM orders 
WHERE customer_id IN (SELECT customer_id FROM customers WHERE country = 'USA')
```

In [None]:
%%sql
/* Your code here */

### Exercise 2: Create Optimal Indexes
Create appropriate indexes for this query:

```sql
SELECT p.product_name, SUM(oi.quantity) 
FROM products p
JOIN order_items oi ON p.product_id = oi.product_id
WHERE p.category_id = 1
GROUP BY p.product_id, p.product_name
```

In [None]:
%%sql
/* Your code here */

### Exercise 3: Optimize a Slow Query
Rewrite this query to be more efficient:

```sql
SELECT *
FROM products
WHERE product_id IN (
    SELECT product_id FROM order_items
    WHERE order_id IN (
        SELECT order_id FROM orders WHERE status = 'completed'
    )
)
```

In [None]:
%%sql
/* Your code here */

## Summary

In this module, you learned:
- ✓ How to analyze queries with EXPLAIN QUERY PLAN
- ✓ Creating and using indexes
- ✓ When to use (and not use) indexes
- ✓ Query optimization best practices
- ✓ Common performance issues and solutions
- ✓ Benchmarking query performance

**Key Optimization Rules:**
1. Index foreign keys and frequently queried columns
2. Avoid SELECT * - only select needed columns
3. Use LIMIT when testing queries
4. Prefer JOINs over subqueries when possible
5. Consider EXISTS instead of IN for subqueries, and compare the plans
6. Avoid functions on indexed columns in WHERE
7. Filter early in CTEs and subqueries
8. Always use EXPLAIN QUERY PLAN to verify query plans

**Next:** Module 10 - Final Project

In [None]:
conn.close()