# SQL Crash Course for Data Science Interviews

**Last Updated:** 20 January 2026

This notebook reviews the SQL concepts most commonly tested in data science interviews. All examples use SQLite through Python's built-in `sqlite3` module, so they run without a separate database server; `pandas` is the only third-party dependency, used to display query results.

## Table of Contents

1. [Setup and Sample Data](#1-setup-and-sample-data)
2. [Basic SELECT, WHERE, ORDER BY](#2-basic-select-where-order-by)
3. [Aggregation Functions](#3-aggregation-functions)
4. [GROUP BY and HAVING](#4-group-by-and-having)
5. [JOINs](#5-joins)
6. [Subqueries and Nested Queries](#6-subqueries-and-nested-queries)
7. [Common Table Expressions (CTEs)](#7-common-table-expressions-ctes)
8. [Window Functions](#8-window-functions)
9. [CASE Statements](#9-case-statements)
10. [String Functions](#10-string-functions)
11. [Date Functions](#11-date-functions)
12. [Practice Questions](#12-practice-questions)

---

## 1. Setup and Sample Data

We'll create an in-memory SQLite database with sample tables representing a simple e-commerce scenario.

In [1]:
import sqlite3
import pandas as pd
from typing import Any
from collections.abc import Mapping


def create_connection() -> sqlite3.Connection:
    """Create and return an in-memory SQLite database connection.
    
    Returns:
        sqlite3.Connection: A connection to the in-memory database.
    """
    return sqlite3.connect(':memory:')


def run_query(
    conn: sqlite3.Connection,
    query: str,
    params: list[Any] | Mapping[str, Any] | None = None
) -> pd.DataFrame:
    """Execute a SQL query and return results as a pandas DataFrame.
    
    Args:
        conn: The database connection.
        query: The SQL query to execute.
        params: Optional parameters for parameterised queries.
    
    Returns:
        pd.DataFrame: Query results as a DataFrame.
    """
    if params:
        return pd.read_sql_query(query, conn, params=params)
    return pd.read_sql_query(query, conn)


def execute_sql(conn: sqlite3.Connection, sql: str) -> None:
    """Execute SQL statement(s) without returning results.
    
    Args:
        conn: The database connection.
        sql: The SQL statement(s) to execute.
    """
    conn.executescript(sql)
    conn.commit()


conn = create_connection()
print("Database connection established successfully.")

Database connection established successfully.
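
The `params` argument of `run_query` is never exercised in the cells that follow, so here is a minimal sketch of parameter binding with `sqlite3`. The `demo` table and its rows are made up for illustration and are separate from the notebook's sample data. Positional `?` placeholders take a sequence, and named `:name` placeholders take a mapping.

```python
import sqlite3

# Hypothetical throwaway table, not part of the notebook's sample schema.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, country TEXT)")
demo_conn.executemany(
    "INSERT INTO demo (id, country) VALUES (?, ?)",
    [(1, "UK"), (2, "France"), (3, "UK")],
)

# Positional placeholder: the value is bound by the driver, not pasted into the SQL text.
positional_rows = demo_conn.execute(
    "SELECT id FROM demo WHERE country = ? ORDER BY id", ("UK",)
).fetchall()

# Named placeholder: mapping keys match the :name markers in the query.
named_rows = demo_conn.execute(
    "SELECT id FROM demo WHERE country = :country ORDER BY id",
    {"country": "France"},
).fetchall()

print(positional_rows)  # [(1,), (3,)]
print(named_rows)       # [(2,)]
```

Binding values this way avoids quoting mistakes and SQL injection. With the helper defined above, the equivalent call would look like `run_query(conn, "SELECT ... WHERE country = ?", ["UK"])`.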


In [2]:
# Create sample tables
setup_sql = """
-- Customers table
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    email TEXT UNIQUE,
    city TEXT,
    country TEXT,
    signup_date DATE
);

-- Products table
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    category TEXT,
    price DECIMAL(10, 2),
    stock_quantity INTEGER
);

-- Orders table
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    order_date DATE,
    total_amount DECIMAL(10, 2),
    status TEXT,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

-- Order items table
CREATE TABLE order_items (
    item_id INTEGER PRIMARY KEY,
    order_id INTEGER,
    product_id INTEGER,
    quantity INTEGER,
    unit_price DECIMAL(10, 2),
    FOREIGN KEY (order_id) REFERENCES orders(order_id),
    FOREIGN KEY (product_id) REFERENCES products(product_id)
);

-- Employees table (for self-join examples)
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department TEXT,
    salary DECIMAL(10, 2),
    manager_id INTEGER,
    hire_date DATE,
    FOREIGN KEY (manager_id) REFERENCES employees(employee_id)
);

-- Insert sample data into customers
INSERT INTO customers VALUES
    (1, 'Alice', 'Smith', 'alice@email.com', 'London', 'UK', '2023-01-15'),
    (2, 'Bob', 'Johnson', 'bob@email.com', 'Manchester', 'UK', '2023-02-20'),
    (3, 'Charlie', 'Williams', 'charlie@email.com', 'Birmingham', 'UK', '2023-03-10'),
    (4, 'Diana', 'Brown', 'diana@email.com', 'Paris', 'France', '2023-04-05'),
    (5, 'Eve', 'Davis', 'eve@email.com', 'Berlin', 'Germany', '2023-05-12'),
    (6, 'Frank', 'Miller', 'frank@email.com', 'London', 'UK', '2023-06-18'),
    (7, 'Grace', 'Wilson', 'grace@email.com', 'Edinburgh', 'UK', '2023-07-22'),
    (8, 'Henry', 'Moore', 'henry@email.com', 'Dublin', 'Ireland', '2023-08-30'),
    (9, 'Ivy', 'Taylor', NULL, 'Glasgow', 'UK', '2023-09-14'),
    (10, 'Jack', 'Anderson', 'jack@email.com', 'Amsterdam', 'Netherlands', '2023-10-01');

-- Insert sample data into products
INSERT INTO products VALUES
    (1, 'Laptop', 'Electronics', 999.99, 50),
    (2, 'Smartphone', 'Electronics', 699.99, 100),
    (3, 'Headphones', 'Electronics', 149.99, 200),
    (4, 'Desk Chair', 'Furniture', 299.99, 30),
    (5, 'Standing Desk', 'Furniture', 599.99, 20),
    (6, 'Monitor', 'Electronics', 399.99, 75),
    (7, 'Keyboard', 'Electronics', 79.99, 150),
    (8, 'Mouse', 'Electronics', 49.99, 200),
    (9, 'Bookshelf', 'Furniture', 149.99, 40),
    (10, 'Lamp', 'Furniture', 59.99, 100);

-- Insert sample data into orders
INSERT INTO orders VALUES
    (1, 1, '2024-01-10', 1149.98, 'Completed'),
    (2, 2, '2024-01-15', 699.99, 'Completed'),
    (3, 1, '2024-02-01', 299.99, 'Completed'),
    (4, 3, '2024-02-14', 1599.97, 'Completed'),
    (5, 4, '2024-02-20', 149.99, 'Shipped'),
    (6, 5, '2024-03-05', 999.99, 'Shipped'),
    (7, 2, '2024-03-10', 449.98, 'Processing'),
    (8, 6, '2024-03-15', 79.99, 'Processing'),
    (9, 7, '2024-03-20', 659.98, 'Pending'),
    (10, 1, '2024-03-25', 1299.98, 'Pending'),
    (11, 8, '2024-04-01', 549.98, 'Completed'),
    (12, 3, '2024-04-10', 199.98, 'Completed');

-- Insert sample data into order_items
INSERT INTO order_items VALUES
    (1, 1, 1, 1, 999.99),
    (2, 1, 3, 1, 149.99),
    (3, 2, 2, 1, 699.99),
    (4, 3, 4, 1, 299.99),
    (5, 4, 1, 1, 999.99),
    (6, 4, 5, 1, 599.99),
    (7, 5, 3, 1, 149.99),
    (8, 6, 1, 1, 999.99),
    (9, 7, 6, 1, 399.99),
    (10, 7, 8, 1, 49.99),
    (11, 8, 7, 1, 79.99),
    (12, 9, 5, 1, 599.99),
    (13, 9, 10, 1, 59.99),
    (14, 10, 1, 1, 999.99),
    (15, 10, 4, 1, 299.99),
    (16, 11, 6, 1, 399.99),
    (17, 11, 3, 1, 149.99),
    (18, 12, 8, 2, 49.99),
    (19, 12, 7, 1, 79.99);

-- Insert sample data into employees
INSERT INTO employees VALUES
    (1, 'Sarah Connor', 'Engineering', 120000, NULL, '2020-01-15'),
    (2, 'John Smith', 'Engineering', 95000, 1, '2021-03-20'),
    (3, 'Emily Jones', 'Engineering', 85000, 1, '2022-06-10'),
    (4, 'Michael Brown', 'Sales', 80000, NULL, '2019-08-01'),
    (5, 'Jessica White', 'Sales', 75000, 4, '2021-11-15'),
    (6, 'David Lee', 'Sales', 70000, 4, '2022-02-28'),
    (7, 'Anna Garcia', 'Marketing', 90000, NULL, '2020-05-12'),
    (8, 'Robert Wilson', 'Marketing', 72000, 7, '2023-01-08'),
    (9, 'Lisa Martinez', 'Engineering', 88000, 1, '2022-09-01'),
    (10, 'James Taylor', 'Sales', 65000, 5, '2023-04-15');
"""

execute_sql(conn, setup_sql)
print("Sample tables created and populated successfully.")

Sample tables created and populated successfully.
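
One caveat about the schema above: SQLite parses `FOREIGN KEY` clauses, but enforcement is normally off for a connection until `PRAGMA foreign_keys = ON` is issued. The sketch below uses two hypothetical tables, `parent` and `child`, to show the difference. It sets the pragma explicitly in both directions so that the result does not depend on how a particular SQLite build was compiled.

```python
import sqlite3

fk_conn = sqlite3.connect(":memory:")
fk_conn.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY);
CREATE TABLE child (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER,
    FOREIGN KEY (parent_id) REFERENCES parent(id)
);
""")

# Enforcement off: a row pointing at a non-existent parent is accepted.
fk_conn.execute("PRAGMA foreign_keys = OFF")
fk_conn.execute("INSERT INTO child VALUES (1, 999)")
fk_conn.commit()  # the pragma cannot change inside an open transaction
orphans_allowed = fk_conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]

# Enforcement on: the same kind of insert now raises IntegrityError.
fk_conn.execute("PRAGMA foreign_keys = ON")
try:
    fk_conn.execute("INSERT INTO child VALUES (2, 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(orphans_allowed, rejected)  # 1 True
```

The sample data in this notebook contains no orphaned rows, so the examples behave the same with either setting.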


In [3]:
# Verify our tables
tables_query = """
SELECT name FROM sqlite_master 
WHERE type='table' 
ORDER BY name;
"""
run_query(conn, tables_query)

Unnamed: 0,name
0,customers
1,employees
2,order_items
3,orders
4,products


---

## 2. Basic SELECT, WHERE, ORDER BY

The `SELECT` statement is the foundation of SQL queries. It retrieves data from one or more tables.

### Basic SELECT Syntax

```sql
SELECT column1, column2, ...
FROM table_name;
```

### The WHERE Clause

The `WHERE` clause filters rows **before** any grouping or aggregation occurs. It supports:
- Comparison operators: `=`, `<>` (also written `!=`), `<`, `>`, `<=`, `>=`
- Logical operators: `AND`, `OR`, `NOT`
- Pattern matching with `LIKE`, set membership with `IN`, and inclusive range checks with `BETWEEN`
- NULL checking: `IS NULL`, `IS NOT NULL`
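
NULL needs its own operators because any ordinary comparison with NULL evaluates to NULL, which `WHERE` treats as not true. The following sketch uses a hypothetical `people` table and a small `names` helper (both invented for this example) to show the effect, including how `<>` silently drops NULL rows.

```python
import sqlite3

null_conn = sqlite3.connect(":memory:")
null_conn.execute("CREATE TABLE people (name TEXT, city TEXT)")
null_conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann", "London"), ("Ben", "Leeds"), ("Cat", None)],
)


def names(where_clause: str) -> list[str]:
    """Return names matching a fixed, trusted WHERE clause, sorted by name."""
    sql = f"SELECT name FROM people WHERE {where_clause} ORDER BY name"
    return [row[0] for row in null_conn.execute(sql)]


equals_null = names("city = NULL")       # comparison with NULL is never true
is_null = names("city IS NULL")          # the correct NULL test
not_london = names("city <> 'London'")   # the NULL row is dropped as well
not_london_or_null = names("city <> 'London' OR city IS NULL")

print(equals_null)         # []
print(is_null)             # ['Cat']
print(not_london)          # ['Ben']
print(not_london_or_null)  # ['Ben', 'Cat']
```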

### The ORDER BY Clause

The `ORDER BY` clause sorts the result set:
- `ASC` for ascending order (default)
- `DESC` for descending order

In [4]:
# Select all columns from customers
query = "SELECT * FROM customers;"
run_query(conn, query)

Unnamed: 0,customer_id,first_name,last_name,email,city,country,signup_date
0,1,Alice,Smith,alice@email.com,London,UK,2023-01-15
1,2,Bob,Johnson,bob@email.com,Manchester,UK,2023-02-20
2,3,Charlie,Williams,charlie@email.com,Birmingham,UK,2023-03-10
3,4,Diana,Brown,diana@email.com,Paris,France,2023-04-05
4,5,Eve,Davis,eve@email.com,Berlin,Germany,2023-05-12
5,6,Frank,Miller,frank@email.com,London,UK,2023-06-18
6,7,Grace,Wilson,grace@email.com,Edinburgh,UK,2023-07-22
7,8,Henry,Moore,henry@email.com,Dublin,Ireland,2023-08-30
8,9,Ivy,Taylor,,Glasgow,UK,2023-09-14
9,10,Jack,Anderson,jack@email.com,Amsterdam,Netherlands,2023-10-01


In [5]:
# Select specific columns with WHERE clause
query = """
SELECT first_name, last_name, city, country
FROM customers
WHERE country = 'UK';
"""
run_query(conn, query)

Unnamed: 0,first_name,last_name,city,country
0,Alice,Smith,London,UK
1,Bob,Johnson,Manchester,UK
2,Charlie,Williams,Birmingham,UK
3,Frank,Miller,London,UK
4,Grace,Wilson,Edinburgh,UK
5,Ivy,Taylor,Glasgow,UK


In [6]:
# Using multiple conditions with AND/OR
query = """
SELECT first_name, last_name, city, country
FROM customers
WHERE country = 'UK' AND city != 'London';
"""
run_query(conn, query)

Unnamed: 0,first_name,last_name,city,country
0,Bob,Johnson,Manchester,UK
1,Charlie,Williams,Birmingham,UK
2,Grace,Wilson,Edinburgh,UK
3,Ivy,Taylor,Glasgow,UK


In [7]:
# Using LIKE for pattern matching
query = """
SELECT product_name, category, price
FROM products
WHERE product_name LIKE '%phone%' OR product_name LIKE '%top%';
"""
run_query(conn, query)

Unnamed: 0,product_name,category,price
0,Laptop,Electronics,999.99
1,Smartphone,Electronics,699.99
2,Headphones,Electronics,149.99


In [8]:
# Using IN and BETWEEN
query = """
SELECT product_name, category, price
FROM products
WHERE category IN ('Electronics', 'Furniture')
  AND price BETWEEN 100 AND 500
ORDER BY price DESC;
"""
run_query(conn, query)

Unnamed: 0,product_name,category,price
0,Monitor,Electronics,399.99
1,Desk Chair,Furniture,299.99
2,Headphones,Electronics,149.99
3,Bookshelf,Furniture,149.99


In [9]:
# Handling NULL values
query = """
SELECT first_name, last_name, email
FROM customers
WHERE email IS NULL;
"""
run_query(conn, query)

Unnamed: 0,first_name,last_name,email
0,Ivy,Taylor,


In [10]:
# ORDER BY with multiple columns
query = """
SELECT product_name, category, price
FROM products
ORDER BY category ASC, price DESC;
"""
run_query(conn, query)

Unnamed: 0,product_name,category,price
0,Laptop,Electronics,999.99
1,Smartphone,Electronics,699.99
2,Monitor,Electronics,399.99
3,Headphones,Electronics,149.99
4,Keyboard,Electronics,79.99
5,Mouse,Electronics,49.99
6,Standing Desk,Furniture,599.99
7,Desk Chair,Furniture,299.99
8,Bookshelf,Furniture,149.99
9,Lamp,Furniture,59.99


In [11]:
# Using LIMIT and OFFSET
query = """
SELECT product_name, price
FROM products
ORDER BY price DESC
LIMIT 5 OFFSET 2;
"""
run_query(conn, query)

Unnamed: 0,product_name,price
0,Standing Desk,599.99
1,Monitor,399.99
2,Desk Chair,299.99
3,Headphones,149.99
4,Bookshelf,149.99


In [12]:
# Using DISTINCT to remove duplicates
query = """
SELECT DISTINCT country
FROM customers
ORDER BY country;
"""
run_query(conn, query)

Unnamed: 0,country
0,France
1,Germany
2,Ireland
3,Netherlands
4,UK


---

## 3. Aggregation Functions

Aggregation functions perform calculations on a set of values and return a single value.

### Common Aggregation Functions

| Function | Description |
|----------|-------------|
| `COUNT()` | Returns the number of rows (`COUNT(*)`) or of non-NULL values (`COUNT(column)`) |
| `SUM()` | Returns the sum of values |
| `AVG()` | Returns the average of values |
| `MIN()` | Returns the minimum value |
| `MAX()` | Returns the maximum value |

**Note:** Aggregate functions skip NULL values. The exception is `COUNT(*)`, which counts rows regardless of their contents.
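
The NULL behaviour matters most for `AVG`, because skipping a value changes the denominator. As a small sketch on a hypothetical `scores` table (not part of the sample data), compare the default behaviour with treating the missing value as zero via `COALESCE`:

```python
import sqlite3

agg_conn = sqlite3.connect(":memory:")
agg_conn.execute("CREATE TABLE scores (student TEXT, score REAL)")
agg_conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 10.0), ("b", 20.0), ("c", None)],
)

count_all, count_score, avg_score, avg_null_as_zero = agg_conn.execute("""
    SELECT
        COUNT(*),                 -- counts rows, including the NULL score
        COUNT(score),             -- counts non-NULL scores only
        AVG(score),               -- mean of the two non-NULL scores
        AVG(COALESCE(score, 0))   -- treats the missing score as 0
    FROM scores
""").fetchone()

print(count_all, count_score, avg_score, avg_null_as_zero)  # 3 2 15.0 10.0
```

Which of the two averages is appropriate depends on whether a missing value means "unknown" or "zero" in the data at hand.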

In [13]:
# COUNT examples
query = """
SELECT 
    COUNT(*) AS total_customers,
    COUNT(email) AS customers_with_email,
    COUNT(DISTINCT country) AS unique_countries
FROM customers;
"""
run_query(conn, query)

Unnamed: 0,total_customers,customers_with_email,unique_countries
0,10,9,5


In [14]:
# SUM and AVG examples
query = """
SELECT 
    SUM(total_amount) AS total_revenue,
    AVG(total_amount) AS average_order_value,
    ROUND(AVG(total_amount), 2) AS avg_order_rounded
FROM orders;
"""
run_query(conn, query)

Unnamed: 0,total_revenue,average_order_value,avg_order_rounded
0,8139.8,678.316667,678.32


In [15]:
# MIN and MAX examples
query = """
SELECT 
    MIN(price) AS cheapest_product,
    MAX(price) AS most_expensive_product,
    MAX(price) - MIN(price) AS price_range
FROM products;
"""
run_query(conn, query)

Unnamed: 0,cheapest_product,most_expensive_product,price_range
0,49.99,999.99,950.0


In [16]:
# Combining aggregations with WHERE
query = """
SELECT 
    COUNT(*) AS completed_orders,
    SUM(total_amount) AS completed_revenue
FROM orders
WHERE status = 'Completed';
"""
run_query(conn, query)

Unnamed: 0,completed_orders,completed_revenue
0,6,4499.89


---

## 4. GROUP BY and HAVING

### GROUP BY

The `GROUP BY` clause groups rows that have the same values in specified columns into summary rows. It's typically used with aggregation functions.

### HAVING

The `HAVING` clause filters groups **after** aggregation (unlike `WHERE` which filters rows before grouping).

### Key Difference: WHERE vs HAVING

- **WHERE**: Filters individual rows before grouping
- **HAVING**: Filters groups after aggregation
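
To see that the two clauses are not interchangeable, the sketch below applies "greater than 100" once as a row filter and once as a group filter. The `sales` table is a hypothetical example, separate from the notebook's sample data.

```python
import sqlite3

wh_conn = sqlite3.connect(":memory:")
wh_conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
wh_conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 50), ("North", 200), ("South", 120), ("South", 130)],
)

# WHERE removes individual rows first; only the survivors are grouped.
where_result = wh_conn.execute("""
    SELECT region, SUM(amount)
    FROM sales
    WHERE amount > 100
    GROUP BY region
    ORDER BY region
""").fetchall()

# HAVING groups every row first, then removes whole groups.
having_result = wh_conn.execute("""
    SELECT region, SUM(amount)
    FROM sales
    GROUP BY region
    HAVING SUM(amount) > 100
    ORDER BY region
""").fetchall()

print(where_result)   # [('North', 200), ('South', 250)]
print(having_result)  # [('North', 250), ('South', 250)]
```

The North total differs because the 50 row was discarded by `WHERE` before the sum was computed, while `HAVING` saw the complete group.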

In [17]:
# Basic GROUP BY
query = """
SELECT category, COUNT(*) AS product_count
FROM products
GROUP BY category;
"""
run_query(conn, query)

Unnamed: 0,category,product_count
0,Electronics,6
1,Furniture,4


In [18]:
# GROUP BY with multiple aggregations
query = """
SELECT 
    category,
    COUNT(*) AS product_count,
    ROUND(AVG(price), 2) AS avg_price,
    MIN(price) AS min_price,
    MAX(price) AS max_price,
    SUM(stock_quantity) AS total_stock
FROM products
GROUP BY category;
"""
run_query(conn, query)

Unnamed: 0,category,product_count,avg_price,min_price,max_price,total_stock
0,Electronics,6,396.66,49.99,999.99,775
1,Furniture,4,277.49,59.99,599.99,190


In [19]:
# Orders by status
query = """
SELECT 
    status,
    COUNT(*) AS order_count,
    SUM(total_amount) AS total_revenue
FROM orders
GROUP BY status
ORDER BY total_revenue DESC;
"""
run_query(conn, query)

Unnamed: 0,status,order_count,total_revenue
0,Completed,6,4499.89
1,Pending,2,1959.96
2,Shipped,2,1149.98
3,Processing,2,529.97


In [20]:
# Using HAVING to filter groups
query = """
SELECT 
    customer_id,
    COUNT(*) AS order_count,
    SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 1
ORDER BY total_spent DESC;
"""
run_query(conn, query)

Unnamed: 0,customer_id,order_count,total_spent
0,1,3,2749.95
1,3,2,1799.95
2,2,2,1149.97


In [21]:
# WHERE and HAVING together
query = """
SELECT 
    customer_id,
    COUNT(*) AS completed_orders,
    SUM(total_amount) AS total_spent
FROM orders
WHERE status = 'Completed'
GROUP BY customer_id
HAVING SUM(total_amount) > 500
ORDER BY total_spent DESC;
"""
run_query(conn, query)

Unnamed: 0,customer_id,completed_orders,total_spent
0,3,2,1799.95
1,1,2,1449.97
2,2,1,699.99
3,8,1,549.98


In [22]:
# Grouping by multiple columns
query = """
SELECT 
    country,
    city,
    COUNT(*) AS customer_count
FROM customers
GROUP BY country, city
ORDER BY country, customer_count DESC;
"""
run_query(conn, query)

Unnamed: 0,country,city,customer_count
0,France,Paris,1
1,Germany,Berlin,1
2,Ireland,Dublin,1
3,Netherlands,Amsterdam,1
4,UK,London,2
5,UK,Manchester,1
6,UK,Glasgow,1
7,UK,Edinburgh,1
8,UK,Birmingham,1


---

## 5. JOINs

JOINs combine rows from two or more tables based on a related column.

### Types of JOINs

| JOIN Type | Description |
|-----------|-------------|
| `INNER JOIN` | Returns only matching rows from both tables |
| `LEFT JOIN` | Returns all rows from left table, matched rows from right |
| `RIGHT JOIN` | Returns all rows from right table, matched rows from left (supported in SQLite 3.39.0 and later) |
| `FULL OUTER JOIN` | Returns all rows from both tables, with NULLs where a row has no match (supported in SQLite 3.39.0 and later) |
| `CROSS JOIN` | Returns Cartesian product of both tables |
| `SELF JOIN` | Joins a table to itself |
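
A quick way to internalise the table is to count the rows each join produces on a tiny dataset. The sketch below assumes two hypothetical tables, `authors` (three rows, one without books) and `books` (three rows), plus an illustrative `row_count` helper. It also checks the SQLite library version bundled with Python, since `RIGHT JOIN` and `FULL OUTER JOIN` need 3.39.0 or later.

```python
import sqlite3

join_conn = sqlite3.connect(":memory:")
join_conn.executescript("""
CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE books (book_id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
INSERT INTO authors VALUES (1, 'Ada'), (2, 'Bo'), (3, 'Cy');
INSERT INTO books VALUES (10, 1, 'A1'), (11, 1, 'A2'), (12, 2, 'B1');
""")


def row_count(join_clause: str) -> int:
    """Count rows produced by joining authors to books with a fixed join clause."""
    sql = f"SELECT COUNT(*) FROM authors a {join_clause}"
    return join_conn.execute(sql).fetchone()[0]


inner_rows = row_count("INNER JOIN books b ON a.author_id = b.author_id")
left_rows = row_count("LEFT JOIN books b ON a.author_id = b.author_id")
cross_rows = row_count("CROSS JOIN books b")

# RIGHT JOIN and FULL OUTER JOIN are only available from SQLite 3.39.0.
supports_full_join = sqlite3.sqlite_version_info >= (3, 39, 0)

print(inner_rows, left_rows, cross_rows)  # 3 4 9
```

The `LEFT JOIN` returns one row more than the `INNER JOIN` because the author without books is kept with NULLs in the `books` columns.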

In [23]:
# INNER JOIN - Orders with customer details
query = """
SELECT 
    o.order_id,
    c.first_name || ' ' || c.last_name AS customer_name,
    o.order_date,
    o.total_amount,
    o.status
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
ORDER BY o.order_date DESC
LIMIT 10;
"""
run_query(conn, query)

Unnamed: 0,order_id,customer_name,order_date,total_amount,status
0,12,Charlie Williams,2024-04-10,199.98,Completed
1,11,Henry Moore,2024-04-01,549.98,Completed
2,10,Alice Smith,2024-03-25,1299.98,Pending
3,9,Grace Wilson,2024-03-20,659.98,Pending
4,8,Frank Miller,2024-03-15,79.99,Processing
5,7,Bob Johnson,2024-03-10,449.98,Processing
6,6,Eve Davis,2024-03-05,999.99,Shipped
7,5,Diana Brown,2024-02-20,149.99,Shipped
8,4,Charlie Williams,2024-02-14,1599.97,Completed
9,3,Alice Smith,2024-02-01,299.99,Completed


In [24]:
# LEFT JOIN - All customers with their orders (including those without orders)
query = """
SELECT 
    c.customer_id,
    c.first_name || ' ' || c.last_name AS customer_name,
    COUNT(o.order_id) AS order_count,
    COALESCE(SUM(o.total_amount), 0) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, customer_name
ORDER BY total_spent DESC;
"""
run_query(conn, query)

Unnamed: 0,customer_id,customer_name,order_count,total_spent
0,1,Alice Smith,3,2749.95
1,3,Charlie Williams,2,1799.95
2,2,Bob Johnson,2,1149.97
3,5,Eve Davis,1,999.99
4,7,Grace Wilson,1,659.98
5,8,Henry Moore,1,549.98
6,4,Diana Brown,1,149.99
7,6,Frank Miller,1,79.99
8,9,Ivy Taylor,0,0.0
9,10,Jack Anderson,0,0.0


In [25]:
# Multiple JOINs - Order details with customer and product info
query = """
SELECT 
    o.order_id,
    c.first_name || ' ' || c.last_name AS customer_name,
    p.product_name,
    oi.quantity,
    oi.unit_price,
    (oi.quantity * oi.unit_price) AS line_total
FROM order_items oi
INNER JOIN orders o ON oi.order_id = o.order_id
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN products p ON oi.product_id = p.product_id
ORDER BY o.order_id, p.product_name;
"""
run_query(conn, query)

Unnamed: 0,order_id,customer_name,product_name,quantity,unit_price,line_total
0,1,Alice Smith,Headphones,1,149.99,149.99
1,1,Alice Smith,Laptop,1,999.99,999.99
2,2,Bob Johnson,Smartphone,1,699.99,699.99
3,3,Alice Smith,Desk Chair,1,299.99,299.99
4,4,Charlie Williams,Laptop,1,999.99,999.99
5,4,Charlie Williams,Standing Desk,1,599.99,599.99
6,5,Diana Brown,Headphones,1,149.99,149.99
7,6,Eve Davis,Laptop,1,999.99,999.99
8,7,Bob Johnson,Monitor,1,399.99,399.99
9,7,Bob Johnson,Mouse,1,49.99,49.99


In [26]:
# SELF JOIN - Employees with their managers
query = """
SELECT 
    e.name AS employee_name,
    e.department,
    e.salary,
    m.name AS manager_name
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id
ORDER BY e.department, e.salary DESC;
"""
run_query(conn, query)

Unnamed: 0,employee_name,department,salary,manager_name
0,Sarah Connor,Engineering,120000,
1,John Smith,Engineering,95000,Sarah Connor
2,Lisa Martinez,Engineering,88000,Sarah Connor
3,Emily Jones,Engineering,85000,Sarah Connor
4,Anna Garcia,Marketing,90000,
5,Robert Wilson,Marketing,72000,Anna Garcia
6,Michael Brown,Sales,80000,
7,Jessica White,Sales,75000,Michael Brown
8,David Lee,Sales,70000,Michael Brown
9,James Taylor,Sales,65000,Jessica White


In [27]:
# CROSS JOIN example - All combinations of categories and statuses
query = """
SELECT
    p.category,
    o.status
FROM (SELECT DISTINCT category FROM products) p
CROSS JOIN (SELECT DISTINCT status FROM orders) o
ORDER BY p.category, o.status;
"""
run_query(conn, query)

Unnamed: 0,category,status
0,Electronics,Completed
1,Electronics,Pending
2,Electronics,Processing
3,Electronics,Shipped
4,Furniture,Completed
5,Furniture,Pending
6,Furniture,Processing
7,Furniture,Shipped


In [28]:
# Emulating FULL OUTER JOIN with UNION (SQLite versions before 3.39.0 have no native FULL OUTER JOIN)
query = """
SELECT 
    c.customer_id AS customer_id,
    c.first_name || ' ' || c.last_name AS customer_name,
    o.order_id AS order_id,
    o.total_amount
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id

UNION

SELECT 
    COALESCE(c.customer_id, o.customer_id) AS customer_id,
    c.first_name || ' ' || c.last_name AS customer_name,
    o.order_id AS order_id,
    o.total_amount
FROM orders o
LEFT JOIN customers c ON c.customer_id = o.customer_id

ORDER BY customer_id, order_id;
"""
run_query(conn, query)

Unnamed: 0,customer_id,customer_name,order_id,total_amount
0,1,Alice Smith,1.0,1149.98
1,1,Alice Smith,3.0,299.99
2,1,Alice Smith,10.0,1299.98
3,2,Bob Johnson,2.0,699.99
4,2,Bob Johnson,7.0,449.98
5,3,Charlie Williams,4.0,1599.97
6,3,Charlie Williams,12.0,199.98
7,4,Diana Brown,5.0,149.99
8,5,Eve Davis,6.0,999.99
9,6,Frank Miller,8.0,79.99
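
Because `UNION` removes duplicates, the emulation above would also collapse rows that are genuinely repeated in the data. A commonly used alternative is `UNION ALL` with an anti-join on the second branch, sketched here on two hypothetical tables, `left_t` and `right_t`. Where the bundled SQLite is new enough, the sketch also runs the native `FULL OUTER JOIN` and checks that both give the same rows.

```python
import sqlite3

fo_conn = sqlite3.connect(":memory:")
fo_conn.executescript("""
CREATE TABLE left_t (k INTEGER, lv TEXT);
CREATE TABLE right_t (k INTEGER, rv TEXT);
INSERT INTO left_t VALUES (1, 'l1'), (2, 'l2');
INSERT INTO right_t VALUES (2, 'r2'), (3, 'r3');
""")

emulated = fo_conn.execute("""
    SELECT a.k AS k, a.lv, b.rv
    FROM left_t a
    LEFT JOIN right_t b ON a.k = b.k
    UNION ALL
    SELECT b.k, a.lv, b.rv
    FROM right_t b
    LEFT JOIN left_t a ON a.k = b.k
    WHERE a.k IS NULL   -- anti-join: keep only right rows with no left match
    ORDER BY k
""").fetchall()

print(emulated)  # [(1, 'l1', None), (2, 'l2', 'r2'), (3, None, 'r3')]

# Native syntax, only attempted when the SQLite library supports it.
if sqlite3.sqlite_version_info >= (3, 39, 0):
    native = fo_conn.execute("""
        SELECT COALESCE(a.k, b.k) AS k, a.lv, b.rv
        FROM left_t a
        FULL OUTER JOIN right_t b ON a.k = b.k
        ORDER BY k
    """).fetchall()
else:
    native = emulated
```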


---

## 6. Subqueries and Nested Queries

A subquery is a query nested inside another query. Subqueries can be used in:
- `SELECT` clause (scalar subqueries)
- `FROM` clause (derived tables)
- `WHERE` clause (filtering)

### Types of Subqueries

- **Scalar subquery**: Returns a single value
- **Row subquery**: Returns a single row
- **Table subquery**: Returns multiple rows and columns
- **Correlated subquery**: References columns from the outer query
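
The choice between an uncorrelated and a correlated subquery can change results when NULLs are involved. The sketch below uses hypothetical `members` and `visits` tables to show a common interview pitfall. `NOT IN` against a subquery that returns a NULL matches nothing, while a correlated `NOT EXISTS` behaves as most people expect.

```python
import sqlite3

sub_conn = sqlite3.connect(":memory:")
sub_conn.executescript("""
CREATE TABLE members (member_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE visits (visit_id INTEGER PRIMARY KEY, member_id INTEGER);
INSERT INTO members VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cat');
INSERT INTO visits VALUES (100, 1), (101, NULL);  -- one visit has no member
""")

# Uncorrelated: the list contains NULL, so NOT IN is never true.
not_in_rows = sub_conn.execute("""
    SELECT name FROM members
    WHERE member_id NOT IN (SELECT member_id FROM visits)
    ORDER BY name
""").fetchall()

# Correlated: evaluated per outer row and unaffected by the NULL visit.
not_exists_rows = sub_conn.execute("""
    SELECT name FROM members m
    WHERE NOT EXISTS (
        SELECT 1 FROM visits v WHERE v.member_id = m.member_id
    )
    ORDER BY name
""").fetchall()

print(not_in_rows)      # []
print(not_exists_rows)  # [('Ben',), ('Cat',)]
```

Adding `WHERE member_id IS NOT NULL` inside the `NOT IN` subquery is another way to avoid the empty result.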

In [29]:
# Scalar subquery in SELECT
query = """
SELECT 
    product_name,
    price,
    (SELECT AVG(price) FROM products) AS avg_price,
    price - (SELECT AVG(price) FROM products) AS diff_from_avg
FROM products
ORDER BY diff_from_avg DESC;
"""
run_query(conn, query)

Unnamed: 0,product_name,price,avg_price,diff_from_avg
0,Laptop,999.99,348.99,651.0
1,Smartphone,699.99,348.99,351.0
2,Standing Desk,599.99,348.99,251.0
3,Monitor,399.99,348.99,51.0
4,Desk Chair,299.99,348.99,-49.0
5,Headphones,149.99,348.99,-199.0
6,Bookshelf,149.99,348.99,-199.0
7,Keyboard,79.99,348.99,-269.0
8,Lamp,59.99,348.99,-289.0
9,Mouse,49.99,348.99,-299.0


In [30]:
# Subquery in WHERE clause
query = """
SELECT product_name, price
FROM products
WHERE price > (SELECT AVG(price) FROM products)
ORDER BY price DESC;
"""
run_query(conn, query)

Unnamed: 0,product_name,price
0,Laptop,999.99
1,Smartphone,699.99
2,Standing Desk,599.99
3,Monitor,399.99


In [31]:
# Subquery with IN
query = """
SELECT 
    first_name || ' ' || last_name AS customer_name,
    city,
    country
FROM customers
WHERE customer_id IN (
    SELECT DISTINCT customer_id
    FROM orders
    WHERE total_amount > 1000
);
"""
run_query(conn, query)

Unnamed: 0,customer_name,city,country
0,Alice Smith,London,UK
1,Charlie Williams,Birmingham,UK


In [32]:
# Subquery with EXISTS
query = """
SELECT 
    first_name || ' ' || last_name AS customer_name,
    email
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
    AND o.status = 'Completed'
);
"""
run_query(conn, query)

Unnamed: 0,customer_name,email
0,Alice Smith,alice@email.com
1,Bob Johnson,bob@email.com
2,Charlie Williams,charlie@email.com
3,Henry Moore,henry@email.com


In [33]:
# Subquery in FROM clause (derived table)
query = """
SELECT 
    customer_summary.customer_name,
    customer_summary.total_orders,
    customer_summary.total_spent
FROM (
    SELECT 
        c.first_name || ' ' || c.last_name AS customer_name,
        COUNT(o.order_id) AS total_orders,
        COALESCE(SUM(o.total_amount), 0) AS total_spent
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, customer_name
) AS customer_summary
WHERE customer_summary.total_orders > 0
ORDER BY customer_summary.total_spent DESC;
"""
run_query(conn, query)

Unnamed: 0,customer_name,total_orders,total_spent
0,Alice Smith,3,2749.95
1,Charlie Williams,2,1799.95
2,Bob Johnson,2,1149.97
3,Eve Davis,1,999.99
4,Grace Wilson,1,659.98
5,Henry Moore,1,549.98
6,Diana Brown,1,149.99
7,Frank Miller,1,79.99


In [34]:
# Correlated subquery - Find products that are above their category average price
query = """
SELECT 
    product_name,
    category,
    price
FROM products p1
WHERE price > (
    SELECT AVG(price)
    FROM products p2
    WHERE p2.category = p1.category
)
ORDER BY category, price DESC;
"""
run_query(conn, query)

Unnamed: 0,product_name,category,price
0,Laptop,Electronics,999.99
1,Smartphone,Electronics,699.99
2,Monitor,Electronics,399.99
3,Standing Desk,Furniture,599.99
4,Desk Chair,Furniture,299.99


---

## 7. Common Table Expressions (CTEs)

CTEs provide a way to write auxiliary statements for use in a larger query. They make complex queries more readable and maintainable.

### Syntax

```sql
WITH cte_name AS (
    SELECT ...
)
SELECT ...
FROM cte_name;
```

### Benefits of CTEs

- **Readability**: Break complex queries into logical blocks
- **Reusability**: Reference the same subquery multiple times
- **Recursion**: CTEs can reference themselves (recursive CTEs)
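
The reusability point is easiest to see when one CTE is referenced twice in the same statement. In this sketch, `rep_totals` is defined once and used both as the outer source and inside a scalar subquery. The `sales` table and its values are hypothetical.

```python
import sqlite3

cte_conn = sqlite3.connect(":memory:")
cte_conn.execute("CREATE TABLE sales (rep TEXT, amount INTEGER)")
cte_conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("ann", 100), ("ann", 300), ("ben", 100), ("cat", 500)],
)

# rep_totals appears in the outer FROM and again in the AVG subquery.
above_average = cte_conn.execute("""
    WITH rep_totals AS (
        SELECT rep, SUM(amount) AS total
        FROM sales
        GROUP BY rep
    )
    SELECT rep, total
    FROM rep_totals
    WHERE total > (SELECT AVG(total) FROM rep_totals)
    ORDER BY rep
""").fetchall()

print(above_average)  # [('ann', 400), ('cat', 500)]
```

Without the CTE, the grouped query would have to be written out twice as a derived table.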

In [35]:
# Basic CTE
query = """
WITH high_value_orders AS (
    SELECT 
        order_id,
        customer_id,
        total_amount
    FROM orders
    WHERE total_amount > 500
)
SELECT 
    c.first_name || ' ' || c.last_name AS customer_name,
    hvo.order_id,
    hvo.total_amount
FROM high_value_orders hvo
JOIN customers c ON hvo.customer_id = c.customer_id
ORDER BY hvo.total_amount DESC;
"""
run_query(conn, query)

Unnamed: 0,customer_name,order_id,total_amount
0,Charlie Williams,4,1599.97
1,Alice Smith,10,1299.98
2,Alice Smith,1,1149.98
3,Eve Davis,6,999.99
4,Bob Johnson,2,699.99
5,Grace Wilson,9,659.98
6,Henry Moore,11,549.98


In [36]:
# Multiple CTEs
query = """
WITH 
customer_totals AS (
    SELECT 
        customer_id,
        SUM(total_amount) AS total_spent,
        COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
),
avg_spending AS (
    SELECT AVG(total_spent) AS avg_total_spent
    FROM customer_totals
)
SELECT 
    c.first_name || ' ' || c.last_name AS customer_name,
    ct.total_spent,
    ct.order_count,
    ROUND(avs.avg_total_spent, 2) AS avg_customer_spending,
    CASE 
        WHEN ct.total_spent > avs.avg_total_spent THEN 'Above Average'
        ELSE 'Below Average'
    END AS spending_category
FROM customer_totals ct
JOIN customers c ON ct.customer_id = c.customer_id
CROSS JOIN avg_spending avs
ORDER BY ct.total_spent DESC;
"""
run_query(conn, query)

Unnamed: 0,customer_name,total_spent,order_count,avg_customer_spending,spending_category
0,Alice Smith,2749.95,3,1017.48,Above Average
1,Charlie Williams,1799.95,2,1017.48,Above Average
2,Bob Johnson,1149.97,2,1017.48,Above Average
3,Eve Davis,999.99,1,1017.48,Below Average
4,Grace Wilson,659.98,1,1017.48,Below Average
5,Henry Moore,549.98,1,1017.48,Below Average
6,Diana Brown,149.99,1,1017.48,Below Average
7,Frank Miller,79.99,1,1017.48,Below Average


In [37]:
# CTE with aggregations - Product sales summary
query = """
WITH product_sales AS (
    SELECT 
        p.product_id,
        p.product_name,
        p.category,
        p.price,
        COALESCE(SUM(oi.quantity), 0) AS total_sold,
        COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS total_revenue
    FROM products p
    LEFT JOIN order_items oi ON p.product_id = oi.product_id
    GROUP BY p.product_id, p.product_name, p.category, p.price
)
SELECT 
    product_name,
    category,
    price,
    total_sold,
    total_revenue
FROM product_sales
ORDER BY total_revenue DESC;
"""
run_query(conn, query)

Unnamed: 0,product_name,category,price,total_sold,total_revenue
0,Laptop,Electronics,999.99,4,3999.96
1,Standing Desk,Furniture,599.99,2,1199.98
2,Monitor,Electronics,399.99,2,799.98
3,Smartphone,Electronics,699.99,1,699.99
4,Desk Chair,Furniture,299.99,2,599.98
5,Headphones,Electronics,149.99,3,449.97
6,Keyboard,Electronics,79.99,2,159.98
7,Mouse,Electronics,49.99,3,149.97
8,Lamp,Furniture,59.99,1,59.99
9,Bookshelf,Furniture,149.99,0,0.0


In [38]:
# Recursive CTE - Employee hierarchy
query = """
WITH RECURSIVE employee_hierarchy AS (
    -- Base case: Top-level managers (no manager)
    SELECT 
        employee_id,
        name,
        department,
        manager_id,
        0 AS level,
        name AS hierarchy_path
    FROM employees
    WHERE manager_id IS NULL
    
    UNION ALL
    
    -- Recursive case: Employees with managers
    SELECT 
        e.employee_id,
        e.name,
        e.department,
        e.manager_id,
        eh.level + 1,
        eh.hierarchy_path || ' -> ' || e.name
    FROM employees e
    INNER JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
)
SELECT 
    employee_id,
    name,
    department,
    level,
    hierarchy_path
FROM employee_hierarchy
ORDER BY department, level, name;
"""
run_query(conn, query)

Unnamed: 0,employee_id,name,department,level,hierarchy_path
0,1,Sarah Connor,Engineering,0,Sarah Connor
1,3,Emily Jones,Engineering,1,Sarah Connor -> Emily Jones
2,2,John Smith,Engineering,1,Sarah Connor -> John Smith
3,9,Lisa Martinez,Engineering,1,Sarah Connor -> Lisa Martinez
4,7,Anna Garcia,Marketing,0,Anna Garcia
5,8,Robert Wilson,Marketing,1,Anna Garcia -> Robert Wilson
6,4,Michael Brown,Sales,0,Michael Brown
7,6,David Lee,Sales,1,Michael Brown -> David Lee
8,5,Jessica White,Sales,1,Michael Brown -> Jessica White
9,10,James Taylor,Sales,2,Michael Brown -> Jessica White -> James Taylor


---

## 8. Window Functions

Window functions perform calculations across a set of rows related to the current row. Unlike `GROUP BY` aggregation, they keep every input row in the output instead of collapsing each group into a single row.

### Syntax

```sql
function_name() OVER (
    [PARTITION BY column1, column2, ...]
    [ORDER BY column3, column4, ...]
    [frame_clause]
)
```

### Common Window Functions

| Function | Description |
|----------|-------------|
| `ROW_NUMBER()` | Assigns unique sequential integers |
| `RANK()` | Assigns ranks with gaps for ties |
| `DENSE_RANK()` | Assigns ranks without gaps |
| `NTILE(n)` | Divides rows into n buckets |
| `LAG(column, n)` | Accesses data from n rows before |
| `LEAD(column, n)` | Accesses data from n rows after |
| `FIRST_VALUE()` | Returns first value in window |
| `LAST_VALUE()` | Returns last value in window |
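| `SUM() OVER (...)` | Sum over the window; a running total when the window has an `ORDER BY` |
| `AVG() OVER (...)` | Average over the window; a running average when the window has an `ORDER BY` |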
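
The optional frame clause controls which rows around the current row feed the calculation. As a sketch on a hypothetical `daily` table, the query below contrasts an explicit two-row moving average with an empty `OVER ()`, whose window is the entire result set.

```python
import sqlite3

frame_conn = sqlite3.connect(":memory:")
frame_conn.execute("CREATE TABLE daily (day INTEGER, value INTEGER)")
frame_conn.executemany(
    "INSERT INTO daily VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30), (4, 40)],
)

moving = frame_conn.execute("""
    SELECT
        day,
        value,
        -- Explicit frame: the current row and the one before it.
        AVG(value) OVER (
            ORDER BY day
            ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
        ) AS moving_avg_2,
        -- Empty OVER (): every row sees the whole result set.
        SUM(value) OVER () AS grand_total
    FROM daily
    ORDER BY day
""").fetchall()

for row in moving:
    print(row)
# (1, 10, 10.0, 100)
# (2, 20, 15.0, 100)
# (3, 30, 25.0, 100)
# (4, 40, 35.0, 100)
```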

In [39]:
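# ROW_NUMBER, RANK, and DENSE_RANK comparison
# (salaries in this table are all distinct, so the three columns come out identical;
#  they only diverge when rows tie on the ORDER BY value)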
query = """
SELECT 
    name,
    department,
    salary,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
    RANK() OVER (ORDER BY salary DESC) AS rank_num,
    DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank_num
FROM employees
ORDER BY salary DESC;
"""
run_query(conn, query)

Unnamed: 0,name,department,salary,row_num,rank_num,dense_rank_num
0,Sarah Connor,Engineering,120000,1,1,1
1,John Smith,Engineering,95000,2,2,2
2,Anna Garcia,Marketing,90000,3,3,3
3,Lisa Martinez,Engineering,88000,4,4,4
4,Emily Jones,Engineering,85000,5,5,5
5,Michael Brown,Sales,80000,6,6,6
6,Jessica White,Sales,75000,7,7,7
7,Robert Wilson,Marketing,72000,8,8,8
8,David Lee,Sales,70000,9,9,9
9,James Taylor,Sales,65000,10,10,10
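
Every salary in `employees` is distinct, so the three ranking columns above are identical. The functions only differ when rows tie. The sketch below uses a hypothetical `results` table in which two players share a score.

```python
import sqlite3

rank_conn = sqlite3.connect(":memory:")
rank_conn.execute("CREATE TABLE results (player TEXT, score INTEGER)")
rank_conn.executemany(
    "INSERT INTO results VALUES (?, ?)",
    [("a", 90), ("b", 80), ("c", 80), ("d", 70)],
)

ranked = rank_conn.execute("""
    SELECT
        player,
        -- player is added as a tie-breaker so row numbers are deterministic
        ROW_NUMBER() OVER (ORDER BY score DESC, player) AS row_num,
        RANK()       OVER (ORDER BY score DESC) AS rank_num,
        DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank_num
    FROM results
    ORDER BY score DESC, player
""").fetchall()

for row in ranked:
    print(row)
# ('a', 1, 1, 1)
# ('b', 2, 2, 2)
# ('c', 3, 2, 2)
# ('d', 4, 4, 3)
```

`RANK` skips to 4 after the tie, while `DENSE_RANK` continues with 3. `ROW_NUMBER` never repeats a value.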


In [40]:
# PARTITION BY - Rank within each department
query = """
SELECT 
    name,
    department,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank,
    RANK() OVER (ORDER BY salary DESC) AS overall_rank
FROM employees
ORDER BY department, dept_rank;
"""
run_query(conn, query)

Unnamed: 0,name,department,salary,dept_rank,overall_rank
0,Sarah Connor,Engineering,120000,1,1
1,John Smith,Engineering,95000,2,2
2,Lisa Martinez,Engineering,88000,3,4
3,Emily Jones,Engineering,85000,4,5
4,Anna Garcia,Marketing,90000,1,3
5,Robert Wilson,Marketing,72000,2,8
6,Michael Brown,Sales,80000,1,6
7,Jessica White,Sales,75000,2,7
8,David Lee,Sales,70000,3,9
9,James Taylor,Sales,65000,4,10
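
A frequent follow-up question is "top N per group". Window functions cannot be used directly in `WHERE`, so one common approach is to compute the rank in a CTE and filter in the outer query. The `staff` table below is a hypothetical stand-in for `employees`.

```python
import sqlite3

top_conn = sqlite3.connect(":memory:")
top_conn.execute("CREATE TABLE staff (name TEXT, dept TEXT, salary INTEGER)")
top_conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [
        ("a", "eng", 120), ("b", "eng", 100), ("c", "eng", 90),
        ("d", "ops", 80), ("e", "ops", 70),
    ],
)

# Rank inside the CTE, then filter on the rank in the outer query.
top_two = top_conn.execute("""
    WITH ranked AS (
        SELECT
            name, dept, salary,
            ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
        FROM staff
    )
    SELECT dept, name, salary
    FROM ranked
    WHERE rn <= 2
    ORDER BY dept, salary DESC
""").fetchall()

print(top_two)
# [('eng', 'a', 120), ('eng', 'b', 100), ('ops', 'd', 80), ('ops', 'e', 70)]
```

Swapping `ROW_NUMBER` for `RANK` or `DENSE_RANK` changes how ties at the cut-off are handled, which is worth stating explicitly in an interview answer.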


In [41]:
# LAG and LEAD - Compare with previous/next values
query = """
SELECT 
    order_id,
    order_date,
    total_amount,
    LAG(total_amount, 1) OVER (ORDER BY order_date) AS prev_order_amount,
    LEAD(total_amount, 1) OVER (ORDER BY order_date) AS next_order_amount,
    total_amount - LAG(total_amount, 1) OVER (ORDER BY order_date) AS diff_from_prev
FROM orders
ORDER BY order_date;
"""
run_query(conn, query)

Unnamed: 0,order_id,order_date,total_amount,prev_order_amount,next_order_amount,diff_from_prev
0,1,2024-01-10,1149.98,,699.99,
1,2,2024-01-15,699.99,1149.98,299.99,-449.99
2,3,2024-02-01,299.99,699.99,1599.97,-400.0
3,4,2024-02-14,1599.97,299.99,149.99,1299.98
4,5,2024-02-20,149.99,1599.97,999.99,-1449.98
5,6,2024-03-05,999.99,149.99,449.98,850.0
6,7,2024-03-10,449.98,999.99,79.99,-550.01
7,8,2024-03-15,79.99,449.98,659.98,-369.99
8,9,2024-03-20,659.98,79.99,1299.98,579.99
9,10,2024-03-25,1299.98,659.98,549.98,640.0
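
`LAG` and `LEAD` also accept an optional third argument that replaces the `NULL` returned at the edge of the window, and they can be combined with `PARTITION BY` to compare each row with the previous row of the same group. A minimal sketch, assuming a hypothetical `demo_orders` table rather than the notebook's `orders` table:

```python
import sqlite3

# Hypothetical data: two orders for customer 1, one order for customer 2.
lag_conn = sqlite3.connect(":memory:")
lag_conn.execute(
    "CREATE TABLE demo_orders (customer_id INTEGER, order_date TEXT, amount REAL)"
)
lag_conn.executemany(
    "INSERT INTO demo_orders VALUES (?, ?, ?)",
    [
        (1, "2024-01-05", 100.0),
        (1, "2024-02-10", 150.0),
        (2, "2024-01-20", 200.0),
    ],
)

lag_rows = lag_conn.execute(
    """
    SELECT
        customer_id,
        order_date,
        amount,
        -- Third argument: value to use when there is no previous row in the partition
        LAG(amount, 1, 0.0) OVER (
            PARTITION BY customer_id
            ORDER BY order_date
        ) AS prev_amount
    FROM demo_orders
    ORDER BY customer_id, order_date;
    """
).fetchall()
lag_conn.close()

for row in lag_rows:
    print(row)
```

Whether a default such as `0.0` is appropriate depends on the question being asked; for differences and percentage changes, leaving the first row as `NULL` is often the safer choice.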


In [42]:
# Running total, running count, and running average
query = """
SELECT 
    order_id,
    order_date,
    total_amount,
    SUM(total_amount) OVER (ORDER BY order_date) AS running_total,
    COUNT(*) OVER (ORDER BY order_date) AS running_count,
    ROUND(AVG(total_amount) OVER (ORDER BY order_date), 2) AS running_avg
FROM orders
ORDER BY order_date;
"""
run_query(conn, query)

Unnamed: 0,order_id,order_date,total_amount,running_total,running_count,running_avg
0,1,2024-01-10,1149.98,1149.98,1,1149.98
1,2,2024-01-15,699.99,1849.97,2,924.99
2,3,2024-02-01,299.99,2149.96,3,716.65
3,4,2024-02-14,1599.97,3749.93,4,937.48
4,5,2024-02-20,149.99,3899.92,5,779.98
5,6,2024-03-05,999.99,4899.91,6,816.65
6,7,2024-03-10,449.98,5349.89,7,764.27
7,8,2024-03-15,79.99,5429.88,8,678.74
8,9,2024-03-20,659.98,6089.86,9,676.65
9,10,2024-03-25,1299.98,7389.84,10,738.98
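
The running totals above are unambiguous because every `order_date` is unique. When the `ORDER BY` column has duplicates, the default frame (`RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`) treats tied rows as peers and gives them the same running value. The sketch below illustrates this with a hypothetical `demo_sales` table and shows one way to force a strict row-by-row total.

```python
import sqlite3

# Hypothetical data: two sales share the same date.
rt_conn = sqlite3.connect(":memory:")
rt_conn.execute(
    "CREATE TABLE demo_sales (sale_id INTEGER, sale_date TEXT, amount INTEGER)"
)
rt_conn.executemany(
    "INSERT INTO demo_sales VALUES (?, ?, ?)",
    [(1, "2024-01-01", 10), (2, "2024-01-01", 20), (3, "2024-01-02", 30)],
)

rt_rows = rt_conn.execute(
    """
    SELECT
        sale_id,
        sale_date,
        amount,
        -- Default RANGE frame: rows with the same sale_date are summed together
        SUM(amount) OVER (ORDER BY sale_date) AS range_total,
        -- Explicit ROWS frame plus a tie-breaker: one row at a time
        SUM(amount) OVER (
            ORDER BY sale_date, sale_id
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS rows_total
    FROM demo_sales
    ORDER BY sale_id;
    """
).fetchall()
rt_conn.close()

for row in rt_rows:
    print(row)
```

Both same-day sales show 30 in `range_total`, whereas `rows_total` steps through 10, 30, 60.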


In [43]:
# NTILE - Divide into quartiles
query = """
SELECT 
    product_name,
    price,
    NTILE(4) OVER (ORDER BY price) AS price_quartile
FROM products
ORDER BY price;
"""
run_query(conn, query)

Unnamed: 0,product_name,price,price_quartile
0,Mouse,49.99,1
1,Lamp,59.99,1
2,Keyboard,79.99,1
3,Headphones,149.99,2
4,Bookshelf,149.99,2
5,Desk Chair,299.99,2
6,Monitor,399.99,3
7,Standing Desk,599.99,3
8,Smartphone,699.99,4
9,Laptop,999.99,4
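
The quartiles above are uneven (3, 3, 2, 2) because 10 rows do not divide evenly by 4: `NTILE` hands the extra rows to the lowest-numbered buckets. A small sketch that counts bucket sizes, using a hypothetical `nums` table holding the integers 1 to 10:

```python
import sqlite3

# Hypothetical table with ten rows to split into four buckets.
ntile_conn = sqlite3.connect(":memory:")
ntile_conn.execute("CREATE TABLE nums (n INTEGER)")
ntile_conn.executemany("INSERT INTO nums VALUES (?)", [(i,) for i in range(1, 11)])

bucket_sizes = ntile_conn.execute(
    """
    WITH bucketed AS (
        SELECT n, NTILE(4) OVER (ORDER BY n) AS bucket
        FROM nums
    )
    SELECT bucket, COUNT(*) AS rows_in_bucket
    FROM bucketed
    GROUP BY bucket
    ORDER BY bucket;
    """
).fetchall()
ntile_conn.close()

print(bucket_sizes)
```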


In [44]:
# Running total per customer
query = """
SELECT 
    o.order_id,
    c.first_name || ' ' || c.last_name AS customer_name,
    o.order_date,
    o.total_amount,
    SUM(o.total_amount) OVER (
        PARTITION BY o.customer_id 
        ORDER BY o.order_date
    ) AS customer_running_total,
    ROW_NUMBER() OVER (
        PARTITION BY o.customer_id 
        ORDER BY o.order_date
    ) AS customer_order_num
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
ORDER BY customer_name, o.order_date;
"""
run_query(conn, query)

Unnamed: 0,order_id,customer_name,order_date,total_amount,customer_running_total,customer_order_num
0,1,Alice Smith,2024-01-10,1149.98,1149.98,1
1,3,Alice Smith,2024-02-01,299.99,1449.97,2
2,10,Alice Smith,2024-03-25,1299.98,2749.95,3
3,2,Bob Johnson,2024-01-15,699.99,699.99,1
4,7,Bob Johnson,2024-03-10,449.98,1149.97,2
5,4,Charlie Williams,2024-02-14,1599.97,1599.97,1
6,12,Charlie Williams,2024-04-10,199.98,1799.95,2
7,5,Diana Brown,2024-02-20,149.99,149.99,1
8,6,Eve Davis,2024-03-05,999.99,999.99,1
9,8,Frank Miller,2024-03-15,79.99,79.99,1


In [45]:
# Top N per group using window functions
query = """
WITH ranked_products AS (
    SELECT 
        product_name,
        category,
        price,
        ROW_NUMBER() OVER (PARTITION BY category ORDER BY price DESC) AS rank_in_category
    FROM products
)
SELECT 
    product_name,
    category,
    price,
    rank_in_category
FROM ranked_products
WHERE rank_in_category <= 2
ORDER BY category, rank_in_category;
"""
run_query(conn, query)

Unnamed: 0,product_name,category,price,rank_in_category
0,Laptop,Electronics,999.99,1
1,Smartphone,Electronics,699.99,2
2,Standing Desk,Furniture,599.99,1
3,Desk Chair,Furniture,299.99,2


In [46]:
# Percentage of total using window functions
query = """
SELECT 
    product_name,
    category,
    price,
    SUM(price) OVER (PARTITION BY category) AS category_total,
    SUM(price) OVER () AS grand_total,
    ROUND(100.0 * price / SUM(price) OVER (PARTITION BY category), 2) AS pct_of_category,
    ROUND(100.0 * price / SUM(price) OVER (), 2) AS pct_of_total
FROM products
ORDER BY category, price DESC;
"""
run_query(conn, query)

Unnamed: 0,product_name,category,price,category_total,grand_total,pct_of_category,pct_of_total
0,Laptop,Electronics,999.99,2379.94,3489.9,42.02,28.65
1,Smartphone,Electronics,699.99,2379.94,3489.9,29.41,20.06
2,Monitor,Electronics,399.99,2379.94,3489.9,16.81,11.46
3,Headphones,Electronics,149.99,2379.94,3489.9,6.3,4.3
4,Keyboard,Electronics,79.99,2379.94,3489.9,3.36,2.29
5,Mouse,Electronics,49.99,2379.94,3489.9,2.1,1.43
6,Standing Desk,Furniture,599.99,1109.96,3489.9,54.06,17.19
7,Desk Chair,Furniture,299.99,1109.96,3489.9,27.03,8.6
8,Bookshelf,Furniture,149.99,1109.96,3489.9,13.51,4.3
9,Lamp,Furniture,59.99,1109.96,3489.9,5.4,1.72
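
`FIRST_VALUE()` and `LAST_VALUE()` appear in the function table but not in the cells above. `LAST_VALUE()` is a common interview trap: with an `ORDER BY` and no explicit frame, the frame stops at the current row, so the function simply returns the current row's value. The sketch below uses a hypothetical `scores` table to contrast the default frame with an explicit full-partition frame.

```python
import sqlite3

# Hypothetical data for illustrating window frames.
frame_conn = sqlite3.connect(":memory:")
frame_conn.execute("CREATE TABLE scores (player TEXT, points INTEGER)")
frame_conn.executemany(
    "INSERT INTO scores VALUES (?, ?)", [("a", 10), ("b", 20), ("c", 30)]
)

frame_rows = frame_conn.execute(
    """
    SELECT
        player,
        points,
        FIRST_VALUE(points) OVER (ORDER BY points) AS lowest,
        -- Default frame ends at the current row
        LAST_VALUE(points) OVER (ORDER BY points) AS last_default_frame,
        -- Explicit frame spanning the whole partition
        LAST_VALUE(points) OVER (
            ORDER BY points
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS last_full_frame
    FROM scores
    ORDER BY points;
    """
).fetchall()
frame_conn.close()

for row in frame_rows:
    print(row)
```

Only `last_full_frame` returns 30 on every row; `last_default_frame` mirrors the `points` column.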


---

## 9. CASE Statements

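CASE statements (strictly speaking, expressions) provide conditional logic in SQL queries, similar to if-else chains in programming languages. The `WHEN` branches are evaluated in order and the first match wins; if nothing matches and there is no `ELSE`, the result is `NULL`.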

### Simple CASE Syntax

```sql
CASE expression
    WHEN value1 THEN result1
    WHEN value2 THEN result2
    ELSE default_result
END
```

### Searched CASE Syntax

```sql
CASE
    WHEN condition1 THEN result1
    WHEN condition2 THEN result2
    ELSE default_result
END
```
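
Two details of these forms are easy to miss: a simple `CASE` compares with `=`, so `WHEN NULL` never matches, and a `CASE` with no `ELSE` yields `NULL` when no branch matches. A minimal sketch, assuming a hypothetical `readings` table:

```python
import sqlite3

# Hypothetical data: one NULL, one small value, one large value.
case_conn = sqlite3.connect(":memory:")
case_conn.execute("CREATE TABLE readings (id INTEGER, x INTEGER)")
case_conn.executemany(
    "INSERT INTO readings VALUES (?, ?)", [(1, None), (2, 3), (3, 10)]
)

case_rows = case_conn.execute(
    """
    SELECT
        id,
        -- Simple CASE: x = NULL is never true, so NULL falls through to ELSE
        CASE x WHEN NULL THEN 'missing' ELSE 'present' END AS simple_case,
        -- Searched CASE: IS NULL detects the missing value
        CASE WHEN x IS NULL THEN 'missing' ELSE 'present' END AS searched_case,
        -- No ELSE: unmatched rows become NULL
        CASE WHEN x > 5 THEN 'big' END AS no_else
    FROM readings
    ORDER BY id;
    """
).fetchall()
case_conn.close()

for row in case_rows:
    print(row)
```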

In [47]:
# Simple CASE statement
query = """
SELECT 
    order_id,
    status,
    CASE status
        WHEN 'Completed' THEN 'Done'
        WHEN 'Shipped' THEN 'In Transit'
        WHEN 'Processing' THEN 'Being Prepared'
        WHEN 'Pending' THEN 'Awaiting Action'
        ELSE 'Unknown'
    END AS status_description
FROM orders;
"""
run_query(conn, query)

Unnamed: 0,order_id,status,status_description
0,1,Completed,Done
1,2,Completed,Done
2,3,Completed,Done
3,4,Completed,Done
4,5,Shipped,In Transit
5,6,Shipped,In Transit
6,7,Processing,Being Prepared
7,8,Processing,Being Prepared
8,9,Pending,Awaiting Action
9,10,Pending,Awaiting Action


In [48]:
# Searched CASE with conditions
query = """
SELECT 
    product_name,
    price,
    CASE
        WHEN price < 100 THEN 'Budget'
        WHEN price >= 100 AND price < 300 THEN 'Mid-Range'
        WHEN price >= 300 AND price < 700 THEN 'Premium'
        ELSE 'Luxury'
    END AS price_tier
FROM products
ORDER BY price;
"""
run_query(conn, query)

Unnamed: 0,product_name,price,price_tier
0,Mouse,49.99,Budget
1,Lamp,59.99,Budget
2,Keyboard,79.99,Budget
3,Headphones,149.99,Mid-Range
4,Bookshelf,149.99,Mid-Range
5,Desk Chair,299.99,Mid-Range
6,Monitor,399.99,Premium
7,Standing Desk,599.99,Premium
8,Smartphone,699.99,Premium
9,Laptop,999.99,Luxury


In [49]:
# CASE in aggregation - Pivot-like behaviour
query = """
SELECT 
    category,
    COUNT(*) AS total_products,
    SUM(CASE WHEN price < 100 THEN 1 ELSE 0 END) AS budget_products,
    SUM(CASE WHEN price >= 100 AND price < 500 THEN 1 ELSE 0 END) AS mid_range_products,
    SUM(CASE WHEN price >= 500 THEN 1 ELSE 0 END) AS premium_products
FROM products
GROUP BY category;
"""
run_query(conn, query)

Unnamed: 0,category,total_products,budget_products,mid_range_products,premium_products
0,Electronics,6,2,2,2
1,Furniture,4,1,2,1
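
The `SUM(CASE ... THEN 1 ELSE 0 END)` pattern has two common equivalents: `COUNT(CASE ... THEN 1 END)`, which relies on `COUNT` ignoring `NULL`, and the aggregate `FILTER` clause. The sketch below compares all three on a hypothetical `items` table; it assumes a SQLite build that supports `FILTER` on aggregates (3.30 or later), and not every database engine offers this clause.

```python
import sqlite3

# Hypothetical data: category A has one item under 100, category B has none.
agg_conn = sqlite3.connect(":memory:")
agg_conn.execute("CREATE TABLE items (category TEXT, price REAL)")
agg_conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("A", 50.0), ("A", 150.0), ("A", 600.0), ("B", 700.0)],
)

agg_rows = agg_conn.execute(
    """
    SELECT
        category,
        SUM(CASE WHEN price < 100 THEN 1 ELSE 0 END) AS via_sum_case,
        COUNT(CASE WHEN price < 100 THEN 1 END)      AS via_count_case,
        COUNT(*) FILTER (WHERE price < 100)          AS via_filter
    FROM items
    GROUP BY category
    ORDER BY category;
    """
).fetchall()
agg_conn.close()

for row in agg_rows:
    print(row)
```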


In [50]:
# CASE for conditional aggregation
query = """
SELECT 
    SUM(CASE WHEN status = 'Completed' THEN total_amount ELSE 0 END) AS completed_revenue,
    SUM(CASE WHEN status = 'Shipped' THEN total_amount ELSE 0 END) AS shipped_revenue,
    SUM(CASE WHEN status IN ('Processing', 'Pending') THEN total_amount ELSE 0 END) AS pending_revenue,
    SUM(total_amount) AS total_revenue
FROM orders;
"""
run_query(conn, query)

Unnamed: 0,completed_revenue,shipped_revenue,pending_revenue,total_revenue
0,4499.89,1149.98,2489.93,8139.8


In [51]:
# CASE in ORDER BY
query = """
SELECT 
    order_id,
    status,
    total_amount
FROM orders
ORDER BY 
    CASE status
        WHEN 'Pending' THEN 1
        WHEN 'Processing' THEN 2
        WHEN 'Shipped' THEN 3
        WHEN 'Completed' THEN 4
        ELSE 5
    END,
    total_amount DESC;
"""
run_query(conn, query)

Unnamed: 0,order_id,status,total_amount
0,10,Pending,1299.98
1,9,Pending,659.98
2,7,Processing,449.98
3,8,Processing,79.99
4,6,Shipped,999.99
5,5,Shipped,149.99
6,4,Completed,1599.97
7,1,Completed,1149.98
8,2,Completed,699.99
9,11,Completed,549.98


---

## 10. String Functions

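SQLite provides built-in functions for manipulating text data. Function names and behaviour vary between database engines, so check the equivalents for the dialect used in your interview.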

### Common String Functions in SQLite

| Function | Description |
|----------|-------------|
| `LENGTH()` | Returns the length of a string |
| `UPPER()` | Converts to uppercase |
| `LOWER()` | Converts to lowercase |
| `SUBSTR()` | Extracts a substring (1-based start; a negative start counts from the end) |
| `TRIM()` | Removes leading/trailing spaces |
| `LTRIM()` | Removes leading spaces |
| `RTRIM()` | Removes trailing spaces |
| `REPLACE()` | Replaces occurrences of a substring |
| `INSTR()` | Returns the 1-based position of a substring (0 if not found) |
| `||` | Concatenates strings |

In [52]:
# String concatenation and case conversion
query = """
SELECT 
    first_name || ' ' || last_name AS full_name,
    UPPER(first_name || ' ' || last_name) AS full_name_upper,
    LOWER(email) AS email_lower
FROM customers
WHERE email IS NOT NULL
LIMIT 5;
"""
run_query(conn, query)

Unnamed: 0,full_name,full_name_upper,email_lower
0,Alice Smith,ALICE SMITH,alice@email.com
1,Bob Johnson,BOB JOHNSON,bob@email.com
2,Charlie Williams,CHARLIE WILLIAMS,charlie@email.com
3,Diana Brown,DIANA BROWN,diana@email.com
4,Eve Davis,EVE DAVIS,eve@email.com


In [53]:
# LENGTH and SUBSTR
query = """
SELECT 
    product_name,
    LENGTH(product_name) AS name_length,
    SUBSTR(product_name, 1, 3) AS first_three_chars,
    SUBSTR(product_name, -3) AS last_three_chars
FROM products;
"""
run_query(conn, query)

Unnamed: 0,product_name,name_length,first_three_chars,last_three_chars
0,Laptop,6,Lap,top
1,Smartphone,10,Sma,one
2,Headphones,10,Hea,nes
3,Desk Chair,10,Des,air
4,Standing Desk,13,Sta,esk
5,Monitor,7,Mon,tor
6,Keyboard,8,Key,ard
7,Mouse,5,Mou,use
8,Bookshelf,9,Boo,elf
9,Lamp,4,Lam,amp


In [54]:
# REPLACE and INSTR
query = """
SELECT 
    email,
    REPLACE(email, '@email.com', '@company.com') AS new_email,
    INSTR(email, '@') AS at_position,
    SUBSTR(email, 1, INSTR(email, '@') - 1) AS username,
    SUBSTR(email, INSTR(email, '@') + 1) AS domain
FROM customers
WHERE email IS NOT NULL;
"""
run_query(conn, query)

Unnamed: 0,email,new_email,at_position,username,domain
0,alice@email.com,alice@company.com,6,alice,email.com
1,bob@email.com,bob@company.com,4,bob,email.com
2,charlie@email.com,charlie@company.com,8,charlie,email.com
3,diana@email.com,diana@company.com,6,diana,email.com
4,eve@email.com,eve@company.com,4,eve,email.com
5,frank@email.com,frank@company.com,6,frank,email.com
6,grace@email.com,grace@company.com,6,grace,email.com
7,henry@email.com,henry@company.com,6,henry,email.com
8,jack@email.com,jack@company.com,5,jack,email.com


In [55]:
# TRIM functions
query = """
SELECT 
    '  hello world  ' AS original,
    TRIM('  hello world  ') AS trimmed,
    LTRIM('  hello world  ') AS left_trimmed,
    RTRIM('  hello world  ') AS right_trimmed,
    LENGTH('  hello world  ') AS original_length,
    LENGTH(TRIM('  hello world  ')) AS trimmed_length;
"""
run_query(conn, query)

Unnamed: 0,original,trimmed,left_trimmed,right_trimmed,original_length,trimmed_length
0,hello world,hello world,hello world,hello world,15,11


In [56]:
# Pattern matching with LIKE (case-insensitive for ASCII characters in SQLite)
query = """
SELECT 
    product_name,
    category
FROM products
WHERE product_name LIKE '%o%'  -- Contains 'o'
   OR product_name LIKE 'L%'   -- Starts with 'L'
ORDER BY product_name;
"""
run_query(conn, query)

Unnamed: 0,product_name,category
0,Bookshelf,Furniture
1,Headphones,Electronics
2,Keyboard,Electronics
3,Lamp,Furniture
4,Laptop,Electronics
5,Monitor,Electronics
6,Mouse,Electronics
7,Smartphone,Electronics
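
SQLite also offers `GLOB`, which uses Unix-style wildcards (`*`, `?`, and character classes such as `[LM]`) and, unlike `LIKE`, is case-sensitive. The sketch below contrasts the two operators on a hypothetical `demo_products` table whose names differ only in case.

```python
import sqlite3

# Hypothetical data with mixed-case names.
like_conn = sqlite3.connect(":memory:")
like_conn.execute("CREATE TABLE demo_products (name TEXT)")
like_conn.executemany(
    "INSERT INTO demo_products VALUES (?)", [("Laptop",), ("lamp",), ("Monitor",)]
)


def matching_names(operator_sql: str, pattern: str) -> list[str]:
    """Return names matching the pattern, using a fixed LIKE or GLOB query."""
    cursor = like_conn.execute(operator_sql, (pattern,))
    return [row[0] for row in cursor]


LIKE_SQL = "SELECT name FROM demo_products WHERE name LIKE ? ORDER BY name"
GLOB_SQL = "SELECT name FROM demo_products WHERE name GLOB ? ORDER BY name"

like_matches = matching_names(LIKE_SQL, "l%")     # case-insensitive for ASCII
glob_matches = matching_names(GLOB_SQL, "l*")     # case-sensitive
glob_class = matching_names(GLOB_SQL, "[LM]*")    # starts with uppercase L or M
like_conn.close()

print(like_matches, glob_matches, glob_class)
```

Note that the default `ORDER BY` collation is also case-sensitive, which is why `Laptop` sorts before `lamp`.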


In [57]:
# COALESCE for handling NULL values
query = """
SELECT 
    first_name || ' ' || last_name AS customer_name,
    email,
    COALESCE(email, 'No email provided') AS email_with_default
FROM customers;
"""
run_query(conn, query)

Unnamed: 0,customer_name,email,email_with_default
0,Alice Smith,alice@email.com,alice@email.com
1,Bob Johnson,bob@email.com,bob@email.com
2,Charlie Williams,charlie@email.com,charlie@email.com
3,Diana Brown,diana@email.com,diana@email.com
4,Eve Davis,eve@email.com,eve@email.com
5,Frank Miller,frank@email.com,frank@email.com
6,Grace Wilson,grace@email.com,grace@email.com
7,Henry Moore,henry@email.com,henry@email.com
8,Ivy Taylor,,No email provided
9,Jack Anderson,jack@email.com,jack@email.com
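
`COALESCE` only replaces `NULL`; blank or whitespace-only strings pass through unchanged. Concatenation with `||` also returns `NULL` as soon as any operand is `NULL`. The sketch below illustrates both points on a hypothetical `contacts` table and shows one way to treat blank strings as missing with `NULLIF` and `TRIM`.

```python
import sqlite3

# Hypothetical data: a NULL email, a blank email, and a real one.
null_conn = sqlite3.connect(":memory:")
null_conn.execute("CREATE TABLE contacts (id INTEGER, first_name TEXT, email TEXT)")
null_conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [(1, "Ivy", None), (2, "Jack", "  "), (3, "Al", "al@example.com")],
)

null_rows = null_conn.execute(
    """
    SELECT
        first_name,
        COALESCE(email, 'No email provided') AS coalesce_only,
        -- TRIM then NULLIF turns blank strings into NULL before the fallback applies
        COALESCE(NULLIF(TRIM(email), ''), 'No email provided') AS blank_as_missing,
        -- Any NULL operand makes the whole concatenation NULL
        first_name || ' <' || email || '>' AS label
    FROM contacts
    ORDER BY id;
    """
).fetchall()
null_conn.close()

for row in null_rows:
    print(row)
```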


---

## 11. Date Functions

SQLite has no dedicated date or time type. Its date and time functions operate on ISO-8601 text (such as `2024-01-10`), Julian day numbers, or Unix timestamps.

### Date/Time Functions in SQLite

| Function | Description |
|----------|-------------|
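| `DATE()` | Returns the date as `YYYY-MM-DD` text |
| `TIME()` | Returns the time as `HH:MM:SS` text |
| `DATETIME()` | Returns date and time as `YYYY-MM-DD HH:MM:SS` text |
| `JULIANDAY()` | Returns the Julian day number (a fractional day count, useful for date differences) |
| `STRFTIME()` | Formats a date/time according to a format string |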

### STRFTIME Format Codes

| Code | Description |
|------|-------------|
| `%Y` | 4-digit year |
| `%m` | Month (01-12) |
| `%d` | Day of month (01-31) |
| `%H` | Hour (00-23) |
| `%M` | Minute (00-59) |
| `%S` | Second (00-59) |
| `%W` | Week of year (00-53; week 01 starts on the first Monday) |
| `%w` | Day of week (0-6, Sunday = 0) |
| `%j` | Day of year (001-366) |

In [58]:
# Current date and time ('now' is UTC; add the 'localtime' modifier for local time)
query = """
SELECT 
    DATE('now') AS current_date,
    TIME('now') AS current_time,
    DATETIME('now') AS current_datetime,
    DATE('now', 'localtime') AS local_date;
"""
run_query(conn, query)

Unnamed: 0,current_date,current_time,current_datetime,local_date
0,2026-01-25,14:28:39,2026-01-25 14:28:39,2026-01-25


In [59]:
# Extracting date parts with STRFTIME
query = """
SELECT 
    order_date,
    STRFTIME('%Y', order_date) AS year,
    STRFTIME('%m', order_date) AS month,
    STRFTIME('%d', order_date) AS day,
    STRFTIME('%W', order_date) AS week_of_year,
    STRFTIME('%w', order_date) AS day_of_week
FROM orders
ORDER BY order_date
LIMIT 5;
"""
run_query(conn, query)

Unnamed: 0,order_date,year,month,day,week_of_year,day_of_week
0,2024-01-10,2024,1,10,2,3
1,2024-01-15,2024,1,15,3,1
2,2024-02-01,2024,2,1,5,4
3,2024-02-14,2024,2,14,7,3
4,2024-02-20,2024,2,20,8,2
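
`STRFTIME` returns text, so a month comes back as `'01'` rather than the integer `1`. This matters in filters: comparing the result with an integer literal does not match in SQLite, whereas comparing with the zero-padded string, or casting first, does. A minimal sketch using literal dates only:

```python
import sqlite3

# No tables needed: the query works on a literal date.
part_conn = sqlite3.connect(":memory:")
date_parts = part_conn.execute(
    """
    SELECT
        STRFTIME('%m', '2024-01-10')                   AS month_text,
        CAST(STRFTIME('%m', '2024-01-10') AS INTEGER)  AS month_int,
        STRFTIME('%m', '2024-01-10') = 1               AS text_equals_int,   -- 0 (false)
        STRFTIME('%m', '2024-01-10') = '01'            AS text_equals_text,  -- 1 (true)
        STRFTIME('%w', '2024-01-10')                   AS day_of_week        -- Wednesday
    ;
    """
).fetchone()
part_conn.close()

print(date_parts)
```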


In [60]:
# Date arithmetic
query = """
SELECT 
    order_date,
    DATE(order_date, '+7 days') AS plus_7_days,
    DATE(order_date, '-1 month') AS minus_1_month,
    DATE(order_date, '+1 year') AS plus_1_year,
    DATE(order_date, 'start of month') AS start_of_month,
    DATE(order_date, 'start of month', '+1 month', '-1 day') AS end_of_month
FROM orders
LIMIT 5;
"""
run_query(conn, query)

Unnamed: 0,order_date,plus_7_days,minus_1_month,plus_1_year,start_of_month,end_of_month
0,2024-01-10,2024-01-17,2023-12-10,2025-01-10,2024-01-01,2024-01-31
1,2024-01-15,2024-01-22,2023-12-15,2025-01-15,2024-01-01,2024-01-31
2,2024-02-01,2024-02-08,2024-01-01,2025-02-01,2024-02-01,2024-02-29
3,2024-02-14,2024-02-21,2024-01-14,2025-02-14,2024-02-01,2024-02-29
4,2024-02-20,2024-02-27,2024-01-20,2025-02-20,2024-02-01,2024-02-29
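
Month arithmetic needs care near month ends. SQLite adds the offset to the month field and then normalises the result, so a day that does not exist in the target month rolls over into the following month. The sketch below shows this for 31 January 2024 and, as one alternative, computes the last day of the following month with the `start of month` modifier.

```python
import sqlite3

# No tables needed: the query works on a literal date.
month_conn = sqlite3.connect(":memory:")
month_end = month_conn.execute(
    """
    SELECT
        -- 2024-02-31 does not exist, so the result rolls over into March
        DATE('2024-01-31', '+1 month') AS naive_plus_1_month,
        -- First day of the month, forward two months, back one day
        DATE('2024-01-31', 'start of month', '+2 months', '-1 day') AS end_of_next_month
    ;
    """
).fetchone()
month_conn.close()

print(month_end)
```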


In [61]:
# Calculating date differences
query = """
SELECT 
    order_id,
    order_date,
    DATE('now') AS today,
    CAST(JULIANDAY('now') - JULIANDAY(order_date) AS INTEGER) AS days_since_order
FROM orders
ORDER BY order_date DESC
LIMIT 5;
"""
run_query(conn, query)

Unnamed: 0,order_id,order_date,today,days_since_order
0,12,2024-04-10,2026-01-25,655
1,11,2024-04-01,2026-01-25,664
2,10,2024-03-25,2026-01-25,671
3,9,2024-03-20,2026-01-25,676
4,8,2024-03-15,2026-01-25,681


In [62]:
# Grouping by date parts - Orders by month
query = """
SELECT 
    STRFTIME('%Y-%m', order_date) AS month,
    COUNT(*) AS order_count,
    SUM(total_amount) AS monthly_revenue
FROM orders
GROUP BY STRFTIME('%Y-%m', order_date)
ORDER BY month;
"""
run_query(conn, query)

Unnamed: 0,month,order_count,monthly_revenue
0,2024-01,2,1849.97
1,2024-02,3,2049.95
2,2024-03,5,3489.92
3,2024-04,2,749.96


In [63]:
# Customer tenure calculation (approximate: assumes 30-day months and 365-day years)
query = """
SELECT 
    first_name || ' ' || last_name AS customer_name,
    signup_date,
    DATE('now') AS today,
    CAST((JULIANDAY('now') - JULIANDAY(signup_date)) / 30 AS INTEGER) AS months_as_customer,
    CAST((JULIANDAY('now') - JULIANDAY(signup_date)) / 365 AS INTEGER) AS years_as_customer
FROM customers
ORDER BY signup_date;
"""
run_query(conn, query)

Unnamed: 0,customer_name,signup_date,today,months_as_customer,years_as_customer
0,Alice Smith,2023-01-15,2026-01-25,36,3
1,Bob Johnson,2023-02-20,2026-01-25,35,2
2,Charlie Williams,2023-03-10,2026-01-25,35,2
3,Diana Brown,2023-04-05,2026-01-25,34,2
4,Eve Davis,2023-05-12,2026-01-25,32,2
5,Frank Miller,2023-06-18,2026-01-25,31,2
6,Grace Wilson,2023-07-22,2026-01-25,30,2
7,Henry Moore,2023-08-30,2026-01-25,29,2
8,Ivy Taylor,2023-09-14,2026-01-25,28,2
9,Jack Anderson,2023-10-01,2026-01-25,28,2
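
Dividing a day count by 30 only approximates months. When whole calendar months are needed, one common convention is to combine the year and month differences and subtract one if the day of the month has not yet been reached. The sketch below compares both approaches on a hypothetical `demo_customers` table, using a fixed reference date (the `:as_of` parameter is an assumption for reproducibility) instead of `'now'`.

```python
import sqlite3

# Hypothetical data: two signup dates, evaluated as of a fixed date.
tenure_conn = sqlite3.connect(":memory:")
tenure_conn.execute("CREATE TABLE demo_customers (signup_date TEXT)")
tenure_conn.executemany(
    "INSERT INTO demo_customers VALUES (?)", [("2023-01-15",), ("2023-08-30",)]
)

tenure_rows = tenure_conn.execute(
    """
    SELECT
        signup_date,
        -- Approximation: days elapsed divided by 30
        CAST((JULIANDAY(:as_of) - JULIANDAY(signup_date)) / 30 AS INTEGER) AS approx_months,
        -- Whole calendar months: year and month difference, minus 1 if the
        -- day of month in :as_of is earlier than the signup day of month
        (CAST(STRFTIME('%Y', :as_of) AS INTEGER)
            - CAST(STRFTIME('%Y', signup_date) AS INTEGER)) * 12
        + (CAST(STRFTIME('%m', :as_of) AS INTEGER)
            - CAST(STRFTIME('%m', signup_date) AS INTEGER))
        - (STRFTIME('%d', :as_of) < STRFTIME('%d', signup_date)) AS whole_months
    FROM demo_customers
    ORDER BY signup_date;
    """,
    {"as_of": "2026-01-25"},
).fetchall()
tenure_conn.close()

for row in tenure_rows:
    print(row)
```

For the second customer the approximation reports 29 months, while only 28 full calendar months have elapsed between 30 August 2023 and 25 January 2026.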


In [64]:
# Filtering by date ranges
query = """
SELECT 
    order_id,
    order_date,
    total_amount
FROM orders
WHERE order_date >= '2024-01-01'
  AND order_date < '2024-03-01'  -- half-open range covers all of Jan-Feb, including 29 Feb 2024
ORDER BY order_date;
"""
run_query(conn, query)

Unnamed: 0,order_id,order_date,total_amount
0,1,2024-01-10,1149.98
1,2,2024-01-15,699.99
2,3,2024-02-01,299.99
3,4,2024-02-14,1599.97
4,5,2024-02-20,149.99
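
Date filters on text columns are string comparisons, which becomes visible once values carry a time component: `'2024-01-31 15:30:00'` sorts after `'2024-01-31'`, so a `BETWEEN` with a date-only upper bound silently drops it. The sketch below uses a hypothetical `events` table to compare `BETWEEN` with a half-open range.

```python
import sqlite3

# Hypothetical data: timestamps stored as ISO-8601 text.
range_conn = sqlite3.connect(":memory:")
range_conn.execute("CREATE TABLE events (event_ts TEXT)")
range_conn.executemany(
    "INSERT INTO events VALUES (?)",
    [("2024-01-15 09:00:00",), ("2024-01-31 15:30:00",), ("2024-02-01 00:00:00",)],
)

# BETWEEN with date-only bounds misses the afternoon of 31 January.
between_rows = [
    row[0]
    for row in range_conn.execute(
        "SELECT event_ts FROM events "
        "WHERE event_ts BETWEEN '2024-01-01' AND '2024-01-31' "
        "ORDER BY event_ts"
    )
]

# Half-open range: inclusive start, exclusive start of the next month.
half_open_rows = [
    row[0]
    for row in range_conn.execute(
        "SELECT event_ts FROM events "
        "WHERE event_ts >= '2024-01-01' AND event_ts < '2024-02-01' "
        "ORDER BY event_ts"
    )
]
range_conn.close()

print(between_rows)
print(half_open_rows)
```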


---

## 12. Practice Questions

Now it's time to test your SQL skills! Below are practice questions ranging from basic to advanced. Each question has a hidden solution; try to solve it yourself before revealing the answer.

---

### Question 1: Basic SELECT and Filtering

**Task:** Find all customers from the UK who signed up in 2023. Display their full name, city, and signup date. Order the results by signup date (most recent first).

In [None]:
# Write your solution here
query = """

"""
# run_query(conn, query)

<details>
<summary>Click to reveal answer</summary>

```sql
SELECT 
    first_name || ' ' || last_name AS full_name,
    city,
    signup_date
FROM customers
WHERE country = 'UK'
  AND signup_date BETWEEN '2023-01-01' AND '2023-12-31'
ORDER BY signup_date DESC;
```

</details>

---

### Question 2: Aggregation with GROUP BY

**Task:** Calculate the total revenue, average order value, and number of orders for each order status. Only include statuses that have generated more than 500 in total revenue. Round the average to 2 decimal places.

In [None]:
# Write your solution here
query = """

"""
# run_query(conn, query)

<details>
<summary>Click to reveal answer</summary>

```sql
SELECT 
    status,
    COUNT(*) AS order_count,
    SUM(total_amount) AS total_revenue,
    ROUND(AVG(total_amount), 2) AS avg_order_value
FROM orders
GROUP BY status
HAVING SUM(total_amount) > 500
ORDER BY total_revenue DESC;
```

</details>

---

### Question 3: JOIN with Multiple Tables

**Task:** Create a report showing each customer's name, total number of orders, and total amount spent. Include customers who haven't placed any orders (showing 0 for their counts). Order by total spent descending.

In [None]:
# Write your solution here
query = """

"""
# run_query(conn, query)

<details>
<summary>Click to reveal answer</summary>

```sql
SELECT 
    c.first_name || ' ' || c.last_name AS customer_name,
    COUNT(o.order_id) AS total_orders,
    COALESCE(SUM(o.total_amount), 0) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, customer_name
ORDER BY total_spent DESC;
```

</details>

---

### Question 4: Subquery with EXISTS

**Task:** Find all products that have never been ordered. Display the product name, category, and price.

In [None]:
# Write your solution here
query = """

"""
# run_query(conn, query)

<details>
<summary>Click to reveal answer</summary>

```sql
SELECT 
    product_name,
    category,
    price
FROM products p
WHERE NOT EXISTS (
    SELECT 1
    FROM order_items oi
    WHERE oi.product_id = p.product_id
);
```

</details>

---

### Question 5: Window Function - Ranking

**Task:** Rank all employees by salary within their department. Show the employee name, department, salary, and their rank within the department. Also show their overall company-wide rank.

In [None]:
# Write your solution here
query = """

"""
# run_query(conn, query)

<details>
<summary>Click to reveal answer</summary>

```sql
SELECT 
    name,
    department,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank,
    RANK() OVER (ORDER BY salary DESC) AS company_rank
FROM employees
ORDER BY department, dept_rank;
```

</details>

---

### Question 6: CTE with Running Total

**Task:** Using a CTE, calculate the running total of revenue by order date. Show the order date, daily revenue, and cumulative revenue up to that date.

In [None]:
# Write your solution here
query = """

"""
# run_query(conn, query)

<details>
<summary>Click to reveal answer</summary>

```sql
WITH daily_revenue AS (
    SELECT 
        order_date,
        SUM(total_amount) AS daily_total
    FROM orders
    GROUP BY order_date
)
SELECT 
    order_date,
    daily_total,
    SUM(daily_total) OVER (ORDER BY order_date) AS cumulative_revenue
FROM daily_revenue
ORDER BY order_date;
```

</details>

---

### Question 7: CASE Statement with Aggregation

**Task:** Create a summary report that shows, for each product category, the count of products in each price tier (Budget: <100, Mid-Range: 100-500, Premium: >500).

In [None]:
# Write your solution here
query = """

"""
# run_query(conn, query)

<details>
<summary>Click to reveal answer</summary>

```sql
SELECT 
    category,
    SUM(CASE WHEN price < 100 THEN 1 ELSE 0 END) AS budget_count,
    SUM(CASE WHEN price >= 100 AND price <= 500 THEN 1 ELSE 0 END) AS mid_range_count,
    SUM(CASE WHEN price > 500 THEN 1 ELSE 0 END) AS premium_count,
    COUNT(*) AS total_products
FROM products
GROUP BY category;
```

</details>

---

### Question 8: Top N Per Group

**Task:** Find the top 2 highest-priced products in each category. Display the product name, category, price, and rank within category.

In [None]:
# Write your solution here
query = """

"""
# run_query(conn, query)

<details>
<summary>Click to reveal answer</summary>

```sql
WITH ranked_products AS (
    SELECT 
        product_name,
        category,
        price,
        ROW_NUMBER() OVER (PARTITION BY category ORDER BY price DESC) AS price_rank
    FROM products
)
SELECT 
    product_name,
    category,
    price,
    price_rank
FROM ranked_products
WHERE price_rank <= 2
ORDER BY category, price_rank;
```

</details>

---

### Question 9: Self-Join with Hierarchy

**Task:** Create a report showing each employee, their manager's name, and how many direct reports their manager has. Order by department and employee name.

In [None]:
# Write your solution here
query = """

"""
# run_query(conn, query)

<details>
<summary>Click to reveal answer</summary>

```sql
WITH manager_report_counts AS (
    SELECT 
        manager_id,
        COUNT(*) AS direct_reports
    FROM employees
    WHERE manager_id IS NOT NULL
    GROUP BY manager_id
)
SELECT 
    e.name AS employee_name,
    e.department,
    COALESCE(m.name, 'No Manager') AS manager_name,
    COALESCE(mrc.direct_reports, 0) AS manager_direct_reports
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id
LEFT JOIN manager_report_counts mrc ON m.employee_id = mrc.manager_id
ORDER BY e.department, e.name;
```

</details>

---

### Question 10: Complex Query - Customer Analysis

**Task:** Create a comprehensive customer analysis report that shows:
- Customer name
- Total orders
- Total spent
- Average order value
- Days since first order
- Days since last order
- Customer tier (VIP: >2000 total spent, Regular: 500-2000, New: <500)

Only include customers who have placed at least one order.

In [None]:
# Write your solution here
query = """

"""
# run_query(conn, query)

<details>
<summary>Click to reveal answer</summary>

```sql
WITH customer_orders AS (
    SELECT 
        c.customer_id,
        c.first_name || ' ' || c.last_name AS customer_name,
        COUNT(o.order_id) AS total_orders,
        SUM(o.total_amount) AS total_spent,
        ROUND(AVG(o.total_amount), 2) AS avg_order_value,
        MIN(o.order_date) AS first_order_date,
        MAX(o.order_date) AS last_order_date
    FROM customers c
    INNER JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, customer_name
)
SELECT 
    customer_name,
    total_orders,
    total_spent,
    avg_order_value,
    CAST(JULIANDAY('now') - JULIANDAY(first_order_date) AS INTEGER) AS days_since_first_order,
    CAST(JULIANDAY('now') - JULIANDAY(last_order_date) AS INTEGER) AS days_since_last_order,
    CASE 
        WHEN total_spent > 2000 THEN 'VIP'
        WHEN total_spent >= 500 THEN 'Regular'
        ELSE 'New'
    END AS customer_tier
FROM customer_orders
ORDER BY total_spent DESC;
```

</details>

---

### Question 11: LAG/LEAD Analysis

**Task:** For each order, show the order details along with the previous order's amount and the percentage change from the previous order. Order by order date.

In [None]:
# Write your solution here
query = """

"""
# run_query(conn, query)

<details>
<summary>Click to reveal answer</summary>

```sql
SELECT 
    order_id,
    order_date,
    total_amount,
    LAG(total_amount) OVER (ORDER BY order_date) AS prev_order_amount,
    ROUND(
        100.0 * (total_amount - LAG(total_amount) OVER (ORDER BY order_date)) 
        / LAG(total_amount) OVER (ORDER BY order_date), 
        2
    ) AS pct_change
FROM orders
ORDER BY order_date;
```

</details>

---

### Question 12: Complex Multi-Table Query

**Task:** Create a product performance report showing:
- Product name and category
- Total quantity sold
- Total revenue generated
- Number of unique customers who purchased
- Average quantity per order
- Rank by revenue within category

Include products with at least one sale.

In [None]:
# Write your solution here
query = """

"""
# run_query(conn, query)

<details>
<summary>Click to reveal answer</summary>

```sql
WITH product_metrics AS (
    SELECT 
        p.product_id,
        p.product_name,
        p.category,
        SUM(oi.quantity) AS total_quantity_sold,
        SUM(oi.quantity * oi.unit_price) AS total_revenue,
        COUNT(DISTINCT o.customer_id) AS unique_customers,
        ROUND(AVG(oi.quantity), 2) AS avg_quantity_per_order
    FROM products p
    INNER JOIN order_items oi ON p.product_id = oi.product_id
    INNER JOIN orders o ON oi.order_id = o.order_id
    GROUP BY p.product_id, p.product_name, p.category
)
SELECT 
    product_name,
    category,
    total_quantity_sold,
    total_revenue,
    unique_customers,
    avg_quantity_per_order,
    RANK() OVER (PARTITION BY category ORDER BY total_revenue DESC) AS revenue_rank_in_category
FROM product_metrics
ORDER BY category, revenue_rank_in_category;
```

</details>

---

## Additional Resources

For further practice and learning, consider these resources:

- [DataCamp - Top SQL Interview Questions](https://www.datacamp.com/blog/top-sql-interview-questions-and-answers-for-beginners-and-intermediate-practitioners)
- [DataLemur - SQL Interview Questions](https://datalemur.com/questions)
- [GeeksforGeeks - SQL Interview Questions](https://www.geeksforgeeks.org/sql/sql-interview-questions/)
- [StrataScratch - SQL Practice](https://www.stratascratch.com/)
- [SQLite Window Functions Documentation](https://sqlite.org/windowfunctions.html)
- [Big Tech Interviews - SQL Window Functions Guide](https://bigtechinterviews.com/sql-window-functions/)

---

## Summary

This notebook covered the essential SQL concepts frequently tested in data science interviews:

1. **Basic Queries**: SELECT, WHERE, ORDER BY, DISTINCT, LIMIT
2. **Aggregation**: COUNT, SUM, AVG, MIN, MAX
3. **Grouping**: GROUP BY, HAVING (remember: WHERE filters rows, HAVING filters groups)
4. **JOINs**: INNER, LEFT, RIGHT, FULL OUTER, CROSS, SELF
5. **Subqueries**: Scalar, table, correlated, EXISTS/NOT EXISTS
6. **CTEs**: Readable, reusable query blocks; recursive CTEs for hierarchies
7. **Window Functions**: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, NTILE, running aggregations
8. **CASE Statements**: Conditional logic in queries
9. **String Functions**: Concatenation, substring, pattern matching
10. **Date Functions**: Extraction, arithmetic, formatting

**Key Interview Tips:**
- Always clarify requirements before writing queries
- Consider edge cases (NULL values, empty results)
- Optimise for readability first, performance second
- Practice explaining your thought process
- Know when to use CTEs vs subqueries vs JOINs

Good luck with your interview preparation!

In [None]:
# Close the database connection when done
conn.close()
print("Database connection closed.")