# SQL Aggregation & Grouping


## **SQL Environment Setup (do not edit)**

In [1]:
# @title

%%capture
!mkdir -p notebook_lib
!wget -q -O notebook_lib/sql_runner.py \
  https://raw.githubusercontent.com/Haross/sql_notebook/8021f5c05b7d973b8db549a1398a3c9a5c7829d5/notebook_lib/sql_runner.py
!wget -q -O notebook_lib/validators.py \
  https://raw.githubusercontent.com/Haross/sql_notebook/7baff2c6485cdf641cabcdb55d92a51317cd18b9/notebook_lib/validators.py

from notebook_lib.sql_runner import make_sql_runner
from notebook_lib.validators import make_df_validator_nospoilers, check_process_rules

import sqlite3
import pandas as pd
from pathlib import Path


In [2]:
# @title

DB_FILE = 'class.db'

if DB_FILE != ":memory:":
    Path(DB_FILE).unlink(missing_ok=True)

conn = sqlite3.connect(DB_FILE)
conn.execute("PRAGMA foreign_keys = ON;")

conn.executescript(r'''
DROP TABLE IF EXISTS employee;

CREATE TABLE employee (
    department   TEXT,
    first_name   TEXT NOT NULL,
    last_name    TEXT NOT NULL,
    year         INTEGER,
    salary       INTEGER,
    position     TEXT
);

INSERT INTO employee (department, first_name, last_name, year, salary, position) VALUES
('IT','Olivia','Pearson',2011,3000,'Trainee'),
('IT','Olivia','Pearson',2012,3000,'Trainee'),
('IT','Olivia','Pearson',2012,4200,'Junior Developer'),
('IT','Olivia','Pearson',2013,4900,'Junior Developer'),
('IT','Olivia','Pearson',2014,8100,'Senior Developer'),

('Management','Jack','Johnson',2011,4300,'Junior Project Manager'),
('Management','Jack','Johnson',2012,5100,'Project Manager'),
('Management','Jack','Johnson',2013,7200,'Senior Project Manager'),
('Management','Jack','Johnson',2014,7600,'Senior Project Manager'),
('Management','Jack','Johnson',2015,9500,'Head of Department'),

('IT','Harry','Taylor',2015,2700,'Trainee'),

('Human Resources','Lily','Bennett',2013,1900,'Junior HR Specialist'),
('Human Resources','Lily','Bennett',2014,2300,'HR Specialist'),
('Human Resources','Lily','Bennett',2015,3650,'Senior HR Specialist'),

('Accounting','Charlie','Johnson',2010,2000,'Junior Accountant'),
('Accounting','Charlie','Johnson',2011,2000,'Junior Accountant'),
('Accounting','Charlie','Johnson',2012,2500,'Accountant'),
('Accounting','Charlie','Johnson',2013,3200,'Accountant'),
('Accounting','Charlie','Johnson',2014,3700,'Senior Accountant'),
('Accounting','Charlie','Johnson',2015,4200,'Senior Accountant'),

('IT','Jacob','King',2013,3400,'Trainee'),
('IT','Jacob','King',2014,4100,'Junior Developer'),
('IT','Jacob','King',2015,5900,'Developer'),

('Accounting','Jessica','Poole',2014,3800,'Senior Accountant'),
('Accounting','Jessica','Poole',2015,4300,'Senior Accountant'),

('Management','Ethan','Black',2013,5100,'Project Manager'),
('Management','Ethan','Black',2014,5900,'Project Manager'),
('Management','Ethan','Black',2015,6300,'Senior Project Manager'),

('Help Desk','Ella','Watson',2013,1400,'Trainee'),
('Help Desk','Ella','Watson',2014,1900,'Customer Service Assistant'),
('Help Desk','Ella','Watson',2015,2300,'Customer Service Assistant'),

('Human Resources','Sophia','Hunt',2011,2100,'HR Junior Specialist'),

('Marketing','Amelia','Wright',2014,2100,'Trainee'),
('Marketing','Amelia','Wright',2015,2300,'Junior SEO Specialist'),

('Marketing','Lucy','Green',2013,2000,'Trainee'),

('Marketing','Ruby','Chapman',2012,2500,'Trainee'),
('Marketing','Ruby','Chapman',2013,3400,'Junior SEO Specialist'),
('Marketing','Ruby','Chapman',2014,3900,'SEO Specialist'),
('Marketing','Ruby','Chapman',2015,5400,'Senior SEO Specialist'),

(NULL,'Amie','Walker',NULL,NULL,NULL),

('Help Desk','Brian','Murphy',2012,1500,'Trainee'),
('Help Desk','Brian','Murphy',2013,2000,'Customer Service Assistant'),
('Help Desk','Brian','Murphy',2014,2500,'Customer Service Assistant'),
('Help Desk','Brian','Murphy',2015,3700,'Customer Service Specialist'),

('Management','Eva','Saunders',2011,2100,'Trainee'),
('Management','Eva','Saunders',2012,4100,'Junior Project Manager'),
('Management','Eva','Saunders',2013,4600,'Junior Project Manager'),
('Management','Eva','Saunders',2014,5300,'Project Manager'),
('Management','Eva','Saunders',2015,6100,'Senior Project Manager');
''')
print(f"Database ready ✅ ({DB_FILE})")


Database ready ✅ (class.db)


# SQL Basics: Aggregations & Grouping

## Count the rows

So far, your queries have returned **rows of data**.

But SQL can do more than retrieve information: it can also **compute statistics** across many rows at once.

This type of operation is called **aggregation**.

Let’s start with the simplest example:

```sql
SELECT COUNT(*)
FROM orders;
```

Here we introduce something new: `COUNT()`.

> `COUNT()` is a **function**, and in SQL a function call like this one is written with parentheses.

The value inside the parentheses tells the function what to operate on.

* `COUNT(*)` → count all rows
* `*` means “everything”

So instead of returning the actual orders, this query returns just **one number**:
the total number of rows in the `orders` table.

No details, just the summary.
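To see this run end to end, here is a minimal sketch in Python using the standard `sqlite3` module. The `orders` table is not part of this notebook’s database, so the sketch builds a tiny in-memory stand-in; the connection name `demo`, the column layout, and the four rows are made-up assumptions for illustration.

```python
import sqlite3

# Hypothetical toy data in a separate in-memory database,
# so the notebook's own `conn` is left untouched.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total_sum REAL)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100, 20.0), (2, 100, 35.5), (3, 101, 12.0), (4, None, 8.0)],
)

# COUNT(*) collapses the whole table into one row holding one number.
row_count = demo.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(row_count)  # 4
```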


In [3]:
# @title Practice 1
base_practice_1 = make_df_validator_nospoilers(
    expected_hash='9d2b32f656572fd947be6dc3849aa8ea3ded535b68561b695ee25e13e295883d',
    required_cols=['COUNT(*)'],
    expected_rows=1,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_1 = base_practice_1

make_sql_runner(
    conn,
    runner_id="practice_1",
    description_md='### Practice 1\nCount all rows in the table employee\n',
    validator=val_practice_1,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## Count the rows, ignore the NULLs

Naturally, the asterisk (`*`) isn’t the only option you can use with `COUNT()`.

Instead of counting all rows, you can count the values in a **specific column**:

```sql
SELECT COUNT(customer_id)
FROM orders;
```

So what’s the difference?

### `COUNT(*)`

> Counts **every row** in the table.

### `COUNT(customer_id)`

> Counts only rows where `customer_id` is **not NULL**.

In other words, `COUNT(column)` automatically **ignores NULL values**.

If a row has a missing value in that column, it won’t be included in the count.


Think of it like this:

> * `COUNT(*)` → count rows
> * `COUNT(column)` → count non-NULL values
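The difference is easiest to see side by side. The following sketch assumes a hypothetical four-row `orders` table (built in memory with Python’s `sqlite3`; the name `demo` and the values are illustrative only) in which one order has no `customer_id`.

```python
import sqlite3

# Hypothetical toy data, separate from the notebook's `conn`.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
# Order 4 has no customer: Python's None is stored as SQL NULL.
demo.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 100), (2, 100), (3, 101), (4, None)],
)

all_rows, with_customer = demo.execute(
    "SELECT COUNT(*), COUNT(customer_id) FROM orders"
).fetchone()
print(all_rows, with_customer)  # 4 3
```

The row with the NULL `customer_id` is still a row, so `COUNT(*)` includes it, while `COUNT(customer_id)` skips it.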


In [4]:
# @title Practice 2
base_practice_2 = make_df_validator_nospoilers(
    expected_hash='c58d55ba30f669b0af4788f5d06b631da7010b019230fc4738a6f099944255e7',
    required_cols=['non_null_no'],
    expected_rows=1,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_2 = base_practice_2

make_sql_runner(
    conn,
    runner_id="practice_2",
    description_md='### Practice 2\nCount the non-NULL values in the column **position** of the table employee. Name the column **non_null_no**.\n',
    validator=val_practice_2,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## Count distinct values in a column

Great. And just like before, we can combine `COUNT()` with `DISTINCT`.

```sql
SELECT COUNT(DISTINCT customer_id) AS distinct_customers
FROM orders;
```

Now we’re not counting rows.

We’re counting **unique values**.

This query tells the database:

* look at `customer_id`
* remove duplicates
* then count what remains

The result is the number of **different customers** who have placed at least one order.

So even if one customer placed 5 (or 50) orders, they are counted **only once**.

---

Think of it as:

* `COUNT(*)` → how many rows?
* `COUNT(column)` → how many non-NULL values?
* `COUNT(DISTINCT column)` → how many unique non-NULL values?
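All three variants can be compared in one query. This is a small sketch under assumptions: the `orders` table, its four rows, and the connection name `demo` are invented for the example and live in a throwaway in-memory SQLite database.

```python
import sqlite3

# Hypothetical toy data, separate from the notebook's `conn`.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 100), (2, 100), (3, 101), (4, None)],
)

query = """
SELECT COUNT(*)                    AS all_rows,
       COUNT(customer_id)          AS non_null_values,
       COUNT(DISTINCT customer_id) AS distinct_customers
FROM orders
"""
counts = demo.execute(query).fetchone()
print(counts)  # (4, 3, 2)
```

Customer `100` appears twice but is counted once by `COUNT(DISTINCT ...)`, and the NULL is ignored by both column-based counts.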


In [5]:
# @title Practice 3
base_practice_3 = make_df_validator_nospoilers(
    expected_hash='725a43a341e086d7ef351a3990f0a45aecffc0f9b3bee0a9903efd284b3a088d',
    required_cols=['distinct_positions'],
    expected_rows=1,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_3 = base_practice_3

make_sql_runner(
    conn,
    runner_id="practice_3",
    description_md='### Practice 3\nCount how many different positions there are in the table employee. Name the column distinct_positions.\n',
    validator=val_practice_3,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## Find the minimum and maximum value

Nice work so far. Let’s add two more useful aggregation functions to your toolkit.

Sometimes you don’t need every row; you just want the **smallest** or **largest** value.

For example:

```sql
SELECT MIN(total_sum)
FROM orders;
```

`MIN(total_sum)` returns the smallest value in the `total_sum` column.

In our case, that is the total of the **cheapest order** in the table.

Simple and very practical.

There’s also the opposite function: `MAX()`.

```sql
SELECT MAX(total_sum)
FROM orders;
```

`MAX(total_sum)` returns the largest value: the total of the **most expensive order**.


These functions are perfect when you need quick insights like:

* lowest salary
* highest sale
* earliest date
* latest shipment
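Here is a short sketch of both functions at work, including the “earliest date” case. It assumes a hypothetical `orders` table whose dates are stored as ISO-8601 text (`YYYY-MM-DD`), a format in which alphabetical order and chronological order coincide; the table, the rows, and the name `demo` are illustrative assumptions.

```python
import sqlite3

# Hypothetical toy data, separate from the notebook's `conn`.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (order_date TEXT, total_sum REAL)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2019-03-01", 20.0), ("2019-01-15", 35.5), ("2019-02-10", 12.0)],
)

# Several aggregates can sit side by side in one SELECT.
cheapest, priciest, earliest = demo.execute(
    "SELECT MIN(total_sum), MAX(total_sum), MIN(order_date) FROM orders"
).fetchone()
print(cheapest, priciest, earliest)  # 12.0 35.5 2019-01-15
```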


In [6]:
# @title Practice 4
base_practice_4 = make_df_validator_nospoilers(
    expected_hash='ef7e938aa91b83197f740f6bf965c24bdeec576e626255d745fa7569fd43773f',
    required_cols=['MAX(salary)'],
    expected_rows=1,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_4 = base_practice_4

make_sql_runner(
    conn,
    runner_id="practice_4",
    description_md='### Practice 4\nSelect the highest salary from the table employee.\n',
    validator=val_practice_4,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## Find the average value

So now you know how to find the minimum and maximum.

But what if you want something more representative: not the extremes, but a **typical value**?

That’s where `AVG()` comes in.

```sql
SELECT AVG(total_sum)
FROM orders
WHERE customer_id = 100;
```

`AVG()` calculates the **average (mean)** of the specified column.

Here’s what happens:

1. `WHERE` filters the rows → only orders from customer `100`
2. `AVG()` computes the average → across those orders

The result is the **average order value** for that customer.

Very useful for questions like:

* What’s the average salary?
* What’s the average purchase amount?
* What’s the average delivery time?
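The two steps (filter, then average) can be sketched as follows. The `orders` rows and the connection name `demo` are assumptions made up for this example; customer `100` has two orders and customer `101` has one.

```python
import sqlite3

# Hypothetical toy data, separate from the notebook's `conn`.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (customer_id INTEGER, total_sum REAL)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(100, 20.0), (100, 40.0), (101, 12.0)],
)

# WHERE keeps only customer 100's rows; AVG then averages 20.0 and 40.0.
avg_100 = demo.execute(
    "SELECT AVG(total_sum) FROM orders WHERE customer_id = 100"
).fetchone()[0]
print(avg_100)  # 30.0
```

The order from customer `101` never reaches `AVG()`, because `WHERE` removed it first.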


In [7]:
# @title Practice 5
base_practice_5 = make_df_validator_nospoilers(
    expected_hash='bec47f5fa46176eaa9c57179c62fb87b491828489b65417f3c5ca10018dc6262',
    required_cols=['AVG(salary)'],
    expected_rows=1,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_5 = base_practice_5

make_sql_runner(
    conn,
    runner_id="practice_5",
    description_md='### Practice 5\nFind the average salary in the table employee for the year 2013.\n',
    validator=val_practice_5,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## Find the sum

Great! One last aggregation function completes the set: `SUM()`.

```sql
SELECT SUM(total_sum)
FROM orders
WHERE customer_id = 100;
```

`SUM()` adds together all values in the specified column.

Here’s what happens:

1. `WHERE` selects only orders from customer `100`
2. `SUM()` adds their `total_sum` values

The result is the **total amount of money** this customer has spent across all their orders.

---

This function is especially useful for questions like:

* total sales
* total revenue
* total expenses
* total hours worked
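A minimal sketch of `SUM()` with a filter, again on a made-up `orders` table (the rows and the name `demo` are assumptions). It also shows one detail that is easy to trip over in SQLite: when the filter matches no rows, `SUM()` returns NULL rather than 0.

```python
import sqlite3

# Hypothetical toy data, separate from the notebook's `conn`.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (customer_id INTEGER, total_sum REAL)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(100, 20.0), (100, 40.0), (101, 12.0)],
)

sum_sql = "SELECT SUM(total_sum) FROM orders WHERE customer_id = ?"
total_100 = demo.execute(sum_sql, (100,)).fetchone()[0]
total_999 = demo.execute(sum_sql, (999,)).fetchone()[0]  # no such customer

print(total_100)  # 60.0
print(total_999)  # None (SQL NULL)
```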


In [8]:
# @title Practice 6
base_practice_6 = make_df_validator_nospoilers(
    expected_hash='3957a0e80ec3b1a198feaa3ed23867612f83ba218b7ef7d08a5844ee1e0d4e16',
    required_cols=['SUM(salary)'],
    expected_rows=1,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_6 = base_practice_6

make_sql_runner(
    conn,
    runner_id="practice_6",
    description_md='### Practice 6\nFind the sum of all salaries in the Marketing department in 2014. Remember to put the department name in quotes!\n',
    validator=val_practice_6,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## Group the rows and count them

So far, our aggregation functions worked on **all rows at once**.

But often we don’t want one single number for the whole table.

We want statistics **per group**.

For example:
- orders per customer
- employees per department
- sales per year

This is where `GROUP BY` comes in.

```sql
SELECT
  customer_id,
  COUNT(*)
FROM orders
GROUP BY customer_id;
```

Take a look at the following table, which illustrates the query:

<img src="https://raw.githubusercontent.com/Haross/DB_pics_nt/main/group_and_count_example.png" width="60%">


Here’s what happens:

1. `GROUP BY customer_id` → rows are grouped by customer
2. all orders from the same customer are collected together
3. `COUNT(*)` → counts how many rows each group contains

Instead of one total count, we now get **one row per customer**, together with the number of orders they placed.

So the result looks something like:

customer_id → number_of_orders

In other words:
**split → then aggregate**.
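The split-then-aggregate idea can be tried on a handful of rows. In this sketch the `orders` table, its six rows, the alias `number_of_orders`, and the connection name `demo` are all assumptions for illustration; the `ORDER BY` is only there to make the output order predictable.

```python
import sqlite3

# Hypothetical toy data, separate from the notebook's `conn`.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 100), (2, 101), (3, 100), (4, 102), (5, 100), (6, 102)],
)

per_customer = demo.execute("""
    SELECT customer_id, COUNT(*) AS number_of_orders
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
print(per_customer)  # [(100, 3), (101, 1), (102, 2)]
```

Six input rows become three output rows: one per distinct `customer_id`.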




In [9]:
# @title Practice 7
base_practice_7 = make_df_validator_nospoilers(
    expected_hash='30d336f8e2f4c4dbd6e7888ef1328f436712fd8738f27a08893be0fc1c20e4b3',
    required_cols=['department', 'employees_no'],
    expected_rows=6,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_7 = base_practice_7

make_sql_runner(
    conn,
    runner_id="practice_7",
    description_md='### Practice 7\nFind the number of employees in each department in the year 2013. Show the department name together with the number of employees. Name the second column employees_no.\n',
    validator=val_practice_7,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## Find minimum and maximum values in groups

Excellent! And remember, `COUNT(*)` isn’t the only function you can use with `GROUP BY`.

In fact, **any aggregation function** can be applied per group.

Take a look:

```sql
SELECT
  customer_id,
  MAX(total_sum)
FROM orders
GROUP BY customer_id;
```

We simply replaced `COUNT(*)` with `MAX(total_sum)`.

So what changes?

Instead of counting orders, SQL now:

1. groups rows by `customer_id`
2. finds the highest `total_sum` inside each group

The result shows **each customer together with their most expensive order**.

Much more informative than a single overall maximum.
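As a quick illustration, the sketch below computes both the smallest and the largest order total per customer. The `orders` rows and the name `demo` are invented for the example, and `ORDER BY` is added only so the rows come back in a fixed order.

```python
import sqlite3

# Hypothetical toy data, separate from the notebook's `conn`.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (customer_id INTEGER, total_sum REAL)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(100, 20.0), (100, 35.5), (101, 12.0), (101, 50.0), (102, 7.5)],
)

extremes = demo.execute("""
    SELECT customer_id, MIN(total_sum), MAX(total_sum)
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
print(extremes)  # [(100, 20.0, 35.5), (101, 12.0, 50.0), (102, 7.5, 7.5)]
```

Customer `102` has a single order, so its minimum and maximum are the same value.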


In [10]:
# @title Practice 8
base_practice_8 = make_df_validator_nospoilers(
    expected_hash='d00f43fa8e41f16dc530b4fcd9cb23bdb0c0c2d9df8a0cb8aeaa6f4e48dd287e',
    required_cols=['department', 'MIN(salary)', 'MAX(salary)'],
    expected_rows=6,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_8 = base_practice_8

make_sql_runner(
    conn,
    runner_id="practice_8",
    description_md='### Practice 8\nShow all departments together with their lowest and highest salary in 2014.\n',
    validator=val_practice_8,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## Find the average value in groups

Nice! Let’s build on the same idea one more time.

Just like with `COUNT()` or `MAX()`, we can also use `AVG()` together with `GROUP BY`.

```sql
SELECT
  customer_id,
  AVG(total_sum)
FROM orders
WHERE order_date >= '2019-01-01'
  AND order_date < '2020-01-01'
GROUP BY customer_id;
```

Let’s break it down:

1. `WHERE` → keep only orders placed in 2019
2. `GROUP BY customer_id` → group orders by customer
3. `AVG(total_sum)` → compute the average inside each group

So instead of one global average, we now get **one average per customer**.

The result tells us:

👉 *What was the typical order value for each customer in 2019?*

This pattern is extremely common in real-world analysis, for example:

* average spending per customer
* average salary per department
* average sales per month

**Filter → group → aggregate.**
That’s the pattern to remember.
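The pattern can be traced on a few rows. This sketch assumes a hypothetical `orders` table with ISO-formatted text dates; two of the five rows fall outside 2019 and should not influence the averages. All names and values are illustrative.

```python
import sqlite3

# Hypothetical toy data, separate from the notebook's `conn`.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT, total_sum REAL)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (100, "2018-12-31", 100.0),  # filtered out: before 2019
        (100, "2019-01-01", 10.0),
        (100, "2019-06-15", 30.0),
        (101, "2019-03-03", 50.0),
        (101, "2020-01-01", 500.0),  # filtered out: after 2019
    ],
)

avg_2019 = demo.execute("""
    SELECT customer_id, AVG(total_sum)
    FROM orders
    WHERE order_date >= '2019-01-01'
      AND order_date < '2020-01-01'
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
print(avg_2019)  # [(100, 20.0), (101, 50.0)]
```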


In [17]:
# @title Practice 9
base_practice_9 = make_df_validator_nospoilers(
    expected_hash='becd258a92f17502ba157e5ce48ef8a47cfa1003e3f4380e12a9446a71b744cc',
    required_cols=['department', 'AVG(salary)'],
    expected_rows=6,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_9 = base_practice_9

make_sql_runner(
    conn,
    runner_id="practice_9",
    description_md='### Practice 9\nFor each department find the average salary in 2015.\n',
    validator=val_practice_9,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## Group by multiple columns

Nice work! Let’s take `GROUP BY` one step further.

Just like `ORDER BY`, grouping isn’t limited to a single column.  
You can group by **multiple columns at the same time** to create more detailed summaries.

Imagine this situation:

Some customers place many orders every day, and we want to know the **total amount spent per customer per day**.

```sql
SELECT
  customer_id,
  order_date,
  SUM(total_sum)
FROM orders
GROUP BY customer_id, order_date;
```

Take a look at the following table, which illustrates the query:

<img src="https://raw.githubusercontent.com/Haross/DB_pics_nt/main/group_and_count_multiple_Fields_example.png" width="60%">

Here’s what SQL does:

1. group rows by `customer_id`
2. then split again by `order_date`
3. calculate `SUM(total_sum)` inside each group

So instead of one total per customer, we now get **one total per customer per day**.

This gives us a much more detailed view of activity.
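A small runnable sketch of grouping by two columns, assuming a made-up `orders` table in which customer `100` orders twice on the same day (table, rows, and the name `demo` are illustrative assumptions):

```python
import sqlite3

# Hypothetical toy data, separate from the notebook's `conn`.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT, total_sum REAL)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (100, "2019-01-01", 10.0),
        (100, "2019-01-01", 15.0),
        (100, "2019-01-02", 5.0),
        (101, "2019-01-01", 7.0),
    ],
)

daily_totals = demo.execute("""
    SELECT customer_id, order_date, SUM(total_sum)
    FROM orders
    GROUP BY customer_id, order_date
    ORDER BY customer_id, order_date
""").fetchall()
for row in daily_totals:
    print(row)
# (100, '2019-01-01', 25.0)
# (100, '2019-01-02', 5.0)
# (101, '2019-01-01', 7.0)
```

Each distinct (`customer_id`, `order_date`) pair forms its own group, so customer `100` now appears in two rows.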


### Important rule

When using `GROUP BY`, every column in `SELECT` must be:

✅ either included in `GROUP BY`  
✅ or wrapped in an aggregation function (`SUM`, `COUNT`, `AVG`, etc.)

Why?

Because once rows are grouped, SQL needs **one single value per group**.

For example:
- `customer_id` → same for the whole group ✔️  
- `order_date` → same for the whole group ✔️  
- `SUM(total_sum)` → computed value ✔️  
- `ship_date` → could be different inside the group ❌  

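If you tried selecting `ship_date`, the database wouldn’t know **which one to choose**. Strict engines such as PostgreSQL raise an error. SQLite, which this notebook uses, does not: it silently returns `ship_date` from an arbitrary row of the group, which is rarely what you want.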
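It is worth checking how the engine behind this notebook treats that last case. The sketch below (Python `sqlite3`, with a made-up `orders` table and a connection named `demo`, both assumptions for illustration) selects the bare column `ship_date`. SQLite accepts the query and fills the column from some row of each group, so treat the rule as something to follow yourself rather than something SQLite enforces.

```python
import sqlite3

# Hypothetical toy data, separate from the notebook's `conn`.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE orders (customer_id INTEGER, order_date TEXT, ship_date TEXT, total_sum REAL)"
)
demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (100, "2019-01-01", "2019-01-03", 10.0),
        (100, "2019-01-01", "2019-01-05", 15.0),
    ],
)

# ship_date is neither grouped nor aggregated; SQLite runs the query anyway.
rows = demo.execute("""
    SELECT customer_id, order_date, ship_date, SUM(total_sum)
    FROM orders
    GROUP BY customer_id, order_date
""").fetchall()

# One group, one row; which of the two ship dates appears is not guaranteed.
print(rows)
```

If you actually need a ship date per group, say which one you mean, for example `MIN(ship_date)` or `MAX(ship_date)`.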

---

### Mental model

Think of it like:

> **more grouping columns → smaller, more specific groups → more detailed summaries**


In [12]:
# @title Practice 10
base_practice_10 = make_df_validator_nospoilers(
    expected_hash='c1993ddc08865c86603a06e0263dd5bc1b2d3f0b148bf6e2a8f1f49ab0fe3257',
    required_cols=['last_name', 'first_name', 'AVG(salary)'],
    expected_rows=16,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_10 = base_practice_10

make_sql_runner(
    conn,
    runner_id="practice_10",
    description_md='### Practice 10\nFind the average salary for each employee. Show the last name, the first name, and the average salary. Group the table by the last name and the first name.\n',
    validator=val_practice_10,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## Filter groups

So far, we’ve learned how to:
- filter rows → `WHERE`
- group rows → `GROUP BY`
- compute statistics → `SUM`, `COUNT`, `AVG`, …

But what if we want to **filter the groups themselves**?

For example:
👉 only show customers whose **daily total** exceeds $2,000.

For that, SQL provides a special keyword: `HAVING`.

```sql
SELECT
  customer_id,
  order_date,
  SUM(total_sum)
FROM orders
GROUP BY customer_id, order_date
HAVING SUM(total_sum) > 2000;
```

Here’s the logic:

1. `GROUP BY` → create groups (customer + day)
2. `SUM(total_sum)` → calculate totals for each group
3. `HAVING` → keep only groups where the total is greater than 2000

So instead of filtering individual orders, we’re filtering **summaries**.

Only the high-spending days remain.

### WHERE vs HAVING (important!)

A simple rule to remember:

- `WHERE` → filters **rows** (before grouping)
- `HAVING` → filters **groups** (after grouping)
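To make the difference concrete, the sketch below runs two similar-looking queries on the same made-up data (the `orders` rows and the name `demo` are assumptions). One filters groups with `HAVING`; the other filters individual rows with `WHERE` before any grouping happens.

```python
import sqlite3

# Hypothetical toy data, separate from the notebook's `conn`.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT, total_sum REAL)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (100, "2019-01-01", 1500.0),
        (100, "2019-01-01", 800.0),
        (100, "2019-01-02", 300.0),
        (101, "2019-01-01", 2500.0),
    ],
)

# HAVING: sum each customer-day first, then keep totals above 2000.
big_days = demo.execute("""
    SELECT customer_id, order_date, SUM(total_sum)
    FROM orders
    GROUP BY customer_id, order_date
    HAVING SUM(total_sum) > 2000
    ORDER BY customer_id
""").fetchall()

# WHERE: drop single orders of 1000 or less first, then sum what is left.
big_orders = demo.execute("""
    SELECT customer_id, order_date, SUM(total_sum)
    FROM orders
    WHERE total_sum > 1000
    GROUP BY customer_id, order_date
    ORDER BY customer_id
""").fetchall()

print(big_days)    # [(100, '2019-01-01', 2300.0), (101, '2019-01-01', 2500.0)]
print(big_orders)  # [(100, '2019-01-01', 1500.0), (101, '2019-01-01', 2500.0)]
```

Customer `100` reaches 2300 on 2019-01-01 only when both orders are summed, which is why that day survives `HAVING` with its full total but shows just 1500 in the `WHERE` version.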



### Clause order matters

SQL clauses must appear in this syntax order:
```sql
-- Syntax order (how you must write it)
SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT / OFFSET
```
But the database generally evaluates them in this logical execution order:

```sql
SELECT        -- (5) Choose output columns + aggregates (after grouping/filtering)
FROM          -- (1) Build the working set (tables + joins)
WHERE         -- (2) Filter rows (before grouping)
GROUP BY      -- (3) Form groups ("buckets") for aggregation
HAVING        -- (4) Filter groups ("buckets") after aggregation
ORDER BY      -- (6) Sort the final result set
LIMIT/OFFSET  -- (7) Return a window of rows (pagination)
```

> Why this matters: the syntax order is fixed (for example, `HAVING` must come after `GROUP BY`), and the execution order explains why an aggregate such as `SUM()` can be used in `HAVING` but not in `WHERE`: when `WHERE` runs, the groups don’t exist yet.


Keeping this structure in mind will save you many debugging headaches.
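One of those headaches can be reproduced in a few lines. This sketch assumes a made-up `orders` table and a connection named `demo`; it puts an aggregate into `WHERE`, which SQLite rejects, and then moves the same condition into `HAVING`, where it works. The exact wording of the error message may vary between SQLite versions, so the sketch only captures it rather than relying on it.

```python
import sqlite3

# Hypothetical toy data, separate from the notebook's `conn`.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (customer_id INTEGER, total_sum REAL)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(100, 1500.0), (100, 800.0), (101, 300.0)],
)

# WHERE is evaluated before groups exist, so an aggregate there is an error.
try:
    demo.execute(
        "SELECT customer_id FROM orders WHERE SUM(total_sum) > 2000 GROUP BY customer_id"
    )
    where_error = None
except sqlite3.OperationalError as exc:
    where_error = str(exc)
print("WHERE attempt failed:", where_error)

# HAVING is evaluated after grouping, so the same condition is valid there.
having_rows = demo.execute(
    "SELECT customer_id FROM orders GROUP BY customer_id HAVING SUM(total_sum) > 2000"
).fetchall()
print(having_rows)  # [(100,)]
```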




Take a look at the following picture which illustrates the SQL order of execution:

<img src="https://raw.githubusercontent.com/Haross/DB_pics_nt/main/SQL_order_execution_illustrated.png" width="60%">

In [13]:
# @title Practice 11
base_practice_11 = make_df_validator_nospoilers(
    expected_hash='15c3c0f58ab585bbd298bb8c424313836fb2430c0f85de605a3ea809f3bdaa1a',
    required_cols=['last_name', 'first_name', 'years'],
    expected_rows=10,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_11 = base_practice_11

make_sql_runner(
    conn,
    runner_id="practice_11",
    description_md='### Practice 11\nFind the employees who have spent more than 2 years in the company. Select their last name and first name together with the number of years worked (name this column years).\n',
    validator=val_practice_11,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



In [14]:
# @title Practice 12
base_practice_12 = make_df_validator_nospoilers(
    expected_hash='817430c2a9c2f0cfd9591520ad96b11aea45eb1f6046a6c3c06cf769f0814dc8',
    required_cols=['department', 'AVG(salary)'],
    expected_rows=2,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_12 = base_practice_12

make_sql_runner(
    conn,
    runner_id="practice_12",
    description_md='### Practice 12\nFind the departments where the average salary in 2012 was higher than $3,000. Show the department name with the average salary.\n',
    validator=val_practice_12,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## Order groups

Correct! One last useful trick before we wrap up.

Just like regular rows, **groups can also be sorted**.

After computing aggregated values, you can use `ORDER BY` to control how those grouped results appear.

```sql
SELECT
  customer_id,
  order_date,
  SUM(total_sum)
FROM orders
GROUP BY customer_id, order_date
ORDER BY SUM(total_sum) DESC;
```

Here’s what happens:

1. `GROUP BY` → create groups (customer + day)
2. `SUM(total_sum)` → calculate totals per group
3. `ORDER BY` → sort those totals

This time we sort by `SUM(total_sum)` in descending order (`DESC`),
so the **highest daily totals appear first**.

In other words, the biggest spenders rise to the top.


### Tip

You can sort by:
- an aggregation → `ORDER BY SUM(total_sum)`
- a selected column → `ORDER BY customer_id`
- or even an alias → `ORDER BY daily_total`

Example:

```sql
SELECT
  customer_id,
  order_date,
  SUM(total_sum) AS daily_total
FROM orders
GROUP BY customer_id, order_date
ORDER BY daily_total DESC;
```
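The alias version can be run as a quick sketch. The `orders` rows and the connection name `demo` are made up for illustration; the query itself is the one shown above.

```python
import sqlite3

# Hypothetical toy data, separate from the notebook's `conn`.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT, total_sum REAL)")
demo.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (100, "2019-01-01", 10.0),
        (100, "2019-01-01", 15.0),
        (100, "2019-01-02", 5.0),
        (101, "2019-01-01", 7.0),
    ],
)

ranked = demo.execute("""
    SELECT customer_id, order_date, SUM(total_sum) AS daily_total
    FROM orders
    GROUP BY customer_id, order_date
    ORDER BY daily_total DESC
""").fetchall()
for row in ranked:
    print(row)
# (100, '2019-01-01', 25.0)
# (101, '2019-01-01', 7.0)
# (100, '2019-01-02', 5.0)
```

Sorting by the alias works because `ORDER BY` is evaluated after `SELECT`, so the name `daily_total` already exists at that point.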

---

### Mental model

**group → aggregate → filter (HAVING) → sort**

This is the typical workflow for analytical SQL queries.


In [15]:
# @title Practice 13
base_practice_13 = make_df_validator_nospoilers(
    expected_hash='01c5ec5966c8d53477d7a1338dd4f852d001a6ccd2c13b43555e4b326af67f46',
    required_cols=['last_name', 'first_name', 'SUM(salary)'],
    expected_rows=16,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_13 = base_practice_13

make_sql_runner(
    conn,
    runner_id="practice_13",
    description_md='### Practice 13\nSort the employees by their total salary (the sum of all their salaries). The highest values should appear first. Show the last name, the first name, and the sum.\n',
    validator=val_practice_13,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



In [19]:
# @title In-class exercise 1
make_sql_runner(
    conn,
    runner_id="in_class_1",
    description_md="### In-class exercise 1\nShow the columns last_name and first_name from the table employee together with each person's **average salary** and the number of years they have worked in the company.\n\nUse the following aliases: average_salary for each person's average salary and years_worked for the number of years worked in the company. Show only employees **who have spent more than 2 years in the company**. Order the results by **average salary** in descending order.\n",
    validator=None,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)

