# SQL Ordering

So far, you’ve learned how to retrieve data from one table, combine multiple tables with joins, and filter exactly the rows you need.

But getting the *right* data is only half the job.

In real analysis, **how the results are arranged matters just as much**, whether you're ranking sales, finding the newest records, or sorting customers alphabetically.

In this notebook, you’ll learn how to control the order of your results with `ORDER BY` and how to remove duplicate rows with `DISTINCT`.

Let’s get started.


## **SQL Environment Setup (do not edit)**

In [1]:
# @title

%%capture
!mkdir -p notebook_lib
!wget -q -O notebook_lib/sql_runner.py \
  https://raw.githubusercontent.com/Haross/sql_notebook/main/notebook_lib/sql_runner.py
!wget -q -O notebook_lib/validators.py \
  https://raw.githubusercontent.com/Haross/sql_notebook/main/notebook_lib/validators.py

from notebook_lib.sql_runner import make_sql_runner
from notebook_lib.validators import make_df_validator_nospoilers, check_process_rules

import sqlite3
import pandas as pd
from pathlib import Path


In [2]:
# @title

DB_FILE = 'class.db'

if DB_FILE != ":memory:":
    Path(DB_FILE).unlink(missing_ok=True)

conn = sqlite3.connect(DB_FILE)
conn.execute("PRAGMA foreign_keys = ON;")

conn.executescript(r'''
DROP TABLE IF EXISTS employee;

CREATE TABLE employee (
    department   TEXT,
    first_name   TEXT NOT NULL,
    last_name    TEXT NOT NULL,
    year         INTEGER,
    salary       INTEGER,
    position     TEXT
);

INSERT INTO employee (department, first_name, last_name, year, salary, position) VALUES
('IT','Olivia','Pearson',2011,3000,'Trainee'),
('IT','Olivia','Pearson',2012,3000,'Trainee'),
('IT','Olivia','Pearson',2012,4200,'Junior Developer'),
('IT','Olivia','Pearson',2013,4900,'Junior Developer'),
('IT','Olivia','Pearson',2014,8100,'Senior Developer'),

('Management','Jack','Johnson',2011,4300,'Junior Project Manager'),
('Management','Jack','Johnson',2012,5100,'Project Manager'),
('Management','Jack','Johnson',2013,7200,'Senior Project Manager'),
('Management','Jack','Johnson',2014,7600,'Senior Project Manager'),
('Management','Jack','Johnson',2015,9500,'Head of Department'),

('IT','Harry','Taylor',2015,2700,'Trainee'),

('Human Resources','Lily','Bennett',2013,1900,'Junior HR Specialist'),
('Human Resources','Lily','Bennett',2014,2300,'HR Specialist'),
('Human Resources','Lily','Bennett',2015,3650,'Senior HR Specialist'),

('Accounting','Charlie','Johnson',2010,2000,'Junior Accountant'),
('Accounting','Charlie','Johnson',2011,2000,'Junior Accountant'),
('Accounting','Charlie','Johnson',2012,2500,'Accountant'),
('Accounting','Charlie','Johnson',2013,3200,'Accountant'),
('Accounting','Charlie','Johnson',2014,3700,'Senior Accountant'),
('Accounting','Charlie','Johnson',2015,4200,'Senior Accountant'),

('IT','Jacob','King',2013,3400,'Trainee'),
('IT','Jacob','King',2014,4100,'Junior Developer'),
('IT','Jacob','King',2015,5900,'Developer'),

('Accounting','Jessica','Poole',2014,3800,'Senior Accountant'),
('Accounting','Jessica','Poole',2015,4300,'Senior Accountant'),

('Management','Ethan','Black',2013,5100,'Project Manager'),
('Management','Ethan','Black',2014,5900,'Project Manager'),
('Management','Ethan','Black',2015,6300,'Senior Project Manager'),

('Help Desk','Ella','Watson',2013,1400,'Trainee'),
('Help Desk','Ella','Watson',2014,1900,'Customer Service Assistant'),
('Help Desk','Ella','Watson',2015,2300,'Customer Service Assistant'),

('Human Resources','Sophia','Hunt',2011,2100,'HR Junior Specialist'),

('Marketing','Amelia','Wright',2014,2100,'Trainee'),
('Marketing','Amelia','Wright',2015,2300,'Junior SEO Specialist'),

('Marketing','Lucy','Green',2013,2000,'Trainee'),

('Marketing','Ruby','Chapman',2012,2500,'Trainee'),
('Marketing','Ruby','Chapman',2013,3400,'Junior SEO Specialist'),
('Marketing','Ruby','Chapman',2014,3900,'SEO Specialist'),
('Marketing','Ruby','Chapman',2015,5400,'Senior SEO Specialist'),

(NULL,'Amie','Walker',NULL,NULL,NULL),

('Help Desk','Brian','Murphy',2012,1500,'Trainee'),
('Help Desk','Brian','Murphy',2013,2000,'Customer Service Assistant'),
('Help Desk','Brian','Murphy',2014,2500,'Customer Service Assistant'),
('Help Desk','Brian','Murphy',2015,3700,'Customer Service Specialist'),

('Management','Eva','Saunders',2011,2100,'Trainee'),
('Management','Eva','Saunders',2012,4100,'Junior Project Manager'),
('Management','Eva','Saunders',2013,4600,'Junior Project Manager'),
('Management','Eva','Saunders',2014,5300,'Project Manager'),
('Management','Eva','Saunders',2015,6100,'Senior Project Manager');
''')
print(f"Database ready ✅ ({DB_FILE})")


Database ready ✅ (class.db)


## Get to know the tables

Great — let’s take a look at the tables we’ll be working with.

If you’ve had enough of cars and movies, good news: this time we’re switching to **orders** and **employees**.

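The worked examples in the text are based on the `orders` table. It appears only in the explanations; the setup cell above creates just the `employee` table: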

> orders (order_id, customer_id, order_date, ship_date, total_sum)

Pretty straightforward.

Each row represents one order:
- a unique id (`order_id`)
- the customer who placed it (`customer_id`)
- when it was created (`order_date`)
- when it was shipped (`ship_date`)
- and its total value (`total_sum`)

Simple structure, perfect for practicing queries.

Now for the exercises, we’ll use a slightly richer dataset: **employees with their salaries over time**.

Here, things get more realistic:
- an employee can appear in multiple years
- salaries may change from year to year
- departments can differ
- positions may evolve due to promotions

In other words, multiple rows can describe the same person across time.

This makes the dataset ideal for practicing sorting.


In [13]:
# @title Employee table — sample preview
df = pd.read_sql("SELECT * FROM employee LIMIT 10", conn).style.format(na_rep="NULL").hide(axis="index")
df

| department | first_name | last_name | year | salary | position |
|---|---|---|---|---|---|
| IT | Olivia | Pearson | 2011 | 3000 | Trainee |
| IT | Olivia | Pearson | 2012 | 3000 | Trainee |
| IT | Olivia | Pearson | 2012 | 4200 | Junior Developer |
| IT | Olivia | Pearson | 2013 | 4900 | Junior Developer |
| IT | Olivia | Pearson | 2014 | 8100 | Senior Developer |
| Management | Jack | Johnson | 2011 | 4300 | Junior Project Manager |
| Management | Jack | Johnson | 2012 | 5100 | Project Manager |
| Management | Jack | Johnson | 2013 | 7200 | Senior Project Manager |
| Management | Jack | Johnson | 2014 | 7600 | Senior Project Manager |
| Management | Jack | Johnson | 2015 | 9500 | Head of Department |


## Sort the rows — ORDER BY

Alright, let’s get to work.

You’re already comfortable filtering rows with `WHERE`.  
But here’s an important question:

**Have you noticed that query results don’t come back in any particular order?**

That’s because, by default, SQL does **not** sort your data.

Without explicit sorting:
- rows may appear in an arbitrary sequence
- different databases may return different orders
- even the same query can produce a different order on different runs

If you care about the order of your results (and you usually should), you must tell the database how to sort them.

That’s where `ORDER BY` comes in.

```sql
SELECT *
FROM orders
ORDER BY customer_id;
```
Here we add the `ORDER BY` clause and specify a column.

In this case, the results are sorted by `customer_id`, so orders from the same customer end up next to each other, starting with the lowest `customer_id`.
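If you want to try this outside the practice database, here is a minimal sketch in Python. It assumes a throwaway in-memory SQLite database with a few made-up `orders` rows (the notebook's own database only contains `employee`), and it uses the hypothetical name `demo_conn` so it does not overwrite the notebook's `conn`.

```python
import sqlite3

# Hypothetical demo data: an in-memory database with a small orders table.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
    "order_date TEXT, ship_date TEXT, total_sum REAL)"
)
demo_conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 102, "2024-01-05", "2024-01-07", 250.0),
        (2, 100, "2024-01-06", "2024-01-08", 80.0),
        (3, 101, "2024-01-06", "2024-01-09", 120.0),
        (4, 100, "2024-01-09", "2024-01-10", 40.0),
    ],
)

# ORDER BY customer_id arranges the rows from the lowest id to the highest.
sorted_ids = [
    row[0]
    for row in demo_conn.execute("SELECT customer_id FROM orders ORDER BY customer_id")
]
print(sorted_ids)  # [100, 100, 101, 102]
demo_conn.close()
```

The rows were inserted in a mixed-up order on purpose, so the sorted output comes from `ORDER BY` and not from the insertion order.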


In [3]:
# @title Practice 1
base_practice_1 = make_df_validator_nospoilers(
    expected_hash='afe1e7cf8f3a2f48fc410bf05b71a01ee24ac091966dea136de0aaa5505b9828',
    required_cols=['department', 'first_name', 'last_name', 'year', 'salary', 'position'],
    expected_rows=49,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_1 = base_practice_1

make_sql_runner(
    conn,
    runner_id="practice_1",
    description_md='### Practice 1\nSelect all columns from the table employee and sort them according to the salary.\n',
    validator=val_practice_1,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## ORDER BY with conditions

Excellent! Now you know how to sort results, which means you can quickly spot things like the lowest or highest salary.

But SQL gets even more powerful when you **combine filtering and sorting**.

You don’t have to choose one or the other. You can do both in the same query:

```sql
SELECT *
FROM orders
WHERE customer_id = 100
ORDER BY total_sum;
```

Here’s what happens step by step:

1. `WHERE` filters the rows → only orders from customer `100`
2. `ORDER BY` sorts the remaining rows → by `total_sum`

The result?
The cheapest order appears first, and the most expensive one appears last.

Filter first. Then sort.
Simple and very useful.
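The two steps above can be sketched in Python as follows. This assumes a throwaway in-memory SQLite database with made-up rows and only the columns the query needs; `demo_conn` is a hypothetical name chosen so the notebook's `conn` stays untouched.

```python
import sqlite3

# Made-up demo rows; only the columns this query needs are included.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total_sum REAL)"
)
demo_conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 102, 250.0), (2, 100, 80.0), (3, 101, 120.0), (4, 100, 40.0), (5, 100, 310.0)],
)

# WHERE keeps only customer 100; ORDER BY then sorts what is left by total_sum.
customer_100 = demo_conn.execute(
    "SELECT order_id, total_sum FROM orders WHERE customer_id = 100 ORDER BY total_sum"
).fetchall()
print(customer_100)  # [(4, 40.0), (2, 80.0), (5, 310.0)]
demo_conn.close()
```

Orders from customers 101 and 102 never reach the sorting step, which is why only three rows come back.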


In [4]:
# @title Practice 2
base_practice_2 = make_df_validator_nospoilers(
    expected_hash='61f3a35c934fe011802a6be8465c9c71524e1b5f3980c1ddb8af57e18ec02a26',
    required_cols=['department', 'first_name', 'last_name', 'year', 'salary', 'position'],
    expected_rows=5,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_2 = base_practice_2

make_sql_runner(
    conn,
    runner_id="practice_2",
    description_md='### Practice 2\nSelect only the rows related to the year 2011 from the table employee. Sort them by salary.\n',
    validator=val_practice_2,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## Ascending and descending orders

In the previous example, the smallest values appeared first and the largest last.

That’s called **ascending order**, and in SQL it’s the default behavior.

So even if you write:

```sql
SELECT *
FROM orders
ORDER BY total_sum;
```

the database will automatically sort from low → high.

Still, it’s often good practice to be explicit and add `ASC` (ascending):

```sql
SELECT *
FROM orders
ORDER BY total_sum ASC;
```

Adding `ASC` doesn’t change the result — it simply makes your intention clear and your query easier to read.

---

### Reversing the order

Sometimes you want the opposite: the **largest values first**.

For that, use `DESC` (descending):

```sql
SELECT *
FROM orders
ORDER BY total_sum DESC;
```

Now the results go from high → low, so the most expensive orders appear at the top.

**Quick summary:**

* `ASC` → smallest to largest (default)
* `DESC` → largest to smallest

<img src="https://raw.githubusercontent.com/Haross/DB_pics_nt/main/Ascending-Descending-order.webp" width="40%">
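A quick way to convince yourself that `ASC` is the default is to run all three variants side by side. The sketch below assumes a throwaway in-memory SQLite database with made-up totals; `demo_conn` and the helper `totals` are hypothetical names used only for this illustration.

```python
import sqlite3

# Made-up demo rows in a trimmed-down orders table.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE orders (order_id INTEGER, total_sum REAL)")
demo_conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 250.0), (2, 80.0), (3, 120.0), (4, 40.0)],
)

def totals(order_clause):
    """Return total_sum values sorted by the given ORDER BY clause (demo helper)."""
    # The clause is a fixed string written in this cell, never user input.
    query = f"SELECT total_sum FROM orders ORDER BY {order_clause}"
    return [row[0] for row in demo_conn.execute(query)]

default_order = totals("total_sum")
ascending = totals("total_sum ASC")
descending = totals("total_sum DESC")

print(ascending)                   # [40.0, 80.0, 120.0, 250.0]
print(descending)                  # [250.0, 120.0, 80.0, 40.0]
print(default_order == ascending)  # True
demo_conn.close()
```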


In [5]:
# @title Practice 3
base_practice_3 = make_df_validator_nospoilers(
    expected_hash='afe1e7cf8f3a2f48fc410bf05b71a01ee24ac091966dea136de0aaa5505b9828',
    required_cols=['department', 'first_name', 'last_name', 'year', 'salary', 'position'],
    expected_rows=49,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_3 = base_practice_3

make_sql_runner(
    conn,
    runner_id="practice_3",
    description_md='### Practice 3\nSelect all rows from the table employee and sort them in the **descending** order by the column **last_name**.\n',
    validator=val_practice_3,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## Sort by multiple columns

Nice work so far. Let’s level up a bit.

Sorting doesn’t have to rely on just one column. You can sort by **multiple columns at the same time**, and each one can use a **different order**.

```sql
SELECT *
FROM orders
ORDER BY customer_id ASC, total_sum DESC;
```

SQL applies the sorting **from left to right**:

1. First → sort by `customer_id` (ascending)
2. Then → within each customer, sort by `total_sum` (descending)

So the result will look like this:

* customers appear in order (1, 2, 3, …)
* for each customer, the most expensive orders come first

This technique is very common when working with grouped or hierarchical data.

Think of it as:
> **primary sort → secondary sort → tertiary sort → ...**
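Here is a small sketch of the primary and secondary sort working together. It assumes a throwaway in-memory SQLite database with made-up rows; `demo_conn` is a hypothetical name, separate from the notebook's `conn`.

```python
import sqlite3

# Made-up demo rows: customer 100 has three orders, the others have one each.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total_sum REAL)"
)
demo_conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 102, 250.0), (2, 100, 80.0), (3, 101, 120.0), (4, 100, 40.0), (5, 100, 310.0)],
)

# Primary sort: customer_id ascending. Secondary sort: total_sum descending,
# which only decides the order among rows that share the same customer_id.
ranked = demo_conn.execute(
    "SELECT customer_id, total_sum FROM orders "
    "ORDER BY customer_id ASC, total_sum DESC"
).fetchall()
for row in ranked:
    print(row)
# (100, 310.0)
# (100, 80.0)
# (100, 40.0)
# (101, 120.0)
# (102, 250.0)
demo_conn.close()
```

Notice that 250.0 is the second-largest total overall but still appears last, because the secondary column never overrides the primary one.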


In [6]:
# @title Practice 4
base_practice_4 = make_df_validator_nospoilers(
    expected_hash='afe1e7cf8f3a2f48fc410bf05b71a01ee24ac091966dea136de0aaa5505b9828',
    required_cols=['department', 'first_name', 'last_name', 'year', 'salary', 'position'],
    expected_rows=49,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_4 = base_practice_4

make_sql_runner(
    conn,
    runner_id="practice_4",
    description_md='### Practice 4\nSelect all rows from the table **employee** and sort them in the **ascending** order by the **department** and then in the **descending** order by the **salary**.\n',
    validator=val_practice_4,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## Duplicate results

Great progress! Let’s look at another important detail.

By default, SQL returns **every row** that matches your query.

Most of the time, that’s exactly what we want.  
But sometimes… it’s not.

Imagine this task:

> Get the IDs of all customers who have ever placed an order.

A first attempt might look like this:

```sql
SELECT customer_id
FROM orders;
```

Seems correct, right?

But there’s a problem.

If a customer placed multiple orders, their `customer_id` will appear multiple times in the results, once for each order.

So instead of a clean list of customers, you get **duplicates**.

Before moving on, let's go to the next practice and check what happens.
What do you notice?


In [14]:
# @title Practice 5
base_practice_5 = make_df_validator_nospoilers(
    expected_hash='b0da3d9e0811399a5a084fb0b1610a6bc00b6f98f2dedc7ff52f11a7820ae292',
    required_cols=[ 'year'],
    expected_rows=49,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_5 = base_practice_5

make_sql_runner(
    conn,
    runner_id="practice_5",
    description_md='### Practice 5\nSelect the column **year** for all rows in the table **employee**. Then examine the result carefully.\n',
    validator=val_practice_5,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## Select distinct values

Did you spot the issue?

Some values appeared multiple times.

In Practice 5, the same `year` appeared in many rows. The `orders` example behaves the same way: if several orders were placed by the same customer, their `customer_id` shows up again and again, once per order.

But we don’t want duplicates.  
We just want a **clean list of unique customers**.

Luckily, SQL makes this easy with `DISTINCT`:

```sql
SELECT DISTINCT customer_id
FROM orders;
```

When you add `DISTINCT`, the database removes duplicate rows from the result and returns each value only once.

Now each `customer_id` appears **once and only once**.

Simple change, much cleaner result.
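To compare the two queries directly, here is a minimal sketch. It assumes a throwaway in-memory SQLite database with made-up rows, and `demo_conn` is a hypothetical name. The `ORDER BY` in the second query is an addition for this sketch: `DISTINCT` on its own does not promise any particular order, so sorting keeps the printed output predictable.

```python
import sqlite3

# Made-up demo rows: customer 100 placed three orders.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
demo_conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 102), (2, 100), (3, 101), (4, 100), (5, 100)],
)

# Without DISTINCT: one row per order, so repeated customers show up repeatedly.
all_ids = [row[0] for row in demo_conn.execute("SELECT customer_id FROM orders")]

# With DISTINCT: each customer_id is returned once.
unique_ids = [
    row[0]
    for row in demo_conn.execute(
        "SELECT DISTINCT customer_id FROM orders ORDER BY customer_id"
    )
]

print(len(all_ids))  # 5
print(unique_ids)    # [100, 101, 102]
demo_conn.close()
```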


In [8]:
# @title Practice 6
base_practice_6 = make_df_validator_nospoilers(
    expected_hash='2f91706a5dcc434aca543c0a62c85c3966ac83842fd3a1cfb2b0306021089fb8',
    required_cols=['year'],
    expected_rows=7,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_6 = base_practice_6

make_sql_runner(
    conn,
    runner_id="practice_6",
    description_md='### Practice 6\nSelect the column **year** from the table **employee** in such a way that each year is only shown once.\n',
    validator=val_practice_6,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)



## Select distinct values in multiple columns

Excellent. `DISTINCT` isn’t limited to a single column — you can apply it to **combinations of columns** as well.

```sql
SELECT DISTINCT
  customer_id,
  order_date
FROM orders;
```

When you use multiple columns, SQL doesn’t check each one separately.

Instead, it keeps only **unique combinations**.

So in this case:

* a customer might place several orders on the same day
* but each `(customer_id, order_date)` pair will appear only once

The result tells us **on which days each customer placed at least one order**, without repeating the same day multiple times.

Think of it as:
> **unique pairs, not unique cells**
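The sketch below illustrates the "pairs" idea. It assumes a throwaway in-memory SQLite database with made-up rows in which customer 100 orders twice on the same day; `demo_conn` is a hypothetical name, and the `ORDER BY` is added only to make the printed order predictable.

```python
import sqlite3

# Made-up demo rows: orders 2 and 4 share both customer_id and order_date.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT)"
)
demo_conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 102, "2024-01-05"),
        (2, 100, "2024-01-06"),
        (3, 101, "2024-01-06"),
        (4, 100, "2024-01-06"),
        (5, 100, "2024-01-09"),
    ],
)

# DISTINCT compares the whole (customer_id, order_date) combination.
pairs = demo_conn.execute(
    "SELECT DISTINCT customer_id, order_date FROM orders "
    "ORDER BY customer_id, order_date"
).fetchall()
for pair in pairs:
    print(pair)
# (100, '2024-01-06')
# (100, '2024-01-09')
# (101, '2024-01-06')
# (102, '2024-01-05')
demo_conn.close()
```

Customer 100 and the date `2024-01-06` each still appear more than once, but no complete pair is repeated: five orders collapse into four rows.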


In [9]:
# @title Practice 7
base_practice_7 = make_df_validator_nospoilers(
    expected_hash='4f166ad0bcc9a71a2926e41f140cad1faa4044db4a52dee104154babfde5d3ad',
    required_cols=['department', 'position'],
    expected_rows=24,
    sort_rows=True,
    sort_cols=True,
    exact_cols=False,
    hide_missing_cols=True,
    hide_row_count=False,
)

val_practice_7 = base_practice_7

make_sql_runner(
    conn,
    runner_id="practice_7",
    description_md='### Practice 7\nCheck what positions there are in every department. In order to do that, select the columns department and position from the table employee and **eliminate duplicates**.\n',
    validator=val_practice_7,
    sol_sql=None,
    select_only=True,
    dedupe=True,
    schema_tables=['employee']
)

