## SQL-92 Demo using DuckDB

This notebook introduces the basics of read-only SQL-92 queries.

Copyright Jens Dittrich & Christian Schön & Joris Nix, [Big Data Analytics Group](https://bigdata.uni-saarland.de/), [CC-BY-SA](https://creativecommons.org/licenses/by-sa/4.0/legalcode)

This notebook uses [DuckDB](https://duckdb.org/).

Information on the SQL dialect supported by DuckDB can be found [here](https://duckdb.org/docs/).

In [1]:
import duckdb

### Database Schema

All examples below are taken from the following scenario:

You are running a photo agency which has several types of employees: seniors, salespersons, and photographers. 

Create schemas for all tables:

In [2]:
duckdb.sql("""
CREATE TABLE persons (
    id INTEGER PRIMARY KEY,
    lastname TEXT,
    firstname TEXT,
    birthday TEXT
);""")

duckdb.sql("""
CREATE TABLE employees (
    personId INTEGER PRIMARY KEY,
    salary INTEGER,
    experience INTEGER,
    FOREIGN KEY(personId) REFERENCES persons(id)
);""")

duckdb.sql("""
CREATE TABLE seniors (
    employeeId INTEGER PRIMARY KEY,
    numGreyHairs INTEGER,
    bonus INTEGER,
    FOREIGN KEY(employeeId) REFERENCES employees(personId)
);""")

duckdb.sql("""
CREATE TABLE salespersons (
    employeeId INTEGER PRIMARY KEY,
    areaOfExpertise TEXT,
    FOREIGN KEY(employeeId) REFERENCES employees(personId)
);""")

duckdb.sql("""
CREATE TABLE photographers (
    employeeId INTEGER PRIMARY KEY,
    FOREIGN KEY(employeeId) REFERENCES employees(personId)
);""")

duckdb.sql("""
CREATE TABLE cameras (
    id INTEGER PRIMARY KEY,
    brand TEXT,
    model TEXT
);""")

duckdb.sql("""
CREATE TABLE photos (
    id INTEGER PRIMARY KEY,
    location TEXT,
    unix_time INTEGER,
    photographerId INTEGER,
    cameraId INTEGER,
    FOREIGN KEY(photographerId) REFERENCES photographers(employeeId),
    FOREIGN KEY(cameraId) REFERENCES cameras(id)
);""")

Import the CSV data into those tables:

In [3]:
duckdb.sql("COPY persons FROM './data/photodb/persons.csv' (FORMAT CSV, DELIMITER ',');")
duckdb.sql("COPY employees FROM './data/photodb/employees.csv' (FORMAT CSV, DELIMITER ',');")
duckdb.sql("COPY seniors FROM './data/photodb/seniors.csv' (FORMAT CSV, DELIMITER ',');")
duckdb.sql("COPY salespersons FROM './data/photodb/salespersons.csv' (FORMAT CSV, DELIMITER ',');")
duckdb.sql("COPY photographers FROM './data/photodb/photographers.csv' (FORMAT CSV, DELIMITER ',');")
duckdb.sql("COPY cameras FROM './data/photodb/cameras.csv' (FORMAT CSV, DELIMITER ',');")
duckdb.sql("COPY photos FROM './data/photodb/photos.csv' (FORMAT CSV, DELIMITER ',');")

In [4]:
# show the complete table:
duckdb.sql("""
SELECT *
FROM employees;""")

┌──────────┬────────┬────────────┐
│ personId │ salary │ experience │
│  int32   │ int32  │   int32    │
├──────────┼────────┼────────────┤
│        1 │  45000 │          3 │
│        2 │  37000 │          3 │
│        3 │  50000 │          2 │
│        4 │  60000 │          3 │
│        5 │  55000 │          2 │
│        6 │  15000 │          1 │
│        7 │  50000 │          2 │
└──────────┴────────┴────────────┘

In [5]:
duckdb.sql("""
SELECT            *	
FROM              employees e, seniors s
WHERE             e.personid = s.employeeid;""")

┌──────────┬────────┬────────────┬────────────┬──────────────┬───────┐
│ personId │ salary │ experience │ employeeId │ numGreyHairs │ bonus │
│  int32   │ int32  │   int32    │   int32    │    int32     │ int32 │
├──────────┼────────┼────────────┼────────────┼──────────────┼───────┤
│        1 │  45000 │          3 │          1 │           45 │ 34000 │
│        2 │  37000 │          3 │          2 │          457 │ 40000 │
└──────────┴────────┴────────────┴────────────┴──────────────┴───────┘

In [6]:
# the same join, written with explicit JOIN ... ON syntax:
duckdb.sql("""
SELECT            *	
FROM              employees e JOIN seniors s
                  ON e.personid = s.employeeid;""")

┌──────────┬────────┬────────────┬────────────┬──────────────┬───────┐
│ personId │ salary │ experience │ employeeId │ numGreyHairs │ bonus │
│  int32   │ int32  │   int32    │   int32    │    int32     │ int32 │
├──────────┼────────┼────────────┼────────────┼──────────────┼───────┤
│        1 │  45000 │          3 │          1 │           45 │ 34000 │
│        2 │  37000 │          3 │          2 │          457 │ 40000 │
└──────────┴────────┴────────────┴────────────┴──────────────┴───────┘

In [7]:
duckdb.sql("SELECT 42;")

┌───────┐
│  42   │
│ int32 │
├───────┤
│    42 │
└───────┘

### Projection

In [8]:
# projection for the attribute 'salary':
duckdb.sql("""
SELECT salary
FROM employees;""")

┌────────┐
│ salary │
│ int32  │
├────────┤
│  45000 │
│  37000 │
│  50000 │
│  60000 │
│  55000 │
│  15000 │
│  50000 │
└────────┘

First difference from relational algebra: SQL results may contain **duplicates**.

In [9]:
# projection for the attribute 'salary', eliminating duplicates using 'DISTINCT'
duckdb.sql("""
SELECT DISTINCT salary
FROM employees;""")

┌────────┐
│ salary │
│ int32  │
├────────┤
│  45000 │
│  37000 │
│  50000 │
│  60000 │
│  55000 │
│  15000 │
└────────┘

In [10]:
# projection for the attribute 'personId':
duckdb.sql("""
SELECT personid
FROM employees;""")

┌──────────┐
│ personId │
│  int32   │
├──────────┤
│        1 │
│        2 │
│        3 │
│        4 │
│        5 │
│        6 │
│        7 │
└──────────┘

Because personId is a key, adding DISTINCT would have no effect here.
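
The effect of DISTINCT on the two columns can be sketched in plain Python, without DuckDB. The lists copy the 'salary' and 'personId' columns shown above; `distinct` is a hypothetical helper written for this illustration, not a DuckDB function. It keeps the first occurrence of each value only for readability, since SQL itself guarantees no output order without ORDER BY.

```python
# Column values copied from the employees table shown above.
salaries = [45000, 37000, 50000, 60000, 55000, 15000, 50000]
person_ids = [1, 2, 3, 4, 5, 6, 7]


def distinct(values):
    """Hypothetical helper: drop duplicates, keeping first occurrences."""
    return list(dict.fromkeys(values))


# 'salary' contains 50000 twice, so DISTINCT removes one row.
print(len(salaries), len(distinct(salaries)))

# 'personId' is a key, so DISTINCT leaves the column unchanged.
print(distinct(person_ids) == person_ids)
```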

### Sorting of the output, order by

In [11]:
# projection for the attributes 'experience' and 'personId' using descending order:
# alternative, ascending: ASC
duckdb.sql("""
SELECT experience, personid
FROM employees
ORDER BY experience DESC, personid DESC;""")

┌────────────┬──────────┐
│ experience │ personId │
│   int32    │  int32   │
├────────────┼──────────┤
│          3 │        4 │
│          3 │        2 │
│          3 │        1 │
│          2 │        7 │
│          2 │        5 │
│          2 │        3 │
│          1 │        6 │
└────────────┴──────────┘

### Selection/Filter

Not to be confused with SELECT (which projects the data, see above).

In [12]:
# selection of/filter all employees with a salary of more than 50000:
duckdb.sql("""
SELECT *
FROM employees
WHERE salary>50000;""")

┌──────────┬────────┬────────────┐
│ personId │ salary │ experience │
│  int32   │ int32  │   int32    │
├──────────┼────────┼────────────┤
│        4 │  60000 │          3 │
│        5 │  55000 │          2 │
└──────────┴────────┴────────────┘

In [13]:
# selection of all employees with a salary of more than 50000 and a personid > 4:
duckdb.sql("""
SELECT *
FROM employees
WHERE salary>50000 AND personid>4;""")

┌──────────┬────────┬────────────┐
│ personId │ salary │ experience │
│  int32   │ int32  │   int32    │
├──────────┼────────┼────────────┤
│        5 │  55000 │          2 │
└──────────┴────────┴────────────┘

### Filtering Strings

To filter on string types, we can use the LIKE operator with two wildcards:

1. percent symbol (%): represents zero, one, or multiple characters

2. underscore symbol (_): represents exactly one character
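
One way to build intuition for the two wildcards is to translate a LIKE pattern into a regular expression. The following is a minimal sketch in plain Python, not how DuckDB implements LIKE. The helpers `like_to_regex` and `like` are hypothetical names, escape characters are not handled, and the first names are copied from the persons table used in this notebook.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern into a regular expression (illustrative only)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # zero, one, or multiple characters
        elif ch == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # every other character matches itself
    return re.compile("".join(parts), re.DOTALL)


def like(value, pattern):
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None


firstnames = ["Albert", "Rob", "Peter", "Frank", "Tim", "Hans", "Peter", "Dieter"]

# '_ete_' requires exactly five characters, so 'Dieter' does not qualify.
print([name for name in firstnames if like(name, "_ete_")])

# '%ete_' allows any prefix, so 'Dieter' qualifies as well.
print([name for name in firstnames if like(name, "%ete_")])
```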


In [14]:
duckdb.sql("""
SELECT *
FROM persons;""")

┌───────┬────────────┬───────────┬────────────┐
│  id   │  lastname  │ firstname │  birthday  │
│ int32 │  varchar   │  varchar  │  varchar   │
├───────┼────────────┼───────────┼────────────┤
│     1 │ Schweitzer │ Albert    │ 1973-03-01 │
│     2 │ Carlos     │ Rob       │ 1975-07-12 │
│     3 │ Mueller    │ Peter     │ 1963-10-09 │
│     4 │ Zappa      │ Frank     │ 1955-11-04 │
│     5 │ Taylor     │ Tim       │ 1980-03-04 │
│     6 │ Wurst      │ Hans      │ 1974-02-01 │
│     7 │ Miese      │ Peter     │ 1983-05-06 │
│     8 │ Koenig     │ Dieter    │ 1967-06-11 │
└───────┴────────────┴───────────┴────────────┘

In [15]:
duckdb.sql("""
SELECT *
FROM persons
WHERE firstname LIKE '__bert';""")

┌───────┬────────────┬───────────┬────────────┐
│  id   │  lastname  │ firstname │  birthday  │
│ int32 │  varchar   │  varchar  │  varchar   │
├───────┼────────────┼───────────┼────────────┤
│     1 │ Schweitzer │ Albert    │ 1973-03-01 │
└───────┴────────────┴───────────┴────────────┘

In [16]:
duckdb.sql("""
SELECT *
FROM persons
WHERE firstname LIKE '%bert';""")

┌───────┬────────────┬───────────┬────────────┐
│  id   │  lastname  │ firstname │  birthday  │
│ int32 │  varchar   │  varchar  │  varchar   │
├───────┼────────────┼───────────┼────────────┤
│     1 │ Schweitzer │ Albert    │ 1973-03-01 │
└───────┴────────────┴───────────┴────────────┘

In [17]:
duckdb.sql("""
SELECT *
FROM persons
WHERE firstname LIKE '_ete_';""")

┌───────┬──────────┬───────────┬────────────┐
│  id   │ lastname │ firstname │  birthday  │
│ int32 │ varchar  │  varchar  │  varchar   │
├───────┼──────────┼───────────┼────────────┤
│     3 │ Mueller  │ Peter     │ 1963-10-09 │
│     7 │ Miese    │ Peter     │ 1983-05-06 │
└───────┴──────────┴───────────┴────────────┘

In [18]:
duckdb.sql("""
SELECT *
FROM persons
WHERE firstname LIKE '%ete_';""")

┌───────┬──────────┬───────────┬────────────┐
│  id   │ lastname │ firstname │  birthday  │
│ int32 │ varchar  │  varchar  │  varchar   │
├───────┼──────────┼───────────┼────────────┤
│     3 │ Mueller  │ Peter     │ 1963-10-09 │
│     7 │ Miese    │ Peter     │ 1983-05-06 │
│     8 │ Koenig   │ Dieter    │ 1967-06-11 │
└───────┴──────────┴───────────┴────────────┘

### Union

In [19]:
# example for Union:
duckdb.sql("""
SELECT employeeId
FROM salespersons
  UNION
SELECT employeeId
FROM photographers;""")

┌────────────┐
│ employeeId │
│   int32    │
├────────────┤
│          4 │
│          5 │
│          3 │
│          7 │
│          6 │
└────────────┘

### Difference

In [20]:
# example for difference:
# the difference operator, called 'MINUS' in some other systems, is 'EXCEPT' in DuckDB
duckdb.sql("""
SELECT employeeId
FROM salespersons
  EXCEPT
SELECT employeeId
FROM photographers;""")

┌────────────┐
│ employeeId │
│   int32    │
├────────────┤
│          4 │
│          5 │
└────────────┘

### Cross Product

In [21]:
# example for the cross product:
duckdb.sql("""
SELECT *
FROM employees, seniors;""")

┌──────────┬────────┬────────────┬────────────┬──────────────┬───────┐
│ personId │ salary │ experience │ employeeId │ numGreyHairs │ bonus │
│  int32   │ int32  │   int32    │   int32    │    int32     │ int32 │
├──────────┼────────┼────────────┼────────────┼──────────────┼───────┤
│        1 │  45000 │          3 │          1 │           45 │ 34000 │
│        2 │  37000 │          3 │          1 │           45 │ 34000 │
│        3 │  50000 │          2 │          1 │           45 │ 34000 │
│        4 │  60000 │          3 │          1 │           45 │ 34000 │
│        5 │  55000 │          2 │          1 │           45 │ 34000 │
│        6 │  15000 │          1 │          1 │           45 │ 34000 │
│        7 │  50000 │          2 │          1 │           45 │ 34000 │
│        1 │  45000 │          3 │          2 │          457 │ 40000 │
│        2 │  37000 │          3 │          2 │          457 │ 40000 │
│        3 │  50000 │          2 │          2 │          457 │ 40000 │
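│        4 │  60000 │          3 │          2 │          457 │ 40000 │
│        5 │  55000 │          2 │          2 │          457 │ 40000 │
│        6 │  15000 │          1 │          2 │          457 │ 40000 │
│        7 │  50000 │          2 │          2 │          457 │ 40000 │
└──────────┴────────┴────────────┴────────────┴──────────────┴───────┘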

In [22]:
duckdb.sql("""
SELECT *
FROM employees;""")

┌──────────┬────────┬────────────┐
│ personId │ salary │ experience │
│  int32   │ int32  │   int32    │
├──────────┼────────┼────────────┤
│        1 │  45000 │          3 │
│        2 │  37000 │          3 │
│        3 │  50000 │          2 │
│        4 │  60000 │          3 │
│        5 │  55000 │          2 │
│        6 │  15000 │          1 │
│        7 │  50000 │          2 │
└──────────┴────────┴────────────┘

In [23]:
duckdb.sql("""
SELECT *
FROM seniors;""")

┌────────────┬──────────────┬───────┐
│ employeeId │ numGreyHairs │ bonus │
│   int32    │    int32     │ int32 │
├────────────┼──────────────┼───────┤
│          1 │           45 │ 34000 │
│          2 │          457 │ 40000 │
└────────────┴──────────────┴───────┘

In [24]:
duckdb.sql("SELECT 'The cross product has: ', (SELECT COUNT(*) FROM employees, seniors) AS cnt, 'entries.';")

┌───────────────────────────┬───────┬────────────┐
│ 'The cross product has: ' │  cnt  │ 'entries.' │
│          varchar          │ int64 │  varchar   │
├───────────────────────────┼───────┼────────────┤
│ The cross product has:    │    14 │ entries.   │
└───────────────────────────┴───────┴────────────┘
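
The count of 14 follows directly from the table sizes: 7 employees times 2 seniors. As a rough sketch in plain Python (tuples copied from the two tables above; the row order here is an artifact of `itertools.product` and need not match DuckDB's output order), the cross product pairs every employee with every senior. Filtering that result on equal IDs yields the same two rows as the employees/seniors join shown earlier in this notebook.

```python
from itertools import product

# (personId, salary, experience), copied from the employees table.
employees = [
    (1, 45000, 3), (2, 37000, 3), (3, 50000, 2), (4, 60000, 3),
    (5, 55000, 2), (6, 15000, 1), (7, 50000, 2),
]
# (employeeId, numGreyHairs, bonus), copied from the seniors table.
seniors = [(1, 45, 34000), (2, 457, 40000)]

# Cross product: concatenate every employee tuple with every senior tuple.
cross = [e + s for e, s in product(employees, seniors)]
print(len(cross))  # 7 * 2 rows

# A join is conceptually a cross product followed by a selection:
# keep rows where personId (index 0) equals employeeId (index 3).
joined = [row for row in cross if row[0] == row[3]]
print(joined)
```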

### Rename

In [25]:
duckdb.sql("SELECT 6*7 AS answerToEverything;")

┌────────────────────┐
│ answerToEverything │
│       int32        │
├────────────────────┤
│                 42 │
└────────────────────┘

In [26]:
# Building the cross product employees X employees using renaming:
duckdb.sql("""
SELECT e1.salary, e2.salary
FROM employees AS e1, employees AS e2;""")

┌────────┬────────┐
│ salary │ salary │
│ int32  │ int32  │
├────────┼────────┤
│  45000 │  45000 │
│  37000 │  45000 │
│  50000 │  45000 │
│  60000 │  45000 │
│  55000 │  45000 │
│  15000 │  45000 │
│  50000 │  45000 │
│  45000 │  37000 │
│  37000 │  37000 │
│  50000 │  37000 │
│    ·   │    ·   │
│    ·   │    ·   │
│    ·   │    ·   │
│  55000 │  15000 │
│  15000 │  15000 │
│  50000 │  15000 │
│  45000 │  50000 │
│  37000 │  50000 │
│  50000 │  50000 │
│  60000 │  50000 │
│  55000 │  50000 │
│  15000 │  50000 │
│  50000 │  50000 │
├────────┴────────┤
│     49 rows     │
│   (20 shown)    │
└─────────────────┘

### Joins

In [27]:
# Theta-Join, in this example an equi-join predicate, the explicit way:
duckdb.sql("""
SELECT *
FROM employees JOIN seniors
ON personid = employeeId;""")

┌──────────┬────────┬────────────┬────────────┬──────────────┬───────┐
│ personId │ salary │ experience │ employeeId │ numGreyHairs │ bonus │
│  int32   │ int32  │   int32    │   int32    │    int32     │ int32 │
├──────────┼────────┼────────────┼────────────┼──────────────┼───────┤
│        1 │  45000 │          3 │          1 │           45 │ 34000 │
│        2 │  37000 │          3 │          2 │          457 │ 40000 │
└──────────┴────────┴────────────┴────────────┴──────────────┴───────┘

In [28]:
# the same equi-join, the implicit way (cross product plus WHERE predicate):
duckdb.sql("""
SELECT *
FROM employees, seniors
WHERE personid = employeeId;""")

┌──────────┬────────┬────────────┬────────────┬──────────────┬───────┐
│ personId │ salary │ experience │ employeeId │ numGreyHairs │ bonus │
│  int32   │ int32  │   int32    │   int32    │    int32     │ int32 │
├──────────┼────────┼────────────┼────────────┼──────────────┼───────┤
│        1 │  45000 │          3 │          1 │           45 │ 34000 │
│        2 │  37000 │          3 │          2 │          457 │ 40000 │
└──────────┴────────┴────────────┴────────────┴──────────────┴───────┘

In [29]:
# Theta-Join, explicitly as INNER JOIN, the keyword "INNER" is redundant and can be left out:
duckdb.sql("""
SELECT *
FROM employees INNER JOIN seniors
ON personid = employeeId;""")

┌──────────┬────────┬────────────┬────────────┬──────────────┬───────┐
│ personId │ salary │ experience │ employeeId │ numGreyHairs │ bonus │
│  int32   │ int32  │   int32    │   int32    │    int32     │ int32 │
├──────────┼────────┼────────────┼────────────┼──────────────┼───────┤
│        1 │  45000 │          3 │          1 │           45 │ 34000 │
│        2 │  37000 │          3 │          2 │          457 │ 40000 │
└──────────┴────────┴────────────┴────────────┴──────────────┴───────┘

In [30]:
duckdb.sql("SELECT * FROM seniors;")

┌────────────┬──────────────┬───────┐
│ employeeId │ numGreyHairs │ bonus │
│   int32    │    int32     │ int32 │
├────────────┼──────────────┼───────┤
│          1 │           45 │ 34000 │
│          2 │          457 │ 40000 │
└────────────┴──────────────┴───────┘

In [31]:
# Left Outer Join:
duckdb.sql("""
SELECT *
FROM employees LEFT OUTER JOIN seniors
ON personid = employeeId;""")

┌──────────┬────────┬────────────┬────────────┬──────────────┬───────┐
│ personId │ salary │ experience │ employeeId │ numGreyHairs │ bonus │
│  int32   │ int32  │   int32    │   int32    │    int32     │ int32 │
├──────────┼────────┼────────────┼────────────┼──────────────┼───────┤
│        1 │  45000 │          3 │          1 │           45 │ 34000 │
│        2 │  37000 │          3 │          2 │          457 │ 40000 │
│        3 │  50000 │          2 │       NULL │         NULL │  NULL │
│        4 │  60000 │          3 │       NULL │         NULL │  NULL │
│        5 │  55000 │          2 │       NULL │         NULL │  NULL │
│        6 │  15000 │          1 │       NULL │         NULL │  NULL │
│        7 │  50000 │          2 │       NULL │         NULL │  NULL │
└──────────┴────────┴────────────┴────────────┴──────────────┴───────┘

For employees without a matching senior, the attributes of seniors are filled with NULL values.
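
Where the NULL values come from can be sketched with a naive nested-loop implementation in plain Python. This only illustrates the semantics and is not how DuckDB executes the join. `left_outer_join` and its parameters are hypothetical names, and Python's `None` stands in for SQL's NULL.

```python
# (personId, salary, experience), copied from the employees table.
employees = [
    (1, 45000, 3), (2, 37000, 3), (3, 50000, 2), (4, 60000, 3),
    (5, 55000, 2), (6, 15000, 1), (7, 50000, 2),
]
# (employeeId, numGreyHairs, bonus), copied from the seniors table.
seniors = [(1, 45, 34000), (2, 457, 40000)]


def left_outer_join(left, right, left_key, right_key, right_width):
    """Equi-join on left[left_key] == right[right_key], keeping all left rows."""
    result = []
    for l in left:
        matches = [r for r in right if l[left_key] == r[right_key]]
        if matches:
            # Regular join partners: one output row per match.
            result.extend(l + r for r in matches)
        else:
            # No partner: pad the right-hand attributes with None (NULL).
            result.append(l + (None,) * right_width)
    return result


for row in left_outer_join(employees, seniors, 0, 0, 3):
    print(row)
```

A right outer join is the mirror image: swapping the two inputs (and the two key positions) preserves all rows of the other table instead.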

In [32]:
# Right Outer Join:
duckdb.sql("""
SELECT *
FROM employees RIGHT OUTER JOIN seniors
ON personid = employeeId;""")

┌──────────┬────────┬────────────┬────────────┬──────────────┬───────┐
│ personId │ salary │ experience │ employeeId │ numGreyHairs │ bonus │
│  int32   │ int32  │   int32    │   int32    │    int32     │ int32 │
├──────────┼────────┼────────────┼────────────┼──────────────┼───────┤
│        1 │  45000 │          3 │          1 │           45 │ 34000 │
│        2 │  37000 │          3 │          2 │          457 │ 40000 │
└──────────┴────────┴────────────┴────────────┴──────────────┴───────┘

Well, there is no difference from the inner join here, because every senior has a matching employee. Let's switch the tables:

In [33]:
# Right Outer Join:
duckdb.sql("""
SELECT *
FROM seniors RIGHT OUTER JOIN employees
ON personid = employeeId;""")

┌────────────┬──────────────┬───────┬──────────┬────────┬────────────┐
│ employeeId │ numGreyHairs │ bonus │ personId │ salary │ experience │
│   int32    │    int32     │ int32 │  int32   │ int32  │   int32    │
├────────────┼──────────────┼───────┼──────────┼────────┼────────────┤
│          1 │           45 │ 34000 │        1 │  45000 │          3 │
│          2 │          457 │ 40000 │        2 │  37000 │          3 │
│       NULL │         NULL │  NULL │        3 │  50000 │          2 │
│       NULL │         NULL │  NULL │        4 │  60000 │          3 │
│       NULL │         NULL │  NULL │        5 │  55000 │          2 │
│       NULL │         NULL │  NULL │        6 │  15000 │          1 │
│       NULL │         NULL │  NULL │        7 │  50000 │          2 │
└────────────┴──────────────┴───────┴──────────┴────────┴────────────┘

In [34]:
# Full Outer Join:
duckdb.sql("""
SELECT *
FROM seniors FULL OUTER JOIN employees
ON personid = employeeId;""")

┌────────────┬──────────────┬───────┬──────────┬────────┬────────────┐
│ employeeId │ numGreyHairs │ bonus │ personId │ salary │ experience │
│   int32    │    int32     │ int32 │  int32   │ int32  │   int32    │
├────────────┼──────────────┼───────┼──────────┼────────┼────────────┤
│          1 │           45 │ 34000 │        1 │  45000 │          3 │
│          2 │          457 │ 40000 │        2 │  37000 │          3 │
│       NULL │         NULL │  NULL │        3 │  50000 │          2 │
│       NULL │         NULL │  NULL │        4 │  60000 │          3 │
│       NULL │         NULL │  NULL │        5 │  55000 │          2 │
│       NULL │         NULL │  NULL │        6 │  15000 │          1 │
│       NULL │         NULL │  NULL │        7 │  50000 │          2 │
└────────────┴──────────────┴───────┴──────────┴────────┴────────────┘

### Grouping and Aggregation

In [35]:
# as there is no GROUP BY statement, the entire input is considered 
# a single partition/group
# this means the aggregate function is called only once 
# count(*) calculates the number of tuples in the group:
duckdb.sql("""
SELECT count(*)
FROM employees;""")

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│            7 │
└──────────────┘

In [36]:
# combining several aggregation functions is no problem:
duckdb.sql("""
SELECT count(*) as number, CAST(round(avg(salary),0) AS INTEGER) as avg_salary
FROM employees;""")

┌────────┬────────────┐
│ number │ avg_salary │
│ int64  │   int32    │
├────────┼────────────┤
│      7 │      44571 │
└────────┴────────────┘

In [37]:
# now a real grouping with three different groups, followed by the aggregation:
duckdb.sql("""
SELECT experience, count(*), CAST(round(avg(salary),0) AS INTEGER) as avg_salary
FROM employees
GROUP BY experience;""")

┌────────────┬──────────────┬────────────┐
│ experience │ count_star() │ avg_salary │
│   int32    │    int64     │   int32    │
├────────────┼──────────────┼────────────┤
│          1 │            1 │      15000 │
│          2 │            3 │      51667 │
│          3 │            3 │      47333 │
└────────────┴──────────────┴────────────┘

In [38]:
# a grouping based on two grouping attributes:
duckdb.sql("""
SELECT salary, experience, count(*)
FROM employees
GROUP BY salary, experience;""")

┌────────┬────────────┬──────────────┐
│ salary │ experience │ count_star() │
│ int32  │   int32    │    int64     │
├────────┼────────────┼──────────────┤
│  45000 │          3 │            1 │
│  37000 │          3 │            1 │
│  50000 │          2 │            2 │
│  60000 │          3 │            1 │
│  55000 │          2 │            1 │
│  15000 │          1 │            1 │
└────────┴────────────┴──────────────┘

In [39]:
duckdb.sql("""
SELECT *
FROM employees
ORDER BY salary ASC;""")

┌──────────┬────────┬────────────┐
│ personId │ salary │ experience │
│  int32   │ int32  │   int32    │
├──────────┼────────┼────────────┤
│        6 │  15000 │          1 │
│        2 │  37000 │          3 │
│        1 │  45000 │          3 │
│        3 │  50000 │          2 │
│        7 │  50000 │          2 │
│        5 │  55000 │          2 │
│        4 │  60000 │          3 │
└──────────┴────────┴────────────┘

In [40]:
# select using an attribute not appearing in the GROUP BY clause:
duckdb.sql("""
SELECT salary, max(salary), count(*)
FROM employees
GROUP BY experience;""")

BinderException: Binder Error: column "salary" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(salary)" if the exact value of "salary" is not important.

**Note!** This query throws an exception.

**Why?** The query is semantically ambiguous: within each group, the attribute 'salary' may take several different values, so there is no single value to output for the group. The rule is therefore: an attribute may appear in the SELECT clause without an aggregation function only if it also appears in the GROUP BY clause.
In this example, only the attribute 'experience' can be used without an aggregation function; every other attribute must be wrapped in an aggregation function!
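
The ambiguity becomes visible once the groups are written out. The following sketch in plain Python (data copied from the employees table, no DuckDB involved) groups the salaries by 'experience'. A group such as experience = 2 holds several salaries, so a bare 'salary' has no well-defined value, whereas aggregates like max or count collapse each group to exactly one value.

```python
from collections import defaultdict

# (personId, salary, experience), copied from the employees table.
employees = [
    (1, 45000, 3), (2, 37000, 3), (3, 50000, 2), (4, 60000, 3),
    (5, 55000, 2), (6, 15000, 1), (7, 50000, 2),
]

# GROUP BY experience: collect the salaries of each group.
groups = defaultdict(list)
for person_id, salary, experience in employees:
    groups[experience].append(salary)

for experience, salaries in sorted(groups.items()):
    # 'salaries' is a list, not a single value: which one should a bare
    # 'salary' in the SELECT clause refer to? Aggregates resolve this.
    print(experience, salaries, max(salaries), len(salaries))
```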

In [41]:
# consider again the contents of employees:
duckdb.sql("""
SELECT experience, salary
FROM employees
ORDER BY experience;""")

┌────────────┬────────┐
│ experience │ salary │
│   int32    │ int32  │
├────────────┼────────┤
│          1 │  15000 │
│          2 │  50000 │
│          2 │  55000 │
│          2 │  50000 │
│          3 │  45000 │
│          3 │  37000 │
│          3 │  60000 │
└────────────┴────────┘

In [42]:
# this example is valid and has a semantic meaning:
duckdb.sql("""
SELECT experience, salary, count(*)
FROM employees
GROUP BY experience, salary;""")

┌────────────┬────────┬──────────────┐
│ experience │ salary │ count_star() │
│   int32    │ int32  │    int64     │
├────────────┼────────┼──────────────┤
│          3 │  45000 │            1 │
│          3 │  37000 │            1 │
│          2 │  50000 │            2 │
│          3 │  60000 │            1 │
│          2 │  55000 │            1 │
│          1 │  15000 │            1 │
└────────────┴────────┴──────────────┘

### Grouping and Aggregation with HAVING

In [43]:
# now grouping with HAVING:
duckdb.sql("""
SELECT experience, count(*), avg(salary)
FROM employees
WHERE salary > 40000
GROUP BY experience
HAVING count(*) > 2;""")

┌────────────┬──────────────┬────────────────────┐
│ experience │ count_star() │    avg(salary)     │
│   int32    │    int64     │       double       │
├────────────┼──────────────┼────────────────────┤
│          2 │            3 │ 51666.666666666664 │
└────────────┴──────────────┴────────────────────┘

### Conceptual Order of HAVING

The same query with HAVING rolled out in a step-by-step manner according to the conceptual order of execution:

In [44]:
# 1. FROM: all employees
duckdb.sql("""
SELECT * 
FROM employees;""")

┌──────────┬────────┬────────────┐
│ personId │ salary │ experience │
│  int32   │ int32  │   int32    │
├──────────┼────────┼────────────┤
│        1 │  45000 │          3 │
│        2 │  37000 │          3 │
│        3 │  50000 │          2 │
│        4 │  60000 │          3 │
│        5 │  55000 │          2 │
│        6 │  15000 │          1 │
│        7 │  50000 │          2 │
└──────────┴────────┴────────────┘

In [45]:
# 2. WHERE: selection of tuples with salary > 40000
duckdb.sql("""
SELECT * 
FROM employees
WHERE salary > 40000;""")

┌──────────┬────────┬────────────┐
│ personId │ salary │ experience │
│  int32   │ int32  │   int32    │
├──────────┼────────┼────────────┤
│        1 │  45000 │          3 │
│        3 │  50000 │          2 │
│        4 │  60000 │          3 │
│        5 │  55000 │          2 │
│        7 │  50000 │          2 │
└──────────┴────────┴────────────┘

In [46]:
# 3. GROUP BY: build groups based on "experience"
# 5. compute the aggregation functions count(*) and avg(salary) for each group
Q1 = duckdb.sql("""
SELECT experience, count(*), avg(salary) AS avg_salary
FROM employees
WHERE salary > 40000
GROUP BY experience;""")
Q1 # let's call this query Q1

┌────────────┬──────────────┬────────────────────┐
│ experience │ count_star() │     avg_salary     │
│   int32    │    int64     │       double       │
├────────────┼──────────────┼────────────────────┤
│          2 │            3 │ 51666.666666666664 │
│          3 │            2 │            52500.0 │
└────────────┴──────────────┴────────────────────┘

In [47]:
# 3. GROUP BY: build groups based on "experience"
# 4. HAVING: only output groups with avg(salary) > 52000
# 5. compute the aggregation functions count(*) and avg(salary) for each group
duckdb.sql("""
SELECT experience, count(*), avg(salary)
FROM employees
WHERE salary > 40000
GROUP BY experience
HAVING avg(salary) > 52000;""")

┌────────────┬──────────────┬─────────────┐
│ experience │ count_star() │ avg(salary) │
│   int32    │    int64     │   double    │
├────────────┼──────────────┼─────────────┤
│          3 │            2 │     52500.0 │
└────────────┴──────────────┴─────────────┘

In [48]:
# or alternatively using Q1
duckdb.sql("""
SELECT *
FROM Q1
WHERE avg_salary > 52000;""")

┌────────────┬──────────────┬────────────┐
│ experience │ count_star() │ avg_salary │
│   int32    │    int64     │   double   │
├────────────┼──────────────┼────────────┤
│          3 │            2 │    52500.0 │
└────────────┴──────────────┴────────────┘
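
The conceptual order walked through above can also be mimicked as a small pipeline in plain Python. This is only a sketch of the semantics under the assumption that the steps run strictly one after another; a real system such as DuckDB is free to choose a different physical plan. The data is copied from the employees table. Note that the HAVING step already has to evaluate avg(salary) in order to filter groups.

```python
from collections import defaultdict

# (personId, salary, experience), copied from the employees table.
employees = [
    (1, 45000, 3), (2, 37000, 3), (3, 50000, 2), (4, 60000, 3),
    (5, 55000, 2), (6, 15000, 1), (7, 50000, 2),
]

# 1. FROM: all employees
rows = list(employees)

# 2. WHERE: keep tuples with salary > 40000
rows = [r for r in rows if r[1] > 40000]

# 3. GROUP BY: build groups based on experience
groups = defaultdict(list)
for person_id, salary, experience in rows:
    groups[experience].append(salary)

# 4. HAVING: keep only groups with avg(salary) > 52000
kept = {
    experience: salaries
    for experience, salaries in groups.items()
    if sum(salaries) / len(salaries) > 52000
}

# 5. SELECT: output experience, count(*), avg(salary) per remaining group
result = [
    (experience, len(salaries), sum(salaries) / len(salaries))
    for experience, salaries in sorted(kept.items())
]
print(result)
```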

### Uncorrelated Subqueries

We could also explain the semantics of HAVING using a so-called (uncorrelated) subquery:

In [49]:
# recall Q1 from above
duckdb.sql("""
SELECT experience, count(*), avg(salary) 
FROM employees
WHERE salary > 40000
GROUP BY experience;""")

┌────────────┬──────────────┬────────────────────┐
│ experience │ count_star() │    avg(salary)     │
│   int32    │    int64     │       double       │
├────────────┼──────────────┼────────────────────┤
│          2 │            3 │ 51666.666666666664 │
│          3 │            2 │            52500.0 │
└────────────┴──────────────┴────────────────────┘

we could rewrite this to:

In [50]:
duckdb.sql("""
SELECT *
FROM (
    SELECT experience, count(*), avg(salary) AS avg_salary
    FROM employees
    WHERE salary > 40000
    GROUP BY experience
    );""")

┌────────────┬──────────────┬────────────────────┐
│ experience │ count_star() │     avg_salary     │
│   int32    │    int64     │       double       │
├────────────┼──────────────┼────────────────────┤
│          2 │            3 │ 51666.666666666664 │
│          3 │            2 │            52500.0 │
└────────────┴──────────────┴────────────────────┘

Here Q1 is a subquery (or inner query). Conceptually you can read this as: Q1 produces a relation and that relation is then used as any other relation in the FROM clause.

Now, we can filter on particular groups and thus simulate the effect of HAVING:

In [51]:
duckdb.sql("""
SELECT *
FROM (
    SELECT experience, count(*), avg(salary) AS avg_salary
    FROM employees
    WHERE salary > 40000
    GROUP BY experience
    )
WHERE avg_salary > 52000;""")

┌────────────┬──────────────┬────────────┐
│ experience │ count_star() │ avg_salary │
│   int32    │    int64     │   double   │
├────────────┼──────────────┼────────────┤
│          3 │            2 │    52500.0 │
└────────────┴──────────────┴────────────┘

### Views

Subqueries can become very hard to read. In general, SQL statements can become quite large, and large parts of them often repeat things we specify over and over again anyway. We therefore recommend breaking up SQL statements into building blocks wherever possible to improve readability. This can be done using views.

In [52]:
# R := T ⋈_{T.x=S.y} S

# HighlyPaidEmployees := SELECT experience, count(*), avg(salary) AS avg_salary
#     FROM employees
#     WHERE salary > 40000
#     GROUP BY experience

In [53]:
# delete this view if it already exists:

duckdb.sql("DROP VIEW IF EXISTS HighlyPaidEmployees;")

# create a view
duckdb.sql("""
CREATE VIEW HighlyPaidEmployees as
    SELECT experience, count(*), avg(salary) AS avg_salary
    FROM employees
    WHERE salary > 40000
    GROUP BY experience;""")

In [54]:
duckdb.sql("""
SELECT *
FROM HighlyPaidEmployees;""")

┌────────────┬──────────────┬────────────────────┐
│ experience │ count_star() │     avg_salary     │
│   int32    │    int64     │       double       │
├────────────┼──────────────┼────────────────────┤
│          2 │            3 │ 51666.666666666664 │
│          3 │            2 │            52500.0 │
└────────────┴──────────────┴────────────────────┘

Notice that a view definition **does not execute any query**; it is merely an alias for an SQL statement. A view can then be used just like any other relation, and only then is it executed:

In [55]:
duckdb.sql("""
SELECT *
FROM (SELECT experience, count(*), avg(salary) AS avg_salary
    FROM employees
    WHERE salary > 40000
    GROUP BY experience);""")

┌────────────┬──────────────┬────────────────────┐
│ experience │ count_star() │     avg_salary     │
│   int32    │    int64     │       double       │
├────────────┼──────────────┼────────────────────┤
│          2 │            3 │ 51666.666666666664 │
│          3 │            2 │            52500.0 │
└────────────┴──────────────┴────────────────────┘
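
That a plain view is evaluated only when it is queried can be observed by changing the base table after the view has been created. The sketch below deliberately uses Python's built-in sqlite3 module as a stand-in for DuckDB, so it runs without the CSV files. It uses a reduced two-row employees table, and the inserted row with personId 8 is invented purely for this illustration.

```python
import sqlite3

# In-memory stand-in database (SQLite, not DuckDB).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE employees ("
    "personId INTEGER PRIMARY KEY, salary INTEGER, experience INTEGER)"
)
con.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 45000, 3), (4, 60000, 3)],
)

# Defining the view stores the query text; nothing is computed yet.
con.execute("""
CREATE VIEW HighlyPaidEmployees AS
    SELECT experience, count(*) AS cnt, avg(salary) AS avg_salary
    FROM employees
    WHERE salary > 40000
    GROUP BY experience;""")

before = con.execute("SELECT * FROM HighlyPaidEmployees;").fetchall()

# Hypothetical new employee, added after the view was defined.
con.execute("INSERT INTO employees VALUES (8, 90000, 3);")

# The view is re-evaluated on every use, so it reflects the new row.
after = con.execute("SELECT * FROM HighlyPaidEmployees;").fetchall()

print(before)
print(after)
con.close()
```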

If you want to precompute and store the result of a view, you can use **materialized views**. Some systems offer this functionality.

**General recommendation:** do not use materialized views unless you know exactly what you are doing...