## SQL-92 Demo

This notebook introduces the basics of read-only SQL-92 queries.

Copyright Jens Dittrich & Christian Schön, [Big Data Analytics Group](https://bigdata.uni-saarland.de/), [CC-BY-SA](https://creativecommons.org/licenses/by-sa/4.0/legalcode)

### Installation

This notebook requires the SQLite3 kernel for Jupyter Notebook. The advantage of this system is that SQLite is a very lightweight DBMS: it needs no complicated installation outside of Jupyter Notebook, yet it works well for small projects, databases, and queries.

For installation instructions for the kernel, please refer to the official [sqlite3-kernel GitHub repository](https://github.com/brownan/sqlite3-kernel).

Afterwards, you can simply switch the active kernel using the menu "Kernel"->"Change Kernel"->"Sqlite3".

Unfortunately, the SQLite kernel requires a Unix system and will therefore not work on Windows.
If you prefer working on Windows, you have to use a virtual machine or stick to the Python kernel and wrap your queries in Pandas.
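
For the Python route, a minimal sketch could look as follows. It uses Python's built-in `sqlite3` module instead of Pandas to stay dependency-free; the table `demo` and its rows are made up for illustration only. With Pandas installed, the same connection object could presumably be passed to `pandas.read_sql_query` to get a DataFrame instead of a list of tuples.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical demo table, used only for this sketch.
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

# Run a read-only query and fetch all result tuples.
rows = conn.execute(
    "SELECT id, value FROM demo WHERE value > 10 ORDER BY id"
).fetchall()
print(rows)  # [(2, 20), (3, 30)]

conn.close()
```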


### Database Schema

All examples below are taken from the following scenario:

You are running a photo agency which has several types of employees: seniors, salespersons, and photographers. 

Create schemas for all tables:

In [1]:
PRAGMA foreign_keys = OFF;

DROP TABLE IF EXISTS employees;
DROP TABLE IF EXISTS persons;
DROP TABLE IF EXISTS seniors;
DROP TABLE IF EXISTS salespersons;
DROP TABLE IF EXISTS photographers;
DROP TABLE IF EXISTS cameras;
DROP TABLE IF EXISTS photos;

PRAGMA foreign_keys = ON;

CREATE TABLE persons (
    id INTEGER PRIMARY KEY,
    lastname TEXT,
    firstname TEXT,
    birthday TEXT
);

CREATE TABLE employees (
    personId INTEGER PRIMARY KEY,
    salary INTEGER,
    experience INTEGER,
    FOREIGN KEY(personId) REFERENCES persons(id)
);

CREATE TABLE seniors (
    employeeId INTEGER PRIMARY KEY,
    numGreyHairs INTEGER,
    bonus INTEGER,
    FOREIGN KEY(employeeId) REFERENCES employees(personId)
);

CREATE TABLE salespersons (
    employeeId INTEGER PRIMARY KEY,
    areaOfExpertise TEXT,
    FOREIGN KEY(employeeId) REFERENCES employees(personId)
);

CREATE TABLE photographers (
    employeeId INTEGER PRIMARY KEY,
    FOREIGN KEY(employeeId) REFERENCES employees(personId)
);

CREATE TABLE cameras (
    id INTEGER PRIMARY KEY,
    brand TEXT,
    model TEXT
);

CREATE TABLE photos (
    id INTEGER PRIMARY KEY,
    location TEXT,
    unix_time INTEGER,
    photographerId INTEGER,
    cameraId INTEGER,
    FOREIGN KEY(photographerId) REFERENCES photographers(employeeId),
    FOREIGN KEY(cameraId) REFERENCES cameras(id)
);



Import the CSV data into these tables:

In [2]:
-- enable CSV mode:
.mode csv
-- import the necessary CSV-files as tables:
-- syntax: .import path/to/csv.file tablename
-- warning: execute this cell only once, otherwise the same rows are imported a second time!
.import data/photodb/persons.csv persons
.import data/photodb/employees.csv employees
.import data/photodb/seniors.csv seniors
.import data/photodb/salespersons.csv salespersons
.import data/photodb/photographers.csv photographers
.import data/photodb/cameras.csv cameras
.import data/photodb/photos.csv photos



In [3]:
-- enable pretty formatting of the output:
.mode columns
.headers on
.schema 

CREATE TABLE persons (
    id INTEGER PRIMARY KEY,
    lastname TEXT,
    firstname TEXT,
    birthday TEXT
);
CREATE TABLE employees (
    personId INTEGER PRIMARY KEY,
    salary INTEGER,
    experience INTEGER,
    FOREIGN KEY(personId) REFERENCES persons(id)
);
CREATE TABLE seniors (
    employeeId INTEGER PRIMARY KEY,
    numGreyHairs INTEGER,
    bonus INTEGER,
    FOREIGN KEY(employeeId) REFERENCES employees(personId)
);
CREATE TABLE salespersons (
    employeeId INTEGER PRIMARY KEY,
    areaOfExpertise TEXT,
    FOREIGN KEY(employeeId) REFERENCES employees(personId)
);
CREATE TABLE photographers (
    employeeId INTEGER PRIMARY KEY,
    FOREIGN KEY(employeeId) REFERENCES employees(personId)
);
CREATE TABLE cameras (
    id INTEGER PRIMARY KEY,
    brand TEXT,
    model TEXT
);
CREATE TABLE photos (
    id INTEGER PRIMARY KEY,
    location TEXT,
    unix_time INTEGER,
    photographerId INTEGER,
    cameraId INTEGER,
    FOREIGN KEY(photographerId) REFERENCES photographers(employeeId),
    FOREIGN KEY(cameraId) REFERENCES cameras(id)
);

In [4]:
-- show the complete table:
SELECT *
FROM employees;

personId    salary      experience
----------  ----------  ----------
1           45000       3         
2           37000       3         
3           50000       2         
4           60000       3         
5           55000       2         
6           15000       1         
7           50000       2         


In [5]:
-- show the complete table:
SELECT *
FROM seniors;

employeeId  numGreyHairs  bonus     
----------  ------------  ----------
1           45            34000     
2           457           40000     


In [6]:
SELECT 42;

42        
----------
42        


### Projection

In [7]:
-- projection for the attribute 'salary':
SELECT salary
FROM employees;

salary    
----------
45000     
37000     
50000     
60000     
55000     
15000     
50000     


First difference from relational algebra: SQL keeps **duplicates** unless told otherwise.

In [8]:
-- projection for the attribute 'salary', eliminating duplicates using 'DISTINCT'
SELECT DISTINCT salary
FROM employees;

salary    
----------
45000     
37000     
50000     
60000     
55000     
15000     
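
The effect of `DISTINCT` can also be checked by counting. The following sketch rebuilds the `employees` table from above in an in-memory database (using Python's built-in `sqlite3` module, which is an assumption of this example and not part of the notebook's kernel setup) and compares the number of salary values with and without duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (personId INTEGER PRIMARY KEY, salary INTEGER, experience INTEGER)"
)
# Same rows as in the employees table shown above.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 45000, 3), (2, 37000, 3), (3, 50000, 2), (4, 60000, 3),
     (5, 55000, 2), (6, 15000, 1), (7, 50000, 2)],
)

# Bag semantics: every salary value is counted, including the repeated 50000.
n_all = conn.execute("SELECT count(salary) FROM employees").fetchone()[0]
# Set semantics: duplicates are eliminated before counting.
n_distinct = conn.execute("SELECT count(DISTINCT salary) FROM employees").fetchone()[0]

print(n_all, n_distinct)  # 7 6
conn.close()
```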


In [9]:
-- projection for the attribute 'personId':
SELECT personid
FROM employees;

personId  
----------
1         
2         
3         
4         
5         
6         
7         


As personId is a key, adding DISTINCT would have no effect here.

### Sorting the Output with ORDER BY

In [10]:
-- projection for the attributes 'experience' and 'personId' using ascending order:
SELECT experience, personid
FROM employees
ORDER BY experience, personid ASC;

experience  personId  
----------  ----------
1           6         
2           3         
2           5         
2           7         
3           1         
3           2         
3           4         


### Selection/Filter

Not to be confused with SELECT (which projects the data, see above).

In [11]:
-- selection of/filter all employees with a salary of more than 50000:
SELECT *
FROM employees
WHERE salary>50000;

personId    salary      experience
----------  ----------  ----------
4           60000       3         
5           55000       2         


In [12]:
-- selection of all employees with a salary of more than 50000 and a personid > 4:
SELECT *
FROM employees
WHERE salary>50000 AND personid>4;

personId    salary      experience
----------  ----------  ----------
5           55000       2         


### Filtering Strings

To filter on string-types we can use the LIKE-operator:

1. percent symbol (%): represents zero, one, or multiple characters

2. underscore symbol (_): represents exactly one character


In [13]:
SELECT *
FROM persons

id          lastname    firstname   birthday  
----------  ----------  ----------  ----------
1           Schweitzer  Albert      1973-03-01
2           Carlos      Rob         1975-07-12
3           Mueller     Peter       1963-10-09
4           Zappa       Frank       1955-11-04
5           Taylor      Tim         1980-03-04
6           Wurst       Hans        1974-02-01
7           Miese       Peter       1983-05-06
8           Koenig      Dieter      1967-06-11


In [14]:
-- no match expected: '_ank' requires exactly four characters, but 'Frank' has five
SELECT *
FROM persons
WHERE firstname LIKE '_ank'



In [15]:
SELECT *
FROM persons
WHERE firstname LIKE '%ank'

id          lastname    firstname   birthday  
----------  ----------  ----------  ----------
4           Zappa       Frank       1955-11-04


In [16]:
SELECT *
FROM persons
WHERE firstname LIKE '_ete_'

id          lastname    firstname   birthday  
----------  ----------  ----------  ----------
3           Mueller     Peter       1963-10-09
7           Miese       Peter       1983-05-06


In [17]:
SELECT *
FROM persons
WHERE firstname LIKE '%ete_'

id          lastname    firstname   birthday  
----------  ----------  ----------  ----------
3           Mueller     Peter       1963-10-09
7           Miese       Peter       1983-05-06
8           Koenig      Dieter      1967-06-11


### Union

In [18]:
-- example for Union:
SELECT employeeId
FROM salespersons
  UNION
SELECT employeeId
FROM photographers

employeeId
----------
3         
4         
5         
6         
7         


### Difference

In [19]:
-- example for difference:
-- the set difference operator is 'EXCEPT' in standard SQL and in sqlite (Oracle calls it 'MINUS')
SELECT employeeId
FROM salespersons
  EXCEPT
SELECT employeeId
FROM photographers

employeeId
----------
4         
5         


### Cross Product

In [20]:
-- example for the cross product:
SELECT *
FROM employees, seniors;

personId    salary      experience  employeeId  numGreyHairs  bonus     
----------  ----------  ----------  ----------  ------------  ----------
1           45000       3           1           45            34000     
1           45000       3           2           457           40000     
2           37000       3           1           45            34000     
2           37000       3           2           457           40000     
3           50000       2           1           45            34000     
3           50000       2           2           457           40000     
4           60000       3           1           45            34000     
4           60000       3           2           457           40000     
5           55000       2           1           45            34000     
5           55000       2           2           457           40000     
6           15000       1           1           45            34000     
6           15000       1           2           457           40000     
7           50000       2           1           45            34000     
7           50000       2           2           457           40000     

In [21]:
.headers off
SELECT 'The cross product has: ', (SELECT COUNT(*) FROM employees, seniors) AS cnt, 'entries.';
.headers on

The cross product has:     14          entries.  


### Rename

In [22]:
SELECT 6*7 AS answerToEverything;

answerToEverything
------------------
42                


In [23]:
-- Building the cross product employees X employees using renaming:
SELECT e1.salary, e2.salary
FROM employees as e1, employees as e2;

salary      salary    
----------  ----------
45000       45000     
45000       37000     
45000       50000     
45000       60000     
45000       55000     
45000       15000     
45000       50000     
37000       45000     
37000       37000     
37000       50000     
37000       60000     
37000       55000     
37000       15000     
37000       50000     
50000       45000     
50000       37000     
50000       50000     
50000       60000     
50000       55000     
50000       15000     
50000       50000     
60000       45000     
60000       37000     
60000       50000     
60000       60000     
60000       55000     
60000       15000     
60000       50000     
55000       45000     
55000       37000     
55000       50000     
55000       60000     
55000       55000     
55000       15000     
55000       50000     
15000       45000     
15000       37000     
15000       50000     
15000       60000     
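15000       55000     
15000       15000     
15000       50000     
50000       45000     
50000       37000     
50000       50000     
50000       60000     
50000       55000     
50000       15000     
50000       50000     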

### Joins

In [24]:
-- Theta-Join, in this example an equi-join predicate, the explicit way:
SELECT *
FROM employees JOIN seniors
ON personid = employeeId;

personId    salary      experience  employeeId  numGreyHairs  bonus     
----------  ----------  ----------  ----------  ------------  ----------
1           45000       3           1           45            34000     
2           37000       3           2           457           40000     


In [25]:
-- Theta-Join, explicitly as INNER JOIN, the keyword "INNER" is redundant and can be left out:
SELECT *
FROM employees INNER JOIN seniors
ON personid = employeeId;

personId    salary      experience  employeeId  numGreyHairs  bonus     
----------  ----------  ----------  ----------  ------------  ----------
1           45000       3           1           45            34000     
2           37000       3           2           457           40000     


In [26]:
-- Left Outer Join:
SELECT *
FROM employees LEFT OUTER JOIN seniors
ON personid = employeeId;

personId    salary      experience  employeeId  numGreyHairs  bonus     
----------  ----------  ----------  ----------  ------------  ----------
1           45000       3           1           45            34000     
2           37000       3           2           457           40000     
3           50000       2                                               
4           60000       3                                               
5           55000       2                                               
6           15000       1                                               
7           50000       2                                               


In [27]:
-- Right Outer Join:
SELECT *
FROM employees RIGHT OUTER JOIN seniors
ON personid = employeeId;

Error: RIGHT and FULL OUTER JOINs are not currently supported


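**Ouch!** The SQLite version used here does not support RIGHT OUTER JOIN (support for RIGHT and FULL OUTER JOIN was only added in SQLite 3.39.0).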
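
On versions without RIGHT OUTER JOIN, one common workaround is to swap the two operands and use a LEFT OUTER JOIN instead. The sketch below illustrates this with Python's built-in `sqlite3` module and the data shown above; the aliases `e` and `s` and the explicit column list (which restores the column order of the original query) are choices made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (personId INTEGER PRIMARY KEY, salary INTEGER, experience INTEGER);
CREATE TABLE seniors (employeeId INTEGER PRIMARY KEY, numGreyHairs INTEGER, bonus INTEGER);
""")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 45000, 3), (2, 37000, 3), (3, 50000, 2), (4, 60000, 3),
     (5, 55000, 2), (6, 15000, 1), (7, 50000, 2)],
)
conn.executemany(
    "INSERT INTO seniors VALUES (?, ?, ?)",
    [(1, 45, 34000), (2, 457, 40000)],
)

# "employees RIGHT OUTER JOIN seniors" keeps every row of seniors,
# which is the same as "seniors LEFT OUTER JOIN employees".
rows = conn.execute("""
    SELECT e.personId, e.salary, e.experience, s.employeeId, s.numGreyHairs, s.bonus
    FROM seniors AS s LEFT OUTER JOIN employees AS e
    ON e.personId = s.employeeId
    ORDER BY s.employeeId
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

Since every senior in this data set has a matching employee, the result here coincides with the inner join; rows padded with NULLs would only appear for seniors without a matching employee.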

### Grouping and Aggregation

In [28]:
-- as there is no GROUP BY statement, the entire input is considered 
-- a single partition/group
-- this means the aggregate function is called only once 
-- count(*) calculates the number of tuples in the group:
SELECT count(*)
FROM employees;

count(*)  
----------
7         


In [29]:
-- combining several aggregation functions is no problem:
SELECT count(*), avg(salary)
FROM employees;

count(*)    avg(salary)     
----------  ----------------
7           44571.4285714286


In [30]:
-- now a real grouping with three different groups, followed by the aggregation:
SELECT experience, count(*), avg(salary)
FROM employees
GROUP BY experience;

experience  count(*)    avg(salary)
----------  ----------  -----------
1           1           15000.0    
2           3           51666.66666
3           3           47333.33333


In [31]:
SELECT *
FROM employees;

personId    salary      experience
----------  ----------  ----------
1           45000       3         
2           37000       3         
3           50000       2         
4           60000       3         
5           55000       2         
6           15000       1         
7           50000       2         


In [32]:
-- an attempt to output two attributes while grouping on only one of them:
SELECT experience, salary, count(*)
FROM employees
GROUP BY experience;

experience  salary      count(*)  
----------  ----------  ----------
1           15000       1         
2           50000       3         
3           60000       3         


**Ouch!** SQLite really should raise an error in this case (other database systems such as PostgreSQL do).

**Why?** The semantics of this SQL query are ambiguous!
The attribute 'salary' potentially has different values within each group, so it is unclear which of them should be output. For this reason:
In the SELECT clause, only attributes that occur in the GROUP BY clause may be used without an aggregation function.
In this example, only the attribute 'experience' may be used without an aggregation function; every other attribute must be wrapped in one!
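
One way to make such a query well-defined while still grouping only by 'experience' is to wrap 'salary' in aggregation functions. The following sketch (run through Python's built-in `sqlite3` module on the employees data shown above; the choice of `min` and `max` is just one possible illustration) outputs the salary range per group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (personId INTEGER PRIMARY KEY, salary INTEGER, experience INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 45000, 3), (2, 37000, 3), (3, 50000, 2), (4, 60000, 3),
     (5, 55000, 2), (6, 15000, 1), (7, 50000, 2)],
)

# 'salary' only appears inside aggregation functions, so each group
# yields exactly one well-defined output tuple.
rows = conn.execute("""
    SELECT experience, min(salary), max(salary), count(*)
    FROM employees
    GROUP BY experience
    ORDER BY experience
""").fetchall()

for row in rows:
    print(row)
# (1, 15000, 15000, 1)
# (2, 50000, 55000, 3)
# (3, 37000, 60000, 3)
conn.close()
```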

In [33]:
-- this example is valid and has well-defined semantics:
SELECT experience, salary, count(*)
FROM employees
GROUP BY experience, salary;

experience  salary      count(*)  
----------  ----------  ----------
1           15000       1         
2           50000       2         
2           55000       1         
3           37000       1         
3           45000       1         
3           60000       1         


### Grouping and Aggregation with HAVING

In [34]:
-- now grouping with HAVING:
SELECT experience, count(*), avg(salary)
FROM employees
WHERE salary > 40000
GROUP BY experience
HAVING avg(salary) > 52000;

experience  count(*)    avg(salary)
----------  ----------  -----------
3           2           52500.0    


### Conceptual Order of HAVING

The same query with HAVING rolled out in a step-by-step manner according to the conceptual order of execution:

In [35]:
-- 1. FROM: all employees
SELECT * 
FROM employees

personId    salary      experience
----------  ----------  ----------
1           45000       3         
2           37000       3         
3           50000       2         
4           60000       3         
5           55000       2         
6           15000       1         
7           50000       2         


In [36]:
-- 2. WHERE: selection of tuples with salary > 40000
SELECT * 
FROM employees
WHERE salary > 40000

personId    salary      experience
----------  ----------  ----------
1           45000       3         
3           50000       2         
4           60000       3         
5           55000       2         
7           50000       2         


In [37]:
-- 3. GROUP BY: build groups based on "experience"
-- 5. compute the aggregation functions count(*) and avg(salary) for each group
SELECT experience, count(*), avg(salary) 
FROM employees
WHERE salary > 40000
GROUP BY experience
-- let's call this query Q1

experience  count(*)    avg(salary)     
----------  ----------  ----------------
2           3           51666.6666666667
3           2           52500.0         


In [38]:
-- 3. GROUP BY: build groups based on "experience"
-- 4. HAVING: only output groups with avg(salary) > 52000
-- 5. compute the aggregation functions count(*) and avg(salary) for each group
SELECT experience, count(*), avg(salary) 
FROM employees
WHERE salary > 40000
GROUP BY experience
HAVING avg(salary) > 52000;

experience  count(*)    avg(salary)
----------  ----------  -----------
3           2           52500.0    


### Uncorrelated Subqueries

We could also explain the semantics of HAVING using a so-called (uncorrelated) subquery:

In [39]:
-- recall Q1 from above
SELECT experience, count(*), avg(salary) 
FROM employees
WHERE salary > 40000
GROUP BY experience

experience  count(*)    avg(salary)     
----------  ----------  ----------------
2           3           51666.6666666667
3           2           52500.0         


We could rewrite this to:

In [40]:
SELECT *
FROM (
    SELECT experience, count(*), avg(salary) 
    FROM employees
    WHERE salary > 40000
    GROUP BY experience
    );

experience  count(*)    avg(salary)     
----------  ----------  ----------------
2           3           51666.6666666667
3           2           52500.0         


Here Q1 is a subquery (or inner query). Conceptually you can read this as: Q1 produces a relation and that relation is then used as any other relation in the FROM clause.

Now, we can filter on particular groups and thus simulate the effect of HAVING:

In [41]:
SELECT *
FROM (
    SELECT experience, count(*), avg(salary) AS avg_salary
    FROM employees
    WHERE salary > 40000
    GROUP BY experience
    )
WHERE avg_salary > 52000;

experience  count(*)    avg_salary
----------  ----------  ----------
3           2           52500.0   
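
The claimed equivalence can be checked mechanically. The sketch below, which assumes Python's built-in `sqlite3` module and the employees data shown above, runs the HAVING query and the subquery variant and compares their results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (personId INTEGER PRIMARY KEY, salary INTEGER, experience INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 45000, 3), (2, 37000, 3), (3, 50000, 2), (4, 60000, 3),
     (5, 55000, 2), (6, 15000, 1), (7, 50000, 2)],
)

# Variant 1: filter the groups with HAVING.
having_rows = conn.execute("""
    SELECT experience, count(*), avg(salary)
    FROM employees
    WHERE salary > 40000
    GROUP BY experience
    HAVING avg(salary) > 52000
""").fetchall()

# Variant 2: compute all groups in a subquery, then filter with WHERE.
subquery_rows = conn.execute("""
    SELECT *
    FROM (
        SELECT experience, count(*), avg(salary) AS avg_salary
        FROM employees
        WHERE salary > 40000
        GROUP BY experience
    )
    WHERE avg_salary > 52000
""").fetchall()

print(having_rows == subquery_rows)  # True
print(having_rows)                   # [(3, 2, 52500.0)]
conn.close()
```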


### Views

Subqueries can become very hard to read. In general, SQL statements can grow quite large, and large parts of them often repeat things we specify over and over again. We therefore recommend breaking SQL statements up into building blocks wherever possible to improve readability. This can be done using views.

In [42]:
-- delete this view if it already exists:
DROP VIEW IF EXISTS HighlyPaidEmployees;

-- create a view
CREATE VIEW HighlyPaidEmployees AS
    SELECT experience, count(*), avg(salary) AS avg_salary
    FROM employees
    WHERE salary > 40000
    GROUP BY experience;



Notice that a view definition **does not execute any query**; it is merely an alias for an SQL statement. A view can then be used just like any other relation, and only then is it executed:

In [43]:
SELECT *
FROM HighlyPaidEmployees
WHERE avg_salary > 52000;

experience  count(*)    avg_salary
----------  ----------  ----------
3           2           52500.0   
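
That a view is evaluated only when it is queried can be observed by changing the underlying table between two reads of the view. The following sketch assumes Python's built-in `sqlite3` module; the additional employee with id 8 and the column alias `cnt` are hypothetical and exist only for this demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (personId INTEGER PRIMARY KEY, salary INTEGER, experience INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 45000, 3), (2, 37000, 3), (3, 50000, 2), (4, 60000, 3),
     (5, 55000, 2), (6, 15000, 1), (7, 50000, 2)],
)

# Defining the view stores the query text, not a result.
conn.execute("""
    CREATE VIEW HighlyPaidEmployees AS
        SELECT experience, count(*) AS cnt, avg(salary) AS avg_salary
        FROM employees
        WHERE salary > 40000
        GROUP BY experience
""")

query = "SELECT experience, cnt FROM HighlyPaidEmployees ORDER BY experience"
before = conn.execute(query).fetchall()

# Hypothetical new employee, added after the view was defined.
conn.execute("INSERT INTO employees VALUES (8, 70000, 3)")
after = conn.execute(query).fetchall()

print(before)  # [(2, 3), (3, 2)]
print(after)   # [(2, 3), (3, 3)]
conn.close()
```

The second read already reflects the new row, which it could not do if the view had been computed once at definition time.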


If you want the result of a view to be computed in advance and stored, you can use **materialized views**. Some systems offer this functionality.

**General recommendation:** do not use materialized views unless you know exactly what you are doing...

**Takeaway message:** sqlite3 is a useful tool for simple SQL queries within notebooks, such as SPJ queries (select, project, join). SQL in general, however, is much more powerful than the subset implemented by SQLite. For more serious applications (large databases, many tables, complex queries, data curation, etc.), use a more feature-complete DBMS such as PostgreSQL or a commercial DBMS.

## Exercises

In these exercises, we want to further query the photo database.

### Query 1

First, you should find out how many photos each photographer took. Your output should contain the following information:
1. The ID of the photographer as "PhotographerID",
2. The first name of the photographer as "FirstName",
3. The last name of the photographer as "LastName",
4. The number of photos the photographer took as "Photos".

In [44]:
-- drop view if already existing
DROP VIEW IF EXISTS q1_student;
-- create view based on student query
CREATE VIEW q1_student AS
-- insert your query here
-- ...



In [45]:
-- TEST
-- define result table
DROP TABLE IF EXISTS q1;
CREATE TABLE q1 (
    PhotographerID INTEGER PRIMARY KEY,
    FirstName TEXT,
    LastName TEXT,
    Photos INTEGER
);

-- import query results
.mode csv
.import data/photodb/tests/q1.csv q1
.mode columns
-- compare query results
SELECT *
FROM (SELECT * FROM q1
      EXCEPT
      SELECT * FROM q1_student)
UNION
SELECT *
FROM (SELECT * FROM q1_student
      EXCEPT
      SELECT * FROM q1);
-- We expect an empty result.
-- Note that this test compares the resulting tuples and does not ensure that your query is semantically correct.
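
To see how this test reacts to a wrong answer, the following sketch applies the same pattern to two tiny tables. It assumes Python's built-in `sqlite3` module; the tables `expected` and `student` and their rows are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE expected (id INTEGER, n INTEGER);
CREATE TABLE student (id INTEGER, n INTEGER);
INSERT INTO expected VALUES (1, 10), (2, 20);
INSERT INTO student VALUES (1, 10), (2, 99);
""")

# Symmetric difference: tuples that occur in exactly one of the two tables.
diff = conn.execute("""
    SELECT * FROM (SELECT * FROM expected EXCEPT SELECT * FROM student)
    UNION
    SELECT * FROM (SELECT * FROM student EXCEPT SELECT * FROM expected)
    ORDER BY 1, 2
""").fetchall()

print(diff)  # [(2, 20), (2, 99)]
conn.close()
```

Both the missing tuple and the wrong tuple show up. Because EXCEPT and UNION work on sets, a result that differs from the expected one only in the number of duplicates would presumably pass this check unnoticed.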



### Query 2

Finally, we want to determine how many photos were taken in each year by camera brand. Do not consider cameras made by Nikon. Your output should consist of the following attributes. Sort your output in descending order by the number of photos and in ascending order by the year.
1. The year of the photos as "Year",
2. The brand of the cameras as "Brand",
3. The number of photos taken as "Amount".

In [46]:
-- drop view if already existing
DROP VIEW IF EXISTS q2_student;
-- create view based on student query
CREATE VIEW q2_student AS
-- insert your query here
-- ...



In [47]:
-- TEST
-- define result table
DROP TABLE IF EXISTS q2;
CREATE TABLE q2 (
    Year TEXT,
    Brand TEXT,
    Amount INTEGER,
    PRIMARY KEY (Year, Brand)
);

-- import query results
.mode csv
.import data/photodb/tests/q2.csv q2
.mode columns
-- compare query results
SELECT *
FROM (SELECT * FROM q2
      EXCEPT
      SELECT * FROM q2_student)
UNION
SELECT *
FROM (SELECT * FROM q2_student
      EXCEPT
      SELECT * FROM q2);
-- We expect an empty result.
-- Note that this test compares the resulting tuples and does not ensure that your query is semantically correct.

