# SSC Data Science and Analytics Workshop 2022

## Intro to Databases in Industry: Data Cleaning, Querying, and Modeling at Scale

### Introduction to SQL

Welcome! Over the next hour, I'll introduce some of the fundamentals of querying databases using the SQL language.

- Why use databases
- The relational model
- Query languages, SQL, Postgres
- How to run SQL
- Basic SQL queries

## Why not spreadsheets?

At this point in MDS, you have a good idea of why spreadsheet software like Excel or its equivalents is not suitable for most data science purposes.

Pandas:
- reproducible
- range of functionalities
- scalable
- fast
- can be automated

## Why not Pandas?

But Pandas was pretty nice and powerful, wasn't it? Let's see.

Think about what happens if:

- Your dataframe is 100 GB in size
- Multiple people want to use and make changes to the dataset simultaneously
- You want to be able to manage what each user can do
- You want to be able to let different users see different parts of the dataset
- You don't want to store everything in one place
- You want to restrict the kind of data to be stored
- The dataset file is corrupted
- The system crashes halfway through making a change
- You want to optimize access to your data
- ...


## Databases and database management systems

You guessed it right! A **database management system (DBMS)** addresses all of the above problems.

**What is a database?**
A database is an organized collection of related data.

**What is a database management system?**
A DBMS is a collection of programs that enables users to create, query, modify and manage a database in an optimized and efficient manner. A DBMS relieves us from worrying about how the data is stored and managed.

Using a DBMS ensures:
- Data independence
- Efficient data access
- Data integrity
- Data security
- Concurrent access
- Crash recovery
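
Two of these guarantees (data integrity and recovering cleanly from a change that fails halfway) can be sketched in a few lines. This is only an illustration: it assumes Python's built-in `sqlite3` module as a stand-in for a full DBMS server, and the `accounts` table and its values are made up.

```python
import sqlite3

# In-memory SQLite database: a small stand-in for a real DBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE accounts (
        owner   TEXT PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)  -- integrity rule
    )
    """
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    # `with conn` treats the statements as one transaction:
    # commit if everything succeeds, roll back if an exception is raised.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE owner = 'bob'")
        # This would make alice's balance negative and violates the CHECK rule.
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE owner = 'alice'")
except sqlite3.IntegrityError as err:
    print("Transfer rejected:", err)

# Bob's update was rolled back together with the failed one.
balances = dict(conn.execute("SELECT owner, balance FROM accounts"))
print(balances)
```

The half-finished transfer leaves no trace: both balances are exactly what they were before the transaction started.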

There are different types of DBMS for different kinds of data:

- Relational (most widely used)
- Document
- Hierarchical
- Network
- Object-oriented
- Graph

---

**Remember:**
    
database $\ne$ database management system

---

## The relational model

### Why the relational model?

Take a moment and think about the kind of problems that you may run into if you choose to store data in a single table.

<img src="./img/lecture1/table.png" width="800">

The most famous data model today is the relational model, although other models have also gained traction in the past few years.

The relational model works with **entities** and **relationships**. It is based on set theory in mathematics and was introduced by Edgar Codd (IBM) in 1970 ([more details here](https://en.wikipedia.org/wiki/Relational_model)). Its foundation in **set theory** is the reason you will hear words like "tuples", "domain", "union", "cross product", etc.

---

**Example:**

Entities:
- students in a school
- employees of an organization
- cars of a rental company
- houses in a city

Relationships:
- students to a department
- purchases to customers
- movies to actors
- customers to a bank
    
---

In a relational model, entities and relationships are both **sets of tuples** called **relations**. These relations are represented as **tables** with rows and columns.

**What is a relational database?** A collection of relations

**Relations**: made up of two parts: A schema and an instance

**Schema**: specifies
1. Name of a relation
2. Name and domain of each attribute

**Domain**: A set of constraints that determines the type, length, format, range, uniqueness and nullability of values stored for an attribute.

---

**Example:**

Student (**sid**: _string_, **name**: _string_, **login**: _string_, **age**: _integer_, **gpa**: _real_)

---

**Instance**: a particular relation that follows a certain schema

**Relational Database Schema**: collection of schemas in the database

**Database Instance**: a collection of instances of its relations
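
To make the schema/instance distinction concrete, here is a small sketch that declares the `Student` schema from the example above and then inspects both parts. It assumes Python's built-in `sqlite3` module rather than Postgres, and `PRAGMA table_info` is SQLite-specific; note also that SQLite enforces column types far more loosely than Postgres does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema: the relation's name plus the name and type of each attribute.
conn.execute(
    """
    CREATE TABLE student (
        sid   TEXT PRIMARY KEY,
        name  TEXT NOT NULL,
        login TEXT,
        age   INTEGER,
        gpa   REAL
    )
    """
)

# Instance: the set of tuples currently stored under that schema.
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?, ?)",
    [
        ("23792", "Arman", "arman@mds", 28, 2.5),
        ("82347", "Varada", "varada@mds", 29, 2.9),
    ],
)

# PRAGMA table_info returns one row per attribute:
# (cid, name, type, notnull, dflt_value, pk)
schema = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(student)")]
instance = conn.execute("SELECT * FROM student ORDER BY sid").fetchall()

print(schema)
print(instance)
```

The schema stays fixed while the instance changes every time rows are inserted, updated, or deleted.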

### Anatomy of a table

<img src="img/lecture1/table_anatomy.png" width="700">

## Query language in a DBMS

**What is a query?** A question that we ask about the data. The result of a query is a new relation.

In order to talk to the database and ask questions, we need to speak its language. A DBMS
- provides a specialized language for us to write our queries
- optimizes how our queries are executed

## What is SQL?

Well, it's finally time to learn about SQL!

- SQL stands for Structured Query Language ([or... does it?](https://en.wikipedia.org/wiki/SQL#History)).
- It is a programming language that we use to talk to a relational DBMS.
- It was originally developed by IBM in the 1970s to manipulate and retrieve data stored in their DBMS, System R.
- SQL ≠ relational model ≠ database ≠ DBMS

### A peek at SQL queries

Suppose that we have the following table (relation) in our database, and 

> we want to retrieve the names, ages, and GPAs of students older than 25.

|  sid  | name      | login      | age | gpa |
|-------|-----------|------------|-----|-----|
| 23792 | Arman     | arman@mds  | 28  | 2.5 |
| 82347 | Varada    | varada@mds | 29  | 2.9 |
| 11238 | Tiffany   | tiff@mds   | 23  | 2.8 |
| 87263 | Mike      | mike@mds   | 19  | 3.8 |
| 13298 | Joel      | joel@mds   | 25  | 3.2 |
| 91287 | Florencia | flor@mds   | 20  | 3.3 |

We can write this as the following SQL query:

```sql
SELECT
    name, age, gpa
FROM
    Students
WHERE
    age > 25;
```

Running the above query should return this relation:

| name   | age | gpa |
|--------|-----|-----|
| Arman  | 28  | 2.5 |
| Varada | 29  | 2.9 |

### SQL syntax

Let's dissect the different parts of our SQL query here:

```sql
SELECT
    name, age, gpa
FROM
    Students
WHERE
    age > 25;
```

A SQL statement consists of keywords, clauses, identifiers, a terminating semicolon, and sometimes comments, which together form a complete, executable, and independent piece of code.

```sql
SELECT
```
- `SELECT` is a **keyword** that appears in every SQL query. It is used to select and return data from columns, given the conditions that follow it.

- `SELECT` is very powerful, but not dangerous: A `SELECT` statement never changes any values or tables in the database.

- The fact that we select only a few columns (instead of all of them) is called **projection** in database terms.

```sql
name, age, gpa
Students
```

- These are called **identifiers**, and refer to the labels of columns and tables that exist in the database.

```sql
FROM
```
- This is another keyword that tells SQL which relation (i.e. table) to retrieve the columns from.

```sql
WHERE
```
- Yet another SQL keyword that is used to place a condition on the returned values.

- We can also have comments in a SQL query by preceding text with `--`:

```sql
-- Hey, I'm a comment!
-- ===========================
SELECT
    name, age, gpa  -- column names
FROM
    Students        -- table name
WHERE
    age > 25;       -- condition
```

- Block comments are also possible by enclosing comment lines in `/*` and `*/`:

```sql
/*
This is our first SQL query, and we
are learning about the following keywords:
SELECT
FROM
WHERE
*/

SELECT
    name, age, gpa
FROM
    Students
WHERE
    age > 25;
```

- Don't forget that every SQL statement needs to be terminated with a `;`.

- SQL keywords are traditionally written in upper case letters, but that is not a requirement. I prefer to follow this tradition because it makes the query more readable.

- A keyword together with the identifiers, expressions, etc. that follow it forms a **clause**. For example:

```sql
SELECT
    name, age, gpa  -- columns are chosen here
FROM
    Students        -- table is specified here
WHERE
    age > 25;       -- filter is applied here
```

- It is common to put each clause or each keyword on a different line, but there is no generally agreed-upon style.

- In general, it doesn't matter whether the entire SQL statement is on one line or broken over several lines. Anything that comes before a `;` belongs to the same statement.

There are many other keywords that we will use throughout DSCI 513. The ones that you just saw are a few that are commonly used when querying data.

> Note that SQL **is not imperative** (like Python or C++); it is a **declarative** language: We don't tell SQL **how** to retrieve data, but **what** to retrieve. For instance, we didn't write a for loop to retrieve the data from each row according to a certain condition. We told SQL what we wanted, and SQL did it for us.
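
The contrast can be seen side by side in the sketch below: the loop spells out how to find the rows, while the SQL statement only states what is wanted. It assumes Python's built-in `sqlite3` module as the SQL engine and uses a shortened, illustrative copy of the students table.

```python
import sqlite3

students = [
    ("Arman", 28, 2.5),
    ("Varada", 29, 2.9),
    ("Tiffany", 23, 2.8),
    ("Mike", 19, 3.8),
]

# Imperative: we describe *how* to walk the rows and test each one.
imperative = []
for name, age, gpa in students:
    if age > 25:
        imperative.append((name, gpa))

# Declarative: we describe *what* we want; the engine decides how to get it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT, age INTEGER, gpa REAL)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)", students)
declarative = conn.execute(
    "SELECT name, gpa FROM Students WHERE age > 25"
).fetchall()

print(sorted(imperative) == sorted(declarative))
```

Both approaches give the same rows here, but only the declarative version leaves the DBMS free to choose an execution strategy, for example by using an index instead of scanning every row.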

### Flavours of SQL

- SQL is not owned by a particular company or organization
- It was adopted as a database language standard by the American National Standards Institute (ANSI) in 1986, and by the International Organization for Standardization (ISO) in 1987.
- However, there are various SQL flavors and implementations, such as Oracle SQL, MySQL (open source), PostgreSQL (open source), IBM DB2, Microsoft SQL Server, Microsoft Access, SQLite (open source)
- These implementations have slightly different syntax and various additional features.
- In DSCI 513, we use **PostgreSQL**

<img src="img/lecture1/flavours_sql.png" width="700">

## What is PostgreSQL?

[PostgreSQL](https://www.postgresql.org/about/) (also known simply by its nickname _Postgres_) is an open-source, cross-platform DBMS that implements the relational model. PostgreSQL is very reliable with great performance characteristics, and is equipped with almost all features of the commercial and proprietary DBMSs.

PostgreSQL appeared in the 1980s as a research project at the University of California, Berkeley. It was meant to improve an earlier prototype relational DBMS called INGRES, which explains the name Postgres, which is short for PostINGRES. [Here](https://medium.com/launch-school/a-brief-history-of-postgresql-36d8d392c611) is an informative blog post about PostgreSQL's history if you're interested!

## The client-server model

Similar to most other DBMSs, Postgres works based on a **client-server** model. In this model:

- The DBMS along with its databases and data are all stored on a host computer where the database server resides. This is typically a powerful machine with high processing power and large storage.
- Client hosts are usually personal computers with GUIs that can connect to a database server to access the data.

In this model, the clients and the server are connected over a network. The heavy-lifting of processing, managing and storing large amounts of data is done by the server host, and clients only retrieve the data that they need.

<img src="img/lecture1/client-server.png" width="500">

> Although sometimes used interchangeably, there is a difference between a **client/server** and a **client/server host**. A host is a device, whereas a client/server is a piece of software. For example, you can simultaneously have multiple client programs connected to a remote database. Similarly, a remote host is a device (i.e. a computer) that might have several server programs running concurrently.

The idea of client-server models for databases has become the standard of computation and storage today, known as **cloud computing**:

- Today we rarely store movie or music files on our computers. This is why most of us have laptops with only 256/512 GB of space: most of what would take up space is already provided as a cloud service (e.g. Netflix, Spotify, YouTube), or is stored on cloud storage spaces (e.g. OneDrive, Dropbox, Google Drive).
- We rarely run production-stage computation-intensive jobs on our own computers. All such computations are done on cloud-computing services (e.g. Google Cloud Platform, Amazon Web Services, Microsoft Azure). I personally haven't run a single simulation code on my own computer, nor have I ever stored any raw data locally. I use my computer mainly as an interface to access the services that I want.

> Note that there are certain situations where one might want to **locally** benefit from the advantages of storing data in a database. A relational database engine that works only with local databases is SQLite. If you're curious to find out the use cases for **SQLite**, take a look [here](https://www.sqlite.org/whentouse.html).

Whenever we use Postgres (or any other client-server DBMS), the first step before anything else is to **connect** to the database server. This is why we will talk about _host address_, _port_, _username_, and _password_ when we try to use a database.

## How to run SQL in Postgres?

Well, we have a variety of options to run our SQL statements in PostgreSQL:

- pgAdmin is the official web-based GUI for interacting with PostgreSQL databases
- `psql` is PostgreSQL's interactive command-line interface
- `%sql` and `%%sql` magic commands in Jupyter notebooks, which are provided by the `ipython-sql` package
- `psycopg2` is a widely used Python adapter for PostgreSQL databases
- Using the `read_sql_query()` function in Pandas

I will demonstrate the usage of the `%sql` and `%%sql` magics provided by the `ipython-sql` package here.

### `psql`

This is PostgreSQL's command-line tool that allows us to interactively run SQL statements as well as "meta" commands. I introduce a couple of useful `psql` meta commands here, but you can find all the other ones in the Postgres documentation [here](https://www.postgresql.org/docs/current/app-psql.html) or a shorter version in this [cheatsheet](http://www.postgresonline.com/downloads/special_feature/postgresql83_psql_cheatsheet.pdf).

| Command | Usage                                         |
|---------|-----------------------------------------------|
| `\l`    | list all databases                            |
| `\c`    | connect to a database                         |
| `\cd`   | change directory                              |
| `\!`    | execute shell commands                        |
| `\i`    | execute commands from file                    |
| `\d`    | list tables and views                         |
| `\d+`   | list tables and views with additional info    |
| `\dt`   | list tables                                   |
| `\dt+`  | list tables with additional info              |
| `\h`    | view help on SQL commands                     |
| `\?`    | view help on psql meta commands               |
| `\q`    | quit interactive shell                        |

> Note that you don't need to terminate meta commands with `;`.

### `ipython-sql` (`%sql` and `%%sql`)

`ipython-sql` is a package that enables us to run SQL statements right from a Jupyter notebook. This package is included in the `dsci513env.yaml` environment file, so you should have it installed in your conda environment. In order to use it, we should load it first:

In [1]:
%load_ext sql

Now we need the host address of where the database is stored, along with a username and a password.

It is always a bad idea to store login information directly in a notebook or code file, for security reasons. For example, you don't want to commit your sensitive login information to a Git repo.

In order to avoid that, we store that kind of information in a separate file, like `credentials.json` here, and read the username and password into our IPython session:

In [2]:
import json
import urllib.parse

with open('data/credentials.json') as f:
    login = json.load(f)
    
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']

And also make sure to add your file name (e.g. `credentials.json`) to your `.gitignore` file, so you don't accidentally commit it.
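
The reason for passing the password through `urllib.parse.quote` is that characters such as `@`, `:` or `/` have a special meaning inside a connection URL. The sketch below shows the effect on a made-up password; the credentials are hypothetical and hard-coded only for illustration. Note that `quote` leaves `/` untouched by default, so this sketch passes `safe=""` to escape it as well, which is a small deviation from the cell above.

```python
from urllib.parse import quote

# Hypothetical credentials, hard-coded here for illustration only.
login = {
    "user": "postgres",
    "password": "p@ss:w/rd",
    "host": "localhost",
    "port": 5432,
}

# Percent-encode every reserved character, including "/".
password = quote(login["password"], safe="")

url = (
    f"postgresql://{login['user']}:{password}"
    f"@{login['host']}:{login['port']}/world_dsci513"
)
print(url)  # postgresql://postgres:p%40ss%3Aw%2Frd@localhost:5432/world_dsci513
```

Without the encoding, the `@` inside the password would be mistaken for the separator between the credentials and the host name.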

Now we can establish the connection to the `world_dsci513` database using the following code:

In [3]:
%sql postgresql://{username}:{password}@{host}:{port}/world_dsci513

'Connected: postgres@world_dsci513'

Note that we have used the `%sql` line magic to interpret the rest of the line as a SQL command. This is similar to the `%timeit` magic that we used in DSCI 511.

We can also use `%%sql` cell magic to apply the magic to an entire notebook cell.

A limited number of `psql` meta commands (e.g. `\l`, `\d`) can also be executed here. This is made possible through the `pgspecial` package. For example, let's list all databases that exist on our PostgreSQL server:

In [5]:
%sql \l

 * postgresql://postgres:***@localhost:5432/world_dsci513
4 rows affected.


Name,Owner,Encoding,Collate,Ctype,Access privileges
postgres,postgres,UTF8,C,C,
template0,postgres,UTF8,C,C,=c/postgres postgres=CTc/postgres
template1,postgres,UTF8,C,C,=c/postgres postgres=CTc/postgres
world_dsci513,postgres,UTF8,C,C,


Or list the relations (i.e. tables) in the current database:

In [6]:
%sql \d

 * postgresql://postgres:***@localhost:5432/world_dsci513
3 rows affected.


Schema,Name,Type,Owner
public,city,table,postgres
public,country,table,postgres
public,countrylanguage,table,postgres


Let's run some SQL statements now. Let's retrieve the `name` and `population` columns from the `country` table:

In [6]:
%sql SELECT name, population FROM country;

 * postgresql://postgres:***@localhost:5432/world_dsci513
239 rows affected.


name,population
Afghanistan,22720000
Netherlands,15864000
Netherlands Antilles,217000
Albania,3401200
Algeria,31471000
American Samoa,68000
Andorra,78000
Angola,12878000
Anguilla,8000
Antigua and Barbuda,68000


#### Limiting returned and displayed rows

As you can see, all rows are returned and displayed by default. This behaviour can be problematic if our table is very large for two reasons:
1. Retrieving large tables can be slow, and maybe not necessary
2. Displaying a lot of rows clutters our Jupyter notebook

We can modify the `ipython-sql` configuration to limit the number of returned and displayed rows. For example, here we change the display limit:

In [8]:
%config SqlMagic.displaylimit = 20

In [9]:
%sql SELECT name, population FROM country;

 * postgresql://postgres:***@localhost:5432/world_dsci513
239 rows affected.


name,population
Afghanistan,22720000
Netherlands,15864000
Netherlands Antilles,217000
Albania,3401200
Algeria,31471000
American Samoa,68000
Andorra,78000
Angola,12878000
Anguilla,8000
Antigua and Barbuda,68000


Looks good. Let's apply the magic to an entire cell so that we can break the lines:

In [10]:
%%sql

SELECT
    name, population
FROM
    country
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
239 rows affected.


name,population
Afghanistan,22720000
Netherlands,15864000
Netherlands Antilles,217000
Albania,3401200
Algeria,31471000
American Samoa,68000
Andorra,78000
Angola,12878000
Anguilla,8000
Antigua and Barbuda,68000


We can use `*` to retrieve all columns:

In [11]:
%%sql

SELECT
    *
FROM
    country
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
239 rows affected.


code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
AFG,Afghanistan,Asia,Southern and Central Asia,652090.0,1919.0,22720000,45.9,5976.0,,Afganistan/Afqanestan,Islamic Emirate,Mohammad Omar,1,AF
NLD,Netherlands,Europe,Western Europe,41526.0,1581.0,15864000,78.3,371362.0,360478.0,Nederland,Constitutional Monarchy,Beatrix,5,NL
ANT,Netherlands Antilles,North America,Caribbean,800.0,,217000,74.7,1941.0,,Nederlandse Antillen,Nonmetropolitan Territory of The Netherlands,Beatrix,33,AN
ALB,Albania,Europe,Southern Europe,28748.0,1912.0,3401200,71.6,3205.0,2500.0,Shqipëria,Republic,Rexhep Mejdani,34,AL
DZA,Algeria,Africa,Northern Africa,2381741.0,1962.0,31471000,69.7,49982.0,46966.0,Al-Jazair/Algérie,Republic,Abdelaziz Bouteflika,35,DZ
ASM,American Samoa,Oceania,Polynesia,199.0,,68000,75.1,334.0,,Amerika Samoa,US Territory,George W. Bush,54,AS
AND,Andorra,Europe,Southern Europe,468.0,1278.0,78000,83.5,1630.0,,Andorra,Parliamentary Coprincipality,,55,AD
AGO,Angola,Africa,Central Africa,1246700.0,1975.0,12878000,38.3,6648.0,7984.0,Angola,Republic,José Eduardo dos Santos,56,AO
AIA,Anguilla,North America,Caribbean,96.0,,8000,76.1,63.2,,Anguilla,Dependent Territory of the UK,Elisabeth II,62,AI
ATG,Antigua and Barbuda,North America,Caribbean,442.0,1981.0,68000,70.5,612.0,584.0,Antigua and Barbuda,Constitutional Monarchy,Elisabeth II,63,AG


## More SQL commands

### `DISTINCT`

The `DISTINCT` keyword is used to return only distinct rows from a table, and ignore duplicates:

```sql
SELECT
    DISTINCT column1, column2, ...
FROM
    table1;
```

Note that `DISTINCT` is applied to **all columns** that we list after `SELECT`, and returns all distinct combinations of values stored in those columns. In the above code snippet, columns other than `column1` and `column2` can still have duplicate values.
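
The difference between applying `DISTINCT` to one column and to a combination of columns can be checked on a tiny table. This is a sketch that assumes Python's built-in `sqlite3` module as the engine and a four-row, simplified version of the `country` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT, continent TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO country VALUES (?, ?, ?)",
    [
        ("Albania", "Europe", "Southern Europe"),
        ("Andorra", "Europe", "Southern Europe"),
        ("Netherlands", "Europe", "Western Europe"),
        ("Algeria", "Africa", "Northern Africa"),
    ],
)

# One column: each continent appears once.
continents = conn.execute("SELECT DISTINCT continent FROM country").fetchall()

# Two columns: each (continent, region) *combination* appears once,
# so 'Europe' is repeated because it pairs with two different regions.
pairs = conn.execute("SELECT DISTINCT continent, region FROM country").fetchall()

print(sorted(continents))
print(sorted(pairs))
```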

In [25]:
%%sql

SELECT
    DISTINCT continent
FROM
    country
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
7 rows affected.


continent
Asia
South America
North America
Oceania
Antarctica
Africa
Europe


In [26]:
%%sql

SELECT
    DISTINCT continent, region
FROM
    country
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
25 rows affected.


continent,region
Oceania,Melanesia
Oceania,Australia and New Zealand
North America,Central America
Africa,Northern Africa
Asia,Eastern Asia
Oceania,Polynesia
Europe,Nordic Countries
Asia,Middle East
Oceania,Micronesia/Caribbean
Europe,Baltic Countries


### `DISTINCT ON`

`DISTINCT ON` is not standard SQL, but a useful Postgres extension which allows us to return distinct rows based only on the values of the column(s) listed in parentheses (whereas `DISTINCT` applies to all selected columns).

```sql
SELECT
    DISTINCT ON (column1) column1, column2
FROM
    table1
;
```

Note that only the first row of each duplicate group is returned. Unless the query also has an `ORDER BY` clause that starts with the `DISTINCT ON` column(s), it's not predictable which row in the duplicate group is returned as the first row!
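
Since `DISTINCT ON` is specific to Postgres, the sketch below imitates its behaviour in plain Python rather than running it: sort the rows, then keep the first row of each group. The sort plays the role of an `ORDER BY` clause, which is what makes the "first row" well defined. The city rows are illustrative values, not query output.

```python
from itertools import groupby
from operator import itemgetter

# (countrycode, name, population): illustrative rows from a city-like table
cities = [
    ("NLD", "Rotterdam", 593321),
    ("AFG", "Kabul", 1780000),
    ("NLD", "Amsterdam", 731200),
    ("AFG", "Herat", 186800),
]

# Rough equivalent of:
#   SELECT DISTINCT ON (countrycode) countrycode, name
#   FROM city
#   ORDER BY countrycode, population DESC;
ordered = sorted(cities, key=lambda row: (row[0], -row[2]))

# Keep only the first row of each countrycode group.
largest_city = [
    next(group)[:2] for _, group in groupby(ordered, key=itemgetter(0))
]
print(largest_city)  # [('AFG', 'Kabul'), ('NLD', 'Amsterdam')]
```

Without the sort step, which row comes first in each group would depend on the arbitrary order of the input, mirroring the unpredictability described above.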

In [None]:
%%sql

SELECT
    DISTINCT ON (countrycode) countrycode, name
FROM
    city
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
232 rows affected.


countrycode,name
ABW,Oranjestad
AFG,Mazar-e-Sharif
AGO,Luanda
AIA,The Valley
ALB,Tirana
AND,Andorra la Vella
ANT,Willemstad
ARE,Abu Dhabi
ARG,San Salvador de Jujuy
ARM,Yerevan


### `ORDER BY`

The `ORDER BY` keyword sorts the results according to one or more columns:

```sql
SELECT
    column1, column2, ...
FROM
    table1
ORDER BY
    column1, column2, ...;
```

The rows are sorted in **ascending** order by default.

In [28]:
%%sql

SELECT
    name, population
FROM
    country
ORDER BY
    population
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
239 rows affected.


name,population
Heard Island and McDonald Islands,0
United States Minor Outlying Islands,0
South Georgia and the South Sandwich Islands,0
Antarctica,0
Bouvet Island,0
British Indian Ocean Territory,0
French Southern territories,0
Pitcairn,50
Cocos (Keeling) Islands,600
Holy See (Vatican City State),1000


We can also sort the returned rows in descending order by adding the `DESC` keyword after the column names. In fact, there is an `ASC` keyword as well for ascending sorting, which is optional:

```sql
SELECT
    column1, column2, ...
FROM
    table1
ORDER BY
    column1 [ASC|DESC], column2 [ASC|DESC], ...;
```
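
Each column in `ORDER BY` gets its own direction, so ascending and descending sorts can be mixed. The following sketch assumes Python's built-in `sqlite3` module as the engine and a four-row, simplified `country` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT, continent TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO country VALUES (?, ?, ?)",
    [
        ("India", "Asia", 1013662000),
        ("Brazil", "South America", 170115000),
        ("China", "Asia", 1277558000),
        ("Nigeria", "Africa", 111506000),
    ],
)

# Continents in alphabetical order; within a continent, largest population first.
ranked = conn.execute(
    """
    SELECT continent, name
    FROM country
    ORDER BY continent ASC, population DESC
    """
).fetchall()

for row in ranked:
    print(row)
```

The second sort column only matters for rows that tie on the first one, which is why China is listed before India within Asia.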

In [29]:
%%sql

SELECT
    name, population
FROM
    country
ORDER BY
    population DESC
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
239 rows affected.


name,population
China,1277558000
India,1013662000
United States,278357000
Indonesia,212107000
Brazil,170115000
Pakistan,156483000
Russian Federation,146934000
Bangladesh,129155000
Japan,126714000
Nigeria,111506000


### `LIMIT`

We've already talked about how we can limit the number of returned rows from the database using `ipython-sql`'s configuration, but that is specific to the `ipython-sql` extension. With SQL in general, we can use the `LIMIT` keyword to limit the number of returned rows:

```sql
SELECT
    column1, column2, ...
FROM
    table1
LIMIT
    N_ROWS;
```

In [30]:
%%sql

SELECT
    name, continent
FROM
    country
LIMIT
    5
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
5 rows affected.


name,continent
Afghanistan,Asia
Netherlands,Europe
Netherlands Antilles,North America
Albania,Europe
Algeria,Africa


It is also possible to skip the first `n` rows by supplying the optional `OFFSET` keyword:

```sql
SELECT
    column1, column2, ...
FROM
    table1
LIMIT
    N_ROWS OFFSET N_OFFSET;
```
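
A common use of `LIMIT` together with `OFFSET` is paging through a result set. The sketch below assumes Python's built-in `sqlite3` module as the engine; the `numbers` table and the `fetch_page` helper are hypothetical names introduced for this example. It also adds an `ORDER BY`, because without one the database is free to return rows in any order and the pages would not be well defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")
conn.executemany("INSERT INTO numbers VALUES (?)", [(i,) for i in range(1, 13)])


def fetch_page(page, page_size=5):
    """Return one page of values; `page` starts at 0."""
    query = "SELECT n FROM numbers ORDER BY n LIMIT ? OFFSET ?"
    cursor = conn.execute(query, (page_size, page * page_size))
    return [n for (n,) in cursor]


print(fetch_page(0))  # [1, 2, 3, 4, 5]
print(fetch_page(1))  # [6, 7, 8, 9, 10]
print(fetch_page(2))  # [11, 12]
```

The last page is simply shorter: `LIMIT` sets an upper bound on the number of rows, not an exact count.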

In [31]:
%%sql

SELECT
    name, continent
FROM
    country
LIMIT
    5 OFFSET 10;

 * postgresql://postgres:***@localhost:5432/world_dsci513
5 rows affected.


name,continent
United Arab Emirates,Asia
Argentina,South America
Armenia,Asia
Aruba,North America
Australia,Oceania


---  
**Remember:**

The order of SQL keywords does matter: `SELECT`, `FROM`, `JOIN`, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, `LIMIT`
    
---
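
To see that the clause order is enforced, the sketch below runs one statement with the clauses in the expected order and one with `LIMIT` placed before `WHERE`. It assumes Python's built-in `sqlite3` module as the engine; Postgres rejects the second statement as well, although its error message is worded differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("Pitcairn", 50), ("Anguilla", 8000), ("Andorra", 78000), ("Albania", 3401200)],
)

good = "SELECT name FROM country WHERE population > 1000 ORDER BY name LIMIT 2"
bad = "SELECT name FROM country LIMIT 2 WHERE population > 1000"

names = conn.execute(good).fetchall()
print(names)

try:
    conn.execute(bad)
    error = None
except sqlite3.OperationalError as err:
    # The parser gives up as soon as it meets the misplaced keyword.
    error = str(err)
print(error)
```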

## ---------

- Various data types in SQL
- `WHERE` conditionals, pattern matching
- Derived columns, aliases with `AS`
- Conditionals with `CASE`
- Functions and operators in SQL

In [5]:
%load_ext sql
%config SqlMagic.displaylimit = 20

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [6]:
import json
import urllib.parse

with open('data/credentials.json') as f:
    login = json.load(f)

user = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']

In [7]:
%sql postgresql://{user}:{password}@{host}:{port}/

'Connected: postgres@'

## Data types

You might remember from the previous lecture that in relational databases, each column is characterized by its name and its **domain**. A domain is the set of permissible or valid values that a column is allowed to store. This highlights one of the advantages of using a DBMS, which enforces particular data types for the columns of a table.

Postgres supports

- boolean
- character
- number
- datetime
- binary

and some extension types specific to Postgres.

### Type conversion

To demonstrate how different data types work in SQL, I first need to show you how we convert values from one type to another. In standard SQL, type conversion can be done using the `CAST` function:

```sql
CAST(<column> AS <data_type>)
```
In Postgres, we can also use the double-colon syntax as a shorthand for the above `CAST` function:

```sql
<column>::<data_type>
```

### Boolean

We can specify this data type using the keyword `BOOLEAN` or `BOOL`. Valid input values are `NULL`, `TRUE`, `1`, `YES`, `Y`, `T`, `FALSE`, `0`, `NO`, `N`, `F` (when casting from an integer rather than a string, any non-zero value is treated as `TRUE`). Note that all of these values will be interpreted as `TRUE`, `FALSE`, or `NULL` by Postgres:

In [19]:
%%sql

SELECT
    'TRUE'::BOOLEAN,
    'T'::BOOLEAN,
    '0'::BOOLEAN,
    'NO'::BOOLEAN
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


bool,bool_1,bool_2,bool_3
True,True,False,False


### Characters

The character data type is used to represent fixed-length and variable-length character strings. This type can be defined using the following keywords:

- `CHAR(n)`: a string of exactly `n` characters, padded with spaces
- `VARCHAR(n)`: a variable-length string of up to `n` characters
- `TEXT`: a Postgres-specific type for which there is practically no limit on the number of characters

In [20]:
%%sql

SELECT
    'Arman'::CHAR(50),
    'Arman'::VARCHAR(2),
    'Arman'::TEXT
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


bpchar,varchar,text
Arman,Ar,Arman


> Note that you can't see the space-padding for `CHAR(50) 'Arman'` in the Jupyter notebook, but if you run the same statement in `psql`, you will see `'Arman'` + 45 spaces in the output.
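
The padding and truncation seen in the cast above can be imitated in plain Python. The helper names `as_char` and `as_varchar` are made up for this sketch, and they only mimic what an explicit cast does; inserting an over-length value into a `CHAR(n)` or `VARCHAR(n)` column is a different situation, in which Postgres raises an error instead of truncating.

```python
def as_char(value: str, n: int) -> str:
    """Mimic an explicit cast to CHAR(n): truncate, then pad with spaces."""
    return value[:n].ljust(n)


def as_varchar(value: str, n: int) -> str:
    """Mimic an explicit cast to VARCHAR(n): truncate, never pad."""
    return value[:n]


padded = as_char("Arman", 50)
print(repr(padded), len(padded))     # 'Arman' followed by 45 spaces, length 50
print(repr(as_varchar("Arman", 2)))  # 'Ar'
```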

### Numbers

Numerical values in Postgres belong to the following general categories:
- Integers
- Floating-point numbers
- Arbitrary precision numbers

**Integers:**

| Name     | Storage Size | Description                | Range                                        |
|----------|--------------|----------------------------|----------------------------------------------|
| `smallint` | 2 bytes      | small-range integer        | -32768 to +32767                             |
| `integer`  | 4 bytes      | typical choice for integer | -2147483648 to +2147483647                   |
| `bigint`   | 8 bytes      | large-range integer        | -9223372036854775808 to +9223372036854775807 |
| `serial`      | 4 bytes | auto-incrementing integer       | 1 to 2147483647          |
| `bigserial`   | 8 bytes | large auto-incrementing integer | 1 to 9223372036854775807 |

We'll learn later that the `serial` type (which is not an actual data type) is a shortcut that tells Postgres to create a unique, auto-incrementing integer column, often used for the primary key column of a table.
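
The ranges in the integer table follow directly from the storage sizes: a signed integer stored in `n` bytes has `8n` bits, one of which encodes the sign. A short Python check of that arithmetic (the helper name `signed_range` is made up for this sketch):

```python
def signed_range(n_bytes):
    """Smallest and largest value of a signed integer stored in n_bytes bytes."""
    bits = 8 * n_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for name, size in [("smallint", 2), ("integer", 4), ("bigint", 8)]:
    low, high = signed_range(size)
    print(f"{name:<8} {low} to {high}")
```

The upper bounds of `serial` and `bigserial` are the same as those of `integer` and `bigint`; they simply start counting from 1.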

**Floating-point numbers:**

| Name     | Storage Size | Description                | Precision                                    |
|----------|--------------|----------------------------|----------------------------------------------|
| `real`             | 4 bytes  | variable-precision, inexact     | at least 6 decimal digits (implementation dependent) |
| `double precision` | 8 bytes  | variable-precision, inexact     | at least 15 decimal digits (implementation dependent) |
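
What "inexact" means for a 4-byte float can be seen without a database. The sketch below uses Python's `struct` module to squeeze a number into 4 bytes and read it back, under the assumption that `real` uses the same IEEE 754 single-precision format, as it does on typical platforms.

```python
import struct

value = 4.54021223948e-8  # a Python float uses 8 bytes (double precision)

# Round-trip through a 4-byte IEEE 754 float.
as_real = struct.unpack("f", struct.pack("f", value))[0]

print(f"{value:.11e}")    # all 12 significant digits survive in 8 bytes
print(f"{as_real:.11e}")  # only the leading digits survive in 4 bytes

relative_error = abs(as_real - value) / value
print(relative_error)
```

The two printed values agree in roughly their first seven significant digits and differ after that, which matches the precision stated for `real`.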

**Arbitrary precision numbers:**

| Name     | Storage Size | Description                | Range                                        |
|----------|--------------|----------------------------|----------------------------------------------|
| `numeric`          | variable | user-specified precision, exact | up to 131072 digits before and up to 16383 digits after the decimal point |
| `decimal`          | variable | user-specified precision, exact | up to 131072 digits before and up to 16383 digits after the decimal point |

> `DECIMAL` and `NUMERIC` data types are exactly the same thing in Postgres.

In [21]:
%%sql

SELECT CAST(44.268 AS SMALLINT);

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


int2
44


Note that this is also acceptable (and maybe preferred, but specific to Postgres):

In [22]:
%%sql

SELECT 44.268::SMALLINT;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


int2
44


In [23]:
%%sql

SELECT CAST(4.54021223948E-8 AS REAL);

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


float4
4.540212e-08


With the `numeric` data type, we can specify the total number of significant digits to store (known as precision) as well as the number of digits in the fractional part (known as scale) by specifying `NUMERIC(precision, scale)`:

In [24]:
%%sql

SELECT CAST('1.123456789' AS NUMERIC(5, 2));

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


numeric
1.12


The `numeric` type is exact (as opposed to the floating-point types) and immune to round-off error, but it is **slow to work with for the DBMS**. It is often used for monetary and financial data, where numbers with many digits may need to be stored or where exactness is important.
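
Python's `decimal` module behaves much like `numeric` and makes the trade-off easy to see. This is an analogy rather than Postgres itself: `quantize` is used here as a rough counterpart of casting to `NUMERIC(5, 2)`, keeping two digits after the decimal point.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floating point cannot represent 0.1 or 0.2 exactly.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)  # False

# Decimal arithmetic is exact for these values.
exact_sum = Decimal("0.1") + Decimal("0.2")
print(exact_sum == Decimal("0.3"))  # True

# Rough analogue of CAST('1.123456789' AS NUMERIC(5, 2)).
rounded = Decimal("1.123456789").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(rounded)  # 1.12
```

For amounts of money, a rounding surprise like the first one is exactly what we want to avoid, which is why exact types are preferred there despite being slower.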

For example, the following number cannot be represented as `BIGINT` and would throw an error, but it works with `NUMERIC`:

In [25]:
%%sql

SELECT CAST(9223372036854775808 AS NUMERIC);

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


numeric
9223372036854775808


### Date/time

Postgres provides **datetime** and **interval** data types similar to those we've seen in DSCI 511 in Python and Pandas.

#### Datetimes

- `DATE` for dates
- `TIME` for the time of day

Postgres also provides two ways to store the **timestamp** data type:
- `TIMESTAMP` for date + time
- `TIMESTAMPTZ` for date + time + timezone (Postgres specific)

When a timestamp value is queried:
- For `TIMESTAMP`, Postgres returns the timestamp as originally stored in the database server
- For `TIMESTAMPTZ`, Postgres converts the timestamp into the time zone of the current session, which defaults to the local time zone configured on the database server

Note that Postgres does not store the timezone information itself. It always stores `TIMESTAMPTZ` values internally in [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time), and converts them back to that time zone when they are displayed.
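
The store-in-UTC, convert-on-display idea can be sketched with Python's `datetime` module. This is an analogy for how `TIMESTAMPTZ` behaves, not Postgres code, and it uses fixed UTC offsets (named `pst` and `cet` here) instead of real time zone rules to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone

pst = timezone(timedelta(hours=-8), "PST")  # fixed offset, no daylight saving
cet = timezone(timedelta(hours=1), "CET")

# A client working in PST enters '2021-11-18 8:30:00'.
entered = datetime(2021, 11, 18, 8, 30, tzinfo=pst)

# What is kept: the same instant expressed in UTC.
stored_utc = entered.astimezone(timezone.utc)

# What is shown depends on the time zone used for display.
shown_in_pst = stored_utc.astimezone(pst)
shown_in_cet = stored_utc.astimezone(cet)

print(stored_utc)    # 2021-11-18 16:30:00+00:00
print(shown_in_pst)  # 2021-11-18 08:30:00-08:00
print(shown_in_cet)  # 2021-11-18 17:30:00+01:00
```

All three values describe the same instant; only the representation changes, and the original zone name is not needed to reconstruct it.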

#### Intervals

There is also another datatype for storing intervals of time. Intervals are useful for doing date and time arithmetic, such as adding a duration of time to a timestamp.
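
If you remember `timedelta` from Python, intervals play a similar role. Here is a small Python sketch of the kind of arithmetic meant here (an analogy only; the date and duration are made up for illustration):

```python
from datetime import datetime, timedelta

start = datetime(2021, 11, 18, 8, 30)
duration = timedelta(days=1, hours=23, minutes=8)

# Adding a duration to a timestamp gives another timestamp.
end = start + duration

# Subtracting two timestamps gives a duration back.
elapsed = end - start

print(end)      # 2021-11-20 07:38:00
print(elapsed)  # 1 day, 23:08:00
```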

For more detailed information, refer to the Postgres documentation [here](https://www.postgresql.org/docs/8.4/datatype-datetime.html).

**Entering datetime data**

Postgres does a pretty good job of getting the datetimes right even if we don't enter them in the standard ISO way. Let's take a look at a few examples:

In [26]:
%%sql

SELECT
    'January 23, 2021'::DATE,
    '23 January 2021'::DATE,
    '2021 1 23'::DATE,
    '1/23/2021'::DATE,
    'today'::DATE,
    'tomorrow'::DATE
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


date,date_1,date_2,date_3,date_4,date_5
2021-01-23,2021-01-23,2021-01-23,2021-01-23,2021-11-18,2021-11-19


In [27]:
%%sql

SELECT
    '14:24:00'::TIME,
    '2:24pm'::TIME,
    '2:24 PM PST'::TIME WITH TIME ZONE,
    'now'::TIME,
    'now'::TIME WITH TIME ZONE
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


time,time_1,timetz,time_2,timetz_1
14:24:00,14:24:00,14:24:00-08:00,07:39:08.686027,07:39:08.686027-08:00


In [28]:
%%sql

SELECT
    '1 day 23 hours 8 minutes'::INTERVAL,
    '2m 18s'::INTERVAL,
    '3 years 2 months'::INTERVAL
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


interval,interval_1,interval_2
"1 day, 23:08:00",0:02:18,"1155 days, 0:00:00"


When datetime is stored without timezone, it is oblivious to the local server timezone:

In [29]:
%sql SELECT '2021-11-18 8:30:00'::TIMESTAMP;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


timestamp
2021-11-18 08:30:00


In [30]:
%sql SELECT '2021-11-18 8:30:00'::TIMESTAMPTZ;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


timestamptz
2021-11-18 08:30:00-08:00


In [31]:
%sql SHOW TIMEZONE;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


TimeZone
America/Vancouver


In [32]:
%sql SET timezone = 'America/New_York';

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
Done.


[]

In [33]:
%sql SELECT '2021-11-18 8:30:00 -8'::TIMESTAMPTZ;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


timestamptz
2021-11-18 11:30:00-05:00


In [34]:
%sql SET timezone = 'America/Vancouver';

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
Done.


[]

In [35]:
%sql SELECT '2021-11-18 8:30:00'::TIMESTAMPTZ;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


timestamptz
2021-11-18 08:30:00-08:00


### Binary data

It is also possible to have binary data in a table (e.g. documents, images, videos). We don't use binary data in this course.

### Nulls

A null is a marker to indicate that the value for a column is unknown, or not entered yet. A null is not equal to 0, or an empty string. In fact, a null is not even equal to another null!

How different environments show nulls:
- `ipython-sql` -> `None`
- psql -> blank space
- pgAdmin -> `[null]`

## Filtering rows with `WHERE`

We've seen the `WHERE` keyword in passing in the last lecture. `WHERE` is an intuitive keyword that is used to filter rows based on a particular condition. The syntax is as follows:

```sql
SELECT
    column1, column2
FROM
    table1
WHERE
    condition
;
```

| Condition        | Operator                        |
|------------------|---------------------------------|
| Comparison       | `=`, `<>`, `<`, `<=`, `>`, `>=` |
| Pattern matching | `LIKE`                          |
| Range            | `BETWEEN`                       |
| List             | `IN`                            |
| Null testing     | `IS NULL`                       |

In [37]:
%sql postgresql://{user}:{password}@{host}:{port}/imdb_dsci513

'Connected: postgres@imdb_dsci513'

In [38]:
%%sql

SELECT
    *
FROM
    movies;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
26058 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
10035423,Kate & Leopold,,2001,,118,6.4,74982
10042742,Mister 880,,1950,,90,7.1,1171
10041181,Black Hand,,1950,,92,6.4,666
10041387,Francis,,1950,,91,6.4,979
10041719,Orpheus,Orphée,1950,,95,8.0,9346
10041931,Stromboli,"Stromboli, terra di Dio",1950,,107,7.3,5239
10042052,Woman in Hiding,,1950,,92,6.9,553
10042179,Abbott and Costello in the Foreign Legion,,1950,,80,6.6,2573
10042200,Annie Get Your Gun,,1950,,107,6.9,4050
10042206,Armored Car Robbery,,1950,,67,7.0,2077


---

**Example:** Retrieve rows for movies produced in or after 2010.

---

In [39]:
%%sql

SELECT
    *
FROM
    movies
WHERE
    start_year >= 2010
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
8804 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
10069049,The Other Side of the Wind,,2018,,122,6.9,4904
10176694,The Tragedy of Man,Az ember tragédiája,2011,,160,7.8,610
10293069,Dark Blood,,2012,,86,6.5,1073
10315642,Wazir,,2016,,103,7.1,15796
10337692,On the Road,,2012,,124,6.1,38216
10359950,The Secret Life of Walter Mitty,,2013,,114,7.3,278645
10365907,A Walk Among the Tombstones,,2014,,114,6.5,106413
10369610,Jurassic World,,2015,,124,7.0,547391
10376136,The Rum Diary,,2011,,119,6.2,95417
10376479,American Pastoral,,2016,,108,6.1,13376


---

**Example:** Retrieve the row for the movie called "Lost Highway".

---

In [40]:
%%sql

SELECT
    *
FROM
    movies
WHERE
    title = 'Lost Highway'
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
10116922,Lost Highway,,1997,,134,7.6,120549


> Note that in SQL, strings are enclosed in single quotes, i.e. `'string'`.

> While SQL syntax is case-insensitive, SQL is **case-sensitive** when it comes to **comparing strings**. In the above example, `'Lost highway'` will not return any rows.

### Logical operators `AND`, `OR`, and `NOT`

Just like in Python, we can combine multiple conditions using the logical/boolean operators `AND`, `OR`, and `NOT`.

When there are multiple logical operators, `NOT` is evaluated first, then `AND` and finally `OR`.

We can enclose each condition in parentheses if we want. This can be done either for readability, or to override the default precedence rules.

---

**Example:** Retrieve the rows for movies that are produced in 2015 and are rated higher than 9.

---

In [41]:
%%sql

SELECT
    *
FROM
    movies
WHERE
    start_year = 2015
    AND
    rating > 9
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
0 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes


---

**Example:** Retrieve the rows for movies that are produced either in 2015 or 2018, and are rated higher than 8.

---

In [42]:
%%sql

SELECT
    *
FROM
    movies
WHERE
    start_year = 2015
    OR 
    start_year = 2018
    AND
    rating > 8
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1048 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
10369610,Jurassic World,,2015,,124,7.0,547391
10420293,The Stanford Prison Experiment,,2015,,122,6.9,33319
10478970,Ant-Man,,2015,,117,7.3,517941
10790770,Miles Ahead,,2015,,100,6.4,8650
10884732,The Wedding Ringer,,2015,,101,6.6,67575
11533089,Tab Hunter Confidential,,2015,,90,7.8,2852
11596363,The Big Short,,2015,,130,7.8,318033
11598642,Z for Zachariah,,2015,,98,6.0,25985
11618448,Racing Extinction,,2015,,90,8.3,7042
11638355,The Man from U.N.C.L.E.,,2015,,116,7.3,245184


What? This isn't the right result! We have multiple returned movies that are rated below 8.

The reason is that the `AND` operator takes precedence over `OR`. Therefore, `start_year = 2018 AND rating > 8` gets evaluated first, and then the result is passed to the `OR` part of the condition. In order to override this behaviour, we can rewrite our query in the following way:

In [43]:
%%sql

SELECT
    *
FROM
    movies
WHERE
    (start_year = 2015
    OR
    start_year = 2018)
    AND
    rating > 8
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
119 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
11618448,Racing Extinction,,2015,,90,8.3,7042
12096673,Inside Out,,2015,,95,8.2,550606
12473476,Be Here Now,,2015,,100,8.7,2863
12631186,Baahubali: The Beginning,Bahubali: The Beginning,2015,,159,8.1,94989
12865822,All the World in a Design School,,2015,,59,8.4,1270
13170832,Room,,2015,,118,8.2,326042
13270538,Requiem for the American Dream,,2015,,73,8.1,8061
13717510,The Drop Box,,2015,,79,8.1,604
13865286,My Lonely Me,,2015,,95,8.2,671
14112208,Kuttram Kadithal,,2015,,120,8.1,638


---

**Example:** Count the number of movies that have no less than 1 million votes.

---

We need to use the `COUNT()` function to count the number of returned rows (more on `COUNT()` in a later lecture):

In [44]:
%%sql

SELECT
    COUNT(*)
FROM
    movies
WHERE
    NOT nvotes < 1000000
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


count
33
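
The `NOT nvotes < 1000000` condition can equivalently be written as `nvotes >= 1000000`. A quick check on a made-up table, using SQLite only as a stand-in for Postgres, suggests that both forms count the same rows and that both skip rows where `nvotes` is null:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, nvotes INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("A", 500), ("B", 1000000), ("C", 2500000), ("D", None)],  # hypothetical rows
)

count_not = conn.execute(
    "SELECT COUNT(*) FROM movies WHERE NOT nvotes < 1000000"
).fetchone()[0]

count_ge = conn.execute(
    "SELECT COUNT(*) FROM movies WHERE nvotes >= 1000000"
).fetchone()[0]
conn.close()

print(count_not, count_ge)  # 2 2
```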


> It is mostly a matter of style whether to negate a condition with `NOT` or to use the complementary comparison operator (here, `nvotes >= 1000000`).

### Pattern matching

It is quite common to look for rows in which the values of one or more columns match a particular pattern. In SQL, this can be done either using `LIKE` or by using regular expressions. The syntax for `LIKE` is as follows:

```sql
SELECT
    column1, column2
FROM
    table1
WHERE
    column1 [NOT] LIKE '<pattern>'
;
```

Postgres provides us with two wild-cards that we can use with `LIKE`:
- `%` matches any string of zero or more characters
- `_` matches exactly one character.

Pattern matching with `LIKE` is case sensitive; however, Postgres also provides the `ILIKE` keyword that has the same functionality as `LIKE` but is case-insensitive.

> **Note:** With `LIKE` or `ILIKE`, the entire string should match the pattern.

In [47]:
%%sql

SELECT
    'Arman' LIKE '%a_',
    'UBC' LIKE '_B_',
    'MDS is awesome!' LIKE '%!_'
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


?column?,?column?_1,?column?_2
True,True,False


---

**Example:** Retrieve those movies from the `movies` table whose title contains the word `'Violin'` (note that `LIKE` is picky about letter cases in strings!)

---

In [48]:
%%sql

SELECT
    *
FROM
    movies
WHERE
    title LIKE '%Violin%'
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
5 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
10120802,The Red Violin,Le violon rouge,1998,,130,7.6,30285
10451966,The Violin,El violín,2005,,98,7.7,2212
12401715,The Devil's Violinist,,2013,,122,6.1,3033
14972904,The Violin Teacher,Tudo Que Aprendemos Juntos,2015,,102,6.8,645
10053987,The Steamroller and the Violin,Katok i skripka,1961,,46,7.5,4867


---

**Example:** Retrieve those movies from the `movies` table whose title starts with the word `'Zero'`.

---

In [49]:
%%sql

SELECT
    *
FROM
    movies
WHERE
    title LIKE 'Zero%'
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
18 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
10095244,Zerograd,Gorod Zero,1988,,103,7.5,1463
10113557,Zero Kelvin,Kjærlighetens kjøtere,1995,,118,7.3,1711
10120906,Zero Effect,,1998,,116,6.9,13383
10198837,Zero Tolerance,Noll tolerans,1999,,108,6.4,3288
10283693,Zero Woman: Red Handcuffs,Zeroka no onna: Akai wappa,1974,,88,6.6,783
10365960,Zero Day,,2002,,92,7.2,3840
10421090,Zerophilia,,2005,,90,6.2,2177
11592292,Zero 2,,2010,,90,7.6,5360
11790885,Zero Dark Thirty,,2012,,157,7.4,254644
12294965,Zero Charisma,,2013,,86,6.2,2384


---

**Example:** Retrieve those movies from the `movies` table whose title is 4 letters long and ends with the letter `'e'`.

---

In [50]:
%%sql

SELECT
    *
FROM
    movies
WHERE
    title LIKE '___e'
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
71 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
10043539,Five,,1951,,93,6.3,1068
10064694,More,,1969,,112,6.5,2155
10066500,Hope,Umut,1970,,100,8.2,2770
10067814,Love,Szerelem,1971,,88,7.9,1582
10068306,Bone,,1972,,95,6.8,905
10069158,Rage,,1972,,100,6.3,765
10071803,Mame,,1974,,132,6.1,2490
10080716,Fame,,1980,,134,6.6,18864
10087182,Dune,,1984,,137,6.5,113255
10088930,Clue,,1985,,94,7.3,71433


---

**Example:** Retrieve those movies from the `movies` table whose title contains the character `'%'`.

---

We can specify an escape character using the keyword `ESCAPE` that tells SQL to not interpret a `%` or `_` that immediately follows it:

In [51]:
%%sql

SELECT
    *
FROM
    movies
WHERE
    title LIKE '%$%%' ESCAPE '$'
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
3 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
10487092,Who the #$&% Is Jackson Pollock?,,2006,,74,7.0,1134
12662228,10%: What Makes a Hero?,,2013,,88,6.8,543
11869226,100% Love,,2011,,141,7.0,2369
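
To make the wildcard and escape rules concrete, here is a minimal Python sketch that translates a `LIKE` pattern into a regular expression. It is not how Postgres implements `LIKE`, and the helper names `like_to_regex` and `like` are hypothetical. One more assumption: this sketch has no escape character unless one is passed in, whereas Postgres uses the backslash by default.

```python
import re

def like_to_regex(pattern, escape=None):
    """Translate a SQL LIKE pattern into an equivalent regex string."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if escape is not None and ch == escape and i + 1 < len(pattern):
            # The character right after the escape is taken literally.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            parts.append(".*")  # any run of characters, possibly empty
        elif ch == "_":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    return "".join(parts)

def like(value, pattern, escape=None):
    # fullmatch mirrors the rule that the entire string must match.
    regex = like_to_regex(pattern, escape)
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None

print(like("Arman", "%a_"))                   # True
print(like("UBC", "_B_"))                     # True
print(like("MDS is awesome!", "%!_"))         # False
print(like("100% Love", "%$%%", escape="$"))  # True
print(like("Zero Day", "%$%%", escape="$"))   # False
```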


Pattern matching could also be done using regular expressions in two different ways:
- `SIMILAR TO`: This is the SQL standard's definition of a regular expression, which is a mix between the `LIKE` and common regular expressions
- `~`: This is the POSIX regular expression operator

> **Note:** With `SIMILAR TO`, the entire string should match the pattern. This is unlike regex behaviour!

You can find more information on this in the Postgres documentation [here](https://www.postgresql.org/docs/current/functions-matching.html).

In [52]:
%%sql

SELECT
    'abc' SIMILAR TO '%(b|d)%',
    'abc' SIMILAR TO '(b|c)_';

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


?column?,?column?_1
True,False
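
The difference between whole-string matching (`LIKE`, `SIMILAR TO`) and the substring matching that regex users expect (`~`) has a close analogue in Python's `re` module. This is only an analogy with Python's regex flavour, not Postgres syntax:

```python
import re

# Whole-string matching, analogous to LIKE and SIMILAR TO.
whole = re.fullmatch(r"b|c", "abc") is not None

# Substring matching, analogous to the POSIX operator `~`.
partial = re.search(r"b|c", "abc") is not None

# Anchors turn a substring search back into a whole-string match.
anchored = re.search(r"^(b|c)$", "abc") is not None

print(whole, partial, anchored)  # False True False
```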


---

**Example:** Select movies from the `movies` table whose title starts and ends with a digit.

---

In [53]:
%%sql

SELECT
    *
FROM
    movies
WHERE
    title ~ '^\d.*\d$'
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
58 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
10048918,1984,,1956,,90,7.0,2837
10068156,1776,,1972,,141,7.6,7250
10074084,1900,Novecento,1976,,317,7.7,20696
10078721,10,,1979,,122,6.1,14224
10080319,9 to 5,Nine to Five,1980,,109,6.8,25199
10087803,1984,Nineteen Eighty-Four,1984,,113,7.1,61070
10109001,1-900,06,1994,,87,6.2,576
10112257,301/302,"301, 302",1995,,100,6.4,938
10126765,23,,1998,,99,7.3,6219
10212712,2046,,2004,,129,7.4,47553
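
One subtlety of the pattern `'^\d.*\d$'` is that it needs at least two characters, so a title consisting of a single digit would not match. The Python sketch below illustrates this with Python's `re` module (whose syntax agrees with Postgres for this simple pattern, as far as I know). The titles `'7'` and `'2 Fast 2 Furious'` and the relaxed pattern are hypothetical additions for illustration:

```python
import re

strict = re.compile(r"^\d.*\d$")      # the pattern used in the query above
relaxed = re.compile(r"^\d(.*\d)?$")  # hypothetical variant that allows one digit

titles = ["1984", "9 to 5", "10", "7", "2 Fast 2 Furious"]

strict_hits = [t for t in titles if strict.search(t)]
relaxed_hits = [t for t in titles if relaxed.search(t)]

print(strict_hits)   # ['1984', '9 to 5', '10']
print(relaxed_hits)  # ['1984', '9 to 5', '10', '7']
```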


### `IN`

Sometimes we want to check whether a column value matches any one of the items in a list. We can express this with an `OR` operator:

```sql
SELECT
    column1, column2
FROM
    table1
WHERE
    column1 = value1
    OR
    column1 = value2
    OR
    column1 = value3
;
```

This can be rewritten more succinctly using the `IN` operator:

```sql
SELECT
    column1, column2
FROM
    table1
WHERE
    column1 [NOT] IN (value1, value2, value3)
;
```

---

**Example:** Retrieve rows from the `movies` table that correspond to the movies `'Donnie Brasco'`, `'The Usual Suspects'`, `'Schindler''s List'`, `'Shutter Island'`, `'A Beautiful Mind'`.

---

In [54]:
%%sql

SELECT
    *
FROM
    movies
WHERE
    title IN ('Donnie Brasco',
              'The Usual Suspects',
              'Schindler''s List',
              'Shutter Island',
              'A Beautiful Mind'
               )
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
5 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
10108052,Schindler's List,,1993,,195,8.9,1110590
10114814,The Usual Suspects,,1995,,106,8.5,922333
10119008,Donnie Brasco,,1997,,127,7.7,258120
10268978,A Beautiful Mind,,2001,,135,8.2,784095
11130884,Shutter Island,,2010,,138,8.1,1027318
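
As a sanity check that `IN` is just shorthand for a chain of `OR`s, here is a small sketch on a made-up three-row table, with SQLite standing in for Postgres. It also shows the doubled single quote (`''`) used to put an apostrophe inside a SQL string:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?)",
    [("Donnie Brasco",), ("Schindler's List",), ("Lost Highway",)],
)

with_or = conn.execute(
    "SELECT title FROM movies "
    "WHERE title = 'Donnie Brasco' OR title = 'Schindler''s List' "
    "ORDER BY title"
).fetchall()

with_in = conn.execute(
    "SELECT title FROM movies "
    "WHERE title IN ('Donnie Brasco', 'Schindler''s List') "
    "ORDER BY title"
).fetchall()
conn.close()

print(with_or == with_in)  # True
print(with_in)             # [('Donnie Brasco',), ("Schindler's List",)]
```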


### `BETWEEN`

The `BETWEEN` keyword is helpful for when we want to select a range of values, and it can be used for number, character and datetime ranges:

```sql
SELECT
    column1, column2
FROM
    table1
WHERE
    column1 [NOT] BETWEEN value1 AND value2
;
```

> **Note:** `BETWEEN` is **inclusive** of both ends of the interval.

We can try it out using a `SELECT` statement without any tables:

In [55]:
%%sql

SELECT 
    5 BETWEEN 1 AND 10,
    DATE '2021-11-01' BETWEEN DATE '2021-01-01' AND '2021-11-10',
    'w' BETWEEN 'e' AND 'm';

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
1 rows affected.


?column?,?column?_1,?column?_2
True,True,False
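
To see the inclusiveness at the boundaries, we can test the end points themselves. The sketch below uses SQLite as a stand-in for Postgres; SQLite reports true and false as `1` and `0`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
row = conn.execute(
    "SELECT "
    "  1 BETWEEN 1 AND 10, "   # lower end point is included
    "  10 BETWEEN 1 AND 10, "  # upper end point is included
    "  11 BETWEEN 1 AND 10, "  # just outside the range
    "  5 >= 1 AND 5 <= 10"     # what BETWEEN is shorthand for
).fetchone()
conn.close()

print(row)  # (1, 1, 0, 1)
```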


---

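**Example:** Retrieve the title, production year and rating of movies from the `movies` table that were produced between 2018 and 2020 and have a rating of at least 8 with at least 100000 votes. Sort the results by rating and return only the first 5 rows. Since `ORDER BY` sorts in ascending order by default, these are the 5 lowest-rated matches; writing `ORDER BY rating DESC` instead would list the highest-rated movies first.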

---

In [56]:
%%sql

SELECT
    title, start_year, rating
FROM
    movies
WHERE
    start_year BETWEEN 2018 AND 2020
    AND
    rating >= 8
    AND
    nvotes >= 100000
ORDER BY
    rating
LIMIT
    5
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
5 rows affected.


title,start_year,rating
Once Upon a Time... in Hollywood,2019,8.0
Toy Story 4,2019,8.0
Bohemian Rhapsody,2018,8.0
Green Book,2018,8.2
Spider-Man: Into the Spider-Verse,2018,8.4


### `IS NULL`

Trying to find `NULL` values using `WHERE column = NULL` fails: the comparison itself evaluates to `NULL` rather than true, so no rows are returned. This is because a `NULL` value is by definition not known and _could be anything_, so it's not necessarily equal to another `NULL`. To find `NULL` values in a column, we can use `IS NULL`:

```sql
SELECT
    column1, column2
FROM
    table1
WHERE
    column1 IS [NOT] NULL
;
```
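
Here is a minimal sketch of the difference, using SQLite as a stand-in for Postgres and a made-up three-row table in which two rows have no `orig_title`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, orig_title TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("Orpheus", "Orphée"), ("Francis", None), ("Black Hand", None)],
)

def count(where):
    # Helper for this sketch only: count rows matching a WHERE condition.
    return conn.execute(f"SELECT COUNT(*) FROM movies WHERE {where}").fetchone()[0]

equals_null = count("orig_title = NULL")    # comparison is never true
is_null = count("orig_title IS NULL")       # finds the two missing values
is_not_null = count("orig_title IS NOT NULL")
conn.close()

print(equals_null, is_null, is_not_null)  # 0 2 1
```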

---

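**Example:** Find the movies in the `movies` table whose `orig_title` is different from that listed in the `title` column (in this table, `orig_title` appears to be filled in only when it differs from `title`).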

---

In [57]:
%%sql

SELECT
    *
FROM
    movies
WHERE
    orig_title IS NOT NULL
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
8270 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
10041719,Orpheus,Orphée,1950,,95,8.0,9346
10041931,Stromboli,"Stromboli, terra di Dio",1950,,107,7.3,5239
10042355,Story of a Love Affair,Cronaca di un amore,1950,,98,7.1,2209
10042619,Diary of a Country Priest,Journal d'un curé de campagne,1951,,115,8.0,8621
10042692,Variety Lights,Luci del varietà,1950,,97,7.1,2416
10042804,The Young and the Damned,Los olvidados,1950,,85,8.3,16453
10042810,Operation Disaster,Morning Departure,1950,,102,7.0,668
10042876,Rashomon,Rashômon,1950,,88,8.2,138304
10042906,La Ronde,La ronde,1950,,93,7.6,4456
10043048,To Joy,Till glädje,1950,,98,7.2,2109


## Column Aliases with `AS`

In SQL, we are not required to use the same column and table names as in the schema. We can create **aliases** for a column or a table with the following syntax:

```sql
SELECT
    column1 [AS] c1,
    column2 [AS] c2
FROM
    table1 [AS] t1
;
```

Note that the keyword `AS` is optional. I usually choose to use it because it makes the query more readable.

We will use table aliases a lot when we work on SQL joins in the upcoming lectures!

In [58]:
%%sql

SELECT
    title AS movieTitle,
    orig_title AS "original Title",
    runtime AS Duration
FROM
    movies;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
26058 rows affected.


movietitle,original Title,duration
Kate & Leopold,,118
Mister 880,,90
Black Hand,,92
Francis,,91
Orpheus,Orphée,95
Stromboli,"Stromboli, terra di Dio",107
Woman in Hiding,,92
Abbott and Costello in the Foreign Legion,,80
Annie Get Your Gun,,107
Armored Car Robbery,,67


Note that we've used a column alias with a space in its name. This is generally not a good practice, but if you absolutely need to do it, in Postgres you should enclose the alias in double quotes, e.g. `"alias name"`. Double quotes also preserve letter case: the unquoted alias `movieTitle` is folded to `movietitle` in the output above, while the quoted alias keeps its capital `T`. Another situation where double quotes are necessary is when you want to name a column with a word that is a reserved keyword in Postgres, e.g. `"order"`.

> **Note:** we **cannot** use column aliases in the `WHERE` clause, because SQL evaluates `WHERE` before the aliases in the `SELECT` list are assigned. The following query will throw an error:

In [59]:
%%sql

SELECT
    title AS movieTitle,
    orig_title AS "original Title",
    runtime AS Duration
FROM
    movies
WHERE
    Duration > 100
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
(psycopg2.errors.UndefinedColumn) column "duration" does not exist
LINE 8:     Duration > 100
            ^

[SQL: SELECT
    title AS movieTitle,
    orig_title AS "original Title",
    runtime AS Duration
FROM
    movies
WHERE
    Duration > 100
;]
(Background on this error at: https://sqlalche.me/e/14/f405)


## Derived columns

Derived columns in SQL are columns that are the result of doing operations on existing columns of a table.

For example, suppose that we want to convert the `runtime` column of our table `movies` from minutes to hours. We can do that by manipulating the `runtime` column right in the `SELECT` statement:

In [60]:
%%sql

SELECT
    title,
    runtime / 60.
FROM
    movies;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
26058 rows affected.


title,?column?
Kate & Leopold,1.9666666666666668
Mister 880,1.5
Black Hand,1.5333333333333332
Francis,1.5166666666666668
Orpheus,1.5833333333333333
Stromboli,1.7833333333333332
Woman in Hiding,1.5333333333333332
Abbott and Costello in the Foreign Legion,1.3333333333333333
Annie Get Your Gun,1.7833333333333332
Armored Car Robbery,1.1166666666666667


> Note that I've written `60.` with the decimal point on purpose. If you divide by `60` instead, both operands are integers (the column `runtime` is of type integer), so SQL performs integer division and returns truncated integer values instead of fractional ones.

SQL doesn't know what to call the derived column, and by default you will see `?column?` as the column name. We can use an alias to name the new derived column:

In [61]:
%%sql

SELECT
    title,
    runtime / 60. AS runtime_hours
FROM
    movies;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
26058 rows affected.


title,runtime_hours
Kate & Leopold,1.9666666666666668
Mister 880,1.5
Black Hand,1.5333333333333332
Francis,1.5166666666666668
Orpheus,1.5833333333333333
Stromboli,1.7833333333333332
Woman in Hiding,1.5333333333333332
Abbott and Costello in the Foreign Legion,1.3333333333333333
Annie Get Your Gun,1.7833333333333332
Armored Car Robbery,1.1166666666666667
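
Coming back to the `60.` versus `60` point: the truncation is easy to see on a single value. The sketch below uses SQLite as a stand-in (its integer division truncates in the same way for this example), with `118` being the runtime of the first movie in the table. I write `60.0` here because I'm not relying on SQLite accepting the bare `60.` form:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
int_div, float_div = conn.execute("SELECT 118 / 60, 118 / 60.0").fetchone()
conn.close()

print(int_div)              # 1
print(round(float_div, 4))  # 1.9667
```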


Remember I mentioned that the `SELECT` statement is powerful, but not dangerous? Derived columns returned by Postgres are not saved anywhere, nor do they change existing columns.

---

**Example:** Using table `names` from the `imdb` database, find the age of all actors/actresses who are still alive. Who is the oldest person listed as alive in the table?

---

In [62]:
%%sql

SELECT
    name,
    2021 - birth_year AS age
FROM
    names
WHERE
    birth_year IS NOT NULL
    AND
    death_year IS NULL
ORDER BY
    age DESC
;

   postgresql://postgres:***@localhost:5432/
 * postgresql://postgres:***@localhost:5432/imdb_dsci513
35413 rows affected.


name,age
Julia Calhoun,151
Benoît Duval,140
John Seabourne Sr.,131
Carl Stephenson,128
Pierre Charbonnier,124
Manuel R. Ojeda,123
Léonide Azar,121
Helen Leary,121
Georges Chaperot,119
Earl Rath,119


## ---------

Well, things have finally gotten serious with SQL and relational databases! In this lecture, we will learn how to do more advanced operations to gain more insight into the data in the tables of a database. Furthermore, we will explore how to connect the related data in multiple tables together through SQL joins. It is with joins in relational databases that we can benefit from the true power of these databases.

First things first, let's connect to our database:

In [24]:
%load_ext sql
%config SqlMagic.displaylimit = 30

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [8]:
import json
import urllib.parse

with open('data/credentials.json') as f:
    login = json.load(f)
    
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']

In [66]:
%sql postgresql://{username}:{password}@{host}:{port}/world_dsci513

'Connected: postgres@world_dsci513'

## Aggregations

So far in the course, we have seen many functions for various purposes. For example, `ROUND()` and `SQRT()` for math operations, `CHAR_LENGTH()` and `SUBSTR()` for manipulating strings, or `EXTRACT()` and `to_date()` for working with datetimes. As you may have noticed, these functions produce an individual output for each and every row of a column in an element-wise manner (remember vectorized operations in NumPy?).

There is also a small class of useful functions in SQL called **aggregation** functions, which operate on groups of rows and summarize the data stored in those rows in the form of a single value. Here is a list of standard aggregation functions in SQL:

| Function        | What it computes                     |
|-----------------|--------------------------------------|
| `COUNT(*)`      | Count of all rows in a table         |
| `COUNT(column)` | Count of non-null values in a column |
| `MIN()`         | Minimum value in a column            |
| `MAX()`         | Maximum value in a column            |
| `AVG()`         | Average of values in a column        |
| `SUM()`         | Total sum of values in a column      |

A couple of points to remember:

- Except for `COUNT(*)`, all aggregation functions ignore `NULL`s
- In addition to numbers, `MIN()` and `MAX()` also work with strings.
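
Both points can be checked on a tiny made-up table. The sketch below uses SQLite as a stand-in for Postgres; the table mimics `country`, but the three rows and their values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT, lifeexpectancy REAL)")
conn.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("Canada", 80.0), ("Japan", 70.0), ("Antarctica", None)],
)

row = conn.execute(
    "SELECT COUNT(*), COUNT(lifeexpectancy), AVG(lifeexpectancy), "
    "       MIN(name), MAX(name) "
    "FROM country"
).fetchone()
conn.close()

# COUNT(*) sees 3 rows, COUNT(column) and AVG() skip the NULL,
# and MIN()/MAX() compare the strings alphabetically.
print(row)  # (3, 2, 75.0, 'Antarctica', 'Japan')
```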

---

**Example:** Find the population of the world according to the `country` table in the `world_dsci513` database.

---

In [4]:
%%sql

SELECT
    SUM(population)
FROM
    country
;

 * postgresql://postgres:***@localhost/world_dsci513
1 rows affected.


sum
6078749450


There are a few things that we need to remember when using aggregation functions:

- It is valid to have multiple aggregations in a SQL query, but it is NOT possible to have both aggregations and regular columns in a single query:

```sql
-- This is CORRECT:
SELECT
    AVG(lifeexpectancy), SUM(population)
FROM
    country
WHERE
    continent = 'North America'
;

-- This is WRONG:
SELECT
    AVG(lifeexpectancy), name
FROM
    country
WHERE
    continent = 'North America'
;
```

There is only one exception to the latter rule, and that's when we have a `GROUP BY` clause (we'll learn about that in a bit).
  
- An aggregation function CANNOT be used in the `WHERE` clause, because in SQL `WHERE` processes rows before aggregations. For example, we can't find the name of countries with above average populations using the following query:

```sql
-- This is WRONG:
SELECT
    name
FROM
    country
WHERE
    population > AVG(population)
;
```

It is, of course, possible to write a query to answer the above question, but we have to wait until we learn about subqueries in a later lecture!
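
If you want to see the database refuse such a query without touching the course database, here is a small sketch with SQLite as a stand-in. SQLite also rejects an aggregate inside `WHERE`, although its error message differs from the one Postgres gives:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT, population INTEGER)")

try:
    conn.execute("SELECT name FROM country WHERE population > AVG(population)")
    rejected = False
except sqlite3.Error as err:
    # The statement is refused before any row is processed.
    rejected = True
    print("rejected:", err)
finally:
    conn.close()

print(rejected)  # True
```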

### Postgres-specific aggregations (OPTIONAL)

In addition to the standard SQL aggregation functions, Postgres also provides a number of functions of the same kind which can be useful for some statistical calculations (find a comprehensive list in the documentation [here](https://www.postgresql.org/docs/current/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE)). Here are a few examples of Postgres-specific aggregations:

| Function               | What it computes                                                     |
|------------------------|----------------------------------------------------------------------|
| `stddev_pop()`         | Population standard deviation                                        |
| `stddev_samp()`        | Sample standard deviation                                            |
| `regr_r2(Y, X)`        | Coefficient of determination for the linear regression of `Y` on `X` |
| `regr_slope(Y, X)`     | Slope of the regression line of `Y` on `X`                           |
| `regr_intercept(Y, X)` | Intercept of the regression line of `Y` on `X`                       |
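
As a reminder of the difference between the two standard deviations, Python's `statistics` module has the same pair: `pstdev()` divides by `n` (like `stddev_pop()`), while `stdev()` divides by `n - 1` (like `stddev_samp()`). The numbers below are made up for illustration:

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]

pop_sd = statistics.pstdev(values)   # population: divide by n
samp_sd = statistics.stdev(values)   # sample: divide by n - 1

print(pop_sd, round(samp_sd, 3))  # 2.0 2.138
```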

---

**Example:** Compute the average ± sample standard deviation of the population of cities located in the US using the `city` table. Write your query such that its output looks like this: `Average ± STDEV population of cities in the US = <average_population> ± <stdev_population>`

---

In [46]:
%%sql

SELECT
    'Average ± STDEV population of cities in the US = ' ||
    AVG(population)::int || ' ± ' || stddev_samp(population)::int 
FROM
    city
WHERE
    countrycode = 'USA'
;

   postgresql://postgres:***@localhost/
   postgresql://postgres:***@localhost/imdb
 * postgresql://postgres:***@localhost/world
1 rows affected.


?column?
Average ± STDEV population of cities in the US = 286955 ± 586583


## Grouping

If we divide a table into groups of rows based on values of one or more columns, that is called grouping. For example, in the `country` table of the `world_dsci513` database, we find several countries located in the same continent. In this situation, we can group the rows in our `country` table based on the values in the `continent` column. In this way, we would end up with a bunch of "sub-tables": a sub-table for all rows where `continent = 'Asia'`, another sub-table for all rows where `continent = 'Europe'`, and so on.

The formal syntax of the grouping operation in SQL looks like this:

```sql
SELECT
    grouping_columns, aggregated_columns
FROM
    table1
WHERE
    condition
GROUP BY
    grouping_columns
ORDER BY
    grouping_columns
```

Typically, it is not the sub-tables themselves that we're interested in, but some sort of summary statistics: For example, we might want to know the average population for each continent, i.e. for each sub-table or group. In order to do this, we can use the aggregation functions that we learned about in the previous section. The question of **"what is the average population of countries in each continent"** can be asked in SQL terms as follows:

In [47]:
%%sql

SELECT
    continent, AVG(population)
FROM
    country
GROUP BY
    continent
;

   postgresql://postgres:***@localhost/
   postgresql://postgres:***@localhost/imdb
 * postgresql://postgres:***@localhost/world
7 rows affected.


continent,avg
Asia,72647562.74509805
South America,24698571.42857143
North America,13053864.864864863
Oceania,1085755.3571428573
Antarctica,0.0
Africa,13525431.034482758
Europe,15871186.95652174


Important points:

- If your SQL query involves grouping as well as filtering with `WHERE` and sorting with `ORDER BY`, the `GROUP BY` clause MUST appear between `WHERE` and `ORDER BY`.
- There can't be any non-aggregated column in a grouping query, except for the columns which are used for grouping (remember the exception I talked about with aggregation functions?). In other words, a non-aggregated column in the `SELECT` clause MUST appear in the `GROUP BY` clause as well.
- If there are null values in the grouping column, there will be a separate group for null values in the results.
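
The last point can be checked with a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for Postgres (an assumption, made only so the example runs without a database server), and the `place` table and its rows are made up for illustration:

```python
import sqlite3

# In-memory SQLite database: a stand-in for Postgres, not the course setup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE place (name TEXT, continent TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO place VALUES (?, ?, ?)",
    [
        ("A", "Asia", 100),
        ("B", "Asia", 300),
        ("C", "Europe", 50),
        ("D", None, 10),  # NULL in the grouping column
        ("E", None, 30),  # NULL in the grouping column
    ],
)

# Rows whose grouping column is NULL are collected into one group of their own.
rows = conn.execute(
    """
    SELECT continent, COUNT(*), AVG(population)
    FROM place
    GROUP BY continent
    """
).fetchall()

# Map each group key (including None for NULL) to its (count, average).
groups = {continent: (n, avg) for continent, n, avg in rows}
print(groups)
```

The dictionary has three keys: `'Asia'`, `'Europe'`, and `None`, where `None` is how Python represents the null group.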

---

**Example:** Write a query to return the average and maximum population of cities in the `city` table for China, India, Canada, US, Australia, and Russia. Show the results for each country using the corresponding country code, and order groups alphabetically in ascending order.

---

In [67]:
%%sql

SELECT
    countrycode, AVG(population), MAX(population)
FROM
    city
WHERE
    countrycode IN ('CHN', 'IND', 'CAN', 'USA', 'AUS', 'RUS')
GROUP BY
    countrycode
ORDER BY
    countrycode
;

 * postgresql://postgres:***@localhost/world_dsci513
   postgresql://postgres:***@localhost:5432/mds
6 rows affected.


countrycode,avg,max
AUS,808119.0,3276207
CAN,258649.7959183673,1016376
CHN,484720.69972451794,9696300
IND,361579.2551319648,10500000
RUS,365876.7195767196,8389200
USA,286955.37956204376,8008278


### Filtering revisited: the `HAVING` clause

So far, we have used the `WHERE` clause to filter rows. However, I mentioned before that aggregation functions cannot be used inside a `WHERE` clause. There is another reserved keyword, `HAVING`, for when we need to do filtering using aggregated values. The syntax is as follows (order is important!):

```sql
SELECT
    grouping_columns, aggregated_columns
FROM
    table1
[WHERE
    condition]
GROUP BY
    grouping_columns
HAVING
    group_condition
[ORDER BY
    grouping_columns]
```

To summarize:

- `WHERE` filters rows **before** grouping (or any other operation)
- `HAVING` filters groups **after** grouping
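
A minimal sketch of this ordering, again using Python's built-in `sqlite3` module as a stand-in for Postgres (an assumption) and a made-up `city` table with hypothetical country codes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, countrycode TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [
        ("a1", "AAA", 500), ("a2", "AAA", 1500), ("a3", "AAA", 2500),
        ("b1", "BBB", 3000), ("b2", "BBB", 4000),
        ("c1", "CCC", 100),
    ],
)

rows = conn.execute(
    """
    SELECT countrycode, COUNT(*) AS n_cities
    FROM city
    WHERE population >= 1000   -- row filter: runs before grouping
    GROUP BY countrycode
    HAVING COUNT(*) >= 2       -- group filter: runs after grouping
    ORDER BY countrycode
    """
).fetchall()
print(rows)
```

Country `AAA` has three cities in the table, but its group only counts two, because `WHERE` removed the small city before the groups were formed. Country `CCC` never forms a group at all.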

---

**Example:**

Write a query to return the average and maximum population of cities for countries that have more than 60 cities listed in the `city` table.

Show the results for each country using the corresponding country code, and order groups by the number of cities of each country in descending order. Also, convert the returned values to integer type.

---

In [66]:
%%sql

SELECT
    countrycode,
    AVG(population)::int,
    MAX(population)::int,
    COUNT(population) AS count
FROM
    city
GROUP BY
    countrycode
HAVING
    COUNT(*) > 60
ORDER BY
    count DESC
;

   postgresql://postgres:***@localhost/
   postgresql://postgres:***@localhost/imdb
 * postgresql://postgres:***@localhost/world
15 rows affected.


countrycode,avg,max,count
CHN,484721,9696300,363
IND,361579,10500000,341
USA,286955,8008278,274
BRA,343507,9968485,250
JPN,314375,7980230,248
RUS,365877,8389200,189
MEX,345390,8591309,173
PHL,227462,2173831,136
DEU,282209,3386667,93
IDN,441008,9604900,85


Note that just like with the `WHERE` clause, the expression used for filtering with `HAVING` does not necessarily need to appear in the `SELECT` clause. For instance, the `HAVING` clause will still do its job even if `COUNT(population)` is omitted from the `SELECT` clause:

In [68]:
%%sql

SELECT
    countrycode,
    AVG(population)::int,
    MAX(population)::int
FROM
    city
GROUP BY
    countrycode
HAVING
    COUNT(*) > 60
ORDER BY
    COUNT(*) DESC
;

   postgresql://postgres:***@localhost/
   postgresql://postgres:***@localhost/imdb
 * postgresql://postgres:***@localhost/world
15 rows affected.


countrycode,avg,max
CHN,484721,9696300
IND,361579,10500000
USA,286955,8008278
BRA,343507,9968485
JPN,314375,7980230
RUS,365877,8389200
MEX,345390,8591309
PHL,227462,2173831
DEU,282209,3386667
IDN,441008,9604900


A `GROUP BY` clause can be considered as equivalent to using `DISTINCT` if no aggregate functions are used:

In [74]:
%%sql

SELECT
    continent
FROM
    country
GROUP BY
    continent
;

   postgresql://postgres:***@localhost/
   postgresql://postgres:***@localhost/imdb
 * postgresql://postgres:***@localhost/world
7 rows affected.


continent
Asia
South America
North America
Oceania
Antarctica
Africa
Europe


In [75]:
%%sql

SELECT
    DISTINCT continent
FROM
    country
;

   postgresql://postgres:***@localhost/
   postgresql://postgres:***@localhost/imdb
 * postgresql://postgres:***@localhost/world
7 rows affected.


continent
Asia
South America
North America
Oceania
Antarctica
Africa
Europe


> **Note:** Neither `GROUP BY` nor `DISTINCT` ignore null values.

As long as they are aggregated, columns appearing in the `HAVING` clause don't necessarily need to be present in the `SELECT` clause. For example, here we're retrieving the names of continents that have at least 40 countries:

In [87]:
%%sql

SELECT
    continent
FROM
    country
GROUP BY
    continent
HAVING
    COUNT(name) >= 40
;

   postgresql://postgres:***@localhost/
   postgresql://postgres:***@localhost/imdb
 * postgresql://postgres:***@localhost/world
3 rows affected.


continent
Asia
Africa
Europe


### Multi-level grouping

The `GROUP BY` clause can accommodate more than one column to construct multi-level groups. For example, we can group the rows in the `country` table of the `world_dsci513` database first based on `continent` and then based on `region`, all in one go:

In [70]:
%%sql

SELECT
    continent, region, AVG(population)::int
FROM
    country
GROUP BY
    continent, region
ORDER BY
    continent, region
;

   postgresql://postgres:***@localhost/
   postgresql://postgres:***@localhost/imdb
 * postgresql://postgres:***@localhost/world
25 rows affected.


continent,region,avg
Africa,Central Africa,10628000
Africa,Eastern Africa,12349950
Africa,Northern Africa,24752286
Africa,Southern Africa,9377200
Africa,Western Africa,13039529
Antarctica,Antarctica,0
Asia,Eastern Asia,188416000
Asia,Middle East,10465594
Asia,Southeast Asia,47140091
Asia,Southern and Central Asia,106484000


## Joins

Joins are probably the most fundamentally important operation in relational databases. The reason is that the whole idea of such databases is that data can be broken down into various tables that are related to each other, and can be joined together whenever related information from multiple tables is required. Consider the following query as an example:

---

**Example:** Write a query that returns the name of all countries along with their corresponding continents and their cities.

---

As we've been working with the `world_dsci513` database, we immediately notice that information about countries and cities is stored in two different tables, so we should somehow combine or "join" the data from the two tables.

The syntax for joining tables in SQL is as follows:

```sql
SELECT
    columns
FROM
    left_table
join_type
    right_table
ON
    join_condition
WHERE
    row_filter
GROUP BY
    columns
HAVING
    group_filter
ORDER BY
    columns
;
```

In this section, we'll learn how to do a join to answer the question we posed for the `world_dsci513` database, but I prefer to use a smaller database to demonstrate the various joining methods first, and then use our larger databases.

First, let's create a new database called `mds` on the local host (i.e. our own computer) and connect to it:

In [5]:
%sql CREATE DATABASE mds;

 * postgresql://postgres:***@localhost/world_dsci513
Done.


[]

In [9]:
%sql postgresql://{username}:{password}@{host}:{port}/mds

'Connected: postgres@mds'

The following cell creates three tables named `instructor`, `instructor_course`, and `course_cohort`, with information about MDS instructors, the courses they teach, and the cohorts each course is offered to. **Don't worry about the content of this cell!** We will learn how to create tables in the next lecture. For now, just run the cell to create and populate the tables:

In [56]:
%%sql

DROP TABLE IF EXISTS
    instructor,
    instructor_course,
    course_cohort
;

CREATE TABLE instructor (
    id INTEGER PRIMARY KEY,
    name TEXT,
    email TEXT,
    phone VARCHAR(12),
    department VARCHAR(50)
    )
;

INSERT INTO
    instructor (id, name, email, phone, department)
VALUES
    (1, 'Mike', 'mike@mds.ubc.ca', '605-332-2343', 'Computer Science'),
    (2, 'Tiffany', 'tiff@mds.ubc.ca', '445-794-2233', 'Neuroscience'),
    (3, 'Arman', 'arman@mds.ubc.ca', '935-738-5796', 'Physics'),
    (4, 'Varada', 'varada@mds.ubc.ca', '243-924-4446', 'Computer Science'),
    (5, 'Quan', 'quan@mds.ubc.ca', '644-818-0254', 'Economics'),
    (6, 'Joel', 'joel@mds.ubc.ca', '773-432-7669', 'Biomedical Engineering'),
    (7, 'Florencia', 'flor@mds.ubc.ca', '773-926-2837', 'Biology'),
    (8, 'Alexi', 'alexiu@mds.ubc.ca', '421-888-4550', 'Statistics'),
    (15, 'Vincenzo', 'vincenzo@mds.ubc.ca', '776-543-1212', 'Statistics'),
    (19, 'Gittu', 'gittu@mds.ubc.ca', '776-334-1132', 'Biomedical Engineering'),
    (16, 'Jessica', 'jessica@mds.ubc.ca', '211-990-1762', 'Computer Science')
;

    
CREATE TABLE instructor_course (
    id SERIAL PRIMARY KEY,
    instructor_id INTEGER,
    course TEXT,
    enrollment INTEGER,
    begins DATE
    )
;

INSERT INTO
    instructor_course (instructor_id, course, enrollment, begins)
VALUES
    (8, 'Statistical Inference and Computation I', 125, '2021-10-01'),
    (8, 'Regression II', 102, '2022-02-05'),
    (1, 'Descriptive Statistics and Probability', 79, '2021-09-10'),
    (1, 'Algorithms and Data Structures', 25, '2021-10-01'),
    (3, 'Algorithms and Data Structures', 25, '2021-10-01'),
    (3, 'Python Programming', 133, '2021-09-07'),
    (3, 'Databases & Data Retrieval', 118, '2021-11-16'),
    (6, 'Visualization I', 155, '2021-10-01'),
    (6, 'Privacy, Ethics & Security', 148, '2022-03-01'),
    (2, 'Programming for Data Manipulation', 160, '2021-09-08'),
    (7, 'Data Science Workflows', 98, '2021-09-15'),
    (2, 'Data Science Workflows', 98, '2021-09-15'),
    (12, 'Web & Cloud Computing', 78, '2022-02-10'),
    (10, 'Introduction to Optimization', NULL, '2022-09-01'),
    (9, 'Parallel Computing', NULL, '2023-01-10'),
    (13, 'Natural Language Processing', NULL, '2023-09-10')
;

CREATE TABLE course_cohort (
    id INTEGER,
    cohort VARCHAR(7)
    )
;

INSERT INTO
    course_cohort (id, cohort)
VALUES
    (13, 'MDS-CL'),
    (8, 'MDS-CL'),
    (1, 'MDS-CL'),
    (3, 'MDS-CL'),
    (1, 'MDS-V'),
    (9, 'MDS-V'),
    (3, 'MDS-V')
;

   postgresql://postgres:***@localhost/world_dsci513
 * postgresql://postgres:***@localhost:5432/mds
Done.
Done.
11 rows affected.
Done.
16 rows affected.
Done.
7 rows affected.


[]

Awesome! Let's take a look at the first two tables:

In [57]:
%sql SELECT * FROM instructor;

   postgresql://postgres:***@localhost/world_dsci513
 * postgresql://postgres:***@localhost:5432/mds
11 rows affected.


id,name,email,phone,department
1,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science
2,Tiffany,tiff@mds.ubc.ca,445-794-2233,Neuroscience
3,Arman,arman@mds.ubc.ca,935-738-5796,Physics
4,Varada,varada@mds.ubc.ca,243-924-4446,Computer Science
5,Quan,quan@mds.ubc.ca,644-818-0254,Economics
6,Joel,joel@mds.ubc.ca,773-432-7669,Biomedical Engineering
7,Florencia,flor@mds.ubc.ca,773-926-2837,Biology
8,Alexi,alexiu@mds.ubc.ca,421-888-4550,Statistics
15,Vincenzo,vincenzo@mds.ubc.ca,776-543-1212,Statistics
19,Gittu,gittu@mds.ubc.ca,776-334-1132,Biomedical Engineering


In [58]:
%sql SELECT * FROM instructor_course;

   postgresql://postgres:***@localhost/world_dsci513
 * postgresql://postgres:***@localhost:5432/mds
16 rows affected.


id,instructor_id,course,enrollment,begins
1,8,Statistical Inference and Computation I,125.0,2021-10-01
2,8,Regression II,102.0,2022-02-05
3,1,Descriptive Statistics and Probability,79.0,2021-09-10
4,1,Algorithms and Data Structures,25.0,2021-10-01
5,3,Algorithms and Data Structures,25.0,2021-10-01
6,3,Python Programming,133.0,2021-09-07
7,3,Databases & Data Retrieval,118.0,2021-11-16
8,6,Visualization I,155.0,2021-10-01
9,6,"Privacy, Ethics & Security",148.0,2022-03-01
10,2,Programming for Data Manipulation,160.0,2021-09-08


### Cross join

A cross join is the simplest way to join two tables: by cross-joining tables A and B, we match every row from table A with every row from table B. In other words, it returns all combinations of rows from table A and table B. This type of join is also sometimes called the _Cartesian product_ of two relations or tables:

In [59]:
%config SqlMagic.displaylimit = 200

In [60]:
%%sql

SELECT
    *
FROM
    instructor
CROSS JOIN
    instructor_course
;

   postgresql://postgres:***@localhost/world_dsci513
 * postgresql://postgres:***@localhost:5432/mds
176 rows affected.


id,name,email,phone,department,id_1,instructor_id,course,enrollment,begins
1,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science,1,8,Statistical Inference and Computation I,125.0,2021-10-01
2,Tiffany,tiff@mds.ubc.ca,445-794-2233,Neuroscience,1,8,Statistical Inference and Computation I,125.0,2021-10-01
3,Arman,arman@mds.ubc.ca,935-738-5796,Physics,1,8,Statistical Inference and Computation I,125.0,2021-10-01
4,Varada,varada@mds.ubc.ca,243-924-4446,Computer Science,1,8,Statistical Inference and Computation I,125.0,2021-10-01
5,Quan,quan@mds.ubc.ca,644-818-0254,Economics,1,8,Statistical Inference and Computation I,125.0,2021-10-01
6,Joel,joel@mds.ubc.ca,773-432-7669,Biomedical Engineering,1,8,Statistical Inference and Computation I,125.0,2021-10-01
7,Florencia,flor@mds.ubc.ca,773-926-2837,Biology,1,8,Statistical Inference and Computation I,125.0,2021-10-01
8,Alexi,alexiu@mds.ubc.ca,421-888-4550,Statistics,1,8,Statistical Inference and Computation I,125.0,2021-10-01
15,Vincenzo,vincenzo@mds.ubc.ca,776-543-1212,Statistics,1,8,Statistical Inference and Computation I,125.0,2021-10-01
19,Gittu,gittu@mds.ubc.ca,776-334-1132,Biomedical Engineering,1,8,Statistical Inference and Computation I,125.0,2021-10-01
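
The row count reported above follows from the definition: 11 instructors times 16 course rows gives 176 combinations. The same arithmetic can be seen on a smaller scale in the sketch below, which uses Python's built-in `sqlite3` module as a stand-in for Postgres (an assumption) and two made-up tables `a` and `b`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")
conn.execute("CREATE TABLE b (y TEXT)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO b VALUES (?)", [("p",), ("q",)])

# Every row of a is paired with every row of b: 3 * 2 = 6 rows.
pairs = conn.execute(
    "SELECT x, y FROM a CROSS JOIN b ORDER BY x, y"
).fetchall()
print(len(pairs))
print(pairs)
```

Because the result size is the product of the two table sizes, cross joins on large tables grow very quickly.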


**How to deal with ambiguous column names**

Now suppose that we want to return only the names of the instructors and their IDs from the `instructor` table, and the names of courses and their IDs from the `instructor_course` table. Since there is a column named `id` in both tables, we cannot use a bare `id` in the `SELECT` clause, because it is ambiguous.

In this situation, we should either prefix the column name with the full name of its parent table (e.g. `instructor.id`), or create table aliases using the keyword `AS` (just like we did before with columns) and prefix the column name with the parent table's alias. A table name or alias followed by a dot and the name of a column is called a _qualified name_. Here is an example of using qualified names for ambiguous column names:

In [61]:
%%sql

SELECT
    name, i.id, course, ic.id
FROM
    instructor AS i
CROSS JOIN
    instructor_course AS ic
LIMIT 10
;

   postgresql://postgres:***@localhost/world_dsci513
 * postgresql://postgres:***@localhost:5432/mds
10 rows affected.


name,id,course,id_1
Mike,1,Statistical Inference and Computation I,1
Tiffany,2,Statistical Inference and Computation I,1
Arman,3,Statistical Inference and Computation I,1
Varada,4,Statistical Inference and Computation I,1
Quan,5,Statistical Inference and Computation I,1
Joel,6,Statistical Inference and Computation I,1
Florencia,7,Statistical Inference and Computation I,1
Alexi,8,Statistical Inference and Computation I,1
Vincenzo,15,Statistical Inference and Computation I,1
Gittu,19,Statistical Inference and Computation I,1


- The keyword `AS` can be dropped
- Table aliases only exist during the execution of a statement
- Using table aliases is a great way to reduce clutter in SQL join statements
- Once you create an alias for a table, you should only use the alias to refer to that table in the statement. For example, the following query would throw an error:

```sql
-- This is WRONG
SELECT
    instructor.name, instructor.id, course, ic.id
FROM
    instructor AS i
CROSS JOIN
    instructor_course AS ic
;
```
- When you retrieve data from multiple tables, you can still use `*` to return all columns of a particular table. The only difference is that you should prefix it with the corresponding table's name or alias. For example, we can return all columns of the table `instructor` and just the `course` column of the table `instructor_course` in a cross join with the following query:

```sql
SELECT
    i.*, ic.course
FROM
    instructor AS i
CROSS JOIN
    instructor_course AS ic
;
```

### Inner join

Except for a cross join, all other types of joins use a join condition (typically given with the `ON` keyword) to figure out which rows from the two tables to pair up. An inner join is a type of join that only returns the matching rows from the left and right tables. The image below ([source](https://www.postgresqltutorial.com/postgresql-joins/)) shows a Venn diagram of an inner join:

<img src="img/lecture3/inner_join.png" width="250">

For example, in our `instructor` table there are some instructors who are assigned one or more courses in the `instructor_course` table, and some who are not. Similarly, there are courses in the `instructor_course` table that have an instructor, and some that don't have an instructor yet. With an inner join based on the `instructor.id` and `instructor_course.instructor_id` columns, we retrieve only the matching rows: instructors appear only if they have one or more assigned courses, and courses appear only if they have an instructor:

In [52]:
%%sql

SELECT
    name, i.id, ic.instructor_id, course
FROM
    instructor AS i
INNER JOIN
    instructor_course AS ic
ON
    i.id = ic.instructor_id
;

   postgresql://postgres:***@localhost/world_dsci513
 * postgresql://postgres:***@localhost:5432/mds
12 rows affected.


name,id,instructor_id,course
Alexi,8,8,Statistical Inference and Computation I
Alexi,8,8,Regression II
Mike,1,1,Descriptive Statistics and Probability
Mike,1,1,Algorithms and Data Structures
Arman,3,3,Algorithms and Data Structures
Arman,3,3,Python Programming
Arman,3,3,Databases & Data Retrieval
Joel,6,6,Visualization I
Joel,6,6,"Privacy, Ethics & Security"
Tiffany,2,2,Programming for Data Manipulation


In the table returned above, the instructors who are not yet assigned any courses (Varada, Quan, Vincenzo, Gittu, and Jessica) are missing. The courses "Web & Cloud Computing", "Introduction to Optimization", "Parallel Computing", and "Natural Language Processing" are missing as well, since there are not yet any instructors assigned to these courses.

> **Note:** The `INNER` keyword is optional.

### Self join

Sometimes we want to compare a table to itself. For example, we may want to know which pairs of instructors in the `instructor` table are from the same department. To find out, we need to compare the value in the `department` column of each row to that of all other rows to find matches:

In [62]:
%%sql

SELECT
    i1.name, i1.department, i2.department, i2.name
FROM
    instructor i1
JOIN
    instructor i2
ON
    i1.department = i2.department
    AND
    i1.id <> i2.id
;

   postgresql://postgres:***@localhost/world_dsci513
 * postgresql://postgres:***@localhost:5432/mds
10 rows affected.


name,department,department_1,name_1
Mike,Computer Science,Computer Science,Jessica
Mike,Computer Science,Computer Science,Varada
Varada,Computer Science,Computer Science,Jessica
Varada,Computer Science,Computer Science,Mike
Joel,Biomedical Engineering,Biomedical Engineering,Gittu
Alexi,Statistics,Statistics,Vincenzo
Vincenzo,Statistics,Statistics,Alexi
Gittu,Biomedical Engineering,Biomedical Engineering,Joel
Jessica,Computer Science,Computer Science,Varada
Jessica,Computer Science,Computer Science,Mike


The `i1.id <> i2.id` join condition ensures that a row does not match itself.
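
Note that each pair shows up twice in the result above, once in each order (e.g. Mike/Jessica and Jessica/Mike). If every pair should be listed only once, one common variation is to replace `<>` with `<` in the join condition. The sketch below illustrates this with Python's built-in `sqlite3` module as a stand-in for Postgres (an assumption) and a subset of the `instructor` rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instructor (id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO instructor VALUES (?, ?, ?)",
    [
        (1, "Mike", "Computer Science"),
        (4, "Varada", "Computer Science"),
        (16, "Jessica", "Computer Science"),
        (8, "Alexi", "Statistics"),
        (15, "Vincenzo", "Statistics"),
        (3, "Arman", "Physics"),
    ],
)

# i1.id < i2.id keeps only one ordering of each pair and still
# prevents a row from matching itself.
pairs = conn.execute(
    """
    SELECT i1.name, i2.name, i1.department
    FROM instructor i1
    JOIN instructor i2
    ON i1.department = i2.department
       AND i1.id < i2.id
    ORDER BY i1.id, i2.id
    """
).fetchall()
print(pairs)
```

With three Computer Science instructors and two Statistics instructors, this returns four pairs instead of eight.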

### Natural join

For joins involving a join condition, e.g. inner or self joins, we have so far specified the matching condition explicitly. When columns in different tables have the same name and we simply want to match rows with equal values **in all identically named columns**, we can do a **natural join** using the keywords `NATURAL JOIN`. For example, the `id` column in the `course_cohort` table refers to the `id` column in the `instructor_course` table. Let's find which courses are offered for which cohorts using a natural join:

In [63]:
%sql SELECT * FROM course_cohort;

   postgresql://postgres:***@localhost/world_dsci513
 * postgresql://postgres:***@localhost:5432/mds
7 rows affected.


id,cohort
13,MDS-CL
8,MDS-CL
1,MDS-CL
3,MDS-CL
1,MDS-V
9,MDS-V
3,MDS-V


In [64]:
%%sql

SELECT
    ic.course, cc.cohort
FROM
    instructor_course ic
NATURAL JOIN
    course_cohort cc
;

   postgresql://postgres:***@localhost/world_dsci513
 * postgresql://postgres:***@localhost:5432/mds
7 rows affected.


course,cohort
Web & Cloud Computing,MDS-CL
Visualization I,MDS-CL
Statistical Inference and Computation I,MDS-CL
Descriptive Statistics and Probability,MDS-CL
Statistical Inference and Computation I,MDS-V
"Privacy, Ethics & Security",MDS-V
Descriptive Statistics and Probability,MDS-V


If there are no matching columns in the two tables, `NATURAL JOIN` acts like `JOIN ... ON TRUE` and results in a cross-product join between the participating tables.
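
When the tables share exactly one column name, a natural join gives the same rows as an inner join with an explicit equality condition on that column. The sketch below checks this on a subset of the two tables, using Python's built-in `sqlite3` module as a stand-in for Postgres (an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor_course (id INTEGER PRIMARY KEY, course TEXT)")
conn.execute("CREATE TABLE course_cohort (id INTEGER, cohort TEXT)")
conn.executemany(
    "INSERT INTO instructor_course VALUES (?, ?)",
    [
        (1, "Statistical Inference and Computation I"),
        (2, "Regression II"),
        (3, "Descriptive Statistics and Probability"),
    ],
)
conn.executemany(
    "INSERT INTO course_cohort VALUES (?, ?)",
    [(1, "MDS-CL"), (3, "MDS-CL"), (1, "MDS-V")],
)

# Natural join: rows are matched on the only shared column name, id.
natural = conn.execute(
    """
    SELECT course, cohort
    FROM instructor_course
    NATURAL JOIN course_cohort
    ORDER BY course, cohort
    """
).fetchall()

# The same match written out explicitly.
explicit = conn.execute(
    """
    SELECT ic.course, cc.cohort
    FROM instructor_course ic
    JOIN course_cohort cc
    ON ic.id = cc.id
    ORDER BY ic.course, cc.cohort
    """
).fetchall()
print(natural == explicit)
```

The explicit form is more verbose, but it keeps working as intended if a column with a shared name is later added to either table.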

### Outer joins

An outer join is a type of join that returns all the rows from one or both of the tables that take part in the join. Outer joins are useful in questions that involve missing values.

#### Left outer join

In the joining process, the first table from which data is retrieved using `SELECT` is called the **left** table, and the table that is joined onto that is called the **right** table. In other words, the first table that appears in the query is the left table (table on the left of the query), and the one appearing later is the right table (table on the right of the query).

A left outer join is a type of join that returns all rows from the left table (matching or not), in addition to the matching rows from both tables. The non-matching rows from the left table are assigned null values in the columns that belong to the right table. This is schematically shown in the diagram below ([source](https://www.postgresqltutorial.com/postgresql-joins/)):

<img src="img/lecture3/left_join.png" width="250">

For example, in the [inner join](#Inner-join) example, instructors who don't teach any course are not returned by the join operation. Let's say we want to retrieve a list of all instructors and the courses they teach, as well as those who don't teach any courses:

In [32]:
%%sql

SELECT
    name, i.id, ic.instructor_id, course
FROM
    instructor AS i
LEFT OUTER JOIN
    instructor_course AS ic
ON
    i.id = ic.instructor_id
;

   postgresql://postgres:***@localhost/world_dsci513
 * postgresql://postgres:***@localhost:5432/mds
17 rows affected.


name,id,instructor_id,course
Alexi,8,8.0,Statistical Inference and Computation I
Alexi,8,8.0,Regression II
Mike,1,1.0,Descriptive Statistics and Probability
Mike,1,1.0,Algorithms and Data Structures
Arman,3,3.0,Algorithms and Data Structures
Arman,3,3.0,Python Programming
Arman,3,3.0,Databases & Data Retrieval
Joel,6,6.0,Visualization I
Joel,6,6.0,"Privacy, Ethics & Security"
Tiffany,2,2.0,Programming for Data Manipulation


> **Note:** The keyword `OUTER` is optional.

How can this be helpful? As an example, we can return the names of instructors who don't teach any courses with the following query:

In [31]:
%%sql

SELECT
    name
FROM
    instructor AS i
LEFT JOIN
    instructor_course AS ic
ON
    i.id = ic.instructor_id
WHERE
    ic.course IS NULL
;

   postgresql://postgres:***@localhost/world_dsci513
 * postgresql://postgres:***@localhost:5432/mds
5 rows affected.


name
Vincenzo
Quan
Gittu
Jessica
Varada
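
The `IS NULL` filter works because a left join pads the right-hand columns with nulls for every unmatched left row, something the truncated preview of the earlier left join result does not show. Here is a minimal sketch of that padding, using Python's built-in `sqlite3` module as a stand-in for Postgres (an assumption) and a small subset of the two tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE instructor_course (instructor_id INTEGER, course TEXT)")
conn.executemany(
    "INSERT INTO instructor VALUES (?, ?)",
    [(1, "Mike"), (4, "Varada"), (5, "Quan")],
)
conn.executemany(
    "INSERT INTO instructor_course VALUES (?, ?)",
    [(1, "Algorithms and Data Structures"), (12, "Web & Cloud Computing")],
)

# All instructors are kept; unmatched ones get NULL (None in Python)
# in the columns coming from instructor_course.
rows = conn.execute(
    """
    SELECT i.name, i.id, ic.instructor_id, ic.course
    FROM instructor AS i
    LEFT JOIN instructor_course AS ic
    ON i.id = ic.instructor_id
    ORDER BY i.id
    """
).fetchall()

# Keeping only the padded rows is what WHERE ... IS NULL does in SQL.
unassigned = [name for name, _, instructor_id, _ in rows if instructor_id is None]
print(rows)
print(unassigned)
```

The course with `instructor_id` 12 does not appear at all, because a left join only preserves unmatched rows from the left table.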


#### Right outer join

A right join acts exactly in the same way as a left join, except that it keeps all rows from the right table and only the matching ones from the left table. The diagram below demonstrates a right join schematically ([source](https://www.postgresqltutorial.com/postgresql-joins/)):

<img src="img/lecture3/right_join.png" width="250">

Let's retrieve a list of all courses and their appointed instructors, as well as those courses without an instructor:

In [35]:
%%sql

SELECT
    name, i.id, ic.instructor_id, course
FROM
    instructor AS i
RIGHT OUTER JOIN
    instructor_course AS ic
ON
    i.id = ic.instructor_id
;

   postgresql://postgres:***@localhost/world_dsci513
 * postgresql://postgres:***@localhost:5432/mds
16 rows affected.


name,id,instructor_id,course
Alexi,8.0,8,Statistical Inference and Computation I
Alexi,8.0,8,Regression II
Mike,1.0,1,Descriptive Statistics and Probability
Mike,1.0,1,Algorithms and Data Structures
Arman,3.0,3,Algorithms and Data Structures
Arman,3.0,3,Python Programming
Arman,3.0,3,Databases & Data Retrieval
Joel,6.0,6,Visualization I
Joel,6.0,6,"Privacy, Ethics & Security"
Tiffany,2.0,2,Programming for Data Manipulation


Now, let's find out which courses do not have an appointed instructor yet:

In [202]:
%%sql

SELECT
    course
FROM
    instructor AS i
RIGHT OUTER JOIN
    instructor_course AS ic
ON
    i.id = ic.instructor_id
WHERE
    i.id IS NULL
;

   postgresql://postgres:***@localhost/
   postgresql://postgres:***@localhost/imdb
 * postgresql://postgres:***@localhost/postgres
   postgresql://postgres:***@localhost/world
4 rows affected.


course
Web & Cloud Computing
Introduction to Optimization
Parallel Computing
Natural Language Processing


#### Full outer join

A full outer join is the combination of a left and right join: it retrieves **matching and non-matching** rows from **both** tables. Take a look at the schematic diagram of a full outer join ([source](https://www.postgresqltutorial.com/postgresql-joins/)):

<img src="img/lecture3/full_outer_join.png" width="250">

Let's do a full outer join between the `instructor` and `instructor_course` tables to retrieve all instructors and courses:

In [41]:
%%sql

SELECT
    name, i.id, ic.instructor_id, course
FROM
    instructor AS i
FULL OUTER JOIN
    instructor_course AS ic
ON
    i.id = ic.instructor_id
;

   postgresql://postgres:***@localhost/world_dsci513
 * postgresql://postgres:***@localhost:5432/mds
21 rows affected.


name,id,instructor_id,course
Alexi,8.0,8.0,Statistical Inference and Computation I
Alexi,8.0,8.0,Regression II
Mike,1.0,1.0,Descriptive Statistics and Probability
Mike,1.0,1.0,Algorithms and Data Structures
Arman,3.0,3.0,Algorithms and Data Structures
Arman,3.0,3.0,Python Programming
Arman,3.0,3.0,Databases & Data Retrieval
Joel,6.0,6.0,Visualization I
Joel,6.0,6.0,"Privacy, Ethics & Security"
Tiffany,2.0,2.0,Programming for Data Manipulation


We can now write a query to find instructors who are free to teach a course, and courses that need an instructor:

In [42]:
%%sql

SELECT
    name, course
FROM
    instructor AS i
FULL OUTER JOIN
    instructor_course AS ic
ON
    i.id = ic.instructor_id
WHERE
    i.name IS NULL
    OR
    ic.course IS NULL
;

   postgresql://postgres:***@localhost/world_dsci513
 * postgresql://postgres:***@localhost:5432/mds
9 rows affected.


name,course
,Web & Cloud Computing
,Introduction to Optimization
,Parallel Computing
,Natural Language Processing
Vincenzo,
Quan,
Gittu,
Jessica,
Varada,


---

**Question:** What's the difference between a cross join and a full outer join?

---

## ---------

So far in the course, we concentrated on retrieving data from a database and its tables using `SELECT` statements. In this lecture, you will learn how to make modifications to rows and tables, delete existing ones, and make new ones. You'll also learn about how to enforce constraints on your tables such that your database always stays in good shape.

In [1]:
%load_ext sql
%config SqlMagic.displaylimit = 30

In [2]:
import json
import urllib.parse

with open('data/credentials.json') as f:
    login = json.load(f)
    
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']

In [3]:
%sql postgresql://{username}:{password}@{host}:{port}/mds

'Connected: postgres@mds'

In [5]:
%%sql

DROP TABLE IF EXISTS
    instructor,
    instructor_course,
    course_cohort
;

CREATE TABLE instructor (
    id INTEGER PRIMARY KEY,
    name TEXT,
    email TEXT,
    phone VARCHAR(12),
    department VARCHAR(50)
    )
;

INSERT INTO
    instructor (id, name, email, phone, department)
VALUES
    (1, 'Mike', 'mike@mds.ubc.ca', '605-332-2343', 'Computer Science'),
    (2, 'Tiffany', 'tiff@mds.ubc.ca', '445-794-2233', 'Neuroscience'),
    (3, 'Arman', 'arman@mds.ubc.ca', '935-738-5796', 'Physics'),
    (4, 'Varada', 'varada@mds.ubc.ca', '243-924-4446', 'Computer Science'),
    (5, 'Quan', 'quan@mds.ubc.ca', '644-818-0254', 'Economics'),
    (6, 'Joel', 'joel@mds.ubc.ca', '773-432-7669', 'Biomedical Engineering'),
    (7, 'Florencia', 'flor@mds.ubc.ca', '773-926-2837', 'Biology'),
    (8, 'Alexi', 'alexiu@mds.ubc.ca', '421-888-4550', 'Statistics'),
    (15, 'Vincenzo', 'vincenzo@mds.ubc.ca', '776-543-1212', 'Statistics'),
    (19, 'Gittu', 'gittu@mds.ubc.ca', '776-334-1132', 'Biomedical Engineering'),
    (16, 'Jessica', 'jessica@mds.ubc.ca', '211-990-1762', 'Computer Science')
;

    
CREATE TABLE instructor_course (
    id SERIAL PRIMARY KEY,
    instructor_id INTEGER,
    course TEXT,
    enrollment INTEGER,
    begins DATE
    )
;

INSERT INTO
    instructor_course (instructor_id, course, enrollment, begins)
VALUES
    (8, 'Statistical Inference and Computation I', 125, '2021-10-01'),
    (8, 'Regression II', 102, '2022-02-05'),
    (1, 'Descriptive Statistics and Probability', 79, '2021-09-10'),
    (1, 'Algorithms and Data Structures', 25, '2021-10-01'),
    (3, 'Algorithms and Data Structures', 25, '2021-10-01'),
    (3, 'Python Programming', 133, '2021-09-07'),
    (3, 'Databases & Data Retrieval', 118, '2021-11-16'),
    (6, 'Visualization I', 155, '2021-10-01'),
    (6, 'Privacy, Ethics & Security', 148, '2022-03-01'),
    (2, 'Programming for Data Manipulation', 160, '2021-09-08'),
    (7, 'Data Science Workflows', 98, '2021-09-15'),
    (2, 'Data Science Workflows', 98, '2021-09-15'),
    (12, 'Web & Cloud Computing', 78, '2022-02-10'),
    (10, 'Introduction to Optimization', NULL, '2022-09-01'),
    (9, 'Parallel Computing', NULL, '2023-01-10'),
    (13, 'Natural Language Processing', NULL, '2023-09-10')
;

CREATE TABLE course_cohort (
    id INTEGER,
    cohort VARCHAR(7)
    )
;

INSERT INTO
    course_cohort (id, cohort)
VALUES
    (13, 'MDS-CL'),
    (8, 'MDS-CL'),
    (1, 'MDS-CL'),
    (3, 'MDS-CL'),
    (1, 'MDS-V'),
    (9, 'MDS-V'),
    (3, 'MDS-V')
;

 * postgresql://postgres:***@localhost/mds
Done.
Done.
11 rows affected.
Done.
16 rows affected.
Done.
7 rows affected.


[]

Let's take a look at the tables of the `mds` database that we created in lecture 3 to demonstrate various types of joins:

In [6]:
%sql SELECT * FROM instructor;

 * postgresql://postgres:***@localhost/mds
11 rows affected.


id,name,email,phone,department
1,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science
2,Tiffany,tiff@mds.ubc.ca,445-794-2233,Neuroscience
3,Arman,arman@mds.ubc.ca,935-738-5796,Physics
4,Varada,varada@mds.ubc.ca,243-924-4446,Computer Science
5,Quan,quan@mds.ubc.ca,644-818-0254,Economics
6,Joel,joel@mds.ubc.ca,773-432-7669,Biomedical Engineering
7,Florencia,flor@mds.ubc.ca,773-926-2837,Biology
8,Alexi,alexiu@mds.ubc.ca,421-888-4550,Statistics
15,Vincenzo,vincenzo@mds.ubc.ca,776-543-1212,Statistics
19,Gittu,gittu@mds.ubc.ca,776-334-1132,Biomedical Engineering


In [7]:
%sql SELECT * FROM instructor_course;

 * postgresql://postgres:***@localhost/mds
16 rows affected.


id,instructor_id,course,enrollment,begins
1,8,Statistical Inference and Computation I,125.0,2021-10-01
2,8,Regression II,102.0,2022-02-05
3,1,Descriptive Statistics and Probability,79.0,2021-09-10
4,1,Algorithms and Data Structures,25.0,2021-10-01
5,3,Algorithms and Data Structures,25.0,2021-10-01
6,3,Python Programming,133.0,2021-09-07
7,3,Databases & Data Retrieval,118.0,2021-11-16
8,6,Visualization I,155.0,2021-10-01
9,6,"Privacy, Ethics & Security",148.0,2022-03-01
10,2,Programming for Data Manipulation,160.0,2021-09-08


## Inserting, modifying, and deleting rows

A database is rarely used only for retrieving data. We often want to insert new data, update existing data, or delete obsolete data. You might remember from lecture 1 that this relates to the data manipulation language (DML) that a DBMS also provides along with its data query language (DQL). For relational DBMSs, SQL provides standard statements for data manipulation using the keywords `INSERT`, `UPDATE`, and `DELETE`.

With row insertion, updating and deletion statements, we typically need to know in advance about the structure of our table, e.g. column names, their data types, constraints, etc. We can easily inspect the columns of a table using `psql`'s meta-commands that we've learned before. For example, we can find out about the columns and datatypes in the `instructor` table by running `\d instructor` in psql:

<img src="img/lecture4/d_instructor.png" width="600">

### `INSERT`

The `INSERT` statement is used to add new rows to a table, and can be used in three different ways:
- by column position
- by column name
- from a table

#### By column position

```sql
INSERT INTO
    table_name
VALUES
    (value1, value2, ...);
```

- It's not mandatory to provide a value for every column in the table, unless a column is explicitly set as **non-nullable**. This is a Postgres extension; in other RDBMSs you might need to provide a value for every column using this syntax. The values are assigned to columns from left to right.
- The order of values should be the same as the order of columns in the table.
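
The two points above can be tried out in a small, self-contained sketch. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the Postgres `mds` database, and the column types of `instructor` are assumptions. Conveniently, SQLite is one of the systems that does require a value for every column with this syntax, so it also illustrates the portability caveat:

```python
import sqlite3

# In-memory SQLite database standing in for the Postgres `mds` database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instructor ("
    " id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT, department TEXT)"
)

# Values are matched to columns by position, from left to right.
conn.execute(
    "INSERT INTO instructor VALUES "
    "(78, 'Rachel', 'rachel@cs.ubc.ca', '766-442-9059', 'Computer Science')"
)

# A shorter value list is accepted by Postgres but rejected by SQLite.
try:
    conn.execute("INSERT INTO instructor VALUES (79, 'Anthony', 'anthony@gmail')")
    short_list_accepted = True
except sqlite3.OperationalError as err:
    short_list_accepted = False
    print("SQLite rejected the short value list:", err)

rows = conn.execute("SELECT * FROM instructor").fetchall()
print(rows)
```

If a statement needs to run on more than one RDBMS, listing the column names explicitly is the safer choice.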

For example, let's add two new instructors to our `instructor` table in the `mds` database:

In [8]:
%%sql

INSERT INTO
    instructor
VALUES
    (78, 'Rachel', 'rachel@cs.ubc.ca', '766-442-9059', 'Computer Science')
;

 * postgresql://postgres:***@localhost/mds
1 rows affected.


[]

In [9]:
%%sql

INSERT INTO
    instructor
VALUES
    (79, 'Anthony', 'anthony@gmail')
;

 * postgresql://postgres:***@localhost/mds
1 rows affected.


[]

In [10]:
%sql SELECT * FROM instructor;

 * postgresql://postgres:***@localhost/mds
13 rows affected.


id,name,email,phone,department
1,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science
2,Tiffany,tiff@mds.ubc.ca,445-794-2233,Neuroscience
3,Arman,arman@mds.ubc.ca,935-738-5796,Physics
4,Varada,varada@mds.ubc.ca,243-924-4446,Computer Science
5,Quan,quan@mds.ubc.ca,644-818-0254,Economics
6,Joel,joel@mds.ubc.ca,773-432-7669,Biomedical Engineering
7,Florencia,flor@mds.ubc.ca,773-926-2837,Biology
8,Alexi,alexiu@mds.ubc.ca,421-888-4550,Statistics
15,Vincenzo,vincenzo@mds.ubc.ca,776-543-1212,Statistics
19,Gittu,gittu@mds.ubc.ca,776-334-1132,Biomedical Engineering


> Remember that there is no guarantee that the new row shows up as the last row in our table.

> Columns that are not given a value in the insert statement will either be set to their default value (if they have one) or null.

#### By column name

```sql
INSERT INTO
    table_name(col1, col2, ...)
VALUES
    (value1, value2, ...);
```

- A value must be provided for every listed column, but the list does not have to cover every column in the table. The exception is a column that cannot be left empty, such as a primary key with no default value (more about this later in this lecture).
- The column names and their values can appear in any order with this syntax
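
Here is a minimal sketch of how omitted columns are filled in, again using Python's `sqlite3` as a stand-in for Postgres. The schema is hypothetical: `department` is given a default value of `'Unassigned'` (not part of the real `instructor` table) so that both the "default" and the "null" case are visible:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: `department` has a default value, `phone` does not.
conn.execute(
    "CREATE TABLE instructor ("
    " id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT,"
    " department TEXT DEFAULT 'Unassigned')"
)

# Columns are listed in a shuffled order; `phone` is left out entirely.
conn.execute(
    "INSERT INTO instructor(department, name, email, id) "
    "VALUES ('Mathematics', 'Carl', 'carl@math.ubc.ca', 65)"
)
# Only the primary key is supplied.
conn.execute("INSERT INTO instructor(id) VALUES (999)")

rows = conn.execute(
    "SELECT id, name, email, phone, department FROM instructor ORDER BY id"
).fetchall()
for row in rows:
    print(row)  # omitted columns show up as None (null) or as the default value
```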

For example, here I'll add another row to the `instructor` table using the syntax above, with a shuffled column order:

In [11]:
%%sql

INSERT INTO
    instructor(department, name, email, id)
VALUES
    ('Mathematics', 'Carl', 'carl@math.ubc.ca', 65)
;

 * postgresql://postgres:***@localhost/mds
1 rows affected.


[]

In [12]:
%sql SELECT * FROM instructor;

 * postgresql://postgres:***@localhost/mds
14 rows affected.


id,name,email,phone,department
1,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science
2,Tiffany,tiff@mds.ubc.ca,445-794-2233,Neuroscience
3,Arman,arman@mds.ubc.ca,935-738-5796,Physics
4,Varada,varada@mds.ubc.ca,243-924-4446,Computer Science
5,Quan,quan@mds.ubc.ca,644-818-0254,Economics
6,Joel,joel@mds.ubc.ca,773-432-7669,Biomedical Engineering
7,Florencia,flor@mds.ubc.ca,773-926-2837,Biology
8,Alexi,alexiu@mds.ubc.ca,421-888-4550,Statistics
15,Vincenzo,vincenzo@mds.ubc.ca,776-543-1212,Statistics
19,Gittu,gittu@mds.ubc.ca,776-334-1132,Biomedical Engineering


Let's also try to insert a row with only one value for the column `id` (which is required since it's a primary key):

In [13]:
%%sql

INSERT INTO
    instructor(id)
VALUES
    (999)
;

 * postgresql://postgres:***@localhost/mds
1 rows affected.


[]

In [14]:
%sql SELECT * FROM instructor;

 * postgresql://postgres:***@localhost/mds
15 rows affected.


id,name,email,phone,department
1,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science
2,Tiffany,tiff@mds.ubc.ca,445-794-2233,Neuroscience
3,Arman,arman@mds.ubc.ca,935-738-5796,Physics
4,Varada,varada@mds.ubc.ca,243-924-4446,Computer Science
5,Quan,quan@mds.ubc.ca,644-818-0254,Economics
6,Joel,joel@mds.ubc.ca,773-432-7669,Biomedical Engineering
7,Florencia,flor@mds.ubc.ca,773-926-2837,Biology
8,Alexi,alexiu@mds.ubc.ca,421-888-4550,Statistics
15,Vincenzo,vincenzo@mds.ubc.ca,776-543-1212,Statistics
19,Gittu,gittu@mds.ubc.ca,776-334-1132,Biomedical Engineering


#### Multiple rows at once

When inserting rows, we don't need to write several `INSERT INTO` statements to insert several rows. It's possible to insert multiple rows with a single `INSERT INTO` statement by separating the rows to be inserted with commas:
```sql
INSERT INTO
    table_name(col1, col2, ...)
VALUES
    (row1_value1, row1_value2, ...),
    (row2_value1, row2_value2, ...),
    (row3_value1, row3_value2, ...),
    (row4_value1, row4_value2, ...)
;
```

#### From a table

```sql
INSERT INTO
    table_name
    [(col1, col2, ...)]
SELECT ...
```

This `INSERT` syntax allows for reading rows from another table and inserting them into the table we want. In terms of column order and default values, it works the same as the previous two methods.

Let's assume that we had another table called `visiting_instructor` in our `mds` database, with the following rows:

In [15]:
%%sql

CREATE TABLE visiting_instructor (
    id INTEGER PRIMARY KEY,
    name TEXT,
    email TEXT
    )
;

INSERT INTO
    visiting_instructor (id, name, email)
VALUES
    (501, 'Oliver', 'oliver@gmail.com'),
    (502, 'Adriana', 'adriana@gmail.com')
;

 * postgresql://postgres:***@localhost/mds
Done.
2 rows affected.


[]

In [16]:
%sql SELECT * FROM visiting_instructor;

 * postgresql://postgres:***@localhost/mds
2 rows affected.


id,name,email
501,Oliver,oliver@gmail.com
502,Adriana,adriana@gmail.com


Recently, these two visiting instructors have accepted permanent positions in the MDS program, so we want to add them to our `instructor` table. We can do so with the following `INSERT` statement, which copies the rows from `visiting_instructor` into `instructor`:

In [17]:
%%sql

INSERT INTO
    instructor(id, name, email)
SELECT
    *
FROM visiting_instructor
;

 * postgresql://postgres:***@localhost/mds
2 rows affected.


[]

In [18]:
%sql SELECT * FROM instructor;

 * postgresql://postgres:***@localhost/mds
17 rows affected.


id,name,email,phone,department
1,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science
2,Tiffany,tiff@mds.ubc.ca,445-794-2233,Neuroscience
3,Arman,arman@mds.ubc.ca,935-738-5796,Physics
4,Varada,varada@mds.ubc.ca,243-924-4446,Computer Science
5,Quan,quan@mds.ubc.ca,644-818-0254,Economics
6,Joel,joel@mds.ubc.ca,773-432-7669,Biomedical Engineering
7,Florencia,flor@mds.ubc.ca,773-926-2837,Biology
8,Alexi,alexiu@mds.ubc.ca,421-888-4550,Statistics
15,Vincenzo,vincenzo@mds.ubc.ca,776-543-1212,Statistics
19,Gittu,gittu@mds.ubc.ca,776-334-1132,Biomedical Engineering


Note that:

- We can retrieve any subset of columns and rows in another table as long as they are consistent with the columns of the destination table.
- The column names returned by the `SELECT` statement are ignored by the RDBMS.
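
The second point is easy to verify with a small sketch. It uses Python's `sqlite3` in place of Postgres, assumed column types, and deliberately unhelpful aliases (`foo`, `bar`, `baz`) to show that only the position of each returned column matters:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instructor (
        id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT, department TEXT
    );
    CREATE TABLE visiting_instructor (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    INSERT INTO visiting_instructor VALUES
        (501, 'Oliver', 'oliver@gmail.com'),
        (502, 'Adriana', 'adriana@gmail.com');
""")

# The result columns are matched to (id, name, email) by position,
# so the aliases have no effect. The WHERE clause picks a subset of rows.
conn.execute("""
    INSERT INTO instructor(id, name, email)
    SELECT id AS foo, name AS bar, email AS baz
    FROM visiting_instructor
    WHERE id > 501
""")

rows = conn.execute("SELECT id, name, email, department FROM instructor").fetchall()
print(rows)
```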

### `UPDATE`

In addition to adding rows to our tables, we can also update existing ones. The standard SQL syntax for updating rows is:

```sql
UPDATE
    table_name
SET
    col1 = expr1,
    col2 = expr2,
    ...
WHERE
    condition
;
```

For example, let's assign our new instructors to the business school:

In [19]:
%%sql

UPDATE
    instructor
SET
    department = 'Business'
WHERE
    department IS NULL
    AND
    name IS NOT NULL
;

 * postgresql://postgres:***@localhost/mds
3 rows affected.


[]

In [20]:
%sql SELECT * FROM instructor;

 * postgresql://postgres:***@localhost/mds
17 rows affected.


id,name,email,phone,department
1,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science
2,Tiffany,tiff@mds.ubc.ca,445-794-2233,Neuroscience
3,Arman,arman@mds.ubc.ca,935-738-5796,Physics
4,Varada,varada@mds.ubc.ca,243-924-4446,Computer Science
5,Quan,quan@mds.ubc.ca,644-818-0254,Economics
6,Joel,joel@mds.ubc.ca,773-432-7669,Biomedical Engineering
7,Florencia,flor@mds.ubc.ca,773-926-2837,Biology
8,Alexi,alexiu@mds.ubc.ca,421-888-4550,Statistics
15,Vincenzo,vincenzo@mds.ubc.ca,776-543-1212,Statistics
19,Gittu,gittu@mds.ubc.ca,776-334-1132,Biomedical Engineering


---
    
**Remember:**

`UPDATE` is a **dangerous** statement; you might accidentally modify all or several rows in a table with a wrong search condition. It is always a good idea to use a `SELECT` statement to make sure the returned rows are actually the ones you want to update, and then modify them via the `UPDATE` statement.

Another option is to create a temporary table to test your `UPDATE` statement. You'll learn how to do that later in this lecture.

---
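
The "`SELECT` first, `UPDATE` second" habit described above can be sketched as follows. This is only an illustration using Python's `sqlite3` with a simplified, hypothetical version of the `instructor` table; the point is that the exact same condition is reused in both statements:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instructor (id INTEGER PRIMARY KEY, name TEXT, department TEXT);
    INSERT INTO instructor VALUES
        (1, 'Mike', 'Computer Science'),
        (79, 'Anthony', NULL),
        (501, 'Oliver', NULL),
        (999, NULL, NULL);
""")

# A fixed, trusted string (never build SQL from user input like this).
condition = "department IS NULL AND name IS NOT NULL"

# Step 1: preview the rows that the condition selects.
preview = conn.execute(
    f"SELECT id, name FROM instructor WHERE {condition} ORDER BY id"
).fetchall()
print("Rows that would change:", preview)

# Step 2: run the UPDATE with the same condition and compare the counts.
cur = conn.execute(f"UPDATE instructor SET department = 'Business' WHERE {condition}")
updated = cur.rowcount
print("Rows updated:", updated)
```

If the number of updated rows differs from the number of previewed rows, something changed between the two steps and the result deserves a second look.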

### `DELETE`

Well, there finally comes a time when some rows need to be deleted (or all rows, who knows...). The `DELETE` statement in SQL removes rows from a table. Most of the time we don't want to delete all rows but only those that meet specific conditions. Similar to `UPDATE`, `DELETE` also accepts a `WHERE` clause:

```sql
DELETE FROM
    table_name
WHERE
    condition
;
```

For example, I want to remove the row in the `instructor` table that has an `id` of 999:

In [21]:
%%sql

DELETE FROM
    instructor
WHERE
    id = 999
;

 * postgresql://postgres:***@localhost/mds
1 rows affected.


[]

In [22]:
%sql SELECT * FROM instructor;

 * postgresql://postgres:***@localhost/mds
16 rows affected.


id,name,email,phone,department
1,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science
2,Tiffany,tiff@mds.ubc.ca,445-794-2233,Neuroscience
3,Arman,arman@mds.ubc.ca,935-738-5796,Physics
4,Varada,varada@mds.ubc.ca,243-924-4446,Computer Science
5,Quan,quan@mds.ubc.ca,644-818-0254,Economics
6,Joel,joel@mds.ubc.ca,773-432-7669,Biomedical Engineering
7,Florencia,flor@mds.ubc.ca,773-926-2837,Biology
8,Alexi,alexiu@mds.ubc.ca,421-888-4550,Statistics
15,Vincenzo,vincenzo@mds.ubc.ca,776-543-1212,Statistics
19,Gittu,gittu@mds.ubc.ca,776-334-1132,Biomedical Engineering


Also, I want to remove the newly hired instructors (which we copied into the `instructor` table) from the `visiting_instructor` table. Since I want to delete all rows in that table, I can write the `DELETE` statement without a `WHERE` condition:

In [23]:
%sql DELETE FROM visiting_instructor;

 * postgresql://postgres:***@localhost/mds
2 rows affected.


[]

In [24]:
%sql SELECT * FROM visiting_instructor;

 * postgresql://postgres:***@localhost/mds
0 rows affected.


id,name,email


Note that although the `DELETE` statement has removed all rows, **the table structure is intact**. In other words, the table is still there but stores no rows at the moment.
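
A quick way to convince yourself of this is to delete all rows and then look the table up in the catalog. The sketch below uses Python's `sqlite3` as a stand-in for Postgres, so the catalog table is SQLite's `sqlite_master` (in Postgres you would use `\d` in `psql` instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visiting_instructor (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    INSERT INTO visiting_instructor VALUES
        (501, 'Oliver', 'oliver@gmail.com'),
        (502, 'Adriana', 'adriana@gmail.com');
""")

# No WHERE clause: every row is removed.
conn.execute("DELETE FROM visiting_instructor")

remaining = conn.execute("SELECT COUNT(*) FROM visiting_instructor").fetchone()[0]

# The table definition is still registered in the catalog.
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
print(remaining, tables)
```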

### `TRUNCATE`

If the goal is to remove all rows from a table, popular RDBMSs including Postgres also support the `TRUNCATE` statement with the following syntax:
```sql
TRUNCATE TABLE table_name;
```

`TRUNCATE` is **faster and more efficient** in terms of RDBMS resources because, unlike `DELETE`, it does not scan every row. During the process, `DELETE` logs the changes made to each and every row, whereas `TRUNCATE` treats the deletion of all rows as a single operation. Even though `TRUNCATE` locks the entire table during the operation (as opposed to `DELETE`, which locks individual rows), it is still the better choice for deleting all rows because it releases the lock much sooner.

---
    
**Remember:**

`DELETE` and `TRUNCATE` are both very dangerous statements, and you should use them with extreme caution in real-life databases. With `DELETE`, you should always try a `SELECT` query first to see if your `WHERE` condition returns the rows that you want, and then use `DELETE` to remove them.

Another option is to create a temporary table to test your `DELETE` statement. You'll learn how to do that later in this lecture.

---

### `RETURNING` (OPTIONAL)

The table modifying commands `INSERT`, `UPDATE`, and `DELETE` all accept a `RETURNING` clause which returns the rows that have been modified by these commands. The returning clause can be helpful in reliably identifying the rows that have been modified, without having to run a separate `SELECT` statement after table modification. Here is the syntax for `RETURNING` used along with `INSERT`:

```sql
INSERT INTO
    table1(col1, col2, ...)
VALUES
    (val1, val2, ...)
RETURNING
    *
;
```

We can return any subset of the columns of the modified rows, just as in a `SELECT` statement.

Here is an example of returning rows from an `UPDATE` statement:

In [25]:
%%sql

UPDATE
    instructor
SET
    department = 'STATS'
WHERE
    department = 'Statistics'
RETURNING
    *
;

 * postgresql://postgres:***@localhost/mds
2 rows affected.


id,name,email,phone,department
8,Alexi,alexiu@mds.ubc.ca,421-888-4550,STATS
15,Vincenzo,vincenzo@mds.ubc.ca,776-543-1212,STATS


We can check that the above two rows in the `instructor` table are actually the ones that are updated:

In [26]:
%sql SELECT * FROM instructor;

 * postgresql://postgres:***@localhost/mds
16 rows affected.


id,name,email,phone,department
1,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science
2,Tiffany,tiff@mds.ubc.ca,445-794-2233,Neuroscience
3,Arman,arman@mds.ubc.ca,935-738-5796,Physics
4,Varada,varada@mds.ubc.ca,243-924-4446,Computer Science
5,Quan,quan@mds.ubc.ca,644-818-0254,Economics
6,Joel,joel@mds.ubc.ca,773-432-7669,Biomedical Engineering
7,Florencia,flor@mds.ubc.ca,773-926-2837,Biology
19,Gittu,gittu@mds.ubc.ca,776-334-1132,Biomedical Engineering
16,Jessica,jessica@mds.ubc.ca,211-990-1762,Computer Science
78,Rachel,rachel@cs.ubc.ca,766-442-9059,Computer Science


## Creating, modifying, and dropping tables

In this section, we'll learn the basics of creating tables from scratch, modifying them, and dropping them. Before we discuss how to do this, I need to remind you that designing a database with optimal structure, relations, and constraints typically requires expert knowledge and experience, and sometimes many design iterations. The goal here is for you to get a basic knowledge of how tables are defined, altered, and dropped, and how constraints are enforced in relatively simple cases. Even if you're not going to design a database or create tables yourself, understanding what the process looks like helps with how you think about a database in general and how you use it.

### Creating tables

The general syntax for creating a table is as follows:

```sql
CREATE TABLE table_name (
   column1    datatype [column_constraint],
   column2    datatype [column_constraint],
   column3    datatype [column_constraint],
   [table_constraints]
);
```

In order to define a table, we need:
- column names
- column data types

You may optionally also define
- default values
- constraints

Suppose that in a database for a sample online store, we have a table to keep information about our customers. We can create a table called `customer` with the following statement:

```sql
CREATE TABLE customer
(
    customer_id    INTEGER,
    title          CHAR(4),
    fname          VARCHAR(32),
    lname          VARCHAR(32),
    addressline    VARCHAR(64),
    town           VARCHAR(32),
    zipcode        CHAR(10),
    phone          VARCHAR(16)
);
```
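
To check that the statement does what we expect, we can run it and then list the resulting columns. The sketch below does this with Python's `sqlite3` rather than Postgres, so it uses SQLite's `PRAGMA table_info` where we would use `\d customer` in `psql`. Note that SQLite treats declared types such as `VARCHAR(32)` more loosely than Postgres does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer
    (
        customer_id    INTEGER,
        title          CHAR(4),
        fname          VARCHAR(32),
        lname          VARCHAR(32),
        addressline    VARCHAR(64),
        town           VARCHAR(32),
        zipcode        CHAR(10),
        phone          VARCHAR(16)
    )
""")

# Each row of table_info is (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(customer)")]
for name, declared_type in columns:
    print(f"{name:<12} {declared_type}")
```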

## ---------

## Subqueries

There is a particular type of question that we have avoided so far for our SQL queries, and that is one for which we need the result of a second query to be able to run the first query. Take the following question as an example:

---

**Example:** Using the `world_dsci513` database, find the countries with surface area above the average value of all countries in the world.

---

This query looks simple. You might be tempted to try

```sql
-- This will NOT work
SELECT
    name
FROM
    country
WHERE
    surfacearea > AVG(surfacearea)
;
```

but we've learned before that aggregate functions **cannot** be used within a `WHERE` clause.

To answer this question, we need to query the database twice: once to obtain the average surface area, and once to actually retrieve the rows that satisfy the condition. However, we don't need to do that in two separate queries and manually take the data coming from the first query and use it in the second. We can use a **subquery** to do that for us.
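
To make the "two separate queries" idea concrete, here is a minimal sketch of the manual approach. It uses Python's `sqlite3` with a toy `country` table (three hypothetical rows, not the real `world_dsci513` data). It also confirms that the naive query with `AVG` inside `WHERE` is rejected:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (name TEXT, surfacearea REAL);
    -- Toy rows for illustration only
    INSERT INTO country VALUES ('A', 10.0), ('B', 20.0), ('C', 60.0);
""")

# An aggregate function inside WHERE is rejected.
try:
    conn.execute("SELECT name FROM country WHERE surfacearea > AVG(surfacearea)")
    naive_query_ok = True
except sqlite3.OperationalError as err:
    naive_query_ok = False
    print("Rejected:", err)

# Query 1: compute the average surface area.
avg_area = conn.execute("SELECT AVG(surfacearea) FROM country").fetchone()[0]

# Query 2: pass that value by hand into the query that filters the rows.
above_avg = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM country WHERE surfacearea > ? ORDER BY name", (avg_area,)
    )
]
print(avg_area, above_avg)
```

A subquery removes the manual hand-off between the two queries.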

A subquery is a `SELECT` statement that is incorporated into another SQL statement. For example, the query that computes the average surface area is:

```sql
SELECT
    AVG(surfacearea)
FROM
    country
;
```

We can use this intermediate information in our original query by embedding the above query in the `WHERE` clause of the original query:

```sql
SELECT
    name
FROM
    country
WHERE
    surfacearea > (
        SELECT AVG(surfacearea) FROM country
    )
;
```

In [39]:
%sql postgresql://{username}:{password}@{host}:{port}/world_dsci513

'Connected: postgres@world_dsci513'

In [41]:
%%sql

SELECT
    name
FROM
    country
WHERE
    surfacearea > (
        SELECT
            AVG(surfacearea)
        FROM
            country
    )
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
43 rows affected.


name
Afghanistan
Algeria
Angola
Argentina
Australia
Bolivia
Brazil
Chile
Egypt
South Africa


Note that:

- A subquery should always be enclosed in parentheses, e.g. `(SELECT ...)`
- Subqueries should **not** be terminated by a semi-colon, as opposed to regular queries
- Sometimes the main SQL statement is called the **outer query** and the subquery is called the **inner query**

A subquery can be used in the `SELECT`, `FROM`, `WHERE`, and `HAVING` clauses, but it is most commonly used in the `WHERE` clause.

In our last query, we can use a subquery in the `SELECT` clause as well to check if the returned rows do satisfy the condition of `surfacearea > AVG(surfacearea)`:

In [42]:
%%sql

SELECT
    name,
    ROUND(surfacearea::NUMERIC / (SELECT AVG(surfacearea) FROM country)::NUMERIC, 2)
        AS ratio
FROM
    country
WHERE
    surfacearea > (
        SELECT AVG(surfacearea) FROM country
    )
ORDER BY
    ratio
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
43 rows affected.


name,ratio
Somalia,1.02
Afghanistan,1.05
Myanmar,1.09
Zambia,1.21
Chile,1.21
Turkey,1.24
Pakistan,1.28
Mozambique,1.29
Namibia,1.32
Tanzania,1.42


> **Note:** A subquery in the `SELECT` clause should always return a single value, not a column or rows of values.

---

**Example:** Retrieve the names of countries whose capital cities have a population larger than 5 million.

---

In [58]:
%%sql

SELECT
    name
FROM
    country
WHERE
    capital IN (
        SELECT id
        FROM city
        WHERE population > 5000000
    )
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
13 rows affected.


name
United Kingdom
Egypt
Indonesia
Iran
Japan
China
Colombia
"Congo, The Democratic Republic of the"
South Korea
Mexico


Well, as you might have guessed, we can rewrite this query using a **join**:

In [61]:
%%sql

SELECT
    co.name
FROM
    country co
JOIN
    city ci
ON
    co.capital = ci.id
WHERE
    ci.population > 5000000
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
13 rows affected.


name
United Kingdom
Egypt
Indonesia
Iran
Japan
China
Colombia
"Congo, The Democratic Republic of the"
South Korea
Mexico


Using a subquery is actually another way to gain access to data stored in other tables.

Typically, joins can be rewritten as a subquery and vice-versa, so what's the difference?

- Subqueries tend to be more readable and more intuitive
- Subqueries cannot be used if you need to include columns from the inner query in your results
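
The second point can be illustrated with a small sketch. It uses Python's `sqlite3` and two toy tables with hypothetical rows (not the real `world_dsci513` data). Both queries find the same country, but only the join can also report the capital's name and population:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT, population INTEGER);
    CREATE TABLE country (name TEXT, capital INTEGER);
    -- Toy rows for illustration only
    INSERT INTO city VALUES (1, 'Bigtown', 6000000), (2, 'Smallville', 40000);
    INSERT INTO country VALUES ('Alpha', 1), ('Beta', 2);
""")

# Subquery: the outer SELECT can only see columns of `country`.
via_subquery = conn.execute("""
    SELECT name
    FROM country
    WHERE capital IN (SELECT id FROM city WHERE population > 5000000)
""").fetchall()

# Join: columns from `city` are available in the result as well.
via_join = conn.execute("""
    SELECT co.name, ci.name, ci.population
    FROM country co
    JOIN city ci ON co.capital = ci.id
    WHERE ci.population > 5000000
""").fetchall()

print(via_subquery)
print(via_join)
```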

---

**Example:** Retrieve the names of countries where English is an official language and whose population is over 1 million.

---

In [85]:
%%sql

SELECT
    name
FROM
    country
WHERE
    population > 1000000
    AND
    code IN (
        SELECT
            countrycode
        FROM
            countrylanguage
        WHERE
            language = 'English'
            AND
            isofficial = True
    )
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
10 rows affected.


name
Australia
United Kingdom
South Africa
Hong Kong
Ireland
Canada
Lesotho
New Zealand
United States
Zimbabwe


### Correlated subqueries

Subqueries that we have seen so far are called **simple or uncorrelated subqueries**, as they are executed **once** and **independently** of the outer query.

It sometimes happens that we need data from each row of the outer query in the subquery. This is an instance of what's called a **correlated subquery**. In a correlated subquery, the subquery takes the current row of the outer query and executes over all rows of the inner query. When it finishes, the next row from the outer query is selected, the subquery is executed in full again for that outer row, and so on.
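
The row-by-row description above can be mimicked with an explicit nested loop. The sketch below is conceptual only: it uses Python's `sqlite3` with a toy `country` table (hypothetical rows), and a real query planner is free to evaluate a correlated subquery differently. It compares the loop against the equivalent correlated subquery:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (name TEXT, continent TEXT, population INTEGER);
    -- Toy rows for illustration only
    INSERT INTO country VALUES
        ('A1', 'X', 5), ('A2', 'X', 9), ('B1', 'Y', 7), ('B2', 'Y', 3);
""")
rows = conn.execute("SELECT name, continent, population FROM country").fetchall()

# Conceptual evaluation: for every outer row, run the inner query again.
largest_by_loop = []
for name, continent, population in rows:                      # outer query
    inner_max = max(p for _, c, p in rows if c == continent)  # inner query for this row
    if population == inner_max:
        largest_by_loop.append(name)

# The same question asked with a correlated subquery.
largest_by_sql = [
    row[0]
    for row in conn.execute("""
        SELECT c1.name
        FROM country c1
        WHERE c1.population = (
            SELECT MAX(c2.population)
            FROM country c2
            WHERE c2.continent = c1.continent
        )
    """)
]
print(sorted(largest_by_loop), sorted(largest_by_sql))
```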

---

**Example:** Which countries have the largest population in the continent where they are located?

---

The query for the above example requires the population of each country to be compared with that of all other countries in the same continent. Comparing each row with every other row that meets a particular condition can be achieved with a correlated subquery:

In [21]:
%%sql

SELECT
    c1.name, c1.continent
FROM
    country c1
WHERE
    c1.population = (
        SELECT
            MAX(c2.population)
        FROM
            country c2
        WHERE
            c2.continent = c1.continent
    )
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
11 rows affected.


name,continent
Australia,Oceania
Brazil,South America
China,Asia
Nigeria,Africa
Russian Federation,Europe
United States,North America
Antarctica,Antarctica
Bouvet Island,Antarctica
South Georgia and the South Sandwich Islands,Antarctica
Heard Island and McDonald Islands,Antarctica


---

**Example:** Write a query that returns the most populated city listed for each country code in the `city` table.

---

In [74]:
%%sql

SELECT
    ci1.name, ci1.countrycode
FROM
    city ci1
WHERE
    ci1.population = (
        SELECT
            MAX(ci2.population)
        FROM
            city ci2
        WHERE
            ci2.countrycode = ci1.countrycode
    )
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
232 rows affected.


name,countrycode
Kabul,AFG
Amsterdam,NLD
Willemstad,ANT
Tirana,ALB
Alger,DZA
Tafuna,ASM
Andorra la Vella,AND
Luanda,AGO
South Hill,AIA
Saint John´s,ATG


---

**Example:** Use the above query to return the name of countries whose capital city is not their most populated city.

---

The outer query that contains a subquery can itself be the result of joining other tables. In this example, we first need to join `country` and `city` to list the capital city of each country:

In [25]:
%%sql

SELECT
    co.name, ci.name
FROM
    country co
JOIN
    city ci
ON
    co.capital = ci.id
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
232 rows affected.


name,name_1
Afghanistan,Kabul
Netherlands,Amsterdam
Netherlands Antilles,Willemstad
Albania,Tirana
Algeria,Alger
American Samoa,Fagatogo
Andorra,Andorra la Vella
Angola,Luanda
Anguilla,The Valley
Antigua and Barbuda,Saint John´s


In the next step, we need to check whether the population of that capital city is the maximum population of cities of that particular country or not. This can be achieved via a correlated subquery:

In [29]:
%%sql

SELECT
    co.name
FROM
    country co
JOIN
    city ci
ON
    co.capital = ci.id
WHERE
    ci.population <> (
        SELECT
            MAX(ci2.population)
        FROM
            city ci2
        WHERE
            ci.countrycode = ci2.countrycode
    )
ORDER BY
    co.population DESC
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
48 rows affected.


name
China
India
United States
Brazil
Pakistan
Nigeria
Vietnam
Philippines
Turkey
South Africa


Correlated subqueries are usually quite inefficient (remember time complexity of nested loops from DSCI 512?), but they can answer some interesting and complex questions.
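
One common way to avoid the repeated inner query is to compute all the maxima once with `GROUP BY` and join against that result. The sketch below shows this for the "most populated city per country code" question, using Python's `sqlite3` and a toy `city` table with hypothetical rows. It only checks that the two formulations agree; whether the rewrite is actually faster depends on the RDBMS and its query planner:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT, countrycode TEXT, population INTEGER);
    -- Toy rows for illustration only
    INSERT INTO city VALUES
        (1, 'North', 'AAA', 50), (2, 'South', 'AAA', 80),
        (3, 'East',  'BBB', 30), (4, 'West',  'BBB', 10);
""")

correlated = """
    SELECT ci1.name
    FROM city ci1
    WHERE ci1.population = (
        SELECT MAX(ci2.population)
        FROM city ci2
        WHERE ci2.countrycode = ci1.countrycode
    )
    ORDER BY ci1.name
"""

# The maxima are computed once per country code, then joined back to `city`.
grouped_join = """
    SELECT ci.name
    FROM city ci
    JOIN (
        SELECT countrycode, MAX(population) AS max_pop
        FROM city
        GROUP BY countrycode
    ) m
    ON m.countrycode = ci.countrycode AND ci.population = m.max_pop
    ORDER BY ci.name
"""

result_correlated = conn.execute(correlated).fetchall()
result_grouped = conn.execute(grouped_join).fetchall()
print(result_correlated, result_grouped)
```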

### `ANY` and `ALL`

Syntax:

```sql
SELECT
    column_name(s)
FROM
    table_name
WHERE
    column_name operator {ALL|ANY} (
        SELECT
            column_name
        FROM
            table_name
        WHERE
            condition
    );
```

---

**Example:** Find all non-European countries whose population is larger than every European country.

---

In [30]:
%%sql

SELECT
    name
FROM
    country
WHERE
    continent <> 'Europe'
    AND
    population > ALL (
        SELECT
            population
        FROM
            country
        WHERE
            continent = 'Europe'
    )
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
6 rows affected.


name
Brazil
Indonesia
India
China
Pakistan
United States


---

**Example:** Find all European countries whose population is smaller than at least one city in the US.

---

In [106]:
%%sql

SELECT
    name
FROM
    country
WHERE
    continent = 'Europe'
    AND
    population < ANY (
        SELECT
            population
        FROM
            city
        WHERE
            countrycode = (
                SELECT
                    code
                FROM
                    country
                WHERE
                    name ILIKE '%United%States'
            )
    )
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
26 rows affected.


name
Albania
Andorra
Bosnia and Herzegovina
Faroe Islands
Gibraltar
Svalbard and Jan Mayen
Ireland
Iceland
Croatia
Latvia


Subqueries written with `ALL` or `ANY` and one of `<`, `<=`, `>`, or `>=` can be rewritten using aggregations. For example, we can rewrite the above query as:

In [109]:
%%sql

SELECT
    name
FROM
    country
WHERE
    continent = 'Europe'
    AND
    population < (
        SELECT
            MAX(population)
        FROM
            city
        WHERE
            countrycode = (
                SELECT
                    code
                FROM
                    country
                WHERE
                    name ILIKE '%United%States'
            )
    )
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
26 rows affected.


name
Albania
Andorra
Bosnia and Herzegovina
Faroe Islands
Gibraltar
Svalbard and Jan Mayen
Ireland
Iceland
Croatia
Latvia


Here are equivalencies between using `ALL` and `ANY` with `MAX` and `MIN` in subqueries:

| Using `ALL` and `ANY` | Using `MIN` and `MAX` |
|-----------------------|-----------------------|
| `< ALL (subquery)`    | `< MIN(values)`       |
| `> ALL (subquery)`    | `> MAX(values)`       |
| `< ANY (subquery)`    | `< MAX(values)`       |
| `> ANY (subquery)`    | `> MIN(values)`       |

> **Remember:** The fact that two queries achieve the same result does not mean that they are also the same in terms of performance. For example, using `> MAX()` is usually faster than `> ALL()`.
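
The four equivalences in the table can be checked mechanically. The sketch below does so in plain Python, where `all()` and `any()` play the roles of `ALL` and `ANY`; the list `subquery_values` is a hypothetical stand-in for the rows returned by a subquery:

```python
# Hypothetical values standing in for the rows returned by a subquery.
subquery_values = [3, 8, 5]


def check(x, values):
    """Return True if all four ALL/ANY forms agree with their MIN/MAX forms for x."""
    return (
        all(x < v for v in values) == (x < min(values))      # < ALL  vs  < MIN
        and all(x > v for v in values) == (x > max(values))  # > ALL  vs  > MAX
        and any(x < v for v in values) == (x < max(values))  # < ANY  vs  < MAX
        and any(x > v for v in values) == (x > min(values))  # > ANY  vs  > MIN
    )


# Try every integer from 0 to 11, which covers values below, inside and above the list.
equivalences_hold = all(check(x, subquery_values) for x in range(0, 12))
print(equivalences_hold)
```

These equivalences assume that the subquery returns at least one row and no nulls. With an empty subquery, `> ALL (...)` is true for every row, whereas `MAX` over no rows yields null and the comparison keeps no rows.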

### `EXISTS`

With subqueries, sometimes we don't care about the rows that are returned, but only whether any row is returned at all. The `EXISTS` and `NOT EXISTS` keywords in SQL provide this type of functionality, i.e., they check for the **existence** of rows, not their values. Here is the syntax for an `EXISTS` subquery:

```sql
SELECT
    column_name(s)
FROM
    table_name
WHERE
    [NOT] EXISTS (
        SELECT column_name FROM table_name WHERE condition
    )
;
```

If the subquery returns one or more rows, then the `WHERE` condition becomes `TRUE`.

Remember that:
- With `EXISTS`, the columns returned by the subquery do not matter at all, which is why we usually use `SELECT *` in the subquery.
- The subquery following `EXISTS` can be either simple or correlated, but typically they are correlated.
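
The first point can be verified by swapping the select list of the subquery and observing that the result does not change. The sketch below uses Python's `sqlite3` with toy `country` and `city` tables (hypothetical rows); `countries_with_big_city` is a helper introduced only for this illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE city (id INTEGER PRIMARY KEY, countrycode TEXT, population INTEGER);
    -- Toy rows for illustration only
    INSERT INTO country VALUES ('AAA', 'Alpha'), ('BBB', 'Beta');
    INSERT INTO city VALUES (1, 'AAA', 6000000), (2, 'AAA', 7000000), (3, 'BBB', 40000);
""")


def countries_with_big_city(select_list):
    """Run the same correlated EXISTS query with a different subquery select list."""
    query = f"""
        SELECT co.name
        FROM country co
        WHERE EXISTS (
            SELECT {select_list}
            FROM city ci
            WHERE ci.countrycode = co.code AND ci.population > 5000000
        )
        ORDER BY co.name
    """
    return [row[0] for row in conn.execute(query)]


# The select list is a fixed set of trusted strings, not user input.
results = {s: countries_with_big_city(s) for s in ("*", "1", "NULL", "ci.id")}
print(results)
```

Note that `Alpha` is returned once even though two of its cities satisfy the condition: `EXISTS` only asks whether a matching row exists, not how many.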

---

**Example:** Find all countries that have at least one city with a population greater than 5 million.

---

In [114]:
%%sql

SELECT
    co.name
FROM
    country co
WHERE
    EXISTS (
        SELECT
            *
        FROM
            city ci
        WHERE
            co.code = ci.countrycode
            AND
            ci.population > 5000000
    )
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
18 rows affected.


name
Brazil
United Kingdom
Egypt
Indonesia
India
Iran
Japan
China
Colombia
"Congo, The Democratic Republic of the"


---

**Example:** Which countries speak at least one language that is not spoken in any other country in their continent?

---

It's best to first create a temporary table that stores the names of countries, their continents, and their languages. To keep the example small, the query below only includes countries in Asia and Europe:

In [44]:
%%sql

DROP TABLE IF EXISTS ccl;

CREATE TEMPORARY TABLE ccl AS (
    SELECT
        co.name, co.continent, cl.language
    FROM
        country co
    JOIN
        countrylanguage cl
    ON
        co.code = cl.countrycode
    WHERE
        co.continent IN ('Asia', 'Europe')
)
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
Done.
441 rows affected.


[]

Now we can use `NOT EXISTS` to detect countries that have languages not spoken elsewhere in their continent. Note that because a subquery runs its query over all rows, we need to exclude the current row of the outer query, otherwise there would always be a country in the same continent speaking all languages of the country in the outer query, i.e. itself!

In [45]:
%%sql

SELECT
    t1.*
FROM
    ccl t1
WHERE
    NOT EXISTS (
        SELECT
            *
        FROM
            ccl t2
        WHERE
            t1.name <> t2.name
            AND
            t1.continent = t2.continent
            AND
            t1.language = t2.language
    )
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
129 rows affected.


name,continent,language
Bhutan,Asia,Dzongkha
Philippines,Asia,Pilipino
Faroe Islands,Europe,Faroese
Georgia,Asia,Georgiana
Indonesia,Asia,Javanese
Iceland,Europe,Icelandic
Japan,Asia,Japanese
Kyrgyzstan,Asia,Kirgiz
Cyprus,Asia,Greek
Latvia,Europe,Latvian


We can also replace `t1.*` in the `SELECT` clause with `COUNT(DISTINCT name)` to find how many countries have this property. Because `ccl` only contains countries in Asia and Europe, the count covers those two continents only.

To summarize, we use `EXISTS` when:

- We don't need the data from a related table. With joins, we always have access to the columns of the target tables as well.
- We just need to check existence, which can also be achieved by using outer joins and checking for nulls.
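
The second point, checking existence with an outer join and a null test, can be sketched as follows. The example uses Python's `sqlite3` with toy tables (hypothetical rows) and looks for countries that have no city listed at all, once with `NOT EXISTS` and once with a `LEFT JOIN`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE city (id INTEGER PRIMARY KEY, countrycode TEXT);
    -- Toy rows for illustration only
    INSERT INTO country VALUES ('AAA', 'Alpha'), ('BBB', 'Beta'), ('CCC', 'Gamma');
    INSERT INTO city VALUES (1, 'AAA'), (2, 'AAA');
""")

not_exists = [
    row[0]
    for row in conn.execute("""
        SELECT co.name
        FROM country co
        WHERE NOT EXISTS (
            SELECT * FROM city ci WHERE ci.countrycode = co.code
        )
        ORDER BY co.name
    """)
]

# Outer join: countries without a matching city get nulls in the city columns.
left_join = [
    row[0]
    for row in conn.execute("""
        SELECT co.name
        FROM country co
        LEFT JOIN city ci ON ci.countrycode = co.code
        WHERE ci.id IS NULL
        ORDER BY co.name
    """)
]
print(not_exists, left_join)
```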

---

**Example:** Find city names that happen to be used in more than one country!

---

In [46]:
%%sql

SELECT
    DISTINCT ci1.name
FROM
    city ci1
WHERE
    EXISTS (
        SELECT
            ci2.name
        FROM
            city ci2
        WHERE
            ci2.name = ci1.name
            AND
            ci1.id <> ci2.id
            AND
            ci1.countrycode <> ci2.countrycode
    )
ORDER BY
    ci1.name DESC
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
47 rows affected.


name
York
Worcester
Victoria
Vancouver
Valencia
Tripoli
Toledo
Taiping
Santa Rosa
Santa Maria


Alternatively, we can use a self-join to find city names used in more than one country:

In [48]:
%%sql

SELECT
    DISTINCT ON (ci1.name) ci1.name,
    ci1.countrycode,
    ci2.name,
    ci2.countrycode
FROM
    city ci1
JOIN
    city ci2
ON
    ci2.name = ci1.name
    AND
    ci1.id <> ci2.id
    AND
    ci1.countrycode <> ci2.countrycode
ORDER BY
    ci1.name DESC
;

   postgresql://postgres:***@localhost:5432/mds
 * postgresql://postgres:***@localhost:5432/world_dsci513
47 rows affected.


name,countrycode,name_1,countrycode_1
York,CAN,York,GBR
Worcester,USA,Worcester,GBR
Victoria,MEX,Victoria,SYC
Vancouver,CAN,Vancouver,USA
Valencia,ESP,Valencia,VEN
Tripoli,LBY,Tripoli,LBN
Toledo,BRA,Toledo,PHL
Taiping,MYS,Taiping,TWN
Santa Rosa,PHL,Santa Rosa,USA
Santa Maria,BRA,Santa Maria,PHL
