# Set up a connection to the Postgres server

We assume that the server is running on a node called `postgres`. See [https://github.com/IITDBGroup/cs425](https://github.com/IITDBGroup/cs425) for how to set up a Docker container and link the notebook container so that the Postgres server is exposed as `postgres` on the notebook server. Make sure to run the cell below first to get a connection. The syntax for running SQL code from a Jupyter notebook using cell magic is described [here](https://github.com/catherinedevlin/ipython-sql). **Every time you open this notebook, you have to execute the cell below to open a connection to Postgres.**

In [1]:
%load_ext sql
%sql postgresql://postgres:test@notebpostgres/cs425


'Connected: postgres@cs425'

# Basic SQL syntax, constants, and identifiers

## Keywords

Keywords in SQL are case insensitive, e.g., `SELECT`, `SeLeCT`, and `select` will all be recognized as the keyword "SELECT" in SQL. 

## Identifiers

The conventions for identifiers (e.g., table and attribute names) in SQL depend on the database system you are using. Typically, unquoted identifiers are case-insensitive, have to start with a letter, and can contain letters, numbers, and `_` (underscore). Postgres folds unquoted identifiers to lowercase, e.g., the table names `STuDENt`, `student`, and `STUDENT` would all be represented internally as `student`. By quoting, you can use identifiers that do not follow this syntax. Quoted identifiers are delimited by `"` (double quote). For example, `99people` is not a valid identifier since it starts with a number. However, `"99people"` is allowed.
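
The keyword and quoting rules can be tried out with a minimal sketch. As an assumption made so the example runs without a server, it uses Python's built-in `sqlite3` module with an in-memory database instead of the Postgres connection. SQLite treats identifier case differently from Postgres in some details, so only the case-insensitive keywords and the quoted identifier are shown. The name `98people` is a hypothetical second table name used to trigger the error.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the Postgres server.
conn = sqlite3.connect(":memory:")

# Keywords may be written in any mix of upper and lower case.
conn.execute('CrEaTe TaBlE "99people" (name TEXT)')
conn.execute("""insert INTO "99people" VALUES ('Peter')""")
rows = conn.execute('SeLeCT name FrOm "99people"').fetchall()
print(rows)

# Without double quotes, a name starting with a digit is rejected.
try:
    conn.execute("CREATE TABLE 98people (name TEXT)")
    unquoted_ok = True
except sqlite3.Error:
    unquoted_ok = False
print(unquoted_ok)
```

The first `print` shows the single inserted row, and the second shows `False` because the unquoted name is not a valid identifier.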

## Constants

* String constants in SQL are delimited by `'` (single quote), e.g., `'Peter'` is a valid string
* Number constants, e.g., `1`, `12432`, `-234235`
* The format of date constants is database system dependent. Most systems allow you to specify the format for a date. See [https://www.postgresql.org/docs/9.6/static/datatype-datetime.html](https://www.postgresql.org/docs/9.6/static/datatype-datetime.html) for information on how dates are handled in Postgres. For example, `DATE '2004-10-19'` creates a date constant

## Casting

* In Postgres, casting is denoted by `expression::datatype`. For instance, `'123'::int` casts the string constant `'123'` to an integer. The SQL-standard form `CAST('123' AS int)` is accepted as well

## Function Calls

* Functions are called using `()`. For instance, the function `upper` converts a string to upper case: `upper('abc')` yields `'ABC'`.
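
Constants, casts, and function calls can be combined in a single `SELECT`. The sketch below again assumes an in-memory SQLite database via Python's `sqlite3` module rather than the Postgres server. Because the `::` cast syntax is specific to Postgres, the example uses the standard `CAST(... AS ...)` form.

```python
import sqlite3

# Assumption: in-memory SQLite replaces the Postgres connection.
conn = sqlite3.connect(":memory:")

row = conn.execute(
    "SELECT 'Peter', "                  # string constant
    "       -234235, "                  # number constant
    "       CAST('123' AS INTEGER) + 1, "  # cast a string to an integer, then add
    "       upper('abc')"               # function call
).fetchone()
print(row)  # ('Peter', -234235, 124, 'ABC')
```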

# Data Definition Language (DDL)

The data definition language part of SQL allows you to change the schema of a database, e.g., creating new relations (tables) or changing the schema of a relation.

## Creating Tables

The create table statement creates a new table. It is of the form:

~~~sql
CREATE TABLE table_name (attrdefs_and_constraints);

attrdefs_and_constraints := (attrdef | constraint)*

attrdef := name datatype
constraint := PRIMARY KEY (attrname_list) | FOREIGN KEY (attrname_list) REFERENCES relation_name | ...
~~~

Let's create a table to store information about student organizations. For each organization it records its `name`, its `budget`, and whether its membership is restricted to persons of a particular gender (`m = male`, `f = female`, `a = all`).

In [9]:
%%sql
CREATE TABLE student_org 
(
    name TEXT,
    budget float,
    gender char(1),
    PRIMARY KEY (name)
);

Done.


[]

Now let's check out the newly generated table using a query: `SELECT * FROM table_name` returns all rows of table `table_name`. We will discuss queries in more detail later.

In [11]:
%%sql
SELECT * FROM student_org;

0 rows affected.


name,budget,gender


Now let's insert some rows into our new table. We are using SQL's insert command of the form

~~~sql
INSERT INTO table_name VALUES (value_list);
~~~

and then check the updated content.

In [12]:
%%sql
INSERT INTO student_org VALUES ('ACM', 10000, 'a');
INSERT INTO student_org VALUES ('IEEE', 20000, 'a');
SELECT * FROM student_org;

1 rows affected.
1 rows affected.
2 rows affected.


name,budget,gender
ACM,10000.0,a
IEEE,20000.0,a
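
The `CREATE TABLE` grammar above also lists `FOREIGN KEY` constraints, which the `student_org` example does not use. The following sketch shows one in action. It assumes an in-memory SQLite database (via Python's `sqlite3`) instead of Postgres, and the table `org_member` with its columns is hypothetical, introduced only for illustration. SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`; Postgres enforces them by default.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the Postgres server.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch

conn.execute("""
    CREATE TABLE student_org (
        name TEXT, budget FLOAT, gender CHAR(1),
        PRIMARY KEY (name))""")
# Hypothetical table: which student belongs to which organization.
conn.execute("""
    CREATE TABLE org_member (
        org_name TEXT, student_id INTEGER,
        PRIMARY KEY (org_name, student_id),
        FOREIGN KEY (org_name) REFERENCES student_org)""")

conn.execute("INSERT INTO student_org VALUES ('ACM', 10000, 'a')")
conn.execute("INSERT INTO org_member VALUES ('ACM', 128)")  # parent row exists

try:
    # 'Chess Club' is not in student_org, so this row violates the constraint.
    conn.execute("INSERT INTO org_member VALUES ('Chess Club', 128)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
print(fk_enforced)  # True
```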


## Changing the schema of a relation

SQL provides the `ALTER TABLE` command for changing the schema of a relation. Let's add a column storing the immigration status of each student to the `student` relation:

In [3]:
%%sql
ALTER TABLE student ADD imm_status VARCHAR(30);

Done.


[]

Again, let's see how this affected the student relation using a query. As shown below, the database has set the value of the new column to `NULL` (shown as `None` in Python) for all students in the database.

In [5]:
%%sql
SELECT * FROM student;

13 rows affected.


id,name,dept_name,tot_cred,imm_status
128,Zhang,Comp. Sci.,102,
12345,Shankar,Comp. Sci.,32,
19991,Brandt,History,80,
23121,Chavez,Finance,110,
44553,Peltier,Physics,56,
45678,Levy,Physics,46,
54321,Williams,Comp. Sci.,54,
55739,Sanchez,Music,38,
70557,Snow,Physics,0,
76543,Brown,Comp. Sci.,58,


Now let's get rid of this column.

In [6]:
%%sql
ALTER TABLE student DROP imm_status;

Done.


[]

... and check that we are back to normal.

In [7]:
%%sql
SELECT * FROM student;

13 rows affected.


id,name,dept_name,tot_cred
128,Zhang,Comp. Sci.,102
12345,Shankar,Comp. Sci.,32
19991,Brandt,History,80
23121,Chavez,Finance,110
44553,Peltier,Physics,56
45678,Levy,Physics,46
54321,Williams,Comp. Sci.,54
55739,Sanchez,Music,38
70557,Snow,Physics,0
76543,Brown,Comp. Sci.,58


# Run basic SQL queries

First let's run some basic queries over the **University** schema from the textbook

Get all departments (here `*` is a shortcut referring to all attributes)

In [2]:
%%sql
SELECT * FROM department

7 rows affected.


dept_name,building,budget
Biology,Watson,90000.0
Comp. Sci.,Taylor,100000.0
Elec. Eng.,Taylor,85000.0
Finance,Painter,120000.0
History,Painter,50000.0
Music,Packard,80000.0
Physics,Watson,70000.0


Only show the names of departments:

In [2]:
%%sql
SELECT dept_name FROM department

7 rows affected.


dept_name
Biology
Comp. Sci.
Elec. Eng.
Finance
History
Music
Physics


Find all departments that at least one student is associated with. Return each department only once (using `DISTINCT`).

In [3]:
%%sql
SELECT DISTINCT dept_name FROM student

7 rows affected.


dept_name
Comp. Sci.
Elec. Eng.
History
Music
Finance
Physics
Biology


Just to demonstrate what changes if we omit the `DISTINCT`:

In [3]:
%%sql 
SELECT dept_name FROM student

13 rows affected.


dept_name
Comp. Sci.
Comp. Sci.
History
Finance
Physics
Physics
Comp. Sci.
Music
Physics
Comp. Sci.
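
The difference between the two queries boils down to duplicate elimination. Here is a small sketch, assuming an in-memory SQLite database in place of Postgres and a reduced `student` table containing five of the rows shown above.

```python
import sqlite3

# Assumption: in-memory SQLite with a reduced student table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, dept_name TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("Zhang", "Comp. Sci."), ("Shankar", "Comp. Sci."),
     ("Brandt", "History"), ("Peltier", "Physics"), ("Levy", "Physics")])

# Without DISTINCT: one output row per student.
all_rows = conn.execute("SELECT dept_name FROM student").fetchall()

# With DISTINCT: each department appears once.
distinct_rows = conn.execute(
    "SELECT DISTINCT dept_name FROM student ORDER BY dept_name").fetchall()

print(len(all_rows))   # 5
print(distinct_rows)   # [('Comp. Sci.',), ('History',), ('Physics',)]
```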


Return ids of students that have more than 50 total credits


In [4]:
%%sql
SELECT id 
FROM student
WHERE tot_cred > 50

9 rows affected.


id
128
19991
23121
44553
54321
76543
76653
98765
98988


Just to confirm that this worked, let's get back all of the attributes:

In [5]:
%%sql
SELECT *
FROM student
WHERE tot_cred > 50

9 rows affected.


id,name,dept_name,tot_cred
128,Zhang,Comp. Sci.,102
19991,Brandt,History,80
23121,Chavez,Finance,110
44553,Peltier,Physics,56
54321,Williams,Comp. Sci.,54
76543,Brown,Comp. Sci.,58
76653,Aoi,Elec. Eng.,60
98765,Bourikas,Elec. Eng.,98
98988,Tanaka,Biology,120


Find all instructors and the buildings they work in:

In [7]:
%%sql
SELECT name, building
FROM instructor, department
WHERE instructor.dept_name = department.dept_name

12 rows affected.


name,building
Srinivasan,Taylor
Wu,Painter
Mozart,Packard
Einstein,Watson
El Said,Painter
Gold,Watson
Katz,Taylor
Califieri,Painter
Singh,Painter
Crick,Watson


Using aliasing, we can write the same query with less code. In SQL you can assign an alias to a relation in the `FROM` clause by writing `relation alias`. You can then refer to the relation by its alias instead of the relation name in the `SELECT` and `WHERE` clauses.

In [8]:
%%sql
SELECT i.name, d.building
FROM instructor i, department d
WHERE i.dept_name = d.dept_name

12 rows affected.


name,building
Srinivasan,Taylor
Wu,Painter
Mozart,Packard
Einstein,Watson
El Said,Painter
Gold,Watson
Katz,Taylor
Califieri,Painter
Singh,Painter
Crick,Watson
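
To check that the aliased and the unaliased query are interchangeable, one can run both and compare the results. The sketch assumes an in-memory SQLite database instead of Postgres and reduced `instructor` and `department` tables with a few of the rows shown above.

```python
import sqlite3

# Assumption: in-memory SQLite with reduced versions of the two tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (dept_name TEXT, building TEXT)")
conn.execute("CREATE TABLE instructor (name TEXT, dept_name TEXT)")
conn.executemany("INSERT INTO department VALUES (?, ?)", [
    ("Comp. Sci.", "Taylor"), ("Finance", "Painter"),
    ("Music", "Packard"), ("Physics", "Watson")])
conn.executemany("INSERT INTO instructor VALUES (?, ?)", [
    ("Srinivasan", "Comp. Sci."), ("Wu", "Finance"),
    ("Mozart", "Music"), ("Einstein", "Physics")])

# Join written with full relation names.
plain = conn.execute("""
    SELECT name, building
    FROM instructor, department
    WHERE instructor.dept_name = department.dept_name
    ORDER BY name""").fetchall()

# The same join written with aliases i and d.
aliased = conn.execute("""
    SELECT i.name, d.building
    FROM instructor i, department d
    WHERE i.dept_name = d.dept_name
    ORDER BY i.name""").fetchall()

print(plain == aliased)  # True
print(aliased)
```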


Pairs of instructors working for the same department (we use `x.name <> y.name` to ensure that we are not pairing an instructor with him-/herself).

In [9]:
%%sql
SELECT x.name, y.name
FROM instructor x, instructor y
WHERE x.dept_name = y.dept_name AND x.name <> y.name

12 rows affected.


name,name_1
Srinivasan,Brandt
Srinivasan,Katz
Wu,Singh
Einstein,Gold
El Said,Califieri
Gold,Einstein
Katz,Brandt
Katz,Srinivasan
Califieri,El Said
Singh,Wu


However, this still returns each pair of instructors A and B twice: once as `(A,B)` and once as `(B,A)`. To avoid that, we can require that the name of the left instructor is lexicographically smaller than the name of the right instructor by adding the condition `x.name < y.name`. This condition already implies `x.name <> y.name`, so keeping the latter is harmless but redundant.

In [11]:
%%sql
SELECT x.name, y.name AS name_right
FROM instructor x, instructor y
WHERE x.dept_name = y.dept_name AND x.name <> y.name AND x.name < y.name

6 rows affected.


name,name_right
Einstein,Gold
Katz,Srinivasan
Califieri,El Said
Singh,Wu
Brandt,Katz
Brandt,Srinivasan
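
The effect of switching from `<>` to `<` in the self-join can be sketched as follows. The example assumes an in-memory SQLite database in place of Postgres and a reduced `instructor` table; the department assignments are sample data chosen so that two departments have two instructors each.

```python
import sqlite3

# Assumption: in-memory SQLite with a small sample instructor table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (name TEXT, dept_name TEXT)")
conn.executemany("INSERT INTO instructor VALUES (?, ?)", [
    ("Einstein", "Physics"), ("Gold", "Physics"),
    ("Wu", "Finance"), ("Singh", "Finance"), ("Mozart", "Music")])

# <> removes self-pairs but keeps both (A,B) and (B,A).
twice = conn.execute("""
    SELECT x.name, y.name
    FROM instructor x, instructor y
    WHERE x.dept_name = y.dept_name AND x.name <> y.name
    ORDER BY x.name, y.name""").fetchall()

# < keeps only the pair whose left name sorts first.
once = conn.execute("""
    SELECT x.name, y.name
    FROM instructor x, instructor y
    WHERE x.dept_name = y.dept_name AND x.name < y.name
    ORDER BY x.name, y.name""").fetchall()

print(len(twice))  # 4
print(once)        # [('Einstein', 'Gold'), ('Singh', 'Wu')]
```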


In the `SELECT` clause you can also use expressions, e.g., arithmetic and renaming (`expression AS new_name`).

In [13]:
%%sql
SELECT name, tot_cred / 10 AS one_tenth_cred
FROM student

13 rows affected.


name,one_tenth_cred
Zhang,10.2
Shankar,3.2
Brandt,8.0
Chavez,11.0
Peltier,5.6
Levy,4.6
Williams,5.4
Sanchez,3.8
Snow,0.0
Brown,5.8


`CASE` expressions let us make case distinctions based on comparisons. Here we return, for each student, an indicator of whether they are ready to graduate. A student is ready to graduate if they have earned more than 80 credits.


In [4]:
%%sql
SELECT name, tot_cred, CASE WHEN tot_cred > 80 THEN 'ready to graduate' ELSE 'not ready' END AS grad_status
FROM student

13 rows affected.


name,tot_cred,grad_status
Zhang,102,ready to graduate
Shankar,32,not ready
Brandt,80,not ready
Chavez,110,ready to graduate
Peltier,56,not ready
Levy,46,not ready
Williams,54,not ready
Sanchez,38,not ready
Snow,0,not ready
Brown,58,not ready
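
Note the boundary case: a student with exactly 80 credits is not ready, because the condition is a strict `>`. A minimal sketch of this, assuming an in-memory SQLite database instead of Postgres and three of the students from the output above:

```python
import sqlite3

# Assumption: in-memory SQLite with three sample students.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, tot_cred INTEGER)")
conn.executemany("INSERT INTO student VALUES (?, ?)", [
    ("Zhang", 102), ("Brandt", 80), ("Shankar", 32)])

rows = conn.execute("""
    SELECT name,
           CASE WHEN tot_cred > 80 THEN 'ready to graduate'
                ELSE 'not ready'
           END AS grad_status
    FROM student
    ORDER BY name""").fetchall()

# Brandt has exactly 80 credits and therefore falls into the ELSE branch.
print(rows)
```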


Return students with between 80 and 100 credits (`BETWEEN` includes both bounds):

In [5]:
%%sql
SELECT name, tot_cred
FROM student
WHERE tot_cred BETWEEN 80 AND 100

2 rows affected.


name,tot_cred
Brandt,80
Bourikas,98
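
Since `BETWEEN` includes both bounds, it can be rewritten as a pair of comparisons. The sketch below checks this equivalence, assuming an in-memory SQLite database in place of Postgres and four sample students from the outputs above.

```python
import sqlite3

# Assumption: in-memory SQLite with four sample students.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, tot_cred INTEGER)")
conn.executemany("INSERT INTO student VALUES (?, ?)", [
    ("Zhang", 102), ("Brandt", 80), ("Bourikas", 98), ("Shankar", 32)])

between_rows = conn.execute("""
    SELECT name, tot_cred FROM student
    WHERE tot_cred BETWEEN 80 AND 100
    ORDER BY name""").fetchall()

# Equivalent formulation with two inclusive comparisons.
compare_rows = conn.execute("""
    SELECT name, tot_cred FROM student
    WHERE tot_cred >= 80 AND tot_cred <= 100
    ORDER BY name""").fetchall()

print(between_rows)                  # [('Bourikas', 98), ('Brandt', 80)]
print(between_rows == compare_rows)  # True
```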


At a student's request, here is the same query written with aggregation and `HAVING` instead. This assumes that student names are unique, so that each group contains exactly one student.

In [9]:
%%sql
SELECT name, min(tot_cred) AS tot_cred
FROM student
GROUP BY name
HAVING min(tot_cred) >= 80 AND max(tot_cred) <= 100

2 rows affected.


name,tot_cred
Bourikas,98
Brandt,80


Return the average total credits of students per department. In SQL, aggregation is applied in the `SELECT` clause. Group-by expressions are given in a separate `GROUP BY` clause.

In [10]:
%%sql
SELECT dept_name, avg(tot_cred) AS avg_cred
FROM student
GROUP BY dept_name


7 rows affected.


dept_name,avg_cred
Comp. Sci.,61.5
Elec. Eng.,79.0
History,80.0
Music,38.0
Finance,110.0
Physics,34.0
Biology,120.0


If we only want departments where the average credit is larger than `100` we can apply a `HAVING` clause to post-filter the result after aggregation. The `HAVING` clause and the `WHERE` clause both correspond to selection in relational algebra. The difference is that the `WHERE` clause is applied **before** any aggregation or grouping is evaluated and the `HAVING` clause is applied **after** aggregation. Note that the `HAVING` clause may reference aggregation functions that are not used in the `SELECT` clause.

In [11]:
%%sql
SELECT dept_name, avg(tot_cred) AS avg_cred
FROM student
GROUP BY dept_name
HAVING avg(tot_cred) > 100

2 rows affected.


dept_name,avg_cred
Finance,110.0
Biology,120.0
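
The difference between `WHERE` and `HAVING` becomes visible when the same threshold is applied in each clause. The following sketch assumes an in-memory SQLite database instead of Postgres and a reduced `student` table with six of the rows used in this notebook.

```python
import sqlite3

# Assumption: in-memory SQLite with a reduced student table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, dept_name TEXT, tot_cred INTEGER)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)", [
    ("Zhang", "Comp. Sci.", 102), ("Shankar", "Comp. Sci.", 32),
    ("Chavez", "Finance", 110), ("Tanaka", "Biology", 120),
    ("Peltier", "Physics", 56), ("Levy", "Physics", 46)])

# HAVING filters groups after the averages have been computed.
having_rows = conn.execute("""
    SELECT dept_name, avg(tot_cred) FROM student
    GROUP BY dept_name
    HAVING avg(tot_cred) > 100
    ORDER BY dept_name""").fetchall()

# WHERE filters individual students before grouping, so Comp. Sci.
# survives with only Zhang's 102 credits in its group.
where_rows = conn.execute("""
    SELECT dept_name, avg(tot_cred) FROM student
    WHERE tot_cred > 100
    GROUP BY dept_name
    ORDER BY dept_name""").fetchall()

# HAVING may use an aggregate that does not appear in SELECT.
count_rows = conn.execute("""
    SELECT dept_name FROM student
    GROUP BY dept_name
    HAVING count(*) > 1
    ORDER BY dept_name""").fetchall()

print(having_rows)  # [('Biology', 120.0), ('Finance', 110.0)]
print(where_rows)   # [('Biology', 120.0), ('Comp. Sci.', 102.0), ('Finance', 110.0)]
print(count_rows)   # [('Comp. Sci.',), ('Physics',)]
```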


Return the total credit hours students from the `Music` or `Biology` departments have taken.

In [12]:
%%sql
SELECT sum(tot_cred)
FROM student
WHERE dept_name = 'Music' OR dept_name = 'Biology'


1 rows affected.


sum
158


Return the highest instructor salary:

In [13]:
%%sql
SELECT max(salary)
FROM instructor

1 rows affected.


max
95000.0


Same query, but using the trick we introduced in the relational algebra part of the lecture to compute maximal salaries without using aggregation (find salaries for which at least one higher salary exists and then remove these salaries from the set of all salaries). Note that here we are using the **Set** version of set difference.

In [16]:
%%sql
(SELECT salary FROM instructor)
EXCEPT
(SELECT l.salary
FROM instructor l, instructor r
WHERE l.salary < r.salary)

1 rows affected.


salary
95000.0
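
The set-difference formulation can be compared against `max` directly. This is a sketch assuming an in-memory SQLite database instead of Postgres; the salaries other than Einstein's 95000 are sample values. SQLite does not accept parentheses around the operands of `EXCEPT`, so they are omitted here, which does not change the meaning of the query.

```python
import sqlite3

# Assumption: in-memory SQLite; most salaries are sample values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (name TEXT, salary REAL)")
conn.executemany("INSERT INTO instructor VALUES (?, ?)", [
    ("Einstein", 95000), ("Wu", 90000),
    ("Srinivasan", 65000), ("Mozart", 40000)])

# All salaries minus those for which a strictly higher salary exists.
except_rows = conn.execute("""
    SELECT salary FROM instructor
    EXCEPT
    SELECT l.salary
    FROM instructor l, instructor r
    WHERE l.salary < r.salary""").fetchall()

# The aggregation-based formulation for comparison.
max_sal = conn.execute("SELECT max(salary) FROM instructor").fetchone()[0]

print(except_rows)                  # [(95000.0,)]
print(except_rows == [(max_sal,)])  # True
```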


Now let's get back the names of instructors with the highest salaries. This requires joining the result of aggregation with the instructor table. Here we use two new features:

* `WITH q AS (SELECT ...)` defines a so-called common table expression (CTE). This works just like assignment in relational algebra
* the `FROM` clause can contain queries. The semantics is that the queries in the `FROM` clause are evaluated first before we evaluate the outer query.

In [19]:
%%sql
WITH maxSal AS (SELECT max(salary) AS msal
FROM instructor)
SELECT i.name
FROM maxSal m, instructor i
WHERE msal = i.salary


1 rows affected.


name
Einstein


Alternatively, we can use a subquery in the `FROM` clause instead of a CTE:

In [20]:
%%sql
SELECT i.name
FROM (SELECT max(salary) AS msal
      FROM instructor) m, 
      instructor i
WHERE msal = i.salary

1 rows affected.


name
Einstein
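
Both formulations should always return the same rows. A minimal sketch to verify this, assuming an in-memory SQLite database in place of Postgres and sample salaries (only Einstein's 95000 comes from the output above):

```python
import sqlite3

# Assumption: in-memory SQLite; most salaries are sample values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (name TEXT, salary REAL)")
conn.executemany("INSERT INTO instructor VALUES (?, ?)", [
    ("Einstein", 95000), ("Wu", 90000), ("Mozart", 40000)])

# Variant 1: common table expression.
cte_rows = conn.execute("""
    WITH maxSal AS (SELECT max(salary) AS msal FROM instructor)
    SELECT i.name
    FROM maxSal m, instructor i
    WHERE m.msal = i.salary""").fetchall()

# Variant 2: subquery in the FROM clause.
sub_rows = conn.execute("""
    SELECT i.name
    FROM (SELECT max(salary) AS msal FROM instructor) m,
         instructor i
    WHERE m.msal = i.salary""").fetchall()

print(cte_rows)              # [('Einstein',)]
print(cte_rows == sub_rows)  # True
```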


# Data Manipulation Language (DML) Operations 
Now let's learn about how to update tables by inserting, deleting, and updating rows.

## Inserting data
We first take a look at how to insert data into a table using SQL's `INSERT` command. Inserting a single new row is done as follows:
~~~sql
INSERT INTO table VALUES (value1, ..., valueN)
~~~

In [3]:
%%sql
INSERT INTO department VALUES ('data science', 'Watson', 200000.0)

1 rows affected.


[]

Now let's check the new state of table `department`

In [4]:
%%sql
SELECT * FROM department

8 rows affected.


dept_name,building,budget
Biology,Watson,90000.0
Comp. Sci.,Taylor,100000.0
Elec. Eng.,Taylor,85000.0
Finance,Painter,120000.0
History,Painter,50000.0
Music,Packard,80000.0
Physics,Watson,70000.0
data science,Watson,200000.0
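
The form above relies on the values being listed in the order of the table's columns. Standard SQL also allows naming the target columns explicitly, in which case omitted columns are set to `NULL` (or to their default, if one is declared). The sketch below illustrates both forms, assuming an in-memory SQLite database instead of Postgres and a `department` table without further constraints; the department `statistics` is hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite; department has no constraints or defaults.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (dept_name TEXT, building TEXT, budget REAL)")

# Positional form: one value per column, in column order.
conn.execute("INSERT INTO department VALUES ('data science', 'Watson', 200000.0)")

# Explicit column list: budget is omitted and therefore NULL.
# 'statistics' is a hypothetical department used for illustration.
conn.execute(
    "INSERT INTO department (dept_name, building) VALUES ('statistics', 'Taylor')")

rows = conn.execute("SELECT * FROM department ORDER BY dept_name").fetchall()
print(rows)
# [('data science', 'Watson', 200000.0), ('statistics', 'Taylor', None)]
```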
