# SQL Workshop
### <font color=indigo>LIVE DEMOs</font>
---

* This document is a technical brainstorm (like whiteboarding with code). 
* I'll give you this and any other assets over the break. 
* No need to follow along live.

---

## Introductions

* Name
* Background/Role
* Prior Experience with SQL?
* Hobby

### Michael Burgess
* michael.burgess@decoded.com

* Head of Technical Solutions
    * Head for Data, Analytics, AI
    * IT contractor defence, telephony, mobile, ...
    * Physics 6/7yr
* Arguing, Youtube, Podcasts, Philosophy

---


## SQLite & Python

Python has a built-in library, `sqlite3`, which provides utilities for working with SQLite; it is specific to SQLite.

In [1]:
import sqlite3
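
As a minimal sketch of using `sqlite3` on its own (the in-memory database `":memory:"`, the table name `demo` and its two rows are assumptions for illustration, not workshop files):

```python
import sqlite3

# ":memory:" gives a temporary database that lives only in RAM
raw_con = sqlite3.connect(":memory:")

# Create a small table and add two rows
raw_con.execute("CREATE TABLE demo (Name TEXT, Price INTEGER)")
raw_con.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [("Alice", 10), ("Eve", 20)],
)

# Each row comes back as a plain Python tuple
rows = raw_con.execute("SELECT Name, Price FROM demo ORDER BY Price").fetchall()
print(rows)

raw_con.close()
```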

The sqlalchemy library can connect to and manage multiple different `RDBMS`es...

In [2]:
from sqlalchemy import create_engine

Set up (and, in this case, *create*) the database,

In [3]:
db = create_engine("sqlite:///demo.db")

Get a connection which we can use *as if it were a file*,

In [4]:
con = db.connect()

... this means that sometimes, where I could use a file connection, I can use this one instead.

## Using Pandas with SQL

In [5]:
import pandas as pd

Pandas can save to *files* and therefore also dbs,

In [6]:
prices = pd.DataFrame({
    'Name': ['Alice', 'Eve', 'Bob'],
    'Price': [10, 20, 30]
})

prices

Unnamed: 0,Name,Price
0,Alice,10
1,Eve,20
2,Bob,30


The number of rows *saved* to the table `prices` is...

In [7]:
# prices.to_sql('prices', con)

In [8]:
prices.to_sql('prices', con, index=False, if_exists='replace')

3

In [9]:
pd.read_sql('SELECT * FROM prices', con)

Unnamed: 0,Name,Price
0,Alice,10
1,Eve,20
2,Bob,30


### A Querying Template

It's convenient to use a triple-quoted (multi-line) string when specifying a query, so we can use whitespace,

In [10]:
# triple-quoted (multi-line) string

query = """
    SELECT *
        FROM prices
"""

pd.read_sql(query, con)

Unnamed: 0,Name,Price
0,Alice,10
1,Eve,20
2,Bob,30


### Q&A

* How would I select just the `Name` column?

In [11]:
query = """
    SELECT Name
        FROM prices
"""

pd.read_sql(query, con)

Unnamed: 0,Name
0,Alice
1,Eve
2,Bob


* How would I select rows from `prices` with a `Price` of £20 or more?

In [12]:
query = """
    SELECT *
        FROM prices
        WHERE 
            Price >= 20
"""

pd.read_sql(query, con)

Unnamed: 0,Name,Price
0,Eve,20
1,Bob,30


* How would I select rows from `prices` with a `Price` of £20 or more, **but just one row**?

In [13]:
query = """
    SELECT *
        FROM prices
        WHERE 
            Price >= 20
        LIMIT 1
"""

pd.read_sql(query, con)

Unnamed: 0,Name,Price
0,Eve,20


...which row we get is (sort of) arbitrary right now... how would we make this the minimum?
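
One possible answer, as a sketch: sort before limiting, so that `LIMIT 1` keeps the cheapest matching row. This uses an in-memory `sqlite3` database with the same three rows (deliberately inserted out of order); the connection name `con_demo` is an assumption for illustration.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE prices (Name TEXT, Price INTEGER)")
con_demo.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [("Bob", 30), ("Alice", 10), ("Eve", 20)],
)

query = """
    SELECT *
        FROM prices
        WHERE
            Price >= 20
        ORDER BY Price ASC
        LIMIT 1
"""

# ORDER BY runs before LIMIT, so the first row is the minimum
cheapest = con_demo.execute(query).fetchall()
print(cheapest)

con_demo.close()
```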

# BREAK

In [14]:
query = """
    select *
        from prices
        where 
            Price >= 20
        limit 1
"""

pd.read_sql(query, con)

Unnamed: 0,Name,Price
0,Eve,20


---
## Activity: Workbook I (c. 25min)
---

## Review Activity I: Filtering

In [15]:
# Import create_engine from sqlalchemy to connect to the database
from sqlalchemy import create_engine

# Import pandas
import pandas as pd

# Create an engine to the database sqlite-sakila.db
engine = create_engine("sqlite:///sqlite-sakila.db")

# Establish a connection to the database
dvd = engine.connect()

In [16]:
query = """
    SELECT title FROM film LIMIT 1
"""
pd.read_sql(query, dvd)

Unnamed: 0,title
0,ACADEMY DINOSAUR


* Find the film with the minimum replacement cost which is long (at least 2hr) and not adult (rated neither R nor NC-17).
    

In [17]:
query = """
    SELECT title
    FROM film 
    WHERE
        (length >= 120)
    AND NOT (rating = 'R')
    AND NOT (rating = 'NC-17')
    ORDER BY replacement_cost
    LIMIT 1
"""
pd.read_sql(query, dvd)

Unnamed: 0,title
0,CONTROL ANTHEM


* You can write `<>` or `!=` to mean "not equal"
    * `<>` is the standard SQL spelling, used more in Microsoft-y places
    * `!=` everywhere else
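
A quick check that the two spellings behave the same in SQLite, sketched on a made-up three-row `film` table (the rows are assumptions, not the Sakila data):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE film (title TEXT, rating TEXT)")
db.executemany(
    "INSERT INTO film VALUES (?, ?)",
    [("A", "R"), ("B", "NC-17"), ("D", "PG")],
)

# Same filter, written with each "not equal" operator
with_bang = db.execute(
    "SELECT title FROM film WHERE rating != 'R' ORDER BY title"
).fetchall()
with_angle = db.execute(
    "SELECT title FROM film WHERE rating <> 'R' ORDER BY title"
).fetchall()

print(with_bang == with_angle)
db.close()
```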

In [18]:
query = """
    SELECT title
    FROM film 
    WHERE
        (length >= 120)
    AND (rating != 'R')
    AND (rating != 'NC-17')
    ORDER BY replacement_cost
    LIMIT 1
"""
pd.read_sql(query, dvd)

Unnamed: 0,title
0,CONTROL ANTHEM


In [19]:
query = """
    SELECT title
    FROM film 
    WHERE
            (length >= 120)
    AND NOT (rating IN ('R', 'NC-17'))
    
    ORDER BY replacement_cost
    LIMIT 1
"""
pd.read_sql(query, dvd)

Unnamed: 0,title
0,CONTROL ANTHEM


* Isn't this *wrong*? ...

In [20]:
query = """
    SELECT title
    FROM film 
    WHERE
        (length >= 120)
    AND (rating != 'R' OR rating != 'NC-17')
    ORDER BY replacement_cost
    LIMIT 1
"""
pd.read_sql(query, dvd)

Unnamed: 0,title
0,CONTROL ANTHEM


### Applying Filters to Rows

```sql
Film Name  Length  Rating    length >= 120  AND ( rating != 'R'  OR  rating != 'NC-17' )

A          120     R               T        AND (       F        OR          T         )  ->  T AND T  ->  T
B          120     NC-17           T        AND (       T        OR          F         )  ->  T AND T  ->  T
D          120     PG              T        AND (       T        OR          T         )  ->  T AND T  ->  T
E          120     U               T        AND (       T        OR          T         )  ->  T AND T  ->  T .... OOPs! (every row passes, even A and B)

```

* Why did this go wrong?
    * ... what mistake in *thinking* occurred?
    * why did (my colleague at decoded...) write `OR` when the correct answer is `AND`?
* The film should `NOT` be (R `OR` NC-17), which is the same as (`NOT` R) `AND` (`NOT` NC-17)
    * `OR` in English does not always translate to `OR` in logic
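
To see this on actual rows, here is a sketch using the four films from the table above in an in-memory `sqlite3` database. The helper `titles` is hypothetical, written just for this comparison.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE film (title TEXT, length INTEGER, rating TEXT)")
db.executemany(
    "INSERT INTO film VALUES (?, ?, ?)",
    [("A", 120, "R"), ("B", 120, "NC-17"), ("D", 120, "PG"), ("E", 120, "U")],
)

def titles(condition):
    """Return the titles of the films matching a WHERE condition."""
    sql = f"SELECT title FROM film WHERE {condition} ORDER BY title"
    return [row[0] for row in db.execute(sql)]

# The mistaken filter: every rating differs from at least one of the two
wrong = titles("rating != 'R' OR rating != 'NC-17'")

# Two equivalent correct filters
not_of_or = titles("NOT (rating = 'R' OR rating = 'NC-17')")
and_of_nots = titles("rating != 'R' AND rating != 'NC-17'")

print(wrong)
print(not_of_or)
print(and_of_nots)
```

The `OR` version lets every film through, while the two correct versions agree with each other and keep only D and E.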

## JOINs

In [21]:
pd.read_sql('SELECT * FROM prices', con)

Unnamed: 0,Name,Price
0,Alice,10
1,Eve,20
2,Bob,30


In [22]:
cities = pd.DataFrame({
    'Name': ['Alice', 'Eve', 'Bob', 'Dan'],
    'City': ['Leeds', 'London', 'Paris', 'Glasgow']
})

cities.to_sql('cities', con, index=False, if_exists='replace')

4

In [23]:
pd.read_sql('SELECT * FROM cities', con)

Unnamed: 0,Name,City
0,Alice,Leeds
1,Eve,London
2,Bob,Paris
3,Dan,Glasgow


#### cartesian join = joining without any condition
* all match-ups

In [24]:
pd.read_sql('SELECT * FROM cities, prices', con)

Unnamed: 0,Name,City,Name.1,Price
0,Alice,Leeds,Alice,10
1,Alice,Leeds,Eve,20
2,Alice,Leeds,Bob,30
3,Eve,London,Alice,10
4,Eve,London,Eve,20
5,Eve,London,Bob,30
6,Bob,Paris,Alice,10
7,Bob,Paris,Eve,20
8,Bob,Paris,Bob,30
9,Dan,Glasgow,Alice,10
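
A cartesian join pairs every row of one table with every row of the other, so 4 cities and 3 prices give 4 × 3 = 12 pairings in full, even where a display shows fewer. A sketch that counts them, rebuilding both tables in an in-memory `sqlite3` database:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE prices (Name TEXT, Price INTEGER)")
db.execute("CREATE TABLE cities (Name TEXT, City TEXT)")
db.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [("Alice", 10), ("Eve", 20), ("Bob", 30)],
)
db.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("Alice", "Leeds"), ("Eve", "London"), ("Bob", "Paris"), ("Dan", "Glasgow")],
)

n_cities = db.execute("SELECT COUNT(*) FROM cities").fetchone()[0]
n_prices = db.execute("SELECT COUNT(*) FROM prices").fetchone()[0]

# No join condition, so every city is matched with every price
n_pairs = db.execute("SELECT COUNT(*) FROM cities, prices").fetchone()[0]

print(n_cities, n_prices, n_pairs)
db.close()
```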


####  inner join = "join"
* adding a pairing-up condition 
    * aka filtering the above 

In [25]:
pd.read_sql("""
    SELECT * 
    FROM cities 
    JOIN prices
    ON cities.Name = prices.Name
""", con)

Unnamed: 0,Name,City,Name.1,Price
0,Alice,Leeds,Alice,10
1,Eve,London,Eve,20
2,Bob,Paris,Bob,30


####  `LEFT` join 
* keeps every row of the *LEFT* table, NULL'ing the RIGHT table's columns where there is no match,

In [26]:
pd.read_sql("""
    SELECT * 
    FROM cities 
    LEFT JOIN prices
    ON cities.Name = prices.Name
""", con)

Unnamed: 0,Name,City,Name.1,Price
0,Alice,Leeds,Alice,10.0
1,Eve,London,Eve,20.0
2,Bob,Paris,Bob,30.0
3,Dan,Glasgow,,


---
## Group Activity: Workbook 2 (c. 25 min)
---

`RIGHT` joins aren't supported by older versions of SQLite (support arrived in 3.39), and they don't really need to be; on those versions they cause an error,

In [27]:
#pd.read_sql("SELECT * FROM cities RIGHT JOIN prices ON cities.Name = prices.Name", con)
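
Why a `RIGHT` join is not really needed: swapping the table order turns it into a `LEFT` join. The sketch below writes `cities RIGHT JOIN prices` as `prices LEFT JOIN cities`. The extra row `('Fay', 40)` is made up for illustration, so that there is a price with no matching city.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE prices (Name TEXT, Price INTEGER)")
db.execute("CREATE TABLE cities (Name TEXT, City TEXT)")
db.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [("Alice", 10), ("Eve", 20), ("Bob", 30), ("Fay", 40)],  # Fay is hypothetical
)
db.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("Alice", "Leeds"), ("Eve", "London"), ("Bob", "Paris"), ("Dan", "Glasgow")],
)

# Every row of prices is kept; city columns are NULL where nothing matches
rows = db.execute("""
    SELECT cities.Name, cities.City, prices.Name, prices.Price
    FROM prices
    LEFT JOIN cities
    ON cities.Name = prices.Name
    ORDER BY prices.Price
""").fetchall()

for row in rows:
    print(row)
db.close()
```

Dan drops out (no price), and Fay appears with `None` in the city columns.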

## Review Activity II: Linking Tables

There are several ways of phrasing a `JOIN`. Consider the below, where we place the `ON` conditions underneath the `JOIN film_actor`,

```sql
    SELECT title, first_name, last_name
    FROM actor    
    JOIN film
    JOIN film_actor
        ON 
            film_actor.film_id = film.film_id 
        AND
            film_actor.actor_id = actor.actor_id 
```

In [56]:
query = """
    SELECT title, first_name, last_name
    FROM actor    
    JOIN film
    JOIN film_actor
        ON 
            film.film_id = film_actor.film_id
        AND
            actor.actor_id = film_actor.actor_id
    WHERE 
        film.film_id IN (1, 2, 3)
"""
pd.read_sql(query, dvd)

Unnamed: 0,title,first_name,last_name
0,ACADEMY DINOSAUR,PENELOPE,GUINESS
1,ACADEMY DINOSAUR,CHRISTIAN,GABLE
2,ACADEMY DINOSAUR,LUCILLE,TRACY
3,ACADEMY DINOSAUR,SANDRA,PECK
4,ACADEMY DINOSAUR,JOHNNY,CAGE
5,ACADEMY DINOSAUR,MENA,TEMPLE
6,ACADEMY DINOSAUR,WARREN,NOLTE
7,ACADEMY DINOSAUR,OPRAH,KILMER
8,ACADEMY DINOSAUR,ROCK,DUKAKIS
9,ACADEMY DINOSAUR,MARY,KEITEL
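
Another common phrasing gives each `JOIN` its own `ON`. The sketch below compares the two styles on a toy version of the three tables; the ids, the film `TOY FILM` and the pairings are made up for illustration and are not the Sakila data.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE actor (actor_id INTEGER, first_name TEXT, last_name TEXT)")
db.execute("CREATE TABLE film (film_id INTEGER, title TEXT)")
db.execute("CREATE TABLE film_actor (actor_id INTEGER, film_id INTEGER)")
db.executemany("INSERT INTO actor VALUES (?, ?, ?)",
               [(1, "PENELOPE", "GUINESS"), (2, "CHRISTIAN", "GABLE")])
db.executemany("INSERT INTO film VALUES (?, ?)",
               [(1, "ACADEMY DINOSAUR"), (2, "TOY FILM")])
db.executemany("INSERT INTO film_actor VALUES (?, ?)",
               [(1, 1), (2, 1), (2, 2)])

# Style used above: all conditions under the last JOIN
all_at_end = db.execute("""
    SELECT title, first_name, last_name
    FROM actor
    JOIN film
    JOIN film_actor
        ON film.film_id = film_actor.film_id
        AND actor.actor_id = film_actor.actor_id
    ORDER BY title, last_name
""").fetchall()

# Alternative: link one table at a time, each JOIN with its own ON
one_at_a_time = db.execute("""
    SELECT title, first_name, last_name
    FROM actor
    JOIN film_actor
        ON actor.actor_id = film_actor.actor_id
    JOIN film
        ON film.film_id = film_actor.film_id
    ORDER BY title, last_name
""").fetchall()

print(all_at_end == one_at_a_time)
db.close()
```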


---

## Sub-Queries

It seems like `SELECT` mostly means `RETURN`, i.e., provide us with data,

In [60]:
pd.read_sql("""
    SELECT * 
    FROM film
    LIMIT 1
""", dvd)

Unnamed: 0,film_id,title,description,release_year,language_id,original_language_id,rental_duration,rental_rate,length,replacement_cost,rating,special_features,last_update
0,1,ACADEMY DINOSAUR,A Epic Drama of a Feminist And a Mad Scientist...,2006,1,,6,0.99,86,20.99,PG,"Deleted Scenes,Behind the Scenes",2021-03-06 15:52:00


In [61]:
pd.read_sql("SELECT * FROM prices", con)

Unnamed: 0,Name,Price
0,Alice,10
1,Eve,20
2,Bob,30


A subquery is, e.g., a use of `SELECT` as part of another query. Below we `SELECT` the `Name`s of the `prices` table and filter `cities` by whether its `Name` is *in* the `prices` set.

(I.e., we join.)

In [65]:
pd.read_sql("""
    SELECT * 
    FROM cities 
    WHERE 
        cities.Name IN (SELECT Name FROM prices)
    """, con)

Unnamed: 0,Name,City
0,Alice,Leeds
1,Eve,London
2,Bob,Paris


...we shouldn't do this. We should just `JOIN`. 

Why? DBs are optimized for JOINs ("do less"), and can struggle to optimize subqueries.
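
As a sketch, the `JOIN` phrasing of the same filter, checked against the subquery version on an in-memory copy of the two tables:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE prices (Name TEXT, Price INTEGER)")
db.execute("CREATE TABLE cities (Name TEXT, City TEXT)")
db.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [("Alice", 10), ("Eve", 20), ("Bob", 30)],
)
db.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("Alice", "Leeds"), ("Eve", "London"), ("Bob", "Paris"), ("Dan", "Glasgow")],
)

# Subquery version: keep cities whose Name appears in prices
with_subquery = db.execute("""
    SELECT Name, City
    FROM cities
    WHERE Name IN (SELECT Name FROM prices)
    ORDER BY Name
""").fetchall()

# JOIN version: pair up matching rows, return only the cities columns
with_join = db.execute("""
    SELECT cities.Name, cities.City
    FROM cities
    JOIN prices
    ON cities.Name = prices.Name
    ORDER BY cities.Name
""").fetchall()

print(with_subquery == with_join)
db.close()
```

The two agree here because each `Name` appears only once in `prices`; with repeated names the `JOIN` would return a row per match.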

But subqueries can still be very useful,

```sql

INSERT INTO table_today
    SELECT * FROM table_yesterday

```
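
A runnable sketch of this copy in SQLite, where the `SELECT` follows `INSERT INTO ...` directly, with no `VALUES` keyword. The columns and rows of `table_yesterday` are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE table_yesterday (Name TEXT, Price INTEGER)")
db.execute("CREATE TABLE table_today (Name TEXT, Price INTEGER)")
db.executemany(
    "INSERT INTO table_yesterday VALUES (?, ?)",
    [("Alice", 10), ("Eve", 20)],
)

# The SELECT supplies the rows to insert
db.execute("INSERT INTO table_today SELECT * FROM table_yesterday")

copied = db.execute("SELECT * FROM table_today ORDER BY Price").fetchall()
print(copied)
db.close()
```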

---
## Individual Activity & Take-Home: Workbook 3 (c. 20 min)
* do you have any questions from the course so far?
* any areas you want a little review?
* what areas have you found most useful, hardest, most interesting?
    * ...
---

In [68]:
# Writing an SQL query 
query = """
    SELECT payment_id
    FROM payment 
    WHERE 
        amount > (SELECT AVG(amount) FROM payment)
    ORDER BY amount ASC
    LIMIT 5
    """

# Querying the database
pd.read_sql_query(query, dvd)

Unnamed: 0,payment_id
0,6
1,7
2,12
3,13
4,16


In [71]:
# Writing an SQL query 
query = """
    SELECT payment_id
    FROM payment 
    JOIN 
        (SELECT AVG(amount) AS Avg FROM payment)
    ON 
        amount > Avg
    ORDER BY amount
    LIMIT 5
    
    """

# Querying the database
pd.read_sql_query(query, dvd)

Unnamed: 0,payment_id
0,6
1,7
2,12
3,13
4,16


# Next Steps & Review