# Discussion 11: SQL

In [1]:
import pandas as pd
import numpy as np
import duckdb

In [2]:
# Run this cell to set up SQL.
%load_ext sql



In [3]:
# Run this cell to connect to duckdb
conn = duckdb.connect()
conn.query("INSTALL sqlite")
%sql conn --alias duckdb

In [4]:
# Connect to the database
%sql duckdb:///test_database --alias test

## SQL Syntax
### Q1 
Create the `survey` table that we'll work with in this question.

In [5]:
data = {'j_name': ['Llama Technician','Software Engineer','Open Source Maintainer','Big Data Engineer', 'Data Analyst', 'Analyst Intern'],
        'c_name': ["Google","Salesforce", "Github","Microsoft","Startup","Google"],
        'c_location' : ['Mountain View', 'SF', 'SF', 'Redmond', 'Berkeley', 'SF'],
        'm_name': ["Applied Math","ORMS","Computer Science", "Data Science", "Data Science","Philosophy"]
        }

survey = pd.DataFrame(data, columns = list(data.keys()))
survey

Unnamed: 0,j_name,c_name,c_location,m_name
0,Llama Technician,Google,Mountain View,Applied Math
1,Software Engineer,Salesforce,SF,ORMS
2,Open Source Maintainer,Github,SF,Computer Science
3,Big Data Engineer,Microsoft,Redmond,Data Science
4,Data Analyst,Startup,Berkeley,Data Science
5,Analyst Intern,Google,SF,Philosophy


#### 1a
Write an SQL query that selects all data science major graduates that got jobs in Berkeley.
The result generated by your query should include all 4 columns.

In [6]:
%%sql 
-- write your query here --
SELECT * 
FROM survey
WHERE m_name = 'Data Science' 
AND c_location = 'Berkeley';

j_name,c_name,c_location,m_name
Data Analyst,Startup,Berkeley,Data Science


#### 1b
Write an SQL query to find the top 2 most popular companies that data science graduates will work at, from most popular to 2nd most popular.

In [7]:
%%sql 
-- write your query here --
SELECT c_name, COUNT(*) AS count
FROM survey
WHERE m_name = 'Data Science'
GROUP BY c_name
ORDER BY count DESC
LIMIT 2;

c_name,count
Microsoft,1
Startup,1
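
Both companies above are tied at one graduate each, so the order of the two rows is not determined by `ORDER BY count DESC` alone. As a minimal sketch, assuming an in-memory SQLite database as a stand-in for the notebook's DuckDB connection, a second sort key can make the ranking deterministic. The alphabetical tie-breaker is an illustrative choice, not part of the original question.

```python
import sqlite3

# In-memory SQLite database (an assumption; the notebook itself uses DuckDB).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE survey (j_name TEXT, c_name TEXT, c_location TEXT, m_name TEXT)"
)
conn.executemany(
    "INSERT INTO survey VALUES (?, ?, ?, ?)",
    [
        ("Llama Technician", "Google", "Mountain View", "Applied Math"),
        ("Software Engineer", "Salesforce", "SF", "ORMS"),
        ("Open Source Maintainer", "Github", "SF", "Computer Science"),
        ("Big Data Engineer", "Microsoft", "Redmond", "Data Science"),
        ("Data Analyst", "Startup", "Berkeley", "Data Science"),
        ("Analyst Intern", "Google", "SF", "Philosophy"),
    ],
)

top_companies = conn.execute(
    """
    SELECT c_name, COUNT(*) AS count
    FROM survey
    WHERE m_name = 'Data Science'
    GROUP BY c_name
    ORDER BY count DESC, c_name ASC  -- second key breaks ties alphabetically
    LIMIT 2
    """
).fetchall()
print(top_companies)
```

With the tie-breaker in place, the same two rows come back in the same order on every run.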


### Q2

Create the tables that we'll work with in this question.

In [8]:
homes_data = {'home_id': [1,2,3,4,5,6],
        'city': ["Berkeley","San Jose","Berkeley","Berkeley","Berkeley", "Sunnyvale"],
        'bedrooms': [2,1,5,3,4,1],
        'bathrooms': [2,2,1,1,3,2],
        'area': [str(i) for i in [500,750,1000,1500,500,1000]] 
        }

homes = pd.DataFrame(homes_data, columns = list(homes_data.keys()))

transactions_data = {'home_id': [1,2,3,5],
        'buyer_id': [5,6,7,8],
        'seller_id': [8,7,6,5],
        'transaction_data': ['1/12/2001','4/14/2001','8/11/2001','12/21/2001'],
        'sale_price': [1000,500,750,1200]
        }

transactions = pd.DataFrame(transactions_data, columns = list(transactions_data.keys()))


buyers_data = {'buyer_id': [5,6,7,8],
        'name': ["Lillian","Ishani","Arman","Yash"],
        }

buyers = pd.DataFrame(buyers_data, columns = list(buyers_data.keys()))

seller_data = {'seller_id': [8,7,6,5],
        'name': ["Mir","Shiny","Pragnay","Ian"],
        }

seller = pd.DataFrame(seller_data, columns = list(seller_data.keys()))

In [9]:
homes

Unnamed: 0,home_id,city,bedrooms,bathrooms,area
0,1,Berkeley,2,2,500
1,2,San Jose,1,2,750
2,3,Berkeley,5,1,1000
3,4,Berkeley,3,1,1500
4,5,Berkeley,4,3,500
5,6,Sunnyvale,1,2,1000


In [10]:
transactions

Unnamed: 0,home_id,buyer_id,seller_id,transaction_data,sale_price
0,1,5,8,1/12/2001,1000
1,2,6,7,4/14/2001,500
2,3,7,6,8/11/2001,750
3,5,8,5,12/21/2001,1200


In [11]:
buyers

Unnamed: 0,buyer_id,name
0,5,Lillian
1,6,Ishani
2,7,Arman
3,8,Yash


In [12]:
seller

Unnamed: 0,seller_id,name
0,8,Mir
1,7,Shiny
2,6,Pragnay
3,5,Ian




Consider the following real estate schema (underlined columns form each table's primary key, so their values, taken together, are unique within the table):

<code>
Homes(<u>home_id int</u>, city text, bedrooms int, bathrooms int,
area text)
Transactions(<u>home_id int, buyer_id int, seller_id int, transaction_date date</u>, sale_price int)
Buyers(<u>buyer_id int</u>, name text)
Sellers(<u>seller_id int</u>, name text)
</code>
<br>

Fill in the blanks in the SQL query to find the home_id, selling price, and area for each home in Berkeley with an area greater than 600. If the home has not been sold yet and has an area greater than 600, it should still be included in the table with **the price as None**.


In [13]:
%%sql 
-- fill in the blanks --
SELECT H.home_id, T.sale_price, CAST(H.area as INT) as int_area
FROM Homes AS H
LEFT JOIN Transactions AS T
ON H.home_id = T.home_id
WHERE H.city = 'Berkeley'
AND int_area > 600;

home_id,sale_price,int_area
3,750.0,1000
4,,1500


In [14]:
%%sql 
-- alternate solution using RIGHT JOIN and casting only when making a comparison--
SELECT H.home_id, T.sale_price, H.area
FROM Transactions AS T
RIGHT JOIN Homes AS H
ON H.home_id = T.home_id
WHERE H.city = 'Berkeley'
AND CAST(area AS INT) > 600;

home_id,sale_price,area
3,750.0,1000
4,,1500


## Joins 

![](joins.png)

Note: You do not need the JOIN keyword to join SQL tables. The following are equivalent:

    SELECT column1, column2
    FROM table1, table2
    WHERE table1.id = table2.id;
    
    SELECT column1, column2
    FROM table1 JOIN table2 
    ON table1.id = table2.id;
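
The equivalence of the two forms can be checked on a small example. The sketch below assumes an in-memory SQLite database and made-up rows for `table1` and `table2`; only the two query shapes come from the note above.

```python
import sqlite3

# Hypothetical sample data; ids 2 and 3 appear in both tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (id INTEGER, column1 TEXT);
    CREATE TABLE table2 (id INTEGER, column2 TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (2, 'x'), (3, 'y'), (4, 'z');
    """
)

# Implicit join: comma-separated tables with the match condition in WHERE.
implicit = conn.execute(
    """
    SELECT column1, column2
    FROM table1, table2
    WHERE table1.id = table2.id
    ORDER BY column1
    """
).fetchall()

# Explicit join: JOIN ... ON with the same match condition.
explicit = conn.execute(
    """
    SELECT column1, column2
    FROM table1 JOIN table2
    ON table1.id = table2.id
    ORDER BY column1
    """
).fetchall()

print(implicit == explicit, implicit)
```

The `ORDER BY` is added only so the two result lists can be compared directly; without it, row order is not guaranteed.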

### Q3 

In the figure above, assume `table1` has $m$ records, while `table2` has $n$ records. Describe which records are returned from each type of join. What is the **maximum** possible number of records returned in each join? Consider the cases where on the joined field, (1) both tables have unique values, and (2) both tables have duplicated values. As a bonus, what is the **minimum** possible number of records returned in each join?


_Write your answer in this cell_ 
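
One way to explore this question is to count the rows each join returns for a few small inputs. The sketch below is an experiment, not a full answer: it assumes an in-memory SQLite database, a hypothetical helper `join_counts`, and made-up key lists with $m = 3$ and $n = 2$. The right join is written as a left join with the tables swapped, because older SQLite versions do not support `RIGHT JOIN`.

```python
import sqlite3


def join_counts(left_ids, right_ids):
    """Return the number of rows produced by inner, left, and right joins on id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (id INTEGER)")
    conn.execute("CREATE TABLE table2 (id INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?)", [(i,) for i in left_ids])
    conn.executemany("INSERT INTO table2 VALUES (?)", [(i,) for i in right_ids])
    queries = {
        "inner": "SELECT COUNT(*) FROM table1 JOIN table2 ON table1.id = table2.id",
        "left": "SELECT COUNT(*) FROM table1 LEFT JOIN table2 ON table1.id = table2.id",
        # Right join emulated by swapping the tables in a left join.
        "right": "SELECT COUNT(*) FROM table2 LEFT JOIN table1 ON table1.id = table2.id",
    }
    counts = {name: conn.execute(sql).fetchone()[0] for name, sql in queries.items()}
    conn.close()
    return counts


unique_keys = join_counts([1, 2, 3], [2, 3])      # unique values, partial overlap
duplicated_keys = join_counts([1, 1, 1], [1, 1])  # every row shares the same key
disjoint_keys = join_counts([1, 2, 3], [4, 5])    # no keys in common

print(unique_keys)
print(duplicated_keys)
print(disjoint_keys)
```

Varying the key lists (more duplicates, less overlap) shows how the counts move between the extremes.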

## More SQL Queries
### Q4

Examine this schema for these two tables:

    CREATE TABLE cat_owners (
        id integer, 
        name text, 
        age integer,
        PRIMARY KEY (id)
    ); 

    CREATE TABLE cats (
        id integer,
        owner_id integer, 
        name text, 
        breed text, 
        age integer, 
        PRIMARY KEY (id),
        FOREIGN KEY (owner_id) REFERENCES cat_owners
    );


In [15]:
cat_owners_data = {'id': [10,11,12],
        'name': ["Alice","Bob","Candice"],
        }

cat_owners = pd.DataFrame(cat_owners_data, columns = list(cat_owners_data.keys()))

cats_data = {'id': [51,52,53,54,55],
        'owner_id': ["Alice","Alice","Bob","Bob","Candice"],
        'name': ["Mittens","Whisker","Felix","Lucky","Fluffy"],
        'breed' : ["Tabby","Black","Orange","Tabby","Black"],
        'age': [2,3,1,2,16]
        }

cats = pd.DataFrame(cats_data, columns = list(cats_data.keys()))

In [16]:
cat_owners 

Unnamed: 0,id,name
0,10,Alice
1,11,Bob
2,12,Candice


In [17]:
cats

Unnamed: 0,id,owner_id,name,breed,age
0,51,Alice,Mittens,Tabby,2
1,52,Alice,Whisker,Black,3
2,53,Bob,Felix,Orange,1
3,54,Bob,Lucky,Tabby,2
4,55,Candice,Fluffy,Black,16


#### 4a
Write an SQL query to create a table almost identical to `cats`, except with an additional
column `Nickname` that has the value 'Kitten' for cats aged 1 or younger,
'Catto' for cats aged strictly between 1 and 15, and 'Wise One' for cats aged 15 or older.

In [18]:
%%sql 
-- write your query here --
SELECT C.id, owner_id, name, breed, age,
    CASE
        WHEN age <= 1 THEN 'Kitten'
        WHEN age >= 15 THEN 'Wise One'
        ELSE 'Catto'
    END AS Nickname
FROM Cats AS C;

id,owner_id,name,breed,age,Nickname
51,Alice,Mittens,Tabby,2,Catto
52,Alice,Whisker,Black,3,Catto
53,Bob,Felix,Orange,1,Kitten
54,Bob,Lucky,Tabby,2,Catto
55,Candice,Fluffy,Black,16,Wise One


#### 4b
Considering only cats with ages strictly greater than 1, write an SQL query that returns the names of owners who own more than one such cat.

In [23]:
%%sql 
-- write your query here --
SELECT owner_id
FROM Cats AS C
WHERE C.age > 1
GROUP BY C.owner_id
HAVING COUNT(*) > 1;

owner_id
Alice


#### 4c
Write an SQL query to find the `owner_id` of the cat owner who owns the most cats.

In [20]:
%%sql 
-- write your query here --
SELECT owner_id 
FROM Cats 
GROUP BY owner_id
ORDER BY COUNT(*) DESC 
LIMIT 1;

owner_id
Alice


#### 4d
Write an SQL query to figure out the names of all of the cat owners who have a cat
named Pishi. 

In [21]:
%%sql 
-- write your query here --
SELECT O.name 
FROM Cats AS C
JOIN Cat_owners AS O
ON C.owner_id = O.id 
WHERE C.name = 'Pishi';

name


#### 4e
Is it possible to have a cat with an owner not in the `cat_owners` table? Explain your answer.

_Write your answer in this cell_ 
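
The question can also be tested empirically against the schema from the start of Q4. The sketch below is a minimal experiment under stated assumptions: it uses an in-memory SQLite database rather than DuckDB, turns on SQLite's foreign key enforcement (off by default) with `PRAGMA foreign_keys = ON`, and inserts hypothetical rows, including an owner age of 30 and a cat named 'Stray' that are not in the original data. Other database engines may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE cat_owners (
        id integer,
        name text,
        age integer,
        PRIMARY KEY (id)
    );
    CREATE TABLE cats (
        id integer,
        owner_id integer,
        name text,
        breed text,
        age integer,
        PRIMARY KEY (id),
        FOREIGN KEY (owner_id) REFERENCES cat_owners
    );
    INSERT INTO cat_owners VALUES (10, 'Alice', 30);  -- age is hypothetical
    """
)

# A cat whose owner_id matches an existing owner is accepted.
conn.execute("INSERT INTO cats VALUES (51, 10, 'Mittens', 'Tabby', 2)")

# A cat whose owner_id (99) matches no row in cat_owners.
try:
    conn.execute("INSERT INTO cats VALUES (52, 99, 'Stray', 'Black', 3)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# A cat with no owner at all: NULL does not violate the foreign key.
conn.execute("INSERT INTO cats VALUES (53, NULL, 'Felix', 'Orange', 1)")
ownerless = conn.execute(
    "SELECT COUNT(*) FROM cats WHERE owner_id IS NULL"
).fetchone()[0]

print(rejected, ownerless)
```

Here `rejected` records whether the insert that references a missing owner failed, and `ownerless` counts cats stored with a NULL `owner_id`.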

#### 4f
Write an SQL query to select all rows from the `cats` table that have cats of
the top 2 most popular cat breeds.

In [22]:
%%sql 
-- write your query here --
SELECT *
FROM Cats WHERE breed IN
    (SELECT breed
    FROM Cats
    GROUP BY breed
    ORDER BY COUNT(*) DESC
    LIMIT 2);

id,owner_id,name,breed,age
51,Alice,Mittens,Tabby,2
52,Alice,Whisker,Black,3
54,Bob,Lucky,Tabby,2
55,Candice,Fluffy,Black,16
