# Discussion 11: SQL

In [None]:
import pandas as pd
import numpy as np
import duckdb

In [None]:
# Run this cell to set up SQL.
%load_ext sql

In [None]:
# Run this cell to connect to duckdb
conn = duckdb.connect()
conn.query("INSTALL sqlite")
%sql conn --alias duckdb

## SQL Syntax
### Q1 
Create the `survey` table that we'll work with in this question.

In [None]:
data = {'j_name': ['Llama Technician','Software Engineer','Open Source Maintainer','Big Data Engineer', 'Data Analyst', 'Analyst Intern'],
        'c_name': ["Google","Salesforce", "Github","Microsoft","Startup","Google"],
        'c_location' : ['Mountain View', 'SF', 'SF', 'Redmond', 'Berkeley', 'SF'],
        'm_name': ["Applied Math","ORMS","Computer Science", "Data Science", "Data Science","Philosophy"]
        }

survey = pd.DataFrame(data, columns = list(data.keys()))
survey

#### 1a
Write an SQL query that selects all Data Science majors who got jobs in Berkeley.
The result generated by your query should include all 4 columns.

In [None]:
%%sql 
-- write your query here --


#### 1b
Write an SQL query to find the top 2 most popular companies that data science graduates will work at, from most popular to 2nd most popular.

In [None]:
%%sql 
-- write your query here --
SELECT c_name, ____________ AS count
FROM survey
WHERE _____________ = 'Data Science'
GROUP BY ______________
ORDER BY ______________
LIMIT 2;

### Q2

Create the tables that we'll work with in this question.

In [None]:
homes_data = {'home_id': [1,2,3,4,5,6],
        'city': ["Berkeley","San Jose","Berkeley","Berkeley","Berkeley", "Sunnyvale"],
        'bedrooms': [2,1,5,3,4,1],
        'bathrooms': [2,2,1,1,3,2],
        'area': [str(i) for i in [500,750,1000,1500,500,1000]] 
        }

homes = pd.DataFrame(homes_data, columns = list(homes_data.keys()))

transactions_data = {'home_id': [1,2,3,5],
        'buyer_id': [5,6,7,8],
        'seller_id': [8,7,6,5],
        'transaction_date': ['1/12/2001','4/14/2001','8/11/2001','12/21/2001'],
        'sale_price': [1000,500,750,1200]
        }

transactions = pd.DataFrame(transactions_data, columns = list(transactions_data.keys()))


buyers_data = {'buyer_id': [5,6,7,8],
        'name': ["Angela","Maya","Jacob","Kevin"],
        }

buyers = pd.DataFrame(buyers_data, columns = list(buyers_data.keys()))

sellers_data = {'seller_id': [8,7,6,5],
        'name': ["Gisella","Vicky","Alana","James"],
        }

sellers = pd.DataFrame(sellers_data, columns = list(sellers_data.keys()))

In [None]:
homes

In [None]:
transactions

In [None]:
buyers

In [None]:
sellers



Consider the following real estate schema (underlined columns make up each table's primary key, so their combined values are unique, with no duplicates):

* <code> homes(<u>home_id int</u>, city text, bedrooms int, bathrooms int,
area text) </code>
* <code> transactions(<u>home_id int, buyer_id int, seller_id int, transaction_date date</u>, sale_price int) </code>
* <code> buyers(<u>buyer_id int</u>, name text) </code>
* <code> sellers(<u>seller_id int</u>, name text) </code>

Fill in the blanks in the SQL query to find the `home_id`, selling price, and area for each home in Berkeley with an area greater than 600. If such a home has not been sold yet, it should still be included in the table with **the price as None** (that is, SQL `NULL`).
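
As a hint on where a `None` price can come from, here is a minimal sketch on unrelated toy data. It assumes Python's built-in `sqlite3` module instead of this notebook's DuckDB connection so that it runs on its own, and the `books` and `loans` tables are hypothetical. It shows a join that keeps rows with no match and fills the missing side with `NULL`, which Python reports as `None`.

```python
import sqlite3

# Hypothetical toy tables, unrelated to the homes schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (book_id INTEGER, title TEXT);
    CREATE TABLE loans (book_id INTEGER, borrower TEXT);
    INSERT INTO books VALUES (1, 'Dune'), (2, 'Emma'), (3, 'Ulysses');
    INSERT INTO loans VALUES (1, 'Ana'), (3, 'Raj');
""")

# LEFT JOIN keeps every row of the left table (books),
# even when no row in loans matches it.
rows = conn.execute("""
    SELECT books.book_id, title, borrower
    FROM books
    LEFT JOIN loans
    ON books.book_id = loans.book_id
    ORDER BY books.book_id;
""").fetchall()

for row in rows:
    print(row)  # the unmatched book has borrower None (SQL NULL)

conn.close()
```

Book 2 has no loan, so its `borrower` comes back as `None` rather than the row being dropped.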


In [None]:
%%sql 
-- fill in the blanks --
SELECT _______________
FROM _________
_________ JOIN _________
ON _______________
WHERE _______________;

## Joins 

![](joins.png)

Note: You do not need the JOIN keyword to perform an inner join in SQL. The following are equivalent:

    SELECT column1, column2
    FROM table1, table2
    WHERE table1.id = table2.id;
    
    SELECT column1, column2
    FROM table1 JOIN table2 
    ON table1.id = table2.id;
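
To check this equivalence yourself, here is a minimal sketch. It assumes Python's built-in `sqlite3` module rather than this notebook's DuckDB connection so that it runs standalone, and the contents of `table1` and `table2` are made up for illustration.

```python
import sqlite3

# Hypothetical contents for table1 and table2; only ids 2 and 3 match.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, column1 TEXT);
    CREATE TABLE table2 (id INTEGER, column2 TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (2, 'x'), (3, 'y'), (4, 'z');
""")

# Comma-separated tables with the join condition in WHERE.
implicit_join = conn.execute("""
    SELECT column1, column2
    FROM table1, table2
    WHERE table1.id = table2.id
    ORDER BY column1;
""").fetchall()

# The same inner join written with JOIN ... ON.
explicit_join = conn.execute("""
    SELECT column1, column2
    FROM table1 JOIN table2
    ON table1.id = table2.id
    ORDER BY column1;
""").fetchall()

print(implicit_join)
print(explicit_join)
print(implicit_join == explicit_join)

conn.close()
```

Both queries keep only the rows whose `id` appears in both tables. The `ORDER BY` is there only to make the two results directly comparable.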

### Q3 

In the figure above, assume `table1` has $m$ records, while `table2` has $n$ records. Describe which records are returned from each type of join. What is the **maximum** possible number of records returned in each join? Consider two cases for the joined field: (1) both tables have unique values, and (2) both tables have duplicated values. As a bonus, what is the **minimum** possible number of records returned in each join?


_Write your answer in this cell_ 

## More SQL Queries
### Q4

Examine this schema for these two tables:

    CREATE TABLE cat_owners (
        id integer, 
        name text, 
        age integer,
        PRIMARY KEY (id)
    ); 

    CREATE TABLE cats (
        id integer,
        owner_id integer, 
        name text, 
        breed text, 
        age integer, 
        PRIMARY KEY (id),
        FOREIGN KEY (owner_id) REFERENCES cat_owners
    );


In [None]:
cat_owners_data = {'id': [10,11,12],
        'name': ["Alice","Bob","Candice"],
        }

cat_owners = pd.DataFrame(cat_owners_data, columns = list(cat_owners_data.keys()))

cats_data = {'id': [51,52,53,54,55],
        'owner_id': [10, 10, 11, 11, 12],
        'name': ["Mittens","Whisker","Felix","Lucky","Fluffy"],
        'breed' : ["Tabby","Black","Orange","Tabby","Black"],
        'age': [2,3,1,2,16]
        }

cats = pd.DataFrame(cats_data, columns = list(cats_data.keys()))

In [None]:
cat_owners 

In [None]:
cats

#### 4a
Write an SQL query to create a table almost identical to `cats`, except with an additional
column `Nickname` that has the value 'Kitten' for cats aged 1 or younger,
'Catto' for cats with ages strictly between 1 and 15, and 'Wise One' for cats aged 15 or older.
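
As a hint, one way to derive a label column from value ranges is a `CASE` expression. The sketch below uses unrelated toy data: it assumes Python's built-in `sqlite3` module so it runs standalone, and the `dogs` table with its size thresholds is hypothetical.

```python
import sqlite3

# Hypothetical toy table, unrelated to the cats data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dogs (name TEXT, weight INTEGER);
    INSERT INTO dogs VALUES ('Rex', 8), ('Fido', 20), ('Bear', 45);
""")

# CASE checks its WHEN branches top to bottom and returns the first match;
# ELSE covers every row that matched none of them.
rows = conn.execute("""
    SELECT *,
           CASE
               WHEN weight < 10 THEN 'small'
               WHEN weight < 30 THEN 'medium'
               ELSE 'large'
           END AS size
    FROM dogs
    ORDER BY weight;
""").fetchall()

for row in rows:
    print(row)

conn.close()
```

Because the branches are tested in order, the second `WHEN` only sees rows that failed the first, so each row gets exactly one label.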

In [None]:
%%sql 
-- write your query here --


#### 4b
Considering only cats with ages strictly greater than 1, write an SQL query that returns the owner_ids that own more than one cat.
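
As a hint, recall that `WHERE` filters individual rows before grouping, while `HAVING` filters whole groups after aggregation. The sketch below illustrates the difference on unrelated toy data. It assumes Python's built-in `sqlite3` module so it runs standalone, and the `orders` table is hypothetical.

```python
import sqlite3

# Hypothetical toy table, unrelated to the cats data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 5), (1, 50), (1, 70), (2, 80), (2, 3), (3, 90);
""")

# WHERE drops small orders row by row; HAVING then keeps only the
# customers that still have more than one order left in their group.
rows = conn.execute("""
    SELECT customer_id, COUNT(*) AS big_orders
    FROM orders
    WHERE amount > 10
    GROUP BY customer_id
    HAVING COUNT(*) > 1
    ORDER BY customer_id;
""").fetchall()

print(rows)

conn.close()
```

Customer 2 has two orders in total, but only one survives the `WHERE` filter, so the `HAVING` clause excludes that customer.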

In [None]:
%%sql 
-- write your query here --


#### 4c
Write an SQL query that returns the total number of cats each `owner_id` owns sorted by the number of cats in descending order. There should be two columns (`owner_id` and `num_cats`).

In [None]:
%%sql 
-- write your query here --


#### 4d
Write an SQL query to figure out the names of all of the cat owners who have a cat
named Pishi. 

In [None]:
%%sql 
-- write your query here --


#### 4e
Is it possible to have a cat with an owner not in the `cat_owners` table? Explain your answer.

_Write your answer in this cell_ 

#### 4f
Write an SQL query to select all rows from the `cats` table that have cats of
the top 2 most popular cat breeds.

In [None]:
%%sql 
-- write your query here --