In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab11.ipynb")

# Lab 11: SQL
## Due Sunday, July 28, 11:59 PM PT

In this lab, we are going to practice viewing, sorting, grouping, and merging tables with SQL. We will explore two datasets:
1. A "minified" version of the [Internet Movie Database](https://www.imdb.com/interfaces/) (IMDb). This SQLite database (~10MB) is a tiny sample of the much larger full database (more than a few GBs). As a result, our queries may return very different results than they would on the whole database!

1. The money donated during the 2016 election using the [Federal Election Commission (FEC)'s public records](https://www.fec.gov/data/). You will be connecting to a SQLite database containing the data. The data we will be working with in this lab is relatively small (~106MB); however, it is a sample taken from a much larger database (more than a few GBs).


To receive credit for a lab, answer all questions correctly and submit before the deadline.

**The on-time deadline is Sunday, July 28, 11:59 PM PT**. Please read the syllabus for the grace period policy. No late submissions beyond the grace period will be accepted. While course staff is happy to help you if you encounter difficulties with submission, we may not be able to respond to late-night requests for assistance (TAs need to sleep, after all!). **We strongly encourage you to plan to submit your work to Gradescope several hours before the stated deadline.** This way, you will have ample time to contact staff for submission support.

### Lab Walk-Through
In addition to the lab notebook, we have also released a prerecorded walk-through video of the lab. We encourage you to reference this video as you work through the lab. Run the cell below to display the video.

**Note:** The walkthrough video is from Spring 2022, when the format of answers was different from this semester's.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("uQ3E4pejmD8", list = 'PLQCcNQgUcDfpdBnhS-lPq8LPas48tkMgp', listType = 'playlist')

### Collaboration Policy
Data science is a collaborative activity. While you may talk with others about this assignment, we ask that you **write your solutions individually**. If you discuss the assignment with others, please **include their names** in the cell below.

**Collaborators:** *list names here*

### Debugging Guide
If you run into any technical issues, we highly recommend checking out the [Data 100 Debugging Guide](https://ds100.org/debugging-guide/). In this guide, you can find general questions about Jupyter notebooks / Datahub, Gradescope, common SQL errors, and more.

In [None]:
# Run this cell and the next one to set up your notebook.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sqlalchemy
from pathlib import Path
from zipfile import ZipFile

with ZipFile('data.zip', 'r') as zipObj:
    # Extract all contents of the zip file into the current directory
    zipObj.extractall()

## SQL Query Syntax

Throughout this lab, you will become familiar with the following syntax for the `SELECT` query:

```
SELECT <column list>
FROM <table>
[WHERE <predicate>]
[GROUP BY <column list>]
[HAVING <predicate>]
[ORDER BY <column list>]
[LIMIT <number of rows>]
[OFFSET <number of rows>]
```
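
To make the clause order concrete, here is a minimal sketch that runs one query using most of these clauses. It uses Python's built-in `sqlite3` module and a hypothetical `pets` table invented for illustration; neither the table nor its rows are part of this lab's databases.

```python
import sqlite3

# Hypothetical toy table (an assumption for illustration, not lab data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (name TEXT, species TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        ("Biscuit", "dog", 3),
        ("Rex", "dog", 7),
        ("Cleo", "cat", 2),
        ("Dot", "cat", 5),
        ("Echo", "cat", 9),
        ("Finn", "fish", 1),
    ],
)

query = """
SELECT species, COUNT(*) AS total, MAX(age) AS oldest
FROM pets
WHERE age >= 2            -- filter rows before grouping
GROUP BY species          -- one output row per species
HAVING COUNT(*) >= 2      -- filter groups after aggregation
ORDER BY total DESC       -- sort the remaining groups
LIMIT 5;                  -- cap the number of rows returned
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('cat', 3, 9), ('dog', 2, 7)]
```

Note that `WHERE` removes the one-year-old fish before any grouping happens, while `HAVING` is evaluated on the groups that remain.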

<br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Part 0 [Tutorial]: Writing SQL in Jupyter Notebooks

**Caution: Be careful with large SQL queries!!** You may need to restart your JupyterHub instance if it stops responding. To avoid printing out 100k-row tables, we've adjusted the display limit so that displayed tables are truncated to 20 rows (though they may contain more rows in reality).


In [None]:
%config SqlMagic.displaylimit = 20

In [None]:
# Run this cell to set up SQL. 
import duckdb
%load_ext sql

### 1. `%%sql` cell magic

In lecture, we used the `sqlalchemy` extension to use **`%%sql` cell magic**, which enables us to connect to SQL databases and issue SQL commands within Jupyter Notebooks.

Run the below cells to connect to a mini IMDb database using `duckdb` as the backend.

In [None]:
# Run this cell to connect to duckdb
conn = duckdb.connect()
conn.query("INSTALL sqlite")

In [None]:
# Run this cell to connect to the imdbmini database
imdb_mini_db = "duckdb:///imdbmini.db"
%sql duckdb:///imdbmini.db --alias imdb_engine

The above cell connects to the same database using the SQLAlchemy Python library, which can connect to several different database management systems, including sqlite3, MySQL, PostgreSQL, and Oracle; we use `duckdb`. The library also supports an advanced feature for generating queries called an [object relational mapper](https://docs.sqlalchemy.org/en/20/tutorial/index.html#unified-tutorial) or ORM, which we won't discuss in this course but is quite useful for application development.

Above, prefixing our single-line command with `%sql` means that the entire line will be treated as a SQL command (this is called "line magic"). In this class we will most often write multi-line SQL, meaning we need "cell magic", where the first line has `%%sql` (note the double `%` operator).

The database `imdbmini.db` includes several tables, one of which is `Name`. Running the below cell will return the first 5 rows of that table. Note that `%%sql` is on its own line.

We've also included syntax for single-line comments, which begin with `--` and run to the end of the line, and multi-line comments, which are surrounded by `/*` and `*/`.

In [None]:
%%sql
/*
 * This is a
 * multi-line comment.
 */
-- This is a single-line/inline comment. --
SELECT *
FROM Name
LIMIT 5;

<br/>

### 2. The `pandas` command `pd.read_sql`

This section describes how data scientists use SQL and `python` in practice, using the `pandas` command `pd.read_sql` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html)). **You will see both `%sql` magic and `pd.read_sql` in this course**.

With the connection string `imdb_mini_db`, we can call `pd.read_sql`, which takes in a `query` **string**. Note the `"""` used to define our multi-line string, which allows a query to span multiple lines. The resulting `DataFrame` `df` stores the results of the same SQL query from the previous section.

In [None]:
# Run this cell to see the demo.
query = """
SELECT *
FROM Name
LIMIT 5;
"""

df = pd.read_sql(query, imdb_mini_db)
df

#### `pd.read_sql` vs. `%sql` magic error messages

`pd.read_sql` has **long error messages**: Because the SQL query is now inside a Python string, the errors are harder to read. Consider the below (incorrect) query.

**Note**: Uncomment the below code and check out the error. You can uncomment/comment out multiple lines at the same time by selecting them and pressing ctrl + / or command + /. Make sure to comment out the erroring cell afterwards to prevent autograder issues!

In [None]:
# Uncomment the below code and check out the error.
# query = """
# SELECT *
# FROM Title;
# LIMIT 5
# """
# pd.read_sql(query, imdb_mini_db)

<br/>
<details>
<summary>Now that's an unruly error message! Can you see what's wrong in the cell above and correct the query? Toggle this cell to check your answer!</summary>
It has a semicolon in the wrong place!
</details>

<br/>

On the other hand, `%sql` magic gives more intelligible error messages, so we will use this format more often.

In [None]:
# %%sql
# -- Uncomment the code and check out the error. --
# SELECT *
# FROM Title;
# LIMIT 5

<br/><br/>
<hr style="border: 1px solid #fdb515;" />

# Part 1: The IMDb (mini) Dataset

Let's explore a miniature version of the [IMDb Dataset](https://www.imdb.com/interfaces/). This is the same dataset that we will use for the upcoming homework. We'll load it in using cell magic.

In [None]:
%%sql imdb_engine
SELECT * FROM sqlite_master WHERE type='table';

From running the above cell, we see the database has 4 tables: `Name`, `Role`, `Rating`, and `Title`.

<details open>
    <summary>[<b>Click to Expand</b>] See descriptions of each table's schema.</summary>
    
**`Name`** – Contains the following information for names of people.
    
- nconst (integer) - alphanumeric unique identifier of the name/person
- primaryName (text) – name by which the person is most often credited
- birthYear (text) – in YYYY format
- deathYear (text) – in YYYY format
    
    
**`Role`** – Contains the principal cast/crew for titles.
    
- tconst (integer) - alphanumeric unique identifier of the title
- ordering (text) – a number to uniquely identify rows for a given tconst
- nconst (integer) - alphanumeric unique identifier of the name/person
- category (text) - the category of job that person was in
- characters (text) - the name of the character played if applicable, else '\\N'
    
**`Rating`** – Contains the IMDb rating and votes information for titles.
    
- tconst (integer) - alphanumeric unique identifier of the title
- averageRating (text) – weighted average of all the individual user ratings
- numVotes (text) - number of votes (i.e., ratings) the title has received
    
**`Title`** - Contains the following information for titles.
    
- tconst (integer) - alphanumeric unique identifier of the title
- titleType (text) -  the type/format of the title
- primaryTitle (text) -  the more popular title / the title used by the filmmakers on promotional materials at the point of release
- isAdult (text) - 0: non-adult title; 1: adult title
- Year (text) – represents the release year of a title.
- runtimeMinutes (text)  – primary runtime of the title, in minutes
    
</details>

<br/><br/>

From the above descriptions, we can conclude the following:
* `Name.nconst` and `Title.tconst` are primary keys of the `Name` and `Title` tables, respectively.
* `Role.nconst` and `Role.tconst` are **foreign keys** that point to `Name.nconst` and `Title.tconst`, respectively.
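
As a rough illustration of how foreign keys tie the tables together, the sketch below rebuilds a tiny stand-in for `Name`, `Title`, and `Role` with Python's built-in `sqlite3` module and follows both foreign keys in one query. The rows (`Person A`, `Film X`, and so on) are made up, and only a few of the real columns are included.

```python
import sqlite3

# Miniature stand-in for the schema above; all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Name  (nconst INTEGER PRIMARY KEY, primaryName TEXT);
CREATE TABLE Title (tconst INTEGER PRIMARY KEY, primaryTitle TEXT);
CREATE TABLE Role  (
    tconst   INTEGER REFERENCES Title(tconst),  -- foreign key into Title
    nconst   INTEGER REFERENCES Name(nconst),   -- foreign key into Name
    category TEXT
);
INSERT INTO Name  VALUES (1, 'Person A'), (2, 'Person B');
INSERT INTO Title VALUES (10, 'Film X'), (20, 'Film Y');
INSERT INTO Role  VALUES (10, 1, 'actor'), (10, 2, 'director'), (20, 1, 'writer');
""")

# Each Role row is matched to exactly one Name row and one Title row.
query = """
SELECT Name.primaryName, Role.category, Title.primaryTitle
FROM Role
JOIN Name  ON Role.nconst = Name.nconst
JOIN Title ON Role.tconst = Title.tconst
ORDER BY Title.primaryTitle, Name.primaryName;
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because `nconst` and `tconst` are primary keys in their own tables, each `Role` row matches at most one person and one title, so the join never duplicates a role.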

<br/><br/>

---

## Question 1

What are the different kinds of `titleType`s included in the `Title` table? Write a query to find out all the unique `titleType`s of films using the `DISTINCT` keyword.  (**You may not use `GROUP BY`.**)

In [None]:
%%sql imdb_engine --save query_q1
...

In [None]:
# Run this cell for grading purposes. 
# No further action is required. 
sql_q1 = %sqlcmd snippets query_q1
res_q1 = pd.read_sql(sql_q1, imdb_mini_db)

In [None]:
grader.check("q1")

<br><br>

---

## Question 2

Before we proceed, we want to get a better picture of the kinds of jobs that exist. To do this, examine the `Role` table by computing the number of records with each job `category`. Present the results in descending order by the total counts (`total`).

The top of your table should look like this (however, you should have more rows):

| |category|total|
|-----|-----|-----|
|**0**|actor|21665|
|**1**|writer|13830|
|**2**|...|...|

In [None]:
%%sql imdb_engine --save query_q2
...

In [None]:
# Run this cell for grading purposes. 
# No further action is required. 
sql_q2 = %sqlcmd snippets query_q2
res_q2 = pd.read_sql(sql_q2, imdb_mini_db)

In [None]:
grader.check("q2")

<br/>
If we computed the results correctly, we should see a nice horizontal bar chart of the counts per category below:

In [None]:
# Run this cell to make a bar plot.
plt.barh(res_q2["category"], res_q2["total"])
plt.xlabel("Counts")
plt.ylabel("Job Category");

<br/><br/>

---

## Question 3

Now that we have a better sense of the basics of our data, we can ask some more interesting questions.

The `Rating` table has the `numVotes` and the `averageRating` for each title. Which 10 films have the most ratings?

Write a SQL query that outputs three fields: the `title`, `numVotes`, and `averageRating` for the 10 films that have the highest number of ratings.  Sort the result in descending order by the number of votes.

**Hint**: The `numVotes` in the `Rating` table is not an integer! Use `CAST(Rating.numVotes AS int) AS numVotes` to convert the attribute to an integer. 
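
To see why the cast matters, here is a small sketch on a hypothetical `votes` table (not part of the lab data), run with Python's built-in `sqlite3` module. When numbers are stored as text, they sort character by character, so `'9'` ranks above `'100'`.

```python
import sqlite3

# Hypothetical table whose count column `n` is stored as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE votes (item TEXT, n TEXT)")
conn.executemany(
    "INSERT INTO votes VALUES (?, ?)",
    [("a", "9"), ("b", "100"), ("c", "25")],
)

# Sorting the raw text compares strings character by character.
as_text = conn.execute("SELECT item FROM votes ORDER BY n DESC").fetchall()

# Casting first makes the sort numeric.
as_int = conn.execute(
    "SELECT item FROM votes ORDER BY CAST(n AS int) DESC"
).fetchall()

print(as_text)  # [('a',), ('c',), ('b',)]
print(as_int)   # [('b',), ('c',), ('a',)]
```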

In [None]:
%%sql imdb_engine --save query_q3
...

In [None]:
# Run this cell for grading purposes. 
# No further action is required. 
sql_q3 = %sqlcmd snippets query_q3
res_q3 = pd.read_sql(sql_q3, imdb_mini_db)

In [None]:
grader.check("q3")

<br/><br/>
<hr style="border: 1px solid #fdb515;" />

# Part 2: Election Donations in New York City

Finally, let's analyze the Federal Election Commission (FEC)'s public records. We connect to the database using cell magic so that we can flexibly explore the database.

In [None]:
# Run this cell to connect to the fec_nyc database
fec_nyc_db = "duckdb:///fec_nyc.db"
%sql duckdb:///fec_nyc.db --alias fec_engine

### Table Descriptions

Run the below cell to explore the **schemas** of all tables saved in the database.

If you'd like, you can consult the below linked FEC pages for the descriptions of the tables themselves.

* `cand` ([link](https://www.fec.gov/campaign-finance-data/candidate-summary-file-description/)): Candidates table. Contains names and party affiliation.
* `comm` ([link](https://www.fec.gov/campaign-finance-data/committee-summary-file-description/)): Committees table. Contains committee names and types.
* `indiv_sample_nyc` ([link](https://www.fec.gov/campaign-finance-data/contributions-individuals-file-description)): All individual contributions from New York City.

In [None]:
%%sql 
/* just run this cell */
SELECT * FROM sqlite_master WHERE type='table';

<br/><br/>

Let's look at the `indiv_sample_nyc` table. The below cell displays individual donations made by residents of New York City. We use `LIMIT 5` to avoid loading and displaying a huge table.

In [None]:
%%sql fec_engine
/* just run this cell */
SELECT *
FROM indiv_sample_nyc
LIMIT 5;

You can write a SQL query to return the id and name of the first five candidates from the Democratic party, as below:

In [None]:
%%sql fec_engine
/* just run this cell */
SELECT cand_id, cand_name
FROM cand
WHERE cand_pty_affiliation = 'DEM'
LIMIT 5;

<br/><br/>
<hr style="border: 1px solid #fdb515;" />

## [Tutorial] Matching Text with `LIKE`

First, let's look at 2016 election contributions made by Donald Trump, who was a New York (NY) resident during that year. The following SQL query returns the `cmte_id`, `transaction_amt`, and `name` for every contribution made by any donor with "DONALD" and "TRUMP" in their name in the `indiv_sample_nyc` table.

**Notes:**
* We use `WHERE ... LIKE '...'` to match fields with text patterns. The `%` wildcard matches zero or more characters. Compare this to what you know from regex!
* We use `pd.read_sql` syntax here because we will do some EDA on the result `res`.
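
As a quick sketch of how `%` behaves, the snippet below runs a few `LIKE` patterns against a hypothetical `donors` table using Python's built-in `sqlite3` module. The names are invented. Case sensitivity of `LIKE` can differ between database engines (SQLite's default is case-insensitive for ASCII letters), so the sketch uses upper-case text and patterns throughout.

```python
import sqlite3

# Hypothetical donor names, not taken from the FEC data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donors (name TEXT)")
conn.executemany(
    "INSERT INTO donors VALUES (?)",
    [("SMITH, ANNA",), ("SMITH",), ("BLACKSMITH, JO",), ("JONES, ANNA",)],
)

def names_like(pattern):
    """Return the names matching a LIKE pattern, sorted alphabetically."""
    cursor = conn.execute(
        "SELECT name FROM donors WHERE name LIKE ? ORDER BY name", (pattern,)
    )
    return [row[0] for row in cursor]

# '%SMITH%' matches SMITH anywhere, including with nothing before or after it.
contains = names_like("%SMITH%")
# 'SMITH%' requires the name to start with SMITH.
starts = names_like("SMITH%")

# Two LIKE conditions joined by AND must both hold.
both = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM donors "
        "WHERE name LIKE '%SMITH%' AND name LIKE '%ANNA%'"
    )
]

print(contains)  # ['BLACKSMITH, JO', 'SMITH', 'SMITH, ANNA']
print(starts)    # ['SMITH', 'SMITH, ANNA']
print(both)      # ['SMITH, ANNA']
```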

In [None]:
# Run this cell to see an example of LIKE.
example_query = """
SELECT 
    cmte_id,
    transaction_amt,
    name
FROM indiv_sample_nyc
WHERE name LIKE '%TRUMP%' AND name LIKE '%DONALD%';
"""

example_res = pd.read_sql(example_query, fec_nyc_db)
example_res

If we look at the list above, it appears that some donations were not by Donald Trump himself, but instead by an entity called "DONALD J TRUMP FOR PRESIDENT INC". Fortunately, we see that our query only seems to have picked up one such anomalous name.

In [None]:
# Run this cell to see the value counts for each donor name.
example_res['name'].value_counts()

<br/><br/>

---

## Question 4



In the cell below, revise the above query so that the 15 anomalous donations made by "DONALD J TRUMP FOR PRESIDENT INC" do not appear. Your resulting table should have 142 rows. 

**Hints:**
* Consider using the above query as a starting point, or checking out the SQL query skeleton at the top of this lab. 
* The `NOT` keyword may also be useful here.


In [None]:
%%sql fec_engine --save query_q4 
...

In [None]:
# Run this cell for grading purposes. 
# No further action is required. 
sql_q4 = %sqlcmd snippets query_q4
res_q4 = pd.read_sql(sql_q4, fec_nyc_db)

In [None]:
# Print the number of rows in your query
# Double check that this equals 142 
res_q4.shape[0]

In [None]:
grader.check("q4")

<br/><br/>

---

## Question 5: `JOIN`ing Tables

Let's explore the other two tables in our database: `cand` and `comm`.

The `cand` table contains summary financial information about each candidate registered with the FEC or appearing on an official state ballot for House, Senate or President.

In [None]:
%%sql fec_engine
/* just run this cell */
SELECT *
FROM cand
LIMIT 5;

The `comm` table contains summary financial information about each committee registered with the FEC. Committees are organizations that spend money for political action or parties, or spend money for or against political candidates.

In [None]:
%%sql fec_engine
/* just run this cell */
SELECT *
FROM comm
LIMIT 5;

<br><br>

---

### Question 5a

Notice that both the `cand` and `comm` tables have a `cand_id` column. Let's try joining these two tables on this column to print out committee information for candidates.

List the first 5 candidate names (`cand_name`) in reverse lexicographic order (i.e., reverse alphabetical order) by `cand_name`, along with their corresponding committee names. **Only select rows that have a matching `cand_id` in both tables.**

Your output should look similar to the following:

|cand_name|cmte_nm|
|----|----|
|ZUTLER, DANIEL PAUL MR|CITIZENS TO ELECT DANIEL P ZUTLER FOR PRESIDENT|
|ZUMWALT, JAMES|ZUMWALT FOR CONGRESS|
|...|...|

Consider starting from the following query skeleton, which uses the `AS` keyword to rename the `cand` and `comm` tables to `c1` and `c2`, respectively.
Which join is most appropriate?

    SELECT ...
    FROM cand AS c1
        [INNER | {LEFT | RIGHT | FULL} {OUTER}] JOIN comm AS c2
        ON ...
    ...
    ...;


In [None]:
%%sql fec_engine --save query_q5a 
...

In [None]:
# Run this cell for grading purposes. 
# No further action is required. 
sql_q5a = %sqlcmd snippets query_q5a
res_q5a = pd.read_sql(sql_q5a, fec_nyc_db)

In [None]:
grader.check("q5a")

<br/><br/>

---

### Question 5b

Suppose we modify the query from the previous part to include *all* candidates, **including those that don't have a committee.**


List the first 5 candidate names (`cand_name`) in reverse lexicographic order by `cand_name`, along with their corresponding committee names. If the candidate has no committee in the `comm` table, then `cmte_nm` should be NULL (or `None` in the `python` representation).

Your output should look similar to the following:

|cand_name|cmte_nm|
|----|----|
|ZUTLER, DANIEL PAUL MR|CITIZENS TO ELECT DANIEL P ZUTLER FOR PRESIDENT|
|...|...|
|ZORNOW, TODD MR|None|

**Hint**: Start from the same query skeleton as the previous part. 
Which join is most appropriate?

In [None]:
%%sql fec_engine --save query_q5b
...

In [None]:
# Run this cell for grading purposes. 
# No further action is required. 
sql_q5b = %sqlcmd snippets query_q5b
res_q5b = pd.read_sql(sql_q5b, fec_nyc_db)

In [None]:
grader.check("q5b")

<br/><br/>

---

## Question 6: Subqueries and Grouping

If we return to our results from Question 4, we see that many of the contributions were to the same committee:

In [None]:
# Your SQL query result from Question 4
# Reprinted for your convenience
res_q4['cmte_id'].value_counts()

<br><br>

For this question, create a new SQL query that returns the total amount that Donald Trump contributed to each committee.

Your table should have four columns: `cmte_id`, `total_amount` (total amount contributed to that committee), `num_donations` (total number of donations), and `cmte_nm` (name of the committee). Your table should be sorted in **decreasing order** of `total_amount`.

Your output should look similar to the following:

|cmte_id|total_amount|num_donations|cmte_nm|
|----|----|----|----|
|C00580100|18633157|131|DONALD J. TRUMP FOR PRESIDENT, INC.|
|C00055582|10000|1|NY REPUBLICAN FEDERAL CAMPAIGN COMMITTEE|
|...|...|...|...|

**This is a hard question!** Don't be afraid to reference the lecture slides, or the overall SQL query skeleton at the top of this lab.

**Hints:**

* Note that committee names are not available in `indiv_sample_nyc`, so you will have to obtain information somehow from the `comm` table (perhaps a `JOIN` would be useful).
* Remember that you can compute summary statistics after grouping by using aggregates like `COUNT(*)`, `SUM()` as output fields.
* A **subquery** may be useful (but is not required) to break the question down into subparts. Consider the following query skeleton, which uses the `WITH` clause to store a subquery's results in a temporary table named `donations`.

        WITH donations AS (
            SELECT ...
            FROM ...
            ... JOIN ...
                ON ...
            WHERE ...
        )
        SELECT ...
        FROM donations
        GROUP BY ...
        ORDER BY ...;
  
**Note**: The video walkthrough solution may not be fully correct here. Remember that when using `GROUP BY`, all columns in the `SELECT` statement must either be present in the `GROUP BY` clause or be used in an aggregate function. For example:

        SELECT titleType, SUM(runtimeMinutes), Year
        FROM Title
        GROUP BY titleType;

Here, this query violates the rule because `Year` is included in the SELECT statement without being either part of an aggregate function or listed in the `GROUP BY` clause.
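
One way to see the two usual fixes is the sketch below, which uses Python's built-in `sqlite3` module and a hypothetical four-row `Title` table with integer columns (an assumption; the lab's table stores these as text). SQLite happens to tolerate bare columns like `Year` in grouped queries, so the sketch only runs the two valid forms rather than relying on an error message.

```python
import sqlite3

# Hypothetical miniature Title table; rows and integer types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Title (titleType TEXT, runtimeMinutes INTEGER, Year INTEGER)"
)
conn.executemany(
    "INSERT INTO Title VALUES (?, ?, ?)",
    [
        ("movie", 120, 2000),
        ("movie", 90, 2001),
        ("short", 10, 2000),
        ("short", 15, 2000),
    ],
)

# Fix 1: add Year to GROUP BY, giving one row per (titleType, Year) pair.
by_type_and_year = conn.execute("""
SELECT titleType, Year, SUM(runtimeMinutes) AS total_runtime
FROM Title
GROUP BY titleType, Year
ORDER BY titleType, Year;
""").fetchall()

# Fix 2: keep one row per titleType and wrap Year in an aggregate.
by_type = conn.execute("""
SELECT titleType, SUM(runtimeMinutes) AS total_runtime, MIN(Year) AS first_year
FROM Title
GROUP BY titleType
ORDER BY titleType;
""").fetchall()

print(by_type_and_year)  # [('movie', 2000, 120), ('movie', 2001, 90), ('short', 2000, 25)]
print(by_type)           # [('movie', 210, 2000), ('short', 25, 2000)]
```

Which fix is appropriate depends on the question being asked: the first changes the granularity of the result, while the second keeps one row per group and summarizes the extra column.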


In [None]:
%%sql fec_engine --save query_q6
...

In [None]:
# Run this cell for grading purposes. 
# No further action is required. 
sql_q6 = %sqlcmd snippets query_q6
res_q6 = pd.read_sql(sql_q6, fec_nyc_db)

In [None]:
grader.check("q6")

<br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Allie congratulates you for finishing Lab 11!

<center><video controls src = "images/allie.MOV" width = "250">animation</video></center>

<img src='images/allie_2.jpg' width="250px" /> <img src='images/allie_1.jpg' width="200px" /> 

### Course Content Feedback

If you have any feedback about this assignment or about any of our other weekly assignments, lectures, or discussions, please fill out the [Course Content Feedback Form](https://forms.gle/owfPCGgnrju1xQEA9). Your input is valuable in helping us improve the quality and relevance of our content to better meet your needs and expectations!

### Submission Instructions

Below, you will see a cell. Running this cell will automatically generate a zip file with your autograded answers. Submit this file to the Lab 11 assignment on Gradescope. If you run into any issues when running this cell, feel free to check this [section](https://ds100.org/debugging-guide/autograder_gradescope/autograder_gradescope.html#why-does-grader.exportrun_teststrue-fail-if-all-previous-tests-passed) in the Data 100 Debugging Guide.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)