In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab09.ipynb")

# Lab 09: Random Variables and Inference & SQL
For the first half of this lab, you will:

1. Explore properties of random variables using the example of a binomial distribution.
1. Verify the Central Limit Theorem (CLT) using simulations. 
1. Examine if the "sample maximum" is a biased estimator for the true maximum of a population.
1. Perform inference for the population correlation of the tips dataset.

For the second half of the lab, you will practice viewing, sorting, grouping, and merging tables with SQL. We will explore two datasets:
1. A "minified" version of the [Internet Movie Database](https://www.imdb.com/interfaces/) (IMDb). This SQLite database (~10MB) is a tiny sample of the much larger database (more than a few GBs). As a result, our queries may return wildly different results than they would on the full database!

1. The money donated during the 2016 election using the [Federal Election Commission (FEC)'s public records](https://www.fec.gov/data/). You will be connecting to a SQLite database containing the data. The data we will be working with in this lab is relatively small (~106MB); however, it is a sample taken from a much larger database (more than a few GBs).

To receive credit for a lab, answer all questions correctly and submit before the deadline.

You must submit this assignment to Pensieve by the on-time deadline, **Thursday, July 31, 11:59 PM PT**. Please read the syllabus for the Slip Day policy. As a reminder, slip days are **not** applicable on labs. **We strongly encourage you to plan to submit your work to Pensieve several hours before the stated deadline.** This way, you will have ample time to contact staff for submission support. 

### Lab Walk-Through
In addition to the lab notebook, we have also released a prerecorded walk-through video of the lab. We encourage you to reference this video as you work through the lab. Run the cell below to display the video.

**Note**: The walkthrough video is partially recorded from Spring 2023. There may be slight inconsistencies between the version you are viewing and the version used in the recording, but the content is identical.

**Playlist link**: click [**here**](https://www.youtube.com/watch?v=_K7OvmRbb5w&list=PLQCcNQgUcDfoJvVQxlWZC2vJfj1IKzWwR) for all the parts of the lab.

For the first half:

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("_K7OvmRbb5w", list = 'PLQCcNQgUcDfoJvVQxlWZC2vJfj1IKzWwR', listType = 'playlist')

For the second half:

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("uQ3E4pejmD8", list = 'PLQCcNQgUcDfpdBnhS-lPq8LPas48tkMgp', listType = 'playlist')

### Discussion 10 & 11 Mini-lecture

In Data 100, discussions will not reserve time to host mini-lectures. Instead, we will release a set of pre-recorded mini-lectures that supplement the concepts introduced in lecture. For labs released on Tuesday (or earlier), the content of the mini-lectures will be covered in Wednesday discussions. For labs released on Thursday (or earlier), the content of the mini-lectures will be covered in Monday discussions. Thus, it is important to watch these mini-lectures **before** attending the discussion section you are assigned to. Discussion 10 mini-lecture is attached [here](https://www.youtube.com/watch?v=bp7-OwxdGwg&list=PLQCcNQgUcDfoUXRtrHc9TUx2pBYNfToVN). Discussion 11 mini-lecture is attached [here](https://www.youtube.com/watch?v=bp7-OwxdGwg&list=PLQCcNQgUcDfoUXRtrHc9TUx2pBYNfToVN).

### Collaboration Policy
Data science is a collaborative activity. While you may talk with others about this assignment, we ask that you **write your solutions individually**. If you discuss the assignment with others, please **include their names** in the cell below.

**Collaborators:** *list names here*

---
### Debugging Guide

If you run into any technical issues, we highly recommend checking out the [Data 100 Debugging Guide](https://ds100.org/debugging-guide/). In this guide, you can find general questions about Jupyter notebooks / Datahub, Pensieve, and common `pandas`, RegEx, and visualization errors.

In [None]:
# Run this cell to set up your notebook
import csv
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats
import seaborn as sns
from IPython.display import FileLink, FileLinks
FileLink('path_to_file/filename.extension')
%matplotlib inline
sns.set()
sns.set_context("talk")
import warnings
warnings.filterwarnings('ignore')

np.random.seed(2023) # Do not change this line; this sets the pseudorandomness of the autograder.

from IPython.display import display, Latex, Markdown

<br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Part 1: Random Variables and Inferences
---

## Question 1: Probability with Binomial Random Variables

---

### Question 1a: Loading the Data

The Berkeley Half Marathon is an annual weekend-long race here in Berkeley. 

We want to understand how many participants in this year’s race also participated in the previous year's race. To accomplish this, we collect a sample of this year’s participants.

Let's first assume that we have access to the official data so we can simulate the potential result we might get (**in practice we don't!**). The dataset `marathon.csv` includes information for **all racers** who registered for the Berkeley Half Marathon. In other words, the dataset represents our **full population**.

* The `Bib Number` of each participant (i.e., racer) is in order of registration — integers from $1$ to the total unknown number of participants. 
* The column `Race Type` denotes the type of race a participant is in.
* The column `Experienced` denotes if a participant participated in the race in the previous year.
* The column `Dog Lover` denotes if a participant is a dog lover.

Load the dataset `marathon.csv` into the `DataFrame` `marathon` and assign `true_prop` to the true proportion of experienced racers. 

**Hint:** The **true proportion** of experienced racers is the proportion of experienced racers in the *population*. We term numerical functions of the population, such as the true proportion of experienced racers, as **population parameters**.

In [None]:
marathon = ...
display(marathon.head())
true_prop = ...
print(f"The true proportion of experienced racers in the population of size {len(marathon)} is {np.round(true_prop, 4)}") 

In [None]:
grader.check("q1a")

Suppose that you have access to the official roster and are able to collect a Simple Random Sample (SRS) of 100 racers. You decided to use the proportion of experienced racers in this smaller *sample* as an **estimate** of the true proportion of experienced racers in the full *population*. Let's denote this true proportion as $p$.


How would a sample proportion compare to the true proportion? Suppose we take a simple random sample of size $n$. For an individual $i \in \{1, 2, \dots, n\}$ in our sample, we define $X_i$ to be a random variable indicating if individual $i$ is experienced or not. That is, if individual $i$ is experienced, $X_i = 1$, otherwise, $X_i = 0$. Then we can define the sample proportion as the fraction of experienced racers in the sample. The sample proportion $\hat{p}$ is therefore also the mean of the sample.

$$\hat{p} = \text{sample proportion} = \frac{1}{n}\sum_{i=1}^{\text{n}} X_i$$

Note that **sample proportion** is a numerical function of the sample, so it is also a (sample) statistic. As a reminder, sample statistics are random variables, due to the randomness of the samples. 


For the remainder of the lab, we will assume that the true population is large enough that we can treat the sample as **a random sample with replacement.** Under this assumption, the $X_i$’s are i.i.d. (independent and identically distributed). Each $X_i$ follows a Bernoulli distribution with probability $p$ that a racer is experienced (i.e. $P(X_i = 1) = p$). Then, the sample proportion $\hat{p}$ is a scaled Binomial random variable with expectation $\mathbb{E}[\hat{p}] = p$ and variance $\text{Var}(\hat{p}) = \frac{p(1-p)}{n}$.

As a reminder, we can show that $\mathbb{E}[\hat{p}] = p$ via linearity of expectation:

$$
\mathbb{E}[\hat{p}] = \mathbb{E}[\frac{1}{n}\sum_{i=1}^{\text{n}} X_i] = \frac{1}{n}\sum_{i=1}^{\text{n}} \mathbb{E}[X_i] = \frac{1}{n}\sum_{i=1}^{\text{n}} p = \frac{np}{n} = p
$$

and $\text{Var}(\hat{p}) = \frac{p(1-p)}{n}$ via the scaling property $\text{Var}(aX) = a^2\text{Var}(X)$ and the additivity of variance for independent random variables:

$$
\text{Var}(\hat{p}) = \text{Var}(\frac{1}{n}\sum_{i=1}^{\text{n}} X_i) = \frac{1}{n^2}\sum_{i=1}^{\text{n}} \text{Var}(X_i) = \frac{1}{n^2}\sum_{i=1}^{\text{n}} p(1-p) = \frac{p(1-p)}{n}
$$

In the remainder of this question, let's confirm these statistics through simulation.
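
Before working with `marathon`, here is a minimal sketch of the same check on a made-up population. The values `p_demo`, `n_demo`, and `num_sims` are hypothetical and chosen only for illustration. The sketch uses a local `np.random.default_rng` generator, so it does not touch the global seed set at the top of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; leaves np.random.seed untouched

p_demo = 0.3      # hypothetical true proportion
n_demo = 100      # size of each simulated sample
num_sims = 5000   # number of simulated samples

# Each draw counts the successes among n_demo i.i.d. Bernoulli(p_demo) trials;
# dividing by n_demo turns the count into a sample proportion.
p_hats = rng.binomial(n_demo, p_demo, size=num_sims) / n_demo

theory_var = p_demo * (1 - p_demo) / n_demo
print(f"[Mean]     Simulated: {p_hats.mean():.5f}   Theoretical: {p_demo:.5f}")
print(f"[Variance] Simulated: {p_hats.var():.5f}   Theoretical: {theory_var:.5f}")
```

The simulated values should land close to, but not exactly on, the theoretical ones, since they are themselves computed from random samples.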

<br><br>

---

### Question 1b: Expected Proportion

The expressions above give us the expectation and variance for the proportion of experienced racers if we apply probability theory. Do these results hold true if we actually simulate the proportion of experienced racers on random samples?

Run 5000 independent simulations to compute the **proportion of experienced racers** in simulated samples of size $n = 100$, each generated uniformly at random from the true population `marathon`. You may assume that the true population is large enough such that the sample is a random sample with replacement. Assign `samples` to an array with 5000 elements, each of which is the proportion of experienced racers in that simulated sample. Also, assign `simulated_mean` and `simulated_var` to the mean and variance of the simulated proportions, respectively.

Useful function: `df.sample` ([link](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html))

In [None]:
samples = ...
simulated_mean = ...
simulated_var = ...

print(f"[Mean]     Simulated: {simulated_mean:.5f}   Theoretical: {true_prop:.5f}")
print(f"[Variance] Simulated: {simulated_var:.5f}   Theoretical: {true_prop*(1-true_prop)/100:.5f}")

In [None]:
grader.check("q1b")

<br>

---

What is a better way to support racers than passing out dog photos? :-) You decide to take a sample of size $n = 100$, where each racer will receive 1 dog photo if they are an experienced racer, 3 dog photos if they love dogs, and 4 dog photos if they are both an experienced racer and love dogs. What is the expected number of photos you need to print? 

Again, assume that the true population is large enough such that the sample is a random sample with replacement to simplify the problem, and that whether a racer loves dogs and whether the racer participated in the previous year are independent. Let $D$ be the number of dog photos that need to be printed. Here, we picked $D$ to refer to dog photos. More generally, when picking letters for random variables, it is good practice to try and pick something informative and unambiguous. 

We can then find the **expected number of photos, $\mathbb{E}(D)$** as follows: 
$$\mathbb{E}(D) = \large 100 \cdot p + 100 \cdot 3 \cdot q,$$ 
where $p$ is the true proportion of experienced racers and $q$ is the true proportion of dog lovers. This result follows from the linearity of expectation:

$$\mathbb{E}[aX+bY] = a \mathbb{E}[X] + b\mathbb{E}[Y].$$

The variance of the number of photos is:
$$\text{Var} (D) = \large 100 \cdot p (1- p)+ 100 \cdot 3^2 \cdot q (1- q),$$
which follows from the properties of variance and the assumed independence of the two traits (so the covariance term is zero):

$$\text{Var}(aX+bY) = a^2\text{Var}(X) + b^2\text{Var}(Y) + 2 \cdot a \cdot b \text{Cov}(X, Y) = a^2\text{Var}(X) + b^2\text{Var}(Y).$$

See the video walkthrough for a full derivation of these results.
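
As a quick sanity check of the two formulas, the sketch below simulates the photo count on a made-up population in which the two traits are generated independently. The proportions `p_demo` and `q_demo` are hypothetical and are not the values in `marathon`.

```python
import numpy as np

rng = np.random.default_rng(1)  # local generator; leaves np.random.seed untouched

p_demo, q_demo = 0.3, 0.6       # hypothetical proportions of experienced racers / dog lovers
n_demo, num_sims = 100, 5000

# One row per simulated sample, one column per racer in that sample.
experienced = rng.binomial(1, p_demo, size=(num_sims, n_demo))
dog_lover = rng.binomial(1, q_demo, size=(num_sims, n_demo))

# 1 photo for experience plus 3 photos for loving dogs (so 4 for both).
photos = (experienced + 3 * dog_lover).sum(axis=1)

theory_mean = n_demo * (p_demo + 3 * q_demo)
theory_var = n_demo * (p_demo * (1 - p_demo) + 3**2 * q_demo * (1 - q_demo))
print(f"[Mean]     Simulated: {photos.mean():.3f}   Theoretical: {theory_mean:.3f}")
print(f"[Variance] Simulated: {photos.var():.3f}   Theoretical: {theory_var:.3f}")
```

If the two traits were generated with some dependence, the mean formula would still hold, but the variance formula would need the covariance term.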

<br>

---

### Question 1c: Expectation and Variance of Linear Combinations of Random Variables

Confirm this result through simulation. Run 5000 independent simulations, where each simulation finds the number of **photos** needed for a sample of size $n = 100$ racers, selected uniformly at random (with replacement) from the true population `marathon`.  Assign `photo_samples` to an array with 5000 elements, each of which is the number of **photos** needed for the simulated sample. 

**Note:** We have computed `prop_dog_lover`, the true proportion of dog lover racers for you so that you can verify that your simulated statistics match the theoretical statistics.

In [None]:
prop_dog_lover = np.mean(marathon["Dog Lover"])

photo_samples = []
...

simulated_photos_mean = np.mean(photo_samples)
simulated_photos_var = np.var(photo_samples)


print(f"[Mean]     Simulated: {simulated_photos_mean:.5f}    Theoretical: {(true_prop + 3*prop_dog_lover)*100:.5f}")
print(f"[Variance] Simulated: {simulated_photos_var:.5f}   Theoretical: {100*true_prop*(1-true_prop) + 900*prop_dog_lover*(1-prop_dog_lover):.5f}")

In [None]:
grader.check("q1c")

<br>
<hr style="border: 1px solid #fdb515;" />

## Question 2: Central Limit Theorem

The Central Limit Theorem states that the distribution of the sample mean will converge to a normal distribution as the sample size ($n$) goes to infinity. That means that if we drew many sufficiently large samples from the population, calculated the proportion of experienced racers (which is a sample mean) for each sample, and viewed a histogram of the proportions, the histogram would look approximately normal!

Let's see this in action!
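
One rough way to see the convergence numerically, before plotting anything, is to track the skewness of the simulated proportions: a normal distribution has skewness 0. The sketch below assumes a hypothetical proportion `p_demo = 0.1`, chosen because a small proportion makes the asymmetry at small sample sizes easy to see.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # local generator; leaves np.random.seed untouched

p_demo = 0.1      # hypothetical true proportion
num_sims = 5000

skewness = {}
for n_demo in (10, 100, 1000):
    # Sample proportions from num_sims samples of size n_demo
    p_hats = rng.binomial(n_demo, p_demo, size=num_sims) / n_demo
    skewness[n_demo] = stats.skew(p_hats)
    print(f"n = {n_demo:4d}   skewness of simulated proportions: {skewness[n_demo]:.3f}")
```

Skewness is only one summary of shape, so this is a sanity check rather than a proof of normality.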

<br>

---

### Question 2a
Complete the function `simulate`. The function `simulate` takes in one argument (`sample_size`: the size of the sample) and returns a list of length 5000 where each element is the proportion of experienced racers in a random sample of size `sample_size`. 

Then, use `simulate` to run 5000 independent simulations for each of the sample sizes 100, 500, and 1000, where each simulation finds the proportion of **experienced racers** in a sample selected uniformly at random from the true population `marathon`. You may assume that the true population is large enough such that the sample is a random sample with replacement (note that in reality, our population is finite, with about 50k racers; this approximation becomes less accurate as the sample size grows). You should assign `samples100`, `samples500`, and `samples1000` each to arrays of 5000 elements with proportions of experienced racers of sample sizes **100**, **500**, and **1000**, respectively.

In [None]:
def simulate(sample_size):
    ...

samples100 = ...
samples500 = ...
samples1000 = ...

In [None]:
grader.check("q2a")

<br>

---

### Question 2b

Recall that if a random variable follows a normal distribution with mean $\mu$ and variance $\sigma^2$, then its Probability Density Function (pdf) is
$$\large
f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2  \sigma ^2} \right)
$$

Complete the function `gaussian`, which returns the pdf of a normal distribution with mean `mu` and variance `var`, computed at values `x`. Pay attention to the order of operations and add parentheses accordingly: the solution presented in the lab walkthrough video is missing a pair, so you will need to modify the solution slightly!

**Hint:** This is very similar to `gaussian_kernel` from Lab 04!

**Note:** The solution differs slightly from the walkthrough. You will need to add parentheses around `2 * var` for the tests to pass.

**Another Note:** Use the `numpy` (instead of `math`) library to access common mathematical constants, such as $e$ and $\pi$.
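
To make the parentheses issue concrete, here is a small sketch with made-up numbers (`var_demo` and `dev_demo` are hypothetical). It also computes reference densities with `scipy.stats.norm.pdf`, which you could compare against your own `gaussian` output once it is written. Note that `scale` in `scipy.stats.norm.pdf` is a standard deviation, not a variance.

```python
import numpy as np
from scipy import stats

var_demo = 4.0   # hypothetical variance
dev_demo = 3.0   # hypothetical deviation (x - mu)

# `/` and `*` have equal precedence and are evaluated left to right.
no_parens = dev_demo**2 / 2 * var_demo      # (9 / 2) * 4 = 18.0
with_parens = dev_demo**2 / (2 * var_demo)  # 9 / 8 = 1.125
print(no_parens, with_parens)

# Reference densities of the standard normal at -1, 0, and 1.
# `scale` is the standard deviation, so a variance must be square-rooted first.
reference = stats.norm.pdf(np.array([-1, 0, 1]), loc=0, scale=np.sqrt(1.0))
print(reference)
```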

In [None]:
def gaussian(mu, var, x):
    """
    Compute the Gaussian density at value x.

    Args:
        mu: the mean/center of the Gaussian distribution.
        var: variance of the Gaussian distribution.
        x: observation.

    Returns:
        The density at value x.
    """    
    ...

gaussian(0, 1, np.array([-1, 0, 1]))

In [None]:
grader.check("q2b")

<br>

---

### Question 2c

We are ready to demonstrate the Central Limit Theorem visually by comparing simulated distributions of sample means to the normal distribution. We have provided the skeleton code of an interactive plot. Fill in the blanks below using the `simulate` and `gaussian` functions from previous parts of this question.

Then, in the cell below, describe the mean and spread of the sampling distribution and how they change as you increase the value of `sample_size`.

In [None]:
from ipywidgets import interact

def f(sample_size):
    plt.figure(figsize=(10, 5))
    # Generate the simulated proportions
    sim_samples = ...
    # Make a histogram plot of the simulated proportions. Set density to True and edgecolor to "none"
    ...
    x = np.linspace(0, 1, 1001)
    # We provided the mean and variance for you. If you are interested in knowing how to calculate these, take Data 140!
    mean = true_prop
    var = true_prop*(1-true_prop)/sample_size
    # Compute the pdf of the normal distribution of mean `mean` and variance `var` at locations x
    y = ...
    plt.plot(x, y, linewidth=1)
    plt.xlim(0, 0.6);
    plt.ylim(0, 35);
    plt.show()
interact(f, sample_size=(10, 1000, 10));

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br>

<hr style="border: 1px solid #fdb515;" />

## Question 3: Estimator for Population Max


Now suppose that we do not have access to the official roster; instead, we only have one sample. Without the official roster, we do not know the population and therefore do not know the total number of racers. However, we still want to estimate the total number of racers given an observed sample so we can prepare a dog photo for everyone. That is, we want to find an estimator for the **population maximum**.

Recall that the `Bib Number` of each participant (i.e., racer) is in order of registration—integers from $1$ to the total unknown number of participants. You decide to construct a sample by recording the bib number of every racer you see on the street in a given time period and use the maximum bib number in your sample as an estimator for the true maximum bib number (that is, the total number of participants, assuming everyone who registered participated). Assume that a racer's bib number has no relation to their racing experience so that you are equally likely to see any of the bib numbers in your sample.

**Is the sample maximum a good estimator for the population maximum?** We'll use simulation to explore the answer to this question in this part of the lab.

<br>

---

### Question 3a

Let's first assume that we have access to the total number of participants (again, in practice we don't!). Find the **true population maximum** and assign it to `true_max`.

In [None]:
true_max = ...
true_max

In [None]:
grader.check("q3a")

In [None]:
# Run this cell to see the summary statistics of Bib Number; no further action is needed.
marathon.describe()

You can use the above output to quickly check and see if the value you assigned to `true_max` aligns with what you find in the dataset.

<br>

---

### Question 3b

How would a sample maximum compare to the true maximum? Suppose we draw a sample of size $n$ with replacement from the population. We denote the bib number of individual $i$ in the sample as $B_i$. We will have $n$ i.i.d. random variables: $B_1, B_2, \dots, B_n$. Define the **sample max** as the maximum value of the sample.

$$\text{sample max} = \max (B_1, \dots, B_n)$$


Recall from [Data 8](https://inferentialthinking.com/chapters/10/3/Empirical_Distribution_of_a_Statistic.html) that we can get the empirical distribution of a statistic by **simulating**, or repeatedly sampling from the population.
Suppose we compute the sample max as the **maximum bib number from observing the bib numbers of $n = 200$ random racers**. By repeating this process for many randomly selected samples, we get a simulated distribution of the sample max statistic.

Assign `sample_maxes` to a list that contains 5000 simulated sample maxes from samples of size $n = 200$, each sampled randomly **with replacement** from the population `marathon`. (Side note: We sample with replacement because, while it means that we could see the same racer multiple times in our sample, it allows us to assume each individual in our sample is drawn i.i.d. from the population.)

Some useful functions: `df.sample` ([link](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html)), `np.random.choice` ([link](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html)). 
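
To illustrate the simulation idea on a toy problem, the sketch below uses a made-up population of bib numbers $1, \dots, 1000$ and samples of size 20 (both hypothetical, not the values in this lab). Because $P(\max \le k) = (k/N)^n$ when sampling with replacement, the exact expectation of the sample max can be computed and compared against the simulated average.

```python
import numpy as np

rng = np.random.default_rng(3)  # local generator; leaves np.random.seed untouched

N_demo = 1000     # hypothetical population size (bibs 1..N_demo)
n_demo = 20       # hypothetical sample size
num_sims = 5000

# Each row is one sample of bib numbers drawn uniformly with replacement.
bibs = rng.integers(1, N_demo + 1, size=(num_sims, n_demo))
sample_maxes_demo = bibs.max(axis=1)

# E[max] = sum_{k=0}^{N-1} P(max > k) = N - sum_{k=0}^{N-1} (k / N)^n
k = np.arange(N_demo)
expected_max = N_demo - np.sum((k / N_demo) ** n_demo)

print(f"Average simulated sample max: {sample_maxes_demo.mean():.2f}")
print(f"Exact expected sample max:    {expected_max:.2f}")
print(f"Population max:               {N_demo}")
```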



In [None]:
sample_maxes = ...
for i in range(5000):
    sample = ...
    ...

In [None]:
grader.check("q3b")

<!-- BEGIN QUESTION -->

<br>

---


### Question 3c

Plot the empirical distribution of the sample maximum that you generated in Question 3b. Your plot should look like the below plot. It should include both the average sample maximum and the true population maximum as vertical lines.

<img src='images/sample_max_dist.png' width="600px" />

Visualization/plotting tips:
* To plot a vertical line with specific linestyles, see the `plt.axvline` [documentation](https://matplotlib.org/3.5.1/api/_as_gen/matplotlib.pyplot.axvline.html).
* To include a label in the legend, pass in `label=...` to the plot that you'd like to label ([example](https://matplotlib.org/3.5.1/gallery/pyplots/axline.html#sphx-glr-gallery-pyplots-axline-py)).

**Note:** Your graph does not need to look exactly like the one above, but it should be similar.

In [None]:
plt.figure(figsize = [10, 6])
bins = np.linspace(49000, 50750, 25) # For your plot

avg_sample_maxes = ...
...

plt.legend();     # Show legend

<!-- END QUESTION -->

<br>

---

### Question 3d

Recall from Lecture 16 that an **unbiased estimator** is one where the expected value of the estimator is the parameter. For example, the sample mean $\bar{X}_n$ is an unbiased estimator of the population mean $\mu$ because $\mathbb{E}[\bar{X}_n] = \mu$ by linearity of expectation.

Based on your analysis in Question 3c, assign `q3d` to the most correct option out of the following; then in the second cell, **explain your choice.**

1. The sample maximum is an unbiased estimator of the population maximum.
1. The sample maximum overestimates the population maximum.
1. The sample maximum underestimates the population maximum.


In [None]:
q3d = ...

In [None]:
grader.check("q3d")

**Explain your choice here**:

<br>

<hr style="border: 1px solid #fdb515;" />

## Question 4: Inference for the Population Correlation


Previously in this lab, we explored some properties of random variables and identified a biased estimator. For these parts, we assumed we had access to the population data. We simulated samples from the **true population** and calculated the mean and variance of the resulting sample statistics. In practice, however, we only have access to one sample (and therefore one value of our estimator); we will explore this next.


We define **population correlation** as the expected product of *standardized* deviations from expectation: 

$$r(X, Y) =  \mathbb{E} \left[\left(\frac{X - \mathbb{E}[X]}{\text{SD}(X)} \right) \left(\frac{Y - \mathbb{E}[Y]}{\text{SD}(Y)}\right)\right]$$

Note that population correlation involves the population means $\mathbb{E}[X]$ and $\mathbb{E}[Y]$ and the population standard deviations $\text{SD}(X)$ and $\text{SD}(Y)$. Correlation provides us with important information about the linear relationship between variables.

In this part, we'll explore the `tips` dataset once more, and we will compute the sample correlation statistic of two features: **total bill** and **party size**. We will then explore how the sample correlation estimates the true population correlation parameter.

The below cell assigns `data` to our single sample collected about customer tipping behaviors.

In [None]:
# Run this cell to load tips data
tips = sns.load_dataset("tips")
data = tips[['total_bill','size']]
data

<br>

---

### Question 4a
To estimate the population correlation, we'd like to use an estimator based on data from a simple random sample of our tips data set. For a sample $(X_1, Y_1), \dots, (X_n, Y_n)$ generated IID from a population,  define the **sample correlation** as follows:

$$\frac{\sum\limits_{i=1}^n\left(X_i-\overline{X}\right)\left(Y_i-\overline{Y}\right)}{\sqrt{\sum\limits_{i=1}^n \left(X_i - \overline{X}\right)^2}\sqrt{\sum\limits_{i=1}^n \left(Y_i - \overline{Y}\right)^2}}$$

In this formula, the numerator is proportional to the sample covariance between the two variables (the $\frac{1}{n}$ factors that would appear in the numerator and denominator cancel), while the denominator normalizes this value, yielding a coefficient between -1 and 1 that represents the strength and direction of the linear relationship in the sample data.

Note the similar structure to the true population correlation. If the $i$-th individual in our sample has "total bill" $X_i$ and "party size" $Y_i$, then $\overline{X}, \overline{Y}$ are the sample means of total bill and party size, respectively.

Implement the `sample_correlation` function in the cell below to compute the sample correlation for `sample`, which has two columns: `total_bill` and `size`.

**Hint:** Remember that Python evaluates mathematical expressions the same way a mathematician would. For example, $A / B * C$ is interpreted as $(A / B) * C$, not $A / (B * C)$.
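
One way to sanity-check a hand-written correlation is to compare it with `np.corrcoef`. The sketch below does this on synthetic data (`x_demo` and `y_demo` are made up, not the tips data) and also shows that dividing the covariance by the product of standard deviations gives the same value, as long as both use the same normalization (`ddof=0` here).

```python
import numpy as np

rng = np.random.default_rng(4)  # local generator; leaves np.random.seed untouched

# Synthetic, linearly related data (hypothetical)
x_demo = rng.normal(size=200)
y_demo = 0.5 * x_demo + rng.normal(size=200)

# Off-diagonal entry of the 2x2 correlation matrix
r_numpy = np.corrcoef(x_demo, y_demo)[0, 1]

# Covariance divided by the product of SDs, both normalized by n (ddof=0)
cov_xy = np.cov(x_demo, y_demo, ddof=0)[0, 1]
r_from_cov = cov_xy / (np.std(x_demo) * np.std(y_demo))

print(f"np.corrcoef:       {r_numpy:.6f}")
print(f"cov / (sd_x sd_y): {r_from_cov:.6f}")
```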


In [None]:
def sample_correlation(sample):
    """
    Compute sample correlation of x and y.
    sample: A DataFrame of dimension (n, 2). The two columns are 'total_bill' and 'size'
    """
    x, y = sample['total_bill'], sample['size']
    x_bar = ...
    y_bar = ...
    n = ...
    ...

sample_correlation(data)

In [None]:
grader.check("q4a")

### Review: Terminology

Let the sample correlation of `data` be the estimator for the population correlation. In other words:

* **Parameter**: Population correlation. Unknown, but fixed.
* **Statistic**: Sample correlation. Dependent on the random sample we obtained.
* **Estimator**: The sample correlation statistic `corr_est` is an estimator of the population correlation parameter.

What can we infer about the population correlation given this estimate? Is it possible that the total bill and the party size are actually uncorrelated?

We can perform **bootstrapped hypothesis testing** ([Data 8 textbook](https://inferentialthinking.com/chapters/13/2/Bootstrap.html)): since we cannot simulate samples from the true population, we resample from the sample instead. If the sample is representative of the true population, then the resample will also be similar to samples from the original population. Note that if your sample is not representative, bootstrapping will also give you a biased result.

The hypotheses are as follows:

* **Null hypothesis**: Total bill and party size are uncorrelated; the population correlation is 0.
* **Alternate hypothesis**: The population correlation is not 0.


To test this hypothesis, we can bootstrap a $(100-p)$% confidence interval for the population correlation and check if 0 is in the interval. If 0 is in the interval, the data are consistent with the null hypothesis. If 0 is *not* in the interval, we reject the null hypothesis at the $p$% significance level. For more on the duality of the confidence interval and the p-value, see this [StackExchange discussion](https://stats.stackexchange.com/questions/179902/confidence-interval-p-value-duality-vs-frequentist-interpretation-of-cis). 
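
The procedure can be sketched on synthetic data as follows. Everything here is hypothetical: the arrays `x_demo` and `y_demo` stand in for a single observed sample, `np.corrcoef` stands in for the statistic, and 2000 resamples is an arbitrary choice. This is a percentile bootstrap for a 95% interval, not the lab's `ci_correlation` function.

```python
import numpy as np

rng = np.random.default_rng(5)  # local generator; leaves np.random.seed untouched

# A single synthetic "observed" sample of paired values (hypothetical)
n_demo = 150
x_demo = rng.normal(size=n_demo)
y_demo = 0.6 * x_demo + rng.normal(size=n_demo)

num_boot = 2000
boot_corrs = np.empty(num_boot)
for b in range(num_boot):
    # Resample row indices with replacement so each (x, y) pair stays together
    idx = rng.integers(0, n_demo, size=n_demo)
    boot_corrs[b] = np.corrcoef(x_demo[idx], y_demo[idx])[0, 1]

# The middle 95% of the bootstrapped statistics forms the interval
lower_demo, upper_demo = np.percentile(boot_corrs, [2.5, 97.5])
reject_null = not (lower_demo <= 0 <= upper_demo)

print(f"95% bootstrap CI: ({lower_demo:.3f}, {upper_demo:.3f})")
print(f"Reject the null at the 5% level? {reject_null}")
```

Resampling indices rather than each column separately matters: shuffling `x_demo` and `y_demo` independently would destroy the pairing that correlation measures.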

<br>

---

### Question 4b

Implement the `ci_correlation` function in the cell below that returns a bootstrapped confidence interval at the `conf` level. Each bootstrap resample should draw `n` rows from the `sample` `DataFrame` with replacement, where `n` is the number of rows in `sample`; repeat this `m` times to construct `m` bootstrapped sample correlations using the `sample_correlation` function you implemented in Question 4a.

Then, assign `boot_ci` to the bootstrapped 95\% confidence interval for the tips `data` sample.

**Hint:** You may find `np.percentile`([documentation](https://numpy.org/doc/stable/reference/generated/numpy.percentile.html)) helpful.

**Note:** This question is called Question 5 in the lab walkthrough.


In [None]:
def ci_correlation(sample, conf, m=5000):
    """Compute a bootstrap confidence interval for the correlation coefficient.
    Parameters:
    sample (DataFrame): A DataFrame containing at least two numeric columns to compute correlation.
    conf (float): The desired confidence level (e.g., 95 for a 95% confidence interval).
    m (int): The number of bootstrap resamples to generate. Default is 5000.
    """
    estimates = []
    n = len(sample)
    for j in range(m):
        resample = ...
        ...
    lower = ...
    upper = ...
    return (lower, upper)

boot_ci = ...
boot_ci

In [None]:
grader.check("q4b")

<!-- BEGIN QUESTION -->

<br>

---

### Question 4c
Now that we have the bootstrapped 95% confidence interval of the parameter based on a single sample of size 244, let's determine what we can conclude about our population correlation.

Fill in the blanks for the sentence:

> By bootstrapping our sample `data`, our estimate of the population correlation is ________ with a ___ % confidence interval of ________.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<br>

---

### Question 4d

In the cell below, interpret the statement from the previous question. Can we reject the null hypothesis at the 5% significance level? What can we infer about the relationship between total bill and party size?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Part 2: SQL
---


### Important Note!

**Please only have ONE active SQL database running at once on DataHub, meaning do not have multiple homework/lab/lecture notebooks running at the same time**.

In [None]:
# Run this cell to set up your notebook.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sqlalchemy
from pathlib import Path
from zipfile import ZipFile

if Path('../../../../../../../gradescope').is_dir():
    path = Path('.') 
else:
    path = Path('../../../../../../../tmp') 

with ZipFile('data.zip', 'r') as zipObj:
    # Extract all the contents of zip file to the tmp directory
    zipObj.extractall(path = path)

imdb_sqlite_conn = 'duckdb:///' + str(path / 'imdbmini.db')
fec_nyc_sqlite_conn = 'duckdb:///' + str(path / 'fec_nyc.db')

## SQL Query Syntax

Throughout this lab, you will become familiar with the following syntax for the `SELECT` query:

```
SELECT <column list>
FROM <table>
[WHERE <predicate>]
[GROUP BY <column list>]
[HAVING <predicate>]
[ORDER BY <column list>]
[LIMIT <number of rows>]
[OFFSET <number of rows>]
```
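
To see how the clauses fit together, here is a minimal sketch that runs one query using every clause above. It uses Python's built-in `sqlite3` module with a throwaway in-memory database; the `movie` table and its rows are made up for illustration and are not part of the lab's databases.

```python
import sqlite3

# Throwaway in-memory database with a hypothetical `movie` table
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE movie (title TEXT, year INTEGER, rating REAL)")
conn_demo.executemany(
    "INSERT INTO movie VALUES (?, ?, ?)",
    [
        ("A", 1999, 7.5), ("B", 1999, 8.1),
        ("C", 2005, 6.0), ("D", 2005, 9.0),
        ("E", 2010, 5.5), ("F", 1980, 8.8),
    ],
)

query_demo = """
SELECT year, COUNT(*) AS num_movies, AVG(rating) AS avg_rating
FROM movie
WHERE year >= 1990        -- filter rows before grouping
GROUP BY year
HAVING COUNT(*) >= 2      -- filter groups after aggregation
ORDER BY avg_rating DESC
LIMIT 5
OFFSET 0;
"""

rows_demo = conn_demo.execute(query_demo).fetchall()
conn_demo.close()

for row in rows_demo:
    print(row)
```

`WHERE` drops the 1980 movie before any grouping happens, while `HAVING` drops the 2010 group afterwards because it contains only one movie.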

## Subpart 1 [Tutorial]: Writing SQL in Jupyter Notebooks

**Caution: Be careful with large SQL queries!!** You may need to reboot your Jupyter Hub instance if it stops responding. To avoid printing out 100k-sized tables, we've adjusted the display limit to ensure that the tables displayed are truncated to 20 rows (though they may contain more rows in reality).


In [None]:
%config SqlMagic.displaylimit = 20

In [None]:
# Run this cell to set up SQL. 
import duckdb
%load_ext sql

### 1. `%%sql` cell magic

In lecture, we used the `sqlalchemy` extension to use **`%%sql` cell magic**, which enables us to connect to SQL databases and issue SQL commands within Jupyter Notebooks.

Run the below cells to connect to a mini IMDb database using `duckdb` as the backend.

In [None]:
# Run this cell to connect to duckdb
conn = duckdb.connect()
conn.query("INSTALL sqlite")

In [None]:
# Run this cell to connect to the imdbmini database
imdb_mini_db = 'duckdb:///' + str(path / 'imdbmini.db')
%sql duckdb:///../../../../../../../tmp/imdbmini.db --alias imdb_engine

The above cell connects to the same database using the SQLAlchemy Python library, which can connect to several different database management systems, including sqlite3, MySQL, PostgreSQL, and Oracle; we use `duckdb`. The library also supports an advanced feature for generating queries called an [object relational mapper](https://docs.sqlalchemy.org/en/20/tutorial/index.html#unified-tutorial) or ORM, which we won't discuss in this course but is quite useful for application development. 
<br><br>
Above, prefixing our single-line command with `%sql` means that the entire line will be treated as a SQL command (this is called "line magic"). In this class we will most often write multi-line SQL, meaning we need "cell magic", where the first line has `%%sql` (note the double `%` operator).

The database `imdbmini.db` includes several tables, one of which is `Name`. Running the below cell will return the first 5 rows of that table. Note that `%%sql` is on its own line.

We've also included syntax for single-line comments, which begin with `--` and run to the end of the line, and multi-line comments, which are surrounded by `/*` and `*/`.

In [None]:
%%sql
/*
 * This is a
 * multi-line comment.
 */
-- This is a single-line/inline comment. --
SELECT
  *
FROM
  Name
LIMIT
  5;

<br/>

### 2. The `pandas` command `pd.read_sql`

This section describes how data scientists use SQL and `python` in practice, using the `pandas` command `pd.read_sql` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html)). **You will see both `%sql` magic and `pd.read_sql` in this course**.

We can call `pd.read_sql` with a `query` **string** and the connection string `imdb_mini_db`. Note the `"""` that delimits our multi-line string, which allows the query to span multiple lines. The resulting `DataFrame` `df` stores the results of the same SQL query from the previous section.

In [None]:
# Run this cell to see the demo.
query = """
SELECT *
FROM name
LIMIT 5;
"""

df = pd.read_sql(query, imdb_mini_db)
df

#### `pd.read_sql` vs. `%sql` magic error messages

`pd.read_sql` has **long error messages**: Because the SQL query is now inside a Python string, the errors become harder to read. Consider the below (incorrect) query.

**Note**: Uncomment the below code and check out the error. You can uncomment/comment out multiple lines at the same time by selecting them and pressing ctrl+/ or command+/.

In [None]:
# Uncomment the below code and check out the error.
# query = """
# SELECT *
# FROM Title;
# LIMIT 5
# """
# pd.read_sql(query, imdb_mini_db)

<br/>
<details>
<summary>Now that's an unruly error message! Can you see what's wrong in the cell above and correct the query? Toggle this cell to check your answer!</summary>
It has a semicolon in the wrong place!
</details>

<br/>

On the other hand, `%sql` magic gives more intelligible error messages, so we will use this format more often.

In [None]:
# %%sql
# -- Uncomment all this block and check out the error. --
# SELECT *
# FROM Title;
# LIMIT 5

<br/><br/>
<hr style="border: 1px solid #fdb515;" />

## Subpart 1: The IMDb (mini) Dataset

Let's explore a miniature version of the [IMDb Dataset](https://www.imdb.com/interfaces/). This is the same dataset that we will use for the upcoming homework. We'll load it in using cell magic.

In [None]:
%%sql imdb_engine
SELECT
  *
FROM
  sqlite_master
WHERE
  type = 'table';

From running the above cell, we see the database has 4 tables: `Title`, `Name`, `Role`, and `Rating`.

<details open>
    <summary>[<b>Click to Expand</b>] See descriptions of each table's schema.</summary>
    
**`Title`** - Contains the following information for titles.
    
- tconst (integer) - alphanumeric unique identifier of the title
- titleType (text) -  the type/format of the title
- primaryTitle (text) -  the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (text) -  original title, in the original language
- isAdult (text) - 0: non-adult title; 1: adult title
- startYear (text) – represents the release year of a title.
- endYear (text) - TV Series end year. 'None' for all other title types
- runtimeMinutes (text)  – primary runtime of the title, in minutes
- genres (text) – includes up to three genres associated with the title


**`Name`** – Contains the following information for names of people.
    
- nconst (integer) - alphanumeric unique identifier of the name/person
- primaryName (text)– name by which the person is most often credited
- birthYear (text) – in YYYY format
- deathYear (text) – in YYYY format
- primaryProfession (text) – the top-3 professions of the person
    
    
**`Role`** – Contains the principal cast/crew for titles.
    
- tconst (integer) - alphanumeric unique identifier of the title
- ordering (text) – a number to uniquely identify rows for a given tconst
- nconst (integer) - alphanumeric unique identifier of the name/person
- category (text) - the category of job that person was in
- job (text) - the specific job title if applicable, else 'None'
- characters (text) - the name of the character played if applicable, else 'None'
    
**`Rating`** – Contains the IMDb rating and votes information for titles.
    
- tconst (integer) - alphanumeric unique identifier of the title
- averageRating (text) – weighted average of all the individual user ratings
- numVotes (text) - number of votes (i.e., ratings) the title has received
    
</details>

<br/><br/>

From the above descriptions, we can conclude the following:
* `Name.nconst` and `Title.tconst` are **primary keys** of the `Name` and `Title` tables, respectively.
* `Role.nconst` and `Role.tconst` are **foreign keys** that point to `Name.nconst` and `Title.tconst`, respectively.
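
As a rough illustration of what these key relationships mean, the sketch below builds tiny stand-ins for `Name`, `Title`, and `Role` with Python's built-in `sqlite3` module. The rows are made up, and the explicit `PRIMARY KEY` and `REFERENCES` constraints are an assumption of this sketch; the lab's database files may not declare them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE Name  (nconst INTEGER PRIMARY KEY, primaryName TEXT);
CREATE TABLE Title (tconst INTEGER PRIMARY KEY, primaryTitle TEXT);
CREATE TABLE Role  (
    tconst   INTEGER REFERENCES Title(tconst),
    nconst   INTEGER REFERENCES Name(nconst),
    category TEXT
);
INSERT INTO Name  VALUES (1, 'Person A'), (2, 'Person B');
INSERT INTO Title VALUES (10, 'Film X');
INSERT INTO Role  VALUES (10, 1, 'actor'), (10, 2, 'director');
""")

# Primary key: a second row with nconst = 1 is rejected.
try:
    conn.execute("INSERT INTO Name VALUES (1, 'Duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Foreign key: a Role row that points at a nonexistent title is rejected.
try:
    conn.execute("INSERT INTO Role VALUES (99, 1, 'actor')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Following both foreign keys links each role to a person and a title.
credits = conn.execute("""
    SELECT n.primaryName, t.primaryTitle, r.category
    FROM Role AS r
    JOIN Name AS n ON r.nconst = n.nconst
    JOIN Title AS t ON r.tconst = t.tconst
    ORDER BY n.primaryName
""").fetchall()
print(duplicate_rejected, orphan_rejected, credits)
```

A primary key identifies each row of its own table uniquely, while a foreign key stores the primary-key value of a row in another table, which is what makes joining the tables meaningful.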

<br/><br/>

---

### Question 5

What are the different kinds of `titleType`s included in the `Title` table? Write a query to find out all the unique `titleType`s of films using the `DISTINCT` keyword.  (**You may not use `GROUP BY`.**)

In [None]:
%%sql imdb_engine --save query_q5
...

In [None]:
# Run this cell for grading purposes. 
# No further action is required. 
sql_q5 = %sqlcmd snippets query_q5
res_q5 = pd.read_sql(sql_q5, imdb_mini_db)

In [None]:
grader.check("q5")

<br><br>

---

### Question 6

Before we proceed, we want to get a better picture of the kinds of jobs that exist. To do this, examine the `Role` table by computing the number of records with each job `category`. Present the results in descending order by the total counts.

The top of your table should look like this (however, you should have more rows):

| |category|total|
|-----|-----|-----|
|**0**|actor|21665|
|**1**|writer|13830|
|**2**|...|...|

In [None]:
%%sql imdb_engine --save query_q6
...

In [None]:
# Run this cell for grading purposes. 
# No further action is required. 
sql_q6 = %sqlcmd snippets query_q6
res_q6 = pd.read_sql(sql_q6, imdb_mini_db)

In [None]:
grader.check("q6")

<br/>
If we computed the results correctly, we should see a nice horizontal bar chart of the counts per category below:

In [None]:
# Run this cell to make a bar plot.
plt.barh(res_q6["category"], res_q6["total"])
plt.xlabel("Counts")
plt.ylabel("Job Category");

<br/><br/>

---

### Question 7

Now that we have a better sense of the basics of our data, we can ask some more interesting questions.

The `Rating` table has the `numVotes` and the `averageRating` for each title. Which 10 films have the most ratings?

Write a SQL query that outputs three fields: the `title`, `numVotes`, and `averageRating` for the 10 films that have the highest number of ratings.  Sort the result in descending order by the number of votes.

**Hint**: The `numVotes` in the `Rating` table is not an integer! Use `CAST(Rating.numVotes AS int) AS numVotes` to convert the attribute to an integer. 
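
To see why the cast matters, here is a small sketch on a hypothetical `Votes` table (made-up rows, run with Python's built-in `sqlite3` module, not the lab database). Sorting a text column compares the values character by character, so `'9'` sorts above `'100'`; casting to an integer restores numeric order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: numVotes is stored as text, as in the lab's Rating table.
conn.execute("CREATE TABLE Votes (item TEXT, numVotes TEXT)")
conn.executemany(
    "INSERT INTO Votes VALUES (?, ?)",
    [("a", "9"), ("b", "100"), ("c", "25")],
)

# Text comparison: '9' > '25' > '100' because '9' > '2' > '1'.
as_text = conn.execute(
    "SELECT item FROM Votes ORDER BY numVotes DESC"
).fetchall()

# Numeric comparison after the cast: 100 > 25 > 9.
as_int = conn.execute(
    "SELECT item FROM Votes ORDER BY CAST(numVotes AS int) DESC"
).fetchall()

print(as_text)
print(as_int)
```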

In [None]:
%%sql imdb_engine --save query_q7
...

In [None]:
# Run this cell for grading purposes. 
# No further action is required. 
sql_q7 = %sqlcmd snippets query_q7
res_q7 = pd.read_sql(sql_q7, imdb_mini_db)

In [None]:
grader.check("q7")

<br/><br/>
<hr style="border: 1px solid #fdb515;" />

## Subpart 2: Election Donations in New York City

Finally, let's analyze the Federal Election Commission (FEC)'s public records. We connect to the database using cell magic so that we can flexibly explore the database.

In [None]:
# Run this cell to connect to the fec_nyc database
fec_nyc_db = 'duckdb:///' + str(path / 'fec_nyc.db')
%sql duckdb:///../../../../../../../tmp/fec_nyc.db --alias fec_engine

### Table Descriptions

Run the below cell to explore the **schemas** of all tables saved in the database.

If you'd like, you can consult the below linked FEC pages for the descriptions of the tables themselves.

* `cand` ([link](https://www.fec.gov/campaign-finance-data/candidate-summary-file-description/)): Candidates table. Contains names and party affiliation.
* `comm` ([link](https://www.fec.gov/campaign-finance-data/committee-summary-file-description/)): Committees table. Contains committee names and types.
* `indiv_sample_nyc` ([link](https://www.fec.gov/campaign-finance-data/contributions-individuals-file-description)): All individual contributions from New York City.

In [None]:
%%sql fec_engine
/* just run this cell */
SELECT
  *
FROM
  sqlite_master
WHERE
  type = 'table';

<br/><br/>

Let's look at the `indiv_sample_nyc` table. The below cell displays individual donations made by residents of New York City. We use `LIMIT 5` to avoid loading and displaying a huge table.

In [None]:
%%sql fec_engine
/* just run this cell */
SELECT
  *
FROM
  indiv_sample_nyc
LIMIT
  5;

In [None]:
%%sql fec_engine
/* just run this cell */
SELECT
  cand_id,
  cand_name
FROM
  cand
WHERE
  cand_pty_affiliation = 'DEM'
LIMIT
  5;

<br/><br/>
<hr style="border: 1px solid #fdb515;" />

## [Tutorial] Matching Text with `LIKE`

First, let's look at 2016 election contributions made by Donald Trump, who was a New York (NY) resident during that year. The following SQL query returns the `cmte_id`, `transaction_amt`, and `name` for every contribution made by any donor with "DONALD" and "TRUMP" in their name in the `indiv_sample_nyc` table.

**Notes:**
* We use `WHERE ... LIKE '...'` to match fields against text patterns. The `%` wildcard matches zero or more characters. Compare this to what you know from regex!
* We use `pd.read_sql` syntax here because we will do some EDA on the result `res`.
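
The wildcard behavior described above can be sketched on a hypothetical `Donor` table with made-up names, using Python's built-in `sqlite3` module. Roughly speaking, `%` plays the role of `.*` in a regex and `_` plays the role of a single `.`; unlike a regex search, a `LIKE` pattern must match the entire value. The helper `names_like` is invented for this example. Case sensitivity of `LIKE` varies between database systems, so the sketch sticks to uppercase text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and names, invented for illustration.
conn.execute("CREATE TABLE Donor (name TEXT)")
conn.executemany(
    "INSERT INTO Donor VALUES (?)",
    [("SMITH, ANNA",), ("SMITHSON, ANN",), ("ANN",), ("JOANNE LEE",)],
)

def names_like(pattern):
    """Return all names matching a LIKE pattern, in sorted order."""
    rows = conn.execute(
        "SELECT name FROM Donor WHERE name LIKE ? ORDER BY name", (pattern,)
    )
    return [row[0] for row in rows]

print(names_like("SMITH%"))  # names that start with SMITH
print(names_like("%ANN%"))   # names that contain ANN anywhere
print(names_like("ANN%"))    # % may match zero characters, so 'ANN' itself matches
print(names_like("A__"))     # _ matches exactly one character
```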

In [None]:
# Run this cell to see an example of LIKE.
example_query = """
SELECT 
    cmte_id,
    transaction_amt,
    name
FROM indiv_sample_nyc
WHERE name LIKE '%TRUMP%' AND name LIKE '%DONALD%';
"""

example_res = pd.read_sql(example_query, fec_nyc_db)
example_res

If we look at the list above, it appears that some donations were not by Donald Trump himself, but instead by an entity called "DONALD J TRUMP FOR PRESIDENT INC". Fortunately, we see that our query only seems to have picked up one such anomalous name.

In [None]:
# Run this cell to see the value counts for each donor name.
example_res['name'].value_counts()

<br/><br/>

---

### Question 8



In the cell below, revise the above query so that the 15 anomalous donations made by "DONALD J TRUMP FOR PRESIDENT INC" do not appear. Your resulting table should have 142 rows. 

**Hints:**
* Consider using the above query as a starting point, or checking out the SQL query skeleton at the top of this lab. 
* The `NOT` keyword may also be useful here.


In [None]:
%%sql fec_engine --save query_q8
...

In [None]:
# Run this cell for grading purposes. 
# No further action is required. 
sql_q8 = %sqlcmd snippets query_q8
res_q8 = pd.read_sql(sql_q8, fec_nyc_db)

In [None]:
# Print the number of rows in your query
# Double check that this equals 142 
res_q8.shape[0]

In [None]:
grader.check("q8")

<br/><br/>

---

### Question 9: `JOIN`ing Tables

Let's explore the other two tables in our database: `cand` and `comm`.

The `cand` table contains summary financial information about each candidate registered with the FEC or appearing on an official state ballot for House, Senate or President.

In [None]:
%%sql fec_engine
/* just run this cell */
SELECT
  *
FROM
  cand
LIMIT
  5;

The `comm` table contains summary financial information about each committee registered with the FEC. Committees are organizations that spend money for political action or parties, or spend money for or against political candidates.

In [None]:
%%sql fec_engine
/* just run this cell */
SELECT
  *
FROM
  comm
LIMIT
  5;

<br><br>

---

#### Question 9a

Notice that both the `cand` and `comm` tables have a `cand_id` column. Let's try joining these two tables on this column to print out committee information for candidates.

List the first 5 candidate names (`cand_name`) in reverse lexicographic order (i.e., reverse alphabetical order) by `cand_name`, along with their corresponding committee names. **Only select rows that have a matching `cand_id` in both tables.**

Your output should look similar to the following:

|cand_name|cmte_nm|
|----|----|
|ZUTLER, DANIEL PAUL MR|CITIZENS TO ELECT DANIEL P ZUTLER FOR PRESIDENT|
|ZUMWALT, JAMES|ZUMWALT FOR CONGRESS|
|...|...|

Consider starting from the following query skeleton, which uses the `AS` keyword to rename the `cand` and `comm` tables to `c1` and `c2`, respectively.
Which join is most appropriate?

    SELECT ...
    FROM cand AS c1
        [INNER | {LEFT | RIGHT | FULL} [OUTER]] JOIN comm AS c2
        ON ...
    ...
    ...;
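
As a reminder of how join types differ, the sketch below compares two of them on hypothetical `student` and `club` tables (made-up rows, unrelated to the FEC data), using Python's built-in `sqlite3` module. It only covers `INNER` and `LEFT` joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy tables: Ben has no club, and club sid 4 has no student.
conn.executescript("""
CREATE TABLE student (sid INTEGER, name TEXT);
CREATE TABLE club    (sid INTEGER, club_name TEXT);
INSERT INTO student VALUES (1, 'Ada'), (2, 'Ben'), (3, 'Cy');
INSERT INTO club    VALUES (1, 'Chess'), (3, 'Robotics'), (4, 'Drama');
""")

# INNER JOIN keeps only rows whose sid appears in both tables.
inner_rows = conn.execute("""
    SELECT s.name, c.club_name
    FROM student AS s
    INNER JOIN club AS c ON s.sid = c.sid
    ORDER BY s.name
""").fetchall()

# LEFT JOIN keeps every row of the left table, filling gaps with NULL (None).
left_rows = conn.execute("""
    SELECT s.name, c.club_name
    FROM student AS s
    LEFT JOIN club AS c ON s.sid = c.sid
    ORDER BY s.name
""").fetchall()

print(inner_rows)
print(left_rows)
```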


In [None]:
%%sql fec_engine --save query_q9a 
...

In [None]:
# Run this cell for grading purposes. 
# No further action is required. 
sql_q9a = %sqlcmd snippets query_q9a
res_q9a = pd.read_sql(sql_q9a, fec_nyc_db)

In [None]:
grader.check("q9a")

<br/><br/>

---

#### Question 9b

Suppose we modify the query from the previous part to include *all* candidates, **including those that don't have a committee.**


List the first 5 candidate names (`cand_name`) in reverse lexicographic order by `cand_name`, along with their corresponding committee names. If the candidate has no committee in the `comm` table, then `cmte_nm` should be NULL (or `None` in the `python` representation).

Your output should look similar to the following:

|cand_name|cmte_nm|
|----|----|
|ZUTLER, DANIEL PAUL MR|CITIZENS TO ELECT DANIEL P ZUTLER FOR PRESIDENT|
|...|...|
|ZORNOW, TODD MR|None|

**Hint**: Start from the same query skeleton as the previous part. 
Which join is most appropriate?

In [None]:
%%sql fec_engine --save query_q9b
...

In [None]:
# Run this cell for grading purposes. 
# No further action is required. 
sql_q9b = %sqlcmd snippets query_q9b
res_q9b = pd.read_sql(sql_q9b, fec_nyc_db)

In [None]:
grader.check("q9b")

<br/><br/>

---

### Question 10: Subqueries and Grouping

If we return to our results from Question 8, we see that many of the contributions were to the same committee:

In [None]:
# Your SQL query result from Question 8
# Reprinted for your convenience
res_q8['cmte_id'].value_counts()

<br><br>

For this question, create a new SQL query that returns the total amount that Donald Trump contributed to each committee.

Your table should have four columns: `cmte_id`, `total_amount` (total amount contributed to that committee), `num_donations` (total number of donations), and `cmte_nm` (name of the committee). Your table should be sorted in **decreasing order** of `total_amount`.

Your output should look similar to the following:

|cmte_id|total_amount|num_donations|cmte_nm|
|----|----|----|----|
|C00580100|18633157|131|DONALD J. TRUMP FOR PRESIDENT, INC.|
|C00055582|10000|1|NY REPUBLICAN FEDERAL CAMPAIGN COMMITTEE|
|...|...|...|...|

**This is a hard question!** Don't be afraid to reference the lecture slides, or the overall SQL query skeleton at the top of this lab.

**Hints:**

* Note that committee names are not available in `indiv_sample_nyc`, so you will have to obtain information somehow from the `comm` table (perhaps a `JOIN` would be useful).
* Remember that you can compute summary statistics after grouping by using aggregates like `COUNT(*)`, `SUM()` as output fields.
* A **subquery** may be useful (but not required) to break your question down into subparts. Consider the following query skeleton, which uses the `WITH` clause to store a subquery's results in a temporary table named `donations`.

        WITH donations AS (
            SELECT ...
            FROM ...
            ... JOIN ...
                ON ...
            WHERE ...
        )
        SELECT ...
        FROM donations
        GROUP BY ...
        ORDER BY ...;
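
To make the shape of this skeleton concrete, here is a minimal sketch on a hypothetical `orders` table (made-up rows, unrelated to the FEC data), run with Python's built-in `sqlite3` module. The `WITH` clause names an intermediate result, `big_orders`, and the outer query then groups and sorts it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy table, invented for illustration.
conn.executescript("""
CREATE TABLE orders (customer TEXT, item TEXT, amount INTEGER);
INSERT INTO orders VALUES
    ('ann', 'book', 12), ('ann', 'pen', 3),
    ('bob', 'book', 20), ('bob', 'lamp', 35), ('cat', 'pen', 2);
""")

cte_rows = conn.execute("""
    WITH big_orders AS (          -- temporary, named subquery result
        SELECT customer, amount
        FROM orders
        WHERE amount >= 10
    )
    SELECT customer, SUM(amount) AS total, COUNT(*) AS num_orders
    FROM big_orders               -- the outer query reads from it like a table
    GROUP BY customer
    ORDER BY total DESC
""").fetchall()
print(cte_rows)
```

The temporary table exists only for the duration of the query; nothing is written to the database.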
  
**Note**: The video walkthrough solution may not be fully correct here. Remember that when using `GROUP BY`, every column in the `SELECT` clause must either appear in the `GROUP BY` clause or be used in an aggregate function. For example:

        SELECT titleType, SUM(runtimeMinutes), startYear
        FROM Title
        GROUP BY titleType;

This query violates the rule because `startYear` is included in the `SELECT` clause without being either part of an aggregate function or listed in the `GROUP BY` clause.
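
There are two common ways to repair a query like this, sketched below on a hypothetical toy `Title` table with made-up rows, integer runtimes, and a `startYear` column standing in for the extra non-aggregated column. The sketch uses Python's built-in `sqlite3` module. SQLite happens to tolerate bare columns in grouped queries, whereas `duckdb` reports an error, so only the compliant forms are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy version of Title, invented for illustration.
conn.executescript("""
CREATE TABLE Title (titleType TEXT, runtimeMinutes INTEGER, startYear INTEGER);
INSERT INTO Title VALUES
    ('movie', 100, 2000), ('movie', 120, 2001),
    ('short', 10, 2000), ('short', 15, 2000);
""")

# Option 1: add the extra column to GROUP BY (one row per type/year pair).
by_type_and_year = conn.execute("""
    SELECT titleType, startYear, SUM(runtimeMinutes)
    FROM Title
    GROUP BY titleType, startYear
    ORDER BY titleType, startYear
""").fetchall()

# Option 2: wrap the extra column in an aggregate (one row per type).
by_type = conn.execute("""
    SELECT titleType, SUM(runtimeMinutes), MIN(startYear)
    FROM Title
    GROUP BY titleType
    ORDER BY titleType
""").fetchall()

print(by_type_and_year)
print(by_type)
```

Which option is appropriate depends on the question being asked: the first changes the granularity of the groups, while the second keeps one row per group and summarizes the extra column.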


In [None]:
%%sql fec_engine --save query_q10
...

In [None]:
# Run this cell for grading purposes. 
# No further action is required. 
sql_q10 = %sqlcmd snippets query_q10
res_q10 = pd.read_sql(sql_q10, fec_nyc_db)

In [None]:
grader.check("q10")

<br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Congratulations! You are finished with Lab 09!


### Course Content Feedback

If you have any feedback about this assignment or about any of our other weekly assignments, lectures, or discussions, please fill out the [Course Content Feedback Form](https://forms.gle/tUUekdzp91W4bxS87). Your input is valuable in helping us improve the quality and relevance of our content to better meet your needs and expectations!

### Submission Instructions

Below, you will see a cell. Running this cell will automatically generate a zip file with your autograded answers. Submit this file to the Lab 09 assignment on Pensieve. If you run into any issues when running this cell, feel free to check this [section](https://ds100.org/debugging-guide/autograder_gradescope/autograder_gradescope.html#why-does-grader.exportrun_teststrue-fail-if-all-previous-tests-passed) in the Data 100 Debugging Guide.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)