# Discussion 1: Random Variables, Sampling, and SQL

In this notebook, we walk through Question 2 of Discussion 1 and demonstrate how different sampling methods can produce biased samples.

In [None]:
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter("ignore")
%matplotlib inline

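def plot_sample(arr):
    """Plot a histogram of sample means and mark their average against the population mean."""
    plt.figure(figsize=(10, 6))
    plt.hist(arr)
    plt.xlabel("Exam Difficulty")
    data_mean = np.mean(arr)
    # `data` is the global dataframe loaded in a later cell
    pop_mean = data["response"].mean()
    # axvline's ymin/ymax are axes fractions in [0, 1]; the defaults span the full height
    plt.axvline(data_mean, color="r", linewidth=3, label="Empirical Mean")
    plt.axvline(pop_mean, color="orange", linewidth=3, label="Population Mean")
    plt.legend()
    return data_mean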

## Question 2: Sampling

To start, we read in the dataframe `data`, which has one row for each student in Data 100 and the following columns:

* `response`, their response to the survey on a 10-point scale
* `attended fireside chat`, whether the student attended the fireside chat after the exam
* `saw on piazza`, whether the student _saw_ the survey Piazza post
* `social pressure bias`, the amount a student's `response` changes when faced with social pressure; for example, if `response` was 5 and `social pressure bias` was -1, the student would answer 4 when faced with social pressure
* `mini-discussion`, the student's mini-discussion section.
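
To make the `social pressure bias` column concrete, here is a minimal sketch on a few made-up records (the values are hypothetical and not taken from `data/responses.csv`). It only shows the arithmetic described above: the answer given under social pressure is `response` plus `social pressure bias`.

```python
# Hypothetical toy records mirroring two of the columns described above.
students = [
    {"response": 5, "social pressure bias": -1},
    {"response": 8, "social pressure bias": 0},
    {"response": 3, "social pressure bias": 2},
]

# The answer a student gives in public is their response shifted by the bias.
public_answers = [s["response"] + s["social pressure bias"] for s in students]
print(public_answers)  # [4, 8, 5]
```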



In [None]:
means = {} # this will collect our mean estimates for each option
data = pd.read_csv("data/responses.csv")
data

For each sampling procedure, we will bootstrap from `data` using the mean of `response` as our test statistic, calculate the bootstrapped estimate of the mean of `response`, and plot a histogram comparing our empirical estimate to the true population mean, which is:

In [None]:
data["response"].mean()
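
Each option below follows the same pattern: draw a sample, record the mean of `response`, repeat many times, and compare the average of those means to the population mean. The sketch below shows that pattern on a made-up population using only the standard library. The function name `simulate_sample_means` and the toy data are assumptions for illustration. Like the pandas `sample` calls in this notebook, which default to `replace=False`, it draws without replacement.

```python
import random
import statistics

def simulate_sample_means(population, sample_size, replications=1000, seed=42):
    """Draw repeated samples without replacement and return each sample's mean."""
    rng = random.Random(seed)  # local generator so the global random state is untouched
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(replications)
    ]

# Hypothetical population of 500 responses on a 1-10 scale (not the real survey data).
toy_rng = random.Random(0)
toy_population = [toy_rng.randint(1, 10) for _ in range(500)]

sample_means = simulate_sample_means(toy_population, sample_size=50)
population_mean = statistics.mean(toy_population)
empirical_mean = statistics.mean(sample_means)

print(f"Population mean: {population_mean:.4f}")
print(f"Average of sample means: {empirical_mean:.4f}")
```

Because every member of this toy population is equally likely to be drawn, the two printed values should be close. The options below differ in who is eligible to be sampled, which is where bias can enter.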

### Option A

Recall that in option A:

> The professor sends a Zoom poll to all students in the first fireside chat following the exam.

To simulate this, we will bootstrap from `data` among the students for whom `attended fireside chat` is `True`. We assume that only 80% of students in the fireside chat complete the poll.

In [None]:
# option A
np.random.seed(42)

replications = 1000
sample_frac = 0.8 # assume only 80% of students respond to the poll
data_A = []
for _ in range(replications):
    responses = data[data["attended fireside chat"]].sample(frac=sample_frac)
    data_A.append(responses["response"].mean())

means["A"] = plot_sample(data_A)
means["A"]

### Option B

Recall that in option B:

> The professor adds a question to the homework assignment of a simple random sample of students within each discussion section.

To simulate this, we will first add a new column `discussion` to `data` with the student's discussion section number (taken from `mini-discussion`). We will then bootstrap from `data` by grouping by `discussion` and sampling 10 students from each section.

In [None]:
# option B
np.random.seed(42)

# add discussion column to data
data["discussion"] = data["mini-discussion"].str[:-1].astype(int)

replications = 1000
sample_size = 10 # survey 10 students in each section
data_B = []
for _ in range(replications):
    responses = data.groupby("discussion").sample(n=sample_size)
    data_B.append(responses["response"].mean())

means["B"] = plot_sample(data_B)
means["B"]
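
The `discussion` column in the cell above is built with `.str[:-1].astype(int)`, which drops the last character of each `mini-discussion` label and converts the rest to an integer. As a plain-Python sketch of that step, assuming labels consist of a section number followed by a single letter (a hypothetical format; the actual labels in the CSV are not shown in this notebook):

```python
# Hypothetical labels: a section number followed by a single letter.
toy_labels = ["1A", "1B", "12A", "12C"]

# Dropping the last character and converting to int mirrors `.str[:-1].astype(int)`.
toy_sections = [int(label[:-1]) for label in toy_labels]
print(toy_sections)  # [1, 1, 12, 12]
```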

### Option C

Recall that in option C:

> The professor makes a post on Piazza asking students to submit a Google Form.

To simulate this, we will bootstrap from `data` among the students for whom `saw on piazza` is `True`. We assume that only 40% of students who see the post complete the form.

In [None]:
# option C
np.random.seed(42)

replications = 1000
sample_frac = 0.4 # assume only 40% of students who saw the post responded
data_C = []
for _ in range(replications):
    responses = data[data["saw on piazza"]].sample(frac=sample_frac)
    data_C.append(responses["response"].mean())

means["C"] = plot_sample(data_C)
means["C"]

### Option D

Recall that in option D:

> The professor chooses a simple random sample of mini-discussion sections, goes to each selected section, and surveys each student in the group as part of the ending discussion question.

To simulate this, we will bootstrap from `data` by randomly choosing 20 mini-discussion sections and filtering out the students in `data` who are not in one of the selected sections. Note that for this sampling procedure, because the students are being asked in public, we also need to account for social pressure bias by adding the `social pressure bias` column to `response` before taking the mean.

In [None]:
# option D
np.random.seed(42)

replications = 1000
sample_size = 20 # choose 20 mini-discussion sections
data_D = []
for _ in range(replications):
    sampled_discussions = np.random.choice(data["mini-discussion"].unique(), replace=False, size=sample_size)
    responses = data[data["mini-discussion"].isin(sampled_discussions)]

    # account for social pressure bias without assigning into a slice of `data`
    public_responses = responses["response"] + responses["social pressure bias"]
    data_D.append(public_responses.mean())

means["D"] = plot_sample(data_D)
means["D"]

### Results

The cell below summarizes the results of each sampling procedure using the `means` dictionary. Option B, which takes a simple random sample within each discussion section, is the theoretically optimal procedure of the four, and it does obtain the smallest estimation error in magnitude.

In [None]:
# summary of sampling processes
actual_mean = data["response"].mean()
print(f"Actual Mean: {actual_mean:.4f}")
for option, mean in means.items():
    print(
        f"Option {option}:   Mean = {mean:.4f}   Bias = {(mean - actual_mean):.4f}"
        f"{'   (THEORETICALLY OPTIMAL)' if option == 'B' else ''}"
    )
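
The `Bias` value printed for each option is the average of that option's simulated sample means minus the population mean. Written out, with $R$ the number of replications, $\bar{x}_r$ the sample mean from replication $r$, and $\mu$ the population mean of `response` (these symbols are introduced here for illustration and do not appear elsewhere in the notebook):

$$
\widehat{\text{Bias}} = \frac{1}{R} \sum_{r=1}^{R} \bar{x}_r - \mu
$$

A value near zero means the procedure's estimates center on the population mean. For option D, $\bar{x}_r$ is the mean of `response` after adding `social pressure bias`.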