# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

# Data Sampling


In our description of Gallup's polling fiasco in 1948, we used the term "bias" to say why the poll results were wrong. In this section, we'll define bias with a bit more rigor using the concept of sampling frames. Later in the book, we'll define bias in a precise statistical sense using random variables and expectation. Before starting to work with equations, though, it's important to develop intuition for why bias arises.

**Statistical bias** is the difference between your estimate and the truth. Gallup's 1948 poll predicted that Dewey would receive 50% of the popular vote, but in reality Dewey received 45%. This estimate had a bias of 5 percentage points.
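The arithmetic behind this number can be written out in a few lines of Python. This is only a sketch: the function name `bias_in_points` is made up for illustration, and the sign convention (estimate minus truth) is an assumption, since the text only says "difference".

```python
def bias_in_points(estimate_pct, truth_pct):
    """Return estimate minus truth, in percentage points (assumed sign convention)."""
    return estimate_pct - truth_pct

gallup_dewey_estimate = 50  # Gallup's predicted share of the popular vote for Dewey (%)
dewey_actual = 45           # share of the popular vote Dewey actually received (%)

dewey_bias = bias_in_points(gallup_dewey_estimate, dewey_actual)
print(dewey_bias)  # prints 5
```

A positive value means the poll overestimated Dewey's support.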

As the Gallup poll shows, bias commonly occurs because the sample differs from the population. In data analysis, our **population** is the set of all units of interest. The "of interest" part is important. For an election poll, the population is not every person in the US, or even every registered voter in the US – it is every person who will actually cast a vote. This difference can be dramatic: in 2020, there were around 330 million people in the US and 240 million registered voters, but 158 million people cast votes.

Our **sampling frame** describes all the possible units that can be drawn into the sample. Our **sample** is a subset of our sampling frame. It's important to distinguish between the sampling frame and the sample. We choose a sampling frame before we collect any data, when we decide how we will collect it. For instance, an election poll conducted only via landline phone calls has a sampling frame of people with landlines. This sampling frame misses people from the population who don't have landlines. It also includes people not in the population, like people who have landlines but aren't registered to vote. Even if the poll manages to receive responses from every single person with a landline, its result will still be biased because of the sampling frame.
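A small simulation can make the landline example concrete. Everything in this sketch is hypothetical: the population size, the share of voters with landlines, and the support rates are invented numbers chosen only to show the effect. For simplicity, the simulated frame is a strict subset of the population, so it leaves out the second problem described above (people in the frame who are not in the population).

```python
import random

random.seed(42)

# Hypothetical population of voters. All rates below are invented for illustration.
N = 100_000
P_LANDLINE = 0.40                  # share of voters who have a landline
P_SUPPORT_WITH_LANDLINE = 0.55     # support for a candidate among landline owners
P_SUPPORT_WITHOUT_LANDLINE = 0.40  # support among everyone else

population = []
for _ in range(N):
    has_landline = random.random() < P_LANDLINE
    p_support = P_SUPPORT_WITH_LANDLINE if has_landline else P_SUPPORT_WITHOUT_LANDLINE
    supports = random.random() < p_support
    population.append((has_landline, supports))

def support_rate(units):
    """Fraction of units that support the candidate."""
    return sum(supports for _, supports in units) / len(units)

# The sampling frame is everyone with a landline. Suppose every one of them responds.
sampling_frame = [unit for unit in population if unit[0]]

true_rate = support_rate(population)
frame_rate = support_rate(sampling_frame)

print(f"Support in the population:     {true_rate:.3f}")
print(f"Support in the sampling frame: {frame_rate:.3f}")
```

Even with a perfect response rate, the frame's support rate stays near 0.55 while the population's is near 0.46. Collecting more responses from the same frame does not close this gap.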

## Common Sampling Scenarios

Let's examine the population, sampling frame, and sample for common sampling scenarios. We'll use an orange oval to depict the population, blue for the sampling frame, and green for the sample:

```{image} design_data_figures/sampling_frames.svg
:alt: sampling_frames.svg
:align: center
```

**Scenario 1: Census**

```{image} design_data_figures/census.svg
:alt: census.svg
:align: center
```

In a perfect census, the sampling frame and sample capture the entire population. For instance, if our population is the names of all students currently enrolled in Data 100, the official course roster is a census. A perfect census is relatively rare, however. Even well-funded, large-scale efforts like the US Census cannot practically capture every single person in the US simply because some people won't fill out the form.
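One way to picture a census is with Python sets: the population, sampling frame, and sample are all the same set. The roster below is a made-up list of ten placeholder names, not real course data, and the two nonresponders are chosen arbitrarily.

```python
# Hypothetical roster of ten students; the names are placeholders.
population = {f"student_{i}" for i in range(1, 11)}

# In a perfect census, the frame and the sample both equal the population.
sampling_frame = set(population)
sample = set(sampling_frame)
is_census = sample == sampling_frame == population
print(is_census)

# If some units never respond, the sample falls short of the population.
nonresponders = {"student_3", "student_7"}
responders = sample - nonresponders
coverage = len(responders) / len(population)
print(coverage)
```

Once `coverage` drops below 1, the data are no longer a perfect census, which is the situation the US Census faces when some people don't fill out the form.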

**Scenario 2: Administrative data**

```{image} design_data_figures/census.svg
:alt: census.svg
:align: center
```

In this book, administrative data refers to a nonrandom sample that was not collected specifically for research purposes. With administrative data, the sample covers the entire sampling frame, but the sampling frame is not the entire population. For example, the US government keeps track of every tax return. However, if we aim to find the median income for households in the United States, this data is not a census since the IRS does not require a tax return for households with incomes beneath a threshold (at the time of this writing, roughly $20,000).
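The effect of the filing threshold on a median can be sketched with simulated incomes. The income distribution here is an invented lognormal, not real IRS or household data, and the sketch assumes that no household below the threshold files a return, which is a simplification.

```python
import random
import statistics

random.seed(0)

# Hypothetical household incomes drawn from an invented lognormal distribution.
household_incomes = [random.lognormvariate(11, 0.8) for _ in range(50_000)]

# Assumed simplification: only households at or above the threshold file a return.
FILING_THRESHOLD = 20_000
tax_returns = [income for income in household_incomes if income >= FILING_THRESHOLD]

population_median = statistics.median(household_incomes)
admin_median = statistics.median(tax_returns)

print(f"Households in the population: {len(household_incomes)}")
print(f"Households with a tax return: {len(tax_returns)}")
print(f"Median income, population:    {population_median:,.0f}")
print(f"Median income, tax returns:   {admin_median:,.0f}")
```

Because the frame drops only low-income households, the median computed from the returns is higher than the population median, even though every return in the frame is included.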

Administrative datasets are often large since they come from large companies, websites, or governments. But these data present challenges for analysis because they were collected in a nonrandom manner and thus may not be representative of the population. For instance, a 2010 study collected 100,000 political tweets from Twitter from the month prior to the German national election. The authors found that the proportion of tweets about a party closely followed the proportion of votes that party received in the election {cite}`tumasjanPredictingElectionsTwitter2010`. From this finding, one might infer that the number of tweets could replace expensive, randomly collected surveys for predicting votes.


Outline

intro based on election case study

- get the sense that things don't match reality
- call this bias
- define this more precisely

terms

- population, sampling frame, sample
- include figure from slides here

common sample scenarios

1. census
1. administrative data
1. simple random sample
1. reality of simple random samples
1. presidential election