# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
# Use the full option name; the bare 'precision' pattern is ambiguous in newer pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

(ch:theorydatadesign)=
# Theory for Data Design

In this chapter, we develop the theory behind the chance processes introduced in the {ref}`ch:datascope` chapter.  This theory makes the concepts of bias and variation more precise.  
We continue to reason about the accuracy of our data through the abstraction of the urn model, first introduced in the {ref}`sec:variationtypes` section of the {ref}`ch:datascope` chapter, and we use probability and simulation to develop the theory.

We use the urn model in two ways. First, we consider an artificial example in which a small sample of the first seven SpaceX Starship prototypes (called SN1, SN2, ..., SN7) is pressure tested.
In this example, we can exactly compute the chance of a particular sample being drawn from the urn of marbles (prototypes) for testing. This example is useful for introducing the concepts of a simple random sample and a stratified random sample, which are both core sampling techniques in complex surveys, and for
computing the chance of any particular outcome from sampling. (See Section 3.1.)
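
As a minimal sketch of this kind of exact calculation, the code below enumerates every possible simple random sample from an urn holding the seven prototypes. The sample size of three is an assumption made only for illustration (the text above does not fix it), and the variable names are hypothetical.

```python
from itertools import combinations
from math import comb

import numpy as np

# The urn: one marble per prototype, SN1 through SN7.
prototypes = [f"SN{i}" for i in range(1, 8)]

# ASSUMPTION: we draw 3 prototypes for testing (illustrative sample size).
sample_size = 3

# In a simple random sample, every subset of the given size is equally likely,
# so we can list all subsets and count them.
all_samples = list(combinations(prototypes, sample_size))
n_samples = len(all_samples)

# Chance of drawing any one particular sample.
chance_of_one_sample = 1 / n_samples

# Chance that a particular prototype (here SN1) lands in the sample:
# the fraction of equally likely samples that contain it.
chance_sn1_selected = sum("SN1" in s for s in all_samples) / n_samples

# Drawing one simple random sample: choose without replacement.
rng = np.random.default_rng(42)
one_sample = rng.choice(prototypes, size=sample_size, replace=False)

print(n_samples, chance_of_one_sample, chance_sn1_selected)
```

Enumeration is practical only because the urn is tiny; for larger urns, the count of samples comes from `math.comb` and the chances are approximated by simulation.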

Next (in Section 3.2), we use the urn model as a technical framework to design and run simulation studies to understand larger and more complex problems. 
We return to some of the examples from {numref}`Chapter %s <ch:datascope>` and dive deeper into, for instance,
how the pollsters might have gotten the 2016 Presidential Election predictions wrong.
We can use the actual votes cast in Pennsylvania for the two candidates to simulate the sampling variation in selecting 1,400 voters for a poll from the six million who voted in Pennsylvania.
This simulation can help us uncover how response bias could skew the polls, 
and show us that collecting a lot more data would not have helped the situation.
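
The sketch below shows one way such a poll simulation could be set up. All vote counts are round, hypothetical placeholders rather than the actual Pennsylvania results, the "responding" urn is an invented illustration of response bias, and `simulate_lead` is a hypothetical helper name. Drawing voters without replacement from a two-color urn is modeled with the hypergeometric distribution.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate_lead(votes_a, votes_b, poll_size, n_sims, rng):
    """Simulate candidate A's lead (as a proportion) in many polls.

    Each poll draws `poll_size` voters without replacement from an urn
    holding `votes_a` marbles for A and `votes_b` marbles for B.
    """
    a_in_poll = rng.hypergeometric(votes_a, votes_b, poll_size, size=n_sims)
    b_in_poll = poll_size - a_in_poll
    return (a_in_poll - b_in_poll) / poll_size


# HYPOTHETICAL urn of all voters: B narrowly wins.
true_a, true_b = 2_950_000, 3_050_000
true_lead = (true_a - true_b) / (true_a + true_b)

# HYPOTHETICAL urn of voters who respond to pollsters: some of B's
# supporters do not respond, so A appears to be ahead.
resp_a, resp_b = 2_950_000, 2_850_000

n_sims = 10_000
unbiased = simulate_lead(true_a, true_b, 1_400, n_sims, rng)
biased_small = simulate_lead(resp_a, resp_b, 1_400, n_sims, rng)
biased_large = simulate_lead(resp_a, resp_b, 14_000, n_sims, rng)

print("true lead:              ", round(true_lead, 4))
print("unbiased, n=1,400:  mean", unbiased.mean().round(4),
      "SD", unbiased.std().round(4))
print("biased,   n=1,400:  mean", biased_small.mean().round(4),
      "SD", biased_small.std().round(4))
print("biased,   n=14,000: mean", biased_large.mean().round(4),
      "SD", biased_large.std().round(4))
```

Under these assumed counts, the larger poll has a smaller spread but is still centered on the wrong value, which is the sense in which more data would not repair the bias.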

In a second simulation study (Section 3.3), we examine the efficacy of a COVID-19 vaccine. A designed experiment for the vaccine was carried out on over 50,000 volunteers. Abstracting the experiment to an urn model offers a mechanism for studying assignment variation. Through simulation, we examine the expected outcome of a clinical trial for a vaccine under the assumption that the treatment was not effective. Our simulation, together with a careful examination of the data scope, debunks claims of vaccine ineffectiveness.
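
To make the idea of assignment variation concrete, here is a rough sketch of a simulation under the assumption that the treatment has no effect. Every number below (group sizes, case counts) is a hypothetical placeholder, not data from the trial. If the vaccine does nothing, the volunteers who fall ill would have done so regardless of their group, so the number of cases in the treatment group behaves like the number of marked marbles drawn when the treatment group is chosen at random from the urn.

```python
import numpy as np

rng = np.random.default_rng(1)

# HYPOTHETICAL placeholder values, chosen only to illustrate the mechanics.
n_volunteers = 50_000          # marbles in the urn
n_treatment = 25_000           # marbles drawn for the treatment group
total_cases = 200              # marbles marked "became ill"
observed_treatment_cases = 40  # cases seen in the treatment group

# Under "no effect", the case marbles are fixed in advance and random
# assignment decides how many of them end up in the treatment group.
simulated_cases = rng.hypergeometric(
    total_cases,                 # marked marbles
    n_volunteers - total_cases,  # unmarked marbles
    n_treatment,                 # number drawn
    size=100_000,
)

# Share of simulated trials with as few treatment-group cases as observed.
approx_p_value = np.mean(simulated_cases <= observed_treatment_cases)

print("mean cases in treatment group under no effect:", simulated_cases.mean())
print("approximate chance of a result this extreme:", approx_p_value)
```

With these assumed numbers, a treatment group with so few cases almost never arises from assignment variation alone, which is the kind of evidence the simulation study is meant to provide.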

In addition to sampling variation and assignment variation, we also address measurement error in Section 3.4. There, we use measurements taken at different times of the day to estimate the accuracy of an air quality sensor.

Simulation studies enable us to approximate the typical deviations in a chance process and the distribution of the possible values for a summary statistic of the data. For those wanting a more formal approach, we generalize these approximations through probability in Section 3.5. Others may wish to skip this section until they find the need for the more formal theory.  