In [None]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
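# Use the fully qualified option name; the bare 'precision' key no longer
# resolves in recent pandas releases
pd.set_option('display.precision', 2)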
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

# Exercises



**TODO: the content below was pasted verbatim from Notion, but doesn't seem to have actual exercises?**

## Cluster Sampling

In cluster sampling, the population is also divided into non-overlapping subgroups, called clusters, which tend to be smaller than strata. With cluster sampling we take a simple random sample of the clusters and use all of the units in each selected cluster for our sample. Using our urn analogy, suppose the marbles are divided up into small groups and each group of marbles is placed inside a hollow ball. These balls are placed in the urn, and we take a simple random sample of balls. For each ball selected, we put all of the marbles inside that ball into our sample.

Again, as a simple example, suppose our population of $7$ individuals is placed into $4$ clusters as follows:

$$\left(A,B\right)~ \left(C, D\right)~ \left(E, F\right) ~ \left(G\right).$$

Then, if we take an SRS of $2$ clusters, our sample will be one of the following six equally likely possibilities:

$$\left(A,B,C,D\right)~~ \left(A,B,E,F\right)~~ \left(A,B,G\right)\\ \left(C, D,E,F\right)~~ \left(C, D,G\right)~~ \left(E, F, G\right).$$

We can compute the probability that $A$ is in our sample. $A$ is included exactly when its cluster $(A, B)$ is chosen, which happens in $3$ of the $6$ equally likely samples:

$${\mathbb{P}}\left(\textrm{A in sample} \right) = {\mathbb{P}}\left(\textrm{cluster (A, B) chosen}\right) =  \frac{3}{6} = \frac{1}{2}$$

With this cluster sampling scheme:

$${\mathbb{P}}\left(\textrm{A and B in sample} \right) = \frac{1}{2}\\ {\mathbb{P}}\left(\textrm{A and C in sample}\right) = \frac{1}{6}$$

since $A$ and $B$ always appear together, while $A$ and $C$ appear together only when both $(A, B)$ and $(C, D)$ are selected.

Cluster sampling is still probability sampling since we can assign a probability to each potential sample. However, the chance of any particular sample depends on how the population is clustered, and it generally differs from the chance of that sample under an SRS.
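
As a quick check on these probabilities, we can enumerate every possible sample for the small example above. The snippet below is only a sketch: the helper `prob` is a name introduced here for illustration, and it relies on the fact that every pair of clusters is equally likely under an SRS of clusters.

```python
from fractions import Fraction
from itertools import combinations

# The four clusters from the example above
clusters = [("A", "B"), ("C", "D"), ("E", "F"), ("G",)]

# An SRS of 2 clusters makes every pair of clusters equally likely;
# the sample is the union of the units in the chosen clusters
samples = [set(c1) | set(c2) for c1, c2 in combinations(clusters, 2)]

def prob(*units):
    """Chance that all of the given units appear in the sample together."""
    hits = sum(all(u in s for u in units) for s in samples)
    return Fraction(hits, len(samples))

print(len(samples))      # 6
print(prob("A"))         # 1/2
print(prob("A", "B"))    # 1/2
print(prob("A", "C"))    # 1/6
```

Changing `clusters` changes these probabilities, which is one way to see how strongly the sampling distribution depends on the clustering.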

Why use cluster sampling? Cluster sampling makes sample collection easier. For example, it is much easier to poll 100 homes of 2-4 people each than to poll 300 individuals scattered across a city. This is why many polling agencies use forms of cluster sampling to conduct surveys.


## Systematic Sampling

In systematic sampling, the population is ordered in a list, the first unit is selected at random from the first $k$ elements on the list, and then every $k^{th}$ unit after that is placed in the sample. Again, as a simple example, suppose our population of $6$ individuals is ordered alphabetically, and we select one of the first three, $A, B, C$, at random and then every third element on the list after that. Since our list only has $6$ elements, our samples are simply:

$$\left(A,D\right)~ \left(B, E\right)~ \left(C, F\right).$$

Again, these samples are equally likely, so each has chance $1/3$. We can compute the probability that $A$ is in our sample:

$${\mathbb{P}}\left(\textrm{A in sample} \right) = \frac{1}{3}$$

With this sampling scheme:

$${\mathbb{P}}\left(\textrm{A and D in sample} \right) = \frac{1}{3}\\ {\mathbb{P}}\left(\textrm{A and C in sample}\right) = 0$$

In fact, the chance that $A$ and any letter other than $D$ are both in the sample is $0$.

You might have noticed that systematic sampling is a special case of cluster sampling: the clusters are $(A, D)$, $(B, E)$, and $(C, F)$, and we take an SRS of one cluster.
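
The same kind of enumeration works for the systematic sampling example. This is a minimal sketch that assumes the six-letter population and $k = 3$ from above; `prob` is again a helper name introduced only for illustration.

```python
from fractions import Fraction

population = list("ABCDEF")  # units in list order
k = 3                        # sampling interval

# One possible sample per random start among the first k units:
# take the start unit and then every k-th unit after it
samples = [population[start::k] for start in range(k)]

def prob(*units):
    """Chance that all of the given units appear in the sample together."""
    hits = sum(all(u in s for u in units) for s in samples)
    return Fraction(hits, len(samples))

print(samples)           # [['A', 'D'], ['B', 'E'], ['C', 'F']]
print(prob("A"))         # 1/3
print(prob("A", "D"))    # 1/3
print(prob("A", "C"))    # 0
```

Each entry of `samples` behaves like one cluster, and choosing the random start amounts to choosing a single cluster at random.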

## Intercept surveys on the web

When a popup window asks you to complete a brief questionnaire, you are being asked to take part in an intercept survey. These surveys frequently use systematic sampling: every $k^{th}$ visitor to a website is asked to complete a brief survey. Here the "list" of population units is simply the ordering of visitors to the site. It's reasonable to imagine that this sort of ordering wouldn't introduce a selection bias in the sampling process. But how often do you click "No thanks" or close the popup window? Non-response bias in these surveys is difficult to assess; we might imagine that those who take the time to answer the questions are either very dissatisfied or very satisfied with their customer experience. And what if you use a popup blocker? You will never see, let alone answer, any of these surveys. This type of non-response can be even more difficult to get a handle on.
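
To get a feel for how non-response could distort an intercept survey, here is a small simulation sketch. Everything in it is hypothetical: the number of visitors, the 1-5 satisfaction scores, and the response rates (50% for visitors with extreme scores, 10% for everyone else) are made-up assumptions chosen only to illustrate the mechanism, not estimates from any real survey.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: visitors in arrival order, each with a satisfaction
# score from 1 (very dissatisfied) to 5 (very satisfied)
n_visitors = 20_000
scores = [rng.randint(1, 5) for _ in range(n_visitors)]

# Systematic selection: random start among the first k visitors, then every k-th
k = 10
start = rng.randrange(k)
invited = scores[start::k]

def responds(score):
    """Assumed behavior: extreme scores answer far more often than moderate ones."""
    p = 0.5 if score in (1, 5) else 0.1
    return rng.random() < p

respondents = [s for s in invited if responds(s)]

def share_extreme(values):
    """Fraction of scores that are 1 or 5."""
    return sum(v in (1, 5) for v in values) / len(values)

invited_share = share_extreme(invited)
respondent_share = share_extreme(respondents)

print(f"invited: {len(invited)}, responded: {len(respondents)}")
print(f"extreme scores among invited:     {invited_share:.2f}")
print(f"extreme scores among respondents: {respondent_share:.2f}")
```

Under these assumptions the invited visitors resemble the population, but the respondents over-represent the extremes, even though the systematic selection itself was unbiased.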

## Wikipedia and a simulation study
