# KEN1435 - Principles of Data Science: Lecture 6

First, we load the necessary Python packages.

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from palmerpenguins import load_penguins
#import watermark

tab10 = plt.get_cmap("tab10").colors

%matplotlib inline
#%load_ext watermark
#%watermark -v -iv

# Accuracy of Percentages

We will use the box model again, but now we go the other way around. We have draws from the box, but we don't know the composition of the box. The statistical task is to guess the composition of the box from the draws. Since the draws are random, we will not provide just one number. Instead, we will provide a range of plausible numbers. This leads to a new concept: the confidence interval. Here is how to build confidence intervals for percentages.

Let's look at an example from politics. From [Wikipedia](https://en.wikipedia.org/wiki/Voter_turnout_in_United_States_presidential_elections):

```
Approximately 240 million people were eligible to vote in the 2020 
presidential election and roughly 66.1% of them submitted ballots, 
totaling about 158 million. Biden received about 81 million votes, Trump 
about 74 million votes, and other candidates (including Jo Jorgensen 
and Howie Hawkins) a combined approximately 3 million votes.
```

In polling, we want to guess the outcome of an election before it happens.

In [3]:
votes_for_biden = np.repeat("Biden", 81 * 1000000)
votes_for_trump = np.repeat("Trump", 74 * 1000000)
box = np.concatenate([votes_for_biden, votes_for_trump])

In [5]:
n = 400
draws = np.random.choice(box, size=n, replace=False)

# Fraction of the sampled tickets that are votes for Biden
biden_wins = (draws == "Biden").mean()
biden_wins

0.475
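Building the box as an array of 155 million strings can take a lot of memory. As an alternative sketch, assuming we only need the number of Biden tickets in the sample, the count can be drawn directly with NumPy's `Generator.hypergeometric`, which models draws without replacement from a box with two kinds of tickets. The seed and the variable names below are arbitrary choices for illustration.

```python
import numpy as np

# Box composition taken from the text: 81 million Biden and 74 million Trump tickets.
n_biden = 81 * 1_000_000
n_trump = 74 * 1_000_000
n = 400  # poll size, as in the cell above

# The seed is an arbitrary choice so that the sketch is reproducible.
rng = np.random.default_rng(seed=0)

# Number of Biden tickets among n draws made without replacement.
biden_count = rng.hypergeometric(ngood=n_biden, nbad=n_trump, nsample=n)
sample_fraction = biden_count / n
print(sample_fraction)
```

The result is a single simulated poll, just like the cell above, but no large array is created.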

How accurate is this percentage? The usual chance error applies here:
$$
\text{sample percentage } = \text{population percentage } + \text{chance error}.
$$
If we make the range around the sample percentage large enough, then we can hope to cover the population percentage. From earlier lectures we know that the range within $2$ standard errors of the expected value covers around 95% of the possible values. So it seems natural to define an interval using standard errors:
$$
\text{sample percentage } \pm 2 \times \text{SE}.
$$
The number $2$ is just one possible choice of multiplier. If we choose a larger number, the interval gets wider and the chance that it covers the population percentage increases, too. Let's calculate the 95% confidence interval for the polling example above.
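To see how the multiplier affects the chance of covering the population percentage, here is a small sketch. It assumes the chance error follows the normal curve, which is the approximation behind the "2 SE covers about 95%" rule. The helper `normal_coverage` is a name introduced here for illustration.

```python
from math import erf, sqrt


def normal_coverage(k):
    """Area under the standard normal curve between -k and +k."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SE: {100 * normal_coverage(k):.1f}%")
# within 1 SE: 68.3%
# within 2 SE: 95.4%
# within 3 SE: 99.7%
```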

We already calculated the sample percentage.

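What's left is to calculate the SE. Think of each Biden ticket as a $1$ and each Trump ticket as a $0$, so that the sum of the draws counts the Biden votes in the sample. Recall the formula for the SE of the sum from a previous lecture,
$$
\text{SE for the sum} = \sqrt{\text{number of draws}} \times \text{(SD of box)}.
$$
To turn this into the SE for the percentage, divide by the number of draws and express the result as a percentage:
$$
\text{SE for the percentage} = \frac{\text{SE for the sum}}{\text{number of draws}} \times 100\%.
$$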
The new challenge is that we don't know the composition of the box. It turns out that we can use the sample to guess the composition of the box. That's usually a good guess even with moderate sample sizes.
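A minimal sketch of this calculation for the poll above, assuming the box is coded as 0-1 tickets (1 for Biden, 0 for Trump) and that the SD of such a box is $\sqrt{p(1-p)}$, where $p$ is the fraction of 1s. Since the true $p$ is unknown, the sample fraction is plugged in instead. The variable names are chosen here for illustration.

```python
from math import sqrt

n = 400                  # number of draws in the poll
sample_fraction = 0.475  # fraction of Biden tickets in the sample above

# Plug-in estimate: SD of a 0-1 box whose fraction of 1s equals the sample fraction.
sd_box_estimate = sqrt(sample_fraction * (1 - sample_fraction))

se_sum = sqrt(n) * sd_box_estimate  # SE for the number of Biden tickets in the sample
se_percentage = se_sum / n * 100    # SE for the sample percentage

print(f"estimated SD of box: {sd_box_estimate:.3f}")
print(f"SE for the percentage: {se_percentage:.1f} percentage points")
# estimated SD of box: 0.499
# SE for the percentage: 2.5 percentage points
```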

The formula for the SD of a box that contains only 0s and 1s, $\sqrt{p(1-p)}$ with $p$ the fraction of 1s, may seem to come out of the blue. You will see a formal mathematical derivation in the Probability and Statistics course in your second year.
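Until then, the shortcut $\sqrt{p(1-p)}$ for the SD of a 0-1 box can at least be checked numerically. The sketch below uses a scaled-down box with 81 tickets marked 1 and 74 tickets marked 0, which has the same fractions as the election box. This small box is an assumption made here to keep the example light.

```python
import numpy as np

# Scaled-down 0-1 box: 81 ones (Biden) and 74 zeros (Trump).
small_box = np.repeat([1, 0], [81, 74])

p = small_box.mean()                # fraction of 1s in the box
sd_direct = small_box.std()         # SD computed from the definition (population SD)
sd_shortcut = np.sqrt(p * (1 - p))  # shortcut formula for a 0-1 box

print(sd_direct, sd_shortcut)
```

Both numbers agree up to floating-point rounding.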

The confidence interval looks like this:
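A minimal sketch of the calculation, assuming the plug-in SD $\sqrt{p(1-p)}$ with $p$ replaced by the sample fraction and a multiplier of $2$ for roughly 95% confidence. The function name `confidence_interval` is introduced here for illustration.

```python
from math import sqrt


def confidence_interval(sample_fraction, n, multiplier=2):
    """Approximate confidence interval for a percentage, returned in percent."""
    sd_box = sqrt(sample_fraction * (1 - sample_fraction))  # plug-in SD of the 0-1 box
    se_percentage = sd_box / sqrt(n) * 100                  # SE for the sample percentage
    center = sample_fraction * 100
    return center - multiplier * se_percentage, center + multiplier * se_percentage


lower, upper = confidence_interval(0.475, 400)
print(f"95% confidence interval: {lower:.1f}% to {upper:.1f}%")
# 95% confidence interval: 42.5% to 52.5%
```

In the simulated box the Biden percentage is $81/155 \approx 52.3\%$, so this particular interval covers the population percentage, if only barely.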

What do confidence intervals really mean?

Step 1: repeat polling many times.

Step 2: plot intervals.

Step 3: show how the intervals relate to the population percentage. About 95% of them should cover it.

```{r}
pop <- mean(box == "Biden")
ggplot(df_ci, aes(sample_id, mid)) + 
  geom_pointrange(aes(ymin = lower, ymax = upper)) + 
  geom_hline(yintercept = pop)
```
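The R chunk above expects a data frame `df_ci` with the columns `sample_id`, `mid`, `lower` and `upper`. Here is a Python sketch of steps 1 to 3 that builds such a table and draws a similar plot. It assumes 100 repeated polls of 400 draws each, simulates the Biden count of each poll with `Generator.hypergeometric` instead of sampling from the full box, and uses an arbitrary seed.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

n_biden, n_trump = 81_000_000, 74_000_000  # box composition from the text
n = 400                                    # draws per poll
n_polls = 100                              # number of repeated polls (arbitrary choice)
rng = np.random.default_rng(seed=1)        # arbitrary seed for reproducibility

# Step 1: simulate the Biden fraction in each poll (draws without replacement).
fractions = rng.hypergeometric(n_biden, n_trump, n, size=n_polls) / n

# Step 2: build one interval per poll, sample fraction +/- 2 SE.
se = np.sqrt(fractions * (1 - fractions) / n)
df_ci = pd.DataFrame({
    "sample_id": np.arange(1, n_polls + 1),
    "mid": fractions,
    "lower": fractions - 2 * se,
    "upper": fractions + 2 * se,
})

# Step 3: plot the intervals with the population fraction as a horizontal line.
pop = n_biden / (n_biden + n_trump)
fig, ax = plt.subplots(figsize=(10, 4))
ax.errorbar(df_ci["sample_id"], df_ci["mid"], yerr=2 * se, fmt="o", markersize=3)
ax.axhline(pop, color="red")
ax.set_xlabel("sample_id")
ax.set_ylabel("fraction for Biden")
```

Intervals that do not cross the horizontal line are the ones that miss the population percentage.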

Step 4: how many cover the population percentage?
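To answer this question by simulation, the sketch below repeats the poll many times and counts the intervals that contain the population fraction. The number of repetitions (10,000), the seed, and the use of `Generator.hypergeometric` to simulate each poll are assumptions made here for illustration.

```python
import numpy as np

n_biden, n_trump = 81_000_000, 74_000_000  # box composition from the text
n = 400                                    # draws per poll
n_polls = 10_000                           # number of repeated polls (arbitrary choice)
rng = np.random.default_rng(seed=2)        # arbitrary seed for reproducibility

pop = n_biden / (n_biden + n_trump)        # population fraction for Biden

# Biden fraction in each simulated poll (draws without replacement).
fractions = rng.hypergeometric(n_biden, n_trump, n, size=n_polls) / n

# Interval for each poll: sample fraction +/- 2 SE, with the plug-in SE.
se = np.sqrt(fractions * (1 - fractions) / n)
covered = (fractions - 2 * se <= pop) & (pop <= fractions + 2 * se)

coverage = covered.mean()
print(f"{100 * coverage:.1f}% of the intervals cover the population percentage")
```

The printed share should be close to 95%. It varies a little from run to run if the seed is changed.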

# Accuracy of Averages

So far we've been working with sums and percentages. The same ideas also carry over to averages. Here is an example using the box model.
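A minimal sketch, assuming a hypothetical box of 10,000 numbered tickets with made-up values (not real data). Dividing the SE for the sum by the number of draws gives the SE for the average, $\text{(SD of box)} / \sqrt{\text{number of draws}}$, and the unknown SD of the box is estimated by the SD of the sample.

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # arbitrary seed for reproducibility

# Hypothetical box: 10,000 numbered tickets with made-up values.
box = rng.normal(loc=50, scale=10, size=10_000)

n = 100
draws = rng.choice(box, size=n, replace=False)

sample_average = draws.mean()
sd_estimate = draws.std()              # SD of the sample, used as a guess for the SD of the box
se_average = sd_estimate / np.sqrt(n)  # SE for the sample average

lower = sample_average - 2 * se_average
upper = sample_average + 2 * se_average
print(f"sample average: {sample_average:.2f}")
print(f"approximate 95% confidence interval: {lower:.2f} to {upper:.2f}")
print(f"population average: {box.mean():.2f}")
```

As with percentages, about 95% of intervals built this way should cover the population average when the sampling is repeated.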

# Summary

We learned about:

* SE for percentages.
* confidence intervals for percentages.
* SE for averages.