# Confidence intervals

Before getting started, let's load some packages.

In [1]:
## load packages
import numpy as np
import pandas as pd

## Sampling the Arsenal Stadium ⚽

Here's a hypothetical study using simulated data to show how estimates from samples differ from the true population value. It'll help us understand confidence intervals.

We want to know the average height of every adult Arsenal fan attending the Emirates next Sunday. Excluding under 18s and away fans, that's about 50,000 adults.

## The population

Remember this is all hypothetical! We can simulate the height of all 50,000 adults and conveniently save it in a dataframe called `population`. We set the mean height to 175 cm and the standard deviation to 30 cm.

In [33]:
## set mean height, sd height, number of adults
mean, sd, size = 175, 30, 50000

## create dataframe with two columns
## person: numbers from 1 to 50000
## height: normal distribution using mean, sd, and size
population = pd.DataFrame(
    dict(
        person = range(1, size+1),
        height = np.random.normal(loc = mean, scale = sd, size = size)),
    columns=['person', 'height'])
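The cell above draws fresh random heights on every run, so the numbers printed in this notebook will not repeat exactly. As a minimal sketch, assuming you want repeatable results, the same simulation can be driven by a seeded generator. The seed value `42` and the name `population_seeded` are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed (42 is arbitrary) so the simulated heights repeat across runs
rng = np.random.default_rng(seed=42)

# same settings as the notebook: mean height, sd height, number of adults
mean, sd, size = 175, 30, 50000

# hypothetical name, kept separate from the notebook's `population`
population_seeded = pd.DataFrame({
    "person": range(1, size + 1),
    "height": rng.normal(loc=mean, scale=sd, size=size),
})

print(population_seeded.shape)
```

Re-running this block should give the same dataframe each time, which makes it easier to compare results with someone else.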

Before taking a sample from our population, let's check a few things. Does our dataframe `population` have the two columns `person` and `height`, and how many rows does it have?

In [34]:
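## see first 6 rows
## (a notebook only displays a cell's last expression, so wrap this in print() to see it here)
population.head(n = 6)

## get shape of population: (rows, columns)
print(population.shape)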

(50000, 2)


Now, what is the mean height of the population?

In [35]:
## get mean of columns and round to 1 d.p
round(np.mean(population, axis = 0), ndigits = 1)

person    25000.5
height      174.9
dtype: float64

## Samples

Now in real life we could run a study to estimate the height of adult Arsenal fans. Realistically, we could stand outside the stadium and measure 100 adults. Let's do that with code, saving our 100 adults in an object called `sample001`.

In [36]:
## sample 100 people and save this into object called 'sample001'
sample001 = population.sample(n = 100)

Let's check the number of rows in this sample then calculate the mean height.

**Think:** Would you expect the sample mean and the population mean to be identical?

In [37]:
## get shape of sample001: (rows, columns)
print(sample001.shape)

## get mean of columns and round to 1 d.p
round(np.mean(sample001, axis = 0), ndigits = 1)

(100, 2)


person    26798.1
height      176.0
dtype: float64

Let's take another sample, called `sample002`, and calculate the mean height.

**Think:** Would you expect the mean of another sample to be the same as that of the first sample?

In [38]:
## repeat sampling step above
## sample 100 people and save this into object called 'sample002'
sample002 = population.sample(n = 100)

## get mean of columns and round to 1 d.p
round(np.mean(sample002, axis = 0), ndigits = 1)

person    26652.2
height      177.1
dtype: float64
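Two samples already give two different means. To see how much sample means vary in general, here is a rough sketch that repeats the sampling step many times and looks at the spread of the resulting means. It rebuilds the heights with a seeded generator so it runs on its own. The seed, the number of repeats (`n_samples = 1000`), and the variable names are assumptions made for illustration.

```python
import numpy as np

# Assumption: arbitrary seed, used only to make this sketch repeatable
rng = np.random.default_rng(seed=1)

# same population settings as the notebook
mean, sd, size = 175, 30, 50000
heights = rng.normal(loc=mean, scale=sd, size=size)

# draw many samples of 100 adults (without replacement) and keep each sample mean
n, n_samples = 100, 1000
sample_means = np.array([
    rng.choice(heights, size=n, replace=False).mean()
    for _ in range(n_samples)
])

# centre and spread of the sample means
print(round(sample_means.mean(), 1))
print(round(sample_means.std(ddof=1), 2))

# theoretical standard error of the mean: sd / sqrt(n)
print(round(sd / np.sqrt(n), 2))
```

The spread of the sample means should sit close to `sd / sqrt(n)`, which is the quantity a confidence interval is built on.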

## Calculating 95% confidence intervals

Let's calculate a confidence interval (CI) by hand to see how it communicates uncertainty.

We will do this for our `sample001`.
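For reference, the calculation follows the usual normal-approximation form of a 95% confidence interval for a mean, where $\bar{x}$ is the sample mean, $s$ is the standard deviation of the sample, and $n$ is the sample size:

$$
\text{95\% CI} = \bar{x} \pm 1.96 \times \frac{s}{\sqrt{n}}
$$

The term $s / \sqrt{n}$ is the standard error of the mean, and 1.96 is the standard normal value that leaves 2.5% in each tail.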




In [39]:
## save sample size in object n
n = 100

## calculate sample mean and save in object x bar
x_bar = np.mean(sample001["height"])

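## calculate standard error
## note: np.std defaults to ddof=0 (population formula);
## pass ddof=1 for the sample standard deviation (slightly wider interval)
se = np.std(sample001["height"])/np.sqrt(n)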

## calculate lower and upper limits
lower = x_bar - (1.96 * se)
upper = x_bar + (1.96 * se)

## print limits
print(lower, upper)

170.01738820176416 181.92541662063263
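One way to read the "95%" is that, if we repeated the study many times, about 95% of the intervals built this way would contain the population mean. The sketch below checks this by simulation. It is self-contained and makes a few assumptions of its own: the seed, the helper name `ci95`, the number of repeats (`n_trials = 2000`), and the use of `ddof=1` for the sample standard deviation.

```python
import numpy as np

# Assumption: arbitrary seed, used only to make this sketch repeatable
rng = np.random.default_rng(seed=7)

# same population settings as the notebook
mean, sd, size = 175, 30, 50000
heights = rng.normal(loc=mean, scale=sd, size=size)
pop_mean = heights.mean()

def ci95(sample):
    """Return (lower, upper) of a normal-approximation 95% CI for the mean."""
    n = len(sample)
    x_bar = np.mean(sample)
    se = np.std(sample, ddof=1) / np.sqrt(n)  # sample SD, hence ddof=1
    return x_bar - 1.96 * se, x_bar + 1.96 * se

# repeat the study many times and count how often the CI contains the population mean
n_trials = 2000
hits = 0
for _ in range(n_trials):
    lower, upper = ci95(rng.choice(heights, size=100, replace=False))
    hits += int(lower <= pop_mean <= upper)

coverage = hits / n_trials
print(coverage)
```

The printed proportion should land near 0.95, though it will wobble a little from run to run if the seed is changed.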
