## Statistical Inference with Confidence Intervals

Throughout week 2, we have explored the concept of confidence intervals: how to calculate and interpret them, and what *confidence* really means.

In this tutorial, we're going to review how to calculate confidence intervals for population proportions and means.

To begin, let's review some of the material from this week and consider once more why confidence intervals are useful tools for deriving insights from data.

First, recall the setting in which nearly all statistical analysis takes place -- the data are a sample from a population, and that population can be described in terms of various numerical parameters.  Using the data, we can estimate a parameter of interest.  For example, suppose that the parameter of interest is the mean credit card debt for all people residing in the United States.  We may call this (unknown) parameter $\theta$.  Using a sample of data, we estimate this parameter, say using the sample mean (average) of the credit card debts for all people in our sample.  We can denote this estimate by $\hat{\theta}$.  We know that $\hat{\theta}$ is not exactly equal to $\theta$, but can we somehow convey which values for $\theta$ could potentially be the actual value?  This is the goal of a confidence interval.

### Why Confidence Intervals?

A confidence interval is a range calculated around a parameter estimate (a statistic) that contains the values of the parameter that are consistent with the data at a chosen confidence level.  For example, in the lecture, we estimated, with 95% confidence, that the population proportion of parents with a toddler who use a car seat for all travel with their toddler was somewhere between 82.2% and 87.7%.  This interval is a random quantity because it is derived from a single random sample of data.  If we were to obtain a different sample from the same population, we would obtain a different interval.

The key property of a confidence interval is that if we were to repeatedly sample from the population, calculating a 95% confidence interval from each sample, then approximately 95% of the calculated intervals would contain ("cover") the true proportion.
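The coverage property can be checked with a small simulation. The sketch below is illustrative only: the "true" proportion `p_true`, the number of repetitions `reps`, and the seed are assumed values rather than quantities from the lecture, and it uses the large-sample interval with a multiplier of 1.96.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

p_true = 0.85   # assumed "true" population proportion (hypothetical)
n = 659         # size of each simulated sample
reps = 2000     # number of repeated samples

# Draw all samples at once; each entry is the number of "successes" in one sample.
counts = rng.binomial(n, p_true, size=reps)
p_hat = counts / n

# 95% interval for each simulated sample: estimate +/- 1.96 * standard error.
se = np.sqrt(p_hat * (1 - p_hat) / n)
lcb = p_hat - 1.96 * se
ucb = p_hat + 1.96 * se

# Fraction of intervals that contain the true proportion.
coverage = np.mean((lcb <= p_true) & (p_true <= ucb))
print(coverage)
```

The printed fraction should land close to 0.95, though not exactly on it, because the simulation is itself random and the interval is a large-sample approximation.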

### How are Confidence Intervals Calculated?

A confidence interval for a population proportion or mean can be calculated as follows:

$${\rm Best\ Estimate} \pm {\rm Margin\ of\ Error}$$

Here, the *Best Estimate* is the **observed sample proportion or mean** and the *Margin of Error* is the **t-multiplier** times the **standard error**.

The t-multiplier is calculated based on the degrees of freedom and desired confidence level.  For samples with more than 30 observations and a confidence level of 95%, we can use the z-multiplier of 1.96 instead of the t-multiplier.  The z-multiplier has the advantage of not depending on any "degrees of freedom".
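To see how the t-multiplier approaches 1.96 as the sample size grows, the multipliers can be looked up numerically. This is a minimal sketch that assumes SciPy is installed (it is not imported elsewhere in this tutorial); `t_multiplier` is a hypothetical helper name introduced only for illustration.

```python
from scipy import stats

def t_multiplier(n, confidence=0.95):
    """Two-sided t-multiplier for a sample of size n (n - 1 degrees of freedom)."""
    tail = (1 - confidence) / 2
    return stats.t.ppf(1 - tail, df=n - 1)

# The z-multiplier depends only on the confidence level, not on the sample size.
z_multiplier = stats.norm.ppf(0.975)

for n in (5, 25, 30, 100, 1000):
    print(n, round(t_multiplier(n), 3), round(z_multiplier, 3))
```

For small samples the t-multiplier is noticeably larger than the z-multiplier, which makes the interval wider; the gap shrinks as the number of observations increases.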

The equation to create a 95% confidence interval can also be shown as:

$${\rm Sample\ Proportion\ or\ Mean} \pm (\text{t-multiplier} \cdot {\rm Standard\ Error})$$

Lastly, the standard error is calculated differently for a proportion and for a mean:

$${\rm Standard\ Error \ for\ Proportion} = \sqrt{\frac{{\rm Sample\ Proportion} \cdot (1 - {\rm Sample\ Proportion})}{{\rm Number\ Of\ Observations}}}$$

$${\rm Standard\ Error \ for\ Mean} = \frac{{\rm Sample\ Standard\ Deviation}}{\sqrt{{\rm Number\ Of\ Observations}}}$$

Let's replicate the car seat example from the course lecture:

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

We have a sample of 659 people with a toddler, and 85% of these parents use a car seat all of the time. This point estimate (85%) is not exactly equal to the population proportion of parents who use a car seat.  The standard error (SE) conveys the likely error in the point estimate relative to the population value.  We calculate this standard error next, using the procedure for calculating a standard error of a proportion.

In [2]:
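tstar = 1.96  # multiplier for 95% confidence (z-value, appropriate for this large sample)
p = .85       # sample proportion of parents who always use a car seat
n = 659       # sample size

# Standard error of a proportion
se = np.sqrt((p * (1 - p))/n)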
se

0.01390952774409444

The standard error is 0.014, or 1.4 percentage points.  Thus, our point estimate (85%) is likely to be around 1.4 percentage points from the truth.

Next we compute a confidence interval for the proportion of parents of toddlers who always use a car seat.  A confidence interval is defined in terms of its "lower confidence bound" (lcb) and "upper confidence bound" (ucb).

In [3]:
lcb = p - tstar * se
ucb = p + tstar * se
(lcb, ucb)

(0.8227373256215749, 0.8772626743784251)

We don't need to compute the confidence interval from the formula by hand; statsmodels can calculate it for us:

In [5]:
sm.stats.proportion_confint(n * p, n)

(0.8227378265796143, 0.8772621734203857)
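The two results agree only to about six decimal places. The difference is consistent with the hand calculation rounding the multiplier to 1.96, while a normal-based interval uses the more precise 97.5th percentile of the standard normal distribution. A minimal check, using only the Python standard library and repeating the values of `p` and `n` from above:

```python
from math import sqrt
from statistics import NormalDist

p = 0.85
n = 659
se = sqrt(p * (1 - p) / n)

# 97.5th percentile of the standard normal: about 1.959964 rather than 1.96.
z_exact = NormalDist().inv_cdf(0.975)

rounded = (p - 1.96 * se, p + 1.96 * se)        # the hand calculation above
precise = (p - z_exact * se, p + z_exact * se)  # unrounded multiplier
print(rounded)
print(precise)
```

The second tuple should match the statsmodels output closely, which suggests the rounding of the multiplier accounts for the discrepancy.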

Now, let's take our Cartwheel dataset introduced in lecture and calculate a confidence interval for our mean cartwheel distance:

In [6]:
df = pd.read_csv("Cartwheeldata.csv")

In [7]:
df.head()

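|   | ID | Age | Gender | GenderGroup | Glasses | GlassesGroup | Height | Wingspan | CWDistance | Complete | CompleteGroup | Score |
|---|----|-----|--------|-------------|---------|--------------|--------|----------|------------|----------|---------------|-------|
| 0 | 1 | 56 | F | 1 | Y | 1 | 62.0 | 61.0 | 79 | Y | 1 | 7 |
| 1 | 2 | 26 | F | 1 | Y | 1 | 62.0 | 60.0 | 70 | Y | 1 | 8 |
| 2 | 3 | 33 | F | 1 | Y | 1 | 66.0 | 64.0 | 85 | Y | 1 | 7 |
| 3 | 4 | 39 | F | 1 | N | 0 | 64.0 | 63.0 | 87 | Y | 1 | 10 |
| 4 | 5 | 27 | M | 2 | N | 0 | 73.0 | 75.0 | 72 | N | 0 | 4 |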


In [8]:
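mean = df["CWDistance"].mean()  # sample mean cartwheel distance
sd = df["CWDistance"].std()     # sample standard deviation (pandas divides by n - 1)
n = len(df)                     # number of observations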
n

25

In [9]:
tstar = 2.064  # t-multiplier for 95% confidence with n - 1 = 24 degrees of freedom

# Standard error of the mean
se = sd/np.sqrt(n)
se

3.0117104774529713

In [10]:
lcb = mean - tstar * se
ucb = mean + tstar * se
(lcb, ucb)

(76.26382957453707, 88.69617042546294)

In [11]:
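# zconfint_mean() uses the normal (z) multiplier, so this interval is slightly
# narrower than the t-based one above; tconfint_mean() is the t-based counterpart.
sm.stats.DescrStatsW(df["CWDistance"]).zconfint_mean()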

(76.57715593233026, 88.38284406766975)
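The statsmodels interval is slightly narrower than the hand-calculated one because `zconfint_mean` is based on the normal distribution rather than the t distribution. The sketch below reproduces both intervals from the summary values printed above, using only the standard library. The mean of 82.48 is not printed in the notebook; it is inferred here as the midpoint of the hand-calculated interval, which is an assumption of this sketch.

```python
from statistics import NormalDist

mean = 82.48               # inferred: midpoint of the interval printed above
se = 3.0117104774529713    # standard error printed above
tstar = 2.064              # t-multiplier used above (24 degrees of freedom)
zstar = NormalDist().inv_cdf(0.975)  # normal multiplier, about 1.96

t_interval = (mean - tstar * se, mean + tstar * se)
z_interval = (mean - zstar * se, mean + zstar * se)

print(t_interval)  # wider: uses the larger t-multiplier for a sample of 25
print(z_interval)  # narrower: uses the normal multiplier
```

With only 25 observations, the t-based interval is usually the more appropriate choice, since the z-multiplier is recommended above for samples with more than 30 observations.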