# Traditional Confidence Interval

Build a confidence interval using the sampling distribution of the statistic that best estimates the parameter of interest. In this case, I used the difference in sample mean heights between coffee drinkers and non-drinkers to estimate the difference in population mean heights.

In [2]:
# Import all packages and set plots to be embedded inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(42)

# load dataset in dataframe from csv file
coffee_full = pd.read_csv('coffee_dataset.csv')
coffee_red = coffee_full.sample(200)

In [3]:
# Bootstrap sampling distribution on sample of coffee dataset
diff = []

for _ in range(10000):
    bootsample = coffee_red.sample(200, replace=True)
    mean_coff = bootsample[bootsample['drinks_coffee'] == True]['height'].mean()
    mean_nocoff = bootsample[bootsample['drinks_coffee'] == False]['height'].mean()
    diff.append(mean_coff - mean_nocoff)

# 95% confidence interval: cut off 2.5% in each tail of the bootstrap distribution
np.percentile(diff, 2.5), np.percentile(diff, 97.5)

(0.39656867909086274, 2.2432588681124224)

In [4]:
# Create a 95% confidence interval using the statsmodels CompareMeans class
import statsmodels.stats.api as sms

X1 = coffee_red[coffee_red['drinks_coffee'] == True]['height'] 
X2 = coffee_red[coffee_red['drinks_coffee'] == False]['height']

cm = sms.CompareMeans(sms.DescrStatsW(X1), sms.DescrStatsW(X2))
cm.tconfint_diff(usevar='unequal')

(0.39600106159185644, 2.273413157022891)
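
As a rough sketch of what the traditional interval does, the block below computes an unequal-variance (Welch) t-interval for a difference in means by hand. It assumes SciPy is available for the t critical value, and it uses synthetic heights as a stand-in for `coffee_dataset.csv`, so the numbers will not match the output above. The function name `welch_ci` and the simulated means and spreads are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from scipy import stats

def welch_ci(x1, x2, alpha=0.05):
    """Two-sided (1 - alpha) interval for mean(x1) - mean(x2), unequal variances."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # Squared standard error of each sample mean
    v1 = x1.var(ddof=1) / x1.size
    v2 = x2.var(ddof=1) / x2.size
    se = np.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (x1.size - 1) + v2 ** 2 / (x2.size - 1))
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    diff = x1.mean() - x2.mean()
    return diff - t_crit * se, diff + t_crit * se

# Synthetic stand-in data (assumed values, not the coffee dataset)
rng = np.random.default_rng(42)
drinkers = rng.normal(68.0, 3.0, size=120)
non_drinkers = rng.normal(66.5, 3.0, size=80)

lower, upper = welch_ci(drinkers, non_drinkers)
print(lower, upper)
```

The interval is centered on the observed difference in sample means, with a half-width equal to the t critical value times the standard error of that difference.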

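With this 95% confidence interval, we can be 95% confident that the difference in population mean heights between coffee drinkers and non-drinkers falls between the lower and upper bounds. The bootstrap interval and the statsmodels interval are close: roughly (0.40, 2.24) versus (0.40, 2.27). Notice that both the confidence level and the parameter can change depending on what you are building your confidence interval for and what percentage you cut off in each tail.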
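
To illustrate how the percentage cut off in each tail sets the confidence level, here is a minimal sketch that builds 90%, 95%, and 99% bootstrap percentile intervals from the same bootstrap distribution. The data are synthetic (an assumed 200-row sample with a boolean coffee flag and a height), and the helper `percentile_ci` is a hypothetical name introduced only for this example.

```python
import numpy as np

def percentile_ci(boot_stats, confidence=0.95):
    """Percentile interval: cut off (1 - confidence) / 2 in each tail."""
    tail = (1 - confidence) / 2 * 100
    return np.percentile(boot_stats, tail), np.percentile(boot_stats, 100 - tail)

# Synthetic stand-in for the 200-row sample (assumed values)
rng = np.random.default_rng(42)
n = 200
drinks_coffee = rng.random(n) < 0.6
height = np.where(drinks_coffee,
                  rng.normal(68.0, 3.0, n),
                  rng.normal(66.5, 3.0, n))

# Resample whole rows with replacement, then split by group
boot_diffs = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    d, h = drinks_coffee[idx], height[idx]
    boot_diffs.append(h[d].mean() - h[~d].mean())

ci_90 = percentile_ci(boot_diffs, 0.90)
ci_95 = percentile_ci(boot_diffs, 0.95)
ci_99 = percentile_ci(boot_diffs, 0.99)
print(ci_90, ci_95, ci_99)
```

Cutting off less in each tail gives a higher confidence level and a wider interval, so the 99% interval contains the 95% interval, which in turn contains the 90% interval.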