<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Confidence-Intervals" data-toc-modified-id="Confidence-Intervals-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Confidence Intervals</a></span></li><li><span><a href="#Interpreting-Confidence-Intervals" data-toc-modified-id="Interpreting-Confidence-Intervals-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Interpreting Confidence Intervals</a></span><ul class="toc-item"><li><span><a href="#Interpretation-1-(incorrect)" data-toc-modified-id="Interpretation-1-(incorrect)-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Interpretation 1 (incorrect)</a></span></li><li><span><a href="#Interpretation-2-(correct)" data-toc-modified-id="Interpretation-2-(correct)-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Interpretation 2 (correct)</a></span></li></ul></li><li><span><a href="#Code" data-toc-modified-id="Code-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Code</a></span><ul class="toc-item"><li><span><a href="#Finding-confidence-interval" data-toc-modified-id="Finding-confidence-interval-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Finding confidence interval</a></span></li></ul></li></ul></div>

# Confidence Intervals

A confidence interval is a range of values, calculated from a sample, that estimates a population parameter (here, the mean) at a stated level of confidence.
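For reference, one common textbook form of a confidence interval for a mean is sketched below. This is an assumption added for context: the code in this notebook builds its interval by bootstrapping rather than from this formula.

$$
\bar{x} \pm z^{*} \cdot \frac{s}{\sqrt{n}}
$$

Here $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $z^{*} \approx 1.96$ for a 95% interval under a normal approximation.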

# Interpreting Confidence Intervals

Consider a statement such as:

"We found our 95% confidence interval for ages to be from 26.3 to 28.3."

or

"We are 95% confident that the average age falls between 26.3 and 28.3."

## Interpretation 1 (incorrect)

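> There is a 95% probability that the mean age is between 26.3 and 28.3

This is incorrect because the population mean is a fixed (unknown) number, not a random quantity. Once an interval has been computed, it either contains the mean or it does not; the 95% describes the procedure that produces intervals, not this one interval.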

## Interpretation 2 (correct)

> If we draw 100 random samples and build a 95% confidence interval from each one, we expect about 95 of those intervals to contain the true population mean age.
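The second interpretation can be checked with a small simulation. The sketch below is only illustrative: the population is a hypothetical normally distributed set of ages (not the Titanic data), and `normal_ci` and `coverage` are helper names made up for this example. It builds many intervals from independent samples and reports the fraction that contain the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of ages (an assumption for illustration only)
population = rng.normal(loc=30, scale=14, size=100_000)
true_mean = population.mean()

def normal_ci(sample, z=1.96):
    """Approximate 95% CI for the mean using a normal approximation."""
    sample = np.asarray(sample)
    se = sample.std(ddof=1) / np.sqrt(len(sample))
    return sample.mean() - z * se, sample.mean() + z * se

def coverage(population, true_mean, n=50, n_intervals=1000, rng=rng):
    """Fraction of intervals, one per random sample, that contain true_mean."""
    hits = 0
    for _ in range(n_intervals):
        # Draw one random sample of size n (with replacement)
        sample = rng.choice(population, size=n)
        low, high = normal_ci(sample)
        hits += int(low <= true_mean <= high)
    return hits / n_intervals

print(coverage(population, true_mean))
```

The printed fraction should land close to 0.95, though it varies a little from run to run and with the sample size.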

# Code 

In [None]:
import numpy as np
import pandas as pd

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
titanic_file = 'data/titanic.csv'
df_titanic = pd.read_csv(titanic_file)
ages = df_titanic.Age.dropna()

In [None]:
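# Range and standard deviation of the observed ages
# (returned as one tuple so the notebook displays all three values)
np.min(ages), np.max(ages), np.std(ages)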

In [None]:
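# distplot is deprecated in seaborn >= 0.11; histplot with a KDE is a close replacement
sns.histplot(ages, kde=True, stat='density')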

In [None]:
# Draw one random sample of 10 ages (with replacement) and show its mean
sample = ages.sample(10, replace=True)
print(sample.mean())
display(sample)


In [None]:
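def get_all_sample_means(data, n=10, n_samples=100):
    '''
    Draw `n_samples` random samples of size `n` (with replacement)
    from `data` and return the mean of each sample.
    '''
    # Each row of `samples` is one random sample of size n
    samples = np.random.choice(data, size=(n_samples, n))
    # Average across each row to get one mean per sample
    means = np.mean(samples, axis=1)
    return means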

In [None]:
# Means of 1,000 random samples of size 10 (note: `samples` holds sample means)
samples = get_all_sample_means(ages, n=10, n_samples=10**3)
samples

In [None]:
# Rug plot of the sample means drawn over the distribution of ages;
# the red line marks the mean of all ages
sns.rugplot(x=samples)
sns.histplot(ages, kde=True, stat='density')
plt.axvline(ages.mean(), color='red')

In [None]:
# Distribution of ages alone, with its mean marked in red
sns.histplot(ages, kde=True, stat='density')
plt.axvline(ages.mean(), color='red')

## Finding confidence interval

In [None]:
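def bootstrap_sample(sample, n_samples=10**4):
    '''
    Resample `sample` with replacement `n_samples` times, each time
    drawing as many values as the original sample, and return the
    mean of each bootstrap resample.
    '''
    bs_sample_means = get_all_sample_means(
        sample,
        n=len(sample),
        n_samples=n_samples
    )

    return bs_sample_means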

In [None]:
np.mean(sample)

In [None]:
b_sample_means = bootstrap_sample(sample)

In [None]:
# Distribution of the bootstrap means, with their average marked in red
sns.histplot(b_sample_means, kde=True, stat='density')
plt.axvline(b_sample_means.mean(), color='red')

In [None]:
np.mean(b_sample_means)

In [None]:
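# About 95% of a normal distribution lies within 1.96 standard deviations
# of its mean; 2 is used here as a round approximation
two_std = np.std(b_sample_means)*2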

In [None]:
# Approximate 95% confidence interval: sample mean ± 2 standard errors
(np.mean(sample)-two_std, np.mean(sample)+two_std)
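The interval above takes the sample mean plus or minus two standard deviations of the bootstrap means. A common alternative is the percentile method, which reads the 2.5th and 97.5th percentiles straight from the bootstrap means. The sketch below is not part of the original notebook: `percentile_bootstrap_ci` is a hypothetical helper, and the ten ages are made-up values standing in for `sample`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the 10-value `sample` drawn earlier
sample = np.array([22.0, 35.0, 19.0, 28.0, 41.0, 30.0, 24.0, 54.0, 18.0, 33.0])

def percentile_bootstrap_ci(sample, n_samples=10**4, confidence=0.95, rng=rng):
    """Percentile bootstrap confidence interval for the mean."""
    sample = np.asarray(sample)
    # Each row is one bootstrap resample of the same size as the sample
    resamples = rng.choice(sample, size=(n_samples, len(sample)), replace=True)
    means = resamples.mean(axis=1)
    # Cut off equal tails on both sides of the bootstrap distribution
    alpha = 1 - confidence
    low, high = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return low, high

low, high = percentile_bootstrap_ci(sample)
print(low, high)
```

When the bootstrap means are roughly symmetric the two methods tend to give similar intervals. The percentile version does not assume a normal shape.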

In [None]:
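# One standard deviation of the bootstrap means, i.e. the bootstrap
# estimate of the standard error of the mean
two_std/2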
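The quantity `two_std/2` is the standard deviation of the bootstrap means, which estimates the standard error of the sample mean. As a rough sanity check, it can be compared with the analytic value `s / sqrt(n)`. The sketch below uses made-up ages in place of `sample`, and the names `boot_se` and `analytic_se` are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the 10-value `sample` drawn earlier
sample = np.array([22.0, 35.0, 19.0, 28.0, 41.0, 30.0, 24.0, 54.0, 18.0, 33.0])
n = len(sample)

# Bootstrap: resample with replacement and take the mean of each resample
boot_means = rng.choice(sample, size=(10**4, n), replace=True).mean(axis=1)
boot_se = boot_means.std()

# Analytic standard error, using the standard deviation of the sample itself
# (ddof=0, because the bootstrap treats the sample as the whole population)
analytic_se = sample.std(ddof=0) / np.sqrt(n)

print(boot_se, analytic_se)
```

The two numbers should agree closely, with small differences due to the randomness of resampling.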

In [None]:
# Fit a normal curve to the bootstrap means, using their mean and
# standard deviation as the curve's parameters
import scipy.stats

normal_curve = scipy.stats.norm(np.mean(b_sample_means), np.std(b_sample_means))


In [None]:
# Probability mass of the fitted normal curve between ages 20 and 40
normal_curve.cdf(40) - normal_curve.cdf(20)
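The fitted normal curve can also be used in the other direction: instead of asking how much probability lies between two chosen ages, ask which endpoints enclose the central 95%. The sketch below uses `ppf` (the inverse of `cdf`) for this. The values `boot_mean` and `boot_se` are hypothetical placeholders for `np.mean(b_sample_means)` and `np.std(b_sample_means)`.

```python
import scipy.stats

# Hypothetical values standing in for the bootstrap mean and standard error
boot_mean, boot_se = 30.0, 4.0
normal_curve = scipy.stats.norm(boot_mean, boot_se)

# Endpoints that leave 2.5% of the probability in each tail
low, high = normal_curve.ppf([0.025, 0.975])

# Check: the mass between the endpoints should be 95%
central_mass = normal_curve.cdf(high) - normal_curve.cdf(low)
print(low, high, central_mass)
```

The endpoints sit about 1.96 standard errors either side of the mean, which is why multiplying the standard deviation by 2 works as a quick approximation.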