## Calculate Confidence Interval

### Point estimates are estimates of population parameters based on sample data. For example, to find the average weight of students in a school, we would collect a sample and use its mean to estimate the average weight of all the students in the school.

### The sample mean is usually not exactly the same as the population mean.

### A confidence interval is a range of values above and below a point estimate that captures the true population parameter at some predetermined confidence level.

### If you want a 95% chance that the interval built around your point estimate captures the true population parameter, set your confidence level to 95%.

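### CI = mean ± margin of error (MOE)
### where MOE = z * sd / sqrt(n), with z the critical z-value for the chosen confidence level, sd the population standard deviation, and n the sample size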
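The formula above can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the function name `z_confidence_interval` and the example numbers (mean 50, standard deviation 10, n = 100) are made up for illustration, and it assumes the population standard deviation is known.

```python
import math
from scipy import stats


def z_confidence_interval(sample_mean, pop_stdev, n, confidence=0.95):
    """Return (lower, upper) for a z-based interval with a known population sd."""
    # Two-sided critical value: for 95% confidence this is the 97.5th percentile.
    z_critical = stats.norm.ppf((1 + confidence) / 2)
    margin_of_error = z_critical * pop_stdev / math.sqrt(n)
    return sample_mean - margin_of_error, sample_mean + margin_of_error


# Hypothetical inputs chosen only to make the arithmetic easy to follow.
lower, upper = z_confidence_interval(sample_mean=50.0, pop_stdev=10.0, n=100)
print(lower, upper)
```

With these inputs the margin of error is roughly 1.96 * 10 / 10 = 1.96, so the interval is roughly 50 ± 1.96.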

In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import random
import math

In [3]:
np.random.seed(10)

population_wt = stats.poisson.rvs(loc=18, mu=35, size=150000)  # randomly generate weight data for a population of 150k people

sample_size = 1000
sample = np.random.choice(a=population_wt, size=sample_size)  # sample of 1000 people
sample_mean = sample.mean()  # point estimate of the population mean

z_critical = stats.norm.ppf(q = 0.975)  # Two-sided z-critical value at the 95% confidence level
# q = 0.975 leaves 2.5% in each tail; the value is about 1.96

print("z-critical value:")              # Check the z-critical value
print(z_critical)                        

pop_stdev = population_wt.std()  # Get the population standard deviation

margin_of_error = z_critical * (pop_stdev/math.sqrt(sample_size))

confidence_interval = (sample_mean - margin_of_error,
                       sample_mean + margin_of_error)  

print("Confidence interval:")
print(confidence_interval)

z-critical value:
1.95996398454
Confidence interval:
(52.666990574348212, 53.399009425651791)
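To see what the 95% confidence level means in practice, the sampling step can be repeated many times while counting how often the interval contains the true population mean. The sketch below is an illustration under assumptions: it builds its own population with NumPy's `default_rng` (seed 0) rather than reusing the notebook's seeded data, and the trial count of 1000 is arbitrary.

```python
import math
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Same shape of population as above: Poisson(35) shifted up by 18.
population = rng.poisson(lam=35, size=150000) + 18
true_mean = population.mean()
pop_stdev = population.std()

sample_size = 1000
n_trials = 1000  # arbitrary number of repeated samples

z_critical = stats.norm.ppf(0.975)
margin_of_error = z_critical * pop_stdev / math.sqrt(sample_size)

# Draw all samples at once (with replacement): one row per trial.
samples = rng.choice(population, size=(n_trials, sample_size))
sample_means = samples.mean(axis=1)

# An interval captures the true mean when the sample mean is within one MOE of it.
captured = np.abs(sample_means - true_mean) <= margin_of_error
coverage = captured.mean()
print("Fraction of intervals containing the true mean:", coverage)
```

The printed fraction should land close to 0.95, though it will vary a little with the seed and the number of trials.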


In [4]:
s_stdev = sample.std(ddof=1)  # sample standard deviation (ddof=1 divides by n - 1)

In [5]:
from scipy.stats import t
from numpy import average, std
from math import sqrt

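### The major difference between using a Z score and a T statistic is that the former requires the population standard deviation, while the latter uses the sample standard deviation as an estimate of it.
### The T statistic is also the usual choice if you have a small sample size (less than 30).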
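One way to see why sample size matters is to compare the two critical values directly. The following sketch, with sample sizes picked arbitrarily for illustration, prints the 95% t-critical value next to the z-critical value for several values of n.

```python
from scipy import stats

confidence = 0.95
q = (1 + confidence) / 2  # upper-tail quantile for a two-sided interval

z_critical = stats.norm.ppf(q)

# Illustrative sample sizes; the degrees of freedom are n - 1.
t_criticals = {}
for n in (5, 10, 30, 100, 1000):
    t_criticals[n] = stats.t.ppf(q, df=n - 1)
    print(f"n={n:5d}  t-critical={t_criticals[n]:.3f}  z-critical={z_critical:.3f}")
```

The t-critical value is always larger than the z-critical value, which widens the interval to account for the extra uncertainty in the estimated standard deviation, and the gap shrinks as n grows.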

In [6]:
data = [63.5, 81.3, 88.9, 63.5, 76.2, 67.3, 66.0, 64.8, 74.9, 81.3, 76.2,
            72.4, 76.2, 81.3, 71.1, 80.0, 73.7, 74.9, 76.2, 86.4, 73.7, 81.3,
            68.6, 71.1, 83.8, 71.1, 68.6, 81.3, 73.7, 74.9]

In [7]:
mean = average(data)  # mean weight of weightlifters

In [8]:
mean


74.806666666666658

In [9]:
from scipy.stats import sem, t
from numpy import mean  # use NumPy's mean; the scipy.mean alias is deprecated
confidence = 0.95

In [10]:
n = len(data)
m = mean(data)
std_err = sem(data)  # standard error of the mean
h = std_err * t.ppf((1 + confidence) / 2, n - 1)  # margin of error

start = m - h
print(start)

72.3288596918


In [11]:
end = m + h
print(end)

77.2844736415
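As a cross-check, SciPy can compute the same t-based interval in one call with `t.interval`. This sketch repeats the weight data so that it runs on its own; the confidence level is passed positionally because the keyword name for that argument differs between SciPy versions.

```python
from numpy import mean
from scipy.stats import sem, t

data = [63.5, 81.3, 88.9, 63.5, 76.2, 67.3, 66.0, 64.8, 74.9, 81.3, 76.2,
        72.4, 76.2, 81.3, 71.1, 80.0, 73.7, 74.9, 76.2, 86.4, 73.7, 81.3,
        68.6, 71.1, 83.8, 71.1, 68.6, 81.3, 73.7, 74.9]

confidence = 0.95
n = len(data)

# loc is the sample mean, scale is the standard error of the mean.
lower, upper = t.interval(confidence, n - 1, loc=mean(data), scale=sem(data))
print(lower, upper)
```

The two bounds should agree with the `start` and `end` values printed above.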
