# Statistics

## Estimating a population mean when the population std dev $\sigma$ is known
Not realistic, but this is the groundwork for t-tests. 

Example:
Random sample of 40 students. Avg resting heart rate is 76.3bpm.  
Assume population std dev is 12.5bpm.  
Construct a 99% confidence interval for the mean resting heart rate of the population.

In [59]:
from scipy.stats import norm
import numpy as np

In [34]:
# setup our variables from the given info
n = 40
x_bar = 76.3
sigma = 12.5
ci = .99 # confidence level
alpha = 1- ci

In [35]:
# find the crit value
crit_val = norm.ppf(alpha/2) * -1
crit_val

2.5758293035489004

Next, solve for the margin of error E, according to the following equation:
  
$ \normalsize E = Z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} $

In [38]:
E = crit_val * (sigma/np.sqrt(n))
E

5.090929664387351

Now we can compute the bounds and interpret:

In [39]:
lb = round(x_bar-E,2)
ub = round(x_bar+E,2)

print(f'Lower bound= {lb}')
print(f'Upper bound= {ub}')

Lower bound= 71.21
Upper bound= 81.39


We can say with 99% confidence that the mean resting heart rate of the population is within the range $ 71.21 < \mu < 81.39 $.
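As a cross-check on the manual steps above, the same interval can be obtained in one call. This is a minimal sketch, assuming `scipy.stats.norm.interval` is given the confidence level positionally, the sample mean as `loc`, and the standard error $\sigma/\sqrt{n}$ as `scale`; the name `std_err` is introduced here for illustration only.

```python
import numpy as np
from scipy.stats import norm

# same given info as the example above
n = 40
x_bar = 76.3
sigma = 12.5
conf_level = 0.99

# standard error of the mean when sigma is known (name is illustrative)
std_err = sigma / np.sqrt(n)

# central interval containing conf_level of a normal centred on x_bar
lb, ub = norm.interval(conf_level, loc=x_bar, scale=std_err)

print(f'Lower bound= {lb:.2f}')
print(f'Upper bound= {ub:.2f}')
```

The bounds should agree with the values computed step by step from the critical value and the margin of error.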

---

## T-Score: Estimating a population mean when the population std dev $\sigma$ is unknown
A much more realistic situation! 
If you don't know $\sigma$, you can't use a z-score. Instead, we use a t-score.  
What we need:  
1. Random sample
2. n > 30 or population known to be normally distributed  
  
Recall:  
$ \normalsize Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} $   
  
For t, we just make a modification:  
$ \normalsize T = \frac{\bar{X}-\mu}{s/\sqrt{n}} $  

Critical values are given by $ \normalsize t_{\alpha/2} $  
  
The t distribution uses degrees of freedom, df = n - 1.  
The t-score is often used for small samples, such as n < 30.  
  
**To find a t critical value with python:**  
`scipy.stats.t.ppf(q, df)`  
Where   
q = cumulative (lower-tail) probability, e.g. $\alpha/2$ for the left critical value or $1-\alpha/2$ for the right one  
df = degrees of freedom  
  
!! As sample size grows, the t critical value approaches the z critical value !!  
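A quick way to see this convergence is to print the t critical value for increasing degrees of freedom next to the z critical value. This is only a sketch: the 95% level and the particular `dfs` values are arbitrary choices made for illustration.

```python
from scipy.stats import norm, t

conf_level = 0.95          # illustrative choice
alpha = 1 - conf_level

# right-tail critical value of the standard normal
z_crit = norm.ppf(1 - alpha / 2)

# arbitrary degrees of freedom, small to large
dfs = [5, 10, 30, 100, 1000]
t_crits = [t.ppf(1 - alpha / 2, df) for df in dfs]

for df, t_crit in zip(dfs, t_crits):
    print(f'df={df:>4}: t crit = {t_crit:.4f}   z crit = {z_crit:.4f}')
```

The t critical value is always larger than the z critical value, and the gap shrinks as df grows.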
  
Hey, let's get an example:

Construct a 95% confidence interval for avg age of people denied a promotion.    
A random sample of 23 people has an average age of 47, with a sample standard deviation of 7.2.   
Assume the sample comes from a normally distributed population.   

In [40]:
from scipy.stats import t

In [41]:
n = 23
s = 7.2 # sample std dev
x_bar = 47 # sample mean
df = n-1
ci = .95 # confidence level
alpha = 1-ci # outside the bounds of the ci

In [42]:
t_crit_val = t.ppf((alpha/2), df) * -1
t_crit_val

2.0738730679040147

In [43]:
E = t_crit_val * (s/np.sqrt(n))
E

3.1135134785958267

In [44]:
lb = round(x_bar-E,2)
ub = round(x_bar+E,2)

print(f'Lower bound= {lb}')
print(f'Upper bound= {ub}')

Lower bound= 43.89
Upper bound= 50.11


We can say with 95% confidence that the average age of people denied a promotion is within the range: $ 43.89 < \mu < 50.11 $
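The same bounds can be cross-checked in one call. This is a minimal sketch, assuming `scipy.stats.t.interval` takes the confidence level and degrees of freedom positionally, with the sample mean as `loc` and the estimated standard error $s/\sqrt{n}$ as `scale`; `std_err` is a name introduced here for illustration.

```python
import numpy as np
from scipy.stats import t

# same given info as the example above
n = 23
s = 7.2        # sample std dev
x_bar = 47     # sample mean
df = n - 1
conf_level = 0.95

# estimated standard error of the mean (name is illustrative)
std_err = s / np.sqrt(n)

# central interval of a t distribution centred on x_bar
lb, ub = t.interval(conf_level, df, loc=x_bar, scale=std_err)

print(f'Lower bound= {lb:.2f}')
print(f'Upper bound= {ub:.2f}')
```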

---

## Chi-Squared: Estimate population variance 
$ \large \sigma^2 = \frac{(n-1)s^2}{\chi^2} $

The chi-sq distribution is used to estimate the population variance $\sigma^2$, which in turn gives an estimate of the population standard deviation, $\sigma$, at a specified confidence level.  
  
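Note, the chi-sq distribution is not symmetrical, so the left and right critical values differ and have to be found separately. The larger (right) critical value gives the lower bound and the smaller (left) one gives the upper bound:
$$ \large \frac{(n-1)s^2}{\chi^2_R} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_L} $$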



Example and Procedure:  
A sample of 10 appliances has a voltage standard deviation of 0.15 volts.    
Construct a 95% confidence interval for population variance and stdev. 

In [51]:
from scipy.stats import chi2

# given info
s = 0.15
n = 10
df = n-1
conf_lvl = 0.95
alpha = 1 - conf_lvl

# break out the tails (cumulative probabilities passed to ppf)
left_conf_lvl = alpha/2 # 0.025
right_conf_lvl = 1-(alpha/2) # 0.975

# critical values (what a chi-sq table would give)
left_crit_val = chi2.ppf(left_conf_lvl, df)
right_crit_val = chi2.ppf(right_conf_lvl, df)
    
print(f'Right crit value: {right_crit_val}')
print(f'Left crit value: {left_crit_val}')

Right crit value: 19.02276779864163
Left crit value: 2.7003894999803584


$ \chi^2_R = 19.023 $  
  
$ \chi^2_L = 2.700 $

In [58]:
# Use the equation
var_lb = (n-1)*s**2 / right_crit_val
var_ub = (n-1)*s**2 / left_crit_val

print(f'variance lower bound: {round(var_lb,3)}')
print(f'variance upper bound: {round(var_ub,3)}\n')
print(f'Stdev lower bound: {round(np.sqrt(var_lb),3)}')
print(f'Stdev upper bound: {round(np.sqrt(var_ub),3)}')



variance lower bound: 0.011
variance upper bound: 0.075

Stdev lower bound: 0.103
Stdev upper bound: 0.274


And the interpretation:  
We are 95% confident that the population variance and standard deviation of the voltage for these appliances fall within these ranges:  

$ 0.011\ V^2 < \sigma^2 < 0.075\ V^2 $  
$ 0.103\ V < \sigma < 0.274\ V $
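The procedure above can be wrapped in a small reusable function. This is a sketch only: `variance_interval` is a hypothetical helper name, not part of scipy, and it assumes the sample comes from a normally distributed population.

```python
import numpy as np
from scipy.stats import chi2


def variance_interval(s, n, conf_lvl):
    """Return (lower, upper) bounds for the population variance.

    Hypothetical helper; s is the sample std dev, n the sample size.
    """
    df = n - 1
    alpha = 1 - conf_lvl
    # chi-sq is asymmetric, so each tail needs its own critical value
    left_crit_val = chi2.ppf(alpha / 2, df)
    right_crit_val = chi2.ppf(1 - alpha / 2, df)
    # the larger critical value yields the lower bound
    var_lb = df * s**2 / right_crit_val
    var_ub = df * s**2 / left_crit_val
    return var_lb, var_ub


var_lb, var_ub = variance_interval(s=0.15, n=10, conf_lvl=0.95)
print(f'variance: {var_lb:.3f} to {var_ub:.3f}')
print(f'stdev:    {np.sqrt(var_lb):.3f} to {np.sqrt(var_ub):.3f}')
```

Taking the square root of each variance bound gives the matching interval for the standard deviation.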