# Hypothesis Testing (One Sample)

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

### One-sided Hypothesis Tests

#### Example: Pharmaceutical Company

A pharmaceutical company is trying out a medication for lowering blood sugar and managing diabetes. It is known that any level of Hemoglobin A1c below 5.7% is considered normal. The drug company has treated 100 study volunteers with this medication and would like to prove that after treatment their mean A1c is below 5.7%.

In [2]:
pop_mean = 5.7
sample_mean = 5.1
sample_std = 1.6
n = 100
statistic = (sample_mean - pop_mean)/(sample_std/np.sqrt(n))
pval = stats.t.sf(np.abs(statistic), n-1)
print(statistic)
print(pval)
# pval: probability of observing a result at least as extreme as this one by pure chance,
# given that the null hypothesis is true
# This p-value is smaller than 5%, therefore we reject H_0

-3.750000000000003
0.0001489332089038242
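
The calculation above can be wrapped in a small helper that makes the direction of the test explicit. This is only a sketch: the function name `t_test_from_summary` and its `alternative` argument are hypothetical and not part of the original notebook. For H_0: μ = 5.7 against H_1: μ < 5.7, the p-value is the lower-tail probability `stats.t.cdf(statistic, n-1)`, which matches the `sf(abs(...))` form whenever the statistic is negative.

```python
import numpy as np
from scipy import stats


def t_test_from_summary(sample_mean, pop_mean, sample_std, n, alternative="two-sided"):
    """One-sample t-test from summary statistics (hypothetical helper)."""
    se = sample_std / np.sqrt(n)          # standard error of the mean
    statistic = (sample_mean - pop_mean) / se
    df = n - 1
    if alternative == "less":             # H_1: mean < pop_mean
        pval = stats.t.cdf(statistic, df)
    elif alternative == "greater":        # H_1: mean > pop_mean
        pval = stats.t.sf(statistic, df)
    elif alternative == "two-sided":      # H_1: mean != pop_mean
        pval = 2 * stats.t.sf(abs(statistic), df)
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return statistic, pval


# A1c example: is the mean after treatment below 5.7?
statistic, pval = t_test_from_summary(5.1, 5.7, 1.6, 100, alternative="less")
print(statistic, pval)
```

Spelling out the alternative avoids the silent mistake of taking `abs()` when the statistic points in the direction opposite to H_1.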


In [4]:
# Confidence Interval
stats.t.interval(0.95, df=n-1, loc=sample_mean, scale=(sample_std/np.sqrt(n)))
# we are 95% confident that the population mean lies within the interval below

(4.78252528775861, 5.417474712241389)
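
As a cross-check, the same interval can be built by hand as mean ± t_crit × standard error. The sketch below assumes the A1c summary values from this example; the variable names `t_crit`, `lower` and `upper` are chosen only for illustration.

```python
import numpy as np
from scipy import stats

sample_mean, sample_std, n = 5.1, 1.6, 100

se = sample_std / np.sqrt(n)             # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)    # 2.5% in each tail for a 95% interval

lower = sample_mean - t_crit * se
upper = sample_mean + t_crit * se
print(lower, upper)
```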

#### Example: Municipal Children's Home

Boys of a certain age are known to have a mean weight of μ = 85 pounds. A complaint is made that the boys living in a municipal children's home are underfed and thus underweight (one-sided test!). As one piece of evidence, n = 25 boys (of the same age) are weighed and found to have a mean weight of 80.94 pounds. It is known that the population standard deviation σ is 11.6 pounds (the unrealistic part of this example!).  
Based on the available data, what should be concluded concerning the complaint?

In [2]:
# your code here
pop_mean = 85
sample_mean = 80.94
sample_std = 11.6 # note: this is the known population σ, not an estimate from the sample
n = 25
statistic = (sample_mean - pop_mean)/(sample_std/np.sqrt(n))
pval = stats.t.sf(np.abs(statistic), n-1)
print(statistic)
print(pval)

-1.750000000000001
0.046447544473094286
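
Because σ is stated as known in this example, the textbook alternative to the t-distribution is a z-test based on the standard normal distribution. The sketch below is an illustration under that assumption, not the notebook's original solution; the names `z`, `p_z` and `p_t` are hypothetical. It computes the lower-tail p-value both ways for comparison.

```python
import numpy as np
from scipy import stats

pop_mean = 85
sample_mean = 80.94
sigma = 11.6        # known population standard deviation
n = 25

z = (sample_mean - pop_mean) / (sigma / np.sqrt(n))

p_z = stats.norm.cdf(z)          # z-test: lower tail of the standard normal
p_t = stats.t.cdf(z, df=n - 1)   # t-based version, as in the cell above

print(z, p_z, p_t)
```

Both p-values fall below 5%, so either way H_0: μ = 85 is rejected at the 5% level in favor of the boys being underweight. The t-based p-value is slightly larger because the t-distribution has heavier tails.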


In [4]:
# Confidence Interval
stats.t.interval(0.95, df=n-1, loc=sample_mean, scale=(sample_std/np.sqrt(n)))

(76.15175533702299, 85.728244662977)

### Two-sided Hypothesis Tests

#### Example: Honolulu Heart Study

It is assumed that the mean systolic blood pressure is μ = 120 mm Hg. In the Honolulu Heart Study, a sample of n = 100 people had an average systolic blood pressure of 130.1 mm Hg with a standard deviation of 21.21 mm Hg. Is the group significantly different (with respect to systolic blood pressure!) from the regular population?

In [7]:
pop_mean = 120
sample_mean = 130.1
sample_std = 21.21
n = 100
statistic = (sample_mean - pop_mean)/(sample_std/np.sqrt(n))
pval = stats.t.sf(np.abs(statistic), n-1)*2 # for two-sided: *2 !!
print(statistic)
print(pval)

4.761904761904759
6.562701817208617e-06


In [8]:
# Confidence Interval
stats.t.interval(0.95, df=n-1, loc=sample_mean, scale=(sample_std/np.sqrt(n)))

(125.89147584585008, 134.30852415414992)
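
A two-sided test at the 5% level and a 95% confidence interval carry the same information: H_0 is rejected exactly when the hypothesized mean falls outside the interval. The following sketch checks this for the Honolulu numbers; the flag names `rejects` and `outside` are made up for illustration.

```python
import numpy as np
from scipy import stats

pop_mean = 120
sample_mean, sample_std, n = 130.1, 21.21, 100

se = sample_std / np.sqrt(n)
statistic = (sample_mean - pop_mean) / se
pval = 2 * stats.t.sf(abs(statistic), n - 1)   # two-sided p-value

lower, upper = stats.t.interval(0.95, df=n - 1, loc=sample_mean, scale=se)

rejects = bool(pval < 0.05)                      # test decision at the 5% level
outside = not (lower <= pop_mean <= upper)       # is 120 outside the 95% interval?
print(rejects, outside)
```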

## Using data arrays

#### Generating 100 draws from a standard normal random variable

In [6]:
X = stats.norm(0, 1).rvs(size = 100)
print(X)
# data sample

[ 0.45967078 -0.75660082 -0.57403381  0.55584224 -0.2279619  -1.80562109
  0.36546587  0.88974727 -1.32205161 -0.44629461  2.09378472 -0.20946527
  0.87962938  1.50887261  1.34744692 -0.54795174 -1.50735744 -0.75278797
 -0.55298303  1.53791274 -1.02225382  1.9869551   0.45395842  1.07151053
  2.14552979  1.00135539 -0.44736846  0.38224522 -2.15499338 -0.39455264
  1.42473992  0.3923123  -0.45450243  0.20940424  0.00834324  0.38081188
  0.89310614 -0.90461316 -0.78744142 -1.1106927  -1.57782351  0.51522652
  0.67030771 -0.11270582 -0.57755272 -0.155758   -0.17802566 -0.04152592
 -0.22103078  1.97696614 -0.68291835 -0.72747554  1.31987146  1.20968905
  1.20991322  0.3158278  -1.730445    0.69493725  0.70876368  0.68939313
  1.219246    0.61264267  1.01691654 -1.28985324  0.61734299  1.03683691
  0.61425521  0.45709165 -0.15764698  1.95882039 -0.39661293  0.26339999
  1.40040244  1.75363875  0.11321864  1.48283378 -0.05528946 -0.52090103
  1.58535736 -0.57091935  0.20169989 -1.20928696  0

#### Test if the sample average of X is equal to 0

In [9]:
help(stats.ttest_1samp)

In [10]:
stats.ttest_1samp(X, 5) # H_0: mean = 5. ttest_1samp is two-sided by default, like a test of whether a coefficient is 0 or not; most tests we run are two-sided
stats.ttest_1samp(X, 0) # H_0: mean = 0. The p-value is greater than 5%, so we fail to reject H_0 (only this last result is displayed)

Ttest_1sampResult(statistic=0.5582027960403086, pvalue=0.5779661451748539)
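
To see what `ttest_1samp` does internally, its result can be reproduced by hand from the sample mean and the sample standard deviation (with `ddof=1`). The sketch below uses its own seeded sample, so its numbers will differ from the output above; the seed and the variable names are assumptions for illustration. The `alternative` keyword for one-sided tests is available in SciPy 1.6 and later.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)        # assumed seed, only for reproducibility
sample = rng.standard_normal(100)     # 100 draws from N(0, 1)

# Library result (two-sided by default)
result = stats.ttest_1samp(sample, 0)

# The same test by hand
n = len(sample)
se = sample.std(ddof=1) / np.sqrt(n)  # ddof=1 gives the sample standard deviation
manual_stat = (sample.mean() - 0) / se
manual_pval = 2 * stats.t.sf(abs(manual_stat), n - 1)

# One-sided version, H_1: mean < 0 (requires SciPy >= 1.6)
one_sided = stats.ttest_1samp(sample, 0, alternative="less")

print(result.statistic, manual_stat)
print(result.pvalue, manual_pval)
print(one_sided.pvalue)
```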

#### Using actual data

In [11]:
data = pd.read_csv('../../../03_data-visualization/02_lab-matplotlib-seaborn/your-code/Fitbit2.csv') 
data.head()

FileNotFoundError: [Errno 2] File ../../../03_data-visualization/02_lab-matplotlib-seaborn/your-code/Fitbit2.csv does not exist: '../../../03_data-visualization/02_lab-matplotlib-seaborn/your-code/Fitbit2.csv'

In [None]:
data.describe()

In [None]:
stats.ttest_1samp(data['Distance'], 8.5)
# fail to reject H_0: the p-value is greater than 5%
stats.ttest_1samp(data['Distance'], 20)
# reject H_0
# a |statistic| above the critical value (about 1.96 for large samples) means the p-value is less than 5%
# to judge the statistic directly, look up the critical value in a t-table
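
Instead of looking up a printed table, the critical values can be computed with the inverse CDF (`ppf`). This is a small sketch; the degrees of freedom 99 and 24 are taken from the earlier examples (n = 100 and n = 25) purely for illustration.

```python
from scipy import stats

alpha = 0.05

# Two-sided critical values: put alpha/2 in each tail
z_crit = stats.norm.ppf(1 - alpha / 2)         # large-sample (normal) value, about 1.96
t_crit_99 = stats.t.ppf(1 - alpha / 2, df=99)  # n = 100
t_crit_24 = stats.t.ppf(1 - alpha / 2, df=24)  # n = 25

print(z_crit, t_crit_99, t_crit_24)
```

The t critical values are larger than 1.96 and shrink toward it as the sample size grows, which is why 1.96 is only a rule of thumb for small samples.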