# Hypothesis Testing

Hypothesis testing is a critical tool for determining what the value of a parameter could be.

Every test is built on two competing hypotheses:

**Null Hypothesis: $H_0$**

**Alternative Hypothesis: $H_a$**

The tests are:

* One Population Proportion
* Difference in Population Proportions
* One Population Mean
* Difference in Population Means

This notebook walks through functions that are useful for calculating a test statistic (z or t) and a p-value for each of the hypothesis tests listed above.

Every test statistic has the same general form:

$$\frac{Best\ Estimate - Hypothesized\ Estimate}{Standard\ Error\ of\ Estimate}$$ 
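For the four tests in this notebook, the general form specializes roughly as follows. This is a sketch of the versions used in the code cells; $p_0$ and $\mu_0$ denote the hypothesized values and $\hat{p}$ in the second line is the pooled sample proportion, all notation introduced here for illustration.

$$
\begin{aligned}
\text{One proportion:}\quad & z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \\
\text{Difference in proportions:}\quad & z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad \hat{p} = \frac{y_1 + y_2}{n_1 + n_2} \\
\text{One mean:}\quad & t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \\
\text{Difference in means (pooled):}\quad & t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}\,\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \\
\text{Difference in means (unpooled):}\quad & t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}
\end{aligned}
$$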

In [1]:
import statsmodels.api as sm
import numpy as np
import pandas as pd
import scipy
from scipy import stats

## One Population Proportion

#### Research Question 

In previous years, 52% of parents believed that electronics and social media were the cause of their teenager’s lack of sleep. Do more parents today believe that their teenager’s lack of sleep is caused by electronics and social media?

**Population**: Parents with a teenager (age 13-18)  
**Parameter of Interest**: p  
**Null Hypothesis:** p = 0.52  
**Alternative Hypothesis:** p > 0.52  

#### Survey Results:
A random sample of 1018 parents with a teenager was taken, and 56% said they believe electronics and social media were the cause of their teenager's lack of sleep.

In [2]:
n = 1018
p_null = 0.52
p_hat = 0.56

In [3]:
def one_prop(n, p_null, p_hat):
    best_estimate = p_hat
    hypo_estimate = p_null
    se = np.sqrt((p_null*(1-p_null))/n)
    z = (best_estimate - hypo_estimate)/se
    p = scipy.stats.norm.sf(z) # one-sided, upper tail (alternative: p > p_null)
    print("Z-Test-Statistic:",z)
    print("P-value:",p)

In [4]:
one_prop(n, p_null, p_hat)

Z-Test-Statistic: 2.5545334262132955
P-value: 0.005316510991822442


**OR**

In [5]:
z_test_statistic, p_value = sm.stats.proportions_ztest(p_hat * n, n, p_null)
print("z test statistic:",z_test_statistic)
print("p-value:",p_value)
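# proportions_ztest returns (z_test_statistic, p_value). With its defaults the standard error
# is computed from the sample proportion and the p-value is two-sided, so both numbers differ
# from one_prop above. Passing alternative='larger' and prop_var=p_null matches one_prop's setup.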

z test statistic: 2.571067795759113
p-value: 0.010138547731721065
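The two sets of results differ for two reasons that can be checked by hand: `one_prop` builds the standard error from the null value (0.52) and reports an upper-tail p-value, while `proportions_ztest`, left at its defaults, builds the standard error from the sample proportion (0.56) and reports a two-sided p-value. The sketch below reproduces both pairs of numbers using only the standard library; the helper name `z_and_p` is illustrative and not part of the notebook.

```python
from math import sqrt
from statistics import NormalDist

n, p_null, p_hat = 1018, 0.52, 0.56

def z_and_p(p_for_se, two_sided):
    """Return (z, p_value), with the standard error built from p_for_se."""
    se = sqrt(p_for_se * (1 - p_for_se) / n)
    z = (p_hat - p_null) / se
    if two_sided:
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    else:
        p_value = 1 - NormalDist().cdf(z)  # upper tail only
    return z, p_value

# Standard error under the null, upper-tail p-value (the one_prop setup)
z_null, p_one_sided = z_and_p(p_null, two_sided=False)

# Standard error from the sample proportion, two-sided p-value
# (the default proportions_ztest setup)
z_hat, p_two_sided = z_and_p(p_hat, two_sided=True)

print("null-based SE:  ", z_null, p_one_sided)
print("sample-based SE:", z_hat, p_two_sided)
```

Since the research question is one-sided and the null value is specified, the null-based standard error with an upper-tail p-value is the setup that matches the stated hypotheses.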


## Difference in Population Proportions

#### Research Question

Is there a significant difference between the population proportions of parents of black children and parents of Hispanic children who report that their child has had some swimming lessons?

**Populations**: All parents of black children age 6-18 and all parents of Hispanic children age 6-18  
**Parameter of Interest**: p1 - p2, where p1 is the proportion for parents of black children and p2 is the proportion for parents of Hispanic children  
**Null Hypothesis:** p1 - p2 = 0  
**Alternative Hypothesis:** p1 - p2 $\neq$ 0  

#### Survey Results
* A sample of 247 parents of black children age 6-18 was taken, with 91 saying that their child has had some swimming lessons (36.8%).
* A sample of 308 parents of Hispanic children age 6-18 was taken, with 120 saying that their child has had some swimming lessons (39.0%).

In [6]:
#n1 = 247
#p1 = .37

#n2 = 308
#p2 = .39

# In order to solve this problem programmatically let's create the two populations.
#population1 = np.random.binomial(1, p1, n1)
#population2 = np.random.binomial(1, p2, n2)

#print(population1[:5])
#print(population2[:5])

#sm.stats.ttest_ind(population1, population2)
# (t_test_statistic, p_value, 

In [7]:
n1 = 247
y1 = 91

n2 = 308
y2 = 120

p_null = 0

In [8]:
# Function for calculating Z-Statistic and P-value for Difference in Population Proportions
def two_prop(hypo_estimate, y1, y2, n1, n2):
    p_hat = (y1+y2)/(n1+n2)
    p1 = y1/n1
    p2 = y2/n2
    best_estimate = p1-p2
    se = np.sqrt((p_hat*(1-p_hat))*(1/n1 + 1/n2)) #Standard Error of Estimate
    z = (best_estimate - hypo_estimate)/se
    p = scipy.stats.norm.sf(abs(z))*2 #two-sided
    # p = scipy.stats.norm.sf(abs(z))#one-sided
    print("z-statistic:",z)
    print("p-value:",p)

In [9]:
two_prop(p_null, y1, y2, n1, n2)

z-statistic: -0.5110545335044571
p-value: 0.6093128715165157
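`two_prop` switches between one-sided and two-sided p-values by commenting lines in and out. A possible alternative, sketched here with only the standard library, takes the direction as a parameter and returns the results instead of printing them. The function name `two_prop_test` and the `alternative` labels are assumptions made for this illustration, not part of the notebook.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_test(y1, n1, y2, n2, hypo_estimate=0.0, alternative="two-sided"):
    """Pooled z-test for p1 - p2; returns (z, p_value)."""
    p_pooled = (y1 + y2) / (n1 + n2)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (y1 / n1 - y2 / n2 - hypo_estimate) / se
    cdf = NormalDist().cdf
    if alternative == "two-sided":
        p_value = 2 * (1 - cdf(abs(z)))
    elif alternative == "larger":     # alternative: p1 - p2 > hypo_estimate
        p_value = 1 - cdf(z)
    elif alternative == "smaller":    # alternative: p1 - p2 < hypo_estimate
        p_value = cdf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'larger' or 'smaller'")
    return z, p_value

# Swimming-lesson survey counts from above
z, p_two_sided = two_prop_test(91, 247, 120, 308)
_, p_smaller = two_prop_test(91, 247, 120, 308, alternative="smaller")
print("z:", z)
print("two-sided p:", p_two_sided)
print("lower-tail p:", p_smaller)
```

Because z is negative here, the lower-tail p-value is half of the two-sided one.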


## One Population Mean

#### Research Question 

Is the average cartwheel distance (in inches) for adults more than 80 inches?

**Population**: All adults  
**Parameter of Interest**: $\mu$, population mean cartwheel distance.<br>
**Null Hypothesis:** $\mu$ = 80<br>
**Alternative Hypothesis:** $\mu$ > 80

#### Survey Results:
25 adults were asked to perform a cartwheel, giving the following sample mean and sample standard deviation:

$\bar{x} = 82.48$ <br>
$s = 15.06$

In [10]:
df = pd.read_csv("Cartwheeldata.csv")
df.head()

Unnamed: 0,ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
0,1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
1,2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
2,3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
3,4,39,F,1,N,0,64.0,63.0,87,Y,1,10
4,5,27,M,2,N,0,73.0,75.0,72,N,0,4


In [11]:
n = len(df)
mean = df["CWDistance"].mean()
sd = df["CWDistance"].std()
p_null = 80
(n, mean, sd)

(25, 82.48, 15.058552387264855)

In [12]:
def one_mean(n,mean,sd,p_null):
    best_estimate = mean
    hypo_estimate = p_null
    se = sd/np.sqrt(n)
    t = (best_estimate-hypo_estimate)/se
    p = stats.t.sf(t, n-1) # one-sided, upper tail (alternative: mu > p_null)
    print("t-statistic:",t)
    print("p-value:",p)

In [13]:
one_mean(n,mean,sd,p_null)

t-statistic: 0.8234523266982027
p-value: 0.2091793328533854


**OR**

In [14]:
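# ztest refers the statistic to the standard normal distribution rather than the
# t distribution, so its p-value differs slightly from the one printed by one_mean.
sm.stats.ztest(df["CWDistance"], value = 80, alternative = "larger")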

(0.8234523266982029, 0.20512540845395266)
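The statistic matches the one from `one_mean`, but the p-value does not (0.2051 versus 0.2092). The difference comes from the reference distribution: `one_mean` uses a t distribution with n - 1 = 24 degrees of freedom, while `ztest` uses the standard normal. As a quick check, the sketch below recomputes the normal-based p-value from the summary statistics printed earlier, using only the standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics printed by the notebook for CWDistance
n, mean, sd, mu_null = 25, 82.48, 15.058552387264855, 80

se = sd / sqrt(n)                        # standard error of the sample mean
stat = (mean - mu_null) / se             # same value as the t-statistic
p_normal = 1 - NormalDist().cdf(stat)    # upper-tail area under the standard normal

print("statistic:", stat)
print("normal-based p-value:", p_normal)
```

With only 25 observations the t-based p-value is usually the safer choice, since the t distribution accounts for the extra uncertainty from estimating the standard deviation.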

## Difference in Population Means (for Independent Groups)

#### Research Question 

Considering adults in the NHANES data, is there a significant difference between the mean Body Mass Index of females and males?

**Population**: Adults in the NHANES data.  
**Parameter of Interest**: $\mu_1 - \mu_2$, the difference in mean Body Mass Index, where $\mu_1$ is the mean for females and $\mu_2$ is the mean for males.  
**Null Hypothesis:** $\mu_1 = \mu_2$  
**Alternative Hypothesis:** $\mu_1 \neq \mu_2$

2976 Female Adults  
$\bar{x}_1 = 29.94$  
$s_1 = 7.75$  

2759 Male Adults  
$\bar{x}_2 = 28.78$  
$s_2 = 6.25$  

$\bar{x}_1 - \bar{x}_2 = 1.16$

In [15]:
da = pd.read_csv("nhanes_2015_2016.csv")
da.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,...,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,...,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,1,78,3,1.0,3.0,...,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,2,2,56,3,1.0,5.0,...,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0


In [16]:
females = da[da["RIAGENDR"] == 2]
male = da[da["RIAGENDR"] == 1]

In [17]:
n1 = len(females)
mu1 = females["BMXBMI"].mean()
sd1 = females["BMXBMI"].std()

(n1, mu1, sd1)

(2976, 29.939945652173996, 7.75331880954568)

In [18]:
n2 = len(male)
mu2 = male["BMXBMI"].mean()
sd2 = male["BMXBMI"].std()

(n2, mu2, sd2)

(2759, 28.778072111846985, 6.252567616801485)

In [22]:
p_null = 0

In [23]:
def estimated_se_pooled(n1,n2,s1,s2):
    # Pooled variance weights each sample variance by its own degrees of freedom
    return np.sqrt( ((n1-1)*(s1**2) + (n2-1)*(s2**2))/(n1+n2-2) )*np.sqrt(1/n1+1/n2)

def estimated_se_unpooled(n1,n2,s1,s2):
    return np.sqrt( (s1**2)/n1 + (s2**2)/n2 )

def two_mean(n1, mu1, s1, n2, mu2, s2, p_null, pool=1):
    best_estimate = mu1-mu2
    hypo_estimate = p_null
    if pool:
        se = estimated_se_pooled(n1,n2,s1,s2)
        dof = n1+n2-2
    else:
        se = estimated_se_unpooled(n1,n2,s1,s2)
        dof = min(n1,n2)-1 # conservative degrees of freedom
    t = (best_estimate - hypo_estimate)/se
    p = stats.t.sf(np.abs(t), dof)*2 # two-sided
    print("t-statistic:",t)
    print("p-value:",p)

In [24]:
two_mean(n1,mu1,sd1, n2,mu2,sd2, p_null,1)


**OR**

In [25]:
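# ztest pools the variances (usevar='pooled') and uses the standard normal distribution.
# dropna() removes rows with missing BMI, so the sample sizes used here can be smaller than
# the len() counts passed to two_mean, and the statistic differs from the one printed above.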
sm.stats.ztest(
    x1=females["BMXBMI"].dropna(),
    x2=male["BMXBMI"].dropna(),
    value=0,
    alternative='two-sided',
    usevar='pooled',
    ddof=1.0
)

(6.1755933531383205, 6.591544431126401e-10)
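The pooled and unpooled standard errors used in this section can also be compared directly from the summary statistics printed earlier. The sketch below does this with only the standard library and should be read as a rough check rather than a replacement for the library call. It takes the `len()` counts as the sample sizes, which is an assumption: if some rows have a missing BMI, the true sample sizes are slightly smaller and the statistics shift accordingly.

```python
from math import sqrt

# Summary statistics printed earlier in this section
n1, mean1, sd1 = 2976, 29.939945652173996, 7.75331880954568    # females
n2, mean2, sd2 = 2759, 28.778072111846985, 6.252567616801485   # males

# Pooled: one common variance, each sample weighted by its degrees of freedom
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
se_pooled = sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2)

# Unpooled: each sample keeps its own variance
se_unpooled = sqrt(sd1**2 / n1 + sd2**2 / n2)

diff = mean1 - mean2
t_pooled = diff / se_pooled
t_unpooled = diff / se_unpooled

print("pooled:  ", se_pooled, t_pooled)
print("unpooled:", se_unpooled, t_unpooled)
```

The two sample standard deviations (7.75 and 6.25) are not very close, so the unpooled version may be the more defensible choice here. With samples this large, both approaches give a statistic far out in the tail.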