Every hypothesis test is built on two competing statements:

Null Hypothesis: $H_0$

Alternative Hypothesis: $H_a$

The tests we will discuss in this notebook are

1. One Population Proportion
2. Difference in Population Proportions
3. One Population Mean
4. Difference in Population Means

In [1]:
# import the libraries used in this notebook
import statsmodels.api as sm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### One Population Proportion



Research Question: In previous years, 52% of parents believed that electronics and social media were the cause of their teenager’s lack of sleep. Do more parents today believe that their teenager’s lack of sleep is caused by electronics and social media?

Population: Parents with a teenager (age 13-18), Parameter of Interest: p

Null Hypothesis: p = 0.52, Alternative Hypothesis: p > 0.52 (note that this is a one-sided test)

Data: 1018 people were surveyed. 56% of those surveyed believe that their teenager’s lack of sleep is caused by electronics and social media.

Use of proportions_ztest() from statsmodels

Note the argument alternative="larger" indicating a one-sided test. The function returns two values - the z-statistic and the corresponding p-value.

In [3]:
p = 0.52
n= 1018
data = 0.56

# number of "successes" = observed proportion * sample size
z_stat, p_value = sm.stats.proportions_ztest(data*n, n, p, alternative="larger")
print("The z-statistic is: {}".format(z_stat))
print("The corresponding p-value is: {}".format(p_value))

The z-statistic is: 2.571067795759113
The corresponding p-value is: 0.005069273865860533


Conclusion: Since the p-value of the z-test (~0.005) is well below the usual 0.05 significance level, we reject the null hypothesis that the proportion of parents who believe that their teenager’s lack of sleep is caused by electronics and social media is the same as the previous years’ estimate of 52%. Strictly speaking we do not "accept" the alternative hypothesis, but the data provide evidence that this proportion is now greater than 52%.
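As a cross-check, the z-statistic above can be reproduced by hand. The following is a minimal sketch using only the standard library; the helper name `one_proportion_ztest` is made up for illustration, and it assumes the standard error is computed from the sample proportion, which is consistent with the output printed above.

```python
import math

def one_proportion_ztest(p_hat, n, p0):
    """One-sided ("larger") z-test for a single proportion (illustrative helper)."""
    # standard error based on the sample proportion (an assumption, see lead-in)
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    z = (p_hat - p0) / se
    # upper-tail probability of the standard normal, P(Z > z)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

z, p_value = one_proportion_ztest(p_hat=0.56, n=1018, p0=0.52)
print(z, p_value)
```

Textbook versions of this test often use the null value 0.52 in the standard error instead, which gives a slightly different statistic; `proportions_ztest()` exposes a `prop_var` argument for that choice.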

### Difference in Population Proportions

Research Question: Is there a significant difference between the population proportions of parents of black children and parents of Hispanic children who report that their child has had some swimming lessons?

Populations: All parents of black children age 6-18 and all parents of Hispanic children age 6-18. Parameter of Interest: p1 - p2, where p1 is the proportion for parents of black children and p2 is the proportion for parents of Hispanic children.

Null Hypothesis: p1 - p2 = 0, Alternative Hypothesis: p1 - p2 $\neq$ 0

Data: 247 Parents of Black Children. 36.8% of parents report that their child has had some swimming lessons. 308 Parents of Hispanic Children. 38.9% of parents report that their child has had some swimming lessons.

Use of ttest_ind() from statsmodels: Here the difference in population proportions is examined with a two-sample t-test. Each response is a 0/1 outcome, so we simulate two samples with np.random.binomial() using the sample sizes and (rounded) proportions above and pass them to the t-test function. Because the samples are randomly generated and no seed is set, the test statistic and p-value change from run to run.

The function returns three values: (a) test statistic, (b) p-value of the t-test, and (c) degrees of freedom used in the t-test.

In [4]:
n1=247
p1=.37

n2=308
p2=.39

population1 = np.random.binomial(1,p1,n1)
population2 = np.random.binomial(1,p2,n2)

In [5]:
x,y,z=sm.stats.ttest_ind(population1,population2)

In [6]:
print('the value of t Stats is:{}'.format(x))
print('the p value of the t test is:{}'.format(y))
print('the degree of freedom is:{}'.format(z))

the value of t Stats is:1.0368991283987257
the p value of the t test is:0.3002360536344007
the degree of freedom is:553.0


Conclusion of the hypothesis test: Since the p-value is quite high (~0.30 in the run shown above), we cannot reject the null hypothesis in this case, i.e. the difference in the population proportions is not statistically significant.
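An alternative that avoids simulation is a pooled two-proportion z-test computed directly from the reported sample proportions. This is a sketch of that swapped-in technique, not the method used in the cells above; `two_proportion_ztest` is a hypothetical helper written only with the standard library.

```python
import math

def two_proportion_ztest(p1_hat, n1, p2_hat, n2):
    """Two-sided pooled z-test for a difference in proportions (illustrative helper)."""
    # pooled proportion under H0: p1 = p2
    pooled = (p1_hat * n1 + p2_hat * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    # two-sided p-value: 2 * P(Z > |z|) for a standard normal Z
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# sample proportions and sizes taken from the Data paragraph above
z, p_value = two_proportion_ztest(0.368, 247, 0.389, 308)
print(z, p_value)
```

Unlike the simulated samples, this calculation gives the same answer on every run, and it leads to the same decision: the null hypothesis is not rejected. The `proportions_ztest()` function used earlier also accepts arrays of counts and sample sizes for the two-sample case.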

### One Population Mean

Research Question: Suppose a cartwheeling competition was organized for some adults. The data look like the following:

(80.57, 98.96, 85.28, 83.83, 69.94, 89.59, 91.09, 66.25, 91.21, 82.7 , 73.54, 81.99, 54.01, 82.89, 75.88, 98.32, 107.2 , 85.53, 79.08, 84.3 , 89.32, 86.35, 78.98, 92.26, 87.01)

Is the average cartwheel distance (in inches) for adults more than 80 inches?

Population: All adults. Parameter of Interest: $\mu$, the population mean cartwheel distance.

Null Hypothesis: $\mu$ = 80, Alternative Hypothesis: $\mu$ > 80

Data: 25 adult participants. Sample mean $\bar{x} = 83.84$, standard deviation $10.72$ (computed with NumPy's default, which divides by n).

In [7]:
cwdata = np.array([80.57, 98.96, 85.28, 83.83, 69.94, 89.59, 91.09, 66.25, 91.21, 82.7 , 73.54, 81.99, 54.01, 82.89, 75.88, 98.32, 107.2 , 85.53, 79.08, 84.3 , 89.32, 86.35, 78.98, 92.26, 87.01])


In [8]:
n=len(cwdata)
mean=cwdata.sum()/n
std=cwdata.std()
(n,mean,std)

(25, 83.84320000000001, 10.716018932420752)

In [9]:
sm.stats.ztest(cwdata,value=80,alternative='larger')

(1.756973189172546, 0.039461189601168366)

Conclusion of the hypothesis test: Since the p-value (0.0394) is lower than the standard significance level of 0.05, we can reject the null hypothesis that the mean cartwheel distance for adults (a population quantity) is equal to 80 inches. The data provide evidence in support of the alternative hypothesis that the mean cartwheel distance is, in fact, higher than 80 inches. Note that we used alternative="larger" in the z-test.
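The z-statistic reported above can be reproduced from its definition, $z = (\bar{x} - \mu_0) / (s / \sqrt{n})$. The sketch below uses only the standard library and assumes the sample standard deviation (dividing by n - 1) is used for the standard error, which is consistent with the output above; the variable names are illustrative.

```python
import math
import statistics

cartwheel = [80.57, 98.96, 85.28, 83.83, 69.94, 89.59, 91.09, 66.25, 91.21,
             82.7, 73.54, 81.99, 54.01, 82.89, 75.88, 98.32, 107.2, 85.53,
             79.08, 84.3, 89.32, 86.35, 78.98, 92.26, 87.01]

mu0 = 80                               # value of the mean under H0
n = len(cartwheel)
x_bar = statistics.mean(cartwheel)
s = statistics.stdev(cartwheel)        # sample standard deviation (divides by n - 1)
se = s / math.sqrt(n)                  # standard error of the mean

z = (x_bar - mu0) / se
# one-sided p-value, P(Z > z) for a standard normal Z
p_value = 0.5 * math.erfc(z / math.sqrt(2))
print(z, p_value)
```

With only 25 observations, a one-sample t-test (for example `scipy.stats.ttest_1samp`) is a common alternative; it uses the same statistic but a t distribution, so its p-value would be somewhat larger.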

### Difference in Population Means

Research Question: Considering adults in the NHANES data, is there a significant difference between the mean Body Mass Index of males and females?

Population: Adults in the NHANES data. Parameter of Interest: $\mu_1 - \mu_2$, the difference in mean Body Mass Index (1 = females, 2 = males).

Null Hypothesis: $\mu_1 = \mu_2$, Alternative Hypothesis: $\mu_1 \neq \mu_2$

Data:

2976 Female Adults: $\bar{x}_1 = 29.94$, $s_1 = 7.75$

2759 Male Adults: $\bar{x}_2 = 28.78$, $s_2 = 6.25$

$\bar{x}_1 - \bar{x}_2 = 1.16$

In [10]:
#reading the dataset
data=pd.read_csv('https://raw.githubusercontent.com/kshedden/statswpy/master/NHANES/merged/nhanes_2015_2016.csv')

In [11]:
df=pd.DataFrame(data)

In [12]:
df.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,...,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,...,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,1,78,3,1.0,3.0,...,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,2,2,56,3,1.0,5.0,...,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0


In [13]:
# separate females and males; the mean and std of BMI are calculated in the next cells
female=df[df.RIAGENDR==2]
male = df[df.RIAGENDR==1]

In [14]:
#for female
n=len(female)
mean = female.BMXBMI.mean()
std = female.BMXBMI.std()
(n,mean,std)

(2976, 29.939945652173996, 7.75331880954568)

In [15]:
# for male
n=len(male)
mean = male.BMXBMI.mean()
std = male.BMXBMI.std()
(n,mean,std)

(2759, 28.778072111846985, 6.252567616801485)

In [16]:
sm.stats.ztest(female["BMXBMI"].dropna(), male["BMXBMI"].dropna(),alternative='two-sided')

(6.1755933531383205, 6.591544431126401e-10)

Conclusion: Since the p-value (6.59e-10) is extremely small, we can reject the null hypothesis that the mean BMI of males is the same as that of females. In this sample, females have the higher mean BMI (29.94 vs. 28.78).
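For intuition, a two-sample z-statistic can also be approximated from the rounded summary statistics alone. The sketch below assumes a pooled variance estimate and uses a hypothetical helper, `two_sample_ztest_from_summary`. Because the group sizes printed above count rows with missing BMI values, which the test itself drops, the result only approximately matches the statistic reported above.

```python
import math

def two_sample_ztest_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sided pooled-variance z-test from summary statistics (illustrative helper)."""
    # pooled variance estimate under the assumption of equal variances
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    z = (mean1 - mean2) / se
    # two-sided p-value: 2 * P(Z > |z|) for a standard normal Z
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# rounded summary statistics from the Data section (females first, then males)
z, p_value = two_sample_ztest_from_summary(29.94, 7.75, 2976, 28.78, 6.25, 2759)
print(z, p_value)
```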