# Hypothesis Testing

In this notebook we demonstrate formal hypothesis testing using the NHANES data.

It is important to note that the NHANES data are a "complex survey". The data are not an independent and representative sample from the target population. Proper analysis of complex survey data should make use of additional information about how the data were collected. Since complex survey analysis is a somewhat specialized topic, we ignore this aspect of the data here, and analyze the NHANES data as if they were an independent and identically distributed sample from a population.

First we import the libraries that we will need.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import statsmodels.api as sm
import scipy.stats.distributions as dist

Below we read the data, and convert some of the integer codes to text values. The NHANES codebooks for `SMQ020`, `RIAGENDR`, and `DMDCITZN` describe the meanings of the numerical codes.

In [2]:
da = pd.read_csv("nhanes_2015_2016.csv")
da.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,...,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,...,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,1,78,3,1.0,3.0,...,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,2,2,56,3,1.0,5.0,...,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0


In [3]:
da = da.replace({
    "SMQ020" : {1 : "Yes", 2 : "No"},
    "RIAGENDR" : {1 : "Male", 2 : "Female"},
    "DMDCITZN" : {1 : "Yes", 2 : "No", 7 : np.nan, 9 : np.nan}
})
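
A quick way to check a recoding like the one above is to tabulate the recoded column. The sketch below uses a small made-up frame (the `toy` data are hypothetical, not NHANES rows) to show that `replace` with a nested dictionary only changes the codes listed for each column and leaves every other value as it was.

```python
import pandas as pd

# Hypothetical toy data: code 9 stands in for an unmapped survey code.
toy = pd.DataFrame({"SMQ020": [1, 2, 9, 1], "RIAGENDR": [1, 2, 2, 1]})

recoded = toy.replace({
    "SMQ020": {1: "Yes", 2: "No"},
    "RIAGENDR": {1: "Male", 2: "Female"},
})

# Tabulate the result, including missing values, to spot leftover codes.
counts = recoded["SMQ020"].value_counts(dropna=False)
print(counts)
```

Any numeric code that still appears in such a table was not covered by the mapping and will be carried into later calculations unchanged.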

In [4]:
da.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,Yes,Male,62,3,Yes,5.0,...,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,Yes,Male,53,3,No,3.0,...,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,Yes,Male,78,3,Yes,3.0,...,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,No,Female,56,3,Yes,5.0,...,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,No,Female,42,4,Yes,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0


##  1) Hypothesis Tests for one Proportion

### <ins> **Method 1 :**</ins>

The most basic hypothesis test may be the **one-sample test for a proportion**. This test is used if we have specified a particular value as the null value for the proportion, and we wish to assess if the data are compatible with the true parameter value being equal to this specified value. One-sample tests are not used very often in practice, because it is not very common that we have a specific fixed value to use for comparison.

For illustration, imagine that the rate of lifetime smoking in another country is known to be 40%, and we wish to assess whether the rate of lifetime smoking in the US differs from 40%. In the following notebook cell, we carry out the **(two-sided) one-sample test** that the population proportion of smokers is 0.4, and obtain a p-value of about 0.50. This indicates that the NHANES data are compatible with the proportion of (ever) smokers in the US being 40%.
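
Written out, the calculation in the next cell takes the following form. The symbols $\hat{p}$, $p_0$, $n$ and $\Phi$ are notation introduced here for illustration and do not appear in the notebook code.

$$
z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}, \qquad \text{p-value} = 2\,\Phi(-|z|)
$$

Here $\hat{p}$ is the sample proportion of "Yes" responses, $p_0 = 0.4$ is the null value, $n$ is the number of non-missing responses, and $\Phi$ is the standard normal distribution function.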

In [5]:
x = da.SMQ020.dropna() == "Yes"
p = x.mean()
se = np.sqrt(0.4 * (1-0.4) / len(x))

# Test-statistic and p-value :
test_stat = (p-0.4) / se
pvalue = 2 * dist.norm.cdf(-np.abs(test_stat))

print(test_stat, pvalue)

0.673856895322319 0.5004022987340593


### <ins> **Method 2 : statsmodels**</ins>

The following cell carries out the same test using the Statsmodels library. The result is slightly different from the one obtained above because Statsmodels by default uses the sample proportion instead of the null proportion when computing the standard error. This distinction is rarely consequential, and `proportions_ztest` accepts a `prop_var` argument so that the null proportion is used instead, in which case the result agrees with what we calculated above. Both variants use the normal approximation to the sampling distribution of the test statistic; an exact test based on the binomial sampling distribution is also available (`sm.stats.binom_test`). The p-values from all three approaches are expected to be nearly identical when the sample size is large and the proportion is not close to either 0 or 1.


In [6]:
# Normal approximation with estimated proportion in SE
print(sm.stats.proportions_ztest(count=x.sum(), nobs= len(x), value=0.4))

(0.672662805775994, 0.501161835318324)


## 2) Hypothesis Tests for two Proportions

Comparative tests tend to be used much more frequently than tests comparing one population to a fixed value. A **two-sample test of proportions** is used to assess whether the proportion of individuals with some trait differs between two sub-populations. For example, we can compare the smoking rates between females and males. Since smoking rates vary strongly with age, we do this in the subpopulation of people **between 20 and 25 years of age**. In the cell below, we carry out this test directly in Python, without calling a library test routine. We find that the smoking rate for males is around 10 percentage points greater than the smoking rate for females, and this difference is statistically significant (the p-value is around 0.01).
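
In symbols, the pooled two-sample statistic computed in the next cell can be written as follows. The notation is introduced here for illustration: $\hat{p}_F$ and $\hat{p}_M$ are the sample proportions of "Yes" responses among females and males, $n_F$ and $n_M$ are the group sizes, and $\hat{p}$ is the pooled proportion over both groups.

$$
z = \frac{\hat{p}_F - \hat{p}_M}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_F} + \dfrac{1}{n_M}\right)}}, \qquad \text{p-value} = 2\,\Phi(-|z|)
$$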

In [7]:
# Dropping missing values
dx = da[["SMQ020", "RIAGENDR", "RIDAGEYR"]].dropna()

# Restrict to people between 20 and 25 years old:
dx = dx.loc[(dx["RIDAGEYR"]>=20) & (dx["RIDAGEYR"]<=25), :]

# Summarize the data by calculating the proportion of "yes" responses and the sample size
p = dx.groupby("RIAGENDR")["SMQ020"].agg([lambda z : np.mean(z == "Yes"), "size"])
p.columns = ["Smoke", "N"]
print(p)

# The pooled rate of "Yes" responses, and the SE of the estimated difference of proportions
p_comb = (dx.SMQ020 == "Yes").mean()
va = p_comb * (1 - p_comb)
se = np.sqrt(va*(1/p.N.Female + 1/p.N.Male))

# Calculate the test statistic and the p-value:
test_stat = (p.Smoke.Female - p.Smoke.Male) / se
pvalue = 2 * dist.norm.cdf(-np.abs(test_stat))

print(test_stat, pvalue)

             Smoke    N
RIAGENDR               
Female    0.238095  273
Male      0.341270  252
-2.6092144683138088 0.00907503457584406


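Essentially the same test as above can be conducted by converting the "Yes"/"No" responses to numbers (Yes=1, No=0) and conducting a two-sample t-test, as below. Note that `replace` leaves any value other than "Yes" or "No" unchanged, so the NHANES non-response codes for `SMQ020` (7 = refused, 9 = don't know), which were not recoded above, would enter the t-test as the numbers 7 and 9. This is a likely reason why the t statistic below (about -1.36) is so far from the z statistic above (about -2.61).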

In [8]:
dx_females = dx.loc[dx.RIAGENDR == "Female", "SMQ020"].replace({
    "Yes" : 1,
    "No" : 0
})

dx_male = dx.loc[dx.RIAGENDR == "Male", "SMQ020"].replace({
    "Yes" : 1,
    "No" : 0
})

In [9]:
# print test statistic and p-value and degrees of freedom
sm.stats.ttest_ind(dx_females, dx_male)

(-1.360177655178216, 0.17435969067157403, 523.0)