# NHANES Hypothesis Tests Practice

<img src="images/hiptest.png"/>

### Import Labraries

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np
import scipy.stats.distributions as dist
from statsmodels.stats.proportion import proportion_confint

In [2]:
da = pd.read_csv("data/nhanes_2015_2016.csv")

da.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,...,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,...,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,1,78,3,1.0,3.0,...,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,2,2,56,3,1.0,5.0,...,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0


## Question 1

Conduct a hypothesis test (at the 0.05 level) for the null hypothesis that the proportion of women who smoke is equal to the proportion of men who smoke.

In [3]:
# Create 2 groups: 'males' and 'females'

da_male_smoke = da[da['RIAGENDR']==1]['SMQ020']
da_male_smoke = da_male_smoke[~da_male_smoke.isna()]
da_male_smoke.reset_index(inplace=True, drop=True)

da_female_smoke = da[da['RIAGENDR']==2]['SMQ020']
da_female_smoke = da_female_smoke[~da_female_smoke.isna()]
da_female_smoke.reset_index(inplace=True, drop=True)

In [4]:
# Sample sizes

n_male_smoke = sum(da_male_smoke==1)

n_male_not_smoke = sum(da_male_smoke!=1)

n_total_male = len(da_male_smoke)


n_female_smoke = sum(da_female_smoke==1)

n_female_not_smoke = len(da_male_smoke) - sum(da_female_smoke==1)  # some gaps in data

n_total_female = len(da_male_smoke)


print(n_male_smoke,n_male_not_smoke,n_total_male)
print(n_female_smoke,n_female_not_smoke,n_total_female)

1413 1346 2759
906 1853 2759


In [5]:
# Proportions

prop_male_smoke = n_male_smoke / n_total_male

prop_female_smoke = n_female_smoke / n_total_female

print('Proportion smokers between males:', round(prop_male_smoke,2))

print('Proportion smokers between females:', round(prop_female_smoke,2))

Proportion smokers between males: 0.51
Proportion smokers between females: 0.33


In [6]:
# Hypothesis Test for the "Difference in Two Proportions"

# Sample sizes
n1 = n_total_male
n2 = n_total_female

# Number of smokers in each gender
y1 = n_male_smoke
y2 = n_female_smoke

# Estimates of the population proportions
p1 = prop_male_smoke
p2 = prop_female_smoke

# Estimate of the combined population proportion (a kind of mean)
phat = (y1 + y2) / (n1 + n2)

# Estimate of the variance of the combined population proportion
va = phat * (1 - phat)

# Estimate of the standard error of the combined population proportion
se = np.sqrt(va * (1 / n1 + 1 / n2))

# Test statistic and its p-value
test_stat = (p1 - p2) / se
pvalue = 2*dist.norm.cdf(-np.abs(test_stat))

# Print the test statistic its p-value
print("Test Statistic")
print(round(test_stat, 2))

print("\nP-Value")
print(pvalue)

Test Statistic
13.83

P-Value
1.7414332306604475e-43


**Answer.** 

    Because our p-value < alpha value (0.05), we "Reject the null hypothesis".

The proportion of women who smoke **is not** the same the proportion of men who smoke.

__Q1a.__ Write 1-2 sentences explaining the substance of your findings to someone who does not know anything about statistical hypothesis tests.

**Answer.** The proportion of women who smoke is significally different from the proportion of men who smoke.

__Q1b.__ Construct three 95% confidence intervals: one for the proportion of women who smoke, one for the proportion of men who smoke, and one for the difference in the rates of smoking between women and men.

In [7]:
# Standard Error for difference of 2 population proportions

# Standard Error for males
se_male = np.sqrt(prop_male_smoke * (1 - prop_male_smoke)/ n_total_male)

# Standard Error for females
se_female = np.sqrt(prop_female_smoke * (1 - prop_female_smoke)/ n_total_female)

# Standard Error for the difference
se_diff = np.sqrt(se_female**2 + se_male**2)
se_diff

0.013057420421539416

In [8]:
# 95% C.I. for proportion female smokers

'''
se = np.sqrt((p * (1 - p))/n)

lcb = p - zstar * se
ucb = p + zstar * se

(lcb, ucb)
'''

# Boundaries
lcb_female = prop_female_smoke - 1.96 * se_female
ucb_female = prop_female_smoke + 1.96 * se_female

print('95% C.I. for proportion female smokers:',(lcb_female,ucb_female))

95% C.I. for proportion female smokers: (0.31085596540720084, 0.34590373013466214)


In [9]:
# 95% C.I. for proportion male smokers

'''
se = np.sqrt((p * (1 - p))/n)

lcb = p - zstar * se
ucb = p + zstar * se

(lcb, ucb)
'''

# Boundaries
lcb_male = prop_male_smoke - 1.96 * se_male
ucb_male = prop_male_smoke + 1.96 * se_male

print('95% C.I. for proportion female smokers:',(lcb_male,ucb_male))

95% C.I. for proportion female smokers: (0.4934902211293819, 0.5307939397984904)


In [10]:
# 95% C.I. for diference between proportions

'''
se_diff = np.sqrt(se_female**2 + se_male**2)

d = p1 - p2

lcb = d - zstar * se
ucb = d + zstar * se

(lcb, ucb)
'''

# Diference between proportions
d = prop_male_smoke - prop_female_smoke   

# Boundaries
lcb_diff = d - 1.96 * se_diff
ucb_diff = d + 1.96 * se_diff

print('95% C.I. for proportion female smokers:',(lcb_diff, ucb_diff))

95% C.I. for proportion female smokers: (0.15816968866678743, 0.20935477671922192)


__Q1c.__ Comment on any ways in which the confidence intervals that you found in part b reinforce, contradict, or add support to the hypothesis test conducted in part a.

**Answer.** The confidence intervals support our conclusion that the the proportion of males who smoke is significant from the proportion of females who smoke. The confidence interval for the proportion of males who smoke does not overlap with the proportion of females who smoke. The confidence interval for the difference between the two proportions does not include zero.

## Question 2

Partition the population into two groups based on whether a person has graduated college or not, using the educational attainment variable [DMDEDUC2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDEDUC2).  Then conduct a test of the null hypothesis that the average heights (in centimeters) of the two groups are equal.  Next, convert the heights from centimeters to inches, and conduct a test of the null hypothesis that the average heights (in inches) of the two groups are equal.

In [14]:
# Partition the population into two groups based on whether a person has graduated college or not.

graduated_height_cm = da[da['DMDEDUC2']==5]['BMXHT']
graduated_height_cm = graduated_height_cm[~graduated_height_cm.isna()]
graduated_height_cm.reset_index(inplace=True, drop=True)

not_graduated_height_cm = da[da['DMDEDUC2']!=5]['BMXHT']
not_graduated_height_cm = not_graduated_height_cm[~not_graduated_height_cm.isna()]
not_graduated_height_cm.reset_index(inplace=True, drop=True)

In [19]:
# Sample Sizes

n_graduated_height_cm = len(graduated_height_cm)

n_not_graduated_height_cm = len(not_graduated_height_cm)


# Means

graduated_mean_height_cm = graduated_height_cm.mean()

not_graduated_mean_height_cm = not_graduated_height_cm.mean()


# Standard Deviations

graduated_std_height_cm = graduated_height_cm.std()

not_graduated_std_height_cm = not_graduated_height_cm.std()


# Errors

se_graduated_height_cm = graduated_std_height_cm / n_graduated_height_cm

se_not_graduated_height_cm = not_graduated_std_height_cm / n_not_graduated_height_cm

sem_diff = np.sqrt(se_graduated_height_cm**2 + se_not_graduated_height_cm**2)     # SE for difference of 2 population means

In [21]:
# 95% C.I. for diference between means in cm

'''
se_diff = np.sqrt(se_A**2 + se_B**2)

mu = mu1 -mu2

lcb = mu - zstar * se
ucb = mu + zstar * se

(lcb, ucb)
'''

# Diference between means
d = abs(graduated_mean_height_cm - not_graduated_mean_height_cm)   

# Boundaries
lcb_diff = d - 1.96 * se_diff
ucb_diff = d + 1.96 * se_diff

print('95% C.I. for diference between means in cm:',(lcb_diff, ucb_diff))

95% C.I. for diference between means: (2.224315675065462, 2.275500763117897)


In [17]:
# The population into two groups in inches

graduated_height_inches = graduated_height_cm / 2.54
graduated_height_inches = graduated_height_inches[~graduated_height_inches.isna()]
graduated_height_inches.reset_index(inplace=True, drop=True)

not_graduated_height_inches = not_graduated_height_cm / 2.54
not_graduated_height_inches = not_graduated_height_inches[~not_graduated_height_inches.isna()]
not_graduated_height_inches.reset_index(inplace=True, drop=True)

In [22]:
# Sample Sizes
n_graduated_height_inches = len(graduated_height_inches)
n_not_graduated_height_inches = len(not_graduated_height_inches)


# Means
graduated_mean_height_inches = graduated_height_inches.mean()
not_graduated_mean_height_inches = not_graduated_height_inches.mean()


# Standard Deviations
graduated_std_height_inches = graduated_height_inches.std()
not_graduated_std_height_inches = not_graduated_height_inches.std()


# Errors
se_graduated_height_inches = graduated_std_height_inches / n_graduated_height_inches
se_not_graduated_height_inches = not_graduated_std_height_inches / n_not_graduated_height_inches
sem_diff = np.sqrt(se_graduated_height_inches**2 + se_not_graduated_height_inches**2)     # SE for difference

In [23]:
# 95% C.I. for diference between means in inches

'''
se_diff = np.sqrt(se_A**2 + se_B**2)

mu = mu1 -mu2

lcb = mu - zstar * se
ucb = mu + zstar * se

(lcb, ucb)
'''

# Diference between means
d = abs(graduated_mean_height_inches - not_graduated_mean_height_inches)   

# Boundaries
lcb_diff = d - 1.96 * se_diff
ucb_diff = d + 1.96 * se_diff

print('95% C.I. for diference between means in inches:',(lcb_diff, ucb_diff))

95% C.I. for diference between means in inches: (0.8601980934112066, 0.9113831814636411)


__Q2a.__ Based on the analysis performed here, are you confident that people who graduated from college have a different average height compared to people who did not graduate from college?

__Q2b:__ How do the results obtained using the heights expressed in inches compare to the results obtained using the heights expressed in centimeters?

## Question 3

Conduct a hypothesis test of the null hypothesis that the average BMI for men between 30 and 40 is equal to the average BMI for men between 50 and 60.  Then carry out this test again after log transforming the BMI values.

__Q3a.__ How would you characterize the evidence that mean BMI differs between these age bands, and how would you characterize the evidence that mean log BMI differs between these age bands?

## Question 4

Suppose we wish to compare the mean BMI between college graduates and people who have not graduated from college, focusing on women between the ages of 30 and 40.  First, consider the variance of BMI within each of these subpopulations using graphical techniques, and through the estimated subpopulation variances.  Then, calculate pooled and unpooled estimates of the standard error for the difference between the mean BMI in the two populations being compared.  Finally, test the null hypothesis that the two population means are equal, using each of the two different standard errors.

__Q4a.__ Comment on the strength of evidence against the null hypothesis that these two populations have equal mean BMI.

__Q4b.__ Comment on the degree to which the two populations have different variances, and on the extent to which the results using different approaches to estimating the standard error of the mean difference give divergent results.

## Question 5

Conduct a test of the null hypothesis that the first and second diastolic blood pressure measurements within a subject have the same mean values.

__Q5a.__ Briefly describe your findings for an audience that is not familiar with statistical hypothesis testing.

__Q5b.__ Pretend that the first and second diastolic blood pressure measurements were taken on different people.  Modfify the analysis above as appropriate for this setting.

__Q5c.__ Briefly describe how the approaches used and the results obtained in the preceeding two parts of the question differ.