# Practice notebook for hypothesis tests using NHANES data

This notebook will give you the opportunity to perform some hypothesis tests with the NHANES data that are similar to
what was done in the week 3 case study notebook.

You can enter your code into the cells that say "enter your code here", and you can type responses to the questions into the cells that say "Type Markdown and Latex".

Note that most of the code that you will need to write below is very similar to code that appears in the case study notebook.  You will need to edit code from that notebook in small ways to adapt it to the prompts below.

Remember: for means, the t-test and the z-test give nearly the same answer once each sample has more than about 30 observations.

To get started, we will use the same module imports and read the data in the same way as we did in the case study:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np
import scipy.stats as sem

da = pd.read_csv("nhanes_2015_2016.csv")

In [2]:
da.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,...,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,...,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,1,78,3,1.0,3.0,...,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,2,2,56,3,1.0,5.0,...,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0


## Question 1

Conduct a hypothesis test (at the 0.05 level) for the null hypothesis that the proportion of women who smoke is equal to the proportion of men who smoke.

In [3]:
# Codes: RIAGENDR 1 = male, 2 = female; SMQ020 1 = smoker, 2 = non-smoker.
# SMQ020 codes 7 and 9 are "refused" and "don't know"; set them to NaN and drop them.
# Recoding 2 -> 0 makes each group's mean equal to its proportion of smokers.
female_smoke = da.loc[da.RIAGENDR==2, "SMQ020"].replace({2:0, 7: np.nan, 9: np.nan}).dropna()
male_smoke = da.loc[da.RIAGENDR==1, "SMQ020"].replace({2:0, 7: np.nan, 9: np.nan}).dropna()
sm.stats.ttest_ind(female_smoke, male_smoke)

(-16.42058555898443, 3.032088786691117e-59, 5723.0)
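
Because `SMQ020` was recoded to 0/1, the t-test above is closely related to a two-sample z-test for proportions. The following is a minimal sketch of that z-test using only the standard library. The function name `two_proportion_ztest` and the counts are illustrative assumptions, not NHANES values.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(count1, n1, count2, n2):
    """Two-sided z-test of H0: p1 == p2 using the pooled proportion."""
    p1, p2 = count1 / n1, count2 / n2
    # Under H0 both groups share one proportion, so pool the successes.
    p_pool = (count1 + count2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 300 of 1000 in group 1, 500 of 1000 in group 2.
z, p_value = two_proportion_ztest(300, 1000, 500, 1000)
print(f"z = {z:.3f}, p = {p_value:.3g}")
```

statsmodels also exposes `sm.stats.proportions_ztest`, which takes counts and sample sizes and should give a comparable result.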

__Q1a.__ Write 1-2 sentences explaining the substance of your findings to someone who does not know anything about statistical hypothesis tests.

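Because the p-value is extremely small (about 3e-59), we reject the null hypothesis that the proportion of women who smoke equals the proportion of men who smoke. In plain terms, a gap this large between the two groups would be extremely unlikely to appear by chance alone if the smoking rates were truly the same.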

__Q1b.__ Construct three 95% confidence intervals: one for the proportion of women who smoke, one for the proportion of men who smoke, and one for the difference in the rates of smoking between women and men.

In [4]:
p_f_s = female_smoke.mean()
n_f_s = female_smoke.size

p_m_s = male_smoke.mean()
n_m_s = male_smoke.size

# NOTE: proportion_confint treats its arguments as one binomial count and one
# sample size, so the "difference" interval below is not the usual interval
# for the difference of two independent proportions.
diff_smoke = ((p_f_s - p_m_s) * (n_f_s + n_m_s))

# proportion_confint expects the count of successes, not the proportion
ci_female = sm.stats.proportion_confint(p_f_s * n_f_s, n_f_s)
ci_male = sm.stats.proportion_confint(p_m_s * n_m_s, n_m_s)
ci_diff = sm.stats.proportion_confint(-diff_smoke, n_f_s + n_m_s)

print(" Female Confidence Interval:",ci_female,"\n", 
      "Male Confidence Interval:",ci_male,"\n", 
      "Difference Confidence Interval:",ci_diff)

 Female Confidence Interval: (0.2882949879861214, 0.32139545615923526) 
 Male Confidence Interval: (0.49458749263718593, 0.5319290347874418) 
 Difference Confidence Interval: (0.1978916761657093, 0.21893440711356177)
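
For the difference between two independent proportions, the usual normal-approximation interval combines the two standard errors in quadrature rather than treating the difference as a single proportion. A minimal sketch under that assumption follows. The helper `diff_proportion_ci` and the counts are hypothetical, not NHANES values.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportion_ci(count1, n1, count2, n2, level=0.95):
    """Normal-approximation CI for p1 - p2 from two independent samples."""
    p1, p2 = count1 / n1, count2 / n2
    # Variances of independent estimates add, so the standard errors
    # combine in quadrature.
    se_diff = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    diff = p1 - p2
    return diff - z_crit * se_diff, diff + z_crit * se_diff


# Hypothetical counts: 300 of 1000 in group 1, 500 of 1000 in group 2.
lower, upper = diff_proportion_ci(300, 1000, 500, 1000)
print(f"95% CI for p1 - p2: ({lower:.3f}, {upper:.3f})")
```

An interval built this way is generally wider than one that treats the difference as a single binomial proportion, because it carries the sampling uncertainty of both groups.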


__Q1c.__ Comment on any ways in which the confidence intervals that you found in part b reinforce, contradict, or add support to the hypothesis test conducted in part a.

The upper limit of the female confidence interval (about 0.32) is well below the lower limit of the male confidence interval (about 0.49), so the two intervals do not overlap. This reinforces the decision above to reject the hypothesis that women and men smoke at the same rate.

## Question 2

Partition the population into two groups based on whether a person has graduated college or not, using the educational attainment variable [DMDEDUC2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDEDUC2).  Then conduct a test of the null hypothesis that the average heights (in centimeters) of the two groups are equal.  Next, convert the heights from centimeters to inches, and conduct a test of the null hypothesis that the average heights (in inches) of the two groups are equal.

In [5]:
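# Recode DMDEDUC2: 1, 2, and 3 become "no" (no college).
# 4 and 5 become "yes". Note that in the NHANES codebook 4 is "some college
# or AA degree", so "yes" here means attended college, not strictly graduated.
# 7 and 9 (refused, don't know) become NaN and are dropped.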
education = da[["DMDEDUC2", "BMXHT"]].replace({1:"no",
                                             2:"no",
                                             3:"no",
                                             4:"yes",
                                             5:"yes",
                                             7:np.nan,
                                             9:np.nan}).dropna()
education.columns = ["College", "Height"]
education.head()

Unnamed: 0,College,Height
0,yes,184.5
1,no,171.4
2,no,170.1
3,yes,160.9
4,yes,164.9


In [6]:
yes_edu = education.loc[education.College == "yes"]["Height"]
no_edu = education.loc[education.College == "no"]["Height"]

In [7]:
# ttest_ind expects a series of numbers for each group.
# 1 inch = 2.54 cm, so divide by 2.54 to convert centimeters to inches.
print(" Hypothesis in cm:", sm.stats.ttest_ind(yes_edu, no_edu), "\n",
      "Hypothesis in inches:", sm.stats.ttest_ind(yes_edu/2.54, no_edu/2.54))

 Hypothesis in cm: (10.29686451254072, 1.2272029342207103e-24, 5412.0) 
 Hypothesis in inches: (10.29686451254091, 1.2272029342183252e-24, 5412.0)


__Q2a.__ Based on the analysis performed here, are you confident that people who graduated from college have a different average height compared to people who did not graduate from college?

Yes. The p-value (about 1.2e-24) is far below 0.05, so we reject the null hypothesis that the two education groups have equal average height.

__Q2b:__ How do the results obtained using the heights expressed in inches compare to the results obtained using the heights expressed in centimeters?

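The results in centimeters and in inches are the same up to floating-point rounding. Converting units rescales both the difference in means and its standard error by the same factor, so the test statistic and the p-value do not change.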
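
The scale invariance of the t statistic can be checked directly. The sketch below uses synthetic heights (not NHANES data) and a hand-written pooled t statistic; the helper name `pooled_t` is an illustrative assumption.

```python
import numpy as np


def pooled_t(x, y):
    """Two-sample t statistic with a pooled variance estimate."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var * (1 / nx + 1 / ny))


# Synthetic heights in centimeters (illustrative, not NHANES data).
rng = np.random.default_rng(0)
heights_a = rng.normal(168.0, 10.0, size=200)
heights_b = rng.normal(166.0, 10.0, size=250)

t_cm = pooled_t(heights_a, heights_b)
t_in = pooled_t(heights_a / 2.54, heights_b / 2.54)  # centimeters to inches
print(t_cm, t_in)
```

Dividing every value by 2.54 divides both the numerator and the denominator of the statistic by 2.54, so the two printed values should agree up to rounding.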

## Question 3

Conduct a hypothesis test of the null hypothesis that the average BMI for men between 30 and 40 is equal to the average BMI for men between 50 and 60.  Then carry out this test again after log transforming the BMI values.

In [8]:
# RIAGENDR 1 = male; both age bands include their endpoints.
men_30_40 = da.loc[(da.RIAGENDR == 1) & (da.RIDAGEYR >= 30) & (da.RIDAGEYR <= 40), "BMXBMI"].dropna()
men_50_60 = da.loc[(da.RIAGENDR == 1) & (da.RIDAGEYR >= 50) & (da.RIDAGEYR <= 60), "BMXBMI"].dropna()
sm.stats.ttest_ind(men_30_40, men_50_60)

(0.8984008016755045, 0.3691930312327223, 978.0)

__Q3a.__ How would you characterize the evidence that mean BMI differs between these age bands?

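We fail to reject the null hypothesis: the p-value (about 0.37) is greater than 0.05, so the data provide little evidence that mean BMI differs between men aged 30 to 40 and men aged 50 to 60.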
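
The question also asks for the same test after log transforming BMI. Unlike a change of units, the log is a nonlinear transform, so the statistic and p-value can change. A minimal sketch of the comparison follows, assuming synthetic right-skewed data in place of the NHANES values and a hypothetical helper `two_sample_ztest`.

```python
import numpy as np
from statistics import NormalDist


def two_sample_ztest(x, y):
    """Two-sided z-test of equal means using the unpooled standard error."""
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    z = float((x.mean() - y.mean()) / se)
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


# Synthetic right-skewed "BMI" values (illustrative, not NHANES data).
rng = np.random.default_rng(1)
bmi_young = rng.lognormal(mean=np.log(28.0), sigma=0.2, size=400)
bmi_old = rng.lognormal(mean=np.log(28.5), sigma=0.2, size=400)

z_raw, p_raw = two_sample_ztest(bmi_young, bmi_old)
z_log, p_log = two_sample_ztest(np.log(bmi_young), np.log(bmi_old))
print(f"raw: z = {z_raw:.3f}, p = {p_raw:.3f}")
print(f"log: z = {z_log:.3f}, p = {p_log:.3f}")
```

With the notebook's data, the equivalent step would be to pass `np.log(men_30_40)` and `np.log(men_50_60)` to the same test used above. The log-scale test compares the means of log BMI, which reduces the influence of the right tail.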

## Question 4

Suppose we wish to compare the mean BMI between college graduates and people who have not graduated from college, focusing on women between the ages of 30 and 40. Then, calculate pooled and unpooled estimates of the standard error for the difference between the mean BMI in the two populations being compared.  Finally, test the null hypothesis that the two population means are equal, using each of the two different standard errors.

In [9]:
# There may be a simpler way, but here each recoded column is created
# separately: a single replace on the whole of da would map the code 1 in
# every column, so 1 = "men" for gender and 1 = "no" for education would collide.
da["Gender"] = da.RIAGENDR.replace({1: np.nan, 2: "women"})
da["Education"] = da.DMDEDUC2.replace({1:"no college", 
                                       2:"no college", 
                                       3:"no college", 
                                       4:"yes college", 
                                       5:"yes college", 
                                       7:np.nan, 
                                       9:np.nan})
# pd.cut with bins [30, 40] yields the interval (30, 40], so age 30 is excluded.
da["agegrp"] = pd.cut(da.RIDAGEYR,[30,40])

women = da[["Gender", "Education", "agegrp", "BMXBMI"]].dropna()

In [10]:
women.head()

Unnamed: 0,Gender,Education,agegrp,BMXBMI
7,women,yes college,"(30, 40]",28.2
34,women,yes college,"(30, 40]",25.5
50,women,no college,"(30, 40]",27.2
61,women,no college,"(30, 40]",35.3
65,women,yes college,"(30, 40]",27.8
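
One alternative way to build the subgroup is to recode each column with `Series.map` and select rows with a boolean mask. The sketch below uses a tiny made-up frame, not NHANES data. It also treats code 4 as not having graduated, which is an assumption about the codebook and differs from the recoding in the cell above.

```python
import pandas as pd

# Tiny made-up frame with NHANES-style codes (illustrative values only).
df = pd.DataFrame({
    "RIAGENDR": [1, 2, 2, 2, 1],
    "DMDEDUC2": [5, 1, 5, 9, 3],
    "RIDAGEYR": [35, 30, 40, 33, 52],
})

# Series.map recodes one column at a time, so the code 1 can mean "men"
# in one column and "no college" in another; unmapped codes become NaN.
df["Gender"] = df["RIAGENDR"].map({1: "men", 2: "women"})
df["Education"] = df["DMDEDUC2"].map(
    {1: "no college", 2: "no college", 3: "no college", 4: "no college", 5: "college"}
)

# A boolean mask with between() keeps both endpoints of the age band.
mask = (df["Gender"] == "women") & df["RIDAGEYR"].between(30, 40)
result = df.loc[mask].dropna()
print(result)
```

Because the mapping is applied per column, there is no need to turn unwanted gender codes into NaN just to filter them out.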


In [11]:
w_college_bmi = women.loc[women.Education == "yes college"]["BMXBMI"]
w_no_college_bmi = women.loc[women.Education == "no college"]["BMXBMI"]

w_college_bmi = sm.stats.DescrStatsW(w_college_bmi)
w_no_college_bmi = sm.stats.DescrStatsW(w_no_college_bmi)

print("pooled: ", sm.stats.CompareMeans(w_college_bmi, w_no_college_bmi).ttest_ind(usevar='pooled'))
print("unequal:", sm.stats.CompareMeans(w_college_bmi, w_no_college_bmi).ttest_ind(usevar='unequal'))


pooled:  (-0.9065765008408478, 0.3650983150014415, 467.0)
unequal: (-0.9197633379807121, 0.3583282185034655, 350.9363329961912)
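
The question also asks for the two standard errors themselves. As a sketch of the arithmetic behind the pooled and unpooled options, using synthetic samples (not NHANES data) and illustrative variable names:

```python
import numpy as np

# Synthetic BMI samples (illustrative, not NHANES data).
rng = np.random.default_rng(2)
group_a = rng.normal(28.0, 6.0, size=180)
group_b = rng.normal(29.0, 8.0, size=290)

n_a, n_b = len(group_a), len(group_b)
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)

# Pooled: assume both populations share one variance.
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
se_pooled = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Unpooled: let each population keep its own variance.
se_unpooled = np.sqrt(var_a / n_a + var_b / n_b)

diff = group_a.mean() - group_b.mean()
print("pooled SE:  ", se_pooled, " statistic:", diff / se_pooled)
print("unpooled SE:", se_unpooled, " statistic:", diff / se_unpooled)
```

The two standard errors differ most when the group sizes and the group variances are both unequal. When they are close, as in the output above, the two tests lead to the same conclusion.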


In [12]:
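# z-test versions of the same pooled and unpooled comparisons.
pooled = sm.stats.CompareMeans(w_college_bmi, w_no_college_bmi).ztest_ind(usevar='pooled')
unpooled = sm.stats.CompareMeans(w_college_bmi, w_no_college_bmi).ztest_ind(usevar='unequal')
# NOTE: this compares the two (statistic, p-value) result tuples with each
# other, so the output below is not a test of the BMI means.
sm.stats.ttest_ind(pooled, unpooled)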

(0.011164881676915218, 0.9921054824732218, 2.0)

__Q4a.__ Comment on the strength of evidence against the null hypothesis that these two populations have equal mean BMI.

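The pooled and unpooled t-tests give p-values of about 0.37 and 0.36, both well above 0.05, so there is little evidence against the null hypothesis that the two populations have equal mean BMI. The p-value of about 0.99 in the last cell comes from comparing the two test results with each other, not the two groups, and a p-value is in any case not the probability that the null hypothesis is true.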

## Question 5

Conduct a test of the null hypothesis that the first and second diastolic blood pressure measurements within a subject have the same mean values.

In [13]:
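# NOTE: BPXSY1 and BPXSY2 are the systolic readings; the diastolic readings
# are in BPXDI1 and BPXDI2. Both readings come from the same subject, so a
# paired test on the differences would match the question more closely
# than ttest_ind, which assumes independent samples.
blood_pressure = da[["BPXSY1", "BPXSY2"]].dropna()
blood_pressure.columns = ["bp1", "bp2"]
sm.stats.ttest_ind(blood_pressure["bp1"], blood_pressure["bp2"])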

(1.9065526348810609, 0.05660521364967507, 10736.0)
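
Since both readings come from the same person, a paired analysis tests whether the mean within-subject difference is zero. The sketch below contrasts that with the independent-samples calculation on synthetic readings (not NHANES data). All names and numbers are illustrative assumptions.

```python
import numpy as np
from statistics import NormalDist

# Synthetic paired readings (illustrative, not NHANES data): the second
# reading is the first minus a small shift, plus measurement noise.
rng = np.random.default_rng(3)
first = rng.normal(70.0, 12.0, size=500)
second = first - 0.5 + rng.normal(0.0, 4.0, size=500)

# A paired test is a one-sample test on the within-subject differences.
diffs = first - second
se = diffs.std(ddof=1) / np.sqrt(len(diffs))
z_paired = float(diffs.mean() / se)
p_paired = 2 * (1 - NormalDist().cdf(abs(z_paired)))

# Treating the readings as independent ignores the pairing and
# inflates the standard error.
se_indep = np.sqrt(first.var(ddof=1) / len(first) + second.var(ddof=1) / len(second))
z_indep = float((first.mean() - second.mean()) / se_indep)

print(f"paired:      z = {z_paired:.2f}, p = {p_paired:.3g}")
print(f"independent: z = {z_indep:.2f}")
```

With the notebook's data, the analogous step would be to take the difference of the two diastolic columns, drop missing values, and run a one-sample test on that difference.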