# Practice notebook for univariate analysis using NHANES data

This notebook will give you the opportunity to perform some univariate analyses on your own using the NHANES data.  These analyses are similar to what was done in the week 2 NHANES case study notebook.

You can enter your code into the cells that say "enter your code here", and you can type responses to the questions into the cells that say "Type Markdown and Latex".

Note that most of the code that you will need to write below is very similar to code that appears in the case study notebook.  You will need to edit code from that notebook in small ways to adapt it to the prompts below.

To get started, we will use the same module imports and read the data in the same way as we did in the case study:

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np

da = pd.read_csv("nhanes_2015_2016.csv")

## Question 1

Relabel the marital status variable [DMDMARTL](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDMARTL) to have brief but informative character labels.  Then construct a frequency table of these values for all people, then for women only and for men only.  Then construct these three frequency tables using only people whose age is between 30 and 40.

In [2]:
r = {1: "Married", 2: "Widowed", 3: "Divorced", 4: "Separated", 5: "NeverMarried", 6: "Cohabitating", 77: "Refused", 99: "Unknown"}
da["DMDMARTLx"] = da["DMDMARTL"].replace(r)

print("All subjects:")
x = da["DMDMARTLx"].value_counts()
print(x / x.sum())

for ky,db in da.groupby("RIAGENDR"):
    print("\nRIAGENDR=", ky)
    x = db["DMDMARTLx"].value_counts()
    print(x / x.sum())
    
da3040 = da.query('RIDAGEYR >= 30 & RIDAGEYR <= 40')
for ky,db in da3040.groupby("RIAGENDR"):
    print("\nRIAGENDR=", ky, " 30 <= RIDAGEYR <= 40")
    x = db["DMDMARTLx"].value_counts()
    print(x / x.sum())

All subjects:
DMDMARTLx
Married         0.507855
NeverMarried    0.183412
Divorced        0.105773
Cohabitating    0.096273
Widowed         0.072342
Separated       0.033979
Refused         0.000365
Name: count, dtype: float64

RIAGENDR= 1
DMDMARTLx
Married         0.562881
NeverMarried    0.184451
Cohabitating    0.100991
Divorced        0.087271
Widowed         0.038110
Separated       0.025915
Refused         0.000381
Name: count, dtype: float64

RIAGENDR= 2
DMDMARTLx
Married         0.457193
NeverMarried    0.182456
Divorced        0.122807
Widowed         0.103860
Cohabitating    0.091930
Separated       0.041404
Refused         0.000351
Name: count, dtype: float64

RIAGENDR= 1  30 <= RIDAGEYR <= 40
DMDMARTLx
Married         0.556680
NeverMarried    0.204453
Cohabitating    0.157895
Divorced        0.048583
Separated       0.024291
Widowed         0.006073
Refused         0.002024
Name: count, dtype: float64

RIAGENDR= 2  30 <= RIDAGEYR <= 40
DMDMARTLx
Married         0.535714
Nev

__Q1a.__ Briefly comment on some of the differences that you observe between the distributions of marital status for women and for men, for people of all ages.

__Q1b.__ Briefly comment on the differences that you observe between the distribution of marital status for women in the overall population and for women between the ages of 30 and 40.

__Q1c.__ Repeat part b for the men.

## Question 2

Restricting to the female population, stratify the subjects into age bands no wider than ten years, and construct the distribution of marital status within each age band.  Within each age band, present the distribution in terms of proportions that must sum to 1.
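Proportions that sum to 1 within each age band mean that each row of an age-band-by-status table is divided by its own total. The following is a minimal sketch of that row-wise normalization, using a small made-up data set (the `toy` frame, its column names, and its values are hypothetical and are not NHANES data):

```python
import pandas as pd

# Toy data, made up for illustration only (not NHANES values)
toy = pd.DataFrame({
    "age": [25, 28, 34, 37, 39, 45, 48, 52],
    "status": ["NeverMarried", "Married", "Married", "Divorced",
               "Married", "Married", "Widowed", "Divorced"],
})

# Age bands no wider than ten years
toy["agegrp"] = pd.cut(toy["age"], [20, 30, 40, 50, 60])

# Counts: one row per age band, one column per status
counts = pd.crosstab(toy["agegrp"], toy["status"])

# Divide every row by its own total so each age band sums to 1
props = counts.div(counts.sum(axis=1), axis=0)
print(props.round(3))
```

Dividing by column totals instead would answer a different question: how each marital status is spread across the age bands.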

In [3]:
# insert your code here
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})
da.RIAGENDRx.value_counts()

RIAGENDRx
Female    2976
Male      2759
Name: count, dtype: int64

In [4]:
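# Dividing by the row count keeps males with a missing DMDMARTLx in the denominator,
# so these proportions sum to less than 1 and differ from the male table in Question 1.
da[da['RIAGENDRx'] == "Male"].DMDMARTLx.value_counts()/da[da['RIAGENDRx'] == "Male"].shape[0]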

DMDMARTLx
Married         0.535339
NeverMarried    0.175426
Cohabitating    0.096049
Divorced        0.083001
Widowed         0.036245
Separated       0.024647
Refused         0.000362
Name: count, dtype: float64

In [5]:
dx = da.groupby(["DMDMARTLx"])["RIAGENDRx"].value_counts().unstack()
dx = dx.apply(lambda x: x/x.sum(), axis=0)
print(dx.to_string(float_format="%.4f")) 

RIAGENDRx     Female   Male
DMDMARTLx                  
Cohabitating  0.0919 0.1010
Divorced      0.1228 0.0873
Married       0.4572 0.5629
NeverMarried  0.1825 0.1845
Refused       0.0004 0.0004
Separated     0.0414 0.0259
Widowed       0.1039 0.0381


In [15]:
da_fem = da[da["RIAGENDRx"] == "Female"].copy()  # copy so the new column is not set on a slice of da
da_fem["agegrp"] = pd.cut(da_fem.RIDAGEYR, [10, 20, 30, 40, 50, 60, 70, 80])
dx = da_fem.groupby(["agegrp"])["DMDMARTLx"].value_counts().unstack()
# axis=0 scales each marital status column to sum to 1 across the age bands;
# use axis=1 to get proportions that sum to 1 within each age band.
dx = dx.apply(lambda x: x/x.sum(), axis=0)
cm = sns.light_palette("pink", as_cmap=True)
dx.style.background_gradient(cmap=cm)


DMDMARTLx  Cohabitating  Divorced   Married  NeverMarried  Refused  Separated   Widowed
agegrp
(10, 20]       0.030534  0.000000  0.000767      0.057692      0.0   0.000000  0.000000
(20, 30]       0.404580  0.031429  0.120491      0.440385      0.0   0.093220  0.000000
(30, 40]       0.217557  0.122857  0.198005      0.186538      0.0   0.144068  0.006757
(40, 50]       0.141221  0.197143  0.221028      0.121154      0.0   0.279661  0.040541
(50, 60]       0.122137  0.237143  0.197237      0.080769      1.0   0.228814  0.094595
(60, 70]       0.072519  0.242857  0.162701      0.073077      0.0   0.186441  0.219595
(70, 80]       0.011450  0.168571  0.099770      0.040385      0.0   0.067797  0.638514


__Q2a.__ Comment on the trends that you see in this series of marginal distributions.

__Q2b.__ Repeat the analysis for males.

In [16]:
# insert your code here
da_m = da[da["RIAGENDRx"] == "Male"].copy()  # copy so the new column is not set on a slice of da
da_m["agegrp"] = pd.cut(da_m.RIDAGEYR, [10, 20, 30, 40, 50, 60, 70, 80])
dx = da_m.groupby(["agegrp"])["DMDMARTLx"].value_counts().unstack()
# axis=0 scales each marital status column to sum to 1 across the age bands;
# use axis=1 to get proportions that sum to 1 within each age band.
dx = dx.apply(lambda x: x/x.sum(), axis=0)
cm = sns.light_palette("blue", as_cmap=True)
dx.style.background_gradient(cmap=cm)


DMDMARTLx  Cohabitating  Divorced   Married  NeverMarried  Refused  Separated   Widowed
agegrp
(10, 20]       0.011321  0.000000  0.000677      0.074380      0.0   0.000000      0.00
(20, 30]       0.347170  0.008734  0.069736      0.466942      0.0   0.102941      0.02
(30, 40]       0.271698  0.104803  0.174678      0.183884      1.0   0.176471      0.02
(40, 50]       0.124528  0.148472  0.190928      0.080579      0.0   0.161765      0.02
(50, 60]       0.128302  0.248908  0.200406      0.097107      0.0   0.147059      0.10
(60, 70]       0.083019  0.240175  0.197021      0.078512      0.0   0.205882      0.17
(70, 80]       0.033962  0.248908  0.166554      0.018595      0.0   0.205882      0.67


__Q2c.__ Comment on any notable differences that you see when comparing these results between females and males.

## Question 3

Construct a histogram of the distribution of heights using the BMXHT variable in the NHANES sample.

In [19]:
# insert your code here
da["BMXHT"].describe()

count    5673.000000
mean      166.142834
std        10.079264
min       129.700000
25%       158.700000
50%       166.000000
75%       173.500000
max       202.700000
Name: BMXHT, dtype: float64

__Q3a.__ Use the `bins` argument to [histplot](https://seaborn.pydata.org/generated/seaborn.histplot.html#seaborn.histplot) to produce histograms with different numbers of bins.  Assess whether the default value for this argument gives a meaningful result, and comment on what happens as the number of bins grows excessively large or excessively small. 
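To get a feel for what the number of bins does before plotting, it can help to look at the bin counts directly. The sketch below uses simulated heights (an assumption made for illustration; these are not the NHANES values) and `np.histogram`, which bins data in the same spirit as an integer `bins` argument to `histplot`:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated heights in cm (made up for illustration; not NHANES values)
heights = rng.normal(loc=166, scale=10, size=5000)

summary = {}
for bins in (3, 30, 3000):
    counts, edges = np.histogram(heights, bins=bins)
    # Store bin width, the largest bin count, and the number of empty bins
    summary[bins] = (edges[1] - edges[0], int(counts.max()), int((counts == 0).sum()))
    print(f"bins={bins:4d}  bin width={summary[bins][0]:7.3f} cm  "
          f"largest count={summary[bins][1]:4d}  empty bins={summary[bins][2]}")
```

With very few bins almost all detail about the shape is lost, while with very many bins most bins hold only a handful of observations (or none), so the plot mostly shows sampling noise.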

__Q3b.__ Make separate histograms for the heights of women and men, then make a side-by-side boxplot showing the heights of women and men.

In [3]:
# insert your code here

__Q3c.__ Comment on what features, if any, are not represented clearly in the boxplots, and what features, if any, are easier to see in the boxplots than in the histograms.

## Question 4

Make a boxplot showing the distribution of within-subject differences between the first and second systolic blood pressure measurements ([BPXSY1](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm#BPXSY1) and [BPXSY2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm#BPXSY2)).

In [None]:
# insert your code here

__Q4a.__ What proportion of the subjects have a lower SBP on the second reading compared to the first?

In [None]:
# insert your code here
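One detail to watch when computing this proportion is missing data: a comparison involving `NaN` evaluates to `False`, so subjects with a missing reading would silently count as "not lower" unless they are dropped first. A minimal sketch with made-up readings (the values are hypothetical; only the column names come from NHANES):

```python
import numpy as np
import pandas as pd

# Toy readings, made up for illustration (NaN marks a missing measurement)
bp = pd.DataFrame({
    "BPXSY1": [128, 146, 138, 132, np.nan, 116, 110, 120],
    "BPXSY2": [124, 140, 132, 134, 100, np.nan, 110, 114],
})

# Keep only subjects with both readings, then take second minus first
both = bp.dropna(subset=["BPXSY1", "BPXSY2"])
diff = both["BPXSY2"] - both["BPXSY1"]

# The mean of a boolean Series is the proportion of True values
prop_lower = (diff < 0).mean()
print(f"{len(both)} complete pairs, proportion lower on second reading: {prop_lower:.3f}")
```

Ties (a difference of exactly zero) are counted as "not lower" here.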

__Q4b.__ Make side-by-side boxplots of the two systolic blood pressure variables.

In [4]:
# insert your code here

__Q4c.__ Comment on the variation within either the first or second systolic blood pressure measurements, and the variation in the within-subject differences between the first and second systolic blood pressure measurements.
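A useful fact when thinking about this comparison is that Var(X1 - X2) = Var(X1) + Var(X2) - 2 Cov(X1, X2), so when two readings on the same person are strongly positively correlated, their difference varies much less than either reading alone. The simulation below is only a sketch under an assumed model (a stable per-subject level plus independent reading noise, with made-up parameters); it is not based on the NHANES values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Assumed model: each subject has a stable underlying level,
# and each reading adds independent measurement noise
level = rng.normal(125, 18, size=n)
sbp1 = level + rng.normal(0, 5, size=n)
sbp2 = level + rng.normal(0, 5, size=n)
diff = sbp1 - sbp2

sd1, sd2, sd_diff = sbp1.std(ddof=1), sbp2.std(ddof=1), diff.std(ddof=1)
print(f"SD first: {sd1:.1f}  SD second: {sd2:.1f}  SD of difference: {sd_diff:.1f}")
```

The between-subject variation cancels in the difference, leaving only the reading-to-reading noise.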

## Question 5

Construct a frequency table of household sizes for people within each educational attainment category (the relevant variable is [DMDEDUC2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDEDUC2)).  Convert the frequencies to proportions.

In [None]:
# insert your code here
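One possible approach is `pd.crosstab` with `normalize="index"`, which converts the counts in each row to proportions of that row's total. In the sketch below the data are made up, and the use of `DMDHHSIZ` as the household-size column is an assumption (the prompt above does not name the variable), so check the NHANES codebook before relying on it:

```python
import pandas as pd

# Toy data, made up for illustration only.
# DMDHHSIZ as the household-size column is an assumption.
toy = pd.DataFrame({
    "DMDEDUC2": [3, 3, 3, 4, 4, 5, 5, 5, 5, 4],
    "DMDHHSIZ": [2, 4, 4, 1, 2, 2, 2, 3, 1, 2],
})

# normalize="index" divides each row by its total,
# giving one household-size distribution per education level
tab = pd.crosstab(toy["DMDEDUC2"], toy["DMDHHSIZ"], normalize="index")
print(tab.round(3))
```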

__Q5a.__ Comment on any major differences among the distributions.

__Q5b.__ Restrict the sample to people between 30 and 40 years of age.  Then calculate the median household size for women and men within each level of educational attainment.

In [7]:
# insert your code here
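A grouped median over two keys can be laid out as a small table with `unstack`. The sketch below uses made-up rows, and again treats `DMDHHSIZ` as the household-size column, which is an assumption not stated in the prompt:

```python
import pandas as pd

# Toy data, made up for illustration only (DMDHHSIZ is an assumed column name)
toy = pd.DataFrame({
    "RIDAGEYR": [31, 35, 38, 40, 33, 36, 52, 29, 30, 39],
    "RIAGENDR": [1, 1, 2, 2, 1, 2, 1, 2, 2, 1],
    "DMDEDUC2": [3, 3, 3, 3, 5, 5, 5, 5, 5, 5],
    "DMDHHSIZ": [2, 4, 3, 5, 1, 2, 6, 7, 4, 3],
})

# between() includes both endpoints by default, so ages 30 and 40 are kept
sub = toy[toy["RIDAGEYR"].between(30, 40)]

# Median household size per (education, gender) cell; genders become columns
med = sub.groupby(["DMDEDUC2", "RIAGENDR"])["DMDHHSIZ"].median().unstack()
print(med)
```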

## Question 6

The participants can be clustered into "masked variance units" (MVU) based on every combination of the variables [SDMVSTRA](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#SDMVSTRA) and [SDMVPSU](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#SDMVPSU).  Calculate the mean age ([RIDAGEYR](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#RIDAGEYR)), height ([BMXHT](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm#BMXHT)), and BMI ([BMXBMI](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm#BMXBMI)) for each gender ([RIAGENDR](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#RIAGENDR)), within each MVU, and report the ratio between the largest and smallest mean (e.g. for height) across the MVUs.

In [1]:
# insert your code here
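Grouping by both design variables at once yields one group per MVU. The sketch below shows the pattern for height only, on made-up values (hypothetical data; only the column names are from NHANES). The same steps would apply to `RIDAGEYR` and `BMXBMI`:

```python
import pandas as pd

# Toy data, made up for illustration: 2 strata x 2 PSUs x 2 genders x 2 subjects
toy = pd.DataFrame({
    "SDMVSTRA": [119] * 8 + [120] * 8,
    "SDMVPSU":  ([1] * 4 + [2] * 4) * 2,
    "RIAGENDR": [1, 1, 2, 2] * 4,
    "BMXHT":    [174.0, 176.0, 160.0, 162.0, 170.0, 172.0, 158.0, 160.0,
                 178.0, 182.0, 163.0, 165.0, 171.0, 173.0, 159.0, 163.0],
})

# Mean height for each gender within each MVU (stratum/PSU combination)
means = toy.groupby(["SDMVSTRA", "SDMVPSU", "RIAGENDR"])["BMXHT"].mean()

# For each gender, ratio of the largest to the smallest MVU mean
ratio = means.groupby(level="RIAGENDR").agg(lambda m: m.max() / m.min())
print(means.unstack())
print(ratio.round(4))
```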

__Q6a.__ Comment on the extent to which mean age, height, and BMI vary among the MVUs.

__Q6b.__ Calculate the inter-quartile range (IQR) for age, height, and BMI for each gender and each MVU.  Report the ratio between the largest and smallest IQR across the MVUs.

In [None]:
# insert your code here
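The IQR is the 75th percentile minus the 25th percentile, and a small helper function can be passed to `agg` to compute it per group. This is a sketch on made-up ages (the `iqr` helper and the data are hypothetical, introduced here for illustration):

```python
import pandas as pd

def iqr(x):
    """Inter-quartile range: 75th minus 25th percentile (missing values are skipped)."""
    return x.quantile(0.75) - x.quantile(0.25)

# Toy data, made up for illustration: two MVUs, one gender
toy = pd.DataFrame({
    "SDMVSTRA": [119] * 5 + [120] * 5,
    "SDMVPSU":  [1] * 10,
    "RIAGENDR": [1] * 10,
    "RIDAGEYR": [20, 30, 40, 50, 60, 35, 40, 45, 50, 55],
})

# IQR of age for each gender within each MVU
iqrs = toy.groupby(["SDMVSTRA", "SDMVPSU", "RIAGENDR"])["RIDAGEYR"].agg(iqr)

# Ratio of the largest to the smallest IQR across MVUs
iqr_ratio = iqrs.max() / iqrs.min()
print(iqrs)
print("ratio:", iqr_ratio)
```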

__Q6c.__ Comment on the extent to which the IQR for age, height, and BMI vary among the MVUs.