# Practice notebook for multivariate analysis using NHANES data

This notebook will give you the opportunity to perform some multivariate analyses on your own using the NHANES study data.  These analyses are similar to what was done in the week 3 NHANES case study notebook.

You can enter your code into the cells that say "enter your code here", and you can type responses to the questions into the cells that say "Type Markdown and Latex".

Note that most of the code that you will need to write below is very similar to code that appears in the case study notebook.  You will need to edit code from that notebook in small ways to adapt it to the prompts below.

To get started, we will use the same module imports and read the data in the same way as we did in the case study:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np
from scipy import stats

da = pd.read_csv("nhanes_2015_2016.csv")
da.columns

## Question 1

Make a scatterplot showing the relationship between the first and second measurements of diastolic blood pressure ([BPXDI1](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm#BPXDI1) and [BPXDI2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm#BPXDI2)).  Also obtain the 4x4 matrix of correlation coefficients among the first two systolic and the first two diastolic blood pressure measures.

In [None]:
# enter your code here
sns.regplot(x='BPXDI1', y='BPXDI2', data=da, fit_reg=False, scatter_kws={'alpha': 0.2});

In [None]:
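sns.jointplot(x='BPXSY1', y='BPXDI1', data=da, kind='kde');
print(da['BPXSY1'].corr(da['BPXDI1']))  # Pearson r; JointGrid.annotate is unavailable in recent seaborn releases

In [None]:
sns.jointplot(x='BPXSY2', y='BPXDI2', data=da, kind='kde');
print(da['BPXSY2'].corr(da['BPXDI2']))  # Pearson r, second systolic vs. second diastolic

In [None]:
sns.jointplot(x='BPXSY1', y='BPXSY2', data=da, kind='kde');
print(da['BPXSY1'].corr(da['BPXSY2']))  # Pearson r, repeated systolic measurements

In [None]:
sns.jointplot(x='BPXDI1', y='BPXDI2', data=da, kind='kde');
print(da['BPXDI1'].corr(da['BPXDI2']))  # Pearson r, repeated diastolic measurements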
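
Question 1 also asks for the 4x4 matrix of correlation coefficients. A minimal sketch of one way to obtain it, assuming the four blood pressure columns named in the prompt. The helper name `bp_correlations` and the toy data frame are made up for illustration, and rows with any missing reading are dropped before the correlations are computed:

```python
import pandas as pd

BP_COLS = ["BPXSY1", "BPXSY2", "BPXDI1", "BPXDI2"]

def bp_correlations(df):
    """Pearson correlations among the four blood pressure columns (hypothetical helper)."""
    # Drop rows with any missing reading so every coefficient uses the same people
    return df[BP_COLS].dropna().corr()

# Made-up readings, only to show the shape of the result; this is not NHANES data
toy = pd.DataFrame({
    "BPXSY1": [120, 130, 110, 140, 125],
    "BPXSY2": [118, 132, 112, 138, None],
    "BPXDI1": [80, 85, 70, 90, 82],
    "BPXDI2": [78, 86, 72, 88, 80],
})
corr = bp_correlations(toy)
print(corr.round(3))
```

With the real data, `bp_correlations(da)` would give the requested matrix. The `BPXSY1`/`BPXSY2` and `BPXDI1`/`BPXDI2` entries are the repeated-measurement correlations that Q1a asks about.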

__Q1a.__ How does the correlation between repeated measurements of diastolic blood pressure relate to the correlation between repeated measurements of systolic blood pressure?

Systolic and diastolic blood pressure are only weakly correlated with each other. The correlation coefficient between the first diastolic and first systolic measurements is around 0.32, and the coefficient between the second diastolic and second systolic measurements is around 0.30. In both cases, the weak correlation is consistent with some people having unusually high systolic blood pressure but average diastolic blood pressure.

__Q1b.__ Are the second systolic and second diastolic blood pressure measures more correlated or less correlated than the first systolic and first diastolic blood pressure measures?

The second systolic and second diastolic blood pressure measurements are slightly less correlated (around 0.30) than the first systolic and first diastolic blood pressure measurements (around 0.32).

## Question 2

Construct a grid of scatterplots between the first systolic and the first diastolic blood pressure measurement.  Stratify the plots by gender (rows) and by race/ethnicity groups (columns).

In [None]:
# insert your code here
sns.FacetGrid(da).map(plt.scatter, 'BPXSY1', 'BPXDI1', alpha=0.4).add_legend()

In [None]:
da['RIAGENDRx']=da.RIAGENDR.replace({1:'Male', 2:'Female'})
_=sns.FacetGrid(da, col="RIDRETH1", row='RIAGENDRx').map(plt.scatter, 'BPXSY1', 'BPXDI1', alpha=0.5).add_legend()

__Q2a.__ Comment on the extent to which these two blood pressure variables are correlated to different degrees in different demographic subgroups.

These scatterplots reveal differences in the means as well as differences in the degree of association (correlation) between the two variables.  We see that although some ethnic groups tend to have higher blood pressure than others, the relationship between systolic and diastolic blood pressure within genders is roughly similar across the ethnic groups.
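
The visual impression from the grid can be checked numerically by computing the correlation separately within each subgroup. The following is a sketch under stated assumptions: the helper name `subgroup_corr` and the toy values are hypothetical, while the column names are the ones used in the cells above.

```python
import pandas as pd

def subgroup_corr(df, x="BPXSY1", y="BPXDI1", by=("RIAGENDR", "RIDRETH1")):
    """Pearson correlation of x and y within each group (hypothetical helper)."""
    # Series.corr skips pairs where either value is missing
    return df.groupby(list(by))[[x, y]].apply(lambda g: g[x].corr(g[y]))

# Made-up values: one group rises together, the other moves in opposite directions
toy = pd.DataFrame({
    "RIAGENDR": [1, 1, 1, 2, 2, 2],
    "RIDRETH1": [3, 3, 3, 3, 3, 3],
    "BPXSY1": [110, 120, 130, 110, 120, 130],
    "BPXDI1": [70, 75, 80, 80, 75, 70],
})
by_group = subgroup_corr(toy)
print(by_group)
```

Calling `subgroup_corr(da)` on the real data would return one coefficient per gender and ethnicity combination, which can be compared cell by cell with the scatterplot grid.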

## Question 3

Use "violin plots" to compare the distributions of ages within groups defined by gender and educational attainment.

In [None]:
# insert your code here
plt.figure(figsize=(12, 4))
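# Replace the numeric education codes with readable labels
da['DMDEDUC2x'] = da.DMDEDUC2.replace({1:'less than 9th grade', 2:'9-11 grade', 3:'HS/GED', 4:'Some col/AA', 5:'College Graduate', 7:'Refused', 9:'DK'})
# Pass x and y by keyword; recent seaborn releases do not accept them positionally
a = sns.violinplot(x=da.DMDEDUC2x, y=da.RIDAGEYR)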

In [None]:
plt.figure(figsize=(12, 4))
b = sns.violinplot(x=da.RIAGENDRx, y=da.RIDAGEYR)

__Q3a.__ Comment on any evident differences among the age distributions in the different demographic groups.

In the violin plots of age by educational attainment, the distributions are centered at intermediate ages and are approximately symmetric. The median age for the "less than 9th grade" group appears to be higher than that of every other group except DK. In the violin plots of age by gender, both distributions are centered at intermediate ages, are approximately symmetric, and are unimodal.
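
The quartiles drawn inside each violin can also be tabulated directly, with education and gender combined as the question asks. A minimal sketch, assuming the recoded columns `DMDEDUC2x` and `RIAGENDRx` created in the cells above; the helper name `age_quartiles` and the toy ages are made up:

```python
import pandas as pd

def age_quartiles(df, groups=("DMDEDUC2x", "RIAGENDRx"), value="RIDAGEYR"):
    """25th, 50th and 75th percentiles of `value` within each group (hypothetical helper)."""
    q = df.groupby(list(groups))[value].quantile([0.25, 0.5, 0.75])
    # Move the quantile level into columns: one row per group, one column per quantile
    return q.unstack()

# Made-up ages, not NHANES data
toy = pd.DataFrame({
    "DMDEDUC2x": ["HS/GED"] * 6,
    "RIAGENDRx": ["Female"] * 3 + ["Male"] * 3,
    "RIDAGEYR": [20, 30, 40, 50, 60, 70],
})
quartiles = age_quartiles(toy)
print(quartiles)
```

A table like this makes it easier to back up a statement about which group is older, since the median is read from a number instead of being judged by eye from the plot.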

## Question 4

Use violin plots to compare the distributions of BMI within a series of 10-year age bands.  Also stratify these plots by gender.

In [None]:
# insert your code here
plt.figure(figsize=(12, 4))
da["agegrp"] = pd.cut(da.RIDAGEYR, [18, 30, 40, 50, 60, 70, 80])
c = sns.violinplot(x=da['agegrp'], y=da.BMXBMI)

In [None]:
plt.figure(figsize=(12, 4))
b = sns.violinplot(x=da.RIAGENDRx, y=da.BMXBMI)

__Q4a.__ Comment on the trends in BMI across the demographic groups.

In the violin plots of BMI within 10-year age bands, we can see quite clearly that the distributions are centered at intermediate BMI values and are strongly right-skewed. In the violin plots of BMI by gender, the distributions are also centered at intermediate values and are mostly right-skewed.
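
Two details of the age banding are worth checking in code. `pd.cut` builds right-closed intervals by default, so with the bin edges used above an age of exactly 18 falls outside the first band `(18, 30]` and becomes missing. The bands can then be crossed with gender to get a numeric counterpart to the stratified plot. This is a sketch on made-up values, assuming the column names used in the cells above:

```python
import pandas as pd

edges = [18, 30, 40, 50, 60, 70, 80]

# Right-closed bins: 18 is not in (18, 30], but 30 is
ages = pd.Series([18, 19, 30, 31, 80])
bands = pd.cut(ages, edges)
print(bands)

# Made-up BMI values, not NHANES data
toy = pd.DataFrame({
    "RIDAGEYR": [25, 28, 25, 35],
    "RIAGENDRx": ["Female", "Female", "Male", "Male"],
    "BMXBMI": [22.0, 26.0, 27.0, 30.0],
})
toy["agegrp"] = pd.cut(toy.RIDAGEYR, edges)

# Median BMI per age band (rows) and gender (columns); observed=True keeps only bands that occur
median_bmi = (
    toy.groupby(["agegrp", "RIAGENDRx"], observed=True)["BMXBMI"]
    .median()
    .unstack()
)
print(median_bmi)
```

If 18-year-olds should be included, `pd.cut` accepts `include_lowest=True`, which closes the first interval on the left.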

## Question 5

Construct a frequency table for the joint distribution of ethnicity groups ([RIDRETH1](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#RIDRETH1)) and health-insurance status ([HIQ210](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/HIQ_I.htm#HIQ210)).  Normalize the results so that the values within each ethnic group are proportions that sum to 1.

In [None]:
da.groupby('RIDRETH1')['HIQ210'].value_counts()

In [None]:
# insert your code here
# Count HIQ210 responses within each ethnic group
dx = da.groupby(['RIDRETH1'])['HIQ210']
dx = dx.value_counts()
# Pivot the response codes into columns, then divide each row by its total
dx = dx.unstack()
dx = dx.apply(lambda x: x/x.sum(), axis=1)
print(dx.to_string(float_format='%.3f'))
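
The same row-normalized table can be built in one step with `pd.crosstab`, whose `normalize="index"` option divides each row by its total. A minimal sketch on made-up codes, assuming the two column names from the prompt; rows with a missing value in either column are left out of the table:

```python
import pandas as pd

# Made-up codes, not NHANES data
toy = pd.DataFrame({
    "RIDRETH1": [1, 1, 1, 1, 3, 3],
    "HIQ210": [1, 2, 2, 2, 2, 2],
})

# Rows are ethnic groups, columns are HIQ210 response codes, each row sums to 1
tab = pd.crosstab(toy["RIDRETH1"], toy["HIQ210"], normalize="index")
print(tab.round(3))
```

On the real data this would be `pd.crosstab(da["RIDRETH1"], da["HIQ210"], normalize="index")`. In the NHANES codebook, `HIQ210` equal to 1 means the respondent reported a period without coverage in the past 12 months, so that column holds the uninsured rate for each group.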

__Q5a.__ Which ethnic group has the highest rate of being uninsured in the past year?

Mexican Americans had the highest rate of being uninsured in the past year.