
# Practice notebook for multivariate analysis using NHANES data

This notebook will give you the opportunity to perform some multivariate analyses on your own using the NHANES study data.  These analyses are similar to what was done in the week 3 NHANES case study notebook.

You can enter your code into the cells that say "enter your code here", and you can type responses to the questions into the cells that say "Type Markdown and Latex".

Note that most of the code that you will need to write below is very similar to code that appears in the case study notebook.  You will need to edit code from that notebook in small ways to adapt it to the prompts below.

To get started, we will use the same module imports and read the data in the same way as we did in the case study:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np

da = pd.read_csv("nhanes_2015_2016.csv")
da.columns


## Question 1

Make a scatterplot showing the relationship between the first and second measurements of diastolic blood pressure ([BPXDI1](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm#BPXDI1) and [BPXDI2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm#BPXDI2)).  Also obtain the 4x4 matrix of correlation coefficients among the first two systolic and the first two diastolic blood pressure measures.

In [None]:
# Scatterplot with semi-transparent points to reduce overplotting
sns.regplot(x="BPXDI1", y="BPXDI2", data=da, fit_reg=False, scatter_kws={"alpha": 0.2})

# 4x4 correlation matrix of the first two systolic and first two diastolic measurements
print(da.loc[:, ["BPXSY1", "BPXSY2", "BPXDI1", "BPXDI2"]].dropna().corr())
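
The correlation code above calls `.dropna()` before `.corr()`, which keeps only rows that are complete in every selected column. Calling `.corr()` directly instead uses, for each pair of columns, every row where both values are present. The sketch below illustrates the difference on synthetic data; the column names `sy1`, `sy2`, `di1`, `di2` and the values are made up for illustration and are not NHANES data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for four blood pressure columns (hypothetical names and values)
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
toy = pd.DataFrame({
    "sy1": base + rng.normal(scale=0.5, size=n),
    "sy2": base + rng.normal(scale=0.5, size=n),
    "di1": rng.normal(size=n),
    "di2": rng.normal(size=n),
})
toy.loc[:9, "di2"] = np.nan  # blank out the first ten values of one column

# Listwise deletion: only rows complete in all four columns are used
listwise = toy.dropna().corr()

# Pairwise deletion: each pair uses every row where both values are present
pairwise = toy.corr()

print(listwise.shape)
print((listwise - pairwise).abs().round(4))
```

With few missing values the two matrices are usually close, but only the listwise version guarantees that every entry is computed from the same set of respondents.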

## Question 2

Construct a grid of scatterplots between the first systolic and the first diastolic blood pressure measurement.  Stratify the plots by gender (rows) and by race/ethnicity groups (columns).

In [None]:
# The imports and `da` come from the first cell; only the gender codes need relabeling here
da["RIAGENDR"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})
# Optional: relabel the race/ethnicity codes as well (this makes the column titles long)
#da["RIDRETH1"] = da.RIDRETH1.replace({1: "Mexican American", 2: "Other Hispanic", 3: "Non-Hispanic White", 4: "Non-Hispanic Black", 5: "Other Race - Including Multi-Racial"})
# Rows are gender, columns are race/ethnicity; each panel plots systolic against diastolic
q2 = sns.FacetGrid(da, col="RIDRETH1", row="RIAGENDR").map(plt.scatter, "BPXSY1", "BPXDI1", alpha=0.5)

## Question 3

Use "violin plots" to compare the distributions of ages within groups defined by gender and educational attainment.

In [None]:
plt.figure(figsize=(15, 5))
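# Recent seaborn releases require keyword arguments: education on x, age on y, split by gender
a = sns.violinplot(x="DMDEDUC2", y="RIDAGEYR", hue="RIAGENDR", data=da)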

## Question 4

Use violin plots to compare the distributions of BMI within a series of 10-year age bands.  Also stratify these plots by gender.

In [None]:
da['agegrp'] = pd.cut(da.RIDAGEYR, [20, 30, 40, 50, 60, 70, 80])
plt.figure(figsize=(15, 5))
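# Recent seaborn releases require keyword arguments: age band on x, BMI on y, split by gender
b = sns.violinplot(x="agegrp", y="BMXBMI", hue="RIAGENDR", data=da)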
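
The age bands above come from `pd.cut` with edges starting at 20. By default `pd.cut` builds right-closed intervals such as (20, 30], so a value equal to the lowest edge falls outside every bin and becomes missing. The sketch below shows this on a handful of made-up ages (not NHANES data) and one possible remedy, `include_lowest=True`.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([20, 21, 30, 31, 80])
edges = [20, 30, 40, 50, 60, 70, 80]

# Default: right-closed bins (20, 30], (30, 40], ..., so age 20 is not binned
default_bins = pd.cut(ages, edges)

# include_lowest=True closes the first bin on the left so that age 20 is kept
closed_bins = pd.cut(ages, edges, include_lowest=True)

print(default_bins.isna().tolist())  # [True, False, False, False, False]
print(closed_bins.isna().tolist())   # [False, False, False, False, False]
```

Whether respondents aged exactly 20 should be kept depends on how the bands are meant to be defined, so treat this as an option rather than a required change.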

## Question 5

Construct a frequency table for the joint distribution of ethnicity groups ([RIDRETH1](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#RIDRETH1)) and health-insurance status ([HIQ210](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/HIQ_I.htm#HIQ210)).  Normalize the results so that the values within each ethnic group are proportions that sum to 1.

In [None]:
da.groupby(["RIDRETH1", "HIQ210"]).size().unstack().fillna(0).apply(lambda x: x/x.sum(), axis=1)

In [None]:
pd.crosstab(da['RIDRETH1'], da['HIQ210'], margins=True, normalize='index')
# normalize='index' divides each row by its total, so every row sums to 1
# margins=True appends an "All" row holding the proportions for the whole sample
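
The two cells above reach the same row-normalized table by different routes. A small check on toy data makes that easier to see; the `group` and `status` columns below are hypothetical stand-ins for `RIDRETH1` and `HIQ210`, with made-up values.

```python
import pandas as pd

# Hypothetical toy data standing in for RIDRETH1 (group) and HIQ210 (status)
toy = pd.DataFrame({
    "group": [1, 1, 1, 1, 2, 2],
    "status": [1, 2, 2, 2, 1, 1],
})

# Route 1: count each combination, unstack, then divide each row by its total
by_hand = (
    toy.groupby(["group", "status"]).size()
    .unstack(fill_value=0)
    .apply(lambda row: row / row.sum(), axis=1)
)

# Route 2: crosstab with row normalization
tab = pd.crosstab(toy["group"], toy["status"], normalize="index")

print(tab)
print(tab.sum(axis=1))  # every row sums to 1
```

In this toy table, group 1 splits 0.25 / 0.75 across the two statuses and group 2 splits 1.0 / 0.0, whichever route is used.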