<a href="https://colab.research.google.com/github/EdenShaveet/Disclosure-Curriculum/blob/main/Module2_NHANES_merge_subset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module Exercise: Merge, Subset & Preprocess NHANES Data (then disclose your methods!)
**Script Description:** Merges and subsets pre-pandemic demographic and body measurement data from NHANES, and provides accompanying methods disclosure text.

**Instructions:** Download two NHANES datasets from GitHub and upload them to this script in the code blocks marked with the symbols "⬅️🗂️." Run each code block sequentially to merge and subset the datasets to pre-defined characteristics. Review the methods disclosures at the end of this script and revise/add details as you see fit.

First, let's import necessary packages.

In [None]:
# Import packages
import pandas as pd
import seaborn as sns
import io

# Download NHANES Datasets from GitHub

Download pre-pandemic (2017-2020) [demographic NHANES data](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_DEMO.htm) as an xlsx file from [HERE](https://github.com/EdenShaveet/Disclosure-Curriculum/blob/main/NHANES_Demo_2017_2020.xlsx).

Download pre-pandemic (2017-2020) [body measures NHANES data](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_BMX.htm) as an xlsx file from [HERE](https://github.com/EdenShaveet/Disclosure-Curriculum/blob/main/NHANES_Body_2017_2020.xlsx).

# Upload & View Data

Run the following code block, select "choose files," and select your downloaded demographic dataset (NHANES_Demo_2017_2020.xlsx)

In [None]:
# Upload Demographic dataset
from google.colab import files
uploaded = files.upload()

In [None]:
# Name demographic dataset as "df_demo"
df_demo = pd.read_excel(io.BytesIO(uploaded['NHANES_Demo_2017_2020.xlsx']))
# Return dataset preview
df_demo

Let's take a look at the shape of our demographic dataset.

*Based on information from NHANES, we expect to see **15,560** cases (rows) and **29** variables (columns)*

In [None]:
df_demo.shape

Run the following code block, select "choose files," and select your downloaded body measures dataset (NHANES_Body_2017_2020.xlsx)

In [None]:
# Upload Body Measures dataset
from google.colab import files
uploaded = files.upload()

In [None]:
# Name body measures dataset as "df_body"
df_body = pd.read_excel(io.BytesIO(uploaded['NHANES_Body_2017_2020.xlsx']))
# Return dataset preview
df_body

Let's take a look at the shape of our body measures dataset.

*Based on information from NHANES, we expect to see **14,300** cases (rows) and **22** variables (columns)*

In [None]:
df_body.shape
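
The shape checks above can also be automated so that a mismatch is reported rather than spotted by eye. The following is a minimal sketch: `check_shape` is a hypothetical helper (not part of pandas or of this notebook), and the small DataFrame is toy data standing in for an uploaded dataset.

```python
import pandas as pd

def check_shape(df, expected_rows, expected_cols, name="dataset"):
    """Hypothetical helper: report whether df has the expected (rows, columns)."""
    rows, cols = df.shape
    ok = (rows, cols) == (expected_rows, expected_cols)
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: {rows} rows x {cols} columns "
          f"(expected {expected_rows} x {expected_cols}) -> {status}")
    return ok

# Toy stand-in for an uploaded dataset (3 cases, 2 variables)
toy = pd.DataFrame({"seqn": [1, 2, 3], "bmxwt": [60.5, 72.1, 80.0]})

toy_ok = check_shape(toy, 3, 2, name="toy")   # shape matches
toy_bad = check_shape(toy, 4, 2, name="toy")  # row count does not match
```

In this notebook, the equivalent calls would be `check_shape(df_demo, 15560, 29)` and `check_shape(df_body, 14300, 22)`.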

# Merge Datasets

We are going to merge our two datasets on the unique NHANES respondent identifier, "SEQN" (referenced as the lowercase column name `seqn` in the code below).

In [None]:
# Merge demographic and body measures datasets
df_merged = pd.merge(df_demo, df_body, on='seqn', how ='inner')
# Return merged dataset
df_merged

Let's take a look at the shape of our merged dataset.

*Since we performed an inner merge, we expect to see **14,300** cases (the same number as in the smaller of our two datasets, the body measures dataset) and **50** variables: our demographic dataset contains 29 variables, our body measures dataset contains 22 variables, and the variable we merged on is counted only once. (29+22)-1=50*

In [None]:
# View dataset shape (rows, columns)
df_merged.shape
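
To make the row and column arithmetic concrete, here is a small sketch of an inner merge on toy data. The values are invented for illustration; `validate` and `indicator` are standard `pd.merge` parameters, but their use here is a suggestion rather than part of the original exercise.

```python
import pandas as pd

# Toy stand-ins (hypothetical values): 4 demographic cases, 3 body-measure cases
demo = pd.DataFrame({
    "seqn": [1, 2, 3, 4],
    "riagendr": [1, 2, 2, 1],
    "ridageyr": [34, 51, 17, 62],
})
body = pd.DataFrame({"seqn": [2, 3, 4], "bmxwt": [70.2, 55.0, 88.4]})

# validate="one_to_one" raises a MergeError if seqn repeats in either table
merged = pd.merge(demo, body, on="seqn", how="inner", validate="one_to_one")

# Columns: (3 + 2) - 1 shared key; rows: only seqn values present in both tables
expected_cols = demo.shape[1] + body.shape[1] - 1

# An outer merge with indicator=True shows which cases the inner merge drops
audit = pd.merge(demo, body, on="seqn", how="outer", indicator=True)
dropped = audit.loc[audit["_merge"] != "both", "seqn"].tolist()

print(merged.shape, expected_cols, dropped)
```

Listing the dropped identifiers is one way to confirm that the merged row count equals the size of the smaller dataset only because every one of its cases also appears in the larger one.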

# Subset Dataset (Exclusions)

Let's say we are only interested in maintaining a dataset that contains the demographic and body measurement data of females aged 18+ years who were born in a U.S. state or Washington D.C.

We're going to subset our dataset to include only those individuals.

In [None]:
# Subset to females 18+ born in U.S. state or D.C.
df = df_merged[(df_merged.riagendr==2) & (df_merged.ridageyr>17) & (df_merged.dmdborn4==1)]
df

Let's say we also wish to exclude any case that is missing a measurement for upper arm length or arm circumference.

In [None]:
# Exclude cases where upper arm length or arm circumference could not be obtained.
# bmiarml and bmiarmc are comment codes (1 = could not obtain), so keep only rows where neither is flagged.
df = df[(df.bmiarml != 1) & (df.bmiarmc != 1)]
df
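
For the methods disclosure, it can help to record how many cases remain after each criterion. The sketch below tallies this on toy data with invented values. It assumes the NHANES coding used above (`riagendr` 2 = female, `dmdborn4` 1 = born in a U.S. state or D.C.) and that the comment-code columns hold 1 when a measurement could not be obtained and are otherwise empty.

```python
import numpy as np
import pandas as pd

# Toy merged data (hypothetical values)
toy = pd.DataFrame({
    "seqn":     [1, 2, 3, 4, 5, 6],
    "riagendr": [2, 2, 1, 2, 2, 2],
    "ridageyr": [25, 16, 40, 33, 58, 71],
    "dmdborn4": [1, 1, 1, 2, 1, 1],
    "bmiarml":  [np.nan, np.nan, np.nan, np.nan, 1, np.nan],
    "bmiarmc":  [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
})

# Inclusion criteria, applied cumulatively in this order
criteria = {
    "female": toy.riagendr == 2,
    "aged 18+": toy.ridageyr > 17,
    "born in U.S. state or D.C.": toy.dmdborn4 == 1,
}

counts = {"start": len(toy)}
keep = pd.Series(True, index=toy.index)
for label, mask in criteria.items():
    keep &= mask
    counts[label] = int(keep.sum())  # cases remaining after this criterion

# Among eligible cases, count how many carry a "could not obtain" flag
eligible = toy[keep]
flagged_length = int((eligible.bmiarml == 1).sum())
flagged_circ = int((eligible.bmiarmc == 1).sum())
flagged_either = int(((eligible.bmiarml == 1) | (eligible.bmiarmc == 1)).sum())

print(counts, flagged_length, flagged_circ, flagged_either)
```

Reporting these counts alongside the criteria lets a reader check their own reproduction step by step.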

# Download Dataset

Let's download our new dataset and save it in a place we'll remember. (You'll need it again in a later module.)

In [None]:
df.to_csv('NHANES_subset.csv')
files.download('NHANES_subset.csv')
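
One detail worth knowing when the file is reloaded later: by default, `to_csv` also writes the DataFrame index, which after subsetting holds the original row positions and comes back as an extra unnamed column. The sketch below shows the difference on toy data, using an in-memory round trip instead of a file download. Whether to pass `index=False` is a design choice and is not something the original exercise specifies.

```python
import io
import pandas as pd

# Toy subset (hypothetical values) with a non-contiguous index, as left by filtering
toy = pd.DataFrame({"seqn": [2, 5], "bmxwt": [70.2, 61.3]}, index=[1, 4])

# Round trip through CSV text, with and without the index column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

with_index_cols = list(with_index.columns)
without_index_cols = list(without_index.columns)

print(with_index_cols)
print(without_index_cols)
```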

# Methods Disclosure

Now, let's **disclose** our data acquisition and preprocessing (merging, subsetting/excluding) methods in such a way that if someone wanted to reproduce our methods, they could.

*Feel free to add your own revisions or additions to this disclosure!*

1. **Data Acquisition**
* This dataset was initially acquired in May 2022 from two pre-pandemic NHANES datasets: [2017-2020 Demographic Variables and Sample Weights](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_DEMO.htm#RIDAGEYR) and [2017-2020 Body Measures](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_BMX.htm#BMIWAIST). Information about sample eligibility is available in each dataset's linked documentation. Each dataset was converted from an XPT file to an XLSX file using Stata v.17 and made available in a [GitHub repository](https://github.com/EdenShaveet/Disclosure-Curriculum).

2. **Data Merging**
* An inner merge on participant identifier number was conducted using the Pandas package in Python within a Google Colab notebook. See Colab notebook for code.

3. **Data Subsetting**
* The merged dataset was subset to include only females aged 18+ at the time of participation who were born in a U.S. state or D.C. We additionally excluded respondents for whom upper arm length or arm circumference measurements were not obtained. See Colab notebook for code.
