# Wrangle NHANES

This notebook pulls raw NHANES dietary recall data for the 2017-2020 and 2021-2023 waves, joins it with FPED and demographics, and builds a single clean DF that we can use for HEI scoring, clustering, and analyses.

## Set Working Directory

When you open a notebook, the default working directory will be the folder that notebook is in. We want it to be the top (root) directory of the project, `ds1_nhanes`.

First, we need to mount our Google Drive, which contains the `ds1_nhanes` folder. The following chunk will mount the drive (if in Google Colab) and set the working directory to the root of the project folder. Note that this code chunk should be at the top of every notebook.

In [None]:
import os
import re

try:
  from google.colab import drive
  drive.mount('/content/drive')
  os.chdir('/content/drive/MyDrive/ds1_nhanes/')
except ImportError:
  # Not in Colab: if we are not already at the project root, step up one level
  from pathlib import Path
  if not re.search(r'ds1_nhanes$', str(os.getcwd())):
    os.chdir(Path(os.getcwd()).parent)

print(os.getcwd())

Mounted at /content/drive
/content/drive/MyDrive/ds1_nhanes


Bingo bongo, we're good to go.
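The local fallback in the cell above steps up exactly one folder, so it assumes the notebook sits one level below `ds1_nhanes`. As a minimal sketch of a more tolerant alternative, the hypothetical helper below (`find_project_root` is not part of this project) walks up the parent folders until it finds one with the project name. The example path is made up and nothing is read from disk.

```python
from pathlib import Path


def find_project_root(start, name="ds1_nhanes"):
    """Return the nearest folder named `name` at or above `start`, else None.

    Hypothetical helper for illustration only.
    """
    start = Path(start)
    for candidate in [start, *start.parents]:
        if candidate.name == name:
            return candidate
    return None


# Made-up nested location inside the project
nested = Path("/content/drive/MyDrive/ds1_nhanes/notebooks/wrangling")
root = find_project_root(nested)
print(root)

# A path outside the project gives None, which the caller should handle
print(find_project_root(Path("/tmp/somewhere_else")))
```

If something like this were adopted, the result could be passed to `os.chdir` after checking that it is not `None`.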

## Load Libraries and Data

In [None]:
import pandas as pd
import numpy as np
import sys

# If in colab, have to install pyreadstat
# If local, it is already installed
if 'google.colab' in sys.modules:
  !pip install pyreadstat
import pyreadstat

Collecting pyreadstat
  Downloading pyreadstat-1.2.8-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.0 kB)
Downloading pyreadstat-1.2.8-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.9 MB)
Installing collected packages: pyreadstat
Successfully installed pyreadstat-1.2.8


The DR1IFF and DR2IFF datasets contain individual food intake over two recall days for each respondent, identified by SEQN (the respondent sequence number). The surveys were conducted in waves, and we combine two of them: 2017-2020 and 2021-2023. Given the size of these datasets, we wanted to read only the columns relevant to our analysis for both days and both waves. However, `usecols` does not work on .XPT files, so we read everything and drop the extra columns afterward.
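As a rough sketch of the "read everything, then subset" approach, the hypothetical helper below (`keep_relevant` is not in the notebook) keeps only the relevant columns that a given frame actually has. This matters because day 1 and day 2 files use different column names. The toy frame stands in for the result of `pd.read_sas`, and its values are made up. Separately, pyreadstat's `read_xport` appears to accept a `usecols` argument, but that has not been checked against these files.

```python
import pandas as pd


def keep_relevant(df, relevant_cols):
    """Return a copy of df restricted to the relevant columns it contains."""
    present = [col for col in df.columns if col in relevant_cols]
    return df[present].copy()


# Toy stand-in for a frame returned by pd.read_sas (made-up values)
toy = pd.DataFrame({
    "SEQN": [1.0, 1.0, 2.0],
    "DR1IFDCD": [94000100.0, 92101000.0, 83102000.0],
    "DR1IGRMS": [120.0, 300.0, 4.9],
    "DR1IPROT": [0.0, 0.3, 0.1],  # not in the relevant list, so it is dropped
})
relevant = ["SEQN", "WTDR2D", "DR1IFDCD", "DR1IGRMS", "DR2IGRMS"]

small = keep_relevant(toy, relevant)
print(list(small.columns))
```

Returning a copy rather than dropping columns in place leaves the original frame untouched, which makes it easier to re-run a cell.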

In [None]:
# Read in the four datasets (all columns; the relevant ones are selected below)
dr1_17 = pd.read_sas('data/raw/nhanes_2017_2020/P_DR1IFF.xpt')
dr2_17 = pd.read_sas('data/raw/nhanes_2017_2020/P_DR2IFF.xpt')
dr1_21 = pd.read_sas('data/raw/nhanes_2021_2023/DR1IFF_L.xpt')
dr2_21 = pd.read_sas('data/raw/nhanes_2021_2023/DR2IFF_L.xpt')

print(dr1_17.columns)

print(dr1_21.columns)
# Create a list of all the datasets
datasets = [dr1_21, dr2_21, dr1_17, dr2_17]

# Define the relevant columns for this analysis
relevant_cols = ['SEQN', # respondent sequence number, unique identifier for each respondent
                 'WTDR2D', # weight_day_2_dietary, the two-day dietary weight for
                 # respondents who completed both recall days (2021-2023 name)
                 'WTDR2DPP', # same weight, different name in 2017 - 2020 data
                 'DR1IGRMS', # grams of the food item consumed, day 1
                 'DR2IGRMS', # grams of the food item consumed, day 2
                 'DR1IFDCD', # usda_food_code, food identifier, day 1
                 'DR2IFDCD', # usda_food_code, food identifier, day 2
                 'DR1IKCAL', # energy (kcal) day 1
                 'DR2IKCAL', # energy (kcal) day 2
                 'DR1ISFAT', # total saturated fatty acids (gm) day 1
                 'DR2ISFAT', # total saturated fatty acids (gm) day 2
                 'DR1ISODI', # sodium (mg) day 1
                 'DR2ISODI', # sodium (mg) day 2
                 'DR1IMFAT', # monounsaturated fatty acids (gm) day 1
                 'DR2IMFAT', # monounsaturated fatty acids (gm) day 2
                 'DR2IPFAT',  # polyunsaturated fatty acids (gm) day 2
                 'DR1IPFAT'] # polyunsaturated fatty acids (gm) day 1
for dataset in datasets:
  dataset.drop(columns=[col for col in dataset.columns if col not in relevant_cols], inplace=True)

#Look up what columns are relevant and load only what is relevant, check class notes - LB
print(dr1_21.info())
print(dr2_21.info())
print(dr1_17.info())
print(dr2_17.info())
dr1_21.head()


Index(['SEQN', 'WTDRD1PP', 'WTDR2DPP', 'DR1ILINE', 'DR1DRSTZ', 'DR1EXMER',
       'DRABF', 'DRDINT', 'DR1DBIH', 'DR1DAY', 'DR1LANG', 'DR1CCMNM',
       'DR1CCMTX', 'DR1_020', 'DR1_030Z', 'DR1FS', 'DR1_040Z', 'DR1IFDCD',
       'DR1IGRMS', 'DR1IKCAL', 'DR1IPROT', 'DR1ICARB', 'DR1ISUGR', 'DR1IFIBE',
       'DR1ITFAT', 'DR1ISFAT', 'DR1IMFAT', 'DR1IPFAT', 'DR1ICHOL', 'DR1IATOC',
       'DR1IATOA', 'DR1IRET', 'DR1IVARA', 'DR1IACAR', 'DR1IBCAR', 'DR1ICRYP',
       'DR1ILYCO', 'DR1ILZ', 'DR1IVB1', 'DR1IVB2', 'DR1INIAC', 'DR1IVB6',
       'DR1IFOLA', 'DR1IFA', 'DR1IFF', 'DR1IFDFE', 'DR1ICHL', 'DR1IVB12',
       'DR1IB12A', 'DR1IVC', 'DR1IVD', 'DR1IVK', 'DR1ICALC', 'DR1IPHOS',
       'DR1IMAGN', 'DR1IIRON', 'DR1IZINC', 'DR1ICOPP', 'DR1ISODI', 'DR1IPOTA',
       'DR1ISELE', 'DR1ICAFF', 'DR1ITHEO', 'DR1IALCO', 'DR1IMOIS', 'DR1IS040',
       'DR1IS060', 'DR1IS080', 'DR1IS100', 'DR1IS120', 'DR1IS140', 'DR1IS160',
       'DR1IS180', 'DR1IM161', 'DR1IM181', 'DR1IM201', 'DR1IM221', 'DR1IP182',
       

Unnamed: 0,SEQN,WTDR2D,DR1IFDCD,DR1IGRMS,DR1IKCAL,DR1ISFAT,DR1IMFAT,DR1IPFAT,DR1ISODI
0,130378.0,70554.222162,94000100.0,120.0,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,5.0
1,130378.0,70554.222162,94000100.0,120.0,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,5.0
2,130378.0,70554.222162,92101000.0,300.0,3.0,0.006,0.045,0.003,6.0
3,130378.0,70554.222162,94000100.0,240.0,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,10.0
4,130378.0,70554.222162,83102000.0,4.9,27.0,0.43,0.661,1.609,59.0


In [None]:
# Check sample sizes in each wave
# Use `dataset` as the loop variable so we don't shadow the built-in `set`
for dataset in datasets:
  print(len(dataset['SEQN'].unique()))

# What should our full sample size be if we are using 2-day dietary recall data:
print(f"Full sample size: {len(dr2_17['SEQN'].unique()) + len(dr2_21['SEQN'].unique())}")

6751
5879
12632
10830
Full sample size: 16709


## Explore Dietary Recall Data

Compare the number of rows in the df with the number of unique SEQNs (respondent IDs).

In [None]:
# Compare rows to unique respondent IDs
# first get number of rows
rows = dr1_21.shape[0]
unique_seqns = dr1_21['SEQN'].nunique()
print(f"{rows} rows and {unique_seqns} unique SEQN numbers in DR1.")

100116 rows and 6751 unique SEQN numbers in DR1.


There are far more rows than unique respondents. This is because for each respondent, there is one row for each individual food they consumed.
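To make the long format concrete, here is a small sketch with made-up values: each row is one food item, so counting rows per SEQN gives the number of foods each respondent reported.

```python
import pandas as pd

# Toy long-format recall data: one row per food item (made-up values)
toy = pd.DataFrame({
    "SEQN": [1, 1, 1, 2, 2],
    "DR1IFDCD": [94000100, 92101000, 83102000, 94000100, 11710801],
})

# Number of food rows reported by each respondent
foods_per_person = toy.groupby("SEQN").size()
print(foods_per_person)
print(f"{len(toy)} rows, {toy['SEQN'].nunique()} respondents")
```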

Check out how many unique food codes there are:

In [None]:
n_codes = dr1_21['DR1IFDCD'].nunique()
print(f"There are {n_codes} unique food codes")

There are 3987 unique food codes


In [None]:
# # Check SEQNs between dr1 and dr2
# diff = set(dr1_17['SEQN']).difference(dr2_17['SEQN'])
# print(f"{len(diff)} SEQNs missing from DR2 that were in DR1 in the 2017 - 2020 wave")
# diff2 = set(dr2_17['SEQN']).difference(dr1_17['SEQN'])
# print(f"{len(diff2)} SEQNs missing from DR1 that were in DR2 in the 2017 - 2020 wave")

# diff = set(dr1_21['SEQN']).difference(dr2_21['SEQN'])
# print(f"{len(diff)} SEQNs missing from DR2 that were in DR1 in the 2021 - 2023 wave")
# diff2 = set(dr2_21['SEQN']).difference(dr1_21['SEQN'])
# print(f"{len(diff2)} SEQNs missing from DR1 that were in DR2 in the 2021 - 2023 wave")


For the 2017-2020 wave, there are 1804 respondents who reported on day 1 but not day 2, and two respondents who reported on day 2 but not day 1.

For the 2021-2023 wave, there are 873 respondents who reported on day 1 but not day 2, and one respondent who reported on day 2 but not day 1. We will lose these people when we join.

The two-day weights (WTDR2D, or WTDR2DPP in 2017-2020) were derived from the day 1 weights (WTDRD1PP in 2017-2020) with a further adjustment for additional non-response on the second recall, so we will drop the respondents who did not respond on both days.
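The comparison described above comes down to set differences and an intersection on the SEQN values. A minimal sketch with made-up IDs (not NHANES respondents):

```python
# Toy SEQN sets (made up): one respondent is missing from each day
day1_seqns = {101, 102, 103, 104}
day2_seqns = {102, 103, 104, 105}

only_day1 = day1_seqns - day2_seqns   # reported on day 1 but not day 2
only_day2 = day2_seqns - day1_seqns   # reported on day 2 but not day 1
both_days = day1_seqns & day2_seqns   # the respondents we keep

print(f"{len(only_day1)} only in day 1, {len(only_day2)} only in day 2")
print(f"{len(both_days)} respondents completed both days")
```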


In [None]:
# # JEANNINE ADDED
# # Get a list of the missing SEQNs

# # 2017-2020 wave
# missing_from_dr2_17 = list(set(dr1_17['SEQN']).difference(dr2_17['SEQN']))
# missing_from_dr1_17 = list(set(dr2_17['SEQN']).difference(dr1_17['SEQN']))

# # 2021-2023 wave
# missing_from_dr2_21 = list(set(dr1_21['SEQN']).difference(dr2_21['SEQN']))
# missing_from_dr1_21 = list(set(dr2_21['SEQN']).difference(dr1_21['SEQN']))

# # Print the lists of missing SEQN values
# print("SEQNs missing from DR2 that were in DR1 in the 2017 - 2020 wave:", missing_from_dr2_17)
# print("SEQNs missing from DR1 that were in DR2 in the 2017 - 2020 wave:", missing_from_dr1_17)
# print("SEQNs missing from DR2 that were in DR1 in the 2021 - 2023 wave:", missing_from_dr2_21)
# print("SEQNs missing from DR1 that were in DR2 in the 2021 - 2023 wave:", missing_from_dr1_21)

## Join with FPED

Join FPED to each of the four DR datasets. We have to do this and aggregate within each DR before we combine the DRs together, otherwise we get cartesian merges.
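The sketch below illustrates the concern with toy data (made-up values, and `fped_toy` is a stand-in, not the real FPED file). Joining two long tables on SEQN pairs every day 1 row with every day 2 row, while the food-code merge is many-to-one and keeps the row count. The `validate` argument of `merge` is one way to have pandas check that assumption.

```python
import pandas as pd

# Toy recall tables: one row per food item (made-up values)
day1 = pd.DataFrame({"SEQN": [1, 1], "food_code": [10, 20], "grams": [100.0, 50.0]})
day2 = pd.DataFrame({"SEQN": [1, 1, 1], "food_code": [10, 30, 30], "grams": [80.0, 20.0, 40.0]})
fped_toy = pd.DataFrame({"foodcode": [10, 20, 30], "f_total": [1.0, 0.0, 0.5]})

# Joining the long tables on SEQN gives 2 * 3 = 6 rows for one respondent
cartesian = day1.merge(day2, on="SEQN", suffixes=("_d1", "_d2"))
print(len(cartesian))

# Many-to-one merge on food code; validate raises MergeError if
# foodcode were duplicated in fped_toy
merged = day1.merge(
    fped_toy,
    left_on="food_code",
    right_on="foodcode",
    how="left",
    validate="many_to_one",
)
print(len(merged) == len(day1))
```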

In [None]:
fped = pd.read_csv('data/miscellany/FPED_1720.csv')
fped.columns = fped.columns.str.lower()
fped.columns = fped.columns.str.replace(" ", "_")

fped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7444 entries, 0 to 7443
Data columns (total 39 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   foodcode                   7444 non-null   int64  
 1   description                7444 non-null   object 
 2   f_total_(cup_eq)           7444 non-null   float64
 3   f_citmlb_(cup_eq)          7444 non-null   float64
 4   f_other_(cup_eq)           7444 non-null   float64
 5   f_juice_(cup_eq)           7444 non-null   float64
 6   v_total_(cup_eq)           7444 non-null   float64
 7   v_drkgr_(cup_eq)           7444 non-null   float64
 8   v_redor_total_(cup_eq)     7444 non-null   float64
 9   v_redor_tomato_(cup_eq)    7444 non-null   float64
 10  v_redor_other_(cup_eq)     7444 non-null   float64
 11  v_starchy_total_(cup_eq)   7444 non-null   float64
 12  v_starchy_potato_(cup_eq)  7444 non-null   float64
 13  v_starchy_other_(cup_eq)   7444 non-null   float

Rename columns in the DR datasets that are not named consistently (food codes, grams, nutrients, kcal, and two-day weights). This will make it easier to map over the list of all datasets when we aggregate.

In [None]:
# Function to rename columns across all datasets
# These are the ones we care about that differ between DR dataset
def rename_columns(df):
    new_columns = {}
    for col in df.columns:
        if re.search(r'FDCD$', col):
            new_columns[col] = 'food_code'
        elif re.search(r'GRMS$', col):
            new_columns[col] = 'grams'
        elif re.search(r'SODI$', col):
            new_columns[col] = 'sodium'
        elif re.search(r'SFAT$', col):
            new_columns[col] = 'satfat'
        elif re.search(r'MFAT$', col):
            new_columns[col] = 'monofat'
        elif re.search(r'PFAT$', col):
            new_columns[col] = 'polyfat'
        elif re.search(r'^WTDR2', col):
            new_columns[col] = 'weight_2d'
        elif re.search(r'DR1IKCAL', col):
            new_columns[col] = 'kcal_d1'
        elif re.search(r'DR2IKCAL', col):
            new_columns[col] = 'kcal_d2'
        else:
            new_columns[col] = col

    df = df.rename(columns=new_columns)
    return df

# Rename columns in each dataset
datasets_renamed = list(map(rename_columns, datasets))

print("\nRenamed Datasets:\n")
list(map(lambda df: df.info(), datasets_renamed))



Renamed Datasets:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100116 entries, 0 to 100115
Data columns (total 9 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   SEQN       100116 non-null  float64
 1   weight_2d  100116 non-null  float64
 2   food_code  100116 non-null  float64
 3   grams      99787 non-null   float64
 4   kcal_d1    99787 non-null   float64
 5   satfat     99787 non-null   float64
 6   monofat    99787 non-null   float64
 7   polyfat    99787 non-null   float64
 8   sodium     99787 non-null   float64
dtypes: float64(9)
memory usage: 6.9 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88032 entries, 0 to 88031
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   SEQN       88032 non-null  float64
 1   weight_2d  88032 non-null  float64
 2   food_code  88032 non-null  float64
 3   grams      87778 non-null  float64
 4   kcal_d2    87778 non-null 

[None, None, None, None]

In [None]:
# Check sample sizes
for df in datasets_renamed:
  print(len(df['SEQN'].unique()))

6751
5879
12632
10830


Merge each DR with FPED

In [None]:
# Map over our list of datasets and merge each one with FPED
print('Shape before merge:')
for df in datasets_renamed:
  print(df.shape)

def merge_with_fped(df):
    return df.merge(fped, left_on='food_code', right_on='foodcode', how='left')

dfs_fped = list(map(merge_with_fped, datasets_renamed))

print('\nRow counts should be same after: ')
for df in dfs_fped:
  print(df.shape)

Shape before merge:
(100116, 9)
(88032, 9)
(183910, 9)
(149495, 9)

Row counts should be same after: 
(100116, 48)
(88032, 48)
(183910, 48)
(149495, 48)


In [None]:
# Check sample sizes
for df in dfs_fped:
  print(len(df['SEQN'].unique()))

6751
5879
12632
10830


## Combine Datasets

First reduce each dataset to the SEQNs that did both days of dietary recall

In [None]:
# datasets were listed as [dr1_21, dr2_21, dr1_17, dr2_17],
# so indices 0-1 are the 2021-2023 wave and 2-3 are the 2017-2020 wave
dr21_seqns = np.intersect1d(dfs_fped[0]['SEQN'], dfs_fped[1]['SEQN'])
print(f'Num unique SEQNs 2021: {len(dr21_seqns)}')

dfs_fped[0] = dfs_fped[0][dfs_fped[0]['SEQN'].isin(dr21_seqns)]
dfs_fped[1] = dfs_fped[1][dfs_fped[1]['SEQN'].isin(dr21_seqns)]

dr17_seqns = np.intersect1d(dfs_fped[2]['SEQN'], dfs_fped[3]['SEQN'])
print(f'Num unique SEQNs 2017: {len(dr17_seqns)}')

dfs_fped[2] = dfs_fped[2][dfs_fped[2]['SEQN'].isin(dr17_seqns)]
dfs_fped[3] = dfs_fped[3][dfs_fped[3]['SEQN'].isin(dr17_seqns)]

print('\nUnique SEQNs that were in both days:')
for df in dfs_fped:
  print(len(df['SEQN'].unique()))

Num unique SEQNs 2021: 5878
Num unique SEQNs 2017: 10828

Unique SEQNs that were in both days:
5878
5878
10828
10828


Now concatenate each set of DR days together for each wave.

In [None]:
# Combine first and second DF to make a df for 2021-2023
df_21 = pd.concat([dfs_fped[0], dfs_fped[1]], ignore_index=True)
print(df_21.shape)

# Combine third and fourth DF to make df for 2017-2020
df_17 = pd.concat([dfs_fped[2], dfs_fped[3]], ignore_index=True)
print(df_17.shape)

(175182, 49)
(308387, 49)


In [None]:
# Check sample sizes
df21ss = len(df_21['SEQN'].unique())
df17ss = len(df_17['SEQN'].unique())
print(f"2021-2023: {df21ss}")
print(f"2017-2020: {df17ss}")
print(f'Total: {df21ss + df17ss}')

2021-2023: 5878
2017-2020: 10828
Total: 16706


Get food group totals for each food item for each person. FPED values are expressed per 100 grams of food, so take grams, divide by 100, then multiply by every food group column.
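A worked example of that scaling on made-up numbers: 300 g of a food with 0.5 cup equivalents of fruit per 100 g contributes 300 / 100 * 0.5 = 1.5 cup equivalents. The sketch applies the same row-wise multiplication to a toy frame with two FPED-style columns.

```python
import pandas as pd

# Toy rows (made-up values); the FPED-style columns are per 100 g of food
toy = pd.DataFrame({
    "grams": [300.0, 50.0],
    "f_total_(cup_eq)": [0.5, 0.0],
    "g_total_(oz_eq)": [0.0, 2.0],
})
fped_cols = ["f_total_(cup_eq)", "g_total_(oz_eq)"]

# Scale each per-100 g value by the amount actually eaten
scaled = toy.copy()
scaled[fped_cols] = toy[fped_cols].multiply(toy["grams"] / 100, axis=0)
print(scaled)
```

Working on a copy keeps the per-100 g values available, which avoids scaling the same columns twice if a cell is re-run.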

In [None]:
# Put them back into a list
waves = [df_17, df_21]
# print(waves[0].head())

def get_food_group_totals(df):
    cols = df.loc[:, 'f_total_(cup_eq)':'a_drinks_(no._of_drinks)'].columns
    df[cols] = df[cols].multiply(df['grams'] / 100, axis=0)
    return df

waves_fped = list(map(
    get_food_group_totals,
    waves
))

print(waves_fped[0].head())

       SEQN     weight_2d   food_code  grams  kcal_d1  satfat  monofat  \
0  109263.0  17808.067666  28320300.0  199.5    114.0   1.472    2.105   
1  109263.0  17808.067666  91746110.0   20.0    101.0   2.659    1.255   
2  109263.0  17808.067666  58106210.0  238.0    633.0  10.627    6.207   
3  109263.0  17808.067666  64104010.0  209.0     99.0   0.046    0.013   
4  109263.0  17808.067666  11710801.0  124.0    123.0   1.557    2.401   

   polyfat  sodium  foodcode  ... pf_legumes_(oz_eq)  d_total_(cup_eq)  \
0    0.948   649.0  28320300  ...                0.0             0.000   
1    0.534    24.0  91746110  ...                0.0             0.010   
2    4.001  1423.0  58106210  ...                0.0             1.666   
3    0.082    10.0  64104010  ...                0.0             0.000   
4    1.324    45.0  11710801  ...                0.0             0.000   

   d_milk_(cup_eq)  d_yogurt_(cup_eq)  d_cheese_(cup_eq)  oils_(grams)  \
0            0.000                0.

Group by SEQN and aggregate FPED variables
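One detail worth seeing on toy data (made-up values) is how the weight column behaves under different aggregation functions. With `'unique'`, each cell holds an array, so the column has `object` dtype and needs converting later. With `'first'`, each cell is a plain number. The sketch assumes each respondent has a single weight value repeated on every row.

```python
import pandas as pd

# Toy long-format data: the weight repeats on every food row (made-up values)
toy = pd.DataFrame({
    "SEQN": [1, 1, 2],
    "weight_2d": [1000.0, 1000.0, 2500.0],
    "grams": [120.0, 300.0, 50.0],
})

by_unique = toy.groupby("SEQN").agg({"weight_2d": "unique", "grams": "sum"})
by_first = toy.groupby("SEQN").agg({"weight_2d": "first", "grams": "sum"})

print(by_unique["weight_2d"].dtype)
print(by_first["weight_2d"].dtype)
print(by_first)
```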

In [None]:
## Grouping by SEQN within each wave
# Set aggregation functions so we don't have to do them all manually
# Everything except food code and description in FPED should be summed
print(waves_fped[0].columns)
# aggregate the nhanes columns and fped columns
cols_to_sum = ['grams', 'satfat', 'monofat', 'polyfat',
       'sodium', 'f_total_(cup_eq)',
       'f_citmlb_(cup_eq)', 'f_other_(cup_eq)', 'f_juice_(cup_eq)',
       'v_total_(cup_eq)', 'v_drkgr_(cup_eq)', 'v_redor_total_(cup_eq)',
       'v_redor_tomato_(cup_eq)', 'v_redor_other_(cup_eq)',
       'v_starchy_total_(cup_eq)', 'v_starchy_potato_(cup_eq)',
       'v_starchy_other_(cup_eq)', 'v_other_(cup_eq)', 'v_legumes_(cup_eq)',
       'g_total_(oz_eq)', 'g_whole_(oz_eq)', 'g_refined_(oz_eq)',
       'pf_total_(oz_eq)', 'pf_mps_total_(oz_eq)', 'pf_meat_(oz_eq)',
       'pf_curedmeat_(oz_eq)', 'pf_organ_(oz_eq)', 'pf_poult_(oz_eq)',
       'pf_seafd_hi_(oz_eq)', 'pf_seafd_low_(oz_eq)', 'pf_eggs_(oz_eq)',
       'pf_soy_(oz_eq)', 'pf_nutsds_(oz_eq)', 'pf_legumes_(oz_eq)',
       'd_total_(cup_eq)', 'd_milk_(cup_eq)', 'd_yogurt_(cup_eq)',
       'd_cheese_(cup_eq)', 'oils_(grams)', 'solid_fats_(grams)',
       'add_sugars_(tsp_eq)', 'a_drinks_(no._of_drinks)', 'kcal_d1', 'kcal_d2']

# Set aggregation functions
aggs = {col: 'sum' for col in cols_to_sum}
aggs['SEQN'] = 'first'
aggs['weight_2d'] = 'unique'
aggs['grams'] = 'sum'

# Aggregate each dataset
waves_grouped = list(map(
    lambda df:
        df.groupby('SEQN').agg(aggs),
    waves_fped
))

print('\nSample sizes:')
for wave in waves_grouped:
  print(len(wave['SEQN'].unique()))

print('\nShapes:')
for wave in waves_grouped:
  print(wave.shape)

# print(waves_grouped[0].head())
print(waves_grouped[0].columns)

Index(['SEQN', 'weight_2d', 'food_code', 'grams', 'kcal_d1', 'satfat',
       'monofat', 'polyfat', 'sodium', 'foodcode', 'description',
       'f_total_(cup_eq)', 'f_citmlb_(cup_eq)', 'f_other_(cup_eq)',
       'f_juice_(cup_eq)', 'v_total_(cup_eq)', 'v_drkgr_(cup_eq)',
       'v_redor_total_(cup_eq)', 'v_redor_tomato_(cup_eq)',
       'v_redor_other_(cup_eq)', 'v_starchy_total_(cup_eq)',
       'v_starchy_potato_(cup_eq)', 'v_starchy_other_(cup_eq)',
       'v_other_(cup_eq)', 'v_legumes_(cup_eq)', 'g_total_(oz_eq)',
       'g_whole_(oz_eq)', 'g_refined_(oz_eq)', 'pf_total_(oz_eq)',
       'pf_mps_total_(oz_eq)', 'pf_meat_(oz_eq)', 'pf_curedmeat_(oz_eq)',
       'pf_organ_(oz_eq)', 'pf_poult_(oz_eq)', 'pf_seafd_hi_(oz_eq)',
       'pf_seafd_low_(oz_eq)', 'pf_eggs_(oz_eq)', 'pf_soy_(oz_eq)',
       'pf_nutsds_(oz_eq)', 'pf_legumes_(oz_eq)', 'd_total_(cup_eq)',
       'd_milk_(cup_eq)', 'd_yogurt_(cup_eq)', 'd_cheese_(cup_eq)',
       'oils_(grams)', 'solid_fats_(grams)', 'add_sugars_(

Now we are down to 1 row per SEQN per wave, including both days of dietary recall.

Finally, we can concat both waves into a single df and divide the 2-day weights by 2.
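A small arithmetic sketch of what halving does, using made-up weights: after pooling two waves and dividing every weight by 2, the weights sum to the average of the two per-wave totals rather than to their sum. Whether equal halves are the right split for these particular cycles is an assumption that should be checked against the NHANES analytic guidance.

```python
# Made-up weights for two waves (not NHANES values)
wave_a = [1000.0, 3000.0]
wave_b = [2000.0, 2000.0, 4000.0]

# Pool the waves and halve every weight
combined = [w / 2 for w in wave_a + wave_b]

total_combined = sum(combined)
mean_of_wave_totals = (sum(wave_a) + sum(wave_b)) / 2
print(total_combined, mean_of_wave_totals)
```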


In [None]:
# Combine waves
df = pd.concat(waves_grouped, ignore_index=True)

# Divide 2day weights by 2 since we are combining two waves
df['weight_2d'] = df['weight_2d'] / 2

# Rearrange columns by bringing important ones to front
# Leaving in grams just as a check
front_cols = ['SEQN', 'weight_2d', 'grams']
cols = front_cols + [col for col in df.columns if col not in front_cols]
df = df[cols]

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16706 entries, 0 to 16705
Data columns (total 46 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   SEQN                       16706 non-null  float64
 1   weight_2d                  16706 non-null  object 
 2   grams                      16706 non-null  float64
 3   satfat                     16706 non-null  float64
 4   monofat                    16706 non-null  float64
 5   polyfat                    16706 non-null  float64
 6   sodium                     16706 non-null  float64
 7   f_total_(cup_eq)           16706 non-null  float64
 8   f_citmlb_(cup_eq)          16706 non-null  float64
 9   f_other_(cup_eq)           16706 non-null  float64
 10  f_juice_(cup_eq)           16706 non-null  float64
 11  v_total_(cup_eq)           16706 non-null  float64
 12  v_drkgr_(cup_eq)           16706 non-null  float64
 13  v_redor_total_(cup_eq)     16706 non-null  flo

In [None]:
# Make the SEQN column an integer and weight_2d a float with two decimal places

df['SEQN'] = df['SEQN'].astype(int)

df['weight_2d'] = df['weight_2d'].astype(float)
df['weight_2d'] = df['weight_2d'].round(2)

df

Unnamed: 0,SEQN,weight_2d,grams,satfat,monofat,polyfat,sodium,f_total_(cup_eq),f_citmlb_(cup_eq),f_other_(cup_eq),...,d_total_(cup_eq),d_milk_(cup_eq),d_yogurt_(cup_eq),d_cheese_(cup_eq),oils_(grams),solid_fats_(grams),add_sugars_(tsp_eq),a_drinks_(no._of_drinks),kcal_d1,kcal_d2
0,109263,8904.03,2827.75,26.920,22.669,13.076,3846.0,3.750000,0.0000,0.00000,...,2.092055,0.07600,0.000000,2.012055,13.845360,27.626970,9.062974,0.000000,1402.0,1133.0
1,109264,3626.88,4177.01,34.240,39.302,25.897,4390.0,0.000000,0.0000,0.00000,...,0.537095,0.13950,0.000000,0.365095,34.001246,43.120193,23.014909,0.000000,1046.0,1932.0
2,109265,17806.00,4081.76,58.919,45.656,29.272,3982.0,5.515000,1.0385,1.00450,...,4.619700,4.24320,0.000000,0.352500,32.383460,92.666504,24.896911,0.000000,1926.0,1551.0
3,109266,2994.10,9866.51,44.307,45.510,31.964,5440.0,2.050530,1.6773,0.37323,...,3.714278,0.24605,0.296753,2.486475,73.121477,58.731125,18.556072,0.000000,1698.0,1896.0
4,109269,9115.96,1906.30,26.090,30.298,32.008,3007.0,0.099107,0.0000,0.00000,...,1.362210,0.32500,0.000000,1.023210,56.224312,24.756572,36.124985,0.000000,1251.0,847.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16701,142303,24470.59,2969.85,10.447,8.263,6.134,1886.0,0.807600,0.0000,0.18760,...,1.120547,0.00000,0.000000,1.120547,8.543764,13.108101,3.121792,0.000000,280.0,625.0
16702,142304,18983.82,1984.75,27.862,25.901,19.110,3021.0,1.277200,0.0000,0.00000,...,1.246750,0.06900,0.000000,1.177750,25.632750,44.236450,24.995500,0.000000,1569.0,659.0
16703,142307,35129.58,4994.00,32.649,54.585,34.805,5532.0,0.613400,0.0000,0.56280,...,3.507997,0.99040,0.585000,1.433397,44.998021,51.066426,31.430700,0.000000,1609.0,1713.0
16704,142309,44695.97,3649.05,92.059,73.725,45.295,11457.0,0.000000,0.0000,0.00000,...,7.245600,0.00000,0.000000,7.245600,103.999400,96.851400,20.917200,0.000000,4651.0,1411.0


In [None]:
# Combine kcal day 1 and day 2 into a total kcal variable
df['kcal_2day'] = df['kcal_d1'] + df['kcal_d2']

print('\nCheck kcal sums:')
print(df.loc[:, df.columns.str.contains('kcal')].head())

print('\nMean kcal for day 1, 2, and both:')
print(df.loc[:, df.columns.str.contains('kcal')].mean())


Check kcal sums:
   kcal_d1  kcal_d2  kcal_2day
0   1402.0   1133.0     2535.0
1   1046.0   1932.0     2978.0
2   1926.0   1551.0     3477.0
3   1698.0   1896.0     3594.0
4   1251.0    847.0     2098.0

Mean kcal for day 1, 2, and both:
kcal_d1      1945.627679
kcal_d2      1834.166467
kcal_2day    3779.794146
dtype: float64


This is hopefully a nice clean DF with one row per SEQN, both DR days, and two waves lumped together.

In [None]:
df.columns

Index(['SEQN', 'weight_2d', 'grams', 'satfat', 'monofat', 'polyfat', 'sodium',
       'f_total_(cup_eq)', 'f_citmlb_(cup_eq)', 'f_other_(cup_eq)',
       'f_juice_(cup_eq)', 'v_total_(cup_eq)', 'v_drkgr_(cup_eq)',
       'v_redor_total_(cup_eq)', 'v_redor_tomato_(cup_eq)',
       'v_redor_other_(cup_eq)', 'v_starchy_total_(cup_eq)',
       'v_starchy_potato_(cup_eq)', 'v_starchy_other_(cup_eq)',
       'v_other_(cup_eq)', 'v_legumes_(cup_eq)', 'g_total_(oz_eq)',
       'g_whole_(oz_eq)', 'g_refined_(oz_eq)', 'pf_total_(oz_eq)',
       'pf_mps_total_(oz_eq)', 'pf_meat_(oz_eq)', 'pf_curedmeat_(oz_eq)',
       'pf_organ_(oz_eq)', 'pf_poult_(oz_eq)', 'pf_seafd_hi_(oz_eq)',
       'pf_seafd_low_(oz_eq)', 'pf_eggs_(oz_eq)', 'pf_soy_(oz_eq)',
       'pf_nutsds_(oz_eq)', 'pf_legumes_(oz_eq)', 'd_total_(cup_eq)',
       'd_milk_(cup_eq)', 'd_yogurt_(cup_eq)', 'd_cheese_(cup_eq)',
       'oils_(grams)', 'solid_fats_(grams)', 'add_sugars_(tsp_eq)',
       'a_drinks_(no._of_drinks)', 'kcal_d1', 'k

## Add Demographics

The NHANES dataset includes demographics for each respondent. The following steps merge key demographics onto the dietary data by respondent ID (SEQN).

Load demographic data as xpt:

In [None]:
demos_17 = pd.read_sas('data/raw/nhanes_2017_2020/P_DEMO.xpt')
demos_21 = pd.read_sas('data/raw/nhanes_2021_2023/DEMO_L.xpt')

demos_17.info()
demos_21.info()

# combine into a single dataset
demos = pd.concat([demos_17, demos_21], ignore_index=True)
demos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15560 entries, 0 to 15559
Data columns (total 29 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   SEQN      15560 non-null  float64
 1   SDDSRVYR  15560 non-null  float64
 2   RIDSTATR  15560 non-null  float64
 3   RIAGENDR  15560 non-null  float64
 4   RIDAGEYR  15560 non-null  float64
 5   RIDAGEMN  987 non-null    float64
 6   RIDRETH1  15560 non-null  float64
 7   RIDRETH3  15560 non-null  float64
 8   RIDEXMON  14300 non-null  float64
 9   DMDBORN4  15560 non-null  float64
 10  DMDYRUSZ  3028 non-null   float64
 11  DMDEDUC2  9232 non-null   float64
 12  DMDMARTZ  9232 non-null   float64
 13  RIDEXPRG  1874 non-null   float64
 14  SIALANG   15560 non-null  float64
 15  SIAPROXY  15560 non-null  float64
 16  SIAINTRP  15560 non-null  float64
 17  FIALANG   14481 non-null  float64
 18  FIAPROXY  14481 non-null  float64
 19  FIAINTRP  14481 non-null  float64
 20  MIALANG   11000 non-null  fl

Keep SEQN, gender, age, race, education, and the ratio of family income to poverty. We also need the pseudo sampling units and pseudo strata (SDMVPSU and SDMVSTRA).

In [None]:
demos = demos[['SEQN', 'RIAGENDR', 'RIDAGEYR', 'RIDRETH3', 'DMDEDUC2', 'INDFMPIR', 'SDMVPSU', 'SDMVSTRA']]

# rename the demo columns
demos.columns = ['SEQN', 'gender', 'age', 'race', 'education', 'income_ratio', 'psu', 'strata']
demos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27493 entries, 0 to 27492
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   SEQN          27493 non-null  float64
 1   gender        27493 non-null  float64
 2   age           27493 non-null  float64
 3   race          27493 non-null  float64
 4   education     17026 non-null  float64
 5   income_ratio  23251 non-null  float64
 6   psu           27493 non-null  float64
 7   strata        27493 non-null  float64
dtypes: float64(8)
memory usage: 1.7 MB


In [None]:
print(demos['race'].unique())
print(demos['race'].min())
print(demos['race'].max())

[6. 1. 3. 2. 4. 7.]
1.0
7.0


Merge demographics with our dietary intake data:

In [None]:
# check to see
# diff = set(df['SEQN']).difference(demos['SEQN'])
# print(f"{len(diff)} SEQNs missing from demographics that were in dietary intake")
# diff2 = set(demos['SEQN']).difference(df['SEQN'])
# print(f"{len(diff2)} SEQNs missing from dietary intake that were in demographics")

# Now join our demos with dietary intake data
df = df.merge(demos, on='SEQN', how='left')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16706 entries, 0 to 16705
Data columns (total 54 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   SEQN                       16706 non-null  int64  
 1   weight_2d                  16706 non-null  float64
 2   grams                      16706 non-null  float64
 3   satfat                     16706 non-null  float64
 4   monofat                    16706 non-null  float64
 5   polyfat                    16706 non-null  float64
 6   sodium                     16706 non-null  float64
 7   f_total_(cup_eq)           16706 non-null  float64
 8   f_citmlb_(cup_eq)          16706 non-null  float64
 9   f_other_(cup_eq)           16706 non-null  float64
 10  f_juice_(cup_eq)           16706 non-null  float64
 11  v_total_(cup_eq)           16706 non-null  float64
 12  v_drkgr_(cup_eq)           16706 non-null  float64
 13  v_redor_total_(cup_eq)     16706 non-null  flo

Recode demographic variables. Coding schemes are available on the NHANES website in the documentation beside each dataset. We are splitting the income-to-poverty ratio into quintiles.
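As a hedged sketch of the two recoding patterns on toy values (not NHANES data): a dictionary with `Series.map` leaves missing values as NaN, so they can be kept separate from codes that were present but have no label, and `pd.qcut` with `q=5` produces five equal-sized groups. The label dictionary here is deliberately partial, and treating code 9 as "Don't know" is an assumption of the example.

```python
import numpy as np
import pandas as pd

# Toy education codes (made up); 9.0 stands for an unlisted code
education = pd.Series([1.0, 5.0, 9.0, np.nan])
education_labels = {
    1.0: "Less than 9th grade",
    5.0: "College graduate or above",
}

# map() leaves unlisted codes and missing values as NaN
mapped = education.map(education_labels)
# Label codes that were present but unlisted; true missing values stay NaN
education_label = mapped.mask(mapped.isna() & education.notna(), "Don't know")
print(education_label.tolist())

# Ten evenly spaced toy ratios split into five equal-sized groups
income_ratio = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
income_qs = pd.qcut(
    income_ratio,
    q=5,
    labels=["Lowest", "Low", "Medium", "High", "Highest"],
)
print(income_qs.value_counts().sort_index())
```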

In [None]:
# Gender
df['gender'] = df['gender'].apply(lambda x: ('Female' if x == 2 else 'Male'))

# Education
# Note: the final else also catches missing values and any other codes
df['education'] = df['education'].apply(
  lambda x: (
    'Less than 9th grade' if x == 1
    else '9th to 11th grade' if x == 2
    else 'High school/GED' if x == 3
    else 'Some college or AA' if x == 4
    else 'College graduate or above' if x == 5
    else "Don't know"
  )
)

# Race
df['race'] = df['race'].apply(
  lambda x: (
    'Mexican American' if x == 1
    else 'Other Hispanic' if x == 2
    else 'White' if x == 3
    else 'Black' if x == 4
    else 'Asian' if x == 6
    else 'Other or Multi'
  )
)

# Income to poverty ratio
df['income_ratio_qs'] = pd.qcut(
  x=df['income_ratio'],
  q=5,
  duplicates='drop',
  labels=['Lowest', 'Low', 'Medium', 'High', 'Highest']
)
# Export df to the data folder
df.to_csv('data/clean/nhanes_2017_2023_forHEI.csv', index=False)

In [None]:
# # JEANNINE ADDED
# # From the lists of missing SEQN values as described determined above
# # missing_from_dr2_17, missing_from_dr1_17, missing_from_dr2_21, missing_from_dr1_21

# # Create a DataFrame containing only the missing SEQN values
# missing_seqn_df = demos[demos['SEQN'].isin(missing_from_dr2_17 + missing_from_dr1_17 + missing_from_dr2_21 + missing_from_dr1_21)]

# # Display the demographic information of the missing SEQN values
# print(missing_seqn_df)

### Remove Children

In [None]:
print(f"Shape before removing children: {df.shape}")
# .copy() so that adding columns later does not trigger SettingWithCopyWarning
df = df[df['age'] >= 18].copy()
print(f"Shape after removing children: {df.shape}")

Shape before removing children: (16706, 55)
Shape after removing children: (11394, 55)


## Calculate PBP Consumption

Get the column names for protein foods from plant-based proteins (PBPs): legumes, nuts and seeds, and soy. Then add them together to get the total ounce equivalents of PBP consumption per person. We also calculate PBP consumption as a proportion of total protein consumption.

In [None]:
# Keywords we will use to find column names
keywords = ['pf_legumes', 'pf_nutsds', 'pf_soy']
pbp_columns = [col for col in df.columns if any(keyword in col for keyword in keywords)]

# Condition: Any of the selected columns has a value greater than 1
# df['has_pbp'] = (df[pbp_columns] > 1).any(axis=1)
# print(df.head())

# Check that this worked
# df[df['has_pbp'] == True][pbp_columns + ['has_pbp']].head(10)

# Sum the total oz equivalents of pbp in a new column
df['oz_pbp'] = df[pbp_columns].sum(axis=1)

# Check that this worked - oz from pbp should be sum of the three pbp cols
print(df[['SEQN', 'oz_pbp'] + pbp_columns].head())

      SEQN   oz_pbp  pf_soy_(oz_eq)  pf_nutsds_(oz_eq)  pf_legumes_(oz_eq)
3   109266  2.52624           0.783            0.11484              1.6284
6   109271  0.07200           0.072            0.00000              0.0000
8   109273  0.00000           0.000            0.00000              0.0000
9   109274  0.00000           0.000            0.00000              0.0000
16  109282  0.54075           0.012            0.52875              0.0000


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['oz_pbp'] = df[pbp_columns].sum(axis=1)
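
A note on the warning above: it most likely shows up because `df` was produced by a boolean filter (the remove-children step), so pandas cannot tell whether the new column lands on a view or a copy. A minimal sketch of one common way to avoid it, using a made-up toy frame (`toy` and `adults` are hypothetical names, not objects from this notebook):

```python
import pandas as pd

# Toy stand-in for the notebook's df (values are made up for illustration)
toy = pd.DataFrame({
    'age': [12, 34, 56],
    'pf_soy_(oz_eq)': [0.0, 0.5, 0.1],
    'pf_legumes_(oz_eq)': [0.2, 1.0, 0.0],
})

# Taking an explicit copy when filtering makes the result an independent frame,
# so adding a column afterwards does not raise SettingWithCopyWarning
adults = toy[toy['age'] >= 18].copy()
adults['oz_pbp'] = adults[['pf_soy_(oz_eq)', 'pf_legumes_(oz_eq)']].sum(axis=1)

print(adults)
```

The computed values are the same either way; the copy just makes the intent explicit and keeps the output clean.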


Check how many people consumed zero total protein (pf_total, which excludes legumes) or zero MPS total (meat, poultry, seafood).

In [None]:
print(f"{df[df['pf_total_(oz_eq)'] == 0].shape[0]} \
people consumed no total protein (excluding legumes)")

print(f"{df[df['pf_mps_total_(oz_eq)'] == 0].shape[0]} \
people consumed no meat, poultry, seafood protein")

print(f"{df[df['pf_total_(oz_eq)'] + df['pf_legumes_(oz_eq)']== 0].shape[0]} \
people consumed no total protein OR legume protein")

34 people consumed no total protein (excluding legumes)
367 people consumed no meat, poultry, seafood protein
16 people consumed no total protein OR legume protein


Note that the pf_total category excludes legumes, presumably because they are not considered the same quality as meat proteins. This is worth looking into further to see whether the comparison is fair. We might be better off using the grams of protein from the NHANES survey instead.

Anyhow, if we want PBP as a proportion of total protein in ounce equivalents, we should compare PBPs to pf_total + pf_legumes.

In [None]:
# Make our own total protein variable, adding total protein to legumes
df['pf_total_calc'] = df['pf_total_(oz_eq)'] + df['pf_legumes_(oz_eq)']
df['prop_pbp'] = np.where(
    df['pf_total_calc'] == 0,
    np.nan,
    df['oz_pbp'] / (df['pf_total_calc'])
)

print(df[['SEQN', 'oz_pbp', 'pf_total_(oz_eq)', 'prop_pbp']].head(10))

# Check how many NaNs there are in prop_pbp
print(f'\nThere are {df["prop_pbp"].isna().sum()} NaNs in prop_pbp')

      SEQN    oz_pbp  pf_total_(oz_eq)  prop_pbp
3   109266  2.526240          2.788150  0.571994
6   109271  0.072000         19.537600  0.003685
8   109273  0.000000          8.140580  0.000000
9   109274  0.000000          7.475666  0.000000
16  109282  0.540750          7.221450  0.074881
17  109284  0.000000          6.725655  0.000000
19  109286  5.483253         28.136695  0.187637
22  109290  0.000000         12.633300  0.000000
23  109291  0.000000          1.962517  0.000000
24  109293  0.000000          9.923510  0.000000

There are 16 NaNs in prop_pbp


16 people are NA, which means the sum of their pf_total and pf_legumes was 0.

Let's check summary stats on the proportions and make sure they make sense

In [None]:
df['prop_pbp'].describe()

Unnamed: 0,prop_pbp
count,11378.0
mean,0.195182
std,0.236676
min,0.0
25%,0.0
50%,0.107587
75%,0.309851
max,1.0


This is capping out at 1, which is perfect.
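
If we want that check to fail loudly rather than rely on eyeballing `describe()`, something like the following could work. This is only a sketch on a made-up toy frame (the `toy` values are invented); it assumes the PBP columns are all components of the denominator, which is why the ratio should stay within 0 and 1.

```python
import numpy as np
import pandas as pd

# Toy stand-in with one zero-denominator row (values are made up)
toy = pd.DataFrame({
    'oz_pbp': [2.5, 0.0, 0.0],
    'pf_total_calc': [5.0, 8.0, 0.0],
})

# Replace zero denominators with NaN so the ratio is NaN instead of inf
toy['prop_pbp'] = toy['oz_pbp'] / toy['pf_total_calc'].replace(0, np.nan)

# Every non-missing proportion should fall in [0, 1]
valid = toy['prop_pbp'].dropna()
in_range = valid.between(0, 1).all()
assert in_range, "prop_pbp has values outside [0, 1]"

print(toy)
```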

## Tying in Biomarkers

In [None]:
# Load biomarker files for both waves (2017-2020 and 2021-2023)

# Note: moved import to top of script

# import total cholesterol
tchol_17, meta = pyreadstat.read_xport('data/raw/nhanes_2017_2020/P_TCHOL.xpt')
tchol_21, meta = pyreadstat.read_xport('data/raw/nhanes_2021_2023/TCHOL_L.xpt')

# import glycohemoglobin (glyco_HG)
glyco_HG_17, meta = pyreadstat.read_xport('data/raw/nhanes_2017_2020/P_GHB.xpt')
glyco_HG_21, meta = pyreadstat.read_xport('data/raw/nhanes_2021_2023/GHB_L.xpt')

# import vitamin D ---- dammit vitamin D doesn't exist for 2017 wave
#vit_d_17, meta= pyreadstat.read_xport('data/raw/nhanes_2017_2020/VID_L.xpt')
vit_d_21, meta= pyreadstat.read_xport('data/raw/nhanes_2021_2023/VID_L.xpt')

# import lead, cadmium, mercury, selenium, manganese in blood
heavy_met_17, meta=pyreadstat.read_xport('data/raw/nhanes_2017_2020/P_PBCD.xpt')
heavy_met_21, meta=pyreadstat.read_xport('data/raw/nhanes_2021_2023/PBCD_L.xpt')

# import ferritin - pyreadstat is being weird with it, so lets try pd.read_sas instead
#frtn, meta= pyreadstat.read_xport('data/raw/nhanes_2021_2023/FERTIN_L.xpt')

frtn_17= pd.read_sas("data/raw/nhanes_2017_2020/P_FERTIN.xpt", format='xport')
frtn_21= pd.read_sas("data/raw/nhanes_2021_2023/FERTIN_L.xpt", format='xport')

# Import Blood Pressure
bp_17, meta=pyreadstat.read_xport('data/raw/nhanes_2017_2020/P_BPXO.xpt')
bp_21, meta=pyreadstat.read_xport('data/raw/nhanes_2021_2023/BPXO_L.xpt')


In [None]:
tchol_21.head()
# LBXTC is the total cholesterol in serum (mg/dL)
# tchol is the data set that will retain the weight (WTPH2YR), since they can't merge with all of them having it
# LBDTCSI is total cholesterol (mmol/L)
tchol_21.describe()
#tchol_17.head()


Unnamed: 0,SEQN,WTPH2YR,LBXTC,LBDTCSI
count,8068.0,8068.0,6890.0,6890.0
mean,136360.136589,37744.395761,181.541074,4.694643
std,3448.491865,30937.952799,42.31614,1.094357
min,130378.0,0.0,62.0,1.6
25%,133345.5,18092.615464,151.0,3.9
50%,136391.0,30264.726858,178.0,4.6
75%,139336.25,49006.051233,207.0,5.35
max,142310.0,241728.857241,438.0,11.33


In [None]:
glyco_HG_21.head()
#LBXGH is glycohemoglobin %
glyco_HG_21.describe()
glyco_HG_17_short=glyco_HG_17[['SEQN','LBXGH']]
glyco_HG_21_short=glyco_HG_21[['SEQN','LBXGH']]
glyco_HG_21_short.head()
glyco_HG_17_short.head()

Unnamed: 0,SEQN,LBXGH
0,109264.0,5.3
1,109266.0,5.2
2,109271.0,5.6
3,109273.0,5.1
4,109274.0,5.7


In [None]:
# Aint using Vitamin D since '17 doesn't have it
vit_d_21.head()
vit_d_21.describe()
#LBXVIDMS is 25-hydroxyvitamin D2 + D3 (combined vitamin counts)
vit_d_21_short = vit_d_21[['SEQN','LBXVIDMS']]

In [None]:
vit_d_21_short.head()
#vit_d_short.describe()

Unnamed: 0,SEQN,LBXVIDMS
0,130378.0,58.9
1,130379.0,60.5
2,130380.0,39.4
3,130381.0,
4,130382.0,


In [None]:
heavy_met_21.head()
# LBDBPBSI is blood lead (umol/L)
# LBXBCD - Blood cadmium (ug/L)
# LBXTHG - Blood mercury, total (ug/L)
# LBXBSE - Blood selenium (ug/L)
# LBXBMN - Blood manganese (ug/L)
# selenium is good for you, not sure why lumped in here

# Lets condense the data frame
heavy_met_17_short = heavy_met_17[['SEQN','LBDBPBSI','LBXBCD','LBXTHG','LBXBSE','LBXBMN']]
heavy_met_21_short = heavy_met_21[['SEQN','LBDBPBSI','LBXBCD','LBXTHG','LBXBSE','LBXBMN']]
heavy_met_21_short.head()
#num_na_rows = heavy_met_short.isna().any(axis=1).sum()
#print(f"Number of rows with at least one NA: {num_na_rows}")


Unnamed: 0,SEQN,LBDBPBSI,LBXBCD,LBXTHG,LBXBSE,LBXBMN
0,130378.0,0.136,0.117,1.01,189.2,10.94
1,130379.0,0.099,0.313,9.64,192.3,7.74
2,130380.0,0.019,0.27,0.55,160.5,11.93
3,130381.0,,,,,
4,130382.0,0.03,0.136,0.12,151.9,15.49


In [None]:
frtn_17.head()
frtn_21.head()
# LBDFERSI - ferritin (ug/L)
frtn_21_short=frtn_21[['SEQN','LBDFERSI']]
frtn_21_short.head()
frtn_17_short=frtn_17[['SEQN','LBDFERSI']]
frtn_17_short.head()
num_na_rows = frtn_21_short.isna().any(axis=1).sum()
print(f"Number of rows with at least one NA: {num_na_rows}")

Number of rows with at least one NA: 614


In [None]:
# Blood Pressure

#bp.head()

# Too many readings,lets average diastolic and systolic

# Create a new DataFrame with just SEQN
bp_17_avg = bp_17[['SEQN']].copy()
bp_21_avg = bp_21[['SEQN']].copy()

# Calculate average systolic and diastolic BP for each participant
bp_17_avg['systolic_avg'] = bp_17[['BPXOSY1', 'BPXOSY2', 'BPXOSY3']].mean(axis=1, skipna=True)
bp_17_avg['diastolic_avg'] = bp_17[['BPXODI1', 'BPXODI2', 'BPXODI3']].mean(axis=1, skipna=True)
bp_21_avg['systolic_avg'] = bp_21[['BPXOSY1', 'BPXOSY2', 'BPXOSY3']].mean(axis=1, skipna=True)
bp_21_avg['diastolic_avg'] = bp_21[['BPXODI1', 'BPXODI2', 'BPXODI3']].mean(axis=1, skipna=True)

# Preview the result
bp_17_avg.head()
bp_21_avg.head()
#num_na_rows = bp_avg.isna().any(axis=1).sum()
#print(f"Number of rows with at least one NA: {num_na_rows}")

Unnamed: 0,SEQN,systolic_avg,diastolic_avg
0,130378.0,132.666667,96.0
1,130379.0,117.0,78.666667
2,130380.0,109.0,78.333333
3,130386.0,115.0,73.666667
4,130387.0,141.333333,76.0
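
One thing to keep in mind with the row-wise averages above: with `skipna=True`, a participant with a single reading still gets an "average", and only participants with no readings at all end up NaN. A small sketch on made-up readings (`toy_bp` and `n_readings` are hypothetical names) that also tracks how many readings went into each mean:

```python
import numpy as np
import pandas as pd

# Made-up systolic readings: complete, partial, and fully missing
toy_bp = pd.DataFrame({
    'SEQN': [1, 2, 3],
    'BPXOSY1': [120.0, 130.0, np.nan],
    'BPXOSY2': [124.0, np.nan, np.nan],
    'BPXOSY3': [122.0, np.nan, np.nan],
})
sys_cols = ['BPXOSY1', 'BPXOSY2', 'BPXOSY3']

# Mean over whatever readings exist; NaN only when all three are missing
toy_bp['systolic_avg'] = toy_bp[sys_cols].mean(axis=1, skipna=True)

# Count of non-missing readings behind each average
toy_bp['n_readings'] = toy_bp[sys_cols].notna().sum(axis=1)

print(toy_bp[['SEQN', 'systolic_avg', 'n_readings']])
```

Whether a one-reading average is acceptable is a judgment call; the count column just makes it possible to filter later.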


In [None]:
# Combining Biomarkers into common data frame

# An inner join would keep only SEQNs that have values for every biomarker
# Lets first use an outer join and see how much is missing

from functools import reduce

# Put all biomarker DataFrames in a list
biomarker_17_df = [tchol_17, glyco_HG_17_short, heavy_met_17_short,bp_17_avg, frtn_17_short]
biomarker_21_df = [tchol_21, glyco_HG_21_short, heavy_met_21_short,bp_21_avg, frtn_21_short]

# Merge them on 'SEQN' w/ inner or outer
# INNER: only keep participants present in all biomarker files
# OUTER: keep all SEQNs that appear in at least one

biomarkers_17 = reduce(lambda left, right: pd.merge(left, right, on='SEQN', how='outer'), biomarker_17_df)
biomarkers_21 = reduce(lambda left, right: pd.merge(left, right, on='SEQN', how='outer'), biomarker_21_df)

# Preview
print(biomarkers_21.shape)
biomarkers_21.head()

# How many NaNs? - 7151 rows with at least 1 NaN
num_na_rows = biomarkers_21.isna().any(axis=1).sum()
print(f"Number of rows with at least one NA: {num_na_rows}")

biomarkers_21.isna().sum()

# Ferritin has a load of NAs; 6777/8727 I think we need to omit it

(8727, 13)
Number of rows with at least one NA: 7151


Unnamed: 0,0
SEQN,0
WTPH2YR,659
LBXTC,1837
LBDTCSI,1837
LBXGH,2012
LBDBPBSI,1141
LBXBCD,1141
LBXTHG,1141
LBXBSE,1141
LBXBMN,1141


In [None]:
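# Lets tidy up the biomarker df
# NOTE: LBDTCSI is total cholesterol in mmol/L (SI units), not a standard deviation,
# so the 'cholesterol_std_dev' name below is a misnomer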
biomarkers_17 = biomarkers_17.rename(columns={
    'SEQN': 'SEQN',
    'WTPH2YR': 'weight_2yr',
    'LBXTC': 'total_cholesterol',
    'LBDTCSI': 'cholesterol_std_dev',
    'LBXGH': 'glycohemoglobin',
    'LBDBPBSI': 'blood_lead',
    'LBXBCD': 'blood_cadmium',
    'LBXTHG': 'blood_mercury',
    'LBXBSE': 'blood_selenium',
    'LBXBMN': 'blood_manganese',
    'systolic_avg': 'avg_systolic_bp',
    'diastolic_avg': 'avg_diastolic_bp',
    'LBDFERSI': 'serum_ferritin'
})
biomarkers_21 = biomarkers_21.rename(columns={
    'SEQN': 'SEQN',
    'WTPH2YR': 'weight_21_2yr',
    'LBXTC': 'total_cholesterol',
    'LBDTCSI': 'cholesterol_std_dev',
    'LBXGH': 'glycohemoglobin',
    'LBXVIDMS': 'vitamin_d',
    'LBDBPBSI': 'blood_lead',
    'LBXBCD': 'blood_cadmium',
    'LBXTHG': 'blood_mercury',
    'LBXBSE': 'blood_selenium',
    'LBXBMN': 'blood_manganese',
    'systolic_avg': 'avg_systolic_bp',
    'diastolic_avg': 'avg_diastolic_bp',
    'LBDFERSI': 'serum_ferritin'
})
#biomarkers_21.head()
biomarkers_17.head()

Unnamed: 0,SEQN,total_cholesterol,cholesterol_std_dev,glycohemoglobin,blood_lead,blood_cadmium,blood_mercury,blood_selenium,blood_manganese,avg_systolic_bp,avg_diastolic_bp,serum_ferritin
0,109263.0,,,,,,,,,,,
1,109264.0,166.0,4.29,5.3,0.017,0.1,0.2,167.69,15.21,108.0,67.0,15.7
2,109265.0,,,,,0.071,0.2,168.27,10.62,,,42.1
3,109266.0,195.0,5.04,5.2,0.082,0.223,0.36,167.51,8.85,99.0,54.333333,11.6
4,109269.0,,,,,0.071,0.2,157.94,9.61,,,41.7


In [None]:
# Lets try and merge biomarkers '17 and biomarkers '21

biomarkers_total = pd.concat([biomarkers_17, biomarkers_21], ignore_index=True)
biomarkers_total.info()
biomarkers_total.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22499 entries, 0 to 22498
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   SEQN                 22499 non-null  float64
 1   total_cholesterol    17718 non-null  float64
 2   cholesterol_std_dev  17718 non-null  float64
 3   glycohemoglobin      16452 non-null  float64
 4   blood_lead           18693 non-null  float64
 5   blood_cadmium        19688 non-null  float64
 6   blood_mercury        19688 non-null  float64
 7   blood_selenium       19688 non-null  float64
 8   blood_manganese      19688 non-null  float64
 9   avg_systolic_bp      17871 non-null  float64
 10  avg_diastolic_bp     17871 non-null  float64
 11  serum_ferritin       12507 non-null  float64
 12  weight_21_2yr        8068 non-null   float64
dtypes: float64(13)
memory usage: 2.2 MB


Unnamed: 0,SEQN,total_cholesterol,cholesterol_std_dev,glycohemoglobin,blood_lead,blood_cadmium,blood_mercury,blood_selenium,blood_manganese,avg_systolic_bp,avg_diastolic_bp,serum_ferritin,weight_21_2yr
0,109263.0,,,,,,,,,,,,
1,109264.0,166.0,4.29,5.3,0.017,0.1,0.2,167.69,15.21,108.0,67.0,15.7,
2,109265.0,,,,,0.071,0.2,168.27,10.62,,,42.1,
3,109266.0,195.0,5.04,5.2,0.082,0.223,0.36,167.51,8.85,99.0,54.333333,11.6,
4,109269.0,,,,,0.071,0.2,157.94,9.61,,,41.7,
5,109270.0,103.0,2.66,,0.058,0.141,0.2,143.47,8.86,124.666667,73.333333,,
6,109271.0,147.0,3.8,5.6,0.072,2.31,0.42,196.7,6.95,107.0,67.0,196.0,
7,109273.0,164.0,4.24,5.1,0.033,0.84,0.51,196.63,5.75,113.666667,67.333333,313.0,
8,109274.0,105.0,2.72,5.7,0.019,0.071,0.2,181.73,3.79,134.0,70.0,366.0,
9,109275.0,167.0,4.32,,0.037,0.11,0.2,206.31,8.69,,,,
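
The `info()` output above shows `weight_21_2yr` with only 8068 non-null values out of 22499. That is what `pd.concat` does when a column exists in one frame but not the other: the missing side is filled with NaN. A sketch with made-up frames (`wave_17`, `wave_21`, and `stacked` are hypothetical names):

```python
import pandas as pd

# Made-up wave tables; only the later wave carries the weight column
wave_17 = pd.DataFrame({
    'SEQN': [1, 2],
    'total_cholesterol': [166.0, 195.0],
})
wave_21 = pd.DataFrame({
    'SEQN': [3, 4],
    'total_cholesterol': [180.0, 170.0],
    'weight_21_2yr': [31000.0, 0.0],
})

# Row-wise stack: columns are unioned, gaps become NaN
stacked = pd.concat([wave_17, wave_21], ignore_index=True)

print(stacked)
print(stacked['weight_21_2yr'].isna().sum(), "rows have no weight_21_2yr")
```

So every row from the earlier wave will carry a NaN in that column, which matters for any later step that drops rows with missing values.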


In [None]:
#df.head()
df.describe()
df.shape
biomarkers_total.shape

# compare biomarkers to df

missing_in_df = biomarkers_total[~biomarkers_total['SEQN'].isin(df['SEQN'])]
missing_in_df

Unnamed: 0,SEQN,total_cholesterol,cholesterol_std_dev,glycohemoglobin,blood_lead,blood_cadmium,blood_mercury,blood_selenium,blood_manganese,avg_systolic_bp,avg_diastolic_bp,serum_ferritin,weight_21_2yr
0,109263.0,,,,,,,,,,,,
1,109264.0,166.0,4.29,5.3,0.017,0.100,0.20,167.69,15.21,108.000000,67.000000,15.70,
2,109265.0,,,,,0.071,0.20,168.27,10.62,,,42.10,
4,109269.0,,,,,0.071,0.20,157.94,9.61,,,41.70,
5,109270.0,103.0,2.66,,0.058,0.141,0.20,143.47,8.86,124.666667,73.333333,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
22488,142300.0,167.0,4.32,5.1,0.071,0.306,0.87,153.80,21.74,141.666667,91.000000,5.69,31470.681927
22492,142304.0,,,,,,,,,113.000000,72.666667,,0.000000
22493,142305.0,180.0,4.65,6.0,0.016,0.226,0.12,193.40,7.11,143.666667,79.333333,,49710.929024
22494,142306.0,,,,,,,,,,,,0.000000


In [None]:
df.head()

Unnamed: 0,SEQN,weight_2d,grams,satfat,monofat,polyfat,sodium,f_total_(cup_eq),f_citmlb_(cup_eq),f_other_(cup_eq),...,age,race,education,income_ratio,psu,strata,income_ratio_qs,oz_pbp,pf_total_calc,prop_pbp
3,109266,2994.1,9866.51,44.307,45.51,31.964,5440.0,2.05053,1.6773,0.37323,...,29.0,Asian,College graduate or above,5.0,2.0,168.0,Highest,2.52624,4.41655,0.571994
6,109271,7988.83,13587.9,107.088,113.156,68.257,11603.0,1.35389,0.0,1.329896,...,49.0,White,9th to 11th grade,,1.0,167.0,,0.072,19.5376,0.003685
8,109273,28255.51,3707.63,42.654,39.373,24.282,4939.0,0.0,0.0,0.0,...,36.0,White,Some college or AA,0.83,1.0,155.0,Lowest,0.0,8.14058,0.0
9,109274,6187.41,5679.36,45.402,55.548,39.762,10156.0,0.8347,0.8316,0.0,...,68.0,Other or Multi,Some college or AA,1.2,2.0,167.0,Low,0.0,7.475666,0.0
16,109282,25233.38,5262.9,101.404,60.175,21.043,6286.0,2.87094,0.03402,1.85542,...,76.0,White,College graduate or above,3.61,2.0,164.0,High,0.54075,7.22145,0.074881


In [None]:
# Merging biomarkers + df
# First, lets make SEQN in biomarkers an integer to conform to df
biomarkers_total['SEQN'] = biomarkers_total['SEQN'].astype(int)

# Lets do a left merge (retain every row of df; biomarker-only SEQNs are dropped)
# Note: Chris switched order here, df should be on left so we retain all SEQNs
df_biomarkers = pd.merge(df, biomarkers_total, on="SEQN", how="left")
df_biomarkers.shape

(11394, 70)
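
The row count staying at 11394 is a good sign that the left merge did not duplicate anyone. If we wanted pandas to enforce that, `pd.merge` has `validate` and `indicator` arguments. A sketch on made-up frames (`left`, `right`, and `merged` are hypothetical names, and the values are invented):

```python
import pandas as pd

# Made-up stand-ins: integer SEQN on the left, float SEQN on the right
left = pd.DataFrame({
    'SEQN': [109266, 109271, 109273],
    'oz_pbp': [2.5, 0.07, 0.0],
})
right = pd.DataFrame({
    'SEQN': [109266.0, 109273.0, 109999.0],
    'blood_mercury': [0.36, 0.51, 0.2],
})

# Cast the key so both sides share a dtype before merging
right['SEQN'] = right['SEQN'].astype(int)

# validate raises if SEQN is duplicated on either side;
# indicator adds a _merge column showing where each row matched
merged = pd.merge(
    left, right, on='SEQN', how='left',
    validate='one_to_one', indicator=True,
)

print(merged)
print((merged['_merge'] == 'left_only').sum(), "rows had no biomarker match")
```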

In [None]:
df_biomarkers.head()

Unnamed: 0,SEQN,weight_2d,grams,satfat,monofat,polyfat,sodium,f_total_(cup_eq),f_citmlb_(cup_eq),f_other_(cup_eq),...,glycohemoglobin,blood_lead,blood_cadmium,blood_mercury,blood_selenium,blood_manganese,avg_systolic_bp,avg_diastolic_bp,serum_ferritin,weight_21_2yr
0,109266,2994.1,9866.51,44.307,45.51,31.964,5440.0,2.05053,1.6773,0.37323,...,5.2,0.082,0.223,0.36,167.51,8.85,99.0,54.333333,11.6,
1,109271,7988.83,13587.9,107.088,113.156,68.257,11603.0,1.35389,0.0,1.329896,...,5.6,0.072,2.31,0.42,196.7,6.95,107.0,67.0,196.0,
2,109273,28255.51,3707.63,42.654,39.373,24.282,4939.0,0.0,0.0,0.0,...,5.1,0.033,0.84,0.51,196.63,5.75,113.666667,67.333333,313.0,
3,109274,6187.41,5679.36,45.402,55.548,39.762,10156.0,0.8347,0.8316,0.0,...,5.7,0.019,0.071,0.2,181.73,3.79,134.0,70.0,366.0,
4,109282,25233.38,5262.9,101.404,60.175,21.043,6286.0,2.87094,0.03402,1.85542,...,5.5,0.057,0.21,0.31,204.25,10.77,139.333333,72.666667,49.8,


In [None]:
# Interrogate this big old biomarkers + df

num_na_rows = df_biomarkers.isna().any(axis=1).sum()
print(f"Number of rows with at least one NA: {num_na_rows}")

df_biomarkers.isna().sum()

Number of rows with at least one NA: 10693


Unnamed: 0,0
SEQN,0
weight_2d,0
grams,0
satfat,0
monofat,0
...,...
blood_manganese,494
avg_systolic_bp,552
avg_diastolic_bp,552
serum_ferritin,4009


In [None]:
# Dropping NAs leaves us with 1022 complete rows, roughly 1/11th the size of df
cleaned_df_biomarkers=df_biomarkers.dropna()
cleaned_df_biomarkers.shape
cleaned_df_biomarkers.head()

Unnamed: 0,SEQN,weight_2d,grams,satfat,monofat,polyfat,sodium,f_total_(cup_eq),f_citmlb_(cup_eq),f_other_(cup_eq),...,glycohemoglobin,blood_lead,blood_cadmium,blood_mercury,blood_selenium,blood_manganese,avg_systolic_bp,avg_diastolic_bp,serum_ferritin,weight_21_2yr
6962,130380,51989.6,8516.54,36.848,35.394,24.734,5869.0,2.624598,0.70677,0.576078,...,6.2,0.019,0.27,0.55,160.5,11.93,109.0,78.333333,13.3,85328.844519
6979,130441,21365.66,11599.83,34.18,24.489,18.589,6905.0,6.4818,0.0,4.7592,...,5.2,0.022,0.289,2.45,161.4,12.02,110.0,68.0,283.0,35626.638908
6982,130449,20133.43,9430.59,69.591,78.252,26.782,6522.0,0.207,0.207,0.0,...,5.1,0.048,0.26,0.12,168.5,9.43,107.666667,70.666667,127.0,22669.943011
6985,130457,15831.83,8777.63,36.765,30.114,19.939,5621.0,4.61425,1.38,2.76175,...,4.9,0.021,0.143,2.06,168.8,10.16,127.0,81.333333,18.7,22072.10819
6988,130473,96626.33,7561.32,57.451,64.048,29.088,5891.0,3.261373,1.136745,2.117696,...,5.4,0.019,0.14,1.36,157.3,6.0,95.0,61.333333,40.0,52441.672103
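
Note that every surviving row above has a SEQN from the later wave. A plain `dropna()` removes a row if any column is missing, so one sparsely filled column can wipe out most of the data. A sketch of restricting the check to the columns an analysis actually needs, on a made-up frame (`toy`, `all_cols`, and `needed` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Made-up frame: the weight column is missing for the first two rows
toy = pd.DataFrame({
    'SEQN': [1, 2, 3, 4],
    'total_cholesterol': [166.0, np.nan, 180.0, 170.0],
    'blood_mercury': [0.2, 0.4, 0.5, np.nan],
    'weight_21_2yr': [np.nan, np.nan, 31000.0, 22000.0],
})

# Drops a row if ANY column is missing
all_cols = toy.dropna()

# Drops a row only if one of the listed columns is missing
needed = toy.dropna(subset=['total_cholesterol', 'blood_mercury'])

print(f"dropna() keeps {len(all_cols)} rows")
print(f"dropna(subset=...) keeps {len(needed)} rows")
```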


In [None]:
# Thresholding to drop missingest columns
threshold = 0.5  # Drop columns where >50% data is missing
df_biomarkers_thresh = df_biomarkers.loc[:, df_biomarkers.isnull().mean() < threshold]

print(df_biomarkers_thresh.shape)
print(df_biomarkers.shape)
# this dropped only one column (70 -> 69), lets move forward with what we got

(11394, 69)
(11394, 70)
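
The shapes differ by one column, so something did cross the 50% line. A sketch of how to list which columns a missingness threshold would drop, using a made-up frame (`toy`, `missing_share`, `dropped`, and `kept` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Made-up frame with one lightly missing and one heavily missing column
toy = pd.DataFrame({
    'SEQN': [1, 2, 3, 4],
    'serum_ferritin': [15.7, np.nan, 42.1, 11.6],         # 25% missing
    'weight_21_2yr': [np.nan, np.nan, np.nan, 31000.0],   # 75% missing
})

threshold = 0.5

# Share of missing values per column
missing_share = toy.isna().mean()

# Names of the columns at or above the threshold, and the frame without them
dropped = missing_share[missing_share >= threshold].index.tolist()
kept = toy.loc[:, missing_share < threshold]

print("Dropped:", dropped)
print("Kept:", kept.columns.tolist())
```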


## Save Data

Let's save this as a csv so we can play around with it elsewhere:

In [None]:
# One last check of data frame
df_biomarkers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11394 entries, 0 to 11393
Data columns (total 70 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   SEQN                       11394 non-null  int64   
 1   weight_2d                  11394 non-null  float64 
 2   grams                      11394 non-null  float64 
 3   satfat                     11394 non-null  float64 
 4   monofat                    11394 non-null  float64 
 5   polyfat                    11394 non-null  float64 
 6   sodium                     11394 non-null  float64 
 7   f_total_(cup_eq)           11394 non-null  float64 
 8   f_citmlb_(cup_eq)          11394 non-null  float64 
 9   f_other_(cup_eq)           11394 non-null  float64 
 10  f_juice_(cup_eq)           11394 non-null  float64 
 11  v_total_(cup_eq)           11394 non-null  float64 
 12  v_drkgr_(cup_eq)           11394 non-null  float64 
 13  v_redor_total_(cup_eq)     1139

In [None]:
# Remove weight_21_2yr, looks like something hinky happened with it
# And we are already set with weight_2d
df_biomarkers = df_biomarkers.drop('weight_21_2yr', axis=1)
print(df_biomarkers.columns)

Index(['SEQN', 'weight_2d', 'grams', 'satfat', 'monofat', 'polyfat', 'sodium',
       'f_total_(cup_eq)', 'f_citmlb_(cup_eq)', 'f_other_(cup_eq)',
       'f_juice_(cup_eq)', 'v_total_(cup_eq)', 'v_drkgr_(cup_eq)',
       'v_redor_total_(cup_eq)', 'v_redor_tomato_(cup_eq)',
       'v_redor_other_(cup_eq)', 'v_starchy_total_(cup_eq)',
       'v_starchy_potato_(cup_eq)', 'v_starchy_other_(cup_eq)',
       'v_other_(cup_eq)', 'v_legumes_(cup_eq)', 'g_total_(oz_eq)',
       'g_whole_(oz_eq)', 'g_refined_(oz_eq)', 'pf_total_(oz_eq)',
       'pf_mps_total_(oz_eq)', 'pf_meat_(oz_eq)', 'pf_curedmeat_(oz_eq)',
       'pf_organ_(oz_eq)', 'pf_poult_(oz_eq)', 'pf_seafd_hi_(oz_eq)',
       'pf_seafd_low_(oz_eq)', 'pf_eggs_(oz_eq)', 'pf_soy_(oz_eq)',
       'pf_nutsds_(oz_eq)', 'pf_legumes_(oz_eq)', 'd_total_(cup_eq)',
       'd_milk_(cup_eq)', 'd_yogurt_(cup_eq)', 'd_cheese_(cup_eq)',
       'oils_(grams)', 'solid_fats_(grams)', 'add_sugars_(tsp_eq)',
       'a_drinks_(no._of_drinks)', 'kcal_d1', 'k

In [None]:
df_biomarkers.to_csv('data/clean/nhanes_2017_2023_clean.csv', index=False)

## Biomarker Appendix

In [None]:
seqn_cluster= pd.read_csv('data/clean/seqn_cluster.csv')

In [None]:
# Lets merge clusters w/ appropriate biomarkers

df_cluster_bm = pd.merge(df_biomarkers, seqn_cluster, on="SEQN", how="left")
df_cluster_bm.head()
df_cluster_bm4 = df_cluster_bm[['SEQN','cluster','blood_mercury','avg_systolic_bp','avg_diastolic_bp','total_cholesterol']]

df_cluster_bm4.head()

Unnamed: 0,SEQN,cluster,blood_mercury,avg_systolic_bp,avg_diastolic_bp,total_cholesterol
0,109266,1,0.36,99.0,54.333333,195.0
1,109271,0,0.42,107.0,67.0,147.0
2,109273,1,0.51,113.666667,67.333333,164.0
3,109274,1,0.2,134.0,70.0,105.0
4,109282,1,0.31,139.333333,72.666667,233.0


In [None]:
# Missing Data

#print(df_cluster_bm4.isna().sum())

num_rows_with_na = df_cluster_bm4.isna().any(axis=1).sum()
print(num_rows_with_na)
len(df_cluster_bm4)


# Dataframe with only complete rows

df_cluster_bm_clean = df_cluster_bm4.dropna()

df_cluster_bm_clean.head()
len(df_cluster_bm_clean)

1316


10078

In [None]:
#We have 10,078 rows of complete cases

#working with df_cluster_bm_clean


df_cluster_bm_clean.head()


#Getting Cluster Means



cluster_means = df_cluster_bm_clean.groupby('cluster')[['blood_mercury', 'avg_systolic_bp', 'avg_diastolic_bp', 'total_cholesterol']].mean()

print(cluster_means)

#Pretty healthy folks on average: normal cholesterol (<200 mg/dL), 'safe' Hg (<5.8 ug/L), bp close to normal (~120/80)


         blood_mercury  avg_systolic_bp  avg_diastolic_bp  total_cholesterol
cluster                                                                     
0             1.095992       123.279633         75.090701         182.463654
1             0.872805       122.078452         73.909717         185.649552
2             1.225009       123.882868         74.537505         186.848846
3             1.786740       123.656632         73.876081         186.911405


In [None]:
# Lets dive a little deeper into the clusters ~Percentages Above Thresholds~


# Define thresholds
thresholds = {
    'total_cholesterol': 200,
    'blood_mercury': 5.8,
    'avg_systolic_bp': 140,
    'avg_diastolic_bp': 90
}

# DataFrame to store results
proportions_df = pd.DataFrame()

# Loop through each biomarker and compute proportions
for biomarker, threshold in thresholds.items():
    # Elevated cases per cluster
    elevated = df_cluster_bm_clean[df_cluster_bm_clean[biomarker] > threshold].groupby('cluster').size()

    # Valid (non-NA) counts per cluster
    total = df_cluster_bm_clean[df_cluster_bm_clean[biomarker].notna()].groupby('cluster').size()

    # Proportion and fill missing with 0 if no elevated cases
    proportions = (elevated / total).fillna(0)

    # Store in final dataframe as percentages
    proportions_df[biomarker] = (proportions * 100).round(2)

print(proportions_df)




         total_cholesterol  blood_mercury  avg_systolic_bp  avg_diastolic_bp
cluster                                                                     
0                    31.24           2.36            13.46              9.72
1                    31.70           0.90            14.78              7.41
2                    34.14           2.84            18.78              8.94
3                    34.07           5.53            15.71              6.85
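
The same percentages can be computed without a loop, since the mean of a boolean column within a group is the share of True values. A sketch on made-up data (`toy`, `toy_thresholds`, `above`, and `pct_above` are hypothetical names). It assumes there are no missing values, as in the complete-case frame, because a NaN compared against a cutoff counts as False.

```python
import pandas as pd

# Made-up complete-case data for two clusters
toy = pd.DataFrame({
    'cluster': [0, 0, 0, 0, 1, 1],
    'total_cholesterol': [210.0, 190.0, 250.0, 180.0, 170.0, 160.0],
    'avg_systolic_bp': [145.0, 120.0, 118.0, 130.0, 150.0, 110.0],
})
toy_thresholds = {'total_cholesterol': 200, 'avg_systolic_bp': 140}

# One boolean column per biomarker: True where the value exceeds its cutoff
above = pd.DataFrame(
    {name: toy[name] > cutoff for name, cutoff in toy_thresholds.items()}
)

# Group-wise mean of booleans = proportion elevated; scale to percent
pct_above = above.groupby(toy['cluster']).mean().mul(100).round(2)

print(pct_above)
```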


In [None]:
# One-way ANOVA on the raw biomarker values across clusters
# (this compares cluster means, not the proportions above threshold)

import scipy.stats as stats

# Perform ANOVA for each biomarker
for biomarker in thresholds.keys():  # Use biomarker names from the thresholds dictionary
    # Get the data for the current biomarker
    data_by_cluster = [df_cluster_bm_clean[df_cluster_bm_clean['cluster'] == cluster][biomarker] for cluster in df_cluster_bm_clean['cluster'].unique()]

    # Perform the ANOVA
    f_stat, p_val = stats.f_oneway(*data_by_cluster)

    # Print the result
    print(f"ANOVA for {biomarker} - p-value: {p_val}")



ANOVA for total_cholesterol - p-value: 0.01769128619239881
ANOVA for blood_mercury - p-value: 5.71464639573765e-56
ANOVA for avg_systolic_bp - p-value: 0.0005710929868295701
ANOVA for avg_diastolic_bp - p-value: 0.0036651204517255563
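
The ANOVA above compares cluster means of the raw biomarker values. If the question is whether the share of people above a threshold differs by cluster, a chi-square test of independence on the cluster-by-elevated table is one option. A sketch on made-up data (`toy`, `elevated`, and `table` are hypothetical names; with counts this small the test is only illustrative):

```python
import pandas as pd
from scipy import stats

# Made-up mercury values for two clusters
toy = pd.DataFrame({
    'cluster': [0] * 6 + [1] * 6,
    'blood_mercury': [6.0, 7.1, 6.5, 9.0, 1.0, 0.5,
                      0.3, 0.8, 1.2, 0.4, 6.2, 0.9],
})

# Flag values above the 5.8 ug/L cutoff used earlier
toy['elevated'] = toy['blood_mercury'] > 5.8

# Contingency table of cluster by elevated status
table = pd.crosstab(toy['cluster'], toy['elevated'])

# Chi-square test of independence between cluster and elevated status
chi2, p_val, dof, expected = stats.chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_val:.4f}")
```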
