# HEALTH & NUTRITION
#### EXPLORING THE CORRELATION BETWEEN DIET, LIFESTYLE AND HEALTH

In this project, I explore how the three most mainstream diets, together with lifestyle, affect the health of the people who follow them.


### About this Notebook

This notebook follows the NHANES_Data_cleaning notebook.

Here, the cleaned datasets are combined, explored, and preprocessed.

# Concatenate the datasets

The clean datasets are loaded and concatenated together.

In [1]:
import sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json

In [2]:
# Read the NHANES datasets saved as .csv files
NHANES_2017_2018 = pd.read_csv('FINAL_DATASETS/NHANES_2017-2018.csv')
NHANES_2015_2016 = pd.read_csv('FINAL_DATASETS/NHANES_2015-2016.csv')
NHANES_2013_2014 = pd.read_csv('FINAL_DATASETS/NHANES_2013-2014.csv')

In [3]:
NHANES_2017_2018.drop('Unnamed: 0', axis=1, inplace=True)
NHANES_2015_2016.drop('Unnamed: 0', axis=1, inplace=True)
NHANES_2013_2014.drop('Unnamed: 0', axis=1, inplace=True)

In [4]:
NHANES_2017_2018.head()

Unnamed: 0,Participant_id,Gender,Race,Education level,Num family members,Annual family income,Age ranges,Energy (kcal),Carbohydrate (g),Carbohydrate (% kcal),...,Walk/bicycle for commute Y-N,Num days/week walk/bicycle for commute,Vigorous recreational activities Y-N,Num days/week vigorous recreational activities,Moderate recreational activities Y-N,Num days/week moderate recreational activities,Hours/day sedentary activity,Smoke cigarettes Y-N,Days/mo smoked cigs,Num cigs/day
0,93705.0,F,African American,9-11th grade,1.0,"$10,000 to \$14,999",60-69,1231.705,152.47,0.495297,...,0.0,0.0,0.0,0.0,1.0,2.0,5.0,0.0,0.0,0.0
1,93708.0,F,Other / Multi-Racial,Less than 9th grade,2.0,"$25,000 to \$34,999",60-69,1064.425,103.815,0.391712,...,0.0,0.0,0.0,0.0,1.0,5.0,2.0,0.0,0.0,0.0
2,93709.0,F,African American,College - AA,1.0,"$5,000 to \$9,999",70-79,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,10.0,1.0,30.0,5.0
3,93711.0,M,Other / Multi-Racial,College or above,3.0,"$100,000 and Over",50-59,2757.185,337.56,0.490586,...,1.0,5.0,1.0,4.0,1.0,2.0,7.0,0.0,0.0,0.0
4,93713.0,M,White,High school - GED,1.0,"$25,000 to \$34,999",60-69,2152.82,276.0,0.508625,...,1.0,3.0,0.0,0.0,1.0,3.0,2.0,1.0,30.0,15.0


In [5]:
NHANES_2017_2018.shape

(5003, 100)

In [6]:
NHANES_2015_2016.head()

Unnamed: 0,Participant_id,Gender,Race,Education level,Num family members,Annual family income,Age ranges,Energy (kcal),Carbohydrate (g),Carbohydrate (% kcal),...,Walk/bicycle for commute Y-N,Num days/week walk/bicycle for commute,Vigorous recreational activities Y-N,Num days/week vigorous recreational activities,Moderate recreational activities Y-N,Num days/week moderate recreational activities,Hours/day sedentary activity,Smoke cigarettes Y-N,Days/mo smoked cigs,Num cigs/day
0,83732.0,M,White,College or above,2.0,"$65,000 to \$74,999",60-69,2272.855,237.465,0.420351,...,0.0,0.0,0.0,0.0,1.0,6.0,8.0,0.0,0.0,0.0
1,83733.0,M,White,High school - GED,1.0,"$15,000 to \$19,999",50-59,2663.35,290.21,0.451764,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,1.0,30.0,20.0
2,83734.0,M,White,High school - GED,2.0,"$20,000 to \$24,999",70-79,2237.455,249.35,0.440298,...,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0
3,83735.0,F,White,College or above,1.0,"$65,000 to \$74,999",50-59,1356.55,157.99,0.465858,...,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0
4,83736.0,F,African American,College - AA,5.0,"$35,000 to \$44,999",40-49,862.735,111.065,0.531071,...,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0


In [7]:
NHANES_2015_2016.shape

(5304, 100)

In [8]:
NHANES_2013_2014.head()

Unnamed: 0,Participant_id,Gender,Race,Education level,Num family members,Annual family income,Age ranges,Energy (kcal),Carbohydrate (g),Carbohydrate (% kcal),...,Walk/bicycle for commute Y-N,Num days/week walk/bicycle for commute,Vigorous recreational activities Y-N,Num days/week vigorous recreational activities,Moderate recreational activities Y-N,Num days/week moderate recreational activities,Hours/day sedentary activity,Smoke cigarettes Y-N,Days/mo smoked cigs,Num cigs/day
0,73557.0,M,African American,High school - GED,3.0,"$15,000 to \$19,999",60-69,1928.265,228.695,0.491707,...,0.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0
1,73558.0,M,White,High school - GED,4.0,"$35,000 to \$44,999",50-59,2774.965,267.635,0.364585,...,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,1.0,1.0
2,73559.0,M,White,College - AA,2.0,"$65,000 to \$74,999",70-79,1860.29,255.43,0.547192,...,0.0,0.0,0.0,0.0,1.0,1.0,5.0,0.0,0.0,0.0
3,73561.0,F,White,College or above,2.0,"$100,000 and Over",70-79,1518.98,187.075,0.492911,...,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0
4,73562.0,M,Mexican American,College - AA,1.0,"$55,000 to \$64,999",50-59,1824.08,189.59,0.415749,...,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0


In [9]:
NHANES_2013_2014.shape

(5504, 100)

In [10]:
# DataFrames concatenation

NHANES_2013_2018 = pd.concat([NHANES_2017_2018, 
                              NHANES_2015_2016,
                              NHANES_2013_2014])

In [11]:
NHANES_2013_2018.shape

(15811, 100)

In [12]:
NHANES_2013_2018.nunique()

Participant_id                                    15811
Gender                                                2
Race                                                  5
Education level                                       7
Num family members                                    7
                                                  ...  
Num days/week moderate recreational activities        8
Hours/day sedentary activity                         61
Smoke cigarettes Y-N                                  2
Days/mo smoked cigs                                  31
Num cigs/day                                         34
Length: 100, dtype: int64

### Diet column

Each participant is assigned to one of 5 dietary regimes: **DASH, Mediterranean, Paleo, USDA Balanced, or Unbalanced.** Membership in a regime is not necessarily a voluntary choice: it is calculated here from the macronutrient intake each participant declared during the survey.

The Diet column is added to the data frame.

In [13]:
def diet(row):
    """
    Assign a survey participant to one of the 5 dietary regimes (DASH, Mediterranean,
    Paleo, USDA Balanced, or Unbalanced) based on the share of energy intake
    declared for carbohydrate, protein and total fat.
    """
    carb = row['Carbohydrate (% kcal)']
    protein = row['Protein (% kcal)']
    fat = row['Total fat (% kcal)']

    # The branches are evaluated in order, so the first matching regime wins
    if 0.52 < carb <= 0.6 and 0.13 <= protein <= 0.23 and 0.22 <= fat < 0.32:
        return 'DASH'
    elif 0.45 <= carb <= 0.52 and 0.1 <= protein <= 0.2 and 0.32 <= fat <= 0.4:
        return 'Mediterranean'
    elif 0.3 <= carb <= 0.4 and 0.25 <= protein <= 0.35 and 0.3 <= fat <= 0.4:
        return 'Paleo'
    elif 0.45 <= carb <= 0.65 and 0.1 <= protein <= 0.3 and 0.25 <= fat <= 0.35:
        return 'USDA Balanced'
    else:
        return 'Unbalanced'


NHANES_2013_2018['Diet type'] = NHANES_2013_2018.apply(diet, axis=1)
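
The row-wise `apply` above can also be written in vectorized form. The following is a minimal sketch on toy data, assuming the same three column names; `classify_diet` and `toy` are hypothetical names introduced only for illustration. `np.select` returns the label of the first condition that holds, which mirrors the order of the `if`/`elif` branches.

```python
import numpy as np
import pandas as pd


def classify_diet(df):
    """Vectorized sketch of the row-wise diet() rules (illustrative only)."""
    carb = df['Carbohydrate (% kcal)']
    protein = df['Protein (% kcal)']
    fat = df['Total fat (% kcal)']

    # Series.between is inclusive on both ends by default
    conditions = [
        (carb > 0.52) & (carb <= 0.6) & protein.between(0.13, 0.23)
        & (fat >= 0.22) & (fat < 0.32),
        carb.between(0.45, 0.52) & protein.between(0.1, 0.2) & fat.between(0.32, 0.4),
        carb.between(0.3, 0.4) & protein.between(0.25, 0.35) & fat.between(0.3, 0.4),
        carb.between(0.45, 0.65) & protein.between(0.1, 0.3) & fat.between(0.25, 0.35),
    ]
    labels = ['DASH', 'Mediterranean', 'Paleo', 'USDA Balanced']
    # Rows matching none of the conditions fall back to 'Unbalanced'
    return np.select(conditions, labels, default='Unbalanced')


# Hypothetical macronutrient shares, one row per regime
toy = pd.DataFrame({
    'Carbohydrate (% kcal)': [0.55, 0.49, 0.35, 0.60, 0.20],
    'Protein (% kcal)':      [0.18, 0.15, 0.30, 0.25, 0.40],
    'Total fat (% kcal)':    [0.27, 0.36, 0.35, 0.30, 0.40],
})
toy['Diet type'] = classify_diet(toy)
```

Because the function returns a plain array, the assignment is positional and does not depend on the row labels of the frame.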

In [14]:
NHANES_2013_2018['Diet type'].head(20)

0        Unbalanced
1        Unbalanced
2        Unbalanced
3     Mediterranean
4     Mediterranean
5        Unbalanced
6     Mediterranean
7              DASH
8     Mediterranean
9        Unbalanced
10       Unbalanced
11    USDA Balanced
12             DASH
13    USDA Balanced
14             DASH
15       Unbalanced
16    Mediterranean
17       Unbalanced
18       Unbalanced
19             DASH
Name: Diet type, dtype: object

In [15]:
NHANES_2013_2018.groupby('Diet type').count()

Diet type,Participant_id,Gender,Race,Education level,Num family members,Annual family income,Age ranges,Energy (kcal),Carbohydrate (g),Carbohydrate (% kcal),...,Walk/bicycle for commute Y-N,Num days/week walk/bicycle for commute,Vigorous recreational activities Y-N,Num days/week vigorous recreational activities,Moderate recreational activities Y-N,Num days/week moderate recreational activities,Hours/day sedentary activity,Smoke cigarettes Y-N,Days/mo smoked cigs,Num cigs/day
DASH,1791,1791,1791,1791,1791,1791,1791,1791,1791,1791,...,1788,1788,1788,1788,1788,1788,1788,1791,1791,1791
Mediterranean,3001,3001,3001,3001,3001,3001,3001,3001,3001,3001,...,2995,2995,2995,2995,2995,2995,2995,2999,2999,2999
Paleo,204,204,204,204,204,204,204,204,204,204,...,204,204,204,204,204,204,204,204,204,204
USDA Balanced,2254,2254,2254,2254,2254,2254,2254,2254,2254,2254,...,2249,2249,2249,2249,2249,2249,2249,2254,2254,2254
Unbalanced,8561,8561,8561,8561,8561,8561,8561,8561,8561,8561,...,8535,8535,8535,8535,8535,8535,8535,8558,8558,8558


In [16]:
NHANES_2013_2018.isnull().sum()

Participant_id                   0
Gender                           0
Race                             0
Education level                  0
Num family members               0
                                ..
Hours/day sedentary activity    40
Smoke cigarettes Y-N             5
Days/mo smoked cigs              5
Num cigs/day                     5
Diet type                        0
Length: 101, dtype: int64

### Remove rows where macronutrients = 0
Rows where the macronutrients are 0 are not useful for the data analysis, so they are removed.

In [17]:
NHANES_2013_2018[NHANES_2013_2018['Carbohydrate (g)']  == 0]

Unnamed: 0,Participant_id,Gender,Race,Education level,Num family members,Annual family income,Age ranges,Energy (kcal),Carbohydrate (g),Carbohydrate (% kcal),...,Num days/week walk/bicycle for commute,Vigorous recreational activities Y-N,Num days/week vigorous recreational activities,Moderate recreational activities Y-N,Num days/week moderate recreational activities,Hours/day sedentary activity,Smoke cigarettes Y-N,Days/mo smoked cigs,Num cigs/day,Diet type
2,93709.0,F,African American,College - AA,1.0,"$5,000 to \$9,999",70-79,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,10.0,1.0,30.0,5.0,Unbalanced
9,93718.0,M,African American,High school - GED,7.0,"$65,000 to \$74,999",40-49,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,5.0,4.0,0.0,0.0,0.0,Unbalanced
25,93750.0,F,Other / Multi-Racial,College or above,2.0,"$100,000 and Over",50-59,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,2.0,5.0,0.0,0.0,0.0,Unbalanced
40,93773.0,M,White,College - AA,2.0,"$5,000 to \$9,999",60-69,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,12.0,0.0,0.0,0.0,Unbalanced
41,93774.0,F,White,College - AA,3.0,"$75,000 to \$99,999",40-49,0.0,0.0,0.0,...,2.0,0.0,0.0,1.0,2.0,1.0,0.0,0.0,0.0,Unbalanced
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5462,83652.0,M,Other / Multi-Racial,College or above,1.0,"$45,000 to \$54,999",30-39,0.0,0.0,0.0,...,5.0,1.0,3.0,1.0,3.0,8.0,0.0,0.0,0.0,Unbalanced
5464,83657.0,M,African American,College or above,2.0,"$10,000 to \$14,999",30-39,0.0,0.0,0.0,...,0.0,1.0,5.0,1.0,5.0,10.0,0.0,0.0,0.0,Unbalanced
5480,83687.0,F,White,College - AA,3.0,"$45,000 to \$54,999",70-79,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,Unbalanced
5498,83720.0,M,African American,College - AA,2.0,"$45,000 to \$54,999",30-39,0.0,0.0,0.0,...,5.0,1.0,2.0,0.0,0.0,7.0,0.0,0.0,0.0,Unbalanced


In [18]:
NHANES_2013_2018.drop(NHANES_2013_2018[NHANES_2013_2018['Carbohydrate (g)']  == 0].index, inplace = True)
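
One caveat is worth checking here. The three frames were concatenated without `ignore_index=True`, so the combined frame keeps each cycle's original row labels and the same label can occur up to three times. `DataFrame.drop` removes every row that carries a given label, so dropping by `.index` can also discard rows whose carbohydrate intake is not 0. The previews appear to show this: row 6 of the 2017-2018 cycle is classified as Mediterranean in the `head(20)` output, yet it is absent from the later `head(10)` output. A minimal sketch on toy data (the frame names are hypothetical) that shows the effect next to a boolean-mask filter:

```python
import pandas as pd

# Two toy "survey cycles" whose row labels both start at 0 (hypothetical data)
cycle_a = pd.DataFrame({'Carbohydrate (g)': [150.0, 0.0, 200.0]})
cycle_b = pd.DataFrame({'Carbohydrate (g)': [120.0, 180.0, 0.0]})
combined = pd.concat([cycle_a, cycle_b])   # row labels: 0, 1, 2, 0, 1, 2

zero_carb = combined['Carbohydrate (g)'] == 0

# Label-based drop: labels 1 and 2 each occur twice, so four rows are removed
dropped_by_label = combined.drop(combined[zero_carb].index)

# Boolean-mask filter: only the two rows that actually contain 0 are removed
kept_by_mask = combined[~zero_carb]
```

The row counts reported in the outputs that follow reflect the label-based drop, so they would differ if the mask-based filter were used instead.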

In [19]:
NHANES_2013_2018[NHANES_2013_2018['Carbohydrate (g)']  == 0]

Unnamed: 0,Participant_id,Gender,Race,Education level,Num family members,Annual family income,Age ranges,Energy (kcal),Carbohydrate (g),Carbohydrate (% kcal),...,Num days/week walk/bicycle for commute,Vigorous recreational activities Y-N,Num days/week vigorous recreational activities,Moderate recreational activities Y-N,Num days/week moderate recreational activities,Hours/day sedentary activity,Smoke cigarettes Y-N,Days/mo smoked cigs,Num cigs/day,Diet type


In [20]:
NHANES_2013_2018.groupby('Diet type').count()

Diet type,Participant_id,Gender,Race,Education level,Num family members,Annual family income,Age ranges,Energy (kcal),Carbohydrate (g),Carbohydrate (% kcal),...,Walk/bicycle for commute Y-N,Num days/week walk/bicycle for commute,Vigorous recreational activities Y-N,Num days/week vigorous recreational activities,Moderate recreational activities Y-N,Num days/week moderate recreational activities,Hours/day sedentary activity,Smoke cigarettes Y-N,Days/mo smoked cigs,Num cigs/day
DASH,1403,1403,1403,1403,1403,1403,1403,1403,1403,1403,...,1400,1400,1400,1400,1400,1400,1400,1403,1403,1403
Mediterranean,2343,2343,2343,2343,2343,2343,2343,2343,2343,2343,...,2339,2339,2339,2339,2339,2339,2339,2341,2341,2341
Paleo,157,157,157,157,157,157,157,157,157,157,...,157,157,157,157,157,157,157,157,157,157
USDA Balanced,1753,1753,1753,1753,1753,1753,1753,1753,1753,1753,...,1749,1749,1749,1749,1749,1749,1749,1753,1753,1753
Unbalanced,5239,5239,5239,5239,5239,5239,5239,5239,5239,5239,...,5227,5227,5227,5227,5227,5227,5227,5237,5237,5237


In [21]:
NHANES_2013_2018.describe()

Unnamed: 0,Participant_id,Num family members,Energy (kcal),Carbohydrate (g),Carbohydrate (% kcal),Protein (g),Protein (% kcal),Total fat (g),Total fat (% kcal),Total sugars (g),...,Walk/bicycle for commute Y-N,Num days/week walk/bicycle for commute,Vigorous recreational activities Y-N,Num days/week vigorous recreational activities,Moderate recreational activities Y-N,Num days/week moderate recreational activities,Hours/day sedentary activity,Smoke cigarettes Y-N,Days/mo smoked cigs,Num cigs/day
count,10895.0,10895.0,10895.0,10895.0,10895.0,10895.0,10895.0,10895.0,10895.0,10895.0,...,10872.0,10872.0,10872.0,10872.0,10872.0,10872.0,10872.0,10891.0,10891.0,10891.0
mean,88204.759615,3.018541,2007.434205,242.064505,0.486806,80.021954,0.163865,79.89871,0.3493287,103.410243,...,0.231512,1.085081,0.235467,0.779985,0.411976,1.448492,6.975918,0.151318,4.888624,2.059958
std,8450.909236,1.708956,849.873031,109.938157,0.092626,37.553668,0.0486,40.97352,0.07826997,66.118895,...,0.421818,2.18756,0.42431,1.594355,0.492213,2.09153,11.257202,0.358375,10.736941,5.628455
min,73557.0,1.0,14.12,1.0,0.096075,0.85,0.006814,5.397605e-79,1.938578e-81,0.98,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,80819.5,2.0,1433.0325,167.51,0.427536,55.2475,0.13205,51.9175,0.299191,59.19,...,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0
50%,88169.0,3.0,1878.955,225.07,0.486036,74.045,0.157265,73.66,0.3511505,90.145,...,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0
75%,95505.5,4.0,2437.055,296.555,0.544059,97.5975,0.187448,99.865,0.4009422,131.2625,...,0.0,0.0,0.0,0.0,1.0,3.0,8.0,0.0,0.0,0.0
max,102956.0,7.0,11481.4,1423.87,0.974284,507.495,0.540433,478.16,0.8379124,1115.5,...,1.0,7.0,1.0,7.0,1.0,7.0,166.65,1.0,30.0,90.0


In [22]:
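# numeric_only=True restricts the mean to numeric columns; recent pandas versions
# raise an error when mean() meets text columns such as Gender or Race
NHANES_2013_2018.groupby('Diet type').mean(numeric_only=True)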

Diet type,Participant_id,Num family members,Energy (kcal),Carbohydrate (g),Carbohydrate (% kcal),Protein (g),Protein (% kcal),Total fat (g),Total fat (% kcal),Total sugars (g),...,Walk/bicycle for commute Y-N,Num days/week walk/bicycle for commute,Vigorous recreational activities Y-N,Num days/week vigorous recreational activities,Moderate recreational activities Y-N,Num days/week moderate recreational activities,Hours/day sedentary activity,Smoke cigarettes Y-N,Days/mo smoked cigs,Num cigs/day
DASH,87162.560228,3.135424,1863.81784,257.635859,0.556035,74.902406,0.163018,59.296087,0.280946,110.067741,...,0.243571,1.142143,0.212857,0.670714,0.415,1.46,6.786595,0.129722,4.203849,1.689237
Mediterranean,88153.147674,2.931711,2158.68548,261.614639,0.485977,81.046231,0.15257,87.560222,0.361453,111.77356,...,0.219324,0.997435,0.240701,0.788371,0.416845,1.443352,7.421085,0.155489,5.038018,1.997437
Paleo,86270.713376,3.401274,1695.833885,155.148439,0.36078,118.337484,0.28578,66.876688,0.35344,54.848631,...,0.235669,1.197452,0.33758,1.22293,0.420382,1.522293,6.731423,0.095541,3.490446,1.22293
USDA Balanced,87859.694238,3.11352,1986.948217,267.076863,0.538602,74.865114,0.154604,68.797812,0.306794,119.766506,...,0.229846,1.088622,0.220698,0.699257,0.406518,1.436249,6.503392,0.152881,4.905876,2.087279
Unbalanced,88680.361328,2.982821,1994.44417,223.386624,0.455083,81.512171,0.168589,86.094332,0.376328,93.869476,...,0.234169,1.104458,0.241056,0.819208,0.410561,1.449589,6.992877,0.156387,5.041436,2.20317


In [23]:
NHANES_2013_2018.to_csv('FINAL_DATASETS/NHANES_2013_2018.csv')
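
By default, `to_csv` writes the row index as an unnamed first column. That is typically how an `Unnamed: 0` column, like the one dropped at the top of this notebook, arises when the file is read back. A minimal sketch, assuming the index carries no information that needs to be kept; it uses in-memory buffers and a hypothetical two-row frame rather than the real file path:

```python
import io
import pandas as pd

toy = pd.DataFrame({'Participant_id': [93705.0, 93708.0],
                    'Diet type': ['Unbalanced', 'Unbalanced']})

# Default behavior: the index is written and comes back as 'Unnamed: 0'
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
round_trip_default = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
round_trip_clean = pd.read_csv(without_index)
```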

# Exploratory data analysis

Data exploration was performed with **Tableau Desktop.**  

The Tableau Dashboards can be found __[here](https://public.tableau.com/app/profile/francesca.scipioni)__.

# Data Preprocessing

### 1: Features Selection
To identify the factors that most influence the participants' health, only the relevant features in the DataFrame are selected and retained.

In [24]:
NHANES_2013_2018.columns

Index(['Participant_id', 'Gender', 'Race', 'Education level',
       'Num family members', 'Annual family income', 'Age ranges',
       'Energy (kcal)', 'Carbohydrate (g)', 'Carbohydrate (% kcal)',
       ...
       'Num days/week walk/bicycle for commute',
       'Vigorous recreational activities Y-N',
       'Num days/week vigorous recreational activities',
       'Moderate recreational activities Y-N',
       'Num days/week moderate recreational activities',
       'Hours/day sedentary activity', 'Smoke cigarettes Y-N',
       'Days/mo smoked cigs', 'Num cigs/day', 'Diet type'],
      dtype='object', length=101)

In [25]:
drop_list = ['Energy (kcal)',
             'Carbohydrate (g)',
             'Carbohydrate (% kcal)', 
             'Protein (g)', 
             'Protein (% kcal)', 
             'Total fat (g)', 
             'Total fat (% kcal)',
             'Alcohol (g)',
             'Low calorie diet',
             'Low fat/Low cholesterol diet',
             'Low sodium diet',
             'Sugar free/Low sugar diet',
             'Diabetic diet',
             'Muscle building diet',
             'Low carbohydrate diet',
             'High protein diet',
             'Gluten-free/Celiac diet',
             'Total DS taken',
             'Days DS taken',
             'Prevent health problems',
             'Improve overall health',
             'Supplement diet',
             'Maintain health',
             'Healthy skin-hair-nails',
             'Weight loss',
             'Get more energy',
             'Antioxidants',
             'Build muscle',
             '60 sec pulse',
             'Pulse regular/irregular', 
             'Systolic BP (mm Hg)',
             'Diastolic BP (mm Hg)',
             'Weight (kg)',
             'Height (cm)',
             'Waist Circumference (cm)',
             'Alchool past 12 mo',
             'Num days consuming 4/5 drinks',
             'Num times 4/5 drinks in 2 hrs',
             'General health condition',
             'How healthy is the diet',
             'Ever had HBP Y-N',
             'Ever had HBP +2 times Y-N',
             'Ever had HCL Y-N',
             'Ever had chest pain Y-N',
             'Chest pain walking uphill/hurry Y-N',
             'Chest pain on level ground Y-N',
             '\$ spent non-food items',
             'Ever used marijuana Y-N',
             'Used marijuana every month Y-N',
             'Num daily joints',
             'Ever used cocaine Y-N',
             'Ever used heroin Y-N',
             'Ever used methamphetamine Y-N',
             'Vigorous activity at work Y-N',
             'Moderate activity at work Y-N',
             'Walk/bicycle for commute Y-N',
             'Vigorous recreational activities Y-N',
             'Moderate recreational activities Y-N',
             'Smoke cigarettes Y-N',
             'Days/mo smoked cigs'
            ]

NHANES_2013_2018_pp = NHANES_2013_2018.drop(drop_list, axis=1)

In [26]:
NHANES_2013_2018_pp.columns

Index(['Participant_id', 'Gender', 'Race', 'Education level',
       'Num family members', 'Annual family income', 'Age ranges',
       'Total sugars (g)', 'Cholesterol (mg)', 'Moisture (g)', 'Water (g)',
       'Shellfish past 30 days Y-N', 'Fish past 30 days Y-N',
       'On special diet Y-N', 'DS taken Y-N', 'Num DS taken daily',
       'BMI (kg/m**2)', 'Alcohol drinks/day', '4+ drinks every day Y-N',
       'HBP medicine now Y-N', 'HCL medicine now Y-N',
       '\$ spent supermarket/grocery store',
       '\$ spent for food at other stores', '\$ spent eating out',
       '\$ spent carryout/delivered foods', 'Have diabetes Y-N',
       'Num meals not home prepared', 'Num meals from fast food',
       'Num ready-to-eat foods', 'Num frozen meals',
       'Frequency of marijuana use', 'Frequency of cocaine use',
       'Frequency of methamphetamine use',
       'Num days/week vigorous activity at work',
       'Num days/week moderate activity at work',
       'Num days/week walk/bicycl

### 2. Columns aggregation and rename
The physical activity columns can be merged together, while others need to be renamed.

In [27]:
NHANES_2013_2018_pp.head(10)

Unnamed: 0,Participant_id,Gender,Race,Education level,Num family members,Annual family income,Age ranges,Total sugars (g),Cholesterol (mg),Moisture (g),...,Frequency of cocaine use,Frequency of methamphetamine use,Num days/week vigorous activity at work,Num days/week moderate activity at work,Num days/week walk/bicycle for commute,Num days/week vigorous recreational activities,Num days/week moderate recreational activities,Hours/day sedentary activity,Num cigs/day,Diet type
0,93705.0,F,African American,9-11th grade,1.0,"$10,000 to \$14,999",60-69,67.295,208.0,1873.825,...,Never/Not disclosed,Never/Not disclosed,0.0,0.0,0.0,0.0,2.0,5.0,0.0,Unbalanced
1,93708.0,F,Other / Multi-Racial,Less than 9th grade,2.0,"$25,000 to \$34,999",60-69,46.105,98.5,2378.25,...,Never/Not disclosed,Never/Not disclosed,0.0,0.0,0.0,0.0,5.0,2.0,0.0,Unbalanced
3,93711.0,M,Other / Multi-Racial,College or above,3.0,"$100,000 and Over",50-59,155.985,411.5,3938.385,...,Never/Not disclosed,Never/Not disclosed,0.0,0.0,5.0,4.0,2.0,7.0,0.0,Mediterranean
4,93713.0,M,White,High school - GED,1.0,"$25,000 to \$34,999",60-69,174.455,166.5,2229.395,...,Never/Not disclosed,Never/Not disclosed,0.0,0.0,3.0,0.0,3.0,2.0,15.0,Mediterranean
5,93714.0,F,African American,College - AA,3.0,"$35,000 to \$44,999",50-59,59.0,877.5,1754.235,...,Never/Not disclosed,Never/Not disclosed,0.0,0.0,0.0,0.0,0.0,6.0,0.0,Unbalanced
7,93716.0,M,Other / Multi-Racial,College or above,3.0,"$100,000 and Over",60-69,132.695,410.0,2428.1,...,Never/Not disclosed,Never/Not disclosed,0.0,0.0,0.0,2.0,1.0,5.0,0.0,DASH
8,93717.0,M,White,High school - GED,1.0,"$15,000 to \$19,999",18-29,133.08,335.5,2816.825,...,>100 times,Never/Not disclosed,0.0,0.0,7.0,0.0,0.0,5.0,20.0,Mediterranean
10,93721.0,F,Mexican American,Less than 9th grade,2.0,"$45,000 to \$54,999",60-69,72.235,510.0,2441.645,...,Never/Not disclosed,Never/Not disclosed,0.0,0.0,0.0,0.0,0.0,2.0,0.0,Unbalanced
11,93722.0,F,White,College - AA,1.0,"$25,000 to \$34,999",60-69,87.355,174.0,2827.26,...,Never/Not disclosed,Never/Not disclosed,0.0,0.0,0.0,0.0,0.0,4.0,0.0,USDA Balanced
12,93723.0,M,White,College or above,2.0,"$55,000 to \$64,999",60-69,148.505,273.0,2158.93,...,Never/Not disclosed,Never/Not disclosed,2.0,3.0,0.0,2.0,0.0,2.0,0.0,DASH


#### Physical activity

The physical activity columns will be averaged together to get the mean number of days/week a participant dedicates to exercise.

In [28]:
NHANES_2013_2018_pp['Num days/week physical activity'] = (
    NHANES_2013_2018_pp['Num days/week vigorous activity at work'] + 
    NHANES_2013_2018_pp['Num days/week moderate activity at work'] +
    NHANES_2013_2018_pp['Num days/week walk/bicycle for commute'] +
    NHANES_2013_2018_pp['Num days/week vigorous recreational activities'] +
    NHANES_2013_2018_pp['Num days/week moderate recreational activities']) / 5
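
The same average can be written with `DataFrame.mean(axis=1)`. This is a minimal sketch on toy data, assuming the five column names above; `activity_cols` and `toy` are names introduced here for illustration. Passing `skipna=False` reproduces the behavior of the explicit sum, where a missing value in any of the five columns makes the result missing.

```python
import numpy as np
import pandas as pd

activity_cols = [
    'Num days/week vigorous activity at work',
    'Num days/week moderate activity at work',
    'Num days/week walk/bicycle for commute',
    'Num days/week vigorous recreational activities',
    'Num days/week moderate recreational activities',
]

# Toy rows: the first uses the values shown for participant 93711 (0, 0, 5, 4, 2),
# the second is a made-up row with one missing answer
toy = pd.DataFrame([[0.0, 0.0, 5.0, 4.0, 2.0],
                    [1.0, np.nan, 0.0, 0.0, 3.0]], columns=activity_cols)

# Row-wise mean over the five activity columns; NaN propagates because skipna=False
toy['Num days/week physical activity'] = toy[activity_cols].mean(axis=1, skipna=False)
```

For the first row the result is 11 / 5 = 2.2, which agrees with the value shown for that participant in the output further down.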

In [29]:
drop_list = ['Num days/week vigorous activity at work',
       'Num days/week moderate activity at work',
       'Num days/week walk/bicycle for commute',
       'Num days/week vigorous recreational activities',
       'Num days/week moderate recreational activities']

NHANES_2013_2018_pp.drop(drop_list, axis=1, inplace=True)

In [30]:
NHANES_2013_2018_pp.head(10)

Unnamed: 0,Participant_id,Gender,Race,Education level,Num family members,Annual family income,Age ranges,Total sugars (g),Cholesterol (mg),Moisture (g),...,Num meals from fast food,Num ready-to-eat foods,Num frozen meals,Frequency of marijuana use,Frequency of cocaine use,Frequency of methamphetamine use,Hours/day sedentary activity,Num cigs/day,Diet type,Num days/week physical activity
0,93705.0,F,African American,9-11th grade,1.0,"$10,000 to \$14,999",60-69,67.295,208.0,1873.825,...,5.397605e-79,5.397605e-79,5.397605e-79,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,5.0,0.0,Unbalanced,0.4
1,93708.0,F,Other / Multi-Racial,Less than 9th grade,2.0,"$25,000 to \$34,999",60-69,46.105,98.5,2378.25,...,5.397605e-79,4.0,5.397605e-79,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,2.0,0.0,Unbalanced,1.0
3,93711.0,M,Other / Multi-Racial,College or above,3.0,"$100,000 and Over",50-59,155.985,411.5,3938.385,...,5.397605e-79,5.397605e-79,5.397605e-79,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,7.0,0.0,Mediterranean,2.2
4,93713.0,M,White,High school - GED,1.0,"$25,000 to \$34,999",60-69,174.455,166.5,2229.395,...,0.0,8.0,3.0,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,2.0,15.0,Mediterranean,1.2
5,93714.0,F,African American,College - AA,3.0,"$35,000 to \$44,999",50-59,59.0,877.5,1754.235,...,5.0,3.0,8.0,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,6.0,0.0,Unbalanced,0.0
7,93716.0,M,Other / Multi-Racial,College or above,3.0,"$100,000 and Over",60-69,132.695,410.0,2428.1,...,0.0,5.397605e-79,5.397605e-79,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,5.0,0.0,DASH,0.6
8,93717.0,M,White,High school - GED,1.0,"$15,000 to \$19,999",18-29,133.08,335.5,2816.825,...,5.397605e-79,5.397605e-79,5.397605e-79,3-6 times per week,>100 times,Never/Not disclosed,5.0,20.0,Mediterranean,1.4
10,93721.0,F,Mexican American,Less than 9th grade,2.0,"$45,000 to \$54,999",60-69,72.235,510.0,2441.645,...,1.0,5.397605e-79,5.397605e-79,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,2.0,0.0,Unbalanced,0.0
11,93722.0,F,White,College - AA,1.0,"$25,000 to \$34,999",60-69,87.355,174.0,2827.26,...,5.397605e-79,5.0,5.397605e-79,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,4.0,0.0,USDA Balanced,0.0
12,93723.0,M,White,College or above,2.0,"$55,000 to \$64,999",60-69,148.505,273.0,2158.93,...,5.397605e-79,1.0,3.0,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,2.0,0.0,DASH,1.4


In [31]:
NHANES_2013_2018_pp.columns

Index(['Participant_id', 'Gender', 'Race', 'Education level',
       'Num family members', 'Annual family income', 'Age ranges',
       'Total sugars (g)', 'Cholesterol (mg)', 'Moisture (g)', 'Water (g)',
       'Shellfish past 30 days Y-N', 'Fish past 30 days Y-N',
       'On special diet Y-N', 'DS taken Y-N', 'Num DS taken daily',
       'BMI (kg/m**2)', 'Alcohol drinks/day', '4+ drinks every day Y-N',
       'HBP medicine now Y-N', 'HCL medicine now Y-N',
       '\$ spent supermarket/grocery store',
       '\$ spent for food at other stores', '\$ spent eating out',
       '\$ spent carryout/delivered foods', 'Have diabetes Y-N',
       'Num meals not home prepared', 'Num meals from fast food',
       'Num ready-to-eat foods', 'Num frozen meals',
       'Frequency of marijuana use', 'Frequency of cocaine use',
       'Frequency of methamphetamine use', 'Hours/day sedentary activity',
       'Num cigs/day', 'Diet type', 'Num days/week physical activity'],
      dtype='object')

In [32]:
NHANES_2013_2018_pp.isnull().sum()

Participant_id                           0
Gender                                   0
Race                                     0
Education level                          0
Num family members                       0
Annual family income                     0
Age ranges                               0
Total sugars (g)                         0
Cholesterol (mg)                         0
Moisture (g)                             0
Water (g)                                0
Shellfish past 30 days Y-N               0
Fish past 30 days Y-N                    0
On special diet Y-N                      0
DS taken Y-N                             0
Num DS taken daily                       0
BMI (kg/m**2)                            0
Alcohol drinks/day                      23
4+ drinks every day Y-N                 23
HBP medicine now Y-N                   112
HCL medicine now Y-N                   112
\$ spent supermarket/grocery store     400
\$ spent for food at other stores      400
\$ spent ea

In [33]:
NHANES_2013_2018_pp.shape

(10895, 37)

In [34]:
# Round selected float64 columns to one decimal place

round_cols = ['Total sugars (g)', 
              'Cholesterol (mg)', 
              'Moisture (g)',
              'Water (g)', 
              'Num DS taken daily',
              'BMI (kg/m**2)', 
              'Alcohol drinks/day',
              '\$ spent supermarket/grocery store',
              '\$ spent for food at other stores', 
              '\$ spent eating out',
              '\$ spent carryout/delivered foods',
              'Num meals not home prepared', 
              'Num meals from fast food',
              'Num ready-to-eat foods', 
              'Num frozen meals',
              'Hours/day sedentary activity',
              'Num cigs/day', 
              'Num days/week physical activity']

for col in round_cols:
    NHANES_2013_2018_pp[col] = NHANES_2013_2018_pp[col].round(1)

#### Rename columns

If a participant reported taking medication for high blood pressure (HBP) or high cholesterol (HCL), they are considered to have that condition.

In [35]:
NHANES_2013_2018_pp = NHANES_2013_2018_pp.rename(columns={'HBP medicine now Y-N' : 'Have HBP',
                                                'HCL medicine now Y-N' : 'Have HCL',
                                                'Have diabetes Y-N' : 'Have diabetes'})

#### Obesity

The BMI column is used to classify each participant as underweight, normal weight, overweight, or obese.

In [36]:
def BMI(x):
    # Map a BMI value (kg/m**2) to a weight class; each upper bound is inclusive
    if x <= 18.5:
        return 'Underweight'
    elif x <= 25:
        return 'Normal'
    elif x <= 30:
        return 'Overweight'
    else:
        return 'Obesity'

NHANES_2013_2018_pp['BMI'] = NHANES_2013_2018_pp['BMI (kg/m**2)'].apply(BMI)
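
Two details of the function above are worth keeping in mind: the upper bounds are inclusive (a BMI of exactly 25 is labelled 'Normal'), and a missing value would fall through to 'Obesity' because every comparison with NaN is false. The sketch below is an alternative, not the method used in this notebook. It assumes the commonly cited WHO cut-offs with left-closed intervals (below 18.5, 18.5 to below 25, 25 to below 30, 30 and above) and uses `pd.cut`, which keeps missing values as missing. The sample BMI values are invented.

```python
import numpy as np
import pandas as pd

# Invented BMI values, including the boundary cases and one missing value.
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, np.nan])

# right=False makes every bin closed on the left:
# [0, 18.5), [18.5, 25), [25, 30), [30, inf)
bmi_class = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, np.inf],
    labels=['Underweight', 'Normal', 'Overweight', 'Obesity'],
    right=False,
)
print(bmi_class.tolist())
```

With this binning a BMI of 18.5 is 'Normal', 25.0 is 'Overweight', 30.0 is 'Obesity', and the missing value stays NaN instead of being assigned a class.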

In [37]:
NHANES_2013_2018_pp.drop('BMI (kg/m**2)', axis=1, inplace=True)

In [38]:
NHANES_2013_2018_pp.shape

(10895, 37)

### 3: Fill missing data
I will check whether the DataFrame contains missing data. If so, I will select the most appropriate approach to deal with it.

In [39]:
NHANES_2013_2018_pp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10895 entries, 0 to 5503
Data columns (total 37 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Participant_id                      10895 non-null  float64
 1   Gender                              10895 non-null  object 
 2   Race                                10895 non-null  object 
 3   Education level                     10895 non-null  object 
 4   Num family members                  10895 non-null  float64
 5   Annual family income                10895 non-null  object 
 6   Age ranges                          10895 non-null  object 
 7   Total sugars (g)                    10895 non-null  float64
 8   Cholesterol (mg)                    10895 non-null  float64
 9   Moisture (g)                        10895 non-null  float64
 10  Water (g)                           10895 non-null  float64
 11  Shellfish past 30 days Y-N          10895 
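
Since `info()` prints every column, it can help to reduce the output to the columns that actually have gaps. The following is a minimal sketch on a made-up frame (`df`, `summary` and the values are placeholders) that lists only the affected columns together with the share of missing rows.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for NHANES_2013_2018_pp (values are invented).
df = pd.DataFrame({
    'Alcohol drinks/day': [1.0, np.nan, 2.0, np.nan],
    'Num frozen meals': [0.0, 3.0, np.nan, 1.0],
    'Gender': ['F', 'M', 'F', 'M'],
})

missing = df.isnull().sum()
summary = pd.DataFrame({
    'n_missing': missing,
    'pct_missing': (missing / len(df) * 100).round(1),
})

# Keep only the columns with at least one missing value, worst first.
summary = summary[summary['n_missing'] > 0].sort_values('n_missing', ascending=False)
print(summary)
```

The percentage column makes it easier to judge whether filling with 0 or with the mean is reasonable for a given column.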

In [40]:
# For some columns it makes sense to replace the missing values with 0,
# while for others the column mean is a better choice

miss_zero = ['Alcohol drinks/day',
             '4+ drinks every day Y-N',
             'Have HBP',
             'Have HCL',
             'Have diabetes']
             
miss_avg = ['\$ spent supermarket/grocery store',
             '\$ spent for food at other stores',
             '\$ spent eating out',
             '\$ spent carryout/delivered foods',
             'Num meals not home prepared',
             'Num meals from fast food',
             'Num ready-to-eat foods',
             'Num frozen meals',
             'Hours/day sedentary activity',
             'Num cigs/day',
             'Num days/week physical activity'
            ]

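# Assign the result back instead of calling fillna(inplace=True) on a selected column,
# which recent pandas versions flag as chained assignment
for col in miss_zero:
    NHANES_2013_2018_pp[col] = NHANES_2013_2018_pp[col].fillna(0)

for col in miss_avg:
    mean_value = NHANES_2013_2018_pp[col].mean()
    NHANES_2013_2018_pp[col] = NHANES_2013_2018_pp[col].fillna(mean_value)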
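
The two filling strategies can also be expressed as a single `fillna` call by passing a dictionary that maps each column to its fill value. This is only a sketch on a toy frame: `df`, `fill_values` and the numbers are illustrative, and the two lists are shortened versions of `miss_zero` and `miss_avg`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for NHANES_2013_2018_pp (values are invented).
df = pd.DataFrame({
    'Alcohol drinks/day': [1.0, np.nan, 3.0],
    'Num frozen meals': [2.0, np.nan, 4.0],
})

miss_zero = ['Alcohol drinks/day']   # missing is treated as 0
miss_avg = ['Num frozen meals']      # missing is replaced by the column mean

# Build one {column: fill value} mapping covering both strategies.
fill_values = {col: 0 for col in miss_zero}
fill_values.update(df[miss_avg].mean().to_dict())

df = df.fillna(value=fill_values)
print(df)
```

Columns that do not appear in the dictionary are not modified, so the mapping doubles as a record of how each column was imputed.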

In [41]:
# For the categorical columns, the missing values are replaced with 'Never/Not disclosed'

miss_cat = ['Frequency of marijuana use',
            'Frequency of cocaine use',
            'Frequency of methamphetamine use']

for col in miss_cat:
    NHANES_2013_2018_pp[col] = NHANES_2013_2018_pp[col].fillna('Never/Not disclosed')

In [42]:
NHANES_2013_2018_pp.isnull().sum()

Participant_id                        0
Gender                                0
Race                                  0
Education level                       0
Num family members                    0
Annual family income                  0
Age ranges                            0
Total sugars (g)                      0
Cholesterol (mg)                      0
Moisture (g)                          0
Water (g)                             0
Shellfish past 30 days Y-N            0
Fish past 30 days Y-N                 0
On special diet Y-N                   0
DS taken Y-N                          0
Num DS taken daily                    0
Alcohol drinks/day                    0
4+ drinks every day Y-N               0
Have HBP                              0
Have HCL                              0
\$ spent supermarket/grocery store    0
\$ spent for food at other stores     0
\$ spent eating out                   0
\$ spent carryout/delivered foods     0
Have diabetes                         0


In [43]:
NHANES_2013_2018_pp.head(20)

Unnamed: 0,Participant_id,Gender,Race,Education level,Num family members,Annual family income,Age ranges,Total sugars (g),Cholesterol (mg),Moisture (g),...,Num ready-to-eat foods,Num frozen meals,Frequency of marijuana use,Frequency of cocaine use,Frequency of methamphetamine use,Hours/day sedentary activity,Num cigs/day,Diet type,Num days/week physical activity,BMI
0,93705.0,F,African American,9-11th grade,1.0,"$10,000 to \$14,999",60-69,67.3,208.0,1873.8,...,0.0,0.0,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,5.0,0.0,Unbalanced,0.4,Obesity
1,93708.0,F,Other / Multi-Racial,Less than 9th grade,2.0,"$25,000 to \$34,999",60-69,46.1,98.5,2378.2,...,4.0,0.0,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,2.0,0.0,Unbalanced,1.0,Normal
3,93711.0,M,Other / Multi-Racial,College or above,3.0,"$100,000 and Over",50-59,156.0,411.5,3938.4,...,0.0,0.0,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,7.0,0.0,Mediterranean,2.2,Normal
4,93713.0,M,White,High school - GED,1.0,"$25,000 to \$34,999",60-69,174.5,166.5,2229.4,...,8.0,3.0,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,2.0,15.0,Mediterranean,1.2,Normal
5,93714.0,F,African American,College - AA,3.0,"$35,000 to \$44,999",50-59,59.0,877.5,1754.2,...,3.0,8.0,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,6.0,0.0,Unbalanced,0.0,Obesity
7,93716.0,M,Other / Multi-Racial,College or above,3.0,"$100,000 and Over",60-69,132.7,410.0,2428.1,...,0.0,0.0,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,5.0,0.0,DASH,0.6,Obesity
8,93717.0,M,White,High school - GED,1.0,"$15,000 to \$19,999",18-29,133.1,335.5,2816.8,...,0.0,0.0,3-6 times per week,>100 times,Never/Not disclosed,5.0,20.0,Mediterranean,1.4,Normal
10,93721.0,F,Mexican American,Less than 9th grade,2.0,"$45,000 to \$54,999",60-69,72.2,510.0,2441.6,...,0.0,0.0,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,2.0,0.0,Unbalanced,0.0,Obesity
11,93722.0,F,White,College - AA,1.0,"$25,000 to \$34,999",60-69,87.4,174.0,2827.3,...,5.0,0.0,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,4.0,0.0,USDA Balanced,0.0,Normal
12,93723.0,M,White,College or above,2.0,"$55,000 to \$64,999",60-69,148.5,273.0,2158.9,...,1.0,3.0,Never/Not disclosed,Never/Not disclosed,Never/Not disclosed,2.0,0.0,DASH,1.4,Normal


In [44]:
NHANES_2013_2018_pp.columns

Index(['Participant_id', 'Gender', 'Race', 'Education level',
       'Num family members', 'Annual family income', 'Age ranges',
       'Total sugars (g)', 'Cholesterol (mg)', 'Moisture (g)', 'Water (g)',
       'Shellfish past 30 days Y-N', 'Fish past 30 days Y-N',
       'On special diet Y-N', 'DS taken Y-N', 'Num DS taken daily',
       'Alcohol drinks/day', '4+ drinks every day Y-N', 'Have HBP', 'Have HCL',
       '\$ spent supermarket/grocery store',
       '\$ spent for food at other stores', '\$ spent eating out',
       '\$ spent carryout/delivered foods', 'Have diabetes',
       'Num meals not home prepared', 'Num meals from fast food',
       'Num ready-to-eat foods', 'Num frozen meals',
       'Frequency of marijuana use', 'Frequency of cocaine use',
       'Frequency of methamphetamine use', 'Hours/day sedentary activity',
       'Num cigs/day', 'Diet type', 'Num days/week physical activity', 'BMI'],
      dtype='object')

In [45]:
NHANES_2013_2018_pp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10895 entries, 0 to 5503
Data columns (total 37 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Participant_id                      10895 non-null  float64
 1   Gender                              10895 non-null  object 
 2   Race                                10895 non-null  object 
 3   Education level                     10895 non-null  object 
 4   Num family members                  10895 non-null  float64
 5   Annual family income                10895 non-null  object 
 6   Age ranges                          10895 non-null  object 
 7   Total sugars (g)                    10895 non-null  float64
 8   Cholesterol (mg)                    10895 non-null  float64
 9   Moisture (g)                        10895 non-null  float64
 10  Water (g)                           10895 non-null  float64
 11  Shellfish past 30 days Y-N          10895 

### One-hot encoding categorical columns

In [46]:
# Gender
NHANES_2013_2018_pp = pd.concat([NHANES_2013_2018_pp, 
                      pd.get_dummies(NHANES_2013_2018_pp['Gender'], prefix='Gender', prefix_sep=' - ')], 
                      axis=1)

# Race, Education level, Annual family income, Age ranges, Diet type
col_encode = ['Race', 
              'Education level',
              'Annual family income',
              'Age ranges', 
              'Diet type']
for col in col_encode:
    NHANES_2013_2018_pp = pd.concat([NHANES_2013_2018_pp, 
                          pd.get_dummies(NHANES_2013_2018_pp[col])], 
                          axis=1)
    
# BMI and Frequency of drugs use
NHANES_2013_2018_pp = pd.concat([NHANES_2013_2018_pp, 
                      pd.get_dummies(NHANES_2013_2018_pp['BMI'], prefix='BMI', prefix_sep=' - ')], 
                      axis=1)

NHANES_2013_2018_pp = pd.concat([NHANES_2013_2018_pp, 
                      pd.get_dummies(NHANES_2013_2018_pp['Frequency of marijuana use'], prefix='Marijuana', prefix_sep=' - ')], 
                      axis=1)

NHANES_2013_2018_pp = pd.concat([NHANES_2013_2018_pp, 
                      pd.get_dummies(NHANES_2013_2018_pp['Frequency of cocaine use'], prefix='Cocaine', prefix_sep=' - ')], 
                      axis=1)

NHANES_2013_2018_pp = pd.concat([NHANES_2013_2018_pp, 
                      pd.get_dummies(NHANES_2013_2018_pp['Frequency of methamphetamine use'], prefix='Meth', prefix_sep=' - ')], 
                      axis=1)

drop_cols = ['Participant_id',
             'Gender',
             'Race', 
             'Education level',
             'Annual family income',
             'Age ranges', 
             'Diet type',
             'BMI',
             'Frequency of marijuana use',
             'Frequency of cocaine use',
             'Frequency of methamphetamine use',
             '7.0',
             '9.0',
             'Marijuana - Never/Not disclosed',
             'Cocaine - Never/Not disclosed',
             'Meth - Never/Not disclosed']

NHANES_2013_2018_pp.drop(drop_cols, axis=1, inplace=True)
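
For comparison, `pd.get_dummies` can encode several columns in one call when it is given the whole DataFrame and a `columns` list. In that form it removes the original categorical columns itself, so no separate `concat` and `drop` steps are needed. The sketch below uses a made-up frame. Unlike the cell above, it prefixes every encoded column, which is a choice made for this illustration only.

```python
import pandas as pd

# Toy frame standing in for NHANES_2013_2018_pp (values are invented).
df = pd.DataFrame({
    'Gender': ['F', 'M', 'F'],
    'BMI': ['Normal', 'Obesity', 'Normal'],
    'Num family members': [1.0, 2.0, 3.0],
})

# Encode both categorical columns at once; the originals are dropped automatically.
encoded = pd.get_dummies(
    df,
    columns=['Gender', 'BMI'],
    prefix={'Gender': 'Gender', 'BMI': 'BMI'},
    prefix_sep=' - ',
)
print(encoded.columns.tolist())
# ['Num family members', 'Gender - F', 'Gender - M', 'BMI - Normal', 'BMI - Obesity']
```

The dtype of the indicator columns depends on the pandas version (`uint8` in older releases, `bool` in newer ones), so downstream code should not rely on one specific dtype.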

In [47]:
NHANES_2013_2018_pp.head(10)

Unnamed: 0,Num family members,Total sugars (g),Cholesterol (mg),Moisture (g),Water (g),Shellfish past 30 days Y-N,Fish past 30 days Y-N,On special diet Y-N,DS taken Y-N,Num DS taken daily,...,Cocaine - 50-99 times,Cocaine - 6-19 times,Cocaine - >100 times,Cocaine - Once,Meth - 2-5 times,Meth - 20-49 times,Meth - 50-99 times,Meth - 6-19 times,Meth - >100 times,Meth - Once
0,1.0,67.3,208.0,1873.8,1275.0,0.0,0.0,0.0,1.0,1.2,...,0,0,0,0,0,0,0,0,0,0
1,2.0,46.1,98.5,2378.2,3684.0,1.0,1.0,0.0,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0
3,3.0,156.0,411.5,3938.4,2427.0,0.0,0.0,1.0,1.0,1.1,...,0,0,0,0,0,0,0,0,0,0
4,1.0,174.5,166.5,2229.4,0.0,1.0,0.0,0.0,0.0,2.0,...,0,0,0,0,0,0,0,0,0,0
5,3.0,59.0,877.5,1754.2,480.0,0.0,1.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
7,3.0,132.7,410.0,2428.1,1350.6,1.0,1.0,0.0,1.0,1.7,...,0,0,0,0,0,0,0,0,0,0
8,1.0,133.1,335.5,2816.8,2850.0,1.0,1.0,0.0,0.0,0.0,...,0,0,1,0,0,0,0,0,0,0
10,2.0,72.2,510.0,2441.6,1467.0,1.0,1.0,0.0,1.0,1.0,...,0,0,0,0,0,0,0,0,0,0
11,1.0,87.4,174.0,2827.3,3735.0,0.0,1.0,0.0,1.0,1.5,...,0,0,0,0,0,0,0,0,0,0
12,2.0,148.5,273.0,2158.9,0.0,1.0,1.0,0.0,1.0,2.2,...,0,0,0,0,0,0,0,0,0,0


In [48]:
col_list = NHANES_2013_2018_pp.columns.values.tolist()

In [49]:
col_list

['Num family members',
 'Total sugars (g)',
 'Cholesterol (mg)',
 'Moisture (g)',
 'Water (g)',
 'Shellfish past 30 days Y-N',
 'Fish past 30 days Y-N',
 'On special diet Y-N',
 'DS taken Y-N',
 'Num DS taken daily',
 'Alcohol drinks/day',
 '4+ drinks every day Y-N',
 'Have HBP',
 'Have HCL',
 '\\$ spent supermarket/grocery store',
 '\\$ spent for food at other stores',
 '\\$ spent eating out',
 '\\$ spent carryout/delivered foods',
 'Have diabetes',
 'Num meals not home prepared',
 'Num meals from fast food',
 'Num ready-to-eat foods',
 'Num frozen meals',
 'Hours/day sedentary activity',
 'Num cigs/day',
 'Num days/week physical activity',
 'Gender - F',
 'Gender - M',
 'African American',
 'Mexican American',
 'Other / Multi-Racial',
 'Other Hispanic',
 'White',
 '9-11th grade',
 'College - AA',
 'College or above',
 'High school - GED',
 'Less than 9th grade',
 '$0 to \\$4,999',
 '$10,000 to \\$14,999',
 '$100,000 and Over',
 '$15,000 to \\$19,999',
 '$20,000 and Over',
 '$20,000 t

In [51]:
# The columns are re-ordered in a more logical way to make the data analysis easier

col_order = ['Have HBP',                               
'Have HCL',    
'Have diabetes',  
'Gender - F',                             
'Gender - M',         
'Mexican American',                       
'African American',                     
'White',                     
'Other Hispanic',                         
'Other / Multi-Racial',    
'Less than 9th grade',                                                
'9-11th grade', 
'High school - GED',                                                            
'College - AA',                           
'College or above',                                                   
'18-29',                              
'30-39',                              
'40-49',                              
'50-59',                              
'60-69',                              
'70-79',                              
'80+',   
'Num family members',
'Under \\$20,000',
'$20,000 and Over',
'$0 to \\$4,999',
'$5,000 to \\$9,999',
'$10,000 to \\$14,999',
'$15,000 to \\$19,999',
'$20,000 to \\$24,999',
'$25,000 to \\$34,999',
'$35,000 to \\$44,999',
'$45,000 to \\$54,999',
'$55,000 to \\$64,999',
'$65,000 to \\$74,999',
'$75,000 to \\$99,999',
'$100,000 and Over',
'BMI - Underweight',  
'BMI - Normal',
'BMI - Overweight',
'BMI - Obesity',
'DASH',                                   
'Mediterranean',                          
'Paleo',                                  
'USDA Balanced',                          
'Unbalanced',  
'On special diet Y-N',         
'Total sugars (g)',                       
'Cholesterol (mg)',                       
'Moisture (g)',                           
'Water (g)',                              
'Shellfish past 30 days Y-N',             
'Fish past 30 days Y-N',    
'Num days/week physical activity', 
'Hours/day sedentary activity',                                
'DS taken Y-N',                           
'Num DS taken daily',                                                                            
'\$ spent supermarket/grocery store',     
'\$ spent for food at other stores',      
'\$ spent eating out',                    
'\$ spent carryout/delivered foods',                             
'Num meals not home prepared',            
'Num meals from fast food',               
'Num ready-to-eat foods',                 
'Num frozen meals',                                
'Num cigs/day',    
'Alcohol drinks/day',                     
'4+ drinks every day Y-N',                                    
'Marijuana - Once per month',  
'Marijuana - 2-3 times per month',                      
'Marijuana - 1-2 times per week',                 
'Marijuana - 3-6 times per week',                      
'Marijuana - one or more times per day',            
'Cocaine - Once',
'Cocaine - 2-5 times',   
'Cocaine - 6-19 times',                 
'Cocaine - 20-49 times',                  
'Cocaine - 50-99 times',                               
'Cocaine - >100 times',                                
'Meth - Once',                         
'Meth - 2-5 times',    
'Meth - 6-19 times',                   
'Meth - 20-49 times',                     
'Meth - 50-99 times',                                           
'Meth - >100 times']

NHANES_2013_2018_pp = NHANES_2013_2018_pp[col_order]
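
Selecting with a hand-written list has two failure modes: a misspelled name raises a `KeyError`, and any column that is not in the list is dropped without notice. The sketch below shows one way to check for both before reordering. `reorder_columns` is a hypothetical helper written for this illustration, and the toy frame stands in for the real one.

```python
import pandas as pd


def reorder_columns(df, col_order):
    """Return df with columns in col_order, refusing to lose or invent columns."""
    not_in_frame = [c for c in col_order if c not in df.columns]
    would_be_dropped = [c for c in df.columns if c not in col_order]
    if not_in_frame or would_be_dropped:
        raise ValueError(
            f'not in frame: {not_in_frame}; would be dropped: {would_be_dropped}'
        )
    return df[col_order]


# Toy frame with columns in an arbitrary order.
df = pd.DataFrame({'b': [1], 'a': [2], 'c': [3]})
ordered = reorder_columns(df, ['a', 'b', 'c'])
print(ordered.columns.tolist())
# ['a', 'b', 'c']
```

If dropping some columns is intentional, the second check can be relaxed. The point is only to make the decision explicit.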

In [52]:
NHANES_2013_2018_pp.head(10)

Unnamed: 0,Have HBP,Have HCL,Have diabetes,Gender - F,Gender - M,Mexican American,African American,White,Other Hispanic,Other / Multi-Racial,...,Cocaine - 6-19 times,Cocaine - 20-49 times,Cocaine - 50-99 times,Cocaine - >100 times,Meth - Once,Meth - 2-5 times,Meth - 6-19 times,Meth - 20-49 times,Meth - 50-99 times,Meth - >100 times
0,1.0,0.0,0.0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1.0,1.0,0.0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0.0,0.0,0.0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,0.0,0.0,0.0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0.0,0.0,1.0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0.0,0.0,0.0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
8,0.0,0.0,0.0,0,1,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
10,0.0,0.0,0.0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11,0.0,0.0,0.0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
12,0.0,0.0,0.0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [53]:
NHANES_2013_2018_pp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10895 entries, 0 to 5503
Data columns (total 85 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Have HBP                               10895 non-null  float64
 1   Have HCL                               10895 non-null  float64
 2   Have diabetes                          10895 non-null  float64
 3   Gender - F                             10895 non-null  uint8  
 4   Gender - M                             10895 non-null  uint8  
 5   Mexican American                       10895 non-null  uint8  
 6   African American                       10895 non-null  uint8  
 7   White                                  10895 non-null  uint8  
 8   Other Hispanic                         10895 non-null  uint8  
 9   Other / Multi-Racial                   10895 non-null  uint8  
 10  Less than 9th grade                    10895 non-null  uint8  
 11  9-1

In [54]:
# Save the preprocessed NHANES DataFrame to a .csv file
NHANES_2013_2018_pp.to_csv('FINAL_DATASETS/NHANES_2013_2018_FINAL.csv')
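
Note that `to_csv` writes the row index as an unnamed first column by default, which is why a file saved this way comes back with an `Unnamed: 0` column, like the ones dropped at the start of this notebook. The index here is also left over from concatenating the three survey cycles (10895 entries labelled 0 to 5503), so it carries little information. The sketch below uses an in-memory buffer and a made-up frame to show the difference that `index=False` makes. Whether to change the actual call depends on what the notebooks that read this file expect.

```python
import io

import pandas as pd

# Toy frame with a non-contiguous index, as left behind by row filtering.
df = pd.DataFrame({'Have HBP': [1.0, 0.0]}, index=[0, 3])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)
# ['Unnamed: 0', 'Have HBP']

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
# ['Have HBP']
```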