# Intro

We will be working with data from [NHANES](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2017), the National Health and Nutrition Examination Survey, conducted by the National Center for Health Statistics (USA).

The survey participants are followed for 2 years and are also asked demographic, dietary, and lifestyle-related questions.

For each survey (one is conducted every two years), we find data about:
- demographics
- diet
- examination
- laboratory
- questionnaire (mainly lifestyle related)

With this data we will try to predict a few diseases, such as heart disease or asthma.

As the data is quite noisy, full of NaNs, etc., I pooled the surveys from several two-year cycles to have more usable rows:
- 2013-2014
- 2015-2016
- 2017-2018

This should not be a problem, as the data is not a time series. The survey methods can be consulted on the website.
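
Pooling the cycles amounts to stacking the per-cycle tables row-wise. Below is a minimal sketch with pandas, assuming each cycle is already loaded as a DataFrame with a `SEQN` participant id. The toy frames, the values outside 2015-2016, and the `cycle` column are made up for illustration; this is not necessarily what the project's loader does internally.

```python
import pandas as pd

# Toy stand-ins for the per-cycle tables (illustrative values only).
cycles = {
    "2013-2014": pd.DataFrame({"SEQN": [73557, 73558], "RIDAGEYR": [69, 54]}),
    "2015-2016": pd.DataFrame({"SEQN": [83732, 83733], "RIDAGEYR": [62, 53]}),
    "2017-2018": pd.DataFrame({"SEQN": [93703, 93704], "RIDAGEYR": [2, 2]}),
}

# Stack the cycles row-wise and tag each row with the cycle it came from.
pooled = pd.concat(
    [df.assign(cycle=name) for name, df in cycles.items()],
    ignore_index=True,
).set_index("SEQN")

print(pooled.shape)
```

Keeping a `cycle` column makes it possible to check later whether a variable was measured differently from one cycle to the next.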

In [1]:
import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold
from sklearn import metrics
from sklearn.preprocessing import StandardScaler

from imblearn.over_sampling import SMOTE

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, callbacks

import os, sys

# Relative paths
dirname = os.path.dirname
sep = os.sep

ml_folder = dirname(os.getcwd())
sys.path.append(ml_folder)

from src.utils import mining_data_tb as md
from src.utils import visualization_tb as vi

import warnings

warnings.filterwarnings("ignore")

## Data exploration

In [2]:
# As the data variables are coded (for instance, "RIAGENDR" is Gender), we first need to load the variable descriptions. For that, we will create an object with all the info and methods to change names whenever necessary

# 1) We create the object
vardata = md.variables_data()

# 2) We load the info
vardata_path = "data" + sep + "6_variables" + sep + "0_final_variables.csv"
vardata.load_data(2, vardata_path)
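
To illustrate the idea behind the variables object, here is a minimal sketch of translating coded column names into readable ones with plain pandas. The `var_names` dictionary is hypothetical: only the "RIAGENDR" to "Gender" pair comes from the comment above, and the age label is my assumption. The actual `md.variables_data` class may work differently.

```python
import pandas as pd

# Hypothetical lookup: NHANES code -> readable name.
# "RIAGENDR" -> "Gender" is from the notebook; the age label is an assumption.
var_names = {"RIAGENDR": "Gender", "RIDAGEYR": "Age (years)"}

df = pd.DataFrame(
    {"RIAGENDR": [1, 1], "RIDAGEYR": [62, 53]},
    index=pd.Index([83732, 83733], name="SEQN"),
)

# rename returns a new frame; codes missing from the lookup stay unchanged.
readable = df.rename(columns=var_names)
print(list(readable.columns))
```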

In [8]:
# Now we can load the actual dataset we will be using for the ml models

# 1) Create object
dataset = md.dataset()

# 2) Load data
folders = ["1_demographics", "2_dietary", "3_examination", "4_laboratory", "5_questionnaire"]
dataset.load_data(0, folders)

# 3) Clean duplicated columns (already have them detected)
columns_correction = {
            "WTDRD1_x" : "WTDRD1",
            "WTDR2D_x" : "WTDR2D",
            "DRABF_x" : "DRABF",
            "DRDINT_x" : "DRDINT",
            "WTSAF2YR_x" : "WTSAF2YR",
            "LBXHCT_x" : "LBXHCT"
        }
dataset.clean_columns(columns_correction)

# 4) Add heart_disease -> creates a new column with a 1 if the participant has any cardiovascular disease (angina, stroke, etc.) and a 0 otherwise
dataset.heart_disease()
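
The logic described in step 4 can be sketched as follows. This is only an illustration of the "1 if any condition, 0 otherwise" rule, not the actual `heart_disease` method: the column names `angina` and `stroke` and the coding (1 = yes, 2 = no) are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical condition columns (names and coding are assumptions):
# 1 = "yes", 2 = "no", NaN = not answered.
df = pd.DataFrame({
    "angina": [2, 1, np.nan, 2],
    "stroke": [2, 2, 1, np.nan],
})

condition_cols = ["angina", "stroke"]

# 1 if any condition column equals 1, and 0 otherwise.
# NaN never equals 1, so unanswered questions count as 0 in this sketch.
df["heart_disease"] = df[condition_cols].eq(1).any(axis=1).astype(int)

print(df["heart_disease"].tolist())  # [0, 1, 1, 0]
```

Treating a missing answer as 0 is a design choice worth keeping in mind, since it mixes "no disease" with "unknown".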

In [11]:
# With this, we can do a first exploration. For that, we can access the df through the object attribute "df"

print(dataset.df.shape, "\n")
dataset.df.head(2)

(29285, 951) 



SEQN,SDDSRVYR,RIDSTATR,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDRETH1,RIDRETH3,RIDEXMON,RIDEXAGM,DMQMILIZ,...,SMQ690F,SMQ830,SMQ840,SMDANY,SMAQUEX,SMQ690I,SMQ857,SMQ690J,SMQ861,heart_disease
83732,9,2,1,62,,3,3,1.0,,2.0,...,,,,2.0,2.0,,,,,0
83733,9,2,1,53,,3,3,1.0,,2.0,...,,,,1.0,2.0,,,,,0


In [12]:
# As we can see, it is quite wide. Let's filter by the columns we will actually be using (using our magnificent object)
features = ["MCQ010", "RIAGENDR", "RIDAGEYR", "DR1TCHOL", "DR1TTFAT", "DR1TSFAT", "DR1TSUGR", "DR2TCHOL", "DR2TTFAT", "DR2TSFAT", "DR2TSUGR", "BPXDI1", "BPXSY1", "BMXWT", "DXDTOPF", "BMXWAIST", "LBXTR", "LBXTC", "LBXSGL"]

dataset.filter_columns(features)

In [15]:
print(dataset.df.shape)

dataset.df.info()

(29285, 19)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 29285 entries, 83732 to 102956
Data columns (total 19 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   MCQ010    28126 non-null  float64
 1   RIAGENDR  29285 non-null  int64  
 2   RIDAGEYR  29285 non-null  int64  
 3   DR1TCHOL  24248 non-null  float64
 4   DR1TTFAT  24248 non-null  float64
 5   DR1TSFAT  24248 non-null  float64
 6   DR1TSUGR  24248 non-null  float64
 7   DR2TCHOL  20754 non-null  float64
 8   DR2TTFAT  20754 non-null  float64
 9   DR2TSFAT  20754 non-null  float64
 10  DR2TSUGR  20754 non-null  float64
 11  BPXDI1    20522 non-null  float64
 12  BPXSY1    20522 non-null  float64
 13  BMXWT     27642 non-null  float64
 14  DXDTOPF   13290 non-null  float64
 15  BMXWAIST  24476 non-null  float64
 16  LBXTR     8654 non-null   float64
 17  LBXTC     21517 non-null  float64
 18  LBXSGL    18610 non-null  float64
dtypes: float64(17), int64(2)
memory usage: 4.5 MB
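
The non-null counts above are easier to compare as a share of missing values per column. A small sketch using a few of the counts printed by `info()`; the variable names `non_null` and `missing_share` are mine.

```python
import pandas as pd

n_rows = 29285  # total rows, from the shape printed above

# Non-null counts copied from the info() output for a few columns.
non_null = pd.Series({
    "MCQ010": 28126,
    "RIAGENDR": 29285,
    "BPXSY1": 20522,
    "DXDTOPF": 13290,
    "LBXTR": 8654,
})

# Share of missing values per column, worst first.
missing_share = (1 - non_null / n_rows).sort_values(ascending=False)
print(missing_share.round(3))

# On the real frame, dataset.df.isna().mean() should give the same shares.
```

Columns such as `LBXTR` and `DXDTOPF` are missing for more than half of the participants, which matters when choosing how to impute or drop rows.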
