# National Health and Nutrition Examination Survey

https://www.cdc.gov/nchs/nhanes/index.htm

Can I predict diabetes from diet?

Goal: predict whether a respondent is diabetic, prediabetic, or not diabetic, based on:
* Lab Data
    * Insulin
    * Glucose
    * HDL
    * Total Cholesterol
* Examination Data
    * Body Measurements
    * Blood Pressure
* Diet Interview
    * 1st and 2nd Day Nutrition
* Demographics
* Questionnaire Data
    * Alcohol
    * Smoking

Look into finding the diet of people who have diabetes?

***

# Pandas Tools

##### Count NaNs in a column
* df['Column_Name'].isnull().sum()

##### Fill all NaN values in a column with a particular value
* Demographics_df['DMDCITZN'] = Demographics_df['DMDCITZN'].fillna(2.0)

##### Combining Dataframes

* Full_Days_Nutrition_df = pd.merge(Day_1_Nutrition_df, Day_2_Nutrition_df, on="SEQN")
* Full_Days_Nutrition_df.head()
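
`pd.merge` defaults to an inner join, so a respondent missing from either day's file is dropped without warning. A minimal sketch of how to see what was lost, using made-up `SEQN` values and hypothetical column names (`kcal_day1`, `kcal_day2`) rather than the real NHANES files:

```python
import pandas as pd

# Toy stand-ins for the two nutrition files; values and column names are invented.
day_1 = pd.DataFrame({"SEQN": [1, 2, 3], "kcal_day1": [2000, 1800, 2200]})
day_2 = pd.DataFrame({"SEQN": [2, 3, 4], "kcal_day2": [1900, 2100, 2500]})

# Default (inner) join keeps only SEQN values present in both frames.
inner = pd.merge(day_1, day_2, on="SEQN")

# An outer join with indicator=True adds a _merge column saying where each row came from.
outer = pd.merge(day_1, day_2, on="SEQN", how="outer", indicator=True)

print(len(inner))                      # rows present on both days
print(outer["_merge"].value_counts())  # both / left_only / right_only
```

Comparing the row counts before and after a merge is a quick check that no respondents disappeared unexpectedly.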

##### Find rows that are all NaN

* nan = df[df.isnull().all(axis=1)].index
* nan

##### Change all '99' in a column to 1
* Demographics_df.loc[Demographics_df['DMDBORN4'] == 99, 'DMDBORN4'] = 1

##### Check for NaNs

* df.isnull().values.any()

##### Getting Dummies
* Finished_Demographics_df = pd.get_dummies(Demographics_df, columns = need_dummies, drop_first = True)

##### Rename Columns
* df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})

##### Add 'Age', then drop all 0-17 year olds
* Blood_Pressure_df = pd.merge(Blood_Pressure_df, Age_df, on="SEQN")
* Blood_Pressure_df = Blood_Pressure_df.drop(Blood_Pressure_df[Blood_Pressure_df.Age < 18].index)
* Blood_Pressure_df.head()

***

# Import Libraries

In [1]:
import pandas as pd

import warnings
warnings.filterwarnings("ignore")


***

# Target!!!

##### 5. Diabetes
   * Ages 1+
   * DIQ010 - Doctor told you have diabetes
       * 1 = Yes
       * 2 = No
       * 3 = Borderline
       * 9 = Don't know
   * DIQ160 - Ever told you have prediabetes
       * Ages 12+
       * Missing 3530
       * 1 = Yes
       * 2 = No
       * 9 = Don't know
   * DIQ170 - Ever told have health risk for diabetes
       * Ages 12+
       * Missing 3389
       * 1 = Yes
       * 2 = No
       * 9 = Don't know -> No


In [24]:
target_df = pd.read_sas(filepath_or_buffer = 'NHANES_Questionnaire_Data/Q_Diabetes.XPT')
diabetes_columns = ['SEQN', 'DIQ010', 'DIQ160', 'DIQ170']
target_df = target_df[diabetes_columns]
target_df = target_df.rename(columns={'DIQ010': 'Diabetes', 'DIQ160': 'Prediabetes', 'DIQ170': 'At Risk'})
target_df = pd.merge(master_df[['SEQN','Age_x']], target_df, on="SEQN")
target_df = target_df.drop('Age_x', axis = 1)
print ('Target Shape: ' + str(target_df.shape))

target_df.head()

Target Shape: (5735, 4)


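|   | SEQN | Diabetes | Prediabetes | At Risk |
|---|------|----------|-------------|---------|
| 0 | 83732.0 | 1.0 | NaN | NaN |
| 1 | 83733.0 | 2.0 | 2.0 | 2.0 |
| 2 | 83734.0 | 1.0 | NaN | NaN |
| 3 | 83735.0 | 2.0 | 1.0 | 1.0 |
| 4 | 83736.0 | 2.0 | 2.0 | 2.0 |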


***
# Diabetes Column

* 1 = Yes
* 2 = No
* 3 = Borderline (Prediabetes)
* 9 = Don't Know

#####  Game Plan
* 1 -> 99
* 2 -> 0
* 3 -> 10
* 9 -> 0

In [25]:
target_df.loc[target_df['Diabetes'] == 1.0, 'Diabetes'] = 99
target_df.loc[target_df['Diabetes'] == 2.0, 'Diabetes'] = 0
target_df.loc[target_df['Diabetes'] == 3.0, 'Diabetes'] = 10
target_df.loc[target_df['Diabetes'] == 9.0, 'Diabetes'] = 0

In [26]:
target_df.head()

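|   | SEQN | Diabetes | Prediabetes | At Risk |
|---|------|----------|-------------|---------|
| 0 | 83732.0 | 99.0 | NaN | NaN |
| 1 | 83733.0 | 0.0 | 2.0 | 2.0 |
| 2 | 83734.0 | 99.0 | NaN | NaN |
| 3 | 83735.0 | 0.0 | 1.0 | 1.0 |
| 4 | 83736.0 | 0.0 | 2.0 | 2.0 |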


In [27]:
target_df['Diabetes'].value_counts()

0.0     4800
99.0     809
10.0     126
Name: Diabetes, dtype: int64

***
# Prediabetes Column

* 1 = Yes
* 2 = No
* 9 = Don't know

#####  Game Plan
* 9 -> 0
* NaN -> 0
* 2 -> 0
* 1 -> 5

In [28]:
target_df['Prediabetes'] = target_df['Prediabetes'].fillna(0)
target_df.loc[target_df['Prediabetes'] == 9, 'Prediabetes'] = 0
target_df.loc[target_df['Prediabetes'] == 2, 'Prediabetes'] = 0
target_df.loc[target_df['Prediabetes'] == 1, 'Prediabetes'] = 5

In [29]:
target_df.head()

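|   | SEQN | Diabetes | Prediabetes | At Risk |
|---|------|----------|-------------|---------|
| 0 | 83732.0 | 99.0 | 0.0 | NaN |
| 1 | 83733.0 | 0.0 | 0.0 | 2.0 |
| 2 | 83734.0 | 99.0 | 0.0 | NaN |
| 3 | 83735.0 | 0.0 | 5.0 | 1.0 |
| 4 | 83736.0 | 0.0 | 0.0 | 2.0 |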


In [30]:
target_df['Prediabetes'].value_counts()

0.0    5257
5.0     478
Name: Prediabetes, dtype: int64

***
# At Risk Column

* 1 = Yes
* 2 = No
* 9 = Don't know

#####  Game Plan
* 9 -> 0
* NaN -> 0
* 2 -> 0
* 1 -> 3

In [31]:
target_df['At Risk'] = target_df['At Risk'].fillna(0)
target_df.loc[target_df['At Risk'] == 9, 'At Risk'] = 0
target_df.loc[target_df['At Risk'] == 2, 'At Risk'] = 0
target_df.loc[target_df['At Risk'] == 1, 'At Risk'] = 3

In [32]:
target_df['At Risk'].value_counts()

0.0    4961
3.0     774
Name: At Risk, dtype: int64

# Proxy Target

* New column that is the sum of the three recoded columns

In [33]:
target_df['Proxy Target'] = target_df['Diabetes'] + target_df['Prediabetes'] + target_df['At Risk']
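
The sum only works as an encoding if every combination of weights produces a distinct total. A quick check of that, sketched with just the weights from the game plans above (the variable names are mine):

```python
from itertools import product

# Weights from the game plans: each column is 0 or its "yes" weight
# (Diabetes can also be 10 for borderline).
diabetes_weights = [0, 10, 99]
prediabetes_weights = [0, 5]
at_risk_weights = [0, 3]

# Map every possible sum back to the combination(s) that produce it,
# in (Diabetes, Prediabetes, At Risk) order.
sums = {}
for combo in product(diabetes_weights, prediabetes_weights, at_risk_weights):
    sums.setdefault(sum(combo), []).append(combo)

# If any sum had more than one combination, the encoding would be ambiguous.
collisions = {total: combos for total, combos in sums.items() if len(combos) > 1}
print(collisions)  # an empty dict means every sum is unique
print(sums[13])    # which answers add up to 13
```

Because no two combinations share a total, each Proxy Target value can be traced back to exactly one set of answers.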

In [34]:
target_df.head()

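|   | SEQN | Diabetes | Prediabetes | At Risk | Proxy Target |
|---|------|----------|-------------|---------|--------------|
| 0 | 83732.0 | 99.0 | 0.0 | 0.0 | 99.0 |
| 1 | 83733.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 83734.0 | 99.0 | 0.0 | 0.0 | 99.0 |
| 3 | 83735.0 | 0.0 | 5.0 | 3.0 | 8.0 |
| 4 | 83736.0 | 0.0 | 0.0 | 0.0 | 0.0 |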


In [35]:
target_df['Proxy Target'].value_counts()

0.0     3804
99.0     809
3.0      518
5.0      282
8.0      196
10.0      66
13.0      60
Name: Proxy Target, dtype: int64

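| Proxy Target | Meaning | Count |
|---|---|---|
| 99 | Diabetic | 809 |
| 13 | Borderline + At Risk | 60 |
| 10 | Borderline | 66 |
| 08 | Prediabetic + At Risk | 196 |
| 05 | Prediabetic | 282 |
| 03 | At Risk | 518 |
| 00 | Not | 3804 |

| Group | Total | Share of 5735 |
|---|---|---|
| Diabetic (99) | __809__ | __14.1%__ |
| 03 through 13 | __1122__ | __19.6%__ |
| Not (00) | __3804__ | __66.3%__ |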
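
The group totals and percentages can be reproduced from the `value_counts()` output rather than worked out by hand. A minimal sketch, with the counts copied from above and a hypothetical helper `band_of` that uses the Diabetic / Prediabetic / Healthy names from the next section:

```python
import pandas as pd

# Counts copied from the Proxy Target value_counts() output.
proxy_counts = pd.Series({0: 3804, 99: 809, 3: 518, 5: 282, 8: 196, 10: 66, 13: 60})

def band_of(code):
    """Assign a proxy code to one of three bands (hypothetical helper)."""
    if code == 99:
        return "Diabetic"
    if code == 0:
        return "Healthy"
    return "Prediabetic"

# groupby with a function applies it to each index label (the proxy code).
band_counts = proxy_counts.groupby(band_of).sum()

# Share of each band as a percentage of all respondents.
band_pct = (band_counts / band_counts.sum() * 100).round(1)

print(band_counts)
print(band_pct)
```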

# Change Proxy Target into Real Target
* 1 = Diabetic
* 2 = Prediabetic
* 3 = Healthy

In [37]:
target_df.loc[target_df['Proxy Target'] == 99, 'Proxy Target'] = 1
target_df.loc[target_df['Proxy Target'] == 13, 'Proxy Target'] = 2
target_df.loc[target_df['Proxy Target'] == 10, 'Proxy Target'] = 2
target_df.loc[target_df['Proxy Target'] == 8, 'Proxy Target'] = 2
target_df.loc[target_df['Proxy Target'] == 5, 'Proxy Target'] = 2
target_df.loc[target_df['Proxy Target'] == 3, 'Proxy Target'] = 2
target_df.loc[target_df['Proxy Target'] == 0, 'Proxy Target'] = 3
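
Note that the chain of `.loc` assignments depends on order: `3 -> 2` has to run before `0 -> 3`, otherwise the freshly written 3s would be turned into 2s by the later rule. A sketch of an order-independent alternative, using a single lookup dictionary with `Series.map` on a toy Series (not the real `target_df`):

```python
import pandas as pd

# Toy proxy scores covering every value seen in the value_counts() output.
proxy = pd.Series([99.0, 13.0, 10.0, 8.0, 5.0, 3.0, 0.0])

# One lookup table: 1 = Diabetic, 2 = Prediabetic, 3 = Healthy.
# Keys are floats to match the float64 column.
proxy_to_target = {99.0: 1, 13.0: 2, 10.0: 2, 8.0: 2, 5.0: 2, 3.0: 2, 0.0: 3}

# map() looks up each original value exactly once, so rule order cannot matter.
# Any value missing from the dictionary becomes NaN, which exposes unexpected codes.
target = proxy.map(proxy_to_target)

print(target.tolist())
```

Checking `target.isnull().sum()` afterwards is a cheap way to confirm that every code was covered by the dictionary.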


In [39]:
target_df = target_df.rename(columns={'Proxy Target': 'Target'})

In [41]:
target_df['Target'].value_counts()

3.0    3804
2.0    1122
1.0     809
Name: Target, dtype: int64

In [44]:
final_target_df = target_df[['SEQN','Target']]

In [46]:
final_target_df.to_csv('Target.csv')
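
By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. A small round-trip sketch, using an in-memory buffer instead of `Target.csv` and two rows shaped like the data above:

```python
import io
import pandas as pd

# Two rows in the same shape as final_target_df.
final_target = pd.DataFrame({"SEQN": [83732.0, 83733.0], "Target": [1.0, 3.0]})

# Default: the index is written as an extra, unnamed column.
with_index = io.StringIO()
final_target.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False writes only the real columns.
without_index = io.StringIO()
final_target.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_clean)
```

Whether to pass `index=False` depends on whether anything downstream already expects the extra column.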

# Don't forget to think about the class imbalances!!
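
One common way to account for the imbalance is to weight each class inversely to its frequency. A sketch of the `n_samples / (n_classes * class_count)` heuristic (the formula scikit-learn documents for `class_weight="balanced"`), applied to the Target counts above. Whether weighting, resampling, or something else fits this project is still an open choice, so treat this as one option rather than the plan:

```python
# Class counts from the Target value_counts() output:
# 1 = Diabetic, 2 = Prediabetic, 3 = Healthy.
class_counts = {1: 809, 2: 1122, 3: 3804}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Rarer classes get weights above 1, the majority class gets a weight below 1.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    print(label, round(weight, 3))
```

A dictionary like `class_weights` can be passed to many classifiers that accept per-class weights.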