# Lab 5: Wide and Deep Networks
## by Michael Doherty, Leilani Guzman, and Carson Pittman

In January 1922, [Leonard Thompson](https://en.wikipedia.org/wiki/Leonard_Thompson_(diabetic)) was hospitalized in Toronto, Canada. Having been diagnosed with Type 1 diabetes, a death sentence at the time, the 14-year-old Thompson was staring death in the face. Fighting for his life, Thompson received the [first-ever injection of insulin](https://diabetes.org/blog/history-wonderful-thing-we-call-insulin); less than 24 hours later, Thompson's blood sugar levels were back to normal.

Thanks to the discovery and production of synthetic insulin, people with diabetes can now live relatively normal lives. Nevertheless, diabetes remains a significant health hazard, especially among adults in the United States, where [about 10% of Americans have diabetes and about 33% of Americans have prediabetes](https://www.cdc.gov/diabetes/library/spotlights/diabetes-facts-stats.html#:~:text=Key%20findings%20include%3A,t%20know%20they%20have%20it.). To help combat this, the Centers for Disease Control and Prevention (CDC) annually conducts "Behavioral Risk Factor Surveillance System" (BRFSS) telephone surveys, which collect health-related data about American adults.

Our dataset, titled "Diabetes Health Indicators Dataset", is a cleaned and consolidated subset of the BRFSS dataset from 2015 (i.e., our dataset has no missing data and only contains features that are potentially relevant to diabetes). Our task is to create a Wide and Deep Neural Network that can predict whether someone has diabetes, prediabetes, or neither. 

Link to the dataset: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

## 1. Preparation
### 1.1 Data Preprocessing and Description

To begin, we first need to read in the data.

In [11]:
import pandas as pd

df = pd.read_csv("data/diabetes_012_health_indicators_BRFSS2015.csv")

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Diabetes_012          253680 non-null  float64
 1   HighBP                253680 non-null  float64
 2   HighChol              253680 non-null  float64
 3   CholCheck             253680 non-null  float64
 4   BMI                   253680 non-null  float64
 5   Smoker                253680 non-null  float64
 6   Stroke                253680 non-null  float64
 7   HeartDiseaseorAttack  253680 non-null  float64
 8   PhysActivity          253680 non-null  float64
 9   Fruits                253680 non-null  float64
 10  Veggies               253680 non-null  float64
 11  HvyAlcoholConsump     253680 non-null  float64
 12  AnyHealthcare         253680 non-null  float64
 13  NoDocbcCost           253680 non-null  float64
 14  GenHlth               253680 non-null  float64
 15  

Unnamed: 0,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0
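
The `Non-Null Count` column above already suggests that nothing is missing. As a minimal sketch, the same check can be made explicit with `isna()`. The small `toy` frame below is a hypothetical stand-in for `df` (with one missing value inserted on purpose) so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the real data is loaded from the CSV above.
# One BMI value is deliberately set to NaN to show what the check reports.
toy = pd.DataFrame({
    "Diabetes_012": [0.0, 2.0, 1.0, 0.0],
    "BMI": [40.0, 25.0, np.nan, 27.0],
})

# Count missing values per column; all zeros would mean nothing is missing.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

Run on the real `df`, a total of 0 would agree with the `info()` output above.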


As we can see, there is no missing data in our dataset. However, our nominal categorical variables aren't one-hot encoded, so let's fix that.

In [12]:
df = pd.get_dummies(df, columns=['HighBP', 'HighChol', 'CholCheck', 'Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'DiffWalk', 'Sex'], dtype=int)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 36 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Diabetes_012              253680 non-null  float64
 1   BMI                       253680 non-null  float64
 2   GenHlth                   253680 non-null  float64
 3   MentHlth                  253680 non-null  float64
 4   PhysHlth                  253680 non-null  float64
 5   Age                       253680 non-null  float64
 6   Education                 253680 non-null  float64
 7   Income                    253680 non-null  float64
 8   HighBP_0.0                253680 non-null  int32  
 9   HighBP_1.0                253680 non-null  int32  
 10  HighChol_0.0              253680 non-null  int32  
 11  HighChol_1.0              253680 non-null  int32  
 12  CholCheck_0.0             253680 non-null  int32  
 13  CholCheck_1.0             253680 non-null  i

Unnamed: 0,Diabetes_012,BMI,GenHlth,MentHlth,PhysHlth,Age,Education,Income,HighBP_0.0,HighBP_1.0,...,HvyAlcoholConsump_0.0,HvyAlcoholConsump_1.0,AnyHealthcare_0.0,AnyHealthcare_1.0,NoDocbcCost_0.0,NoDocbcCost_1.0,DiffWalk_0.0,DiffWalk_1.0,Sex_0.0,Sex_1.0
0,0.0,40.0,5.0,18.0,15.0,9.0,4.0,3.0,0,1,...,1,0,0,1,1,0,0,1,1,0
1,0.0,25.0,3.0,0.0,0.0,7.0,6.0,1.0,1,0,...,1,0,1,0,0,1,1,0,1,0
2,0.0,28.0,5.0,30.0,30.0,9.0,4.0,8.0,0,1,...,1,0,0,1,0,1,0,1,1,0
3,0.0,27.0,2.0,0.0,0.0,11.0,3.0,6.0,0,1,...,1,0,0,1,1,0,1,0,1,0
4,0.0,24.0,2.0,3.0,0.0,11.0,5.0,4.0,0,1,...,1,0,0,1,1,0,1,0,1,0
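
To illustrate what `pd.get_dummies` does to a binary column, here is a small sketch on a hypothetical `toy` frame that mirrors `HighBP`. It also shows the `drop_first=True` option, which keeps a single indicator per binary feature. Whether the redundant `_0.0` columns matter for our network is a judgment call; this is only meant to make the encoding concrete.

```python
import pandas as pd

# Hypothetical toy frame with one binary column, mirroring HighBP.
toy = pd.DataFrame({"HighBP": [1.0, 0.0, 1.0]})

# Default behaviour: one indicator column per observed value.
full = pd.get_dummies(toy, columns=["HighBP"], dtype=int)

# drop_first=True drops the first level, leaving one indicator per feature.
compact = pd.get_dummies(toy, columns=["HighBP"], dtype=int, drop_first=True)

print(list(full.columns))
print(list(compact.columns))
```

In the default encoding the two indicator columns always sum to 1, so each one fully determines the other.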


It's important to note that we did not one-hot encode the <code>GenHlth</code>, <code>Age</code>, <code>Education</code>, or <code>Income</code> variables, as these categorical variables are ordinal (and thus, we want to capture the inherent ordering of the data). Now that we've one-hot encoded our nominal data, we need to normalize our numeric data.

In [13]:
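from sklearn import preprocessing

# Note: normalize() defaults to axis=1, so each row (person) is rescaled to
# unit L2 norm across these three columns; columns are not scaled independently.
df[['BMI', 'MentHlth', 'PhysHlth']] = preprocessing.normalize(df[['BMI', 'MentHlth', 'PhysHlth']])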

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 36 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Diabetes_012              253680 non-null  float64
 1   BMI                       253680 non-null  float64
 2   GenHlth                   253680 non-null  float64
 3   MentHlth                  253680 non-null  float64
 4   PhysHlth                  253680 non-null  float64
 5   Age                       253680 non-null  float64
 6   Education                 253680 non-null  float64
 7   Income                    253680 non-null  float64
 8   HighBP_0.0                253680 non-null  int32  
 9   HighBP_1.0                253680 non-null  int32  
 10  HighChol_0.0              253680 non-null  int32  
 11  HighChol_1.0              253680 non-null  int32  
 12  CholCheck_0.0             253680 non-null  int32  
 13  CholCheck_1.0             253680 non-null  i

Unnamed: 0,Diabetes_012,BMI,GenHlth,MentHlth,PhysHlth,Age,Education,Income,HighBP_0.0,HighBP_1.0,...,HvyAlcoholConsump_0.0,HvyAlcoholConsump_1.0,AnyHealthcare_0.0,AnyHealthcare_1.0,NoDocbcCost_0.0,NoDocbcCost_1.0,DiffWalk_0.0,DiffWalk_1.0,Sex_0.0,Sex_1.0
0,0.0,0.862863,5.0,0.388288,0.323574,9.0,4.0,3.0,0,1,...,1,0,0,1,1,0,0,1,1,0
1,0.0,1.0,3.0,0.0,0.0,7.0,6.0,1.0,1,0,...,1,0,1,0,0,1,1,0,1,0
2,0.0,0.550823,5.0,0.590167,0.590167,9.0,4.0,8.0,0,1,...,1,0,0,1,0,1,0,1,1,0
3,0.0,1.0,2.0,0.0,0.0,11.0,3.0,6.0,0,1,...,1,0,0,1,1,0,1,0,1,0
4,0.0,0.992278,2.0,0.124035,0.0,11.0,5.0,4.0,0,1,...,1,0,0,1,1,0,1,0,1,0
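
One thing worth noting about the output above: `preprocessing.normalize` rescales each row to unit L2 norm, which is why rows whose `MentHlth` and `PhysHlth` are both 0 end up with a `BMI` of exactly 1.0. The sketch below reproduces this on the first three rows of the earlier `df.head()` output and contrasts it with per-column scaling via `StandardScaler`. Treating per-column scaling as the intended behaviour is an assumption on our part, and `StandardScaler` is just one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Values copied from the first three rows of the df.head() output above.
toy = pd.DataFrame({
    "BMI": [40.0, 25.0, 28.0],
    "MentHlth": [18.0, 0.0, 30.0],
    "PhysHlth": [15.0, 0.0, 30.0],
})

# Row-wise: each sample is divided by its own L2 norm (axis=1 is the default).
row_wise = preprocessing.normalize(toy)

# Column-wise alternative: each feature gets zero mean and unit variance.
col_wise = preprocessing.StandardScaler().fit_transform(toy)

print(np.round(row_wise, 6))
print(np.round(col_wise, 6))
```

With row-wise normalization, a person's scaled `BMI` depends on their own `MentHlth` and `PhysHlth` values; with column-wise scaling, each feature is scaled against the same feature across all people.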


Now that we're done with pre-processing, our final dataset can be described as follows:
- <code>Diabetes_012</code>: The target variable; 0 means no diabetes, 1 means prediabetes, and 2 means diabetes.
- **Normalized Numeric Data**:
    - <code>BMI</code>: The Body Mass Index of the person.
    - <code>MentHlth</code>: Number of days during the past 30 days when the person's mental health wasn't good (e.g., lots of stress, depressive moods, etc.).
    - <code>PhysHlth</code>: Number of days during the past 30 days when the person's physical health wasn't good (e.g., was injured, sick, etc.).
- **Ordinal Data**:
    - <code>GenHlth</code>: The person's health, as rated by themselves. 1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor.
    - <code>Age</code>: The person's age category, defined as follows:
        - 1 = Age 18-24
        - 2 = Age 25-29
        - 3 = Age 30-34
        - 4 = Age 35-39
        - 5 = Age 40-44
        - 6 = Age 45-49
        - 7 = Age 50-54
        - 8 = Age 55-59
        - 9 = Age 60-64
        - 10 = Age 65-69
        - 11 = Age 70-74
        - 12 = Age 75-79
        - 13 = Age 80+
    - <code>Education</code>: The person's education level category, defined as follows:
        - 1 = Never attended school or only kindergarten
        - 2 = Grades 1 through 8 (Elementary)
        - 3 = Grades 9 through 11 (Some high school)
        - 4 = Grade 12 or GED (High school graduate)
        - 5 = College 1 year to 3 years (Some college or technical school)
        - 6 = College 4 years or more (College graduate)
    - <code>Income</code>: The person's annual household income category, defined as follows:
        - 1 = Less than \\$10,000
        - 2 = \\$10,000 - \\$14,999
        - 3 = \\$15,000 - \\$19,999
        - 4 = \\$20,000 - \\$24,999
        - 5 = \\$25,000 - \\$34,999
        - 6 = \\$35,000 - \\$49,999
        - 7 = \\$50,000 - \\$74,999
        - 8 = \\$75,000 or more


### 1.2 Cross-Product Features

Common risk factors for diabetes (according to the CDC): https://www.cdc.gov/diabetes/basics/risk-factors.html

Possible cross-product features (suggested by ChatGPT):
- HighBP and HighChol
- Smoker and HvyAlcoholConsump
- Age and BMI
- GenHlth and MentHlth
- Education and Income
- AnyHealthcare and NoDocbcCost
- HighBP and HeartDiseaseorAttack

Others:
- Age and PhysActivity
- BMI and PhysActivity
- HvyAlcoholConsump and MentHlth

### 1.3 Performance Metric
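Accuracy is a poor metric for this task because the classes are imbalanced: a model that always predicts the majority class can score a high accuracy while never identifying anyone with prediabetes or diabetes.

We are therefore leaning toward the F1 score, or more generally an F-beta score, which combines precision and recall:

- High precision means that when the model predicts someone has prediabetes or diabetes, it is usually correct.
- High recall means the model identifies most individuals who actually have prediabetes or diabetes, even if it also generates some false positives.

We think recall matters more here, since an early warning encourages a healthier lifestyle even when it turns out to be wrong. This favors an F-beta score with beta greater than 1, which weights recall more heavily than precision.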
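
To make the trade-off concrete, here is a small sketch using scikit-learn's metric functions on hypothetical labels. It scores a degenerate model that always predicts class 0; the macro averaging and `beta=2` are illustrative assumptions, not settled choices.

```python
from sklearn.metrics import accuracy_score, fbeta_score, recall_score

# Hypothetical labels: 0 = no diabetes, 1 = prediabetes, 2 = diabetes.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 2, 2]
# A degenerate model that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging gives every class equal weight, regardless of its size.
macro_recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
# beta=2 weights recall more heavily than precision.
macro_f2 = fbeta_score(y_true, y_pred, beta=2, average="macro", zero_division=0)

print(f"accuracy:     {accuracy:.3f}")
print(f"macro recall: {macro_recall:.3f}")
print(f"macro F2:     {macro_f2:.3f}")
```

The accuracy looks respectable even though the model never flags a single at-risk person, while the macro-averaged recall and F2 scores are much lower.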

### 1.4 Training and Testing Method
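Because the dataset is imbalanced, a likely choice is stratified k-fold cross-validation (scikit-learn's <code>StratifiedKFold</code>), which preserves the class proportions of the full dataset in every fold.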
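
A minimal sketch of how stratified splitting behaves, using hypothetical labels rather than our real data. The class proportions, `n_splits=5`, and `random_state=0` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 80% class 0, 10% class 1, 10% class 2.
y = np.array([0] * 80 + [1] * 10 + [2] * 10)
X = np.zeros((len(y), 1))  # placeholder features; only y drives the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Count how many samples of each class land in this test fold.
    fold_counts.append(np.bincount(y[test_idx], minlength=3))

print(np.array(fold_counts))
```

Each test fold keeps the same 80/10/10 class mix as the full label set, so the minority classes are represented in every fold.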

## 2. Modeling
### 2.1 Initial Models

### 2.2 Model Comparison

### 2.3 Wide and Deep Network vs. Standard Multi-Layer Perceptron
(We can also compare to just a "deep" network, in which case rename this section)

## 3. Embedding Weights of Our Deep Network
### 3.1 Capture Embedding Weights

### 3.2 Visualization and Explanation