## Lab 5, Group 1
### Names: Hailey DeMark, Deborah Park, Karis Park
### Student IDs: 48869449, 48878679, 48563429

Link to DataSet: (link)

## Preparation (4 points total)
* [1 points] Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created). You have the option of using tf.dataset for processing, but it is not required. 


* [1 points] Identify groups of features in your data that should be combined into cross-product features. Provide a compelling justification for why these features should be crossed (or why some features should not be crossed).

* [1 points] Choose and explain what metric(s) you will use to evaluate your algorithm’s performance. You should give a detailed argument for why this (these) metric(s) are appropriate on your data. That is, why is the metric appropriate for the task (e.g., in terms of the business case for the task). Please note: rarely is accuracy the best evaluation metric to use. Think deeply about an appropriate measure of performance.

* [1 points] Choose the method you will use for dividing your data into training and testing (i.e., are you using Stratified 10-fold cross validation? Shuffle splits? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. Argue why your cross validation method is a realistic mirroring of how an algorithm would be used in practice. Use the method to split your data that you argue for. 
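As a hedged illustration of the metric bullet above, the sketch below shows how accuracy can look strong on an imbalanced binary label while recall exposes a useless model. The labels and predictions are made-up toy arrays, not values from the diabetes data, and the choice of precision, recall, and F1 is an assumption about what might matter for a screening task rather than the metric this lab settles on.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels (hypothetical): 90 negatives followed by 10 positives
y_true = np.array([0] * 90 + [1] * 10)

# A trivial "model" that always predicts the majority class
y_majority = np.zeros_like(y_true)

# A hypothetical model that finds 7 of the 10 positives and raises 5 false alarms
y_model = y_true.copy()
y_model[90:93] = 0   # 3 missed positives
y_model[:5] = 1      # 5 false positives

def summarize(name, y_pred):
    # zero_division=0 avoids warnings when no positives are predicted
    scores = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }
    print(name, {k: round(float(v), 3) for k, v in scores.items()})
    return scores

majority_scores = summarize("always-negative", y_majority)
model_scores = summarize("toy model", y_model)
```

The always-negative predictor reaches high accuracy while its recall is zero, which is the failure mode the rubric warns about when it says accuracy is rarely the best metric.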

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import copy

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.utils import FeatureSpace
from tensorflow.keras.layers import Embedding, Flatten, Dense, Input, Concatenate
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
# load dataset
df = pd.read_csv('diabetes_binary_health_indicators_BRFSS2015.csv')
df

In [None]:
numeric_cols = ["BMI", "GenHlth", "MentHlth", "PhysHlth", "Age", "Education", "Income"]
outcome = "Diabetes_binary"
categorical_cols = [ col for col in df.columns if col != outcome and col not in numeric_cols ]

df[categorical_cols] = df[categorical_cols].astype(int).astype(str)
df[outcome] = df[outcome].astype(int)

df

We use a single stratified 80/20 train/test split because the dataset is large; we can switch to 5- or 10-fold cross-validation if required.

In [None]:
X = copy.deepcopy(df)
y = X.pop(outcome).values

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state=1234, stratify = y)
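If the single split above is replaced by k-fold cross-validation, one possible sketch looks like the following. It uses scikit-learn's `StratifiedKFold` on a small synthetic label vector (hypothetical, standing in for `y`) and checks that each fold preserves the positive rate. The 5 folds and the 20% positive rate are illustrative assumptions, not properties of the real data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real features and labels (hypothetical data)
rng = np.random.default_rng(1234)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)  # 20% positives

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)

fold_rates = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_demo, y_demo)):
    # each validation fold should keep roughly the overall positive rate
    rate = y_demo[val_idx].mean()
    fold_rates.append(rate)
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}, positive rate={rate:.3f}")
```

With the real data, the indices from `skf.split` would select rows of `X` and `y`, and a fresh model would be trained on each fold.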


Cross features:
- `HighBP` and `HighChol`
- `Stroke` and `HeartDiseaseorAttack`
- `Smoker` and `HvyAlcoholConsump`
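To make the idea of a crossed feature concrete, here is a small sketch that builds the `HighBP` and `HighChol` cross by hand in pandas. The four-row frame is toy data, and the `HighBP_x_HighChol` column name is hypothetical. The Keras `FeatureSpace.cross` used later hashes the combination into bins instead of joining strings, so this only illustrates the concept.

```python
import pandas as pd

# Toy frame with every combination of two binary indicators (hypothetical values)
toy = pd.DataFrame({
    "HighBP":   ["0", "0", "1", "1"],
    "HighChol": ["0", "1", "0", "1"],
})

# A cross-product feature treats each joint combination as its own category
toy["HighBP_x_HighChol"] = toy["HighBP"] + "_" + toy["HighChol"]

# Two binary features give 2 * 2 = 4 possible crossed categories
n_categories = toy["HighBP_x_HighChol"].nunique()
print(toy)
print("crossed categories:", n_categories)
```

The count of four joint categories is where the `crossing_dim=2*2` setting in the feature space comes from.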


## Modeling (5 points total)
* [2 points] Create at least three combined wide and deep networks to classify your data using Keras (this total of "three" includes the model you will train in the next step of the rubric). Visualize the performance of the network on the training data and validation data in the same plot versus the training iterations.
    * Note: you can use the "history" return parameter that is part of Keras "fit" function to easily access this data.



In [None]:
# build a tf.data.Dataset that includes both numeric and categorical variables,
# for ease of use later
batch_size = 64

def create_dataset_from_dataframe(X, y):

    # map each numeric and categorical column name to an (n, 1) array of its values
    feature_dict = {key: value.values[:,np.newaxis] for key, value in X[numeric_cols+categorical_cols].items()}

    # create the Dataset here
    ds = tf.data.Dataset.from_tensor_slices((feature_dict, y))
    
    # now enable batching and prefetching
    ds = ds.batch(batch_size)
    ds = ds.prefetch(batch_size)
    
    return ds

ds_train = create_dataset_from_dataframe(X_train, y_train)
ds_test = create_dataset_from_dataframe(X_test, y_test)

In [None]:
features = {}
for col in categorical_cols:
  features[col] = FeatureSpace.string_categorical(num_oov_indices=0)
for col in numeric_cols:
  features[col] = FeatureSpace.float_normalized()

feature_space = FeatureSpace(
    features=features,
    crosses=[
        FeatureSpace.cross(
          feature_names=('HighBP', 'HighChol'),
          crossing_dim=2*2
        ),
        FeatureSpace.cross(
          feature_names=('Stroke', 'HeartDiseaseorAttack'),
          crossing_dim=2*2
        ),
        FeatureSpace.cross(
          feature_names=('Smoker', 'HvyAlcoholConsump'),
          crossing_dim=2*2
        ),
    ],
    output_mode="concat",
)


feature_space.adapt(ds_train.map(lambda x, _: x))

In [None]:
def setup_embedding_from_categorical(feature_space, col_name):
    # N is the vocabulary size, i.e. the number of categories for this variable
    N = len(feature_space.preprocessors[col_name].get_vocabulary())
    
    # get the output from the feature space, which is input to embedding
    x = feature_space.preprocessors[col_name].output
    
    # now use an embedding to deal with integers from feature space
    # (input_length is omitted: it is deprecated in recent Keras and inferred from the input)
    x = Embedding(input_dim=N, 
                  output_dim=int(np.sqrt(N)), 
                  name=col_name+'_embed')(x)
    
    x = Flatten()(x) # get rid of that pesky extra dimension (for time of embedding)
    
    return x # return the tensor here 

def setup_embedding_from_crossing(feature_space, col_name):
    # N is the number of hash bins used for this crossed feature
    N = feature_space.crossers[col_name].num_bins
    x = feature_space.crossers[col_name].output
    
    # now use an embedding to deal with integers as if they were one hot encoded
    # (input_length is omitted: it is deprecated in recent Keras and inferred from the input)
    x = Embedding(input_dim=N, 
                  output_dim=int(np.sqrt(N)), 
                  name=col_name+'_embed')(x)
    
    x = Flatten()(x) # get rid of that pesky extra dimension (for time of embedding)
    
    return x

def plot_loss(history):
    # plot training and validation curves on the same axes, as the rubric asks
    plt.figure(figsize=(10,4))

    plt.subplot(1,2,1)
    plt.plot(history.history['accuracy'], label='training')
    plt.plot(history.history['val_accuracy'], label='validation')
    plt.ylabel('Accuracy')
    plt.xlabel('epochs')
    plt.legend()

    plt.subplot(1,2,2)
    plt.plot(history.history['loss'], label='training')
    plt.plot(history.history['val_loss'], label='validation')
    plt.ylabel('Loss')
    plt.xlabel('epochs')
    plt.legend()

    plt.tight_layout()
    plt.show()

# stop when the training loss has not improved by at least 0.05
# for 5 consecutive epochs
early_stopping = EarlyStopping(
    monitor = "loss",
    verbose = 1,
    patience = 5,
    mode = "min",
    min_delta = 0.05
)

### Model 1

In [None]:
dict_inputs = feature_space.get_inputs() # need to use unprocessed features here, to gain access to each output

# we need to create separate lists for each branch
crossed_outputs = []

# for each crossed variable, make an embedding
for col in feature_space.crossers.keys():
    
    x = setup_embedding_from_crossing(feature_space, col)
    
    # save these outputs in list to concatenate later
    crossed_outputs.append(x)
    

# now concatenate the outputs and add a fully connected layer
wide_branch = Concatenate(name='wide_concat')(crossed_outputs)

# reset this input branch
all_deep_branch_outputs = []

# for each numeric variable, add the normalized output directly (no embedding needed)
for idx,col in enumerate(numeric_cols):
    x = feature_space.preprocessors[col].output
    #x = tf.cast(x,float) # cast an integer as a float here
    all_deep_branch_outputs.append(x)
    
# for each categorical variable
for col in categorical_cols:
    
    # get the output tensor from embedding layer
    x = setup_embedding_from_categorical(feature_space, col)
    
    # save these outputs in list to concatenate later
    all_deep_branch_outputs.append(x)


# merge the deep branches together
deep_branch = Concatenate(name='embed_concat')(all_deep_branch_outputs)
deep_branch = Dense(units=50,activation='relu', name='deep1')(deep_branch)
deep_branch = Dense(units=25,activation='relu', name='deep2')(deep_branch)
deep_branch = Dense(units=10,activation='relu', name='deep3')(deep_branch)
    
# merge the deep and wide branch
final_branch = Concatenate(name='concat_deep_wide')([deep_branch, wide_branch])
final_branch = Dense(units=1,activation='sigmoid',
                     name='combined')(final_branch)

model1 = keras.Model(inputs=dict_inputs, outputs=final_branch)
model1.compile(
    optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
)

model1.summary()

In [None]:
history1 = model1.fit(
    ds_train, epochs=100, validation_data=ds_test, callbacks=[early_stopping]
)

In [None]:
plot_loss(history1)

In [None]:
def build_model(deep_neurons=(), final_neurons=()):
  dict_inputs = feature_space.get_inputs() # need to use unprocessed features here, to gain access to each output

  # we need to create separate lists for each branch
  crossed_outputs = []

  # for each crossed variable, make an embedding
  for col in feature_space.crossers.keys():
      
      x = setup_embedding_from_crossing(feature_space, col)
      
      # save these outputs in list to concatenate later
      crossed_outputs.append(x)
      

  # now concatenate the outputs and add a fully connected layer
  wide_branch = Concatenate(name='wide_concat')(crossed_outputs)

  # reset this input branch
  all_deep_branch_outputs = []

  # for each numeric variable, add the normalized output directly (no embedding needed)
  for idx,col in enumerate(numeric_cols):
      x = feature_space.preprocessors[col].output
      #x = tf.cast(x,float) # cast an integer as a float here
      all_deep_branch_outputs.append(x)
      
  # for each categorical variable
  for col in categorical_cols:
      
      # get the output tensor from embedding layer
      x = setup_embedding_from_categorical(feature_space, col)
      
      # save these outputs in list to concatenate later
      all_deep_branch_outputs.append(x)


  # merge the deep branches together
  deep_branch = Concatenate(name='embed_concat')(all_deep_branch_outputs)
  for i, neurons in enumerate(deep_neurons):
      deep_branch = Dense(units=neurons,activation='relu', name=f'deep{i}')(deep_branch)
      
  # merge the deep and wide branch
  final_branch = Concatenate(name='concat_deep_wide')([deep_branch, wide_branch])
  for i, neurons in enumerate(final_neurons):
      final_branch = Dense(units=neurons,activation='relu', name=f'final{i}')(final_branch)
  final_branch = Dense(units=1,activation='sigmoid',
                      name='combined')(final_branch)

  model = keras.Model(inputs=dict_inputs, outputs=final_branch)
  model.compile(
      optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
  )

  model.summary()  # summary() prints the table itself and returns None

  return model

In [None]:
model1 = build_model([50, 25, 10], [])

* [2 points] Investigate generalization performance by altering the number of layers in the deep branch of the network. Try at least two models (this "two" includes the wide and deep model trained from the previous step). Use the method of cross validation and evaluation metric that you argued for at the beginning of the lab to answer: What model with what number of layers performs superiorly? Use proper statistical methods to compare the performance of different models.

* [1 points] Compare the performance of your best wide and deep network to a standard multi-layer perceptron (MLP). Alternatively, you can compare to a network without the wide branch (i.e., just the deep network). For classification tasks, compare using the receiver operating characteristic and area under the curve. For regression tasks, use Bland-Altman plots and residual variance calculations.  Use proper statistical methods to compare the performance of different models.  
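One way to compare two classifiers with ROC curves and AUC is sketched below, under assumptions: the labels and the two score vectors are synthetic stand-ins (not predictions from the networks in this lab), and the paired bootstrap confidence interval on the AUC difference is one reasonable statistical comparison among several, not necessarily the one the rubric intends.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)

# Synthetic labels and scores (hypothetical stand-ins for model outputs)
n = 2000
y_true = rng.integers(0, 2, size=n)
scores_a = y_true * 0.8 + rng.normal(scale=0.7, size=n)  # stronger signal
scores_b = y_true * 0.5 + rng.normal(scale=0.7, size=n)  # weaker signal

# ROC curve points and AUC for each model
fpr_a, tpr_a, _ = roc_curve(y_true, scores_a)
fpr_b, tpr_b, _ = roc_curve(y_true, scores_b)
auc_a = roc_auc_score(y_true, scores_a)
auc_b = roc_auc_score(y_true, scores_b)

# Paired bootstrap: resample the same test rows for both models
diffs = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)
    if len(np.unique(y_true[idx])) < 2:
        continue  # AUC is undefined without both classes
    diffs.append(roc_auc_score(y_true[idx], scores_a[idx])
                 - roc_auc_score(y_true[idx], scores_b[idx]))

ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
print(f"AUC A={auc_a:.3f}, AUC B={auc_b:.3f}")
print(f"95% bootstrap CI for AUC difference: [{ci_low:.3f}, {ci_high:.3f}]")
```

If the interval excludes zero, the difference in AUC is unlikely to be resampling noise alone. The `fpr_*` and `tpr_*` arrays can be passed to `plt.plot` to draw both curves on one set of axes.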

## Exceptional Work (1 points total)