## L4 - 1: Regression Model with TF DenseFeatures

### Instructions
- Build a regression model to predict resting blood pressure (the trestbps field in the dataset) using TensorFlow DenseFeatures.
- Include the age and sex features, and create a TF cross feature (https://www.tensorflow.org/api_docs/python/tf/feature_column/crossed_column) from them by binning the age field.
- Evaluate with common regression metrics (MSE, MAE) and classification metrics (accuracy, F1, precision and recall across classes, AUC). No ROC or PR curve is needed, because this is a regression model whose output is converted to a binary label, so it does not provide a confidence score for a given prediction.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.metrics import accuracy_score, f1_score, classification_report, roc_auc_score
#from https://archive.ics.uci.edu/ml/datasets/Heart+Disease
swiss_dataset_path = "./data/heart_disease_data/processed_swiss.csv"
swiss_df = pd.read_csv(swiss_dataset_path).replace('?', np.nan)

In [2]:
column_list = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num_label']

In [3]:
cleveland_df = pd.read_csv("./data/heart_disease_data/processed.cleveland.txt",  names=column_list).replace('?', np.nan)

In [4]:
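# ignore_index=True gives every row a unique label; otherwise both files reuse 0, 1, 2, ...
# and the later drop(training_data.index) split removes more rows than intended
combined_heart_df = pd.concat([swiss_df, cleveland_df], ignore_index=True)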

In [5]:
len(combined_heart_df)

426

In [6]:
# Review the data
combined_heart_df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num_label
0,32.0,1.0,1.0,95.0,0.0,,0.0,127,0,0.7,1.0,,,1
1,34.0,1.0,4.0,115.0,0.0,,,154,0,0.2,1.0,,,1
2,35.0,1.0,4.0,,0.0,,0.0,130,1,,,,7.0,3
3,36.0,1.0,4.0,110.0,0.0,,0.0,125,1,1.0,2.0,,6.0,1
4,38.0,0.0,4.0,105.0,0.0,,0.0,166,0,2.8,1.0,,,2


In [7]:
# It would appear that we have some na values to deal with
combined_heart_df.isna().sum()

age            0
sex            0
cp             0
trestbps       2
chol           0
fbs           75
restecg        1
thalach        1
exang          1
oldpeak        6
slope         17
ca           122
thal          54
num_label      0
dtype: int64

In [8]:
# Based on the solution, it appears that we're only training on specific features,
# so subset the columns.
combined_heart_df = combined_heart_df[['sex',  'age', 'trestbps', 'thalach']]

In [9]:
# The simplest approach is to drop rows with missing values. That would be severe on the
# full table, but after subsetting only trestbps and thalach have gaps.
print(combined_heart_df.shape)
clean_df = combined_heart_df.dropna()
print(clean_df.shape)

(426, 4)
(424, 4)


In [10]:
clean_df.dtypes

sex         float64
age         float64
trestbps     object
thalach      object
dtype: object

In [11]:
# Skipping ahead, it looks as though we're being encouraged to recast sex as a string.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
clean_df = clean_df.assign(sex=np.where(clean_df["sex"] == 1, "male", "female"))


In [12]:
clean_df.head()

Unnamed: 0,sex,age,trestbps,thalach
0,male,32.0,95,127
1,male,34.0,115,154
3,male,36.0,110,125
4,female,38.0,105,166
5,female,38.0,110,156


In [13]:
# Some of these features read as object, most likely because the raw files mark missing
# values with '?', so pandas parsed those columns as strings. Recast them to float.
# astype() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
clean_df = clean_df.astype({"trestbps": "float", "thalach": "float"})

clean_df.dtypes


sex          object
age         float64
trestbps    float64
thalach     float64
dtype: object
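To make the object-dtype issue concrete, here is a minimal sketch on a tiny in-memory CSV. The three-row data is invented for illustration; it only assumes the raw file uses '?' for missing values, as the `.replace('?', np.nan)` calls above suggest. It shows two ways to end up with float columns: coercing after the fact, or telling `read_csv` about the placeholder up front.

```python
import io

import pandas as pd

# Invented three-row sample; '?' stands in for a missing reading
raw = "age,trestbps\n32,95\n35,?\n36,110\n"

# Read as-is: the '?' forces the whole column to be parsed as strings
as_read = pd.read_csv(io.StringIO(raw))
print(pd.api.types.is_numeric_dtype(as_read["trestbps"]))  # False

# Option 1: coerce afterwards; anything unparseable becomes NaN
coerced = pd.to_numeric(as_read["trestbps"], errors="coerce")
print(int(coerced.isna().sum()))  # 1

# Option 2: declare the placeholder at read time so the column parses as float
parsed = pd.read_csv(io.StringIO(raw), na_values="?")
print(parsed["trestbps"].dtype)  # float64
```

Option 2 would replace the `.replace('?', np.nan)` plus `astype` steps, but whether it fits depends on how the real files are laid out.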

In [14]:
# Define the train and test groups
training_data = clean_df.sample(frac=0.8)
test_data = clean_df.drop(training_data.index)
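One thing to watch with this split: `drop(training_data.index)` removes rows by index label, so it only works as intended when every row has a unique label. A small sketch with two invented toy frames shows what happens when frames are concatenated with and without `ignore_index=True`:

```python
import pandas as pd

# Two toy frames; each has its own 0, 1, 2 index, like two separately loaded files
a = pd.DataFrame({"age": [32, 34, 35]})
b = pd.DataFrame({"age": [63, 67, 41]})

dup = pd.concat([a, b])                      # index labels: 0, 1, 2, 0, 1, 2
uniq = pd.concat([a, b], ignore_index=True)  # index labels: 0, 1, 2, 3, 4, 5

# Take the first two rows as "training" and drop them by label
train_dup = dup.iloc[:2]
test_dup = dup.drop(train_dup.index)     # drops EVERY row labelled 0 or 1

train_uniq = uniq.iloc[:2]
test_uniq = uniq.drop(train_uniq.index)  # drops exactly the two training rows

print(len(test_dup), len(test_uniq))  # 2 4
```

With duplicate labels the test set silently shrinks, so it is worth checking that `len(training_data) + len(test_data) == len(clean_df)` after splitting.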

In [15]:
#adapted from https://www.tensorflow.org/tutorials/structured_data/feature_columns
def df_to_dataset(df, predictor,  batch_size=32):
    df = df.copy()
    labels = df.pop(predictor)
    ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    ds = ds.shuffle(buffer_size=len(df))
    ds = ds.batch(batch_size)
    return ds
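For intuition about what shuffle-then-batch does, here is a NumPy-only sketch. It is not how tf.data is implemented, and `shuffled_batches` is a hypothetical helper name. The point is that features and labels are reordered with the same permutation, so each label stays attached to its row, while the order of rows differs from the source DataFrame.

```python
import numpy as np

def shuffled_batches(features, labels, batch_size, rng):
    """Yield (feature dict, labels) batches in one random order (illustrative only)."""
    order = rng.permutation(len(labels))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield {name: values[idx] for name, values in features.items()}, labels[idx]

rng = np.random.default_rng(0)
features = {"age": np.array([32.0, 34.0, 36.0, 38.0, 41.0])}
labels = features["age"] * 10  # synthetic label tied to its own row

batches = list(shuffled_batches(features, labels, batch_size=2, rng=rng))
print([len(y) for _, y in batches])  # [2, 2, 1]
```

Because the row order is randomized, predictions made from a shuffled dataset cannot be compared position by position with labels taken straight from the DataFrame.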

In [None]:
# Convert to tensors
PREDICTOR_FIELD = 'trestbps'
batch_size = 128
train_ds = df_to_dataset(training_data, PREDICTOR_FIELD, batch_size=batch_size)
test_ds = df_to_dataset(test_data, PREDICTOR_FIELD, batch_size=batch_size)

In [None]:
# Preprocess numerical features -- there is only one
tf_numeric_age_feature = tf.feature_column.numeric_column(key="age",
                                                          default_value=0,
                                                          dtype=tf.float64)

In [None]:
# Based on what I'm seeing, they want us to bucket age, too?
b_list = [0, 18, 25, 40, 55, 65, 80, 100]
# Create a TF bucketized feature from the numeric age feature
tf_bucket_age_feature = tf.feature_column.bucketized_column(source_column=tf_numeric_age_feature, boundaries= b_list)
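To see which bucket a given age lands in, here is a NumPy sketch of the same binning rule. It uses `np.digitize` as a stand-in rather than TensorFlow itself; both treat each bucket as including its left boundary and excluding its right one. The sample ages are made up.

```python
import numpy as np

boundaries = [0, 18, 25, 40, 55, 65, 80, 100]
ages = np.array([17, 18, 32, 40, 64.9, 65, 101])  # invented sample ages

# Bucket i holds boundaries[i-1] <= age < boundaries[i];
# bucket 0 is everything below 0 and the last bucket is everything >= 100
bucket_ids = np.digitize(ages, boundaries)
print(bucket_ids.tolist())  # [1, 2, 3, 4, 5, 6, 8]

# 8 boundaries produce 9 buckets, so the one-hot encoding is 9 wide
n_buckets = len(boundaries) + 1
one_hot = np.eye(n_buckets)[bucket_ids]
print(one_hot.shape)  # (7, 9)
```

Note that 18 and 65 fall into the bucket that starts at that boundary, not the one that ends there.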

In [None]:
# And then one-hot encode the gender column... which, given its low cardinality, doesn't
# really require a vocab.

# Here, keep the vocabulary in memory as opposed to writing it to a file.

# I think it's also relevant to mention that the vocabulary is built from the WHOLE file,
# not just the training split.
gender_vocab = tf.feature_column.categorical_column_with_vocabulary_list(
      'sex', clean_df['sex'].unique())
gender_one_hot = tf.feature_column.indicator_column(gender_vocab)

In [2]:
# Now, this is new: cross features...
crossed_age_gender_feature = tf.feature_column.crossed_column([tf_bucket_age_feature, gender_vocab], hash_bucket_size=1000)
tf_crossed_age_gender_feature = tf.feature_column.indicator_column(crossed_age_gender_feature)


In [None]:
feature_columns = [ tf_crossed_age_gender_feature, tf_bucket_age_feature, gender_one_hot ]
dense_feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
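The crossed column defined earlier pairs each age bucket with a sex value and hashes the pair into `hash_bucket_size` slots. The sketch below illustrates the idea in plain Python. `cross_exact` and `cross_hashed` are hypothetical helpers, and CRC32 is only a stand-in, since TensorFlow uses its own fingerprint hash and will assign different slot numbers.

```python
import zlib

N_AGE_BUCKETS = 9               # 8 boundaries -> 9 buckets
SEX_VOCAB = ["male", "female"]

def cross_exact(age_bucket, sex):
    """Collision-free ID: one slot per (age bucket, sex) pair."""
    return age_bucket * len(SEX_VOCAB) + SEX_VOCAB.index(sex)

def cross_hashed(age_bucket, sex, hash_bucket_size=1000):
    """Hashed ID in [0, hash_bucket_size); CRC32 is a stand-in, not TensorFlow's hash."""
    key = f"{age_bucket}_X_{sex}".encode("utf-8")
    return zlib.crc32(key) % hash_bucket_size

pairs = [(b, s) for b in range(N_AGE_BUCKETS) for s in SEX_VOCAB]
exact_ids = [cross_exact(b, s) for b, s in pairs]
hashed_ids = [cross_hashed(b, s) for b, s in pairs]

print(len(pairs), len(set(exact_ids)))                   # 18 18
print(all(0 <= h < 1000 for h in hashed_ids))            # True
```

There are only 18 possible (age bucket, sex) combinations here, so with `hash_bucket_size=1000` most of the 1000 indicator columns can never be active. A smaller size would shrink the input layer, at the cost of a higher chance of hash collisions.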

In [3]:
# Use same architecture as example
def build_model(dense_feature_layer):
  model = tf.keras.Sequential([
    dense_feature_layer,
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
  ])

  optimizer = tf.keras.optimizers.RMSprop(0.001)

  model.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mae', 'mse'])
  return model

In [17]:
# Build the model
model = build_model(dense_feature_layer)

# Train
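# No validation set because that would require building a separate TF dataset
EPOCHS = 2000
# Stop once training MSE has not improved for 10 consecutive epochs;
# raise patience (e.g. to 100) if it should train closer to the end of EPOCHS
early_stop = tf.keras.callbacks.EarlyStopping(monitor='mse', patience=10)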
history = model.fit(train_ds,   callbacks=[early_stop], epochs=EPOCHS,  verbose=1)
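The effect of `patience` is easier to see in a stripped-down form. This is a simplified pure-Python sketch of the patience rule, not Keras code: `stopping_epoch` is a hypothetical helper, the metric values are invented, and options such as `min_delta` and `restore_best_weights` are ignored.

```python
def stopping_epoch(metric_history, patience):
    """Return the 1-based epoch at which training would stop (lower metric is better)."""
    best = float("inf")
    wait = 0
    for epoch, value in enumerate(metric_history, start=1):
        if value < best:
            best = value   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1      # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(metric_history)  # never triggered: ran every epoch

mse_per_epoch = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 6.9]  # invented values
print(stopping_epoch(mse_per_epoch, patience=3))    # 6
print(stopping_epoch(mse_per_epoch, patience=100))  # 7
```

With a small patience, training stops at epoch 6 and never reaches the better value at epoch 7; a large patience lets it run through all the epochs.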


In [None]:
loss, mae, mse = model.evaluate(test_ds, verbose=2)
print("MAE:{}\nMSE:{}".format(mae, mse))
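As a sanity check on what those two numbers mean, here is a NumPy sketch that computes MAE and MSE by hand. The blood pressure values are invented, not taken from the dataset.

```python
import numpy as np

# Invented resting blood pressure values (mm Hg)
y_true = np.array([120.0, 130.0, 110.0, 140.0])
y_pred = np.array([125.0, 128.0, 118.0, 135.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))  # average absolute miss, in mm Hg
mse = np.mean(errors ** 2)     # average squared miss, in (mm Hg)^2

print(mae, mse)  # 5.0 29.5
```

MAE stays in the units of the label, which makes it easier to read than MSE; MSE penalizes the single 8 mm Hg miss much more heavily than the others.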

In [None]:
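# test_ds reshuffles on every pass, so predict from an unshuffled dataset to keep rows aligned with test_labels
test_labels = test_data[PREDICTOR_FIELD].values
ordered_test_ds = tf.data.Dataset.from_tensor_slices(dict(test_data.drop(columns=[PREDICTOR_FIELD]))).batch(batch_size)
test_predictions = model.predict(ordered_test_ds).flatten()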
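The instructions also ask for classification metrics, which means turning the regression output into a binary label first. Below is a minimal NumPy sketch of that step. The 130 mm Hg cutoff is a hypothetical threshold chosen only for illustration, and the arrays are invented stand-ins for `test_labels` and `test_predictions`.

```python
import numpy as np

THRESHOLD = 130.0  # hypothetical cutoff in mm Hg, for illustration only

# Invented stand-ins for test_labels and test_predictions
y_true = np.array([120.0, 130.0, 110.0, 140.0, 150.0, 128.0])
y_pred = np.array([125.0, 128.0, 118.0, 135.0, 131.0, 133.0])

# Binarize both the labels and the predictions with the same cutoff
true_cls = (y_true >= THRESHOLD).astype(int)
pred_cls = (y_pred >= THRESHOLD).astype(int)

tp = int(np.sum((pred_cls == 1) & (true_cls == 1)))
fp = int(np.sum((pred_cls == 1) & (true_cls == 0)))
fn = int(np.sum((pred_cls == 0) & (true_cls == 1)))
tn = int(np.sum((pred_cls == 0) & (true_cls == 0)))
print(tp, fp, fn, tn)  # 2 1 1 2

# Metrics for the positive class; these divide by zero if a class is never predicted
accuracy = (tp + tn) / len(true_cls)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions there is one ROC point, so AUC reduces to
# the average of recall (TPR) and specificity (TNR)
specificity = tn / (tn + fp)
auc = (recall + specificity) / 2
print(round(accuracy, 3), round(f1, 3), round(auc, 3))  # 0.667 0.667 0.667
```

On the real arrays, the scikit-learn functions imported at the top of the notebook (`accuracy_score`, `f1_score`, `classification_report`, `roc_auc_score`) should report the same quantities, with `classification_report` adding the per-class precision and recall.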