# Assignment - Autoencoder

In this assignment, we will focus on healthcare. The data set contains data about patients with and without heart problems. Each row represents a single patient. There are two files: heart-normal (patients without any heart problems) and heart-anomaly (patients with heart problems). This is an anomaly detection task: build an autoencoder on normal patients to identify anomalous observations. Supervised learning is not an option because there are only 20 anomalous observations, which is not enough to build a binary classification model.

## Description of Variables

The variables are described in "Heart - Data Dictionary.docx".

## Goal

Use the **heart-normal.csv** data set to train an autoencoder on healthy (i.e., normal) patients. Then, use the observations in the **heart-anomaly.csv** data set to check whether the autoencoder can successfully detect patients who have a heart anomaly.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable or undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


# Read and Prepare the Data

In [1]:
# Common imports
import numpy as np
import pandas as pd

random_state=42

In [2]:
heartnormal = pd.read_csv("heart-normal.csv")
heartnormal.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [3]:
heartanomaly=pd.read_csv("heart-anomaly.csv")
heartanomaly.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,67,1,0,160,286,0,0,108,1,1.5,1,3,2
1,67,1,0,120,229,0,0,129,1,2.6,1,2,3
2,62,0,0,140,268,0,0,160,0,3.6,0,2,2
3,63,1,0,130,254,0,0,147,0,1.4,1,1,3
4,53,1,0,140,203,1,0,155,1,3.1,0,0,3


In [4]:
heartnormal.shape

(165, 13)

In [5]:
heartanomaly.shape

(20, 13)

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import FunctionTransformer

In [7]:
# Identify the numerical columns
numeric_columns = heartnormal.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns ('thal' is stored as an integer code, so it is listed by hand)
# Note: 'thal' also remains in numeric_columns, so it is both scaled and one-hot encoded
categorical_columns = ['thal']

In [8]:
binary_columns = ['sex', 'fbs','exang']

In [9]:
# Remove the binary columns from the numeric list so they are not scaled
for col in binary_columns:
    numeric_columns.remove(col)

In [10]:
binary_columns

['sex', 'fbs', 'exang']

In [11]:
numeric_columns

['age',
 'cp',
 'trestbps',
 'chol',
 'restecg',
 'thalach',
 'oldpeak',
 'slope',
 'ca',
 'thal']

In [12]:
categorical_columns

['thal']
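
The three lists printed above can be checked for overlap before they are handed to a `ColumnTransformer`. This is a minimal sketch with the lists copied from the outputs above; the helper name `find_overlaps` is hypothetical and not part of the notebook.

```python
from itertools import combinations

# Column lists copied from the notebook outputs above
numeric_columns = ['age', 'cp', 'trestbps', 'chol', 'restecg',
                   'thalach', 'oldpeak', 'slope', 'ca', 'thal']
categorical_columns = ['thal']
binary_columns = ['sex', 'fbs', 'exang']


def find_overlaps(groups):
    """Return {(name_a, name_b): shared columns} for each pair of groups that overlap."""
    overlaps = {}
    for (name_a, cols_a), (name_b, cols_b) in combinations(groups.items(), 2):
        shared = sorted(set(cols_a) & set(cols_b))
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


groups = {'num': numeric_columns, 'cat': categorical_columns, 'binary': binary_columns}
overlaps = find_overlaps(groups)
print(overlaps)
```

Any column reported here is processed by more than one transformer and therefore appears more than once in the transformed array.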

In [13]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [14]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [15]:
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [16]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns)],
        remainder='passthrough')

In [17]:
normal_x = preprocessor.fit_transform(heartnormal)

normal_x

array([[ 1.10306652,  1.71093264,  0.97372481, ...,  1.        ,
         1.        ,  0.        ],
       [-1.62754823,  0.65755993,  0.04323489, ...,  1.        ,
         0.        ,  0.        ],
       [-1.20745366, -0.39581278,  0.04323489, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-1.20745366, -0.39581278, -0.57709173, ...,  1.        ,
         0.        ,  0.        ],
       [-1.52252459,  0.65755993,  0.53949618, ...,  1.        ,
         0.        ,  0.        ],
       [-1.52252459,  0.65755993,  0.53949618, ...,  1.        ,
         0.        ,  0.        ]])

In [18]:
normal_x.shape

(165, 17)

In [19]:
anomaly_x = preprocessor.transform(heartanomaly)

anomaly_x

array([[ 1.5231611 , -1.44918549,  1.90421473,  0.81980549, -1.18012347,
        -2.6400108 ,  1.17814884, -1.0035591 ,  3.1150997 , -0.26104233,
         0.        ,  0.        ,  1.        ,  0.        ,  1.        ,
         0.        ,  1.        ],
       [ 1.5231611 , -1.44918549, -0.57709173, -0.24780329, -1.18012347,
        -1.54145941,  2.59146023, -1.0035591 ,  1.93351016,  1.89255693,
         0.        ,  0.        ,  0.        ,  1.        ,  1.        ,
         0.        ,  1.        ],
       [ 0.99804287, -1.44918549,  0.6635615 ,  0.48266588, -1.18012347,
         0.08021169,  3.87628877, -2.69322494,  1.93351016, -0.26104233,
         0.        ,  0.        ,  1.        ,  0.        ,  0.        ,
         0.        ,  0.        ],
       [ 1.10306652, -1.44918549,  0.04323489,  0.22044617, -1.18012347,
        -0.59984393,  1.04966598, -1.0035591 ,  0.75192062,  1.89255693,
         0.        ,  0.        ,  0.        ,  1.        ,  1.        ,
         0.        

In [20]:
anomaly_x.shape

(20, 17)
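
The preprocessor is fitted on the normal patients only and then reused on the anomalous ones, so both sets are scaled with the same statistics. The NumPy sketch below shows that idea for a single column, using only the `age` values visible in the two `head()` outputs. The real scaler uses all 165 normal rows, so these numbers are illustrative only.

```python
import numpy as np

# Age values taken from the first five rows of each file, as shown above
normal_age = np.array([63, 37, 41, 56, 57], dtype=float)
anomaly_age = np.array([67, 67, 62, 63, 53], dtype=float)

# "fit": learn the mean and standard deviation from the normal data only
mean = normal_age.mean()
std = normal_age.std()  # population std (ddof=0), which is what StandardScaler uses

# "transform": apply the same statistics to both sets
normal_scaled = (normal_age - mean) / std
anomaly_scaled = (anomaly_age - mean) / std

print(normal_scaled)
print(anomaly_scaled)
```

The anomalous values are deliberately not re-centered on their own mean; a shift relative to the normal population is part of what the autoencoder should pick up.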

# Autoencoder

In [21]:
import tensorflow as tf
from tensorflow import keras

In [22]:
model = keras.models.Sequential()

#Encoder
model.add(keras.layers.InputLayer(input_shape=[17]))
model.add(keras.layers.Dense(50, activation='selu'))
#model.add(keras.layers.Dense(50, activation='selu'))

#Decoder
model.add(keras.layers.Dense(50, activation='selu'))
model.add(keras.layers.Dense(17))  

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 50)                900       
_________________________________________________________________
dense_1 (Dense)              (None, 50)                2550      
_________________________________________________________________
dense_2 (Dense)              (None, 17)                867       
Total params: 4,317
Trainable params: 4,317
Non-trainable params: 0
_________________________________________________________________
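
The parameter counts in the summary can be verified by hand: a dense layer with `n_inputs` inputs and `n_units` units has `(n_inputs + 1) * n_units` parameters (weights plus one bias per unit). A short check using the layer sizes from the model above; the helper name `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return (n_inputs + 1) * n_units


# Input width followed by the width of each Dense layer in the autoencoder
layer_sizes = [17, 50, 50, 17]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)
```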


In [23]:
adam = keras.optimizers.Adam(learning_rate=0.001)


model.compile(loss='mean_squared_error', optimizer=adam, metrics=['mean_squared_error'])

In [24]:
from tensorflow.keras.callbacks import EarlyStopping

earlystop = EarlyStopping(monitor='val_loss', patience=5, verbose=1, mode='auto')

callback = [earlystop]
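
To make the role of `patience` concrete, the following is a simplified pure-Python sketch of the stopping rule: training stops once the monitored loss has failed to improve for `patience` consecutive epochs. It is an illustration of the idea, not the Keras implementation, and the loss values are made up.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # No improvement: count it and stop once patience is exhausted
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up loss curve: the best value so far is reached at epoch 2
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.75, 0.73, 0.60]
stop = early_stop_epoch(losses, patience=5)
print(stop)
```

A loss that keeps improving every epoch never triggers the rule, in which case training runs for the full number of epochs.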

In [25]:
# Be careful: both the input and the target are normal_x while training the autoencoder.
# The validation data is also normal_x, so val_loss is measured on the training rows.

model.fit(normal_x, normal_x, 
          validation_data = (normal_x, normal_x),
          epochs=200, batch_size=100, callbacks=callback)

Epoch 1/200
...
Epoch 200/200


<tensorflow.python.keras.callbacks.History at 0x1a5049f3bb0>
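
Because the validation data passed to `fit` is the training data itself, `val_loss` tracks the training loss. One common alternative is to hold out some normal rows for validation. The sketch below uses NumPy, an 80/20 split, and a random stand-in array with the same shape as `normal_x`; the split ratio and the stand-in data are assumptions, not something the assignment prescribes.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for normal_x so the example runs on its own (165 rows, 17 columns)
normal_x_demo = rng.normal(size=(165, 17))

# Shuffle the row indices, then hold out 20% of them for validation
indices = rng.permutation(len(normal_x_demo))
n_val = int(0.2 * len(normal_x_demo))
val_idx, train_idx = indices[:n_val], indices[n_val:]

train_x = normal_x_demo[train_idx]
val_x = normal_x_demo[val_idx]

print(train_x.shape, val_x.shape)
```

The two arrays would then be passed as `model.fit(train_x, train_x, validation_data=(val_x, val_x), ...)`, so that early stopping reacts to rows the model has not been fitted on.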

In [26]:
model.evaluate(normal_x, normal_x)



[0.005915837828069925, 0.005915837828069925]

In [27]:
model.evaluate(normal_x, normal_x)[0]*1000



5.915837828069925

In [28]:
model.evaluate(anomaly_x, anomaly_x)



[0.024617154151201248, 0.024617154151201248]

In [29]:
model.evaluate(anomaly_x, anomaly_x)[0]*1000



24.617154151201248

## Predict first 20 in normal data

In [30]:
from sklearn.metrics import mean_squared_error

# Reconstruction error (MSE x 1000) for individual normal patients
# Note: range(1, 21) scores rows 1 through 20 and skips row 0
for i in range(1,21):
    prediction = model.predict(normal_x[i:i+1])
    print((mean_squared_error(normal_x[i:i+1], prediction))*1000)

23.6382604640009
9.243500150138903
4.312235488242118
6.5294653732704475
5.4464361843314135
4.527891619218368
3.822508051588053
5.166946745002025
3.909497641122364
3.9008178473206794
3.6703807555471024
3.4455560279152486
7.2481507542067005
3.2690632714584322
6.423045023348083
5.021313112857455
11.8869246417369
7.187158887288646
7.901355638665102
3.2184421284794325
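
The loop above calls `predict` once per row. The same per-patient errors can be computed in one pass by predicting on the whole array and averaging the squared differences along each row. This is a minimal NumPy sketch; `reconstruct` is a hypothetical stand-in for `model.predict` so that the example runs without TensorFlow.

```python
import numpy as np


def rowwise_mse(x, x_hat):
    """Mean squared error of each row of x against the same row of x_hat."""
    return np.mean((x - x_hat) ** 2, axis=1)


def reconstruct(x):
    # Hypothetical stand-in for model.predict: shrinks every value by 10%
    return 0.9 * x


x = np.array([[1.0, -1.0, 2.0],
              [0.0, 0.5, -0.5]])

errors = rowwise_mse(x, reconstruct(x))
print(errors * 1000)  # same x1000 scaling as in the notebook
```

With the real model, `rowwise_mse(normal_x, model.predict(normal_x))` would return one error per patient in a single call.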


## Predict all 20 in anomaly data

In [31]:
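# Reconstruction error (MSE x 1000) for individual anomalous patients
# Note: range(1, 20) scores rows 1 through 19 and skips row 0, so 19 of the 20 anomalies are printed
for i in range(1,20):
    prediction = model.predict(anomaly_x[i:i+1])
    print((mean_squared_error(anomaly_x[i:i+1], prediction))*1000)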

34.22687665272411
42.44947338210777
10.053766237262685
47.60404443244025
11.026728487264728
16.550880693859014
8.915174605055528
76.304948939606
23.949957713799634
23.135347947583895
23.076534088020885
6.463911978903473
20.33274454865855
13.700030530189997
4.864474453199608
49.363163205934114
9.368561361093434
13.052660530077793
11.02258235342397
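
One way to turn these per-patient errors into a detection rule is to pick a threshold from the normal errors and flag anything above it. The sketch below uses the values printed above, rounded to three decimals. The 95th-percentile cutoff is an assumption chosen for illustration, not part of the assignment.

```python
import numpy as np

# Per-patient errors (MSE x 1000) copied from the two outputs above
normal_errors = np.array([
    23.638, 9.244, 4.312, 6.529, 5.446, 4.528, 3.823, 5.167, 3.909, 3.901,
    3.670, 3.446, 7.248, 3.269, 6.423, 5.021, 11.887, 7.187, 7.901, 3.218])
anomaly_errors = np.array([
    34.227, 42.449, 10.054, 47.604, 11.027, 16.551, 8.915, 76.305, 23.950,
    23.135, 23.077, 6.464, 20.333, 13.700, 4.864, 49.363, 9.369, 13.053, 11.023])

# Assumed rule: flag a patient if the error exceeds the 95th percentile of normal errors
threshold = np.percentile(normal_errors, 95)

flagged_anomalies = int((anomaly_errors > threshold).sum())
flagged_normals = int((normal_errors > threshold).sum())

print(f"threshold = {threshold:.3f}")
print(f"anomalies flagged: {flagged_anomalies} of {len(anomaly_errors)}")
print(f"normals flagged:   {flagged_normals} of {len(normal_errors)}")
```

A lower threshold catches more anomalies at the cost of more false alarms among normal patients, which is a useful trade-off to mention in the discussion.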


# Discussion

Provide a brief discussion (one-paragraph): can the model successfully detect patients with heart anomalies? If not, why? <br>
Discuss any other relevant issues about your autoencoder. 

# Extra Credit (3 points):

# Build a GAN

Build a GAN that can generate patients with **normal hearts**. Test the effectiveness of your GAN using the autoencoder you built earlier. Hint: when you send your newly generated data to the autoencoder, the error term should be small.

In [32]:
import tensorflow as tf
from tensorflow import keras

In [46]:
codings_size = 100   # size of the random noise vector fed to the generator

generator = keras.models.Sequential([
    keras.layers.InputLayer(input_shape=[codings_size]),
    keras.layers.Dense(40, activation="relu"),
    keras.layers.Dense(40, activation="relu"),
    keras.layers.Dense(17, activation=None)    
])


In [47]:
generator.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_9 (Dense)              (None, 40)                4040      
_________________________________________________________________
dense_10 (Dense)             (None, 40)                1640      
_________________________________________________________________
dense_11 (Dense)             (None, 17)                697       
Total params: 6,377
Trainable params: 6,377
Non-trainable params: 0
_________________________________________________________________


In [48]:
discriminator = keras.models.Sequential([
    keras.layers.InputLayer(input_shape=[17]),
    keras.layers.Dense(50, activation="relu"),
    keras.layers.Dense(25, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])


In [49]:
discriminator.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_12 (Dense)             (None, 50)                900       
_________________________________________________________________
dense_13 (Dense)             (None, 25)                1275      
_________________________________________________________________
dense_14 (Dense)             (None, 1)                 26        
Total params: 2,201
Trainable params: 2,201
Non-trainable params: 0
_________________________________________________________________


In [50]:
gan = keras.models.Sequential([generator, discriminator])

In [51]:
# Compile the discriminator while it is trainable, then freeze it inside the combined GAN
discriminator.compile(loss="binary_crossentropy", optimizer="Adam")
discriminator.trainable = False
gan.compile(loss="mean_squared_error", optimizer="Adam")

In [52]:
batch_size = 32

dataset = tf.data.Dataset.from_tensor_slices(normal_x).shuffle(1000)

dataset = dataset.batch(batch_size, drop_remainder=True).prefetch(1)
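
With `drop_remainder=True`, only full batches are kept. A quick check of what that means for 165 normal rows and a batch size of 32:

```python
n_rows = 165      # rows in normal_x
batch_size = 32

# Number of full batches per epoch and the rows left out of each epoch
full_batches, leftover = divmod(n_rows, batch_size)

print(f"{full_batches} batches of {batch_size}, {leftover} rows dropped per epoch")
```

Since `shuffle` reshuffles on each pass by default, the dropped rows should differ from epoch to epoch rather than being excluded permanently.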

In [53]:
def train_gan(gan, dataset, batch_size, codings_size, n_epochs=10):
    generator, discriminator = gan.layers
    for epoch in range(n_epochs):
        for X_batch in dataset:
            # phase 1 - training the discriminator
            noise = tf.random.normal(shape=[batch_size, codings_size])
            generated_data = tf.cast(generator(noise), tf.float64)
            X_fake_and_real = tf.concat([generated_data, X_batch], axis=0)
            y1 = tf.constant([[0.]] * batch_size + [[1.]] * batch_size)
            discriminator.trainable = True
            discriminator.train_on_batch(X_fake_and_real, y1)
            # phase 2 - training the generator
            noise = tf.random.normal(shape=[batch_size, codings_size])
            y2 = tf.constant([[1.]] * batch_size)
            discriminator.trainable = False
            gan.train_on_batch(noise, y2)
        print("Epoch: {}/{}".format(epoch, n_epochs))

In [54]:
train_gan(gan, dataset, batch_size, codings_size, n_epochs=100)

Epoch: 0/100
...
Epoch: 71/100
...

In [55]:
noise = tf.random.normal(shape=[1, codings_size])
generated_data = tf.cast(generator(noise), tf.float64)

In [56]:
generated_data

<tf.Tensor: shape=(1, 17), dtype=float64, numpy=
array([[-0.73013985, -2.47334981, -1.43780434, -0.39002925, -0.7204473 ,
        -0.11289843,  0.00578984,  0.64840305, -1.35245073,  0.60112172,
        -0.38344851, -0.10997078,  0.86434484,  0.35212845, -0.00317993,
        -1.09569454, -0.27936321]])>
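
The generator's last layer has no activation, so the one-hot and binary positions come out as arbitrary real numbers rather than 0 or 1. The sketch below snaps the generated row shown above back to valid values. It assumes the column layout implied by the `ColumnTransformer` order: 10 scaled numeric columns, then 4 one-hot `thal` columns, then 3 binary columns. The 4-column split is inferred from the 17-column total, and the 0.5 cutoff is an assumption.

```python
import numpy as np

# Generated row copied from the output above
generated = np.array([
    -0.73013985, -2.47334981, -1.43780434, -0.39002925, -0.7204473,
    -0.11289843, 0.00578984, 0.64840305, -1.35245073, 0.60112172,
    -0.38344851, -0.10997078, 0.86434484, 0.35212845, -0.00317993,
    -1.09569454, -0.27936321])

N_NUMERIC, N_ONEHOT = 10, 4  # assumed layout: numeric | one-hot thal | binary

numeric = generated[:N_NUMERIC]
onehot_raw = generated[N_NUMERIC:N_NUMERIC + N_ONEHOT]
binary_raw = generated[N_NUMERIC + N_ONEHOT:]

# One-hot block: keep only the largest entry
onehot = np.zeros_like(onehot_raw)
onehot[np.argmax(onehot_raw)] = 1.0

# Binary block: round to 0 or 1 with an assumed 0.5 cutoff
binary = (binary_raw > 0.5).astype(float)

snapped = np.concatenate([numeric, onehot, binary])
print(snapped)
```

Comparing the autoencoder error before and after snapping would show how much of the error comes from invalid categorical values rather than from implausible numeric ones.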

In [57]:
prediction=model.predict(generated_data)

In [58]:
print((mean_squared_error(generated_data,prediction))*1000)

27.637046511481294
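
To interpret this number, it can be set beside the average errors the autoencoder produced earlier for the normal and anomalous sets. A small sketch using the three values printed in this notebook; note that it rests on a single generated sample.

```python
# Values (MSE x 1000) copied from the outputs in this notebook
normal_mse = 5.915837828069925
anomaly_mse = 24.617154151201248
generated_mse = 27.637046511481294

# Which reference value is the generated sample closer to?
gap_to_normal = abs(generated_mse - normal_mse)
gap_to_anomaly = abs(generated_mse - anomaly_mse)
closer_to = "normal" if gap_to_normal < gap_to_anomaly else "anomaly"

print(f"gap to normal: {gap_to_normal:.2f}, gap to anomaly: {gap_to_anomaly:.2f}")
print(f"generated sample is closer to the {closer_to} average")
```

Averaging the error over many generated samples would give a steadier picture than one draw from the generator.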


# Discussion

Provide a brief discussion (one-paragraph): can the GAN generate patients with normal hearts? If not, why? <br>
Discuss any other relevant issues about your GAN. 