# Cardiovascular Anomaly Detection Using Machine Learning

This project focuses on healthcare. The data set, made available by MIT, contains 9,026 heartbeat measurements. Each row represents a single measurement captured on a timeline, and each measurement has 80 data points (columns). This is a multiclass classification task: predict whether a measurement represents a normal heartbeat or one of four other beat categories.

## Description of Variables

You will use the **heartbeat_cleaned.csv** data set for this assignment. Each row represents a single measurement. Columns T1 through T80 are the time steps on the timeline (there are 80 time steps, and each time step has exactly one measurement).

The last column is the target variable. It shows the label (category) of the measurement as follows:<br>
0 = Normal<br>
1 = Supraventricular premature beat<br>
2 = Premature ventricular contraction<br>
3 = Fusion of ventricular and normal beat<br>
4 = Unclassifiable beat
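
The label codes above can be kept in a small lookup table so that predictions are easier to read later on. This is only a sketch: `LABEL_NAMES` and `describe_label` are hypothetical names that do not appear elsewhere in the notebook.

```python
# Hypothetical helper: map the integer codes in the Target column to readable names.
LABEL_NAMES = {
    0: "Normal",
    1: "Supraventricular premature beat",
    2: "Premature ventricular contraction",
    3: "Fusion of ventricular and normal beat",
    4: "Unclassifiable beat",
}


def describe_label(code):
    """Return the readable name for a Target code (raises KeyError for unknown codes)."""
    return LABEL_NAMES[int(code)]


print(describe_label(2))
```

Casting with `int()` lets the helper accept float codes such as `2.0` as well as plain integers.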

## Goal

Use the data set **heartbeat_cleaned.csv** to predict the column called **Target**. The input variables are columns labeled as **T1 to T80**.

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable or undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


# Note:

The data has been cleaned. There are no unequal-length sequences and no zero padding, so you shouldn't use any `Masking` layer (as I mentioned in the lecture).

# Read and Prepare the Data (1 point)

In [1]:
# Common imports
import numpy as np
import pandas as pd

np.random.seed(42)

## Fetching the Data

In [4]:
#We will try to predict the "Target" value in the data set:

hb_data = pd.read_csv("/Users/onkaratmaramsalunke/Downloads/heartbeat_cleaned.csv")
hb_data.head()

Unnamed: 0,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,...,T72,T73,T74,T75,T76,T77,T78,T79,T80,Target
0,0.987,0.892,0.461,0.113,0.149,0.19,0.165,0.162,0.147,0.138,...,0.197,0.197,0.196,0.203,0.201,0.199,0.201,0.205,0.208,0
1,1.0,0.918,0.621,0.133,0.105,0.125,0.117,0.0898,0.0703,0.0781,...,0.195,0.191,0.152,0.172,0.207,0.211,0.207,0.207,0.172,0
2,1.0,0.751,0.143,0.104,0.0961,0.0519,0.0442,0.0416,0.0364,0.0857,...,0.226,0.242,0.244,0.286,0.468,0.816,0.977,0.452,0.0519,0
3,1.0,0.74,0.235,0.0464,0.0722,0.0567,0.0103,0.0155,0.0284,0.0155,...,0.0851,0.0747,0.0515,0.0593,0.067,0.0361,0.121,0.451,0.869,0
4,1.0,0.833,0.309,0.0191,0.101,0.12,0.104,0.0874,0.0765,0.0765,...,0.205,0.421,0.803,0.951,0.467,0.0,0.0519,0.082,0.0628,0


## Splitting the dataset into train and test datasets


In [5]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(hb_data, test_size=0.3)
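
With imbalanced classes, a plain random split can leave a rare class under-represented in one of the two sets. The sketch below shows the optional `stratify` and `random_state` arguments of `train_test_split` on a toy frame. The toy data and the names `toy`, `toy_train` and `toy_test` are assumptions for illustration, not part of the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for hb_data: 100 rows with an imbalanced 5-class target (made-up data).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "T1": rng.random(100),
    "Target": np.repeat([0, 1, 2, 3, 4], [60, 10, 10, 10, 10]),
})

# stratify keeps each class's share the same in both splits;
# random_state makes the split reproducible on its own.
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, stratify=toy["Target"], random_state=42
)

print(toy_train["Target"].value_counts(normalize=True).sort_index())
print(toy_test["Target"].value_counts(normalize=True).sort_index())
```

Both printouts should show the same class proportions, which is the point of stratifying.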

## Checking for missing values, if any


In [6]:
train_set.isna().sum()

T1        0
T2        0
T3        0
T4        0
T5        0
         ..
T77       0
T78       0
T79       0
T80       0
Target    0
Length: 81, dtype: int64

In [7]:
test_set.isna().sum()

T1        0
T2        0
T3        0
T4        0
T5        0
         ..
T77       0
T78       0
T79       0
T80       0
Target    0
Length: 81, dtype: int64

## Data Preparation


In [8]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import FunctionTransformer

## Separate the target variable (we don't want to transform it)

In [10]:
train_target = train_set[['Target']]
test_target = test_set[['Target']]

train_inputs = train_set.drop(['Target'], axis=1)
test_inputs = test_set.drop(['Target'], axis=1)

## Identify the numerical columns (to ensure all columns are being considered)

In [11]:
train_inputs.dtypes

T1     float64
T2     float64
T3     float64
T4     float64
T5     float64
        ...   
T76    float64
T77    float64
T78    float64
T79    float64
T80    float64
Length: 80, dtype: object

In [12]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

In [13]:
numeric_columns

['T1',
 'T2',
 'T3',
 'T4',
 'T5',
 'T6',
 'T7',
 'T8',
 'T9',
 'T10',
 'T11',
 'T12',
 'T13',
 'T14',
 'T15',
 'T16',
 'T17',
 'T18',
 'T19',
 'T20',
 'T21',
 'T22',
 'T23',
 'T24',
 'T25',
 'T26',
 'T27',
 'T28',
 'T29',
 'T30',
 'T31',
 'T32',
 'T33',
 'T34',
 'T35',
 'T36',
 'T37',
 'T38',
 'T39',
 'T40',
 'T41',
 'T42',
 'T43',
 'T44',
 'T45',
 'T46',
 'T47',
 'T48',
 'T49',
 'T50',
 'T51',
 'T52',
 'T53',
 'T54',
 'T55',
 'T56',
 'T57',
 'T58',
 'T59',
 'T60',
 'T61',
 'T62',
 'T63',
 'T64',
 'T65',
 'T66',
 'T67',
 'T68',
 'T69',
 'T70',
 'T71',
 'T72',
 'T73',
 'T74',
 'T75',
 'T76',
 'T77',
 'T78',
 'T79',
 'T80']

## Pipeline

In [14]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [18]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns)],
        remainder='passthrough')

#passthrough is an optional step. You don't have to use it.

## Transform: fit_transform() for TRAIN

In [20]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

array([[ 0.52866194,  0.34497511, -1.32502878, ...,  0.23995314,
         0.21351362,  0.37129189],
       [ 0.52866194,  0.47203991, -1.1904895 , ..., -0.99282403,
        -1.11960255, -1.1952421 ],
       [-3.23452505, -2.87058419, -1.20272034, ...,  1.5729561 ,
         1.53246329,  1.45183029],
       ...,
       [ 0.52866194,  1.00899118,  0.46882827, ..., -1.24839979,
        -1.330146  , -1.33191294],
       [-1.11208759, -0.90927747,  0.04074875, ...,  0.25498701,
         0.18420363,  0.20847103],
       [-0.22773865, -0.20837162,  0.43213574, ..., -0.20605161,
        -0.22125127, -0.31946326]])

In [21]:
train_x.shape

(5572, 80)

## Transform: transform() for TEST

In [22]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

array([[ 0.52866194,  0.16872393, -1.71070804, ..., -1.18776319,
        -1.18750404, -1.23520722],
       [ 0.26900203,  0.54991834,  0.72975293, ..., -1.19026883,
        -1.37215699, -1.19968267],
       [ 0.52866194, -0.13049319, -1.86726284, ..., -0.94772243,
        -0.99308108, -0.96087875],
       ...,
       [ 0.51360919,  0.36137057, -1.03148853, ...,  0.03950157,
         0.09627365,  0.16406534],
       [ 0.49103007,  0.98849686,  1.740836  , ..., -0.0707468 ,
        -0.18217128, -0.12210464],
       [-0.63039966, -1.00765022, -0.47294668, ..., -0.47164994,
        -0.42642122, -0.73885031]])

In [23]:
test_x.shape

(2388, 80)
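
The reason for calling `fit_transform()` on the train data but only `transform()` on the test data is that the scaling statistics should come from the train data alone. A minimal sketch with a single made-up column (the `toy_*` arrays are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-column data (made-up values, for illustration only).
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
toy_train_scaled = scaler.fit_transform(toy_train)  # learns mean and std from train
toy_test_scaled = scaler.transform(toy_test)        # reuses the train mean and std

print(scaler.mean_)      # mean learned from the train data only
print(toy_test_scaled)   # the test value 2.0 maps to 0.0 because it equals the train mean
```

Because the test data is scaled with train statistics, its columns will generally not have exactly zero mean or unit variance, which is expected.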

## Keras needs ordinal (integer-coded) target values for classification


In [24]:
from sklearn.preprocessing import OrdinalEncoder

ord_enc = OrdinalEncoder()

train_y = ord_enc.fit_transform(train_target)

train_y

array([[1.],
       [0.],
       [2.],
       ...,
       [0.],
       [4.],
       [4.]])

In [25]:
test_y = ord_enc.transform(test_target)

test_y

array([[0.],
       [0.],
       [0.],
       ...,
       [1.],
       [2.],
       [0.]])
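
As the output above shows, `OrdinalEncoder` returns a 2-D float array. If a flat integer label vector is preferred, it can be derived as sketched below. The toy target and the names `toy_target`, `toy_enc`, `encoded` and `labels` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy target shaped like train_target: a one-column DataFrame (made-up values).
toy_target = pd.DataFrame({"Target": [0, 3, 1, 4, 2, 0]})

toy_enc = OrdinalEncoder()
encoded = toy_enc.fit_transform(toy_target)   # float array of shape (6, 1)

# Flatten to a 1-D integer vector, a common form for integer class labels.
labels = encoded.ravel().astype("int64")

print(encoded.shape, encoded.dtype)
print(labels)
```

Since the original codes are already the integers 0 to 4, the encoding leaves the values unchanged and only alters the dtype and shape.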

# Find the baseline (0.5 point)

In [None]:
train_target.value_counts()/len(train_target)
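
The baseline is the accuracy obtained by always predicting the most frequent class, so it equals the largest class share. A small sketch on made-up labels (`toy_labels` is a stand-in for the real target column):

```python
import pandas as pd

# Toy labels (made-up values); in the notebook this would be the Target column of the train set.
toy_labels = pd.Series([0, 0, 0, 0, 0, 0, 1, 2, 2, 4], name="Target")

class_share = toy_labels.value_counts(normalize=True)  # fraction of rows per class
majority_class = class_share.idxmax()                   # most frequent class
baseline_accuracy = class_share.max()                   # accuracy of always predicting it

print(majority_class, baseline_accuracy)
```

Any model trained later should beat this number to be considered useful.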

# Multiclass classification using Keras


In [None]:
import tensorflow as tf
from tensorflow import keras

# fix random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

In [None]:
#What is your input shape?
#(meaning: how many neurons should be in the input layer?)

train_x.shape

# Build a cross-sectional shallow model using Keras (with only one hidden layer) (2 points)

In [None]:
#Define the model: for multi-class

model = keras.models.Sequential()

model.add(keras.layers.Input(shape=(train_x.shape[1],)))
model.add(keras.layers.Dense(50, activation='relu'))
model.add(keras.layers.Dense(5, activation='softmax'))

#final layer: there has to be 5 nodes with softmax (because we have 5 categories)
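
To make the role of the 5-node softmax layer concrete, the sketch below computes softmax probabilities and the sparse categorical cross-entropy by hand with NumPy. The logits and labels are made up, and the two helper functions are illustrative re-implementations, not the Keras internals.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def sparse_categorical_crossentropy(y_true, probs):
    """Mean negative log-probability assigned to the true integer class."""
    rows = np.arange(len(y_true))
    return float(-np.log(probs[rows, y_true]).mean())


# Two made-up samples with 5 raw scores each (one score per heartbeat class).
logits = np.array([[2.0, 0.5, 0.1, 0.1, 0.1],
                   [0.0, 0.0, 3.0, 0.0, 0.0]])
y_true = np.array([0, 2])

probs = softmax(logits)
print(probs.sum(axis=1))      # each row sums to 1
print(probs.argmax(axis=1))   # predicted class per sample
print(sparse_categorical_crossentropy(y_true, probs))
```

The loss shrinks toward zero as the probability assigned to the true class approaches 1.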

In [None]:
# Compile model

#Optimizer:
adam = keras.optimizers.Adam(learning_rate=0.01)

model.compile(loss='sparse_categorical_crossentropy', optimizer=adam, metrics=['accuracy'])

In [None]:
# Fit the model

history = model.fit(train_x, train_y, 
                    validation_data=(test_x, test_y), 
                    epochs=20, batch_size=500)

In [None]:
# evaluate the model

scores = model.evaluate(test_x, test_y, verbose=0)

scores

# In results, first is loss, second is accuracy

In [None]:
# extract the accuracy from model.evaluate

print("%s: %.2f" % (model.metrics_names[0], scores[0]))
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
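
Accuracy alone can hide poor performance on rare classes, so it may help to also look at a confusion matrix. The sketch below uses made-up probabilities in place of real model output. In the notebook, `toy_probs` would correspond to the array returned by `model.predict(test_x)`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up softmax outputs for 4 samples and 5 classes (illustrative only).
toy_probs = np.array([
    [0.7, 0.1, 0.1, 0.05, 0.05],
    [0.2, 0.6, 0.1, 0.05, 0.05],
    [0.6, 0.1, 0.2, 0.05, 0.05],
    [0.1, 0.1, 0.1, 0.1, 0.6],
])
toy_true = np.array([0, 1, 2, 4])

toy_pred = toy_probs.argmax(axis=1)   # most probable class per row
cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2, 3, 4])

print(toy_pred)
print(cm)   # rows = true class, columns = predicted class
```

Off-diagonal entries show which classes are confused with each other. Here the third sample (true class 2) is predicted as class 0.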

# Build a cross-sectional deep model using Keras (with two or more hidden layers) (2 points)

In [None]:
#Define the model: for multi-class

model = keras.models.Sequential()

model.add(keras.layers.Input(shape=(train_x.shape[1],)))
model.add(keras.layers.Dense(62, activation='relu'))
model.add(keras.layers.Dense(62, activation='relu'))
model.add(keras.layers.Dense(62, activation='relu'))
model.add(keras.layers.Dense(5, activation='softmax'))

#final layer: there has to be 5 nodes with softmax (because we have 5 categories)

In [None]:
# Compile model

#Optimizer:
adam = keras.optimizers.Adam(learning_rate=0.01)

model.compile(loss='sparse_categorical_crossentropy', optimizer=adam, metrics=['accuracy'])

In [None]:
# Fit the model

history = model.fit(train_x, train_y, 
                    validation_data=(test_x, test_y), 
                    epochs=20, batch_size=500)

In [None]:
# evaluate the model

scores = model.evaluate(test_x, test_y, verbose=0)

scores

# In results, first is loss, second is accuracy

In [None]:
# extract the accuracy from model.evaluate

print("%s: %.2f" % (model.metrics_names[0], scores[0]))
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

# Build a sequential shallow LSTM Model (with only one LSTM layer) (2 points)

In [None]:
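n_steps = 80   # one time step per column T1..T80
n_inputs = 1   # one measurement per time step

model = keras.models.Sequential([
    keras.layers.Input(shape=(n_steps,)),
    # Recurrent layers need 3-D input: (samples, time steps, features)
    keras.layers.Reshape((n_steps, n_inputs)),
    keras.layers.LSTM(32),   # 32 units is an arbitrary starting point
    keras.layers.Dense(5, activation='softmax')   # 5 categories
])

In [None]:
np.random.seed(42)
tf.random.set_seed(42)

optimizer = keras.optimizers.Nadam(learning_rate=0.01)

# Multiclass target with integer labels: use sparse categorical cross-entropy
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])

history = model.fit(train_x, train_y, epochs=20,
                    validation_data=(test_x, test_y))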
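
Recurrent layers such as LSTM and GRU consume 3-D input shaped (samples, time steps, features), while the preprocessed arrays in this notebook are 2-D (samples, 80). One way to add the feature axis is on the NumPy side. A minimal sketch, where `toy_x` is a made-up stand-in for `train_x`:

```python
import numpy as np

n_steps, n_inputs = 80, 1   # 80 time steps, one measurement per step

# Toy stand-in for train_x: 3 samples x 80 columns (made-up values).
toy_x = np.arange(3 * n_steps, dtype="float32").reshape(3, n_steps)

# Add a trailing feature axis: (samples, time steps) -> (samples, time steps, features).
toy_x_seq = toy_x.reshape(-1, n_steps, n_inputs)

print(toy_x.shape, "->", toy_x_seq.shape)
```

The values are unchanged. Only the shape differs, so each row is now read as a sequence of 80 single-feature steps.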

In [None]:
# evaluate the model

scores = model.evaluate(test_x, test_y, verbose=0)

scores

# In results, first is loss, second is accuracy

In [None]:
# extract the accuracy from model.evaluate

print("%s: %.2f" % (model.metrics_names[0], scores[0]))
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

# Build a sequential deep LSTM Model (with only two LSTM layers) (2 points)

In [None]:
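n_steps = 80   # one time step per column T1..T80
n_inputs = 1   # one measurement per time step

model = keras.models.Sequential([
    keras.layers.Input(shape=(n_steps,)),
    # Recurrent layers need 3-D input: (samples, time steps, features)
    keras.layers.Reshape((n_steps, n_inputs)),
    keras.layers.LSTM(32, return_sequences=True),   # pass the full sequence to the next LSTM
    keras.layers.LSTM(32),
    keras.layers.Dense(5, activation='softmax')   # 5 categories
])

In [None]:
np.random.seed(42)
tf.random.set_seed(42)

optimizer = keras.optimizers.Nadam(learning_rate=0.01)

# Multiclass target with integer labels: use sparse categorical cross-entropy
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])

history = model.fit(train_x, train_y, epochs=20,
                    validation_data=(test_x, test_y))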

In [None]:
# evaluate the model

scores = model.evaluate(test_x, test_y, verbose=0)

scores

# In results, first is loss, second is accuracy

In [None]:
# extract the accuracy from model.evaluate

print("%s: %.2f" % (model.metrics_names[0], scores[0]))
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

# Build a sequential shallow GRU Model (with only one GRU layer) (2 points)

In [None]:
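n_steps = 80   # one time step per column T1..T80
n_inputs = 1   # one measurement per time step

model = keras.models.Sequential([
    keras.layers.Input(shape=(n_steps,)),
    # Recurrent layers need 3-D input: (samples, time steps, features)
    keras.layers.Reshape((n_steps, n_inputs)),
    keras.layers.GRU(32),   # 32 units is an arbitrary starting point
    keras.layers.Dense(5, activation='softmax')   # 5 categories
])

In [None]:
np.random.seed(42)
tf.random.set_seed(42)

optimizer = keras.optimizers.Nadam(learning_rate=0.01)

# Multiclass target with integer labels: use sparse categorical cross-entropy
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])

history = model.fit(train_x, train_y, epochs=20,
                    validation_data=(test_x, test_y))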

In [None]:
# evaluate the model

scores = model.evaluate(test_x, test_y, verbose=0)

scores

# In results, first is loss, second is accuracy

In [None]:
# extract the accuracy from model.evaluate

print("%s: %.2f" % (model.metrics_names[0], scores[0]))
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))


# Build a sequential deep GRU Model (with only two GRU layers) (2 points)

In [1]:
n_steps = 80   # one time step per column T1..T80
n_inputs = 1   # one measurement per time step

model = keras.models.Sequential([
    keras.layers.Input(shape=(n_steps,)),
    # Recurrent layers need 3-D input: (samples, time steps, features)
    keras.layers.Reshape((n_steps, n_inputs)),
    keras.layers.GRU(32, return_sequences=True),   # pass the full sequence to the next GRU
    keras.layers.GRU(32),
    keras.layers.Dense(5, activation='softmax')   # 5 categories
])

In [None]:
np.random.seed(42)
tf.random.set_seed(42)

optimizer = keras.optimizers.Nadam(learning_rate=0.01)

# Multiclass target with integer labels: use sparse categorical cross-entropy
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])

history = model.fit(train_x, train_y, epochs=20,
                    validation_data=(test_x, test_y))

In [None]:
# evaluate the model

scores = model.evaluate(test_x, test_y, verbose=0)

scores

# In results, first is loss, second is accuracy