# BRCA Multi-Omics (TCGA) - Building an RNN Model

In order to practice constructing deep learning models on Omics data, we will put together a recurrent neural network (RNN) model on an example [BRCA dataset](https://www.kaggle.com/samdemharter/brca-multiomics-tcga).

The BRCA Multi-Omics dataset (omics dataset) is made publicly available on Kaggle. The dataset is only 21.62MB in size. The dataset contains 705 breast cancer samples (611 patients survived, 94 died; "vital.status"). Additionally, the dataset contains four different omics data types with a total of 1936 total features. The omics data types can be broken down into the following four categories:

- cn: copy number variations (860 features)
- mu: mutations (249 features)
- rs: gene expression (604 features)
- pp: protein levels (223 features)

The dataset contains four different omics data types (1936 features in total).

- cn: copy number variations as calculated by gistic taking values -2, -1, 0, 1, 2 (n=860)
- mu: somatic mutations taking boolean values (n=249)
- rs: gene expression (n=604)
- pp: phospho-protein levels (n=223)

We will develop a sequential neural network (i.e. a recurrent neural network (RNN)) to predict the outcome of the patient. The pipeline is as follows:

1.	Create a training dataset of size n x 1937 and a test dataset of size m x 1937.
2.	Create an output dataset of size 1 x 705 with outputs 0 = Alive, 1 = Dead.
3.	Normalize the inputs in the training set.
4.	Train an RNN model to predict the outcome of the test set.

In [1]:
# Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense

Using TensorFlow backend.


In [2]:
# Import the data and separate into training and test sets.

data = pd.read_csv('data.csv')
train = data.sample(frac=0.8, random_state=25)
test = data.drop(train.index)
train_x = train.iloc[: , :-1]
train_y = train.iloc[: , -1]
test_x = test.iloc[: , :-1]
test_y = test.iloc[: , -1]

In [3]:
# Normalize feature set

train_x = np.array(train_x)
train_y = np.array(train_y)
test_x = np.array(test_x)
test_y = np.array(test_y)
scaler = MinMaxScaler()
train_x = scaler.fit_transform(train_x)
test_x = scaler.fit_transform(test_x)

In [4]:
# Define the keras model

model = Sequential()
model.add(Dense(12, input_dim=1936, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

In [5]:
# Compile the keras model

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [6]:
# Fit the keras model on the dataset

model.fit(train_x, train_y, epochs=150, batch_size=10, verbose=0)

<keras.callbacks.callbacks.History at 0x25301973708>

In [7]:
# Evaluate the keras model

_, accuracy = model.evaluate(train_x, train_y)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 100.00


In [16]:
# Make class predictions with the model

predictions = (model.predict(test_x) > 0.5).astype(int)

test_y = np.reshape(test_y,(141,1))

print('Accuracy: ', (predictions == test_y).sum()/float(predictions.size))

Accuracy:  0.8368794326241135
