**Date**: 2018-07-29

**Authors**: Yichen Fang

**Purpose**: Point out a fault in the previous experiments, and test the old neural network on the new, corrected one-hot encoded outputs with motifs attached

**Background**:
- It was discovered that the old one-hot encoded outputs with motifs attached are incorrect. The old motif processing file contained a serious fault: one block of code was indented at the wrong level. As a result, within each section, all 24 sequences were assigned the motifs of the first sequence, so the motifs assigned to the other 23 sequences are wrong.
- Consequently, the results of previous experiments that were based on those incorrect outputs are no longer valid.
- This experiment tests how the old RNN model performs on the new, correct one-hot encoded outputs.
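
To make the failure mode concrete, here is a minimal sketch of how a single wrong indentation level can produce exactly this symptom. It is not the original motif processing code: `find_motifs`, `label_section_buggy`, `label_section_fixed`, and the toy motifs are all hypothetical names chosen for illustration, and a section is shortened to 3 sequences instead of 24.

```python
def find_motifs(seq):
    # Hypothetical stand-in for the real motif search:
    # report which of three toy motifs occur in the sequence.
    return [m for m in ("TATA", "CAAT", "GC") if m in seq]

def label_section_buggy(section):
    labelled = []
    for i, seq in enumerate(section):
        if i == 0:
            labelled.clear()           # per-section setup
            motifs = find_motifs(seq)  # BUG: indented one level too deep,
                                       # so it only runs for the first sequence
        labelled.append((seq, motifs))
    return labelled

def label_section_fixed(section):
    labelled = []
    for i, seq in enumerate(section):
        if i == 0:
            labelled.clear()           # per-section setup
        motifs = find_motifs(seq)      # runs for every sequence
        labelled.append((seq, motifs))
    return labelled

section = ["TATAAGC", "CAATTT", "GGGG"]  # toy section, assumed data
buggy = label_section_buggy(section)
fixed = label_section_fixed(section)

for (seq, bad), (_, good) in zip(buggy, fixed):
    print(seq, "buggy:", bad, "fixed:", good)
```

In the buggy version every sequence in the section carries the motifs of the first one, which matches the symptom described above; the code still runs without raising an error, which is why this kind of fault is easy to miss.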

**Experiment**:

In [20]:
import pickle
import numpy as np
import matplotlib.pyplot as plt

Set the real/random sequence buffer paths, the length to curtail all sequences to, and the number of motifs:

In [21]:
real_buffer_path = "/home/ubuntu/newOutput/10_percent/random_0.1_instance_7.txt"
random_buffer_path = "/home/ubuntu/formatted/random_sequences/random_sequence_buffer.txt"
curtail_len = 3000
motif_num = 3

Load the `pickle` buffered list:

In [22]:
with open(real_buffer_path, "rb") as buff:
    seq_record_list = pickle.load(buff)
len(seq_record_list)

8088

The following cell randomly shuffles the sequences. For each DNA section (each consisting of 24 sequences), between 18 and 24 sequences are allocated to the training data, so every section is represented in the training set. In this way, the trained model can learn the characteristics of every section in the data we have.

In [23]:
import random
from random import shuffle

first_list = [] # to add to training set
second_list = [] # to add to test set
current = [] # contains all 24 sequences from the same DNA section

for i in range(len(seq_record_list)):
    current.append(seq_record_list.pop())
    if len(current) == 24:
        shuffle(current) # Shuffle the 24 sequences from the same DNA section
        random_select = random.randint(18, 24) # Number of sequences (18 to 24, inclusive) allocated to the training set
        first_list.extend(current[:random_select])
        second_list.extend(current[random_select:])
        current = []

shuffle(first_list) # Shuffle again to eliminate dependencies
shuffle(second_list) # Shuffle again to eliminate dependencies

seq_record_list = first_list + second_list

print("Number of sequences in training/validation set are: " + str(len(first_list)))
print("Number of sequences in testing set are: " + str(len(second_list)))

Number of sequences in training/validation set are: 7102
Number of sequences in testing set are: 986
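
As a sanity check on these counts, the expected split sizes can be worked out from the parameters of the cell above. This is a rough sketch, assuming `random.randint(18, 24)` draws uniformly from the seven values 18 to 24 (so 21 on average); the variable names are illustrative only.

```python
total_sequences = 8088   # from len(seq_record_list)
section_size = 24        # sequences per DNA section
low, high = 18, 24       # inclusive bounds passed to random.randint

sections = total_sequences // section_size   # number of complete sections
mean_per_section = (low + high) / 2          # average training allocation per section
expected_train = sections * mean_per_section
expected_test = total_sequences - expected_train

print(sections, expected_train, expected_test)
```

This gives 337 sections, with about 7077 training and 1011 test sequences expected on average, so the observed 7102 / 986 split is in line with the allocation rule. Note that because the upper bound is 24, some sections may contribute no sequences at all to the test set.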


In [24]:
train_val_num = len(first_list)
test_num = len(second_list)

The following cell transforms the data into a format that is recognizable by the neural network model.

In [25]:
# A helper function to flatten a 2d list to 1d.
# Input: [[1, 2], [2, 3], [3, 4, 5]]
# Output: [1, 2, 2, 3, 3, 4, 5]
def flatten(lst):
    new_lst = []
    for sub_lst in lst:
        for item in sub_lst:
            new_lst.append(item)
    return new_lst

# A helper function to transform a lst so that its length becomes read_len by:
# 1. If len(lst) > read_len, curtail the end of the lst.
# 2. If len(lst) < read_len, keep extending the end of the lst with 0 (NA).
def curtail(lst, read_len, motif_number):
    if len(lst) > read_len:
        lst = lst[:read_len]
    else:
        for i in range(read_len - len(lst)):
            lst.append([0 for _ in range(motif_number + 4)])
    return lst

# Produce the train-test split
# length_read: the length that you want all DNA sequences to conform to
def prepare_input(training_size, test_size, length_read, original_list, motif_number):
    X_train = []
    y_train = []
    X_test = []
    y_test = []
    seq_count = 0
    while seq_count < training_size:
        X_train.append(flatten(curtail(original_list[seq_count][3], length_read, motif_number)))
        y_train.append(int(original_list[seq_count][1]))
        seq_count += 1
    while seq_count < (training_size + test_size):
        X_test.append(flatten(curtail(original_list[seq_count][3], length_read, motif_number)))
        y_test.append(int(original_list[seq_count][1]))
        seq_count += 1
    return X_train, y_train, X_test, y_test

# Turn list into numpy tensors that can directly feed into a neural network model
def to_np_array(X_train, y_train, X_test, y_test):
    X_train = np.array(X_train)
    y_train = np.array(y_train)
    if len(y_train.shape) == 1:
        y_train = np.transpose(np.array([y_train]))
    X_test = np.array(X_test)
    y_test = np.array(y_test)
    if len(y_test.shape) == 1:
        y_test = np.transpose(np.array([y_test]))
    return X_train, y_train, X_test, y_test

In [26]:
X_train, y_train, X_test, y_test = prepare_input(train_val_num, test_num, curtail_len, seq_record_list, motif_num)
X_train, y_train, X_test, y_test = to_np_array(X_train, y_train, X_test, y_test)
[X_train.shape, y_train.shape, X_test.shape, y_test.shape]

[(7102, 21000), (7102, 1), (986, 21000), (986, 1)]
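
The second dimension of 21000 comes from 3000 positions times 7 values per position (4 one-hot base channels plus 3 motif channels). The toy sketch below shows the same pad, truncate, and flatten steps on a tiny input. It uses a non-mutating variant of `curtail` written for this illustration (the helper above appends to the list it is given, so it pads the caller's record in place); the sample vectors are made up.

```python
def flatten(lst):
    # Flatten a 2d list to 1d, same behaviour as the helper above.
    return [item for sub_lst in lst for item in sub_lst]

def curtail_copy(lst, read_len, motif_number):
    # Illustrative variant: returns a new list instead of mutating the input.
    if len(lst) > read_len:
        return lst[:read_len]
    width = motif_number + 4
    padding = [[0] * width for _ in range(read_len - len(lst))]
    return lst + padding

motif_number = 3
short_seq = [[1, 0, 0, 0, 0, 0, 0],
             [0, 1, 0, 0, 1, 0, 0]]           # 2 positions, assumed toy data
long_seq = [[0, 0, 1, 0, 0, 0, 0]] * 5        # 5 positions, assumed toy data

padded = curtail_copy(short_seq, 3, motif_number)     # padded with one all-zero row
truncated = curtail_copy(long_seq, 3, motif_number)   # cut down to 3 rows

print(len(flatten(padded)), len(flatten(truncated)))  # both 3 * 7 values long
print(3000 * (motif_number + 4))                      # row width used in this experiment
```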

We run the experiment with four LSTM layers, having 8, 8, 4, and 4 units respectively, for 30 epochs (the value set in the `model.fit` call):

In [27]:
from keras.models import Model, Sequential
from keras.layers import Dense, CuDNNLSTM, CuDNNGRU

ModuleNotFoundError: No module named 'keras'

In [None]:
X_train_rnn = X_train.reshape(train_val_num, curtail_len, motif_num + 4)
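
The reshape undoes the earlier flattening: each row of length `curtail_len * (motif_num + 4)` becomes a `(curtail_len, motif_num + 4)` matrix with one row per sequence position. A small NumPy sketch with toy sizes (the names `n_seq`, `seq_len`, and `width` are illustrative, not part of the notebook) shows that position `p` of the reshaped array is exactly the slice `[p * width : (p + 1) * width]` of the flat row.

```python
import numpy as np

n_seq, seq_len, width = 2, 4, 7            # toy sizes; the experiment uses 3000 and 7
flat = np.arange(n_seq * seq_len * width).reshape(n_seq, seq_len * width)

# Same kind of reshape as the cell above, on the toy array.
as_sequences = flat.reshape(n_seq, seq_len, width)

p = 1
print(as_sequences.shape)
print(np.array_equal(as_sequences[0, p], flat[0, p * width:(p + 1) * width]))
```

If the test set is later passed to the model, it would presumably need the same reshape with `test_num` in place of `train_val_num`; that step is not shown in this notebook.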

In [None]:
model = Sequential()
model.add(CuDNNLSTM(8, input_shape=(curtail_len, motif_num + 4), return_sequences=True))
model.add(CuDNNLSTM(8, return_sequences=True))
model.add(CuDNNLSTM(4, return_sequences=True))
model.add(CuDNNLSTM(4))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(X_train_rnn, y_train, epochs=30, batch_size=128, validation_split=0.1)

**Result**:

The following cell **visualizes** the training/validation accuracies and losses for each epoch.

In [None]:
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training Accuracy')
plt.plot(epochs, val_acc, 'b', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training Loss')
plt.plot(epochs, val_loss, 'b', label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.legend()

plt.show()

**Conclusion**:
- From the graphs above, one can conclude that the old RNN model does not perform satisfactorily on the correct data. Hence, we need to overhaul the neural network model.