**Date**: 2018-08-17

**Authors**: Zhanyuan Zhang

**Purpose**: Apply a convolutional neural network (CNN) to train the binary classification model.
- Use 1D convolution, max pooling, and dropout to improve accuracy.
- Include bias terms in the layers.
- Use ReLU activations in the hidden (learning) layers and softmax for the final classification.

**Background**: The current model, trained with a recurrent neural network (RNN), gives low accuracy. We originally switched from a CNN to an RNN because the switch improved accuracy by 10%, for reasons we did not pin down. However, given the math behind these two architectures, a CNN should be better at handling spatial data, which is what we have here, since the order of nucleotides in a sequence matters.

**Experiment**:

In [1]:
import pickle
import numpy as np
import matplotlib.pyplot as plt
from utility import flatten
from utility import curtail
from utility import prepare_input
from utility import to_np_array
from utility import unpickle

In [None]:
real_buffer_path = "/home/ubuntu/formatted/10_percent/random_0.1_instance_7.txt"
# random_buffer_path = "/home/ubuntu/formatted/random_sequences/random_sequence_buffer.txt"
curtail_len = 3000
motif_num = 3

In [None]:
seq_record_list = unpickle(real_buffer_path)
len(seq_record_list)

The following cell shuffles the sequences and splits them into training and test sets. Each DNA section consists of 24 segments. For every section, a random number of its sequences (between 18 and 24) is allocated to the training data, so every section is guaranteed to be represented there. In this way, the final trained model can learn the characteristics of all the sections in the data we have.

In [None]:
import random
from random import shuffle

first_list = [] # to add to training set
second_list = [] # to add to test set
current = [] # contains all 24 sequences from the same DNA section

for i in range(len(seq_record_list)):
    current.append(seq_record_list.pop())
    if len(current) == 24:
        shuffle(current) # Shuffle the 24 sequences from the same DNA section
        random_select = random.randint(18, 24) # Allocate the number of sequences to the training set
        first_list.extend(current[:random_select])
        second_list.extend(current[random_select:])
        current = []

shuffle(first_list) # Shuffle again to eliminate dependencies
shuffle(second_list) # Shuffle again to eliminate dependencies

seq_record_list = first_list + second_list

print("Number of sequences in training/validation set: " + str(len(first_list)))
print("Number of sequences in testing set: " + str(len(second_list)))
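The per-section split above can also be written as a reusable function. The following is a minimal sketch, not the notebook's own code: the name `split_by_section` and its parameters are hypothetical, and it walks the list from the front (dropping any incomplete trailing section) rather than popping from the end. A `seed` argument is added so the split can be reproduced.

```python
import random

def split_by_section(records, section_size=24, min_train=18, seed=None):
    """Split records into (train, test) lists, one DNA section at a time.

    Hypothetical helper: assumes `records` is ordered so that every
    consecutive run of `section_size` items belongs to one section.
    """
    rng = random.Random(seed)
    train, test = [], []
    # Only complete sections are used; a trailing partial section is dropped.
    for start in range(0, len(records) - section_size + 1, section_size):
        section = list(records[start:start + section_size])
        rng.shuffle(section)
        # randint is inclusive on both ends, so a section may contribute
        # all of its sequences to training and none to testing.
        n_train = rng.randint(min_train, section_size)
        train.extend(section[:n_train])
        test.extend(section[n_train:])
    # Shuffle again so that neighbours no longer come from the same section.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

train, test = split_by_section(list(range(48)), seed=0)
print(len(train), len(test))
```

Because the upper bound of `randint` is inclusive, the test set can end up small; lowering the upper bound to `section_size - 1` would guarantee at least one test sequence per section, if that is desired.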

In [None]:
train_val_num = len(first_list)
test_num = len(second_list)

The following cell transforms the data into NumPy arrays in a format that the neural network model can accept.

In [None]:
X_train, y_train, X_test, y_test = prepare_input(train_val_num, test_num, curtail_len, seq_record_list, motif_num)
X_train, y_train, X_test, y_test = to_np_array(X_train, y_train, X_test, y_test)

# Check the shape of training and testing data
[X_train.shape, y_train.shape, X_test.shape, y_test.shape]
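The implementation of `prepare_input` lives in `utility` and is not shown in this notebook. As an illustration only of what "a format the network can accept" can look like, the sketch below one-hot encodes a nucleotide string, truncating or zero-padding it to a fixed length and flattening the result. The function name `one_hot_flat`, the `ACGT` alphabet, and the handling of unknown symbols are all assumptions, not the project's actual encoding.

```python
import numpy as np

NUCLEOTIDES = "ACGT"  # assumed alphabet; the real encoding is defined in utility

def one_hot_flat(seq, length):
    """One-hot encode `seq`, cut or zero-pad it to `length`, then flatten."""
    encoded = np.zeros((length, len(NUCLEOTIDES)), dtype=np.float32)
    for i, base in enumerate(seq[:length]):
        j = NUCLEOTIDES.find(base.upper())
        if j >= 0:  # unknown symbols (for example N) stay all-zero
            encoded[i, j] = 1.0
    return encoded.reshape(-1)

x = one_hot_flat("ACGTN", length=6)
print(x.shape)
```

Fixing the length up front (here via `length`, analogous to `curtail_len`) is what lets sequences of different sizes be stacked into a single array.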

In [None]:
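from keras.models import Model, Sequential
from keras.layers import Dense, Conv1D, MaxPooling1D, Dropout, Flatten  # Flatten is used in the model below
from keras.optimizers import Adam  # optimizer passed to model.compile
from keras.activations import relu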

In [None]:
LR = 5e-2
model = Sequential()
# Conv1D takes input_shape=(steps, channels); the batch dimension is implicit
model.add(Conv1D(filters=1, kernel_size=210, input_shape=(21000, 1), activation="relu"))
model.add(MaxPooling1D())
model.add(Conv1D(filters=1, kernel_size=420, activation="relu"))
model.add(MaxPooling1D())
model.add(Flatten())
model.add(Dense(1024, activation="relu"))
model.add(Dropout(0.25))
model.add(Dense(2, activation="softmax"))
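To see how the sequence length shrinks through the stack above, the following sketch traces the output length of each layer by hand. It assumes the Keras defaults that the model relies on: `padding="valid"` and stride 1 for `Conv1D`, and `pool_size=2` with a matching stride for `MaxPooling1D`. The helper names are hypothetical.

```python
def conv1d_out(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

def maxpool1d_out(length, pool_size=2):
    """Output length of 1D max pooling with stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1

length = 21000
length = conv1d_out(length, 210)   # first Conv1D
length = maxpool1d_out(length)     # first MaxPooling1D
length = conv1d_out(length, 420)   # second Conv1D
length = maxpool1d_out(length)     # second MaxPooling1D

flat = length * 1                  # Flatten: steps x filters, with filters=1
dense_params = flat * 1024 + 1024  # weights plus biases of Dense(1024)
print(length, dense_params)
```

With a single filter per convolution, almost all trainable parameters sit in the first `Dense` layer, which is worth keeping in mind when tuning the dropout rate or the number of filters.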

In [None]:
model.compile(optimizer=Adam(lr=LR),
              loss='categorical_crossentropy',
              metrics=['acc'])
# Train on X_train from the preparation cell; 10% of it is held out for validation
history = model.fit(X_train, y_train, epochs=30, batch_size=128, validation_split=0.1)
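`Conv1D` expects 3-D input of shape `(samples, steps, channels)`, and `categorical_crossentropy` with a two-unit softmax expects one-hot labels of shape `(samples, 2)`. Whether `prepare_input` and `to_np_array` already return arrays in these shapes is not shown here, so the sketch below is only a hedged illustration, using placeholder arrays, of how the conversion could be done with NumPy if they do not.

```python
import numpy as np

# Placeholder data standing in for X_train / y_train (assumed 2-D features
# and integer class labels; the real arrays come from prepare_input).
X = np.zeros((5, 21000), dtype=np.float32)
y = np.array([0, 1, 1, 0, 1])

# Add a trailing channel axis: (samples, 21000) -> (samples, 21000, 1)
X_cnn = X[..., np.newaxis]

# One-hot encode the labels by indexing rows of the identity matrix
y_onehot = np.eye(2, dtype=np.float32)[y]

print(X_cnn.shape, y_onehot.shape)
```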

In [None]:

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training Accuracy')
plt.plot(epochs, val_acc, 'b', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training Loss')
plt.plot(epochs, val_loss, 'b', label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.legend()

plt.show()