**Date**: 2018-08-17

**Authors**: Zhanyuan Zhang

**Purpose**: Apply a convolutional neural network (CNN) to train the binary classification model.
- Use operations like 1D convolution, max pooling, and dropout to improve accuracy.
- Add bias in layers.
- Use ReLU in the hidden layers, and a sigmoid output (the two-class case of softmax) for the final classification.

**Background**: The current model, trained with a recurrent neural network (RNN), gives low accuracy. We originally switched from a CNN to an RNN because doing so improved accuracy by 10%, for reasons we did not fully understand. However, given the math behind these two kinds of networks, a CNN should be better at handling spatial data, which is what we have here, since the order of nucleotides in a sequence matters.

**Experiment**:

In [1]:
import pickle
import numpy as np
import matplotlib.pyplot as plt
from utility import flatten
from utility import curtail
from utility import prepare_input
from utility import to_np_array
from utility import unpickle



In [2]:
real_buffer_path = "/home/ubuntu/data3/10_percent/random_0.1_instance_7.txt"
# random_buffer_path = "/home/ubuntu/formatted/random_sequences/random_sequence_buffer.txt"
curtail_len = 3000
motif_num = 3

In [3]:
seq_record_list = unpickle(real_buffer_path)
len(seq_record_list)

8088

In [4]:
import random
from random import shuffle

first_list = [] # to add to training set
second_list = [] # to add to test set
current = [] # contains all 24 sequences from the same DNA section

for i in range(len(seq_record_list)):
    current.append(seq_record_list.pop())
    if len(current) == 24:
        shuffle(current) # Shuffle the 24 sequences from the same DNA section
        random_select = random.randint(18, 24) # Number of the 24 sequences sent to the training set (both bounds inclusive)
        first_list.extend(current[:random_select])
        second_list.extend(current[random_select:])
        current = []

shuffle(first_list) # Shuffle again to eliminate dependencies
shuffle(second_list) # Shuffle again to eliminate dependencies

seq_record_list = first_list + second_list

print("Number of sequences in training/validation set are: " + str(len(first_list)))
print("Number of sequences in testing set are: " + str(len(second_list)))

Number of sequences in training/validation set are: 7073
Number of sequences in testing set are: 1015
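
The split in cell In [4] works one DNA section at a time: each block of 24 sequences is shuffled, then between 18 and 24 of them go to the training/validation set and the rest go to the test set. A minimal, seeded sketch of the same idea follows. The function `split_by_group` and its parameters are hypothetical names (not part of `utility`), the integers are placeholders for sequence records, and it assumes, as the cell above does, that consecutive records belong to the same section.

```python
import random

def split_by_group(records, group_size=24, min_train=18, seed=0):
    """Split records into train/test lists, one block of `group_size` at a time."""
    rng = random.Random(seed)  # local RNG so the split is reproducible
    train, test = [], []
    for start in range(0, len(records), group_size):
        group = list(records[start:start + group_size])
        rng.shuffle(group)  # shuffle within the block
        # randint is inclusive on both ends, so a whole block may go to train
        n_train = rng.randint(min(min_train, len(group)), len(group))
        train.extend(group[:n_train])
        test.extend(group[n_train:])
    rng.shuffle(train)  # shuffle across blocks
    rng.shuffle(test)
    return train, test

records = list(range(8088))  # placeholder for the 8088 sequence records
train, test = split_by_group(records)
print(len(train), len(test))
```

Because the number sent to training is drawn per block, the train/test sizes vary from run to run unless a seed is fixed, which is why the sketch uses its own `random.Random(seed)`.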


In [5]:
train_val_num = len(first_list)
test_num = len(second_list)

In [6]:
X_train, y_train, X_test, y_test = prepare_input(train_val_num, test_num, curtail_len, seq_record_list, motif_num)
X_train, y_train, X_test, y_test = to_np_array(X_train, y_train, X_test, y_test)

# Check the shape of training and testing data
[X_train.shape, y_train.shape, X_test.shape, y_test.shape]

[(7073, 21000), (7073, 1), (1015, 21000), (1015, 1)]

In [7]:
from keras.models import Model, Sequential
from keras.layers import Dense, Conv1D, MaxPooling1D, Dropout, Flatten
from keras.activations import relu
from keras.optimizers import Adam

Using TensorFlow backend.


In [8]:
X_train_cnn = np.expand_dims(X_train, axis=2)
X_train_cnn.shape

(7073, 21000, 1)
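
`Conv1D` expects input of shape `(batch, steps, channels)`, so the 2D array needs a trailing channel axis of size 1. The small sketch below shows the reshaping on a stand-in array (the 4-row `X` is an illustrative placeholder, not the real `X_train`).

```python
import numpy as np

X = np.zeros((4, 21000), dtype=np.float32)  # small stand-in for X_train
X_cnn = np.expand_dims(X, axis=2)           # append a channel axis of size 1
print(X_cnn.shape)                          # (4, 21000, 1)

# Equivalent spelling using indexing
same = X[..., np.newaxis]
print(np.array_equal(X_cnn, same))          # True
```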

In [46]:
LR = 5e-2
model = Sequential()
model.add(Conv1D(filters=1, kernel_size=200, input_shape=(21000, 1), activation="relu", use_bias=True))
model.add(MaxPooling1D())
model.add(Conv1D(filters=1, kernel_size=400, activation="relu", use_bias=True))
model.add(MaxPooling1D())
model.add(Flatten())
model.add(Dense(1000, activation="relu"))
model.add(Dropout(0.7))
model.add(Dense(1, activation="sigmoid"))
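
To see how large the network in In [46] is, the sequence length after each layer can be worked out by hand. The sketch below assumes the Keras defaults that the cell relies on (`padding="valid"` and stride 1 for `Conv1D`, `pool_size=2` for `MaxPooling1D`); the helper names `conv1d_len` and `maxpool1d_len` are hypothetical and only encode that arithmetic.

```python
def conv1d_len(length, kernel_size, stride=1):
    """Output length of a 'valid' 1D convolution."""
    return (length - kernel_size) // stride + 1

def maxpool1d_len(length, pool_size=2):
    """Output length of 'valid' max pooling with stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1

after_conv1 = conv1d_len(21000, 200)        # 20801
after_pool1 = maxpool1d_len(after_conv1)    # 10400
after_conv2 = conv1d_len(after_pool1, 400)  # 10001
after_pool2 = maxpool1d_len(after_conv2)    # 5000
flat = after_pool2 * 1                      # one filter, so 5000 features

# Trainable parameters: weights plus biases per layer
params = {
    "conv1": 200 * 1 * 1 + 1,
    "conv2": 400 * 1 * 1 + 1,
    "dense_1000": flat * 1000 + 1000,
    "dense_out": 1000 * 1 + 1,
}
print(flat, params, sum(params.values()))
```

Under these assumptions almost all of the roughly 5 million parameters sit in the `Dense(1000)` layer, while each convolution contributes only a single filter with a few hundred weights.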

In [47]:
model.compile(optimizer=Adam(lr=LR),
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(X_train_cnn, y_train, epochs=30, batch_size=128, validation_split=0.1)

Train on 6365 samples, validate on 708 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30

KeyboardInterrupt: 