This notebook builds a very elementary neural network that tries to predict the expression of certain genes from DNA sequences in non-coding regions.

In [1]:
import os
from Bio import SeqIO
import numpy as np
import tensorflow as tf


### Data Processing
We first apply **one-hot encoding** to translate the DNA bases A, C, G, and T into corresponding 0/1 vectors. Note that many DNA sequences contain 'n' (unknown base) characters; we encode these as all zeros.

In [2]:
base_pairs = {'A': [1, 0, 0, 0],
              'C': [0, 1, 0, 0],
              'G': [0, 0, 1, 0],
              'T': [0, 0, 0, 1],
              'a': [1, 0, 0, 0],
              'c': [0, 1, 0, 0],
              'g': [0, 0, 1, 0],
              't': [0, 0, 0, 1],
              'n': [0, 0, 0, 0],
              'N': [0, 0, 0, 0]}

The following functions turn the one-hot encoded DNA data into an input matrix that can be fed into neural network algorithms. The main points to note are:
1. DNA sequences have different lengths, some very short (~100 bases), some very long (~3000 bases). Since most sequences are 1000 to 2000 bases long, we use only the first 1000 bases of each sequence to train the neural network and make predictions. Longer sequences are truncated to 1000 bases; shorter ones are padded with zeros.
2. DNA sequences come from different strands, some negative, some positive. We take the complement of a sequence if it comes from the negative strand so that all our data is from the same (positive) strand.
3. The entire sequence is *flattened*. For example, AGCT is transformed into [1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1], where the first four values represent A, the next four represent G, and so on. A sequence of 1000 bases therefore becomes a vector of 4000 values.
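
To make the three steps concrete, here is a minimal, self-contained sketch of encoding, flattening, and padding on a toy sequence. The helper name `encode_and_pad` and the tiny target length are illustrative assumptions, not part of the notebook's pipeline.

```python
# Illustrative sketch; `encode_and_pad` is a hypothetical helper, not the notebook's code.
ONE_HOT = {'A': [1, 0, 0, 0], 'C': [0, 1, 0, 0],
           'G': [0, 0, 1, 0], 'T': [0, 0, 0, 1],
           'N': [0, 0, 0, 0]}

def encode_and_pad(sequence, n_bases):
    """One-hot encode `sequence`, flatten it, and fix its length to 4 * n_bases values."""
    flat = [bit for base in sequence.upper() for bit in ONE_HOT[base]]
    target = 4 * n_bases
    # Truncate long sequences; pad short ones with zeros.
    return flat[:target] + [0] * max(0, target - len(flat))

print(encode_and_pad("AGCT", 4))   # the flattened example from the list above
print(encode_and_pad("AG", 4))     # short sequence: padded with eight zeros
print(encode_and_pad("AGCTN", 4))  # long sequence: truncated to the first four bases
```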

In [3]:
def align_sequence(length, sequence):
    # Truncate to `length` values, or pad with zeros if the sequence is shorter.
    if len(sequence) > length:
        aligned_seq = sequence[:length]
    else:
        aligned_seq = sequence + [0]*(length-len(sequence))
    return aligned_seq

def to_positive_strand(strand, sequence):
    # One-hot encode each base, complementing sequences from the negative strand.
    if strand == '-':
        unflattened_seq = [base_pairs[n] for n in sequence.complement()]
    else:
        unflattened_seq = [base_pairs[n] for n in sequence]
    return unflattened_seq

def process_seq_record(seq_record, X, y):
    # The header is '|'-separated: field 1 is the expression label, field 3 the strand.
    header = seq_record.description.split('|')
    expressed = int(header[1])
    y.append(expressed)
    unflattened_seq = to_positive_strand(header[3], seq_record.seq)
    flattened_seq = [i for x in unflattened_seq for i in x]
    # 1000 bases x 4 values per base = 4000 features.
    aligned_seq = align_sequence(4000, flattened_seq)
    X.append(np.array(aligned_seq))

def read_file(file, X, y):
    # Parse one FASTA file and append every record to X and y.
    for seq_record in SeqIO.parse("data/input/3.24_species_only/" + file, "fasta"):
        process_seq_record(seq_record, X, y)
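
One detail worth checking in `to_positive_strand`: taking the complement alone keeps the bases in their original order, whereas reading the opposite strand in the conventional 5'→3' direction requires the reverse complement. Which one is appropriate depends on how the sequences in the input files were extracted, which the notebook does not state, so the sketch below only illustrates the difference in plain Python (the helper names are hypothetical). Biopython's `Seq` objects provide both `complement()` and `reverse_complement()`.

```python
# Hypothetical helpers that illustrate complement vs. reverse complement.
COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N'}

def complement(seq):
    """Swap each base for its pairing partner, keeping the original order."""
    return ''.join(COMPLEMENT[b] for b in seq.upper())

def reverse_complement(seq):
    """Complement the sequence and reverse it."""
    return complement(seq)[::-1]

seq = "AAGC"
print(complement(seq))          # TTCG
print(reverse_complement(seq))  # GCTT
```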

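`training_size` and `test_size` are the numbers of files to read for training and testing. Each file contributes 24 sequences, so reading 500 files gives 12000 training sequences and reading 20 files gives 480 test sequences. For this simple model, we train the neural network on those 12000 sequences and test its performance on the 480.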

In [54]:
training_size = 500
test_size = 20
file_count = 0

X_train = []
y_train = []
X_test = []
y_test = []

for file in os.listdir("data/input/3.24_species_only"):
    if (file_count < training_size):
        read_file(file, X_train, y_train)
    elif (file_count < training_size + test_size):
        read_file(file, X_test, y_test)
    file_count += 1
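
`os.listdir` returns entries in arbitrary order, so which files end up in the training and test sets can differ between machines. A minimal sketch of a reproducible split, assuming it is acceptable to sort the file names first (the function `split_files` and the example names are hypothetical):

```python
def split_files(file_names, n_train, n_test):
    """Return (train_files, test_files) from a deterministically ordered list."""
    ordered = sorted(file_names)
    return ordered[:n_train], ordered[n_train:n_train + n_test]

# Example with made-up file names; in the notebook these would come from os.listdir(...).
names = ["c.fa", "a.fa", "d.fa", "b.fa", "e.fa"]
train_files, test_files = split_files(names, n_train=3, n_test=1)
print(train_files)  # ['a.fa', 'b.fa', 'c.fa']
print(test_files)   # ['d.fa']
```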

We examine the shapes of the training and test matrices to check that the code above works as expected.

In [55]:
X_train = np.array(X_train).astype(int)
y_train = np.transpose(np.array([y_train]).astype(int))
X_test = np.array(X_test).astype(int)
y_test = np.transpose(np.array([y_test]).astype(int))
[X_train.shape, y_train.shape, X_test.shape, y_test.shape]

[(12000, 4000), (12000, 1), (480, 4000), (480, 1)]

### Logistic Regression
Before moving on to the neural network, we first fit a very simple logistic regression model to get a feel for the prediction procedure.

In [58]:
from sklearn import linear_model as lm

In [59]:
model = lm.LogisticRegression()
model.fit(X_train, y_train.ravel())

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [57]:
y_predicted = np.array(model.predict(X_test))
round(sum(y_test.ravel() == y_predicted)/y_test.shape[0], 2)

0.55

An accuracy of 55% is only slightly better, if at all, than random guessing. This suggests that a lot of work remains before we get a satisfactory neural network.
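
Whether 55% beats chance depends on the class balance of the labels: if one class dominates, always predicting it can score higher than 50%. A small sketch of a majority-class baseline, using made-up labels rather than the notebook's data (`majority_baseline` is a hypothetical helper):

```python
import numpy as np

def majority_baseline(y):
    """Accuracy obtained by always predicting the most frequent binary label."""
    y = np.asarray(y).ravel()
    positive_rate = y.mean()
    return max(positive_rate, 1 - positive_rate)

toy_labels = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])  # 3 ones, 7 zeros
print(majority_baseline(toy_labels))  # 0.7
```

Applied to the notebook, `majority_baseline(y_test)` would give the figure to compare 0.55 against.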

### Neural Network
Now that we have the data ready as numpy arrays with the correct shapes, we can train the neural network on our training data in TensorFlow and use our test data to see how accurately it predicts.

**TODO**: Fix the problem and somehow make the neural network actually **RUN**!!

In [6]:
X = tf.placeholder(tf.float32, [None, 4000])
W = tf.Variable(tf.truncated_normal([4000, 1], stddev=0.1))
B = tf.Variable(tf.zeros([1]))

# model
# NOTE: softmax over a single output unit always evaluates to 1.0
Y = tf.nn.softmax(tf.matmul(X, W) + B)
# placeholder for correct labels
Y_ = tf.placeholder(tf.float32, [None, 1])
# global_variables_initializer replaces the deprecated initialize_all_variables
init = tf.global_variables_initializer()

cross_entropy = -tf.reduce_sum(Y_ * tf.log(Y))

# % of correct answers found in batch
is_correct = tf.equal(tf.argmax(Y,1), tf.argmax(Y_,1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))

optimizer = tf.train.GradientDescentOptimizer(0.003)
train_step = optimizer.minimize(cross_entropy)

In [7]:
sess = tf.Session()
sess.run(init)

train_data={X: X_train, Y_: y_train}
sess.run([accuracy, cross_entropy], feed_dict=train_data)

[1.0, -0.0]

In [8]:
test_data={X: X_test, Y_: y_test}
sess.run([accuracy, cross_entropy], feed_dict=test_data)

[1.0, -0.0]
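
The perfect scores above are an artifact rather than the result of a trained model. With a single output unit, softmax normalizes each row over one value and so always returns 1.0; `argmax` along an axis of size 1 is always 0 for both predictions and labels, so accuracy is 1.0, and `log(1.0) = 0` makes the cross-entropy zero. In addition, `train_step` is never run in the cells above. The NumPy sketch below reproduces the effect and shows one common alternative for binary labels, a sigmoid output with binary cross-entropy. This is a suggested direction under those assumptions, not the notebook's method, and the toy numbers are made up.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([[-2.0], [0.0], [3.0]])   # shape (3, 1): one output unit
labels = np.array([[0.0], [1.0], [1.0]])

probs_softmax = softmax(logits)
print(probs_softmax.ravel())                 # [1. 1. 1.] regardless of the logits
print(np.argmax(probs_softmax, axis=1))      # [0 0 0], same as argmax of the labels

# Alternative: sigmoid output, threshold at 0.5, binary cross-entropy loss.
p = sigmoid(logits)
predictions = (p > 0.5).astype(float)
accuracy = (predictions == labels).mean()
bce = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
print(accuracy, bce)
```

In TensorFlow 1.x the corresponding building blocks would be `tf.sigmoid` and `tf.nn.sigmoid_cross_entropy_with_logits`, followed by repeated `sess.run(train_step, ...)` calls to actually update the weights.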