## Last time: CNN
* Trained model
* [Geometric deep learning](http://geometricdeeplearning.com/)

## Non-image CNN: Sequence motif
* [Position weight matrix](https://en.wikipedia.org/wiki/Position_weight_matrix)

![](img/pwm.png)

![](http://jaspar.genereg.net/static/logos/svg/MA0149.1.svg)

In [1]:
import numpy as np

seq_length = 40
num_train = 1000
num_val = 100

# PFM from JASPAR
motif = np.array([[   0,   2, 104, 104,   1,   2, 103, 102,   0,   0,  99, 105,   0,   0, 100, 102,   5,   3],
                  [   0,   0,   0,   0,   0,   0,   0,   0,   0,   2,   4,   0,   0,   2,   3,   0,   0,   3],
                  [ 105, 103,   1,   1, 104, 102,   2,   3, 104, 103,   2,   0, 105, 103,   0,   2,  97,  97],
                  [   0,   0,   0,   0,   0,   1,   0,   0,   1,   0,   0,   0,   0,   0,   2,   1,   3,   2]])
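As a rough illustration of how a position frequency matrix (PFM) relates to a position weight matrix, the sketch below normalizes the counts above into per-position probabilities and then into log-odds scores. The pseudocount of 1 and the uniform background of 0.25 are assumptions made for this example, not values taken from JASPAR.

```python
import numpy as np

# Same PFM as above (rows: A, C, G, T; columns: motif positions)
pfm = np.array([[  0,   2, 104, 104,   1,   2, 103, 102,   0,   0,  99, 105,   0,   0, 100, 102,   5,   3],
                [  0,   0,   0,   0,   0,   0,   0,   0,   0,   2,   4,   0,   0,   2,   3,   0,   0,   3],
                [105, 103,   1,   1, 104, 102,   2,   3, 104, 103,   2,   0, 105, 103,   0,   2,  97,  97],
                [  0,   0,   0,   0,   0,   1,   0,   0,   1,   0,   0,   0,   0,   0,   2,   1,   3,   2]])

pseudocount = 1.0   # assumed value, avoids log(0) for unseen bases
background = 0.25   # assumed uniform background base frequency

# position probability matrix: every column sums to 1
smoothed = pfm + pseudocount
ppm = smoothed / smoothed.sum(axis=0)

# position weight matrix: log-odds of each base against the background
pwm = np.log2(ppm / background)

# consensus sequence: most frequent base at each position
bases = np.array(['A', 'C', 'G', 'T'])
consensus = ''.join(bases[pfm.argmax(axis=0)])
print(consensus)
```

The consensus string gives a quick reference to compare against whatever the convolution filter learns later on.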

In [2]:
def datagen(seq_length, num_sample, motif):
    from tensorflow.keras.utils import to_categorical
    
    freq = np.hstack([np.ones((4,(seq_length-motif.shape[1])//2)), 
                      motif,
                      np.ones((4,(seq_length-motif.shape[1])//2))])

    # normalize each column of counts to probabilities and sample positive sequences
    pos = np.array([np.random.choice(['A', 'C', 'G', 'T'], num_sample, p=freq[:,i]/sum(freq[:,i])) 
                    for i in range(seq_length)]).transpose()

    # negative samples: every base is drawn uniformly at random
    neg = np.array([np.random.choice(['A', 'C', 'G', 'T'], num_sample, p=np.array([1,1,1,1])/4.0)
                for i in range(seq_length)]).transpose()
    
    pos_tensor = np.zeros(list(pos.shape) + [4])
    neg_tensor = np.zeros(list(neg.shape) + [4])

    base_dict = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

    #naive one-hot encoding
    for row in range(num_sample):
        for col in range(seq_length):
            pos_tensor[row,col,base_dict[pos[row,col]]] = 1
            neg_tensor[row,col,base_dict[neg[row,col]]] = 1

    X = np.vstack((pos_tensor, neg_tensor))
    y = to_categorical(np.concatenate((np.ones(num_sample), np.zeros(num_sample))), 2)
    return X, y
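The double loop marked "naive one-hot encoding" can also be written with NumPy broadcasting. The helper below is a minimal sketch; the name `one_hot` and the `alphabet` parameter are hypothetical and not part of the original notebook.

```python
import numpy as np

def one_hot(seqs, alphabet='ACGT'):
    """One-hot encode an array of single characters with shape (n, length)."""
    seqs = np.asarray(seqs)
    # compare every character against every letter of the alphabet;
    # broadcasting yields a boolean array of shape (n, length, len(alphabet))
    return (seqs[..., None] == np.array(list(alphabet))).astype(float)

demo = np.array([list('ACGT'), list('GGAA')])
encoded = one_hot(demo)
print(encoded.shape)  # (2, 4, 4)
```

Applied to `pos` and `neg`, this should produce the same tensors as the loop, since the column order matches `base_dict`.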

In [3]:
def dataset(seq_length, num_train, num_val, motif):
    X_train, y_train = datagen(seq_length=seq_length,
                               num_sample=num_train,
                               motif=motif)
    X_val, y_val = datagen(seq_length=seq_length,
                           num_sample=num_val,
                           motif=motif)
    return (X_train, y_train), (X_val, y_val)

(X_train, y_train), (X_val, y_val) = dataset(seq_length, num_train, num_val, motif)

In [4]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, Dense, Flatten, Dropout, MaxPooling1D
from tensorflow.keras.activations import relu
from tensorflow.keras.optimizers import SGD

model = Sequential()
model.add(Conv1D(filters=1, 
                 kernel_size=17,
                 padding='same',
                 input_shape=(seq_length, 4),
                 activation='relu'))

model.add(Flatten())
model.add(Dense(2, activation='softmax'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d (Conv1D)              (None, 40, 1)             69        
_________________________________________________________________
flatten (Flatten)            (None, 40)                0         
_________________________________________________________________
dense (Dense)                (None, 2)                 82        
Total params: 151
Trainable params: 151
Non-trainable params: 0
_________________________________________________________________
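The parameter counts in the summary can be checked by hand. The short sketch below redoes the arithmetic using the layer sizes defined in the cell above.

```python
seq_length, channels = 40, 4      # input shape (seq_length, 4)
filters, kernel_size = 1, 17      # Conv1D settings used above
num_classes = 2

# Conv1D: one weight per (kernel position, input channel, filter) plus one bias per filter
conv_params = kernel_size * channels * filters + filters

# Dense: with padding='same' the flattened conv output has seq_length * filters values
dense_params = (seq_length * filters) * num_classes + num_classes

print(conv_params, dense_params, conv_params + dense_params)  # 69 82 151
```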


In [5]:
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd,
              loss='binary_crossentropy',
              metrics=['accuracy'])

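# note: validation_split holds out the last 30% of the rows before shuffling;
# X_train stacks positives first, so that held-out slice contains only negatives
model.fit(X_train, y_train, validation_split=0.3, epochs=10, shuffle=True)  # starts training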

Train on 1400 samples, validate on 600 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f5e3b5f7cf8>

In [6]:
score = model.evaluate(X_val, y_val, verbose=0)
print(score[1])

1.0


In [7]:
convlayer = model.layers[0]
weights = convlayer.get_weights()[0].squeeze()
print('Convolution parameter shape: {}'.format(weights.shape))

Convolution parameter shape: (17, 4)


In [8]:
num2seq = ['A','C','G','T']

''.join([num2seq[np.argmax(weights[i,:])] for i in range(weights.shape[0])])

'TGAAAGAAAGGAAGGAA'

# Recurrent Neural Networks
**by: Santiago Hincapie**

# Fully connected
![](img/d30.png)

# Recurrent Neural Networks: Process Sequences
![](img/d31.png)

* **one to many** $\to$ Image Captioning
* **many to one** $\to$ Sentiment Classification
* **many to many** $\to$ Machine Translation
* **many to many** $\to$ Video classification on frame level

# Recurrent Neural Networks
![](img/d32.png)

We can process a sequence of vectors $x$ by applying a recurrence formula at every time step:
$$ h_t = f_W (h_{t-1}, x_t) $$

**Notice:** the same function and the same set of parameters are used at every time step.

## Vanilla RNN
$$ h_t = f_W (h_{t-1}, x_t) $$
$$ h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t) $$
$$ y_t = W_{hy}h_t $$
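The recurrence above can be sketched in a few lines of NumPy. The dimensions and the random weights below are illustrative assumptions; as in the formulas, there are no bias terms.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, T = 4, 3, 4, 5  # illustrative sizes

# the same three matrices are reused at every time step
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

def rnn_step(h_prev, x_t):
    """One step of the vanilla RNN: new hidden state and output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

xs = rng.normal(size=(T, input_dim))   # a toy input sequence
h = np.zeros(hidden_dim)               # initial hidden state h_0
ys = []
for x_t in xs:
    h, y = rnn_step(h, x_t)
    ys.append(y)
ys = np.stack(ys)
print(ys.shape)  # (5, 4)
```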

# RNN: Computational Graph
![](img/d33.png)

## Example: Character-level Language Model
**Vocabulary:** [h, e, l, o]<br>
**Training sequence:** "hello"

![](img/de1.png)
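To make the training setup concrete, here is a small sketch of how the sequence "hello" turns into input/target pairs: each character is the input at one time step, and the next character is the target. The variable names are chosen for this example only.

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {c: i for i, c in enumerate(vocab)}
text = 'hello'

# the input at step t is character t, the target is character t + 1
inputs = [char_to_ix[c] for c in text[:-1]]    # h, e, l, l
targets = [char_to_ix[c] for c in text[1:]]    # e, l, l, o

# one-hot encode the inputs: one row per time step
X = np.eye(len(vocab))[inputs]
print(inputs, targets)  # [0, 1, 2, 2] [1, 2, 2, 3]
print(X.shape)          # (4, 4)
```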

#### Test time
![](img/de2.png)

# Backpropagation through time
Run forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.

![](img/d34.png)

# _Truncated_ Backpropagation through time
Run forward and backward through chunks of the sequence instead of the whole sequence. Hidden states are carried forward from chunk to chunk, but gradients are only backpropagated within each chunk.


![](img/d35.png)

## Example!
![](img/d36.png)

![](img/d37.png)

### The Stacks Project: open source algebraic geometry textbook
![](img/de3.png)


![](img/de4.png)

### Image caption
![](img/de5.png)

### Visual Question Answering
![](img/de6.png)

## Multilayer RNNs
![](img/d38.png)

$$ h_t^l = \tanh W^l \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l\end{pmatrix}$$

## Vanilla RNN Gradient Flow
![](img/d39.png)

## Vanilla RNN Gradient Flow
![](img/d310.png)

Computing gradient of $h_0$ involves many factors of $W$ (and repeated $\tanh$)

* Largest singular value > 1: Exploding gradients
* Largest singular value < 1: Vanishing gradients
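A toy experiment illustrates the two regimes. The sketch below multiplies a vector repeatedly by $W^T$, using a scaled identity matrix so that every singular value equals a chosen constant. It ignores the $\tanh$ factors, so it only shows the effect of $W$.

```python
import numpy as np

def repeated_product_norm(scale, steps=50, dim=4):
    """Norm of a vector after `steps` multiplications by W^T, with W = scale * I."""
    W = scale * np.eye(dim)   # every singular value of W equals `scale`
    grad = np.ones(dim)
    for _ in range(steps):
        grad = W.T @ grad
    return np.linalg.norm(grad)

explode = repeated_product_norm(1.1)   # singular values > 1: norm grows
vanish = repeated_product_norm(0.9)    # singular values < 1: norm shrinks
print(explode, vanish)
```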

Exploding gradients? $\to$ **Gradient clipping:** Scale gradient if its norm is too big
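A minimal sketch of clipping by norm: if the gradient norm exceeds a threshold, rescale the gradient so its norm equals the threshold. The function name `clip_by_norm` and the threshold value are assumptions for this example.

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale `grad` so that its L2 norm is at most `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])            # norm 5
clipped = clip_by_norm(g, 1.0)      # same direction, norm 1
untouched = clip_by_norm(g, 10.0)   # below the threshold, returned unchanged
print(clipped, untouched)
```

Clipping changes only the length of the gradient, not its direction.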

Vanishing gradients? $\to$ Sorry, Houston $\to$ Change the RNN architecture

# Long Short Term Memory

$$ \begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = 
\begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W
\begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix} $$
$$c_t = f \odot c_{t-1} + i \odot g $$
$$h_t = o \odot \tanh(c_t) $$

**i:** Input gate, whether to write to cell<br>
**f:** Forget gate, whether to erase cell<br>
**o:** Output gate, how much to reveal cell<br>
**g:** Gate gate (?), how much to write to cell<br>
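The LSTM equations above translate almost line by line into NumPy. The following is a sketch under the same simplification as the formulas (a single stacked matrix $W$ and no bias terms). The sizes and random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(W, h_prev, c_prev, x_t):
    """One LSTM step; W has shape (4H, H + D) and stacks the i, f, o, g blocks."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t])   # shape (4H,)
    i = sigmoid(z[0:H])                     # input gate
    f = sigmoid(z[H:2 * H])                 # forget gate
    o = sigmoid(z[2 * H:3 * H])             # output gate
    g = np.tanh(z[3 * H:4 * H])             # candidate values to write
    c_t = f * c_prev + i * g                # update the cell state
    h_t = o * np.tanh(c_t)                  # reveal part of the cell
    return h_t, c_t

rng = np.random.default_rng(0)
H, D = 3, 2                                 # hidden and input sizes (illustrative)
W = rng.normal(scale=0.1, size=(4 * H, H + D))
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):         # a toy sequence of 5 inputs
    h, c = lstm_step(W, h, c, x_t)
print(h.shape, c.shape)  # (3,) (3,)
```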

## LSTM Gradient Flow
![](img/d311.png)

## LSTM Gradient Flow
![](img/d312.png)

# Let's play!

In [1]:
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb

In [2]:
max_features = 20000
# cut texts after this number of words (among top max_features most common words)
maxlen = 80
batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

Loading data...
25000 train sequences
25000 test sequences


In [3]:
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

x_train shape: (25000, 80)
x_test shape: (25000, 80)
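With its default arguments, `pad_sequences` pads and truncates at the start of each sequence. The pure-Python helper below is a hypothetical stand-in (`pad_pre` is not a Keras function) that mimics this default for a single sequence, to show what happens to short and long reviews.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad with `value` (maxlen must be > 0)."""
    seq = list(seq)[-maxlen:]                       # truncate from the front
    return [value] * (maxlen - len(seq)) + seq      # pad at the front

short = pad_pre([1, 2, 3], 5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], 5)
print(short)  # [0, 0, 1, 2, 3]
print(long)   # [3, 4, 5, 6, 7]
```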


In [4]:
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [5]:
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 25000 samples, validate on 25000 samples
Epoch 1/15
Epoch 2/15
 1440/25000 [>.............................] - ETA: 1:34 - loss: 0.2885 - acc: 0.8896

KeyboardInterrupt: 

In [None]:
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)

## Denoising DNA Sequence

Using a deep convolutional neural network (CNN) for [denoising images](http://papers.nips.cc/paper/4686-image-denoising-and-inpainting-with-deep-neural-networks.pdf) or [constructing super resolution images](https://arxiv.org/pdf/1501.00092.pdf) has generated some quite amazing results.

**Can we “stack” a set of erroneous DNA sequences together to remove all kinds of errors with a neural network?**

In [6]:
import random

def sim_error(seq, pi=0.05, pd=0.05, ps=0.01):
    """
    Given an input sequence `seq`, generate another
    sequence with errors.
    pi: insertion error rate
    pd: deletion error rate
    ps: substitution error rate
    """
    out_seq = []
    for c in seq:
        while 1:
            r = random.uniform(0,1)
            if r < pi:
                out_seq.append(random.choice(["A","C","G","T"]))
            else:
                break
        r -= pi
        if r < pd:
            continue
        r -= pd
        if r < ps:
            out_seq.append(random.choice(["A","C","G","T"]))
            continue
        out_seq.append(c)
    return "".join(out_seq)

In [7]:
import random
seq = [random.choice(["A","C","G","T"]) for _ in range(220)]
print("".join(seq))

GTTTCCCTCTGGACTGTCGTTGACCGGTAACCGATGATTGCCTGAAGCGTGGGCCAATGATCCGACCAGAGCGACGCCTATAAAGTGAGCAGAAGTTGGTCCGCTTTTTCTGTCGGACCGAGGGTTATTCTTTGGCAGCTTAATCTCTGACATGTCCAATACGACTGTAAATATTGATACATAATACATCCGCGTACTAGGGCAGGCTTACGTACCGTAC


In [8]:
N = 20
length = 50

seqs = []
seqs_raw = []
for i in range(N):
    seq_i = '<' + sim_error(seq, pi=0.05, pd=0.05, ps=0.01) + '>'
    seqs_raw.append(seq_i)
    for j in range(length, len(seq_i)):
        seq_j = seq_i[j-length:j+1]
        seqs.append(seq_j)

len(seqs)

3407
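The inner loop above slides a window of `length + 1` characters over each noisy read: the first `length` characters become the input and the last one the prediction target. A small sketch on a toy string (the helper name `windows` is made up for this example):

```python
def windows(s, length):
    """All substrings of `length + 1` characters, as built in the loop above."""
    return [s[j - length:j + 1] for j in range(length, len(s))]

demo = '<ACGTACGT>'
w = windows(demo, 4)
print(len(w))       # 6, i.e. len(demo) - length
print(w[0], w[-1])  # <ACGT ACGT>
```

Each read of length `n` therefore contributes `n - length` windows, which is why the total depends on the random insertions and deletions.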

In [9]:
import numpy as np
from tensorflow.keras.utils import to_categorical

chars = sorted(list(set(seqs_raw[0])))
mapping = dict((c, i) for i, c in enumerate(chars))
mapping
seqs_t = to_categorical([list(map(lambda x: mapping[x], s)) for s in seqs], 6)
seqs_t.shape

(3407, 51, 6)

In [10]:
X, y = seqs_t[:, :-1], seqs_t[:, -1]

In [11]:
X.shape, y.shape

((3407, 50, 6), (3407, 6))

In [12]:
model = Sequential()
model.add(LSTM(32, input_shape=X.shape[1:]))
model.add(Dense(12, activation='relu'))
model.add(Dense(12, activation='relu'))
model.add(Dense(6, activation='softmax'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 32)                4992      
_________________________________________________________________
dense_1 (Dense)              (None, 12)                396       
_________________________________________________________________
dense_2 (Dense)              (None, 12)                156       
_________________________________________________________________
dense_3 (Dense)              (None, 6)                 78        
Total params: 5,622
Trainable params: 5,622
Non-trainable params: 0
_________________________________________________________________
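The parameter counts in this summary can be reproduced by hand. The sketch below uses the standard count for an LSTM layer with biases (four gate blocks, each with input weights, recurrent weights, and a bias vector). The helper names are chosen for this example.

```python
def lstm_params(input_dim, units):
    # four blocks (i, f, o, g), each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

counts = [lstm_params(6, 32),      # LSTM(32) on 6-dimensional one-hot inputs
          dense_params(32, 12),
          dense_params(12, 12),
          dense_params(12, 6)]
print(counts, sum(counts))  # [4992, 396, 156, 78] 5622
```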


In [13]:
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [14]:
model.fit(X, y, epochs=100)

Epoch 1/100
...
Epoch 100/100


<tensorflow.python.keras.callbacks.History at 0x7f4b81faff98>

In [22]:
c = '<'
inv = dict(map(reversed, mapping.items()))
seq_rec = c
text = [mapping[c]]
while len(seq_rec) < 220 and c != '>':
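    # predict_classes was removed in newer TensorFlow releases; argmax over predict gives the same class
    probs = model.predict(to_categorical(sequence.pad_sequences([text], maxlen=50), 6))
    cl = int(np.argmax(probs, axis=-1)[0])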
    c = inv[cl]
    text += [mapping[c]]
    seq_rec += c

In [23]:
print(seq_rec)

<TTTTTCCGTGTAC>


In [100]:
inv

{0: '<', 1: '>', 2: 'A', 3: 'C', 4: 'G', 5: 'T'}