## Sequences

A simple example of sequence prediction - $[0, 1, 2] \rightarrow [3, 4, 5]$
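
As a minimal sketch, input and target pairs like the one above can be cut from a longer series with a sliding window. The helper name `make_windows` is hypothetical and only for illustration:

```python
import numpy as np

def make_windows(series, seq_len):
    """Split a 1D series into (input, target) windows of length seq_len."""
    series = np.asarray(series)
    inputs, targets = [], []
    # each input window is immediately followed by its target window
    for start in range(len(series) - 2 * seq_len + 1):
        inputs.append(series[start:start + seq_len])
        targets.append(series[start + seq_len:start + 2 * seq_len])
    return np.array(inputs), np.array(targets)

inputs, targets = make_windows(range(8), seq_len=3)
print(inputs[0], '->', targets[0])  # [0 1 2] -> [3 4 5]
```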

Applications / examples of sequence problems
- language modelling
- translation
- time series
- generating image captions

We can have different types of sequence problem structures

![](assets/sequences.png)

The many to many structure can also be thought of as an encoder-decoder structure:

![](assets/quoc-le.png)

## Problems with dense networks

Stateless

Unaware of temporal structure

Fixed size inputs & outputs
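
A rough sketch of the fixed size problem, assuming a single dense layer applied to a flattened sequence (the names `dense` and `weights` are illustrative). The weight matrix is tied to one sequence length, so a longer sequence cannot be fed in:

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, features, units = 3, 2, 4

# a dense layer over the flattened sequence needs one row per (timestep, feature)
weights = rng.uniform(size=(timesteps * features, units))

def dense(sequence):
    return np.tanh(sequence.reshape(-1).dot(weights))

print(dense(rng.uniform(size=(3, 2))).shape)  # (4,)

try:
    dense(rng.uniform(size=(5, 2)))  # a longer sequence
except ValueError as error:
    print('longer sequence fails:', error)
```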

## Promise of recurrent neural networks

Network able to learn a mapping from inputs over time
- outputs become conditional on the context of the sequence

Learn the temporal dependence of data
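
A minimal sketch of what conditioning on context means, assuming a simple tanh recurrence with random weights (all names here are illustrative). Two sequences end with the same input, but the final state differs because the earlier inputs differ:

```python
import numpy as np

rng = np.random.default_rng(0)
feature_weights = rng.uniform(size=(2, 4))
state_weights = rng.uniform(size=(4, 4))

def run(sequence):
    state = np.zeros(4)
    for x in sequence:
        # the new state depends on the current input and the previous state
        state = np.tanh(state.dot(state_weights) + x.dot(feature_weights))
    return state

step = np.array([0.5, -0.5])
seq_a = np.array([[1.0, 1.0], [0.0, 0.0], step])
seq_b = np.array([[-1.0, 2.0], [3.0, 3.0], step])

# same last input, different history
print(np.allclose(run(seq_a), run(seq_b)))
```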

An RNN is Turing complete
- in principle it can simulate arbitrary programs

## Being comfortable in three dimensions

We model the temporal structure in data using a dimension in an array - by convention this is the second dimension.

Our dimensions then are: 
- the batch dimension (number of samples)
- timesteps
- features
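
A small sketch of this layout in NumPy, with sizes chosen only for illustration:

```python
import numpy as np

batch_size, timesteps, features = 4, 3, 2

# values 0..23 laid out as (batch, timesteps, features)
data = np.arange(batch_size * timesteps * features).reshape(
    batch_size, timesteps, features
)

print(data.shape)               # (4, 3, 2)
print(data[0].shape)            # one sample: (3, 2)
print(data[:, -1, :].shape)     # last timestep of every sample: (4, 2)
print(data.mean(axis=1).shape)  # average over time: (4, 2)
```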

## Practical

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

batch_size = 1000
timesteps = 32
features = 16

samples = tf.random.uniform((batch_size, timesteps, features))

Select all samples, first timestep, all features:

In [2]:
samples[:, 0, :]

<tf.Tensor: id=10, shape=(1000, 16), dtype=float32, numpy=
array([[0.46424544, 0.7721001 , 0.3713907 , ..., 0.575696  , 0.7400645 ,
        0.75130725],
       [0.7625606 , 0.9685751 , 0.42492294, ..., 0.7437755 , 0.18617857,
        0.01660204],
       [0.99516654, 0.57923126, 0.93059003, ..., 0.21720362, 0.5128175 ,
        0.4847033 ],
       ...,
       [0.6671653 , 0.658352  , 0.10562837, ..., 0.11620092, 0.6198951 ,
        0.9504447 ],
       [0.15319824, 0.16427588, 0.41125262, ..., 0.0280416 , 0.4160682 ,
        0.43512297],
       [0.81149054, 0.88620865, 0.29803014, ..., 0.44782937, 0.04601979,
        0.2196101 ]], dtype=float32)>

Last sample, all timesteps, first feature:

In [3]:
samples[-1, :, 0]

<tf.Tensor: id=15, shape=(32,), dtype=float32, numpy=
array([0.81149054, 0.36649883, 0.42218685, 0.15335298, 0.904534  ,
       0.66553354, 0.2674117 , 0.0316906 , 0.12106466, 0.5772643 ,
       0.0232017 , 0.4743743 , 0.17386997, 0.36009216, 0.49845338,
       0.49139953, 0.7374474 , 0.17315185, 0.3671106 , 0.9532969 ,
       0.46684217, 0.9984925 , 0.99783885, 0.47813845, 0.8609488 ,
       0.3230902 , 0.57820535, 0.98574424, 0.41925073, 0.4170395 ,
       0.8973566 , 0.8503436 ], dtype=float32)>

Ninth sample, sixth timestep, all features:

In [4]:
samples[8, 5, :]

<tf.Tensor: id=20, shape=(16,), dtype=float32, numpy=
array([0.02393913, 0.10989189, 0.81392586, 0.01238775, 0.26136005,
       0.8693919 , 0.08746338, 0.8578043 , 0.40168118, 0.8016467 ,
       0.9237622 , 0.87369466, 0.7280284 , 0.48793352, 0.51919794,
       0.2144723 ], dtype=float32)>

## Recurrent neural networks

A recurrent neural network passes its outputs to itself at each timestep
- the state is the hidden activation from the previous timestep

![](assets/unrolled.png)

Let's model the forward pass of a recurrent neural network:

In [5]:
#  data
samples = np.random.uniform(size=(4, 3, 2))

#  architecture & weights
nodes = 4
previous_state_weights = np.random.uniform(size=(nodes, nodes))
feature_weights = np.random.uniform(size=(samples.shape[2], nodes))
state_weights = np.random.uniform(size=(nodes, 8))

#  initial state
state = np.zeros(nodes)

#  update the hidden state using the first timestep of the first sample
#  use tanh to help deal with vanishing gradients
#  tanh squeezes the state between -1 and 1
state = np.tanh(
    state.dot(previous_state_weights) + samples[0][0].dot(feature_weights)
)
#  project the state onto the output
output = state.dot(state_weights)

state

In [None]:
state = np.tanh(
    state.dot(previous_state_weights) + samples[0][1].dot(feature_weights)
)

output = state.dot(state_weights)

state
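
The two manual steps above can be generalised into a loop over every timestep, for the whole batch at once. This is a sketch under assumptions: `rnn_forward` and `output_weights` are hypothetical names, with `output_weights` playing the role of `state_weights`:

```python
import numpy as np

def rnn_forward(samples, feature_weights, previous_state_weights, output_weights):
    """Run a simple tanh RNN over every timestep of every sample."""
    batch_size, timesteps, _ = samples.shape
    nodes = previous_state_weights.shape[0]
    state = np.zeros((batch_size, nodes))  # one state vector per sample
    outputs = []
    for step in range(timesteps):
        state = np.tanh(
            state.dot(previous_state_weights)
            + samples[:, step, :].dot(feature_weights)
        )
        outputs.append(state.dot(output_weights))
    # stack to (batch, timesteps, output size)
    return np.stack(outputs, axis=1), state

rng = np.random.default_rng(42)
samples = rng.uniform(size=(4, 3, 2))
outputs, final_state = rnn_forward(
    samples,
    feature_weights=rng.uniform(size=(2, 4)),
    previous_state_weights=rng.uniform(size=(4, 4)),
    output_weights=rng.uniform(size=(4, 8)),
)
print(outputs.shape, final_state.shape)  # (4, 3, 8) (4, 4)
```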

## Backprop through time

Backpropagating error requires error to flow backwards in time
- error must flow back to the first time step to calculate gradients

The loss for a given layer depends not only on its influence on the layers below it, but also on its influence on the layer at the next time step

Backpropagating through time means unrolling the network, which makes
- the memory footprint of a recurrent neural network large
- parallel training on multiple sequences inefficient on hardware that shares memory (e.g. a GPU)

Further reading - see *Truncated Backprop Through Time*
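
To make the backward flow of error concrete, here is a minimal sketch of backpropagation through time. It assumes a scalar RNN with a squared error loss on the final state only, which is a simplification chosen for illustration and not the model used elsewhere in these notes. The gradients are checked against finite differences:

```python
import numpy as np

def forward(w, u, xs):
    """Scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t), starting from h_0 = 0."""
    states = [0.0]
    for x in xs:
        states.append(np.tanh(w * states[-1] + u * x))
    return states

def loss(w, u, xs, target):
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2

def bptt(w, u, xs, target):
    """Gradients of the loss with respect to w and u."""
    states = forward(w, u, xs)  # every state is kept for the backward pass
    grad_w, grad_u = 0.0, 0.0
    grad_state = states[-1] - target  # dL/dh_T
    for t in range(len(xs), 0, -1):
        grad_pre = grad_state * (1 - states[t] ** 2)  # back through tanh
        grad_w += grad_pre * states[t - 1]
        grad_u += grad_pre * xs[t - 1]
        grad_state = grad_pre * w  # pass the error to the previous timestep
    return grad_w, grad_u

xs, target, w, u = [0.5, -0.1, 0.3, 0.8], 0.2, 0.7, 1.1
grad_w, grad_u = bptt(w, u, xs, target)

# compare against central finite differences
eps = 1e-6
numeric_w = (loss(w + eps, u, xs, target) - loss(w - eps, u, xs, target)) / (2 * eps)
numeric_u = (loss(w, u + eps, xs, target) - loss(w, u - eps, xs, target)) / (2 * eps)
print(np.isclose(grad_w, numeric_w, atol=1e-6), np.isclose(grad_u, numeric_u, atol=1e-6))
```

The list of stored states grows with the sequence length, which is the memory cost described above. Truncating would mean stopping the backward loop after a fixed number of steps.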

## Character level language modeling

Let's use a recurrent neural network to predict the next letter in the word *goodbye!*

In [None]:
#  we will model many to many 
#  feeding in the entire input sequence then reading the output sequence

def encode(alphabet, samples, seq_len):
    samples = np.array(samples)
    encoding = np.zeros((samples.shape[0], seq_len, len(alphabet)))

    for row, sample in enumerate(samples):
        for se in range(seq_len):
            #  list.index raises ValueError for characters outside the alphabet
            idx = alphabet.index(sample[se])
            encoding[row, se, idx] = 1

    #  every timestep must be one-hot
    assert (np.sum(encoding, axis=2) == 1).all()
    return encoding


def make_dataset(word, seq_len):
    words = word * 50
    
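    #  random start positions, leaving room for an input and a target window
    indices = np.random.randint(seq_len, len(words) - seq_len*2, size=100)

    f, t = zip(*[
        [words[idx:idx+seq_len], words[idx+seq_len:idx+seq_len*2]] for idx in indices
    ])

    #  sorted so that character indices are the same on every run
    alphabet = sorted(set(word))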

    return encode(alphabet, f, seq_len), encode(alphabet, t, seq_len), alphabet

f, t, alphabet = make_dataset('goodbye!', seq_len=3)

In [None]:
model = keras.Sequential(
    [keras.layers.SimpleRNN(8, return_sequences=True),
     keras.layers.Dense(t.shape[2], activation='softmax')]
)
model.compile(loss='categorical_crossentropy', optimizer='adam')
h = model.fit(f, t, epochs=100, verbose=0)

In [None]:
def decode(alphabet, encoded):
    #  single sample only
    return [alphabet[v] for v in encoded.flatten()]

test = encode(alphabet, np.array(['goo']), seq_len=3)
decode(alphabet, np.argmax(model.predict(test), axis=2))

In [None]:
test = encode(alphabet, np.array(['bye']), seq_len=3)
decode(alphabet, np.argmax(model.predict(test), axis=2))

In [None]:
test = encode(alphabet, np.array(['!go']), seq_len=3)
decode(alphabet, np.argmax(model.predict(test), axis=2))
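
As a compact alternative sketch of the one-hot encoding used above, rows of an identity matrix can be indexed directly. The helper names `one_hot` and `to_text` are hypothetical, and the alphabet is sorted here so that each character keeps the same index between runs:

```python
import numpy as np

word = 'goodbye!'
alphabet = sorted(set(word))  # 7 unique characters in a stable order

def one_hot(text):
    indices = [alphabet.index(char) for char in text]
    return np.eye(len(alphabet))[indices]  # shape (len(text), len(alphabet))

def to_text(encoded):
    return ''.join(alphabet[i] for i in np.argmax(encoded, axis=-1))

encoded = one_hot('goo')
print(encoded.shape)     # (3, 7)
print(to_text(encoded))  # goo
```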

## Functional Keras API

In [None]:
f, t, alphabet = make_dataset('hello', seq_len=3)

feat = tf.keras.Input(shape=(3, 4))
rnn = keras.layers.SimpleRNN(8, return_sequences=True)(feat)
classes = keras.layers.Dense(t.shape[2], activation='softmax')(rnn)

model = tf.keras.Model(inputs=feat, outputs=classes)

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam')
h = model.fit(f, t, epochs=100, verbose=0)

In [None]:
model.predict(f[0].reshape(1, 3, 4))

## Practical

Train a recurrent network to predict the next letter in a word.  