# Sequences, sequences everywhere

Sequences may come from different applications: sentences, audio, video, sensors...  
All of them present feature vectors associated with an increasing "time step" index.  

The patterns within a sequence are not i.i.d. with respect to each other. We usually assume that different sequences are i.i.d.

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras as K

## Sequences with feedforward networks

Why can't we use the usual MLP and live happy?  
Well, it turns out that, sometimes, you can! 

Let's take **sequence classification** as an example (Q: what other tasks are possible in sequence learning?).

Btw: I prefer to work with the `batch_first` tensor layout, but feel free to use whatever seems best for your case. Usually, you just need to select the appropriate layout in the RNN models (see next).

In [None]:
batch_size, sequence_length, input_size =  10, 7, 3
num_classes = 2
x = tf.random.uniform(minval=-2, maxval=2, shape=(batch_size, sequence_length, input_size), dtype=tf.float32)
y = tf.random.uniform(minval=0, maxval=num_classes, shape=(batch_size,), dtype=tf.int32)
# plot first sequence, first feature
plt.plot(np.arange(0, sequence_length), x[0, :, 0]) 

Let's build an MLP which produces a linear projection **for each element** in the sequence. Each feature point of each time step has its own parameters.

In [None]:
model = K.Sequential()
model.add(K.layers.Input(shape=(sequence_length, input_size)))
model.add(K.layers.Reshape((sequence_length*input_size,)))
model.add(K.layers.Dense(32, activation="tanh"))
model.add(K.layers.Dense(num_classes, activation="softmax"))
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()  # summary() prints by itself and returns None
model.fit(x, y, epochs=2, batch_size=2)

We treated each feature point as i.i.d. with respect to the others -> we lost the time dependency! We are back to tabular data.

What if we use an **RNN** instead?

In [None]:
model = K.Sequential()
model.add(K.layers.Input(shape=(sequence_length, input_size)))
model.add(K.layers.SimpleRNN(32, time_major=False)) # this is the default, turn it to True to disable `batch_first`
model.add(K.layers.Dense(num_classes, activation="softmax"))
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()  # summary() prints by itself and returns None
model.fit(x, y, epochs=2, batch_size=2)

## Inspecting RNNs

In [None]:
layer = K.layers.SimpleRNN(32)
out = layer(x)
print(out.shape)

Ok, we "lost" the time dimension, so we can use this output directly to perform sequence classification.

What if I want the output of the RNN at each time step?

In [None]:
layer = K.layers.SimpleRNN(32, return_sequences=True)
out = layer(x)
print(out.shape)

Great! What else?

In [None]:
# return last state
layer = K.layers.SimpleRNN(32, return_state=True)
out, h = layer(x)
print(out.shape, h.shape)
print(tf.math.reduce_all(out == h))

Ok, so the output is equal to the state!

But if they are equal, why two different options? Well... the RNN state can be quite a complicated object.

In [None]:
layer = K.layers.LSTM(32, return_state=True)
out, h, c = layer(x)
print(out.shape, h.shape, c.shape)
print(tf.math.reduce_all(out == h))
print(tf.math.reduce_all(out == c))

If you don't ask for the state, you won't notice the difference.

In [None]:
layer = K.layers.LSTM(32, return_sequences=True)
out = layer(x)
print(out.shape)

### RNN cells vs. RNN layers

**Check out [this Keras guide on working with RNNs](https://keras.io/guides/working_with_rnns/).**

How are RNNs implemented? Basically, each RNN processes batches of sequences.

The RNN is internally built from a fixed number of RNN cells. Each cell processes one time step only (a single feature vector). Its output is "forwarded" back to its input to create the time dependency. It also keeps an internal state which is updated each time it receives a vector.

The RNN uses the RNN cells inside a loop over the time steps.

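* **The cell state size determines the output size of the RNN. With several cells, the layer keeps one state per cell (`num_cells` states in total), while its output is the output of the last cell**
* **If you use more than one cell, you get a `stacked` RNN**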
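
To make the cell/layer distinction concrete, here is a minimal sketch. The sizes and the variable names (`x_demo`, `rnn_layer`, ...) are illustrative assumptions, not part of the original notebook. It wraps a `SimpleRNNCell` into the generic `RNN` layer, then repeats the same computation with an explicit loop over the time steps, reusing the same cell and therefore the same weights. Finally, it builds a stacked RNN from a list of two cells.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras as K

# illustrative sizes (assumptions)
batch_size, sequence_length, input_size, units = 4, 7, 3, 5
x_demo = tf.random.uniform((batch_size, sequence_length, input_size))

# a cell processes ONE time step; the RNN layer loops it over the sequence
cell = K.layers.SimpleRNNCell(units)
rnn_layer = K.layers.RNN(cell)
layer_out = np.asarray(rnn_layer(x_demo))  # shape: (batch_size, units)

# the same loop written by hand, reusing the cell (and its weights)
state = [tf.zeros((batch_size, units))]  # the default initial state is all zeros
for t in range(sequence_length):
    step_out, state = cell(x_demo[:, t, :], state)
manual_out = np.asarray(step_out)

# True if the layer and the hand-written loop agree
print(np.allclose(layer_out, manual_out, atol=1e-5))

# a list of cells gives a stacked RNN: each cell feeds the next one
stacked = K.layers.RNN([K.layers.SimpleRNNCell(8), K.layers.SimpleRNNCell(16)])
stacked_out = stacked(x_demo)
print(stacked_out.shape)  # the last dimension is the size of the last cell
```

In practice you rarely need the explicit loop: layers such as `SimpleRNN`, `LSTM` and `GRU` already bundle the cell and the loop together.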

## Deep RNN

Let's go deep and add more layers! It's so easy, there is little I can tell you.

In [None]:
model = K.Sequential()
model.add(K.layers.GRU(32, return_sequences=True)) # the next recurrent layer needs a 3D input, so return the whole sequence
model.add(K.layers.GRU(64))
model.add(K.layers.Dense(2))

## Apply the same layer multiple times

You can easily apply weight sharing to any layer by wrapping it in the `TimeDistributed` layer.

In [None]:
inputs = K.Input(shape=(sequence_length, input_size))
layer = K.layers.Dense(32)
outputs = K.layers.TimeDistributed(layer)(inputs) # replicate the layer over the second dimension
outputs.shape 
# do whatever you want with `outputs` depending on the task
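
As a quick sanity check of what "weight sharing" means here, the following sketch (with illustrative names such as `x_td` and `dense`, which are assumptions) compares the `TimeDistributed` output with the result of applying the very same `Dense` layer to each time step separately. It also counts the parameters, which do not depend on the sequence length.

```python
import numpy as np
from tensorflow import keras as K

# toy batch: 10 sequences, 7 time steps, 3 features (assumed sizes)
x_td = np.random.uniform(-2, 2, size=(10, 7, 3)).astype("float32")

dense = K.layers.Dense(32)
time_distributed = K.layers.TimeDistributed(dense)
td_out = np.asarray(time_distributed(x_td))  # shape: (10, 7, 32)

# apply the SAME dense layer to each time step, one at a time
per_step = np.stack(
    [np.asarray(dense(x_td[:, t, :])) for t in range(x_td.shape[1])], axis=1
)

print(td_out.shape)
print(np.allclose(td_out, per_step, atol=1e-5))

# one kernel (3 x 32) plus one bias (32), shared by all the time steps
n_params = dense.count_params()
print(n_params)
```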

## Use Convolutions to deal with sequences

Convolutions are also heavily used on 1D patterns like sequences.

In [None]:
inputs = K.layers.Input((sequence_length, input_size))
conv1 = K.layers.Conv1D(filters=32, kernel_size=2, padding="same")(inputs)
conv1 = K.layers.ReLU()(conv1)

conv2 = K.layers.Conv1D(filters=64, kernel_size=2, padding="same")(conv1)
conv2 = K.layers.ReLU()(conv2)

final = K.layers.GlobalAveragePooling1D()(conv2)
outputs = K.layers.Dense(num_classes, activation="softmax")(final)

model = K.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()  # summary() prints by itself and returns None
model.fit(x, y, epochs=2, batch_size=2)

## Sequences of different lengths

In [None]:
batch_size, max_sequence_length, input_size = 10, 7, 3
num_classes = 2

# half of the sequences have the maximum length, the other half are 3 time steps shorter
x1 = tf.random.uniform(minval=-2, maxval=2, shape=(batch_size // 2, max_sequence_length, input_size), dtype=tf.float32)
x2 = tf.random.uniform(minval=-2, maxval=2, shape=(batch_size // 2, max_sequence_length - 3, input_size), dtype=tf.float32)

y = tf.random.uniform(minval=0, maxval=num_classes, shape=(batch_size,), dtype=tf.int32)
# plot first sequence, first feature
fig, ax = plt.subplots(2)
ax[0].plot(np.arange(0, x1.shape[1]), x1[0, :, 0]) 
ax[1].plot(np.arange(0, x2.shape[1]), x2[0, :, 0]) 


How can we build a dataset or simply a single numpy array for `x`?

**Padding**!

In [None]:
# right ("post") padding is what the cuDNN (CUDA) kernels require when masks are used
padded = K.preprocessing.sequence.pad_sequences(list(x1) + list(x2),
                                                padding="post", value=0,
                                                dtype="float32")

print(padded.shape)
print(padded)

Padding adds 0 values. You can also pad to a fixed length and truncate longer sequences.
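
Padding to a fixed length with truncation could look like the sketch below. The toy sequences and the target length of 5 are assumptions chosen for illustration; `maxlen` and `truncating` are arguments of the same `pad_sequences` function used above.

```python
import numpy as np
from tensorflow import keras as K

# three toy sequences with 7, 4 and 2 time steps, 3 features each (assumed sizes)
toy_sequences = [np.ones((7, 3)), np.ones((4, 3)), np.ones((2, 3))]

fixed = K.preprocessing.sequence.pad_sequences(
    toy_sequences,
    maxlen=5,           # every sequence ends up with exactly 5 time steps
    padding="post",     # shorter sequences get zeros appended at the end
    truncating="post",  # longer sequences lose their last time steps
    value=0.0,
    dtype="float32",
)

print(fixed.shape)     # (number of sequences, maxlen, features)
print(fixed[:, :, 0])  # first feature: ones on real steps, zeros on padding
```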

This is now a dataset like the first one we built and can be fed to all the previous models. 

However, you do not usually want to compute loss and backpropagation for all the padded time steps. You should compute them only for the "real" time steps.

**Masking**

A mask is a boolean tensor with shape `(batch_size, sequence_length)` which is False whenever that time step has to be skipped, True otherwise.

Some layers (e.g. RNNs) accept a `mask` argument together with the input during the forward pass.

You can always use the `Masking` layer directly.  
Masks propagate through the model. If a previous layer produces a mask and the current layer accepts masks, it will receive it automatically.

In [None]:
lstm = K.layers.LSTM(32, return_sequences=True)
mask = tf.sequence_mask([sequence_length]*int(batch_size/2)+[sequence_length-3]*int(batch_size/2), 
                        maxlen=sequence_length)
print(mask.shape)
print(mask)
out = lstm(padded, mask=mask)
print(out.shape)
# only the non-masked positions are taken into account when computing gradients

Alternatively:

In [None]:
# this layer masks every time step whose features are all equal to `mask_value` (here, all zeros)
# you can use it in your models like any other layer
mask_layer = K.layers.Masking(mask_value=0)
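
To see mask propagation at work, here is a small sketch under assumed toy sizes (names like `short_seq` and `masked_model` are illustrative). It feeds an RNN with a short sequence, and then with the same sequence padded with zeros and passed through a `Masking` layer. If the padded steps are really skipped, the two final outputs should match.

```python
import numpy as np
from tensorflow import keras as K

rng = np.random.default_rng(0)
# one "real" sequence of 4 steps, and the same sequence post-padded with 3 all-zero steps
short_seq = rng.uniform(-2, 2, size=(1, 4, 3)).astype("float32")
padded_seq = np.concatenate([short_seq, np.zeros((1, 3, 3), dtype="float32")], axis=1)

mask_layer = K.layers.Masking(mask_value=0.0)
rnn = K.layers.SimpleRNN(8)

# the mask computed by the Masking layer: False on the padded steps
computed_mask = np.asarray(mask_layer.compute_mask(padded_seq))
print(computed_mask)

out_short = np.asarray(rnn(short_seq))          # no padding at all
masked_model = K.Sequential([mask_layer, rnn])  # the mask reaches the RNN automatically
out_masked = np.asarray(masked_model(padded_seq))

# True if the padded steps were skipped
print(np.allclose(out_short, out_masked, atol=1e-5))
```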

## Sliding windows

In [None]:
help(K.preprocessing.timeseries_dataset_from_array)
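
A minimal usage sketch of this function on a toy series (the series `0..9` and the window length of 3 are assumptions for illustration): each input is a window of consecutive values and the target is the value that comes right after the window.

```python
import numpy as np
from tensorflow import keras as K

# toy univariate series 0, 1, ..., 9
series = np.arange(10)
window = 3

windows_ds = K.preprocessing.timeseries_dataset_from_array(
    data=series[:-window],    # values that can start a full window with a target
    targets=series[window:],  # targets[i] is the value right after the window starting at i
    sequence_length=window,
    batch_size=128,           # larger than the number of windows -> a single batch
)

for window_batch, target_batch in windows_ds:
    print(window_batch.numpy())  # one window per row
    print(target_batch.numpy())  # the value following each window
```

The `sequence_stride` and `sampling_rate` arguments control, respectively, how far apart consecutive windows start and how many steps are skipped inside each window.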

**Exercise**: build a model of your choice to classify sequences. You can also try with MNIST if you feel comfortable: use 1 or 28 pixels at a time to create the sequence.

Otherwise, you can look for specific datasets online in the usual UCI repository (ask me for advice :) )

## Beyond classification: sequence modeling

Example: https://keras.io/examples/timeseries/timeseries_anomaly_detection/

Notebook: https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/timeseries/ipynb/timeseries_weather_forecasting.ipynb#scrollTo=Arln3kkJDQtq

## Natural Language Processing: sequences of tokens

How to manage sentences? Well, they are sequences of words, right?

Lots of preprocessing and then -> DL model

In [None]:
sentence = ["Hi, how are you"]

In [None]:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

**Each word becomes an integer**

In [None]:
vect = TextVectorization(max_tokens=20,
                         ngrams=None,
                         output_mode="int", 
                         output_sequence_length=20, 
                         pad_to_max_tokens=False, 
                         vocabulary=None)
vect.adapt(sentence)
print(vect.get_vocabulary())

In [None]:
out = vect(sentence)
print(out)

**You may also want ngrams**

In [None]:
vect = TextVectorization(max_tokens=20,
                         ngrams=2,
                         output_mode="int", 
                         output_sequence_length=20, 
                         pad_to_max_tokens=False, 
                         vocabulary=None)
vect.adapt(sentence)
print(vect.get_vocabulary())
out = vect(sentence)
print(out)

The output now also includes the ngrams that were found.

**How to feed these results to a DL model?**

You *cannot* feed raw integers! They carry an ordering (0 < 1 < 2 < 3) that *does not correspond to* any ordering of the tokens (e.g. you < how are < how).

* Feed one-hot encodings
* Feed embeddings

**One-hot encoding**

In [None]:
vect = TextVectorization(max_tokens=20,
                         output_mode="int", 
                         output_sequence_length=10)
vect.adapt(sentence)
print(vect.get_vocabulary())

# one hot encoding
from tensorflow.keras.utils import to_categorical  # same Keras as the rest of the notebook

to_encode = vect(sentence)
print(to_encode)
encoded = to_categorical(to_encode, num_classes=len(vect.get_vocabulary()))
print(encoded)
print(encoded.shape)

You just need to take sentences in batches and you are fine!

**The most common choice in NLP is the Embedding**

Let's create a dataset of words!

In [None]:
#just another way to break into words
from tensorflow.keras.preprocessing.text import text_to_word_sequence
print(text_to_word_sequence(sentence[0]))
print()

features = [["hello it me"], 
            ["how to escape"],
            ["beware of dog"]]
labels = tf.random.uniform(minval=0, maxval=2, shape=(len(features),), dtype=tf.int32)
text_only_dataset = tf.data.Dataset.from_tensor_slices(features) # this is used only to adapt the vectorization

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=100).batch(2)
for x, y in dataset:
  print(x, y)
  print()

In [None]:
vect = TextVectorization(max_tokens=20,
                         ngrams=None,
                         output_mode="int", 
                         output_sequence_length=20, 
                         pad_to_max_tokens=False, 
                         vocabulary=None)
vect.adapt(text_only_dataset)
print(vect.get_vocabulary())

In [None]:
embedding = K.layers.Embedding(input_dim=len(vect.get_vocabulary()),
                               output_dim=10)
print(embedding(vect(features)).shape)
print(embedding(vect(features)))
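
To connect embeddings with the one-hot encoding seen before, the sketch below (toy vocabulary size and token ids are assumptions) checks that an embedding lookup gives the same result as multiplying the one-hot vectors by the embedding table. It also uses `mask_zero=True`, an `Embedding` option that treats index 0 as padding and produces a mask for the following layers; this fits `TextVectorization`, which pads its output with 0.

```python
import numpy as np
from tensorflow import keras as K

vocab_size, embedding_dim = 6, 4       # assumed toy sizes
token_ids = np.array([[2, 5, 3, 0, 0]])  # one "sentence", padded with 0

emb = K.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, mask_zero=True)
looked_up = np.asarray(emb(token_ids))  # shape: (1, 5, embedding_dim)

# the layer is a trainable table with one row per token id
table = emb.get_weights()[0]            # shape: (vocab_size, embedding_dim)

# selecting a row is the same as multiplying a one-hot vector by the table
one_hot = np.eye(vocab_size, dtype="float32")[token_ids]  # shape: (1, 5, vocab_size)
projected = one_hot @ table

print(np.allclose(looked_up, projected, atol=1e-6))

# with mask_zero=True the padded positions are marked as False
padding_mask = np.asarray(emb.compute_mask(token_ids))
print(padding_mask)
```

The lookup avoids materializing the large and sparse one-hot tensors, which is one reason why embeddings scale better to big vocabularies.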

Now, build a model! Strings are a valid tensor type so we are fine.

In [None]:
model = K.Sequential()
# the input is a tensor of shape (batch_size, 1) because each element
# is a single sentence (like "hello it me" -> size = 1)
model.add(K.layers.Input(shape=(1,), dtype=tf.string))
model.add(vect)
model.add(embedding) # 3D output
model.add(K.layers.LSTM(32)) # one word at a time
model.add(K.layers.Dense(2, activation="softmax"))
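# integer class labels + softmax output -> sparse categorical crossentropy (not mse)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")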
model.fit(dataset, epochs=2)

**Bonus**: NLP often uses Bidirectional RNNs -> they process the sequence in both directions (forward and in reverse).  
You can use the `Bidirectional` Keras layer, or the `go_backwards` parameter in the RNN constructor to process the sequence in reverse only.
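
A minimal sketch of both options, with assumed toy sizes. Note how the `Bidirectional` wrapper, by default, concatenates the outputs of the two directions.

```python
import tensorflow as tf
from tensorflow import keras as K

# toy batch: 10 sequences, 7 time steps, 3 features (assumed sizes)
x_bi = tf.random.uniform((10, 7, 3))

# forward and backward LSTM, outputs concatenated (default merge mode)
bidirectional = K.layers.Bidirectional(K.layers.LSTM(32))
bi_out = bidirectional(x_bi)
print(bi_out.shape)  # last dimension: 32 (forward) + 32 (backward)

# a single LSTM that reads the sequence from the last step to the first
backward_only = K.layers.LSTM(32, go_backwards=True)
back_out = backward_only(x_bi)
print(back_out.shape)
```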

**You now have all the basics to deal with NLP tasks, too**

You can also try out `Attention` layers and the `Transformer` architecture. These are popular models for NLP. However, they are usually very large (dense), with lots of parameters, and they require lots of training time.

You can use Colab, but you have to periodically checkpoint your model so that when (not if) the runtime gets disconnected you don't lose everything.
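
One possible way to checkpoint is the `ModelCheckpoint` callback. The sketch below uses a toy model and a temporary folder, both assumptions for illustration; on Colab you would probably point `ckpt_path` to storage that survives a disconnection.

```python
import os
import tempfile
import numpy as np
from tensorflow import keras as K

# toy data and model (assumptions for illustration)
x_ckpt = np.random.uniform(-2, 2, size=(10, 7, 3)).astype("float32")
y_ckpt = np.random.randint(0, 2, size=(10,))

model_ckpt = K.Sequential()
model_ckpt.add(K.layers.Input(shape=(7, 3)))
model_ckpt.add(K.layers.SimpleRNN(8))
model_ckpt.add(K.layers.Dense(2, activation="softmax"))
model_ckpt.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# hypothetical location of the checkpoint file
ckpt_path = os.path.join(tempfile.mkdtemp(), "demo.weights.h5")

# by default the callback saves at the end of every epoch
checkpoint = K.callbacks.ModelCheckpoint(filepath=ckpt_path, save_weights_only=True)
model_ckpt.fit(x_ckpt, y_ckpt, epochs=2, batch_size=5, callbacks=[checkpoint], verbose=0)

# after a disconnection: rebuild the same architecture, then restore the weights
model_ckpt.load_weights(ckpt_path)
print(os.path.exists(ckpt_path))
```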

[Multi-head attention](https://keras.io/api/layers/attention_layers/multi_head_attention/)


[Transformer Example](https://keras.io/examples/nlp/text_classification_with_transformer/#implement-a-transformer-block-as-a-layer)