# LSTM and Sequence Generation

In this notebook we're going to explore two changes to the RNN we built in the previous notebook.

1. The use of LSTM layers instead of SimpleRNN layers.
2. Generating sequences instead of generating a single prediction as output.

## LSTM

A Long Short-Term Memory (LSTM) layer is an extension of the RNN idea, designed to do two things:

1. Give the hidden state more flexibility in how it updates.
2. Provide an efficient route for backpropagation, similar to skip connections in CNNs.

Here is a diagram from one of the readings [https://colah.github.io/posts/2015-08-Understanding-LSTMs/](https://colah.github.io/posts/2015-08-Understanding-LSTMs/):

![](https://colah.github.io/images/post-covers/lstm.png)

What you see here is the addition of several "gates", as well as an additional output from the layer compared to a simple RNN. Gates act like other hidden layers and have their own learned weights.

The two states are called the "cell state" (the top line) and the "hidden state" (the bottom line). Both serve a similar purpose to the RNN hidden state, but the cell state is not passed through an activation function as it moves from one timestep to the next, which helps avoid the vanishing gradient problem that simple RNNs often suffer from.
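To make the vanishing-gradient point concrete, here is a rough numeric sketch (not code from the reading). It assumes a scalar simple RNN with a hypothetical recurrent weight `w` and no input, and it treats the multiplicative gate on the cell state (the forget gate described next) as a fixed constant, ignoring the gates' own dependence on the hidden state. Under those assumptions, the gradient through the RNN is a product of `w * tanh'(...)` factors, while the gradient along the cell state is just a product of gate values.

```python
import math

def rnn_gradient(w, h0, steps):
    """d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h ** 2)  # chain rule: w * tanh'(w * h_prev)
    return grad

def cell_state_gradient(forget_gate, steps):
    """d c_T / d c_0 for c_t = forget_gate * c_{t-1} + (new information)."""
    return forget_gate ** steps

steps = 50
print("simple RNN:", rnn_gradient(w=0.9, h0=0.5, steps=steps))
print("cell state:", cell_state_gradient(forget_gate=0.99, steps=steps))
```

With these illustrative numbers the RNN gradient shrinks far faster than the cell-state gradient, because every RNN step multiplies in a factor below 1 while a gate value near 1 leaves the gradient almost unchanged.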

The gates are typically referred to as follows:

**Forget gate**: The first gate uses a sigmoid activation to produce values between 0 and 1, which are then pointwise multiplied with the incoming cell state. This allows the cell state to "forget" irrelevant context when the activations are near 0. In other words, this gate is designed so that it can move the values in the cell state closer to zero.

**Input and "gate" gates**: The second gate is a pointwise multiply that combines a sigmoid and a tanh activation. The multiply allows the sigmoid to scale the tanh output. The result is a value between -1 and 1, which gets pointwise added to the cell state (after the forget gate is applied). This allows the cell state to incrementally learn new information and store it in the context.

The sigmoid is called the "input" gate and the tanh is called the "gate gate", but they work together to decide how much the cell state learns from the new input.

**Output gate**: Finally, the last sigmoid is the output gate. It scales a tanh of the updated cell state to produce the layer's output and the next hidden state (which are the same values), but it does not modify the cell state.
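The gate descriptions above can be sketched as a single LSTM timestep in NumPy. This is a minimal illustration rather than the Keras implementation: the parameter names (`"forget"`, `"input"`, `"gate"`, `"output"`) and the choice to concatenate the previous hidden state with the input are assumptions made for readability, and the weights are random rather than learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM timestep. `params` maps hypothetical gate names to (weights, bias)."""
    z = np.concatenate([h_prev, x])   # every gate sees the previous hidden state and the input
    W_f, b_f = params["forget"]
    W_i, b_i = params["input"]
    W_g, b_g = params["gate"]
    W_o, b_o = params["output"]

    f = sigmoid(W_f @ z + b_f)        # forget gate: values in (0, 1)
    i = sigmoid(W_i @ z + b_i)        # input gate: values in (0, 1)
    g = np.tanh(W_g @ z + b_g)        # "gate gate": candidate values in (-1, 1)
    o = sigmoid(W_o @ z + b_o)        # output gate: values in (0, 1)

    c = f * c_prev + i * g            # forget old context, then add new information
    h = o * np.tanh(c)                # output and next hidden state; c itself is not rescaled
    return h, c

rng = np.random.default_rng(0)
input_dim, units = 3, 4
params = {
    name: (rng.normal(size=(units, units + input_dim)), np.zeros(units))
    for name in ("forget", "input", "gate", "output")
}
h, c = lstm_step(rng.normal(size=input_dim), np.zeros(units), np.zeros(units), params)
print(h.shape, c.shape)
```

Note that the only operations applied to the cell state on its way to the next timestep are a pointwise multiply and a pointwise add, which is the "efficient route" for gradients mentioned earlier.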

## Sequence to sequence

The next change we'll make is to allow multiple outputs over time, allowing us to generate text output from text input rather than a single output for classification or regression.

![](https://karpathy.github.io/assets/rnn/diags.jpeg)

This requires a few changes.

1. First, the training data. Our inputs and labels will come from the exact same text, but at every timestep the label will be the word directly following the current word.
2. Second, we have to enable each timestep to produce a prediction.
3. Third, we have to map those numeric predictions to a word.
4. Fourth, we have to enable the network to stop somehow. We'll allow the network to generate a "stop" token as one of its "words"; when that token is predicted, generation stops.

Note that this strategy works for generating text both character by character and word by word. We will be performing a character by character training process.
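The steps above can be sketched in plain Python before involving TensorFlow. This is only an illustration: the `<stop>` token and the helper names `make_training_pair` and `decode` are hypothetical and do not appear in the notebook's own code.

```python
STOP = "<stop>"  # hypothetical stop token, used here for illustration only

def make_training_pair(tokens):
    """Return (inputs, labels) where each label is the token that follows the input."""
    sequence = list(tokens) + [STOP]
    return sequence[:-1], sequence[1:]

def decode(predicted_ids, id_to_token):
    """Map numeric predictions back to tokens, halting at the stop token."""
    output = []
    for idx in predicted_ids:
        token = id_to_token[idx]
        if token == STOP:
            break
        output.append(token)
    return "".join(output)

inputs, labels = make_training_pair("hello")
print(inputs)  # ['h', 'e', 'l', 'l', 'o']
print(labels)  # ['e', 'l', 'l', 'o', '<stop>']

id_to_token = ["h", "e", "l", "o", STOP]
print(decode([0, 1, 2, 2, 3, 4, 0], id_to_token))  # hello
```

The same shifting works whether the tokens are characters or words; only the vocabulary changes.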

In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras.layers.experimental.preprocessing import StringLookup

import numpy as np
import os
import time

In [2]:
# Load the tiny_shakespeare dataset, 40,000 lines of text from various Shakespeare plays,
# but only the "train" subset.
dataset = tfds.load(name='tiny_shakespeare')['train']

# Split each example's text from a single string into an array of characters
# (decoded as UTF-8)
dataset = dataset.map(lambda x: tf.strings.unicode_split(x['text'], 'UTF-8'))

# Extract all the unique characters (from the first dataset element) to form the vocabulary
vocabulary = sorted(set(next(iter(dataset)).numpy()))

# We're creating two functions to swap between characters and their int lookup value
ids_from_chars = StringLookup(
    vocabulary=list(vocabulary)
)

chars_from_ids = tf.keras.layers.experimental.preprocessing.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True
)

# For training at each step we're asking the model to predict the next character
# based on the current state + current character.
dataset = dataset.map(lambda x: (ids_from_chars(x[:-1]), ids_from_chars(x[1:])))

# Unbatch kind of flattens the data.
# We're sort of implying that any line can flow fluidly into another line
dataset = dataset.unbatch()

# Now we're chopping the flat data into sequences of 100 characters each
seq_len = 100
dataset = dataset.batch(seq_len, drop_remainder = True)

# We can see that "next_char" is just "current_char" shifted one position.
# The integers are StringLookup vocabulary indices rather than raw code points,
# which gives a reasonably compact numeric representation.
for current_char, next_char in dataset.take(1):
    print(current_char, '\n', b''.join(chars_from_ids(current_char).numpy()))
    print()
    print(next_char, '\n', b''.join(chars_from_ids(next_char).numpy()))


tf.Tensor(
[20 49 58 59 60  3 17 49 60 49 66 45 54 12  2 16 45 46 55 58 45  3 63 45
  3 56 58 55 43 45 45 44  3 41 54 65  3 46 61 58 60 48 45 58  8  3 48 45
 41 58  3 53 45  3 59 56 45 41 51 10  2  2 15 52 52 12  2 33 56 45 41 51
  8  3 59 56 45 41 51 10  2  2 20 49 58 59 60  3 17 49 60 49 66 45 54 12
  2 39 55 61], shape=(100,), dtype=int64) 
 b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'

tf.Tensor(
[49 58 59 60  3 17 49 60 49 66 45 54 12  2 16 45 46 55 58 45  3 63 45  3
 56 58 55 43 45 45 44  3 41 54 65  3 46 61 58 60 48 45 58  8  3 48 45 41
 58  3 53 45  3 59 56 45 41 51 10  2  2 15 52 52 12  2 33 56 45 41 51  8
  3 59 56 45 41 51 10  2  2 20 49 58 59 60  3 17 49 60 49 66 45 54 12  2
 39 55 61  3], shape=(100,), dtype=int64) 
 b'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '


In [3]:
# Now we're going to shuffle and batch it, which is standard when working with TensorFlow's Dataset class
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))

dataset

<PrefetchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
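The buffered shuffle described in the comments above can be sketched in plain Python. This is an illustrative approximation of the idea, not TensorFlow's actual implementation, and `buffered_shuffle` is a hypothetical helper: it keeps at most `buffer_size` elements in memory and emits a random one each time a new element arrives.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a partially shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)       # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]             # emit a random buffered element...
        buffer[idx] = item            # ...and put the new element in its slot
    rng.shuffle(buffer)               # stream exhausted: drain the rest in random order
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
print(shuffled)
```

A small buffer only mixes nearby elements (a buffer of size 1 leaves the order unchanged), which is why the buffer size is chosen to be large relative to the batch size.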

## The Model

Some things to note: 

1. We're using an LSTM rather than a simple RNN layer.
2. It has a relatively larger internal representation (more units) than last time.
3. The embedding dimension is similarly larger than last time.
4. The output dimension is the size of the vocab (every unique character in the training data!)

Also, we're actually subclassing the model class. This allows us to have a bit more control over when and how the layers talk to each other, and how we deal with state. Code slightly adapted from TF docs [https://www.tensorflow.org/tutorials/text/text_generation](https://www.tensorflow.org/tutorials/text/text_generation)

In [4]:
class TextGeneratorModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, rnn_units):
        super().__init__()

        # Embedding layer
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

        # LSTM Second
        self.lstm = tf.keras.layers.LSTM(rnn_units,
                                       return_sequences=True, 
                                       return_state=True)

        # Dense last.
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, inputs, h_states=None, return_state=False, training=False):
        # Transform the input using the embedding layer
        x = inputs
        x = self.embedding(x, training=training)

        # If there is no incoming state from an earlier call,
        # use the default initial state behavior.
        # Note: h_states is expected to be a [hidden_state, cell_state] pair.
        if h_states is None:
            h_states = self.lstm.get_initial_state(x)
        
        # Transform the embedding using the LSTM.
        # It returns the per-timestep outputs, the final hidden state, and the final cell state.
        x, h_states, c_states = self.lstm(x, initial_state=h_states, training=training)

        # Pass the transformed x into the dense layer
        x = self.dense(x, training=training)

        # Only return state when asked
        if return_state:
            return x, h_states, c_states
        else: 
            return x


# embedding dimension = 256, LSTM units = 1024
model = TextGeneratorModel(len(ids_from_chars.get_vocabulary()), 256, 1024)


for input_example_batch, target_example_batch in dataset.take(1):
    print(input_example_batch)
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")


    
model.summary()

tf.Tensor(
[[ 3 51 49 ... 54 47  3]
 [56 59  8 ...  3 44 45]
 [ 3 47 45 ... 48 41 52]
 ...
 [45 58 45 ...  3 41 54]
 [45 48 55 ... 44  3 41]
 [55 44 61 ... 54  3 43]], shape=(64, 100), dtype=int64)
(64, 100, 67) # (batch_size, sequence_length, vocab_size)
Model: "text_generator_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        multiple                  17152     
_________________________________________________________________
lstm (LSTM)                  multiple                  5246976   
_________________________________________________________________
dense (Dense)                multiple                  68675     
Total params: 5,332,803
Trainable params: 5,332,803
Non-trainable params: 0
_________________________________________________________________
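
The parameter counts in the summary above can be reproduced by hand, which is a useful check on the architecture. The sketch below assumes the usual LSTM layout of four weight blocks (forget, input, candidate, and output), each of which sees the embedding and the previous hidden state and has its own bias.

```python
vocab_size, embedding_dim, rnn_units = 67, 256, 1024

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 blocks, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embedding_dim + rnn_units) * rnn_units + rnn_units)

# Dense: weights from every LSTM unit to every vocabulary entry, plus biases.
dense_params = rnn_units * vocab_size + vocab_size

print(embedding_params)                               # 17152
print(lstm_params)                                    # 5246976
print(dense_params)                                   # 68675
print(embedding_params + lstm_params + dense_params)  # 5332803
```

Almost all of the parameters live in the LSTM, roughly four times what a simple RNN layer with the same number of units would need.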


In [5]:
# Let's see what happens on the untrained network...
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

print("Input:\n", b''.join(chars_from_ids(input_example_batch[0]).numpy()))
print()
print("Predictions:\n", b''.join(chars_from_ids(sampled_indices).numpy()))
# Of course... a lot of garbage :)
# But that's just because it's not trained... hopefully!

Input:
 b' king to-day, my Lord of Derby?\n\nDERBY:\nBut now the Duke of Buckingham and I\nAre come from visiting '

Predictions:
 b"LP:D;e,feBOYXaqpMMUrG'?zHQAavq&P-AiAJRQxwBhPsvndd;'yU:qpCg!CR!vnw$he!tq DVAXm!&IsoQzR$REopSeM!BQIG"


In [6]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

example_batch_loss = loss(target_example_batch, example_batch_predictions)
mean_loss = example_batch_loss.numpy().mean()
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", mean_loss)


Prediction shape:  (64, 100, 67)  # (batch_size, sequence_length, vocab_size)
Mean loss:         4.2039866
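
A quick sanity check on this number: an untrained model should spread probability roughly uniformly over the vocabulary, so its cross-entropy should be close to the natural log of the vocabulary size. The snippet assumes the 67-entry vocabulary shown in the prediction shape above.

```python
import math

vocab_size = 67
uniform_loss = -math.log(1.0 / vocab_size)  # cross-entropy of a uniform guess
print(round(uniform_loss, 4))  # 4.2047
```

The measured mean loss is very close to this value. A starting loss far above it would suggest the initial logits are badly scaled.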


In [None]:

model.compile(
    loss=loss,
    optimizer=tf.keras.optimizers.Adam(1e-4)
)

history = model.fit(
    dataset, 
    epochs=20
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
 19/156 [==>...........................] - ETA: 4:37 - loss: 2.8563

## Generating Text With The Model

Using a model to generate text is done by treating the outputs from the dense layer as the logits of a sampling distribution. At each step we run the model once, maintaining the state, sampling from the prediction, and then using our sampled character as the input for the next timestep. TensorFlow's documentation has a wonderful visual and class to help us do this more easily, so we've borrowed both from [https://www.tensorflow.org/tutorials/text/text_generation](https://www.tensorflow.org/tutorials/text/text_generation).

![](https://www.tensorflow.org/tutorials/text/images/text_generation_sampling.png)
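
Here is a small NumPy sketch of the sampling step on its own, separate from the TensorFlow class that follows. The logits are made-up values for a hypothetical 4-character vocabulary, and `sample_next_id` is an illustrative helper rather than part of the tutorial code. Dividing the logits by a temperature below 1 sharpens the distribution, while a temperature above 1 flattens it.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def sample_next_id(logits, temperature=1.0, rng=None):
    """Sample one token id from temperature-scaled logits."""
    if rng is None:
        rng = np.random.default_rng()
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5, -1.0]          # made-up scores for 4 characters
rng = np.random.default_rng(0)
for temperature in (0.5, 1.0, 2.0):
    probs = softmax(np.array(logits) / temperature)
    print(temperature, np.round(probs, 3), sample_next_id(logits, temperature, rng))
```

Sampling, rather than always taking the most likely character, is what keeps generated text from falling into repetitive loops.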

In [None]:
class OneStep(tf.keras.Model):
    def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
        super().__init__()
        self.temperature=temperature
        self.model = model
        self.chars_from_ids = chars_from_ids
        self.ids_from_chars = ids_from_chars

        # Create a mask to prevent "" or "[UNK]" from being generated.
        skip_ids = self.ids_from_chars(['','[UNK]'])[:, None]
        sparse_mask = tf.SparseTensor(
            # Put a -inf at each bad index.
            values=[-float('inf')]*len(skip_ids),
            indices = skip_ids,
            # Match the shape to the vocabulary
            dense_shape=[len(ids_from_chars.get_vocabulary())]) 
        self.prediction_mask = tf.sparse.to_dense(sparse_mask)

    @tf.function
    def generate_one_step(self, inputs, states=None):
        # Convert strings to token IDs.
        input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
        input_ids = self.ids_from_chars(input_chars).to_tensor()

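        # Run the model.
        # predicted_logits.shape is [batch, char, next_char_logits]
        # TextGeneratorModel.call takes the incoming state as `h_states` and
        # returns the hidden and cell states separately, so repack them as a pair.
        predicted_logits, h_state, c_state = self.model(inputs=input_ids, h_states=states,
                                                        return_state=True)
        states = [h_state, c_state]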
        # Only use the last prediction.
        predicted_logits = predicted_logits[:, -1, :]
        predicted_logits = predicted_logits/self.temperature
        # Apply the prediction mask: prevent "" or "[UNK]" from being generated.
        predicted_logits = predicted_logits + self.prediction_mask

        # Sample the output logits to generate token IDs.
        predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
        predicted_ids = tf.squeeze(predicted_ids, axis=-1)

        # Convert from token ids to characters
        predicted_chars = self.chars_from_ids(predicted_ids)

        # Return the characters and model state.
        return predicted_chars, states