# Text Generation with an RNN

This notebook demonstrates how to generate text using a character-based RNN. We will work with a dataset of Shakespeare's writing from Andrej Karpathy's [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). 

In the final results of the notebook you can see that, while some of the sentences are grammatical, they make only partial sense. The model has not learned the meaning of words, but consider:

* The model is character-based.
* The structure of the output resembles a play: blocks of text generally begin with a speaker name in all capital letters, similar to the dataset.
* The model is trained on small batches of text (100 characters each), and is still able to generate a longer sequence of text with coherent structure.
* The model has no built-in concept of words (no pretrained word embeddings are used), yet it trains its own character embeddings and produces actual words for the most part.

## Setup and Preprocessing

To train the model in a reasonable time and with a relatively small training corpus, we train it to predict the next character from a given context window, instead of training it on words and their embeddings.

We define the following utilities:
- `process_text`: Reads the file at `file_path` and returns the full text encoded as an array of indices, the vocabulary (the sorted unique characters), a character-to-index mapping, and an index-to-character array.
- `split_input_target`: Creates an (input, target) pair from each fixed-length sequence; the target is the same sequence shifted forward by one character.
- `create_dataset`: Takes the encoded text and converts it into a dataset of batched (input, target) sequences that are fed into the model. It also shuffles the sequences, which generally helps training.
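
To make the character mapping and the input/target shift concrete, here is a minimal pure-Python sketch on a toy string. The sample text `"hello world"` and the variable names `input_ids` and `target_ids` are illustrative assumptions; the notebook's own utilities use NumPy arrays and the full Shakespeare file.

```python
# Toy stand-in for the Shakespeare text (hypothetical sample string)
text = "hello world"

# Same mapping idea as `process_text`: sorted unique characters -> indices
vocab = sorted(set(text))
char2idx = {ch: i for i, ch in enumerate(vocab)}
idx2char = list(vocab)

# Encode the whole text as a list of integer indices
text_as_int = [char2idx[ch] for ch in text]

# Same split as `split_input_target`: the target is the input shifted by one character
chunk = text_as_int[:6]  # encodes "hello "
input_ids, target_ids = chunk[:-1], chunk[1:]

input_text = "".join(idx2char[i] for i in input_ids)
target_text = "".join(idx2char[i] for i in target_ids)
print(repr(input_text), "->", repr(target_text))
```

At every position the model sees one character of the input and is asked to predict the character at the same position in the target, which is simply the next character in the text.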

In [9]:
# Importing the required libraries
import numpy as np
import tensorflow as tf
import keras

In [10]:
def process_text(file_path):
    # Read the whole file as UTF-8 text; the context manager closes the file afterwards
    with open(file_path, 'rb') as f:
        text = f.read().decode(encoding='utf-8')
    vocab = sorted(set(text))  # The unique characters in the file

    # Creating a mapping from unique characters to indices and vice versa
    char2idx = {u: i for i, u in enumerate(vocab)}
    idx2char = np.array(vocab)
    text_as_int = np.array([char2idx[c] for c in text])

    return text_as_int, vocab, char2idx, idx2char

In [11]:
def split_input_target(chunk):
    input_text, target_text = chunk[:-1], chunk[1:]
    return input_text, target_text

In [12]:
def create_dataset(text_as_int, seq_length, batch_size, buffer_size):
    char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
    # Create sequences and map them to a pair of input and output
    dataset = char_dataset.batch(seq_length + 1, drop_remainder=True).map(split_input_target)
    # Shuffle the sequences using a buffer of `buffer_size` elements, then group them into batches
    dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)

    return dataset
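
The effect of the two `drop_remainder=True` calls can be checked with simple arithmetic. The following is a rough pure-Python emulation of the chunking and batching steps, not the `tf.data` pipeline itself; shuffling is left out, and the helper names `chunk_sequences` and `make_batches` and the toy sizes are assumptions made for illustration.

```python
def chunk_sequences(encoded, seq_length):
    """Split `encoded` into chunks of seq_length + 1 items, dropping any short remainder."""
    size = seq_length + 1
    n_chunks = len(encoded) // size
    return [encoded[i * size:(i + 1) * size] for i in range(n_chunks)]


def make_batches(pairs, batch_size):
    """Group pairs into full batches only, dropping an incomplete final batch."""
    n_batches = len(pairs) // batch_size
    return [pairs[i * batch_size:(i + 1) * batch_size] for i in range(n_batches)]


encoded = list(range(23))  # hypothetical encoded text of 23 characters
chunks = chunk_sequences(encoded, seq_length=4)  # 23 // 5 = 4 chunks, 3 items dropped
pairs = [(c[:-1], c[1:]) for c in chunks]  # same split as `split_input_target`
batches = make_batches(pairs, batch_size=3)  # 4 // 3 = 1 batch, 1 pair dropped

print(len(chunks), len(batches))
print(pairs[0])
```

Each chunk holds `seq_length + 1` characters so that, after the split, both the input and the target have exactly `seq_length` characters.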


## Building the Model

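The model is defined as a Sequential model in TensorFlow Keras, indicating a linear stack of layers. In the code, the LSTM, Dropout, and Batch Normalization layers (items 2 to 4 in the list) appear twice in a row before the final Dense layer.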

1. **Embedding Layer:**
    - Purpose: Converts integer-encoded vocabulary indices into dense vectors of fixed size.
    - Parameters:
        - `vocab_size`: Size of the vocabulary, i.e., the total number of unique characters in the input.
        - `embedding_dim`: Dimension of the dense embedding.
        - `batch_input_shape`: Shape of the input data, `[batch_size, None]`; the batch size is fixed (required by the stateful LSTM layers) and `None` allows a variable sequence length.
        - `trainable`: When `True`, the embedding weights are updated during training, so the model learns its own vector representation of each character.

2. **LSTM Layer**
    - Purpose: Long Short-Term Memory (LSTM) layer with return sequences set to True, indicating it returns the full sequence of outputs for each input sequence.
    - Parameters:
        - `rnn_units`: Number of LSTM units.
        - `return_sequences`: True to return the full sequence.
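        - `stateful`: True to maintain state across batches: the final state of each sample in a batch is used as the initial state of the sample at the same index in the next batch.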
        - `recurrent_initializer`: Initialization for the recurrent weights.

3. **Dropout Layer**
    - Purpose: Introduces dropout to prevent overfitting during training.
    - Parameter:
        - `0.1`: Fraction of input units to drop.

4. **Batch Normalization Layer**
    - Purpose: Normalizes and scales the inputs, helping stabilize and accelerate the training process.

5. **Dense Output Layer:**
    - Purpose: Dense (fully connected) layer responsible for generating the output predictions.
    - Parameter:
        - `vocab_size`: Number of units, representing the size of the output vocabulary.
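
The parameter counts that `model.summary()` reports later in this notebook can be reproduced by hand from the layer descriptions. The sketch below uses the standard formulas for these layer types with default settings; the vocabulary size of 65 is not stated in the text and is inferred here from the reported embedding count (16640 / 256), so treat it as an assumption. The helper function names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    # One vector of length `embedding_dim` per vocabulary entry
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)


def batch_norm_params(features):
    # gamma, beta, moving mean, and moving variance for every feature
    return 4 * features


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


# 65 is inferred from the summary output, not stated in the notebook text
vocab_size, embedding_dim, rnn_units = 65, 256, 1024

print("embedding:", embedding_params(vocab_size, embedding_dim))
print("first LSTM:", lstm_params(embedding_dim, rnn_units))
print("batch norm:", batch_norm_params(rnn_units))
print("second LSTM:", lstm_params(rnn_units, rnn_units))
print("dense:", dense_params(rnn_units, vocab_size))
```

The second LSTM is larger than the first only because its input is the 1024-dimensional output of the first LSTM block rather than the 256-dimensional embedding. The Dense layer's count does not appear in the summary output shown in the notebook, so it is computed here without a value to compare against.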

In [14]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = keras.Sequential([
        keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None], trainable=True),
        keras.layers.LSTM(rnn_units, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
        keras.layers.Dropout(0.1),
        keras.layers.BatchNormalization(),
        keras.layers.LSTM(rnn_units, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
        keras.layers.Dropout(0.1),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(vocab_size)
    ])

    return model

In [15]:
# Defining a loss function with `from_logits=True`, since the final Dense layer outputs raw logits
# This is used in the training of our model
def loss(labels, logits):
    return keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
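
To show what this loss computes, here is a small pure-Python sketch of sparse categorical cross-entropy on raw logits. It is an illustrative re-implementation under simplifying assumptions (plain lists, no batching dimension), not the Keras code; the function names and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def sparse_crossentropy_from_logits(labels, logits):
    """One loss value per position: minus the log-probability of the true label."""
    return [-log_softmax(row)[label] for label, row in zip(labels, logits)]


# Hypothetical logits for 2 positions over a 3-character vocabulary
logits = [[2.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
labels = [0, 2]  # integer class indices, not one-hot vectors

losses = sparse_crossentropy_from_logits(labels, logits)
print(losses)
```

The second position has uniform logits, so its loss is ln 3, the cost of guessing among three equally likely characters. The first position favors the correct label and therefore has a lower loss. "Sparse" refers to the labels being integer indices, which matches the integer-encoded targets produced by `create_dataset`.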

In [16]:
# Utility to generate the text
def generate_text(model, char2idx, idx2char, start_string, generate_char_num, temperature=1.0):
    # Low temperatures result in more predictable text; higher temperatures result in more surprising text.

    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]

    # Expand the dimension of the input by 1 as the model expects batches
    input_eval = tf.expand_dims(input_eval, 0)

    text_generated = []
    model.reset_states()

    for _ in range(generate_char_num):
        predictions = model(input_eval)

        # Remove the extra dimension corresponding to the batches in the output
        predictions = tf.squeeze(predictions, 0)

        predictions /= temperature
        # Using a categorical distribution to predict the character returned by the model
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()

        # We pass the predicted character as the next input to the model along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], axis=0)
        text_generated.append(idx2char[predicted_id])

    return start_string + ''.join(text_generated)
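
The role of `temperature` is easier to see on a small example. Sampling from a categorical distribution over logits divided by the temperature amounts to sampling from the softmax of the scaled logits. The following pure-Python sketch mirrors that idea; the helper names and the toy logits are assumptions, and `random.choices` stands in for `tf.random.categorical`.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Probabilities obtained after dividing the logits by `temperature`."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_id(logits, temperature=1.0, rng=random):
    """Draw one index according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-character vocabulary
for t in (0.2, 1.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

rng = random.Random(0)  # seeded only to make the sketch repeatable
samples = [sample_id(logits, 0.2, rng) for _ in range(1000)]
print(samples.count(0) / len(samples))
```

A low temperature concentrates almost all probability on the highest-scoring character, while a high temperature flattens the distribution toward uniform, which is why the generated text becomes more surprising.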

## Running the model

We use two models for simplicity.
- We first train a model that accepts batches of text (of size `64`), which makes training on the supplied data more efficient. When training is complete, we save its weights.
- This model is never used for the actual prediction.
- Then we create another model with exactly the same architecture, but with a `batch_size` of `1`. This model loads the weights of the trained model and lets us pass a single input sequence, without having to assemble a full batch of `64`.
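
The reason the weights transfer between the two models is that a layer's weights depend only on the layer sizes, while the stored recurrent state has one row per batch element. The NumPy sketch below illustrates this with a simplified tanh recurrent cell, which is a stand-in for an LSTM and not the notebook's model; all names and sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, units = 8, 16  # hypothetical sizes, much smaller than the notebook's

# The weights depend only on the layer sizes, not on the batch size
W_x = rng.normal(size=(input_dim, units))
W_h = rng.normal(size=(units, units))
b = np.zeros(units)


def rnn_step(x, h):
    """One step of a simple tanh recurrent cell (simplified stand-in for an LSTM)."""
    return np.tanh(x @ W_x + h @ W_h + b)


# "Training" shape: a batch of 64 inputs, with one state row per batch element
x_train = rng.normal(size=(64, input_dim))
h_train = rnn_step(x_train, np.zeros((64, units)))

# "Prediction" shape: the same weights applied to a batch of 1
x_single = x_train[:1]
h_single = rnn_step(x_single, np.zeros((1, units)))

print(h_train.shape, h_single.shape)
print(np.allclose(h_train[0], h_single[0]))
```

The state array is the only thing whose shape changes with the batch size, which is consistent with a stateful layer needing a fixed batch size when it is built, and with rebuilding the model for `batch_size` of `1` before loading the saved weights.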

In [19]:
# Load the standard `shakespeare` text file provided by TensorFlow
path_to_file = keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

# Define the constants
RNN_UNITS = 1024
EMBEDDING_DIMENSIONS = 256
BATCH_SIZE = 64
BUFFER_SIZE = 10000
SEQUENCE_LENGTH = 100
TEXT_TO_GENERATE = 2000
EPOCHS = 50

# Processing the file
text_as_int, vocab, char2idx, idx2char = process_text(path_to_file)
VOCAB_SIZE = len(vocab)

# Create the dataset
dataset = create_dataset(
  text_as_int,
  SEQUENCE_LENGTH,
  BATCH_SIZE,
  BUFFER_SIZE
)
# Create the model
model = build_model(
  VOCAB_SIZE,
  EMBEDDING_DIMENSIONS,
  RNN_UNITS,
  BATCH_SIZE
)
model.compile(optimizer='adam', loss=loss)
model.summary()
# Train the model on the dataset and save the results
history = model.fit(dataset, epochs=EPOCHS)
model.save_weights("shakespeare_weights.h5", save_format='h5')

# Create the prediction model but with batch_size = 1
model = build_model(
  VOCAB_SIZE,
  EMBEDDING_DIMENSIONS,
  RNN_UNITS,
  1
)
model.load_weights("shakespeare_weights.h5")

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (64, None, 256)           16640     
                                                                 
 lstm_2 (LSTM)               (64, None, 1024)          5246976   
                                                                 
 dropout_2 (Dropout)         (64, None, 1024)          0         
                                                                 
 batch_normalization_2 (Bat  (64, None, 1024)          4096      
 chNormalization)                                                
                                                                 
 lstm_3 (LSTM)               (64, None, 1024)          8392704   
                                                                 
 dropout_3 (Dropout)         (64, None, 1024)          0         
                                                      

In [21]:
import time

# Predict the text and generate content
user_input = "KING HENRY VI: You shall face death."

start = time.time()
generated_text = generate_text(model, char2idx, idx2char, user_input, TEXT_TO_GENERATE)
end = time.time()

print(generated_text)
print("RUNTIME: ", end - start)

KING HENRY VI: You shall face death.

BRUTUS:
But since you shall perceive your grace my sons should proud te as the posterns
So excellent feathers from the earth to see him such a cuff
Than a paper early to the seat,
And tunder'd my command.

WARWICK:
I will be mild: I know you all short as sweet
That you take with unthat which grieves my heart,
And wet my choice is now my meaning to return.
Perchance she cannot meet him: that's the matter?

Messenger:
The news,
Ah, my young prince, whose honourable thoughts,
Though these were honour!

MERCUTIO:
O heavens! what fear shall wheelike him down and in the chair of state,
My name the lights, I woo not
Our renowned spirit.

Provost:
Here, my lord.

DUKE VINCENTIO:
No more of this. Canst thou tell if
Claudio is condemned.

LEONTES:
As now she proved the liars.

MENENIUS:
O sir, you are not; 'tis it true.

POMPEY:
If you head at his pomp,
Allowing him a breath, a little dinear, the sport is done effect.

DUKE OF YORK:
Well, bear you where you 