# Generating a Shrek movie script using a simple RNN 

The scripts for [Shrek](https://www.imsdb.com/scripts/Shrek.html) and [Shrek The Third](https://www.imsdb.com/scripts/Shrek-the-Third.html) were downloaded and saved into a single text file.

Using a simple RNN trained to predict the following character given an input sequence of characters from this file, we are able to generate our own Shrek 'scripts'.

![alt text](https://s.wsj.net/public/resources/images/OB-IO273_shrekf_E_20100520084037.jpg)

In [0]:
try:
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
import numpy as np
import os
import time
import re

TensorFlow 2.x selected.


Open the text file containing the two Shrek movie scripts.

In [0]:
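# Read the raw bytes and decode as UTF-8; the context manager closes the file afterwards
with open('/content/shrek_script.txt', 'rb') as f:
  text = f.read().decode(encoding='utf-8')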

Preview the first 100 characters of the text file. Notice the excess white space caused by the formatting of the scripts.

In [0]:
text[:100]

'SHREK\n\n                                       Written by\n\n                                William St'

We replace each run of repeated spaces with a single space using a regular expression.

In [0]:
clean_text = re.sub(' +', ' ', text)
clean_text[:100]

'SHREK\n\n Written by\n\n William Steig & Ted Elliott\n\n\n\n\n SHREK\n Once upon a time there was a lovely \n p'
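The pattern `' +'` matches runs of spaces only, so newlines (and hence the line structure of the script) survive the cleaning step. A minimal sketch on a made-up string; `sample` is an illustration and is not taken from the script file:

```python
import re

# Hypothetical sample mimicking the padded layout of the script file
sample = "SHREK\n\n          Written by\n\n      William Steig"

# ' +' matches one or more spaces; newlines are not matched, so line breaks are kept
collapsed = re.sub(' +', ' ', sample)

print(repr(collapsed))
```

If tabs or other whitespace also needed collapsing, a different pattern would be required; the notebook only targets spaces.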

The vocabulary is simply a list of all unique characters present in the scripts.

In [0]:
vocab = sorted(set(clean_text))

In [0]:
vocab[:5] # Print first 5 unique characters

['\n', ' ', '!', '"', '#']

Since our model requires numerical representations of the data, we need to map each unique character to a number and be able to reverse said mapping.

In [0]:
char2idx = {c:i for i, c in enumerate(vocab)} # Maps characters to a number
idx2char = np.array(vocab) # Maps numbers to a character

Let us look at the first 10 character mappings.

In [0]:
for char in range(10):
  print('{:4s}: {:2d}'.format(repr(idx2char[char]), char2idx[idx2char[char]]))

'\n':  0
' ' :  1
'!' :  2
'"' :  3
'#' :  4
'&' :  5
"'" :  6
'(' :  7
')' :  8
',' :  9
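To check that the two mappings really are inverses of each other, we can encode a string and decode it again. The sketch below uses a toy string and plain Python lists instead of the script and NumPy; all `toy_*` names are introduced here for illustration only:

```python
# Toy text standing in for the script (an assumption for illustration)
toy_text = "ogres are like onions"

toy_vocab = sorted(set(toy_text))                        # unique characters, sorted
toy_char2idx = {c: i for i, c in enumerate(toy_vocab)}   # character -> index
toy_idx2char = toy_vocab                                 # index -> character (list lookup)

# Encode every character, then decode the indices back into a string
encoded = [toy_char2idx[c] for c in toy_text]
decoded = ''.join(toy_idx2char[i] for i in encoded)

print(encoded[:5])
print(decoded == toy_text)
```

The round trip is lossless because every character in the text is, by construction, present in the vocabulary.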


We may now numericalise the text using `char2idx`.

In [0]:
numerical_text = np.array([char2idx[c] for c in clean_text])

The first 20 characters of the script and their numerical representation.

In [0]:
print ('{} \n\n maps to \n\n {}'.format(repr(clean_text[:20]), numerical_text[:20]))

'SHREK\n\n Written by\n\n' 

 maps to 

 [44 33 43 30 36  0  0  1 48 70 61 72 72 57 66  1 54 77  0  0]


Given a sequence of `N` characters from the script, the aim of the model is to predict character `N+1`. The length of the input sequence is controlled by the `seq_length` variable and can be changed.

We use the `tf.data.Dataset.from_tensor_slices` function to convert our numericalised text into a stream of character indices. Using the `batch` method, we then group this stream into sequences of length `seq_length + 1`, since each training example needs `seq_length` input characters plus one extra character to complete the shifted target.

In [0]:
seq_length = 100

char_dataset = tf.data.Dataset.from_tensor_slices(numerical_text) # Forms a tensor from each char in numerical_text

sequences = char_dataset.batch(seq_length+1, drop_remainder=True) # Groups the character stream into chunks of seq_length + 1, from which the input/target pairs are formed

We now define a function to assign the first `seq_length` characters as the input and then shift over the characters by one place to form the target. We then map all sequences using this function to form our input/target pairs.

In [0]:
def gen_splits(chunk):
    input_text = chunk[:-1] # First seq_length chars are input
    target_text = chunk[1:] # Last seq_length chars: the next character for every input position
    return input_text, target_text

dataset = sequences.map(gen_splits)

Example input and target pair. We can see the aim of the model is to predict the following character for a given stream of characters.

In [0]:
for i, o in dataset.take(1):
  print('Input stream: ', repr(''.join(idx2char[i.numpy()])))
  print('Target stream: ', repr(''.join(idx2char[o.numpy()])))

Input stream:  'SHREK\n\n Written by\n\n William Steig & Ted Elliott\n\n\n\n\n SHREK\n Once upon a time there was a lovely \n p'
Target stream:  'HREK\n\n Written by\n\n William Steig & Ted Elliott\n\n\n\n\n SHREK\n Once upon a time there was a lovely \n pr'
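The chunk-then-shift logic can also be sketched without TensorFlow. The helper names `chunk_stream` and `split_input_target` and the toy string are assumptions made for this illustration; they mimic `batch(..., drop_remainder=True)` followed by `gen_splits`, but are not part of the notebook:

```python
def chunk_stream(stream, seq_length):
    """Split a sequence into consecutive chunks of seq_length + 1 items,
    dropping any incomplete final chunk (like batch(..., drop_remainder=True))."""
    size = seq_length + 1
    n_chunks = len(stream) // size
    return [stream[k * size:(k + 1) * size] for k in range(n_chunks)]

def split_input_target(chunk):
    """Input is everything but the last item; target is everything but the first."""
    return chunk[:-1], chunk[1:]

toy_stream = list("SHREK AND DONKEY")   # 16 characters, hypothetical toy data
pairs = [split_input_target(c) for c in chunk_stream(toy_stream, seq_length=4)]

for inp, tgt in pairs:
    print(repr(''.join(inp)), '->', repr(''.join(tgt)))
```

With `seq_length=4` the 16 characters give three chunks of five and the trailing `'Y'` is dropped. In each pair, the target at position `k` is the character that follows the input at position `k`.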


In [0]:
BATCH_SIZE = 256
BUFFER_SIZE = 10000 # tf.data is designed to handle potentially infinite streams, so it shuffles within a buffer of this many elements rather than over the whole dataset

We now shuffle the data and create training batches.

In [0]:
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
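Because both the chunking and the batching use `drop_remainder=True`, the number of training steps per epoch follows from two integer divisions. A rough sketch of that arithmetic; the helper `steps_per_epoch` is hypothetical and the character count used below is a made-up figure, not the measured length of `clean_text`:

```python
def steps_per_epoch(n_chars, seq_length, batch_size):
    """Number of full training batches produced per epoch."""
    n_sequences = n_chars // (seq_length + 1)   # incomplete final chunk is dropped
    return n_sequences // batch_size            # incomplete final batch is dropped

# 190_000 is a hypothetical character count used purely for illustration
print(steps_per_epoch(190_000, seq_length=100, batch_size=256))   # prints 7
```

With so few steps per epoch, a smaller `BATCH_SIZE` or `seq_length` would give more gradient updates per pass over the data, at the cost of noisier updates or shorter context.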

We now define a function to build our model using the `tf.keras.Sequential` API.

We begin with an embedding layer to map each character to an embedding space with `embedding_dim` dimensions. 

The recurrent layers follow. Here we have a choice of [LSTMs](https://www.bioinf.jku.at/publications/older/2604.pdf) or [GRUs](https://arxiv.org/pdf/1412.3555.pdf), specified with the `cell_type` argument. We may add as many of these layers as we wish using the `rec_layers` argument. The number of units in each recurrent layer is controlled with the `rnn_units` argument.

Two fully connected layers with dropout between them form the head of the model. The dropout rate is specified using the `drop_rate` argument.

In [0]:
def build_model(vocab_size, embedding_dim = 256, cell_type = 'lstm', rec_layers = 1, rnn_units = 1024, drop_rate = 0.2, batch_size = 256):

  cell_types = ['lstm', 'gru']

  model = tf.keras.Sequential()

  model.add(tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]))

  if cell_type not in cell_types:
    raise ValueError("Invalid cell type. Expected one of: {}".format(repr(cell_types)))

  if cell_type == 'lstm':
    for _ in range(rec_layers):
      model.add(tf.keras.layers.LSTM(rnn_units, return_sequences=True, stateful=True))

  else:
    for _ in range(rec_layers):
      model.add(tf.keras.layers.GRU(rnn_units, return_sequences=True, stateful=True))

  model.add(tf.keras.layers.Dense(rnn_units))
  model.add(tf.keras.layers.Dropout(drop_rate))
  model.add(tf.keras.layers.Dense(vocab_size))

  return model
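To get a feel for the size of this architecture, the trainable parameters can be counted by hand. The sketch below covers the LSTM variant only and assumes standard LSTM layers with bias terms; the helper names and the vocabulary size of 80 are assumptions for illustration, as the real value is `len(vocab)`:

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per character
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias vector
    return input_dim * units + units

def count_params(vocab_size, embedding_dim=256, rnn_units=1024, rec_layers=1):
    total = embedding_params(vocab_size, embedding_dim)
    input_dim = embedding_dim
    for _ in range(rec_layers):
        total += lstm_params(input_dim, rnn_units)
        input_dim = rnn_units                      # next layer sees the LSTM output
    total += dense_params(rnn_units, rnn_units)    # first dense layer
    total += dense_params(rnn_units, vocab_size)   # output layer (dropout has no weights)
    return total

print(count_params(vocab_size=80))   # prints 6399056
```

Under these assumptions the single LSTM layer accounts for most of the weights, so `rnn_units` and `rec_layers` are the arguments that affect model size the most.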


We now build our model. Based on the results of this [paper](https://www.researchgate.net/publication/335158858_LSTM_vs_GRU_vs_Bidirectional_RNN_for_script_generation), we will use LSTM cells, as its authors found them to perform better than GRUs for script generation.

In [0]:
vocab_size = len(vocab)
rec_layers = 1

model = build_model(vocab_size, cell_type = 'lstm', rec_layers = rec_layers, batch_size = BATCH_SIZE)

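Since the model predicts one character out of the vocabulary at each position, this is a standard multi-class classification problem, and we make use of the `tf.keras.losses.sparse_categorical_crossentropy` loss function. The final `Dense` layer has no activation and therefore outputs logits, so we set `from_logits=True`.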

In [0]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
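For intuition, the quantity computed for a single position is the negative log-probability that the softmax of the logits assigns to the true character. A plain-Python sketch; the function name `sparse_ce_from_logits` and the example logits are assumptions for illustration, not TensorFlow code:

```python
import math

def sparse_ce_from_logits(label, logits):
    """Cross-entropy for one integer label given unnormalised scores (logits)."""
    m = max(logits)                                            # subtract max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]                         # equals -log(softmax(logits)[label])

# Uniform scores over 4 classes: the loss equals log(4)
uniform = sparse_ce_from_logits(2, [0.0, 0.0, 0.0, 0.0])
# A strongly favoured correct class: the loss is close to zero
confident = sparse_ce_from_logits(2, [0.0, 0.0, 10.0, 0.0])

print(uniform, confident)
```

A useful sanity check follows from the uniform case: a model that has learned nothing should start with a loss of roughly `log(vocab_size)`.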

We now compile our model using this loss and set Adam as the optimizer.

In [0]:
model.compile(optimizer='Adam', loss=loss)

The following callback saves checkpoints during training. We will use the latest checkpoint to restore our model prior to generating a script.

In [0]:
ckpt_dir = './training_checkpoints'
ckpt = os.path.join(ckpt_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=ckpt,
    save_weights_only=True,
    save_freq=20000) # Save a checkpoint only after 20000 samples have been seen, to reduce disk usage on Colab (newer TensorFlow releases count batches rather than samples here)

We are now ready to train our model.

In [0]:
EPOCHS=750

In [0]:
model_history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback], verbose=1)

Train for 7 steps
Epoch 1/750
Epoch 2/750
...

To generate text, we rebuild the model with a batch size of 1 and load the weights saved in the latest checkpoint. Rebuilding is necessary because a stateful recurrent model has a fixed batch size, which cannot be changed once the model has been built.

In [0]:
writer = build_model(vocab_size, cell_type = 'lstm', rec_layers = rec_layers, batch_size = 1)

writer.load_weights(tf.train.latest_checkpoint(ckpt_dir))

writer.build(tf.TensorShape([1, None]))


Since we wish to generate a single Shrek script, we compute the average length of the two scripts. This is how many characters we will ask the model to generate.

In [0]:
script_length = len(clean_text)//2

We now define a function `generate_text` to generate our script.

Initially, we pass a starting string to the model, from which it must predict the following character. The model returns logits over the vocabulary, and the index of the next character is sampled from the corresponding [categorical distribution](https://en.wikipedia.org/wiki/Categorical_distribution) rather than simply taken as the most likely character. The sampled character is then passed as the next input, while the stateful recurrent layers carry their hidden state over from the previous step. This process is repeated until `gen_to` characters have been generated.

In [0]:
def generate_text(model, start_string, gen_to):

  # Numericalise starting string
  input_eval = [char2idx[c] for c in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  text_generated = []

  # controls the degree of 'randomness' of the character generation
  temperature = 1.0

  model.reset_states()
  for i in range(gen_to):
      # predict next char
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # categorical distribution to predict the character
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # Pass predicted char and previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
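The role of `temperature` is easier to see in isolation. Dividing the logits by a temperature below 1 sharpens the distribution towards the most likely character, while a temperature above 1 flattens it. A plain-Python sketch of this sampling step; the helper names and `toy_logits` are assumptions for illustration and stand in for `tf.random.categorical`:

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after scaling them by 1 / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                                  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature=1.0, rng=random):
    """Draw one index from the categorical distribution defined by the logits."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]   # hypothetical scores for a 3-character vocabulary

print(softmax_with_temperature(toy_logits, temperature=1.0))
print(softmax_with_temperature(toy_logits, temperature=0.2))
print(sample_index(toy_logits, temperature=0.2, rng=random.Random(0)))
```

In `generate_text` the temperature is fixed at 1.0, so characters are sampled from the model's unmodified distribution. Lower values would likely give more repetitive but more legible output.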

We are now ready to ask our model to generate a thrilling Shrek script for us to (hopefully) enjoy.

In [0]:
print(generate_text(writer, start_string=u"THE SHREKINING\n\n Written by\n\n A Basic RNN\n\n\n\n\n", gen_to = script_length))

THE SHREKINING

 Written by

 A Basic RNN




 They hold tou of
 home on 
 lf.)
 
 DONKEY
 What's your lork away.
 
 SHREK
 What?

 DONKEY
 Where do, uh, I sleep?

 SHREK
 (irritated) Outside!

 DONKEY
 Oh, well, I I will have - - (He CREATURES
 Avil does Donkey away. He falls, knocking over a
 guard holding an axe on his way down. The guard drops the
 logerbread Man is attending school.
 
 TEACHARES
 Sell the lavy bouthis plan.
 
 SHREK
 Donkey, we're dealing with
 amateurs.
 
 The guards are confused. They all ounder the 3 mice) What are 
 you doing in my house? (He gets bumped 
 from behind and he drops the mice.) 
 Hey, the boat and if the door buss him.
 
 ST I think that went perche Third - Final Screening Script 69.
 
 
 
 SHREK
 Thanks Artie.
 
 ARTIE
 The soap's because you start the Third - Final Screening Script 17.
 
 
 
 Shrek leans in closer after each "is," waiting in
 anyou.
 
 There is a montage of scement.
 
 The pirates aim the cannon at Puss, Donkey and Artie. Artie

These scripts are obviously not going to win any awards, but they aren't bad for a simple RNN.

![alt text](https://dingo.care2.com/pictures/petition_images/petition/466/386203-1549228753-wide.jpg)