# Text Generation using Recurrent Neural Networks
 
Based on the TensorFlow tutorial [Text generation with an RNN](https://www.tensorflow.org/text/tutorials/text_generation)

Text generation uses the Shakespeare dataset from [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

Modifications are described along the way. The notebook does not currently use any available GPU, but it will by the time we're done.

## Step 0 - Environment Setup 


In [2]:
import tensorflow as tf

import numpy as np
import os
import time



In [1]:
local_data_path_root = "C:/LocalResearch/JPD-Research/translationWork"

In [31]:
local_data_path = local_data_path_root+ "/data"
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt', cache_subdir=local_data_path)

# Read, then decode 
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# length of text is the number of characters in it
print(f'Length of text: {len(text)} characters')
print(text[:250])

vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

Length of text: 1115394 characters
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

65 unique characters


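The cell above downloads the dataset and validates it by printing its length, its first 250 characters, and the number of unique characters.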

## Step 1 - Preprocess the data 
With the dataset loaded, we need to vectorize it: the text is first split into character tokens, and each token is then mapped to a numeric ID using a StringLookup layer.

In [7]:
example_texts = ['abcdefg', 'xyz']

chars = tf.strings.unicode_split(example_texts, input_encoding='UTF-8')
print(chars)

ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary=list(vocab), mask_token=None)

ids = ids_from_chars(chars)
print(ids)

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>

<tf.RaggedTensor [[40, 41, 42, 43, 44, 45, 46], [63, 64, 65]]>


For generation, this mapping must be invertible. A second StringLookup layer, built from the same vocabulary with invert=True, recovers characters from IDs,
and tf.strings.reduce_join joins those characters back into strings.

In [8]:
chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)
chars = chars_from_ids(ids)
print(chars)

print(tf.strings.reduce_join(chars, axis=-1).numpy())

def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)

print( text_from_ids(ids))
print( text_from_ids(ids).numpy() )


<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>
[b'abcdefg' b'xyz']
tf.Tensor([b'abcdefg' b'xyz'], shape=(2,), dtype=string)
[b'abcdefg' b'xyz']
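
To make the two lookups concrete, here is a minimal plain-Python sketch of the same round trip. It assumes a toy vocabulary built from a short string and reserves index 0 for unknown characters, which mirrors where the StringLookup layer above places `[UNK]` when `mask_token=None`. The names `char_to_id`, `id_to_char`, `encode` and `decode` are illustrative only and are not part of the notebook.

```python
# Toy stand-ins for ids_from_chars / chars_from_ids (illustrative names).
toy_text = "hello world"
toy_vocab = sorted(set(toy_text))

# Index 0 is reserved for unknown characters; real characters start at 1.
char_to_id = {ch: i + 1 for i, ch in enumerate(toy_vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}
id_to_char[0] = "[UNK]"

def encode(s):
    """Map each character to its ID, falling back to 0 for unknown characters."""
    return [char_to_id.get(ch, 0) for ch in s]

def decode(ids):
    """Map IDs back to characters and join them into one string."""
    return "".join(id_to_char[i] for i in ids)

encoded = encode("hello")
print(encoded)
print(decode(encoded))
print(encode("hex"))  # 'x' is not in the toy vocabulary, so it maps to 0
```

The round trip is lossless only for characters that are in the vocabulary; anything else collapses to the single unknown ID.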


Task definition: given a character or a sequence of characters, predict the most likely next character.
RNNs maintain an internal state that depends on the elements seen so far, and use that state to predict the next character.

Split the text into training sequences. Each input holds seq_length characters,
and its target is the same text shifted one character to the right, so each chunk needs seq_length+1 characters.

In [9]:
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
print(all_ids)
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)
for ids in ids_dataset.take(10):
    print(chars_from_ids(ids).numpy().decode('utf-8'))


tf.Tensor([19 48 57 ... 46  9  1], shape=(1115394,), dtype=int64)
F
i
r
s
t
 
C
i
t
i


In [10]:
seq_length = 100

Use batch to group the individual characters into sequences of seq_length+1 characters.

In [11]:
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

for seq in sequences.take(1):
  print(chars_from_ids(seq))  
for seq in sequences.take(5):
  print(text_from_ids(seq))

tf.Tensor(
[b'F' b'i' b'r' b's' b't' b' ' b'C' b'i' b't' b'i' b'z' b'e' b'n' b':'
 b'\n' b'B' b'e' b'f' b'o' b'r' b'e' b' ' b'w' b'e' b' ' b'p' b'r' b'o'
 b'c' b'e' b'e' b'd' b' ' b'a' b'n' b'y' b' ' b'f' b'u' b'r' b't' b'h'
 b'e' b'r' b',' b' ' b'h' b'e' b'a' b'r' b' ' b'm' b'e' b' ' b's' b'p'
 b'e' b'a' b'k' b'.' b'\n' b'\n' b'A' b'l' b'l' b':' b'\n' b'S' b'p' b'e'
 b'a' b'k' b',' b' ' b's' b'p' b'e' b'a' b'k' b'.' b'\n' b'\n' b'F' b'i'
 b'r' b's' b't' b' ' b'C' b'i' b't' b'i' b'z' b'e' b'n' b':' b'\n' b'Y'
 b'o' b'u' b' '], shape=(101,), dtype=string)
tf.Tensor(b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou ', shape=(), dtype=string)
tf.Tensor(b'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k', shape=(), dtype=string)
tf.Tensor(b"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki", shape=(), dtype=string)
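
As a rough illustration of what `batch(seq_length+1, drop_remainder=True)` does to the character stream, the sketch below chunks a plain Python list the same way. The helper name `chunk` is hypothetical. The final lines apply the same integer division to the corpus length printed earlier.

```python
def chunk(ids, size):
    """Split ids into consecutive chunks of exactly `size`, dropping any short tail."""
    n_full = len(ids) // size
    return [ids[i * size:(i + 1) * size] for i in range(n_full)]

toy_ids = list(range(10))
chunks = chunk(toy_ids, 4)
print(chunks)  # two full chunks; the trailing [8, 9] is dropped

# For the corpus above: 1,115,394 characters in chunks of seq_length + 1 = 101.
n_sequences = 1115394 // 101
print(n_sequences)
```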


These sequences need to be turned into (input, label) pairs.
At each time step, the input is the current character and the label is the next character.

In [12]:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

print( split_input_target(list("tensorflow")))

(['t', 'e', 'n', 's', 'o', 'r', 'f', 'l', 'o'], ['e', 'n', 's', 'o', 'r', 'f', 'l', 'o', 'w'])


In [13]:
dataset = sequences.map(split_input_target)

for input_example, target_example in dataset.take(1):
    print("Input :", text_from_ids(input_example).numpy())
    print("Target:", text_from_ids(target_example).numpy())

Input : b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target: b'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '


With these sequences built, we need to shuffle them and pack them into training batches that the model can consume.
Note that tf.data does not shuffle the whole dataset in memory at once; it shuffles within a buffer of BUFFER_SIZE elements and then emits batches of BATCH_SIZE.

In [14]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE))

print(dataset)

<_PrefetchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int64, name=None), TensorSpec(shape=(64, 100), dtype=tf.int64, name=None))>
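
The buffered shuffle described in the comments above can be approximated in a few lines of plain Python. This is a sketch of the general idea only, not TensorFlow's actual implementation; the function name `buffered_shuffle` and the fixed seed are assumptions made for illustration.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) > buffer_size:
            # Once the buffer is over capacity, emit one randomly chosen element.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # The input is exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=5))
print(shuffled)
```

An element can only be emitted after it has entered the buffer, so a small buffer gives only local shuffling. Here the dataset has 1115394 // 101 = 11043 sequences, so a buffer of 10000 covers most of it, and 11043 // 64 gives 172 full batches per epoch.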


## Step 3 - Model generation

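This tf.keras.Model subclass has three layers:

| layer name | function                                                                               |
|------------|----------------------------------------------------------------------------------------|
| Embedding  | input layer: trainable lookup table mapping each character ID to an embedding_dim vector |
| GRU        | RNN layer with rnn_units units                                                         |
| Dense      | output layer with vocab_size outputs                                                   |

The model outputs one logit per character in the vocabulary; these are the unnormalized log-likelihoods of each character according to the model.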

In [15]:
# Length of the vocabulary in StringLookup Layer
vocab_size = len(ids_from_chars.get_vocabulary())

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [16]:
class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

In [17]:
model = MyModel(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

## Testing the model

In [18]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

print(model.summary())

(64, 100, 66) # (batch_size, sequence_length, vocab_size)
Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  16896     
                                                                 
 gru (GRU)                   multiple                  3938304   
                                                                 
 dense (Dense)               multiple                  67650     
                                                                 
Total params: 4022850 (15.35 MB)
Trainable params: 4022850 (15.35 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
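
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras GRU default of `reset_after=True`, which gives each of the three gate/candidate blocks two bias vectors; under that assumption the arithmetic matches the table above.

```python
vocab_size, embedding_dim, rnn_units = 66, 256, 1024

# Embedding: one embedding_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# GRU: three blocks (update gate, reset gate, candidate state), each with an
# input kernel, a recurrent kernel, and two bias vectors (reset_after=True assumed).
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: weight matrix plus one bias per output.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params)
print(total_params)
```

Almost all of the parameters sit in the GRU, so changing rnn_units affects model size far more than changing embedding_dim or the vocabulary.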


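To get actual character predictions, you need to sample from the distribution defined by the logits returned over the character vocabulary.
For the first example in the batch (sampled IDs, followed by the decoded input and the decoded predictions):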

In [19]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()
print(sampled_indices)
print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())

[31  5 37 36 35 44 22 12 49 28 56 64 51 29 13 46 65 58 56 22 61 47 45 10
 29 55 33 22 65 41  8 12 61 42 61 59 42 62 30  9 44 61 65 56 23 41 47 45
 19 26  5 63 15 33 25 17 14 13 15 13 44 37 60 56 28 10  5 26 12 38 48 55
 47 65  0 54 49 18 45 46 20 22 21 50 40 21 65 46 25  4 54 35 16 15  4 29
 51 49 30 26]
Input:
 b's true; I heard a senator speak it.\nThus it is: the Volsces have an army forth; against\nwhom Cominiu'

Next Char Predictions:
 b'R&XWVeI;jOqylP?gzsqIvhf3PpTIzb-;vcvtcwQ.evzqJbhfFM&xBTLDA?B?eXuqO3&M;Yiphz[UNK]ojEfgGIHkaHzgL$oVCB$PljQM'
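
For intuition about what sampling from logits means, here is a small stdlib-only sketch: a softmax turns logits into probabilities, and an index is then drawn in proportion to those probabilities. It illustrates the idea behind `tf.random.categorical` and is not its implementation; the toy logits and function names are made up for the example.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_from_logits(logits, rng):
    """Draw one index with probability proportional to softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.1]
probs = softmax(toy_logits)
samples = [sample_from_logits(toy_logits, rng) for _ in range(1000)]
counts = [samples.count(i) for i in range(len(toy_logits))]
print([round(p, 3) for p in probs])
print(counts)
```

Sampling, rather than always taking the highest logit, keeps lower-probability characters reachable. Before training the logits are close to uniform, which is why the predictions above look like noise.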


## Step 4 - Model Training

We've turned prediction into a standard classification problem: given the previous RNN state and the current input character, predict the class of the next character.
We need an optimizer and a loss function:
- sparse categorical cross-entropy is a reasonable loss here, because the targets are integer class IDs and the model returns logits (hence from_logits=True)
- use the 'Adam' optimizer
- use compile to configure the training process

In [20]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

example_batch_mean_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", example_batch_mean_loss)

model.compile(optimizer='adam', loss=loss)

Prediction shape:  (64, 100, 66)  # (batch_size, sequence_length, vocab_size)
Mean loss:         tf.Tensor(4.190152, shape=(), dtype=float32)


Prior to training, the model's predictions should be close to uniform, so the exponential of the mean loss should be approximately equal to the vocabulary size. A much higher value means the model is confident in its wrong answers.

In [21]:
print(tf.exp(example_batch_mean_loss).numpy())

66.03284
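
The link between the initial loss and the vocabulary size is short arithmetic. Assuming an untrained model that spreads probability evenly over all 66 vocabulary entries, the cross-entropy per character is ln(66), and exponentiating it gives back 66:

```python
import math

vocab_size = 66

# A model with no information assigns probability 1/vocab_size to every character.
uniform_prob = 1.0 / vocab_size

# Cross-entropy for one target is -log(probability assigned to the true class).
expected_initial_loss = -math.log(uniform_prob)
print(expected_initial_loss)

# Exponentiating the mean loss recovers the effective number of equally likely choices.
print(math.exp(expected_initial_loss))

# The measured mean loss from the cell above, exponentiated the same way.
measured = math.exp(4.190152)
print(measured)
```

The measured value sits just above 66, which is consistent with a freshly initialized model whose logits are nearly, but not exactly, uniform.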


configure checkpoints, and train the model

In [22]:
# Directory where the checkpoints will be saved
checkpoint_dir = local_data_path_root + '/training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [23]:
EPOCHS = 20
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
print(history)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
<keras.src.callbacks.History object at 0x000002BAC1EBBC90>


## Step 5 - Text Generation

Text can be generated with a loop that passes in a prompt, appends the generated character to the output, and feeds that character back in.

We'll define a class that takes the model and the char <--> id lookup layers as inputs
- this class builds a mask so that [UNK] won't be generated
- its generate_one_step function
    - tokenizes the current inputs
    - runs the model on the input
    - keeps only the last prediction
    - scales the logits by the temperature and applies the mask
    - samples a character ID from the resulting logits
    - converts the sampled ID back to a character
    - returns the character and the updated model state

In [24]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
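
The two adjustments made to the logits in generate_one_step, dividing by the temperature and adding a mask of -inf at forbidden positions, can be tried in isolation. The sketch below uses plain Python and made-up logits, with index 0 standing in for `[UNK]`; `next_char_probs` is a hypothetical helper, not part of the class above.

```python
import math

def softmax(logits):
    """Convert logits to probabilities; exp(-inf) evaluates to 0.0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_char_probs(logits, mask, temperature=1.0):
    """Scale logits by temperature, add the mask, and return probabilities."""
    scaled = [x / temperature for x in logits]
    masked = [x + m for x, m in zip(scaled, mask)]
    return softmax(masked)

toy_logits = [3.0, 2.0, 1.0, 0.1]       # index 0 plays the role of "[UNK]"
mask = [-math.inf, 0.0, 0.0, 0.0]       # -inf removes index 0 from sampling

for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in next_char_probs(toy_logits, mask, t)])
```

Index 0 has the largest raw logit but ends up with probability zero at every temperature. Lower temperatures concentrate probability on the best remaining character, and higher temperatures flatten the distribution.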

now define an instance of our generator

In [25]:
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)

The generation loop:
- defines a constant prompt of "ROMEO:"
- initializes the result list with this prompt (the model state starts as None)
- iterates 1000 times
    - calling generate_one_step to get the next char (and the updated states)
    - appending the next char to "result"
- finally, the result characters are joined together
- and the result (and run time) are printed out

In [11]:
start = time.time()
states = None
next_char = tf.constant(['ROMEO:'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print('_'*80 + '\n\n' + result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

________________________________________________________________________________

ROMEO:
Ha's, condemns you.

First Citizen:
And indeed, sir; I have heard that Lucentio that have tumbled
but then poor souls,--like honour, break a grand
As every kept thing it: the king cry Cluth,'er
In hand, and made thee well, indeed, I
will attend you again; betedle the victory!

KING RICHARD III:
No; if I reaf corceing to her
Accused by a man at an unlawful bed.

TYBALT:
What art thou, man? awake! thy trial with
'Tis can yield be here? early up your haugn?
But then a grandain state, out with a piece of banished.
Threw you, last likeness, fair Duke of York,
But Tybalt scolding here already? Romeo!
Or bid him misselves twenty times to wake you: yea, my most conspiration!

HASTINGS:
Ay, my gracious lord;
I come with bathal for that which, be not so advice
My life contend; in AUng of such things as yours, as I am sorry,
Will not her dear explority of this parand?

HORTENSIO:
Sir, I have these noble-garte

### things to note:
- this model is character-based, so it has no built-in notion of words
- it has learned when to capitalize, where to insert paragraph breaks, and how to imitate a Shakespeare-like vocabulary
- it has not yet learned to generate coherent sentences
    - a larger number of epochs may improve this

### things to experiment with
- the "temperature" parameter can be used to make more/less random predictions
- starting with a different seed string will change the output
- adding another RNN layer can improve its overall accuracy
- generation throughput can be improved by batching (the code below generates five outputs in a similar time to the single output above)

In [12]:
start = time.time()
states = None
next_char = tf.constant(['ROMEO:', 'ROMEO:', 'ROMEO:', 'ROMEO:', 'ROMEO:'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result, '\n\n' + '_'*80)
print('\nRun time:', end - start)

tf.Tensor(
[b"ROMEO:\nThis gentleman bound of all the deed.\n\nSecond Citizen:\nBut as I was your daughter's daughter shall.\nShe start to that; if I warrant, you will not yet.\nNo, no, sir, the medicience: against the king\nRain on the Duke of Gloucester's death\nTo grate my brother's limit; when I blame fortwwelle\nIn carrion-bear with black spidices,\nMakes her government to chole myself,\nYea, atten at the least i' the world: these severe\nOn you and you are well as gloats, they do\nnot perft; methought I hear some noise of peace,\nTear me with interchangeably cherishing\nThat when he would have lately give,\nBale praised infect the bight: pierce did I lie?\nBe anmorm, your birth ign of a friend,\nHow he goes but myself again by judges\nThan so your highness' dry or twortune\nTo shape me with it? O my dear'st trial! when\nhe wakes.\n\nShepherd:\nYou, Siciliab,\nOld Gay arroa the Tormente. To me, look thee, then,\nAs one distressed word before him, as we\nrepeal'd the Duke of Bertag

## Step 6 - Saving the trained generator model

The generator can be exported with tf.saved_model.save so that it can be reloaded and used elsewhere.
Note: skip the "save" cell below if you only want to load and run a previously saved version.

In [6]:
local_model_path = local_data_path_root+ "/models/"

In [None]:
tf.saved_model.save(one_step_model, local_model_path+'one_step')

In [7]:
one_step_reloaded = tf.saved_model.load(local_model_path+'one_step')

In [8]:
states = None
next_char = tf.constant(['ROMEO:'])
result = [next_char]

for n in range(100):
  next_char, states = one_step_reloaded.generate_one_step(next_char, states=states)
  result.append(next_char)

print(tf.strings.join(result)[0].numpy().decode("utf-8"))

ROMEO:
So strike along. Poor Margaret!
My master markless appetle and fly.

WARWICK:
Why, that's our guilt


## Step 7 - Still to do

- better understand
    - the practical difference between using an RNN vs an LSTM layer
    - the masking step
    - what to expect in terms of improved performance by adding an additional (RNN?) layer
- this should operate similarly if we replaced the IDs as character placeholders with word tokenizations, no?     
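
As a starting point for the word-level question, the sketch below shows what the ID mapping might look like with word tokens instead of characters. The regular expression is an assumption chosen for illustration (words, single punctuation marks, and newlines as separate tokens), and `word_tokenize` is a hypothetical helper; a real pipeline would likely use a proper tokenizer.

```python
import re

def word_tokenize(text):
    """Split text into word tokens, punctuation tokens, and newline tokens."""
    return re.findall(r"\n|[A-Za-z']+|[^\sA-Za-z']", text)

sample = "First Citizen:\nBefore we proceed any further, hear me speak."
tokens = word_tokenize(sample)

# Same scheme as the character lookup: 0 is reserved for unknown tokens.
word_vocab = sorted(set(tokens))
word_to_id = {tok: i + 1 for i, tok in enumerate(word_vocab)}
ids = [word_to_id.get(tok, 0) for tok in tokens]

print(tokens)
print(ids)
```

The rest of the pipeline would be structurally the same, but the vocabulary would be far larger than 66 entries, which enlarges the Embedding and Dense layers, and spaces would have to be reinserted when joining generated tokens back into text.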