# **Text generation with an RNN**

Following the [TensorFlow tutorial](https://www.tensorflow.org/text/tutorials/text_generation).

This tutorial looks at training an RNN model to predict the next character in a sequence. The model is trained on text written by Shakespeare.

Something to keep in mind: the text generation process demonstrated in the Udacity course went as follows.
- Tokenize the texts, then convert them to sequences.

- Give each sequence a set length. From each sequence, use the last token as the label and the remaining tokens as the feature vector.

- Convert the labels into one-hot vectors whose length is the vocabulary size.

- Train a classification model on the sequences and one-hot labels.

- Generate text by providing a seed word and then running inference multiple times.
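
A minimal sketch of that "last token is the label" setup, in plain Python rather than any particular library. The token sequences and the vocabulary size below are made up for illustration; they are not taken from the Udacity material.

```python
# Toy illustration of using the last token as a one-hot label (hypothetical data).
sequences = [[4, 2, 7, 1], [3, 3, 5, 0]]  # already tokenized, fixed length
vocab_size = 8  # assumed vocabulary size for this toy example


def one_hot(index, size):
    """Return a list of zeros of length `size` with a 1 at `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


# Everything except the last token is the feature vector.
features = [seq[:-1] for seq in sequences]
# The last token, one-hot encoded, is the label.
labels = [one_hot(seq[-1], vocab_size) for seq in sequences]

print(features)
print(labels)
```

The approach in this notebook differs: every position in the sequence gets its own label (the next character), and the labels stay as integer IDs instead of one-hot vectors.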


## **Import dependencies**

In [1]:
import tensorflow as tf
import numpy as np
import os
import time

print(tf.__version__)

## **Get the dataset**

In [2]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


In [None]:
# decode the text
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
print(text)

In [4]:
print(f"Number of characters in the text: {len(text)}")

Number of characters in the text: 1115394


😅 This is a much more efficient way to get the text than what I did. In fairness, I removed the names of the speakers of each line of dialogue from the text.

In [6]:
# find the number of unique characters in the file
vocab = sorted(set(text))
print(vocab)

['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [7]:
print(f"Number of unique characters in the text is {len(vocab)}")

Number of unique characters in the text is 65


Where does $ come up in the text?

## **Process the text**

In [10]:
# split the sample strings into individual UTF-8 characters
example_texts = ["megasxlr", "theweekend"]

example_text_chars = tf.strings.unicode_split(example_texts, input_encoding="UTF-8")
print(example_text_chars)

<tf.RaggedTensor [[b'm', b'e', b'g', b'a', b's', b'x', b'l', b'r'],
 [b't', b'h', b'e', b'w', b'e', b'e', b'k', b'e', b'n', b'd']]>


In [None]:
for index, char_ in enumerate(list(vocab)):
  print(f"index:{index}, character:{char_}")

**Define a text encoder**

In [9]:
# create an encoder to convert characters into token IDs
ids_from_chars = tf.keras.layers.StringLookup(vocabulary=list(vocab),
                                              mask_token=None)


In [12]:
ids = ids_from_chars(example_text_chars)
print(ids)

<tf.RaggedTensor [[52, 44, 46, 40, 58, 63, 51, 57],
 [59, 47, 44, 62, 44, 44, 50, 44, 53, 43]]>


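It seems the text is encoded using each character's index in the initial vocab, offset by one: `StringLookup` reserves index 0 for the out-of-vocabulary token `[UNK]`, so `'a'` (index 39 in `vocab`) becomes 40.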
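
The offset can be reproduced with a plain-Python mimic of the lookup. This is only a sketch of the behaviour observed above, assuming index 0 is held back for unknown characters; it is not how the Keras layer is implemented, and the toy vocabulary and helper names are made up.

```python
# Plain-Python mimic of the lookup behaviour; not the Keras layer itself.
vocab = sorted(set("abcxyz "))  # hypothetical toy vocabulary
OOV_TOKEN = "[UNK]"

# Index 0 is reserved for the out-of-vocabulary token, so real characters start at 1.
lookup_vocab = [OOV_TOKEN] + vocab
char_to_id = {ch: i for i, ch in enumerate(lookup_vocab)}


def encode(text):
    """Map each character to its ID, falling back to 0 for unknown characters."""
    return [char_to_id.get(ch, 0) for ch in text]


def decode(ids):
    """Map IDs back to characters."""
    return "".join(lookup_vocab[i] for i in ids)


ids = encode("cab")
print(ids, decode(ids))
print(encode("q"))  # 'q' is not in the toy vocabulary
```

This is also why the vocabulary reported by the encoder has one more entry than `vocab` itself.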

**Define a text decoder**

In [14]:
# create a text decoder
chars_from_ids = tf.keras.layers.StringLookup(vocabulary=ids_from_chars.get_vocabulary(),
                                              invert=True,
                                              mask_token=None)


In [20]:
decoded_chars = chars_from_ids(ids)
print(decoded_chars)

<tf.RaggedTensor [[b'm', b'e', b'g', b'a', b's', b'x', b'l', b'r'],
 [b't', b'h', b'e', b'w', b'e', b'e', b'k', b'e', b'n', b'd']]>


In [23]:
# the decoded chars are returned as lists of chars. We can join the individual
# chars back into strings
tf.strings.reduce_join(decoded_chars, axis=-1).numpy()

array([b'megasxlr', b'theweekend'], dtype=object)


In [49]:
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)

The trained model should be able to predict the next probable character given an initial character or sequence of characters.

Towards this
- our training dataset needs to contain pairs where the input is a sequence of characters and the label is the same sequence shifted one character to the right.
- Example: Input: Megas, Label: egasx


In [27]:
# split the full text into individual chars
text_split_into_individual_chars = tf.strings.unicode_split(text, 'UTF-8')
print(text_split_into_individual_chars)

tf.Tensor([b'F' b'i' b'r' ... b'g' b'.' b'\n'], shape=(1115394,), dtype=string)


In [28]:
# convert each character into a token ID
text_ids = ids_from_chars(text_split_into_individual_chars)
print(text_ids)

tf.Tensor([19 48 57 ... 46  9  1], shape=(1115394,), dtype=int64)


In [29]:
ids_dataset = tf.data.Dataset.from_tensor_slices(text_ids)
print(ids_dataset)

<TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>


In [30]:
for ids in ids_dataset.take(10):
  print(chars_from_ids(ids).numpy().decode('utf-8'))

F
i
r
s
t
 
C
i
t
i


In [31]:
# define the max_length of the sequences
seq_length = 100


In [51]:
# group the token IDs into sequences of seq_length + 1 (101) tokens
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

# display the first five sequences
for seq in sequences.take(5):
  print(text_from_ids(seq).numpy())


b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
b'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
b"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
b"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
b'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'


In [53]:
# define a function to split a sequence into a feature vector and a label
def split_input_target(sequence):
  input_text = sequence[:-1]
  target_text = sequence[1:]
  return input_text, target_text

# example
split_input_target(list("Megasxlr"))

(['M', 'e', 'g', 'a', 's', 'x', 'l'], ['e', 'g', 'a', 's', 'x', 'l', 'r'])

In [54]:
# apply the function to each sequence in the dataset
dataset = sequences.map(split_input_target)
print(dataset)

<MapDataset element_spec=(TensorSpec(shape=(100,), dtype=tf.int64, name=None), TensorSpec(shape=(100,), dtype=tf.int64, name=None))>


In [55]:
for input_example, target_example in dataset.take(1):
  print(f"input: {text_from_ids(input_example).numpy()}")
  print(f"label: {text_from_ids(target_example).numpy()}")

input: b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
label: b'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '


Wow, the feature and label are not so different: the label is just the input shifted by one character.

**Create training batches from the mapped dataset**

In [56]:
BATCH_SIZE = 64
BUFFER_SIZE = 10000

dataset = (dataset
           .shuffle(BUFFER_SIZE)
           .batch(BATCH_SIZE, drop_remainder=True)
           .prefetch(tf.data.experimental.AUTOTUNE))

dataset

<PrefetchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int64, name=None), TensorSpec(shape=(64, 100), dtype=tf.int64, name=None))>

Summary of the text processing pipeline that generates the features and labels
1. Split the text into individual chars
2. Convert the individual chars into tokens
3. Group the tokens into sequences of 101 tokens each.
4. Split each sequence into a feature vector and a label
  - the feature is the sequence excluding the last char
  - the label is the sequence excluding the first char
5. Generate a new dataset containing batches of 64 samples (feature vectors and labels)
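
Steps 1 to 4 can be sketched without TensorFlow to make the shapes easier to see. This is a rough plain-Python equivalent under simplifying assumptions: a short stand-in string replaces the full text, `seq_length` is 10 instead of 100, and the final shuffle-and-batch step is left out.

```python
text = "First Citizen: Before we proceed any further, hear me speak."  # short stand-in
seq_length = 10  # the notebook uses 100

vocab = sorted(set(text))
char_to_id = {ch: i + 1 for i, ch in enumerate(vocab)}  # 0 is left for unknown chars

# Steps 1-2: split into characters and convert them to token IDs.
ids = [char_to_id[ch] for ch in text]

# Step 3: group into chunks of seq_length + 1, dropping the incomplete remainder.
chunk = seq_length + 1
sequences = [ids[i:i + chunk] for i in range(0, len(ids) - chunk + 1, chunk)]

# Step 4: the input drops the last token, the target drops the first.
pairs = [(seq[:-1], seq[1:]) for seq in sequences]

print(len(pairs), "input/target pairs of length", seq_length)
```

Each chunk holds one extra token so that, after the split, both the input and the target have exactly `seq_length` tokens.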

## **Define the text generation model**

In [63]:
# define the model parameters
vocab_size = len(ids_from_chars.get_vocabulary())
print(vocab_size)

embedding_dim = 256
rnn_units = 1024

66


In [59]:
# define the model
class MyModel(tf.keras.Model):

  # define class constructor to initialise the layers with the parameters
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units, return_sequences=True,
                                   return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)
  

  # subclassed models derived from tf.keras.Model define their forward pass in call()
  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs

    # pass input to the embedding layer
    x = self.embedding(x, training=training)

    # use the GRU's default initial state if none is provided
    if states is None:
      states = self.gru.get_initial_state(x)
    
    # call the GRU layer with the embedding, initial state and training flag
    # (training tells layers such as dropout whether to run in training or inference mode)
    x, states = self.gru(x, initial_state=states, training=training)

    # Call the dense layer with the output of the GRU layer
    x = self.dense(x, training=training)

    # return the dense output, plus the state if requested
    if return_state:
      return x, states
    else:
      return x
     

In [64]:
# define an instance of the model
model = MyModel(vocab_size, embedding_dim, rnn_units)

Try the model on a single batch from the dataset.

In [None]:
for batch_feature_vector, batch_label in dataset.take(1):
  example_batch_predictions = model(batch_feature_vector)
  print(example_batch_predictions)
  print(example_batch_predictions.shape, "(batch size, sequence_length, vocab_size)")

The next probable character is determined from the output distribution (logits over the vocabulary) produced by the dense layer.
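
As a rough illustration of going from logits to a character ID, the sketch below applies a softmax and then samples from the resulting probabilities, which is the same idea as the `tf.random.categorical` call used further down. The logits and helper names here are made up, and only the standard library is used.

```python
import math
import random


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def sample_id(logits, rng=random):
    """Draw one token ID at random, weighted by the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [0.1, 2.0, -1.0, 0.5]  # hypothetical scores for a 4-character vocabulary
probs = softmax(logits)
print([round(p, 3) for p in probs])
print(sample_id(logits, random.Random(0)))
```

Sampling, rather than always taking the highest-scoring ID, keeps the generated text from getting stuck repeating the same characters.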

In [68]:
model.summary()

Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  16896     
                                                                 
 gru (GRU)                   multiple                  3938304   
                                                                 
 dense (Dense)               multiple                  67650     
                                                                 
Total params: 4,022,850
Trainable params: 4,022,850
Non-trainable params: 0
_________________________________________________________________
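
The parameter counts in the summary can be checked by hand. The arithmetic below is a sketch that assumes the GRU keeps separate bias vectors for its input and recurrent kernels; with that assumption the numbers match the summary exactly.

```python
vocab_size, embedding_dim, rnn_units = 66, 256, 1024

# One embedding vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# Three gates, each with an input kernel, a recurrent kernel and two bias vectors
# (assumes the GRU variant with separate input and recurrent biases).
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense layer: one weight per (GRU unit, vocabulary entry) plus one bias per entry.
dense_params = rnn_units * vocab_size + vocab_size

total = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total)
```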


In [73]:
print(example_batch_predictions[0])

tf.Tensor(
[[ 0.00562176 -0.00374314 -0.01639711 ... -0.00193836  0.00166717
  -0.00250686]
 [-0.00671655 -0.0107293  -0.00898772 ... -0.00578413  0.00860768
   0.00137541]
 [-0.01193155 -0.00204952 -0.01788967 ... -0.00425291  0.00959354
  -0.0027385 ]
 ...
 [-0.01239498 -0.00092952  0.00505774 ... -0.01444685 -0.01578278
  -0.00837081]
 [-0.0053544   0.00378801 -0.00509801 ... -0.01159768  0.00221087
   0.00558506]
 [ 0.00099461 -0.00201884 -0.00211347 ... -0.00576868 -0.00381639
  -0.00275322]], shape=(100, 66), dtype=float32)


In [None]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
print(sampled_indices)

In [76]:
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()
print(sampled_indices)

[14 34 17 27 20  5  9 55 50 30 15 31 15 21 64 55 31 33 32 48 35 64 38 32
 40 11 37 35 49  8 23 17 39 23  3 47 22 21 37  2 33 42 53 20 42 46 58 44
 29 26 58 50 12 63 22 17 19 56 63 11 10 36 34  5  3 11 62 30 21 53 29 51
 48 17 64 31 58 33 58 42  3 47 51 61 48 38 56 25 18  2 22 64 19 29 32 34
 30 34 39 55]


In [78]:
# display the input and model prediction
print("Input:\n", text_from_ids(batch_feature_vector[0]).numpy())
print("Expected label: \n", text_from_ids(batch_label[0]).numpy())
print("Predicted label: \n", text_from_ids(sampled_indices).numpy())

Input:
 b'. For the dearth,\nThe gods, not the patricians, make it, and\nYour knees to them, not arms, must help'
Expected label: 
 b' For the dearth,\nThe gods, not the patricians, make it, and\nYour knees to them, not arms, must help.'
Predicted label: 
 b'AUDNG&.pkQBRBHypRTSiVyYSa:XVj-JDZJ!hIHX TcnGcgsePMsk;xIDFqx:3WU&!:wQHnPliDyRsTsc!hlviYqLE IyFPSUQUZp'


## **Train the model**

In [82]:
# define a loss function for the model
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

# get the loss for the first batch in the dataset
example_batch_mean_loss = loss(batch_label, example_batch_predictions)
print(f"Prediction shape: {example_batch_predictions.shape}")
print(f"Mean loss: {example_batch_mean_loss}")

Prediction shape: (64, 100, 66)
Mean loss: 4.191046237945557


In [83]:
tf.exp(example_batch_mean_loss).numpy()

66.0919

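I'm not too sure about this, but
- *The exponential of the mean loss should be approximately equal to the vocabulary size* for a freshly initialized model. The output logits from the dense layer all have similar magnitudes, so the predicted distribution is close to uniform and the cross-entropy is close to ln(66) ≈ 4.19.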
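
A quick check of that reasoning, assuming the untrained model assigns every character exactly the same probability (a simplification, since the real logits are only approximately equal):

```python
import math

vocab_size = 66
uniform_prob = 1.0 / vocab_size

# Cross-entropy when every character is given the same probability.
uniform_loss = -math.log(uniform_prob)

print(round(uniform_loss, 4))            # compare with the measured mean loss above
print(round(math.exp(uniform_loss), 4))  # exponentiating recovers the vocabulary size
```

A starting loss well above this value would suggest the initial logits are far from uniform.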

In [84]:
# define the loss and optimizer for the model
model.compile(optimizer='adam', loss=loss)


**Define model callbacks**

ModelCheckpoint
- Saves the model or its weights at a defined frequency, so checkpoints are available from different points during training.

In [85]:
# define a directory to store checkpoints of the model during training
checkpoint_dir = "./training_checkpoints"
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# define the model checkpoint
# Saves the model weights at the end of each epoch to the training_checkpoints dir
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
                                                         save_weights_only=True)


In [86]:
# Define the training epochs 
EPOCHS = 20

In [87]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


## **Generating Text**

To generate text, we provide a seed character (or string) and run inference to predict the next probable character. Running this multiple times, feeding each prediction back in, allows us to generate larger pieces of text.
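
The loop itself can be sketched with a toy stand-in for the trained model. Everything below is hypothetical: `predict_next` samples from simple next-character counts over a tiny made-up corpus, whereas the real model would run the GRU and also carry its state from one step to the next.

```python
import random

# Hypothetical stand-in for a trained model: next-character counts from a tiny corpus.
corpus = "the theme then there"
counts = {}
for current, following in zip(corpus, corpus[1:]):
    counts.setdefault(current, {}).setdefault(following, 0)
    counts[current][following] += 1


def predict_next(char, rng):
    """Sample the next character in proportion to how often it followed `char`."""
    options = counts.get(char)
    if not options:  # character never seen with a successor: fall back to a space
        return " "
    chars = list(options)
    return rng.choices(chars, weights=[options[c] for c in chars], k=1)[0]


def generate(seed, length, rng):
    """Repeatedly predict one character and feed it back in as the next input."""
    result = seed
    for _ in range(length):
        result += predict_next(result[-1], rng)
    return result


print(generate("t", 20, random.Random(42)))
```

The structure is the part that carries over: one prediction per step, appended to the output and used as the next input.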

In [None]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_fro