# **Text generation with an RNN**

Text generation involves training a model to produce coherent text sequences. Recurrent Neural Networks (RNNs) are well suited to this task because they process sequential data one step at a time and keep a hidden state that summarizes previous inputs, which lets them predict the next character or word from what came before.

#### **Dataset Overview: Edgar Allan Poe's Works**

Poe's writings, known for their rich vocabulary and complex sentence structures, provide a distinctive dataset for training a text generation model: a model trained on them should learn to generate text that mirrors Poe's style.

### Importing libraries

In [1]:
import tensorflow as tf
import numpy as np
import os
import time




### Download the Edgar Allan Poe Dataset

In [2]:
# Download the dataset - The Works of Edgar Allan Poe, Volume 1 (Project Gutenberg)
path_to_file = tf.keras.utils.get_file('edgar_allan_poe.txt', 'https://www.gutenberg.org/files/2147/2147-0.txt')

### Read the data

In [9]:
# Read the data and decode it from bytes to a string
with open(path_to_file, 'rb') as f:
    text = f.read().decode(encoding='utf-8')
print('Length of text: {} characters'.format(len(text)))
print(text[:70])

Length of text: 580856 characters
﻿*** START OF THE PROJECT GUTENBERG EBOOK THE WORKS OF EDGAR ALLAN POE


In [10]:
# Create a sorted set of unique characters in the text
vocab = sorted(set(text))
print('{} unique characters'.format(len(vocab)))

103 unique characters
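
The `\ufeff` that opens the decoded text is a Unicode byte order mark (BOM) left at the start of the downloaded file, and it counts as one of the 103 vocabulary characters. The sketch below uses a short hypothetical byte string, not the real file, to show how Python's standard `utf-8-sig` codec would drop it. The notebook itself keeps plain `utf-8`, so the outputs that follow still include the BOM.

```python
# Hypothetical sample bytes: a UTF-8 BOM followed by the start of the header line
raw = b'\xef\xbb\xbf*** START OF'

with_bom = raw.decode('utf-8')         # keeps the BOM as the character '\ufeff'
without_bom = raw.decode('utf-8-sig')  # strips a leading BOM if one is present

print(repr(with_bom[:4]))
print(repr(without_bom[:4]))

# The BOM adds exactly one entry to the set of unique characters
print(len(set(with_bom)) - len(set(without_bom)))
```

Switching the notebook to `utf-8-sig` would presumably reduce the vocabulary by one character and shift the text by one position, so the printed values in later cells would change accordingly.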


### Process the text

**Creating two lookup tables**: The first maps each unique character to a numerical index (`char2idx`), and the second maps indices back to characters (`idx2char`). With these, we can convert the entire text into a sequence of integers (`text_as_int`), which the RNN will use as input.
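
As a minimal sketch of that round trip, the example below encodes and decodes a short hypothetical string. The names `sample`, `vocab_small`, `char_to_index`, and `index_to_char` are illustrative only and are not part of the notebook, which applies the same idea to the full corpus.

```python
sample = "The Raven"  # hypothetical sample text, not the Poe corpus

# Sorted unique characters play the role of the vocabulary
vocab_small = sorted(set(sample))

# Forward table: character -> index; reverse table: index -> character
char_to_index = {ch: i for i, ch in enumerate(vocab_small)}
index_to_char = vocab_small

# Encode the string as integers, then decode it back
encoded = [char_to_index[ch] for ch in sample]
decoded = ''.join(index_to_char[i] for i in encoded)

print(vocab_small)
print(encoded)
print(decoded == sample)
```

Because the vocabulary is built from the text itself, every character is guaranteed to have an index, and decoding an encoded string always recovers the original.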

In [11]:
# Create a mapping from unique characters to indices and vice versa
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

In [12]:
# Convert the entire text to a sequence of integers using the mapping
text_as_int = np.array([char2idx[c] for c in text])

In [13]:
# Print the first 50 character-to-index mappings to verify the encoding
print('{')
for char, _ in zip(char2idx, range(50)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

{
  '\n':   0,
  '\r':   1,
  ' ' :   2,
  '!' :   3,
  '$' :   4,
  '&' :   5,
  '(' :   6,
  ')' :   7,
  '*' :   8,
  ',' :   9,
  '-' :  10,
  '.' :  11,
  '/' :  12,
  '0' :  13,
  '1' :  14,
  '2' :  15,
  '3' :  16,
  '4' :  17,
  '5' :  18,
  '6' :  19,
  '7' :  20,
  '8' :  21,
  '9' :  22,
  ':' :  23,
  ';' :  24,
  '?' :  25,
  'A' :  26,
  'B' :  27,
  'C' :  28,
  'D' :  29,
  'E' :  30,
  'F' :  31,
  'G' :  32,
  'H' :  33,
  'I' :  34,
  'J' :  35,
  'K' :  36,
  'L' :  37,
  'M' :  38,
  'N' :  39,
  'O' :  40,
  'P' :  41,
  'Q' :  42,
  'R' :  43,
  'S' :  44,
  'T' :  45,
  'U' :  46,
  'V' :  47,
  'W' :  48,
  'X' :  49,
  ...
}


In [14]:
# Show how the first 13 characters of the text are mapped to integers
print('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))

'\ufeff*** START OF' ---- characters mapped to int ---- > [102   8   8   8   2  44  45  26  43  45   2  40  31]


### **The prediction task**

The task of the RNN is to predict the next character in a sequence given the previous characters. For example, given the sequence "Edgar Allan P", the model should predict "o" as the next character. By training the model on numerous sequences, it learns to generate text by predicting one character at a time.
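
To make the task concrete, the sketch below lists the (context, next character) pairs that a single phrase yields. It is plain Python for illustration only; the notebook builds the equivalent pairs with `tf.data`.

```python
phrase = "Edgar Allan Poe"

# Each pair is (all characters seen so far, the character that follows)
pairs = [(phrase[:i], phrase[i]) for i in range(1, len(phrase))]

# Show the last few pairs, including the "Edgar Allan P" -> "o" case
for context, next_char in pairs[-3:]:
    print(repr(context), '->', repr(next_char))
```

A phrase of n characters yields n - 1 such pairs, which is why one text chunk can supply many training targets at once.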

### Creating training examples and targets

In [15]:
seq_length = 100    # length of each input sequence; each chunk holds seq_length + 1 characters (input + target)
examples_per_epoch = len(text) // (seq_length + 1)      # number of chunks we can extract from the text
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)  # TensorFlow dataset of individual character indices

In [17]:
for i in char_dataset.take(10):
    print(idx2char[i.numpy()])


*
*
*
 
S
T
A
R
T


In [18]:
for i in char_dataset.take(10):
    print(i.numpy())

102
8
8
8
2
44
45
26
43
45


In [19]:
# Batch the characters into sequences
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)

In [20]:
# Split each chunk into an input sequence and a target sequence shifted by one character
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

### Creating Training Batches

In [21]:
for input_example, target_example in dataset.take(1):
    print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  '\ufeff*** START OF THE PROJECT GUTENBERG EBOOK THE WORKS OF EDGAR ALLAN POE\r\n— VOLUME 1 ***\r\n\r\n\r\n\r\n\r\nThe '
Target data: '*** START OF THE PROJECT GUTENBERG EBOOK THE WORKS OF EDGAR ALLAN POE\r\n— VOLUME 1 ***\r\n\r\n\r\n\r\n\r\nThe W'


In [22]:
BATCH_SIZE = 64
BUFFER_SIZE = 10000
# Shuffle the sequences and group them into batches of 64 for training
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
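
The dataset sizes follow from the numbers above. The sketch below works through the arithmetic in plain Python, assuming the 580856-character length printed earlier and `drop_remainder=True` at both batching steps.

```python
text_length = 580856   # character count printed when the file was read
seq_length = 100
batch_size = 64

# Each chunk holds seq_length + 1 characters (input plus shifted target)
chunk_length = seq_length + 1
num_chunks, leftover_chars = divmod(text_length, chunk_length)

# Chunks are then grouped into batches; incomplete batches are dropped
steps_per_epoch, leftover_chunks = divmod(num_chunks, batch_size)

print(num_chunks, leftover_chars)
print(steps_per_epoch, leftover_chunks)
```

Under these assumptions there are 5751 chunks and 89 training steps per epoch. Since `BUFFER_SIZE` (10000) exceeds the number of chunks, the shuffle buffer can hold the whole dataset.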

### Building the Model

In [34]:
vocab_size = len(vocab)
embedding_dim = 256
rnn_units = 1024

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

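The model consists of three layers:

+ Embedding Layer: Maps each character index to a trainable dense vector of size `embedding_dim`.

+ GRU Layer: The recurrent layer, with `rnn_units` units, that processes the sequence of vectors and carries state from one step to the next.

+ Dense Layer: Outputs `vocab_size` logits at each time step, one score per candidate next character.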
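
To give a sense of where the model's capacity sits, the sketch below estimates the trainable parameters of each layer by hand. It assumes the Keras GRU defaults (bias enabled and `reset_after=True`, which gives two bias vectors per weight set); with different settings the GRU count would differ.

```python
vocab_size = 103      # unique characters reported above
embedding_dim = 256
rnn_units = 1024

# Embedding: one embedding_dim-sized vector per character
embedding_params = vocab_size * embedding_dim

# GRU: three weight sets (update gate, reset gate, candidate state), each with
# an input kernel, a recurrent kernel, and two bias vectors (assumed default)
gru_params = 3 * (embedding_dim * rnn_units + rnn_units * rnn_units + 2 * rnn_units)

# Dense: weight matrix plus bias, one output logit per character
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```

Nearly all of the parameters belong to the GRU layer, so `rnn_units` is the setting that most affects model size and training time. Calling `model.summary()` on the built model is the way to confirm the actual counts.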

In [35]:
model = build_model(vocab_size=len(vocab), embedding_dim=embedding_dim, 
                    rnn_units=rnn_units, batch_size=BATCH_SIZE)

### Trying the Model

In [36]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 103) # (batch_size, sequence_length, vocab_size)


### Training the Model

In [37]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

model.compile(optimizer='adam', loss=loss)
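
As a rough sanity check on this loss, a model that gives equal logits to all 103 characters should score about ln(103) ≈ 4.63 per character, so a freshly initialized model can be expected to start near that value. The pure Python function below is an illustrative stand-in for a single time step, not the Keras implementation.

```python
import math

def sparse_crossentropy_from_logits(label, logits):
    """Cross-entropy for one time step: -log(softmax(logits)[label])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]

vocab_size = 103

# An untrained model that favors no character: all logits equal
uniform_logits = [0.0] * vocab_size
loss_uniform = sparse_crossentropy_from_logits(label=8, logits=uniform_logits)

# A model that strongly favors the correct character gets a much lower loss
confident_logits = [0.0] * vocab_size
confident_logits[8] = 10.0
loss_confident = sparse_crossentropy_from_logits(label=8, logits=confident_logits)

print(loss_uniform, math.log(vocab_size))
print(loss_confident)
```

The labels are plain integer indices rather than one-hot vectors, which is what the "sparse" variant of the loss expects, and `from_logits=True` tells it that the Dense layer's outputs have not been passed through a softmax.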

In [38]:
# Save the model weights periodically during training.
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix, save_weights_only=True)

In [29]:
EPOCHS = 30
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/30
...
Epoch 30/30


### Generating the Text

In [46]:
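# After training, save the final weights. checkpoint_prefix still contains the
# literal '{epoch}' placeholder (only the callback fills it in), so save under
# a fixed name instead; 'ckpt_final' is an arbitrary choice.
model.save_weights(os.path.join(checkpoint_dir, 'ckpt_final'))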

In [47]:
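def generate_text(model, start_string):
    num_generate = 2000     # Number of characters to generate
    # Convert the start string to a batch containing one sequence of indices
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    text_generated = []
    # Lower temperatures give more predictable text; higher ones, more surprising text
    temperature = 1.0
    model.reset_states()    # Clear the GRU state left over from any previous call
    for i in range(num_generate):
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)    # Remove the batch dimension
        predictions = predictions / temperature
        # Sample the next character from the logits at the last time step
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        # Feed the sampled character back in; the stateful GRU keeps the earlier context
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])
    return start_string + ''.join(text_generated)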
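
`tf.random.categorical` treats its input as unnormalized log-probabilities, so dividing the logits by `temperature` before sampling reshapes the distribution that characters are drawn from. The sketch below illustrates the effect with three hypothetical logits and a hand-written softmax; it does not call TensorFlow.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling them by 1 / temperature."""
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - max_scaled) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical logits for three candidate characters

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a temperature below 1 the most likely character takes a larger share of the probability, while a temperature above 1 flattens the distribution and makes unlikely characters more probable.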

In [48]:
# Create a new model for text generation with batch size 1
model_for_generation = build_model(vocab_size=len(vocab), embedding_dim=embedding_dim,
                                   rnn_units=rnn_units, batch_size=1)

model_for_generation.load_weights(tf.train.latest_checkpoint(checkpoint_dir))



<tensorflow.python.checkpoint.checkpoint.CheckpointLoadStatus at 0x253931a5970>

In [51]:
# Load the trained weights into the new model
model_for_generation.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

# Build the model with a batch size of 1 and a variable sequence length
model_for_generation.build(tf.TensorShape([1, None]))

# Generate and print text starting with a specific string
print(generate_text(model_for_generation, start_string="Once upon a midnight dreary, while I pondered, weak and weary"))

Once upon a midnight dreary, while I pondered, weak and weary, but
      the sashes from the dark hemisphere in Biterary line limn affording
      them the last fle hastiety of the surface. Both hair upon it to the
      “Bight!’s ultimate designates
      took promited their future was that of a place, it
      swirely existings of the person. What a
      taph creation to those I hade corruted,
      atually accompanied by innumerable silk, some to lead, or in the madness of them of insting upon which
      is to have a popul of the principle, holding our
      choor-editor, which alarmed, now to treparture to
      tasky must hold by the Asshranting after this worshim of the car, like a
      wonderful reading thus:

      “Why I did. You say! A sense of
      this discover nothing behind us. A
      portion rather like a Greek column, being
      three-nation, and collected into my calculutiture was—but to the
      fugitive. Are is large and fister of the
      witnesses, and disp