## Import Libraries and the "Tiny Shakespeare" Dataset from TensorFlow Datasets

In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np

# load the Tiny Shakespeare dataset
dataset, info = tfds.load('tiny_shakespeare', with_info=True, as_supervised=False)

The dataset is raw text, but language models need numerical input. So I will map each character to an integer index and then split the encoded text into fixed-length sequences for training.

In [2]:
# get the text from the dataset
text = next(iter(dataset['train']))['text'].numpy().decode('utf-8')

# create a mapping from unique characters to indices
vocab = sorted(set(text))
char2idx = {char: idx for idx, char in enumerate(vocab)}
idx2char = np.array(vocab)

# numerically represent the characters
text_as_int = np.array([char2idx[c] for c in text])

# create training examples and targets
seq_length = 100
examples_per_epoch = len(text) // (seq_length + 1)

# create training sequences
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
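
To make the mapping and chunking steps above concrete, here is a minimal sketch in plain Python on a toy string. The string, the `toy_` names, and the sequence length of 4 are assumptions chosen for illustration; the real pipeline uses the dataset text, `seq_length = 100`, and `tf.data`.

```python
# Toy stand-in for the corpus (hypothetical; the real text comes from the dataset).
toy_text = "to be or not to be"

# Same mapping scheme as above: sorted unique characters -> integer indices.
toy_vocab = sorted(set(toy_text))
toy_char2idx = {char: idx for idx, char in enumerate(toy_vocab)}

# Encode the text as a list of integers.
toy_as_int = [toy_char2idx[c] for c in toy_text]

# Cut the encoded text into chunks of seq_length + 1 and drop the incomplete
# tail, which mirrors batch(seq_length + 1, drop_remainder=True).
toy_seq_length = 4
chunk_size = toy_seq_length + 1
num_chunks = len(toy_as_int) // chunk_size
toy_chunks = [toy_as_int[i * chunk_size:(i + 1) * chunk_size] for i in range(num_chunks)]

# Decode the chunks back to text to check the round trip.
decoded_chunks = ["".join(toy_vocab[i] for i in chunk) for chunk in toy_chunks]
print(toy_vocab)
print(decoded_chunks)
```

The 18-character string yields three chunks of 5 characters, and the last 3 characters are dropped because they do not fill a whole chunk.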

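Each sequence holds seq_length + 1 characters. I will now use the map method to split each one into an input (all but the last character) and a target (the same text shifted forward by one character), so that at every position the model learns to predict the next character.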

In [3]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)
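
As a quick check of what this function returns, the same slicing can be tried on a plain Python string. The chunk "First" is a hypothetical example; in the pipeline each chunk is a tensor of 101 integer indices.

```python
def split_input_target(chunk):
    # Same slicing as above; it works on any sequence, including a plain string.
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

chunk = "First"  # hypothetical chunk of seq_length + 1 = 5 characters
input_text, target_text = split_input_target(chunk)

# At each position, the target is the character that follows the input character.
pairs = list(zip(input_text, target_text))
print(input_text, target_text)
print(pairs)
```

Here the input is "Firs" and the target is "irst", so the model sees "F" and should predict "i", then sees "i" and should predict "r", and so on.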

Now, I will shuffle the dataset and pack it into training batches.

In [4]:
# batch size and buffer size
BATCH_SIZE = 64
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)
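
The arithmetic behind these settings can be sketched as follows. The text length of 1,000,000 characters is an assumed round number used only for illustration, not the measured size of the dataset; substitute `len(text)` to get the real figures.

```python
# Assumed corpus length, for illustration only (use len(text) for the real value).
assumed_text_length = 1_000_000

seq_length = 100
BATCH_SIZE = 64
BUFFER_SIZE = 10000

# Each training sequence consumes seq_length + 1 characters.
num_sequences = assumed_text_length // (seq_length + 1)

# drop_remainder=True discards the final, incomplete batch.
num_batches = num_sequences // BATCH_SIZE
dropped_sequences = num_sequences - num_batches * BATCH_SIZE

# A shuffle buffer at least as large as the dataset shuffles it completely.
buffer_covers_dataset = BUFFER_SIZE >= num_sequences

print(num_sequences, num_batches, dropped_sequences, buffer_covers_dataset)
```

Under this assumption there are 9,900 sequences and 154 batches per epoch, and the shuffle buffer is large enough to hold every sequence at once.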

Now, I will build the model: an embedding layer, a single LSTM layer (a type of Recurrent Neural Network, or RNN), and a dense output layer that produces one logit per character in the vocabulary.

In [5]:
# length of the vocabulary
vocab_size = len(vocab)

# the embedding dimension
embedding_dim = 256

# number of RNN units
rnn_units = 1024

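def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        # map each character index to a dense vector; a stateful LSTM needs a fixed batch size
        tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),
        # return_sequences=True gives an output at every time step, not only the last
        tf.keras.layers.LSTM(rnn_units, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
        # one logit per character in the vocabulary
        tf.keras.layers.Dense(vocab_size)
    ])
    return model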

model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)
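
To get a feel for the size of this model, the number of trainable parameters can be worked out by hand. The sketch below assumes a vocabulary of 65 characters (check `len(vocab)` for the actual value) and the default Keras LSTM layout with four gates and one bias vector per gate.

```python
# Assumed vocabulary size, for illustration only (use len(vocab) for the real value).
assumed_vocab_size = 65
embedding_dim = 256
rnn_units = 1024

# Embedding: one vector of size embedding_dim per character.
embedding_params = assumed_vocab_size * embedding_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (embedding_dim * rnn_units + rnn_units * rnn_units + rnn_units)

# Dense: a weight matrix from rnn_units to the vocabulary, plus a bias.
dense_params = rnn_units * assumed_vocab_size + assumed_vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost all of the parameters sit in the LSTM layer. If the assumptions hold, the total should match the figure reported by `model.summary()`.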

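I will now compile the model with the Adam optimizer and a sparse categorical cross-entropy loss. The Dense layer outputs raw logits, so the loss is computed with from_logits=True.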

In [6]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

model.compile(optimizer='adam', loss=loss)
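
To show what this loss measures, here is a minimal sketch of sparse categorical cross-entropy from logits for a single prediction, written in plain Python. The helper name `sparse_cross_entropy_from_logits` and the vocabulary size of 65 are assumptions for illustration; this is not the Keras implementation, which works on whole batches of tensors.

```python
import math

def sparse_cross_entropy_from_logits(label, logits):
    # log-sum-exp of the logits, with the maximum subtracted for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # negative log-probability that softmax assigns to the correct class
    return log_sum_exp - logits[label]

assumed_vocab_size = 65  # assumption; use len(vocab) for the real value

# All logits equal: the model spreads probability evenly over every character.
uniform_logits = [0.0] * assumed_vocab_size
loss_uniform = sparse_cross_entropy_from_logits(3, uniform_logits)

# A large logit on the correct character: the loss drops close to zero.
confident_logits = [0.0] * assumed_vocab_size
confident_logits[3] = 10.0
loss_confident = sparse_cross_entropy_from_logits(3, confident_logits)

print(loss_uniform, math.log(assumed_vocab_size))
print(loss_confident)
```

With equal logits the loss equals the natural log of the vocabulary size, which gives a rough reference point for the loss of an untrained model.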

## Model Training

In [7]:
import os

# directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'

# name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True
)

# train the model
EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


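After training, I can now use the model to generate text. First, I will rebuild the model with a batch size of 1 and then restore the weights from the latest checkpoint. The rebuild is needed because a stateful LSTM is tied to a fixed batch size, and generation processes one sequence at a time.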

In [8]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

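Now, to generate text, I will input a seed string, predict the next character, and then feed that character back in as the next input, continuing this process to generate longer text. The stateful LSTM carries the earlier context from step to step, so only the newest character needs to be passed in: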

In [10]:
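def generate_text(model, start_string):
    # number of characters to generate
    num_generate = 1000

    # convert the seed string to indices and add a batch dimension
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    text_generated = []

    # clear any LSTM state left over from a previous call
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # sample the next character from the logits at the last time step
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        # feed only the sampled character back in; the stateful LSTM keeps the context
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))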

print(generate_text(model, start_string=u"QUEEN: So, lets end this"))

QUEEN: So, lets end this blood he hate:
O, 'tis the dukedom Boling hath, made betwered.
If any quarrel lay or gow
Such thoughts,
The head to-nice our hand and sensed Mentagry:
Might do any oath we call death.

BUCKINGHAM:
The gentlewas we have flattered
By lawful absence: heeening to me as faults
I'll handone with the joy; I would not follow
Our niches in the city of the worst.

LADY CAPULET:
Where is thou? of all plain, call'd not;
And, as they will there, in my brother Rutlain,
And from the meaning shows that lives; thit
inquiradary.
it were I see, to Antolour said,
Trown it excucall; and promise the truits;
And live with sweet from mortal aid.
Come, your voices! for the head of Hereford's hus
'AP LIUS:
As much he gain?
Come, to too arms: leave that I must not death?
O caughtan, lest, and that we have, beseech you:
Of that there, forth; I throne instantled with his watch;
Pale in the hopes: I have am ansold grow his welling,
An entreated her brother Clarence, or I and once.

KING RIC

The **generate_text** function in the above code uses the trained LSTM model to generate a sequence of text, starting with a given seed phrase (**start_string**). It converts the seed phrase into a sequence of numeric indices, feeds these indices into the model, and then iteratively generates new characters, each time using the most recently sampled character as the input for the next step while the LSTM's internal state carries the earlier context. This process continues for a specified number of iterations (**num_generate**, set to 1000 here), resulting in a stream of text that extends from the initial seed.

The function samples each character from the predicted distribution (with tf.random.categorical) instead of always picking the most likely one, so every run produces different text. The final output is the seed phrase followed by the newly generated characters. It imitates the style and layout of the training data, although, as the sample above shows, many of the words are invented.
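
To illustrate the difference between sampling and always picking the most likely character, here is a small sketch in plain Python. The three-character vocabulary, the logits, and the helper names `softmax` and `sample_id` are made up for illustration; in the notebook this step is handled by tf.random.categorical on the model's logits.

```python
import math
import random

def softmax(logits):
    # convert logits to probabilities, subtracting the max for numerical stability
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_id(logits, rng):
    # draw one index at random, weighted by the softmax probabilities
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_chars = ["a", "b", "c"]    # hypothetical vocabulary
toy_logits = [2.0, 1.0, 0.0]   # hypothetical model output for one time step

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_id(toy_logits, rng) for _ in range(1000)]
counts = {ch: samples.count(i) for i, ch in enumerate(toy_chars)}

# Greedy decoding would return the same character every time.
greedy_id = max(range(len(toy_logits)), key=lambda i: toy_logits[i])

print(counts)
print(toy_chars[greedy_id])
```

Sampling picks "a" most often but still produces "b" and "c" some of the time, whereas the greedy choice is always "a". This variety is what keeps the generated text from falling into a repeating loop.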