# Shakespearean text with Transformers

_Exercise: Use the Transformers library to download a pretrained language model capable of generating text (e.g., GPT), and try generating more convincing Shakespearean text. You will need to use the model's `generate()` method—see Hugging Face's documentation for more details._

## Prepare environment

In [1]:
import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
import numpy as np

## Prepare data

In [2]:
shakespeare_url = "https://homl.info/shakespeare"  # shortcut URL
filepath = tf.keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

In [3]:
# extra code – shows a short text sample
print(shakespeare_text[:80])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.


In [4]:
# Load pretrained model and tokenizer
model_name = "gpt2"  # You could also try "gpt2-medium" for better results
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = TFGPT2LMHeadModel.from_pretrained(model_name)

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [5]:
# Add padding token (GPT2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token

In [6]:
# Shakespeare prompt to condition the generation
prompt = """
In faith, I do not love thee with mine eyes,
For they in thee a thousand errors note;
But 'tis my heart that loves what they despise,
Who, in despite of view, is pleased to dote.
"""

In [7]:
# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors="tf")

In [11]:
# Generate text
output = model.generate(
    input_ids,
    max_length=200,  # total length in tokens, including the prompt
    num_return_sequences=3,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    do_sample=True,
    no_repeat_ngram_size=3,
    pad_token_id=tokenizer.eos_token_id
)
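
The sampling arguments passed to `generate()` above reshape the next-token distribution before a token is drawn. As a rough illustration, and not the Transformers implementation itself, the NumPy sketch below applies temperature scaling, then top-k filtering, then top-p (nucleus) filtering to a toy logits vector. The function name `filter_next_token_probs` and the toy logits are hypothetical.

```python
import numpy as np

def filter_next_token_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return a filtered, renormalized next-token distribution (illustrative only)."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    logits = np.asarray(logits, dtype=np.float64) / temperature

    # Top-k: discard every token scoring below the k-th highest logit.
    if top_k > 0:
        k = min(top_k, logits.size)
        kth_largest = np.sort(logits)[-k]
        logits = np.where(logits < kth_largest, -np.inf, logits)

    # Softmax over the remaining logits.
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()

    # Top-p: keep the smallest set of most likely tokens whose
    # cumulative probability reaches top_p, then renormalize.
    if top_p < 1.0:
        order = np.argsort(-probs)
        cumulative = np.cumsum(probs[order])
        n_keep = int(np.searchsorted(cumulative, top_p)) + 1
        mask = np.zeros(probs.shape, dtype=bool)
        mask[order[:n_keep]] = True
        probs = np.where(mask, probs, 0.0)
        probs /= probs.sum()
    return probs

# Toy vocabulary of five tokens (hypothetical logits).
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
probs = filter_next_token_probs(toy_logits, temperature=0.8, top_k=3, top_p=0.95)

# Draw the next token from the filtered distribution.
rng = np.random.default_rng(42)
next_token = rng.choice(len(probs), p=probs)
```

Lower `temperature`, `top_k`, or `top_p` values make the output more conservative; higher values make it more varied but also more error-prone.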

In [12]:
# Print the generated Shakespeare-like texts
print("\n=== Generated Shakespearean Texts ===\n")
for i, sequence in enumerate(output):
    text = tokenizer.decode(sequence.numpy(), skip_special_tokens=True)
    print(f"Sample {i+1}:\n{text}\n")
    print("-" * 50)


=== Generated Shakespearean Texts ===

Sample 1:

In faith, I do not love thee with mine eyes,
For they in thee a thousand errors note;
But 'tis my heart that loves what they despise,
Who, in despite of view, is pleased to dote.

Thou shalt not, I say, make of thee the one,

For the other is to the one; to be, to think.
 (N. 6:9)

Of to say.
, 'It is to be the saying, 'Tis the beginning of many things,
 (Tis, and to be a part of; to come, to be of.)

To be in, to look; to pray; to know.
. . . to say, to answer.
— And to be to answer (O.T.) 'The in, in the time of the Lord.
 I have seen to my mind, I have asked, I ask not.
 The

--------------------------------------------------
Sample 2:

In faith, I do not love thee with mine eyes,
For they in thee a thousand errors note;
But 'tis my heart that loves what they despise,
Who, in despite of view, is pleased to dote.

In love I believe, in love, I believe in love;

I love in the love of my heart, in my heart love.
 (Cf. 1 Corinthians 8.5

In [10]:
# Fine-tuning example with TensorFlow/Keras
def fine_tune_on_shakespeare(model, tokenizer, shakespeare_texts, epochs=3):
    """
    Function to fine-tune the model on Shakespeare's works using Keras/TensorFlow.
    You would need a dataset of Shakespeare's texts.
    """
    # Prepare dataset
    def encode_texts(texts):
        encodings = tokenizer(texts, truncation=True, padding="max_length",
                              max_length=512, return_tensors="tf")
        input_ids = encodings["input_ids"]
        attention_mask = encodings["attention_mask"]
        # For language modeling, the labels are the input_ids. Padding
        # positions are set to -100 so that the loss ignores them (the pad
        # token is the EOS token here, so it must be masked explicitly).
        labels = tf.where(attention_mask == 1, input_ids,
                          -100 * tf.ones_like(input_ids))
        return input_ids, attention_mask, labels

    # Create TensorFlow Dataset
    def create_dataset(texts, batch_size=4):
        input_ids, attention_mask, labels = encode_texts(texts)
        dataset = tf.data.Dataset.from_tensor_slices((
            {"input_ids": input_ids, "attention_mask": attention_mask},
            labels
        ))
        dataset = dataset.shuffle(buffer_size=len(texts)).batch(batch_size)
        return dataset

    # Example training code
    train_dataset = create_dataset(shakespeare_texts)

    # Define optimizer
    optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)

    # Compile the model without a Keras loss: the Transformers model then uses
    # its internal causal language modeling loss, which works on logits and
    # shifts the labels by one position. "sparse_categorical_crossentropy"
    # would expect probabilities and would compare unshifted labels.
    model.compile(optimizer=optimizer)
    # Train the model
    model.fit(train_dataset, epochs=epochs)

    return model

# For a more complete fine-tuning approach, you could use the TinyShakespeare dataset from TensorFlow Datasets:
def download_shakespeare_dataset():
    """
    Downloads and prepares the TinyShakespeare dataset for fine-tuning.
    Returns a list of text chunks suitable for training.
    """
    import tensorflow_datasets as tfds
    import re

    # Download the Shakespeare dataset
    shakespeare_ds = tfds.load('tiny_shakespeare', split='train')

    # Extract text and preprocess
    shakespeare_text = ""
    for example in shakespeare_ds:
        shakespeare_text += example['text'].numpy().decode('utf-8')

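    # Clean and chunk the text (note: collapsing whitespace also removes line breaks)
    shakespeare_text = re.sub(r'\s+', ' ', shakespeare_text).strip()
    chunk_size = 512  # Characters per chunk (GPT-2's context window is 1,024 tokens)
    stride = 256      # Step between chunk starts, so consecutive chunks overlap by 256 characters

    chunks = []
    for i in range(0, len(shakespeare_text) - chunk_size + 1, stride):
        chunks.append(shakespeare_text[i:i + chunk_size])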

    return chunks

shakespeare_chunks = download_shakespeare_dataset()
fine_tuned_model = fine_tune_on_shakespeare(model, tokenizer, shakespeare_chunks)

Epoch 1/3
Epoch 2/3
Epoch 3/3
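
The fine-tuning cell above trains with a causal language modeling objective: the logits at position t are scored against the token at position t + 1, and padded positions should not contribute to the loss. The NumPy sketch below illustrates that computation on a toy sequence. It is not the library's code; it assumes the common convention that label value -100 marks positions to skip, and `causal_lm_loss` and the toy arrays are hypothetical.

```python
import numpy as np

IGNORE_INDEX = -100  # assumed convention: labels with this value are skipped

def causal_lm_loss(logits, labels):
    """Mean next-token cross-entropy over non-ignored positions.

    logits: array of shape (seq_len, vocab_size)
    labels: array of shape (seq_len,), with IGNORE_INDEX at padded positions
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)

    # Position t predicts token t + 1, so drop the last logit and the first label.
    shift_logits = logits[:-1]
    shift_labels = labels[1:]

    # Numerically stable log-softmax.
    shift_logits = shift_logits - shift_logits.max(axis=-1, keepdims=True)
    log_probs = shift_logits - np.log(np.exp(shift_logits).sum(axis=-1, keepdims=True))

    # Average the negative log-likelihood over real (non-padded) targets only.
    keep = shift_labels != IGNORE_INDEX
    safe_labels = np.where(keep, shift_labels, 0)
    token_losses = -log_probs[np.arange(len(safe_labels)), safe_labels]
    return float(token_losses[keep].mean())

# Toy example: vocabulary of 4 tokens, 3 real tokens followed by 2 padding tokens.
input_ids = np.array([1, 2, 3, 0, 0])
attention_mask = np.array([1, 1, 1, 0, 0])
labels = np.where(attention_mask == 1, input_ids, IGNORE_INDEX)

uniform_logits = np.zeros((5, 4))  # a model that knows nothing
loss = causal_lm_loss(uniform_logits, labels)  # equals log(4) for uniform predictions
```

A model that has learned nothing scores log(vocab_size) on this measure, so falling values during training indicate that the model is picking up the text's statistics.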


In [None]:
# Generate text
output = fine_tuned_model.generate(
    input_ids,
    max_length=200,  # total length in tokens, including the prompt
    num_return_sequences=3,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    do_sample=True,
    no_repeat_ngram_size=3,
    pad_token_id=tokenizer.eos_token_id
)

In [None]:
# Print the generated Shakespeare-like texts
print("\n=== Generated Shakespearean Texts ===\n")
for i, sequence in enumerate(output):
    text = tokenizer.decode(sequence.numpy(), skip_special_tokens=True)
    print(f"Sample {i+1}:\n{text}\n")
    print("-" * 50)
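
Each printed sample repeats the prompt, because a decoder-only model such as GPT-2 returns the prompt tokens followed by the newly generated ones. To compare the continuations alone, the prompt tokens can be sliced off before decoding. The helper below is a minimal sketch on plain lists of token IDs; `split_continuation` and the toy IDs are hypothetical.

```python
def split_continuation(sequence_ids, prompt_length):
    """Split generated token IDs into (prompt part, newly generated part)."""
    ids = list(sequence_ids)
    return ids[:prompt_length], ids[prompt_length:]

# Toy example: a 3-token prompt followed by 4 generated tokens.
generated = [101, 102, 103, 7, 8, 9, 10]
prompt_part, continuation = split_continuation(generated, prompt_length=3)
```

In this notebook, that would correspond to decoding `sequence[input_ids.shape[1]:]` instead of `sequence`, assuming the prompt was not left-padded.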