# Transformer Text Generation

This notebook implements a simple word level neural network. Based on the the text, we will train an Transformer to predict the next words in the sequence.


In [2]:
import sys
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

from collections import Counter
from torch.utils.data import Dataset, DataLoader

!pip install sacremoses

torch.manual_seed(0)

Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
   ---------------------------------------- 0.0/897.5 kB ? eta -:--:--
   ---------------------------------------- 0.0/897.5 kB ? eta -:--:--
   - ------------------------------------- 41.0/897.5 kB 653.6 kB/s eta 0:00:02
   -- ------------------------------------ 61.4/897.5 kB 825.8 kB/s eta 0:00:02
   -- ------------------------------------ 61.4/897.5 kB 825.8 kB/s eta 0:00:02
   -- ------------------------------------ 61.4/897.5 kB 825.8 kB/s eta 0:00:02
   -- ------------------------------------ 61.4/897.5 kB 825.8 kB/s eta 0:00:02
   --- ----------------------------------- 71.7/897.5 kB 262.6 kB/s eta 0:00:04
   --- ----------------------------------- 71.7/897.5 kB 262.6 kB/s eta 0:00:04
   --- ----------------------------------- 71.7/897.5 kB 262.6 kB/s eta 0:00:04
   ---- ---------------------------------- 92.2/897.5 kB 180.8 kB/s eta 0:0

<torch._C.Generator at 0x1af6f4c2510>

**Download and prepare the data**

Prepare the text data by encoding characters as integers. Then divide the text into sequences of characters.

In [4]:
import requests
import re

# Download the file
url = "https://raw.githubusercontent.com/brunoklein99/deep-learning-notes/refs/heads/master/shakespeare.txt"
response = requests.get(url)
text = response.text[:2012]
print(text)

# Tokenize the text via regular expressions and split into the individual words
punctuation = r"[,\.:;\?]"

def clean_and_split(text):
    cleaned_text = re.sub(punctuation, lambda x: f" {x.group()}\n ", text.lower().replace('\n', ' '))
    return re.split(" +", cleaned_text)

word_list = clean_and_split(text)[5:]  # split into words, starting with the actual 1st word
print(word_list)

# Create character-to-index and index-to-character mappings
vocab = Counter(word_list)
vocab = sorted(vocab, key=vocab.get, reverse=True)  # Unique words sorted by frequency in descending order
word_to_idx = {w: i for i, w in enumerate(vocab)}
idx_to_word = {i: w for i, w in enumerate(vocab)}

THE SONNETS

by William Shakespeare

From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou contracted to thine own bright eyes,
Feed'st thy light's flame with self-substantial fuel,
Making a famine where abundance lies,
Thy self thy foe, to thy sweet self too cruel:
Thou that art now the world's fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And tender churl mak'st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.

When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a tattered weed of small worth held:  
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say within thine own deep sunken eyes,
Were an all-eating shame, and thriftless prais

**Define a DataLoader to manage the text data**



In [6]:
# Hyperparameters
seq_length = 50 # Length of input sequence for the Transformer
batch_size = 64

# Create dataset and dataloader (compare to week 5)
class TextDataset(Dataset):
    def __init__(self, text, seq_length):
        self.data = [word_to_idx[word] for word in text]
        self.seq_length = seq_length

    def __len__(self):
        return len(self.data) - self.seq_length

    def __getitem__(self, idx):
        input_seq = self.data[idx:idx + self.seq_length]
        target_seq = self.data[idx + 1:idx + self.seq_length + 1]
        return torch.tensor(input_seq), torch.tensor(target_seq)

dataset = TextDataset(word_list, seq_length)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

**Define the Transformer architecture**

 Use a Transformer and a fully connected layer for the output.

In [8]:
class TextTransformer(nn.Module):

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6, dim_feedforward=2048, dropout=0.1):
        super(TextTransformer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_embedding = nn.Embedding(1000, d_model)  # Positional embedding (up to 1000 words in a sequence)

        # Transformer encoder layers
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead, num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers, dim_feedforward=dim_feedforward,
                                          dropout=dropout, batch_first=True)

        # Linear layer to map the transformer output back to the vocabulary
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        seq_length = x.size(1)
        positions = torch.arange(0, seq_length).unsqueeze(0)

        # Embed input and positions
        x = self.embedding(x) + self.pos_embedding(positions)

        # Transformer expects (seq_len, batch_size, d_model)
        x = x.permute(1, 0, 2)

        # Pass through transformer
        transformer_out = self.transformer(x, x)

        # Map transformer output to vocabulary (to the fully connected layer)
        out = self.fc_out(transformer_out.permute(1, 0, 2))

        return out

**Training Loop**

Train the model by feeding sequences and their targets (next wordss) to the RNN.

Note that the words are encoded as vectors, with one "dimension" per word identified in the text.

In [10]:
# Hyperparameters
num_epochs = 100
d_model = 128
nhead = 8
num_layers = 4
d_feedforward = 2048
dropout = 0.1

lr = 0.001
acceptable_loss = 1  # Accept the model if the loss is sufficiently low

vocab_size = len(vocab)

# Initialize the model, loss function, and optimizer
model = TextTransformer(vocab_size, d_model=d_model, nhead=nhead, num_layers=num_layers,
                        dim_feedforward=d_feedforward, dropout=dropout)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)

# Training loop
for epoch in range(num_epochs):
  model.train()

  for X, y in dataloader:
    # Backpropagation
    optimizer.zero_grad()
    y_pred = model(X)
    loss = loss_fn(y_pred.view(-1, vocab_size), y.view(-1))
    loss.backward()
    optimizer.step()

    # Accept the model if the loss is sufficiently low
    loss_value = loss.item()
    if loss_value < acceptable_loss:
      print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss_value:.4f}')
      break  # exit loop, skip over else clause

  else:  # End of the innner loop -- one full epoch
    print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss_value:.4f}')
    continue  # Start the next epoch

  break  # Stop after the current epoch


Epoch [1/100], Loss: 4.9238
Epoch [2/100], Loss: 4.8664
Epoch [3/100], Loss: 4.8479
Epoch [4/100], Loss: 4.8955
Epoch [5/100], Loss: 4.8353
Epoch [6/100], Loss: 4.8324
Epoch [7/100], Loss: 4.6245
Epoch [8/100], Loss: 3.7626
Epoch [9/100], Loss: 2.8472
Epoch [10/100], Loss: 2.0944
Epoch [11/100], Loss: 1.7132
Epoch [12/100], Loss: 1.3227
Epoch [13/100], Loss: 1.2765
Epoch [14/100], Loss: 1.1706
Epoch [15/100], Loss: 1.0733
Epoch [16/100], Loss: 1.1274
Epoch [17/100], Loss: 1.0971
Epoch [18/100], Loss: 1.0710
Epoch [19/100], Loss: 1.0764
Epoch [20/100], Loss: 1.0476
Epoch [21/100], Loss: 1.0678
Epoch [22/100], Loss: 1.0194
Epoch [23/100], Loss: 1.0678
Epoch [24/100], Loss: 0.9882


**Text Generation**

 Generate text by sampling words from the trained model.

In [12]:
def generate_text(model, start_text, max_length=10, temperature=1.0):
    model.eval()  # Turn-off dropout etc.

    # Preprocess the start text
    words = clean_and_split(start_text)
    input_seq = torch.tensor([word_to_idx[word] for word in words], dtype=torch.long).unsqueeze(0)

    generated_text = [words[0].capitalize()] + words[1:]
    generated_text

    for _ in range(max_length):
        with torch.no_grad():
            # Get the model's predictions for the next token
            output = model(input_seq)

            # Select the last token from the output
            next_token_energy = output[0, -1, :]

            # Apply temperature (controls randomness)
            next_token_energy = next_token_energy / temperature

            # Sample the next token from the probability distribution
            next_token_probs = torch.softmax(next_token_energy, dim=-1)
            next_token = torch.multinomial(next_token_probs, 1).item()

            # Add the predicted word to the generated text
            word = idx_to_word[next_token]
            word = word.capitalize() if generated_text[-1].endswith('\n') else word
            generated_text.append(word)

            # Update the input sequence, the input sequence gets fed back to the model in the loop
            input_seq = torch.cat([input_seq, torch.tensor([[next_token]], dtype=torch.long)], dim=1)

    return ' ' + ' '.join(generated_text)

# Generate some Shakespearean-style text
start_text = "From fairest creatures we desire"
generated_sonnet = generate_text(model, start_text, max_length=100, temperature=0.8)
print(generated_sonnet)

 From fairest creatures we desire mak'st waste all-eating shame face lovely brow ,
 Where is she in thy light's flame with held :
 Then nature's ;
 To the tomb ,
 To be a tattered weed of his beauty by in niggarding :
 Thou couldst count spring my back the time that thereby 'this of her besiege thy beauty lies ,
 Unbless unbless shall sum fresh dig tattered repair beguile face should beauty's field mine shall thereby more praise much his never who is she ,
 Where all ,
 Were an all-eating look his the time that face should form another mother livery heir his old


In [14]:
def model_size_in_kbytes(model):
    total_size = 0
    for param in model.parameters():
        # Number of elements in the parameter tensor
        num_elements = param.numel()

        # Number of bytes per element based on the data type (dtype)
        bytes_per_element = param.element_size()

        # Total size of this parameter
        total_size += num_elements * bytes_per_element

    return total_size/1000

print(f"Size of the original text : {sys.getsizeof(text)/1000: .0f} kb.")
print(f"Size of the neural network: {model_size_in_kbytes(model): .0f} kb.")

Size of the original text :  2 kb.
Size of the neural network:  20774 kb.




---

**Using `torch.hub` to load pretrained network**

We can use `torch.hub` to load a pretrained network instead of creating our own from scratch.

In [None]:
# Load the pre-trained GPT-2 model and tokenizer from Hugging Face using torch.hub
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForCausalLM', 'gpt2')
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'gpt2')

# Define a prompt (input text)
prompt = "Who was Shakespeare?"

# Tokenize the input prompt
inputs = tokenizer(prompt, return_tensors='pt')

# Generate text using GPT-2
# You can specify `max_length` for the length of the generated text
outputs = model.generate(inputs['input_ids'],
                         attention_mask=inputs['attention_mask'],
                         max_length=35,
                         num_return_sequences=1)

# Decode the generated text to human-readable format
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print the generated text
print(generated_text)

Downloading: "https://github.com/huggingface/pytorch-transformers/zipball/main" to C:\Users\khale/.cache\torch\hub\main.zip


In [16]:
print(f"Size of the neural network: {model_size_in_kbytes(model): .0f} kb.")

Size of the neural network:  497759 kb.
