# Lab 2. Building and Using Transformers

## 1. Introduction

In the previous lab we looked at how to build an LSTM to generate text. In this lab we will look at how to do the same thing with transformers.

In the lecture we have already seen how transformers are built in Keras. In this lab rather than building our transformers from scratch, we will use pre-made models on Hugging Face. In the first part we will build a story generator from a Hugging Face template, and in the second part we will look at how to build the same story generator using a Generative Pretrained Transformer (GPT).

## 2. Presentation Instructions

You will be presenting the contents of <b>Lab2AnsBk.docx</b> in a demo. Please upload your completed Lab2AnsBk.docx to Canvas by 4 July 2025, 2359 hours. Ensure that you put in the names of all your team members.


## 3. Building our Transformer Based Story Generator

We will now proceed to build our story generator. As before we begin by loading our text corpus (we are again using Sherlock Holmes). Unlike LSTMs however we can simply present an entire chunk of text to the transformer. However there is an added complication in that the chunks must be of <b>fixed length</b>.

### 3.1 Loading our Dataset

As before we will use glob to scan the training and testing directory, then loading the files into the dataset. We filter out the sentences that have fewer than 5 words, then convert all the text to lowercase. This part exactly the same as in Lab 1.



In [1]:
from datasets import load_dataset
import glob
traindir="sherlock/Train/"
testdir="sherlock/Test/"
trainlist = [file for file in glob.glob(traindir+"*.txt")]
testlist = [file for file in glob.glob(testdir+"*.txt")]

print(trainlist)
# Our training files
data_files ={"train":trainlist,
           "test":testlist}

# Now load the dataset
dataset = load_dataset("text", data_files=data_files)

# Discard statements with fewer than 5 words
min_len = 5
dataset = dataset.filter(lambda example: len(example["text"]) >= min_len)

# Turn all our text to lowercase
def preprocess(example):
    return {"text":example["text"].lower()}

dataset.map(preprocess)

['sherlock/Train/3289-0.txt', 'sherlock/Train/pg2348.txt', 'sherlock/Train/2852-0.txt', 'sherlock/Train/pg2345.txt', 'sherlock/Train/pg2344.txt', 'sherlock/Train/pg2346.txt', 'sherlock/Train/pg2347.txt', 'sherlock/Train/pg2343.txt']


Filter:   0%|          | 0/19488 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1768 [00:00<?, ? examples/s]

Map:   0%|          | 0/15042 [00:00<?, ? examples/s]

Map:   0%|          | 0/1431 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 15042
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1431
    })
})

## 3.2 Training the Tokenizer

We now tokenize the dataset. This means mapping all the sentences to vectors of integers. However unlike in Lab 1 we will create a custom tokenizer by adapting the gpt2 tokenizer we used in Lab 1.

(If you want to train a tokenizer and language model <b><i>from scratch</i></b>, which is very useful for new languages, please see here: https://huggingface.co/blog/how-to-train)

We start by loading the gpt2 tokenizer from Hugging Face. Note that transformers require our sentences to be of <b>fixed length</b>, unlike LSMs in Lab 1 that just require the last <i>lookback</i> tokens.

### Question 1

Explain why transformers train using entire sentences instead of short "look back" sentences like LSTM.

### Question 2

Explain why the sentences used to train the transformers must be of fixed length.

<b>Fill the answers to both questions inside the provided answer book</b>


In [2]:
from transformers import AutoTokenizer

max_length = 30

# Load the model.
model_name = "gpt2"
root_tokenizer = AutoTokenizer.from_pretrained(model_name, padding=True, truncation=True,
                                         max_length=max_length)
#Specify the padding token
root_tokenizer.pad_token = root_tokenizer.eos_token

vocab_size = len(root_tokenizer)
print("Vocab size: ", vocab_size)


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Vocab size:  50257


We can now start training the tokenizer using our dataset. To do so we create a Python generator by yielding a batch of <i>steps</i> sentences at a time. We do this because the dataset is too large to be completely loaded in memory.

In [3]:
# Create Python generator to return steps sentences at a time
# Notice that we use '()' instead of '[]' for our list. This
# yields a generator.

def return_training_corpus(dataset, steps):
    train = dataset["train"]
    return (train[i:i+steps]["text"] for i in range(0, len(train), steps))


# Create the generator that returns 1000 sentences at a time.
# We have about 15,000 sentences in the Sherlock dataset

gen = return_training_corpus(dataset, 1000)

# Now start training the tokenizer
tokenizer = root_tokenizer.train_new_from_iterator(gen, vocab_size, 
                                                   length = len(dataset["train"]))

vocab_size = len(tokenizer)
print("Vocab size of our new tokenizer: ", vocab_size)
tokenizer.pad_token = tokenizer.eos_token

# Save the tokenizer
tokenizer.save_pretrained("sherlock")





Vocab size of our new tokenizer:  20262


('sherlock/tokenizer_config.json',
 'sherlock/special_tokens_map.json',
 'sherlock/vocab.json',
 'sherlock/merges.txt',
 'sherlock/added_tokens.json',
 'sherlock/tokenizer.json')

### 3.3 Training a Transformer From Scratch

We will now start creating and training a transformer from scratch. In the lecture we have seen how to build a transformer using Keras. It is unproductive to do this again, so we will use pre-configured (but untrained) transformers from Hugging Face. 

There are several things we need to do:

    1. Tokenize the entire dataset, creating fixed-length sentences
    2. Load up a pre-configured transformer. We are using the pretrained GPT2 language model.
    3. Train the pre-configured transformer and save it.
    

In [4]:
# Tokenize the entire corpus by using map
def tokenize(example):
    outputs = tokenizer(example["text"], padding=True, truncation=True,
                        max_length=max_length, return_overflowing_tokens=True)
    
    ret_tokens=[]
    
    for input_ids in outputs["input_ids"]:
        ret_tokens.append(input_ids)

    # Map requires a dictionary of tokens to be returned. The token entries
    # must be called "input_ids"
    return {"input_ids":ret_tokens}

# We must set batched = True so that the tokenizer knows how many characters to pad to.

tokenized_dataset = dataset.map(tokenize, remove_columns = dataset["train"].column_names,
                               batched=True)



Map:   0%|          | 0/15042 [00:00<?, ? examples/s]

Map:   0%|          | 0/1431 [00:00<?, ? examples/s]

We can now print the lengths of the first 10 sentences to show you that they've all been padded/truncated to the same length, which is the length of the longest statement seen (or at most 30 characters)


In [5]:
for i, token in enumerate(tokenized_dataset["train"]["input_ids"][:10]):
    print("Length of sentence ", i, ": ", len(token))


Length of sentence  0 :  23
Length of sentence  1 :  23
Length of sentence  2 :  23
Length of sentence  3 :  23
Length of sentence  4 :  23
Length of sentence  5 :  23
Length of sentence  6 :  23
Length of sentence  7 :  23
Length of sentence  8 :  23
Length of sentence  9 :  23


Great! Now let's build our transformer. We will create it from an existing GPT-2 transformer, 

In [8]:
# Bring in the configuration and transformer
from transformers import AutoConfig, TFGPT2LMHeadModel

# Load configuration from existing GPT2 network. Set length of sentences,
# start of sentence and end of sentence tokens

config = AutoConfig.from_pretrained(model_name, 
                                    vocab_size = len(tokenizer), 
                                    n_ctx = max_length,
                                   bos_token_id = tokenizer.bos_token_id,
                                   eos_token_id = tokenizer.eos_token_id)

# Create the model
model = TFGPT2LMHeadModel(config)

### Question 3.

Using the Hugging Face website or otherwise, explain the parameters in AutoConfig.from_pretrained that we have used.

Now we are going to start training the new model. Before this we need to create a data collator that will batch the inputs for training. We then convert the dataset into a TensorFlow dataset.

In [9]:
from transformers import DataCollatorForLanguageModeling

# mlm = Masked Language Model, where we masked random words and let
# the language model infer what it is. We are not doing that here so
# we set mlm to False.

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

batch_size = 16

# Create the TensorFlow datasets
tf_train_set = tokenized_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator)

tf_test_set = tokenized_dataset["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator)


Now we begin training the Transformer! We will use an Adam optimizer to train for 5 epochs or do early termination


In [11]:
from tensorflow.keras.callbacks import ModelCheckpoint
from tf_keras.src.callbacks import EarlyStopping

from transformers import AdamWeightDecay
import os

# tsherlock is the transformer version of the sherlock story generator
filename="./tsherlock.h5"

earlystop = EarlyStopping(min_delta=0.01, patience=2)

# Compile the model 
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
model.compile(optimizer=optimizer)

# If the weights file exists, load it
if os.path.exists(filename):
    # Call model build to initialize the model
    # variables, so that we can call load_weights
    model.build(input_shape=(None, ))
    model.load_weights(filename)
    
# Train the model. This will take a REALLY long time.
epochs=5
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=epochs,
         callbacks=[earlystop])

# Save the weights
model.save_weights(filename)

Epoch 1/5
Epoch 2/5

KeyboardInterrupt: 

### 3.4 Generating Stories

Generating stories using Hugging Face is considerably easier; we can just define a pipeline with our model and tokenizer, and tell it how many words to generate and whether or not to do sampling. 

We are going to use a text generation pipeline. You can get a complete list of available pipelines here: https://huggingface.co/docs/transformers/main_classes/pipelines

Let's try this now.

In [12]:
from transformers import pipeline

# Create the pipeline
pipe=pipeline("text-generation", model=model, tokenizer=tokenizer)

# Number of words to generate
num_words = 100

# Our seed sentence
seed="Elementary my dear Watson, "
text = pipe(seed, max_length=num_words, do_sample=True, no_repeat_ngram_size=2)[0]

print("Generated text: \n")
print(text)


Device set to use 0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Generated text: 

{'generated_text': "Elementary my dear Watson,  One the same man has a very friend a man, and face,” said more, Mr.  We own I have hand case. There very instant man own man that “ You could the great a place, to tell first man of what man very other Watson's be whole I thought much head the lady way, Watson more house, you he will you am I that I had come, a day man was well tell all it was see I will come my"}


### 3.5 Fine-tuning a Pretrained Transformer

We will now fine-tune a pretrained transformer and compare the speed and results. We begin by bringing in the pretrained model using TFAutoModelForCausalLM, then use fit as usual to train the network.

In [13]:
from transformers import TFAutoModelForCausalLM

# The from_pt parameter is to tell the model to convert the weights from PyTorch
# format. We will use the same optimizer as before but set a new checkpoint
# with a new filename
print("Training for %d epochs." % epochs)

pretrained_filename = "ptsherlock.h5"

pretrained_model = TFAutoModelForCausalLM.from_pretrained(model_name, from_pt=True)
pretrained_model.compile(optimizer=optimizer)

# If the weights file exists, load it
if os.path.exists(pretrained_filename):
    pretrained_model.build(input_shape=(None,))
    pretrained_model.load_weights(pretrained_filename)

pretrained_model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=epochs,
         callbacks=[earlystop])

pretrained_model.save_weights(pretrained_filename)



Training for 5 epochs.


pytorch_model.bin:   0%|          | 0.00/548M [00:00<?, ?B/s]

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Epoch 1/5
 16/941 [..............................] - ETA: 15:21 - loss: 8.0769

KeyboardInterrupt: 

As before we start generating our stories so we can compare them.

In [None]:
pretrained_pipe = pipeline("text-generation", model=pretrained_model, tokenizer=tokenizer)
pretrained_text = pretrained_pipe(seed, max_length=num_words, do_sample=True, 
                               no_repeat_ngram_size=2)[0]

print("Generated Text from new Transformer: \n")
print(text)

print("\nPretrained Generated Text: \n")
print(pretrained_text)

### Question 4

Compare the texts generated from the transformer that was trained from scratch versus the transformer that used the pretrained GPT2 weights. Do you see a difference in quality, e.g. fewer "non-English" words?

## 4. Summary

This lab is a follow-up to Lab 1, and here we use transformers to generate stories instead of LSTMs. We started by training a transformer from scratch, and then proceeded to fine-tune a pretrained model.

Hugging Face has many, many models that you can work with, and this lab should serve as an introduction. You are encouraged to explore the other models, and how to use train them and use them.