# Text Generation with Transformers

In this notebook we use the components developed in `modelling.transformer` to train a transformer decoder for our text generation task. We will compare the performance of this model with that established by our RNN baseline.

## Imports

The bulk of the code required to set up and train the model, and to generate new text from it, is contained in `modelling.transformer` (see the source code for details). We import this module together with others that serve the training data and manage model persistence.

In [1]:
from textwrap import wrap

from torch.utils.data import DataLoader

from modelling import data
from modelling import transformer as tfr
from modelling import utils

## Model and Training Parameters

Configure hyper-parameters for the model and the training routine.

In [2]:
MODEL_NAME = "decoder_next_word_gen"

SIZE_EMBED = 256

N_EPOCHS = 20
BATCH_SIZE = 32
SEQ_LEN = 40
MIN_WORD_FREQ = 2
MAXIMUM_LEARNING_RATE = 0.001
WARMUP_EPOCHS = 2
GRADIENT_CLIP = 5

## Setup Training Data

In [3]:
dataset = data.FilmReviewSequences(split="all", seq_len=SEQ_LEN, min_freq=MIN_WORD_FREQ)
data_loader = DataLoader(
    dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    drop_last=True,
    collate_fn=data.pad_seq2seq_data
)

## Instantiate Model

In [4]:
model = tfr.NextWordPredictionTransformer(dataset.vocab_size, SIZE_EMBED)
model

NextWordPredictionTransformer(
  (_position_encoder): PositionalEncoding(
    (_dropout): Dropout(p=0.1, inplace=False)
  )
  (_embedding): Embedding(63223, 256)
  (_decoder): TransformerDecoderLayer(
    (self_attn): MultiheadAttention(
      (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
    )
    (multihead_attn): MultiheadAttention(
      (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
    )
    (linear1): Linear(in_features=256, out_features=512, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (linear2): Linear(in_features=512, out_features=256, bias=True)
    (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (dropout1): Dropout(p=0.1, inplace=False)
    (dropout2): Dropout(p=0.1, inplace=False)
    (dropout3): Dropout(p=0.1, inplace=Fal

Compared with the RNN model, this one is considerably more complex, with more layers and therefore more parameters. It starts with the same embedding layer, now combined with a positional encoding, whose output is fed into a transformer decoder layer comprising two multi-head attention blocks, two linear (dense) feed-forward layers and three sets of layer normalisation and dropout.
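To get a rough feel for where the parameters sit, the sizes printed in the model summary can be tallied by hand. The sketch below is only an estimate under stated assumptions: it counts just the modules visible in the (truncated) printout, so the attention input projections and anything printed after the cut-off are excluded, and the helper names are hypothetical rather than part of `modelling.transformer`.

```python
def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Weights plus optional biases of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def embedding_params(n_tokens: int, dim: int) -> int:
    """One vector of size `dim` per token in the vocabulary."""
    return n_tokens * dim


def layer_norm_params(dim: int) -> int:
    """A scale and a shift per feature (elementwise_affine=True)."""
    return 2 * dim


# Sizes read off the printed model summary.
VOCAB_SIZE = 63223
SIZE_EMBED = 256
SIZE_FEED_FORWARD = 512

counts = {
    "embedding": embedding_params(VOCAB_SIZE, SIZE_EMBED),
    "attention out_proj (x2)": 2 * linear_params(SIZE_EMBED, SIZE_EMBED),
    "linear1": linear_params(SIZE_EMBED, SIZE_FEED_FORWARD),
    "linear2": linear_params(SIZE_FEED_FORWARD, SIZE_EMBED),
    "layer norms (x3)": 3 * layer_norm_params(SIZE_EMBED),
}
total_visible = sum(counts.values())

for name, n in counts.items():
    print(f"{name:<26}{n:>12,}  ({n / total_visible:.1%})")
print(f"{'total (visible modules)':<26}{total_visible:>12,}")
```

Among the modules counted here, the embedding table accounts for the large majority of the parameters, which follows directly from the size of the vocabulary.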

## Train

As well as having a far more complex architecture, transformer-based models are also trickier to train successfully. In particular, the large number of parameters can lead to gradients that grow very large in the early stages of training, preventing convergence.

We handle this with a learning rate schedule that starts close to zero and slowly ramps up, before falling again towards the end of the desired number of epochs. We also clip the gradients; see the source code for the full details.
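As an illustration of the idea, here is a minimal sketch of one such schedule together with gradient clipping. It is not the implementation in `modelling.transformer`: the linear warm-up, the cosine decay, clipping by global norm and the function names are all assumptions chosen for the example, and it is written in plain Python so that it runs without PyTorch.

```python
import math


def warmup_then_decay(epoch: int, n_epochs: int, warmup_epochs: int, max_lr: float) -> float:
    """Learning rate for a 0-indexed epoch (assumed shape: linear warm-up, cosine decay)."""
    if epoch < warmup_epochs:
        # Ramp up linearly from a small positive value towards max_lr.
        return max_lr * (epoch + 1) / (warmup_epochs + 1)
    # Decay from max_lr towards zero over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, n_epochs - warmup_epochs)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads: list[float], max_norm: float) -> list[float]:
    """Rescale gradients so that their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]


# Values taken from the configuration cell earlier in the notebook.
N_EPOCHS = 20
WARMUP_EPOCHS = 2
MAXIMUM_LEARNING_RATE = 0.001
GRADIENT_CLIP = 5

schedule = [
    warmup_then_decay(epoch, N_EPOCHS, WARMUP_EPOCHS, MAXIMUM_LEARNING_RATE)
    for epoch in range(N_EPOCHS)
]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:>2}: lr = {lr:.6f}")

# A gradient with norm 50 is scaled down to norm 5; its direction is unchanged.
clipped = clip_by_global_norm([30.0, 40.0], GRADIENT_CLIP)
print(clipped)
```

In PyTorch, a function of this kind could be divided by the maximum rate and passed as the multiplier to `torch.optim.lr_scheduler.LambdaLR`, with `torch.nn.utils.clip_grad_norm_` applied before each optimiser step. Whether the schedule advances per epoch or per batch is a design choice that this sketch does not settle.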

In [None]:
train_losses = tfr.train(
    model,
    data_loader,
    N_EPOCHS,
    MAXIMUM_LEARNING_RATE,
    WARMUP_EPOCHS,
    GRADIENT_CLIP,
)
utils.save_model(model, name=MODEL_NAME, loss=min(train_losses.values()))

## Generate Text with Model

Start by loading a model and instantiating a tokeniser that can also map from tokens back to text. The `load_model` function loads the best-performing model that has been persisted on the local filesystem.

In [11]:
tokenizer = data.IMDBTokenizer()
best_model: tfr.NextWordPredictionTransformer = utils.load_model(MODEL_NAME)

loading .models/decoder_next_word_gen/trained@2023-06-18T07:14:37;loss=10_8749.pt


Now pass a prompt to the model and have it generate the text that follows.

In [12]:
prompt = "I thought this movie was"
text = tfr.generate(best_model, prompt, tokenizer, temperature=2.0)

for line in wrap(text, width=89):
    print(line)

==> I THOUGHT THIS MOVIE WAS scoutmaster magnetic meaney julliard chrissy corine raju
flaunted coneheads sh_it urinating harding shelby soundgarden liquids gratified cass
unlikable overs perú rideau approx escapee gouged rodriquez terrestial pressburger safes
sloshed roy manji daimajin gair concludes mili leitch gullibility nudged chromosomes
tearjerker...
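The `temperature` argument controls how adventurous the sampling is. The sketch below shows the conventional approach, in which the model's output scores (logits) are divided by the temperature before the softmax. It assumes that `tfr.generate` does something similar, which should be checked against the source code. The function names and the toy logits are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Convert logits to probabilities, sharpening (T < 1) or flattening (T > 1) them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; this does not change the result.
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: list[float], temperature: float = 1.0, rng=random) -> int:
    """Draw one token index at random according to the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy scores for a three-word vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)

print("T=0.5:", [round(p, 3) for p in sharp])
print("T=1.0:", [round(p, 3) for p in neutral])
print("T=2.0:", [round(p, 3) for p in flat])
print("sampled index:", sample_token(logits, temperature=2.0, rng=random.Random(0)))
```

A temperature above 1, such as the 2.0 used here, flattens the distribution, so rare words are sampled more often than they would be at a temperature of 1.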


Now compare this output with that from an untrained model, using the same prompt and temperature.

In [10]:
untrained_model = tfr.NextWordPredictionTransformer(dataset.vocab_size, SIZE_EMBED)
text = tfr.generate(untrained_model, prompt, tokenizer, temperature=2.0)

for line in wrap(text, width=89):
    print(line)

==> I THOUGHT THIS MOVIE WAS scoutmaster magnetic meaney julliard chrissy corine raju
mcmansions coneheads sh_it replicant dwellers shelby draw liquids gratified cass mayer
overs perú deflate approx escapee gouged rodriquez terrestial pressburger safes sloshed
roy manji daimajin sumatra concludes mili leitch burtonesque nudged chromosomes
tearjerker...
