### Pretraining on Unlabeled Data

We will implement a training function and pretrain the LLM. We will also learn basic model evaluation techniques for measuring the quality of the generated text. Finally, we will learn how to load pretrained weights, giving our LLM a starting point for fine-tuning.

In [1]:
import torch
from Chapter04 import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50_257,
    "context_length": 256, # shortened from 1,024
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()

GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(256, 768)
  (drop_emb): Dropout(p=0.1, inplace=False)
  (trf_blocks): Sequential(
    (0): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768, out_features=768, bias=False)
        (W_key): Linear(in_features=768, out_features=768, bias=False)
        (W_value): Linear(in_features=768, out_features=768, bias=False)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (ff): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.1, inplace=False)
    )
    (1): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features

We have shortened the context length to reduce the computational demands of training the model, making it possible to carry out training on a standard laptop.

The two functions below facilitate the conversion between text and token representations. 

In [2]:
import tiktoken
from Chapter04 import generate_text_simple

def text_to_token_ids(text, tokeniser):
    encoded = tokeniser.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # adds batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokeniser):
    flat = token_ids.squeeze(0) # removes batch dimension
    return tokeniser.decode(flat.tolist())

In [3]:
start_context = "Every effort moves you"
tokeniser = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model = model,
    idx = text_to_token_ids(start_context, tokeniser),
    max_new_tokens = 10,
    context_size = GPT_CONFIG_124M["context_length"]
)

print("Output text:\n", token_ids_to_text(token_ids, tokeniser))

Output text:
 Every effort moves you rentingetic wasnم refres RexAngel infieldcigans


The model isn't producing coherent text because it hasn't undergone training. To define what makes text "coherent" or "high quality", we have to implement a numerical method to evaluate the generated content (i.e. a loss function).

### Calculating the text generation loss

The flow from input text to LLM-generated text follows a general five-step procedure:
1. Use the vocabulary to map the input text to token IDs.
2. Pass the token IDs through the model and apply the softmax function to obtain one probability row vector (of vocab_size dimensions) for each input token.
3. Locate the index with the highest probability in each row vector (via the argmax function).
4. Collect these index positions; they are the predicted token IDs.
5. Map the predicted token IDs back into text via the inverse vocabulary.
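
The five steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the five-word `vocab`, the `softmax` helper and the `fake_logits` rows are made up and stand in for the real tokeniser and for `GPTModel`.

```python
import math

# Hypothetical toy vocabulary; the GPT-2 tokeniser has 50,257 entries instead.
vocab = {"every": 0, "effort": 1, "moves": 2, "you": 3, "forward": 4}
inverse_vocab = {idx: tok for tok, idx in vocab.items()}

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Step 1: map the input text to token IDs.
input_ids = [vocab[w] for w in "every effort moves".split()]

# Stand-in for the model: one made-up logit row per input token.
fake_logits = [
    [0.1, 2.0, 0.3, 0.2, 0.1],  # scores for the token following "every"
    [0.2, 0.1, 1.5, 0.3, 0.4],  # scores for the token following "effort"
    [0.1, 0.2, 0.3, 2.5, 0.9],  # scores for the token following "moves"
]

# Step 2: one probability row vector per input token.
probas = [softmax(row) for row in fake_logits]

# Steps 3 and 4: the argmax of each row gives the predicted token IDs.
predicted_ids = [row.index(max(row)) for row in probas]

# Step 5: map the predicted IDs back to text with the inverse vocabulary.
predicted_text = " ".join(inverse_vocab[i] for i in predicted_ids)

print(input_ids)       # [0, 1, 2]
print(predicted_ids)   # [1, 2, 3]
print(predicted_text)  # effort moves you
```

The logits here were chosen by hand so that the prediction matches the shifted input; an untrained model would produce essentially random IDs at steps 3 and 4.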

To illustrate, we will work with two input examples that have already been mapped to token IDs.

In [5]:
inputs = torch.tensor([[16833, 3626, 6100], # "every effort moves"
                      [40, 1107, 588]])     # "I really like"

# Corresponding targets
targets = torch.tensor([[3626, 6100, 345],  # "effort moves you"
                       [1107, 588, 11311]]) # "really like chocolate"

The goal of training an LLM is to maximise the likelihood of the correct token, which involves increasing its probability relative to other tokens. We do this through backpropagation - we update the model weights so that the model outputs higher values for the respective token IDs we want to generate. This requires a loss function, which calculates the difference between the model's predicted output and the actual desired output. 
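
To make "increasing the probability of the correct token" concrete, here is a minimal sketch of a single gradient step. As a simplifying assumption it updates a made-up logit vector directly rather than model weights, and it uses the loss for one token, the negative log of the target's softmax probability. The values of `logits`, `target` and `learning_rate` are arbitrary.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.5, -0.2, 0.1, 0.0]  # made-up logits over a 4-token vocabulary
target = 2                       # index of the correct next token
learning_rate = 1.0              # arbitrary step size

probs_before = softmax(logits)
loss_before = -math.log(probs_before[target])

# Gradient of -log(softmax(logits)[target]) with respect to each logit:
# p_i - 1 for the target index, p_i for every other index.
grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs_before)]

# One gradient descent step: the target logit goes up, all others go down.
logits = [x - learning_rate * g for x, g in zip(logits, grads)]

probs_after = softmax(logits)
loss_after = -math.log(probs_after[target])

print(f"target probability: {probs_before[target]:.3f} -> {probs_after[target]:.3f}")
print(f"loss: {loss_before:.3f} -> {loss_after:.3f}")
```

In real training the same gradient is propagated further back into the model's weights, so the effect on the logits is indirect.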

To calculate the loss:
1. Feed the inputs into the model to obtain logits.
2. Pass these logits through the softmax function to get probabilities.
3. Select the probability scores assigned to the target tokens.
4. Apply the logarithm to these probability scores.
5. Calculate the average log probability.
6. Negate the average log probability; this is the loss we want to compute.
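
Written as a formula, the six steps amount to the following, where $N$ is the number of target tokens and $p_i$ is the softmax probability the model assigns to the $i$-th target token (the symbols $N$ and $p_i$ are notation introduced here for illustration):

$$
\text{loss} = -\frac{1}{N} \sum_{i=1}^{N} \log p_i
$$

With the six log probabilities computed later in this section, this gives

$$
-\frac{1}{6}\left(-9.5296 - 10.3800 - 11.3563 - 11.4712 - 9.8154 - 12.2528\right) \approx 10.8009
$$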

In [6]:
# We feed inputs into the model to calculate logit vectors 
with torch.no_grad(): # disable gradient tracking
    logits = model(inputs) # step 1

probas = torch.softmax(logits, dim=-1) # probability of each token in vocab; step 2
print(probas.shape)

torch.Size([2, 3, 50257])


The first number, 2, is the batch size (the number of input rows). The second, 3, is the number of tokens in each input row. The last number, 50,257, is the vocabulary size: each position holds one probability per token in the vocabulary (it is not the embedding dimension, which is 768). After the softmax function converts logits to probabilities, the generate_text_simple function turns the resulting probability scores into token IDs, which are then decoded back into text.

We can then apply the argmax function to those scores to obtain the corresponding token IDs.

In [None]:
# For each input row, the probability of each token being the next token
probas[0]

tensor([[1.9582e-05, 1.5537e-05, 1.1597e-05,  ..., 2.2041e-05, 7.0134e-06,
         1.8575e-05],
        [9.3378e-06, 1.0149e-05, 7.7960e-06,  ..., 2.8831e-05, 6.1058e-06,
         1.2983e-05],
        [2.8943e-05, 8.6889e-06, 1.5495e-05,  ..., 3.6617e-05, 1.3867e-05,
         1.2969e-05]])

In [6]:
token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)

Token IDs:
 tensor([[[16657],
         [  339],
         [42826]],

        [[49906],
         [29669],
         [41751]]])


In [7]:
# Token IDs to text
print(f"Targets batch 1: {token_ids_to_text(targets[0], tokeniser)}")
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokeniser)}")

Targets batch 1:  effort moves you
Outputs batch 1:  Armed heNetflix


The model training aims to increase the softmax probability in the index positions corresponding to the correct target token IDs. This softmax probability is also used in the evaluation metric to numerically assess the model's generated outputs: the higher the probability in the correct positions, the better. 

To build intuition, consider a vocabulary of only 7 tokens: the probabilities of a randomly initialised model would hover around 1/7 (about 0.14). The GPT-2 vocabulary, however, has 50,257 tokens, so most of the initial probabilities hover around 1/50,257 (about 0.00002).
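
As a rough reference point, the sketch below computes the probability and the corresponding negative log probability for a hypothetical model that spreads its probability uniformly over the whole vocabulary. The uniform model is an assumption made for illustration; an untrained network is only approximately uniform.

```python
import math

vocab_size = 50_257  # GPT-2 vocabulary size, as in GPT_CONFIG_124M

# A perfectly uniform model gives every token the same probability.
uniform_prob = 1 / vocab_size

# Its negative log probability for any target token is log(vocab_size).
uniform_loss = -math.log(uniform_prob)

print(f"uniform probability: {uniform_prob:.2e}")
print(f"uniform loss: {uniform_loss:.2f}")
```

The result, roughly 10.82, is close to the loss of about 10.80 that the untrained model obtains later in this section, which is consistent with near-uniform initial predictions.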

In [None]:
# Step 3 - the initial probability scores corresponding to the target tokens
text_idx = 0
target_probas_1 = probas[text_idx, [0,1,2], targets[text_idx]]
print("Text 1:", target_probas_1)

text_idx = 1
target_probas_2 = probas[text_idx, [0,1,2], targets[text_idx]]
print("Text 2:", target_probas_2)

Text 1: tensor([7.2671e-05, 3.1046e-05, 1.1696e-05])
Text 2: tensor([1.0426e-05, 5.4604e-05, 4.7716e-06])


The output above shows, for each of the two input texts, the initial softmax probability scores corresponding to the target tokens. Next, we apply the logarithm to these probability scores, because log probabilities are more manageable in mathematical optimisation than the raw scores.

In [11]:
log_probas = torch.log(torch.cat((target_probas_1, target_probas_2))) # step 4
print(log_probas)

tensor([ -9.5296, -10.3800, -11.3563, -11.4712,  -9.8154, -12.2528])
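
One commonly cited reason for working with logarithms, offered here as an assumption since the text only says they are more manageable, is numerical: multiplying many small probabilities underflows to zero in floating point, whereas adding their logarithms stays well behaved. The 100-token sequence below is made up.

```python
import math

# Made-up sequence: 100 tokens, each predicted with probability 2e-5.
probs = [2e-5] * 100

# Multiplying the probabilities directly underflows in 64-bit floats.
product = 1.0
for p in probs:
    product *= p

# Summing log probabilities carries the same information without underflow.
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # a finite negative number, about -1082
```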


In [12]:
# Step 5 - combine log probabilities into a single score by computing the average
avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)

tensor(-10.8009)


The goal is to get the average log probability as close to 0 as possible by updating the model's weights during the training process. However, the common practice isn't to push the average log probability up to 0 but rather to bring the negative average log probability down to 0.

In [13]:
# Step 6
neg_avg_log_probas = avg_log_probas * -1
print(neg_avg_log_probas)

tensor(10.8009)


The negative average log probability computed above is known as the cross entropy loss. Cross entropy measures the difference between two probability distributions: the actual distribution (the tokens in a dataset) and the predicted distribution (the token probabilities generated by an LLM). PyTorch's cross entropy function computes this measure for discrete outcomes.

In [14]:
print("Logits shape:", logits.shape)
print("Targets shape:", targets.shape)

Logits shape: torch.Size([2, 3, 50257])
Targets shape: torch.Size([2, 3])


To use the cross entropy function in PyTorch, we flatten these tensors by merging the batch and token dimensions: the logits go from shape (2, 3, 50257) to (6, 50257), and the targets from (2, 3) to (6).

In [15]:
logits_flat = logits.flatten(0,1)
targets_flat = targets.flatten()

print("Flattened logits:", logits_flat.shape)
print("Flattened targets:", targets_flat.shape)

Flattened logits: torch.Size([6, 50257])
Flattened targets: torch.Size([6])


Remember that the targets are the token IDs we want the LLM to generate, and the logits contain the unscaled model outputs before they enter the softmax function to obtain probability scores. Previously, we applied the softmax function, selected the probability scores corresponding to the target ID, and computed the negative average log probabilities (steps 2 - 6). The cross entropy function will do all these steps for us.

In [16]:
loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)

tensor(10.8009)
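
The equivalence between the manual steps and the built-in function can be checked on small random tensors. This is a sketch under stated assumptions: the 7-token vocabulary, the seed and the random `logits` and `targets` are made up, and no model is involved.

```python
import torch

torch.manual_seed(0)

# Toy shapes: batch of 2, 3 tokens each, hypothetical 7-token vocabulary.
logits = torch.randn(2, 3, 7)
targets = torch.randint(0, 7, (2, 3))

# Merge the batch and token dimensions, as done above.
logits_flat = logits.flatten(0, 1)  # shape (6, 7)
targets_flat = targets.flatten()    # shape (6,)

# Manual route: softmax, pick target probabilities, log, mean, negate.
probas = torch.softmax(logits_flat, dim=-1)
row_idx = torch.arange(targets_flat.shape[0])
target_probas = probas[row_idx, targets_flat]
manual_loss = -torch.log(target_probas).mean()

# Built-in route: cross_entropy takes raw logits and integer class targets.
builtin_loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)

print(manual_loss, builtin_loss)
print(torch.allclose(manual_loss, builtin_loss))  # True
```

Note that `cross_entropy` expects unnormalised logits, not probabilities; passing softmax outputs would apply the normalisation twice.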


Perplexity is another metric used alongside cross entropy to evaluate the performance of language models. It is the exponential of the cross entropy loss and offers a more interpretable way to understand how uncertain a model is when predicting the next token in a sequence. It measures how well the probability distribution predicted by the model matches the actual distribution of the words in the dataset. A lower perplexity indicates that the model's predictions are closer to the actual distribution. Perplexity can be read as the effective vocabulary size about which the model is uncertain at each step. As shown below, the model is unsure about which among 49,064 tokens in the vocabulary to generate as the next token.

In [17]:
perplexity = torch.exp(loss)
print(perplexity)

tensor(49064.1680)
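
To put this number in context, the sketch below recomputes the perplexity from the rounded loss and compares it with the perplexity of a hypothetical model that assigns equal probability to every token. The uniform model is an assumption used only as a reference point.

```python
import math

loss = 10.8009  # cross entropy reported above, rounded to four decimals
perplexity = math.exp(loss)

# A uniform model over the GPT-2 vocabulary has a loss of log(vocab_size),
# so its perplexity equals the vocabulary size itself.
vocab_size = 50_257
uniform_perplexity = math.exp(math.log(vocab_size))

print(round(perplexity))
print(round(uniform_perplexity))  # 50257
```

Because the loss is rounded here, the first value differs slightly from the tensor output above. The untrained model's perplexity is only marginally below the vocabulary size, so it is barely better than guessing uniformly.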
