### Pretraining on Unlabeled Data

We will implement a training function and use it to pretrain the LLM. We will also cover basic model evaluation techniques for measuring the quality of the generated text. Finally, we will load pretrained weights, which gives our LLM a starting point for fine-tuning.

In [1]:
import torch
from Chapter04 import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50_257,
    "context_length": 256, # shortened from 1,024
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()

GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(256, 768)
  (drop_emb): Dropout(p=0.1, inplace=False)
  (trf_blocks): Sequential(
    (0): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768, out_features=768, bias=False)
        (W_key): Linear(in_features=768, out_features=768, bias=False)
        (W_value): Linear(in_features=768, out_features=768, bias=False)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (ff): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.1, inplace=False)
    )
    (1): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features

We have shortened the context length to reduce the computational demands of training the model, making it possible to carry out training on a standard laptop.
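
To get a rough feel for what the shorter context saves, the arithmetic below compares the two settings. It is a back-of-the-envelope sketch, not a profile of the actual model: it assumes standard self-attention, where each head builds a context_length × context_length score matrix, and it takes the positional embedding shape from the printed model (`Embedding(256, 768)`). The helper names are ours.

```python
# Rough size comparison for the two context lengths (plain arithmetic, no model needed).
EMB_DIM = 768  # "emb_dim" from GPT_CONFIG_124M


def pos_emb_params(context_length, emb_dim=EMB_DIM):
    """Number of weights in the positional embedding table."""
    return context_length * emb_dim


def attn_scores_per_head(context_length):
    """Entries in one head's attention score matrix for a full-length sequence."""
    return context_length * context_length


for ctx in (256, 1024):
    print(
        f"context_length={ctx}: "
        f"pos_emb weights={pos_emb_params(ctx):,}, "
        f"attention scores per head={attn_scores_per_head(ctx):,}"
    )
```

The positional embedding shrinks by a factor of 4, while the attention score matrix shrinks by a factor of 16, because it grows with the square of the context length.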

The two functions below facilitate the conversion between text and token representations. 

In [2]:
import tiktoken
from Chapter04 import generate_text_simple

def text_to_token_ids(text, tokeniser):
    encoded = tokeniser.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # adds batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokeniser):
    flat = token_ids.squeeze(0) # removes batch dimension
    return tokeniser.decode(flat.tolist())
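
The round trip these two functions perform can be sketched without PyTorch or tiktoken. The `ToyTokeniser` class below is hypothetical (a whitespace tokeniser standing in for the GPT-2 encoding), and `text_to_batch` / `batch_to_text` are list-based analogues of the functions above: wrapping the IDs in an outer list plays the role of `unsqueeze(0)`, and indexing with `[0]` plays the role of `squeeze(0)`.

```python
class ToyTokeniser:
    """Hypothetical whitespace tokeniser; stands in for tiktoken's GPT-2 encoding."""

    def __init__(self, vocab):
        self.str_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_str = {i: tok for tok, i in self.str_to_id.items()}

    def encode(self, text):
        return [self.str_to_id[tok] for tok in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_str[i] for i in ids)


def text_to_batch(text, tokeniser):
    # Wrap in an outer list: shape (1, num_tokens), i.e. a batch of one row.
    return [tokeniser.encode(text)]


def batch_to_text(batch, tokeniser):
    # Take the single row back out before decoding.
    return tokeniser.decode(batch[0])


toy = ToyTokeniser(["Every", "effort", "moves", "you"])
batch = text_to_batch("Every effort moves you", toy)
print(batch)
print(batch_to_text(batch, toy))
```

The batch dimension matters because the model expects input of shape (batch_size, num_tokens), even when there is only one sequence.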

In [3]:
start_context = "Every effort moves you"
tokeniser = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokeniser),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output text:\n", token_ids_to_text(token_ids, tokeniser))

Output text:
 Every effort moves you rentingetic wasnم refres RexAngel infieldcigans


The model isn't producing coherent text because it hasn't undergone training. To define what makes text "coherent" or "high quality", we have to implement a numerical method to evaluate the generated content.
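
As a minimal sketch of what such a numerical measure can look like, the snippet below scores a prediction by the negative log probability assigned to the correct next token. This is one common choice, assumed here for illustration; the four-word vocabulary and its probabilities are made up.

```python
import math

# Toy next-token distribution over a 4-word vocabulary (made-up numbers).
probs = {"you": 0.70, "forward": 0.20, "cat": 0.05, "the": 0.05}
target = "you"


def neg_log_prob(distribution, target_token):
    """Negative log probability assigned to the correct token (lower is better)."""
    return -math.log(distribution[target_token])


confident = neg_log_prob(probs, target)
uniform = neg_log_prob({tok: 0.25 for tok in probs}, target)
print(f"good guess: {confident:.3f}, uniform guess: {uniform:.3f}")
```

A model that puts most of its probability on the correct token gets a score near zero, while a model that guesses uniformly gets a higher score, which gives us a single number to compare.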

### Calculating the Text Generation Loss

We will work with two short inputs that have already been mapped to token IDs, together with the targets the model should predict for them.

In [5]:
inputs = torch.tensor([[16833, 3626, 6100], # "every effort moves"
                      [40, 1107, 588]])     # "I really like"

# Corresponding targets
targets = torch.tensor([[3626, 6100, 345],  # "effort moves you"
                       [1107, 588, 11311]]) # "really like chocolate"
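
Note that each target row is the input row shifted one position to the right. The sketch below rebuilds the first input/target pair from a single token sequence using plain lists; `make_input_target_pair` is a hypothetical helper name used only for illustration.

```python
# Token IDs for "every effort moves you", taken from the tensors above.
sequence = [16833, 3626, 6100, 345]


def make_input_target_pair(token_ids):
    """Split one sequence into (input, target), with the target shifted by one position."""
    return token_ids[:-1], token_ids[1:]


inp, tgt = make_input_target_pair(sequence)
for i in range(len(inp)):
    # At position i the model sees inp[: i + 1] and should predict tgt[i].
    print(f"context {inp[: i + 1]} -> next token {tgt[i]}")
```

Every position in the input therefore provides its own prediction task, so a row of three tokens yields three next-token targets.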

In [6]:
# We feed inputs into the model to calculate logit vectors 
with torch.no_grad(): # disable gradient tracking
    logits = model(inputs)

probas = torch.softmax(logits, dim=-1) # probability of each token in vocab
print(probas.shape)

torch.Size([2, 3, 50257])


The first number, 2, is the batch size (the number of input rows). The second, 3, is the number of tokens in each row. The last number, 50,257, is the vocabulary size rather than the embedding dimension (768): for every token position, the model outputs one probability per vocabulary entry. The generate_text_simple function performs the same conversion from logits to probabilities with softmax before it selects the next token, which is then decoded back into text.
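
To make the shape and the softmax step concrete, here is a plain-Python sketch on made-up logits with a vocabulary of only 5 tokens instead of 50,257. The `softmax` helper is our own stand-in for `torch.softmax(..., dim=-1)`, applied to the last dimension only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vector of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits with shape (batch=2, tokens=3, vocab=5).
toy_logits = [
    [[2.0, 0.5, -1.0, 0.0, 1.0], [0.1, 0.2, 0.3, 0.4, 0.5], [3.0, 3.0, 0.0, 0.0, 0.0]],
    [[-2.0, 0.0, 2.0, 0.0, -2.0], [1.0, 1.0, 1.0, 1.0, 1.0], [0.0, 5.0, 0.0, 0.0, 0.0]],
]

# Apply softmax independently at every token position (the last dimension).
toy_probas = [[softmax(position) for position in row] for row in toy_logits]

shape = (len(toy_probas), len(toy_probas[0]), len(toy_probas[0][0]))
print("shape:", shape)
print("sums per position:", [[round(sum(p), 6) for p in row] for row in toy_probas])
```

The shape is unchanged by softmax, and the values at each token position now sum to 1, so they can be read as a probability distribution over the vocabulary.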

Applying the argmax function to the probability scores along the vocabulary dimension yields, for each position, the ID of the token the model considers most likely.

In [7]:
token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)

Token IDs:
 tensor([[[16657],
         [  339],
         [42826]],

        [[49906],
         [29669],
         [41751]]])
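
Because softmax preserves the ordering of its inputs, taking the argmax of the probabilities gives the same index as taking the argmax of the raw logits. The plain-Python sketch below checks this on made-up scores for a 5-token vocabulary; `softmax` and `argmax` are our own helpers standing in for the PyTorch functions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vector of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest entry (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


toy_logits = [1.2, -0.3, 4.1, 0.0, 2.5]  # made-up scores for a 5-token vocabulary
toy_probas = softmax(toy_logits)

print("argmax of logits:       ", argmax(toy_logits))
print("argmax of probabilities:", argmax(toy_probas))
```

So when only the predicted token ID is needed, the softmax step can be skipped; the probabilities become important once we want to measure how much weight the model gives to the correct token.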


In [8]:
# Token IDs to text
print(f"Targets batch 1: {token_ids_to_text(targets[0], tokeniser)}")
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokeniser)}")

Targets batch 1:  effort moves you
Outputs batch 1:  Armed heNetflix
