#Assignment 3: Decoder-only Transformer

Hello! Welcome to your last assignment, where you will be asked to implement a decoder-only transformer model, and train it on next word prediction objectives.

Because this is a GPU-based assignment, you are encouraged to use Google Colab's T4 runtime. It contains a 15G GPU, which we have tested to be enough for this assignment. However, there is a limited amount of free T4 usage per day. We encourage you to only connect to a runtime when you want to run the code.

You are free to make minor changes to the provided code if that will make things eaiser. However, you will be asked to summarize what you changed in the final section.

In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.2/491.2 kB[0m [31m19.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.12.0-py3-none-any.

First, we will define some hyperparameters. You do not need to change these hyperparameters.

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from datasets import load_dataset
import re
from tqdm import tqdm
from collections import Counter
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters
vocab_size = 32100
embedding_dim = 256
num_heads = 8
num_layers = 8
ffn_dim = 256
max_seq_len = 256
batch_size = 64
num_epochs = 1
learning_rate = 0.001
unk_token = "<UNK>"
pad_token = "<PAD>"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


## Section 1: Implement the Decoder-only Transformer.

In [19]:
from transformers import AutoTokenizer
import numpy as np
import math
# Tokenizer
# We will use a pre-trained tokenizer from T5-base, which is why vocab_size is 32100 above.
# You will need to research on how huggingface tokenizers work
# https://huggingface.co/docs/transformers/en/main_classes/tokenizer
class BPETokenizer:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-base")

    def encode(self, text):
        # TODO
        # Implement this encode function that return a list of INTs (word IDs) based on the input text (str)
        encoding = self.tokenizer.encode(text)
        return encoding

    def decode(self, token_ids):
        # TODO
        # Implement this decode function that returns a string based on a list of INTs (word IDs)
        decoding = self.tokenizer.decode(token_ids)
        return decoding

# Positional Encoding
# TODO: You need to implement a sinusoidal positional embedding class.
# There are many great resources, such as https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/
# Input: current word embedding
# Output: new word embedding after adding the positional embeddings.
class PositionalEncoding(nn.Module):
    def __init__(self, embedding_dim, max_len):
        super().__init__()
        # TODO
        n = 10000.0
        pos_enc = torch.zeros(max_len, embedding_dim)
        positions = torch.arange(0, max_len, step=1).float().unsqueeze(1)
        index = torch.arange(0, embedding_dim, step=2).float()
        denom = 1/(n**(index/embedding_dim))
        pos_enc[:, 0::2] = torch.sin(positions/denom)
        pos_enc[:, 1::2] = torch.cos(positions/denom)
        pos_enc = pos_enc.unsqueeze(0)
        #for k in range(max_len):
        #    for i in range(int(embedding_dim/2)):
        #        denom = n ** (2*i/embedding_dim)
        #        self.pos_enc[k, 2*i] = torch.sin(k/denom)
        #        self.pos_enc[k, 2*i+1] = torch.cos(k/denom)
        self.register_buffer('pos_enc', pos_enc)

    def forward(self, x):
        # TODO
        weights = x + self.pos_enc[:, :x.size(1), :]
        return weights

# Self-Attention Layer
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embedding_dim, num_heads):
        super().__init__()
        assert embedding_dim % num_heads == 0
        # TODO
        # Hint: you need the QKV weights, also some bookkeeping values.
        # Don't forget about a fully-connected output layer
        self.embedding_d = embedding_dim
        self.num_heads = num_heads
        self.head_dim =embedding_dim // num_heads

        self.query = nn.Linear(embedding_dim, embedding_dim, bias=False)
        self.key = nn.Linear(embedding_dim, embedding_dim, bias=False)
        self.value = nn.Linear(embedding_dim, embedding_dim, bias=False)

        attn_dropout = 0.1 #as in the paper
        self.dropout = nn.Dropout(attn_dropout)
        self.layer_norm = nn.LayerNorm(embedding_dim, eps=1e-6)
        self.output = nn.Linear(embedding_dim, embedding_dim)

    # Attention masks are values of 1's and 0's.
    # When the attention mask for a position is 1, it means that the attention can be computed on that position.
    # If a mask value for a position is 0, it means that the attention should 'skip' or 'mask-out' that position.
    def forward(self, x, mask=None):
        # TODO
        # Implement the computations of attention scores, with respect to the attentin mask 'mask'
        # Hint: consider something like attention_scores.masked_fill(?, ?) to compute the masked scores.
        b_size, seq_len = x.size(0), x.size(1)
        q = self.query(x).view(b_size, seq_len, self.num_heads, self.head_dim)
        k = self.key(x).view(b_size, seq_len, self.num_heads, self.head_dim)
        v = self.value(x).view(b_size, seq_len, self.num_heads, self.head_dim)

        q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)

        attn = torch.matmul(q, k.transpose(-2, -1))/math.sqrt(self.head_dim)
        if mask is not None:
            #mask = mask.unsqueeze(1)
            mask = mask.unsqueeze(0).unsqueeze(1)
            attn = attn.masked_fill(mask == 0, float('-inf'))

        attn = F.softmax(attn, dim=-1)
        attn = torch.matmul(attn, v)
        attn = attn.transpose(1, 2).contiguous().view(b_size, seq_len, -1)
        attn = self.output(attn)
        output = self.dropout(attn)
        #output = self.output(attn)
        return output

# Transformer Decoder Layer
class TransformerDecoderLayer(nn.Module):
    def __init__(self, embedding_dim, num_heads, ffn_dim):
        super().__init__()
        self.self_attn = MultiHeadSelfAttention(embedding_dim, num_heads)
        self.norm1 = nn.LayerNorm(embedding_dim)
        # TODO
        # Implement a MLP layer here.
        # Research about what nn.LayerNorm is doing here.
        # Add a second layernorm after the MLP.
        self.fc1 = nn.Linear(embedding_dim, ffn_dim)
        self.fc2 = nn.Linear(ffn_dim, embedding_dim)
        self.act_fun = nn.ReLU()
        self.norm2 = nn.LayerNorm(embedding_dim)

    def forward(self, x, mask=None):
        attn_output = self.self_attn(x, mask)
        # TODO: research about why it is x + attn_output, instead of only attn_output?
        x = self.norm1(x + attn_output)
        # TODO: finish the rest.
        x = self.norm2(self.fc2(self.act_fun(self.fc1(x))) + x)

        return x

# Decoder-only Transformer Model
class DecoderOnlyTransformer(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_heads, num_layers, ffn_dim, max_seq_len):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.pos_encoding = PositionalEncoding(embedding_dim, max_seq_len)
        # TODO
        # Implement the layers
        self.layers = nn.ModuleList([
            TransformerDecoderLayer(embedding_dim, num_heads, ffn_dim) for _ in range(num_layers)
        ])
        self.output = nn.LayerNorm(embedding_dim)
        self.lm_head = nn.Linear(embedding_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        x = self.pos_encoding(x)
        # TODO
        # Create an attention mask, and do the rest of the computation.
        b_size, seq_len = x.size(0), x.size(1)
        device = x.device
        mask = torch.tril(torch.ones((seq_len, seq_len), device=device))

        for layer in self.layers:
            x = layer(x, mask=mask)

        fc_layer_output = self.output(x)
        logits = self.lm_head(fc_layer_output)
        #return fc_layer_output
        return logits

## Section 2: Implement the LLM training.

In [5]:
def load_training_corpus():
    dataset = load_dataset("alpindale/light-novels", split="train")
    texts = dataset["text"][:3000000]
    tokenizer = BPETokenizer()
    tokenized = []
    # TODO
    # In pre-training, we will make every training sequence with exact same lengths (max_seq_len)
    # This means that you will to concatenate shorter text in 'texts' to make every list in 'tokenized' exactly max_seq_len
    # In other words, for instance in tokenized: assert len(instance) == max_seq_len.
    # As you can see later, each element in tokenized is a number (word IDs).
    # For the final instance where you don't have enough tokens, you can pad the sequence with word id 1.
    global max_seq_len
    tokens_seq = list()
    for i in texts:
        tokens = tokenizer.encode(i)
        tokens_seq.extend(tokens)

    for i in range(0, len(tokens_seq), max_seq_len):
        if i+max_seq_len < len(tokens_seq):
            chunk = tokens_seq[i:i+max_seq_len]
        else:
            chunk = tokens_seq[i:]

        if len(chunk) == max_seq_len:
            tokenized.append(chunk)
        else:
            padding = chunk+[1]*(max_seq_len-len(chunk))
            tokenized.append(padding)

    return torch.tensor(tokenized, dtype=torch.long), tokenizer

data, tokenizer = load_training_corpus()
train_data = data

def get_lr_schedule(optimizer, warmup_steps, total_steps):
    """
    Creates a learning rate schedule with linear warmup and cosine decay.
    """
    def lr_lambda(current_step):
        if current_step < warmup_steps:
            return current_step / warmup_steps  # Linear warmup
        progress = (current_step - warmup_steps) / (total_steps - warmup_steps)
        return 0.5 * (1 + torch.cos(torch.tensor(progress * 3.1415926535)))  # Cosine decay

    return LambdaLR(optimizer, lr_lambda)

def train():
    model = DecoderOnlyTransformer(vocab_size, embedding_dim, num_heads, num_layers, ffn_dim, max_seq_len).to(device)
    # TODO: compute the model size. You can use a programmatic approach.
    model_sz = sum(i.numel() for i in model.parameters())
    print(f"The model size: {model_sz} parameters ({model_sz/1000000:.1f} millions)")

    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    loss_fn = nn.CrossEntropyLoss()
    warmup_steps = int(0.1 * (num_epochs * len(train_data) // batch_size))  # 10% of total steps
    total_steps = num_epochs * (len(train_data) // batch_size)
    scheduler = get_lr_schedule(optimizer, warmup_steps, total_steps)

    for epoch in range(num_epochs):
        total_loss = 0
        progress_bar = tqdm(range(0, len(train_data), batch_size), desc=f"Epoch {epoch+1}")
        for i in progress_bar:
            batch = train_data[i:i+batch_size].to(device)
            # TODO Implement the training process.
            inputs = batch[:, :-1]
            targets = batch[:, 1:] # TODO. What is supposed to be the supervision targets in autogressive LM?
            optimizer.zero_grad()
            logits = model(inputs)
            loss = loss_fn(logits.transpose(1, 2), targets)
            loss.backward()
            optimizer.step()
            # Your code should finish before scheduler.step() here.
            scheduler.step()
            total_loss += loss.item()*batch_size
            progress_bar.set_postfix(loss=loss.item())
        print(f"Epoch {epoch+1} completed with Loss: {total_loss / len(train_data)}")
    return model


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/39.0 [00:00<?, ?B/s]

LightNovels.txt:   0%|          | 0.00/907M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/9240994 [00:00<?, ? examples/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

Now, let's train the model! We suggest you save the trained model for later testing.

In [20]:
model = train()
torch.save(model.state_dict(), "model.ckpt")

The model size: 19631972 parameters (19.6 millions)


Epoch 1: 100%|██████████| 4830/4830 [1:06:39<00:00,  1.21it/s, loss=4.34]

Epoch 1 completed with Loss: 4.279744496579885





Then, let's evaluate your model to see if it's working!

In [85]:
# We have provided a sample greedy decoding generation function for you.
# You can use it to test if your trained model is working.
# Although we are not providing any actual test cases, it will be pretty easy
# to tell if your language model is generating coherent natural languages.
# Try a few examples! If it works, it works; we will not do 'exact-match' grading.
model.eval()
def generate_text_greedy(model, input_text, max_length=20):
    input_tokens = tokenizer.encode(input_text)
    generated = input_tokens[:]
    for _ in range(max_length - len(input_tokens)):
        input_tensor = torch.tensor([generated], device=device)
        with torch.no_grad():
            output = model(input_tensor)
        next_token = torch.argmax(output[:, -1, :], dim=-1).item()
        generated.append(next_token)
        if next_token == 1:
            break
    return tokenizer.decode(generated)

print("Generated Sequence:", generate_text_greedy(model, "triple"))

def generate_text_topk(model, input_text, max_length=20, k=50, temperature=1):
    # TODO
    # Implement a top-k sampling generation function.
    # Steps
    # 1. adjust the output distribution/logits based on the temperature
    # 2. find the top-k next tokens (hint: torch.topk)
    # 3. readjust the probability for the top-k candidates for them to sum to 1 (hint: softmax)
    # 4. sample a word from the new k tokens (hint: torch.multinomial)
    model.eval()
    input_ids = torch.tensor(tokenizer.encode(input_text)).unsqueeze(0).to(device)
    gen_tokens = input_ids.clone()
    with torch.no_grad():
        for i in range(max_length):
            logits = model(gen_tokens)
            logits_t1 = logits[:, -1, :]
            logits_t1 = logits_t1/temperature
            topk_logits, topk_indices = torch.topk(logits_t1, k, dim=-1)
            topk_dist = F.softmax(topk_logits, dim=-1)
            topk_idx = torch.multinomial(topk_dist, num_samples=1)
            topk_pred = topk_indices[0, topk_idx[0,0]].item()
            pred_token = torch.tensor([[topk_pred]], device=device)
            #token_t1 = topk_indices[0, torch.multinomial(topk_dist, num_samples=1)]
            gen_tokens = torch.cat([gen_tokens, pred_token], dim=1)

    outputs = gen_tokens[0].tolist()
    gen_text = tokenizer.decode(outputs)
    return gen_text
print("Generated Sequence:", generate_text_topk(model, "Hello"))

Generated Sequence: triple</s></s>
Generated Sequence: Hello</s> “How is she up to her? That’s a good idea.”</s> “I was


## Section 3: Written questions.
Answer all questions with ~100 words (use the fewest words that are enough to express your meanings)

### 1. What does nn.LayerNorm do?

Normalizes the input across layers so the token distribution has e a mean of 0 and variance of 1. It help it to remain stable, resulting in the training process remaining consistent.

### 2. Why the provided code use self.norm1(x + attn_output) instead of self.norm1(attn_output)?

Because we want to attenuate the gradient vanishing by applying residual to the output of the layer. Check Add&Norm ('Attention is all you need')

### 3. What is the model size using the provided hyperparameters?

As printed, The model size: 19631972 parameters (19.6 millions)

### 4. How does temperature play a part in top-k sampling? Share some observations.

Larger T (>1) -> more diverse tokens among top-k. Smaller T (<1) -> the vocabulary becomes less diverse, the model generates output more similar to its training data.

### 5. (Optional, not graded) Please summarize your change to the provided code, if any.
added import math for math.sqrt() and attention weights calucation. Changed load_books_corpus() to load_training_corpus().