# Introduction
In this laboratory we will get our hands dirty working with Large Language Models (e.g. GPT and BERT) to do various useful things. If you haven't already, it is highly recommended that you:

+ Read the [Attention Is All You Need](https://arxiv.org/abs/1706.03762) paper, which is the basis for all transformer-based LLMs.
+ Watch (and potentially *code along* with) this [Andrej Karpathy video](https://www.youtube.com/watch?v=kCc8FmEb1nY), which shows you how to build an autoregressive GPT model from the ground up.

# Exercise 1: Warming Up
In this first exercise you will train a *small* autoregressive GPT model for character generation (the one used by Karpathy in his video) to generate text in the style of Dante Alighieri. Use [this file](https://archive.org/stream/ladivinacommedia00997gut/1ddcd09.txt), which contains the entire text of Dante's Inferno (**note**: you will have to delete some introductory text at the top of the file before training). Train the model for a few epochs, monitor the loss, and generate some text at the end of training. Qualitatively evaluate the results.
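When monitoring the loss, a useful reference point is the cross-entropy of a model that assigns equal probability to every character, which is ln(V) for a vocabulary of V characters. An untrained model should start near this value and move below it as training progresses. The sketch below computes that baseline; the vocabulary size of 80 is an illustrative assumption, not a value measured on the corpus.

```python
import math


def uniform_cross_entropy(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a uniform prediction over vocab_size classes."""
    if vocab_size < 1:
        raise ValueError("vocab_size must be positive")
    return math.log(vocab_size)


# Hypothetical vocabulary size, used for illustration only.
baseline = uniform_cross_entropy(80)
print(f"uniform baseline: {baseline:.3f} nats")
```

If the first logged loss is far above this baseline, the initialization or the loss computation is worth a second look.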

In [9]:
import torch
import torch.nn as nn

In [10]:
# Your code here.
class Dante:
    """A class that aggregates functionality related to the "corpus" used."""
    def __init__(self, train = True, train_size=0.9, block_size=128):
        self._block_size = block_size
        self._train = train

        # Load the entire text file
        with open('commedia.txt', 'r', encoding='utf-8') as fd:
            rawdata = fd.read()

        # Extract tokens BEFORE splitting. Our tokens are characters.
        self._tokens = sorted(set(rawdata))
        self.num_tokens = len(self._tokens)

        # Select train or val/test set.
        rawdata = rawdata[:int(len(rawdata)*train_size)] if train else rawdata[int(len(rawdata)*train_size):]

        # Build the encode/decode dictionaries mapping chars to token ids and back.
        self._c2i = {c: i for (i, c) in enumerate(self._tokens)}
        self._i2c = {i: c for (i, c) in enumerate(self._tokens)}

        # Define the encoder and decoder functions
        self.encode = lambda s: [self._c2i[c] for c in s] # encoder: take a string, output a list of integers
        self.decode = lambda l: ''.join([self._i2c[i] for i in l]) # decoder: take a list of integers, output a string

        # Encode the data
        self._data = torch.tensor(self.encode(rawdata), dtype=torch.long)

    def get_batch(self, batch_size):
        """Retrieves a random batch of contexts and targets."""
        ix = torch.randint(len(self._data) - self._block_size, (batch_size,))
        print(self._data)
        x = torch.stack([self._data[i:i+self._block_size] for i in ix])
        y = torch.stack([self._data[i+1:i+self._block_size+1] for i in ix])
        # x, y = x.to(device), y.to(device)
        return x, y

    def __len__(self):
        return len(self._data) - self._block_size - 1
    
    def __getitem__(self, i):
        xs = self._data[i:i+self._block_size]
        ys = self._data[i+1:i+self._block_size+1]
        return (xs, ys)

In [11]:
ds = Dante(train=True, train_size=0.9, block_size=128)

In [12]:
# Encode a string
ds.encode('nel mezzo del cammin')

[75, 66, 73, 1, 74, 66, 87, 87, 76, 1, 65, 66, 73, 1, 64, 62, 74, 74, 70, 75]

In [13]:
# Decode an Encoded string -> return 'nel mezzo del cammin'
ds.decode(ds.encode('nel mezzo del cammin'))

'nel mezzo del cammin'

In [14]:
(xs, ys) = ds.get_batch(32)

tensor([51, 69, 66,  ..., 81, 79, 76])


In [15]:
xs.shape, ys.shape

(torch.Size([32, 128]), torch.Size([32, 128]))
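The two tensors have the same shape because each target sequence is simply its context shifted one character to the right, so every position in a block provides one training example. The sketch below reproduces that slicing on a plain Python string; `make_example` is a hypothetical helper written for this illustration and is not part of the `Dante` class.

```python
def make_example(data, start, block_size):
    """Return (context, targets), where targets are the context shifted by one."""
    x = data[start:start + block_size]
    y = data[start + 1:start + block_size + 1]
    return x, y


text = "nel mezzo del cammin"
x, y = make_example(text, 0, 8)

# At position t the model sees x[:t + 1] and is trained to predict y[t].
for t in range(len(x)):
    print(repr(x[:t + 1]), "->", repr(y[t]))
```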

In [16]:
"First input"
xs[0]

tensor([70, 62, 11,  1, 73, 66, 81, 81, 76, 79, 11,  1, 81, 70,  1, 68, 70, 82,
        79, 76, 11,  0,  1,  1, 80,  7, 66, 73, 73, 66,  1, 75, 76, 75,  1, 80,
        70, 66, 75,  1, 65, 70,  1, 73, 82, 75, 68, 62,  1, 68, 79, 62, 87, 70,
        62,  1, 83, 76, 81, 66, 11,  0,  0, 64, 69,  7, 70,  7,  1, 83, 70, 65,
        70,  1, 77, 66, 79,  1, 78, 82, 66, 73, 73,  7, 62, 66, 79, 66,  1, 68,
        79, 76, 80, 80, 76,  1, 66,  1, 80, 64, 82, 79, 76,  0,  1,  1, 83, 66,
        75, 70, 79,  1, 75, 76, 81, 62, 75, 65, 76,  1, 82, 75, 62,  1, 67, 70,
        68, 82])

In [17]:
# ds.decode(xs[0]) does not work: iterating a tensor yields 0-dim tensors, which are not keys of the decode dictionary
ds.decode(xs[0].numpy()) # Working

"ia, lettor, ti giuro,\n  s'elle non sien di lunga grazia vote,\n\nch'i' vidi per quell'aere grosso e scuro\n  venir notando una figu"

In [18]:
ds.decode(ys[0].numpy()) # Working

"a, lettor, ti giuro,\n  s'elle non sien di lunga grazia vote,\n\nch'i' vidi per quell'aere grosso e scuro\n  venir notando una figur"

In [20]:
# All configuration parameters for our Transformer
block_size = 128
train_size = 0.9
batch_size = 32
n_embed = 128

In [21]:
# Instantiate datasets for training and test
ds_train = Dante(train=True, train_size=train_size, block_size=block_size)
ds_test = Dante(train=False, train_size=train_size, block_size=block_size)
(xs, ys) = ds_test.get_batch(batch_size)

tensor([11,  0,  1,  ..., 59,  0,  0])


In [60]:
# The top-level GPT nn.Module
class GTPLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embed):
        super().__init__()
        self._vocab_size = vocab_size
        self._n_embd = n_embed
        # maps each token id to a learned n_embed-dimensional embedding vector (no next-token logits are produced yet)
        self.token_embedding = nn.Embedding(vocab_size, n_embed)

    def forward(self, idx, targets=None):
        (B, T) = idx.shape
        tok_emb = self.token_embedding(idx) # (B, T, C)
        return tok_emb

In [61]:
model = GTPLanguageModel(vocab_size=ds_train.num_tokens, n_embed = n_embed)

In [62]:
model(xs)[0][0]

tensor([ 0.0182,  0.6876,  0.9524, -0.1810,  0.9837, -0.6089, -1.1648,  0.2752,
        -0.6810,  0.4400,  1.3942, -0.2972,  0.2556,  1.7411, -0.1625,  0.7471,
        -0.3994,  0.6829,  0.6663, -1.9936, -1.0045,  0.6590,  1.0105, -0.0707,
         1.5947,  0.0098,  0.7688, -0.8266, -0.4158,  1.1425, -0.6613,  0.3734,
        -0.3173, -0.1288,  1.8279, -0.1044,  1.3437,  1.6375,  1.3891,  0.1766,
        -1.1703,  0.6529,  0.9052,  0.4542,  0.8510,  0.0475, -1.1846,  0.7598,
         1.0428,  1.2485, -0.1313, -0.1652,  0.0153, -0.1453,  0.8056,  0.1221,
        -1.8702, -0.1466,  0.7614,  0.3381, -0.6846, -1.0877,  1.8149, -0.5938,
        -0.3843,  1.2736,  1.1190, -0.9846,  0.2179, -0.1396,  0.3629,  0.3197,
         0.8835, -0.4273, -0.9002,  0.1076, -1.4472,  0.2919, -1.0444, -0.3461,
         0.9479,  0.7831, -1.8522,  1.7290,  2.5879,  0.3881,  0.2460,  0.6543,
         0.6894,  1.1303,  0.3790, -0.6986, -1.5515,  0.2599, -0.7662, -2.3683,
         0.2489, -0.9762,  0.9289,  0.83

# Exercise 2: Working with Real LLMs

Our toy GPT can only take us so far. In this exercise we will see how to use the [Hugging Face](https://huggingface.co/) model and dataset ecosystem to access a *huge* variety of pre-trained transformer models.

## Exercise 2.1: Installation and text tokenization

First things first, we need to install the [Hugging Face transformer library](https://huggingface.co/docs/transformers/index):

    conda install -c huggingface -c conda-forge transformers
    
The key classes that you will work with are `GPT2Tokenizer` to encode text into sub-word tokens, and the `GPT2LMHeadModel`. **Note** the `LMHead` part of the class name -- this is the version of the GPT2 architecture that has the text prediction heads attached to the final hidden layer representations (i.e. what we need to **generate** text). 

Instantiate the `GPT2Tokenizer` and experiment with encoding text into integer tokens. Compare the length of input with the encoded sequence length.

**Tip**: Pass the `return_tensors='pt'` argument to the tokenizer to get PyTorch tensors as output (instead of lists).

In [28]:
# Your code here.
from transformers import GPT2LMHeadModel, GPT2Config, GPT2Tokenizer

# Load key classes GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
print(tokenizer("Nel mezzo del cammin di nostra vita", return_tensors='pt')["input_ids"])
print(tokenizer("Ciao mi chiamo Dante", return_tensors='pt')["input_ids"])
print(tokenizer("Paolo Brosio", return_tensors='pt')["input_ids"])
print(tokenizer("Dante alighieri", return_tensors='pt')["input_ids"])



tensor([[   45,   417,   502, 47802,  1619, 12172,  1084,  2566, 18216,   430,
           410,  5350]])
tensor([[   34, 13481, 21504,   442,  1789,    78, 34898]])
tensor([[28875, 14057, 14266,   952]])
tensor([[   35, 12427,   435,   394, 29864]])
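To compare input length with encoded length, the token ids printed above can be set against the character counts of the four prompts. The snippet below only does arithmetic on the ids copied from that output; `chars_per_token` is a hypothetical helper introduced for this comparison.

```python
# Token ids copied from the tokenizer output printed above.
examples = {
    "Nel mezzo del cammin di nostra vita": [45, 417, 502, 47802, 1619, 12172,
                                            1084, 2566, 18216, 430, 410, 5350],
    "Ciao mi chiamo Dante": [34, 13481, 21504, 442, 1789, 78, 34898],
    "Paolo Brosio": [28875, 14057, 14266, 952],
    "Dante alighieri": [35, 12427, 435, 394, 29864],
}


def chars_per_token(text, token_ids):
    """Average number of characters covered by one sub-word token."""
    return len(text) / len(token_ids)


for text, ids in examples.items():
    ratio = chars_per_token(text, ids)
    print(f"{len(text):3d} chars -> {len(ids):2d} tokens ({ratio:.2f} chars/token)  {text!r}")
```

The encoded sequences are shorter than the character sequences, but several words are split into multiple sub-word pieces.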


## Exercise 2.2: Generating Text

There are a lot of ways we can, given a *prompt* in input, sample text from a GPT2 model. Instantiate a pre-trained `GPT2LMHeadModel` and use the [`generate()`](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to generate text from a prompt.

**Note**: The default inference mode for GPT2 is *greedy*, which might not result in satisfying generated text. Look at the `do_sample` and `temperature` parameters.
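To build intuition for these two parameters, the sketch below contrasts greedy selection with temperature sampling on a hand-written list of scores. It illustrates the general idea only and is not the code path used by `generate()`. The logits and all function names are assumptions made for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always take the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature, rng):
    """Sampling: draw a token index in proportion to its probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # hypothetical next-token scores
rng = random.Random(0)    # seeded so that runs are repeatable

print("greedy pick:", greedy_pick(logits))
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: probs={[round(p, 3) for p in probs]}, sampled={sample_pick(logits, t, rng)}")
```

Greedy decoding always returns the same token for the same scores, which is one reason it can fall into repetitive loops. Sampling lets lower-scoring tokens through, more often as the temperature rises.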

In [36]:
# Your code here.
from transformers import GPT2LMHeadModel, GPT2Config, GPT2Tokenizer

# Load key classes GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Generate text from a prompt
prompt = "Nel mezzo del cammin di nostra vita"
generated = model.generate(tokenizer(prompt, return_tensors='pt')["input_ids"], max_length=100)
# print(generated)
print(tokenizer.decode(generated[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Nel mezzo del cammin di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra


In [38]:
# Look at the do_sample and temperature parameters
generated = model.generate(tokenizer(prompt, return_tensors='pt')["input_ids"], max_length=100, do_sample=True, temperature=0.9)   
# print(generated)
print(tokenizer.decode(generated[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Nel mezzo del cammin di nostra vita faggio alla e di siguella, una di giorale della santo e una perche di che di sanna.

D'autre pela susere quelequando sugliando il sable l'apicher che l'ampli sopentor pitta di spagna, neque l'argento e sebata e che qu


# Exercise 3: Reusing Pre-trained LLMs (choose one)

Choose **one** of the following exercises (well, *at least* one). In each of these you are asked to adapt a pre-trained LLM (`GPT2Model` or `DistilBERT` are two good choices) to a new Natural Language Understanding task. A few comments:

+ Since GPT2 is an *autoregressive* model, there is no latent space aggregation at the last transformer layer (you get the same number of tokens out that you give in input). To use a pre-trained model for a classification or retrieval task, you should aggregate these tokens somehow (or opportunistically select *one* to use).

+ BERT models (including DistilBERT) prepend a special [CLS] token to every input sequence, so its latent representation is available at the first position in the output of every self-attention block. You can directly use this as a representation for classification (or retrieval).

+ The first *two* exercises below can probably be done *without* any fine-tuning -- that is, just training a shallow MLP to classify or represent with the appropriate loss function.
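The aggregation options mentioned in the comments above can be sketched on plain Python lists standing in for one sequence of per-token hidden states. The values, the mask, and the function names are illustrative assumptions. Padding on the right is also assumed.

```python
def masked_mean_pool(hidden, mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(hidden[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden, mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                total[j] += value
    if count == 0:
        raise ValueError("mask selects no tokens")
    return [t / count for t in total]


def last_token_pool(hidden, mask):
    """Select the last non-padding token (right padding assumed).

    In an autoregressive model this is the only position that has attended
    to the whole sequence.
    """
    last = max(i for i, keep in enumerate(mask) if keep)
    return hidden[last]


def first_token_pool(hidden):
    """Select the first position, where BERT-style models place [CLS]."""
    return hidden[0]


# Hypothetical hidden states: 4 positions, 2 features, last position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

print("mean :", masked_mean_pool(hidden, mask))
print("last :", last_token_pool(hidden, mask))
print("first:", first_token_pool(hidden))
```

Passing the attention mask matters for the mean and last-token variants: without it, padding vectors would leak into the representation.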

# Exercise 3.1: Training a Text Classifier (easy)

Peruse the [text classification datasets on Hugging Face](https://huggingface.co/datasets?task_categories=task_categories:text-classification&sort=downloads). Choose a *moderately* sized dataset and use an LLM to train a classifier to solve the problem.

**Note**: A good first baseline for this problem is certainly to use an LLM *exclusively* as a feature extractor and then train a shallow model.

# Exercise 3.2: Training a Question Answering Model (harder)

Peruse the [multiple choice question answering datasets on Hugging Face](https://huggingface.co/datasets?task_categories=task_categories:multiple-choice&sort=downloads). Choose a *moderately* sized one and train a model to answer contextualized multiple-choice questions. You *might* be able to avoid fine-tuning by training a simple model to *rank* the multiple choices (see margin ranking loss in PyTorch).
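As a rough illustration of the ranking idea, the sketch below evaluates a margin ranking loss by hand for one question. The per-pair formula matches the one documented for `torch.nn.MarginRankingLoss` with target `y = 1`. The scores, the margin of 1.0, and the helper `question_loss` are assumptions made for this example.

```python
def margin_ranking_loss(score_pos, score_neg, margin=1.0):
    """max(0, margin - (score_pos - score_neg)).

    The loss is zero once the correct choice outscores the wrong one
    by at least `margin`.
    """
    return max(0.0, margin - (score_pos - score_neg))


def question_loss(scores, correct_index, margin=1.0):
    """Average the pairwise loss of the correct choice against every wrong choice."""
    pos = scores[correct_index]
    negatives = [s for i, s in enumerate(scores) if i != correct_index]
    return sum(margin_ranking_loss(pos, n, margin) for n in negatives) / len(negatives)


# Hypothetical scores produced by a small model for four answer choices.
scores = [0.2, 1.5, 0.9, -0.3]
loss = question_loss(scores, correct_index=1)
print(f"loss = {loss:.4f}")
```

Only the choice scored 0.9 contributes here, because it is the only wrong answer within the margin of the correct one.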

# Exercise 3.3: Training a Retrieval Model (hardest)

The Hugging Face dataset repository contains a large number of ["text retrieval" problems](https://huggingface.co/datasets?task_categories=task_categories:text-retrieval&p=1&sort=downloads). These tasks generally require that the model measure *similarity* between text in some metric space -- naively, just a cosine similarity between [CLS] tokens can get you pretty far. Find an interesting retrieval problem and train a model (starting from a pre-trained LLM of course) to solve it.

**Tip**: Sometimes identifying the *retrieval* problems in these datasets can be half the challenge. [This dataset](https://huggingface.co/datasets/BeIR/scifact) might be a good starting point.
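The naive cosine-similarity baseline can be sketched as follows: represent the query and every document by a vector, for instance a [CLS] embedding, then sort the documents by similarity to the query. The vectors and document ids below are made up for illustration, and `rank_documents` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either one is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Hypothetical 2-D embeddings standing in for real sentence representations.
query = [1.0, 0.0]
documents = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.0, 1.0],
    "doc_c": [-1.0, 0.0],
}

print(rank_documents(query, documents))
```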

## Exercise 3.1 

In [37]:
from transformers import DistilBertModel, AutoTokenizer
import torch
import torch.nn as nn
import numpy as np
import torch.nn.functional as F
import torch.optim as optim
import wandb
import os
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.metrics import accuracy_score, classification_report
from datetime import datetime
from datasets import load_dataset

In [114]:
# Function to train a model for a single epoch over the data loader.
def train_epoch(model, dl, opt, epoch='Unknown', device='cpu'):
    model.train()
    losses = []
    for (xs, mas, ys) in tqdm(dl, desc=f'Training epoch {epoch}', leave=True):
        xs, mas, ys = xs.to(device), mas.to(device), ys.to(device)
        opt.zero_grad()
        logits = model(xs, mas)
        loss = F.cross_entropy(logits, ys)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return np.mean(losses)

# Function to evaluate model over all samples in the data loader.
def evaluate_model(model, dl, device='cpu'):
    model.eval()
    predictions = []
    ground_truths = []
    with torch.no_grad():
        for (xs, mas, ys) in tqdm(dl, desc='Evaluating', leave=False):
            xs = xs.to(device)
            mas = mas.to(device)
            logits = model(xs, mas)
            preds = torch.argmax(logits, 1)
            ground_truths.append(ys)
            predictions.append(preds.detach().cpu().numpy())
    predictions = np.hstack(predictions)
    ground_truths = np.hstack(ground_truths)
    return accuracy_score(ground_truths, predictions), classification_report(ground_truths, predictions, zero_division=0, digits=3, output_dict=True)

def train_model(model, dl_train, dl_val, opt, epochs, model_name, dataset_type, lr, batch_size, device='cpu'):
    wandb.init(
        project="DLA Assigment 2",
        name=model_name + "-" + datetime.now().strftime("%Y%m%d-%H%M%S"),
        # track hyperparameters and run metadata
        config={
            "architecture": model_name,
            "dataset": dataset_type,
            "epochs": epochs,
            "learning_rate": lr,
            "batch_size": batch_size,
            "device": device,
            "optimizer": "AdamW"
        }
    )
    # wandb.watch(model, nn.CrossEntropyLoss, log="all", log_freq=10)
    losses_and_accs = []
    classification_report = []
    for epoch in range(epochs):
        loss = train_epoch(model, dl_train, opt, epoch, device=device)
        (val_acc, class_rep) = evaluate_model(model, dl_val, device=device)
        losses_and_accs.append((loss, val_acc))
        classification_report.append(class_rep)
        
        print(f'Epoch {epoch}: Loss - {loss:.4f}, Validation Acc - {val_acc:.4f}')
        # wandb
        wandb.log({"epoch": epoch, "loss": loss, "acc": val_acc, "classification_report": class_rep})
                
    # wandb.unwatch(model)
    # [optional] finish the wandb run, necessary in notebooks
    wandb.finish()    

    # torch.save(model.state_dict(), f"model_states/model_{model_name}.pt")
    # torch.save(model, f"model/model_{model_name}.pt")
    return losses_and_accs


# Simple function to plot the loss curve and validation accuracy.
def plot_validation_curves(training_history):
    losses, accuracies = zip(*training_history)
    plt.figure(figsize=(16, 8))

    plt.subplot(1, 2, 1)
    plt.plot(losses)
    plt.title('Average Training Loss per Epoch')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')

    plt.subplot(1, 2, 2)
    plt.plot(accuracies)
    plt.title(f'Best Accuracy = {np.max(accuracies)} @ epoch {np.argmax(accuracies)}')
    plt.xlabel('Epoch')
    plt.ylabel('Validation Accuracy')
    plt.show()

In [115]:
class DistilBERTClassifier(nn.Module):
    def __init__(self, model, n_classes):
        super().__init__()
        self.model = model
        self.classifier = nn.Sequential(
            nn.Linear(model.config.hidden_size, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes)
        )

    def forward(self, x, mask):
        x = self.model(x, attention_mask=mask).last_hidden_state[:, 0]
        return self.classifier(x)

In [102]:
# The tokenizer must match the DistilBERT checkpoint used by the classifier
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# Convert train, val, and test to DataLoader
def get_dataloader(dataset, batch_size):
    # Tokenize only the text column and return PyTorch tensors
    xx = tokenizer(dataset['text'], padding=True, truncation=True, return_tensors="pt")
    # The tokenizer already returns tensors, so no torch.tensor(...) copy is needed
    x = xx['input_ids']
    a = xx['attention_mask']
    y = torch.tensor(dataset['label'])
    dataset = torch.utils.data.TensorDataset(x, a, y)
    return torch.utils.data.DataLoader(dataset, batch_size=batch_size)

# Load the dataset
dataset = load_dataset("tweet_eval", "emoji")

dl_train = get_dataloader(dataset['train'], batch_size)
dl_val = get_dataloader(dataset['validation'], batch_size)
dl_test = get_dataloader(dataset['test'], batch_size)

n_classes = len(set(dataset['train']['label']))



In [117]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the DistilBERT model
bert = DistilBertModel.from_pretrained('distilbert-base-uncased')

# Instantiate the model
model = DistilBERTClassifier(bert, n_classes).to(device)

# Parameters
batch_size = 128
epochs = 5
lr = 1e-5
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)



In [118]:
# Train the model
results = train_model(model, dl_train, dl_val, optimizer, epochs, "DistilBERT", "tweet_eval", lr, batch_size, device=device)

# Plot the results
plot_validation_curves(results)
print(f'Accuracy report on TEST:\n {evaluate_model(model, dl_test, device=device)[1]}')

Training epoch 0:   0%|          | 0/176 [00:29<?, ?it/s]


KeyboardInterrupt: 

In [120]:
from transformers import DistilBertModel, AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the DistilBERT model
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')

# Instantiate the model
classifier = DistilBERTClassifier(model, n_classes)

# Move the model to the device
classifier.to(device)

# Dataset
dataset = load_dataset("tweet_eval", "emoji")

# Convert train, val, and test
xs_train = dataset['train']['text']
ys_train = dataset['train']['label']

xs_tok_train = tokenizer(xs_train, padding=True, truncation=True, return_tensors="pt")


# Train the model
losses = []
accuracy = []
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(classifier.parameters(), lr=5e-5)
epochs = 10
batch_size = 256
len_train = len(xs_train)

# Train the model
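for epoch in range(epochs):
    classifier.train()  # puts both the DistilBERT backbone and the MLP head in training mode
    for i in range(0, len_train, batch_size):
        # Slice with batch_size so that no samples are skipped between steps
        batch = xs_tok_train['input_ids'][i:i+batch_size], xs_tok_train['attention_mask'][i:i+batch_size]
        batch = [b.to(device) for b in batch]
        ys = torch.tensor(ys_train[i:i+batch_size]).to(device)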
        optimizer.zero_grad()
        output = classifier(*batch)
        loss = criterion(output, ys)
        loss.backward()
        optimizer.step()
        acc = (output.argmax(1) == ys).float().mean()
        losses.append(loss.item())
        accuracy.append(acc.item())
        print(f"Epoch {epoch}, Loss {loss.item()}, Acc {acc.item()}, Mean Loss {sum(losses)/len(losses)}, Batch {i} of {len_train}")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Epoch 0, Loss 2.9835972785949707, Acc 0.125, Mean Loss 2.9835972785949707, Batch 0 of 45000
Epoch 0, Loss 2.9624533653259277, Acc 0.125, Mean Loss 2.973025321960449, Batch 256 of 45000
Epoch 0, Loss 2.9558889865875244, Acc 0.0625, Mean Loss 2.967313210169474, Batch 512 of 45000
Epoch 0, Loss 2.9042811393737793, Acc 0.0625, Mean Loss 2.9515551924705505, Batch 768 of 45000
Epoch 0, Loss 2.9294204711914062, Acc 0.0625, Mean Loss 2.947128248214722, Batch 1024 of 45000
Epoch 0, Loss 2.8204762935638428, Acc 0.125, Mean Loss 2.9260195891062417, Batch 1280 of 45000
Epoch 0, Loss 3.019408702850342, Acc 0.03125, Mean Loss 2.939360891069685, Batch 1536 of 45000
Epoch 0, Loss 2.9148595333099365, Acc 0.25, Mean Loss 2.936298221349716, Batch 1792 of 45000
Epoch 0, Loss 2.9502100944519043, Acc 0.0625, Mean Loss 2.937843985027737, Batch 2048 of 45000
Epoch 0, Loss 2.8904333114624023, Acc 0.15625, Mean Loss 2.933102917671204, Batch 2304 of 45000
Epoch 0, Loss 2.833491325378418, Acc 0.21875, Mean Loss 2

KeyboardInterrupt: 