# Introduction
In this laboratory we will get our hands dirty working with Large Language Models (e.g. GPT and BERT) to do various useful things. If you haven't already, it is highly recommended that you:

+ Read the [Attention Is All You Need](https://arxiv.org/abs/1706.03762) paper, which is the basis for all transformer-based LLMs.
+ Watch (and potentially *code along* with) this [Andrej Karpathy video](https://www.youtube.com/watch?v=kCc8FmEb1nY), which shows you how to build an autoregressive GPT model from the ground up.

# Exercise 1: Warming Up
In this first exercise you will train a *small* autoregressive GPT model for character generation (the one used by Karpathy in his video) to generate text in the style of Dante Alighieri. Use [this file](https://archive.org/stream/ladivinacommedia00997gut/1ddcd09.txt), which contains the entire text of Dante's Inferno (**note**: you will have to delete some introductory text at the top of the file before training). Train the model for a few epochs, monitor the loss, and generate some text at the end of training. Qualitatively evaluate the results.

In [None]:
import torch
import torch.nn as nn

In [1]:
# Your code here.
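class Dante:
    """A class that aggregates functionality related to the "corpus" used."""
    def __init__(self, train=True, train_size=0.9, block_size=128):
        self._block_size = block_size
        self._train = train

        # Load the entire text file.
        with open('commedia.txt', 'r', encoding='utf-8') as fd:
            rawdata = fd.read()

        # Extract tokens BEFORE splitting. Our tokens are characters.
        self._tokens = sorted(set(rawdata))
        self.num_tokens = len(self._tokens)

        # Select train or val/test set.
        split = int(len(rawdata) * train_size)
        rawdata = rawdata[:split] if train else rawdata[split:]

        # Build the encode/decode dictionaries mapping chars to token ids and back.
        self._c2i = {c: i for (i, c) in enumerate(self._tokens)}
        self._i2c = {i: c for (i, c) in enumerate(self._tokens)}

        # Encode the selected split as a 1-D tensor of token ids.
        self._data = torch.tensor(self.encode(rawdata), dtype=torch.long)

    def encode(self, s):
        """Maps a string to a list of token ids."""
        return [self._c2i[c] for c in s]

    def decode(self, ids):
        """Maps an iterable of integer token ids back to a string."""
        return ''.join(self._i2c[i] for i in ids)

    def get_batch(self, batch_size):
        """Retrieves a random batch of contexts and targets."""
        ix = torch.randint(0, len(self), (batch_size,))
        pairs = [self[int(i)] for i in ix]
        xs = torch.stack([x for (x, _) in pairs])
        ys = torch.stack([y for (_, y) in pairs])
        return (xs, ys)

    def __len__(self):
        # Number of valid start positions for a (context, target) pair.
        return len(self._data) - self._block_size - 1

    def __getitem__(self, i):
        # Targets are the context shifted one position to the right.
        xs = self._data[i:i+self._block_size]
        ys = self._data[i+1:i+self._block_size+1]
        return (xs, ys)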

In [None]:
ds = Dante(train=True, train_size=0.9, block_size=128)

In [None]:
# Encode a string
ds.encode('nel mezzo del cammin')

In [None]:
# Decode an Encoded string -> return 'nel mezzo del cammin'
ds.decode(ds.encode('nel mezzo del cammin'))

In [None]:
(xs, ys) = ds.get_batch(32)

In [None]:
xs.shape, ys.shape

In [None]:
# First input
xs[0]

In [None]:
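# ds.decode(xs[0]) does not work: iterating over a tensor yields 0-d tensors, not plain integer ids.
ds.decode(xs[0].numpy())  # Works after converting to a NumPy array.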

In [None]:
ds.decode(ys[0].numpy()) # Working
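
The decoded `xs[0]` and `ys[0]` should show the same text offset by one character, because `__getitem__` takes the targets starting from position `i+1`. A minimal sketch of that relationship on a toy string (the helper `make_example` is hypothetical and not part of the `Dante` class):

```python
def make_example(text, i, block_size):
    """Return (context, target) windows; the target is the context shifted by one."""
    xs = text[i:i + block_size]
    ys = text[i + 1:i + block_size + 1]
    return xs, ys

xs_demo, ys_demo = make_example("nel mezzo del cammin", 0, 8)
print(repr(xs_demo))  # 'nel mezz'
print(repr(ys_demo))  # 'el mezzo'
```

At every position `t`, the model sees `xs[:t+1]` and is trained to predict `ys[t]`, so one window of length `block_size` yields `block_size` training examples.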

In [None]:
# All configuration parameters for our Transformer
block_size = 128
train_size = 0.9
batch_size = 32
n_embed = 128

In [None]:
# Instantiate datasets for training and test
ds_train = Dante(train=True, train_size=train_size, block_size=block_size)
ds_test = Dante(train=False, train_size=train_size, block_size=block_size)
(xs, ys) = ds_test.get_batch(batch_size)

In [None]:
# The top-level GPT nn.Module
class GTPLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embed):
        super().__init__()
        self._vocab_size = vocab_size
        self._n_embd = n_embed
        # Lookup table mapping each token id to an n_embed-dimensional embedding vector.
        self.token_embedding = nn.Embedding(vocab_size, n_embed)

    def forward(self, idx, targets=None):
        (B, T) = idx.shape
        tok_emb = self.token_embedding(idx) # (B, T, C)
        return tok_emb

In [None]:
model = GTPLanguageModel(vocab_size=ds_train.num_tokens, n_embed = n_embed)

In [None]:
model(xs)[0][0]
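
The model above stops at the token embeddings. Below is a minimal sketch of the remaining pieces the exercise asks for: a prediction head, a loss to monitor, a training loop, and sampling. It assumes a toy repeated string in place of the Dante corpus and a hypothetical `TinyCharLM` with no attention blocks, so it illustrates the loop structure rather than the full GPT from the video.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy corpus standing in for the Dante text (assumption for illustration).
toy_text = "nel mezzo del cammin di nostra vita " * 20
toy_chars = sorted(set(toy_text))
toy_c2i = {c: i for i, c in enumerate(toy_chars)}
toy_data = torch.tensor([toy_c2i[c] for c in toy_text], dtype=torch.long)
toy_block, toy_batch, toy_embed = 8, 16, 32

class TinyCharLM(nn.Module):
    """Hypothetical stand-in: embedding + linear head, no attention blocks."""
    def __init__(self, vocab_size, n_embed):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.lm_head(self.token_embedding(idx))  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            # cross_entropy expects (N, C) logits and (N,) targets.
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

def toy_get_batch():
    ix = torch.randint(0, len(toy_data) - toy_block - 1, (toy_batch,))
    xb = torch.stack([toy_data[i:i + toy_block] for i in ix])
    yb = torch.stack([toy_data[i + 1:i + toy_block + 1] for i in ix])
    return xb, yb

toy_model = TinyCharLM(len(toy_chars), toy_embed)
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-2)
losses = []
for step in range(200):
    xb, yb = toy_get_batch()
    _, loss = toy_model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")

# Sample 20 characters autoregressively, starting from token id 0.
idx = torch.zeros((1, 1), dtype=torch.long)
with torch.no_grad():
    for _ in range(20):
        logits, _ = toy_model(idx[:, -toy_block:])
        probs = F.softmax(logits[:, -1, :], dim=-1)
        idx = torch.cat([idx, torch.multinomial(probs, num_samples=1)], dim=1)
print(''.join(toy_chars[i] for i in idx[0].tolist()))
```

For the real exercise, the same loop would draw batches from `ds_train.get_batch(batch_size)` and periodically evaluate on `ds_test` to monitor overfitting.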

# Exercise 2: Working with Real LLMs

Our toy GPT can only take us so far. In this exercise we will see how to use the [Hugging Face](https://huggingface.co/) model and dataset ecosystem to access a *huge* variety of pre-trained transformer models.

## Exercise 2.1: Installation and text tokenization

First things first, we need to install the [Hugging Face Transformers library](https://huggingface.co/docs/transformers/index):

    conda install -c huggingface -c conda-forge transformers
    
The key classes that you will work with are `GPT2Tokenizer`, which encodes text into sub-word tokens, and `GPT2LMHeadModel`. **Note** the `LMHead` part of the class name -- this is the version of the GPT2 architecture that has the language modeling head attached to the final hidden layer representations (i.e. what we need to **generate** text).

Instantiate the `GPT2Tokenizer` and experiment with encoding text into integer tokens. Compare the length of the input text (in characters) with the length of the encoded sequence (in tokens).

**Tip**: Pass the `return_tensors='pt'` argument to the tokenizer to get PyTorch tensors as output (instead of lists).
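
One possible starting point is sketched below, assuming the standard `"gpt2"` checkpoint name. The helper names `load_gpt2_tokenizer` and `compare_lengths` are hypothetical, and the import is done lazily so the download only happens when you call the loader.

```python
def load_gpt2_tokenizer():
    """Downloads (or loads from cache) the pre-trained GPT-2 tokenizer."""
    from transformers import GPT2Tokenizer  # lazy import: needs the transformers package
    return GPT2Tokenizer.from_pretrained("gpt2")

def compare_lengths(text, tokenizer):
    """Returns (number of characters, number of sub-word tokens) for `text`."""
    encoded = tokenizer(text, return_tensors="pt")
    input_ids = encoded["input_ids"]  # shape (1, sequence_length)
    return len(text), input_ids.shape[1]

# Usage (needs network access the first time):
# tokenizer = load_gpt2_tokenizer()
# print(compare_lengths("Nel mezzo del cammin di nostra vita", tokenizer))
```

Since the tokenizer works on sub-word units, the token count is typically much smaller than the character count for English text, and the gap tends to shrink for languages that are less represented in the tokenizer's training data.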

In [19]:
# Your code here.

## Exercise 2.2: Generating Text

Given a *prompt* as input, there are many ways to sample text from a GPT2 model. Instantiate a pre-trained `GPT2LMHeadModel` and use the [`generate()`](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to generate text from a prompt.

**Note**: The default inference mode for GPT2 is *greedy*, which might not result in satisfying generated text. Look at the `do_sample` and `temperature` parameters.
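
A hedged sketch of how greedy and sampled generation might be compared, assuming the `"gpt2"` checkpoint. The helpers `generation_kwargs` and `generate_text` are hypothetical wrappers, and the default of 40 new tokens is an arbitrary choice.

```python
def generation_kwargs(do_sample=False, temperature=1.0, max_new_tokens=40):
    """Builds keyword arguments for generate(); temperature only matters when sampling."""
    kwargs = {"max_new_tokens": max_new_tokens, "do_sample": do_sample}
    if do_sample:
        kwargs["temperature"] = temperature
    return kwargs

def generate_text(prompt, **kwargs):
    """Generates a continuation of `prompt` with a pre-trained GPT-2."""
    from transformers import GPT2LMHeadModel, GPT2Tokenizer  # lazy import
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS
        **generation_kwargs(**kwargs),
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Usage (downloads the model on first run):
# print(generate_text("Once upon a time", do_sample=False))
# print(generate_text("Once upon a time", do_sample=True, temperature=0.7))
```

Lower temperatures sharpen the next-token distribution and bring sampling closer to greedy decoding, while higher temperatures flatten it and produce more varied (and less coherent) text.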

In [20]:
# Your code here.

# Exercise 3: Reusing Pre-trained LLMs (choose one)

Choose **one** of the following exercises (well, *at least* one). In each of these you are asked to adapt a pre-trained LLM (`GPT2Model` or `DistilBERT` are two good choices) to a new Natural Language Understanding task. A few comments:

+ Since GPT2 is an *autoregressive* model, there is no aggregation of the latent representations at the last transformer layer (you get out the same number of tokens that you give in input). To use a pre-trained model for a classification or retrieval task, you should aggregate these tokens somehow (or opportunistically select *one* to use).

+ BERT models (including DistilBERT) prepend a special [CLS] token to every input sequence, so the output of each transformer layer contains a [CLS] representation in the first position. You can directly use the final-layer [CLS] representation for classification (or retrieval).

+ The first *two* exercises below can probably be done *without* any fine-tuning -- that is, just training a shallow MLP to classify or represent with the appropriate loss function.
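
The aggregation options mentioned above can be sketched as follows, using random tensors in place of a model's last hidden states. The function names are hypothetical, and `last_token_pool` assumes right-padded batches. For a causal model such as GPT2, the last real token is a natural choice because it is the only position that attends to the whole sequence.

```python
import torch

def mean_pool(hidden_states, attention_mask):
    """Averages token representations, ignoring padded positions.

    hidden_states: (B, T, C) float tensor; attention_mask: (B, T), 1 for real tokens.
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (B, T, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # (B, C)
    counts = mask.sum(dim=1).clamp(min=1.0)                      # (B, 1)
    return summed / counts

def last_token_pool(hidden_states, attention_mask):
    """Selects the last non-padded token of each sequence (assumes right padding)."""
    last = attention_mask.sum(dim=1) - 1                         # (B,)
    return hidden_states[torch.arange(hidden_states.size(0)), last]

def cls_pool(hidden_states):
    """Selects position 0, where BERT-style models place [CLS]."""
    return hidden_states[:, 0]

# Random tensors standing in for a model's last hidden states.
h = torch.randn(2, 5, 8)
m = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
print(mean_pool(h, m).shape, last_token_pool(h, m).shape, cls_pool(h).shape)
```

Each function returns one vector of size `C` per sequence, which can then be fed to a shallow classifier or compared with a similarity measure.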

## Exercise 3.1: Training a Text Classifier (easy)

Peruse the [text classification datasets on Hugging Face](https://huggingface.co/datasets?task_categories=task_categories:text-classification&sort=downloads). Choose a *moderately* sized dataset and use an LLM to train a classifier to solve the problem.

**Note**: A good first baseline for this problem is certainly to use an LLM *exclusively* as a feature extractor and then train a shallow model.
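
A minimal sketch of this baseline, assuming the features have already been extracted from a frozen LLM. Here they are replaced by synthetic, well-separated vectors, so the sizes and the resulting accuracy are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-ins for pooled LLM features (assumption): two well-separated classes.
n_per_class, feat_dim = 100, 16
features = torch.cat([torch.randn(n_per_class, feat_dim) + 2.0,
                      torch.randn(n_per_class, feat_dim) - 2.0])
labels = torch.cat([torch.zeros(n_per_class), torch.ones(n_per_class)]).long()

# Shallow classifier trained on top of the fixed features.
clf = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(clf.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(50):
    opt.zero_grad()
    loss = loss_fn(clf(features), labels)
    loss.backward()
    opt.step()

with torch.no_grad():
    accuracy = (clf(features).argmax(dim=1) == labels).float().mean().item()
print(f"training accuracy: {accuracy:.2f}")
```

With real data, the features would be computed once under `torch.no_grad()` and cached, which keeps training the shallow model cheap. Accuracy should be reported on a held-out split rather than on the training set as done here.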

## Exercise 3.2: Training a Question Answering Model (harder)

Peruse the [multiple choice question answering datasets on Hugging Face](https://huggingface.co/datasets?task_categories=task_categories:multiple-choice&sort=downloads). Choose a *moderately* sized one and train a model to answer contextualized multiple-choice questions. You *might* be able to avoid fine-tuning by training a simple model to *rank* the multiple choices (see margin ranking loss in PyTorch).
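
A small sketch of the ranking idea using `torch.nn.MarginRankingLoss`. The linear `scorer` and the random feature vectors are assumptions standing in for pooled LLM representations of (question, choice) pairs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical scorer: maps a (question, choice) feature vector to a scalar score.
scorer = nn.Linear(16, 1)
loss_fn = nn.MarginRankingLoss(margin=1.0)

# Random stand-ins for pooled LLM features of correct and incorrect choices.
correct_feats = torch.randn(4, 16)
wrong_feats = torch.randn(4, 16)

score_correct = scorer(correct_feats).squeeze(-1)  # (4,)
score_wrong = scorer(wrong_feats).squeeze(-1)      # (4,)
# target = 1 means "the first input should be ranked higher than the second".
target = torch.ones(4)
loss = loss_fn(score_correct, score_wrong, target)
loss.backward()
print(loss.item())

# Worked value: loss = max(0, -target * (x1 - x2) + margin)
demo = loss_fn(torch.tensor([1.0]), torch.tensor([2.0]), torch.tensor([1.0]))
print(demo.item())  # 2.0
```

At inference time, the predicted answer is simply the choice with the highest score.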

## Exercise 3.3: Training a Retrieval Model (hardest)

The Hugging Face dataset repository contains a large number of ["text retrieval" problems](https://huggingface.co/datasets?task_categories=task_categories:text-retrieval&p=1&sort=downloads). These tasks generally require that the model measure *similarity* between text in some metric space -- naively, just a cosine similarity between [CLS] tokens can get you pretty far. Find an interesting retrieval problem and train a model (starting from a pre-trained LLM of course) to solve it.

**Tip**: Sometimes identifying the *retrieval* problems in these datasets can be half the challenge. [This dataset](https://huggingface.co/datasets/BeIR/scifact) might be a good starting point.
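
The naive cosine-similarity baseline mentioned above can be sketched as follows. The hand-written two-dimensional vectors are placeholders for [CLS] (or pooled) embeddings, and the helper name `retrieve` is hypothetical.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, doc_embs, k=2):
    """Returns (scores, indices) of the k documents most cosine-similar to the query."""
    q = F.normalize(query_emb, dim=-1)   # (C,)
    d = F.normalize(doc_embs, dim=-1)    # (N, C)
    sims = d @ q                         # (N,) cosine similarities
    return torch.topk(sims, k=k)

docs = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = torch.tensor([2.0, 0.1])
scores, indices = retrieve(query, docs, k=2)
print(indices.tolist())  # [0, 2]
```

Normalizing both sides first turns the dot product into a cosine similarity, so document embeddings can be normalized once and reused across queries.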