# Introduction
In this laboratory we will get our hands dirty working with Large Language Models (e.g. GPT and BERT) to do various useful things. If you haven't already, it is highly recommended to:

+ Read the [Attention Is All You Need](https://arxiv.org/abs/1706.03762) paper, which is the basis for all transformer-based LLMs.
+ Watch (and potentially *code along* with) this [Andrej Karpathy video](https://www.youtube.com/watch?v=kCc8FmEb1nY), which shows you how to build an autoregressive GPT model from the ground up.

# Exercise 1: Warming Up
In this first exercise you will train a *small* autoregressive GPT model for character generation (the one used by Karpathy in his video) to generate text in the style of Dante Alighieri. Use [this file](https://archive.org/stream/ladivinacommedia00997gut/1ddcd09.txt), which contains the entire text of Dante's Inferno (**note**: you will have to delete some introductory text at the top of the file before training). Train the model for a few epochs, monitor the loss, and generate some text at the end of training. Qualitatively evaluate the results.
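
The note about deleting the introductory text can also be handled in code rather than by hand. The following is only a sketch: the helper `strip_preamble` and the marker string `"INFERNO"` are assumptions made for illustration, so check the downloaded file to see which line actually marks the start of the poem.

```python
def strip_preamble(text, marker):
    """Drop everything before the first line that starts with `marker`.

    The marker line itself is kept.
    """
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.strip().startswith(marker):
            return "".join(lines[i:])
    raise ValueError(f"marker {marker!r} not found")


# Toy stand-in for the real file; the header lines here are made up.
sample = (
    "Transcriber's notes\n"
    "License text\n"
    "\n"
    "INFERNO\n"
    "\n"
    "Nel mezzo del cammin di nostra vita\n"
)
cleaned = strip_preamble(sample, marker="INFERNO")  # hypothetical marker
print(cleaned)
```

The cleaned string could then be written back to `commedia.txt`, the file name used by the `Dante` class below.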

In [9]:
import torch
import torch.nn as nn

In [10]:
# Your code here.
class Dante:
    """A class that aggregates functionality related to the "corpus" used."""
    def __init__(self, train = True, train_size=0.9, block_size=128):
        self._block_size = block_size
        self._train = train

        # Load the entire text file.
        with open('commedia.txt', 'r', encoding='utf-8') as fd:
            rawdata = fd.read()

        # Extract tokens BEFORE splitting. Our tokens are characters.
        self._tokens = sorted(set(rawdata))
        self.num_tokens = len(self._tokens)

        # Select train or val/test set.
        rawdata = rawdata[:int(len(rawdata)*train_size)] if train else rawdata[int(len(rawdata)*train_size):]

        # Build the encode/decode dictionaries mapping chars to token ids and back.
        self._c2i = {c: i for (i, c) in enumerate(self._tokens)}
        self._i2c = {i: c for (i, c) in enumerate(self._tokens)}

        # Define the encoder and decoder functions.
        self.encode = lambda s: [self._c2i[c] for c in s] # encoder: take a string, output a list of integers
        self.decode = lambda l: ''.join([self._i2c[i] for i in l]) # decoder: take a list of integers, output a string

        # Encode the data
        self._data = torch.tensor(self.encode(rawdata), dtype=torch.long)

    def get_batch(self, batch_size):
        """Retrieves a random batch of contexts and targets."""
        ix = torch.randint(len(self._data) - self._block_size, (batch_size,))
        print(self._data)
        x = torch.stack([self._data[i:i+self._block_size] for i in ix])
        y = torch.stack([self._data[i+1:i+self._block_size+1] for i in ix])
        # x, y = x.to(device), y.to(device)
        return x, y

    def __len__(self):
        return len(self._data) - self._block_size - 1
    
    def __getitem__(self, i):
        xs = self._data[i:i+self._block_size]
        ys = self._data[i+1:i+self._block_size+1]
        return (xs, ys)

In [11]:
ds = Dante(train=True, train_size=0.9, block_size=128)

In [12]:
# Encode a string
ds.encode('nel mezzo del cammin')

[75, 66, 73, 1, 74, 66, 87, 87, 76, 1, 65, 66, 73, 1, 64, 62, 74, 74, 70, 75]

In [13]:
# Decode an encoded string; this should return 'nel mezzo del cammin'
ds.decode(ds.encode('nel mezzo del cammin'))

'nel mezzo del cammin'

In [14]:
(xs, ys) = ds.get_batch(32)

tensor([51, 69, 66,  ..., 81, 79, 76])


In [15]:
xs.shape, ys.shape

(torch.Size([32, 128]), torch.Size([32, 128]))
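
The inputs and targets have the same shape because each target sequence is the input sequence shifted by one character, so every position supplies one (context, next character) training pair. A minimal pure-Python sketch of that relationship follows; the helper name `make_example` is hypothetical, and plain lists are used in place of tensors.

```python
def make_example(tokens, start, block_size):
    """Return (x, y) where y is x shifted one position to the right."""
    x = tokens[start:start + block_size]
    y = tokens[start + 1:start + block_size + 1]
    return x, y


text = "nel mezzo del cammin"
x, y = make_example(list(text), start=0, block_size=8)

# Position t of y is the character that follows the prefix x[:t+1].
pairs = [("".join(x[:t + 1]), y[t]) for t in range(len(x))]
for context, target in pairs:
    print(repr(context), "->", repr(target))
```

A single window of `block_size` characters therefore yields `block_size` prediction problems, one per prefix length.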

In [16]:
# First input
xs[0]

tensor([70, 62, 11,  1, 73, 66, 81, 81, 76, 79, 11,  1, 81, 70,  1, 68, 70, 82,
        79, 76, 11,  0,  1,  1, 80,  7, 66, 73, 73, 66,  1, 75, 76, 75,  1, 80,
        70, 66, 75,  1, 65, 70,  1, 73, 82, 75, 68, 62,  1, 68, 79, 62, 87, 70,
        62,  1, 83, 76, 81, 66, 11,  0,  0, 64, 69,  7, 70,  7,  1, 83, 70, 65,
        70,  1, 77, 66, 79,  1, 78, 82, 66, 73, 73,  7, 62, 66, 79, 66,  1, 68,
        79, 76, 80, 80, 76,  1, 66,  1, 80, 64, 82, 79, 76,  0,  1,  1, 83, 66,
        75, 70, 79,  1, 75, 76, 81, 62, 75, 65, 76,  1, 82, 75, 62,  1, 67, 70,
        68, 82])

In [17]:
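# ds.decode(xs[0]) does not work: iterating over a tensor yields 0-dim tensors, which the decode dictionary does not find as keys.
ds.decode(xs[0].numpy()) # Works: NumPy integers look up like plain Python ints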

"ia, lettor, ti giuro,\n  s'elle non sien di lunga grazia vote,\n\nch'i' vidi per quell'aere grosso e scuro\n  venir notando una figu"

In [18]:
ds.decode(ys[0].numpy()) # Working

"a, lettor, ti giuro,\n  s'elle non sien di lunga grazia vote,\n\nch'i' vidi per quell'aere grosso e scuro\n  venir notando una figur"

In [20]:
# All configuration parameters for our Transformer
block_size = 128
train_size = 0.9
batch_size = 32
n_embed = 128

In [21]:
# Instantiate datasets for training and test
ds_train = Dante(train=True, train_size=train_size, block_size=block_size)
ds_test = Dante(train=False, train_size=train_size, block_size=block_size)
(xs, ys) = ds_test.get_batch(batch_size)

tensor([11,  0,  1,  ..., 59,  0,  0])


In [60]:
# The top-level GPT nn.Module
class GTPLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embed):
        super().__init__()
        self._vocab_size = vocab_size
        self._n_embd = n_embed
        # token embedding table: maps each token id to an n_embed-dimensional vector
        self.token_embedding = nn.Embedding(vocab_size, n_embed)

    def forward(self, idx, targets=None):
        (B, T) = idx.shape
        tok_emb = self.token_embedding(idx) # (B, T, C)
        return tok_emb

In [61]:
model = GTPLanguageModel(vocab_size=ds_train.num_tokens, n_embed = n_embed)

In [62]:
model(xs)[0][0]

tensor([ 0.0182,  0.6876,  0.9524, -0.1810,  0.9837, -0.6089, -1.1648,  0.2752,
        -0.6810,  0.4400,  1.3942, -0.2972,  0.2556,  1.7411, -0.1625,  0.7471,
        -0.3994,  0.6829,  0.6663, -1.9936, -1.0045,  0.6590,  1.0105, -0.0707,
         1.5947,  0.0098,  0.7688, -0.8266, -0.4158,  1.1425, -0.6613,  0.3734,
        -0.3173, -0.1288,  1.8279, -0.1044,  1.3437,  1.6375,  1.3891,  0.1766,
        -1.1703,  0.6529,  0.9052,  0.4542,  0.8510,  0.0475, -1.1846,  0.7598,
         1.0428,  1.2485, -0.1313, -0.1652,  0.0153, -0.1453,  0.8056,  0.1221,
        -1.8702, -0.1466,  0.7614,  0.3381, -0.6846, -1.0877,  1.8149, -0.5938,
        -0.3843,  1.2736,  1.1190, -0.9846,  0.2179, -0.1396,  0.3629,  0.3197,
         0.8835, -0.4273, -0.9002,  0.1076, -1.4472,  0.2919, -1.0444, -0.3461,
         0.9479,  0.7831, -1.8522,  1.7290,  2.5879,  0.3881,  0.2460,  0.6543,
         0.6894,  1.1303,  0.3790, -0.6986, -1.5515,  0.2599, -0.7662, -2.3683,
         0.2489, -0.9762,  0.9289,  0.83
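
Since the exercise asks to monitor the loss, a reference value helps when reading the numbers. The sketch below assumes the model is trained with the usual cross-entropy loss over the character vocabulary, which the notebook does not show at this point. Under that assumption, a model that spreads probability uniformly over `V` characters has a loss of `ln(V)`, so an untrained model should start near that value and move below it. The vocabulary size of 80 is a made-up number; the notebook's value is `ds_train.num_tokens`.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy (in nats) of one prediction: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


vocab_size = 80  # hypothetical; use ds_train.num_tokens in the notebook

# Uniform prediction: every character gets the same logit.
uniform_logits = [0.0] * vocab_size
baseline = cross_entropy(uniform_logits, target=3)
print(f"uniform baseline: {baseline:.3f} nats, ln(V) = {math.log(vocab_size):.3f}")

# A prediction that favours the correct character gives a lower loss.
confident_logits = [0.0] * vocab_size
confident_logits[3] = 5.0
better = cross_entropy(confident_logits, target=3)
print(f"confident prediction: {better:.3f} nats")
```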

# Exercise 2: Working with Real LLMs

Our toy GPT can only take us so far. In this exercise we will see how to use the [Hugging Face](https://huggingface.co/) model and dataset ecosystem to access a *huge* variety of pre-trained transformer models.

## Exercise 2.1: Installation and text tokenization

First things first, we need to install the [Hugging Face Transformers library](https://huggingface.co/docs/transformers/index):

    conda install -c huggingface -c conda-forge transformers
    
The key classes that you will work with are `GPT2Tokenizer`, which encodes text into sub-word tokens, and `GPT2LMHeadModel`. **Note** the `LMHead` part of the class name: this is the version of the GPT2 architecture that has the text prediction head attached to the final hidden layer representations (i.e. what we need to **generate** text).

Instantiate the `GPT2Tokenizer` and experiment with encoding text into integer tokens. Compare the length of input with the encoded sequence length.

**Tip**: Pass the `return_tensors='pt'` argument to the tokenizer to get PyTorch tensors as output (instead of lists).

In [28]:
# Your code here.
from transformers import GPT2LMHeadModel, GPT2Config, GPT2Tokenizer

# Load the pre-trained GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
print(tokenizer("Nel mezzo del cammin di nostra vita", return_tensors='pt')["input_ids"])
print(tokenizer("Ciao mi chiamo Dante", return_tensors='pt')["input_ids"])
print(tokenizer("Paolo Brosio", return_tensors='pt')["input_ids"])
print(tokenizer("Dante alighieri", return_tensors='pt')["input_ids"])



tensor([[   45,   417,   502, 47802,  1619, 12172,  1084,  2566, 18216,   430,
           410,  5350]])
tensor([[   34, 13481, 21504,   442,  1789,    78, 34898]])
tensor([[28875, 14057, 14266,   952]])
tensor([[   35, 12427,   435,   394, 29864]])
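
In the output above, the first prompt has 35 characters but is encoded into only 12 token ids, because sub-word tokens cover several characters each. To give an intuition for where such tokens come from, here is a toy byte-pair-encoding sketch in plain Python. It is not GPT-2's actual tokenizer, which works on bytes with a pre-learned merge table. The function `learn_merges` and the repeated training text are assumptions made for illustration.

```python
from collections import Counter


def learn_merges(text, num_merges):
    """Toy byte-pair encoding: repeatedly fuse the most frequent adjacent pair."""
    tokens = list(text)  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats any more
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # replace the pair with one fused token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges


text = "nel mezzo del cammin di nostra vita " * 3
tokens, merges = learn_merges(text, num_merges=20)
print(len(text), "characters ->", len(tokens), "tokens")
print(tokens)
```

Because GPT-2's vocabulary was learned mostly from English text, Italian words tend to be split into more pieces than comparable English words would be.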


## Exercise 2.2: Generating Text

Given a *prompt* as input, there are many ways to sample text from a GPT2 model. Instantiate a pre-trained `GPT2LMHeadModel` and use the [`generate()`](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to generate text from a prompt.

**Note**: The default inference mode for GPT2 is *greedy*, which might not result in satisfying generated text. Look at the `do_sample` and `temperature` parameters.
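
To make the difference between greedy decoding and sampling concrete, here is a small pure-Python sketch of how one next token could be chosen from a vector of logits. It illustrates the idea behind `do_sample` and `temperature` only; it is not the library's implementation, and it ignores other generation options such as `top_k` and `top_p`. The function names and the example logits are made up.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next(logits, do_sample=False, temperature=1.0, rng=random):
    """Greedy: take the arg-max. Sampling: draw from the softmax distribution."""
    if not do_sample:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary

greedy = pick_next(logits)  # always the same token
rng = random.Random(0)
samples = [pick_next(logits, do_sample=True, temperature=0.9, rng=rng)
           for _ in range(1000)]

sharp = softmax(logits, temperature=0.5)  # low temperature: more peaked
flat = softmax(logits, temperature=2.0)   # high temperature: closer to uniform
print("greedy choice:", greedy)
print("distinct sampled tokens:", sorted(set(samples)))
print("p(top token) at T=0.5:", round(sharp[0], 3), "at T=2.0:", round(flat[0], 3))
```

Greedy decoding always picks the same highest-scoring token at each step, which is one reason it can fall into repetitive loops. Sampling introduces variety, and the temperature controls how much.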

In [36]:
# Your code here.
from transformers import GPT2LMHeadModel, GPT2Config, GPT2Tokenizer

# Load the pre-trained GPT-2 model with its language-modeling head
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Generate text from a prompt
prompt = "Nel mezzo del cammin di nostra vita"
generated = model.generate(tokenizer(prompt, return_tensors='pt')["input_ids"], max_length=100)
# print(generated)
print(tokenizer.decode(generated[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Nel mezzo del cammin di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra vita, di nostra


In [38]:
# Sample from the model (do_sample=True, temperature=0.9) instead of decoding greedily
generated = model.generate(tokenizer(prompt, return_tensors='pt')["input_ids"], max_length=100, do_sample=True, temperature=0.9)
# print(generated)
print(tokenizer.decode(generated[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Nel mezzo del cammin di nostra vita faggio alla e di siguella, una di giorale della santo e una perche di che di sanna.

D'autre pela susere quelequando sugliando il sable l'apicher che l'ampli sopentor pitta di spagna, neque l'argento e sebata e che qu
