# Tutorial walkthrough
> Tutorial walkthrough with annotations.

::: {.callout-warning}
Under construction!
:::

<a href="https://githubtocolab.com/BorjaRequena/GPTutorial/blob/master/nbs/metatutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>

# Unsupervised learning

Unsupervised learning consists of capturing rich patterns in the data without relying on labels. This is in contrast to the typical supervised learning scheme, in which we have a data set composed of labeled samples $\left\{\mathbf{x}, y\right\}$ and we try to approximate the function $f(\mathbf{x})\approx y(\mathbf{x})$.

::: {.callout-note}
In unsupervised learning, even though we follow label-free approaches, what we would consider labels can sometimes be part of the data corpus.
:::

We can split deep unsupervised learning into two main categories: generative and self-supervised learning, although the line is often blurred.
In generative learning, we try to recreate the data distribution. This allows us to generate new data points that are likely to belong to the original data set, and often even to know the probability of observing them.
In self-supervised learning, we instead focus on finding different representations of the data. These are often useful to accomplish other tasks, compress the information, etc.

Indeed, in some cases, the resulting models can accomplish downstream tasks without having been trained to perform them explicitly. For example, the generative model [GPT-3](https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html) is a language model that can perform question answering tasks (among others) without any task-specific training. Similarly, the self-supervised vision model [DINO](https://openaccess.thecvf.com/content/ICCV2021/html/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper.html) can extract segmentation masks from images (see @fig-dino or [this video](https://youtu.be/8I1RelnsgMw)).

::: {#fig-dino}
![](figures/dino.png)

Self-supervised segmentation masks from DINO.
:::

Unsupervised methods have gathered a lot of attention in scientific applications, as they can help us extract physically relevant information from experimental data, as in [this great work](https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.124.010508). Actually, in science, sometimes we do not even know what to look for in the data! For example, suppose that we want to characterize a complex quantum system. To do so, we need to consider all the possible phases the system can be in and devise appropriate order parameters to test whether they exist and find the phase transitions. With self-supervised methods, we can find different data representation schemes for specific regions of the phase diagram. This way, we can explore the phase diagram autonomously to find where the phase transitions may be in our system, as shown in [this work](https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.125.170603).

Most of the recent advances in the machine learning (ML) field have been due to massive scaling, both in terms of the model size and the amount of data. This has relied heavily on the vast amount of unlabeled data that exists on the internet. Think about it: for every cat image in every appropriately labeled data set we can find, how many unlabeled cat images and videos are on the internet? **The current state-of-the-art practice in many ML applications consists of training an unsupervised model with huge amounts of unlabeled data and, then, leveraging its knowledge to accomplish the desired task**.

This process is akin to the way humans learn. Our brain processes a continuous stream of unlabeled data containing rich information about our environment. Furthermore, we never process the exact same information twice, as there are no two instances of our life that are exactly the same. This allows us to generalize extremely well and make the most out of the relatively scarce labeled data we have access to. For example, given a single [stegosaurus](https://en.wikipedia.org/wiki/Stegosaurus) image, we can immediately recognize this dinosaur species anywhere else, with any camera angle, any art style, and even with partial information (e.g., just a part of the dinosaur).

Thus, unsupervised learning is essential for the entire ML field and it is especially promising in scientific applications.

## Generative modeling

In this tutorial, we focus on generative learning. As we have briefly mentioned before, it consists of learning the data distribution to generate new samples. This is extremely powerful both on its own, since high-quality new samples can be very valuable, and in combination with other tools to tackle downstream tasks.

There are many data generation approaches that we can consider. The most straightforward one is to simply generate samples that are similar to the training ones, such as face images or digits (e.g., [this cat](https://thiscatdoesnotexist.com/)). We can also have conditioned synthesis, such as generating an audio signal from a text prompt that can be conditioned on a specific speaker's voice (e.g., [WaveNet](https://www.deepmind.com/blog/wavenet-a-generative-model-for-raw-audio)).
This involves all sorts of translation tasks, where we write text from a sample fragment, generate a new image from a reference one (see the emblematic [horse-to-zebra](https://openaccess.thecvf.com/content_iccv_2017/html/Zhu_Unpaired_Image-To-Image_Translation_ICCV_2017_paper.html) example), or even [create a video from a text fragment](https://makeavideo.studio/)!

::: {.callout-note}
This is a very broad field and here we just show a handful of representative examples.
:::

# Language models

Language models are generative models that write text as we (humans) would. Current advances in language modeling, such as [ChatGPT](https://openai.com/blog/chatgpt/), are definitely on par with humans (honestly, we set the bar quite low). See this cool example in @fig-chatgpt:

::: {#fig-chatgpt}
![](figures/chatgpt_impressive.png)

Sample conversation with ChatGPT. 
:::

The most common approach to generating sequential data is to recursively predict the following item. In the case of language modeling, we could take a piece of text as a starting point and use our model to predict the next word. Appending the predicted word to the existing text, and feeding it to the model again, we could predict the following word. Following this procedure, we could write any arbitrary amount of text!
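The recursive procedure can be sketched in a few lines of plain Python. This is only an illustration: `generate` and `toy_predictor` are hypothetical names, and the "model" here is a stand-in rule that cycles through the digits rather than anything trained.

```python
def generate(predict_next, context, n_new):
    """Autoregressively extend `context` by `n_new` items."""
    sequence = list(context)
    for _ in range(n_new):
        # The model sees everything produced so far and proposes one more item.
        next_item = predict_next(sequence)
        # The prediction becomes part of the context for the next step.
        sequence.append(next_item)
    return sequence


def toy_predictor(sequence):
    """Hypothetical stand-in for a trained model: cycles through the digits 0-9."""
    return (sequence[-1] + 1) % 10


generated = generate(toy_predictor, context=[7], n_new=5)
print(generated)  # [7, 8, 9, 0, 1, 2]
```

Any function that maps the sequence so far to a next item can be plugged in as `predict_next`, which is the role the language model plays in the rest of the tutorial.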

::: {.callout-note}
While the procedure may seem simple, language is very complex and writing coherent text is quite a challenge. Even though we may write text sequentially, the relationships between elements within the text can be intricate, with forward and backward dependencies that can be an entire book apart.
:::

There are several considerations we need to take into account. For example, how to "process text" with a machine learning model that operates purely with mathematical operations.

## Example task

The goal of this tutorial is to train a GPT-like model to count numbers: "1,2,3,4,...,8765,8766,8767,...". This seems like a rather simple task that could be easily achieved numerically with a single line of code. However, we will treat the digits as characters that form sentences.

This toy example will allow us to understand the main concepts behind language models. We will use it as a running example and implement the main ideas as we see them.

Here, we will build our data set, which is nothing more than a text document containing the numbers.

In [None]:
max_num = 1_000_000
text = ",".join([str(i) for i in range(max_num)])

Let's see the first and last few digits of our data set.

In [None]:
#| code-fold: true
print(text[:20])
print(text[-20:])

0,1,2,3,4,5,6,7,8,9,
999997,999998,999999


## Giving numerical meaning to text

We can communicate very deep concepts with words, but how does a machine *understand* them?

When we work with text, we split it into elementary pieces called **tokens**. This is known as tokenization and there is quite a lot of freedom in how to do it. For example, tokens can range from full sentences, to words, to single characters. The most common practice is to use sub-word tokens that lie between single characters and full words, such as those produced by [SentencePiece](https://github.com/google/sentencepiece). We can also have special tokens to account for additional grammatical information. For example, we can use special tokens to indicate the beginning and ending of a sequence, or to indicate that a word starts with a capital letter.

Let's see a simple tokenization example. We would take the following sentence:
```
My cat won't stop purring.
```
And transform it into the tokens:
```
<BoS><maj><my> <cat> <wo><n't> <stop> <purr><ing><.><EoS>
```

::: {.callout-note}
I just made up this tokenization. It is only meant to provide an idea.
:::

With this, we define a *vocabulary* of tokens. To provide them with "meaning", we assign a trainable parameter vector to each of them, known as an **embedding vector**. The larger the embedding, the richer the information we can associate with every individual token. We typically store these vectors in a so-called **embedding matrix**, where every row provides the embedding vector associated with a token. This way, we identify the tokens by an integer index that corresponds to their row in the embedding matrix.
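To make the row-lookup idea concrete, here is a minimal sketch in plain Python. The vocabulary, the embedding size and the helper `embed` are all made up for illustration, and the random values stand in for parameters that a real model would learn.

```python
import random

random.seed(0)

toy_vocab = ["<cat>", "<purr>", "<ing>"]  # hypothetical three-token vocabulary
emb_dim = 4                               # length of each embedding vector

# One row per token; in a real model these values are trainable parameters.
embedding_matrix = [[random.gauss(0.0, 1.0) for _ in range(emb_dim)]
                    for _ in toy_vocab]

# Each token is identified by the index of its row.
token_to_index = {token: i for i, token in enumerate(toy_vocab)}


def embed(tokens):
    "Replace every token by the row of the embedding matrix it indexes."
    return [embedding_matrix[token_to_index[t]] for t in tokens]


vectors = embed(["<purr>", "<ing>"])
print(len(vectors), len(vectors[0]))  # 2 4
```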

Taking long tokens results in large vocabularies and, therefore, we need more memory. However, we can generate a piece of text with just a few inference steps. Conversely, short tokens require much less memory at the cost of more inference steps to write the same text. Thus, this presents a trade-off between memory and computational time. You can get some intuition about it by comparing the number of letters in the alphabet (shortest possible tokens) with the number of entries in a dictionary (every word is a token).
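The trade-off can be illustrated on a small version of the counting text. The sketch below compares two tokenizations chosen purely for illustration: one token per character, and one token per whole number (plus the separator). The names `small_text`, `char_tokens` and `number_tokens` are not used elsewhere in the tutorial.

```python
import re

small_text = ",".join(str(i) for i in range(1000))

# Short tokens: every character (digit or separator) is a token.
char_tokens = list(small_text)
# Long tokens: every whole number is a token, plus the separator.
number_tokens = re.findall(r"\d+|,", small_text)

for name, tokens in [("characters", char_tokens), ("numbers", number_tokens)]:
    print(f"{name}: vocabulary size = {len(set(tokens))}, "
          f"sequence length = {len(tokens)}")
# characters: vocabulary size = 11, sequence length = 3889
# numbers: vocabulary size = 1001, sequence length = 1999
```

The number-level vocabulary is roughly a hundred times larger, while the sequence the model has to produce is about half as long.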

To process a piece of text, we first split it into the tokens of our vocabulary (tokenization), and replace the tokens by their corresponding indices (numericalization). 

Let's see how this works in our example task. First of all, we build the token vocabulary. In this simple case, every digit is a token together with the separator ",". 

In [None]:
vocab = sorted(list(set(text)))
vocab_size = len(vocab)
vocab

[',', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

Now we can build a `Tokenizer` class to encode raw text into tokens, and decode tokens to actual text.

In [None]:
class Tokenizer:
    def __init__(self, vocab):
        self.s2i = {char: i for i, char in enumerate(vocab)}
        self.i2s = {i: char for char, i in self.s2i.items()}
    
    def encode(self, string):
        "Tokenize an input string"
        return [self.s2i[char] for char in string]
    
    def decode(self, ints):
        "Transform a list of integers to a string of characters"
        return ''.join([self.i2s[i] for i in ints])

In [None]:
tkn = Tokenizer(vocab)

Let's see the map from tokens to integers.

In [None]:
tkn.s2i

{',': 0,
 '0': 1,
 '1': 2,
 '2': 3,
 '3': 4,
 '4': 5,
 '5': 6,
 '6': 7,
 '7': 8,
 '8': 9,
 '9': 10}

We can try our tokenizer with a text example.

In [None]:
pre_tkn = text[:10]
pre_tkn, tkn.encode(pre_tkn)

('0,1,2,3,4,', [1, 0, 2, 0, 3, 0, 4, 0, 5, 0])

We can also test the decoding function by encoding and decoding.

In [None]:
tkn.decode(tkn.encode(pre_tkn))

'0,1,2,3,4,'

::: {.callout-note}
Here we only perform the text pre-processing. The embedding belongs to the machine learning model.
:::

## Learning the data probability distribution

To learn how to generate text, we need to learn the underlying distribution of the data we wish to replicate $p_{\text{data}}(\mathbf{x})$. We model text as a sequence of tokens $\mathbf{x}=\left[x_1, x_2, \dots, x_{T-1}\right]$, and the goal is to predict the next token $x_T$. This way, we can recursively generate text: 

1. We start with some initial context $x_1, x_2, \dots, x_{T-1}$.
2. We predict the next token $x_T$, given the context.
3. We append the prediction to the existing text and repeat the process taking $x_1,\dots,x_T$ as context.

We typically do this by defining a parametrized model to approximate the probability distribution, $p_\theta(\mathbf{x})\approx p_{\text{data}}(\mathbf{x})$. The parameters $\theta$ can represent anything from the weights of a neural network to the coefficients of a Gaussian mixture model.

A standard technique in the machine learning field is to use the chain rule of probability to model sequential data. This way, the probability to observe a sequence of tokens can be described as 
$$p_{\theta}(\mathbf{x})=p_\theta(x_1)\prod_{t=2}^{T}p_\theta(x_t|x_1\dots x_{t-1})\,.$$
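As a small sketch of how this factorization is evaluated in practice, the function below accumulates the log-probability of a sequence one conditional at a time. Both `sequence_log_prob` and `uniform_model` are hypothetical helpers; the latter ignores the history entirely and only serves to make the example checkable by hand.

```python
import math


def sequence_log_prob(tokens, cond_prob):
    """Sum of log p(x_t | x_1 ... x_{t-1}) over the sequence (chain rule)."""
    total = 0.0
    for t, token in enumerate(tokens):
        history = tokens[:t]  # empty for the first token, which gives p(x_1)
        total += math.log(cond_prob(token, history))
    return total


def uniform_model(token, history, vocab_size=11):
    "Hypothetical model assigning the same probability to every token."
    return 1.0 / vocab_size


log_p = sequence_log_prob(list("1,2,3"), uniform_model)
# Five tokens, each with probability 1/11, so both printed values agree.
print(log_p, -5 * math.log(11))
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow for long sequences.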

We optimize our model parameters to obtain the **maximum likelihood estimator**, which, under standard regularity conditions, is asymptotically the most statistically efficient estimator. In this tutorial, we do not want to dive too deep into the details. The main intuition behind it is that we try to maximize the likelihood of observing the training data under our parametrized model. As such, we wish to minimize the negative log-likelihood loss or cross-entropy loss:
$$\theta^* = \text{arg}\,\text{min}_\theta - \frac{1}{N}\sum_{i=1}^N \log p_\theta\left(\mathbf{x}^{(i)}\right) = \text{arg}\,\text{min}_\theta - \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \log p_\theta\left(x_t^{(i)}|x_{<t}^{(i)}\right)$$
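As a quick worked example of this loss, consider a hypothetical model that spreads its probability uniformly over a vocabulary of $V$ tokens, regardless of the context. Every term in the sum is then the same:
$$-\log p_\theta\left(x_t|x_{<t}\right) = -\log\frac{1}{V} = \log V\,.$$
For the 11-token vocabulary of our example task, this gives $\log 11\approx 2.40$ per token. This is a handy reference value: a model that has learned anything about the data should score below it.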

We can understand the task as a classification problem at every time-step where the goal is to predict the token that follows. Thus, we can build our self-supervised classification task by simply taking the text shifted by one position as target for our prediction. For example, consider the tokenized sentence
```
<this> <language> <model> <rocks><!>
```
Given the tokens
```
<this> <language>
```
we wish to predict 
```
<model>
```
among all the tokens in the vocabulary.
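The shift-by-one construction can be written out explicitly for this sentence. This is a plain-Python sketch in which tokens are kept as strings for readability; a real pipeline would work with their integer indices.

```python
tokens = ["<this>", "<language>", "<model>", "<rocks>", "<!>"]

inputs = tokens[:-1]   # everything except the last token
targets = tokens[1:]   # the same text shifted by one position

# At every position, the context is the input up to that point
# and the label is the token that follows it.
for t in range(len(inputs)):
    context = inputs[: t + 1]
    print(f"{' '.join(context)} -> {targets[t]}")
```

A single sequence of five tokens therefore yields four training examples, one per position.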

As we typically do in machine learning, we find the optimal parameters $\theta^*$, i.e., train our model, with gradient-based optimization.

# Baseline: bigram language model

Let's create a very simple baseline language model. This will allow us to see how the embedding matrix works and the details of the training loop in PyTorch.

In [None]:
#| hide
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(7)
device = "cuda" if torch.cuda.is_available() else "cpu"

## Data processing

First of all, we need to properly arrange our data. We will start by tokenizing the whole text piece.

In [None]:
data = torch.tensor(tkn.encode(text))
data[:20]

tensor([ 1,  0,  2,  0,  3,  0,  4,  0,  5,  0,  6,  0,  7,  0,  8,  0,  9,  0,
        10,  0])

Now we need to save a part of the data for validation and keep the rest for training. In generative models, we do not tend to use too much data for validation because it is just to get a rough idea of how it is working. In the end, we will evaluate the performance ourselves asking the model to generate samples.

To keep this simple, we will save the last numbers as validation data.

::: {.callout-note}
Given the nature of our data, it would be best to save chunks of the data sampled at different points along the whole text piece.
:::

In [None]:
val_pct = 0.1
split_idx = int(len(data)*val_pct)
data_train = data[:-split_idx]
data_val = data[-split_idx:]

In [None]:
data_train.shape, data_val.shape

(torch.Size([6200001]), torch.Size([688888]))

To train machine learning models, we take advantage of parallelization to process several samples at once. To do so, we will split the text into sub-sequences from which we will build our training batches.

In [None]:
def get_batch(data, batch_size, seq_len):
    idx = torch.randint(len(data)-seq_len, (batch_size,))
    x = torch.stack([data[i:i+seq_len] for i in idx])
    y = torch.stack([data[i:i+seq_len] for i in idx+1])
    return x.to(device), y.to(device)

In [None]:
batch_size = 64
seq_len = 8
xb, yb = get_batch(data_train, batch_size, seq_len)

In [None]:
xb.shape

torch.Size([64, 8])

In [None]:
xb[0], yb[0]

(tensor([ 3,  0,  8,  8, 10,  1,  2,  4], device='cuda:0'),
 tensor([ 0,  8,  8, 10,  1,  2,  4,  0], device='cuda:0'))

## Model definition

We will make a bigram model that predicts the following character based on the previous one. These models are stochastic and, therefore, the output of the model is a probability distribution over our vocabulary. We can easily achieve this by making the embedding size as large as the vocabulary. This way, when we index into the embedding matrix with a token, we immediately obtain one score per vocabulary entry, which defines the probability distribution over the possible next tokens.

In [None]:
class BigramLanguageModel(nn.Module):
    "Language model that predicts text based on the previous character."
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, vocab_size)
        
    def forward(self, x):
        logits = self.embedding(x)
        return logits
        
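    def generate(self, x, n_new):
        "Sample `n_new` tokens, one at a time, appending each to `x`."
        for _ in range(n_new):
            logits = self(x)
            logits = logits[:, -1, :]  # keep only the prediction for the last position
            probs = F.softmax(logits, dim=-1)
            new_tkn = torch.multinomial(probs, 1)  # sample one token per sequence
            x = torch.cat((x, new_tkn), dim=1)
        return x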

In [None]:
bigram_model = BigramLanguageModel(vocab_size).to(device)

In [None]:
xb.shape, bigram_model(xb).shape

(torch.Size([64, 8]), torch.Size([64, 8, 11]))

::: {.callout-note}
The `logits` we define here are the unnormalized probability scores for each token. To transform them into a normalized probability distribution, we use a [SoftMax](https://es.wikipedia.org/wiki/Funci%C3%B3n_SoftMax) function. We will see below that PyTorch takes the logits directly to compute the loss function instead of the probabilities.
:::
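For intuition, the relation between logits, probabilities and the loss can be spelled out without any framework. The sketch below is a plain-Python illustration, not the way PyTorch implements it internally; `softmax`, `cross_entropy` and `toy_logits` are names chosen here.

```python
import math


def softmax(logits):
    "Turn unnormalized scores into probabilities that sum to one."
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    "Negative log-probability that the softmax assigns to the target index."
    return -math.log(softmax(logits)[target])


toy_logits = [2.0, 0.0, -1.0]  # hypothetical scores for a 3-token vocabulary
probs = softmax(toy_logits)
print(probs, sum(probs))
print(cross_entropy(toy_logits, target=0))
```

Larger logits map to larger probabilities, and the loss is small when the target token receives most of the probability mass.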

Let's try generating some text with our model.

In [None]:
context = torch.zeros((1, 1), dtype=torch.long).to(device)
tkn.decode(bigram_model.generate(context, 20)[0].tolist())

',3,467777067,95,,36,4'

## Training loop

With the data and the model, we're almost ready to do the training. We need to define a loss function and an optimization algorithm to update our model parameters.

As we have mentioned before, we wish to minimize the negative log-likelihood of the data with respect to the model. To do so, we use PyTorch's [cross entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss).

In [None]:
def cross_entropy_loss(logits, targets):
    "Cross entropy loss flattening tensors"
    BS, T, C = logits.shape
    loss = F.cross_entropy(logits.view(BS*T, C), targets.view(-1))
    return loss    

Then, as optimizer, we will use AdamW, a variant of [Adam](https://arxiv.org/abs/1412.6980) with decoupled weight decay.

In [None]:
optimizer = torch.optim.AdamW(bigram_model.parameters(), lr=1e-3)

Now let's define the training loop.

In [None]:
batch_size = 32
seq_len = 24
train_steps = 10000

for _ in range(train_steps):
    xb, yb = get_batch(data_train, batch_size, seq_len)
    logits = bigram_model(xb)
    loss = cross_entropy_loss(logits, yb)
    optimizer.zero_grad()  # reset gradients: PyTorch accumulates them across backward calls
    loss.backward()
    optimizer.step()
    
print(loss.item())

2.4061319828033447


Bigram models can't accomplish this example task. After every digit, all the other digits are equally likely to happen if we do not consider any further context. This model can only take advantage of the separator `,`. For instance, we know there will not be two consecutive separators and that the following number won't start with `0`.

We can see this in the first row of the embedding matrix.

In [None]:
#| code-fold: true
embedding_matrix = list(bigram_model.parameters())[0] 
embedding_matrix.softmax(-1)[0]

tensor([7.7754e-08, 3.4567e-08, 1.1784e-01, 1.0176e-01, 1.2117e-01, 1.7006e-01,
        1.1481e-01, 9.1826e-02, 1.0334e-01, 1.6928e-01, 9.8949e-03],
       device='cuda:0', grad_fn=<SelectBackward0>)
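As a point of comparison for the learned row, we can count how often each character actually follows a separator. The sketch below does this on a small version of the text (numbers up to 999, an assumption made to keep it fast), so the frequencies are illustrative rather than those of the training split.

```python
from collections import Counter

small_text = ",".join(str(i) for i in range(1000))

# Count how often each character follows each other character.
pair_counts = Counter(zip(small_text[:-1], small_text[1:]))

# Keep only the pairs whose first character is the separator.
after_separator = {nxt: n for (prev, nxt), n in pair_counts.items() if prev == ","}
total = sum(after_separator.values())

for token in sorted(after_separator):
    print(token, after_separator[token] / total)
```

Neither `,` nor `0` ever appears after a separator, and in this small text the nine remaining digits are equally frequent.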

Let's generate some text.

In [None]:
context = torch.zeros((1, 1), dtype=torch.long).to(device)
tkn.decode(bigram_model.generate(context, 20)[0].tolist())

',3596825081,423,53599'

In contrast to the previous example, we see the model has learned to not add consecutive separators, but the digits are still random. GPT time!

# GPT

Let's prepare a more advanced machine learning model that overcomes the limitations of our baseline. With the bigram model, the prediction of the next token only depends on the last token of the text. Thus, the model works with very limited information about the context, and it would be much more beneficial to account for more of the preceding tokens.

The extreme opposite case would be to account for all the previous existing text. This can be both overkill and unfeasible in terms of memory. For example, writing a book, we may not need to account for the whole thing to write the last sentence. Therefore, in modern architectures, we fix a maximum sequence length that we keep in memory to provide context for our prediction.
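In code, fixing a maximum context length amounts to keeping only the most recent tokens before every prediction. A minimal sketch, where `crop_context` and `max_len` are hypothetical names and `max_len` is assumed to be at least 1:

```python
def crop_context(tokens, max_len):
    "Keep only the most recent `max_len` tokens as context (max_len >= 1)."
    return tokens[-max_len:]


history = list("995,996,997,998,")
print(crop_context(history, max_len=8))
# ['9', '9', '7', ',', '9', '9', '8', ',']
```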

## Transformer

The architecture behind the [GPT language models](https://openai.com/blog/better-language-models/) is based on the transformer, depicted in @fig-transformer.

::: {#fig-transformer}
![](figures/transformer.png){width=60%}

Transformer schematic representation. 
:::

The transformer was introduced as an architecture for translation tasks with two main parts: the encoder (left) and the decoder (right). The decoder is the part responsible for generating the translated text and, thus, it is the language model bit of the whole architecture.

The transformer architecture relies heavily on self-attention mechanisms. Indeed, the original paper is called "[Attention is all you need](https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html)". It has the advantage that it allows 

In [None]:
class AttentionHead(nn.Module):
    def __init__(self, ):
        pass
    
    def forward(self, ):
        pass
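While the class above is still a stub, the core operation can be sketched independently of it. The following is a framework-free illustration of causal scaled dot-product attention as described in the paper cited above, under strong simplifying assumptions: there are no learned query, key or value projections, no batching, and the function name `causal_self_attention` is made up here. It is not the tutorial's `AttentionHead`.

```python
import math


def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position t only sees positions <= t.

    Each argument is a list of T vectors (lists of floats).
    """
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        # Scores against the current and all previous positions (causal mask).
        scores = [sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d_k)
                  for s in range(t + 1)]
        # Softmax over the visible positions.
        shift = max(scores)
        weights = [math.exp(score - shift) for score in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        # Weighted average of the visible value vectors.
        outputs.append([sum(w * values[s][j] for s, w in enumerate(weights))
                        for j in range(dim)])
    return outputs


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical sequence of 3 token vectors
out = causal_self_attention(x, x, x)       # the same vectors act as queries, keys and values
print(out[0])  # [1.0, 0.0]: the first position can only attend to itself
```

Because of the causal mask, changing a later token never alters the outputs at earlier positions, which is what allows the model to be trained on next-token prediction.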