# Basics of Transformers

## Introduction

The goal of this tutorial is to show you the basics of training and generation
with a single GPU. The Transformers library has a lot of convenient methods
for training and generation. We are going to avoid them and instead work directly
with the neural network. This is the only way to understand the material.

We first load a couple of modules.

In [None]:
# We need to use PyTorch directly, unless we use the highest-level APIs.
import torch
import torch.nn as nn
from torch.nn import functional as F
# This is a standard optimizer, which we will use during training.
from torch.optim import AdamW
# Transformers has implementations for essentially every well-known LLM. We
# are going to work with what are called causal, auto-regressive, or
# decoder-only models. These include StarCoder, Llama, the GPT models, etc.
from transformers import AutoModelForCausalLM, AutoTokenizer
# Datasets gives convenient access to open source datasets.
import datasets
# Tqdm makes it easy to get a progress bar during training.
from tqdm.auto import tqdm
# Matplotlib will help us plot metrics after training.
from matplotlib import pyplot as plt

The following cell loads the model and tokenizer. You may need to modify
the `MODEL` variable below to load the model from a different path.
**If you mess up the model during training, consider re-running the cell below to
reload the model.**

In [None]:
MODEL = "/scratch/bchk/aguha/models/Qwen3-1.7B-Base"

# The model and tokenizer get loaded separately. But, they are typically at the
# same location, and it never makes sense to mix-and-match tokenizers and
# models.
tokenizer = AutoTokenizer.from_pretrained(MODEL, padding_side="left")
# I don't know why this isn't set by default, but you always want this.
tokenizer.pad_token = tokenizer.eos_token 

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,
).to("cuda:3")

## Tokenizer Basics

The tokenizer turns an input string into a sequence of tokens, which are the
`input_ids` that appear below. We can ignore the `attention_mask` for now.
We write `return_tensors="pt"` to get a PyTorch tensor. Without this flag, we
get the output as a Python list, which the model cannot use.

In [None]:
example_inputs = tokenizer(["def hello():\n\treturn"], return_tensors="pt")
example_inputs

It's worth understanding the shape of the tensor:

In [None]:
example_inputs.input_ids.shape

It's a 2D tensor, where the first dimension is the batch and the second
dimension is the sequence length. We have a single item in the batch, so the
length of the first dimension is 1. The second dimension is 4, so the input was
split into 4 tokens.

The code below loops through the tokens and *decodes* them back into strings.

In [None]:
# We have a single item in the batch, so we just take the first row.
example_input = example_inputs.input_ids[0]
# For every token in the sequence
for token in example_input:
    # Without token.item() we print tensor(n) instead of n.
    # We use __repr__ so that special characters like newlines are printed.
    print(token.item(), "->", tokenizer.decode(token).__repr__())

**Do now:** You should try to tokenize a different input string, and run the
decoding loop to see how it gets tokenized. I suggest using your name in the
function name, e.g., `hello_arjunguha`.

## Model Basics

In this section we will directly use the model to get a sense of what it
does. In ordinary usage, we train it in a loop, or we generate output in a
loop. There will be no loops in this section.

We'll use the same example input from the previous section. The model
requires the input tensors to be on the same device (the GPU) as the model. The
`.to(model.device)` at the end of the next line takes care of that.

In [None]:
example_inputs = tokenizer(['def hello():\n\treturn'], return_tensors="pt").to(model.device)
print(example_inputs)

The code below runs a *forward pass* with two simplifications:
1. We disable dropout (`model.eval`)
2. We don't compute gradients (`torch.no_grad`)
We'll enable both when we get to training.

Notice that the output from the tokenizer is conveniently structured so that we
can pass it as keyword arguments to `model.forward` by writing
`model.forward(**example_inputs)`. In our case, this is equivalent to:

```python
model.forward(
    input_ids=example_inputs.input_ids,
    attention_mask=example_inputs.attention_mask
)
```

In [None]:
model.eval()
with torch.no_grad():
    example_outputs = model.forward(**example_inputs)
print(example_outputs)

The result above has a lot of optional fields, but the only one that is set is `logits`.
Let's compare its shape to the shape of the input:

In [None]:
print(example_inputs.input_ids.shape)
print(example_outputs.logits.shape)

We have one output per input token, and each output is a tensor with ~150,000
elements. Let's look at one of them.

In [None]:
example_outputs.logits[0, 1]

Each index in this tensor corresponds to a token type ID, and the
output represents the distribution over all possible tokens. But, look at the
numbers. There are plenty of negative numbers, so this is *not* a probability
distribution.

These are raw, unnormalized predictions, or scores, or *logits*. We can turn
them into a distribution using the *softmax* function. We can then sum the
resulting probabilities to verify that they sum to 1:

In [None]:
example_dist_single = F.softmax(example_outputs.logits[0, 0], dim=0)
print(example_dist_single)
print(example_dist_single.sum())
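
To see the arithmetic behind softmax, here is a minimal sketch that uses only the standard library. The three scores are made up and stand in for a vocabulary of three tokens; `softmax` here is our own illustrative helper, not the PyTorch function.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result, but avoids overflow in exp.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, -1.0, 0.5]  # made-up scores for a 3-token vocabulary
toy_probs = softmax(toy_logits)
print(toy_probs)
print(sum(toy_probs))
```

Negative logits become small positive probabilities, and the largest logit gets the largest probability.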

Let's turn every output into a distribution:

In [None]:
# Copied from above
example_inputs = tokenizer(['def hello():\n\treturn'], return_tensors="pt").to(model.device)
model.eval()
with torch.no_grad():
    example_outputs = model.forward(**example_inputs)

# Notice that we do .logits[0] and not .logits[0,0] as above.
example_dist = F.softmax(example_outputs.logits[0], dim=1)
print(example_dist.shape)

Given these distributions (one for each output), we can produce the most likely
output token at each position:

In [None]:
example_dist

In [None]:
print("Input tokens:", example_inputs.input_ids)
for tok in example_inputs.input_ids.cpu().tolist()[0]:
    print(tok, "->", tokenizer.decode(tok).__repr__())
output_tokens = torch.argmax(example_dist, dim=1)
print("Output tokens:", output_tokens)
for tok in output_tokens.cpu().tolist():
    print(tok, "->", tokenizer.decode(tok).__repr__())

Read this as follows:

1. `find` is the most likely next token after `def`
2. `(name` is the most likely next token after `def hello`
3. `'    '` (four spaces) is the most likely next token after `def hello():\n`
4. ` "` is the most likely next token after `def hello():\n\treturn`

When doing generation, we only care about (4), which is the next token after
the full input sequence. But, we happen to get all of this, and it is
necessary for training.

Now that we've seen the output of a vanilla forward pass, let's use `forward`
in a configuration that is closer to what we need to train the model.
To train, we have to specify the expected output sequence. The language-modelling
objective is to predict the next token in a long text sequence. So, the
"expected output" is the input itself, but shifted by 1.

- Input: `[ "def", " hello", "():\n", "\treturn" ]`
- Expected output: `[ " hello", "():\n", "\treturn", ... ]`

We don't have an expected next token for the last input token, but that's fine. We can ignore it. Given enough data, we should see that last token again in some other context where it is not the last token!
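
The shift by one can be sketched with plain list slicing. The token strings below are illustrative and reuse the split shown above; the real tokenizer may split the text differently.

```python
# Token strings are illustrative; the real tokenizer may split differently.
tokens = ["def", " hello", "():\n", "\treturn"]

# The model output at position i is a prediction for the token at position i + 1.
prediction_positions = tokens[:-1]  # drop the last: its prediction has no label
expected_next = tokens[1:]          # drop the first: nothing predicts it

for context_end, target in zip(prediction_positions, expected_next):
    print(repr(context_end), "-> should predict", repr(target))
```

Both slices have the same length, which is what lets us compare predictions and labels position by position.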

## Cross Entropy Loss

When we train a model to "predict the next token", we are effectively saying
that any prediction that is not the expected next token is wrong.
This definition has some obvious consequences:

1. If the model assigns probability $p=1$ to the expected token, then the loss
   should be as small as possible.

2. If the model assigns probability $p=0$ to the expected token, then the loss
   should be as large as possible.

3. The loss should interpolate smoothly between these two extremes.

I encourage you to work through the derivation of cross entropy loss in
[Speech and Language Processing] (Chapter 5). But, the upshot is that
if the model's output distribution is $[p_1, p_2, \ldots, p_n]$ and $p_k$ is
the probability of the expected token, then the loss is:

$$- \log p_k
$$

When $p_k = 1$, this quantity is zero. When $p_k < 1$, the loss is a
positive number, and it grows without bound as $p_k$ approaches 0.

*When training with cross-entropy loss, loss cannot be negative.*
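
As a quick numeric check of the formula, the sketch below evaluates $-\log p_k$ for a few made-up probabilities. `token_loss` is a hypothetical helper name used only for this illustration.

```python
import math

def token_loss(p_k):
    """Cross-entropy loss for one prediction, given the probability of the expected token."""
    # log(1 / p_k) is the same quantity as -log(p_k).
    return math.log(1.0 / p_k)

for p_k in [1.0, 0.9, 0.5, 0.01]:
    print(p_k, "->", round(token_loss(p_k), 3))
```

The loss is zero at $p_k = 1$ and increases as the probability of the expected token shrinks.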

PyTorch implements cross-entropy loss as `F.cross_entropy`. We give it
two arguments:

1. A 2D tensor of predictions with shape `(num_items, num_classes)`.
2. A 1D tensor of labels with shape `(num_items,)`.

The predictions must be *logits*, and not probabilities. That is, we don't
apply softmax ourselves. `F.cross_entropy` is effectively
softmax followed by cross-entropy loss.
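
To make that concrete, the sketch below compares `F.cross_entropy` with a by-hand computation on made-up logits for two items and three classes. It relies on the default `reduction="mean"`, which averages the per-item losses.

```python
import torch
from torch.nn import functional as F

# Made-up logits: 2 items, 3 classes.
toy_logits = torch.tensor([[2.0, -1.0, 0.5], [0.0, 3.0, 1.0]])
toy_labels = torch.tensor([0, 2])

builtin = F.cross_entropy(toy_logits, toy_labels)

# By hand: softmax, pick out the probability of each label, then average -log p_k.
probs = F.softmax(toy_logits, dim=1)
p_k = probs[torch.arange(len(toy_labels)), toy_labels]
manual = (-torch.log(p_k)).mean()

print(builtin, manual)
```

The two values agree up to floating-point error. In practice you should still pass logits to `F.cross_entropy`, since the fused computation is more numerically stable than taking the log of a softmax yourself.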


Consider the following prediction. You can interpret it as follows:

1. There are exactly three tokens in the vocabulary.
2. The model predicted that token ID 1 is certainly the next token, and token 0 and 2 are effectively impossible.

Thus, the loss is as low as possible (zero).

In [None]:
pred = torch.tensor([[0.0, 100.0, 0.0 ]])
labels = torch.tensor([1])
F.cross_entropy(pred, labels)

In the example below, we again have three tokens. The model predicted
that the next token is token ID 0. However, the expected next token is token ID 1.
Thus the loss is very high.

In [None]:
pred = torch.tensor([[100.0, 0.0, 0.0]])
labels = torch.tensor([1])
F.cross_entropy(pred, labels)

The code below computes the loss for a single item. Notice we write
`logits[0,:-1]` to skip the last token predicted, since we don't
have a label for the last token. Conversely, we write
`example_inputs.input_ids[0, 1:]` to exclude the first token from the
list of tokens to predict, since we cannot predict the first token.

In [None]:
model.eval() # This is wrong, deliberately. Will fix later.
example_inputs = tokenizer(['def hello():\n\treturn'], return_tensors="pt").to(model.device)
logits = model.forward(**example_inputs).logits
F.cross_entropy(logits[0, :-1], example_inputs.input_ids[0, 1:])


Notice that the output above now has a `grad_fn`, which enables backpropagation. Backpropagation is what `loss.backward`
does.

In PyTorch, `backward` computes and saves gradients, but *does not 
update model weights.* You can confirm this by re-running the cell above
repeatedly. You'll get exactly the same loss, which indicates the model is
not learning anything.
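
This is easy to check on a small stand-in model. The sketch below uses a hypothetical `nn.Linear` layer in place of the LLM, with made-up inputs and labels, and shows two things: `backward` leaves the weights untouched, and a second `backward` adds to the stored gradients rather than replacing them.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
toy_model = nn.Linear(4, 3)   # hypothetical stand-in for the LLM
x = torch.randn(2, 4)         # 2 made-up inputs
y = torch.tensor([0, 2])      # made-up labels

weights_before = toy_model.weight.detach().clone()

loss = F.cross_entropy(toy_model(x), y)
loss.backward()
grad_after_one = toy_model.weight.grad.clone()

# A second backward pass adds to the stored gradients instead of replacing them.
F.cross_entropy(toy_model(x), y).backward()

weights_unchanged = torch.equal(toy_model.weight.detach(), weights_before)
grads_accumulated = torch.allclose(toy_model.weight.grad, 2 * grad_after_one)
print(weights_unchanged, grads_accumulated)
```

Because gradients accumulate across calls to `backward`, they need to be cleared between optimization steps.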

To actually update model weights, we need an optimizer. The textbook approach
to optimization is *stochastic gradient descent (SGD)*. We are going to use
*AdamW*, which is more sophisticated and works better with LLMs.

In [None]:
optimizer = AdamW(model.parameters(), lr=1e-5)

The code below uses the optimizer with two new lines at the end:
1. `optimizer.step()` updates the weights.
2. `optimizer.zero_grad()` clears the gradients that `.backward` computed, so
   that they do not accumulate into the next step.
   *Always* call `.zero_grad` immediately after `.step` for now.

The cell below is what we call the *training cell*.

In [None]:
model.train()
example_inputs = tokenizer(['def hello():\n\treturn'], return_tensors="pt").to(model.device)
logits = model.forward(**example_inputs).logits
loss = F.cross_entropy(logits[0, :-1], example_inputs.input_ids[0, 1:])
print(loss)
loss.backward()
optimizer.step()
optimizer.zero_grad()

Run the cell above several times. You will probably see the loss going down: the
model is learning! You can run the cell below to see the predictions. Once the loss
above goes down significantly (e.g., below 1.0), you'll see predictions closer
to the input sequence.

(We are effectively training the model to memorize this string, which is
a silly thing to do.)

In [None]:
model.eval() # Do NOT change this 
example_inputs = tokenizer(['def hello():\n\treturn'], return_tensors="pt").to(model.device)
with torch.no_grad():
    example_outputs = model.forward(**example_inputs)
example_dist = F.softmax(example_outputs.logits[0], dim=1)
output_tokens = torch.argmax(example_dist, dim=1)
print("Output tokens:", output_tokens)
for tok in output_tokens.cpu().tolist():
    print(tok, "->", tokenizer.decode(tok).__repr__())

Finally, let's address the use of `model.eval()`. `model.eval()` disables the
*dropout* layers. Dropout is a common regularization technique, but it introduces
randomness. To train with dropout enabled, the **training cell** must call
`model.train()` rather than `model.eval()`. If yours does not, make the change
and re-run training.

## Training A Model

Based on the code above, you are ready to write a training loop. The basic idea
is this:

1. Find a dataset to train on
2. Loop over the items in the dataset:
   - Put the code in the training cell in the loop body.
   - Tokenize and train on each item in the dataset.
   - Log the loss so that you have some sense of progress.

You can use any dataset you like. But, let's work with the [GSM8K] dataset,
which is the training set for the *Math Word Problems* homework.

[GSM8K]: https://huggingface.co/datasets/nuprl/engineering-llm-systems/viewer/gsm8k

In [None]:
train_data = datasets.load_dataset("nuprl/engineering-llm-systems", "gsm8k", split="train")

Here is an example training item:

In [None]:
print(train_data[5_125])

**Do now:** Write the training loop below.

Some suggestions:

1. You should probably work with a prefix of the dataset, e.g., the first
   500 items.

2. You should also have a way of monitoring progress, e.g., using `tqdm`.

3. You need to think about how to format the data. How about formatting it
   as a single string, e.g., `"Question: {question}\nAnswer: {answer}"`, which
   makes evaluation easy.

You can use the following snippet to handle the first two suggestions:

```
for ix in tqdm(range(500)):
    # Your code here
```

4. We recommend creating a list called `losses` to store all the losses
   you get. You can then plot them using `plt.plot(losses)`
   at the end of the loop. In our run over the first 500 items, the loss
   curve showed a very steep drop followed by much smaller drops. There is
   some randomness, e.g., due to dropout, so you won't get exactly the same
   curve, but that shape is a reasonable start.

5. You probably should reload the model, especially if you ran the training
   cell in *Cross Entropy Loss* several times. The model loading code is in the
   Introduction.

In [None]:
losses = [ ]
optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()
for ix in tqdm(range(500)):
    item = train_data[ix]
    # Format as "Question: ...\nAnswer: ..."
    text = f"Question: {item['question']}\nAnswer: {item['answer']}"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model(**inputs, labels=inputs["input_ids"])
    loss = outputs.loss
    losses.append(loss.item())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()


plt.plot(losses)

### Evaluation


Finally, you should evaluate your model. You have code to evaluate on this
task from homework. Try to evaluate zero shot and see how you do, both
before and after training.

## What's Left To Do?

This is the simplest possible training code. There are several ways to
improve it:

To make better use of your GPU:

1. You can pack several training items into a single input, separated with the
   `<|endoftext|>` token. This packing is quite standard, and you can only
   pack up to the maximum supported sequence length. For StarCoder that is
   8,192 tokens. But, that requires significant memory. We recommend not
   exceeding sequences of length 2,048.
2. You can load a batch with more than 1 item. With Flash Attention 2 and
   2,048 token sequences, you can probably do batch size 2 or 4 with
   StarCoder-1B.
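
Packing can be sketched on plain lists of token IDs. This is a minimal illustration, not the only way to do it: `pack_sequences` is a hypothetical helper, the IDs are made up, `eos_id=0` is a placeholder for the real `<|endoftext|>` token ID, and we assume that an item longer than `max_len` is simply truncated.

```python
def pack_sequences(tokenized_items, eos_id, max_len):
    """Concatenate token-id lists, separated by eos_id, into chunks of at most max_len."""
    packed, current = [], []
    for ids in tokenized_items:
        piece = ids + [eos_id]
        # Start a new chunk if this item would not fit in the current one.
        if current and len(current) + len(piece) > max_len:
            packed.append(current)
            current = []
        # Assumption: an item longer than max_len is simply truncated.
        current.extend(piece[:max_len])
    if current:
        packed.append(current)
    return packed

items = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]  # made-up token IDs
chunks = pack_sequences(items, eos_id=0, max_len=8)
print(chunks)
```

Each chunk can then be turned into a tensor and trained on like a single item, so that fewer forward passes are spent on very short sequences.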

To try to learn better:

3. You can configure the optimizer to use *weight decay*. (Which requires setting
   some non-default options to do it right with a Transformer.)
4. You can set a learning rate schedule instead of having a constant
   learning rate, or just try different learning rates.

To better understand training:

5. You can have a validation set, and evaluate on the validation set
   periodically during training.

6. You can save checkpoints for evaluation.

The last two items should be done first. We'll get to all of this later.

## Batching

We are *not* going to worry about batching at first. You can skip this section for now, but return here when we get to batching towards the end of the tutorial.

When we put 2+ items in a batch, we need to pad the shorter items using `padding=True`.

You should modify `inputs` above to include another short string, e.g. 
`'def fac('` and add `padding=True` as an argument to the tokenizer to enable
padding. After that, go ahead and look at `input_ids`. Notice how the
`attention_mask` is set. We also recommend rerunning the loop above.
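
To preview what you should see, here is a sketch of left padding on plain lists. We loaded the tokenizer with `padding_side="left"`, so padding goes at the start of the shorter items. `left_pad` is a hypothetical helper, the token IDs are made up, and `pad_id=0` is a placeholder rather than the real pad token ID.

```python
def left_pad(batch, pad_id):
    """Left-pad token-id lists to a common length and build the matching attention mask."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append([pad_id] * n_pad + ids)
        # 0 marks padding that the model should ignore; 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return input_ids, attention_mask

# Made-up token IDs; pad_id=0 is a placeholder, not the real pad token ID.
ids, mask = left_pad([[11, 12, 13, 14], [21, 22]], pad_id=0)
print(ids)
print(mask)
```

The `attention_mask` is what tells the model which positions are padding, which is why we could ignore it while every batch had a single item.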