# Setup

In [1]:
from transformer_lens import HookedTransformer
import torch as t

device = t.device(
    "mps"
    if t.backends.mps.is_available()
    else "cuda"
    if t.cuda.is_available()
    else "cpu"
)

MAIN = __name__ == "__main__"

reference_gpt2 = HookedTransformer.from_pretrained(
    "gpt2-small",
    fold_ln=False,
    center_unembed=False,
    center_writing_weights=False,
    device=device,
)



Loaded pretrained model gpt2-small into HookedTransformer




In [2]:
print(device)

cuda


# Understanding Inputs & Outputs of a Transformer

### Learning Objectives

- Understand what a transformer is used for.

- Understand causal attention, and what a transformer's output represents.

- Learn what tokenisation is, and how models do it.

- Understand what logits are, and how to use them to derive a probability distribution over the vocabulary

## What is the point of a transformer?

**Transformers exist to model text**

Focus:  GPT-2 style transformers

Key feature: They generate text! You feed in language, and the model generates a probability distribution over tokens. Repeatedly sample this to generate text.

More detail:
1. feed in a sequence of length $n$
2. sample from the probability distribution over the $n+1$-th word/token
3. use this to construct a new sequence of length $n+1$
4. feed this new sequence into the model to get the probability distribution over the $n+2$-th word/token
5. and so on. 
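The loop above can be sketched in plain Python. This is only an illustration under stated assumptions: `VOCAB` and `toy_next_token_probs` are hypothetical stand-ins for a real vocabulary and model (they are not GPT-2), and tokens are represented by their integer index.

```python
import random

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]


def toy_next_token_probs(sequence):
    """Stand-in for a model: return one probability per VOCAB entry.

    It simply favours the token whose index follows the last token's index.
    """
    favoured = (sequence[-1] + 1) % len(VOCAB)
    rest = 0.1 / (len(VOCAB) - 1)
    return [0.9 if i == favoured else rest for i in range(len(VOCAB))]


def generate(prompt, n_new_tokens, rng):
    """Repeatedly sample a next token and append it to the sequence."""
    sequence = list(prompt)
    for _ in range(n_new_tokens):
        probs = toy_next_token_probs(sequence)  # distribution over the next token
        next_token = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        sequence.append(next_token)  # the sequence is now one token longer
    return sequence


rng = random.Random(0)
generated = generate([0], n_new_tokens=5, rng=rng)
print([VOCAB[i] for i in generated])
```

Each pass through the loop corresponds to steps 1 to 4: run the model on the current sequence, sample, append, repeat.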

### How is the model trained?

Provide a block of text, and train the model to predict the next word/token

If the model is provided with 100 tokens in a sequence, the model *predicts the next token for **each** prefix*.
- I.e., it produces 100 logit vectors (representing probability distributions) over the set of all words/tokens in the model's vocabulary, where the $i$-th logit vector represents the probability distribution over the token *following* the $i$-th token in the sequence.

This makes transformers efficient to train. For every sequence of length $n$, we get $n$ different predictions to train on:

$$
p(x_1),\ p(x_2 \mid x_1),\ p(x_3 \mid x_1 x_2),\ \ldots,\ p(x_n \mid x_1 \ldots x_{n-1})
$$
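To make the "one prediction per prefix" idea concrete, here is a minimal sketch that lists the (prefix, target) pairs a single sequence provides. The helper name `prefix_target_pairs` and the example tokens are illustrative assumptions, not part of any library.

```python
def prefix_target_pairs(tokens):
    """Return (prefix, target) pairs: each target is the token that follows its prefix."""
    return [(tokens[:i], tokens[i]) for i in range(len(tokens))]


example_tokens = ["I", " am", " an", " amazing"]
pairs = prefix_target_pairs(example_tokens)
for prefix, target in pairs:
    print(f"p({target!r} | {prefix})")
```

A sequence of length $n$ yields $n$ pairs. The first prefix is empty, matching the $p(x_1)$ term; in practice a BOS token (covered later) fills that position.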



#### Aside - logits

Given an arbitrary vector $x$ of dimension $m$, we can turn it into a probability distribution via the **softmax** function:

$$
x_i \mapsto \frac{e^{x_i}}{\sum_{j=1}^{m} e^{x_j}}
$$

- the exponential makes everything positive
- the normalization makes it add to one.
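A minimal plain-Python sketch of softmax, for illustration only. Subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result; it is an addition here, not something these notes prescribe.

```python
import math


def softmax(x):
    """Map a list of real numbers to positive values that sum to one."""
    shift = max(x)  # shifting every entry by a constant leaves the output unchanged
    exps = [math.exp(x_i - shift) for x_i in x]  # exponential: everything positive
    total = sum(exps)
    return [e / total for e in exps]  # normalisation: adds to one


probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

Larger inputs receive larger probabilities, and the ordering of the inputs is preserved.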

The model's output is the vector $x$ (one for each prediction it makes). We call the entries of this vector **logits**: they are unnormalised scores that determine a probability distribution, and they are related to the actual probabilities via the softmax function.

*How do we stop the transformer from "cheating" by just looking at the tokens it's trying to predict?*

The transformer has **causal attention** (as opposed to bidirectional attention). Causal attention only allows information to move forwards in the sequence, never backwards. The prediction of what comes after token 50 is only a function of the first 50 tokens, not of token 51. We say the transformer is **autoregressive**, because it only predicts future words based on past data.

See [this illustration](./C1P1_transformer_output.png)
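One way to picture causal attention is as a lower-triangular mask over positions. The sketch below is a toy illustration (the function name `causal_mask` is an assumption); how the mask is applied inside attention is not covered here.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may use information from position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

Row `i` only has access to columns `0..i`, so the prediction made at a position never depends on later tokens.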

## Tokens - Transformer Inputs

- Transformer input is natural language; i.e., a sequence of characters/strings.
- ML models take numeric vectors as input, not language. How to convert?

Split into two questions:
1. How to split language into small sub-units?
2. How to convert these into numeric vectors? 

### 2. Converting sub-units into vectors

- Create a lookup table called an **embedding**, which contains one vector of dimension $d_{embed}$ for each possible sub-unit of language we expect.

- The set of all sub-units is the model's **vocabulary**.

- Every element in the vocabulary is labelled with an integer, which never changes. This integer is used to index the embedding.

- Every word has a completely separate embedding vector; no relationship between words is introduced/represented when embedding is performed.

- Indexing the embedding is equivalent to multiplying the **embedding matrix** $W_E$ by the one-hot encoding.
  - The embedding matrix is the row-stack of all embedding vectors:

$$
\begin{aligned}
W_E &= \begin{bmatrix}
\leftarrow v_0 \rightarrow \\
\leftarrow v_1 \rightarrow \\
\vdots \\
\leftarrow v_{d_{vocab}-1} \rightarrow \\
\end{bmatrix} \quad \text{is the embedding matrix (size }d_{vocab} \times d_{embed}\text{),} \\
\\
t_i &= (0, \dots, 0, 1, 0, \dots, 0) \quad \text{is the one-hot encoding for the }i\text{th word (length }d_{vocab}\text{)} \\
\\
v_i &= t_i W_E \quad \text{is the embedding vector for the }i\text{th word (length }d_{embed}\text{).} \\
\end{aligned}
$$
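The equivalence between indexing and one-hot multiplication can be checked on a tiny example. The matrix values below are made up for illustration; only the shapes mirror the definitions above.

```python
# Hypothetical embedding matrix: row i is the embedding vector v_i.
W_E = [
    [0.1, 0.2, 0.3],
    [1.0, 1.1, 1.2],
    [2.0, 2.1, 2.2],
    [3.0, 3.1, 3.2],
]
d_vocab, d_embed = len(W_E), len(W_E[0])


def one_hot(i, length):
    """Vector of zeros with a single 1 at index i."""
    return [1.0 if k == i else 0.0 for k in range(length)]


def vec_mat(vec, mat):
    """Multiply a row vector by a matrix."""
    return [sum(vec[k] * mat[k][c] for k in range(len(vec))) for c in range(len(mat[0]))]


token_id = 2
by_lookup = W_E[token_id]                             # index the table directly
by_matmul = vec_mat(one_hot(token_id, d_vocab), W_E)  # t_i W_E
print(by_lookup)
print(by_matmul)
```

Both routes give the same vector of length `d_embed`; indexing is just the cheaper way to compute it.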

### 1. Splitting language into sub-units (tokens)

#### How?

- Use the set of all words in a dictionary?
  - This doesn't handle arbitrary text, such as URLs or punctuation.

- Use the 256 possible byte values (extended ASCII characters)?
  - Fixes the previous problem.

  - Loses *structure* of language.
  
    - Some sequences of characters are more meaningful than others: "language" is more meaningful than "ngfaslghflv".
    - Want an efficient vocabulary; so want "language" to be a token, but not "ngfaslghflv".
  
#### Most common strategy: **Byte-Pair encodings**

- Begin with the 256 possible byte values (extended ASCII characters) as the base tokens

- Find the most common pair of tokens and merge into a new token

**Note**: the *space* character is one of the 256 base tokens. Merges including *space* are very common. E.g., the first five merges for GPT-2's tokeniser are:

```python
" t"
" a"
"he"
"in"
"re"
```

**Note**: The character `Ġ` prefixes some tokens. It is a special character indicating that the token begins with a space. *Tokens with a leading space are different from those without a leading space*.
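The merge step of byte-pair encoding can be sketched on a toy string. This is a simplified illustration, not GPT-2's actual tokeniser training: it starts from characters of one short string rather than bytes of a large corpus, and the helper names are assumptions.

```python
from collections import Counter


def most_common_pair(tokens):
    """Return the most frequent adjacent pair of tokens."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = list("the theme then")  # start from single characters
for _ in range(3):
    pair = most_common_pair(tokens)
    tokens = merge_pair(tokens, pair)  # the merged pair becomes a new vocabulary entry
    print(pair, tokens)
```

Each merge shortens the token sequence while leaving the underlying text unchanged, which is why frequent character sequences end up as single tokens.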

#### GPT-2's tokeniser's vocabulary:

In [3]:
from pprint import pprint

sorted_vocab = sorted(list(reference_gpt2.tokenizer.vocab.items()), key=lambda n: n[1])
print("First 20 tokens in GPT-2 vocab:")
pprint(sorted_vocab[:20])
print("\n251st to 270th token in GPT-2 vocab:")
pprint(sorted_vocab[250:270])
print("\n991st to 1010th token in GPT-2 vocab:")
pprint(sorted_vocab[990:1010])
print("\nFinal 20 tokens in GPT-2 vocab:")
pprint(sorted_vocab[-20:])

First 20 tokens in GPT-2 vocab:
[('!', 0),
 ('"', 1),
 ('#', 2),
 ('$', 3),
 ('%', 4),
 ('&', 5),
 ("'", 6),
 ('(', 7),
 (')', 8),
 ('*', 9),
 ('+', 10),
 (',', 11),
 ('-', 12),
 ('.', 13),
 ('/', 14),
 ('0', 15),
 ('1', 16),
 ('2', 17),
 ('3', 18),
 ('4', 19)]

251st to 270th token in GPT-2 vocab:
[('ľ', 250),
 ('Ŀ', 251),
 ('ŀ', 252),
 ('Ł', 253),
 ('ł', 254),
 ('Ń', 255),
 ('Ġt', 256),
 ('Ġa', 257),
 ('he', 258),
 ('in', 259),
 ('re', 260),
 ('on', 261),
 ('Ġthe', 262),
 ('er', 263),
 ('Ġs', 264),
 ('at', 265),
 ('Ġw', 266),
 ('Ġo', 267),
 ('en', 268),
 ('Ġc', 269)]

991st to 1010th token in GPT-2 vocab:
[('Ġprodu', 990),
 ('Ġstill', 991),
 ('led', 992),
 ('ah', 993),
 ('Ġhere', 994),
 ('Ġworld', 995),
 ('Ġthough', 996),
 ('Ġnum', 997),
 ('arch', 998),
 ('imes', 999),
 ('ale', 1000),
 ('ĠSe', 1001),
 ('ĠIf', 1002),
 ('//', 1003),
 ('ĠLe', 1004),
 ('Ġret', 1005),
 ('Ġref', 1006),
 ('Ġtrans', 1007),
 ('ner', 1008),
 ('ution', 1009)]

Final 20 tokens in GPT-2 vocab:
[('Revolution', 502

#### First tokens with 3, 4, 5, 6, and 7 characters:

In [4]:
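# For each length from 3 to 7, find the first (lowest-index) token of that length.
length_toptoken_pairs = dict.fromkeys(range(3, 8), "")
for tok, _ in sorted_vocab:
    # Only fill a slot that is tracked (lengths 3-7) and still empty.
    if length_toptoken_pairs.get(len(tok)) == "":
        length_toptoken_pairs[len(tok)] = tok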

for length, tok in length_toptoken_pairs.items():
    print(f"{length}: {tok}")

3: ing
4: Ġthe
5: Ġthat
6: Ġtheir
7: Ġpeople


#### BOS token

The BOS token is a special token used to mark the **B**eginning **O**f a **S**equence. This token:

- Provides context that this is the start of a sequence, which can help the model generate more appropriate text.
- Can act as a "rest position" for attention heads (discussed later)

`TransformerLens` adds this token automatically, including in forward passes of transformer models. E.g., it's implicitly added when `model()` is called.
- You can disable this behaviour by setting `prepend_bos=False` in `to_tokens`, `to_str_tokens`, `model.forward`, and any other function/method that converts strings to multi-token tensors.

Note: if you get weird off-by-one errors, check whether there's an unexpected `prepend_bos`!

- Confusingly, in GPT-2, the End of Sequence (EOS), Beginning of Sequence (BOS), and Padding (PAD) tokens are all the same: `<|endoftext|>` with index `50256`.

  - Why? GPT-2 is an autoregressive model. It has no need to distinguish between BOS and EOS tokens, since it only processes text from left to right. In contrast, other transformer families like BERT benefit from distinct BOS and EOS tokens. 



### Some tokenisation annoyances:

#### Whether or not a word begins with a capital letter or space matters:

In [5]:
print(reference_gpt2.to_str_tokens("Ralph"))
print(reference_gpt2.to_str_tokens(" Ralph"))
print(reference_gpt2.to_str_tokens(" ralph"))
print(reference_gpt2.to_str_tokens("ralph"))

['<|endoftext|>', 'R', 'alph']
['<|endoftext|>', ' Ralph']
['<|endoftext|>', ' r', 'alph']
['<|endoftext|>', 'ral', 'ph']


#### Arithmetic is a mess

In [6]:
print(reference_gpt2.to_str_tokens("56873+3184623=123456789-1000000000"))

['<|endoftext|>', '568', '73', '+', '318', '46', '23', '=', '123', '45', '67', '89', '-', '1', '000000', '000']


### Key Takeaways

- We learn a dictionary of $d_{vocab}$ tokens ("sub-words")

- We (approximately) losslessly convert language to integers via tokenisation

- We convert integers to vectors via lookup table. This is called **embedding**

- **Note**: a transformer's input is a sequence of *tokens*, not vectors.

## Text generation

### Step 1: Convert text to tokens

A sequence gets tokenised:

In [7]:
reference_text = "I am an amazing autoregressive, decoder-only, GPT-2 style transformer. One day I will exceed human level intelligence and take over the world!"

tokens = reference_gpt2.to_tokens(reference_text).to(device)
str_tokens = reference_gpt2.to_str_tokens(tokens)

print(tokens)
print(tokens.shape)
print(str_tokens)

tensor([[50256,    40,   716,   281,  4998,  1960,   382, 19741,    11,   875,
         12342,    12,  8807,    11,   402, 11571,    12,    17,  3918, 47385,
            13,  1881,  1110,   314,   481,  7074,  1692,  1241,  4430,   290,
          1011,   625,   262,   995,     0]], device='cuda:0')
torch.Size([1, 35])
['<|endoftext|>', 'I', ' am', ' an', ' amazing', ' aut', 'ore', 'gressive', ',', ' dec', 'oder', '-', 'only', ',', ' G', 'PT', '-', '2', ' style', ' transformer', '.', ' One', ' day', ' I', ' will', ' exceed', ' human', ' level', ' intelligence', ' and', ' take', ' over', ' the', ' world', '!']


Note that `tokens` has shape `(batch, seq_len)`. Here, the batch dimension is just `1`, since we only supplied one sequence.

### Step 2: Map tokens to logits

In [8]:
logits, cache = reference_gpt2.run_with_cache(tokens, device=device)
print(logits.shape)

torch.Size([1, 35, 50257])


From our input `tokens` of shape `(batch, seq_len)`, we get an output `logits` of shape `(batch, seq_len, d_vocab)`.

The `[i, j, :]`-th element of our output is a vector of logits representing our prediction for the `j+1`-th token in the `i`-th sequence.

**Note**: `run_with_cache` tells the model to cache all intermediate activations. This is discussed later.
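The indexing convention for the logits can be illustrated with nested lists standing in for tensors. All values below are made up; the point is only the alignment between position `j` in the logits and position `j+1` in the tokens.

```python
# Hypothetical toy values: 1 sequence, 3 positions, vocabulary of 4 tokens.
toy_tokens = [[2, 0, 3]]
toy_logits = [[
    [0.1, 2.0, 0.3, 0.0],  # position 0: prediction for the token at position 1
    [0.0, 0.2, 0.1, 1.5],  # position 1: prediction for the token at position 2
    [1.2, 0.1, 0.4, 0.3],  # position 2: prediction for the unseen next token
]]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda k: values[k])


i = 0  # batch index
for j in range(len(toy_tokens[i]) - 1):
    predicted = argmax(toy_logits[i][j])
    actual = toy_tokens[i][j + 1]
    print(f"position {j}: predicted {predicted}, actual next token {actual}")

# Only the final position predicts a token that is not already in the input.
next_token = argmax(toy_logits[i][-1])
print(next_token)
```

Comparing `logits[i][j]` against `tokens[i][j]` instead of `tokens[i][j + 1]` is a typical source of off-by-one mistakes.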

### Step 3: Convert the logits into a probability distribution using *softmax*

In [9]:
probs = logits.softmax(dim=-1)  # Softmax over the d_vocab dimension
print(probs.shape)

torch.Size([1, 35, 50257])


Still of shape `(batch, seq_len, d_vocab)`

### What is the most likely next token at each position?

In [10]:
pprint(list(zip(str_tokens, reference_gpt2.to_str_tokens(probs.argmax(dim=-1)))))

[('<|endoftext|>', '\n'),
 ('I', "'m"),
 (' am', ' a'),
 (' an', ' avid'),
 (' amazing', ' person'),
 (' aut', 'od'),
 ('ore', 'sp'),
 ('gressive', '.'),
 (',', ' and'),
 (' dec', 'ently'),
 ('oder', ','),
 ('-', 'driven'),
 ('only', ' programmer'),
 (',', ' and'),
 (' G', 'IM'),
 ('PT', '-'),
 ('-', 'only'),
 ('2', '.'),
 (' style', ','),
 (' transformer', '.'),
 ('.', ' I'),
 (' One', ' of'),
 (' day', ' I'),
 (' I', ' will'),
 (' will', ' be'),
 (' exceed', ' my'),
 (' human', 'ly'),
 (' level', ' of'),
 (' intelligence', ' and'),
 (' and', ' I'),
 (' take', ' over'),
 (' over', ' the'),
 (' the', ' world'),
 (' world', '.'),
 ('!', ' I')]


### Step 4: Map a distribution to a token

In [11]:
next_token = logits[0, -1].argmax(dim=-1)
next_str_token = reference_gpt2.to_string(next_token)
print(repr(next_str_token))

' I'


Note that we're indexing `logits[0, -1]`. This is because `logits` has shape `(1, sequence_length, vocab_size)`, so this indexing returns the vector of length `vocab_size` representing the model's prediction for what token follows the last token in the input sequence.

In this case, we can see that the model predicts the token `' I'`.
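Taking the argmax is only one way to map a distribution to a token. The sketch below contrasts it with drawing a random sample, using made-up logits and plain Python; the function names are illustrative assumptions.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Always pick the token with the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, rng):
    """Draw a token index at random, weighted by its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [1.0, 3.0, 0.5, 2.0]  # hypothetical values for a 4-token vocabulary
rng = random.Random(0)
greedy_choice = greedy(toy_logits)
samples = [sample(toy_logits, rng) for _ in range(10)]
print(greedy_choice)
print(samples)
```

Greedy selection is deterministic, while sampling can return less likely tokens some of the time.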

### Step 5: Add this to the input and rerun 10 times:

In [12]:
n = 10

for i in range(n):
    print(f"Sequence so far: {reference_gpt2.to_string(tokens)[0]!r}")
    next_token = logits[0, -1].argmax(dim=-1)
    print(f"{tokens.shape[-1]+1}th token = {next_token!r}")
    tokens = t.cat([tokens, next_token[None, None]], dim=-1)
    logits, _ = reference_gpt2.run_with_cache(tokens, device=device)

Sequence so far: '<|endoftext|>I am an amazing autoregressive, decoder-only, GPT-2 style transformer. One day I will exceed human level intelligence and take over the world!'
36th token = tensor(314, device='cuda:0')
Sequence so far: '<|endoftext|>I am an amazing autoregressive, decoder-only, GPT-2 style transformer. One day I will exceed human level intelligence and take over the world! I'
37th token = tensor(716, device='cuda:0')
Sequence so far: '<|endoftext|>I am an amazing autoregressive, decoder-only, GPT-2 style transformer. One day I will exceed human level intelligence and take over the world! I am'
38th token = tensor(257, device='cuda:0')
Sequence so far: '<|endoftext|>I am an amazing autoregressive, decoder-only, GPT-2 style transformer. One day I will exceed human level intelligence and take over the world! I am a'
39th token = tensor(845, device='cuda:0')
Sequence so far: '<|endoftext|>I am an amazing autoregressive, decoder-only, GPT-2 style transformer. One day I will e

## Key Takeaways

- Transformers take in language and predict the next token (for *each* token in a causal way)

- Language is converted into a sequence of integers with a tokeniser.

- Integers are converted to vectors via a lookup table (embedding).

- The output is a vector of logits, one per input token, that we convert to a probability distribution using *softmax*.

- We pick a token from this probability distribution (e.g., greedily, by taking the largest logit) to obtain a new token.
  - More on this later.

- We append this new token to the input and run the model again to generate more text.
  - Autoregressively

- Transformers are sequence operation models.
  - They take in a sequence, perform processing in parallel at each position, and use attention to move information between positions.