<a href="https://colab.research.google.com/github/AmirZur/understanding-lm-cognition/blob/main/Lecture_1_Introduction_to_Language_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lecture 1: Introduction to Language Models

In this lecture, we will introduce the basics of language models. We'll cover next-token prediction, the popular transformer architecture and its fundamental units, and how language models are trained.

### ✍ Learning goals

By the end of the lesson, we hope you take away the following.

* **Autoregressive nature of LMs**: language models predict tokens one at a time.
* **The transformer architecture**: understand the basics of the current popular architecture for language models, including MLP and attention.
* **The 🤗 Transformers library**: learn how to load and inspect the latest open-source models with the Transformers library.

## 1️⃣ Talking to Language Models

### Playing around with a language model

Let's begin by chatting with language models. We'll load `gpt-2`, an open-source model released by OpenAI in 2019. As you'll see from playing with it, times have changed quickly in the world of AI!

In [16]:
# the code below loads gpt2 for us to play with
from transformers import pipeline

gpt2 = pipeline("text-generation", model="gpt2", device_map='cuda')

Device set to use cuda


The pre-trained models we'll be working with in this lesson are *base* models. These models predict the next token in a sequence - if you start a sentence, the model will try to complete it for you.

In [20]:
# try messing around with the prompt!
output = gpt2(
    'Once upon a time',
    max_new_tokens=20, # this controls how much text we want to generate
    do_sample=False, # we'll talk about sampling later - set sampling to False to get the same output each time
    pad_token_id=gpt2.tokenizer.eos_token_id # (ignore: just suppresses a warning)
)

# let's see how your model responds!
print(output[0]['generated_text'])

Once upon a time, the world was a place of great beauty and great danger. The world was a place of great


### ✏ **Exercise 1**

Play around with the model! What can it do? What can't it do? This is the ancestor of GPT-5 (the latest at the time of writing). What are some differences you notice?

*Hint: Can it do math? Does it know basic math? How can you tell?*

In [None]:
# try messing around with the prompt!
output = gpt2(
    # your code here!
)

# let's see how your model responds!
print(output[0]['generated_text'])

What are 3 capabilities you notice? What are 3 things that the model struggles with?

> FILL IN YOUR ANSWER HERE

In [None]:
# when you're done, delete the model from this previous section
# and move on to the next section!
del gpt2

### Peeling back a layer: tokens and logits

How does the model actually represent text? Let's break things down a little bit further. We'll see how text gets represented as **tokens**, and how the model predicts a **distribution** over the next token in a sentence.

In [1]:
# this time, let's load the model and tokenizer separately
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("gpt2")


The **tokenizer** parses a string of text into individual **tokens**. Each token has its own unique **token id**.

In [3]:
# note: Ġ is how gpt-2's tokenizer displays a leading space
# this is why " token" (with a space) is often a different token from "token"!
tokens = tokenizer.tokenize("The Eiffel Tower is in the city of")
token_ids = tokenizer("The Eiffel Tower is in the city of")["input_ids"]

print("Vocabulary size of our tokenizer:", tokenizer.vocab_size)
print("Number of tokens in our input:", len(token_ids))
print()
print("Tokens:", tokens)
print("Token ids:", token_ids)

A language model operates on these input ids. Each language model processes text in this way:

* **Inputs:** list of token ids of an input sentence.
* **Output:** a *probability distribution* over the next token in the sequence.
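
To make this input/output contract concrete, here is a minimal sketch of generating text one token at a time. It does not use GPT-2: `toy_model`, `VOCAB`, and the probability table are made-up stand-ins, and the toy only looks at the last token, whereas a real model conditions on the whole prefix.

```python
# Hypothetical toy "language model": a hand-written table that maps the
# last token id to a probability distribution over the next token id.
VOCAB = ["<eos>", "the", "cat", "sat"]
NEXT_TOKEN_PROBS = {
    1: [0.0, 0.0, 0.9, 0.1],  # after "the": most likely "cat"
    2: [0.1, 0.0, 0.0, 0.9],  # after "cat": most likely "sat"
    3: [0.8, 0.2, 0.0, 0.0],  # after "sat": most likely "<eos>"
}

def toy_model(token_ids):
    """Input: list of token ids. Output: distribution over the next token."""
    return NEXT_TOKEN_PROBS[token_ids[-1]]

def greedy_generate(token_ids, max_new_tokens):
    """Repeatedly pick the most likely next token and append it."""
    token_ids = list(token_ids)
    for _ in range(max_new_tokens):
        probs = toy_model(token_ids)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if next_id == 0:  # stop when the toy model predicts <eos>
            break
        token_ids.append(next_id)
    return token_ids

generated = greedy_generate([1], max_new_tokens=5)
print([VOCAB[i] for i in generated])  # ['the', 'cat', 'sat']
```

Each pass through the loop is one "next-token prediction": the newly chosen token becomes part of the input for the following step.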

Let's look at what that output distribution looks like.

In [25]:
import torch

# this time, let's return the input ids as a tensor instead of as a list
# (it still has the same exact values!)
input_ids = tokenizer("The Eiffel Tower is in the city of", return_tensors="pt")["input_ids"]


with torch.no_grad(): # disable gradient
    # we pass the input ids into the model to get its prediction for the next token
    outputs = model(input_ids=input_ids.to(model.device)).logits[0]

outputs.shape

torch.Size([10, 50257])

What we got from the model is a very large matrix: $(10 \times 50257)$. Do these numbers look familiar to you? Hopefully, you'll notice that this shape matches $(\text{number of input tokens} \times \text{vocabulary size})$.

We'll explain these numbers one at a time.

The $10$ isn't important just yet: our model is processing the text *in parallel*, so it's making a prediction for each token at the same time. This parallelism makes the model efficient, and we'll talk more about it in the next section.

The $50257$ is the size of our vocabulary. The model assigns a score (a *logit*) to each vocabulary item, and a higher score means the model considers that item more likely to be the next token. When we take the [softmax](https://en.wikipedia.org/wiki/Softmax_function) of these scores, we get a *probability distribution* over the next token.
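
As a small worked example of the softmax step, the sketch below turns three made-up scores into probabilities using only the standard library. The `softmax` helper and the example numbers are illustrative assumptions, not part of the notebook's model.

```python
import math

def softmax(logits):
    """Exponentiate each score and normalize so the results sum to 1."""
    largest = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary
probs = softmax(logits)
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

Larger scores get larger probabilities, and the ordering of the tokens is preserved.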

In [26]:
import plotly.express as px

# guessing the next token
next_token_probability = outputs[-1] # (50257,)
# softmax gives us a distribution (sums up to 1)
next_token_probability = next_token_probability.softmax(dim=-1) # (50257,)

fig = px.line(
    x = range(len(next_token_probability)),
    y = next_token_probability.cpu().numpy(),
    template = 'simple_white',
    width = 600,
    height = 400
)

# prettify
fig.update_layout(xaxis=dict(showticklabels=False,
                             title='Token index'),
                  yaxis=dict(title='Probability'),
                  title="Next token after \"The Eiffel Tower is in the city of ___\"")

The plot above doesn't tell us much, but we can tell that there's a handful of tokens that the model predicts are likely to follow our prompt. Let's see which tokens the model predicts.

In [27]:
print("The Eiffel Tower is in the city of")

# print the top 3 tokens, in order of their probabilities
most_likely_tokens = torch.topk(next_token_probability, 3, sorted=True)
for v, i in zip(most_likely_tokens.values, most_likely_tokens.indices):
    print(f'{tokenizer.decode(i.item())} ({v.item():.2f})')

The Eiffel Tower is in the city of
 Paris (0.07)
 London (0.06)
 New (0.03)


We can also get the most likely token using the `argmax` function. Since `argmax` gives us the **index** of the most likely token, we can use the tokenizer to **decode** the model's prediction from a token index to the actual token.

In [30]:
print("Prompt: \"The Eiffel Tower is in the city of\"")

output_token = tokenizer.decode(next_token_probability.argmax(dim=-1))
print(f"Output: \"{output_token}\"")

Prompt: "The Eiffel Tower is in the city of"
Output: " Paris"


Keen observers might notice something a little funny - the output token has a space at the beginning! The `gpt-2` tokenizer, like many others, usually attaches the space before a word to that word's token. Since the end of our prompt doesn't have a space, the next token starts with one. Interestingly enough, whether or not you have a space at the end of your text matters a lot to these models - the token sequence no longer looks the same!

In [34]:
token_ids_with_space = tokenizer(" Paris")["input_ids"]
token_ids_no_space = tokenizer("Paris")["input_ids"]

print("Paris with a space:", token_ids_with_space)
print("Paris without a space:", token_ids_no_space)

Paris with a space: [6342]
Paris without a space: [40313]
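
The sketch below illustrates why a trailing space changes the token sequence. It uses a made-up vocabulary and a greedy longest-match rule, which is a simplification and not GPT-2's actual byte-pair encoding. `TOY_VOCAB`, its ids, and `toy_tokenize` are all hypothetical.

```python
# Hypothetical toy vocabulary: " Paris" (leading space) and "Paris" are
# separate entries with separate ids, mirroring the GPT-2 output above.
TOY_VOCAB = {"Paris": 0, " Paris": 1, " of": 2, " city": 3, " ": 4}

def toy_tokenize(text):
    """Greedy longest-match tokenization (a simplification, not real BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # try the longest vocabulary entry that matches at this position
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token matches at position {pos}")
    return ids

print(toy_tokenize(" city of"))   # [3, 2]
print(toy_tokenize(" city of "))  # [3, 2, 4]
```

In this toy setup, the trailing space becomes its own token, so the model would see a different input sequence than it would for the prompt without the space.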


### ✏ **Exercise 2**

Get the model's predictions when we input the same prompt as above, but with a space at the end:

"The Eiffel Tower is in the city of "

What are the model's top three token predictions?

*Hint: You're more than encouraged to copy-paste code from the cells above!*

In [37]:
# same as before, but with a space at the end of the prompt
input_ids = tokenizer("The Eiffel Tower is in the city of ", return_tensors="pt")["input_ids"]

# your code here!

What are the top three tokens predicted by the model? Were they what you expected?

> FILL IN YOUR ANSWER HERE

### 🧠 Takeaways

In this section, we saw how models take in **token ids** as inputs and output a **probability distribution** over the next token in the sequence.

We also saw how little things like whitespace can throw off our model, because of how the text is tokenized! In the next section, we'll continue breaking down the model, this time looking into its internal computations.

Run this code when you're done with this section - we'll explore a different model in the next section.

In [None]:
del model # delete gpt2 to make space for the next model!

## 2️⃣ The Anatomy of Language Models

### Looking at model internals

In this section, we'll "open the black box" and explore the internal computation of language models. At the end of the day, it's all numbers! But that doesn't mean we understand what these numbers represent.

We'll also get to explore our first library for interpretability, [`nnsight`](https://nnsight.net/)! This library is fantastic for getting hands-on with model internals - we strongly encourage checking out the tutorials on the `nnsight` website.

Run the code below to install `nnsight` to the current compute instance on Colab.

In [2]:
!pip install nnsight


For this section, we'll use a smaller but more recent model, whose architecture will be easier to explain. Everything we covered in the last section - tokenization and output logits - also applies to this model!

You can explore a range of open-source models on the [🤗 Hugging Face](https://huggingface.co/) website.

In [3]:
from nnsight import LanguageModel

model = LanguageModel("HuggingFaceTB/SmolLM2-135M")


device(type='meta')

### Weights vs. Activations

Let's begin by looking at the anatomy of our language model. You'll see our model is made up of:

* **Embedding layer**: This converts our token indices into vectors representing each token in the sequence.
* **Layers**: This is the core of our model. We'll unpack these layers step by step.
* **LM head**: This projects the output from our final layer to predict the *logits* for the next token in the sequence.


In [6]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((576,), eps=1e-05)
    (rotary_emb): Lla

You'll see that inside each layer is also an **attention** block and an **MLP**. We'll discuss these components in later sections, but for now let's take a peek at what's actually there.

The code below uses `nnsight` to inspect the model's internals. Let's quickly break it down:
```python
with model.trace("<INPUT>"):
  component = model.thing.we.want.to.inspect
  thing_to_save = component.output.save()  # or use .input / .weight in place of .output
```

Running `model.trace(...)` lets `nnsight` know we're tracing the internal computations of the model. To save something that we can access outside of the `trace` session, we must use `.save()`. We can choose what to save - `input`, `output`, or `weight` - and we'll discuss the meaning of these in this section.

In [20]:
# what's inside our model?
with model.trace("This is an example"):
  layer_10_mlp = model.model.layers[10].mlp.gate_proj
  layer_10_mlp_weight = layer_10_mlp.weight.save()

layer_10_mlp_weight

Parameter containing:
tensor([[ 0.1025, -0.0850,  0.0097,  ...,  0.1113,  0.1943,  0.2793],
        [ 0.3359,  0.1147,  0.0131,  ...,  0.1963, -0.0369,  0.0457],
        [ 0.1426, -0.1426, -0.1797,  ..., -0.1206,  0.3555,  0.0786],
        ...,
        [ 0.0933, -0.0615,  0.0258,  ...,  0.0879, -0.1318,  0.0437],
        [-0.0464,  0.1377,  0.0503,  ...,  0.1953, -0.1416,  0.3496],
        [ 0.0732, -0.3086, -0.1367,  ...,  0.0049,  0.1621,  0.0104]],
       requires_grad=True)

It's one large matrix! In fact, every component of the model contains **parameters**, or **weights**, that get applied to the inputs in order to predict the next token. Our model is nothing more than a large collection of numbers!

One important thing to note is that **model weights are static** - they don't change when we pass inputs into the model. No matter what input we pass in, the weights of the model will stay the same.

In [23]:
with model.trace("An entire different input!"):
  layer_10_mlp_weight_1 = model.model.layers[10].mlp.gate_proj.weight.save()

layer_10_mlp_weight_1

Parameter containing:
tensor([[ 0.1025, -0.0850,  0.0097,  ...,  0.1113,  0.1943,  0.2793],
        [ 0.3359,  0.1147,  0.0131,  ...,  0.1963, -0.0369,  0.0457],
        [ 0.1426, -0.1426, -0.1797,  ..., -0.1206,  0.3555,  0.0786],
        ...,
        [ 0.0933, -0.0615,  0.0258,  ...,  0.0879, -0.1318,  0.0437],
        [-0.0464,  0.1377,  0.0503,  ...,  0.1953, -0.1416,  0.3496],
        [ 0.0732, -0.3086, -0.1367,  ...,  0.0049,  0.1621,  0.0104]],
       requires_grad=True)

We can think of the **parameters** of the model kind of like the configuration of neurons in the brain - for the most part, they don't change. What changes is how these neurons **activate** when the brain receives different sensory inputs (the kind of activity an fMRI scan measures).

In other words, the components of the model stay the same, but the intermediate outputs of these components depend on what inputs they receive. We can see this when we change what we save from the `weight` of the component to its `output` activation.
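
The same distinction can be seen in a tiny hand-written layer. This is a minimal sketch with made-up numbers: `WEIGHT` and `linear` are hypothetical stand-ins for one projection matrix in the model.

```python
# A tiny "layer": a fixed 2x3 weight matrix applied to a 3-dim input.
WEIGHT = [[0.5, -1.0, 0.0],
          [1.0,  0.0, 2.0]]

def linear(x):
    """Multiply the fixed weight matrix by the input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in WEIGHT]

weight_before = [row[:] for row in WEIGHT]  # copy, to compare afterwards

out_a = linear([1.0, 2.0, 3.0])   # activations for the first input
out_b = linear([0.0, 1.0, -1.0])  # activations for the second input

print(out_a)                    # [-1.5, 7.0]
print(out_b)                    # [-1.0, -2.0]
print(WEIGHT == weight_before)  # True: the weights never changed
```

The weights are identical across both calls, while the outputs (the activations) depend entirely on the input.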

In [32]:
# this time, we're saving outputs and not weights!
with model.trace("This is an example"):
  layer_10_mlp = model.model.layers[10].mlp.gate_proj.output.save()

with model.trace("An entire different input!"):
  layer_10_mlp_1 = model.model.layers[10].mlp.gate_proj.output.save()

print("MLP from first example:")
print(layer_10_mlp)
print("MLP from second example:")
print(layer_10_mlp_1)

MLP from first example:
tensor([[[ 0.6341, -0.0745, -0.4995,  ..., -0.6501, -0.5840,  0.2198],
         [-0.1233,  0.2684, -0.2431,  ..., -0.0561, -1.6066,  0.1166],
         [-0.5244, -0.4444, -0.9604,  ..., -0.0597, -0.8774, -0.0674],
         [-0.2134,  0.3444, -0.5428,  ...,  0.4786, -1.0046, -0.7826]]],
       grad_fn=<UnsafeViewBackward0>)
MLP from second example:
tensor([[[ 0.4559, -0.4140, -1.1922,  ..., -0.4237, -0.8311,  0.5695],
         [-0.0105, -0.2903,  0.2229,  ..., -0.3602, -1.1035, -0.3711],
         [ 0.2780, -1.1566, -0.6276,  ..., -0.6801, -0.6918, -1.2675],
         [-0.4224, -0.1435,  0.1109,  ..., -0.1755, -0.5407, -0.4756],
         [ 0.1385, -0.7148,  0.4529,  ..., -0.5184, -1.1321, -0.4873]]],
       grad_fn=<UnsafeViewBackward0>)


### The residual stream

Now that we're talking about activations, we can zoom out a bit and look at the output activation of an entire layer. Even though that layer might be made up of many individual components, these components get aggregated into a single output that's passed down to the next layer.
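
In Llama-style layers like the ones printed earlier, this aggregation happens by addition: each component's output is added back onto the vector it received. The sketch below shows only that additive pattern. `toy_attention` and `toy_mlp` are made-up placeholders, and the normalization steps (the `RMSNorm` modules) are left out for brevity.

```python
def toy_attention(x):
    """Placeholder for the attention block (hypothetical computation)."""
    return [0.5 * v for v in x]

def toy_mlp(x):
    """Placeholder for the MLP block (hypothetical computation)."""
    return [v * v for v in x]

def toy_layer(x):
    """Each component reads the running vector and adds its output to it."""
    x = [a + b for a, b in zip(x, toy_attention(x))]  # x + attention(x)
    x = [a + b for a, b in zip(x, toy_mlp(x))]        # x + mlp(x)
    return x

print(toy_layer([2.0, 4.0]))  # [12.0, 42.0]
```

Because every component only adds to the running vector, the layer's output has the same size as its input and can be handed straight to the next layer.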

We call the outputs that pass between layers the **residual stream**. Let's begin by understanding the shape of the residual stream.

In [37]:
# this time, we're looking at the output of an entire layer
with model.trace("The city Paris is in France"):
  layer_10_output = model.model.layers[10].output.save()

layer_10_output.shape

torch.Size([1, 6, 576])

The shape of our residual stream is $(\text{batch size} \times \text{number of tokens} \times \text{hidden dimension})$.

The batch size is the number of inputs we pass into the model - for this lesson, it'll always be 1.

We've already seen $\text{number of tokens}$ before! It's the length of our input. In the example above, it happens to correspond directly to the number of words in the sentence.

The hidden dimension is the number of dimensions used (the number of numbers in the vector) in the output of each layer in the model. An easy way to scale up model parameters is increasing the number of layers and the hidden dimension in each layer.

Let's use our knowledge that Paris is the third word in the sentence (index 2) to inspect the model's **activation** of "Paris".

In [39]:
# this is how the model represents "Paris" in the sentence "The city Paris is in France"
paris_activation = layer_10_output[0, 2]

# let's show just the first 20 numbers - there's a lot of them!
paris_activation[:20]

tensor([ -4.3865,  -0.1718,  -6.3006,   5.3175,  -2.1285,   1.6359,   0.8421,
          2.2904,  -1.0070,   2.1634,  -0.9495,   5.4058,   2.6145,  -3.5306,
         -5.1822,   1.1591, -13.7739,   0.1141,   2.0305,   0.4881],
       grad_fn=<SliceBackward0>)

What do the numbers above mean? How do they come together to represent the city of "Paris"? Disappointingly, we won't be able to explain every part of the language model in the first lesson of the course.

However, we can use the internals to make some points about how the model works. One thing we can notice is that the model is **autoregressive** - it can only look backwards, never forwards! This means that **activations are shaped by words that come earlier in the sequence**.

Let's see how the model's activation of "Paris" changes as we change the input text.

In [46]:
# let's compare the model's representation of Paris across different contexts
with model.trace("The city Paris is in France"):
  # get the output of the 3rd token in the sequence (index 2)
  paris_original = model.model.layers[10].output[0, 2].save()

# will this change the representation of Paris?
with model.trace("The city Paris is in London"):
  paris_london = model.model.layers[10].output[0, 2].save()

# how about this?
with model.trace("The beautiful Paris is in France"):
  paris_beautiful = model.model.layers[10].output[0, 2].save()

print("Paris (original) activation:")
print(paris_original[:10])
print()
print("Paris (London) activation:")
print(paris_london[:10])
print()
print("Paris (beautiful) activation:")
print(paris_beautiful[:10])

Paris (original) activation:
tensor([-4.3865, -0.1718, -6.3006,  5.3175, -2.1285,  1.6359,  0.8421,  2.2904,
        -1.0070,  2.1634], grad_fn=<SliceBackward0>)

Paris (London) activation:
tensor([-4.3865, -0.1718, -6.3006,  5.3175, -2.1285,  1.6359,  0.8421,  2.2904,
        -1.0070,  2.1634], grad_fn=<SliceBackward0>)

Paris (beautiful) activation:
tensor([-3.9540, -1.9365, -5.4437,  5.1855,  0.9920,  2.0741, -1.6305,  1.1086,
         1.0773,  3.5357], grad_fn=<SliceBackward0>)


In case you're not yet convinced, we can use the `torch.equal` operation to check which activations are the same.

In [49]:
import torch

print("Changing tokens AFTER Paris:", torch.equal(paris_original, paris_london))
print("Changing tokens BEFORE Paris:", torch.equal(paris_original, paris_beautiful))

Changing tokens AFTER Paris: True
Changing tokens BEFORE Paris: False


Here are two interesting takeaways from the autoregressive nature of language models:

* **Information flows left-to-right** in the model. If we see an interesting activation at some token $t$ and layer $l$, we should be able to attribute it to some earlier activation at token $t' \leq t$ and layer $l' < l$.
* **Prior context shapes model activations**. The model's activations can be thought of as **contextual embeddings** of the tokens in their input. The activation reflects the distribution - and hence the meaning - of the token in that context.
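
Both takeaways can be reproduced with a deliberately crude stand-in for a causal model, where the "activation" at each position is just the running mean of the token values seen so far. The `EMBEDDINGS` table and `causal_activations` are hypothetical. A real transformer mixes earlier tokens with learned attention rather than a plain average.

```python
# Made-up one-number "embeddings" for each word.
EMBEDDINGS = {
    "The": 1.0, "city": 2.0, "beautiful": 5.0, "Paris": 3.0,
    "is": 4.0, "in": 6.0, "France": 7.0, "London": 9.0,
}

def causal_activations(tokens):
    """Activation at position t = mean embedding of tokens 0..t (no lookahead)."""
    activations = []
    running_sum = 0.0
    for t, token in enumerate(tokens):
        running_sum += EMBEDDINGS[token]
        activations.append(running_sum / (t + 1))
    return activations

original = causal_activations("The city Paris is in France".split())
changed_after = causal_activations("The city Paris is in London".split())
changed_before = causal_activations("The beautiful Paris is in France".split())

# index 2 is "Paris" in all three sentences
print(original[2] == changed_after[2])   # True: later tokens are invisible
print(original[2] == changed_before[2])  # False: earlier tokens matter
```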

Let's visualize how residual stream activations change depending on the context by looking at a polysemous word like "break".

In [77]:
import plotly.graph_objects as go
import torch

with model.trace("The package is very fragile - it will break easily"):
  # loop through all the layers to get their activations
  break_one = [layer.output[0, -2] for layer in model.model.layers].save()

with model.trace("If you throw the ball at the window you might break it"):
  # loop through all the layers to get their activations
  break_two = [layer.output[0, -2] for layer in model.model.layers].save()

with model.trace("This relationship is hard and I think we should break up"):
  break_three = [layer.output[0, -2] for layer in model.model.layers].save()

# we can use cosine similarity to measure how similar the activations are
cosine_similarities_one_vs_two = [
    torch.nn.functional.cosine_similarity(act_one, act_two, dim=0).item()
    for act_one, act_two in zip(break_one, break_two)
]

cosine_similarities_one_vs_three = [
    torch.nn.functional.cosine_similarity(act_one, act_three, dim=0).item()
    for act_one, act_three in zip(break_one, break_three)
]

fig = go.Figure()

# plot similarities between the two physical breaks
fig.add_trace(go.Scatter(
    x=list(range(len(model.model.layers))),
    y=cosine_similarities_one_vs_two,
    mode='lines+markers',
    name='break 1 (hit) vs. break 2 (hit)',
    line=dict(color='firebrick', width=2)
))

# add similarity between physical break and figurative break up
fig.add_trace(go.Scatter(
    x=list(range(len(model.model.layers))),
    y=cosine_similarities_one_vs_three,
    mode='lines+markers',
    name='break 1 (hit) vs. break 3 (up)',
    line=dict(color='royalblue', width=2)
))

fig.update_layout(
    title='Similarities of \"break\" activations',
    xaxis_title='layer',
    yaxis_title='cosine similarity'
)

fig.update_layout(template="simple_white", width=600, height=400)

fig

Here are a few things to notice about the plot above:
* **Embeddings at layer 0 are all the same**. At this stage, the model processes the word itself, and doesn't incorporate the surrounding context.
* **Embeddings at the end are rather different**. At this stage, all the model cares about is predicting the next token - the more similar the next token distributions are, the higher the cosine similarity will be.
* **Semantic similarity happens at the middle layers.** The three versions of "break" we looked at are all distinct - the first is intransitive (the package breaks by itself), the second is transitive (the ball breaks the window), and the third is figurative (break up). While the first & the second are different in their *syntax*, the first & the third are different in their *semantics*. This gets reflected most clearly around layer 10 in our case.
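
For reference, cosine similarity compares the direction of two vectors and ignores their length. Here is a minimal standard-library sketch of the same quantity that `torch.nn.functional.cosine_similarity` computes for a pair of vectors. The helper name and the example vectors are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])  # 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # 0.0
in_between = cosine_similarity([3.0, 4.0], [4.0, 3.0])      # 0.96
print(same_direction, orthogonal, in_between)
```

A value of 1 means the two activations point in the same direction, and values closer to 0 mean they have less in common.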

### ✏ **Exercise 3**

Can you find a sense of "break" that's different from both "breaking a window" and "breaking up"?

Change **only** the code in the box below to find a use of break whose representation has a cosine similarity **between 0.7 and 0.8** compared with the other uses of break.

*Hint: look at [this paper](https://aclanthology.org/2023.findings-eacl.36.pdf) for inspiration!*

In [86]:
with model.trace("If you throw the ball at the window you might break it"):
  break_window = model.model.layers[10].output[0, -2].save()

with model.trace("This relationship is hard and I think we should break up"):
  break_up = model.model.layers[10].output[0, -2].save()

###################### YOUR CODE HERE #########################
with model.trace("YOUR SENTENCE HERE"):
  # remember to change the token position (-2), but NOT the layer (10) !
  your_break = model.model.layers[10].output[0, -2].save()
###############################################################

print(
  "break (window) vs. break (yours):",
  torch.nn.functional.cosine_similarity(break_window, your_break, dim=0).item()
)

print(
  "break (up) vs. break (yours):",
  torch.nn.functional.cosine_similarity(break_up, your_break, dim=0).item()
)

break (window) vs. break (yours): 0.5196972489356995
break (up) vs. break (yours): 0.48523223400115967


What sentence did you use? Why do you think it has the similarity that it has?

> FILL IN YOUR ANSWER HERE

### The MLP layer

Now that we have a sense of the information that passes between layers in the **residual stream**, it's time to delve even further into the model architecture and inspect what these layers are made up of.

Each layer in a language model has two core components: an **attention layer** and a **multi-layer perceptron (MLP)**.

In [88]:
model.model.layers[10]

LlamaDecoderLayer(
  (self_attn): LlamaAttention(
    (q_proj): Linear(in_features=576, out_features=576, bias=False)
    (k_proj): Linear(in_features=576, out_features=192, bias=False)
    (v_proj): Linear(in_features=576, out_features=192, bias=False)
    (o_proj): Linear(in_features=576, out_features=576, bias=False)
  )
  (mlp): LlamaMLP(
    (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
    (up_proj): Linear(in_features=576, out_features=1536, bias=False)
    (down_proj): Linear(in_features=1536, out_features=576, bias=False)
    (act_fn): SiLUActivation()
  )
  (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
  (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
)

These components are responsible for different operations. In particular,

* **MLPs** store and process information. This is by far the densest part of the model (it has the most parameters dedicated to it). Facts recalled by the model, and the computational processing of these facts, happen in the MLP layers.
* **Attention** components move information between tokens. This is an important point! MLPs can only work on the token in front of them - information that comes from the rest of the context can only be brought in by attention heads.

You can think of an MLP as a smaller neural network embedded within the large transformer block. These neural networks are **dense** - MLP layers have many more parameters than the rest of the network. It's reasonable to hypothesize that facts recalled by the model are stored in the MLP layers.
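
The printed `LlamaMLP` has three projections and a SiLU activation. In Llama-style models these are combined as `down_proj(act_fn(gate_proj(x)) * up_proj(x))`. The sketch below reproduces that pattern with tiny made-up matrices (hidden size 2 and intermediate size 3, instead of 576 and 1536). `GATE`, `UP`, `DOWN`, and the helper functions are hypothetical.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

# Made-up weights: hidden size 2, intermediate size 3.
GATE = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 x 2, like gate_proj
UP = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]    # 3 x 2, like up_proj
DOWN = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]    # 2 x 3, like down_proj

def gated_mlp(x):
    """Expand to the intermediate size, gate elementwise, project back."""
    gate = [silu(g) for g in matvec(GATE, x)]
    up = matvec(UP, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(DOWN, hidden)

output = gated_mlp([1.0, 2.0])
print(len(output))  # 2: same size as the input vector
```

Note that the MLP processes one token's vector at a time: nothing in this computation looks at other positions in the sequence.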

Let's count the number of parameters in the MLP at layer 10.

In [97]:
mlp_components = [
    parameter for name, parameter in model.model.layers[10].named_parameters()
    if "mlp" in name
]

# get the number of weights (number of numbers) of each MLP component
num_parameters = sum(p.numel() for p in mlp_components)

print("Number of parameters in MLP at layer 10:", num_parameters)

Number of parameters in MLP at layer 10: 2654208


The MLP in layer 10 alone has more than 2.6 million parameters! How does this compare to the attention components in layer 10?

In [98]:
attn_components = [
    parameter for name, parameter in model.model.layers[10].named_parameters()
    if "attn" in name
]

# get the number of weights (number of numbers) of each attention component
num_parameters = sum(p.numel() for p in attn_components)

print("Number of parameters in attention at layer 10:", num_parameters)

Number of parameters in attention at layer 10: 884736


We can see that the MLP has exactly three times as many parameters as the attention components in this layer (2,654,208 vs. 884,736).
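
Both counts can be reproduced by hand from the `in_features` and `out_features` values printed for layer 10, since none of the linear layers has a bias. The variable names below are made up for this sketch.

```python
hidden = 576         # size of the residual stream
intermediate = 1536  # out_features of gate_proj and up_proj
kv_dim = 192         # out_features of k_proj and v_proj

# gate_proj, up_proj and down_proj each hold hidden * intermediate weights
mlp_params = 3 * hidden * intermediate

# q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim
attn_params = 2 * hidden * hidden + 2 * hidden * kv_dim

print(mlp_params)                # 2654208
print(attn_params)               # 884736
print(mlp_params / attn_params)  # 3.0
```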

### ✏ **Exercise 4**

Follow the code we set up above to compute the **percentage of parameters that come from MLP layers in the model**. How much of the model is just MLP layers?

*Hint: use `model.named_parameters()` to get a list of all of the parameters in the model.*

What percentage of the model is made up of MLP layers? Hypothesize functions or capabilities of the model that would require lots of parameters dedicated to them.

> FILL IN YOUR ANSWER HERE

### Attention

In [105]:
print(model.model.layers[10].self_attn.source)

                              * @deprecate_kwarg("past_key_value", new_name="past_key_values", version="4.58")
                              0 def forward(
                              1     self,
                              2     hidden_states: torch.Tensor,
                              3     position_embeddings: tuple[torch.Tensor, torch.Tensor],
                              4     attention_mask: Optional[torch.Tensor],
                              5     past_key_values: Optional[Cache] = None,
                              6     cache_position: Optional[torch.LongTensor] = None,
                              7     **kwargs: Unpack[TransformersKwargs],
                              8 ) -> tuple[torch.Tensor, torch.Tensor]:
                              9     input_shape = hidden_states.shape[:-1]
                             10     hidden_shape = (*input_shape, -1, self.head_dim)
                             11 
 self_q_proj_0            -> 12     query_states = self.q_proj(hid

In [271]:
model.config._attn_implementation = "eager"

with model.trace("a is a & b is b & y is"):
  outputs = model.model.layers[10].self_attn.source.attention_interface_0.output.save()

attn_outputs, attn_weights = outputs

attn_outputs.shape, attn_weights.shape

(torch.Size([1, 10, 9, 64]), torch.Size([1, 9, 10, 10]))
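
Reading the shapes above: `attn_weights` appears to be laid out as (batch, heads, query position, key position), with 9 heads and 10 tokens, and `attn_outputs` as (batch, tokens, heads, head dimension), where $9 \times 64 = 576$. The sketch below shows how one head's weight matrix is typically computed in standard scaled dot-product attention with a causal mask. It is a simplified, standard-library illustration under those assumptions: the function name and the example vectors are made up, and details such as rotary position embeddings are omitted.

```python
import math

def causal_attention_weights(queries, keys):
    """Row i holds how much token i attends to each token j (zero for j > i)."""
    head_dim = len(queries[0])
    weights = []
    for i, query in enumerate(queries):
        # score token i only against positions j <= i (the causal mask)
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(head_dim)
            for j in range(i + 1)
        ]
        largest = max(scores)
        exps = [math.exp(s - largest) for s in scores]
        total = sum(exps)
        # softmax over visible positions, then zeros for future positions
        row = [e / total for e in exps] + [0.0] * (len(keys) - i - 1)
        weights.append(row)
    return weights

# made-up query and key vectors for a 3-token sequence, head dimension 2
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = causal_attention_weights(queries, keys)

print(weights[0])  # [1.0, 0.0, 0.0]: the first token can only attend to itself
```

Each row sums to 1, and everything above the diagonal is zero, which is the same "only look backwards" constraint we saw in the residual stream.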

In [278]:
import plotly.express as px

px.imshow(
    attn_weights[0, 3].detach().cpu().T,
    x=[f"{i}) {l}" for i, l in enumerate("a is a & b is b & y is".split(" "))],
    y=[f"{i}) {l}" for i, l in enumerate("a is a & b is b & y is".split(" "))],
    width=600,
    height=600
)