# Accessing and Comparing Parameters and Outputs of Two Equivalent Models

This notebook compares the `GPT2Model` from Hugging Face with `CkptedTransformer`. We will load `CkptedTransformer` from the `gpt2-small` weights, so except for a few biases in the network, everything should be equal.

In [2]:
from transformers import GPT2Model, GPT2Tokenizer
from DoTLMViz import CkptedTransformer

import torch

## A function to compare two tensors

Let us first define a function that takes two tensors of the same shape and reports the percentage of elements that agree within a tolerance.

In [30]:
def compare(a: torch.Tensor, b: torch.Tensor):
    comparison = torch.isclose(a, b, atol=1e-4, rtol=1e-3)
    print(f"{comparison.sum() / comparison.numel():.2%} of the values are correct.")
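The tolerances passed to `torch.isclose` combine as `|a - b| <= atol + rtol * |b|`, which is the rule given in the PyTorch documentation for finite values. The following plain-Python sketch spells out that arithmetic for scalars; `is_close` is a hypothetical helper written for illustration, not part of PyTorch.

```python
def is_close(a: float, b: float, atol: float = 1e-4, rtol: float = 1e-3) -> bool:
    """Scalar version of the closeness rule documented for torch.isclose (finite inputs only)."""
    return abs(a - b) <= atol + rtol * abs(b)


# Near zero, the absolute tolerance decides the outcome.
print(is_close(0.00005, 0.0))   # True:  5e-5 <= 1e-4
print(is_close(0.0005, 0.0))    # False: 5e-4 >  1e-4

# For large values, the relative tolerance dominates.
print(is_close(100.05, 100.0))  # True:  0.05 <= 1e-4 + 0.1
print(is_close(100.2, 100.0))   # False: 0.2  >  1e-4 + 0.1
```

With `atol=1e-4` and `rtol=1e-3`, values of magnitude around 100 may differ by roughly 0.1 and still count as matching, so the reported percentage is a coarse check rather than a bitwise one.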

## Loading the models

In [6]:
org_gpt2 = GPT2Model.from_pretrained("gpt2")
custom_gpt2 = CkptedTransformer.from_pretrained("gpt2-small", device="cpu")

In [7]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")



## The Input

To compare the two models, we need some input to run them on.

In [8]:
text = "alpha beta gamma delta epsilon eta zeta"
tokens = tokenizer(text, return_tensors="pt")["input_ids"]

## Comparing the Outputs

Let us first compare the outputs before comparing the individual parameters. The output from `org_gpt2` can be obtained as:

In [9]:
org_logits = org_gpt2(tokens)

Let's check its shape.

In [12]:
org_logits.shape

AttributeError: 'BaseModelOutputWithPastAndCrossAttentions' object has no attribute 'shape'

Seems like we didn't get a tensor. What did we get? Let's check its type.

In [52]:
type(org_logits)

transformers.modeling_outputs.BaseModelOutputWithPastAndCrossAttentions

So, it's a `BaseModelOutputWithPastAndCrossAttentions` object, which can be indexed like a tuple. The 0th index holds a tensor (also available as `last_hidden_state`). Maybe that tensor is the logits? Let's see.

In [16]:
org_logits[0].shape

torch.Size([1, 11, 768])

But the logits should have the shape `(1, 11, 50257)`: one score per vocabulary entry for each of the 11 tokens. The last dimension here is 768, the hidden size.

So the final layer of `org_gpt2` does not seem to be an unembedding. Let's inspect its modules.

In [19]:
org_gpt2

GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-11): 12 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2SdpaAttention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D()
        (c_proj): Conv1D()
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)

So we didn't get the desired output because the last layer is `ln_f`, a layer norm, instead of an unembedding. `GPT2Model` is the bare transformer without a language-modeling head.

GPT-2 uses tied embeddings, i.e. it uses the same embedding matrix `wte` for both the input token embedding and the unembedding. So, we can obtain the logits by multiplying the output of `ln_f` with the transpose of the `wte` weight matrix.
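In terms of shapes, writing $h$ for the output of `ln_f` and $W_E$ for `wte.weight` (symbols introduced here only for illustration), the product works out as:

```latex
\text{logits} = h \, W_E^{\top}, \qquad
h \in \mathbb{R}^{1 \times 11 \times 768}, \quad
W_E \in \mathbb{R}^{50257 \times 768}, \quad
\text{logits} \in \mathbb{R}^{1 \times 11 \times 50257}
```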

In [24]:
ln_f_output = org_gpt2(tokens)  # model output object; index 0 holds the output tensor of the final layer norm
gpt2_logits = ln_f_output[0] @ org_gpt2.wte.weight.T

Checking the shape confirms that it is now correct:

In [25]:
gpt2_logits.shape

torch.Size([1, 11, 50257])

Now, let us obtain the output logits from the `custom_gpt2`.

In [26]:
custom_logits, _ = custom_gpt2.run_with_ckpts(tokens)

In [27]:
custom_logits.shape

torch.Size([1, 11, 50257])

In [28]:
gpt2_logits.shape == custom_logits.shape

True

So, the shapes of the outputs from both models match. Now, we can proceed to compare their contents.

In [31]:
compare(gpt2_logits, custom_logits)

100.00% of the values are correct.
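A percentage within tolerance says nothing about how large the worst disagreement is. As a complementary check, one could also look at the largest absolute difference. The sketch below uses a hypothetical helper `max_abs_diff` and small stand-in tensors rather than the real logits.

```python
import torch


def max_abs_diff(a: torch.Tensor, b: torch.Tensor) -> float:
    """Largest elementwise absolute difference between two same-shaped tensors."""
    return (a - b).abs().max().item()


# Stand-in tensors (not real model outputs): b equals a except for one entry.
a = torch.zeros(2, 3)
b = a.clone()
b[0, 1] = 0.25

print(max_abs_diff(a, b))  # 0.25
```

Under the same assumptions, `max_abs_diff(gpt2_logits, custom_logits)` would report the worst-case gap between the two models' logits.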


## Comparing the parameters of the models

To compare the parameters of the models, we first need to inspect them.

In [32]:
org_gpt2

GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-11): 12 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2SdpaAttention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D()
        (c_proj): Conv1D()
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)

What we see are the attribute names of the various layers in the GPT-2 model. We can access the parameters of these layers through these attributes. The `org_gpt2` structure looks simple, while `custom_gpt2` (shown below) looks more *obfuscated*. That is because of the checkpoints we have added, which we are going to ignore for now.

In [33]:
custom_gpt2

CkptedTransformer(
  (embed): Embedding()
  (pos_embed): PosEmbedding()
  (blocks): ModuleList(
    (0-11): 12 x TransformerBlock(
      (resid_pre): Ckpt()
      (ln1): LayerNorm(
        (ckpt_scaled): Ckpt()
        (ckpt_normalized): Ckpt()
      )
      (attn): Attention(
        (ckpt_q): Ckpt()
        (ckpt_k): Ckpt()
        (ckpt_v): Ckpt()
        (ckpt_scores): Ckpt()
        (ckpt_pattern): Ckpt()
        (ckpt_z): Ckpt()
        (ckpt_attn_out): Ckpt()
      )
      (resid_mid): Ckpt()
      (ln2): LayerNorm(
        (ckpt_scaled): Ckpt()
        (ckpt_normalized): Ckpt()
      )
      (mlp): MLP(
        (ckpt_pre): Ckpt()
        (ckpt_post): Ckpt()
        (ckpt_mlp_out): Ckpt()
      )
      (resid_post): Ckpt()
    )
  )
  (ln_final): LayerNorm(
    (ckpt_scaled): Ckpt()
    (ckpt_normalized): Ckpt()
  )
  (unembed): Unembedding()
  (ckpt_embed): Ckpt()
  (ckpt_pos_embed): Ckpt()
)

To inspect the names of the parameters that the `wte` layer of `org_gpt2` has, we can iterate over its `named_parameters()`:

In [47]:
for name, parameter in org_gpt2.wte.named_parameters():
    print(name)

weight


So the `wte` layer has a single parameter named `weight`, which we can access directly:

In [48]:
org_gpt2.wte.weight

Parameter containing:
tensor([[-0.1101, -0.0393,  0.0331,  ..., -0.1364,  0.0151,  0.0453],
        [ 0.0403, -0.0486,  0.0462,  ...,  0.0861,  0.0025,  0.0432],
        [-0.1275,  0.0479,  0.1841,  ...,  0.0899, -0.1297, -0.0879],
        ...,
        [-0.0445, -0.0548,  0.0123,  ...,  0.1044,  0.0978, -0.0695],
        [ 0.1860,  0.0167,  0.0461,  ..., -0.0963,  0.0785, -0.0225],
        [ 0.0514, -0.0277,  0.0499,  ...,  0.0070,  0.1552,  0.1207]],
       requires_grad=True)

Similarly, we can inspect the parameters of the `embed` layer of `custom_gpt2`:

In [49]:
for name, parameters in custom_gpt2.embed.named_parameters():
    print(name)

W_E


So the `embed` layer has a single parameter named `W_E`, which we can access directly:

In [50]:
custom_gpt2.embed.W_E

Parameter containing:
tensor([[-0.1101, -0.0393,  0.0331,  ..., -0.1364,  0.0151,  0.0453],
        [ 0.0403, -0.0486,  0.0462,  ...,  0.0861,  0.0025,  0.0432],
        [-0.1275,  0.0479,  0.1841,  ...,  0.0899, -0.1297, -0.0879],
        ...,
        [-0.0445, -0.0548,  0.0123,  ...,  0.1044,  0.0978, -0.0695],
        [ 0.1860,  0.0167,  0.0461,  ..., -0.0963,  0.0785, -0.0225],
        [ 0.0514, -0.0277,  0.0499,  ...,  0.0070,  0.1552,  0.1207]],
       requires_grad=True)

We can check whether these two parameters are equal using the `compare` function defined earlier:

In [51]:
compare(org_gpt2.wte.weight, custom_gpt2.embed.W_E)

100.00% of the values are correct.
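The same inspect-then-compare routine can be repeated for any other pair of layers. To decide which parameters to pair up, it may help to list names together with shapes, since `compare` needs tensors of matching (or broadcastable) shape. Here is a minimal sketch on a small stand-in module; `list_parameters` and `toy` are hypothetical names, not part of either model.

```python
import torch


def list_parameters(module: torch.nn.Module) -> list[tuple[str, tuple[int, ...]]]:
    """Return (name, shape) for every parameter registered on the module."""
    return [(name, tuple(p.shape)) for name, p in module.named_parameters()]


# A tiny stand-in module, used only to show the output format.
toy = torch.nn.Sequential(torch.nn.Embedding(10, 4), torch.nn.Linear(4, 2))

for name, shape in list_parameters(toy):
    print(name, shape)
# 0.weight (10, 4)
# 1.weight (2, 4)
# 1.bias (2,)
```

Calling `list_parameters(org_gpt2)` and `list_parameters(custom_gpt2)` side by side should make layout differences visible. For example, Hugging Face's `Conv1D` stores its weight as `(in_features, out_features)`, the transpose of `torch.nn.Linear`, so some pairs may need a transpose or reshape before they can be compared.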
