### State dict
PyTorch uses a dictionary called `state_dict` to keep track of a model's weights and biases. It is an ordered Python dictionary that maps each parameter (and persistent buffer) to its tensor. The keys are dotted names that identify a layer and the parameter inside it, for example `transformer.h.0.ln_1.weight`, and the values are the corresponding tensors.
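
Before looking at GPT-2, the key-to-tensor mapping is easier to see on a toy model. The sketch below is only an illustration: `TinyNet`, `fc1`, and `fc2` are hypothetical names that do not appear elsewhere in this notebook.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical two-layer network used only to illustrate state_dict keys."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


tiny = TinyNet()
tiny_sd = tiny.state_dict()

# Each key is "<attribute name>.<parameter name>"; each value is a tensor.
tiny_shapes = {name: tuple(tensor.shape) for name, tensor in tiny_sd.items()}
for name, shape in tiny_shapes.items():
    print(name, shape)
# fc1.weight (8, 4)
# fc1.bias (8,)
# fc2.weight (2, 8)
# fc2.bias (2,)
```

Note that `nn.Linear` stores its weight as `(out_features, in_features)`, which is why `fc1.weight` has shape `(8, 4)`.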

In [6]:
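# Get the state dict of the model.
from transformers import AutoModelForCausalLM

# Load the pretrained GPT-2 checkpoint from the Hugging Face Hub.
model_hf = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
sd_hf = model_hf.state_dict()

# Print every parameter name together with the shape of its tensor.
for k, v in sd_hf.items():
    print(k, v.shape)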

transformer.wte.weight torch.Size([50257, 768])
transformer.wpe.weight torch.Size([1024, 768])
transformer.h.0.ln_1.weight torch.Size([768])
transformer.h.0.ln_1.bias torch.Size([768])
transformer.h.0.attn.c_attn.weight torch.Size([768, 2304])
transformer.h.0.attn.c_attn.bias torch.Size([2304])
transformer.h.0.attn.c_proj.weight torch.Size([768, 768])
transformer.h.0.attn.c_proj.bias torch.Size([768])
transformer.h.0.ln_2.weight torch.Size([768])
transformer.h.0.ln_2.bias torch.Size([768])
transformer.h.0.mlp.c_fc.weight torch.Size([768, 3072])
transformer.h.0.mlp.c_fc.bias torch.Size([3072])
transformer.h.0.mlp.c_proj.weight torch.Size([3072, 768])
transformer.h.0.mlp.c_proj.bias torch.Size([768])
transformer.h.1.ln_1.weight torch.Size([768])
transformer.h.1.ln_1.bias torch.Size([768])
transformer.h.1.attn.c_attn.weight torch.Size([768, 2304])
transformer.h.1.attn.c_attn.bias torch.Size([2304])
transformer.h.1.attn.c_proj.weight torch.Size([768, 768])
transformer.h.1.attn.c_proj.bias 
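
The shapes printed above are enough to work out how many parameters one transformer block holds. The following is a small arithmetic sketch in plain Python; the shapes are copied from the `transformer.h.0.*` lines of the output, and the variable names (`block_shapes`, `block_params`, and so on) are made up for this example.

```python
from math import prod

# Shapes of the transformer.h.0.* entries, copied from the output above.
block_shapes = {
    "ln_1.weight": (768,),
    "ln_1.bias": (768,),
    "attn.c_attn.weight": (768, 2304),
    "attn.c_attn.bias": (2304,),
    "attn.c_proj.weight": (768, 768),
    "attn.c_proj.bias": (768,),
    "ln_2.weight": (768,),
    "ln_2.bias": (768,),
    "mlp.c_fc.weight": (768, 3072),
    "mlp.c_fc.bias": (3072,),
    "mlp.c_proj.weight": (3072, 768),
    "mlp.c_proj.bias": (768,),
}

# c_attn packs the query, key, and value projections into one matrix,
# so its output width is three times the embedding width.
assert 2304 == 3 * 768

# The number of elements in a tensor is the product of its dimensions.
block_params = sum(prod(shape) for shape in block_shapes.values())
print(f"one block: {block_params:,}")  # one block: 7,087,872

# The two embedding tables from the first two lines of the output.
wte_params = prod((50257, 768))
wpe_params = prod((1024, 768))
print(f"wte: {wte_params:,}")  # wte: 38,597,376
print(f"wpe: {wpe_params:,}")  # wpe: 786,432
```

On a loaded model the same per-tensor count is available as `v.numel()` for each value in `sd_hf`.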

In [5]:
for k, v in sd_hf.items():
    if k.startswith("transformer.wte"):
        print(f'This is the embedding layer: {k} of shape {v.shape}')
    elif k.startswith("transformer.wpe"):
        print(f'This is the positional encoding layer: {k} of shape {v.shape}')
    elif k.startswith("transformer.h"):
        print(f'This is the transformer block: {k} of shape {v.shape}')
    elif k.startswith("transformer.ln_f"):
        print(f'This is the final layer norm: {k} of shape {v.shape}')
    elif k.startswith("lm_head"):
        print(f'This is the final classification head: {k} of shape {v.shape}')

This is the embedding layer: transformer.wte.weight of shape torch.Size([50257, 768])
This is the positional encoding layer: transformer.wpe.weight of shape torch.Size([1024, 768])
This is the transformer block: transformer.h.0.ln_1.weight of shape torch.Size([768])
This is the transformer block: transformer.h.0.ln_1.bias of shape torch.Size([768])
This is the transformer block: transformer.h.0.attn.c_attn.weight of shape torch.Size([768, 2304])
This is the transformer block: transformer.h.0.attn.c_attn.bias of shape torch.Size([2304])
This is the transformer block: transformer.h.0.attn.c_proj.weight of shape torch.Size([768, 768])
This is the transformer block: transformer.h.0.attn.c_proj.bias of shape torch.Size([768])
This is the transformer block: transformer.h.0.ln_2.weight of shape torch.Size([768])
This is the transformer block: transformer.h.0.ln_2.bias of shape torch.Size([768])
This is the transformer block: transformer.h.0.mlp.c_fc.weight of shape torch.Size([768, 3072])
This is the 
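
The prefix checks in the cell above could also be written as a lookup table, which makes it easy to count how many entries belong to each component. This is one possible sketch, not the notebook's own code: `COMPONENT_BY_PREFIX`, `describe_key`, and `sample_keys` are hypothetical names, and the sample keys are a handful copied from the printed output so that the block runs without downloading the model.

```python
from collections import Counter

# Same prefixes and labels as the if/elif chain above, expressed as data.
COMPONENT_BY_PREFIX = {
    "transformer.wte": "embedding layer",
    "transformer.wpe": "positional encoding layer",
    "transformer.h": "transformer block",
    "transformer.ln_f": "final layer norm",
    "lm_head": "final classification head",
}


def describe_key(key):
    """Return the component label for a state-dict key, or "other" if no prefix matches."""
    for prefix, label in COMPONENT_BY_PREFIX.items():
        if key.startswith(prefix):
            return label
    return "other"


# A few keys copied from the output above (not the full state dict).
sample_keys = [
    "transformer.wte.weight",
    "transformer.wpe.weight",
    "transformer.h.0.ln_1.weight",
    "transformer.h.0.ln_1.bias",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.1.ln_1.weight",
]

# Count how many of the sample keys fall under each component.
counts = Counter(describe_key(k) for k in sample_keys)
for label, n in counts.items():
    print(f"{label}: {n}")
# embedding layer: 1
# positional encoding layer: 1
# transformer block: 4
```

To run this on the real model, replace `sample_keys` with `sd_hf.keys()`.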