#### 01: GPT-Neo-125M — Model Internals

#### 1. Setup

In [13]:
# Importing necessary libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
import warnings
import yaml
warnings.filterwarnings("ignore")

In [14]:
# Setting random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

#### 2. Loading the model

In [15]:
with open("../config.yaml", "r") as f:
    config = yaml.safe_load(f)
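
The later cells read five keys from `config.yaml`, but the file itself is not shown. The sketch below is one hypothetical way to check that structure before running the notebook, not the author's file. The helper `missing_keys` and the example values are assumptions: the model name is taken from the load report further down, while the prompt and inference values are placeholders.

```python
# Hypothetical helper: reports which dotted config keys used by later cells
# are absent from the loaded config dictionary.
REQUIRED_KEYS = [
    "model.name",
    "analysis.prompt",
    "inference.max_new_tokens",
    "inference.do_sample",
    "inference.repetition_penalty",
]


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the dotted keys in `required` that cannot be resolved in `config`."""
    missing = []
    for dotted in required:
        node = config
        for part in dotted.split("."):
            # Stop descending as soon as a level is absent or not a mapping.
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing


# Placeholder values for illustration only, not the notebook's actual settings.
example_config = {
    "model": {"name": "EleutherAI/gpt-neo-125M"},
    "analysis": {"prompt": "<your prompt here>"},
    "inference": {
        "max_new_tokens": 40,
        "do_sample": False,
        "repetition_penalty": 1.2,
    },
}

print(missing_keys(example_config))
```

Running such a check right after `yaml.safe_load` would surface a misspelled or missing key before the model is loaded, rather than as a `KeyError` several cells later.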

In [16]:
# Starting with a smaller model for quicker iteration
model_name = config['model']['name']
device = "cuda" if torch.cuda.is_available() else "cpu"

In [17]:
# Loading the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    output_attentions=True,
    output_hidden_states=True,
).to(device)
# Switching to eval mode
model.eval()

Loading weights: 100%|██████████| 160/160 [00:00<00:00, 548.16it/s, Materializing param=transformer.wte.weight]
GPTNeoForCausalLM LOAD REPORT from: EleutherAI/gpt-neo-125M
Key                                                   | Status     |  | 
------------------------------------------------------+------------+--+-
transformer.h.{0, 2, 4, 6, 8, 10}.attn.attention.bias | UNEXPECTED |  | 
transformer.h.{0...11}.attn.attention.masked_bias     | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


GPTNeoForCausalLM(
  (transformer): GPTNeoModel(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(2048, 768)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPTNeoBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPTNeoAttention(
          (attention): GPTNeoSelfAttention(
            (attn_dropout): Dropout(p=0.0, inplace=False)
            (resid_dropout): Dropout(p=0.0, inplace=False)
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=False)
            (q_proj): Linear(in_features=768, out_features=768, bias=False)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPTNeoMLP(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (c_proj): Linear(in_fe

#### 3. Model architecture

In [18]:
model_config = model.config
print("Model Architecture:")
print(f"  Model type: {model_config.model_type}")
print(f"  Number of layers: {model_config.num_layers}")
print(f"  Number of attention heads: {model_config.num_heads}")
print(f"  Hidden size: {model_config.hidden_size}")
print(f"  Vocabulary size: {model_config.vocab_size}")
print(f"  Max position embeddings: {model_config.max_position_embeddings}")
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

Model Architecture:
  Model type: gpt_neo
  Number of layers: 12
  Number of attention heads: 12
  Hidden size: 768
  Vocabulary size: 50257
  Max position embeddings: 2048

Total parameters: 125,198,592
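
The total can be cross-checked by hand from the layer shapes in the printed module tree. The sketch below is a back-of-the-envelope count, and it assumes three things the truncated printout does not show: `c_proj` maps 3072 back to 768 with a bias, a final LayerNorm follows the last block, and the output head shares its weights with `wte` so it adds no parameters of its own. The function names are hypothetical.

```python
def linear_params(n_in, n_out, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def layernorm_params(dim):
    """Scale and shift vectors of an affine LayerNorm."""
    return 2 * dim


def gpt_neo_param_count(vocab=50257, max_pos=2048, hidden=768,
                        layers=12, mlp_dim=3072):
    # Token and position embeddings (wte, wpe)
    embeddings = vocab * hidden + max_pos * hidden
    # q/k/v projections have no bias; out_proj does
    attention = (3 * linear_params(hidden, hidden, bias=False)
                 + linear_params(hidden, hidden))
    # c_fc as printed; c_proj shape is an assumption (printout is truncated)
    mlp = linear_params(hidden, mlp_dim) + linear_params(mlp_dim, hidden)
    block = layernorm_params(hidden) + attention + layernorm_params(hidden) + mlp
    # Final LayerNorm is assumed; the tied output head contributes nothing extra
    final_norm = layernorm_params(hidden)
    return embeddings + layers * block + final_norm


print(f"{gpt_neo_param_count():,}")
```

Under these assumptions the count comes to 125,198,592, the same figure as `sum(p.numel() for p in model.parameters())`. Roughly 40M of that sits in the two embedding tables and about 85M in the 12 blocks.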


#### 4. Generating output for a test prompt

In [19]:
# Sample prompt
prompt = config['analysis']['prompt']
# Tokenizing prompt into tensor for the model
inputs = tokenizer(prompt, return_tensors="pt").to(device)

In [20]:
print(f"\nPrompt tokens: {len(inputs['input_ids'][0])}")


Prompt tokens: 15


In [21]:
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=config['inference']['max_new_tokens'],
        do_sample=config['inference']['do_sample'],
        repetition_penalty=config['inference']['repetition_penalty'],
        return_dict_in_generate=True,
        output_attentions=True,
        output_hidden_states=True,
    )

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [22]:
# Decoding the generated text
decoded = tokenizer.decode(output.sequences[0], skip_special_tokens=True)
print("Generated text:\n")
print(decoded)

Generated text:

Marie Curie was a physicist who discovered radium. She was born in the United States and moved to Canada when she was eight years old, where she studied physics at the University of Toronto. She became interested in nuclear fusion theory after her father died.

Curie


In [23]:
new_tokens = output.sequences[0][len(inputs['input_ids'][0]):]
print(f"New tokens: {len(new_tokens)}")

New tokens: 40


#### 5. Summary

In [24]:
print(f"Layers: {model_config.num_layers}")
print(f"Hidden size: {model_config.hidden_size}")
print(f"Generation:")
print(f"Prompt tokens: {len(inputs['input_ids'][0])}")
print(f"Generated tokens: {len(new_tokens)}")
print(f"Total generation steps: {len(output.attentions)}")
print(f"Captured Internals:")
print(f"Attention tensors per step: {len(output.attentions[0])} layers")
print(f"Hidden state tensors per step: {len(output.hidden_states[0])} layers")
print(f"Attention shape: {output.attentions[0][0].shape}")
print(f"Hidden state shape: {output.hidden_states[0][0].shape}")

Layers: 12
Hidden size: 768
Generation:
Prompt tokens: 15
Generated tokens: 40
Total generation steps: 40
Captured Internals:
Attention tensors per step: 12 layers
Hidden state tensors per step: 13 layers
Attention shape: torch.Size([1, 12, 15, 15])
Hidden state shape: torch.Size([1, 15, 768])


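- GPT-Neo-125M: 12 layers, 12 attention heads, hidden size 768, ~125M parameters
- The prompt tokenizes to 15 tokens, and the model generated 40 new tokens
- For each generation step we capture:
  - 12 attention tensors, shaped [batch, heads, seq_len, seq_len] at the first step
  - 13 hidden state tensors (embedding output + 12 layers), shaped [batch, seq_len, hidden_size] at the first step
  - The shapes printed above are for step 0 only; later steps may cover just the newest token if key/value caching is enabled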
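
The shapes printed above come from step 0, which processes the whole prompt. As a rough guide to what later steps look like, the sketch below assumes `generate` runs with its key/value cache enabled, so every step after the first feeds only the newest token through the model. The helper names are hypothetical, and the results should be confirmed against `output.attentions[step][layer].shape` in the notebook.

```python
def expected_attention_shape(step, prompt_len, batch=1, heads=12, use_cache=True):
    """Expected (batch, heads, query_len, key_len) for one layer at a given step."""
    # Keys cover the prompt plus every token generated before this step.
    key_len = prompt_len + step
    # With caching, only the newest token is used as a query after step 0.
    query_len = 1 if (use_cache and step > 0) else key_len
    return (batch, heads, query_len, key_len)


def expected_hidden_shape(step, prompt_len, batch=1, hidden=768, use_cache=True):
    """Expected (batch, query_len, hidden) for one layer at a given step."""
    query_len = 1 if (use_cache and step > 0) else prompt_len + step
    return (batch, query_len, hidden)


for step in (0, 1, 39):
    print(step,
          expected_attention_shape(step, prompt_len=15),
          expected_hidden_shape(step, prompt_len=15))
```

If this assumption holds, an analysis that needs a full square attention map over the final sequence would have to stitch the per-step rows together, or rerun a single forward pass over the complete generated sequence.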

The generated text is fluent but factually incorrect (for example, it claims Curie was born in the United States and studied in Toronto). This motivates the attribution analysis in notebooks 02-04.

Next: 02 How do the 144 attention heads (12 layers x 12 heads) specialise?