### 01: Transformer Internals - Baseline Analysis

- Extract and examine attention weights and hidden states from GPT-Neo-125M during text generation.
- This notebook establishes a baseline understanding of transformer internals by tracking how representations evolve across layers and how attention focuses during autoregressive generation.

#### 1. Setup

In [1]:
# Importing necessary libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
import warnings
warnings.filterwarnings("ignore")


In [2]:
# Setting random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

#### 2. Loading the model

In [3]:
# Starting with a smaller model for quicker iteration
model_name = "EleutherAI/gpt-neo-125M"
device = "cuda" if torch.cuda.is_available() else "cpu"

In [4]:
# Loading the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    output_attentions=True,
    output_hidden_states=True,
).to(device)
# from_pretrained() already returns the model in evaluation mode,
# so an explicit model.eval() call is not needed here

The following generation flags are not valid and may be ignored: ['output_attentions', 'output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
GPTNeoForCausalLM LOAD REPORT from: EleutherAI/gpt-neo-125M
Key                                                   | Status     |
------------------------------------------------------+------------+
transformer.h.{0...11}.attn.attention.masked_bias     | UNEXPECTED |
transformer.h.{0, 2, 4, 6, 8, 10}.attn.attention.bias | UNEXPECTED |

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


#### 3. Model architecture

In [5]:
config = model.config
print("Model Architecture:")
print(f"  Model type: {config.model_type}")
print(f"  Number of layers: {config.num_layers}")
print(f"  Number of attention heads: {config.num_heads}")
print(f"  Hidden size: {config.hidden_size}")
print(f"  Vocabulary size: {config.vocab_size}")
print(f"  Max position embeddings: {config.max_position_embeddings}")
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

Model Architecture:
  Model type: gpt_neo
  Number of layers: 12
  Number of attention heads: 12
  Hidden size: 768
  Vocabulary size: 50257
  Max position embeddings: 2048

Total parameters: 125,198,592
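
The parameter total can be cross-checked by hand from the configuration values printed above. The sketch below is a minimal reconstruction, not the library's own accounting. It assumes the standard GPT-Neo block layout: query, key, and value projections without bias, an output projection with bias, a 4x-wide MLP with biases, two LayerNorms per block, one final LayerNorm, and an output head that shares its weights with the token embedding.

```python
# Sketch: rebuild the parameter count from assumed layer shapes (see lead-in).
vocab_size, max_positions, hidden, num_layers = 50257, 2048, 768, 12
intermediate = 4 * hidden  # assumed MLP width (3072)


def linear(n_in, n_out, bias=True):
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


layer_norm = 2 * hidden  # one scale vector and one shift vector

# Token embeddings plus learned position embeddings
embeddings = vocab_size * hidden + max_positions * hidden

# q, k, v projections (no bias) plus the output projection (with bias)
attention = 3 * linear(hidden, hidden, bias=False) + linear(hidden, hidden)
mlp = linear(hidden, intermediate) + linear(intermediate, hidden)
block = layer_norm + attention + layer_norm + mlp

# The last term is the final LayerNorm; the tied output head adds nothing.
total = embeddings + num_layers * block + layer_norm
print(f"{total:,}")  # 125,198,592
```

Under these assumptions the result matches the reported total, with about 85 million parameters in the twelve blocks and about 40 million in the embeddings.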


#### 4. Generating output for a test prompt

In [6]:
# Sample prompt
prompt = "Why does phone battery drain faster over time? Answer in 2 detailed points"
# Tokenizing prompt into tensor for the model
inputs = tokenizer(prompt, return_tensors="pt").to(device)

In [7]:
# Number of tokens in the prompt
print(f"\nPrompt tokens: {len(inputs['input_ids'][0])}")



Prompt tokens: 14


In [8]:
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=40,
        do_sample=False,  # greedy decoding
        repetition_penalty=10.0,
        return_dict_in_generate=True,
        output_attentions=True,
        output_hidden_states=True,
    )



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
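
The `repetition_penalty=10.0` setting has a large effect on greedy decoding. The sketch below illustrates the usual rule as I understand it: the logit of every token already in the sequence is divided by the penalty when positive and multiplied by it when negative. The function name and the toy logits are hypothetical and only serve the illustration.

```python
import numpy as np


def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Lower the scores of tokens that already appear in the sequence.

    Positive logits are divided by `penalty` and negative logits are
    multiplied by it, so both become less likely when penalty > 1.
    """
    adjusted = logits.copy()
    for token_id in set(seen_token_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty
        else:
            adjusted[token_id] *= penalty
    return adjusted


logits = np.array([4.0, 3.5, -1.0, 0.5])  # toy vocabulary of 4 tokens
seen = [0, 2]                             # tokens already in the sequence
adjusted = apply_repetition_penalty(logits, seen, penalty=10.0)

# Token 0 drops from 4.0 to 0.4 and token 2 from -1.0 to -10.0
print(adjusted)
# The greedy choice moves from token 0 to token 1
print(int(np.argmax(logits)), int(np.argmax(adjusted)))
```

With a penalty this large, any token that has already appeared is very unlikely to be chosen again, which should be kept in mind when reading the generated text.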


In [9]:
# Decoding the generated text
decoded = tokenizer.decode(output.sequences[0], skip_special_tokens=True)
print("Generated text:\n")
print(decoded)

Generated text:

Why does phone battery drain faster over time? Answer in 2 detailed points

The answer to this question is yes. The battery life of a smartphone depends on the battery capacity and the charging time. If you have a phone that has a built-in charger, then


In [10]:
new_tokens = output.sequences[0][len(inputs['input_ids'][0]):]
print(f"New tokens: {len(new_tokens)}")

New tokens: 40


#### 5. Understanding the attention tensor

In [11]:
print("Number of layers per step:", len(output.attentions[0]))

Number of layers per step: 12


In [12]:
print(f"  Number of generation steps: {len(output.attentions)}")

  Number of generation steps: 40


In [13]:
# Attention tensor shape for the first layer at the first generation step
# Format: [batch_size, num_heads, query_len, key_len]
print("Attention shape (layer 0):", output.attentions[0][0].shape)

Attention shape (layer 0): torch.Size([1, 12, 14, 14])


In [14]:
print("\nAttention shapes for first generation step:")
for layer_idx in [0, config.num_layers // 2, config.num_layers - 1]:
    shape = output.attentions[0][layer_idx].shape
    print(f"  Layer {layer_idx:2d}: {shape}")


Attention shapes for first generation step:
  Layer  0: torch.Size([1, 12, 14, 14])
  Layer  6: torch.Size([1, 12, 14, 14])
  Layer 11: torch.Size([1, 12, 14, 14])
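
The shapes above describe only the first generation step, where the whole prompt is processed at once. The sketch below mocks the nested layout of `output.attentions` with zero-filled NumPy arrays. It assumes key/value caching is active, so that every later step carries a single query row over all tokens seen so far. The helper names are hypothetical.

```python
import numpy as np

num_layers, num_heads, prompt_len, new_tokens = 12, 12, 14, 40


def mock_generation_attentions():
    """Build zero-filled arrays with the shapes assumed for cached generation."""
    steps = []
    for step in range(new_tokens):
        query_len = prompt_len if step == 0 else 1  # only the newest token after step 0
        key_len = prompt_len + step                 # every token seen so far
        layers = tuple(
            np.zeros((1, num_heads, query_len, key_len)) for _ in range(num_layers)
        )
        steps.append(layers)
    return tuple(steps)


def last_query_row(attentions, step, layer, head):
    """Attention paid by the newest position at `step` to all visible positions."""
    return attentions[step][layer][0, head, -1, :]


attentions = mock_generation_attentions()
print(len(attentions), len(attentions[0]))           # 40 12
print(attentions[0][0].shape)                        # (1, 12, 14, 14)
print(attentions[1][0].shape)                        # (1, 12, 1, 15)
print(last_query_row(attentions, 39, 11, 0).shape)   # (53,)
```

Indexing as `attentions[step][layer][batch, head, query, key]` and taking the last query row gives a consistent per-step view regardless of whether the step is the first or a cached one.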


In [15]:
# Examining attention values
first_layer_attn = output.attentions[0][0][0]  # [num_heads, seq_len, seq_len]
print("Attention Statistics (Layer 0, First Generation Step):")
print(f"  Min value: {first_layer_attn.min().item():.6f}")
print(f"  Max value: {first_layer_attn.max().item():.6f}")
print(f"  Mean value: {first_layer_attn.mean().item():.6f}")
# Checking that attention weights sum to 1 (along key dimension)
attn_sums = first_layer_attn.sum(dim=-1)
print(f"\nAttention weight sums (should be ~1.0):")
print(f"  Min sum: {attn_sums.min().item():.6f}")
print(f"  Max sum: {attn_sums.max().item():.6f}")
print(f"  Mean sum: {attn_sums.mean().item():.6f}")

Attention Statistics (Layer 0, First Generation Step):
  Min value: 0.000000
  Max value: 1.000000
  Mean value: 0.071429

Attention weight sums (should be ~1.0):
  Min sum: 1.000000
  Max sum: 1.000000
  Mean sum: 1.000000


- The minimum of 0.0 comes from the causal mask: every position assigns zero weight to the positions after it.

- The maximum of 1.0 is also structural. The first token can attend only to itself, so its single weight is exactly 1.0.

- The mean of 0.071429 equals 1/14. Each row sums to 1.0 over 14 key positions, so the mean is fixed by the sequence length and says nothing about how sharply attention is focused.

- The row sums of 1.0 confirm that the weights are softmax-normalized over the key dimension. Judging how concentrated the attention is requires per-row measures, such as entropy or the largest weight in each row, which these summary statistics do not provide.
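
To see why these statistics are structural, the following sketch applies a causal mask and a row-wise softmax to random scores. It uses NumPy and synthetic values rather than the model's tensors, so it only demonstrates the arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len = 14

# Arbitrary query-key scores; any values give the same min, max, and mean
scores = rng.normal(size=(seq_len, seq_len))

# Causal mask: position i may not attend to positions j > i
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf

# Numerically stable row-wise softmax
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(weights.min())   # 0.0, from the masked positions
print(weights.max())   # 1.0, from the first row
print(weights.mean())  # approximately 0.0714, i.e. 1 / 14
```

Any causal attention matrix over 14 positions produces the same three numbers, whatever the scores are.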

#### 6. Understanding hidden states

In [16]:
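# Each step holds num_layers + 1 tensors: index 0 is the embedding output,
# and index i (i >= 1) is the output of transformer block i - 1
print(f"  Number of generation steps: {len(output.hidden_states)}")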
print(f"  Number of layers per step: {len(output.hidden_states[0])}")
print(f"\nFirst generation step, layer 0:")
print(f"  Shape: {output.hidden_states[0][0].shape}")
print(f"  Format: [batch_size, seq_len, hidden_size]")
print(f"  Interpretation: [1, sequence_length, {config.hidden_size}]")
print("\nHidden state shapes for first generation step:")
for layer_idx in [0, config.num_layers // 2, config.num_layers - 1]:
    shape = output.hidden_states[0][layer_idx].shape
    print(f"  Layer {layer_idx:2d}: {shape}")

  Number of generation steps: 40
  Number of layers per step: 13

First generation step, layer 0:
  Shape: torch.Size([1, 14, 768])
  Format: [batch_size, seq_len, hidden_size]
  Interpretation: [1, sequence_length, 768]

Hidden state shapes for first generation step:
  Layer  0: torch.Size([1, 14, 768])
  Layer  6: torch.Size([1, 14, 768])
  Layer 11: torch.Size([1, 14, 768])


In [17]:
print("Hidden State Statistics Across Layers (First Generation Step):")
print("\nLayer | Mean     | Std      | Min       | Max")
print("-" * 55)

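# Index 0 is the embedding output; index i (i >= 1) is the output of block i - 1.
# Stepping by 3 samples indices 0, 3, 6 and 9 and skips the final state at index 12.
for layer_idx in range(0, config.num_layers, 3):
    hidden = output.hidden_states[0][layer_idx][0]  # [seq_len, hidden_size]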

    print(f"{layer_idx:5d} | {hidden.mean().item():8.4f} | "
          f"{hidden.std().item():8.4f} | {hidden.min().item():9.4f} | "
          f"{hidden.max().item():8.4f}")

Hidden State Statistics Across Layers (First Generation Step):

Layer | Mean     | Std      | Min       | Max
-------------------------------------------------------
    0 |   0.0258 |   0.3594 |   -4.6523 |   3.3027
    3 |   0.7326 |  52.5746 | -1760.0094 | 490.0633
    6 |  -0.0571 | 104.3220 | -6698.4536 | 453.7579
    9 |  -0.0499 | 105.8652 | -6768.7222 | 345.0670


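- Activations grow sharply with depth: the standard deviation rises from about 0.36 at index 0 (the embedding output) to about 105 at index 9, and the minimum reaches roughly -6769.

- Means stay small relative to the spread, so the growth is in scale rather than offset. The range is strongly asymmetric (minima near -6700 against maxima of a few hundred), which suggests that a small number of large negative values dominates it. Each block applies LayerNorm (`ln_1`, `ln_2`) before its attention and MLP sublayers, so these raw magnitudes are rescaled before use, but the extremes are worth tracking in later analyses.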
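
The effect of normalization on such extremes can be illustrated with synthetic data. The sketch below applies a plain LayerNorm, without the learned scale and shift, to random activations that contain one large negative dimension. The outlier position and magnitude are hypothetical and are not taken from the model.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis, without the learned scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
hidden = rng.normal(scale=1.0, size=(14, 768))  # synthetic activations
hidden[:, 5] = -6000.0                          # one hypothetical outlier dimension

normed = layer_norm(hidden)
print(f"before: std={hidden.std():.1f}, min={hidden.min():.1f}")
print(f"after:  std={normed.std():.3f}, min={normed.min():.1f}")
```

After normalization every token vector has zero mean and unit variance, so the sublayers see inputs on a fixed scale. The outlier dimension still dominates the normalized vector, which is one reason extreme values are worth monitoring.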

In [18]:
# Comparing the embedding output (index 0) with the final hidden state (index -1)
input_repr = output.hidden_states[0][0][0, -1, :]  # Last token, embedding output
output_repr = output.hidden_states[0][-1][0, -1, :]  # Last token, final state (after the final LayerNorm)
# Computing L2 norms
input_norm = torch.norm(input_repr, p=2).item()
output_norm = torch.norm(output_repr, p=2).item()

# Computing cosine similarity
cos_sim = torch.nn.functional.cosine_similarity(
    input_repr.unsqueeze(0),
    output_repr.unsqueeze(0)
).item()
print("\nInput vs. Output Layer Comparison (Last Token):")
print(f"  Layer 0 norm: {input_norm:.4f}")
print(f"  Layer {config.num_layers-1} norm: {output_norm:.4f}")
print(f"  Cosine similarity: {cos_sim:.4f}")
print(f"  Interpretation: {'Similar direction' if cos_sim > 0.5 else 'Different direction'}")


Input vs. Output Layer Comparison (Last Token):
  Layer 0 norm: 9.1253
  Layer 11 norm: 74.3957
  Cosine similarity: -0.0157
  Interpretation: Different direction


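The final state has a norm about eight times larger than the embedding output (74.40 vs. 9.13) and is almost orthogonal to it (cosine similarity -0.0157). After twelve blocks the last-token vector retains essentially no directional overlap with its input embedding, which is consistent with the representation being rewritten to predict the next token rather than to describe the current one. Note that the final entry of `hidden_states` is taken after the model's final LayerNorm, so its norm is not directly comparable to the raw per-layer statistics above.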
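
A single input-versus-output comparison hides where the direction changes. The sketch below computes the cosine similarity between the first vector and every later one in a stack. It runs on a synthetic random walk, not on model data, and the function name is hypothetical. With real data the stack would presumably be built from the last-token vectors of `output.hidden_states[0]`.

```python
import numpy as np


def cosine_to_first(vectors):
    """Cosine similarity between row 0 and every row of an [n, d] array."""
    unit = vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)
    return unit @ unit[0]


rng = np.random.default_rng(0)
# Synthetic stand-in for one token's vector at 13 depths: a random walk in 768 dims
steps = rng.normal(size=(13, 768))
trajectory = np.cumsum(steps, axis=0)

sims = cosine_to_first(trajectory)
norms = np.linalg.norm(trajectory, axis=-1)
for depth, (sim, norm) in enumerate(zip(sims, norms)):
    print(f"{depth:2d}  cos={sim:+.3f}  norm={norm:7.2f}")
```

As a point of reference, two independent random vectors in 768 dimensions have a cosine similarity with a standard deviation of roughly 1/sqrt(768), about 0.036, so a value of -0.0157 is indistinguishable from that of unrelated directions.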

#### 7. Summarising key observations

In [19]:
print(f"\nModel: {config.name_or_path}")
print(f"  Layers: {config.num_layers}")
print(f"  Heads per layer: {config.num_heads}")
print(f"  Hidden size: {config.hidden_size}")
print(f"\nGeneration:")
print(f"  Prompt tokens: {len(inputs['input_ids'][0])}")
print(f"  Generated tokens: {len(new_tokens)}")
print(f"  Total generation steps: {len(output.attentions)}")
print(f"\nCaptured Internals:")
print(f"  Attention tensors per step: {len(output.attentions[0])} layers")
print(f"  Hidden state tensors per step: {len(output.hidden_states[0])} layers")
print(f"  Attention shape: {output.attentions[0][0].shape}")
print(f"  Hidden state shape: {output.hidden_states[0][0].shape}")


Model: EleutherAI/gpt-neo-125M
  Layers: 12
  Heads per layer: 12
  Hidden size: 768

Generation:
  Prompt tokens: 14
  Generated tokens: 40
  Total generation steps: 40

Captured Internals:
  Attention tensors per step: 12 layers
  Hidden state tensors per step: 13 layers
  Attention shape: torch.Size([1, 12, 14, 14])
  Hidden state shape: torch.Size([1, 14, 768])

**Architecture**: 12-layer GPT-Neo transformer with hidden size 768 and 12 attention heads per layer.

**Generation setup**: 14 input tokens, 40 tokens generated autoregressively.


**Conclusions**

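- Attention weights are correctly normalized: every row sums to 1.0. The summary statistics at the first step (min 0.0, max 1.0, mean 1/14) follow from the causal mask and the sequence length, so they do not by themselves show how sharply attention is focused. Per-row measures such as entropy are needed for that.

- Hidden state statistics change strongly with depth: the standard deviation grows from about 0.36 at the embedding output to about 105 by index 9, with extreme negative values near -6700, and the final last-token state is nearly orthogonal to its input embedding.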
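
One way to quantify attention focus in follow-up work is the entropy of each attention row. The sketch below is a minimal version on two hand-built distributions over 14 positions. Both have the same mean weight of 1/14, yet their entropies differ clearly. The function name is hypothetical.

```python
import numpy as np


def attention_entropy(row):
    """Shannon entropy (in nats) of one attention distribution."""
    row = np.asarray(row, dtype=float)
    nonzero = row[row > 0]  # treat 0 * log(0) as 0
    return float(-(nonzero * np.log(nonzero)).sum())


seq_len = 14
uniform = np.full(seq_len, 1.0 / seq_len)    # attention spread evenly
peaked = np.array([0.9] + [0.1 / 13] * 13)   # attention mostly on one token

print(attention_entropy(uniform), np.log(seq_len))  # equal: the maximum entropy
print(attention_entropy(peaked))                    # much lower
```

Low entropy indicates attention concentrated on few tokens, and entropy near log(n) indicates attention spread evenly over n visible tokens. Because of the causal mask, n differs per row, so comparisons across rows should account for the number of visible positions.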

**This baseline serves as a foundation for subsequent interpretability analyses.**