## Interpretability of Modern LLMs: Beyond GPT-2

### Outline:
1. [Introduction](#introduction)
2. [TransformerLens on GPT-2 (Baseline Capabilities)](#basecap)
3. [Architecture Gap Analysis](#arch)
4. [Experimental Failures: Applying TransformerLens to LLaMA](#fail)
5. [Prototyping Model-Agnostic Interpretability Components](#proto)
6. [Requirements for a Next-Generation Library](#requirement)
7. [Feasibility Roadmap for the Glassbox-LLMs Team](#roadmap)

#### <a name="introduction"></a>Introduction:
Mechanistic interpretability has made impressive progress on small transformer models, particularly those with GPT-2–style architectures. Tools such as TransformerLens allow researchers to inspect intermediate activations, trace causal pathways, and reverse-engineer small-scale behaviors with surprising precision. However, these methods break down when we move to the modern generation of large language models: GPT-3, LLaMA, Mistral, and similar architectures.

The gap is not merely one of size. Modern LLMs differ from GPT-2 in fundamental architectural ways: they use *RMSNorm* instead of *LayerNorm*, *rotary position embeddings* instead of learned absolute ones, *gated* and *deeper MLPs*, *custom activation functions* (such as SiLU in place of GELU), and in some cases mixture-of-experts or other non-GPT-2 components. These differences invalidate many of the assumptions baked into current interpretability tooling. In addition, the scale of contemporary models introduces *combinatorial complexity*: there are far more layers, heads, and possible interactions between them than simple tracing techniques can cover reliably. Closed-source or partially restricted weights further limit direct inspection.
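To make one of these differences concrete, the sketch below contrasts the two normalization schemes in plain Python. It is a simplified illustration, not either model's actual implementation: the learned scale and bias parameters are omitted, and the function names and `eps` value are assumptions chosen for the example.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/bias: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """RMSNorm without learned scale: divide by the root mean square, no centering."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 4.0]
shifted = [v + 10.0 for v in x]  # the same vector plus a constant offset

ln_a, ln_b = layer_norm(x), layer_norm(shifted)
rms_a, rms_b = rms_norm(x), rms_norm(shifted)

# LayerNorm removes the offset, so both inputs give the same output.
print(all(abs(a - b) < 1e-6 for a, b in zip(ln_a, ln_b)))
# RMSNorm keeps the offset, so the outputs differ.
print(any(abs(a - b) > 1e-3 for a, b in zip(rms_a, rms_b)))
```

The practical consequence is that any analysis step that assumes a mean-centered residual stream before the normalization layer cannot be carried over to an RMSNorm model unchanged.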

As a result, interpretability tools designed for GPT-2 do not generalize to GPT-3-class models, even when weights are available. The field currently lacks a *model-agnostic interpretability framework* capable of working across diverse architectures and scales.

This notebook aims to systematically investigate that problem.

We begin by demonstrating what TransformerLens does well on GPT-2 and why those successes rely on architectural simplicity. We then analyze the structural differences between GPT-2 and modern models like LLaMA, showing exactly where existing tools fail. Through empirical experiments in which we attempt to load, hook into, manipulate, and interpret larger models, we expose the specific points at which current interpretability workflows break down.

Building on these insights, we prototype components of a next-generation library. This includes generalized hooks, adaptable layer registries, and standardized internal representations designed to work across open-source LLM families.
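As a rough illustration of what an adaptable layer registry could look like, the sketch below maps canonical component names to architecture-specific module paths. Everything here is hypothetical: `LayerSpec`, `REGISTRY`, and `resolve` are names invented for this example, and the path templates assume Hugging Face-style module naming, which may differ between library versions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerSpec:
    """Templates for locating one architecture's submodules by dotted path."""
    attn: str
    mlp: str
    pre_attn_norm: str

# Hypothetical registry. The templates assume Hugging Face-style module
# names and should be checked against the actual model before use.
REGISTRY = {
    "gpt2": LayerSpec(
        attn="transformer.h.{i}.attn",
        mlp="transformer.h.{i}.mlp",
        pre_attn_norm="transformer.h.{i}.ln_1",
    ),
    "llama": LayerSpec(
        attn="model.layers.{i}.self_attn",
        mlp="model.layers.{i}.mlp",
        pre_attn_norm="model.layers.{i}.input_layernorm",
    ),
}

def resolve(arch: str, component: str, layer: int) -> str:
    """Map a canonical (component, layer) pair to an architecture-specific path."""
    if arch not in REGISTRY:
        raise KeyError(f"no layer spec registered for architecture {arch!r}")
    template = getattr(REGISTRY[arch], component)
    return template.format(i=layer)

# The same canonical request resolves to different paths per architecture.
print(resolve("gpt2", "mlp", 3))
print(resolve("llama", "mlp", 3))
```

Keeping the architecture-specific strings in one table means that hook and patching code can be written once against the canonical names, and supporting a new model family becomes a matter of adding a registry entry.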

Finally, we outline a realistic roadmap for a university-level research team to build such a library, drawing on open-source models and modern interpretability techniques.

In short, this project addresses a critical gap in today’s interpretability ecosystem:
How do we build tools that keep pace with the rapidly evolving architecture of large language models?

*This notebook provides both a technical answer and a path forward.*

#### <a name="basecap"></a>TransformerLens on GPT-2: Baseline Capabilities
Before we can understand why modern LLMs break today’s interpretability tools, we need a clear picture of what those tools do well, and why GPT-2 is uniquely compatible with them. TransformerLens was designed around the GPT-2 architecture, and as a result it provides a remarkably transparent view into the model’s internal mechanisms. This section establishes the “success case” against which later failures will be compared.

**2.1 Why GPT-2 Is an Ideal Interpretability Baseline**
GPT-2 has a relatively simple, uniform transformer architecture:
-  Standard LayerNorm
-  GELU-activated MLP blocks
-  Learned positional embeddings
-  No gating, no RMSNorm, no rotary embeddings, no MoE layers
-  Consistent module names and parameter shapes across layers 
<br>

These design choices make GPT-2 easy to instrument, easy to patch, and easy to reason about. TransformerLens directly mirrors this architecture: every submodule in the model has a named hook point, and its internals map cleanly to the textbook transformer diagram.
Modern LLMs diverge from this simplicity, so this baseline is what makes the later failures visible.

**2.2 What TransformerLens Enables on GPT-2**
TransformerLens supports several interpretability techniques, each relying on reliable access to intermediate activations:

**1. Activation Lens**
Reprojects hidden states at any layer back into vocabulary logits to see “what the model is already thinking.”
Useful for revealing early vs late-layer semantic processing.

**2. Logit Lens**
A lightweight variant of Activation Lens applied to MLP outputs or residual streams, allowing inspection of partial computations.
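
The projection of hidden states into vocabulary logits described in items 1 and 2 can be sketched with a toy example. This is not TransformerLens code: the three-token vocabulary, the unembedding matrix `W_U`, the hidden states, and the RMS-style `final_norm` stand-in are all made up for illustration.

```python
import math

VOCAB = ["cat", "dog", "car"]

# Toy unembedding matrix: one row per hidden dimension, one column per token.
W_U = [
    [2.0, 0.0, -1.0],
    [0.0, 2.0, -1.0],
]

def final_norm(h, eps=1e-5):
    """Stand-in for the model's final normalization (assumed: plain RMS scaling)."""
    rms = math.sqrt(sum(v * v for v in h) / len(h) + eps)
    return [v / rms for v in h]

def lens(h):
    """Project one hidden state to vocabulary logits: norm(h) @ W_U."""
    h = final_norm(h)
    return [sum(h[d] * W_U[d][t] for d in range(len(h))) for t in range(len(VOCAB))]

def top_token(h):
    """Return the vocabulary token with the highest projected logit."""
    logits = lens(h)
    return VOCAB[max(range(len(VOCAB)), key=logits.__getitem__)]

# Made-up residual-stream states for one position after each of three layers.
residual_by_layer = [[-0.5, -0.4], [0.2, 0.9], [0.1, 2.0]]
predictions = [top_token(h) for h in residual_by_layer]
print(predictions)
```

Reading the top token layer by layer shows where in the depth of the model the prediction settles, which is the early- versus late-layer comparison mentioned above.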

**3. Attention Visualization**
Clean extraction of attention patterns, head-by-head or token-by-token.
GPT-2’s uniform attention structure makes these visualizations straightforward.
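
For reference, the quantity being visualized is the per-head matrix of attention weights. A minimal sketch of how it is computed for a single causal head is shown below; the query and key vectors are invented, and the function names are assumptions for this example rather than any library's API.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pattern(queries, keys):
    """Causal attention weights for one head.

    Row i holds softmax(q_i . k_j / sqrt(d_head)) over positions j <= i;
    positions j > i are masked out and shown as 0.0.
    """
    d_head = len(queries[0])
    pattern = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head)
            for k in keys[: i + 1]
        ]
        weights = softmax(scores)
        pattern.append(weights + [0.0] * (len(keys) - i - 1))
    return pattern

# Made-up query/key vectors for a three-token prompt with d_head = 2.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pattern = attention_pattern(queries, keys)

# Print the pattern as a small text grid: rows are queries, columns are keys.
for row in pattern:
    print(" ".join(f"{w:.2f}" for w in row))
```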

**4. Activation Patching**
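Causal intervention on hidden states:
-  Run the source prompt and cache its activations
-  Run the target prompt
-  Replace a chosen activation in one run with the corresponding activation from the other

Comparing the patched output with the unpatched one reveals which layers/heads encode specific information.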
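
The patching procedure can be sketched on a toy two-layer "model" written in plain Python. This is an illustration of the idea under stated assumptions, not TransformerLens code: `LAYERS`, `run`, and `patch_component` are hypothetical names, and the layers are arbitrary arithmetic chosen so that the effect is easy to follow by hand.

```python
# Toy model: each layer is a named function on a two-element activation.
LAYERS = [
    ("layer0", lambda v: [v[0] + v[1], v[1]]),
    ("layer1", lambda v: [2 * v[0], v[1] - 1]),
]

def run(x, hooks=None, cache=None):
    """Run the toy model.

    `hooks` maps a layer name to a function that may replace that layer's
    output; `cache`, if given, records every (possibly patched) activation.
    """
    hooks = hooks or {}
    acts = x
    for name, layer in LAYERS:
        acts = layer(acts)
        if name in hooks:
            acts = hooks[name](acts)
        if cache is not None:
            cache[name] = list(acts)
    return acts

def patch_component(index, clean_value):
    """Build a hook that overwrites one coordinate with its clean-run value."""
    def hook(acts):
        patched = list(acts)
        patched[index] = clean_value
        return patched
    return hook

clean_input, corrupt_input = [1.0, 2.0], [1.0, 0.0]

clean_cache = {}
clean_out = run(clean_input, cache=clean_cache)
corrupt_out = run(corrupt_input)

# Patch each coordinate of layer0 in turn and record the first output value.
restored = {}
for idx in range(2):
    hook = patch_component(idx, clean_cache["layer0"][idx])
    restored[idx] = run(corrupt_input, hooks={"layer0": hook})[0]

print(clean_out[0], corrupt_out[0], restored)
```

In this toy setup, patching coordinate 0 of `layer0` restores the clean value of the first output while patching coordinate 1 does not, which is the kind of localization signal activation patching is used to obtain.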

**5. Causal Scrubbing**
A more principled causal method for isolating mechanisms by systematically replacing components of the computation graph.

These tools operate smoothly only because GPT-2 exposes predictable internal signals that TransformerLens is built to capture.

## Under Construction