### Glassbox LLMs: Scalable Interpretability for Modern Architectures

### Outline:
1. Introduction
2. TransformerLens on GPT-2 (Baseline Capabilities)
3. Architecture Gap Analysis
4. Experimental Failures: Applying TransformerLens to LLaMA
5. Prototyping Model-Agnostic Interpretability Components
6. Requirements for a Next-Generation Library
7. Blueprint of a Python Library

### Executive Summary:
**Business Impact**: The rapid evolution of large language models has created a critical gap: while model capabilities grow rapidly, our ability to understand and interpret these models has stagnated. This research identifies the architectural barriers that prevent interpretability tools from scaling to modern LLMs and proposes a systematic solution.

**Key Insight**: The problem isn't just model size; it's the fundamental architectural changes (RoPE, RMSNorm, SwiGLU) that break assumptions built into current interpretability frameworks such as TransformerLens.

In [None]:
# Let's start by importing essential libraries used for analyzing and visualizing transformer models.
import transformer_lens
import transformer_lens.utils as utils
from transformer_lens import HookedTransformer
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm
import plotly.express as px

print("PyTorch version:", torch.__version__)

ModuleNotFoundError: No module named 'transformer_lens'

### TransformerLens on GPT-2: Baseline Capabilities
Before we can understand why modern LLMs break today’s interpretability tools, we need a clear picture of what those tools do well, and why GPT-2 is uniquely compatible with them. TransformerLens was designed around the GPT-2 architecture, and as a result it provides a remarkably transparent view into the model’s internal mechanisms. This section establishes the “success case” against which later failures will be compared.

**2.1 Why GPT-2 Is an Ideal Interpretability Baseline**
GPT-2 has a relatively simple, uniform transformer architecture:
-  Standard LayerNorm
-  GELU-activated MLP blocks
-  Learned positional embeddings
-  No gating, no RMSNorm, no rotary embeddings, no MoE layers
-  Consistent module names and parameter shapes across layers 
<br>

These design choices make GPT-2 easy to instrument, easy to patch, and easy to reason about. TransformerLens directly mirrors this architecture: every submodule in the model has a named hook point, and its internals map cleanly to the textbook transformer diagram.
Modern LLMs diverge from this simplicity, which is why this baseline matters: it makes clear exactly what breaks later.
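One of the divergences listed above can be made concrete. The following is a minimal sketch in plain Python, with no learned scale or bias, that contrasts LayerNorm with RMSNorm. The function names and the input vector are illustrative assumptions, not code from any library. It shows that LayerNorm centers its input while RMSNorm only rescales it, so any analysis step that relies on a zero-mean normalized activation would need to be revisited for RMSNorm-based models.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned parameters: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """RMSNorm without learned parameters: divide by the root mean square, no centering."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 6.0]  # made-up activation vector
ln_out = layer_norm(x)
rms_out = rms_norm(x)

# LayerNorm output has (numerically) zero mean; RMSNorm output keeps a nonzero mean.
print("LayerNorm mean:", sum(ln_out) / len(ln_out))
print("RMSNorm mean:  ", sum(rms_out) / len(rms_out))
```

Both outputs have roughly unit scale, but only the LayerNorm output is centered.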

**2.2 What TransformerLens Enables on GPT-2**
TransformerLens supports several interpretability techniques on GPT-2, each of which relies on dependable access to intermediate activations:

**1. Activation Lens**
Reprojects the hidden state at any layer back into vocabulary logits to see “what the model is already thinking.”
Useful for contrasting early-layer and late-layer semantic processing.

**2. Logit Lens**
A lightweight variant of Activation Lens applied to MLP outputs or residual streams, allowing inspection of partial computations.
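The projection described above can be sketched in a few lines. This is a toy illustration in plain Python, not TransformerLens code: the four-word vocabulary, the unembedding matrix `W_U`, and the per-layer residual vectors are all made-up values. It follows the common practice of applying a final normalization before multiplying by the unembedding matrix.

```python
import math

VOCAB = ["cat", "dog", "Paris", "London"]

# Hypothetical unembedding matrix: one row per vocabulary item, d_model = 3.
W_U = [
    [1.0, 0.0, 0.0],  # cat
    [0.0, 1.0, 0.0],  # dog
    [0.0, 0.0, 1.0],  # Paris
    [0.0, 0.7, 0.7],  # London
]

# Made-up residual-stream vectors for the last token position after each layer.
resid_by_layer = [
    [0.9, 0.8, 0.1],
    [0.3, 0.9, 0.6],
    [0.1, 0.2, 1.5],
]

def layer_norm(x, eps=1e-5):
    """Final normalization applied before unembedding (no learned parameters here)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def project_to_logits(resid):
    """Map an intermediate residual vector to one logit per vocabulary item."""
    normed = layer_norm(resid)
    return [sum(w * h for w, h in zip(row, normed)) for row in W_U]

top_tokens = []
for layer, resid in enumerate(resid_by_layer):
    logits = project_to_logits(resid)
    best = max(range(len(VOCAB)), key=lambda i: logits[i])
    top_tokens.append(VOCAB[best])
    print(f"layer {layer}: top token = {VOCAB[best]}")
```

Reading the top token layer by layer shows how the intermediate "prediction" changes with depth in this toy setup.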

**3. Attention Visualization**
Clean extraction of attention patterns, head-by-head or token-by-token.
GPT-2’s uniform attention structure makes these visualizations straightforward.
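For reference, the quantity being visualized is the per-head softmax over scaled query-key scores under a causal mask. The sketch below computes it for a single hypothetical head in plain Python. The query and key vectors are made-up values, and `causal_attention_pattern` is an illustrative helper, not a library function. In TransformerLens the equivalent matrix is, to the best of our knowledge, read from the activation cache under a name such as `blocks.0.attn.hook_pattern`.

```python
import math

def causal_attention_pattern(queries, keys):
    """Return a [query_pos][key_pos] matrix of attention weights for one head."""
    d_head = len(queries[0])
    pattern = []
    for i, q in enumerate(queries):
        # Causal mask: position i may only attend to positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head)
            for k in keys[: i + 1]
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (len(keys) - i - 1)
        pattern.append(row)
    return pattern

tokens = ["The", "cat", "sat"]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up values
keys = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]     # made-up values

pattern = causal_attention_pattern(queries, keys)
for tok, row in zip(tokens, pattern):
    print(f"{tok:>4}", " ".join(f"{p:.2f}" for p in row))
```

Each row sums to one and has zeros above the diagonal, which is the structure a heatmap of a GPT-2 head would show.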

**4. Activation Patching**
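Causal intervention on hidden states:
-  Run the source prompt and cache its activations
-  Run the target prompt
-  Replace a chosen activation in one run with the corresponding activation from the other

Comparing the patched output with the unpatched output reveals which layers and heads encode specific information.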
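The three steps above can be sketched with a toy computation in place of a real model. Everything here is hypothetical: the `run` function, the site names, and the numeric inputs are invented for illustration and do not correspond to GPT-2 or to the TransformerLens API. The sketch caches activations from a source run, patches them one at a time into a target run, and reports how much of the source output each patch restores.

```python
def run(inputs, patch=None):
    """Toy forward pass with three named activation sites.

    patch: optional dict mapping a site name to a replacement value.
    Returns the output and a dict of all recorded activations.
    """
    subject, distractor = inputs
    acts = {}

    def site(name, value):
        if patch and name in patch:
            value = patch[name]  # overwrite this activation with the patched value
        acts[name] = value
        return value

    head_a = site("layer0.head_a", 2.0 * subject)     # mostly carries the subject
    head_b = site("layer0.head_b", 0.5 * distractor)  # mostly carries the distractor
    out = site("layer1.mlp", head_a + 0.1 * head_b)
    return out, acts

source_inputs = (1.0, 1.0)    # stand-in for the source prompt
target_inputs = (-1.0, 0.0)   # stand-in for the target prompt

source_out, source_acts = run(source_inputs)
target_out, _ = run(target_inputs)

# Patch each source activation into the target run, one site at a time.
recovery = {}
for name, source_value in source_acts.items():
    patched_out, _ = run(target_inputs, patch={name: source_value})
    recovery[name] = (patched_out - target_out) / (source_out - target_out)

for name, score in recovery.items():
    print(f"{name}: restores {score:.1%} of the source output")
```

In this toy, patching `layer0.head_a` restores almost all of the source behavior while `layer0.head_b` restores almost none, which is the kind of contrast used to localize where information is encoded.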

**5. Causal Scrubbing**
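A more principled causal method for isolating mechanisms: it tests a hypothesized mechanism by systematically replacing components of the computation graph that the hypothesis treats as interchangeable, then checking whether the model's behavior is preserved.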

These tools operate smoothly only because GPT-2 exposes predictable internal signals that TransformerLens is built to capture.
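To illustrate what "predictable internal signals" means in practice, here is a minimal sketch of a named hook-point mechanism in plain Python. It is loosely modelled on the naming convention TransformerLens uses (for example `blocks.0.hook_attn_out`), but `HookPoint`, `ToyBlock`, and this `run_with_cache` are simplified stand-ins written for this example, and the "attention" and "MLP" steps are placeholder arithmetic.

```python
class HookPoint:
    """Identity pass-through that lets registered callbacks observe or replace a value."""

    def __init__(self, name):
        self.name = name
        self.hooks = []

    def __call__(self, value):
        for hook in self.hooks:
            result = hook(value, self.name)
            if result is not None:  # a hook may return a replacement value
                value = result
        return value


class ToyBlock:
    """Hypothetical layer with one hook point after each sub-step."""

    def __init__(self, index):
        self.hook_attn_out = HookPoint(f"blocks.{index}.hook_attn_out")
        self.hook_mlp_out = HookPoint(f"blocks.{index}.hook_mlp_out")

    def forward(self, resid):
        attn_out = self.hook_attn_out([0.5 * v for v in resid])    # stand-in for attention
        resid = [r + a for r, a in zip(resid, attn_out)]
        mlp_out = self.hook_mlp_out([max(0.0, v) for v in resid])  # stand-in for the MLP
        return [r + m for r, m in zip(resid, mlp_out)]


def run_with_cache(blocks, resid):
    """Run all blocks and record every hooked activation under its name."""
    cache = {}

    def save(value, name):
        cache[name] = list(value)

    hook_points = [hp for b in blocks for hp in (b.hook_attn_out, b.hook_mlp_out)]
    for hp in hook_points:
        hp.hooks.append(save)
    try:
        for block in blocks:
            resid = block.forward(resid)
    finally:
        for hp in hook_points:  # always detach the temporary hooks
            hp.hooks.remove(save)
    return resid, cache


blocks = [ToyBlock(i) for i in range(2)]
output, cache = run_with_cache(blocks, [1.0, -2.0])
print(sorted(cache))
```

Because every layer exposes the same names in the same order, analysis code can loop over layers without special cases. That uniformity is the property later sections test against newer architectures.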

In [None]:
# Environment setup: run these commands in a terminal, in this order.

# 1. Create and activate an environment with Python 3.10 (more stable for ML packages)
#conda create -n glassbox-llms python=3.10
#conda activate glassbox-llms

# 2. Make sure pip is available and up to date
#python -m ensurepip --default-pip
#python -m pip install --upgrade pip

# 3. Install all packages needed for this project
#    (pandas, plotly, and tqdm are included because the import cell above uses them)
#pip install torch transformer-lens transformers matplotlib seaborn numpy pandas plotly tqdm jupyter ipykernel