# https://distill.pub/2020/circuits/zoom-in/

# Core Motivation and Conceptual Framework of Mechanistic Interpretability

## 1. Core Motivation: “Zooming In” as a Scientific Shift

Scientific progress has historically been driven by the development of tools that enable finer inspection of complex systems. Microscopes enabled the discovery of cells, and crystallography revealed the structure of DNA. The authors argue that neural network interpretability is at a comparable inflection point.

Rather than attempting to explain entire models at a high level, they propose *zooming in* on internal components: individual neurons, directions in activation space, weights, and the connections between them. At this scale, the authors argue, neural networks stop looking like opaque artifacts and start to show tractable, interpretable, algorithmic internal structure.

---

## 2. Central Thesis

The central thesis is that neural networks contain meaningful internal structure that can be studied with scientific rigor.

Algorithms are not only implicit in the training procedure but are explicitly encoded within the learned weights. These algorithms are implemented through features that are connected into circuits, forming structured computational mechanisms inside the network.

---

## 3. Three Speculative Claims (Foundational Framework)

### Claim 1 — Features

Features are the fundamental units of neural networks.  
A feature corresponds to a direction in activation space and may, but does not necessarily, align with a single neuron. These features are often meaningful, even when their function is initially unintuitive.

Crucially, features can be empirically characterized, tested, and falsified, analogous to the study of neurons in neuroscience.
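To make "a feature is a direction in activation space" concrete, here is a minimal sketch. The 3-neuron layer, the activation values, and both direction vectors are hypothetical and chosen only for illustration; the point is that a feature's strength can be read off as a projection, whether or not the direction lines up with one neuron.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def feature_activation(activation, direction):
    """Project an activation vector onto a feature direction (scaled to unit length)."""
    norm = math.sqrt(dot(direction, direction))
    return dot(activation, direction) / norm

# Hypothetical 3-neuron layer and two hypothetical feature directions.
neuron_aligned = [0.0, 1.0, 0.0]   # coincides with neuron 1
distributed = [1.0, 1.0, 0.0]      # spread across neurons 0 and 1

activation = [2.0, 2.0, 5.0]
aligned_value = feature_activation(activation, neuron_aligned)
distributed_value = feature_activation(activation, distributed)
```

The neuron-aligned case simply returns neuron 1's activation, while the distributed case mixes two neurons, which is why inspecting neurons one at a time can miss a feature.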

---

### Claim 2 — Circuits

Features do not operate independently; they are connected by weights into circuits.  
A circuit is a small subgraph of the network that implements a specific computation.

In the cases the authors study, circuits frequently exhibit symmetry, modularity, and algorithmic clarity, and the computation they perform can often be read directly from the weights.
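As a rough illustration of reading a computation from weights, the sketch below treats one downstream feature as a ReLU over weighted upstream features and lists its strongest excitatory and inhibitory inputs. The feature names and weight values are invented for the example and do not come from any real model.

```python
def relu(x):
    return max(0.0, x)

# Hypothetical weights from named upstream features into one downstream feature.
WEIGHTS = {
    "line_0deg": 0.9,
    "line_45deg": 0.7,
    "line_90deg": -0.6,
    "fur_texture": 0.05,
}

def downstream_activation(upstream, weights=WEIGHTS, bias=0.0):
    """Weighted sum of upstream feature activations followed by a ReLU."""
    pre = bias + sum(w * upstream.get(name, 0.0) for name, w in weights.items())
    return relu(pre)

def strongest_connections(weights=WEIGHTS, k=2):
    """Return the k most excitatory and k most inhibitory upstream features."""
    excitatory = sorted((n for n in weights if weights[n] > 0),
                        key=lambda n: -weights[n])[:k]
    inhibitory = sorted((n for n in weights if weights[n] < 0),
                        key=lambda n: weights[n])[:k]
    return excitatory, inhibitory

excitatory, inhibitory = strongest_connections()
response = downstream_activation({"line_0deg": 1.0, "line_90deg": 1.0})
```

Sorting connections by sign and magnitude is only the simplest possible reading; real circuits also carry spatial structure in their weights.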

---

### Claim 3 — Universality

Similar features and circuits tend to reappear across different models and tasks.  
This suggests convergent learning rather than arbitrary internal representations.

If this claim holds broadly, interpretability could develop a shared vocabulary and taxonomy of features and circuits, enabling cumulative scientific progress.

---

## 4. Evidence for Claim 1: Features Exist and Are Understandable

### Example A — Curve Detectors

Curve detectors appear consistently in early layers of vision models. They detect curved boundaries with specific orientations and form families spanning all orientations.

These features are supported by multiple independent lines of evidence:
- Feature visualization
- Dataset activation analysis
- Synthetic stimuli
- Joint tuning curves
- Circuit construction
- Downstream usage
- Hand-designed reimplementations

Together, these satisfy evidentiary standards comparable to those used in visual neuroscience.
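The synthetic-stimuli and tuning-curve lines of evidence can be sketched as a sweep over stimulus orientation. The `toy_curve_detector` below is a hypothetical stand-in for a real unit (its cosine response, preferred angle, and sharpness are assumptions); with a real model one would render curves at each orientation and record the neuron's activation instead.

```python
import math

def toy_curve_detector(orientation_deg, preferred_deg=90.0, sharpness=4.0):
    """Hypothetical unit: response falls off with angular distance from its preferred orientation."""
    delta = math.radians(orientation_deg - preferred_deg)
    return max(0.0, math.cos(delta)) ** sharpness

def tuning_curve(unit, step_deg=15):
    """Record the unit's response to synthetic stimuli at evenly spaced orientations."""
    return {angle: unit(angle) for angle in range(0, 360, step_deg)}

curve = tuning_curve(toy_curve_detector)
peak_angle = max(curve, key=curve.get)
```

A single sharp peak at the preferred orientation, with no response to the opposite orientation, is the kind of pattern a claimed curve detector should produce; a flat or multi-peaked curve would count against the interpretation.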

---

### Example B — High–Low Frequency Detectors

These features detect low-frequency patterns on one side of a receptive field and high-frequency patterns on the other. Although less intuitive initially, they become simple once understood.

They likely serve as a heuristic for detecting object boundaries, particularly when the background is out of focus, which shows that interpretability extends beyond features that are obvious in advance.

---

### Example C — Pose-Invariant Dog Head Detector

This is a high-level semantic feature responding to dog heads across multiple poses and orientations. Feature visualization, dataset evidence, and synthetic rendering all confirm the interpretation.

This example demonstrates that even abstract semantic concepts are mechanistically accessible within neural networks.

---

## 5. Polysemantic Neurons and Superposition

Not every neuron corresponds to a single clean feature. Many are polysemantic: they respond to several unrelated inputs, such as cat faces, car fronts, and cat legs.

The authors hypothesize that this arises from *superposition*: the network represents more features than it has neurons, so several features share the same neurons to save representational capacity. Superposition makes the representation more efficient but complicates interpretability, revealing a tension between capacity optimization and feature disentanglement.
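A toy sketch of the capacity tradeoff, under assumptions that are not taken from the article: three features are packed into a 2-neuron layer along directions 120 degrees apart, and each feature is read back by projecting onto its direction and applying a ReLU.

```python
import math

# Three hypothetical feature directions in a 2-neuron layer, 120 degrees apart.
DIRECTIONS = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
              for a in (0, 120, 240)]

def encode(feature_values):
    """Write all features into the two neurons as a weighted sum of directions."""
    x = sum(v * d[0] for v, d in zip(feature_values, DIRECTIONS))
    y = sum(v * d[1] for v, d in zip(feature_values, DIRECTIONS))
    return (x, y)

def decode(neurons):
    """Read each feature back by projection; the ReLU clips negative interference."""
    return [max(0.0, neurons[0] * d[0] + neurons[1] * d[1]) for d in DIRECTIONS]

sparse = decode(encode([1.0, 0.0, 0.0]))  # only one feature active
dense = decode(encode([1.0, 1.0, 0.0]))   # two features active at once
```

When only one feature is active it is recovered cleanly, but when two are active together each is read back at half strength. That is the sense in which superposition trades capacity against clean, separable features.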

---

## 6. Evidence for Claim 2: Circuits Are Real and Structured

### Circuit 1 — Curve Detection Circuit

Curve detectors are constructed from aligned lower-level line and curve detectors. Weight matrices form explicit geometric patterns corresponding to curves and exhibit equivariance, rotating with feature orientation.

This demonstrates that weights encode spatial algorithms rather than arbitrary numerical values.
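Equivariance of this kind can be checked mechanically. In the sketch below, the weight from an upstream detector to a downstream detector depends only on their relative orientation; the cosine profile and the eight orientation bins are assumptions made for the example, not measured weights.

```python
import math

N = 8  # hypothetical number of evenly spaced orientation bins

def weight(out_idx, in_idx, n=N):
    """Weight depends only on the angular offset between the two detectors."""
    return math.cos(2 * math.pi * (in_idx - out_idx) / n)

W = [[weight(o, i) for i in range(N)] for o in range(N)]

def rotate(row, k):
    """Cyclically shift a row to the right by k positions."""
    k %= len(row)
    return row[-k:] + row[:-k] if k else list(row)

def is_equivariant(matrix, tol=1e-9):
    """True if row k equals row 0 rotated by k, i.e. weights rotate with orientation."""
    return all(abs(a - b) <= tol
               for k, row in enumerate(matrix)
               for a, b in zip(row, rotate(matrix[0], k)))

equivariant = is_equivariant(W)
```

A weight matrix that passes this check is fully described by a single row, which is one reason equivariant circuits are easier to understand than their parameter count suggests.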

---

### Circuit 2 — Oriented Dog Head Circuit

The network maintains separate pathways for left-facing and right-facing dog heads, which inhibit each other to sharpen selectivity. A later layer unions these pathways to produce pose-invariant detection.

This implements a form of case-splitting followed by conditional recombination, showing that gradient descent can learn structured conditional computation.
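The case-split-then-union pattern can be sketched in a few lines. The inhibition strength and the evidence values below are hypothetical; the sketch only mirrors the structure described above, not the actual weights of any model.

```python
def relu(x):
    return max(0.0, x)

def oriented_heads(left_evidence, right_evidence, inhibition=0.5):
    """Two orientation-specific pathways that suppress each other."""
    left = relu(left_evidence - inhibition * right_evidence)
    right = relu(right_evidence - inhibition * left_evidence)
    return left, right

def pose_invariant_head(left_evidence, right_evidence):
    """Union over the two cases: fire if either oriented pathway fires."""
    left, right = oriented_heads(left_evidence, right_evidence)
    return relu(left + right)

left_only = pose_invariant_head(1.0, 0.0)
right_only = pose_invariant_head(0.0, 1.0)
mostly_left = oriented_heads(1.0, 0.2)
```

Either orientation alone drives the pose-invariant unit equally, while weak evidence for the opposite orientation is suppressed by the mutual inhibition.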

---

### Circuit 3 — Cars in Superposition

A car detector that looks clean at one layer (mixed4c in InceptionV1) is spread by the next layer across several neurons that mainly detect dogs. The authors read this as the network packing the car feature into superposition rather than as accidental entanglement, and it suggests how polysemantic neurons can still support reliable downstream computation.

---

## 7. Circuit Motifs (Recurring Patterns)

Across different circuits, repeated abstract patterns appear, including:
- Equivariance
- Unioning over cases
- Superposition

These motifs are analogous to recurring circuit motifs in systems biology and may be more fundamental than individual circuits. Understanding motifs provides leverage across many architectures.

---

## 8. Evidence for Claim 3: Universality

Certain low-level features, such as curve detectors and high-low frequency detectors, appear consistently across:
- Different architectures
- Different datasets
- Different training runs

Prior work on representational similarity supports this observation. While current evidence is suggestive rather than conclusive, broad universality would enable the construction of a “periodic table” of features.
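One simple way to probe universality, offered here as an assumption rather than the authors' method, is to record unit activations from two models on the same inputs and match each unit to its most correlated counterpart. The unit names and activation values below are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length activation traces."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def best_matches(model_a, model_b):
    """For each unit in model A, find the most correlated unit in model B."""
    matches = {}
    for name_a, acts_a in model_a.items():
        matches[name_a] = max(model_b, key=lambda name_b: pearson(acts_a, model_b[name_b]))
    return matches

# Hypothetical activations of named units on the same five inputs.
model_a = {"a_curve": [0.9, 0.1, 0.8, 0.0, 0.2],
           "a_freq": [0.1, 0.7, 0.0, 0.9, 0.1]}
model_b = {"b_unit0": [0.2, 1.4, 0.1, 1.7, 0.3],
           "b_unit1": [1.8, 0.3, 1.5, 0.1, 0.5]}

matches = best_matches(model_a, model_b)
```

High correlation is only suggestive. A stronger universality claim would also require that the matched units are built from analogous circuits.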

---

## 9. Interpretability as a Natural Science

Interpretability is currently pre-paradigmatic, in the sense described by Kuhn. There is no consensus on objects of study, evaluation standards, or success criteria.

Circuits offer a path forward by enabling:
- Small, falsifiable claims
- Empirical testing
- Predictive power, such as targeted weight edits

Circuits may therefore serve as the epistemic foundation for understanding entire models.
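A circuit claim becomes falsifiable once it predicts the effect of a targeted weight edit. The sketch below uses a hypothetical three-input unit: the claim "input 0 excites this unit" predicts that zeroing the corresponding weight removes the response to a stimulus that only activates input 0.

```python
def relu(x):
    return max(0.0, x)

def unit(inputs, weights):
    """A single ReLU unit over weighted inputs."""
    return relu(sum(w * x for w, x in zip(weights, inputs)))

weights = [1.0, 0.0, -0.5]   # hypothetical learned weights
stimulus = [1.0, 0.0, 0.0]   # activates only input 0

before = unit(stimulus, weights)

edited = list(weights)
edited[0] = 0.0              # targeted edit: cut the claimed excitatory connection
after = unit(stimulus, edited)

# The claim survives only if the response was present and the edit removes it.
claim_holds = before > 0.0 and after == 0.0
```

If the response had survived the edit, the claimed connection could not be what drives the unit, and the circuit description would need revising.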

---

## 10. Closing Perspective

Early microscopy was qualitative, slow, and controversial, yet ultimately transformative. Mechanistic interpretability may follow a similar trajectory.

Understanding neural networks from the inside out is feasible, scientific, and necessary. Circuits research reframes deep learning models as objects of empirical investigation rather than inscrutable artifacts.

---

## One-Sentence Synthesis

Neural networks are structured computational systems whose features and circuits implement discoverable algorithms, and studying them at the appropriate scale can ground interpretability as a true natural science.
