# Circuits in Mechanistic Interpretability

## The Fundamentals of Mechanistic Interpretability

Mechanistic interpretability is a 'natural science'-inspired approach to neural network interpretability, based on zooming in on parts of the network that are small enough to permit concrete, falsifiable predictions. The current paradigm is loosely based on three postulates from Olah et al., 2020:

1. **Features:** Features are the fundamental unit of neural networks. They correspond to directions. These features can be rigorously studied and understood.

2. **Circuits:** Features are connected by weights, forming circuits. These circuits can also be rigorously studied and understood.

3. **Universality:** Analogous features and circuits form across models and tasks.

Let's briefly unpack these postulates in light of developments after 2020 (based on my very superficial current understanding).

**Features**

'Feature' is a fuzzy term with two main, related uses:

1. A property of the input: An interpretable concept in a given input to a model, such as a curve in an image, the language of a text, or more or less anything that a human could say about the input.

2. A fundamental unit of the model: A localized, interpretable concept that the model has learned. In this view, a feature is viewed as an *idealized node* of the network which we expect to light up when the concept is present in the input.

Since we study models *in general*, we will mostly use the second definition.

The idea that features correspond to *directions in the activation space of a given layer* is a postulate about *how the model represents meaning*, motivated by the fact that the model consists mostly of linear operations. This is essentially the *linear representation hypothesis*. Note that while there are (infinitely) many more directions than dimensions in activation space, we expect the model to minimize interference between features (and hence keep them mostly orthogonal, unless the features never co-occur in the training data).
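
To get a feel for why "more directions than dimensions" need not mean heavy interference, here is a minimal sketch in plain Python. It samples random unit vectors and measures how far from orthogonal they are on average. The function names, the dimensions and the use of random directions are all assumptions made for illustration; a trained model does not choose its feature directions at random.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly at random by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_abs_cosine(dim, n_vectors, seed=0):
    """Average |cosine similarity| over all pairs of n_vectors random directions."""
    rng = random.Random(seed)
    vecs = [random_unit_vector(dim, rng) for _ in range(n_vectors)]
    total, pairs = 0.0, 0
    for i in range(n_vectors):
        for j in range(i + 1, n_vectors):
            # Vectors are unit length, so the dot product is the cosine.
            total += abs(sum(a * b for a, b in zip(vecs[i], vecs[j])))
            pairs += 1
    return total / pairs


# In both cases we pack twice as many directions as there are dimensions.
low_dim = mean_abs_cosine(dim=4, n_vectors=8)
high_dim = mean_abs_cosine(dim=64, n_vectors=128)
print(f"mean |cos| in 4 dimensions:  {low_dim:.3f}")
print(f"mean |cos| in 64 dimensions: {high_dim:.3f}")
```

In this toy setting the average overlap shrinks as the dimension grows, even though the ratio of directions to dimensions stays the same. That is the geometric intuition behind expecting many nearly orthogonal feature directions in a wide layer.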

This inspires us to embed the activation space of a given layer in a much larger space, in such a way that the features represented in that layer point in the basis directions of the larger space. This is called *disentangling polysemanticity* and is precisely the motivation behind sparse autoencoders (SAEs), which have empirically been shown to work. I suppose this is strong evidence for the first postulate being directionally correct.

In this (idealized) view, the preimages of the basis vectors of the SAE space represent the features in a given layer. Concretely, this gives us, for each feature, *a linear combination of activations in a single layer, fixed up to scaling*. We can then remodel the network graph in terms of the features, making the notion of a feature as an *idealized node* concrete.
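
The idealized picture can be sketched with a tiny hand-built example: three hypothetical features packed into a 2-dimensional activation space, and an encoder/decoder pair that maps activations to a 3-dimensional code in which each feature has its own basis direction. Everything here (the directions, the names `encode` and `decode`, the absence of biases) is an assumption for illustration. A real sparse autoencoder learns its weights by minimizing reconstruction error plus a sparsity penalty, rather than having them written down by hand.

```python
import math

# Three hypothetical feature directions in a 2-d activation space,
# spaced 120 degrees apart (more features than dimensions).
FEATURE_DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]


def encode(activation):
    """Map a 2-d activation to 3 non-negative feature coefficients.

    Each coefficient is ReLU(activation . direction), so directions that
    point away from the activation are clipped to zero.
    """
    return [
        max(0.0, sum(a * d for a, d in zip(activation, direction)))
        for direction in FEATURE_DIRECTIONS
    ]


def decode(coefficients):
    """Rebuild the 2-d activation as a weighted sum of feature directions."""
    return tuple(
        sum(c * direction[i] for c, direction in zip(coefficients, FEATURE_DIRECTIONS))
        for i in range(2)
    )


# An activation in which only feature 1 is present, with strength 0.8.
activation = tuple(0.8 * x for x in FEATURE_DIRECTIONS[1])
codes = encode(activation)
reconstruction = decode(codes)

print("feature coefficients:", codes)
print("original activation: ", activation)
print("reconstruction:      ", reconstruction)
```

Here the decoder direction of a feature plays the role of its preimage: it is the linear combination of layer activations that corresponds to one basis vector of the larger space. Note that this toy encoder only reconstructs exactly when a single feature is active; when features co-occur, their overlap causes interference, which is part of what training an SAE has to deal with.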


**Circuits**

This is also a fuzzy term, getting at how features interact to implement algorithms in the model. To make this somewhat more precise: if we have identified certain features in the network and recast the network graph in terms of these, a circuit is a set of *tightly connected nodes (across layers) implementing an interpretable algorithm* (speculative definition).

For example, the features 'wheel', 'car body' and 'car window' can form a car-detection circuit. The final node of this circuit is then a feature in its corresponding layer, which activates if cars are present in the input. The 'wheel' feature is also the final node of a 'wheel-detection circuit', which could be the union of a 'right-facing wheel' feature and a 'left-facing wheel' feature. These in turn are final nodes of a circuit containing circle detection, which is made from curve detection, which is a union of directed curve detectors, and so on.
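
As a rough sketch of this hierarchy, the snippet below wires a few hypothetical feature activations together with hand-picked weights. The function names, weights and bias are assumptions chosen for illustration and are not taken from any real model; the "union" is modelled simply as a ReLU over a sum of the two orientation-specific detectors.

```python
def relu(x):
    return max(0.0, x)


def wheel_feature(left_facing_wheel, right_facing_wheel):
    """Union of two orientation-specific detectors: fires if either one fires."""
    return relu(1.0 * left_facing_wheel + 1.0 * right_facing_wheel)


def car_feature(wheel, car_body, car_window):
    """Fires only when several car parts are present together.

    The bias of -1.5 is a hand-picked threshold: a single part with
    activation 1.0 is not enough to activate the car feature.
    """
    return relu(1.0 * wheel + 1.0 * car_body + 1.0 * car_window - 1.5)


# Input containing a left-facing wheel, a car body and a window.
wheel = wheel_feature(left_facing_wheel=1.0, right_facing_wheel=0.0)
car_present = car_feature(wheel, car_body=1.0, car_window=1.0)

# Input containing only a wheel (say, a bicycle wheel).
wheel_only = car_feature(wheel, car_body=0.0, car_window=0.0)

print("car feature with all parts: ", car_present)
print("car feature with wheel only:", wheel_only)
```

Each function is one node in the recast graph, and the weights between them are the edges of the circuit. The output of `wheel_feature` is both the final node of the wheel circuit and an input to the car circuit, mirroring the nesting described above.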

Motifs

Scaling

**Universality**

Universality is the idea that we can understand something about models in general by studying particular models. The usefulness of mechanistic interpretability is proportional to the degree to which this holds. (Example: imagine if all animals had *completely unrelated* biology and cell structure. We would then mostly study human cells and domestic animals. Similarly, mech interp wants to find general circuits which we expect all models to converge on, assuming that 'optimal, reachable algorithms' for a given problem exist; if this is not true, we will have to study models in isolation.)