## Analyzing circuits using TransformerLens

TransformerLens is a Python library that lets us peek under the hood of a large language model (LLM) and see how its components affect one another and the model's output. This lets us do mechanistic interpretability on our models and datasets, particularly for GPT-2-style models (source: https://transformerlensorg.github.io/TransformerLens/).

In this section, we will introduce some methods for analyzing and intervening on circuits within our model. 

### General methods from TransformerLens


We start by importing the relevant libraries and initializing our model, GPT-2 small. We also define the text prompt that will be used for our demonstration.

In [None]:
from transformer_lens.hook_points import HookPoint
from transformer_lens import (
    ActivationCache,
    FactoredMatrix,
    HookedTransformer,
    HookedTransformerConfig,
    utils,
)

import circuitsvis as cv

import einops
import torch as t

import numpy as np

# Pick the best available device: a GPU if one is available, otherwise the CPU.
device = utils.get_device()

# Load the pretrained GPT-2 small weights onto that device.
gpt2small = HookedTransformer.from_pretrained("gpt2-small", device=device)

gpt2_text = "Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on taskspecific datasets."

Loaded pretrained model gpt2-small into HookedTransformer


Next we print the token IDs that the prompt is split into, followed by the model's logits and its cached activations. Note that to_tokens prepends a beginning-of-sequence token (ID 50256) by default. The logits are the final output of the model, before a softmax turns them into probabilities for the next token. The activations (such as the residual stream, the attention queries, keys and values, the attention patterns, and the MLP outputs) are stored in gpt2_cache, which is an "ActivationCache" object.
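
To make the step from logits to next-token probabilities concrete, here is a minimal sketch in plain Python, so it runs without loading the model. The four-token vocabulary and the logit values are made up for illustration and are not taken from GPT-2.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over a made-up 4-token vocabulary (illustrative only).
toy_vocab = ["cat", "dog", "the", "."]
toy_logits = [2.0, 1.0, 0.5, -1.0]

probs = softmax(toy_logits)

# The most likely next token is the one with the highest probability,
# which is also the one with the highest logit.
next_token = toy_vocab[probs.index(max(probs))]
print(next_token)
```

For the real model, the same operation would be applied to the logits at the last position of the prompt, for example gpt2_logits[0, -1].softmax(dim=-1), which gives one probability per entry in GPT-2's vocabulary.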

In [2]:
gpt2_tokens = gpt2small.to_tokens(gpt2_text)
print(gpt2_tokens)

gpt2_logits, gpt2_cache = gpt2small.run_with_cache(gpt2_tokens, remove_batch_dim=True)

print(gpt2_logits)
print(gpt2_cache)

tensor([[50256, 35364,  3303,  7587,  8861,    11,   884,   355,  1808, 18877,
            11,  4572, 11059,    11,  3555, 35915,    11,   290, 15676,  1634,
            11,   389,  6032, 10448,   351, 28679,  4673,   319,  8861,   431,
          7790, 40522,    13]])
tensor([[[ 7.5261, 11.1214,  7.8919,  ..., -3.1299, -3.3873,  8.5934],
         [ 4.4261,  5.2600,  2.1652,  ...,  0.4973, -2.2385,  4.2680],
         [ 8.7212,  7.1920,  3.2337,  ...,  2.7380,  0.2224,  7.2776],
         ...,
         [ 4.8818,  6.6848,  2.6623,  ...,  4.4694, -1.5021,  5.2525],
         [ 7.4928,  8.2157,  3.5746,  ...,  1.5305, -0.2660,  7.6360],
         [ 4.6322,  5.0510,  5.8710,  ..., -1.5175, -6.7258, 13.3053]]],
       grad_fn=<ViewBackward0>)
ActivationCache with keys ['hook_embed', 'hook_pos_embed', 'blocks.0.hook_resid_pre', 'blocks.0.ln1.hook_scale', 'blocks.0.ln1.hook_normalized', 'blocks.0.attn.hook_q', 'blocks.0.attn.hook_k', 'blocks.0.attn.hook_v', 'blocks.0.attn.hook_attn_scores', 'block