# Othello Blank F6 Analysis
By Shea Cardozo

## Summary
**For this project we're going to analyze how Othello-GPT computes that the Cell F6 is blank.** 

1.   We attempt to pinpoint the layer at which the model concludes that F6 is blank. We conclude that while the model has usually finished determining that F6 is blank around Layer 3, this is not consistent, and the model appears to do computation related to F6's state at multiple points within Layers 0-3.

2.   We attempt to isolate which Attention Heads are relevant to computing whether F6 is blank. We find that while this computation seems to be distributed among many attention heads, Head L1H7 seems to provide the strongest contribution to the blank probe direction.

3.   We attempt to analyze Head L1H7 further. We produce a spectrum plot that shows L1H7 typically (but not always) provides a positive contribution in the F6-Blank direction when F6 is indeed blank and a negative contribution in the direction when F6 is filled.

4. We conduct Activation Patching to try to narrow down the source of the computation further. We find a very large effect when patching MLP0 and little effect for the other MLP and Attention layers.

5. We conduct further Activation Patching on the Neurons of MLP0. We find the neurons L0N398, L0N827 and L0N1449 all have large influence in determining whether F6 is blank. While L0N398's interpretation is unclear, we find L0N827 appears to capture F6 == Blank, and L0N1449 appears to capture G5 == Blank.

We conclude with some speculation regarding how the model identifies blank cells in general.

## Setup

(This may take a while)

### Downloads

In [None]:
%pip install transformer_lens==1.2.1
%pip install git+https://github.com/neelnanda-io/neel-plotly
!git clone https://github.com/likenneth/othello_world


### Setting up Environment

In [None]:
# Imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import einops
from fancy_einsum import einsum
import tqdm.auto as tqdm
import random
from pathlib import Path
import plotly.express as px
from torch.utils.data import DataLoader
import pandas as pd

from typing import List, Union, Optional
from functools import partial
import copy

import itertools
from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
import dataclasses
import datasets
from IPython.display import HTML

import transformer_lens
import transformer_lens.utils as utils
from transformer_lens.hook_points import (
    HookedRootModule,
    HookPoint,
)  # Hooking utilities
from transformer_lens import HookedTransformer, HookedTransformerConfig, FactoredMatrix, ActivationCache

torch.set_grad_enabled(False)

from neel_plotly import line, scatter, imshow, histogram

import sys

OTHELLO_ROOT = Path("/content/othello_world/")
sys.path.append(str(OTHELLO_ROOT/"mechanistic_interpretability"))
from mech_interp_othello_utils import plot_single_board, to_string, to_int, int_to_label, string_to_label, OthelloBoardState

### Loading Model

In [None]:
cfg = HookedTransformerConfig(
    n_layers = 8,
    d_model = 512,
    d_head = 64,
    n_heads = 8,
    d_mlp = 2048,
    d_vocab = 61,
    n_ctx = 59,
    act_fn="gelu",
    normalization_type="LNPre"
)

model = HookedTransformer(cfg)

sd = utils.download_file_from_hf("NeelNanda/Othello-GPT-Transformer-Lens", "synthetic_model.pth")
model.load_state_dict(sd)

Downloading synthetic_model.pth:   0%|          | 0.00/101M [00:00<?, ?B/s]

<All keys matched successfully>

In [None]:
# Validation
sample_input = torch.tensor([[20, 19, 18, 10, 2, 1, 27, 3, 41, 42, 34, 12, 4, 40, 11, 29, 43, 13, 48, 56, 33, 39, 22, 44, 24, 5, 46, 6, 32, 36, 51, 58, 52, 60, 21, 53, 26, 31, 37, 9, 25, 38, 23, 50, 45, 17, 47, 28, 35, 30, 54, 16, 59, 49, 57, 14, 15, 55, 7]])
# The argmax of the output (ie the most likely next move from each position)
sample_output = torch.tensor([[21, 41, 40, 34, 40, 41,  3, 11, 21, 43, 40, 21, 28, 50, 33, 50, 33,  5,
         33,  5, 52, 46, 14, 46, 14, 47, 38, 57, 36, 50, 38, 15, 28, 26, 28, 59,
         50, 28, 14, 28, 28, 28, 28, 45, 28, 35, 15, 14, 30, 59, 49, 59, 15, 15,
         14, 15,  8,  7,  8]])
sample_model_output = model(sample_input).argmax(dim=-1)

print(all([a == b for a, b in zip(sample_output[0], sample_model_output[0])]))

True


### Loading Probe

In [None]:
full_linear_probe = torch.load(OTHELLO_ROOT/"main_linear_probe.pth")

### Loading Othello Games

In [None]:
board_seqs_int = torch.tensor(np.load(OTHELLO_ROOT/"board_seqs_int_small.npy"), dtype=torch.long)
board_seqs_string = torch.tensor(np.load(OTHELLO_ROOT/"board_seqs_string_small.npy"), dtype=torch.long)

num_games, length_of_game = board_seqs_int.shape
print("Number of games:", num_games,)
print("Length of game:", length_of_game)

Number of games: 100000
Length of game: 60


In [None]:
stoi_indices = [
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 29, 30, 31, 32, 33, 34, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
]
alpha = "ABCDEFGH"


def to_board_label(i):
    return f"{alpha[i//8]}{i%8}"


board_labels = list(map(to_board_label, stoi_indices))

### Loading Focus Games

In [None]:
num_games = 250
focus_games_int = board_seqs_int[:num_games]
focus_games_string = board_seqs_string[:num_games]

In [None]:
def one_hot(list_of_ints, num_classes=64):
    out = torch.zeros((num_classes,), dtype=torch.float32)
    out[list_of_ints] = 1.
    return out

focus_states = np.zeros((num_games, 60, 8, 8), dtype=np.float32)
focus_valid_moves = torch.zeros((num_games, 60, 64), dtype=torch.float32)

for i in (range(num_games)):
    board = OthelloBoardState()
    for j in range(60):
        board.umpire(focus_games_string[i, j].item())
        focus_states[i, j] = board.state
        focus_valid_moves[i, j] = one_hot(board.get_valid_moves())
        
print("focus states:", focus_states.shape)
print("focus_valid_moves", focus_valid_moves.shape)

focus states: (250, 60, 8, 8)
focus_valid_moves torch.Size([250, 60, 64])


In [None]:
focus_logits, focus_cache = model.run_with_cache(focus_games_int[:, :-1].cuda())

### Setting up Probes

In [None]:
rows = 8
cols = 8 
options = 3
black_to_play_index = 0
white_to_play_index = 1
blank_index = 0
their_index = 1
my_index = 2

linear_probe = torch.zeros(cfg.d_model, rows, cols, options, device="cuda")
linear_probe[..., blank_index] = 0.5 * (full_linear_probe[black_to_play_index, ..., 0] + full_linear_probe[white_to_play_index, ..., 0])
linear_probe[..., their_index] = 0.5 * (full_linear_probe[black_to_play_index, ..., 1] + full_linear_probe[white_to_play_index, ..., 2])
linear_probe[..., my_index] = 0.5 * (full_linear_probe[black_to_play_index, ..., 2] + full_linear_probe[white_to_play_index, ..., 1])

blank_probe = linear_probe[..., 0] - linear_probe[..., 1] * 0.5 - linear_probe[..., 2] * 0.5
my_probe = linear_probe[..., 2] - linear_probe[..., 1]

## Objective One: Which layer does the Model Conclude F6 is Blank?

### Aggregate Blank Probe Accuracy For F6 For Each Layer

We first use the linear probe to determine how far the model has come in determining whether F6 is blank at each layer.

In [None]:
layer_n = [[] for _ in range(8)]

layer_accuracy = {k:0 for k in range(8)}

for layer in range(8):
  for game_index in range(num_games):
    for move in range(5, 54):
      filled = set(int_to_label(focus_games_int[game_index, :move+1]))

      if "F6" in filled:
        continue

      residual_stream = focus_cache["resid_post", layer][game_index, move]
      probe_out = einops.einsum(residual_stream, linear_probe, "d_model, d_model row col options -> row col options")
      probabilities = probe_out.softmax(dim=-1)[..., 0]

      layer_n[layer].append(probabilities[5, 6].item())


{k: round(sum(v) / len(v), 2) for k, v in enumerate(layer_n)}

{0: 0.83, 1: 0.94, 2: 0.98, 3: 1.0, 4: 1.0, 5: 0.99, 6: 1.0, 7: 0.75}

This suggests the model finishes determining that F6 is blank somewhere around Layer 3. However, the relevant computation appears to be spread across Layers 0-3.
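
The probe readout used above can be illustrated on a toy example. The following is a minimal sketch, assuming a 3-dimensional residual stream and made-up probe weights for a single cell (the real model has `d_model = 512`); `probe_cell` is a hypothetical helper and not part of this notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probe_cell(resid, probe):
    """Project a residual vector onto one cell's probe.

    resid: list of d_model floats.
    probe: d_model rows, each with 3 entries (blank, theirs, mine).
    Returns the probabilities for (blank, theirs, mine).
    """
    n_options = len(probe[0])
    logits = [
        sum(r * row[k] for r, row in zip(resid, probe))
        for k in range(n_options)
    ]
    return softmax(logits)


# Toy values (assumed for illustration only)
resid = [1.0, -0.5, 2.0]
probe = [
    [0.8, -0.2, -0.6],
    [0.1, 0.4, -0.5],
    [0.9, -0.3, -0.6],
]

probs = probe_cell(resid, probe)
p_blank = probs[0]  # index 0 plays the role of blank_index
print(probs)
```

In the notebook the same projection is done for all 64 cells at once with `einops.einsum`, and only the `[5, 6]` entry (F6) is kept.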

### Single Move Cases

We will analyze three single moves where F6 is blank to try and gain some insight about how the model functions in this case.

#### Game 5, Move 30

In [None]:
# Board State
layer = 6
game_index = 5
move = 30

plot_single_board(int_to_label(focus_games_int[game_index, :move+1]))

In [None]:
# Probed blank probability per layer
probs = []
for layer in range(8):
  residual_stream = focus_cache["resid_post", layer][game_index, move]
  white_to_play_probe = full_linear_probe[1]
  probe_out = einops.einsum(residual_stream, white_to_play_probe, "d_model, d_model row col options -> row col options")
  probs.append(probe_out.softmax(dim=-1)[..., 0])

imshow(probs, facet_col=0, y=[i for i in "ABCDEFGH"], facet_name="Layer", title=f"Probed probability of blank per layer (Game {game_index} Move {move})", aspect="equal")

In [None]:
imshow([(focus_cache["attn_out", l][game_index, move][:, None, None] * blank_probe).sum(0) for l in range(layer+1)], facet_col=0, y=[i for i in "ABCDEFGH"], facet_name="Layer", title=f"Attention Layer Contributions to blank direction (Game {game_index} Move {move})", aspect="equal", zmin=-8, zmax=8)
imshow([(focus_cache["mlp_out", l][game_index, move][:, None, None] * blank_probe).sum(0) for l in range(layer+1)], facet_col=0, y=[i for i in "ABCDEFGH"], facet_name="Layer", title=f"MLP Layer Contributions to blank direction (Game {game_index} Move {move})", aspect="equal", zmin=-8, zmax=8)

The model seems to identify F6 is blank after Layer 3. But there seem to be contributions to the corresponding direction from Layers 0-5 across both the attention and MLP layers.

#### Game 125, Move 12

In [None]:
# Board State
layer = 6
game_index = 125
move = 12

plot_single_board(int_to_label(focus_games_int[game_index, :move+1]))

In [None]:
# Probed blank probability per layer
probs = []
for layer in range(8):
  residual_stream = focus_cache["resid_post", layer][game_index, move]
  white_to_play_probe = full_linear_probe[1]
  probe_out = einops.einsum(residual_stream, white_to_play_probe, "d_model, d_model row col options -> row col options")
  probs.append(probe_out.softmax(dim=-1)[..., 0])

imshow(probs, facet_col=0, y=[i for i in "ABCDEFGH"], facet_name="Layer", title=f"Probed probability of blank per layer (Game {game_index} Move {move})", aspect="equal")

In [None]:
imshow([(focus_cache["attn_out", l][game_index, move][:, None, None] * blank_probe).sum(0) for l in range(layer+1)], facet_col=0, y=[i for i in "ABCDEFGH"], facet_name="Layer", title=f"Attention Layer Contributions to blank (Game {game_index} Move {move})", aspect="equal", zmin=-8, zmax=8)
imshow([(focus_cache["mlp_out", l][game_index, move][:, None, None] * blank_probe).sum(0) for l in range(layer+1)], facet_col=0, y=[i for i in "ABCDEFGH"], facet_name="Layer", title=f"MLP Layer Contributions to blank (Game {game_index} Move {move})", aspect="equal", zmin=-8, zmax=8)

The model seems to identify F6 is blank after Layer 0. Again there are contributions to the corresponding direction from Layers 0-5 across both the attention and MLP layers. Unlike in the previous case there is a spike in activity in the MLP activations in Layers 4-6.

#### Game 244, Move 42

In [None]:
# Board State
layer = 6
game_index = 244
move = 42

plot_single_board(int_to_label(focus_games_int[game_index, :move+1]))

In [None]:
# Probed blank probability per layer
probs = []
for layer in range(8):
  residual_stream = focus_cache["resid_post", layer][game_index, move]
  white_to_play_probe = full_linear_probe[1]
  probe_out = einops.einsum(residual_stream, white_to_play_probe, "d_model, d_model row col options -> row col options")
  probs.append(probe_out.softmax(dim=-1)[..., 0])

imshow(probs, facet_col=0, y=[i for i in "ABCDEFGH"], facet_name="Layer", title=f"Probed probability of blank per layer (Game {game_index} Move {move})", aspect="equal")

In [None]:
imshow([(focus_cache["attn_out", l][game_index, move][:, None, None] * blank_probe).sum(0) for l in range(layer+1)], facet_col=0, y=[i for i in "ABCDEFGH"], facet_name="Layer", title=f"Attention Layer Contributions to blank (Game {game_index} Move {move})", aspect="equal", zmin=-8, zmax=8)
imshow([(focus_cache["mlp_out", l][game_index, move][:, None, None] * blank_probe).sum(0) for l in range(layer+1)], facet_col=0, y=[i for i in "ABCDEFGH"], facet_name="Layer", title=f"MLP Layer Contributions to blank (Game {game_index} Move {move})", aspect="equal", zmin=-8, zmax=8)

The model seems to identify F6 is blank after Layer 1. Again there are contributions to the corresponding direction from Layers 0-5 across both the attention and MLP layers. 

### Section Conclusion

In general the model seems to have a good idea of when F6 is blank at around Layer 2. Unfortunately, the process of computing the board state seems to take place over multiple layers of the network and there appears to be no exact layer where the model makes this determination for F6. The model instead gradually determines when F6 is blank with computation spread over these initial layers.

## Objective Two: Which Attention Heads Help Compute F6 is Blank?

We attempt to isolate attention heads of interest by measuring the contribution of each head towards the 'F6 == Blank' direction in the residual stream. We will limit our analysis to the first three layers of the model, both for simplicity and because F6's state is usually (but not always!) determined by then.

In [None]:
attention_head_resids, attention_head_labels = focus_cache.stack_head_results(layer=3, return_labels=True)

Tried to stack head results when they weren't cached. Computing head results now


#### Game 5, Move 30

In [None]:
game_index = 5
move = 30

game_attention_heads = {}

game_attention_heads[0] = attention_head_resids[:8, game_index, move, :]
game_attention_heads[1] = attention_head_resids[8:16, game_index, move, :]
game_attention_heads[2] = attention_head_resids[16:, game_index, move, :]
plot_single_board(int_to_label(focus_games_int[game_index, :move+1]))

In [None]:
for layer in range(3):
  imshow([(game_attention_heads[layer][h][:, None, None] * blank_probe).sum(0) for h in range(8)], facet_col=0, y=[i for i in "ABCDEFGH"], facet_name="Head", title=f"Attention Head Contributions to blank (Game {game_index} Move {move} Layer {layer})", aspect="equal", zmin=-3, zmax=3)

In general, we see small positive contributions to the "F6 == Blank" direction across most of the heads in each layer. We see a strong positive contribution in head L1H7.

#### Game 125, Move 12

In [None]:
game_index = 125
move = 12

game_attention_heads = {}

game_attention_heads[0] = attention_head_resids[:8, game_index, move, :]
game_attention_heads[1] = attention_head_resids[8:16, game_index, move, :]
game_attention_heads[2] = attention_head_resids[16:, game_index, move, :]
plot_single_board(int_to_label(focus_games_int[game_index, :move+1]))

In [None]:
for layer in range(3):
  imshow([(game_attention_heads[layer][h][:, None, None] * blank_probe).sum(0) for h in range(8)], facet_col=0, y=[i for i in "ABCDEFGH"], facet_name="Head", title=f"Attention Head Contributions to blank (Game {game_index} Move {move} Layer {layer})", aspect="equal", zmin=-3, zmax=3)

Various attention heads (L0H0, L1H2, L1H3, L2H3) have strong contributions to the "F6 == Blank" direction. Notably, we do not see as strong a contribution from L1H7 here.

#### Game 244, Move 42

In [None]:
game_index = 244
move = 42

game_attention_heads = {}

game_attention_heads[0] = attention_head_resids[:8, game_index, move, :]
game_attention_heads[1] = attention_head_resids[8:16, game_index, move, :]
game_attention_heads[2] = attention_head_resids[16:, game_index, move, :]
plot_single_board(int_to_label(focus_games_int[game_index, :move+1]))

In [None]:
for layer in range(3):
  imshow([(game_attention_heads[layer][h][:, None, None] * blank_probe).sum(0) for h in range(8)], facet_col=0, y=[i for i in "ABCDEFGH"], facet_name="Head", title=f"Attention Head Contributions to blank (Game {game_index} Move {move} Layer {layer})", aspect="equal", zmin=-3, zmax=3)

We see little to no positive contribution to the "F6 == Blank" direction across most of the heads in each layer. However, we see an extremely strong contribution in L1H7, and a smaller but still large contribution in L2H3.

#### Aggregate Attention Head Contribution to Blank Direction

To get a sense of each head's contributions to computing whether F6 is blank over our set of games, we average the contributions of each head in the "F6 == Blank" direction for all game moves where F6 is blank.
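
The averaging described above can be sketched in isolation. This is a toy illustration with invented two-dimensional vectors; `mean_blank_contribution` is a hypothetical helper, and the real computation projects each head's 512-dimensional output onto the F6 entry of `blank_probe`.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_blank_contribution(head_outputs, cell_is_blank, direction):
    """Average projection onto `direction`, over positions where the cell is blank."""
    projections = [
        dot(out, direction)
        for out, blank in zip(head_outputs, cell_is_blank)
        if blank
    ]
    return sum(projections) / len(projections)


# Toy data: one head's output at four positions (assumed values)
direction = [1.0, 0.0]
head_outputs = [[0.5, 0.2], [1.5, -0.3], [-2.0, 0.1], [1.0, 0.4]]
cell_is_blank = [True, True, False, True]

avg = mean_blank_contribution(head_outputs, cell_is_blank, direction)
print(avg)
```

Positions where the cell is already filled are skipped, which mirrors the `continue` in the notebook's loop.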

In [None]:
head_contribution = {k:[] for k in attention_head_labels}

for game_index in range(num_games):
  for move in range(5, 54):
    filled = set(int_to_label(focus_games_int[game_index, :move+1]))

    # Only count positions where F6 is still blank
    if "F6" in filled:
      continue

    for head_index, head_label in enumerate(attention_head_labels):
      head_out = attention_head_resids[head_index, game_index, move, :][:, None, None]
      contributions = (head_out * blank_probe).sum(0)
      head_contribution[head_label].append(contributions[5, 6].item())


{k: round(sum(v) / len(v), 2) for k, v in head_contribution.items()}

{'L0H0': 0.48,
 'L0H1': 0.31,
 'L0H2': 0.26,
 'L0H3': 0.32,
 'L0H4': 0.28,
 'L0H5': 0.26,
 'L0H6': 0.23,
 'L0H7': 0.06,
 'L1H0': -0.02,
 'L1H1': 0.24,
 'L1H2': 0.23,
 'L1H3': 0.31,
 'L1H4': 0.05,
 'L1H5': 0.29,
 'L1H6': 0.17,
 'L1H7': 0.71,
 'L2H0': 0.08,
 'L2H1': 0.59,
 'L2H2': 0.18,
 'L2H3': 0.58,
 'L2H4': 0.03,
 'L2H5': 0.1,
 'L2H6': 0.12,
 'L2H7': 0.2}

We see similar trends as we saw in the previous cases. Most of the heads in layer 0 and layer 1 have small but positive contributions.  L1H7 is a notable exception with the highest average contribution of all heads. Most of the heads in layer 2 have comparatively smaller contributions, with the exceptions of L2H1 and L2H3 which both have large contributions to the "F6 == Blank" direction.

### Section Conclusion

We find that computation related to determining that F6 is blank is distributed across most of the attention heads. A few attention heads, most notably L1H7, seem to have an outsized influence on this computation and may deserve further analysis.

## Objective 3: Analyzing Head L1H7 Further

We take a closer look at the head L1H7, which previous tests suggested might play a role in computing that F6 is blank in the relevant games.

### Spectrum Plot

We compare the contribution of L1H7's output towards the "F6 == Blank" direction in the residual stream between cases where F6 is blank and cases where F6 is not blank. If L1H7 has nothing to do with determining that F6 is blank, we would expect its output to be the same, and roughly orthogonal to the "F6 == Blank" direction, whether or not F6 is blank in the underlying board. Conversely, if it does, we would expect a positive contribution if F6 is blank in the underlying board, and a negative contribution if F6 is not blank in the underlying board.
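
The comparison behind a spectrum plot can be sketched without any plotting. The following is a toy illustration with invented contribution values; `split_by_label` and `fraction_positive` are hypothetical helpers, not part of this notebook.

```python
def split_by_label(values, labels):
    """Group values by their ground-truth label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return groups


def fraction_positive(values):
    """Share of values that are strictly positive."""
    return sum(v > 0 for v in values) / len(values)


# Toy projections onto the "F6 == Blank" direction (assumed values)
contributions = [0.9, 0.4, -0.1, 1.2, -0.8, -0.5, 0.2, -1.1]
labels = ["blank", "blank", "blank", "blank",
          "filled", "filled", "filled", "filled"]

groups = split_by_label(contributions, labels)
blank_pos = fraction_positive(groups["blank"])
filled_pos = fraction_positive(groups["filled"])
print(blank_pos, filled_pos)
```

A head that is unrelated to the feature would give similar fractions for both groups; a clear gap between them is what the histogram makes visible.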

In [None]:
head_contribution = {k: {'filled': [], 'blank': []} for k in attention_head_labels}

for game_index in range(num_games):
  for move in range(5, 54):
    filled = set(int_to_label(focus_games_int[game_index, :move+1]))
    # F6 counts as filled once it appears among the moves played so far
    label = 'filled' if "F6" in filled else 'blank'

    for head_index, head_label in enumerate(attention_head_labels):
      head_out = attention_head_resids[head_index, game_index, move, :][:, None, None]
      contributions = (head_out * blank_probe).sum(0)
      head_contribution[head_label][label].append(contributions[5, 6].item())


In [None]:
blank_contribs = head_contribution['L1H7']['blank']
filled_contribs = head_contribution['L1H7']['filled']

df = pd.DataFrame({"contributions": blank_contribs + filled_contribs, "label":["blank" for b in blank_contribs] + ["filled" for b in filled_contribs]})
px.histogram(df, x="contributions", color="label", histnorm="percent", nbins=100, title="Spectrum plot for Attention Head L1H7 testing F6==BLANK")

We see evidence that L1H7 does behave differently based on F6's status. As we would expect, we generally see a positive contribution to the "F6 == Blank" direction if F6 is blank and a negative contribution if it is not. Notably, we do observe some cases where L1H7 outputs a positive contribution to the "F6 == Blank" direction even when F6 is not blank in the underlying board, and even a very small number of cases where L1H7 makes a negative contribution when F6 is blank. This likely indicates that L1H7 processes multiple features and there are cases where this interferes with the F6 == Blank computation.

### Attention Maps

We inspect L1H7's attention maps for each of the games we investigated earlier to see if we can gain any more insight into what is going on in this head.

In [None]:
game_index = 5
imshow((focus_cache['attn', 1][game_index, 7].T / (focus_cache['attn', 1][game_index, 7].max(dim=1).values)).T, facet_col_spacing =0.017241, y=[i for i in range(59)], x=int_to_label(focus_games_int[game_index, :59]), \
       facet_name="Layer", title=f"Attention Map of Head L1H7 for Game {game_index} (Normalized by Row)", aspect="equal")


In [None]:
game_index = 125
imshow((focus_cache['attn', 1][game_index, 7].T / (focus_cache['attn', 1][game_index, 7].max(dim=1).values)).T, facet_col_spacing =0.017241, y=[i for i in range(59)], x=int_to_label(focus_games_int[game_index, :59]), \
       facet_name="Layer", title=f"Attention Map of Head L1H7 for Game {game_index} (Normalized by Row)", aspect="equal")


In [None]:
game_index = 244
imshow((focus_cache['attn', 1][game_index, 7].T / (focus_cache['attn', 1][game_index, 7].max(dim=1).values)).T, facet_col_spacing =0.017241, y=[i for i in range(59)], x=int_to_label(focus_games_int[game_index, :59]), \
       facet_name="Layer", title=f"Attention Map of Head L1H7 for Game {game_index} (Normalized by Row)", aspect="equal")

Unfortunately it is hard to pick out patterns specifically related to F6 here. Generally the head places a large amount of attention on the first token, and then as the game goes on places more emphasis on later tokens. Most of the time the head does not place a large amount of attention on the F6 token though, and there does not seem to be a strong change in behaviour after the F6 token actually appears. 
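
The row normalisation used in the attention maps above divides each query row by its largest entry, so that every row peaks at 1 and late-game rows (where attention is spread thinly) remain readable. A minimal sketch on an invented 3-token causal attention pattern, with `normalise_rows` as a hypothetical helper:

```python
def normalise_rows(attn):
    """Divide each row by its largest entry so that every row peaks at 1."""
    return [[value / max(row) for value in row] for row in attn]


# Toy causal attention pattern: rows are query positions, columns are
# key positions, and each row sums to 1 (assumed values).
attn = [
    [1.0, 0.0, 0.0],
    [0.7, 0.3, 0.0],
    [0.6, 0.1, 0.3],
]

normalised = normalise_rows(attn)
print(normalised)
```

This changes only the colour scale within each row; the relative ordering of keys for a given query is unchanged.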

### Section Conclusion

Our spectrum plot seems to give evidence that L1H7 does do meaningful computation related to determining that F6 == Blank. However, our attempts to analyze the exact nature of this contribution via attention pattern visualization were not fruitful. It is possible this head is relying on information that was moved between token residual streams in layer 0, but more work is needed to characterize this more precisely.

## Objective 4: Activation Patching

We use activation patching to try to isolate circuits and neurons within Othello-GPT that compute whether F6 is blank. We take game 5 up to the move where F6 is played, and then construct a corrupted game where G5, a different (but valid) move, is played instead.

Notably, F6 is a valid move after G5 is played in the corrupted game, and it would also be a valid move in the clean game if it were not already filled. We therefore use the change in the model's F6 log prob as our patching metric, on the assumption that it is a proxy for a change in the model's internal state.
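
The normalisation behind this metric can be sketched as a standalone function: it maps the corrupted run to 0 and the clean run to 1, so a patch that fully restores clean behaviour scores 1. This is a minimal sketch; `normalised_patching_metric` is a hypothetical name, and the two log probs are the values printed for this game later in this section.

```python
def normalised_patching_metric(patched, clean, corrupted):
    """Rescale a patched log prob so that corrupted -> 0 and clean -> 1."""
    return (patched - corrupted) / (clean - corrupted)


# F6 log probs reported for the clean and corrupted runs of game 5
clean_log_prob = -9.2919
corrupted_log_prob = -2.5760

# A hypothetical patched run that lands exactly halfway between the two
halfway_log_prob = (clean_log_prob + corrupted_log_prob) / 2

clean_score = normalised_patching_metric(
    clean_log_prob, clean_log_prob, corrupted_log_prob)
corrupted_score = normalised_patching_metric(
    corrupted_log_prob, clean_log_prob, corrupted_log_prob)
halfway_score = normalised_patching_metric(
    halfway_log_prob, clean_log_prob, corrupted_log_prob)
print(clean_score, corrupted_score, halfway_score)
```

Values above 1 or below 0 are possible: they mean the patch overshoots the clean run or pushes the model further away from it.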

In [None]:
game_index = 5
move = 35

In [None]:
clean_input = copy.deepcopy(focus_games_int[game_index, :move+1])
clean_moves = int_to_label(clean_input)                                               
" ".join(clean_moves)

'C3 E2 F2 C4 E5 C2 B3 G2 F1 D2 D1 E6 H3 F0 D5 C1 G1 F4 B1 H2 E1 A1 C0 D6 G0 H0 D0 B0 A0 A2 A3 H4 F3 E0 F5 F6'

In [None]:
plot_single_board(clean_moves)

In [None]:
corrupted_input = copy.deepcopy(focus_games_int[game_index, :move+1])
corrupted_input[-1] = to_int("G5")
corrupted_moves = int_to_label(corrupted_input)                                               

" ".join(corrupted_moves)

'C3 E2 F2 C4 E5 C2 B3 G2 F1 D2 D1 E6 H3 F0 D5 C1 G1 F4 B1 H2 E1 A1 C0 D6 G0 H0 D0 B0 A0 A2 A3 H4 F3 E0 F5 G5'

In [None]:
plot_single_board(corrupted_moves)

In [None]:
clean_logits, clean_cache = model.run_with_cache(clean_input)
corrupted_logits, corrupted_cache = model.run_with_cache(corrupted_input)

clean_log_probs = clean_logits.log_softmax(dim=-1)
corrupted_log_probs = corrupted_logits.log_softmax(dim=-1)

In [None]:
f6_index = to_int("F6")
clean_f6_log_prob = clean_log_probs[0, -1, f6_index]
corrupted_f6_log_prob = corrupted_log_probs[0, -1, f6_index]
print("Clean log prob", clean_f6_log_prob)
print("Corrupted log prob", corrupted_f6_log_prob)

def patching_metric(patched_logits):
    # patched_logits.shape is [1, 36, 61] (batch, position, vocab)
    patched_log_probs = patched_logits.log_softmax(dim=-1)
    return (patched_log_probs[0, -1, f6_index] - corrupted_f6_log_prob)/(clean_f6_log_prob - corrupted_f6_log_prob)
print("Clean metric", patching_metric(clean_logits))
print("Corrupted metric", patching_metric(corrupted_logits))

Clean log prob tensor(-9.2919, device='cuda:0', grad_fn=<SelectBackward0>)
Corrupted log prob tensor(-2.5760, device='cuda:0', grad_fn=<SelectBackward0>)
Clean metric tensor(1., device='cuda:0', grad_fn=<DivBackward0>)
Corrupted metric tensor(-0., device='cuda:0', grad_fn=<DivBackward0>)


In [None]:
attn_layer_patches = []
def patch_attn_layer_output(attn_out, hook, layer):
    # Only patch in on the final move, prior moves are identical
    attn_out[0, -1, :] = clean_cache["attn_out", layer][0, -1, :]
    return attn_out
for layer in range(8):
    patched_logits = model.run_with_hooks(corrupted_input, fwd_hooks=[(utils.get_act_name("attn_out", layer), partial(patch_attn_layer_output, layer=layer))])
    attn_layer_patches.append(patching_metric(patched_logits).item())

mlp_layer_patches = []
def patch_mlp_layer_output(mlp_out, hook, layer):
    # Only patch in on the final move, prior moves are identical
    mlp_out[0, -1, :] = clean_cache["mlp_out", layer][0, -1, :]
    return mlp_out
for layer in range(8):
    patched_logits = model.run_with_hooks(corrupted_input, fwd_hooks=[(utils.get_act_name("mlp_out", layer), partial(patch_mlp_layer_output, layer=layer))])
    mlp_layer_patches.append(patching_metric(patched_logits).item())
line([attn_layer_patches, mlp_layer_patches], title="Layer Output Patching Effect on F6 Log Prob", line_labels=["attn", "mlp"])

### Section Conclusion

We find a massive influence driven by patching MLP0, while patching the remaining layers does not seem to matter as much. This is interesting given that we did find positive contributions to the direction of "F6 == Blank" in the attention layers. This might indicate that there are multiple redundant or similar computations that all rely on information computed in MLP0.

## Objective 5: Analyzing MLP0

We conduct a similar activation patching experiment on the individual neurons of MLP0. This might give us some insight on the most important neurons in determining F6 == Blank. 
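
The idea of patching one neuron at a time can be sketched on a toy model. The example below assumes a purely linear readout from three hidden activations, with invented values; the real experiment instead overwrites one entry of MLP0's post-activation via a forward hook and reruns the full (nonlinear) model. `readout` and `patch_each_neuron` are hypothetical helpers.

```python
def readout(hidden, w_out):
    """Toy model output: a weighted sum of the hidden activations."""
    return sum(h * w for h, w in zip(hidden, w_out))


def patch_each_neuron(clean_hidden, corrupted_hidden, w_out):
    """Copy one clean activation at a time into the corrupted run.

    Returns, per neuron, the normalised effect (0 = no change from the
    corrupted run, 1 = clean behaviour fully restored).
    """
    clean_out = readout(clean_hidden, w_out)
    corrupted_out = readout(corrupted_hidden, w_out)
    effects = []
    for i in range(len(corrupted_hidden)):
        patched = list(corrupted_hidden)
        patched[i] = clean_hidden[i]
        patched_out = readout(patched, w_out)
        effects.append(
            (patched_out - corrupted_out) / (clean_out - corrupted_out))
    return effects


# Toy activations and output weights (assumed values)
clean_hidden = [1.0, 3.0, 0.2]
corrupted_hidden = [1.0, 0.0, 0.6]
w_out = [0.5, -2.0, 0.5]

effects = patch_each_neuron(clean_hidden, corrupted_hidden, w_out)
most_important = max(range(len(effects)), key=lambda i: abs(effects[i]))
print(effects, most_important)
```

In this linear toy the per-neuron effects sum to 1; that need not hold in the real model, where later layers can interact with the patched activation.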

### Neuron Activation Patching

In [None]:
neuron_patches = []
def patch_mlp_neuron_output(mlp_out, hook, neuron):
    # Only patch in on the final move, prior moves are identical
    mlp_out[0, -1, neuron] = clean_cache["post", 0][0, -1, neuron]
    return mlp_out
for neuron in range(2048):
    patched_logits = model.run_with_hooks(corrupted_input, fwd_hooks=[(utils.get_act_name("post", 0), partial(patch_mlp_neuron_output, neuron=neuron))])
    neuron_patches.append(patching_metric(patched_logits).item())
line([neuron_patches], title="MLP0 Neuron Output Patching Effect on F6 Log Prob", line_labels=["neuron"])

Three neurons stand out - L0N398, L0N827, and L0N1449. These all warrant further inspection. We compare the output weights of these neurons and the "F6 == Blank" direction via cosine similarity.

In [None]:
# Scale the probes down to be unit norm per cell
blank_probe_normalised = blank_probe / blank_probe.norm(dim=0, keepdim=True)
my_probe_normalised = my_probe / my_probe.norm(dim=0, keepdim=True)
# Set the center blank probes to 0, since they're never blank so the probe is meaningless
blank_probe_normalised[:, [3, 3, 4, 4], [3, 4, 3, 4]] = 0.

In [None]:
layer = 0
neurons = [398, 827, 1449]
heatmaps_blank = []

for neuron in neurons:
  # Clone so that the in-place normalisation does not modify the model's weights
  w_out = model.blocks[layer].mlp.W_out[neuron, :].detach().clone()
  w_out /= w_out.norm()
  heatmaps_blank.append((w_out[:, None, None] * blank_probe_normalised).sum(dim=0))

imshow(heatmaps_blank,
    facet_col=0,
    y=[i for i in "ABCDEFGH"],
    title="Cosine sim of output weights and the blank probe for top layer 0 neurons",
    facet_labels=[f"L0N{neuron}" for neuron in neurons])

We find a strong negative cosine similarity between the output weights of L0N827 and the "F6 == Blank" direction and a strong negative cosine similarity between L0N1449 and the "G5 == Blank" direction. L0N398 has a weak positive cosine similarity with the "F6 == Blank" direction. 
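
For reference, the cosine similarity used here is just the dot product of the two vectors after each is scaled to unit norm. A minimal sketch with invented three-dimensional vectors (the real vectors are rows of `W_out` and per-cell probe directions of size `d_model`):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy neuron output weights and probe direction (assumed values)
w_out = [-2.0, 0.5, 0.0]
probe_direction = [1.0, 0.0, 0.0]

sim = cosine_similarity(w_out, probe_direction)

# Rescaling the weights leaves the similarity unchanged
sim_scaled = cosine_similarity([2.0 * w for w in w_out], probe_direction)
print(sim, sim_scaled)
```

Because the measure ignores vector length, a strong cosine similarity says the neuron writes along the probe direction, but not how large its effect is once multiplied by its activation.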

### Neuron Spectrum Plots

We create spectrum plots for the activations of each neuron, testing F6 == Blank.

In [None]:
neuron_activations = {k: {'filled': [], 'blank': []} for k in neurons}

for game_index in range(num_games):
  for move in range(5, 54):
    filled = set(int_to_label(focus_games_int[game_index, :move+1]))
    mlp_neurons = focus_cache['post', 0][game_index, move, :]

    for neuron in neurons:
      # F6 is filled once it appears among the moves played so far
      if "F6" in filled:
        neuron_activations[neuron]['filled'].append(mlp_neurons[neuron].item())
      else:
        neuron_activations[neuron]['blank'].append(mlp_neurons[neuron].item())

In [None]:
neuron = neurons[0]
blank_contribs = neuron_activations[neuron]['blank']
filled_contribs = neuron_activations[neuron]['filled']

df = pd.DataFrame({"contributions": blank_contribs + filled_contribs, "label":["blank" for b in blank_contribs] + ["filled" for b in filled_contribs]})
px.histogram(df, x="contributions", color="label", histnorm="percent", nbins=100, title=f"Spectrum plot for neuron L0N{neuron} testing F6==BLANK")

L0N398 is more likely to activate when F6 is blank. But there are cases where L0N398 activates when F6 is not blank, and there are cases where L0N398 does not activate when F6 is blank. It is possible that this neuron is polysemantic, or captures entangled features or features in superposition.

In [None]:
neuron = neurons[1]
blank_contribs = neuron_activations[neuron]['blank']
filled_contribs = neuron_activations[neuron]['filled']

df = pd.DataFrame({"contributions": blank_contribs + filled_contribs, "label":["blank" for b in blank_contribs] + ["filled" for b in filled_contribs]})
px.histogram(df, x="contributions", color="label", histnorm="percent", nbins=100, title=f"Spectrum plot for neuron L0N{neuron} testing F6==BLANK")

L0N827 only activates when F6 == Blank, with no cases of positive activation when F6 is not blank. This strongly suggests that this neuron captures F6 == Blank to a significant degree. Notably, while this is a necessary condition for the neuron to have positive activation, it is not sufficient: there are cases where F6 is blank but this neuron still has negative activation.

In [None]:
neuron = neurons[2]
blank_contribs = neuron_activations[neuron]['blank']
filled_contribs = neuron_activations[neuron]['filled']

df = pd.DataFrame({"contributions": blank_contribs + filled_contribs, "label":["blank" for b in blank_contribs] + ["filled" for b in filled_contribs]})
px.histogram(df, x="contributions", color="label", histnorm="percent", nbins=100, title=f"Spectrum plot for neuron L0N{neuron} testing F6==BLANK")

L0N1449 is more likely to have positive activation when F6 is blank but, like L0N398, it sometimes has positive activation when F6 is not blank, and there are cases where it has negative activation when F6 is blank. Again, this suggests this neuron might capture entangled features.

Recall that in our cosine similarity plot, L0N1449 appeared to be closely related to the G5 cell. If we create a similar spectrum plot testing G5 == Blank, we notice something interesting:

In [None]:
neuron_activations = {k: {'filled': [], 'blank': []} for k in neurons}

for game_index in range(num_games):
  for move in range(5, 54):
    filled = set(int_to_label(focus_games_int[game_index, :move+1]))
    mlp_neurons = focus_cache['post', 0][game_index, move, :]

    for neuron in neurons:
      # `filled` holds every cell played so far, so G5 is blank only if it is absent
      if "G5" not in filled:
        neuron_activations[neuron]['blank'].append(mlp_neurons[neuron].item())
      else:
        neuron_activations[neuron]['filled'].append(mlp_neurons[neuron].item())

In [None]:
neuron = neurons[2]
blank_contribs = neuron_activations[neuron]['blank']
filled_contribs = neuron_activations[neuron]['filled']

df = pd.DataFrame({"contributions": blank_contribs + filled_contribs, "label":["blank" for b in blank_contribs] + ["filled" for b in filled_contribs]})
px.histogram(df, x="contributions", color="label", histnorm="percent", nbins=100, title=f"Spectrum plot for neuron L0N{neuron} testing G5==BLANK")

Just like L0N827 for F6, L0N1449 only activates when G5 == Blank, with no cases of positive activation when G5 is not blank. Similarly, there are cases where G5 is blank but this neuron still has negative activation.

It makes sense that patching in activations from a game where G5 is not played to a game where it is played has a large effect on this neuron. It is somewhat less clear why this would affect the F6 log-probs. The spectrum plot testing F6 == Blank also indicated a difference in activation distribution for this neuron, so it is possible that F6 == Blank is entangled in this neuron to some degree along with G5 == Blank.
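One way to probe this possible entanglement is to split the activations by both cells at once. The sketch below is an illustration under stated assumptions, not part of the original analysis: `mean_activation_by_state` is a hypothetical helper, and the toy arrays are invented. If the neuron depended on G5 alone, the mean activation should be roughly the same for a fixed G5 state regardless of F6; a clear gap between the F6-blank and F6-filled groups within the same G5 state would point toward F6 also being represented.

```python
import numpy as np

def mean_activation_by_state(acts, f6_blank, g5_blank):
    """Mean activation for each (F6 blank?, G5 blank?) combination.

    Returns a dict keyed by (f6_is_blank, g5_is_blank); combinations with
    no samples map to NaN.
    """
    acts = np.asarray(acts, dtype=float)
    f6_blank = np.asarray(f6_blank, dtype=bool)
    g5_blank = np.asarray(g5_blank, dtype=bool)
    table = {}
    for f6 in (True, False):
        for g5 in (True, False):
            mask = (f6_blank == f6) & (g5_blank == g5)
            table[(f6, g5)] = float(acts[mask].mean()) if mask.any() else float("nan")
    return table

# Made-up data for illustration only; one entry per (game, move) position.
toy_acts = [1.0, 0.6, -0.1, -0.1, 0.8, -0.05]
toy_f6_blank = [True, True, True, False, False, False]
toy_g5_blank = [True, True, False, False, True, False]
table = mean_activation_by_state(toy_acts, toy_f6_blank, toy_g5_blank)
for (f6, g5), mean_act in table.items():
    print(f"F6 blank={f6!s:5} G5 blank={g5!s:5} mean activation={mean_act:.3f}")
```

With the notebook's data, the two boolean arrays could presumably be built in the same loop that fills `neuron_activations`, by recording `"F6" not in filled` and `"G5" not in filled` for each position.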

### Section Conclusion

We identify three neurons, L0N398, L0N827, and L0N1449, that appear to have a major effect on the log-probs of F6 in our activation patching experiment. We find that L0N827 captures F6 == Blank as a necessary condition for positive activation, and that L0N1449 captures G5 == Blank as a necessary condition for positive activation while still appearing to be entangled with F6 == Blank to some degree. L0N398 seems to represent a feature, or several features, correlated or entangled with F6 == Blank.

## Conclusion

Our analysis reveals several things about how Othello-GPT computes the state of the cell F6. We isolate an attention head, L1H7, that appears to contribute strongly to computing that F6 is blank in the model's internal board state. Using activation patching, we isolate a neuron, L0N827, that appears to capture F6 == Blank, as well as a neuron, L0N1449, that appears to capture G5 == Blank. This suggests the early MLP layers are critical for determining which cells are blank in the model's internal board state.

An exact characterization of how the model computes that F6 is blank remains open, but these insights hopefully provide further leads to investigate. One possible avenue is to connect the neurons identified in MLP0 with the significant attention heads as part of a circuit, rather than analyzing them individually.
