# Shard theory mechint

A summary of project activities so far:
- Basic model editing: finding a "cheese vector" in the activations and subtracting it to reduce the mouse's propensity to go for the cheese (in favor of going to the top right)
    - Statistical data showing that this drastically decreases how often the mouse goes to the cheese and increases how often it goes to the top right (more so than randomly perturbing the weights would!)
- Creating useful tooling
    - Library functions to interactively and programmatically edit mazes
    - Library functions to 
    - A vector field view of agent behavior on a maze
    - Data gathering scripts for episodes, vector fields, and patched vector fields (see below)
- Behavioral statistics with preregistered predictions
    - Euclidean distance
    - (Peli should add more here. I don't recall all his results)


Concrete results:

## Model editing: Finding a "cheese" vector

The basic idea is simple: in a maze with no cheese, the cheese decision-influence won't be active, and the mouse will try to go to the top right. In a maze with cheese, *both* the top-right and cheese decision-influences will be active.

What if we record the activations for both on a specific layer $l$, and patch the network by adding the difference?

Formally, pick a maze $m$ and layer $l$. Let $a_c \in \mathbf{R}^n$ denote the activations of layer $l$ on maze $m$ *with the cheese*, and let $a_{\neg c} \in \mathbf{R}^n$ denote the activations when the cheese is first removed from the maze. (In both cases, the mouse remains in the start position.)

Now we patch the network, setting the output of layer $l$ ($a_l$) to be $a_l' = a_l + \alpha(a_c - a_{\neg c})$ for varying choices of $\alpha$.

For the first step we have $a_l = a_c$ (the mouse hasn't moved and the cheese exists), giving $a_l' = (1+\alpha)a_c - \alpha a_{\neg c}$. For $\alpha = -1$ this yields $a_l' = a_{\neg c}$: the patched activations are exactly those of the maze without cheese, so the network ignores the cheese.
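The patching rule can be checked numerically. The following is a minimal sketch in plain Python, not the project's actual patching code: the function names `cheese_vector` and `patch_activations` are hypothetical, and the activation values are made up for illustration.

```python
import math
from typing import List

Vector = List[float]

def cheese_vector(a_c: Vector, a_no_c: Vector) -> Vector:
    """Difference between activations with and without cheese: a_c - a_{not c}."""
    return [c - n for c, n in zip(a_c, a_no_c)]

def patch_activations(a_l: Vector, a_c: Vector, a_no_c: Vector, alpha: float) -> Vector:
    """Return a_l' = a_l + alpha * (a_c - a_{not c})."""
    diff = cheese_vector(a_c, a_no_c)
    return [a + alpha * d for a, d in zip(a_l, diff)]

# Made-up activations for illustration only (not real network outputs)
a_c = [1.0, 2.0, 0.5]      # layer output with cheese in the maze
a_no_c = [0.5, 2.0, -1.0]  # layer output with the cheese removed

# First timestep: a_l == a_c, so alpha = -1 recovers the no-cheese activations
first_step = patch_activations(a_c, a_c, a_no_c, alpha=-1.0)
assert all(math.isclose(p, n) for p, n in zip(first_step, a_no_c))

# alpha = 0 leaves the layer output unchanged
assert patch_activations(a_c, a_c, a_no_c, alpha=0.0) == a_c
```

On later timesteps $a_l$ no longer equals $a_c$, so the same call simply shifts the current activations by $\alpha$ times the fixed cheese vector.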

Does this work beyond the first timestep? Surprisingly, yes. It substantially raises the probability that the mouse goes to the top right when it reaches "forks in the road" (decision squares in our terminology) and lowers the probability that it goes to the cheese.

(This doesn't prove we're doing something motivational. We could be doing something perceptual, like editing out the mouse's knowledge of the cheese.)

In [8]:
%matplotlib inline
import matplotlib.pyplot as plt
import pickle
from glob import glob
from collections import defaultdict

In [9]:
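def _load_pickle(path: str):
    # Use a context manager so the file handle is closed after loading
    with open(path, 'rb') as f:
        return pickle.load(f)

# Load every saved vector field and group them by maze seed
vfields = [_load_pickle(f) for f in glob('../data/vfields/seed-*.pkl')]
vfields_by_level = defaultdict(list)
for vf in vfields:
    vfields_by_level[vf['seed']].append(vf)

def get_vfields(seed: int, coeff: float):
    # First saved vector field for this seed whose coefficient matches exactly
    return next(vf for vf in vfields_by_level[seed] if vf['coeff'] == coeff)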

In [10]:
from ipywidgets import interact, IntSlider, FloatSlider
from vfield_utils import plot_vfs

def _coeffs_for(seed: int):
    return sorted(set(vf['coeff'] for vf in vfields_by_level[seed]))

seed_max = max(vf['seed'] for vf in vfields)
min_coeff, max_coeff = min(vf['coeff'] for vf in vfields), max(vf['coeff'] for vf in vfields)

@interact(
    seed=IntSlider(min=0, max=seed_max, step=1, value=0),
    coeff=FloatSlider(min=min_coeff, max=max_coeff, step=0.1, value=1.0),
)
def interact_vfields(seed: int, coeff: float):
    # set coeff to nearest available in vfields_by_level[seed]
    coeff = min(_coeffs_for(seed), key=lambda x: abs(x - coeff))
    vfs = get_vfields(seed, coeff)
    plot_vfs(vfs)

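The slider moves in steps of 0.1, but `get_vfields` matches coefficients by exact equality, so the requested value has to be snapped to a coefficient that was actually saved. A minimal standalone sketch of that snapping step follows; the helper name `nearest_coeff` and the list of saved coefficients are hypothetical, chosen only for illustration.

```python
from typing import Iterable

def nearest_coeff(available: Iterable[float], requested: float) -> float:
    """Return the available coefficient closest to the requested one.

    Raises ValueError if `available` is empty.
    """
    return min(available, key=lambda c: abs(c - requested))

# Hypothetical set of coefficients saved for one seed
saved = [-3.0, -1.0, 0.0, 1.0]

assert nearest_coeff(saved, -0.8) == -1.0
assert nearest_coeff(saved, 0.4) == 0.0
```

Because the snapped value is taken from the saved coefficients themselves, the exact-equality lookup that follows is safe from floating-point mismatches.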

## A tour through our tooling

We have a few tools to help us understand the model and the data we're collecting.
Some of these will soon be made into a library.