# **DAS**: Uncover structure in a neural network

In this notebook, we set up DAS (Distributed Alignment Search) over the internal states of our trained neural network to localize high-level variables (circularity, color, and/or area).

The notebook covers:
1. **Single source DAS**: localizing many variables in one representation
2. **Multi-source DAS**: disentangling variables by localizing different variables in different representations

In [1]:
# !pip install -r requirements.txt

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import random
import numpy as np
import torch

random.seed(42)
np.random.seed(42)
_ = torch.manual_seed(42)
_ = torch.cuda.manual_seed(42) # only if using GPU

## 1. Single source DAS: localize many variables in a single representation

We use DAS to find a representation that mediates the causal effect of one or more high-level variables (circularity, color, and/or area).

In [3]:
# set `variables` to choose which variables to localize;
# set `intervention_size` to the number of neurons assigned to those variables
variables = [0, 1]
intervention_size = 64
n_train = 10000
n_test = 1000

Create counterfactual dataset

In [4]:
from data_utils import create_dataset
from counterfactual_data_utils import create_single_source_counterfactual_dataset

# first, create base dataset
images, labels = create_dataset(n_train)
images = images.reshape((-1, 1, 28, 28))
coefficients = np.array([0.4, 0.4, 0.4])

# create single source counterfactual dataset
X_base, X_sources, y_base, y_sources, y_counterfactual = create_single_source_counterfactual_dataset(
    variables, images, labels, coefficients, size=n_train
)
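The helper `create_single_source_counterfactual_dataset` is not shown in this notebook, so the following is only a minimal sketch of how a counterfactual label could be derived. It assumes each example carries binary values for (circularity, color, area) and that the task label follows the rule used later in the notebook (`labels @ coefficients > 0.6`). The function name `counterfactual_labels` and the toy inputs are hypothetical.

```python
import numpy as np

def counterfactual_labels(base_vars, source_vars, variables, coefficients, threshold=0.6):
    """Swap the selected high-level variables from source into base, then relabel.

    Hypothetical helper: assumes binary variables and a thresholded
    weighted-sum labeling rule; the real helper may differ.
    """
    mixed = np.array(base_vars, dtype=float)  # copy so the inputs stay untouched
    mixed[:, variables] = np.asarray(source_vars, dtype=float)[:, variables]
    return (mixed @ coefficients > threshold).astype(float)

coefficients = np.array([0.4, 0.4, 0.4])
base_vars = np.array([[1, 0, 0], [0, 0, 1]])    # made-up base variable settings
source_vars = np.array([[0, 1, 1], [1, 1, 0]])  # made-up source variable settings

# Swap variables 0 and 1 from each source into its base.
y_cf = counterfactual_labels(base_vars, source_vars, [0, 1], coefficients)
print(y_cf)  # [0. 1.]
```

In the second row the base alone would be labeled 0, but after taking variables 0 and 1 from the source all three variables are on, so the counterfactual label is 1. These are the targets DAS trains the intervened network to reproduce.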

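Load an intervenable model with `pyvene`: set up a single low-rank rotated intervention on the output of the first convolutional layer. The cell below resets `intervention_size` to 2, so the learned subspace has 2 dimensions rather than the 64 set earlier.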

In [5]:
import pyvene as pv
from model_utils import PyTorchCNN
from das_utils import CNNConfig, CustomLowRankRotatedSpaceIntervention

# load base model
model = PyTorchCNN()
model.load_state_dict(torch.load('pytorch_models/cnn_model_1.pth'))

model.config = CNNConfig(
    hidden_size=28*28*16 # = 12544; conv1 output (28 x 28 x 16) flattened per example
)

intervention_size = 2 # overrides the value of 64 set earlier; used as the low-rank dimension

# create a single low-rank intervention on the output of the first convolutional layer
representations = [{
    "component": "conv1.output",
    "low_rank_dimension": intervention_size,
}]

pv_config = pv.IntervenableConfig(
    representations=representations,
    intervention_types=CustomLowRankRotatedSpaceIntervention
)
pv_model = pv.IntervenableModel(pv_config, model)
pv_model.set_device('cuda')
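To make the intervention concrete, here is a minimal NumPy sketch of a low-rank interchange intervention. It assumes the common formulation in which the base activation's coordinates along a few orthonormal directions are replaced by the source's coordinates. `CustomLowRankRotatedSpaceIntervention` is defined in `das_utils` and is not shown here, so it may differ in detail. The toy sizes and the random directions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 2  # toy hidden size and low-rank dimension (the notebook uses a far larger d)

# Orthonormal columns via QR; in DAS these directions are learned, not random.
R, _ = np.linalg.qr(rng.normal(size=(d, k)))

def low_rank_interchange(h_base, h_source, R):
    """Replace the base's coordinates in span(R) with the source's coordinates."""
    return h_base + (h_source @ R - h_base @ R) @ R.T

h_base = rng.normal(size=(3, d))
h_source = rng.normal(size=(3, d))
h_new = low_rank_interchange(h_base, h_source, R)

# Inside the subspace the result matches the source ...
print(np.allclose(h_new @ R, h_source @ R))
# ... and in the orthogonal complement the base is untouched.
P = R @ R.T
print(np.allclose(h_new - h_new @ P, h_base - h_base @ P))
```

Only the `k` directions in `R` carry information from the source, which is why a high interchange intervention accuracy suggests those directions encode the targeted variables.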

In [6]:
from das_utils import das_train

das_train(
    pv_model,
    X_base.to('cuda'),
    X_sources.to('cuda'),
    y_counterfactual.to('cuda'),
    lr=0.0005,
    num_epochs=25,
    batch_size=256,
    subspaces=None,
)

Training (Epoch 1): 100%|██████████| 40/40 [00:11<00:00,  3.63it/s, loss=0.508]
Training (Epoch 2): 100%|██████████| 40/40 [00:10<00:00,  3.72it/s, loss=0.0797]
Training (Epoch 3): 100%|██████████| 40/40 [00:10<00:00,  3.72it/s, loss=0.0908]
Training (Epoch 4): 100%|██████████| 40/40 [00:10<00:00,  3.72it/s, loss=0.0843]
Training (Epoch 5): 100%|██████████| 40/40 [00:10<00:00,  3.72it/s, loss=0.089]
Training (Epoch 6): 100%|██████████| 40/40 [00:10<00:00,  3.72it/s, loss=0.0889]
Training (Epoch 7): 100%|██████████| 40/40 [00:10<00:00,  3.72it/s, loss=0.0904]
Training (Epoch 8): 100%|██████████| 40/40 [00:10<00:00,  3.72it/s, loss=0.0937]
Training (Epoch 9): 100%|██████████| 40/40 [00:10<00:00,  3.72it/s, loss=0.09] 
Training (Epoch 10): 100%|██████████| 40/40 [00:10<00:00,  3.72it/s, loss=0.1]  
Training (Epoch 11): 100%|██████████| 40/40 [00:10<00:00,  3.72it/s, loss=0.0938]
Training (Epoch 12): 100%|██████████| 40/40 [00:10<00:00,  3.71it/s, loss=0.0977]
Training (Epoch 13): 100%|███

Evaluate interchange intervention accuracy on a new evaluation dataset

In [14]:
from data_utils import create_dataset
from counterfactual_data_utils import create_single_source_counterfactual_dataset
from das_utils import das_evaluate

# first, create base dataset
images, labels = create_dataset(n_test)
images = images.reshape((-1, 1, 28, 28))
coefficients = np.array([0.4, 0.4, 0.4])

# create single source counterfactual dataset
X_base, X_sources, y_base, y_sources, y_counterfactual = create_single_source_counterfactual_dataset(
    variables, images, labels, coefficients, size=n_test
)

# evaluate the accuracy of the model on the counterfactual dataset
das_evaluate(pv_model, X_base.to('cuda'), X_sources.to('cuda'), y_counterfactual, subspaces=None)

Evaluating: 100%|██████████| 4/4 [00:00<00:00, 11.94it/s]


0.8999999761581421
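The value above is the interchange intervention accuracy (IIA): the fraction of examples where the intervened model's prediction equals the counterfactual label. The internals of `das_evaluate` are not shown, so the sketch below is an assumption about the metric only. In particular, it assumes the model emits one logit per example and that a logit above 0 means class 1. The function name and toy values are hypothetical.

```python
import numpy as np

def interchange_intervention_accuracy(intervened_logits, y_counterfactual):
    """Fraction of examples whose post-intervention prediction equals the counterfactual label."""
    preds = (np.asarray(intervened_logits) > 0).astype(float)  # assumed decision rule
    targets = np.asarray(y_counterfactual, dtype=float)
    return float(np.mean(preds == targets))

logits = [2.1, -0.3, 0.7, -1.5]  # made-up model outputs after intervention
y_cf = [1.0, 0.0, 0.0, 0.0]      # made-up counterfactual labels
iia = interchange_intervention_accuracy(logits, y_cf)
print(iia)  # 0.75
```

An IIA of 1.0 would mean the intervened subspace fully mediates the selected variables on the evaluation set. The untouched model's ordinary accuracy serves as a practical reference point.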

In [9]:
from model_utils import evaluate
from data_utils import create_dataset

# create our own evaluation dataset
images, labels = create_dataset(n_test)
coefficients = np.array([0.4, 0.4, 0.4])

# batch, channel, height, width
X = torch.tensor(images.reshape((-1, 1, 28, 28))).float()
y = torch.tensor(np.matmul(labels, coefficients) > 0.6).float()

# evaluate model
accuracy = evaluate(pv_model.model, X.to('cuda'), y.to('cuda'))
print(f'Accuracy: {accuracy:.4f}')

Evaluating: 100%|██████████| 4/4 [00:00<00:00, 11.05it/s]

Accuracy: 0.9750
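The labeling rule in the cell above, `np.matmul(labels, coefficients) > 0.6` with all coefficients equal to 0.4, can be read as a majority vote if each variable is binary. That binary assumption is not confirmed by the notebook. A short enumeration, under that assumption:

```python
import itertools
import numpy as np

coefficients = np.array([0.4, 0.4, 0.4])

# Enumerate every binary setting of (circularity, color, area).
for combo in itertools.product([0, 1], repeat=3):
    score = float(np.dot(combo, coefficients))
    print(combo, round(score, 1), score > 0.6)

# Collect the settings that receive a positive label.
positives = [
    combo for combo in itertools.product([0, 1], repeat=3)
    if np.dot(combo, coefficients) > 0.6
]
print(len(positives))  # 4
```

Under this reading, the label is positive exactly when at least two of the three variables are active.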





## 2. Multi-source DAS: localize different variables in different representations

We use DAS to find multiple representations that mediate the causal effect of one or more high-level variables (circularity, color, and/or area), where each representation corresponds to a different variable.

In [9]:
# set `variables` to choose which variables to localize;
# set `intervention_size` to the number of neurons assigned to each variable
variables = [0, 1]
intervention_size = 64
n_train = 10000
n_test = 1000

Create counterfactual data

In [10]:
from data_utils import create_dataset
from counterfactual_data_utils import create_multi_source_counterfactual_dataset

# first, create base dataset
images, labels = create_dataset(n_train)
images = images.reshape((-1, 1, 28, 28))
coefficients = np.array([0.4, 0.4, 0.4])

# create multi-source counterfactual dataset
X_base, X_sources, y_base, y_sources, y_counterfactual = create_multi_source_counterfactual_dataset(
    variables, images, labels, coefficients, size=n_train
)

Load intervenable model with `pyvene`: set up a separate intervention for each variable, but link them to use the same rotation matrix (so they can index different subspaces of the rotated neurons).
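The idea of linked interventions indexing different subspaces can be sketched in NumPy as follows. This is an illustration under stated assumptions, not `pyvene`'s implementation. It assumes one shared orthogonal matrix, one source per variable, and contiguous blocks of rotated coordinates. The function name `multi_source_interchange` and the toy sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 2  # toy hidden size and per-variable subspace size

# One shared orthogonal matrix; random here, learned in DAS.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

def multi_source_interchange(h_base, sources, blocks, R):
    """Swap each block of rotated coordinates with the matching source's block."""
    z = h_base @ R  # rotate the base into the shared basis (new array)
    for h_src, (start, stop) in zip(sources, blocks):
        z[:, start:stop] = (h_src @ R)[:, start:stop]
    return z @ R.T  # rotate back to neuron space

h_base = rng.normal(size=(3, d))
h_src1 = rng.normal(size=(3, d))
h_src2 = rng.normal(size=(3, d))
blocks = [(0, k), (k, 2 * k)]  # block for variable 0, block for variable 1

h_new = multi_source_interchange(h_base, [h_src1, h_src2], blocks, R)
z_new = h_new @ R

print(np.allclose(z_new[:, :k], (h_src1 @ R)[:, :k]))             # block 0 from source 1
print(np.allclose(z_new[:, k:2 * k], (h_src2 @ R)[:, k:2 * k]))   # block 1 from source 2
print(np.allclose(z_new[:, 2 * k:], (h_base @ R)[:, 2 * k:]))     # remainder from base
```

Because both swaps happen in the same rotated basis, training pushes each variable into its own block, which is the sense in which the variables become disentangled.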

In [11]:
import pyvene as pv
from model_utils import PyTorchCNN
from das_utils import CNNConfig

# load base model
model = PyTorchCNN()
model.load_state_dict(torch.load('pytorch_models/amir_cnn_model.pth'))

model.config = CNNConfig(
    hidden_size=448 # batch x 28 x (28 x 16) -> batch x 28 x 448
)

intervention_size = 64

# create one intervention per variable on the output of the first convolutional layer;
# both share a rotation matrix, and each lists a different subspace first
representations = [
    {
        "component": "conv1.output",
        "subspace_partition": [[0, intervention_size], [intervention_size, intervention_size * 2], [intervention_size * 2, model.config.hidden_size]],
        "intervention_link_key": 0 # link interventions to use the same rotation matrix
    },
    {
        "component": "conv1.output",
        "subspace_partition": [[intervention_size, intervention_size * 2], [0, intervention_size], [intervention_size * 2, model.config.hidden_size]], 
        "intervention_link_key": 0
    }
]

pv_config = pv.IntervenableConfig(
    representations=representations,
    intervention_types=pv.RotatedSpaceIntervention
)
pv_model = pv.IntervenableModel(pv_config, model)

In [12]:
from das_utils import das_train

das_train(pv_model, X_base, X_sources, y_counterfactual, lr=0.0005, num_epochs=2, batch_size=256)

Training (Epoch 1): 100%|██████████| 40/40 [00:32<00:00,  1.24it/s, loss=0.191]
Training (Epoch 2): 100%|██████████| 40/40 [00:32<00:00,  1.23it/s, loss=0.18] 


Evaluate interchange intervention accuracy on a new evaluation dataset

In [13]:
from data_utils import create_dataset
from counterfactual_data_utils import create_multi_source_counterfactual_dataset
from das_utils import das_evaluate

# first, create base dataset
images, labels = create_dataset(n_test)
images = images.reshape((-1, 1, 28, 28))
coefficients = np.array([0.4, 0.4, 0.4])

# create multi-source counterfactual dataset
X_base, X_sources, y_base, y_sources, y_counterfactual = create_multi_source_counterfactual_dataset(
    variables, images, labels, coefficients, size=n_test
)

# evaluate the accuracy of the model on the counterfactual dataset
das_evaluate(pv_model, X_base, X_sources, y_counterfactual)

Evaluating: 100%|██████████| 4/4 [00:01<00:00,  3.85it/s]


0.7839999794960022