# Generating explanations after caching the latents

This notebook walks through a simple example of generating explanations for an SAE after its latents have been cached.

In [1]:
import os
from functools import partial

import orjson
import torch

from delphi.clients import OpenRouter
from delphi.config import ExperimentConfig, LatentConfig
from delphi.explainers import DefaultExplainer
from delphi.latents import LatentDataset
from delphi.latents.constructors import default_constructor
from delphi.latents.samplers import sample
from delphi.pipeline import Pipeline, process_wrapper

API_KEY = os.getenv("OPENROUTER_API_KEY")


In [2]:
latent_cfg = LatentConfig(
    width=131072,  # Number of latents in your SAE
    min_examples=200,  # Minimum number of examples a latent needs in order to be explained
    max_examples=10000,  # Maximum number of examples to sample from
    n_splits=5,  # Number of splits the cache was divided into
)


In [3]:
module = ".model.layers.10"  # The layer to explain
latent_dict = {module: torch.arange(0, 5)}  # Which latents to explain (here, 0 through 4)




Next, we define the config for the examples shown to the explainer model.
These examples can be selected in one of three ways:
- "top", which takes the most strongly activating examples
- "random", which takes random examples from the whole activation distribution
- "quantiles", which takes examples from each quantile of the activation distribution
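
To make the difference between the three options concrete, here is a minimal, self-contained sketch of what each strategy could look like on a plain list of activation values. It is not the library's implementation: `select_examples`, the number of quantile bins, and the equal allocation per bin are all assumptions made for illustration.

```python
import random


def select_examples(activations, n, how, n_quantiles=4, seed=0):
    """Return the indices of n examples chosen by one of three strategies.

    Hypothetical helper for illustration only; not part of delphi.
    """
    rng = random.Random(seed)
    # Indices ordered from the strongest to the weakest activation.
    order = sorted(
        range(len(activations)), key=lambda i: activations[i], reverse=True
    )
    if how == "top":
        # Only the most strongly activating examples.
        return order[:n]
    if how == "random":
        # Uniform sample over the whole activation distribution.
        return rng.sample(order, n)
    if how == "quantiles":
        # Split the ranked examples into equal bins and sample from each bin,
        # so strong, medium, and weak activations are all represented.
        per_bin = n // n_quantiles
        bin_size = len(order) // n_quantiles
        picked = []
        for q in range(n_quantiles):
            bin_indices = order[q * bin_size:(q + 1) * bin_size]
            picked.extend(rng.sample(bin_indices, per_bin))
        return picked
    raise ValueError(f"unknown sampling strategy: {how!r}")


# Toy data: example i has activation i, so index 99 is the strongest.
toy_activations = [float(i) for i in range(100)]
top_indices = select_examples(toy_activations, 4, "top")
quantile_indices = select_examples(toy_activations, 8, "quantiles")
```

With "top", the explainer would only ever see the strongest activations, while "quantiles" also exposes it to weaker ones, which may lead to explanations that cover more of the latent's behaviour.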


In [4]:

experiment_cfg = ExperimentConfig(
    n_examples_train=40,  # Number of examples to sample for the train set
    example_ctx_len=32,  # Length of each example, in tokens
    train_type="quantiles",  # Sampler used for the train set: "top", "random" or "quantiles"
)


The constructor defines the window of tokens used for each example. The default constructor builds examples of `ctx_len` tokens; this value should be a divisor of the context length used when caching the latents.
The sampler defines how the examples are selected. It always generates a train set and a test set, but here we only care about the train set.
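
The divisor requirement is easiest to see with a toy example. The sketch below assumes a cached context of 256 tokens (a made-up value, not taken from this notebook) and cuts it into windows of `ctx_len` tokens. `split_into_windows` is a hypothetical helper, not the library's constructor; it only illustrates why a context length that does not divide evenly leaves a ragged remainder.

```python
def split_into_windows(tokens, ctx_len):
    """Cut one cached context into consecutive windows of ctx_len tokens.

    Hypothetical helper for illustration only; not part of delphi.
    """
    if len(tokens) % ctx_len != 0:
        raise ValueError(
            f"ctx_len={ctx_len} does not divide the cached context "
            f"length {len(tokens)}"
        )
    return [tokens[i:i + ctx_len] for i in range(0, len(tokens), ctx_len)]


# Assumed cache context length of 256 tokens (illustrative value).
cached_context = list(range(256))

# 32 divides 256, so the context tiles into 8 equally sized windows.
windows = split_into_windows(cached_context, ctx_len=32)

# 48 does not divide 256, so the helper refuses to build uneven windows.
try:
    split_into_windows(cached_context, ctx_len=48)
    uneven_allowed = True
except ValueError:
    uneven_allowed = False
```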


In [9]:
constructor = partial(
    default_constructor,
    token_loader=None,
    n_not_active=experiment_cfg.n_non_activating,
    ctx_len=experiment_cfg.example_ctx_len,
    max_examples=latent_cfg.max_examples,
)
sampler = partial(sample, cfg=experiment_cfg)
dataset = LatentDataset(
    raw_dir="latents",  # The folder where the cache is stored
    cfg=latent_cfg,
    modules=[module],
    latents=latent_dict,
    constructor=constructor,
    sampler=sampler,
)

We use pipes to generate the explanations. Each pipe starts by loading the examples for the corresponding latent and then passes them to the explainer. The explainer uses a client (here OpenRouter) to generate the explanations.
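
As a rough mental model of this flow, the sketch below chains two toy async steps (load, then explain) and limits how many latents are processed at once. Everything in it is hypothetical: `load_examples`, `explain` and `run_pipes` are stand-ins written for this illustration, the "explanation" is a placeholder string rather than a model response, and the real pipeline may schedule its work differently.

```python
import asyncio


async def load_examples(latent_id):
    """Stand-in for loading the cached examples of one latent."""
    await asyncio.sleep(0)  # Pretend to do I/O.
    examples = [f"example {i} for latent {latent_id}" for i in range(3)]
    return {"latent": latent_id, "examples": examples}


async def explain(record):
    """Stand-in for the explainer; a real one would call an LLM client."""
    await asyncio.sleep(0)  # Pretend to wait for the API.
    n_examples = len(record["examples"])
    return {
        "latent": record["latent"],
        "explanation": f"summary of {n_examples} examples",
    }


async def run_pipes(latent_ids, max_concurrent):
    """Run load -> explain for every latent, at most max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(latent_id):
        async with semaphore:
            record = await load_examples(latent_id)
            return await explain(record)

    # gather returns the results in the same order as latent_ids.
    return await asyncio.gather(*(run_one(i) for i in latent_ids))


# In a script, start the event loop explicitly.
# In a notebook, use `await run_pipes(...)` instead.
results = asyncio.run(run_pipes(range(5), max_concurrent=2))
```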

In [10]:
client = OpenRouter("anthropic/claude-3.5-sonnet", api_key=API_KEY)


# The function that saves the explanations, one JSON-encoded file per latent
def explainer_postprocess(result):
    with open(f"results/explanations/{result.record.latent}.txt", "wb") as f:
        f.write(orjson.dumps(result.explanation))
    del result
    return None


explainer_pipe = process_wrapper(
    DefaultExplainer(
        client,
        tokenizer=dataset.tokenizer,
    ),
    postprocess=explainer_postprocess,
)


In [12]:
!mkdir -p results/explanations

Because we are only generating explanations, the pipeline has just two steps: the dataset, which loads the examples, and the explainer pipe.

In [13]:
pipeline = Pipeline(
    dataset,
    explainer_pipe,
)
number_of_parallel_latents = 10
await pipeline.run(number_of_parallel_latents) # This will start generating the explanations.


Processing items: 0it [01:20, ?it/s]


No available randomly sampled non-activating sequences



Processing items: 3it [00:04,  1.61s/it]


[None, None, None]
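
The `None` values in this output are consistent with `explainer_postprocess` returning `None` after writing each explanation to disk, so the explanations themselves have to be read back from `results/explanations`. The sketch below is one way to do that, assuming the files contain JSON as written by `orjson.dumps` above. `load_explanations` is a hypothetical helper, and it uses the standard-library `json` module so that it runs without extra dependencies.

```python
import json
from pathlib import Path


def load_explanations(directory):
    """Read every saved explanation into a dict keyed by file name (no suffix).

    Hypothetical helper for illustration only; not part of delphi.
    """
    explanations = {}
    for path in sorted(Path(directory).glob("*.txt")):
        # The files hold JSON-encoded bytes, which json.loads accepts directly.
        explanations[path.stem] = json.loads(path.read_bytes())
    return explanations
```

For example, `load_explanations("results/explanations")` would return one entry per latent that was explained, keyed by whatever name the latent was saved under.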