![Attribution Targets](https://raw.githubusercontent.com/safety-research/circuit-tracer/main/demos/img/attribution_targets/attribution_targets_banner.png)

# Attribution Targets

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/safety-research/circuit-tracer/blob/main/demos/attribution_targets_demo.ipynb)

This tutorial walks through the **attribution targets API**, demonstrating how to attribute back from arbitrary tokens, functions thereof, or abstract concept directions in the residual stream.

The `AttributionTargets` class (in `circuit_tracer.attribution.targets`) accepts four input formats:

| Input type | Mode | Description |
|---|---|---|
| `None` | Salient logits | Auto-selects the most probable next tokens via `max_n_logits` / `desired_logit_prob` (default) |
| `Sequence[str]` | Token strings | Attribute from explicitly named tokens, e.g. `["▁Austin", "▁Dallas"]` |
| `torch.Tensor` | Token ID tensor | Attribute from specific vocabulary indices |
| `Sequence[TargetSpec]` | Custom target | Attribute from arbitrary residual-stream directions via `CustomTarget(token_str, prob, vec)` |

We demonstrate all four input formats on the capital-city prompt you may know from other demos: the model must resolve *"capital of the state containing Dallas"* via multi-hop reasoning (Dallas → Texas → Austin). After comparing the top features discovered under each mode, we run interventions to test the features surfaced by the two `CustomTarget` examples.

In [None]:
#@title Colab Environment Setup { display-mode: "form" }

import sys

IN_COLAB = "google.colab" in sys.modules

def setup_environment():
    from google.colab import output
    output.enable_custom_widget_manager()

    print("Setting up Colab environment...")
    %pip install -q uv

    # Temporary: install the PR branch with uv for testing.
    !uv pip install --system --no-cache "git+https://github.com/speediedan/circuit-tracer.git@attribution-targets"
    # Once the branch is merged to main and released, replace the line above with:
    # !uv pip install --system --no-cache circuit-tracer

    from huggingface_hub import notebook_login
    notebook_login()

if IN_COLAB:
    setup_environment()
else:
    print("Running in local environment. Skipping Colab-specific setup.")

In [None]:
# @title Imports { display-mode: "form" }

from functools import partial

import torch

from circuit_tracer import ReplacementModel, attribute
from circuit_tracer.attribution.targets import CustomTarget
from circuit_tracer.utils import create_graph_files
from circuit_tracer.utils.demo_utils import (
    cleanup_cuda,
    display_ablation_chart,
    display_attribution_config,
    display_token_probs,
    display_topk_token_predictions,
    display_top_features_comparison,
    get_top_features,
    get_unembed_vecs,
)

## Setup

Load the model and define helper functions. We use `google/gemma-2-2b` with the Gemma Scope transcoders, the same configuration used in the other demos. Change `backend` to `'nnsight'` if you prefer the NNSight backend.

In [None]:
model_name = "google/gemma-2-2b"
transcoder_name = "gemma"
backend = "transformerlens"  # change to 'nnsight' for the nnsight backend!
model = ReplacementModel.from_pretrained(
    model_name, transcoder_name, dtype=torch.bfloat16, backend=backend
)

## Basic Attribution Target Modes

This section explores the three simplest ways to specify attribution targets:

1. **Automatic Salient Logit Targets** (`None`) — the default mode; auto-selects the most probable next tokens.
2. **Token-String Targets** (`Sequence[str]`) — attribute from explicit token surface forms.
3. **Token-ID Targets** (`torch.Tensor`) — attribute from specific vocabulary indices (pre-tokenized equivalent of string targets).

> **Coming up:** After these basic modes, we explore two **custom attribution target** examples that let you attribute back from arbitrary residual-stream directions — a logit *difference* (`logit(Austin) − logit(Dallas)`) and an abstract *semantic concept* (`Capitals − States`).

In [None]:
# Define the prompt, shared attribution parameters, and the three reference tokens (`▁Austin`, `▁Dallas`, `▁Texas`). 

prompt = "Fact: the capital of the state containing Dallas is"
token_x, token_y = "▁Austin", "▁Dallas"

# Shared attribution kwargs (apply to all runs)
# Note: max_n_logits / desired_logit_prob only apply to salient-logit mode
attr_kwargs = dict(
    batch_size=256,
    max_feature_nodes=8192,
    offload="disk" if IN_COLAB else "cpu",
    verbose=True,
)

# Resolve token ids for key tokens
tokenizer = model.tokenizer
idx_x = tokenizer.encode(token_x, add_special_tokens=False)[-1]
idx_y = tokenizer.encode(token_y, add_special_tokens=False)[-1]
idx_texas = tokenizer.encode("▁Texas", add_special_tokens=False)[-1]

# Bind the tokenizer and key tokens for display helpers
display_topk = partial(
    display_topk_token_predictions,
    tokenizer=tokenizer,
    key_tokens=[(token_x, idx_x), (token_y, idx_y), ("▁Texas", idx_texas)],
)

# Show baseline token probabilities
input_ids = model.ensure_tokenized(prompt)
with torch.no_grad():
    baseline_logits, _ = model.get_activations(input_ids)

key_ids = [idx_x, idx_y, idx_texas]
key_labels = [token_x, token_y, "▁Texas"]
display_token_probs(baseline_logits, key_ids, key_labels, title="Baseline probabilities")

### Automatic Target Selection — Salient Logits (`None`)

When `attribution_targets` is `None` (the default), `AttributionTargets` auto-selects the most probable next tokens until `desired_logit_prob` cumulative probability is reached (capped at `max_n_logits`). This is the standard mode used by `attribute_demo.ipynb`.
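
The selection rule described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the library's implementation: `select_salient` is a hypothetical helper, the logits are made up, and the real code may handle ties and edge cases differently.

```python
import math


def select_salient(logits, max_n_logits=10, desired_logit_prob=0.95):
    """Pick tokens in descending-probability order.

    Stops once the cumulative probability reaches desired_logit_prob
    or max_n_logits tokens have been chosen, whichever comes first.
    Returns a list of (token_index, probability) pairs.
    """
    # Numerically stable softmax over the toy logits
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in ranked:
        if len(chosen) >= max_n_logits or cumulative >= desired_logit_prob:
            break
        chosen.append((i, probs[i]))
        cumulative += probs[i]
    return chosen


# Made-up logits for a five-token vocabulary
toy_logits = [4.0, 2.0, 1.0, 0.0, -1.0]
picked = select_salient(toy_logits, max_n_logits=3, desired_logit_prob=0.95)
print(picked)  # the top two tokens cover about 0.94, so a third is needed
```

Lowering `desired_logit_prob` or `max_n_logits` shrinks the target set, which in turn narrows what the resulting graph attributes back from.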

In [None]:
graph_salient = attribute(
    prompt=prompt, model=model,
    max_n_logits=10, desired_logit_prob=0.95,
    **attr_kwargs,
)
print(f"Salient-logits graph: {len(graph_salient.logit_targets)} targets, "
      f"{graph_salient.active_features.shape[0]} active features")

# Free CUDA memory before next run
cleanup_cuda()

### Token-String Targets — `Sequence[str]`

Pass a list of token strings (e.g., `["▁Austin", "▁Dallas"]`) to focus attribution on exactly those logits. Internally, each string is tokenized and its softmax probability and unembedding vector are computed automatically — you only need to supply the surface forms.
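
Because string targets must match vocabulary entries exactly (note the `▁` leading-space marker), it can help to confirm that each surface form maps to a single token id before attributing. The sketch below assumes an `encode` callable with the same shape as the `tokenizer.encode(text, add_special_tokens=False)` calls used in this notebook; `resolve_single_token` and the toy vocabulary are hypothetical and not part of circuit-tracer.

```python
from typing import Callable, Sequence


def resolve_single_token(encode: Callable[[str], Sequence[int]], text: str) -> int:
    """Return the token id for text, or raise if it splits into several tokens."""
    ids = list(encode(text))
    if len(ids) != 1:
        raise ValueError(f"{text!r} maps to {len(ids)} tokens: {ids}")
    return ids[0]


# Toy stand-in for a tokenizer: whole-string lookup, else one id per character
TOY_VOCAB = {"▁Austin": 11, "▁Dallas": 12}


def toy_encode(text: str) -> Sequence[int]:
    if text in TOY_VOCAB:
        return [TOY_VOCAB[text]]
    return [ord(ch) for ch in text]


print(resolve_single_token(toy_encode, "▁Austin"))
```

With the real tokenizer, `partial(tokenizer.encode, add_special_tokens=False)` could be passed as `encode`. A check like this makes it visible when a string is split into several pieces, in which case taking only the last id (as the `[-1]` indexing in this notebook does) would silently target a sub-word token.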

In [None]:
graph_str = attribute(
    prompt=prompt, model=model,
    attribution_targets=[token_x, token_y],
    **attr_kwargs,
)
print(f"String-targets graph: {len(graph_str.logit_targets)} targets, "
      f"{graph_str.active_features.shape[0]} active features")

# Free CUDA memory before next run
cleanup_cuda()

### Token-ID Targets — `torch.Tensor`

Pass a tensor of vocabulary token IDs to attribute from specific indices. This is the pre-tokenized equivalent of the string-target mode above — internally, the same probabilities and unembedding vectors are computed. Use this mode when you already have token IDs (e.g., from a prior tokenization step) and want to skip the string→ID lookup.

In [None]:
# Use the same token IDs as the string-target example above
tensor_targets = torch.tensor([idx_x, idx_y])

graph_tensor = attribute(
    prompt=prompt, model=model,
    attribution_targets=tensor_targets,
    **attr_kwargs,
)
print(f"Tensor-targets graph: {len(graph_tensor.logit_targets)} targets, "
      f"{graph_tensor.active_features.shape[0]} active features")

# Free CUDA memory before next run
cleanup_cuda()

## Custom Attribution Targets

Beyond the basic modes above, `AttributionTargets` also accepts a `Sequence[TargetSpec]`: fully specified custom targets that let you attribute toward **arbitrary directions** in the residual stream. This opens up a large experimental surface; this tutorial explores two examples:

- **Logit Difference Target** — encodes the direction `logit(Austin) − logit(Dallas)`, surfacing features that drive the model to prefer one token *over* another rather than boosting either in isolation.
- **Semantic Concept Target** — encodes an abstract *Capitals − States* direction built from multiple (capital, state) pairs via vector rejection, isolating the *capital-of* relation from shared geography.

Expand the section below for a field-by-field reference on `CustomTarget` before moving on to the examples.

<details>
<summary><b>TargetSpec / CustomTarget — field reference</b></summary>

The `attribution_targets` argument to `attribute()` accepts a `Sequence[TargetSpec]` for fully custom residual-stream directions. Two convenience types are involved:

**`CustomTarget(token_str, prob, vec)`** is a `NamedTuple` with three fields:

| Field | Type | Description |
|---|---|---|
| `token_str` | `str` | Human-readable label for this target (e.g. `"logit(Austin)−logit(Dallas)"`) |
| `prob` | `float` | Scalar weight — typically the softmax probability of the token, or \|p(x)−p(y)\| for a contrast direction |
| `vec` | `Tensor (d_model,)` | The direction in residual-stream space to attribute toward |

**`TargetSpec`** is a type alias for `CustomTarget | tuple[str, float, torch.Tensor]`. Either form is accepted — a raw 3-tuple is coerced to a `CustomTarget` namedtuple automatically before processing.

**Example — raw tuple (coerced automatically):**

```python
raw: TargetSpec = ("my-direction", 0.05, some_tensor)   # plain 3-tuple → TargetSpec
graph = attribute(prompt=prompt, model=model, attribution_targets=[raw])
```

**Example — explicit `CustomTarget` namedtuple:**

```python
from circuit_tracer.attribution.targets import CustomTarget

target = CustomTarget(
    token_str="logit(Austin)−logit(Dallas)",
    prob=abs(p_austin - p_dallas),        # scalar weight
    vec=unembed_austin - unembed_dallas,  # shape: (d_model,)
)
graph = attribute(prompt=prompt, model=model, attribution_targets=[target])
```
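
The tuple-to-namedtuple coercion described above can be pictured with a small stand-in. This is only a sketch of the idea: `ToyTarget` and `coerce` are hypothetical names, the vector is a plain list rather than a tensor, and the library's own coercion logic may differ.

```python
from typing import NamedTuple, Sequence


class ToyTarget(NamedTuple):
    """Stand-in with the same three fields as CustomTarget."""
    token_str: str
    prob: float
    vec: Sequence[float]


def coerce(spec):
    """Return spec unchanged if it is already a ToyTarget, else wrap the 3-tuple."""
    if isinstance(spec, ToyTarget):
        return spec
    return ToyTarget(*spec)  # raises TypeError if the tuple is not length 3


raw = ("my-direction", 0.05, [0.1, -0.2, 0.3])
target = coerce(raw)
print(target.token_str, target.prob)
```

The `isinstance` check comes first because a namedtuple is itself a tuple; named access (`target.vec`) is what the wrapped form adds over the raw tuple.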

</details>

We first define two helper functions for building these custom targets, then construct and attribute from each one.

### Target Builder Helpers

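Both helpers build a `CustomTarget` namedtuple. `build_semantic_concept_target` returns it directly, while `build_custom_diff_target` returns it together with the two resolved token ids.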

---

**`build_custom_diff_target`** — *Logit-difference direction.*

Subtracts the unembedding column of token $y$ from that of token $x$:

$$\mathbf{d} = \mathbf{u}_{x} - \mathbf{u}_{y}$$

and weights the target by the absolute softmax-probability difference $|p(x) - p(y)|$. Attributing toward $\mathbf{d}$ surfaces features that drive the model to prefer token $x$ *over* token $y$ — a narrower signal than boosting $x$ in isolation.
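
The reason a single vector can stand in for a logit *difference* is linearity: projecting onto $\mathbf{u}_x - \mathbf{u}_y$ gives the same number as projecting onto each column separately and subtracting. The toy check below uses made-up three-dimensional vectors and ignores any final normalization or logit post-processing a real model applies; `dot` and `diff_weight` are hypothetical helpers.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def diff_weight(p_x, p_y, floor=1e-6):
    """Target weight |p(x) - p(y)|, floored so it never reaches zero."""
    return max(abs(p_x - p_y), floor)


h = [0.5, -1.0, 2.0]    # toy final residual-stream vector
u_x = [1.0, 0.0, 1.0]   # toy unembedding column for token x
u_y = [0.0, 1.0, 0.5]   # toy unembedding column for token y

d = [a - b for a, b in zip(u_x, u_y)]   # difference direction
logit_gap = dot(h, u_x) - dot(h, u_y)   # subtract the two logits
projected = dot(h, d)                   # project once onto the difference

print(logit_gap, projected)
print(diff_weight(0.4, 0.1), diff_weight(0.2, 0.2))
```

The floor mirrors the `1e-6` floor used in the helper defined in the next cell, which keeps the weight positive when the two probabilities are nearly equal.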

---

**`build_semantic_concept_target`** — *Abstract concept direction via vector rejection.*

Given paired token groups $A$ (e.g. capital cities) and $B$ (e.g. states), the function strips the "state" component from each "capital" unembedding vector by subtracting its orthogonal projection onto the paired state vector (vector rejection):

$$\mathbf{r}_i = \mathbf{u}_{a_i} - \frac{\mathbf{u}_{a_i} \cdot \mathbf{u}_{b_i}}{\|\mathbf{u}_{b_i}\|^2}\,\mathbf{u}_{b_i}$$

The final concept direction is the mean of these residuals across all pairs:

$$\mathbf{d}_{\text{concept}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{r}_i$$

**Intuition:** Raw capital-city vectors (Austin, Sacramento, …) are partially explained by their shared geography with their respective states (Texas, California, …). Projecting away the state component leaves a representation of *"capital-ness"* that is less tied to specific geography. Attributing toward $\mathbf{d}_{\text{concept}}$ is intended to reveal features the model uses to execute the abstract *capital-of* relation in this context, a different and potentially more targeted lens than a single logit difference or token-string target.
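
The two formulas above can be checked by hand on toy two-dimensional vectors. The numbers are made up for illustration and `reject` is a hypothetical helper; the real helper in the next cell does the same arithmetic on unembedding vectors.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def reject(a, b):
    """Component of a orthogonal to b: a - (a.b / b.b) * b."""
    scale = dot(a, b) / dot(b, b)
    return [ai - scale * bi for ai, bi in zip(a, b)]


# Toy (capital, state) vector pairs
pairs = [
    ([3.0, 1.0], [1.0, 0.0]),
    ([2.0, 2.0], [0.0, 1.0]),
]

residuals = [reject(a, b) for a, b in pairs]

# Mean of the residuals, taken coordinate by coordinate
n = len(residuals)
concept = [sum(r[k] for r in residuals) / n for k in range(len(residuals[0]))]

print(residuals)  # each residual is orthogonal to its own state vector
print(concept)
```

Each residual is orthogonal only to its *own* paired state vector, so the averaged direction can still have a nonzero component along individual state vectors (here `dot(concept, [1.0, 0.0])` is `1.0`). The rejection removes per-pair geography, not the whole span of state directions.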

In [None]:
def _get_last_position_probs(model, prompt):
    """Get softmax probabilities at the last token position."""
    input_ids = model.ensure_tokenized(prompt)
    with torch.no_grad():
        logits, _ = model.get_activations(input_ids)
    return torch.softmax(logits.squeeze(0)[-1], dim=-1)


def build_custom_diff_target(model, prompt, token_x, token_y, backend):
    """Build a CustomTarget for the direction logit(token_x) − logit(token_y).

    Returns (custom_target, idx_x, idx_y).
    """
    tokenizer = model.tokenizer
    idx_x = tokenizer.encode(token_x, add_special_tokens=False)[-1]
    idx_y = tokenizer.encode(token_y, add_special_tokens=False)[-1]

    # Extract unembed columns
    vec_x, vec_y = get_unembed_vecs(model, [idx_x, idx_y], backend)
    diff_vec = vec_x - vec_y

    # Weight = |p(token_x) − p(token_y)|, floored at 1e-6
    probs = _get_last_position_probs(model, prompt)
    diff_prob = max((probs[idx_x] - probs[idx_y]).abs().item(), 1e-6)

    custom_target = CustomTarget(
        token_str=f"logit({token_x})-logit({token_y})",
        prob=diff_prob,
        vec=diff_vec,
    )
    return custom_target, idx_x, idx_y


def build_semantic_concept_target(model, prompt, group_a_tokens, group_b_tokens, label, backend):
    """Build a CustomTarget for an abstract concept direction via vector rejection.

    For each (capital, state) pair, project the capital vector onto the state
    vector and subtract that projection.  The residual is the component of
    "capital-ness" orthogonal to its state, stripping out shared geography.

        v_residual_i = v_cap_i − proj_{v_state_i}(v_cap_i)

    The final direction is the mean of these residuals.

    Returns CustomTarget.
    """
    assert len(group_a_tokens) == len(group_b_tokens), "Groups must have equal length for paired differences"
    tokenizer = model.tokenizer
    ids_a = [tokenizer.encode(t, add_special_tokens=False)[-1] for t in group_a_tokens]
    ids_b = [tokenizer.encode(t, add_special_tokens=False)[-1] for t in group_b_tokens]

    vecs_a = get_unembed_vecs(model, ids_a, backend)
    vecs_b = get_unembed_vecs(model, ids_b, backend)

    # Vector rejection: for each pair, remove the state-direction component
    residuals = []
    for va, vb in zip(vecs_a, vecs_b):
        va_f, vb_f = va.float(), vb.float()
        proj = (va_f @ vb_f) / (vb_f @ vb_f) * vb_f      # proj_{state}(capital)
        residuals.append((va_f - proj).to(va.dtype))

    direction = torch.stack(residuals).mean(0)

    # Weight = average probability of group-A tokens, floored at 1e-6
    probs = _get_last_position_probs(model, prompt)
    avg_prob = max(sum(probs[i].item() for i in ids_a) / len(ids_a), 1e-6)

    return CustomTarget(token_str=label, prob=avg_prob, vec=direction)

### Custom Target Configuration

Build the two custom targets and display a summary of all attribution configurations.

In [None]:
# Build the custom diff-target: logit(Austin) − logit(Dallas)
custom_target, _, _ = build_custom_diff_target(
    model, prompt, token_x, token_y, backend=backend
)

# Build the semantic concept target: Capital Cities − States
capitals = ["▁Austin", "▁Sacramento", "▁Olympia", "▁Atlanta"]
states   = ["▁Texas", "▁California", "▁Washington", "▁Georgia"]
semantic_target = build_semantic_concept_target(
    model, prompt, capitals, states,
    label="Capitals − States", backend=backend,
)

display_attribution_config(
    token_pairs=[(token_x, idx_x), (token_y, idx_y), ("▁Texas", idx_texas)],
    target_pairs=[("Logit diff", custom_target), ("Semantic concept", semantic_target)],
)

### Logit Difference Target

Pass a `CustomTarget` (or any `TargetSpec` — a tuple of `(token_str, prob, vec)`) that encodes a contrast direction in the residual stream. Here the direction is `logit(Austin) − logit(Dallas)`, constructing an attribution graph that more narrowly surfaces features driving the selection of the *correct* answer over the surface-level attractor.

In [None]:
graph_custom = attribute(
    prompt=prompt, model=model,
    attribution_targets=[custom_target],
    **attr_kwargs,
)
print(f"Custom-target graph: {len(graph_custom.logit_targets)} targets, "
      f"{graph_custom.active_features.shape[0]} active features")

### Semantic Direction (Concept Target)

Instead of a pairwise logit difference, we can attribute to an **abstract concept direction** in the residual stream. We build a `CustomTarget` via vector rejection: for each (capital, state) pair, project the capital vector onto the state vector and subtract that projection, leaving the pure 'capital-ness' component.

In [None]:
graph_semantic = attribute(
    prompt=prompt, model=model,
    attribution_targets=[semantic_target],
    **attr_kwargs,
)
print(f"Semantic-target graph: {len(graph_semantic.logit_targets)} targets, "
      f"{graph_semantic.active_features.shape[0]} active features")

# Free CUDA memory before feature comparison
cleanup_cuda()

## Compare Top Features

Extract the top-10 features from each graph (ranked by multi-hop influence) and display them side by side. Feature indices link to their [Neuronpedia](https://www.neuronpedia.org/) dashboards. The *Custom Fn* column highlights features that specifically drive the Austin-vs-Dallas logit difference, i.e. the multi-hop reasoning circuit (Dallas → Texas → capital → Austin). The *Semantic Concept* column surfaces features associated with the more general *capital-of* relation; these partially overlap with the multi-hop chain but also include distinct features that may reflect more abstract capital-related reasoning.

> **Note:** The `torch.Tensor` target example is omitted from this comparison because it uses the same token IDs as the `Sequence[str]` example, so the two graphs should be identical.

In [None]:
top_salient, scores_salient   = get_top_features(graph_salient,  n=10)
top_str, scores_str           = get_top_features(graph_str,      n=10)
top_custom, scores_custom     = get_top_features(graph_custom,   n=10)
top_semantic, scores_semantic = get_top_features(graph_semantic,  n=10)

display_top_features_comparison(
    {
        "Salient Logits": top_salient,
        f"Strings [{token_x}, {token_y}]": top_str,
        f"Custom Fn ({custom_target.token_str})": top_custom,
        f"Semantic Concept ({semantic_target.token_str})": top_semantic,
    },
    scores_sets={
        "Salient Logits": scores_salient,
        f"Strings [{token_x}, {token_y}]": scores_str,
        f"Custom Fn ({custom_target.token_str})": scores_custom,
        f"Semantic Concept ({semantic_target.token_str})": scores_semantic,
    },
    neuronpedia_model="gemma-2-2b",
)

## Circuit Interventions

Having identified the top features for each attribution mode, we can now run interventions: we manipulate the discovered features and check whether the model's output changes as their hypothesized causal roles predict. We explore both amplification and ablation of the logit-difference and semantic concept circuits.

### Amplify the Austin-Dallas Logit Difference Circuit

To confirm these custom-target features are causally meaningful, we amplify them by 10× and check that the Austin-vs-Dallas logit gap widens (i.e., the model becomes even more confident Austin is correct).
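
The interventions in this section are passed as `(layer, pos, feat_idx, value)` tuples, where `value` is the new activation to set rather than a multiplier. A minimal sketch of building such tuples, assuming a toy dictionary of activations in place of the sparse tensor used in the notebook (`scale_features` and the feature coordinates are hypothetical):

```python
def scale_features(activations, features, factor):
    """Build intervention tuples that set each feature to factor * its activation.

    factor > 1 amplifies, factor = 0 ablates.
    """
    return [
        (layer, pos, feat, factor * activations[(layer, pos, feat)])
        for (layer, pos, feat) in features
    ]


# Toy activations keyed by (layer, position, feature index)
toy_activations = {(0, 3, 7): 2.0, (1, 3, 42): 0.5}
toy_top_features = list(toy_activations)

amplified = scale_features(toy_activations, toy_top_features, 10.0)
ablated = scale_features(toy_activations, toy_top_features, 0.0)

print(amplified)
print(ablated)
```

Expressing amplification and ablation through one scaling factor keeps the two experiments directly comparable: only `factor` changes between them.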

In [None]:
# Get activations for interventions
input_ids = model.ensure_tokenized(prompt)
original_logits, activations = model.get_activations(input_ids, sparse=True)

# Baseline
display_token_probs(original_logits, key_ids, key_labels, title="Before amplification")

# Amplify top custom-target features by 10×
intervention_tuples = [
    (layer, pos, feat_idx, 10.0 * activations[layer, pos, feat_idx])
    for (layer, pos, feat_idx) in top_custom
]

new_logits, _ = model.feature_intervention(input_ids, intervention_tuples)

display_token_probs(new_logits, key_ids, key_labels, title="After 10× amplification")

orig_gap = (original_logits.squeeze(0)[-1, idx_x] - original_logits.squeeze(0)[-1, idx_y]).item()
new_gap = (new_logits.squeeze(0)[-1, idx_x] - new_logits.squeeze(0)[-1, idx_y]).item()
print(f"\nlogit(Austin) − logit(Dallas): {orig_gap:.4f} → {new_gap:.4f}  (Δ = {new_gap - orig_gap:+.4f})")

display_topk(prompt, original_logits, new_logits)

### Amplify the Semantic Concept Circuit

Same amplification test for the **semantic concept** features. We compare a modest 2× boost (a gentle nudge along the concept axis) with a strong 10× boost to observe the difference in behaviour.

In [None]:
# Baseline
display_token_probs(original_logits, key_ids, key_labels, title="Before amplification (semantic)")

orig_gap = (original_logits.squeeze(0)[-1, idx_x] - original_logits.squeeze(0)[-1, idx_y]).item()

# --- 2× amplification (gentle nudge along the concept axis) ---
sem_amp_tuples_2 = [
    (layer, pos, feat_idx, 2.0 * activations[layer, pos, feat_idx])
    for (layer, pos, feat_idx) in top_semantic
]

sem_amp_logits_2, _ = model.feature_intervention(input_ids, sem_amp_tuples_2)

display_token_probs(sem_amp_logits_2, key_ids, key_labels, title="After 2× amplification (semantic)")

sem_gap_2 = (sem_amp_logits_2.squeeze(0)[-1, idx_x] - sem_amp_logits_2.squeeze(0)[-1, idx_y]).item()
print(f"\nlogit(Austin) − logit(Dallas): {orig_gap:.4f} → {sem_gap_2:.4f}  (Δ = {sem_gap_2 - orig_gap:+.4f})  [2×]")

display_topk(prompt, original_logits, sem_amp_logits_2)

# --- 10× amplification (strong boost) ---
sem_amp_tuples_10 = [
    (layer, pos, feat_idx, 10.0 * activations[layer, pos, feat_idx])
    for (layer, pos, feat_idx) in top_semantic
]

sem_amp_logits_10, _ = model.feature_intervention(input_ids, sem_amp_tuples_10)

display_token_probs(sem_amp_logits_10, key_ids, key_labels, title="After 10× amplification (semantic)")

sem_gap_10 = (sem_amp_logits_10.squeeze(0)[-1, idx_x] - sem_amp_logits_10.squeeze(0)[-1, idx_y]).item()
print(f"\nlogit(Austin) − logit(Dallas): {orig_gap:.4f} → {sem_gap_10:.4f}  (Δ = {sem_gap_10 - orig_gap:+.4f})  [10×]")

display_topk(prompt, original_logits, sem_amp_logits_10)

### Ablate the Austin-Dallas Logit Difference Circuit

Now we do the opposite: zero out progressively more features important to our custom target to dampen the Austin-driving circuit. With enough of the multi-hop reasoning path suppressed, the model can no longer resolve the correct answer and reverts to nearby concepts — e.g. the intermediate state (Texas) rather than its capital.
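
Two quantities are tracked before and after each ablation: the softmax probabilities of the key tokens and the raw logit gap between the two contrasted tokens. The sketch below computes both on made-up logits for a three-token vocabulary; `softmax` and `summarize` are hypothetical helpers, and the numbers are not model outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(logits, idx_x, idx_y):
    """Return the probabilities, the x-minus-y logit gap, and the argmax index."""
    probs = softmax(logits)
    return {
        "probs": probs,
        "gap": logits[idx_x] - logits[idx_y],
        "argmax": max(range(len(logits)), key=lambda i: logits[i]),
    }


# Toy vocabulary order: 0 = Austin, 1 = Dallas, 2 = Texas
before = summarize([5.0, 3.0, 2.0], idx_x=0, idx_y=1)
after = summarize([2.0, 3.0, 4.0], idx_x=0, idx_y=1)

print(before["gap"], "->", after["gap"])
print(before["argmax"], "->", after["argmax"])
```

The gap depends only on the two contrasted logits, while the probabilities also reflect every other token, which is why the chart in the next cell reports both.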

In [None]:
from IPython.display import display, Markdown

# Progressive ablation: zero out increasing numbers of custom-target features
probs_base = torch.softmax(original_logits.squeeze(0)[-1].float(), dim=-1)
groups = {"baseline": {
    "P(Austin)": probs_base[idx_x].item(),
    "P(Dallas)": probs_base[idx_y].item(),
    "P(Texas)":  probs_base[idx_texas].item(),
}}
logit_diffs = {"baseline": orig_gap}

ablation_results = {}
for n in [10, 100]:
    top_n, _ = get_top_features(graph_custom, n=n)
    abl_tuples = [
        (layer, pos, feat_idx, 0.0 * activations[layer, pos, feat_idx])
        for (layer, pos, feat_idx) in top_n
    ]
    abl_logits, _ = model.feature_intervention(input_ids, abl_tuples)
    probs_abl = torch.softmax(abl_logits.squeeze(0)[-1].float(), dim=-1)
    gap = (abl_logits.squeeze(0)[-1, idx_x] - abl_logits.squeeze(0)[-1, idx_y]).item()
    label = f"top-{n}"
    groups[label] = {
        "P(Austin)": probs_abl[idx_x].item(),
        "P(Dallas)": probs_abl[idx_y].item(),
        "P(Texas)":  probs_abl[idx_texas].item(),
    }
    logit_diffs[label] = gap
    ablation_results[n] = abl_logits

display_ablation_chart(groups, logit_diffs=logit_diffs,
                       title="Custom-target ablation: token probabilities & logit gap")

# Show the full top-k comparison for the strongest ablation
strongest_n = max(ablation_results.keys())
display(Markdown(f"#### Top-{strongest_n} ablation — full prediction shift"))
display_topk(prompt, original_logits, ablation_results[strongest_n])

### Ablate the Semantic Concept Circuit

Same progressive ablation, now zeroing out features from the **semantic concept** graph. Because the concept direction captures the capital-vs-state pathway, ablation should similarly collapse the Austin signal.

In [None]:
from IPython.display import display, Markdown

# Progressive ablation of semantic-target features
sem_groups = {"baseline": {
    "P(Austin)": probs_base[idx_x].item(),
    "P(Dallas)": probs_base[idx_y].item(),
    "P(Texas)":  probs_base[idx_texas].item(),
}}
sem_logit_diffs = {"baseline": orig_gap}

sem_ablation_results = {}
for n in [10, 100]:
    top_n, _ = get_top_features(graph_semantic, n=n)
    abl_tuples = [
        (layer, pos, feat_idx, 0.0 * activations[layer, pos, feat_idx])
        for (layer, pos, feat_idx) in top_n
    ]
    abl_logits, _ = model.feature_intervention(input_ids, abl_tuples)
    probs_abl = torch.softmax(abl_logits.squeeze(0)[-1].float(), dim=-1)
    gap = (abl_logits.squeeze(0)[-1, idx_x] - abl_logits.squeeze(0)[-1, idx_y]).item()
    label = f"top-{n}"
    sem_groups[label] = {
        "P(Austin)": probs_abl[idx_x].item(),
        "P(Dallas)": probs_abl[idx_y].item(),
        "P(Texas)":  probs_abl[idx_texas].item(),
    }
    sem_logit_diffs[label] = gap
    sem_ablation_results[n] = abl_logits

display_ablation_chart(sem_groups, logit_diffs=sem_logit_diffs,
                       title="Semantic-target ablation: token probabilities & logit gap")

# Show the full top-k comparison for the strongest ablation
strongest_n = max(sem_ablation_results.keys())
display(Markdown(f"#### Top-{strongest_n} semantic ablation — full prediction shift"))
display_topk(prompt, original_logits, sem_ablation_results[strongest_n])

## Visualize the Semantic Concept Graph

Save the **semantic concept** graph and serve it locally. The interactive visualization shows the circuit driving the abstract `Capitals − States` direction — the multi-hop reasoning path.

**If running on a remote server, set up port forwarding so that port 8046 is accessible on your local machine.**

In [None]:
from pathlib import Path

graph_dir = Path("attribution_targets_demo/graphs")
graph_dir.mkdir(parents=True, exist_ok=True)
graph_path = graph_dir / "dallas_austin_semantic_concept_graph.pt"
graph_semantic.to_pt(graph_path)

slug = "dallas-austin-semantic-concept"
graph_file_dir = "attribution_targets_demo/graph_files"
node_threshold, edge_threshold = 0.8, 0.98

create_graph_files(
    graph_or_path=graph_path,
    slug=slug,
    output_path=graph_file_dir,
    node_threshold=node_threshold,
    edge_threshold=edge_threshold,
)

In [None]:
from circuit_tracer.frontend.local_server import serve

port = 8046
server = serve(data_dir="attribution_targets_demo/graph_files/", port=port)

if IN_COLAB:
    from google.colab import output as colab_output  # noqa
    colab_output.serve_kernel_port_as_iframe(
        port, path="/index.html", height="800px", cache_in_notebook=True
    )
else:
    from IPython.display import IFrame, display
    print(f"Open your graph at: http://localhost:{port}/index.html")
    display(IFrame(src=f"http://localhost:{port}/index.html", width="100%", height="800px"))

In [None]:
# server.stop()