# Explore Activations

The proposal and its hypotheses are described here: https://docs.google.com/document/d/1x7n2iy1_LZXZNLQpxCzF84lZ8BEG6ZT3KWXC59erhJA

In [None]:
import torch
from transformer_lens import HookedTransformer

# 1. Load an open-source model (e.g., GPT-2 or Llama-3-8B if memory allows)
# The proposal suggests models that don't call tools [cite: 90]
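# Fall back to CPU when no GPU is available so the cell runs anywhere
model = HookedTransformer.from_pretrained("gpt2-small", device="cuda" if torch.cuda.is_available() else "cpu")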

## Step 1: Define Your Task Set
Based on your proposal, we will use the five core math tasks, whose prompts share identical phrasing up to the final task-identifying word.

In [None]:
tasks = ["minimum", "maximum", "sum", "difference", "product"]
prompts = [f"Answer minimally: Given the numbers 25 and 9 calculate the {t}" for t in tasks]

## Step 2: Extract Residual Stream Activations
To isolate the "Categorization Layer", you should extract the activations from the residual stream at the final token position across all layers. The final token (the task word) is where the categorization is finalized.

In [None]:
# Reuse the `model` loaded in the first cell (or swap in your chosen open model) [cite: 88, 91]

# Storage for vectors: {task_name: tensor of shape [n_layers, d_model]}
task_vectors = {}

for task, prompt in zip(tasks, prompts):
    # Run the prompt and cache activations; gradients are not needed for this analysis
    with torch.no_grad():
        _, cache = model.run_with_cache(prompt)

    # Extract the residual stream at the very last token position.
    # 'resid_post' is the stream after each block's attention and MLP outputs are added.
    # stack_activation returns [n_layers, batch, pos, d_model]; the indexing yields [n_layers, d_model]
    layer_stack = cache.stack_activation("resid_post")[:, 0, -1, :]
    task_vectors[task] = layer_stack.detach().cpu()

## Step 3: Isolate the "Direction" of Each Category
Because these prompts are nearly identical, the raw activation vectors will be very similar. To find the vector specific to a category (e.g., the "Sum-ness" of the prompt), you must subtract the "average task" baseline.

In [None]:
# 1. Stack all vectors into a tensor [5_tasks, n_layers, d_model]
all_vectors = torch.stack(list(task_vectors.values()))

# 2. Calculate the 'Centroid' (the average activation for this prompt template)
centroid = all_vectors.mean(dim=0)

# 3. Calculate the Task-Specific Vector (Deviation from the mean)
# This represents the unique 'Categorization Circuit' activation for each task
specific_vectors = {}
for i, task in enumerate(tasks):
    specific_vectors[task] = all_vectors[i] - centroid
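
The effect of the centroid subtraction can be checked by hand on toy data. The sketch below is dependency-free and uses made-up three-dimensional vectors (`template` and `offsets` are illustrative assumptions, not model activations). It shows that vectors sharing a large template component look almost identical under cosine similarity until the centroid is removed, and that the centered vectors sum to zero.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy data: a large shared "template" plus a small task-specific offset
template = [10.0, 10.0, 10.0]
offsets = {
    "sum": [0.5, 0.0, 0.0],
    "product": [0.0, 0.5, 0.0],
    "minimum": [0.0, 0.0, 0.5],
}
raw = {t: [s + o for s, o in zip(template, off)] for t, off in offsets.items()}

# Centroid = element-wise mean over tasks; centered = raw - centroid
n = len(raw)
centroid = [sum(col) / n for col in zip(*raw.values())]
centered = {t: [x - c for x, c in zip(vec, centroid)] for t, vec in raw.items()}

raw_cos = cosine(raw["sum"], raw["product"])
centered_cos = cosine(centered["sum"], centered["product"])
# Each coordinate of the centered vectors sums to zero across tasks
residual = [sum(col) for col in zip(*centered.values())]

print(f"raw cosine: {raw_cos:.4f}, centered cosine: {centered_cos:.4f}")
```

In the notebook itself the same two operations are `all_vectors.mean(dim=0)` and the subtraction inside the loop above.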

## Step 4: Validate the Vectors (Orthogonality)
To test the Structural Separation and non-overlapping space hypotheses, compute the cosine similarity between the five task-specific vectors.

In [None]:
import torch.nn.functional as F

# Select a layer to analyze (index 8 of gpt2-small's 12 blocks is a starting guess; sweep all layers to locate the categorization layer)
layer_idx = 8 
v1 = specific_vectors["sum"][layer_idx]
v2 = specific_vectors["product"][layer_idx]

similarity = F.cosine_similarity(v1.unsqueeze(0), v2.unsqueeze(0))
print(f"Similarity between Sum and Product: {similarity.item()}")
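
One caveat when reading these similarities: because the centroid was subtracted, the five task-specific vectors sum to zero at every layer, so they cannot all be mutually orthogonal. If the task-specific offsets were perfectly orthogonal with equal norms, every pairwise cosine after centering would be $-1/(n-1)$, which is $-0.25$ for five tasks. The sketch below illustrates this with hypothetical one-hot vectors standing in for ideal task directions; it is a toy check of the arithmetic, not a model measurement.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

n_tasks = 5

# Hypothetical ideal case: each task gets its own orthogonal direction (one-hot vectors)
raw = [[1.0 if i == j else 0.0 for j in range(n_tasks)] for i in range(n_tasks)]

# Subtract the centroid, exactly as in Step 3
centroid = [sum(col) / n_tasks for col in zip(*raw)]
centered = [[x - c for x, c in zip(vec, centroid)] for vec in raw]

# All off-diagonal cosine similarities between centered vectors
off_diagonal = [
    cosine(centered[i], centered[j])
    for i in range(n_tasks)
    for j in range(n_tasks)
    if i != j
]
baseline = -1.0 / (n_tasks - 1)

print(f"off-diagonal range: {min(off_diagonal):.3f} to {max(off_diagonal):.3f}, baseline: {baseline:.3f}")
```

Under this assumption, a measured similarity near $-0.25$ (rather than $0$) is what fully separated task directions would produce, so it may be more informative to compare results against that baseline.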

## Analysis of the Results

**Vector Magnitude:** If your Gating Mechanism hypothesis is correct, the magnitude ($L^2$ norm) of these isolated vectors should be significantly higher in the "Categorization Layer" than in the early embedding layers.

**Scale Coordination:** Compare the magnitudes of the "Sum" vector and the "Product" vector. If the magnitudes are similar across tasks, it suggests the model has resolved the Scale Coordination Problem.
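
Both checks reduce to computing $L^2$ norms per task and per layer. The sketch below is a dependency-free illustration on made-up numbers: `toy` is a hypothetical stand-in for `specific_vectors`, and `norm_spread` (the ratio of the largest to the smallest norm across tasks) is one possible summary chosen here for illustration, not a metric taken from the proposal.

```python
import math

def l2_norm(vec):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in vec))

def norms_by_layer(task_to_layers):
    """Map each task to a list of L2 norms, one per layer."""
    return {task: [l2_norm(v) for v in layers] for task, layers in task_to_layers.items()}

def norm_spread(norms_at_layer):
    """Ratio of largest to smallest norm across tasks (1.0 means identical scale).

    Assumes no norm is exactly zero.
    """
    return max(norms_at_layer) / min(norms_at_layer)

# Hypothetical stand-in for specific_vectors: {task: [layer_0_vector, layer_1_vector]}
toy = {
    "sum": [[0.3, 0.4], [3.0, 4.0]],
    "product": [[0.6, 0.8], [4.0, 3.0]],
}

norms = norms_by_layer(toy)
n_layers = len(next(iter(toy.values())))
spread_per_layer = [norm_spread([norms[t][layer] for t in toy]) for layer in range(n_layers)]

for task, values in norms.items():
    print(task, [round(v, 3) for v in values])
print("spread per layer:", [round(s, 3) for s in spread_per_layer])
```

With the notebook's tensors, the per-layer norms for one task can be obtained with `specific_vectors[task].norm(dim=-1)`.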

In [None]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
from transformer_lens import HookedTransformer

# 1. Setup and Extraction
model = HookedTransformer.from_pretrained("gpt2-small")
tasks = ["minimum", "maximum", "sum", "difference", "product"]
prompts = [f"Answer minimally: Given the numbers 25 and 9 calculate the {t}" for t in tasks]

all_layer_vectors = []
for prompt in prompts:
    logits, cache = model.run_with_cache(prompt)
    # Extract resid_post from layer index 8 (a candidate 'categorization' layer; gpt2-small has 12 blocks)
    # Shape: [d_model]
    vec = cache["resid_post", 8][0, -1, :].detach().cpu()
    all_layer_vectors.append(vec)

# 2. Isolate Task-Specific Vectors (Subtract Centroid)
all_vectors_tensor = torch.stack(all_layer_vectors)
centroid = all_vectors_tensor.mean(dim=0)
specific_vectors = all_vectors_tensor - centroid 

# 3. Visualization: PCA (2D Projection)
pca = PCA(n_components=2)
pca_results = pca.fit_transform(specific_vectors.numpy())

plt.figure(figsize=(8, 6))
for i, task in enumerate(tasks):
    plt.scatter(pca_results[i, 0], pca_results[i, 1], label=task, s=100)
    plt.text(pca_results[i, 0] + 0.1, pca_results[i, 1] + 0.1, task)

plt.title("2D PCA Projection of Task Categorization Vectors (Layer 8)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.grid(True, linestyle='--', alpha=0.6)
plt.legend()
plt.savefig("task_pca.png")

# 4. Orthogonality Check: Cosine Similarity Heatmap
# Normalize vectors to unit length for cosine similarity
norm_vectors = F.normalize(specific_vectors, p=2, dim=1)
sim_matrix = torch.mm(norm_vectors, norm_vectors.t()).numpy()

plt.figure(figsize=(8, 6))
sns.heatmap(sim_matrix, annot=True, xticklabels=tasks, yticklabels=tasks, cmap="YlGnBu")
plt.title("Cosine Similarity Matrix: Task Orthogonality")
plt.savefig("similarity_heatmap.png")

## Interpreting the Visuals for Your Hypotheses

**PCA Plot (Geometric Structure)**

- Orthogonal Set: If the "Structural Separation" hypothesis is correct, the vectors should appear as distinct points spread out in the plot.
- Task Space: If "sum" and "difference" (arithmetic) are closer to each other than to "minimum" or "maximum" (comparison), it suggests the model organizes these tasks into a meaningful "task space".

**Similarity Heatmap (Non-overlapping Space)**

- Low Off-Diagonal Values: If the off-diagonal values (e.g., similarity between "sum" and "product") are near $0$, it supports the claim that categorization vectors are stored in non-overlapping parts of activation space.
- High Diagonal (1.0): The diagonal is 1.0 by construction, since each vector is compared with itself; it serves only as a sanity check on the normalization.

**Scale Coordination**

By looking at the raw magnitudes ($L^2$ norm) of `specific_vectors` before normalization, you can test whether the model has a common activation scale for all 100 circuits. If the magnitudes are nearly identical across tasks, it supports the "Scale Coordination" or "Super Weight" calibration theory.

In [None]:
# %% [markdown]
# # Experiment: Decoupled Intelligence
# This notebook investigates the **Structural Separation** hypothesis:
# 1. Models contain distinct circuits for prompt categorization and response generation[cite: 4].
# 2. Categorization circuits are invariant to specific numeric inputs[cite: 25, 53].

# %%
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from transformer_lens import HookedTransformer
from sklearn.decomposition import PCA

# %% [markdown]
# ## 1. Setup Data and Model
# We use 5 tasks and 5 pairs of numbers to create a 25-prompt matrix [cite: 16-20].

# %%
# Initialize model (using GPT-2 for local speed, can be swapped for Llama-3-8B)
model = HookedTransformer.from_pretrained("gpt2-small", device="cuda" if torch.cuda.is_available() else "cpu")

tasks = ["minimum", "maximum", "sum", "difference", "product"]
number_pairs = [(25, 9), (42, 11), (7, 3), (99, 1), (15, 15)]
layer_to_analyze = 8  # Candidate layer (index 8 of gpt2-small's 12 blocks); sweep layers to confirm where the task signal peaks

# Generate the prompt list
all_prompts = []
metadata = [] # To track task and pair for each prompt
for task in tasks:
    for n1, n2 in number_pairs:
        all_prompts.append(f"Answer minimally: Given the numbers {n1} and {n2} calculate the {task}")
        metadata.append({"task": task, "pair": f"({n1},{n2})"})

# %% [markdown]
# ## 2. Extract Activations
# We extract the residual stream at the last token position[cite: 101].

# %%
activations = []

for prompt in all_prompts:
    with torch.no_grad():
        logits, cache = model.run_with_cache(prompt)
        # resid_post at the final token position
        vec = cache["resid_post", layer_to_analyze][0, -1, :].detach().cpu()
        activations.append(vec)

activations_tensor = torch.stack(activations)

# %% [markdown]
# ## 3. Disentangle Categorization
# We subtract the average prompt "template" to find the task-specific vectors[cite: 70, 109].

# %%
# Calculate global mean (centroid) to remove template bias
global_centroid = activations_tensor.mean(dim=0)
task_specific_vectors = activations_tensor - global_centroid

# %% [markdown]
# ## 4. Visualization: 25x25 Similarity Heatmap
# If categorization is independent of numbers, we should see 5 distinct 5x5 blocks[cite: 53, 54].

# %%
# Normalize for cosine similarity
norm_vecs = F.normalize(task_specific_vectors, p=2, dim=1)
sim_matrix = torch.mm(norm_vecs, norm_vecs.t()).numpy()

plt.figure(figsize=(12, 10))
labels = [f"{m['task']} {m['pair']}" for m in metadata]
sns.heatmap(sim_matrix, xticklabels=labels, yticklabels=labels, cmap="viridis", annot=False)
plt.title(f"Cosine Similarity Matrix: Stability of Task Categorization (Layer {layer_to_analyze})")
plt.xlabel("Prompt (Task + Number Pair)")
plt.ylabel("Prompt (Task + Number Pair)")
plt.show()

# %% [markdown]
# ## 5. Visualization: PCA Projection
# We project the 25 vectors into 2D space to see the "Task Clusters"[cite: 110].

# %%
pca = PCA(n_components=2)
pca_results = pca.fit_transform(task_specific_vectors.numpy())

plt.figure(figsize=(10, 8))
colors = sns.color_palette("hls", len(tasks))
task_colors = {task: colors[i] for i, task in enumerate(tasks)}

for i, meta in enumerate(metadata):
    plt.scatter(
        pca_results[i, 0], 
        pca_results[i, 1], 
        color=task_colors[meta['task']], 
        label=meta['task'] if i % 5 == 0 else ""
    )
    # Label a few points
    if i % 5 == 0:
        plt.text(pca_results[i, 0] + 0.05, pca_results[i, 1] + 0.05, meta['task'], weight='bold')

plt.legend(title="Tasks")
plt.title("PCA: Task Categorization Clusters (Invariant to Numeric Inputs)")
plt.grid(True, alpha=0.3)
plt.show()
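
The block structure expected in the 25x25 heatmap can also be summarized numerically by comparing the mean similarity within a task to the mean similarity between tasks. The sketch below is a minimal, dependency-free illustration; `block_contrast` is a hypothetical helper name, and the 4x4 matrix is made-up data rather than model output.

```python
def block_contrast(sim_matrix, labels):
    """Return (mean within-task similarity, mean between-task similarity).

    The diagonal is excluded because self-similarity is always 1.
    """
    within, between = [], []
    for i in range(len(labels)):
        for j in range(len(labels)):
            if i == j:
                continue
            if labels[i] == labels[j]:
                within.append(sim_matrix[i][j])
            else:
                between.append(sim_matrix[i][j])
    return sum(within) / len(within), sum(between) / len(between)

# Hypothetical toy similarity matrix: two tasks, two number pairs each
toy_labels = ["sum", "sum", "product", "product"]
toy_sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]

within_mean, between_mean = block_contrast(toy_sim, toy_labels)
print(f"within-task: {within_mean:.2f}, between-task: {between_mean:.2f}")
```

In the notebook this could be called as `block_contrast(sim_matrix, [m["task"] for m in metadata])`. A large gap between the two means would be consistent with categorization being invariant to the numeric inputs, while similar values would suggest the numbers dominate the representation at that layer.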