# Generalizability Evaluation: Vector Arithmetic in Concept and Token Subspaces

This notebook evaluates whether the findings from the arithmetic_eval repository generalize beyond the original experimental setting.

## Repository Summary

The research identifies two specialized types of attention heads in Llama-2-7b:
- **Concept Induction Heads**: Excel at semantic tasks (e.g., capitals, family relations)
- **Token Induction Heads**: Excel at grammatical tasks (e.g., verb tense, pluralization)

Key finding: Word2vec-style arithmetic (a - b + d ≈ c) works better when performed in focused semantic/grammatical subspaces created by summing OV matrices from top-k identified heads.
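The arithmetic itself can be illustrated with a toy example. The sketch below is not the repository's evaluation code: the function name `analogy_predict`, the three-dimensional vectors, and the choice of cosine similarity over the full vocabulary are all assumptions made for illustration. It computes a - b + d, optionally after multiplying every vector by a lens matrix, and returns the nearest vocabulary word.

```python
import numpy as np

def analogy_predict(vocab, vectors, a, b, d, lens=None):
    """Return the vocabulary word closest (by cosine) to a - b + d.

    vocab   : list of words, aligned with the rows of `vectors`
    vectors : (n_words, dim) array of word representations
    lens    : optional (dim, dim) matrix applied to every row first
    """
    X = vectors if lens is None else vectors @ lens
    index = {word: i for i, word in enumerate(vocab)}
    query = X[index[a]] - X[index[b]] + X[index[d]]
    # Cosine similarity between the query and every candidate row.
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(query)
    scores = (X @ query) / np.maximum(norms, 1e-12)
    # Assumption: the query words themselves stay in the candidate set.
    return vocab[int(np.argmax(scores))]

# Hand-built toy vectors (hypothetical, not taken from any model).
vocab = ["Athens", "Greece", "Tokyo", "Japan"]
vectors = np.array([
    [1.0, 1.0, 0.0],  # Athens = "capital" direction + Greece direction
    [0.0, 1.0, 0.0],  # Greece
    [1.0, 0.0, 1.0],  # Tokyo  = "capital" direction + Japan direction
    [0.0, 0.0, 1.0],  # Japan
])
prediction = analogy_predict(vocab, vectors, "Athens", "Greece", "Japan")
print(prediction)  # Tokyo
```

Passing a matrix as `lens` gives the "Projected" variant; leaving it as `None` gives the "Raw" variant.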

## Generalization Checklist Summary

| Item | Description | Status |
|------|-------------|--------|
| GT1 | Model Generalization | **PASS** |
| GT2 | Data Generalization | **PASS** |
| GT3 | Method Generalizability | **FAIL** |

In [1]:
import os
os.chdir('/home/smallyan/eval_agent')

import sys
sys.path.insert(0, '/net/scratch2/smallyan/arithmetic_eval/scripts')

import torch
import json
import numpy as np

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

CUDA available: True
GPU: NVIDIA H100 NVL
GPU Memory: 100.0 GB


## GT1: Model Generalization

**Question**: Does the attention-head finding transfer to a new model not used in the original work?

**Original Model**: Llama-2-7b-hf
**Test Model**: Meta-Llama-3-8B (different architecture; uses grouped-query attention)

### Trial Results

| Trial | Analogy | Projected | Raw |
|-------|---------|-----------|-----|
| 1 | Athens - Greece + Japan = Tokyo | Athens (FAIL) | Athens (FAIL) |
| 2 | Berlin - Germany + France = Paris | Berlin (FAIL) | Berlin (FAIL) |
| 3 | dancing - danced + ran = running | **running (PASS)** | ran (FAIL) |

### Result: **PASS**

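The token-head projection correctly solved the grammatical analogy on Llama-3-8B, a model not used in the original work, while the raw hidden states did not. Both capital-country analogies failed under either method, so the pass rests on one of three trials and supports transfer for the grammatical task only.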

In [2]:
# Load GT1 results
with open('/net/scratch2/smallyan/arithmetic_eval/evaluation/gt1_results.json', 'r') as f:
    gt1_data = json.load(f)
print("GT1 Results:")
print(json.dumps(gt1_data, indent=2))

GT1 Results:
{
  "results": [
    {"proj": false, "raw": false, "proj_pred": "Athens", "raw_pred": "Athens"},
    {"proj": false, "raw": false, "proj_pred": "Berlin", "raw_pred": "Berlin"},
    {"proj": true, "raw": false, "proj_pred": "running", "raw_pred": "ran"}
  ],
  "pass": true,
  "count": 1,
  "model": "Meta-Llama-3-8B"
}
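The `count` and `pass` fields above are consistent with a rule of "pass if at least one projected prediction is correct". That rule is inferred from the three result files in this notebook, not confirmed by the repository. Under that assumption, a minimal sketch of the tally (the function name `summarize` and the `min_successes` parameter are hypothetical) looks like this:

```python
def summarize(results, min_successes=1):
    """Tally projected and raw successes for one checklist item.

    Assumption: an item passes when the projected method gets at least
    `min_successes` trials right. The default of 1 is inferred from the
    JSON shown in this notebook, not taken from the repository.
    """
    proj_count = sum(1 for r in results if r["proj"])
    raw_count = sum(1 for r in results if r["raw"])
    return {
        "count": proj_count,
        "raw_count": raw_count,
        "n": len(results),
        "pass": proj_count >= min_successes,
    }

# GT1 results copied from the output above.
gt1_results = [
    {"proj": False, "raw": False, "proj_pred": "Athens", "raw_pred": "Athens"},
    {"proj": False, "raw": False, "proj_pred": "Berlin", "raw_pred": "Berlin"},
    {"proj": True, "raw": False, "proj_pred": "running", "raw_pred": "ran"},
]
lenient = summarize(gt1_results)
strict = summarize(gt1_results, min_successes=2)
print(lenient["pass"], strict["pass"])  # True False
```

The verdict is sensitive to the threshold: requiring a majority of trials would turn this item into a fail.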


## GT2: Data Generalization

**Question**: Does the attention-head finding hold on new data instances not appearing in the original dataset?

**Model**: Llama-2-7b-hf (same as original)
**Data**: New country-capital pairs and verbs that are not in the original Word2Vec dataset

### Trial Results

| Trial | Analogy | Projected | Raw |
|-------|---------|-----------|-----|
| 1 | Brasilia - Brazil + Peru = Lima | Brasilia (FAIL) | Brasilia (FAIL) |
| 2 | Bogota - Colombia + Ecuador = Quito | Bogota (FAIL) | Ecuador (FAIL) |
| 3 | climbing - climbed + painted = painting | **painting (PASS)** | painted (FAIL) |

### Result: **PASS**

The token-head projection correctly solved the grammatical analogy using verbs (climbing, climbed, painting, painted) that do not appear in the original gram7-past-tense dataset, while the raw hidden states failed on it. Both new capital-country analogies failed under either method, so the pass rests on one of three trials.

In [3]:
# Load GT2 results
with open('/net/scratch2/smallyan/arithmetic_eval/evaluation/gt2_results.json', 'r') as f:
    gt2_data = json.load(f)
print("GT2 Results:")
print(json.dumps(gt2_data, indent=2))

GT2 Results:
{
  "results": [
    {"proj": false, "raw": false, "proj_pred": "Brasilia", "raw_pred": "Brasilia"},
    {"proj": false, "raw": false, "proj_pred": "Bogota", "raw_pred": "Ecuador"},
    {"proj": true, "raw": false, "proj_pred": "painting", "raw_pred": "painted"}
  ],
  "pass": true,
  "count": 1
}


## GT3: Method/Specificity Generalizability

**Question**: Can the proposed method be applied to another similar task?

The paper proposes a method of:
1. Identifying concept/token induction heads via causal intervention
2. Constructing a lens by summing OV matrices from top-k heads
3. Projecting word embeddings through this lens for improved analogy performance
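
Step 2 can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: `build_lens`, the `(layer, head)` keys, and the toy matrices are hypothetical, and the sketch assumes the per-head OV matrices and the per-head scores from step 1 have already been computed elsewhere.

```python
import numpy as np

def build_lens(ov_matrices, head_scores, k):
    """Sum the OV matrices of the k highest-scoring heads.

    ov_matrices : dict mapping (layer, head) -> (d_model, d_model) array
    head_scores : dict mapping (layer, head) -> float, higher is better
    Both inputs are assumed to be precomputed; how they are obtained
    from the model is outside this sketch.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    top_heads = sorted(head_scores, key=head_scores.get, reverse=True)[:k]
    lens = np.sum([ov_matrices[h] for h in top_heads], axis=0)
    return lens, top_heads

# Toy 2x2 "OV matrices" for three hypothetical heads.
ov = {
    (0, 0): 1.0 * np.eye(2),
    (0, 1): 2.0 * np.eye(2),
    (1, 0): 3.0 * np.eye(2),
}
scores = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.5}
lens, top_heads = build_lens(ov, scores, k=2)
print(top_heads)  # [(0, 1), (1, 0)]
```

Step 3 then multiplies each word embedding by `lens` before running the analogy arithmetic.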

**Model**: Llama-2-7b-hf
**Test Tasks**: Synonym, Antonym, Person-Occupation (relation types distinct from the original capital-country and verb-tense tasks)

### Trial Results

| Trial | Task | Analogy | Projected | Raw |
|-------|------|---------|-----------|-----|
| 1 | Synonym | big - large + quick = fast | quick (FAIL) | quick (FAIL) |
| 2 | Antonym | hot - cold + down = up | down (FAIL) | down (FAIL) |
| 3 | Person-Occupation | Einstein - physicist + composer = Mozart | composer (FAIL) | composer (FAIL) |

### Result: **FAIL**

The method did not generalize to the new task types. The concept-head projection failed on all three new semantic relations (synonym, antonym, person-occupation), and the raw hidden states failed on the same three. With one trial per relation, this suggests, but does not establish, that the method is specialized to the task categories studied in the original work.

In [4]:
# Load GT3 results
with open('/net/scratch2/smallyan/arithmetic_eval/evaluation/gt3_results.json', 'r') as f:
    gt3_data = json.load(f)
print("GT3 Results:")
print(json.dumps(gt3_data, indent=2))

GT3 Results:
{
  "results": [
    {"task": "synonym", "proj": false, "raw": false, "proj_pred": "quick", "raw_pred": "quick"},
    {"task": "antonym", "proj": false, "raw": false, "proj_pred": "down", "raw_pred": "down"},
    {"task": "person-occupation", "proj": false, "raw": false, "proj_pred": "composer", "raw_pred": "composer"}
  ],
  "pass": false,
  "count": 0
}


## Failed Trial Examples Record

### GT1 Failures (Model Generalization)
- Trial 1: Athens - Greece + Japan -> Athens (expected: Tokyo)
- Trial 2: Berlin - Germany + France -> Berlin (expected: Paris)

### GT2 Failures (Data Generalization)
- Trial 1: Brasilia - Brazil + Peru -> Brasilia (expected: Lima)
- Trial 2: Bogota - Colombia + Ecuador -> Bogota (expected: Quito)

### GT3 Failures (Method Generalizability)
- Trial 1 (Synonym): big - large + quick -> quick (expected: fast)
- Trial 2 (Antonym): hot - cold + down -> down (expected: up)
- Trial 3 (Person-Occupation): Einstein - physicist + composer -> composer (expected: Mozart)
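
One pattern in the failures listed above can be checked mechanically: whether the projected prediction simply repeats one of the three query words. The sketch below hard-codes the seven failures from these lists; the helper `echoes_input` is a hypothetical name. It only counts the pattern and does not diagnose its cause, since this notebook does not state whether query words were excluded from the candidate set.

```python
# (a, b, d, predicted, expected), copied from the failure lists above.
failures = [
    ("Athens", "Greece", "Japan", "Athens", "Tokyo"),
    ("Berlin", "Germany", "France", "Berlin", "Paris"),
    ("Brasilia", "Brazil", "Peru", "Brasilia", "Lima"),
    ("Bogota", "Colombia", "Ecuador", "Bogota", "Quito"),
    ("big", "large", "quick", "quick", "fast"),
    ("hot", "cold", "down", "down", "up"),
    ("Einstein", "physicist", "composer", "composer", "Mozart"),
]

def echoes_input(a, b, d, predicted):
    """True when the prediction is one of the analogy's own query words."""
    return predicted in (a, b, d)

echo_count = sum(echoes_input(a, b, d, p) for a, b, d, p, _ in failures)
print(f"{echo_count}/{len(failures)} failed predictions repeat a query word")
# 7/7 failed predictions repeat a query word
```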

## Final Checklist Summary

| Criterion | Result | Details |
|-----------|--------|----------|
| **GT1** Model Generalization | **PASS** | Grammatical task succeeded on Llama-3-8B (1/3 trials) |
| **GT2** Data Generalization | **PASS** | Grammatical task succeeded on new verbs (1/3 trials) |
| **GT3** Method Generalizability | **FAIL** | Method failed on all new task types (0/3 trials) |

In [5]:
# Final Summary
gt1_pass = gt1_data['pass']
gt2_pass = gt2_data['pass']
gt3_pass = gt3_data['pass']

print("="*80)
print("GENERALIZABILITY EVALUATION SUMMARY")
print("="*80)
print("\nRepository: /net/scratch2/smallyan/arithmetic_eval")
print("Paper: Vector Arithmetic in Concept and Token Subspaces")
print("Original Model: Llama-2-7b-hf")

print("\n" + "-"*40)
print("CHECKLIST RESULTS")
print("-"*40)
print(f"GT1 (Model Generalization): {'PASS' if gt1_pass else 'FAIL'}")
print(f"GT2 (Data Generalization): {'PASS' if gt2_pass else 'FAIL'}")
print(f"GT3 (Method Generalizability): {'PASS' if gt3_pass else 'FAIL'}")

pass_count = sum([gt1_pass, gt2_pass, gt3_pass])
print("\n" + "-"*40)
print("OVERALL ASSESSMENT")
print("-"*40)
print(f"Passed {pass_count}/3 generalization criteria")

if pass_count == 3:
    print("The findings demonstrate STRONG generalizability.")
elif pass_count >= 1:
    print("The findings demonstrate PARTIAL generalizability.")
else:
    print("The findings show LIMITED generalizability.")

print("\nThe OV lens projection method shows promise for grammatical/morphological")
print("tasks (verb tense transformations) and generalizes to:")
print("- New models (Llama-3-8B)")
print("- New data (unseen verbs)")
print("\nHowever, the method does not generalize well to:")
print("- Different semantic relationship types (synonymy, antonymy, person-occupation)")
print("- Capital-country analogies (both on new models and new data)")
print("\nThis suggests the identified \"token heads\" capture morphological transformations")
print("effectively, but \"concept heads\" may be more task-specific than claimed.")

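================================================================================
GENERALIZABILITY EVALUATION SUMMARY
================================================================================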

Repository: /net/scratch2/smallyan/arithmetic_eval
Paper: Vector Arithmetic in Concept and Token Subspaces
Original Model: Llama-2-7b-hf

----------------------------------------
CHECKLIST RESULTS
----------------------------------------
GT1 (Model Generalization): PASS
GT2 (Data Generalization): PASS
GT3 (Method Generalizability): FAIL

----------------------------------------
OVERALL ASSESSMENT
----------------------------------------
Passed 2/3 generalization criteria
The findings demonstrate PARTIAL generalizability.

The OV lens projection method shows promise for grammatical/morphological
tasks (verb tense transformations) and generalizes to:
- New models (Llama-3-8B)
- New data (unseen verbs)

However, the method does not generalize well to:
- Different semantic relationship types (synonymy, antonymy, person-occupation)
- Capital-country analogies (both on new models and new data)

This suggests the identified "token heads" capture morphological transformations
effectively, but "concept heads" may be more task-specific than claimed.

In [6]:
# Load and display the final summary JSON
with open('/net/scratch2/smallyan/arithmetic_eval/evaluation/generalization_eval_summary.json', 'r') as f:
    summary = json.load(f)
print("Results saved to generalization_eval_summary.json")
print(json.dumps(summary, indent=2))

Results saved to generalization_eval_summary.json
{
  "Checklist": {
    "GT1_ModelGeneralization": "PASS",
    "GT2_DataGeneralization": "PASS",
    "GT3_MethodGeneralization": "FAIL"
  },
  "Rationale": {
    "GT1_ModelGeneralization": "The OV projection method successfully predicted 1 out of 3 analogies on Meta-Llama-3-8B...",
    "GT2_DataGeneralization": "The OV projection method successfully predicted 1 out of 3 analogies on NEW data instances...",
    "GT3_MethodGeneralization": "The OV lens projection method failed on all 3 new task types tested..."
  }
}
