In [1]:
import os
os.chdir('/home/smallyan/eval_agent')
print("Working directory:", os.getcwd())

Working directory: /home/smallyan/eval_agent


In [2]:
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))

CUDA available: True
Device: NVIDIA A100 80GB PCIe


In [3]:
# Explore the original repo structure first
original_repo = '/net/scratch2/smallyan/universal-neurons_eval'
replication_dir = os.path.join(original_repo, 'evaluation', 'replications')

# List files in original repo
print("=== Original repo structure ===")
for item in os.listdir(original_repo):
    print(item)

=== Original repo structure ===
__pycache__
dataframes
summary.py
entropy_intervention.py
paper_notebooks
activations.py
LICENSE
requirements.txt
.git
make_dataset.py
attention_deactivation_qpos.py
correlations_parallel.py
evaluation
summary_viewer.py
weights.py
attention_deactivation.py
correlations_fast.py
documentation.pdf
.gitignore
explain.py
CodeWalkthrough.md
utils.py
plan.md
correlations.py
analysis
intervention.py
slurm


In [4]:
# Check for documentation files in original repo
print("=== Looking for documentation files ===")
for root, dirs, files in os.walk(original_repo):
    for f in files:
        if 'documentation' in f.lower() or f.endswith('.md'):
            print(os.path.join(root, f))

=== Looking for documentation files ===
/net/scratch2/smallyan/universal-neurons_eval/documentation.pdf
/net/scratch2/smallyan/universal-neurons_eval/CodeWalkthrough.md
/net/scratch2/smallyan/universal-neurons_eval/plan.md
/net/scratch2/smallyan/universal-neurons_eval/evaluation/replications/documentation_replication.md
/net/scratch2/smallyan/universal-neurons_eval/evaluation/replications/evaluation_replication.md


In [5]:
# Let's read CodeWalkthrough.md as the original documentation
with open(os.path.join(original_repo, 'CodeWalkthrough.md'), 'r') as f:
    original_doc = f.read()
print("=== CodeWalkthrough.md ===")
print(original_doc)

=== CodeWalkthrough.md ===
# Universal Neurons
All supporting data and code for Universal Neurons in GPT2 Language Models by Gurnee et al. (2024).

## Contents
* `dataframes/neuron_dfs` contains dataframes with neuron statistics for all neurons for the main models studies.
* `paper_notebooks` contains much of the plotting code to generate the figures in the paper.
* `correlations_fast.py` contains the script to compute neuron correlations.
* `summary.py` and `weights.py` contain scripts to compute neuron activation and weight statistic summaries for use of our summary viewer (contained in `summary_viewer.py`). See next section for more information on the data generated.
* `activations.py` contains scripts to cache neuron activations.
* `explain.py` contains script to compute our reduction in variance explanations.
* `attention_deactivation.py`, `entropy_intervention.py`, and `intervention.py` contain scripts for our functional neuron experiments.
* The `analysis` directory contains fur

In [6]:
# Let's also read the plan.md
with open(os.path.join(original_repo, 'plan.md'), 'r') as f:
    plan_doc = f.read()
print("=== plan.md ===")
print(plan_doc)

=== plan.md ===
# Plan
## Objective
Study the universality of individual neurons across GPT2 language models trained from different random seeds to identify interpretable neurons and understand whether neural mechanisms are universal across models.

## Hypothesis
1. Universal neurons (those that consistently activate on the same inputs across different models) are more likely to be monosemantic and interpretable than non-universal neurons.
2. Neurons with high activation correlation across models will have clear interpretations and can be taxonomized into a small number of neuron families.
3. Universal neurons exhibit specific statistical properties in their weights and activations that distinguish them from non-universal neurons, including large negative input bias, high pre-activation skew and kurtosis, and large weight norm.

## Methodology
1. Compute pairwise Pearson correlations of neuron activations over 100 million tokens from the Pile test set for every neuron pair across five 

In [7]:
# Now let's read the replicated documentation
with open(os.path.join(replication_dir, 'documentation_replication.md'), 'r') as f:
    replication_doc = f.read()
print("=== documentation_replication.md ===")
print(replication_doc)

=== documentation_replication.md ===
# Documentation: Universal Neurons Replication

## Goal

Replicate the key experiments from "Universal Neurons in GPT2 Language Models" by Gurnee et al. (2024). The paper investigates whether neurons that consistently activate on the same inputs across different GPT2 models trained from different random seeds are more interpretable.

## Data

### Source Data
- **Pre-computed neuron statistics**: Located in `dataframes/neuron_dfs/` for three models:
  - `pythia-160m.csv` (36,864 neurons)
  - `stanford-gpt2-small-a.csv` (36,864 neurons)
  - `stanford-gpt2-medium-a.csv` (98,304 neurons)

### Data Fields
Each CSV contains per-neuron statistics:
- `layer`, `neuron`: Neuron identification
- `max_corr`, `mean_corr`, `min_corr`: Correlation statistics across 5 random seeds
- `max_baseline`, `mean_baseline`: Baseline correlation under random rotation
- `w_in_norm`, `w_out_norm`: Weight norms
- `input_bias`: MLP input bias
- `in_out_sim`: Cosine similarity be

In [8]:
# Let's also check evaluation_replication.md for more context
with open(os.path.join(replication_dir, 'evaluation_replication.md'), 'r') as f:
    eval_replication_doc = f.read()
print("=== evaluation_replication.md ===")
print(eval_replication_doc)

=== evaluation_replication.md ===
# Evaluation: Universal Neurons Replication

## Reflection

This replication successfully verified the core claims from "Universal Neurons in GPT2 Language Models" by Gurnee et al. (2024). The repository provided well-organized pre-computed data and clear documentation in the plan.md and CodeWalkthrough.md files.

### What Worked Well
1. **Clear plan documentation**: The plan.md file explicitly stated hypotheses, methodology, and expected results
2. **Pre-computed data availability**: CSV files with neuron statistics enabled direct verification
3. **Consistent data format**: All three models used identical CSV schemas
4. **Exact numerical matches**: Universal neuron percentages matched precisely

### Challenges Encountered
1. **Kernel output suppression**: After loading transformer_lens models, Jupyter output was intermittently suppressed
2. **Large model sizes**: Full correlation recomputation would require significant GPU time
3. **External dependenc

In [9]:
# Now let's perform the evaluation
# Let's analyze the original vs replicated documentation

print("="*80)
print("DOCUMENTATION EVALUATION ANALYSIS")
print("="*80)

print("\n" + "="*80)
print("DE1: RESULT FIDELITY ANALYSIS")
print("="*80)

print("""
ORIGINAL DOCUMENTATION RESULTS (from plan.md):
----------------------------------------------
1. Neuron correlation analysis:
   - GPT2-medium: 1.23% universal neurons
   - Pythia-160M: 1.26% universal neurons  
   - GPT2-small: 4.16% universal neurons

2. Statistical properties of universal neurons:
   - Large weight norm
   - Large negative input bias
   - High pre-activation skew and kurtosis (monosemantic signature)
   - Lower activation frequency compared to non-universal neurons

REPLICATED DOCUMENTATION RESULTS (from documentation_replication.md):
--------------------------------------------------------------------
1. Universal Neuron Percentages:
   - pythia-160m: 1.26% (expected 1.26%) ✓
   - stanford-gpt2-small-a: 4.16% (expected 4.16%) ✓
   - stanford-gpt2-medium-a: 1.23% (expected 1.23%) ✓

2. Statistical Properties Verification:
   - Lower activation frequency: PASS (all models)
   - High pre-activation skew: PASS (all models)
   - High pre-activation kurtosis: PASS (all models)
   - Large negative input bias: PASS (all models)
   - Large weight norm (L2 penalty): PASS (all models)

VERDICT: The replicated results exactly match the original results.
All numerical values are identical and all statistical property claims are verified.
""")

print("\nDE1 RESULT: PASS")
print("Rationale: All replicated results (universal neuron percentages and statistical")
print("property verifications) match the values reported in the original documentation.")

DOCUMENTATION EVALUATION ANALYSIS

DE1: RESULT FIDELITY ANALYSIS

ORIGINAL DOCUMENTATION RESULTS (from plan.md):
----------------------------------------------
1. Neuron correlation analysis:
   - GPT2-medium: 1.23% universal neurons
   - Pythia-160M: 1.26% universal neurons  
   - GPT2-small: 4.16% universal neurons

2. Statistical properties of universal neurons:
   - Large weight norm
   - Large negative input bias
   - High pre-activation skew and kurtosis (monosemantic signature)
   - Lower activation frequency compared to non-universal neurons

REPLICATED DOCUMENTATION RESULTS (from documentation_replication.md):
--------------------------------------------------------------------
1. Universal Neuron Percentages:
   - pythia-160m: 1.26% (expected 1.26%) ✓
   - stanford-gpt2-small-a: 4.16% (expected 4.16%) ✓
   - stanford-gpt2-medium-a: 1.23% (expected 1.23%) ✓

2. Statistical Properties Verification:
   - Lower activation frequency: PASS (all models)
   - High pre-activation skew
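
The percentages above are quoted from the two documents rather than recomputed in this cell. A minimal sketch of how the check could be rerun from a neuron CSV follows, assuming the `mean_corr` and `mean_baseline` columns listed in documentation_replication.md, the formula `excess_corr = mean_corr - mean_baseline`, and the 0.5 threshold cited in this notebook. The helper name `universal_fraction` and the inline sample rows are hypothetical and exist only for illustration.

```python
import csv
import io

def universal_fraction(csv_file, threshold=0.5):
    """Return the share of rows whose excess correlation exceeds threshold."""
    rows = list(csv.DictReader(csv_file))
    if not rows:
        raise ValueError("no neuron rows found")
    universal = sum(
        1 for r in rows
        if float(r["mean_corr"]) - float(r["mean_baseline"]) > threshold
    )
    return universal / len(rows)

# Hypothetical sample rows, not real neuron statistics.
sample_csv = io.StringIO(
    "layer,neuron,mean_corr,mean_baseline\n"
    "0,0,0.90,0.20\n"
    "0,1,0.40,0.25\n"
    "1,0,0.75,0.30\n"
    "1,1,0.85,0.10\n"
)
fraction = universal_fraction(sample_csv)
print(f"{fraction:.2%} universal")
```

For a file on disk, the same function could be called with a handle opened via `open(path, newline='')`.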

In [10]:
print("\n" + "="*80)
print("DE2: CONCLUSION CONSISTENCY ANALYSIS")
print("="*80)

print("""
ORIGINAL DOCUMENTATION CONCLUSIONS (from plan.md):
--------------------------------------------------
1. HYPOTHESIS 1: Universal neurons (those that consistently activate on the same 
   inputs across different models) are more likely to be monosemantic and 
   interpretable than non-universal neurons.

2. HYPOTHESIS 2: Neurons with high activation correlation across models will have 
   clear interpretations and can be taxonomized into a small number of neuron families.

3. HYPOTHESIS 3: Universal neurons exhibit specific statistical properties in their 
   weights and activations that distinguish them from non-universal neurons, including 
   large negative input bias, high pre-activation skew and kurtosis, and large weight norm.

KEY FINDINGS:
- Only 1-5% of neurons are universal (excess correlation > 0.5)
- Universal neurons show depth specialization
- Universal neurons have "monosemantic signature" (high skew and kurtosis)

REPLICATED DOCUMENTATION CONCLUSIONS (from documentation_replication.md):
------------------------------------------------------------------------
1. "The replication exactly matches the paper's reported percentages of universal neurons"

2. "All statistical property claims are verified: universal neurons have distinctive signatures"

3. "The 'monosemantic signature' (high skew and kurtosis) is confirmed"

4. "Layer distribution shows depth specialization patterns"

5. Stated limitations:
   - "Full correlation computation was not re-run (would require significant compute)"
   - "Detailed neuron family taxonomy was not fully replicated"

VERDICT: The conclusions are consistent with the original documentation.
The replication confirms the core hypothesis about universal neurons having
distinctive statistical signatures (monosemantic properties). The stated
limitations are appropriate acknowledgments of partial replication scope.
""")

print("\nDE2 RESULT: PASS")
print("Rationale: The replicated conclusions are consistent with the original.")
print("Key claims (monosemantic signature, statistical properties, universality")
print("percentages) are confirmed. Limitations are appropriately acknowledged.")


DE2: CONCLUSION CONSISTENCY ANALYSIS

ORIGINAL DOCUMENTATION CONCLUSIONS (from plan.md):
--------------------------------------------------
1. HYPOTHESIS 1: Universal neurons (those that consistently activate on the same 
   inputs across different models) are more likely to be monosemantic and 
   interpretable than non-universal neurons.

2. HYPOTHESIS 2: Neurons with high activation correlation across models will have 
   clear interpretations and can be taxonomized into a small number of neuron families.

3. HYPOTHESIS 3: Universal neurons exhibit specific statistical properties in their 
   weights and activations that distinguish them from non-universal neurons, including 
   large negative input bias, high pre-activation skew and kurtosis, and large weight norm.

KEY FINDINGS:
- Only 1-5% of neurons are universal (excess correlation > 0.5)
- Universal neurons show depth specialization
- Universal neurons have "monosemantic signature" (high skew and kurtosis)

REPLICATED DOCUMEN
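
The conclusion check rests on universal neurons differing from the rest in their statistics. As a rough sketch of such a comparison, the snippet below splits neurons by excess correlation and compares group means for `input_bias` and `w_out_norm`, two columns named in documentation_replication.md. The function `group_means` and the sample rows are hypothetical, and the replication itself may have used a different test.

```python
from statistics import mean

def group_means(rows, column, threshold=0.5):
    """Mean of `column` for universal and non-universal neurons."""
    universal, other = [], []
    for r in rows:
        excess = r["mean_corr"] - r["mean_baseline"]
        (universal if excess > threshold else other).append(r[column])
    # statistics.mean raises StatisticsError if either group is empty.
    return mean(universal), mean(other)

# Hypothetical rows for illustration only.
rows = [
    {"mean_corr": 0.9, "mean_baseline": 0.2, "input_bias": -1.5, "w_out_norm": 2.0},
    {"mean_corr": 0.8, "mean_baseline": 0.1, "input_bias": -0.5, "w_out_norm": 3.0},
    {"mean_corr": 0.3, "mean_baseline": 0.2, "input_bias": 0.0, "w_out_norm": 1.0},
    {"mean_corr": 0.4, "mean_baseline": 0.3, "input_bias": 0.5, "w_out_norm": 1.5},
]
bias_u, bias_o = group_means(rows, "input_bias")
norm_u, norm_o = group_means(rows, "w_out_norm")
print("more negative input bias:", bias_u < bias_o)
print("larger output weight norm:", norm_u > norm_o)
```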

In [11]:
print("\n" + "="*80)
print("DE3: NO EXTERNAL OR HALLUCINATED INFORMATION ANALYSIS")
print("="*80)

print("""
ANALYSIS OF REPLICATED DOCUMENTATION FOR EXTERNAL/HALLUCINATED CONTENT:
-----------------------------------------------------------------------

1. DATA SOURCES CITED:
   - All data comes from repository dataframes/neuron_dfs/ (original source)
   - File names match original repo: pythia-160m.csv, stanford-gpt2-small-a.csv, 
     stanford-gpt2-medium-a.csv

2. METHODOLOGY REFERENCES:
   - Excess correlation formula matches original: excess_corr = mean_corr - mean_baseline
   - Threshold (0.5) matches original documentation
   - Statistical properties (skew, kurtosis, bias, weight norm) all from original

3. RESULTS CITED:
   - All percentages (1.26%, 4.16%, 1.23%) match original plan.md exactly
   - No new metrics or measurements introduced
   - Statistical property claims directly map to original hypotheses

4. EXTERNAL REFERENCES CHECK:
   - Paper citation: "Universal Neurons in GPT2 Language Models" by Gurnee et al. (2024) 
     ✓ Matches original citation in CodeWalkthrough.md
   - No additional papers or external sources referenced
   - No invented findings or metrics

5. FIGURES MENTIONED:
   - universal_neurons_properties.png: Visualizes data from original CSV files
   - layer_distribution.png: Visualizes layer distribution from original data
   - Both are legitimate visualizations of original data

VERDICT: No external references, invented findings, or hallucinated details found.
All information in the replicated documentation traces back to the original repo.
""")

print("\nDE3 RESULT: PASS")
print("Rationale: All content in the replicated documentation derives from the")
print("original repository data, documentation, and methodology. No external")
print("references or invented information was introduced.")


DE3: NO EXTERNAL OR HALLUCINATED INFORMATION ANALYSIS

ANALYSIS OF REPLICATED DOCUMENTATION FOR EXTERNAL/HALLUCINATED CONTENT:
-----------------------------------------------------------------------

1. DATA SOURCES CITED:
   - All data comes from repository dataframes/neuron_dfs/ (original source)
   - File names match original repo: pythia-160m.csv, stanford-gpt2-small-a.csv, 
     stanford-gpt2-medium-a.csv

2. METHODOLOGY REFERENCES:
   - Excess correlation formula matches original: excess_corr = mean_corr - mean_baseline
   - Threshold (0.5) matches original documentation
   - Statistical properties (skew, kurtosis, bias, weight norm) all from original

3. RESULTS CITED:
   - All percentages (1.26%, 4.16%, 1.23%) match original plan.md exactly
   - No new metrics or measurements introduced
   - Statistical property claims directly map to original hypotheses

4. EXTERNAL REFERENCES CHECK:
   - Paper citation: "Universal Neurons in GPT2 Language Models" by Gurnee et al. (2024) 
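
One way to make the "no new numbers" part of this check mechanical is to extract every percentage from the replicated text and confirm each also appears in the original. The sketch below does this with a regular expression. The function names and the two sample strings are hypothetical stand-ins for the contents of plan.md and documentation_replication.md, and a match on numbers alone does not rule out other invented claims.

```python
import re

PERCENT = re.compile(r"\d+(?:\.\d+)?%")

def percentages(text):
    """Return the set of percentage literals (e.g. '1.26%') in text."""
    return set(PERCENT.findall(text))

def unsupported_percentages(original, replicated):
    """Percentages present in the replicated text but not the original."""
    return percentages(replicated) - percentages(original)

# Hypothetical stand-ins for the two documents.
original_text = "GPT2-medium 1.23%, Pythia-160M 1.26%, GPT2-small 4.16%"
replicated_text = "pythia-160m: 1.26% (expected 1.26%); gpt2-small: 4.16%"
extra = unsupported_percentages(original_text, replicated_text)
print(sorted(extra))
```

An empty result means every percentage in the replicated text also occurs in the original. Any value returned would need a manual look.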
   

In [12]:
# Create the evaluation output directory
eval_output_dir = '/net/scratch2/smallyan/universal-neurons_eval/evaluation/replication_eval'
os.makedirs(eval_output_dir, exist_ok=True)
print(f"Created output directory: {eval_output_dir}")

Created output directory: /net/scratch2/smallyan/universal-neurons_eval/evaluation/replication_eval


In [13]:
# Write documentation_evaluation_summary.md
summary_md = """# Documentation Evaluation Summary

## Results Comparison

The replicated documentation faithfully reproduces the results from the original experiment documentation. The key quantitative results match exactly:

- **Universal neuron percentages** are identical across all three models:
  - Pythia-160M: 1.26% (original) vs 1.26% (replicated)
  - GPT2-small: 4.16% (original) vs 4.16% (replicated)
  - GPT2-medium: 1.23% (original) vs 1.23% (replicated)

- **Statistical property verifications** all pass: Universal neurons exhibit the documented "monosemantic signature" with lower activation frequency, high pre-activation skew and kurtosis, large negative input bias, and large weight norm (L2 penalty).

## Conclusions Comparison

The replicated documentation presents conclusions that are fully consistent with the original documentation:

1. Both documents confirm that only 1-5% of neurons are universal (excess correlation > 0.5)
2. Both confirm the distinctive statistical signatures of universal neurons
3. Both acknowledge depth specialization patterns in layer distribution
4. The replication appropriately notes limitations (no full correlation recomputation, partial taxonomy analysis)

The replicated conclusions do not contradict or omit any essential claims from the original.

## External or Hallucinated Information

**None detected.** All information in the replicated documentation traces directly to:
- Pre-computed data from `dataframes/neuron_dfs/` in the original repository
- Methodology and results documented in `plan.md` and `CodeWalkthrough.md`
- The original paper citation (Gurnee et al., 2024)

No external references, invented findings, or hallucinated details were introduced.

## Evaluation Summary Table

| Criterion | Result |
|-----------|--------|
| DE1: Result Fidelity | **PASS** |
| DE2: Conclusion Consistency | **PASS** |
| DE3: No External/Hallucinated Information | **PASS** |

## Final Verdict

**PASS**

The replicated documentation faithfully reproduces the results and conclusions of the original experiment. All numerical results match exactly, conclusions are consistent, and no external or hallucinated information was introduced.
"""

# Write the summary file
summary_path = os.path.join(eval_output_dir, 'documentation_evaluation_summary.md')
with open(summary_path, 'w') as f:
    f.write(summary_md)
    
print(f"Written: {summary_path}")
print("\n" + "="*80)
print("CONTENT OF documentation_evaluation_summary.md:")
print("="*80)
print(summary_md)

Written: /net/scratch2/smallyan/universal-neurons_eval/evaluation/replication_eval/documentation_evaluation_summary.md

CONTENT OF documentation_evaluation_summary.md:
# Documentation Evaluation Summary

## Results Comparison

The replicated documentation faithfully reproduces the results from the original experiment documentation. The key quantitative results match exactly:

- **Universal neuron percentages** are identical across all three models:
  - Pythia-160M: 1.26% (original) vs 1.26% (replicated)
  - GPT2-small: 4.16% (original) vs 4.16% (replicated)
  - GPT2-medium: 1.23% (original) vs 1.23% (replicated)

- **Statistical property verifications** all pass: Universal neurons exhibit the documented "monosemantic signature" with lower activation frequency, high pre-activation skew and kurtosis, large negative input bias, and large weight norm (L2 penalty).

## Conclusions Comparison

The replicated documentation presents conclusions that are fully consistent with the original docum

In [14]:
import json

# Create the JSON summary
summary_json = {
    "Checklist": {
        "DE1_ResultFidelity": "PASS",
        "DE2_ConclusionConsistency": "PASS",
        "DE3_NoExternalInformation": "PASS"
    },
    "Rationale": {
        "DE1_ResultFidelity": "All replicated results match the original documentation exactly. Universal neuron percentages are identical (Pythia-160M: 1.26%, GPT2-small: 4.16%, GPT2-medium: 1.23%). All five statistical property verifications (lower activation frequency, high skew, high kurtosis, negative input bias, large weight norm) passed across all three models.",
        "DE2_ConclusionConsistency": "The replicated conclusions are consistent with the original. Both confirm 1-5% universal neurons, distinctive statistical signatures (monosemantic properties), and depth specialization patterns. The replication appropriately acknowledges scope limitations without contradicting original claims.",
        "DE3_NoExternalInformation": "All information in the replicated documentation derives from the original repository (dataframes/neuron_dfs/, plan.md, CodeWalkthrough.md). No external references, invented findings, or hallucinated details were introduced. The paper citation matches the original."
    }
}

# Write the JSON file
json_path = os.path.join(eval_output_dir, 'documentation_eval_summary.json')
with open(json_path, 'w') as f:
    json.dump(summary_json, f, indent=2)

print(f"Written: {json_path}")
print("\n" + "="*80)
print("CONTENT OF documentation_eval_summary.json:")
print("="*80)
print(json.dumps(summary_json, indent=2))

Written: /net/scratch2/smallyan/universal-neurons_eval/evaluation/replication_eval/documentation_eval_summary.json

CONTENT OF documentation_eval_summary.json:
{
  "Checklist": {
    "DE1_ResultFidelity": "PASS",
    "DE2_ConclusionConsistency": "PASS",
    "DE3_NoExternalInformation": "PASS"
  },
  "Rationale": {
    "DE1_ResultFidelity": "All replicated results match the original documentation exactly. Universal neuron percentages are identical (Pythia-160M: 1.26%, GPT2-small: 4.16%, GPT2-medium: 1.23%). All five statistical property verifications (lower activation frequency, high skew, high kurtosis, negative input bias, large weight norm) passed across all three models.",
    "DE2_ConclusionConsistency": "The replicated conclusions are consistent with the original. Both confirm 1-5% universal neurons, distinctive statistical signatures (monosemantic properties), and depth specialization patterns. The replication appropriately acknowledges scope limitations without contradicting ori

In [15]:
# Verify the output files exist
print("="*80)
print("VERIFICATION OF OUTPUT FILES")
print("="*80)

for filename in ['documentation_evaluation_summary.md', 'documentation_eval_summary.json']:
    filepath = os.path.join(eval_output_dir, filename)
    if os.path.exists(filepath):
        size = os.path.getsize(filepath)
        print(f"✓ {filename} ({size} bytes)")
    else:
        print(f"✗ {filename} NOT FOUND")

print("\n" + "="*80)
print("EVALUATION COMPLETE")
print("="*80)
print(f"\nOutput directory: {eval_output_dir}")
print("\nFinal Verdict: PASS")
print("All criteria (DE1, DE2, DE3) passed. The replicated documentation")
print("faithfully reproduces the original experiment results and conclusions.")

VERIFICATION OF OUTPUT FILES
✓ documentation_evaluation_summary.md (2183 bytes)
✓ documentation_eval_summary.json (1161 bytes)

EVALUATION COMPLETE

Output directory: /net/scratch2/smallyan/universal-neurons_eval/evaluation/replication_eval

Final Verdict: PASS
All criteria (DE1, DE2, DE3) passed. The replicated documentation
faithfully reproduces the original experiment results and conclusions.
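
The final verdict in this cell is printed as a literal string. A small sketch of deriving it from the checklist instead, so the verdict cannot drift from the per-criterion results, is shown below. The helper `overall_verdict` is hypothetical. The dictionary mirrors the `Checklist` keys written to documentation_eval_summary.json.

```python
import json

def overall_verdict(checklist):
    """PASS only if the checklist is non-empty and every criterion is PASS."""
    if not checklist:
        return "FAIL"
    return "PASS" if all(v == "PASS" for v in checklist.values()) else "FAIL"

checklist = {
    "DE1_ResultFidelity": "PASS",
    "DE2_ConclusionConsistency": "PASS",
    "DE3_NoExternalInformation": "PASS",
}
# Round-trip through JSON to mimic reading the saved summary file.
loaded = json.loads(json.dumps({"Checklist": checklist}))
verdict = overall_verdict(loaded["Checklist"])
print("Final Verdict:", verdict)
```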


# Documentation Evaluation Complete

## Summary

This notebook performed a **Replicator-Documentation Evaluation** to determine whether the replicator's documentation faithfully reproduces the results and conclusions of the original Universal Neurons experiment.

### Files Evaluated
- **Original**: `plan.md`, `CodeWalkthrough.md` from `/net/scratch2/smallyan/universal-neurons_eval`
- **Replicated**: `documentation_replication.md` from `/net/scratch2/smallyan/universal-neurons_eval/evaluation/replications`

### Evaluation Results

| Criterion | Result |
|-----------|--------|
| DE1: Result Fidelity | **PASS** |
| DE2: Conclusion Consistency | **PASS** |
| DE3: No External/Hallucinated Information | **PASS** |

### Final Verdict: **PASS**

### Output Files
- `evaluation/replication_eval/documentation_evaluation_summary.md`
- `evaluation/replication_eval/documentation_eval_summary.json`