In [1]:
import os
os.chdir('/home/smallyan/eval_agent')
print("Working directory:", os.getcwd())

Working directory: /home/smallyan/eval_agent


In [2]:
# Based on my analysis of the documentation, I'll now create the evaluation files

# Store the evaluation results
evaluation_results = {
    "RP1": {
        "verdict": "PASS",
        "analysis": """
The documentation provides sufficient detail for the core experiment:

DOCUMENTED PROCEDURES:
1. Correlation Computation (Section 4.1, A.2):
   - Pearson correlation formula (Eq. 2)
   - Excess correlation formula (Eq. 3)
   - Baseline correlation with random rotation
   - Dataset: 100M tokens from Pile test set
   - Context length: 512 tokens
   - Efficient streaming algorithm with formulas

2. Neuron Analysis (Section 4.2):
   - Statistics: mean, skew, kurtosis of pre-activation
   - Activation sparsity definition
   - Weight metrics specified

3. Mathematical Foundations:
   - MLP formulation (Eq. 1)
   - Reduction in variance formula (Eq. 4)
   - LayerNorm formula (Eq. 5)
   - Heuristic score formula for attention deactivation

MINOR GAPS:
- Complete list of automated tests not enumerated (only "several hundred" mentioned)
- Topic model details underspecified
- Intervention procedure parameters are implied rather than stated

The core methodology for identifying universal neurons is reconstructable.
"""
    },
    "RP2": {
        "verdict": "PASS",
        "analysis": """
The documentation explicitly specifies required resources:

MODELS:
- GPT2 models: stanford-crfm/arwen-gpt2-medium-x21, stanford-crfm/alias-gpt2-small-x21
- Pythia: EleutherAI/pythia-160m
- Model hyperparameters in Table 1 (layers, heads, dmodel, etc.)

DATASETS:
- Pile test set, 100 million tokens
- Context length: 512 tokens
- Reference: Gao et al., 2020

DEPENDENCIES:
- TransformerLens library (Nanda, 2022)
- spaCy package (Honnibal et al., 2020)
- Code available: github.com/wesg52/universal-neurons

COMPUTE:
- MIT Supercloud mentioned

All critical resources are publicly available and explicitly identified.
"""
    },
    "RP3": {
        "verdict": "FAIL",
        "analysis": """
The documentation does NOT adequately address sources of randomness:

IDENTIFIED SOURCES OF VARIABILITY:
1. Random initialization of Gaussian matrix R for baseline correlation
2. Random subset selection for path ablation experiments
3. Five different random seeds for model training (acknowledged)

WHAT IS MISSING:
- No random seeds specified for the analysis procedures
- No explicit handling of sampling variability
- No confidence intervals or variance estimates for key metrics
- The threshold ϱ > 0.5 is acknowledged as arbitrary, with no sensitivity analysis
- No specification of how many runs/samples were used for interventions

The paper studies variability across model seeds as its main research
question, but does not provide seeds or procedures to ensure exact
reproducibility of the analysis itself.

While some consistency is shown (Figures 2b and 6c), the specific values
reported (1.23% and 4.16% universal neurons) would likely vary without
exact seeds and sampling procedures.
"""
    }
}

print("Evaluation Results Summary:")
print("="*60)
for rp, result in evaluation_results.items():
    print(f"{rp}: {result['verdict']}")
print("="*60)

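Evaluation Results Summary:
============================================================
RP1: PASS
RP2: PASS
RP3: FAIL
============================================================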
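
The streaming correlation computation credited under RP1 can be illustrated with a small sketch. It is not the paper's algorithm from Appendix A.2, which is not reproduced in this notebook. It is a generic sufficient-statistics approach, assuming activations arrive as NumPy batches of shape (tokens, neurons). The class name `StreamingPearson` and its methods are hypothetical.

```python
import numpy as np


class StreamingPearson:
    """Accumulate sums so pairwise Pearson correlations can be computed
    without holding all activations in memory.

    Hypothetical helper: a generic sums-of-products scheme, not the
    paper's exact implementation.
    """

    def __init__(self, n_a: int, n_b: int) -> None:
        self.n = 0                            # tokens seen so far
        self.sum_a = np.zeros(n_a)            # per-neuron sums, model A
        self.sum_b = np.zeros(n_b)            # per-neuron sums, model B
        self.sum_aa = np.zeros(n_a)           # per-neuron sums of squares
        self.sum_bb = np.zeros(n_b)
        self.sum_ab = np.zeros((n_a, n_b))    # cross products

    def update(self, a: np.ndarray, b: np.ndarray) -> None:
        """Add one batch; a is (tokens, n_a) and b is (tokens, n_b)."""
        a = a.astype(np.float64)
        b = b.astype(np.float64)
        self.n += a.shape[0]
        self.sum_a += a.sum(axis=0)
        self.sum_b += b.sum(axis=0)
        self.sum_aa += (a * a).sum(axis=0)
        self.sum_bb += (b * b).sum(axis=0)
        self.sum_ab += a.T @ b

    def correlation(self) -> np.ndarray:
        """Return the (n_a, n_b) matrix of Pearson correlations."""
        mean_a = self.sum_a / self.n
        mean_b = self.sum_b / self.n
        cov = self.sum_ab / self.n - np.outer(mean_a, mean_b)
        var_a = self.sum_aa / self.n - mean_a**2
        var_b = self.sum_bb / self.n - mean_b**2
        return cov / np.sqrt(np.outer(var_a, var_b))
```

Calling `update` once per batch and `correlation` at the end gives the same matrix as a single pass over all tokens, up to floating-point error. Accumulating raw sums of squares can lose precision when means are large relative to variances, so float64 accumulators are used here.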


In [3]:
# Create the output directory and files

import os
import json

output_dir = '/net/scratch2/smallyan/universal-neurons_eval/doc_only_evaluation'
os.makedirs(output_dir, exist_ok=True)

# Create the markdown evaluation file
md_content = """# Replication Evaluation - Universal Neurons in GPT2 Language Models

## Evaluation Overview

This evaluation assesses whether the documentation for "Universal Neurons in GPT2 Language Models" (Gurnee et al., 2024) provides sufficient information for an independent researcher to reproduce the experiment and its results without access to the original code.

**Documentation Source:** `/net/scratch2/smallyan/universal-neurons_eval/documentation.pdf`

---

## Evaluation Notes

### RP1: Implementation Reconstructability

**Assessment: PASS**

The documentation provides sufficient methodological detail for reconstructing the core experiment:

**Documented Procedures:**

1. **Correlation Computation (Section 4.1, Appendix A.2):**
   - Pearson correlation formula explicitly given (Equation 2)
   - Excess correlation formula explicitly given (Equation 3)
   - Baseline correlation methodology: correlation with random rotation of neuron basis
   - Dataset specification: 100 million tokens from Pile test set
   - Context length: 512 tokens
   - Excluded tokens: padding, BOS, newline tokens
   - Efficient streaming algorithm described with explicit formulas

2. **Neuron Analysis (Section 4.2):**
   - Activation statistics: mean, skew, kurtosis of pre-activation distribution
   - Activation sparsity: fraction of activations > 0
   - Weight metrics: input bias, cos(win, wout), weight decay penalty, kurtosis of cos(wout, WU)

3. **Mathematical Foundations:**
   - MLP layer formulation (Equation 1)
   - Reduction in variance formula for automated tests (Equation 4)
   - LayerNorm formula (Equation 5)
   - Heuristic score formula for attention deactivation

4. **Weight Preprocessing (Appendix A.1):**
   - Layer norm folding procedure
   - Weight centering procedures
   - Unembedding centering

**Minor Gaps:**
- Complete list of automated tests not enumerated (only "several hundred" mentioned)
- Topic model details underspecified
- Intervention procedure parameters implicit

**Conclusion:** The core methodology for identifying universal neurons via activation correlations is sufficiently detailed for reconstruction.

---

### RP2: Environment and Resource Specification

**Assessment: PASS**

The documentation explicitly specifies all required resources:

**Models:**
- GPT2-medium: `stanford-crfm/arwen-gpt2-medium-x21`
- GPT2-small: `stanford-crfm/alias-gpt2-small-x21`
- Pythia-160m: `EleutherAI/pythia-160m`
- Model hyperparameters detailed in Table 1 (layers, heads, dmodel, dvocab, dMLP, parameters, context, activation function, positional embeddings, precision, dataset, dropout)

**Datasets:**
- Pile test set (Gao et al., 2020)
- 100 million tokens
- Context length: 512 tokens
- Training data: OpenWebText (GPT2), Pile (Pythia)

**Dependencies:**
- TransformerLens library (Nanda, 2022)
- spaCy package (Honnibal et al., 2020)

**Code Availability:**
- GitHub repository: `https://github.com/wesg52/universal-neurons`
- Explicitly stated: "All of our code and data is available"

**Compute Resources:**
- MIT Supercloud (Reuther et al., 2018)

**Conclusion:** All critical resources are publicly available and explicitly identified.

---

### RP3: Determinism and Stability

**Assessment: FAIL**

The documentation does NOT adequately address sources of randomness and variability:

**Identified Sources of Variability:**
1. Random initialization of Gaussian matrix R for baseline correlation computation
2. Random subset selection for path ablation experiments ("random subset of tokens in the second half of the context")
3. Five different random seeds for model training (the main focus of the study)

**Missing Information:**
- No random seeds specified for the analysis procedures
- No explicit handling of sampling variability
- No confidence intervals or variance estimates for key metrics (e.g., 1.23%, 4.16% universal neurons)
- The threshold ϱ > 0.5 is acknowledged as arbitrary with no sensitivity analysis
- No specification of how many runs/samples for intervention experiments
- No explicit discussion of reproducibility guarantees

**Stability Claims Without Support:**
- Figure 2b shows consistency across models but without error bars
- Figure 6c shows shaded ranges across seeds but methodology not detailed
- The finding that "1.23% and 1.26%" of neurons are universal in different models suggests consistency but exact reproducibility procedures are absent

**Conclusion:** While the paper studies variability across model seeds as its primary research question, it does not provide sufficient detail to ensure exact reproducibility of the analysis itself. The specific quantitative claims would likely vary without exact seeds and sampling procedures.

---

## Binary Checklist Table

| Criterion | Description | Verdict |
|-----------|-------------|---------|
| **RP1** | Implementation Reconstructability | **PASS** |
| **RP2** | Environment and Resource Specification | **PASS** |
| **RP3** | Determinism and Stability | **FAIL** |

---

## Summary

**Overall Replicability Assessment: PARTIAL**

The documentation for "Universal Neurons in GPT2 Language Models" provides strong methodological detail for the core experiment. The mathematical formulations, model specifications, and dataset descriptions are sufficiently detailed for an independent researcher to implement the correlation-based universal neuron identification procedure.

However, the documentation fails to address determinism and stability adequately. Sources of randomness in the analysis (random baseline matrices, random sampling for ablation studies) are not controlled through specified seeds, and no variance estimates or confidence intervals are provided for key quantitative findings.

**What can be replicated:**
- The methodology for computing neuron correlations across models
- The identification of neurons with high excess correlation
- The general patterns in neuron families and functional roles

**What may not be exactly reproduced:**
- Specific percentage values (1.23%, 4.16% universal neurons)
- Exact lists of universal neurons
- Quantitative results from causal interventions

The availability of code at `github.com/wesg52/universal-neurons` partially mitigates these concerns, but from a documentation-only perspective, exact replication is not guaranteed.
"""

# Write the markdown file
md_path = os.path.join(output_dir, 'replication_evaluation.md')
with open(md_path, 'w', encoding='utf-8') as f:  # content contains non-ASCII (ϱ)
    f.write(md_content)

print(f"Created: {md_path}")
print(f"File size: {os.path.getsize(md_path)} bytes")

Created: /net/scratch2/smallyan/universal-neurons_eval/doc_only_evaluation/replication_evaluation.md
File size: 6307 bytes
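
RP3 flags the unseeded Gaussian matrix R used for the baseline correlation. The sketch below shows how a fixed seed would make that baseline repeatable, assuming activations are NumPy arrays of shape (tokens, neurons). The function names `pearson_matrix`, `excess_correlation`, and `universal_fraction` are hypothetical, the comparison uses a single other model, and the code is an illustration rather than the paper's Equation 3 verbatim.

```python
import numpy as np


def pearson_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pearson correlation between columns of a (T, n_a) and b (T, n_b)."""
    a = (a - a.mean(axis=0)) / a.std(axis=0)
    b = (b - b.mean(axis=0)) / b.std(axis=0)
    return a.T @ b / a.shape[0]


def excess_correlation(a: np.ndarray, b: np.ndarray, seed: int) -> np.ndarray:
    """Max correlation with model B minus max correlation with a randomly
    rotated copy of model B's neuron basis (assumed baseline)."""
    rng = np.random.default_rng(seed)  # the seed fixes the rotation matrix
    rotation = rng.standard_normal((b.shape[1], b.shape[1]))
    max_corr = pearson_matrix(a, b).max(axis=1)
    max_base = pearson_matrix(a, b @ rotation).max(axis=1)
    return max_corr - max_base


def universal_fraction(excess: np.ndarray, threshold: float = 0.5) -> float:
    """Fraction of neurons whose excess correlation exceeds the threshold."""
    return float((excess > threshold).mean())
```

Reporting `universal_fraction` for several seeds and several thresholds around 0.5 would give the variance estimate and sensitivity analysis that the evaluation found missing.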


In [4]:
# Create the JSON summary file

json_content = {
    "Checklist": {
        "RP1_ImplementationReconstructability": "PASS",
        "RP2_EnvironmentReproducibility": "PASS",
        "RP3_DeterminismAndStability": "FAIL"
    },
    "Rationale": {
        "RP1_ImplementationReconstructability": "The documentation provides sufficient methodological detail for reconstructing the core experiment. Key elements include: (1) Pearson correlation formula (Eq. 2) and excess correlation formula (Eq. 3) for identifying universal neurons; (2) efficient streaming algorithm for computing correlations over 100M tokens; (3) dataset specification (Pile test set, 512 token context); (4) neuron analysis metrics (activation statistics, weight metrics); (5) mathematical foundations for MLP layers, LayerNorm, and automated tests (Eqs. 1, 4, 5); (6) weight preprocessing procedures in Appendix A.1. Minor gaps exist in secondary analyses (complete test list, intervention parameters) but the core methodology is reconstructable.",
        
        "RP2_EnvironmentReproducibility": "All critical resources are explicitly identified and publicly available: (1) Models specified with exact HuggingFace identifiers (stanford-crfm/arwen-gpt2-medium-x21, stanford-crfm/alias-gpt2-small-x21, EleutherAI/pythia-160m); (2) model hyperparameters detailed in Table 1; (3) dataset specified as Pile test set with 100M tokens and 512 context length; (4) key dependencies identified (TransformerLens, spaCy); (5) code and data available at github.com/wesg52/universal-neurons; (6) compute resources (MIT Supercloud) mentioned.",
        
        "RP3_DeterminismAndStability": "The documentation fails to adequately address sources of randomness and variability. Critical gaps include: (1) No random seeds specified for the random Gaussian matrix R used in baseline correlation computation; (2) No seeds for random subset selection in path ablation experiments; (3) No confidence intervals or variance estimates for key metrics (e.g., 1.23%, 4.16% universal neurons); (4) The threshold ϱ > 0.5 is acknowledged as arbitrary with no sensitivity analysis; (5) No specification of number of runs/samples for intervention experiments; (6) No explicit reproducibility guarantees. While the paper studies variability across model seeds as its primary research question, it does not ensure exact reproducibility of the analysis itself."
    }
}

# Write the JSON file
json_path = os.path.join(output_dir, 'self_replication_evaluation.json')
with open(json_path, 'w') as f:
    json.dump(json_content, f, indent=2)

print(f"Created: {json_path}")
print(f"File size: {os.path.getsize(json_path)} bytes")

# Display the JSON content
print("\nJSON Content:")
print(json.dumps(json_content, indent=2))

Created: /net/scratch2/smallyan/universal-neurons_eval/doc_only_evaluation/self_replication_evaluation.json
File size: 2318 bytes

JSON Content:
{
  "Checklist": {
    "RP1_ImplementationReconstructability": "PASS",
    "RP2_EnvironmentReproducibility": "PASS",
    "RP3_DeterminismAndStability": "FAIL"
  },
  "Rationale": {
    "RP1_ImplementationReconstructability": "The documentation provides sufficient methodological detail for reconstructing the core experiment. Key elements include: (1) Pearson correlation formula (Eq. 2) and excess correlation formula (Eq. 3) for identifying universal neurons; (2) efficient streaming algorithm for computing correlations over 100M tokens; (3) dataset specification (Pile test set, 512 token context); (4) neuron analysis metrics (activation statistics, weight metrics); (5) mathematical foundations for MLP layers, LayerNorm, and automated tests (Eqs. 1, 4, 5); (6) weight preprocessing procedures in Appendix A.1. Minor gaps exist in secondary analyses

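The JSON summary pairs each checklist verdict with a rationale under the same key. A minimal sketch of a structural check, assuming the `Checklist`/`Rationale` layout shown above; the function name `validate_summary` and the allowed verdict set are assumptions made for illustration.

```python
from typing import Any

# Assumed verdict vocabulary, taken from the values used in this notebook.
ALLOWED_VERDICTS = {"PASS", "FAIL"}


def validate_summary(summary: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    checklist = summary.get("Checklist", {})
    rationale = summary.get("Rationale", {})
    for key, verdict in checklist.items():
        if verdict not in ALLOWED_VERDICTS:
            problems.append(f"{key}: unexpected verdict {verdict!r}")
        if not str(rationale.get(key, "")).strip():
            problems.append(f"{key}: missing rationale")
    # Rationale entries that have no matching verdict are also reported.
    for key in sorted(rationale.keys() - checklist.keys()):
        problems.append(f"{key}: rationale without a verdict")
    return problems
```

Running `validate_summary(json_content)` before writing the file would catch a mistyped key or verdict early.
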
In [5]:
# Verify the files were created correctly
import os

output_dir = '/net/scratch2/smallyan/universal-neurons_eval/doc_only_evaluation'

print("Files in output directory:")
for name in os.listdir(output_dir):
    filepath = os.path.join(output_dir, name)
    print(f"  {name}: {os.path.getsize(filepath)} bytes")

# Verify content of files
print("\n" + "="*60)
print("Verification: replication_evaluation.md (first 1000 chars)")
print("="*60)
with open(os.path.join(output_dir, 'replication_evaluation.md'), 'r', encoding='utf-8') as f:
    print(f.read()[:1000])

print("\n" + "="*60)
print("Verification: self_replication_evaluation.json")
print("="*60)
with open(os.path.join(output_dir, 'self_replication_evaluation.json'), 'r') as f:
    print(f.read())

Files in output directory:
  self_matching.ipynb: 20999 bytes
  consistency_evaluation.json: 3092 bytes
  replication_evaluation.md: 6307 bytes
  generalization_eval_summary.json: 1764 bytes
  code_critic_evaluation.ipynb: 127502 bytes
  code_critic_summary.json: 1289 bytes
  generalization_eval.ipynb: 114167 bytes
  self_replication_evaluation.json: 2318 bytes

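============================================================
Verification: replication_evaluation.md (first 1000 chars)
============================================================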
# Replication Evaluation - Universal Neurons in GPT2 Language Models

## Evaluation Overview

This evaluation assesses whether the documentation for "Universal Neurons in GPT2 Language Models" (Gurnee et al., 2024) provides sufficient information for an independent researcher to reproduce the experiment and its results without access to the original code.

**Documentation Source:** `/net/scratch2/smallyan/universal-neurons_eval/documentation.pdf`

---

## Evaluation Notes

### RP1: Implementation Reconstructability

**Assessment: PASS**

The documentation provides suffi
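
Printing the files confirms they exist, but not that the two reports agree with each other. A minimal sketch of a cross-check, assuming the markdown checklist rows look like `| **RP1** | ... | **PASS** |` and the JSON keys start with the criterion ID followed by an underscore. The names `ROW_PATTERN`, `verdicts_from_markdown`, and `verdicts_from_checklist` are hypothetical.

```python
import re

# Matches checklist rows such as:
# | **RP1** | Implementation Reconstructability | **PASS** |
ROW_PATTERN = re.compile(
    r"^\|\s*\*\*(RP\d+)\*\*\s*\|.*\|\s*\*\*(PASS|FAIL)\*\*\s*\|\s*$",
    re.MULTILINE,
)


def verdicts_from_markdown(md_text: str) -> dict[str, str]:
    """Extract {criterion ID: verdict} from the markdown checklist table."""
    return {m.group(1): m.group(2) for m in ROW_PATTERN.finditer(md_text)}


def verdicts_from_checklist(checklist: dict[str, str]) -> dict[str, str]:
    """Reduce keys like 'RP1_ImplementationReconstructability' to 'RP1'."""
    return {key.split("_", 1)[0]: verdict for key, verdict in checklist.items()}
```

Comparing `verdicts_from_markdown(md_content)` with `verdicts_from_checklist(json_content["Checklist"])` should yield equal dictionaries when the two files are consistent.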

In [6]:
print("="*70)
print("REPLICATION EVALUATION COMPLETE")
print("="*70)

print("""
TASK: Replication Evaluation for Universal Neurons in GPT2 Language Models
DOCUMENTATION: /net/scratch2/smallyan/universal-neurons_eval/documentation.pdf

EVALUATION RESULTS:
┌──────────────────────────────────────────┬─────────┐
│ Criterion                                │ Verdict │
├──────────────────────────────────────────┼─────────┤
│ RP1: Implementation Reconstructability   │ PASS    │
│ RP2: Environment and Resource Spec       │ PASS    │
│ RP3: Determinism and Stability           │ FAIL    │
└──────────────────────────────────────────┴─────────┘

OUTPUT FILES:
1. /net/scratch2/smallyan/universal-neurons_eval/doc_only_evaluation/replication_evaluation.md
2. /net/scratch2/smallyan/universal-neurons_eval/doc_only_evaluation/self_replication_evaluation.json

SUMMARY:
- The documentation provides good methodological detail (RP1 PASS)
- All required resources are explicitly specified (RP2 PASS)  
- Sources of randomness are not adequately controlled (RP3 FAIL)

The core experiment methodology is replicable, but exact numerical results
may not be reproducible without additional seed specifications.
""")

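======================================================================
REPLICATION EVALUATION COMPLETE
======================================================================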

TASK: Replication Evaluation for Universal Neurons in GPT2 Language Models
DOCUMENTATION: /net/scratch2/smallyan/universal-neurons_eval/documentation.pdf

EVALUATION RESULTS:
┌──────────────────────────────────────────┬─────────┐
│ Criterion                                │ Verdict │
├──────────────────────────────────────────┼─────────┤
│ RP1: Implementation Reconstructability   │ PASS    │
│ RP2: Environment and Resource Spec       │ PASS    │
│ RP3: Determinism and Stability           │ FAIL    │
└──────────────────────────────────────────┴─────────┘

OUTPUT FILES:
1. /net/scratch2/smallyan/universal-neurons_eval/doc_only_evaluation/replication_evaluation.md
2. /net/scratch2/smallyan/universal-neurons_eval/doc_only_evaluation/self_replication_evaluation.json

SUMMARY:
- The documentation provides good methodological detail (RP1 PASS)
- All required resources are explicitly specified (RP2 PASS)  
- Sources of randomness are not adequately controlled (