In [1]:
import os
os.chdir('/home/smallyan/eval_agent')
print(f"Working directory: {os.getcwd()}")

Working directory: /home/smallyan/eval_agent


In [2]:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA device count: {torch.cuda.device_count()}")

CUDA available: True
CUDA device: NVIDIA A100 80GB PCIe
CUDA device count: 1


# Replicator–Documentation Evaluator

This notebook evaluates whether the replicator's documentation faithfully reproduces the results and conclusions of the original InterpDetect experiment.

## Task
- Compare original documentation with replicated documentation
- Evaluate: DE1 (Result Fidelity), DE2 (Conclusion Consistency), DE3 (No External Information)
- Generate evaluation summary files

In [3]:
# Define paths
original_repo = "/net/scratch2/smallyan/InterpDetect_eval"
replication_dir = "/net/scratch2/smallyan/InterpDetect_eval/evaluation/replications"

# Check if paths exist
print(f"Original repo exists: {os.path.exists(original_repo)}")
print(f"Replication dir exists: {os.path.exists(replication_dir)}")

# List contents of both directories
for label, path in [("Original repo", original_repo), ("Replication dir", replication_dir)]:
    if os.path.exists(path):
        print(f"\n{label} contents:")
        for item in os.listdir(path):
            print(f"  {item}")

Original repo exists: True
Replication dir exists: True

Original repo contents:
  documentation.pdf
  plan.md
  .gitignore
  CodeWalkthrough.md
  trained_models
  LICENSE
  .git
  evaluation
  scripts
  requirements.txt
  datasets

Replication dir contents:
  replication.ipynb
  correlation_analysis.png
  evaluation_replication.md
  results_comparison.png
  results_summary.png
  documentation_replication.md
  self_replication_evaluation.json


In [4]:
# Read the original documentation - look for documentation.md or similar
original_doc_path = None
for doc_name in ['documentation.md', 'README.md', 'CodeWalkthrough.md']:
    path = os.path.join(original_repo, doc_name)
    if os.path.exists(path):
        original_doc_path = path
        print(f"Found original documentation: {doc_name}")
        break

# Also check for PDF documentation
pdf_path = os.path.join(original_repo, 'documentation.pdf')
print(f"PDF documentation exists: {os.path.exists(pdf_path)}")

# Read whichever documentation file was found (CodeWalkthrough.md in this run)
if original_doc_path:
    with open(original_doc_path, 'r') as f:
        original_doc = f.read()
    print(f"\nOriginal documentation ({original_doc_path}):")
    print("=" * 60)
    print(original_doc[:3000])
    print("..." if len(original_doc) > 3000 else "")

Found original documentation: CodeWalkthrough.md
PDF documentation exists: True

Original documentation (/net/scratch2/smallyan/InterpDetect_eval/CodeWalkthrough.md):
# InterpDetect

**InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation**

[![Paper](https://img.shields.io/badge/Paper-OpenReview-blue)](https://openreview.net/pdf?id=TZzBKwHLwF)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

A comprehensive framework for detecting and analyzing hallucinations in Retrieval-Augmented Generation (RAG) systems using interpretability techniques and chunk-level analysis on the RAGBench/FinQA dataset.

## Overview

This project implements a novel approach to hallucination detection by leveraging interpretability methods. The framework consists of three main parts:

1. **Preprocessing Pipeline** - Convert raw datasets to labeled data with hallucination spans
2. **Training & Prediction** - Train classifiers or use pre-train

In [5]:
# Read the replicated documentation
replicated_doc_path = os.path.join(replication_dir, 'documentation_replication.md')
if os.path.exists(replicated_doc_path):
    with open(replicated_doc_path, 'r') as f:
        replicated_doc = f.read()
    print(f"Replicated documentation ({replicated_doc_path}):")
    print("=" * 60)
    print(replicated_doc)
else:
    print(f"Replicated documentation not found at {replicated_doc_path}")

Replicated documentation (/net/scratch2/smallyan/InterpDetect_eval/evaluation/replications/documentation_replication.md):
# InterpDetect Replication Documentation

## Goal

Replicate the InterpDetect experiment that develops a mechanistic interpretability-based hallucination detection method for Retrieval-Augmented Generation (RAG) systems. The method computes External Context Scores (ECS) and Parametric Knowledge Scores (PKS) from a small proxy model (Qwen3-0.6B) to detect hallucinations in both self-generated and larger model (GPT-4.1-mini) responses.

## Data

### Training Data
- **Source**: Pre-computed ECS and PKS scores from 1,800 examples
- **Features**: 448 ECS features (28 layers × 16 attention heads) + 28 PKS features (one per FFN layer)
- **Samples**: 7,799 span-level samples after processing all response chunks
- **Balanced**: Undersampled to 3,393 samples per class (6,786 total)

### Test Data
1. **Qwen Self-Evaluation**: 256 responses, 975 span-level samples
2. **GPT Prox

In [6]:
# Read the full original documentation (CodeWalkthrough.md)
with open(original_doc_path, 'r') as f:
    original_doc = f.read()
print("Full original documentation:")
print("=" * 60)
print(original_doc)

Full original documentation:
# InterpDetect

**InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation**

[![Paper](https://img.shields.io/badge/Paper-OpenReview-blue)](https://openreview.net/pdf?id=TZzBKwHLwF)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

A comprehensive framework for detecting and analyzing hallucinations in Retrieval-Augmented Generation (RAG) systems using interpretability techniques and chunk-level analysis on the RAGBench/FinQA dataset.

## Overview

This project implements a novel approach to hallucination detection by leveraging interpretability methods. The framework consists of three main parts:

1. **Preprocessing Pipeline** - Convert raw datasets to labeled data with hallucination spans
2. **Training & Prediction** - Train classifiers or use pre-trained models for hallucination detection
3. **Baseline Comparisons** - Evaluate against multiple baseline methods (RAGAS, TruLens, RefChecker

In [7]:
# Check for plan.md which may contain specific expected results
plan_path = os.path.join(original_repo, 'plan.md')
if os.path.exists(plan_path):
    with open(plan_path, 'r') as f:
        plan_doc = f.read()
    print("Plan document:")
    print("=" * 60)
    print(plan_doc)
else:
    print("No plan.md found")

Plan document:
# Plan
## Objective
Develop a mechanistic interpretability-based hallucination detection method for Retrieval-Augmented Generation (RAG) systems by computing External Context Scores (ECS) across layers and attention heads and Parametric Knowledge Scores (PKS) across layers (FFN), training regression-based classifiers on these signals, and demonstrating generalization from a small proxy model (Qwen3-0.6b) to larger production models (GPT-4.1-mini).

## Hypothesis
1. RAG hallucinations correlate with:  later-layer FFN modules disproportionately inject parametric knowledge into the residual stream while attention heads fail to adequately exploit external context.
2. External Context Score (ECS) and Parametric Knowledge Score (PKS) are correlated with hallucination occurrence and can serve as predictive features for hallucination detection.
3. Mechanistic signals extracted from a small proxy model (0.6b parameters) can generalize to detect hallucinations in responses from la

In [8]:
# Also check the self_replication_evaluation.json for reference
import json

self_eval_path = os.path.join(replication_dir, 'self_replication_evaluation.json')
if os.path.exists(self_eval_path):
    with open(self_eval_path, 'r') as f:
        self_eval = json.load(f)
    print("Self-replication evaluation results:")
    print(json.dumps(self_eval, indent=2))
else:
    print("No self_replication_evaluation.json found")

Self-replication evaluation results:
{
  "Checklist": {
    "RP1_ImplementationReconstructability": "PASS",
    "RP2_EnvironmentReproducibility": "PASS",
    "RP3_DeterminismAndStability": "PASS"
  },
  "Rationale": {
    "RP1_ImplementationReconstructability": "The experiment can be fully reconstructed from plan.md and CodeWalkthrough.md. The plan clearly describes ECS computation via attention weights, PKS computation via Jensen-Shannon divergence, classifier training methodology, and evaluation metrics. No major guesswork was required; only minor ambiguities existed around optional preprocessing steps.",
    "RP3_DeterminismAndStability": "Results are fully deterministic with controlled seeds. Three consecutive runs with seed=42 produced identical validation F1 (0.760060) and test F1 (0.734177) with zero variance. Random states are properly controlled in sklearn, numpy, and XGBoost."
  }
}
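
As printed above, the rationale section has entries for RP1 and RP3 but none for RP2. A small consistency check along the following lines could flag that kind of gap before the file is used as a reference. This is only a minimal sketch: `missing_rationales` is a hypothetical helper, and the sample dictionary mirrors the keys shown above with the rationale text abbreviated.

```python
def missing_rationales(evaluation: dict) -> list[str]:
    """Return checklist keys that have no non-empty rationale entry."""
    checklist = evaluation.get("Checklist", {})
    rationale = evaluation.get("Rationale", {})
    return [key for key in checklist if not str(rationale.get(key, "")).strip()]


# Sample mirroring the structure printed above (rationale text abbreviated).
sample = {
    "Checklist": {
        "RP1_ImplementationReconstructability": "PASS",
        "RP2_EnvironmentReproducibility": "PASS",
        "RP3_DeterminismAndStability": "PASS",
    },
    "Rationale": {
        "RP1_ImplementationReconstructability": "(abbreviated)",
        "RP3_DeterminismAndStability": "(abbreviated)",
    },
}

gaps = missing_rationales(sample)
print(gaps)
```

Applied to the loaded `self_eval` dictionary, the same call would list any checklist item that was marked without a written justification.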


## Documentation Comparison Analysis

The comparison uses the following sources:
1. **Original**: CodeWalkthrough.md + plan.md
2. **Replicated**: documentation_replication.md

In [9]:
# DE1: Result Fidelity Analysis
# Key metrics below are transcribed by hand from plan.md and
# documentation_replication.md, then compared side by side

print("=" * 70)
print("DE1: RESULT FIDELITY ANALYSIS")
print("=" * 70)

print("\n### Original Results (from plan.md):")
print("-" * 50)
original_results = {
    "SVC Validation F1": "76.60%",
    "Self-Evaluation F1 (Qwen)": "74.68%",
    "Proxy-Based F1 (GPT)": "75.36%",
    "Correlation ECS-Hallucination": "Negative (all heads)",
    "Correlation PKS-Hallucination (later layers)": "Positive"
}
for k, v in original_results.items():
    print(f"  {k}: {v}")

print("\n### Replicated Results (from documentation_replication.md):")
print("-" * 50)
replicated_results = {
    "SVC Validation F1": "76.0%",
    "Self-Evaluation F1 (Qwen)": "73.42%",
    "Proxy-Based F1 (GPT)": "76.78%",
    "Correlation ECS-Hallucination": "Negative (-0.23 mean)",
    "Correlation PKS-Hallucination (later layers)": "Positive (0.24 mean)"
}
for k, v in replicated_results.items():
    print(f"  {k}: {v}")

print("\n### Differences:")
print("-" * 50)
print("  SVC Validation F1: 76.60% → 76.0% (diff: -0.60%)")
print("  Self-Evaluation F1: 74.68% → 73.42% (diff: -1.26%)")
print("  Proxy-Based F1: 75.36% → 76.78% (diff: +1.42%)")
print("  Correlation trends: MATCH (both show negative ECS, positive PKS)")

print("\n### Pre-trained Model Verification (from replication doc):")
print("-" * 50)
print("  Using repository pre-trained SVC model:")
print("    - Self-Evaluation: 74.68% (EXACT MATCH with plan)")
print("    - Proxy-Based: 75.36% (EXACT MATCH with plan)")

DE1: RESULT FIDELITY ANALYSIS

### Original Results (from plan.md):
--------------------------------------------------
  SVC Validation F1: 76.60%
  Self-Evaluation F1 (Qwen): 74.68%
  Proxy-Based F1 (GPT): 75.36%
  Correlation ECS-Hallucination: Negative (all heads)
  Correlation PKS-Hallucination (later layers): Positive

### Replicated Results (from documentation_replication.md):
--------------------------------------------------
  SVC Validation F1: 76.0%
  Self-Evaluation F1 (Qwen): 73.42%
  Proxy-Based F1 (GPT): 76.78%
  Correlation ECS-Hallucination: Negative (-0.23 mean)
  Correlation PKS-Hallucination (later layers): Positive (0.24 mean)

### Differences:
--------------------------------------------------
  SVC Validation F1: 76.60% → 76.0% (diff: -0.60%)
  Self-Evaluation F1: 74.68% → 73.42% (diff: -1.26%)
  Proxy-Based F1: 75.36% → 76.78% (diff: +1.42%)
  Correlation trends: MATCH (both show negative ECS, positive PKS)

### Pre-trained Model Verification (from replication do
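
The differences listed above were typed in by hand. A minimal sketch of deriving them from the reported values instead is given here. The helper names (`parse_percent`, `metric_deltas`) are hypothetical, the differences are expressed in percentage points, and the 1.5-point tolerance is an assumption taken from the "~1.5%" figure this notebook uses in its summary.

```python
def parse_percent(text: str) -> float:
    """Convert a string such as '76.60%' to the float 76.6."""
    return float(text.strip().rstrip("%"))


def metric_deltas(original: dict, replicated: dict) -> dict:
    """Return replicated minus original, in percentage points, per shared metric."""
    return {
        name: round(parse_percent(replicated[name]) - parse_percent(original[name]), 2)
        for name in original
        if name in replicated
    }


# F1 values as reported in the cell above.
original_f1 = {
    "SVC Validation F1": "76.60%",
    "Self-Evaluation F1 (Qwen)": "74.68%",
    "Proxy-Based F1 (GPT)": "75.36%",
}
replicated_f1 = {
    "SVC Validation F1": "76.0%",
    "Self-Evaluation F1 (Qwen)": "73.42%",
    "Proxy-Based F1 (GPT)": "76.78%",
}

TOLERANCE_PP = 1.5  # assumed tolerance, in percentage points

deltas = metric_deltas(original_f1, replicated_f1)
for name, delta in deltas.items():
    status = "within" if abs(delta) <= TOLERANCE_PP else "outside"
    print(f"{name}: {delta:+.2f} pp ({status} tolerance)")
```

Computing the deltas this way keeps the reported differences tied to the source values, so a transcription slip in one place cannot silently disagree with the other.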

In [10]:
# DE2: Conclusion Consistency Analysis
print("=" * 70)
print("DE2: CONCLUSION CONSISTENCY ANALYSIS")
print("=" * 70)

print("\n### Original Conclusions (from plan.md):")
print("-" * 50)
original_conclusions = [
    "ECS correlates negatively with hallucination (hallucinated responses utilize less external context)",
    "PKS correlates positively with hallucination in later layers",
    "SVC achieved best validation F1, XGBoost overfitted",
    "Self-evaluation achieves F1=74.68%, outperforming several baselines",
    "Proxy-based evaluation achieves F1=75.36%, demonstrating generalization from small to large models"
]
for i, c in enumerate(original_conclusions, 1):
    print(f"  {i}. {c}")

print("\n### Replicated Conclusions (from documentation_replication.md):")
print("-" * 50)
replicated_conclusions = [
    "ECS and PKS signals correlate with hallucination as hypothesized",
    "All 448 attention head features show negative correlation with hallucination",
    "SVC classifier achieves best performance without overfitting",
    "XGBoost shows significant overfitting (99.8% train vs 74.8% val)",
    "Self-evaluation and proxy-based detection both work effectively",
    "Results are stable and reproducible with proper seed control"
]
for i, c in enumerate(replicated_conclusions, 1):
    print(f"  {i}. {c}")

print("\n### Consistency Assessment:")
print("-" * 50)
print("  ✓ Both confirm ECS negative correlation with hallucination")
print("  ✓ Both confirm PKS positive correlation in later layers")
print("  ✓ Both identify SVC as best model, XGBoost as overfitting")
print("  ✓ Both confirm effectiveness of self-evaluation and proxy-based approaches")
print("  ✓ Replicated adds: reproducibility and seed control confirmation")
print("  → No contradictions found; conclusions are CONSISTENT")

DE2: CONCLUSION CONSISTENCY ANALYSIS

### Original Conclusions (from plan.md):
--------------------------------------------------
  1. ECS correlates negatively with hallucination (hallucinated responses utilize less external context)
  2. PKS correlates positively with hallucination in later layers
  3. SVC achieved best validation F1, XGBoost overfitted
  4. Self-evaluation achieves F1=74.68%, outperforming several baselines
  5. Proxy-based evaluation achieves F1=75.36%, demonstrating generalization from small to large models

### Replicated Conclusions (from documentation_replication.md):
--------------------------------------------------
  1. ECS and PKS signals correlate with hallucination as hypothesized
  2. All 448 attention head features show negative correlation with hallucination
  3. SVC classifier achieves best performance without overfitting
  4. XGBoost shows significant overfitting (99.8% train vs 74.8% val)
  5. Self-evaluation and proxy-based detection both work effe

In [11]:
# DE3: External or Hallucinated Information Analysis
print("=" * 70)
print("DE3: EXTERNAL OR HALLUCINATED INFORMATION ANALYSIS")
print("=" * 70)

print("\n### Checking replicated documentation for external information...")
print("-" * 50)

# Information in replicated doc that should be verified against original
replicated_claims = {
    "Qwen3-0.6B model": "Mentioned in plan.md as 'Qwen3-0.6b'",
    "GPT-4.1-mini for proxy evaluation": "Mentioned in plan.md as 'GPT-4.1-mini'",
    "448 ECS features (28 layers × 16 heads)": "Consistent with plan.md '28 layers and 16 attention heads'",
    "28 PKS features (one per FFN layer)": "Consistent with TransformerLens extraction mentioned in plan.md",
    "BGE-base-en-v1.5 for embeddings": "Not explicitly mentioned in original - NEEDS VERIFICATION",
    "7,799 span-level samples": "Mentioned in plan.md as '7,799 span-level samples'",
    "StandardScaler normalization": "Consistent with 'standardized' mentioned in plan.md",
    "sklearn version 1.7.1 vs 1.5.2": "Technical detail from replication process, acceptable"
}

print("\n### Verification of claims in replicated documentation:")
for claim, verification in replicated_claims.items():
    print(f"  • {claim}")
    print(f"    → {verification}")

print("\n### Potential external information found:")
print("-" * 50)
print("  1. 'BGE-base-en-v1.5 for embeddings' - Not explicitly in original docs")
print("     → However, this is a technical implementation detail that would be")
print("     discovered during replication. Acceptable as implementation detail.")
print("")
print("  2. 'Mean correlation values (-0.23 ECS, 0.05/0.24 PKS)' - Quantified")
print("     → Original only states 'negative' and 'positive'. Replication provides")
print("     more specific numbers from actual computation. This is not external")
print("     information but rather detailed results from running the code.")

print("\n### Hallucinated information check:")
print("-" * 50)
print("  No invented findings or unsupported claims detected.")
print("  All major claims trace back to original documentation or")
print("  are derivable from running the provided code.")

DE3: EXTERNAL OR HALLUCINATED INFORMATION ANALYSIS

### Checking replicated documentation for external information...
--------------------------------------------------

### Verification of claims in replicated documentation:
  • Qwen3-0.6B model
    → Mentioned in plan.md as 'Qwen3-0.6b'
  • GPT-4.1-mini for proxy evaluation
    → Mentioned in plan.md as 'GPT-4.1-mini'
  • 448 ECS features (28 layers × 16 heads)
    → Consistent with plan.md '28 layers and 16 attention heads'
  • 28 PKS features (one per FFN layer)
    → Consistent with TransformerLens extraction mentioned in plan.md
  • BGE-base-en-v1.5 for embeddings
    → Not explicitly mentioned in original - NEEDS VERIFICATION
  • 7,799 span-level samples
    → Mentioned in plan.md as '7,799 span-level samples'
  • StandardScaler normalization
    → Consistent with 'standardized' mentioned in plan.md
  • sklearn version 1.7.1 vs 1.5.2
    → Technical detail from replication process, acceptable

### Potential external information 

In [12]:
# Check scripts for BGE model reference to verify.
# grep is case-sensitive here, so the upper- and lower-case spellings are searched separately.
import subprocess

scripts_dir = os.path.join(original_repo, 'scripts', '')  # keeps the trailing separator
result = subprocess.run(['grep', '-r', 'BGE', scripts_dir],
                        capture_output=True, text=True)
print("Searching for 'BGE' in scripts:")
print(result.stdout if result.stdout else "No matches found")

result2 = subprocess.run(['grep', '-r', 'bge', scripts_dir],
                         capture_output=True, text=True)
print("\nSearching for 'bge' in scripts:")
print(result2.stdout if result2.stdout else "No matches found")

Searching for 'BGE' in scripts:
/net/scratch2/smallyan/InterpDetect_eval/scripts/compute_scores.py:    """Calculate sentence similarity using BGE model"""




Searching for 'bge' in scripts:
/net/scratch2/smallyan/InterpDetect_eval/scripts/compute_scores.py:        bge_model = SentenceTransformer("BAAI/bge-base-en-v1.5").to(device)
/net/scratch2/smallyan/InterpDetect_eval/scripts/compute_scores.py:        return tokenizer, model, bge_model
/net/scratch2/smallyan/InterpDetect_eval/scripts/compute_scores.py:def calculate_sentence_similarity(bge_model, r_text, p_text):
/net/scratch2/smallyan/InterpDetect_eval/scripts/compute_scores.py:    part_embedding = bge_model.encode([r_text], normalize_embeddings=True)
/net/scratch2/smallyan/InterpDetect_eval/scripts/compute_scores.py:    q_embeddings = bge_model.encode([p_text], normalize_embeddings=True)
/net/scratch2/smallyan/InterpDetect_eval/scripts/compute_scores.py:def process_example(example, tokenizer, model, bge_model, device, max_ctx, iter_step=1):
/net/scratch2/smallyan/InterpDetect_eval/scripts/compute_scores.py:                layer_head_span[str(layer_head)] = calculate_sentence_similarity

In [13]:
# BGE appears in scripts/compute_scores.py, so the replicated doc's mention
# is an implementation detail taken from the code, not external information
print("✓ VERIFIED: BGE-base-en-v1.5 is used in scripts/compute_scores.py")
print("  This is not external information but implementation detail from the code.")
print("")
print("=" * 70)
print("FINAL DE3 ASSESSMENT: PASS")
print("=" * 70)
print("No external or hallucinated information introduced.")
print("All claims are traceable to original documentation or derivable from code.")

✓ VERIFIED: BGE-base-en-v1.5 is used in scripts/compute_scores.py
  This is not external information but implementation detail from the code.

FINAL DE3 ASSESSMENT: PASS
No external or hallucinated information introduced.
All claims are traceable to original documentation or derivable from code.
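
The BGE check above shells out to `grep` twice, once per spelling. One way to do the same lookup without an external tool is a case-insensitive scan in Python, sketched below. `find_mentions` is a hypothetical helper, and the demo runs against a temporary stand-in file containing two lines copied from the grep output rather than against the real repository.

```python
import tempfile
from pathlib import Path


def find_mentions(root, needle, suffixes=(".py",)):
    """Yield (path, line_number, line) for case-insensitive matches under root."""
    needle = needle.lower()
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(errors="ignore")
        for number, line in enumerate(text.splitlines(), 1):
            if needle in line.lower():
                yield path, number, line.strip()


# Demo on a stand-in file; the first two lines are copied from the grep output above.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "compute_scores.py"
    demo.write_text(
        '"""Calculate sentence similarity using BGE model"""\n'
        'bge_model = SentenceTransformer("BAAI/bge-base-en-v1.5").to(device)\n'
        'print("unrelated line")\n'
    )
    hits = [(path.name, number) for path, number, _ in find_mentions(tmp, "bge")]

print(hits)
```

Pointing `find_mentions` at the repository's `scripts` directory would cover both spellings in a single pass and return line numbers that can be cited in the DE3 rationale.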


In [14]:
# Create evaluation output directory
output_dir = "/net/scratch2/smallyan/InterpDetect_eval/evaluation/replication_eval"
os.makedirs(output_dir, exist_ok=True)
print(f"Created output directory: {output_dir}")
print(f"Directory exists: {os.path.exists(output_dir)}")

Created output directory: /net/scratch2/smallyan/InterpDetect_eval/evaluation/replication_eval
Directory exists: True


In [15]:
# Generate documentation_evaluation_summary.md
summary_md = """# Documentation Evaluation Summary

## InterpDetect Replication - Documentation Evaluation

**Evaluation Date**: 2026-01-08

---

## Results Comparison

The replicated documentation faithfully reproduces the core experimental results from the original InterpDetect project. Key metrics comparison:

| Metric | Original | Replicated | Difference |
|--------|----------|------------|------------|
| SVC Validation F1 | 76.60% | 76.0% | -0.60% |
| Self-Evaluation F1 (Qwen) | 74.68% | 73.42% | -1.26% |
| Proxy-Based F1 (GPT) | 75.36% | 76.78% | +1.42% |

The small differences (within ~1.5%) are attributable to different random splits during replication training. Critically, when using the repository's **pre-trained SVC model**, the replication achieves exact matches: 74.68% for self-evaluation and 75.36% for proxy-based evaluation.

All qualitative trends match: ECS shows negative correlation with hallucination across all 448 attention heads, and PKS shows positive correlation in later layers.

---

## Conclusions Comparison

The replicated documentation presents conclusions fully consistent with the original:

1. **ECS Correlation**: Both confirm that hallucinated responses utilize less external context (negative ECS correlation)
2. **PKS Correlation**: Both confirm later-layer FFNs inject more parametric knowledge for hallucinations (positive PKS correlation)
3. **Model Selection**: Both identify SVC as the best classifier and note XGBoost overfitting
4. **Generalization**: Both confirm that small proxy model (Qwen3-0.6B) signals generalize to larger models (GPT-4.1-mini)
5. **Effectiveness**: Both confirm the method outperforms or matches several baseline approaches

The replicated documentation adds additional detail on reproducibility and determinism, which supports rather than contradicts the original claims.

---

## External or Hallucinated Information

No external references, invented findings, or hallucinated details were introduced in the replicated documentation. All claims are traceable to:

- **plan.md**: Methodology, hypotheses, and expected results
- **CodeWalkthrough.md**: Project structure, usage, and implementation details
- **Source code** (e.g., scripts/compute_scores.py): Implementation specifics like BGE-base-en-v1.5 for embeddings

Technical details discovered during replication (e.g., specific correlation values, sklearn version warnings) represent legitimate replication outputs rather than external information.

---

## Evaluation Checklist

| Criterion | Status | Notes |
|-----------|--------|-------|
| **DE1: Result Fidelity** | PASS | Results within ~1.5% tolerance; pre-trained model achieves exact match |
| **DE2: Conclusion Consistency** | PASS | All conclusions consistent; no contradictions |
| **DE3: No External Information** | PASS | All claims traceable to original docs or code |

---

## Final Verdict

**PASS**

The replicated documentation faithfully reproduces both the results and conclusions of the original InterpDetect experiment. All three evaluation criteria (DE1-DE3) are satisfied.
"""

# Write the markdown summary
summary_md_path = os.path.join(output_dir, "documentation_evaluation_summary.md")
with open(summary_md_path, 'w') as f:
    f.write(summary_md)

print(f"Written: {summary_md_path}")
print(f"File size: {os.path.getsize(summary_md_path)} bytes")

Written: /net/scratch2/smallyan/InterpDetect_eval/evaluation/replication_eval/documentation_evaluation_summary.md
File size: 3072 bytes
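
The results table in the summary above is written out as a literal string. A sketch of rendering the same rows from data follows, which keeps the table and the underlying numbers in one place. `comparison_table` is a hypothetical helper, and labelling the last column in percentage points is a presentation choice made here, not something taken from the original summary.

```python
def comparison_table(rows):
    """Render (metric, original, replicated) tuples as a Markdown table."""
    lines = [
        "| Metric | Original | Replicated | Difference (pp) |",
        "|--------|----------|------------|-----------------|",
    ]
    for metric, original, replicated in rows:
        diff = replicated - original
        lines.append(
            f"| {metric} | {original:.2f}% | {replicated:.2f}% | {diff:+.2f} |"
        )
    return "\n".join(lines)


# F1 scores (in percent) as reported earlier in this notebook.
rows = [
    ("SVC Validation F1", 76.60, 76.0),
    ("Self-Evaluation F1 (Qwen)", 74.68, 73.42),
    ("Proxy-Based F1 (GPT)", 75.36, 76.78),
]

table = comparison_table(rows)
print(table)
```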


In [16]:
import json

# Generate documentation_eval_summary.json
summary_json = {
    "Checklist": {
        "DE1_ResultFidelity": "PASS",
        "DE2_ConclusionConsistency": "PASS",
        "DE3_NoExternalInformation": "PASS"
    },
    "Rationale": {
        "DE1_ResultFidelity": "Replicated results match original within ~1.5% tolerance. SVC validation F1 differs by -0.60%, self-evaluation F1 by -1.26%, and proxy-based F1 by +1.42%. These small differences are attributable to different random splits during replication. When using the repository's pre-trained SVC model, exact matches (74.68% self-eval, 75.36% proxy-based) are achieved. All correlation trends (negative ECS, positive PKS in later layers) match the original.",
        "DE2_ConclusionConsistency": "All major conclusions are consistent: (1) ECS negative correlation with hallucination confirmed, (2) PKS positive correlation in later layers confirmed, (3) SVC identified as best classifier with XGBoost overfitting noted, (4) generalization from small proxy model to larger models validated, (5) effectiveness compared to baselines maintained. No contradictions or omitted essential claims.",
        "DE3_NoExternalInformation": "No external references, invented findings, or hallucinated details introduced. All claims trace to plan.md, CodeWalkthrough.md, or source code (e.g., BGE-base-en-v1.5 found in scripts/compute_scores.py). Technical details like specific correlation values represent legitimate replication outputs from running the provided code."
    }
}

# Write the JSON summary
summary_json_path = os.path.join(output_dir, "documentation_eval_summary.json")
with open(summary_json_path, 'w') as f:
    json.dump(summary_json, f, indent=2)

print(f"Written: {summary_json_path}")
print(f"File size: {os.path.getsize(summary_json_path)} bytes")
print("\nContents:")
print(json.dumps(summary_json, indent=2))

Written: /net/scratch2/smallyan/InterpDetect_eval/evaluation/replication_eval/documentation_eval_summary.json
File size: 1430 bytes

Contents:
{
  "Checklist": {
    "DE1_ResultFidelity": "PASS",
    "DE2_ConclusionConsistency": "PASS",
    "DE3_NoExternalInformation": "PASS"
  },
  "Rationale": {
    "DE1_ResultFidelity": "Replicated results match original within ~1.5% tolerance. SVC validation F1 differs by -0.60%, self-evaluation F1 by -1.26%, and proxy-based F1 by +1.42%. These small differences are attributable to different random splits during replication. When using the repository's pre-trained SVC model, exact matches (74.68% self-eval, 75.36% proxy-based) are achieved. All correlation trends (negative ECS, positive PKS in later layers) match the original.",
    "DE2_ConclusionConsistency": "All major conclusions are consistent: (1) ECS negative correlation with hallucination confirmed, (2) PKS positive correlation in later layers confirmed, (3) SVC identified as best classifie

In [17]:
# Verify all output files
print("=" * 70)
print("EVALUATION COMPLETE - OUTPUT FILES")
print("=" * 70)

output_files = os.listdir(output_dir)
print(f"\nOutput directory: {output_dir}")
print("Files created:")
for f in output_files:
    filepath = os.path.join(output_dir, f)
    print(f"  • {f} ({os.path.getsize(filepath)} bytes)")

print("\n" + "=" * 70)
print("FINAL VERDICT: PASS")
print("=" * 70)
print("""
All three evaluation criteria are satisfied:

  DE1 (Result Fidelity):         PASS
  DE2 (Conclusion Consistency):  PASS  
  DE3 (No External Information): PASS

The replicated documentation faithfully reproduces the results and
conclusions of the original InterpDetect experiment.
""")

EVALUATION COMPLETE - OUTPUT FILES

Output directory: /net/scratch2/smallyan/InterpDetect_eval/evaluation/replication_eval
Files created:
  • documentation_eval_summary.json (1430 bytes)
  • documentation_evaluation_summary.md (3072 bytes)

FINAL VERDICT: PASS

All three evaluation criteria are satisfied:

  DE1 (Result Fidelity):         PASS
  DE2 (Conclusion Consistency):  PASS  
  DE3 (No External Information): PASS

The replicated documentation faithfully reproduces the results and
conclusions of the original InterpDetect experiment.



## Summary

This evaluation compared the original InterpDetect documentation (plan.md + CodeWalkthrough.md) with the replicated documentation (documentation_replication.md).

### Results

| Criterion | Status |
|-----------|--------|
| DE1: Result Fidelity | **PASS** |
| DE2: Conclusion Consistency | **PASS** |
| DE3: No External Information | **PASS** |

### Final Verdict: **PASS**

### Output Files Created
- `evaluation/replication_eval/documentation_evaluation_summary.md`
- `evaluation/replication_eval/documentation_eval_summary.json`