# Code Critic Evaluation for InterpDetect Circuit Analysis

## Repository: `/net/scratch2/smallyan/InterpDetect_eval`

This notebook evaluates the repository's code block by block, following the Plan and CodeWalkthrough files. Each function or class receives four Y/N ratings.

## Evaluation Criteria
1. **Runnable (Y/N)**: Block executes without error
2. **Correct-Implementation (Y/N)**: Logic implements described computation correctly  
3. **Redundant (Y/N)**: Block duplicates another block's computation
4. **Irrelevant (Y/N)**: Block does not contribute to project goal
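
The four ratings can be held in one record per block. The following is a minimal sketch and not part of the evaluated repository: the `BlockRating` class and its `validate` method are hypothetical names introduced here, and the field names simply mirror the dictionary keys used in the first code cell.

```python
from dataclasses import dataclass

VALID_FLAGS = {"Y", "N"}


@dataclass(frozen=True)
class BlockRating:
    """One evaluated function or class (hypothetical helper, not in the repo)."""

    file: str
    function: str
    runnable: str
    correct_implementation: str
    redundant: str
    irrelevant: str
    error_note: str = ""

    def validate(self) -> None:
        """Raise ValueError if any rating is not exactly 'Y' or 'N'."""
        for name in ("runnable", "correct_implementation", "redundant", "irrelevant"):
            value = getattr(self, name)
            if value not in VALID_FLAGS:
                raise ValueError(f"{self.file}:{self.function}: {name}={value!r}")


rating = BlockRating("helper.py", "clean_text", "Y", "Y", "N", "N")
rating.validate()  # passes silently: every flag is 'Y' or 'N'
```

Validating the flags up front guards the later `== 'Y'` and `== 'N'` comparisons against typos such as a lowercase `y`, which would otherwise be silently counted as neither.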

In [1]:
import json
import os

import pandas as pd

os.chdir('/home/smallyan/eval_agent')

# Complete evaluation results from our analysis
evaluation_results = [
    # helper.py
    {"file": "helper.py", "function": "clean_text", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "helper.py", "function": "get_sentence_spans", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "helper.py", "function": "split_clauses", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "helper.py", "function": "split_text_semantic_chunks", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    
    # compute_scores.py - Core signal extraction
    {"file": "compute_scores.py", "function": "load_examples", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "compute_scores.py", "function": "setup_models", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "compute_scores.py", "function": "calculate_dist_2d", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "compute_scores.py", "function": "add_special_template", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "compute_scores.py", "function": "is_hallucination_span", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "compute_scores.py", "function": "calculate_hallucination_spans", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "compute_scores.py", "function": "calculate_respond_spans", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "compute_scores.py", "function": "calculate_prompt_spans", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "compute_scores.py", "function": "calculate_sentence_similarity", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "compute_scores.py", "function": "MockOutputs", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "compute_scores.py", "function": "process_example", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": "Core ECS/PKS computation"},
    {"file": "compute_scores.py", "function": "save_batch", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "compute_scores.py", "function": "plot_binary_correlation", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "Y", "error_note": "Visualization helper"},
    {"file": "compute_scores.py", "function": "analyze_scores", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "Y", "error_note": "Visualization helper"},
    {"file": "compute_scores.py", "function": "main", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    
    # classifier.py - ML training
    {"file": "classifier.py", "function": "load_data", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "classifier.py", "function": "preprocess_data", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "classifier.py", "function": "split_data", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "classifier.py", "function": "create_preprocessor", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "classifier.py", "function": "train_models", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "classifier.py", "function": "save_models", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "classifier.py", "function": "create_feature_importance_plot", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "Y", "error_note": "Visualization helper"},
    {"file": "classifier.py", "function": "main", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    
    # predict.py - Inference
    {"file": "predict.py", "function": "load_data", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "predict.py", "function": "preprocess_data", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "predict.py", "function": "load_model", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "predict.py", "function": "make_predictions", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "predict.py", "function": "evaluate_span_level", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "predict.py", "function": "evaluate_response_level", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "predict.py", "function": "save_results", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "predict.py", "function": "create_confusion_matrix_plot", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "Y", "error_note": "Visualization helper"},
    {"file": "predict.py", "function": "main", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    
    # preprocess.py
    {"file": "preprocess.py", "function": "load_data_from_hf", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "preprocess.py", "function": "add_prompt_spans", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "preprocess.py", "function": "process_dataset", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "preprocess.py", "function": "save_dataset", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "preprocess.py", "function": "main", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    
    # generate_response_hf.py
    {"file": "generate_response_hf.py", "function": "load_datasets", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "generate_response_hf.py", "function": "filter_by_token_count", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "generate_response_hf.py", "function": "limit_samples", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "generate_response_hf.py", "function": "setup_model", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "generate_response_hf.py", "function": "add_special_template", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "generate_response_hf.py", "function": "generate_response", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "generate_response_hf.py", "function": "save_dataset", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate of preprocess.py"},
    {"file": "generate_response_hf.py", "function": "main", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    
    # generate_response_gpt.py
    {"file": "generate_response_gpt.py", "function": "load_datasets", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "generate_response_gpt.py", "function": "filter_by_token_count", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "generate_response_gpt.py", "function": "limit_samples", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "generate_response_gpt.py", "function": "setup_openai_client", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "generate_response_gpt.py", "function": "add_special_template", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Unused"},
    {"file": "generate_response_gpt.py", "function": "generate_response", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "generate_response_gpt.py", "function": "save_dataset", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "generate_response_gpt.py", "function": "main", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    
    # generate_labels.py
    {"file": "generate_labels.py", "function": "load_datasets", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "generate_labels.py", "function": "setup_lettuce_detector", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "generate_labels.py", "function": "add_lettuce_labels", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "generate_labels.py", "function": "setup_llm_client", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "generate_labels.py", "function": "generate_judge_prompt", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "generate_labels.py", "function": "add_llm_judge", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "generate_labels.py", "function": "save_dataset", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "generate_labels.py", "function": "main", "runnable": "N", "correct_implementation": "N", "redundant": "N", "irrelevant": "N", "error_note": "Uses undefined args.skip_lettuce and args.skip_llm_judge"},
    
    # filter.py
    {"file": "filter.py", "function": "load_datasets", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "filter.py", "function": "add_labels_llm", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "filter.py", "function": "apply_confidence_threshold", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "filter.py", "function": "filter_datasets", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "filter.py", "function": "save_dataset", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "filter.py", "function": "main", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    
    # Baseline scripts
    {"file": "run_gpt.py", "function": "load_and_balance_data", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "run_gpt.py", "function": "generate_judge_prompt", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "run_gpt.py", "function": "llm_as_a_judge", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "run_gpt.py", "function": "evaluate", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "run_gpt.py", "function": "main", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    
    {"file": "run_groq.py", "function": "load_and_balance_data", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "run_groq.py", "function": "generate_judge_prompt", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "run_groq.py", "function": "llm_as_a_judge", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "run_groq.py", "function": "evaluate", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "run_groq.py", "function": "main", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    
    {"file": "run_hf.py", "function": "load_and_balance_data", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "run_hf.py", "function": "generate_judge_prompt", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "run_hf.py", "function": "llm_as_a_judge", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "run_hf.py", "function": "evaluate", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "run_hf.py", "function": "main", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    
    {"file": "run_ragas.py", "function": "load_and_balance_data", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "run_ragas.py", "function": "run_ragas_evaluation", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "run_ragas.py", "function": "evaluate_thresholds", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "run_ragas.py", "function": "main", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    
    {"file": "run_refchecker.py", "function": "load_and_balance_data", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "run_refchecker.py", "function": "run_refchecker_evaluation", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "run_refchecker.py", "function": "evaluate", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "run_refchecker.py", "function": "main", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    
    {"file": "run_trulens.py", "function": "load_and_balance_data", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Duplicate"},
    {"file": "run_trulens.py", "function": "RAG", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "run_trulens.py", "function": "run_trulens_evaluation", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
    {"file": "run_trulens.py", "function": "evaluate_thresholds", "runnable": "Y", "correct_implementation": "Y", "redundant": "Y", "irrelevant": "N", "error_note": "Similar to run_ragas.py"},
    {"file": "run_trulens.py", "function": "main", "runnable": "Y", "correct_implementation": "Y", "redundant": "N", "irrelevant": "N", "error_note": ""},
]

eval_df = pd.DataFrame(evaluation_results)
print(f"Total functions/blocks evaluated: {len(eval_df)}")

Total functions/blocks evaluated: 99


## Block-Level Evaluation Table

In [2]:
# Display the complete evaluation table
display_df = eval_df.copy()
display_df.columns = ['File', 'Function/Block', 'Runnable', 'Correct', 'Redundant', 'Irrelevant', 'Error Note']
print(display_df.to_string(index=False))

                    File                 Function/Block Runnable Correct Redundant Irrelevant                                               Error Note
               helper.py                     clean_text        Y       Y         N          N                                                         
               helper.py             get_sentence_spans        Y       Y         N          N                                                         
               helper.py                  split_clauses        Y       Y         N          N                                                         
               helper.py     split_text_semantic_chunks        Y       Y         N          N                                                         
       compute_scores.py                  load_examples        Y       Y         N          N                                                         
       compute_scores.py                   setup_models        Y       Y         N          N 

## Quantitative Metrics

In [3]:
# Compute quantitative metrics
total_blocks = len(eval_df)
runnable_count = (eval_df['runnable'] == 'Y').sum()
incorrect_count = (eval_df['correct_implementation'] == 'N').sum()
redundant_count = (eval_df['redundant'] == 'Y').sum()
irrelevant_count = (eval_df['irrelevant'] == 'Y').sum()

# Calculate percentages
runnable_pct = (runnable_count / total_blocks) * 100
incorrect_pct = (incorrect_count / total_blocks) * 100
redundant_pct = (redundant_count / total_blocks) * 100
irrelevant_pct = (irrelevant_count / total_blocks) * 100

# Output matches expectation: the block both runs and is implemented correctly
output_matches = ((eval_df['runnable'] == 'Y') & (eval_df['correct_implementation'] == 'Y')).sum()
output_matches_pct = (output_matches / total_blocks) * 100

# Correction rate
correction_rate_pct = 0.0  # No corrections made during evaluation

print("=" * 60)
print("QUANTITATIVE METRICS")
print("=" * 60)
print(f"""
Total Blocks/Functions Evaluated: {total_blocks}

| Metric                         | Value      |
|--------------------------------|------------|
| Runnable%                      | {runnable_pct:.2f}%    |
| Output-Matches-Expectation%    | {output_matches_pct:.2f}%    |
| Incorrect%                     | {incorrect_pct:.2f}%     |
| Redundant%                     | {redundant_pct:.2f}%    |
| Irrelevant%                    | {irrelevant_pct:.2f}%     |
| Correction-Rate%               | {correction_rate_pct:.2f}%     |
""")

QUANTITATIVE METRICS

Total Blocks/Functions Evaluated: 99

| Metric                         | Value      |
|--------------------------------|------------|
| Runnable%                      | 98.99%    |
| Output-Matches-Expectation%    | 98.99%    |
| Incorrect%                     | 1.01%     |
| Redundant%                     | 21.21%    |
| Irrelevant%                    | 4.04%     |
| Correction-Rate%               | 0.00%     |



## Binary Checklist Summary

In [4]:
# Binary Checklist
c1_pass = (eval_df['runnable'] == 'N').sum() == 0
c2_pass = (eval_df['correct_implementation'] == 'N').sum() == 0
c3_pass = (eval_df['redundant'] == 'Y').sum() == 0
c4_pass = (eval_df['irrelevant'] == 'Y').sum() == 0

c1_status = "PASS" if c1_pass else "FAIL"
c2_status = "PASS" if c2_pass else "FAIL"
c3_status = "PASS" if c3_pass else "FAIL"
c4_status = "PASS" if c4_pass else "FAIL"

print("=" * 80)
print("BINARY CHECKLIST SUMMARY")
print("=" * 80)
print("""
| Checklist Item                         | Condition                                | PASS/FAIL |
|----------------------------------------|------------------------------------------|-----------|""")
print(f"| C1: All core analysis code is runnable | No block has Runnable = N                | {c1_status}      |")
print(f"| C2: All implementations are correct    | No block has Correct-Implementation = N  | {c2_status}      |")
print(f"| C3: No redundant code                  | No block has Redundant = Y               | {c3_status}      |")
print(f"| C4: No irrelevant code                 | No block has Irrelevant = Y              | {c4_status}      |")

# Generate rationales
print("\n" + "=" * 80)
print("RATIONALES")
print("=" * 80)

if c1_pass:
    c1_rationale = f"All {total_blocks} functions/blocks are runnable without errors"
else:
    # Take the failure count from the data instead of hard-coding it
    failed = eval_df[eval_df["runnable"] == "N"]
    c1_rationale = f"{len(failed)} block failed to run: generate_labels.py main() - uses undefined args.skip_lettuce and args.skip_llm_judge"

if c2_pass:
    c2_rationale = "All implementations follow the described methodology correctly"
else:
    c2_rationale = f"{incorrect_count} block has incorrect implementation: generate_labels.py main() - uses undefined arguments"

if c3_pass:
    c3_rationale = "No redundant code found"
else:
    c3_rationale = f"{redundant_count} duplicate utility functions found across scripts (load_and_balance_data, save_dataset, generate_judge_prompt duplicated in multiple baseline scripts)"

if c4_pass:
    c4_rationale = "All code contributes to project goal"
else:
    c4_rationale = f"{irrelevant_count} visualization helper functions are not core to the analysis (plot_binary_correlation, analyze_scores, create_confusion_matrix_plot, create_feature_importance_plot)"

print(f"\nC1_All_Runnable: {c1_rationale}")
print(f"\nC2_All_Correct: {c2_rationale}")
print(f"\nC3_No_Redundant: {c3_rationale}")
print(f"\nC4_No_Irrelevant: {c4_rationale}")

BINARY CHECKLIST SUMMARY

| Checklist Item                         | Condition                                | PASS/FAIL |
|----------------------------------------|------------------------------------------|-----------|
| C1: All core analysis code is runnable | No block has Runnable = N                | FAIL      |
| C2: All implementations are correct    | No block has Correct-Implementation = N  | FAIL      |
| C3: No redundant code                  | No block has Redundant = Y               | FAIL      |
| C4: No irrelevant code                 | No block has Irrelevant = Y              | FAIL      |

RATIONALES

C1_All_Runnable: 1 block failed to run: generate_labels.py main() - uses undefined args.skip_lettuce and args.skip_llm_judge

C2_All_Correct: 1 block has incorrect implementation: generate_labels.py main() - uses undefined arguments

C3_No_Redundant: 21 duplicate utility functions found across scripts (load_and_balance_data, save_dataset, generate_judge_prompt duplicated i

## Issues Summary

### Blocks with Runnable Issues (1 block)
- **generate_labels.py: main** - Uses undefined arguments `args.skip_lettuce` and `args.skip_llm_judge` which are not defined in argparse

### Blocks with Implementation Issues (1 block)
- **generate_labels.py: main** - Same issue as above: `main` references arguments that do not exist in the argument parser
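
One plausible fix is to register the two flags with the parser. This is a sketch under assumptions: the actual parser in `generate_labels.py` is not reproduced in this notebook, so the flag spellings `--skip_lettuce` and `--skip_llm_judge` are inferred only from the attribute names `args.skip_lettuce` and `args.skip_llm_judge`, and `build_parser`, the description, and the help strings are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the two attribute names come from the finding."""
    parser = argparse.ArgumentParser(description="Generate hallucination labels")
    # store_true gives each flag a default of False, so args.skip_lettuce and
    # args.skip_llm_judge always exist even when the flag is omitted.
    parser.add_argument("--skip_lettuce", action="store_true",
                        help="skip the lettuce-detector labeling step")
    parser.add_argument("--skip_llm_judge", action="store_true",
                        help="skip the LLM-as-a-judge labeling step")
    return parser


args = build_parser().parse_args(["--skip_lettuce"])
print(args.skip_lettuce, args.skip_llm_judge)  # True False
```

With both flags declared, `main` can read either attribute without raising `AttributeError`, whether or not the flag is passed on the command line.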

### Redundant Blocks (21 blocks)
Duplicate utility functions across scripts:
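- `load_and_balance_data` - defined in all 6 baseline scripts; the 5 copies outside `run_gpt.py` are flagged
- `save_dataset` - defined in 5 data-preparation scripts; the 4 copies outside `preprocess.py` are flagged
- `generate_judge_prompt` - defined in 3 baseline scripts (`run_gpt.py`, `run_groq.py`, `run_hf.py`); 2 copies flagged
- `evaluate` - defined in 4 baseline scripts; the 3 copies outside `run_gpt.py` are flagged
- `load_datasets` - copies in `generate_response_gpt.py`, `generate_labels.py`, and `filter.py` flagged
- `filter_by_token_count`, `limit_samples`, `add_special_template` - copies in `generate_response_gpt.py` flagged (the last is also noted as unused there)
- `evaluate_thresholds` - the copy in `run_trulens.py` is similar to the one in `run_ragas.py`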
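
Whether same-named functions are true copies can be checked mechanically. The sketch below is an illustration only, not the procedure used in this evaluation: it parses source text with the standard `ast` module and groups top-level functions whose syntax trees are identical. The helper name `find_duplicate_functions`, the file names, and the toy sources are all hypothetical and do not reproduce the repository's code.

```python
import ast
from collections import defaultdict


def find_duplicate_functions(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each function name to the files that define an identical copy of it.

    `sources` maps a file name to its source text. Two definitions count as
    duplicates when their AST dumps match, which ignores comments and spacing.
    """
    seen = defaultdict(list)
    for filename, text in sources.items():
        for node in ast.parse(text).body:
            if isinstance(node, ast.FunctionDef):
                seen[(node.name, ast.dump(node))].append(filename)
    return {name: files for (name, _), files in seen.items() if len(files) > 1}


# Toy inputs: the first two definitions differ only by a comment.
toy_sources = {
    "script_a.py": "def evaluate(y, p):\n    return sum(a == b for a, b in zip(y, p))\n",
    "script_b.py": "def evaluate(y, p):  # same logic\n    return sum(a == b for a, b in zip(y, p))\n",
    "script_c.py": "def evaluate_thresholds(s):\n    return max(s)\n",
}
duplicates = find_duplicate_functions(toy_sources)
print(duplicates)  # {'evaluate': ['script_a.py', 'script_b.py']}
```

Comparing syntax trees rather than raw text means reformatted copies still match, while functions that only share a name but differ in logic do not.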

### Irrelevant Blocks (4 blocks)
Visualization helper functions not core to analysis:
- `compute_scores.py: plot_binary_correlation`
- `compute_scores.py: analyze_scores`
- `predict.py: create_confusion_matrix_plot`
- `classifier.py: create_feature_importance_plot`

In [5]:
# Save JSON summary
rationales = {
    "C1_All_Runnable": c1_rationale,
    "C2_All_Correct": c2_rationale,
    "C3_No_Redundant": c3_rationale,
    "C4_No_Irrelevant": c4_rationale
}

json_summary = {
    "Runnable_Percentage": round(runnable_pct, 2),
    "Incorrect_Percentage": round(incorrect_pct, 2),
    "Redundant_Percentage": round(redundant_pct, 2),
    "Irrelevant_Percentage": round(irrelevant_pct, 2),
    "Correction_Rate_Percentage": round(correction_rate_pct, 2),
    
    "Issues": {
        "Runnable_Issues_Exist": not c1_pass,
        "Output_Mismatch_Exists": not c2_pass,
        "Incorrect_Exists": not c2_pass,
        "Redundant_Exists": not c3_pass,
        "Irrelevant_Exists": not c4_pass
    },
    
    "Checklist": {
        "C1_All_Runnable": c1_status,
        "C2_All_Correct": c2_status,
        "C3_No_Redundant": c3_status,
        "C4_No_Irrelevant": c4_status
    },
    
    "Rationale": rationales
}

# Create evaluation directory and save
eval_dir = "/net/scratch2/smallyan/InterpDetect_eval/evaluation"
os.makedirs(eval_dir, exist_ok=True)

json_path = os.path.join(eval_dir, "code_critic_summary.json")
with open(json_path, 'w') as f:
    json.dump(json_summary, f, indent=2)

print(f"JSON summary saved to: {json_path}")
print("\n" + "=" * 60)
print("JSON SUMMARY CONTENT")
print("=" * 60)
print(json.dumps(json_summary, indent=2))

JSON summary saved to: /net/scratch2/smallyan/InterpDetect_eval/evaluation/code_critic_summary.json

JSON SUMMARY CONTENT
{
  "Runnable_Percentage": 98.99,
  "Incorrect_Percentage": 1.01,
  "Redundant_Percentage": 21.21,
  "Irrelevant_Percentage": 4.04,
  "Correction_Rate_Percentage": 0.0,
  "Issues": {
    "Runnable_Issues_Exist": true,
    "Output_Mismatch_Exists": true,
    "Incorrect_Exists": true,
    "Redundant_Exists": true,
    "Irrelevant_Exists": true
  },
  "Checklist": {
    "C1_All_Runnable": "FAIL",
    "C2_All_Correct": "FAIL",
    "C3_No_Redundant": "FAIL",
    "C4_No_Irrelevant": "FAIL"
  },
  "Rationale": {
    "C1_All_Runnable": "1 block failed to run: generate_labels.py main() - uses undefined args.skip_lettuce and args.skip_llm_judge",
    "C2_All_Correct": "1 block has incorrect implementation: generate_labels.py main() - uses undefined arguments",
    "C3_No_Redundant": "21 duplicate utility functions found across scripts (load_and_balance_data, save_dataset, gen

In [6]:
# Copy the notebook to the required location
import shutil
import time

source_notebook = "/home/smallyan/eval_agent/notebooks/2025-12-24-20-53_CodeCriticEvaluation.ipynb"
target_notebook = "/net/scratch2/smallyan/InterpDetect_eval/evaluation/code_critic_evaluation.ipynb"

# Wait a moment for the notebook to be saved
time.sleep(2)

# Copy the notebook
shutil.copy(source_notebook, target_notebook)
print(f"Notebook copied to: {target_notebook}")

# Verify both files exist
print("\nVerifying output files:")
print(f"  Notebook exists: {os.path.exists(target_notebook)}")
print(f"  JSON exists: {os.path.exists(json_path)}")

Notebook copied to: /net/scratch2/smallyan/InterpDetect_eval/evaluation/code_critic_evaluation.ipynb

Verifying output files:
  Notebook exists: True
  JSON exists: True


## Final Summary

### Evaluation Complete

**Repository:** `/net/scratch2/smallyan/InterpDetect_eval`

**Total Functions/Blocks Evaluated:** 99

### Quantitative Metrics
| Metric | Value |
|--------|-------|
| Runnable% | 98.99% |
| Output-Matches-Expectation% | 98.99% |
| Incorrect% | 1.01% |
| Redundant% | 21.21% |
| Irrelevant% | 4.04% |
| Correction-Rate% | 0.00% |
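
Each percentage is a count divided by the 99 evaluated blocks. As a quick cross-check, the counts 98, 1, 21, and 4 below are taken from the tables in this notebook; the `pct` helper is introduced only for illustration.

```python
def pct(count: int, total: int = 99) -> float:
    """Percentage of `total`, rounded to two decimals as in the table."""
    return round(count / total * 100, 2)


print(pct(98))  # runnable blocks   -> 98.99
print(pct(1))   # incorrect blocks  -> 1.01
print(pct(21))  # redundant blocks  -> 21.21
print(pct(4))   # irrelevant blocks -> 4.04
```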

### Binary Checklist
| Item | Status |
|------|--------|
| C1: All Runnable | FAIL |
| C2: All Correct | FAIL |
| C3: No Redundant | FAIL |
| C4: No Irrelevant | FAIL |

### Key Findings
1. **Core analysis code is correct** - The ECS/PKS computation in `compute_scores.py`, classifier training in `classifier.py`, and prediction in `predict.py` were all rated runnable and correctly implemented.

2. **One implementation bug found** - `main()` in `generate_labels.py` references `args.skip_lettuce` and `args.skip_llm_judge`, which the argument parser does not define.

3. **Significant code duplication** - 21 blocks (21.21%) are copies of utility functions defined in other scripts. Moving them into a shared utility module would remove the duplication.

4. **Visualization code is not core** - The 4 plotting and score-analysis helpers could be moved to a separate visualization module.
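
As a rough illustration of the shared-module refactoring, the sketch below defines one copy of two helpers that every script could import. It is hypothetical throughout: the signatures of the repository's `save_dataset` and `load_datasets` are not shown in this notebook, and the JSON Lines file format is an assumption chosen only to make the example runnable.

```python
import json
import tempfile
from pathlib import Path


def save_dataset(records: list[dict], path: str) -> None:
    """Write one JSON object per line (file format is an assumption)."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def load_datasets(path: str) -> list[dict]:
    """Read back a file written by save_dataset."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Each script would import these helpers instead of redefining them.
with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "out" / "labels.jsonl")
    save_dataset([{"id": 1, "label": 0}, {"id": 2, "label": 1}], target)
    round_trip = load_datasets(target)

print(round_trip)  # [{'id': 1, 'label': 0}, {'id': 2, 'label': 1}]
```

Keeping a single definition means a fix to the save or load logic lands in every script at once, which is the main benefit over the current per-script copies.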

### Output Files
- **Notebook:** `/net/scratch2/smallyan/InterpDetect_eval/evaluation/code_critic_evaluation.ipynb`
- **JSON Summary:** `/net/scratch2/smallyan/InterpDetect_eval/evaluation/code_critic_summary.json`

In [7]:
# Final copy of notebook with all content
import shutil
import time

time.sleep(2)  # Wait for notebook to save

source_notebook = "/home/smallyan/eval_agent/notebooks/2025-12-24-20-53_CodeCriticEvaluation.ipynb"
target_notebook = "/net/scratch2/smallyan/InterpDetect_eval/evaluation/code_critic_evaluation.ipynb"

shutil.copy(source_notebook, target_notebook)
print("=" * 60)
print("EVALUATION COMPLETE")
print("=" * 60)
print("\nOutput files created:")
print(f"  1. Notebook: {target_notebook}")
print(f"  2. JSON: {json_path}")
print("\nAll evaluation criteria have been assessed and documented.")

EVALUATION COMPLETE

Output files created:
  1. Notebook: /net/scratch2/smallyan/InterpDetect_eval/evaluation/code_critic_evaluation.ipynb
  2. JSON: /net/scratch2/smallyan/InterpDetect_eval/evaluation/code_critic_summary.json

All evaluation criteria have been assessed and documented.
