# GriceBench Phase 2: Complete Critical Validation

## What This Notebook Does

This notebook executes **Phase 2: Critical Validation** from the morechanges.md plan:

1. **Create Relation Evaluation Set** - Sample 200 Relation violation examples
2. **MRR Evaluation** - Measure retrieval quality (Mean Reciprocal Rank)
3. **Relevance Scoring** - Semantic similarity metrics
4. **Create Annotation Sample** - Stratified 1000 examples for human annotation
5. **Generate All Outputs** - Ready for Phase 3

---

## ⚠️ REQUIRED DATASET

You need to add **ONE dataset** to this notebook:

### Dataset: `gricebench-scientific-fix`

**Files required in this dataset:**

| File | Local Path on Your Computer |
|------|-----------------------------|
| `repair_test.json` | `c:\Users\pushk\OneDrive\Documents\Research Model\GriceBench\data_processed\repair_data\repair_test.json` |
| `gold_annotation_set.json` | `c:\Users\pushk\OneDrive\Documents\Research Model\GriceBench\data_processed\gold_annotation_set.json` |
| `val_examples.json` | `c:\Users\pushk\OneDrive\Documents\Research Model\GriceBench\data_processed\val_examples.json` |
| `topical_corpus.json` | `c:\Users\pushk\OneDrive\Documents\Research Model\GriceBench\data_processed\topical_corpus.json` |

**How to add dataset:**
1. Right panel → Click "Add Data" button
2. Search for your dataset: `gricebench-scientific-fix`
3. Click "Add" to add it

**Note:** The sentence-transformers model will be downloaded automatically - no need to add it as a dataset!

---

## ⚙️ Settings

**Recommended:**
- GPU: Enable (Settings → Accelerator → GPU T4 x2)
- Internet: ON (needed to download model)

In [None]:
# ============================================================================
# CELL 1: INSTALL DEPENDENCIES
# ============================================================================
# This installs the sentence-transformers library for semantic similarity

!pip install -q sentence-transformers

print("✅ Dependencies installed!")

In [None]:
# ============================================================================
# CELL 2: IMPORTS AND CONFIGURATION
# ============================================================================

import os
import json
import numpy as np
import random
import re
from pathlib import Path
from typing import Dict, List
from collections import defaultdict
from datetime import datetime

# Paths
DATA_INPUT = Path("/kaggle/input/gricebench-scientific-fix")
OUTPUT_DIR = Path("/kaggle/working")

# Create output directory
OUTPUT_DIR.mkdir(exist_ok=True)

print("Configuration:")
print(f"  Input: {DATA_INPUT}")
print(f"  Output: {OUTPUT_DIR}")

In [None]:
# ============================================================================
# CELL 3: VERIFY DATASET
# ============================================================================

print("=" * 70)
print("VERIFYING DATASET")
print("=" * 70)

required_files = [
    "repair_test.json",           # OR repair_data/repair_test.json
    "gold_annotation_set.json",
    "val_examples.json",
    "topical_corpus.json"
]

# Check if dataset is mounted
if not DATA_INPUT.exists():
    print("\n❌ ERROR: Dataset not found!")
    print("\nPlease add the 'gricebench-scientific-fix' dataset:")
    print("1. Click 'Add Data' in the right panel")
    print("2. Search for 'gricebench-scientific-fix'")
    print("3. Click 'Add'")
    print("4. Re-run this cell")
else:
    print(f"\n✅ Dataset found at: {DATA_INPUT}")
    print("\nContents:")
    for item in DATA_INPUT.iterdir():
        if item.is_file():
            size_mb = item.stat().st_size / (1024*1024)
            print(f"  📄 {item.name} ({size_mb:.2f} MB)")
        else:
            print(f"  📁 {item.name}/")
            for subitem in item.iterdir():
                size_mb = subitem.stat().st_size / (1024*1024)
                print(f"      📄 {subitem.name} ({size_mb:.2f} MB)")

# Find repair_test.json (could be in root or repair_data/)
repair_test_path = None
if (DATA_INPUT / "repair_test.json").exists():
    repair_test_path = DATA_INPUT / "repair_test.json"
elif (DATA_INPUT / "repair_data" / "repair_test.json").exists():
    repair_test_path = DATA_INPUT / "repair_data" / "repair_test.json"

if repair_test_path:
    print(f"\n✅ repair_test.json found at: {repair_test_path}")
else:
    print("\n❌ repair_test.json NOT FOUND - check your dataset")

In [None]:
# ============================================================================
# CELL 4: DOWNLOAD AND LOAD MODEL
# ============================================================================
# The model is downloaded automatically from HuggingFace - no dataset needed!

from sentence_transformers import SentenceTransformer

print("=" * 70)
print("LOADING SENTENCE ENCODER")
print("=" * 70)
print("\nDownloading model from HuggingFace (first run only)...")
print("Model: sentence-transformers/all-MiniLM-L6-v2")
print("This is a lightweight but effective model for semantic similarity.\n")

encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

print("\n✅ Model loaded successfully!")
print(f"   Embedding dimension: {encoder.get_sentence_embedding_dimension()}")

---
# Part 1: Create Relation Evaluation Set

Sample 200 examples with Relation violations for MRR evaluation.
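
As a rough illustration of the parsing step, the sketch below splits a tagged input string into its context and response. The sample string is hypothetical and only assumes the `[VIOLATION=...] [CONTEXT] ... [RESPONSE] ...` layout implied by the filtering code; the actual records in `repair_test.json` may be laid out differently. The helper name `split_tagged_input` is also an assumption made for this example.

```python
import re

# Hypothetical sample; the real tag layout in repair_test.json may differ.
sample = "[VIOLATION=RELATION] [CONTEXT] Where is the station? [RESPONSE] I like jazz."

def split_tagged_input(text):
    """Return (context, response) from a tagged input string, or None for a missing part."""
    # Context runs from the [CONTEXT] tag up to the next opening bracket.
    context = re.search(r"\[CONTEXT\](.*?)\[", text, re.DOTALL)
    # Response runs from the [RESPONSE] tag to the end of the string.
    response = re.search(r"\[RESPONSE\](.*?)$", text, re.DOTALL)
    return (
        context.group(1).strip() if context else None,
        response.group(1).strip() if response else None,
    )

context, response = split_tagged_input(sample)
print(context)
print(response)
```

One limitation of this pattern: a context that itself contains a `[` character is cut off at that bracket, so it may be worth spot-checking a few extracted contexts before running the evaluation.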

In [None]:
# ============================================================================
# CELL 5: CREATE RELATION EVALUATION SET
# ============================================================================

print("=" * 70)
print("CREATING RELATION EVALUATION SET (200 examples)")
print("=" * 70)

random.seed(42)

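# Load repair test data (repair_test_path is resolved in Cell 3)
if repair_test_path is None:
    raise FileNotFoundError("repair_test.json not found; re-run Cell 3 and check the dataset")
print(f"\nLoading from: {repair_test_path}")
with open(repair_test_path, 'r', encoding='utf-8') as f:
    test_data = json.load(f)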

print(f"Total examples in file: {len(test_data)}")

# Filter for Relation violations
relation_examples = []
for i, item in enumerate(test_data):
    input_text = item.get("input_text", "")
    if "[VIOLATION=RELATION]" in input_text:
        example = {
            "id": f"relation_eval_{i}",
            "input_text": input_text,
            "target_text": item.get("target_text", ""),
            "source_index": i
        }
        
        # Extract context and response
        context_match = re.search(r'\[CONTEXT\](.*?)\[', input_text, re.DOTALL)
        response_match = re.search(r'\[RESPONSE\](.*?)$', input_text, re.DOTALL)
        
        if context_match:
            example["context"] = context_match.group(1).strip()
        if response_match:
            example["response"] = response_match.group(1).strip()
        
        relation_examples.append(example)

print(f"Relation violations found: {len(relation_examples)}")

# Sample 200
num_samples = min(200, len(relation_examples))
relation_eval_set = random.sample(relation_examples, num_samples)

# Save
eval_set_path = OUTPUT_DIR / "relation_eval_set.json"
with open(eval_set_path, 'w', encoding='utf-8') as f:
    json.dump(relation_eval_set, f, indent=2, ensure_ascii=False)

print(f"\n✅ Saved {len(relation_eval_set)} examples to: {eval_set_path}")

---
# Part 2: Load Corpus for Retrieval

Load and encode the topical corpus for MRR evaluation.
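
Because the corpus is encoded with `normalize_embeddings=True`, a plain dot product between two embeddings equals their cosine similarity. The sketch below shows this with small hand-made vectors instead of real encoder output; the helper `l2_normalize` and the toy data are assumptions made for illustration only.

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit length (norms are clipped to avoid division by zero)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

# Toy "corpus" of three 2-d vectors and a single query vector.
corpus = l2_normalize(np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]))
query = l2_normalize(np.array([[2.0, 0.0]]))[0]

# With unit-length rows, the dot product is the cosine similarity.
similarities = corpus @ query

# Indices of the two most similar corpus rows, best first.
top_k = np.argsort(similarities)[::-1][:2]
print(similarities)
print(top_k)
```

The same `argsort` pattern, with 10 in place of 2, is what the MRR cell uses to pick its candidates.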

In [None]:
# ============================================================================
# CELL 6: LOAD AND ENCODE CORPUS
# ============================================================================

print("=" * 70)
print("LOADING AND ENCODING CORPUS")
print("=" * 70)

# Load corpus
corpus_path = DATA_INPUT / "topical_corpus.json"
print(f"\nLoading corpus from: {corpus_path}")

with open(corpus_path, 'r', encoding='utf-8') as f:
    corpus = json.load(f)

print(f"Total corpus size: {len(corpus)}")

# Extract text responses
if isinstance(corpus[0], dict):
    corpus_responses = [item.get('response', str(item)) for item in corpus]
else:
    corpus_responses = corpus

# Subsample for efficiency (10k is good balance of speed vs coverage)
MAX_CORPUS = 10000
if len(corpus_responses) > MAX_CORPUS:
    random.seed(42)
    corpus_sample = random.sample(corpus_responses, MAX_CORPUS)
    print(f"Subsampled to: {len(corpus_sample)} responses")
else:
    corpus_sample = corpus_responses

# Encode corpus
print("\nEncoding corpus responses (this may take a few minutes)...")
corpus_embeddings = encoder.encode(
    corpus_sample,
    convert_to_numpy=True,
    normalize_embeddings=True,
    show_progress_bar=True,
    batch_size=64
)

print(f"\n✅ Corpus encoded!")
print(f"   Shape: {corpus_embeddings.shape}")

---
# Part 3: MRR Evaluation

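Mean Reciprocal Rank (MRR) measures how early the retriever surfaces a relevant response:
- For each context, retrieve the top 10 corpus responses by cosine similarity
- Find the rank of the first candidate whose cosine similarity to the reference response exceeds 0.7
- MRR = mean(1/rank), with a score of 0 when no candidate in the top 10 qualifies

**Target:** MRR ≥ 0.5 (per morechanges.md)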
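
To make the metric concrete, here is a minimal sketch of the MRR calculation on hand-written relevance flags. Each inner list stands for one query's ranked candidates, with `True` marking a candidate that counts as relevant. The function names and the toy data are assumptions for this example and are not taken from the notebook.

```python
def reciprocal_rank(relevance_flags):
    """Return 1/rank of the first relevant candidate, or 0.0 if none is relevant."""
    for position, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(all_flags):
    """Average the reciprocal rank over all queries (0.0 for an empty list)."""
    if not all_flags:
        return 0.0
    return sum(reciprocal_rank(flags) for flags in all_flags) / len(all_flags)

queries = [
    [True, False, False],   # first hit at rank 1, contributes 1.0
    [False, False, True],   # first hit at rank 3, contributes 1/3
    [False, False, False],  # no hit, contributes 0.0
    [False, True, False],   # first hit at rank 2, contributes 0.5
]

mrr = mean_reciprocal_rank(queries)
print(f"MRR: {mrr:.4f}")
```

Queries with no relevant candidate still count in the denominator, which is why a retriever that often misses entirely scores well below one that merely ranks hits a little low.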

In [None]:
# ============================================================================
# CELL 7: MRR EVALUATION
# ============================================================================

print("=" * 70)
print("MRR EVALUATION")
print("=" * 70)

mrr_scores = []
top1_hits = 0
top3_hits = 0
top10_hits = 0

print(f"\nEvaluating {len(relation_eval_set)} examples...\n")

for i, item in enumerate(relation_eval_set):
    if (i + 1) % 50 == 0:
        print(f"  Processed {i + 1}/{len(relation_eval_set)}")
    
    # Get context
    context = item.get('context', '')
    if not context:
        mrr_scores.append(0.0)
        continue
    
    # Get true response (the on-topic reference); fall back to 'response' when target_text is empty
    true_response = item.get('target_text') or item.get('response', '')
    
    # Encode context for retrieval
    context_embedding = encoder.encode(
        [context],
        convert_to_numpy=True,
        normalize_embeddings=True
    )[0]
    
    # Find top-10 from corpus based on context similarity
    similarities = np.dot(corpus_embeddings, context_embedding)
    top_indices = np.argsort(similarities)[-10:][::-1]
    
    # Encode true response for comparison
    true_embedding = encoder.encode(
        [true_response],
        convert_to_numpy=True,
        normalize_embeddings=True
    )[0]
    
    # Find rank of semantically similar response
    rank = None
    for j, idx in enumerate(top_indices):
        candidate_embedding = corpus_embeddings[idx]
        sim_to_true = np.dot(candidate_embedding, true_embedding)
        if sim_to_true > 0.7:  # Threshold for "relevant"
            rank = j + 1
            break
    
    if rank:
        mrr_scores.append(1.0 / rank)
        if rank == 1:
            top1_hits += 1
        if rank <= 3:
            top3_hits += 1
        if rank <= 10:
            top10_hits += 1
    else:
        mrr_scores.append(0.0)

# Calculate final metrics (guard against an empty evaluation set, where np.mean returns NaN)
n = len(relation_eval_set)
mrr = np.mean(mrr_scores) if mrr_scores else 0.0

mrr_results = {
    'mrr': float(mrr),
    'top1_accuracy': top1_hits / n if n > 0 else 0,
    'top3_accuracy': top3_hits / n if n > 0 else 0,
    'top10_accuracy': top10_hits / n if n > 0 else 0,
    'n_examples': n,
    'timestamp': datetime.now().isoformat()
}

print("\n" + "=" * 50)
print("MRR RESULTS")
print("=" * 50)
print(f"\nMRR:          {mrr_results['mrr']:.4f}")
print(f"Top-1:        {mrr_results['top1_accuracy']:.4f} ({top1_hits}/{n})")
print(f"Top-3:        {mrr_results['top3_accuracy']:.4f} ({top3_hits}/{n})")
print(f"Top-10:       {mrr_results['top10_accuracy']:.4f} ({top10_hits}/{n})")

In [None]:
# ============================================================================
# CELL 8: VERDICT AND DECISION
# ============================================================================

print("\n" + "=" * 70)
print("VERDICT (per morechanges.md)")
print("=" * 70)

if mrr_results['mrr'] >= 0.7:
    verdict = "EXCELLENT"
    emoji = "✅"
    action = "Retrieval system is working well."
    next_step = "Proceed to Phase 3 (Annotation)"
elif mrr_results['mrr'] >= 0.5:
    verdict = "ACCEPTABLE"
    emoji = "⚠️"
    action = "Retrieval acceptable but could be improved."
    next_step = "Proceed to Phase 3, consider upgrading model later"
else:
    verdict = "NEEDS IMPROVEMENT"
    emoji = "❌"
    action = "Retrieval below threshold. Fix before proceeding."
    next_step = "Run improvement steps (use all-mpnet-base-v2 or expand corpus)"

print(f"\n{emoji} {verdict}")
print(f"\nAction: {action}")
print(f"Next Step: {next_step}")

# Decision point
print("\n" + "-" * 50)
print("DECISION POINT:")
if mrr_results['mrr'] >= 0.5:
    print("✅ MRR >= 0.5: Continue with this notebook to create annotation sample")
else:
    print("❌ MRR < 0.5: Stop here and fix retrieval first")
    print("   Options:")
    print("   1. Use better encoder: 'all-mpnet-base-v2'")
    print("   2. Expand corpus with more responses")

In [None]:
# ============================================================================
# CELL 9: SAVE MRR RESULTS
# ============================================================================

mrr_output_path = OUTPUT_DIR / "relation_repair_mrr.json"
with open(mrr_output_path, 'w') as f:
    json.dump(mrr_results, f, indent=2)

print(f"\n✅ MRR results saved to: {mrr_output_path}")

---
# Part 4: Create Annotation Sample (1000 examples)

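Creates a stratified sample for human annotation per morechanges.md:
- 200 per maxim (detector positives)
- 200 clean (detector negatives)
- 100 random

With four maxims, these quotas sum to at most 1,100 examples (4 × 200 + 200 + 100), so the output can exceed the nominal 1,000 when every pool is large enough, and it can fall short when a pool holds fewer examples than its quota.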
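
The quota-based sampling can be sketched as follows. This is an illustrative version, not the notebook's own implementation: the function `stratified_sample`, the toy pools, and the small quotas are all assumptions. It shows the two behaviours that matter here: each category is capped at its quota, and an example that appears in several pools is only sampled once.

```python
import random

def stratified_sample(pools, quotas, seed=42):
    """Draw up to quotas[category] unseen items from each pool, skipping duplicates by id."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    seen_ids = set()
    sample = []
    for category, quota in quotas.items():
        candidates = list(pools.get(category, []))
        rng.shuffle(candidates)
        taken = 0
        for item in candidates:
            if taken >= quota:
                break
            if item["id"] in seen_ids:
                continue  # already sampled under an earlier category
            seen_ids.add(item["id"])
            # Copy the item so the source pools are not mutated.
            sample.append({**item, "annotation_category": category})
            taken += 1
    return sample

# Toy pools that overlap on ex3 and ex4.
pools = {
    "relation_positive": [{"id": f"ex{i}"} for i in range(5)],
    "clean": [{"id": f"ex{i}"} for i in range(3, 10)],
}
quotas = {"relation_positive": 3, "clean": 4}

sample = stratified_sample(pools, quotas)
print(len(sample))
```

Because categories are filled in order, an example that violates several maxims is credited to whichever category draws it first, which slightly shrinks the pools available to later categories.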

In [None]:
# ============================================================================
# CELL 10: CREATE ANNOTATION SAMPLE
# ============================================================================

print("=" * 70)
print("CREATING ANNOTATION SAMPLE (1000 examples)")
print("=" * 70)

random.seed(42)

# Load data sources
all_examples = []

# Load validation data
val_path = DATA_INPUT / "val_examples.json"
if val_path.exists():
    with open(val_path, 'r', encoding='utf-8') as f:
        val_data = json.load(f)
    print(f"Validation data: {len(val_data)} examples")
    for i, item in enumerate(val_data):
        item['source_file'] = 'validation'
        item['source_idx'] = i
    all_examples.extend(val_data)

# Load gold annotation data
gold_path = DATA_INPUT / "gold_annotation_set.json"
if gold_path.exists():
    with open(gold_path, 'r', encoding='utf-8') as f:
        gold_data = json.load(f)
    print(f"Gold data: {len(gold_data)} examples")
    for i, item in enumerate(gold_data):
        item['source_file'] = 'gold'
        item['source_idx'] = i
    all_examples.extend(gold_data)

print(f"\nTotal pool: {len(all_examples)} examples")

# Categorize by maxim
maxims = ['quantity', 'quality', 'relation', 'manner']
detector_positives = defaultdict(list)
detector_negatives = []

for item in all_examples:
    labels = item.get('labels', item.get('detector_predictions', {}))
    has_violation = any(labels.get(m, 0) for m in maxims)
    
    if not has_violation:
        detector_negatives.append(item)
    else:
        for maxim in maxims:
            if labels.get(maxim, 0):
                detector_positives[maxim].append(item)

print(f"\nCategorization:")
print(f"  Clean (no violations): {len(detector_negatives)}")
for maxim in maxims:
    print(f"  {maxim} positives: {len(detector_positives[maxim])}")

In [None]:
# ============================================================================
# CELL 11: SAMPLE AND SAVE
# ============================================================================

# Sampling state shared across calls to add_samples
final_sample = []
seen_ids = set()

def add_samples(pool, count, category):
    """Add up to `count` not-yet-sampled items from `pool`, tagged with `category`."""
    shuffled = pool.copy()
    random.shuffle(shuffled)
    added = 0
    for item in shuffled:
        item_id = item.get('id', f"{item.get('source_file', 'unk')}_{item.get('source_idx', 0)}")
        if item_id not in seen_ids:
            item['annotation_category'] = category
            item['sample_id'] = f"sample_{len(final_sample)}"
            final_sample.append(item)
            seen_ids.add(item_id)
            added += 1
            if added >= count:
                break
    return added

print("Sampling...")

# 200 per maxim
for maxim in maxims:
    added = add_samples(detector_positives[maxim], 200, f"{maxim}_positive")
    print(f"  {maxim}_positive: {added}")

# 200 clean
added = add_samples(detector_negatives, 200, "clean")
print(f"  clean: {added}")

# 100 random from remaining
remaining = [item for item in all_examples 
             if item.get('id', f"{item.get('source_file', 'unk')}_{item.get('source_idx', 0)}") not in seen_ids]
added = add_samples(remaining, 100, "random")
print(f"  random: {added}")

print(f"\nTotal sampled: {len(final_sample)}")

# Shuffle and assign final IDs
random.shuffle(final_sample)
for i, item in enumerate(final_sample):
    item['id'] = f"annotation_{i:04d}"

# Save
annotation_sample_path = OUTPUT_DIR / "annotation_sample_1000.json"
with open(annotation_sample_path, 'w', encoding='utf-8') as f:
    json.dump(final_sample, f, indent=2, ensure_ascii=False)

print(f"\n✅ Saved {len(final_sample)} examples to: {annotation_sample_path}")

---
# Part 5: Summary and Downloads

In [None]:
# ============================================================================
# CELL 12: FINAL SUMMARY
# ============================================================================

print("=" * 70)
print("PHASE 2 COMPLETE - SUMMARY")
print("=" * 70)

print("\n📊 MRR RESULTS:")
print(f"   MRR Score: {mrr_results['mrr']:.4f}")
print(f"   Top-1 Accuracy: {mrr_results['top1_accuracy']:.2%}")
print(f"   Verdict: {verdict}")

print("\n📁 OUTPUT FILES (download these):")
outputs = [
    OUTPUT_DIR / "relation_eval_set.json",
    OUTPUT_DIR / "relation_repair_mrr.json",
    OUTPUT_DIR / "annotation_sample_1000.json"
]

for output_file in outputs:
    if output_file.exists():
        size_kb = output_file.stat().st_size / 1024
        print(f"   ✅ {output_file.name} ({size_kb:.1f} KB)")
    else:
        print(f"   ❌ {output_file.name} - NOT CREATED")

print("\n📋 NEXT STEPS:")
print("   1. Download the 3 output files above")
print("   2. Add them back to your gricebench-scientific-fix dataset")
print("   3. For Phase 3: Use annotation_sample_1000.json for human annotation")
print("   4. After annotating, run Phase 4 notebook for agreement analysis")

print("\n" + "=" * 70)
print("Done! 🎉")
print("=" * 70)