# Manual Evaluation of Synthetic Data Generation

This notebook sets up manual evaluation of generated synthetic news articles and tweets: it draws a sample, creates annotation templates, and measures inter-annotator agreement.

## Evaluation Process:
1. **Sample Selection**: 100-300 articles/tweets for manual evaluation
2. **Multi-Annotator Setup**: At least 3 annotators per item
3. **Evaluation Criteria**: 
   - **Inappropriate**: Fact extraction/modification is clearly wrong
   - **Appropriate**: Fact extraction/modification is correct and plausible
   - **In-between**: Partially correct or ambiguous
4. **Agreement Analysis**: Calculate pairwise inter-annotator agreement (Cohen's Kappa)
5. **Decision Making**: Determine if quality is sufficient to proceed
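
The three rating labels are stored as free-text strings in the annotation files, so small spelling differences between annotators (for example `In Between` versus `in-between`) would be counted as disagreement. A minimal sketch of label normalization, assuming the label set listed above; the helper name `normalize_rating` and the accepted spelling variants are illustrative assumptions, not part of the pipeline:

```python
from typing import Optional

# Canonical labels, matching the rating values used in the annotation templates.
VALID_RATINGS = ("appropriate", "inappropriate", "in-between")


def normalize_rating(raw: Optional[str]) -> Optional[str]:
    """Map a free-text rating to a canonical label, or None if unrecognized."""
    if raw is None:
        return None
    # Lowercase, trim, and treat spaces/underscores as hyphens.
    cleaned = raw.strip().lower().replace("_", "-").replace(" ", "-")
    if cleaned == "inbetween":  # hypothetical variant; extend as needed
        cleaned = "in-between"
    return cleaned if cleaned in VALID_RATINGS else None
```

Unrecognized values map to `None`, so they can be treated like missing ratings rather than as a fourth category.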

## Output:
- Annotated evaluation dataset
- Inter-annotator agreement scores
- Quality assessment recommendations

In [None]:
# Import required libraries
import sys
import os
sys.path.append('../..')

import pandas as pd
import numpy as np
import json
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Any
from collections import Counter
import random

# For inter-annotator agreement
from sklearn.metrics import cohen_kappa_score
import itertools

print("Libraries imported successfully!")
print(f"Current time: {datetime.now()}")

In [None]:
# Load generated synthetic data results
print("📁 LOADING GENERATED RESULTS")
print("="*40)

# Load your generated results - adjust file paths as needed
results_files = {
    'news': '../../results/news_batch_final_*.json',  # Replace with actual filename
    'tweets': '../../results/tweets_batch_final_*.json'  # Replace with actual filename
}

# Function to load results
def load_generated_results(file_pattern: str, content_type: str) -> List[Dict]:
    """Load the most recently modified results file matching file_pattern."""
    import glob
    files = glob.glob(file_pattern)
    
    if not files:
        print(f"⚠️ No {content_type} results found matching pattern: {file_pattern}")
        return []
    
    # Use the most recently modified file
    # (getctime is the metadata-change time on Unix, not the creation time)
    latest_file = max(files, key=os.path.getmtime)
    print(f"Loading {content_type} from: {latest_file}")
    
    with open(latest_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    print(f"✅ Loaded {len(data)} {content_type} results")
    return data

# Load results
news_results = load_generated_results(results_files['news'], 'news')
tweet_results = load_generated_results(results_files['tweets'], 'tweets')

print(f"\n📊 TOTAL RESULTS LOADED:")
print(f"News articles: {len(news_results)}")
print(f"Tweets: {len(tweet_results)}")
print(f"Total items: {len(news_results) + len(tweet_results)}")
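
The template-building cell later in this notebook reads `original_content`, `extracted_facts`, `modified_facts`, and `synthetic_content` from every item, so a record missing any of them would raise a `KeyError` there. A minimal sketch of an early check, assuming each loaded result is a flat dictionary; `find_incomplete_items` is a hypothetical helper, not part of the generation pipeline:

```python
from typing import Dict, List, Tuple

# Keys read from each item when the annotation templates are built.
REQUIRED_KEYS = ("original_content", "extracted_facts",
                 "modified_facts", "synthetic_content")


def find_incomplete_items(results: List[Dict]) -> List[Tuple[int, List[str]]]:
    """Return (index, missing_keys) for every item lacking a required key."""
    problems = []
    for index, item in enumerate(results):
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            problems.append((index, missing))
    return problems


# Usage with toy records (illustrative data only)
toy = [
    {"original_content": "a", "extracted_facts": [], "modified_facts": [],
     "synthetic_content": "b"},
    {"original_content": "c"},
]
print(find_incomplete_items(toy))
```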

In [None]:
# Sample selection for manual evaluation
def create_evaluation_sample(results: List[Dict], 
                           content_type: str, 
                           sample_size: int = 100,
                           random_seed: int = 42) -> List[Dict]:
    """
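    Create a sample for manual evaluation.

    Each sampled item is shallow-copied, so adding evaluation fields below
    does not modify the dictionaries in `results`.
    """
    rng = random.Random(random_seed)  # local RNG; global random state is untouched
    
    if len(results) <= sample_size:
        print(f"📊 Using all {len(results)} {content_type} (requested sample size: {sample_size})")
        sample = [dict(item) for item in results]
    else:
        print(f"📊 Sampling {sample_size} {content_type} from {len(results)} total")
        sample = [dict(item) for item in rng.sample(results, sample_size)]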
    
    # Add evaluation structure to each item
    for i, item in enumerate(sample):
        item['evaluation_id'] = f"{content_type}_{i+1:03d}"
        item['content_type'] = content_type
        
        # Initialize evaluation fields
        item['manual_evaluation'] = {
            'annotator_1': {'fact_extraction': None, 'fact_modification': None, 'notes': ''},
            'annotator_2': {'fact_extraction': None, 'fact_modification': None, 'notes': ''},
            'annotator_3': {'fact_extraction': None, 'fact_modification': None, 'notes': ''},
            'consensus': {'fact_extraction': None, 'fact_modification': None, 'notes': ''}
        }
    
    return sample

# Configuration for sampling
SAMPLE_SIZE_NEWS = 100  # Adjust as needed (100-300)
SAMPLE_SIZE_TWEETS = 100  # Adjust as needed (100-300)

# Create evaluation samples
news_sample = create_evaluation_sample(news_results, 'news', SAMPLE_SIZE_NEWS)
tweet_sample = create_evaluation_sample(tweet_results, 'tweets', SAMPLE_SIZE_TWEETS)

print(f"\n✅ EVALUATION SAMPLES CREATED:")
print(f"News sample: {len(news_sample)} items")
print(f"Tweet sample: {len(tweet_sample)} items")
print(f"Total for evaluation: {len(news_sample) + len(tweet_sample)} items")

In [None]:
# Create evaluation templates for annotators
def create_annotation_template(sample: List[Dict], content_type: str) -> tuple:
    """
    Create human-readable templates for manual annotation.

    Returns the paths of the evaluation file and the instructions file.
    """
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    
    # Create evaluation directory
    eval_dir = "../../evaluation"
    os.makedirs(eval_dir, exist_ok=True)
    
    # Create detailed evaluation file for annotators
    eval_file = f"{eval_dir}/{content_type}_manual_evaluation_{timestamp}.json"
    
    # Create simplified version for easier annotation
    simplified_items = []
    
    for item in sample:
        simplified_item = {
            'evaluation_id': item['evaluation_id'],
            'content_type': content_type,
            'original_content': item['original_content'][:500] + "..." if len(item['original_content']) > 500 else item['original_content'],
            'extracted_facts': item['extracted_facts'],
            'modified_facts': item['modified_facts'],
            'synthetic_content': item['synthetic_content'][:500] + "..." if len(item['synthetic_content']) > 500 else item['synthetic_content'],
            'evaluation_template': {
                'fact_extraction_quality': {
                    'rating': None,  # 'appropriate', 'inappropriate', 'in-between'
                    'explanation': '',
                    'specific_issues': []
                },
                'fact_modification_quality': {
                    'rating': None,  # 'appropriate', 'inappropriate', 'in-between'
                    'explanation': '',
                    'specific_issues': []
                },
                'overall_notes': ''
            }
        }
        simplified_items.append(simplified_item)
    
    # Save evaluation template
    with open(eval_file, 'w') as f:
        json.dump(simplified_items, f, indent=2)
    
    print(f"📝 Created evaluation template: {eval_file}")
    
    # Create instruction file
    instructions_file = f"{eval_dir}/{content_type}_annotation_instructions_{timestamp}.md"
    
    instructions = f"""# Manual Evaluation Instructions - {content_type.title()}

## Task Overview
You are evaluating the quality of synthetic {content_type} generation with focus on:
1. **Fact Extraction Quality**: How well were facts extracted from original content?
2. **Fact Modification Quality**: How appropriately were facts modified?

## Rating Scale
- **Appropriate**: Extraction/modification is correct, plausible, and maintains coherence
- **Inappropriate**: Extraction/modification is clearly wrong, implausible, or breaks coherence
- **In-between**: Partially correct, ambiguous, or minor issues

## Evaluation Criteria

### Fact Extraction Quality
- Are the extracted facts actually present in the original content?
- Are the fact types (name_of_fact) appropriate?
- Are the descriptions accurate?
- Is the specific_data correctly identified?

### Fact Modification Quality
- Are the modified facts plausible alternatives?
- Do modifications maintain the same fact type/category?
- Are the changes realistic and believable?
- Do modifications create coherent misinformation?

## Instructions
1. Review each item in: `{eval_file}`
2. For each item, fill in the `evaluation_template` section:
   - Set `rating` to 'appropriate', 'inappropriate', or 'in-between'
   - Provide detailed `explanation` for your rating
   - List any `specific_issues` you identify
   - Add `overall_notes` with additional observations

## Return Instructions
Save your completed evaluations as:
`{content_type}_annotator_[YOUR_NAME]_{timestamp}.json`

Total items to evaluate: {len(simplified_items)}
Estimated time: {len(simplified_items) * 2} minutes (2 min per item)
"""
    
    with open(instructions_file, 'w') as f:
        f.write(instructions)
    
    print(f"📋 Created instructions: {instructions_file}")
    
    return eval_file, instructions_file

# Create evaluation templates
if news_sample:
    news_eval_file, news_instructions = create_annotation_template(news_sample, 'news')

if tweet_sample:
    tweet_eval_file, tweet_instructions = create_annotation_template(tweet_sample, 'tweets')

print(f"\n✅ EVALUATION SETUP COMPLETE")
print(f"📁 Files created in ../../evaluation/ directory")
print(f"📧 Share evaluation files with your 3+ annotators")

In [None]:
# Functions for analyzing annotator agreement (run after getting annotations back)
def load_annotator_evaluations(eval_dir: str, content_type: str) -> Dict:
    """
    Load completed evaluations from multiple annotators
    """
    import glob
    
    pattern = f"{eval_dir}/{content_type}_annotator_*.json"
    annotation_files = glob.glob(pattern)
    
    if not annotation_files:
        print(f"⚠️ No annotation files found for {content_type}")
        return {}
    
    annotations = {}
    for file_path in annotation_files:
        # Filenames follow {content_type}_annotator_{NAME}_{timestamp}.json;
        # this assumes NAME itself contains no underscores.
        annotator_name = os.path.basename(file_path).split('_')[2]
        
        with open(file_path, 'r') as f:
            data = json.load(f)
        
        annotations[annotator_name] = data
        print(f"✅ Loaded annotations from {annotator_name}: {len(data)} items")
    
    return annotations

def calculate_agreement_scores(annotations: Dict, metric: str = 'fact_extraction_quality') -> Dict:
    """
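    Calculate pairwise inter-annotator agreement using Cohen's Kappa.

    Returns a dict keyed by "<annotator1>_vs_<annotator2>"; each pair is
    compared only on items that both annotators rated.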
    """
    annotators = list(annotations.keys())
    
    if len(annotators) < 2:
        print("⚠️ Need at least 2 annotators for agreement calculation")
        return {}
    
    # Create rating matrix
    ratings_matrix = {}
    
    # Get all evaluation IDs
    eval_ids = set()
    for annotator_data in annotations.values():
        for item in annotator_data:
            eval_ids.add(item['evaluation_id'])
    
    eval_ids = sorted(list(eval_ids))
    
    # Build rating matrix
    for annotator in annotators:
        ratings_matrix[annotator] = []
        
        # Create lookup for this annotator
        annotator_ratings = {}
        for item in annotations[annotator]:
            rating = item['evaluation_template'][metric]['rating']
            if rating:  # Only include non-None ratings
                annotator_ratings[item['evaluation_id']] = rating
        
        # Add ratings in order
        for eval_id in eval_ids:
            if eval_id in annotator_ratings:
                ratings_matrix[annotator].append(annotator_ratings[eval_id])
            else:
                ratings_matrix[annotator].append(None)  # Missing rating
    
    # Calculate pairwise agreements
    agreement_scores = {}
    
    for annotator1, annotator2 in itertools.combinations(annotators, 2):
        # Keep only items that both annotators rated
        paired_ratings = [
            (r1, r2)
            for r1, r2 in zip(ratings_matrix[annotator1], ratings_matrix[annotator2])
            if r1 is not None and r2 is not None
        ]
        
        if paired_ratings:
            ratings1, ratings2 = zip(*paired_ratings)
            kappa = cohen_kappa_score(ratings1, ratings2)
            agreement_scores[f"{annotator1}_vs_{annotator2}"] = {
                'kappa': kappa,
                'items_compared': len(paired_ratings),
                'interpretation': interpret_kappa(kappa)
            }
    
    return agreement_scores

def interpret_kappa(kappa: float) -> str:
    """
    Interpret Cohen's Kappa score
    """
    if kappa < 0:
        return "Poor (worse than chance)"
    elif kappa < 0.20:
        return "Slight agreement"
    elif kappa < 0.40:
        return "Fair agreement" 
    elif kappa < 0.60:
        return "Moderate agreement"
    elif kappa < 0.80:
        return "Substantial agreement"
    else:
        return "Almost perfect agreement"

print("Agreement analysis functions defined!")
print("\n📋 NEXT STEPS:")
print("1. Share evaluation files with 3+ annotators")
print("2. Collect completed annotation files")
print("3. Run agreement analysis in next cell")

In [None]:
# Run this cell after collecting annotations from annotators
"""
# Load completed annotations
eval_dir = "../../evaluation"

# Load news annotations
news_annotations = load_annotator_evaluations(eval_dir, 'news')
tweet_annotations = load_annotator_evaluations(eval_dir, 'tweets')

# Calculate agreement scores
if news_annotations:
    print("📊 NEWS EVALUATION AGREEMENT ANALYSIS")
    print("="*50)
    
    # Fact extraction agreement
    extraction_agreement = calculate_agreement_scores(news_annotations, 'fact_extraction_quality')
    print("\n🔍 Fact Extraction Agreement:")
    for pair, scores in extraction_agreement.items():
        print(f"  {pair}: κ={scores['kappa']:.3f} ({scores['interpretation']}) - {scores['items_compared']} items")
    
    # Fact modification agreement  
    modification_agreement = calculate_agreement_scores(news_annotations, 'fact_modification_quality')
    print("\n🔄 Fact Modification Agreement:")
    for pair, scores in modification_agreement.items():
        print(f"  {pair}: κ={scores['kappa']:.3f} ({scores['interpretation']}) - {scores['items_compared']} items")

if tweet_annotations:
    print("\n📊 TWEET EVALUATION AGREEMENT ANALYSIS")
    print("="*50)
    
    # Similar analysis for tweets
    extraction_agreement = calculate_agreement_scores(tweet_annotations, 'fact_extraction_quality')
    print("\n🔍 Fact Extraction Agreement:")
    for pair, scores in extraction_agreement.items():
        print(f"  {pair}: κ={scores['kappa']:.3f} ({scores['interpretation']}) - {scores['items_compared']} items")
    
    modification_agreement = calculate_agreement_scores(tweet_annotations, 'fact_modification_quality')
    print("\n🔄 Fact Modification Agreement:")
    for pair, scores in modification_agreement.items():
        print(f"  {pair}: κ={scores['kappa']:.3f} ({scores['interpretation']}) - {scores['items_compared']} items")

# Overall quality assessment
print("\n🎯 QUALITY ASSESSMENT RECOMMENDATIONS:")
print("="*50)
print("Based on agreement analysis:")
print("• κ > 0.60: Proceed with full dataset processing")
print("• κ 0.40-0.60: Review and improve fact schemas, then retest")
print("• κ < 0.40: Significant improvements needed before scaling")
"""

print("Agreement analysis code ready - remove the surrounding triple quotes when annotations are complete")

## Summary

This notebook provides:

1. **Sample Creation**: Random sampling of generated results for evaluation
2. **Annotation Templates**: Structured files for manual evaluation by 3+ annotators
3. **Agreement Analysis**: Pairwise Cohen's Kappa calculation for inter-annotator reliability
4. **Quality Assessment**: Recommendations based on agreement scores
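
With three or more annotators, Cohen's Kappa yields one score per annotator pair. Fleiss' kappa is a different statistic that summarizes agreement across all annotators in a single number; it is not what `calculate_agreement_scores` computes. A minimal pure-Python sketch, assuming every item was rated by the same number of annotators; `fleiss_kappa` is a hypothetical helper:

```python
from collections import Counter
from typing import List, Sequence


def fleiss_kappa(item_ratings: List[Sequence[str]]) -> float:
    """Fleiss' kappa for items that each have the same number of ratings.

    item_ratings[i] holds the labels all annotators gave to item i.
    """
    if not item_ratings:
        raise ValueError("need at least one rated item")
    n_raters = len(item_ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in item_ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    n_items = len(item_ratings)
    label_totals = Counter()
    per_item_agreement = []
    for ratings in item_ratings:
        counts = Counter(ratings)
        label_totals.update(counts)
        # Fraction of ordered annotator pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    observed = sum(per_item_agreement) / n_items
    # Chance agreement from the overall label proportions.
    expected = sum((total / (n_items * n_raters)) ** 2
                   for total in label_totals.values())
    if expected == 1.0:
        return float("nan")  # only one label was ever used; kappa is undefined
    return (observed - expected) / (1 - expected)
```

Items with missing ratings would need to be dropped or handled separately before calling this function, since the sketch requires a complete rating set per item.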

### Next Steps:
1. Run this notebook to create evaluation templates
2. Distribute templates to 3+ annotators
3. Collect completed annotations
4. Run agreement analysis
5. Decide whether to proceed with full dataset based on quality scores
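
Agreement scores describe how consistently the annotators applied the rating scale; the ratings themselves still need to be combined per item before the final decision. A minimal sketch of a majority-vote consensus, assuming up to three ratings per item and treating ties as unresolved; `majority_rating` and `consensus_summary` are hypothetical helpers:

```python
from collections import Counter
from typing import Dict, List, Optional


def majority_rating(ratings: List[Optional[str]]) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators, else None."""
    given = [r for r in ratings if r is not None]
    if not given:
        return None
    label, votes = Counter(given).most_common(1)[0]
    # Require more than half of the submitted ratings to agree.
    return label if votes > len(given) / 2 else None


def consensus_summary(ratings_by_item: Dict[str, List[Optional[str]]]) -> Dict[str, int]:
    """Count consensus labels across items; unresolved items count as 'no_consensus'."""
    summary = Counter()
    for ratings in ratings_by_item.values():
        summary[majority_rating(ratings) or "no_consensus"] += 1
    return dict(summary)


# Usage with toy ratings (illustrative data only)
toy = {
    "news_001": ["appropriate", "appropriate", "in-between"],
    "news_002": ["appropriate", "inappropriate", "in-between"],
    "news_003": ["inappropriate", "inappropriate", None],
}
print(consensus_summary(toy))
```

Items that end up as `no_consensus` are natural candidates for discussion among annotators before a consensus label is recorded.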

### Agreement Interpretation:
- **κ > 0.60**: Substantial agreement or better - proceed with full processing
- **κ 0.40-0.60**: Moderate agreement - review and improve fact schemas, then retest
- **κ < 0.40**: Fair agreement or worse - significant improvements needed
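
The thresholds above can be applied programmatically to the pairwise scores. A minimal sketch, assuming the input has the shape returned by `calculate_agreement_scores` (a dict mapping each annotator pair to a dict with a `kappa` entry) and that the mean of the pairwise values serves as the summary; the averaging choice, the placement of exactly 0.60 in the middle band, and the helper name `recommend_next_step` are all assumptions:

```python
import math
from typing import Dict


def recommend_next_step(agreement_scores: Dict[str, Dict]) -> str:
    """Map the mean pairwise kappa to one of the three recommendations."""
    kappas = [s["kappa"] for s in agreement_scores.values()
              if not math.isnan(s["kappa"])]
    if not kappas:
        return "No valid kappa scores; collect more annotations"
    mean_kappa = sum(kappas) / len(kappas)
    if mean_kappa > 0.60:
        return "Proceed with full processing"
    if mean_kappa >= 0.40:
        return "Consider improvements, then re-evaluate"
    return "Significant improvements needed"


# Usage with toy scores (illustrative values only)
toy_scores = {
    "ann1_vs_ann2": {"kappa": 0.72},
    "ann1_vs_ann3": {"kappa": 0.55},
    "ann2_vs_ann3": {"kappa": 0.65},
}
print(recommend_next_step(toy_scores))
```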