# Automatic Evaluation of Synthetic Data Generation

This notebook implements automatic evaluation metrics for synthetic news articles and tweets.

## Automatic Evaluation Metrics:
1. **Correctness**: How accurately were facts extracted and modified?
2. **Coherence**: How well does the synthetic content maintain logical flow?
3. **Dissimilarity**: How different is the synthetic content from the original?

## Process:
- Load generated synthetic data
- Apply automatic evaluation metrics
- Generate quantitative quality scores
- Create evaluation reports
- Compare with manual evaluation results

In [None]:
# Import required libraries
import sys
import os
sys.path.append('../..')

import pandas as pd
import numpy as np
import json
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Any, Tuple
import nltk
import re
from collections import Counter

# For text similarity and coherence
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from textstat import flesch_reading_ease, flesch_kincaid_grade
import spacy

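# Download required NLTK data
# Newer NLTK releases look up 'punkt_tab' instead of 'punkt', so both are checked
for resource_path, resource_name in [('tokenizers/punkt', 'punkt'),
                                     ('tokenizers/punkt_tab', 'punkt_tab'),
                                     ('corpora/stopwords', 'stopwords')]:
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(resource_name)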

print("Libraries imported successfully!")
print(f"Current time: {datetime.now()}")

# Load spaCy model for advanced NLP
try:
    nlp = spacy.load("en_core_web_sm")
    print("✅ SpaCy model loaded successfully")
except OSError:
    print("⚠️ SpaCy model not found. Install with: python -m spacy download en_core_web_sm")
    nlp = None

In [None]:
# Load generated synthetic data results
def load_evaluation_data(results_dir: str = "../../results") -> Dict[str, List[Dict]]:
    """
    Load generated results for automatic evaluation
    """
    import glob
    
    data = {'news': [], 'tweets': []}
    
    # Load news results
    news_files = glob.glob(f"{results_dir}/news_batch_final_*.json")
    if news_files:
        # Pick the most recently modified file (ctime is not creation time on Unix)
        latest_news = max(news_files, key=os.path.getmtime)
        with open(latest_news, 'r', encoding='utf-8') as f:
            data['news'] = json.load(f)
        print(f"✅ Loaded {len(data['news'])} news results from {latest_news}")
    
    # Load tweet results  
    tweet_files = glob.glob(f"{results_dir}/tweets_batch_final_*.json")
    if tweet_files:
        # Pick the most recently modified file (ctime is not creation time on Unix)
        latest_tweets = max(tweet_files, key=os.path.getmtime)
        with open(latest_tweets, 'r', encoding='utf-8') as f:
            data['tweets'] = json.load(f)
        print(f"✅ Loaded {len(data['tweets'])} tweet results from {latest_tweets}")
    
    return data

# Load data
evaluation_data = load_evaluation_data()

print(f"\n📊 EVALUATION DATA SUMMARY:")
print(f"News articles: {len(evaluation_data['news'])}")
print(f"Tweets: {len(evaluation_data['tweets'])}")
print(f"Total items: {len(evaluation_data['news']) + len(evaluation_data['tweets'])}")
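
Judging from the keys the evaluators read later in this notebook, each loaded item appears to follow the structure sketched below. The values are invented placeholders, and `REQUIRED_KEYS` and `missing_keys` are hypothetical helpers added here only to show a quick sanity check before evaluation.

```python
from typing import Dict, List

# Keys the evaluation function reads from every item (hypothetical constant)
REQUIRED_KEYS = ('original_content', 'synthetic_content',
                 'extracted_facts', 'modified_facts')


def missing_keys(item: Dict) -> List[str]:
    """Return the required keys that are absent from an item (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if key not in item]


# Placeholder item; the texts and facts are made up for illustration
example_item = {
    'original_content': "The summit was held in Paris in 2019.",
    'synthetic_content': "The summit was held in Madrid in 2021.",
    'extracted_facts': [
        {'name_of_fact': 'location',
         'description_of_fact': 'Where the event took place',
         'specific_data': 'Paris'},
    ],
    'modified_facts': [
        {'name_of_fact': 'location',
         'description_of_fact': 'Where the event took place',
         'specific_data': 'Madrid'},
    ],
    'generation_info': {'index': 0, 'content_type': 'news'},
}

print(missing_keys(example_item))
print(missing_keys({'original_content': 'text only'}))
```

Items that report missing keys would still be scored, because the evaluation code falls back to empty defaults, but their scores would mostly be zeros.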

In [None]:
# Metric 1: Correctness Evaluation
class CorrectnessEvaluator:
    """
    Evaluate correctness of fact extraction and modification
    """
    
    def __init__(self):
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
    
    def evaluate_fact_extraction_correctness(self, original_text: str, extracted_facts: List[Dict]) -> float:
        """
        Evaluate if extracted facts are actually present in the original text
        """
        if not extracted_facts:
            return 0.0
        
        original_lower = original_text.lower()
        correct_extractions = 0
        
        for fact in extracted_facts:
            specific_data = fact.get('specific_data', '').lower()
            
            if not specific_data:
                continue
            
            # Check if the specific data appears in original text
            # Handle multi-word entities
            words = specific_data.split()
            if len(words) == 1:
                # Single word - exact match
                if specific_data in original_lower:
                    correct_extractions += 1
            else:
                # Multi-word - try the exact phrase first
                if specific_data in original_lower:
                    correct_extractions += 1
                else:
                    # Otherwise give partial credit if most content (non-stopword) words appear
                    content_words = [word for word in words if word not in self.stopwords]
                    word_matches = sum(1 for word in content_words if word in original_lower)
                    if content_words and word_matches >= len(content_words) * 0.7:  # 70% of content words match
                        correct_extractions += 0.7
        
        return correct_extractions / len(extracted_facts)
    
    def evaluate_fact_modification_appropriateness(self, extracted_facts: List[Dict], modified_facts: List[Dict]) -> float:
        """
        Evaluate if fact modifications maintain the same category/type
        """
        if not extracted_facts or not modified_facts:
            return 0.0
        
        if len(extracted_facts) != len(modified_facts):
            return 0.0  # Mismatch in fact count
        
        appropriate_modifications = 0
        
        for orig_fact, mod_fact in zip(extracted_facts, modified_facts):
            # Check if fact type/name matches
            if orig_fact.get('name_of_fact') == mod_fact.get('name_of_fact'):
                # Check if description is maintained
                if orig_fact.get('description_of_fact') == mod_fact.get('description_of_fact'):
                    # Check if specific data is actually different
                    orig_data = orig_fact.get('specific_data', '').lower()
                    mod_data = mod_fact.get('specific_data', '').lower()
                    
                    if orig_data != mod_data and mod_data != '':
                        appropriate_modifications += 1
                    elif orig_data == mod_data:
                        appropriate_modifications += 0.5  # Partial credit for no change
        
        return appropriate_modifications / len(extracted_facts)
    
    def evaluate_fact_replacement_accuracy(self, original_text: str, synthetic_text: str, 
                                         extracted_facts: List[Dict], modified_facts: List[Dict]) -> float:
        """
        Evaluate if facts were correctly replaced in synthetic text
        """
        if not extracted_facts or not modified_facts:
            return 0.0
        
        successful_replacements = 0
        
        for orig_fact, mod_fact in zip(extracted_facts, modified_facts):
            orig_data = orig_fact.get('specific_data', '')
            mod_data = mod_fact.get('specific_data', '')
            
            if not orig_data or not mod_data:
                continue
            
            # Check if original fact is removed from synthetic text
            orig_removed = orig_data.lower() not in synthetic_text.lower()
            
            # Check if modified fact is present in synthetic text
            mod_added = mod_data.lower() in synthetic_text.lower()
            
            if orig_removed and mod_added:
                successful_replacements += 1
            elif orig_removed or mod_added:
                successful_replacements += 0.5  # Partial success
        
        return successful_replacements / len(extracted_facts)

print("Correctness evaluator defined!")

In [None]:
# Metric 2: Coherence Evaluation
class CoherenceEvaluator:
    """
    Evaluate coherence of synthetic content
    """
    
    def __init__(self, nlp_model=None):
        self.nlp = nlp_model
    
    def evaluate_readability_coherence(self, text: str) -> Dict[str, float]:
        """
        Evaluate readability as a proxy for coherence
        """
        try:
            flesch_score = flesch_reading_ease(text)
            fk_grade = flesch_kincaid_grade(text)
            
            # Normalize Flesch score to 0-1 range (higher is better)
            flesch_normalized = min(1.0, max(0.0, flesch_score / 100.0))
            
            # Normalize FK grade to 0-1 (lower grades are better; clamp because
            # grades can fall below 0 or rise above 20)
            fk_normalized = min(1.0, max(0.0, 1.0 - (fk_grade / 20.0)))
            
            return {
                'flesch_ease': flesch_score,
                'fk_grade': fk_grade,
                'readability_score': (flesch_normalized + fk_normalized) / 2
            }
        except Exception:  # Avoid a bare except, which would also swallow KeyboardInterrupt
            return {'flesch_ease': 0, 'fk_grade': 20, 'readability_score': 0.0}
    
    def evaluate_sentence_coherence(self, text: str) -> float:
        """
        Evaluate coherence based on sentence structure and flow
        """
        if not self.nlp:
            return 0.5  # Neutral score if spaCy not available
        
        doc = self.nlp(text)
        sentences = list(doc.sents)
        
        if len(sentences) < 2:
            return 1.0  # Single sentence is coherent by default
        
        coherence_score = 0.0
        
        # Check for sentence transition coherence
        for i in range(len(sentences) - 1):
            sent1 = sentences[i]
            sent2 = sentences[i + 1]
            
            # Simple coherence checks
            # 1. Sentence length variation (not all very short or very long)
            len_ratio = min(len(sent1.text), len(sent2.text)) / max(len(sent1.text), len(sent2.text))
            length_score = min(1.0, len_ratio + 0.3)  # Penalty for extreme length differences
            
            # 2. Entity continuity (shared entities between sentences)
            ents1 = set(ent.text.lower() for ent in sent1.ents)
            ents2 = set(ent.text.lower() for ent in sent2.ents)
            
            if ents1 and ents2:
                entity_overlap = len(ents1.intersection(ents2)) / len(ents1.union(ents2))
            else:
                entity_overlap = 0.3  # Neutral score
            
            # 3. Lexical cohesion (shared content words)
            words1 = set(token.lemma_.lower() for token in sent1 
                        if not token.is_stop and not token.is_punct and token.is_alpha)
            words2 = set(token.lemma_.lower() for token in sent2 
                        if not token.is_stop and not token.is_punct and token.is_alpha)
            
            if words1 and words2:
                lexical_overlap = len(words1.intersection(words2)) / len(words1.union(words2))
            else:
                lexical_overlap = 0.0
            
            # Combine scores
            sentence_coherence = (length_score + entity_overlap + lexical_overlap) / 3
            coherence_score += sentence_coherence
        
        return coherence_score / (len(sentences) - 1)
    
    def evaluate_semantic_coherence(self, original_text: str, synthetic_text: str) -> float:
        """
        Evaluate if synthetic text maintains semantic coherence with original structure
        """
        # Use TF-IDF to compare semantic similarity at document level
        vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
        
        try:
            tfidf_matrix = vectorizer.fit_transform([original_text, synthetic_text])
            similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
            
            # We want some similarity (coherent structure) but not too much (should be different)
            # Optimal range: 0.3-0.7 similarity
            if 0.3 <= similarity <= 0.7:
                coherence_score = 1.0
            elif similarity < 0.3:
                coherence_score = similarity / 0.3  # Penalty for too much difference
            else:  # similarity > 0.7
                coherence_score = (1.0 - similarity) / 0.3  # Penalty for too much similarity
            
            return max(0.0, min(1.0, coherence_score))
        except Exception:  # e.g. ValueError when both texts contain only stop words
            return 0.5  # Neutral score on error

print("Coherence evaluator defined!")
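
To make the per-transition arithmetic in `evaluate_sentence_coherence` easier to follow, here is a spaCy-free approximation. It is only a sketch: `transition_score` and `jaccard` are hypothetical names, words are extracted with a simple regex rather than lemmatized, stop words are not removed, and the entity overlap is passed in as a number (defaulting to the 0.3 neutral value used above).

```python
import re


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def transition_score(sent1: str, sent2: str, entity_overlap: float = 0.3) -> float:
    """Approximate score for one sentence transition (hypothetical helper)."""
    if not sent1 or not sent2:
        return 0.0

    # 1. Length balance: ratio of shorter to longer sentence, plus a 0.3 bonus, capped at 1
    len_ratio = min(len(sent1), len(sent2)) / max(len(sent1), len(sent2))
    length_score = min(1.0, len_ratio + 0.3)

    # 2. Lexical cohesion: Jaccard overlap of lowercase alphabetic words
    words1 = set(re.findall(r"[a-z]+", sent1.lower()))
    words2 = set(re.findall(r"[a-z]+", sent2.lower()))
    lexical_overlap = jaccard(words1, words2)

    # 3. Equal-weight average of the three components
    return (length_score + entity_overlap + lexical_overlap) / 3


related = transition_score("The council approved the budget.",
                           "The budget funds new council projects.")
unrelated = transition_score("The council approved the budget.",
                             "Rain is expected tomorrow.")
print(related, unrelated)
```

With these inputs the related pair scores higher than the unrelated one, since it shares words and has a closer length ratio.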

In [None]:
# Metric 3: Dissimilarity Evaluation
class DissimilarityEvaluator:
    """
    Evaluate dissimilarity between original and synthetic content
    """
    
    def __init__(self):
        pass
    
    def evaluate_lexical_dissimilarity(self, original_text: str, synthetic_text: str) -> float:
        """
        Evaluate lexical dissimilarity using TF-IDF cosine distance
        """
        vectorizer = TfidfVectorizer(stop_words='english', max_features=1000, ngram_range=(1, 2))
        
        try:
            tfidf_matrix = vectorizer.fit_transform([original_text, synthetic_text])
            similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
            dissimilarity = 1.0 - similarity
            return max(0.0, min(1.0, dissimilarity))
        except Exception:  # e.g. ValueError when both texts contain only stop words
            return 0.0
    
    def evaluate_structural_dissimilarity(self, original_text: str, synthetic_text: str) -> float:
        """
        Evaluate structural differences (length, sentence count, etc.)
        """
        orig_sentences = nltk.sent_tokenize(original_text)
        synth_sentences = nltk.sent_tokenize(synthetic_text)
        
        orig_words = nltk.word_tokenize(original_text)
        synth_words = nltk.word_tokenize(synthetic_text)
        
        # Length difference (normalized); guard against division by zero for empty texts
        max_words = max(len(orig_words), len(synth_words))
        length_diff = abs(len(orig_words) - len(synth_words)) / max_words if max_words else 0.0
        
        # Sentence count difference (normalized)
        max_sents = max(len(orig_sentences), len(synth_sentences))
        sent_diff = abs(len(orig_sentences) - len(synth_sentences)) / max_sents if max_sents else 0.0
        
        # Average sentence length difference
        orig_avg_sent_len = len(orig_words) / len(orig_sentences) if orig_sentences else 0
        synth_avg_sent_len = len(synth_words) / len(synth_sentences) if synth_sentences else 0
        
        if orig_avg_sent_len > 0:
            sent_len_diff = abs(orig_avg_sent_len - synth_avg_sent_len) / orig_avg_sent_len
        else:
            sent_len_diff = 0
        
        # Combine structural differences
        structural_dissimilarity = (length_diff + sent_diff + sent_len_diff) / 3
        return min(1.0, structural_dissimilarity)
    
    def evaluate_entity_dissimilarity(self, original_text: str, synthetic_text: str, nlp_model=None) -> float:
        """
        Evaluate dissimilarity in named entities
        """
        if not nlp_model:
            # Fallback: simple word-level comparison
            orig_words = set(word.lower() for word in nltk.word_tokenize(original_text) if word.isalpha())
            synth_words = set(word.lower() for word in nltk.word_tokenize(synthetic_text) if word.isalpha())
            
            if not orig_words:
                return 0.0
            
            overlap = len(orig_words.intersection(synth_words))
            return 1.0 - (overlap / len(orig_words))
        
        # Use spaCy for entity extraction
        orig_doc = nlp_model(original_text)
        synth_doc = nlp_model(synthetic_text)
        
        orig_entities = set(ent.text.lower() for ent in orig_doc.ents)
        synth_entities = set(ent.text.lower() for ent in synth_doc.ents)
        
        if not orig_entities:
            return 1.0 if synth_entities else 0.0
        
        overlap = len(orig_entities.intersection(synth_entities))
        entity_dissimilarity = 1.0 - (overlap / len(orig_entities))
        
        return entity_dissimilarity
    
    def evaluate_fact_dissimilarity(self, extracted_facts: List[Dict], modified_facts: List[Dict]) -> float:
        """
        Evaluate how different the modified facts are from extracted facts
        """
        if not extracted_facts or not modified_facts:
            return 0.0
        
        total_dissimilarity = 0.0
        
        for orig_fact, mod_fact in zip(extracted_facts, modified_facts):
            orig_data = orig_fact.get('specific_data', '').lower()
            mod_data = mod_fact.get('specific_data', '').lower()
            
            if not orig_data or not mod_data:
                continue
            
            # Simple string dissimilarity
            if orig_data == mod_data:
                dissimilarity = 0.0  # No change
            else:
                # Calculate character-level dissimilarity
                max_len = max(len(orig_data), len(mod_data))
                if max_len == 0:
                    dissimilarity = 0.0
                else:
                    # Character-overlap heuristic (not a true edit distance):
                    # share of original characters that also occur in the modified value
                    common_chars = sum(1 for c in orig_data if c in mod_data)
                    dissimilarity = 1.0 - (common_chars / max_len)
            
            total_dissimilarity += dissimilarity
        
        return total_dissimilarity / len(extracted_facts)

print("Dissimilarity evaluator defined!")

In [None]:
# Comprehensive evaluation function
def evaluate_synthetic_data_item(item: Dict, 
                                correctness_eval: CorrectnessEvaluator,
                                coherence_eval: CoherenceEvaluator,
                                dissimilarity_eval: DissimilarityEvaluator) -> Dict:
    """
    Evaluate a single synthetic data item across all metrics
    """
    original_text = item.get('original_content', '')
    synthetic_text = item.get('synthetic_content', '')
    extracted_facts = item.get('extracted_facts', [])
    modified_facts = item.get('modified_facts', [])
    
    # Correctness metrics
    fact_extraction_correctness = correctness_eval.evaluate_fact_extraction_correctness(original_text, extracted_facts)
    fact_modification_appropriateness = correctness_eval.evaluate_fact_modification_appropriateness(extracted_facts, modified_facts)
    fact_replacement_accuracy = correctness_eval.evaluate_fact_replacement_accuracy(original_text, synthetic_text, extracted_facts, modified_facts)
    
    correctness_score = (fact_extraction_correctness + fact_modification_appropriateness + fact_replacement_accuracy) / 3
    
    # Coherence metrics
    readability_metrics = coherence_eval.evaluate_readability_coherence(synthetic_text)
    sentence_coherence = coherence_eval.evaluate_sentence_coherence(synthetic_text)
    semantic_coherence = coherence_eval.evaluate_semantic_coherence(original_text, synthetic_text)
    
    coherence_score = (readability_metrics['readability_score'] + sentence_coherence + semantic_coherence) / 3
    
    # Dissimilarity metrics
    lexical_dissimilarity = dissimilarity_eval.evaluate_lexical_dissimilarity(original_text, synthetic_text)
    structural_dissimilarity = dissimilarity_eval.evaluate_structural_dissimilarity(original_text, synthetic_text)
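    # Reuse the spaCy model held by the coherence evaluator instead of relying on the global `nlp`
    entity_dissimilarity = dissimilarity_eval.evaluate_entity_dissimilarity(original_text, synthetic_text, coherence_eval.nlp)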
    fact_dissimilarity = dissimilarity_eval.evaluate_fact_dissimilarity(extracted_facts, modified_facts)
    
    dissimilarity_score = (lexical_dissimilarity + structural_dissimilarity + entity_dissimilarity + fact_dissimilarity) / 4
    
    # Overall quality score
    overall_score = (correctness_score + coherence_score + dissimilarity_score) / 3
    
    return {
        'evaluation_id': item.get('generation_info', {}).get('index', 0),
        'content_type': item.get('generation_info', {}).get('content_type', 'unknown'),
        
        # Correctness components
        'fact_extraction_correctness': fact_extraction_correctness,
        'fact_modification_appropriateness': fact_modification_appropriateness,
        'fact_replacement_accuracy': fact_replacement_accuracy,
        'correctness_score': correctness_score,
        
        # Coherence components
        'readability_score': readability_metrics['readability_score'],
        'sentence_coherence': sentence_coherence,
        'semantic_coherence': semantic_coherence,
        'coherence_score': coherence_score,
        
        # Dissimilarity components
        'lexical_dissimilarity': lexical_dissimilarity,
        'structural_dissimilarity': structural_dissimilarity,
        'entity_dissimilarity': entity_dissimilarity,
        'fact_dissimilarity': fact_dissimilarity,
        'dissimilarity_score': dissimilarity_score,
        
        # Overall
        'overall_quality_score': overall_score,
        
        # Additional metadata
        'flesch_reading_ease': readability_metrics['flesch_ease'],
        'flesch_kincaid_grade': readability_metrics['fk_grade'],
        'num_extracted_facts': len(extracted_facts),
        'num_modified_facts': len(modified_facts)
    }

print("Comprehensive evaluation function defined!")

In [None]:
# Run automatic evaluation on all data
print("🔄 RUNNING AUTOMATIC EVALUATION")
print("="*50)

# Initialize evaluators
correctness_evaluator = CorrectnessEvaluator()
coherence_evaluator = CoherenceEvaluator(nlp)
dissimilarity_evaluator = DissimilarityEvaluator()

all_evaluations = []

# Evaluate news articles
if evaluation_data['news']:
    print(f"\n📰 Evaluating {len(evaluation_data['news'])} news articles...")
    
    for i, item in enumerate(evaluation_data['news']):
        if i % 10 == 0:
            print(f"  Progress: {i}/{len(evaluation_data['news'])}")
        
        evaluation = evaluate_synthetic_data_item(item, correctness_evaluator, coherence_evaluator, dissimilarity_evaluator)
        all_evaluations.append(evaluation)

# Evaluate tweets
if evaluation_data['tweets']:
    print(f"\n🐦 Evaluating {len(evaluation_data['tweets'])} tweets...")
    
    for i, item in enumerate(evaluation_data['tweets']):
        if i % 10 == 0:
            print(f"  Progress: {i}/{len(evaluation_data['tweets'])}")
        
        evaluation = evaluate_synthetic_data_item(item, correctness_evaluator, coherence_evaluator, dissimilarity_evaluator)
        all_evaluations.append(evaluation)

print(f"\n✅ Automatic evaluation completed!")
print(f"Total items evaluated: {len(all_evaluations)}")

# Convert to DataFrame for analysis
evaluation_df = pd.DataFrame(all_evaluations)
print(f"\n📊 Evaluation results shape: {evaluation_df.shape}")

In [None]:
# Generate evaluation report and visualizations
print("📈 AUTOMATIC EVALUATION RESULTS")
print("="*50)

# Overall statistics
print("\n🎯 OVERALL QUALITY SCORES:")
print(f"Mean Overall Quality: {evaluation_df['overall_quality_score'].mean():.3f}")
print(f"Median Overall Quality: {evaluation_df['overall_quality_score'].median():.3f}")
print(f"Std Overall Quality: {evaluation_df['overall_quality_score'].std():.3f}")

# Metric breakdown
print("\n📋 METRIC BREAKDOWN:")
metric_cols = ['correctness_score', 'coherence_score', 'dissimilarity_score']
for metric in metric_cols:
    mean_score = evaluation_df[metric].mean()
    print(f"{metric.replace('_', ' ').title()}: {mean_score:.3f}")

# Content type comparison
if 'content_type' in evaluation_df.columns:
    print("\n📊 BY CONTENT TYPE:")
    content_type_stats = evaluation_df.groupby('content_type')['overall_quality_score'].agg(['mean', 'median', 'std', 'count'])
    print(content_type_stats)

# Create visualizations
plt.figure(figsize=(15, 10))

# 1. Overall quality distribution
plt.subplot(2, 3, 1)
plt.hist(evaluation_df['overall_quality_score'], bins=20, alpha=0.7, edgecolor='black')
plt.title('Overall Quality Score Distribution')
plt.xlabel('Quality Score')
plt.ylabel('Frequency')

# 2. Metric comparison
plt.subplot(2, 3, 2)
metric_means = [evaluation_df[col].mean() for col in metric_cols]
metric_names = [col.replace('_score', '').title() for col in metric_cols]
plt.bar(metric_names, metric_means, alpha=0.7)
plt.title('Average Scores by Metric')
plt.ylabel('Score')
plt.ylim(0, 1)

# 3. Content type comparison (if available)
if 'content_type' in evaluation_df.columns and evaluation_df['content_type'].nunique() > 1:
    plt.subplot(2, 3, 3)
    sns.boxplot(data=evaluation_df, x='content_type', y='overall_quality_score')
    plt.title('Quality by Content Type')
    plt.ylabel('Overall Quality Score')

# 4. Correlation heatmap
plt.subplot(2, 3, 4)
correlation_cols = ['correctness_score', 'coherence_score', 'dissimilarity_score', 'overall_quality_score']
corr_matrix = evaluation_df[correlation_cols].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Metric Correlations')

# 5. Detailed metric breakdown
plt.subplot(2, 3, 5)
detailed_metrics = ['fact_extraction_correctness', 'fact_modification_appropriateness', 'fact_replacement_accuracy']
detailed_means = [evaluation_df[col].mean() for col in detailed_metrics]
detailed_names = [col.replace('_', ' ').replace('fact ', '').title() for col in detailed_metrics]
plt.bar(detailed_names, detailed_means, alpha=0.7)
plt.title('Correctness Components')
plt.ylabel('Score')
plt.xticks(rotation=45)
plt.ylim(0, 1)

# 6. Quality vs number of facts
plt.subplot(2, 3, 6)
plt.scatter(evaluation_df['num_extracted_facts'], evaluation_df['overall_quality_score'], alpha=0.6)
plt.xlabel('Number of Extracted Facts')
plt.ylabel('Overall Quality Score')
plt.title('Quality vs Number of Facts')

plt.tight_layout()
plt.show()

# Save evaluation results
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
results_file = f"../../evaluation/automatic_evaluation_results_{timestamp}.csv"
os.makedirs("../../evaluation", exist_ok=True)

evaluation_df.to_csv(results_file, index=False)
print(f"\n💾 Evaluation results saved to: {results_file}")

# Quality thresholds and recommendations
print("\n🎯 QUALITY ASSESSMENT & RECOMMENDATIONS:")
print("="*50)

mean_overall = evaluation_df['overall_quality_score'].mean()
mean_correctness = evaluation_df['correctness_score'].mean()
mean_coherence = evaluation_df['coherence_score'].mean()
mean_dissimilarity = evaluation_df['dissimilarity_score'].mean()

if mean_overall >= 0.7:
    print("✅ EXCELLENT: Overall quality is high - proceed with full dataset")
elif mean_overall >= 0.6:
    print("✅ GOOD: Quality is acceptable - proceed with caution")
elif mean_overall >= 0.5:
    print("⚠️ MODERATE: Quality needs improvement before scaling")
else:
    print("❌ POOR: Significant improvements needed")

print("\nSpecific recommendations:")
if mean_correctness < 0.6:
    print("• Improve fact extraction and modification accuracy")
if mean_coherence < 0.6:
    print("• Review and improve text generation for better coherence")
if mean_dissimilarity < 0.4:
    print("• Increase dissimilarity - synthetic content too similar to original")
elif mean_dissimilarity > 0.8:
    print("• Decrease dissimilarity - synthetic content too different from original")

print(f"\n📊 Items with quality score >= 0.7: {(evaluation_df['overall_quality_score'] >= 0.7).sum()}/{len(evaluation_df)} ({(evaluation_df['overall_quality_score'] >= 0.7).mean()*100:.1f}%)")

## Evaluation Summary

This notebook provides comprehensive automatic evaluation with three key metrics:

### 1. Correctness (33% weight)
- **Fact Extraction Correctness**: Are extracted facts present in original text?
- **Fact Modification Appropriateness**: Do modified facts keep the same name and description while changing the value?
- **Fact Replacement Accuracy**: Are facts correctly replaced in synthetic text?
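
As a rough illustration of how the replacement check assigns full and partial credit, the sketch below restates the rule from `evaluate_fact_replacement_accuracy` as a standalone function. The name `replacement_score` and the sample texts are made up for this example.

```python
from typing import Dict, List


def replacement_score(synthetic_text: str,
                      extracted_facts: List[Dict],
                      modified_facts: List[Dict]) -> float:
    """Standalone restatement of the replacement rule (hypothetical helper)."""
    if not extracted_facts or not modified_facts:
        return 0.0

    synthetic_lower = synthetic_text.lower()
    credit = 0.0
    for orig_fact, mod_fact in zip(extracted_facts, modified_facts):
        orig_data = orig_fact.get('specific_data', '').lower()
        mod_data = mod_fact.get('specific_data', '').lower()
        if not orig_data or not mod_data:
            continue  # facts without a value earn no credit

        orig_removed = orig_data not in synthetic_lower
        mod_added = mod_data in synthetic_lower
        if orig_removed and mod_added:
            credit += 1.0  # original value gone and new value present
        elif orig_removed or mod_added:
            credit += 0.5  # only one half of the replacement happened

    return credit / len(extracted_facts)


# Invented inputs: the location is fully replaced, the year is removed but not re-inserted
extracted = [{'specific_data': 'Paris'}, {'specific_data': '2019'}]
modified = [{'specific_data': 'Madrid'}, {'specific_data': '2021'}]
synthetic = "The summit was held in Madrid last year."

score = replacement_score(synthetic, extracted, modified)
print(score)  # 0.75: (1.0 + 0.5) / 2
```

Because the check is a plain substring test, short values such as a single digit can match by accident inside unrelated words or numbers.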

### 2. Coherence (33% weight)
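- **Readability**: Flesch reading ease and Flesch-Kincaid grade level, each normalized to 0-1
- **Sentence Coherence**: Length balance, shared entities, and shared content words between adjacent sentences
- **Semantic Coherence**: TF-IDF cosine similarity to the original, with full credit in the 0.3-0.7 range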
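
The two normalizations behind the readability and semantic components can be sketched in isolation. The function names `normalize_readability` and `band_score` are hypothetical; the constants (100, 20, 0.3, 0.7) are the ones used in `CoherenceEvaluator`, and the input numbers are arbitrary.

```python
def clamp01(value: float) -> float:
    """Clamp a number to the closed interval [0, 1]."""
    return max(0.0, min(1.0, value))


def normalize_readability(flesch_ease: float, fk_grade: float) -> float:
    """Average of normalized Flesch ease (higher is better) and FK grade (lower is better)."""
    flesch_normalized = clamp01(flesch_ease / 100.0)
    fk_normalized = clamp01(1.0 - fk_grade / 20.0)
    return (flesch_normalized + fk_normalized) / 2


def band_score(similarity: float, low: float = 0.3, high: float = 0.7) -> float:
    """Full credit inside [low, high]; linear falloff towards 0 outside it."""
    if low <= similarity <= high:
        score = 1.0
    elif similarity < low:
        score = similarity / low  # too different from the original
    else:
        score = (1.0 - similarity) / (1.0 - high)  # too similar to the original
    return clamp01(score)


print(normalize_readability(60.0, 8.0))
for sim in (0.0, 0.15, 0.5, 0.85, 1.0):
    print(sim, round(band_score(sim), 3))
```

The band shape means a near-copy of the original and an unrelated text both score close to zero on this component.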

### 3. Dissimilarity (33% weight)
- **Lexical Dissimilarity**: TF-IDF cosine distance
- **Structural Dissimilarity**: Length and sentence structure differences
- **Entity Dissimilarity**: Named entity changes
- **Fact Dissimilarity**: How different modified facts are from originals
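
The set-based and character-based measures above can be illustrated without spaCy. In this sketch, `set_dissimilarity` and `char_overlap_dissimilarity` restate the notebook's rules under hypothetical names, while `sequence_dissimilarity` is an alternative built on the standard library's `difflib.SequenceMatcher`; it is shown only for comparison and is not what the notebook uses.

```python
from difflib import SequenceMatcher
from typing import Iterable


def set_dissimilarity(original_items: Iterable[str], synthetic_items: Iterable[str]) -> float:
    """Share of original items (e.g. entities) missing from the synthetic set."""
    original_set = {item.lower() for item in original_items}
    synthetic_set = {item.lower() for item in synthetic_items}
    if not original_set:
        return 1.0 if synthetic_set else 0.0
    overlap = len(original_set & synthetic_set)
    return 1.0 - overlap / len(original_set)


def char_overlap_dissimilarity(orig: str, mod: str) -> float:
    """Character-overlap heuristic used for fact values in this notebook."""
    orig, mod = orig.lower(), mod.lower()
    if orig == mod:
        return 0.0
    max_len = max(len(orig), len(mod))
    common_chars = sum(1 for c in orig if c in mod)
    return 1.0 - common_chars / max_len


def sequence_dissimilarity(orig: str, mod: str) -> float:
    """Order-aware alternative based on difflib (an assumption, not the notebook's method)."""
    return 1.0 - SequenceMatcher(None, orig.lower(), mod.lower()).ratio()


print(set_dissimilarity(["Paris", "Reuters", "2019"], ["Madrid", "Reuters", "2021"]))

# Anagrams expose the heuristic's blind spot: every character of "listen" occurs in "silent"
print(char_overlap_dissimilarity("listen", "silent"))
print(sequence_dissimilarity("listen", "silent"))
```

The anagram case scores 0.0 under the character-overlap heuristic even though the values differ, which is one reason an order-aware measure may be worth considering.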

### Quality Thresholds:
- **>= 0.7**: Excellent quality - proceed with full dataset
- **0.6-0.7**: Good quality - proceed with caution
- **0.5-0.6**: Moderate quality - improve before scaling
- **< 0.5**: Poor quality - significant changes needed
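
The equal weighting and the threshold bands can be expressed as two small functions. This is a minimal sketch; `overall_quality` and `quality_label` are hypothetical names, and the sample scores are arbitrary.

```python
def overall_quality(correctness: float, coherence: float, dissimilarity: float) -> float:
    """Equal-weight average of the three top-level metric scores."""
    return (correctness + coherence + dissimilarity) / 3


def quality_label(score: float) -> str:
    """Map an overall score to the bands listed above."""
    if score >= 0.7:
        return "excellent"
    if score >= 0.6:
        return "good"
    if score >= 0.5:
        return "moderate"
    return "poor"


example_score = overall_quality(0.8, 0.7, 0.45)
print(round(example_score, 3), quality_label(example_score))
```

Note that a high overall score can hide a weak component, which is why the notebook also prints per-metric recommendations.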

### Next Steps:
1. Compare automatic scores with manual evaluation results
2. Identify patterns in high/low quality items
3. Use insights to improve fact schemas and generation process
4. Proceed to classification training if scores are satisfactory