# COVID Synthetic Data Generation - Full Methodology

This notebook implements the complete 4-step methodology for synthetic data generation:

1. **Data Collection** (already done)
2. **Fact Characterization** - Define fact types with name, description, and examples
3. **Fact Extraction** - Extract structured facts using LLM
4. **Fact Manipulation** - Modify facts and generate synthetic articles

Each output record follows the specified JSON format, with these fields:
- original_article
- extracted_facts
- modified_facts (the altered facts, before they are substituted into the article)
- modified_article
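
As a rough illustration of this record format, the sketch below builds one record and serializes it to JSON. The `SyntheticRecord` class and the sample values are hypothetical and exist only for this example; the fact keys (`name_of_fact`, `description_of_fact`, `specific_data`) are the ones this notebook reads later when displaying results. The real pipeline may store additional fields.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class SyntheticRecord:
    """Hypothetical container mirroring the four output fields listed above."""
    original_article: str
    extracted_facts: List[Dict[str, Any]] = field(default_factory=list)
    modified_facts: List[Dict[str, Any]] = field(default_factory=list)
    modified_article: str = ""


# Illustrative values only; these are not real pipeline output.
record = SyntheticRecord(
    original_article="The city reported 120 new cases on Monday.",
    extracted_facts=[{
        "name_of_fact": "case_count",
        "description_of_fact": "Number of new cases reported",
        "specific_data": "120",
    }],
    modified_facts=[{
        "name_of_fact": "case_count",
        "description_of_fact": "Number of new cases reported",
        "specific_data": "450",
    }],
    modified_article="The city reported 450 new cases on Monday.",
)

# asdict() keeps the field order, so the JSON keys match the list above.
record_json = json.dumps(asdict(record), indent=2)
print(record_json)
```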

In [None]:
# Import required libraries
import sys
import os
sys.path.append('../..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from datetime import datetime
from tqdm import tqdm

# Import custom modules
from src.data_generation.pipeline import SyntheticDataPipeline, run_demo_pipeline
from src.data_generation.fact_schemas import get_fact_schema, display_fact_schema, COVID_FACT_SCHEMA
from src.utils.evaluation import EvaluationManager
from src.utils.data_utils import load_config

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

print("Libraries imported successfully!")
print(f"Current working directory: {os.getcwd()}")

## Configuration and Setup

In [None]:
# Check API keys and configuration
config = load_config()

openai_key = os.getenv('OPENAI_API_KEY')
print(f"OpenAI API Key available: {openai_key is not None}")

if not openai_key:
    print("\n⚠️  Warning: No OpenAI API key found.")
    print("Please set up your .env file with OPENAI_API_KEY")
    print("Copy .env.example to .env and add your API key")
else:
    print("✅ API key configured - ready to proceed")

print("\nConfiguration loaded:")
print(json.dumps(config, indent=2))

## Step 2: Fact Characterization

Define COVID-specific fact types with:
- Name of fact
- Description of fact  
- Common examples
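
To make the three attributes concrete, here is a minimal sketch of how a fact type could be represented. The `FactType` class, the `format_schema` helper, and the two sample entries are assumptions made for illustration; the actual schema returned by `get_fact_schema` may be structured differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FactType:
    """Hypothetical fact type: a name, a description, and common examples."""
    name: str
    description: str
    examples: List[str]


# Illustrative entries only; the project's real schema is defined elsewhere.
EXAMPLE_SCHEMA = [
    FactType(
        name="case_count",
        description="Number of reported COVID-19 cases",
        examples=["120 new cases", "over 3,000 infections"],
    ),
    FactType(
        name="location",
        description="Place the report refers to",
        examples=["Madrid", "New York City"],
    ),
]


def format_schema(schema: List[FactType]) -> str:
    """Render each fact type as a numbered entry followed by its examples."""
    lines = []
    for i, fact_type in enumerate(schema, 1):
        lines.append(f"{i}. {fact_type.name}: {fact_type.description}")
        lines.append(f"   Examples: {', '.join(fact_type.examples)}")
    return "\n".join(lines)


print(format_schema(EXAMPLE_SCHEMA))
```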

In [None]:
# Display the predefined COVID fact schema
print("COVID-SPECIFIC FACT CHARACTERIZATION SCHEMA")
print("="*50)

fact_schema = get_fact_schema("news")
display_fact_schema(fact_schema)

print(f"\nTotal fact types defined: {len(fact_schema)}")
print("\nThis schema will be used to extract structured facts from COVID articles.")

## Demo Pipeline - Process 10 Articles

Start with a small batch, as the methodology prescribes: 10 articles first, then 100, then the full dataset.

In [None]:
# Run demo pipeline with 10 articles
print("RUNNING DEMO PIPELINE - 10 ARTICLES")
print("="*40)

if openai_key:
    try:
        # This will create sample data and run the full pipeline
        pipeline = run_demo_pipeline(num_articles=10, content_type="news")
        
        print("\n✅ Demo pipeline completed successfully!")
        
        # Get results for analysis
        results = pipeline.results
        print(f"Generated {len(results)} synthetic article pairs")
        
    except Exception as e:
        print(f"❌ Error running pipeline: {e}")
        print("Please check your API key and configuration")
        results = []
        
else:
    print("⚠️  Skipping pipeline execution - no API key configured")
    print("Please add your OpenAI API key to .env file to run the pipeline")
    results = []

## Results Analysis - Structured Output Format

In [None]:
# Display results in the required JSON format
if results:
    print("SYNTHETIC DATA GENERATION RESULTS")
    print("="*50)
    
    # Show first 2 results in detail
    for i, result in enumerate(results[:2]):
        print(f"\n📄 ARTICLE {i + 1}")
        print("-"*30)
        
        print("\n🔹 ORIGINAL ARTICLE:")
        print(f"{result.original_article[:200]}{'...' if len(result.original_article) > 200 else ''}")
        
        print("\n🔹 EXTRACTED FACTS:")
        for j, fact in enumerate(result.extracted_facts, 1):
            print(f"  {j}. {fact.get('name_of_fact', 'Unknown')}: {fact.get('specific_data', 'N/A')}")
            print(f"     Description: {fact.get('description_of_fact', 'N/A')}")
        
        print("\n🔹 MODIFIED FACTS (before replacement):")
        for j, fact in enumerate(result.modified_facts, 1):
            print(f"  {j}. {fact.get('name_of_fact', 'Unknown')}: {fact.get('specific_data', 'N/A')}")
        
        print("\n🔹 MODIFIED ARTICLE:")
        print(f"{result.modified_article[:200]}{'...' if len(result.modified_article) > 200 else ''}")
        
        print("\n" + "="*50)
    
    if len(results) > 2:
        print(f"\n... and {len(results) - 2} more results")
        
else:
    print("No results to display. Run the pipeline first.")

## Quality Assessment - Structured Facts Analysis

In [None]:
# Analyze the quality and structure of extracted/modified facts
if results:
    print("FACT EXTRACTION AND MODIFICATION ANALYSIS")
    print("="*45)
    
    # Collect statistics
    fact_stats = {
        'total_articles': len(results),
        'total_facts_extracted': 0,
        'total_facts_modified': 0,
        'fact_types_found': {},
        'avg_facts_per_article': 0
    }
    
    all_extracted_facts = []
    all_modified_facts = []
    
    for result in results:
        extracted_count = len(result.extracted_facts)
        modified_count = len(result.modified_facts)
        
        fact_stats['total_facts_extracted'] += extracted_count
        fact_stats['total_facts_modified'] += modified_count
        
        all_extracted_facts.extend(result.extracted_facts)
        all_modified_facts.extend(result.modified_facts)
        
        # Count fact types
        for fact in result.extracted_facts:
            fact_type = fact.get('name_of_fact', 'Unknown')
            fact_stats['fact_types_found'][fact_type] = fact_stats['fact_types_found'].get(fact_type, 0) + 1
    
    fact_stats['avg_facts_per_article'] = fact_stats['total_facts_extracted'] / len(results)
    
    # Display statistics
    print(f"Total articles processed: {fact_stats['total_articles']}")
    print(f"Total facts extracted: {fact_stats['total_facts_extracted']}")
    print(f"Total facts modified: {fact_stats['total_facts_modified']}")
    print(f"Average facts per article: {fact_stats['avg_facts_per_article']:.1f}")
    
    print("\nFact types found:")
    for fact_type, count in sorted(fact_stats['fact_types_found'].items(), key=lambda x: x[1], reverse=True):
        percentage = (count / fact_stats['total_facts_extracted']) * 100
        print(f"  {fact_type}: {count} ({percentage:.1f}%)")
    
    # Visualization
    if len(fact_stats['fact_types_found']) > 0:
        plt.figure(figsize=(12, 5))
        
        # Fact types distribution
        plt.subplot(1, 2, 1)
        fact_types = list(fact_stats['fact_types_found'].keys())
        fact_counts = list(fact_stats['fact_types_found'].values())
        
        plt.bar(range(len(fact_types)), fact_counts)
        plt.xlabel('Fact Types')
        plt.ylabel('Count')
        plt.title('Distribution of Extracted Fact Types')
        plt.xticks(range(len(fact_types)), fact_types, rotation=45, ha='right')
        
        # Facts per article distribution
        plt.subplot(1, 2, 2)
        facts_per_article = [len(result.extracted_facts) for result in results]
        plt.hist(facts_per_article, bins=max(1, len(set(facts_per_article))), alpha=0.7, edgecolor='black')
        plt.xlabel('Number of Facts per Article')
        plt.ylabel('Frequency')
        plt.title('Facts per Article Distribution')
        
        plt.tight_layout()
        plt.show()
        
else:
    print("No results available for analysis.")

## Manual Evaluation Setup

Create a template for manual annotation by three or more annotators.
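
Once the annotations are collected, agreement between annotators has to be quantified. One common choice for three or more raters and categorical labels is Fleiss' kappa; the sketch below is a plain-Python version under the assumption that every item is rated by the same number of annotators using the three labels from this notebook's template. Which statistic the project's own evaluation tools compute is not shown here, so treat this as one possible option.

```python
from collections import Counter
from typing import Sequence

# The three rating labels used in this notebook's annotation template.
LABELS = ("appropriate", "in-between", "inappropriate")


def fleiss_kappa(ratings: Sequence[Sequence[str]],
                 labels: Sequence[str] = LABELS) -> float:
    """Fleiss' kappa for items that are each rated by the same number of raters.

    `ratings[i]` holds the labels that the raters assigned to item i.
    """
    if not ratings:
        raise ValueError("ratings must contain at least one item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    if any(label not in labels for item in ratings for label in item):
        raise ValueError("unknown label in ratings")

    n_items = len(ratings)
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[label] ** 2 for label in labels) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall label proportions.
    proportions = [
        sum(c[label] for c in counts) / (n_items * n_raters) for label in labels
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        # Every rating is the same label; kappa is undefined, report 1.0.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical ratings from three annotators on two items.
example = [
    ["appropriate", "appropriate", "in-between"],
    ["inappropriate", "inappropriate", "inappropriate"],
]
print(f"Fleiss' kappa: {fleiss_kappa(example):.3f}")
```

Values near 1 indicate strong agreement, while values near 0 indicate agreement no better than chance.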

In [None]:
# Create evaluation template for manual annotation
if results:
    print("CREATING MANUAL EVALUATION TEMPLATE")
    print("="*38)
    
    # Create evaluation manager and template
    eval_manager = EvaluationManager()
    
    # Convert results to evaluation format
    eval_data = []
    for result in results:
        eval_data.append({
            'original_article': result.original_article,
            'extracted_facts': result.extracted_facts,
            'modified_facts': result.modified_facts,
            'modified_article': result.modified_article
        })
    
    # Create template
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    template_file = f"results/evaluation_template_{timestamp}.csv"
    
    # Ensure results directory exists
    os.makedirs("results", exist_ok=True)
    
    eval_manager.create_annotation_template(eval_data, template_file)
    
    print(f"\n✅ Evaluation template created: {template_file}")
    print("\n📋 Next steps for manual evaluation:")
    print("1. Share the CSV file with 3+ annotators")
    print("2. Each annotator should rate:")
    print("   - fact_extraction_quality: appropriate/inappropriate/in-between")
    print("   - fact_modification_quality: appropriate/inappropriate/in-between")
    print("   - synthetic_article_quality: appropriate/inappropriate/in-between")
    print("3. Collect completed annotations")
    print("4. Use the evaluation analysis tools to calculate agreement")
    
else:
    print("No results available for evaluation template creation.")

## Automatic Evaluation Metrics

Calculate correctness, coherence, and dissimilarity automatically
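
How `EvaluationManager` computes these scores is not shown in this notebook. As a simple lexical stand-in, the sketch below assumes that dissimilarity is one minus the Jaccard overlap of word sets, and that correctness is the fraction of modified facts whose `specific_data` value appears verbatim in the synthetic article. Both function names are hypothetical, and coherence is left out because it has no comparably simple proxy.

```python
import re
from typing import Any, Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard_dissimilarity(original: str, synthetic: str) -> float:
    """1 - |A & B| / |A | B| over word sets; 0.0 means identical vocabularies."""
    a, b = tokenize(original), tokenize(synthetic)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def fact_incorporation(modified_facts: List[Dict[str, Any]],
                       synthetic: str) -> float:
    """Fraction of modified fact values found verbatim in the synthetic text."""
    values = [str(f.get("specific_data", "")).strip() for f in modified_facts]
    values = [v for v in values if v]
    if not values:
        return 0.0
    text = synthetic.lower()
    hits = sum(1 for v in values if v.lower() in text)
    return hits / len(values)


# Illustrative inputs only.
original = "The city reported 120 new cases on Monday."
synthetic = "The city reported 450 new cases on Monday."
facts = [{"name_of_fact": "case_count", "specific_data": "450"}]

print(f"Dissimilarity: {jaccard_dissimilarity(original, synthetic):.3f}")
print(f"Correctness:   {fact_incorporation(facts, synthetic):.3f}")
```

A purely lexical measure like this rewards any word change, so a single altered number yields a low score even when the factual change is significant. That is worth keeping in mind when interpreting the dissimilarity threshold used below.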

In [None]:
# Run automatic evaluation
if results:
    print("AUTOMATIC EVALUATION METRICS")
    print("="*30)
    
    eval_manager = EvaluationManager()
    automatic_evaluations = []
    
    for result in results:
        auto_eval = eval_manager.calculate_automatic_metrics(
            original_text=result.original_article,
            synthetic_text=result.modified_article,
            extracted_facts=result.extracted_facts,
            modified_facts=result.modified_facts
        )
        automatic_evaluations.append(auto_eval)
    
    # Calculate summary statistics
    # Use `ev` as the loop variable to avoid shadowing the built-in `eval`
    correctness_scores = [ev.correctness for ev in automatic_evaluations]
    coherence_scores = [ev.coherence for ev in automatic_evaluations]
    dissimilarity_scores = [ev.dissimilarity for ev in automatic_evaluations]
    
    print(f"\n📊 SUMMARY STATISTICS (n={len(automatic_evaluations)}):")
    print("\nCorrectness (fact incorporation):")
    print(f"  Mean: {np.mean(correctness_scores):.3f} ± {np.std(correctness_scores):.3f}")
    print(f"  Range: [{np.min(correctness_scores):.3f}, {np.max(correctness_scores):.3f}]")
    
    print("\nCoherence (text quality):")
    print(f"  Mean: {np.mean(coherence_scores):.3f} ± {np.std(coherence_scores):.3f}")
    print(f"  Range: [{np.min(coherence_scores):.3f}, {np.max(coherence_scores):.3f}]")
    
    print("\nDissimilarity (from original):")
    print(f"  Mean: {np.mean(dissimilarity_scores):.3f} ± {np.std(dissimilarity_scores):.3f}")
    print(f"  Range: [{np.min(dissimilarity_scores):.3f}, {np.max(dissimilarity_scores):.3f}]")
    
    # Visualization
    plt.figure(figsize=(15, 5))
    
    # Correctness distribution
    plt.subplot(1, 3, 1)
    plt.hist(correctness_scores, bins=10, alpha=0.7, edgecolor='black')
    plt.axvline(np.mean(correctness_scores), color='red', linestyle='--', 
                label=f'Mean: {np.mean(correctness_scores):.3f}')
    plt.xlabel('Correctness Score')
    plt.ylabel('Frequency')
    plt.title('Fact Incorporation Correctness')
    plt.legend()
    
    # Coherence distribution
    plt.subplot(1, 3, 2)
    plt.hist(coherence_scores, bins=10, alpha=0.7, edgecolor='black')
    plt.axvline(np.mean(coherence_scores), color='red', linestyle='--',
                label=f'Mean: {np.mean(coherence_scores):.3f}')
    plt.xlabel('Coherence Score')
    plt.ylabel('Frequency')
    plt.title('Text Coherence Quality')
    plt.legend()
    
    # Dissimilarity distribution
    plt.subplot(1, 3, 3)
    plt.hist(dissimilarity_scores, bins=10, alpha=0.7, edgecolor='black')
    plt.axvline(np.mean(dissimilarity_scores), color='red', linestyle='--',
                label=f'Mean: {np.mean(dissimilarity_scores):.3f}')
    plt.xlabel('Dissimilarity Score')
    plt.ylabel('Frequency')
    plt.title('Dissimilarity from Original')
    plt.legend()
    
    plt.tight_layout()
    plt.show()
    
    # Quality assessment
    print("\n🎯 QUALITY ASSESSMENT:")
    
    avg_correctness = np.mean(correctness_scores)
    if avg_correctness > 0.8:
        print(f"✅ Excellent fact incorporation (mean: {avg_correctness:.3f})")
    elif avg_correctness > 0.6:
        print(f"✅ Good fact incorporation (mean: {avg_correctness:.3f})")
    else:
        print(f"⚠️  Fact incorporation needs improvement (mean: {avg_correctness:.3f})")
    
    avg_coherence = np.mean(coherence_scores)
    if avg_coherence > 0.8:
        print(f"✅ High text coherence (mean: {avg_coherence:.3f})")
    elif avg_coherence > 0.6:
        print(f"✅ Acceptable text coherence (mean: {avg_coherence:.3f})")
    else:
        print(f"⚠️  Text coherence needs improvement (mean: {avg_coherence:.3f})")
    
    avg_dissimilarity = np.mean(dissimilarity_scores)
    if avg_dissimilarity > 0.3:
        print(f"✅ Good dissimilarity from original (mean: {avg_dissimilarity:.3f})")
    else:
        print(f"⚠️  Synthetic articles too similar to originals (mean: {avg_dissimilarity:.3f})")
        
else:
    print("No results available for automatic evaluation.")

## Next Steps Based on Methodology

### Current Status:
- ✅ **Step 1**: Data collection ready
- ✅ **Step 2**: COVID fact schema defined
- ✅ **Step 3**: Structured fact extraction implemented
- ✅ **Step 4**: Fact modification and synthetic generation working

### Workflow Continuation:

1. **Manual Evaluation** (Current)
   - Share evaluation template with 3+ annotators
   - Evaluate fact extraction and modification quality
   - Calculate inter-annotator agreement

2. **If Manual Evaluation Shows Good Results**:
   - Scale to 100 news articles
   - Scale to 100 tweets
   - Repeat evaluation process

3. **If Evaluation Shows Issues**:
   - Refine fact characterization schema
   - Adjust LLM prompts
   - Re-run pipeline on sample data

4. **Full Dataset Processing**:
   - Process complete dataset
   - Generate final synthetic dataset
   - Prepare for classification experiments

5. **Classification Phase**:
   - Train ML/DL models on synthetic data
   - Test on real-world fake news datasets
   - Compare with models trained on real labeled data

### Files Generated:
- `data/synthetic/synthetic_data_pipeline_[timestamp].json` - Full results
- `data/synthetic/synthetic_data_pipeline_[timestamp].csv` - CSV format
- `results/evaluation_template_[timestamp].csv` - Manual annotation template