# Batch Processing for Synthetic Data Generation

This notebook implements the recommended 3-phase approach for synthetic data generation.

## 📋 Recommended Procedure:

### Phase 1: Initial Testing (10 articles + 10 tweets)
- Test fact extraction and modification process
- Verify system setup and API integration
- Quick quality check before scaling

### Phase 2: Manual Evaluation (100 articles + 100 tweets) 
- Generate larger sample for comprehensive evaluation
- **Manual evaluation** by 3+ annotators using evaluation notebooks
- **Automatic evaluation** with correctness, coherence, dissimilarity metrics
- Check inter-annotator agreement
- **Decision point**: Proceed only if quality is satisfactory

### Phase 3: Full Dataset Processing (700 articles + 700 tweets)
- Process complete dataset after validation
- Continue to classification training
- Final automatic evaluation of all results

## 🎯 Fact Characterization Strategy:
Based on your dataset analysis, vaccination news articles focus on:
- **Vaccine types** (Pfizer, Moderna, AstraZeneca, etc.)
- **Effects and side effects**
- **Death and injury statistics**  
- **Regulatory and policy information**

The fact schemas are designed to capture these patterns consistently.

## ⏱️ Processing Estimates:
- **Phase 1**: ~33 minutes (10+10 items)
- **Phase 2**: ~5.6 hours (100+100 items)
- **Phase 3**: ~38.9 hours (700+700 items)
- **Total cost**: ~$5.60 for Phase 3 (~$6.48 across all three phases, at an estimated $0.004 per item)
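
The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the 100-second request delay and the rough $0.004-per-item cost used later in this notebook; `estimate_phase` is a hypothetical helper, and the result ignores API latency and retries.

```python
from typing import Tuple

# Assumed constants, mirroring values that appear later in this notebook.
REQUEST_DELAY_S = 100       # seconds waited between requests
COST_PER_ITEM_USD = 0.004   # rough per-item cost estimate


def estimate_phase(n_items: int) -> Tuple[float, float]:
    """Return (hours, dollars) for processing n_items one after another."""
    hours = n_items * REQUEST_DELAY_S / 3600
    cost = n_items * COST_PER_ITEM_USD
    return hours, cost


# Item counts are articles + tweets for each phase.
for name, n_items in [("Phase 1", 20), ("Phase 2", 200), ("Phase 3", 1400)]:
    hours, cost = estimate_phase(n_items)
    print(f"{name}: {n_items} items, ~{hours:.1f} h, ~${cost:.2f}")
```

If the delay or the per-item cost changes, only the two constants need updating.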

**Current Phase**: Start with Phase 1 below ⬇️

In [None]:
# Import required libraries
import sys
import os
sys.path.append('../..')

import pandas as pd
import numpy as np
import json
import time
from datetime import datetime, timedelta
from tqdm import tqdm
from typing import List, Dict, Optional

# Import custom modules
from src.data_generation.llm_client import create_llm_client, SyntheticDataResult
from src.utils.data_utils import load_config, load_raw_data, save_processed_data
from src.data_generation.fact_schemas import get_fact_schema

print("Libraries imported successfully!")
print(f"Current time: {datetime.now()}")

In [None]:
# Configuration
config = load_config()
llm_provider = "together"  # Using Together.ai

print("🔧 CONFIGURATION")
print("="*40)
print(f"LLM Provider: {llm_provider}")
print(f"Model: {config['llm'][llm_provider]['model']}")
print(f"Temperature: {config['llm'][llm_provider]['temperature']}")
print(f"Max Tokens: {config['llm'][llm_provider]['max_tokens']}")

# Rate limiting settings
REQUEST_DELAY = 100  # 100 seconds between requests (conservative)
BATCH_SIZE = 10      # Save progress every 10 items
MAX_RETRIES = 3      # Retry failed requests

print(f"\n⏱️ RATE LIMITING")
print(f"Delay between requests: {REQUEST_DELAY}s")
print(f"Batch size for saves: {BATCH_SIZE}")
print(f"Max retries: {MAX_RETRIES}")

In [None]:
# Initialize LLM client
try:
    llm_client = create_llm_client(
        provider=llm_provider,
        model_name=config['llm'][llm_provider]['model'],
        temperature=config['llm'][llm_provider]['temperature'],
        max_tokens=config['llm'][llm_provider]['max_tokens']
    )
    print(f"✅ Successfully initialized {llm_provider} client")
    print(f"Model: {config['llm'][llm_provider]['model']}")
except Exception as e:
    print(f"❌ Error initializing LLM client: {e}")
    print("Please check your TOGETHER_API_KEY in .env file.")
    llm_client = None

In [None]:
# Load your datasets
print("📊 LOADING DATASETS")
print("="*40)

try:
    # Load news articles
    news_df = pd.read_csv("../../data/raw/vaccination_all_news.csv")
    print(f"✅ Loaded {len(news_df):,} news articles (will process 700)")
    print(f"Columns: {list(news_df.columns)}")
    
    # Take first 700 articles
    news_df = news_df.head(700)
    print(f"📊 Using {len(news_df)} articles for processing")
    
    # Display first few rows to understand structure
    print("\nFirst few rows of news data:")
    print(news_df.head(2))
    
except FileNotFoundError:
    print("❌ vaccination_all_news.csv not found in data/raw/")
    news_df = None

try:
    # Load tweets
    tweets_df = pd.read_csv("../../data/raw/vaccination_all_tweets.csv")
    print(f"\n✅ Loaded {len(tweets_df):,} tweets (will process 700)")
    print(f"Columns: {list(tweets_df.columns)}")
    
    # Take first 700 tweets
    tweets_df = tweets_df.head(700)
    print(f"📊 Using {len(tweets_df)} tweets for processing")
    
    # Display first few rows
    print("\nFirst few rows of tweet data:")
    print(tweets_df.head(2))
    
except FileNotFoundError:
    print("❌ vaccination_all_tweets.csv not found in data/raw/")
    tweets_df = None

# Calculate processing estimates for 700 items each
if news_df is not None:
    news_time_hours = (len(news_df) * REQUEST_DELAY) / 3600
    print(f"\n⏱️ Processing time for 700 news articles: {news_time_hours:.1f} hours ({news_time_hours/24:.1f} days)")

if tweets_df is not None:
    tweets_time_hours = (len(tweets_df) * REQUEST_DELAY) / 3600
    print(f"⏱️ Processing time for 700 tweets: {tweets_time_hours:.1f} hours ({tweets_time_hours/24:.1f} days)")

if news_df is not None and tweets_df is not None:
    total_time_hours = news_time_hours + tweets_time_hours
    total_cost = (700 + 700) * 0.004  # Estimated cost per item
    print(f"\n💰 TOTAL ESTIMATES:")
    print(f"   Total processing time: {total_time_hours:.1f} hours ({total_time_hours/24:.1f} days)")
    print(f"   Estimated total cost: ${total_cost:.2f}")
    print(f"   Items to process: {700 + 700:,} total (700 articles + 700 tweets)")

In [None]:
# Batch processing function with error handling and progress tracking
def process_batch_robust(data_items: List[str], 
                        content_type: str,
                        start_index: int = 0,
                        max_items: Optional[int] = None,
                        resume_file: Optional[str] = None) -> List[Dict]:
    """
    Robust batch processing with error handling and progress saving
    
    Args:
        data_items: List of content strings to process
        content_type: "news" or "tweets"
        start_index: Index to start processing from (for resuming)
        max_items: Maximum number of items to process (None for all)
        resume_file: File to resume from if it exists
    """
    
    # Load fact schema
    fact_schema = get_fact_schema(content_type)
    
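    # Resume from existing file if specified
    results = []
    if resume_file and os.path.exists(resume_file):
        with open(resume_file, 'r') as f:
            results = json.load(f)
        print(f"📁 Resumed from {resume_file} with {len(results)} existing results")
        # Skipped or failed items are not stored, so len(results) can lag behind
        # the true position; continue after the highest saved index instead.
        start_index = max((r['generation_info']['index'] for r in results), default=-1) + 1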
    
    # Determine processing range
    end_index = len(data_items) if max_items is None else min(start_index + max_items, len(data_items))
    items_to_process = data_items[start_index:end_index]
    
    print(f"\n🚀 STARTING BATCH PROCESSING")
    print(f"Content type: {content_type}")
    print(f"Processing items {start_index} to {end_index-1} ({len(items_to_process)} items)")
    print(f"Estimated time: {(len(items_to_process) * REQUEST_DELAY) / 3600:.1f} hours")
    
    # Create results directory
    os.makedirs("../../results", exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    progress_file = f"../../results/{content_type}_batch_progress_{timestamp}.json"
    
    start_time = datetime.now()
    
    for i, content in enumerate(tqdm(items_to_process, desc=f"Processing {content_type}")):
        actual_index = start_index + i
        retry_count = 0
        success = False
        
        while retry_count < MAX_RETRIES and not success:
            try:
                print(f"\n--- Processing item {actual_index + 1}/{len(data_items)} (attempt {retry_count + 1}) ---")
                
                # Extract facts with 3-fact limit
                extracted_facts = llm_client.extract_structured_facts(content, fact_schema, max_facts=3)
                
                if extracted_facts.extracted_facts:
                    # Modify facts
                    modified_facts = llm_client.modify_facts(extracted_facts.extracted_facts)
                    
                    # Generate synthetic content
                    synthetic_content = llm_client.generate_synthetic_article(content, modified_facts)
                    
                    result = {
                        'generation_info': {
                            'timestamp': datetime.now().isoformat(),
                            'content_type': content_type,
                            'model': config['llm'][llm_provider]['model'],
                            'provider': llm_provider,
                            'max_facts_limit': 3,
                            'index': actual_index
                        },
                        'original_content': content,
                        'extracted_facts': extracted_facts.extracted_facts,
                        'modified_facts': modified_facts,
                        'synthetic_content': synthetic_content,
                        'fact_count': {
                            'extracted': len(extracted_facts.extracted_facts),
                            'modified': len(modified_facts)
                        }
                    }
                    
                    results.append(result)
                    success = True
                    print(f"✅ Successfully processed item {actual_index + 1}")
                    print(f"   Extracted: {len(extracted_facts.extracted_facts)} facts")
                    print(f"   Modified: {len(modified_facts)} facts")
                    
                else:
                    print(f"⚠️ No facts extracted from item {actual_index + 1}")
                    success = True  # Don't retry if no facts found
                    
            except Exception as e:
                retry_count += 1
                print(f"❌ Error processing item {actual_index + 1} (attempt {retry_count}): {e}")
                
                if retry_count < MAX_RETRIES:
                    print(f"   Retrying in {REQUEST_DELAY}s...")
                    time.sleep(REQUEST_DELAY)
                else:
                    print(f"   Giving up after {MAX_RETRIES} attempts")
        
        # Save progress every BATCH_SIZE items
        if (i + 1) % BATCH_SIZE == 0 or i == len(items_to_process) - 1:
            with open(progress_file, 'w') as f:
                json.dump(results, f, indent=2)
            
            elapsed = datetime.now() - start_time
            # Base the ETA on items attempted in this run; len(results) also counts
            # resumed results and omits skipped or failed items.
            attempted = i + 1
            avg_time_per_item = elapsed.total_seconds() / attempted
            remaining_items = len(items_to_process) - attempted
            eta = datetime.now() + timedelta(seconds=remaining_items * avg_time_per_item)
            
            print(f"\n💾 Progress saved: {len(results)} results after {attempted}/{len(items_to_process)} items")
            print(f"   Elapsed time: {elapsed}")
            print(f"   ETA: {eta.strftime('%Y-%m-%d %H:%M:%S')}")
        
        # Rate limiting - wait between requests
        if i < len(items_to_process) - 1:
            print(f"⏳ Waiting {REQUEST_DELAY}s for rate limiting...")
            time.sleep(REQUEST_DELAY)
    
    # Final save
    final_file = f"../../results/{content_type}_batch_final_{timestamp}.json"
    with open(final_file, 'w') as f:
        json.dump(results, f, indent=2)
    
    total_time = datetime.now() - start_time
    print(f"\n🎉 BATCH PROCESSING COMPLETED!")
    print(f"Total items processed: {len(results)}/{len(items_to_process)}")
    print(f"Total time: {total_time}")
    print(f"Final results saved to: {final_file}")
    
    return results

print("Batch processing function defined!")

## Processing Options

Choose one of the options below based on your needs:

### Option 1: Small Test Batch (Recommended for first run)
Process 10-20 items to test your setup

In [None]:
# PHASE 1: Initial Testing - 10 News Articles
if llm_client and news_df is not None:
    print("🧪 PHASE 1: INITIAL TESTING - 10 NEWS ARTICLES")
    print("="*50)
    print("This is the first phase to test your setup and verify fact extraction quality.")
    
    # Use the content column - adjust column name if different
    content_column = 'content'  # Change this if your column name is different
    
    if content_column in news_df.columns:
        phase1_articles = news_df[content_column].head(10).tolist()
        
        phase1_results = process_batch_robust(
            data_items=phase1_articles,
            content_type="news",
            max_items=10
        )
        
        print(f"\n📊 PHASE 1 RESULTS SUMMARY:")
        print(f"Successfully processed: {len(phase1_results)}/10 articles")
        
        if phase1_results:
            total_facts = sum(r['fact_count']['extracted'] for r in phase1_results)
            print(f"Total facts extracted: {total_facts}")
            print(f"Average facts per article: {total_facts/len(phase1_results):.1f}")
            
            # Save Phase 1 results for evaluation
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            phase1_file = f"../../results/phase1_news_{timestamp}.json"
            with open(phase1_file, 'w') as f:
                json.dump(phase1_results, f, indent=2)
            print(f"💾 Phase 1 results saved: {phase1_file}")
            
        print(f"\n✅ Phase 1 completed successfully!")
        print(f"📋 NEXT STEPS:")
        print(f"1. Review the generated examples in results/")
        print(f"2. If quality looks good, proceed to Phase 1 tweets")
        print(f"3. After both Phase 1 tests, move to Phase 2 (100 items each)")
            
    else:
        print(f"❌ Column '{content_column}' not found in news data")
        print(f"Available columns: {list(news_df.columns)}")
        
else:
    print("⚠️ Skipping Phase 1 - no LLM client or data available")

### Option 2: Medium Batch 
Process 50-100 items for more substantial testing

In [None]:
# Option 2: Medium batch (uncomment to run)
# WARNING: This will take ~1.5 hours to complete

"""
if llm_client and news_df is not None:
    print("📈 RUNNING MEDIUM BATCH - 50 NEWS ARTICLES")
    print("="*50)
    print("⚠️ This will take approximately 1.5 hours to complete")
    
    content_column = 'content'  # Adjust if needed
    
    if content_column in news_df.columns:
        medium_articles = news_df[content_column].head(50).tolist()
        
        medium_results = process_batch_robust(
            data_items=medium_articles,
            content_type="news",
            max_items=50
        )
        
        print(f"\\n📊 MEDIUM BATCH RESULTS:")
        print(f"Successfully processed: {len(medium_results)}/50 articles")
        print(f"Remaining articles to process: {len(news_df) - 50}")
"""

print("Medium batch code available (commented out for safety)")
print("After successful test, you can process larger batches or the full 700 articles")

### Option 3: Full Dataset Processing
Process all your data (use with caution: at the 100-second request delay, 700 articles take about 19.4 hours, and articles plus tweets about 1.6 days)

In [None]:
# Option 3: Full dataset processing - 700 articles
# WARNING: This will take ~19.4 hours to complete

"""
if llm_client and news_df is not None:
    print("🏭 FULL DATASET PROCESSING - ALL 700 NEWS ARTICLES")
    print("="*50)
    print(f"⚠️ This will process all {len(news_df):,} articles")
    print(f"⚠️ Estimated time: {(len(news_df) * REQUEST_DELAY) / 3600:.1f} hours")
    print(f"⚠️ Estimated cost: ${len(news_df) * 0.004:.2f}")
    print("⚠️ Make sure you have sufficient API credits!")
    
    # Uncomment below to run full processing
    # content_column = 'content'
    # 
    # if content_column in news_df.columns:
    #     all_articles = news_df[content_column].tolist()
    #     
    #     full_results = process_batch_robust(
    #         data_items=all_articles,
    #         content_type="news"
    #     )
    #
    #     print(f"\\n🎉 COMPLETED: Processed {len(full_results)}/700 articles")
"""

print("Full dataset processing code available (commented out for safety)")
print("Processes all 700 articles - uncomment when ready for full run")

### Tweet Processing
Similar batch processing for tweets

In [None]:
# PHASE 1: Initial Testing - 10 Tweets
if llm_client and tweets_df is not None:
    print("🐦 PHASE 1: INITIAL TESTING - 10 TWEETS")
    print("="*40)
    print("Complete Phase 1 testing with tweets to verify tweet processing.")
    
    # Find the text column in tweets (common names: 'text', 'content', 'tweet_text')
    text_columns = ['text', 'content', 'tweet_text', 'full_text']
    tweet_text_column = None
    
    for col in text_columns:
        if col in tweets_df.columns:
            tweet_text_column = col
            break
    
    if tweet_text_column:
        print(f"Using column '{tweet_text_column}' for tweet text")
        
        # Process first 10 tweets for Phase 1
        phase1_tweets = tweets_df[tweet_text_column].head(10).tolist()
        
        phase1_tweet_results = process_batch_robust(
            data_items=phase1_tweets,
            content_type="tweets",
            max_items=10
        )
        
        print(f"\n📊 PHASE 1 TWEET RESULTS:")
        print(f"Successfully processed: {len(phase1_tweet_results)}/10 tweets")
        
        if phase1_tweet_results:
            total_facts = sum(r['fact_count']['extracted'] for r in phase1_tweet_results)
            print(f"Total facts extracted: {total_facts}")
            print(f"Average facts per tweet: {total_facts/len(phase1_tweet_results):.1f}")
            
            # Save Phase 1 tweet results
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            phase1_tweet_file = f"../../results/phase1_tweets_{timestamp}.json"
            with open(phase1_tweet_file, 'w') as f:
                json.dump(phase1_tweet_results, f, indent=2)
            print(f"💾 Phase 1 tweet results saved: {phase1_tweet_file}")
        
        print(f"\n✅ PHASE 1 COMPLETE!")
        print(f"📋 EVALUATION & NEXT STEPS:")
        print(f"1. Review both news and tweet results in ../../results/")
        print(f"2. Check fact extraction quality manually")
        print(f"3. If satisfied, proceed to Phase 2 (100 articles + 100 tweets)")
        print(f"4. Use evaluation notebooks for comprehensive assessment")
        
    else:
        print(f"❌ Could not find tweet text column")
        print(f"Available columns: {list(tweets_df.columns)}")
        print("Please manually specify the correct column name")
        
else:
    print("⚠️ Skipping Phase 1 tweets - no data or LLM client available")

## Summary and Next Steps

After running your batch processing:

1. **Check Results**: Results are saved in `../../results/` directory
2. **Monitor Progress**: Each batch saves intermediate progress files
3. **Resume Processing**: You can resume from progress files if interrupted
4. **Scale Gradually**: Start small, then increase batch sizes
5. **Monitor Costs**: Together.ai charges $1.12 per 1M tokens
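
Resuming relies on the `resume_file` argument of `process_batch_robust` and on the `*_batch_progress_<timestamp>.json` files it writes. The sketch below shows one way to locate the newest progress file and work out where to continue; `find_latest_progress` and `load_checkpoint` are hypothetical helpers, not part of the project code, and they assume each saved result carries `generation_info['index']` as written by the batch function.

```python
import glob
import json
import os
from typing import List, Optional, Tuple


def find_latest_progress(results_dir: str, content_type: str) -> Optional[str]:
    """Return the newest progress file for content_type, or None if there is none."""
    pattern = os.path.join(results_dir, f"{content_type}_batch_progress_*.json")
    # File names end in YYYYmmdd_HHMMSS, so lexical order is chronological order.
    matches = sorted(glob.glob(pattern))
    return matches[-1] if matches else None


def load_checkpoint(path: Optional[str]) -> Tuple[List[dict], int]:
    """Load saved results and return (results, index of the next item to process)."""
    if not path or not os.path.exists(path):
        return [], 0
    with open(path, "r", encoding="utf-8") as f:
        results = json.load(f)
    # Skipped or failed items are not saved, so use the highest stored index
    # rather than the number of results.
    last_index = max((r["generation_info"]["index"] for r in results), default=-1)
    return results, last_index + 1


latest = find_latest_progress("../../results", "news")
saved, next_index = load_checkpoint(latest)
print(f"Latest progress file: {latest}; next item index: {next_index}")
```

The path returned by `find_latest_progress` can then be passed as `resume_file` when calling `process_batch_robust` again with the same `data_items`.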

### Your Dataset Processing Estimates:
- **700 news articles**: ~19.4 hours, ~$2.80 cost
- **700 tweets**: ~19.4 hours, ~$2.80 cost
- **Total**: ~38.9 hours (1.6 days), ~$5.60 total cost

### Recommended Processing Strategy:
1. **Start Small**: Run test batches (5-10 items) first
2. **Verify Quality**: Check generated examples in `results/` 
3. **Scale Up**: Process 50-100 items to test stability
4. **Full Processing**: Run all 700 articles, then all 700 tweets
5. **Monitor Progress**: Save points every 10 items allow resume if interrupted

### Processing Schedule Recommendation:
- **Day 1**: Test batches + first 100 articles (~3 hours)
- **Day 2**: Remaining 600 articles (~17 hours) 
- **Day 3**: All 700 tweets (~19 hours)

This gives you a manageable 3-day processing schedule with plenty of checkpoints!
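
A schedule like this can be turned into concrete `start_index` and `max_items` values for `process_batch_robust`. The following sketch splits a dataset into sessions that fit a daily time budget; `plan_sessions` is a hypothetical helper, the 8-hour budget is only an example, and the calculation assumes the 100-second request delay dominates the run time.

```python
from typing import List, Tuple

REQUEST_DELAY_S = 100  # assumed to match REQUEST_DELAY in this notebook


def plan_sessions(total_items: int, hours_per_session: float) -> List[Tuple[int, int]]:
    """Split total_items into (start_index, max_items) chunks that fit the time budget."""
    # Number of items that fit in one session, at least one.
    per_session = max(1, int(hours_per_session * 3600 // REQUEST_DELAY_S))
    return [
        (start, min(per_session, total_items - start))
        for start in range(0, total_items, per_session)
    ]


for start, count in plan_sessions(700, hours_per_session=8):
    print(f"start_index={start}, max_items={count}")
```

Each printed pair corresponds to one call of the batch function over the same list of items.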

## Phase 2: Evaluation Testing (100 Items Each)

After reviewing Phase 1 results and confirming the system works correctly, proceed to Phase 2 for evaluation purposes.

In [None]:
# PHASE 2: Evaluation Testing - 100 Articles + 100 Tweets
print("\n🔬 PHASE 2: EVALUATION TESTING - 100 ITEMS EACH")
print("="*50)
print("Complete Phase 2 only after reviewing Phase 1 results.")
print("This phase provides data for manual and automatic evaluation.")

# Phase 2 control - SET TO TRUE AFTER PHASE 1 REVIEW
run_phase2 = False  # Change to True when ready for Phase 2

if run_phase2 and llm_client:
    print("\n📰 Processing 100 News Articles...")
    if news_df is not None:
        phase2_articles = news_df['content'].head(100).tolist()
        
        phase2_news_results = process_batch_robust(
            data_items=phase2_articles,
            content_type="news",
            max_items=100
        )
        
        print(f"News articles processed: {len(phase2_news_results)}/100")
        
        # Calculate Phase 2 news statistics
        if phase2_news_results:
            total_facts = sum(r['fact_count']['extracted'] for r in phase2_news_results)
            print(f"Total facts extracted: {total_facts}")
            print(f"Average facts per article: {total_facts/len(phase2_news_results):.1f}")
    
    print("\n🐦 Processing 100 Tweets...")
    if tweets_df is not None and globals().get('tweet_text_column'):
        phase2_tweets = tweets_df[tweet_text_column].head(100).tolist()
        
        phase2_tweet_results = process_batch_robust(
            data_items=phase2_tweets,
            content_type="tweets",
            max_items=100
        )
        
        print(f"Tweets processed: {len(phase2_tweet_results)}/100")
        
        # Calculate Phase 2 tweet statistics
        if phase2_tweet_results:
            total_facts = sum(r['fact_count']['extracted'] for r in phase2_tweet_results)
            print(f"Total facts extracted: {total_facts}")
            print(f"Average facts per tweet: {total_facts/len(phase2_tweet_results):.1f}")
    
    # Save Phase 2 results
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    
    if 'phase2_news_results' in locals():
        phase2_news_file = f"../../results/phase2_news_{timestamp}.json"
        with open(phase2_news_file, 'w') as f:
            json.dump(phase2_news_results, f, indent=2)
        print(f"💾 Phase 2 news results saved: {phase2_news_file}")
    
    if 'phase2_tweet_results' in locals():
        phase2_tweet_file = f"../../results/phase2_tweets_{timestamp}.json"
        with open(phase2_tweet_file, 'w') as f:
            json.dump(phase2_tweet_results, f, indent=2)
        print(f"💾 Phase 2 tweet results saved: {phase2_tweet_file}")
    
    print(f"\n✅ PHASE 2 COMPLETE!")
    print(f"📋 EVALUATION REQUIRED BEFORE PHASE 3:")
    print(f"1. Use ../evaluation/01_manual_evaluation.ipynb for manual annotation")  
    print(f"2. Recruit 3+ evaluators for inter-annotator agreement")
    print(f"3. Use ../evaluation/02_automatic_evaluation.ipynb for automatic metrics")
    print(f"4. Check the Kappa score (κ); Cohen's Kappa is pairwise, so with 3+ evaluators use Fleiss' Kappa or the mean pairwise value:")
    print(f"   - κ > 0.60: Proceed to Phase 3")
    print(f"   - κ 0.40-0.60: Improve guidelines, re-evaluate")
    print(f"   - κ < 0.40: Major methodology changes needed")
    
else:
    print("⚠️ Phase 2 disabled. Set run_phase2 = True after Phase 1 review")
    print("📋 Before enabling Phase 2:")
    print("1. Review Phase 1 results thoroughly")
    print("2. Verify fact extraction quality")
    print("3. Check error handling works correctly")
    print("4. Ensure you're ready for evaluation process")

## Phase 3: Full Dataset Processing (700 Items Each)

Only proceed to Phase 3 after successful Phase 2 evaluation with satisfactory inter-annotator agreement (κ > 0.60).
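
Cohen's Kappa is defined for two annotators; with three or more, Fleiss' Kappa is a common generalization. The sketch below computes Fleiss' Kappa from raw labels and applies the thresholds this notebook uses for the Phase 3 decision. It is an illustrative stand-in, not the code in the evaluation notebooks; `fleiss_kappa` and `phase3_decision` are hypothetical names, and every item is assumed to have the same number of ratings.

```python
from collections import Counter
from typing import Hashable, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """ratings[i] holds the labels that all annotators assigned to item i."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of annotator pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    expected = sum((c / (n_items * n_raters)) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return 1.0  # every rating fell into a single category
    return (observed - expected) / (1 - expected)


def phase3_decision(kappa: float) -> str:
    """Map a kappa value to the decision rule stated in this notebook."""
    if kappa > 0.60:
        return "proceed to Phase 3"
    if kappa >= 0.40:
        return "improve guidelines and re-evaluate"
    return "major methodology changes needed"


# Toy example: 4 items, 3 annotators each, labels are made up.
example = [["ok", "ok", "ok"], ["bad", "bad", "bad"], ["ok", "ok", "bad"], ["ok", "ok", "ok"]]
kappa = fleiss_kappa(example)
print(f"kappa = {kappa:.2f}: {phase3_decision(kappa)}")
```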

In [None]:
# PHASE 3: Full Dataset Processing - 700 Articles + 700 Tweets
print("\n🚀 PHASE 3: FULL DATASET PROCESSING - 700 ITEMS EACH")
print("="*55)
print("Complete Phase 3 only after successful Phase 2 evaluation.")
print("This is the final production run for the complete dataset.")

# Phase 3 control - SET TO TRUE AFTER SUCCESSFUL PHASE 2 EVALUATION
run_phase3 = False  # Change to True after Phase 2 evaluation success

if run_phase3 and llm_client:
    print("\n📊 ESTIMATED PROCESSING TIME & COST:")
    print(f"- Articles: ~{700 * REQUEST_DELAY / 3600:.1f} hours (700 × {REQUEST_DELAY}s delay)")
    print(f"- Tweets: ~{700 * REQUEST_DELAY / 3600:.1f} hours (700 × {REQUEST_DELAY}s delay)")
    print(f"- Total time: ~{1400 * REQUEST_DELAY / 3600:.1f} hours")
    print(f"- Estimated cost: ~${1400 * 0.004:.2f} (1,400 items × $0.004 per item)")
    print(f"- Rate limiting: one request every {REQUEST_DELAY}s")
    
    confirm = input("\n⚠️  FINAL CONFIRMATION: Process full dataset? (yes/no): ")
    
    if confirm.lower() == 'yes':
        print("\n📰 Processing 700 News Articles...")
        if news_df is not None:
            # Take exactly 700 articles
            phase3_articles = news_df['content'].head(700).tolist()
            
            phase3_news_results = process_batch_robust(
                data_items=phase3_articles,
                content_type="news",
                max_items=700
            )
            
            print(f"News articles processed: {len(phase3_news_results)}/700")
            
            # Calculate comprehensive statistics
            if phase3_news_results:
                total_facts = sum(r['fact_count']['extracted'] for r in phase3_news_results)
                total_modified = sum(r['fact_count']['modified'] for r in phase3_news_results)
                success_rate = len(phase3_news_results) / 700 * 100
                
                print(f"Success rate: {success_rate:.1f}%")
                print(f"Total facts extracted: {total_facts}")
                print(f"Total facts modified: {total_modified}")
                print(f"Average facts per article: {total_facts/len(phase3_news_results):.1f}")
        
        print("\n🐦 Processing 700 Tweets...")
        if tweets_df is not None and globals().get('tweet_text_column'):
            # Take exactly 700 tweets
            phase3_tweets = tweets_df[tweet_text_column].head(700).tolist()
            
            phase3_tweet_results = process_batch_robust(
                data_items=phase3_tweets,
                content_type="tweets",
                max_items=700
            )
            
            print(f"Tweets processed: {len(phase3_tweet_results)}/700")
            
            # Calculate comprehensive statistics
            if phase3_tweet_results:
                total_facts = sum(r['fact_count']['extracted'] for r in phase3_tweet_results)
                total_modified = sum(r['fact_count']['modified'] for r in phase3_tweet_results)
                success_rate = len(phase3_tweet_results) / 700 * 100
                
                print(f"Success rate: {success_rate:.1f}%")
                print(f"Total facts extracted: {total_facts}")
                print(f"Total facts modified: {total_modified}")
                print(f"Average facts per tweet: {total_facts/len(phase3_tweet_results):.1f}")
        
        # Save Phase 3 results with detailed metadata
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        
        if 'phase3_news_results' in locals():
            phase3_news_file = f"../../results/FINAL_news_articles_{timestamp}.json"
            
            # Add metadata to results
            final_news_data = {
                "metadata": {
                    "processing_date": timestamp,
                    "phase": "Phase 3 - Full Dataset",
                    "content_type": "news_articles", 
                    "total_items": 700,
                    "successful_items": len(phase3_news_results),
                    "success_rate": len(phase3_news_results) / 700 * 100
                },
                "results": phase3_news_results
            }
            
            with open(phase3_news_file, 'w') as f:
                json.dump(final_news_data, f, indent=2)
            print(f"💾 Phase 3 news results saved: {phase3_news_file}")
        
        if 'phase3_tweet_results' in locals():
            phase3_tweet_file = f"../../results/FINAL_tweets_{timestamp}.json"
            
            # Add metadata to results
            final_tweet_data = {
                "metadata": {
                    "processing_date": timestamp,
                    "phase": "Phase 3 - Full Dataset", 
                    "content_type": "tweets",
                    "total_items": 700,
                    "successful_items": len(phase3_tweet_results),
                    "success_rate": len(phase3_tweet_results) / 700 * 100
                },
                "results": phase3_tweet_results
            }
            
            with open(phase3_tweet_file, 'w') as f:
                json.dump(final_tweet_data, f, indent=2)
            print(f"💾 Phase 3 tweet results saved: {phase3_tweet_file}")
        
        print(f"\n🎉 PHASE 3 COMPLETE - FULL DATASET PROCESSED!")
        print(f"📋 FINAL SUMMARY:")
        print(f"- Total items processed: {len(locals().get('phase3_news_results', [])) + len(locals().get('phase3_tweet_results', []))} / 1400")
        print(f"- Results saved with FINAL_ prefix for easy identification")
        print(f"- Ready for final analysis and classification tasks")
        print(f"- All phases completed successfully! 🚀")
        
    else:
        print("❌ Phase 3 cancelled by user")
        
else:
    print("⚠️ Phase 3 disabled. Set run_phase3 = True after Phase 2 evaluation")
    print("📋 Before enabling Phase 3:")
    print("1. Complete Phase 2 evaluation successfully")
    print("2. Achieve inter-annotator agreement κ > 0.60")
    print("3. Verify automatic evaluation metrics are satisfactory")
    print("4. Ensure sufficient resources for full processing")
    print("5. Have ~39 hours available (runs can be resumed from progress files if interrupted)")

## Final Summary - Complete 3-Phase Processing Approach

This notebook implements a structured 3-phase approach for synthetic data generation:

**Phase 1** (10+10 items): Initial testing and system validation  
**Phase 2** (100+100 items): Evaluation and quality assessment  
**Phase 3** (700+700 items): Full dataset processing  

### Evaluation Requirements
- **Manual Evaluation**: Use `../evaluation/01_manual_evaluation.ipynb`
  - 3+ annotators required for inter-annotator agreement
  - Kappa (κ) threshold: κ > 0.60 to proceed (Cohen's Kappa compares two annotators; with 3+, use Fleiss' Kappa or the mean pairwise Cohen's Kappa)
- **Automatic Evaluation**: Use `../evaluation/02_automatic_evaluation.ipynb`  
  - Correctness, coherence, and dissimilarity metrics
  - Quantitative assessment of generated content quality
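
The dissimilarity metric itself is defined in the evaluation notebook, not here. As a rough illustration of the idea, the sketch below scores how far a synthetic text has moved from its original using token-level `difflib.SequenceMatcher` from the standard library. This is a stand-in under stated assumptions, not the project's metric; `dissimilarity` is a hypothetical name, and the inputs are assumed to be plain strings such as `original_content` and `synthetic_content` from a saved result.

```python
from difflib import SequenceMatcher


def dissimilarity(original: str, synthetic: str) -> float:
    """Return 1 - similarity ratio over whitespace tokens (0.0 means identical)."""
    original_tokens = original.split()
    synthetic_tokens = synthetic.split()
    matcher = SequenceMatcher(None, original_tokens, synthetic_tokens, autojunk=False)
    return 1.0 - matcher.ratio()


# Made-up sentences for illustration only.
score = dissimilarity("the vaccine is safe", "the vaccine is unsafe")
print(f"dissimilarity = {score:.2f}")
```

A very low score suggests the synthetic text is nearly a copy of the original, while a very high one suggests the original context was lost; what counts as acceptable is a project decision.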

### Next Steps After Processing
1. Complete all 3 phases with proper evaluation checkpoints
2. Use evaluation results for classification model training
3. Document findings and methodology for research publication