# LLM Experimentation Notebook

This notebook is designed for **experimenting with LLM-based interactions**, specifically for entity extraction and newsletter processing.

## Purpose

🧪 **LLM Experimentation**: Test different prompts, models, and parameters  
📊 **Entity Extraction Analysis**: Analyze extraction quality and tune confidence thresholds  
🔍 **Prompt Engineering**: Iterate on prompts to improve extraction accuracy  
📈 **Performance Testing**: Test with real newsletter data and measure results  

## What This Notebook Does

- **Imports Production Modules**: Uses the actual FastAPI `EntityExtractor` for testing
- **Prompt Experimentation**: Test different prompt templates and approaches
- **Data Analysis**: Analyze entity extraction results and quality metrics
- **Interactive Testing**: Test with real newsletter content
- **Performance Benchmarking**: Measure extraction speed and accuracy
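
One way to put a number on "accuracy" is to compare extracted entities against a small hand-labeled list. The sketch below is only an illustration: `score_extraction` is a hypothetical helper, not part of the production code, and it assumes entities can be reduced to `(name, type)` pairs matched case-insensitively on the name.

```python
from typing import Dict, Iterable, Tuple


def score_extraction(
    predicted: Iterable[Tuple[str, str]],
    expected: Iterable[Tuple[str, str]],
) -> Dict[str, float]:
    """Score (name, type) pairs with precision, recall, and F1.

    Hypothetical helper: names are compared case-insensitively,
    types must match exactly.
    """
    pred = {(name.strip().lower(), etype) for name, etype in predicted}
    gold = {(name.strip().lower(), etype) for name, etype in expected}
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hand-labeled expectations for a snippet (illustrative values only)
expected = [
    ("OpenAI", "Organization"),
    ("Sam Altman", "Person"),
    ("GPT-5", "Product"),
    ("San Francisco", "Location"),
]
# What an extractor might return for the same snippet
predicted = [
    ("openai", "Organization"),
    ("Sam Altman", "Person"),
    ("Azure", "Product"),
]

scores = score_extraction(predicted, expected)
print({key: round(value, 3) for key, value in scores.items()})
```

With real results, the `predicted` pairs could be built from each entity's `name` and `type` attributes.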

## What This Notebook Doesn't Do

- **Infrastructure Code**: All production logic lives in the `src/` directory
- **Duplicate Implementation**: Uses FastAPI modules directly
- **Production Deployment**: This is for experimentation only

## Quick Start

1. **Start FastAPI Development Server**: `./scripts/dev-server.sh`
2. **Run This Notebook**: Experiment with LLM interactions
3. **Apply Learnings**: Update production code in `src/processors/entity_extractor.py`

---

**📚 For development workflow, see: [docs/DEVELOPMENT-WORKFLOW.md](../docs/DEVELOPMENT-WORKFLOW.md)**

# Setup and Configuration

**Import production modules and setup experimentation environment**

In [None]:
# Setup: Import Production Modules
import os
import sys
from pathlib import Path
from typing import List, Dict, Any
from datetime import datetime
import json
import time

# Add src directory to path for imports
current_dir = Path.cwd()
project_root = current_dir.parent if current_dir.name == 'notebooks' else current_dir
src_path = project_root / 'src'
sys.path.insert(0, str(src_path))

# Load environment variables
from dotenv import load_dotenv
load_dotenv(project_root / ".env.local")

print("🔬 LLM Experimentation Environment")
print("=" * 50)

# Import production modules
try:
    from config import get_settings
    from models.newsletter import Entity, Newsletter
    from processors.entity_extractor import EntityExtractor
    
    # Get production configuration
    config = get_settings()
    
    # Initialize entity extractor with production settings
    entity_extractor = EntityExtractor(config)
    
    print("✅ Production modules imported successfully")
    print(f"✅ EntityExtractor initialized with model: {config.LLM_MODEL}")
    print(f"✅ Confidence threshold: {config.ENTITY_CONFIDENCE_THRESHOLD}")
    print(f"✅ Max entities per newsletter: {config.MAX_ENTITIES_PER_NEWSLETTER}")
    
except ImportError as e:
    print(f"❌ Failed to import production modules: {e}")
    print("   Make sure FastAPI development environment is set up")
    print("   Run: ./scripts/dev-setup.sh")
    entity_extractor = None

print("\n🧪 Ready for LLM experimentation!")

# Test Data for Experimentation

**Sample newsletter content for testing different scenarios**

In [None]:
# Test Data: Sample Newsletter Content

# Collection of test content for experimenting with entity extraction
test_newsletters = {
    "ai_news": """
    OpenAI announced GPT-5 at their developer conference in San Francisco. 
    CEO Sam Altman presented the new capabilities during the OpenAI DevDay 2024 event.
    Microsoft expanded their Azure AI services with enterprise features.
    Google launched Gemini Pro focusing on AI Safety.
    """,
    
    "tech_startup": """
    Y Combinator's latest batch includes 20 AI startups from Silicon Valley.
    Anthropic raised $100M Series B led by Spark Capital.
    Meta's Reality Labs division showcased new VR headsets at CES 2024.
    TechCrunch reported that Apple is developing autonomous vehicles.
    """,
    
    "research_heavy": """
    Stanford University published research on large language models in Nature.
    Professor Fei-Fei Li's computer vision lab achieved breakthrough results.
    MIT and Harvard collaborated on quantum computing applications.
    The Allen Institute for AI released new datasets for machine learning.
    """,
    
    "business_news": """
    Amazon Web Services launched new cloud infrastructure in Tokyo.
    Salesforce acquired data analytics startup MuleSoft for $6.5 billion.
    Tesla reported record quarterly earnings driven by Model 3 sales.
    Netflix expanded into gaming with mobile app development.
    """,
    
    "short_snippet": """
    OpenAI released ChatGPT-4 with improved reasoning capabilities.
    """,
    
    "complex_entities": """
    The World Economic Forum in Davos featured discussions on AI governance.
    European Union's AI Act was implemented across member states.
    NATO established a new cyber defense initiative in Brussels.
    United Nations Climate Change Conference addressed green technology.
    """
}

print("📰 Test newsletter content loaded:")
for name, content in test_newsletters.items():
    word_count = len(content.split())
    print(f"  • {name}: {word_count} words")

print(f"\n📊 Total test cases: {len(test_newsletters)}")
print("🧪 Ready to experiment with different content types!")

# Entity Extraction Testing

**Test the production entity extractor with different content**
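
When tuning the confidence threshold, it can help to see how many entities would survive at each candidate cutoff. The following is a minimal sketch, assuming each entity exposes a numeric `confidence`; `ScoredEntity` and `sweep_thresholds` are hypothetical stand-ins, and the confidence values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class ScoredEntity:
    """Hypothetical stand-in for the production Entity model."""
    name: str
    type: str
    confidence: float


def sweep_thresholds(
    entities: Sequence[ScoredEntity],
    thresholds: Sequence[float],
) -> Dict[float, int]:
    """Count how many entities meet or exceed each threshold."""
    return {
        threshold: sum(1 for e in entities if e.confidence >= threshold)
        for threshold in thresholds
    }


# Illustrative confidences, not real extraction output
sample = [
    ScoredEntity("OpenAI", "Organization", 0.98),
    ScoredEntity("GPT-5", "Product", 0.91),
    ScoredEntity("AI Safety", "Topic", 0.74),
    ScoredEntity("DevDay", "Event", 0.62),
]

kept = sweep_thresholds(sample, [0.6, 0.7, 0.8, 0.9])
for threshold, count in kept.items():
    print(f"threshold {threshold:.1f}: {count} of {len(sample)} entities kept")
```

The same function should work on the entity lists returned by the extractor, as long as they carry a `confidence` attribute.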

In [None]:
# Entity Extraction: Test Production System

def test_entity_extraction(content_name: str, content: str, show_details: bool = True):
    """Test entity extraction with production system."""
    if not entity_extractor:
        print("❌ Entity extractor not available")
        return None
    
    print(f"🔍 Testing: {content_name}")
    print(f"📝 Content: {content[:100]}{'...' if len(content) > 100 else ''}")
    
    # Time the extraction
    start_time = time.time()
    entities = entity_extractor.extract_entities(content)
    extraction_time = time.time() - start_time
    
    print(f"⏱️  Extraction time: {extraction_time:.2f}s")
    print(f"📊 Entities found: {len(entities)}")
    
    if entities and show_details:
        print("\n📋 Entity Details:")
        entity_counts = {}
        for i, entity in enumerate(entities):
            entity_type = entity.type
            entity_counts[entity_type] = entity_counts.get(entity_type, 0) + 1
            
            print(f"  {i+1:2d}. {entity.name:25} | {entity.type:12} | {entity.confidence:.2f}")
            if entity.context:
                print(f"      Context: {entity.context[:80]}{'...' if len(entity.context) > 80 else ''}")
        
        print(f"\n📈 Entity Type Summary:")
        for entity_type, count in sorted(entity_counts.items()):
            print(f"  {entity_type}: {count}")
    
    print("=" * 60)
    return {
        'content_name': content_name,
        'entities': entities,
        'extraction_time': extraction_time,
        'entity_count': len(entities)
    }

# Test all newsletter content
print("🧪 Testing Entity Extraction with All Content Types")
print("=" * 60)

results = {}
for name, content in test_newsletters.items():
    result = test_entity_extraction(name, content)
    if result:
        results[name] = result
    print()

# Summary analysis
if results:
    print("📊 EXTRACTION SUMMARY")
    print("=" * 60)
    total_entities = sum(r['entity_count'] for r in results.values())
    avg_time = sum(r['extraction_time'] for r in results.values()) / len(results)
    
    print(f"Total test cases: {len(results)}")
    print(f"Total entities extracted: {total_entities}")
    print(f"Average extraction time: {avg_time:.2f}s")
    
    print(f"\nPer-test breakdown:")
    for name, result in results.items():
        print(f"  {name:15}: {result['entity_count']:2d} entities in {result['extraction_time']:.2f}s")
    
    print("\n🎯 Use these results to tune prompts and thresholds!")

# Prompt Experimentation

**Experiment with different prompts and analyze results**
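
The experimental prompts in the next cell use a `{content}` placeholder and doubled braces (`{{`, `}}`) around the JSON example. That pattern suggests the template is filled with `str.format`, although how the production extractor actually renders it is not shown here, so treat that as an assumption. The sketch below shows the rendering and a defensive parse of a model reply; `TEMPLATE`, `parse_entities`, and `fake_reply` are hypothetical.

```python
import json
from typing import Any, Dict, List

# Shortened template in the same style as the experimental prompts
TEMPLATE = """Extract key entities from this newsletter content.

Content: {content}

Return JSON format:
{{"entities": [{{"name": "EntityName", "type": "Organization", "confidence": 0.9}}]}}
"""

# str.format fills {content} and collapses each doubled brace to a single one
rendered = TEMPLATE.format(content="OpenAI released a new model.")
print(rendered)


def parse_entities(raw: str) -> List[Dict[str, Any]]:
    """Parse a JSON reply; return an empty list if it is malformed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    entities = payload.get("entities", [])
    if not isinstance(entities, list):
        return []
    # Keep only items that carry the fields the prompt asks for
    return [e for e in entities if isinstance(e, dict) and "name" in e and "type" in e]


# Hand-written reply standing in for a model response
fake_reply = '{"entities": [{"name": "OpenAI", "type": "Organization", "confidence": 0.95}]}'
parsed = parse_entities(fake_reply)
print(parsed)
```

A single unescaped brace in a template would make `str.format` raise, which is worth checking whenever a new prompt is added.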

In [None]:
# Prompt Experimentation: Test Different Approaches

def test_custom_prompt(content: str, custom_prompt: str, test_name: str):
    """Test entity extraction with a custom prompt."""
    if not entity_extractor:
        print("❌ Entity extractor not available")
        return None
    
    print(f"🧪 Testing Custom Prompt: {test_name}")
    print(f"📝 Prompt (first 200 chars): {custom_prompt[:200]}...")
    
    # Temporarily modify the extractor's prompt
    original_prompt = entity_extractor.ENTITY_EXTRACTION_PROMPT
    entity_extractor.ENTITY_EXTRACTION_PROMPT = custom_prompt
    
    try:
        start_time = time.time()
        entities = entity_extractor.extract_entities(content)
        extraction_time = time.time() - start_time
        
        print(f"⏱️  Extraction time: {extraction_time:.2f}s")
        print(f"📊 Entities found: {len(entities)}")
        
        if entities:
            entity_counts = {}
            for entity in entities:
                entity_type = entity.type
                entity_counts[entity_type] = entity_counts.get(entity_type, 0) + 1
            
            print(f"📈 Entity breakdown: {dict(entity_counts)}")
            
            # Show top entities
            top_entities = sorted(entities, key=lambda e: e.confidence, reverse=True)[:5]
            print(f"🎯 Top entities:")
            for i, entity in enumerate(top_entities):
                print(f"  {i+1}. {entity.name} ({entity.type}) - {entity.confidence:.2f}")
        
        print("=" * 60)
        return entities
        
    finally:
        # Restore original prompt
        entity_extractor.ENTITY_EXTRACTION_PROMPT = original_prompt

# Experimental prompts
experimental_prompts = {
    "concise": """
Extract key entities from this newsletter content. Focus on:
- Organizations (companies, institutions)
- People (individuals mentioned)  
- Products (software, hardware, services)
- Events (conferences, launches)
- Locations (cities, countries)
- Topics (subject areas, technologies)

Content: {content}

Return JSON format:
{{"entities": [{{"name": "EntityName", "type": "Organization|Person|Product|Event|Location|Topic", "confidence": 0.9}}]}}
""",

    "detailed_analysis": """
You are an expert information extraction specialist. Analyze this newsletter content and extract entities with high precision.

ENTITY CATEGORIES:
🏢 Organization: Companies, startups, institutions, government bodies
👤 Person: Individuals, executives, researchers, public figures  
🛠️ Product: Software, hardware, services, models, platforms
🎪 Event: Conferences, launches, announcements, meetings
🌍 Location: Cities, countries, regions, specific venues
🧠 Topic: Technologies, fields of study, research areas

EXTRACTION GUIDELINES:
- Only extract entities you are highly confident about (>0.8)
- Provide context showing where each entity was mentioned
- Include aliases or alternative names if present
- Rate confidence from 0.0 to 1.0 based on clarity and context

NEWSLETTER CONTENT:
{content}

REQUIRED OUTPUT (JSON only):
{{"entities": [{{"name": "EntityName", "type": "Category", "aliases": ["Alt1"], "confidence": 0.95, "context": "sentence mentioning entity"}}]}}
""",

    "strict_validation": """
Extract entities from newsletter content with strict validation.

STRICT CRITERIA:
- Only well-known, clearly identifiable entities
- Minimum confidence: 0.9
- Clear context required for each entity
- No ambiguous or unclear references

Entity types: Organization, Person, Product, Event, Location, Topic

Content: {content}

JSON output only: {{"entities": [...]}}
"""
}

# Test different prompts with AI news content
test_content = test_newsletters["ai_news"]

print("🔬 PROMPT EXPERIMENTATION")
print("=" * 60)

prompt_results = {}
for prompt_name, prompt_template in experimental_prompts.items():
    entities = test_custom_prompt(test_content, prompt_template, prompt_name)
    if entities:
        prompt_results[prompt_name] = entities
    print()

# Compare results
if prompt_results:
    print("📊 PROMPT COMPARISON RESULTS")
    print("=" * 60)
    
    for prompt_name, entities in prompt_results.items():
        avg_confidence = sum(e.confidence for e in entities) / len(entities) if entities else 0
        entity_types = set(e.type for e in entities)
        
        print(f"{prompt_name:20}: {len(entities):2d} entities, avg confidence: {avg_confidence:.2f}")
        print(f"{'':20}  Types: {', '.join(sorted(entity_types))}")
    
    print("\n🎯 Use this analysis to improve the production prompt!")
    print("💡 Update src/processors/entity_extractor.py with best performing prompt")

In [None]:
# Cell 5: Neo4j Client - Import from FastAPI
from typing import Dict, List, Optional, Any
from datetime import datetime

print("📊 Importing Neo4j client from FastAPI...")

try:
    # Import the actual Neo4j client from FastAPI
    from graph.neo4j_client import Neo4jClient as FastAPINeo4jClient
    
    print("✅ Successfully imported Neo4jClient from FastAPI")
    
    # Initialize with configuration
    neo4j_client = FastAPINeo4jClient(config)
    
    print("✅ Neo4j client initialized with production configuration")
    
    # Test connection
    if neo4j_client.connect():
        # Display current graph statistics
        stats = neo4j_client.get_graph_stats()
        print("\n📊 Current graph statistics:")
        for key, value in stats.items():
            print(f"  {key.capitalize()}: {value}")
        
        # Test basic query
        test_result = neo4j_client.execute_query("RETURN 'Neo4j is working!' as message")
        if test_result:
            print(f"✅ Test query result: {test_result[0]['message']}")
    else:
        print("⚠️ Neo4j operations will be limited without connection")
    
    print("\n🎯 Now using production Neo4j client directly!")
    
except ImportError as e:
    print(f"❌ Failed to import FastAPI Neo4j client: {e}")
    print("  This is expected during the transition phase.")
    print("  The production Neo4j client is now the single source of truth.")
    print("  For development, use the FastAPI codebase directly with hot reload.")
    
    # Simple fallback Neo4j client
    class Neo4jClient:
        """Fallback Neo4j client."""
        
        def __init__(self, config=None):
            self.connected = False
            print("⚠️ Using fallback Neo4j client (limited functionality)")
        
        def connect(self) -> bool:
            print("⚠️ Fallback Neo4j client - no real connection")
            return False
        
        def execute_query(self, query: str, parameters: Optional[Dict[str, Any]] = None) -> List[Dict]:
            print("⚠️ Fallback Neo4j client - no real query execution")
            return []
        
        def get_graph_stats(self) -> Dict[str, Any]:
            # Returns a message rather than counts, so values are not limited to int
            return {"message": "Use FastAPI development environment for Neo4j operations"}
    
    neo4j_client = Neo4jClient(config)

print("\n✅ Neo4j client setup complete!")

# Newsletter Processor (Complete Workflow)

**FastAPI File**: `/Users/paulbonneville/Developer/arrgh-fastapi/src/workflows/newsletter_processor.py`

**Purpose**: Complete newsletter processing pipeline orchestrating all components for end-to-end workflow.

**What this mirrors**: The complete workflow orchestration from the FastAPI app, including:
- `NewsletterProcessor` class - Main orchestrator for the complete processing pipeline
- `process_newsletter()` - 5-step workflow: HTML cleaning → entity extraction → newsletter node creation → entity processing in the graph → summary generation
- `_generate_text_summary()` - Intelligent summary creation with entity breakdown and metrics
- Component integration: HTMLProcessor, EntityExtractor, Neo4jClient coordination
- Neo4j graph operations: entity creation/updates, relationship linking, newsletter node management
- Performance metrics and processing time tracking
- Comprehensive error handling with rollback capabilities
- Response generation with detailed processing statistics
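
To make the "summary with entity breakdown" idea concrete, here is a rough sketch of how such a summary string could be assembled. It is not the production `_generate_text_summary()`; the function name `summarize_entities`, its signature, and the output wording are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarize_entities(
    subject: str,
    entities: Iterable[Tuple[str, str]],
    seconds: float,
) -> str:
    """Build a one-line summary with a per-type entity count.

    Hypothetical helper: entities are (name, type) pairs.
    """
    counts = Counter(entity_type for _, entity_type in entities)
    total = sum(counts.values())
    breakdown = ", ".join(
        f"{count} {entity_type}" for entity_type, count in sorted(counts.items())
    )
    return f"Processed '{subject}' in {seconds:.2f}s: {total} entities ({breakdown or 'none'})."


summary = summarize_entities(
    "AI Weekly",
    [
        ("OpenAI", "Organization"),
        ("Microsoft", "Organization"),
        ("Sam Altman", "Person"),
    ],
    1.5,
)
print(summary)
```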

---

In [None]:
# Cell 6: Newsletter Processor - Import from FastAPI
import uuid
from datetime import datetime
from typing import Dict, List, Any

print("🔄 Importing Newsletter Processor from FastAPI...")

try:
    # Import the actual NewsletterProcessor from FastAPI
    from workflows.newsletter_processor import NewsletterProcessor as FastAPINewsletterProcessor
    
    print("✅ Successfully imported NewsletterProcessor from FastAPI")
    
    # Initialize with configuration
    newsletter_processor = FastAPINewsletterProcessor(config)
    
    print("✅ Newsletter processor initialized with production configuration")
    print("🔄 Complete pipeline: HTML → Entities → Neo4j → Summary")
    print("📊 Functions: process_newsletter()")
    
    print("\n🎯 Now using production NewsletterProcessor directly!")
    
except ImportError as e:
    print(f"❌ Failed to import FastAPI NewsletterProcessor: {e}")
    print("  This is expected during the transition phase.")
    print("  The production NewsletterProcessor is now the single source of truth.")
    print("  For development, use the FastAPI codebase directly with hot reload.")
    
    # Simple fallback processor that encourages using FastAPI directly
    class NewsletterProcessor:
        """Fallback newsletter processor."""
        
        def __init__(self, config_obj):
            self.config = config_obj
            print("⚠️ Using fallback Newsletter processor (limited functionality)")
            print("  For full functionality, use the FastAPI development environment")
        
        def process_newsletter(self, newsletter: Newsletter) -> NewsletterProcessingResponse:
            """Fallback process that returns minimal response."""
            print("⚠️ Fallback processor - use FastAPI development environment for real processing")
            
            return NewsletterProcessingResponse(
                status="fallback",
                newsletter_id=newsletter.newsletter_id or str(uuid.uuid4()),
                processing_time=0.0,
                entities_extracted=0,
                entities_new=0,
                entities_updated=0,
                entity_summary={},
                text_summary="Use FastAPI development environment for actual newsletter processing",
                errors=["Using fallback processor - switch to FastAPI development"]
            )
    
    newsletter_processor = NewsletterProcessor(config)

print("\n✅ Newsletter Processor setup complete!")

# Testing & Validation (Complete Pipeline Test)

**FastAPI File**: Testing framework (not directly mirrored - notebook-specific utilities)

**Purpose**: Complete pipeline testing and validation with comprehensive test scenarios and performance assessment.

**What this mirrors**: Testing utilities that validate the complete FastAPI workflow, including:
- `run_complete_pipeline_test()` - Full end-to-end pipeline test with realistic AI newsletter content
- `validate_pipeline_results()` - Performance and quality assessment with multiple metrics
- `quick_test()` - Minimal test for rapid debugging and component validation
- Sample data generation with rich entity content (companies, people, products, events, locations)
- Performance benchmarking: processing time, entity extraction quality, success rates
- Error handling validation and pipeline robustness testing
- Neo4j integration testing with graph database operations
- Comprehensive reporting with success/failure analysis and recommendations
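
A single run says little about success rates, so repeated runs can be aggregated. The sketch below assumes each run has been reduced to a plain dict with `status`, `processing_time`, and `entities_extracted` keys (mirroring the response fields used in this notebook); `aggregate_runs` is a hypothetical helper and the numbers are placeholders, not measurements.

```python
from statistics import mean
from typing import Any, Dict, List


def aggregate_runs(runs: List[Dict[str, Any]]) -> Dict[str, float]:
    """Summarize repeated pipeline runs into a few headline metrics."""
    if not runs:
        return {"success_rate": 0.0, "mean_time": 0.0, "mean_entities": 0.0}
    successes = sum(1 for run in runs if run["status"] == "success")
    return {
        "success_rate": successes / len(runs),
        "mean_time": mean(run["processing_time"] for run in runs),
        "mean_entities": mean(run["entities_extracted"] for run in runs),
    }


# Placeholder values for illustration only
runs = [
    {"status": "success", "processing_time": 2.0, "entities_extracted": 10},
    {"status": "success", "processing_time": 4.0, "entities_extracted": 20},
    {"status": "error", "processing_time": 6.0, "entities_extracted": 0},
]

report = aggregate_runs(runs)
print(report)
```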

---

In [92]:
from datetime import datetime, timezone
import time

def run_complete_pipeline_test():
    """Test the complete newsletter processing pipeline."""
    
    # Sample newsletter for testing
    sample_html = """
    <!DOCTYPE html>
    <html>
    <head><title>AI Weekly Newsletter #245</title></head>
    <body>
        <h1>AI Weekly Newsletter #245</h1>
        <p>Welcome to this week's AI updates!</p>
        
        <h2>Major Announcements</h2>
        <p><strong>OpenAI</strong> announced the release of <strong>GPT-5</strong> at their 
        developer conference in <strong>San Francisco</strong>. CEO <strong>Sam Altman</strong> 
        presented the new capabilities during the <strong>OpenAI DevDay 2024</strong> event.</p>
        
        <h2>Company Updates</h2>
        <p><strong>Microsoft</strong> expanded their <strong>Azure AI</strong> services with 
        new enterprise features. The announcement was made by <strong>Satya Nadella</strong> 
        during the <strong>Microsoft Build 2024</strong> conference.</p>
        
        <h2>Industry News</h2>
        <p><strong>Google</strong> launched their new <strong>Gemini Pro</strong> model, 
        focusing on <strong>AI Safety</strong> and <strong>Responsible AI</strong> development. 
        The launch event was held at <strong>Google I/O 2024</strong> in <strong>Mountain View</strong>.</p>
        
        <h2>Educational Content</h2>
        <p><strong>Stanford University</strong> announced a new <strong>AI Safety</strong> course 
        taught by renowned professor <strong>Fei-Fei Li</strong>. The course will cover 
        <strong>Machine Learning</strong> ethics and <strong>AI Alignment</strong>.</p>
        
        <h2>Research Highlights</h2>
        <p>New research on <strong>Quantum Computing</strong> applications in <strong>AI</strong> 
        was published by researchers at <strong>MIT</strong> and <strong>IBM Research</strong>. 
        The paper explores <strong>Quantum Machine Learning</strong> algorithms.</p>
        
        <p>Thanks for reading! See you next week.</p>
        <p>Best regards,<br>The AI Weekly Team</p>
    </body>
    </html>
    """
    
    # Create newsletter object
    newsletter = Newsletter(
        html_content=sample_html,
        subject="AI Weekly Newsletter #245 - Development Test",
        sender="ai-weekly@example.com",
        received_date=datetime.now(timezone.utc),
        newsletter_id=f"ai-weekly-245-test-{int(time.time())}"
    )
    
    print("🚀 Starting complete pipeline test...")
    print(f"  Newsletter: {newsletter.subject}")
    print(f"  Newsletter ID: {newsletter.newsletter_id}")
    
    # Use the newsletter processor
    try:
        response = newsletter_processor.process_newsletter(newsletter)
        
        print(f"\n✅ Pipeline completed: {response.status}")
        print(f"  - Processing time: {response.processing_time:.2f}s")
        print(f"  - Entities extracted: {response.entities_extracted}")
        print(f"  - New entities: {response.entities_new}")
        print(f"  - Updated entities: {response.entities_updated}")
        print(f"  - Entity summary: {response.entity_summary}")
        print(f"  - Text summary: {response.text_summary}")
        
        if response.errors:
            print(f"  - Errors: {response.errors}")
            
        return response
        
    except Exception as e:
        print(f"❌ Pipeline test failed: {e}")
        import traceback
        traceback.print_exc()
        return None

def validate_pipeline_results(response) -> bool:
    """Validate pipeline results and provide feedback."""
    if not response:
        print("❌ No response to validate")
        return False
    
    print("\n📋 Pipeline Validation Results:")
    print(f"  Status: {response.status}")
    print(f"  Processing time: {response.processing_time:.2f} seconds")
    print(f"  Entities extracted: {response.entities_extracted}")
    
    if neo4j_client.connected:
        print(f"  Entities new: {response.entities_new}")
        print(f"  Entities updated: {response.entities_updated}")
    
    if response.errors:
        print(f"  ❌ Errors ({len(response.errors)}):")
        for error in response.errors:
            print(f"    - {error}")
    
    # Performance assessment
    if response.processing_time < 10:
        print("  ⚡ Performance: Excellent (< 10s)")
    elif response.processing_time < 30:
        print("  ✅ Performance: Good (< 30s)")
    else:
        print("  ⚠️ Performance: Slow (>= 30s)")
    
    # Entity extraction assessment
    if response.entities_extracted > 15:
        print("  🎯 Entity extraction: Excellent (> 15 entities)")
    elif response.entities_extracted > 5:
        print("  ✅ Entity extraction: Good (> 5 entities)")
    elif response.entities_extracted > 0:
        print("  ⚠️ Entity extraction: Limited (1-5 entities)")
    else:
        print("  ❌ Entity extraction: Failed (0 entities)")
    
    # `not response.errors` also covers the case where errors is None
    return response.status == 'success' and not response.errors

def quick_test():
    """Quick test with minimal content."""
    test_html = "<html><body><h1>Test</h1><p>OpenAI released GPT-4 in San Francisco.</p></body></html>"
    newsletter = Newsletter(
        html_content=test_html,
        subject="Quick Test",
        sender="test@example.com",
        newsletter_id=f"quick-test-{int(time.time())}"
    )
    
    print("🔬 Running quick test...")
    response = newsletter_processor.process_newsletter(newsletter)
    print(f"Quick test result: {response.status}, {response.entities_extracted} entities")
    return response

# Run the complete pipeline test
print("🔬 Running complete pipeline test...")
test_response = run_complete_pipeline_test()

# Validate results
print("\n" + "="*50)
validation_passed = validate_pipeline_results(test_response)

if validation_passed:
    print("\n🎉 All tests passed! Pipeline is working correctly.")
else:
    print("\n⚠️ Some tests failed. Check the results above.")
    print("\n💡 Running quick test to diagnose...")
    quick_response = quick_test()

print("\n✅ Pipeline testing complete!")

🔬 Running complete pipeline test...
🚀 Starting complete pipeline test...
  Newsletter: AI Weekly Newsletter #245 - Development Test
  Newsletter ID: ai-weekly-245-test-1752119831
🚀 Starting newsletter processing: AI Weekly Newsletter #245 - Development Test
1️⃣ Processing HTML content...
2025-07-09 21:57:11 [info     ] HTML content cleaned           cleaned_length=1058 original_length=1909
2025-07-09 21:57:11 [info     ] Text sections extracted        headers_count=6 links_count=0 lists_count=0 paragraphs_count=8
   ✅ Cleaned text: 1058 characters
2️⃣ Extracting entities...
   ✅ Extracted 19 entities
3️⃣ Creating newsletter node...
   ✅ Newsletter node created
4️⃣ Processing entities in graph...
   ✅ Processed: 0 new, 19 updated, 19 linked
5️⃣ Generating summary...
   ✅ Processing completed in 20.53 seconds

✅ Pipeline com

# Code Export Helper (Sync changes back to FastAPI)

**FastAPI File**: Development utilities (not directly mirrored - notebook-specific development tools)

**Purpose**: Synchronization tools for maintaining code consistency between notebook development and FastAPI production files.

**What this mirrors**: Development workflow utilities that facilitate notebook-to-FastAPI synchronization, including:
- `compare_files()` - File difference analysis between notebook code and FastAPI files
- `export_to_fastapi()` - Safe code export with automatic backup creation
- `sync_status_check()` - Comprehensive status check for all components across notebook and FastAPI
- `validate_fastapi_imports()` - Import validation to ensure exported code works in FastAPI context
- `development_summary()` - Environment status dashboard with configuration and test results
- `quick_development_check()` - Complete development environment validation workflow
- File mapping system for all FastAPI components (config, models, processors, workflows)
- Development lifecycle management with backup and rollback capabilities
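
The list mentions rollback, so here is a small sketch of what a backup-then-restore cycle could look like. It follows the `.backup` suffix used by `export_to_fastapi()` but is otherwise an assumption: `write_with_backup` and `rollback` are hypothetical helpers, and the demo runs against a temporary directory rather than the real FastAPI files.

```python
import shutil
import tempfile
from pathlib import Path


def write_with_backup(target: Path, new_text: str) -> Path:
    """Copy the existing file to <name>.backup, then overwrite it."""
    backup = target.with_name(target.name + ".backup")
    if target.exists():
        shutil.copy2(target, backup)
    target.write_text(new_text, encoding="utf-8")
    return backup


def rollback(target: Path) -> bool:
    """Restore the file from its .backup copy, if one exists."""
    backup = target.with_name(target.name + ".backup")
    if not backup.exists():
        return False
    shutil.copy2(backup, target)
    return True


# Demo in a throwaway directory so no project files are touched
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "entity_extractor.py"
    target.write_text("PROMPT = 'v1'\n", encoding="utf-8")

    write_with_backup(target, "PROMPT = 'v2'\n")
    after_export = target.read_text(encoding="utf-8")

    restored = rollback(target)
    after_rollback = target.read_text(encoding="utf-8")

print(after_export.strip(), "->", after_rollback.strip())
```

Because each export overwrites the single `.backup` file, only the most recent previous version can be restored this way.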

---

In [93]:
import difflib
from pathlib import Path
import shutil

# Define file mappings
FILE_MAPPINGS = {
    'config': '/Users/paulbonneville/Developer/arrgh-fastapi/src/config.py',
    'models': '/Users/paulbonneville/Developer/arrgh-fastapi/src/models/newsletter.py',
    'html_processor': '/Users/paulbonneville/Developer/arrgh-fastapi/src/processors/html_processor.py',
    'entity_extractor': '/Users/paulbonneville/Developer/arrgh-fastapi/src/processors/entity_extractor.py',
    'neo4j_client': '/Users/paulbonneville/Developer/arrgh-fastapi/src/graph/neo4j_client.py',
    'newsletter_processor': '/Users/paulbonneville/Developer/arrgh-fastapi/src/workflows/newsletter_processor.py'
}

def compare_files(notebook_code: str, fastapi_file: str) -> bool:
    """Compare notebook code with FastAPI file."""
    try:
        with open(fastapi_file, 'r') as f:
            fastapi_code = f.read()
        
        # Simple comparison - in production, you'd want more sophisticated matching
        notebook_lines = notebook_code.strip().split('\n')
        fastapi_lines = fastapi_code.strip().split('\n')
        
        # Show diff if different
        if notebook_lines != fastapi_lines:
            print(f"📊 Differences found in {Path(fastapi_file).name}:")
            diff = difflib.unified_diff(
                fastapi_lines, notebook_lines,
                fromfile='FastAPI', tofile='Notebook',
                lineterm='', n=3
            )
            for line in list(diff)[:20]:  # Show first 20 lines of diff
                print(line)
            return False
        else:
            print(f"✅ {Path(fastapi_file).name} matches notebook code")
            return True
            
    except FileNotFoundError:
        print(f"❌ FastAPI file not found: {fastapi_file}")
        return False
    except Exception as e:
        print(f"❌ Error comparing files: {str(e)}")
        return False

def export_to_fastapi(component: str, notebook_code: str, backup: bool = True) -> bool:
    """Export notebook code to FastAPI file."""
    if component not in FILE_MAPPINGS:
        print(f"❌ Unknown component: {component}")
        return False
    
    fastapi_file = FILE_MAPPINGS[component]
    
    try:
        # Create backup if requested
        if backup and Path(fastapi_file).exists():
            backup_file = f"{fastapi_file}.backup"
            shutil.copy2(fastapi_file, backup_file)
            print(f"📋 Backup created: {backup_file}")
        
        # Write notebook code to FastAPI file
        with open(fastapi_file, 'w') as f:
            f.write(notebook_code)
        
        print(f"✅ Exported to {Path(fastapi_file).name}")
        return True
        
    except Exception as e:
        print(f"❌ Export failed: {str(e)}")
        return False

def sync_status_check():
    """Check sync status of all components."""
    print("🔄 Checking sync status with FastAPI...\n")
    
    sync_status = {}
    
    # Note: In a real implementation, you'd extract the actual code from notebook cells
    # This is a placeholder showing the concept
    
    components = ['html_processor', 'entity_extractor', 'neo4j_client', 'newsletter_processor']
    
    for component in components:
        print(f"📦 Checking {component}...")
        
        # This would compare actual notebook cell code with FastAPI files
        # For now, we'll just check if files exist
        fastapi_file = FILE_MAPPINGS[component]
        if Path(fastapi_file).exists():
            sync_status[component] = "✅ File exists"
        else:
            sync_status[component] = "❌ File missing"
        
        print(f"  {sync_status[component]}")
        print()
    
    # Summary
    print("📊 Sync Status Summary:")
    for component, status in sync_status.items():
        print(f"  - {component}: {status}")
    
    return sync_status

def validate_fastapi_imports():
    """Validate that FastAPI can import our changes."""
    print("🔍 Validating FastAPI imports...")
    
    try:
        # Test basic imports
        import config
        print("✅ config.py imports successfully")
        
        from models import newsletter
        print("✅ models/newsletter.py imports successfully")
        
        # Test processor imports
        try:
            from processors import html_processor
            print("✅ processors/html_processor.py imports successfully")
        except ImportError as e:
            print(f"⚠️ html_processor import issue: {e}")
        
        try:
            from processors import entity_extractor
            print("✅ processors/entity_extractor.py imports successfully")
        except ImportError as e:
            print(f"⚠️ entity_extractor import issue: {e}")
        
        return True
        
    except Exception as e:
        print(f"❌ Import validation failed: {str(e)}")
        return False

def development_summary():
    """Provide a summary of the development environment."""
    print("📋 Development Environment Summary")
    print("="*40)
    
    # Configuration status: choose the icon from each check result instead of always printing ✅
    checks = {
        "OpenAI configured": bool(config.openai_api_key and not config.openai_api_key.startswith('sk-your-')),
        "Neo4j connected": bool(neo4j_client.connected) if 'neo4j_client' in globals() else False,
        "HTML Processor ready": 'html_processor' in globals(),
        "Entity Extractor ready": 'openai_client' in globals() and openai_client is not None,
        "Newsletter Processor ready": 'newsletter_processor' in globals(),
    }
    for label, ok in checks.items():
        print(f"{'✅' if ok else '❌'} {label}: {ok}")
    
    # Recent test results
    if 'test_response' in globals() and test_response:
        print(f"\n📊 Last Test Results:")
        print(f"  Status: {test_response.status}")
        print(f"  Entities: {test_response.entities_extracted}")
        print(f"  Time: {test_response.processing_time:.2f}s")
    
    print(f"\n🎯 Ready for development and testing!")

# Helper functions for development workflow
def quick_development_check():
    """Quick check of development status."""
    print("🚀 Quick Development Status Check\n")
    
    # Check sync status
    print("1. Checking sync status...")
    sync_status = sync_status_check()
    
    # Validate components
    print("\n2. Validating components...")
    validation_results = validate_fastapi_imports()
    
    # Development summary
    print("\n3. Development summary...")
    development_summary()
    
    print("\n🎉 Development check completed!")
    return {
        'sync_status': sync_status,
        'validation_results': validation_results
    }

def extract_cell_code(cell_number: int) -> str:
    """Extract code from a specific notebook cell."""
    # This is a placeholder - in a real implementation, you'd parse the notebook JSON
    print(f"📄 Extracting code from cell {cell_number}...")
    print("⚠️  Manual extraction required - copy code from notebook cell")
    return ""

print("✅ Code Export Helper loaded successfully")
print("🔄 Functions: compare_files(), export_to_fastapi(), sync_status_check()")
print("📁 File mappings configured for all components")
print("\n💡 Usage:")
print("  - sync_status_check(): Check sync status of all components")
print("  - quick_development_check(): Run development status check")
print("  - validate_fastapi_imports(): Test FastAPI imports")
print("  - development_summary(): Show current environment status")

# Run quick check
result = quick_development_check()

✅ Code Export Helper loaded successfully
🔄 Functions: compare_files(), export_to_fastapi(), sync_status_check()
📁 File mappings configured for all components

💡 Usage:
  - sync_status_check(): Check sync status of all components
  - quick_development_check(): Run development status check
  - validate_fastapi_imports(): Test FastAPI imports
  - development_summary(): Show current environment status
🚀 Quick Development Status Check

1. Checking sync status...
🔄 Checking sync status with FastAPI...

📦 Checking html_processor...
  ✅ File exists

📦 Checking entity_extractor...
  ✅ File exists

📦 Checking neo4j_client...
  ✅ File exists

📦 Checking newsletter_processor...
  ✅ File exists

📊 Sync Status Summary:
  - html_processor: ✅ File exists
  - entity_extractor: ✅ File exists
  - neo4j_client: ✅ File exists
  - newsletter_processor: ✅ File exists

2. Validating components...
🔍 Validating FastAPI imports...
✅ config.py imports successfully
✅ models/newsletter.py imports successfully
✅ pro
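
The `extract_cell_code()` placeholder in the cell above notes that a real implementation would parse the notebook JSON, and `sync_status_check()` only confirms that files exist. A minimal sketch of both ideas follows, assuming the notebook is stored in the standard `.ipynb` JSON layout (a top-level `cells` list whose entries carry `cell_type` and `source`). The names `read_code_cell` and `is_in_sync` are hypothetical and not part of the project.

```python
import json
from pathlib import Path


def read_code_cell(notebook_path, cell_index):
    """Return the source of the code cell at `cell_index` (0-based, counting code cells only)."""
    notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    code_cells = [c for c in notebook.get("cells", []) if c.get("cell_type") == "code"]
    source = code_cells[cell_index]["source"]
    # The notebook format stores source either as one string or as a list of lines.
    return source if isinstance(source, str) else "".join(source)


def _normalize(text):
    """Drop trailing whitespace so cosmetic differences do not count as drift."""
    return [line.rstrip() for line in text.strip().splitlines()]


def is_in_sync(cell_code, module_path):
    """True when the cell code and the file contain the same lines."""
    file_code = Path(module_path).read_text(encoding="utf-8")
    return _normalize(cell_code) == _normalize(file_code)
```

One possible use is `is_in_sync(read_code_cell(notebook_path, i), FILE_MAPPINGS['html_processor'])` for some cell index `i`; which cell corresponds to which file is left to the caller.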

# Modern Development Workflow Summary

This notebook now serves as an **interactive testing and analysis environment** that imports and uses actual FastAPI production modules.

## **New FastAPI-First Architecture**

### **FastAPI Development** (Primary)
```bash
# Start development server with hot reload
uvicorn src.main:app --reload --port 8000

# Run tests continuously (requires the pytest-watch plugin, which provides `ptw`)
ptw tests/

# Make changes directly in FastAPI codebase
# The notebook picks them up after a module reload or kernel restart
```

### **Notebook Role** (Supporting)
1. **Interactive Testing**: Test production modules with real data
2. **Data Analysis**: Analyze entity extraction quality and results
3. **Graph Exploration**: Query Neo4j interactively
4. **Prompt Engineering**: Experiment with OpenAI prompts

## **Benefits Achieved**

✅ **Single Source of Truth**: All logic lives in FastAPI only  
✅ **Zero Code Duplication**: Notebook imports production modules  
✅ **Instant Synchronization**: Changes in FastAPI are available in the notebook as soon as the module is reloaded  
✅ **Real Testing**: Notebook tests actual production behavior  
✅ **Reduced Maintenance**: No need to sync two codebases  
✅ **Faster Development**: Hot reload + immediate notebook feedback  

## **Development Workflow**

### **1. Primary Development in FastAPI**
- Edit code in `src/` directory
- Use `uvicorn --reload` for instant feedback
- Write and run unit tests with pytest
- Use debugger and logging for exploration

### **2. Interactive Testing in Notebook** 
- Re-import updated FastAPI modules (reload the module, or enable IPython's `autoreload` extension)
- Test with real newsletter data
- Analyze entity extraction results
- Visualize Neo4j graph data
- Experiment with prompts and configurations
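
A running kernel does not re-read a module that was already imported, so edits under `src/` need a reload before the notebook sees them. Inside a notebook, IPython's `autoreload` extension (`%load_ext autoreload` followed by `%autoreload 2`) handles this. The sketch below shows the underlying mechanism with `importlib.reload`, using a throwaway module named `demo_processor` in a temporary directory as a hypothetical stand-in for a file under `src/`.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def reload_demo():
    """Import a module, edit its file on disk, reload it, and return both values."""
    with tempfile.TemporaryDirectory() as tmp:
        module_file = Path(tmp) / "demo_processor.py"  # hypothetical module name
        module_file.write_text("THRESHOLD = 1\n", encoding="utf-8")
        sys.path.insert(0, tmp)
        try:
            module = importlib.import_module("demo_processor")
            before = module.THRESHOLD

            # Simulate an edit. The new text has a different length, so the
            # cached bytecode is treated as stale even if the file timestamp
            # falls within the same second.
            module_file.write_text("THRESHOLD = 20\n", encoding="utf-8")
            importlib.invalidate_caches()
            module = importlib.reload(module)
            after = module.THRESHOLD
        finally:
            # Leave the interpreter state as it was found.
            sys.path.remove(tmp)
            sys.modules.pop("demo_processor", None)
    return before, after


print(reload_demo())  # (1, 20)
```

Note that `importlib.reload` re-executes the module but does not update objects created from the old version, so instances built before the reload keep their old class until they are recreated.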

### **3. Quality Assurance**
- All tests in `tests/` directory validate production code
- Notebook testing supplements unit tests with real data
- Entity extraction tests ensure production reliability

## **Next Steps**

1. **Start FastAPI Development**: `uvicorn src.main:app --reload --port 8000`
2. **Run Tests**: `ptw tests/` (requires the pytest-watch plugin)
3. **Use Notebook for Analysis**: Import modules and test interactively
4. **Deploy Changes**: Push to production once the test suite passes and the notebook checks confirm the expected behavior

## **File Structure**
```
src/                          # Production code (single source of truth)
├── processors/
│   ├── entity_extractor.py   # Production entity extraction
│   └── html_processor.py     # Production HTML processing
├── graph/
│   └── neo4j_client.py       # Production Neo4j client
└── workflows/
    └── newsletter_processor.py # Production workflow

tests/                        # Comprehensive test suite
├── test_entity_extractor.py  # Entity extraction tests
├── test_newsletter.py        # Integration tests
└── test_simple.py           # Basic functionality tests

notebooks/                    # Interactive analysis only
└── newsletter_development.ipynb # This notebook (imports from src/)
```
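
The layout above can be checked programmatically before running the notebook. This is a minimal sketch that assumes the four production files sit at the paths shown in the tree; `EXPECTED_FILES` and `missing_files` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Paths copied from the file structure above, relative to the project root.
EXPECTED_FILES = [
    "src/processors/entity_extractor.py",
    "src/processors/html_processor.py",
    "src/graph/neo4j_client.py",
    "src/workflows/newsletter_processor.py",
]


def missing_files(project_root):
    """Return the expected production files that are absent under `project_root`."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files(Path.cwd().parent)` from the `notebooks/` directory would return an empty list when the layout matches.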

**🎯 The notebook is now a consumer of production code, not a parallel implementation!**