# Don't Trust – Verify: LLM Agent Verification (OpenAI + Groq)

This notebook demonstrates an end-to-end verification workflow with:
- **Schema validation (Pydantic)** - Ensures structured output
- **Evaluator-driven refinement loops** - Checks factual accuracy
- **Safety checks** - Toxicity and bias detection
- **Controlled simulation** - Demonstrates failure scenarios without API calls
- **Multi-provider support** - OpenAI (recommended) or Groq

## API Provider Options

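The notebook looks for `OPENAI_API_KEY` and then `GROQ_API_KEY`, in that order, and uses the first provider whose key is set.
If neither key is found, it falls back to simulation data.
The setup cell reads the keys with `os.getenv` and does not parse a `.env` file itself, so values kept in `.env` must be exported to the environment before the kernel starts.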

1. **OpenAI** (Recommended) - Most stable, widely supported
   - Set: `OPENAI_API_KEY` environment variable
   - Models: `gpt-4o-mini`, `gpt-4o`, `gpt-3.5-turbo`

2. **Groq** (Fast inference) - Good for testing, can be network-sensitive
   - Set: `GROQ_API_KEY` environment variable  
   - Models: `qwen/qwen3-32b`, `mixtral-8x7b-32768`, `llama-3.1-8b-instant`

3. **Simulation** (No API) - Perfect for demos and learning
   - No setup needed
   - Deterministic scenarios


In [19]:
# Setup & imports
import os
import json
from typing import Optional, Literal, Union
from enum import Enum
from pydantic import BaseModel, Field, ValidationError, field_validator

# Try to import API clients
try:
    from openai import OpenAI
    OPENAI_AVAILABLE = True
except ImportError:
    OPENAI_AVAILABLE = False
    print("⚠️  OpenAI not installed. Run: pip install openai")

try:
    from groq import Groq
    GROQ_AVAILABLE = True
except ImportError:
    GROQ_AVAILABLE = False
    print("⚠️  Groq not installed. Run: pip install groq")

# Configuration
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GROQ_API_KEY = os.getenv("GROQ_API_KEY")

# Determine which provider to use
if OPENAI_API_KEY and OPENAI_AVAILABLE:
    PROVIDER = "openai"
    MODEL_NAME = "gpt-4o-mini"  # Fast, cheap, reliable
    print("🚀 USING OPENAI: " + MODEL_NAME)
elif GROQ_API_KEY and GROQ_AVAILABLE:
    PROVIDER = "groq"
    MODEL_NAME = "llama-3.1-8b-instant"  # Fast Groq model
    print("🚀 USING GROQ: " + MODEL_NAME)
else:
    PROVIDER = "simulation"
    MODEL_NAME = "simulation"
    print("🎭 SIMULATION MODE: No API keys found")
    print("   Set OPENAI_API_KEY or GROQ_API_KEY to use real APIs")

print(f"\nProvider: {PROVIDER}")
print(f"Model: {MODEL_NAME}")

🚀 USING OPENAI: gpt-4o-mini

Provider: openai
Model: gpt-4o-mini


## Test Data: Financial News Article

In [20]:
ARTICLE_TEXT = '''Rivertown Widgets reported Q3 results beating analyst expectations. 
Revenue grew 12% year-over-year to $240M, driven by increased demand in the consumer segment. 
Gross margin improved to 38% from 35% due to operational efficiencies. 
Management announced a $50M share buyback and reaffirmed full-year guidance. 
However, supply chain constraints continue to pressure lead times for new product launches.
'''

print("Article to analyze:")
print(ARTICLE_TEXT)

Article to analyze:
Rivertown Widgets reported Q3 results beating analyst expectations. 
Revenue grew 12% year-over-year to $240M, driven by increased demand in the consumer segment. 
Gross margin improved to 38% from 35% due to operational efficiencies. 
Management announced a $50M share buyback and reaffirmed full-year guidance. 
However, supply chain constraints continue to pressure lead times for new product launches.



## Pydantic Schema

In [21]:
class SummaryOutput(BaseModel):
    """Structured output for financial article summaries."""
    
    title: str = Field(
        ..., 
        min_length=5,
        max_length=100,
        description="Concise title for the summary"
    )
    
    summary: str = Field(
        ..., 
        min_length=20,
        max_length=500,
        description="Brief summary of key points"
    )
    
    key_points: list[str] = Field(
        ..., 
        min_length=2,
        max_length=6,
        description="List of 2-6 key takeaways"
    )
    
    action_items: list[str] = Field(
        ..., 
        min_length=1,
        max_length=5,
        description="Suggested actions or follow-ups"
    )
    
    @field_validator('key_points', 'action_items')
    @classmethod
    def validate_list_items(cls, v):
        """Ensure list items are non-empty strings."""
        if not all(isinstance(item, str) and item.strip() for item in v):
            raise ValueError("All list items must be non-empty strings")
        return v


class EvaluationResult(BaseModel):
    """Result from the evaluator check."""
    verdict: Literal['correct', 'incorrect']
    score: float = Field(ge=0.0, le=1.0)
    reasoning: str


class ToxicityResult(BaseModel):
    """Result from toxicity check."""
    is_toxic: bool
    score: float = Field(ge=0.0, le=1.0)
    reason: str


print("✅ Pydantic schemas defined")

✅ Pydantic schemas defined


## Simulation Engine

Controlled test scenarios for demonstration purposes.

In [None]:
class SimulationScenario(Enum):
    """Defines different failure scenarios for testing."""
    INVALID_JSON = "invalid_json"
    MISSING_FIELD = "missing_field"
    TOXIC_CONTENT = "toxic_content"
    FACTUAL_ERROR = "factual_error"
    SUCCESS = "success"


class SimulationEngine:
    """Generates controlled responses for each scenario."""
    
    def __init__(self):
        self.scenario_cycle = [
            SimulationScenario.INVALID_JSON,
            SimulationScenario.MISSING_FIELD,
            SimulationScenario.TOXIC_CONTENT,
            SimulationScenario.FACTUAL_ERROR,
            SimulationScenario.SUCCESS
        ]
        self.current_index = 0
    
    def get_next_scenario(self) -> SimulationScenario:
        """Get the next scenario in the cycle."""
        scenario = self.scenario_cycle[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.scenario_cycle)
        return scenario
    
    def generate_response(self, scenario: SimulationScenario) -> str:
        """Generate a response for the given scenario."""
        
        if scenario == SimulationScenario.INVALID_JSON:
            return '''{
                "title": "Rivertown Q3 Results",
                "summary": "Strong quarter with revenue growth"
                // Missing closing brace and other fields
            '''
        
        elif scenario == SimulationScenario.MISSING_FIELD:
            return json.dumps({
                "title": "Rivertown Q3 Results",
                "summary": "Rivertown Widgets reported strong Q3 results.",
                "key_points": ["Revenue growth", "Margin improvement"]
            })
        
        elif scenario == SimulationScenario.TOXIC_CONTENT:
            return json.dumps({
                "title": "Rivertown's Garbage Quarter",
                "summary": "This report is trash and management are idiots for focusing on consumer instead of enterprise.",
                "key_points": [
                    "Stupid revenue decisions",
                    "Incompetent supply chain management"
                ],
                "action_items": ["Fire the management team"]
            })
        
        elif scenario == SimulationScenario.FACTUAL_ERROR:
            return json.dumps({
                "title": "Rivertown Q3 Results",
                "summary": "Rivertown beat expectations with 18% YoY revenue growth to $300M, driven by enterprise sales expansion.",
                "key_points": [
                    "Revenue +18% YoY to $300M",
                    "Enterprise segment drove growth",
                    "Margins improved to 42%"
                ],
                "action_items": [
                    "Review enterprise customer contracts",
                    "Analyze cost reduction initiatives"
                ]
            })
        
        else:  # SUCCESS
            return json.dumps({
                "title": "Rivertown Widgets Beats Q3 Expectations",
                "summary": "Rivertown Widgets exceeded analyst expectations in Q3 with 12% YoY revenue growth to $240M, driven by consumer segment demand. Gross margin improved from 35% to 38% through operational efficiencies. Management announced a $50M share buyback while noting ongoing supply chain challenges.",
                "key_points": [
                    "Revenue grew 12% YoY to $240M",
                    "Consumer segment drove growth",
                    "Gross margin improved to 38% (from 35%)",
                    "Announced $50M share buyback program",
                    "Supply chain constraints persist"
                ],
                "action_items": [
                    "Monitor supply chain developments",
                    "Track share buyback execution",
                    "Review full-year guidance updates"
                ]
            })
    
    def evaluate_response(self, response_dict: dict, article: str) -> EvaluationResult:
        """Simulate evaluator check."""
        summary = response_dict.get("summary", "").lower()
        key_points = [kp.lower() for kp in response_dict.get("key_points", [])]
        
        # Check for factual errors
        if "18%" in summary or "$300m" in summary:
            return EvaluationResult(
                verdict='incorrect',
                score=0.3,
                reasoning="Revenue figures are incorrect. Article states 12% growth to $240M, not 18% to $300M."
            )
        
        if "enterprise" in summary or any("enterprise" in kp for kp in key_points):
            return EvaluationResult(
                verdict='incorrect',
                score=0.4,
                reasoning="Incorrect segment attribution. Article states consumer segment drove growth, not enterprise."
            )
        
        if "42%" in summary:
            return EvaluationResult(
                verdict='incorrect',
                score=0.5,
                reasoning="Margin figure is incorrect. Article states 38% gross margin, not 42%."
            )
        
        if not response_dict.get("action_items"):
            return EvaluationResult(
                verdict='incorrect',
                score=0.6,
                reasoning="Missing action items. Summary should include suggested follow-up actions."
            )
        
        return EvaluationResult(
            verdict='correct',
            score=0.92,
            reasoning="Summary is accurate, complete, and includes all key information from the article."
        )
    
    def check_toxicity(self, text: str) -> ToxicityResult:
        """Simulate toxicity check.
        
        Simplified toxicity check for demonstration purposes.
        
        ⚠️ PRODUCTION NOTE: Use dedicated safety APIs:
        - OpenAI Moderation API (free, fast)
        - Azure Content Safety
        - Perspective API (Google)
        - HuggingFace: unitary/toxic-bert
        - Detoxify library
        
        This keyword-based approach demonstrates the verification 
        pattern without adding heavy dependencies.
        """
        toxic_keywords = ['garbage', 'trash', 'idiots', 'stupid', 'incompetent', 'fire']
        
        text_lower = text.lower()
        found_toxic = [word for word in toxic_keywords if word in text_lower]
        
        if found_toxic:
            return ToxicityResult(
                is_toxic=True,
                score=0.85,
                reason=f"Detected inappropriate language: {', '.join(found_toxic)}"
            )
        
        return ToxicityResult(
            is_toxic=False,
            score=0.05,
            reason="No toxic content detected"
        )


sim_engine = SimulationEngine()
print("✅ Simulation engine initialized")

✅ Simulation engine initialized
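The simulated `check_toxicity` above matches keywords as substrings, so the `fire` keyword would also fire on a harmless word such as "firewall". One possible refinement, sketched here with the standard library only, matches whole words instead. `find_flagged_words` is a hypothetical name, and the keyword list is copied from the cell above.

```python
import re

TOXIC_KEYWORDS = ["garbage", "trash", "idiots", "stupid", "incompetent", "fire"]

# \b anchors each keyword to word boundaries so "fire" does not match "firewall"
_TOXIC_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in TOXIC_KEYWORDS) + r")\b",
    flags=re.IGNORECASE,
)


def find_flagged_words(text: str) -> list[str]:
    """Return distinct flagged keywords found as whole words, in order of first appearance."""
    seen: list[str] = []
    for match in _TOXIC_PATTERN.finditer(text):
        word = match.group(1).lower()
        if word not in seen:
            seen.append(word)
    return seen


print(find_flagged_words("Update the firewall rules before launch."))  # → []
print(find_flagged_words("This report is trash and management are idiots."))  # → ['trash', 'idiots']
```

Whole-word matching still flags legitimate phrases such as "fire safety", which is one reason the docstring above points to dedicated moderation services for production use.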


## Multi-Provider LLM Interface

Supports OpenAI, Groq, and simulation mode.

In [23]:
class LLMClient:
    """Unified interface for multiple LLM providers."""
    
    def __init__(self, provider: str = PROVIDER, model: str = MODEL_NAME):
        self.provider = provider
        self.model = model
        self.client = None
        
        if provider == "openai":
            try:
                self.client = OpenAI(api_key=OPENAI_API_KEY)
                # Test the connection
                self.client.models.list()
                print(f"✅ Connected to OpenAI ({model})")
            except Exception as e:
                print(f"❌ OpenAI connection failed: {e}")
                print("   Falling back to simulation mode")
                self.provider = "simulation"
                self.client = None
        
        elif provider == "groq":
            try:
                self.client = Groq(api_key=GROQ_API_KEY)
                print(f"✅ Connected to Groq ({model})")
            except Exception as e:
                print(f"❌ Groq connection failed: {e}")
                print("   Falling back to simulation mode")
                self.provider = "simulation"
                self.client = None
        
        else:
            print("✅ Using simulation mode")
    
    def generate(self, prompt: str, scenario: Optional[SimulationScenario] = None) -> str:
        """Generate a response using the configured provider."""
        
        if self.provider == "simulation":
            if scenario is None:
                scenario = sim_engine.get_next_scenario()
            print(f"   📋 Simulating: {scenario.value}")
            return sim_engine.generate_response(scenario)
        
        elif self.provider == "openai":
            try:
                response = self.client.chat.completions.create(
                    model=self.model,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0.7,
                    max_tokens=800
                )
                return response.choices[0].message.content
            except Exception as e:
                print(f"❌ OpenAI API error: {e}")
                print("   Using simulation fallback")
                return sim_engine.generate_response(SimulationScenario.SUCCESS)
        
        elif self.provider == "groq":
            try:
                response = self.client.chat.completions.create(
                    model=self.model,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0.7,
                    max_tokens=800
                )
                return response.choices[0].message.content
            except Exception as e:
                print(f"❌ Groq API error: {e}")
                print("   Using simulation fallback")
                return sim_engine.generate_response(SimulationScenario.SUCCESS)
    
    def evaluate(self, summary_dict: dict, article: str) -> EvaluationResult:
        """Evaluate summary accuracy."""
        
        if self.provider == "simulation" or self.client is None:
            return sim_engine.evaluate_response(summary_dict, article)
        
        eval_prompt = f"""You are an evaluator. Compare this summary to the article and judge factual accuracy.

Article: {article}

Summary: {json.dumps(summary_dict, indent=2)}

Return ONLY valid JSON with exactly this structure (no markdown, no extra text):
{{
  "verdict": "correct" or "incorrect",
  "score": 0.0 to 1.0,
  "reasoning": "explanation"
}}"""
        
        try:
            raw_response = self.generate(eval_prompt)
            
            # Clean up response (remove markdown if present)
            raw_response = raw_response.strip()
            if raw_response.startswith("```json"):
                raw_response = raw_response.split("```json")[1].split("```")[0]
            elif raw_response.startswith("```"):
                raw_response = raw_response.split("```")[1].split("```")[0]
            
            result = json.loads(raw_response)
            return EvaluationResult(**result)
        except Exception as e:
            print(f"⚠️  Evaluator parse error: {e}")
            print("   Using simulation evaluator")
            return sim_engine.evaluate_response(summary_dict, article)
    
    def check_toxicity(self, text: str) -> ToxicityResult:
        """Check for toxic content."""
        
        if self.provider == "simulation" or self.client is None:
            return sim_engine.check_toxicity(text)
        
        tox_prompt = f"""You are a content safety detector. Analyze this text for toxic, inappropriate, or unprofessional language.

Text: {text}

Return ONLY valid JSON (no markdown):
{{
  "is_toxic": true or false,
  "score": 0.0 to 1.0,
  "reason": "explanation"
}}"""
        
        try:
            raw_response = self.generate(tox_prompt)
            
            # Clean up response
            raw_response = raw_response.strip()
            if raw_response.startswith("```json"):
                raw_response = raw_response.split("```json")[1].split("```")[0]
            elif raw_response.startswith("```"):
                raw_response = raw_response.split("```")[1].split("```")[0]
            
            result = json.loads(raw_response)
            return ToxicityResult(**result)
        except Exception as e:
            print(f"⚠️  Toxicity check error: {e}")
            print("   Using simulation fallback")
            return sim_engine.check_toxicity(text)


print("✅ Multi-provider LLM interface defined")

✅ Multi-provider LLM interface defined
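The markdown-fence cleanup appears in `evaluate`, in `check_toxicity`, and again in the verification loop. A single helper could replace the three copies. The sketch below is one way to write it, with `strip_code_fence` as a hypothetical name. Like the original logic, it only recognizes a bare fence or a `json` language tag, and it assumes at most one fenced block per response.

```python
import json

FENCE = "`" * 3  # Built programmatically so this example can sit inside a fenced block


def strip_code_fence(raw: str) -> str:
    """Return the contents of a leading fenced block, or the stripped text if there is none."""
    text = raw.strip()
    if not text.startswith(FENCE):
        return text
    inner = text[len(FENCE):]
    # Drop an optional "json" language tag right after the opening fence
    if inner.lower().startswith("json"):
        inner = inner[len("json"):]
    # Keep everything before the closing fence, if one is present
    body, _, _ = inner.partition(FENCE)
    return body.strip()


wrapped = FENCE + 'json\n{"verdict": "correct", "score": 0.9}\n' + FENCE
print(json.loads(strip_code_fence(wrapped)))
```

Centralizing the cleanup means a change in how a provider wraps its JSON needs fixing in one place rather than three.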


## Verification Loop

In [24]:
def verification_loop(
    client: LLMClient,
    article: str,
    prompt: str,
    max_rounds: int = 5,
    confidence_threshold: float = 0.75,
    force_scenarios: Optional[list[SimulationScenario]] = None
) -> tuple[Optional[SummaryOutput], Optional[EvaluationResult], bool]:
    """
    Run the verification loop with refinement.
    
    Returns:
        (final_output, evaluation, needs_human_review)
    """
    
    current_prompt = prompt
    
    for round_num in range(1, max_rounds + 1):
        print(f"\n{'='*60}")
        print(f"🔄 Round {round_num}/{max_rounds}")
        print(f"{'='*60}")
        
        # Determine scenario
        scenario = None
        if client.provider == "simulation" and force_scenarios and round_num <= len(force_scenarios):
            scenario = force_scenarios[round_num - 1]
        
        # STEP 1: Generate
        print("\n📝 Step 1: Generating response...")
        full_prompt = f"{current_prompt}\n\nArticle:\n{article}\n\nReturn ONLY valid JSON (no markdown, no extra text) matching this structure:\n{json.dumps(SummaryOutput.model_json_schema(), indent=2)}"
        
        raw_response = client.generate(full_prompt, scenario)
        
        # Clean markdown wrapper if present
        raw_response = raw_response.strip()
        if raw_response.startswith("```json"):
            raw_response = raw_response.split("```json")[1].split("```")[0].strip()
        elif raw_response.startswith("```"):
            raw_response = raw_response.split("```")[1].split("```")[0].strip()
        
        print(f"Raw output ({len(raw_response)} chars):\n{raw_response[:300]}...")
        
        # STEP 2: Parse & Validate
        print("\n🔍 Step 2: Validating structure...")
        try:
            response_dict = json.loads(raw_response)
            summary_obj = SummaryOutput(**response_dict)
            print("✅ Schema validation passed")
        except json.JSONDecodeError as e:
            print(f"❌ Invalid JSON: {e}")
            current_prompt = f"{prompt}\n\nPrevious attempt failed JSON parsing. Return ONLY valid JSON with no markdown formatting, comments, or extra text."
            continue
        except ValidationError as e:
            print(f"❌ Schema validation failed:")
            for error in e.errors():
                print(f"   - {error['loc']}: {error['msg']}")
            current_prompt = f"{prompt}\n\nPrevious attempt failed validation: {e}\nPlease fix these issues and return valid JSON."
            continue
        
        # STEP 3: Toxicity Check
        print("\n🛡️ Step 3: Safety check...")
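        # Check every text field, not only the summary, so a toxic title or bullet is caught too
        text_to_check = "\n".join(
            [summary_obj.title, summary_obj.summary, *summary_obj.key_points, *summary_obj.action_items]
        )
        tox_result = client.check_toxicity(text_to_check)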
        print(f"Toxicity score: {tox_result.score:.2f}")
        
        if tox_result.is_toxic:
            print(f"⚠️ TOXIC CONTENT: {tox_result.reason}")
            current_prompt = f"{prompt}\n\nPrevious attempt contained inappropriate content: {tox_result.reason}\nProvide a professional, neutral summary."
            continue
        else:
            print(f"✅ Safety check passed: {tox_result.reason}")
        
        # STEP 4: Evaluate Accuracy
        print("\n📊 Step 4: Evaluating accuracy...")
        eval_result = client.evaluate(response_dict, article)
        print(f"Verdict: {eval_result.verdict.upper()}")
        print(f"Score: {eval_result.score:.2f}")
        print(f"Reasoning: {eval_result.reasoning}")
        
        # STEP 5: Decision
        if eval_result.verdict == 'correct' and eval_result.score >= confidence_threshold:
            print(f"\n{'='*60}")
            print("✅ ACCEPTED: Response meets quality threshold")
            print(f"{'='*60}")
            return summary_obj, eval_result, False
        
        # Prepare feedback for next round
        print(f"\n🔧 REFINEMENT NEEDED: {eval_result.reasoning}")
        current_prompt = f"{prompt}\n\nEvaluator feedback: {eval_result.reasoning}\nPlease correct these issues."
    
    # Max rounds reached
    print(f"\n{'='*60}")
    print("⚠️ MAX ROUNDS REACHED: Escalating to human review")
    print(f"{'='*60}")
    
    return (
        summary_obj if 'summary_obj' in locals() else None,
        eval_result if 'eval_result' in locals() else None,
        True
    )


print("✅ Verification loop function defined")

✅ Verification loop function defined
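When the round limit is hit, `verification_loop` hands back whatever the most recent rounds produced, which is not necessarily the strongest attempt. A possible variation, not implemented in this notebook, keeps the highest-scoring attempt seen so far. The sketch below uses plain dataclasses and hypothetical names (`Attempt`, `BestAttemptTracker`) so that it runs without Pydantic or an API client; the scores are made-up values for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    """One evaluated round: the parsed output plus the evaluator's score."""
    round_num: int
    output: dict
    score: float
    reasoning: str


class BestAttemptTracker:
    """Remembers the highest-scoring attempt across refinement rounds."""

    def __init__(self) -> None:
        self.best: Optional[Attempt] = None

    def record(self, attempt: Attempt) -> None:
        # Strict comparison keeps the earliest attempt when scores tie
        if self.best is None or attempt.score > self.best.score:
            self.best = attempt


tracker = BestAttemptTracker()
# Illustrative scores only; a real loop would pass the evaluator's result each round
for round_num, score in enumerate([0.69, 0.64, 0.59], start=1):
    tracker.record(Attempt(round_num, {"title": f"draft {round_num}"}, score, "needs work"))

print(tracker.best.round_num, tracker.best.score)  # → 1 0.69
```

Inside the loop, `record` would be called right after the accuracy evaluation, and the escalation branch would return `tracker.best` so the human reviewer starts from the strongest draft.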


## Demo: Run Verification Loop

In [None]:
# Initialize client
client = LLMClient()

# Agent prompt
AGENT_PROMPT = """You are a financial analyst. Summarize this earnings report accurately.
Focus on key metrics, segment performance, and management guidance.
Be factual and professional."""

print("\n" + "="*70)
print("🚀 STARTING VERIFICATION DEMO")
print("="*70)

# Define test scenarios (only used in simulation mode)
test_scenarios = [
    SimulationScenario.INVALID_JSON,
    SimulationScenario.MISSING_FIELD,
    SimulationScenario.TOXIC_CONTENT,
    SimulationScenario.FACTUAL_ERROR,
    SimulationScenario.SUCCESS
]

final_output, final_eval, needs_human = verification_loop(
    client=client,
    article=ARTICLE_TEXT,
    prompt=AGENT_PROMPT,
    max_rounds=6,
    confidence_threshold=0.75,
    force_scenarios=test_scenarios if client.provider == "simulation" else None
)

print("\n" + "="*70)
print("📋 FINAL RESULTS")
print("="*70)

if final_output:
    print("\n✅ Final Output:")
    print(json.dumps(final_output.model_dump(), indent=2))

if final_eval:
    print(f"\n📊 Final Evaluation:")
    print(f"   Verdict: {final_eval.verdict}")
    print(f"   Score: {final_eval.score:.2f}")
    print(f"   Reasoning: {final_eval.reasoning}")

print(f"\n👤 Needs Human Review: {needs_human}")

## Save Results

In [None]:
import datetime

results_summary = {
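    # utcnow() is deprecated since Python 3.12; use a timezone-aware timestamp instead
    "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat().replace("+00:00", "Z"),
    "provider": client.provider,  # Reflects a fallback to simulation, unlike the PROVIDER global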
    "model": MODEL_NAME,
    "final_decision": "HUMAN_REVIEW" if needs_human else (
        "APPROVED" if final_eval and final_eval.verdict == 'correct' else "REJECTED"
    ),
    "confidence_score": final_eval.score if final_eval else 0.0,
    "final_output": final_output.model_dump() if final_output else None,
    "evaluation": final_eval.model_dump() if final_eval else None
}

with open("verification_results.json", "w") as f:
    json.dump(results_summary, f, indent=2)

print(f"✅ Results saved to verification_results.json")
print("\nSummary:")
print(json.dumps(results_summary, indent=2))

In [18]:
# ============================================================================
# Demonstrates Human-in-the-Loop (HITL) Escalation
# ============================================================================

print("\n" + "="*70)
print("🚨 DEMO: HUMAN-IN-THE-LOOP ESCALATION")
print("="*70)
print("\nShowing what happens when verification repeatedly fails...")

# Ambiguous article that's hard to get right
AMBIGUOUS_ARTICLE = """
TechFlow Systems announced Q4 results with revenue "around $200M" 
(exact figures pending audit). Growth was described as "strong" but 
percentage not disclosed. Leadership mentioned "significant" margin 
improvements and "exploring" strategic options. Analyst conference 
call scheduled for next month will provide detailed breakdowns.
"""

# Create a custom evaluator that keeps rejecting
class PerfectionistEvaluator:
    """Evaluator that's never satisfied - to demonstrate HITL escalation."""
    def __init__(self):
        self.round = 0
    
    def evaluate(self, summary_dict, article):
        self.round += 1
        # Always find something wrong with decreasing scores
        reasons = [
            "Vague terms like 'around $200M' need explicit uncertainty markers. Use 'approximately' or 'pending audit' qualifiers.",
            "Revenue qualified now, but 'strong growth' without percentage is too vague. Must note 'percentage not disclosed by company'.",
            "Better, but 'significant margin improvements' is not quantified. Must explicitly state 'company did not provide specific percentage'.",
            "Improved again, but action items must include 'await detailed analyst call next month' given the preliminary nature of data.",
            "Still missing explicit caveats that all figures are preliminary and subject to audit confirmation."
        ]
        
        idx = min(self.round - 1, len(reasons) - 1)
        score = max(0.74 - (self.round * 0.05), 0.50)  # Decreasing scores, never reaches 0.75
        
        return EvaluationResult(
            verdict='incorrect',
            score=score,
            reasoning=reasons[idx]
        )

# Wrapper client
class HITLClient:
    def __init__(self, base_client):
        self.base_client = base_client
        self.provider = base_client.provider
        self.model = base_client.model
        self.client = base_client.client
        self.evaluator = PerfectionistEvaluator()
    
    def generate(self, prompt, scenario=None):
        return self.base_client.generate(prompt, scenario)
    
    def check_toxicity(self, text):
        return self.base_client.check_toxicity(text)
    
    def evaluate(self, summary_dict, article):
        return self.evaluator.evaluate(summary_dict, article)

# Run HITL demo
hitl_client = HITLClient(client)

HITL_PROMPT = """Summarize this preliminary financial report accurately.
Note all uncertainties and pending disclosures explicitly."""

print("\nRunning with max_rounds=3 and threshold=0.75...")
print("(Evaluator will keep rejecting to show HITL escalation)\n")

final_output, final_eval, needs_human = verification_loop(
    client=hitl_client,
    article=AMBIGUOUS_ARTICLE,
    prompt=HITL_PROMPT,
    max_rounds=3,
    confidence_threshold=0.75
)

# Display HITL results
print("\n" + "="*70)
print("📋 RESULTS: HUMAN-IN-THE-LOOP TRIGGERED")
print("="*70)

if needs_human:
    print("\n✅ EXPECTED BEHAVIOR: Escalated to human review")
    print("\n📊 Final Status:")
    print("   • Rounds attempted: 3/3")
    if final_eval:
        # Scores decrease every round, so this is the last score, not the best one
        print(f"   • Final-round score: {final_eval.score:.2f}")
        print("   • Threshold required: 0.75")
        print(f"   • Gap: {0.75 - final_eval.score:.2f}")
        print(f"\n❓ Last issue: {final_eval.reasoning}")
    
    print("\n👤 Human Review Queue:")
    print("   1. Review the LLM's latest attempt below")
    print("   2. Assess if threshold is appropriate for this content")
    print("   3. Manually refine or reject")
    print("   4. Update prompt templates for similar cases")
    
    if final_output:
        print("\n📄 LLM's Latest Attempt (for human to review):")
        print(json.dumps(final_output.model_dump(), indent=2))
    
    print("\n💡 Production Recommendation:")
    print("   • Queue in 'Needs Human Review' dashboard")
    print("   • Priority: HIGH (ambiguous source data)")
    print("   • Assign to: Senior Analyst")
    print("   • Context: Article has preliminary/unaudited figures")

print("\n" + "="*70)
print("✅ HITL Demo Complete")
print("="*70)
print("\nKey Insight: Better to escalate than accept low-quality output!")

Raw output (1184 chars):
{
  "title": "TechFlow Systems Q4 Preliminary Financial Report",
  "summary": "TechFlow Systems has announced its preliminary Q4 results, reporting revenue 'around $200M' with exact figures pending audit. The company described its growth as 'strong', though the percentage growth was not disclosed. L...

🔍 Step 2: Validating structure...
✅ Schema validation passed

🛡️ Step 3: Safety check...
Toxicity score: 0.00
✅ Safety check passed: The text is professional and contains no toxic, inappropriate, or unprofessional language.

📊 Step 4: Evaluating accuracy...
Verdict: INCORRECT
Score: 0.59
Reasoning: Better, but 'significant margin improvements' is not quantified. Must explicitly state 'company did not provide specific percentage'.

🔧 REFINEMENT NEEDED: Better, but 'significant margin improvements' is not quantified. Must explicitly state 'company did not provide specific percentage'.

⚠️ MAX ROUNDS REACHED: Escalating to human review
📋 RESULTS: HUMAN-IN-THE-LOOP 
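
The escalation details printed by the demo (queue, priority, context) could be captured as a structured record rather than console text. The sketch below shows one hypothetical shape for such a record. The `ReviewTicket` class and its field names are assumptions for illustration, not part of the notebook or of any ticketing API, and the placeholder strings stand in for real loop output.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ReviewTicket:
    """Hypothetical record handed to a human reviewer after escalation."""
    article: str
    last_output: Optional[dict]
    last_score: Optional[float]
    last_feedback: Optional[str]
    threshold: float
    rounds_attempted: int
    priority: str = "HIGH"
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize the ticket so it can be written to a queue or a file."""
        return json.dumps(asdict(self), indent=2)


ticket = ReviewTicket(
    article="(article text)",
    last_output={"title": "(draft title)"},
    last_score=0.59,
    last_feedback="(evaluator reasoning)",
    threshold=0.75,
    rounds_attempted=3,
)
print(ticket.to_json())
```

In the notebook, the fields would come from the values returned by `verification_loop` together with the `max_rounds` and `confidence_threshold` arguments passed to it.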