# Credit Risk Multi-Agent Demo with Advanced Document Processing

This notebook demonstrates the enhanced multi-agent system for credit risk analysis with:
- **Tiered Document Processing**: Smart extraction using regex → section vectorization → full RAG
- **Revenue Verification**: Compare reported vs extracted financial data
- **Anomaly Detection**: Identify inconsistencies in sector codes and turnover
- **AI-Powered Suggestions**: Smart recommendations for data corrections

## Architecture Overview

### Core Agents
1. **Data Ingestion Agent**: Fetches company data from Companies House API
2. **Anomaly Detection Agent**: Identifies sector and turnover inconsistencies
3. **Sector Classification Agent**: Suggests correct SIC codes
4. **Turnover Estimation Agent**: Provides revenue estimates

### Document Processing Agents (NEW)
5. **Document Download Agent**: Downloads annual accounts from Companies House
6. **Smart Financial Extraction Agent**: Three-tier extraction system
7. **RAG Document Agent**: Vector-based semantic document analysis

### Processing Tiers
- **Tier 1**: Fast regex pattern matching (target: under 5 seconds)
- **Tier 2**: Section-specific intelligent extraction (target: under 20 seconds)
- **Tier 3**: Full RAG analysis with vector database (target: under 60 seconds)
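
As a rough illustration of what Tier 1 might look like, the sketch below pulls a revenue figure out of plain text with a single regular expression. The pattern, the `extract_revenue` name, and the sample text are assumptions made for illustration; the patterns used by the actual Smart Financial Extraction Agent are not shown in this notebook.

```python
import re
from typing import Optional

# Hypothetical pattern: a "Turnover" or "Revenue" label followed by an amount,
# optionally prefixed with a pound sign and using comma thousands separators.
REVENUE_PATTERN = re.compile(
    r"\b(?:turnover|revenue)\b[^\d£\n]*£?\s*(\d[\d,]*(?:\.\d+)?)",
    re.IGNORECASE,
)


def extract_revenue(text: str) -> Optional[float]:
    """Return the first revenue-like amount found in `text`, or None."""
    match = REVENUE_PATTERN.search(text)
    if match is None:
        return None
    # Drop thousands separators before converting to a number.
    return float(match.group(1).replace(",", ""))


sample = "Profit and loss account\nTurnover: £845,000\nProfit before tax: £155,000"
print(extract_revenue(sample))
```

A single pattern like this only covers simple layouts, which is presumably why the later tiers exist: when nothing matches or confidence is low, the document would move on to Tier 2.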

In [None]:
# Install required packages
%pip install requests pandas numpy matplotlib seaborn plotly

In [None]:
import sys
import os
from datetime import datetime
import json

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Configuration
plt.style.use('default')
sns.set_palette("husl")

print("📊 Libraries loaded successfully")
print(f"🕒 Notebook started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## 🔧 Configuration and Setup

In [None]:
# Add the src directory to Python path
project_root = '/Workspace/Repos/credit_risk_demo/src'  # Adjust path for Databricks
if project_root not in sys.path:
    sys.path.append(project_root)

# Import our agents
try:
    from agents.orchestrator import MultiAgentOrchestrator
    from agents.rag_document_agent import SemanticQuery
    from utils.config_manager import config
    print("✅ Multi-agent system imported successfully")
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("📝 Using mock implementation for demonstration")
    
    # Mock classes for demonstration
    class MockOrchestrator:
        def __init__(self):
            self.session_id = datetime.now().strftime("%Y%m%d_%H%M%S")
        
        def run_enhanced_workflow_with_documents(self, input_data):
            return self._generate_mock_enhanced_results(input_data)
        
        def run_complete_workflow(self, input_data):
            return self._generate_mock_results(input_data)
        
        def _generate_mock_enhanced_results(self, input_data):
            return {
                "success": True,
                "session_id": self.session_id,
                "data": {
                    "companies": self._generate_mock_companies(),
                    "anomalies": self._generate_mock_anomalies(),
                    "suggestions": self._generate_mock_suggestions(),
                    "document_processing": self._generate_mock_document_processing(),
                    "enhanced_analysis": self._generate_mock_enhanced_analysis()
                }
            }
        
        def _generate_mock_results(self, input_data):
            return {
                "success": True,
                "session_id": self.session_id,
                "data": {
                    "companies": self._generate_mock_companies(),
                    "anomalies": self._generate_mock_anomalies(),
                    "suggestions": self._generate_mock_suggestions()
                }
            }
        
        def _generate_mock_companies(self):
            return [
                {
                    "company_number": "12345678",
                    "company_name": "Tech Solutions Ltd",
                    "sic_codes": ["62020"],
                    "turnover": 850000,
                    "status": "active"
                },
                {
                    "company_number": "87654321",
                    "company_name": "Green Energy Co",
                    "sic_codes": ["35110"],
                    "turnover": 1200000,
                    "status": "active"
                },
                {
                    "company_number": "11223344",
                    "company_name": "Retail Fashion Ltd",
                    "sic_codes": ["47710"],
                    "turnover": 650000,
                    "status": "active"
                }
            ]
        
        def _generate_mock_anomalies(self):
            return {
                "sector_anomalies": [
                    {
                        "company_number": "12345678",
                        "company_name": "Tech Solutions Ltd",
                        "anomaly_type": "invalid_sic_code",
                        "confidence": 0.95
                    }
                ],
                "turnover_anomalies": [
                    {
                        "company_number": "11223344",
                        "company_name": "Retail Fashion Ltd",
                        "anomaly_type": "turnover_sector_mismatch",
                        "confidence": 0.85
                    }
                ],
                "summary": {
                    "total_anomalies": 2,
                    "anomaly_rate": 0.67
                }
            }
        
        def _generate_mock_suggestions(self):
            return [
                {
                    "company_number": "12345678",
                    "suggestion_type": "sector_correction",
                    "suggested_sic_code": "62020",
                    "confidence": 0.92,
                    "reasoning": "Based on company name and business activities"
                },
                {
                    "company_number": "11223344",
                    "suggestion_type": "turnover_validation",
                    "suggested_turnover_range": [600000, 800000],
                    "confidence": 0.78,
                    "reasoning": "Industry benchmarks for retail fashion companies"
                }
            ]
        
        def _generate_mock_document_processing(self):
            return {
                "download_summary": {
                    "total_documents": 3,
                    "successful_downloads": 3
                },
                "extraction_summary": {
                    "total_extractions": 3,
                    "successful_extractions": 2,
                    "average_confidence": 0.82
                },
                "rag_summary": {
                    "total_queries": 2,
                    "successful_queries": 2
                },
                "extraction_results": [
                    {
                        "financial_data": {"revenue": 845000, "profit_before_tax": 155000},
                        "extraction_method": "regex_pattern",
                        "confidence": 0.9,
                        "processing_time": 2.3
                    },
                    {
                        "financial_data": {"revenue": 1180000, "profit_before_tax": 225000},
                        "extraction_method": "section_vectorization",
                        "confidence": 0.85,
                        "processing_time": 8.7
                    },
                    {
                        "financial_data": {"revenue": 658000, "profit_before_tax": 89000},
                        "extraction_method": "full_rag",
                        "confidence": 0.72,
                        "processing_time": 45.2
                    }
                ]
            }
        
        def _generate_mock_enhanced_analysis(self):
            return {
                "enhanced_insights": [
                    {
                        "company_name": "Tech Solutions Ltd",
                        "company_number": "12345678",
                        "reported_revenue": 850000,
                        "extracted_revenue": 845000,
                        "discrepancy_percentage": 0.6,
                        "has_discrepancy": False,
                        "extraction_confidence": 0.9,
                        "extraction_method": "regex_pattern"
                    },
                    {
                        "company_name": "Green Energy Co",
                        "company_number": "87654321",
                        "reported_revenue": 1200000,
                        "extracted_revenue": 1180000,
                        "discrepancy_percentage": 1.7,
                        "has_discrepancy": False,
                        "extraction_confidence": 0.85,
                        "extraction_method": "section_vectorization"
                    },
                    {
                        "company_name": "Retail Fashion Ltd",
                        "company_number": "11223344",
                        "reported_revenue": 650000,
                        "extracted_revenue": 658000,
                        "discrepancy_percentage": 1.2,
                        "has_discrepancy": False,
                        "extraction_confidence": 0.72,
                        "extraction_method": "full_rag"
                    }
                ],
                "summary": {
                    "total_companies_analyzed": 3,
                    "companies_with_discrepancies": 0,
                    "average_discrepancy_rate": 1.2,
                    "high_confidence_extractions": 2
                }
            }
    
    MultiAgentOrchestrator = MockOrchestrator
    print("🔄 MockOrchestrator is standing in for MultiAgentOrchestrator")

## 🚀 Demo 1: Enhanced Workflow with Document Processing

In [None]:
# Initialize the orchestrator
orchestrator = MultiAgentOrchestrator()

print(f"🔬 Multi-Agent Orchestrator initialized")
print(f"📋 Session ID: {orchestrator.session_id}")
print(f"🕒 Started at: {datetime.now().strftime('%H:%M:%S')}")

In [None]:
# Define enhanced workflow input with document processing enabled
enhanced_input = {
    "company_numbers": ["12345678", "87654321", "11223344"],
    "enable_document_processing": True,
    "fallback_enabled": True,
    "rag_queries": [
        {
            "query_text": "What was the company's revenue for the latest financial year?",
            "query_type": "financial_data",
            "expected_data_type": "numeric"
        },
        {
            "query_text": "What are the main risk factors mentioned in the annual report?",
            "query_type": "risk_factors",
            "expected_data_type": "text"
        }
    ]
}

print("📋 Enhanced workflow configuration:")
print(f"   • Companies to analyze: {len(enhanced_input['company_numbers'])}")
print(f"   • Document processing: {'✅ Enabled' if enhanced_input['enable_document_processing'] else '❌ Disabled'}")
print(f"   • Tiered extraction fallback: {'✅ Enabled' if enhanced_input['fallback_enabled'] else '❌ Disabled'}")
print(f"   • RAG queries: {len(enhanced_input['rag_queries'])}")

In [None]:
# Run the enhanced workflow
print("🔄 Running enhanced workflow with document processing...")
print("⏱️  This may take a few moments as we process documents")

enhanced_results = orchestrator.run_enhanced_workflow_with_documents(enhanced_input)

if enhanced_results["success"]:
    print("\n✅ Enhanced workflow completed successfully!")
    
    # Extract key metrics
    companies = enhanced_results["data"]["companies"]
    anomalies = enhanced_results["data"]["anomalies"]
    document_processing = enhanced_results["data"]["document_processing"]
    enhanced_analysis = enhanced_results["data"]["enhanced_analysis"]
    
    print(f"\n📊 Workflow Summary:")
    print(f"   • Companies processed: {len(companies)}")
    print(f"   • Anomalies detected: {anomalies['summary']['total_anomalies']}")
    print(f"   • Documents processed: {document_processing['download_summary']['total_documents']}")
    print(f"   • Successful extractions: {document_processing['extraction_summary']['successful_extractions']}")
    print(f"   • Average extraction confidence: {document_processing['extraction_summary']['average_confidence']:.2f}")
    print(f"   • Revenue discrepancies found: {enhanced_analysis['summary']['companies_with_discrepancies']}")
    
else:
    print(f"❌ Enhanced workflow failed: {enhanced_results.get('error_message', 'Unknown error')}")

## 📊 Document Processing Analysis

In [None]:
# Analyze extraction methods performance
if enhanced_results["success"] and "document_processing" in enhanced_results["data"]:
    extraction_results = enhanced_results["data"]["document_processing"]["extraction_results"]
    
    # Create DataFrame for analysis
    extraction_data = []
    for i, result in enumerate(extraction_results):
        extraction_data.append({
            "company_index": i + 1,
            "extraction_method": result["extraction_method"],
            "confidence": result["confidence"],
            "processing_time": result["processing_time"],
            "revenue_extracted": result["financial_data"].get("revenue") if result["financial_data"] else None
        })
    
    df_extraction = pd.DataFrame(extraction_data)
    
    # Create visualizations
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=[
            'Extraction Method Distribution',
            'Confidence vs Processing Time',
            'Extraction Method Performance',
            'Revenue Extraction Results'
        ],
        specs=[[{"type": "pie"}, {"type": "scatter"}],
               [{"type": "bar"}, {"type": "bar"}]]
    )
    
    # Method distribution
    method_counts = df_extraction['extraction_method'].value_counts()
    fig.add_trace(
        go.Pie(labels=method_counts.index, values=method_counts.values, name="Methods"),
        row=1, col=1
    )
    
    # Confidence vs Processing Time
    fig.add_trace(
        go.Scatter(
            x=df_extraction['processing_time'],
            y=df_extraction['confidence'],
            mode='markers+text',
            text=df_extraction['extraction_method'],
            textposition="top center",
            marker=dict(size=10, color=df_extraction['confidence'], colorscale='Viridis'),
            name="Extraction Performance"
        ),
        row=1, col=2
    )
    
    # Method performance comparison
    method_stats = df_extraction.groupby('extraction_method').agg({
        'confidence': 'mean',
        'processing_time': 'mean'
    }).reset_index()
    
    fig.add_trace(
        go.Bar(
            x=method_stats['extraction_method'],
            y=method_stats['confidence'],
            name="Avg Confidence",
            marker_color='lightblue'
        ),
        row=2, col=1
    )
    
    # Revenue extraction results
    fig.add_trace(
        go.Bar(
            x=[f"Company {i}" for i in df_extraction['company_index']],
            y=[r/1000 for r in df_extraction['revenue_extracted']],  # Convert to thousands
            name="Revenue (£000s)",
            marker_color='lightgreen'
        ),
        row=2, col=2
    )
    
    fig.update_layout(
        height=800,
        title_text="Document Processing Performance Analysis",
        showlegend=False
    )
    
    fig.update_xaxes(title_text="Processing Time (seconds)", row=1, col=2)
    fig.update_yaxes(title_text="Confidence Score", row=1, col=2)
    fig.update_yaxes(title_text="Average Confidence", row=2, col=1)
    fig.update_yaxes(title_text="Revenue (£000s)", row=2, col=2)
    
    fig.show()
    
    # Display summary table
    print("\n📋 Extraction Method Performance Summary:")
    display(method_stats.round(2))
else:
    print("❌ No document processing data available for analysis")

## 🔍 Enhanced Revenue Analysis
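
The `discrepancy_percentage` and `has_discrepancy` fields analysed in this section can be thought of as a relative difference plus a threshold check. The sketch below is one plausible way to compute them. It assumes the difference is measured relative to the reported figure and that 5% is the cut-off for flagging; neither assumption is confirmed by the agent code, and `revenue_discrepancy` is a hypothetical helper name.

```python
def revenue_discrepancy(reported: float, extracted: float,
                        threshold_pct: float = 5.0) -> dict:
    """Compare a reported revenue figure with one extracted from filed accounts."""
    if reported <= 0:
        raise ValueError("reported revenue must be positive")
    # Relative difference, expressed as a percentage of the reported figure.
    pct = abs(reported - extracted) / reported * 100
    return {
        "discrepancy_percentage": round(pct, 1),
        # threshold_pct is an assumed cut-off, not taken from the agent code.
        "has_discrepancy": pct > threshold_pct,
    }


# Reported vs extracted revenue for the first mock company in this notebook.
print(revenue_discrepancy(850_000, 845_000))
```

With the mock figures for Tech Solutions Ltd this gives 0.6% and no flag, which matches the mock `enhanced_insights` entry.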

In [None]:
# Analyze revenue discrepancies between reported and extracted data
if enhanced_results["success"] and "enhanced_analysis" in enhanced_results["data"]:
    enhanced_insights = enhanced_results["data"]["enhanced_analysis"]["enhanced_insights"]
    
    # Create DataFrame for analysis
    revenue_data = []
    for insight in enhanced_insights:
        revenue_data.append({
            "company_name": insight["company_name"],
            "reported_revenue": insight["reported_revenue"],
            "extracted_revenue": insight["extracted_revenue"],
            "discrepancy_percentage": insight["discrepancy_percentage"],
            "has_discrepancy": insight["has_discrepancy"],
            "extraction_confidence": insight["extraction_confidence"],
            "extraction_method": insight["extraction_method"]
        })
    
    df_revenue = pd.DataFrame(revenue_data)
    
    # Create comparison visualization
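    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=[
            'Reported vs Extracted Revenue',
            'Revenue Discrepancy Analysis',
            'Extraction Confidence by Method',
            'Revenue Accuracy Summary'
        ],
        # The pie chart at (2, 2) needs a domain-type subplot; the default xy type rejects go.Pie
        specs=[[{"type": "bar"}, {"type": "bar"}],
               [{"type": "bar"}, {"type": "pie"}]]
    )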
    
    # Reported vs Extracted Revenue
    
    fig.add_trace(
        go.Bar(
            x=df_revenue['company_name'],
            y=[r/1000 for r in df_revenue['reported_revenue']],
            name='Reported Revenue',
            marker_color='lightblue'
        ),
        row=1, col=1
    )
    
    fig.add_trace(
        go.Bar(
            x=df_revenue['company_name'],
            y=[r/1000 for r in df_revenue['extracted_revenue']],
            name='Extracted Revenue',
            marker_color='lightcoral'
        ),
        row=1, col=1
    )
    
    # Discrepancy percentage
    fig.add_trace(
        go.Bar(
            x=df_revenue['company_name'],
            y=df_revenue['discrepancy_percentage'],
            name='Discrepancy %',
            marker_color='orange'
        ),
        row=1, col=2
    )
    
    # Confidence by method
    method_confidence = df_revenue.groupby('extraction_method')['extraction_confidence'].mean().reset_index()
    
    fig.add_trace(
        go.Bar(
            x=method_confidence['extraction_method'],
            y=method_confidence['extraction_confidence'],
            name='Avg Confidence',
            marker_color='green'
        ),
        row=2, col=1
    )
    
    # Accuracy summary (pie chart of discrepancy status)
    accuracy_summary = df_revenue['has_discrepancy'].value_counts()
    fig.add_trace(
        go.Pie(
            labels=['Accurate' if not disc else 'Discrepancy' for disc in accuracy_summary.index],
            values=accuracy_summary.values,
            name="Accuracy"
        ),
        row=2, col=2
    )
    
    fig.update_layout(
        height=800,
        title_text="Enhanced Revenue Analysis: Reported vs Extracted",
        showlegend=True
    )
    
    fig.update_yaxes(title_text="Revenue (£000s)", row=1, col=1)
    fig.update_yaxes(title_text="Discrepancy %", row=1, col=2)
    fig.update_yaxes(title_text="Confidence Score", row=2, col=1)
    
    fig.show()
    
    # Display detailed comparison table
    print("\n📋 Detailed Revenue Comparison:")
    comparison_table = df_revenue[[
        'company_name', 'reported_revenue', 'extracted_revenue', 
        'discrepancy_percentage', 'extraction_confidence', 'extraction_method'
    ]].copy()
    
    comparison_table['reported_revenue'] = comparison_table['reported_revenue'].apply(lambda x: f"£{x:,.0f}")
    comparison_table['extracted_revenue'] = comparison_table['extracted_revenue'].apply(lambda x: f"£{x:,.0f}")
    comparison_table['discrepancy_percentage'] = comparison_table['discrepancy_percentage'].apply(lambda x: f"{x:.1f}%")
    comparison_table['extraction_confidence'] = comparison_table['extraction_confidence'].apply(lambda x: f"{x:.2f}")
    
    display(comparison_table)
    
    # Summary statistics
    print("\n📊 Enhanced Analysis Summary:")
    summary = enhanced_results["data"]["enhanced_analysis"]["summary"]
    print(f"   • Total companies analyzed: {summary['total_companies_analyzed']}")
    print(f"   • Companies with significant discrepancies: {summary['companies_with_discrepancies']}")
    print(f"   • Average discrepancy rate: {summary['average_discrepancy_rate']:.1f}%")
    print(f"   • High confidence extractions: {summary['high_confidence_extractions']}")
    
else:
    print("❌ No enhanced analysis data available")

## 🔄 Demo 2: Tiered Processing Performance Comparison
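
Before looking at the timings, it may help to see the fallback logic in miniature. The sketch below tries each tier in order and stops at the first result that meets that tier's confidence threshold. The 0.80 and 0.70 thresholds follow the targets printed in the next cell; the 0.0 threshold for Tier 3, the `run_tiered_extraction` name, and the stub extractors are assumptions for illustration only.

```python
from typing import Callable, List, Optional, Tuple

# Each tier returns (extracted value or None, confidence in [0, 1]).
Extractor = Callable[[str], Tuple[Optional[float], float]]


def run_tiered_extraction(text: str,
                          tiers: List[Tuple[str, Extractor, float]]) -> dict:
    """Try each (name, extractor, min_confidence) tier in order and stop at
    the first result that meets its threshold."""
    attempted = []
    for name, extractor, min_confidence in tiers:
        value, confidence = extractor(text)
        attempted.append(name)
        if value is not None and confidence >= min_confidence:
            return {"value": value, "confidence": confidence,
                    "method": name, "attempted": attempted}
    # No tier produced an acceptable result.
    return {"value": None, "confidence": 0.0,
            "method": None, "attempted": attempted}


# Stub extractors with fixed outputs, loosely mirroring the
# "Complex Annual Report" scenario: Tier 1 falls short, Tier 2 succeeds.
tiers = [
    ("regex_pattern", lambda text: (1_180_000.0, 0.65), 0.80),
    ("section_vectorization", lambda text: (1_180_000.0, 0.82), 0.70),
    ("full_rag", lambda text: (1_180_000.0, 0.78), 0.0),  # assumed: accept any result
]

result = run_tiered_extraction("annual report text", tiers)
print(result["method"], result["attempted"])
```

Because later tiers only run when earlier ones fall short, the cost of a document is the sum of every tier it passed through, which is how `total_processing_time` is computed below.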

In [None]:
# Demonstrate the efficiency of tiered processing
print("🔬 Analyzing Tiered Processing Efficiency")
print("\nTiered Processing Strategy:")
print("📊 Tier 1 (Regex): Fast pattern matching - targets 80%+ confidence in <5 seconds")
print("🎯 Tier 2 (Section): Intelligent section analysis - targets 70%+ confidence in <20 seconds")
print("🧠 Tier 3 (RAG): Full semantic analysis - comprehensive extraction in <60 seconds")

# Mock data for tiered processing comparison
processing_scenarios = [
    {
        "scenario": "Simple Financial Statement",
        "tier1_time": 1.2, "tier1_confidence": 0.92, "tier1_success": True,
        "tier2_time": 0, "tier2_confidence": 0, "tier2_success": False,
        "tier3_time": 0, "tier3_confidence": 0, "tier3_success": False,
        "final_method": "Tier 1 (Regex)"
    },
    {
        "scenario": "Complex Annual Report",
        "tier1_time": 2.1, "tier1_confidence": 0.65, "tier1_success": False,
        "tier2_time": 8.7, "tier2_confidence": 0.82, "tier2_success": True,
        "tier3_time": 0, "tier3_confidence": 0, "tier3_success": False,
        "final_method": "Tier 2 (Section)"
    },
    {
        "scenario": "Unusual Format Document",
        "tier1_time": 1.8, "tier1_confidence": 0.45, "tier1_success": False,
        "tier2_time": 12.3, "tier2_confidence": 0.58, "tier2_success": False,
        "tier3_time": 35.6, "tier3_confidence": 0.78, "tier3_success": True,
        "final_method": "Tier 3 (RAG)"
    },
    {
        "scenario": "Standard UK Filing",
        "tier1_time": 0.9, "tier1_confidence": 0.88, "tier1_success": True,
        "tier2_time": 0, "tier2_confidence": 0, "tier2_success": False,
        "tier3_time": 0, "tier3_confidence": 0, "tier3_success": False,
        "final_method": "Tier 1 (Regex)"
    },
    {
        "scenario": "Multi-language Report",
        "tier1_time": 2.5, "tier1_confidence": 0.52, "tier1_success": False,
        "tier2_time": 15.2, "tier2_confidence": 0.69, "tier2_success": False,
        "tier3_time": 48.9, "tier3_confidence": 0.84, "tier3_success": True,
        "final_method": "Tier 3 (RAG)"
    }
]

df_scenarios = pd.DataFrame(processing_scenarios)

# Calculate total processing time and success rates
df_scenarios['total_processing_time'] = (
    df_scenarios['tier1_time'] + 
    df_scenarios['tier2_time'] + 
    df_scenarios['tier3_time']
)

df_scenarios['final_confidence'] = (
    df_scenarios['tier1_confidence'] * df_scenarios['tier1_success'] +
    df_scenarios['tier2_confidence'] * df_scenarios['tier2_success'] +
    df_scenarios['tier3_confidence'] * df_scenarios['tier3_success']
)

# Create comprehensive visualization
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[
        'Processing Time by Tier',
        'Final Method Distribution',
        'Confidence vs Processing Time',
        'Efficiency Analysis'
    ],
    specs=[[{"type": "bar"}, {"type": "pie"}],
           [{"type": "scatter"}, {"type": "bar"}]]
)

# Processing time breakdown
for i, scenario in enumerate(df_scenarios['scenario']):
    row = df_scenarios.iloc[i]
    
    if row['tier1_success']:
        color = 'green'
        time_used = row['tier1_time']
    elif row['tier2_success']:
        color = 'orange'
        time_used = row['tier1_time'] + row['tier2_time']
    else:
        color = 'red'
        time_used = row['total_processing_time']
    
    fig.add_trace(
        go.Bar(
            x=[time_used],
            y=[scenario],
            orientation='h',
            name=f"{scenario}",
            marker_color=color,
            showlegend=False
        ),
        row=1, col=1
    )

# Final method distribution
method_counts = df_scenarios['final_method'].value_counts()
fig.add_trace(
    go.Pie(
        labels=method_counts.index, 
        values=method_counts.values, 
        name="Methods"
    ),
    row=1, col=2
)

# Confidence vs Processing Time scatter
fig.add_trace(
    go.Scatter(
        x=df_scenarios['total_processing_time'],
        y=df_scenarios['final_confidence'],
        mode='markers+text',
        text=df_scenarios['final_method'],
        textposition="top center",
        marker=dict(
            size=12, 
            color=df_scenarios['final_confidence'], 
            colorscale='RdYlGn',
            showscale=True
        ),
        name="Performance"
    ),
    row=2, col=1
)

# Efficiency analysis (confidence per second)
df_scenarios['efficiency'] = df_scenarios['final_confidence'] / df_scenarios['total_processing_time']

fig.add_trace(
    go.Bar(
        x=df_scenarios['scenario'],
        y=df_scenarios['efficiency'],
        name='Efficiency (Confidence/Second)',
        marker_color='purple'
    ),
    row=2, col=2
)

fig.update_layout(
    height=800,
    title_text="Tiered Processing Performance Analysis",
    showlegend=False
)

fig.update_xaxes(title_text="Processing Time (seconds)", row=1, col=1)
fig.update_xaxes(title_text="Processing Time (seconds)", row=2, col=1)
fig.update_yaxes(title_text="Confidence Score", row=2, col=1)
fig.update_yaxes(title_text="Efficiency Score", row=2, col=2)

fig.show()

# Display efficiency analysis
print("\n📊 Tiered Processing Efficiency Summary:")
efficiency_summary = df_scenarios[[
    'scenario', 'final_method', 'total_processing_time', 
    'final_confidence', 'efficiency'
]].copy()

efficiency_summary['total_processing_time'] = efficiency_summary['total_processing_time'].round(1)
efficiency_summary['final_confidence'] = efficiency_summary['final_confidence'].round(3)
efficiency_summary['efficiency'] = efficiency_summary['efficiency'].round(4)

display(efficiency_summary)

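# Calculate tier success rates, each relative to the documents that actually reached that tier
tier1_success_rate = df_scenarios['tier1_success'].mean() * 100
reached_tier2 = ~df_scenarios['tier1_success']
reached_tier3 = reached_tier2 & ~df_scenarios['tier2_success']
tier2_success_rate = df_scenarios.loc[reached_tier2, 'tier2_success'].mean() * 100
tier3_success_rate = df_scenarios.loc[reached_tier3, 'tier3_success'].mean() * 100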

avg_processing_time = df_scenarios['total_processing_time'].mean()
avg_confidence = df_scenarios['final_confidence'].mean()

print(f"\n🎯 Tier Success Rates:")
print(f"   • Tier 1 (Regex): {tier1_success_rate:.0f}% success rate")
print(f"   • Tier 2 (Section): {tier2_success_rate:.0f}% success rate (when Tier 1 fails)")
print(f"   • Tier 3 (RAG): {tier3_success_rate:.0f}% success rate (when Tier 1-2 fail)")
print(f"\n⚡ Overall Performance:")
print(f"   • Average processing time: {avg_processing_time:.1f} seconds")
print(f"   • Average confidence: {avg_confidence:.3f}")
print(f"   • System efficiency: {(avg_confidence/avg_processing_time):.4f} confidence/second")

## 💡 Key Insights and Recommendations

In [None]:
# Generate comprehensive insights
print("🔍 COMPREHENSIVE ANALYSIS INSIGHTS")
print("=" * 50)

print("\n📊 DOCUMENT PROCESSING PERFORMANCE:")
if enhanced_results["success"] and "document_processing" in enhanced_results["data"]:
    doc_summary = enhanced_results["data"]["document_processing"]
    
    print(f"✅ Successfully downloaded {doc_summary['download_summary']['successful_downloads']}/{doc_summary['download_summary']['total_documents']} documents")
    print(f"📈 Achieved {doc_summary['extraction_summary']['average_confidence']:.1%} average extraction confidence")
    print(f"🎯 {doc_summary['extraction_summary']['successful_extractions']}/{doc_summary['extraction_summary']['total_extractions']} successful extractions")
    
    # Method effectiveness
    extraction_results = doc_summary.get("extraction_results", [])
    method_distribution = {}
    for result in extraction_results:
        method = result["extraction_method"]
        method_distribution[method] = method_distribution.get(method, 0) + 1
    
    print(f"\n🔧 METHOD DISTRIBUTION:")
    for method, count in method_distribution.items():
        percentage = (count / len(extraction_results)) * 100
        print(f"   • {method.replace('_', ' ').title()}: {count} documents ({percentage:.0f}%)")

print("\n🚨 ANOMALY DETECTION RESULTS:")
if enhanced_results["success"]:
    anomaly_summary = enhanced_results["data"]["anomalies"]["summary"]
    print(f"🔍 Detected {anomaly_summary['total_anomalies']} anomalies across {len(enhanced_results['data']['companies'])} companies")
    print(f"📊 Anomaly rate: {anomaly_summary['anomaly_rate']:.1%}")
    
    sector_anomalies = len(enhanced_results["data"]["anomalies"]["sector_anomalies"])
    turnover_anomalies = len(enhanced_results["data"]["anomalies"]["turnover_anomalies"])
    
    print(f"   • Sector code anomalies: {sector_anomalies}")
    print(f"   • Turnover anomalies: {turnover_anomalies}")

print("\n💰 REVENUE VERIFICATION ANALYSIS:")
if enhanced_results["success"] and "enhanced_analysis" in enhanced_results["data"]:
    revenue_summary = enhanced_results["data"]["enhanced_analysis"]["summary"]
    
    print(f"📋 Analyzed revenue data for {revenue_summary['total_companies_analyzed']} companies")
    print(f"⚠️  Found significant discrepancies in {revenue_summary['companies_with_discrepancies']} companies")
    print(f"📊 Average discrepancy rate: {revenue_summary['average_discrepancy_rate']:.1f}%")
    print(f"🎯 High confidence extractions: {revenue_summary['high_confidence_extractions']}/{revenue_summary['total_companies_analyzed']}")
    
    # Calculate accuracy rate
    accuracy_rate = ((revenue_summary['total_companies_analyzed'] - revenue_summary['companies_with_discrepancies']) / 
                    revenue_summary['total_companies_analyzed']) * 100
    print(f"✅ Overall revenue accuracy: {accuracy_rate:.0f}%")

print("\n🚀 SYSTEM PERFORMANCE HIGHLIGHTS:")
print("✨ INTELLIGENT TIERED PROCESSING:")
print(f"   • Tier 1 (Regex): Handles {tier1_success_rate:.0f}% of cases in <5 seconds")
print(f"   • Tier 2 (Section): Processes complex documents in <20 seconds")
print(f"   • Tier 3 (RAG): Comprehensive analysis for challenging cases")
print(f"   • Average processing time: {avg_processing_time:.1f} seconds per document")

print("\n🎯 BUSINESS VALUE DELIVERED:")
print("💡 AUTOMATED ANOMALY DETECTION:")
print("   • Identifies sector code inconsistencies automatically")
print("   • Flags turnover/sector mismatches for review")
print("   • Provides confidence scores for all detections")

print("\n📄 ADVANCED DOCUMENT ANALYSIS:")
print("   • Extracts financial data from annual accounts")
print("   • Verifies reported vs actual revenue figures")
print("   • Supports multiple document formats and layouts")

print("\n🤖 AI-POWERED INSIGHTS:")
print("   • Generates smart suggestions for data corrections")
print("   • Provides reasoning for all recommendations")
print("   • Enables human-in-the-loop validation workflow")

print("\n⚡ SCALABILITY & EFFICIENCY:")
print("   • Tiered processing optimizes speed vs accuracy")
print("   • Vector database enables semantic document search")
print("   • Parallel processing supports high-volume analysis")

print("\n" + "=" * 50)
print("🎉 DEMO COMPLETED SUCCESSFULLY!")
print("📊 The multi-agent system demonstrates enterprise-ready")
print("   capabilities for credit risk analysis and document processing.")
print("=" * 50)

## 🔧 Configuration and Next Steps

### For Production Deployment:

1. **API Configuration**:
   - Set up Companies House API key in Databricks secrets
   - Configure authentication for document downloads

2. **Vector Database Setup**:
   - Deploy ChromaDB or Pinecone for production RAG
   - Configure embedding models (OpenAI, Hugging Face, etc.)

3. **Document Processing**:
   - Install PDF processing libraries (e.g., pypdf, the maintained successor to PyPDF2, or pdfplumber)
   - Set up document storage (Delta Lake, Azure Blob)

4. **Monitoring & Logging**:
   - Enable MLflow for experiment tracking
   - Set up performance monitoring dashboards

5. **Scaling Considerations**:
   - Configure cluster auto-scaling
   - Implement distributed processing for large datasets
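
For step 1 above, a minimal sketch of an authenticated Companies House request might look like the following. It relies on the public Companies House API using HTTP Basic authentication with the API key as the username and an empty password. The `build_company_request` helper and the secret scope and key names are hypothetical. The request is only constructed here, not sent.

```python
import base64
import urllib.request

API_BASE = "https://api.company-information.service.gov.uk"


def build_company_request(company_number: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated company profile request."""
    # Basic auth: the API key is the username and the password is left empty.
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        f"{API_BASE}/company/{company_number}",
        headers={"Authorization": f"Basic {token}"},
    )


# In Databricks the key would come from a secret scope rather than source code, e.g.
#   api_key = dbutils.secrets.get(scope="credit-risk", key="companies-house-api-key")
# The scope and key names above are hypothetical.
request = build_company_request("12345678", api_key="demo-key")
print(request.full_url)
```

Keeping request construction separate from sending makes the authentication logic easy to check without network access or a real key.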

### Key Features Demonstrated:

- ✅ **Multi-agent coordination** for complex workflows
- ✅ **Tiered document processing** (regex → section → RAG)
- ✅ **Revenue verification** through document analysis
- ✅ **Anomaly detection** with confidence scoring
- ✅ **AI-powered suggestions** for data quality improvement
- ✅ **Interactive visualizations** for insights
- ✅ **Scalable architecture** designed with production deployment in mind