# Credit Risk Multi-Agent Anomaly Detection System

## Overview
This notebook demonstrates an intelligent system that:
1. **Ingests data** from the Companies House API and external sources
2. **Detects anomalies** in sector codes and turnover data
3. **Provides AI-powered suggestions** with confidence scores
4. **Enables analyst review** for human-in-the-loop workflows

## Architecture
- **Data Ingestion Agent**: Fetches company data from APIs
- **Anomaly Detection Agent**: Identifies inconsistencies
- **Sector Classification Agent**: Suggests correct sector codes
- **Turnover Estimation Agent**: Proposes turnover corrections
- **Confidence Scoring Agent**: Calculates suggestion reliability
- **Human Review Interface**: Analyst approval workflow
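
The agents listed above are used later in the notebook through a `process()` call that returns an object with `success`, `data`, and `error_message` attributes. The sketch below shows one way such a contract could look. `AgentResult`, `BaseAgent`, and `MissingSicCodeAgent` are hypothetical names chosen for illustration, not the classes in the project source.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class AgentResult:
    """Outcome of one agent call (hypothetical shape, mirroring how results are read later)."""
    success: bool
    data: Dict[str, Any] = field(default_factory=dict)
    error_message: Optional[str] = None


class BaseAgent:
    """Hypothetical base class: each agent has a name, a string ID, and a process() method."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.id = str(uuid.uuid4())

    def process(self, payload: Dict[str, Any]) -> AgentResult:
        raise NotImplementedError


class MissingSicCodeAgent(BaseAgent):
    """Toy agent that flags companies with no primary SIC code."""

    def process(self, payload: Dict[str, Any]) -> AgentResult:
        try:
            flagged = [
                company["company_name"]
                for company in payload["companies"]
                if not company.get("primary_sic_code")
            ]
        except KeyError as exc:
            # Report the problem through the result instead of raising
            return AgentResult(success=False, error_message=f"Missing field: {exc}")
        return AgentResult(success=True, data={"flagged": flagged})


agent = MissingSicCodeAgent("missing_sic_demo")
result = agent.process({"companies": [
    {"company_name": "A Ltd", "primary_sic_code": "62012"},
    {"company_name": "B Ltd", "primary_sic_code": None},
]})
print(result.success, result.data["flagged"])
```

Returning failures through the result object, rather than raising, lets the calling cell branch on `result.success` the way the anomaly detection cell does.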

## 1. Setup and Configuration

In [None]:
# Install required packages
%pip install requests pandas numpy scikit-learn python-dotenv matplotlib seaborn

# Restart Python to use newly installed packages
dbutils.library.restartPython()

In [None]:
# Import libraries
import os
import sys
import pandas as pd
import numpy as np
import json
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Add project source to path
project_root = "/Workspace/Repos/credit_risk_system/src"
if project_root not in sys.path:
    sys.path.append(project_root)

print(f"Python version: {sys.version}")
print(f"Pandas version: {pd.__version__}")
print(f"Numpy version: {np.__version__}")

In [None]:
# Configure environment variables
# Note: In production, these should be stored in Databricks secrets

# Companies House API Configuration
# setdefault keeps any real key that is already present in the environment
os.environ.setdefault("COMPANIES_HOUSE_API_KEY", "your_api_key_here")

# OpenAI Configuration (for AI suggestions)
os.environ.setdefault("OPENAI_API_KEY", "your_openai_key_here")

print("✅ Environment variables configured")
print("⚠️  Remember to replace placeholder API keys with actual values")
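
Because the cell above works with placeholder values, it can help to fail early when a key was never replaced. The following is a minimal sketch under that assumption. `get_api_key` and `PLACEHOLDER_PREFIX` are hypothetical helpers, not part of the project source.

```python
import os

PLACEHOLDER_PREFIX = "your_"  # matches the placeholder values used in this notebook


def get_api_key(var_name, environ=None):
    """Return the key stored in `var_name`, or raise if it is missing or still a placeholder."""
    environ = os.environ if environ is None else environ
    value = environ.get(var_name, "").strip()
    if not value or value.startswith(PLACEHOLDER_PREFIX):
        raise RuntimeError(f"{var_name} is not set to a real value")
    return value


# On Databricks the value would typically come from a secret scope instead, e.g.
#   dbutils.secrets.get(scope="credit-risk", key="companies-house-api-key")
# The scope and key names in that call are made up for illustration.
```

Calling `get_api_key("COMPANIES_HOUSE_API_KEY")` before any API request turns a silent authentication failure into an immediate, readable error.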

## 2. Initialize Agents

In [None]:
# Initialize the multi-agent system
try:
    from agents.data_ingestion_agent import DataIngestionAgent
    from agents.anomaly_detection_agent import AnomalyDetectionAgent
    from utils.config_manager import config
    from utils.logger import setup_logger
    
    # Setup logging
    logger = setup_logger("credit_risk_notebook")
    
    # Initialize agents
    data_agent = DataIngestionAgent()
    anomaly_agent = AnomalyDetectionAgent()
    
    print("✅ Agents initialized successfully")
    print(f"📊 Data Ingestion Agent: {data_agent.name} (ID: {data_agent.id[:8]}...)")
    print(f"🔍 Anomaly Detection Agent: {anomaly_agent.name} (ID: {anomaly_agent.id[:8]}...)")
    
except ImportError as e:
    print(f"❌ Error importing agents: {e}")
    print("Please ensure the source code is available in the workspace")

## 3. Data Ingestion Demo

In [None]:
# Demo: Fetch sample company data
print("🔄 Starting data ingestion process...")

# For demo purposes, we'll use some sample data
# In production, this would fetch from the Companies House API
sample_companies = [
    {
        "company_number": "12345678",
        "company_name": "Tech Innovations Ltd",
        "company_status": "active",
        "company_type": "private-limited-company",
        "date_of_creation": "2020-01-15",
        "primary_sic_code": "62012",  # Software development
        "all_sic_codes": "62012,62020",
        "description": "Software development and consulting services",
        "turnover": 850000,
        "address_line1": "123 Tech Street",
        "locality": "London",
        "postal_code": "E1 6AN",
        "country": "United Kingdom"
    },
    {
        "company_number": "87654321",
        "company_name": "Green Energy Solutions",
        "company_status": "active",
        "company_type": "private-limited-company",
        "date_of_creation": "2019-06-10",
        "primary_sic_code": "47190",  # Other retail sale in non-specialised stores (ANOMALY!)
        "all_sic_codes": "47190",
        "description": "Renewable energy consulting and solar panel installation",
        "turnover": -50000,  # Negative turnover (ANOMALY!)
        "address_line1": "456 Green Lane",
        "locality": "Bristol",
        "postal_code": "BS1 2AB",
        "country": "United Kingdom"
    },
    {
        "company_number": "11111111",
        "company_name": "Financial Advisory Group",
        "company_status": "active",
        "company_type": "private-limited-company",
        "date_of_creation": "2015-03-22",
        "primary_sic_code": None,  # Missing SIC code (ANOMALY!)
        "all_sic_codes": "",
        "description": "Financial planning and investment advisory services",
        "turnover": None,  # Missing turnover (ANOMALY!)
        "address_line1": "789 Finance Square",
        "locality": "Edinburgh",
        "postal_code": "EH1 3BT",
        "country": "United Kingdom"
    },
    {
        "company_number": "22222222",
        "company_name": "Mega Corp Industries",
        "company_status": "active",
        "company_type": "public-limited-company",
        "date_of_creation": "1995-11-08",
        "primary_sic_code": "46190",
        "all_sic_codes": "46190,70210",
        "description": "Manufacturing and distribution of industrial equipment",
        "turnover": 2500000000,  # Very high turnover (potential anomaly)
        "address_line1": "Corporate Tower",
        "locality": "Manchester",
        "postal_code": "M1 1AA",
        "country": "United Kingdom"
    },
    {
        "company_number": "33333333",
        "company_name": "Dormant Holdings Ltd",
        "company_status": "active",
        "company_type": "private-limited-company",
        "date_of_creation": "2018-09-14",
        "primary_sic_code": "64209",
        "all_sic_codes": "64209",
        "description": "Investment holding company",
        "turnover": 0,  # Zero turnover (potential anomaly)
        "address_line1": "Investment House",
        "locality": "Leeds",
        "postal_code": "LS1 4AP",
        "country": "United Kingdom"
    }
]

# Convert to DataFrame
companies_df = pd.DataFrame(sample_companies)

print(f"✅ Loaded {len(sample_companies)} sample companies")
print("\n📋 Sample Company Data:")
display(companies_df[['company_number', 'company_name', 'primary_sic_code', 'turnover', 'description']].head())
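
For the production path mentioned in the cell above, a company profile could be fetched and mapped onto the same flat layout as `sample_companies`. This is a sketch, not the `DataIngestionAgent` implementation: the function names are hypothetical, and the response field names (`sic_codes`, `registered_office_address`, `type`, and so on) are written from memory of the public Companies House API and should be checked against its reference. It uses only the standard library so it runs without extra packages.

```python
import base64
import json
import urllib.request

API_BASE = "https://api.company-information.service.gov.uk"


def fetch_company_profile(company_number, api_key):
    """Fetch one company profile using HTTP Basic auth (API key as username, empty password)."""
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    request = urllib.request.Request(
        f"{API_BASE}/company/{company_number}",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def flatten_profile(profile):
    """Map a profile response onto the flat record layout used by sample_companies."""
    sic_codes = profile.get("sic_codes") or []
    address = profile.get("registered_office_address") or {}
    return {
        "company_number": profile.get("company_number"),
        "company_name": profile.get("company_name"),
        "company_status": profile.get("company_status"),
        "company_type": profile.get("type"),
        "date_of_creation": profile.get("date_of_creation"),
        "primary_sic_code": sic_codes[0] if sic_codes else None,
        "all_sic_codes": ",".join(sic_codes),
        "address_line1": address.get("address_line_1"),
        "locality": address.get("locality"),
        "postal_code": address.get("postal_code"),
        "country": address.get("country"),
        # Assumption: turnover and a free-text description are not part of the
        # profile response and would have to come from another source.
        "turnover": None,
        "description": None,
    }
```

Keeping the mapping in a pure function separate from the network call makes it easy to test against a saved response without an API key.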

## 4. Anomaly Detection

In [None]:
# Run anomaly detection
print("🔍 Running anomaly detection analysis...")

# Prepare data for anomaly detection
detection_data = {
    "companies": sample_companies
}

# Run anomaly detection with the agent; fall back to mock results if the agent is unavailable
try:
    result = anomaly_agent.process(detection_data)
    
    if result.success:
        anomalies = result.data["anomalies"]
        summary = result.data["summary"]
        
        print(f"\n📊 Anomaly Detection Results:")
        print(f"   • Total companies analyzed: {summary['total_companies']}")
        print(f"   • Total anomalies detected: {summary['total_anomalies']}")
        print(f"   • Anomaly rate: {summary['anomaly_rate']:.2%}")
        print(f"   • Sector code anomalies: {summary['sector_anomalies']}")
        print(f"   • Turnover anomalies: {summary['turnover_anomalies']}")
        
        # Display anomalies in a table
        if anomalies:
            anomaly_data = []
            for anomaly in anomalies:
                anomaly_data.append({
                    "Company": anomaly.company_name,
                    "Type": anomaly.anomaly_type,
                    "Current Value": anomaly.current_value,
                    "Confidence": f"{anomaly.confidence:.1%}",
                    "Description": anomaly.description,
                    "Investigation": anomaly.suggested_investigation
                })
            
            anomalies_df = pd.DataFrame(anomaly_data)
            print("\n🚨 Detected Anomalies:")
            display(anomalies_df)
        else:
            print("\n✅ No anomalies detected")
    else:
        print(f"❌ Anomaly detection failed: {result.error_message}")
        
except Exception as e:
    print(f"⚠️  Agent unavailable, falling back to mock anomaly detection: {e}")
    
    # Mock anomaly detection results for demo
    mock_anomalies = [
        {
            "Company": "Green Energy Solutions",
            "Type": "sector_code",
            "Current Value": "47190",
            "Confidence": "75%",
            "Description": "SIC code mismatch with business description",
            "Investigation": "Review business description against SIC code classification"
        },
        {
            "Company": "Green Energy Solutions",
            "Type": "turnover",
            "Current Value": -50000,
            "Confidence": "100%",
            "Description": "Negative turnover value",
            "Investigation": "Verify turnover calculation and data source"
        },
        {
            "Company": "Financial Advisory Group",
            "Type": "sector_code",
            "Current Value": None,
            "Confidence": "90%",
            "Description": "Missing SIC code classification",
            "Investigation": "Assign appropriate SIC code based on business activity"
        },
        {
            "Company": "Financial Advisory Group",
            "Type": "turnover",
            "Current Value": None,
            "Confidence": "70%",
            "Description": "Missing turnover data",
            "Investigation": "Obtain turnover information from financial reports"
        },
        {
            "Company": "Dormant Holdings Ltd",
            "Type": "turnover",
            "Current Value": 0,
            "Confidence": "80%",
            "Description": "Zero turnover for active company",
            "Investigation": "Confirm if company is dormant or verify turnover data"
        }
    ]
    
    anomalies_df = pd.DataFrame(mock_anomalies)
    print("\n📊 Mock Anomaly Detection Results:")
    print(f"   • Total companies analyzed: {len(sample_companies)}")
    print(f"   • Total anomalies detected: {len(mock_anomalies)}")
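    # Rate = share of companies with at least one anomaly (one company can have several)
    flagged = len({a["Company"] for a in mock_anomalies})
    print(f"   • Anomaly rate (companies affected): {flagged/len(sample_companies):.2%}")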
    
    print("\n🚨 Detected Anomalies:")
    display(anomalies_df)
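
The turnover entries in the mock results follow simple rules (missing, negative, zero for an active company), and the sample data also marks a very high turnover as a potential anomaly. A rule-based check along those lines might look like the sketch below. `check_turnover` and the `HIGH_TURNOVER_THRESHOLD` value are illustrative assumptions, not the logic of `AnomalyDetectionAgent`.

```python
HIGH_TURNOVER_THRESHOLD = 1_000_000_000  # illustrative cut-off, not taken from the agent


def check_turnover(company):
    """Return a description of the turnover problem, or None if the value passes every rule."""
    turnover = company.get("turnover")
    if turnover is None:
        return "Missing turnover data"
    if turnover < 0:
        return "Negative turnover value"
    if turnover == 0 and company.get("company_status") == "active":
        return "Zero turnover for active company"
    if turnover > HIGH_TURNOVER_THRESHOLD:
        return "Unusually high turnover"
    return None


# Reduced copies of the sample records, so the block runs on its own
demo_companies = [
    {"company_name": "Tech Innovations Ltd", "company_status": "active", "turnover": 850000},
    {"company_name": "Green Energy Solutions", "company_status": "active", "turnover": -50000},
    {"company_name": "Financial Advisory Group", "company_status": "active", "turnover": None},
    {"company_name": "Mega Corp Industries", "company_status": "active", "turnover": 2500000000},
    {"company_name": "Dormant Holdings Ltd", "company_status": "active", "turnover": 0},
]

turnover_flags = {}
for company in demo_companies:
    issue = check_turnover(company)
    if issue is not None:
        turnover_flags[company["company_name"]] = issue

for name, issue in turnover_flags.items():
    print(f"{name}: {issue}")
```

Checking for `None` first matters: comparing `None` with a number raises a `TypeError` in Python 3, so the missing-value rule has to run before the numeric ones.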

## 5. AI-Powered Suggestions (Mock Implementation)

In [None]:
# Mock AI-powered suggestions for anomaly corrections
print("🤖 Generating AI-powered correction suggestions...")

# Mock suggestions based on detected anomalies
ai_suggestions = [
    {
        "Company": "Green Energy Solutions",
        "Anomaly Type": "sector_code",
        "Current Value": "47190",
        "Suggested Value": "71200",
        "Suggested Description": "Technical testing and analysis (energy consulting)",
        "Confidence": "85%",
        "Reasoning": "Business description mentions renewable energy consulting, which aligns with technical testing and analysis",
        "Data Sources": "Business description analysis, industry classification patterns"
    },
    {
        "Company": "Green Energy Solutions",
        "Anomaly Type": "turnover",
        "Current Value": -50000,
        "Suggested Value": 450000,
        "Suggested Description": "Estimated based on company size and industry averages",
        "Confidence": "60%",
        "Reasoning": "Similar sized energy consulting firms typically have £400-500k turnover",
        "Data Sources": "Industry benchmarks, company age, sector analysis"
    },
    {
        "Company": "Financial Advisory Group",
        "Anomaly Type": "sector_code",
        "Current Value": None,
        "Suggested Value": "66220",
        "Suggested Description": "Activities of insurance agents and brokers",
        "Confidence": "90%",
        "Reasoning": "Business description clearly indicates financial planning and investment advisory services",
        "Data Sources": "Business description NLP analysis, industry classification"
    },
    {
        "Company": "Financial Advisory Group",
        "Anomaly Type": "turnover",
        "Current Value": None,
        "Suggested Value": 320000,
        "Suggested Description": "Estimated based on financial advisory industry standards",
        "Confidence": "70%",
        "Reasoning": "Financial advisory firms established in 2015 typically generate £300-400k annually",
        "Data Sources": "Industry reports, company age analysis, market data"
    }
]

suggestions_df = pd.DataFrame(ai_suggestions)

print("\n💡 AI-Generated Suggestions:")
display(suggestions_df[['Company', 'Anomaly Type', 'Current Value', 'Suggested Value', 'Confidence', 'Reasoning']])

print("\n📈 Suggestion Summary:")
print(f"   • Total suggestions generated: {len(ai_suggestions)}")
print(f"   • High confidence suggestions (>80%): {len([s for s in ai_suggestions if float(s['Confidence'].rstrip('%')) > 80])}")
print(f"   • Medium confidence suggestions (60-80%): {len([s for s in ai_suggestions if 60 <= float(s['Confidence'].rstrip('%')) <= 80])}")
print(f"   • Low confidence suggestions (<60%): {len([s for s in ai_suggestions if float(s['Confidence'].rstrip('%')) < 60])}")
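
The three summary lines above each re-parse the confidence strings. The same banding can be kept in one place, as in this sketch. `confidence_band` is a hypothetical helper, and the thresholds are the ones used in the summary (above 80, 60 to 80 inclusive, below 60).

```python
from collections import Counter


def confidence_band(confidence):
    """Classify a confidence string such as '85%' as 'high', 'medium', or 'low'."""
    score = float(confidence.rstrip("%"))
    if score > 80:
        return "high"
    if score >= 60:
        return "medium"
    return "low"


# Confidence values of the four mock suggestions
band_counts = Counter(confidence_band(c) for c in ["85%", "60%", "90%", "70%"])
print(dict(band_counts))
```

Centralising the thresholds means a later change, such as raising the high-confidence bar, only has to be made once.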

## 6. Human Review Interface (Mock)

In [None]:
# Mock human review interface
print("👥 Human Review Interface Demo")
print("\nIn a production system, this would be an interactive interface where analysts can:")
print("   • Review AI suggestions with detailed reasoning")
print("   • Accept or reject suggestions")
print("   • Provide feedback for model improvement")
print("   • Add manual corrections")
print("   • Track decision audit trail")

# Mock review decisions
review_decisions = [
    {
        "Company": "Green Energy Solutions",
        "Suggestion": "Change SIC code from 47190 to 71200",
        "Decision": "ACCEPTED",
        "Analyst": "Sarah Johnson",
        "Timestamp": "2025-09-13 10:30:00",
        "Notes": "Correct classification based on business activities"
    },
    {
        "Company": "Green Energy Solutions",
        "Suggestion": "Change turnover from -50000 to 450000",
        "Decision": "REJECTED",
        "Analyst": "Sarah Johnson",
        "Timestamp": "2025-09-13 10:32:00",
        "Notes": "Need to verify actual financial data before accepting estimate"
    },
    {
        "Company": "Financial Advisory Group",
        "Suggestion": "Set SIC code to 66220",
        "Decision": "ACCEPTED",
        "Analyst": "Mike Chen",
        "Timestamp": "2025-09-13 10:35:00",
        "Notes": "Accurate classification for financial advisory services"
    },
    {
        "Company": "Financial Advisory Group",
        "Suggestion": "Set turnover to 320000",
        "Decision": "PENDING",
        "Analyst": "Mike Chen",
        "Timestamp": "2025-09-13 10:37:00",
        "Notes": "Waiting for additional data sources to confirm estimate"
    }
]

review_df = pd.DataFrame(review_decisions)

print("\n📝 Sample Review Decisions:")
display(review_df)

# Calculate review statistics
accepted = len([d for d in review_decisions if d['Decision'] == 'ACCEPTED'])
rejected = len([d for d in review_decisions if d['Decision'] == 'REJECTED'])
pending = len([d for d in review_decisions if d['Decision'] == 'PENDING'])

print(f"\n📊 Review Statistics:")
print(f"   • Accepted suggestions: {accepted}")
print(f"   • Rejected suggestions: {rejected}")
print(f"   • Pending review: {pending}")
print(f"   • Acceptance rate: {accepted/(accepted+rejected)*100:.1f}%" if (accepted+rejected) > 0 else "   • Acceptance rate: N/A")
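
The review statistics can also be computed in a single pass, with the divide-by-zero case handled inside the helper. This is a small sketch assuming the same `Decision` values as the mock data. `review_stats` is a hypothetical name.

```python
from collections import Counter


def review_stats(decisions):
    """Count decisions by status and return (counts, acceptance rate or None)."""
    counts = Counter(d["Decision"] for d in decisions)
    decided = counts["ACCEPTED"] + counts["REJECTED"]
    # Pending items are excluded from the rate; None signals "nothing decided yet"
    rate = counts["ACCEPTED"] / decided if decided else None
    return counts, rate


demo_decisions = [
    {"Decision": "ACCEPTED"},
    {"Decision": "REJECTED"},
    {"Decision": "ACCEPTED"},
    {"Decision": "PENDING"},
]
counts, rate = review_stats(demo_decisions)
print(counts["ACCEPTED"], counts["REJECTED"], counts["PENDING"])
print("N/A" if rate is None else f"{rate:.1%}")
```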

## 7. Data Quality Dashboard

In [None]:
# Create data quality dashboard
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
plt.style.use('default')
sns.set_palette("husl")

# Create dashboard
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Credit Risk Data Quality Dashboard', fontsize=16, fontweight='bold')

# 1. Anomaly Types Distribution (counted from the detected anomalies)
type_counts = anomalies_df['Type'].value_counts()
anomaly_types = type_counts.index.tolist()
anomaly_counts = type_counts.tolist()
colors1 = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A']

ax1.pie(anomaly_counts, labels=anomaly_types, autopct='%1.1f%%', colors=colors1[:len(anomaly_counts)])
ax1.set_title('Anomaly Types Distribution')

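# 2. Company Data Completeness (share of non-missing values per field)
field_columns = {
    'Company Name': 'company_name',
    'SIC Code': 'primary_sic_code',
    'Turnover': 'turnover',
    'Address': 'address_line1',
    'Description': 'description'
}
fields = list(field_columns)
completeness = [round(companies_df[col].notna().mean() * 100) for col in field_columns.values()]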

bars = ax2.bar(fields, completeness, color=['#2ECC71' if x == 100 else '#F39C12' if x >= 80 else '#E74C3C' for x in completeness])
ax2.set_title('Data Completeness by Field')
ax2.set_ylabel('Completeness %')
ax2.set_ylim(0, 110)
plt.setp(ax2.get_xticklabels(), rotation=45, ha='right')

# Add value labels on bars
for bar, value in zip(bars, completeness):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, 
             f'{value}%', ha='center', va='bottom')

# 3. Confidence Score Distribution
# Use the confidence values of the suggestions generated in section 5
confidence_scores = [float(s['Confidence'].rstrip('%')) for s in ai_suggestions]
ax3.hist(confidence_scores, bins=5, color='#3498DB', alpha=0.7, edgecolor='black')
ax3.set_title('AI Suggestion Confidence Distribution')
ax3.set_xlabel('Confidence Score (%)')
ax3.set_ylabel('Frequency')
ax3.axvline(x=np.mean(confidence_scores), color='red', linestyle='--', 
            label=f'Mean: {np.mean(confidence_scores):.1f}%')
ax3.legend()

# 4. Review Decisions Summary (counts computed in section 6)
decisions = ['Accepted', 'Rejected', 'Pending']
decision_counts = [accepted, rejected, pending]
colors4 = ['#2ECC71', '#E74C3C', '#F39C12']

bars = ax4.bar(decisions, decision_counts, color=colors4)
ax4.set_title('Review Decisions Summary')
ax4.set_ylabel('Count')

# Add value labels on bars
for bar, value in zip(bars, decision_counts):
    ax4.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05, 
             str(value), ha='center', va='bottom')

plt.tight_layout()
plt.show()

# Summary statistics
print("\n📊 Data Quality Summary:")
print(f"   • Total companies processed: {len(sample_companies)}")
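# Use anomalies_df, which exists whether the agent or the mock fallback ran
# Quality score = share of companies with no detected anomaly
flagged_companies = anomalies_df['Company'].nunique()
print(f"   • Anomalies detected: {len(anomalies_df)}")
print(f"   • Data quality score: {(1 - flagged_companies/len(sample_companies)) * 100:.1f}%")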
print(f"   • Average AI confidence: {np.mean(confidence_scores):.1f}%")
print(f"   • Suggestions reviewed: {accepted + rejected}")
print(f"   • Acceptance rate: {accepted/(accepted+rejected)*100:.1f}%" if (accepted+rejected) > 0 else "   • Acceptance rate: N/A")
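
Field completeness, as plotted in the dashboard, can be measured directly from the records. The sketch below is a plain-Python version under the assumption that `None` and the empty string count as missing while a turnover of `0` counts as present. `completeness_by_field` is a hypothetical helper.

```python
def completeness_by_field(records, fields):
    """Percentage of records in which each field holds a non-empty value."""
    total = len(records)
    result = {}
    for field_name in fields:
        filled = sum(1 for record in records if record.get(field_name) not in (None, ""))
        result[field_name] = 100.0 * filled / total if total else 0.0
    return result


# SIC code and turnover values of the five sample companies
demo_records = [
    {"primary_sic_code": "62012", "turnover": 850000},
    {"primary_sic_code": "47190", "turnover": -50000},
    {"primary_sic_code": None, "turnover": None},
    {"primary_sic_code": "46190", "turnover": 2500000000},
    {"primary_sic_code": "64209", "turnover": 0},
]
field_completeness = completeness_by_field(demo_records, ["primary_sic_code", "turnover"])
print(field_completeness)
```

Completeness only says a value is present, not that it is plausible: the negative turnover above still counts as filled, which is why the anomaly checks are kept separate.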

## 8. Next Steps and Production Considerations

In [None]:
print("🚀 Next Steps for Production Implementation:")
print("")
print("1. 🔐 Security & Configuration:")
print("   • Store API keys in Databricks secrets")
print("   • Implement proper authentication and authorization")
print("   • Set up secure connections to external APIs")
print("")
print("2. 🤖 AI Model Enhancement:")
print("   • Train custom sector classification models")
print("   • Implement advanced NLP for business description analysis")
print("   • Develop turnover estimation algorithms using industry data")
print("   • Set up MLflow for model versioning and tracking")
print("")
print("3. 💾 Data Pipeline:")
print("   • Implement Delta Lake for reliable data storage")
print("   • Set up automated data ingestion workflows")
print("   • Create data validation and quality checks")
print("   • Implement real-time streaming for live updates")
print("")
print("4. 👥 User Interface:")
print("   • Build interactive dashboard with Databricks SQL")
print("   • Create analyst workflow interface")
print("   • Implement notification system for high-priority anomalies")
print("   • Add audit trail and decision tracking")
print("")
print("5. 📊 Monitoring & Observability:")
print("   • Set up model performance monitoring")
print("   • Implement data drift detection")
print("   • Create alerting for system issues")
print("   • Track business metrics and KPIs")
print("")
print("6. 🔄 Continuous Improvement:")
print("   • Collect analyst feedback for model retraining")
print("   • Implement A/B testing for new features")
print("   • Regular model performance reviews")
print("   • Expand to additional data sources and use cases")

print("\n✅ Demo Complete!")
print("This notebook demonstrates the core capabilities of the multi-agent")
print("anomaly detection system. The next step is to implement the production")
print("components listed above for a fully operational system.")