<a href="https://colab.research.google.com/github/LashawnFofung/AI-Portfolio/blob/main/src/AI_Powered_Document_Intelligence_Automation_Platform_MVP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Final Project: 🤖 AI-Powered Document Intelligence Automation Platform MVP**

Production-Grade Retrieval-Augmented Generation (RAG) & Document Governance

This platform is a high-performance solution for high-volume document environments (Legal, Finance, HR). Unlike generic RAG systems that suffer from **"Context Contamination"** (mixing unrelated data), this system uses **Intelligent Boundary Detection** and **Metadata-Rich Chunking** to isolate and retrieve precise document segments.

<br><br>

## **🛠️ MVP Objectives**
- **Contextual Fidelity:** Reduce hallucinations by segregating the vector space based on document classification.

- **Intelligent Automation:** Automatically split bulk PDFs (e.g., a 50-page file containing 5 different contracts) into distinct logical units.

- **Hardware-Aware Multi-Model Versatility:** Toggle between Gemini 2.0 Flash (API) and the local Mistral 7B and Phi-2 models, with automated VRAM deep purges for stability on T4 GPUs.

- **Audit-Ready Compliance:** Every response is passed through a Quality Audit Gate before being displayed to the user.

<br><br>

## **🏗️ Key Technical Architecture**
The system follows a modular Six-Stage Execution Cycle:

1. **Ingestion Layer:** Hybrid OCR (PyMuPDF + Tesseract) extracts text and images while preserving spatial metadata.
2. **Intelligence Layer:** LLMs classify documents into a taxonomy (e.g., "Invoice," "Legal Agreement") and detect page boundaries.
3. **Storage Layer:** LlamaIndex + FAISS create Segregated Silos using metadata filters, preventing data leakage between files.
4. **Orchestration Layer:** A Python-generator-based "Thinking Loop" manages asynchronous status updates and hardware state safety.
5. **Audit Layer:** Automatic calculation of the **RAG Triad** (Faithfulness, Relevance, and Context Density).
6. **Presentation Layer:** An **Obsidian-themed** Gradio UI with real-time PDF previewing and exportable audit reports.


<br><br>

## **🌟 Core MVP Capabilities**
- **Multi-Modal Routing:** Detects if a query is about "Amounts" vs. "Legal Terms" and targets the specific document silo automatically.
- **VRAM Management:** A "Safety Gate" routine (`deep_purge_gpu`) frees GPU memory when switching between heavy local models and light API models, guarding against out-of-memory crashes.
- **Performance Dashboard:** Real-time visualization of **Latency vs. Token Speed** and **Industry Ground Truth** benchmarks.
- **Source Attribution:** Every AI response includes clickable citations (e.g., "Source: Invoice (p. 4)") to ensure human-in-the-loop verification.

<br><br>

## **📖 How to Operate**
- **Environment Setup:** Run Sections 1 and 2 to install dependencies and configure your Gemini API key in the Colab Secrets.
- **Initialization:** Run the global configuration cells to load the BGE embedding model and initialize the default LLM.
- **Select Engine:** Use the `switch_llm` function or the UI dropdown to select your preferred AI model (e.g., "Gemini 2.0 Flash").
- **Upload & Process:** Upload your documents via the Gradio interface. The system will automatically classify and index them.
- **Query & Audit:** Enter your business questions. Use the "Audit Log" tab to view performance metrics and download a professional PDF summary of the session.

<br><br>
## **🔍 Section Logic & Flow Analysis**

- **1-2: Core Config**
  - Initializes global state, mounts drives, and manages API/Security keys.
      - [Section 1: Setup And Installation](#scrollTo=OnjSSFKJmRQc)
        - [Global Asyncio & Uvicorn Fix](#scrollTo=luHZTqNM1YzY&line=2&uniqifier=1)
      - [Section 2A: Core Imports, Security Keys, And Global Settings Configurations](#scrollTo=1TlPkZLbvExS)
      - [Section 2B: LLM Factory & Resource Configuration](#scrollTo=1MYP_DZAw6QJ)

- **3: Data Structures (Schema)**
  - Defines `LogicalDocument` and `ChunkMetadata` dataclasses for pipeline consistency.
      - [Section 3: Data Structures For Enhanced Document Management](#scrollTo=RLrZv_r30NBc)

- **4-5: Ingestion & OCR**
  - **Aware Routing:** Uses LLM-based boundary detection to separate scanned images from text PDFs.
      - [Section 4: Document Intelligence Functions](#scrollTo=xQNJUFwB2gMZ)
      - [Section 5: Advanced PDF Processing Pipeline](#scrollTo=yGrQwXDp4gh_)

- **6: Intelligent Chunking**
  - Implements Metadata-Rich Chunking where every fragment knows its DocID and Page#.
      - [Section 6: Intelligent Chunking With Metadata Preservation](#scrollTo=qYroZgDG9FxX)

- **7-8: Vector DB**
  - Segregates FAISS indices to create "Document Silos" for cleaner retrieval.
      - [Section 7: Query Routing And Intelligent Retrieval](#scrollTo=d_MsgfAaAGsm)
      - [Section 8: Enhanced Answer Generation With Source Attribution](#scrollTo=VrLmw0oyEGfq)

- **9-10: The Brain (Orchestrator)**
  - Handles Streaming UX, Context Window Safety, and Hardware Deep Purges.
      - [Section 9: Enhanced Document Store](#scrollTo=mJTVS2CcHBq3)
      - [Section 10: Backend Chat & Audit Logic](#scrollTo=txaCA0n9MNYN)

- **11-12: UI & Reporting**
  - Gradio Layout: A three-pillar interface (Operations, Configurations, Audit).
      - [Section 11: Chatbot Logic & Orchestration](#scrollTo=80rvI0tlTlEu)
      - [Section 12: Gradio Interface, Chat Handlers, & Wiring Logic](#scrollTo=2pYWo7wCa47E)

- **13: Application Launcher**
  - [Application MVP](#scrollTo=GgDAO0pTcury)





# **SECTION 1. SETUP AND INSTALLATION**

**Logic and Flow Analysis**

This section serves as the **Foundation Layer** of the AI-Powered Document Intelligence Platform. The logic follows a linear, non-destructive sequence:
- **Dependency Provisioning:** Installs the multi-modal stack required for the MVP. This includes UI components (`Gradio`), document parsing (`PyMuPDF`, `LlamaIndex`), OCR engines (`Tesseract`), and the vector search backend (`FAISS`).

- **Global State Initialization:** Sets up persistent tracking variables (`audit_logs` and `current_llm`). This is a critical design choice for the MVP, as it allows for performance metrics to persist across multiple document uploads and ensures the system knows which LLM engine is currently "warm" in memory.

- **Asynchronous Handling:** Applies `nest_asyncio` to prevent event loop conflicts during RAG pipeline execution.

- **Resource Mounting:** Links Google Drive to ensure the UI has access to static assets (logos/branding) and persistent storage for output reports.
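
As a rough illustration of the global-state design described above, the sketch below shows one way a record could be appended to `audit_logs` after each query. The helper name `record_audit_entry` and the field names (`model`, `query`, `latency_s`) are assumptions made for this example; the fields the platform actually logs are defined in later sections.

```python
import time
from typing import Any, Dict, List

# Mirrors the global list initialized in this section.
audit_logs: List[Dict[str, Any]] = []


def record_audit_entry(logs: List[Dict[str, Any]], model_name: str,
                       query: str, started_at: float) -> Dict[str, Any]:
    """Append one latency record to the shared log (hypothetical helper)."""
    entry = {
        "model": model_name,
        "query": query,
        # perf_counter is monotonic, so the difference is never negative
        "latency_s": round(time.perf_counter() - started_at, 4),
    }
    logs.append(entry)
    return entry


start = time.perf_counter()
record_audit_entry(audit_logs, "Gemini 2.0 Flash", "What is the amount due?", start)
print(len(audit_logs))
```

Because the list lives at module level, entries accumulate across uploads and model switches until the notebook runtime is restarted.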

<br>

In [None]:
# ------- GLOBAL ASYNCIO & UVICORN FIX -------

# --- [ASYNC & ENVIRONMENT PREP] ---
# Prevents kernel crashes when switching models or running RAG queries
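!pip install -q nest_asyncio
# Quote the version spec so the shell does not treat ">" as an output redirect
!pip install -q "uvicorn>=0.34.0"

import uvicorn
import asyncio
import nest_asyncio
import sys

# Patch the running event loop so nested async calls do not conflict
nest_asyncio.apply()

print("✅ GLOBAL ASYNCIO INSTALL & IMPORTS COMPLETE.")

✅ GLOBAL ASYNCIO INSTALL & IMPORTS COMPLETE.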


In [None]:
# ------- SECTION 1. SETUP AND INSTALLATION -------

# 1.1 UI, PDF Processing & Machine Learning Foundations
# Grouping core utilities for document extraction and interface building
!pip install -q \
    gradio gradio_pdf \
    pypdf PyPDF2 pymupdf \
    pillow \
    sentence-transformers transformers \
    faiss-cpu \
    numpy pandas jedi \
    json-repair

# 1.2 LlamaIndex Orchestration Stack
# Specifically for RAG (Retrieval-Augmented Generation) and metadata management
!pip install -q \
    llama-index \
    llama-index-readers-file \
    llama-index-vector-stores-faiss

# 1.3 LLM Engine Support (Multi-Modal Switching)
# Libraries required to swap between API-based (Gemini) and Local (HuggingFace) models
!pip install -q \
    llama-index-llms-huggingface \
    llama-index-embeddings-huggingface \
    transformers accelerate bitsandbytes

!pip install -U -q google-generativeai llama-index-llms-google-genai

# 1.4 OCR & Specialized Reporting Tools
# Tesseract for scanned docs; ReportLab for automated PDF performance summaries
!apt-get install -y tesseract-ocr
!pip install -q \
    pytesseract \
    reportlab rouge-score \
    matplotlib seaborn

# --- MISTRAL MODEL INSTALLATION with COLAB GPU  ---
# Install llama-cpp-python with CUDA support for the T4 GPU.
# Note: recent llama-cpp-python releases expect -DGGML_CUDA=on; older ones used -DLLAMA_CUDA=on.
!CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
# Install the LlamaIndex connector for LlamaCPP and json-repair for Mistral cleaning
!pip install llama-index-llms-llama-cpp json-repair


# --- [GLOBAL STATE INITIALIZATION] ---

# --- 1. Initialize GLOBAL AUDIT LOG for Performance Tracking ---
# audit_logs: Stores performance data (latencies, ROUGE scores) for the report generator
audit_logs = []
print("✅ Global audit_logs list initialized.")

# --- 2. Initialize GLOBAL STATE TRACKING FOR LLM CHOICE ---
# current_llm/name: Tracks the active engine to prevent unnecessary re-loading of weights
current_llm = None
current_model_name = ""
print("✅ Global state for LLM variables initialized.")



# --- [EXTERNAL STORAGE LINKING] ---
# MOUNT GOOGLE DRIVE (For UI Image) ---
from google.colab import drive
drive.mount('/content/drive')

print("✅ SECTION 1. SETUP AND INSTALLATION COMPLETE.")



Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 41 not upgraded.
✅ Global audit_logs list initialized.
✅ Global state for LLM variables initialized.
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ SECTION 1. SETUP AND INSTALLATION COMPLETE.


# **SECTION 2: CORE IMPORTS, SECURITY KEYS, AND GLOBAL SETTINGS CONFIGURATIONS**

## **SECTION 2A. CORE IMPORTS, SECURITY KEYS, AND GLOBAL SETTINGS CONFIGURATIONS**

This section establishes the Intelligence Layer and Security Protocol.

<br>

**Logic and Flow Analysis:**

- **Comprehensive Imports:** Consolidates all necessary libraries from data visualization (`Seaborn`) to RAG orchestration (`LlamaIndex`).

- **Memory Safeguards:** Implements a 4-bit quantization configuration (`BitsAndBytes`) specifically tuned for the 16GB VRAM limit of the Google Colab T4 GPU.

- **Secret Management:** Securely retrieves API keys from Colab's internal `userdata` (Secrets) to prevent accidental exposure in the code.

- **Global Singleton Configuration:** Sets the `Settings` object for LlamaIndex, ensuring that every retrieval and generation call throughout the application uses a consistent embedding model and LLM.
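
To make the memory-safeguard point concrete, here is a back-of-the-envelope estimate of weight storage at different precisions. It is a sketch under stated assumptions: the helper `estimate_weight_memory_gb` is hypothetical, a "7B" model is taken to have roughly 7 billion parameters, and the figure covers weights only (no KV cache, activations, or quantization overhead).

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: ~7 billion parameters for a 7B-class model.
fp16_gb = estimate_weight_memory_gb(7e9, 16)
int4_gb = estimate_weight_memory_gb(7e9, 4)
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

Under these assumptions, half-precision weights alone come close to the T4's 16GB, while 4-bit weights leave room for the embedding model and inference buffers.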

<br>

In [None]:
# ------- SECTION 2A. CORE IMPORTS, SECURITY KEYS, AND GLOBAL SETTINGS CONFIGURATIONS -------

# 1. Standard Library & Utilities
import os, time, json, re, io, tempfile, hashlib
import random # Used for simulating performance audit metrics
import gc # Memory Management: Essential for Google Colab T4 GPU
from datetime import datetime
from typing import List, Dict, Tuple, Optional, Any
from dataclasses import dataclass, field
from concurrent.futures import ThreadPoolExecutor


# 2. Data Science & Visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# 3. Document Processing & OCR
import fitz  # PyMuPDF
import pytesseract
from PIL import Image
from PyPDF2 import PdfReader


# 4. AI & Machine Learning (Vector Engine/Backend)
# Core Frameworks
import torch # Memory Management: Essential for Google Colab T4 GPU
gc.collect()
torch.cuda.empty_cache()


import faiss
from rouge_score import rouge_scorer
from transformers import BitsAndBytesConfig
from sentence_transformers import SentenceTransformer


# 5. LlamaIndex (RAG Framework)
# The Orchestrator
from llama_index.core.schema import TextNode
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Document, VectorStoreIndex, StorageContext, Settings
from llama_index.core.vector_stores import MetadataFilters, MetadataFilter, FilterOperator

        # --- LLM & Embedding -----
from json_repair import repair_json  # Critical imports for your Mistral/Router logic
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding


# 6. UI & Automated PDF Reporting
import gradio as gr
from gradio_pdf import PDF
from reportlab.lib import colors
from google.colab import userdata
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from reportlab.lib.colors import HexColor, black, green
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle


# --- --- --- [MEMORY MANAGEMENT] --- --- ---
# Shared 4-bit configuration for T4 GPU efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# --- --- --- [RESOURCE PATH DEFINITIONS] ---- --- ---
# 1A. Define File Path - LOGO
PROJECT_FOLDER = '/content/drive/MyDrive/AI_Powered_Document_Intelligence_Automation_Platform'
LOGO_PATH = os.path.join(PROJECT_FOLDER, 'AI Document Assistant logo v2.png')

# 2A. Define File Path - CONFIG & FILTER IMAGE
CONFIG_FILTER_PATH = os.path.join(PROJECT_FOLDER, 'Document Filter and RAG.png')

# 1B. Verify the path exists to avoid "File Not Found" errors later
if os.path.exists(LOGO_PATH):
    print(f"✅ Image found at: {LOGO_PATH}")
else:
    print(f"❌ Warning: Image not found. Check path: {LOGO_PATH}")


# 2B. Verify the path exists to avoid "File Not Found" errors later
if os.path.exists(CONFIG_FILTER_PATH):
    print(f"✅ Image found at: {CONFIG_FILTER_PATH}")
else:
    print(f"❌ Warning: Image not found. Check path: {CONFIG_FILTER_PATH}")


# --- --- --- [SECURITY & GLOBAL CONFIGURATION] --- --- ---
# 1A. Load Gemini API
API_KEY = userdata.get('GEMINI_API_KEY')
if not API_KEY:
    raise ValueError("GEMINI_API_KEY not found in Colab Secrets.")

# MANDATORY: Set as environment variable so all internal LlamaIndex
# calls and the switch_llm factory can find it automatically.
os.environ["GOOGLE_API_KEY"] = API_KEY



# 1B. Load Hugging Face Token
# Retrieve the secret from Colab and set it as an environment variable
try:
    hf_token = userdata.get('HF_TOKEN')
    os.environ["HF_TOKEN"] = hf_token
    print("✅ Hugging Face Token successfully loaded from Colab Secrets.")
except Exception as e:
    print("❌ Could not find HF_TOKEN in Colab Secrets. Check the 'Key' icon on the left.")


# 2. Configure Embedding Model
llama_embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# 3A. GLOBAL CONFIGURATION (Crucial for Section 9 & 10)
Settings.embed_model = llama_embed_model

# 3B. Initial default LLM
Settings.llm = GoogleGenAI(model="models/gemini-2.0-flash")

# 3C. SAFE NAME ASSIGNMENT
# We use a custom attribute that Pydantic won't block,
# or simply use the existing 'model' attribute.
# To satisfy your Section 11 logs, we use this "monkeypatch" method:
try:
    # This bypasses Pydantic's strict check
    object.__setattr__(Settings.llm, 'model_name', "Gemini 2.0 Flash")
except Exception:
    pass

print(f"✅ Settings initialized with: {getattr(Settings.llm, 'model_name', Settings.llm.model)}")


print("✅ SECTION 2A. CORE IMPORTS, SECURITY KEYS LOADED, AND GLOBAL SETTINGS CONFIGURATIONS COMPLETE.")



✅ Image found at: /content/drive/MyDrive/AI_Powered_Document_Intelligence_Automation_Platform/AI Document Assistant logo v2.png
✅ Image found at: /content/drive/MyDrive/AI_Powered_Document_Intelligence_Automation_Platform/Document Filter and RAG.png
✅ Hugging Face Token successfully loaded from Colab Secrets.
✅ Settings initialized with: Gemini 2.0 Flash
✅ SECTION 2A. CORE IMPORTS, SECURITY KEYS LOADED, AND GLOBAL SETTINGS CONFIGURATIONS COMPLETE.


## **SECTION 2B. LLM FACTORY & RESOURCE CONFIGURATION**


This section implements the **Model Orchestration Layer** using a Factory Pattern.

<br>

**Logic and Flow Analysis:**


- **Memory Purge Mechanism:** Before loading a new model, the system explicitly deletes the old object and triggers `torch.cuda.empty_cache()`. This is essential for the T4 GPU, which only has 16GB of VRAM.

- **Dynamic Model Loading:**
  - **Gemini 2.0 Flash:** Uses an API-based approach (zero local VRAM impact).
  - **Mistral-7B (GGUF):** Uses `LlamaCPP` with a 4-bit quantized (`Q4_K_M`) model, offloading 30 layers to the GPU (`n_gpu_layers=30`) to balance speed against the T4's VRAM headroom.
  - **Phi-2:** A lightweight alternative for rapid testing and low-latency extraction.

- **Global Settings Sync:** Every time a model is switched, the `Settings.llm` singleton is updated so the rest of the RAG pipeline automatically uses the new engine.
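
The purge step described above can be sketched as follows. `switch_llm` calls `deep_purge_gpu`, whose body is not shown in this section, so the function below is an assumed stand-in named `deep_purge_gpu_sketch`, not the platform's implementation. It drops the last reference to the old model, runs the garbage collector, and releases cached CUDA memory when a GPU is present. The `holder` dictionary is a hypothetical substitute for the notebook's global `current_llm`.

```python
import gc

try:
    import torch  # optional here so the sketch also runs on CPU-only machines
except ImportError:
    torch = None


def deep_purge_gpu_sketch(holder: dict, key: str = "llm") -> bool:
    """Drop the old model reference and free cached GPU memory.

    Returns True if a CUDA cache was cleared, False otherwise.
    """
    holder.pop(key, None)  # remove the reference so the model can be collected
    gc.collect()           # reclaim Python-side objects first
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand cached blocks back to the driver
        torch.cuda.ipc_collect()  # clean up inter-process CUDA handles
        return True
    return False


state = {"llm": object()}  # placeholder for a loaded model
cleared = deep_purge_gpu_sketch(state)
print("llm" in state, cleared)
```

Order matters: `empty_cache()` can only release memory that no live tensor still occupies, so the reference must be dropped and collected before the cache is cleared.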

In [None]:
# ------- SECTION 2B. LLM FACTORY & RESOURCE CONFIGURATION -------

# --- 🧠 LLM FACTORY ---

def switch_llm(model_name: str):

    # Global trackers
    global current_llm, current_model_name

    # --- MEMORY PURGE (T4 GPU Optimization) ---
    deep_purge_gpu()


    print(f"🚀 Initializing {model_name}...")

    try:

        # ---- 🧠 Gemini 2.0 (Cloud) -----
        if model_name == "Gemini 2.0 Flash":

            if not os.environ.get("GOOGLE_API_KEY"):
                return "❌ Error: GOOGLE_API_KEY not found in environment."

            current_llm = GoogleGenAI(model="models/gemini-2.0-flash")

        # ---- 🧠 Mistral-7B (Local GGUF) -----
        elif model_name == "Mistral-7B (Llama-CPP)":
            from llama_index.llms.llama_cpp import LlamaCPP

            # Using 4-bit quantization to fit comfortably on T4 GPU
            current_llm = LlamaCPP(
                model_url="https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
                temperature=0.1,
                max_new_tokens=512,
                context_window=4096,
                # n_gpu_layers: -1 moves everything to GPU, 30-35 is safer for 16GB VRAM
                model_kwargs={"n_gpu_layers": 30, "offload_kqv": True},
                messages_to_prompt=lambda msgs: f"[INST] {' '.join([m.content for m in msgs])} [/INST]",
            )

        # ---- 🧠 Phi-2 (Local HF) -----
        elif model_name == "Phi-2 (Small & Fast)":
            from llama_index.llms.huggingface import HuggingFaceLLM

            # Phi-2 requires 'trust_remote_code' and specific dtypes for T4
            current_llm = HuggingFaceLLM(
                model_name="microsoft/phi-2",
                tokenizer_name="microsoft/phi-2",
                context_window=2048,
                max_new_tokens=512,
                model_kwargs={"trust_remote_code": True, "torch_dtype": torch.float16},
                device_map="cuda"
            )

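        else:
            # Unknown name: raise so the fallback below runs instead of
            # reporting a model that never loaded
            raise ValueError(f"Unsupported model: {model_name}")

        # --- VALIDATION & SETTINGS BINDING ---
        # Settings is already imported at module level (Section 2A). Re-importing it
        # here would make it a local name and leave it unbound in the except block.
        Settings.llm = current_llm
        current_model_name = model_name

        return f"✅ System Active: {model_name}"

    except Exception as e:
        # Emergency Recovery: Revert to Gemini to keep the tunnel alive
        current_llm = GoogleGenAI(model="models/gemini-2.0-flash")
        Settings.llm = current_llm
        current_model_name = "Gemini 2.0 Flash"
        return f"⚠️ Fallback Active: {str(e)}"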



print("✅ SECTION 2B. LLM FACTORY & RESOURCE CONFIGURATION COMPLETE.")



✅ SECTION 2B. LLM FACTORY & RESOURCE CONFIGURATION COMPLETE.


## **SECTION 2C. GLOBAL DOCUMENT TAXONOMY CONFIGURATION**

This section establishes the **Taxonomic Framework** for the platform. It moves beyond simple lists by creating a multi-dimensional mapping of document types.

<br>

**Logic and Flow Analysis:**

- **Semantic Mapping:** Defines each category with "Context Keywords." This allows the LLM to identify a document even if the explicit title (e.g., "Invoice") is missing, by looking for supporting evidence (e.g., "amounts due," "billing").

- **Sector Clustering:** Groups document types into industry-specific "Sectors" (Real Estate, Healthcare, Legal). This enables the UI to eventually offer "Industry-Specific Extraction Modes."

- **Prompt Automation:** Dynamically generates the `TAXONOMY_PROMPT_STR`. This ensures that if you add a new category to the dictionary, the LLM's "instructions" are automatically updated without manual rewriting.
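
The prompt-automation idea can be illustrated with a toy taxonomy. This is a minimal sketch: `toy_taxonomy` copies only the shape of `DOCUMENT_TAXONOMY`, and both `build_taxonomy_prompt` and the `Receipt` category are hypothetical names introduced for the example.

```python
from typing import Dict

# Toy taxonomy with the same shape as DOCUMENT_TAXONOMY (entries shortened).
toy_taxonomy: Dict[str, Dict] = {
    "categories": {
        "Invoice": "Bill, amounts due, billing",
        "Contract": "Legal agreement, governing law",
        "Other": "Anything that fits no other category",
    },
    "sectors": {"Legal": ["Contract"]},
}


def build_taxonomy_prompt(taxonomy: Dict[str, Dict]) -> str:
    """Render category names and context keywords into prompt text."""
    names = ", ".join(taxonomy["categories"])
    definitions = "\n".join(
        f"- {name}: {keywords}" for name, keywords in taxonomy["categories"].items()
    )
    return f"Classify into ONE of: {names}\n\nDefinitions:\n{definitions}"


prompt_before = build_taxonomy_prompt(toy_taxonomy)
# Adding a category to the dictionary is the only change needed.
toy_taxonomy["categories"]["Receipt"] = "Proof of payment, paid in full"
prompt_after = build_taxonomy_prompt(toy_taxonomy)
print("Receipt" in prompt_before, "Receipt" in prompt_after)
```

Since the prompt is rebuilt from the dictionary on each call, the category list and its keyword definitions cannot drift apart.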



In [None]:
# ------- SECTION 2C. GLOBAL DOCUMENT TAXONOMY CONFIGURATION -------


# --- GLOBAL DOCUMENT TAXONOMY CONFIGURATION ---
DOCUMENT_TAXONOMY = {
    "categories": {
        "Resume": "CV, professional profile, work history, Career, experience, education, skills, career summary, employment dates, employment history",
        "Contract": "Legal agreement, force majeure, governing law, indemnification, confidentiality, termination clause, 'in witness whereof', terms and conditions, service agreement, obligations, parties, signature page, clauses, 'hereby agree'",
        "Mortgage_Contract": "Home loan agreement, home loan, deed of trust, lien, mortgage terms, property financing, interest rates, principal amount, principal, note, amortization, escrow, prepayment penalty, borrower, lender",
        "Invoice": "Bill, payment request, financial statement, amounts due, billing, charges, invoiced items, itemized list, subtotal, remittance, 'please pay by'",
        "Pay_Slip": "Salary statement, wage slip, wages, earnings statement, deductions, pay period, year-to-date (YTD), gross pay, net pay, pay period, social security withholding, employer tax ID",
        "Lender_Fee_Sheet": "Loan fee, loan estimate, lender charge, closing cost, closing disclosure, origination fee, appraisal fee, title insurance, escrow deposit, settlement charges",
        "Land_Deed": "Property deed, property ownership, title document, title, ownership certificate, property ownership, grantor, grantee, survey, parcel number, county recorder, transfer of ownership, quitclaim, warranty deed, conveyance, parcel ID, legal description, notary acknowledgement",
        "Bank_Statement": "Account statement, opening balance, account balance, transaction history, deposits, withdrawals, checking/savings, available credit",
        "Tax_Document": "W2, 1099, 1099-MISC, Form 1040, IRS, tax return, tax form, federal income tax, social security wages, tax year, withholding, tax amounts (Includes W2s and Tax Returns)",
        "Insurance": "Insurance policy, coverage document, coverage, policy details, premium, claims, policy declaration, coverage, premium, deductible, insured party, policy number, liability, claims, effective date, policy holder",
        "Report": "Analysis, research document, findings, conclusion, research data, whitepaper, executive summary, methodology",
        "Legal_Letter": "Formal correspondence, formal notice, formal request, attorney-client privilege, re:, service of process, demand letter, notice to quit, legal notification, attorney-client communication, correspondence, memo, communication, requests, instructions, notifications",
        "Health_Form": "Application, questionnaire, data entry form, submitted data, form fields",
        "ID_Document": "Driver's license, passport, passport number, identification, ID numbers, birth certificate, visa, state ID, identity verification, expiration date, photo ID, place of birth, issue date, biometric, state seal",
        "Medical_Report": "Medical report, prescription, health record, health information, medical conditions, patient records, clinical notes, diagnosis, lab results, physician statement",
        "Other": "Miscellaneous documents that do not contain keywords for specific financial, legal, or professional categories, or do not fit other categories"
    },
    "sectors": {
        "Real Estate": ["Mortgage_Contract", "Lender_Fee_Sheet", "Land_Deed", "Pay_Slip", "Tax_Document", "Bank_Statement", "Report"],
        # Every entry must be a key of "categories" ("Medical" was not one, so it is dropped)
        "Healthcare": ["Medical_Report", "Health_Form", "Insurance"],
        "Legal": ["Contract", "Land_Deed", "Legal_Letter", "Health_Form"]
    }
}

# Helper function to get sector for a document type
def get_sector_for_type(doc_type: str) -> str:
    """Returns the primary industry sector for a given document type."""
    for sector, types in DOCUMENT_TAXONOMY["sectors"].items():
        if doc_type in types:
            return sector
    return "General"

print("✅ SECTION 2C. GLOBAL DOCUMENT TAXONOMY CONFIGURATION COMPLETE.")


✅ SECTION 2C. GLOBAL DOCUMENT TAXONOMY CONFIGURATION COMPLETE.


# **SECTION 3. DATA STRUCTURES FOR ENHANCED DOCUMENT MANAGEMENT**

This section defines the **Data Blueprint** for the entire platform. By using Python `dataclasses`, we create a hierarchical representation of documents that prevents "context drift."

<br>

**Logic and Flow Analysis**
- **Physical Layer (`PageInfo`):** Captures raw text and page numbers to ensure "ground truth" citations.

- **Business Layer (`LogicalDocument`):** Enables boundary detection. It treats a multi-document PDF as a collection of semantic entities (e.g., separating an Invoice from a Contract within the same file).

- **Retrieval Layer (`ChunkMetadata`):** The unit of search. It stores rich metadata (IDs and types) alongside embeddings, allowing the vector engine to perform "Siloed Retrieval" (filtering results by document type).
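
"Siloed Retrieval" amounts to filtering candidates by metadata before anything reaches the LLM. The sketch below shows that filtering step in plain Python. It is an illustration only: `ToyChunk` is a trimmed-down stand-in for `ChunkMetadata`, `filter_silo` is a hypothetical helper, and similarity ranking is left out entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyChunk:
    """Trimmed-down stand-in for ChunkMetadata (illustrative only)."""
    chunk_id: str
    doc_type: str
    page_start: int
    text: str


def filter_silo(chunks: List[ToyChunk], doc_type: str) -> List[ToyChunk]:
    """Keep only chunks whose metadata matches the requested document type."""
    return [c for c in chunks if c.doc_type == doc_type]


chunks = [
    ToyChunk("c1", "Invoice", 4, "Amount due: $1,200"),
    ToyChunk("c2", "Contract", 1, "This agreement is governed by..."),
    ToyChunk("c3", "Invoice", 5, "Please pay by March 1"),
]
invoice_only = filter_silo(chunks, "Invoice")
print([f"{c.doc_type} (p. {c.page_start})" for c in invoice_only])
```

Carrying `doc_type` and page numbers on every chunk is also what makes citations such as "Source: Invoice (p. 4)" possible.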

<br>

In [None]:
# ------- SECTION 3. DATA STRUCTURES FOR ENHANCED DOCUMENT MANAGEMENT -------
@dataclass
class PageInfo:
    """
    PHYSICAL LAYER: Represents one page of the input file.
    Used for OCR tracking and initial classification. Stores information about a single page.
    """
    page_num: int
    text: str
    doc_type: Optional[str] = None
    page_in_doc: int = 0   # Position relative to the logical start

@dataclass
class LogicalDocument:
    """
    BUSINESS LAYER: Groups pages into a single 'semantic' entity.
    Represents a logical document within a PDF.
    """
    doc_id: str
    doc_type: str
    page_start: int
    page_end: int
    text: str
    sector: str = "General"
    chunks: List[Dict] = field(default_factory=list)

@dataclass
class ChunkMetadata:
    """
    RETRIEVAL LAYER: The actual object indexed in the Vector Database.
    Rich per-chunk metadata allows for 'Siloed Retrieval' (filtering by doc_type).
    """
    chunk_id: str
    doc_id: str
    doc_type: str
    chunk_index: int
    page_start: int
    page_end: int
    text: str
    sector: str = "General"
    embedding: Optional[np.ndarray] = None

    def to_dict(self) -> Dict:
        """Converts metadata to a dictionary for LlamaIndex node compatibility."""
        return {
            "chunk_id": self.chunk_id,
            "doc_id": self.doc_id,
            "doc_type": self.doc_type,
            "sector": self.sector,
            "page_start": self.page_start,
            "page_end": self.page_end
        }

print("✅ DATA STRUCTURES INITIALIZED.")

print("✅ SECTION 3. DATA STRUCTURES FOR ENHANCED DOCUMENT MANAGEMENT COMPLETE.")


✅ DATA STRUCTURES INITIALIZED.
✅ SECTION 3. DATA STRUCTURES FOR ENHANCED DOCUMENT MANAGEMENT COMPLETE.


# **SECTION 4. DOCUMENT INTELLIGENCE FUNCTIONS**

This section contains the "brains" of the ingestion engine. It moves beyond simple text extraction by adding two critical semantic checks.

<br>

**Logic and Flow Analysis**

- **Semantic Classification (`classify_document_type`):** Instead of just indexing text, the system identifies what it is reading (e.g., a "Tax Return" vs. a "Medical Record"). This allows for metadata-filtered searches later.

- **Logical Boundary Detection (`detect_document_boundary`):** This logic prevents "Context Contamination." It analyzes the transition between pages to decide if a new document has started. If page 5 looks like a "Contract" but page 6 looks like a "Bank Statement," the system creates a hard boundary.
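
To show how a page-to-page boundary check turns into document ranges, here is a small sketch that walks consecutive pages and opens a new range whenever the check reports a different document. `group_pages` and `stub_same_document` are hypothetical names. The stub compares only the first word of each page and stands in for the LLM-backed check, so the example runs without any model.

```python
from typing import Callable, List, Tuple


def group_pages(pages: List[str],
                same_document: Callable[[str, str], bool]) -> List[Tuple[int, int]]:
    """Return (start, end) page ranges, 1-indexed and inclusive."""
    if not pages:
        return []
    ranges: List[Tuple[int, int]] = []
    start = 1
    for i in range(1, len(pages)):
        if not same_document(pages[i - 1], pages[i]):
            ranges.append((start, i))  # close the document that ends on page i
            start = i + 1              # the next document starts on page i + 1
    ranges.append((start, len(pages)))
    return ranges


def stub_same_document(prev_text: str, curr_text: str) -> bool:
    """Toy predicate: pages belong together when they share a first word."""
    return prev_text.split()[0] == curr_text.split()[0]


pages = ["Contract page one", "Contract page two", "Statement of account"]
print(group_pages(pages, stub_same_document))  # [(1, 2), (3, 3)]
```

Each resulting range maps onto the `page_start` and `page_end` fields of a `LogicalDocument`.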

In [None]:
# ------- SECTION 4. DOCUMENT INTELLIGENCE FUNCTIONS -------

# --- GLOBAL TAXONOMY DEFINITIONS ---

# VALID_DOC_TYPES to match DOCUMENT_TAXONOMY keys exactly
VALID_DOC_TYPES = list(DOCUMENT_TAXONOMY["categories"].keys())

TAXONOMY_PROMPT_STR = ", ".join(VALID_DOC_TYPES)



def heuristic_classify(text: str) -> str:
    """Fallback classifier using keywords if the LLM fails. Case-insensitive."""
    text_lower = text.lower()


    # Mapping keys from DOCUMENT_TAXONOMY to their expanded keyword lists (all lowercase)
    keywords_map = {
        "Resume": ["cv", "work history", "skills", "education"],
        "Contract": ["agreement", "indemnification", "confidentiality", "hereby agree"],
        "Mortgage_Contract": ["mortgage", "deed of trust", "amortization", "interest rate"],
        "Invoice": ["bill to", "amount due", "subtotal", "tax invoice"],
        "Pay_Slip": ["payslip", "pay slip", "earnings statement", "net pay", "ytd", "income", "salary", "base pay", "payroll", "compensation"],
        "Lender_Fee_Sheet": ["closing cost", "origination fee", "settlement charges"],
        "Land_Deed": ["grantor", "grantee", "parcel id", "quitclaim", "conveyance"],
        "Bank_Statement": ["account statement", "transaction history", "deposits"],
        "Tax_Document": ["w2", "w-2", "1099", "irs", "tax return", "income tax"],
        "Insurance": ["insurance policy", "premium", "easement", "encumbrance"],
        "ID_Document": ["passport", "driver's license", "id number", "photo id"]
    }


    # TEMPLATE CHECK: If it contains many underscores or dollar signs WITHOUT digits
    # Prevents the AI from getting confused by blank forms
    underscore_count = text_lower.count("____")
    dollar_no_digit = ("$" in text_lower and not any(char.isdigit() for char in text_lower))

    if underscore_count > 5 or dollar_no_digit or "sample" in text_lower:
        return "Other"

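    for category, keywords in keywords_map.items():
        # Anchor each keyword at a word start so short keys such as "irs"
        # do not fire inside unrelated words such as "first"
        if any(re.search(rf"\b{re.escape(word)}", text_lower) for word in keywords):
            return category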

    return "Other"



def classify_document_type(text: str, max_length: int = 2000) -> str:
    """
    Identifies document category by providing the LLM with the full taxonomy context.
    """

    # Truncate text if too long to avoid token limits
    # Safety Check: Use a sample to stay within LLM context limits and reduce latency
    text_sample = text[:max_length] if len(text) > max_length else text


    # 1. Build a context string from your actual taxonomy
    taxonomy_context = "\n".join([f"- {cat}: {desc}" for cat, desc in DOCUMENT_TAXONOMY["categories"].items()])

    # 2. Updated Prompt: No more hardcoded "Financial Statement" instructions
    prompt = f"""
    You are a document expert. Classify the text into exactly ONE of these categories:
    {TAXONOMY_PROMPT_STR}

    Use these definitions for guidance:
    {taxonomy_context}

    CRITICAL RULES:
    1. Respond with ONLY the category name.
    2. If the text looks like a BLANK FORM, TEMPLATE, or contains "____" placeholders instead of real data, you MUST use 'Other'.
    3. Do NOT classify blank appraisal forms as 'Lender_Fee_Sheet' unless they contain filled-in numbers.
    4. If unsure, use 'Other'.

    Text snippet:
    {text_sample}
    """

    try:
        # Check for MockLLM
        if "MockLLM" in str(type(Settings.llm)):
            return heuristic_classify(text_sample)

        response = Settings.llm.complete(prompt).text.strip()

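        # Exact match first, so "Mortgage_Contract" is never shadowed by "Contract"
        response_lower = response.lower()
        for valid_type in VALID_DOC_TYPES:
            if valid_type.lower() == response_lower:
                return valid_type

        # Substring match, longest names first ("Medical_Report" before "Report")
        for valid_type in sorted(VALID_DOC_TYPES, key=len, reverse=True):
            if valid_type.lower() in response_lower:
                return valid_type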

        # Fallback to heuristic if LLM output is ambiguous
        return heuristic_classify(text_sample)
    except Exception:
        return heuristic_classify(text_sample)


def detect_document_boundary(prev_text: str, curr_text: str,
                            current_doc_type: Optional[str] = None) -> bool:
    """
    Detect if two consecutive pages belong to the same document.
    Returns True if they're from the same document.
    """
    # Quick heuristic checks first
    if not prev_text or not curr_text:
        return False

    # Sample the texts for LLM analysis
    prev_sample = prev_text[-500:] if len(prev_text) > 500 else prev_text
    curr_sample = curr_text[:500] if len(curr_text) > 500 else curr_text

    prompt = f"""
    Determine if these two pages are from the SAME document or different documents.

    Current document type: {current_doc_type or 'Unknown'}

    End of Previous Page:
    ...{prev_sample}

    Start of Current Page:
    {curr_sample}...

    Consider:
    - Continuity of content
    - Formatting consistency
    - Topic coherence
    - Page numbers or headers

    Default to 'Yes' unless you see a clear signal of a different entity
    (e.g., a new person's name on a resume, a different bank logo, or a new header 'Exhibit A').
    Answer ONLY 'Yes' or 'No'.

    Decision Criteria:
    1. Does the sentence from the previous page continue?
    2. Is the formatting (headers/footers) consistent?
    3. Does the subject matter suddenly shift (e.g., from a lease to a utility bill)?

    Answer 'Yes' if they are the SAME document.
    Answer 'No' if a NEW document has started.
    Respond with ONLY 'Yes' or 'No'.
    """

    try:
        # Use the global LlamaIndex LLM setting
        response = Settings.llm.complete(prompt)

        return response.text.strip().lower().startswith('yes')
    except Exception as e:
        print(f"Boundary detection error: {e}")
        # Default to keeping pages together if uncertain
        return True

print("✅ SECTION 4. DOCUMENT INTELLIGENCE FUNCTIONS COMPLETE.")


✅ SECTION 4. DOCUMENT INTELLIGENCE FUNCTIONS COMPLETE.


# **SECTION 5. ADVANCED PDF PROCESSING PIPELINE**


**Logic and Flow Analysis**

This section implements the **Transformation Layer** of the platform. It is designed to handle "dirty" real-world data through three specialized sub-systems:

<br>

- **Hybrid OCR Router:** Detects if a page is a "Searchable PDF" or a "Scanned Image." If no text is found, it automatically triggers Tesseract OCR to "see" the content.

- **Taxonomy-Aware Segmentation:** Uses the intelligence functions from Section 4 and the Schema from Section 2C to group pages into `LogicalDocuments` while simultaneously tagging them with their business `Sector`.

- **High-Fidelity UI Rendering:** Includes a rendering engine that converts PDF pages into high-resolution images (using `fitz.Matrix(3, 3)`, a 3x zoom) for display within the Gradio interface, so that even small-print legal text stays legible.
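
The segmentation flow described above can be sketched without an LLM. This is a minimal illustration, not the notebook's implementation: `group_pages` and `shares_first_word` are hypothetical names, and the predicate stands in for `detect_document_boundary`, which returns `True` when two pages belong to the same document.

```python
from typing import Callable, List


def group_pages(pages: List[str],
                same_document: Callable[[str, str], bool]) -> List[List[int]]:
    """Group consecutive page indices into runs that belong together."""
    groups: List[List[int]] = []
    for i, text in enumerate(pages):
        # Start a new group on the first page or when a boundary is detected.
        if i == 0 or not same_document(pages[i - 1], text):
            groups.append([i])
        else:
            groups[-1].append(i)
    return groups


# Hypothetical stand-in for the LLM check: pages match if they share a first word.
def shares_first_word(prev: str, curr: str) -> bool:
    return prev.split()[:1] == curr.split()[:1]


# Made-up page texts, for illustration only.
pages = ["Lease page one", "Lease page two", "Invoice 0042", "Invoice totals"]
groups = group_pages(pages, shares_first_word)
print(groups)
```

Each inner list is one logical document; in the real pipeline every group would then be classified and mapped to a sector.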

In [None]:
# ------- SECTION 5. ADVANCED PDF PROCESSING PIPELINE -------

# --- 1. CORE SEGMENTATION LOGIC ---
def analyze_pages(pages_info):  # Shared Analysis Logic
    """
    Groups individual pages into logical business units using the 2C Taxonomy.
    Flow: Page Ingestion -> Boundary Detection -> Classification -> Sector Mapping.
    """

    logical_docs = []
    current_pages = []
    doc_counter = 0

    for i, page in enumerate(pages_info):
        if i == 0:
            # Initialize the first document type
            doc_type = classify_document_type(page.text)
            current_pages = [page]
        else:
            # Check if current page is a continuation of the previous one
            if detect_document_boundary(pages_info[i-1].text, page.text, doc_type):
                current_pages.append(page)
            else:
                # Boundary detected: Finalize the current logical document
                sector = get_sector_for_type(doc_type)

                logical_docs.append(
                    LogicalDocument(
                        doc_id=f"doc_{doc_counter}",
                        doc_type=doc_type,
                        sector=sector,  # Previously computed but never passed
                        page_start=current_pages[0].page_num,
                        page_end=current_pages[-1].page_num,
                        text="\n\n".join(p.text for p in current_pages),
                    )
                )
                doc_counter += 1
                # Start new document tracking
                doc_type = classify_document_type(page.text)
                current_pages = [page]

    # Handle the final trailing document in the sequence
    if current_pages:
        sector = get_sector_for_type(doc_type)
        logical_docs.append(
            LogicalDocument(
                doc_id=f"doc_{doc_counter}",
                doc_type=doc_type,
                sector=sector,
                page_start=current_pages[0].page_num,
                page_end=current_pages[-1].page_num,
                text="\n\n".join(p.text for p in current_pages),
            )
        )

    return pages_info, logical_docs


# --- 2. MULTI-MODAL INGESTION ROUTERS ---
def extract_and_analyze_file(file): # Aware Router
    """Detects file extension and routes to PDF or Image processor."""

    ext = os.path.splitext(file.name)[1].lower()

    if ext == ".pdf":
        return extract_and_analyze_pdf(file)
    elif ext in [".png", ".jpg", ".jpeg"]:
        return extract_and_analyze_image(file)
    else:
        raise ValueError(f"Unsupported file type: {ext}")


def extract_and_analyze_pdf(pdf_file) -> Tuple[List[PageInfo], List[LogicalDocument]]:
    """
    HYBRID OCR PIPELINE: Extracts digital text or triggers OCR for scanned pages.
    """

    # Capture the actual name from the Gradio file object
    original_filename = os.path.basename(pdf_file.name)


    print(f"📖 Starting PDF extraction and analysis for: {original_filename}")

    doc = fitz.open(pdf_file.name) # open file

    pages_info = []
    for i, page in enumerate(doc):
        text = page.get_text().strip()

        # Hybrid OCR: If no text found, render page to image and use Tesseract
        if not text:
            pix = page.get_pixmap()
            img = Image.open(io.BytesIO(pix.tobytes("png")))
            text = pytesseract.image_to_string(img)

        pages_info.append(PageInfo(page_num=i, text=text))

    doc.close()
    return analyze_pages(pages_info)


def extract_and_analyze_image(image_file): # Image Ingestion
    """Processes standalone images via OCR and treats them as a single document."""

    print("🖼️ Processing Image:", image_file.name)

    img = Image.open(image_file.name)
    text = pytesseract.image_to_string(img)

    pages_info = [PageInfo(page_num=0, text=text)]
    return analyze_pages(pages_info)



# --- 3. UI RENDERING LOGIC ---

# For Document Viewer in UI (Convert Uploaded PDF file into an image to be viewed in Gradio UI)
def load_pdf_into_viewer(selected_file):
    """Renders PDF pages to crisp images for the Gradio viewer."""

    if not selected_file or not os.path.exists(str(selected_file)):
        return None, {"current_page": 0, "images": []}, "**Page 0 of 0**"

    try:
        doc = fitz.open(selected_file)
        images = []
        # Matrix(3, 3) renders at 3x zoom (about 216 DPI from PyMuPDF's 72 DPI base)
        for page in doc:
            pix = page.get_pixmap(matrix=fitz.Matrix(3, 3), colorspace=fitz.csRGB)
            img = Image.open(io.BytesIO(pix.tobytes("png")))
            images.append(img)
        doc.close()

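        # Guard against PDFs with no pages so images[0] cannot raise IndexError
        if not images:
            return None, {"current_page": 0, "images": []}, "**Page 0 of 0**"

        indicator = f"<center>**Page 1 of {len(images)}**</center>"

        return images[0], {"current_page": 0, "images": images}, indicator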
    except Exception as e:
        print(f"Viewer Error: {e}")
        return None, {"current_page": 0, "images": []}, "**Error loading viewer**"


def flip_page(direction, state):
    """Handles 'Next' and 'Previous' button clicks in the UI."""
    images = state.get("images", [])
    current = state.get("current_page", 0)

    if not images:
        return None, state, "**Page 0 of 0**"

    current = min(current + 1, len(images) - 1) if direction == "next" else max(current - 1, 0)
    state["current_page"] = current
    indicator = f"<center>**Page {current + 1} of {len(images)}**</center>"
    return images[current], state, indicator


print("✅ SECTION 5. ADVANCED PDF PROCESSING PIPELINE COMPLETE.")


✅ SECTION 5. ADVANCED PDF PROCESSING PIPELINE COMPLETE.


# **SECTION 6. INTELLIGENT CHUNKING WITH METADATA PRESERVATION**

This section defines the **Granular Transformation Layer**. After a document has been logically segmented (e.g., separating an Invoice from a Contract), the text must be broken down into "chunks" that fit the LLM's context window while ensuring "provenance" (data origin) is never lost.

<br>

**Logic and Flow Analysis**

- **Semantic Sliding Window:** A custom algorithm that ensures no information is lost at chunk boundaries by creating a calibrated "overlap."

- **LlamaIndex Orchestration (** `SentenceSplitter` **):** A high-level path that respects paragraph and sentence boundaries, preventing a chunk from being cut off in the middle of a critical legal clause.

- **Metadata Injection:** The "Secret Sauce." Every chunk is stamped with its `doc_type`, `sector`, `doc_id`, and page range (`page_start`, `page_end`). This lets the retrieval phase filter out irrelevant document "silos" instead of searching everything at once.
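
The sliding window from the first bullet comes down to simple stride arithmetic. The sketch below is illustrative only: `sliding_windows` is a hypothetical helper, and the tiny `chunk_size` and `overlap` values are chosen so the overlap is easy to see.

```python
from typing import List


def sliding_windows(words: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split a word list into windows that share `overlap` words with their neighbor."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # how far each window advances
    windows = []
    for start in range(0, len(words), stride):
        end = min(start + chunk_size, len(words))
        windows.append(words[start:end])
        if end >= len(words):
            break  # the last window already reached the end of the text
    return windows


words = [f"w{i}" for i in range(10)]
windows = sliding_windows(words, chunk_size=4, overlap=1)
for window in windows:
    print(window)
```

With `chunk_size=4` and `overlap=1` the stride is 3, so the last word of each window reappears as the first word of the next one.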

<br>

In [None]:
# ------- SECTION 6. INTELLIGENT CHUNKING WITH METADATA PRESERVATION -------

# --- 1. CUSTOM SLIDING WINDOW CHUNKING ---
def chunk_document_with_metadata(logical_doc: LogicalDocument,
                                chunk_size: int = 500,
                                overlap: int = 100) -> List[ChunkMetadata]:
    """
    Splits a logical document into overlapping chunks while preserving business context.
    Ensures that context at boundaries is maintained via the 'stride' method.
    """

    chunks_metadata = []
    words = logical_doc.text.split()

    # Case A: Document is smaller than the threshold
    if len(words) <= chunk_size:
        # Document is small enough to be a single chunk
        chunk_meta = ChunkMetadata(
            chunk_id=f"{logical_doc.doc_id}_chunk_0",
            doc_id=logical_doc.doc_id,
            doc_type=logical_doc.doc_type,
            sector=logical_doc.sector,
            chunk_index=0,
            page_start=logical_doc.page_start,
            page_end=logical_doc.page_end,
            text=logical_doc.text
        )
        chunks_metadata.append(chunk_meta)

    # Case B: Multi-chunk split with sliding window
    else:
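        # Create overlapping chunks; the stride must be positive or range() fails
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        stride = chunk_size - overlap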
        for i, start_idx in enumerate(range(0, len(words), stride)):
            end_idx = min(start_idx + chunk_size, len(words))
            chunk_text = ' '.join(words[start_idx:end_idx])

            # Calculate which pages this chunk spans
            # (simplified - in production, track more precisely)
            chunk_position = start_idx / len(words)
            page_range = logical_doc.page_end - logical_doc.page_start
            relative_page = int(chunk_position * page_range)
            chunk_page_start = logical_doc.page_start + relative_page
            chunk_page_end = min(chunk_page_start + 1, logical_doc.page_end)

            chunk_meta = ChunkMetadata(
                chunk_id=f"{logical_doc.doc_id}_chunk_{i}",
                doc_id=logical_doc.doc_id,
                doc_type=logical_doc.doc_type,
                sector=logical_doc.sector,
                chunk_index=i,
                page_start=chunk_page_start,
                page_end=chunk_page_end,
                text=chunk_text
            )
            chunks_metadata.append(chunk_meta)

            if end_idx >= len(words):
                break

    return chunks_metadata


# --- 2. LLAMA-INDEX ADVANCED CHUNKING ---
def chunk_with_llama_index(logical_doc: LogicalDocument,
                           chunk_size: int = 500,
                           chunk_overlap: int = 100) -> List[ChunkMetadata]:
    """
    Uses LlamaIndex SentenceSplitter to ensure chunks respect natural language boundaries.
    """
    # Create LlamaIndex document with metadata
    doc = Document(
        text=logical_doc.text,
        metadata={
            "doc_id": logical_doc.doc_id,
            "doc_type": logical_doc.doc_type,
            "sector": logical_doc.sector, # Carrying over from Section 5
            "page_start": logical_doc.page_start,
            "page_end": logical_doc.page_end,
            "source": f"{logical_doc.doc_type}_document"
        }
    )

    # Use LlamaIndex's sentence splitter for better chunking
    # Sentence-aware splitter prevents cutting mid-sentence
    splitter = SentenceSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        paragraph_separator="\n\n",
    )

    # Create nodes (chunks) from document
    nodes = splitter.get_nodes_from_documents([doc])

    # Convert to our ChunkMetadata format for consistency
    chunks_metadata = []
    for i, node in enumerate(nodes):
        # IMPORTANT: Explicitly pull from node.metadata
        # LlamaIndex nodes store metadata in a .metadata dictionary
        m = node.metadata

        chunk_meta = ChunkMetadata(
            chunk_id=f"{logical_doc.doc_id}_chunk_{i}",
            doc_id=m.get("doc_id", logical_doc.doc_id),
            doc_type=m.get("doc_type", logical_doc.doc_type),
            sector=m.get("sector", logical_doc.sector),
            chunk_index=i,
            # If splitter doesn't track pages, we fallback to logical_doc values
            page_start=m.get("page_start", logical_doc.page_start),
            page_end=m.get("page_end", logical_doc.page_end),
            text=node.get_content()
        )
        chunks_metadata.append(chunk_meta)

    return chunks_metadata


# --- 3. BATCH PROCESSOR ---
def process_all_documents(logical_docs: List[LogicalDocument],
                         use_llama_index: bool = False) -> List[ChunkMetadata]:
    """
    Orchestrates the conversion of segmented documents into searchable chunks.
    """

    all_chunks = []

    print(f"🧩 Processing {len(logical_docs)} logical documents into chunks...")

    for logical_doc in logical_docs:
        if use_llama_index:
            chunks = chunk_with_llama_index(logical_doc)
        else:
            chunks = chunk_document_with_metadata(logical_doc)

        logical_doc.chunks = chunks  # Store reference, Maintain parent-child relationship
        all_chunks.extend(chunks)
        print(f" 📄  ∟ {logical_doc.doc_type} ({logical_doc.doc_id}): {len(chunks)} chunks created.")

    return all_chunks


print("✅ SECTION 6. INTELLIGENT CHUNKING WITH METADATA PRESERVATION COMPLETE.")



✅ SECTION 6. INTELLIGENT CHUNKING WITH METADATA PRESERVATION COMPLETE.


# **SECTION 7. QUERY ROUTING AND INTELLIGENT RETRIEVAL**

This section introduces the Search Orchestration Layer. It uses a "Router-First" architecture to address the "Needle in a Haystack" problem in large document sets.

 <br>

 **Logic and Flow Analysis**

- **The Intent Router:** Before searching, the LLM analyzes the query to predict which document type (from Section 2C) contains the answer. It uses a robust JSON-repair logic to ensure the system doesn't crash if the LLM's formatting is imperfect.

- **Segregated Indices (Silos):** Instead of one massive index, the `IntelligentRetriever` builds specialized "mini-indices" for each document type. This prevents "contextual noise" (e.g., a Bank Statement's numbers confusing a Legal Contract's query).

- **Confidence-Based Logic:** If the AI is >70% confident in its routing, it searches a specific silo. If unsure, it automatically falls back to a global search, ensuring no information is missed.
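
The confidence-based fallback in the last bullet can be reduced to a single decision function. This is a simplified sketch under stated assumptions: `choose_index` is a hypothetical name, the silo dictionary holds placeholder objects rather than FAISS indices, and the 0.7 threshold is taken from the description above.

```python
from typing import Dict

CONFIDENCE_THRESHOLD = 0.7  # routing threshold described above


def choose_index(predicted_type: str, confidence: float, silos: Dict[str, object]) -> str:
    """Return the silo name to search, or 'global' when routing is uncertain."""
    # A silo is used only when the router is confident AND that silo exists.
    if confidence > CONFIDENCE_THRESHOLD and predicted_type in silos:
        return predicted_type
    return "global"


# Placeholder silos; in the notebook each value would hold an index and a mapping.
silos = {"Pay_Slip": object(), "Tax_Document": object()}

confident = choose_index("Pay_Slip", 0.95, silos)
borderline = choose_index("Pay_Slip", 0.70, silos)
missing_silo = choose_index("Lease", 0.99, silos)
print(confident, borderline, missing_silo)
```

Note that the comparison is strict, so a confidence of exactly 0.7 still falls back to the global search.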

<br>

In [None]:
# ------- SECTION 7. QUERY ROUTING AND INTELLIGENT RETRIEVAL -------


# --- 1. THE ROUTER (INTENT ANALYSIS) ---
def predict_query_document_type(query: str, llm=None) -> Tuple[str, float]:
    """
    Predicts the target document category based on the Section 2C Taxonomy.
    Returns the predicted 'type' and a 'confidence' score.
    """

    query_lower = query.lower()

    # HARD OVERRIDES (Instant High Confidence)
    # Ensures "Find Amounts" and "Income" queries bypass the LLM for speed/accuracy
    all_keywords = ["monetary amounts", "financial figures", "all amounts", "every amount"]
    if any(kw in query_lower for kw in all_keywords):
        return "All", 1.0

    income_keywords = ["income", "salary", "wages", "earnings", "take-home pay"]
    if any(kw in query_lower for kw in income_keywords):
        return "Pay_Slip", 0.95


    # SETUP LLM & TAXONOMY
    # Use the passed LLM, or fall back to the global Settings.llm
    active_llm = llm or Settings.llm

    # Extract model name for logging (handles LlamaIndex LLM objects)
    model_name = getattr(active_llm,
                 "model_name",
                 "AI Engine").lower()

    # Access global document taxonomy
    valid_keys = list(DOCUMENT_TAXONOMY["categories"].keys())
    taxonomy_str = "\n".join([f"- {k}: {v}" for k, v in DOCUMENT_TAXONOMY["categories"].items()])

    print(f"🧠 Routing via {model_name}...")



    # --- LOGGING THE CHOSEN LLM ---
    log_entry = {
        "timestamp": datetime.now().strftime("%H:%M:%S"),
        "event": "ROUTING_ATTEMPT",
        "model_used": model_name,
        "query_preview": query[:30] + "..."
    }
    audit_logs.append(log_entry)


    # CONSTRUCT MODEL-AWARE PROMPTS
    is_mistral = "mistral" in model_name
    is_phi = "phi" in model_name
    # is_gemini = "gemini" in model_name

    raw_prompt = f"""
    Analyze the user query and pick the best category from the list.
    Query: "{query}"

    Available categories:
    {taxonomy_str}
    - All: Use this if the user asks for a summary of EVERY document, "all amounts", or "all figures" across the entire file.
    - Pay_Slip: Use for income, salary, wages, earnings, or take-home pay questions.
    - Tax_Document: Use for annual income reports, W2s, or IRS filings.
    - Mortgage_Contract: Use for loan terms and interest.


    Return ONLY a JSON object:
    <json>{{"type": "CategoryName or All", "confidence": 0.9}}</json>"""

    # Apply special wrapping for small/specific models
    if is_mistral:
        final_prompt = f"[INST] {raw_prompt} [/INST]"
    elif is_phi:
        # Phi-2 works best with very direct few-shot or completion style
        final_prompt = f"Instruct: {raw_prompt}\nOutput: <json>"
    else:
        final_prompt = raw_prompt


    try:
        # 4. EXECUTION & RESPONSE CLEANING
        response_text = active_llm.complete(final_prompt).text.strip()

        # Phi-2 fix: if we pre-filled <json>, add it back to the text for parsing
        if is_phi and not response_text.startswith("<json"):
            response_text = "<json>" + response_text

        # 5. EXTRACTION LOGIC
        json_str = None
        xml_match = re.search(r'<json>(.*?)</json>', response_text, re.DOTALL)
        if xml_match:
            json_str = xml_match.group(1)
        else:
            json_match = re.search(r'\{.*\}', response_text, re.DOTALL)
            if json_match:
                json_str = json_match.group()

        doc_type, confidence = "Other", 0.0

        # Try parsing JSON
        if json_str:
            try:
                # Use your repair_json helper for minor syntax fixes
                result = json.loads(repair_json(json_str))
                doc_type = result.get("type", "Other")
                confidence = float(result.get("confidence", 0.0))
            except Exception:
                # Malformed JSON or an unexpected shape: fall through to the keyword sweep
                pass

        # 6. KEYWORD FALLBACK (CRITICAL FOR SMALL MODELS)
        # If JSON failed or returned an invalid key, sweep the text for any valid category name.
        # "All" is a legitimate answer, so it must not trigger the sweep.
        if (doc_type not in valid_keys and doc_type != "All") or confidence < 0.1:
            for key in valid_keys:
                if key.lower() in response_text.lower():
                    doc_type, confidence = key, 0.7
                    break

        # Final validation against taxonomy - ALLOW "All"
        if doc_type not in valid_keys and doc_type != "All":
            doc_type = "Other"

        print(f"✅ Router assigned: {doc_type} ({confidence*100:.1f}%)")
        return doc_type, confidence

    except Exception as e:
        print(f"🎯 Routing error: {e}")
        return "Other", 0.0




# --- 2. THE INTELLIGENT RETRIEVER (VECTOR ENGINE) ---
class IntelligentRetriever:
    """
    Advanced FAISS retrieval system with metadata-driven Silo Filtering.
    """

    def __init__(self):
        self.index = None
        self.chunks_metadata = [] # Master list of all chunks from all files
        self.doc_type_indices = {} # Map of indices per document type

    def build_indices(self, new_chunks: List[ChunkMetadata]):
        """
        Builds or updates FAISS indices with new document embeddings.
        """

        print(f"🔨 Processing {len(new_chunks)} new chunks for the vector index...")

        # 1. Create embeddings only for the NEW chunks
        texts = [chunk.text for chunk in new_chunks]
        embeddings_list = Settings.embed_model.get_text_embedding_batch(texts, show_progress=True)
        new_embeddings = np.array(embeddings_list).astype('float32')
        dim = new_embeddings.shape[1]

        # Store embeddings in metadata for these new chunks
        for i, chunk in enumerate(new_chunks):
            chunk.embedding = new_embeddings[i]

        # --- TIER 1: GLOBAL INDEX (APPEND MODE) ---
        if self.index is None:
            self.index = faiss.IndexFlatL2(dim)

        self.index.add(new_embeddings)

        # IMPORTANT: Append new chunks to the master metadata list
        # Prevents previous files from disappearing
        self.chunks_metadata.extend(new_chunks)

        # --- TIER 2: SEGREGATED INDICES (SILOS) ---
        # Updates the silos to include the new data
        doc_types = set(chunk.doc_type for chunk in new_chunks)

        for doc_type in doc_types:
            # Find indices of the new chunks that match this type
            # Reference the full self.chunks_metadata to rebuild the mapping correctly
            all_type_indices = [idx for idx, chunk in enumerate(self.chunks_metadata)
                                if chunk.doc_type == doc_type]

            if all_type_indices:
                # Rebuild the specific silo index for this type
                # (FAISS IndexFlatL2 is fast enough to rebuild for specific silos)
                type_embeddings = np.array([self.chunks_metadata[i].embedding for i in all_type_indices]).astype('float32')

                type_index = faiss.IndexFlatL2(dim)
                type_index.add(type_embeddings)

                self.doc_type_indices[doc_type] = {
                    'index': type_index,
                    'mapping': all_type_indices  # Maps back to the updated master list
                }

        print(f"✅ Database updated. Total Chunks: {len(self.chunks_metadata)}")



    def retrieve(self, query: str, k: int = 4,
                filter_doc_type: Optional[str] = None,
                auto_route: bool = True) -> List[Any]:
        """Performs a routed or global search based on intent confidence."""


        # 1. GENERATE QUERY EMBEDDING - Use Settings.embed_model.get_query_embedding
        # FAISS expects a 2D numpy array (float32)
        # Wrap the single embedding in a list
        query_vec = Settings.embed_model.get_query_embedding(query)
        query_embedding = np.array([query_vec]).astype('float32')

        # Variables to store search results
        chunk_indices = []
        distances = []


        # 2. SELECTION (ROUTING) LOGIC (Which index to search?)
        # CASE A: User manually selected a specific filter (and it's not "All")
        if filter_doc_type and filter_doc_type.lower() != "all" and filter_doc_type in self.doc_type_indices:
            print(f"🔍 Searching specific silo: {filter_doc_type}")
            type_data = self.doc_type_indices[filter_doc_type]
            D, I = type_data['index'].search(query_embedding, k)

            # Map the silo-specific index back to the master self.chunks_metadata list
            chunk_indices = [type_data['mapping'][i] for i in I[0] if i != -1]
            distances = D[0][:len(chunk_indices)]

        # CASE B: Auto-Route is enabled (AI guesses the document type)
        elif auto_route:
            predicted_type, confidence = predict_query_document_type(query)

            # If AI is confident and the silo exists, search the silo
            if confidence > 0.7 and predicted_type in self.doc_type_indices:
                print(f"🎯 Auto-routed to: {predicted_type} ({confidence:.2%})")
                type_data = self.doc_type_indices[predicted_type]
                D, I = type_data['index'].search(query_embedding, k)
                chunk_indices = [type_data['mapping'][i] for i in I[0] if i != -1]
                distances = D[0][:len(chunk_indices)]
            else:
                # Fallback to global search if AI is unsure
                print(f"🌐 Low routing confidence ({confidence:.2%}). Searching all documents...")
                D, I = self.index.search(query_embedding, k)
                chunk_indices = [i for i in I[0] if i != -1]
                distances = D[0][:len(chunk_indices)]

        # CASE C: Search Everything (Filter is "All" or no filter provided)
        else:
            print("🌐 Searching global index (all files)...")
            D, I = self.index.search(query_embedding, k)
            chunk_indices = [i for i in I[0] if i != -1]
            distances = D[0][:len(chunk_indices)]

        # 3. CONVERT RESULTS TO SCORED CHUNKS
        valid_results = []

        # Relaxed similarity threshold (0.40) for general use
        RELAXED_THRESHOLD = 0.40

        for idx, i in enumerate(chunk_indices):
            dist = distances[idx]
            score = 1.0 / (1.0 + dist)
            chunk_obj = self.chunks_metadata[i]

            if score >= RELAXED_THRESHOLD:
                # Store as a simple namespace or dict to avoid scope/class errors
                node_data = type('Node', (object,), {
                    'text': chunk_obj.text,
                    'metadata': {
                        "page_start": chunk_obj.page_start,
                        "page_end": chunk_obj.page_end,
                        "doc_type": chunk_obj.doc_type,
                        "doc_id": chunk_obj.doc_id
                    },
                    # Bind the text now; a bare closure would return the loop's last chunk_obj
                    'get_content': lambda text=chunk_obj.text: text
                })
                valid_results.append(type('Result', (object,), {'node': node_data, 'score': score}))

        # Safety fallback
        if not valid_results and len(chunk_indices) > 0:
            print("⚠️ Threshold too high. Falling back to top result.")
            idx = chunk_indices[0]
            chunk_obj = self.chunks_metadata[idx]
            node_data = type('Node', (object,), {
                'text': chunk_obj.text,
                'metadata': {"page_start": chunk_obj.page_start, "page_end": chunk_obj.page_end,
                             "doc_type": chunk_obj.doc_type, "doc_id": chunk_obj.doc_id},
                'get_content': lambda text=chunk_obj.text: text
            })
            valid_results.append(type('Result', (object,), {'node': node_data, 'score': 1.0 / (1.0 + distances[0])}))

        return valid_results


print("✅ SECTION 7. QUERY ROUTING AND INTELLIGENT RETRIEVAL COMPLETE.")



✅ SECTION 7. QUERY ROUTING AND INTELLIGENT RETRIEVAL COMPLETE.


# **SECTION 8. ENHANCED ANSWER GENERATION WITH SOURCE ATTRIBUTION**


This section represents the final stage of the RAG pipeline. Its goal is to provide evidence-based answers while implementing an automated quality gate.

<br>

**Logic and Flow Analysis**

1. **Context Synthesis (**  `generate_answer_with_sources` **):** Instead of just passing raw text, the system builds a "Structured Context." Every chunk is prefixed with a source header built from its metadata (Doc Type and Page Numbers). This encourages the LLM to provide in-text citations, allowing a human reviewer to verify the answer quickly.

2. **Strict Constraint Enforcement:** The prompt is engineered with a "Closed-Domain" instruction (Answer based ONLY on provided context). This is your primary defense against hallucinations.

3. **The RAG Triad Auditor (** `evaluate_rag_performance` **):** This implements a "Judge LLM" to evaluate the response on three metrics: Faithfulness (factuality), Context Relevance (retrieval quality), and Answer Relevance (helpfulness).
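
The "Structured Context" and closed-domain instruction from the first two points can be sketched as follows. This is a simplified illustration: `build_context` and `closed_domain_prompt` are hypothetical helpers, chunks are plain dictionaries rather than retriever results, and the sample texts are made up.

```python
from typing import Dict, List


def build_context(chunks: List[Dict]) -> str:
    """Prefix every chunk with a source header so the LLM can cite it."""
    parts = []
    for chunk in chunks:
        header = (f"[Source: {chunk['doc_type']}, "
                  f"Pages {chunk['page_start']}-{chunk['page_end']}]")
        parts.append(f"{header}\n{chunk['text']}\n")
    return "\n".join(parts)


def closed_domain_prompt(query: str, context: str) -> str:
    """Wrap the context in an instruction that restricts the answer to it."""
    return (f"Context:\n{context}\n\nQuestion: {query}\n"
            "Answer based ONLY on context. Cite document types and page numbers.")


# Made-up chunks, for illustration only.
chunks = [
    {"doc_type": "Pay_Slip", "page_start": 0, "page_end": 0, "text": "Net pay: 2,150.00"},
    {"doc_type": "Tax_Document", "page_start": 3, "page_end": 4, "text": "Box 1 wages: 51,600.00"},
]
context = build_context(chunks)
prompt = closed_domain_prompt("What is the net pay?", context)
print(prompt)
```

Because each header travels with its chunk, a citation in the answer can be traced back to a document type and page range.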

    

In [None]:
# ------- SECTION 8. ENHANCED ANSWER GENERATION WITH SOURCE ATTRIBUTION -------

def clean_llm_json(raw_response):
    """Fixes JSON formatting and strips Mistral/Llama-CPP instruction tags."""
    # Convert object to string if it's a Gradio list or dict
    text = str(raw_response)

    # --- MISTRAL REPAIR: Strip Echoed Prompt Artifacts ---
    # Removes everything before the last [/INST] tag if present
    if "[/INST]" in text:
        text = text.split("[/INST]")[-1]

    # Remove common local LLM artifacts that break JSON parsers
    junk_markers = ["[INST]", "Context:", "Question:", "Answer:", "```json", "```"]
    for marker in junk_markers:
        text = text.replace(marker, "")

    text = text.strip()

    try:
        # Use json_repair to handle trailing commas or missing quotes common in 4-bit Mistral
        return json_repair.repair_json(text, return_objects=True)
    except Exception as e:
        print(f"⚠️ JSON Repair failed: {e}")
        # Return a basic structure so the UI doesn't crash
        return {"answer": text, "sources": []}


# --- THE GENERATOR (CONTEXT-AWARE SYNTHESIS) ---
def generate_answer_with_sources(query: str,
                                retrieved_chunks: list) -> dict:
    """
    Generate answer with detailed source attribution using the active LLM.
    """
    if not retrieved_chunks:
        return {
            'answer': "I couldn't find relevant information to answer your question.",
            'sources': [],
            'confidence': 0.0,
            'context_text': ""
        }

    # Identify the active model name safely
    current_model = getattr(Settings.llm, "model_name", "Mistral 7B").lower()


    # 1.1 Context Preparation
    # Prefix every chunk with its 'Physical Provenance' (Type + Page)
    context_parts = []
    sources = []

    for item in retrieved_chunks:
        node = item.node
        score = item.score
        meta = node.metadata

        doc_type = meta.get('doc_type', 'Document')
        p_start = meta.get('page_start', '?')
        p_end = meta.get('page_end', '?')
        text_content = node.get_content()

        header = f"[Source: {doc_type}, Pages {p_start}-{p_end}]"
        context_parts.append(f"{header}\n{text_content}\n")
        sources.append({'doc_type': doc_type, 'pages': f"{p_start}-{p_end}", 'relevance': f"{score:.2%}"})

    context = "\n".join(context_parts)

    # Model-Specific Prompting
    if "gemini" in current_model:
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer based ONLY on context. Cite document types and page numbers."
    elif "phi" in current_model:
        # Phi-2 works best with "Instruct/Output" tags or direct completion
        safe_context = context[:4000] # Stricter limit for Phi-2 memory
        prompt = f"Instruct: Use the context below to answer the question.\nContext: {safe_context}\nQuestion: {query}\nOutput:"
    else:
        # Mistral 7B Instruction format
        prompt = f"[INST] Answer using ONLY the context provided.\nContext: {context}\nQuestion: {query} [/INST]"

    try:
        raw_response = Settings.llm.complete(prompt).text.strip()

        # Post-process to remove model artifacts
        response = raw_response.split("Output:")[-1].split("Answer:")[-1].strip()

        avg_score = sum(item.score for item in retrieved_chunks) / len(retrieved_chunks)

        return {
            'answer': response,
            'sources': sources,
            'context_text': context,
            'confidence': avg_score,
            'chunks_used': len(retrieved_chunks)
        }
    except Exception as e:
        print(f"⚠️ Generation Error ({current_model}): {e}")
        # Return the same keys as the success path so callers can rely on them
        return {'answer': f"Error: {str(e)}", 'sources': sources, 'context_text': context,
                'confidence': 0.0, 'chunks_used': 0}


# --- THE AUDITOR (PERFORMANCE METRICS) ---
def evaluate_rag_performance(query, context, answer):
    """
    The 'Judge LLM' logic: Evaluates the RAG Triad in JSON format.
    Ensures high-fidelity output and detects potential hallucinations.

    RAG Triad:
    Faithfulness, Answer Relevance, and Context Relevance.
    """
    prompt = f"""
    Act as an AI Quality Auditor. Rate this RAG response (1-5 scale).

    Query: {query}
    Context: {context}
    Answer: {answer}

    Rate the following from 1-5 (5 is best) in JSON format:
    1. Faithfulness (Is the answer supported ONLY by the context?)
    2. Context Relevance (Is the retrieved context useful for the query?)
    3. Answer Relevance (Does the answer actually address the user's question?)

    Respond ONLY in JSON: {{"faithfulness": 5, "relevance": 4, "answer_relevance": 5}}
    """
    try:
        # Use the universal LlamaIndex LLM object
        response = Settings.llm.complete(prompt).text.strip()

        # 1. Clean the response of Markdown code blocks if they exist
        # This removes ```json and ``` wrapping
        clean_response = re.sub(r'```(?:json)?\n?|```', '', response).strip()

        # 2. JSON extraction (more targeted)
        json_match = re.search(r'(\{.*\})', clean_response, re.DOTALL)

        if json_match:
            parsed_data = json.loads(json_match.group(1))

            # 3. Normalize the keys and coerce every score to float.
            # Note: run_performance_audit reads 'Relevance' (capital R) and 'audit_score',
            # so callers must map these lowercase keys when writing to audit_logs.
            return {
                "faithfulness": float(parsed_data.get("faithfulness", 0)),
                "relevance": float(parsed_data.get("relevance", 0)),
                "answer_relevance": float(parsed_data.get("answer_relevance", 0))
            }
        else:
            raise ValueError(f"No valid JSON found. Raw response: {response[:50]}...")

    except Exception as e:
        print(f"⚠️ Audit Evaluation Error: {e}")
        return {"faithfulness": 0, "relevance": 0, "answer_relevance": 0}

print("✅ SECTION 8. ENHANCED ANSWER GENERATION WITH SOURCE ATTRIBUTION COMPLETE.")





# **SECTION 9. ENHANCED DOCUMENT STORE**

The EnhancedDocumentStore manages the document lifecycle using a State-Machine Architecture:

<br>

**Logic and Flow Analysis**

- **Unified Ingestion (The "Push" Pipeline):** When a user uploads a file, `process_file` runs a sequential flow: **Extraction → Logical Segmentation → Siloed Chunking → Vector Indexing.** The pipeline is additive: each new upload is appended to the existing pages, documents, and chunks, so multiple PDFs end up in one searchable index.

- **Semantic Query Routing:** The `query` method is the bridge between the UI and the retrieval engine. It is "self-healing": if a targeted silo search returns no chunks, it automatically falls back to a global search, so a misclassified document does not produce a "No results" error.

- **UI Serialization:** Methods like `get_document_structure` translate complex internal Python objects (Dataclasses) into human-readable strings and labels for the Gradio dashboard.
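
The silo-then-global fallback described above can be sketched in isolation. This is a minimal illustration, not the platform's retriever: `retrieve_with_fallback`, `toy_retrieve`, and `CORPUS` are hypothetical names, and keyword overlap stands in for vector search.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical retriever signature: (question, optional doc-type silo) -> matching texts
Retriever = Callable[[str, Optional[str]], List[str]]

# Toy corpus, assumed for illustration only: doc type -> chunk texts
CORPUS: Dict[str, List[str]] = {
    "Invoice": ["Invoice total is $500."],
    "Contract": ["Notice period is 30 days."],
}


def toy_retrieve(question: str, silo: Optional[str]) -> List[str]:
    """Keyword-overlap stand-in for a vector search, optionally limited to one silo."""
    if silo:
        docs = CORPUS.get(silo, [])
    else:
        docs = [text for texts in CORPUS.values() for text in texts]
    words = set(question.lower().split())
    return [t for t in docs if words & set(t.lower().rstrip(".").split())]


def retrieve_with_fallback(question: str, doc_type: Optional[str],
                           retrieve: Retriever) -> Tuple[List[str], str]:
    """Search the requested silo first; widen to a global search if it returns nothing."""
    # Treat "All" (any casing) as "no filter"
    silo = doc_type if doc_type and doc_type.strip().lower() != "all" else None
    hits = retrieve(question, silo)
    if not hits and silo is not None:
        # The silo was empty for this question, so retry without the filter
        return retrieve(question, None), "All (Fallback)"
    return hits, silo or "All"


hits, used = retrieve_with_fallback("notice period", "Invoice", toy_retrieve)
print(used, hits)
```

The second return value records which search actually produced the hits, which is the same role `filter_used` plays in the store's result dictionary.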

<br>
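
As a rough sketch of the UI serialization step, the snippet below turns a dataclass into a display row. `LogicalDoc` is a simplified, hypothetical stand-in for the notebook's `LogicalDocument`, and `to_ui_row` is an illustrative helper; only the 1-indexed page labels and the 200-character preview are taken from the store's own logic.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class LogicalDoc:
    """Hypothetical, simplified stand-in for the notebook's LogicalDocument."""
    doc_id: str
    doc_type: str
    page_start: int  # 0-indexed
    page_end: int    # 0-indexed, inclusive
    text: str


def to_ui_row(doc: LogicalDoc, preview_len: int = 200) -> Dict[str, str]:
    """Convert an internal document object into strings suitable for a UI table."""
    if len(doc.text) > preview_len:
        preview = doc.text[:preview_len] + "..."
    else:
        preview = doc.text
    return {
        "id": doc.doc_id,
        "type": doc.doc_type,
        # Internal page numbers are 0-indexed; the UI shows 1-indexed ranges
        "pages": f"{doc.page_start + 1}-{doc.page_end + 1}",
        "preview": preview,
    }


row = to_ui_row(LogicalDoc("d1", "Invoice", 0, 2, "Total due: $500"))
print(row["pages"])
```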

In [None]:
# ------- SECTION 9. ENHANCED DOCUMENT STORE -------
class EnhancedDocumentStore:
    """
    The central hub orchestrating the end-to-end RAG lifecycle.
    Maintains state for pages, logical documents, and vector indices.
    """

    def __init__(self):
        # Master State Variables
        self.pages_info = []       # List of RawPage objects (from Section 4)
        self.logical_docs = []     # List of LogicalDocument objects (from Section 5)
        self.chunks_metadata = []  # List of ChunkMetadata objects (from Section 6)

        # Core Engines
        self.retriever = IntelligentRetriever() # Section 7

        # System Metadata
        self.is_ready = False
        self.processing_stats = {}
        self.active_filenames = []


    # --- PRIMARY INGESTION PIPELINE ---
    def process_pdf(self, pdf_file, filename: str = "document.pdf"):
        """
        Executes the full pipeline: Extract -> Segment -> Chunk -> Index.
        Supports additive processing (adding new files to existing index).
        """

        self.filename = filename
        self.is_ready = False
        start_time = datetime.now()


        try:
            # Step 1: Extract and segment the new file, then append to the
            # existing state instead of overwriting it
            new_pages, new_logical_docs = extract_and_analyze_pdf(pdf_file)
            self.pages_info.extend(new_pages)
            self.logical_docs.extend(new_logical_docs)

            # Step 2: Chunk only the new documents and add them to the master list
            new_chunks = process_all_documents(new_logical_docs)
            self.chunks_metadata.extend(new_chunks)

            # Step 3: Update the index with the new chunks only.
            # This relies on build_indices adding nodes to the existing index rather
            # than recreating it (with a VectorStoreIndex, insert_nodes is the equivalent).
            self.retriever.build_indices(new_chunks)

            # Step 4: Telemetry (cumulative stats for the whole store)
            process_time = (datetime.now() - start_time).total_seconds()
            self.processing_stats = {
                'filename': filename,
                'total_pages': len(self.pages_info),  # Total pages in the entire system
                'documents_found': len(self.logical_docs),
                'total_chunks': len(self.chunks_metadata),
                'document_types': list(set(doc.doc_type for doc in self.logical_docs)),
                'processing_time': f"{process_time:.1f}s"
            }

            self.is_ready = True
            return True, self.processing_stats

        except Exception as e:
            return False, {'error': str(e)}



    # --- 1. THE INGESTION ENGINE ---
    def process_file(self, file):
        """
        Executes the full pipeline: Extract -> Segment -> Chunk -> Index.
        Additive: new pages, documents, and chunks are appended to the existing state.
        """

        self.filename = os.path.basename(file.name)
        start = time.time()

        print(f"⚙️ Orchestrator: Starting pipeline for {self.filename}")

        # Step 1: Extract NEW content
        new_pages, new_docs = extract_and_analyze_file(file)

        # Step 2: Chunk and Append
        new_chunks = process_all_documents(new_docs)

        self.pages_info.extend(new_pages)
        self.logical_docs.extend(new_docs)
        self.chunks_metadata.extend(new_chunks)


        # Step 3: Index (Append Mode)
        self.retriever.build_indices(new_chunks)

        # Step 4: Stats
        self.processing_stats = {
            "filename": self.filename,
            "total_pages": len(self.pages_info),
            "total_chunks": len(self.chunks_metadata),
            "document_types": list(set(doc.doc_type for doc in self.logical_docs)),
            "processing_time": f"{time.time() - start:.2f}s",
        }

        self.is_ready = True
        return True, self.processing_stats

    # --- 2. THE QUERY ENGINE ---
    def query(self, question: str, filter_type: Optional[str] = None,
             auto_route: bool = True, k: int = 4) -> Dict:
        """
        Handles user queries with automated intent routing and silo fallbacks.
        """
        if not self.is_ready:
            return {
                'answer': "Please upload and process a PDF first.",
                'sources': [],
                'confidence': 0.0
            }

        # Sanitize the filter: Convert "All" strings to None for global search
        search_filter = None
        if filter_type and str(filter_type).strip().lower() != "all":
            search_filter = filter_type

        # FIRST ATTEMPT: Targeted Retrieval. Retrieve relevant chunks (Section 7 - Segregated Retrieval)
        retrieved = self.retriever.retrieve(
            question, k=k,
            filter_doc_type=search_filter,
            auto_route=auto_route
        )

        # FALLBACK: If 0 results found and we used a filter, try searching EVERYTHING
        if not retrieved and search_filter is not None:
            print(f"⚠️ No results found for silo '{search_filter}'. Falling back to Global Search...")
            retrieved = self.retriever.retrieve(
                question, k=k,
                filter_doc_type=None, # Remove the filter
                auto_route=False      # Disable routing for the fallback
            )
            filter_type = "All (Fallback)"

        # GENERATE RESPONSE - (Section 8 - Evidence-based Response)
        # This function should take the list of nodes and return a dict with 'answer' and 'context_text'
        result = generate_answer_with_sources(question, retrieved)

        # Ensure 'retrieved_chunks' is in the dictionary so chat_with_status can see it
        result['retrieved_chunks'] = retrieved
        result['filter_used'] = filter_type or ('auto' if auto_route else 'none')

        # Calculate a simple confidence score for the logs
        result['confidence'] = sum([n.score for n in retrieved]) / len(retrieved) if retrieved else 0.0

        return result


    # --- 3. UI HELPER METHODS ---
    def get_document_structure(self) -> List[Dict]:
        """
        Get the document structure for UI display.
        """
        if not self.logical_docs:
            return []

        structure = []
        for doc in self.logical_docs:
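            structure.append({
                'id': doc.doc_id,
                'type': doc.doc_type,
                # Source file, so the UI can group entries by file
                # (falls back to a placeholder if the attribute is absent)
                'source': getattr(doc, 'source', None) or "Unknown File",
                'pages': f"{doc.page_start + 1}-{doc.page_end + 1}",  # 1-indexed for UI
                'chunks': len(doc.chunks) if doc.chunks else 0,
                'preview': doc.text[:200] + "..." if len(doc.text) > 200 else doc.text
            })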

        return structure


# --- UI FILTERING LOGIC ---

def get_filtered_structure(selected_filters):
    """
    selected_filters: List of strings from the Multiselect (e.g., ["Type: Report", "File: my_doc.pdf"])
    """
    # 1. Get all logical documents from the store
    # (EnhancedDocumentStore keeps them in the `logical_docs` attribute)
    all_docs = doc_store.logical_docs

    if not selected_filters or "All" in selected_filters:
        filtered = all_docs
    else:
        # Extract the actual values from the labels
        type_filters = [f.replace("Type: ", "") for f in selected_filters if f.startswith("Type: ")]
        file_filters = [f.replace("File: ", "") for f in selected_filters if f.startswith("File: ")]

        filtered = [
            d for d in all_docs
            if d.doc_type in type_filters or os.path.basename(d.source) in file_filters
        ]

    # 2. Build the display string
    structure_lines = ["🧬 FILTERED DOCUMENT STRUCTURE:"]
    current_file = ""
    for doc in filtered:
        fname = os.path.basename(doc.source)
        if fname != current_file:
            structure_lines.append(f"\n📂 FILE: {fname}")
            current_file = fname
        structure_lines.append(f"   └─ 🏷️ {doc.doc_type.upper()} | 📑 Pgs: {doc.page_start + 1}-{doc.page_end + 1}")

    return "\n".join(structure_lines)

print("✅ SECTION 9. ENHANCED DOCUMENT STORE COMPLETE.")




# **SECTION 10. BACKEND CHAT & AUDIT LOGIC**

This section introduces the RAG Triad evaluation and the Data Serialization engine.

<br>

**Logic and Flow Analysis**
- **Golden Datasets:** By defining ground-truth Q&A pairs for Healthcare, Legal, and Real Estate, you move from "guessing" if the AI is right to "measuring" it.

- **The AI Auditor:** The `evaluate_response_audit` function uses a "Judge LLM" pattern. It asks the model to critique its own work or another model's work, returning structured JSON to calculate faithfulness and relevance.

- **Performance Telemetry:** The `run_performance_audit` function computes average latency, an estimated tokens-per-second figure (based on a fixed 150-token response size), and the success rate, then saves a bottleneck chart built with `Seaborn`.

- **Document Intelligence Bridge:** `process_pdf_handler` is the "middleman" that connects the Gradio UI upload button to the `EnhancedDocumentStore`.
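
The "Judge LLM" pattern depends on recovering structured scores from free-form model output. The sketch below shows one defensive way to do that with the standard library only. `parse_judge_scores` is a hypothetical helper, not a function from this notebook, and the default key names are assumed to match the auditor's JSON schema.

```python
import json
import re
from typing import Dict, Tuple


def parse_judge_scores(
    raw: str,
    keys: Tuple[str, ...] = ("faithfulness", "relevance", "answer_relevance"),
) -> Dict[str, float]:
    """Pull a JSON object out of a judge reply; return zeros if parsing fails."""
    fallback = {k: 0.0 for k in keys}

    # Drop Markdown code-fence markers the model may wrap around its JSON
    cleaned = re.sub(r"`{3}(?:json)?", "", raw).strip()

    # Take everything from the first "{" to the last "}"
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if not match:
        return fallback

    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback

    scores: Dict[str, float] = {}
    for key in keys:
        try:
            # Models sometimes return "4" instead of 4, so coerce to float
            scores[key] = float(data.get(key, 0))
        except (TypeError, ValueError):
            scores[key] = 0.0
    return scores


reply = 'Sure!\n{"faithfulness": 5, "relevance": "4"}'
print(parse_judge_scores(reply))
```

Returning a zero-filled dictionary on failure keeps downstream aggregation simple, at the cost of making a parsing failure look like a very poor answer; logging the failure separately avoids confusing the two.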

<br>
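
The telemetry arithmetic can be shown without pandas. This is a simplified sketch under stated assumptions: `summarize_audit` is a hypothetical helper, each log row is assumed to carry `latency` (seconds) and `audit_score` (0 to 1), and the token count is a fixed estimate rather than a measured value.

```python
from statistics import mean
from typing import Dict, List


def summarize_audit(logs: List[Dict], assumed_tokens: int = 150,
                    pass_mark: float = 0.7) -> Dict[str, float]:
    """Compute average latency, estimated tokens/sec, and success rate (%)."""
    if not logs:
        return {"avg_latency": 0.0, "tokens_per_sec": 0.0, "success_rate": 0.0}

    avg_latency = mean([float(row.get("latency", 0)) for row in logs])

    # Throughput is an estimate: a fixed response size divided by mean latency
    tokens_per_sec = assumed_tokens / avg_latency if avg_latency > 0 else 0.0

    # Share of responses whose audit score clears the pass mark
    passed = [1.0 if float(row.get("audit_score", 0)) > pass_mark else 0.0 for row in logs]
    success_rate = 100 * mean(passed)

    return {
        "avg_latency": avg_latency,
        "tokens_per_sec": tokens_per_sec,
        "success_rate": success_rate,
    }


sample_logs = [
    {"latency": 2.0, "audit_score": 0.9},
    {"latency": 4.0, "audit_score": 0.5},
]
print(summarize_audit(sample_logs))
```

Because the token count is assumed rather than measured, the tokens-per-second figure is only as good as that assumption; recording the real token count per response would make it a measurement.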

In [None]:
# ------- SECTION 10. BACKEND CHAT & AUDIT LOGIC -------


# 1. GLOBAL STORE INSTANCE (Initialize The Engine)
doc_store = EnhancedDocumentStore()
audit_logs = [] # Global registry for session analytics

# 2. - GOLDEN DATASETS (Define Ground-Truth) -
# Ground-truth Q&A pairs used to test RAG pipeline responses against a source of truth
GOLDEN_DATASETS = {
    "Healthcare": [
        {"question": "What is the primary diagnosis?", "golden_answer": "Diagnosis of Type 2 Diabetes with neuropathy."},
        {"question": "What are the latest lab results for Glucose?", "golden_answer": "Fasting glucose was 145 mg/dL."}
    ],
    "Legal": [
        {"question": "What is the termination notice period?", "golden_answer": "The agreement requires a 30-day written notice for termination."},
        {"question": "Who are the parties involved?", "golden_answer": "Between Acme Corp and John Smith."}
    ],
    "Real Estate": [
        {"question": "What is the total cash to close?", "golden_answer": "The estimated cash to close is $95,802."},
        {"question": "What is the loan amount and the interest rate?", "golden_answer": "The loan amount is $380,000 and the interest rate is 4.25%."},
        {"question": "Who are the applicants and what is the property address?", "golden_answer": "The applicants are John Q. Smith and Mary A. Smith. The property is 1254 Main Street, San Diego, CA 92110."},
        {"question": "What is the estimated cash to close?", "golden_answer": "The estimated cash to close is $95,802."},
        {"question": "Is there a prepayment penalty or balloon payment?", "golden_answer": "No, the loan does not have a prepayment penalty or a balloon payment."},
        {"question": "What are the total origination charges in Section A?", "golden_answer": "The total origination charges are $1,070, including an Underwriting Fee of $550, Wire Transfer Fee of $75, and Administration Fee of $445."}

    ]
}

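# 3. - AUDIT EVALUATOR (JUDGE LLM) -
def evaluate_response_audit(query: str, response: str) -> Dict:
    """
    Uses a Judge LLM to score a Q&A pair.
    Returns a dict with 'score' (e.g. 0.9) and 'reasoning' (one sentence).
    Note: the retrieved context is not passed in, so the score reflects how well
    the response answers the query rather than strict faithfulness to the context.
    """
    active_llm = Settings.llm
    model_name = getattr(active_llm, "model_name", "AI Engine")
    # Case-insensitive match so lowercase model identifiers are detected too
    is_mistral = "mistral" in str(model_name).lower()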

    raw_audit_prompt = f"""
    Evaluate this Q&A pair:
    Query: {query}
    AI Response: {response}

    Return ONLY JSON:
    {{"score": 0.9, "reasoning": "1-sentence explanation"}}
    """

    # Apply Mistral tags if needed
    final_audit_prompt = f"[INST] {raw_audit_prompt} [/INST]" if is_mistral else raw_audit_prompt

    try:
        raw_output = active_llm.complete(final_audit_prompt).text.strip()

        # Robust JSON search
        json_match = re.search(r'\{.*\}', raw_output, re.DOTALL)

        if json_match:
            repaired = repair_json(json_match.group())
            result = json.loads(repaired)
            return result
        else:
            raise ValueError("Auditor output was not structured JSON")

    except Exception as e:
        print(f"⚠️ Audit Error: {e}")
        return {"score": 0.0, "reasoning": "Audit engine parsing failed."}


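# --- 4. PERFORMANCE AUDIT & VISUALIZATION LOGIC ---
# Generates the accuracy metrics and plots
def run_performance_audit(doc_filter, audit_num_chunks):
    """
    Calculates Speed, Chunk Metrics, and RAG Triad scores for the Audit Dashboard.
    Always returns six values:
    (summary, score dict, success rate, context density, plot path, table rows).
    """
    if not audit_logs:
        # Six values, matching the main return at the end of the function
        return "**Avg Latency:** N/A", {}, "N/A", "N/A", None, [["No Data", "-", "-", "-"]]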

    # Convert logs to DataFrame for filtering. Filter logs based on active UI selection
    full_df = pd.DataFrame(audit_logs).copy()

    # --- Ensure required columns exist ---
    required_cols = ['latency', 'audit_score', 'Relevance', 'Filter_Used']
    for col in required_cols:
        if col not in full_df.columns:
            full_df[col] = 0 if col != 'Filter_Used' else 'Unknown'

    # 1. DEFINE SECTOR MAPPINGS
    # This ties the UI selection to the AI's classification types
    sector_map = {
        "Real Estate": [
            "Mortgage_Contract", "Lender_Fee_Sheet", "Land_Deed",
            "Pay_Slip", "Tax_Document", "Bank_Statement", "Report",
            "Other"
        ],
        "Healthcare": [
            "Medical", "Medical_Report", "Insurance", "Health_Form","Other"
        ],
        "Legal": [
            "Contract", "Land_Deed", "Legal_Letter", "Form", "Other"
        ]
    }

    # 3. APPLY FILTERING
    if doc_filter == "All":
        filtered_df = full_df.copy()
    elif doc_filter in sector_map:
        relevant_types = sector_map[doc_filter]
        filtered_df = full_df[full_df['Filter_Used'].isin(relevant_types)].copy()
    else:
        filtered_df = full_df[full_df['Filter_Used'] == doc_filter].copy()

    if filtered_df.empty:
        return f"**No audit data for: {doc_filter}**", {}, "0%", "0%", None, [["No Data", "-", "-", "-"]]


    # 4. DATA CLEANING
    for col in ['audit_score', 'Relevance', 'latency']:
        if col in filtered_df.columns:
            # This converts "4" or 4.0 to float, and handles errors
            filtered_df[col] = pd.to_numeric(filtered_df[col], errors='coerce').fillna(0)

    # 5. CALCULATE METRICS
    avg_latency = filtered_df['latency'].mean()
    avg_tokens = 150  # Fixed estimate of tokens per response (not measured)
    tokens_per_sec = avg_tokens / avg_latency if avg_latency > 0 else 0

    # Success Rate (Faithfulness/Audit Score > 0.7)
    success_rate = (filtered_df['audit_score'] > 0.7).mean() * 100

    # Context Density Calculation
    # filtered_df is non-empty here (checked above), so mean() is well defined
    avg_relevance = filtered_df['Relevance'].mean()
    context_density = (avg_relevance / 5) * 100  # Assumes 'Relevance' is on a 0-5 scale

    # 6. VISUALIZATION
    plt.figure(figsize=(6, 4))
    sns.barplot(x=['Retriever', 'LLM Speed'],
                y=[avg_latency * 0.3, tokens_per_sec / 10],
                hue=['Retriever', 'LLM Speed'],
                palette="viridis",
                legend=False)
    plt.title(f"Efficiency Metrics | Sector: {doc_filter}")
    plt.savefig("bottlenecks.png")
    plt.close()

    # 7. UI TABLE DATA
    audit_table_data = [
        ["Generation Speed", f"{tokens_per_sec:.1f} t/s", "12.5 t/s", "Industrial"],
        ["Context Precision", f"{context_density:.0f}%", "85%", "Target"],
        ["Avg Latency", f"{avg_latency:.1f}s", "3.0s", "Target"],
        ["Processed Silos", f"{len(filtered_df['Filter_Used'].unique())}", "N/A", "Count"]
    ]

    return (
        f"**Avg Latency:** {avg_latency:.2f}s | **Speed:** {tokens_per_sec:.1f} tokens/sec",
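        # Faithfulness comes from the judge's audit_score (0-1 scale); previously it
        # reused the relevance average and duplicated Context Density
        {"Faithfulness": float(filtered_df['audit_score'].mean()), "Context Density": context_density/100},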
        f"{success_rate:.1f}%",
        f"{context_density:.0f}%",
        "bottlenecks.png",
        audit_table_data
    )

# --- 5. BATCH UPLOAD & UI STATE HANDLER (UPLOAD & PROCESS PDFs) ---
def process_pdf_handler(file_list):
    """
    Orchestrates the ingestion of multiple files and updates UI components.
    Returns six values: (Status Message, Structure JSON, Structure Display,
    Filter Update, View Selector, Status Bar)
    """

    try:
      if not file_list:
          # Return empty defaults for all 6 outputs
          return (
              "⚠️ No files uploaded.",
              "[]",
              "",
              gr.update(choices=["All"], value=["All"]),
              gr.update(choices=[], value=None),
              "No file uploaded"
          )

      file_reports = []
      all_doc_types = set()
      all_filenames = []
      view_selector_choices = []

      total_pages = 0
      total_chunks = 0
      start_batch_time = datetime.now()

      for file in file_list:
          # full_path is the /tmp/gradio/... path needed for the PDF viewer
          full_path = file.name
          # fname is the clean name for the UI
          fname = os.path.basename(full_path)

          # 2. PROCESS FILE - Call the Orchestrator from Section 9
          # Pass the uploaded file object to the engine; fname is only the display name
          success, stats = doc_store.process_pdf(file, filename=fname)

          if success:
              all_doc_types.update(stats.get('document_types', []))
              all_filenames.append(fname)

              # Create the (Label, Value) tuple for the dropdown
              view_selector_choices.append((fname, full_path))

              # stats holds cumulative store-wide totals, so keep the latest
              # values instead of summing them (summing would double-count)
              total_pages = stats.get('total_pages', 0)
              total_chunks = stats.get('total_chunks', 0)

              # Build a clean plain-text report for this specific file
              report = (
                f"💾  {fname}\n"
                f"   └─ 📄 Pages: {stats['total_pages']} | 🧩 Chunks: {stats['total_chunks']}\n"
                f"   └─ 🏷️ Types: {', '.join(stats['document_types'])}\n"
                f"   └─ ⏱️ Time: {stats['processing_time']}"
              )

              # Add this file's report to the batch
              file_reports.append(report)

          else:
            file_reports.append(f"❌ {fname} | FAILED: {stats.get('error', 'Unknown Error')}")


      # --- DATA AGGREGATION for STRUCTURE VIEW LOGIC---
      # 3. STRUCTURE DATA AGGREGATION - Generate Structure Visuals
      structure_json = doc_store.get_document_structure()
      structure_lines = ["🧬 GLOBAL DOCUMENT STRUCTURE:"]
      current_file = ""

      for doc in structure_json:
          doc_source = doc.get('source') or doc.get('filename') or doc.get('file_name') or "Unknown File"

          if doc_source != current_file:
              # Clean up the path if it is a full /tmp/ path
              display_name = os.path.basename(doc_source)
              structure_lines.append(f"\n📂 FILE: {display_name}")
              current_file = doc_source

          structure_lines.append(f"   └─ 🏷️ {doc['type'].upper()} | 📑 Pgs: {doc['pages']} | 🧩 {doc['chunks']} chunks")

      structure_display = "\n".join(structure_lines)

      # CONSTRUCT THE MAIN STATUS LOG
      batch_time = (datetime.now() - start_batch_time).total_seconds()
      joined_reports = "\n\n".join(file_reports)

      status_msg = f"""
  ================================================================
  📂 BATCH PROCESSING COMPLETE ({batch_time:.1f}s)

  {joined_reports}

  -----------------------------------------------------------------
  📊 TOTAL BATCH STATS:
  Files: {len(all_filenames)} | Pages: {total_pages} | Chunks: {total_chunks}
  ================================================================="""


      # JSON String for the Code box
      # Convert the list (structure_json) to a JSON string
      # indent=2 makes it look like a pretty-printed JSON object in the UI
      structure_json_string = json.dumps(structure_json, indent=2)

      # 4. PREPARE SMART FILTERS (Types + Files)
      # Create labels that distinguish between Document Types and Specific Files
      unique_types = sorted(list(all_doc_types))
      type_options = [f"Type: {t}" for t in unique_types]
      file_options = [f"File: {f}" for f in sorted(all_filenames)]

      # Dynamic UI Filter logic. Combine them into one list for the multiselect dropdown
      smart_filter_choices = ["All"] + type_options + file_options


      # Update the Search Document Filter dropdown
      doc_type_filter_update = gr.update(choices=smart_filter_choices, value=["All"])


      # Update the "Select File to View" Dropdown
      # choices = paths, value = the first path in the list
      view_selector_update = gr.update(
          choices=view_selector_choices,
          value=view_selector_choices[0][1] if view_selector_choices else None
      )

      # 5. WIRING THE RETURN
      # Outputs must match the click event's output list, in this order:
      # (status, json_code, textbox_display, doc_filter, view_selector, status_bar)
      return (
          status_msg,                 # 1. status_msg (Textbox), including batch totals
          structure_json_string,      # 2. structure_json (Code)
          structure_display,          # 3. structure_display (Textbox)
          doc_type_filter_update,     # 4. doc_type_filter (Multiselect)
          view_selector_update,       # 5. view_selector (Dropdown)
          f"✅ Successfully indexed {len(all_filenames)} files."  # 6. op_status_bar (Status Label)
      )

    except Exception as e:
        print(f"Process Error: {e}")
        return f"Error: {str(e)}", "[]", "❌ Failed", gr.update(), gr.update(), "Error"


# ------- 6. EXPORT LOGIC (ReportLab) - PERFORMANCE AUDIT REPORT EXPORT (Logic for File Generation) ------- #
def handle_audit_export(audit_data):
    """
    Logic to convert your dataframe/audit results into a PDF.
    This is similar to your chat export but for the audit tab.
    """

    # Ensure audit_data is a DataFrame and not empty
    if audit_data is None or (isinstance(audit_data, pd.DataFrame) and audit_data.empty):
        return gr.update(visible=False, value=None), "⚠️ No audit data available to export."

    # Create a temporary file
    fd, path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd) # Close immediately to allow ReportLab to write to it

    try:
        doc = SimpleDocTemplate(path, pagesize=letter)
        styles = getSampleStyleSheet()
        elements = []

        # 2. Add Header
        title_style = ParagraphStyle(
            'Title',
            parent=styles['Heading1'],
            fontSize=16,
            spaceAfter=20,
            alignment=1  # Center alignment
        )
        elements.append(Paragraph("AI-Powered Document Intelligence - Performance Audit Report", title_style))

        # Date and Time
        current_time = datetime.now().strftime('%Y-%m-%d %H:%M')
        elements.append(Paragraph(f"Generated on: {current_time}", styles['Normal']))
        elements.append(Spacer(1, 20))

        # 3. Process Table Data
        # Convert the DataFrame to a list of lists (header + rows),
        # casting every value to a string for ReportLab
        data = [[str(c) for c in audit_data.columns]] + audit_data.astype(str).values.tolist()

        # Create the Table object
        audit_table = Table(data, hAlign='CENTER')

        # Apply Industry-Standard Styling
        audit_table.setStyle(TableStyle([
            ('BACKGROUND', (0, 0), (-1, 0), colors.darkslategray), # Header Background
            ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),     # Header Text
            ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
            ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
            ('FONTSIZE', (0, 0), (-1, -1), 10),
            ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
            ('BACKGROUND', (0, 1), (-1, -1), colors.whitesmoke),   # Body Background
            ('GRID', (0, 0), (-1, -1), 1, colors.black),           # Table Grid
            ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.lightgrey]) # Striped rows
        ]))

        elements.append(audit_table)
        elements.append(Spacer(1, 30))
        elements.append(Paragraph("<b>End of Performance Audit Report</b>", styles['Italic']))

        # 4. Build PDF
        doc.build(elements)

        # 5. Final check and return
        if os.path.exists(path):
            # Return two values: the file update and the status message
            return gr.update(value=path, visible=True, label="📥 Download Performance Audit Report"), "✅ Audit report generated successfully!"
        else:
            return gr.update(visible=False), "❌ Error: PDF file was not created."

    except Exception as e:
        print(f"Audit Export Error: {e}")
        # Return the same two-item shape as the success path so that
        # handle_audit_export_ui can unpack the result
        return gr.update(visible=False, value=None), f"❌ Error: {str(e)}"



# --- 7. REPORT GENERATION UTILITIES ---

# Wrapper that generates the Performance Audit PDF and exposes it for download. Works with the 'handle_audit_export' function
def handle_audit_export_ui(audit_data):
    """
    UI Wrapper: Connects the Audit Dashboard state to the PDF downloader.
    Categorized as: UI-Backend Bridge.
    """

    # 1. Call your existing PDF generator
    # handle_audit_export usually returns (gr.update(value=path), status_msg)
    file_update, status_msg = handle_audit_export(audit_data)

    # 2. Extract the actual string path from the dictionary
    file_path = file_update.get("value") if isinstance(file_update, dict) else file_update

    if file_path and os.path.exists(file_path):
        # We return the STRING path for the DownloadButton
        # and the status message for the status bar
        return file_path, status_msg

    return None, "❌ Export failed: No data found."


# CHAT HISTORY EXPORT (Logic for PDF File Generation) & DOWNLOAD CHAT HISTORY
def export_chat_history_to_pdf(history):
    """
    Transforms the live chat session into a formatted PDF document.
    Categorized as: Data Serialization logic.
    """

    if not history or len(history) == 0:
        return None  # No file to download

    try:
        # 1. Build a timestamped output path (assumes the Colab /content directory)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        file_path = f"/content/AI-Powered_Document_Intelligence_Platform_Chat_History_{timestamp}.pdf"

        # 2. Setup ReportLab PDF
        doc = SimpleDocTemplate(file_path, pagesize=letter)
        styles = getSampleStyleSheet()
        elements = []

        # --- CUSTOM STYLES ---
        user_header_style = ParagraphStyle('UserHeader', parent=styles['Normal'], fontSize=10,
                                          textColor=colors.white, backColor=HexColor("#2E4053"),
                                          borderPadding=5, borderRadius=3, spaceAfter=5)

        ai_header_style = ParagraphStyle('AIHeader', parent=styles['Normal'], fontSize=10,
                                        textColor=colors.white, backColor=HexColor("#1A5276"),
                                        borderPadding=5, borderRadius=3, spaceAfter=5)

        metadata_style = ParagraphStyle('Metadata', parent=styles['Normal'], fontSize=8,
                                       textColor=colors.grey, fontName="Courier-Oblique",
                                       leftIndent=20, spaceBefore=5)


        # --- HEADER & TITLE TO ELEMENTS ---
        # Timestamp and platform name at the top of the first page
        gen_time = datetime.now().strftime("%B %d, %Y | %H:%M")
        elements.append(Paragraph(f"Generated on: {gen_time}", styles['Heading1']))
        elements.append(Paragraph("AI Document Intelligence Automation Platform", styles['Heading1']))
        elements.append(Spacer(1, 10))

        # Main Title
        elements.append(Paragraph("AI-Powered Document Intelligence Automation Platform Chat History", styles['Normal']))

        # --- ADD 3 LINES OF EXTRA SPACE AFTER MAIN TITLE ---
        elements.append(Spacer(1, 45)) # Approximately 3 lines of space (15 units per line)


        # ------ BUILD CHAT CONTENT ------

        # HANDLE DICTIONARY FORMAT ---
        for entry in history:
            # 1. Extract role and content safely
            if not isinstance(entry, dict): continue

            role = entry.get("role", "user")
            content_data = entry.get("content", "")



            # --- CRITICAL FIX FOR LIST ERROR ---
            # Extract text from Gradio's complex message format if it's a list
            if isinstance(content_data, list):
                raw_text = " ".join([item.get("text", "") for item in content_data if isinstance(item, dict)])
            else:
                raw_text = str(content_data)

            # --- CLEANING (DO NOT PURGE \n YET) ---
            # Remove raw structural code artifacts
            clean_text = re.sub(r"\[\{'text':\s*'", "", raw_text)
            clean_text = re.sub(r"',\s*'type':\s*'text'\}\]", "", clean_text)


            # Purge symbols but LEAVE \n and \r for the splitting logic below
            purge_list = ["■", "🤖", "👤", "✅", "⏳", "🧠", "🎯", "****", "**", "##"]
            for sym in purge_list:
                clean_text = clean_text.replace(sym, "")

            clean_text = clean_text.replace("AI Document Assistant:", "").replace("You:", "").strip()

            # Handle Metadata Split (Keep text before '---')
            parts = clean_text.split('---')
            main_answer = parts[0].strip()
            metadata = parts[1].strip() if len(parts) > 1 else ""

            # Add Role Header
            header_text = "<b>USER QUERY</b>" if role == "user" else "<b>AI ASSISTANT VERIFIED RESPONSE</b>"
            elements.append(Paragraph(header_text, user_header_style if role == "user" else ai_header_style))

            # --- BULLET POINT & NEW LINE LOGIC ---
            # We split by actual newlines to create distinct Paragraph blocks
            # This makes reading much easier than one big block of text
            text_lines = main_answer.replace("\\n", "\n").split('\n')
            for line in text_lines:
                line = line.strip()
                if not line:
                    elements.append(Spacer(1, 6)) # Adds space between paragraphs
                    continue

                # Detect bullet points (starts with *, -, or :)
                if line.startswith('*') or line.startswith('-') or line.startswith(':'):
                    # Clean the prefix and add a professional bullet
                    bullet_text = f"&bull; {line.lstrip('*-: ').strip()}"
                    elements.append(Paragraph(bullet_text, styles['Normal']))
                else:
                    elements.append(Paragraph(line, styles['Normal']))

            # 6. Add Metadata (Audit Trail) in a separate gray box
            if role != "user" and metadata:
                elements.append(Spacer(1, 5))
                # Filter metadata to remove the ugly \n■■ prefixes
                clean_meta = metadata.replace("\\n", " ").replace("n■■", "").replace("■", "").strip()
                elements.append(Paragraph(f"<i>Audit Trail:</i> {clean_meta}", metadata_style))

            elements.append(Spacer(1, 15)) # Space after each Q&A turn

        doc.build(elements)
        return file_path

    except Exception as e:
        print(f"PDF Export Error: {e}")
        return None

print("✅ SECTION 10. BACKEND CHAT & AUDIT LOGIC COMPLETE.")



✅ SECTION 10. BACKEND CHAT & AUDIT LOGIC COMPLETE.


# **SECTION 11. CHATBOT LOGIC & ORCHESTRATION**

This section is built around **Python generators** (`yield`), which let the UI stay responsive between the heavy retrieval and generation steps. Instead of staring at a blank screen for several seconds, the user sees a step-by-step progress report (Routing -> Searching -> Analyzing -> Scoring).
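
The streaming pattern can be sketched in a few lines. This is a minimal, self-contained illustration rather than the notebook's handler: `answer_with_status` and its stage names are hypothetical, and the real work at each stage is replaced by placeholders.

```python
from typing import Iterator, List, Tuple

History = List[dict]


def answer_with_status(message: str, history: History) -> Iterator[Tuple[History, str]]:
    """Hypothetical handler: yields (history, status) after each stage."""
    history = history + [
        {"role": "user", "content": message},
        {"role": "assistant", "content": "Analyzing..."},  # placeholder bubble
    ]
    yield history, "Routing"

    # Placeholder for retrieval; a real handler would query the vector store here.
    yield history, "Searching"

    # Placeholder for generation; the answer overwrites the placeholder bubble.
    history[-1] = {"role": "assistant", "content": f"Echo: {message}"}
    yield history, "Done"


updates = list(answer_with_status("What is the rent?", []))
statuses = [status for _, status in updates]
final_answer = updates[-1][0][-1]["content"]
```

A UI framework that accepts generator handlers, as Gradio does, re-renders after each `yield`, which is what produces the step-by-step status display.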

<br>

**Logic and Flow Analysis**

- **Safety First:** It applies "Context Window Safety," capping the number of retrieved chunks (`k`) at a per-model limit (3 for Phi-2, 6 for Mistral in the code below).

- **The Fallback Loop:** If a filtered search (e.g., looking only in "Legal") returns no chunks, it automatically retries as a "Global" search so the user never hits a dead end.

- **The Quality Gate:** It doesn't just show an answer; it calculates a **Faithfulness Score** (0.0 to 5.0) before displaying the result, giving the user a "Confidence Meter."

- **Hardware Management:** The `deep_purge_gpu` logic is vital for the Free Tier (T4 GPU): it drops the references to the current model and releases cached VRAM before a new model is loaded.
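
The first two bullets can be sketched in isolation as follows. The helper names (`cap_k`, `search_with_fallback`, `fake_search`) and the `SAFE_K` table are assumptions made for this illustration; the notebook inlines the same logic inside `chat_with_status` and queries `doc_store` instead of a stand-in function.

```python
from typing import Callable, List, Optional, Tuple

# Assumed per-model caps, mirroring the limits used in this section's handler.
SAFE_K = {"Phi-2": 3, "Mistral": 6}


def cap_k(requested_k: int, engine_name: str) -> int:
    """Lower k only when it exceeds the cap of a matching model."""
    for name, cap in SAFE_K.items():
        if name in engine_name:
            return min(requested_k, cap)
    return requested_k


def search_with_fallback(
    search: Callable[[str, Optional[str], int], List[str]],
    query: str,
    filter_type: Optional[str],
    k: int,
) -> Tuple[List[str], str]:
    """Try the filtered search first; retry globally if it returns nothing."""
    chunks = search(query, filter_type, k)
    if not chunks and filter_type is not None:
        return search(query, None, k), "Global Fallback"
    return chunks, filter_type or "All"


def fake_search(query: str, filter_type: Optional[str], k: int) -> List[str]:
    """Stand-in store: only the unfiltered search finds anything."""
    corpus = ["chunk-a", "chunk-b", "chunk-c", "chunk-d"]
    return [] if filter_type == "Legal" else corpus[:k]


k = cap_k(10, "Phi-2 (Local)")
chunks, scope = search_with_fallback(fake_search, "termination clause", "Legal", k)
```

Returning the scope label together with the chunks keeps the "Global Fallback" information available for the response footer.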


In [None]:
# ------- SECTION 11. CHATBOT LOGIC & ORCHESTRATION -------


# Chat handler with status bar update. Define how the AI thinks and responds.
def chat_with_status(message, history, doc_type_filter, auto_route, audit_num_chunks):
        """
        Primary UI Controller.
        Manages the 'Thinking Loop' and streams status updates via yield.
        """

        global current_model_name

        # HISTORY INITIALIZATION
        # Handles both list-of-dicts and list-of-lists (Gradio formats)
        if history is None:
          history = []


        # DYNAMIC ENGINE DETECTION
        active_engine_name = (
            globals().get('current_model_name') or
            getattr(Settings.llm, "model_name", "AI Engine")
        )


        # FILTER SANITIZATION
        # Ensure filter is a string (hashable) for the Vector Store
        # Gradio dropdowns often pass a list like ['Contract']. LlamaIndex filters need the string 'Contract'.
        if isinstance(doc_type_filter, list) and len(doc_type_filter) > 0:
            clean_filter = str(doc_type_filter[0])
        elif doc_type_filter and doc_type_filter != "none":
            clean_filter = str(doc_type_filter)
        else:
            clean_filter = "All"

        filter_label = clean_filter

        # READINESS CHECK - Check if documents exist
        if not doc_store.is_ready:
            response = "📚 Please upload and process a PDF document first."
            history.append({"role": "user", "content": f"**👤 You:** {message}"})
            history.append({"role": "assistant", "content": f"**🤖 AI Document Assistant:** {response}"})
            yield history, "⚠️ System Not Ready"
            return


        # START PROCESSING TELEMETRY
        # Responsive UI Start. Immediate UI Feedback (Streaming Yield)
        # Immediately tell the user the AI is working so the app feels responsive.
        log_entry = None
        start_total = time.time()
        routed_type = clean_filter
        routing_confidence = 1.0  # Default for manual selection


        # AGENTIC ROUTING (UI FEEDBACK LAYER)
        if auto_route:
            yield history, f"🎯 AI ({active_engine_name}) is routing..."
            # Pass Settings '.llm' to ensure it uses the engine selected in the UI
            routed_type, routing_confidence = predict_query_document_type(message, Settings.llm)
            clean_filter = routed_type
            filter_label = f"{routed_type} ({routing_confidence:.1%})"
            print(f"✅ Router assigned category: {routed_type} ({routing_confidence:.2%})")


        # STREAMING ANALYTICS (Update UI to show the 'Silo' being searched)
        # Responsive UI Start
        history.append({"role": "user", "content": f"**👤 You:** {message}"})
        history.append({"role": "assistant", "content": f"**🤖 AI Document Assistant is 🧠 Analyzing {filter_label} documents...**"})
        yield history, f"⏳ Searching (Filter: {filter_label})..."

        # --- CONTEXT WINDOW SAFETY LOGIC ---
        # Map safety limits based on the current global model name
        active_k = int(audit_num_chunks) if audit_num_chunks else 5

        # 'active_engine_name' was resolved above from 'current_model_name' (Section 2B).
        # Only reduce k if the slider is HIGHER than the model's safe limit.
        if "Phi-2" in active_engine_name:
            if active_k > 3: # Phi-2 can usually handle 3 chunks safely
                print(f"⚠️ Phi-2 context safety. Capping slider from {active_k} to 3.")
                active_k = 3
        elif "Mistral" in active_engine_name:
            if active_k > 6: # Mistral can handle about 6 chunks on a T4
                print(f"⚠️ Mistral context safety. Capping slider from {active_k} to 6.")
                active_k = 6
        # --------------------------------------------------

        # DEBUG PRINT: Verify what we are asking the database
        print("-" * 30)
        print("🔍 DEBUG: Sending to Database...")
        print(f"   > Query: '{message}'")
        print(f"   > Applied Filter: '{clean_filter}'")
        print(f"   > Search Depth (k): {active_k} (Safety Adjusted)")
        print("-" * 30)


        try:
            # ----- RETRIEVAL -----
            # Ensure search_filter is None if "All" is selected to bypass metadata silos
            search_filter = None if clean_filter.strip().lower() == "all" else clean_filter

            # Limit the search depth 'k' based on the user's Audit Slider
            result = doc_store.query(
                message,
                filter_type=search_filter,
                auto_route=False,
                k=active_k
            )

            # Extract the chunks correctly for the next step
            # Note: your doc_store.query likely returns chunks in a key called 'retrieved_chunks'
            chunks_to_process = result.get('retrieved_chunks', [])


            # --- SMART FALLBACK ---
            if len(chunks_to_process) == 0 and search_filter is not None:
                print(f"⚠️ Zero results in '{search_filter}'. Retrying with Global Search...")
                result = doc_store.query(message, filter_type=None, k=active_k)
                chunks_to_process = result.get('retrieved_chunks', [])
                applied_filter = "Global Fallback" # Update this for the footer
            else:
                applied_filter = result.get('filter_used', clean_filter)


            # --- NEW SYSTEM WRAPPING LOGIC ---
            if "dollar amounts" in message.lower() or "financial figures" in message.lower():
                # Financial Extraction System Instruction
                system_instruction = (
                    "You are a financial auditor. Extract only digits and dollar values. "
                    "Ignore blank form fields, underscores (___), and empty placeholders. "
                    "If a field is empty, do not mention it."
                )
            elif "summary" in message.lower():
                # Summary System Instruction
                system_instruction = "You are an executive assistant. Provide a structured, concise overview of the text."
            else:
                # General Instruction
                system_instruction = "You are a helpful document assistant. Answer based strictly on the context provided."

            # Combine for the LLM
            effective_query = f"[SYSTEM: {system_instruction}]\n\n[USER REQUEST: {message}]"

            # 7. GENERATE
            generation_result = generate_answer_with_sources(effective_query, chunks_to_process)
            answer = generation_result.get('answer', "I'm sorry, I couldn't generate an answer for that.")

            # POST-GENERATION CLEANUP (The "Safety Net")
            # Strips out if model echoes rules
            if "IMPORTANT RULES" in answer:
                answer = answer.split("completely.")[-1].strip()
            if "professional executive summary" in answer:
                answer = answer.split("points.")[-1].strip()

            # EVALUATION & AUDIT
            context_text = generation_result.get('context_text', "")

            # POST-GENERATION AUDIT (The Quality Gate)
            # Evaluate the 'RAG Triad' immediately after generation
            scores = evaluate_rag_performance(message, context_text, answer)
            latency = time.time() - start_total


            # LOGGING - Build the audit record for this turn
            log_entry = {
                "timestamp": datetime.now().strftime("%H:%M:%S"),
                "model": active_engine_name,  # Dynamic model name
                "query": message[:50],
                "latency": round(latency, 3),
                "routed_category": routed_type,
                "audit_score": float(scores.get("faithfulness", 0)),
                "Relevance": float(scores.get("relevance", 0)),
                "Filter_Used": str(clean_filter), # Ensure this is a string, not a list
            }


            audit_logs.append(log_entry) # Ensure audit_logs = [] is defined in Section 1

            # --------- FINAL UI RESPONSE CONSTRUCTION --------------

            # RETRIEVED DATA PROCESSING
            retrieved_chunks = result.get('retrieved_chunks', [])
            chunk_count = len(retrieved_chunks)

            # SOURCE LOGIC
            unique_sources_list = []  # Initialize a list first

            if retrieved_chunks:
                for c in retrieved_chunks:
                    # LlamaIndex returns a NodeWithScore object; metadata is in .node.metadata
                    node = c.node if hasattr(c, 'node') else (c[0] if isinstance(c, (tuple, list)) else c)
                    meta = getattr(node, 'metadata', {})

                    # Fallback cascade: Meta dict -> Object attribute -> Default '?'
                    d_type = meta.get('doc_type', getattr(node, 'doc_type', 'Other'))
                    p_start = meta.get('page_start', getattr(node, 'page_start', '?'))
                    p_end = meta.get('page_end', getattr(node, 'page_end', '?'))

                    label = f"{d_type} (p.{p_start}-{p_end})"
                    if label not in unique_sources_list:
                        unique_sources_list.append(label)

            # FORMAT SOURCES TEXT
            sources_text = f"\n\n🔍 **Sources:** {', '.join(unique_sources_list)}" if unique_sources_list else ""

            conf_suffix = f" | 🎯 Routing Confidence: {routing_confidence:.1%}" if auto_route else ""


            # METADATA FOOTER LOGIC
            # 'applied_filter' was set during retrieval (including "Global Fallback").
            # Re-reading it from 'result' here would overwrite that label.
            stats_text = (
                f"\n\n---\n"
                f"*⏱️ {latency:.2f}s | 🤖 **Engine: {active_engine_name}**{conf_suffix} *\n"
                f"|*✅ Faithfulness: {scores.get('faithfulness', 0)}/5 | 🧩 Retrieved from {chunk_count} document chunks*"
                f"\n*📂 Search Scope: {applied_filter}*"
            )

            # FINAL ASSEMBLY
            full_response = (
                f"🤖 **AI Document Assistant:**\n"
                f"{answer.strip()}\n\n"
                f"(Answer derived from {chunk_count} document chunks.)"
                f"{sources_text}"
                f"{stats_text}"
            )

            history[-1] = {"role": "assistant", "content": full_response}

            yield history, "✅ Response Generated"

        except Exception as e:
            # Surface the failure in the chat instead of leaving the placeholder message
            error_msg = f"**🤖 AI Document Assistant:** ⚠️ Error: {str(e)}"
            history[-1] = {"role": "assistant", "content": error_msg}
            yield history, "❌ Search Failed"




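# --- 2. CENTRAL SWITCHING & VRAM MANAGEMENT ---
def deep_purge_gpu():
    """Drop references to the active LLM and release cached GPU memory."""
    global current_llm

    # Remove references so the model weights become garbage-collectable
    current_llm = None
    Settings.llm = None
    gc.collect()

    # Comprehensive CUDA purge
    if torch.cuda.is_available():
        torch.cuda.empty_cache()   # Release cached, unused blocks held by PyTorch
        torch.cuda.ipc_collect()   # Release memory held for CUDA inter-process handles
        torch.cuda.synchronize()
    print("🧹 VRAM Deep Purged.")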


def handle_model_transition(model_name, history, clear_history):
    """
    The UI manager that talks to Gradio. Backend connector for the UI dropdown.
    Switches the LLM and manages the Chatbot state.
    """
    # 1. Clear history if the checkbox is checked
    if clear_history:
        history = []

    # 2. Call the backend switch function (Section 2B)
    switch_result = switch_llm(model_name)

    # Force switch_result to a string to prevent "NoneType" iteration errors
    result_text = str(switch_result) if switch_result else "❌ Error: Unknown Failure"

    # Test the sanitized string; 'switch_result' itself may be None
    if "❌" in result_text:
        new_status = f"### ⚠️ Error: {result_text}"
    else:
        # 3. Dynamic Status Construction
        status_type = "History Cleared" if clear_history else "Context Retained"
        new_status = f"### 🧠 Engine: {model_name} | ✅ {status_type}"

    return history, new_status
print("✅ SECTION 11. CHATBOT LOGIC & ORCHESTRATION Complete")



✅ SECTION 11. CHATBOT LOGIC & ORCHESTRATION Complete


# **SECTION 12. GRADIO INTERFACE, CHAT HANDLERS, & WIRING LOGIC**



## **SECTION 12A. GRADIO INTERFACE, CHAT HANDLERS, & WIRING LOGIC**

This section serves as the **Environmental Patch Layer**. In complex environments like Google Colab or Python 3.12+, asynchronous event loops (which Gradio uses to handle multiple users or streaming text) can often "clash" or crash.

<br>

**Logic and Flow Analysis**

- **Event Loop Patching:** By using `nest_asyncio`, the system allows the UI to run inside an already-active notebook loop without hanging.

- **Uvicorn Override:** The patch for `loop_factory` is a critical stability fix. It prevents a known "TypeError" in newer Python versions by stripping out incompatible arguments before the server starts.
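
The argument-stripping idea behind the override can be shown without a server. The sketch below is generic and makes assumptions for illustration: `strip_kwargs` is a hypothetical helper, and `run_server` is a stand-in for an entry point that rejects unknown keyword arguments. It does not call Uvicorn.

```python
import functools


def strip_kwargs(func, *names):
    """Return a wrapper around func that silently drops the named keyword arguments."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for name in names:
            kwargs.pop(name, None)  # No error if the argument was not passed
        return func(*args, **kwargs)
    return wrapper


def run_server(app, host="127.0.0.1", port=8000):
    """Hypothetical stand-in for an entry point that rejects unknown kwargs."""
    return f"{app} on {host}:{port}"


patched_run = strip_kwargs(run_server, "loop_factory")
result = patched_run("demo", port=7860, loop_factory=object())
```

Calling `run_server` directly with `loop_factory` would raise a `TypeError`; the wrapper removes the argument before the call reaches the original function.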

In [None]:
# ------- SECTION 12A. GRADIO INTERFACE, CHAT HANDLERS, & WIRING LOGIC -------


# 1. Force Standard Policy (Insurance against uvloop hijacking)
if sys.platform != "win32":
    asyncio.set_event_loop_policy(asyncio.DefaultEventLoopPolicy())

# 2. THE MONKEY PATCH: Intercepts the Gradio-Uvicorn handshake
original_run = uvicorn.run

def patched_uvicorn_run(*args, **kwargs):
    # Fix 1: Strip the 'loop_factory' which crashes Python 3.12/Colab
    kwargs.pop("loop_factory", None)

    # Fix 2: Force the standard asyncio event loop (never uvloop)
    kwargs["loop"] = "asyncio"

    # Fix 3: Fall back to the pure-Python h11 HTTP parser if httptools is requested
    if kwargs.get("http") == "httptools":
        kwargs["http"] = "h11"

    return original_run(*args, **kwargs)

# Inject the patch
uvicorn.run = patched_uvicorn_run


print("✅ SECTION 12A. GRADIO INTERFACE, CHAT HANDLERS, & WIRING LOGIC Complete")

✅ SECTION 12A. GRADIO INTERFACE, CHAT HANDLERS, & WIRING LOGIC Complete


## **SECTION 12B. GRADIO INTERFACE, CHAT HANDLERS, & WIRING LOGIC**

This is the Layout & Event Wiring Layer. It follows a "Pillar Architecture" designed for professional document workflows.

<br>

**Logic and Flow Analysis:**

1. **Behavioral Scripting (JS):** Uses a JavaScript `MutationObserver` to ensure the chatbot automatically scrolls to the bottom as the AI "types."

2. **The Three-Pillar Layout:**
  - **Tab 1 (Operations):** The primary workspace. It features a "Dual-Column" design where the left side handles the "Physical" (files, page viewing, engine selection) and the right side handles the "Digital" (the chat conversation).
  
  - **Tab 2 (Transparency):** Dedicated to explainability. It shows the **JSON Structure** of how the AI "sees" the document, which is vital for developer debugging.

  - **Tab 3 (Governance):** The Audit dashboard. It visualizes metrics like **Latency vs. Token Speed** and **Context Density**, translating technical data into business-ready status reports.

3. **Event Wiring:** This uses the `.click()`, `.change()`, and `.submit()` methods to create a "Reactive" interface. Specifically, the use of `.then()` allows for "Chain Reactions": for example, sending a message and then automatically clearing the input box for the next question.


In [None]:
# ------- SECTION 12B. GRADIO INTERFACE, CHAT HANDLERS, & WIRING LOGIC -------


# --- JavaScript is for behavior (Auto-Scroll) ---
scroll_script = """
function() {
    const targetNode = document.querySelector('#chatbot-box');
    if (!targetNode) {
        console.log("Chatbot box not found yet...");
        return;
    }

    const observer = new MutationObserver(() => {
        // In newer Gradio, the scrollable area is usually a 'div' inside the chatbot
        const scrollContainer = targetNode.querySelector('.scrollable-auto') || targetNode.querySelector('.wrapper') || targetNode;
        if (scrollContainer) {
            scrollContainer.scrollTo({
                top: scrollContainer.scrollHeight,
                behavior: 'smooth'
            });
        }
    });

    observer.observe(targetNode, { childList: true, subtree: true });
}
"""


# ----------------------------------------------- UI LAYOUT  ------------------------------------------------------------------------------------- #

# CSS: We target only the 'header-container' for centering
# CSS: Targets only the tab navigation bar to make it look like a black menu
# Targets Download & Export buttons for Chat History & Performance Audit Report
custom_css = """
    /* Center the header text */
    .welcome-text-header-container {
        text-align: center;
        margin-bottom: 20px;
    }

    /* 1. FORCE ALL BUTTONS TO BLACK */
    /* This targets primary buttons, secondary buttons, and specific IDs */
    button.primary, button.secondary, #dark-btn, #chat-export-btn, #ingest_btn, #send_btn {
        background-color: black !important;
        background: black !important;
        color: white !important;
        border: 1px solid #444 !important;
        box-shadow: none !important;
    }

    /* Button Hover Effect */
    button.primary:hover, button.secondary:hover {
        background-color: #222 !important;
        border-color: #00d1b2 !important;
    }

    /* 2. NAVIGATION BAR (TABS) STYLING */
    /* The horizontal strip background */
    .tabs > .tab-nav {
        background-color: black !important;
        border-bottom: 2px solid #333 !important;
        padding: 8px 10px 0px 10px !important;
        display: flex !important;
        gap: 5px !important;
        border-radius: 8px 8px 0 0 !important;
    }

    /* Individual Tab Labels (Inactive) */
    .tabs > .tab-nav > button {
        background-color: #111 !important; /* Very dark grey for inactive */
        color: white !important;           /* White font */
        border: none !important;
        border-radius: 5px 5px 0 0 !important;
        padding: 10px 25px !important;
        font-weight: bold !important;
    }

    /* The Active (Selected) Tab */
    .tabs > .tab-nav > button.selected {
        background-color: black !important; /* Pure black for active */
        color: #00d1b2 !important;           /* Highlight font color */
        border-bottom: 3px solid #00d1b2 !important;
    }

    /* LABELS */
    .gradio-container .label {
        background-color: black !important;
        color: white !important;
        padding: 4px 10px !important;
        border-radius: 5px 5px 0 0 !important;
        border: 1px solid #444 !important;
        box-shadow: none !important;
        font-weight: bold !important;
    }

    .control-frame {
        border: 1px solid #e0e0e0;
        border-radius: 12px;
        padding: 20px;
        background-color: #fcfcfc;
        box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1);
    }

    .section-divider {
        border-top: 2px solid #3b82f6;
        margin: 15px 0;
        opacity: 0.5;
    }

    .vertical-divider {
        border-right: 2px solid #e0e0e0;
        height: 90vh; /* Fills most of the vertical screen */
        margin: 0 20px;
        align-self: center;
    }
    """


def create_interface():
    # Load custom CSS for the 'Obsidian' black-and-teal theme
    with gr.Blocks(title="AI-Powered Document Intelligence Automation Platform") as demo:
      # --- 1. HEADER SECTION(CENTERED) ---
      with gr.Column(elem_classes="welcome-text-header-container"):
          gr.Markdown("# 🤖 AI-Powered Document Intelligence Automation Platform")
          gr.Markdown("### Providing assistance with document search. ✨")
          gr.Markdown("📂 Upload and process a multi-page PDF or scanned image, then enter a search request in the chatbot.")
          gr.Markdown("Accepted file formats: .pdf, .png, .jpg, .jpeg")
          gr.HTML("<hr style='border: 1px solid #e0e0e0;'>")

      # --- 2. THREE-PILLAR NAVIGATION TABS ---
      with gr.Tabs() as tabs:
# -------- --- TAB 1: OPERATIONS (CHATTING & VIEWING) ---
          with gr.TabItem("💬 Chat Operations", id="chat_tab"):
             with gr.Row():
 # -----------     --------   # TAB 1: OPERATIONAL CORE (LEFT TOP COLUMN: UI IMAGE)
                with gr.Column(scale=1):
                    gr.Image( # AI-Powered Document Assistant logo v2.png
                        value=LOGO_PATH,
                        width=100,
                        show_label=False,
                        container=False,
                        scale=1)

                    # --- DIVIDER ---
                    gr.HTML("<hr>")

 # -----------     --------    # TAB 1 - LEFT COLUMN: LARGE LANGUAGE MODEL (LLM) SELECTION
                    with gr.Row():
                      gr.Markdown("# 🧠 Large Language Models (LLM)")

                    with gr.Row():
                      # Status indicator
                      engine_status = gr.Markdown("*Status: Ready*")

                    with gr.Row():
                      # Create the choice button
                      llm_selector = gr.Dropdown(
                          choices=["Gemini 2.0", "Mistral 7B", "Phi-2"],
                          label="Select LLM Engine",
                          value="Gemini 2.0",
                          scale=1,
                          container=False)
                      clear_on_switch_checkbox = gr.Checkbox(label="Clear History on Switch")

                    # --- DIVIDER ---
                    gr.HTML("<hr>")



                    # DOCUMENT PROCESSING CENTRAL
                    with gr.Row():
                      gr.Markdown("## 📂 Document Processing Central")

                    # FILE_UPLOAD, INGEST_BTN (PROCESS DOCUMENT), CLEAR_ALL_BTN, DOC_TYPE_FILTER, STATUS_OUTPUT
                    with gr.Row():
                      gr.Markdown("Upload file(s) and press Process Document button")

                    with gr.Row():
                      file_upload = gr.File(
                        label="Upload Multi-page PDF or Scanned Image",
                        file_count="multiple",  # Enable multiple files
                        file_types=[".pdf", ".png", ".jpg", ".jpeg"],
                        interactive=True,
                        type="filepath")


                    with gr.Row():
                      ingest_btn = gr.Button("🔄 Process Document", variant="primary", interactive=True, scale=1)
                      clear_all_btn = gr.Button("🗑️ Clear All", variant="primary", interactive=True, scale=1)

                    # --- DIVIDER ---
                    gr.HTML("<hr>")


                    # PROCESSING STATUS & METADATA
                    with gr.Row():
                      # This shows the bullet points of document pages created in 'structure_display' string in def 'process_pdf_handler'.
                      status_output = gr.Textbox(
                          label="Processing Status & Metadata",
                          lines=15, # Increased height
                          elem_classes="status-window",
                          interactive=False,
                          placeholder="Technical details will appear here after upload...",
                          visible=True,
                          elem_id="status-box")   # ID for custom styling

                    # --- DIVIDER ---
                    gr.HTML("<hr>")


                    # VIEW DOCUMENT
                    with gr.Row():
                      gr.Markdown("# 📄 Document Preview")

                    with gr.Row():
                      # View the PDF pages as images in the UI
                      # Fixed height viewer prevents layout shifts
                      doc_viewer = gr.Image(
                              label="Page Viewer",
                              type="pil",
                              interactive=False,
                              height=550)

                    with gr.Row():
                      prev_btn = gr.Button("⬅️ Previous", scale=1)
                      # Indicator shows: Page 1 of 10
                      next_btn = gr.Button("Next ➡️", scale=1)

                    with gr.Row():
                      page_indicator = gr.Markdown("## <center>Page 0 of 0</center>")

                      op_status_bar = gr.Markdown("**Status:** Ready")
                      # Hidden State to store the PDF pages and current index
                      viewer_state = gr.State({"current_page": 0, "images": []})
                      filename_debug_output = gr.Textbox(label="Uploaded Filename (Debug)", visible=False, lines=1, interactive=False) # ADDED DEBUG TEXTBOX



# -----------     --------   # TAB 1 - RIGHT COLUMN: AI-POWERED DOCUMENT INTELLIGENCE CHATBOT INTERFACE

                with gr.Column(scale=2):
                    gr.Markdown("## AI-Powered Document Intelligence Chatbot")

                    # Chatbot Design
                    chatbot = gr.Chatbot(
                      label="AI Document Assistant",
                      height=1000,
                      show_label=True,
                      value=[{"role": "assistant", "content": "**🤖 AI Document Assistant:** 👋 Welcome! Upload files in the 📂 Upload & Process Documents tab to begin. 🚀"}],
                      elem_id="chatbot-box",
                      autoscroll=True,
                      render_markdown=True) # Processes the **bold** text

                    # --- DIVIDER ---
                    gr.HTML("<hr>")

                    with gr.Row():
                        msg_input = gr.Textbox(show_label=False, placeholder="Ask a question about your docs...", scale=8, container=True)

                    with gr.Row():
                        send_btn = gr.Button("🚀 Send", scale=1, variant="primary", interactive=True)
                        chat_download_btn = gr.DownloadButton(
                               "📤 Download Chat History (PDF)", # Button the user clicks to start the export
                               visible=True,
                               interactive=True,
                               elem_id="chat-export-btn",
                               variant="primary",
                               scale=1)

                        # This component holds the actual file once generated
                        # Visible=False until the file is ready
                        chat_download_file = gr.File(label="Download Ready", visible=False, scale=1)

                    with gr.Row():
                        example_btn1 = gr.Button("📝 Summary", variant="primary", interactive=True, scale=1)
                        example_btn2 = gr.Button("💰 Find Amounts", variant="primary", interactive=True, scale=1)
                        clear_chat_btn = gr.Button("🗑️ Clear Chat", variant="primary", interactive=True, scale=1)

# -----------  # --- TAB 2: RAG CONFIGURATIONS & DOCUMENT(s) FILTERS ---
          with gr.TabItem("⚙️ Configurations & 📂 Filters", id="Config_filters_tab"):
             with gr.Row():


 # -----------     --------   # TAB 2 - LEFT COLUMN: DOCUMENT STRUCTURE
                with gr.Column(scale=2):
                    # DOCUMENT STRUCTURE
                    gr.Markdown("# 🧬 Processed Document Breakdown")

                    with gr.Row():
                          gr.Markdown("""
                            This view displays the breakdown of your file. Our system
                            identifies distinct sub-document types (e.g., an Invoice followed by a Lease)
                            within a single upload, mapping the specific page ranges and initial content previews
                            to ensure the retriever (search) knows exactly where each piece of information originated.
                            """)

                    # --- DIVIDER ---
                    gr.HTML("<hr>")


                    # Human-Readable Text output
                    with gr.Row():
                        gr.Markdown("### Document Structure")

                    with gr.Row():
                        gr.Markdown("Identifies distinct sub-documents and page ranges within your file.")

                    with gr.Row():
                        structure_output_textbox = gr.Textbox(label="Text Output", visible=True, scale=3, lines=8)


                    # --- DIVIDER ---
                    gr.HTML("<hr>")

                    # Developer JSON output
                    with gr.Row():
                        gr.Markdown("### Developer Document Structure")

                    with gr.Row():
                        gr.Markdown("Machine-ready schema for debugging and database integration.")

                    with gr.Row():
                         # Shows the raw JSON structure returned by `process_pdf_handler`.
                        structure_output_code = gr.Code(label="JSON Output", language="json", lines=25, interactive=False, elem_id="structure-json-box")


 # -----------     --------   # TAB 2 - RIGHT COLUMN: FILTERS  & RAG CONFIGURATIONS
                with gr.Column(scale=2):
                    gr.Image( # Document Filter and RAG.png
                        value=CONFIG_FILTER_PATH,
                        show_label=False,
                        container=False,
                        scale=1)

                    # --- DIVIDER ---
                    gr.HTML("<hr>")


                    # FILTER DOCUMENTS
                    with gr.Row():
                        gr.Markdown("# Filters")

                    with gr.Row():
                        gr.Markdown("Filter AI Document Assistant responses or document preview by File or Document Type.")

                    with gr.Row():
                        view_selector = gr.Dropdown(label="Select File to View", choices=[],scale=2)
                    with gr.Row():
                        doc_type_filter = gr.Dropdown(
                            choices=["All"]+ VALID_DOC_TYPES, # Automatically populates from DOCUMENT_TAXONOMY
                            label="Filter By Document (File) & Type:",
                            value="All",
                            interactive=True,
                            multiselect=True,
                            scale=2)

                    # --- DIVIDER ---
                    gr.HTML("<hr>")


                    with gr.Row():
                        # RIGHT COLUMN: RAG CONFIGURATION ROW
                        gr.Markdown("# ⚙️ RAG Configuration")

                    with gr.Row():
                        gr.Markdown("To refine AI Document Assistant response, adjust Recall Chunks.")

                    with gr.Row():
                        auto_route = gr.Checkbox(value=True, label="🎯 Auto-Route Queries")

                    with gr.Row():
                        audit_num_chunks = gr.Slider(
                                  minimum=1,
                                  maximum=10,
                                  value=4,
                                  step=1,
                                  label="📊 Recall Chunks",
                                  info="Determines how many chunks are analyzed for precision.")


# -----------    # --- TAB 3: AUDIT & GROUND TRUTH ---
          with gr.TabItem("⚖️ Performance Audit", id="audit_tab"):

              #Blank row for Spacing
              with gr.Row():


                  # LEFT COLUMN: SECTOR FILTER PERFORMANCE AUDIT
                  with gr.Column(scale=1):
                      gr.Markdown("### ⚙️ PERFORMANCE AUDIT CONFIGURATIONS")

                      with gr.Row():
                        sector_dropdown = gr.Dropdown(
                        choices=["Real Estate", "Healthcare", "Legal", "All"], # Can be optimized in future enhancements for other domains
                        label="Select Performance Audit Sector",
                        value="All",
                        interactive=True,
                        visible=True
                      )

                      # Left-Bottom: AUDIT ACTION BUTTONS (run audit, download report)
                      with gr.Row():
                        run_audit_btn = gr.Button("🏁 Run Performance Audit", variant="primary", scale=1)
                        audit_download_btn = gr.DownloadButton("📄 Download Performance Audit Report (PDF)", variant="primary", scale=1)
                        audit_download_file = gr.File(label="📄 Download Performance Audit Report (PDF)", visible=False, interactive=False, container=True)

                  # RIGHT COLUMN: VISUALIZATION PERFORMANCE AUDIT METRICS
                  with gr.Column(scale=6):
                        gr.Markdown("# ⚖️ Performance Audit")

                        # HEADING
                        with gr.Column(scale=6):
                            gr.Markdown("### ⚙️ MONITORING & PERFORMANCE DASHBOARD")
                            gr.Markdown("### 🛠️ Industry Ground Truth Evaluation")
                            gr.Markdown("This dashboard translates raw AI-Judge scores into a Performance Audit status.")
                            gr.Markdown("--------------------------------------------------------------------------------")
                            gr.Markdown("### 📊 Live Performance")
                            gr.Markdown("--------------------------------------------------------------------------------")

                        # RIGHT COLUMN: VISUALIZATION PERFORMANCE AUDIT METRICS
                        with gr.Column(scale=6):

                          # Right-Top: LIVE PERFORMANCE
                          with gr.Row():
                            latency_stat = gr.Markdown("**Avg Latency:** -- | **Speed:** --")
                            density_perc = gr.Label(label="Context Density")
                          with gr.Row():
                            audit_accuracy_gauge = gr.Label(label="RAG Triad & Context Metrics")
                            accuracy_gauge = gr.Label(label="Context Density Score")
                            bottleneck_plot = gr.Image(label="Latency vs. Token Speed")

                          # Right-Middle-Top: AUDIT TABLE
                          with gr.Row():
                            audit_table = gr.Dataframe(
                              headers=["Metric", "Current Audit", "Industry Benchmark", "Status"],
                              value=[]
                          )


          # Bottom: GLOBAL STATUS BAR (Visible across all tabs)
          op_status_bar = gr.Markdown(
                value="**Status:** Ready | **Documents:** 0 | **Chunks:** 0 | **Cache Hits:** 0/0",
                elem_id="op_status_bar"
          )

# ----------------------------------------------- COMPONENTS WIRING (Defined with chat_interface) ------------------------------------------------------------------------------------- #

      # Chat Event handlers
      def update_status_bar():
            """Update the status bar with current statistics."""
            if doc_store.is_ready:
                stats = doc_store.processing_stats
                cache_rate = 0
                if hasattr(doc_store.retriever, 'total_queries') and doc_store.retriever.total_queries > 0:
                    cache_rate = (doc_store.retriever.cache_hits / doc_store.retriever.total_queries) * 100

                return f"**Status:** ✅ Ready | **Documents:** {stats.get('documents_found', 0)} | **Chunks:** {stats.get('total_chunks', 0)} | **Cache Rate:** {cache_rate:.0f}%"
            return "**Status:** Ready | **Documents:** 0 | **Chunks:** 0 | **Cache Hits:** 0/0"



      def clear_all():
          """Clear everything and reset the interface."""
          global doc_store, audit_logs
          doc_store = EnhancedDocumentStore()
          audit_logs = []

          # Return 14 values, one per component in the clear_all_btn outputs list
          return (
              [],                                 # 1. chatbot
              None,                               # 2. file_upload
              "",                                 # 3. status_output
              None,                               # 4. doc_viewer (now gr.Image, so return None)
              "",                                 # 5. filename_debug_output
              "",                                 # 6. structure_output_code
              "",                                 # 7. structure_output_textbox
              gr.update(choices=[], value=None),  # 8. doc_type_filter
              gr.update(choices=[], value=None),  # 9. view_selector
              pd.DataFrame(),                     # 10. audit_table
              None,                               # 11. audit_download_file
              "🔄 System Reset",                  # 12. op_status_bar
              {"current_page": 0, "images": []},  # 13. viewer_state (reset state)
              "**Page 0 of 0**"                   # 14. page_indicator (reset Markdown)
          )


      def process_pdf_with_status(file_list):
            """Processes uploaded file and ensures UI doesn't hang on error."""
            try:
                # Calls the existing handler from Section 11
                status, structure_json_string, structure_display, doc_type_filter, view_selector, filename_summary = process_pdf_handler(file_list)

                # UI global status bar
                status_bar_text = update_status_bar()

                return status, structure_json_string, structure_display, view_selector, doc_type_filter, status_bar_text, filename_summary

            except Exception as e:
                # Debugging print to see exactly what happened in Colab logs
                print(f"Error in wrapper: {str(e)}")
                # Return the same seven values, in the same order, as the success path
                return (
                    f"❌ System Error: {str(e)}",   # status
                    "[]",                           # structure_json_string
                    "⚠️ Error",                     # structure_display
                    gr.update(choices=[]),          # view_selector
                    gr.update(choices=["All"]),     # doc_type_filter
                    "Error",                        # status_bar_text
                    "",                             # filename_summary
                )


      # UI Buttons: SUMMARY, FIND AMOUNTS,CLEAR CHAT
      # Define Example question handlers
      def ask_summary(history, doc_type_filter, auto_route, audit_num_chunks):
          """Specific wrapper for the Summary button with deep retrieval."""

          # 1. Cleaner, Goal-Oriented Prompt
          msg = (
              "Provide a high-level executive summary of the document. "
              "Highlight the primary purpose, key stakeholders, and critical deadlines. "
              "Do not repeat these instructions in your response."
          )

          if history is None: history = []
          final_history = history

          # 2. FORCE HIGHER K-VALUE: Summaries need more context than a single question.
          # We use max(10, audit_num_chunks) to ensure it's at least 10 chunks.
          summary_k = max(10, int(audit_num_chunks) if audit_num_chunks else 10)

          for updated_history, status in chat_with_status(
              msg, history, doc_type_filter, auto_route, summary_k
          ):
              final_history = updated_history

          return final_history


      def ask_amounts(history, doc_type_filter, auto_route, audit_num_chunks):
          """Specific wrapper for finding financial data."""

          # 1. Simplified Message (Easier for Gemini to process)
          msg = "Identify and list all numerical dollar amounts, fees, and financial figures found in the text."

          if history is None: history = []
          final_history = history

          # 2. DEEP SEARCH: Financials are often buried in late-page exhibits.
          amounts_k = max(12, int(audit_num_chunks) if audit_num_chunks else 12)

          for updated_history, status in chat_with_status(
              msg, history, doc_type_filter, auto_route, amounts_k
          ):
              final_history = updated_history

          return final_history



      # --- EVENT WIRING ---

      # 🔗 1. LLM Selector
      # Connect the selector to the handle_model_transition (UI Manager) function
      llm_selector.change(
          fn=handle_model_transition,
          inputs=[llm_selector, chatbot, clear_on_switch_checkbox],
          outputs=[chatbot, engine_status]
      )


      # 🔗 2. Processing Events

      # A. File Upload. Ensures the loading state for uploading a file is properly cleared
      #    File Preview (2 Outputs)
      # When files are picked, show the first one in the viewer and names in the debug box
      file_upload.change(
            fn=lambda x: (x[0].name if x else None, f"📑 {len(x)} files selected" if x else "No files"),
            inputs=[file_upload],
            outputs=[doc_viewer, filename_debug_output]
        )

      # B. "View File" dropdown to switch which PDF is showing in the doc_viewer
      view_selector.change(
          fn=load_pdf_into_viewer,
          inputs=[view_selector],
          outputs=[doc_viewer, viewer_state, page_indicator]
      )


      # C. Document Processing (Backend Ingestion)
      ingest_btn.click(
            fn=process_pdf_handler,
            inputs=[file_upload], # Pull from the actual uploaded file
            outputs=[
                status_output,              # Receives status_msg
                structure_output_code,      # Receives structure_json_string
                structure_output_textbox,   # Receives structure_display (Bulleted String)
                doc_type_filter,               # Receives filter update - doc_type_filter (Dropdown - Filter Document Type)
                view_selector,
                op_status_bar]              # Receives status bar update - status_bar_text (String - Global Status Indicator)
        )


      # D. Document Viewer Navigation: Previous Button
      prev_btn.click(
          fn=flip_page,
          inputs=[gr.State("prev"), viewer_state],
          outputs=[doc_viewer, viewer_state, page_indicator]
      )

      # E. Document Viewer Navigation: Next Button
      next_btn.click(
          fn=flip_page,
          inputs=[gr.State("next"), viewer_state],
          outputs=[doc_viewer, viewer_state, page_indicator]
      )


      # 🔗 3. Chat Textbox (Message) Input & Send

      # A. Chat Functionality (The "Conversation" bridge)
      # Use .then() to clear the input box after sending
      msg_input.submit(
            fn=chat_with_status,
            inputs=[msg_input, chatbot, doc_type_filter, auto_route, audit_num_chunks],
            outputs=[chatbot, op_status_bar]
      ).then(lambda: "", None, [msg_input]) # Only clear the input


      # B. Same chat flow, triggered by the Send button
      # Use .then() to clear the input box after sending
      send_btn.click(
            fn=chat_with_status,
            inputs=[msg_input, chatbot, doc_type_filter, auto_route, audit_num_chunks],
            outputs=[chatbot, op_status_bar]
      ).then(lambda: "", None, [msg_input])



      # 🔗 4. Download Chat History & Performance Audit Events

      # A. When the button is clicked:
      # -----Take 'chatbot' as input, run 'export_chat_history_to_pdf',
      # -----and send the result back to 'chat_download_btn', which serves the file
      chat_download_btn.click(
          fn=export_chat_history_to_pdf,
          inputs=[chatbot],
          outputs=[chat_download_btn]
      )

      # B. Run Performance Audit
      run_audit_btn.click(
          fn=run_performance_audit,
          inputs=[sector_dropdown, audit_num_chunks],
          outputs=[
              latency_stat,          # 1. (Markdown) -> f"**Avg Latency:**..."
              audit_accuracy_gauge,  # 2. (Label/Plot) -> {"Faithfulness": ...}
              accuracy_gauge,        # 3. (Label) -> f"{success_rate}%"
              density_perc,          # 4. (Label) -> f"{context_density}%"
              bottleneck_plot,       # 5. (Image) -> "bottlenecks.png"
              audit_table            # 6. (Dataframe) -> audit_table_data
          ]
      )


      # C. Export & Download Performance Audit
      audit_download_btn.click(
          fn=handle_audit_export_ui,
          inputs=[audit_table],
          outputs=[
              audit_download_btn,  # Receives the file update
              op_status_bar         # Receives the status message (the "✅ Audit report..." text)
          ]
      )


      # 🔗 5. UI Utility Events

      # A. Utility/Reset Buttons: Clear ALL (Start new)
      # Clear the entire platform
      clear_all_btn.click(
            fn=clear_all,
            inputs=[],
            outputs=[
                chatbot,                  # 1
                file_upload,               # 2
                status_output,            # 3
                doc_viewer,               # 4
                filename_debug_output,     # 5
                structure_output_code,    # 6
                structure_output_textbox, # 7
                doc_type_filter,           # 8
                view_selector,            # 9
                audit_table,              # 10
                audit_download_file,       # 11
                op_status_bar,            # 12
                viewer_state,             # 13
                page_indicator            # 14
            ]
      )

      # B. Utility/Reset Buttons: Clear Chat History
      clear_chat_btn.click(
          fn=lambda: (
              [{"role": "assistant", "content": "**🤖 AI Document Assistant:** 👋 Chat cleared. How can I help you with your documents? 🚀"}],
              gr.update(visible=False)
          ),
          inputs=None,
          outputs=[chatbot, chat_download_file]
      )

      # C. Summary Button Wiring
      example_btn1.click(
          fn=ask_summary,
          inputs=[chatbot, doc_type_filter, auto_route, audit_num_chunks],
          outputs=[chatbot]
      )

      # D. Find Amounts Button Wiring
      example_btn2.click(
          fn=ask_amounts,
          inputs=[chatbot, doc_type_filter, auto_route, audit_num_chunks],
          outputs=[chatbot]
      )

      # 🔗 6. ADDED - Initialize JavaScript for Auto-Scroll
      # Register the load event before returning; code placed after `return` never runs
      demo.load(js=scroll_script)

      return demo

print("✅ SECTION 12B. GRADIO INTERFACE, CHAT HANDLERS, & WIRING LOGIC Complete.")



✅ SECTION 12B. GRADIO INTERFACE, CHAT HANDLERS, & WIRING LOGIC Complete.


# **SECTION 13. APPLICATION LAUNCHER**

This section is the **Runtime Orchestrator**. Its primary responsibility is to manage the transition between development and production environments.

<br>

**Logic and Flow Analysis**

1. **Port Management:** `gr.close_all()` is a critical "defensive" coding practice for notebook environments (like Google Colab). It prevents the common "Address already in use" error by forcefully closing previous sessions before starting a new one.

2. **Theme Merging:** It combines the `gr.themes.Soft()` base (which provides modern typography and spacing) with your `custom_css` (the "Obsidian" black-and-teal skin). This ensures the UI is both structurally sound and aesthetically branded.

3. **The Connectivity Bridge:**
  - `debug=True`: This allows you to see real-time Python errors in the notebook console while the app is running, which is essential for troubleshooting RAG retrieval issues.

  - `share=True`: This is the most powerful feature of the launcher; it creates a temporary **Gradio Live Link** (e.g., `https://xyz123.gradio.live`). This allows stakeholders to test the platform on their own devices without needing to install Python or the local models.

  <br>
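
The port conflict described under **Port Management** can be reproduced without Gradio. The sketch below is an illustration only: it uses plain standard-library sockets rather than a Gradio server, and the helper name `try_bind` is hypothetical. It shows that a second listener on the same port is refused while the first one is still open, and that the port becomes usable again once the stale listener is closed, which is the situation `gr.close_all()` is meant to clean up.

```python
import socket
from typing import Optional


def try_bind(port: int) -> Optional[socket.socket]:
    """Return a listening socket on 127.0.0.1:port, or None if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen(1)
        return sock
    except OSError:
        # Typically "Address already in use"
        sock.close()
        return None


# The first "server" takes a free port chosen by the OS (port 0 means "pick one").
first = try_bind(0)
port = first.getsockname()[1]

# A second server on the same port is refused while the first is still open.
second = try_bind(port)
blocked_while_open = second is None

# Closing the stale server releases the port, so a relaunch can bind again.
first.close()
third = try_bind(port)
free_after_close = third is not None
if third is not None:
    third.close()

print(f"Blocked while open: {blocked_while_open} | Free after close: {free_after_close}")
```

In a notebook, re-running the launcher cell without closing the previous server leaves the old listener alive in the same way `first` is alive above.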

In [None]:
# ------- SECTION 13. APPLICATION LAUNCHER -------

# 1. Cleanup: Close any existing Gradio servers to free up ports
gr.close_all()

print("🚀 Initializing Platform Components...")
print("📂 Loading Vector Store...")
print("🧠 Connecting LLM Engine...")


# 2. Build the Interface
# Calls the create_interface() function defined in Section 12
demo = create_interface()

if __name__ == "__main__":
    print("🚀 AI-Powered Document Intelligence Automation Platform Launching...")

    demo.launch(
        theme=gr.themes.Soft(),
        css=custom_css,
        debug=True,
        share=True,
    )