# 📄 AskYourDocuments: Chat with Your Documents using AI

Welcome to **AskYourDocuments**! This notebook creates an intelligent document analysis system that lets you upload various document types (PDF, DOCX, XLSX, PPTX, images) and ask questions about their content using powerful AI models.

## 🔍 What This Notebook Does

1. Sets up a complete document processing pipeline
2. Extracts text, tables, and image content from your documents
3. Creates a searchable knowledge base from your documents
4. Deploys a web interface to upload documents and ask questions
5. Processes your queries using AI and retrieves targeted information
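
Steps 3 and 5 come down to embedding text chunks and ranking them against the embedded query. The sketch below illustrates that idea with cosine similarity, the metric this notebook configures for its ChromaDB collections. The three-dimensional vectors, the chunk texts, and the `top_k_chunks` helper are made up for illustration; the notebook itself obtains embeddings from `text-embedding-3-large` and stores them in ChromaDB.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return the cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: List[float],
    chunks: List[Tuple[str, List[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Rank (text, vector) pairs by similarity to the query, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings": hand-written values, not output from a real model.
toy_chunks = [
    ("Quarterly revenue table", [0.9, 0.1, 0.0]),
    ("Office relocation notice", [0.0, 0.2, 0.9]),
    ("Revenue forecast summary", [0.8, 0.3, 0.1]),
]
toy_query = [1.0, 0.2, 0.0]  # stands in for the embedded user question

for text, score in top_k_chunks(toy_query, toy_chunks):
    print(f"{score:.3f}  {text}")
```

In this toy data the two revenue-related chunks rank above the unrelated notice, which is the behavior the retrieval step relies on before any reranking.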

## 📋 How To Use This Notebook (in Google Colab)

1. **Run each cell in sequence** from top to bottom
2. **Set up your API keys** in the second cell:
   - Azure API key - for embeddings and LLM (required)
   - Hugging Face API token - for vision features (optional)
   - Ngrok authtoken - for better web access (recommended)
3. **Upload HTML and JS files** when prompted (or use the auto-generated basic templates)
4. The **final cell launches the web interface** via ngrok link

## 🔑 Required API Keys

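- **Azure AI Inference API Key**: Required for document processing and query answering. This notebook calls the `https://models.inference.ai.azure.com` endpoint with a GitHub token, which you can create in your [GitHub token settings](https://github.com/settings/tokens).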
- **Hugging Face API Token**: Optional for enhanced vision features. Get a token from your [Hugging Face account settings](https://huggingface.co/settings/tokens).
- **Ngrok Authtoken**: Recommended for stable public URLs. Sign up at [ngrok.com](https://ngrok.com/) and get your token.
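
Because the configuration cell ships with the placeholder string `'KEY'`, it can help to check that the required values were actually replaced before running the backend. The following is a minimal sketch of such a check; `find_unset_keys` and `example_settings` are hypothetical names that are not part of the notebook, and the token value shown is a dummy.

```python
from typing import Dict, List

PLACEHOLDER = "KEY"  # the placeholder string used in the configuration cell


def find_unset_keys(settings: Dict[str, str], required: List[str]) -> List[str]:
    """Return the names of required settings that are empty or still the placeholder."""
    unset = []
    for name in required:
        value = (settings.get(name) or "").strip()
        if not value or value == PLACEHOLDER:
            unset.append(name)
    return unset


example_settings = {
    "AZURE_EMBEDDING_API_KEY_SECRET": "KEY",  # not filled in yet
    "HF_API_TOKEN_SECRET": "hf_example_token",  # dummy value for illustration
}
problems = find_unset_keys(example_settings, ["AZURE_EMBEDDING_API_KEY_SECRET"])
if problems:
    print("Set these before continuing:", ", ".join(problems))
```

Only the Azure key is listed as required here, matching the list above; the Hugging Face and ngrok tokens could be reported as warnings instead.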

## 🚀 Getting Started

Run the first cell below to install all required dependencies, then proceed through each cell sequentially. Make sure to configure your API keys in the second cell!
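
If a later cell fails with an import error, a quick way to see which packages did not install is to probe for their modules. This is a small sketch using only the standard library; `find_missing_modules` is a hypothetical helper, and the module list is a partial selection chosen for illustration rather than the notebook's full dependency set.

```python
import importlib.util
from typing import List


def find_missing_modules(module_names: List[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names can differ from pip package names (for example, PyMuPDF imports as `fitz`).
modules_to_check = ["flask", "chromadb", "fitz", "pdfplumber", "pytesseract", "tiktoken"]
missing = find_missing_modules(modules_to_check)
print("Missing modules:", missing or "none")
```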

In [1]:
# --- 0. Installations ---
!pip install --upgrade pymupdf Flask Flask-CORS pyngrok loguru -q
!pip install langchain tiktoken sentence-transformers accelerate chromadb torch numpy pdfplumber pdf2image pytesseract opencv-python-headless Pillow huggingface_hub python-dotenv -q
!pip install azure-ai-inference openai langchain-openai -q
!pip install python-docx openpyxl python-pptx -q
!sudo apt-get update -qq
!sudo apt-get install -y tesseract-ocr tesseract-ocr-eng poppler-utils libreoffice unoconv -qq
print("--- Dependencies installed/updated ---")


[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.1/24.1 MB[0m [31m40.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.6/61.6 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.8/42.8 kB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.3/67.3 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.2/48.2 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m19.3/19.3 MB[0m [31m83.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m94.9/94.9 kB[0m [31m5.5 MB/s[0m eta [36m0

In [None]:
AZURE_EMBEDDING_API_KEY_SECRET = 'KEY' # add your own key
HF_API_TOKEN_SECRET = 'KEY' # add your own key
!ngrok config add-authtoken KEY # add your own key

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [4]:
# Backend

# --- 1. Imports ---
import base64
import io
import json
import logging
import os
import re
import shutil
import tempfile
import time
import uuid
from typing import Any, Dict, List, Optional

import chromadb
import fitz  # PyMuPDF
import numpy as np
import pdfplumber
import pytesseract
import torch
from flask import Flask, jsonify, render_template, request, send_from_directory
from flask_cors import CORS
from google.colab import files as colab_files # Specific to Colab, handle if not in Colab
from google.colab import userdata # Specific to Colab
from huggingface_hub import InferenceClient
from huggingface_hub.utils import (GatedRepoError, HFValidationError,
                                   RepositoryNotFoundError)
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from loguru import logger
from pdf2image import convert_from_path
from PIL import Image, UnidentifiedImageError

# New Azure Imports
from azure.ai.inference import EmbeddingsClient
from azure.core.credentials import AzureKeyCredential
# New OpenAI Imports for LLM
from openai import OpenAI
from langchain_openai import ChatOpenAI # For LangChain integration with OpenAI

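import cv2
import tiktoken  # used by initialize_rag_models_from_kb()
from sentence_transformers import CrossEncoder  # reranker model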

from docx import Document as DocxDocument
from openpyxl import load_workbook
# Note: python-pptx doesn't have a direct save-to-PDF method, unoconv is best here.
# from pptx import Presentation as PptxPresentation
# from pptx.enum.shapes import MSO_SHAPE_TYPE
# from pptx.util import Inches

# --- 2. Loguru Configuration ---
logger.remove()
logger.add(
    lambda msg: print(msg, end=""),
    level="INFO",
    format="<green>{time:YYYY-MM-DD HH:mm:ss.SSS}</green> | <level>{level: <8}</level> | <cyan>{name}</cyan>:<cyan>{function}</cyan>:<cyan>{line}</cyan> - <level>{message}</level>",
)
LOG_FILE_PATH = "ask_your_doc_flask_merged_app.log"
logger.add(LOG_FILE_PATH, rotation="10 MB", level="DEBUG")
logger.info(f"AskYourDoc Merged Flask App Starting. Log file: {LOG_FILE_PATH}")

# --- 3. Configuration (Combined and Prioritized) ---
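# Keep a token set in the configuration cell; otherwise fall back to Colab secrets.
HF_API_TOKEN_SECRET = globals().get("HF_API_TOKEN_SECRET")
if HF_API_TOKEN_SECRET:
    logger.info("HF_TOKEN loaded.")
else:
    try:
        HF_API_TOKEN_SECRET = userdata.get("HF_TOKEN")
        logger.info("HF_TOKEN loaded from Colab secrets.")
    except userdata.SecretNotFoundError:
        logger.warning("HF_TOKEN not in Colab secrets.")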

# AZURE EMBEDDING CONFIGURATION
AZURE_EMBEDDING_ENDPOINT = "https://models.inference.ai.azure.com"
AZURE_EMBEDDING_MODEL_NAME = "text-embedding-3-large"
logger.info(f"Azure Embedding Model: {AZURE_EMBEDDING_MODEL_NAME} configured.")
logger.warning("Using a GitHub token for Azure AI Inference API Key. Please verify this is the correct type of key for your Azure endpoint.")


# AZURE LLM CONFIGURATION (using the same endpoint and key as embedding)
AZURE_LLM_ENDPOINT = AZURE_EMBEDDING_ENDPOINT # Reusing the endpoint
AZURE_LLM_MODEL_NAME = "gpt-4o"
AZURE_LLM_API_KEY_SECRET = AZURE_EMBEDDING_API_KEY_SECRET # Reusing the token for both
logger.info(f"Azure LLM Model: {AZURE_LLM_MODEL_NAME} configured.")


# From Parsing Script
VISION_MODEL_ID_CONFIG = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
IMAGE_OCR_PROMPT_CONFIG = "You are an advanced OCR model. Transcribe all text from this image accurately. If it's handwritten, do your best to read it. If there is no text, output 'No text found'."
IMAGE_DESCRIPTION_PROMPT_CONFIG = "You are an expert image analyst. Describe this image in detail, focusing on elements relevant to understanding a document (e.g., charts, diagrams, important visual features). If the image primarily contains text, transcribe the text instead of describing it as a visual."


# From KB/Retrieval Script
EMBEDDING_MODEL_NAME = AZURE_EMBEDDING_MODEL_NAME # Reflect the new model for logging
EMBEDDING_MODEL_MAX_TOKENS = 8192 # Max input tokens for text-embedding-3-large
RERANKER_MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-12-v2"
RERANKER_MAX_TOKENS = 512 # Max sequence length for reranker.

# Adjusted for larger context window of text-embedding-3-large (8192 tokens)
PRIMARY_CHUNK_SIZE_TOKENS = 1500
PRIMARY_CHUNK_OVERLAP_TOKENS = 150
MIN_CHUNK_SIZE_TOKENS_NO_SPLIT = 75
EFFECTIVE_MAX_CHUNK_TOKENS = EMBEDDING_MODEL_MAX_TOKENS - 5
ATOMIC_BLOCK_NO_SPLIT_THRESHOLD = EFFECTIVE_MAX_CHUNK_TOKENS

SEPARATORS_HIERARCHICAL = ["\n\n\n", "\n\n", "\n", ". ", "? ", "! ", "; ", ", ", " ", ""]

# From Generation Script
GENERATION_MODEL_NAME = AZURE_LLM_MODEL_NAME # Reflect the new LLM model for logging

CHROMA_DB_PATH = "./ask_your_doc_chroma_db_merged_flask"
CHROMA_COLLECTION_NAME_PREFIX = "askdocmerged"
MAX_TOKENS_FOR_DIRECT_LLM_SUMMARIZATION = 80000

# --- 4. Global Application State (Unified) ---
APP_STATE = {
    "hf_api_token": HF_API_TOKEN_SECRET,
    "hf_vision_client": None,
    "vision_client_initialized": False,
    "langchain_llm": None,
    "llm_initialized": False,
    "azure_llm_client": None,
    "tokenizer": None,
    "embedding_model": None,
    "azure_embedding_client": None,
    "reranker_model": None,
    "rag_models_loaded": False,
    "chroma_client": None,
    "db_initialized": False,
    "chroma_collection_cache": {},
    "original_doc_filename_for_chunking_context": None,
}
# Web-specific state
APP_WEB_STATE = {
    "staged_files": {},
    "processed_collections": {},
}

# --- 5. AI Persona and System Instructions (No Inline Citations) ---
AI_PERSONA_NAME = "DocuMind AI"
AI_ROLE_DESCRIPTION = (
    "a meticulous and highly accurate AI assistant, designed to interact with "
    "and answer questions about documents. You are an expert in information "
    "retrieval and text comprehension."
)
AI_CORE_DIRECTIVE = (
    "Your responses must be grounded *exclusively* in the provided context data. "
    "You do not have access to external knowledge or the internet. "
    "You must not invent, assume, or infer information beyond what is explicitly stated in the context. "
    "Accuracy and adherence to the provided context are paramount. "
    "DO NOT invent data columns, values, or statuses that are not explicitly written in the PROVIDED CONTEXT."
)
AI_ANSWERING_STYLE_REFINED = (
    "Provide answers that are factual, precise, and directly supported by the text in "
    "the 'PROVIDED CONTEXT'. Use clear and concise language. If the query asks for "
    "specific data points (e.g., numbers, dates, names), ensure they are extracted "
    "accurately. If the query is more general, synthesize the relevant information "
    "from the context into a coherent response. Avoid jargon unless it's part of the "
    "document's language and relevant to the answer. You are polite, professional, and helpful."
)
AI_CONTEXTUAL_PRIORITIZATION_POLICY = (
    "When multiple context sources are provided, synthesize information if they are "
    "complementary. If they are contradictory, point out the discrepancy if relevant "
    "to the query, or prioritize the most specific or seemingly authoritative source if "
    "a single answer is required and discernable. If context chunks seem to be out of "
    "order, try to make sense of them logically if possible, but do not assume continuity "
    "if it's not evident."
)
AI_CITATION_POLICY_TEXT = (
    "DO NOT include bracketed citations like [Doc: ..., P:..., CkID:...] in your response. "
    "You may refer to 'the document' or specific document names if it's natural for the answer, "
    "but avoid detailed source tagging in the answer text."
)
AI_NO_ANSWER_POLICY = (
    "If the context is insufficient to answer the query, you must state: 'Based on the provided document context, I could not find the information to answer this question.'"
)
AI_SUMMARIZATION_POLICY = (
    "If the query asks for a summary, provide a concise yet comprehensive overview of the main points "
    "from the 'PROVIDED CONTEXT'. The summary should be neutral, "
    "objective, and reflect the key information accurately. Follow the citation policy "
    "(i.e., no bracketed citations). Focus on the core aspects and main themes."
)

# --- 6. Initialization Functions (Combined and Refined) ---
def initialize_hf_client_from_parser():
    global APP_STATE
    if APP_STATE["vision_client_initialized"]:
        return True
    if not APP_STATE["hf_api_token"]:
        logger.warning("HF_TOKEN not found. Vision features disabled.")
        return False
    if VISION_MODEL_ID_CONFIG == "YOUR_NOVITA_SUPPORTED_VISION_MODEL_ID_PLACEHOLDER":
        logger.error(f"VISION_MODEL_ID_CONFIG is placeholder. Vision disabled.")
        return False
    try:
        APP_STATE["hf_vision_client"] = InferenceClient(
            provider="novita", api_key=HF_API_TOKEN_SECRET
        )
        logger.success(f"HF client (Novita) for {VISION_MODEL_ID_CONFIG} initialized.")
        APP_STATE["vision_client_initialized"] = True
    except Exception as e:
        # loguru has no exc_info flag; opt(exception=True) attaches the traceback.
        logger.opt(exception=True).error(f"HF client init error: {e}")
        APP_STATE["hf_vision_client"] = None
        APP_STATE["vision_client_initialized"] = False
    return APP_STATE["vision_client_initialized"]

def initialize_llm_model():
    global APP_STATE
    if APP_STATE["llm_initialized"]:
        return True
    if not AZURE_LLM_API_KEY_SECRET:
        logger.critical("AZURE_LLM_API_KEY_SECRET missing!")
        return False
    try:
        APP_STATE["azure_llm_client"] = OpenAI(
            base_url=AZURE_LLM_ENDPOINT,
            api_key=AZURE_LLM_API_KEY_SECRET,
        )
        logger.success(f"OpenAI client for '{AZURE_LLM_MODEL_NAME}' initialized.")

        APP_STATE["langchain_llm"] = ChatOpenAI(
            model=AZURE_LLM_MODEL_NAME,
            openai_api_base=AZURE_LLM_ENDPOINT,
            openai_api_key=AZURE_LLM_API_KEY_SECRET,
            temperature=0.25,
            max_tokens=1000,
        )
        logger.success(
            f"LangChain LLM wrapper for '{AZURE_LLM_MODEL_NAME}' initialized for potential summarization."
        )
        APP_STATE["llm_initialized"] = True
    except Exception as e:
        logger.opt(exception=True).critical(f"OpenAI LLM init FAILED: {e}")
        APP_STATE["llm_initialized"] = False
    return APP_STATE["llm_initialized"]

def initialize_rag_models_from_kb():
    global APP_STATE
    if APP_STATE["rag_models_loaded"]:
        return True
    try:
        APP_STATE["tokenizer"] = tiktoken.get_encoding("cl100k_base")
        logger.success("Tiktoken 'cl100k_base' OK.")
    except Exception as e:
        logger.warning(f"Tiktoken fail: {e}. Using len().")
        APP_STATE["tokenizer"] = None

    device = "cuda" if torch.cuda.is_available() else "cpu"
    models_ok = True

    # Initialize Azure Embeddings Client
    if not APP_STATE.get("azure_embedding_client"):
        try:
            logger.info(f"Loading Azure Embeddings client: {AZURE_EMBEDDING_MODEL_NAME}...")
            APP_STATE["azure_embedding_client"] = EmbeddingsClient(
                endpoint=AZURE_EMBEDDING_ENDPOINT,
                credential=AzureKeyCredential(AZURE_EMBEDDING_API_KEY_SECRET)
            )
            APP_STATE["embedding_model"] = APP_STATE["azure_embedding_client"]
            logger.success(f"Azure Embeddings client for '{AZURE_EMBEDDING_MODEL_NAME}' initialized.")
        except Exception as e:
            logger.opt(exception=True).critical(f"Azure Embeddings client load FAILED: {e}")
            models_ok = False

    if not APP_STATE.get("reranker_model"):
        try:
            logger.info(f"Loading reranker: {RERANKER_MODEL_NAME} ({device})...")
            APP_STATE["reranker_model"] = CrossEncoder(
                RERANKER_MODEL_NAME,
                device=device,
                trust_remote_code=True,
                max_length=RERANKER_MAX_TOKENS,
            )
            logger.success("Reranker OK.")
        except Exception as e:
            logger.warning(f"Reranker load FAIL: {e}. Reranking off.")
            APP_STATE["reranker_model"] = None
    APP_STATE["rag_models_loaded"] = models_ok and APP_STATE["embedding_model"] is not None
    if not APP_STATE["rag_models_loaded"]:
        logger.error("Essential RAG models FAILED to load.")
    return APP_STATE["rag_models_loaded"]

def initialize_chromadb_from_kb():
    global APP_STATE
    if APP_STATE["db_initialized"]:
        return True
    logger.info(f"Initializing ChromaDB: path='{CHROMA_DB_PATH}'")
    try:
        APP_STATE["chroma_client"] = chromadb.PersistentClient(path=CHROMA_DB_PATH)
        logger.success(f"ChromaDB persistent OK: '{CHROMA_DB_PATH}'.")
        APP_STATE["db_initialized"] = True
    except Exception as e:
        logger.error(f"Chroma persistent FAIL: {e}. Trying in-memory.")
        try:
            APP_STATE["chroma_client"] = chromadb.Client()
            logger.success("Chroma in-memory OK.")
            APP_STATE["db_initialized"] = True
        except Exception as e_mem:
            logger.critical(f"Chroma in-memory FAIL: {e_mem}")
            APP_STATE["db_initialized"] = False
            return False
    return True

def get_or_create_collection_cached(
    collection_name_str: str,
) -> Optional[chromadb.api.models.Collection.Collection]:
    if not APP_STATE.get("db_initialized") or not APP_STATE.get("chroma_client"):
        logger.error("ChromaDB not init.")
        return None
    if collection_name_str in APP_STATE["chroma_collection_cache"]:
        logger.debug(f"Using cached collection: {collection_name_str}")
        return APP_STATE["chroma_collection_cache"][collection_name_str]
    try:
        collection = APP_STATE["chroma_client"].get_or_create_collection(
            name=collection_name_str, metadata={"hnsw:space": "cosine"}
        )
        APP_STATE["chroma_collection_cache"][collection_name_str] = collection
        logger.info(
            f"Collection '{collection_name_str}' accessed/created. Items: {collection.count()}"
        )
        return collection
    except Exception as e:
        logger.error(
            f"FAIL get/create Chroma collection '{collection_name_str}': {e}"
        )
        return None

# --- 7. Helper Functions ---
def count_tokens(text: str) -> int:
    if APP_STATE.get("tokenizer"):
        try:
            return len(APP_STATE["tokenizer"].encode(text, disallowed_special=()))
        except Exception as e:
            logger.warning(f"Tiktoken encode error for text '{text[:30]}...': {e}. Using char count.")
            return len(text)
    return len(text)

def image_to_base64_data_uri(pil_image: Image.Image) -> str:
    buffered = io.BytesIO()
    img_format = pil_image.format if pil_image.format else 'PNG'
    try:
        pil_image.save(buffered, format=img_format)
    except Exception as e:
        logger.warning(f"Warning: Could not save image in format {img_format}, falling back to PNG. Error: {e}")
        img_format = 'PNG'
        pil_image.save(buffered, format=img_format)
    encoded_bytes = base64.b64encode(buffered.getvalue())
    encoded_string = encoded_bytes.decode('utf-8')
    mime_type = Image.MIME.get(img_format.upper(), 'image/png')
    return f"data:{mime_type};base64,{encoded_string}"

def _process_image_with_vision_model(pil_image: Image.Image, prompt_text: str) -> str:
    if not APP_STATE.get("vision_client_initialized") or not APP_STATE.get(
        "hf_vision_client"
    ):
        logger.error("Vision model client not initialized. Cannot process image.")
        return "Error: Vision model client not initialized."
    if VISION_MODEL_ID_CONFIG == "YOUR_NOVITA_SUPPORTED_VISION_MODEL_ID_PLACEHOLDER":
        logger.error(f"Vision model ID is a placeholder: {VISION_MODEL_ID_CONFIG}. Cannot process image.")
        return "Error: Vision model ID is a placeholder."

    base64_image_url = image_to_base64_data_uri(pil_image)
    try:
        completion = APP_STATE["hf_vision_client"].chat.completions.create(
            model=VISION_MODEL_ID_CONFIG,
            messages=[
                {"role": "user", "content": [{"type": "text", "text": prompt_text}, {"type": "image_url", "image_url": {"url": base64_image_url}}]}
            ],
            max_tokens=1500
        )
        if completion.choices and completion.choices[0].message and completion.choices[0].message.content:
            return completion.choices[0].message.content.strip()
        else:
            logger.error(f"Vision model ({VISION_MODEL_ID_CONFIG}) - No content in response. Prompt: {prompt_text[:50]}...")
            return f"Error: No content from vision model {VISION_MODEL_ID_CONFIG}."
    except Exception as e:
        logger.opt(exception=True).error(f"An unexpected error occurred with the vision model ({VISION_MODEL_ID_CONFIG}): {e}")
        return f"Error processing image with vision model {VISION_MODEL_ID_CONFIG}: {str(e)}"

def sanitize_chromadb_collection_name(name: str) -> str:
    name = str(name)
    name = re.sub(r'[ \t\n\r\f\v.,;:!?"\'`()\[\]{}<>|/\\]+', '_', name)
    name = re.sub(r'[^a-zA-Z0-9_-]', '', name)
    # Strip leading and trailing underscores/hyphens in one pass.
    name = re.sub(r'^[_-]+|[_-]+$', '', name)
    if name and not name[0].isalnum():
        name = 'c_' + name
    if name and not name[-1].isalnum():
        name = name + '_c'
    if len(name) < 3:
        name = name + '___'
        name = name[:3]
    if len(name) > 63:
        name = name[:63]
    if name and not name[0].isalnum(): name = 'c' + name[1:]
    if name and len(name) > 1 and not name[-1].isalnum(): name = name[:-1] + 'c'
    elif name and len(name) == 1 and not name[0].isalnum(): name = 'cc'
    if not name or len(name) < 3:
        name = f"coll_{uuid.uuid4().hex[:8]}"
    return name.lower()


# --- 8. PDF Parsing Logic (The CORE parser for all converted PDFs) ---
def get_font_properties(span_dict: Dict) -> Dict:
    return {
        "size": span_dict.get("size", 0.0),
        "font": span_dict.get("font", "UnknownFont"),
        "color": span_dict.get("color", 0),
        "flags": span_dict.get("flags", 0),
        "origin_x": span_dict.get("bbox", [0, 0, 0, 0])[0],
        "origin_y": span_dict.get("bbox", [0, 0, 0, 0])[1],
    }

def is_likely_header_or_footer(
    block_text: str,
    page_rect: fitz.Rect,
    block_bbox: fitz.Rect,
    page_number: int,
    num_pages: int,
) -> bool:
    block_content = block_text.strip()
    if not block_content or len(block_content) > 150:
        return False
    page_height = page_rect.height
    is_top_zone = block_bbox.y1 < page_rect.y0 + 0.12 * page_height
    is_bottom_zone = block_bbox.y0 > page_rect.y1 - 0.12 * page_height
    if not (is_top_zone or is_bottom_zone):
        return False
    page_num_str = str(page_number)
    num_pages_str = str(num_pages)
    patterns = [
        r"^(Page\s*)?" + re.escape(page_num_str) + r"(\s*(of|-|/)\s*" + re.escape(num_pages_str) + r")?([\s.]*)$",
        r"^(\[?" + re.escape(page_num_str) + r"\]?)$",
        r"^\s*-\s*\d+\s*-\s*$",
        r"^\s*" + re.escape(page_num_str) + r"\s*$",
    ]
    for pattern in patterns:
        if re.fullmatch(pattern, block_content, re.IGNORECASE):
            return True
    if len(block_content.split()) < 7 and len(block_content) < 70:
        if not re.search(r"[.!?]$", block_content):
            if block_content.isupper() or block_content.istitle():
                return True
            if len(block_content) < 10 and len(set(block_content.replace(" ", ""))) < 4:
                return True
    return False

def parse_pdf_content_worker(pdf_path: str, original_filename: str, original_document_type: str, config_dict: dict) -> list:
    logger.info(f"  Worker: Parsing PDF content from '{os.path.basename(pdf_path)}' (originally {original_document_type}) with config: {config_dict}")
    all_content_blocks = []
    doc_block_counter = 0

    use_pdfplumber_tables = config_dict.get("use_pdfplumber_tables", True)
    use_pymupdf_text = config_dict.get("use_pymupdf_text", True)
    process_scanned_pages = config_dict.get("process_scanned_pages", False)
    use_vision_for_ocr_flag = config_dict.get("use_vision_for_ocr", False)
    process_structured_images = config_dict.get("process_structured_images", False)
    use_vision_for_description_flag = config_dict.get("use_vision_for_description", False)
    scan_detection_char_threshold = config_dict.get("scan_detection_char_threshold", 100)
    dpi_for_conversion = config_dict.get("dpi_for_conversion", 200)

    tables_by_page = {}
    if use_pdfplumber_tables:
        logger.debug("  Worker: Attempting table extraction with PdfPlumber...")
        try:
            with pdfplumber.open(pdf_path) as pdf_pl:
                for p_idx, page_pl in enumerate(pdf_pl.pages):
                    page_num_human = p_idx + 1
                    raw_tables = page_pl.find_tables()
                    extracted_tables_data = []
                    if raw_tables:
                        logger.debug(f"    P{page_num_human}: Found {len(raw_tables)} potential tables with PdfPlumber.")
                        for tbl_idx, raw_table in enumerate(raw_tables):
                            md_table = f"[Table {tbl_idx+1} on Page {page_num_human}]\n"
                            data = raw_table.extract()
                            if data:
                                try:
                                    header = data[0]
                                    if header and all(isinstance(c, (str, type(None))) for c in header):
                                        md_table += "| " + " | ".join(str(c).strip().replace("\n", " ") if c is not None else "" for c in header) + " |\n"
                                        md_table += "| " + " | ".join("---" for _ in header) + " |\n"
                                        for row in data[1:]:
                                            if row and all(isinstance(c, (str, type(None))) for c in row):
                                                md_table += f"| {' | '.join(str(c).strip().replace(chr(10),' ') if c is not None else '' for c in row)} |\n"
                                    else:
                                        for row_idx, row in enumerate(data):
                                            if row and all(isinstance(c, (str, type(None))) for c in row):
                                                # Emit every row with leading and trailing pipes so the markdown table stays well-formed.
                                                md_table += "| " + " | ".join(str(c).strip().replace("\n", " ") if c is not None else "" for c in row) + " |\n"
                                                if row_idx == 0 and len(data) > 1:
                                                    md_table += "| " + " | ".join("---" for _ in row) + " |\n"
                                except TypeError:
                                    md_table += "(Complex table structure, basic markdown conversion failed)\n"
                                    logger.debug(f"    P{page_num_human} T{tbl_idx+1}: TypeError during MD conversion.")
                            else:
                                md_table += "(Table detected, but no data could be extracted or table is empty)\n"
                            extracted_tables_data.append(
                                {"bbox": raw_table.bbox, "markdown_content": md_table, "order": raw_table.bbox[1]}
                            )
                    if extracted_tables_data:
                        tables_by_page[page_num_human] = sorted(
                            extracted_tables_data, key=lambda t: t["order"]
                        )
        except Exception as e:
            logger.opt(exception=True).error(
                f"  Worker: Pdfplumber table extraction FAILED for '{os.path.basename(pdf_path)}': {e}"
            )

    try:
        doc = fitz.open(pdf_path)
    except Exception as e:
        logger.critical(f"  Worker: PyMuPDF open FAILED for '{os.path.basename(pdf_path)}': {e}.")
        return [{"block_id":"error_doc_open", "type":"error", "content":f"PDF open fail: {e}", "page_number":0, "document_type": original_document_type}]

    num_pages = len(doc)
    all_font_sizes_doc = []
    logger.debug(f"  Worker: Document '{os.path.basename(pdf_path)}' has {num_pages} pages. Calculating font statistics...")
    for page_fitz_temp in doc:
        try:
            textpage_temp = page_fitz_temp.get_textpage_ocr(flags=0, full=False)
            blocks_dict_temp = textpage_temp.extractDICT().get("blocks", [])
            for block_dict_temp in blocks_dict_temp:
                if block_dict_temp.get("type") == 0:
                    for line_dict_temp in block_dict_temp.get("lines", []):
                        for span_dict_temp in line_dict_temp.get("spans", []):
                            all_font_sizes_doc.append(span_dict_temp.get("size", 0))
        except Exception as e_fs:
            logger.warning(f"    Font size stats extraction error on page {page_fitz_temp.number + 1}: {e_fs}")

    avg_font_size = np.mean([s for s in all_font_sizes_doc if s > 0]) if any(s > 0 for s in all_font_sizes_doc) else 10.0
    std_font_size = np.std([s for s in all_font_sizes_doc if s > 0]) if any(s > 0 for s in all_font_sizes_doc) and len(set(s for s in all_font_sizes_doc if s > 0)) > 1 else 2.0
    if std_font_size < 1.5:
        std_font_size = 1.5
    logger.info(f"  Doc stats: Pages:{num_pages}, AvgFont:{avg_font_size:.1f}, StdFont:{std_font_size:.1f}")

    for page_idx, page_fitz in enumerate(doc):
        page_num_human = page_idx + 1
        logger.debug(f"  Worker: Processing Page {page_num_human}/{num_pages}")
        page_content_elements = []
        current_page_table_bboxes = [
            tuple(t["bbox"]) for t in tables_by_page.get(page_num_human, [])
        ]

        def check_overlap_with_tables(b_bbox_coords, t_bboxes_coords_list, threshold=0.3):
            b_x0,b_y0,b_x1,b_y1 = b_bbox_coords
            b_area=(b_x1-b_x0)*(b_y1-b_y0)
            if b_area == 0: return False
            for t_coords in t_bboxes_coords_list:
                t_x0,t_y0,t_x1,t_y1 = t_coords
                ix0,iy0 = max(b_x0,t_x0), max(b_y0,t_y0)
                ix1,iy1 = min(b_x1,t_x1), min(b_y1,t_y1)
                iarea = max(0,ix1-ix0) * max(0,iy1-iy0)
                if (iarea/b_area) > threshold: return True
            return False

        if use_pymupdf_text:
            try:
                # PyMuPDF names the individual flags TEXT_PRESERVE_*; only the presets use the TEXTFLAGS_ prefix.
                textpage_flags = (
                    fitz.TEXTFLAGS_SEARCH | fitz.TEXT_PRESERVE_LIGATURES
                    | fitz.TEXT_PRESERVE_IMAGES | fitz.TEXT_PRESERVE_WHITESPACE
                )
            except AttributeError:
                textpage_flags = fitz.TEXTFLAGS_SEARCH
                logger.warning(f"    P{page_num_human}: Older PyMuPDF version, using basic text flags.")

            try:
                textpage = page_fitz.get_textpage_ocr(flags=textpage_flags, full=False)
                page_dict = textpage.extractDICT()

                for block_dict in page_dict.get("blocks", []):
                    if block_dict.get("type") == 0:
                        block_text_content, span_font_props_list = "", []
                        for line_dict in block_dict.get("lines", []):
                            line_text_parts = []
                            for span_dict_val in line_dict.get("spans", []):
                                line_text_parts.append(span_dict_val["text"])
                                span_font_props_list.append(get_font_properties(span_dict_val))
                            block_text_content += " ".join(line_text_parts).strip() + "\n"

                        block_text_content = re.sub(r'\s*\n\s*', '\n', block_text_content).strip()
                        block_text_content = re.sub(r' +', ' ', block_text_content)
                        block_bbox_fitz = fitz.Rect(block_dict["bbox"])

                        if not block_text_content or check_overlap_with_tables(tuple(block_bbox_fitz), current_page_table_bboxes):
                            continue

                        current_block_avg_font_size = np.mean([s['size'] for s in span_font_props_list if s['size'] > 0]) if any(s['size'] > 0 for s in span_font_props_list) else avg_font_size
                        is_bold = any(s['flags'] & (1 << 4) for s in span_font_props_list)
                        num_words_first_line = len(block_text_content.split('\n')[0].split())

                        block_sem_type, is_heading_candidate = "text_paragraph", False
                        if is_likely_header_or_footer(block_text_content, page_fitz.rect, block_bbox_fitz, page_num_human, num_pages):
                            block_sem_type = "noise_header_footer"
                        elif len(block_text_content) <= 6 and not re.search(r'[a-zA-Z]{2,}', block_text_content) and \
                             not (current_block_avg_font_size > avg_font_size + 0.8 * std_font_size or is_bold):
                            block_sem_type = "noise_short_irrelevant"
                        else:
                            if page_idx == 0 and current_block_avg_font_size >= avg_font_size + 1.8 * std_font_size and \
                               num_words_first_line < 18 and block_bbox_fitz.y0 < page_fitz.rect.height * 0.3:
                                block_sem_type, is_heading_candidate = "title_document", True
                            elif current_block_avg_font_size >= avg_font_size + 1.4 * std_font_size and \
                                 (is_bold or num_words_first_line < 20 or (page_idx <=1 and num_words_first_line < 28)):
                                block_sem_type, is_heading_candidate = "h1_heading", True
                            elif current_block_avg_font_size >= avg_font_size + 0.7 * std_font_size and \
                                 (is_bold or num_words_first_line < 25):
                                block_sem_type, is_heading_candidate = "h2_heading", True
                            elif (current_block_avg_font_size >= avg_font_size + 0.3 * std_font_size and \
                                  is_bold and num_words_first_line < 30 and not block_text_content.endswith('.')):
                                block_sem_type, is_heading_candidate = "h3_heading", True

                        if is_heading_candidate:
                            logger.trace(f"    P{page_num_human}: Found {block_sem_type} (F:{current_block_avg_font_size:.1f},B:{is_bold}): '{block_text_content[:40]}...'")
                        page_content_elements.append({
                            "type": block_sem_type, "content": block_text_content, "bbox": tuple(block_bbox_fitz),
                            "order": block_bbox_fitz.y0, "is_semantic_heading": is_heading_candidate,
                            "font_size": current_block_avg_font_size, "is_bold": is_bold, "document_type": original_document_type
                        })
            except Exception as e_pymupdf_txt:
                logger.opt(exception=True).error(f"  Worker: PyMuPDF text extraction error on P{page_num_human}: {e_pymupdf_txt}")

        for tbl_data in tables_by_page.get(page_num_human, []):
            page_content_elements.append({
                "type": "table_markdown", "content": tbl_data["markdown_content"],
                "bbox": tbl_data["bbox"], "order": tbl_data["order"],
                "is_semantic_heading": False, "font_size": avg_font_size, "is_bold": False, "document_type": original_document_type
            })

        initial_digital_text_len = sum(len(el["content"]) for el in page_content_elements if el["type"] not in ["noise_header_footer", "noise_short_irrelevant"])
        page_might_be_scan = initial_digital_text_len < scan_detection_char_threshold

        structured_images_on_page_meta = []
        if process_scanned_pages or process_structured_images:
            try:
                structured_images_on_page_meta = page_fitz.get_images(full=True)
            except Exception as e:
                logger.warning(f"  Worker: PyMuPDF get_images failed P{page_num_human}: {e}")

        if process_scanned_pages and page_might_be_scan and not structured_images_on_page_meta:
            logger.info(f"  P{page_num_human}: Low digital text ({initial_digital_text_len} chars), no structured images. Candidate for full page OCR.")
            try:
                pil_page_img_list = convert_from_path(
                    pdf_path, dpi=dpi_for_conversion, first_page=page_num_human,
                    last_page=page_num_human, fmt='jpeg', thread_count=1
                )
                if not pil_page_img_list:
                    # Fall through instead of skipping the page, so tables and text already
                    # collected for this page are still emitted below.
                    logger.warning(f"    P{page_num_human}: pdf2image returned no image for full page OCR.")
                else:
                    pil_page_img = pil_page_img_list[0]
                    fp_bbox, fp_ocr_text, fp_ocr_type_detail = tuple(page_fitz.rect), "", ""

                    if use_vision_for_ocr_flag and APP_STATE.get("hf_vision_client"):
                        logger.debug(f"    P{page_num_human}: Attempting Vision OCR (full page).")
                        fp_ocr_text = _process_image_with_vision_model(pil_page_img, IMAGE_OCR_PROMPT_CONFIG) or ""
                        fp_ocr_type_detail = "vision_ocr (full_page)"

                    # The sentinel is compared in lowercase because the text is lowercased first.
                    vision_text_unusable = (not fp_ocr_text.strip() or "Error:" in fp_ocr_text
                                            or "no text found" in fp_ocr_text.lower())
                    if vision_text_unusable:
                        # Tesseract runs whenever Vision OCR was skipped, failed, or found no text.
                        logger.debug(f"    P{page_num_human}: Vision OCR skipped, failed, or no text. Attempting Tesseract (full page). Vision out: '{fp_ocr_text[:50]}...'")
                        fp_ocr_text = pytesseract.image_to_string(pil_page_img)
                        fp_ocr_type_detail = "tesseract_ocr (full_page)"

                    if fp_ocr_text and fp_ocr_text.strip() and "Error:" not in fp_ocr_text and "no text found" not in fp_ocr_text.lower():
                        page_content_elements = [el for el in page_content_elements if el["type"] == "table_markdown"]
                        page_content_elements.append({
                            "type": "full_page_ocr_text", "content": fp_ocr_text.strip(), "bbox": fp_bbox,
                            "order": 0, "is_semantic_heading": False, "font_size": avg_font_size,
                            "is_bold": False, "ocr_source": fp_ocr_type_detail, "document_type": original_document_type
                        })
                        logger.info(f"    P{page_num_human}: Full scan OCR successful via {fp_ocr_type_detail}. Text length: {len(fp_ocr_text)}")
                    else:
                        logger.warning(f"    P{page_num_human}: Full page OCR attempt ({fp_ocr_type_detail or 'N/A'}) yielded no usable text.")
            except Exception as e_fp_ocr:
                logger.opt(exception=True).error(f"    Error during P{page_num_human} full page OCR processing: {e_fp_ocr}")

        if process_structured_images and structured_images_on_page_meta:
            logger.debug(f"  P{page_num_human}: Processing {len(structured_images_on_page_meta)} structured image regions.")
            for img_idx, img_meta_item in enumerate(structured_images_on_page_meta):
                xref = img_meta_item[0]
                try:
                    base_image = doc.extract_image(xref)
                    if not base_image or "image" not in base_image:
                        logger.warning(f"    P{page_num_human} ImgX{xref}: Could not extract base image data.")
                        continue

                    pil_img = Image.open(io.BytesIO(base_image["image"]))
                    img_placements_rects = page_fitz.get_image_rects(xref)

                    if not img_placements_rects:
                         if len(structured_images_on_page_meta) == 1 and page_might_be_scan:
                             img_placements_rects = [page_fitz.rect]
                             logger.debug("    Single image on scan-like page, using full page rect.")
                         else:
                             logger.warning(f"    P{page_num_human} ImgX{xref}: No placement rectangles found. Skipping image.")
                             continue

                    for rect_idx, fitz_rect_obj in enumerate(img_placements_rects):
                        img_bbox_coords = tuple(fitz_rect_obj)
                        img_filename_ref = f"p{page_num_human}_img{img_idx}_xref{xref}_r{rect_idx}.{base_image.get('ext','png')}"
                        if check_overlap_with_tables(img_bbox_coords, current_page_table_bboxes, 0.7):
                            logger.trace(f"    Skipping image XREF {xref} Rect {rect_idx} due to high table overlap.")
                            continue

                        ocr_from_img_content, desc_from_img_content, vision_ocr_attempted_for_img = "", "", False
                        if use_vision_for_ocr_flag and APP_STATE.get("hf_vision_client"):
                            vision_ocr_attempted_for_img = True
                            ocr_from_img_content = _process_image_with_vision_model(pil_img, IMAGE_OCR_PROMPT_CONFIG) or ""
                            # Compare the sentinel in lowercase: the text is lowercased before the check.
                            if ocr_from_img_content and "Error:" not in ocr_from_img_content and "no text found" not in ocr_from_img_content.lower():
                                page_content_elements.append({
                                    "type": "image_ocr_text", "content": ocr_from_img_content.strip(),
                                    "bbox": img_bbox_coords, "order": img_bbox_coords[1],
                                    "source_image_filename": img_filename_ref, "is_semantic_heading": False,
                                    "font_size": avg_font_size, "is_bold": False, "document_type": original_document_type
                                })
                            elif "no text found" in ocr_from_img_content.lower():
                                ocr_from_img_content = ""
                            elif "Error:" in ocr_from_img_content:
                                logger.warning(f"      P{page_num_human},ImgX{xref}R{rect_idx} Vision OCR Error: {ocr_from_img_content}")
                                ocr_from_img_content = ""

                        if process_structured_images and use_vision_for_description_flag and APP_STATE.get("hf_vision_client"):
                            should_describe_img = not (
                                vision_ocr_attempted_for_img and len(ocr_from_img_content) > 75 and
                                not any(k in ocr_from_img_content.lower() for k in ["chart","diagram","graph","figure","table"])
                            )
                            if should_describe_img:
                                desc_from_img_content = _process_image_with_vision_model(pil_img, IMAGE_DESCRIPTION_PROMPT_CONFIG)
                                if desc_from_img_content and "Error:" not in desc_from_img_content and desc_from_img_content.strip():
                                    if not(ocr_from_img_content and desc_from_img_content.strip().lower() == ocr_from_img_content.strip().lower()):
                                        page_content_elements.append({
                                            "type": "image_description", "content": desc_from_img_content.strip(),
                                            "bbox": img_bbox_coords, "order": img_bbox_coords[1] + 0.01,
                                            "source_image_filename": img_filename_ref, "is_semantic_heading": False,
                                            "font_size": avg_font_size, "is_bold": False, "document_type": original_document_type
                                        })
                except UnidentifiedImageError:
                    logger.warning(f"  P{page_num_human}, ImgX{xref}: PIL UnidentifiedImageError. Skipping this image resource.")
                except Exception as e_img_proc:
                    logger.opt(exception=True).error(f"  Error processing structured image XREF {xref} on P{page_num_human}: {e_img_proc}")


        page_content_elements.sort(key=lambda item: item.get("order", float('inf')))
        for el_data in page_content_elements:
            content, el_type_str = el_data.get("content", "").strip(), el_data.get("type", "")
            if "noise" in el_type_str or content.lower() == "no text found":
                logger.trace(f"    P{page_num_human}: Final skip of noise block type '{el_type_str}' or 'No text found'.")
                continue
            if content:
                doc_block_counter += 1
                block_to_add = {k: v for k, v in el_data.items() if k not in ["order"]}
                block_to_add.update({"block_id": f"doc_block_{doc_block_counter}", "page_number": page_num_human})
                block_to_add.setdefault("is_semantic_heading", False)
                block_to_add.setdefault("font_size", avg_font_size)
                block_to_add.setdefault("is_bold", False)
                block_to_add.setdefault("document_type", original_document_type)
                all_content_blocks.append(block_to_add)
    try:
        doc.close()
    except Exception as e:
        logger.warning(f"  Worker: Error closing PDF '{os.path.basename(pdf_path)}': {e}")

    if not all_content_blocks:
        all_content_blocks.append({
            "block_id": "fallback_empty_doc", "type": "info",
            "content": f"No content could be extracted from this {original_document_type} file after PDF conversion. It might be empty, purely graphical without OCR, or a protected file.",
            "page_number": 0, "document_type": original_document_type
        })
    logger.info(
        f"  Worker: Finished parsing '{os.path.basename(pdf_path)}' (originally {original_document_type}). Total blocks extracted: {len(all_content_blocks)}"
    )
    return all_content_blocks
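
# Illustrative sketch (not part of the pipeline): the heading heuristic used in the worker above,
# restated as a standalone function so the thresholds can be read and tested in isolation.
# The function name and its parameters are hypothetical; the multipliers mirror the ones above.
# `top_fraction` stands for the block's top edge divided by the page height.
def _sketch_classify_block(font_size, avg_size, std_size, is_bold, words_first_line,
                           page_idx=1, top_fraction=0.5, ends_with_period=False):
    """Return the semantic type a non-noise text block would likely receive."""
    if (page_idx == 0 and font_size >= avg_size + 1.8 * std_size
            and words_first_line < 18 and top_fraction < 0.3):
        return "title_document"
    if font_size >= avg_size + 1.4 * std_size and (
            is_bold or words_first_line < 20 or (page_idx <= 1 and words_first_line < 28)):
        return "h1_heading"
    if font_size >= avg_size + 0.7 * std_size and (is_bold or words_first_line < 25):
        return "h2_heading"
    if (font_size >= avg_size + 0.3 * std_size and is_bold
            and words_first_line < 30 and not ends_with_period):
        return "h3_heading"
    return "text_paragraph"

# Usage (assuming a 10pt body font with a 2pt standard deviation): a bold 11pt line is an H3
# candidate, while the same line without bold stays a plain paragraph.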

# --- Universal File Converter to PDF ---
def convert_to_pdf(input_path: str, temp_dir: str) -> Optional[Dict]:
    """
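    Converts a supported document (DOCX, XLSX/XLS, PPTX, or image) to PDF using unoconv or Pillow.
    Files that are already PDFs are passed through unchanged.

    Returns a dictionary with {pdf_path, original_filename, original_extension, original_document_type}.
    For unsupported extensions, pdf_path is None and an "error" message is included.
    Returns None if the conversion itself fails.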
    """
    original_filename = os.path.basename(input_path)
    file_extension = os.path.splitext(original_filename)[1].lower()
    base_name_no_ext = os.path.splitext(original_filename)[0]
    output_pdf_path = os.path.join(temp_dir, f"{base_name_no_ext}_{uuid.uuid4().hex[:6]}.pdf")

    original_doc_type = "unknown"
    if file_extension == ".pdf":
        original_doc_type = "pdf"
    elif file_extension == ".docx":
        original_doc_type = "docx"
    elif file_extension == ".xlsx":
        original_doc_type = "xlsx"
    elif file_extension == ".pptx":
        original_doc_type = "pptx"
    elif file_extension in [".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff"]:
        original_doc_type = "image"
    elif file_extension == ".xls": # Old Excel format, map to xlsx for parsing
        original_doc_type = "xlsx" # Treat as xlsx for parsing context after conversion
    else:
        logger.error(f"Conversion failed: Unsupported file type for conversion: {file_extension} for '{original_filename}'")
        return {"pdf_path": None, "original_filename": original_filename, "original_extension": file_extension, "original_document_type": "unsupported", "error": f"Unsupported file type: {file_extension}"}

    if file_extension == ".pdf":
        logger.info(f"  File '{original_filename}' is already a PDF. Skipping conversion.")
        return {
            "pdf_path": input_path,
            "original_filename": original_filename,
            "original_extension": file_extension,
            "original_document_type": original_doc_type
        }
    elif original_doc_type == "image":
        try:
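            # Use a context manager so the source file handle is closed after conversion.
            with Image.open(input_path) as img:
                rgb_img = img if img.mode == "RGB" else img.convert("RGB")
                rgb_img.save(output_pdf_path, "PDF")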
            logger.info(f"  Converted image '{original_filename}' to PDF at '{output_pdf_path}'.")
            return {
                "pdf_path": output_pdf_path,
                "original_filename": original_filename,
                "original_extension": file_extension,
                "original_document_type": original_doc_type
            }
        except Exception as e:
            logger.opt(exception=True).error(f"  Failed to convert image '{original_filename}' to PDF using Pillow: {e}")
            return None
    else:  # Use unoconv for DOCX, XLSX, PPTX, and XLS
        import subprocess  # Local import so this function does not rely on the notebook's import cell.
        # Pass the arguments as a list (no shell) so paths containing spaces or shell
        # metacharacters are handled safely.
        command = ["unoconv", "-f", "pdf", "-o", output_pdf_path, input_path]
        logger.info(f"  Attempting to convert '{original_filename}' to PDF using unoconv: {' '.join(command)}")
        try:
            result = subprocess.run(command, capture_output=True, text=True, errors="replace")
            pdf_exists = os.path.exists(output_pdf_path)
            pdf_size = os.path.getsize(output_pdf_path) if pdf_exists else 0
            if result.returncode == 0 and pdf_size > 0:
                logger.success(f"  Successfully converted '{original_filename}' to PDF at '{output_pdf_path}'.")
                return {
                    "pdf_path": output_pdf_path,
                    "original_filename": original_filename,
                    "original_extension": file_extension,
                    "original_document_type": original_doc_type
                }
            else:
                logger.error(f"  Unoconv failed or produced empty PDF for '{original_filename}'. Result code: {result.returncode}. PDF exists: {pdf_exists}. Output file size: {pdf_size if pdf_exists else 'N/A'}. stderr: {result.stderr.strip()[:500]}")
                return None
        except Exception as e:
            logger.opt(exception=True).error(f"  Error during unoconv conversion of '{original_filename}': {e}")
            return None
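
# Illustrative sketch (not part of the pipeline): the extension checks in convert_to_pdf can also
# be read as a lookup table. The names below are hypothetical; the mapping mirrors the branches
# above, including the choice to treat legacy .xls files as "xlsx".
_SKETCH_EXTENSION_TO_DOC_TYPE = {
    ".pdf": "pdf", ".docx": "docx", ".xlsx": "xlsx", ".xls": "xlsx", ".pptx": "pptx",
    **{ext: "image" for ext in (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff")},
}

def _sketch_doc_type_for(filename: str) -> str:
    """Return the document type for a filename, or "unsupported" for unknown extensions."""
    import os  # Local import keeps the sketch self-contained.
    extension = os.path.splitext(os.path.basename(filename))[1].lower()
    return _SKETCH_EXTENSION_TO_DOC_TYPE.get(extension, "unsupported")

# Usage: _sketch_doc_type_for("Q3 Report.XLS") gives "xlsx"; _sketch_doc_type_for("notes.txt")
# gives "unsupported". The extension is lowercased first, so the lookup is case-insensitive.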

# --- Central Parsing Function (now orchestrates conversion and then PDF parsing) ---
def parse_document_via_profile(file_path: str, profile_name: str = "default_fallback") -> list:
    logger.info(f"Initiating parsing process for document '{os.path.basename(file_path)}' with profile: '{profile_name}'")

    current_config = {
        "use_pdfplumber_tables": True, "use_pymupdf_text": True, "process_scanned_pages": False,
        "process_structured_images": False, "use_vision_for_ocr": False, "use_vision_for_description": False,
        "scan_detection_char_threshold": 120, "dpi_for_conversion": 220
    }
    if profile_name == "fastest":
        current_config.update({"process_scanned_pages": False, "process_structured_images": False, "use_vision_for_ocr": False, "use_vision_for_description": False})
    elif profile_name == "digital_plus_ocr":
        current_config.update({"process_scanned_pages": True, "use_vision_for_ocr": False, "process_structured_images": True, "use_vision_for_description": False, "scan_detection_char_threshold": 150, "dpi_for_conversion": 200})
    elif profile_name == "comprehensive_vision":
        current_config.update({"process_scanned_pages": True, "use_vision_for_ocr": True, "process_structured_images": True, "use_vision_for_description": True, "scan_detection_char_threshold": 100, "dpi_for_conversion": 250})
    elif profile_name == "default_fallback":
        logger.warning(f"Using 'default_fallback' profile for '{os.path.basename(file_path)}'.")
        current_config.update({"process_scanned_pages": True, "use_vision_for_ocr": True, "process_structured_images": True, "use_vision_for_description": True})
    else:
        # Keep the conservative defaults defined above so that every expected key
        # (e.g. scan_detection_char_threshold, dpi_for_conversion) is still present.
        logger.error(f"Unexpected profile '{profile_name}'. Using minimal config.")

    if not APP_STATE.get("vision_client_initialized", False):
        if current_config.get("use_vision_for_ocr"):
            logger.info(f"Profile '{profile_name}': Vision OCR globally disabled (client not ready).")
            current_config["use_vision_for_ocr"] = False
        if current_config.get("use_vision_for_description"):
            logger.info(f"Profile '{profile_name}': Vision Desc globally disabled (client not ready).")
            current_config["use_vision_for_description"] = False

    logger.info(f"Final config for parsing '{os.path.basename(file_path)}' (profile '{profile_name}'): {current_config}")
    start_time = time.time()

    converted_info = convert_to_pdf(file_path, STAGED_UPLOADS_DIR) # Use staged uploads dir for temp PDFs
    if not converted_info or not converted_info.get("pdf_path"):
        # convert_to_pdf returns None when the conversion itself fails, so guard before calling .get().
        failed_info = converted_info or {}
        error_msg = failed_info.get("error", "Unknown conversion error.")
        logger.error(f"Document conversion failed for '{os.path.basename(file_path)}': {error_msg}")
        return [{"block_id":"conversion_error", "type":"error", "content":f"File conversion to PDF failed: {error_msg}", "page_number":0, "document_type": failed_info.get("original_document_type", "unsupported")}]

    pdf_to_parse_path = converted_info["pdf_path"]
    original_filename = converted_info["original_filename"]
    original_document_type = converted_info["original_document_type"]

    try:
        # Now call the single PDF content parser
        parsed_data = parse_pdf_content_worker(pdf_to_parse_path, original_filename, original_document_type, current_config)

        # Ensure that blocks from non-PDFs have their `document_type` consistently set
        for block in parsed_data:
            block["document_type"] = original_document_type

        logger.success(f"Profile '{profile_name}' parsing for '{original_filename}' (converted from {original_document_type}) in {time.time()-start_time:.2f}s. Blocks: {len(parsed_data)}.")
        return parsed_data
    except Exception as e:
        logger.opt(exception=True).error(f"Error during PDF content parsing of converted file '{pdf_to_parse_path}' (originally {original_document_type}): {e}")
        return [{"block_id":"parsing_error", "type":"error", "content":f"Error parsing PDF content after conversion: {e}", "page_number":0, "document_type": original_document_type}]
    finally:
        # Clean up the temporary PDF file if it was created
        if pdf_to_parse_path != file_path: # Only remove if it's a new temp file, not the original if it was already a PDF
            try:
                os.remove(pdf_to_parse_path)
                logger.info(f"  Removed temporary PDF file: '{pdf_to_parse_path}'.")
            except Exception as e:
                logger.warning(f"  Could not remove temporary PDF file '{pdf_to_parse_path}': {e}")
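
# Illustrative sketch (not part of the pipeline): the profile handling in
# parse_document_via_profile expressed as a lookup table plus a vision-availability gate.
# All names here are hypothetical; the values mirror the profiles defined above. One assumption
# of this sketch: unknown profile names simply fall back to the base defaults.
_SKETCH_BASE_CONFIG = {
    "use_pdfplumber_tables": True, "use_pymupdf_text": True, "process_scanned_pages": False,
    "process_structured_images": False, "use_vision_for_ocr": False, "use_vision_for_description": False,
    "scan_detection_char_threshold": 120, "dpi_for_conversion": 220,
}
_SKETCH_PROFILE_OVERRIDES = {
    "fastest": {},
    "digital_plus_ocr": {"process_scanned_pages": True, "process_structured_images": True,
                         "scan_detection_char_threshold": 150, "dpi_for_conversion": 200},
    "comprehensive_vision": {"process_scanned_pages": True, "use_vision_for_ocr": True,
                             "process_structured_images": True, "use_vision_for_description": True,
                             "scan_detection_char_threshold": 100, "dpi_for_conversion": 250},
    "default_fallback": {"process_scanned_pages": True, "use_vision_for_ocr": True,
                         "process_structured_images": True, "use_vision_for_description": True},
}

def _sketch_resolve_profile(profile_name: str, vision_ready: bool) -> dict:
    """Merge the base config with a profile's overrides, then gate the vision features."""
    config = dict(_SKETCH_BASE_CONFIG)
    config.update(_SKETCH_PROFILE_OVERRIDES.get(profile_name, {}))
    if not vision_ready:
        # Without an initialized vision client, both vision features are switched off.
        config["use_vision_for_ocr"] = False
        config["use_vision_for_description"] = False
    return config

# Usage: _sketch_resolve_profile("comprehensive_vision", vision_ready=False) keeps the scan and
# image processing flags but turns both vision flags off.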


# --- 9. Knowledge Base Functions ---
def _prepend_context_to_chunk(original_chunk_text: str, block_metadata: dict) -> str:
    """
    Dynamically prepends contextual information to a chunk based on its type and metadata.
    Refined to use the actual `document_type` metadata.
    """
    # The key may exist with a value of None (it is reset after each use), so fall back with `or`.
    orig_filename = APP_STATE.get("original_doc_filename_for_chunking_context") or "this document"
    block_type = block_metadata.get("type", "text_paragraph")
    page_num = block_metadata.get("page_number", "N/A")
    doc_type = block_metadata.get("document_type", "document") # Crucial: Get original document type
    font_size = block_metadata.get("font_size", 0) # Only relevant for PDFs originally
    is_bold_str = " (Bold)" if block_metadata.get("is_bold", False) else "" # Only relevant for PDFs originally
    source_img_filename = block_metadata.get("source_image_filename")
    ocr_source = block_metadata.get("ocr_source")

    prefix = ""

    # General document type indicator for clarity in context
    doc_type_label = {
        "pdf": "PDF document",
        "docx": "Word document",
        "xlsx": "Excel document",
        "pptx": "PowerPoint presentation",
        "image": "Image file"
    }.get(doc_type, "document")

    # Specific prefixes for different block types
    if block_type == "table_markdown":
        prefix = f"Context: Data table from {doc_type_label} '{orig_filename}', section/page {page_num}.\nContent:\n"
    elif block_type == "image_ocr_text":
        img_info = f"image '{source_img_filename}'" if source_img_filename else "an image"
        prefix = f"Context: OCR transcription from {img_info} on page {page_num} of {doc_type_label} '{orig_filename}'.\nContent:\n"
    elif block_type == "image_description":
        img_info = f"image '{source_img_filename}'" if source_img_filename else "an image"
        prefix = f"Context: Description of {img_info} on page {page_num} of {doc_type_label} '{orig_filename}'.\nContent:\n"
    elif block_type == "full_page_ocr_text":
        ocr_source_info = f" via {ocr_source}" if ocr_source else ""
        prefix = f"Context: Full page OCR text from scanned page {page_num} of {doc_type_label} '{orig_filename}'{ocr_source_info}.\nContent:\n"
    elif block_type == "title_document":
        prefix = f"Context: Document Title from {doc_type_label} '{orig_filename}' (P{page_num}, F{font_size:.0f}{is_bold_str}):\n"
    elif block_type in ["h1_heading", "h2_heading", "h3_heading", "heading_1", "heading_2", "heading_3", "h_candidate_uppercase", "pptx_slide_text"]:
        level_map = {
            "h1_heading": "H1", "h2_heading": "H2", "h3_heading": "H3",
            "heading_1": "Heading 1", "heading_2": "Heading 2", "heading_3": "Heading 3",
            "h_candidate_uppercase": "Potential Heading",
            "pptx_slide_text": "PowerPoint Slide Content" # Special case for PPTX converted content
        }
        level_str = level_map.get(block_type, "Heading")
        prefix = f"Context: {level_str} from {doc_type_label} '{orig_filename}', section/page {page_num} (F{font_size:.0f}{is_bold_str}).\nContent:\n"
    elif block_type == "excel_sheet_data": # For sheets that might have been detected as whole text blocks in PDF
        prefix = f"Context: Data from Excel sheet of {doc_type_label} '{orig_filename}', sheet/page {page_num}.\nContent:\n"
    elif block_type == "docx_paragraph": # For paragraphs from Word docs
        prefix = f"Context: Paragraph from Word document '{orig_filename}', page {page_num}.\nContent:\n"
    else: # Default for generic text paragraphs
        # Heuristic for detecting potential sub-headings within generic text blocks
        lines = original_chunk_text.split('\n', 1)
        first_line = lines[0].strip()
        # Check if first line is short, few words, not ending in punctuation (common for headings)
        if (1 < len(first_line.split()) < 9 and
            count_tokens(first_line) < 40 and
            not first_line.endswith(('.', '?', '!')) and
            (first_line.isupper() or first_line.istitle() or is_bold_str)):
            prefix = f"Context: From {doc_type_label} '{orig_filename}', section/page {page_num}, likely section titled '{first_line}'.\nContent:\n"
        else:
            prefix = f"Context: Text from {doc_type_label} '{orig_filename}', section/page {page_num}.\nContent:\n"

    return prefix + original_chunk_text

def _chunk_parsed_blocks(
    parsed_blocks: List[Dict], collection_name_as_doc_id: str, original_filename_for_meta: str
) -> List[Dict]:
    all_final_chunks = []
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=PRIMARY_CHUNK_SIZE_TOKENS,
        chunk_overlap=PRIMARY_CHUNK_OVERLAP_TOKENS,
        length_function=count_tokens,
        separators=SEPARATORS_HIERARCHICAL,
        keep_separator=False,
    )
    for block_idx, block in enumerate(parsed_blocks):
        original_content = block.get("content", "").strip()
        block_type = block.get("type", "unknown")
        if not original_content or "noise" in block_type or block_type == "error":
            continue

        current_block_metadata = {k: v for k, v in block.items() if k != "content"}
        current_block_metadata.update({"doc_id": collection_name_as_doc_id, "original_filename": original_filename_for_meta})

        APP_STATE["original_doc_filename_for_chunking_context"] = original_filename_for_meta
        temp_prefixed_content = _prepend_context_to_chunk(original_content, block)
        APP_STATE["original_doc_filename_for_chunking_context"] = None

        prefixed_token_count = count_tokens(temp_prefixed_content)
        original_token_count = count_tokens(original_content)

        # Block types that may be split once they exceed the primary chunk size.
        splittable_types = {
            "text_paragraph", "full_page_ocr_text", "docx_paragraph", "excel_sheet_data",
            "pptx_slide_text", "image_ocr_text", "image_description", "table_markdown",
        }
        should_split = False
        if prefixed_token_count > EFFECTIVE_MAX_CHUNK_TOKENS:
            logger.warning(f"Block {block.get('block_id', f'b{block_idx}')} (pg:{block.get('page_number')}, type:{block_type}) prefixed form ({prefixed_token_count} tokens) exceeds max {EFFECTIVE_MAX_CHUNK_TOKENS}. Will be split.")
            should_split = True
        elif block_type in splittable_types and original_token_count > PRIMARY_CHUNK_SIZE_TOKENS:
            should_split = True
        # All other block types (titles, headings) stay atomic as long as their prefixed form fits.

        if not should_split:
            chunk_id = f"{block.get('block_id', f'b{block_idx}')}_chunk0"
            final_meta = {**current_block_metadata, "chunk_id": chunk_id, "original_block_id": block.get('block_id')}
            all_final_chunks.append({
                "text_content_for_embedding": temp_prefixed_content,
                "original_chunk_text": original_content,
                "metadata": final_meta
            })
        else:
            sub_texts = text_splitter.split_text(original_content)
            logger.debug(f"  Splitting block {block.get('block_id')} ({block_type}, pg {block.get('page_number')}) into {len(sub_texts)} sub-chunks.")
            for i, sub_text_content in enumerate(sub_texts):
                sub_text_content = sub_text_content.strip()
                if not sub_text_content: continue

                APP_STATE["original_doc_filename_for_chunking_context"] = original_filename_for_meta
                final_sub_chunk_for_embedding = _prepend_context_to_chunk(sub_text_content, block)
                APP_STATE["original_doc_filename_for_chunking_context"] = None

                final_sub_chunk_token_count = count_tokens(final_sub_chunk_for_embedding)
                if final_sub_chunk_token_count > EFFECTIVE_MAX_CHUNK_TOKENS:
                    logger.warning(f"    Sub-chunk {i} of {block.get('block_id')} (pg:{block.get('page_number')}) still too large ({final_sub_chunk_token_count} tokens) after splitting. TRUNCATING. Text: '{final_sub_chunk_for_embedding[:100]}...'")
                    tokenizer = APP_STATE.get("tokenizer")
                    if tokenizer:
                        encoded = tokenizer.encode(final_sub_chunk_for_embedding, disallowed_special=())
                        final_sub_chunk_for_embedding = tokenizer.decode(encoded[:EFFECTIVE_MAX_CHUNK_TOKENS])
                    else:
                        final_sub_chunk_for_embedding = final_sub_chunk_for_embedding[:EFFECTIVE_MAX_CHUNK_TOKENS * 4]

                chunk_id = f"{block.get('block_id', f'b{block_idx}')}_subchunk{i}"
                final_meta = {**current_block_metadata, "chunk_id": chunk_id, "original_block_id": block.get('block_id'), "split_index": i}
                all_final_chunks.append({
                    "text_content_for_embedding": final_sub_chunk_for_embedding,
                    "original_chunk_text": sub_text_content,
                    "metadata": final_meta
                })
    logger.info(f"Chunking for '{original_filename_for_meta}': {len(all_final_chunks)} final chunks.")
    return all_final_chunks
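
# Illustrative sketch (not part of the pipeline): the split decision in _chunk_parsed_blocks as a
# pure function. The function name and the default limits (512 and 256 tokens) are hypothetical
# stand-ins; the real limits come from EFFECTIVE_MAX_CHUNK_TOKENS and PRIMARY_CHUNK_SIZE_TOKENS.
_SKETCH_SPLITTABLE_TYPES = frozenset({
    "text_paragraph", "full_page_ocr_text", "docx_paragraph", "excel_sheet_data",
    "pptx_slide_text", "image_ocr_text", "image_description", "table_markdown",
})

def _sketch_should_split(block_type, prefixed_tokens, original_tokens,
                         max_tokens=512, primary_size=256):
    """Return True when a block should be handed to the text splitter."""
    if prefixed_tokens > max_tokens:
        # Any block whose prefixed form exceeds the hard limit is split, whatever its type.
        return True
    # Otherwise only body-like block types are split, and only above the primary chunk size.
    return block_type in _SKETCH_SPLITTABLE_TYPES and original_tokens > primary_size

# Usage: a 280-token heading stays whole under these limits, while a 280-token paragraph is split.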

def _embed_chunks(chunks_to_embed: List[Dict], batch_size: int = 16) -> Optional[List[Dict]]:
    if not APP_STATE.get("rag_models_loaded") or not APP_STATE.get("azure_embedding_client"):
        logger.error("Azure Embedding client not ready.")
        return None

    texts = [c["text_content_for_embedding"] for c in chunks_to_embed]
    if not texts:
        logger.warning("No texts to embed.")
        return []

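    logger.info(f"Embedding {len(texts)} chunks using Azure OpenAI '{AZURE_EMBEDDING_MODEL_NAME}'...")
    try:
        embeddings_list = []
        step = max(1, batch_size)
        # Send the texts in batches of `batch_size` rather than in one large request.
        for start in range(0, len(texts), step):
            response = APP_STATE["azure_embedding_client"].embed(
                input=texts[start:start + step],
                model=AZURE_EMBEDDING_MODEL_NAME
            )
            embeddings_list.extend(item.embedding for item in response.data)

        # zip() below would silently drop chunks if the service returned fewer vectors.
        if len(embeddings_list) != len(texts):
            logger.error(f"Azure Embedding returned {len(embeddings_list)} vectors for {len(texts)} texts.")
            return None

        embeddings_np = np.array(embeddings_list, dtype=np.float32)

        for c, emb in zip(chunks_to_embed, embeddings_np):
            c["embedding_vector"] = emb.tolist()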
        logger.success(f"Embedded {len(texts)} chunks using Azure OpenAI.")
        return chunks_to_embed
    except Exception as e:
        logger.error(f"Azure Embedding FAILED: {e}", exc_info=True)
        return None

def _store_chunks_in_chromadb(
    chunks_with_embeddings: List[Dict], collection: chromadb.api.models.Collection.Collection
) -> int:
    if not collection:
        logger.error("Chroma collection invalid.")
        return 0

    ids, embs, metas, docs = [], [], [], []
    valid_count = 0
    for chunk in chunks_with_embeddings:
        if "embedding_vector" not in chunk or not chunk.get("metadata", {}).get("chunk_id"):
            logger.warning(f"Skip chunk: missing embedding_vector/chunk_id. BlockID: {chunk.get('metadata',{}).get('original_block_id')}")
            continue

        meta = chunk["metadata"]
        ids.append(meta["chunk_id"])
        embs.append(chunk["embedding_vector"])
        docs.append(chunk["original_chunk_text"])

        # Chroma metadata values must be scalars (str/int/float/bool). Some Chroma
        # versions reject None, so keys with a None value are dropped.
        clean_meta = {}
        for k, v in meta.items():
            if v is None:
                continue
            if k == "bbox" and isinstance(v, (list, tuple)):
                try:
                    clean_meta[k] = json.dumps(v)
                except TypeError:
                    logger.warning(f"Could not serialize bbox {v}. Storing as str.")
                    clean_meta[k] = str(v)
            elif isinstance(v, (str, int, float, bool)):
                clean_meta[k] = v
            else:
                clean_meta[k] = str(v)
        metas.append(clean_meta)
        valid_count += 1

    if not ids:
        logger.warning(f"No valid chunks to store in '{collection.name}'.")
        return 0
    try:
        collection.upsert(ids=ids, embeddings=embs, metadatas=metas, documents=docs)
        logger.success(f"Stored {valid_count} chunks. Collection '{collection.name}' total: {collection.count()}.")
        return valid_count
    except Exception as e:
        logger.opt(exception=True).error(f"Chroma upsert FAILED for '{collection.name}': {e}")
        return 0
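
# --- Illustrative sketch (not part of the original pipeline) ---
# The metadata clean-up performed in _store_chunks_in_chromadb can be read as a
# small pure function. `_sketch_clean_metadata` is a hypothetical name. It assumes
# the vector store accepts only scalar metadata values, that bbox lists are kept
# as JSON strings, and that None values are dropped rather than stored.
import json

def _sketch_clean_metadata(meta: dict) -> dict:
    """Return a copy of `meta` that holds only str/int/float/bool values."""
    cleaned = {}
    for key, value in meta.items():
        if value is None:
            continue  # assumption: None is dropped, not stored
        if key == "bbox" and isinstance(value, (list, tuple)):
            cleaned[key] = json.dumps(list(value))  # keep coordinates recoverable
        elif isinstance(value, (str, int, float, bool)):
            cleaned[key] = value
        else:
            cleaned[key] = str(value)  # fall back to a string representation
    return cleaned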

def ingest_document_into_knowledge_base(
    parsed_blocks: List[Dict], collection_name: str, original_filename: str,
    target_collection: chromadb.api.models.Collection.Collection
) -> bool:
    start_time = time.time()
    logger.info(f"Ingesting '{original_filename}' into collection: {collection_name}")

    if not APP_STATE.get("rag_models_loaded") or not target_collection:
        logger.error(
            f"Ingest prereqs FAIL. RAG loaded: {APP_STATE.get('rag_models_loaded')}, "
            f"Collection valid: {target_collection is not None}."
        )
        return False

    chunks_for_embedding = _chunk_parsed_blocks(parsed_blocks, collection_name, original_filename)
    if not chunks_for_embedding:
        logger.error(f"Chunking for '{original_filename}' FAIL.")
        return False

    chunks_with_vectors = _embed_chunks(chunks_for_embedding)
    if not chunks_with_vectors:
        logger.error(f"Embedding for '{original_filename}' FAIL.")
        return False

    num_stored = _store_chunks_in_chromadb(chunks_with_vectors, target_collection)
    success = num_stored > 0

    logger.log(
        "SUCCESS" if success else "ERROR",
        f"Ingest '{original_filename}' ({collection_name}) in {time.time()-start_time:.2f}s. "
        f"Stored: {num_stored}."
    )
    return success

# --- 10. Retrieval Logic ---
def retrieve_relevant_context(
    query_text: str, target_collection_names: List[str],
    k_initial_retrieval: int = 20, k_final_target: int = 12,
    use_reranking: bool = True
) -> Dict:
    retrieval_start_time = time.time()
    payload = {
        "query": query_text,
        "formatted_context_for_llm": "Error: Retrieval fail.",
        "source_chunks_retrieved": [],
        "retrieval_metadata": {
            "collections_queried": target_collection_names,
            "k_initial": k_initial_retrieval,
            "k_final": k_final_target,
        },
    }

    required_keys = ["rag_models_loaded", "db_initialized", "chroma_client", "azure_embedding_client"]
    missing_keys = [k for k in required_keys if not APP_STATE.get(k)]
    if missing_keys:
        payload["retrieval_metadata"]["error"] = f"RAG/DB systems not ready (missing: {', '.join(missing_keys)})."
        logger.error(payload["retrieval_metadata"]["error"])
        return payload

    try:
        query_response = APP_STATE["azure_embedding_client"].embed(
            input=[query_text],
            model=AZURE_EMBEDDING_MODEL_NAME
        )
        query_emb = query_response.data[0].embedding
    except Exception as e:
        payload["retrieval_metadata"]["error"] = f"Query embed fail with Azure: {e}"
        logger.error(f"Query embed fail with Azure: {e}")
        return payload

    candidates = []
    noise_types = ["noise_header_footer", "noise_short_irrelevant", "info"]
    is_about_query = any(p in query_text.lower() for p in ["about this document", "summarize", "overview", "main idea", "tell me about"])
    anchors = []

    for coll_name in target_collection_names:
        coll = get_or_create_collection_cached(coll_name)
        if not coll:
            logger.warning(f"Skip collection '{coll_name}' as it's invalid or couldn't be accessed.")
            continue

        current_collection_candidates_map = {}

        try:
            res = coll.query(
                query_embeddings=[query_emb],
                n_results=k_initial_retrieval,
                where={"type": {"$nin": noise_types}},
                include=['metadatas', 'documents', 'distances']
            )
            if res and res.get('ids') and res['ids'][0]:
                for i in range(len(res['ids'][0])):
                    candidate_id = res['ids'][0][i]
                    current_collection_candidates_map[candidate_id] = {
                        "id": candidate_id,
                        "text_content": res['documents'][0][i],
                        "metadata": res['metadatas'][0][i],
                        "retrieval_score_distance": res['distances'][0][i],
                        "source_stage": f"S1_Semantic_{coll_name}"
                    }
        except Exception as e:
            logger.error(f"Chroma S1 semantic query failed for '{coll_name}': {e}")

        if is_about_query and len(target_collection_names) == 1:
            try:
                res_anc = coll.query(
                    query_embeddings=[query_emb],
                    n_results=5,
                    where={"$and": [
                        {"type": {"$in": ["title_document", "h1_heading", "pptx_slide_text", "heading_1", "excel_sheet_data", "image_ocr_text", "image_description"]}},
                        {"page_number": {"$lte": 2}}
                    ]},
                    include=['metadatas', 'documents', 'distances']
                )
                if res_anc and res_anc.get('ids') and res_anc['ids'][0]:
                    page_num_key = lambda x: x["metadata"].get("page_number", 999)
                    temp_anchors_from_coll = sorted([
                        {
                            "id": res_anc['ids'][0][i],
                            "text_content": res_anc['documents'][0][i],
                            "metadata": res_anc['metadatas'][0][i],
                            "retrieval_score_distance": res_anc['distances'][0][i],
                            "source_stage": f"S2_Anchor_{coll_name}"
                        } for i in range(len(res_anc['ids'][0]))
                    ], key=lambda x: (page_num_key(x), x["retrieval_score_distance"]))

                    for anchor_cand in temp_anchors_from_coll[:2]:
                        current_collection_candidates_map[anchor_cand['id']] = anchor_cand
                        anchors.append(anchor_cand)
            except Exception as e:
                logger.error(f"Chroma S2 anchor query failed for '{coll_name}': {e}")

        candidates.extend(list(current_collection_candidates_map.values()))

    final_anchors = list({ac['id']: ac for ac in anchors}.values())

    if not candidates:
        payload["formatted_context_for_llm"] = "No relevant context found in the document(s)."
        return payload

    selected_chunks = list(final_anchors)
    anchor_ids = {fa['id'] for fa in final_anchors}
    general_candidates = [c for c in candidates if c['id'] not in anchor_ids]

    slots_left_for_general = k_final_target - len(selected_chunks)

    if slots_left_for_general > 0 and general_candidates:
        if use_reranking and APP_STATE.get("reranker_model"):
            try:
                rerank_pairs = [[query_text, chunk['text_content']] for chunk in general_candidates]

                scores = APP_STATE["reranker_model"].predict(rerank_pairs, show_progress_bar=False)
                for chunk, score in zip(general_candidates, scores):
                    chunk['rerank_score'] = float(score)  # plain float keeps the payload JSON-serializable

                if is_about_query and len(target_collection_names) == 1:
                    for chunk in general_candidates:
                        boost = 0.0
                        meta_type = chunk.get("metadata",{}).get("type")
                        meta_page = chunk.get("metadata",{}).get("page_number", 999)
                        if meta_type == "title_document": boost = 5.0
                        elif meta_type in ["h1_heading", "heading_1", "pptx_slide_text"] and meta_page <= 2: boost = 2.5
                        elif meta_type in ["h2_heading", "heading_2"] and meta_page <= 3: boost = 1.0
                        elif meta_type in ["excel_sheet_data", "image_ocr_text", "image_description"] and meta_page <= 2: boost = 1.5
                        chunk['rerank_score'] = chunk.get('rerank_score', 0) + boost

                general_candidates.sort(key=lambda x: x.get('rerank_score', -float('inf')), reverse=True)
                payload["retrieval_metadata"]["reranked_general_candidates"] = True
            except Exception as e:
                logger.error(f"Reranking FAILED: {e}. Falling back to semantic scores for general candidates.")
                general_candidates.sort(key=lambda x: x.get('retrieval_score_distance', float('inf')))
                payload["retrieval_metadata"]["reranked_general_candidates"] = False
        else:
            general_candidates.sort(key=lambda x: x.get('retrieval_score_distance', float('inf')))
            payload["retrieval_metadata"]["reranked_general_candidates"] = False

        selected_chunks.extend(general_candidates[:slots_left_for_general])

    if not selected_chunks:
        payload["formatted_context_for_llm"] = "No relevant context found after selection and reranking."
        return payload

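    def sort_key_final(chunk):
        # The list is sorted with reverse=True, so a higher score comes first.
        # Without a reranker the negated distance is used; a chunk with no
        # distance gets -inf and therefore ranks last.
        if use_reranking and APP_STATE.get("reranker_model"):
            score = chunk.get('rerank_score', -float('inf'))
        else:
            score = -chunk.get('retrieval_score_distance', float('inf'))
        page = chunk.get('metadata', {}).get('page_number', 999)
        bbox_str = chunk.get('metadata', {}).get('bbox', None)
        y_order = float('inf')
        if bbox_str:
            try:
                bbox = json.loads(bbox_str) if isinstance(bbox_str, str) else bbox_str
                if isinstance(bbox, (list, tuple)) and len(bbox) >= 2:
                    y_order = bbox[1]
            except (TypeError, ValueError):
                # json.JSONDecodeError is a ValueError; keep the default order.
                pass
        return (score, -page, -y_order)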

    selected_chunks.sort(key=sort_key_final, reverse=True)

    formatted_context_parts = []
    for i, chunk in enumerate(selected_chunks):
        meta = chunk['metadata']
        orig_fname = meta.get('original_filename', 'UnknownDocument')
        doc_type = meta.get('document_type', 'document')
        bbox_val = meta.get("bbox", "N/A")

        content_for_llm = chunk.get('text_content', '')

        bbox_str_fmt = "N/A"
        if isinstance(bbox_val, str):
            try:
                parsed_bbox = json.loads(bbox_val)
                bbox_str_fmt = f"({parsed_bbox[0]:.0f},{parsed_bbox[1]:.0f},{parsed_bbox[2]:.0f},{parsed_bbox[3]:.0f})"
            except (ValueError, TypeError, IndexError, KeyError):
                bbox_str_fmt = bbox_val
        elif isinstance(bbox_val, (list, tuple)) and len(bbox_val) == 4:
            bbox_str_fmt = f"({bbox_val[0]:.0f},{bbox_val[1]:.0f},{bbox_val[2]:.0f},{bbox_val[3]:.0f})"


        header = (
            f"--- Context Source {i+1} (Doc:'{orig_fname}', Type:{doc_type}, SrcStage:{chunk.get('source_stage','N/A')}) ---\n"
            f"Page/Section: {meta.get('page_number','?')}, Type: {meta.get('type','?')}, ChunkID: {meta.get('chunk_id','N/A')}\n"
            f"Font Size: {float(meta.get('font_size') or 0):.0f}, Bold: {meta.get('is_bold',False)}, Semantic Heading: {meta.get('is_semantic_heading',False)}\n"
            f"BBox: {bbox_str_fmt}, Distance: {chunk.get('retrieval_score_distance',-1):.4f}"
        )
        if 'rerank_score' in chunk:
            header += f", RerankScore: {chunk['rerank_score']:.4f}"

        formatted_context_parts.append(f"{header}\nContent:\n{content_for_llm}\n--- End Src {i+1} ---")
        payload["source_chunks_retrieved"].append(chunk)

    payload["formatted_context_for_llm"] = "\n\n".join(formatted_context_parts)
    payload["retrieval_metadata"]["final_chunk_count"] = len(selected_chunks)

    logger.success(
        f"Context retrieval for '{query_text[:30]}...' in {time.time()-retrieval_start_time:.2f}s. "
        f"Final chunks: {len(selected_chunks)}."
    )
    return payload
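
# --- Illustrative sketch (not part of the original pipeline) ---
# The selection step in retrieve_relevant_context keeps every anchor chunk and then
# fills the remaining slots with the best-scoring general candidates.
# `_sketch_select_chunks` is a hypothetical helper that mirrors that idea on plain
# dicts. It assumes each item has an "id" and a "score" where higher is better.
from typing import Dict, List

def _sketch_select_chunks(anchors: List[Dict], candidates: List[Dict], k_final: int) -> List[Dict]:
    # De-duplicate anchors by id, keeping the order of first appearance.
    unique_anchors = list({a["id"]: a for a in anchors}.values())
    anchor_ids = {a["id"] for a in unique_anchors}
    # Anchors are excluded from the general pool so they are not selected twice.
    general = [c for c in candidates if c["id"] not in anchor_ids]
    general.sort(key=lambda c: c.get("score", float("-inf")), reverse=True)
    slots_left = max(0, k_final - len(unique_anchors))
    return unique_anchors + general[:slots_left]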

# --- 11. Response Generation ---
def _generate_llm_messages(
    user_query: str,
    context_str: str,
    document_identifier: str,
    dynamic_role_instructions: Optional[str],
    is_summary_context: bool,
) -> List[Dict]:
    system_content_parts = [
        f"You are {AI_PERSONA_NAME}, {AI_ROLE_DESCRIPTION}",
        AI_CORE_DIRECTIVE.replace("'{document_name}'", f"'{document_identifier}'"),
        AI_ANSWERING_STYLE_REFINED,
        AI_CONTEXTUAL_PRIORITIZATION_POLICY,
        AI_CITATION_POLICY_TEXT,
        AI_NO_ANSWER_POLICY,
    ]

    if is_summary_context:
        system_content_parts.append(AI_SUMMARIZATION_POLICY)
        context_header = "PROVIDED SUMMARY (Answer ONLY from this summary):"
    else:
        context_header = "PROVIDED CONTEXT (Answer ONLY from this):"

    if dynamic_role_instructions and dynamic_role_instructions.strip():
        system_content_parts.extend([
            "\n--- ADDITIONAL ROLE GUIDANCE ---",
            dynamic_role_instructions.strip(),
            "--- END GUIDANCE ---",
            "Strictly adhere to ALL guidance."
        ])

    system_message = {"role": "system", "content": "\n\n".join(system_content_parts).strip()}
    user_message = {"role": "user", "content": f"USER'S QUERY: {user_query.strip()}\n\n{context_header}\n{context_str.strip()}"}

    return [system_message, user_message]


def generate_standard_llm_response(
    user_query: str, formatted_context_from_rag: str,
    source_chunks_from_rag: List[Dict], doc_name_identifier: str,
    dynamic_role_info: Optional[str], response_payload: Dict
) -> Dict:
    messages = _generate_llm_messages(
        user_query, formatted_context_from_rag, doc_name_identifier,
        dynamic_role_info, is_summary_context=False
    )
    response_payload["llm_prompt_sent"] = messages

    try:
        logger.info(f"Sending RAG request to OpenAI ('{AZURE_LLM_MODEL_NAME}'). Query: '{user_query[:70]}...'")
        gen_start_time = time.time()

        if not APP_STATE.get("azure_llm_client"):
            logger.error("OpenAI LLM client instance not available.")
            response_payload.update({"answer_text": "Error: AI model not configured.", "error_message": "AI model not configured."})
            return response_payload

        openai_api_response = APP_STATE["azure_llm_client"].chat.completions.create(
            model=AZURE_LLM_MODEL_NAME,
            messages=messages,
            temperature=0.25,
            max_tokens=1000,
            top_p=1.0,
        )
        response_payload["generation_time_s"] = round(time.time() - gen_start_time, 2)
        logger.success(f"OpenAI LLM response in {response_payload['generation_time_s']:.2f}s.")

        try:
            response_payload["llm_full_response_obj_str"] = str(openai_api_response)
        except Exception:
            response_payload["llm_full_response_obj_str"] = "Could not serialize full OpenAI response object."

        if openai_api_response.choices and openai_api_response.choices[0].message:
            generated_text = (openai_api_response.choices[0].message.content or "").strip()
            response_payload["answer_text"] = generated_text

            citations_found_raw = re.findall(r'(\[Doc:[^\]]+?CkID:[^\]]+?\])', generated_text)
            if citations_found_raw:
                logger.warning(
                    f"LLM included {len(citations_found_raw)} bracketed citations despite instruction. "
                    "These should not be displayed in the final answer if the prompt was followed."
                )
            response_payload["parsed_citations"] = [{"tag": c} for c in citations_found_raw]
        else:
            logger.error("OpenAI API response did not contain expected message content.")
            response_payload.update({
                "answer_text": "Error: Could not parse text from AI response.",
                "error_message": "OpenAI response parsing error: No choices or message found."
            })

    except Exception as e:
        logger.opt(exception=True).error(f"General error during OpenAI API call: {e}")
        response_payload.update({
            "answer_text": f"Error: LLM call failed ({type(e).__name__}).",
            "error_message": str(e)
        })

    response_payload.setdefault("answer_text", "Error: Processing failed to produce an answer.")
    response_payload.setdefault("parsed_citations", [])
    return response_payload
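
# --- Illustrative sketch (not part of the original pipeline) ---
# The pattern used in generate_standard_llm_response to detect bracketed citations
# can be tried in isolation. `_sketch_find_citation_tags` is a hypothetical helper;
# the "[Doc:... CkID:...]" layout in the test input is inferred from the pattern.
import re

def _sketch_find_citation_tags(text: str) -> list:
    """Return every '[Doc:...CkID:...]' tag found in `text`."""
    return re.findall(r'(\[Doc:[^\]]+?CkID:[^\]]+?\])', text)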

def generate_llm_response(
    user_query: str, formatted_context_from_rag: str,
    source_chunks_from_rag: List[Dict], default_doc_name: str = "the document",
    dynamic_role_info: Optional[str] = None, is_aboutness_query_flag: bool = False
) -> Dict:
    response_payload = {
        "answer_text": "Error: LLM response could not be generated.",
        "llm_prompt_sent": "",
        "llm_full_response_obj_str": None,
        "parsed_citations": [],
        "error_message": None,
        "generation_time_s": 0.0,
        "summarization_chain_used": False,
    }

    required_llm_keys = ["llm_initialized", "azure_llm_client", "langchain_llm"]
    if not all(APP_STATE.get(k) for k in required_llm_keys):
        logger.error("LLM systems not initialized (OpenAI client or LangChain LLM missing).")
        response_payload["error_message"] = "LLM systems not initialized."
        return response_payload

    doc_names = sorted({
        c['metadata'].get('original_filename', default_doc_name)
        for c in source_chunks_from_rag if c.get('metadata')
    })
    doc_id_for_prompt = default_doc_name
    if doc_names:
        if len(doc_names) == 1:
            doc_id_for_prompt = f"document '{doc_names[0]}'"
        else:
            doc_id_for_prompt = f"documents ({', '.join(doc_names)})"

    context_total_text = "\n\n".join([
        c.get("original_chunk_text", c.get("text_content",""))
        for c in source_chunks_from_rag
    ])
    context_token_count = count_tokens(context_total_text)

    if is_aboutness_query_flag and context_token_count > MAX_TOKENS_FOR_DIRECT_LLM_SUMMARIZATION and source_chunks_from_rag:
        logger.info(
            f"Aboutness query with large context ({context_token_count} tokens > {MAX_TOKENS_FOR_DIRECT_LLM_SUMMARIZATION}). "
            "Attempting LangChain map-reduce summarization."
        )
        response_payload["summarization_chain_used"] = True

        langchain_documents = [
            Document(page_content=chunk.get("original_chunk_text", chunk.get("text_content","")), metadata=chunk.get("metadata", {}))
            for chunk in source_chunks_from_rag if chunk.get("original_chunk_text", chunk.get("text_content","")).strip()
        ]

        if not langchain_documents:
            logger.warning("No valid documents for LangChain summarization. Falling back to standard generation.")
            response_payload["summarization_chain_used"] = False
            return generate_standard_llm_response(
                user_query, formatted_context_from_rag, source_chunks_from_rag,
                doc_id_for_prompt, dynamic_role_info, response_payload
            )
        try:
            summarize_chain = load_summarize_chain(
                llm=APP_STATE["langchain_llm"], chain_type="map_reduce", verbose=False
            )
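            # Note: the stock map_reduce summarize prompts only take `text`, so
            # `question` is likely ignored unless custom map/combine prompts are
            # passed to load_summarize_chain.
            summarization_question = (
                f"Provide a comprehensive summary of the key information in '{doc_id_for_prompt}' "
                f"that is relevant to the user's query: '{user_query}'"
            )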

            start_time_lc_summarize = time.time()
            summary_result = summarize_chain.invoke({
                "input_documents": langchain_documents,
                "question": summarization_question
            })
            response_payload["generation_time_s"] = round(time.time() - start_time_lc_summarize, 2)

            raw_summary_text = (summary_result.get("output_text") or "").strip()
            logger.info(
                f"LangChain summarization finished in {response_payload['generation_time_s']:.2f}s. "
                f"Raw summary length: {len(raw_summary_text)} characters."
            )

            # Only an empty summary counts as failure; a valid summary may
            # legitimately contain the word "Error:".
            if not raw_summary_text:
                logger.warning("LangChain summary was empty. Falling back to standard generation.")
                response_payload["summarization_chain_used"] = False
                return generate_standard_llm_response(
                    user_query, formatted_context_from_rag, source_chunks_from_rag,
                    doc_id_for_prompt, dynamic_role_info, response_payload
                )

            messages_refine = _generate_llm_messages(
                user_query, raw_summary_text, doc_id_for_prompt,
                dynamic_role_info, is_summary_context=True
            )
            response_payload["llm_prompt_sent"] = messages_refine

            start_time_final_refine = time.time()
            final_openai_response = APP_STATE["azure_llm_client"].chat.completions.create(
                model=AZURE_LLM_MODEL_NAME,
                messages=messages_refine,
                temperature=0.25,
                max_tokens=1000,
                top_p=1.0,
            )
            response_payload["generation_time_s"] += round(time.time() - start_time_final_refine, 2)
            response_payload["llm_full_response_obj_str"] = str(final_openai_response)

            if final_openai_response.choices and final_openai_response.choices[0].message:
                refined_answer_text = (final_openai_response.choices[0].message.content or "").strip()
                response_payload["answer_text"] = refined_answer_text + f"\n\n(This answer is based on a summary of {doc_id_for_prompt}.)"
                logger.info("Summary refined by main LLM.")
                response_payload["parsed_citations"] = []
            else:
                logger.error("OpenAI summary refinement response did not contain expected message content.")
                response_payload.update({
                    "answer_text": "Error: Could not parse text from AI's refined summary.",
                    "error_message": "OpenAI summary refinement parsing error: No choices or message found."
                })

        except Exception as e_lc_summarize:
            logger.opt(exception=True).error(f"LangChain summarization process FAILED: {e_lc_summarize}. Falling back to standard generation.")
            response_payload.update({
                "error_message": f"Summarization chain error: {str(e_lc_summarize)}. Fallback executed.",
                "summarization_chain_used": False
            })
            return generate_standard_llm_response(
                user_query, formatted_context_from_rag, source_chunks_from_rag,
                doc_id_for_prompt, dynamic_role_info, response_payload
            )
    else:
        return generate_standard_llm_response(
            user_query, formatted_context_from_rag, source_chunks_from_rag,
            doc_id_for_prompt, dynamic_role_info, response_payload
        )

    response_payload.setdefault("answer_text", "Error: Summarization path failed to produce a final answer.")
    response_payload.setdefault("parsed_citations", [])
    return response_payload
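
# --- Illustrative sketch (not part of the original pipeline) ---
# generate_llm_response routes a request to map-reduce summarization only when the
# query asks what a document is about, the context exceeds the direct-prompt token
# budget, and at least one chunk was retrieved. `_sketch_choose_generation_path`
# and its return labels are hypothetical and only restate that condition.
def _sketch_choose_generation_path(
    is_aboutness_query: bool, context_tokens: int, max_direct_tokens: int, has_chunks: bool
) -> str:
    if is_aboutness_query and context_tokens > max_direct_tokens and has_chunks:
        return "map_reduce_summary"
    return "standard"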

# --- One-time Initialization for AI/DB systems ---
def perform_initial_setup():
    logger.info("Flask App: Performing one-time initial setup for AI/DB systems...")
    if not AZURE_LLM_API_KEY_SECRET:
        logger.critical("AZURE_LLM_API_KEY_SECRET MISSING. APP CANNOT FUNCTION.")
        return False

    all_ok = True
    if not initialize_rag_models_from_kb():
        all_ok = False
    if not initialize_llm_model():
        all_ok = False
    if not initialize_chromadb_from_kb():
        all_ok = False
    if HF_API_TOKEN_SECRET and not initialize_hf_client_from_parser():
        logger.warning("HF Vision client (for parser) initialization failed or was skipped. Vision features might be limited.")

    if all_ok:
        logger.success("Flask App: All critical AI/DB systems initialized.")
    else:
        logger.error("Flask App: CRITICAL - One or more essential AI/DB system initializations FAILED.")
    return all_ok

APP_SYSTEMS_READY = perform_initial_setup()

# --- 12. Flask App Definition and Endpoints ---
BASE_DIR = os.getcwd()
TEMPLATES_DIR = os.path.join(BASE_DIR, 'templates')
STATIC_DIR = os.path.join(BASE_DIR, 'static')
STAGED_UPLOADS_DIR = os.path.join(BASE_DIR, 'colab_staged_uploads_final_nocite')

os.makedirs(TEMPLATES_DIR, exist_ok=True)
os.makedirs(STATIC_DIR, exist_ok=True)
os.makedirs(os.path.join(STATIC_DIR, 'js'), exist_ok=True)
os.makedirs(STAGED_UPLOADS_DIR, exist_ok=True)

app = Flask(__name__, template_folder=TEMPLATES_DIR, static_folder=STATIC_DIR)
CORS(app)

# --- HTML Templates (a minimal placeholder is written if a template file is missing) ---

@app.route('/')
def hero_page_route():
    hero_html_path = os.path.join(TEMPLATES_DIR, 'hero-geometric.html')
    if not os.path.exists(hero_html_path):
        with open(hero_html_path, 'w') as f:
            f.write('<!DOCTYPE html><html><head><title>DocuMind AI</title><style>body{font-family: sans-serif; display: flex; flex-direction: column; align-items: center; justify-content: center; height: 100vh; margin: 0; background-color: #f0f2f5;} h1{color: #333;} a{text-decoration: none; padding: 10px 20px; background-color: #007bff; color: white; border-radius: 5px;}</style></head><body><h1>Welcome to DocuMind AI</h1><p>Your intelligent document assistant.</p><a href="/chatui">Start Chatting</a></body></html>')
    return render_template('hero-geometric.html')

@app.route('/chatui')
def chat_page_route():
    chat_html_path = os.path.join(TEMPLATES_DIR, 'chat.html')
    if not os.path.exists(chat_html_path):
        with open(chat_html_path, 'w') as f:
            f.write('<!DOCTYPE html><html><head><title>DocuMind AI Chat</title><style>body{font-family: sans-serif; padding: 20px; background-color: #f0f2f5;} h1{color: #333; text-align: center;}</style></head><body><h1>Chat with Your Documents</h1><div><!-- Basic chat UI elements would be added here by frontend JS --> <p>Chat interface placeholder. Real UI would be built with JavaScript.</p> <a href="/">Back to Home</a></div></body></html>')
    return render_template('chat.html')

@app.route('/stage_file', methods=['POST'])
def stage_single_file_route():
    logger.info("--- API Call: /stage_file ---")
    if 'file' not in request.files:
        return jsonify({"error": "No 'file' part in the request."}), 400

    file_obj = request.files['file']
    original_filename = request.form.get('originalFilename', file_obj.filename)

    if not file_obj or not original_filename:
        return jsonify({"error": "No file provided or filename is missing."}), 400

    safe_original_filename = re.sub(r'[^a-zA-Z0-9._-]', '_', original_filename)

    staging_id = f"staged_{uuid.uuid4().hex[:8]}_{safe_original_filename}"
    staged_file_path = os.path.join(STAGED_UPLOADS_DIR, staging_id)

    try:
        file_obj.save(staged_file_path)
        file_size = os.path.getsize(staged_file_path)
        APP_WEB_STATE["staged_files"][staging_id] = {
            "original_filename": original_filename,
            "staged_path": staged_file_path,
            "status": "staged",
            "size": file_size
        }
        logger.info(f"  Staged '{original_filename}' (ID:{staging_id}, Size:{file_size}B) to '{staged_file_path}'.")
        return jsonify({
            "message": "File staged successfully.",
            "original_filename": original_filename,
            "staging_id": staging_id,
            "status": "staged"
        }), 200
    except Exception as e:
        logger.opt(exception=True).error(f"  Error staging '{original_filename}': {e}")
        return jsonify({"error": f"File staging failed: {str(e)}"}), 500
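
# --- Illustrative sketch (not part of the original routes) ---
# How /stage_file derives a filesystem-safe staging ID from a client-supplied
# filename: unsafe characters become underscores and a short random prefix keeps
# IDs unique. `_sketch_make_staging_id` is a hypothetical helper that no route calls.
import re
import uuid

def _sketch_make_staging_id(original_filename: str) -> str:
    # Keep letters, digits, dot, underscore and hyphen; replace everything else.
    safe_name = re.sub(r'[^a-zA-Z0-9._-]', '_', original_filename)
    return f"staged_{uuid.uuid4().hex[:8]}_{safe_name}"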

@app.route('/process_staged_files', methods=['POST'])
def process_staged_files_route():
    logger.info("--- API Call: /process_staged_files ---")
    if not APP_SYSTEMS_READY:
        logger.error("Backend AI/DB systems are not ready. Cannot process files.")
        return jsonify({"error": "Backend systems are not ready. Please try again later."}), 503

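    # silent=True returns None for a missing or malformed JSON body, so the
    # 400 branch below handles it instead of Flask raising its own error.
    data = request.get_json(silent=True)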
    if not data:
        logger.warning("/process_staged_files: No JSON data received.")
        return jsonify({"error": "Invalid JSON payload."}), 400

    staging_ids = data.get('staging_ids', [])
    options = data.get('options', {})
    logger.info(f"  Processing options: {options}. Staging IDs: {staging_ids}")

    processed_docs_summary = []
    errors_during_processing = []

    for staging_id in staging_ids:
        staged_info = APP_WEB_STATE["staged_files"].get(staging_id)
        if not staged_info or staged_info.get("status") != "staged":
            err_msg = f"File with staging ID '{staging_id}' not found or not in 'staged' state."
            # staged_info is None when the ID is unknown, so guard the lookup.
            errors_during_processing.append({"staging_id": staging_id, "filename": (staged_info or {}).get("original_filename", "Unknown"), "error": err_msg})
            logger.warning(f"  {err_msg}")
            continue

        path_to_staged_file = staged_info["staged_path"]
        actual_original_filename = staged_info["original_filename"]
        logger.info(f"  Processing '{actual_original_filename}' (ID: {staging_id}) from '{path_to_staged_file}'")

        profile_to_use = "fastest"
        if options.get("handwritten", False) or options.get("images", False):
            profile_to_use = "comprehensive_vision" if APP_STATE.get("vision_client_initialized") else "digital_plus_ocr"
        elif options.get("ocr", False):
            profile_to_use = "digital_plus_ocr"
        elif not options:
            profile_to_use = "default_fallback"

        logger.info(f"    Effective parsing profile for '{actual_original_filename}': '{profile_to_use}'")

        try:
            parsed_blocks = parse_document_via_profile(path_to_staged_file, profile_name=profile_to_use)

            if parsed_blocks and parsed_blocks[0].get("type") == "error":
                err_content = parsed_blocks[0].get('content', 'Unknown parsing error')
                errors_during_processing.append({"staging_id": staging_id, "filename": actual_original_filename, "error": f"Parsing/Conversion failed: {err_content}"})
                staged_info["status"] = "parse_failure"
                logger.error(f"    Parsing/Conversion failed for '{actual_original_filename}': {err_content}")
                continue

            base_collection_name = f"{CHROMA_COLLECTION_NAME_PREFIX}_{os.path.splitext(actual_original_filename)[0]}_{uuid.uuid4().hex[:6]}"
            final_collection_name = sanitize_chromadb_collection_name(base_collection_name)

            target_collection_obj = get_or_create_collection_cached(final_collection_name)
            if not target_collection_obj:
                err_msg = f"Failed to get or create ChromaDB collection '{final_collection_name}'."
                errors_during_processing.append({"staging_id": staging_id, "filename": actual_original_filename, "error": err_msg})
                staged_info["status"] = "db_collection_failure"
                logger.error(f"    {err_msg} for '{actual_original_filename}'")
                continue

            ingestion_successful = ingest_document_into_knowledge_base(
                parsed_blocks, final_collection_name, actual_original_filename, target_collection_obj
            )

            if ingestion_successful:
                APP_WEB_STATE["processed_collections"][actual_original_filename] = final_collection_name
                staged_info.update({"status": "processed", "collection_name": final_collection_name})
                processed_docs_summary.append({
                    "filename": actual_original_filename,
                    "collection_name": final_collection_name,
                    "staging_id": staging_id,
                    "status": "processed"
                })
                logger.info(f"    Successfully ingested '{actual_original_filename}' into collection '{final_collection_name}'.")
                try:
                    os.remove(path_to_staged_file)
                    logger.info(f"    Removed staged file '{path_to_staged_file}'.")
                    if staging_id in APP_WEB_STATE["staged_files"]:
                        del APP_WEB_STATE["staged_files"][staging_id]
                except Exception as e_remove:
                    logger.warning(f"    Could not remove staged file '{path_to_staged_file}': {e_remove}")
            else:
                errors_during_processing.append({"staging_id": staging_id, "filename": actual_original_filename, "error": "Ingestion into knowledge base failed."})
                staged_info["status"] = "ingestion_failure"
                logger.error(f"    Ingestion failed for '{actual_original_filename}' into '{final_collection_name}'.")

        except Exception as e_pipeline:
            error_message = f"Unhandled error in processing pipeline for '{actual_original_filename}': {str(e_pipeline)}"
            errors_during_processing.append({"staging_id": staging_id, "filename": actual_original_filename, "error": error_message})
            if staged_info: staged_info["status"] = "pipeline_error"
            # loguru has no exc_info keyword; opt(exception=True) attaches the traceback.
            logger.opt(exception=True).error(f"    {error_message}")

    logger.info(f"--- /process_staged_files call ended. Successfully processed: {len(processed_docs_summary)}, Errors: {len(errors_during_processing)} ---")

    if not processed_docs_summary and errors_during_processing:
        return jsonify({"error": "All file processing attempts failed.", "details": errors_during_processing}), 500

    return jsonify({
        "message": "File processing finished.",
        "processed_documents": processed_docs_summary,
        "errors": errors_during_processing
    }), 200


@app.route('/ask', methods=['POST'])
def ask_question_route():
    logger.info("--- API Call: /ask ---")
    if not APP_SYSTEMS_READY:
        logger.error("Backend AI/DB systems not ready for /ask call.")
        return jsonify({"error": "Backend systems are not ready. Please try again later."}), 503

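    # silent=True returns None for a missing or malformed JSON body, so the 400 below handles it.
    data = request.get_json(silent=True)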
    if not data or 'query' not in data:
        logger.warning("/ask: Missing 'query' in JSON payload.")
        return jsonify({"error": "Missing 'query' in request payload."}), 400

    user_query = data['query']
    logger.info(f"  Received query for /ask: '{user_query[:100]}...'")

    active_document_filenames = data.get('active_documents', [])
    collections_to_query_names = []
    query_scope_description = "all processed documents"

    if active_document_filenames:
        logger.info(f"  Frontend specified active documents: {active_document_filenames}")
        valid_collections_for_active_docs = []
        all_active_docs_found = True
        for fname in active_document_filenames:
            coll_name = APP_WEB_STATE["processed_collections"].get(fname)
            if coll_name:
                valid_collections_for_active_docs.append(coll_name)
            else:
                logger.warning(f"    Document '{fname}' specified as active, but not found in processed collections. Query will default to all processed documents.")
                all_active_docs_found = False
                break

        if all_active_docs_found and valid_collections_for_active_docs:
            collections_to_query_names = valid_collections_for_active_docs
            if len(active_document_filenames) == 1:
                query_scope_description = f"document '{active_document_filenames[0]}'"
            else:
                query_scope_description = f"documents: {', '.join(active_document_filenames)}"
            logger.info(f"  Query will target specific collections: {collections_to_query_names}")
        else:
            collections_to_query_names = list(APP_WEB_STATE["processed_collections"].values())
            logger.info(f"  Fallback: Querying ALL {len(collections_to_query_names)} processed collections.")
    else:
        collections_to_query_names = list(APP_WEB_STATE["processed_collections"].values())
        logger.info(f"  No specific documents indicated by frontend. Querying ALL {len(collections_to_query_names)} processed collections.")

    if not collections_to_query_names:
        if not APP_WEB_STATE["processed_collections"]:
            return jsonify({"answer_text": "No documents have been processed yet. Please upload and process a document first.", "parsed_citations": []}), 400
        else:
            logger.error("Internal state error: No collections to query despite having processed documents, or collections list became empty.")
            return jsonify({"answer_text": "Error: No valid document collections available to query at this time.", "parsed_citations": []}), 500

    is_aboutness_type_query = any(
        phrase in user_query.lower() for phrase in ["about this document", "summarize", "overview", "main idea", "tell me about"]
    )
    if is_aboutness_type_query:
        logger.info("  Query detected as an 'aboutness' or summarization request.")

    retrieval_results = retrieve_relevant_context(
        user_query,
        collections_to_query_names,
        use_reranking=(APP_STATE.get("reranker_model") is not None)
    )

    formatted_context = retrieval_results.get("formatted_context_for_llm", "")
    source_chunks = retrieval_results.get("source_chunks_retrieved", [])

    if "Error:" in formatted_context or not source_chunks:
        logger.warning(f"  Context retrieval yielded no usable chunks or an error: {formatted_context[:200]}")

    llm_generation_output = generate_llm_response(
        user_query,
        formatted_context,
        source_chunks,
        default_doc_name=query_scope_description,
        is_aboutness_query_flag=is_aboutness_type_query
    )

    final_answer_text = llm_generation_output.get("answer_text", "Error: AI response generation failed or produced no text.")
    error_message_from_llm = llm_generation_output.get("error_message")

    if error_message_from_llm and "Error:" in final_answer_text:
        logger.warning(f"  LLM generation resulted in an error message displayed as answer: '{final_answer_text}'. Underlying error: {error_message_from_llm}")
    elif final_answer_text == AI_NO_ANSWER_POLICY or not final_answer_text.strip() or "Error: AI" in final_answer_text:
        final_answer_text = "I'm sorry, I couldn't find specific information to answer your question in the provided document(s)."
        logger.info("  AI indicated no answer or returned an empty/error state; providing a polite 'no information' message.")

    logger.info(f"--- /ask call finished. Answer length: {len(final_answer_text)}. Summarization used: {llm_generation_output.get('summarization_chain_used', False)} ---")

    return jsonify({
        "answer_text": final_answer_text,
        "parsed_citations": llm_generation_output.get("parsed_citations", []),
        "llm_prompt_sent": "REDACTED_IN_RESPONSE",
        "error_message": error_message_from_llm,
        "summarization_chain_used": llm_generation_output.get("summarization_chain_used", False)
    })

@app.route('/list_processed_documents_for_chat', methods=['GET'])
def list_processed_docs_for_chat_api():
    processed_filenames = list(APP_WEB_STATE["processed_collections"].keys())
    logger.info(f"API Call: /list_processed_documents_for_chat. Returning {len(processed_filenames)} document filenames.")
    return jsonify({"processed_document_filenames": processed_filenames}), 200

# --- Ngrok and App Run Function ---
public_url_cell4_final = None

def start_flask_app_cell4_final():
    global public_url_cell4_final
    if public_url_cell4_final:
        try:
            ngrok.disconnect(public_url_cell4_final)
            logger.info("Disconnected previous ngrok tunnel.")
        except Exception as e_ngrok_disc:
            logger.warning(f"Could not disconnect previous ngrok tunnel: {e_ngrok_disc}")
        public_url_cell4_final = None

    try:
        NGROK_AUTHTOKEN_SECRET = userdata.get('NGROK_AUTHTOKEN')
        if NGROK_AUTHTOKEN_SECRET:
            conf.get_default().auth_token = NGROK_AUTHTOKEN_SECRET
            logger.info("Ngrok authentication token set successfully.")
        else:
            logger.warning("NGROK_AUTHTOKEN not found in Colab secrets. Ngrok might be rate-limited or fail for non-authenticated users.")
    except userdata.SecretNotFoundError:
        logger.warning("NGROK_AUTHTOKEN secret not found. Proceeding without ngrok authentication.")
    except Exception as e_ngrok_auth:
        logger.warning(f"An error occurred while setting ngrok authtoken: {e_ngrok_auth}")

    if not APP_SYSTEMS_READY:
        logger.critical("CRITICAL: Backend AI/DB systems FAILED initialization. Flask application will not run effectively or may fail.")
        print("Flask app cannot start because critical backend systems are not ready. Check logs for errors.")
        return

    try:
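        # ngrok.connect() returns an NgrokTunnel; keep its URL string, which ngrok.disconnect() expects.
        public_url_cell4_final = ngrok.connect(5000).public_url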
        logger.info(f"Flask application (Merged FINAL Version - NoCite) is accessible at: {public_url_cell4_final}")
        print(f" * Ngrok tunnel (FINAL NoCite Version) \"{public_url_cell4_final}\" -> \"http://127.0.0.1:5000\"")

        app.run(host='0.0.0.0', port=5000, debug=False, use_reloader=False)

    except Exception as e_flask_run:
        # loguru has no exc_info keyword; opt(exception=True) attaches the traceback.
        logger.opt(exception=True).error(f"Error starting Flask application or ngrok tunnel (FINAL NoCite Version): {e_flask_run}")
        if public_url_cell4_final:
            try:
                ngrok.disconnect(public_url_cell4_final)
            except Exception as e_disc_fatal:
                logger.error(f"Failed to disconnect ngrok tunnel during error handling: {e_disc_fatal}")
        try:
            ngrok.kill()
        except Exception as e_kill_fatal:
            logger.error(f"Failed to kill ngrok processes during error handling: {e_kill_fatal}")
        raise

if __name__ == "__main__":
    start_flask_app_cell4_final()

2025-06-01 09:16:18.063 | INFO     | __main__:<cell line: 0>:62 - AskYourDoc Merged Flask App Starting. Log file: ask_your_doc_flask_merged_app.log
2025-06-01 09:16:18.064 | INFO     | __main__:<cell line: 0>:68 - HF_TOKEN loaded.
2025-06-01 09:16:18.064 | INFO     | __main__:<cell line: 0>:75 - Azure Embedding Model: text-embedding-3-large configured.
2025-06-01 09:16:18.064 | INFO     | __main__:<cell line: 0>:83 - Azure LLM Model: gpt-4o configured.
2025-06-01 09:16:18.079 | INFO     | __main__:perform_initial_setup:1598 - Flask App: Performing one-time initial setup for AI/DB systems...
2025-06-01 09:16:18.080 | INFO     | __main__:initialize_rag_models_from_kb:253 - Loading Azure Embeddings client: text-embedding-3-large...
2025-06-01 09:16:18.081 | SUCCESS  | __main__:initialize_rag_models_from_kb:259 - Azure Embeddings client for 'text-embedding-3-large' initialized.
2025-06-01 09:16:18.081 | INFO     | __main__:initialize_rag_models_from_kb:266 - Loading reranker: cross-encoder

 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://172.28.0.12:5000
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug:127.0.0.1 - - [01/Jun/2025 09:18:24] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [01/Jun/2025 09:18:26] "GET /favicon.ico HTTP/1.1" 404 -


2025-06-01 09:18:52.833 | INFO     | __main__:stage_single_file_route:1655 - --- API Call: /stage_file ---


INFO:werkzeug:127.0.0.1 - - [01/Jun/2025 09:18:53] "POST /stage_file HTTP/1.1" 200 -


2025-06-01 09:18:53.223 | INFO     | __main__:stage_single_file_route:1679 -   Staged 'HastiVadariyaResume (1) (1).pdf' (ID:staged_81a5e222_HastiVadariyaResume__1___1_.pdf, Size:114517B) to '/content/colab_staged_uploads_final_nocite/staged_81a5e222_HastiVadariyaResume__1___1_.pdf'.




2025-06-01 09:19:02.792 | INFO     | __main__:process_staged_files_route:1692 - --- API Call: /process_staged_files ---
2025-06-01 09:19:02.794 | INFO     | __main__:process_staged_files_route:1704 -   Processing options: {'textOnly': True, 'ocr': False, 'handwritten': False, 'images': False}. Staging IDs: ['staged_81a5e222_HastiVadariyaResume__1___1_.pdf']
2025-06-01 09:19:02.795 | INFO     | __main__:process_staged_files_route:1719 -   Processing 'HastiVadariyaResume (1) (1).pdf' (ID: staged_81a5e222_HastiVadariyaResume__1___1_.pdf) from '/content/colab_staged_uploads_final_nocite/staged_81a5e222_HastiVadariyaResume__1___1_.pdf'
2025-06-01 09:19:02.795 | INFO     | __main__:process_staged_files_route:1729 -     Effective parsing profile for 'HastiVadariyaResume (1) (1).pdf': 'fastest'
2025-06-01 09:19:02.795 | INFO     | __main__:parse_document_via_profile:844 - Initiating parsing process for document 'staged_81a5e222_HastiVadariyaResume__1___1_.pdf' with profile: 'fastest'
2025-06-0



2025-06-01 09:19:03.648 | INFO     | __main__:parse_pdf_content_worker:533 -   Doc stats: Pages:1, AvgFont:9.6, StdFont:6.0
2025-06-01 09:19:03.685 | INFO     | __main__:parse_pdf_content_worker:765 -   Worker: Finished parsing 'staged_81a5e222_HastiVadariyaResume__1___1_.pdf' (originally pdf). Total blocks extracted: 32
2025-06-01 09:19:03.686 | SUCCESS  | __main__:parse_document_via_profile:893 - Profile 'fastest' parsing for 'staged_81a5e222_HastiVadariyaResume__1___1_.pdf' (converted from pdf) in 0.89s. Blocks: 32.
2025-06-01 09:19:03.715 | INFO     | __main__:get_or_create_collection_cached:317 - Collection 'askdocmerged_hastivadariyaresume_1_1__80d0ed' accessed/created. Items: 0
2025-06-01 09:19:03.715 | INFO     | __main__:ingest_document_into_knowledge_base:1129 - Ingesting 'HastiVadariyaResume (1) (1).pdf' into collection: askdocmerged_hastivadariyaresume_1_1__80d0ed
2025-06-01 09:19:03.716 | INFO     | __main__:_chunk_parsed_blocks:1049 - Chunking for 'HastiVadariyaResume (1)

INFO:werkzeug:127.0.0.1 - - [01/Jun/2025 09:19:05] "POST /process_staged_files HTTP/1.1" 200 -


2025-06-01 09:19:05.633 | SUCCESS  | __main__:_store_chunks_in_chromadb:1118 - Stored 32 chunks. Collection 'askdocmerged_hastivadariyaresume_1_1__80d0ed' total: 32.
2025-06-01 09:19:05.633 | SUCCESS  | __main__:ingest_document_into_knowledge_base:1151 - Ingest 'HastiVadariyaResume (1) (1).pdf' (askdocmerged_hastivadariyaresume_1_1__80d0ed) in 1.92s. Stored: 32.
2025-06-01 09:19:05.635 | INFO     | __main__:process_staged_files_route:1765 -     Successfully ingested 'HastiVadariyaResume (1) (1).pdf' into collection 'askdocmerged_hastivadariyaresume_1_1__80d0ed'.
2025-06-01 09:19:05.636 | INFO     | __main__:process_staged_files_route:1768 -     Removed staged file '/content/colab_staged_uploads_final_nocite/staged_81a5e222_HastiVadariyaResume__1___1_.pdf'.
2025-06-01 09:19:05.636 | INFO     | __main__:process_staged_files_route:1785 - --- /process_staged_files call ended. Successfully processed: 1, Errors: 0 ---


INFO:werkzeug:127.0.0.1 - - [01/Jun/2025 09:19:06] "GET /chatui HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [01/Jun/2025 09:19:07] "GET /static/js/documents.js HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [01/Jun/2025 09:19:07] "GET /static/js/chat.js HTTP/1.1" 200 -


2025-06-01 09:19:15.558 | INFO     | __main__:ask_question_route:1799 - --- API Call: /ask ---
2025-06-01 09:19:15.559 | INFO     | __main__:ask_question_route:1810 -   Received query for /ask: 'hey whats the document abot...'
2025-06-01 09:19:15.559 | INFO     | __main__:ask_question_route:1817 -   Frontend specified active documents: ['HastiVadariyaResume (1) (1).pdf']
2025-06-01 09:19:15.559 | INFO     | __main__:ask_question_route:1835 -   Query will target specific collections: ['askdocmerged_hastivadariyaresume_1_1__80d0ed']
2025-06-01 09:19:15.896 | SUCCESS  | __main__:retrieve_relevant_context:1355 - Context retrieval for 'hey whats the document abot...' in 0.34s. Final chunks: 12.
2025-06-01 09:19:15.897 | INFO     | __main__:generate_standard_llm_response:1410 - Sending RAG request to OpenAI ('gpt-4o'). Query: 'hey whats the document abot...'


INFO:werkzeug:127.0.0.1 - - [01/Jun/2025 09:19:19] "POST /ask HTTP/1.1" 200 -


2025-06-01 09:19:19.272 | SUCCESS  | __main__:generate_standard_llm_response:1426 - OpenAI LLM response in 3.37s.
2025-06-01 09:19:19.273 | INFO     | __main__:ask_question_route:1885 - --- /ask call finished. Answer length: 428. Summarization used: False ---
2025-06-01 09:19:45.482 | INFO     | __main__:ask_question_route:1799 - --- API Call: /ask ---
2025-06-01 09:19:45.484 | INFO     | __main__:ask_question_route:1810 -   Received query for /ask: 'can you tell me summary of her current job...'
2025-06-01 09:19:45.484 | INFO     | __main__:ask_question_route:1817 -   Frontend specified active documents: ['HastiVadariyaResume (1) (1).pdf']
2025-06-01 09:19:45.484 | INFO     | __main__:ask_question_route:1835 -   Query will target specific collections: ['askdocmerged_hastivadariyaresume_1_1__80d0ed']
2025-06-01 09:19:45.645 | SUCCESS  | __main__:retrieve_relevant_context:1355 - Context retrieval for 'can you tell me summary of her...' in 0.16s. Final chunks: 12.
2025-06-01 09:19:45.646

INFO:werkzeug:127.0.0.1 - - [01/Jun/2025 09:19:50] "POST /ask HTTP/1.1" 200 -


2025-06-01 09:19:50.025 | SUCCESS  | __main__:generate_standard_llm_response:1426 - OpenAI LLM response in 4.38s.
2025-06-01 09:19:50.026 | INFO     | __main__:ask_question_route:1885 - --- /ask call finished. Answer length: 1374. Summarization used: False ---
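
The log lines "Frontend specified active documents" and "Query will target specific collections" come from the scope-resolution step inside `/ask`. That step can be pulled out of the route and tested without Flask or ChromaDB. The sketch below is one way to do that, not the notebook's actual code: the function name `resolve_query_scope` and the sample filenames and collection names are hypothetical, and only the fallback rule (any unknown filename means every processed collection is queried) is taken from the route above.

```python
from typing import Dict, List, Tuple


def resolve_query_scope(
    processed_collections: Dict[str, str],
    active_documents: List[str],
) -> Tuple[List[str], str]:
    """Return (collection names to query, human-readable scope description)."""
    all_collections = list(processed_collections.values())
    default_scope = "all processed documents"

    # No active documents sent by the frontend: query everything.
    if not active_documents:
        return all_collections, default_scope

    selected = []
    for filename in active_documents:
        collection = processed_collections.get(filename)
        if not collection:
            # A single unknown filename falls back to every processed collection.
            return all_collections, default_scope
        selected.append(collection)

    if len(active_documents) == 1:
        return selected, f"document '{active_documents[0]}'"
    return selected, f"documents: {', '.join(active_documents)}"


# Hypothetical state for illustration only; these names are made up.
processed = {
    "report.pdf": "askdoc_report_a1b2c3",
    "notes.docx": "askdoc_notes_d4e5f6",
}

print(resolve_query_scope(processed, ["report.pdf"]))
print(resolve_query_scope(processed, ["missing.pdf"]))
```

Keeping this logic in a pure function means the fallback behaviour can be checked with plain assertions before the Flask app and ngrok tunnel are started.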
