# Application Demonstration

The application provides two main functionalities, both of which are demonstrated in this notebook.

**Functionality 1: Publication-based RAG and Query Answering**

The RAG pipeline lets users query the publication base (a Zotero collection) in natural language, retrieving information regardless of where it is written and even synthesizing know[ledge]

LLM answers are grounded in the retrieved context, and each relevant claim is supported by a source (publication title and section) that is shown to the user.

This functionality encapsulates the following steps and modules:
- pdfProcessing
    - Extracting text/metadata from PDFs
    - Preparing and chunking content for populating the Vector DB
- Vector DB and Embedding models
    - Vector embeddings are computed for the paper chunks (e.g. pretrained ModernBert embedder)
    - Vector embeddings are stored in the vector DB with relevant metadata
    - Enables similarity search for most relevant chunks given a user query
- LLM
    - LLM configuration
    - Prompt building: system and user prompts are constructed, and the retrieved chunks are passed to the chosen LLM (e.g. Mistral NeMo)
    - User query is answered based on retrieved knowledge
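
The retrieval step listed above can be illustrated with a small, self-contained sketch. It is not the notebook's implementation: the bag-of-words `embed` function is a toy stand-in for a real embedder such as ModernBERT, and the chunk dictionaries, `cosine`, and `top_k` are hypothetical names chosen for illustration. Only the idea (embed the query, rank stored chunks by similarity, keep the best `k`) carries over.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a real embedder: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[dict], k: int = 2) -> list[dict]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)[:k]


# Hypothetical chunks with the metadata the pipeline shows to the user.
chunks = [
    {"title": "Paper A", "section": "Methods", "text": "the controller adjusts the lens voltage"},
    {"title": "Paper B", "section": "Results", "text": "mirror alignment improved beam focus"},
    {"title": "Paper A", "section": "Introduction", "text": "liquid lens autofocus is fast"},
]

hits = top_k("which voltage does the controller adjust", chunks, k=2)
print(hits[0]["title"], "|", hits[0]["section"])
```

A real vector DB replaces the linear scan with an index, but the ranking principle is the same.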

In addition to the functionality demonstration, a structured evaluation is performed with several user queries of varying difficulty.

**Functionality 2: External paper search**

If users find that relevant information is not covered by the current publication base (Zotero collection), this functionality lets them retrieve external papers via the Semantic Scholar API.

***USAGE NOTES:***
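- For the first run, set CLEAR_DB_ON_RUN = True to populate the vector DB. An empty database is ingested either way; the flag additionally clears existing chunks and forces re-ingestion of all PDFs.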
- The outputs of the query demonstrations (Chapters 1 to 4 for functionality 1) are written to the outputs/application_demo folder in case the notebook outputs are difficult to read.

# Functionality 1: Publication-based RAG and Query Answering

## 1. Setup & Initialization

In [1]:
import sys
import os
from pathlib import Path
import json
import time
from tqdm import tqdm

# Change to parent directory for config.yaml access
parent_dir = Path.cwd().parent
os.chdir(parent_dir)
sys.path.insert(0, str(parent_dir))

from pdfProcessing.docling_PDF_processor import DoclingPDFProcessor
from pdfProcessing.chunking import create_chunks_from_sections
from embeddingModels.ModernBertEmbedder import ModernBertEmbedder
from embeddingModels.QwenEmbedder import QwenEmbedder
from backend.services.embedder import EmbeddingService
from backend.services.vector_db import VectorDBService
from backend.services.rag_answer_service import ChromaRagRetriever
from llmAG.rag.pipeline import RagPipeline
from llmAG.llm import build_llm
from zotero_integration.metadata_loader import ZoteroMetadataLoader

import pandas as pd
import numpy as np

print(f"Working directory: {os.getcwd()}")

Working directory: c:\Users\leonb\Repos\GenAI


In [None]:
# Configuration
EMBEDDER_TYPE = "bert"  # "bert" or "qwen"
CHROMA_PATH = "./backend/chroma_db"
MAX_CHUNK_SIZE = 2500
OVERLAP_SIZE = 200
TOP_K_RETRIEVAL = 5
CLEAR_DB_ON_RUN = True  # Set to True to clear DB and re-ingest all PDFs

# Output directory for full chunk outputs
OUTPUT_DIR = Path("outputs/application_demo")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"

# Initialize services
print("Initializing Zotero metadata loader...")
try:
    zotero_loader = ZoteroMetadataLoader()
    print("Zotero metadata loaded")
except Exception as e:
    print(f"Warning: Zotero metadata not available: {e}")
    zotero_loader = None

print("Initializing PDF processor...")
processor = DoclingPDFProcessor()

print("Initializing embedding service...")
embed_service = EmbeddingService()
embedder = embed_service.load_model(EMBEDDER_TYPE)

print("Initializing ChromaDB...")
db_service = VectorDBService(
    db_path=CHROMA_PATH,
    collection_names={
        "bert": "scientific_papers_bert",
        "qwen": "scientific_papers_qwen"
    }
)

print("Initializing LLM (Ollama mistral-nemo)...")
try:
    llm = build_llm(model="mistral-nemo", temperature=0.1)
    print("LLM initialized")
except Exception as e:
    print(f"Error: LLM initialization failed: {e}")
    print("  Make sure Ollama app is running")
    llm = None

Initializing Zotero metadata loader...
Loaded 24 items from zotero_export_20260114_160922.json
✓ Zotero metadata loaded
Initializing PDF processor...
Initializing Docling Converter...
CUDA not found. Using CPU for PDF Processing.
Initializing embedding service...
Loading Model Key: bert...
Loading Alibaba-NLP/gte-modernbert-base on cpu...
Initializing ChromaDB...
Initializing LLM (Ollama mistral-nemo)...
✓ LLM initialized


## 2. Ingest Pipeline

In [None]:
# Check database status
collection = db_service.get_collection(EMBEDDER_TYPE)
chunk_count = collection.count()

print(f"Database status (model: {EMBEDDER_TYPE})")
print(f"  Chunks in database: {chunk_count}")
print(f"  CLEAR_DB_ON_RUN: {CLEAR_DB_ON_RUN}")

if CLEAR_DB_ON_RUN and chunk_count > 0:
    print(f"  Clearing existing {chunk_count} chunks...")
    all_ids = collection.get()['ids']
    if all_ids:
        collection.delete(ids=all_ids)
    print("  Database cleared")

Database status (model: bert)
  Chunks in database: 327
  CLEAR_DB_ON_RUN: False


In [None]:
def ingest_pdf(pdf_path: Path, model_key: str = "bert"):
    """Ingest single PDF: Process → Chunk → Embed → Store"""
    print(f"\nProcessing: {pdf_path.name}")
    
    # Try Zotero metadata first
    zotero_meta = None
    if zotero_loader:
        zotero_meta = zotero_loader.get_metadata_by_filename(pdf_path.name)
        if zotero_meta:
            print(f"  Using Zotero metadata: '{zotero_meta['title'][:50]}...'")
        else:
            print("  Warning: No Zotero match - using Docling extraction")
    
    # Process PDF
    metadata, sections = processor.process_pdf(str(pdf_path), zotero_metadata=zotero_meta)
    print(f"  Extracted {len(sections)} sections")
    
    # Create chunks
    docs, metas, ids = create_chunks_from_sections(
        filename=pdf_path.name,
        metadata=metadata,
        sections=sections,
        max_chunk_size=MAX_CHUNK_SIZE,
        overlap_size=OVERLAP_SIZE
    )
    print(f"  Created {len(docs)} chunks")
    
    if not docs:
        print("  Error: No chunks created")
        return 0
    
    # Embed and store
    embeddings = embedder.encode(docs)
    db_service.upsert_chunks(
        model_key=model_key,
        ids=ids,
        documents=docs,
        embeddings=embeddings.tolist(),
        metadata=metas
    )
    
    print(f"  Ingested {len(docs)} chunks")
    return len(docs)

# Conditional ingestion
pdf_dir = Path.cwd() / "data" / "testPDFs"
pdf_files = list(pdf_dir.glob("*.pdf"))
print(f"Found {len(pdf_files)} PDFs in {pdf_dir}")

collection = db_service.get_collection(EMBEDDER_TYPE)
chunk_count = collection.count()

if chunk_count == 0 or CLEAR_DB_ON_RUN:
    print(f"\nIngesting {len(pdf_files)} PDFs...")
    total_chunks = 0
    for i, pdf in enumerate(pdf_files):
        print(f"[{i+1}/{len(pdf_files)}]", end="")
        chunks = ingest_pdf(pdf, model_key=EMBEDDER_TYPE)
        total_chunks += chunks
    print(f"\nIngestion complete: {total_chunks} chunks from {len(pdf_files)} PDFs")
else:
    print(f"Skipping ingestion ({chunk_count} chunks already in database)")

Found 13 PDFs in c:\Users\leonb\Repos\GenAI\data\testPDFs
⏭ Skipping ingestion (327 chunks already in database)
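
To make the roles of `MAX_CHUNK_SIZE` and `OVERLAP_SIZE` concrete, here is a minimal sliding-window sketch. It is an assumption about how such parameters typically interact, not the actual logic of `create_chunks_from_sections`, which may split on sentence or section boundaries instead. The helper names `chunk_text` and `chunk_ids` are hypothetical; the ID pattern only mirrors the `filename#Section_partN` shape visible in the retrieval outputs of this notebook.

```python
def chunk_text(text: str, max_chunk_size: int = 2500, overlap_size: int = 200) -> list[str]:
    """Split text into windows of at most max_chunk_size characters.

    Consecutive windows share overlap_size characters so that a sentence cut
    at a boundary still appears complete in one of the two chunks.
    """
    if overlap_size >= max_chunk_size:
        raise ValueError("overlap_size must be smaller than max_chunk_size")
    step = max_chunk_size - overlap_size
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def chunk_ids(filename: str, section: str, n_chunks: int) -> list[str]:
    """Build IDs shaped like 'paper.pdf#Reward_function_part1' (assumed pattern)."""
    safe_section = section.replace(" ", "_")
    return [f"{filename}#{safe_section}_part{i}" for i in range(1, n_chunks + 1)]


# Synthetic 6000-character section used only for illustration.
text = "".join(str(i % 10) for i in range(6000))
parts = chunk_text(text)
ids = chunk_ids("example.pdf", "Reward function", len(parts))
print([len(p) for p in parts], ids)
```

With the defaults used in this notebook, each new window starts 2300 characters after the previous one, so a 6000-character section yields three chunks.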


## 3. RAG Pipeline Initialization

In [None]:
# Initialize RAG components
retriever = ChromaRagRetriever(
    embed_service=embed_service,
    db_service=db_service,
    model_name=EMBEDDER_TYPE
)

rag_pipeline = RagPipeline(
    retriever=retriever,
    model="mistral-nemo",
    temperature=0.1
)
print("RAG pipeline initialized")

def show_llm_prompt(question: str, top_k: int = 5, template_name: str = "answer"):
    """Display the exact prompt that will be sent to the LLM."""
    retrieved_docs = retriever.get_relevant_documents(question, k=top_k)
    context = rag_pipeline._format_context(retrieved_docs)
    prompt_template = rag_pipeline._prompts.get(template_name, rag_pipeline._prompts["answer"])
    formatted_prompt = prompt_template.format_messages(question=question, context=context)
    
    print(f"{'='*80}")
    print(f"EXACT PROMPT SENT TO LLM")
    print(f"{'='*80}")
    print(f"Template: {template_name} | Retrieved chunks: {len(retrieved_docs)} | Context: {len(context)} chars\n")
    
    for i, msg in enumerate(formatted_prompt):
        role = msg.__class__.__name__.replace('Message', '').upper()
        print(f"\n{'='*80}")
        print(f"MESSAGE {i+1}: {role}")
        print(f"{'='*80}\n")
        print(msg.content)
    
    print(f"\n{'='*80}")
    print(f"Total prompt length: {sum(len(m.content) for m in formatted_prompt)} chars")
    print(f"{'='*80}")

✓ RAG pipeline initialized
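
The system prompt shown later requires citations that reuse the title and section labels "exactly as they appear in the context headers". The sketch below shows one way such headers could be produced from retrieved chunks. It is a hypothetical illustration: `format_context` and the dictionary layout are assumptions, and the real `RagPipeline._format_context` may format the context differently.

```python
def format_context(docs: list[dict]) -> str:
    """Join retrieved chunks into one context string with citable headers.

    Each chunk is prefixed with a '[Title | Section]' header so the LLM can
    copy the header verbatim when citing a claim.
    """
    blocks = []
    for doc in docs:
        title = doc.get("title", "Unknown")
        section = doc.get("section", "N/A")
        blocks.append(f"[{title} | {section}]\n{doc['text']}")
    return "\n\n".join(blocks)


# Hypothetical retrieved chunks.
docs = [
    {"title": "Paper A", "section": "Methods", "text": "The controller adjusts the lens voltage."},
    {"title": "Paper B", "section": "Discussion", "text": "Sparse sampling reduces dead-time."},
]
context = format_context(docs)
print(context)
```

Keeping the header format identical to the required citation format means citations can later be checked by simple string comparison.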


## 4. RAG Pipeline Demonstration

Three example queries, one from each evaluation difficulty tier.

### Tier 1: Direct Factual Question

In [None]:
# Tier 1 Query: Direct factual retrieval
query_tier1 = "What physical quantity is the controller changing (the actuator variable) in the liquid-lens autofocus setup?"

print(f"QUERY (Tier 1): {query_tier1}\n")
print("="*80)
print("RETRIEVAL RESULTS")
print("="*80)

# Retrieve chunks
query_embedding = embedder.encode([query_tier1])[0]
results = db_service.query(
    model_key=EMBEDDER_TYPE,
    query_embedding=query_embedding.tolist(),
    n_results=TOP_K_RETRIEVAL
)

# Collect full output for file
output_lines = [f"QUERY (Tier 1): {query_tier1}\n", "="*80 + "\nRETRIEVAL RESULTS\n" + "="*80 + "\n"]

for i in range(len(results['ids'][0])):
    chunk_id = results['ids'][0][i]
    distance = results['distances'][0][i]
    content = results['documents'][0][i]
    meta = results['metadatas'][0][i]
    
    chunk_output = f"""
{'='*80}
Rank {i+1} | Distance: {distance:.4f}
{'='*80}
ID:      {chunk_id}
Section: {meta.get('section', 'N/A')}
Paper:   {meta.get('title', 'N/A')}
Authors: {meta.get('authors', 'N/A')}

Content ({len(content)} chars):
{'-'*80}
{content}
"""
    print(chunk_output)
    output_lines.append(chunk_output)

# Save full output to file
with open(OUTPUT_DIR / "tier1_retrieval.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(output_lines))
print(f"\nFull retrieval output saved to {OUTPUT_DIR / 'tier1_retrieval.txt'}")

QUERY (Tier 1): What physical quantity is the controller changing (the actuator variable) in the liquid-lens autofocus setup?

RETRIEVAL RESULTS

Rank 1 | Distance: 0.2860
ID:      Zhang_et_al.___2024___Precision_autofocus_in_optical_microscopy_with_liquid_lenses_controlled_by_deep_reinforcement_learni.pdf#Introduction_part5
Section: Introduction
Paper:   Precision autofocus in optical microscopy with liquid lenses controlled by deep reinforcement learning
Authors: Jing Zhang, Yong-feng Fu, Hao Shen, Quan Liu, Li-ning Sun, Li-guo Chen

Content (1250 chars):
--------------------------------------------------------------------------------
In addition, the integration of software algorithms and simple hardware enables end-to-end optical microscope autofocusing, reducing system complexity and cost. Fast Response: The combination of liquid lenses with millisecond focusing speeds and intelligent focusing algorithms enables the rapid autofocusing of optical microscopes. Robustness: The utiliz

In [7]:
# Show exact prompt sent to LLM
show_llm_prompt(query_tier1, top_k=TOP_K_RETRIEVAL)

EXACT PROMPT SENT TO LLM
Template: answer | Retrieved chunks: 5 | Context: 9950 chars


MESSAGE 1: SYSTEM

You are a RAG assistant answering questions about scientific PDFs using only the provided context.
Use the context as the sole source of truth. Do not guess or use prior knowledge.
Answer with factual statements supported by the context.
Every factual claim must include an inline citation formatted as [Title | Section] placed immediately after the clause it supports.
Citations must use titles and section labels exactly as they appear in the context headers; do not invent, shorten, or paraphrase them.
If only part of the question is supported, answer only that part and state that the remaining parts are not in the provided context; do not ask to search online.
If you cannot answer with exact [Title | Section] citations from the context, respond exactly with: "I do not know based on the provided context because the retrieved sections do not mention this. Would you like me to find re

In [8]:
# Generate LLM answer
response_tier1 = rag_pipeline.run(query_tier1, k=TOP_K_RETRIEVAL, include_sources=True)

print("="*80)
print("LLM ANSWER")
print("="*80 + "\n")
print(response_tier1.answer)

print("\n" + "="*80)
print(f"SOURCES ({len(response_tier1.sources)} documents)")
print("="*80)
for i, source in enumerate(response_tier1.sources):
    print(f"\n[{i+1}] {source.metadata.get('title', 'Unknown')}")
    print(f"    Section: {source.metadata.get('section', 'N/A')}")

LLM ANSWER

The controller is changing the voltage applied to the liquid lens [Precision autofocus in optical microscopy with liquid lenses controlled by deep reinforcement learning | Effect of actions on autofocus performance].

SOURCES (5 documents)

[1] Precision autofocus in optical microscopy with liquid lenses controlled by deep reinforcement learning
    Section: Introduction

[2] Precision autofocus in optical microscopy with liquid lenses controlled by deep reinforcement learning
    Section: Introduction

[3] Precision autofocus in optical microscopy with liquid lenses controlled by deep reinforcement learning
    Section: Effect of actions on autofocus performance

[4] Precision autofocus in optical microscopy with liquid lenses controlled by deep reinforcement learning
    Section: Introduction

[5] AutoFocus: AI-driven alignment of nanofocusing X-ray mirror systems
    Section: 5.3 Challenges and Considerations


**Comment:**
- The LLM answer is correct: the voltage applied to the liquid lens is indeed the variable the controller adjusts.
- The answer is based on the relevant context.
- Four of the five retrieved chunks stem from the correct paper, even though the query does not name it.
- The relevant chunks were retrieved, namely Chunk 2 (Introduction) and Chunk 3 (Effect of actions on autofocus performance).

### Tier 2: Multi-detail Question

In [None]:
# Tier 2 Query: Requires extracting multiple related details
query_tier2 = "List the reward hyperparameters (e.g., alpha, beta, mu, delta) for DRL autofocus and what each incentivizes."

print(f"QUERY (Tier 2): {query_tier2}\n")
print("="*80)
print("RETRIEVAL RESULTS")
print("="*80)

# Retrieve chunks
query_embedding = embedder.encode([query_tier2])[0]
results = db_service.query(
    model_key=EMBEDDER_TYPE,
    query_embedding=query_embedding.tolist(),
    n_results=TOP_K_RETRIEVAL
)

output_lines = [f"QUERY (Tier 2): {query_tier2}\n", "="*80 + "\nRETRIEVAL RESULTS\n" + "="*80 + "\n"]

for i in range(len(results['ids'][0])):
    chunk_id = results['ids'][0][i]
    distance = results['distances'][0][i]
    content = results['documents'][0][i]
    meta = results['metadatas'][0][i]
    
    chunk_output = f"""
{'='*80}
Rank {i+1} | Distance: {distance:.4f}
{'='*80}
ID:      {chunk_id}
Section: {meta.get('section', 'N/A')}
Paper:   {meta.get('title', 'N/A')}
Authors: {meta.get('authors', 'N/A')}

Content ({len(content)} chars):
{'-'*80}
{content}
"""
    print(chunk_output)
    output_lines.append(chunk_output)

with open(OUTPUT_DIR / "tier2_retrieval.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(output_lines))
print(f"\nFull retrieval output saved to {OUTPUT_DIR / 'tier2_retrieval.txt'}")

QUERY (Tier 2): List the reward hyperparameters (e.g., alpha, beta, mu, delta) for DRL autofocus and what each incentivizes.

RETRIEVAL RESULTS

Rank 1 | Distance: 0.2574
ID:      Zhang_et_al.___2024___Precision_autofocus_in_optical_microscopy_with_liquid_lenses_controlled_by_deep_reinforcement_learni.pdf#Reward_function_part1
Section: Reward function
Paper:   Precision autofocus in optical microscopy with liquid lenses controlled by deep reinforcement learning
Authors: Jing Zhang, Yong-feng Fu, Hao Shen, Quan Liu, Li-ning Sun, Li-guo Chen

Content (721 chars):
--------------------------------------------------------------------------------
The last term δ is an additional reward component aimed at enhancing the discriminative ability of the reward function by setting relatively large positive and negative rewards for the clearest and least clear images, respectively, thereby further reducing the focusing steps. Since achieving clear imaging, reducing the time to focus, and stopping au

In [10]:
# Show exact prompt sent to LLM
show_llm_prompt(query_tier2, top_k=TOP_K_RETRIEVAL)

EXACT PROMPT SENT TO LLM
Template: answer | Retrieved chunks: 5 | Context: 7509 chars


MESSAGE 1: SYSTEM

You are a RAG assistant answering questions about scientific PDFs using only the provided context.
Use the context as the sole source of truth. Do not guess or use prior knowledge.
Answer with factual statements supported by the context.
Every factual claim must include an inline citation formatted as [Title | Section] placed immediately after the clause it supports.
Citations must use titles and section labels exactly as they appear in the context headers; do not invent, shorten, or paraphrase them.
If only part of the question is supported, answer only that part and state that the remaining parts are not in the provided context; do not ask to search online.
If you cannot answer with exact [Title | Section] citations from the context, respond exactly with: "I do not know based on the provided context because the retrieved sections do not mention this. Would you like me to find re

In [11]:
# Generate LLM answer
response_tier2 = rag_pipeline.run(query_tier2, k=TOP_K_RETRIEVAL, include_sources=True)

print("="*80)
print("LLM ANSWER")
print("="*80 + "\n")
print(response_tier2.answer)

print("\n" + "="*80)
print(f"SOURCES ({len(response_tier2.sources)} documents)")
print("="*80)
for i, source in enumerate(response_tier2.sources):
    print(f"\n[{i+1}] {source.metadata.get('title', 'Unknown')}")
    print(f"    Section: {source.metadata.get('section', 'N/A')}")

LLM ANSWER

The reward hyperparameters for DRL autofocus in the provided context are:

- Alpha (α): 100 [Precision autofocus in optical microscopy with liquid lenses controlled by deep reinforcement learning | Reward function]
- Beta (β): 30 [Precision autofocus in optical microscopy with liquid lenses controlled by deep reinforcement learning | Reward function]
- Mu (μ): 200 [Precision autofocus in optical microscopy with liquid lenses controlled by deep reinforcement learning | Reward function]
- Delta (δ): 100 [Precision autofocus in optical microscopy with liquid lenses controlled by deep reinforcement learning | Reward function]

These hyperparameters incentivize the following aspects of the autofocus task:

- Alpha (α) encourages achieving clear imaging.
- Beta (β) rewards reducing the time to focus.
- Mu (μ) incentivizes stopping automatically once focused.
- Delta (δ) enhances the discriminative ability of the reward function by setting relatively large positive and negative re

**Comment:**
- The LLM answer is correct; the reward hyperparameters and their specific incentives are accurately identified.
- The answer is based on the relevant context provided in the text.
- The correct chunks were retrieved, namely Chunk 1 (Reward function) and Chunk 2 (Ablation experiments on the reward function).
- With the increased difficulty of the multi-detail question, the system still provides a useful answer.

### Tier 3: Synthesis / Cross-paper Question

In [None]:
# Tier 3 Query: Synthesis requiring reasoning across sources
query_tier3 = "How does FAST define 'scanning efficiency,' and in what way is this fundamentally different from raster-grid scanning?"

print(f"QUERY (Tier 3): {query_tier3}\n")
print("="*80)
print("RETRIEVAL RESULTS")
print("="*80)

# Retrieve chunks
query_embedding = embedder.encode([query_tier3])[0]
results = db_service.query(
    model_key=EMBEDDER_TYPE,
    query_embedding=query_embedding.tolist(),
    n_results=TOP_K_RETRIEVAL
)

output_lines = [f"QUERY (Tier 3): {query_tier3}\n", "="*80 + "\nRETRIEVAL RESULTS\n" + "="*80 + "\n"]

for i in range(len(results['ids'][0])):
    chunk_id = results['ids'][0][i]
    distance = results['distances'][0][i]
    content = results['documents'][0][i]
    meta = results['metadatas'][0][i]
    
    chunk_output = f"""
{'='*80}
Rank {i+1} | Distance: {distance:.4f}
{'='*80}
ID:      {chunk_id}
Section: {meta.get('section', 'N/A')}
Paper:   {meta.get('title', 'N/A')}
Authors: {meta.get('authors', 'N/A')}

Content ({len(content)} chars):
{'-'*80}
{content}
"""
    print(chunk_output)
    output_lines.append(chunk_output)

with open(OUTPUT_DIR / "tier3_retrieval.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(output_lines))
print(f"\nFull retrieval output saved to {OUTPUT_DIR / 'tier3_retrieval.txt'}")

QUERY (Tier 3): How does FAST define 'scanning efficiency,' and in what way is this fundamentally different from raster-grid scanning?

RETRIEVAL RESULTS

Rank 1 | Distance: 0.2472
ID:      Kandel_et_al.___2023___Demonstration_of_an_AI_driven_workflow_for_autonomous_high_resolution_scanning_microscopy.pdf#Discussion_part3
Section: Discussion
Paper:   Demonstration of an AI-driven workflow for autonomous high-resolution scanning microscopy
Authors: Saugat Kandel, Tao Zhou, Anakha V. Babu, Zichao Di, Xinxin Li, Xuedan Ma, Martin Holt, Antonino Miceli, Charudatta Phatak, Mathew J. Cherukara

Content (1200 chars):
--------------------------------------------------------------------------------
As such, there could exist scenarios in which the time required for the motormovementeclipsesthe time required for a single measurement. We expect to address the latter challenge by explicitly including a measurement-density-based term 38 or a movement-time-based term in the candidate selection proce

In [13]:
# Show exact prompt sent to LLM
show_llm_prompt(query_tier3, top_k=TOP_K_RETRIEVAL)

EXACT PROMPT SENT TO LLM
Template: answer | Retrieved chunks: 5 | Context: 4945 chars


MESSAGE 1: SYSTEM

You are a RAG assistant answering questions about scientific PDFs using only the provided context.
Use the context as the sole source of truth. Do not guess or use prior knowledge.
Answer with factual statements supported by the context.
Every factual claim must include an inline citation formatted as [Title | Section] placed immediately after the clause it supports.
Citations must use titles and section labels exactly as they appear in the context headers; do not invent, shorten, or paraphrase them.
If only part of the question is supported, answer only that part and state that the remaining parts are not in the provided context; do not ask to search online.
If you cannot answer with exact [Title | Section] citations from the context, respond exactly with: "I do not know based on the provided context because the retrieved sections do not mention this. Would you like me to find re

In [14]:
# Generate LLM answer
response_tier3 = rag_pipeline.run(query_tier3, k=TOP_K_RETRIEVAL, include_sources=True)

print("="*80)
print("LLM ANSWER")
print("="*80 + "\n")
print(response_tier3.answer)

print("\n" + "="*80)
print(f"SOURCES ({len(response_tier3.sources)} documents)")
print("="*80)
for i, source in enumerate(response_tier3.sources):
    print(f"\n[{i+1}] {source.metadata.get('title', 'Unknown')}")
    print(f"    Section: {source.metadata.get('section', 'N/A')}")

LLM ANSWER

FAST defines 'scanning efficiency' as the ability to isolate regions of interest in sparse settings and prepare for pointwise scanning in these regions, or more generally, to guide any scanning microscopy experiment where full pointwise information is not needed [Demonstration of an AI-driven workflow for autonomous high-resolution scanning microscopy | Discussion].

This definition is fundamentally different from raster-grid scanning because FAST does not require a systematic sampling of all points within the field of view. Instead, it strategically selects regions or points based on estimated reconstruction discrepancy (ERD) to minimize experimental dead-time while ensuring adequate sparsity [Self-driving scanning microscopy workflow]. In contrast, raster-grid scanning systematically samples every point in a predefined grid pattern, regardless of the information content at each location.

SOURCES (5 documents)

[1] Demonstration of an AI-driven workflow for autonomous hig

**Comment:**
- The LLM answer is correct; it accurately describes the functionality of FAST in isolating regions of interest and its fundamental departure from the systematic nature of raster-grid scanning.
- The answer is based on the relevant context, specifically regarding the strategic selection of points and the reduction of experimental dead-time.
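- The correct chunks were retrieved, namely Chunk 1 (Discussion) and Chunk 4 (Self-driving scanning microscopy workflow).
- The second citation, [Self-driving scanning microscopy workflow], gives only the section label and omits the paper title, so it does not follow the [Title | Section] format required by the system prompt.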
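
Since the system prompt prescribes the `[Title | Section]` citation format, answers can be checked mechanically. The sketch below is a hypothetical post-processing step that is not part of the demonstrated pipeline; `check_citations` and its report layout are assumptions. It is applied to a shortened version of the Tier 3 answer above, which contains one citation with title and section and one with a section label only.

```python
import re

# Matches any bracketed span without nested brackets.
CITATION = re.compile(r"\[([^\[\]]+)\]")


def check_citations(answer: str, sources: list[tuple[str, str]]) -> list[dict]:
    """Report, for every bracketed citation, whether it is well formed
    ('Title | Section') and whether it matches a retrieved source."""
    allowed = set(sources)
    report = []
    for raw in CITATION.findall(answer):
        parts = [p.strip() for p in raw.split("|")]
        well_formed = len(parts) == 2 and all(parts)
        report.append({
            "citation": raw,
            "well_formed": well_formed,
            "in_sources": well_formed and (parts[0], parts[1]) in allowed,
        })
    return report


title = "Demonstration of an AI-driven workflow for autonomous high-resolution scanning microscopy"
sources = [(title, "Discussion"), (title, "Self-driving scanning microscopy workflow")]
answer = (
    f"FAST isolates regions of interest [{title} | Discussion]. "
    "It selects points based on ERD [Self-driving scanning microscopy workflow]."
)
report = check_citations(answer, sources)
for entry in report:
    print(entry["well_formed"], entry["in_sources"], entry["citation"][:50])
```

The check assumes that titles and section labels contain neither `|` nor square brackets; a title containing such characters would need a more careful parser.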

## 5. Systematic Evaluation

Evaluation across all questions in the dataset, measuring retrieval accuracy and answer quality.

In [None]:
# Load evaluation dataset
def load_eval_dataset(filename="eval_dataset.json"):
    potential_dirs = [Path.cwd(), Path.cwd().parent]
    for directory in potential_dirs:
        file_path = directory / filename
        if file_path.exists():
            with open(file_path, "r", encoding="utf-8") as f:
                data = json.load(f)
            print(f"Loaded {len(data)} questions from {file_path}")
            return data
    print(f"Warning: {filename} not found")
    return []

eval_dataset = load_eval_dataset()

✓ Loaded 16 questions from c:\Users\leonb\Repos\GenAI\eval_dataset.json
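
The structure of `eval_dataset.json` is not shown in this notebook, but the evaluator below reads the keys `question`, `tier`, `target_tag`, `expected_chunk_id`, `expected_answer`, and `expected_papers`. The following sketch shows what one entry might look like and how entries could be sanity-checked before a long evaluation run. All values are hypothetical placeholders, and `validate_item` is an illustrative helper, not part of the project.

```python
ALLOWED_KEYS = {
    "question", "tier", "target_tag",
    "expected_chunk_id", "expected_answer", "expected_papers",
}

# Hypothetical entry; the values are placeholders, not taken from the real dataset.
example_item = {
    "question": "Which variable does the controller adjust?",
    "tier": 1,
    "target_tag": "example-tag",
    "expected_chunk_id": "example.pdf#Introduction_part1",
    "expected_answer": "The voltage applied to the liquid lens.",
    "expected_papers": ["example.pdf"],
}


def validate_item(item: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the item looks usable."""
    problems = []
    if not item.get("question"):
        problems.append("missing 'question'")
    if item.get("tier") not in (1, 2, 3):
        problems.append("'tier' should be 1, 2, or 3")
    if not isinstance(item.get("expected_papers", []), list):
        problems.append("'expected_papers' should be a list")
    unknown = set(item) - ALLOWED_KEYS
    if unknown:
        problems.append(f"unknown keys: {sorted(unknown)}")
    return problems


print(validate_item(example_item))
print(validate_item({"tier": 4, "foo": 1}))
```

Only `question` is accessed without a default in the evaluator, so a missing `question` is the one problem that would raise an error during evaluation; the other checks guard against silently skipped metrics.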


In [None]:
# Enhanced RAG Evaluator with chunk-level, multi-paper, and answer quality metrics
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class EnhancedRAGEvaluator:
    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.results = []
        print("Loading semantic similarity model...")
        self.semantic_model = SentenceTransformer('all-MiniLM-L6-v2')
        print("Model loaded")

    def evaluate(self, dataset, top_k=5):
        print(f"Starting evaluation of {len(dataset)} questions...")
        self.results = []
        
        for item in tqdm(dataset):
            question = item['question']
            target_tag = item.get('target_tag')
            tier = item.get('tier')
            expected_chunk_id = item.get('expected_chunk_id')
            expected_answer = item.get('expected_answer')
            expected_papers = item.get('expected_papers', [])
            
            start_time = time.time()
            try:
                response = self.pipeline.run(question, k=top_k, include_sources=True)
                elapsed = time.time() - start_time
                
                retrieved_filenames = [src.metadata.get('filename', '') for src in response.sources]
                unique_papers = list(set(retrieved_filenames))
                num_unique_papers = len(unique_papers)
                
                # Exact chunk match (Tier 1-2)
                exact_chunk_match = False
                chunk_found_at_rank = None
                if expected_chunk_id:
                    for rank, src in enumerate(response.sources, 1):
                        parent_id = src.metadata.get('parent_id', '')
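                        # Note: the second condition compares only the filename part of the
                        # expected ID, so any chunk from the same PDF also counts as a match.
                        if parent_id == expected_chunk_id or expected_chunk_id.split('#')[0] in parent_id: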
                            exact_chunk_match = True
                            chunk_found_at_rank = rank
                            break
                
                # Semantic chunk similarity
                semantic_chunk_hit = None
                best_chunk_similarity = None
                if expected_chunk_id and not exact_chunk_match:
                    try:
                        collection = self.pipeline.retriever.db_service.get_collection(
                            self.pipeline.retriever.model_name
                        )
                        expected_docs = collection.get(ids=[expected_chunk_id])
                        if expected_docs and expected_docs['documents']:
                            expected_text = expected_docs['documents'][0]
                            expected_embedding = self.semantic_model.encode([expected_text])
                            retrieved_texts = [src.page_content for src in response.sources]
                            retrieved_embeddings = self.semantic_model.encode(retrieved_texts)
                            similarities = cosine_similarity(expected_embedding, retrieved_embeddings)[0]
                            best_chunk_similarity = float(similarities.max())
                            semantic_chunk_hit = best_chunk_similarity > 0.7
                    except Exception:
                        # Best-effort metric: on any lookup or encoding failure,
                        # both similarity fields stay None.
                        pass
                
                # Multi-paper metrics (Tier 3)
                multi_paper_match = num_unique_papers >= 2
                paper_recall = None
                paper_precision = None
                
                if expected_papers and len(expected_papers) > 0:
                    retrieved_normalized = {f.lower() for f in retrieved_filenames if f}
                    expected_normalized = {p.lower() for p in expected_papers}
                    correct_papers = retrieved_normalized & expected_normalized
                    
                    if len(expected_normalized) > 0:
                        paper_recall = len(correct_papers) / len(expected_normalized)
                    if len(retrieved_normalized) > 0:
                        paper_precision = len(correct_papers) / len(retrieved_normalized)
                
                # Answer quality
                answer_similarity = None
                if expected_answer:
                    answer_embedding = self.semantic_model.encode([response.answer])
                    expected_embedding = self.semantic_model.encode([expected_answer])
                    answer_similarity = float(cosine_similarity(answer_embedding, expected_embedding)[0][0])
                
                self.results.append({
                    "Tier": tier,
                    "Question": question[:60] + "..." if len(question) > 60 else question,
                    "Target_Tag": target_tag,
                    "Exact_Chunk_Match": exact_chunk_match if expected_chunk_id else None,
                    "Chunk_Rank": chunk_found_at_rank if exact_chunk_match else None,
                    "Semantic_Chunk_Hit": semantic_chunk_hit,
                    # Compare against None so that a legitimate score of 0.0 is not dropped
                    "Best_Chunk_Similarity": round(best_chunk_similarity, 3) if best_chunk_similarity is not None else None,
                    "Num_Papers": num_unique_papers,
                    "Multi_Paper_Match": multi_paper_match if tier == 3 else None,
                    "Paper_Recall": round(paper_recall, 3) if paper_recall is not None else None,
                    "Paper_Precision": round(paper_precision, 3) if paper_precision is not None else None,
                    "Answer_Similarity": round(answer_similarity, 3) if answer_similarity is not None else None,
                    "Papers": " | ".join([p.split(' - ')[0][:30] for p in unique_papers[:2]]),
                    "Latency": round(elapsed, 2)
                })
                
            except Exception as e:
                print(f"Error on: {question[:30]}... {e}")
                self.results.append({
                    "Tier": tier, "Question": question[:60] + "..." if len(question) > 60 else question, "Target_Tag": target_tag,
                    "Exact_Chunk_Match": False, "Chunk_Rank": None, "Semantic_Chunk_Hit": None,
                    "Best_Chunk_Similarity": None, "Num_Papers": 0, "Multi_Paper_Match": False,
                    "Paper_Recall": None, "Paper_Precision": None, "Answer_Similarity": None,
                    "Papers": "ERROR", "Latency": 0
                })

        return pd.DataFrame(self.results)

evaluator = EnhancedRAGEvaluator(rag_pipeline)

Loading semantic similarity model...
✓ Model loaded
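
The paper-level recall and precision computed inside `EnhancedRAGEvaluator.evaluate` are plain set operations. The worked example below repeats that logic in isolation so the numbers are easy to follow. The function name `paper_metrics` and the file names are hypothetical; the normalization (lowercasing, dropping empty names, deduplicating) follows the evaluator code above.

```python
def paper_metrics(retrieved_filenames: list[str], expected_papers: list[str]):
    """Return (recall, precision) over unique, case-insensitive filenames.

    recall    = correctly retrieved papers / expected papers
    precision = correctly retrieved papers / unique retrieved papers
    """
    retrieved = {f.lower() for f in retrieved_filenames if f}
    expected = {p.lower() for p in expected_papers}
    correct = retrieved & expected
    recall = len(correct) / len(expected) if expected else None
    precision = len(correct) / len(retrieved) if retrieved else None
    return recall, precision


# Five retrieved chunks from three distinct papers; four papers were expected.
recall, precision = paper_metrics(
    ["a.pdf", "a.pdf", "B.pdf", "c.pdf", "c.pdf"],
    ["a.pdf", "b.pdf", "d.pdf", "e.pdf"],
)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

Because duplicates collapse into a set, retrieving several chunks from the same paper neither raises recall nor lowers precision.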


In [None]:
# Run evaluation
df_results = evaluator.evaluate(eval_dataset, top_k=5)

# Display summary
print("\n" + "="*80)
print("EVALUATION SUMMARY")
print("="*80 + "\n")

# Tier 1-2: Exact chunk matching
tier_12 = df_results[df_results['Tier'].isin([1, 2])]
if len(tier_12) > 0:
    chunk_match_rate = tier_12['Exact_Chunk_Match'].sum() / tier_12['Exact_Chunk_Match'].notna().sum()
    print(f"Tier 1-2 (Single Paper) - Exact Chunk Hit Rate: {chunk_match_rate:.2%} ({int(tier_12['Exact_Chunk_Match'].sum())}/{int(tier_12['Exact_Chunk_Match'].notna().sum())})")
    
    found_ranks = tier_12[tier_12['Exact_Chunk_Match'] == True]['Chunk_Rank']
    if len(found_ranks) > 0:
        print(f"  - Avg rank of correct chunk: {found_ranks.mean():.1f}")
    
    semantic_hits = tier_12[tier_12['Semantic_Chunk_Hit'] == True]
    if len(semantic_hits) > 0:
        print(f"  - Semantic near-miss hits: {len(semantic_hits)} (similarity > 0.7)")
    
    misses_with_sim = tier_12[(tier_12['Exact_Chunk_Match'] == False) & (tier_12['Best_Chunk_Similarity'].notna())]
    if len(misses_with_sim) > 0:
        print(f"  - Avg similarity for misses: {misses_with_sim['Best_Chunk_Similarity'].mean():.3f}")

# Tier 3: Multi-paper matching
tier_3 = df_results[df_results['Tier'] == 3]
if len(tier_3) > 0:
    multi_match_rate = tier_3['Multi_Paper_Match'].sum() / len(tier_3)
    print(f"\nTier 3 (Synthesis) - Multi-Paper Hit Rate: {multi_match_rate:.2%} ({int(tier_3['Multi_Paper_Match'].sum())}/{len(tier_3)})")
    print(f"  - Avg papers retrieved: {tier_3['Num_Papers'].mean():.1f}")
    
    tier_3_with_expected = tier_3[tier_3['Paper_Recall'].notna()]
    if len(tier_3_with_expected) > 0:
        print(f"  - Avg paper recall: {tier_3_with_expected['Paper_Recall'].mean():.2%}")
        print(f"  - Avg paper precision: {tier_3_with_expected['Paper_Precision'].mean():.2%}")

# Answer Quality
with_answer_eval = df_results[df_results['Answer_Similarity'].notna()]
if len(with_answer_eval) > 0:
    avg_answer_sim = with_answer_eval['Answer_Similarity'].mean()
    print(f"\nAnswer Quality (semantic similarity to expected):")
    print(f"  - Avg answer similarity: {avg_answer_sim:.3f} ({len(with_answer_eval)} questions)")
    print(f"  - High quality (>0.7): {(with_answer_eval['Answer_Similarity'] > 0.7).sum()}/{len(with_answer_eval)}")

print(f"\nAverage Latency: {df_results['Latency'].mean():.2f}s")

Starting evaluation of 16 questions...


  0%|          | 0/16 [00:00<?, ?it/s]

In [None]:
# Detailed results table
print("="*80)
print("DETAILED RESULTS")
print("="*80)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 40)

print(df_results.to_string(index=False))

# Save results
output_filename = OUTPUT_DIR / "evaluation_results.csv"
df_results.to_csv(output_filename, index=False)
print(f"\nResults saved to {output_filename}")

# Functionality 2: External Paper Search