# 🔬 Dual Evaluator RAG Pipeline Evaluation Notebook

## Purpose
This notebook allows **two independent human evaluators** (A and B) to assess a Retrieval-Augmented Generation (RAG) pipeline in parallel.

## Goals
- **Independent, unbiased human evaluation** from two perspectives
- **Structured scoring and notes** for each query using a standardized rubric
- **Later comparison** between evaluators and optional automated metrics

## Workflow
1. ✅ **Run Setup** (pipeline + environment)
2. 👤 **Evaluator A** completes their section independently
3. 👤 **Evaluator B** completes their section independently
4. 🤖 **Run automated evaluation** (e.g., DeepEval) if desired
5. 🔍 **Compare results** and discuss insights

## ⚠️ Independence & Blindness Notice
- **Evaluator A and B must NOT see each other's answers or scores before finishing**
- **Automated evaluation should NOT be shown to either evaluator ahead of time**
- Each evaluator works in their own dedicated section

---

# 1️⃣ Setup: Environment & API Keys

## Purpose
This section is for the **organizer/technical person**, not the evaluators.

## What This Section Does
- ✅ Verifies required API keys (language model, vector DB, RAG backend)
- ✅ Initializes the RAG pipeline (similar to `main.py`)
- ✅ Runs a simple test query to confirm the pipeline works

## Requirements
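You must have the following environment variables set (the setup cell raises an error if any of them is missing):
- `ANTHROPIC_API_KEY` - For Claude LLM
- `JINA_API_KEY` - For embeddings
- `QDRANT_API_KEY` - For vector database
- `QDRANT_URL` - Qdrant instance URL
- `LLAMAPARSE_API_KEY` - For LlamaParse PDF parsing
- `OPENAI_API_KEY` - Validated by the setup cell alongside the keys above

Optional: `LLAMACLOUD_API_KEY` (copied to `LLAMA_CLOUD_API_KEY` when present).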

## Success Criteria
After running setup cells, you should see:
- ✅ "All required API keys loaded successfully!"
- ✅ "RAG pipeline fully configured and ready!"
- ✅ Test query returns an answer with context

## ⚠️ Important
**Do not modify setup cells after evaluators have started their work!**

---

## Import Dependencies and Load API Keys

**Instructions**: Run this cell to import all required libraries and validate API keys.

# Configure Anthropic multimodal LLM (Claude 3) for image understanding
# Requires: pip install llama-index-multi-modal-llms-anthropic
from llama_index.multi_modal_llms.anthropic import AnthropicMultiModal

anthropic_mm_llm = AnthropicMultiModal(
    model="claude-3-sonnet-20240229",  # or "claude-3-opus-20240229"
    max_tokens=300,
)

print("\u2705 Anthropic multimodal LLM configured (Claude 3 Sonnet)")


In [1]:
# Import all required libraries
import pandas as pd
import os
import json
import uuid
from typing import List, Any, Dict, Tuple
from datetime import datetime
from pathlib import Path

import qdrant_client
from dotenv import load_dotenv
from llama_parse import LlamaParse

from langchain.docstore.document import Document
from langchain_anthropic import ChatAnthropic
from langchain_community.document_loaders import PyPDFLoader

from ai_eval.resources import deepeval_scorer as deep
from ai_eval.resources.rag_template import RAG
from ai_eval.resources import eval_dataset_builder as eval_builder
from ai_eval.services.file import JSONService
from ai_eval.config import global_config as glob

from llama_index.core import (
    Document as LlamaIndexDocument,
    Settings,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.jinaai import JinaEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.response_synthesizers import ResponseMode
from llama_index.llms.langchain import LangChainLLM

# Load environment variables
load_dotenv()

# Validate API keys
required_keys = {
    "ANTHROPIC_API_KEY": os.getenv("ANTHROPIC_API_KEY"),
    "JINA_API_KEY": os.getenv("JINA_API_KEY"),
    "QDRANT_API_KEY": os.getenv("QDRANT_API_KEY"),
    "QDRANT_URL": os.getenv("QDRANT_URL"),
    "LLAMAPARSE_API_KEY": os.getenv("LLAMAPARSE_API_KEY"),
    "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY"),
}

missing_keys = [key for key, value in required_keys.items() if not value]
if missing_keys:
    raise ValueError(
        f"❌ Missing required API keys: {', '.join(missing_keys)}. Please set them in your .env file.")

# Optional: LlamaCloud API key
llamacloud_api_key = os.getenv("LLAMACLOUD_API_KEY")
if llamacloud_api_key:
    os.environ["LLAMA_CLOUD_API_KEY"] = llamacloud_api_key

print("✅ All required API keys loaded successfully!")
print(f"📅 Session started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")



✅ All required API keys loaded successfully!
📅 Session started at: 2025-11-22 15:16:01


## 2️⃣ Configure Embedding Model

**Instructions**: Configure Jina AI embeddings for document and query encoding.

In [2]:
# ✅ Explicitly configure the embedding vector dimension
VECTOR_DIM = 2048

# Configure Jina AI embeddings (Jina v4, retrieval-optimized, multimodal-capable)
# Get your Jina AI API key for free: https://jina.ai/?sui=apikey
embed_model = JinaEmbedding(
    api_key=required_keys["JINA_API_KEY"],
    model="jina-embeddings-v4",   # ensure this model is 2048-dim or adjust VECTOR_DIM accordingly
    task="retrieval.passage",
)

# Register embedding model with LlamaIndex
Settings.embed_model = embed_model

# Validate that the model dimension matches VECTOR_DIM (fail fast if not)
actual_dim = getattr(embed_model, "dimension", None)
if actual_dim is None:
    actual_dim = getattr(embed_model, "embed_dim", None)

if actual_dim is not None and actual_dim != VECTOR_DIM:
    raise ValueError(
        f"Embedding model dimension ({actual_dim}) does not match configured VECTOR_DIM ({VECTOR_DIM}).\n"
        f"Please switch to a 2048-dimensional Jina embedding model or update VECTOR_DIM to {actual_dim}."
    )

embedding_dim = VECTOR_DIM

print("✅ Jina AI embeddings configured (model: jina-embeddings-v4)")
print(f"📏 Embedding dimension (vector size): {embedding_dim}")
print("📊 Model: multimodal & multilingual, retrieval-optimized")


✅ Jina AI embeddings configured (model: jina-embeddings-v4)
📏 Embedding dimension (vector size): 2048
📊 Model: multimodal & multilingual, retrieval-optimized


## 3️⃣ Load and Process Documents

**Instructions**: Load the Allplan PDF manual and convert it to LlamaIndex format.

In [7]:
# Load PDF document using LlamaParse page-level multimodal parsing
# (the vision model is the Anthropic Sonnet 4.0 option configured below)
filename = "../../data/Allplan_2020_Manual.pdf"
loader = LlamaParse(
    parse_mode="parse_page_with_lvm",          # page-level multimodal parsing
    model="anthropic-sonnet-4.0",                # multimodal LVM used by LlamaParse
    vendor_multimodal_api_key=required_keys["ANTHROPIC_API_KEY"],
    api_key=required_keys["LLAMAPARSE_API_KEY"],
)

# Use the correct method to load the PDF
raw_docs = loader.load_data(filename)

print(f"📄 Loaded {len(raw_docs)} pages from PDF (multimodal LlamaParse)")

# Convert to LlamaIndex Documents
# NOTE: LlamaParse documents expose `.text`; keep a fallback to `.page_content` for safety.
llama_documents: List[LlamaIndexDocument] = []
for i, doc in enumerate(raw_docs):
    text = getattr(doc, "text", None)
    if text is None:
        text = getattr(doc, "page_content", "")

    # Merge any parser-provided metadata with the source file and 1-based page number
    metadata = getattr(doc, "metadata", {}) or {}
    metadata = {
        **metadata,
        "source": filename,
        "page": i + 1,
    }

    llama_documents.append(
        LlamaIndexDocument(
            text=text,
            metadata=metadata,
        )
    )

print(f"✅ Converted {len(llama_documents)} pages to LlamaIndex format")

Started parsing the file under job_id 907080ca-00c6-4815-bea4-51724742b6cc
..📄 Loaded 307 pages from PDF (multimodal LlamaParse)
✅ Converted 307 pages to LlamaIndex format


## 4️⃣ Chunk Documents into Nodes

**Instructions**: Split documents into smaller chunks for better retrieval.

In [8]:
# Create nodes using SentenceSplitter with formatting preservation
from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(
    chunk_size=1024,          # token-based chunk size
    chunk_overlap=200,
    # paragraph separator: three consecutive newlines (i.e. two blank lines in a row)
    paragraph_separator="\n\n\n",
)
nodes = parser.get_nodes_from_documents(llama_documents)

# Enrich nodes with chunk-level metadata for downstream analysis / tooling
for i, node in enumerate(nodes):
    if node.metadata is None:
        node.metadata = {}
    
    source = node.metadata.get("source", "allplan_docs_collection")
    page = node.metadata.get("page", "NA")
    
    node.metadata["chunk_index"] = i
    node.metadata["chunk_id"] = f"{source}_p{page}_c{i}"

print(f"✅ Created {len(nodes)} nodes from {len(llama_documents)} documents")
print(f"📊 Chunk size: 1024 tokens, Overlap: 200 tokens")
print(f"📝 Formatting preserved: separators kept, paragraphs maintained")
print(f"🏷️  Chunk metadata enriched: chunk_index and chunk_id added")

# Convert back to LangChain Documents for compatibility
documents = [
    Document(
        page_content=doc.text,
        metadata=doc.metadata
    )
    for doc in llama_documents
]


✅ Created 312 nodes from 307 documents
📊 Chunk size: 1024 tokens, Overlap: 200 tokens
📝 Formatting preserved: separators kept, paragraphs maintained
🏷️  Chunk metadata enriched: chunk_index and chunk_id added
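
To make the `chunk_size` / `chunk_overlap` settings concrete, the sketch below shows a plain sliding window over a token list. It is a simplified illustration only: `sliding_window_chunks` is a hypothetical helper, and `SentenceSplitter` additionally tries to respect sentence and paragraph boundaries, so its real chunks will differ. The output above (312 nodes from 307 pages) suggests that most pages fit into a single 1024-token chunk.

```python
from typing import List


def sliding_window_chunks(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Toy example: 10 tokens, windows of 4 tokens that overlap by 1 token
tokens = [f"t{i}" for i in range(10)]
chunks = sliding_window_chunks(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

With the notebook's settings (1024 and 200), each new window would advance by 824 tokens under this simplified scheme.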


## 5️⃣ Connect to Qdrant and Build Vector Index

**Instructions**: Connect to Qdrant vector database and create/update the index.

In [9]:
# Create Qdrant client
client = qdrant_client.QdrantClient(
    url=required_keys["QDRANT_URL"],
    api_key=required_keys["QDRANT_API_KEY"],
)

# Create vector store with explicit 2048-dimensional vectors (matches embedding_dim)
vector_store = QdrantVectorStore(
    collection_name="allplan_docs_collection",
    client=client,
    # Match the enforced Jina embedding dimensionality (VECTOR_DIM = 2048)
    vector_size=embedding_dim,
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build index
index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
)

print("✅ VectorStoreIndex built and documents stored in Qdrant")
print(f"📦 Collection: allplan_docs_collection")
print(f"📐 Vector dimension: {embedding_dim} (Jina v4)")


✅ VectorStoreIndex built and documents stored in Qdrant
📦 Collection: allplan_docs_collection
📐 Vector dimension: 2048 (Jina v4)


## 6️⃣ Configure RAG Pipeline

**Instructions**: Set up the retriever, LLM, and complete RAG system.

In [10]:
# Create retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=3,
)

# Configure LLM (Claude Haiku 4.5)
chat_model = ChatAnthropic(
    model="claude-haiku-4-5-20251001",
    temperature=0.1,
    max_retries=2,
    api_key=required_keys["ANTHROPIC_API_KEY"],
)

llama_llm = LangChainLLM(llm=chat_model)

# Define LlamaIndexRAG class
class LlamaIndexRAG(RAG):
    """RAG implementation using LlamaIndex Core, Jina embeddings, and Qdrant."""

    def __init__(
        self,
        llm,
        documents: List[Document],
        k: int = 3,
        index: VectorStoreIndex = None,
        retriever: VectorIndexRetriever = None,
        query_engine: RetrieverQueryEngine = None,
    ):
        super().__init__(llm, documents, k)
        self.index = index
        self.retriever = retriever

        # Create query engine if not provided
        if query_engine is None and self.retriever is not None:
            self.query_engine = RetrieverQueryEngine.from_args(
                retriever=self.retriever,
                llm=llama_llm,
                response_mode=ResponseMode.COMPACT,
            )
        else:
            self.query_engine = query_engine

    def retrieve(self, question: str, *args: Any, **kwargs: Any) -> List[Document]:
        """Retrieve relevant documents using LlamaIndex retriever."""
        if self.retriever is None:
            return []

        # Retrieve nodes from LlamaIndex
        nodes = self.retriever.retrieve(question)

        # Convert LlamaIndex nodes to LangChain Documents
        langchain_docs = []
        for node in nodes[: self.k]:
            doc = Document(
                page_content=node.get_content(),
                metadata=getattr(node, "metadata", {}) or {},
            )
            langchain_docs.append(doc)

        return langchain_docs

    def generate(self, question: str, context: str, *args: Any, **kwargs: Any) -> str:
        """Generate answer using LlamaIndex query engine."""
        if self.query_engine is None:
            # Fallback: simple LLM call using raw context
            prompt = (
                "Using the following context, answer the question:\n\n"
                f"Context: {context}\n\n"
                f"Question: {question}\n\n"
                "Answer:"
            )
            answer = self.llm.invoke(prompt)
            if hasattr(answer, "content"):
                return answer.content
            return str(answer)

        # Use LlamaIndex query engine
        response = self.query_engine.query(question)
        return str(response)


# Create RAG instance
rag = LlamaIndexRAG(
    llm=chat_model,
    documents=documents,
    k=3,
    index=index,
    retriever=retriever,
)

print("✅ RAG pipeline fully configured and ready!")
print("📝 Retrieval: Top-3 similarity search")
print("🤖 LLM: Claude Haiku 4.5 (text-only generation; parsing is multimodal via LlamaParse)")

# Create results directory (and any missing parent directories) if it doesn't exist
results_dir = Path(glob.DATA_PKG_DIR) / "evaluation_results"
results_dir.mkdir(parents=True, exist_ok=True)

✅ RAG pipeline fully configured and ready!
📝 Retrieval: Top-3 similarity search
🤖 LLM: Claude Haiku 4.5 (text-only generation; parsing is multimodal via LlamaParse)


## Test Pipeline with Sample Query

**Instructions**: Run this cell to verify the RAG pipeline is working correctly.


In [11]:
# Run a test query to confirm everything works
test_query = "What is Allplan?"
print(f"\n🧪 Testing pipeline with query: '{test_query}'\n")

try:
    answer, relevant_docs = rag.answer(question=test_query)
    print("✅ Pipeline test successful!")
    print(f"   Answer length: {len(answer)} characters")
    print(f"   Retrieved documents: {len(relevant_docs)}")
    print(f"\n💬 Sample answer: {answer[:200]}...")
except Exception as e:
    print(f"❌ Pipeline test failed: {e}")
    raise


🧪 Testing pipeline with query: 'What is Allplan?'

✅ Pipeline test successful!
   Answer length: 335 characters
   Retrieved documents: 3

💬 Sample answer: Allplan 2020 is a high-performance CAD program designed for architects and civil engineers. It provides tools and features to help users carry out common operations and accomplish daily tasks in their...


---

# 📋 Human Evaluation Rubric (Shared by A and B)

## Scoring Criteria
Both evaluators must use this **exact rubric** to keep evaluations comparable.

### 1️⃣ Relevance
*Does the answer address the question asked?*

- **5 - Excellent**: Directly and fully addresses the question
- **4 - Good**: Addresses the question with only minor tangential content
- **3 - Fair**: Partially addresses the question but includes irrelevant information
- **2 - Poor**: Barely addresses the question, mostly irrelevant
- **1 - Very Poor**: Completely off-topic or does not address the question

### 2️⃣ Accuracy
*Is the information provided factually correct?*

- **5 - Excellent**: All information is accurate and correct
- **4 - Good**: Mostly accurate with minor errors that don't affect understanding
- **3 - Fair**: Some accurate information but notable errors present
- **2 - Poor**: Many errors, unreliable information
- **1 - Very Poor**: Completely incorrect, misleading, or fabricated

### 3️⃣ Completeness
*Does the answer cover all important aspects of the question?*

- **5 - Excellent**: Comprehensive, covers all key aspects thoroughly
- **4 - Good**: Covers most aspects with minor gaps
- **3 - Fair**: Covers basic aspects but missing important details
- **2 - Poor**: Significant gaps, incomplete answer
- **1 - Very Poor**: Severely incomplete, missing most key information

### 4️⃣ Source Quality
*Are the retrieved documents relevant and helpful?*

- **5 - Excellent**: All retrieved documents are highly relevant and support the answer
- **4 - Good**: Most documents are relevant and helpful
- **3 - Fair**: Some relevant documents but also irrelevant ones
- **2 - Poor**: Few relevant documents, mostly irrelevant
- **1 - Very Poor**: No relevant documents retrieved

## Overall Score
- Computed as the **average** of the four criteria above
- Evaluators may slightly adjust if needed, but should explain why in notes
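
As a worked example of the averaging rule, the sketch below computes an overall score from the four criterion scores and rejects values outside the 1-5 rubric range. The dictionary keys match the `scores` structure used later in this notebook; `overall_score` itself is a hypothetical helper, not part of the pipeline code.

```python
from typing import Dict

CRITERIA = ("relevance", "accuracy", "completeness", "source_quality")


def overall_score(scores: Dict[str, int]) -> float:
    """Average the four rubric criteria after checking they are all present and in range."""
    missing = [c for c in CRITERIA if scores.get(c) is None]
    if missing:
        raise ValueError(f"Missing scores for: {', '.join(missing)}")
    for criterion in CRITERIA:
        if not 1 <= scores[criterion] <= 5:
            raise ValueError(f"{criterion} must be between 1 and 5")
    return sum(scores[c] for c in CRITERIA) / len(CRITERIA)


example_scores = {"relevance": 5, "accuracy": 4, "completeness": 3, "source_quality": 4}
example_overall = overall_score(example_scores)  # (5 + 4 + 3 + 4) / 4
print(f"Overall score: {example_overall:.2f}")
```

An evaluator who adjusts the overall score by hand would record the adjusted value together with the reason in the notes.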

## Notes Fields
For each query, provide:
- **Brief notes per criterion** (what worked, what didn't)
- **General comments** (overall impression, suggestions, concerns)

---



# 👤 SECTION 2 - EVALUATOR A

## ⚠️ IMPORTANT - READ CAREFULLY

### Audience
- This section is **exclusively for Evaluator A**
- **Do NOT scroll to Evaluator B's section or comparison sections**

### Estimated Time
- 20-30 minutes (depending on number of queries)

### Your Task
1. You will see a list of queries (questions to test the RAG system)
2. For each query:
   - The system will run the RAG pipeline and show:
     - 📝 The query text
     - 💬 The model's answer
     - 📚 Retrieved context snippets
     - 🔗 Information about the sources
3. Your job is to:
   - Read the question, answer, and context carefully
   - Score the answer on **4 criteria** (1-5 each) using the rubric above
   - Provide short notes for each criterion
   - Provide general comments for the query

### Workflow
1. **Start a new session** (runs automatically)
2. **Define your queries** (or use provided list)
3. **Evaluate each query** (fill in scores and notes)
4. **Save results** (happens automatically)
5. **STOP** - Do not proceed to other sections

### Checkpoint
You are done when:
- ✅ All queries are evaluated
- ✅ The save step confirms results were written
- ✅ You see your results file path

**Then STOP and do not proceed to Evaluator B or comparison sections!**

---

## Evaluator A - Session Setup

In [None]:
# Generate unique session ID for Evaluator A
evaluator_a_session_id = f"evaluator_a_{datetime.now().strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:6]}"
print(f"🆔 Evaluator A Session ID: {evaluator_a_session_id}")

# Define queries for Evaluator A
# NOTE: These should be the same queries that Evaluator B will evaluate
evaluator_a_queries = [
    "How do I create a new project in Allplan?",
    "What are the system requirements for Allplan 2020?",
    "How can I export drawings to PDF?",
    "What is the purpose of layers in Allplan?",
    "How do I import CAD files into Allplan?",
]

print(f"\n📝 Evaluator A will evaluate {len(evaluator_a_queries)} queries:")
for i, q in enumerate(evaluator_a_queries, 1):
    print(f"   {i}. {q}")

## Function to Collect Evaluation Scores

**Note**: In an interactive notebook, this would collect user input. For demonstration, it shows the structure.

In [None]:
# Function to collect evaluation scores from user
def collect_evaluation(
    query_idx: int,
    query: str,
    answer: str,
    docs: List[Document],
    total_queries: int = 0,
) -> Dict:
    """
    Display query/answer/context and collect evaluation scores.
    In an interactive notebook, this would use input() or a form.
    For this template, we'll show what needs to be collected.

    Pass total_queries to show progress as "i/N"; with the default 0 only the index
    is printed. The function does not read either evaluator's query list, so it can
    be shared by Evaluator A and Evaluator B.
    """

    # Progress label that works for both evaluators
    progress = f"{query_idx}/{total_queries}" if total_queries else str(query_idx)

    print("\n" + "="*80)
    print(f"📝 QUERY {progress}")
    print("="*80)
    print(f"\n❓ QUESTION:\n{query}")
    print("\n" + "-"*80)
    print(f"\n💬 ANSWER:\n{answer}")
    print("\n" + "-"*80)
    print(f"\n📚 RETRIEVED CONTEXT ({len(docs)} documents):\n")

    for i, doc in enumerate(docs, 1):
        print(f"[{i}] Page {doc.metadata.get('page', 'N/A')}:")
        print(f"    {doc.page_content[:300]}...")
        print()

    print("="*80)
    print("📊 EVALUATION TIME - Use the rubric above (1-5 scale)")
    print("="*80)

    # In a real interactive session, you would use input() here
    # For this template, we'll structure what needs to be collected

    evaluation = {
        "query_index": query_idx,
        "query": query,
        "answer": answer,
        "retrieved_docs_count": len(docs),
        "timestamp": datetime.now().isoformat(),

        # These would be collected via input() in interactive mode
        "scores": {
            "relevance": None,  # input("Relevance (1-5): ")
            "accuracy": None,   # input("Accuracy (1-5): ")
            "completeness": None,  # input("Completeness (1-5): ")
            "source_quality": None,  # input("Source Quality (1-5): ")
        },
        "notes": {
            "relevance_notes": None,  # input("Relevance notes: ")
            "accuracy_notes": None,   # input("Accuracy notes: ")
            "completeness_notes": None,  # input("Completeness notes: ")
            "source_quality_notes": None,  # input("Source quality notes: ")
        },
        "general_comments": None,  # input("General comments: ")
    }

    # Calculate overall score (average of 4 criteria)
    scores = evaluation["scores"]
    if all(v is not None for v in scores.values()):
        evaluation["overall_score"] = sum(scores.values()) / len(scores)

    print("\n⚠️ IN INTERACTIVE MODE: You would fill in scores and notes here")
    print("   For now, this is a template structure showing what to collect")

    return evaluation
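
The template above leaves every score as `None`. One way the `input()` calls hinted at in its comments could be wired up is sketched below. `prompt_score` and its `ask` parameter are hypothetical names introduced for illustration; the `ask` hook only exists so the helper can be exercised without a keyboard.

```python
from typing import Callable


def prompt_score(label: str, ask: Callable[[str], str] = input, max_attempts: int = 3) -> int:
    """Ask for a whole-number score from 1 to 5, re-prompting on invalid input."""
    for _ in range(max_attempts):
        raw = ask(f"{label} (1-5): ").strip()
        try:
            value = int(raw)
        except ValueError:
            value = None
        if value is not None and 1 <= value <= 5:
            return value
        print(f"'{raw}' is not a whole number from 1 to 5, please try again.")
    raise ValueError(f"No valid score entered for {label} after {max_attempts} attempts")


# Demonstration with scripted answers instead of real keyboard input:
# the first two entries are rejected, the third is accepted.
scripted_answers = iter(["7", "abc", "4"])
demo_score = prompt_score("Relevance", ask=lambda prompt: next(scripted_answers))
print(f"Recorded relevance score: {demo_score}")
```

In an interactive session, a call such as `prompt_score("Relevance")` could fill the corresponding entry of the `scores` dictionary, which would also let the overall score be computed.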

## Evaluator A - Interactive Evaluation Loop

**Instructions**:
- Run the cell below to start the evaluation loop
- For each query, you will:
  1. See the question, answer, and context
  2. Enter scores (1-5) for each criterion
  3. Provide notes and comments
- Results are kept in memory during the loop and written to disk when you run the save cell that follows it


In [None]:
# EVALUATOR A: Main evaluation loop
print("="*80)
print("🚀 STARTING EVALUATOR A EVALUATION SESSION")
print("="*80)

evaluator_a_results = []

for idx, query in enumerate(evaluator_a_queries, 1):
    try:
        # Run RAG pipeline
        answer, relevant_docs = rag.answer(question=query)

        # Collect evaluation (in interactive mode, this would prompt for input)
        evaluation = collect_evaluation(idx, query, answer, relevant_docs)

        # Store result
        evaluator_a_results.append(evaluation)

        print(f"\n✅ Query {idx} evaluation recorded")

    except Exception as e:
        print(f"❌ Error evaluating query {idx}: {e}")
        continue

print("\n" + "="*80)
print(f"✅ EVALUATOR A COMPLETED {len(evaluator_a_results)} EVALUATIONS")
print("="*80)


In [None]:
# Save Evaluator A results
evaluator_a_file = results_dir / f"{evaluator_a_session_id}.json"

with open(evaluator_a_file, 'w') as f:
    json.dump({
        "session_id": evaluator_a_session_id,
        "evaluator": "A",
        "timestamp": datetime.now().isoformat(),
        "num_queries": len(evaluator_a_queries),
        "evaluations": evaluator_a_results,
    }, f, indent=2)

print(f"\n💾 Evaluator A results saved to:")
print(f"   {evaluator_a_file}")
print("\n" + "="*80)
print("🎉 EVALUATOR A: YOU ARE DONE!")
print("="*80)
print("\n⚠️  IMPORTANT: Do NOT proceed to other sections!")
print("   Please close this notebook now or wait for the organizer.")


---

# 👤 SECTION 3 - EVALUATOR B

## ⚠️ IMPORTANT - READ CAREFULLY

### Audience
- This section is **exclusively for Evaluator B**
- **Do NOT scroll up to Evaluator A's section**

### Independence & Blindness
- **You should NOT see Evaluator A's queries, answers, or scores in advance**
- **Use the same rubric** but rely on your own judgment
- **Until both evaluators are finished**, no one should open comparison sections

### Estimated Time
- 20-30 minutes (depending on number of queries)

### Your Task
1. You will see a list of queries (the same queries as Evaluator A)
2. For each query:
   - The system will run the RAG pipeline and show:
     - 📝 The query text
     - 💬 The model's answer
     - 📚 Retrieved context snippets
     - 🔗 Information about the sources
3. Your job is to:
   - Read the question, answer, and context carefully
   - Score the answer on **4 criteria** (1-5 each) using the rubric above
   - Provide short notes for each criterion
   - Provide general comments for the query

### Workflow
1. **Start a new session** (runs automatically)
2. **Use provided queries** (same as Evaluator A)
3. **Evaluate each query** (fill in scores and notes)
4. **Save results** (happens automatically)
5. **STOP** - Do not proceed to comparison sections

### Checkpoint
You are done when:
- ✅ All queries are evaluated
- ✅ The save step confirms results were written
- ✅ You see your results file path

**Then STOP - Do not open comparison or automated evaluation results yet!**

---

## Evaluator B - Session Setup

In [None]:
# Generate unique session ID for Evaluator B
evaluator_b_session_id = f"evaluator_b_{datetime.now().strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:6]}"
print(f"🆔 Evaluator B Session ID: {evaluator_b_session_id}")

# Define queries for Evaluator B (SAME as Evaluator A for comparison)
evaluator_b_queries = [
    "How do I create a new project in Allplan?",
    "What are the system requirements for Allplan 2020?",
    "How can I export drawings to PDF?",
    "What is the purpose of layers in Allplan?",
    "How do I import CAD files into Allplan?",
]

print(f"\n📝 Evaluator B will evaluate {len(evaluator_b_queries)} queries:")
for i, q in enumerate(evaluator_b_queries, 1):
    print(f"   {i}. {q}")

## Evaluator B - Interactive Evaluation Loop

**Instructions**:
- Run the cell below to start the evaluation loop
- For each query, you will:
  1. See the question, answer, and context
  2. Enter scores (1-5) for each criterion
  3. Provide notes and comments
- Results are kept in memory during the loop and written to disk when you run the save cell that follows it

In [None]:
# EVALUATOR B: Main evaluation loop
print("="*80)
print("🚀 STARTING EVALUATOR B EVALUATION SESSION")
print("="*80)

evaluator_b_results = []

for idx, query in enumerate(evaluator_b_queries, 1):
    try:
        # Run RAG pipeline
        answer, relevant_docs = rag.answer(question=query)

        # Collect evaluation (in interactive mode, this would prompt for input)
        evaluation = collect_evaluation(idx, query, answer, relevant_docs)

        # Store result
        evaluator_b_results.append(evaluation)

        print(f"\n✅ Query {idx} evaluation recorded")

    except Exception as e:
        print(f"❌ Error evaluating query {idx}: {e}")
        continue

print("\n" + "="*80)
print(f"✅ EVALUATOR B COMPLETED {len(evaluator_b_results)} EVALUATIONS")
print("="*80)

In [None]:
# Save Evaluator B results
evaluator_b_file = results_dir / f"{evaluator_b_session_id}.json"

with open(evaluator_b_file, 'w') as f:
    json.dump({
        "session_id": evaluator_b_session_id,
        "evaluator": "B",
        "timestamp": datetime.now().isoformat(),
        "num_queries": len(evaluator_b_queries),
        "evaluations": evaluator_b_results,
    }, f, indent=2)

print(f"\n💾 Evaluator B results saved to:")
print(f"   {evaluator_b_file}")
print("\n" + "="*80)
print("🎉 EVALUATOR B: YOU ARE DONE!")
print("="*80)
print("\n⚠️  IMPORTANT: Do NOT proceed to comparison sections!")
print("   Please close this notebook now or wait for the organizer.")


### Evaluator B - Observations and Notes

**Instructions**: Document your observations below:

#### Query 1: [Query text]
- **Relevance** (1-5): 
- **Accuracy** (1-5): 
- **Completeness** (1-5): 
- **Source Quality** (1-5): 
- **Notes**: 

#### Query 2: [Query text]
- **Relevance** (1-5): 
- **Accuracy** (1-5): 
- **Completeness** (1-5): 
- **Source Quality** (1-5): 
- **Notes**: 

#### Query 3: [Query text]
- **Relevance** (1-5): 
- **Accuracy** (1-5): 
- **Completeness** (1-5): 
- **Source Quality** (1-5): 
- **Notes**: 

#### Overall Observations:
- **Strengths**: 
- **Weaknesses**: 
- **Suggestions**: 


---

# 📊 SECTION 4 - Automated Evaluation with DeepEval

## Instructions:
Now that both evaluators have completed their manual testing, let's run automated metrics on the ground truth dataset.

This will evaluate the RAG system on:
- **Answer Relevancy**: How relevant is the answer to the question?
- **Faithfulness**: Is the answer grounded in the retrieved context?
- **Contextual Relevancy**: Are retrieved documents relevant?

---

### Load Ground Truth QA Dataset

In [None]:
# Load annotated evaluation data
# Named json_service so the standard-library `json` module stays available to other cells
json_service = JSONService(
    path="generated_qa_data_tum.json",
    root_path=glob.DATA_PKG_DIR,
    verbose=True
)

qa_data = json_service.doRead()
print(f"✅ Loaded {len(qa_data)} evaluation samples")

# Extract components
ground_truth_contexts = [item["context"] for item in qa_data]
sample_queries = [item["question"] for item in qa_data]
expected_responses = [item["answer"] for item in qa_data]

print(f"\n📋 Dataset composition:")
print(f"   - Questions: {len(sample_queries)}")
print(f"   - Contexts: {len(ground_truth_contexts)}")
print(f"   - Expected answers: {len(expected_responses)}")

### Build Evaluation Dataset

In [None]:
# Create evaluation dataset builder
builder = eval_builder.EvalDatasetBuilder(rag)

# Build the evaluation dataset
evaluation_dataset = builder.build_evaluation_dataset(
    input_contexts=ground_truth_contexts,
    sample_queries=sample_queries,
    expected_responses=expected_responses,
)

print(f"✅ Evaluation dataset built with {len(evaluation_dataset.test_cases)} test cases")

### Run DeepEval Scoring

**Note**: This may take several minutes depending on dataset size.

In [None]:
# Initialize scorer
scorer = deep.DeepEvalScorer(evaluation_dataset)

# Calculate scores
print("🔄 Running DeepEval metrics... (this may take a few minutes)")
results = scorer.calculate_scores()

print("\n" + "="*80)
print("📊 DEEPEVAL RESULTS")
print("="*80)
print(results)

### View Overall Metrics

In [None]:
# Get overall metrics summary
overall_metrics = scorer.get_overall_metrics()

print("\n" + "="*80)
print("📈 OVERALL METRICS SUMMARY")
print("="*80)
print(overall_metrics)

### Generate and Save Summary Report

In [None]:
# Generate comprehensive summary
summary = scorer.get_summary(save_to_file=True)

print("\n" + "="*80)
print("📝 EVALUATION SUMMARY")
print("="*80)
print(summary)
print("\n✅ Summary saved to file!")

---

# 🔍 SECTION 5 - Comparison & Discussion

## Audience
- **Organizer plus both evaluators jointly**, after both have finished

## Overview
This section loads:
- Evaluator A's results
- Evaluator B's results
- Optional automated metrics

It then:
- Aligns evaluations by query
- Compares individual scores (relevance, accuracy, completeness, source quality)
- Compares overall scores

## Graceful Handling of Missing Data
- If one evaluator did not complete all queries, the notebook will:
  - Compare only on the overlapping queries
  - Warn when data is missing or incomplete
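
The overlap handling described above can be sketched as follows. This is a minimal illustration that matches evaluations on their `query` text (a key present in each saved evaluation); `align_by_query` is a hypothetical helper and the notebook's actual comparison code may align differently, for example by `query_index`.

```python
from typing import Dict, List, Tuple


def align_by_query(
    evals_a: List[Dict], evals_b: List[Dict]
) -> Tuple[List[Tuple[Dict, Dict]], List[str]]:
    """Pair evaluations that share the same query text and list queries seen by only one evaluator."""
    by_query_a = {e["query"]: e for e in evals_a}
    by_query_b = {e["query"]: e for e in evals_b}

    # Keep Evaluator A's order for the overlapping queries
    pairs = [(e, by_query_b[e["query"]]) for e in evals_a if e["query"] in by_query_b]

    # Queries evaluated by exactly one of the two evaluators
    unmatched = sorted(set(by_query_a) ^ set(by_query_b))
    return pairs, unmatched


# Toy data: A evaluated Q1 and Q2, B evaluated Q2 and Q3
toy_a = [{"query": "Q1", "scores": {"relevance": 4}}, {"query": "Q2", "scores": {"relevance": 5}}]
toy_b = [{"query": "Q2", "scores": {"relevance": 3}}, {"query": "Q3", "scores": {"relevance": 2}}]

pairs, unmatched = align_by_query(toy_a, toy_b)
print(f"Comparable queries: {[a['query'] for a, _ in pairs]}")
if unmatched:
    print(f"Warning: evaluated by only one evaluator: {unmatched}")
```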

## What You'll See
- **Aggregate statistics**: Average scores per evaluator, per criterion
- **Inter-rater agreement**: How closely the two evaluators' scores match for the same queries
- **Query-level comparison**: Side-by-side scores for each query
- **Discussion prompts**: Questions to guide your analysis
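
For the inter-rater agreement item, a simple starting point is sketched below: the mean absolute difference between paired scores plus the share of exact and within-one-point matches. These particular statistics and the `agreement_stats` helper are assumptions chosen for illustration; the notebook does not specify which agreement measure it uses, and a chance-corrected measure such as a weighted kappa may be preferable for larger query sets.

```python
from typing import Dict, List


def agreement_stats(scores_a: List[int], scores_b: List[int]) -> Dict[str, float]:
    """Summarize how closely two evaluators' paired 1-5 scores agree."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("Both evaluators need the same, non-zero number of scores")

    diffs = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return {
        "mean_abs_diff": sum(diffs) / n,                   # 0.0 means identical scores
        "exact_agreement": sum(d == 0 for d in diffs) / n,  # share of identical scores
        "within_one": sum(d <= 1 for d in diffs) / n,       # share differing by at most 1 point
    }


# Toy relevance scores for five queries (illustrative values, not real results)
relevance_a = [5, 4, 3, 2, 4]
relevance_b = [5, 3, 3, 4, 4]
agreement = agreement_stats(relevance_a, relevance_b)
for name, value in agreement.items():
    print(f"{name}: {value:.2f}")
```

Running the same function per criterion makes it easier to see where the rubric is interpreted differently by the two evaluators.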

---

### Load Evaluation Results

In [None]:
# Function to load most recent evaluator results
import json  # standard-library module; imported here so this cell does not rely on earlier bindings
from typing import Optional


def load_evaluator_results(evaluator: str) -> Optional[Dict]:
    """Load the most recent results for a given evaluator, or return None if no file exists."""
    pattern = f"evaluator_{evaluator.lower()}_*.json"
    files = sorted(results_dir.glob(pattern),
                   key=lambda x: x.stat().st_mtime, reverse=True)

    if not files:
        print(f"⚠️  No results found for Evaluator {evaluator}")
        return None

    with open(files[0], 'r') as f:
        data = json.load(f)

    print(f"✅ Loaded Evaluator {evaluator} results from: {files[0].name}")
    return data


# Load results
eval_a_data = load_evaluator_results("A")
eval_b_data = load_evaluator_results("B")

if eval_a_data is None or eval_b_data is None:
    print("\n‚ùå Cannot proceed with comparison - missing evaluator data")
else:
    print(f"\nüìä Comparison ready:")
    print(f"   - Evaluator A: {len(eval_a_data['evaluations'])} evaluations")
    print(f"   - Evaluator B: {len(eval_b_data['evaluations'])} evaluations")

### Aggregate Statistics

In [None]:
def calculate_aggregate_stats(evaluations: List[Dict]) -> Dict:
    """Calculate aggregate statistics for an evaluator's results."""
    criteria = ['relevance', 'accuracy', 'completeness', 'source_quality']

    # Keep only evaluations in which every criterion has been scored
    valid_evals = [e for e in evaluations if all(
        s is not None for s in e['scores'].values())]

    stats = {"num_evaluations": len(valid_evals)}
    for c in criteria:
        stats[f"avg_{c}"] = (
            sum(e['scores'][c] for e in valid_evals) / len(valid_evals)
            if valid_evals else None)

    # Average the overall score only over evaluations that recorded one,
    # so a missing value is not silently counted as 0
    overall = [e['overall_score'] for e in valid_evals
               if e.get('overall_score') is not None]
    stats["avg_overall"] = sum(overall) / len(overall) if overall else None
    return stats


def print_aggregate_stats(label: str, stats: Dict) -> None:
    """Print one evaluator's averages, skipping criteria with no data."""
    print(f"\n👤 Evaluator {label} ({stats['num_evaluations']} complete evaluations):")
    for key, value in stats.items():
        if value is not None and key != "num_evaluations":
            print(
                f"   {key.replace('avg_', '').replace('_', ' ').title()}: {value:.2f}")


if eval_a_data and eval_b_data:
    stats_a = calculate_aggregate_stats(eval_a_data['evaluations'])
    stats_b = calculate_aggregate_stats(eval_b_data['evaluations'])

    print("="*80)
    print("📈 AGGREGATE STATISTICS")
    print("="*80)

    print_aggregate_stats("A", stats_a)
    print_aggregate_stats("B", stats_b)

### Inter-Rater Agreement Analysis


In [None]:
from typing import Optional


def calculate_agreement(eval_a: List[Dict], eval_b: List[Dict]) -> Optional[Dict]:
    """Calculate inter-rater agreement on the queries both evaluators scored."""

    def fully_scored(e: Dict) -> bool:
        # Every criterion and the overall score must be present
        return (all(s is not None for s in e['scores'].values())
                and e.get('overall_score') is not None)

    # Match evaluations by query text rather than by list position, so a
    # query skipped by one evaluator does not misalign the remaining pairs
    b_by_query = {e['query']: e for e in eval_b}
    a_queries = {e['query'] for e in eval_a}

    for query in sorted(a_queries ^ set(b_by_query)):
        print(f"⚠️  Query scored by only one evaluator, skipped: '{query}'")

    agreements = []
    for ea in eval_a:
        eb = b_by_query.get(ea['query'])
        if eb is None:
            continue
        if not (fully_scored(ea) and fully_scored(eb)):
            print(f"⚠️  Incomplete scores, skipped: '{ea['query']}'")
            continue

        agreements.append({
            'query': ea['query'],
            'diff_relevance': abs(
                ea['scores']['relevance'] - eb['scores']['relevance']),
            'diff_accuracy': abs(
                ea['scores']['accuracy'] - eb['scores']['accuracy']),
            'diff_completeness': abs(
                ea['scores']['completeness'] - eb['scores']['completeness']),
            'diff_source': abs(
                ea['scores']['source_quality'] - eb['scores']['source_quality']),
            'diff_overall': abs(ea['overall_score'] - eb['overall_score']),
        })

    if not agreements:
        return None

    return {
        'num_queries': len(agreements),
        'mean_abs_diff_overall': sum(a['diff_overall'] for a in agreements) / len(agreements),
        'max_diff_overall': max(a['diff_overall'] for a in agreements),
        'queries_with_large_disagreement': [a for a in agreements if a['diff_overall'] >= 2.0],
    }


if eval_a_data and eval_b_data:
    agreement = calculate_agreement(
        eval_a_data['evaluations'], eval_b_data['evaluations'])

    if agreement:
        print("\n" + "="*80)
        print("🤝 INTER-RATER AGREEMENT")
        print("="*80)
        print(f"\nQueries compared: {agreement['num_queries']}")
        print(
            f"Mean absolute difference (overall score): {agreement['mean_abs_diff_overall']:.2f}")
        print(
            f"Maximum difference (overall score): {agreement['max_diff_overall']:.2f}")

        if agreement['queries_with_large_disagreement']:
            print("\n⚠️  Queries with large disagreement (≥2.0 points):")
            for q in agreement['queries_with_large_disagreement']:
                print(
                    f"   - '{q['query'][:60]}...' (diff: {q['diff_overall']:.2f})")
    else:
        print("\n⚠️  No overlapping, fully scored queries to compare")


---

# 📤 SECTION 6 - Export & Reporting

## Purpose
Export all evaluation data for further analysis or sharing with stakeholders.

## What Can Be Exported
- **Human evaluation data** (from Evaluator A and B) → CSV files
- **Automated metrics** (if present) → CSV file
- **Combined comparison report** → CSV or JSON
- **Entire notebook** → HTML or PDF (via Jupyter tools)

## How to Export

### CSV Export
- Run the cells below to generate CSV files
- Files will be saved in the `evaluation_results` directory
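
The combined comparison report listed above could be assembled along these lines. This is only a sketch using the standard library: the helper names (`build_comparison_rows`, `rows_to_csv`), the column names, and the sample records are assumptions, not part of the notebook.

```python
import csv
import io
from typing import Dict, List

FIELDS = ["query", "overall_a", "overall_b", "abs_diff"]


def build_comparison_rows(eval_a: List[Dict], eval_b: List[Dict]) -> List[Dict]:
    """One row per query scored by both evaluators, with both overall scores."""
    overall_b = {e["query"]: e.get("overall_score") for e in eval_b}
    rows = []
    for e in eval_a:
        if e["query"] not in overall_b:
            continue  # query missing for Evaluator B
        score_a, score_b = e.get("overall_score"), overall_b[e["query"]]
        both_scored = score_a is not None and score_b is not None
        rows.append({
            "query": e["query"],
            "overall_a": score_a,
            "overall_b": score_b,
            "abs_diff": abs(score_a - score_b) if both_scored else None,
        })
    return rows


def rows_to_csv(rows: List[Dict]) -> str:
    """Serialize comparison rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data, for illustration only
sample_a = [{"query": "q1", "overall_score": 4.0}, {"query": "q2", "overall_score": 3.0}]
sample_b = [{"query": "q1", "overall_score": 2.5}, {"query": "q3", "overall_score": 5.0}]

comparison_rows = build_comparison_rows(sample_a, sample_b)
csv_text = rows_to_csv(comparison_rows)
print(csv_text)
```

To write a file instead of a string, the same `csv.DictWriter` can be given a file handle opened with `newline=""` in place of the `StringIO` buffer.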

### Notebook Export
- Use Jupyter's File → Download as → HTML/PDF
- For PDF export, you may need LaTeX installed
- Alternative: Export to HTML and print to PDF from browser

## Versioning
- **Recommendation**: Track the notebook file and exported CSVs in git
- The notebook contains a version string for tracking changes over time

---


In [None]:
def export_evaluations_to_csv(eval_data: Dict, output_file: Path) -> None:
    """Export evaluation results to CSV format."""

    rows = []
    for e in eval_data['evaluations']:
        # Tolerate records that were saved without scores or notes
        scores = e.get('scores') or {}
        notes = e.get('notes') or {}
        row = {
            'query_index': e['query_index'],
            'query': e['query'],
            'timestamp': e['timestamp'],
            'relevance': scores.get('relevance'),
            'accuracy': scores.get('accuracy'),
            'completeness': scores.get('completeness'),
            'source_quality': scores.get('source_quality'),
            'overall_score': e.get('overall_score'),
            'relevance_notes': notes.get('relevance_notes'),
            'accuracy_notes': notes.get('accuracy_notes'),
            'completeness_notes': notes.get('completeness_notes'),
            'source_quality_notes': notes.get('source_quality_notes'),
            'general_comments': e.get('general_comments'),
        }
        rows.append(row)

    df = pd.DataFrame(rows)
    df.to_csv(output_file, index=False)
    print(f"✅ Exported to: {output_file}")


# Export Evaluator A and B results to CSV
exported_files = []

if eval_a_data:
    export_file_a = results_dir / f"{eval_a_data['session_id']}.csv"
    export_evaluations_to_csv(eval_a_data, export_file_a)
    exported_files.append(export_file_a)

if eval_b_data:
    export_file_b = results_dir / f"{eval_b_data['session_id']}.csv"
    export_evaluations_to_csv(eval_b_data, export_file_b)
    exported_files.append(export_file_b)

# Report success only if something was actually written
if exported_files:
    print(f"\n💾 Exported {len(exported_files)} evaluation file(s) successfully!")
else:
    print("\n⚠️  No evaluation results available to export")


---

# ✅ Evaluation Complete!

## Summary

You have successfully:
1. ✅ Set up the RAG pipeline with LlamaIndex, Jina AI v4, and Qdrant
2. ✅ Conducted independent manual evaluations (Evaluator A & B)
3. ✅ Run automated metrics with DeepEval
4. ✅ Compared and analyzed results
5. ✅ Exported data for further analysis

## Next Steps

- 📊 Review the generated summary report
- 🔧 Implement suggested improvements
- 🔄 Re-run evaluation to measure improvements
- 🧪 Consider testing with different:
  - Chunk sizes (current: 1024 tokens)
  - Retrieval parameters (current: top-3)
  - LLM models (current: Claude Haiku 4.5)
  - Embedding models (current: Jina v4)

## Version
- **Notebook Version**: 2.0.0
- **Created**: November 22, 2025
- **Last Modified**: November 22, 2025
- **Key Improvements**: Enhanced evaluation workflow, Jina v4 embeddings, robust comparison analytics

---

## üìù Notes for Future Iterations

- Consider adding more diverse query types
- Implement automatic query generation
- Add visualization of score distributions
- Create dashboards for real-time evaluation monitoring
- Integrate with CI/CD pipeline for continuous evaluation

---


### Discussion Section

**Instructions**: Both evaluators should now discuss their findings together.

#### Key Discussion Points:

1. **What types of queries worked well?**
   - 

2. **What types of queries struggled?**
   - 

3. **Did manual evaluation align with automated metrics?**
   - 

4. **What are the main strengths of this RAG system?**
   - 

5. **What are the main weaknesses?**
   - 

6. **Recommended improvements:**
   - Retrieval:
   - Generation:
   - Chunking strategy:
   - Other:

7. **Overall assessment (1-10):**
   - Evaluator A score: 
   - Evaluator B score: 
   - Automated score (average): 


---

