# 📊 Notebook 04: Evaluation & Monitoring

This notebook shows how to measure the retrieval and answer quality of our RAG engine and how to monitor it in production.

## 🎯 Objectives
1. **Retrieval Evaluation**: Compare Vector vs Hybrid search quality using keyword recall on a golden set.
2. **Guardrails**: Test prompt grounding and hallucination prevention.
3. **Monitoring**: Introduction to Prometheus metrics in RAG Engine Mini.

---

In [None]:
import os
import sys
from typing import List

# Add the project root (the parent directory) to sys.path so the `src` package is importable
sys.path.append(os.path.abspath("../"))

from src.core.bootstrap import get_container
from src.domain.entities import TenantId, Answer
from src.application.use_cases.ask_question_hybrid import AskHybridRequest

container = get_container()
log = f"Container services loaded: {list(container.keys())}"
print(log)

## 1. Retrieval Evaluation

We use a "Golden Set" of questions and expected answers or keywords to evaluate retrieval quality.
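
The scoring idea can be sketched in isolation before it is run against the real container. The snippet below is a minimal illustration, not the project's implementation: `FakeChunk` and `keyword_recall` are hypothetical names, and the chunk texts are made up for the example. It assumes only that a retrieved chunk exposes a `.text` attribute.

```python
from dataclasses import dataclass


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a retrieved chunk; only `.text` is used."""
    text: str


def keyword_recall(expected_keywords, retrieved_chunks):
    """Fraction of expected keywords found (case-insensitively) in the chunks."""
    if not expected_keywords:
        return 0.0  # nothing to look for, so avoid dividing by zero
    combined = " ".join(chunk.text.lower() for chunk in retrieved_chunks)
    hits = [kw for kw in expected_keywords if kw.lower() in combined]
    return len(hits) / len(expected_keywords)


# Made-up retrieval result for illustration only
chunks = [
    FakeChunk("Hybrid search merges vector and keyword results."),
    FakeChunk("Ranks are combined with Reciprocal Rank Fusion."),
]
expected = ["Reciprocal Rank Fusion", "RRF", "Vector", "Keyword"]

score = keyword_recall(expected, chunks)
print(f"Recall: {score:.2f}")  # 3 of 4 keywords matched -> Recall: 0.75
```

Because this is plain substring matching, a keyword such as "Vector" also matches "vectorized", and synonyms such as "RRF" count as separate keywords. The score is therefore a rough signal, not an exact recall measurement.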

In [None]:
GOLDEN_SET = [
    {
        "question": "What is the main architecture of this project?",
        "expected": ["Clean Architecture", "Ports and Adapters", "Hexagonal"]
    },
    {
        "question": "How do we handle hybrid search?",
        "expected": ["Reciprocal Rank Fusion", "RRF", "Vector", "Keyword"]
    }
]

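def evaluate_recall(question, expected_keywords, retrieved_chunks):
    """Return the fraction of expected keywords found in the retrieved chunks.

    `question` does not affect the score; it is accepted for logging convenience.
    """
    if not expected_keywords:
        return 0.0  # avoid division by zero on an empty keyword list
    # Case-insensitive substring match over the concatenated chunk text
    combined_text = " ".join(c.text.lower() for c in retrieved_chunks)
    found = [kw for kw in expected_keywords if kw.lower() in combined_text]
    return len(found) / len(expected_keywords)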

eval_use_case = container["ask_hybrid_use_case"]
tenant = TenantId("notebook_eval")

for entry in GOLDEN_SET:
    # Run only the retrieval step (no LLM generation)
    chunks = eval_use_case.execute_retrieval_only(
        tenant_id=tenant,
        question=entry["question"],
        expand_query=True  # enable query expansion before searching
    )
    
    score = evaluate_recall(entry["question"], entry["expected"], chunks)
    print(f"Question: {entry['question']}")
    print(f"Recall Score: {score:.2f} ({len(chunks)} chunks retrieved)\n")

## 2. Guardrail Testing

Let's try to "trick" the model into hallucinating something not in the context.
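
The pass/fail check in this section depends on spotting refusal wording in the model's answer. A minimal sketch of such a detector is shown here. The name `is_grounded_refusal` and the marker list are hypothetical, chosen for illustration, and would need tuning to the wording the prompt actually asks the model to use.

```python
# Hypothetical refusal phrases; adjust to match the wording your prompt requests.
REFUSAL_MARKERS = (
    "don't have",
    "do not have",
    "not mentioned",
    "no information",
)


def is_grounded_refusal(answer, markers=REFUSAL_MARKERS):
    """Return True if the answer contains any known refusal phrase."""
    # Normalize case and typographic apostrophes so "don’t" matches "don't"
    normalized = answer.lower().replace("\u2019", "'")
    return any(marker in normalized for marker in markers)


print(is_grounded_refusal("I don't have enough information to answer that."))
print(is_grounded_refusal("The capital of Mars is Olympus City."))
```

Phrase matching is a heuristic: a model can refuse in words that are not on the list, or hallucinate while using one of these phrases, so a sample of responses should still be reviewed manually.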

In [None]:
def test_guardrails(question, context):
    # We'll use the prompt builder directly to see how it grounds the model
    from src.application.services.prompt_builder import build_rag_prompt
    from src.domain.entities import Chunk, DocumentId
    
    mock_chunks = [Chunk(id="1", tenant_id=tenant, document_id=DocumentId("doc"), text=context)]
    prompt = build_rag_prompt(question, mock_chunks)
    
    llm = container["llm"]
    answer = llm.generate(prompt)
    return answer

irrelevant_question = "What is the capital of Mars?"
context = "This project is about RAG Engine and Artificial Intelligence."

answer = test_guardrails(irrelevant_question, context)
print(f"Question: {irrelevant_question}")
print(f"Context: {context}")
print(f"Model Response: {answer}")
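# Heuristic check: look for refusal wording in the model's answer
refusal_markers = ("don't have", "not mentioned")
is_prevented = any(marker in answer.lower() for marker in refusal_markers)
print("\nIs Hallucination Prevented?", "Yes" if is_prevented else "No")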

## 3. Production Monitoring

RAG Engine Mini exposes runtime metrics in **Prometheus** format.

You can monitor:
- **rag_api_request_duration_seconds**: Latency of API requests, in seconds.
- **rag_llm_tokens_total**: Cumulative LLM token usage, a proxy for cost.
- **rag_embedding_cache_total**: Embedding cache lookups, which indicate how effective the Redis cache is.

To see metrics in action, run `make run` and visit `http://localhost:8000/metrics`.
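
The `/metrics` endpoint returns plain text, so it can also be inspected from a notebook. The sketch below parses a hand-written sample, not real output from RAG Engine Mini. It assumes the duration metric is a Prometheus histogram (which exposes `_sum` and `_count` series), that series carry no labels, and that lines have no trailing timestamps. `parse_samples` is a hypothetical helper.

```python
# Hand-written sample in Prometheus text exposition format (values are made up).
SAMPLE_METRICS = """\
# HELP rag_api_request_duration_seconds Request latency
# TYPE rag_api_request_duration_seconds histogram
rag_api_request_duration_seconds_sum 12.5
rag_api_request_duration_seconds_count 50
rag_llm_tokens_total 48210
"""


def parse_samples(metrics_text):
    """Parse exposition lines into {series: value}, skipping comments."""
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field (no timestamps assumed)
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


samples = parse_samples(SAMPLE_METRICS)
mean_duration = (
    samples["rag_api_request_duration_seconds_sum"]
    / samples["rag_api_request_duration_seconds_count"]
)
print(f"Mean request duration: {mean_duration:.3f}s")
print(f"Total LLM tokens: {samples['rag_llm_tokens_total']:.0f}")
```

With the server running, the same function could be fed the text fetched from `http://localhost:8000/metrics`, for example via `urllib.request.urlopen`. Real output will usually include labels and bucket series, so the series names used above may need adjusting.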

---
**Congratulations!** You've completed the educational journey for RAG Engine Mini Stage 1.

Check out the [White-Box Documentation](../docs/README.md) for more details.