# RAG Evaluation and Observability with MLflow

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Build** a complete RAG pipeline using LangChain v1.0+
2. **Understand** why evaluation is critical for production RAG systems
3. **Create** a "Golden Dataset" for systematic evaluation
4. **Use MLflow** to track experiments and enable observability
5. **Interpret** LLM-as-a-Judge metrics (Faithfulness, Answer Relevance)
6. **Debug** RAG failures using per-question analysis
7. **Iterate** on RAG configurations using experiment comparison


---

## üéØ Why Evaluate RAG Systems?

Before we dive into building, let's understand **why evaluation matters**.

### The Problem: "It Looks Good" is Not Enough

When you test a chatbot manually, you might ask 5-10 questions and think: *"The answers seem reasonable!"* But in production:

- You can't manually check thousands of queries
- Users will ask questions you never anticipated
- Small changes (new documents, different LLM) can break things silently

### What Can Go Wrong in RAG?

| Failure Type | Description | Example |
|--------------|-------------|--------|
| **Retrieval Failure** | Wrong documents were fetched | User asks about "Python" (the language) but retrieves documents about "python" (the snake) |
| **Hallucination** | LLM invents information not in documents | LLM confidently states a date that doesn't exist in your PDFs |
| **Irrelevant Answer** | Answer is technically correct but doesn't address the question | User asks "How do I install X?" and gets "X was developed in 2020..." |
| **Context Window Overflow** | Too many chunks stuffed into prompt | The LLM gets confused or ignores important context |

### The Solution: Systematic Evaluation

We need:
1. **A benchmark** (Golden Dataset) with known correct answers
2. **Automated metrics** that can score answers at scale
3. **Observability** to trace what happened at each step
4. **Experiment tracking** to compare different configurations

> üí° **Key Insight**: Evaluation is not just about quality‚Äîit's about **confidence**. You need to know *when* your system will fail, not just hope it won't.


---

## üî≠ Introduction to LLM Observability

### What is Observability?

**Observability** is the ability to understand what's happening *inside* your system by examining its *outputs*. For LLM applications, this means:

- **Traces**: The complete journey of a request (query ‚Üí retrieval ‚Üí generation ‚Üí response)
- **Metrics**: Quantitative measurements (latency, token count, quality scores)
- **Logs**: Detailed records of inputs, outputs, and intermediate steps

### Why Observability for RAG?

RAG systems are **multi-step pipelines**. When something goes wrong, you need to know:

```
User Query ‚Üí [Embedding] ‚Üí [Retrieval] ‚Üí [Prompt Construction] ‚Üí [LLM Generation] ‚Üí Response
     ‚Üì            ‚Üì             ‚Üì                ‚Üì                     ‚Üì              ‚Üì
  Logged?      Traced?      What docs?      What prompt?          What output?    Scored?
```

Without observability, debugging is like finding a needle in a haystack.

### MLflow for LLM Observability

**MLflow** is an open-source platform originally designed for ML experiment tracking. It now supports:

| Feature | Description |
|---------|-------------|
| **Autologging** | Automatically capture LangChain traces (no code changes!) |
| **Experiment Tracking** | Compare different configurations side-by-side |
| **LLM Evaluation** | Built-in metrics for faithfulness, relevance, etc. |
| **Artifacts** | Store evaluation datasets and results |
| **UI** | Visual dashboard to explore all of the above |

> üìå **In this notebook**, we use `mlflow.langchain.autolog()` to automatically capture every LangChain call, then `mlflow.evaluate()` to score our RAG responses.


---

## Step 0: Install Required Packages

Before we begin, we need to install the required packages. This cell installs:

- **`langchain`**: The core LangChain framework
- **`langchain-community`**: Community integrations (document loaders, etc.)
- **`langchain-openai`**: OpenAI-specific components (embeddings, chat models)
- **`langchain-chroma`**: Chroma vector store integration
- **`langgraph`**: Graph-based agent orchestration (required for modern agents in v1.0+)
- **`pypdf`**: PDF parsing library
- **`gradio`**: Web interface for interactive demos
- **`python-dotenv`**: Environment variable management

> ‚ö†Ô∏è **Important**: After running this cell, you may need to **restart the kernel** to ensure all packages are properly loaded.

In [1]:
# Install required packages
# Note: The --force-reinstall for numpy and scipy fixes potential binary incompatibility issues
# !uv pip install -U -q langchain langchain-community langchain-openai langchain-chroma langgraph pypdf gradio python-dotenv
# !uv pip install --force-reinstall -q numpy scipy

print("‚úÖ Packages installed! Please restart the kernel if this is your first time running this cell.")

‚úÖ Packages installed! Please restart the kernel if this is your first time running this cell.


In [2]:
# Install MLflow with GenAI evaluation support
!uv pip install -q mlflow

print("‚úÖ MLflow installed - restart kernel if this is your first time")


‚úÖ MLflow installed - restart kernel if this is your first time


---

## Step 1: Environment Setup

We need to configure our API keys to authenticate with OpenAI. This notebook supports both:

- **Google Colab**: Uses `google.colab.userdata` to securely access keys stored in Colab Secrets
- **Local Execution**: Uses `python-dotenv` to load keys from a `.env` file

### Setting Up Your API Key

**For local development**, create a `.env` file in this directory with:
```
OPENAI_API_KEY=your-api-key-here
```

**For Colab**, add your key to Colab Secrets with the name `OPENAI_API_KEY`.

In [None]:
import os
import sys

# Configuration
MODEL = "gpt-4o-mini"  # The LLM model to use
db_name = "vector_db"  # Directory name for the vector store

# Option 1: Set your API key directly (for Colab)
#from google.colab import userdata
#os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Option 2
# Load environment variables from .env file
# from dotenv import load_dotenv
# load_dotenv()

# Verify API key is set
if os.environ.get("OPENAI_API_KEY"):
    print("‚úÖ OPENAI_API_KEY loaded successfully")
else:
    print("‚ö†Ô∏è Warning: OPENAI_API_KEY not found. Please set it in your .env file or environment.")

‚úÖ OPENAI_API_KEY loaded successfully


### RAG Chain Architecture

Here's the complete flow of our RAG system:

```
‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
‚îÇ  RAG Chain Flow                                     ‚îÇ
‚îú‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚î§
‚îÇ                                                     ‚îÇ
‚îÇ  1. User Question + Chat History                    ‚îÇ
‚îÇ     ‚Üì                                               ‚îÇ
‚îÇ  2. History-Aware Retriever                         ‚îÇ
‚îÇ     (Reformulates question to be standalone)        ‚îÇ
‚îÇ     ‚Üì                                               ‚îÇ
‚îÇ  3. Vector Store Search (Chroma)                    ‚îÇ
‚îÇ     (Finds top-k most similar chunks)               ‚îÇ
‚îÇ     ‚Üì                                               ‚îÇ
‚îÇ  4. Question-Answer Chain                           ‚îÇ
‚îÇ     (LLM generates answer using retrieved context)  ‚îÇ
‚îÇ     ‚Üì                                               ‚îÇ
‚îÇ  5. Final Response                                  ‚îÇ
‚îÇ                                                     ‚îÇ
‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò
```

**Key Components:**
- **History-Aware Retriever**: Handles follow-up questions by reformulating them
- **Vector Store**: Stores embeddings and performs semantic search
- **Stuff Documents Chain**: "Stuffs" all retrieved docs into the LLM prompt


---

## Step 2: Import Dependencies

Now we import all the necessary modules from LangChain and other libraries. Here's what each import does:

### Document Processing
- **`DirectoryLoader`**: Loads multiple files from a directory
- **`PyPDFLoader`**: Parses PDF files into text
- **`RecursiveCharacterTextSplitter`**: Splits text into chunks while respecting natural boundaries

### Embeddings & Vector Store
- **`OpenAIEmbeddings`**: Converts text to vector embeddings using OpenAI's models
- **`Chroma`**: A fast, open-source vector database

### LLM & Chains
- **`ChatOpenAI`**: OpenAI's chat models (GPT-4, etc.)
- **`create_history_aware_retriever`**: Creates a retriever that understands conversation context
- **`create_retrieval_chain`**: Combines retrieval and generation into a single chain
- **`create_stuff_documents_chain`**: Creates a chain that "stuffs" documents into the prompt

### Prompts & Messages
- **`ChatPromptTemplate`**: Templates for structured prompts
- **`MessagesPlaceholder`**: Placeholder for conversation history
- **`HumanMessage` / `AIMessage`**: Message types for chat history

In [4]:
import glob
import os

# Document loading and processing
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Embeddings and LLM
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

# Vector store
from langchain_chroma import Chroma

# Chains for RAG
from langchain_classic.chains import create_history_aware_retriever, create_retrieval_chain
from langchain_classic.chains.combine_documents import create_stuff_documents_chain

# Prompts and messages
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage

# UI
import gradio as gr

print("‚úÖ All imports successful!")

‚úÖ All imports successful!


In [5]:
import pandas as pd
import mlflow

# Configure MLflow to use SQLite database instead of deprecated filesystem
# This provides better durability, easier collaboration, and is future-proof
mlflow.set_tracking_uri("sqlite:///mlflow.db")

# Optional: import GenAI evaluation metrics if you plan to use mlflow.evaluate
from mlflow.metrics.genai import (
    faithfulness,
    answer_relevance,
)

# Enable automatic tracing for your LangChain RAG pipeline
# By default, trace logging is enabled; you can add more options per your MLflow version.
mlflow.langchain.autolog(
    log_traces=True,
)

# Set experiment name (all runs will be grouped here)
mlflow.set_experiment("RAG_PDF_Embeddings_Evaluation_v3")

print("‚úÖ MLflow configured with SQLite backend")
print("üí° Your experiments are stored in: mlflow.db")
print("üí° Start MLflow UI in terminal: mlflow ui --backend-store-uri sqlite:///mlflow.db")
print("   Then visit: http://localhost:5000")


2025/12/12 14:18:17 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2025/12/12 14:18:17 INFO mlflow.store.db.utils: Updating database tables
2025-12-12 14:18:17 INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
2025-12-12 14:18:17 INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
2025-12-12 14:18:17 INFO  [alembic.runtime.migration] Running upgrade  -> 451aebb31d03, add metric step
2025-12-12 14:18:17 INFO  [alembic.runtime.migration] Running upgrade 451aebb31d03 -> 90e64c465722, migrate user column to tags
2025-12-12 14:18:17 INFO  [alembic.runtime.migration] Running upgrade 90e64c465722 -> 181f10493468, allow nulls for metric values
2025-12-12 14:18:17 INFO  [alembic.runtime.migration] Running upgrade 181f10493468 -> df50e92ffc5e, Add Experiment Tags Table
2025-12-12 14:18:17 INFO  [alembic.runtime.migration] Running upgrade df50e92ffc5e -> 7ac759974ad8, Update run tags with larger limit
2025-12-12 14:18:17 INFO  [alembic.runtime.mig

‚úÖ MLflow configured with SQLite backend
üí° Your experiments are stored in: mlflow.db
üí° Start MLflow UI in terminal: mlflow ui --backend-store-uri sqlite:///mlflow.db
   Then visit: http://localhost:5000


### üîç Detailed Breakdown of Imports

Let's break down exactly what each imported component does:

- **`DirectoryLoader, PyPDFLoader`**: `DirectoryLoader` helps us grab all files in a folder. `PyPDFLoader` is the specialist that knows how to read PDF files page by page.
- **`RecursiveCharacterTextSplitter`**: This is a "smart" splitter. Instead of just chopping text every 1000 characters (which might cut a sentence in half), it tries to split at natural boundaries like paragraphs `\n\n` or sentences `\n` to check context.
- **`OpenAIEmbeddings`**: This tool takes text and turns it into a list of numbers (vectors). It uses OpenAI's models to do this translation.
- **`Chroma`**: This is our database. It stores the vectors we created with `OpenAIEmbeddings` so we can search them later.
- **`create_history_aware_retriever`**: A special chain that takes your follow-up question (e.g., "How does it work?") and your chat history, and rewrites it into a full question (e.g., "How does RAG work?") so the database can understand it.
- **`create_retrieval_chain`**: The manager that coordinates everything: it gets the question, sends it to the retriever, gets documents back, and sends them to the LLM.
- **`create_stuff_documents_chain`**: The worker that actually sends the prompt to the LLM. It "stuffs" all the retrieved text into the system prompt.
- **`ChatPromptTemplate`**: A flexible template builder. It lets us create prompts with placeholders (like `{context}` or `{input}`) that get filled in dynamically.


---

## Step 3: Load Documents

The first step in building a RAG application is loading your documents. We use:

- **`glob.glob()`**: To find all PDF files in the `pdfs/` directory and current directory
- **`PyPDFLoader`**: To parse each PDF and extract text content

### Document Structure

Each loaded document contains:
- **`page_content`**: The actual text content
- **`metadata`**: Information about the document (source file, page number, etc.)

> üìÅ **Note**: Place your PDF files in a `pdfs/` subdirectory or in the same directory as this notebook.

In [6]:
# Find all PDF files in the pdfs/ subdirectory and current directory
folders = glob.glob("pdfs/*.pdf") + glob.glob("*.pdf")

if not folders:
    print("‚ö†Ô∏è No PDF files found. Please add PDF files to the 'pdfs/' directory or current directory.")
else:
    print(f"üìÑ Found {len(folders)} PDF file(s)")

# Load all documents
documents = []
for file_path in folders:
    loader = PyPDFLoader(file_path)
    docs = loader.load()
    for doc in docs:
        # Add custom metadata to track source file
        doc.metadata["source_file"] = os.path.basename(file_path)
        documents.append(doc)

print(f"‚úÖ Loaded {len(documents)} pages from {len(folders)} file(s)")

üìÑ Found 9 PDF file(s)
‚úÖ Loaded 126 pages from 9 file(s)


In [7]:
# print first document metadata, such as file name, source, total number of pages, etc.
print("First document metadata:")
print(documents[0].metadata['source'])
print(documents[0].metadata['total_pages'])
print(documents[0].metadata['source_file'])


First document metadata:
pdfs/1901.09069v2.pdf
11
1901.09069v2.pdf


In [8]:
# show the content of first document
print(documents[0].page_content)


Word Embeddings: A Survey
Felipe Almeida Geraldo Xex ¬¥eo‚àó
Computer and Systems Engineering Program (PESC-COPPE)
Federal University of Rio de Janeiro
Rio de Janeiro, Brazil
{falmeida,xexeo}@cos.ufrj.br
Abstract
This work lists and describes the main re-
cent strategies for building Ô¨Åxed-length,
dense and distributed representations for
words, based on the distributional hypoth-
esis. These representations are now com-
monly called word embeddings and, in ad-
dition to encoding surprisingly good syn-
tactic and semantic information, have been
proven useful as extra features in many
downstream NLP tasks.
1 Introduction
The task of representing words and documents is
part and parcel of most, if not all, Natural Lan-
guage Processing (NLP) tasks. In general, it has
been found to be useful to represent them as vec-
tors, which have an appealing, intuitive interpreta-
tion, can be the subject of useful operations (e.g.
addition, subtraction, distance measures, etc) and
lend themselves we

In [9]:
print(len(documents[0].page_content))

3856


---

## Step 4: Split Documents into Chunks

LLMs have a **context window limit** (maximum tokens they can process at once). Additionally, for effective retrieval, we want to find *specific* relevant passages, not entire documents.

We use **`RecursiveCharacterTextSplitter`** which:
- Splits text hierarchically (paragraphs ‚Üí sentences ‚Üí words)
- Tries to keep semantically related text together
- Creates overlapping chunks to preserve context at boundaries

### Key Parameters

| Parameter | Value | Description |
|-----------|-------|-------------|
| `chunk_size` | 1000 | Maximum characters per chunk |
| `chunk_overlap` | 200 | Characters shared between adjacent chunks |
| `add_start_index` | True | Tracks the position of each chunk in the original document |

In [10]:
# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # Max characters per chunk
    chunk_overlap=200,     # Overlap between chunks for context continuity
    add_start_index=True   # Track position in original document
)

# Split documents into chunks
chunks = text_splitter.split_documents(documents)
print(f"‚úÖ Split {len(documents)} pages into {len(chunks)} chunks")

# Show example chunk
if chunks:
    print("\nüìù Example Chunk:")
    print("-" * 50)
    print(chunks[0].page_content[:300] + "...")
    print("-" * 50)
    print(f"Metadata: {chunks[0].metadata}")

‚úÖ Split 126 pages into 688 chunks

üìù Example Chunk:
--------------------------------------------------
Word Embeddings: A Survey
Felipe Almeida Geraldo Xex ¬¥eo‚àó
Computer and Systems Engineering Program (PESC-COPPE)
Federal University of Rio de Janeiro
Rio de Janeiro, Brazil
{falmeida,xexeo}@cos.ufrj.br
Abstract
This work lists and describes the main re-
cent strategies for building Ô¨Åxed-length,
dense...
--------------------------------------------------
Metadata: {'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-05-03T00:58:57+00:00', 'author': '', 'keywords': '', 'moddate': '2023-05-03T00:58:57+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'pdfs/1901.09069v2.pdf', 'total_pages': 11, 'page': 0, 'page_label': '1', 'source_file': '1901.09069v2.pdf', 'start_index': 0}


In [11]:
# inspect a chunk 
print(chunks[0].page_content)


Word Embeddings: A Survey
Felipe Almeida Geraldo Xex ¬¥eo‚àó
Computer and Systems Engineering Program (PESC-COPPE)
Federal University of Rio de Janeiro
Rio de Janeiro, Brazil
{falmeida,xexeo}@cos.ufrj.br
Abstract
This work lists and describes the main re-
cent strategies for building Ô¨Åxed-length,
dense and distributed representations for
words, based on the distributional hypoth-
esis. These representations are now com-
monly called word embeddings and, in ad-
dition to encoding surprisingly good syn-
tactic and semantic information, have been
proven useful as extra features in many
downstream NLP tasks.
1 Introduction
The task of representing words and documents is
part and parcel of most, if not all, Natural Lan-
guage Processing (NLP) tasks. In general, it has
been found to be useful to represent them as vec-
tors, which have an appealing, intuitive interpreta-
tion, can be the subject of useful operations (e.g.
addition, subtraction, distance measures, etc) and


In [12]:
print(len(chunks[0].page_content))

976


---

## Step 5: Create Embeddings and Vector Store

### What are Embeddings?

**Embeddings** are numerical representations (vectors) of text that capture semantic meaning. Texts with similar meanings will have vectors that are close together in the embedding space.

### What is a Vector Store?

A **Vector Store** is a specialized database optimized for:
- Storing high-dimensional vectors
- Performing fast similarity searches
- Enabling "semantic search" (finding text by meaning, not just keywords)

### Our Setup

- **`OpenAIEmbeddings`**: Uses OpenAI's `text-embedding-3-small` model (fast and cost-effective)
- **`Chroma`**: Open-source vector database that persists to disk

> üí° **Tip**: The embeddings are stored locally, so subsequent runs will be faster as you won't need to re-embed documents.

In [13]:
# Initialize embedding model
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')

# Clean up existing database if it exists (to ensure fresh data)
# NOTE: In production, you would likely load the existing DB mostly.
# For this lab, we check if it exists and load it to save time/cost.

if os.path.exists(db_name):
    # Load existing vector store
    vectorstore = Chroma(
        persist_directory=db_name, 
        embedding_function=embeddings
    )
    print(f"‚úÖ Loaded existing vector store: {db_name}")
    try:
        count = vectorstore._collection.count()
        print(f"üìä Document count: {count}")
    except:
        print("üìä Could not get document count")
else:
    # Create new vector store
    print(f"üÜï Creating new vector store: {db_name}...")
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=db_name
    )
    print(f"‚úÖ Vector store created with {vectorstore._collection.count()} documents")


üÜï Creating new vector store: vector_db...
‚úÖ Vector store created with 688 documents


In [14]:
# test chroma db for similarity search 
query = "What is Byte Pair Encoding?"
docs_with_scores = vectorstore.similarity_search_with_score(query)
for doc, score in docs_with_scores:
    print(f"Document: {doc.page_content}\nScore: {score}\n")


Document: arXiv:2411.08671v1  [cs.DS]  13 Nov 2024
Theoretical Analysis of Byte-Pair Encoding
L¬¥ aszl¬¥ o Kozma and Johannes Voderholzer
Institut f¬® ur Informatik, Freie Universit¬® at Berlin, Germany
Abstract
Byte-Pair Encoding (BPE) is a widely used method for subword token ization, with origins in
grammar-based text compression. It is employed in a variety of lang uage processing tasks such
as machine translation or large language model (LLM) pretraining, t o create a token dictionary
of a prescribed size. Most evaluations of BPE to date are empirical, a nd the reasons for its good
practical performance are not well understood.
In this paper we focus on the optimization problem underlying BPE: Ô¨Ån ding a pair encoding
that achieves optimal compression utility. We show that this problem is APX-complete, indi-
cating that it is unlikely to admit a polynomial-time approximation scheme . This answers, in a
stronger form, a question recently raised by Zouhar et al. [ ZMG+23].
Score: 0

### Understanding Chroma Similarity Scores

Chroma uses **L2 (Euclidean) distance** for similarity search. The score represents how "far apart" two vectors are in the embedding space.

| Score Range | Interpretation |
|-------------|---------------|
| **< 0.5** | Highly relevant - strong semantic match |
| **0.5 - 1.0** | Moderately relevant - related content |
| **> 1.0** | Potentially irrelevant - consider filtering these out |

> üí° **Lower is better** for L2 distance. If you see high scores (>1.5), the retrieved chunks may not actually be relevant to the query.


---

## Step 6: Build the RAG Chain

Now we create the complete RAG pipeline using **LangChain Expression Language (LCEL)**. The chain consists of two main components:

### 1. History-Aware Retriever

This component reformulates the user's question to be **standalone** (understandable without context). 

**Example:**
- Chat history: "Tell me about SecLM"
- Follow-up: "What are its main features?"
- Reformulated: "What are the main features of SecLM?"

### 2. Question-Answer Chain

This component:
1. Takes the retrieved documents and the question
2. "Stuffs" the documents into the prompt as context
3. Generates a grounded answer using the LLM

### The Complete Flow

```
User Question ‚Üí Contextualize ‚Üí Retrieve ‚Üí Generate Answer
      ‚Üë              ‚Üì            ‚Üì            ‚Üì
  Chat History    Standalone    Relevant    Final
                   Question     Documents   Response
```

In [15]:
# 1. Initialize the LLM
llm = ChatOpenAI(temperature=0, model_name=MODEL)

In [16]:
# 2. Create a retriever from the vector store
# k=5 means we retrieve the top 5 most relevant chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

> üìä **Experiment Tracking Note**: The value of `k` (number of retrieved chunks) will be logged as a parameter in MLflow. This allows you to compare runs with different retrieval depths to see how it affects answer quality.


In [17]:
query = "What is Byte Pair Encoding?"

# Just check what the retriever returns (raw documents)
docs = retriever.invoke(query)
print(f"Retrieved {len(docs)} documents")
len(docs)


Retrieved 5 documents


5

In [18]:
print(docs[0].page_content)

arXiv:2411.08671v1  [cs.DS]  13 Nov 2024
Theoretical Analysis of Byte-Pair Encoding
L¬¥ aszl¬¥ o Kozma and Johannes Voderholzer
Institut f¬® ur Informatik, Freie Universit¬® at Berlin, Germany
Abstract
Byte-Pair Encoding (BPE) is a widely used method for subword token ization, with origins in
grammar-based text compression. It is employed in a variety of lang uage processing tasks such
as machine translation or large language model (LLM) pretraining, t o create a token dictionary
of a prescribed size. Most evaluations of BPE to date are empirical, a nd the reasons for its good
practical performance are not well understood.
In this paper we focus on the optimization problem underlying BPE: Ô¨Ån ding a pair encoding
that achieves optimal compression utility. We show that this problem is APX-complete, indi-
cating that it is unlikely to admit a polynomial-time approximation scheme . This answers, in a
stronger form, a question recently raised by Zouhar et al. [ ZMG+23].


In [19]:
for i, doc in enumerate(docs):
    print(f'---'*20)
    print(f'=== Document {i+1} sourced from {doc.metadata["source_file"]} page {doc.metadata["page"]} ===')
    print(f'=== Content of Document {i+1} ===')
    print(doc.page_content)
    

------------------------------------------------------------
=== Document 1 sourced from 2411.08671v1.pdf page 0 ===
=== Content of Document 1 ===
arXiv:2411.08671v1  [cs.DS]  13 Nov 2024
Theoretical Analysis of Byte-Pair Encoding
L¬¥ aszl¬¥ o Kozma and Johannes Voderholzer
Institut f¬® ur Informatik, Freie Universit¬® at Berlin, Germany
Abstract
Byte-Pair Encoding (BPE) is a widely used method for subword token ization, with origins in
grammar-based text compression. It is employed in a variety of lang uage processing tasks such
as machine translation or large language model (LLM) pretraining, t o create a token dictionary
of a prescribed size. Most evaluations of BPE to date are empirical, a nd the reasons for its good
practical performance are not well understood.
In this paper we focus on the optimization problem underlying BPE: Ô¨Ån ding a pair encoding
that achieves optimal compression utility. We show that this problem is APX-complete, indi-
cating that it is unlikely to admit a

In [20]:
query = "What is the capital of France?"
# This query might not be in the documents, so retrieval might return irrelevant info
docs = retriever.invoke(query)
for i, doc in enumerate(docs):
    print(f'---'*20)
    print(f'=== Document {i+1} sourced from {doc.metadata["source_file"]} page {doc.metadata["page"]} ===')
    print(f'=== Content of Document {i+1} ===')
    print(doc.page_content)

------------------------------------------------------------
=== Document 1 sourced from 1301.3781v3.pdf page 4 ===
=== Content of Document 1 ===
resulting vectors can be used to answer very subtle semantic relationships between words, such as
a city and the country it belongs to, e.g. France is to Paris as Germany is to Berlin. Word vectors
with such semantic relationships could be used to improve many existing NLP applications, such
as machine translation, information retrieval and question answering systems, and may enable other
future applications yet to be invented.
5
------------------------------------------------------------
=== Document 2 sourced from 1301.3781v3.pdf page 4 ===
=== Content of Document 2 ===
Somewhat surprisingly, these questions can be answered by performing simple algebraic operations
with the vector representation of words. To Ô¨Ånd a word that is similar to small in the same sense as
biggest is similar to big, we can simply compute vectorX = vector(‚Äùbigge

In [21]:
from langchain_openai import ChatOpenAI

llm_basic = ChatOpenAI(model_name=MODEL, temperature=0)
# 3. Define the contextualization prompt
# This prompt helps reformulate questions based on chat history
contextualize_q_system_prompt = (
    "Given a chat history and the latest user question "
    "which might reference context in the chat history, "
    "formulate a standalone question which can be understood "
    "without the chat history. Do NOT answer the question, "
    "just reformulate it if needed and otherwise return it as is."
)

contextualize_q_prompt = ChatPromptTemplate.from_messages([
    ("system", contextualize_q_system_prompt),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

# Create the history-aware retriever
history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)



__What contextualize_q_system_prompt does__
* This prompt is used only by create_history_aware_retriever.‚Äã
* It tells the LLM: ‚ÄúGiven chat history + latest user input, rewrite the question so it‚Äôs standalone; don‚Äôt answer it.‚Äù That rewritten question is then sent to the retriever.‚Äã
* You need this only if:
    * You want follow‚Äëup questions like ‚ÄúWhat about its limitations?‚Äù to still retrieve the right chunks, and
    * You are using create_history_aware_retriever (or an equivalent ‚Äúconversational retriever‚Äù).

If you don‚Äôt care about multi‚Äëturn context in retrieval, you can skip the history‚Äëaware retriever entirely and just use retriever = vectorstore.as_retriever(...) as you did with RetrievalQA. In that case, contextualize_q_system_prompt and its prompt are not needed.

In [22]:
# 4. Define the QA prompt
# This prompt instructs the LLM how to use the retrieved context
qa_system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, just say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", qa_system_prompt),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

# Create the question-answer chain
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)



__What qa_system_prompt does__
* This is the system prompt used by the answer‚Äëgeneration step (create_stuff_documents_chain).‚Äã
* It controls how the LLM:
    * Uses {context} (retrieved docs),
    * Handles ‚ÄúI don‚Äôt know‚Äù cases,
    * Constrains length and style of answers
    

In [23]:
# 5. Combine into the final RAG chain
rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

print("‚úÖ RAG chain created successfully!")

‚úÖ RAG chain created successfully!


---

## üìù Logging Prompts for Reproducibility

Prompts are **hyperparameters** of your LLM system. Even small changes can dramatically affect results:

- Changing "answer concisely" to "answer in detail" ‚Üí Different response lengths
- Adding "cite sources" ‚Üí Better grounded responses

By logging prompts as artifacts:
- ‚úÖ You can compare Prompt V1 vs V2 side-by-side in MLflow
- ‚úÖ You know exactly what prompt produced which results
- ‚úÖ You can roll back to previous prompt versions

> üéì **Advanced**: MLflow has a Prompt Registry feature for managing prompt versions at scale.


In [24]:
# Preview the prompts that will be logged during evaluation
print("üìù Contextualize Prompt:")
print("-" * 40)
print(contextualize_q_system_prompt)
print()
print("üìù QA System Prompt:")
print("-" * 40)
print(qa_system_prompt)
print()
print("üí° These prompts will be logged as artifacts during the evaluation run.")


üìù Contextualize Prompt:
----------------------------------------
Given a chat history and the latest user question which might reference context in the chat history, formulate a standalone question which can be understood without the chat history. Do NOT answer the question, just reformulate it if needed and otherwise return it as is.

üìù QA System Prompt:
----------------------------------------
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

{context}

üí° These prompts will be logged as artifacts during the evaluation run.


---

## Step 7: Test the RAG Chain

Let's test our RAG chain with a simple query. The chain will:

1. Take the user's question
2. Retrieve relevant document chunks from the vector store
3. Generate a response based on the retrieved context

The response object contains:
- **`answer`**: The generated response
- **`context`**: The retrieved document chunks used to generate the answer

In [25]:
# Initialize empty chat history
chat_history = []

# Ask a question
query = "What is the main topic of these documents?"
response = rag_chain.invoke({"input": query, "chat_history": chat_history})

print("‚ùì Question:", query)
print("\nüí¨ Answer:", response["answer"])

# Update chat history for follow-up questions
chat_history.extend([
    HumanMessage(content=query),
    AIMessage(content=response["answer"])
])

print("\n‚úÖ Chat history updated. You can now ask follow-up questions!")

‚ùì Question: What is the main topic of these documents?

üí¨ Answer: The main topic of these documents is the development and evaluation of universal text embeddings, which are models designed to perform well across a variety of natural language processing tasks. They discuss the challenges of creating effective embeddings, the importance of diverse and high-quality datasets, and recent advancements in the field. Additionally, they mention benchmarks like the Massive Text Embedding Benchmark (MTEB) that assess the performance of these models across multiple languages and tasks.

‚úÖ Chat history updated. You can now ask follow-up questions!


---

## üîç Understanding Traces in MLflow

MLflow automatically captured a **trace** of that RAG chain execution. Let's explore it!

### What is a Trace?

A trace is like an X-ray of your RAG pipeline. It shows:

- **Each step**: Retrieval ‚Üí Prompt construction ‚Üí LLM call ‚Üí Response
- **Timing**: How long each step took (find bottlenecks!)
- **Inputs/Outputs**: What data flowed through each step
- **Token counts**: How many tokens were used (costs!)

### How to View Traces

1. Start the MLflow UI (if not already running):
   ```bash
   mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5000
   ```

2. Open browser: `http://localhost:5000`

3. Navigate to: **Traces** tab (top menu)

4. Click on any trace to see:
   - **Timeline view**: Visual representation of execution time
   - **Span details**: Click each span to see inputs/outputs
   - **Retrieval inspection**: See which documents were retrieved
   - **LLM calls**: See the exact prompt sent to the LLM

### üéØ Debugging with Traces

Traces help you diagnose problems:

| Problem | What to Check in Trace |
|---------|------------------------|
| Wrong answer | **Retrieval span**: Were the right documents retrieved? |
| Hallucination | **Context vs Answer**: Does answer contain info NOT in context? |
| Slow responses | **Timeline**: Is retrieval slow? LLM call slow? |
| High costs | **Token counts**: Are you retrieving too many chunks? |

> üí° **Try This**: Run the chain with `k=10` instead of `k=5` and compare the traces. You'll see more retrieval time and higher token usage!


In [26]:
query = "repeat the answer but this time in bullet points please"
response = rag_chain.invoke({"input": query, "chat_history": chat_history})

In [27]:
print("‚ùì Question:", query)
print("\nüí¨ Answer:", response["answer"])

‚ùì Question: repeat the answer but this time in bullet points please

üí¨ Answer: - The main topic is the development and evaluation of universal text embeddings for natural language processing tasks.
- It discusses challenges in creating effective embeddings and the importance of diverse, high-quality datasets.
- Recent advancements in the field and benchmarks like the Massive Text Embedding Benchmark (MTEB) are highlighted, assessing model performance across multiple languages and tasks.


In [28]:
query = "What is byte pair encoding?"
response = rag_chain.invoke({"input": query, "chat_history": chat_history})
print("‚ùì Question:", query)
print("\nüí¨ Answer:", response["answer"])

‚ùì Question: What is byte pair encoding?

üí¨ Answer: Byte Pair Encoding (BPE) is a data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. It is adapted for word segmentation by merging characters or character sequences instead of bytes, creating a token dictionary of variable-length subword units. BPE is commonly used in natural language processing tasks, such as machine translation, to improve tokenization and reduce vocabulary size.


## üìù Step 3: Create Golden Evaluation Dataset

### üß† Educational Context: The "Golden Dataset"

To scientifically evaluate a RAG system, we cannot just "eyeball" a few answers. We need a **benchmark**‚Äîoften called a "Golden Dataset" or "Ground Truth" set.

#### What makes a good evaluation dataset?
1.  **Diversity**: Questions should cover different topics within your documents.
2.  **Complexity**: Include simple fact lookups ("What is X?") and reasoning questions ("Compare X and Y").
3.  **Ground Truth**: You must provide the *ideal* answer. The LLM Judge will compare the RAG system's output against this reference.

**Measurement Goals**:
*   **Retrieval Quality**: Did the system find the right page in the PDF?
*   **Generation Quality**: Did the LLM answer accurately based on that page?

üëá **Action**: The code below creates a list of dictionaries, where each item has a `question` and a `ground_truth` answer. We convert this to a pandas DataFrame for easy handling.

In [29]:
# Golden dataset: Questions with ground truth answers
# Based on your PDFs: Word Embeddings, BPE, NMT, MTEB
eval_data = [
    {
        "question": "What is Byte Pair Encoding (BPE)?",
        "ground_truth": "BPE is a data compression technique that iteratively replaces the most frequent pair of bytes/symbols with a new symbol. It's used for subword tokenization in NLP tasks like machine translation."
    },
    {
        "question": "What complexity result did Kozma and Voderholzer prove about optimal pair encoding?",
        "ground_truth": "They proved that optimal pair encoding is APX-complete, meaning it's unlikely to admit a polynomial-time approximation scheme unless P=NP."
    },
    {
        "question": "What is the distributional hypothesis in NLP?",
        "ground_truth": "Words that appear in similar contexts tend to have similar meanings. This principle, suggested by Harris (1954), underlies modern word embeddings."
    },
    {
        "question": "What is the Vector Space Model (VSM)?",
        "ground_truth": "The VSM represents words and documents as vectors in high-dimensional space, enabling mathematical operations like cosine similarity for information retrieval. Generally attributed to Salton (1975)."
    },
    {
        "question": "Who introduced the GloVe word embedding model and when?",
        "ground_truth": "GloVe (Global Vectors for Word Representation) was introduced by Pennington et al. in 2014."
    },
    {
        "question": "What is the main contribution of Neural Network Language Models (NNLMs)?",
        "ground_truth": "NNLMs, pioneered by Bengio et al. (2003), reframed language modeling as unsupervised learning and introduced embedding layers that project words into dense vector spaces."
    },
    {
        "question": "What benchmark is used to evaluate text embedding models across multiple languages?",
        "ground_truth": "The Massive Text Embedding Benchmark (MTEB) evaluates embedding models across multiple languages and diverse NLP tasks."
    },
    {
        "question": "What is the key advantage of subword tokenization in neural machine translation?",
        "ground_truth": "Subword tokenization (like BPE) enables open-vocabulary translation, handling rare words and achieving better compression while maintaining translation quality."
    }
]

eval_df = pd.DataFrame(eval_data)
print(f"‚úÖ Golden dataset ready: {len(eval_df)} evaluation questions")
print(f"üìÑ Covering: Word Embeddings, BPE, NMT, Vector Models")
eval_df[["question"]].head()


‚úÖ Golden dataset ready: 8 evaluation questions
üìÑ Covering: Word Embeddings, BPE, NMT, Vector Models


Unnamed: 0,question
0,What is Byte Pair Encoding (BPE)?
1,What complexity result did Kozma and Voderholz...
2,What is the distributional hypothesis in NLP?
3,What is the Vector Space Model (VSM)?
4,Who introduced the GloVe word embedding model ...


In [30]:
# Analyze the golden dataset before proceeding
print("üìä GOLDEN DATASET STATISTICS:")
print("=" * 50)
print(f"Total evaluation questions: {len(eval_df)}")
print(f"Average question length: {eval_df['question'].str.len().mean():.0f} characters")
print(f"Average ground truth length: {eval_df['ground_truth'].str.len().mean():.0f} characters")

# Show topic distribution (based on keywords)
print("\nüìå Topic Coverage:")
topics = {
    'BPE/Tokenization': eval_df['question'].str.contains('BPE|Byte Pair|tokeniz', case=False).sum(),
    'Embeddings': eval_df['question'].str.contains('embed|vector|GloVe|word2vec', case=False).sum(),
    'Language Models': eval_df['question'].str.contains('language model|NNLM|neural', case=False).sum(),
    'Benchmarks': eval_df['question'].str.contains('benchmark|MTEB|evaluat', case=False).sum(),
}
for topic, count in topics.items():
    print(f"  - {topic}: {count} question(s)")


üìä GOLDEN DATASET STATISTICS:
Total evaluation questions: 8
Average question length: 61 characters
Average ground truth length: 152 characters

üìå Topic Coverage:
  - BPE/Tokenization: 2 question(s)
  - Embeddings: 3 question(s)
  - Language Models: 2 question(s)
  - Benchmarks: 1 question(s)


## üîç Step 4: Run RAG Inference

### üß† Educational Context: Batch Inference

Now that we have our questions, we need to generate answers using our RAG pipeline. This is called **Inference**.

#### Why are we doing this loop?
We need to capture two things for every question:
1.  **The Generated Answer**: What the LLM actually said.
2.  **The Retrieved Contexts**: The specific text chunks the system found in your PDF.

**Why context matters**: To measure "Faithfulness" (hallucination), the Judge needs to see *exactly* what the LLM read before it answered. If the LLM answers correctly but the context was irrelevant, it might be using its pre-trained knowledge instead of your data!

üëá **Action**: The loop below iterates through each question in our golden dataset, sends it to the `rag_chain`, and saves the results.

In [31]:
results = []
print("üîç Running RAG evaluation inference...\n")

for idx, row in eval_df.iterrows():
    try:
        # Invoke your existing RAG chain (single-turn evaluation)
        response = rag_chain.invoke({
            "input": row["question"],
            "chat_history": []  # Empty history for clean evaluation
        })
        
        # Extract answer and retrieved contexts
        answer = response["answer"]
        contexts = [doc.page_content for doc in response["context"]]
        
        results.append({
            "question": row["question"],
            "ground_truth": row["ground_truth"],
            "answer": answer,
            "contexts": contexts  # Required for faithfulness metric
        })
        
        print(f"  ‚úì Q{idx+1}: {row['question'][:70]}...")
        
    except Exception as e:
        print(f"  ‚úó Q{idx+1} failed: {e}")
        continue

results_df = pd.DataFrame(results)
print(f"\n‚úÖ Inference complete: {len(results_df)}/{len(eval_df)} questions answered")
results_df[["question", "answer"]].head(3)


üîç Running RAG evaluation inference...

  ‚úì Q1: What is Byte Pair Encoding (BPE)?...
  ‚úì Q2: What complexity result did Kozma and Voderholzer prove about optimal p...
  ‚úì Q3: What is the distributional hypothesis in NLP?...
  ‚úì Q4: What is the Vector Space Model (VSM)?...
  ‚úì Q5: Who introduced the GloVe word embedding model and when?...
  ‚úì Q6: What is the main contribution of Neural Network Language Models (NNLMs...
  ‚úì Q7: What benchmark is used to evaluate text embedding models across multip...
  ‚úì Q8: What is the key advantage of subword tokenization in neural machine tr...

‚úÖ Inference complete: 8/8 questions answered


Unnamed: 0,question,answer
0,What is Byte Pair Encoding (BPE)?,Byte Pair Encoding (BPE) is a data compression...
1,What complexity result did Kozma and Voderholz...,Kozma and Voderholzer proved that the problem ...
2,What is the distributional hypothesis in NLP?,The distributional hypothesis in NLP posits th...


In [32]:
# Quality checks before running expensive LLM Judge
print("üõ°Ô∏è QUALITY GUARDRAILS CHECK:")
print("=" * 50)

empty_answers = sum(1 for r in results if not r.get('answer', '').strip())
zero_contexts = sum(1 for r in results if not r.get('contexts', []))
avg_answer_len = sum(len(r.get('answer', '')) for r in results) / len(results) if results else 0

print(f"‚úì Questions answered: {len(results)}/{len(eval_df)}")
print(f"‚ö† Empty answers: {empty_answers}")
print(f"‚ö† Zero retrieved contexts: {zero_contexts}")
print(f"üìè Average answer length: {avg_answer_len:.0f} characters")

if empty_answers > 0 or zero_contexts > 0:
    print("\nüî¥ WARNING: Some questions have issues. Review before running the Judge.")
else:
    print("\n‚úÖ All checks passed. Ready for LLM Judge evaluation.")


üõ°Ô∏è QUALITY GUARDRAILS CHECK:
‚úì Questions answered: 8/8
‚ö† Empty answers: 0
‚ö† Zero retrieved contexts: 0
üìè Average answer length: 378 characters

‚úÖ All checks passed. Ready for LLM Judge evaluation.


---

## üìè RAG Evaluation Metrics Reference

Before we run evaluation, let's understand the metrics we'll use.

### How LLM-as-a-Judge Works

Traditional metrics like **BLEU** or **ROUGE** compare word overlap. But for conversational AI:
- "The capital of France is Paris" ‚â† "Paris is the capital city of France" (different words, same meaning!)

**LLM-as-a-Judge** uses a powerful LLM (like GPT-4) to evaluate responses semantically. We give it:
- The original question
- The retrieved context (from your documents)
- The generated answer
- The ground truth answer

The Judge LLM then applies a rubric to score the response.

### Metrics We Use

| Metric | Question the Judge Asks | Score Range | What It Measures |
|--------|------------------------|-------------|------------------|
| **Faithfulness** | "Is the answer supported *only* by the retrieved context?" | 1-5 | Anti-hallucination: did the LLM make things up? |
| **Answer Relevance** | "Does the answer actually address the user's question?" | 1-5 | Is the response on-topic and helpful? |

### Interpreting Scores

| Score | Interpretation | Action |
|-------|----------------|--------|
| **5** | Excellent - no issues | ‚úÖ Keep configuration |
| **4** | Good - minor issues | üëÄ Monitor, may need attention |
| **3** | Acceptable - noticeable issues | ‚ö†Ô∏è Investigate specific failures |
| **1-2** | Poor - significant problems | üî¥ Debug and fix before production |

### Additional Metrics Available

MLflow and other frameworks offer more metrics:

- **Context Precision**: Did we retrieve the *right* documents?
- **Context Recall**: Did we retrieve *all* relevant documents?
- **Answer Correctness**: How close is the answer to the ground truth?
- **Toxicity**: Is the response harmful or inappropriate?

> üí° **Note**: Each Judge LLM call costs money. Start with 2-3 key metrics, then expand as needed.


## üìä Step 5: MLflow LLM-as-a-Judge Evaluation

### üß† Educational Context: LLM-as-a-Judge

Evaluating free-text answers is hard. In the past, we used **BLEU** or **ROUGE** scores (checking for exact word overlap), but these are bad for chatbots. A correct answer might use completely different words than the ground truth!

**The Solution**: Use a smart LLM (like GPT-4) as a "Judge". We give the Judge the evidence (Question, Context, Answer, Ground Truth) and a rubric, and it assigns a score.

### Understanding the Metrics

We are using MLflow's GenAI metrics:

#### 1. Faithfulness (Anti-Hallucination)
*   **Question asked to Judge**: "Is the generated answer based *solely* on the provided context?"
*   **Interpretation**: 
    *   Score 1.0 (High): The model acted like a faithful storage retrieval system. 
    *   Score 0.0 (Low): The model made things up (hallucinated) or used outside knowledge.

#### 2. Answer Relevance
*   **Question asked to Judge**: "Does this answer actually address the user's question?"
*   **Interpretation**:
    *   Score 1.0 (High): The answer is on-topic and helpful.
    *   Score 0.0 (Low): The answer is irrelevant, evaded the question, or just rambled.

üëá **Action**: `mlflow.evaluate` runs this entire process automatically. It sends prompts to the Judge model for every row in your dataframe.

In [33]:
# Define evaluation metrics (GPT-4o-mini as judge)
metrics = [
    faithfulness(model="openai:/gpt-4o-mini"),        # Anti-hallucination: answer supported by context?
    answer_relevance(model="openai:/gpt-4o-mini"),    # Does answer address the question?
]

# Log current configuration as baseline
current_config = {
    "chunk_size": 1000,
    "chunk_overlap": 200,
    "retrieval_k": 5,
    "embedding_model": "text-embedding-3-small",
    "llm_model": MODEL
}

import os

# Run evaluation and log to MLflow
with mlflow.start_run(run_name="RAG_Baseline_v1") as run:
    
    # Log parameters
    mlflow.log_params(current_config)
    
    # Also log prompts as parameters (shows in Parameters tab)
    mlflow.log_param("contextualize_prompt", contextualize_q_system_prompt[:250] + "...")  # Truncate for param limit
    mlflow.log_param("qa_prompt", qa_system_prompt[:250] + "...")  # Truncate for param limit
    
    # Save prompt files to disk
    with open("contextualize_prompt.txt", "w") as f:
        f.write(contextualize_q_system_prompt)
    with open("qa_prompt.txt", "w") as f:
        f.write(qa_system_prompt)
    
    # Save dataset files to disk
    eval_df.to_csv("golden_dataset.csv", index=False)
    results_df.to_csv("evaluation_results.csv", index=False)
    
    # Verify files exist before logging
    files_to_log = ["contextualize_prompt.txt", "qa_prompt.txt", "golden_dataset.csv", "evaluation_results.csv"]
    print("üìÅ Verifying files before logging:")
    for f in files_to_log:
        if os.path.exists(f):
            size = os.path.getsize(f)
            print(f"   ‚úÖ {f} ({size} bytes)")
        else:
            print(f"   ‚ùå {f} NOT FOUND!")
    
    # Log all artifacts
    for f in files_to_log:
        if os.path.exists(f):
            mlflow.log_artifact(f)
    
    # Also log dataset using log_input for Datasets tab
    try:
        import mlflow.data
        dataset = mlflow.data.from_pandas(
            results_df,
            source="golden_evaluation_dataset",
            name="rag_eval_questions"
        )
        mlflow.log_input(dataset, context="evaluation")
        print("   ‚úÖ Dataset logged via log_input()")
    except Exception as e:
        print(f"   ‚ö†Ô∏è log_input() failed: {e}")
    
    # Log tables for better UI display (MLflow 2.9+)
    try:
        mlflow.log_table(data=eval_df, artifact_file="golden_dataset.json")
        mlflow.log_table(data=results_df[["question", "ground_truth", "answer"]], artifact_file="results_summary.json")
        print("   ‚úÖ Tables logged via log_table()")
    except Exception as e:
        print(f"   ‚ö†Ô∏è log_table() failed: {e}")
    
    # Get artifact URI for debugging
    artifact_uri = mlflow.get_artifact_uri()
    print(f"\nüìç Artifact URI: {artifact_uri}")
    
    # Run LLM judge evaluation
    eval_results = mlflow.evaluate(
        data=results_df,
        targets="ground_truth",
        predictions="answer",
        extra_metrics=metrics,
        model_type="question-answering",
        evaluator_config={"col_mapping": {"inputs": "question", "context": "contexts"}}
    )
    
    run_id = run.info.run_id
    experiment_id = run.info.experiment_id

print(f"\nüéâ Evaluation complete!")
print(f"üÜî Run ID: {run_id}")
print(f"üåê View in MLflow UI: http://localhost:5000/#/experiments/{experiment_id}/runs/{run_id}")

# Display aggregate metrics
print("\nüìä AGGREGATE METRICS (1-5 scale, higher is better):")
print("="*60)
key_metrics = [
    "faithfulness/v1/mean",
    "answer_relevance/v1/mean", 
]
for metric in key_metrics:
    if metric in eval_results.metrics:
        score = eval_results.metrics[metric]
        print(f"  {metric:.<50} {score:.3f}")


üìÅ Verifying files before logging:
   ‚úÖ contextualize_prompt.txt (271 bytes)
   ‚úÖ qa_prompt.txt (248 bytes)
   ‚úÖ golden_dataset.csv (1752 bytes)
   ‚úÖ evaluation_results.csv (43373 bytes)
   ‚úÖ Dataset logged via log_input()
   ‚úÖ Tables logged via log_table()

üìç Artifact URI: /Users/tarekatwan/Repos/MyWork/Teach/repos/advanced_machine_learning/activities/03_Generative_AI/rag_demo/mlruns/1/093a8d47bc354efb908b0bb3dfd691f8/artifacts


  faithfulness(model="openai:/gpt-4o-mini"),        # Anti-hallucination: answer supported by context?
  return make_genai_metric(
  answer_relevance(model="openai:/gpt-4o-mini"),    # Does answer address the question?
  return make_genai_metric(
  return _dataset_source_registry.resolve(
  return _dataset_source_registry.resolve(

 - For traditional ML or deep learning models: Use `mlflow.models.evaluate`, which maintains full compatibility with the original `mlflow.evaluate` API.

 - For LLMs or GenAI applications: Use the new `mlflow.genai.evaluate` API, which offers enhanced features specifically designed for evaluating LLMs and GenAI applications.

  token_count(),
  toxicity(),
  flesch_kincaid_grade_level(),
  ari_grade_level(),
  builtin_metrics = [*text_metrics, exact_match()]
2025/12/12 14:19:21 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025-12-12 14:19:23 WARNI [evaluate_modules.metrics.evaluate-measurement--toxicity.2390290fa0bf6d78480

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]



  0%|          | 0/8 [00:00<?, ?it/s]

  0%|          | 0/8 [00:00<?, ?it/s]


üéâ Evaluation complete!
üÜî Run ID: 093a8d47bc354efb908b0bb3dfd691f8
üåê View in MLflow UI: http://localhost:5000/#/experiments/1/runs/093a8d47bc354efb908b0bb3dfd691f8

üìä AGGREGATE METRICS (1-5 scale, higher is better):
  faithfulness/v1/mean.............................. 4.375
  answer_relevance/v1/mean.......................... 5.000


## üî¨ Step 6: Per-Question Breakdown

### üß† Educational Context: Debugging

Averages hide details. To improve your system, you must look at **individual failures**.

#### How to interpret this table:
1.  **Low Faithfulness, High Relevance**: The model gave a good-sounding answer, but it wasn't in the document! This is dangerous (Hallucination).
    *   *Fix*: Check if the retrieval step failed to find the right chunk.
2.  **High Faithfulness, Low Relevance**: The model quoted the document perfectly, but it didn't answer the user's question.
    *   *Fix*: The retrieved chunk might be irrelevant to the question.

üëá **Action**: We highlight scores below 0.7 in red. Focus on these rows to understand *why* the system failed.

In [34]:
# Extract per-question scores
eval_table = eval_results.tables["eval_results_table"]

# Display detailed breakdown
print("üìã PER-QUESTION PERFORMANCE:\n")
display_cols = [
    "question",
    "faithfulness/v1/score",
    "answer_relevance/v1/score",
    # "context_precision/v1/score", 
    # "context_recall/v1/score"
]

breakdown = eval_table[display_cols].round(3)
breakdown.columns = ["Question", "Faithfulness", "Relevance"]

# Highlight low-scoring questions (< 0.7)
# Note: style for dataframe
styled = breakdown.style.map(
    lambda x: 'background-color: #ffcccc' if isinstance(x, (int, float)) and x < 3.5 else '',
    subset=["Faithfulness", "Relevance"]
)

styled


Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]

üìã PER-QUESTION PERFORMANCE:



Unnamed: 0,Question,Faithfulness,Relevance
0,What is Byte Pair Encoding (BPE)?,5,5
1,What complexity result did Kozma and Voderholzer prove about optimal pair encoding?,4,5
2,What is the distributional hypothesis in NLP?,4,5
3,What is the Vector Space Model (VSM)?,4,5
4,Who introduced the GloVe word embedding model and when?,5,5
5,What is the main contribution of Neural Network Language Models (NNLMs)?,4,5
6,What benchmark is used to evaluate text embedding models across multiple languages?,4,5
7,What is the key advantage of subword tokenization in neural machine translation?,5,5


---

## üîç Common RAG Failure Patterns & Fixes

Use this reference when analyzing your per-question scores.

### Pattern 1: Low Faithfulness + High Relevance

**Symptom**: The answer sounds great and addresses the question, but it's not actually in the documents.

**Diagnosis**: The LLM is **hallucinating** - using its pretrained knowledge instead of your documents.

**Fixes**:
- Make your system prompt stricter: "Answer ONLY based on the provided context"
- Lower the LLM temperature to reduce creativity
- Check if retrieval is returning irrelevant chunks (forcing the LLM to guess)

---

### Pattern 2: High Faithfulness + Low Relevance

**Symptom**: The answer quotes the document perfectly but doesn't answer the question.

**Diagnosis**: **Retrieval failure** - the wrong chunks were fetched.

**Fixes**:
- Increase `k` (number of retrieved chunks) to get more options
- Improve chunk overlap to preserve context boundaries
- Try a different embedding model (some are better for your domain)
- Add metadata filtering (e.g., only search specific document types)

---

### Pattern 3: Low Faithfulness + Low Relevance

**Symptom**: The answer is both wrong and off-topic.

**Diagnosis**: Complete system failure - likely the question is **out of scope**.

**Fixes**:
- Add a fallback: "I don't have information about that topic"
- Check if your documents even contain relevant information
- Review your embedding model - it may not understand the domain

---

### Pattern 4: Inconsistent Scores Across Similar Questions

**Symptom**: "What is X?" scores 5, but "Explain X" scores 2.

**Diagnosis**: Your chunking or retrieval is **sensitive to phrasing**.

**Fixes**:
- Use a history-aware retriever (like we have) to normalize queries
- Add query expansion or rewriting before retrieval
- Consider hybrid search (combine semantic + keyword matching)


In [35]:
# Identify and analyze low-scoring questions
THRESHOLD = 3.5  # Scores below this are concerning

print("üîç FAILURE ANALYSIS:")
print("=" * 50)

# Find questions with low faithfulness
low_faith = eval_table[eval_table['faithfulness/v1/score'] < THRESHOLD]

if len(low_faith) > 0:
    print(f"\n‚ö† Found {len(low_faith)} question(s) with Faithfulness < {THRESHOLD}:")
    for idx, row in low_faith.iterrows():
        print(f"\n  Q: {row['question'][:70]}...")
        print(f"  Score: {row['faithfulness/v1/score']}")
        
        # Show what was retrieved (from results_df)
        matching_result = results_df[results_df['question'] == row['question']]
        if len(matching_result) > 0 and 'contexts' in matching_result.columns:
            contexts = matching_result.iloc[0]['contexts']
            print(f"  Retrieved {len(contexts)} chunks. First chunk preview:")
            if contexts:
                print(f"    '{contexts[0][:150]}...'")
else:
    print("‚úÖ No low-faithfulness questions found!")


üîç FAILURE ANALYSIS:
‚úÖ No low-faithfulness questions found!


## üß™ Step 7: Experiment Comparison & Iteration

### üß† Educational Context: The Scientific Method for RAG

RAG is not "set and forget". It requires tuning. This is the **Experiment Loop**:

1.  **Baseline**: Run the system with default settings (e.g., chunk_size=1000).
2.  **Hypothesis**: "I think smaller chunks will capture details better."
3.  **Experiment**: Change `chunk_size` to 500.
4.  **Evaluate**: Rerun this notebook.
5.  **Compare**: Look at the table below. Did Faithfulness go up?

#### Parameters you can tune:
*   **Chunk Size**: 500, 1000, 2000 characters.
*   **Overlap**: 10%, 20% of chunk size.
*   **k (Retrieval Count)**: Provide 3, 5, or 10 chunks to the LLM.

üëá **Action**: This table aggregates all your MLflow runs so you can pick the best configuration.

### How to Interpret This Table

**Finding the Best Configuration:**

1. **Primary Metric**: Look for highest `faithfulness/v1/mean` (prevents hallucination)
2. **Secondary Metric**: Among high-faithfulness runs, pick highest `answer_relevance/v1/mean`
3. **Tradeoffs**: Smaller chunks may increase faithfulness but could reduce relevance

**Reading the Parameters:**
- `chunk_size`: Larger = more context per chunk, but may include noise
- `chunk_overlap`: Higher = better context continuity at boundaries
- `retrieval_k`: More chunks = better recall, but may confuse the LLM

> üí° **Pro Tip**: If two runs have similar scores, prefer the one with lower API costs (smaller chunks = more embeddings, higher k = more tokens in prompt).


In [36]:
# After you modify chunk_size, k, or embeddings ‚Üí rerun Steps 4-6
# This cell shows all runs side-by-side

runs_df = mlflow.search_runs(
    experiment_names=["RAG_PDF_Embeddings_Evaluation"],
    order_by=["start_time DESC"]
)

# Select key columns for comparison
comparison_cols = [
    "run_id",
    "params.chunk_size",
    "params.chunk_overlap", 
    "params.retrieval_k",
    "params.embedding_model",
    "metrics.faithfulness/v1/mean",
    "metrics.answer_relevance/v1/mean",
    "metrics.context_precision/v1/mean",
    "metrics.context_recall/v1/mean"
]

# Ensure cols exist
available_cols = [c for c in comparison_cols if c in runs_df.columns]
comparison = runs_df[available_cols].round(3)

print("üî¨ EXPERIMENT COMPARISON:")
print("="*100)
comparison.head(10)




üî¨ EXPERIMENT COMPARISON:


Unnamed: 0,run_id


---

## üéØ Student Challenge

Now it's your turn to experiment!

### Challenge 1: Tune the Chunking Strategy

**Hypothesis**: Smaller chunks might capture specific details better.

**Task**:
1. Go back to **Step 4** and change `chunk_size` from 1000 to 500
2. Delete the `vector_db` folder to force re-indexing
3. Re-run Steps 4, 5, and 6 to rebuild the vector store
4. Re-run Steps 3-7 (Evaluation) with a new run name (e.g., "RAG_SmallChunks_v1")
5. Compare results in the experiment comparison table

**Questions to Answer**:
- Did faithfulness improve or get worse?
- What about answer relevance?
- Why do you think you saw these changes?

---

### Challenge 2: Expand the Golden Dataset

**Task**: Add 3 new questions to the evaluation dataset:

1. One **factual question** ("When was X published?")
2. One **reasoning question** ("Compare X and Y")
3. One **out-of-scope question** (something NOT in your documents)

**Questions to Answer**:
- How did the system handle the out-of-scope question?
- Did it admit it didn't know, or did it hallucinate?

---

### Challenge 3 (Bonus): Try a Different Embedding Model

**Task**:
1. Change from `text-embedding-3-small` to `text-embedding-3-large`
2. Delete the `vector_db` folder
3. Re-run the entire pipeline
4. Compare the cost vs. quality tradeoff

---

### Reflection Questions

After completing the challenges, answer these:

1. Which parameter had the biggest impact on quality?
2. What tradeoffs did you observe (quality vs. cost vs. speed)?
3. How would you decide on the best configuration for production?


---

## üìö Next Steps & Resources

### Further Reading

| Resource | Description | Link |
|----------|-------------|------|
| **MLflow LLM Evaluation Docs** | Complete guide to GenAI metrics | [mlflow.org/docs](https://mlflow.org/docs/latest/llms/llm-evaluate/index.html) |
| **RAGAS Framework** | Alternative evaluation framework with more metrics | [ragas.io](https://docs.ragas.io/) |
| **LangSmith** | LangChain's own observability platform | [docs.smith.langchain.com](https://docs.smith.langchain.com/) |
| **Arize Phoenix** | Open-source LLM observability | [phoenix.arize.com](https://phoenix.arize.com/) |

### Research Papers

- **"RAGAS: Automated Evaluation of Retrieval Augmented Generation"** - Original paper on RAG metrics
- **"Judging LLM-as-a-Judge"** - Meta-analysis of using LLMs for evaluation

### What's Next in Your Learning Path?

1. **Production Deployment**: Learn to deploy RAG with proper monitoring
2. **Advanced Retrieval**: Explore hybrid search, re-ranking, and multi-index strategies
3. **Agentic RAG**: Combine RAG with tool use for more complex tasks
4. **Fine-tuning**: Train custom embedding models for your domain

---

## üèÅ Summary

In this notebook, you learned:

- ‚úÖ RAG systems need **systematic evaluation**, not just manual testing
- ‚úÖ **Golden Datasets** provide the ground truth for benchmarking
- ‚úÖ **LLM-as-a-Judge** enables semantic evaluation at scale
- ‚úÖ **MLflow** provides observability and experiment tracking
- ‚úÖ **Faithfulness** measures hallucination, **Relevance** measures answer quality
- ‚úÖ Per-question analysis helps you **debug specific failures**
- ‚úÖ Experiment comparison helps you **iterate on configurations**

**Remember**: A RAG system is only as good as your ability to measure and improve it!
