# Evaluating Retrieval Quality: Evidence-to-Chunk Matching for RAG Systems

## 🎯 What This Notebook Does

This notebook solves a critical challenge in RAG evaluation: **automatically determining whether retrieved chunks contain the ground-truth evidence needed to answer questions correctly.**

Instead of manually checking hundreds of retrieved chunks, this system:
- ✅ Automatically matches evidence strings to retrieved chunks using fuzzy matching + numeric overlap
- ✅ Filters multi-hop questions to keep only those with complete evidence coverage
- ✅ Creates a high-quality evaluation dataset for measuring retrieval performance
- ✅ Outputs ready-to-use data for computing precision, recall, and other retrieval metrics

## 🏗️ The Problem This Solves

When evaluating RAG systems, you need to know:
1. **Did the retrieval system find the right information?** (Recall)
2. **How much irrelevant information was retrieved?** (Precision)
3. **Can the system handle complex, multi-step reasoning?** (Multi-hop evaluation)

Manual evaluation doesn't scale. This notebook automates the process.

## 📊 Input & Output

**Input:**
- QA dataset with annotated evidence strings

**Output:**
- Matched evidence pairs (evidence string ↔ chunk ID)

## 🔧 Prerequisites
- Contextual AI API access (or adapt the retrieval functions for your system)
- QA dataset (like the one from the generation notebook)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextualAI/examples/blob/main/11-retrieval-analysis/Retrieval_Matching.ipynb)

## 1. Environment Setup

First, we'll install and import the necessary libraries. This notebook relies on `pandas` for data manipulation, `thefuzz` for fuzzy string matching, and the `contextual` client for interacting with the datastore and retrieval APIs. We also import standard libraries for handling data and making API requests.

In [None]:
# ! pip install thefuzz[speedup]

In [None]:
import asyncio
import json
import math
import os
import re
import time
from typing import Any, Dict, List, Optional, Tuple

import numpy as np
import pandas as pd
import requests
from contextual import ContextualAI
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import NonLLMContextRecall, NonLLMContextPrecisionWithReference
from tenacity import retry, stop_after_attempt, wait_exponential
from thefuzz import fuzz  # fuzzy matching; see the install cell above
from tqdm.notebook import tqdm

client = ContextualAI()


## 2. Prepare Datastore with Metadata for Filtering

To ensure that we only search for matches within the correct source document, it's crucial to have reliable metadata. Here, we'll add a `Filename` field to the custom metadata of each document in our datastore.

This allows us to use a `documents_filters` parameter in our retrieval requests, which dramatically narrows down the search space and prevents incorrect matches from other documents. This is a best practice for building robust retrieval evaluation pipelines.
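As a minimal sketch of what such a filter looks like, the helper below builds the `documents_filters` object in the shape used by the retrieval requests later in this notebook. The function name `build_filename_filter` is hypothetical; the field name `Filename` matches the custom metadata key set in the following cells.

```python
from typing import Any, Dict


def build_filename_filter(filename: str) -> Dict[str, Any]:
    """Build a filter restricting retrieval to one source document.

    Hypothetical helper: the dict shape mirrors the `documents_filters`
    payload used in the retrieval cells of this notebook.
    """
    return {
        "operator": "AND",
        "filters": [
            {"field": "Filename", "operator": "equals", "value": filename}
        ],
    }


example_filter = build_filename_filter("Tesla_2024_Annual_Report.pdf")
print(example_filter["filters"][0]["value"])
```

Keeping the filter in one place makes it harder for the metadata key in the datastore and the key used at query time to drift apart.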

In [None]:
datastore_id = 'ecd873f1-134a-4c43-8d69-bcd32533fd67'

In [None]:
docs = client.datastores.documents.list(datastore_id=datastore_id)
doc_pairs = [(doc.id, doc.name) for doc in docs.documents]
print("Document ID and Name pairs:")
for doc_id, name in doc_pairs:
    print(f"ID: {doc_id}, Name: {name}")

In [None]:
# Loop through doc_pairs and set the Filename metadata for each document
for doc_id, name in doc_pairs:
    result = client.datastores.documents.set_metadata(
        datastore_id=datastore_id,
        document_id=doc_id,
        custom_metadata={"Filename": name}
    )
    print(f"Set metadata for {name} (ID: {doc_id}): {result}")


In [None]:
# Verify that the metadata was written for the first document
document_id = docs.documents[0].id
metadata = client.datastores.documents.metadata(
    datastore_id=datastore_id,
    document_id=document_id,
)
print("Document metadata:", metadata.custom_metadata)

## 3. Load and Preprocess Annotated Data

Next, we load the annotated dataset, which contains the ground-truth information for our evaluation. This dataset is structured as an Excel file where each row corresponds to a piece of evidence required to answer a multi-hop question.

The key columns for our purposes are:
- `Question`: The user query.
- `Source_Document`: The filename of the document containing the evidence.
- `Evidence`: The ground-truth text string that must be found in a retrieved chunk.

In [None]:
XLSX_PATH = "qa_pairs_multi_row_20250616_174936.xlsx"

In [None]:
df = pd.read_excel(XLSX_PATH)
df.head(3)

### Data Cleaning: Forward-Filling Questions

Our dataset is designed for multi-hop questions, where a single complex question requires multiple steps (and evidence strings) to answer. In the raw data, the `Question` and `Answer` are only present on the first row for a given `QA_ID`.

To create a clean, flat structure where every row has a question, we'll use `ffill()` (forward-fill) to propagate the question and answer to all subsequent rows belonging to the same `QA_ID`.

In [None]:
# `fillna(method='ffill')` is deprecated in recent pandas versions; `ffill()` is the direct equivalent
df['Question'] = df['Question'].ffill()
df['Answer'] = df['Answer'].ffill()
df.head(3)
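One caveat with a plain forward-fill is that a value can leak across a `QA_ID` boundary if the first row of a group happens to be empty. The sketch below, on made-up toy data, shows a grouped variant that fills only within each `QA_ID`. Whether this guard is needed depends on how the spreadsheet was produced.

```python
import pandas as pd

# Toy data (invented for illustration): QA_ID 2 has no question on its first row.
toy = pd.DataFrame(
    {
        "QA_ID": [1, 1, 2, 2],
        "Question": ["Q one?", None, None, None],
        "Answer": ["A one", None, "A two", None],
    }
)

# Plain forward-fill copies "Q one?" into the QA_ID 2 rows.
plain = toy[["Question", "Answer"]].ffill()

# Grouped forward-fill only propagates values within the same QA_ID.
grouped = toy.groupby("QA_ID")[["Question", "Answer"]].ffill()

print(plain["Question"].tolist())
print(grouped["Question"].tolist())
```

In the grouped result the `Question` for `QA_ID` 2 stays missing, which surfaces the gap instead of silently attaching the wrong question.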

## 4. Evidence-to-Chunk Matching Algorithm

This section contains the core logic for matching evidence strings to retrieved chunks. A reliable matching function is essential for automated retrieval evaluation, as it allows us to programmatically verify if the retrieved content contains the necessary information.

Our approach uses a hybrid scoring method:

1.  **Fuzzy Matching**: Uses `fuzz.token_set_ratio` from `thefuzz` to compare the textual similarity between the evidence and a chunk, ignoring word order.
2.  **Numeric Overlap**: Extracts and compares numeric tokens (e.g., "$1.2M", "50%") separately. This is crucial for financial or data-heavy documents where numbers are key identifiers.

The final score is a weighted average of the fuzzy and numeric scores. A chunk is considered a match if its score meets or exceeds a predefined threshold.
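To make the weighting concrete, the sketch below reproduces the scoring rule described above on hand-picked numbers. The function name `hybrid_score` and the sample values are illustrative only. The fuzzy score is passed in directly rather than computed, so no fuzzy-matching library is needed.

```python
def hybrid_score(fuzzy: float, numeric: float, alpha: float, has_numbers: bool) -> float:
    """Weighted average of fuzzy and numeric scores (illustrative helper).

    When the evidence contains no numbers, only the fuzzy score is used.
    """
    if not has_numbers:
        return fuzzy
    return alpha * fuzzy + (1 - alpha) * numeric


# Evidence with numbers: fuzzy similarity 0.90, half of its numbers found in the chunk.
score = hybrid_score(fuzzy=0.90, numeric=0.5, alpha=0.8, has_numbers=True)
print(round(score, 2))  # 0.8 * 0.90 + 0.2 * 0.5 = 0.82

# With a threshold of 0.85 this chunk would not count as a match.
print(score >= 0.85)
```

The example shows how a chunk with high textual similarity can still fall below the threshold when some of the evidence numbers are missing from it.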

In [None]:
# Configuration

API_KEY = os.environ["CONTEXTUAL_API_KEY"]

# Agent config
AGENT_ID = '15543690-68fd-49e7-8fc9-1f53c8e42e33'

# Retrieval config
TOP_RETRIEVED_DOCS = 150
RETRIEVAL_ALPHA = 0.9

# Evidence-string matching config.
# Enabling preprocessing is recommended only for numerically heavy datasets such as financial tables.
ENABLE_PREPROCESSING = False

### Custom Retrieval Logic

To maximize our chances of finding a match, we need to configure the retrieval system to return a broad set of candidate chunks. During the initial matching phase, it's often beneficial to cast a wider net than you would in a production application.

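This retrieval function is designed to:

- Increase the number of retrieved documents (`top_k`).
- Bypass the reranker and filter models, which might otherwise remove relevant but lower-scored chunks.
- Adjust the balance between lexical and semantic search. With `RETRIEVAL_ALPHA = 0.9`, the semantic weight is 0.9 and the lexical weight is 0.1.

Note that in `retrieve_documents` below these overrides are built but not yet sent: the `override_retrieval_config` entry in the payload is commented out, so requests use whatever retrieval settings the agent is configured with.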

In [None]:
## Test Query to make sure the retrieval is working
query = "What is the total revenue for Tesla in 2024?"
source_document = "Tesla_2024_Annual_Report.pdf"

url = f"https://api.app.contextual.ai/v1/applications/{AGENT_ID}/query?retrievals_only=true"
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "accept": "application/json",
    "Content-Type": "application/json",
}

payload: dict[str, Any] = {
    "stream": False,
    "messages": [{"role": "user", "content": query}],
    "documents_filters": {
        "operator": "AND",
        "filters": [
            {"field": "Filename", "operator": "equals", "value": source_document}
        ]
    }
}

# Network I/O: may raise requests.Timeout or requests.HTTPError (this test cell has no retry logic)
resp = requests.post(url, headers=headers, data=json.dumps(payload), timeout=30)
resp.raise_for_status()
print(resp.json())

In [None]:
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=8))
def retrieve_documents(
    query: str,
    source_document: str,
    *,
    alpha: float = RETRIEVAL_ALPHA,  # Uses the value from your config cell
    top_k: int = TOP_RETRIEVED_DOCS,    # Uses the value from your config cell
) -> List[Dict[str, Any]]:
    """
    A single, simplified function to retrieve documents that combines all necessary logic.
    """
    # Step 1: Build the override config.
    # NOTE: currently unused. The override is commented out of the payload below
    # because it is not working right now.
    override_cfg = {
        "filter_retrievals": False,
        "rerank_retrievals": False,
        "lexical_alpha": 1.0 - alpha,
        "semantic_alpha": alpha,
        "top_k": top_k,
    }

    # Step 2: Build the full payload for the API request.
    url = f"https://api.app.contextual.ai/v1/applications/{AGENT_ID}/query?retrievals_only=true"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    payload: Dict[str, Any] = {
        "stream": False,
        "messages": [{"role": "user", "content": query}],
        "documents_filters": {
            "operator": "AND",
            "filters": [
                {"field": "Filename", "operator": "equals", "value": source_document}
            ]
        },
        # "override_retrieval_config": override_cfg,
    }
    # Step 3: Make the request and return the desired content.
    resp = requests.post(url, headers=headers, data=json.dumps(payload), timeout=30)
    resp.raise_for_status()
    # The API returns a list of dicts in the "retrieval_contents" key
    return resp.json().get("retrieval_contents", [])



In [None]:
def _preprocess_text(text: str) -> str:
    """Lightweight text normalization to improve fuzzy matching robustness.

    Steps:
    1. Lower‐case everything.
    2. Strip table/markdown characters (``|``), repeated dashes and newlines.
    3. Remove formatting punctuation such as ``$``, ``%``, ``(``, ``)`` and ``+``.
    4. Remove commas that only serve as thousands separators inside numbers (e.g. ``2,000`` → ``2000``).
    5. Replace any remaining non‐alphanumeric characters with a single space and collapse runs of whitespace.
    """
    text = text.lower()
    text = text.replace("|", " ")
    text = re.sub(r"-+", " ", text)
    text = re.sub(r"\n+", " ", text)
    text = re.sub(r"[\(\)\$\%\+]", "", text)
    # Remove commas between digits (e.g. 1,234 -> 1234)
    text = re.sub(r"(\d),(\d)", r"\1\2", text)
    # Remove any other non alphanum / period characters
    text = re.sub(r"[^a-z0-9\. ]", " ", text)
    # Collapse redundant whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text


# Regex for capturing numbers that appear in evidence / chunks.
_NUMBER_PATTERN = (
    r"[-+]?\$?\s*(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?%?|"  # standard numbers / currency / percentages
    r"\([^)]*\d+[^)]*\)"  # numbers enclosed in parentheses
)


def _extract_numbers(text: str) -> List[str]:
    """Return a list of the numeric substrings found in *text*."""
    return [m.strip() for m in re.findall(_NUMBER_PATTERN, text)]


def matcher_with_metadata(
    evidence_str: str,
    retrieved_chunks: List[Dict[str, Any]],
    threshold: float = 0.85,
    *,
    alpha: float = 1.0,
    preprocess_text: bool = ENABLE_PREPROCESSING,
    ) -> Optional[Dict[str, Any]]:
    """Return the retrieved chunk (with metadata) that best matches evidence_str.

    This function finds the best matching chunk from a list of chunks containing
    both content text and metadata. The scoring mirrors the non-LLM portion of
    ``EvidenceMatchFuzzyLLM``:

    • The *fuzzy* component uses `token_set_ratio` on the evidence and chunk text (preprocessed only if enabled)
    • The *numeric* component measures overlap of extracted numeric tokens
    • The final score is ``alpha * fuzzy + (1-alpha) * numeric`` when evidence
      contains numbers, otherwise just the fuzzy score

    Parameters
    ----------
    evidence_str : str
        The target evidence string to match against.
    retrieved_chunks : List[Dict[str, Any]]
        List of chunk dictionaries, each containing a "content_text" key with
        the text content to match against, plus any additional metadata.
    threshold : float, default 0.85
        Minimum score required to return a match. Chunks scoring below this
        threshold will result in None being returned.
    alpha : float, default 1.0
        Weight assigned to the fuzzy score when both fuzzy and numeric scores
        are available. With the default of 1.0 the numeric score has no
        effect. For chunks heavy with financial tables, alpha = 0.8 is
        recommended to give some weight to numeric matching.
    preprocess_text : bool, default ENABLE_PREPROCESSING
        Whether to run light normalization before matching. Preprocessing is
        helpful for chunks where numbers matter significantly, such as
        financial tables.

    Returns
    -------
    Optional[Dict[str, Any]]
        The complete chunk dictionary from retrieved_chunks with the highest
        score, or None if no chunk meets the threshold or if input is empty.
        The returned dictionary includes both the matched content and any
        associated metadata.

    Examples
    --------
    >>> chunks = [
    ...     {"content_text": "Revenue was $1.2M in Q1", "source": "report.pdf"},
    ...     {"content_text": "Expenses totaled $800K", "source": "budget.xlsx"}
    ... ]
    >>> result = matcher_with_metadata("Revenue $1.2M", chunks)
    >>> print(result["source"])  # "report.pdf"
    """
    if not evidence_str or not retrieved_chunks:
        return None

    norm_evidence = _preprocess_text(evidence_str) if preprocess_text else evidence_str
    evidence_nums = _extract_numbers(norm_evidence)

    best_chunk = None
    best_score = -1.0

    for chunk in retrieved_chunks:
        chunk_text = chunk["content_text"]
        norm_chunk = _preprocess_text(chunk_text) if preprocess_text else chunk_text

        # Short-circuit if the entire evidence string is a substring
        if norm_evidence in norm_chunk:
            return chunk  # Return the whole dict

        fuzzy_score = fuzz.token_set_ratio(norm_evidence, norm_chunk) / 100.0
        chunk_nums = _extract_numbers(norm_chunk)
        numeric_matches = sum(1 for n in evidence_nums if n in chunk_nums)
        numeric_score = numeric_matches / len(evidence_nums) if evidence_nums else 0.0

        final_score = alpha * fuzzy_score + (1 - alpha) * numeric_score if evidence_nums else fuzzy_score

        if final_score > best_score:
            best_score = final_score
            best_chunk = chunk

    return best_chunk if best_score >= threshold else None
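One property of the numeric component is easy to miss: numbers are compared as exact strings, so `$1,234.5` in the evidence does not count as a match for `1234.5` in a chunk. The self-contained sketch below repeats the number pattern from the cell above and shows this on invented sentences. The helper name `numeric_overlap` is hypothetical.

```python
import re
from typing import List

# Same pattern as in the matcher cell, repeated so this snippet runs on its own.
NUMBER_PATTERN = (
    r"[-+]?\$?\s*(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?%?|"
    r"\([^)]*\d+[^)]*\)"
)


def extract_numbers(text: str) -> List[str]:
    """Return the numeric substrings found in text, stripped of whitespace."""
    return [m.strip() for m in re.findall(NUMBER_PATTERN, text)]


def numeric_overlap(evidence: str, chunk: str) -> float:
    """Fraction of evidence numbers that appear verbatim among the chunk numbers."""
    evidence_nums = extract_numbers(evidence)
    if not evidence_nums:
        return 0.0
    chunk_nums = extract_numbers(chunk)
    return sum(n in chunk_nums for n in evidence_nums) / len(evidence_nums)


evidence = "Revenue was $1,234.5 in 2024, up 12%"
print(extract_numbers(evidence))

# The chunk formats the first number differently, so only 2 of 3 numbers match.
chunk = "Revenue: 1234.5 for fiscal 2024, growth of 12%"
print(numeric_overlap(evidence, chunk))
```

Setting `ENABLE_PREPROCESSING = True` strips `$` and thousands separators before extraction, which is one way to reduce this kind of formatting mismatch.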

### Unit Test: Matching a Single Row

Before processing the entire dataset, it's good practice to test our `retrieve_documents` and `matcher_with_metadata` functions on a single example. This helps verify that the API calls are working correctly, the data is being processed as expected, and the matching logic is sound.

In [None]:
# Pick a single row to test (index 1, i.e. the second row); make sure its Evidence is not NaN
row = df.iloc[1]

# Extract the query and evidence string
query = row['Question']
evidence_str = row['Evidence']
source_document = row['Source_Document']

print(query)
print(evidence_str)

# Retrieve full response and get the list of dicts
retrieved_chunks = retrieve_documents(query=query, source_document=source_document)

# Run the matcher
match = matcher_with_metadata(evidence_str, retrieved_chunks)

# Add results to DataFrame
df.at[row.name, 'Match'] = bool(match)
df.at[row.name, 'Content_Id'] = match['content_id'] if match else None

# Print the result
if match:
    print("Match found!")
    # print("Matched content_id:", match["content_id"])
    # print("Matched chunk text:\n", match["content_text"])
else:
    print("No match found.")

## 5. Batch Processing: Matching the Entire Dataset

Now that we've validated the process on a single row, we'll apply it to every row in the DataFrame. We iterate through the dataset, retrieve candidate chunks for each evidence string, and run the matcher.

The results—a boolean `Match` status and the `Content_Id` of the matched chunk—are stored in new columns in the DataFrame. This process can take some time, as it involves making an API call for each row.

In [None]:
# Initialize columns
df['Match'] = False
df['Content_Id'] = None

for idx, row in df.iterrows():
    query = row['Question']
    evidence_str = row['Evidence']
    source_document = row['Source_Document']
    retrieved_chunks = retrieve_documents(query=query, source_document=source_document)
    match = matcher_with_metadata(evidence_str, retrieved_chunks)
    df.at[idx, 'Match'] = bool(match)
    df.at[idx, 'Content_Id'] = match['content_id'] if match else None
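Because rows that belong to the same multi-hop question share the same `Question`, and may share the same `Source_Document`, the loop above can issue identical retrieval requests more than once. A minimal sketch of memoizing those calls is shown below. Here `fake_retrieve` is a stand-in for `retrieve_documents` so the snippet runs without API access, and `make_cached_retriever` is a hypothetical wrapper name.

```python
from typing import Any, Callable, Dict, List, Tuple

Chunk = Dict[str, Any]


def make_cached_retriever(
    retrieve: Callable[[str, str], List[Chunk]],
) -> Callable[[str, str], List[Chunk]]:
    """Wrap a retrieval function so each (query, document) pair is fetched once."""
    cache: Dict[Tuple[str, str], List[Chunk]] = {}

    def cached_retrieve(query: str, source_document: str) -> List[Chunk]:
        key = (query, source_document)
        if key not in cache:
            cache[key] = retrieve(query, source_document)
        return cache[key]

    return cached_retrieve


# Stand-in for retrieve_documents that records how often it is called.
calls: List[Tuple[str, str]] = []


def fake_retrieve(query: str, source_document: str) -> List[Chunk]:
    calls.append((query, source_document))
    return [{"content_id": "c1", "content_text": f"text for {query}"}]


cached = make_cached_retriever(fake_retrieve)
cached("What was revenue?", "report.pdf")
cached("What was revenue?", "report.pdf")  # served from the cache
cached("What was revenue?", "other.pdf")   # different document, new call
print(len(calls))
```

In the batch loop, `make_cached_retriever(retrieve_documents)` could stand in for the direct call, since `retrieve_documents` accepts the query and source document as its first two arguments.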

### Analysis: Reviewing Match Results

After running the batch process, let's inspect the results. We can check the `tail` of the DataFrame and count the total number of successful matches to get a sense of the match rate.

In [None]:
df.tail()

Not every annotated evidence string will have been matched to a chunk. For now, we continue with only the rows that did match.

In [None]:
true_count = df['Match'].sum()
true_count

## 6. Filtering for a High-Quality Evaluation Set

Since our dataset is designed for multi-hop queries, a complete evaluation example requires successful matches for *both* steps of a given question (`Step_Number` 1 and 2).

In this final step, we filter the DataFrame to keep only the `QA_ID`s for which we found a valid chunk match for both evidence strings. This produces a high-quality, reliable dataset that can be used for downstream retrieval and RAG evaluation tasks.

In [None]:
# Step 1: Filter for step 1 or 2 with Match == True
filtered = df[df['Step_Number'].isin([1, 2]) & df['Match']]

# Step 2: Find QA_IDs with both step 1 and step 2 matched
step_counts = filtered.groupby('QA_ID')['Step_Number'].nunique()
qa_ids_with_both_matched = step_counts[step_counts == 2].index.tolist()

# Step 3: Filter the original DataFrame for those QA_IDs and steps 1 or 2
result_df = df[df['QA_ID'].isin(qa_ids_with_both_matched) & df['Step_Number'].isin([1, 2])]

# Optional: sort for easier viewing
result_df = result_df.sort_values(['QA_ID', 'Step_Number'])

result_df
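The cell above hard-codes two steps per question. If a dataset contains questions with a different number of hops, one possible generalization is to keep a `QA_ID` only when every one of its rows matched. The sketch below shows this on invented toy data, assuming the same `QA_ID`, `Step_Number`, and `Match` columns.

```python
import pandas as pd

# Toy data (invented): QA 1 is fully matched, QA 2 misses its second step,
# QA 3 has three steps and all of them matched.
toy = pd.DataFrame(
    {
        "QA_ID": [1, 1, 2, 2, 3, 3, 3],
        "Step_Number": [1, 2, 1, 2, 1, 2, 3],
        "Match": [True, True, True, False, True, True, True],
    }
)

# For each row, True only if every row with the same QA_ID has Match == True.
fully_matched = toy.groupby("QA_ID")["Match"].transform("all")

complete = toy[fully_matched].sort_values(["QA_ID", "Step_Number"])
print(complete["QA_ID"].unique().tolist())
```

This variant keeps questions 1 and 3 and drops question 2, without assuming how many steps a question has.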

In [None]:
#result_df.to_csv('matched_retrievals.csv', index=False)
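As one example of the downstream use mentioned above, the matched `Content_Id` values can serve as ground truth for retrieval metrics. The sketch below computes precision and recall at k for a single question from a ranked list of retrieved chunk IDs. The function name and the IDs are made up for illustration.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(
    retrieved_ids: List[str], relevant_ids: Set[str], k: int
) -> Tuple[float, float]:
    """Precision and recall over the top-k retrieved chunk IDs."""
    top_k = retrieved_ids[:k]
    hits = len(set(top_k) & relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Ground-truth chunk IDs for a two-hop question (one per evidence string).
relevant = {"chunk-a", "chunk-b"}
# Ranked IDs returned by the retriever under evaluation.
retrieved = ["chunk-x", "chunk-a", "chunk-y", "chunk-z", "chunk-b"]

print(precision_recall_at_k(retrieved, relevant, k=3))
print(precision_recall_at_k(retrieved, relevant, k=5))
```

For a multi-hop question, recall below 1.0 at a given k means at least one required piece of evidence was not retrieved within the top k chunks.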