# Evaluating Retrieval Performance with Standard and RAG-Specific Metrics

This notebook provides a comprehensive framework for evaluating the performance of a retrieval system using both traditional information retrieval (IR) metrics and modern, RAG-specific metrics. It takes a dataset of queries with ground-truth evidence and systematically assesses the quality of the retrieved results.

**Key Evaluation Metrics Covered:**
- **Standard IR Metrics:**
    - `Recall@k`: What fraction of relevant documents are retrieved in the top-k results?
    - `Precision@k`: What fraction of the top-k retrieved documents are relevant?
    - `nDCG@k`: A measure of ranking quality that accounts for the position of relevant documents.
    - `Mean Reciprocal Rank (MRR)`: How high up in the ranking is the first relevant document?
- **RAG-Specific Metrics (using RAGAs):**
    - `Context Recall`: Measures the extent to which the ground-truth answer is covered by the retrieved context.
    - `Context Precision`: Measures the signal-to-noise ratio in the retrieved context.
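
For reference, the standard IR metrics above can be written out as follows. This is a sketch under binary relevance, which is what the functions in Section 3 assume: $R$ is the set of ground-truth IDs for a query, $d_1, \dots, d_k$ are the top-$k$ retrieved items, and $\mathrm{rel}_i = 1$ if $d_i \in R$ and $0$ otherwise (also $0$ when fewer than $i$ items were retrieved). The symbols are notation introduced here, not taken from the notebook.

```latex
\begin{align*}
\mathrm{Recall@}k    &= \frac{\lvert \{d_1,\dots,d_k\} \cap R \rvert}{\lvert R \rvert} \\
\mathrm{Precision@}k &= \frac{1}{k}\sum_{i=1}^{k} \mathrm{rel}_i \\
\mathrm{DCG@}k       &= \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i}-1}{\log_2(i+1)} \\
\mathrm{IDCG@}k      &= \sum_{i=1}^{\min(\lvert R \rvert,\,k)} \frac{1}{\log_2(i+1)} \\
\mathrm{nDCG@}k      &= \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k} \\
\mathrm{RR@}k        &= \begin{cases} 1/\min\{\, i \le k : \mathrm{rel}_i = 1 \,\} & \text{if a relevant item is in the top } k \\ 0 & \text{otherwise} \end{cases}
\end{align*}
```

MRR is the mean of $\mathrm{RR@}k$ over all queries. With binary relevance, $2^{\mathrm{rel}_i}-1$ is simply $\mathrm{rel}_i$.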

## 📊 Workflow

1.  **Data Preparation**: Load the matched evidence data from the previous notebook and pivot it into a one-query-per-row format.
2.  **Metric Implementation**: Define functions for standard IR metrics.
3.  **Retrieval and Evaluation**: For each query, retrieve documents and calculate all metrics.
4.  **Analysis**: Aggregate the results and analyze the mean scores to understand overall system performance.

This notebook is a starting point for retrieval analysis. Ultimately, you will want to compare retrieval performance across agent configurations (for example, with and without the reranker) as part of your evaluation.

## 1. Environment and Configuration

First, we import all necessary libraries and set up the configuration parameters for the evaluation, such as API keys and Agent IDs.

In [1]:
import math
import os

import pandas as pd
import numpy as np
import asyncio
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import NonLLMContextRecall, NonLLMContextPrecisionWithReference

from contextual import ContextualAI
client = ContextualAI()
# The agent used in this notebook lives in a custom tenant

In [2]:
# Configuration
agent_id = '15543690-68fd-49e7-8fc9-1f53c8e42e33'


## 2. Data Preparation

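The evaluation script expects a pandas DataFrame (`wide` in the code below) where each row represents a single retrieval test case (query). The DataFrame should be constructed from a CSV or other data source with the following columns. Note that the pivot below fills the `evidence_*` columns from `Content_Id`, so in this notebook they hold content IDs, which are compared against the `content_id` of each retrieved chunk.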

| Column Name                | Description                                                                                 | Example                                                                 |
|----------------------------|--------------------------------------------------------------------------------------------|-------------------------------------------------------------------------|
| `prompt`               | The query string to be sent to the retrieval system.                                       | `"Which company had higher R&D expenses in FY2024, Tesla or Qualcomm?"` |
| `evidence_1`               | The first ground-truth evidence entry for this query.                                      | `"Tesla had higher expenses"`                                           |
| `evidence_2`               | (Optional) The second ground-truth evidence entry, if applicable.                          | `"R&D expenses for electric cars in 2024 were mainly  .."`              |
| ...                        | (Optional) Additional `evidence_*` columns, one per further evidence entry.                |                                                                         |


In [None]:
# Load the matched evidence produced by the previous notebook.
# Expected columns: QA_ID, Question, Answer, Content_Id
matched = pd.read_csv("matched_retrievals")
# Number the evidence rows within each QA_ID group (1, 2, ...)
matched['evidence_num'] = matched.groupby('QA_ID').cumcount() + 1

# Pivot to wide format
wide = matched.pivot_table(
    index=['QA_ID', 'Question'],
    columns='evidence_num',
    values='Content_Id',
    aggfunc='first'
).reset_index()

# Rename columns to evidence_1, evidence_2, ...
wide.columns = ['QA_ID', 'prompt'] + [f"evidence_{i}" for i in wide.columns[2:]]

# Now `wide` is your desired DataFrame
wide.head()
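
To make the reshaping concrete, here is a small sketch of the same pivot on a made-up three-row table. The IDs and questions are invented for illustration; only the column names follow the cell above.

```python
import pandas as pd

# Toy long-format data: one row per (query, evidence) pair. Values are invented.
matched = pd.DataFrame({
    "QA_ID": [1, 1, 2],
    "Question": ["q1", "q1", "q2"],
    "Content_Id": ["a", "b", "c"],
})

# Number the evidence rows within each query
matched["evidence_num"] = matched.groupby("QA_ID").cumcount() + 1

# One row per query, one column per evidence slot
wide = matched.pivot_table(
    index=["QA_ID", "Question"],
    columns="evidence_num",
    values="Content_Id",
    aggfunc="first",
).reset_index()
wide.columns = ["QA_ID", "prompt"] + [f"evidence_{i}" for i in wide.columns[2:]]

# Build the ground-truth set for each row, skipping empty evidence slots
evidence_cols = [c for c in wide.columns if c.startswith("evidence_")]
ground_truth = [
    {row[c] for c in evidence_cols if pd.notnull(row[c])}
    for _, row in wide.iterrows()
]
print(wide)
print(ground_truth)
```

Queries with fewer evidence rows than the maximum get `NaN` in the unused `evidence_*` columns, which is why the evaluation code filters with `pd.notnull`.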

## 3. Defining the Evaluation Metrics

Here, we implement the functions for our evaluation metrics. This includes standard IR metrics to assess the accuracy and ranking quality of retrieval, as well as the useful variants Precision@R, Hit Rate, and Reciprocal Rank. All functions expect `ground_truth_ids` to be a set.

In [4]:
def recall_at_k(retrieved_ids, ground_truth_ids, k):
    retrieved_top_k = set(retrieved_ids[:k])
    return len(retrieved_top_k & ground_truth_ids) / len(ground_truth_ids) if ground_truth_ids else 0.0

def precision_at_k(retrieved_ids, ground_truth_ids, k):
    retrieved_top_k = retrieved_ids[:k]
    return sum(1 for cid in retrieved_top_k if cid in ground_truth_ids) / k if k else 0.0

def ndcg_at_k(retrieved_chunks, ground_truth_ids, k=10):
    # Use the order of retrieved_chunks as provided, do not sort by score
    top_k = retrieved_chunks[:k]
    relevances = [1 if chunk['content_id'] in ground_truth_ids else 0 for chunk in top_k]
    dcg = sum((2**rel - 1) / np.log2(idx + 2) for idx, rel in enumerate(relevances))
    ideal_relevances = sorted([1]*min(len(ground_truth_ids), k) + [0]*(k - min(len(ground_truth_ids), k)), reverse=True)
    idcg = sum((2**rel - 1) / np.log2(idx + 2) for idx, rel in enumerate(ideal_relevances))
    return dcg / idcg if idcg > 0 else 0.0

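def idcg(matched_chunk_positions):
    # Ideal DCG under binary relevance: only the number of relevant items
    # matters, so any sized collection with one entry per relevant item works.
    total_relevant = len(matched_chunk_positions)
    return sum(1 / math.log2(i + 2) for i in range(total_relevant))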

def precision_at_r(retrieved_ids, ground_truth_ids):
    r = len(ground_truth_ids)
    retrieved_top_r = retrieved_ids[:r]
    return sum(1 for cid in retrieved_top_r if cid in ground_truth_ids) / r if r else 0.0

def hit_rate_at_k(retrieved_ids, ground_truth_ids, k):
    return int(any(cid in ground_truth_ids for cid in retrieved_ids[:k]))

def reciprocal_rank_at_k(retrieved_ids, ground_truth_ids, k):
    for idx, cid in enumerate(retrieved_ids[:k]):
        if cid in ground_truth_ids:
            return 1.0 / (idx + 1)
    return 0.0
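
As a sanity check, the metrics can be worked through by hand on a toy ranking. The sketch below restates compact versions of the functions so that it runs on its own; `ndcg_at_k_ids` is a hypothetical simplified variant that takes plain IDs instead of chunk dicts, and the IDs are invented.

```python
import math

def recall_at_k(retrieved_ids, ground_truth_ids, k):
    return len(set(retrieved_ids[:k]) & ground_truth_ids) / len(ground_truth_ids) if ground_truth_ids else 0.0

def precision_at_k(retrieved_ids, ground_truth_ids, k):
    return sum(1 for cid in retrieved_ids[:k] if cid in ground_truth_ids) / k if k else 0.0

def ndcg_at_k_ids(retrieved_ids, ground_truth_ids, k):
    # Binary relevance, so the gain 2**rel - 1 reduces to rel itself
    rels = [1 if cid in ground_truth_ids else 0 for cid in retrieved_ids[:k]]
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(ground_truth_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0

def reciprocal_rank_at_k(retrieved_ids, ground_truth_ids, k):
    for idx, cid in enumerate(retrieved_ids[:k]):
        if cid in ground_truth_ids:
            return 1.0 / (idx + 1)
    return 0.0

# Toy ranking: relevant items sit at ranks 3 and 5; "c5" is never retrieved
retrieved = ["c1", "c7", "c3", "c9", "c4"]
relevant = {"c3", "c4", "c5"}
k = 5

recall = recall_at_k(retrieved, relevant, k)        # 2 of 3 relevant items found
precision = precision_at_k(retrieved, relevant, k)  # 2 of 5 retrieved items relevant
ndcg = ndcg_at_k_ids(retrieved, relevant, k)        # (1/log2(4) + 1/log2(6)) / ideal DCG
rr = reciprocal_rank_at_k(retrieved, relevant, k)   # first hit at rank 3

print(f"recall@{k}={recall:.3f} precision@{k}={precision:.3f} nDCG@{k}={ndcg:.3f} RR@{k}={rr:.3f}")
```

Here recall is 2/3, precision is 2/5, and the reciprocal rank is 1/3. nDCG comes out lower than recall because both hits appear late in the ranking.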

## 4. Running the Evaluation

This section contains the core logic for running the evaluation. We'll define two key functions:

1.  `run_query`: A simple wrapper to call the retrieval agent for a given query and parse its response to extract the retrieved chunks and attribution data.
2.  `evaluate_single_query`: The main evaluation function. For a single query (a row in our `wide` DataFrame), it calls `run_query`, gets the retrieved results, and then calculates all of our defined metrics (both standard IR and RAGAs).

In [None]:
def run_query(user_input):
    """
    Run the retrieval agent for the given user_input.
    Returns:
        - retrieved_chunks: list of dicts with 'content_id' and 'score'
        - attributions: list of attribution objects (with 'content_ids')
    """
    query_result = client.agents.query.create(
        agent_id=agent_id,
        messages=[{
            "content": user_input,
            "role": "user"
        }],
        include_retrieval_content_text=True,
        retrievals_only=False
    )
    # Extract retrieved_chunks (content_id and score)
    retrieved_chunks = [
        {"content_id": rc.content_id, "score": rc.score}
        for rc in query_result.retrieval_contents
    ]

    # De-duplicate attributions that cite the same sequence of content_ids,
    # keeping them in first-seen order
    unique_attributions = []
    seen = set()
    for attr in query_result.attributions:
        key = tuple(attr.content_ids)
        if key not in seen:
            seen.add(key)
            unique_attributions.append({"content_ids": attr.content_ids})

    return retrieved_chunks, unique_attributions

run_query("Apple revenue")
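
The response parsing can also be exercised without calling the API. The sketch below is an assumption-laden stand-in: `parse_query_result` is a hypothetical helper that mirrors the parsing inside `run_query`, and the fake response object only mimics the two attributes that code reads (`retrieval_contents` and `attributions`). It is not the client's real response type.

```python
from types import SimpleNamespace

def parse_query_result(query_result):
    """Hypothetical helper mirroring the parsing done inside run_query."""
    retrieved_chunks = [
        {"content_id": rc.content_id, "score": rc.score}
        for rc in query_result.retrieval_contents
    ]
    unique_attributions, seen = [], set()
    for attr in query_result.attributions:
        key = tuple(attr.content_ids)
        if key not in seen:  # drop repeated citations of the same ID sequence
            seen.add(key)
            unique_attributions.append({"content_ids": list(attr.content_ids)})
    return retrieved_chunks, unique_attributions

# Fake response with a duplicated attribution (all values invented)
fake_result = SimpleNamespace(
    retrieval_contents=[
        SimpleNamespace(content_id="c1", score=0.9),
        SimpleNamespace(content_id="c2", score=0.4),
    ],
    attributions=[
        SimpleNamespace(content_ids=["c1"]),
        SimpleNamespace(content_ids=["c1"]),
        SimpleNamespace(content_ids=["c1", "c2"]),
    ],
)

chunks, attributions = parse_query_result(fake_result)
print(chunks)
print(attributions)
```

Note that de-duplication is per attribution, not per ID: `"c1"` still appears in both remaining attributions.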

In [6]:
def evaluate_single_query(row, run_query, k=10, use_attributions=False):
    index_df = int(row['QA_ID'])
    user_input = row["prompt"]
    evidence_chunk_id_cols = [col for col in row.index if col.startswith('evidence_')]
    ground_truth_ids = set(row[col] for col in evidence_chunk_id_cols if pd.notnull(row[col]))
    retrieved_chunks, attributions = run_query(user_input)
    
    retriever_ids = [chunk['content_id'] for chunk in retrieved_chunks]
    if use_attributions:
        # Evaluate only the chunks cited by the generation model
        attributions_ids = [cid for chunk in attributions for cid in chunk['content_ids']]
        retrieved_ids = attributions_ids
    else:
        # Evaluate the raw retriever output
        retrieved_ids = retriever_ids
        attributions_ids = None

    recall = recall_at_k(retrieved_ids, ground_truth_ids, k)
    precision = precision_at_k(retrieved_ids, ground_truth_ids, k)
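    # Compute nDCG over the same ID list as the other metrics, so that
    # use_attributions=True is reflected here as well
    ndcg = ndcg_at_k([{'content_id': cid} for cid in retrieved_ids], ground_truth_ids, k)
    # Ideal DCG for this query: depends only on the number of relevant items, capped at k
    icdg = idcg(set(range(min(len(ground_truth_ids), k))))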
    precision_r = precision_at_r(retrieved_ids, ground_truth_ids)
    hit_rate = hit_rate_at_k(retrieved_ids, ground_truth_ids, k)
    mrr = reciprocal_rank_at_k(retrieved_ids, ground_truth_ids, k)

    # --- RAGAS metrics ---
    reference_contexts = list(ground_truth_ids)
    retrieved_contexts = retrieved_ids

    sample = SingleTurnSample(
        user_input=user_input,
        reference_contexts=reference_contexts,
        retrieved_contexts=retrieved_contexts,
    )

    # Run RAGAS metrics (must be run in an event loop)
    async def run_ragas_metrics(sample):
        context_recall = NonLLMContextRecall()
        context_precision = NonLLMContextPrecisionWithReference()
        recall_score = await context_recall.single_turn_ascore(sample)
        precision_score = await context_precision.single_turn_ascore(sample)
        return recall_score, precision_score

    recall_score, precision_score = asyncio.run(run_ragas_metrics(sample))


    return {
        "index": index_df,
        "user_input": user_input,
        "k": k,
        "recall@k": recall,
        "precision@k": precision,
        "precision@R": precision_r,
        "nDCG@k": float(ndcg),
        "iDCG@k": float(icdg),
        "hit_rate@k": hit_rate,
        "mrr@k": mrr,
        "ragas_context_recall": recall_score,
        "ragas_context_precision": precision_score,
        "retrieved_chunks": retrieved_chunks,  # full info
        "ground_truth_ids": list(ground_truth_ids),
        "retriever_ids": retriever_ids,
        "attributions": attributions_ids,
    }
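
One caveat with the `asyncio.run` call above: it raises `RuntimeError` when called from a thread whose event loop is already running, which is commonly the case inside a Jupyter kernel. If you hit that error, one possible workaround is to run the coroutine in a short-lived worker thread. The helper below is a sketch only; the name `run_coro_sync` is hypothetical and not part of RAGAs or this notebook.

```python
import asyncio
import concurrent.futures

def run_coro_sync(coro):
    """Run a coroutine to completion from synchronous code.

    Uses asyncio.run directly when no loop is running in this thread,
    otherwise runs it in a separate thread with its own event loop.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No running loop in this thread: asyncio.run is safe
        return asyncio.run(coro)
    # A loop is already running here (e.g. in a notebook): use a worker thread
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()

async def fake_metric():
    # Stand-in for an async metric call such as single_turn_ascore
    await asyncio.sleep(0)
    return 0.5

plain_result = run_coro_sync(fake_metric())

async def caller_with_running_loop():
    # Simulates calling the helper while an event loop is already running
    return run_coro_sync(fake_metric())

nested_result = asyncio.run(caller_with_running_loop())
print(plain_result, nested_result)
```

In `evaluate_single_query`, this would replace `asyncio.run(run_ragas_metrics(sample))` with `run_coro_sync(run_ragas_metrics(sample))`. The worker-thread branch blocks the calling thread until the coroutine finishes, which is acceptable for a sequential evaluation loop.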

### 4.1. Unit Test: Evaluating a Single Query

Before running the full evaluation, it's a good practice to test our `evaluate_single_query` function on a single row. This acts as a "unit test" to ensure that:
- The `run_query` function is correctly called.
- All metrics (including the async RAGAs metrics) are calculated without errors.
- The output dictionary has the structure we expect.

If this cell runs successfully, we can be more confident that the full evaluation loop will proceed smoothly.

In [None]:
# Pick the first example (row 0) from your DataFrame
row = wide.iloc[0]

# Run the evaluation for this single example
result = evaluate_single_query(row, run_query, k=10, use_attributions=False)

# Print the results
print(result)

### 4.2. Executing the Full Evaluation Run

Now that we have validated our functions on a single example, we will run the evaluation across the entire dataset.

The following cell iterates through each row in our prepared `wide` DataFrame, calling `evaluate_single_query` for each one. To prevent data loss on long runs, the script saves its progress to a partial CSV file periodically.

You can configure two key options here:
- `k`: The cutoff for rank-sensitive metrics (e.g., `precision@10`, `recall@10`).
- `use_attributions`:
    - `False` (default): Evaluates the direct output of the **retriever**. This measures the quality of the initial candidate set.
    - `True`: Evaluates only the chunks that were **attributed** by the generation model. This measures the quality of the final evidence used in the answer.
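
To see how these two options change the numbers, the sketch below sweeps `k` over both ID sources on toy data. The IDs and attributions are invented, and the two metric functions are restated so the snippet runs on its own.

```python
def recall_at_k(retrieved_ids, ground_truth_ids, k):
    return len(set(retrieved_ids[:k]) & ground_truth_ids) / len(ground_truth_ids) if ground_truth_ids else 0.0

def precision_at_k(retrieved_ids, ground_truth_ids, k):
    return sum(1 for cid in retrieved_ids[:k] if cid in ground_truth_ids) / k if k else 0.0

ground_truth_ids = {"c3", "c4", "c5"}
retriever_ids = ["c1", "c7", "c3", "c9", "c4", "c5"]              # raw retriever ranking
attributions = [{"content_ids": ["c3"]}, {"content_ids": ["c3", "c4"]}]  # cited by the answer

def select_ids(use_attributions):
    # Same selection rule as in evaluate_single_query
    if use_attributions:
        return [cid for attr in attributions for cid in attr["content_ids"]]
    return retriever_ids

sweep = {}
for use_attributions in (False, True):
    ids = select_ids(use_attributions)
    for k in (1, 3, 5, 10):
        sweep[(use_attributions, k)] = (
            recall_at_k(ids, ground_truth_ids, k),
            precision_at_k(ids, ground_truth_ids, k),
        )

for (use_attr, k), (rec, prec) in sweep.items():
    print(f"use_attributions={use_attr!s:5} k={k:2d} recall={rec:.3f} precision={prec:.3f}")
```

Two things are worth noticing. Precision always divides by `k`, so it drops once `k` exceeds the number of items actually returned. The flattened attribution list can also contain the same ID more than once (`"c3"` here), and each copy counts toward precision, so you may want to de-duplicate it if that is not the behavior you intend.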

In [None]:
results = []
for idx, row in wide.iterrows():
    try:
        result = evaluate_single_query(row, run_query, k=10, use_attributions=False)
        results.append(result)
    except Exception as e:
        print(f"Error on row {idx}: {e}")
        # Optionally, append a placeholder or skip
        continue
    # Save a checkpoint every 25 rows, and once more at the last row
    if (idx + 1) % 25 == 0 or (idx + 1) == len(wide):
        pd.DataFrame(results).to_csv("retrieval_eval_results_partial.csv", index=False)
        print(f"Saved progress at row {idx + 1}")

# Final save
final = pd.DataFrame(results)
final.to_csv("eval_results_final.csv", index=False)


## 5. Analyzing the Results

After the evaluation loop is complete, the results are stored in a final DataFrame. We can now compute the average (mean) for each metric across all queries. This gives us a high-level overview of the retrieval system's performance.

The metrics below represent the average scores for the entire test set.

In [13]:
final = pd.read_csv("eval_results_final.csv")

metrics = [
    'recall@k',
    'precision@k',
    'precision@R',
    'nDCG@k',
    'iDCG@k',
    'ragas_context_recall',
    'ragas_context_precision'
]

means = final[metrics].mean()
print(means)

recall@k                   0.854839
precision@k                0.164516
precision@R                0.500000
nDCG@k                     0.715072
iDCG@k                     1.590225
ragas_context_recall       0.903226
ragas_context_precision    0.648375
dtype: float64
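
Means hide which queries are failing, so it can help to also look at the weakest rows. The sketch below uses a made-up three-row results table with the same column names as the output of `evaluate_single_query`; it also averages `hit_rate@k` and `mrr@k`, which are stored in the results but not included in the `metrics` list above.

```python
import pandas as pd

# Toy results table (values invented); real runs would load eval_results_final.csv
final = pd.DataFrame({
    "user_input": ["q1", "q2", "q3"],
    "recall@k": [1.0, 0.5, 0.0],
    "hit_rate@k": [1, 1, 0],
    "mrr@k": [1.0, 0.5, 0.0],
})

# Mean of each metric across queries
summary = final[["recall@k", "hit_rate@k", "mrr@k"]].mean()

# Queries where at least one ground-truth item was missed, worst first
missed = final[final["recall@k"] < 1.0].sort_values("recall@k")

print(summary)
print(missed[["user_input", "recall@k"]])
```

Reviewing the `missed` rows alongside their `ground_truth_ids` and `retriever_ids` columns is usually the quickest way to tell whether a low score reflects a retrieval gap or a labeling issue.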


## 6. Deeper Analysis

This notebook has shown how to assess the retrievals returned in the RAG Agent query path. A deeper analysis would also consider factors such as:
- The weighting between lexical and semantic search
- Retrieval quality at each stage, including the retriever, reranker, and filter model

You can modify this script or your agent to test retrieval under different configurations to optimize your RAG Agent.

## Appendix: Comparison Metrics (Attribution-Based)

For reference, these were the results from a previous run where `use_attributions` was set to `True`. This evaluates the chunks that the generation model actually used in its answer, rather than all retrieved chunks.

- recall@k                   0.854839
- precision@k                0.164516
- precision@R                0.500000
- nDCG@k                     0.715072
- iDCG@k                     1.590225
- ragas_context_recall       0.903226
- ragas_context_precision    0.648375
