<img src="https://imagedelivery.net/Dr98IMl5gQ9tPkFM5JRcng/3e5f6fbd-9bc6-4aa1-368e-e8bb1d6ca100/Ultra" alt="Contextual AI logo" width="160" />

# Evaluation of RAG Agents with RAGAS

This notebook demonstrates how to use RAGAS to evaluate the quality of your Contextual AI Retrieval-Augmented Generation (RAG) agents. Its purpose is to show that Contextual AI's platform is flexible enough to support external evaluation approaches; the approach shown here with RAGAS applies similarly to other evaluation tools.

### What is RAGAS?

RAGAS is an open-source evaluation framework specifically designed for RAG systems. It provides several important metrics to assess the quality of both the retrieval and generation components:

1. **Faithfulness** - Measures if the generated answer is factually consistent with the retrieved context
2. **Context Relevancy** - Evaluates if the retrieved passages are relevant to the question  
3. **Context Recall** - Checks if all information needed to answer the question is present in the context
4. **Answer Relevancy** - Assesses if the generated answer is relevant to the question

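One of the key advantages of RAGAS is that some of its metrics, such as faithfulness, are reference-free, meaning we don't need ground truth answers to compute them. This makes RAGAS particularly useful for evaluating production systems built with Contextual AI where labeled data may not be available. This notebook uses Faithfulness and Context Recall, together with Answer Accuracy, which compares the generated answer against a reference answer.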
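
To make the first metric concrete: the RAGAS documentation describes faithfulness as the fraction of claims in the generated answer that are supported by the retrieved context. The sketch below only illustrates that ratio. In RAGAS itself an LLM extracts and verifies the claims; here the verdicts are supplied by hand, and `faithfulness_ratio` is a hypothetical helper, not part of the library.

```python
from typing import Sequence


def faithfulness_ratio(claim_supported: Sequence[bool]) -> float:
    """Fraction of answer claims judged to be supported by the context."""
    if not claim_supported:
        raise ValueError("at least one claim is required")
    return sum(claim_supported) / len(claim_supported)


# Hand-written verdicts for a five-claim answer: four supported, one not.
verdicts = [True, True, True, True, False]
print(faithfulness_ratio(verdicts))
```

Read this way, a faithfulness score below 1.0 means that at least one claim in the answer could not be traced back to the retrieved passages.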

### In this notebook you will learn how to:
1. Set up the RAGAS evaluation environment
2. Prepare evaluation datasets
3. Query a Contextual AI RAG agent and capture its outputs
4. Calculate RAGAS metrics for your RAG pipeline
5. Analyze and interpret the evaluation results

Let's begin by setting up our environment.

## 1: Environment Setup

First, we'll install the necessary packages for our evaluation. We need:
- `langfuse`: For tracking and observability
- `ragas`: The evaluation framework itself
- `openpyxl`: For Excel export of results
- `openai`: For LLM access (used by RAGAS for evaluation)
- `langchain-openai`: For LangChain with OpenAI integration
- `langchain-contextual`: For connecting to Contextual AI

These packages work together to create a complete evaluation pipeline. Installation might take a few minutes, depending on your connection speed and whether some dependencies are already installed.

In [None]:
%pip install langfuse ragas openpyxl openai langchain-openai langchain-contextual --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.
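
After installing (and restarting the kernel if prompted), it can help to confirm which versions were actually picked up. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, and the package list is simply copied from the install command above.

```python
from importlib import metadata

# Distribution names taken from the install command in this notebook.
REQUIRED = ["langfuse", "ragas", "openpyxl", "openai",
            "langchain-openai", "langchain-contextual"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```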


## 2: Import Dependencies

Now we'll import the required libraries and set up our clients. We organize the imports into logical groups:

- Standard library imports for basic Python functionality
- Third-party imports for data processing and API interactions
- RAGAS-specific imports for evaluation metrics
- Client initialization for Contextual AI and the evaluator LLM

Note that we're using GPT-4o as our evaluator model since evaluation quality is highly dependent on the capability of the LLM used for assessment. The evaluator model needs to understand subtleties in text and be able to compare information accurately.

In [None]:
# Standard library imports
import os
import random
import time
import asyncio
import uuid
from typing import List, Dict, Any

# Third party imports
import requests
import pandas as pd
import tqdm
import openai
from langchain_openai import ChatOpenAI
from contextual import ContextualAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness, ContextRecall, AnswerAccuracy

# API keys: replace the placeholders with your own keys
os.environ["OPENAI_API_KEY"] = "API_KEY"
os.environ["CONTEXTUAL_API_KEY"] = "API_KEY"

# Initialize clients
client = ContextualAI()
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
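
The cell above assigns placeholder keys directly in the notebook. If you would rather not store real keys in a saved notebook, one option is to read them from the environment and prompt only when they are missing. This is a sketch under that assumption; `ensure_env` is a hypothetical helper, not part of any of the libraries used here.

```python
import os
from getpass import getpass


def ensure_env(name: str, prompt_fn=getpass) -> str:
    """Return the value of an environment variable, prompting once if it is unset."""
    value = os.environ.get(name)
    if not value:
        value = prompt_fn(f"Enter {name}: ")
        if not value:
            raise RuntimeError(f"{name} is required")
        os.environ[name] = value
    return value


# In the notebook this would replace the hard-coded placeholder assignments:
# ensure_env("OPENAI_API_KEY")
# ensure_env("CONTEXTUAL_API_KEY")
```

The `prompt_fn` parameter exists only so the helper can be exercised without interactive input.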

In [None]:
def fetch_file(filepath):
    """Download a file from the ContextualAI examples repository to the same relative path."""
    directory = os.path.dirname(filepath)
    if directory:  # A bare filename has no directory to create
        os.makedirs(directory, exist_ok=True)

    print(f"Fetching {filepath}")
    response = requests.get(f"https://raw.githubusercontent.com/ContextualAI/examples/main/01-getting-started/{filepath}")
    response.raise_for_status()  # Fail loudly rather than saving an error page as data

    with open(filepath, 'wb') as f:
        f.write(response.content)

fetch_file('data/eval_short.csv')

Fetching data/eval_short.csv


## 3: Prepare Evaluation Data

For this evaluation, you need a dataset that contains the following columns:
- `prompt` (str): The question used to evaluate our RAG pipeline
- `reference` (str): The ground truth answer to the question (needed for answer accuracy evaluation)

Note: RAGAS can compute reference-free metrics such as faithfulness without ground truth answers, but reference answers let us also evaluate answer accuracy, and the context recall metric used below is also given the reference.
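
Before querying the agent, it is worth checking that the dataset really has the two columns described above. A minimal sketch follows; `missing_columns` is a hypothetical helper that works on any iterable of column names, so it can be tried without loading the CSV.

```python
REQUIRED_COLUMNS = ("prompt", "reference")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# With a DataFrame this would be called as missing_columns(eval_df.columns).
print(missing_columns(["prompt", "reference"]))
print(missing_columns(["prompt"]))
```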

In [None]:
eval_df = pd.read_csv("data/eval_short.csv")
eval_df.head()

|   | prompt | reference |
|---|--------|-----------|
| 0 | What was Apple's total net sales for 2022? | Apple's total net sales for the three months e... |
| 1 | What was Apple's Services revenue in Q1 2023 a... | Apple's Services revenue was $20,766 million i... |
| 2 | What was Apple's iPhone revenue in Q1 2026? | I am an AI assistant created by Contextual AI.... |
| 3 | How much did Apple spend on research and devel... | Apple spent $7,709 million on research and dev... |
| 4 | What is the next amazing product from Apple? | I am an AI assistant created by Contextual AI.... |


## 4: Test the RAG Pipeline

Before running the full evaluation, let's test our connection to the Contextual AI agent with a single query. This helps ensure everything is properly configured.

We'll select the first question from our evaluation dataset and send it to our RAG agent. This initial test allows us to:
1. Verify the agent is accessible and responding
2. Examine the structure of the response
3. Check that retrieval contents are included in the response

This step is crucial for debugging any connection or configuration issues before scaling up to the full evaluation.

In [None]:
agent_id = "agent_id"  # Replace with the ID of your Contextual AI agent
prompt = eval_df['prompt'][0]

In [None]:
query_result = client.agents.query.create(
    agent_id=agent_id,
    messages=[{
        "content": prompt,  # The question taken from the evaluation dataset
        "role": "user"
    }],
    include_retrieval_content_text=True  # Include retrieved passage text in the response
)
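
To examine the structure of the response, we can summarize the fields that the rest of this notebook relies on. The sketch below is an illustration only: `summarize_response` is a hypothetical helper, the attribute names (`message.content`, `retrieval_contents`, `content_text`) are the ones read later in this notebook, and a stand-in object is used so the code runs without calling the API.

```python
from types import SimpleNamespace


def summarize_response(query_result):
    """Collect the fields the rest of this notebook reads from a query result."""
    message = getattr(query_result, "message", None)
    contents = getattr(query_result, "retrieval_contents", None) or []
    return {
        "answer": getattr(message, "content", "") or "",
        "num_retrievals": len(contents),
        "num_with_text": sum(1 for c in contents if getattr(c, "content_text", None)),
    }


# Stand-in object with the same attribute names, used so the sketch runs offline.
fake_result = SimpleNamespace(
    message=SimpleNamespace(content="Example answer"),
    retrieval_contents=[
        SimpleNamespace(content_text="Example passage"),
        SimpleNamespace(content_text=None),
    ],
)
print(summarize_response(fake_result))
```

With a live agent, calling `summarize_response(query_result)` on the result above should report a non-empty answer and at least one retrieval with text if everything is configured correctly.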

## 5: RAGAS Evaluation Functions

Now we'll define the function that creates evaluation samples from the Contextual AI agent's responses. This function will:
1. Take a question from our dataset
2. Query the Contextual AI agent
3. Extract the model's response and retrieved contexts
4. Format everything into a structured sample for RAGAS evaluation

The `create_api_multidoc_qa` function handles all these steps and ensures we're capturing all the necessary information for thorough evaluation. It's important to include both the generated answer and the contexts that were used to generate it, as RAGAS evaluates both aspects of the RAG system.

In [None]:
def create_api_multidoc_qa(client, agent_id, eval_df, sample_index):
    """
    Create a multi-document QA dataset sample using an API to retrieve content
    instead of a local dataset.

    Args:
        client: The client object used to make API calls
        agent_id: ID of the agent to query
        eval_df: DataFrame containing prompts and references
        sample_index: Index of the sample to use from eval_df

    Returns:
        Dictionary with question, contexts, model answer, reference answer
    """
    # Get the sample from the evaluation dataset
    sample = eval_df.iloc[sample_index]
    question = sample["prompt"]
    reference_answer = sample["reference"]

    # Call the API to get retrieval results
    query_result = client.agents.query.create(
        agent_id=agent_id,
        messages=[{
            "content": question,
            "role": "user"
        }],
        include_retrieval_content_text=True
    )

    # Extract the model's response from the query result
    model_answer = ""
    if hasattr(query_result, 'message') and query_result.message:
        model_answer = query_result.message.content

    # Extract contexts from retrieval results
    contexts = []

    # Check if retrieval contents exist in the response
    if hasattr(query_result, 'retrieval_contents') and query_result.retrieval_contents:
        for content in query_result.retrieval_contents:
            # Extract the content text if available
            if hasattr(content, 'content_text') and content.content_text:
                contexts.append({
                    'text': content.content_text,
                    'source': getattr(content, 'doc_name', 'Unknown'),
                    'page': getattr(content, 'page', None),
                    'doc_id': getattr(content, 'doc_id', None)
                })
    print(question)  # Log the question so progress is visible during batch runs
    # Create our multidoc sample
    multidoc_sample = {
        "question": question,
        "contexts": contexts,
        "model_answer": model_answer,
        "reference_answer": reference_answer,
        "message_id": query_result.message_id
    }

    return multidoc_sample
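
Because each sample triggers a network call, a transient failure can interrupt a long evaluation run. One possible mitigation is to wrap the call in a small retry loop with exponential backoff. This is a sketch, not part of the Contextual AI client: `call_with_retries` is a hypothetical helper, and it retries on any exception for simplicity, whereas real code would probably retry only on errors known to be transient.

```python
import random
import time


def call_with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponential backoff and jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            sleep(delay)


# Hypothetical usage with the function defined above:
# sample = call_with_retries(lambda: create_api_multidoc_qa(client, agent_id, eval_df, 0))
```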

## 6: Create a Sample for Testing

Let's create a single evaluation sample to test our function before scaling up to the full dataset.

This allows us to examine the structure of our evaluation samples and make sure all the necessary components are being captured correctly:
- The question being asked
- The retrieved context passages
- The model's generated answer
- The reference answer for comparison

A well-formed sample is essential for accurate RAGAS evaluation. Check that contexts are being properly extracted and that the model's answer is complete.
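
The checks described above can also be automated. The following sketch assumes the dictionary layout returned by `create_api_multidoc_qa` (`question`, `contexts`, `model_answer`, `reference_answer`); `sample_problems` is a hypothetical helper, and the example dictionary is made up for illustration.

```python
def sample_problems(sample):
    """Return a list of human-readable problems found in one evaluation sample."""
    problems = []
    for key in ("question", "model_answer", "reference_answer"):
        value = sample.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key} is missing or empty")
    contexts = sample.get("contexts") or []
    if not contexts:
        problems.append("no contexts were retrieved")
    elif any(not ctx.get("text") for ctx in contexts):
        problems.append("at least one context has no text")
    return problems


# Made-up sample with an empty answer and no contexts.
example = {
    "question": "What was Apple's total net sales for 2022?",
    "contexts": [],
    "model_answer": "",
    "reference_answer": "placeholder reference",
}
print(sample_problems(example))
```

An empty list means the sample has everything the RAGAS metrics in this notebook need.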

In [None]:
i = 0
sample = create_api_multidoc_qa(client, agent_id, eval_df, i)
sample

What was Apple's total net sales for 2022?


{'question': "What was Apple's total net sales for 2022?",
 'contexts': [{'text': 'Section: FORM 10-Q\nSub-Section: Products and Services Performance\n\n\nThe following table shows net sales by category for the three months ended December 31, 2022 and December\xa025, 2021 (dollars in millions):\n|  | Three Months Ended |  |  |\n|----------------------------------------|----------------------|-------------------|--------|\n|  | December 31, 2022 | December 25, 2021 | Change |\n| Net sales by category: |  |  |  |\n| (1) iPhone | $ 65,775 | $ 71,628 | (8)% |\n| (1) Mac | 7,735 | 10,852 | (29)% |\n| (1) iPad | 9,396 | 7,248 | 30\xa0% |\n| (1)(2) Wearables, Home and Accessories | 13,482 | 14,701 | (8)% |\n| (3) Services | 20,766 | 19,516 | 6\xa0% |\n| Total net sales | $ 117,154 | $ 123,945 | (5)% | (1) Products net sales include amortization of the deferred value of unspecified software upgrade rights, which are bundled in the sales price of the respective product. (2) Wearables, Home and 

## 7: RAGAS Metrics Evaluation

Now we'll define the functions that will apply RAGAS metrics to our evaluation samples. We'll use three key metrics:

1. **Faithfulness**: Measures if the generated answer contains only facts that are present in or can be inferred from the retrieved contexts.

2. **Context Recall**: Evaluates if the retrieved contexts contain all the information necessary to answer the question completely.

3. **Answer Accuracy**: Assesses how well the generated answer matches the reference answer in terms of factual content.

These functions use asyncio so that the three metric calls for each sample run concurrently, which matters when evaluating many samples with LLM-based metrics.

In [None]:
import asyncio
import json
from ragas.metrics import Faithfulness, ContextRecall, AnswerAccuracy
from ragas import SingleTurnSample

faithfulness_metric = Faithfulness(llm=evaluator_llm)
context_recall_metric = ContextRecall(llm=evaluator_llm)
answer_accuracy_metric = AnswerAccuracy(llm=evaluator_llm)

async def evaluate_single_turn_sample(sample_dict):
    """
    Evaluate a single QA sample using RAGAS metrics.

    Args:
        sample_dict: Dictionary containing question, model_answer, reference_answer, and contexts

    Returns:
        Dictionary with RAGAS metric scores
    """
    # Create a RAGAS SingleTurnSample
    sample = SingleTurnSample(
        user_input=sample_dict["question"],
        response=sample_dict["model_answer"],
        reference=sample_dict["reference_answer"],
        retrieved_contexts=[ctx["text"] for ctx in sample_dict["contexts"]]
    )

    # Calculate all metrics concurrently
    results = await asyncio.gather(
        faithfulness_metric.single_turn_ascore(sample),
        context_recall_metric.single_turn_ascore(sample),
        answer_accuracy_metric.single_turn_ascore(sample)
    )

    # Return results in a dictionary
    return {
        "faithfulness": results[0],
        "context_recall": results[1],
        "answer_accuracy": results[2]
    }

async def evaluate_single_sample(client, agent_id, eval_df, sample_index):
    """
    Evaluate a single sample from the evaluation dataset.

    Args:
        client: API client
        agent_id: ID of the agent to query
        eval_df: DataFrame containing prompts and references
        sample_index: Index of the sample to evaluate

    Returns:
        Dictionary with complete evaluation results
    """
    # Get RAG results for this sample
    multidoc_sample = create_api_multidoc_qa(client, agent_id, eval_df, sample_index)

    # Get evaluation metrics
    evaluation_results = await evaluate_single_turn_sample(multidoc_sample)

    # Create a complete result dictionary with all data
    complete_results = {
        # Metadata
        "sample_index": sample_index,

        # Original inputs and outputs - ensure proper quoting/serialization
        "question": json.dumps(multidoc_sample["question"]),
        "model_answer": json.dumps(multidoc_sample["model_answer"]),
        "message_id": json.dumps(multidoc_sample["message_id"]),
        "reference_answer": json.dumps(multidoc_sample["reference_answer"]),
        "contexts_json": json.dumps(multidoc_sample["contexts"]),  # Serialize contexts as JSON
        "num_contexts": len(multidoc_sample["contexts"]),
    }

    # Add metrics directly
    complete_results.update(evaluation_results)

    # Add any original DataFrame columns that might be useful
    try:
        row = eval_df.iloc[sample_index]
        for col in eval_df.columns:
            if col not in ["prompt", "reference"]:  # Skip these as we already have them
                col_value = row[col]
                # Handle different types of data appropriately
                if isinstance(col_value, (str, dict, list)):
                    complete_results[f"original_{col}"] = json.dumps(col_value)
                else:
                    complete_results[f"original_{col}"] = col_value
    except Exception as e:
        print(f"Warning: Could not add original columns for sample {sample_index}: {e}")

    return complete_results
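
In `evaluate_single_turn_sample`, a single failing metric call makes `asyncio.gather` raise and discards the other scores. If partial results are preferable, one option is `return_exceptions=True`, recording a failed metric as NaN. The sketch below uses stand-in coroutines instead of real RAGAS metrics; `gather_scores` is a hypothetical helper.

```python
import asyncio
import math


async def gather_scores(named_coros):
    """Await metric coroutines together; a failed metric becomes NaN instead of raising."""
    names = list(named_coros)
    outcomes = await asyncio.gather(*named_coros.values(), return_exceptions=True)
    scores = {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            print(f"Warning: {name} failed: {outcome!r}")
            scores[name] = math.nan
        else:
            scores[name] = outcome
    return scores


# Stand-in coroutines so the sketch runs without an evaluator LLM.
async def ok_metric():
    return 1.0


async def failing_metric():
    raise RuntimeError("simulated evaluator error")


# In a notebook cell, use `await gather_scores(...)` instead of asyncio.run(...).
demo = asyncio.run(gather_scores({"faithfulness": ok_metric(), "context_recall": failing_metric()}))
print(demo)
```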



Let's test that a single sample is evaluated correctly:

In [None]:
results = await evaluate_single_sample(client, agent_id, eval_df, 1)
results

What was Apple's Services revenue in Q1 2023 and what was the year-over-year growth rate?


{'sample_index': 1,
 'question': '"What was Apple\'s Services revenue in Q1 2023 and what was the year-over-year growth rate?"',
 'model_answer': '"Based on Apple\'s Q1 2023 financial data:\\n\\nApple\'s Services revenue for Q1 2023 was $20.766 billion, representing a 6% increase compared to Q1 2022 when Services revenue was $19.516 billion.[2]()\\n\\nThe growth in Services revenue was driven primarily by higher net sales from cloud services, the App Store, and music.[1]()"',
 'message_id': '"a8986b87-9eab-45b5-a53a-b57a5929cfb1"',
 'reference_answer': '"Apple\'s Services revenue was $20,766 million in Q1 2023, up from $19,516 million in Q1 2022, representing a 6% increase."',
 'contexts_json': '[{"text": "Section: FORM 10-Q\\nSub-Section: Services\\n\\n\\nServices net sales increased during the first quarter of 2023 compared to the same quarter in 2022 due primarily to higher net sales from cloud services, the App \\u00ae Store  and music.", "source": "Apple", "page": 20, "doc_id": "d

## 8: Batch Evaluation Functions

To evaluate multiple samples efficiently, we'll define functions that can process the entire dataset. These functions:

1. Build on our single sample evaluation logic
2. Support asynchronous processing for better performance
3. Include progress tracking for visibility during long-running evaluations
4. Collect and organize the results for analysis

The `evaluate_dataset` function iterates over the samples with optional progress tracking, while `run_evaluation` provides a simple interface for triggering the evaluation process.

In [None]:
async def evaluate_dataset(client, agent_id, eval_df, num_samples=None, show_progress=True):
    """
    Evaluate the RAG system on the entire dataset or a subset.

    Args:
        client: API client
        agent_id: ID of the agent to query
        eval_df: DataFrame containing prompts and references
        num_samples: Number of samples to evaluate (None for all)
        show_progress: Whether to show a progress bar

    Returns:
        DataFrame with evaluation results for all samples
    """
    # Determine how many samples to evaluate
    if num_samples is None or num_samples > len(eval_df):
        num_samples = len(eval_df)

    # Create a list of sample indices to evaluate
    sample_indices = list(range(num_samples))

    # Evaluate all samples with progress bar if requested
    if show_progress:
        results = []
        for idx in tqdm.tqdm(sample_indices, desc="Evaluating samples"):
            result = await evaluate_single_sample(client, agent_id, eval_df, idx)
            results.append(result)
    else:
        # Use asyncio.gather for potentially faster evaluation without progress tracking
        tasks = [evaluate_single_sample(client, agent_id, eval_df, idx) for idx in sample_indices]
        results = await asyncio.gather(*tasks)

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)

    return results_df

def run_evaluation(client, agent_id, eval_df, num_samples=None):
    """
    Main function to run the evaluation.

    Args:
        client: API client
        agent_id: ID of the agent to query
        eval_df: DataFrame containing prompts and references
        num_samples: Number of samples to evaluate (None for all)

    Returns:
        DataFrame with evaluation results
    """
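    # Run the async evaluation function to completion.
    # Note: where an event loop is already running (such as in Jupyter),
    # run_until_complete only works if nested event loops are enabled;
    # otherwise call `await evaluate_dataset(...)` directly instead.
    loop = asyncio.get_event_loop()
    results_df = loop.run_until_complete(evaluate_dataset(client, agent_id, eval_df, num_samples))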

    return results_df
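
`evaluate_dataset` offers two extremes: one sample at a time (with a progress bar) or all samples at once via `asyncio.gather`. A middle ground is to cap how many samples are in flight, which can help with API rate limits. The following is a sketch of that idea using `asyncio.Semaphore`; `gather_bounded` and `fake_evaluate` are hypothetical names, and the stand-in coroutine replaces `evaluate_single_sample` so the code runs offline.

```python
import asyncio


async def gather_bounded(coro_factories, limit=4):
    """Run coroutine factories concurrently, at most `limit` at a time, keeping order."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(factory):
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(run_one(f) for f in coro_factories))


# Stand-in for evaluate_single_sample so the sketch runs offline.
async def fake_evaluate(idx):
    await asyncio.sleep(0.01)
    return {"sample_index": idx}


factories = [lambda i=i: fake_evaluate(i) for i in range(5)]
# In a notebook cell, use `await gather_bounded(...)` instead of asyncio.run(...).
demo_results = asyncio.run(gather_bounded(factories, limit=2))
print(demo_results)
```

Note that `create_api_multidoc_qa` calls the client synchronously, so with the functions as written in this notebook the agent queries themselves would still run one after another; only the metric scoring would overlap.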

## 9: Running the Full Evaluation

Now we'll run the evaluation on multiple samples from our dataset. This process:
1. Takes each question from the evaluation dataset
2. Queries the Contextual AI agent
3. Evaluates the response using our RAGAS metrics
4. Collects the results into a DataFrame

Depending on the number of samples and the complexity of the questions, this process might take some time to complete. The progress bar helps track the evaluation status.

In [None]:
results_df = run_evaluation(client, agent_id, eval_df, num_samples=2)


Evaluating samples:   0%|          | 0/2 [00:00<?, ?it/s]

What was Apple's total net sales for 2022?


Evaluating samples:  50%|█████     | 1/2 [00:11<00:11, 11.34s/it]

What was Apple's Services revenue in Q1 2023 and what was the year-over-year growth rate?


Evaluating samples: 100%|██████████| 2/2 [00:27<00:00, 13.55s/it]


## 10: Analyzing the Results

After the evaluation completes, we can examine the results to assess the performance of our RAG system:

- High faithfulness scores indicate that the generated answers are factually consistent with the retrieved contexts
- High context recall scores suggest that the retrieval system is finding the relevant information needed to answer the questions
- High answer accuracy scores show that the generated answers match the reference answers well

We can also look for patterns in the results to identify specific areas for improvement in our RAG system.

In [None]:
results_df

|   | sample_index | question | model_answer | message_id | reference_answer | contexts_json | num_contexts | faithfulness | context_recall | answer_accuracy |
|---|--------------|----------|--------------|------------|------------------|---------------|--------------|--------------|----------------|-----------------|
| 0 | 0 | "What was Apple's total net sales for 2022?" | "I apologize, but I cannot provide Apple's tot... | "2cb316e2-0cdb-43f4-b487-acd2798e5c7e" | "Apple's total net sales for the three months ... | [{"text": "Section: FORM 10-Q\nSub-Section: Pr... | 10 | 0.8 | 1.0 | 0.0 |
| 1 | 1 | "What was Apple's Services revenue in Q1 2023 ... | "Based on Apple's Q1 2023 financial data:\n\nA... | "2e0013d3-2928-4764-a222-b6d9bf0e2d84" | "Apple's Services revenue was $20,766 million ... | [{"text": "Section: FORM 10-Q\nSub-Section: Se... | 15 | 0.777778 | 1.0 | 1.0 |
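
To look for patterns, a useful first step is to average each metric and list the samples that score poorly on it. The sketch below works on plain dictionaries so it runs on its own; the scores are copied from the two rows shown above, while `summarize` and the 0.5 threshold are assumptions made for illustration. With the real DataFrame, `results_df.to_dict("records")` would produce rows in the same shape.

```python
from statistics import mean

METRICS = ("faithfulness", "context_recall", "answer_accuracy")

# Scores copied from the two rows shown above.
rows = [
    {"sample_index": 0, "faithfulness": 0.8, "context_recall": 1.0, "answer_accuracy": 0.0},
    {"sample_index": 1, "faithfulness": 0.777778, "context_recall": 1.0, "answer_accuracy": 1.0},
]


def summarize(rows, threshold=0.5):
    """Return per-metric means and, per metric, the sample indices scoring below threshold."""
    means = {m: mean(r[m] for r in rows) for m in METRICS}
    low = {m: [r["sample_index"] for r in rows if r[m] < threshold] for m in METRICS}
    return means, low


means, low = summarize(rows)
print(means)
print(low)
```

Samples flagged for a metric are good candidates for manual review of the question, the retrieved contexts, and the reference answer.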


## 11: Exporting the Results

To share and further analyze our evaluation results, we'll save them in formats that are easy to work with:

1. CSV format for general compatibility
2. Excel format for better visualization and filtering

The `save_for_excel` function handles the conversion of complex data types and provides formatted output suitable for analysis in spreadsheet applications.

In [None]:
results_df.to_csv(
    "rag_evaluation_results.csv",
    index=False,
    quoting=1,  # QUOTE_ALL
    encoding='utf-8-sig'  # This encoding includes a BOM that Excel recognizes
)

In [None]:
def save_for_excel(df, filename="rag_evaluation_results.xlsx", save_csv=True):
    """
    Save DataFrame in Excel-friendly formats.

    Args:
        df: pandas DataFrame to save
        filename: output filename (for Excel)
        save_csv: whether to also save a CSV version

    Returns:
        Path to the saved Excel file
    """
    import pandas as pd
    import json
    import os

    # Make a copy of the DataFrame to avoid modifying the original
    excel_df = df.copy()

    # Handle complex columns (lists, dicts) that don't display well in Excel
    for col in excel_df.columns:
        # Check if the column contains complex data like lists or dicts
        if excel_df[col].dtype == 'object':
            # Sample the first non-null value
            sample_val = excel_df[col].dropna().iloc[0] if not excel_df[col].isna().all() else None

            if isinstance(sample_val, (list, dict)):
                # For complex types, convert to formatted strings
                excel_df[col] = excel_df[col].apply(
                    lambda x: json.dumps(x, ensure_ascii=False) if not pd.isna(x) else x
                )

    # For very long text fields, truncate for better Excel viewing
    for col in ['model_answer', 'reference_answer']:
        if col in excel_df.columns:
            excel_df[f'{col}_truncated'] = excel_df[col].apply(
                lambda x: (str(x)[:1000] + '...') if isinstance(x, str) and len(x) > 1000 else x
            )

    # Extract file name without extension
    base_name = os.path.splitext(filename)[0]

    # Save to Excel (the best format for viewing in Excel)
    excel_path = f"{base_name}.xlsx"
    excel_df.to_excel(excel_path, index=False, engine='openpyxl')
    print(f"Excel file saved: {excel_path}")

    # Optionally save to CSV with proper escaping and encoding
    if save_csv:
        csv_path = f"{base_name}.csv"
        excel_df.to_csv(
            csv_path,
            index=False,
            quoting=1,  # QUOTE_ALL to handle text with commas properly
            escapechar='\\',
            encoding='utf-8-sig'  # This encoding includes a BOM that Excel recognizes
        )
        print(f"CSV file saved: {csv_path}")

    return excel_path

save_for_excel(results_df, "rag_evaluation_results.xlsx")

Excel file saved: rag_evaluation_results.xlsx
CSV file saved: rag_evaluation_results.csv


'rag_evaluation_results.xlsx'
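
Because `evaluate_single_sample` stores the text fields with `json.dumps`, the exported values carry surrounding quotes and escape sequences. When loading the results back for analysis, they can be decoded with `json.loads`. This is a minimal sketch on a plain dictionary; `decode_row` is a hypothetical helper, the column names are the ones used in this notebook, and the example row is made up.

```python
import json

JSON_COLUMNS = ("question", "model_answer", "message_id", "reference_answer", "contexts_json")


def decode_row(row):
    """Return a copy of a result row with the JSON-encoded fields decoded."""
    decoded = dict(row)
    for col in JSON_COLUMNS:
        if isinstance(decoded.get(col), str):
            decoded[col] = json.loads(decoded[col])
    return decoded


# Made-up row encoded the same way evaluate_single_sample encodes its fields.
encoded = {
    "sample_index": 1,
    "question": json.dumps("What was Apple's Services revenue in Q1 2023?"),
    "contexts_json": json.dumps([{"text": "example passage", "page": 20}]),
}
print(decode_row(encoded))
```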