# Document Retrieval Evaluation in Azure AI Foundry

## Summary
This notebook sample demonstrates how to evaluate an Azure AI Search index using Azure AI Evaluation. The evaluator used in this example, `DocumentRetrievalEvaluator`, requires two inputs to calculate the evaluation metrics: a list of ground truth labeled documents (sometimes referred to as "qrels") and a list of actual search results obtained from a search index.

This sample uses data prepared in advance to show how to get started quickly with the Azure AI Evaluation service. To better understand the data processing workflow, take a look at the [full sample](Document_Retrieval_Evaluation_Full_Sample.ipynb).

### Explanation of Document Retrieval Metrics 
The metrics that will be generated in the output of the evaluator include:

| Metric          | Category             | Description |
|-----------------|----------------------|-------------|
| Fidelity        | Search Fidelity      | How well the top n retrieved chunks reflect the content for a given query; the number of good documents returned out of the total number of known good documents in a dataset |
| NDCG            | Search NDCG          | How close the ranking is to an ideal order in which all relevant items are at the top of the list |
| XDCG            | Search XDCG          | How good the results are in the top-k documents, regardless of the scoring of other index documents |
| Max Relevance N | Search Max Relevance | Maximum relevance in the top-k chunks |
| Holes           | Search Label Sanity  | Number of documents with missing query relevance judgments (ground truth) |

Some metrics, particularly NDCG, XDCG, and Fidelity, are sensitive to holes. Ideally, the count of holes for a given evaluation should be zero; otherwise, results for these metrics may not be accurate. We recommend iteratively checking results against the current known ground truth and filling holes to improve the accuracy of the evaluation metrics. This process is not covered explicitly in the sample.
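
To make the metrics and the effect of holes more concrete, here is a minimal sketch that computes NDCG and a hole count for a single query using common textbook definitions (linear gain, log2 discount). It is not the implementation used by `DocumentRetrievalEvaluator`, whose exact formulas and cutoffs may differ. The function names and the sample document IDs are hypothetical.

```python
import math


def dcg(labels):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(label / math.log2(rank + 2) for rank, label in enumerate(labels))


def ndcg_at_k(ranked_ids, qrels, k):
    """NDCG@k for one query; unlabeled documents (holes) are scored as label 0."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def count_holes(ranked_ids, qrels, k):
    """Number of top-k results that have no ground truth label."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id not in qrels)


# Hypothetical ground truth labels and search results for one query.
qrels = {"doc_a": 2, "doc_b": 1, "doc_c": 0}
ranked = ["doc_b", "doc_x", "doc_a"]

print("NDCG@3:", ndcg_at_k(ranked, qrels, k=3))
print("Holes@3:", count_holes(ranked, qrels, k=3))
```

In this example `doc_x` is a hole: it is scored as irrelevant, which lowers NDCG even though the document might turn out to be relevant once it is labeled.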

## Setup

### Prerequisites
Before running this notebook, be sure you have fulfilled the following prerequisites:
* Create or get access to an [Azure Subscription](https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ready/azure-best-practices/initial-subscriptions), and assign yourself the Owner or Contributor role for creating resources in this subscription.
* Install the `az` CLI in the current environment, and run `az login` to gain access to your resources.
* Create an [Azure AI Foundry project](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/create-projects?tabs=ai-studio).

### Install Python requirements
Run the following command to install the Python requirements for this notebook.

In [None]:
!pip install -r requirements.txt
!pip freeze

### Import all modules
For convenience, all modules needed for the rest of the notebook are imported in a single cell.

In [None]:
# Standard library
import json
import logging
import os
import pathlib
import random
import string
import time

# Azure SDK
from azure.ai.evaluation import DocumentRetrievalEvaluator
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
    Dataset,
    Evaluation,
    EvaluatorConfiguration,
)
from azure.core.credentials import AzureKeyCredential
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Other open source packages
import pandas as pd
from dotenv import load_dotenv

### Load resource connection configuration
The following cell will load the necessary resource connection configuration for the sample. Copy the contents of `.env.sample` into a new file named `.env`, and fill in the values corresponding to your own service resources.

In [None]:
load_dotenv()
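
If a value is missing from `.env`, the failure only shows up later as a `KeyError` when the client is created. As an optional check, here is a minimal sketch that reports missing settings up front. The helper name `find_missing_settings` is hypothetical, and the list contains only the one variable this notebook reads; extend it if your `.env` defines more.

```python
import os

# The only variable read by this notebook; extend the list as needed.
REQUIRED_SETTINGS = ["AZURE_PROJECT_CONNECTION_STRING"]


def find_missing_settings(required, env):
    """Return the names in `required` that are absent or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


missing = find_missing_settings(REQUIRED_SETTINGS, os.environ)
if missing:
    print("Missing settings in .env:", ", ".join(missing))
else:
    print("All required settings are present.")
```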

### Create client objects for managing resources and set other helpful variables
We will also create all of the client objects and other variables needed for the rest of the notebook in the following cell.

In [None]:
# Create the Azure AI Project client
project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str=os.environ["AZURE_PROJECT_CONNECTION_STRING"]
)

# Set other helpful variables
data_directory = os.path.join(".")
data_ids = None

## Dataset Preparation

### Upload dataset to Azure AI Foundry
To run an evaluation in the cloud, we need to upload our evaluation data to the specified Azure AI Foundry project.

This sample includes a few small dataset examples that were prepared in advance using a subset of the TREC-COVID open-source dataset.  The sample files contain search results and ground truth labels for each query provided in the dataset, where search results were obtained using Azure AI Search.  Each file contains results using a different search configuration: text-only, vector-only, semantic, and semantic-vector hybrid search.
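
Before uploading, it can help to confirm that each row of a file carries the two columns the evaluator reads. The sketch below assumes one JSON object per line, with the column names `retrieval_ground_truth` and `retrieved_documents` taken from the data mapping used in the evaluation cell later in this notebook. The inner field names (`document_id`, `query_relevance_label`, `relevance_score`) and the helper name `check_jsonl_rows` are assumptions for illustration; check the sample files for the actual schema.

```python
import json

REQUIRED_COLUMNS = ("retrieval_ground_truth", "retrieved_documents")


def check_jsonl_rows(lines):
    """Return the 1-based row numbers that are missing a required column."""
    bad_rows = []
    for row_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if any(column not in record for column in REQUIRED_COLUMNS):
            bad_rows.append(row_number)
    return bad_rows


# Hypothetical record; the inner field names are assumptions.
example_record = {
    "retrieval_ground_truth": [
        {"document_id": "doc_a", "query_relevance_label": 2},
        {"document_id": "doc_b", "query_relevance_label": 0},
    ],
    "retrieved_documents": [
        {"document_id": "doc_b", "relevance_score": 0.91},
        {"document_id": "doc_a", "relevance_score": 0.47},
    ],
}
example_lines = [
    json.dumps(example_record),
    json.dumps({"retrieved_documents": []}),  # missing the ground truth column
]
print("Rows with missing columns:", check_jsonl_rows(example_lines))
```

To check one of the sample files, pass an open file handle instead of the example list, since iterating over a text file yields its lines.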

In the next cell, we'll upload each file to the Azure AI Foundry project, and later use them to run evaluation.

In [None]:
evaluation_configs = []
for config_name, sample_data_file_name in [
    ("Text Search", "evaluate-trec-covid-text.jsonl"),
    ("Semantic Search", "evaluate-trec-covid-semantic.jsonl"),
    ("Vector Search", "evaluate-trec-covid-vector.jsonl"),
    ("Hybrid Search", "evaluate-trec-covid-hybrid.jsonl")
]:
    print(f"Uploading file '{sample_data_file_name}'")
    data_id, _ = project_client.upload_file(os.path.join(data_directory, sample_data_file_name))

    print(f"File {sample_data_file_name} was uploaded successfully!")
    print(f"Data ID: {data_id}")

    evaluation_configs.append((config_name, data_id))

## Run document retrieval evaluation
After our datasets are uploaded, we will configure and run the document retrieval evaluator for each uploaded dataset. The initialization parameters `ground_truth_label_min` and `ground_truth_label_max` configure the qrels scaling for metrics that depend on a count of labels, such as Fidelity. In this case, the TREC-COVID ground truth set has 0, 1, and 2 as possible labels, so we set the values of those parameters accordingly.
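
If you bring your own dataset, the label range can be read from the data rather than hard-coded. Here is a minimal sketch, assuming each ground truth entry stores its label under a `query_relevance_label` key (an assumption about the file schema). The helper name `label_range` and the sample records are hypothetical.

```python
def label_range(records, label_key="query_relevance_label"):
    """Return (min, max) of all ground truth labels across the given records."""
    labels = [
        entry[label_key]
        for record in records
        for entry in record["retrieval_ground_truth"]
    ]
    if not labels:
        raise ValueError("No ground truth labels found")
    return min(labels), max(labels)


# Hypothetical records, one per query.
records = [
    {"retrieval_ground_truth": [{"document_id": "doc_a", "query_relevance_label": 2}]},
    {
        "retrieval_ground_truth": [
            {"document_id": "doc_b", "query_relevance_label": 0},
            {"document_id": "doc_c", "query_relevance_label": 1},
        ]
    },
]
label_min, label_max = label_range(records)
print("Observed label range:", label_min, label_max)
```

The observed range can be narrower than the range defined by the labeling scheme (for example, if no document happens to receive the top label), so prefer the documented bounds of the scheme when they are known.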

In [None]:
def run_evaluation(evaluation_name, evaluation_description, dataset_id):
    # Define the evaluation: dataset, evaluator, column mapping, and label range
    evaluation = Evaluation(
        display_name=evaluation_name,
        description=evaluation_description,
        data=Dataset(id=dataset_id),
        evaluators={
            "documentretrievalevaluator": EvaluatorConfiguration(
                id=DocumentRetrievalEvaluator().id,
                data_mapping={
                    "retrieval_ground_truth": "${data.retrieval_ground_truth}",
                    "retrieved_documents": "${data.retrieved_documents}"
                },
                init_params={
                    "ground_truth_label_min": 0,
                    "ground_truth_label_max": 2
                }
            )
        },
    )

    # Submit the evaluation to the Azure AI Foundry project
    evaluation_response = project_client.evaluations.create(
        evaluation=evaluation,
    )

    # Fetch the evaluation to read its current status and portal link
    get_evaluation_response = project_client.evaluations.get(evaluation_response.id)

    print("----------------------------------------------------------------")
    print("Created evaluation, evaluation ID: ", get_evaluation_response.id)
    print("Evaluation status: ", get_evaluation_response.status)
    print("AI project URI: ", get_evaluation_response.properties["AiStudioEvaluationUri"])
    print("----------------------------------------------------------------")

In [None]:
for config_name, data_id in evaluation_configs:
    run_evaluation(
        f"TREC-COVID evaluation - {config_name}",
        "Document retrieval evaluation using the TREC-COVID dataset from BeIR",
        data_id,
    )
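
The status printed right after creation may not be final. If you want to wait for completion from the notebook rather than checking the portal, a generic polling helper is one option. The sketch below is an assumption-laden illustration: the helper name `wait_for_status` is hypothetical, and the terminal status strings (`"Completed"`, `"Failed"`, `"Canceled"`) are assumed values, so check the statuses your evaluations actually report. A stand-in status source is used so the example runs without Azure access.

```python
import time


def wait_for_status(get_status, terminal_statuses, poll_seconds=10.0, timeout_seconds=1800.0):
    """Call get_status() until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in terminal_statuses:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still '{status}' after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Stand-in for a real status lookup so the sketch runs without Azure access.
fake_statuses = iter(["NotStarted", "Running", "Completed"])
final_status = wait_for_status(
    lambda: next(fake_statuses),
    terminal_statuses={"Completed", "Failed", "Canceled"},  # assumed values
    poll_seconds=0.01,
)
print("Final status:", final_status)
```

In this notebook, the status lookup could be `lambda: project_client.evaluations.get(evaluation_id).status`, where `evaluation_id` is the ID printed by `run_evaluation`.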

## Comparing results

Once the evaluations are complete, you can compare the results: open the "Evaluations" tab on the left side of the Azure AI Foundry project page, select the runs to compare, and then click the "Compare" button to see the metric results side by side.

![Azure AI Foundry project evaluations page](eval-results-select.png)