# Document Retrieval Evaluation in Azure AI Foundry

## Summary
This notebook sample demonstrates how to evaluate an Azure AI Search index using Azure AI Evaluation. The evaluator used in this example, `DocumentRetrievalEvaluator`, requires two inputs to calculate the evaluation metrics: a list of ground truth labeled documents (sometimes referred to as "qrels") and a list of actual search results obtained from a search index. This sample walks through data preparation, gathering search results for different search configurations, running the evaluation, and comparing the results of each run.

### Explanation of Document Retrieval Metrics 
The evaluator output includes the following metrics:

| Metric          | Category             | Description |
|-----------------|----------------------|-------------|
| Fidelity        | Search Fidelity      | How well the top n retrieved chunks reflect the content for a given query; the number of good documents returned out of the total number of known good documents in a dataset |
| NDCG            | Search NDCG          | How close the ranking is to an ideal order in which all relevant items are at the top of the list |
| XDCG            | Search XDCG          | How good the results are in the top-k documents, regardless of the scoring of other index documents |
| Max Relevance N | Search Max Relevance | Maximum relevance in the top-k chunks |
| Holes           | Search Label Sanity  | Number of documents with missing query relevance judgments (ground truth) |

It's important to note that some metrics, particularly NDCG, XDCG and Fidelity, are sensitive to holes.  Ideally the count of holes for a given evaluation should be zero, otherwise results for these metrics may not be accurate.  It is recommended to iteratively check results against current known ground truth to fill holes to improve accuracy of the evaluation metrics.  This process is not covered explicitly in the sample but is important to mention.
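
To make the effect of holes concrete, here is a minimal sketch of a textbook NDCG@k calculation that also counts holes. It is an illustration only: the helper names (`dcg`, `ndcg_and_holes`) and the sample document IDs are hypothetical, and the formula shown is the common linear-gain, log2-discount variant, which is not necessarily the exact formula `DocumentRetrievalEvaluator` uses.

```python
import math


def dcg(labels):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(label / math.log2(rank + 2) for rank, label in enumerate(labels))


def ndcg_and_holes(retrieved_ids, qrels, k=3):
    """Return (NDCG@k, holes@k) for a single query.

    retrieved_ids: document IDs in ranked order.
    qrels: dict mapping document ID -> relevance label.
    """
    top_k = retrieved_ids[:k]
    # A hole is a retrieved document that has no ground truth label.
    holes = sum(1 for doc_id in top_k if doc_id not in qrels)
    # Assumption: unlabeled documents are scored as non-relevant (label 0).
    gains = [qrels.get(doc_id, 0) for doc_id in top_k]
    ideal_dcg = dcg(sorted(qrels.values(), reverse=True)[:k])
    score = dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
    return score, holes


# Hypothetical labels: "d9" is retrieved but has no judgment, so it is a hole.
sample_qrels = {"d1": 2, "d2": 1, "d3": 0}
score, holes = ndcg_and_holes(["d1", "d9", "d2"], sample_qrels)
print(f"NDCG@3 = {score:.3f}, holes = {holes}")
```

In this sketch the unlabeled document `d9` is scored as non-relevant. If it were in fact relevant, the computed NDCG would understate the ranking quality, which is why filling holes improves the accuracy of the metrics.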

## Setup

### Prerequisites
Before running this notebook, be sure you have fulfilled the following prerequisites:
* Create or get access to an [Azure Subscription](https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ready/azure-best-practices/initial-subscriptions), and assign yourself the Owner or Contributor role for creating resources in this subscription.
* `az` CLI is installed in the current environment, and you have run `az login` to gain access to your resources.
* Read and understand the documentation covering [assigning RBAC roles between resources](https://learn.microsoft.com/en-us/azure/role-based-access-control/role-assignments-cli) using the `az` CLI.
* Create an [Azure AI Search resource](https://learn.microsoft.com/en-us/azure/search/search-create-service-portal), and assign yourself the "Search Index Data Contributor" role for the resource.
* Create an [Azure AI Foundry project](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/create-projects?tabs=ai-studio).
* [Optional] Deploy a text embedding model in the Azure AI Foundry project for testing vector-based search scenarios in this example, and assign yourself the "Cognitive Services OpenAI User" role for the Azure AI Services resource created for your project.  For integrated vectorization support, you will also need to ensure that your Azure AI Search resource has the "Cognitive Services OpenAI User" role assigned for the Azure AI Services resource as well.

### Install Python requirements
Run the following command to install the Python requirements for this notebook.

In [None]:
!pip install -r requirements.txt
!pip freeze

### Import all modules
For convenience, all modules needed for the rest of the notebook can be imported all at once.

In [None]:
# Standard library
import json
import logging
import os
import random
import string
import time

# Third-party data handling
import pandas as pd

# Azure SDK
from azure.ai.evaluation import DocumentRetrievalEvaluator
from azure.ai.projects.models import (
    Dataset,
    Evaluation,
    EvaluatorConfiguration,
)
from azure.ai.projects import AIProjectClient
from azure.core.credentials import AzureKeyCredential
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery, VectorizableTextQuery
from azure.search.documents.indexes.models import (
    ComplexField,
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SemanticConfiguration,
    SemanticField,
    SemanticPrioritizedFields,
    SemanticSearch,
    VectorSearch,
    VectorSearchProfile,
    VectorSearchAlgorithmConfiguration,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    HnswAlgorithmConfiguration,
    HnswParameters
)

# Other open source packages
from dotenv import load_dotenv
from beir import util, LoggingHandler
from beir.datasets.data_loader import GenericDataLoader
from openai import AzureOpenAI
import tiktoken

### Load resource connection configuration
The following cell will load the necessary resource connection configuration for the sample. Copy the contents of `.env.sample` into a new file named `.env`, and fill in the values corresponding to your own service resources.
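
Before going further, it can help to confirm which settings were actually picked up. The sketch below is one way to do that once `load_dotenv()` has run. The variable names are the ones this notebook reads; the split into required and optional groups is an assumption based on how the notebook uses them, and `missing_variables` is a hypothetical helper.

```python
import os

# Assumption: the notebook cannot run without these two settings.
REQUIRED_VARIABLES = [
    "AZURE_AI_SEARCH_ENDPOINT",
    "AZURE_PROJECT_CONNECTION_STRING",
]

# Assumption: these may be left empty (key-less auth, no vector search).
OPTIONAL_VARIABLES = [
    "AZURE_AI_SEARCH_KEY",
    "AZURE_AI_SEARCH_VECTORIZER_ENDPOINT",
    "AZURE_AI_SEARCH_VECTORIZER_KEY",
    "AZURE_AI_SEARCH_VECTORIZER_MODEL_NAME",
    "AZURE_AI_SEARCH_VECTORIZER_DEPLOYMENT_NAME",
    "AZURE_AI_SEARCH_VECTORIZER_MODEL_DIMENSIONS",
]


def missing_variables(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not environ.get(name)]


print("Missing required settings:", missing_variables(REQUIRED_VARIABLES))
print("Unset optional settings:", missing_variables(OPTIONAL_VARIABLES))
```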

In [None]:
load_dotenv()

### Create client objects for managing resources and set other helpful variables
We will also create all of the client objects and other variables needed for the rest of the notebook in the following cell.

In [None]:
# Create the Azure AI Search service clients
search_service_endpoint = os.environ["AZURE_AI_SEARCH_ENDPOINT"]
# The key is optional; .get() returns None instead of raising KeyError when it is unset
search_service_key = os.environ.get("AZURE_AI_SEARCH_KEY")
index_name = "trec-covid-vector"

if not search_service_key:
    search_credential = DefaultAzureCredential()

else:
    search_credential = AzureKeyCredential(search_service_key)
    
search_index_client = SearchIndexClient(search_service_endpoint, search_credential)
search_client = SearchClient(search_service_endpoint, index_name, search_credential)

# Create the Azure AI Project client
project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str=os.environ["AZURE_PROJECT_CONNECTION_STRING"]
)

# Set other helpful variables
data_directory = os.path.join(".", "beir")
data_ids = None

# search parameters
search_top_k = 50
search_select = "doc_id"

# vector search parameters, if using vector search
vector_field_name = "title_text_ada002"
# These settings are optional, so read them with .get() to avoid a KeyError when they are unset
search_vectorizer_endpoint = os.environ.get("AZURE_AI_SEARCH_VECTORIZER_ENDPOINT")
search_vectorizer_key = os.environ.get("AZURE_AI_SEARCH_VECTORIZER_KEY")
search_vectorizer_model_name = os.environ.get("AZURE_AI_SEARCH_VECTORIZER_MODEL_NAME")
search_vectorizer_deployment_name = os.environ.get("AZURE_AI_SEARCH_VECTORIZER_DEPLOYMENT_NAME")
# Environment variables are strings, but the index schema needs an integer dimension count
_vectorizer_dimensions = os.environ.get("AZURE_AI_SEARCH_VECTORIZER_MODEL_DIMENSIONS")
search_vectorizer_model_dimensions = int(_vectorizer_dimensions) if _vectorizer_dimensions else None
use_integrated_vectorization = all([
    search_vectorizer_endpoint,
    search_vectorizer_key,
    search_vectorizer_model_name,
    search_vectorizer_deployment_name,
    search_vectorizer_model_dimensions,
])
doc_chunk_size = 1024 if use_integrated_vectorization else None
tokenizer = tiktoken.get_encoding("cl100k_base")

aoai_client = None
if use_integrated_vectorization:
    print("Creating AOAI client for vectorizing index data")
    ad_token_provider = get_bearer_token_provider(
        DefaultAzureCredential(),
        "https://cognitiveservices.azure.com/.default"
    )
    aoai_client = AzureOpenAI(
        azure_endpoint=search_vectorizer_endpoint,
        api_version="2024-06-01",
        azure_ad_token_provider=ad_token_provider
    )

### Create search configurations
In the next cell, we will set some additional configuration values for configuring document search using Azure AI Search.  We can select from these configurations later on when we generate search results for evaluation, and then compare the results of each run after the evaluations are finished to determine which configuration performs best for the index.

In [None]:
# Full-text search
text_search_configuration = {
    "name": "text",
    "query_language": "en",
    "select": search_select,
    "top": search_top_k,
    "api_version": "2024-11-01-preview",
    "score_field": "@search.score"
}

# Semantic search
semantic_search_configuration = {
    "name": "semantic",
    "query_type": "semantic",
    "select": search_select,
    "top": search_top_k,
    "semantic_configuration_name": "en-semantic-config",
    "query_language": "en",
    "api_version": "2024-11-01-preview",
    "score_field": "@search.reranker_score"
}

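# Vector search -- requires setting values for the environment variables AZURE_AI_SEARCH_VECTORIZER_ENDPOINT,
# AZURE_AI_SEARCH_VECTORIZER_KEY, AZURE_AI_SEARCH_VECTORIZER_MODEL_NAME, AZURE_AI_SEARCH_VECTORIZER_DEPLOYMENT_NAME,
# and AZURE_AI_SEARCH_VECTORIZER_MODEL_DIMENSIONS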
vector_search_configuration = {
    "name": "vector",
    "select": search_select,
    "top": search_top_k,
    "vector_queries": [
        {
            "kind": "text",
            "fields": vector_field_name,
            "k_nearest_neighbors": search_top_k
        }
    ],
    "score_field": "@search.score"
}

# Semantic vector hybrid search
semantic_vector_hybrid_search_configuration = {
    "name": "hybrid",
    "query_type": "semantic",
    "select": search_select,
    "top": search_top_k,
    "semantic_configuration_name": "en-semantic-config",
    "vector_queries": [
        {
            "kind": "text",
            "fields": vector_field_name,
            "k_nearest_neighbors": search_top_k
        }
    ],
    "score_field": "@search.score"
}

## Dataset Preparation

### Download the TREC-COVID (BeIR) dataset
In the next cell, we will download an open source dataset to perform evaluation on. We will use the TREC-COVID dataset from BeIR, which contains a corpus we can index into Azure AI Search, as well as a set of queries to run through our search service and a set of ground truth qrels for evaluation.


In [None]:
# This function was adapted from the original sample here:
# https://github.com/beir-cellar/beir?tab=readme-ov-file#beers-quick-example

def download_beir(dataset_name: str, output_directory: str):
    #### Just some code to print debug information to stdout
    logging.basicConfig(format='%(asctime)s - %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S',
                        level=logging.INFO,
                        handlers=[LoggingHandler()])

    #### Download the <dataset_name>.zip archive and unzip it
    url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset_name)
    out_dir = os.path.join(output_directory, "datasets")
    data_path = util.download_and_unzip(url, out_dir)

    #### Load the test split from data_path to confirm the dataset was downloaded and unzipped correctly
    corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

In [None]:
download_beir("trec-covid", data_directory)

### Create an Azure AI Search index from the dataset corpus
Next, we will create an Azure AI Search index using the BeIR TREC-COVID dataset downloaded in the previous cell.  If integrated vectorization is enabled in the configuration settings, we will also add a vector field for our index.

In [None]:
def create_or_update_index(search_index_client: SearchIndexClient):
    fields = [
        SearchField(
            name="chunk_id",
            type=SearchFieldDataType.String,
            key=True,
            searchable=True,
            hidden=False,
            stored=True,
            filterable=True,
            sortable=True,
            facetable=False,
            analyzer_name="standard.lucene"
        ),
        SearchField(
            name="doc_id",
            type=SearchFieldDataType.String,
            key=False,
            searchable=True,
            hidden=False,
            stored=True,
            filterable=True,
            sortable=True,
            facetable=False,
            analyzer_name="standard.lucene"
        ),
        SearchField(
            name="title",
            type=SearchFieldDataType.String,
            searchable=True,
            hidden=False,
            stored=True,
            filterable=True,
            sortable=False,
            facetable=False,
            analyzer_name="standard.lucene"
        ),
        SearchField(
            name="text",
            type=SearchFieldDataType.String,
            searchable=True,
            hidden=False,
            stored=True,
            filterable=True,
            sortable=False,
            facetable=False,
            analyzer_name="standard.lucene"
        ),
        ComplexField(
            name="metadata",
            fields=[
                SearchField(name="url", type=SearchFieldDataType.String, searchable=True, hidden=False, stored=True, filterable=True, sortable=False, facetable=False, analyzer_name="standard.lucene"),
                SearchField(name="pubmed_id", type=SearchFieldDataType.String, searchable=True, hidden=False, stored=True, filterable=True, sortable=False, facetable=False, analyzer_name="standard.lucene")
            ]
        )
    ]
    semantic_prioritized_fields = SemanticPrioritizedFields(
        title_field=SemanticField(field_name="title"),
        content_fields=[SemanticField(field_name="text")],
        keywords_fields=[SemanticField(field_name="metadata/url")]
    )
    semantic_configuration = SemanticConfiguration(name="en-semantic-config", prioritized_fields=semantic_prioritized_fields)
    semantic_search = SemanticSearch(default_configuration_name="en-semantic-config", configurations=[semantic_configuration])

    vector_search = None
    if use_integrated_vectorization:
        algorithm = HnswAlgorithmConfiguration(
            name="hnsw-1",
            parameters=HnswParameters(
                m=4, ef_construction=400, ef_search=500, metric="cosine"
            )
        )
        vectorizer = AzureOpenAIVectorizer(
            vectorizer_name="aoai-vectorizer-1",
            parameters=AzureOpenAIVectorizerParameters(
                resource_url=search_vectorizer_endpoint,
                deployment_name=search_vectorizer_deployment_name,
                api_key=search_vectorizer_key,
                model_name=search_vectorizer_model_name
            )
        )
        profile = VectorSearchProfile(
            name="vector-profile-1",
            algorithm_configuration_name=algorithm.name,
            vectorizer_name=vectorizer.vectorizer_name
        )
        vector_search = VectorSearch(
            profiles=[profile],
            algorithms=[algorithm],
            vectorizers=[vectorizer]
        )
        fields.append(
            SearchField(
                name=vector_field_name,
                type="Collection(Edm.Single)",
                searchable=True,
                hidden=True,
                stored=False,
                filterable=False,
                sortable=False,
                facetable=False,
                vector_search_dimensions=search_vectorizer_model_dimensions,
                vector_search_profile_name=profile.name
            )
        )
    
    index = SearchIndex(name=index_name, fields=fields, semantic_search=semantic_search, vector_search=vector_search)

    result = search_index_client.create_or_update_index(index)

In [None]:
create_or_update_index(search_index_client)

### Index the documents from the dataset corpus
Once we have the data downloaded and the index created, we will ingest the documents from the local file into the index.  If integrated vectorization is configured, we will also create embeddings for the input data to include in the ingestion payload.

In [None]:
def split_text(text, max_tokens):
    tokens = tokenizer.encode(text)
    chunks = []
    startIndex = 0
    while startIndex < len(tokens):
        endIndex = startIndex + max_tokens
        chunks.append(tokens[startIndex:endIndex])
        startIndex = endIndex
    return [tokenizer.decode(chunk) for chunk in chunks]

def get_embedding(
    aoai_client: AzureOpenAI,
    embedding_model: str,
    doc_id: str,
    title: str,
    content: str
):
    time.sleep(0.01)

    text_to_embed = [" ".join([title, content])]
    chunk_ids = [doc_id]
    
    if doc_chunk_size:
        text_to_embed = split_text(text_to_embed[0], doc_chunk_size)
        # create random chunk IDs for each chunk
        chars = string.ascii_lowercase
        chunk_id_length = 8
        chunk_ids = [''.join(random.choice(chars) for i in range(chunk_id_length)) for _ in text_to_embed]

    text_embeddings = [x.embedding for x in aoai_client.embeddings.create(
            input=text_to_embed,
            model=embedding_model
        ).data
    ]
    
    return chunk_ids, text_to_embed, text_embeddings

def index_dataset(search_client: SearchClient):
    doc_chunks_count = 0
    for df in pd.read_json(os.path.join(data_directory, "datasets", "trec-covid", "corpus.jsonl"), lines=True, orient="records", chunksize=100):
        _df = df.copy()
        _df.rename(columns={"_id":"doc_id"}, inplace=True)

        # If integrated vectorization is enabled, we'll chunk the data and vectorize it before uploading.
        if use_integrated_vectorization:
            start = time.time()
            _df[["chunk_id", "chunk", vector_field_name]] = _df.apply(
                    lambda q: get_embedding(
                        aoai_client,
                        search_vectorizer_model_name,
                        q["doc_id"],
                        q["title"],
                        q["text"]
                    ),
                    axis=1,
                    result_type="expand"
                )
            end = time.time()
            doc_chunks_count += 1
            print(f"Successfully chunked and vectorized {doc_chunks_count * len(_df)} documents (elapsed time: {end - start})")
            _df = _df.explode(["chunk_id", "chunk", vector_field_name])
            _df.drop("text", axis=1, inplace=True)
            _df.rename(columns={"chunk":"text"}, inplace=True)
        else:
            _df["chunk_id"] = _df["doc_id"]

        documents = _df.to_dict(orient="records")
        search_client.upload_documents(documents=documents)
        del _df

In [None]:
index_dataset(search_client)

### Get search results and merge with qrels
The `DocumentRetrievalEvaluator` from the Azure AI Evaluation SDK requires both search results and ground truth labels in JSON-lines format. In the next section, we will choose a search configuration to evaluate, generate search results for that configuration, and join the search results with their corresponding qrels from the TREC-COVID dataset to form our JSON-lines input for the evaluator.
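
As a rough illustration of the target format, the sketch below builds one JSON-lines record with the same two fields and per-document keys that the cells in this section produce. The document IDs, scores, and labels are made up, and `build_record` is a hypothetical helper, not part of the SDK.

```python
import json


def build_record(retrieved_documents, retrieval_ground_truth):
    """Serialize one query's search results and qrels as a single JSON line.

    Each list is stored as a JSON string inside the record, mirroring how
    the notebook applies json.dumps to both fields before saving.
    """
    record = {
        "retrieved_documents": json.dumps(retrieved_documents),
        "retrieval_ground_truth": json.dumps(retrieval_ground_truth),
    }
    return json.dumps(record)


# Made-up example data for one query.
example_line = build_record(
    retrieved_documents=[
        {"document_id": "doc-a", "relevance_score": 12.3},
        {"document_id": "doc-b", "relevance_score": 9.8},
    ],
    retrieval_ground_truth=[
        {"document_id": "doc-a", "query_relevance_label": 2},
        {"document_id": "doc-c", "query_relevance_label": 0},
    ],
)
print(example_line)
```

Here `doc-b` appears in the search results without a ground truth label, so it would count as a hole for this query.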

In [None]:
import json
import pandas as pd

def search(query: str, search_client: SearchClient, score_field: str = "@search.score", **search_configuration):
    search_text = query
    vector_queries = None
    if "vector_queries" in search_configuration.keys():
        search_text = None
        vector_queries = [VectorizableTextQuery(text=query, **vector_query_config) for vector_query_config in search_configuration["vector_queries"]]
        search_configuration.pop("vector_queries")
        
    results = search_client.search(search_text=search_text, vector_queries=vector_queries, **search_configuration)
    return [{"document_id": result["doc_id"], "relevance_score": result.get(score_field, None)} for result in results]

def prepare_dataset(search_configuration):
    # Load the queryset and qrels
    queryset = pd.read_json(os.path.join(data_directory, "datasets", "trec-covid", "queries.jsonl"), lines=True, orient="records")
    qrels = pd.read_csv(os.path.join(data_directory, "datasets", "trec-covid", "qrels", "test.tsv"), delimiter="\t")
    
    # Drop negative qrels values and duplicates, and rename columns
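    # Copy the filtered frame so the in-place operations below do not act on a view of the original
    qrels = qrels.loc[qrels["score"] >= 0].copy()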
    qrels.drop_duplicates(subset=["query-id", "corpus-id"], inplace=True)
    qrels.rename(columns={"corpus-id": "document_id", "score": "query_relevance_label"}, inplace=True)
    
    # Group qrels by query ID and generate groundtruth set per query
    qrels_grouped = qrels.groupby("query-id")
    qrels_aggregated = qrels_grouped[["document_id", "query_relevance_label"]].agg(lambda x: list(x))
    qrels_aggregated["retrieval_ground_truth"] = qrels_aggregated.apply(lambda x: json.dumps([{"document_id": doc_id, "query_relevance_label": label} for (doc_id, label) in zip(x["document_id"], x["query_relevance_label"])]), axis=1)
    
    # Join the queryset and the aggregated qrels on query ID
    merged = queryset.merge(qrels_aggregated, left_on="_id", right_on="query-id")
    
    # Generate search results for each query
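    # Work on a shallow copy so the caller's configuration dictionary is not modified by pop()
    search_configuration = dict(search_configuration)
    search_configuration_name = search_configuration.pop("name")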
    score_field = search_configuration.pop("score_field")
    merged["retrieved_documents"] = merged.apply(
        lambda x: json.dumps(search(
            query=x["text"],
            search_client=search_client,
            score_field=score_field,
            **search_configuration
        )), axis=1)
    
    merged_final = merged[["retrieved_documents", "retrieval_ground_truth"]]
    # Save final dataset to a local file in JSON-lines format
    jsonl_path = os.path.join(".", f"evaluate-beir-{search_configuration_name}.jsonl")
    merged_final.to_json(jsonl_path, lines=True, orient="records")
    
    return (merged_final, jsonl_path)

### Upload dataset to Azure AI Foundry
To run an evaluation in the cloud, we need to upload our evaluation data to the specified Azure AI Foundry project.

We will run the data preparation for each search configuration specified earlier in the notebook, so we can compare evaluation runs to determine which configuration performs best.

In [None]:
data_ids = []
for config in [
    text_search_configuration,
    semantic_search_configuration,
    vector_search_configuration,
    semantic_vector_hybrid_search_configuration
]:
    config_name = config['name']
    print(f"Preparing data for config '{config_name}'")
    dataset, jsonl_path = prepare_dataset(config)
    data_id, _ = project_client.upload_file(jsonl_path)
    data_ids.append((config_name, data_id))

    print(f"File {jsonl_path} was uploaded successfully!")
    print(f"Data ID: {data_id}")
    print("Data content summary:")
    print(dataset.head())

## Run document retrieval evaluation
After our datasets are uploaded, we will configure and run the document retrieval evaluator for each uploaded dataset. The init params `ground_truth_label_min` and `ground_truth_label_max` configure the qrels scaling for metrics that depend on a count of labels, such as Fidelity. In this case, the TREC-COVID ground truth set has 0, 1, and 2 as possible labels, so we set the values of those init params accordingly.
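
For a different dataset, one way to choose these two values is to scan the labels in the prepared ground truth. The sketch below assumes each query's ground truth is a list of dictionaries with a `query_relevance_label` key, as in the records prepared earlier; `label_range` is a hypothetical helper and the sample labels are made up.

```python
def label_range(ground_truth_per_query):
    """Return (min_label, max_label) across the ground truth of all queries."""
    labels = [
        document["query_relevance_label"]
        for ground_truth in ground_truth_per_query
        for document in ground_truth
    ]
    if not labels:
        raise ValueError("No ground truth labels found")
    return min(labels), max(labels)


# Made-up ground truth for two queries.
sample_ground_truth = [
    [
        {"document_id": "doc-a", "query_relevance_label": 2},
        {"document_id": "doc-b", "query_relevance_label": 0},
    ],
    [{"document_id": "doc-c", "query_relevance_label": 1}],
]
label_min, label_max = label_range(sample_ground_truth)
print(f"ground_truth_label_min={label_min}, ground_truth_label_max={label_max}")
```

The observed range is only a lower bound on the labeling scheme: if a sample happens to contain no documents with the highest label, the values should still come from the dataset's documented label set.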

In [None]:
def run_evaluation(evaluation_name, evaluation_description, dataset_id):
    # Create an evaluation
    evaluation = Evaluation(
        display_name=evaluation_name,
        description=evaluation_description,
        data=Dataset(id=dataset_id),
        evaluators={
            "documentretrievalevaluator": EvaluatorConfiguration(
                id=DocumentRetrievalEvaluator().id,
                data_mapping={
                    "retrieval_ground_truth": "${data.retrieval_ground_truth}",
                    "retrieved_documents": "${data.retrieved_documents}"
                },
                init_params={
                    "ground_truth_label_min": 0,
                    "ground_truth_label_max": 2
                }
            )
        },
    )

    # Submit the evaluation to the project
    evaluation_response = project_client.evaluations.create(
        evaluation=evaluation,
    )

    # Get evaluation
    get_evaluation_response = project_client.evaluations.get(evaluation_response.id)

    print("----------------------------------------------------------------")
    print("Created evaluation, evaluation ID: ", get_evaluation_response.id)
    print("Evaluation status: ", get_evaluation_response.status)
    print("AI project URI: ", get_evaluation_response.properties["AiStudioEvaluationUri"])
    print("----------------------------------------------------------------")

In [None]:
for (config_name, data_id) in data_ids:
    run_evaluation(f"TREC-COVID evaluation - {config_name}", "Document retrieval evaluation using the TREC-COVID dataset from BeIR", data_id)

## Comparing results

Once the evaluations are complete, you can compare the results in the Azure AI Foundry project page: open the "Evaluations" tab on the left side, select the runs to compare, and then click the "Compare" button to see the metric results side by side.

![Azure AI Foundry project evaluations page](eval-results-select.png)