# Document Retrieval Evaluation in Azure AI Foundry

## Objective
This notebook sample demonstrates how to perform evaluation of an Azure AI Search index using Azure AI Evaluation.

## Time

You should expect to spend about 20 minutes running this notebook after completing the setup steps below.

## About This Example
The evaluator used in this example, `DocumentRetrievalEvaluator`, requires two inputs for calculating the evaluation metrics: a list of ground truth labeled documents (sometimes referred to as "qrels") and a list of actual search results obtained from a search index.  This sample walks through the steps of data preparation, gathering search results for different search configurations, running the evaluation, and comparing the results of each run.

### Explanation of Document Retrieval Metrics
The metrics that will be generated in the output of the evaluator include:

| Metric          | Category             | Description |
|-----------------|----------------------|-------------|
| Fidelity        | Search Fidelity      | How well the top n retrieved chunks reflect the content for a given query; number of good documents returned out of the total number of known good documents in a dataset |
| NDCG            | Search NDCG          | How close the ranking is to an ideal order in which all relevant items are at the top of the list |
| XDCG            | Search XDCG          | How good the results are in the top-k documents, regardless of the scoring of other index documents |
| Max Relevance N | Search Max Relevance | Maximum relevance in the top-k chunks |
| Holes           | Search Label Sanity  | Number of documents with missing query relevance judgments (ground truth) |

Some metrics, particularly NDCG, XDCG, and Fidelity, are sensitive to holes.  Ideally, the count of holes for a given evaluation should be zero; otherwise, results for these metrics may not be accurate.  We recommend iteratively checking results against the current known ground truth and filling holes to improve the accuracy of the evaluation metrics.  This sample does not cover that process explicitly.
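
To make the NDCG and Holes definitions above concrete, here is a minimal sketch that uses the standard textbook formulation of NDCG (linear gain, log2 discount). The function names and toy labels are illustrative assumptions, and the evaluator's internal formulas (for example, its gain function or cut-offs) may differ.

```python
import math


def dcg(labels: list[int]) -> float:
    """Discounted cumulative gain: each label is discounted by log2(rank + 1)."""
    return sum(label / math.log2(rank + 1) for rank, label in enumerate(labels, start=1))


def ndcg_and_holes(retrieved_ids: list[str], qrels: dict[str, int]) -> tuple[float, int]:
    """Return (NDCG, holes) for one query.

    retrieved_ids: document IDs in ranked order.
    qrels: ground truth mapping of document ID -> relevance label.
    """
    # A "hole" is a retrieved document that has no ground truth judgment.
    holes = sum(1 for doc_id in retrieved_ids if doc_id not in qrels)
    # In this sketch, holes are scored as non-relevant (label 0), which is how they can distort the metric.
    labels = [qrels.get(doc_id, 0) for doc_id in retrieved_ids]
    # Ideal ordering: known labels sorted from most to least relevant, cut to the same length.
    ideal = sorted(qrels.values(), reverse=True)[: len(retrieved_ids)]
    ideal_dcg = dcg(ideal)
    return (dcg(labels) / ideal_dcg if ideal_dcg > 0 else 0.0, holes)


toy_qrels = {"d1": 1, "d2": 1, "d3": 0}
score, holes = ndcg_and_holes(["d1", "d9", "d2"], toy_qrels)
print(f"NDCG={score:.3f}, holes={holes}")
```

Here `d9` has no judgment, so it counts as one hole and is scored as non-relevant even though its true relevance is unknown.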

### Dataset

This example will use the open source BeIR/scidocs dataset, hosted publicly on [HuggingFace](https://huggingface.co/datasets/BeIR/scidocs#dataset-summary), which contains a corpus we can index into Azure AI Search, as well as a set of queries to run through our search service and a set of ground truth qrels for evaluation. The download is about 25MB.

To speed up future indexing, we will downsample the dataset for the purposes of this demo.


## Before You Begin
Before running this notebook, be sure you have fulfilled the following prerequisites:
* Create or get access to an [Azure Subscription](https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ready/azure-best-practices/initial-subscriptions), and assign yourself the Owner or Contributor role for creating resources in this subscription.
* Install the `az` CLI in the current environment, and run `az login` to gain access to your resources.
* Read and understand the documentation covering [assigning RBAC roles between resources](https://learn.microsoft.com/en-us/azure/role-based-access-control/role-assignments-cli) using the `az` CLI.
* Create an [Azure AI Search resource](https://learn.microsoft.com/en-us/azure/search/search-create-service-portal), and assign yourself the "Search Index Data Contributor" role for the resource.
* Create an [Azure AI Foundry project](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/create-projects?tabs=ai-studio).
* [Connect](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/evaluations-storage-account) the Azure AI Foundry project to a storage account.
* [Optional] Deploy a text embedding model in the Azure AI Foundry project for testing vector-based search scenarios in this example, and assign yourself the "Cognitive Services OpenAI User" role for the Azure AI Services resource created for your project.  For integrated vectorization support, you will also need to ensure that your Azure AI Search resource has the "Cognitive Services OpenAI User" role assigned for the Azure AI Services resource as well.
* Copy the contents of `.env.sample` into a new file named `.env`, and fill in the values corresponding to your own service resources.
* We recommend creating all needed resources within a single resource group for easier deletion.

Set these environment variables with your own values:

**AZURE_AI_SEARCH_ENDPOINT** - Search API endpoint for search service 

**AZURE_AI_SEARCH_KEY** - Search API key for search service 

**AZURE_AI_SEARCH_VECTORIZER_ENDPOINT** - Embedding model endpoint to be used for vectorization of your corpus 

**AZURE_AI_SEARCH_VECTORIZER_KEY** - Embedding model API key to be used for vectorization of your corpus 

**AZURE_AI_SEARCH_VECTORIZER_MODEL_NAME** - Embedding model name to be used for vectorization of your corpus 

**AZURE_AI_SEARCH_VECTORIZER_DEPLOYMENT_NAME** - Embedding model deployment name to be used for vectorization of your corpus 

**AZURE_AI_SEARCH_VECTORIZER_MODEL_DIMENSIONS** - Embedding model dimensions to be used for vectorization of your corpus 

**AZURE_PROJECT_ENDPOINT** - The Azure AI Foundry project endpoint, as found in the overview page of your Azure AI Foundry project. 

Example: AZURE_PROJECT_ENDPOINT=https://your-account.services.ai.azure.com/api/projects/your-project
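
As a quick sanity check once the `.env` file has been loaded, the sketch below reports which of the variables listed above are missing or empty. The helper name `missing_variables` and the split into required and optional groups are assumptions made for illustration, based on the vectorizer settings being optional in this sample.

```python
import os

REQUIRED_VARIABLES = [
    "AZURE_AI_SEARCH_ENDPOINT",
    "AZURE_AI_SEARCH_KEY",
    "AZURE_PROJECT_ENDPOINT",
]
# Only needed for the optional vector search scenarios.
OPTIONAL_VECTORIZER_VARIABLES = [
    "AZURE_AI_SEARCH_VECTORIZER_ENDPOINT",
    "AZURE_AI_SEARCH_VECTORIZER_KEY",
    "AZURE_AI_SEARCH_VECTORIZER_MODEL_NAME",
    "AZURE_AI_SEARCH_VECTORIZER_DEPLOYMENT_NAME",
    "AZURE_AI_SEARCH_VECTORIZER_MODEL_DIMENSIONS",
]


def missing_variables(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


print("Missing required variables:", missing_variables(REQUIRED_VARIABLES))
print("Missing vectorizer variables:", missing_variables(OPTIONAL_VECTORIZER_VARIABLES))
```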

### Installation
Run the following command to install the Python requirements for this notebook.

In [None]:
!pip install azure-core azure-identity azure-search-documents azure-ai-evaluation openai pandas python-dotenv tiktoken "datasets<4.0"

### Parameters
The following cell will load the necessary resource connection configuration for the sample.

In [None]:
from dotenv import load_dotenv

load_dotenv()

## Run The Example

### Import all modules
For convenience, all modules needed for the rest of the notebook can be imported all at once.

In [None]:
# Standard library
import copy
import os
import time
import uuid
from pathlib import Path

# Data handling
import pandas as pd

# Azure SDK
from azure.ai.evaluation import DocumentRetrievalEvaluator, evaluate
from azure.core.credentials import AzureKeyCredential
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery
from azure.search.documents.indexes.models import (
    ComplexField,
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SemanticConfiguration,
    SemanticField,
    SemanticPrioritizedFields,
    SemanticSearch,
    VectorSearch,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    HnswAlgorithmConfiguration,
    HnswParameters,
)

# Other open source packages
from datasets import load_dataset
from openai import AzureOpenAI
import tiktoken

### Create client objects for managing resources and set other helpful variables
We will also create all of the client objects and other variables needed for the rest of the notebook in the following cell.

In [None]:
# Create the Azure AI Search service clients
search_service_endpoint = os.environ["AZURE_AI_SEARCH_ENDPOINT"]
search_service_key = os.environ["AZURE_AI_SEARCH_KEY"]
index_name = "scidocs-vector"
search_credential = AzureKeyCredential(search_service_key)
search_index_client = SearchIndexClient(search_service_endpoint, search_credential)
search_client = SearchClient(search_service_endpoint, index_name, search_credential)

# Create the Azure AI Project client
azure_project_endpoint = os.environ["AZURE_PROJECT_ENDPOINT"]

# search parameters
search_top_k = 50
search_select = "doc_id"

# vector search parameters, if using vector search
vector_field_name = "title_text_ada002"
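# These settings are optional; leave them unset or empty to skip the vector search scenarios.
search_vectorizer_endpoint = os.environ.get("AZURE_AI_SEARCH_VECTORIZER_ENDPOINT")
search_vectorizer_key = os.environ.get("AZURE_AI_SEARCH_VECTORIZER_KEY")
search_vectorizer_model_name = os.environ.get("AZURE_AI_SEARCH_VECTORIZER_MODEL_NAME")
search_vectorizer_deployment_name = os.environ.get("AZURE_AI_SEARCH_VECTORIZER_DEPLOYMENT_NAME")
search_vectorizer_model_dimensions = os.environ.get("AZURE_AI_SEARCH_VECTORIZER_MODEL_DIMENSIONS")
# Integrated vectorization is enabled only when every vectorizer setting has a value.
use_integrated_vectorization = all(
    [
        search_vectorizer_endpoint,
        search_vectorizer_key,
        search_vectorizer_model_name,
        search_vectorizer_deployment_name,
        search_vectorizer_model_dimensions,
    ]
)
if use_integrated_vectorization:
    # Environment variables are strings; the vector field definition expects an integer dimension count.
    search_vectorizer_model_dimensions = int(search_vectorizer_model_dimensions)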
doc_chunk_size = 1024 if use_integrated_vectorization else None
tokenizer = tiktoken.get_encoding("cl100k_base")

aoai_client = None
if use_integrated_vectorization:
    ad_token_provider = get_bearer_token_provider(
        DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
    )
    aoai_client = AzureOpenAI(
        azure_endpoint=search_vectorizer_endpoint, api_version="2024-06-01", azure_ad_token_provider=ad_token_provider
    )
    print("Using integrated vectorization with Azure OpenAI model:", search_vectorizer_model_name)

### Create search configurations
In the next cell, we will set some additional configuration values for configuring document search using Azure AI Search.  We can select from these configurations later on when we generate search results for evaluation, and then compare the results of each run after the evaluations are finished to determine which configuration performs best for the index.

In [None]:
# Full-text search
text_search_configuration = {
    "name": "text",
    "select": search_select,
    "top": search_top_k,
    "api_version": "2024-11-01-preview",
    "score_field": "@search.score",
}

# Semantic search
semantic_search_configuration = {
    "name": "semantic",
    "query_type": "semantic",
    "select": search_select,
    "top": search_top_k,
    "semantic_configuration_name": "en-semantic-config",
    "api_version": "2024-11-01-preview",
    "score_field": "@search.reranker_score",
}

# Vector search -- requires setting values for all of the AZURE_AI_SEARCH_VECTORIZER_* environment variables
# (ENDPOINT, KEY, MODEL_NAME, DEPLOYMENT_NAME, and MODEL_DIMENSIONS)
vector_search_configuration = {
    "name": "vector",
    "select": search_select,
    "top": search_top_k,
    "vector_queries": [{"kind": "text", "fields": vector_field_name, "k_nearest_neighbors": search_top_k}],
    "score_field": "@search.score",
}

# Semantic vector hybrid search
semantic_vector_hybrid_search_configuration = {
    "name": "hybrid",
    "query_type": "semantic",
    "select": search_select,
    "top": search_top_k,
    "semantic_configuration_name": "en-semantic-config",
    "vector_queries": [{"kind": "text", "fields": vector_field_name, "k_nearest_neighbors": search_top_k}],
    "score_field": "@search.score",
}

## Dataset Preparation

### Download the Scidocs (Beir) dataset
In the next cell, we will download an open source dataset to perform evaluation on. We will use the Scidocs dataset from BeIR hosted on [HuggingFace](https://huggingface.co/datasets/BeIR/scidocs#dataset-summary), which contains a corpus we can index into Azure AI Search, as well as a set of queries to run through our search service and a set of ground truth qrels for evaluation. The download is about 25MB.

To speed up future indexing, we will downsample the dataset for the purposes of this demo.


In [None]:
corpus = load_dataset("BeIR/scidocs", "corpus")["corpus"]
queries = load_dataset("BeIR/scidocs", "queries")["queries"].to_pandas()
qrels = load_dataset("BeIR/scidocs-qrels", "default")["test"].to_pandas()

# Downsample queries to 1/10th of the original size
queries = queries.sample(frac=0.1, random_state=42)
downsampled_query_ids = queries["_id"].tolist()
qrels = qrels[qrels["query-id"].isin(downsampled_query_ids)]
# Use a set so the membership test in the corpus filter below is a constant-time lookup
qrels_doc_ids = set(qrels["corpus-id"])
# Filter the corpus to documents that are in the filtered qrels. You might want to include more documents for a more realistic retrieval task.
# This truncation is simply to make vectorizing run faster for demo purposes.
corpus = corpus.filter(lambda example: example["_id"] in qrels_doc_ids)

### Create an Azure AI Search index from the dataset corpus
Next, we will create an Azure AI Search index using the BeIR Scidocs dataset downloaded in the previous cell.  If integrated vectorization is enabled in the configuration settings, we will also add a vector field for our index.

In [None]:
def create_or_update_index(search_index_client: SearchIndexClient) -> None:
    fields = [
        SearchField(
            name="chunk_id", type=SearchFieldDataType.String, key=True, facetable=False, analyzer_name="standard.lucene"
        ),
        SearchField(name="doc_id", type=SearchFieldDataType.String, facetable=False, analyzer_name="standard.lucene"),
        SearchField(
            name="title",
            type=SearchFieldDataType.String,
            sortable=False,
            facetable=False,
            analyzer_name="standard.lucene",
        ),
        SearchField(
            name="text",
            type=SearchFieldDataType.String,
            sortable=False,
            facetable=False,
            analyzer_name="standard.lucene",
        ),
        ComplexField(
            name="metadata",
            fields=[
                SearchField(
                    name="url",
                    type=SearchFieldDataType.String,
                    sortable=False,
                    facetable=False,
                    analyzer_name="standard.lucene",
                ),
                SearchField(
                    name="pubmed_id",
                    type=SearchFieldDataType.String,
                    sortable=False,
                    facetable=False,
                    analyzer_name="standard.lucene",
                ),
            ],
        ),
    ]
    semantic_prioritized_fields = SemanticPrioritizedFields(
        title_field=SemanticField(field_name="title"),
        content_fields=[SemanticField(field_name="text")],
        keywords_fields=[SemanticField(field_name="metadata/url")],
    )
    semantic_configuration = SemanticConfiguration(
        name="en-semantic-config", prioritized_fields=semantic_prioritized_fields
    )
    semantic_search = SemanticSearch(
        default_configuration_name="en-semantic-config", configurations=[semantic_configuration]
    )

    vector_search = None
    if use_integrated_vectorization:
        algorithm = HnswAlgorithmConfiguration(
            name="hnsw-1", parameters=HnswParameters(m=4, ef_construction=400, ef_search=500, metric="cosine")
        )
        vectorizer = AzureOpenAIVectorizer(
            vectorizer_name="aoai-vectorizer-1",
            parameters=AzureOpenAIVectorizerParameters(
                resource_url=search_vectorizer_endpoint,
                deployment_name=search_vectorizer_deployment_name,
                api_key=search_vectorizer_key,
                model_name=search_vectorizer_model_name,
            ),
        )
        profile = VectorSearchProfile(
            name="vector-profile-1",
            algorithm_configuration_name=algorithm.name,
            vectorizer_name=vectorizer.vectorizer_name,
        )
        vector_search = VectorSearch(profiles=[profile], algorithms=[algorithm], vectorizers=[vectorizer])
        fields.append(
            SearchField(
                name=vector_field_name,
                type="Collection(Edm.Single)",
                searchable=True,
                hidden=True,
                stored=False,
                filterable=False,
                sortable=False,
                facetable=False,
                vector_search_dimensions=search_vectorizer_model_dimensions,
                vector_search_profile_name=profile.name,
            )
        )

    index = SearchIndex(name=index_name, fields=fields, semantic_search=semantic_search, vector_search=vector_search)

    search_index_client.create_or_update_index(index)

In [None]:
create_or_update_index(search_index_client)

### Index the documents from the dataset corpus
Once we have the data downloaded and the index created, we will ingest the documents from the downloaded corpus into the index.  If integrated vectorization is configured, we will also create embeddings for the input data to include in the ingestion payload.

In [None]:
def split_text(text: str, max_tokens: int) -> list[str]:
    """Split text into consecutive chunks of at most max_tokens tokens each."""
    tokens = tokenizer.encode(text)
    chunks = [tokens[start : start + max_tokens] for start in range(0, len(tokens), max_tokens)]
    return [tokenizer.decode(chunk) for chunk in chunks]


def get_embedding(
    aoai_client: AzureOpenAI, embedding_model: str, doc_id: str, title: str, content: str
) -> tuple[list[str], list[str], list[list[float]]]:
    text_to_embed = [" ".join([title, content])]
    chunk_ids = [doc_id]

    if doc_chunk_size:
        text_to_embed = split_text(text_to_embed[0], doc_chunk_size)
        chunk_ids = [uuid.uuid4().hex[:8] for _ in text_to_embed]

    text_embeddings = [
        x.embedding for x in aoai_client.embeddings.create(input=text_to_embed, model=embedding_model).data
    ]

    return chunk_ids, text_to_embed, text_embeddings


def index_dataset(search_client: SearchClient) -> None:
    chunk_size = 100
    documents_processed = 0
    for i in range(0, len(corpus), chunk_size):
        chunk = corpus[i : i + chunk_size]
        # Rename the dataset's `_id` column to match the index's `doc_id` field.
        _df = pd.DataFrame(chunk).rename(columns={"_id": "doc_id"})

        # If integrated vectorization is enabled, we'll chunk the data and vectorize it before uploading.
        if use_integrated_vectorization:
            start = time.time()
            _df[["chunk_id", "chunk", vector_field_name]] = _df.apply(
                lambda q: get_embedding(aoai_client, search_vectorizer_model_name, q["doc_id"], q["title"], q["text"]),
                axis=1,
                result_type="expand",
            )
            end = time.time()
            documents_processed += len(_df)
            print(
                f"Chunked and vectorized {documents_processed} of {len(corpus)} documents "
                f"(last {len(_df)} docs took: {end - start:.2f} seconds)"
            )
            _df = _df.explode(["chunk_id", "chunk", vector_field_name])
            # Replace the full document text with the per-chunk text.
            _df = _df.drop(columns="text").rename(columns={"chunk": "text"})
        else:
            _df["chunk_id"] = _df["doc_id"]

        documents = _df.to_dict(orient="records")
        search_client.upload_documents(documents=documents)

In [None]:
index_dataset(search_client)  # This can take ~10mins with vectorization on. Much faster with vectorization off.
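
The fixed-size chunking performed by `split_text` above can be illustrated without a tokenizer. The sketch below is a simplified stand-in, not the notebook's function: `chunk_tokens` is a hypothetical name, and a plain list of integers takes the place of the output of `tokenizer.encode`.

```python
def chunk_tokens(tokens: list[int], max_tokens: int) -> list[list[int]]:
    """Split a token sequence into consecutive, non-overlapping chunks."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[start : start + max_tokens] for start in range(0, len(tokens), max_tokens)]


toy_tokens = list(range(10))  # stand-in for tokenizer.encode(text)
chunks = chunk_tokens(toy_tokens, max_tokens=4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Because the chunks do not overlap, a sentence that straddles a boundary is split across two chunks; that is a simplification this demo accepts.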

### Get search results and merge with qrels
The `DocumentRetrievalEvaluator` from the Azure AI Evaluation SDK requires both search results and ground truth labels in JSON-lines format.  In the next cell, we define helpers that generate search results for a given search configuration and join them with their corresponding qrels from the Scidocs dataset to form the JSON-lines input for the evaluator.

In [None]:
def search(
    query: str,
    search_client: SearchClient,
    score_field: str = "@search.score",
    **search_configuration: object,
) -> list[dict[str, object]]:
    search_text = query
    vector_queries = None
    if "vector_queries" in search_configuration:
        search_text = None
        vector_queries = [
            VectorizableTextQuery(text=query, **vector_query_config)
            for vector_query_config in search_configuration["vector_queries"]
        ]
        search_configuration.pop("vector_queries")

    results = search_client.search(search_text=search_text, vector_queries=vector_queries, **search_configuration)
    return [{"document_id": result["doc_id"], "relevance_score": result.get(score_field, None)} for result in results]


def prepare_dataset(search_configuration: dict[str, object], qrels: pd.DataFrame) -> tuple[pd.DataFrame, Path]:
    # Rename on a copy so the caller's DataFrame is not modified.
    qrels = qrels.rename(
        columns={"corpus-id": "document_id", "score": "query_relevance_label", "query-id": "query_id"}
    )

    # Group qrels by query ID and generate groundtruth set per query
    qrels_grouped = qrels.groupby("query_id")
    qrels_aggregated = qrels_grouped[["document_id", "query_relevance_label"]].agg(lambda x: list(x))
    qrels_aggregated["retrieval_ground_truth"] = [
        [{"document_id": doc_id, "query_relevance_label": label} for doc_id, label in zip(doc_ids, labels)]
        for doc_ids, labels in zip(qrels_aggregated["document_id"], qrels_aggregated["query_relevance_label"])
    ]

    # Join the query set and the aggregated qrels on query ID
    merged = queries.merge(qrels_aggregated, left_on="_id", right_on="query_id")

    # Generate search results for each query - this can take some time.
    search_configuration_name = search_configuration.pop("name")
    score_field = search_configuration.pop("score_field")
    merged["retrieved_documents"] = merged.apply(
        lambda row: search(
            query=row["text"], search_client=search_client, score_field=score_field, **search_configuration
        ),
        axis=1,
    )

    merged_final = merged[["retrieved_documents", "retrieval_ground_truth"]]
    jsonl_path = Path(f"evaluate-beir-{search_configuration_name}.jsonl")
    merged_final.to_json(jsonl_path, lines=True, orient="records")

    return (merged_final, jsonl_path)
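
For reference, each line of the JSON-lines file written by `prepare_dataset` holds one query's retrieved documents and ground truth labels. The sketch below builds the same record shape from toy data using only the standard library; the document IDs, labels, and scores are made up for illustration.

```python
import json

# Toy ground truth rows: (query_id, document_id, relevance label), mirroring the qrels columns.
toy_qrels = [("q1", "d1", 1), ("q1", "d2", 0), ("q2", "d3", 1)]
# Toy ranked search results per query: (document_id, score).
toy_results = {"q1": [("d2", 3.2), ("d1", 1.7)], "q2": [("d3", 0.9)]}

# Group the ground truth labels by query ID.
ground_truth: dict[str, list[dict[str, object]]] = {}
for query_id, document_id, label in toy_qrels:
    ground_truth.setdefault(query_id, []).append(
        {"document_id": document_id, "query_relevance_label": label}
    )

# Emit one JSON object per query, one object per line.
lines = []
for query_id, hits in toy_results.items():
    record = {
        "retrieved_documents": [
            {"document_id": doc_id, "relevance_score": score} for doc_id, score in hits
        ],
        "retrieval_ground_truth": ground_truth[query_id],
    }
    lines.append(json.dumps(record))

print("\n".join(lines))
```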

### Run document retrieval evaluation
After the JSON-lines datasets are generated, we will configure and run the document retrieval evaluator for each one.  The init params `ground_truth_label_min` and `ground_truth_label_max` configure the qrels scaling for metrics that depend on a count of labels, such as Fidelity.  In this case, the Scidocs ground truth set has 0 and 1 as possible labels, so we set the values of those init params accordingly. The thresholds, which this sample leaves at their defaults, define whether to mark a metric as passing or failing.

In [None]:
data_ids = []
config_list = [text_search_configuration, semantic_search_configuration]
if use_integrated_vectorization:
    config_list.extend([vector_search_configuration, semantic_vector_hybrid_search_configuration])

for config in config_list:
    config_name = config["name"]
    print(f"Running searches with config '{config_name}'")
    dataset, jsonl_path = prepare_dataset(copy.deepcopy(config), qrels)
    print(f"Made file {jsonl_path}")
    print("Data content summary:")
    print(dataset.head())

In [None]:
def run_evaluation_locally() -> dict[str, object]:
    results = {}
    files_to_evaluate = [
        ("Text Search", "evaluate-beir-text.jsonl"),
        ("Semantic Search", "evaluate-beir-semantic.jsonl"),
    ]
    if use_integrated_vectorization:
        files_to_evaluate.extend(
            [("Vector Search", "evaluate-beir-vector.jsonl"), ("Hybrid Search", "evaluate-beir-hybrid.jsonl")]
        )
    for config_name, sample_data_file_name in files_to_evaluate:
        doc_retrieve = DocumentRetrievalEvaluator(ground_truth_label_min=0, ground_truth_label_max=1)
        response = evaluate(
            data=sample_data_file_name,
            evaluation_name=f"Doc retrieval eval demo - {config_name} run",
            evaluators={
                "DocumentRetrievalEvaluator": doc_retrieve,
            },
            evaluator_config={
                # The key must match the evaluator name used in `evaluators` above.
                "DocumentRetrievalEvaluator": {
                    "column_mapping": {
                        "retrieval_ground_truth": "${data.retrieval_ground_truth}",
                        "retrieved_documents": "${data.retrieved_documents}",
                    }
                }
            },
            azure_ai_project=azure_project_endpoint,  # This line uploads evaluation results to the Foundry Evaluations portal
        )
        results[config_name] = response

    return results


res = run_evaluation_locally()
print("Check Foundry Evaluations portal for results")

### Comparing results

To compare the different evaluations once they are complete, click the "Evaluations" tab on the left side of the Azure AI Foundry project page, select the runs for comparison, and then click the "Compare" button to see metric results side by side.
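
The runs can also be compared locally from the `res` dictionary returned by `run_evaluation_locally`. The sketch below assumes that each evaluation result exposes its aggregate values under a `"metrics"` key; `summarize_metrics` is a hypothetical helper, and the metric names and values in the stand-in data are made up for illustration, not real output.

```python
def summarize_metrics(results: dict[str, dict]) -> list[tuple[str, str, float]]:
    """Flatten {run name: evaluation result} into sorted (metric, run, value) rows."""
    rows = []
    for run_name, result in results.items():
        # Assumption: each result exposes aggregate values under a "metrics" key.
        for metric_name, value in result.get("metrics", {}).items():
            rows.append((metric_name, run_name, value))
    return sorted(rows)


# Hypothetical stand-in for `res`; metric names and values are invented for this example.
fake_results = {
    "Text Search": {"metrics": {"example.ndcg": 0.41}},
    "Semantic Search": {"metrics": {"example.ndcg": 0.47}},
}
for metric_name, run_name, value in summarize_metrics(fake_results):
    print(f"{metric_name:<16}{run_name:<18}{value:.2f}")
```

Sorting by metric name first groups the same metric across runs, which makes side-by-side reading easier.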


## Cleaning Up

To clean up the resources used in this sample, delete your Azure AI Search service, Azure AI Foundry project, and optionally your vectorizer deployment, wherever it is hosted.

If you made a resource group specifically to run this example, you could instead [delete the resource group](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/delete-resource-group).
