In [1]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/search/ranking-api/ranking_api_beir_evaluation.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fsearch%2Franking-api%2Franking_api_beir_evaluation.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/search/ranking-api/ranking_api_beir_evaluation.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/search/ranking-api/ranking_api_beir_evaluation.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/217753/github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>

| Author |
| --- |
| [Jannis Grönberg](https://github.com/puebster) |

## Overview
### Ranking API BEIR Evaluation

This notebook evaluates the ranking model `semantic-ranker-default-004` against established information retrieval benchmarks. It uses BEIR (Benchmarking IR), a benchmark that provides diverse collections and tasks for evaluating ranking models. The ground truth for the evaluation consists of human-annotated relevance judgments.

**Context:**

This notebook performs the evaluation in several key stages:

1.  **Data Loading:** It begins by accessing a public Google Cloud Storage bucket (`ranking_datasets`). From this bucket, it loads human-annotated data for the selected BEIR subdatasets: `nfcorpus`, `scidocs`, `trec-covid`, `webis-touche2020`, `dbpedia-entity`, and `msmarco`. This data includes the queries, documents, and their associated relevance judgments.

2.  **Score Computation:** Using the loaded data, the notebook scores every query-document pair with the target model, `semantic-ranker-default-004`. The computed scores are then written to the configured Google Cloud Storage bucket, making them available for later use or inspection.

3.  **Recomputation Option:** For users who wish to re-run the scoring process or store the results in a different location, the notebook allows for configuration of the target Google Cloud Storage bucket to one within their own Google Cloud Platform project.

4.  **Score Loading for Evaluation:** The notebook loads these computed scores from the specified bucket.

5.  **Metric Calculation:** Finally, utilizing the `pytrec_eval` library, the notebook calculates standard ranking performance metrics. Specifically, it computes the Normalized Discounted Cumulative Gain (NDCG) at various cutoffs: NDCG@1, NDCG@3, NDCG@5, and NDCG@10. These values provide a quantitative measure of how well the model ranks relevant documents for the given queries, based on the loaded ground truth relevance judgments and the computed scores.
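
To make the metric in step 5 concrete, here is a minimal sketch of NDCG@k for a single query. It assumes linear gain and a `log2(rank + 1)` discount, which is our understanding of how `ndcg_cut` is defined in `trec_eval`. The helper names (`dcg_at_k`, `ndcg_at_k`) and the toy labels and scores are hypothetical and not part of this notebook.

```python
import math


def dcg_at_k(relevances: list[int], k: int) -> float:
    """Discounted cumulative gain over the first k relevance labels (linear gain)."""
    # enumerate() is 0-based, so rank 1 gets the discount log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(labels: dict[str, int], scores: dict[str, float], k: int) -> float:
    """NDCG@k for one query: DCG of the predicted order over DCG of the ideal order."""
    predicted_order = sorted(scores, key=scores.get, reverse=True)
    predicted = [labels.get(doc_id, 0) for doc_id in predicted_order]
    ideal = sorted(labels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(predicted, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical toy query: three judged documents with graded labels.
labels = {"d1": 2, "d2": 0, "d3": 1}
scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5}  # the ranker puts the irrelevant d2 first

print(round(ndcg_at_k(labels, scores, k=1), 4))
print(round(ndcg_at_k(labels, scores, k=3), 4))
```

Because the top-ranked document is irrelevant, NDCG@1 is zero for this toy query, while NDCG@3 gives partial credit for the relevant documents further down the list.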

**Important Notes on Evaluation:**

*   **Labeled Data Only:** We have restricted our evaluation to only the labeled datapoints within each BEIR subdataset. This decision stems from the observation that many datasets contain examples that are clearly relevant but lack explicit labels. Including these unlabeled but relevant examples would lead to an underestimation of the model's performance, making a fair evaluation challenging.
*   **Datasets with Multiple Labels:** We have further filtered the BEIR datasets to include only those that have more than one unique label assigned. This is because computing meaningful Normalized Discounted Cumulative Gain (NDCG) scores on datasets where each query has only a single relevant document is problematic and can lead to misleading results.
*   **Enable the API:** You first need to enable the Discovery Engine API by visiting [Enable Discovery Engine API](https://console.developers.google.com/apis/api/discoveryengine.googleapis.com/).
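
The two filtering notes above can be illustrated with a small sketch. It is not the code that produced the published dataset: the `label` field (with `None` marking an unjudged pair) and the toy rows are assumptions, and "more than one unique label" is read here as more than one distinct label value per subdataset.

```python
from collections import defaultdict

# Hypothetical rows; the "label" field (None = unjudged) is an assumption.
rows = [
    {"name": "toy-graded", "query_id": "q1", "document_id": "d1", "label": 2},
    {"name": "toy-graded", "query_id": "q1", "document_id": "d2", "label": 0},
    {"name": "toy-graded", "query_id": "q1", "document_id": "d3", "label": None},
    {"name": "toy-single", "query_id": "q2", "document_id": "d4", "label": 1},
    {"name": "toy-single", "query_id": "q2", "document_id": "d5", "label": None},
]

# Step 1: keep only pairs that carry an explicit human judgment.
labeled = [row for row in rows if row["label"] is not None]

# Step 2: collect the distinct label values seen in each subdataset.
labels_per_dataset = defaultdict(set)
for row in labeled:
    labels_per_dataset[row["name"]].add(row["label"])

# Step 3: keep only subdatasets with more than one distinct label value.
kept = [row for row in labeled if len(labels_per_dataset[row["name"]]) > 1]

print(sorted({row["name"] for row in kept}))
print(len(kept))
```

In this toy data, the `toy-single` subdataset is dropped because all of its judged pairs share one label value, so any ranking of them would score the same.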

### Install Google Cloud SDKs and other required packages


In [None]:
%pip install --upgrade --quiet google-cloud-discoveryengine pytrec_eval numpy pandas scikit-learn tqdm gcsfs

# Setup

### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [1]:
# @title Colab authentication
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

In [None]:
# @title Import Libraries

# Standard library imports
import json
import os
import time

from google.cloud import discoveryengine_v1 as discoveryengine
from google.cloud import storage

# Third-party library imports
import numpy as np
import pandas as pd
import pytrec_eval
from sklearn.metrics import roc_auc_score
from tqdm.auto import tqdm

In [3]:
# @title Helper functions


def call_model_in_batches(
    df: pd.DataFrame, model_name: str, project_id: str, location: str
):
    """Scores the documents of each query with the Discovery Engine Rank API.

    The function groups a DataFrame of query-document pairs by query, sends the
    documents of each query to the Rank API in batches of up to 200 records,
    and returns a dictionary of query-to-document-score mappings.

    Args:
        df: A Pandas DataFrame containing query and document information. It must
          have the columns "name", "query_id", "query", "document_id", "title",
          and "content". Each row represents a document associated with a
          specific query; "name" is the subdataset name shown in the progress
          bar.
        model_name: The name of the ranking model to use in the Discovery Engine.
        project_id: The ID of the Google Cloud Project.
        location: The location (region) of the Discovery Engine instance.

    Returns:
        A dictionary where the outer keys are query IDs (str) and the inner values
        are dictionaries mapping document IDs (str) to their ranking scores
        (float), rounded to four decimal places.
    """
    results = {}

    client = discoveryengine.RankServiceClient()
    ranking_config = client.ranking_config_path(
        project=project_id,
        location=location,
        ranking_config="default_ranking_config",
    )
    query_groups = df.groupby("query_id")
    dataset_name = df.iloc[0]["name"]
    overall_passed_seconds = 0

    # Iterate through the queries of the dataset
    for query_id, group in tqdm(query_groups, desc=f"Scoring {dataset_name}"):

        # Add the query id to the results
        query = group.iloc[0]["query"]
        results[query_id] = {}

        records = []
        # iterate through rows of each subgroup.
        for _, row in group.iterrows():

            records.append(
                discoveryengine.RankingRecord(
                    id=row["document_id"],
                    title=row["title"],
                    content=row["content"],
                )
            )
            # Send the records to the API in batches of 200.
            if len(records) >= 200:
                now = time.time()
                request = discoveryengine.RankRequest(
                    ranking_config=ranking_config,
                    model=model_name,
                    query=query,
                    records=records,
                    ignore_record_details_in_response=True,
                )
                resp = client.rank(request=request)
                overall_passed_seconds += time.time() - now
                # Round the scores and add them to the overall results.
                for i in resp.records:
                    results[query_id][i.id] = round(i.score, 4)
                records = []

        # Score any records left over after the last full batch.
        if len(records) > 0:
            now = time.time()
            request = discoveryengine.RankRequest(
                ranking_config=ranking_config,
                model=model_name,
                query=query,
                records=records,
                ignore_record_details_in_response=True,
            )
            resp = client.rank(request=request)
            overall_passed_seconds += time.time() - now
            # Round the scores and add them to the overall results.
            for i in resp.records:
                results[query_id][i.id] = round(i.score, 4)

    print(
        f"Processing {len(query_groups)} queries with a total of"
        f" {len(df)} documents took {round(overall_passed_seconds, 2)}s.\n That"
        f" is {round(1000*overall_passed_seconds/len(query_groups), 2)}ms per"
        f" query or {round(1000*overall_passed_seconds/len(df), 2)}ms per"
        " document."
    )
    return results


def evaluate(
    query_document_ground_truth: dict[str, dict[str, int]],
    query_document_prediction: dict[str, dict[str, float]],
    k_values: list[int],
    ignore_identical_ids: bool = True,
    verbose: bool = True,
) -> dict[str, float]:
    """Evaluates the retrieval results against the ground truth relevance judgments

    using Normalized Discounted Cumulative Gain (NDCG).
    This code is taken from the official BEIR evaluation code.

    Args:
        query_document_ground_truth: A dictionary representing the ground truth relevance judgments. The
          outer keys are query IDs (str), and the inner dictionary maps document
          IDs (str) to their relevance scores (int).
        query_document_prediction: A dictionary representing the retrieval results. The outer keys
          are query IDs (str), and the inner dictionary maps retrieved document
          IDs (str) to their retrieval scores (float).
        k_values: A list of integers representing the cutoff ranks for NDCG
          calculation (e.g., [1, 3, 5, 10]). NDCG will be calculated for each k in
          this list.
        ignore_identical_ids: If True, documents with the same ID as the query ID
          will be removed from query_document_prediction before evaluation to
          prevent self-ranking. Defaults to True.
        verbose: If True, the NDCG score for each k-value will be printed.
          Defaults to True.

    Returns:
        A dictionary mapping "NDCG@k" (one key per value in k_values) to the
        NDCG@k score averaged over all evaluated queries.
    """

    if ignore_identical_ids:
        # Drop predictions whose document ID equals the query ID (in place).
        for query_id, document_scores in query_document_prediction.items():
            for document_id in list(document_scores):
                if query_id == document_id:
                    document_scores.pop(document_id)

    ndcg = {}

    for k in k_values:
        ndcg[f"NDCG@{k}"] = 0.0

    ndcg_string = "ndcg_cut." + ",".join([str(k) for k in k_values])
    evaluator = pytrec_eval.RelevanceEvaluator(
        query_document_ground_truth,
        {
            ndcg_string,
        },
    )

    scores = evaluator.evaluate(query_document_prediction)

    for query_id in scores.keys():
        for k in k_values:
            ndcg[f"NDCG@{k}"] += scores[query_id]["ndcg_cut_" + str(k)]

    for k in k_values:
        ndcg[f"NDCG@{k}"] = round(ndcg[f"NDCG@{k}"] / len(scores), 5)

    if verbose:
        for metric_name, value in ndcg.items():
            print(f"{metric_name}: {value:.4f}")

    return ndcg


def load_bytes_from_gcs(bucket_name: str, file_path: str):
    """Downloads a file from a Google Cloud Storage (GCS) bucket and returns its bytes.

    Args:
        bucket_name: The name of the GCS bucket.
        file_path: The path to the file within the bucket (e.g.,
          'data/my_data.json').

    Returns:
        The file contents as bytes, or None if the download failed.
    """
    try:
        # Initialize the GCS client
        client = storage.Client()

        # Get the bucket object
        bucket = client.bucket(bucket_name)

        # Get the blob (file) object
        blob = bucket.blob(file_path)

        # Download the blob's content as bytes
        _bytes = blob.download_as_bytes()

        return _bytes

    except Exception as e:
        print(f"An error occurred: {e}")
        return None


def save_dict_to_gcs_json(
    data: dict, bucket_name: str, file_path: str, encoding: str = "utf-8"
) -> None:
    """Saves a Python dictionary as a JSON file to a Google Cloud Storage (GCS) bucket.

    Args:
        data: The dictionary to be saved.
        bucket_name: The name of the GCS bucket.
        file_path: The desired path for the JSON file within the bucket (e.g.,
          'data/my_data.json').
        encoding: The encoding to use when writing the JSON file. Defaults to
          'utf-8'.
    """
    try:
        # Initialize the GCS client
        client = storage.Client()

        # Get the bucket object
        bucket = client.bucket(bucket_name)

        # Get the blob (file) object
        blob = bucket.blob(file_path)

        # Serialize the dictionary to a JSON string
        json_string = json.dumps(
            data, indent=2, ensure_ascii=False
        )  # indent for readability

        # Upload the JSON string to GCS as bytes with specified encoding
        blob.upload_from_string(
            json_string.encode(encoding), content_type="application/json"
        )

        print(f"Dictionary successfully saved to gs://{bucket_name}/{file_path}")

    except Exception as e:
        print(f"An error occurred while saving to GCS: {e}")

# If you want to run the scoring yourself, run the following section. Otherwise, you can skip it.

In [7]:
# @title Google Cloud Platform parameters

# @markdown Change this to your bucket, project id and location


PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

# @markdown If you want to run the scoring yourself, change this to your own bucket. Otherwise, leave these variables as they are.
OUTPUT_SCORES_BUCKET_NAME = "ranking_datasets"  # @param {type: "string", placeholder: "[your-bucket-name]", isTemplate: true}
MODEL_NAME = "semantic-ranker-default-004"  # @param {type: "string", placeholder: "[model-name]", isTemplate: true}
SCORE_PATH = "scores"  # @param {type: "string", placeholder: "[score-path]", isTemplate: true}

DATA_INPUT_BUCKET_NAME = "ranking_datasets"  # leave this as is
DATASET_PATH = "input_data/labeled_beir_with_more_than_1_label.pkl"  # leave this as is

In [None]:
# @title Read Dataset
df = pd.read_pickle(f"gs://{DATA_INPUT_BUCKET_NAME}/{DATASET_PATH}")
display(df.groupby("name").agg({"query_id": "nunique", "document_id": "nunique"}))

In [12]:
# @title Test connection
docs = [
    {
        "query_id": "1",
        "query": "Why is the sky blue?",
        "name": "test",
        "document_id": "1",
        "title": "Rayleigh Scattering and Blue Sky",
        "content": (
            "The sky appears blue because of a phenomenon called Rayleigh scattering. "
            "When sunlight travels through the atmosphere, it collides with gas molecules "
            "and scatters in all directions. Blue light is scattered more than other colors "
            "because it travels in shorter waves."
        ),
    },
    {
        "query_id": "1",
        "query": "Why is the sky blue?",
        "name": "test",
        "document_id": "2",
        "title": "Cooking Pasta",
        "content": (
            "To cook perfect pasta, bring a large pot of salted water to a rolling boil. "
            "Add the pasta and cook until al dente, stirring occasionally. Drain and serve "
            "with your favorite sauce."
        ),
    },
]

res = call_model_in_batches(
    df=pd.DataFrame(docs),
    model_name=MODEL_NAME,
    project_id=PROJECT_ID,
    location=LOCATION,
)

for d in docs:
    for r, v in res["1"].items():
        if r == d["document_id"]:
            print(d["title"], v)
            break

In [None]:
# @title Score Query-Document pairs

dataset_groups = df.groupby("name")
for subdataset_name, dataset in dataset_groups:
    score_output_path = f"{SCORE_PATH}/{subdataset_name.replace('-','_')}/{MODEL_NAME.replace('-', '_')}.json"

    print(f"Scoring: {subdataset_name.upper()}")
    print(f"Shape: {dataset.shape}")

    scores = call_model_in_batches(
        df=dataset,
        model_name=MODEL_NAME,
        project_id=PROJECT_ID,
        location=LOCATION,
    )

    save_dict_to_gcs_json(
        data=scores, bucket_name=OUTPUT_SCORES_BUCKET_NAME, file_path=score_output_path
    )

# Calculate NDCG and ROC AUC Scores
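
ROC AUC is defined for binary labels, whereas the BEIR judgments are graded. The sketch below mirrors the thresholding used in this section (a label counts as positive when it is greater than half of the maximum label) and then computes ROC AUC from its pairwise definition instead of calling scikit-learn. The helper names and the toy data are hypothetical.

```python
def binarize(label: int, max_label: int) -> int:
    """Maps a graded label to 1 if it exceeds half of the maximum label, else 0."""
    return 1 if label / max_label > 0.5 else 0


def roc_auc(y_true: list[int], y_score: list[float]) -> float:
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correctly ranked pair.
    """
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))


graded_labels = [0, 1, 2, 2]  # hypothetical graded judgments, maximum label 2
model_scores = [0.1, 0.7, 0.4, 0.9]  # hypothetical ranking scores

y_true = [binarize(label, max(graded_labels)) for label in graded_labels]
print(y_true)
print(roc_auc(y_true, model_scores))
```

Note that with a maximum label of 2, a label of 1 falls exactly on the threshold and is treated as negative, which matches the `<= 0.5` comparison in the cell below.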

In [None]:
# @title Calculate...

k_values = [1, 3, 5, 10]

# Get datasets
client = storage.Client()
bucket = client.bucket(OUTPUT_SCORES_BUCKET_NAME)
blobs = [
    "nfcorpus",
    "scidocs",
    "trec-covid",
    "webis-touche2020",
    "dbpedia-entity",
    "msmarco",
]
_res = None

for sub_name in blobs:
    print(sub_name)
    query_document_ground_truth_path = (
        f"{SCORE_PATH}/{sub_name.replace('-', '_')}/qrels.json"
    )
    query_document_prediction_path = (
        f"{SCORE_PATH}/{sub_name.replace('-', '_')}/{MODEL_NAME.replace('-', '_')}.json"
    )

    query_document_ground_truth = json.loads(
        load_bytes_from_gcs(
            OUTPUT_SCORES_BUCKET_NAME, query_document_ground_truth_path
        ).decode()
    )
    query_document_prediction = json.loads(
        load_bytes_from_gcs(
            OUTPUT_SCORES_BUCKET_NAME, query_document_prediction_path
        ).decode()
    )

    # Check if all query document pairs are present in Query Document Ground Truth and Query Document Prediction
    for first, second, n1, n2 in [
        (
            query_document_ground_truth,
            query_document_prediction,
            "Query Document Ground Truth",
            "Query Document Prediction",
        ),
        (
            query_document_prediction,
            query_document_ground_truth,
            "Query Document Prediction",
            "Query Document Ground Truth",
        ),
    ]:
        deleted_queries = 0
        deleted_docs = 0
        first_query_id_keys = list(first.keys())
        for query_id in first_query_id_keys:
            if query_id not in second:
                del first[query_id]
                deleted_queries += 1
                # The query is gone, so there are no documents left to compare.
                continue
            first_document_id_keys = list(first[query_id].keys())
            for document_id in first_document_id_keys:
                if document_id not in second[query_id]:
                    del first[query_id][document_id]
                    deleted_docs += 1
        if deleted_queries > 0 or deleted_docs > 0:
            print(
                f"Deleted from {n1} (because not present in {n2}):"
                f" {deleted_queries} Queries, {deleted_docs} Documents"
            )

    # ROC AUC needs binary labels, so the graded ground truth labels are binarized:
    # a label counts as positive if it is greater than half of the maximum label.
    y_true, y_pred = [], []
    max_score = 0
    for query_id, document_id_scores in query_document_ground_truth.items():
        for document_id, y_true_score in document_id_scores.items():
            max_score = max(max_score, y_true_score)

    for query_id, document_id_scores in query_document_ground_truth.items():
        for document_id, y_true_score in document_id_scores.items():
            y_true_label = 0 if y_true_score / max_score <= 0.5 else 1
            y_true.append(y_true_label)
            y_pred.append(query_document_prediction[query_id][document_id])

    ndcg = evaluate(
        query_document_ground_truth, query_document_prediction, k_values, verbose=False
    )
    results = {k: [v] for k, v in ndcg.items()}

    results["ROC AUC"] = [float(roc_auc_score(np.array(y_true), np.array(y_pred)))]

    ndcg_res = pd.DataFrame.from_dict(results, orient="index", columns=[sub_name])
    if _res is None:
        _res = ndcg_res
    else:
        _res = _res.merge(ndcg_res, left_index=True, right_index=True)

_res["Macro Avg."] = _res.mean(axis=1)
_res.index.name = MODEL_NAME

In [None]:
_res

## Conclusion

In this notebook, we demonstrated how to evaluate the performance of the `semantic-ranker-default-004` model on several key BEIR subdatasets.

We walked through the process of:
*   Loading human-annotated ground truth data from a public Google Cloud Storage bucket.
*   Using the Ranking API to compute relevance scores for query-document pairs.
*   Storing the computed scores back into GCS for persistence and accessibility.
*   Loading these scores and the ground truth to calculate standard information retrieval metrics, specifically NDCG@(1, 3, 5, 10) using the `pytrec_eval` library, along with ROC AUC using scikit-learn.

A significant takeaway from this evaluation is the strong performance of the `semantic-ranker-default-004` model. At the time of publishing this notebook, the results indicate that this model achieves state-of-the-art results on these specific BEIR benchmarks, showcasing its effectiveness for ranking tasks.

### Where to Learn More

*   **Google Ranking API:** Explore the official documentation for the Ranking API to understand its capabilities, pricing, and other available models.
*   **BEIR Dataset:** Visit the BEIR project website to learn more about the various datasets available, their characteristics, and how they are used for benchmarking.
*   **pytrec_eval:** Refer to the documentation for the `pytrec_eval` library to understand the NDCG metric and other potential evaluation metrics in more detail.