In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Advanced RAG Techniques - Vertex RAG Engine Retrieval Quality Evaluation and Hyperparameters Tuning

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/rag_engine_evaluation.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Frag_engine_evaluation.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/rag_engine_evaluation.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/rag_engine_evaluation.ipynb">
      <img width="32px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

|           |                                         |
|-----------|---------------------------------------- |
| Author(s) | [Ed Tsoi](https://github.com/edtsoi430) |

## Overview

Retrieval quality is arguably the most important component of a Retrieval Augmented Generation (RAG) application. It directly affects the quality of the generated response, and poor retrieval can lead to irrelevant, incomplete, or hallucinated output.

This notebook provides guidelines on how to evaluate retrieval quality with Vertex RAG Engine using the [public BEIR-fiqa 2018 dataset](https://arxiv.org/abs/2104.08663). (You are also welcome to bring your own (BYO) evaluation dataset.)

By the end of this notebook, you should have a better understanding of how your retrieval pipeline performs, how certain changes (e.g. using a new embedding model or changing the chunk size) affect the quality of your retrieved contexts, and how to use this notebook to tune hyperparameters to get the most out of your RAG application.

**Note:** This notebook assumes that you already understand how to implement a RAG system with Vertex AI RAG Engine. For more general instructions on how to use Vertex AI RAG Engine, please refer to the [RAG Engine API Documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/rag-api).


##### **How exactly could we use this notebook to improve the RAG system?**
* **Hyperparameter tuning:** Several hyperparameters can affect retrieval quality:

| Parameter | Description |
|------------|----------------------|
| Chunk Size | When documents are ingested into an index, they are split into chunks. The `chunk_size` parameter (in tokens) specifies the size of each chunk. |
| Chunk Overlap |  By default, documents are split into chunks with a certain amount of overlap to improve relevance and retrieval quality. |
| Top K | Controls the maximum number of contexts that are retrieved. |
| Vector Distance threshold | Only contexts with a distance smaller than the threshold are considered. |
| Embedding model | The embedding model used to convert input text into embeddings for retrieval.|

You can use this notebook to evaluate your retrieval quality and see how changing certain parameters (e.g. top k, chunk size) affects your retrieval metrics (recall@k, precision@k, nDCG@k).

* **Response Quality Evaluation:** Once you have optimized the retrieval metrics, you can assess how they affect response quality using the [Evaluation Service API Colab](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaluate_rag_gen_ai_evaluation_service_sdk.ipynb).
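
One way to organize a tuning run is to enumerate the candidate settings up front and evaluate them one by one. The sketch below is a minimal illustration, not part of the RAG Engine API: the candidate values and the `build_configs` helper are hypothetical, and only the parameter names (`chunk_size`, `chunk_overlap`, `top_k`) come from the table above.

```python
from itertools import product

# Hypothetical candidate values; adjust them to your own corpus and budget.
CHUNK_SIZES = [256, 512, 1024]
CHUNK_OVERLAPS = [51, 102]
TOP_K_VALUES = [5, 10]


def build_configs(chunk_sizes, chunk_overlaps, top_k_values):
    """Return one dict per valid combination of the candidate settings."""
    configs = []
    for chunk_size, chunk_overlap, top_k in product(
        chunk_sizes, chunk_overlaps, top_k_values
    ):
        # An overlap as large as the chunk itself makes no sense; skip it.
        if chunk_overlap >= chunk_size:
            continue
        configs.append(
            {
                "chunk_size": chunk_size,
                "chunk_overlap": chunk_overlap,
                "top_k": top_k,
            }
        )
    return configs


configs = build_configs(CHUNK_SIZES, CHUNK_OVERLAPS, TOP_K_VALUES)
print(f"{len(configs)} configurations to evaluate")
print(configs[0])
```

Note that chunk size and chunk overlap are applied when files are imported, so changing them means re-importing the files, whereas top k is set at query time and can be varied against the same corpus.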









## Get started

### Install Vertex AI SDK and other required packages


In [22]:
%pip install --upgrade --user --quiet google-cloud-aiplatform beir

### Restart runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.

The restart might take a minute or longer. After it's restarted, continue to the next step.

In [2]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [40]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth
    auth.authenticate_user()

### Set Google Cloud project information and initialize Vertex AI SDK

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [3]:
# Use the environment variable if the user doesn't provide Project ID.
import os

import vertexai

PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}

if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

vertexai.init(project=PROJECT_ID, location=LOCATION)

# 1. Option A. Create a RAG Corpus (or use an existing RAG Corpus that you would like to evaluate on).
* If you would like to bring your own existing RagCorpus (with imported files), skip to Option B below.

### Create a RagCorpus with the specified configuration.

In [None]:
from vertexai.preview import rag

# See the list of currently supported embedding models here: https://cloud.google.com/vertex-ai/generative-ai/docs/rag-overview#supported-embedding-models
# You may adjust the embedding model here if you would like.
embedding_model_config = rag.EmbeddingModelConfig(
    publisher_model="publishers/google/models/text-embedding-004"  # @param {type:"string", isTemplate: true},
)

rag_corpus = rag.create_corpus(
    display_name="test-corpus",
    description="A test corpus where we import the BEIR-FiQA-2018 dataset",
    embedding_model_config=embedding_model_config,
)

print(rag_corpus)


### Load BEIR Fiqa dataset (test split).

In [14]:
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

# Download and load a BEIR dataset
dataset = 'fiqa' #@param ["arguana", "climate-fever", "cqadupstack", "dbpedia-entity", "fever", "fiqa", "germanquad", "hotpotqa", "mmarco", "mrtydi", "msmarco-v2", "msmarco", "nfcorpus", "nq-train", "nq", "quora", "scidocs", "scifact", "trec-covid-beir", "trec-covid-v2", "trec-covid", "vihealthqa", "webis-touche2020"]
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = "datasets"
data_path = util.download_and_unzip(url, out_dir)

# Load the dataset
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
print(f"Successfully loaded the {dataset} dataset with {len(corpus)} files and {len(queries)} queries!")


Successfully loaded the fiqa dataset with 57638 files and 648 queries!
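
If you bring your own evaluation dataset, it helps to match the shape of the three objects returned by the loader above. The toy example below sketches that shape with made-up IDs and text (all values are hypothetical): `corpus` maps a document ID to a dict with `title` and `text`, `queries` maps a query ID to the query string, and `qrels` maps a query ID to a dict of relevant document IDs and their relevance scores.

```python
# Toy stand-ins for the objects returned by GenericDataLoader.load().
# All IDs and strings here are hypothetical placeholders.
corpus = {
    "101": {"title": "", "text": "An index fund tracks a market index."},
    "102": {"title": "Bonds", "text": "A bond is a fixed-income instrument."},
}
queries = {"q1": "What is an index fund?"}
qrels = {"q1": {"101": 1}}  # query ID -> {doc ID: relevance score}

# Basic consistency checks that are worth running on a custom dataset.
for query_id, relevant_docs in qrels.items():
    assert query_id in queries, f"qrels references unknown query {query_id}"
    for doc_id in relevant_docs:
        assert doc_id in corpus, f"qrels references unknown document {doc_id}"

print(f"{len(corpus)} documents, {len(queries)} queries, {len(qrels)} judged queries")
```

The `extract_doc_id` helper later in this notebook expects numeric document IDs, so a custom dataset with non-numeric IDs would need that helper adjusted.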


### Define helper function for processing dataset.

In [36]:
import os
import math
import subprocess

def convert_beir_to_rag_corpus(corpus, output_dir):
    """
    Convert a BEIR corpus to Vertex RAG corpus format with a maximum of 10,000
    files per subdirectory.

    For each document in the BEIR corpus, we will create a new txt where:
      * doc_id will be the file name
      * doc_content will be the document text prepended by title (if any).

    Args:
      corpus: BEIR corpus
      output_dir (str): Directory where the converted corpus will be saved

    Returns:
      None (will write output to disk)
    """
    # Create the output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    file_count, subdir_count = 0, 0
    current_subdir = os.path.join(output_dir, f"{subdir_count}")
    os.makedirs(current_subdir, exist_ok=True)

    # Convert each file in the corpus
    for doc_id, doc_content in corpus.items():
        # Combine title and text (if title exists)
        full_text = doc_content.get('title', '')
        if full_text:
            full_text += '\n\n'
        full_text += doc_content['text']

        # Create a new file for each document.
        file_path = os.path.join(current_subdir, f"{doc_id}.txt")
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(full_text)

        file_count += 1

        # Create a new subdirectory if the current one has reached the limit
        if file_count >= 10000:
            subdir_count += 1
            current_subdir = os.path.join(output_dir, f"{subdir_count}")
            os.makedirs(current_subdir, exist_ok=True)
            file_count = 0

    print(f"Conversion complete. {len(corpus)} files saved in {output_dir}")


def count_files_in_gcs_bucket(bucket_path):
  """
  Counts the number of files in a Google Cloud Storage bucket path,
  excluding directories.

  Args:
    bucket_path: The path to the directory in the GCS bucket.

  Returns:
    The number of files in the bucket path.
  """
  # Add a trailing slash if it's not present
  if not bucket_path.endswith("/"):
    bucket_path += "/"

  process = subprocess.Popen(
      f"gsutil ls {bucket_path}** | grep -v '/$' | wc -l",
      shell=True,
      stdout=subprocess.PIPE,
      stderr=subprocess.PIPE
  )
  stdout, stderr = process.communicate()
  if stderr:
    raise RuntimeError(f"Error counting files in GCS: {stderr.decode()}")
  return int(stdout.decode().strip())

def count_directories_after_split(gcs_bucket_path):
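  """
  Computes the number of subdirectories the converted corpus was split into,
  assuming at most 10,000 files per subdirectory (as written by
  convert_beir_to_rag_corpus).

  Args:
    gcs_bucket_path: The path to the directory in the GCS bucket.

  Returns:
    The expected number of subdirectories under the bucket path.
  """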
  num_files_in_bucket = count_files_in_gcs_bucket(gcs_bucket_path)
  num_directories = math.ceil(num_files_in_bucket / 10000)
  return num_directories
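
To make the split arithmetic concrete, the standalone sketch below reproduces the calculation for the 57,638 FiQA documents loaded earlier, without touching GCS. The `MAX_FILES_PER_DIR` constant and the `subdirectory_sizes` helper are illustrative names, not part of the notebook's helpers.

```python
import math

MAX_FILES_PER_DIR = 10000  # same limit used by convert_beir_to_rag_corpus


def subdirectory_sizes(num_files, max_per_dir=MAX_FILES_PER_DIR):
    """Return the number of files that land in each numbered subdirectory."""
    num_dirs = math.ceil(num_files / max_per_dir)
    sizes = [max_per_dir] * num_dirs
    if num_files % max_per_dir:
        # The last subdirectory holds the remainder.
        sizes[-1] = num_files % max_per_dir
    return sizes


sizes = subdirectory_sizes(57638)
print(len(sizes), sizes)
```

With 57,638 files this gives six subdirectories, so the import step iterates over subdirectories `0` through `5`.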


### Convert BEIR corpus to RagCorpus format and upload to GCS bucket.

In [None]:
CONVERTED_DATASET_PATH = f'/converted_dataset_{dataset}'
# Convert BEIR corpus to RAG format.
convert_beir_to_rag_corpus(corpus, CONVERTED_DATASET_PATH)

Conversion complete. 57638 files saved in /converted_dataset_fiqa


#### Authenticate with gcloud.

In [None]:
!gcloud auth application-default login
!gcloud auth application-default set-quota-project {PROJECT_ID}
!gcloud config set project {PROJECT_ID}

#### Create a test bucket for the BEIR evaluation dataset (or use an existing bucket of your choice).

In [None]:
!gsutil mb gs://beir-test-bucket

Creating gs://beir-test-bucket/...


#### Upload to specified GCS bucket (Modify the GCS bucket path to desired location)

In [None]:
GCS_BUCKET_PATH = "gs://beir-test-bucket/beir-fiqa" #@param {type: "string"}

!echo "Uploading files from ${CONVERTED_DATASET_PATH} to ${GCS_BUCKET_PATH}"
# Upload RAG format dataset to GCS bucket.
!gsutil -m rsync -r -d $CONVERTED_DATASET_PATH $GCS_BUCKET_PATH


### Import evaluation dataset files into RagCorpus.

In [1]:
from vertexai.preview import rag

num_subdirectories = count_directories_after_split(GCS_BUCKET_PATH)
paths = [GCS_BUCKET_PATH+f"/{i}/" for i in range(num_subdirectories)]

chunk_size = 512 #@param {type:"integer"}
chunk_overlap = 102 #@param {type:"integer"}
total_imported, total_num_of_files = 0, 0

for path in paths:
  num_files_to_be_imported = count_files_in_gcs_bucket(path)
  total_num_of_files += num_files_to_be_imported
  max_retries, attempt, imported = 10, 0, 0
  while attempt < max_retries and imported < num_files_to_be_imported:
    response = rag.import_files(
        rag_corpus.name,
        [path],
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        timeout=20000,
        max_embedding_requests_per_min=1400,
    )
    imported += response.imported_rag_files_count if response.imported_rag_files_count else 0
    attempt += 1
  total_imported += imported

print(f"{total_imported} files out of {total_num_of_files} imported!")


57638 files out of 57638 imported!


# 1. Option B. Bring your own existing RagCorpus (insert RAG_CORPUS_ID here).

In [None]:
from vertexai.preview import rag

# Specify your rag corpus ID here that you want to use.
RAG_CORPUS_ID = "" #@param {type: "string"}

rag_corpus = rag.get_corpus(name=f"projects/{PROJECT_ID}/locations/{LOCATION}/ragCorpora/{RAG_CORPUS_ID}")

print(rag_corpus)

# 2. Run Retrieval Quality Evaluation
For Retrieval Quality Evaluation, we focus on the following metrics:
- **Recall@k:**
  - Measures how many of the relevant documents/chunks are successfully retrieved within the top k results
  - Helps evaluate the retrieval component's ability to find ALL relevant information
- **Precision@k:**
  - Measures the proportion of retrieved documents that are actually relevant within the top k results
  - Helps evaluate how "focused" your retrieval is
- **nDCG@k:**
  - Measures both relevance AND ranking quality
  - Takes into account the position of relevant documents

Follow the rest of this notebook to compute these metrics for your configuration and to optimize your settings.
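
For reference, the helper functions defined in the next cell compute these metrics as follows, where $C_k$ is the list of at most $k$ retrieved contexts, $d(c)$ is the source document of context $c$, $R_q$ is the set of documents marked relevant for query $q$, and $\mathrm{rel}_i$ is the relevance score of the document behind the $i$-th retrieved context (0 if it is not in the ground truth). This is a summary of the notebook's own implementation; other libraries may define these metrics slightly differently.

```latex
\mathrm{Recall@}k = \frac{\left|\{d(c) : c \in C_k\} \cap R_q\right|}{|R_q|}
\qquad
\mathrm{Precision@}k = \frac{\left|\{c \in C_k : d(c) \in R_q\}\right|}{|C_k|}

\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)}
\qquad
\mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```

Here $\mathrm{IDCG@}k$ is the $\mathrm{DCG@}k$ of the ground-truth relevance scores sorted in descending order. Note that recall counts unique relevant documents, while precision counts retrieved contexts (chunks), so several chunks from the same relevant document all count toward precision.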

### Define evaluation helper function.

In [38]:
import re
import time

import numpy as np
from tqdm import tqdm
from vertexai.preview import rag

def extract_rag_id(rag_resource_name):
  """Extracts the RAG ID from a RAG resource name.

  Args:
    rag_resource_name: The full name of the RAG resource.

  Returns:
    The RAG ID extracted from the resource name.
  """
  return rag_resource_name.split('/')[-1]

def extract_doc_id(file_path):
  """Extracts the document ID from a file path.

  The function assumes the document ID is the number before the '.txt' extension in the file path.

  Args:
    file_path: The path to the file.

  Returns:
    The document ID extracted from the file path, or None if no ID is found.
  """
  # Use regular expression to find the number before .txt, with optional /
  match = re.search(r'/?(\d+)\.txt$', file_path)

  if match:
    return match.group(1)
  else:
    return None

# RAG Engine helper function to extract doc_id, snippet, and score.
def extract_retrieval_details(response):
  doc_id = extract_doc_id(response.source_uri)
  retrieved_snippet = response.text
  distance = response.distance
  return (doc_id, retrieved_snippet, distance)

# RAG Engine helper function for retrieval.
def rag_api_retrieve(query, corpus_id, top_k):
  return rag.retrieval_query(
        rag_resources=[rag.RagResource(rag_corpus=f"projects/{PROJECT_ID}/locations/{LOCATION}/ragCorpora/{corpus_id}")],
        text=query,
        similarity_top_k=top_k,
        vector_distance_threshold=0.5,
      ).contexts.contexts


def calculate_document_level_recall_precision(retrieved_response, cur_qrel):
  """Calculates document-level recall and precision for one query's retrieved contexts.

  Args:
    retrieved_response: A list of retrieved contexts for the query (already limited to top_k by the retrieval call).
    cur_qrel: qrels[query_id] dictionary mapping relevant doc IDs to relevance scores.

  Returns:
    A tuple containing the recall and precision scores.
  """
  if not retrieved_response:
    return (0, 0)

  relevant_retrieved_unique = set()
  num_relevant_retrieved_snippet = 0
  for res in retrieved_response:
      doc_id, text, score = extract_retrieval_details(res)
      if (doc_id in cur_qrel):
        relevant_retrieved_unique.add(doc_id)
        num_relevant_retrieved_snippet += 1
  recall = len(relevant_retrieved_unique) / len(cur_qrel.keys())
  precision = num_relevant_retrieved_snippet / len(retrieved_response)
  return (recall, precision)

def calculate_document_level_metrics(queries, qrels, k_values, corpus_id):
  """Calculates and prints the average recall, precision, and nDCG for a set of queries.

  Args:
    queries: A dictionary of queries with query IDs as keys and query text as values.
    qrels: A dictionary of ground truth relevant documents for each query.
    k_values: A list of top_k values to evaluate.
    corpus_id: The ID of the RagCorpus to retrieve from.

  Returns:
    None

  Prints:
    The average recall, average precision, and average nDCG for each top_k value.
  """

  for top_k in k_values:
    start_time = time.time()
    total_recall, total_precision, total_ndcg = 0, 0, 0
    print(f"Computing metrics for top_k value: {top_k}")
    print(f"Total number of queries: {len(queries)}")
    for query_id, query in tqdm(queries.items(), total=len(queries), desc=f"Processing Queries (top_k={top_k})"):
      response = rag_api_retrieve(query, corpus_id, top_k)

      recall, precision = calculate_document_level_recall_precision(response, qrels[query_id])
      ndcg = ndcg_at_k(response, qrels[query_id], top_k)

      total_recall += recall
      total_precision += precision
      total_ndcg += ndcg

    end_time = time.time()
    execution_time = end_time - start_time
    num_queries = len(queries)
    average_recall, average_precision, average_ndcg = total_recall / num_queries, total_precision / num_queries, total_ndcg / num_queries
    print(f"\nAverage Recall@{top_k}: {average_recall:.4f}")
    print(f"Average Precision@{top_k}: {average_precision:.4f}")
    print(f"Average nDCG@{top_k}: {average_ndcg:.4f}")
    print(f"Execution time: {execution_time} seconds.")
    print("=============================================")

def dcg_at_k_with_zero_padding_if_needed(r, k):
    r = np.asarray(r)[:k]
    if r.size:
        # Pad with zeros if r is shorter than k
        if r.size < k:
            r = np.pad(r, (0, k - r.size))
        return np.sum(np.subtract(np.power(2, r), 1) / np.log2(np.arange(2, k + 2)))
    return 0.

def ndcg_at_k(retriever_results, ground_truth_relevances, k):
    if not retriever_results:
      return 0

    # Prepare retriever results
    retrieved_relevances = []
    for res in retriever_results[:k]:
      doc_id, text, score = extract_retrieval_details(res)
      if doc_id in ground_truth_relevances:
        retrieved_relevances.append(ground_truth_relevances[doc_id])
      else:
        retrieved_relevances.append(0)  # Assume irrelevant if not in ground truth

    # Calculate DCG
    dcg = dcg_at_k_with_zero_padding_if_needed(retrieved_relevances, k)
    # Calculate IDCG
    ideal_relevances = sorted(ground_truth_relevances.values(), reverse=True)
    idcg = dcg_at_k_with_zero_padding_if_needed(ideal_relevances, k)

    return dcg / idcg if idcg > 0 else 0.
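
As a quick sanity check of the metric definitions, the standalone sketch below recomputes recall, precision, and nDCG for a single made-up query using plain Python. The document IDs and relevance scores are hypothetical, and `toy_metrics` is an illustrative re-implementation that takes a ranked list of document IDs instead of RAG Engine contexts.

```python
import math


def dcg(relevances, k):
    """DCG@k with the (2^rel - 1) / log2(rank + 1) gain used above."""
    return sum(
        (2**rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def toy_metrics(retrieved_doc_ids, qrel, k):
    """Return (recall, precision, ndcg) for one query."""
    retrieved = retrieved_doc_ids[:k]
    if not retrieved:
        return 0.0, 0.0, 0.0
    hits = [doc_id for doc_id in retrieved if doc_id in qrel]
    recall = len(set(hits)) / len(qrel)
    precision = len(hits) / len(retrieved)
    gains = [qrel.get(doc_id, 0) for doc_id in retrieved]
    ideal = sorted(qrel.values(), reverse=True)
    idcg = dcg(ideal, k)
    ndcg = dcg(gains, k) / idcg if idcg > 0 else 0.0
    return recall, precision, ndcg


# Hypothetical query: two relevant documents, four retrieved contexts.
qrel = {"11": 1, "22": 1}
retrieved = ["11", "99", "22", "98"]
recall, precision, ndcg = toy_metrics(retrieved, qrel, k=4)
print(f"recall={recall:.4f} precision={precision:.4f} ndcg={ndcg:.4f}")
```

Both relevant documents are found, so recall is 1.0, but only two of the four retrieved contexts are relevant, so precision is 0.5. nDCG falls below 1 because the second relevant document appears at rank 3 rather than rank 2.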

### Run Retrieval Quality Evaluation.

In [21]:
calculate_document_level_metrics(queries, qrels, [5, 10, 100], corpus_id=extract_rag_id(rag_corpus.name))

Computing metrics for top_k value: 5
Total number of queries: 648


Processing Queries (top_k=5): 100%|██████████| 648/648 [44:47<00:00,  4.15s/it]



Average Recall@5: 0.5608
Average Precision@5: 0.2713
Average nDCG@5: 0.4450
Execution time: 2687.608230829239 seconds.
Computing metrics for top_k value: 10
Total number of queries: 648


Processing Queries (top_k=10): 100%|██████████| 648/648 [37:31<00:00,  3.48s/it]



Average Recall@10: 0.6571
Average Precision@10: 0.1679
Average nDCG@10: 0.4039
Execution time: 2251.886693954468 seconds.
Computing metrics for top_k value: 100
Total number of queries: 648


Processing Queries (top_k=100): 100%|██████████| 648/648 [38:48<00:00,  3.59s/it]


Average Recall@100: 0.8801
Average Precision@100: 0.0253
Average nDCG@100: 0.2592
Execution time: 2328.4095141887665 seconds.





# 3. Next steps
* Once the evaluation is done, carefully examine the metrics and tune the hyperparameters. Below are some suggestions on how to optimize the hyperparameters for the best retrieval quality.

### How to optimize Recall:
* If your recall is too low, consider the following steps:
  * **Reducing chunk size:** Sometimes important information might be buried within large chunks, making it more difficult to retrieve relevant context. Try reducing the chunk size.
  * **Increasing chunk overlap:** If the chunk overlap is too small, some relevant information at chunk boundaries might be lost. Consider increasing the chunk overlap (a chunk overlap of 20% of the chunk size is generally a good start).
  * **Increasing top-K:** If your top k is too small, the retriever might miss some relevant information because too few contexts are returned.
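
As a small worked example of the 20% starting point, the sketch below derives a chunk overlap from a chunk size. The `suggested_overlap` helper and the candidate chunk sizes are hypothetical; for a chunk size of 512 it yields 102, the overlap value used in the import step earlier in this notebook.

```python
def suggested_overlap(chunk_size, ratio=0.2):
    """Return a chunk overlap (in tokens) as a fraction of the chunk size."""
    if not 0 <= ratio < 1:
        raise ValueError("ratio must be in [0, 1)")
    return int(chunk_size * ratio)


for chunk_size in (256, 512, 1024):
    print(chunk_size, suggested_overlap(chunk_size))
```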

### How to optimize Precision:
* If your precision number is low, consider:
  * **Reducing top-K:** Your top k might be too large, adding a lot of unwanted noise to the retrieved contexts.
  * **Reducing chunk overlap:** Too large a chunk overlap could result in duplicate information across the retrieved chunks.
  * **Increasing chunk size:** If your chunk size is too small, it might lack sufficient context resulting in a low precision score.

### How to optimize nDCG:
* If your nDCG number is low, consider:
  * **Changing your embedding model:** Your embedding model might not be capturing relevance well. Consider using a different embedding model (e.g. if your documents are multilingual, consider using a multilingual embedding model). For more information on the currently supported embedding models, see the documentation [here](https://cloud.google.com/vertex-ai/generative-ai/docs/rag-overview#supported-embedding-models).

### Evaluate Response Quality
* If you want to evaluate response quality (generated answers) in addition to retrieval quality, please refer to the [Gen AI Evaluation Service - RAG Evaluation Colab](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaluate_rag_gen_ai_evaluation_service_sdk.ipynb).


# 4. Cleaning up (Delete RagCorpus)

Once we are done with evaluation, we should clean up the RagCorpus to free up resources since we don't need it anymore.

In [None]:
rag.delete_corpus(rag_corpus.name)