In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Using "task type" embeddings for improving RAG search quality

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/embeddings/task-type-embedding.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fembeddings%2Ftask-type-embedding.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/embeddings/task-type-embedding.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/embeddings/task-type-embedding.ipynb">
      <img width="32px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>


| | |
|-|-|
|Author(s) | [Kaz Sato](https://github.com/kazunori279) |

## Why can't my RAG find relevant documents?

Retrieval Augmented Generation (RAG) is becoming a very popular architecture pattern for LLM grounding: it retrieves relevant business data and feeds it to the LLM to suppress hallucinations. But many RAG engineers struggle to build a production-quality retrieval engine, and the search quality of RAG systems is becoming a common challenge. In many cases, this happens when you use "plain vanilla" text embeddings with a vector database for a simple similarity search. This is actually a common flaw in information retrieval system design, one that Google has been solving for its search services for over a decade.

In this tutorial, we'll learn what Google has learned, and how the new "task type" embeddings in Vertex AI can provide a quick way to improve the search quality of your RAG system.

## Questions and answers are not similar

When building Retrieval Augmented Generation (RAG) systems, most designs we've seen use text embedding and vector search to conduct a [semantic similarity search](https://cloud.google.com/blog/products/ai-machine-learning/how-to-use-grounding-for-your-llms-with-text-embeddings). In many cases, this leads to degraded search quality, because of the "question is not the answer" problem. For example, a question like "Why is the sky blue?" and its answer, "The scattering of sunlight causes the blue color", have distinctly different meanings as separate statements. Likewise, if you ask "What's the best birthday present for my kid?" to a RAG system with a similarity search, it would be hard to find items like "Nintendo Switch" or "Lego sets", as those answers are not semantically similar to the question.

![](https://storage.googleapis.com/github-repo/embeddings/task_type_embeddings/1.%20task_type_1.png)

Technically speaking, there is a major gap between the distribution of question embeddings and the distribution of answer embeddings, so a simple similarity search doesn't work well. To fill the gap, you need an AI/ML model that learns the relationship between queries and answers.

### The history of "filling the gap"

This is actually a classic problem of semantic search for information retrieval, and Google has a long history of optimizing semantic search quality for billions of users. Google Search started incorporating semantic search in 2015, with the introduction of [AI search innovations](https://blog.google/products/search/how-ai-powers-great-search-results/) like our deep learning ranking system RankBrain. This innovation was quickly followed by neural matching to improve the accuracy of document retrieval in Search. Neural matching uses deep learning models that learn the relationship between the query intention and relevant documents. For example, consider the search query "insights how to manage a green". If a friend asked you this, you'd probably be stumped. But with neural matching, we're able to make sense of it.

<center>
  <img src="https://storage.googleapis.com/github-repo/embeddings/task_type_embeddings/2.%20Google%20Search.png" width="300"/><br/>
Google Search can find relevant documents<br/>
for an ambiguous query "how to manage a green"
</center>
<br/>

### Dual encoder model
For developers who want to incorporate sophisticated semantic search like neural matching into their RAG system, the popular approach has been to [train a dual encoder model](https://cloud.google.com/blog/products/ai-machine-learning/scaling-deep-retrieval-tensorflow-two-towers-architecture) (aka two-tower) that learns the relationship between the query embeddings and answer embeddings. The following diagram depicts how a dual encoder model maps queries to relevant documents.

<br/>
<div align="center">
  <img src="https://storage.googleapis.com/github-repo/embeddings/task_type_embeddings/3.%20dual%20encoder.gif">
</div>

But it is not easy to design, train and build a production retrieval system with a customized dual encoder model. It requires data science and ML engineering experience, and significant effort to train the model on a dataset of question-and-answer pairs.

### LLM-based "Advanced RAG"
In the era of Large Language Models (LLMs), some "Advanced RAG" approaches have emerged to solve this problem. [HyDE](https://arxiv.org/abs/2212.10496) uses LLM reasoning capabilities to generate possible answer texts, and uses those texts for the similarity search. Additionally, [query expansion with LLM](https://arxiv.org/abs/2305.03653) uses LLM reasoning to expand the original query with possible answer candidates as query texts.

The downside of these LLM-based approaches is that they add LLM prediction latency and high cost to every query. While the vector search itself can finish within milliseconds and can handle thousands of queries per second at low cost, the LLM reasoning adds a few seconds of latency to every single query, at a significantly higher cost than generating embeddings.

## New "task type" embedding

Recently [Vertex AI Embeddings API](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings) launched new text embedding models, `text-embedding-004` and `text-multilingual-embedding-002` based on [the new embedding models](https://deepmind.google/research/publications/85521/) developed by the Google DeepMind and Research team.

The unique feature of these models is that they can generate optimized embeddings based on [task types](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/task-types). With task types, you can significantly reduce the time and cost for developing dual encoder models or advanced RAG systems for achieving higher search quality, for specific tasks including question and answering, document retrieval by query, fact verification, etc.
For example, to generate embeddings for question answering, all you have to do is specify the task type `QUESTION_ANSWERING` for query texts and `RETRIEVAL_DOCUMENT` for answer texts when generating their embeddings. In the embedding space optimized with the task types, the query embedding for "Why is the sky blue?" and the answer embedding for "The scattering..." are placed much closer together, because the embedding models are trained to learn that they have a question-and-answer relationship. Thus you should be able to get higher search quality when using vector databases to find the right answer for the query.

![](https://storage.googleapis.com/github-repo/embeddings/task_type_embeddings/4.%20task_type_2.png)

### Behind the scenes: LLM distillation + dual encoder

How do the new embedding models enable this? The novel part of the model design is that the Google DeepMind and Research team used [LLM distillation](https://arxiv.org/abs/1503.02531) (a process to train a smaller model from a larger model) to pre-train a dual encoder based embedding model.
Here are the distillation steps. The team used an LLM to generate two kinds of data: 1) synthetic queries and 2) query-and-document pairs. First, the team generated numerous synthetic queries from the training dataset (corpus) for each task.

![](https://storage.googleapis.com/github-repo/embeddings/task_type_embeddings/5.%20FRet_1.png)

Then, with the generated queries, the team found possibly relevant documents to create query-and-document pairs. For each pair, the LLM was used to determine whether it is an appropriate query-and-document match (a positive pair) or not (a negative pair).

![](https://storage.googleapis.com/github-repo/embeddings/task_type_embeddings/6.%20FRet_2.png)

With those pairs, the team trained a dual-encoder based embedding model. The result is an embedding model that inherits the reasoning capability of the LLM as a distilled model for the specific tasks. It is significantly smaller, faster and less expensive for generating embeddings, and yet it provides smart matching between queries and relevant documents.

If you are interested in more details of the model design and other performance comparison results, check out [the paper](https://deepmind.google/research/publications/85521/) published by the team.

### Supported task types

The new embedding models support the following task types:

|Task type|Embeddings optimization criteria|
|:-|:-|
|`SEMANTIC_SIMILARITY`|Semantic similarity. Use this task type when retrieving similar texts from the corpus.|
|`RETRIEVAL_QUERY`|Document search and information retrieval. Use `RETRIEVAL_QUERY` for query texts, and `RETRIEVAL_DOCUMENT` for documents to be retrieved.|
|`QUESTION_ANSWERING`|Questions and answers applications such as RAG. Use `QUESTION_ANSWERING` for question texts, and `RETRIEVAL_DOCUMENT` for documents to be retrieved.|
|`FACT_VERIFICATION`|Document search for fact verification. Use `FACT_VERIFICATION` for the target text, and `RETRIEVAL_DOCUMENT` for documents to be retrieved.|
|`CODE_RETRIEVAL_QUERY`|Code search. Use `CODE_RETRIEVAL_QUERY` for query text, and `RETRIEVAL_DOCUMENT` for code blocks to be retrieved (available on embedding model `text-embedding-preview-0815` and later).|
|`CLASSIFICATION`|Text classification. Use this task type for training a small classification model with the embedding.|
|`CLUSTERING`|Text clustering. Use this task type for k-means or other clustering analysis.|

For example, if you are building a RAG system for a question answering use case, you may specify the task type `RETRIEVAL_DOCUMENT` when generating embeddings for the documents you index with vector search, and specify `QUESTION_ANSWERING` when generating embeddings for question texts. You should then see improved search quality compared to using `SEMANTIC_SIMILARITY` for both query and document. Likewise, you may use `RETRIEVAL_QUERY` for document search queries, and `FACT_VERIFICATION` for queries that look up documents for fact checking.
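The pairing rules in the table can be summarized as a small lookup. The sketch below is only an illustration: the names `ASYMMETRIC_TASK_TYPES`, `SYMMETRIC_TASK_TYPES`, and `task_type_pair()` are hypothetical helpers and not part of the Vertex AI SDK, and treating `SEMANTIC_SIMILARITY`, `CLASSIFICATION`, and `CLUSTERING` as "same type on every text" is an assumption based on how the table describes them.

```python
# Query-side task type -> document-side task type, following the table above.
# (Hypothetical helper names; not part of the Vertex AI SDK.)
ASYMMETRIC_TASK_TYPES: dict[str, str] = {
    "RETRIEVAL_QUERY": "RETRIEVAL_DOCUMENT",
    "QUESTION_ANSWERING": "RETRIEVAL_DOCUMENT",
    "FACT_VERIFICATION": "RETRIEVAL_DOCUMENT",
    "CODE_RETRIEVAL_QUERY": "RETRIEVAL_DOCUMENT",
}

# Assumption: these task types use the same type for every text.
SYMMETRIC_TASK_TYPES: set[str] = {
    "SEMANTIC_SIMILARITY",
    "CLASSIFICATION",
    "CLUSTERING",
}


def task_type_pair(query_task_type: str) -> tuple[str, str]:
    """Return (task type for queries, task type for indexed documents)."""
    if query_task_type in ASYMMETRIC_TASK_TYPES:
        return query_task_type, ASYMMETRIC_TASK_TYPES[query_task_type]
    if query_task_type in SYMMETRIC_TASK_TYPES:
        return query_task_type, query_task_type
    raise ValueError(f"Unknown task type: {query_task_type}")


print(task_type_pair("QUESTION_ANSWERING"))
# ('QUESTION_ANSWERING', 'RETRIEVAL_DOCUMENT')
```

Keeping the pairing in one place makes it harder to index documents with one task type and then query them with a mismatched one.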

Beyond the document retrieval and Q&A use cases, the embedding models also support optimized embeddings for text classification and clustering. Embeddings with the task type `CLASSIFICATION` are useful for classifying texts by their semantics, for use cases such as customer and product segmentation.


# Task type embeddings in Action

In the following sections, we will test how task types improve search quality in a question answering use case.

## Setup

Let's start setting up the SDK and environment variables.

### Install Python SDK
This tutorial uses the Vertex AI SDK for Python.

In [None]:
%pip install --upgrade --quiet --user google-cloud-aiplatform

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [2]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and initialize Vertex AI SDK

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [3]:
import subprocess

# get project ID
PROJECT_ID: str = (
    subprocess.check_output(["gcloud", "config", "get-value", "project"])
    .decode()
    .strip()
)
LOCATION = "us-central1"
if PROJECT_ID == "(unset)":
    print("Please set the project ID manually below")

In [4]:
# define project information
if PROJECT_ID == "(unset)":
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

## Try the task types

Now we're ready to try the task types. Let's get started with some simple examples.

### Generate embeddings with task types

Here we define a wrapper function `get_embeddings()` to make it easier to call the Embeddings API. To specify the task type when generating an embedding, wrap each text in `TextEmbeddingInput(text, task_type)`.

In [56]:
from typing import Any

import vertexai
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

# init the vertexai package
vertexai.init(project=PROJECT_ID, location=LOCATION)

MODEL_NAME = "text-embedding-004"


def get_embeddings(texts: list[str], task_type: str) -> list[list[float]]:
    """A wrapper to get text embeddings for a list of texts"""
    model: TextEmbeddingModel = TextEmbeddingModel.from_pretrained(MODEL_NAME)
    inputs: list[TextEmbeddingInput] = [
        TextEmbeddingInput(text, task_type) for text in texts
    ]
    embs = model.get_embeddings(inputs)
    return [emb.values for emb in embs]

### Why is the sky blue?

Using this function, generate embeddings with the task type `SEMANTIC_SIMILARITY` for a question and two possible answers. Then, print the similarities between them using the [`cosine_similarity()` function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) from scikit-learn.

In [31]:
from sklearn.metrics.pairwise import cosine_similarity

QUESTION = "Why is the sky blue?"
ANSWER_1 = "The sky is blue today"
ANSWER_2 = "The scattering of sunlight causes the blue color"

q_emb = get_embeddings([QUESTION], "SEMANTIC_SIMILARITY")
a1_emb = get_embeddings([ANSWER_1], "SEMANTIC_SIMILARITY")
a2_emb = get_embeddings([ANSWER_2], "SEMANTIC_SIMILARITY")

print(cosine_similarity(q_emb, a1_emb))
print(cosine_similarity(q_emb, a2_emb))

[[0.82764803]]
[[0.81540485]]


Since the embedding space for this task type is organized to measure semantic similarity, the first answer (which doesn't answer the question at all) naturally has a higher similarity than the second answer (which is the correct answer).
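For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between embeddings rather than their magnitude. The following is a minimal pure-Python sketch of that formula on toy two-dimensional vectors; the function name `cosine()` is a hypothetical helper for illustration, and the tutorial itself keeps using scikit-learn's `cosine_similarity()`.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (not real embeddings)
print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```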

Now, let's try task type `QUESTION_ANSWERING`:

In [32]:
q_emb = get_embeddings([QUESTION], "QUESTION_ANSWERING")
a1_emb = get_embeddings([ANSWER_1], "RETRIEVAL_DOCUMENT")
a2_emb = get_embeddings([ANSWER_2], "RETRIEVAL_DOCUMENT")

print(cosine_similarity(q_emb, a1_emb))
print(cosine_similarity(q_emb, a2_emb))

[[0.60361005]]
[[0.66949501]]


This time, the second answer gets a higher similarity than the first one (about 0.67 versus 0.60), because the embedding space is organized to capture question-and-answer relationships.

### Taylor Swift or Lego set?

Try another example:

In [46]:
QUESTION = "What is the best birthday present for my son?"
ANSWER_1 = "My son says Taylor Swift's birthday is December 13th"
ANSWER_2 = "Lego set"

q_emb = get_embeddings([QUESTION], "SEMANTIC_SIMILARITY")
a1_emb = get_embeddings([ANSWER_1], "SEMANTIC_SIMILARITY")
a2_emb = get_embeddings([ANSWER_2], "SEMANTIC_SIMILARITY")

print(cosine_similarity(q_emb, a1_emb))
print(cosine_similarity(q_emb, a2_emb))

[[0.61688392]]
[[0.56495122]]


The model does its job properly - it picks the first answer as the sentence most similar to the question. But that's not what we want.

In [47]:
q_emb = get_embeddings([QUESTION], "QUESTION_ANSWERING")
a1_emb = get_embeddings([ANSWER_1], "RETRIEVAL_DOCUMENT")
a2_emb = get_embeddings([ANSWER_2], "RETRIEVAL_DOCUMENT")

print(cosine_similarity(q_emb, a1_emb))
print(cosine_similarity(q_emb, a2_emb))

[[0.44911438]]
[[0.55942861]]


With the task type `QUESTION_ANSWERING` specified, the model now understands it should find an appropriate answer to the question, not a similar sentence. It chooses the Lego set over Taylor Swift's birthday.

# Evaluate the search quality of task types

The results above were interesting. Now, let's evaluate the search quality with the task types with a larger dataset.

## Download NQ-Open dataset

As a test dataset, we will use 1K rows sampled from the [NQ-Open](https://huggingface.co/datasets/google-research-datasets/nq_open) dataset provided by the Google Research team. The NQ-Open dataset can be easily downloaded from Hugging Face Datasets:

In [50]:
# disable Hugging Face Hub telemetry before downloading the dataset
import os

os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"

In [51]:
import pandas as pd

# download NQ-Open dataset
splits = {
    "train": "nq_open/train-00000-of-00001.parquet",
    "validation": "nq_open/validation-00000-of-00001.parquet",
}
nq_open_df = pd.read_parquet(
    "hf://datasets/google-research-datasets/nq_open/" + splits["train"]
)

# sampling for 1K rows
nq_open_df = nq_open_df.sample(n=1000).reset_index(drop=True)
nq_open_df.head()

| |question|answer|
|-:|:-|:-|
|0|who are the goo goo dolls touring with|[Phillip Phillips]|
|1|what team has the most wins against alabama|[Tennessee]|
|2|who played tara in buffy the vampire slayer|[Amber Nicole Benson]|
|3|who was the british general who surrendered at...|[Charles Cornwallis]|
|4|who sings the song oh what a night|[the Four Seasons]|


The sampled DataFrame contains 1K question-and-answer pairs on a wide variety of topics, such as the question "what is the solid outer layer of the earth called" and its answer "The crust".

In this tutorial, we will use 1K rows to measure the search quality. Increasing the sample to 10K rows gives better results, but the tutorial would then take about 30 minutes to finish.

## Calculating Q&A search quality with `SEMANTIC_SIMILARITY`

First, let's test the task type `SEMANTIC_SIMILARITY` as a baseline.

Using the `get_embeddings()` function above, we define another function `generate_embs()` that takes a DataFrame of questions and answers and generates embeddings for them with the specified task types. The embeddings are stored in the new columns `q_emb` and `a_emb`.

In [53]:
from tqdm import tqdm  # to show the progress bar


def generate_embs(
    df: pd.DataFrame, question_task_type: str, answer_task_type: str
) -> pd.DataFrame:
    """Generate text embeddings for questions and answers"""

    q_embs: list[list[float]] = []
    a_embs: list[list[float]] = []

    for i in tqdm(range(0, len(df), 5)):  # bundle 5 items for each API call
        embs = get_embeddings(
            df.loc[i : i + 4, "question"].tolist(), question_task_type
        )
        q_embs.extend(embs)

        answers: list[Any] = [row[0] for row in df.loc[i : i + 4, "answer"]]
        embs = get_embeddings(answers, answer_task_type)
        a_embs.extend(embs)

    df["q_emb"] = q_embs
    df["a_emb"] = a_embs
    return df

We're ready to start generating the embeddings for the semantic similarity task. The following code may take a few minutes to finish generating embeddings for the 1K rows.

In [54]:
# generate embeddings using SEMANTIC_SIMILARITY for both question and answer
similarity_df = nq_open_df.copy()
generate_embs(similarity_df, "SEMANTIC_SIMILARITY", "SEMANTIC_SIMILARITY")
similarity_df

100%|██████████| 200/200 [02:00<00:00,  1.65it/s]


| |question|answer|q_emb|a_emb|
|-:|:-|:-|:-|:-|
|0|who are the goo goo dolls touring with|[Phillip Phillips]|[0.015478329733014107, -0.012187072075903416, ...|[-0.05240778252482414, 0.034256916493177414, -...|
|1|what team has the most wins against alabama|[Tennessee]|[0.016561519354581833, 0.04496956244111061, -0...|[0.02570470981299877, 0.035091277211904526, 0....|
|2|who played tara in buffy the vampire slayer|[Amber Nicole Benson]|[-0.0050974395126104355, 0.02355935052037239, ...|[-0.017453117296099663, 0.057975661009550095, ...|
|3|who was the british general who surrendered at...|[Charles Cornwallis]|[-0.019962364807724953, 0.007353120017796755, ...|[-0.07644159346818924, 0.015994466841220856, -...|
|4|who sings the song oh what a night|[the Four Seasons]|[0.00897395983338356, 0.010980166494846344, -0...|[0.013994764536619186, 0.018888963386416435, 0...|
|...|...|...|...|...|
|995|what is the condition when a person dies witho...|[Intestate]|[-0.006326922215521336, 0.04302241653203964, 0...|[-0.03499693050980568, 0.03457438945770264, -0...|
|996|where does the perks of being a wallflower tak...|[Pittsburgh]|[-0.0195462629199028, 0.03374296426773071, -0....|[-0.014027892611920834, 0.03729143366217613, -...|
|997|when is the new season of total divas coming out|[in fall 2018]|[-0.02207242324948311, -0.028120234608650208, ...|[-0.007370626088231802, 0.024703267961740494, ...|
|998|when does rick and morty ep 7 air|[March 10, 2014]|[0.0029733579140156507, 0.014099868945777416, ...|[0.011575975455343723, -0.04454023763537407, -...|


### Calculate cosine similarities between questions and answers

Next, we will calculate the cosine similarity for all combinations of questions and answers using the `cosine_similarity()` function from scikit-learn. The `calc_cosine_similarities()` function returns a two-dimensional array of similarities between the 1K questions and the 1K answers. Each row holds the 1K values that represent the cosine similarity between one question and all 1K answers; the cell below prints the rows for the first five questions.

In [67]:
def calc_cosine_similarities(df: pd.DataFrame) -> list[list[float]]:
    """Calculates similarities for all combinations of the questions and answers"""
    q_embs: list[list[float]] = df["q_emb"].tolist()
    a_embs: list[list[float]] = df["a_emb"].tolist()
    return cosine_similarity(q_embs, a_embs)


# print the answer similarities for the first five questions
q_and_a_similarities = calc_cosine_similarities(similarity_df)
q_and_a_similarities[:5]

array([[0.48812594, 0.40821951, 0.37326078, ..., 0.37413833, 0.41819645,
        0.38734404],
       [0.40231459, 0.54912763, 0.41487358, ..., 0.39853956, 0.41158954,
        0.35762238],
       [0.35498545, 0.42271897, 0.51284914, ..., 0.37637304, 0.4078475 ,
        0.45110982],
       [0.44713429, 0.3888597 , 0.37808677, ..., 0.41129379, 0.39207029,
        0.44731761],
       [0.48705845, 0.40621167, 0.36234445, ..., 0.39593297, 0.40667984,
        0.38146532]])

### Calculate MRR

Finally, we will measure the search quality using [Mean Reciprocal Rank (MRR)](https://en.wikipedia.org/wiki/Mean_reciprocal_rank). It is a simple metric - the table from the Wikipedia page gives an example:

![](https://storage.googleapis.com/github-repo/embeddings/task_type_embeddings/mrr_example.png)

In this tutorial, we will use the `answer` column of the dataset as the correct response for the `question` column. Thus, if the similarity between the question and answer is ranked as 3rd in the similarity list, reciprocal rank is 1/3.
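As a concrete illustration of the metric, here is a minimal sketch that computes MRR for three made-up queries. The data and the helper name `reciprocal_rank()` are hypothetical and exist only for this example; each list marks with a 1 the position at which the correct answer appears in the ranked results.

```python
def reciprocal_rank(ranked_relevance: list[int]) -> float:
    """Return 1/rank of the first correct item, or 0.0 if none is correct."""
    for position, relevance in enumerate(ranked_relevance, start=1):
        if relevance == 1:
            return 1.0 / position
    return 0.0


# Hypothetical toy data: one list per query, 1 marks the correct answer
toy_rankings = [
    [0, 0, 1],  # correct answer ranked 3rd -> 1/3
    [0, 1, 0],  # correct answer ranked 2nd -> 1/2
    [1, 0, 0],  # correct answer ranked 1st -> 1
]

toy_mrr = sum(reciprocal_rank(r) for r in toy_rankings) / len(toy_rankings)
print(round(toy_mrr, 4))  # (1/3 + 1/2 + 1) / 3 = 0.6111
```

An MRR of 1.0 would mean the correct answer is always ranked first, while values near 0 mean it is usually ranked low or missing from the list.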

To rank the answers, we define a `get_top100_similar_answers()` function that returns the indices of the 100 most similar answers, sorted by similarity.

In [57]:
def get_top100_similar_answers(answer_similarities: list[float]) -> list[int]:
    """Extracts top 100 similar answers"""
    sim_index: list[tuple[int, float]] = list(enumerate(answer_similarities))
    sorted_sim_index: list[tuple[int, float]] = sorted(
        sim_index, key=lambda x: x[1], reverse=True
    )
    sorted_index: list[int] = [index for index, _ in sorted_sim_index]
    return sorted_index[:100]


# print the indices of the 10 most similar answers for the first question
get_top100_similar_answers(q_and_a_similarities[0])[:10]

[676, 6, 45, 426, 115, 620, 582, 982, 94, 704]

With this top 100 answer list for each question, we will add a `query_ranks` column to the DataFrame that holds a list like `[0 0 1 0 0...]`, with a 1 wherever the listed answer matches the correct one. In this example, the correct response is ranked 3rd.

In [58]:
# mypy: disable-error-code="index"


def add_query_ranks(df: pd.DataFrame) -> pd.DataFrame:
    """Adds a 'query_ranks' column to the DataFrame"""

    # Calculate similarities for all combinations of questions and answers
    q_and_a_similarities: list[list[float]] = calc_cosine_similarities(df)

    # Get query ranks for all questions
    query_ranks_for_all_questions: list[list[int]] = []
    for i in tqdm(range(len(q_and_a_similarities))):
        # For each question, get ground truth and top 100 similar answers
        # The 'answer' column holds a list; use its first item as ground truth
        ground_truth: Any = df.loc[i, "answer"][0]
        similar_answers: list[int] = get_top100_similar_answers(q_and_a_similarities[i])

        # If an answer equals the ground truth, put "1" in the query ranks, otherwise "0"
        query_ranks: list[int] = []
        for index in similar_answers:
            similar_answer: Any = df.loc[index, "answer"][0]
            if similar_answer == ground_truth:
                query_ranks.append(1)
            else:
                query_ranks.append(0)
        query_ranks_for_all_questions.append(query_ranks)

    # Add the query ranks to the dataframe
    df["query_ranks"] = query_ranks_for_all_questions
    return df

In [59]:
# add query_ranks column to the similarity_df
add_query_ranks(similarity_df)
similarity_df.head()

100%|██████████| 1000/1000 [00:02<00:00, 493.25it/s]


| |question|answer|q_emb|a_emb|query_ranks|
|-:|:-|:-|:-|:-|:-|
|0|who are the goo goo dolls touring with|[Phillip Phillips]|[0.015478329733014107, -0.012187072075903416, ...|[-0.05240778252482414, 0.034256916493177414, -...|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...|
|1|what team has the most wins against alabama|[Tennessee]|[0.016561519354581833, 0.04496956244111061, -0...|[0.02570470981299877, 0.035091277211904526, 0....|[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...|
|2|who played tara in buffy the vampire slayer|[Amber Nicole Benson]|[-0.0050974395126104355, 0.02355935052037239, ...|[-0.017453117296099663, 0.057975661009550095, ...|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...|
|3|who was the british general who surrendered at...|[Charles Cornwallis]|[-0.019962364807724953, 0.007353120017796755, ...|[-0.07644159346818924, 0.015994466841220856, -...|[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...|
|4|who sings the song oh what a night|[the Four Seasons]|[0.00897395983338356, 0.010980166494846344, -0...|[0.013994764536619186, 0.018888963386416435, 0...|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...|


With the `query_ranks` column in place, we can calculate the MRR value for the dataset.

In [60]:
def calc_mrr(df: pd.DataFrame) -> float:
    """Calculates MRR from the query ranks"""

    query_ranks_for_all_questions: list[list[int]] = df["query_ranks"].tolist()
    reciprocal_ranks: list[float] = []

    for query_ranks in query_ranks_for_all_questions:
        for i, relevance in enumerate(query_ranks):
            if relevance == 1:
                reciprocal_ranks.append(1 / (i + 1))
                break
        else:  # This else is associated with the for loop, executed if no break occurred
            reciprocal_ranks.append(0)

    return sum(reciprocal_ranks) / len(query_ranks_for_all_questions)

In [61]:
mrr = calc_mrr(similarity_df)
mrr

0.2834443053591409

You should see a value around 0.3 with this task type.

## Calculating Q&A search quality with `QUESTION_ANSWERING`

Next, let's test with the task type `QUESTION_ANSWERING`.

Start by generating embeddings for the same dataset. This time, we will use the task type `QUESTION_ANSWERING` for the questions and `RETRIEVAL_DOCUMENT` for the answers. This will take a few minutes to finish.

In [62]:
# generate embeddings using QUESTION_ANSWERING for question and
# RETRIEVAL_DOCUMENT for answer
q_and_a_df = nq_open_df.copy()
generate_embs(q_and_a_df, "QUESTION_ANSWERING", "RETRIEVAL_DOCUMENT")
q_and_a_df

100%|██████████| 200/200 [02:21<00:00,  1.42it/s]


| |question|answer|q_emb|a_emb|
|-:|:-|:-|:-|:-|
|0|who are the goo goo dolls touring with|[Phillip Phillips]|[0.04009374976158142, -0.023044785484671593, -...|[-0.004477624781429768, -0.009845536202192307,...|
|1|what team has the most wins against alabama|[Tennessee]|[0.0355888232588768, 0.019433049485087395, 0.0...|[0.07419366389513016, 0.054128434509038925, 0....|
|2|who played tara in buffy the vampire slayer|[Amber Nicole Benson]|[0.018368709832429886, 0.007910536602139473, -...|[0.031061062589287758, 0.013845161534845829, -...|
|3|who was the british general who surrendered at...|[Charles Cornwallis]|[0.018189506605267525, -0.0029289748053997755,...|[0.000980065786279738, -0.010285453870892525, ...|
|4|who sings the song oh what a night|[the Four Seasons]|[0.0233558751642704, -0.009065551683306694, -0...|[0.026310211047530174, 0.0006469273357652128, ...|
|...|...|...|...|...|
|995|what is the condition when a person dies witho...|[Intestate]|[0.028344353660941124, 0.030213801190257072, 0...|[0.02063932828605175, 0.04471667483448982, -0....|
|996|where does the perks of being a wallflower tak...|[Pittsburgh]|[0.009749174118041992, 0.024704810231924057, -...|[0.0412089079618454, 0.05075874924659729, -0.0...|
|997|when is the new season of total divas coming out|[in fall 2018]|[-0.0029080261010676622, -0.06319431215524673,...|[9.415245585842058e-05, 0.020825795829296112, ...|
|998|when does rick and morty ep 7 air|[March 10, 2014]|[0.0316818468272686, -0.0025976563338190317, -...|[0.01649138517677784, 0.005796652287244797, -0...|


Then, add the `query_ranks` column for calculating MRR.

In [63]:
add_query_ranks(q_and_a_df)
q_and_a_df.head()

100%|██████████| 1000/1000 [00:02<00:00, 482.52it/s]


| |question|answer|q_emb|a_emb|query_ranks|
|-:|:-|:-|:-|:-|:-|
|0|who are the goo goo dolls touring with|[Phillip Phillips]|[0.04009374976158142, -0.023044785484671593, -...|[-0.004477624781429768, -0.009845536202192307,...|[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...|
|1|what team has the most wins against alabama|[Tennessee]|[0.0355888232588768, 0.019433049485087395, 0.0...|[0.07419366389513016, 0.054128434509038925, 0....|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...|
|2|who played tara in buffy the vampire slayer|[Amber Nicole Benson]|[0.018368709832429886, 0.007910536602139473, -...|[0.031061062589287758, 0.013845161534845829, -...|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...|
|3|who was the british general who surrendered at...|[Charles Cornwallis]|[0.018189506605267525, -0.0029289748053997755,...|[0.000980065786279738, -0.010285453870892525, ...|[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...|
|4|who sings the song oh what a night|[the Four Seasons]|[0.0233558751642704, -0.009065551683306694, -0...|[0.026310211047530174, 0.0006469273357652128, ...|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...|


Finally, calculate MRR for the task type `QUESTION_ANSWERING`.

In [64]:
mrr = calc_mrr(q_and_a_df)
mrr

0.38180816343499935

You should see an MRR value around 0.4. In the run shown here (0.38 versus 0.28), that is a relative increase of roughly 35% over the task type `SEMANTIC_SIMILARITY`.
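The size of the improvement can be checked with a line of arithmetic on the two MRR values printed above. This is only a sketch for the particular sample shown in this notebook; the variable names are chosen for the example, and because the 1K rows are sampled randomly, your own numbers will differ.

```python
# MRR values copied from the outputs above (they vary from run to run)
mrr_semantic_similarity = 0.2834443053591409
mrr_question_answering = 0.38180816343499935

# Relative gain of QUESTION_ANSWERING over the SEMANTIC_SIMILARITY baseline
relative_gain = (
    mrr_question_answering - mrr_semantic_similarity
) / mrr_semantic_similarity
print(f"Relative MRR gain: {relative_gain:.1%}")  # roughly 35% for this sample
```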

# Cleaning up

This tutorial doesn't use any persistent Google Cloud resources, so there is nothing to delete when cleaning up.

# Conclusion

As we saw, Google has been tackling this issue for many years to optimize the semantic search quality of its largest services, and that knowledge and expertise is now packaged as easy-to-use "task type" embedding models.

In many cases, you should see around a 30% - 40% increase in the MRR value by changing from `SEMANTIC_SIMILARITY` to `QUESTION_ANSWERING`. Using an appropriate task type for each requirement may improve the user experience of your gen AI applications.

### When to use embedding tuning
One caveat of task type embeddings is that the model is trained on a generic corpus from the web. That means the model may be less effective for use cases such as retrieving proprietary company documents, or answering questions on topics that the model doesn't know. In such cases, [tuning text embeddings](https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-embeddings) could be a better option than using the pre-trained model.

### Getting Started
To get started with task type embeddings and the new embedding models, see the following resources:

- [Choose an embeddings task type](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/task-types)
- [Get text embeddings](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings)
- [Overview of Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview)
