In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Using "task type" embeddings for improving RAG search quality

## Why can't my RAG find relevant documents?

Retrieval Augmented Generation (RAG) is becoming a very popular architecture pattern for LLM grounding: it retrieves relevant business data and feeds it to the LLM to suppress hallucinations. But many RAG engineers struggle to build a production-quality retrieval engine, and the search quality of RAG systems is becoming a common challenge. In many cases, this happens when you use "plain vanilla" text embeddings with a vector database for a simple similarity search. This is actually a common flaw in information retrieval system design, one that Google has been solving for our search services for over a decade.

In this tutorial, we'll look at what Google has learned, and how the new "task type" embeddings in Vertex AI can provide a quick way to optimize the search quality of your RAG system.

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/embeddings/task-type-embedding.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fembeddings%2Ftask-type-embedding.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/embeddings/task-type-embedding.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/embeddings/task-type-embedding.ipynb">
      <img width="32px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

## Questions and answers are not similar

When building Retrieval Augmented Generation (RAG) systems, most designs we've seen use text embedding and vector search to conduct a [semantic similarity search](https://cloud.google.com/blog/products/ai-machine-learning/how-to-use-grounding-for-your-llms-with-text-embeddings). In many cases, this leads to degraded search quality, because of the "question is not the answer" problem. For example, a question like "Why is the sky blue?" and its answer, "The scattering of sunlight causes the blue color", have distinctly different meanings as separate statements. Likewise, if you ask "What's the best birthday present for my kid?" to a RAG system with a similarity search, it would be hard to find items like "Nintendo Switch" or "Lego sets", as those answers are not semantically similar to the question.

![](https://storage.googleapis.com/github-repo/embeddings/task_type_embeddings/1.%20task_type_1.png)

Technically speaking, there is a major gap between the distribution of question embeddings and the distribution of answer embeddings, so a simple similarity search doesn't work well. To fill the gap, you need an AI/ML model that learns the relationship between queries and answers.

### The history of "filling the gap"

This is actually a classic problem of semantic search for information retrieval, and Google has a long history of optimizing semantic search quality for billions of users. Google Search started incorporating semantic search in 2015, with the introduction of [AI search innovations](https://blog.google/products/search/how-ai-powers-great-search-results/) like our deep learning ranking system RankBrain. This innovation was quickly followed by neural matching to improve the accuracy of document retrieval in Search. Neural matching contains deep learning models that learn the relationship between the query intention and relevant documents. For example, take a search for "insights how to manage a green". If a friend asked you this, you'd probably be stumped. But with neural matching, we're able to make sense of it.

<center>
  <img src="https://storage.googleapis.com/github-repo/embeddings/task_type_embeddings/2.%20Google%20Search.png" width="300"/><br/>
Google Search can find relevant documents<br/>
for an ambiguous query "how to manage a green"
</center>
<br/>

### Dual encoder model
For developers who want to incorporate sophisticated semantic search like neural matching into their RAG system, the popular approach has been to [train a dual encoder model](https://cloud.google.com/blog/products/ai-machine-learning/scaling-deep-retrieval-tensorflow-two-towers-architecture) (aka two-tower) that learns the relationship between query embeddings and answer embeddings. The following diagram depicts how a dual encoder model maps queries to relevant documents.

<br/>
<div align="center">
  <img src="https://storage.googleapis.com/github-repo/embeddings/task_type_embeddings/3.%20dual%20encoder.gif">
</div>

But it is not easy to design, train and build a production retrieval system with a customized dual encoder model. It requires data science and ML engineering experience, and significant effort to prepare a training dataset of question and answer pairs.

### LLM-based "Advanced RAG"
In the era of Large Language Models (LLMs), some "Advanced RAG" approaches emerged to solve this problem. [HyDE](https://arxiv.org/abs/2212.10496) uses LLM reasoning capabilities to generate possible answer texts, and uses those texts for similarity search. Additionally, [query expansion with LLM](https://arxiv.org/abs/2305.03653) uses LLM reasoning to expand the original query with possible answer candidates as query texts.

The downside of these LLM-based approaches is that they add LLM prediction latency and high cost to every query. While the vector search itself can finish within milliseconds and can handle thousands of queries per second at low cost, the LLM reasoning adds a few seconds of latency to every single query, at a significantly higher cost than generating embeddings.

## New "task type" embedding

Recently, the [Vertex AI Embeddings API](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings) launched new text embedding models, `text-embedding-004` and `text-multilingual-embedding-002`, based on [the new embedding models](https://deepmind.google/research/publications/85521/) developed by the Google DeepMind and Research team.

The unique feature of these models is that they can generate optimized embeddings based on [task types](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/task-types). With task types, you can significantly reduce the time and cost of developing dual encoder models or advanced RAG systems to achieve higher search quality for specific tasks, including question answering, document retrieval by query, fact verification, etc.
For example, to generate embeddings for question answering, all you have to do is specify the task type `QUESTION_ANSWERING` for query texts and `RETRIEVAL_DOCUMENT` for answer texts when generating their embeddings. In the embedding space optimized with the task types, the embedding of the query "Why is the sky blue?" and the embedding of the answer "The scattering..." will be placed much closer together, because the embedding models are trained to learn that they have a question and answer relationship. Thus you should be able to get higher search quality when using vector databases to find the right answer for the query.

![](https://storage.googleapis.com/github-repo/embeddings/task_type_embeddings/4.%20task_type_2.png)

### Behind the scenes: LLM distillation + dual encoder

How do the new embedding models enable this? The novel part of the model design is that the Google DeepMind and Research team used [LLM distillation](https://arxiv.org/abs/1503.02531) (a process to train a smaller model from a larger model) to pre-train a dual encoder based embedding model.
Here are the distillation steps. The team used an LLM to generate two kinds of data: 1) synthetic queries and 2) query-and-document pairs. First, the team generated numerous synthetic queries from the training dataset (corpus) for each task.

![](https://storage.googleapis.com/github-repo/embeddings/task_type_embeddings/5.%20FRet_1.png)

Then, with the generated queries, the team found possibly relevant documents to create query-and-document pairs. For those pairs, the team used the LLM to determine whether each pair is an appropriate query-and-document match (positive pair) or not (negative pair).

![](https://storage.googleapis.com/github-repo/embeddings/task_type_embeddings/6.%20FRet_2.png)

With those pairs, the team trained a dual-encoder based embedding model. The result is an embedding model that inherits the reasoning capability of the LLM as a distilled model for the specific tasks. It is significantly smaller, faster and less expensive for generating embeddings, and yet it provides smart matching between queries and relevant documents.

If you are interested in more details of the model design and other performance comparison results, check out [the paper](https://deepmind.google/research/publications/85521/) published by the team.

### Supported task types

The new embedding models support the following task types:

|Task type|Embeddings optimization criteria|
|:-|:-|
|`SEMANTIC_SIMILARITY`|Semantic similarity. Use this task type when retrieving similar texts from the corpus.|
|`RETRIEVAL_QUERY`|Document search and information retrieval. Use `RETRIEVAL_QUERY` for query texts, and `RETRIEVAL_DOCUMENT` for documents to be retrieved.|
|`QUESTION_ANSWERING`|Questions and answers applications such as RAG. Use `QUESTION_ANSWERING` for question texts, and `RETRIEVAL_DOCUMENT` for documents to be retrieved.|
|`FACT_VERIFICATION`|Document search for fact verification. Use `FACT_VERIFICATION` for the target text, and `RETRIEVAL_DOCUMENT` for documents to be retrieved.|
|`CODE_RETRIEVAL_QUERY`|Code search. Use `CODE_RETRIEVAL_QUERY` for query text, and `RETRIEVAL_DOCUMENT` for code blocks to be retrieved (available on embedding model `text-embedding-preview-0815` and later)|
|`CLASSIFICATION`|Text classification. Use this task type for training a small classification model with the embedding.|
|`CLUSTERING`|Text clustering. Use this task type for k-means or other clustering analysis.|

For example, if you are building a RAG system for a question answering use case, you may specify task type `RETRIEVAL_DOCUMENT` when generating embeddings for the documents indexed in vector search, and specify `QUESTION_ANSWERING` when generating embeddings for question texts. You should then see improved search quality compared to using `SEMANTIC_SIMILARITY` for both query and document. Likewise, you may use `RETRIEVAL_QUERY` for queries in document search, and `FACT_VERIFICATION` for queries that look for documents for fact checking.
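
The query and document pairings in the table can be captured in a small lookup. The following is a minimal sketch and not part of the Vertex AI SDK: the `TASK_TYPE_PAIRS` dictionary, its use-case keys, and the `task_types_for()` helper are hypothetical names introduced for illustration. Only the task type strings come from the table above.

```python
# Hypothetical lookup: use case -> (task type for queries, task type for documents).
# The task type strings are taken from the table above; the keys are illustrative.
TASK_TYPE_PAIRS = {
    "similarity": ("SEMANTIC_SIMILARITY", "SEMANTIC_SIMILARITY"),
    "document_search": ("RETRIEVAL_QUERY", "RETRIEVAL_DOCUMENT"),
    "question_answering": ("QUESTION_ANSWERING", "RETRIEVAL_DOCUMENT"),
    "fact_verification": ("FACT_VERIFICATION", "RETRIEVAL_DOCUMENT"),
    "code_search": ("CODE_RETRIEVAL_QUERY", "RETRIEVAL_DOCUMENT"),
}


def task_types_for(use_case):
    """Return the (query, document) task types for a use case."""
    try:
        return TASK_TYPE_PAIRS[use_case]
    except KeyError:
        raise ValueError(f"Unknown use case: {use_case!r}") from None


query_type, doc_type = task_types_for("question_answering")
print(query_type, doc_type)  # QUESTION_ANSWERING RETRIEVAL_DOCUMENT
```

Keeping the pairing in one place makes it harder to accidentally embed queries and documents with mismatched task types.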

Beyond the document retrieval and Q&A use cases, the embedding models also support optimized embeddings for text classification and clustering. Embeddings with task type `CLASSIFICATION` are useful for classifying texts by their semantics, for use cases such as customer and product segmentation.


# Task type embeddings in Action

In the following sections, we will test how task types improve search quality in a question answering case.

## Setup

Let's start by setting up the SDK and environment variables.

### Install Python SDK
This tutorial uses the Vertex AI SDK for Python (`google-cloud-aiplatform`).

In [None]:
%pip install --upgrade --quiet --user google-cloud-aiplatform

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [2]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### Authenticate your notebook environment (Colab only)
If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and initialize Vertex AI SDK

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [1]:
# get project ID
PROJECT_ID = ! gcloud config get project
PROJECT_ID = PROJECT_ID[0]
LOCATION = "us-central1"
if PROJECT_ID == "(unset)":
    print("Please set the project ID manually below")

In [2]:
# define project information
if PROJECT_ID == "(unset)":
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

## Download NQ-Open dataset

As a test dataset, we will use 1K rows sampled from the [NQ-Open](https://huggingface.co/datasets/google-research-datasets/nq_open) dataset provided by the Google Research team. The NQ-Open dataset can be easily downloaded from Hugging Face Datasets:

In [4]:
import os
import pandas as pd

# download NQ-Open dataset
os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"  # to suppress the HF_TOKEN warning
splits = {
    "train": "nq_open/train-00000-of-00001.parquet",
    "validation": "nq_open/validation-00000-of-00001.parquet",
}
nq_open_df = pd.read_parquet(
    "hf://datasets/google-research-datasets/nq_open/" + splits["train"]
)

# sampling for 1K rows
nq_open_df = nq_open_df.sample(n=1000).reset_index(drop=True)
nq_open_df.head()

Unnamed: 0,question,answer
0,in will and grace who is the father of grace's...,[Leo]
1,who is the leader of the official opposition p...,[Wab Kinew]
2,tu sooraj main saanjh piyaji cast kanak real name,[Rhea Sharma]
3,the process in which the value of ∆u=0 is,[isothermal process]
4,who plays parker's friend in liv and maddie,[Herbie Jackson]


The sampled DataFrame contains 1K question and answer pairs on a wide variety of topics, such as the question "what is the solid outer layer of the earth called" and its answer "The crust".

In this tutorial, we will use 1K rows to measure the search quality. By increasing the rows to 10K you will see better results, but it would take about 30 minutes to finish the tutorial.

## Calculating Q&A search quality with SEMANTIC_SIMILARITY

First, let's test the task type SEMANTIC_SIMILARITY as a baseline.



### Generate embeddings for questions and answers

Here we define a wrapper function `get_embeddings()` to make it easier to call the Embeddings API. To specify the task type when generating an embedding, wrap each text in a `TextEmbeddingInput(text, task_type)` object.

In [5]:
from google.cloud import aiplatform
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

# init the aiplatform package
aiplatform.init(project=PROJECT_ID, location=LOCATION)

MODEL_NAME = "text-embedding-004"


# a wrapper to get text embeddings for a list of texts
def get_embeddings(texts, task_type):
    model = TextEmbeddingModel.from_pretrained(MODEL_NAME)
    inputs = [TextEmbeddingInput(text, task_type) for text in texts]
    embs = model.get_embeddings(inputs)
    return [emb.values for emb in embs]

Using this `get_embeddings()` function, we define another function `generate_embs()` that takes a DataFrame and generates embeddings for all rows, with the task types specified for questions and answers. The embeddings will be stored in new columns `q_emb` and `a_emb`.

In [6]:
from tqdm import tqdm  # to show the progress bar


# generate text embeddings for questions and answers
def generate_embs(df, question_task_type, answer_task_type):
    q_embs = []
    a_embs = []
    for i in tqdm(range(0, len(df), 5)):  # bundle 5 items for each API call
        embs = get_embeddings(df.loc[i : i + 4, "question"], question_task_type)
        q_embs.extend(embs)
        answers = [row[0] for row in df.loc[i : i + 4, "answer"]]
        embs = get_embeddings(answers, answer_task_type)
        a_embs.extend(embs)

    df["q_emb"] = q_embs
    df["a_emb"] = a_embs

We are now ready to generate the embeddings for the semantic similarity task. The following code may take a few minutes to finish generating embeddings for the 1K rows.

In [7]:
# generate embeddings using SEMANTIC_SIMILARITY for both question and answer
similarity_df = nq_open_df.copy()
generate_embs(similarity_df, "SEMANTIC_SIMILARITY", "SEMANTIC_SIMILARITY")
similarity_df

100%|██████████| 200/200 [01:16<00:00,  2.62it/s]


Unnamed: 0,question,answer,q_emb,a_emb
0,in will and grace who is the father of grace's...,[Leo],"[0.018023718148469925, -0.02119533158838749, -...","[0.029674364253878593, -4.3852698581758887e-05..."
1,who is the leader of the official opposition p...,[Wab Kinew],"[-0.024005817249417305, -0.003985343035310507,...","[-0.021027375012636185, -0.010554051026701927,..."
2,tu sooraj main saanjh piyaji cast kanak real name,[Rhea Sharma],"[-0.048737287521362305, 0.022673968225717545, ...","[-0.004593866877257824, 0.004197182599455118, ..."
3,the process in which the value of ∆u=0 is,[isothermal process],"[-0.017183132469654083, 0.022296994924545288, ...","[-0.07643440365791321, 0.0066937957890331745, ..."
4,who plays parker's friend in liv and maddie,[Herbie Jackson],"[-0.04511098191142082, 0.013076890259981155, -...","[-0.03658730536699295, -0.01976088061928749, 0..."
...,...,...,...,...
995,what does matt keough do for a living,[executive for the Oakland Athletics],"[-0.038916464895009995, 0.022071899846196175, ...","[-0.002222894923761487, 0.03451692685484886, -..."
996,when was the movable type printing technique i...,[AD 1040],"[-0.06705911457538605, 0.009357360191643238, -...","[0.005251062568277121, 0.006768928840756416, -..."
997,what was sarah lancaster name in coronation st...,[Wendy Farmer],"[0.005200691055506468, 0.008783416822552681, -...","[-0.010710291564464569, 0.025595564395189285, ..."
998,using techniques that aim at keeping you safe ...,[defensive driving skills],"[0.03723779320716858, -0.03420135751366615, -0...","[0.009463907219469547, -0.003430944634601474, ..."


### Calculate cosine similarities between questions and answers

Next, we will calculate the cosine similarity for all combinations of the questions and answers using the `cosine_similarity()` function from scikit-learn. The `calc_cosine_similarities()` function returns a two-dimensional array with the similarities between 1K questions x 1K answers. By printing the result for the first question, you will see 1K values that represent the cosine similarity between the first question and each of the 1K answers.
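
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of the same computation, for illustration only: the `cosine()` and `similarity_matrix()` helpers and the toy 2-dimensional vectors are assumptions made up for this example, not the scikit-learn implementation and not real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(q_embs, a_embs):
    """Row i, column j holds the similarity of question i and answer j."""
    return [[cosine(q, a) for a in a_embs] for q in q_embs]


# Toy 2-dimensional vectors, chosen only to make the arithmetic easy to check.
questions = [[1.0, 0.0], [1.0, 1.0]]
answers = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
matrix = similarity_matrix(questions, answers)
print(len(matrix), len(matrix[0]))  # 2 3 (2 questions x 3 answers)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.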

In [8]:
from sklearn.metrics.pairwise import cosine_similarity


# calculate similarities for all combinations of the questions and answers
def calc_cosine_similarities(df):
    q_embs = df["q_emb"].tolist()
    a_embs = df["a_emb"].tolist()
    return cosine_similarity(q_embs, a_embs)


# print the answer similarities for the first question
q_and_a_similarities = calc_cosine_similarities(similarity_df)
q_and_a_similarities[0]

array([0.39343315, 0.37640625, 0.33152983, 0.30465271, 0.36603535,
       0.37953952, 0.38292636, 0.35593677, 0.36575594, 0.40468736,
       0.41857268, 0.32089753, 0.37006785, 0.44271376, 0.37327467,
       0.30415343, 0.39056585, 0.41512766, 0.36607489, 0.39208197,
       0.39489792, 0.35601312, 0.3973622 , 0.36340229, 0.37166937,
       0.34242905, 0.36735435, 0.31049811, 0.38714251, 0.35671507,
       0.31534748, 0.36131805, 0.35462555, 0.43285521, 0.37385569,
       0.38020557, 0.39276016, 0.26599416, 0.32889233, 0.35395607,
       0.39249028, 0.36587857, 0.39429497, 0.32399515, 0.36707224,
       0.33660027, 0.31985978, 0.43636975, 0.41249792, 0.39786981,
       0.36295599, 0.40917005, 0.36357834, 0.38312368, 0.29934683,
       0.31971568, 0.34318662, 0.38762046, 0.35435774, 0.37446872,
       0.33679314, 0.48027954, 0.38757943, 0.36915082, 0.36286353,
       0.38500496, 0.35116212, 0.41941442, 0.34657553, 0.30470581,
       0.39855073, 0.31833683, 0.32057092, 0.35527732, 0.36903

### Calculate MRR

Finally, we will measure the search quality using [Mean Reciprocal Rank (MRR)](https://en.wikipedia.org/wiki/Mean_reciprocal_rank). It is a simple metric: the average, over all queries, of the reciprocal of the rank at which the first correct answer appears. The table from the Wikipedia page gives an example:

![](https://storage.googleapis.com/github-repo/embeddings/task_type_embeddings/mrr_example.png)

In this tutorial, we will use the `answer` column of the dataset as the correct response for the `question` column. Thus, if the correct answer is ranked 3rd in the similarity list for a question, the reciprocal rank is 1/3.
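
As a small worked example of the arithmetic, the sketch below averages reciprocal ranks for an assumed set of three queries whose correct answers are ranked 3rd, 2nd and 1st. The `mean_reciprocal_rank()` helper is a hypothetical name for illustration and is separate from the functions used later in this tutorial.

```python
from fractions import Fraction


def mean_reciprocal_rank(ranks):
    """Average of 1/rank; a rank of None (answer not found) contributes 0."""
    reciprocal = [Fraction(1, r) if r else Fraction(0) for r in ranks]
    return sum(reciprocal) / len(ranks)


# Assumed example: three queries whose correct answers rank 3rd, 2nd and 1st.
mrr = mean_reciprocal_rank([3, 2, 1])
print(mrr, float(mrr))  # 11/18, roughly 0.61
```

Here (1/3 + 1/2 + 1) / 3 = 11/18. An MRR of 1.0 would mean the correct answer is always ranked first.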

To compute the ranks, we define the `get_top100_similar_answers()` function, which returns the indices of the top 100 answers sorted by similarity.

In [9]:
# extract top 100 similar answers
def get_top100_similar_answers(answer_similarities):
    sim_index = list(enumerate(answer_similarities))
    sorted_sim_index = sorted(sim_index, key=lambda x: x[1], reverse=True)
    sorted_index = [index for index, _ in sorted_sim_index]
    return sorted_index[:100]


# print top 100 similar answers with their index
get_top100_similar_answers(q_and_a_similarities[0])

[108,
 61,
 322,
 519,
 829,
 450,
 612,
 13,
 512,
 181,
 929,
 601,
 127,
 636,
 172,
 128,
 169,
 47,
 598,
 970,
 985,
 419,
 576,
 33,
 524,
 438,
 116,
 529,
 597,
 629,
 628,
 804,
 546,
 168,
 361,
 653,
 393,
 229,
 219,
 802,
 86,
 387,
 551,
 806,
 863,
 444,
 240,
 223,
 621,
 883,
 483,
 502,
 781,
 67,
 242,
 10,
 839,
 888,
 343,
 525,
 618,
 627,
 17,
 153,
 583,
 538,
 307,
 403,
 382,
 833,
 48,
 715,
 674,
 330,
 699,
 180,
 434,
 577,
 785,
 117,
 125,
 946,
 956,
 148,
 91,
 918,
 957,
 350,
 51,
 866,
 491,
 559,
 740,
 912,
 748,
 474,
 813,
 499,
 610,
 694]

With this top 100 answer list for each question, we will add a `query_ranks` column to the DataFrame that holds a list like `[0 0 1 0 0...]`. In this example, the correct response is ranked 3rd.
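
To make that representation concrete, here is a minimal standalone sketch. The `relevance_list()` and `first_hit_rank()` helpers and the four answers are made up for illustration; they are assumptions, not data from NQ-Open.

```python
def relevance_list(ranked_answers, ground_truth):
    """Mark each ranked answer with 1 if it equals the ground truth, else 0."""
    return [1 if answer == ground_truth else 0 for answer in ranked_answers]


def first_hit_rank(relevance):
    """1-based position of the first 1, or None if there is no match."""
    for position, flag in enumerate(relevance, start=1):
        if flag == 1:
            return position
    return None


# Made-up answers, ordered from most to least similar to the question.
ranked = ["the mantle", "the core", "the crust", "the moon"]
flags = relevance_list(ranked, "the crust")
print(flags, first_hit_rank(flags))  # [0, 0, 1, 0] 3
```

Because the comparison is on the answer text, two rows that share the same answer string would both be marked 1. Only the first match matters for the reciprocal rank.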

In [10]:
# add query_ranks column to the dataframe
def add_query_ranks(df):
    # calculate similarities for all combinations of the questions and answers
    q_and_a_similarities = calc_cosine_similarities(df)

    # get query ranks for all questions
    query_ranks_for_all_questions = []
    for i in tqdm(range(len(q_and_a_similarities))):
        # for each question, get ground truth and top 100 similar answers
        ground_truth = df.loc[i, "answer"][0]
        similar_answers = get_top100_similar_answers(q_and_a_similarities[i])

        # if an answer = ground truth, put "1" on the query ranks
        query_ranks = []
        for index in similar_answers:
            similar_answer = df.loc[index, "answer"][0]
            if similar_answer == ground_truth:
                query_ranks.append(1)
            else:
                query_ranks.append(0)
        query_ranks_for_all_questions.append(query_ranks)

    # add the query ranks to the dataframe
    df["query_ranks"] = query_ranks_for_all_questions

In [11]:
# add query_ranks column to the similarity_df
add_query_ranks(similarity_df)
similarity_df.head()

100%|██████████| 1000/1000 [00:01<00:00, 508.66it/s]


Unnamed: 0,question,answer,q_emb,a_emb,query_ranks
0,in will and grace who is the father of grace's...,[Leo],"[0.018023718148469925, -0.02119533158838749, -...","[0.029674364253878593, -4.3852698581758887e-05...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,who is the leader of the official opposition p...,[Wab Kinew],"[-0.024005817249417305, -0.003985343035310507,...","[-0.021027375012636185, -0.010554051026701927,...","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,tu sooraj main saanjh piyaji cast kanak real name,[Rhea Sharma],"[-0.048737287521362305, 0.022673968225717545, ...","[-0.004593866877257824, 0.004197182599455118, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,the process in which the value of ∆u=0 is,[isothermal process],"[-0.017183132469654083, 0.022296994924545288, ...","[-0.07643440365791321, 0.0066937957890331745, ...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,who plays parker's friend in liv and maddie,[Herbie Jackson],"[-0.04511098191142082, 0.013076890259981155, -...","[-0.03658730536699295, -0.01976088061928749, 0...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


With the `query_ranks` column, we can calculate the MRR value for the dataset.

In [12]:
# calculate MRR from the query ranks
def calc_mrr(df):
    query_ranks_for_all_questions = df["query_ranks"]
    reciprocal_ranks = []
    for query_ranks in query_ranks_for_all_questions:
        for i, relevance in enumerate(query_ranks):
            if relevance == 1:
                reciprocal_ranks.append(1 / (i + 1))
                break
        else:
            reciprocal_ranks.append(0)
    return sum(reciprocal_ranks) / len(query_ranks_for_all_questions)

In [13]:
mrr = calc_mrr(similarity_df)
mrr

0.30243866726776186

You should see a value around 0.3 with this task type.

## Calculating Q&A search quality with QUESTION_ANSWERING

Next, let's test with the task type `QUESTION_ANSWERING`.

Start by generating embeddings for the same dataset. This time, we will use task type `QUESTION_ANSWERING` for the questions and `RETRIEVAL_DOCUMENT` for the answers. This will take a few minutes to finish.

In [14]:
# generate embeddings using QUESTION_ANSWERING for question and
# RETRIEVAL_DOCUMENT for answer
q_and_a_df = nq_open_df.copy()
generate_embs(q_and_a_df, "QUESTION_ANSWERING", "RETRIEVAL_DOCUMENT")
q_and_a_df

100%|██████████| 200/200 [00:59<00:00,  3.37it/s]


Unnamed: 0,question,answer,q_emb,a_emb
0,in will and grace who is the father of grace's...,[Leo],"[0.039596837013959885, -0.03138693422079086, -...","[0.0478411465883255, 0.032881636172533035, -0...."
1,who is the leader of the official opposition p...,[Wab Kinew],"[0.010377402417361736, -0.025048334151506424, ...","[0.033089689910411835, -0.023185579106211662, ..."
2,tu sooraj main saanjh piyaji cast kanak real name,[Rhea Sharma],"[-0.027541551738977432, 0.007868854328989983, ...","[0.028119541704654694, -0.016552863642573357, ..."
3,the process in which the value of ∆u=0 is,[isothermal process],"[-0.006571343168616295, 0.014349174685776234, ...","[-0.020945671945810318, -0.006434822455048561,..."
4,who plays parker's friend in liv and maddie,[Herbie Jackson],"[-0.03737664967775345, -0.008099189028143883, ...","[-0.011477257125079632, -0.030604803934693336,..."
...,...,...,...,...
995,what does matt keough do for a living,[executive for the Oakland Athletics],"[-0.014033731073141098, 0.007273292634636164, ...","[0.001903463271446526, 0.017220230773091316, -..."
996,when was the movable type printing technique i...,[AD 1040],"[-0.03695536032319069, -0.010554018430411816, ...","[-0.01185955386608839, 0.05006154999136925, -0..."
997,what was sarah lancaster name in coronation st...,[Wendy Farmer],"[0.032287102192640305, -0.012488983571529388, ...","[0.026390070095658302, 0.0046031661331653595, ..."
998,using techniques that aim at keeping you safe ...,[defensive driving skills],"[0.07813065499067307, -0.048868484795093536, 0...","[0.02209935523569584, 0.011047369800508022, -0..."


Then, generate the `query_ranks` column for calculating MRR.

In [15]:
add_query_ranks(q_and_a_df)
q_and_a_df.head()

100%|██████████| 1000/1000 [00:01<00:00, 510.74it/s]


Unnamed: 0,question,answer,q_emb,a_emb,query_ranks
0,in will and grace who is the father of grace's...,[Leo],"[0.039596837013959885, -0.03138693422079086, -...","[0.0478411465883255, 0.032881636172533035, -0....","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,who is the leader of the official opposition p...,[Wab Kinew],"[0.010377402417361736, -0.025048334151506424, ...","[0.033089689910411835, -0.023185579106211662, ...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,tu sooraj main saanjh piyaji cast kanak real name,[Rhea Sharma],"[-0.027541551738977432, 0.007868854328989983, ...","[0.028119541704654694, -0.016552863642573357, ...","[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,the process in which the value of ∆u=0 is,[isothermal process],"[-0.006571343168616295, 0.014349174685776234, ...","[-0.020945671945810318, -0.006434822455048561,...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,who plays parker's friend in liv and maddie,[Herbie Jackson],"[-0.03737664967775345, -0.008099189028143883, ...","[-0.011477257125079632, -0.030604803934693336,...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


Finally, calculate MRR for the task type `QUESTION_ANSWERING`.

In [16]:
mrr = calc_mrr(q_and_a_df)
mrr

0.3972196009342006

You should see an MRR value around 0.4, which is about a 30% increase over the task type `SEMANTIC_SIMILARITY`.
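
The relative improvement can be checked with a few lines of arithmetic. This sketch simply reuses the two MRR values printed in the runs above. The variable names are illustrative, and your own numbers will differ somewhat because the 1K rows are sampled at random.

```python
# MRR values copied from the two runs shown above; yours will differ
# slightly because the 1K rows are sampled at random.
mrr_similarity = 0.30243866726776186
mrr_question_answering = 0.3972196009342006

relative_gain = (mrr_question_answering - mrr_similarity) / mrr_similarity
print(f"{relative_gain:.1%}")  # 31.3%
```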

# Cleaning up

This tutorial doesn't use any persistent Google Cloud resources, so there is nothing to delete when cleaning up.

# Conclusion

As we saw, Google has been tackling this issue for many years to optimize the semantic search quality of its largest services, and that knowledge and expertise are now packaged as easy-to-use "task type" embedding models.

In many cases, you should see an increase of around 30% to 40% in the MRR value by changing from `SEMANTIC_SIMILARITY` to `QUESTION_ANSWERING`. Using an appropriate task type for each requirement may improve the user experience of your gen AI applications.

### When to use embedding tuning
One caveat of task type embeddings is that the model is trained on a generic corpus from the web. That means the model may be less effective for use cases such as retrieving proprietary company documents or answering questions on topics that the model doesn't know. In such cases, [Tune text embeddings](https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-embeddings) could provide a better option than using the pre-trained model.

### Getting Started
Here's a list of resources to get started with task type embeddings and the new embedding models:

- [Choose an embeddings task type](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/task-types)
- [Get text embeddings](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings)
- [Overview of Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview)
