# Setup

In [1]:
! pip install --upgrade --quiet --user google-cloud-aiplatform google-cloud-storage

In [2]:
PROJECT_ID = "qwiklabs-gcp-02-d4cb0f04c60a"  
LOCATION = "us-central1"

In [5]:
from datetime import datetime, timezone, timedelta

tzinfo = timezone(timedelta(hours=8)) # UTC+8
UID = datetime.now(tzinfo).strftime("%m%d%H%M")
UID

'10221603'

# Task 2. Prepare a dataset with hybrid embeddings

In this task, you process product catalog data and convert each product description into two types of embeddings:
- dense (semantic) and
- sparse (keyword-based). 

These embeddings are later used for hybrid search in Vertex AI Vector Search (formerly Matching Engine).

You need to prepare a data file to build an index for both sparse and dense embeddings, based on the data format described in the Input data format and structure documentation.
https://cloud.google.com/vertex-ai/docs/vector-search/setup/format-structure

```
{
  'id': 0,
  'title': 'Google Sticker',
  'embedding':
    [0.022880317643284798,
    -0.03315234184265137,
    ...
    -0.03309667482972145,
    0.04621824622154236],
  'sparse_embedding': {
    'values': [0.933008728540452, 0.359853737603667],
    'dimensions': [191, 78]
  }
}  
```
Each item should have an embedding property that holds the dense embedding, and a sparse_embedding property that has values and dimensions properties. Sparse embeddings have thousands of dimensions, of which only a few are non-zero. The format is efficient because it stores only the non-zero values, together with their positions (dimensions) in the space.
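
To make the values/dimensions layout concrete, here is a minimal sketch that converts a mostly-zero vector into that format and back. The helper names `to_sparse_embedding` and `to_dense_vector` and the toy numbers are illustrative assumptions, not part of the lab code.

```python
def to_sparse_embedding(vector):
    """Keep only the non-zero entries as parallel 'values' and 'dimensions' lists."""
    values = []
    dimensions = []
    for position, value in enumerate(vector):
        if value != 0.0:
            values.append(float(value))
            dimensions.append(position)
    return {"values": values, "dimensions": dimensions}


def to_dense_vector(sparse_embedding, size):
    """Rebuild the full vector, filling every unlisted position with 0.0."""
    vector = [0.0] * size
    pairs = zip(sparse_embedding["values"], sparse_embedding["dimensions"])
    for value, position in pairs:
        vector[position] = value
    return vector


# Toy 10-dimensional vector with two non-zero entries (made-up numbers).
toy_vector = [0.0, 0.0, 0.36, 0.0, 0.0, 0.0, 0.0, 0.93, 0.0, 0.0]
toy_sparse = to_sparse_embedding(toy_vector)
print(toy_sparse)  # {'values': [0.36, 0.93], 'dimensions': [2, 7]}
```

Only two numbers and two positions are stored instead of ten entries. With a vocabulary of thousands of words and titles of three or four words, the saving is much larger.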

As a sample retail dataset, you use the Google Merch Shop dataset for this lab, which contains 202 rows of Google-branded goods.

https://shop.merch.google/

In [18]:
# Run the following code in a cell to download the dataset and load it into a pandas DataFrame:

import pandas as pd

CSV_URL = "https://storage.googleapis.com/qwiklabs-gcp-02-d4cb0f04c60a/google_merch_shop_items.csv"

df = pd.read_csv(CSV_URL)
df["title"]

# Note: In this instance, you use the dataset to create sparse embeddings for implementing a token-based search with Vector Search, 
# as well as dense embeddings to allow you to implement semantic search with Vector Search.

# 202 ITEMS

0                          Google Sticker
1                    Google Cloud Sticker
2                       Android Black Pen
3                   Google Ombre Lime Pen
4                    For Everyone Eco Pen
                      ...                
197        Google Recycled Black Backpack
198    Google Cascades Unisex Zip Sweater
199    Google Cascades Womens Zip Sweater
200         Google Cloud Skyline Backpack
201       Google City Black Tote Backpack
Name: title, Length: 202, dtype: object

In [19]:
df.shape

(202, 10)

## TfidfVectorizer - from text df to sparse embeddings

In [7]:
# Run the following code in a cell to train a vectorizer (using TfidfVectorizer from scikit-learn) to generate sparse embeddings from product titles:
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

from sklearn.feature_extraction.text import TfidfVectorizer
# Sample Text Data
corpus = df.title.tolist()
# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer()
# Fit and Transform
vectorizer.fit_transform(corpus)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 839 stored elements and shape (202, 243)>

The variable corpus holds a list of the 202 item names, such as "Google Sticker" or "Chrome Dino Pin". The code then passes them to the vectorizer by calling its fit_transform() method, which learns the vocabulary and the IDF weights. After this, the vectorizer is ready to generate sparse embeddings.

The TF-IDF vectorizer gives a higher weight to distinctive words in the dataset (such as "Shirts" or "Dino") than to common words (such as "The", "a", or "of").
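
The weighting can be sketched in plain Python. The following is a simplified illustration, not the scikit-learn implementation: it assumes whitespace tokenization (scikit-learn uses a regular-expression tokenizer) and applies the smoothed IDF formula and L2 normalization that TfidfVectorizer documents as its defaults. The function names and the three-title toy corpus are made up for the example.

```python
import math
from collections import Counter


def fit_idf(corpus):
    """Learn an alphabetical vocabulary and smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    tokenized = [doc.lower().split() for doc in corpus]
    terms = sorted({term for doc in tokenized for term in doc})
    vocabulary = {term: index for index, term in enumerate(terms)}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in terms}
    return vocabulary, idf


def tfidf_sparse(text, vocabulary, idf):
    """Return L2-normalized TF-IDF weights; out-of-vocabulary words are dropped."""
    counts = Counter(t for t in text.lower().split() if t in vocabulary)
    weights = {vocabulary[t]: count * idf[t] for t, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    dims = sorted(weights)
    return {"values": [weights[d] / norm for d in dims], "dimensions": dims}


toy_corpus = ["Google Sticker", "Google Pen", "Chrome Dino Pin"]
toy_vocabulary, toy_idf = fit_idf(toy_corpus)
sticker_emb = tfidf_sparse("Google Sticker", toy_vocabulary, toy_idf)
unknown_emb = tfidf_sparse("Unicorn plushie", toy_vocabulary, toy_idf)
print(sticker_emb)
print(unknown_emb)
```

In this toy corpus "google" appears in two of the three titles and "sticker" in only one, so "sticker" receives the larger weight. A text made up entirely of unknown words produces an empty embedding.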

For more complex production use cases, you can consider using subword tokenizers or more advanced vectorizers like BM25 or SPLADE.

- https://en.wikipedia.org/wiki/Okapi_BM25
- https://en.wikipedia.org/wiki/Learned_sparse_retrieval

In [8]:
# Run the following code in a new cell to create a function that generates sparse embeddings:

# wrapper for sparse embedding
def get_sparse_embedding(text):
    # Transform Text into TF-IDF Sparse Vector
    tfidf_vector = vectorizer.transform([text])

    # Collect the non-zero entries of the sparse TF-IDF vector
    values = []
    dims = []
    for i, tfidf_value in enumerate(tfidf_vector.data):
        # 'values' holds the non-zero TF-IDF scores for the input text.
        # TF-IDF = TF (term frequency in the input text) * IDF (how rare the word is in the fitted corpus);
        # scikit-learn L2-normalizes the resulting vector by default.
        values.append(float(tfidf_value))

        # 'dims' holds, for each of those scores, the INDEX (dimension number) of the word
        # in the TF-IDF VOCABULARY. Words that are not in the vocabulary produce no entry.
        dims.append(int(tfidf_vector.indices[i]))
    return {"values": values, "dimensions": dims}

# This function passes the parameter text to the vectorizer to generate a sparse embedding. 
# Then converts it to the {"values": ...., "dimensions": ...} format for building a Vector Search sparse index.


In [9]:
# Run the following code in a new cell to test the wrapper with a product title:

text_text = "Chrome Dino Pin"
get_sparse_embedding(text_text)

# Why do the dims look unrelated to the word order?
# The values in the dims (dimensions) list are not positions of the words in the original text.
# Each one is the INDEX (dimension number) in the TF-IDF VOCABULARY of a word that is present in the text
# and has a non-zero TF-IDF score. scikit-learn sorts the vocabulary alphabetically, so:
# 'chrome' 33
# 'dino' 48
# 'pin' 157

# Why do the 'values' differ?
# Each word appears once in text_text, so all three have the same term frequency (TF).
# The TF-IDF score (TF * IDF, then L2-normalized) is highest for 'pin' (dimension 157),
#    so 'pin' has the highest IDF of the 3 terms:
#    it is rarer in the fitted corpus than 'chrome' or 'dino',
#    which makes it a more distinctive term for identifying the product
#    when you run a search.

{'values': [0.5212913389979028, 0.5212913389979028, 0.6756557405747007],
 'dimensions': [33, 48, 157]}

In [24]:
# TEST SAMPLE STRING
# unicorn, plushie are not in TF-IDF vocab
get_sparse_embedding("Unicorn plushie")


{'values': [], 'dimensions': []}

In [23]:
# TEST SAMPLE STRING
# google and pin are in TF-IDF vocab
get_sparse_embedding("Google unicorn plushie pin")


{'values': [0.33910318426250974, 0.9407491857147825], 'dimensions': [78, 157]}

In [10]:
# To build a hybrid index, each item should have 
# both sparse_embedding and embedding (for dense embedding).
# The following code uses Google's text-embedding-005 model to generate dense text embeddings of 768 dimensions for semantic search. 
# Run the following code in a new cell to create a wrapper for dense embeddings:

from vertexai.preview.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("text-embedding-005")

# wrapper for dense embedding
def get_dense_embedding(text):
    return model.get_embeddings([text])[0].values



In [11]:
# Test the wrapper method by running the following code in a new cell:

text_text = "Chrome Dino Pin"
get_dense_embedding(text_text)


[-0.06114290654659271,
 0.017346370965242386,
 -0.004251249600201845,
 -0.02798495627939701,
 -0.0011111712083220482,
 0.012573054060339928,
 -0.05285409837961197,
 -0.030847828835248947,
 -0.003995438572019339,
 -0.05352348834276199,
 -0.08685804903507233,
 0.034621480852365494,
 -0.01601252891123295,
 -0.012502963654696941,
 -0.024926766753196716,
 0.03435458615422249,
 0.0061781443655490875,
 -0.07511552423238754,
 -0.024149153381586075,
 -0.0012593824649229646,
 0.00979007687419653,
 -0.07821597903966904,
 -0.020418530330061913,
 -0.01775938645005226,
 0.023217324167490005,
 0.008083238266408443,
 0.011712702922523022,
 0.020031563937664032,
 -0.013191470876336098,
 -0.019503341987729073,
 0.06235750392079353,
 0.015036839991807938,
 0.03624138981103897,
 0.032345667481422424,
 -0.014319414272904396,
 -0.02620174176990986,
 -0.06564001739025116,
 -0.04058409854769707,
 -0.009316562674939632,
 0.03755204379558563,
 0.050808507949113846,
 -0.11942362040281296,
 0.018268104642629623,


In [20]:
len(get_dense_embedding(text_text))

768

In [12]:
# In this lab, you use both wrapper methods 
# to create an object in the correct format, 
# and then save a JSONL file into a cloud storage bucket. 
# First, you use gcloud to create the bucket.

# Run the following code in a new cell to set the bucket name and then create it:

BUCKET_URI = f"gs://{PROJECT_ID}-vs-hybridsearch-{UID}"
! gcloud storage buckets create -l $LOCATION --project $PROJECT_ID $BUCKET_URI


Creating gs://qwiklabs-gcp-02-d4cb0f04c60a-vs-hybridsearch-10221603/...


In [13]:
# Run the following gcloud code in a new cell to copy the embeddings file 
# (a sample JSON array of objects with id, title, embedding, and sparse_embedding properties) to your storage bucket:

! gcloud storage cp gs://partner-genai-bucket/genai115/items.json  $BUCKET_URI/items.json

# The data file in your bucket now holds, for every item:
# - a DENSE embedding
# - a SPARSE embedding: the TF-IDF 'values' and their 'dimensions' (indices into the TF-IDF vocabulary)
# Next, you build and deploy a hybrid index in Vector Search. For this, you need to initialize the aiplatform package.

Copying gs://partner-genai-bucket/genai115/items.json to gs://qwiklabs-gcp-02-d4cb0f04c60a-vs-hybridsearch-10221603/items.json
  Completed files 1/1 | 3.3MiB/3.3MiB                                          
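
The lab copies a prebuilt embeddings file, but a file of this shape could be assembled from the two wrappers. The following is a rough sketch under stated assumptions: `build_item`, `to_jsonl`, and the two stand-in embedding functions are hypothetical, and the record layout simply mirrors the example shown earlier. Check the input data format documentation for the exact field types before relying on it.

```python
import json


def build_item(item_id, title, dense_fn, sparse_fn):
    """Assemble one record with id, title, embedding, and sparse_embedding."""
    return {
        "id": item_id,
        "title": title,
        "embedding": dense_fn(title),
        "sparse_embedding": sparse_fn(title),
    }


def to_jsonl(items):
    """Serialize the records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(item) for item in items) + "\n"


# Stand-in embedding functions so the sketch runs without cloud access.
# In the notebook you would pass get_dense_embedding and get_sparse_embedding.
def fake_dense(text):
    return [float(len(text)), 0.5]


def fake_sparse(text):
    return {"values": [1.0], "dimensions": [len(text.split())]}


titles = ["Google Sticker", "Android Black Pen"]
items = [build_item(i, t, fake_dense, fake_sparse) for i, t in enumerate(titles)]
jsonl_text = to_jsonl(items)
print(jsonl_text)
```

Writing `jsonl_text` to a local file and copying it to the bucket with gcloud storage cp would then take the place of copying the prebuilt file.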


In [14]:
# Run the following code in a new cell to initialize the aiplatform package:

from google.cloud import aiplatform
aiplatform.init(project=PROJECT_ID, location=LOCATION)

In [15]:
# Run the following code in a new cell to create a hybrid index using the JSONL file in your bucket. This cell takes a minute or two to complete:

my_hybrid_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=f"vs-hybridsearch-index-{UID}",
    contents_delta_uri=BUCKET_URI,
    dimensions=768,
    approximate_neighbors_count=10,
)

In [27]:
my_hybrid_index

<google.cloud.aiplatform.matching_engine.matching_engine_index.MatchingEngineIndex object at 0x7f9ee61c5a20> 
resource name: projects/194398435739/locations/us-central1/indexes/1176576397766819840

In [28]:
my_hybrid_index.__dict__

{'project': '194398435739',
 'location': 'us-central1',
 'credentials': <google.auth.compute_engine.credentials.Credentials at 0x7f9eed0b3670>,
 'api_client': <google.cloud.aiplatform.utils.IndexClientWithOverride at 0x7f9ee436e1d0>,
 '_FutureManager__latest_future_lock': <unlocked _thread.lock object at 0x7f9ee4371e80>,
 '_FutureManager__latest_future': None,
 '_exception': None,
 '_gca_resource': name: "projects/194398435739/locations/us-central1/indexes/1176576397766819840"
 display_name: "vs-hybridsearch-index-10221603"
 metadata_schema_uri: "gs://google-cloud-aiplatform/schema/matchingengine/metadata/nearest_neighbor_search_1.0.0.yaml"
 metadata {
   struct_value {
     fields {
       key: "config"
       value {
         struct_value {
           fields {
             key: "shardSize"
             value {
               string_value: "SHARD_SIZE_MEDIUM"
             }
           }
           fields {
             key: "dimensions"
             value {
               number_value

In [30]:
import copy
cred = copy.deepcopy(my_hybrid_index.credentials.__dict__)
cred.pop('token')
cred

{'expiry': datetime.datetime(2025, 10, 22, 9, 5, 45, 436085),
 '_quota_project_id': None,
 '_trust_boundary': None,
 '_universe_domain': 'googleapis.com',
 '_use_non_blocking_refresh': False,
 '_refresh_worker': <google.auth._refresh_worker.RefreshThreadManager at 0x7f9ee4398490>,
 '_scopes': ['https://www.googleapis.com/auth/cloud-platform'],
 '_default_scopes': None,
 '_service_account_email': '194398435739-compute@developer.gserviceaccount.com',
 '_universe_domain_cached': False}

In [31]:
my_hybrid_index.api_client.__dict__

{'_clients': {'v1': <google.cloud.aiplatform.utils.ClientWithOverride.WrappedClient at 0x7f9ee436e320>,
  'v1beta1': <google.cloud.aiplatform.utils.ClientWithOverride.WrappedClient at 0x7f9ee436e2f0>}}

In [36]:
my_hybrid_index.api_client._clients["v1"].__dict__

{'_client_class': google.cloud.aiplatform_v1.services.index_service.client.IndexServiceClient,
 '_credentials': <google.auth.compute_engine.credentials.Credentials at 0x7f9eed0b3670>,
 '_client_options': ClientOptions: {'api_endpoint': 'us-central1-aiplatform.googleapis.com', 'client_cert_source': None, 'client_encrypted_cert_source': None, 'quota_project_id': None, 'credentials_file': None, 'scopes': None, 'api_key': None, 'api_audience': None, 'universe_domain': None},
 '_client_info': <google.api_core.gapic_v1.client_info.ClientInfo at 0x7f9ee61c58a0>,
 '_api_transport': None}

In [40]:
import copy
cred = copy.deepcopy(my_hybrid_index.api_client._clients["v1"]._credentials.__dict__)
cred.pop('token')
cred

{'expiry': datetime.datetime(2025, 10, 22, 9, 5, 45, 436085),
 '_quota_project_id': None,
 '_trust_boundary': None,
 '_universe_domain': 'googleapis.com',
 '_use_non_blocking_refresh': False,
 '_refresh_worker': <google.auth._refresh_worker.RefreshThreadManager at 0x7f9eddf4a740>,
 '_scopes': ['https://www.googleapis.com/auth/cloud-platform'],
 '_default_scopes': None,
 '_service_account_email': '194398435739-compute@developer.gserviceaccount.com',
 '_universe_domain_cached': False}

In [38]:
my_hybrid_index.api_client._clients["v1"]._client_info.__dict__

{'python_version': '3.10.18',
 'grpc_version': '1.75.1',
 'api_core_version': '2.26.0',
 'gapic_version': '1.122.0+top_google_constructor_method+google.cloud.aiplatform.matching_engine.matching_engine_index.MatchingEngineIndex.create_tree_ah_index+environment+WORKBENCH_INSTANCE',
 'client_library_version': None,
 'user_agent': 'model-builder/1.122.0+top_google_constructor_method+google.cloud.aiplatform.matching_engine.matching_engine_index.MatchingEngineIndex.create_tree_ah_index+environment+WORKBENCH_INSTANCE',
 'rest_version': None,
 'protobuf_runtime_version': None}

In [37]:
my_hybrid_index.api_client._clients["v1beta1"].__dict__

{'_client_class': google.cloud.aiplatform_v1beta1.services.index_service.client.IndexServiceClient,
 '_credentials': <google.auth.compute_engine.credentials.Credentials at 0x7f9eed0b3670>,
 '_client_options': ClientOptions: {'api_endpoint': 'us-central1-aiplatform.googleapis.com', 'client_cert_source': None, 'client_encrypted_cert_source': None, 'quota_project_id': None, 'credentials_file': None, 'scopes': None, 'api_key': None, 'api_audience': None, 'universe_domain': None},
 '_client_info': <google.api_core.gapic_v1.client_info.ClientInfo at 0x7f9ee61c58a0>,
 '_api_transport': None}

In [39]:
my_hybrid_index.api_client._clients["v1beta1"]._client_info.__dict__

{'python_version': '3.10.18',
 'grpc_version': '1.75.1',
 'api_core_version': '2.26.0',
 'gapic_version': '1.122.0+top_google_constructor_method+google.cloud.aiplatform.matching_engine.matching_engine_index.MatchingEngineIndex.create_tree_ah_index+environment+WORKBENCH_INSTANCE',
 'client_library_version': None,
 'user_agent': 'model-builder/1.122.0+top_google_constructor_method+google.cloud.aiplatform.matching_engine.matching_engine_index.MatchingEngineIndex.create_tree_ah_index+environment+WORKBENCH_INSTANCE',
 'rest_version': None,
 'protobuf_runtime_version': None}

In [41]:
import copy
cred = copy.deepcopy(my_hybrid_index.api_client._clients["v1beta1"]._credentials.__dict__)
cred.pop('token')
cred

{'expiry': datetime.datetime(2025, 10, 22, 9, 5, 45, 436085),
 '_quota_project_id': None,
 '_trust_boundary': None,
 '_universe_domain': 'googleapis.com',
 '_use_non_blocking_refresh': False,
 '_refresh_worker': <google.auth._refresh_worker.RefreshThreadManager at 0x7f9eddf2ffd0>,
 '_scopes': ['https://www.googleapis.com/auth/cloud-platform'],
 '_default_scopes': None,
 '_service_account_email': '194398435739-compute@developer.gserviceaccount.com',
 '_universe_domain_cached': False}

The MatchingEngineIndex.create_tree_ah_index() method builds an index in Vector Search.

To use the index, you need to create an index endpoint. It works as a server instance, accepting query requests for your index.

Note: Wait for the cell to complete and the index to be deployed before continuing.

In [16]:
# Run the following code in a new cell to create an index endpoint:

my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=f"vs-hybridsearch-index-endpoint-{UID}", public_endpoint_enabled=True
)


# You are almost ready to begin running hybrid queries.
# All that remains is to deploy the index to the endpoint.

# Note: The following cell can take more than 30 minutes to complete.

In [42]:
my_index_endpoint

<google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint.MatchingEngineIndexEndpoint object at 0x7f9ee436fe50> 
resource name: projects/194398435739/locations/us-central1/indexEndpoints/2147146097930272768

In [43]:
my_index_endpoint.__dict__

{'project': '194398435739',
 'location': 'us-central1',
 'credentials': <google.auth.compute_engine.credentials.Credentials at 0x7f9eed0b3670>,
 'api_client': <google.cloud.aiplatform.utils.IndexEndpointClientWithOverride at 0x7f9ee436fca0>,
 '_FutureManager__latest_future_lock': <unlocked _thread.lock object at 0x7f9ee5d0fe40>,
 '_FutureManager__latest_future': None,
 '_exception': None,
 '_gca_resource': name: "projects/194398435739/locations/us-central1/indexEndpoints/2147146097930272768"
 display_name: "vs-hybridsearch-index-endpoint-10221603"
 deployed_indexes {
   id: "vs_hybridsearch_deployed_10221603"
   index: "projects/194398435739/locations/us-central1/indexes/1176576397766819840"
   create_time {
     seconds: 1761120568
     nanos: 130844000
   }
   index_sync_time {
     seconds: 1761122204
     nanos: 352550000
   }
   automatic_resources {
     min_replica_count: 2
     max_replica_count: 2
   }
   deployment_group: "default"
 }
 etag: "AMEw9yNI_dNRhJgJZtzYCsFa4XdVTJSH-

In [17]:
# Run the following code in a new cell to deploy the index to the endpoint, specifying a unique deployed index ID.

DEPLOYED_HYBRID_INDEX_ID = f"vs_hybridsearch_deployed_{UID}"
my_index_endpoint.deploy_index(
    index=my_hybrid_index, deployed_index_id=DEPLOYED_HYBRID_INDEX_ID
)

<google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint.MatchingEngineIndexEndpoint object at 0x7f9ee436fe50> 
resource name: projects/194398435739/locations/us-central1/indexEndpoints/2147146097930272768

In [56]:
my_hybrid_index._gca_resource._pb.name
# type(my_hybrid_index._gca_resource._pb)
# ["name"]

'projects/194398435739/locations/us-central1/indexes/1176576397766819840'

In [64]:
my_index_endpoint.deployed_indexes._pb[0]

id: "vs_hybridsearch_deployed_10221603"
index: "projects/194398435739/locations/us-central1/indexes/1176576397766819840"
create_time {
  seconds: 1761120568
  nanos: 130844000
}
index_sync_time {
  seconds: 1761122204
  nanos: 352550000
}
automatic_resources {
  min_replica_count: 2
  max_replica_count: 2
}
deployment_group: "default"

In [67]:
my_index_endpoint.deployed_indexes._pb[0].index

'projects/194398435739/locations/us-central1/indexes/1176576397766819840'

In [68]:
# sanity check index location
my_hybrid_index._gca_resource._pb.name == my_index_endpoint.deployed_indexes._pb[0].index

True

In [None]:
# This would be a good time to review the explanatory material at the beginning of the lab.


# Task 3. Run hybrid queries

In this task, you generate both dense and sparse embeddings for the query "Kids" and encapsulate them in a HybridQuery object. In addition to the dense and sparse embeddings, you pass in rrf_ranking_alpha, which controls how the rankings from the semantic and token-based search results are merged. Because the dense embedding captures meaning, a search for "cozy hoodie", for example, could surface the same products as a search for "comfortable sweatshirt" even if the exact keywords aren't present.



In [25]:
# Run the following code in a new cell to prepare the query for the word Kids:

from google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint import (
    HybridQuery,
)

query_text = "Kids"
query_dense_emb = get_dense_embedding(query_text)
query_sparse_emb = get_sparse_embedding(query_text)
query = HybridQuery(
    dense_embedding=query_dense_emb,
    sparse_embedding_dimensions=query_sparse_emb["dimensions"],
    sparse_embedding_values=query_sparse_emb["values"],
    rrf_ranking_alpha=0.5,
)


In [71]:
query_dense_emb[:5], len(query_dense_emb)

([-0.048625607043504715,
  -0.0027711312286555767,
  -0.013441437855362892,
  -0.046644870191812515,
  -0.023981038480997086],
 768)

In [72]:
query_sparse_emb

{'values': [1.0], 'dimensions': [105]}

In [26]:
# Run the following code in a new cell to run the query and print distances for each item in the response:

# run a hybrid query
response = my_index_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_HYBRID_INDEX_ID,
    queries=[query],
    num_neighbors=10,
)

# print results
for idx, neighbor in enumerate(response[0]):
    title = df.title[int(neighbor.id)]
    dense_dist = neighbor.distance if neighbor.distance else 0.0
    sparse_dist = neighbor.sparse_distance if neighbor.sparse_distance else 0.0
    print(f"{title:<40}: dense_dist: {dense_dist:.3f}, sparse_dist: {sparse_dist:.3f}")

Google Blue Kids Sunglasses             : dense_dist: 0.577, sparse_dist: 0.606
Google Red Kids Sunglasses              : dense_dist: 0.573, sparse_dist: 0.572
YouTube Kids Coloring Pencils           : dense_dist: 0.546, sparse_dist: 0.478
YouTube Kids Character Sticker Sheet    : dense_dist: 0.525, sparse_dist: 0.468
Google Doogler Youth Tee                : dense_dist: 0.546, sparse_dist: 0.000
Chrome Dino Glow-in-the-Dark Youth Tee  : dense_dist: 0.536, sparse_dist: 0.000
Google Bike Youth Tee                   : dense_dist: 0.535, sparse_dist: 0.000
Google Indigo Youth Tee                 : dense_dist: 0.524, sparse_dist: 0.000
Google White Classic Youth Tee          : dense_dist: 0.509, sparse_dist: 0.000
Google Doogler Toddler Tee              : dense_dist: 0.506, sparse_dist: 0.000


Each neighbor object has a distance property, which holds the distance between the query and the item's dense embedding, and a sparse_distance property, which holds the distance for the sparse embedding. These values are inverted distances, so a higher value means a closer match.
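
To see why items without the keyword report a sparse_dist of 0.000, here is a small sketch of a dot product between two sparse embeddings. It assumes the scores behave like dot products (higher means closer); the notebook does not state the distance measure the index uses. `sparse_dot` and the item embeddings are made up for illustration, and only the query embedding comes from the output above.

```python
def sparse_dot(a, b):
    """Dot product of two sparse embeddings; only shared dimensions contribute."""
    b_lookup = dict(zip(b["dimensions"], b["values"]))
    pairs = zip(a["dimensions"], a["values"])
    return sum(value * b_lookup.get(dim, 0.0) for dim, value in pairs)


# Sparse embedding of the query "Kids", as printed earlier in the notebook.
toy_query = {"values": [1.0], "dimensions": [105]}

# Hypothetical item embeddings: one shares dimension 105 with the query, one does not.
item_with_keyword = {"values": [0.6, 0.8], "dimensions": [78, 105]}
item_without_keyword = {"values": [0.6, 0.8], "dimensions": [78, 191]}

print(sparse_dot(toy_query, item_with_keyword))     # 0.8
print(sparse_dot(toy_query, item_without_keyword))  # 0.0
```

An item that shares no vocabulary dimension with the query contributes nothing to the sum. This matches the Youth Tee rows above, which are returned only because of their dense score.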

By running a query with HybridQuery, you get the following result:

![Query results](https://cdn.qwiklabs.com/wKlzkxCZjNFW7s7rQVRkYc9jdb8vl9b%2FSF6TcUfRkeU%3D)

**In addition to the token-based search results (from the TF-IDF sparse embeddings) that contain the Kids keyword, semantic search results (from the 768-dimensional text-embedding-005 dense embeddings) are also included.** For example, Google White Classic Youth Tee is included because the embedding model places Youth and Kids close together semantically.

To merge the token-based and semantic search results, hybrid search uses Reciprocal Rank Fusion (RRF). For more information about RRF and how to specify the rrf_ranking_alpha parameter, refer to the next section.

https://plg.uwaterloo.ca/%7Egvcormac/cormacksigir09-rrf.pdf



## What is reciprocal rank fusion?

RRF merges the rankings from the semantic and token-based search results into a single ranking by scoring each item on the reciprocal of its rank in each result list. In many production information retrieval or recommender systems, the results then go through further precision ranking algorithms, known as reranking. By combining millisecond-level retrieval from vector search with precision reranking of the results, you can build multi-stage systems that provide higher search quality and recommendation performance: https://cloud.google.com/blog/products/ai-machine-learning/scaling-deep-retrieval-tensorflow-two-towers-architecture?e=48754805

![flow](https://cdn.qwiklabs.com/c5jAYb%2F%2FSUaCc7tNwe3DSYShL4YXRV%2BRTtYJYRkFDFk%3D)
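
As a rough illustration of the fusion step, the sketch below implements a weighted form of RRF in plain Python. It assumes that alpha weights the dense ranking and 1 - alpha the sparse ranking, and it uses the constant k = 60 from the RRF paper linked above. The exact formula Vector Search applies for rrf_ranking_alpha is not given in this notebook, so treat `weighted_rrf` and the toy rankings as hypothetical.

```python
def weighted_rrf(dense_ranking, sparse_ranking, alpha=0.5, k=60):
    """Fuse two ranked lists; each item scores weight / (k + rank) per list."""
    scores = {}
    for rank, item in enumerate(dense_ranking, start=1):
        scores[item] = scores.get(item, 0.0) + alpha / (k + rank)
    for rank, item in enumerate(sparse_ranking, start=1):
        scores[item] = scores.get(item, 0.0) + (1.0 - alpha) / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Toy rankings (made up): the dense list includes a semantic-only match.
dense_ranking = ["youth tee", "kids sunglasses", "kids pencils"]
sparse_ranking = ["kids sunglasses", "kids pencils"]

print(weighted_rrf(dense_ranking, sparse_ranking, alpha=0.5))
# ['kids sunglasses', 'kids pencils', 'youth tee']
print(weighted_rrf(dense_ranking, sparse_ranking, alpha=1.0))
# ['youth tee', 'kids sunglasses', 'kids pencils']
```

With alpha=0.5, items that appear in both lists rise to the top, while the semantic-only match still makes the results. Under this assumption, alpha=1.0 reproduces the dense ranking alone.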