# Semantic Search with Vertex Vector Search

**Learning Objectives**
  1. Learn how to create text embeddings using the Vertex Text Embeddings API
  1. Learn how to load embeddings into Vertex Vector Search
  1. Learn how to query Vertex Vector Search
  1. Learn how to build an information retrieval system based on semantic match
  
  
In this notebook, we implement a simple (albeit fast and scalable) [semantic search](https://en.wikipedia.org/wiki/Semantic_search#:~:text=Semantic%20search%20seeks%20to%20improve,to%20generate%20more%20relevant%20results.) retrieval system using [Vertex Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview) and [Vertex Text Embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings). In a semantic search system, documents are returned in response to a user query and ranked by how well they match it semantically. The returned documents should therefore match the intent or meaning of the query rather than its exact keywords, in contrast to a boolean or keyword-based retrieval system. Such a semantic search system generally has two components:

* A component that produces semantically meaningful vector representations of both the documents and the user queries. We will use the [Vertex Text Embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) API to create these embeddings, leveraging the power of large language models.

* A component that stores the document embeddings and retrieves the most relevant documents, that is, the documents whose embeddings are closest to the user-query embedding in the embedding space. We will use [Vertex Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview), which can scale up to billions of embeddings thanks to an [efficient approximate nearest neighbor strategy](https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html) for comparing and retrieving the document vectors closest to a query vector, based on a [recent paper from Google Research](https://arxiv.org/abs/1908.10396).
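
To make the second component concrete, here is a minimal sketch of exact (brute-force) nearest-neighbor retrieval by cosine similarity. It is only an illustration of the idea that Vector Search approximates at scale, not how the service is implemented; the helper names (`cosine_similarity`, `top_k`) and the 3-dimensional toy vectors are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, doc_vectors, k):
    """Return the indices of the k documents most similar to the query."""
    scored = [
        (cosine_similarity(query, vec), idx)
        for idx, vec in enumerate(doc_vectors)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy "document embeddings" and "query embedding" (hypothetical values).
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.05, 0.0]

nearest = top_k(toy_query, toy_docs, k=2)
print(nearest)
```

This exhaustive scan compares the query with every document, which is fine for a handful of vectors but becomes too slow for very large collections; that cost is what motivates an approximate nearest neighbor index.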



**Dataset:** We will use a very small subset of the [COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge), which contains around 1 million medical research papers focused on COVID-19. For the sake of speed, we will use only about 4000 titles, abstracts, and URLs, all from 2021.


## Setup 

In [1]:
import json
import os

import pandas as pd
from google.cloud import aiplatform
from IPython import display
from vertexai.language_models import TextEmbeddingModel

In [2]:
REGION = "us-central1"
PROJECT = !(gcloud config get-value core/project)
PROJECT = PROJECT[0]
BUCKET = f"{PROJECT}-cord19-semantic-search"

# Do not change these
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION

In [3]:
!gsutil ls gs://{BUCKET} || gsutil mb -l {REGION} gs://{BUCKET}

gs://dherin-dev-cord19-semantic-search/cord19_embeddings.json


## Loading the data

The dataset we will use consists of the title, abstract, and URL metadata of roughly 4000 samples from the ~1 million medical papers in the [COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge). In this lab, the abstracts are the documents for which we compute and store embeddings.

In [4]:
metadata = pd.read_csv("../data/cord19_metadata_sample.csv.gz")
metadata = metadata[~metadata.abstract.isna()]
metadata.index = range(len(metadata))
metadata.head()

| | title | abstract | url |
|---|---|---|---|
| 0 | Ethnobotanical and ethnomedicinal analysis of ... | Algerian people largely rely on traditional me... | https://www.ncbi.nlm.nih.gov/pubmed/34131369/;... |
| 1 | Myopericarditis in a previously healthy adoles... | We report the case of a previously healthy 16‐... | https://www.ncbi.nlm.nih.gov/pubmed/34133825/;... |
| 2 | Religious Support as a Contribution to Face th... | Coping with the COVID-19 pandemic has required... | https://www.ncbi.nlm.nih.gov/pubmed/33405093/;... |
| 3 | The urgency of resuming disrupted dog rabies v... | OBJECTIVE: Dog vaccination is a cost-effective... | http://medrxiv.org/cgi/content/short/2021.04.2... |
| 4 | Intestinal organoids in farm animals | In livestock species, the monolayer of epithel... | https://doi.org/10.1186/s13567-021-00909-x; ht... |


## Creating the embeddings

The first thing to do is to create embedding vectors for our abstracts. For that, we need to first instantiate the `TextEmbeddingModel` client with the appropriate version of the text embedding model:

In [5]:
model = TextEmbeddingModel.from_pretrained("text-embedding-004")



The embedding model can process a list of up to 5 texts in a single call. We therefore iterate over `metadata.abstract` in batches of 5 and pass each batch to `model.get_embeddings`, collecting the embeddings of all the abstracts in the list `vectors`. Running the next cell will take a couple of minutes:

In [6]:
MAX_BATCH_SIZE = 5
vectors = []

for i in range(0, len(metadata), MAX_BATCH_SIZE):
    batch = metadata.abstract[i : i + MAX_BATCH_SIZE].to_list()
    embeddings = model.get_embeddings(batch)
    vectors.extend([embedding.values for embedding in embeddings])
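
The slicing pattern used in the cell above also handles a final batch that is smaller than the batch size. The sketch below illustrates this on toy data without calling the embedding API; `iter_batches` and `fake_embed` are hypothetical helpers introduced only for this illustration, and `fake_embed` merely stands in for a real embedding call.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices containing at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def fake_embed(texts):
    """Hypothetical stand-in for an embedding call: one small vector per text."""
    return [[float(len(text)), 0.0] for text in texts]


# 12 toy "abstracts": with a batch size of 5, the last batch has only 2 items.
toy_abstracts = [f"abstract {n}" for n in range(12)]

toy_vectors = []
batch_sizes = []
for batch in iter_batches(toy_abstracts, batch_size=5):
    batch_sizes.append(len(batch))
    toy_vectors.extend(fake_embed(batch))

print(batch_sizes)
print(len(toy_vectors) == len(toy_abstracts))
```

Checking that the number of vectors equals the number of input texts is a cheap way to confirm that no batch was dropped or duplicated.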

## Creating the Vector Search engine input file

At this point, our abstract embeddings are stored in memory in the `vectors` list. To store these embeddings into [Vertex Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview), we need to serialize them into a JSON file with the [following format](https://cloud.google.com/vertex-ai/docs/vector-search/setup/format-structure):

```python
{"id": <DOCUMENT_ID1>, "embedding": [0.1, ..., -0.7]}
{"id": <DOCUMENT_ID2>, "embedding": [-0.4, ..., 0.8]}
etc.
```
where the value of the `id` field should be an identifier allowing us to retrieve the actual document from a separate source, and the value of `embedding` is the vector returned by the text embedding API. 

For the document `id` we simply use the row index in the `metadata` DataFrame, which will serve as our in-memory document store. This makes it particularly easy to retrieve the abstract, title and url from an `id` returned by the vector search:

```python
metadata.abstract[id]
metadata.title[id]
metadata.url[id]
```

The next cell iterates over `vectors` and, for each entry, appends a JSON line in the format above to `cord19_embeddings.json`. Each line contains the index of the abstract in `metadata` together with the embedding vector returned by the text embedding API:

In [7]:
embeddings_file_path = "cord19_embeddings.json"

# Removing the embedding file if it already exists
!test -f {embeddings_file_path} && rm {embeddings_file_path}

with open(embeddings_file_path, "a") as embeddings_file:
    for i, embedding in enumerate(vectors):
        json_line = json.dumps({"id": i, "embedding": embedding}) + "\n"
        embeddings_file.write(json_line)



Let us verify that our embedding file contains one line per abstract, that is, as many lines as our dataframe has rows, and then save it to a GCS bucket:

In [34]:
!wc -l {embeddings_file_path}
print(len(metadata), "metadata dataframe")

3446 cord19_embeddings.json
3446 metadata dataframe
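
Beyond the line count, one could also check that every line is valid JSON with an `id` and an `embedding` of consistent length. The following is a minimal sketch of such a check, run here on a small temporary file with made-up values rather than on the real embeddings file; `validate_embeddings_file` is a hypothetical helper, not part of any Vertex AI library.

```python
import json
import os
import tempfile


def validate_embeddings_file(path):
    """Return (record_count, dimension); raise ValueError on a malformed line."""
    count, dimension = 0, None
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)
            if "id" not in record or "embedding" not in record:
                raise ValueError(f"line {line_number}: missing 'id' or 'embedding'")
            if dimension is None:
                dimension = len(record["embedding"])
            elif len(record["embedding"]) != dimension:
                raise ValueError(f"line {line_number}: unexpected dimension")
            count += 1
    return count, dimension


# Write a toy JSON-lines file with 4 records of dimension 3 (made-up values).
toy_records = [{"id": i, "embedding": [0.1 * i, 0.2, 0.3]} for i in range(4)]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for record in toy_records:
        tmp.write(json.dumps(record) + "\n")
    toy_path = tmp.name

result = validate_embeddings_file(toy_path)
os.remove(toy_path)
print(result)
```

In the notebook, the same function could be pointed at `embeddings_file_path`, and the returned count compared with `len(metadata)`.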




In [9]:
EMBEDDINGS_URI = f"gs://{BUCKET}"

!gsutil cp {embeddings_file_path} {EMBEDDINGS_URI}



Copying file://cord19_embeddings.json [Content-Type=application/json]...
\ [1 files][ 55.9 MiB/ 55.9 MiB]                                                
Operation completed over 1 objects/55.9 MiB.                                     


## Creating the Vector Search engine index

We are now ready to set up [Vertex Vector Search](https://cloud.google.com/vertex-ai/docs/matching-engine/overview). The procedure requires two steps:

1. The [creation of an index](https://cloud.google.com/vertex-ai/docs/vector-search/overview)
1. The [deployment of this index to an endpoint](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-public)

While creating the index, the embedding vectors are uploaded to the matching engine and a tree-like data structure (the index) is built, allowing fast but approximate retrieval of the `approximate_neighbors_count` nearest neighbors of a given vector. The index depends on a notion of distance between embedding vectors, which we specify with `distance_measure_type`. Here we choose `COSINE_DISTANCE`, which is essentially a measure of the angle between the embedding vectors. Other possible choices are the square of the Euclidean distance (`SQUARED_L2_DISTANCE`), the [Manhattan distance](https://en.wikipedia.org/wiki/Taxicab_geometry) (`L1_DISTANCE`), or the dot product distance (`DOT_PRODUCT_DISTANCE`). (Note that if the embeddings you are using were trained to minimize one of these distances between matching pairs, you may get better results by selecting that particular distance; otherwise `COSINE_DISTANCE` will do just fine.)
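
To give a feel for how these distance measures differ, the sketch below computes each of them for small toy vectors. It uses common textbook definitions (cosine distance as one minus cosine similarity, dot product distance as the negated dot product); these are assumptions for illustration, and the exact conventions and scaling used by the service may differ. All function names are made up for this example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def squared_l2_distance(a, b):
    """Square of the Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l1_distance(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dot_product_distance(a, b):
    """Negated dot product, so that a larger dot product means 'closer'."""
    return -dot(a, b)


u = [3.0, 4.0]
v = [4.0, 3.0]
w = [6.0, 8.0]  # same direction as u, twice the length

print(cosine_distance(u, v), squared_l2_distance(u, v), l1_distance(u, v))
print(dot_product_distance(u, v))
# Rescaling a vector leaves the cosine distance unchanged but not the L2 distance.
print(cosine_distance(u, w), squared_l2_distance(u, w))
```

The last line shows why the cosine distance is a natural default when only the direction of the embeddings matters: `u` and `w` point the same way, so their cosine distance is zero even though they are far apart in Euclidean terms.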

The next cell creates the matching engine index from the embedding file. Running it will take about 1 hour:

In [21]:
DISPLAY_NAME = "cord19_embeddings"

matching_engine_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=DISPLAY_NAME,
    contents_delta_uri=EMBEDDINGS_URI,
    dimensions=len(vectors[0]),
    approximate_neighbors_count=150,
    distance_measure_type="COSINE_DISTANCE",
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=7,
    description=DISPLAY_NAME,
)

Creating MatchingEngineIndex
Create MatchingEngineIndex backing LRO: projects/115851500182/locations/us-central1/indexes/2237809627733426176/operations/3539341295749169152
MatchingEngineIndex created. Resource name: projects/115851500182/locations/us-central1/indexes/2237809627733426176
To use this MatchingEngineIndex in another session:
index = aiplatform.MatchingEngineIndex('projects/115851500182/locations/us-central1/indexes/2237809627733426176')


Once the index is created it is associated with the resource name:

In [22]:
INDEX_RESOURCE_NAME = matching_engine_index.resource_name
print(INDEX_RESOURCE_NAME)

projects/115851500182/locations/us-central1/indexes/2237809627733426176


In turn, this index resource name can be used to instantiate an index:

In [23]:
matching_engine_index = aiplatform.MatchingEngineIndex(
    index_name=INDEX_RESOURCE_NAME
)

Now that our index is up and running, we need to make it accessible before we can query it. The first step is to create a public endpoint (for speedups, one can also create a [private endpoint in a VPC network](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-vpc)):

In [24]:
matching_engine_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=DISPLAY_NAME,
    description=DISPLAY_NAME,
    public_endpoint_enabled=True,
)

Creating MatchingEngineIndexEndpoint
Create MatchingEngineIndexEndpoint backing LRO: projects/115851500182/locations/us-central1/indexEndpoints/4142902638855323648/operations/2557556576982401024
MatchingEngineIndexEndpoint created. Resource name: projects/115851500182/locations/us-central1/indexEndpoints/4142902638855323648
To use this MatchingEngineIndexEndpoint in another session:
index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/115851500182/locations/us-central1/indexEndpoints/4142902638855323648')


The second step is to deploy the index to the endpoint we created: 

In [25]:
DEPLOYED_INDEX_ID = f"{DISPLAY_NAME}_deployed"

matching_engine = matching_engine_endpoint.deploy_index(
    index=matching_engine_index, deployed_index_id=DEPLOYED_INDEX_ID
)

matching_engine.deployed_indexes

Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/115851500182/locations/us-central1/indexEndpoints/4142902638855323648
Deploy index MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/115851500182/locations/us-central1/indexEndpoints/4142902638855323648/operations/8187900536125652992
MatchingEngineIndexEndpoint index_endpoint Deployed index. Resource name: projects/115851500182/locations/us-central1/indexEndpoints/4142902638855323648


[id: "cord19_embeddings_deployed"
index: "projects/115851500182/locations/us-central1/indexes/2237809627733426176"
create_time {
  seconds: 1723146455
  nanos: 603768000
}
index_sync_time {
  seconds: 1723147424
  nanos: 390569000
}
automatic_resources {
  min_replica_count: 2
  max_replica_count: 2
}
deployment_group: "default"
]

## Querying Vector Search

We are now ready to issue queries to Vector Search! 

To begin with, we need to create a text embedding from a user query: 

In [26]:
QUERY = "prophylactic measures"

text_embeddings = [vector.values for vector in model.get_embeddings([QUERY])]

Then we can use the `find_neighbors` method of the endpoint to which our Vector Search index is deployed. This method takes as input the embedding vector of the user query and returns the abstract IDs of the `NUM_NEIGHBORS` nearest neighbors:

In [27]:
# Define number of neighbors to return
NUM_NEIGHBORS = 10

response = matching_engine.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=text_embeddings,
    num_neighbors=NUM_NEIGHBORS,
)

response

[[MatchNeighbor(id='3409', distance=0.42408186197280884, sparse_distance=None, feature_vector=[], crowding_tag='0', restricts=[], numeric_restricts=[], sparse_embedding_values=[], sparse_embedding_dimensions=[]),
  MatchNeighbor(id='1087', distance=0.4245913028717041, sparse_distance=None, feature_vector=[], crowding_tag='0', restricts=[], numeric_restricts=[], sparse_embedding_values=[], sparse_embedding_dimensions=[]),
  MatchNeighbor(id='2987', distance=0.4517195224761963, sparse_distance=None, feature_vector=[], crowding_tag='0', restricts=[], numeric_restricts=[], sparse_embedding_values=[], sparse_embedding_dimensions=[]),
  MatchNeighbor(id='2614', distance=0.46803832054138184, sparse_distance=None, feature_vector=[], crowding_tag='0', restricts=[], numeric_restricts=[], sparse_embedding_values=[], sparse_embedding_dimensions=[]),
  MatchNeighbor(id='1816', distance=0.47184431552886963, sparse_distance=None, feature_vector=[], crowding_tag='0', restricts=[], numeric_restricts=[]
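
Since the index performs approximate retrieval, one possible sanity check is to compare the returned IDs with an exact brute-force ranking computed on the same vectors and measure how many of the exact neighbors were recovered. The sketch below shows only that comparison step; `recall_at_k` is a hypothetical helper and the IDs are toy values, not results from this notebook.

```python
def recall_at_k(approximate_ids, exact_ids):
    """Fraction of the exact top-k IDs that also appear in the approximate result."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approximate_ids) & set(exact_ids)) / len(exact_ids)


# Toy IDs (made up): the approximate search missed one of four exact neighbors.
toy_approximate_ids = ["a", "b", "c", "d"]
toy_exact_ids = ["a", "b", "c", "e"]

recall = recall_at_k(toy_approximate_ids, toy_exact_ids)
print(recall)
```

A recall close to 1.0 would suggest that the approximation loses little for the queries tested; a low value would suggest revisiting the index parameters.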

The next cell formats the `NUM_NEIGHBORS` most relevant abstracts into a dataframe that also contains the corresponding paper titles and URLs:

In [30]:
matched_ids = [int(match.id) for match in response[0]]
matched_distances = [match.distance for match in response[0]]
matched_titles = [metadata.title[i] for i in matched_ids]
matched_abstracts = [metadata.abstract[i] for i in matched_ids]
matched_urls = [metadata.url[i] for i in matched_ids]

matches = pd.DataFrame(
    {
        "distance": matched_distances,
        "title": matched_titles,
        "abstract": matched_abstracts,
        "url": matched_urls,
    }
)
matches

| | distance | title | abstract | url |
|---|---|---|---|---|
| 0 | 0.424082 | Modeling the effects of preventive measures an... | Coronavirus disease (COVID-19) onset in Decemb... | https://api.elsevier.com/content/article/pii/S... |
| 1 | 0.424591 | MEDIDAS DE BIOSSEGURANÇA PARA ENFRENTAMENTO AO... | Introdução A doença coronavírus 2019 (COVID-19... | https://api.elsevier.com/content/article/pii/S... |
| 2 | 0.45172 | Challenges and potential solutions in the deve... | Currently, COVID-19 pandemic has become an unp... | https://api.elsevier.com/content/article/pii/S... |
| 3 | 0.468038 | Controlling risk of SARS-CoV-2 infection in es... | The SARS-CoV-2 global pandemic poses significa... | https://api.elsevier.com/content/article/pii/S... |
| 4 | 0.471844 | Substantial Decline in Use of HIV Preexposure ... | In response to the novel coronavirus disease (... | https://doi.org/10.1097/qai.0000000000002514; ... |
| 5 | 0.47908 | Mosques in Japan responding to COVID-19 pandem... | Religious activities tend to be conducted in e... | https://api.elsevier.com/content/article/pii/S... |
| 6 | 0.47917 | Interplay between risk perception, behaviour, ... | Pharmaceutical and non-pharmaceutical interven... | https://arxiv.org/pdf/2112.12062v2.pdf; https:... |
| 7 | 0.479923 | [Institutional measures and desired supports r... | OBJECTIVES: Owing to the spread of COVID-19, m... | https://www.ncbi.nlm.nih.gov/pubmed/34162772/;... |
| 8 | 0.482903 | COVID-19 preventive behaviors and influencing ... | BACKGROUND: COVID19 is a respiratory disease c... | https://www.ncbi.nlm.nih.gov/pubmed/33451303/;... |
| 9 | 0.484601 | 9 Tips and Pearls for Safer Performance of Der... | SARS-CoV2 pandemic has affected dermatologypra... | https://www.ncbi.nlm.nih.gov/pubmed/33959527/;... |


Here is the Vector Search response formatted as a simple list for convenience. The list of returned papers may include some written in a language other than English, even though the query was in English. This demonstrates the multilingual ability of large language models and illustrates that matches are made on the basis of meaning rather than exact keyword matches:

In [31]:
html = "<html><body><ol>"
for i in range(len(matches)):
    html += f"""            
    <li> 
        <article>
            <header>
                <a href="{matches.url[i]}"> <h2>{matches.title[i]}</h2></a>
            </header>
            <p>{matches.abstract[i]}</p>
        </article>
    </li>
    """
html += "</ol></body></html>"
display.HTML(html)
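
Titles and abstracts can contain characters such as `<` or `&` that have a special meaning in HTML. As a possible refinement of the cell above, the sketch below escapes each field with the standard library's `html.escape` before building the list. `render_matches` is a hypothetical helper and the record is made up for illustration; the module is imported under the alias `html_lib` only to avoid clashing with the `html` string variable used in the notebook.

```python
import html as html_lib


def render_matches(records):
    """Build an ordered HTML list from dicts with 'title', 'abstract' and 'url' keys."""
    items = []
    for record in records:
        title = html_lib.escape(record["title"])
        abstract = html_lib.escape(record["abstract"])
        url = html_lib.escape(record["url"], quote=True)
        items.append(
            f'<li><article><header><a href="{url}"><h2>{title}</h2></a></header>'
            f"<p>{abstract}</p></article></li>"
        )
    return "<ol>" + "".join(items) + "</ol>"


# Toy record (made up) containing characters that need escaping.
toy_matches = [
    {
        "title": "Masks & distancing",
        "abstract": "Effect size < 0.5",
        "url": "https://example.com/a?x=1&y=2",
    }
]

page = render_matches(toy_matches)
print(page)
```

In the notebook, the resulting string could be passed to `display.HTML` in the same way as the `html` string above.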

## Cleaning Up

In [None]:
matching_engine.delete(force=True)
matching_engine_index.delete()

Copyright 2023 Google Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.