# Semantic Search with Vertex Vector Search

**Learning Objectives**
  1. Learn how to create text embeddings using the Vertex Text Embeddings API
  2. Learn how to load embeddings into Vertex Vector Search
  3. Learn how to query Vertex Vector Search
  4. Learn how to build an information retrieval system based on semantic match
  
  
In this notebook, we implement a simple (albeit fast and scalable) [semantic search](https://en.wikipedia.org/wiki/Semantic_search#:~:text=Semantic%20search%20seeks%20to%20improve,to%20generate%20more%20relevant%20results.) retrieval system using [Vertex Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview) and [Vertex Text Embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings). In a semantic search system, a number of documents are returned in response to a user query, ranked by their semantic match. Unlike a boolean or keyword-based retrieval system, the returned documents should match the intent or meaning of the query rather than its exact keywords. Such a semantic search system generally has two components:

* A component that produces semantically meaningful vector representations of both the documents and the user queries; we will use the [Vertex Text Embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) API to create these embeddings, leveraging the power of large language models.

* A component that allows users to store the document vector embeddings and retrieve the most relevant documents, that is, the documents whose embeddings are closest to the user-query embedding in the embedding space. We will use [Vertex Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview), which can scale up to billions of embeddings thanks to an [efficient approximate nearest neighbor strategy](https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html) for comparing and retrieving the document vectors closest to a query vector, based on a [recent paper from Google Research](https://arxiv.org/abs/1908.10396).

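To make the interplay of the two components concrete, here is a minimal sketch in plain Python. It assumes hypothetical 3-dimensional embeddings that are made up for illustration (real embeddings come from the embedding API and are much longer), and it ranks documents by exact cosine similarity rather than with the approximate strategy that Vector Search uses. The names `cosine_similarity` and `rank_documents` are illustrative helpers, not part of any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_documents(query_vector, document_vectors, top_k=2):
    """Return the ids of the top_k documents most similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in document_vectors.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]


# Hypothetical embeddings, invented for this example only.
toy_documents = {
    "vaccines": [0.9, 0.1, 0.0],
    "masks": [0.7, 0.3, 0.1],
    "economics": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.0]

top_ids = rank_documents(toy_query, toy_documents)
print(top_ids)
```

This brute-force comparison touches every document, which is fine for a handful of vectors but motivates an approximate index once the collection grows.
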


**Dataset:** We will use a very small subset of the [COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge), which contains around 1 million medical research papers focused on COVID-19. For the sake of speed, we will use only 4000 titles, abstracts, and URLs, all from 2021.


## Setup 

In [None]:
import json
import os

import pandas as pd
from google import genai
from google.cloud import aiplatform
from IPython import display

In [None]:
REGION = "us-central1"
PROJECT = !(gcloud config get-value core/project)
PROJECT = PROJECT[0]
BUCKET = f"{PROJECT}-cord19-semantic-search"

# Do not change these
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION

In [None]:
!gsutil ls gs://{BUCKET} || gsutil mb -l {REGION} gs://{BUCKET}

## Loading the data

The dataset we will use consists of the title, abstract, and URL metadata of roughly 4000 samples from the ~1 million medical papers in the [COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge). In this lab, we use the abstracts as the documents for which we compute and store embeddings.

In [None]:
metadata = pd.read_csv("../data/cord19_metadata_sample.csv.gz")
metadata = metadata[~metadata.abstract.isna()]
metadata.index = range(len(metadata))
metadata.head()

## Creating the embeddings

The first thing to do is to create embedding vectors for our abstracts. For that, we need to first instantiate the Gen AI SDK client:

In [None]:
client = genai.Client(vertexai=True, project=PROJECT, location=REGION)
EMBEDDING_MODEL = "text-embedding-004"

The embedding model can process a list of up to 5 texts in a single call. We therefore iterate over `metadata.abstract` in batches of 5 and feed each batch to `client.models.embed_content` to create the embeddings of all the abstracts, which we then store in the list `vectors`. Running the next cell will take a couple of minutes:

In [None]:
MAX_BATCH_SIZE = 5
vectors = []

for i in range(0, len(metadata), MAX_BATCH_SIZE):
    batch = metadata.abstract[i : i + MAX_BATCH_SIZE].to_list()
    embeddings = client.models.embed_content(
        model=EMBEDDING_MODEL, contents=batch
    )
    vectors.extend([embedding.values for embedding in embeddings.embeddings])

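The slicing pattern used in the loop above can be isolated into a small generator, which makes the handling of the last, possibly shorter batch easy to check without calling the embedding API. This is only a sketch: `iter_batches` is a hypothetical helper name, and the short strings stand in for real abstracts.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# Stand-in for the abstracts: 12 short strings instead of real text.
toy_abstracts = [f"abstract {n}" for n in range(12)]

batches = list(iter_batches(toy_abstracts, batch_size=5))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

With 12 items and a batch size of 5, the generator yields two full batches followed by a final batch holding the remaining 2 items, so no abstract is skipped or duplicated.
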
## Creating the Vector Search engine input file

At this point, our abstract embeddings are stored in memory in the `vectors` list. To store these embeddings into [Vertex Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview), we need to serialize them into a JSON file with the [following format](https://cloud.google.com/vertex-ai/docs/vector-search/setup/format-structure):

```python
{"id": <DOCUMENT_ID1>, "embedding": [0.1, ..., -0.7]}
{"id": <DOCUMENT_ID2>, "embedding": [-0.4, ..., 0.8]}
etc.
```
where the value of the `id` field should be an identifier allowing us to retrieve the actual document from a separate source, and the value of `embedding` is the vector returned by the text embedding API. 

For the document `id` we simply use the row index in the `metadata` DataFrame, which will serve as our in-memory document store. This makes it particularly easy to retrieve the abstract, title and url from an `id` returned by the vector search:

```python
metadata.abstract[id]
metadata.title[id]
metadata.url[id]
```

The next cell iterates over `vectors` and, for each entry, appends a JSON line in the format above to `cord19_embeddings.json`. Each line contains the index of the abstract in `metadata` together with the embedding vector returned by the text embedding API:

In [None]:
embeddings_file_path = "cord19_embeddings.json"

# Removing the embedding file if it already exists
!test -f {embeddings_file_path} && rm {embeddings_file_path}

with open(embeddings_file_path, "a") as embeddings_file:
    for i, embedding in enumerate(vectors):
        json_line = json.dumps({"id": i, "embedding": embedding}) + "\n"
        embeddings_file.write(json_line)

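Beyond counting lines, it can be useful to read the file back and confirm that every line parses as JSON with the two expected fields and a consistent vector length. The following is a minimal sketch of such a check on a toy file; `check_embeddings_file` is a hypothetical helper, and the 2-dimensional vectors are made up for illustration.

```python
import json
import os
import tempfile


def check_embeddings_file(path, expected_count, expected_dim):
    """Return True if the file has one well-formed JSON record per document."""
    count = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # Each record should carry exactly the two documented fields.
            if set(record) != {"id", "embedding"}:
                return False
            if len(record["embedding"]) != expected_dim:
                return False
            count += 1
    return count == expected_count


# Toy vectors, invented for this example; real embeddings are much longer.
toy_vectors = [[0.1, -0.7], [-0.4, 0.8], [0.3, 0.3]]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_embeddings.json")
    with open(toy_path, "w") as f:
        for i, vector in enumerate(toy_vectors):
            f.write(json.dumps({"id": i, "embedding": vector}) + "\n")

    file_is_valid = check_embeddings_file(toy_path, expected_count=3, expected_dim=2)
    wrong_dim_detected = not check_embeddings_file(toy_path, expected_count=3, expected_dim=3)

print(file_is_valid, wrong_dim_detected)
```

In the notebook, the same function could be pointed at `embeddings_file_path` with `len(metadata)` and `len(vectors[0])` as the expected values.
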
Let us verify that our embedding file has as many lines as our original dataframe has rows (one line per abstract), and then save it to a GCS bucket:

In [None]:
!wc -l {embeddings_file_path}
print(len(metadata), "metadata dataframe")

In [None]:
EMBEDDINGS_URI = f"gs://{BUCKET}"

!gsutil cp {embeddings_file_path} {EMBEDDINGS_URI}

## Creating the Vector Search engine index

We are now ready to set up [Vertex Vector Search](https://cloud.google.com/vertex-ai/docs/matching-engine/overview). The procedure requires two steps:

1. The [creation of an index](https://cloud.google.com/vertex-ai/docs/vector-search/overview)
1. The [deployment of this index to an endpoint](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-public)

While creating the index, the embedding vectors are uploaded to Vector Search and a tree-like data structure (the index) is created, allowing for fast but approximate retrieval of the `approximate_neighbors_count` nearest neighbors of a given vector. The index depends on a notion of distance between embedding vectors that we need to specify in the `distance_measure_type`. Here we choose `COSINE_DISTANCE`, which is essentially a measure of the angle between the embedding vectors. Other possible choices are the square of the Euclidean distance (`SQUARED_L2_DISTANCE`), the [Manhattan distance](https://en.wikipedia.org/wiki/Taxicab_geometry) (`L1_DISTANCE`), or the dot product distance (`DOT_PRODUCT_DISTANCE`). (Note that if the embeddings you are using have been trained to minimize one of these distances between matching pairs, then you may get better results by selecting that particular distance; otherwise `COSINE_DISTANCE` will do just fine.)

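The four distance options can be written out in a few lines of plain Python. This is a minimal sketch of the textbook definitions, not of how Vector Search computes or reports them internally. The function names are chosen here for illustration, and treating the dot product distance as the negated dot product is an assumption about the convention.

```python
import math


def dot_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_distance(u, v):
    """1 - cos(angle): 0 for parallel vectors, 1 for orthogonal ones."""
    norms = math.sqrt(dot_product(u, u)) * math.sqrt(dot_product(v, v))
    return 1.0 - dot_product(u, v) / norms


def squared_l2_distance(u, v):
    """Square of the Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def l1_distance(u, v):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def dot_product_distance(u, v):
    """Negated dot product, so a larger dot product means a smaller distance."""
    return -dot_product(u, v)


# Two made-up vectors of equal length.
u = [3.0, 4.0]
v = [4.0, 3.0]

distances = {
    "COSINE_DISTANCE": cosine_distance(u, v),
    "SQUARED_L2_DISTANCE": squared_l2_distance(u, v),
    "L1_DISTANCE": l1_distance(u, v),
    "DOT_PRODUCT_DISTANCE": dot_product_distance(u, v),
}
print(distances)
```

Note that the cosine distance ignores vector length: rescaling `v` leaves it unchanged, whereas the other three measures all change.
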
The next cell creates the matching engine index from the embedding file. Running it will take about 1 hour:

In [None]:
DISPLAY_NAME = "cord19_embeddings"

matching_engine_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=DISPLAY_NAME,
    contents_delta_uri=EMBEDDINGS_URI,
    dimensions=len(vectors[0]),
    approximate_neighbors_count=150,
    distance_measure_type="COSINE_DISTANCE",
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=7,
    description=DISPLAY_NAME,
)

Once the index is created, it is associated with a resource name:

In [None]:
INDEX_RESOURCE_NAME = matching_engine_index.resource_name
print(INDEX_RESOURCE_NAME)

In turn, this index resource name can be used to instantiate the index:

In [None]:
matching_engine_index = aiplatform.MatchingEngineIndex(
    index_name=INDEX_RESOURCE_NAME
)

Now that our index has been created, we need to make it accessible so that we can query it. The first step is to create a public endpoint (for speedups, one can also create a [private endpoint in a VPC network](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-vpc)):

In [None]:
matching_engine_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=DISPLAY_NAME,
    description=DISPLAY_NAME,
    public_endpoint_enabled=True,
)

The second step is to deploy the index to the endpoint we created: 

In [None]:
DEPLOYED_INDEX_ID = f"{DISPLAY_NAME}_deployed"

matching_engine = matching_engine_endpoint.deploy_index(
    index=matching_engine_index, deployed_index_id=DEPLOYED_INDEX_ID
)

matching_engine.deployed_indexes

## Querying Vector Search

We are now ready to issue queries to Vector Search! 

To begin with, we need to create a text embedding from a user query: 

In [None]:
QUERY = "prophylactic measures"

embeddings = client.models.embed_content(model=EMBEDDING_MODEL, contents=QUERY)
text_embeddings = [vector.values for vector in embeddings.embeddings]

Then we can use the `find_neighbors` method of the endpoint to which our Vector Search index is deployed. This method takes as input the embedding vector of the user query and returns the abstract IDs of the `NUM_NEIGHBORS` nearest neighbors:

In [None]:
# Define number of neighbors to return
NUM_NEIGHBORS = 10

response = matching_engine.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=text_embeddings,
    num_neighbors=NUM_NEIGHBORS,
)

response

The next cell formats the `NUM_NEIGHBORS` most relevant abstracts into a dataframe that also contains the corresponding paper titles and URLs:

In [None]:
matched_ids = [int(match.id) for match in response[0]]
matched_distances = [match.distance for match in response[0]]
matched_titles = [metadata.title[i] for i in matched_ids]
matched_abstracts = [metadata.abstract[i] for i in matched_ids]
matched_urls = [metadata.url[i] for i in matched_ids]

matches = pd.DataFrame(
    {
        "distance": matched_distances,
        "title": matched_titles,
        "abstract": matched_abstracts,
        "url": matched_urls,
    }
)
matches

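The join between the returned neighbor ids and the document store can also be sketched without pandas. The example below assumes a hypothetical `Neighbor` dataclass that mimics only the `id` and `distance` attributes used above, and a plain dictionary as the document store. Skipping ids that are missing from the store is a design choice made for this sketch, not something the notebook does.

```python
from dataclasses import dataclass


@dataclass
class Neighbor:
    """Hypothetical stand-in exposing the two attributes used above."""

    id: str
    distance: float


def neighbors_to_rows(neighbors, documents):
    """Attach document fields to each neighbor, keeping the returned order."""
    rows = []
    for neighbor in neighbors:
        doc = documents.get(int(neighbor.id))
        if doc is None:
            # The id is not in the document store; skip it in this sketch.
            continue
        rows.append({"distance": neighbor.distance, **doc})
    return rows


# Toy document store and neighbors, invented for this example.
toy_documents = {
    0: {"title": "Paper A", "url": "https://example.com/a"},
    1: {"title": "Paper B", "url": "https://example.com/b"},
}
toy_neighbors = [Neighbor("1", 0.12), Neighbor("0", 0.34), Neighbor("7", 0.5)]

rows = neighbors_to_rows(toy_neighbors, toy_documents)
print(rows)
```

Because the ids come back as strings, the conversion with `int` is what ties each neighbor to its row index in the store.
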
Here is the Vector Search response formatted as a simple list for convenience. In the list of returned papers, you may see some written in a language other than English even though the query was in English. This demonstrates the multilingual ability of large language models and illustrates that the matches are made on the basis of meaning rather than exact keyword matches:

In [None]:
html = "<html><body><ol>"
for i in range(len(matches)):
    html += f"""            
    <li> 
        <article>
            <header>
                <a href="{matches.url[i]}"> <h2>{matches.title[i]}</h2></a>
            </header>
            <p>{matches.abstract[i]}</p>
        </article>
    </li>
    """
html += "</ol></body></html>"
display.HTML(html)

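Because titles and abstracts are inserted into the markup verbatim, characters such as `<` or `&` in a paper's text could break the rendered list. One possible variant, sketched here under the assumption that each match is a dictionary with `url`, `title`, and `abstract` keys, escapes every field with `escape` from the standard library's `html` module. The helper name `render_matches` and the toy row are hypothetical.

```python
from html import escape


def render_matches(rows):
    """Build an ordered HTML list, escaping all text taken from the documents."""
    items = []
    for row in rows:
        items.append(
            "<li><article>"
            f'<header><a href="{escape(row["url"], quote=True)}">'
            f'<h2>{escape(row["title"])}</h2></a></header>'
            f'<p>{escape(row["abstract"])}</p>'
            "</article></li>"
        )
    return "<html><body><ol>" + "".join(items) + "</ol></body></html>"


# Toy row containing characters that need escaping; invented for this example.
toy_rows = [
    {
        "url": "https://example.com/paper?x=1&y=2",
        "title": "Masks & <vaccines>",
        "abstract": "Risk when a < b",
    }
]

page = render_matches(toy_rows)
print(page)
```

The returned string can be passed to `display.HTML` in the same way as the string built in the cell above.
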
## Cleaning Up

In [None]:
matching_engine.delete(force=True)
matching_engine_index.delete()

Copyright 2023 Google Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.