## Setup

Before getting started with the Vertex AI services, we need to set up the following.

* Install Python SDK
* Environment variables
* Authentication (Colab only)
* Enable APIs
* Set IAM permissions

### Install Python SDK

The Vertex AI, Cloud Storage, and BigQuery APIs can be accessed in multiple ways, including the REST API and the Python SDK. This tutorial uses the Python SDK.

In [None]:
# Install / upgrade packages only for this user in this notebook

%pip install --upgrade --user \
    google-genai \
    google-cloud-storage \
    google-cloud-logging \
    'google-cloud-bigquery[pandas]' \
    google-cloud-aiplatform

Collecting google-genai
  Downloading google_genai-1.52.0-py3-none-any.whl.metadata (46 kB)
Collecting google-cloud-logging
  Downloading google_cloud_logging-3.12.1-py2.py3-none-any.whl.metadata (5.0 kB)
Collecting google-cloud-aiplatform
  Downloading google_cloud_aiplatform-1.128.0-py2.py3-none-any.whl.metadata (46 kB)
Collecting google-cloud-appengine-logging<2.0.0,>=0.1.3 (from google-cloud-logging)
  Downloading google_cloud_appengine_logging-1.7.0-py3-none-any.whl.metadata (10 kB)
Collecting google-cloud-audit-log<1.0.0,>=0.3.1 (from google-cloud-logging)
  Downloading google_cloud_audit_log-0.4.0-py3-none-any.whl.metadata (9.3 kB)
Downloading google_genai-1.52.0-py3-none-any.whl (261 kB)
Downloading google_cloud_logging-3.12.1-py2.py3-none-any.whl (229 kB)
Downloading google_cloud_appengine_logging-1.7.0-py3-none-any.whl (16 kB)
Downloading google_cloud_audit_log-0.4.0-py3-none-any.whl (44 kB)
Downloading google_cloud_aiplatform-1.128.0-py2.py3-none-any.whl (8.1 MB)

### Restart current runtime

To use the newly installed packages in this Jupyter runtime, we must restart the runtime. 

In [2]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### Environment variables

Set the project ID and region that the rest of the notebook uses.

In [1]:
# get project ID and region
PROJECT_ID = "qwiklabs-gcp-00-7ee53ce98247"
LOCATION = "us-central1"

In [None]:
# generate a unique ID for this session
from datetime import datetime
UID = datetime.now().strftime("%m%d%H%M")
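
The `%m%d%H%M` format yields an eight-digit month-day-hour-minute string, so the ID is unique only to the minute and repeats every year. A minimal sketch with a fixed timestamp shows the shape of the result; the `make_uid` helper is hypothetical and introduced only for illustration.

```python
from datetime import datetime


def make_uid(now: datetime) -> str:
    """Format a timestamp as month, day, hour, minute (MMDDHHMM)."""
    return now.strftime("%m%d%H%M")


# A fixed timestamp keeps the example reproducible (the year is arbitrary
# because the format string does not include it).
example_uid = make_uid(datetime(2024, 11, 22, 4, 7))
print(example_uid)  # 11220407
```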

In [3]:
# Set up a logger that sends records to Cloud Logging
import logging
from google.cloud import logging as gcp_logging
from google.cloud.logging.handlers import CloudLoggingHandler

# Instantiate the Google Cloud Logging client
# (named logging_client so the Gen AI `client` created later does not reuse the name)
logging_client = gcp_logging.Client()

# Create a specific logger for cloud logging
logger = logging.getLogger('my_cloud_logger')

# Prevent this logger from sending messages to its parent (the root logger),
# which has the default console handler
logger.propagate = False

# Create and add the Cloud Logging handler
handler = CloudLoggingHandler(logging_client)
logger.addHandler(handler)

# Set the logging level
logger.setLevel(logging.INFO)

### Import Libraries

In [4]:
import random
import time
import numpy as np
import tqdm

### Enable APIs

Run the following to enable the Compute Engine, Vertex AI, Cloud Storage, and BigQuery APIs for this Google Cloud project.

In [5]:
! gcloud services enable compute.googleapis.com aiplatform.googleapis.com storage.googleapis.com bigquery.googleapis.com --project {PROJECT_ID}

Operation "operations/acat.p2-114918136414-3b1faa22-3ac9-44ba-bb19-f5e12fe5ccd2" finished successfully.


### Set IAM permissions

We also need to grant the default service account access to those services.

- Go to [the IAM page](https://console.cloud.google.com/iam-admin/) in the Console.
- Find the principal for the default Compute Engine service account. It should look like: `<project-number>-compute@developer.gserviceaccount.com`
- Click the edit button on the right, then click `ADD ANOTHER ROLE` to add the `Vertex AI User`, `BigQuery User`, and `Storage Admin` roles to the account.

## Getting Started with Vertex AI Embeddings for Text

Now we are ready to get started with embeddings!

### Data Preparation

We will use [the Stack Overflow public dataset](https://console.cloud.google.com/marketplace/product/stack-exchange/stack-overflow) hosted in the BigQuery table `bigquery-public-data.stackoverflow.posts_questions`. This is a very large dataset with 23 million rows that doesn't fit into memory, so we limit it to 1000 rows for this tutorial.

In [6]:
# load the BQ Table into a Pandas DataFrame
from google.cloud import bigquery

QUESTIONS_SIZE = 1000

bq_client = bigquery.Client(project=PROJECT_ID)
QUERY_TEMPLATE = """
        SELECT DISTINCT q.id, q.title
        FROM (SELECT * FROM `bigquery-public-data.stackoverflow.posts_questions`
        WHERE score > 0 ORDER BY view_count DESC) AS q
        LIMIT {limit} ;
        """
query = QUERY_TEMPLATE.format(limit=QUESTIONS_SIZE)
query_job = bq_client.query(query)
rows = query_job.result()
df = rows.to_dataframe()

# examine the data
df.head()

| | id | title |
|---|---|---|
| 0 | 73822240 | Directory path not defined error in Node - Ref... |
| 1 | 73822970 | Do i need to know the variables before launchi... |
| 2 | 73841996 | emacs how to bind "LEFT-POINTING DOUBLE ANGLE ... |
| 3 | 73587667 | Unity XR Interaction Toolkit multiple cameras ... |
| 4 | 73800797 | How to set different alignment on the header o... |


### Call the API to generate embeddings

With the Stack Overflow dataset, we will use the `title` column (the question title) and generate an embedding for each title with the Embeddings for Text API.

From the `google-genai` package, import `genai` and create a client that targets Vertex AI.

In [None]:
# create a Gen AI client that targets Vertex AI
from google import genai
from google.genai import types

EMBEDDING_MODEL = "gemini-embedding-001"
client = genai.Client(vertexai=True, 
                      project=PROJECT_ID, 
                      location=LOCATION)

Once you have the client, you can call its `models.embed_content` method to get embeddings. This tutorial passes 5 texts per call. But there is a caveat: by default, the text embeddings API has a "requests per minute" quota of 60 for new Cloud projects and 600 for projects with usage history (see [Quotas and limits](https://cloud.google.com/vertex-ai/docs/quotas#request_quotas) to check the latest quota value for `base_model:textembedding-gecko`). So, rather than calling the method directly, you may want to define a wrapper like the one below, which waits one second before each call (at most 60 calls per minute) and passes 5 texts each time.

In [8]:
# get embeddings for a list of texts
BATCH_SIZE = 5


def get_embeddings_wrapper(texts: list[str]) -> list[list[float]]:
    embeddings: list[list[float]] = []
    for i in tqdm.tqdm(range(0, len(texts), BATCH_SIZE)):
        time.sleep(1)  # to avoid the quota error
        response = client.models.embed_content(
            model=EMBEDDING_MODEL, contents=texts[i : i + BATCH_SIZE]
        )
        embeddings.extend(e.values for e in response.embeddings)
    return embeddings
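
To make the batching and pacing explicit without calling the API, here is a minimal sketch that swaps the real call for a stand-in. The names `embed_in_batches` and `fake_embed` are hypothetical and not part of any SDK, and the pause is injectable so the sketch runs instantly.

```python
from typing import Callable


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 5,
    pause: Callable[[], None] = lambda: None,
) -> list[list[float]]:
    """Call embed_fn on consecutive slices of texts, pausing before each call."""
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        pause()  # the notebook wrapper uses time.sleep(1) here
        embeddings.extend(embed_fn(texts[start : start + batch_size]))
    return embeddings


calls: list[int] = []  # records the size of every batch that was sent


def fake_embed(batch: list[str]) -> list[list[float]]:
    """Stand-in for the embedding API: one 1-dimensional vector per text."""
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


titles = [f"question {n}" for n in range(1000)]
vectors = embed_in_batches(titles, fake_embed)
print(len(calls), len(vectors))  # 200 1000
```

With 1000 titles and a batch size of 5 this makes 200 calls, so a one-second pause per call puts a floor of about 200 seconds on the total run time.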

The following code gets embeddings for the question titles and adds them to the DataFrame as a new `embedding` column. This will take a few minutes.

In [9]:
# get embeddings for the question titles and add them as "embedding" column
df = df.assign(embedding=get_embeddings_wrapper(list(df.title)))
df.head()

100%|██████████| 200/200 [03:59<00:00,  1.20s/it]


| | id | title | embedding |
|---|---|---|---|
| 0 | 73822240 | Directory path not defined error in Node - Ref... | [0.024633081629872322, -0.02726684883236885, 0... |
| 1 | 73822970 | Do i need to know the variables before launchi... | [-0.011780531145632267, -0.012115361168980598,... |
| 2 | 73841996 | emacs how to bind "LEFT-POINTING DOUBLE ANGLE ... | [-0.008853348903357983, -0.00976007804274559, ... |
| 3 | 73587667 | Unity XR Interaction Toolkit multiple cameras ... | [-0.006446841638535261, -0.0076327030546963215... |
| 4 | 73800797 | How to set different alignment on the header o... | [0.022850239649415016, -0.013919918797910213, ... |


## Look at the embedding similarities

Let's see how these embeddings are organized by meaning in the embedding space by calculating the similarities between them and sorting the results.

Because embeddings are vectors, you can calculate the similarity between two embeddings with one of the popular metrics, such as the following:

![](https://storage.googleapis.com/github-repo/img/embeddings/textemb-vs-notebook/8.png)

Which metric should we use? It usually depends on how each model was trained. For the `gemini-embedding-001` model, we use the inner product (dot product).
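
For intuition, here is a minimal pure-Python sketch of three commonly used metrics (dot product, cosine similarity, and Euclidean distance) on toy two-dimensional vectors. The helper names are illustrative only and the vectors are made up.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Inner (dot) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a: list[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    length = norm(a)
    return [x / length for x in a]


u = [3.0, 4.0]
v = [4.0, 3.0]
print(dot(u, v))                 # 24.0
print(cosine_similarity(u, v))   # 0.96
print(euclidean_distance(u, v))
```

For unit-length vectors the dot product and cosine similarity coincide: `dot(normalize(u), normalize(v))` gives the same value as `cosine_similarity(u, v)` up to floating-point rounding.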

The following code picks one question at random and uses the NumPy `np.dot` function to calculate the similarities between that question and every question in the DataFrame (including itself).

In [10]:
# pick one of them as a key question
# (randrange excludes len(df); randint includes it and could raise an IndexError)
key = random.randrange(len(df))

# calc dot product between the key and other questions
embs = np.array(df.embedding.to_list())
similarities = np.dot(embs[key], embs.T)

# print similarities for the first 5 questions
similarities[:5]

array([0.46090167, 0.47791521, 0.4531224 , 0.52944323, 0.47942473])

Finally, sort the questions by similarity and print the top 20.

In [11]:
# print the question
print(f"Key question: {df.title[key]}\n")

# sort and print the questions by similarities
sorted_questions = sorted(
    zip(df.title, similarities), key=lambda x: x[1], reverse=True
)[:20]
for question, similarity in sorted_questions:
    print(f"{similarity:.4f} {question}")

Key question: When many Akka Actors send messages to one actor, how to cleanly handle inheritance of inner Command classes

1.0000 When many Akka Actors send messages to one actor, how to cleanly handle inheritance of inner Command classes
0.6412 Create and call actor from another actor in actix
0.6198 Which is best using separate kafka template / using same kafka template for different topic
0.5974 Kotlin/Java: Add multiple items to a Builder class that only allows one .add() at a time
0.5970 How to refactor the code to obey the rule ‘open-closed’?
0.5914 Consume same message by 2 replicas kubernetes
0.5851 boost::asio delegate type erasure
0.5842 How to register a generic dependency in Autofac when I dont know the Types in advance?
0.5837 DDD problems with aggregates and transactions
0.5814 Filtering an array based on an Inner array
0.5813 Confused on How to Implement Main Method that Runs Program in Order:
0.5797 What is the preferred way of importing object hierarchies into Python
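
Sorting every question works well for 1000 rows. When only the top few matches are needed, a partial selection is enough. Here is a minimal sketch using the standard library's `heapq.nlargest` on made-up titles and scores:

```python
import heapq

# Toy data: these titles and similarity scores are invented for illustration.
titles = ["title a", "title b", "title c", "title d"]
similarities = [0.41, 0.93, 0.58, 0.77]

# nlargest compares the tuples by their first element (the similarity),
# so it returns the k best (similarity, title) pairs in descending order.
top2 = heapq.nlargest(2, zip(similarities, titles))
for similarity, title in top2:
    print(f"{similarity:.4f} {title}")
```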


## Get Started with Vector Search

### Setting up Vector Search
- Save the embeddings in JSON files on Cloud Storage
- Build an Index
- Create an Index Endpoint
- Deploy the Index to the endpoint

### Use Vector Search

- Query with the endpoint

### Save the embeddings in a JSON file

To load the embeddings into Vector Search, we need to save them in a file in JSONL format (one JSON object per line).

First, export the `id` and `embedding` columns from the DataFrame in JSONL format, and save it.

In [None]:
# save id and embedding as a json file
jsonl_string = df[["id", "embedding"]].to_json(orient="records", lines=True)
with open("questions.json", "w") as f:
    f.write(jsonl_string)

# show the first few lines of the json file
! head -n 3 questions.json

{"id":73822240,"embedding":[0.0246330816,-0.0272668488,0.0207392909,-0.0440361314,-0.031373255,0.0284537803,0.0120917652,-0.0299413074,0.0027486072,-0.0006155428,-0.0224032085,-0.0019137176,-0.0091092316,0.0201763678,0.1139698252,-0.0056741177,-0.0100406399,0.0108823255,0.0048874109,-0.0374809504,-0.0164337289,0.0100775193,0.0219101999,-0.0113371043,-0.0097626634,0.0088247182,0.0279919561,-0.0052137263,0.0465257764,-0.0083526811,0.0089516267,-0.0114217736,-0.0134397177,-0.0137203308,-0.0099498592,-0.0011264359,0.0258511826,0.0000112111,0.0054886746,0.0026485834,-0.0211554486,-0.0012422909,0.0301114433,-0.0218603723,-0.0161140617,0.012670056,-0.000627226,-0.0278853215,-0.0009695567,0.0261852238,0.0101965601,-0.0088438839,-0.0207830369,-0.1745157987,-0.0151121235,0.00538381,-0.0050421939,0.0116567928,0.0064808633,-0.0204410274,-0.0054700961,0.0013384022,0.0014969226,-0.0298429243,-0.0151881734,-0.0055720452,0.0109770643,0.016403662,-0.0163887739,-0.0011517595,0.0087300232,-0.0022488798,-
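
To show what the JSONL layout looks like on a small scale, here is a minimal standard-library sketch that writes and re-reads two records. The records are made up and use three-element vectors for readability.

```python
import json

# Two invented records with the same keys the notebook exports.
records = [
    {"id": 1, "embedding": [0.1, -0.2, 0.3]},
    {"id": 2, "embedding": [0.0, 0.5, -0.4]},
]

# JSONL: one compact JSON object per line, separated by newlines.
jsonl_text = "\n".join(json.dumps(r, separators=(",", ":")) for r in records)
print(jsonl_text)

# Reading back: each non-empty line is parsed independently.
parsed = [json.loads(line) for line in jsonl_text.splitlines() if line]
print(parsed == records)  # True
```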

Then, create a new Cloud Storage bucket and copy the file to it.

In [13]:
BUCKET_URI = f"gs://{PROJECT_ID}-embvs-tutorial-{UID}"
! gsutil mb -l {LOCATION} -p {PROJECT_ID} {BUCKET_URI}
! gsutil cp questions.json {BUCKET_URI}

Creating gs://qwiklabs-gcp-00-7ee53ce98247-embvs-tutorial-11220407/...
Copying file://questions.json [Content-Type=application/json]...
- [1 files][ 39.3 MiB/ 39.3 MiB]                                                
Operation completed over 1 objects/39.3 MiB.                                     
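
Bucket names share a global namespace, which is why the project ID and the session ID are combined above. The sketch below checks a candidate name against a simplified reading of the naming rules (3 to 63 characters; lowercase letters, digits, hyphens, underscores, and dots; starting and ending with a letter or digit). These rules are an assumption stated for illustration and are not exhaustive, so consult the Cloud Storage documentation for the authoritative list. The helper name and the project ID are hypothetical.

```python
import re

# Assumed basic rules: 3-63 characters, lowercase letters, digits, hyphens,
# underscores and dots, starting and ending with a letter or digit.
BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if the whole name matches the simplified pattern."""
    return BUCKET_NAME_RE.fullmatch(name) is not None


project_id = "my-project"  # hypothetical project ID
uid = "11220407"           # session ID in MMDDHHMM form
bucket_name = f"{project_id}-embvs-tutorial-{uid}"

print(looks_like_valid_bucket_name(bucket_name))  # True
print(looks_like_valid_bucket_name("My_Bucket"))  # False (uppercase letters)
```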


### Create an Index

Now we are ready to load the embeddings into Vector Search. Its APIs are available under the [aiplatform](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform) package of the SDK.

In [14]:
# init the aiplatform package
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION)

Create a [MatchingEngineIndex](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.MatchingEngineIndex) with its `create_tree_ah_index` function (Matching Engine is the previous name of Vector Search).

In [15]:
# create index
from google.cloud import aiplatform

PROJECT_ID = "qwiklabs-gcp-00-7ee53ce98247"
LOCATION = "us-central1"
BUCKET_URI = f"gs://{PROJECT_ID}-embvs-tutorial-{UID}"

aiplatform.init(project=PROJECT_ID, location=LOCATION)

my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="embvs-tutorial-index",
    contents_delta_uri=BUCKET_URI,
    dimensions=3072,
    approximate_neighbors_count=20,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    leaf_node_embedding_count=100,        # optional
    leaf_nodes_to_search_percent=10       # optional
)

Creating MatchingEngineIndex
Create MatchingEngineIndex backing LRO: projects/114918136414/locations/us-central1/indexes/524187770495696896/operations/2383159069151068160
MatchingEngineIndex created. Resource name: projects/114918136414/locations/us-central1/indexes/524187770495696896
To use this MatchingEngineIndex in another session:
index = aiplatform.MatchingEngineIndex('projects/114918136414/locations/us-central1/indexes/524187770495696896')
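
The `dimensions` argument has to match the length of every stored vector (3072 in this notebook). Here is a minimal sketch of a check that could be run before building an index, shown on toy data; `check_dimensions` is a hypothetical helper, not an SDK function.

```python
def check_dimensions(embeddings, expected: int) -> list[int]:
    """Return the positions of vectors whose length differs from expected."""
    return [i for i, vec in enumerate(embeddings) if len(vec) != expected]


# Toy data: the last vector is deliberately one element short.
toy_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8]]
bad_rows = check_dimensions(toy_embeddings, expected=3)
print(bad_rows)  # [2]
```

In the notebook, a call such as `check_dimensions(df.embedding, 3072)` would be expected to return an empty list when all embeddings have the indexed dimension.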


### Create Index Endpoint and deploy the Index

To use the Index, you need to create an [Index Endpoint](https://cloud.google.com/vertex-ai/docs/vector-search/deploy-index-public). It works as a server instance accepting query requests for your Index.

In [16]:
# create IndexEndpoint
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=f"embvs-tutorial-index-endpoint-{UID}",
    public_endpoint_enabled=True,
)

Creating MatchingEngineIndexEndpoint
Create MatchingEngineIndexEndpoint backing LRO: projects/114918136414/locations/us-central1/indexEndpoints/3693639988722794496/operations/5361164322749808640
MatchingEngineIndexEndpoint created. Resource name: projects/114918136414/locations/us-central1/indexEndpoints/3693639988722794496
To use this MatchingEngineIndexEndpoint in another session:
index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/114918136414/locations/us-central1/indexEndpoints/3693639988722794496')


With the Index Endpoint, deploy the Index by specifying a unique deployed index ID.

In [17]:
DEPLOYED_INDEX_ID = f"embvs_tutorial_deployed_{UID}"

In [18]:
# deploy the Index to the Index Endpoint
my_index_endpoint.deploy_index(
    index=my_index,
    deployed_index_id=DEPLOYED_INDEX_ID,
    machine_type="e2-standard-16",
    min_replica_count=1,
    max_replica_count=1,
)

Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/114918136414/locations/us-central1/indexEndpoints/3693639988722794496
Deploy index MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/114918136414/locations/us-central1/indexEndpoints/3693639988722794496/operations/7688962380146933760
MatchingEngineIndexEndpoint index_endpoint Deployed index. Resource name: projects/114918136414/locations/us-central1/indexEndpoints/3693639988722794496


<google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint.MatchingEngineIndexEndpoint object at 0x7f3866cc22f0> 
resource name: projects/114918136414/locations/us-central1/indexEndpoints/3693639988722794496

### Run Query

Finally, we are ready to use Vector Search. The following code creates an embedding for a test question and finds similar questions with Vector Search.

In [None]:
test_embeddings = get_embeddings_wrapper(["How to read JSON with Python?"])

In [None]:
# Test query
response = my_index_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=test_embeddings,
    num_neighbors=20,
)

for neighbor in response[0]:
    # neighbor IDs come back as strings; convert to match the integer id column
    question_id = np.int64(neighbor.id)
    similar = df.query("id == @question_id", engine="python")
    print(f"{neighbor.distance:.4f} {similar.title.values[0]}")
    # Do not remove or modify this logging call, it will be used for tracking purposes
    logger.info(f'Task 4. Similar question with the vector search is: {similar.title.values[0]}')
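
The loop above converts each neighbor ID from a string back to an integer to find the matching row. Here is a minimal sketch of the same lookup with a plain dictionary. The `Neighbor` class is a stand-in for the objects returned by `find_neighbors`, and the IDs, scores, and titles are made up.

```python
from dataclasses import dataclass


@dataclass
class Neighbor:
    """Stand-in for one search result: a string ID and a score."""
    id: str
    distance: float


# Invented lookup table from integer question ID to title.
titles_by_id = {
    101: "How to parse JSON in Python?",
    102: "Reading a CSV file with pandas",
}

# Invented results, in the order a query might return them.
results = [Neighbor("101", 0.88), Neighbor("102", 0.71)]

# Convert each string ID to int before looking up the title.
matches = [(n.distance, titles_by_id[int(n.id)]) for n in results]
for score, title in matches:
    print(f"{score:.4f} {title}")
```

Building the dictionary once (for example with `dict(zip(df.id, df.title))`) avoids scanning the DataFrame for every neighbor.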