# Setup

In [1]:
%pip install -q --upgrade --user google-cloud-aiplatform google-cloud-discoveryengine google-cloud-storage google-cloud-bigquery[pandas] google-cloud-bigquery-storage pandas ipywidgets

Note: you may need to restart the kernel to use updated packages.


In [2]:
%load_ext google.cloud.bigquery

The google.cloud.bigquery extension is already loaded. To reload it, use:
  %reload_ext google.cloud.bigquery


In [1]:
from typing import List
import requests
import subprocess
import time

from google.api_core.client_options import ClientOptions
from google.api_core.exceptions import GoogleAPICallError
from google.cloud import bigquery
from google.cloud import discoveryengine_v1alpha as discoveryengine
from google.cloud import storage

from tqdm import tqdm  # to show a progress bar

import vertexai
from vertexai.language_models import TextEmbeddingModel, TextEmbeddingInput

tqdm.pandas()

In [2]:
# Define project information for Vertex AI
PROJECT_ID = "qwiklabs-gcp-02-9355e20ba053"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

# Initialize Vertex AI SDK
vertexai.init(project=PROJECT_ID, location=LOCATION)

In [3]:
import os
os.environ

environ{'SHELL': '/bin/bash',
        'CONDA_EXE': '/opt/conda/bin/conda',
        '_CE_M': '',
        'VERTEX_PRODUCT': 'WORKBENCH_INSTANCE',
        'GRPC_FORK_SUPPORT_ENABLED': '0',
        'DL_ANACONDA_HOME': '/opt/conda',
        'FRAMEWORK_FILE_PATH': '/opt/deeplearning/metadata/framework',
        'GOOGLE_CLOUD_PROJECT': 'qwiklabs-gcp-02-9355e20ba053',
        'POST_STARTUP_SCRIPT_PATH': '/opt/c2d/post_start.sh',
        'DL_PATH_DEPS': '/opt/deeplearning/deps',
        'DL_BIN_PATH': '/opt/deeplearning/bin',
        'XML_CATALOG_FILES': 'file:///opt/conda/etc/xml/catalog file:///etc/xml/catalog',
        'KERNEL_LAUNCH_TIMEOUT': '598',
        'BINARIES_PATH': '/opt/deeplearning/binaries',
        'PWD': '/home/jupyter',
        'LOGNAME': 'jupyter',
        'CONDA_PREFIX': '/opt/conda',
        'JPY_SESSION_NAME': '/home/jupyter/Untitled.ipynb',
        'TENSORBOARD_PROXY_URL': '/proxy/%PORT%/',
        'GOOGLE_CLOUD_REGION': 'us-east1',
        'PACKAGE_SOURCE_CODE_PATH': '/

# Task 3. Create embeddings with Vertex AI

## Data Preparation

We will be using the Stack Overflow public dataset hosted in the BigQuery table bigquery-public-data.stackoverflow.posts_questions. The table has about 23 million rows, which is too many to load into memory, so this lab limits the query to 500 rows.

https://console.cloud.google.com/marketplace/product/stack-exchange/stack-overflow

In this task, we will:
- Fetch the data from BigQuery
- Concatenate the title and body, and create embeddings from the combined text.

1. Run the following code snippet in the next cell. It connects to BigQuery, executes a SQL query against the Stack Overflow dataset, loads the results into a pandas DataFrame, and converts the id column to strings.

In [4]:
bq_client = bigquery.Client(project=PROJECT_ID)
query = f"""
SELECT
  DISTINCT
  q.id,
  q.title,
  q.body,
  q.answer_count,
  q.comment_count,
  q.creation_date,
  q.last_activity_date,
  q.score,
  q.tags,
  q.view_count
FROM
  `bigquery-public-data.stackoverflow.posts_questions` AS q
WHERE
  q.score > 0
ORDER BY
  q.view_count DESC
LIMIT
  500;
"""

# Load the BQ Table into a Pandas Dataframe
df = bq_client.query(query).result().to_dataframe()

# Convert ID to String
df["id"] = df["id"].apply(str)

# examine the data
df.head()

Unnamed: 0,id,title,body,answer_count,comment_count,creation_date,last_activity_date,score,tags,view_count
0,927358,How do I undo the most recent local commits in...,<p>I accidentally committed the wrong files to...,100,12,2009-05-29 18:09:14.627000+00:00,2022-09-09 08:13:22.747000+00:00,24809,git|version-control|git-commit|undo,11649204
1,5767325,How can I remove a specific item from an array?,<p>How do I remove a specific value from an ar...,137,7,2011-04-23 22:17:18.487000+00:00,2022-09-16 16:24:04.310000+00:00,10953,javascript|arrays,10493798
2,2003505,How do I delete a Git branch locally and remot...,<h4>Failed Attempts to Delete a Remote Branch:...,42,10,2010-01-05 01:12:15.867000+00:00,2022-09-20 09:16:37.687000+00:00,19556,git|version-control|git-branch|git-push|git-re...,10278934
3,16956810,How to find all files containing specific text...,<p>How do I find all files containing a specif...,53,9,2013-06-06 08:06:45.533000+00:00,2022-09-04 13:42:00.477000+00:00,6894,linux|text|grep|directory|find,9378947
4,4114095,How do I revert a Git repository to a previous...,<p>How do I revert from my current state to a ...,41,3,2010-11-06 16:58:14.550000+00:00,2022-09-02 06:25:46.480000+00:00,7617,git|git-checkout|git-reset|git-revert,8956751


In [5]:
df.to_csv('stackoverflow_posts_questions.csv', index=False)

sample output
![sample output](https://cdn.qwiklabs.com/xCMcHqFtLgZ%2FxCWlyro%2BF8juZ2Vy%2Bg1PyN5UKmVy3%2FI%3D)

## Call the API to generate embeddings

With the Stack Overflow dataset, we will use the title and body columns (the question title and description) and generate embeddings for them with the Embeddings for Text API. The API is available under the vertexai package of the SDK.

From the package, import TextEmbeddingModel and get a model.

For more information, refer to:
- Vertex AI: Get Text Embeddings https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings
- Vertex AI: Model versions and lifecycle https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions


In [6]:
# Run the following code snippet in the next cell, that loads a pre-trained text embeddings model.
# Load the text embeddings model
model = TextEmbeddingModel.from_pretrained("text-embedding-004")



In [8]:
# Run the following code snippet in the next cell, to define a python function named get_embeddings_wrapper that takes a list of texts and an optional batch size as input parameters. The function uses the previously loaded model (a text embeddings model) to obtain embeddings for the provided texts.
# Get embeddings for a list of texts
def get_embeddings_wrapper(texts, batch_size: int = 50) -> List:
    embs = []
    for i in tqdm(range(0, len(texts), batch_size)):
        # Create embeddings optimized for document retrieval
        # (supported in textembedding-gecko@002 and later)
        result = model.get_embeddings(
            [
                TextEmbeddingInput(text=text, task_type="RETRIEVAL_DOCUMENT")
                for text in texts[i : i + batch_size]
            ]
        )
        embs.extend([e.values for e in result])
    return embs
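
The batching in `get_embeddings_wrapper` can be sketched without calling the API. The snippet below is a minimal illustration, assuming a hypothetical helper named `split_into_batches` and made-up input strings. It only shows how 500 texts break into slices of 50, which is why the progress bar for a 500-row DataFrame reports 10 steps.

```python
from typing import List, Sequence


def split_into_batches(texts: Sequence[str], batch_size: int = 50) -> List[List[str]]:
    """Slice texts into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(texts[i : i + batch_size]) for i in range(0, len(texts), batch_size)]


# Hypothetical stand-in for the 500 title/body strings.
sample_texts = [f"question {n}" for n in range(500)]
batches = split_into_batches(sample_texts)
print(len(batches), len(batches[0]), len(batches[-1]))  # 10 50 50
```

When the number of texts is not a multiple of the batch size, the last batch is simply shorter.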

In [9]:
# This code snippet modifies the previously loaded DataFrame (df) by combining the title and body columns into a new title_body column. Then, it uses the get_embeddings_wrapper function to obtain text embeddings for each combined title and body, and the resulting embeddings are added as a new embedding column to the DataFrame. Finally, the first few rows of the updated DataFrame are displayed.
df["title_body"] = df["title"] + "\n" + df["body"]

# df.assign adds a new column; here its values come from the list that get_embeddings_wrapper returns
df = df.assign(embedding=get_embeddings_wrapper(df.title_body))
df.head()

100%|██████████| 10/10 [00:10<00:00,  1.06s/it]


Unnamed: 0,id,title,body,answer_count,comment_count,creation_date,last_activity_date,score,tags,view_count,title_body,embedding
0,927358,How do I undo the most recent local commits in...,<p>I accidentally committed the wrong files to...,100,12,2009-05-29 18:09:14.627000+00:00,2022-09-09 08:13:22.747000+00:00,24809,git|version-control|git-commit|undo,11649204,How do I undo the most recent local commits in...,"[0.0060494025237858295, 0.007868818938732147, ..."
1,5767325,How can I remove a specific item from an array?,<p>How do I remove a specific value from an ar...,137,7,2011-04-23 22:17:18.487000+00:00,2022-09-16 16:24:04.310000+00:00,10953,javascript|arrays,10493798,How can I remove a specific item from an array...,"[0.03094310313463211, -0.0007308208150789142, ..."
2,2003505,How do I delete a Git branch locally and remot...,<h4>Failed Attempts to Delete a Remote Branch:...,42,10,2010-01-05 01:12:15.867000+00:00,2022-09-20 09:16:37.687000+00:00,19556,git|version-control|git-branch|git-push|git-re...,10278934,How do I delete a Git branch locally and remot...,"[0.05592119321227074, -0.01804913766682148, -0..."
3,16956810,How to find all files containing specific text...,<p>How do I find all files containing a specif...,53,9,2013-06-06 08:06:45.533000+00:00,2022-09-04 13:42:00.477000+00:00,6894,linux|text|grep|directory|find,9378947,How to find all files containing specific text...,"[-0.03896770253777504, 0.026809267699718475, -..."
4,4114095,How do I revert a Git repository to a previous...,<p>How do I revert from my current state to a ...,41,3,2010-11-06 16:58:14.550000+00:00,2022-09-02 06:25:46.480000+00:00,7617,git|git-checkout|git-reset|git-revert,8956751,How do I revert a Git repository to a previous...,"[0.020585879683494568, -0.010793603025376797, ..."


In [10]:
df.to_csv('stackoverflow_posts_questions.csv', index=False)

sample output
![sample output](https://cdn.qwiklabs.com/t7O9JhC%2F%2FqQUUDlTAGx4x6FhSwjc%2Bnrw8mvX5Nd1r2M%3D)

# Task 4. Scrape HTML from Question Pages


In [11]:
# Run the following code snippet in the next cell to set up necessary project information.
BUCKET_NAME = "qwiklabs-gcp-02-9355e20ba053"
BUCKET_URI = "gs://qwiklabs-gcp-02-9355e20ba053"
REGION = "us-east1"
PROJECT_ID = "qwiklabs-gcp-02-9355e20ba053"

In [12]:
# Run the following code snippet in the next cell, to create the Google Cloud Storage bucket.
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Creating gs://qwiklabs-gcp-02-9355e20ba053/...


In [None]:
# Create directories within the bucket.

In [15]:
!pwd

/home/jupyter


In [16]:
!ls

day3_custom_embeddings.ipynb  requirements.txt			 utils
notebook_template.ipynb       stackoverflow_posts_questions.csv


In [24]:
!cd /dev && ls -lm -la

total 4
drwxr-xr-x 14 root root          2880 Oct 16 08:31 .
drwxr-xr-x 18 root root          4096 Oct 16 08:31 ..
crw-r--r--  1 root root     10,   235 Oct 16 08:31 autofs
drwxr-xr-x  2 root root           140 Oct 16 08:30 block
drwxr-xr-x  2 root root            80 Oct 16 08:30 bsg
crw-------  1 root root     10,   234 Oct 16 08:30 btrfs-control
drwxr-xr-x  2 root root          2360 Oct 16 08:31 char
crw--w----  1 root tty       5,     1 Oct 16 09:02 console
lrwxrwxrwx  1 root root            11 Oct 16 08:30 core -> /proc/kcore
crw-------  1 root root     10,    62 Oct 16 08:31 cpu_dma_latency
crw-------  1 root root     10,   203 Oct 16 08:30 cuse
drwxr-xr-x  6 root root           120 Oct 16 08:30 disk
lrwxrwxrwx  1 root root            13 Oct 16 08:30 fd -> /proc/self/fd
crw-rw-rw-  1 root root      1,     7 Oct 16 08:31 full
crw-rw-rw-  1 root root     10,   229 Oct 16 08:31 fuse
crw-------  1 root root     10,   228 Oct 16 08:31 hpet
drwxr-xr-x  2 root root             0 Oct 16 0

In [14]:
%%bash

# Set your Google Cloud Storage bucket name
BUCKET_NAME="gs://qwiklabs-gcp-02-9355e20ba053"

# Array of top-level directory names you want to create
TOP_LEVEL_DIRECTORIES=("embeddings-stackoverflow")

# Why does an object named 'null' appear inside each folder?
# Cloud Storage has no real directories: a "folder" is just a shared prefix in object names.
# To make an otherwise empty folder show up in the console, upload a zero-byte object under that prefix.
# - gsutil -m cp -r: copy recursively; -m enables parallel operations.
# - /dev/null: a special file on Unix-like systems (like the lab environment) that yields zero bytes
#   when read, so it works as a zero-byte source without creating a file on disk first.
# - "$BUCKET_NAME/$TOP_LEVEL_DIRECTORY/": the destination prefix.
# Because the destination ends with '/', gsutil treats it as a folder and names the object
# after the source file: embeddings-stackoverflow/null. That zero-byte 'null' object
# (Content-Type application/octet-stream) is what the console lists inside the folder.
# In short: the sub-directories are created by copying /dev/null into them.

# What does "${TOP_LEVEL_DIRECTORIES[@]}" mean?
# - TOP_LEVEL_DIRECTORIES: the name of the Bash array variable.
# - [@] inside ${...}: expands to all elements of the array.
# - The surrounding double quotes keep each element as a single word.
#   e.g. for ("dir one" "dir two"), "${array[@]}" yields two elements: "dir one" and "dir two".

# Loop through the top-level array and create directories
for TOP_LEVEL_DIRECTORY in "${TOP_LEVEL_DIRECTORIES[@]}"; do
  gsutil -m cp -r /dev/null "$BUCKET_NAME/$TOP_LEVEL_DIRECTORY/"

  # Array of subdirectory names you want to create inside the top-level directory
  SUBDIRECTORIES=("html")

  # Loop through the subdirectories array and create subdirectories inside the top-level directory
  for SUBDIRECTORY in "${SUBDIRECTORIES[@]}"; do
    gsutil -m cp -r /dev/null "$BUCKET_NAME/$TOP_LEVEL_DIRECTORY/$SUBDIRECTORY/"
  done
done

echo "Directories created successfully."

Copying file:///dev/null [Content-Type=application/octet-stream]...
/ [1/1 files][    0.0 B/    0.0 B]                                              
Operation completed over 1 objects.                                              
Copying file:///dev/null [Content-Type=application/octet-stream]...
/ [1/1 files][    0.0 B/    0.0 B]                                              
Operation completed over 1 objects.                                              


Directories created successfully.
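
The `null` object that shows up under each folder follows from how a copy to a destination ending in `/` is named: the base name of the source file is appended to the prefix. The sketch below models only that naming rule in plain Python. `placeholder_object_name` is a hypothetical helper for illustration, not part of gsutil or the Cloud Storage client.

```python
import posixpath


def placeholder_object_name(source_path: str, destination_prefix: str) -> str:
    """Model the object name produced by copying a file to a prefix ending in '/'."""
    if not destination_prefix.endswith("/"):
        raise ValueError("this sketch only models destinations that end with '/'")
    return destination_prefix + posixpath.basename(source_path)


for prefix in ("embeddings-stackoverflow/", "embeddings-stackoverflow/html/"):
    print(placeholder_object_name("/dev/null", prefix))
# embeddings-stackoverflow/null
# embeddings-stackoverflow/html/null
```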


Run the following code snippet in the next cell, to define a Python function named scrape_question that performs the following tasks:
-   It sends an HTTP GET request to a specified URL (question_url) to scrape the content of a question page.
-   If the request is successful (HTTP status code 200) and the response contains content, it uploads the HTML content of the question page to Google Cloud Storage (GCS).
-   The GCS URI (Uniform Resource Identifier) of the uploaded HTML file is returned.


In [25]:
JSONL_MIME_TYPE = "application/jsonl"
HTML_MIME_TYPE = "text/html"

BUCKET_NAME = "qwiklabs-gcp-02-9355e20ba053"
DIRECTORY = "embeddings-stackoverflow"
BLOB_PREFIX = f"{DIRECTORY}/html/"

GCS_URI_PREFIX = f"gs://{BUCKET_NAME}/{BLOB_PREFIX}"

storage_client = storage.Client()
bucket = storage_client.bucket(BUCKET_NAME)


def scrape_question(question_url: str) -> str:
    response = requests.get(question_url)

    if response.status_code != 200 or not response.content:
        print(f"URL: {question_url} Code: {response.status_code}")
        return None

    print(f"Scraping {question_url}")

    link_title = response.url.split("/")[-1] + ".html"
    gcs_uri = f"{GCS_URI_PREFIX}{link_title}"

    # Upload HTML to Google Cloud Storage
    blob = bucket.blob(f"{BLOB_PREFIX}{link_title}")
    blob.upload_from_string(response.content, content_type=HTML_MIME_TYPE)
    time.sleep(1)
    return gcs_uri
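
The object name in `scrape_question` comes from the last path segment of the final response URL (after any redirects). A minimal sketch of that step in isolation, assuming a made-up URL on `example.com` and a hypothetical helper `html_object_name`; the default prefix mirrors `BLOB_PREFIX` above:

```python
def html_object_name(final_url: str, blob_prefix: str = "embeddings-stackoverflow/html/") -> str:
    """Build the Cloud Storage object name from the last segment of a URL."""
    link_title = final_url.split("/")[-1] + ".html"
    return f"{blob_prefix}{link_title}"


# Made-up URL used only for illustration.
print(html_object_name("https://example.com/questions/123/some-question-title"))
# embeddings-stackoverflow/html/some-question-title.html
```

A URL that ends with a trailing slash would yield an empty last segment and therefore the object name `.html`, so stripping trailing slashes first may be worth considering.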

The following code snippet has 2 main parts:
- it constructs URLs for Stack Overflow questions based on their IDs and then
- scrapes the HTML content from each of these URLs before uploading it to Google Cloud Storage (GCS)

In [26]:
# Get the published URL from the ID
QUESTION_BASE_URL = "https://stackoverflow.com/questions/"
# ie go from "https://stackoverflow.com/questions/"
#    to "https://stackoverflow.com/questions/927358", etc
# to reconstruct the URL
df["question_url"] = df["id"].apply(lambda x: f"{QUESTION_BASE_URL}{x}")

# then use URL to
# Scrape HTML from stackoverflow.com and upload to GCS
df["gcs_uri"] = df["question_url"].apply(scrape_question)

Scraping https://stackoverflow.com/questions/927358
Scraping https://stackoverflow.com/questions/5767325
Scraping https://stackoverflow.com/questions/2003505
Scraping https://stackoverflow.com/questions/16956810
Scraping https://stackoverflow.com/questions/4114095
Scraping https://stackoverflow.com/questions/2906582
Scraping https://stackoverflow.com/questions/503093
Scraping https://stackoverflow.com/questions/1789945
Scraping https://stackoverflow.com/questions/1783405
Scraping https://stackoverflow.com/questions/3207219
Scraping https://stackoverflow.com/questions/1125968
Scraping https://stackoverflow.com/questions/5585779
Scraping https://stackoverflow.com/questions/4366730
Scraping https://stackoverflow.com/questions/3437059
Scraping https://stackoverflow.com/questions/3552461
Scraping https://stackoverflow.com/questions/16476924
Scraping https://stackoverflow.com/questions/20035101
Scraping https://stackoverflow.com/questions/176918
Scraping https://stackoverflow.com/questions/1

In the next cell, restructure the embeddings data to JSONL to follow the Vertex AI Search format (Unstructured with Metadata). This format is required to use custom embeddings.

https://cloud.google.com/generative-ai-app-builder/docs/prepare-data

In [27]:
EMBEDDINGS_FIELD_NAME = "embedding_vector"


def format_row(row):
    return {
        "id": row["id"],
        "content": {"mimeType": HTML_MIME_TYPE, "uri": row["gcs_uri"]},
        "structData": {
            EMBEDDINGS_FIELD_NAME: row["embedding"],
            "title": row["title"],
            "body": row["body"],
            "question_url": row["question_url"],
            "answer_count": row["answer_count"],
            "creation_date": row["creation_date"],
            "score": row["score"],
        },
    }


# convert each df's row to JSON
#         extracting only the relevant fields/cols
# the full file containing JSONs (1 JSON per original df record) => JSONL
vais_embeddings = (
    df.apply(format_row, axis=1)
    .to_json(orient="records", lines=True, force_ascii=False)
    .replace("\\/", "/")  # pandas escapes "/" as "\/" in JSON output; undo that so URIs stay readable
)
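
To make the target format concrete, the sketch below builds a single JSONL line with the same shape as `format_row`, using made-up values (the ID, bucket name, and three-element vector are placeholders, not real data). It uses the standard-library `json` module, which does not escape `/`, so no replacement step is needed in this sketch.

```python
import json

# Placeholder record; real rows carry a 768-element embedding.
example_row = {
    "id": "123",
    "title": "Example title",
    "body": "<p>Example body</p>",
    "question_url": "https://example.com/questions/123",
    "gcs_uri": "gs://example-bucket/embeddings-stackoverflow/html/example.html",
    "embedding": [0.1, 0.2, 0.3],
}

# Same layout as format_row: an ID, a pointer to the content, and structured metadata.
document = {
    "id": example_row["id"],
    "content": {"mimeType": "text/html", "uri": example_row["gcs_uri"]},
    "structData": {
        "embedding_vector": example_row["embedding"],
        "title": example_row["title"],
        "body": example_row["body"],
        "question_url": example_row["question_url"],
    },
}

# One JSON object per line is what makes the file JSONL.
jsonl_line = json.dumps(document, ensure_ascii=False)
print(jsonl_line)
```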

In [28]:
# In the next cell, upload the JSONL file to Google Cloud Storage.
jsonl_filename = f"{DIRECTORY}/vais_embeddings.jsonl"
embeddings_file = f"gs://{BUCKET_NAME}/{jsonl_filename}"

blob = bucket.blob(jsonl_filename)
blob.upload_from_string(vais_embeddings, content_type=JSONL_MIME_TYPE)

In [29]:
df.to_csv('stackoverflow_posts_questions.csv', index=False)

In [32]:
df.head()

Unnamed: 0,id,title,body,answer_count,comment_count,creation_date,last_activity_date,score,tags,view_count,title_body,embedding,question_url,gcs_uri
0,927358,How do I undo the most recent local commits in...,<p>I accidentally committed the wrong files to...,100,12,2009-05-29 18:09:14.627000+00:00,2022-09-09 08:13:22.747000+00:00,24809,git|version-control|git-commit|undo,11649204,How do I undo the most recent local commits in...,"[0.0060494025237858295, 0.007868818938732147, ...",https://stackoverflow.com/questions/927358,gs://qwiklabs-gcp-02-9355e20ba053/embeddings-s...
1,5767325,How can I remove a specific item from an array?,<p>How do I remove a specific value from an ar...,137,7,2011-04-23 22:17:18.487000+00:00,2022-09-16 16:24:04.310000+00:00,10953,javascript|arrays,10493798,How can I remove a specific item from an array...,"[0.03094310313463211, -0.0007308208150789142, ...",https://stackoverflow.com/questions/5767325,gs://qwiklabs-gcp-02-9355e20ba053/embeddings-s...
2,2003505,How do I delete a Git branch locally and remot...,<h4>Failed Attempts to Delete a Remote Branch:...,42,10,2010-01-05 01:12:15.867000+00:00,2022-09-20 09:16:37.687000+00:00,19556,git|version-control|git-branch|git-push|git-re...,10278934,How do I delete a Git branch locally and remot...,"[0.05592119321227074, -0.01804913766682148, -0...",https://stackoverflow.com/questions/2003505,gs://qwiklabs-gcp-02-9355e20ba053/embeddings-s...
3,16956810,How to find all files containing specific text...,<p>How do I find all files containing a specif...,53,9,2013-06-06 08:06:45.533000+00:00,2022-09-04 13:42:00.477000+00:00,6894,linux|text|grep|directory|find,9378947,How to find all files containing specific text...,"[-0.03896770253777504, 0.026809267699718475, -...",https://stackoverflow.com/questions/16956810,gs://qwiklabs-gcp-02-9355e20ba053/embeddings-s...
4,4114095,How do I revert a Git repository to a previous...,<p>How do I revert from my current state to a ...,41,3,2010-11-06 16:58:14.550000+00:00,2022-09-02 06:25:46.480000+00:00,7617,git|git-checkout|git-reset|git-revert,8956751,How do I revert a Git repository to a previous...,"[0.020585879683494568, -0.010793603025376797, ...",https://stackoverflow.com/questions/4114095,gs://qwiklabs-gcp-02-9355e20ba053/embeddings-s...


In [38]:
# 768 dims embedding
#     for model = TextEmbeddingModel.from_pretrained("text-embedding-004")
len(df.embedding.loc[0])

768
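
The data store schema in Task 5 declares a fixed vector dimension of 768, so it can help to confirm that every embedding has that length before importing. A minimal sketch, assuming a hypothetical helper `check_embedding_dimensions` and toy vectors in place of the DataFrame column:

```python
from typing import List, Sequence


def check_embedding_dimensions(
    embeddings: Sequence[Sequence[float]], expected_dim: int
) -> List[int]:
    """Return the positions of embeddings whose length differs from expected_dim."""
    return [i for i, vec in enumerate(embeddings) if len(vec) != expected_dim]


# Toy vectors; the last one is deliberately one value short.
toy_embeddings = [[0.0] * 768, [0.0] * 768, [0.0] * 767]
print(check_embedding_dimensions(toy_embeddings, expected_dim=768))  # [2]
```

With the notebook's DataFrame, the call would look like `check_embedding_dimensions(df["embedding"].tolist(), 768)`, and an empty list would mean every row matches.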

# Task 5. Set up AI Applications - TO EXPLORE WHAT CODE IS DOING
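* The scraped HTML is not embedded in the JSONL itself; each JSONL record only points to its HTML file through content.uri (the gcs_uri column).
* The data store is created with content_config="CONTENT_REQUIRED", so each imported document is expected to reference content, which here is the HTML file in Cloud Storage.
* The JSONL file is imported into the data store in Task 5 and used for testing in Task 6.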

In [39]:
# In the next cell, set up client options for interacting with the Google Cloud Vertex AI Discovery Engine service. It specifies the API endpoint based on the provided DATA_STORE_LOCATION.
DATA_STORE_LOCATION = "global"

client_options = (
    ClientOptions(api_endpoint=f"{DATA_STORE_LOCATION}-discoveryengine.googleapis.com")
    if DATA_STORE_LOCATION != "global"
    else None
)
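
The conditional above picks a regional endpoint only when the data store is not in `global`. The same rule is shown below as a small pure function, for illustration only; `discovery_engine_endpoint` is a hypothetical name, and `"eu"` is just an example input.

```python
from typing import Optional


def discovery_engine_endpoint(location: str) -> Optional[str]:
    """Return the regional API endpoint, or None to use the default global one."""
    if location == "global":
        return None
    return f"{location}-discoveryengine.googleapis.com"


print(discovery_engine_endpoint("global"))  # None
print(discovery_engine_endpoint("eu"))  # eu-discoveryengine.googleapis.com
```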


In [40]:
# In the next cell, define several functions that interact with the Google Cloud Vertex AI Discovery Engine service. These functions are responsible for creating a data store, updating its schema, importing documents, and creating a search engine.
def create_data_store(
    project_id: str, location: str, data_store_name: str, data_store_id: str
):
    # Create a client
    client = discoveryengine.DataStoreServiceClient(client_options=client_options)

    # Initialize request argument(s)
    data_store = discoveryengine.DataStore(
        display_name=data_store_name,
        industry_vertical="GENERIC",
        content_config="CONTENT_REQUIRED",
        solution_types=["SOLUTION_TYPE_SEARCH"],
    )

    request = discoveryengine.CreateDataStoreRequest(
        parent=discoveryengine.DataStoreServiceClient.collection_path(
            project_id, location, "default_collection"
        ),
        data_store=data_store,
        data_store_id=data_store_id,
    )
    operation = client.create_data_store(request=request)

    try:
        operation.result()
    except GoogleAPICallError:
        pass


def update_schema(
    project_id: str,
    location: str,
    data_store_id: str,
):
    client = discoveryengine.SchemaServiceClient(client_options=client_options)

    schema = discoveryengine.Schema(
        name=client.schema_path(project_id, location, data_store_id, "default_schema"),
        
        # JSON Schema
        struct_schema={
            "$schema": "https://json-schema.org/draft/2020-12/schema",
            "type": "object",
            "properties": {
                EMBEDDINGS_FIELD_NAME: {
                    "type": "array",
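                    # The dictionary key (EMBEDDINGS_FIELD_NAME) is the field name in structData
                    #   inside each line of the JSONL file.
                    # The keyPropertyMapping value "embedding_vector" marks that field as the embedding vector.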
                    "keyPropertyMapping": "embedding_vector",
                    "dimension": 768,
                    "items": {"type": "number"},
                }
            },
        },
    )

    operation = client.update_schema(
        request=discoveryengine.UpdateSchemaRequest(schema=schema)
    )

    print("Waiting for operation to complete...")

    response = operation.result()

    # Handle the response
    print(response)


def import_documents(
    project_id: str,
    location: str,
    data_store_id: str,
    gcs_uri: str,
):
    client = discoveryengine.DocumentServiceClient(client_options=client_options)

    # The full resource name of the search engine branch.
    # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )

    request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        # 'gcs_uri' here references the JSONL GCS URI/file path
        gcs_source=discoveryengine.GcsSource(input_uris=[gcs_uri]),
        # Options: `FULL`, `INCREMENTAL`
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.FULL,
    )

    # Make the request
    operation = client.import_documents(request=request)


def create_engine(
    project_id: str, location: str, data_store_name: str, data_store_id: str
):
    client = discoveryengine.EngineServiceClient(client_options=client_options)

    # Initialize request argument(s)
    config = discoveryengine.Engine.SearchEngineConfig(
        search_tier="SEARCH_TIER_ENTERPRISE", search_add_ons=["SEARCH_ADD_ON_LLM"]
    )

    engine = discoveryengine.Engine(
        display_name=data_store_name,
        solution_type="SOLUTION_TYPE_SEARCH",
        industry_vertical="GENERIC",
        data_store_ids=[data_store_id],
        search_engine_config=config,
    )

    request = discoveryengine.CreateEngineRequest(
        parent=discoveryengine.DataStoreServiceClient.collection_path(
            project_id, location, "default_collection"
        ),
        engine=engine,
        engine_id=engine.display_name,
    )

    # Make the request
    operation = client.create_engine(request=request)
    response = operation.result(timeout=90)

In [41]:
# In the next cell, set the project related variables.
DATA_STORE_NAME = "stackoverflow-embeddings"
DATA_STORE_ID = f"{DATA_STORE_NAME}-id"

In [42]:
# In the next cell, initialize and configure a search application in Google Cloud Vertex AI Discovery Engine, including creating a data store, updating its schema for embeddings, importing documents, and creating a search engine attached to the data store.
# Create a Data Store
create_data_store(PROJECT_ID, DATA_STORE_LOCATION, DATA_STORE_NAME, DATA_STORE_ID)

# Update the Data Store Schema for embeddings
update_schema(PROJECT_ID, DATA_STORE_LOCATION, DATA_STORE_ID)

# Import the embeddings JSONL file
import_documents(PROJECT_ID, DATA_STORE_LOCATION, DATA_STORE_ID, embeddings_file)

# Create a Search App and attach the Data Store
create_engine(PROJECT_ID, DATA_STORE_LOCATION, DATA_STORE_NAME, DATA_STORE_ID)

Waiting for operation to complete...
name: "projects/268447846383/locations/global/collections/default_collection/dataStores/stackoverflow-embeddings-id/schemas/default_schema"
struct_schema {
  fields {
    key: "type"
    value {
      string_value: "object"
    }
  }
  fields {
    key: "properties"
    value {
      struct_value {
        fields {
          key: "embedding_vector"
          value {
            struct_value {
              fields {
                key: "type"
                value {
                  string_value: "array"
                }
              }
              fields {
                key: "keyPropertyMapping"
                value {
                  string_value: "embedding_vector"
                }
              }
              fields {
                key: "items"
                value {
                  struct_value {
                    fields {
                      key: "type"
                      value {
                        string_value: "num

sample output
![sample output](https://cdn.qwiklabs.com/VbWrqyWOPk70tVT5zbiNHvpJaSVvjfNgGWIpODwUbNU%3D)

Next, we need to set the embedding specification for the data store. We will set the same specifications for all search requests: 0.5 * relevance_score.

This is not supported in the client libraries, so we will use the requests module to make a REST request. Documentation: Bring Embeddings
https://cloud.google.com/generative-ai-app-builder/docs/bring-embeddings#global


Run the following code snippet in the next cell. It retrieves an access token using gcloud auth print-access-token, then sends a PATCH request to update the serving configuration of the search application in Vertex AI Discovery Engine. The request includes the embedding configuration and a ranking expression, and the server's response is printed.


In [43]:
access_token = (
    subprocess.check_output(["gcloud", "auth", "print-access-token"])
    .decode("utf-8")
    .strip()
)

response = requests.patch(
    url=f"https://discoveryengine.googleapis.com/v1alpha/projects/{PROJECT_ID}/locations/{DATA_STORE_LOCATION}/collections/default_collection/dataStores/{DATA_STORE_ID}/servingConfigs/default_search?updateMask=embeddingConfig,rankingExpression",
    headers={
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json; charset=utf-8",
        "X-Goog-User-Project": PROJECT_ID,
    },
    json={
        "name": f"projects/{PROJECT_ID}/locations/{DATA_STORE_LOCATION}/collections/default_collection/dataStores/{DATA_STORE_ID}/servingConfigs/default_search",
        "embeddingConfig": {"fieldPath": EMBEDDINGS_FIELD_NAME},
        # relevance_score is a built-in, system-generated metric, calculated internally
        # by the Discovery Engine ranking models.
        # A custom ranking expression is usually used to re-rank documents by combining signals,
        #   e.g. "0.5 * relevance_score + 0.5 * custom_factor",
        #   but here it only halves relevance_score.
        "ranking_expression": "0.5 * relevance_score",
    },
)

print(response.text)

{
  "name": "projects/268447846383/locations/global/collections/default_collection/engines/stackoverflow-embeddings/servingConfigs/default_search",
  "displayName": "default_search",
  "solutionType": "SOLUTION_TYPE_SEARCH",
  "embeddingConfig": {
    "fieldPath": "embedding_vector"
  },
  "rankingExpression": "0.5 * relevance_score"
}
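
The request URL and the `name` field in the body share one resource path. A minimal sketch that builds that path once, assuming a hypothetical helper `serving_config_name` and a placeholder project ID:

```python
def serving_config_name(
    project_id: str,
    location: str,
    data_store_id: str,
    serving_config: str = "default_search",
) -> str:
    """Build the serving config resource name used in both the URL and the body."""
    return (
        f"projects/{project_id}/locations/{location}"
        f"/collections/default_collection/dataStores/{data_store_id}"
        f"/servingConfigs/{serving_config}"
    )


# Placeholder project ID; the data store ID matches the one defined above.
name = serving_config_name("example-project", "global", "stackoverflow-embeddings-id")
url = (
    f"https://discoveryengine.googleapis.com/v1alpha/{name}"
    "?updateMask=embeddingConfig,rankingExpression"
)
print(url)
```

Building the path in one place keeps the URL and the body from drifting apart if the data store ID changes.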



In [45]:
response.__dict__

{'_content': b'{\n  "name": "projects/268447846383/locations/global/collections/default_collection/engines/stackoverflow-embeddings/servingConfigs/default_search",\n  "displayName": "default_search",\n  "solutionType": "SOLUTION_TYPE_SEARCH",\n  "embeddingConfig": {\n    "fieldPath": "embedding_vector"\n  },\n  "rankingExpression": "0.5 * relevance_score"\n}\n',
 '_content_consumed': True,
 '_next': None,
 'status_code': 200,
 'headers': {'Content-Type': 'application/json; charset=UTF-8', 'Vary': 'Origin, X-Origin, Referer', 'Content-Encoding': 'gzip', 'Date': 'Thu, 16 Oct 2025 09:31:14 GMT', 'Server': 'ESF', 'X-XSS-Protection': '0', 'X-Frame-Options': 'SAMEORIGIN', 'X-Content-Type-Options': 'nosniff', 'Server-Timing': 'gfet4t7; dur=775', 'Transfer-Encoding': 'chunked'},
 'raw': <urllib3.response.HTTPResponse at 0x7f7e995a1f90>,
 'url': 'https://discoveryengine.googleapis.com/v1alpha/projects/qwiklabs-gcp-02-9355e20ba053/locations/global/collections/default_collection/dataStores/stacko

In [49]:
response.raw.__dict__

{'headers': HTTPHeaderDict({'Content-Type': 'application/json; charset=UTF-8', 'Vary': 'Origin, X-Origin, Referer', 'Content-Encoding': 'gzip', 'Date': 'Thu, 16 Oct 2025 09:31:14 GMT', 'Server': 'ESF', 'X-XSS-Protection': '0', 'X-Frame-Options': 'SAMEORIGIN', 'X-Content-Type-Options': 'nosniff', 'Server-Timing': 'gfet4t7; dur=775', 'Transfer-Encoding': 'chunked'}),
 'status': 200,
 'version': 11,
 'version_string': 'HTTP/1.1',
 'reason': 'OK',
 'decode_content': False,
 '_has_decoded_content': True,
 '_request_url': '/v1alpha/projects/qwiklabs-gcp-02-9355e20ba053/locations/global/collections/default_collection/dataStores/stackoverflow-embeddings-id/servingConfigs/default_search?updateMask=embeddingConfig,rankingExpression',
 '_retries': Retry(total=0, connect=None, read=False, redirect=None, status=None),
 'chunked': True,
 '_decoder': <urllib3.response.GzipDecoder at 0x7f7e995a2950>,
 'enforce_content_length': True,
 'auto_close': True,
 '_body': None,
 '_fp': <http.client.HTTPRespons

In [51]:
response.raw._decoder.__dict__

{'_obj': <zlib.Decompress at 0x7f7e83770870>, '_state': 0}

In [54]:
response.raw._decoder._obj.__dict__

AttributeError: 'zlib.Decompress' object has no attribute '__dict__'

In [50]:
response.raw._original_response.__dict__

{'fp': None,
 'debuglevel': 0,
 '_method': 'PATCH',
 'headers': <http.client.HTTPMessage at 0x7f7e995a03d0>,
 'msg': <http.client.HTTPMessage at 0x7f7e995a03d0>,
 'version': 11,
 'status': 200,
 'reason': 'OK',
 'chunked': True,
 'chunk_left': None,
 'length': None,
 'will_close': False,
 'code': 200,
 '__IOBase_closed': True}

In [47]:
response.cookies.__dict__

{'_policy': <http.cookiejar.DefaultCookiePolicy at 0x7f7e995a13f0>,
 '_cookies_lock': <unlocked _thread.RLock object owner=0 count=0 at 0x7f7e83717040>,
 '_cookies': {},
 '_now': 1760607074}

In [48]:
response.cookies._policy.__dict__

{'netscape': True,
 'rfc2965': False,
 'rfc2109_as_netscape': None,
 'hide_cookie2': False,
 'strict_domain': False,
 'strict_rfc2965_unverifiable': True,
 'strict_ns_unverifiable': False,
 'strict_ns_domain': 0,
 'strict_ns_set_initial_dollar': False,
 'strict_ns_set_path': False,
 'secure_protocols': ('https', 'wss'),
 '_blocked_domains': (),
 '_allowed_domains': None,
 '_now': 1760607074}

In [55]:
response.request.__dict__

{'method': 'PATCH',
 'url': 'https://discoveryengine.googleapis.com/v1alpha/projects/qwiklabs-gcp-02-9355e20ba053/locations/global/collections/default_collection/dataStores/stackoverflow-embeddings-id/servingConfigs/default_search?updateMask=embeddingConfig,rankingExpression',
 'headers': {'User-Agent': 'python-requests/2.32.5', 'Accept-Encoding': 'gzip, deflate, br, zstd', 'Accept': '*/*', 'Connection': 'keep-alive', 'Authorization': 'Bearer [REDACTED]

In [56]:
response.connection.__dict__

{'max_retries': Retry(total=0, connect=None, read=False, redirect=None, status=None),
 'config': {},
 'proxy_manager': {},
 '_pool_connections': 10,
 '_pool_maxsize': 10,
 '_pool_block': False,
 'poolmanager': <urllib3.poolmanager.PoolManager at 0x7f7e9991c2b0>}

In [57]:
response.connection.poolmanager.__dict__

{'headers': {},
 'connection_pool_kw': {'maxsize': 10, 'block': False},
 'pools': <urllib3._collections.RecentlyUsedContainer at 0x7f7e995d0250>,
 'pool_classes_by_scheme': {'http': urllib3.connectionpool.HTTPConnectionPool,
  'https': urllib3.connectionpool.HTTPSConnectionPool},
 'key_fn_by_scheme': {'http': functools.partial(<function _default_key_normalizer at 0x7f7ee0cb43a0>, <class 'urllib3.poolmanager.PoolKey'>),
  'https': functools.partial(<function _default_key_normalizer at 0x7f7ee0cb43a0>, <class 'urllib3.poolmanager.PoolKey'>)}}
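
The `__dict__` dumps above expose a lot of internal state, including the `Authorization` header of the request. A smaller summary can be built from the public attributes of the response alone. The following is a minimal sketch, assuming a `requests`-style response object; `redact_headers`, `summarize_response`, and the stand-in `fake_response` are hypothetical names introduced for illustration and are not part of the lab.

```python
from types import SimpleNamespace
from typing import Any, Dict, Mapping

# Header names whose values should never be printed in a notebook.
SENSITIVE_HEADERS = {"authorization", "cookie", "proxy-authorization"}


def redact_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of the headers with credential-bearing values masked."""
    return {
        name: "[REDACTED]" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }


def summarize_response(response: Any) -> Dict[str, Any]:
    """Collect the useful parts of a requests-style response via public attributes."""
    return {
        "status_code": response.status_code,
        "method": response.request.method,
        "url": response.url,
        "request_headers": redact_headers(response.request.headers),
        "body": response.json(),
    }


# Stand-in object so the sketch runs without network access (assumption:
# it mimics only the attributes used above).
fake_response = SimpleNamespace(
    status_code=200,
    url="https://example.invalid/servingConfigs/default_search",
    request=SimpleNamespace(
        method="PATCH",
        headers={"Authorization": "Bearer abc123", "Accept": "*/*"},
    ),
    json=lambda: {"rankingExpression": "0.5 * relevance_score"},
)

summary = summarize_response(fake_response)
print(summary["request_headers"])
```

In the notebook, calling `summarize_response(response)` on the real response should give the status code, URL, and parsed JSON body without printing the bearer token.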

sample output ![sample output](https://cdn.qwiklabs.com/tPFFttXBoTSg91B5Bf9yL4nXgp7IthV7azuROt4MCZM%3D)

# Task 6. Test Search Application

In [58]:
# Define search_data_store, which runs a search query against a Vertex AI Discovery Engine data store.
def search_data_store(
    project_id: str,
    location: str,
    data_store_id: str,
    search_query: str,
) -> List[discoveryengine.SearchResponse]:
    # Create a client
    client = discoveryengine.SearchServiceClient(client_options=client_options)

    # The full resource name of the search engine serving config
    # e.g. projects/{project_id}/locations/{location}/dataStores/{data_store_id}/servingConfigs/{serving_config_id}
    serving_config = client.serving_config_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        serving_config="default_config",
    )

    # Optional: Configuration options for search
    # Refer to the `ContentSearchSpec` reference for all supported fields:
    # https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.types.SearchRequest.ContentSearchSpec
    content_search_spec = discoveryengine.SearchRequest.ContentSearchSpec(
        # For information about snippets, refer to:
        # https://cloud.google.com/generative-ai-app-builder/docs/snippets
        snippet_spec=discoveryengine.SearchRequest.ContentSearchSpec.SnippetSpec(
            return_snippet=True
        ),
        # For information about search summaries, refer to:
        # https://cloud.google.com/generative-ai-app-builder/docs/get-search-summaries
        summary_spec=discoveryengine.SearchRequest.ContentSearchSpec.SummarySpec(
            summary_result_count=5,
            include_citations=True,
            ignore_adversarial_query=True,
            ignore_non_summary_seeking_query=True,
        ),
    )

    # Refer to the `SearchRequest` reference for all supported fields:
    # https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.types.SearchRequest
    request = discoveryengine.SearchRequest(
        serving_config=serving_config,
        query=search_query,
        page_size=10,
        content_search_spec=content_search_spec,
        query_expansion_spec=discoveryengine.SearchRequest.QueryExpansionSpec(
            condition=discoveryengine.SearchRequest.QueryExpansionSpec.Condition.AUTO,
        ),
        spell_correction_spec=discoveryengine.SearchRequest.SpellCorrectionSpec(
            mode=discoveryengine.SearchRequest.SpellCorrectionSpec.Mode.AUTO
        ),
    )

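    # client.search returns a pager over SearchResponse pages; attributes such as
    # `summary` and `results` are read from the first page.
    response = client.search(request)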
    return response
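
Besides the summary, the returned object also carries the matching documents. The following is a minimal sketch of reading them, assuming the response exposes a `results` sequence whose items have a `document.id` field, as the Discovery Engine `SearchResponse` does; `list_result_ids` and the stand-in `fake_search_response` are hypothetical names used only for illustration.

```python
from itertools import islice
from types import SimpleNamespace
from typing import Any, List


def list_result_ids(response: Any, limit: int = 5) -> List[str]:
    """Return the document IDs of up to `limit` results from the first page."""
    return [result.document.id for result in islice(response.results, limit)]


# Stand-in response so the sketch runs without calling the API (assumption:
# it mimics only the `results[].document.id` shape used above).
fake_search_response = SimpleNamespace(
    results=[
        SimpleNamespace(document=SimpleNamespace(id="doc-1")),
        SimpleNamespace(document=SimpleNamespace(id="doc-2")),
        SimpleNamespace(document=SimpleNamespace(id="doc-3")),
    ]
)

print(list_result_ids(fake_search_response, limit=2))
```

With the real function, this would be called as `list_result_ids(search_data_store(PROJECT_ID, DATA_STORE_LOCATION, DATA_STORE_ID, search_query))`.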


In [59]:
# Search the data store with a sample query and print the summary text of the response.
search_query = "How do I create an array in Java?"

response = search_data_store(
    PROJECT_ID, DATA_STORE_LOCATION, DATA_STORE_ID, search_query
)

print(f"Summary: {response.summary.summary_text}")

Summary: To declare a one-dimensional array in Java, the general form is `type var-name;` or `type var-name;` [1]. To instantiate an array in Java, use `var-name = new type [size];` [1]. For example, `int intArray;` declares an array, and `intArray = new int;` allocates memory to the array [1]. The line `int intArray = new int;` is equivalent to the previous two lines [1]. You can also combine both statements in one line, such as `int intArray = new int{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };` [1]. To access the elements of the array, you can use a for loop [1]. For example: `for (int i = 0; i < intArray.length; i++) System.out.println("Element at index " + i + ": "+ intArray[i]);` [1].

Another way to declare and initialize an `ArrayList` is: `private List<String> list = new ArrayList<String>(){{ add("e1"); add("e2"); }};` [1].

To create a `LinkedList` in Java, you can use: `List<String> linkedList = new LinkedList<>();` [2]. If you need to do frequent insertion/deletion of elements on the

In [60]:
# Search the data store with a second query and print the summary text of the response.
search_query = "How do I create an array in Python?"

response = search_data_store(
    PROJECT_ID, DATA_STORE_LOCATION, DATA_STORE_ID, search_query
)

print(f"Summary: {response.summary.summary_text}")

Summary: To create an array in Python, you can initialize it with values, which can contain mixed types [1]. For example, `arr = [1, "eels"]` [1]. You can also create an empty array [1]. Python's list is a wrapper for a real array which contains references to items [1]. To create a list of a certain size, you can create an empty array and then convert it to a list [2]. For example, you can use `np.empty(10)` and then convert it to a list using `arrayName.tolist()` [2]. Alternatively, you can chain them [2]. You can also create an empty list using `s1 = list()` and then append values to it using a loop [2]. For example: `for i in range(0,9): s1.append(i)` [2].




In [62]:
# Search the data store with a longer, more open-ended query and print the summary text of the response.
search_query = "What are mutable vs immutable data types in Python? Please be comprehensive in data types, and in your explanations with examples"

response = search_data_store(
    PROJECT_ID, DATA_STORE_LOCATION, DATA_STORE_ID, search_query
)

print(f"Summary: {response.summary.summary_text}")

Summary: Mutable objects can be changed, while immutable objects cannot [2]. Immutable types include strings, numbers, and bools [2]. It is pointless to shallow copy immutable types like strings, tuples, or bytes because it is equivalent to returning another reference to the immutable object [4]. To check if an object allows getting, setting, and deleting items (mutable sequences), you can use `collections.abc.MutableSequence` [3]. Examples of type checking in Python include `assert type(variable_name) == int`, `assert type(variable_name) == bool`, and `assert type(variable_name) == list` [1].
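
Because the summaries are requested with `include_citations=True`, each claim carries a bracketed marker such as `[2]`. The following is a minimal sketch that lists which sources a summary cites, working on the summary text only; `cited_sources` is a hypothetical helper, and the pattern assumes citations are always a number in square brackets, so a code fragment like `arr[0]` inside a summary would be counted as well.

```python
import re
from typing import List

# Matches bracketed numeric markers such as [1] or [12].
CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def cited_sources(summary_text: str) -> List[int]:
    """Return the distinct citation numbers found in a summary, ascending."""
    return sorted({int(number) for number in CITATION_PATTERN.findall(summary_text)})


# Shortened version of the summary printed above.
sample_summary = (
    "Mutable objects can be changed, while immutable objects cannot [2]. "
    "It is pointless to shallow copy immutable types [4]. "
    "You can use `collections.abc.MutableSequence` [3]. "
    "Examples of type checking include `assert type(variable_name) == int` [1]."
)

print(cited_sources(sample_summary))
```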



sample output ![sample output](https://cdn.qwiklabs.com/HNbOJfqZdczDx%2BMw6zG9tDJ8WDrkPO4cUbMt10TiMaE%3D)

Note: If you receive an error, re-run the same cell after 3-4 minutes.

If the output shows the message "No results could be found. Try rephrasing the search query.", wait a few more minutes and re-run the cell.
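
The wait-and-re-run advice can also be automated. The following is a minimal sketch of such a loop, not part of the lab: `search_with_retry`, its parameters, and the 180-second default wait are assumptions chosen to match the 3-4 minute guidance, and the broad `except Exception` should probably be narrowed (for example to `GoogleAPICallError`) in real use.

```python
import time
from typing import Callable, Optional

# Text the service returns while the index is not ready yet (from the note above).
NO_RESULTS_MESSAGE = "No results could be found"


def search_with_retry(
    run_search: Callable[[], str],
    attempts: int = 3,
    wait_seconds: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call run_search until it returns a usable summary or attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            summary = run_search()
            if summary and NO_RESULTS_MESSAGE not in summary:
                return summary
            print(f"Attempt {attempt}: no usable summary yet")
        except Exception as error:  # assumption: any error is worth one more try
            print(f"Attempt {attempt} failed: {error}")
        if attempt < attempts:
            sleep(wait_seconds)
    return None


# Demo with a stand-in search function and no real waiting.
demo_answers = iter(
    ["No results could be found. Try rephrasing the search query.", "Ready [1]"]
)
print(search_with_retry(lambda: next(demo_answers), wait_seconds=0))
```

In the notebook, the callable could wrap the existing function, for example `lambda: search_data_store(PROJECT_ID, DATA_STORE_LOCATION, DATA_STORE_ID, search_query).summary.summary_text`.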