In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Online feature serving and vector retrieval of BigQuery data with Vertex AI Feature Store


<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/feature_store/online_feature_serving_and_vector_retrieval_bigquery_data_with_feature_store.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/feature_store/online_feature_serving_and_vector_retrieval_bigquery_data_with_feature_store.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/feature_store/online_feature_serving_and_vector_retrieval_bigquery_data_with_feature_store.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

## Overview

This tutorial demonstrates how to use `Vertex AI Feature Store` for online serving and vector retrieval of feature values in `BigQuery`.

Learn more about [Vertex AI Feature Store](https://cloud.google.com/vertex-ai/docs/featurestore/overview).

### Objective

In this tutorial, you learn how to create and use an online feature store instance to host and serve data in `BigQuery` with `Vertex AI Feature Store`, in an end-to-end workflow that covers feature serving and vector retrieval.

This tutorial uses the following Google Cloud ML services and resources:

- `Vertex AI Feature Store`

The steps performed include:

- Provision an online feature store instance to host and serve data.
- Create a feature view instance to serve a `BigQuery` table.
- Use the online server to search nearest neighbors.

### Note
This is a public Preview release. By using the feature, you acknowledge that you're aware of the open issues and that this preview is provided “as is” under the pre-GA terms of service.


### Dataset

This tutorial uses the [Google Patents Public Data](https://console.cloud.google.com/marketplace/product/google_patents_public_datasets/google-patents-public-data) dataset from the `BigQuery` public datasets.


### Costs

This tutorial uses billable components of Google Cloud:

* `Vertex AI`
* `BigQuery`

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and
[BigQuery pricing](https://cloud.google.com/bigquery/pricing)
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook.

In [None]:
# Install the packages
! pip3 install --upgrade --quiet google-cloud-aiplatform\
                                 google-cloud-bigquery\
                                 db-dtypes

### Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# # Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations). Note that the new API is currently only available in the following regions:
* `us-central1`
* `us-east1`
* `us-west1`
* `europe-west4`
* `asia-southeast1`

In [None]:
REGION = "us-central1"  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Import libraries

In [None]:
from google.cloud import aiplatform, bigquery
from google.cloud.aiplatform_v1beta1 import (
    FeatureOnlineStoreAdminServiceClient, FeatureOnlineStoreServiceClient)
from google.cloud.aiplatform_v1beta1.types import NearestNeighborQuery
from google.cloud.aiplatform_v1beta1.types import \
    feature_online_store as feature_online_store_pb2
from google.cloud.aiplatform_v1beta1.types import \
    feature_online_store_admin_service as \
    feature_online_store_admin_service_pb2
from google.cloud.aiplatform_v1beta1.types import \
    feature_online_store_service as feature_online_store_service_pb2
from google.cloud.aiplatform_v1beta1.types import \
    feature_view as feature_view_pb2

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION)

API_ENDPOINT = f"{REGION}-aiplatform.googleapis.com"

## Set up data source in BigQuery

### Requirements
The data source has to be a BigQuery table or a BigQuery view, with the following requirements on columns:
1. [*Required*] One entity id column, type: string
2. [*Required*] One embedding column, type: double array
3. [*Optional*] One or more filtering columns, type: string or string array
4. [*Optional*] One crowding column, type: integer. Crowding ensures that results are diverse by returning at most k' < k neighbors with any single crowding attribute out of k total neighbors
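
The crowding rule in item 4 can be illustrated with a small, service-independent sketch. It assumes the neighbors are already ranked from nearest to farthest, and the helper name `apply_crowding`, its `per_crowding_limit` parameter, and the sample rows are hypothetical and chosen only for illustration. It does not show how Feature Store implements crowding internally.

```python
from collections import defaultdict


def apply_crowding(ranked_neighbors, k, per_crowding_limit):
    """Keep at most `per_crowding_limit` neighbors per crowding value, up to k total.

    `ranked_neighbors` is a list of (entity_id, crowding_value) pairs,
    already sorted from nearest to farthest.
    """
    counts = defaultdict(int)
    selected = []
    for entity_id, crowding_value in ranked_neighbors:
        if len(selected) == k:
            break
        if counts[crowding_value] >= per_crowding_limit:
            continue  # this crowding value already has its share of results
        counts[crowding_value] += 1
        selected.append(entity_id)
    return selected


# Hypothetical ranked results: (publication_number, cited_by_filing_date)
ranked = [
    ("A", 20100101),
    ("B", 20100101),
    ("C", 20100101),
    ("D", 20120305),
    ("E", 20120305),
    ("F", 20150720),
]
diverse = apply_crowding(ranked, k=4, per_crowding_limit=2)
print(diverse)
```

In this sketch, `C` is skipped even though it ranks third, because two results with the same crowding value were already selected.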

### Test data source

From the `patents-public-data.google_patents_research.publications_202304` table, select a subset of columns and exclude the repeated record columns, which aren't compatible with Feature Store.

For demo purposes, the query below creates a small dataset (<=100MB). You can use the full dataset if needed.

In [None]:
FEATURE_EXTRACT_QUERY_FULL = """
SELECT publication_number, embedding_v1 as embedding, url, country, publication_description,
cpc_low, cpc_inventive_low, top_terms, title, CAST(title_translated as INT) as title_translated,
abstract, CAST(abstract_translated as INT) as abstract_translated,
cited_by[safe_offset(0)].filing_date as cited_by_filing_date,
similar[safe_offset(0)].filing_date as similar_filing_date
FROM `patents-public-data.google_patents_research.publications_202304`
"""
FEATURE_EXTRACT_QUERY_SMALL = f"{FEATURE_EXTRACT_QUERY_FULL} WHERE cited_by[safe_offset(0)].filing_date is not NULL LIMIT 1000"

This data source has filtering columns (for example, `country`) and crowding columns (for example, `cited_by_filing_date`). The schema of the publications table used in this guide is:

|Column name|Type|Mode|
|-----------|----|----|
|publication_number|STRING|NULLABLE|
|embedding|FLOAT|REPEATED|
|url|STRING|NULLABLE|
|country|STRING|NULLABLE|
|publication_description|STRING|NULLABLE|
|cpc_low|STRING|REPEATED|
|cpc_inventive_low|STRING|REPEATED|
|top_terms|STRING|REPEATED|
|title|STRING|NULLABLE|
|title_translated|INTEGER|NULLABLE|
|abstract|STRING|NULLABLE|
|abstract_translated|INTEGER|NULLABLE|
|cited_by_filing_date|INTEGER|NULLABLE|
|similar_filing_date|INTEGER|NULLABLE|

View the retrieved data.

In [None]:
bq_client = bigquery.Client(project=PROJECT_ID)

product_data = bq_client.query(FEATURE_EXTRACT_QUERY_SMALL).result().to_dataframe()

print(product_data.shape)
product_data.head()

### Create BigQuery dataset

Create a BigQuery dataset to hold the BigQuery table for the tutorial. Since the source data for this tutorial is located in the `US` region, the dataset must also be located in the `US` region. If you use your own data and dataset, you can also use that dataset to create the BigQuery table.


In [None]:
# First, create a dataset if it does not already exist. The source data for this demo is located in the US region, so the dataset must also be located in the US region.

BQ_DATASET_ID = "featurestore_demo"  # @param {type:"string"}
dataset = bigquery.Dataset(f"{PROJECT_ID}.{BQ_DATASET_ID}")
dataset.location = "US"
dataset = bq_client.create_dataset(
    dataset, exists_ok=True, timeout=30
)  # Make an API request.

# Confirm dataset created.
print(f"Created dataset {dataset.project}.{dataset.dataset_id}")

#### Create a BigQuery table

In [None]:
BQ_TABLE_ID = "publications_202304_small"  # @param {type:"string"}
BQ_TABLE_ID_FQN = f"{PROJECT_ID}.{BQ_DATASET_ID}.{BQ_TABLE_ID}"
job_config = bigquery.QueryJobConfig(destination=BQ_TABLE_ID_FQN)
query_job = bq_client.query(FEATURE_EXTRACT_QUERY_SMALL, job_config=job_config)

try:
    query_job.result()
    print(f"Created table: {BQ_TABLE_ID_FQN}")
except Exception as e:
    # The query job fails if, for example, the destination table already exists.
    print(f"Error: {e}")

In [None]:
DATA_SOURCE = f"bq://{BQ_TABLE_ID_FQN}"
print(f"Data source is: {DATA_SOURCE}")

## Set up and start online serving

To serve embedding data in feature store, you need to do the following:

1. Create an online store cluster to host the data.
2. Define the data (FeatureView) to be served by the newly-created instance.

### Initialize Admin Service Client

Create the admin client used to manage Feature Store resources.

In [None]:
admin_client = FeatureOnlineStoreAdminServiceClient(
    client_options={"api_endpoint": API_ENDPOINT}
)

### Create Feature Online Store

Create a feature online store with embedding management enabled.

In [None]:
FEATURE_ONLINE_STORE_ID = "my_feature_online_store_unique"  # @param {type: "string"}

In [None]:
online_store_config = feature_online_store_pb2.FeatureOnlineStore(
    bigtable=feature_online_store_pb2.FeatureOnlineStore.Bigtable(
        auto_scaling=feature_online_store_pb2.FeatureOnlineStore.Bigtable.AutoScaling(
            min_node_count=1, max_node_count=3, cpu_utilization_target=50
        )
    ),
    embedding_management=feature_online_store_pb2.FeatureOnlineStore.EmbeddingManagement(
        enabled=True
    ),
)

create_store_lro = admin_client.create_feature_online_store(
    feature_online_store_admin_service_pb2.CreateFeatureOnlineStoreRequest(
        parent=f"projects/{PROJECT_ID}/locations/{REGION}",
        feature_online_store_id=FEATURE_ONLINE_STORE_ID,
        feature_online_store=online_store_config,
    )
)

### Verify online store instance creation

After the long-running operation (LRO) is complete, show the result.

> **Note:** This operation might take up to 10 minutes to complete.

In [None]:
# Wait for the LRO to finish and get the LRO result.
print(create_store_lro.result())

#### Verify that the `FeatureOnlineStore` instance is created by getting the online store instance

In [None]:
# Use get to verify the store is created.
admin_client.get_feature_online_store(
    name=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURE_ONLINE_STORE_ID}"
)

#### List all online stores for the location

In [None]:
# Use list to verify the store is created.
admin_client.list_feature_online_stores(
    parent=f"projects/{PROJECT_ID}/locations/{REGION}"
)

### Create feature view instance

After creating a `FeatureOnlineStore` instance, you define the features to serve with it. To do this, create a `FeatureView` instance, which specifies the following:

* A data source (a BigQuery table or view URI, or a FeatureGroup and its features) synced to the `FeatureOnlineStore` instance for serving.
* The cron schedule to run the sync pipeline.

When you create the feature view, a sync job is scheduled. It either starts immediately or follows the cron schedule. During the sync job, the data is exported to Cloud Bigtable, and the index is built and deployed to a GKE cluster.
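
The schedule string used in the next cell, `TZ=America/Los_Angeles 00 13 11 8 *`, combines an optional `TZ=` prefix with five cron fields. The following is a minimal sketch for reading such a string, assuming standard unix-cron field order (minute, hour, day of month, month, day of week). The helper `parse_cron_schedule` is hypothetical and isn't part of the Vertex AI SDK.

```python
def parse_cron_schedule(cron):
    """Split a cron string with an optional TZ= prefix into named fields."""
    parts = cron.split()
    timezone = None
    if parts and parts[0].startswith("TZ="):
        timezone = parts[0][len("TZ="):]
        parts = parts[1:]
    if len(parts) != 5:
        raise ValueError(f"Expected 5 cron fields, got {len(parts)}: {cron!r}")
    names = ["minute", "hour", "day_of_month", "month", "day_of_week"]
    schedule = dict(zip(names, parts))
    schedule["timezone"] = timezone
    return schedule


schedule = parse_cron_schedule("TZ=America/Los_Angeles 00 13 11 8 *")
print(schedule)
```

Under that assumption, the tutorial's schedule reads as 13:00 on August 11, Los Angeles time, on any day of the week.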

In [None]:
FEATURE_VIEW_ID = "feature_view_publications"  # @param {type: "string"}
# A schedule will be created based on cron setting.
# If cron is empty, an immediate schedule job will be started.
CRON_SCHEDULE = "TZ=America/Los_Angeles 00 13 11 8 *"  # @param {type: "string"}

In [None]:
# Vector search configs
DIMENSIONS = 64  # @param {type: "number"}
EMBEDDING_COLUMN = "embedding"  # @param {type: "string"}
# Optional
LEAF_NODE_EMBEDDING_COUNT = 10000  # @param {type: "number"}
# Optional
CROWDING_COLUMN = "cited_by_filing_date"  # @param {type: "string"}
# Optional
FILTER_COLUMNS = ["country"]  # @param

In [None]:
big_query_source = feature_view_pb2.FeatureView.BigQuerySource(
    uri=DATA_SOURCE, entity_id_columns=["publication_number"]
)

sync_config = feature_view_pb2.FeatureView.SyncConfig(cron=CRON_SCHEDULE)

vector_search_config = feature_view_pb2.FeatureView.VectorSearchConfig(
    embedding_column=EMBEDDING_COLUMN,
    filter_columns=FILTER_COLUMNS,
    crowding_column=CROWDING_COLUMN,
    embedding_dimension=DIMENSIONS,
    tree_ah_config=feature_view_pb2.FeatureView.VectorSearchConfig.TreeAHConfig(),
)

print(f"vector_search_config: {vector_search_config}")

create_view_lro = admin_client.create_feature_view(
    feature_online_store_admin_service_pb2.CreateFeatureViewRequest(
        parent=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURE_ONLINE_STORE_ID}",
        feature_view_id=FEATURE_VIEW_ID,
        feature_view=feature_view_pb2.FeatureView(
            big_query_source=big_query_source,
            sync_config=sync_config,
            vector_search_config=vector_search_config,
        ),
    )
)

Wait for the LRO to complete and show the result.

In [None]:
print(create_view_lro.result())

### Verify feature view instance creation

Verify that the `FeatureView` instance is created by getting the feature view.

In [None]:
admin_client.get_feature_view(
    name=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURE_ONLINE_STORE_ID}/featureViews/{FEATURE_VIEW_ID}"
)

Verify that the FeatureView instance is created by listing all the feature views within the online store.

In [None]:
admin_client.list_feature_views(
    parent=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURE_ONLINE_STORE_ID}"
)

### Feature view syncs

The sync pipeline executes according to the schedule specified in the `FeatureView` instance.

To skip the wait and execute the sync pipeline immediately, start the sync manually.

In [None]:
sync_response = admin_client.sync_feature_view(
    feature_view=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURE_ONLINE_STORE_ID}/featureViews/{FEATURE_VIEW_ID}"
)

The `sync_response` contains the ID of the sync job.
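
A sync can take a while, so a status check is usually repeated until the job finishes. The sketch below shows one possible polling pattern with an upper bound on the total wait. The helper `wait_until_done` and its parameters are hypothetical, and the status check is replaced by a stand-in so the example runs without any Google Cloud calls.

```python
import time


def wait_until_done(is_done, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Call `is_done()` until it returns True or the timeout elapses.

    Returns the number of polls made. Raises TimeoutError on timeout.
    """
    waited = 0
    polls = 0
    while True:
        polls += 1
        if is_done():
            return polls
        if waited >= timeout_seconds:
            raise TimeoutError(f"Not finished after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for a real status check: reports completion on the third poll.
responses = iter([False, False, True])
polls = wait_until_done(lambda: next(responses), sleep=lambda seconds: None)
print(polls)
```

In this notebook, `is_done` could wrap the check used in the next section, which treats a sync as finished once `run_time.end_time.seconds` is greater than zero.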

#### Use `get_feature_view_sync` to check the status of the job

In [None]:
import time

while True:
    feature_view_sync = admin_client.get_feature_view_sync(
        name=sync_response.feature_view_sync
    )
    if feature_view_sync.run_time.end_time.seconds > 0:
        status = "succeeded" if feature_view_sync.final_status.code == 0 else "failed"
        print(f"Sync {status} for {feature_view_sync.name}.")
        # Wait a little longer for the job to shut down properly.
        time.sleep(30)
        break
    else:
        print("Sync ongoing, waiting for 30 seconds.")
    time.sleep(30)

#### Use `list_feature_view_syncs` to view all your syncs

In [None]:
admin_client.list_feature_view_syncs(
    parent=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURE_ONLINE_STORE_ID}/featureViews/{FEATURE_VIEW_ID}"
)

### Start online serving

After the data sync is complete, use the `FetchFeatureValues` and `SearchNearestEntities` APIs to retrieve the data.

Get the public endpoint domain name.

In [None]:
# Get the online store to look up its public serving endpoint.
feature_online_store_instance = admin_client.get_feature_online_store(
    name=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURE_ONLINE_STORE_ID}"
)
PUBLIC_ENDPOINT = (
    feature_online_store_instance.dedicated_serving_endpoint.public_endpoint_domain_name
)

print(f"PUBLIC_ENDPOINT for online serving: {PUBLIC_ENDPOINT}")

#### Initialize the data client

In [None]:
data_client = FeatureOnlineStoreServiceClient(
    client_options={"api_endpoint": PUBLIC_ENDPOINT}
)

#### Set `NearestNeighborQuery.StringFilter`
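
A string filter restricts search results by the values in a filtering column. The sketch below illustrates one plausible reading of allow and deny tokens. It assumes that a row is dropped if it carries any deny token, and that when allow tokens are given a row must carry at least one of them. That semantics is an assumption for illustration, and the helper `passes_string_filter` and the sample rows are hypothetical.

```python
def passes_string_filter(value, allow_tokens=(), deny_tokens=()):
    """Return True if a column value (string or list of strings) passes the filter."""
    if value is None:
        values = set()
    elif isinstance(value, str):
        values = {value}
    else:
        values = set(value)
    if values & set(deny_tokens):
        return False  # carries a denied token
    if allow_tokens and not (values & set(allow_tokens)):
        return False  # carries none of the allowed tokens
    return True


# Hypothetical rows with a `country` filtering column.
rows = [
    {"publication_number": "P1", "country": "WIPO (PCT)"},
    {"publication_number": "P2", "country": "United States"},
    {"publication_number": "P3", "country": "Japan"},
]
kept = [
    row["publication_number"]
    for row in rows
    if passes_string_filter(
        row["country"], allow_tokens=["WIPO (PCT)"], deny_tokens=["United States"]
    )
]
print(kept)
```

Under these assumptions, an allow list alone is already restrictive, so the deny token mainly matters when the allow list is empty or a column holds several values.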

In [None]:
results_df = (
    bq_client.query(f"select publication_number from `{BQ_TABLE_ID_FQN}` limit 1")
    .result()
    .to_dataframe()
)
ENTITY_ID = results_df.loc[0, "publication_number"]
print(f"Sample publication number: {ENTITY_ID}")

In [None]:
country_filter = NearestNeighborQuery.StringFilter(
    name="country",
    allow_tokens=["WIPO (PCT)"],  # try different allow tokens
    deny_tokens=["United States"],  # try different deny tokens
)

#### Search with `ENTITY_ID`

In [None]:
data_client.search_nearest_entities(
    request=feature_online_store_service_pb2.SearchNearestEntitiesRequest(
        feature_view=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURE_ONLINE_STORE_ID}/featureViews/{FEATURE_VIEW_ID}",
        query=NearestNeighborQuery(
            entity_id=ENTITY_ID, neighbor_count=5, string_filters=[country_filter]
        ),
        return_full_entity=True,  # returning entities with metadata
    )
)

#### Search with `Embedding`
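
Searching by embedding returns the entities whose stored vectors are closest to the query vector. As a conceptual reference, the sketch below does an exact brute-force search with Euclidean distance over a few hypothetical 3-dimensional vectors. Both the distance measure and the data are assumptions for illustration. The service itself uses the approximate index configured in the feature view, so its results and distances can differ.

```python
import math


def nearest_neighbors(query, candidates, neighbor_count):
    """Return the `neighbor_count` closest (entity_id, distance) pairs."""
    scored = [
        (entity_id, math.dist(query, vector))
        for entity_id, vector in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:neighbor_count]


# Hypothetical stored embeddings keyed by entity ID.
candidates = {
    "P1": [1.0, 1.0, 1.0],
    "P2": [0.0, 0.0, 0.0],
    "P3": [1.0, 1.0, 0.0],
}
query = [1] * 3
top = nearest_neighbors(query, candidates, neighbor_count=2)
print([entity_id for entity_id, _ in top])
```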

In [None]:
EMBEDDINGS = [1] * DIMENSIONS

In [None]:
data_client.search_nearest_entities(
    request=feature_online_store_service_pb2.SearchNearestEntitiesRequest(
        feature_view=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURE_ONLINE_STORE_ID}/featureViews/{FEATURE_VIEW_ID}",
        query=NearestNeighborQuery(
            embedding=NearestNeighborQuery.Embedding(value=EMBEDDINGS),
            neighbor_count=10,
            string_filters=[country_filter],
        ),
        return_full_entity=True,  # returning entities with metadata
    )
)

#### Use the `FetchFeatureValues` API to retrieve the full data without search


In [None]:
data_client.fetch_feature_values(
    request=feature_online_store_service_pb2.FetchFeatureValuesRequest(
        feature_view=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURE_ONLINE_STORE_ID}/featureViews/{FEATURE_VIEW_ID}",
        id=ENTITY_ID,
    )
)

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial.

In [None]:
# Delete Feature View
admin_client.delete_feature_view(
    name=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURE_ONLINE_STORE_ID}/featureViews/{FEATURE_VIEW_ID}"
)

# Delete Feature Online Store
admin_client.delete_feature_online_store(
    name=f"projects/{PROJECT_ID}/locations/{REGION}/featureOnlineStores/{FEATURE_ONLINE_STORE_ID}",
    force=True,
)