In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Performing Semantic Search in BigQuery

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/semantic-search-in-bigquery/stackoverflow_questions_semantic_search.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fapplying-llms-to-data%2Fsemantic-search-in-bigquery%2Fstackoverflow_questions_semantic_search.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/applying-llms-to-data/semantic-search-in-bigquery/stackoverflow_questions_semantic_search.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/bigquery/import?url=https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/semantic-search-in-bigquery/stackoverflow_questions_semantic_search.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/bigquery/v1/32px.svg" alt="BigQuery Studio logo"><br> Open in BigQuery Studio
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/applying-llms-to-data/semantic-search-in-bigquery/stackoverflow_questions_semantic_search.ipynb">
      <img width="32px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>


| | |
|-|-|
|Author(s) | [Jaideep Sethi](https://github.com/sethijaideep) |

## Overview

This notebook demonstrates how to perform semantic search in BigQuery using vector search. It covers:


*   Completing setup steps for accessing Vertex AI from BigQuery
*   Creating a remote model in BigQuery
*   Generating text embeddings using the remote model
*   Creating a vector index to optimize the semantic search
*   Performing semantic search using the `VECTOR_SEARCH` function in BigQuery


## About the dataset

We are going to use the Stack Overflow public dataset available in BigQuery. The data is an archive of Stack Overflow posts, votes, tags, and badges.

The dataset can be accessed [here](https://console.cloud.google.com/bigquery(cameo:product/stack-exchange/stack-overflow)).

## Services and Costs

This tutorial uses the following Google Cloud data analytics and ML services, which are billable components of Google Cloud:

* BigQuery & BigQuery ML [(pricing)](https://cloud.google.com/bigquery/pricing)
* Vertex AI API [(pricing)](https://cloud.google.com/vertex-ai/pricing)

Use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.


# Setup steps for accessing Vertex AI models from BigQuery

## Enable the Vertex AI and BigQuery Connection APIs

In [None]:
!gcloud services enable aiplatform.googleapis.com bigqueryconnection.googleapis.com

## Create a Cloud resource connection
You can learn more about Cloud resource connections [here](https://cloud.google.com/bigquery/docs/create-cloud-resource-connection).

In [None]:
!bq mk --connection --location=us \
    --connection_type=CLOUD_RESOURCE vertex_conn

## Grant the "Vertex AI User" role to the service account used by the Cloud resource connection


In [None]:
SERVICE_ACCT = !bq show --format=prettyjson --connection us.vertex_conn | grep "serviceAccountId" | cut -d '"' -f 4
SERVICE_ACCT_EMAIL = SERVICE_ACCT[-1]

In [None]:
import os

PROJECT_ID = os.environ["GOOGLE_CLOUD_PROJECT"]
!gcloud projects add-iam-policy-binding --format=none $PROJECT_ID --member=serviceAccount:$SERVICE_ACCT_EMAIL --role=roles/aiplatform.user
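
The service account lookup above relies on `grep` and `cut`, which depend on how the JSON output happens to be laid out across lines. As a hedged alternative, the sketch below parses the JSON and searches it for the `serviceAccountId` key. The helper name `find_service_account_id` and the sample output are hypothetical and only illustrate the idea; the real JSON comes from the `bq show` command used above.

```python
import json


def find_service_account_id(node):
    """Recursively search parsed JSON for the first "serviceAccountId" value."""
    if isinstance(node, dict):
        if "serviceAccountId" in node:
            return node["serviceAccountId"]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_service_account_id(child)
        if found is not None:
            return found
    return None


# Hypothetical output for illustration; the real JSON comes from
# `bq show --format=prettyjson --connection us.vertex_conn`.
sample_output = """
{
  "name": "projects/123/locations/us/connections/vertex_conn",
  "cloudResource": {
    "serviceAccountId": "example-sa@example-project.iam.gserviceaccount.com"
  }
}
"""

print(find_service_account_id(json.loads(sample_output)))
```

Searching by key name avoids assuming exactly where the field sits inside the JSON document.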

# Create the remote model in BigQuery ML

## Create a new dataset named `bigquery_demo`

In [None]:
%%bigquery
CREATE SCHEMA
  `bigquery_demo` OPTIONS (location = 'US');

## Create the remote model for Text Embedding in BigQuery ML
A text embedding model converts text into numerical vectors. These vector representations are designed to capture the semantic meaning and context of the text they represent. To generate embeddings, we are using the `text-embedding-004` model, which is one of the text embedding models available on the Vertex AI platform.

You can learn more about the Embeddings APIs [here](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings).
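
To make "capturing semantic meaning" more concrete, here is a minimal sketch of how closeness between embeddings can be measured with cosine distance (1 minus cosine similarity). The three-dimensional vectors and their names are made up for illustration; real embeddings from the model have hundreds of dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; not real model output.
crash_question = [0.9, 0.1, 0.0]
memory_question = [0.8, 0.2, 0.1]
layout_question = [0.0, 0.2, 0.9]

# Vectors pointing in similar directions have a smaller distance.
print(cosine_distance(crash_question, memory_question))
print(cosine_distance(crash_question, layout_question))
```

Texts with related meaning are expected to produce vectors with a small distance, which is what the search later in this notebook ranks by.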

Note: If you encounter a permission error while accessing or using the endpoint for the service account, please wait a minute and try again.

In [None]:
%%bigquery
CREATE OR REPLACE MODEL `bigquery_demo.text_embedding_004`
REMOTE WITH CONNECTION `us.vertex_conn`
OPTIONS (endpoint = 'text-embedding-004')

# Prepare the dataset for semantic search
Semantic search retrieves results based on the meaning of words and phrases rather than on exact keyword matches.

## Generate text embeddings for title and body associated with Stack Overflow questions

For our use case, we are going to use the `title` and `body` fields from the Stack Overflow `posts_questions` table to generate text embeddings and perform semantic search using the `VECTOR_SEARCH` function.

Note: To limit costs for this demo, we'll use only the 10,000 most-viewed iOS-related questions.

In [None]:
%%bigquery
CREATE OR REPLACE TABLE
  `bigquery_demo.posts_questions_embedding` AS
SELECT
  *
FROM
  ML.GENERATE_EMBEDDING( MODEL `bigquery_demo.text_embedding_004`,
    (
    SELECT
      id,
      title,
      body,
      -- Separate title and body so their words do not run together.
      CONCAT(title, ' ', body) AS content
    FROM
      `bigquery-public-data.stackoverflow.posts_questions`
    WHERE
      tags LIKE '%ios%'
    ORDER BY
      view_count DESC
    LIMIT
      10000 ),
    STRUCT ( TRUE AS flatten_json_output,
      'SEMANTIC_SIMILARITY' AS task_type ) );

Let's now check the new table containing the embedding fields.

In [None]:
%%bigquery
SELECT * FROM `bigquery_demo.posts_questions_embedding` LIMIT 100;

## Create Vector Index on the embeddings to help with efficient semantic search
A vector index is a data structure designed to let the `VECTOR_SEARCH` function perform a more efficient vector search of embeddings. You can learn more about vector indexes [here](https://cloud.google.com/bigquery/docs/vector-index).
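
To give some intuition for the `IVF` index type used in this notebook, the toy sketch below groups vectors into lists by their nearest centroid and then scans only the lists closest to the query. This is a simplified illustration, not BigQuery's implementation: the centroids are hand-picked here (a real index learns them by clustering), and the function names and two-dimensional vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def build_ivf(vectors, centroids):
    """Assign each vector id to the list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(
            range(len(centroids)),
            key=lambda i: cosine_distance(vec, centroids[i]),
        )
        lists[nearest].append(vec_id)
    return lists


def ivf_search(query, vectors, centroids, lists, lists_to_search, top_k):
    """Scan only the lists whose centroids are closest to the query."""
    ranked = sorted(
        range(len(centroids)),
        key=lambda i: cosine_distance(query, centroids[i]),
    )
    candidates = [vid for i in ranked[:lists_to_search] for vid in lists[i]]
    candidates.sort(key=lambda vid: cosine_distance(query, vectors[vid]))
    return candidates[:top_k]


# Hypothetical toy data: two clusters of two vectors each.
vectors = {
    "q1": [0.9, 0.1], "q2": [0.8, 0.3],
    "q3": [0.1, 0.9], "q4": [0.2, 0.8],
}
centroids = [[1.0, 0.0], [0.0, 1.0]]

lists = build_ivf(vectors, centroids)
result = ivf_search([0.95, 0.05], vectors, centroids, lists,
                    lists_to_search=1, top_k=2)
print(lists)   # {0: ['q1', 'q2'], 1: ['q3', 'q4']}
print(result)  # ['q1', 'q2']
```

Scanning fewer lists is faster but can miss neighbors that fall into an unscanned list, which is the trade-off behind options such as `num_lists` and `fraction_lists_to_search`.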

In [None]:
%%bigquery
  CREATE OR REPLACE VECTOR INDEX ix_posts_questions
  ON
  `bigquery_demo.posts_questions_embedding` (ml_generate_embedding_result) OPTIONS(index_type = 'IVF',
    distance_type = 'COSINE',
    ivf_options = '{"num_lists":500}');

## Verify vector index creation

Note: The vector index is populated asynchronously. You can check whether the index is ready to be used by querying the `INFORMATION_SCHEMA.VECTOR_INDEXES` view and verifying that the `coverage_percentage` column value is greater than 0 and the `last_refresh_time` column value isn't `NULL`.
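
If you want to apply this readiness rule in Python, for example after loading the query result into dictionaries, a minimal sketch could look like the following. The helper name `index_is_ready` and the sample rows are hypothetical; only the two column names come from the view described above.

```python
from datetime import datetime, timezone


def index_is_ready(row):
    """Apply the readiness rule: coverage above zero and a non-null refresh time."""
    coverage = row.get("coverage_percentage") or 0
    return coverage > 0 and row.get("last_refresh_time") is not None


# Hypothetical rows shaped like the INFORMATION_SCHEMA.VECTOR_INDEXES output.
pending = {
    "index_name": "ix_posts_questions",
    "coverage_percentage": 0,
    "last_refresh_time": None,
}
ready = {
    "index_name": "ix_posts_questions",
    "coverage_percentage": 100,
    "last_refresh_time": datetime(2024, 1, 1, tzinfo=timezone.utc),
}

print(index_is_ready(pending))  # False
print(index_is_ready(ready))    # True
```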

In [None]:
%%bigquery
SELECT
  table_name,
  index_name,
  index_status,
  coverage_percentage,
  last_refresh_time,
  disable_reason
FROM
  `bigquery_demo.INFORMATION_SCHEMA.VECTOR_INDEXES`;

# Perform semantic search

We can now use the text embeddings to perform a similarity search on a new question.

## Match input question text to existing questions using vector search
Now let's perform a semantic search using the `VECTOR_SEARCH` function to find the top 5 closest results in our `posts_questions_embedding` table to a given question.

In [None]:
%%bigquery
SELECT
  query.query AS input_question,
  base.id AS matching_question_id,
  base.title AS matching_question_title,
  base.content AS matching_question_content,
  distance
FROM
  VECTOR_SEARCH( TABLE `bigquery_demo.posts_questions_embedding`,
    'ml_generate_embedding_result',
    (
    SELECT
      ml_generate_embedding_result,
      content AS query
    FROM
      ML.GENERATE_EMBEDDING( MODEL `bigquery_demo.text_embedding_004`,
        (
        SELECT
          'Why does my iOS app crash with a low memory warning despite minimal memory usage?' AS content) ) ),
    top_k => 5,
    OPTIONS => '{"fraction_lists_to_search": 0.10}')
ORDER BY
  distance ASC ;
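
The query above hard-codes the question text. As a rough sketch of how the search could be reused for other questions, the hypothetical helper below builds a trimmed version of the same query with the question left as a named query parameter, `@question`, and with `top_k` and `fraction_lists_to_search` validated before being placed in the SQL text. The function name and the validation rules are assumptions made for illustration.

```python
def build_vector_search_sql(top_k=5, fraction_lists_to_search=0.10):
    """Return the VECTOR_SEARCH query text with the question left as @question."""
    if not isinstance(top_k, int) or top_k < 1:
        raise ValueError("top_k must be a positive integer")
    if not 0 < fraction_lists_to_search <= 1:
        raise ValueError("fraction_lists_to_search must be in (0, 1]")
    # JSON-formatted search options, as in the query above.
    options = '{"fraction_lists_to_search": %s}' % fraction_lists_to_search
    return f"""
SELECT
  query.query AS input_question,
  base.id AS matching_question_id,
  base.title AS matching_question_title,
  distance
FROM
  VECTOR_SEARCH(
    TABLE `bigquery_demo.posts_questions_embedding`,
    'ml_generate_embedding_result',
    (
      SELECT ml_generate_embedding_result, content AS query
      FROM ML.GENERATE_EMBEDDING(
        MODEL `bigquery_demo.text_embedding_004`,
        (SELECT @question AS content))
    ),
    top_k => {top_k},
    options => '{options}')
ORDER BY distance ASC
""".strip()


sql = build_vector_search_sql(top_k=5)
print(sql)
```

The resulting string could then be run with the BigQuery client library, supplying the question as a `STRING` query parameter named `question`, so that user-provided text is never spliced into the SQL itself.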

Summary: The results demonstrate that `VECTOR_SEARCH` effectively identified the top 5 most similar questions. You can use this same approach to implement semantic search in BigQuery on any dataset.

# Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial by uncommenting the commands below:

In [None]:
#
# !bq rm -r -f $PROJECT_ID:bigquery_demo
# !bq rm --connection --project_id=$PROJECT_ID --location=us vertex_conn
#

# Wrap up

In this notebook, you have seen an example of how to integrate BQML with Vertex AI LLMs, how to generate embeddings with `ML.GENERATE_EMBEDDING`, and how to perform semantic search using `VECTOR_SEARCH` in BigQuery.

Check out our BigQuery ML documentation on [generating embeddings](https://cloud.google.com/bigquery/docs/generate-text-embedding) and [vector search](https://cloud.google.com/bigquery/docs/vector-search) to learn more about generative AI in BigQuery.