In [1]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Production & Scalable RAG Pipeline Using BigFrames

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/retrieval-augmented-generation/scalable_rag_with_bigframes.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fretrieval-augmented-generation%2Fscalable_rag_with_bigframes.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/retrieval-augmented-generation/scalable_rag_with_bigframes.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/bigquery/import?url=https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/retrieval-augmented-generation/scalable_rag_with_bigframes.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/bigquery/v1/32px.svg" alt="BigQuery Studio logo"><br> Open in BigQuery Studio
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/retrieval-augmented-generation/scalable_rag_with_bigframes.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/217753/github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>



| | |
|-|-|
| Authors | [Lorenzo Spataro](https://github.com/lspataroG), [Elia Secchi](https://github.com/eliasecchig) |


# Overview

This notebook demonstrates how to use [BigFrames](https://cloud.google.com/python/docs/reference/bigframes/latest) and [LangChain](http://python.langchain.com/docs) to build a Retrieval-Augmented Generation (RAG) pipeline with Vertex AI.

Specifically, we are going to build a data pipeline capable of being deployed in a production environment with scheduled execution.

You will learn how to:
- Load data from BigQuery into BigFrames
- Create embeddings using Vertex AI models
- Build a vector store using BigQuery
- Create a RAG pipeline using LangChain
- Query your data using natural language

## What is BigQuery DataFrames?
BigQuery DataFrames, also called [BigFrames](https://cloud.google.com/python/docs/reference/bigframes/latest), lets you process data in BigQuery using familiar Python APIs like pandas and scikit-learn. It works by converting Python code into optimized SQL that runs directly in BigQuery.

Key benefits:
- Process terabytes of data using Python without moving it out of BigQuery
- Train ML models directly in BigQuery using scikit-learn syntax
- A wide range of popular pandas and scikit-learn APIs are available through SQL conversion
- Lazy evaluation for better performance
- Custom Python functions run as BigQuery remote functions
- Vertex AI integration for Gemini model access

The following diagram describes the workflow of BigQuery DataFrames:
![BigQuery DataFrames Workflow](https://cloud.google.com/static/bigquery/images/dataframes-workflow.png)


# Get started

## Install Vertex AI SDK and other required packages


In [None]:
%pip install --upgrade --user --quiet google-cloud-aiplatform  "bigframes" langchain markdownify swifter "langchain-google-community[featurestore]" langchain-google-vertexai

## Restart runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.

The restart might take a minute or longer. After it's restarted, continue to the next step.

In [None]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. In Colab or Colab Enterprise, you might see an error message that says "Your session crashed for an unknown reason." This is expected. Wait until it's finished before continuing to the next step. ⚠️</b>
</div>


## Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

## Set Google Cloud project information and initialize Vertex AI SDK

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

Alternatively, you can enable the Vertex AI API by running the `gcloud services enable` command in the cell that follows the project configuration below.

In [2]:
import os

import bigframes.ml.llm as llm
import bigframes.pandas as bpd
from google.cloud import bigquery
import vertexai

PROJECT_ID = (
    "your-project-id"  # @param {type: "string", placeholder: "your-project-id"}
)
# Fall back to the GOOGLE_CLOUD_PROJECT environment variable if no project ID is provided.
if not PROJECT_ID or PROJECT_ID == "your-project-id":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

# LOCATION must be a US region because the source dataset is stored in the US.
LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")
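
The fallback above yields the string `"None"` when neither a project ID nor the environment variable is set. A stricter variant could fail early instead. The following is a minimal sketch; `resolve_project_id` is a hypothetical helper that is not part of this notebook or of any Google Cloud library.

```python
import os
from collections.abc import Mapping


def resolve_project_id(explicit: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the explicit project ID, else GOOGLE_CLOUD_PROJECT, else raise."""
    placeholder = "your-project-id"
    if explicit and explicit != placeholder:
        return explicit
    from_env = env.get("GOOGLE_CLOUD_PROJECT")
    if not from_env:
        # Failing here is easier to debug than a later API error about project "None"
        raise ValueError(
            "Set PROJECT_ID or the GOOGLE_CLOUD_PROJECT environment variable."
        )
    return from_env


# The environment mapping is passed explicitly here to keep the example deterministic
print(resolve_project_id("your-project-id", {"GOOGLE_CLOUD_PROJECT": "demo-project"}))
```

Passing the environment as an argument is only a convenience for testing; calling `resolve_project_id(PROJECT_ID)` would read `os.environ`.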

In [None]:
# Set the Google Cloud project and enable the Vertex AI API
! gcloud config set project $PROJECT_ID && gcloud services enable aiplatform.googleapis.com

In [4]:
# Set project and location for Vertex, BigQuery and BigFrames
vertexai.init(project=PROJECT_ID, location=LOCATION)

bq_client = bigquery.Client(project=PROJECT_ID, location="US")
bpd.options.bigquery.project = PROJECT_ID
bpd.options.bigquery.location = "US"

### Import libraries

In [5]:
# Standard library imports
from datetime import datetime, timedelta
import json

# Third-party imports
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_community import BigQueryVectorStore
from langchain_google_vertexai import VertexAIEmbeddings
from markdownify import markdownify

### Variables definition

As we're building a pipeline intended to run regularly on a schedule, we're going to set up some time-dependent variables:

*   `RUN_DATE`: The date the process runs
*   `IS_INCREMENTAL`: If `True`, only query recent data; otherwise, query the whole dataset
*   `LOOK_BACK_DAYS`: If `IS_INCREMENTAL=True`, defines how many days in the past to query


In [6]:
IS_INCREMENTAL = True  # Flag to enable incremental processing
# Set to the last date in the dataset for demonstration purposes
# (i.e., there is no data after this date)
RUN_DATE = datetime.strptime("2022-09-26", "%Y-%m-%d").date()
LOOK_BACK_DAYS = 1  # Number of days to look back from RUN_DATE
START_DATE = str(
    RUN_DATE - timedelta(days=LOOK_BACK_DAYS)
)  # Start date for data processing window
END_DATE = str(RUN_DATE)  # End date for data processing window
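
The window arithmetic above can be factored into a small function, which makes it easier to test before scheduling the pipeline. This is a minimal sketch; `processing_window` is a hypothetical helper name, not something the notebook defines.

```python
from datetime import date, datetime, timedelta


def processing_window(run_date: date, look_back_days: int) -> tuple[str, str]:
    """Return (start_date, end_date) as ISO date strings for one pipeline run."""
    if look_back_days < 0:
        raise ValueError("look_back_days must be non-negative")
    start = run_date - timedelta(days=look_back_days)
    return str(start), str(run_date)


run_date = datetime.strptime("2022-09-26", "%Y-%m-%d").date()
start_date, end_date = processing_window(run_date, look_back_days=1)
print(start_date, end_date)  # 2022-09-25 2022-09-26
```

Because the query in the next section filters with an inclusive `BETWEEN`, a look-back of one day covers two calendar days: the start date and the run date itself.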

# 1. Data Loading and initial preprocessing
This section retrieves and examines Stack Overflow Python Q&A data from the public BigQuery table `production-ai-template.stackoverflow_qa.stackoverflow_python_questions_and_answers`. The data comes from the official [Stack Overflow public dataset](https://console.cloud.google.com/marketplace/product/stack-exchange/stack-overflow) and contains a sample of Python-related questions and their corresponding answers.


In [None]:
query = f"""
    SELECT
        creation_date,
        last_edit_date,
        question_id,
        question_title,
        question_body AS question_text,
        answers
    FROM `production-ai-template.stackoverflow_qa.stackoverflow_python_questions_and_answers`
    WHERE TRUE
        # If IS_INCREMENTAL is True, filter records between START_DATE and END_DATE
        # Otherwise, include all records without date filtering
        {f'AND TIMESTAMP_TRUNC(creation_date, DAY) BETWEEN TIMESTAMP("{START_DATE}") AND TIMESTAMP("{END_DATE}")' if IS_INCREMENTAL else ''}
"""
df = bpd.read_gbq(query)
df.head(2)
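
The inline conditional inside the f-string above can be hard to read. The sketch below builds the same kind of query in a standalone function so that both modes can be inspected without running anything in BigQuery. `build_questions_query` is a hypothetical helper, and the date values are assumed to be trusted constants, as they are in this notebook.

```python
def build_questions_query(
    table: str, is_incremental: bool, start_date: str, end_date: str
) -> str:
    """Build the source query, adding a date filter only for incremental runs."""
    date_filter = ""
    if is_incremental:
        # BETWEEN is inclusive on both ends
        date_filter = (
            "AND TIMESTAMP_TRUNC(creation_date, DAY) "
            f'BETWEEN TIMESTAMP("{start_date}") AND TIMESTAMP("{end_date}")'
        )
    return f"""
    SELECT
        creation_date,
        last_edit_date,
        question_id,
        question_title,
        question_body AS question_text,
        answers
    FROM `{table}`
    WHERE TRUE
        {date_filter}
    """


table = "production-ai-template.stackoverflow_qa.stackoverflow_python_questions_and_answers"
incremental_sql = build_questions_query(table, True, "2022-09-25", "2022-09-26")
full_sql = build_questions_query(table, False, "2022-09-25", "2022-09-26")
print(incremental_sql)
```

The `WHERE TRUE` clause is what lets the optional `AND ...` filter be appended or omitted without changing the rest of the statement.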

## Data Cleaning and Markdown Conversion
In this step, we clean the raw data by converting HTML content to Markdown format for better readability and processing.
We transform both questions and answers from HTML to Markdown, structure the content with proper headings,
and combine them into a unified text format that will be easier to work with in subsequent steps.

> Note: In this case, we call `to_pandas()` to pull the data into local memory, where the processing happens. This lets us transform and clean the data with regular pandas operations, as long as the data fits in memory. Section 7 demonstrates how to scale this processing using remote functions.

In [8]:
def convert_html_to_markdown(html: str) -> str:
    """Convert HTML into Markdown for easier parsing and rendering after LLM response."""
    return markdownify(html).strip()


def create_answers_markdown(answers: list[dict]) -> str:
    """Convert each answer's HTML to markdown and concatenate into a single markdown text."""
    answers_md = ""
    for index, answer_record in enumerate(answers):
        answers_md += (
            f"\n\n## Answer {index + 1}:\n"  # Answer number is H2 heading size
        )
        answers_md += convert_html_to_markdown(answer_record["body"])
    return answers_md

In [None]:
# Sort, deduplicate and reset index in one operation
df = (
    df.sort_values("last_edit_date", ascending=False)
    .drop_duplicates("question_id")
    .reset_index(drop=True)
)

# Create markdown fields efficiently
df["question_title_md"] = "# " + df["question_title"] + "\n"  # Title is H1 heading size
df["question_text_md"] = (
    df["question_text"].to_pandas().apply(convert_html_to_markdown) + "\n"
)
df["answers_md"] = df["answers"].to_pandas().apply(create_answers_markdown)

# Create a column containing the whole markdown text
df["full_text_md"] = df["question_title_md"] + df["question_text_md"] + df["answers_md"]
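
To make the layout of `full_text_md` concrete, the sketch below assembles one document from strings that are assumed to be already converted to Markdown, so it runs without `markdownify`. `assemble_full_text` is a hypothetical helper that mirrors the concatenation done above.

```python
def assemble_full_text(title: str, question_md: str, answers_md: list[str]) -> str:
    """Combine a title, a question, and its answers into one Markdown document."""
    parts = [f"# {title}\n", f"{question_md}\n"]  # Title is an H1 heading
    for index, answer_md in enumerate(answers_md):
        # Each answer gets a numbered H2 heading
        parts.append(f"\n\n## Answer {index + 1}:\n{answer_md}")
    return "".join(parts)


full_text = assemble_full_text(
    title="How to read a CSV?",
    question_md="I have a file.",
    answers_md=["Use `csv`.", "Use pandas."],
)
print(full_text)
```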

In [None]:
# Select final columns for the cleaned dataset
final_cols = ["last_edit_date", "question_id", "question_text", "full_text_md"]
df = df[final_cols]
df.head(2)

# 2. Text Chunking
The text data is in Markdown format, which calls for a thoughtful chunking approach. While this notebook uses a basic character-based splitter, production systems typically employ more sophisticated techniques that:

- Preserve semantic units like paragraphs and sections
- Maintain markdown structure and hierarchy
- Keep related content together (e.g., questions with their answers)
- Use overlapping chunks to maintain context across boundaries
- Consider special markdown elements like code blocks and lists

This helps ensure chunks remain coherent and meaningful for downstream tasks.

For simplicity, we stick with a basic character-based splitter, since the primary focus of this notebook is demonstrating scalable Gen AI data processing.

In [11]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=20,
    length_function=len,
)
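
To illustrate what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified fixed-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and lines; `split_with_overlap` is a hypothetical function written only for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The current window already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the overlap plays in carrying context across chunk boundaries.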

Apply text chunking to each document locally using pandas and swifter:


In [None]:
df["text_chunk"] = (
    df["full_text_md"]
    .to_pandas()
    .astype(object)
    .swifter.apply(text_splitter.split_text)
)
df.head(2)

In [None]:
# Compute the sequential index of a chunk within the list of chunks for each question
chunk_ids = [
    str(idx) for text_chunk in df["text_chunk"] for idx in range(len(text_chunk))
]
# Explode the chunk list so that we get a row per chunk
df = df.explode("text_chunk").reset_index(drop=True)
# Assigning the chunk_id as question_id + sequential index of the chunk
df["chunk_id"] = df["question_id"].astype("string") + "__" + chunk_ids
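
The ID scheme above (`question_id`, two underscores, then the position of the chunk within its question) can be checked on plain Python data. This is a minimal sketch with made-up rows; `explode_with_chunk_ids` is a hypothetical helper, and it assumes every question has at least one chunk.

```python
def explode_with_chunk_ids(rows: list[dict]) -> list[dict]:
    """Return one row per chunk, each with a chunk_id of the form '<question_id>__<index>'."""
    exploded = []
    for row in rows:
        for index, chunk in enumerate(row["text_chunk"]):
            exploded.append(
                {
                    "question_id": row["question_id"],
                    "text_chunk": chunk,
                    "chunk_id": f'{row["question_id"]}__{index}',
                }
            )
    return exploded


rows = [
    {"question_id": 10, "text_chunk": ["first part", "second part"]},
    {"question_id": 11, "text_chunk": ["only part"]},
]
for item in explode_with_chunk_ids(rows):
    print(item["chunk_id"], "->", item["text_chunk"])
```

Because the index restarts at zero for every question, the resulting IDs stay stable across runs as long as the chunking parameters and the text do not change.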

In [None]:
df.head()

# 3. Embedding

To generate embeddings, we leverage the seamless integration between BigFrames, BigQuery, and Vertex AI.
This integration allows us to efficiently generate embeddings through Vertex AI's batch scoring process.
The `text-embedding-005` model converts each text chunk into a high-dimensional vector representation,
enabling semantic search and similarity analysis.

> Note: This step might take a few minutes to complete.


In [None]:
# Initialize the embedding model
embedder = llm.TextEmbeddingGenerator(model_name="text-embedding-005")

# Generate embeddings
embeddings_df = embedder.predict(df["text_chunk"])
df = df.assign(
    embedding_result=embeddings_df["ml_generate_embedding_result"],
    embedding_statistics=embeddings_df["ml_generate_embedding_statistics"],
    embedding_status=embeddings_df["ml_generate_embedding_status"],
)
current_timestamp = datetime.now()
df["creation_timestamp"] = current_timestamp

df.head()

The DataFrame now has four new columns: `embedding_result`, `embedding_statistics`, `embedding_status`, and `creation_timestamp`.
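
Before saving the results, it can be worth checking `embedding_status` for rows that failed to embed. The sketch below works on plain dictionaries and assumes that an empty status string marks a successful row, which is how the BigQuery ML documentation describes `ml_generate_embedding_status`. The helper name and the error text are hypothetical.

```python
def split_by_embedding_status(
    rows: list[dict], status_field: str = "embedding_status"
) -> tuple[list[dict], list[dict]]:
    """Separate rows into (succeeded, failed) based on the status column."""
    succeeded, failed = [], []
    for row in rows:
        # Assumption: an empty or missing status means the embedding call succeeded
        if row.get(status_field):
            failed.append(row)
        else:
            succeeded.append(row)
    return succeeded, failed


rows = [
    {"chunk_id": "10__0", "embedding_status": ""},
    {"chunk_id": "10__1", "embedding_status": "hypothetical error message"},
    {"chunk_id": "11__0", "embedding_status": ""},
]
succeeded, failed = split_by_embedding_status(rows)
print(len(succeeded), "succeeded,", len(failed), "failed")
```

In the notebook itself, the equivalent check would be a filter on the `embedding_status` column of `df`.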

# 4. Saving results

We are now ready to save the processing results to a BigQuery table, where they can be consumed by whichever vector database we choose to use.
The incremental writing strategy lets us update the embeddings table efficiently by:
1. Processing only new or modified questions since the last run (controlled by the `IS_INCREMENTAL` flag)
2. Appending new embeddings to the existing table when `IS_INCREMENTAL=True`
3. Replacing the entire table when `IS_INCREMENTAL=False`

Incremental updates can leave duplicate entries in the table
(e.g., if a question was modified multiple times), so we need to deduplicate
the table afterwards and keep only the latest version of each question.
Deduplication is based on `question_id` and keeps the rows with the most recent `creation_timestamp`.


In [16]:
DESTINATION_DATASET_ID = "stackoverflow_data"
DESTINATION_TABLE_ID = "incremental_questions_embeddings"
PARTITION_DATE_COLUMN = "creation_timestamp"

If the table doesn't exist yet, create an empty one with partitioning and the right schema:

In [None]:
def create_table_if_not_exist(
    df, project_id, dataset_id, table_id, partition_column, location="US"
):
    """Create a day-partitioned table matching the DataFrame's schema if it doesn't exist."""
    # Write an empty copy of the DataFrame to BigQuery to obtain its schema
    table_schema = bq_client.get_table(df.head(0).to_gbq()).schema

    # Define the table with daily partitioning on the given column
    table = bigquery.Table(f"{project_id}.{dataset_id}.{table_id}", schema=table_schema)
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY, field=partition_column
    )

    # Create the dataset and the table; existing ones are left untouched
    dataset = bigquery.Dataset(f"{project_id}.{dataset_id}")
    dataset.location = location
    bq_client.create_dataset(dataset, exists_ok=True)
    bq_client.create_table(table=table, exists_ok=True)


create_table_if_not_exist(
    df=df,
    project_id=PROJECT_ID,
    dataset_id=DESTINATION_DATASET_ID,
    table_id=DESTINATION_TABLE_ID,
    partition_column=PARTITION_DATE_COLUMN,
)

In [None]:
# If IS_INCREMENTAL is True, append new data to existing table
# If IS_INCREMENTAL is False, replace entire table with new data
if_exists_mode = "append" if IS_INCREMENTAL else "replace"

incremental_table_id = df.to_gbq(
    destination_table=f"{DESTINATION_DATASET_ID}.{DESTINATION_TABLE_ID}",
    if_exists=if_exists_mode,
)

## Create a new Dedup table (Optional)

If necessary, we can create a deduplication table to address duplicate questions that may appear in the dataset across different dates.

In [None]:
df_questions = bpd.read_gbq(
    f"{DESTINATION_DATASET_ID}.{DESTINATION_TABLE_ID}", use_cache=False
)
max_date_df = (
    df_questions.groupby("question_id")["creation_timestamp"].max().reset_index()
)
df_questions_dedup = max_date_df.merge(
    df_questions, how="inner", on=["question_id", "creation_timestamp"]
)
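
The group-by and merge above keep every row whose `creation_timestamp` equals the latest one for its `question_id`, so all chunks written by the most recent run of a question survive together. The same rule is sketched below on plain Python data; `keep_latest_per_question` and the sample rows are hypothetical.

```python
from datetime import datetime


def keep_latest_per_question(rows: list[dict]) -> list[dict]:
    """Keep only the rows written at the latest creation_timestamp of each question."""
    latest: dict = {}
    for row in rows:
        qid, ts = row["question_id"], row["creation_timestamp"]
        if qid not in latest or ts > latest[qid]:
            latest[qid] = ts  # Equivalent to groupby("question_id").max()
    # Equivalent to the inner merge on (question_id, creation_timestamp)
    return [r for r in rows if r["creation_timestamp"] == latest[r["question_id"]]]


first_run = datetime(2022, 9, 25, 8, 0)
second_run = datetime(2022, 9, 26, 8, 0)
rows = [
    {"question_id": 1, "chunk_id": "1__0", "creation_timestamp": first_run},
    {"question_id": 1, "chunk_id": "1__1", "creation_timestamp": first_run},
    {"question_id": 1, "chunk_id": "1__0", "creation_timestamp": second_run},
    {"question_id": 2, "chunk_id": "2__0", "creation_timestamp": first_run},
]
for row in keep_latest_per_question(rows):
    print(row["question_id"], row["chunk_id"], row["creation_timestamp"])
```

Note that question 1 shrinks from two chunks to one here: chunks from the older run are dropped entirely rather than matched chunk by chunk.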

In [None]:
DESTINATION_DEDUPED_QUESTIONS_TABLE_ID = "questions_embeddings"
create_table_if_not_exist(
    df=df_questions_dedup,
    project_id=PROJECT_ID,
    dataset_id=DESTINATION_DATASET_ID,
    table_id=DESTINATION_DEDUPED_QUESTIONS_TABLE_ID,
    partition_column=PARTITION_DATE_COLUMN,
)

deduped_table_id = df_questions_dedup.to_gbq(
    destination_table=f"{DESTINATION_DATASET_ID}.{DESTINATION_DEDUPED_QUESTIONS_TABLE_ID}",
    if_exists="replace",
)

# 5. Testing retrieval

Let's find the documents most similar to an input query.

In [None]:
embedding_model = VertexAIEmbeddings(
    model_name="text-embedding-005", project=PROJECT_ID
)
bq_store = BigQueryVectorStore(
    project_id=PROJECT_ID,
    location="US",
    dataset_name=DESTINATION_DATASET_ID,
    table_name=DESTINATION_DEDUPED_QUESTIONS_TABLE_ID,
    embedding=embedding_model,
    embedding_field="embedding_result",
    content_field="text_chunk",
)

In [None]:
# Perform similarity search and look at the most relevant documents
search_query = "how do I read a csv file with python?"  # @param {type:"string"}
results = bq_store.similarity_search(search_query)
text_results = [x.page_content for x in results]
text_results
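
Conceptually, a similarity search embeds the query and ranks the stored chunks by how close their vectors are to it. The brute-force sketch below shows that idea with toy two-dimensional vectors. It assumes cosine similarity as the metric, which may differ from the distance type configured in the vector store, and it is not how BigQuery executes the search.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], docs: list[dict], k: int = 2) -> list[str]:
    """Return the text of the k documents most similar to the query vector."""
    ranked = sorted(
        docs,
        key=lambda doc: cosine_similarity(query_vector, doc["embedding"]),
        reverse=True,
    )
    return [doc["text"] for doc in ranked[:k]]


docs = [
    {"text": "doc A", "embedding": [0.9, 0.1]},
    {"text": "doc B", "embedding": [0.0, 1.0]},
    {"text": "doc C", "embedding": [0.5, 0.5]},
]
print(top_k([1.0, 0.0], docs, k=2))  # ['doc A', 'doc C']
```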

# 6. Answer Generation

Now we can put everything together and use an LLM to answer a question based on Stack Overflow data!

We are going to use LangChain's `create_retrieval_chain` and `create_stuff_documents_chain` helpers, which replace the legacy [`RetrievalQA` chain](https://python.langchain.com/docs/versions/migrating_chains/retrieval_qa/), to build a very simple RAG chain.


In [None]:
from langchain import hub
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_google_vertexai import ChatVertexAI

# Convert the BigQuery VectorStore to a LangChain retriever
langchain_retriever = bq_store.as_retriever()

# Initialize the Vertex AI chat model.
# It is named `chat_llm` to avoid shadowing the `bigframes.ml.llm` module imported earlier as `llm`.
chat_llm = ChatVertexAI(model_name="gemini-1.5-flash")

# See full prompt at https://smith.langchain.com/hub/langchain-ai/retrieval-qa-chat
retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

combine_docs_chain = create_stuff_documents_chain(chat_llm, retrieval_qa_chat_prompt)
rag_chain = create_retrieval_chain(langchain_retriever, combine_docs_chain)

print(rag_chain.invoke({"input": search_query})["answer"])
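
The "stuff" strategy used by the chain above places all retrieved documents into a single prompt alongside the question. The sketch below shows that idea in plain Python. The prompt wording is a hypothetical stand-in, not the template pulled from the LangChain hub, and `stuff_documents` is not a LangChain function.

```python
def stuff_documents(question: str, documents: list[str]) -> str:
    """Build one prompt that contains every retrieved document plus the question."""
    context = "\n\n".join(documents)  # All documents are concatenated, none are dropped
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = stuff_documents(
    question="how do I read a csv file with python?",
    documents=["Use the csv module.", "pandas.read_csv returns a DataFrame."],
)
print(prompt)
```

Since every retrieved chunk ends up in the prompt, the chunk size and the number of retrieved documents together determine how much of the model's context window is used.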

# 7. Scaling data processing to Terabytes: BigFrames Remote Functions

Sometimes the data is too large to process locally with custom Python functions.
In fact, every time we convert a Series or a DataFrame to pandas using `to_pandas()`, the data is loaded into memory.

To process large datasets remotely instead, we can define remote <b>user-defined functions (UDFs)</b>. Let's see an example of a BigFrames [remote function](https://cloud.google.com/bigquery/docs/samples/bigquery-dataframes-remote-function).

In [None]:
import json

import bigframes.bigquery as bbq
import bigframes.pandas
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=10,
    length_function=len,
)

# Create UDF for chunking
# Behind the scenes, BigFrames will automatically create a connection for you but you can also create a dedicated connection.
# See here: https://cloud.google.com/bigquery/docs/remote-functions#create_a_connection.


@bigframes.pandas.remote_function(packages=["langchain"], reuse=True)
def chunk_text_udf(text: str) -> str:
    return json.dumps(
        [chunk.page_content for chunk in text_splitter.create_documents([text])]
    )

We are going to read the final table we saved earlier to show how to perform chunking using a Python remote function.

The rest of the pandas custom functions could also be implemented in a similar way.

In [None]:
# Reading the table we saved earlier for demonstration purposes
final_cols = ["last_edit_date", "question_id", "question_text", "full_text_md"]

df_udf = bpd.read_gbq(
    f"{DESTINATION_DATASET_ID}.{DESTINATION_DEDUPED_QUESTIONS_TABLE_ID}",
    use_cache=False,
)[final_cols]

# Sort, deduplicate and reset index in one operation
df_udf = (
    df_udf.sort_values("last_edit_date", ascending=False)
    .drop_duplicates("question_id")
    .reset_index(drop=True)
)
df_udf.head()

We use the UDF to split each document into a list of text chunks.

Since BigFrames UDFs expect simple types as input and output, the UDF serializes the list of chunks to a JSON string before returning it.

In [27]:
df_udf["full_text_chunk"] = df_udf["full_text_md"].apply(chunk_text_udf)

As we can see, each value is now a string rather than a list.

In [None]:
first_row_text = df_udf["full_text_chunk"].iloc[0]
print(type(first_row_text))
first_row_text

To fix that, we can use the `json_extract_string_array` function from `bigframes.bigquery` to convert the JSON string back into a list of strings.

In [None]:
df_udf["full_text_chunk"] = bbq.json_extract_string_array(df_udf["full_text_chunk"])

first_row_text = df_udf["full_text_chunk"].iloc[0]
print(type(first_row_text))
first_row_text
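
The serialize-then-parse round trip above can be reproduced locally with the standard library. In this sketch, `json.loads` plays the role that `json_extract_string_array` plays in BigQuery, and the fixed-width `chunk_to_json` function is a hypothetical stand-in for the remote function.

```python
import json


def chunk_to_json(text: str, chunk_size: int = 5) -> str:
    """Split text into fixed-width chunks and return them as a JSON array string."""
    chunks = [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]
    return json.dumps(chunks)  # A single string is a type the UDF can return


serialized = chunk_to_json("hello world")
print(type(serialized).__name__, serialized)

restored = json.loads(serialized)  # Back to a list of strings
print(type(restored).__name__, restored)
```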

Here is what the DataFrame looks like after chunking.

In [None]:
df_udf.head(2)

# 8. Cleaning up

Run this cell to clean up the resources created in this notebook.

In [None]:
from google.cloud import bigquery_connection

# Remove BigQuery dataset and tables
dataset = f"{PROJECT_ID}.{DESTINATION_DATASET_ID}"
dataset_object = bigquery.Dataset(dataset)
bq_client.delete_dataset(dataset_object, delete_contents=True, not_found_ok=True)

# Remove the Cloud Function backing the BigFrames remote function
!gcloud functions delete $chunk_text_udf.bigframes_cloud_function --region=$LOCATION --quiet

# Remove BigQuery external connection
connection_client = bigquery_connection.ConnectionServiceClient()
connection_path = connection_client.connection_path(
    project=PROJECT_ID, location="us", connection="bigframes-default-connection"
)
connection_client.delete_connection(name=connection_path)

# Conclusion

This notebook showcased the power of **BigFrames** for building production-ready RAG pipelines on Google Cloud. We leveraged BigFrames' seamless integration with BigQuery and Vertex AI to efficiently process and embed large text datasets.

Key takeaways highlighting BigFrames' capabilities include:

*   **Scalable Data Processing:** BigFrames allowed us to manipulate BigQuery data using familiar pandas-like syntax, whether processing in memory or through scalable remote functions for terabyte-scale datasets.
*   **Simplified Embedding Generation:** BigFrames made it easy to generate embeddings with Vertex AI's embedding models directly within our data pipeline.
*   **Efficient Data Management:** We used BigFrames to manage our embeddings in BigQuery, implementing incremental updates and deduplication so that each run only processes recent data.