In [1]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Google Spanner
> [Spanner](https://cloud.google.com/spanner) is a highly scalable database that combines unlimited scalability with relational semantics, such as secondary indexes, strong consistency, schemas, and SQL, providing 99.999% availability in one easy solution.

This notebook goes over how to chunk documents when using `Spanner` for vector search. We'll use the `SpannerVectorStore` class from Spanner's LangChain integration library.

Learn more about Spanner's integration with LangChain by visiting the [GitHub repo](https://github.com/googleapis/langchain-google-spanner-python/).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GoogleCloudPlatform/spanner-vector-hybrid-search-samples/blob/main/chunking/chunking-basics.ipynb)

## Before You Begin

To run this notebook, you will need to do the following:

 * [Create a Google Cloud Project](https://developers.google.com/workspace/guides/create-project)
 * [Enable the Cloud Spanner API](https://console.cloud.google.com/flows/enableapi?apiid=spanner.googleapis.com)
 * [Create a Spanner instance](https://cloud.google.com/spanner/docs/create-manage-instances)
 * [Create a Spanner database](https://cloud.google.com/spanner/docs/create-manage-databases)

### 🦜🔗 Install dependencies
Let's first install the LangChain and Vertex AI libraries.

In [7]:
%pip install --upgrade --quiet langchain-community langchain-text-splitters langchain-google-spanner langchain-google-vertexai

Note: you may need to restart the kernel to use updated packages.


**Colab only:** Uncomment the following cell to restart the kernel or use the button to restart the kernel. For Vertex AI Workbench you can restart the terminal using the button on top.

In [None]:
# # Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

### 🔐 Authentication
Authenticate to Google Cloud as the IAM user logged into this notebook in order to access your Google Cloud Project.

* If you are using Colab to run this notebook, use the cell below and continue.
* If you are using Vertex AI Workbench, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env).

In [None]:
from google.colab import auth

auth.authenticate_user()

### ☁ Set Your Google Cloud Project
Set your Google Cloud project so that you can leverage Google Cloud resources within this notebook.

If you don't know your project ID, try the following:

* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113).

In [2]:
# @markdown Please fill in the value below with your Google Cloud project ID and then run the cell.

PROJECT_ID = "kt-shared-project"  # @param {type:"string"}

# Set the project id
!gcloud config set project {PROJECT_ID}
%env GOOGLE_CLOUD_PROJECT={PROJECT_ID}

Updated property [core/project].
env: GOOGLE_CLOUD_PROJECT=kt-shared-project


### 💡 API Enablement
The `langchain-google-spanner` package requires that you [enable the Spanner API](https://console.cloud.google.com/flows/enableapi?apiid=spanner.googleapis.com) in your Google Cloud Project.

In [None]:
# enable Spanner API
!gcloud services enable spanner.googleapis.com

### Set Spanner database values
Find your database values in the [Spanner Instances page](https://console.cloud.google.com/spanner?_ga=2.223735448.2062268965.1707700487-2088871159.1707257687).

In [84]:
# @title Set Your Values Here { display-mode: "form" }
INSTANCE = "tempus-test2"  # @param {type: "string"}
DATABASE = "fts-test2"  # @param {type: "string"}
TABLE_NAME = "vectors_search_data"  # @param {type: "string"}

### Initialize a table
The `SpannerVectorStore` class instance requires a database table with id, content, and embeddings columns.

The helper method `init_vector_store_table()` can be used to create a table with the proper schema for you.

In [111]:
import langchain_google_spanner

from langchain_google_spanner import SecondaryIndex, SpannerVectorStore, TableColumn

SpannerVectorStore.init_vector_store_table(
    instance_id=INSTANCE,
    database_id=DATABASE,
    table_name=TABLE_NAME,
    # Customize the table creation
    id_column="id",
    # content_column="content_column",
    # metadata_columns=[
    #     TableColumn(name="metadata", type="JSON", is_null=True),
    #     TableColumn(name="title", type="STRING(MAX)", is_null=False),
    # ],
    # secondary_indexes=[
    #     SecondaryIndex(index_name="row_id_and_title", columns=["row_id", "title"])
    # ],
)

Waiting for operation to complete...


True

### Create an embedding class instance

You can use any [LangChain embeddings model](https://python.langchain.com/docs/integrations/text_embedding/).
You may need to enable the Vertex AI API to use `VertexAIEmbeddings`. We recommend pinning the embedding model's version for production. Learn more about the [Text embeddings models](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text-embeddings) and [Model versions](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings-api#model_versions).

In [None]:
# enable Vertex AI API
!gcloud services enable aiplatform.googleapis.com

In [112]:
from langchain_google_vertexai import VertexAIEmbeddings

# Make sure you update the model version below to reflect the latest production version
embeddings = VertexAIEmbeddings(
    model_name="text-embedding-005", project=PROJECT_ID
)

### SpannerVectorStore

To initialize the `SpannerVectorStore` class, you need to provide 4 required arguments. The other arguments are optional and only need to be passed if they differ from the defaults.

1. `instance_id` - The name of the Spanner instance
1. `database_id` - The name of the Spanner database
1. `table_name` - The name of the table within the database to store the documents & their embeddings.
1. `embedding_service` - The Embeddings implementation which is used to generate the embeddings.

In [113]:
db = SpannerVectorStore(
    instance_id=INSTANCE,
    database_id=DATABASE,
    table_name=TABLE_NAME,
    embedding_service=embeddings,
    # Connect to a custom vector store table
    id_column="id",
    # content_column="content",
    # metadata_columns=["metadata", "title"],
)

### Chunking overview
With all of that setup out of the way, let's talk about chunking (aka text splitting). In order to index documents in a vector store like Spanner, it's necessary to first partition or chunk the document into smaller pieces and then send those pieces to the data store to be indexed.

Why is it "necessary" to split documents before indexing them? At a high level, it's because documents (even small ones) are made up of smaller "fragments" such as sentences, concepts, or words, and a search query usually relates to only a few of those fragments rather than to the whole document. There are a variety of approaches for splitting documents, and LangChain offers multiple options as described [here](https://python.langchain.com/docs/concepts/text_splitters/#text-structured-based).

As explained in the above LangChain article on text splitters, there are roughly four broad approaches for chunking:

- Length based
- Text-structure based
- Document-structure based
- Semantic meaning based
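
To make the first approach concrete, here is a minimal pure-Python sketch of length-based chunking with overlap. The function name `chunk_by_length` and the sample string are made up for illustration; this is not how LangChain implements its splitters, only the underlying idea of sliding a fixed-size window over the text.

```python
def chunk_by_length(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    (Hypothetical helper for illustration only.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "Spanner is fault tolerant by design."
for chunk in chunk_by_length(sample, chunk_size=16, chunk_overlap=4):
    print(repr(chunk))
```

A window like this can cut through the middle of a word or sentence, which is one reason the structure-based and semantic approaches exist.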

#### Chunking with CharacterTextSplitter followed by indexing on Spanner

In [114]:
import uuid
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader("vector_doc_input.txt")
file_contents = loader.load()

# CharacterTextSplitter is just one of many text splitters
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

# Generate chunks (list of LangChain Documents)
documents = text_splitter.split_documents(file_contents)

ids = [str(uuid.uuid4()) for _ in range(len(documents))]
# The following indexes the above chunks in Spanner
vectorstore = SpannerVectorStore.from_documents(documents,
                                                embeddings,
                                                INSTANCE,
                                                DATABASE,
                                                TABLE_NAME,
                                                id_column="id",
                                                ids=ids)

The above code chunks the document and indexes the chunks in the specified Spanner table. Let's query the underlying Spanner table (`TABLE_NAME`) directly:

In [115]:
import pandas as pd
from google.cloud import spanner

spanner_db = spanner.Client(project=PROJECT_ID).instance(INSTANCE).database(DATABASE)

result_df = pd.DataFrame()

with spanner_db.snapshot() as snapshot:
    results = snapshot.execute_sql(f"SELECT * FROM {TABLE_NAME} LIMIT 10;")

    rows = []
    for row in results:
        rows.append(row)
    
    # Get column names
    cols = [x.name for x in results.fields]

    # Convert to pandas dataframe
    result_df = pd.DataFrame(rows, columns = cols)

display(result_df)

Unnamed: 0,id,content,embedding
0,0e662025-f37e-46cc-ac15-e60c31b4311d,Since both correctness and availability are cr...,"[-0.024999909102916718, -0.001801451202481985,..."
1,2f41a409-3f48-47fd-9f44-a6c58872d073,Upping reliability with chaos testing\nWe run ...,"[-0.04426778107881546, 0.0026189256459474564, ..."
2,31936bef-c7d2-49a8-a804-191c62280056,4. Memory/quota faults\nWhen servers run low o...,"[-0.037452224642038345, -0.007777196355164051,..."
3,319c15f6-b562-444f-be6d-e7d86dae5e4f,Blackhole the request: Sometimes the file syst...,"[-0.0427422858774662, -0.0013925273669883609, ..."
4,3eed6c98-3944-402f-bed2-46500d4ce700,5. Cloud faults\nAccess to Spanner from the Go...,"[-0.04665933921933174, 0.001961533911526203, 0..."
5,4da4365f-c09e-418a-8ebe-9838dcb0b57e,A read or query on the database does not retur...,"[-0.018720634281635284, -0.020843015983700752,..."
6,90036355-e996-4982-b42c-b89eb4a95dda,SOURCE: https://cloud.google.com/blog/products...,"[-0.05042260140180588, 0.007137879263609648, 0..."
7,90b8c7e1-2fdf-4613-abf0-03372d0e5e7f,"For example, through chaos testing, we found a...","[-0.029230546206235886, -0.012972258031368256,..."
8,a5e22f63-87fd-497b-ac9c-3a5ead36d15c,"Errors, either transient or permanent, can be ...","[-0.019595345482230186, -0.019145982339978218,..."
9,ce2bf89f-4b28-40b7-974b-a1995cc4dce1,The restart logic is quite complex and we even...,"[-0.028140805661678314, -0.019556596875190735,..."


Let's now do a similarity search on the indexed data via LangChain and display the results:

In [133]:
results = vectorstore.similarity_search(query="resilience", k=3)
print('Num results: ' + str(len(results)))
print()

search_rows = [x.page_content for x in results]
cols = ['page_content']

# The following ensures that the full chunked text fragment is displayed without truncation
pd.set_option('display.max_colwidth', None)

search_df = pd.DataFrame(search_rows, columns = cols)
display(search_df)

Num results: 3



Unnamed: 0,page_content
0,Spanner earns its reputation for reliability\nSpanner is fault tolerant by design. We continuously validate Spanner’s reliability by running many large-scale randomized system tests that employ chaos testing.\n\nYou can learn more about what makes Spanner unique and how it’s being used today. Or try it yourself for free for 90-days or for as little as $65 USD/month for a production-ready instance that grows with your business without downtime or disruptive re-architecture.
1,"5. Cloud faults\nAccess to Spanner from the Google Cloud Platform is mediated by Spanner API Front End Servers, which proxy requests coming into Google Cloud through Google front ends to a Spanner database. External clients open sessions with the Spanner database and execute transactions on these sessions. For Spanner, we crash the Spanner API frontend servers, which forces sessions to migrate to other Spanner API frontend servers. This should not be visible to the client (besides some additional latency).\n\n6. Regional outages\nThe largest faults we simulate in system tests are outages of an entire region, forcing Spanner to serve data from a quorum of other regions. The majority of our system tests simulate several kinds of regional outages, triggered either by file system or network outages, and we verify Spanner continues to serve. This resilience is a property of the Paxos algorithm, which guarantees progress as long as a quorum (2 of 3, or 3 of 5) of replicas remain healthy."
2,"A fault-tolerant design foundation\nSpanner is built from “mostly reliable” components including machines, disks, and networking hardware that have a low rate of failure. Even so, bad things happen: bad memory and disks may lead to data corruption; file accesses may yield transient or permanent errors or corruption; or network connectivity within or between data centers may be throttled or lost altogether. Worst of all, software bugs sometimes produce correlated failures in all servers running the same version of the code."
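
Conceptually, a similarity search embeds the query and ranks the stored chunks by how close their embeddings are to the query embedding. The sketch below illustrates that ranking with cosine similarity over toy 3-dimensional vectors. The helper names, the vectors, and the choice of cosine similarity are assumptions for illustration; the actual distance metric and query execution are handled by Spanner and the library and are not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k):
    """Return the contents of the k rows most similar to query_vec.

    rows is a list of (content, embedding) pairs (toy stand-in for a table).
    """
    scored = [(cosine_similarity(query_vec, emb), content) for content, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [content for _, content in scored[:k]]


# Toy data: real embeddings have hundreds of dimensions.
toy_rows = [
    ("chunk about fault tolerance", [0.9, 0.1, 0.0]),
    ("chunk about pricing", [0.0, 0.2, 0.9]),
    ("chunk about regional outages", [0.7, 0.6, 0.1]),
]
toy_query = [1.0, 0.2, 0.0]

print(top_k(toy_query, toy_rows, k=2))
```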


Let's use another chunking approach and compare results.

In [135]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# A second table to index using second chunking approach
TABLE_NAME_RS = "vectors_search_data_rs" # rs for recursive splitter

SpannerVectorStore.init_vector_store_table(
    instance_id=INSTANCE,
    database_id=DATABASE,
    table_name=TABLE_NAME_RS,
    # Customize the table creation
    id_column="id",
    # content_column="content_column",
    # metadata_columns=[
    #     TableColumn(name="metadata", type="JSON", is_null=True),
    #     TableColumn(name="title", type="STRING(MAX)", is_null=False),
    # ],
    # secondary_indexes=[
    #     SecondaryIndex(index_name="row_id_and_title", columns=["row_id", "title"])
    # ],
)

# We'll use RecursiveCharacterTextSplitter this time and a different chunk size (just to demonstrate)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=30, length_function=len, is_separator_regex=False)

# Generate chunks (list of LangChain Documents)
documents = text_splitter.split_documents(file_contents)

ids = [str(uuid.uuid4()) for _ in range(len(documents))]

# The following indexes the above chunks in Spanner
vectorstore_rs = SpannerVectorStore.from_documents(documents,
                                                embeddings,
                                                INSTANCE,
                                                DATABASE,
                                                TABLE_NAME_RS,
                                                id_column="id",
                                                ids=ids)

Waiting for operation to complete...
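
To give a feel for what a recursive, text-structure-based splitter does, here is a simplified pure-Python sketch: it tries the coarsest separator first (paragraph breaks) and only falls back to finer separators (lines, words, raw characters) for pieces that are still too long. The function `recursive_split` is hypothetical and deliberately omits chunk overlap and other details of LangChain's `RecursiveCharacterTextSplitter`.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitting (illustration only, no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample_text = (
    "Para one is short.\n\n"
    "Para two is a bit longer than the limit allows.\n\n"
    "End."
)
for chunk in recursive_split(sample_text, chunk_size=30):
    print(repr(chunk))
```

In this sketch the short paragraphs stay intact, and only the over-long paragraph is broken up at word boundaries.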


Let's do a search using these newly indexed documents.

In [136]:
results_rs = vectorstore_rs.similarity_search(query="resilience", k=3)
print('Num results: ' + str(len(results_rs)))
print()

search_rows_rs = [x.page_content for x in results_rs]

# The following ensures that the full chunked text fragment is displayed without truncation
pd.set_option('display.max_colwidth', None)

search_rs_df = pd.DataFrame(search_rows_rs, columns = cols)
display(search_rs_df)

Num results: 3



Unnamed: 0,page_content
0,"5. Cloud faults\nAccess to Spanner from the Google Cloud Platform is mediated by Spanner API Front End Servers, which proxy requests coming into Google Cloud through Google front ends to a Spanner database. External clients open sessions with the Spanner database and execute transactions on these sessions. For Spanner, we crash the Spanner API frontend servers, which forces sessions to migrate to other Spanner API frontend servers. This should not be visible to the client (besides some additional latency).\n\n6. Regional outages\nThe largest faults we simulate in system tests are outages of an entire region, forcing Spanner to serve data from a quorum of other regions. The majority of our system tests simulate several kinds of regional outages, triggered either by file system or network outages, and we verify Spanner continues to serve. This resilience is a property of the Paxos algorithm, which guarantees progress as long as a quorum (2 of 3, or 3 of 5) of replicas remain healthy.\n\nSpanner earns its reputation for reliability\nSpanner is fault tolerant by design. We continuously validate Spanner’s reliability by running many large-scale randomized system tests that employ chaos testing.\n\nYou can learn more about what makes Spanner unique and how it’s being used today. Or try it yourself for free for 90-days or for as little as $65 USD/month for a production-ready instance that grows with your business without downtime or disruptive re-architecture."
1,"SOURCE: https://cloud.google.com/blog/products/databases/chaos-testing-spanner-improves-reiliability\n\nOne of the secrets behind Spanner’s reliability is the team’s extensive use of chaos testing, the process of deliberately injecting faults into production-like instances of the database. Although engineers focus on testing the “happy path,” most software bugs occur when things go wrong. Given Spanner’s complex architecture and constantly evolving codebase, it is inevitable that bugs will be introduced. Here, we give an overview of the types of chaos testing we employ and the kinds of bugs it finds.\n\nA fault-tolerant design foundation\nSpanner is built from “mostly reliable” components including machines, disks, and networking hardware that have a low rate of failure. Even so, bad things happen: bad memory and disks may lead to data corruption; file accesses may yield transient or permanent errors or corruption; or network connectivity within or between data centers may be throttled or lost altogether. Worst of all, software bugs sometimes produce correlated failures in all servers running the same version of the code."
2,"Since both correctness and availability are critical, Spanner uses principles of fault-tolerant design to mask failures of these components and achieve high reliability for the service. For example, checksums are used to detect data corruption at many levels. Spanner tablets, which store a fragment of the database, are replicated across three or (usually) more data centers and the reads and writes use Paxos to achieve consensus and consistency of the distributed state. Checksums are also used to detect corruption of a tablet replica. The data for these tablets is stored in files, and the file system keeps multiple copies of the data blocks within the data center, using checksums to detect corrupted blocks. Finally, we proceed cautiously when rolling out new software versions, alerting on any anomalies that may be caused by a new bug.\n\nUpping reliability with chaos testing\nWe run over a thousand system tests per week to validate that Spanner’s design and implementation actually mask faults and provide a highly reliable service. Each test creates a production-like instance of Spanner comprising hundreds of processes running on the same computing platform and using the same dependent systems (e.g., file system, lock service) as production Spanner. Most tests run for between one and 24 hours and execute tens or hundreds of thousands of transactions."


In [141]:
# Display both result sets one after the other for comparison

display(search_df)
display(search_rs_df)

Unnamed: 0,page_content
0,Spanner earns its reputation for reliability\nSpanner is fault tolerant by design. We continuously validate Spanner’s reliability by running many large-scale randomized system tests that employ chaos testing.\n\nYou can learn more about what makes Spanner unique and how it’s being used today. Or try it yourself for free for 90-days or for as little as $65 USD/month for a production-ready instance that grows with your business without downtime or disruptive re-architecture.
1,"5. Cloud faults\nAccess to Spanner from the Google Cloud Platform is mediated by Spanner API Front End Servers, which proxy requests coming into Google Cloud through Google front ends to a Spanner database. External clients open sessions with the Spanner database and execute transactions on these sessions. For Spanner, we crash the Spanner API frontend servers, which forces sessions to migrate to other Spanner API frontend servers. This should not be visible to the client (besides some additional latency).\n\n6. Regional outages\nThe largest faults we simulate in system tests are outages of an entire region, forcing Spanner to serve data from a quorum of other regions. The majority of our system tests simulate several kinds of regional outages, triggered either by file system or network outages, and we verify Spanner continues to serve. This resilience is a property of the Paxos algorithm, which guarantees progress as long as a quorum (2 of 3, or 3 of 5) of replicas remain healthy."
2,"A fault-tolerant design foundation\nSpanner is built from “mostly reliable” components including machines, disks, and networking hardware that have a low rate of failure. Even so, bad things happen: bad memory and disks may lead to data corruption; file accesses may yield transient or permanent errors or corruption; or network connectivity within or between data centers may be throttled or lost altogether. Worst of all, software bugs sometimes produce correlated failures in all servers running the same version of the code."


Unnamed: 0,page_content
0,"5. Cloud faults\nAccess to Spanner from the Google Cloud Platform is mediated by Spanner API Front End Servers, which proxy requests coming into Google Cloud through Google front ends to a Spanner database. External clients open sessions with the Spanner database and execute transactions on these sessions. For Spanner, we crash the Spanner API frontend servers, which forces sessions to migrate to other Spanner API frontend servers. This should not be visible to the client (besides some additional latency).\n\n6. Regional outages\nThe largest faults we simulate in system tests are outages of an entire region, forcing Spanner to serve data from a quorum of other regions. The majority of our system tests simulate several kinds of regional outages, triggered either by file system or network outages, and we verify Spanner continues to serve. This resilience is a property of the Paxos algorithm, which guarantees progress as long as a quorum (2 of 3, or 3 of 5) of replicas remain healthy.\n\nSpanner earns its reputation for reliability\nSpanner is fault tolerant by design. We continuously validate Spanner’s reliability by running many large-scale randomized system tests that employ chaos testing.\n\nYou can learn more about what makes Spanner unique and how it’s being used today. Or try it yourself for free for 90-days or for as little as $65 USD/month for a production-ready instance that grows with your business without downtime or disruptive re-architecture."
1,"SOURCE: https://cloud.google.com/blog/products/databases/chaos-testing-spanner-improves-reiliability\n\nOne of the secrets behind Spanner’s reliability is the team’s extensive use of chaos testing, the process of deliberately injecting faults into production-like instances of the database. Although engineers focus on testing the “happy path,” most software bugs occur when things go wrong. Given Spanner’s complex architecture and constantly evolving codebase, it is inevitable that bugs will be introduced. Here, we give an overview of the types of chaos testing we employ and the kinds of bugs it finds.\n\nA fault-tolerant design foundation\nSpanner is built from “mostly reliable” components including machines, disks, and networking hardware that have a low rate of failure. Even so, bad things happen: bad memory and disks may lead to data corruption; file accesses may yield transient or permanent errors or corruption; or network connectivity within or between data centers may be throttled or lost altogether. Worst of all, software bugs sometimes produce correlated failures in all servers running the same version of the code."
2,"Since both correctness and availability are critical, Spanner uses principles of fault-tolerant design to mask failures of these components and achieve high reliability for the service. For example, checksums are used to detect data corruption at many levels. Spanner tablets, which store a fragment of the database, are replicated across three or (usually) more data centers and the reads and writes use Paxos to achieve consensus and consistency of the distributed state. Checksums are also used to detect corruption of a tablet replica. The data for these tablets is stored in files, and the file system keeps multiple copies of the data blocks within the data center, using checksums to detect corrupted blocks. Finally, we proceed cautiously when rolling out new software versions, alerting on any anomalies that may be caused by a new bug.\n\nUpping reliability with chaos testing\nWe run over a thousand system tests per week to validate that Spanner’s design and implementation actually mask faults and provide a highly reliable service. Each test creates a production-like instance of Spanner comprising hundreds of processes running on the same computing platform and using the same dependent systems (e.g., file system, lock service) as production Spanner. Most tests run for between one and 24 hours and execute tens or hundreds of thousands of transactions."
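
In the output above, each of the three `CharacterTextSplitter` results appears inside one of the larger `RecursiveCharacterTextSplitter` results. One way to check this kind of relationship programmatically is sketched below. The helper `containment_map` and the toy strings are hypothetical and only stand in for the real result lists.

```python
def containment_map(small_chunks, large_chunks):
    """For each small chunk, return the index of the first large chunk
    that contains it as a substring, or None if no large chunk does."""
    mapping = []
    for small in small_chunks:
        match = next(
            (i for i, large in enumerate(large_chunks) if small in large),
            None,
        )
        mapping.append(match)
    return mapping


# Toy stand-ins for the two lists of page_content strings.
toy_small = ["reliability", "cloud faults", "design foundation"]
toy_large = [
    "cloud faults ... regional outages ... reliability",
    "source ... design foundation",
    "checksums ... chaos testing",
]

print(containment_map(toy_small, toy_large))
```

In this notebook, the same check could be run as `containment_map(search_rows, search_rows_rs)`.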


## Cleanup

To ensure that you don't continue to get billed for the resources you provisioned, just go into the [Cloud Spanner section](https://console.cloud.google.com/spanner/instances/) of the Cloud Console and delete the instance you created.