# Cloudflare Vectorize Walkthrough

This notebook demonstrates Cloudflare Vectorize's functionality via the LangChain Python package.

In [25]:
import json
import itertools
import asyncio
import warnings
from datetime import datetime
import pandas as pd
import os
from dotenv import load_dotenv

warnings.filterwarnings('ignore')

from langchain_community.embeddings.cloudflare_workersai import CloudflareWorkersAIEmbeddings
from langchain_community.document_loaders import WikipediaLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

from langchain_community.vectorstores.cloudflare_vectorize import CloudflareVectorize, VectorizeRecord



# Setup/Params

In [2]:
# name your vectorize index
vectorize_index_name = "test-langchain"

## Embeddings

To store embeddings and run semantic search and retrieval, you must first convert your raw values into embeddings.  Specify an embedding model from those available on WorkersAI:

[https://developers.cloudflare.com/workers-ai/models/](https://developers.cloudflare.com/workers-ai/models/)

In [3]:
MODEL_WORKERSAI = "@cf/baai/bge-large-en-v1.5"

## Raw Values

Vectorize only stores embeddings, metadata and namespaces. If you want to store and retrieve raw values, you must leverage Cloudflare's SQL Database D1.

You can create a database here and retrieve its id:

[https://dash.cloudflare.com/YOUR-ACCT-NUMBER/workers/d1](https://dash.cloudflare.com/YOUR-ACCT-NUMBER/workers/d1)

In [4]:
# provide the id of your D1 Database
d1_database_id = "8ce9ce08-8961-475c-98fb-1ef0e6e4ca40"

## API Tokens

This Python package is a wrapper around Cloudflare's REST API.  To interact with the API, you need to provide an API token with the appropriate privileges.

You can create and manage API tokens here:

https://dash.cloudflare.com/YOUR-ACCT-NUMBER/api-tokens

In [5]:
load_dotenv("/Users/collierking/Desktop/chartclass/langchain/libs/community/tests/integration_tests/vectorstores/.env");

**Note:**
CloudflareVectorize depends on WorkersAI, Vectorize (and D1 if you are using it to store and retrieve raw values).

While you can create a single `api_token` with Edit privileges to all needed resources (WorkersAI, Vectorize & D1), you may want to follow the principle of least privilege and create a separate API token for each service.


In [6]:
cf_acct_id = os.getenv("cf_acct_id")

# single token with WorkersAI, Vectorize & D1
api_token = os.getenv("api_token")

# separate tokens with access to each service
cf_ai_token = os.getenv("cf_ai_token")
cf_vectorize_token = os.getenv("cf_vectorize_token")
cf_d1_token = os.getenv("d1_api_token")
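
A missing variable in the `.env` file only shows up later as a `None` token. The sketch below checks for that up front. `missing_env_vars` is a hypothetical helper written for this walkthrough, not part of LangChain, and the demonstration uses a stand-in dictionary rather than the real environment.

```python
import os

# Hypothetical helper: list the expected variables that are unset or empty.
def missing_env_vars(names, environ=None):
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

# Variable names as read by os.getenv in the cell above
# (separate-token setup).
required = ["cf_acct_id", "cf_ai_token", "cf_vectorize_token", "d1_api_token"]

# Stand-in mapping used only for this demonstration.
example_env = {"cf_acct_id": "abc123", "cf_ai_token": "token"}
print(missing_env_vars(required, example_env))
```

Calling `missing_env_vars(required)` without a second argument checks the real process environment.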

# Documents

For this example, we will use LangChain's Wikipedia loader to pull an article about Cloudflare.  We will store this in Vectorize and query its contents later.

In [7]:
docs = WikipediaLoader(query="Cloudflare", load_max_docs=2).load()

We will then create some simple chunks with metadata based on the chunk sections.

In [8]:
text_splitter = \
    RecursiveCharacterTextSplitter(
        # Set a really small chunk size, just to show.
        chunk_size=100,
        chunk_overlap=20,
        length_function=len,
        is_separator_regex=False,
    )
texts = text_splitter.create_documents([docs[0].page_content])

running_section = ""
for text in texts:
    if text.page_content.startswith("="):
        # Heading chunk such as "== History ==": remember its title.
        running_section = text.page_content.replace("=", "").strip()
    else:
        # Body chunk: tag it with the current section, or "Introduction"
        # if no heading has been seen yet.
        text.metadata = {"section": running_section or "Introduction"}


These chunks look like this:


In [9]:
print(texts[0],"\n\n",texts[-1])

page_content='Cloudflare, Inc., is an American company that provides content delivery network services,' metadata={'section': 'Introduction'} 

 page_content='In 2014, Cloudflare began providing free DDoS mitigation for artists, activists, jour' metadata={'section': 'DDoS mitigation'}
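
To make the tagging rule easier to follow, here is a minimal sketch of the same logic on plain strings. `tag_sections` is a hypothetical helper and the sample chunks are shortened stand-ins. As in the loop above, heading chunks themselves receive no section.

```python
def tag_sections(chunks, default="Introduction"):
    """Return (chunk, section) pairs; heading chunks get section None."""
    tagged = []
    current = ""
    for chunk in chunks:
        if chunk.startswith("="):
            # Heading such as "== History ==": remember its title.
            current = chunk.replace("=", "").strip()
            tagged.append((chunk, None))
        else:
            # Body text inherits the most recent heading.
            tagged.append((chunk, current or default))
    return tagged

sample = [
    "Cloudflare, Inc., is an American company",
    "== History ==",
    "Text that belongs to the history section",
]
for chunk, section in tag_sections(sample):
    print(repr(section), "<-", chunk)
```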


# Embeddings

In this example, we will create some embeddings using an embeddings model from WorkersAI and the `CloudflareWorkersAIEmbeddings` class from LangChain.

This will instantiate that "embedder" for later use.


In [10]:
embedder = \
    CloudflareWorkersAIEmbeddings(
        account_id=cf_acct_id,
        api_token=cf_ai_token,
        model_name=MODEL_WORKERSAI
    )

# CloudflareVectorize Class

Now we can create the `CloudflareVectorize` instance.  Here we pass:

* The `embedding` instance from earlier
* The account ID
* A global API token for all services (WorkersAI, Vectorize, D1)
* Individual API tokens for each service

In [11]:
cfVect = \
    CloudflareVectorize(
        embedding=embedder,
        account_id=cf_acct_id,
        api_token=api_token, #(Optional if using service-specific token)
        ai_api_token=cf_ai_token,  #(Optional if using global token)
        d1_api_token=cf_d1_token,  #(Optional if using global token)
        vectorize_api_token=cf_vectorize_token,  #(Optional if using global token)
        d1_database_id=d1_database_id,  #(Optional if not using D1)
    )

**Note:** Service-specific tokens, if provided, take precedence over the global token.  You can also provide them instead of a global token.
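
The fallback rule in the note can be pictured with a small sketch. `pick_token` is a hypothetical helper used only for illustration. It is not the package's actual code.

```python
from typing import Optional

# Hypothetical helper: prefer the service-specific token, else the global one.
def pick_token(service_token: Optional[str], global_token: Optional[str]) -> Optional[str]:
    return service_token if service_token else global_token

print(pick_token("vectorize-only", "global"))  # service-specific token is used
print(pick_token(None, "global"))              # falls back to the global token
```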


## Gotchas

A few "gotchas" are shown below for various missing token/parameter combinations.

D1 Database ID provided but no "global" `api_token` and no `d1_api_token`

In [12]:
cfVect = \
    CloudflareVectorize(
        embedding=embedder,
        account_id=cf_acct_id,
        # api_token=api_token, #(Optional if using service-specific token)
        ai_api_token=cf_ai_token,  #(Optional if using global token)
        # d1_api_token=cf_d1_token,  #(Optional if using global token)
        vectorize_api_token=cf_vectorize_token,  #(Optional if using global token)
        d1_database_id=d1_database_id,  #(Optional if not using D1)
    )

ValueError: `d1_database_id` provided, but no global `api_token` provided and no `d1_api_token` provided.

No "global" `api_token` provided and either missing `ai_api_token` or `vectorize_api_token`

In [13]:
cfVect = \
    CloudflareVectorize(
        embedding=embedder,
        account_id=cf_acct_id,
        # api_token=api_token, #(Optional if using service-specific token)
        # ai_api_token=cf_ai_token,  #(Optional if using global token)
        d1_api_token=cf_d1_token,  #(Optional if using global token)
        vectorize_api_token=cf_vectorize_token,  #(Optional if using global token)
        d1_database_id=d1_database_id,  #(Optional if not using D1)
    )

ValueError: Not enough API token values provided.  Please provide a global `api_token` or all of `ai_api_token`,`vectorize_api_token`.
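
The two error cases above can be summarised as a pair of checks. This is a sketch under the assumption that validation depends only on which tokens are present. `check_tokens` and its shortened messages are hypothetical and do not reproduce the package's code.

```python
def check_tokens(api_token=None, ai_api_token=None, vectorize_api_token=None,
                 d1_api_token=None, d1_database_id=None):
    """Raise ValueError for the two token combinations shown above."""
    # Without a global token, both the WorkersAI and Vectorize tokens are needed.
    if not api_token and not (ai_api_token and vectorize_api_token):
        raise ValueError("Not enough API token values provided.")
    # A D1 database needs either the global token or a D1-specific token.
    if d1_database_id and not (api_token or d1_api_token):
        raise ValueError("`d1_database_id` provided, but no token covers D1.")

# The two failing combinations from the cells above, with placeholder values.
for kwargs in (
    {"ai_api_token": "ai", "vectorize_api_token": "vec", "d1_database_id": "db"},
    {"d1_api_token": "d1", "vectorize_api_token": "vec", "d1_database_id": "db"},
):
    try:
        check_tokens(**kwargs)
    except ValueError as err:
        print(err)
```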

# Creating an Index

Let's start this example by creating an index, first deleting it if it already exists.  If the index doesn't exist, we will get an error from Cloudflare telling us so.

In [17]:
try:
    cfVect.delete_index(index_name=vectorize_index_name)
except Exception as e:
    print(e)

410 Client Error: Gone for url: https://api.cloudflare.com/client/v4/accounts/7e5a6431075d52d65d279502b9980de3/vectorize/v2/indexes/test-langchain


In [18]:
r = cfVect.create_index(index_name=vectorize_index_name)

In [19]:
print(r)

{'created_on': '2025-03-08T17:56:12.776646Z', 'modified_on': '2025-03-08T17:56:12.776646Z', 'name': 'test-langchain', 'description': '', 'config': {'dimensions': 1024, 'metric': 'cosine'}}
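
The `config` above reports 1024 dimensions, which matches the output size of `@cf/baai/bge-large-en-v1.5`. A vector of any other length would not fit this index, so a quick local check can catch a model/index mismatch early. `check_dimensions` is a hypothetical helper, shown as a sketch only.

```python
def check_dimensions(vector, index_config):
    """Return True if the vector length matches the index configuration."""
    expected = index_config["dimensions"]
    if len(vector) != expected:
        raise ValueError(f"expected {expected} dimensions, got {len(vector)}")
    return True

# Config values copied from the create_index response above.
index_config = {"dimensions": 1024, "metric": "cosine"}
print(check_dimensions([0.0] * 1024, index_config))
```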


# Listing Indexes

Now we can list the indexes on our account.

In [21]:
indexes = cfVect.list_indexes()
indexes = [x for x in indexes if "test-langchain" in x.get("name")]
print(indexes)

[{'created_on': '2025-03-08T17:56:12.776646Z', 'modified_on': '2025-03-08T17:56:12.776646Z', 'name': 'test-langchain', 'description': '', 'config': {'dimensions': 1024, 'metric': 'cosine'}}, {'created_on': '2025-03-08T02:31:53.968678Z', 'modified_on': '2025-03-08T02:31:53.968678Z', 'name': 'test-langchain2', 'description': '', 'config': {'dimensions': 1024, 'metric': 'cosine'}}]


# Get Index
We can also retrieve a specific index by name to see its details.

In [61]:
r = cfVect.get_index(index_name=vectorize_index_name)
print(r)

{'created_on': '2025-03-08T17:56:12.776646Z', 'modified_on': '2025-03-08T17:56:12.776646Z', 'name': 'test-langchain', 'description': '', 'config': {'dimensions': 1024, 'metric': 'cosine'}}


The `get_index_info` call returns a `processedUpToMutation` value, which can be used to track the status of operations such as creating indexes and adding or deleting records.

In [62]:
r = cfVect.get_index_info(index_name=vectorize_index_name)
print(r)

{'dimensions': 1024, 'vectorCount': 110, 'processedUpToDatetime': '2025-03-08T18:05:53.568Z', 'processedUpToMutation': 'a5499994-a863-4d6a-b26b-77c4918612b5'}


# Adding Metadata Indexes

It is common to assist retrieval by supplying metadata filters in queries.  In Vectorize, this is accomplished by first creating a "metadata index" on your Vectorize Index.  We will do so for our example by creating one on the `section` field in our documents.

**Reference:** [https://developers.cloudflare.com/vectorize/reference/metadata-filtering/](https://developers.cloudflare.com/vectorize/reference/metadata-filtering/)


In [22]:
r = cfVect.create_metadata_index(
    property_name="section",
    index_type="string",
    index_name=vectorize_index_name,
)
print(r)

{'mutationId': '7dc8c166-67ad-4a95-95fc-411a92a374aa'}


# Adding Documents

Now we will add documents to our Vectorize Index.

**Note:**
Adding embeddings to Vectorize happens asynchronously, meaning there will be a small delay between adding the embeddings and being able to query them.  By default, `add_documents` has a `wait=True` parameter, which waits for this operation to complete before returning a response.  If you do not want the program to wait for the embeddings to become available, you can set `wait=False`.


In [24]:
r = cfVect.add_documents(
    index_name=vectorize_index_name,
    documents=texts
)

In [29]:
print(json.dumps(r)[:300])

{"result": {"mutationId": "a5499994-a863-4d6a-b26b-77c4918612b5"}, "success": true, "errors": [], "messages": [], "ids": ["a1a30a1a-3b93-47c9-b6c7-eb79889b8f51", "b0b08af8-db77-460c-903d-7cd4cedfac4e", "b310dc9c-c48b-44c6-9cc0-ae615990015a", "d7645bdc-2dc6-49c2-973a-b1af9b4fb3d2", "b923f84c-78e8-43a
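
When `wait=False` is used, the `mutationId` in this response can be compared against the `processedUpToMutation` value reported by `get_index_info`. The sketch below polls until the two match. `wait_for_mutation` is a hypothetical helper, and the equality test is a simplification, since the reported value moves on once later mutations are processed. A stand-in function replaces the real API call so the example runs offline.

```python
import time

def wait_for_mutation(get_info, mutation_id, timeout=30.0, interval=1.0,
                      sleep=time.sleep):
    """Poll get_info() until it reports mutation_id; False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        info = get_info()
        if info.get("processedUpToMutation") == mutation_id:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)

target = "a5499994-a863-4d6a-b26b-77c4918612b5"

# Stand-in for the index info call: reports the target on the third poll.
responses = iter([
    {"processedUpToMutation": "older-mutation"},
    {"processedUpToMutation": "older-mutation"},
    {"processedUpToMutation": target},
])
done = wait_for_mutation(lambda: next(responses), target, sleep=lambda s: None)
print(done)
```

Against the real index, `get_info` could be `lambda: cfVect.get_index_info(index_name=vectorize_index_name)` and the ID could come from `r["result"]["mutationId"]`.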


# Query/Search

We will do some searches on our embeddings.  We can specify our search `query` and the top number of results we want with `k`.


In [33]:
query_documents = \
    cfVect.similarity_search(
        index_name=vectorize_index_name,
        query="california",
        k=10
    )

In [34]:
print(query_documents[0])

page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'id': 'd7645bdc-2dc6-49c2-973a-b1af9b4fb3d2', 'score': 0.6114662}
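
The index was created with the `cosine` metric, so the `score` in the metadata reflects how closely the query embedding and the stored embedding point in the same direction. As a rough illustration, assuming the score is plain cosine similarity, it can be computed like this on toy vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```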


## Output

If you want to return metadata, pass `return_metadata='all'` or `return_metadata='indexed'`.  The default is `none`, which returns no metadata.

If you want to return the embedding values, pass `return_values=True`.  The default is `False`.

**Note:**
If you pass a non-default value for either of these, the number of results is capped at 20, regardless of `k`.

[https://developers.cloudflare.com/vectorize/platform/limits/](https://developers.cloudflare.com/vectorize/platform/limits/)

In [35]:
query_documents = \
    cfVect.similarity_search(
        index_name=vectorize_index_name,
        query="california",
        return_values=True,
        return_metadata='all',
        k=100
    )

In [37]:
print(len(query_documents))

20
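
The effect seen here, `k=100` yielding 20 results, can be expressed as a simple cap. The sketch below follows the rule as stated in this notebook. `effective_k` is a hypothetical helper, and the exact limits should be checked against the Cloudflare limits page linked above.

```python
def effective_k(k, return_values=False, return_metadata="none", cap=20):
    """Number of results to expect for a requested k (sketch of the cap)."""
    wants_payload = return_values or return_metadata != "none"
    return min(k, cap) if wants_payload else k

print(effective_k(100, return_values=True, return_metadata="all"))
print(effective_k(10))
```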


In [42]:
print(str(query_documents[0])[:300])

page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'id': 'd7645bdc-2dc6-49c2-973a-b1af9b4fb3d2', 'score': 0.6114662, 'metadata': {'section': 'Introduction'}, 'values': [-0.028919144, -0.019105384, -0.000850724, 0.012162158, 0.0185395


If you'd like the `scores` to be returned separately, you can use `similarity_search_with_score`


In [43]:
query_documents, query_scores = \
    cfVect.similarity_search_with_score(
        index_name=vectorize_index_name,
        query="california",
        k=100,
        return_metadata="all",
    )

In [46]:
print(query_documents[0])

page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'id': 'd7645bdc-2dc6-49c2-973a-b1af9b4fb3d2', 'score': 0.6114662, 'metadata': {'section': 'Introduction'}}


In [47]:
print(query_scores[0])

0.6114662


## Including D1
All of the add and search methods on `CloudflareVectorize` support an `include_d1` parameter (default `True`).

This controls whether raw values are stored in and retrieved from D1.

If you do not want to use D1 for this, you can set `include_d1=False`.  Searches will then return documents with an empty `page_content` field.

In [73]:
query_documents, query_scores = \
    cfVect.similarity_search_with_score(
        index_name=vectorize_index_name,
        query="california",
        k=100,
        return_metadata="all",
        include_d1=False
    )

In [75]:
query_documents[0]

Document(id='75f83c36-4a9f-47f5-88e6-bf76e71c7335', metadata={'id': '75f83c36-4a9f-47f5-88e6-bf76e71c7335', 'score': 0.6114662, 'metadata': {'section': 'Introduction'}}, page_content='')
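
Conceptually, Vectorize returns IDs and scores, and the raw text is looked up separately by ID. The sketch below illustrates that idea with Python's built-in `sqlite3` as a local stand-in for D1. The table name, schema, and `hydrate` helper are all hypothetical and do not describe how the package stores data.

```python
import sqlite3

# Local stand-in for a D1 table that maps record IDs to raw text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, text TEXT)")
conn.execute(
    "INSERT INTO docs VALUES (?, ?)",
    ("doc-1", "Cloudflare's headquarters are in San Francisco, California."),
)

# Shape of a vector match: an ID and a score, but no text.
matches = [{"id": "doc-1", "score": 0.61}]

def hydrate(matches, conn, include_text=True):
    """Attach raw text to each match, or leave page_content empty."""
    results = []
    for match in matches:
        text = ""
        if include_text:
            row = conn.execute(
                "SELECT text FROM docs WHERE id = ?", (match["id"],)
            ).fetchone()
            text = row[0] if row else ""
        results.append({**match, "page_content": text})
    return results

print(hydrate(matches, conn))
print(hydrate(matches, conn, include_text=False))
```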

## Searching with Metadata

As mentioned before, Vectorize supports filtered search on indexed metadata fields.  Here is an example where we search for `Introduction` values within the indexed `section` metadata field.

More info on searching on Metadata fields is here: [https://developers.cloudflare.com/vectorize/reference/metadata-filtering/](https://developers.cloudflare.com/vectorize/reference/metadata-filtering/)


In [48]:
query_documents = \
    cfVect.similarity_search(
        index_name=vectorize_index_name,
        query="california",
        k=100,
        filter={"section": "Introduction"},
        return_metadata="all",
        return_values=True
    )

In [51]:
print(str(query_documents[0])[:300])

page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'id': 'd7645bdc-2dc6-49c2-973a-b1af9b4fb3d2', 'score': 0.6114662, 'metadata': {'section': 'Introduction'}, 'values': [-0.028919144, -0.019105384, -0.000850724, 0.012162158, 0.0185395
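
Vectorize applies the filter on the server. To show what a filter such as `{"section": "Introduction"}` selects, here is a local sketch that mimics equality and the `$eq`/`$ne` operators described in the linked reference. `matches_filter` is a hypothetical helper and covers only these cases.

```python
def matches_filter(metadata, flt):
    """Return True if metadata satisfies every condition in flt."""
    for field, condition in flt.items():
        value = metadata.get(field)
        if isinstance(condition, dict):
            # Explicit operator form, e.g. {"section": {"$ne": "History"}}.
            if "$eq" in condition and value != condition["$eq"]:
                return False
            if "$ne" in condition and value == condition["$ne"]:
                return False
        elif value != condition:
            # Bare value means implicit equality.
            return False
    return True

records = [
    {"section": "Introduction"},
    {"section": "History"},
]
print([r for r in records if matches_filter(r, {"section": "Introduction"})])
print([r for r in records if matches_filter(r, {"section": {"$ne": "Introduction"}})])
```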


## Search by IDs
We can also retrieve specific records by their IDs.

In [58]:
sample_ids = [x.id for x in query_documents][:3]

query_documents = \
    cfVect.get_by_ids(
        index_name=vectorize_index_name,
        ids=sample_ids
    )

In [59]:
print(len(query_documents))

3


# Deleting Records
We can also delete records by their IDs.


In [63]:
r = cfVect.delete(
        index_name=vectorize_index_name,
        ids=sample_ids
    )

In [64]:
print(r)

{'result': {'mutationId': '7d204ebe-ced7-4227-9e20-06baa8b7eaa9'}, 'result_info': None, 'success': True, 'errors': [], 'messages': [], 'ids': ['d7645bdc-2dc6-49c2-973a-b1af9b4fb3d2', '6e1a432d-7d7c-46a3-875b-e9a8b95fbae4', 'bda51df6-9be5-46b9-bf84-f65a71e33ffa']}


To confirm the deletion:

In [66]:
query_documents = \
    cfVect.get_by_ids(
        index_name=vectorize_index_name,
        ids=sample_ids
    )
print(len(query_documents))

0


# Creating from Documents
LangChain stipulates that all vectorstores must have a `from_documents` method to instantiate a new vectorstore from documents.  This is more streamlined than the separate `create_index` and `add_documents` steps shown above.

You can do that as shown here:

In [67]:
vectorize_index_name = "test-langchain-from-docs"

In [70]:
cfVect = \
    CloudflareVectorize.from_documents(
        account_id=cf_acct_id,
        index_name=vectorize_index_name,
        documents=texts,
        embedding=embedder,
        # api_token=cf_vectorize_token,
        d1_database_id=d1_database_id,
        ai_api_token=cf_ai_token,
        d1_api_token=cf_d1_token,
        vectorize_api_token=cf_vectorize_token
    )

In [71]:
#query for documents
query_documents = \
    cfVect.similarity_search(
        index_name=vectorize_index_name,
        query="california",
        k=10
    )

print(query_documents[0])


page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'id': '75f83c36-4a9f-47f5-88e6-bf76e71c7335', 'score': 0.6114662}
