<a href="https://colab.research.google.com/github/RERobbins/data_science_266_sandbox/blob/main/2_Vector_Databases.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Introduction to Vector Databases

A vector database, also known as a vector search engine or similarity search database, is a type of database that specializes in storing and retrieving high-dimensional vectors efficiently.  The embeddings we have been using are high-dimensional vectors.

In question answering, vector databases are particularly useful for semantic search, where you want to find documents or data points that are semantically similar to a given query.

Traditional relational databases are not well-suited for efficiently querying and retrieving semantically similar data. Vector databases, on the other hand, are designed to handle similarity-based searches efficiently.
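
To make "similarity-based search" concrete, here is a minimal sketch of the underlying idea: score every stored vector against the query vector and sort by score. The three-dimensional vectors and their labels are made up for illustration, and the exhaustive scan shown here is only a stand-in; a vector database typically relies on specialized indexes so that it does not have to compare the query against every stored vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, stored_vectors):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [
        (label, cosine_similarity(query_vector, vector))
        for label, vector in stored_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional "embeddings"; real embeddings have hundreds of dimensions.
toy_vectors = {
    "cookie policy": [0.9, 0.1, 0.0],
    "data sales": [0.1, 0.9, 0.1],
    "account deletion": [0.0, 0.2, 0.9],
}
toy_query = [0.2, 0.8, 0.0]

ranking = rank_by_similarity(toy_query, toy_vectors)
print([label for label, _ in ranking])
```

The query vector points mostly along the second dimension, so the "data sales" vector ranks first.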

Vector databases are an essential component of modern natural language processing solutions that are built to apply the generative capabilities of large language models to data collections.  This approach is called retrieval augmented generation or "RAG".

RAG is used in tasks like question answering.  With RAG, a retrieval component first selects a set of relevant documents or passages from a larger corpus, and then a generation component generates the final response based on the selected information. This approach aims to combine the accuracy of retrieval with the flexibility of generation.
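
The two RAG stages can be sketched in a few lines. This is a toy illustration rather than the pipeline used in this sequence: the hypothetical `retrieve` function ranks passages by word overlap as a stand-in for vector search, and `build_prompt` only assembles the text that a generative model would receive. No model is called.

```python
def retrieve(question, passages, k=2):
    """Rank passages by how many question words they share (a stand-in for vector search)."""
    question_words = set(question.lower().split())
    ranked = sorted(
        passages,
        key=lambda passage: len(question_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_passages):
    """Assemble the prompt that the generation component would receive."""
    context = "\n".join(f"- {passage}" for passage in context_passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up passages for illustration.
toy_passages = [
    "we do not sell your personal data",
    "cookies are small text files placed on your device",
    "you can delete your account at any time",
]
toy_question = "do you sell my personal data"

selected = retrieve(toy_question, toy_passages, k=1)
prompt = build_prompt(toy_question, selected)
print(prompt)
```

In a real system, the retrieval step would query a vector database and the prompt would be sent to a large language model.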

This notebook builds on our work with embeddings in the prior notebook by introducing vector databases.  The next notebook in this sequence covers question answering using RAG.

We will use Qdrant, a vector database, and explore some of the most important concepts.  For this notebook, we use an ephemeral vector database by default, but we also show how you could use a persistent vector database instead.  For anything beyond toy examples, we would use a persistent database.

Working with other vector databases is easy.  If you want to explore further, popular alternatives to consider include [Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma), [Facebook AI Similarity Search (FAISS)](https://python.langchain.com/docs/integrations/vectorstores/faiss), [Pinecone](https://python.langchain.com/docs/integrations/vectorstores/pinecone),  and [Weaviate](https://python.langchain.com/docs/integrations/vectorstores/weaviate).  A more comprehensive set supported by LangChain is set out [here](https://python.langchain.com/docs/integrations/vectorstores/).

# Vector Database Embeddings

The choice of a generative large language model can be decoupled from the selection of the embedding model used in an accompanying vector database.  The generative models we use from OpenAI and Cohere take a string as input and not an embedding.  When performing similarity search, you should turn the query into an embedding with the same model that generated the stored embeddings.  What you need back from the vector database is the text that the matching embeddings represent, not the vectors themselves.  LangChain returns that text as strings inside LangChain document objects.

Up to this point we have experimented with several different embedding models.  For the remaining exercises, we will use a different embedding model.  We will use `multi-qa-mpnet-base-cos-v1` from the SentenceTransformer collection.  It is based on `microsoft/mpnet-base` and has a maximum token length of `512`.  The embeddings are normalized and cosine-similarity is an appropriate choice for a distance function.  
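
Normalization matters for the choice of distance function because, for unit-length vectors, the dot product and the cosine similarity are the same number. The sketch below checks this with two made-up two-dimensional vectors; the helper names are illustrative and not part of any library.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(vector):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(vector, vector))
    return [x / length for x in vector]


def cosine_similarity(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


raw_a = [3.0, 4.0]
raw_b = [4.0, 3.0]
unit_a, unit_b = normalize(raw_a), normalize(raw_b)

# For unit-length vectors the dot product and the cosine similarity coincide.
print(math.isclose(dot(unit_a, unit_b), cosine_similarity(raw_a, raw_b)))
```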

We could have selected the OpenAI embedding model, the Cohere model or many others.  We picked the SentenceTransformer model to make the decoupling between the generative model and embedding model clear.  In practice, we expect that most people will use the embedding model that is most often associated with the generative model they select, i.e., the OpenAI embedding model with OpenAI generative models.  The point is, you have a choice.  Your selection will be influenced by many factors.

Of course, you can experiment with other embedding models in this notebook. If you want to use embedding models covered in the first notebook in this sequence, refer back to the information there about getting API keys and setting up your environment as need be.

OpenAI trial accounts expire after three months.  If you want to use OpenAI embeddings after three months you will need to upgrade to paid access.  Cohere trial accounts do not expire, but the API rate limiting is more significant than OpenAI trial account rate limiting.

The default embedding model for this notebook is not tied to an OpenAI, Cohere or any other membership.

The results of the examples below will likely vary depending on the embedding model.

# Setup

## Environment Related Helpers

This portion of the notebook includes `install_if_needed`, which installs a single package or a list of packages with `pip` only if necessary, and `running_in_colab`, a predicate that returns `True` if the notebook is running in Google Colab.

In [None]:
import importlib


def install_if_needed(package_names):
    """
    Install one or more Python packages using pip if they are not already installed.

    Args:
        package_names (str or list): The name(s) of the package(s) to install.

    Returns:
        None
    """
    if isinstance(package_names, str):
        package_names = [package_names]

    for package_name in package_names:
        try:
            importlib.import_module(package_name)
            print(f"{package_name} is already installed.")
        except ImportError:
            !pip install --quiet {package_name}
            print(f"{package_name} has been installed.")


def running_in_colab():
    """
    Check if the Jupyter Notebook is running in Google Colab.

    Returns:
        bool: True if running in Google Colab, False otherwise.
    """
    try:
        import google.colab

        return True
    except ImportError:
        return False

## Mount Google Drive

By default, the data you create in Google Colaboratory does not persist from session to session.  Each session runs in a virtual machine and when that machine goes away, so does your data.  If you want your data to persist, you must store it outside the virtual machine. Google Drive can be used for that purpose.  We use it later in this notebook to store the OpenAI and Cohere API keys.

In [None]:
if running_in_colab():
    from google.colab import drive

    drive.mount("drive")

Mounted at drive


## Install LangChain

In [None]:
install_if_needed("langchain")

langchain has been installed.


## GPU Support (Optional)

In [None]:
import tensorflow as tf

print("GPU Available:", tf.config.list_physical_devices("GPU"))

GPU Available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


In [None]:
install_if_needed("torch")
import torch

print("CUDA Available:", torch.cuda.is_available())

torch is already installed.
CUDA Available: True


# Embeddings

We will use OpenAI and Cohere large language models.

We will use embedding models from OpenAI and Cohere as well as an embedding model from the SentenceTransformers framework.

An overview of OpenAI models can be found [here](https://platform.openai.com/docs/models/overview) and an overview of OpenAI embeddings can be found [here](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings).

An overview of Cohere models and embeddings can be found [here](https://docs.cohere.com/docs/models).

An overview of SentenceTransformers can be found [here](https://sbert.net).  SentenceTransformers was created by Nils Reimers.  Nils is now the Director of Machine Learning at Cohere.

In [None]:
packages = [
    "openai",
    "cohere",
    "tiktoken",
    "transformers",
    "sentence_transformers",
]

install_if_needed(packages)

import openai, tiktoken
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings.cohere import CohereEmbeddings

import cohere
from langchain.chat_models import ChatOpenAI
from langchain.llms import Cohere

from langchain.embeddings import HuggingFaceEmbeddings
from transformers import AutoTokenizer

openai is already installed.
cohere is already installed.
tiktoken is already installed.
transformers is already installed.
sentence_transformers is already installed.
seaborn is already installed.
matplotlib is already installed.


Instantiate the embedding models.

The default OpenAI model is `text-embedding-ada-002`, which is the preferred OpenAI embedding model for its GPT 3.5 and GPT 4 models.  The context length for the model is 8192 tokens.  For more information see the OpenAI [blog announcement](https://openai.com/blog/new-and-improved-embedding-model).

The default Cohere model is `embed-english-v2.0`.  The maximum number of tokens for the model is `512`.

The SentenceTransformers model `paraphrase-multilingual-mpnet-base-v2` is based on the `xlm-roberta-base` model.  It is trained on more than fifty languages. The maximum number of tokens for this model is `128`.  We use it here because of its multilingual capability.  Many of the embedding models available from the [SBERT](https://sbert.net/docs/pretrained_models.html) site have a `512` token maximum.
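
Because each embedding model has its own token limit, it helps to check texts against the limit before embedding them; text over the limit is generally truncated or rejected, depending on the model and its settings. The sketch below is a minimal illustration of such a check. `whitespace_token_count` is a hypothetical stand-in that counts whitespace-separated words, which is not how real tokenizers work, and `find_oversized` is an illustrative helper rather than a library function.

```python
def whitespace_token_count(text):
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def find_oversized(texts, max_tokens, token_count=whitespace_token_count):
    """Return the indexes of texts whose token count exceeds max_tokens."""
    return [i for i, text in enumerate(texts) if token_count(text) > max_tokens]


# One short sample and one sample of 200 "tokens".
samples = ["short text", "word " * 200]
print(find_oversized(samples, max_tokens=128))
```

With a real tokenizer, you would pass a model-specific counting function as `token_count` in place of the word-counting stand-in.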

In [None]:
openai_embeddings_model = OpenAIEmbeddings()
openai_embeddings_model.model

'text-embedding-ada-002'

In [None]:
cohere_embeddings_model = CohereEmbeddings(truncate="None")
cohere_embeddings_model.model

'embed-english-v2.0'

In [None]:
sbert_model_name = "paraphrase-multilingual-mpnet-base-v2"
sbert_embeddings_model = HuggingFaceEmbeddings(model_name=sbert_model_name)
sbert_tokenizer = AutoTokenizer.from_pretrained(
    f"sentence-transformers/{sbert_model_name}"
)

In [None]:
def openai_token_count(text):
    embedding_model = OpenAIEmbeddings()
    openai_encoding = tiktoken.encoding_for_model(embedding_model.model)
    return len(openai_encoding.encode(text))


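# COHERE_API_KEY must already be defined at this point; this cell does not set it.
cohere_client = cohere.Client(COHERE_API_KEY)


def cohere_token_count(text, model_name="command", client=cohere_client):
    # Use the client passed in by the caller instead of always using the global one.
    return len(client.tokenize(text=text, model=model_name))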


def sbert_token_count(text, tokenizer=sbert_tokenizer):
    return len(tokenizer(text, add_special_tokens=False).input_ids)

In [None]:
cohere_multilingual_embeddings_model = CohereEmbeddings(
    model="embed-multilingual-v2.0", truncate="End"
)

# Vector Databases

In [None]:
# llm = Cohere(model="command", temperature=0)  ##Cohere seems fragile on self-query
LLM = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
# llm = ChatOpenAI(model="gpt-4", temperature=0)

In [None]:
st_model_name = "multi-qa-mpnet-base-cos-v1"
st_embeddings_model = HuggingFaceEmbeddings(model_name=st_model_name)
st_tokenizer = AutoTokenizer.from_pretrained(f"sentence-transformers/{st_model_name}")

embeddings_model = st_embeddings_model

## Document Chunking

Document chunking, also known as text segmentation or document splitting, refers to the process of breaking down large documents or pieces of text into smaller, manageable segments before feeding them to large language models. There are several reasons why chunking is important when working with these models.

Chunking documents when working with large language models is essential to overcome input limitations, improve performance, manage costs, ensure complete responses, maintain contextual coherence, and guide the model's attention effectively. It allows you to make the most out of these powerful models when dealing with lengthy or complex text documents.
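
The core mechanics of chunking with overlap can be sketched with a sliding character window. This is a deliberate simplification with a hypothetical helper name: it cuts at fixed character offsets, whereas a splitter such as LangChain's `RecursiveCharacterTextSplitter` tries to break on separators like paragraph breaks, newlines, and spaces before falling back to arbitrary positions.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    # Each window starts chunk_size - chunk_overlap characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still partly visible in both chunks.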

In the following code cells, we will download several corporate privacy policies from the web.  We will use document loaders specific to `pdf` files or to URLs, as the case may be.

We then use LangChain's `RecursiveCharacterTextSplitter` to chunk each document.  See the relevant [LangChain documentation](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter).

We add a piece of metadata that identifies the relevant organization for each chunk.

In [None]:
install_if_needed(["pypdf", "unstructured"])

import textwrap
from langchain.document_loaders import PyPDFLoader, UnstructuredURLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

pypdf has been installed.
unstructured has been installed.


In [None]:
import pandas as pd

policy_data = [
    ("Apple",
     "Privacy Policy",
     "https://www.apple.com/legal/privacy/pdfs/apple-privacy-policy-en-ww.pdf",
    ),
    ("Cohere", "Privacy Policy", "https://cohere.com/privacy"),
    ("Google",
     "Privacy Policy",
     "https://static.googleusercontent.com/media/www.google.com/en//intl/en/policies/privacy/google_privacy_policy_en.pdf",
    ),
    ("Hugging Face", "Privacy Policy", "https://huggingface.co/privacy"),
    ("Meta",
     "Privacy Policy",
     "https://about.fb.com/wp-content/uploads/2022/07/Privacy-Within-Metas-Integrity-Systems.pdf",
    ),
    ("Threads", "Privacy Policy", "https://terms.threads.com/privacy-policy"),
    ("TikTok",
     "Privacy Policy",
     "https://www.tiktok.com/legal/page/us/privacy-policy/en",
    ),]

columns = ["organization", "title", "url"]

policy_df = pd.DataFrame(policy_data, columns=columns)

In [None]:
import os


def get_chunks(url, organization, title, chunk_size=385, chunk_overlap=50):
    """
    Load the document at a url, split it into chunks, and return the chunks.

    The url, the organization name, and the document title are added to the
    metadata of every chunk.

    Parameters:
    url (string): Target page or pdf file.
    organization (string): Organization name.
    title (string): Document title.
    chunk_size (int, optional): Maximum chunk size in characters, default is 385.
    chunk_overlap (int, optional): Chunk overlap in characters, default is 50.

    Returns:
    list of chunks
    """

    # Use PyPDFLoader for pdf targets, otherwise UnstructuredURLLoader
    if os.path.splitext(url)[1] == ".pdf":
        loader = PyPDFLoader(url)
    else:
        loader = UnstructuredURLLoader([url])

    documents = loader.load()
    for document in documents:
        metadata = document.metadata
        metadata["url"] = url
        metadata["organization"] = organization
        metadata["title"] = title
        if metadata.get("page", None) is not None:
            metadata["page"] += 1

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )

    return text_splitter.split_documents(documents)


def explore_documents(documents):
    block_indent = "   "
    metadata = documents[0].metadata
    content = documents[0].page_content
    print(f"{metadata['organization']} {metadata['title']} {len(documents)} chunks")
    print("First chunk:")
    print(
        textwrap.fill(
            content,
            initial_indent=block_indent,
            subsequent_indent=block_indent,
            replace_whitespace=True,
        )
    )
    print()

In [None]:
chunks = []

for row in policy_df.itertuples(index=False):
    policy_chunks = get_chunks(row.url, row.organization, row.title)
    explore_documents(policy_chunks)
    chunks += policy_chunks

Apple Privacy Policy 79 chunks
First chunk:
   Apple Privacy Policy Apple’s Privacy Policy describes how Apple
   collects, uses, and shares your personal data. Updated December 22,
   2022 In addition to this Privacy Policy, we provide data and
   privacy information embedded in our products and certain features
   that ask to use your personal data. This product-specific
   information is accompanied by our Data & Privacy Icon.  You will be

Cohere Privacy Policy 57 chunks
First chunk:
   Products  For Developers  For Business  Pricing  Blog  Company  Try
   now  Cohere Privacy Policy  Last Update: Aug 4, 2023

Google Privacy Policy 82 chunks
First chunk:
   Privacy Policy Last modified: December 18, 2017 ( view archived
   versions ) (The hyperlinked examples are available at the end of
   this document.) There are many different ways you can use our
   services – to search for and share information, to communicate with
   other

Hugging Face Privacy Policy 66 chunks
First chunk:
  

## Create Vector Database from Chunked Documents

In [None]:
install_if_needed("qdrant-client")
from langchain.vectorstores import Qdrant

qdrant-client has been installed.


In [None]:
print(f"There are {len(chunks)} chunks.")

There are 654 chunks.


In [None]:
# Let's take a look at a few chunks.

chunks[22]

Document(page_content='a government-issued ID in limited circumstances, including when setting up a wireless account and activating your device, for the purpose of extending commercial credit, managing reservations, or as required by law •Other Information You Provide to Us. Details such as the content of your communications with Apple, including interactions with customer support and contacts through', metadata={'source': '/tmp/tmprrka5pu9/tmp.pdf', 'page': 4, 'url': 'https://www.apple.com/legal/privacy/pdfs/apple-privacy-policy-en-ww.pdf', 'organization': 'Apple', 'title': 'Privacy Policy'})

In [None]:
chunks[356]

Document(page_content='know whether or not adult nudity and sexual activity was consensually taken and consensually \n shared on our platforms. \n One way we can account for this lack of knowledge, however, is by using automation to remove \n all forms of adult nudity that violate Community Standards on Instagram and Facebook. In this', metadata={'source': '/tmp/tmpaijjvjfe/tmp.pdf', 'page': 12, 'url': 'https://about.fb.com/wp-content/uploads/2022/07/Privacy-Within-Metas-Integrity-Systems.pdf', 'organization': 'Meta', 'title': 'Privacy Policy'})

In [None]:
chunks[600]

Document(page_content='and Stripe (https://stripe.com/en-ie/privacy).', metadata={'source': 'https://www.tiktok.com/legal/page/us/privacy-policy/en', 'url': 'https://www.tiktok.com/legal/page/us/privacy-policy/en', 'organization': 'TikTok', 'title': 'Privacy Policy'})

In [None]:
%%time

collection_name = "my_collection"

vectordb = Qdrant.from_documents(
    documents=chunks,
    embedding=embeddings_model,
    location=":memory:",
    collection_name=collection_name,
)

CPU times: user 1.88 s, sys: 38.4 ms, total: 1.92 s
Wall time: 1.02 s


In [None]:
# Confirm that we have the same number of vectors in the vector database as we have chunks.

assert vectordb.client.get_collection(collection_name).vectors_count == len(chunks)

## Query the Vector Database

By default, a similarity search returns the four chunks whose vectors are closest to the query.

In [None]:
query = "Does Apple sell my personal data?"
results = vectordb.similarity_search(query)
[result.metadata["organization"] for result in results]

['Apple', 'Apple', 'Apple', 'Apple']

Examine the first result in the list.  It looks to be responsive to the question.

In [None]:
print(textwrap.fill(results[0].page_content))

also disclose information about you where there is a lawful basis for
doing so, if we determine that disclosure is reasonably necessary to
enforce our terms and conditions or to protect our operations or
users, or in the event of a reorganization, merger, or sale. Apple
does not sell your personal data including as “sale” is defined in
Nevada and California. Apple also does not


Increase the number of results to 10.  The results are not limited to information that came from Apple's policy.

In [None]:
query = "Does Apple sell my personal data?"
results = vectordb.similarity_search(query, k=10)
[result.metadata["organization"] for result in results]

['Apple',
 'Apple',
 'Apple',
 'Apple',
 'Threads',
 'Apple',
 'Apple',
 'Apple',
 'Apple',
 'Apple']

Let's take a look at each non-Apple result and see what's going on.

In [None]:
for result in results:
    if result.metadata["organization"] != "Apple":
        print(f"Organization: {result.metadata['organization']}")
        print(textwrap.fill(result.page_content))
        print()

Organization: Threads
Personal Data Sales  We will not sell your Personal Data, and have not
done so over the last 12 months.  Personal Data Sharing



The chunk doesn't say anything about Apple, but it does talk about the sale of personal data.  So, although this result ranked below the top four, our similarity search brought in information we probably didn't want to consider.  Let's try another example.

In [None]:
query = "Does Apple use cookies?"
results = vectordb.similarity_search(query, k=10)
[result.metadata["organization"] for result in results]

['Apple',
 'Apple',
 'Apple',
 'Apple',
 'Apple',
 'Apple',
 'Apple',
 'Threads',
 'Threads',
 'Hugging Face']

We see the same issue.  Let's inspect.

In [None]:
for result in results:
    if result.metadata["organization"] != "Apple":
        print(f"Organization: {result.metadata['organization']}")
        print(textwrap.fill(result.page_content))
        print()

Organization: Threads
You can learn more about our use of Cookies on our Cookie Policy.
Data Security

Organization: Threads
Cookies are small pieces of data– usually text files – placed on your
computer, tablet, phone or similar device when you use that device to
access Threads. We may also supplement the information we collect from
you with information received from third parties, including third
parties that have placed their own Cookies on your device(s). Please
note that because of our use of

Organization: Hugging Face
D. Cookies  We use cookies only for the purposes of delivering,
updating, monitoring, improving the Services, and maintaining security
on our Services by detecting, preventing and responding to any type of
threats or incidents.



None of these results reference Apple, yet they all talk about the use of cookies.  While it is true that we only saw these results by expanding the number of results returned, this problem could also show up when looking for fewer results.  For example, what if the company we care about doesn't reference a concept at all and others do?

When we processed the source documents and split them into chunks, we added the name of the organization for the policy as metadata.  We can use that metadata as a filter.  In our example below, the filter is very simple: we merely indicate that the organization field needs to be `Apple`.  When we add that parameter, the results are limited to chunks from Apple's policy.

In [None]:
results = vectordb.similarity_search(query, filter={"organization": "Apple"}, k=10)
[result.metadata["organization"] for result in results]

['Apple',
 'Apple',
 'Apple',
 'Apple',
 'Apple',
 'Apple',
 'Apple',
 'Apple',
 'Apple',
 'Apple']
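
Conceptually, a metadata filter restricts the candidate set before the similarity ranking is applied. The sketch below shows that idea with made-up records and two-dimensional vectors. `filtered_search` is a hypothetical helper, and the way a real vector database applies filters internally may differ.

```python
def filtered_search(query_vector, records, metadata_filter, k=4):
    """Keep records matching every filter entry, then return the k best by dot product."""
    candidates = [
        record
        for record in records
        if all(
            record["metadata"].get(key) == value
            for key, value in metadata_filter.items()
        )
    ]
    candidates.sort(
        key=lambda record: sum(q * v for q, v in zip(query_vector, record["vector"])),
        reverse=True,
    )
    return candidates[:k]


# Hypothetical records with made-up 2-dimensional vectors.
toy_records = [
    {"vector": [0.9, 0.1], "metadata": {"organization": "Apple"}, "text": "Apple cookie text"},
    {"vector": [1.0, 0.0], "metadata": {"organization": "Threads"}, "text": "Threads cookie text"},
    {"vector": [0.2, 0.8], "metadata": {"organization": "Apple"}, "text": "Apple data sale text"},
]

hits = filtered_search([1.0, 0.0], toy_records, {"organization": "Apple"}, k=2)
print([hit["text"] for hit in hits])
# ['Apple cookie text', 'Apple data sale text']
```

The Threads record is the closest match to the query vector, but the filter removes it before ranking, so only Apple records are returned.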

Our next query only references "the company" and not any specific company.  The results relate to several of the companies.

In [None]:
query = "Does the company sell private data?"
results = vectordb.similarity_search(query, k=10)
[result.metadata["organization"] for result in results]

['Threads',
 'Threads',
 'Apple',
 'Apple',
 'Threads',
 'Threads',
 'Hugging Face',
 'Cohere',
 'TikTok',
 'Apple']

We can use the filter to indicate that we care about Google only.

In [None]:
query = "Does the company sell private data?"
results = vectordb.similarity_search(query, filter={"organization": "Google"}, k=10)
[result.metadata["organization"] for result in results]

['Google',
 'Google',
 'Google',
 'Google',
 'Google',
 'Google',
 'Google',
 'Google',
 'Google',
 'Google']

In [None]:
print(textwrap.fill(results[0].page_content))

by law. We may share  non-personally identifiable information
publicly and with our partners – like publishers, advertisers or
connected sites. For example, we may share information publicly to
show trends  about the general use of our services. If Google is
involved in a merger, acquisition or asset sale, we will continue to
ensure the confidentiality of any personal
