<a href="https://colab.research.google.com/github/RERobbins/data_science_266_sandbox/blob/main/2_Vector_Databases.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Introduction to Vector Databases

A vector database, also known as a vector search engine or similarity search database, is a type of database that specializes in storing and retrieving high-dimensional vectors efficiently.  The embeddings we have been using are high-dimensional vectors.

In question answering, vector databases are particularly useful for semantic search, where you want to find documents or data points that are semantically similar to a given query.

Traditional relational databases are not well-suited for efficiently querying and retrieving semantically similar data. Vector databases, on the other hand, are designed to handle similarity-based searches efficiently.
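To make the idea concrete, here is a minimal sketch of what a similarity search does conceptually: score every stored vector against the query and keep the best matches. The three-dimensional vectors and the `top_k` helper are made up for illustration. Production vector databases typically rely on approximate nearest-neighbor indexes rather than this brute-force scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    """Return the k (label, score) pairs most similar to the query."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
toy_vectors = {
    "cookies": [0.9, 0.1, 0.0],
    "data sharing": [0.1, 0.9, 0.1],
    "retention": [0.0, 0.2, 0.9],
}

nearest = top_k([0.8, 0.2, 0.0], toy_vectors, k=2)
print(nearest)
```

The query vector points mostly along the first axis, so the "cookies" vector ranks first and "data sharing" second.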

Vector databases are an essential component of modern natural language processing solutions that are built to apply the generative capabilities of large language models to data collections.  This approach is called retrieval augmented generation or "RAG".

RAG is used in tasks like question answering.  With RAG, a retrieval component first selects a set of relevant documents or passages from a larger corpus, and then a generation component generates the final response based on the selected information. This approach aims to combine the accuracy of retrieval with the flexibility of generation.

This notebook builds on our work with embeddings in the prior notebook by introducing vector databases.  The next notebook in this sequence covers question answering using RAG.

We will use Qdrant, a vector database, and explore some of the most important concepts.  For this notebook, we use an ephemeral vector database by default, but we also show how you could use a persistent vector database instead.  For anything beyond toy examples, we would use a persistent database.

Working with other vector databases is easy.  If you want to explore further, popular alternatives to consider include [Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma), [Facebook AI Similarity Search (FAISS)](https://python.langchain.com/docs/integrations/vectorstores/faiss), [Pinecone](https://python.langchain.com/docs/integrations/vectorstores/pinecone),  and [Weaviate](https://python.langchain.com/docs/integrations/vectorstores/weaviate).  A more comprehensive set supported by LangChain is set out [here](https://python.langchain.com/docs/integrations/vectorstores/).

# Vector Database Embeddings

The choice of a generative large language model can be decoupled from the selection of the embedding model used in an accompanying vector database.  The generative models we use from OpenAI and Cohere take a string as input, not an embedding.  When performing similarity search, you must embed the query with the same model that was used to embed the stored documents.  What you ultimately want back from the vector database is the text behind the matching embeddings, not the vectors themselves.  LangChain returns that text as strings inside LangChain document objects.

Up to this point we have experimented with several different embedding models.  For the remaining exercises, we will use a different embedding model.  We will use `multi-qa-mpnet-base-cos-v1` from the SentenceTransformer collection.  It is based on `microsoft/mpnet-base` and has a maximum token length of `512`.  The embeddings are normalized and cosine-similarity is an appropriate choice for a distance function.  
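Normalization is what makes cosine similarity a natural fit here: for unit-length vectors, cosine similarity reduces to a plain dot product. The sketch below checks that on two small made-up vectors; the helper names are illustrative and not part of any library.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

cos = cosine_similarity(a, b)                    # 24 / (5 * 5)
dot_of_units = dot(normalize(a), normalize(b))   # same value after normalizing
print(cos, dot_of_units)
```

Both values agree up to floating-point rounding.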

We could have selected the OpenAI embedding model, the Cohere model or many others.  We picked the SentenceTransformer model to make the decoupling between the generative model and embedding model clear.  In practice, we expect that most people will use the embedding model that is most often associated with the generative model they select, i.e., the OpenAI embedding model with OpenAI generative models.  The point is, you have a choice.  Your selection will be influenced by many factors.

Of course, you can experiment with other embedding models in this notebook. If you want to use embedding models covered in the first notebook in this sequence, refer back to the information there about getting API keys and setting up your environment as needed.

OpenAI trial accounts expire after three months.  If you want to use OpenAI embeddings after three months you will need to upgrade to paid access.  Cohere trial accounts do not expire, but the API rate limiting is more significant than OpenAI trial account rate limiting.

The default embedding model for this notebook is not tied to an OpenAI, Cohere or any other membership.

The results of the examples below will likely vary depending on the embedding model.

# Setup

## Environment Related Helpers

This portion of the notebook includes `install_if_needed`, which installs a single package or a list of packages with `pip` only if necessary, and `running_in_colab`, a predicate that returns `True` if the notebook is running in Google Colab.

In [1]:
import os
import importlib

def install_if_needed(package_names):
    """
    Install one or more Python packages using pip if they are not already installed.

    Args:
        package_names (str or list): The name(s) of the package(s) to install.

    Returns:
        None
    """
    if isinstance(package_names, str):
        package_names = [package_names]

    for package_name in package_names:
        try:
            importlib.import_module(package_name)
            print(f"{package_name} is already installed.")
        except ImportError:
            !pip install --quiet {package_name}
            print(f"{package_name} has been installed.")


def running_in_colab():
    """
    Check if the Jupyter Notebook is running in Google Colab.

    Returns:
        bool: True if running in Google Colab, False otherwise.
    """
    try:
        import google.colab

        return True
    except ImportError:
        return False
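
One caveat: `install_if_needed` passes the pip package name to `importlib.import_module`, and the two names can differ. For example, the `qdrant-client` package is imported as `qdrant_client`, so the import check fails and `pip` runs every time. A possible refinement, sketched here with a hypothetical `is_importable` helper that is not part of this notebook, checks a derived or explicitly supplied import name instead.

```python
import importlib.util

def is_importable(package_name, import_name=None):
    """
    Report whether a module can be found, without importing it.

    `import_name` is an optional override. By default, hyphens in the pip
    name are replaced with underscores, a common but not universal convention.
    """
    module_name = import_name or package_name.replace("-", "_")
    return importlib.util.find_spec(module_name) is not None

print(is_importable("json"))                 # standard library module
print(is_importable("no-such-package-xyz"))  # made-up name for illustration
```

Packages whose import name is unrelated to the pip name still need the explicit `import_name` argument.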

## Mount Google Drive

By default, the data you create in Google Colaboratory does not persist from session to session.  Each session runs in a virtual machine and when that machine goes away, so does your data.  If you want your data to persist, you must store it outside the virtual machine. Google Drive can be used for that purpose.  In this notebook, it is where the optional persistent vector database is stored.  We also use it to store the OpenAI and Cohere API keys, which you only need if you switch to those embedding models.

In [2]:
if running_in_colab():
    from google.colab import drive

    drive.mount("drive")

Mounted at drive


## Install LangChain

In [3]:
install_if_needed("langchain")

langchain has been installed.


## GPU Support (Optional)

In [4]:
import tensorflow as tf

print("GPU Available:", tf.config.list_physical_devices("GPU"))

GPU Available: []


In [5]:
install_if_needed("torch")
import torch

print("CUDA Available:", torch.cuda.is_available())

torch is already installed.
CUDA Available: False


## Embeddings

We only include what we need for the SentenceTransformer embedding model described above.  If you want to use other embeddings please revisit the code included in the first notebook in this sequence for setting up the necessary API keys and other related functions.  This notebook assumes that the `embeddings_model` variable has been set to the embedding model of choice.

In [6]:
packages = ["transformers", "sentence_transformers",]

install_if_needed(packages)

from langchain.embeddings import HuggingFaceEmbeddings
from transformers import AutoTokenizer

transformers has been installed.
sentence_transformers has been installed.


In [7]:
st_model_name = "multi-qa-mpnet-base-cos-v1"
st_embeddings_model = HuggingFaceEmbeddings(model_name=st_model_name)
st_tokenizer = AutoTokenizer.from_pretrained(f"sentence-transformers/{st_model_name}")

embeddings_model = st_embeddings_model


# Document Chunking

Document chunking, also known as text segmentation or document splitting, refers to the process of breaking down large documents or pieces of text into smaller, manageable segments before feeding them to large language models. There are several reasons why chunking is important when working with these models.

Chunking documents when working with large language models is essential to overcome input limitations, improve performance, manage costs, ensure complete responses, maintain contextual coherence, and guide the model's attention effectively. It allows you to make the most out of these powerful models when dealing with lengthy or complex text documents.

In the following code cells, we will download several corporate privacy policies from the web.  We will use document loaders specific to `pdf` files or `urls` as the case may be.

We then use LangChain's `RecursiveCharacterTextSplitter` to chunk each document.  See the relevant [LangChain documentation](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter).

For our default embedding model, the maximum number of tokens is `512`.  If we assume that each token, on average, corresponds to five characters of text, a chunk could hold up to 512 × 5 = 2,560 characters.  To be conservative, we will use 2,000 characters.  We also need to decide how much our chunks will overlap.  We will use 25% of the chunk size, so 500 characters.  You can experiment with those settings in the code below.
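
The arithmetic above, together with the idea of overlapping chunks, can be sketched in a few lines. This is a simplified fixed-window splitter written for illustration only. `RecursiveCharacterTextSplitter` behaves differently because it prefers to split on separators such as paragraph and line breaks. The `sliding_window_chunks` helper and the sample text are hypothetical.

```python
MAX_TOKENS = 512       # token limit of the default embedding model
CHARS_PER_TOKEN = 5    # rough average assumed in the text, not a measured value

max_chars = MAX_TOKENS * CHARS_PER_TOKEN  # upper bound on chunk size in characters
chunk_size = 2000                         # conservative choice below max_chars
chunk_overlap = chunk_size // 4           # 25% of the chunk size

def sliding_window_chunks(text, size, overlap):
    """Split text into windows of `size` characters that share `overlap` characters."""
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks

# Hypothetical 5,000-character sample made of repeating digits.
sample_text = "".join(str(i % 10) for i in range(5000))
sample_chunks = sliding_window_chunks(sample_text, chunk_size, chunk_overlap)
print(max_chars, chunk_overlap, [len(chunk) for chunk in sample_chunks])
```

The last 500 characters of each chunk reappear at the start of the next one, which is what lets a sentence that straddles a boundary appear whole in at least one chunk.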

We add a piece of metadata that identifies the relevant organization for each chunk.

In [8]:
install_if_needed(["pypdf", "unstructured"])

import textwrap
from langchain.document_loaders import PyPDFLoader, UnstructuredURLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

pypdf has been installed.
unstructured has been installed.


In [9]:
import pandas as pd

policy_data = [
    ("Apple",
     "Privacy Policy",
     "https://www.apple.com/legal/privacy/pdfs/apple-privacy-policy-en-ww.pdf",
    ),
    ("Cohere", "Privacy Policy", "https://cohere.com/privacy"),
    ("Google",
     "Privacy Policy",
     "https://static.googleusercontent.com/media/www.google.com/en//intl/en/policies/privacy/google_privacy_policy_en.pdf",
    ),
    ("Hugging Face", "Privacy Policy", "https://huggingface.co/privacy"),
    ("Meta",
     "Privacy Policy",
     "https://about.fb.com/wp-content/uploads/2022/07/Privacy-Within-Metas-Integrity-Systems.pdf",
    ),
    ("Threads", "Privacy Policy", "https://terms.threads.com/privacy-policy"),
    ("TikTok",
     "Privacy Policy",
     "https://www.tiktok.com/legal/page/us/privacy-policy/en",
    ),]

columns = ["organization", "title", "url"]

policy_df = pd.DataFrame(policy_data, columns=columns)

In [10]:
def get_chunks(url, organization, title, chunk_size=2000, chunk_overlap=500):
    """
    This function takes a url to an organization's web page, organization name,
    and document title and returns chunks constructed from the target url.
    The function adds the url, the organization name and the document title
    as metadata to the chunks.

    Parameters:
    url (string): Target page.
    organization (string): Organization name.
    title (string): Document title.
    chunk_size (int, optional): Chunk size, default is 2000 characters.
    chunk_overlap (int, optional): Chunk overlap, default is 500 characters.

    Returns:
    list of chunks
    """

    # Use PyPDFLoader for pdf targets, otherwise UnstructuredURLLoader
    if os.path.splitext(url)[1] == ".pdf":
        loader = PyPDFLoader(url)
    else:
        loader = UnstructuredURLLoader([url])

    documents = loader.load()
    for document in documents:
        metadata = document.metadata
        metadata["url"] = url
        metadata["organization"] = organization
        metadata["title"] = title
        if metadata.get("page", None) is not None:
            metadata["page"] += 1

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )

    return text_splitter.split_documents(documents)


def explore_documents(documents):
    """Print the organization, title, chunk count, and a truncated preview of the first chunk."""
    block_indent = "   "
    metadata = documents[0].metadata
    content = documents[0].page_content[:300] + ". . ."
    print(f"{metadata['organization']} {metadata['title']} {len(documents)} chunks")
    print("Truncated First chunk:")
    print(
        textwrap.fill(
            content,
            initial_indent=block_indent,
            subsequent_indent=block_indent,
            replace_whitespace=True,
        )
    )
    print()

In [11]:
chunks = []

for row in policy_df.itertuples(index=False):
    policy_chunks = get_chunks(row.url, row.organization, row.title)
    explore_documents(policy_chunks)
    chunks += policy_chunks

Apple Privacy Policy 18 chunks
Truncated First chunk:
   Apple Privacy Policy Apple’s Privacy Policy describes how Apple
   collects, uses, and shares your personal data. Updated December 22,
   2022 In addition to this Privacy Policy, we provide data and
   privacy information embedded in our products and certain features
   that ask to use your personal data. This . . .



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Cohere Privacy Policy 10 chunks
Truncated First chunk:
   Products  For Developers  For Business  Pricing  Blog  Company  Try
   now  Cohere Privacy Policy  Last Update: Aug 4, 2023  Cohere Inc.
   (“Cohere”) values and respects your privacy. We have prepared this
   privacy policy to explain the manner in which we collect, use, and
   disclose personal information th. . .

Google Privacy Policy 20 chunks
Truncated First chunk:
   Privacy Policy Last modified: December 18, 2017 ( view archived
   versions ) (The hyperlinked examples are available at the end of
   this document.) There are many different ways you can use our
   services – to search for and share information, to communicate with
   other people or to create new content. Wh. . .

Hugging Face Privacy Policy 12 chunks
Truncated First chunk:
   Terms of Service  Privacy Policy  Content Policy  Code of Conduct
   Hugging Face Privacy Policy  🗓 Effective Date: March 28, 2023  We
   have implemented this Privacy Policy because

In [12]:
print(f"There are {len(chunks)} chunks.")

There are 140 chunks.


In [13]:
# Let's take a look at a few chunks.

chunks[10]

Document(page_content='Apple’s Sharing of Personal Data Apple may share personal data with Apple-affiliated companies, service providers who act on our behalf, our partners, developers, and publishers, or others at your direction. Apple does not share personal data with third parties for their own marketing purposes.  •Service Providers. Apple may engage third parties to act as our service providers and perform certain tasks on our behalf, such as processing or storing data, including personal data, in connection with your use of our services and delivering products to customers. Apple service providers are obligated to handle personal data consistent with this Privacy Policy and according to our instructions. •Partners. At times, Apple may partner with third parties to provide services or other offerings. For example, Apple financial offerings like Apple Card and Apple Cash are offered by Apple and our partners. Apple requires its partners to protect your personal data. •Developers an

In [14]:
chunks[75]

Document(page_content='negative impact hate speech has on individuals, communities, and our society is why we want to \n quickly detect and remove it through automation as soon as it is posted before many people can \n see it, rather than waiting for user reports after it has gotten many views. \n Thinking about hate speech from a data perspective, hate speech is primarily content-based. \n Although the problem feels very personal between people, it is often reflected in words like racial \n slurs or images like nooses or swastikas that can be picked out of a post or interaction and \n identified as hate speech. As a result, focusing detection of violations against individual pieces \n of content, rather than detection of people, is generally the most effective way to address this \n challenge. If we tried to identify or predict people who might engage in hate speech — for \n example by building a model to predict the type of person who might post hate speech — we \n would run into acc

In [15]:
chunks[120]

Document(page_content='Other State Law Privacy Rights\n\nCalifornia Resident Rights\n\nUnder California Civil Code Sections 1798.83-1798.84, California residents are entitled to contact us to prevent disclosure of Personal Data to third parties for such third parties\' direct marketing purposes; in order to submit such a request, please contact us at privacy@threads.com.\n\nNevada Resident Rights\n\nIf you are a resident of Nevada, you have the right to opt-out of the sale of certain Personal Data to third parties who intend to license or sell that Personal Data. You can exercise this right by contacting us at privacy@threads.com with the subject line "Nevada Do Not Sell Request" and providing us with your name and the email address associated with your account. Please note that we do not currently sell your Personal Data as sales are defined in Nevada Revised Statutes Chapter 603A.\n\nEuropean Union Data Subject Right\n\nUK and EU Residents\n\nIf you are a resident of the European Uni

## Create Vector Database from Chunked Documents

Installing `qdrant-client` may produce a pip dependency error message.  You may ignore that message.

In [16]:
install_if_needed("qdrant-client")
from langchain.vectorstores import Qdrant

ERROR: pip's dependency res

Now we create the vector database.  By default, we store it in memory.  If you want yours to persist, revise `qdrant_database_location` below, disable the `location` parameter in the `from_documents` call, and enable the `path` parameter.

By default, we assume the notebook will run on a CPU-only Colab instance.  Creating the vector database can take a minute or two in a CPU session; in a GPU session, it should take a fraction of that time.

In [17]:
%%time

collection_name = "my_collection"
qdrant_database_location = "/content/drive/MyDrive/my_qdrant"

vectordb = Qdrant.from_documents(
    documents = chunks,
    embedding = embeddings_model,
    location = ":memory:",
#   path = qdrant_database_location,
    collection_name = collection_name
    )

CPU times: user 4min 7s, sys: 1min, total: 5min 7s
Wall time: 1min 21s


In [18]:
# Confirm that we have the same number of vectors in the vector database as we have chunks.

assert vectordb.client.get_collection(collection_name).vectors_count == len(chunks)

## Query the Vector Database

By default, a similarity search returns the four chunks whose vectors are closest to the query.

In [19]:
query = "Does Apple sell my personal data?"
results = vectordb.similarity_search(query)
[result.metadata["organization"] for result in results]

['Apple', 'Apple', 'Apple', 'Apple']

Examine the first result in the list.  It looks to be responsive to the question.

In [20]:
print(textwrap.fill(results[0].page_content))

you do not resubscribe. This information is provided to developers or
publishers so that they can understand the performance of their
subscriptions. •Others. Apple may share personal data with others at
your direction or with your consent, such as when we share information
with your carrier to activate your account. We may also disclose
information about you if we determine that for purposes of national
security, law enforcement, or other issues of public importance,
disclosure is necessary or appropriate. We may also disclose
information about you where there is a lawful basis for doing so, if
we determine that disclosure is reasonably necessary to enforce our
terms and conditions or to protect our operations or users, or in the
event of a reorganization, merger, or sale. Apple does not sell your
personal data including as “sale” is defined in Nevada and California.
Apple also does not “share” your personal data as that term is defined
in California. Protection of Personal Data at App

In [21]:
query = "Does Apple use Cookies?"
results = vectordb.similarity_search(query)
[result.metadata["organization"] for result in results]

['Apple', 'Apple', 'Apple', 'Threads']

Hey, one of the results doesn't come from Apple's policy.  That is concerning.  Let's take a look.

In [22]:
for result in results:
    if result.metadata["organization"] != "Apple":
        print(f"Organization: {result.metadata['organization']}")
        print(textwrap.fill(result.page_content))
        print()

Organization: Threads
You can learn more about our use of Cookies on our Cookie Policy.
Data Security  We seek to protect your Personal Data from unauthorized
access, use and disclosure using appropriate physical, technical,
organizational and administrative security measures based on the type
of Personal Data and how we are processing that data. You should also
help protect your data by appropriately selecting and protecting your
password and/or other sign-on mechanism; limiting access to your
computer or device and browser; and signing off after you have
finished accessing your account. Although we work to protect the
security of your account and other data that we hold in our records,
please be aware that no method of transmitting data over the internet
or storing data is completely secure.  Data Retention  We retain
Personal Data about you as necessary to provide you with Threads or to
perform our business or commercial purposes for collecting your
Personal Data. When establishing 

The chunk doesn't say anything about Apple, but it does talk about cookies.  Our similarity search brought in information we probably didn't want to consider.  Let's try another example.

In [23]:
query = "Does Cohere sell my personal data?"
results = vectordb.similarity_search(query)
[result.metadata["organization"] for result in results]

['Cohere', 'Cohere', 'Cohere', 'Apple']

We see the same issue.  Let's inspect.

In [24]:
for result in results:
    if result.metadata["organization"] != "Cohere":
        print(f"Organization: {result.metadata['organization']}")
        print(textwrap.fill(result.page_content))
        print()

Organization: Apple
Apple’s Sharing of Personal Data Apple may share personal data with
Apple-affiliated companies, service providers who act on our behalf,
our partners, developers, and publishers, or others at your direction.
Apple does not share personal data with third parties for their own
marketing purposes.  •Service Providers. Apple may engage third
parties to act as our service providers and perform certain tasks on
our behalf, such as processing or storing data, including personal
data, in connection with your use of our services and delivering
products to customers. Apple service providers are obligated to handle
personal data consistent with this Privacy Policy and according to our
instructions. •Partners. At times, Apple may partner with third
parties to provide services or other offerings. For example, Apple
financial offerings like Apple Card and Apple Cash are offered by
Apple and our partners. Apple requires its partners to protect your
personal data. •Developers and P

It is troubling to realize that our searches may take into account documents that should not be relevant at all.  For example, what if the company we care about doesn't reference a concept at all and others do?

Let's repeat the last example and retrieve the cosine similarity score.  Should we look at scores and have a threshold?

In [25]:
query = "Does Cohere sell my personal data?"
results = vectordb.similarity_search_with_score(query)
for document, score in results:
    print(f"Organization: {document.metadata['organization']}\t Score: {score}")

Organization: Cohere	 Score: 0.6561509581046471
Organization: Cohere	 Score: 0.6123153411665482
Organization: Cohere	 Score: 0.5278344415126656
Organization: Apple	 Score: 0.5277527524331249
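
To explore the threshold question, here is a small sketch that applies a cutoff to the scores printed above. The `filter_by_score` helper and the `0.6` cutoff are hypothetical and chosen only for illustration.

```python
# (organization, score) pairs copied from the output above.
scored_results = [
    ("Cohere", 0.6561509581046471),
    ("Cohere", 0.6123153411665482),
    ("Cohere", 0.5278344415126656),
    ("Apple", 0.5277527524331249),
]

def filter_by_score(results, threshold):
    """Keep only results whose similarity score meets the threshold."""
    return [(org, score) for org, score in results if score >= threshold]

SCORE_THRESHOLD = 0.6  # hypothetical cutoff, not a recommended value
kept = filter_by_score(scored_results, SCORE_THRESHOLD)
print(kept)

# Gap between the weakest Cohere result and the Apple result.
gap = scored_results[2][1] - scored_results[3][1]
print(gap)
```

The cutoff removes the Apple chunk, but it also removes the third Cohere chunk, because the two scores differ by less than 0.0001. A threshold alone cannot separate them here.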


When we processed the source documents and split them into chunks, we added the name of the organization for the policy as metadata.  We can use that metadata as a filter.  In our example below, the filter is very simple: we merely indicate that the organization field needs to be `Apple`.  When we add that parameter, the results are limited to Apple chunks, even when we expand the number of results returned from the default of 4 to 10.

In [26]:
query = "Does Apple use Cookies?"
results = vectordb.similarity_search_with_score(query, filter={"organization": "Apple"}, k=10)
for document, score in results:
    print(f"Organization: {document.metadata['organization']}\t Score: {score}")

Organization: Apple	 Score: 0.679313401100281
Organization: Apple	 Score: 0.537787142921534
Organization: Apple	 Score: 0.5178187754454002
Organization: Apple	 Score: 0.4514172466844604
Organization: Apple	 Score: 0.43803491980847886
Organization: Apple	 Score: 0.4294281645288761
Organization: Apple	 Score: 0.4053593097796341
Organization: Apple	 Score: 0.39328099546132345
Organization: Apple	 Score: 0.3704239429262331
Organization: Apple	 Score: 0.36804883343011763
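
Conceptually, a metadata filter restricts the candidates to chunks whose metadata matches and then ranks the survivors by score. The sketch below mimics that in plain Python. The `filtered_search` helper, the dictionaries, and the scores are all invented for illustration, and a real vector database can apply the filter as part of the index search rather than after scoring everything.

```python
def filtered_search(scored_chunks, metadata_filter, k=4):
    """
    Keep chunks whose metadata matches every key/value pair in
    metadata_filter, then return the k highest-scoring survivors.
    """
    matches = [
        chunk
        for chunk in scored_chunks
        if all(chunk["metadata"].get(key) == value
               for key, value in metadata_filter.items())
    ]
    return sorted(matches, key=lambda chunk: chunk["score"], reverse=True)[:k]

# Hypothetical pre-scored chunks; the scores are made up.
scored_chunks = [
    {"metadata": {"organization": "Threads"}, "score": 0.55},
    {"metadata": {"organization": "Apple"}, "score": 0.54},
    {"metadata": {"organization": "Apple"}, "score": 0.68},
]

apple_only = filtered_search(scored_chunks, {"organization": "Apple"}, k=10)
print(apple_only)
```

Asking for `k=10` returns only two chunks here, because the filter caps the candidate pool before ranking.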


Our next query only references "the company" and not any specific company.  The results relate to several of the companies.

In [27]:
query = "Does the company sell private data?"
results = vectordb.similarity_search_with_score(query, k=10)
for document, score in results:
    print(f"Organization: {document.metadata['organization']}\t Score: {score}")

Organization: Apple	 Score: 0.580494584423452
Organization: Apple	 Score: 0.5797794282285412
Organization: TikTok	 Score: 0.5776097310290866
Organization: Hugging Face	 Score: 0.5654666165640002
Organization: Apple	 Score: 0.55246001082843
Organization: Cohere	 Score: 0.5454517364826044
Organization: TikTok	 Score: 0.5293108496689212
Organization: Hugging Face	 Score: 0.5178860201316624
Organization: Apple	 Score: 0.5157372330603829
Organization: Hugging Face	 Score: 0.5059552966802697


We can use the filter to indicate that we care about Google only.

In [28]:
query = "Does the company sell private data?"
results = vectordb.similarity_search_with_score(query, filter={"organization": "Google"}, k=10)
for document, score in results:
    print(f"Organization: {document.metadata['organization']}\t Score: {score}")

Organization: Google	 Score: 0.48035517998090516
Organization: Google	 Score: 0.4488092412159005
Organization: Google	 Score: 0.4391542655724485
Organization: Google	 Score: 0.4192212436766915
Organization: Google	 Score: 0.4051980849827558
Organization: Google	 Score: 0.39934450965347273
Organization: Google	 Score: 0.3915672366774913
Organization: Google	 Score: 0.38617583806931866
Organization: Google	 Score: 0.3791200428384656
Organization: Google	 Score: 0.37469536969432515


Now that we are familiar with vector databases, we are ready to use them to give a large language model the ability to answer questions using data it was not trained on: the information contained in the vector database.