<a href="https://colab.research.google.com/github/RERobbins/data_science_266_sandbox/blob/main/4_QA_RAG_Self_Query.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Question Answering Using Retrieval Augmented Generation With a Self-Querying Retriever

As we have worked with the vector database, it has become clear that it would be useful if we could take a look at a query and see if there was an obvious way to apply metadata filters when we retrieve information from it.  For example, if we ask a question about Company X and the database has information about Company X and Company Y with suitable metadata tags, we ought to be able to take a query that focuses on Company X and only look at information in the vector database that relates to Company X.
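As a point of reference, the idea can be sketched without any LLM at all: look for a known organization name in the query and keep only the chunks whose metadata matches. The sketch below is illustrative only. `KNOWN_ORGANIZATIONS`, `detect_organizations`, and `filter_chunks` are hypothetical names, the sample text is made up, and the chunks are plain dictionaries rather than LangChain documents.

```python
import re

# Hypothetical list of organizations that appear in the metadata.
KNOWN_ORGANIZATIONS = ["Company X", "Company Y"]


def detect_organizations(query, known=KNOWN_ORGANIZATIONS):
    """Return the known organizations mentioned in the query (case-insensitive)."""
    found = []
    for name in known:
        if re.search(rf"\b{re.escape(name)}\b", query, flags=re.IGNORECASE):
            found.append(name)
    return found


def filter_chunks(chunks, organizations):
    """Keep chunks whose organization is in the list; keep all if the list is empty."""
    if not organizations:
        return list(chunks)
    return [c for c in chunks if c["metadata"]["organization"] in organizations]


# Made-up chunks for illustration only.
chunks = [
    {"text": "X retains data for 30 days.", "metadata": {"organization": "Company X"}},
    {"text": "Y shares data with partners.", "metadata": {"organization": "Company Y"}},
]

mentioned = detect_organizations("How long does Company X keep my data?")
candidates = filter_chunks(chunks, mentioned)
print(mentioned, [c["text"] for c in candidates])
```

A fixed list of names is brittle, which is part of the motivation for letting an LLM construct the filter instead.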

One can imagine a variety of ways to approach this problem.  LangChain provides one mechanism, which it calls a self-querying retriever and describes as follows:

> A self-querying retriever is one that, as the name suggests, has the ability to query itself. Specifically, given any natural language query, the retriever uses a query-constructing LLM chain to write a structured query and then applies that structured query to it's underlying VectorStore. This allows the retriever to not only use the user-input query for semantic similarity comparison with the contents of stored documented, but to also extract filters from the user query on the metadata of stored documents and to execute those filters.

See [here](https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/) for more detail.

This notebook picks up where we left off in the Retrieval Augmented Generation notebook.

The comments made about the use of generative models, embedding models and vector databases apply here.  Moreover, the code to either build your vector database or use one that you have kept is the same.

# Setup

## Environment Related Helpers

This portion of the notebook includes `install_if_needed`, which installs a single package or a list of packages with `pip` only if necessary, and `running_in_colab`, a predicate that returns `True` if the notebook is running in Google Colab.

In [1]:
import importlib


def install_if_needed(package_names):
    """
    Install one or more Python packages using pip if they are not already installed.

    Args:
        package_names (str or list): The name(s) of the package(s) to install.

    Returns:
        None
    """
    if isinstance(package_names, str):
        package_names = [package_names]

    for package_name in package_names:
        try:
            importlib.import_module(package_name)
            print(f"{package_name} is already installed.")
        except ImportError:
            !pip install --quiet {package_name}
            print(f"{package_name} has been installed.")


def running_in_colab():
    """
    Check if the Jupyter Notebook is running in Google Colab.

    Returns:
        bool: True if running in Google Colab, False otherwise.
    """
    try:
        import google.colab

        return True
    except ImportError:
        return False
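One caveat with `install_if_needed`: it checks by importing the name it was given, and a pip distribution name does not always match its import name (`python-dotenv`, for example, is imported as `dotenv`), so the helper invokes pip for such packages on every run. A possible alternative, sketched here under the assumption that checking installed distribution metadata is acceptable, uses the standard library's `importlib.metadata`. The helper name `is_distribution_installed` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def is_distribution_installed(distribution_name):
    """Return True if installed-package metadata exists for the named distribution."""
    try:
        version(distribution_name)
        return True
    except PackageNotFoundError:
        return False


# A name that should not exist in any environment.
print(is_distribution_installed("surely-not-a-real-distribution-name"))
```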

## Mount Google Drive

By default, the data you create in Google Colaboratory does not persist from session to session.  Each session runs in a virtual machine and when that machine goes away, so does your data.  If you want your data to persist, you must store it outside the virtual machine. Google Drive can be used for that purpose.  We use it later in this notebook to store the OpenAI and Cohere API keys.

In [2]:
if running_in_colab():
    from google.colab import drive

    drive.mount("drive")

Mounted at drive


## API Keys

In [3]:
install_if_needed("python-dotenv")

python-dotenv has been installed.


In [4]:
import os
import getpass

from dotenv import load_dotenv, find_dotenv

def env_file_path(
    colab_path="/content/drive/MyDrive/.env", other_path=f"{find_dotenv()}"
):
    """
    Returns the appropriate file path for the environment variables file (.env) based on the execution environment.

    This function is designed to determine the correct path for the environment variables file
    depending on whether the code is running in Google Colab or in a different environment.

    Args:
        colab_path (str, optional): The file path for the environment variables file in Google Colab.
            Default is '/content/drive/MyDrive/.env'.

        other_path (str, optional): The file path for the environment variables file in other environments.
            Default is the path located by `find_dotenv()`.

    Returns:
        str: The file path for the environment variables file (.env).
    """

    return colab_path if running_in_colab() else other_path

In [5]:
load_dotenv(env_file_path())
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
COHERE_API_KEY = os.environ["COHERE_API_KEY"]
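The cell above raises a `KeyError` if a key is missing from the environment. A possible fallback, sketched below, prompts for the key instead. This may be what the otherwise unused `getpass` import was intended for, but that is an assumption. `get_api_key` and the `DEMO_API_KEY` variable are hypothetical names.

```python
import getpass
import os


def get_api_key(name):
    """Return the key from the environment, prompting for it only if it is absent."""
    value = os.environ.get(name)
    if not value:
        value = getpass.getpass(f"Enter {name}: ")
        os.environ[name] = value
    return value


# Demonstration with a placeholder variable so that no prompt appears.
os.environ["DEMO_API_KEY"] = "placeholder-value"
print(get_api_key("DEMO_API_KEY"))
```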

## LangChain

In [6]:
install_if_needed("langchain")

langchain has been installed.


## GPU Support (Optional)

In [7]:
import tensorflow as tf

print("GPU Available:", tf.config.list_physical_devices("GPU"))

GPU Available: []


In [8]:
install_if_needed("torch")
import torch

print("CUDA Available:", torch.cuda.is_available())

torch is already installed.
CUDA Available: False


## Embeddings

Instantiate the embeddings model.  

In [9]:
packages = [
    "openai",
    "cohere",
    "tiktoken",
    "transformers",
    "sentence_transformers",
]

install_if_needed(packages)

import openai, tiktoken

import cohere
from langchain.chat_models import ChatOpenAI
from langchain.llms import Cohere

from langchain.embeddings import HuggingFaceEmbeddings
from transformers import AutoTokenizer

openai has been installed.
cohere has been installed.
tiktoken has been installed.

In [10]:
st_model_name = "multi-qa-mpnet-base-cos-v1"
st_embeddings_model = HuggingFaceEmbeddings(model_name=st_model_name)
st_tokenizer = AutoTokenizer.from_pretrained(f"sentence-transformers/{st_model_name}")

embeddings_model = st_embeddings_model

## Vector Database

In [11]:
install_if_needed("qdrant-client")
from langchain.vectorstores import Qdrant

import qdrant_client

ERROR: pip's dependency res

# Generative Model

OpenAI trial accounts expire after three months and provide access to `gpt-3.5-turbo` but not `gpt-4`.  Paid OpenAI accounts permit use of `gpt-4` as well and do not expire.  Cohere trial accounts do not expire, but the API rate limiting is more significant than OpenAI trial account rate limiting.

Set the `LLM` variable below to reflect the generative model you want to use.  

The results in most of the examples below will vary with your choice.


In [12]:
# LLM = Cohere(model="command", temperature=0)
LLM = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
# LLM = ChatOpenAI(model="gpt-4", temperature=0)

# Load Documents, Split Into Chunks, Create Vector Database

**If you saved your Qdrant database when you worked on the vector database notebook, you can skip this section and use the Load Persistent Vector Database section below.**

Otherwise, we use the following cells to download the privacy policies we have been working with and split them into chunks to be stored in the vector database.

In [None]:
install_if_needed(["pypdf", "unstructured"])

import textwrap
from langchain.document_loaders import PyPDFLoader, UnstructuredURLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

pypdf has been installed.
unstructured has been installed.


In [None]:
import pandas as pd

policy_data = [
    ("Apple",
     "Privacy Policy",
     "https://www.apple.com/legal/privacy/pdfs/apple-privacy-policy-en-ww.pdf",
    ),
    ("Cohere", "Privacy Policy", "https://cohere.com/privacy"),
    ("Google",
     "Privacy Policy",
     "https://static.googleusercontent.com/media/www.google.com/en//intl/en/policies/privacy/google_privacy_policy_en.pdf",
    ),
    ("Hugging Face", "Privacy Policy", "https://huggingface.co/privacy"),
    ("Meta",
     "Privacy Policy",
     "https://about.fb.com/wp-content/uploads/2022/07/Privacy-Within-Metas-Integrity-Systems.pdf",
    ),
    ("Threads", "Privacy Policy", "https://terms.threads.com/privacy-policy"),
    ("TikTok",
     "Privacy Policy",
     "https://www.tiktok.com/legal/page/us/privacy-policy/en",
    ),]

columns = ["organization", "title", "url"]

policy_df = pd.DataFrame(policy_data, columns=columns)

In [None]:
def get_chunks(url, organization, title, chunk_size=2000, chunk_overlap=500):
    """
    This function takes a url to an organization's web page, organization name,
    and document title and returns chunks constructed from the target url.
    The function adds the url, the organization name and the document title
    as metadata to the chunks.

    Parameters:
    url (string): Target page.
    organization (string): Organization name.
    title (string): Document title.
    chunk_size (int, optional): Chunk size, default is 2000 characters.
    chunk_overlap (int, optional): Chunk overlap, default is 500 characters.

    Returns:
    list of chunks
    """

    # Use PyPDFLoader for pdf targets, otherwise UnstructuredURLLoader
    if os.path.splitext(url)[1] == ".pdf":
        loader = PyPDFLoader(url)
    else:
        loader = UnstructuredURLLoader([url])

    documents = loader.load()
    for document in documents:
        metadata = document.metadata
        metadata["url"] = url
        metadata["organization"] = organization
        metadata["title"] = title
        if metadata.get("page", None) is not None:
            metadata["page"] += 1

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )

    return text_splitter.split_documents(documents)


def explore_documents(documents):
    block_indent = "   "
    metadata = documents[0].metadata
    content = documents[0].page_content[:300] + ". . ."
    print(f"{metadata['organization']} {metadata['title']} {len(documents)} chunks")
    print("Truncated First chunk:")
    print(
        textwrap.fill(
            content,
            initial_indent=block_indent,
            subsequent_indent=block_indent,
            replace_whitespace=True,
        )
    )
    print()
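To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line breaks). It only illustrates how consecutive chunks can share `chunk_overlap` characters. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=2000, chunk_overlap=500):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 3  # 30 characters
pieces = sliding_window_chunks(sample, chunk_size=12, chunk_overlap=4)
for piece in pieces:
    print(len(piece), piece)
```

With the notebook's settings of 2000 and 500, each window would start 1500 characters after the previous one.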

In [None]:
chunks = []

for row in policy_df.itertuples(index=False):
    policy_chunks = get_chunks(row.url, row.organization, row.title)
    explore_documents(policy_chunks)
    chunks += policy_chunks

Apple Privacy Policy 18 chunks
Truncated First chunk:
   Apple Privacy Policy Apple’s Privacy Policy describes how Apple
   collects, uses, and shares your personal data. Updated December 22,
   2022 In addition to this Privacy Policy, we provide data and
   privacy information embedded in our products and certain features
   that ask to use your personal data. This . . .



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Cohere Privacy Policy 10 chunks
Truncated First chunk:
   Products  For Developers  For Business  Pricing  Blog  Company  Try
   now  Cohere Privacy Policy  Last Update: Aug 4, 2023  Cohere Inc.
   (“Cohere”) values and respects your privacy. We have prepared this
   privacy policy to explain the manner in which we collect, use, and
   disclose personal information th. . .

Google Privacy Policy 20 chunks
Truncated First chunk:
   Privacy Policy Last modified: December 18, 2017 ( view archived
   versions ) (The hyperlinked examples are available at the end of
   this document.) There are many different ways you can use our
   services – to search for and share information, to communicate with
   other people or to create new content. Wh. . .

Hugging Face Privacy Policy 12 chunks
Truncated First chunk:
   Terms of Service  Privacy Policy  Content Policy  Code of Conduct
   Hugging Face Privacy Policy  🗓 Effective Date: March 28, 2023  We
   have implemented this Privacy Policy because

In [None]:
print(f"There are {len(chunks)} chunks.")

There are 140 chunks.


In [None]:
%%time

collection_name = "my_collection"

vectordb = Qdrant.from_documents(
    documents = chunks,
    embedding = embeddings_model,
    location = ":memory:",
    collection_name = collection_name
    )

CPU times: user 5.9 s, sys: 890 ms, total: 6.79 s
Wall time: 12.1 s


In [None]:
# Confirm that we have the same number of vectors in the vector database as we have chunks.

assert vectordb.client.get_collection(collection_name).vectors_count == len(chunks)

# Load Persistent Vector Database

**If you saved your Qdrant database when you worked on the vector database notebook, you can use this section instead of the Load Documents, Split Into Chunks, Create Vector Database section above.**

If you execute this code block more than once in a session, you are likely to get an error indicating that your vector database is already accessed by another instance of the Qdrant client and that, if you require concurrent access, you should use Qdrant server instead.

The prior section should always work in lieu of this section.

In [13]:
collection_name = "my_collection"
qdrant_database_location = "/content/drive/MyDrive/my_qdrant"

client = qdrant_client.QdrantClient(path=qdrant_database_location)

vectordb = Qdrant(client=client,
                   collection_name=collection_name,
                   embeddings=embeddings_model,)

In [14]:
assert vectordb.client.get_collection(collection_name).vectors_count == 140

# Self-Querying Retriever

A self-querying retriever is one that, as the name suggests, has the ability to query itself. Specifically, given any natural language query, the retriever uses a query-constructing LLM chain to write a structured query and then applies that structured query to its underlying vector store. This allows the retriever not only to use the user-input query for semantic similarity comparison with the contents of stored documents, but also to extract filters on the metadata of stored documents from the user query and to execute those filters.

The self-querying retriever's arguments include descriptions of the metadata fields and the document content.
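To illustrate what executing a metadata filter amounts to, the sketch below evaluates a small structured filter against document metadata. The `Eq`, `Or`, and `Not` classes are hypothetical stand-ins written for this illustration. They are not LangChain's own classes, and a real vector store translates the filter into its native query language rather than evaluating it in Python.

```python
from dataclasses import dataclass


@dataclass
class Eq:
    attribute: str
    value: str


@dataclass
class Or:
    arguments: list


@dataclass
class Not:
    argument: object


def matches(filter_, metadata):
    """Return True if the metadata dictionary satisfies the filter."""
    if isinstance(filter_, Eq):
        return metadata.get(filter_.attribute) == filter_.value
    if isinstance(filter_, Or):
        return any(matches(arg, metadata) for arg in filter_.arguments)
    if isinstance(filter_, Not):
        return not matches(filter_.argument, metadata)
    raise TypeError(f"Unsupported filter: {filter_!r}")


metadata_rows = [
    {"organization": "Apple"},
    {"organization": "Cohere"},
    {"organization": "Google"},
]

apple_or_cohere = Or([Eq("organization", "Apple"), Eq("organization", "Cohere")])
not_apple = Not(Eq("organization", "Apple"))

print([m["organization"] for m in metadata_rows if matches(apple_or_cohere, m)])
print([m["organization"] for m in metadata_rows if matches(not_apple, m)])
```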

In [15]:
install_if_needed("lark")

lark has been installed.


In [16]:
import textwrap
from langchain import PromptTemplate, LLMChain
from langchain.chains import RetrievalQA
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="organization",
        description="The company or organization that created the document.  It describes that company's policy.",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="title",
        description="The title of the document",
        type="string",
    ),
    AttributeInfo(
        name="url",
        description="The url for the document",
        type="string",
    ),
]
document_content_description = "A policy"

retriever = SelfQueryRetriever.from_llm(
    LLM,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True,
    enable_limit=True,
)

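Now we build a chain using our new retriever and add a function to pretty-print results.  The prompt we used in the last notebook is defined as well, but the line that passes it to the chain is commented out, so the results below come from the chain's default prompt.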

In [72]:
template = """Use the following pieces of context to answer the question at the end.
Your answer should be as concise as possible and ideally not more than one sentence.
If you don't know the answer, just say that you don't know.

{context}

Question: {question}

Answer:"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

chain = RetrievalQA.from_chain_type(
    llm = LLM,
    retriever = retriever,
#    chain_type_kwargs = {"prompt": prompt},
    return_source_documents = True,
)

def pretty_print_result(result):
    """Print the query, the organizations of the source documents, and the wrapped answer."""
    indent = "    "
    organizations = [
        document.metadata["organization"] for document in result["source_documents"]
    ]
    print()
    print(f"Query: {result['query']}")
    print(f"Organizations: {organizations}")
    print("Answer:")
    print(textwrap.fill(result["result"], initial_indent=indent, subsequent_indent=indent))

import warnings

# Ignore all UserWarning messages
warnings.filterwarnings("ignore", category=UserWarning)


We will run the chain, which calls the `SelfQueryRetriever` to fetch relevant documents, and examine both the metadata filter it generates and the organizations of the documents retrieved.

As you look at the examples below and substitute your own, you should discover that the approach is not consistently reliable.  In some cases, the system fails to recognize that a word in a query is an organization.  Sometimes revising the prompt slightly to make that distinction more apparent helps.

Does this suggest that using our generative models to do entity extraction is perhaps not the best way to proceed?

In the next example, the system does not identify Meta as an organization.  Nevertheless, it is interesting to note that three of the four retrieved documents relate to Meta; the remaining one comes from Cohere.

In [73]:
result = chain("How does Meta protect my data?")
pretty_print_result(result)

query='Meta protect my data' filter=None limit=None

Query: How does Meta protect my data?
Organizations: ['Meta', 'Cohere', 'Meta', 'Meta']
Answer:
    Meta implements reasonable administrative, technical, and physical
    measures to safeguard personal information against theft, loss,
    unauthorized access, use, modification, and disclosure. Access to
    personal information is restricted to employees and authorized
    service providers on a need-to-know basis. They only retain
    personal information for as long as it is operationally or legally
    necessary, and after that, they either destroy or anonymize the
    information. If you request access or updates to your personal
    information, Meta will direct you to the relevant customer. You
    may have the right to access, update, correct, delete, transfer,
    and object to the processing of your personal information. Meta is
    committed to privacy and conducts privacy reviews for new products
    and tools that involve

However, when we change the query to make more explicit that we are talking about the company named Meta, the filter we want is generated and the documents are limited to Meta.

In [74]:
result = chain("How does the company named Meta protect my data?")
pretty_print_result(result)

query='Meta data protection' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Meta') limit=None

Query: How does the company named Meta protect my data?
Organizations: ['Meta', 'Meta', 'Meta', 'Meta']
Answer:
    Meta protects your data through its commitment to privacy and its
    Privacy Review process. The company analyzes privacy alongside
    safety, security, and integrity concerns. New products and
    internal tools go through a privacy review, where experts evaluate
    privacy risks and make necessary changes before launch. This
    review is especially important for safety, security, and integrity
    tools that may use a range of data to detect and prevent harm.
    Meta follows eight privacy principles when discussing the
    appropriate use of personal data. These principles include purpose
    limitation, data minimization, and data retention. The company
    aims to process data only for a limited, clearly stated purpose
    that prov

One of the subtle and even more interesting things about LangChain's self-querying retriever is that the query itself can specify the number of documents to fetch.  We enabled that by passing `enable_limit=True` to `SelfQueryRetriever.from_llm`.  See the relevant documentation [here](https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/qdrant_self_query).

In the next example, the prompt asks for five examples, the structured query has `limit=5`, and we get five results instead of the default of four.

In [75]:
result = chain("How does the company named Meta protect my data?  I want five examples.")
pretty_print_result(result)

query='Meta protect my data' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Meta') limit=5

Query: How does the company named Meta protect my data?  I want five examples.
Organizations: ['Meta', 'Meta', 'Meta', 'Meta', 'Meta']
Answer:
    Meta protects user data through various measures. Here are five
    examples:  1. Privacy Review: Meta has a Privacy Review process in
    place for all new products and internal tools that handle user
    data. This review involves experts from legal, policy, and product
    teams who evaluate privacy risks associated with the project and
    make necessary changes to control those risks before launch.  2.
    Data Minimization: Meta follows the principle of data
    minimization, which means they collect and create only the minimum
    amount of data required to support the stated purposes. This
    ensures that unnecessary or excessive data is not collected or
    stored.  3. Data Retention: Meta retains user d

Now, let's ask about two companies.  The filter seems to be doing the right thing. Notice the response regarding Cohere though.

In [76]:
result = chain("Do Apple or Cohere use cookies?")
pretty_print_result(result)

query='cookies' filter=Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Apple'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Cohere')]) limit=None

Query: Do Apple or Cohere use cookies?
Organizations: ['Apple', 'Cohere', 'Apple', 'Apple']
Answer:
    Yes, both Apple and Cohere use cookies. Apple uses cookies on
    their website to understand website activity, monitor and improve
    the website, and provide a customized experience. Cohere, on the
    other hand, may also use cookies on their website, but the
    provided context does not specifically mention Cohere's use of
    cookies.


When we write our prompt to make clear that Cohere is the name of a company and don't reference other companies, we get back sufficient detail to answer the question.  Depending on the way we word the prompt, the system sometimes fails to recognize that Cohere is the name of a company.  When referencing more than one company, perhaps we should be forcing the retriever to retrieve more documents.

In [77]:
result = chain("Does the company named Cohere use cookies?")
pretty_print_result(result)

query='cookies' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Cohere') limit=None

Query: Does the company named Cohere use cookies?
Organizations: ['Cohere', 'Cohere', 'Cohere', 'Cohere']
Answer:
    Yes, Cohere uses cookies on its website. The website collects the
    IP addresses of visitors, as well as other information such as
    page requests, browser type, operating system, and average time
    spent on the website. Cookies are used to recognize a user's
    computer or device when they return to the website and to optimize
    the user experience.
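The suggestion above, retrieving more documents when a query names more than one company, could be approached by taking the top results per organization rather than overall. The sketch below works on hypothetical `(score, organization, text)` tuples with made-up scores. `top_k_per_organization` is an illustrative name and not part of LangChain or Qdrant.

```python
from collections import defaultdict


def top_k_per_organization(scored_chunks, organizations, k=2):
    """Return up to k highest-scoring chunks for each requested organization."""
    by_org = defaultdict(list)
    for score, organization, text in scored_chunks:
        if organization in organizations:
            by_org[organization].append((score, organization, text))

    selected = []
    for organization in organizations:
        ranked = sorted(by_org[organization], key=lambda item: item[0], reverse=True)
        selected.extend(ranked[:k])
    return selected


# Made-up similarity scores for illustration only.
scored_chunks = [
    (0.91, "Apple", "chunk a1"),
    (0.88, "Apple", "chunk a2"),
    (0.86, "Apple", "chunk a3"),
    (0.80, "Cohere", "chunk c1"),
    (0.62, "Cohere", "chunk c2"),
]

balanced = top_k_per_organization(scored_chunks, ["Apple", "Cohere"], k=2)
print([(organization, text) for _, organization, text in balanced])
```

A plain top-four ranking of these scores would return three Apple chunks and one Cohere chunk, which mirrors the imbalance seen in the two-company example.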


Unlike the last example, which worked well, see what happens with a more natural phrasing of the question.  Not only does the system fail to recognize that Cohere is a company, it generates a filter on the word "Cookies" as if that were a document title.  Finally, even though no source documents are returned, the system responds in the affirmative, and we cannot tell whether that answer is a hallucination.

In [78]:
result = chain("Does Cohere use cookies?")
pretty_print_result(result)

query='Cohere' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='title', value='Cookies') limit=None

Query: Does Cohere use cookies?
Organizations: []
Answer:
    Yes, Cohere uses cookies.
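One way to reduce the risk of an unsupported answer like this one is to check whether any source documents came back before trusting the result. The sketch below assumes a result dictionary shaped like the one passed to `pretty_print_result` (with `result` and `source_documents` keys). `guarded_answer` and its fallback message are hypothetical.

```python
def guarded_answer(result, fallback="I don't know; no supporting documents were retrieved."):
    """Return the model's answer only when at least one source document supports it."""
    if not result.get("source_documents"):
        return fallback
    return result["result"]


# Mirrors the shape of the result above: an answer with no source documents.
empty_result = {
    "query": "Does Cohere use cookies?",
    "result": "Yes, Cohere uses cookies.",
    "source_documents": [],
}
print(guarded_answer(empty_result))
```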


But what happens when we ask about a company not in our database?  The filter looks good, but the mechanism is blind to the absence of Microsoft data in the vector database.  Our relevant documents don't include Microsoft.  What if we asked about a pair of companies and the most relevant documents still came from only one?

In [79]:
result = chain("Does Apple use cookies?  Does Microsoft use cookies?")
pretty_print_result(result)

query='cookies' filter=Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Apple'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Microsoft')]) limit=None

Query: Does Apple use cookies?  Does Microsoft use cookies?
Organizations: ['Apple', 'Apple', 'Apple', 'Apple']
Answer:
    Yes, Apple uses cookies. According to their Privacy Policy,
    Apple's websites, online services, interactive applications, and
    advertisements may use cookies and other technologies such as web
    beacons. These technologies help Apple understand user behavior,
    enhance security, measure the effectiveness of advertisements, and
    improve user experience.  As for Microsoft, I don't have
    information about their use of cookies. It's best to refer to
    Microsoft's Privacy Policy or contact Microsoft directly for more
    information on their use of cookies.
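A simple check can surface this blind spot: compare the organizations named in the filter with those present in the retrieved documents. In the sketch below the requested list is written out by hand, and extracting it from the generated filter is left aside. `missing_organizations` is a hypothetical helper.

```python
def missing_organizations(requested, retrieved):
    """Return requested organizations that have no retrieved document, in request order."""
    present = set(retrieved)
    return [name for name in requested if name not in present]


# Values taken from the filter and the organizations list printed above.
requested = ["Apple", "Microsoft"]
retrieved = ["Apple", "Apple", "Apple", "Apple"]
print(missing_organizations(requested, retrieved))
```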


Up to this point we have not included prompts that looked to exclude information.  Let's try some below.  

In the first example, we get responses related to Cohere, Apple, Meta and Hugging Face.

In [82]:
result = chain("How do companies protect my data?")
pretty_print_result(result)

query='data protection' filter=None limit=None

Query: How do companies protect my data?
Organizations: ['Cohere', 'Apple', 'Meta', 'Hugging Face']
Answer:
    Companies protect your data by implementing various safeguards and
    security measures. These measures may include administrative,
    technical, and physical controls to prevent unauthorized access,
    use, modification, and disclosure of your personal information.
    Access to your data is restricted to employees and authorized
    service providers who have a legitimate need to access it for
    their job responsibilities. Companies also follow data retention
    practices, keeping your personal information only for as long as
    necessary and either destroying or anonymizing it afterwards.
    Additionally, companies may have procedures in place to handle
    data subject requests, such as providing access to or updating
    personal information. It is important to note that while companies
    strive to use commerciall

Let's revise the prompt to ask for five examples, cite the relevant companies, and limit itself to the supplied context.  The results are very interesting and impressive.

In [85]:
query = """
How do companies protect my data?
Please provide five examples.
When you give an example, please cite one or more companies that provide the protection you describe.
Do not describe any practice not included in the supplied context.
"""

result = chain(query)
pretty_print_result(result)

query='data protection' filter=None limit=5

Query: 
How do companies protect my data?
Please provide five examples.
When you give an example, please cite one or more companies that provide the protection you describe. 
Do not describe any practice not included in the supplied context.

Organizations: ['Cohere', 'Apple', 'Meta', 'Hugging Face', 'Apple']
Answer:
    Based on the provided context, here are five examples of how
    companies protect data:  1. Implementing Safeguards: Companies
    like Apple have implemented reasonable administrative, technical,
    and physical measures to safeguard personal information against
    theft, loss, and unauthorized access. (Source: Apple)  2.
    Restricting Access: Companies limit access to personal information
    on a need-to-know basis, ensuring that only employees and
    authorized service providers who require access for their job
    responsibilities can access the data. (Source: Apple)  3. Data
    Minimization: Companies collect an

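Now we revise the prompt to exclude Apple.  The filter looks good: it wraps the comparison `organization` equals `'Apple'` in a `NOT` operation, and Apple no longer appears among the retrieved organizations.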

In [87]:
query = """
How do companies protect my data?
Please provide five examples.
When you give an example, please cite one or more companies that provide the protection you describe.
Do not describe any practice not included in the supplied context.
I am not interested in information about Apple's practices.
"""

result = chain(query)
pretty_print_result(result)

query='data protection' filter=Operation(operator=<Operator.NOT: 'not'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Apple')]) limit=5

Query: 
How do companies protect my data?
Please provide five examples.
When you give an example, please cite one or more companies that provide the protection you describe. 
Do not describe any practice not included in the supplied context.
I am not interested in information about Apple's practices.

Organizations: ['Cohere', 'Meta', 'Hugging Face', 'Google', 'TikTok']
Answer:
    1. One example of how companies protect data is by implementing
    reasonable administrative, technical, and physical measures to
    safeguard personal information against unauthorized access, use,
    modification, and disclosure. This can include measures such as
    encryption, firewalls, and access controls. Hugging Face, as
    mentioned in the context, follows generally accepted industry
    standards and uses appropriate 

Finally, we revise the prompt to exclude both Hugging Face and Apple.

In [89]:
query = """
How do companies protect my data?
Please provide five examples.
When you give an example, please cite one or more companies that provide the protection you describe.
Do not describe any practice not included in the supplied context.
I am not interested in information about Hugging Face or Apple.
"""

result = chain(query)
pretty_print_result(result)

query='data protection' filter=Operation(operator=<Operator.NOT: 'not'>, arguments=[Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Hugging Face'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Apple')])]) limit=5

Query: 
How do companies protect my data?
Please provide five examples.
When you give an example, please cite one or more companies that provide the protection you describe. 
Do not describe any practice not included in the supplied context.
I am not interested in information about Hugging Face or Apple.

Organizations: ['Cohere', 'Meta', 'Google', 'TikTok', 'Meta']
Answer:
    Based on the provided context, here are five examples of how
    companies protect data:  1. Safeguards: Companies like Google
    implement reasonable administrative, technical, and physical
    measures to safeguard personal information against theft, loss,
    and unauthorized access, us
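
To make the two exclusion filters above concrete, the sketch below evaluates the same `NOT` / `OR` / `EQ` structure against document metadata. The `Comparison` and `Operation` dataclasses are simplified, hypothetical stand-ins for the objects printed in the structured queries, not LangChain's own classes, and the sample metadata is made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Comparison:
    """Simplified stand-in for the Comparison shown in the structured query."""
    comparator: str  # only "eq" is handled in this sketch
    attribute: str
    value: str


@dataclass
class Operation:
    """Simplified stand-in for the Operation shown in the structured query."""
    operator: str  # "not", "or", or "and"
    arguments: List[Union["Operation", Comparison]]


def matches(node, metadata):
    """Return True if the metadata dict satisfies the filter tree."""
    if isinstance(node, Comparison):
        if node.comparator != "eq":
            raise ValueError(f"unsupported comparator: {node.comparator}")
        return metadata.get(node.attribute) == node.value
    results = [matches(arg, metadata) for arg in node.arguments]
    if node.operator == "not":
        return not results[0]
    if node.operator == "or":
        return any(results)
    if node.operator == "and":
        return all(results)
    raise ValueError(f"unsupported operator: {node.operator}")


# Same shape as the filter produced for "not Hugging Face or Apple".
exclude_two = Operation("not", [
    Operation("or", [
        Comparison("eq", "organization", "Hugging Face"),
        Comparison("eq", "organization", "Apple"),
    ])
])

# Illustrative metadata only; the real collection holds privacy policy chunks.
names = ["Cohere", "Apple", "Meta", "Hugging Face", "Google"]
docs = [{"organization": name} for name in names]
kept = [d["organization"] for d in docs if matches(exclude_two, d)]
print(kept)
```

In the notebook itself this evaluation is delegated to the vector database, which receives a translated form of the filter; the sketch only shows what the filter means.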

Let's try one more time to see how the system responds when we ask about a company not represented in our collection.

In [90]:
query = """
How does Microsoft protect my data?
Limit your resposne to informatoin in the supplied context.
"""

result = chain(query)
pretty_print_result(result)

query='Microsoft protect data' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Microsoft') limit=1

Query: 
How does Microsoft protect my data?
Limit your resposne to informatoin in the supplied context.

Organizations: []
Answer:
    I'm sorry, but I don't have enough information to answer your
    question.
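
The empty `Organizations` list shows that the filter on `organization` equals `'Microsoft'` matched nothing, so the model had no context to work with. One possible way to make that case explicit is to check for an empty retrieval before calling the model at all. The sketch below assumes this design; `answer_with_guard`, `fake_retrieve`, and `fake_generate` are hypothetical names for illustration and are not part of the notebook's chain or of LangChain.

```python
from typing import Callable, Dict, List

NO_CONTEXT_ANSWER = "No documents in the collection match this query."


def answer_with_guard(
    question: str,
    retrieve: Callable[[str], List[Dict]],
    generate: Callable[[str, List[Dict]], str],
) -> Dict:
    """Skip generation when retrieval returns no documents."""
    docs = retrieve(question)
    if not docs:
        # Nothing matched the filter, so do not ask the model to answer.
        return {"query": question, "organizations": [], "answer": NO_CONTEXT_ANSWER}
    return {
        "query": question,
        "organizations": [d["organization"] for d in docs],
        "answer": generate(question, docs),
    }


# Hypothetical stand-ins for the retriever and the generative model.
def fake_retrieve(question: str) -> List[Dict]:
    collection = [
        {"organization": "Apple", "text": "sample text"},
        {"organization": "Meta", "text": "sample text"},
    ]
    # Crude filter: keep documents whose organization is named in the question.
    return [d for d in collection if d["organization"] in question]


def fake_generate(question: str, docs: List[Dict]) -> str:
    return f"Answer based on {len(docs)} document(s)."


result = answer_with_guard("How does Microsoft protect my data?", fake_retrieve, fake_generate)
print(result["organizations"], result["answer"])
```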


If you experiment with the self-query mechanism, you will likely conclude that it is fragile.  The implementations also seem to vary by vector database.  For example, when this notebook was created, Chroma, a popular vector database for simple LangChain examples, did not support self-query operations that result in the use of the `NOT` operator, even though the query parser recognizes when that operator should be used and Chroma supports that operator directly.
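
One possible workaround when a translator rejects `NOT` is to push the negation down to the leaf comparisons using De Morgan's laws, so that the final filter contains only `$eq`, `$ne`, `$and`, and `$or`. The sketch below is a minimal illustration under that assumption: it uses plain tuples as a hypothetical filter representation and emits a Chroma-style `where` dictionary. It is not LangChain's translator, and whether a given store treats `$ne` the same way for documents that lack the attribute should be checked against that store's documentation.

```python
def to_where(node, negate=False):
    """Translate a small filter tree into a `where` dict that avoids NOT.

    Nodes are tuples: ("eq", attribute, value), ("not", child),
    ("and", child, ...), or ("or", child, ...).
    """
    kind = node[0]
    if kind == "eq":
        _, attribute, value = node
        return {attribute: {"$ne" if negate else "$eq": value}}
    if kind == "not":
        # Flip the pending negation instead of emitting a NOT operator.
        return to_where(node[1], not negate)
    if kind in ("and", "or"):
        if negate:
            # De Morgan: not (a or b) == (not a) and (not b), and vice versa.
            kind = "and" if kind == "or" else "or"
        return {"$" + kind: [to_where(child, negate) for child in node[1:]]}
    raise ValueError(f"unsupported node type: {kind}")


# Same logic as the filter generated for "not Hugging Face or Apple".
exclude_two = (
    "not",
    ("or",
     ("eq", "organization", "Hugging Face"),
     ("eq", "organization", "Apple")),
)
where = to_where(exclude_two)
print(where)
```

A dictionary of this shape could then be passed as a metadata filter when querying the store directly, bypassing the self-query translation step for the negated case.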