<a href="https://colab.research.google.com/github/RERobbins/data_science_266_sandbox/blob/main/3_QA_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Introduction to Question Answering Using Retrieval Augmented Generation

In our introduction to vector databases, we noted that retrieval augmented generation, or "RAG", is used in tasks like question answering. With RAG, a retrieval component first selects a set of relevant documents or passages from a larger corpus, and then a generation component generates the final response based on the selected information. This approach aims to combine the accuracy of retrieval with the flexibility of generation.

This notebook builds on our work with vector databases.  We will take a query from the user and present it to the vector database. Then, we present that query and the relevant documents to the generative model to generate an answer.
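
To make the retrieve-then-generate flow concrete, here is a minimal, self-contained sketch. It is not how this notebook performs retrieval: word overlap stands in for embedding similarity, no model is called, and the function names (`tokenize`, `retrieve`, `build_prompt`) and sample passages are hypothetical.

```python
import re

# Toy corpus standing in for chunks stored in a vector database (made-up text).
PASSAGES = [
    "We use cookies to remember your preferences.",
    "We do not sell your personal data to third parties.",
    "You can contact our privacy team by email.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, passages, k=2):
    """Rank passages by word overlap with the query (a stand-in for vector similarity)."""
    query_words = tokenize(query)
    ranked = sorted(
        passages,
        key=lambda passage: len(query_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, context_passages):
    """Combine the retrieved passages and the question into one prompt for a generative model."""
    context = "\n".join(context_passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


query = "Do you sell my personal data?"
top_passages = retrieve(query, PASSAGES, k=1)
prompt = build_prompt(query, top_passages)
print(prompt)
```

In the rest of the notebook, the vector database plays the role of `retrieve` and LangChain assembles the prompt and calls the generative model.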

We will make use of LangChain to coordinate this activity and explore model prompting.

We will answer questions about the privacy policies we have been working with.

# Generative Models, Embedding Models and Vector Databases

This notebook assumes that you use the same generative model throughout.  You will rely on the API keys you needed to work on the Embeddings notebook.

This notebook uses the same SentenceTransformer embedding model we used in the Vector Database notebook.  As was the case with that notebook, you are free to select another embedding model.

Finally, this notebook will use the Qdrant vector database.  You are, of course, free to experiment with other vector databases as we discussed before.

# Setup

## Environment Related Helpers

This portion of the notebook includes `install_if_needed`, which installs a single package or a list of packages with `pip` only if necessary, and `running_in_colab`, a predicate that returns `True` if the notebook is running in Google Colab.

In [None]:
import importlib


def install_if_needed(package_names):
    """
    Install one or more Python packages using pip if they are not already installed.

    Args:
        package_names (str or list): The name(s) of the package(s) to install.

    Returns:
        None
    """
    if isinstance(package_names, str):
        package_names = [package_names]

    for package_name in package_names:
        try:
            importlib.import_module(package_name)
            print(f"{package_name} is already installed.")
        except ImportError:
            !pip install --quiet {package_name}
            print(f"{package_name} has been installed.")


def running_in_colab():
    """
    Check if the Jupyter Notebook is running in Google Colab.

    Returns:
        bool: True if running in Google Colab, False otherwise.
    """
    try:
        import google.colab

        return True
    except ImportError:
        return False
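
One caveat with `install_if_needed`: it passes the pip name to `importlib.import_module`, and some packages import under a different name (for example, `python-dotenv` is imported as `dotenv`). For those packages the check fails and `pip` runs even when the package is present. A minimal sketch of one way to handle that follows; `IMPORT_NAMES` and `is_importable` are hypothetical helpers, not part of the notebook.

```python
import importlib.util

# Hypothetical mapping from pip package names to import names; extend as needed.
IMPORT_NAMES = {
    "python-dotenv": "dotenv",
    "qdrant-client": "qdrant_client",
}


def is_importable(package_name):
    """Return True if the module behind a pip package name can be located."""
    module_name = IMPORT_NAMES.get(package_name, package_name)
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


print(is_importable("json"))
```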

## Mount Google Drive

By default, the data you create in Google Colaboratory does not persist from session to session.  Each session runs in a virtual machine and when that machine goes away, so does your data.  If you want your data to persist, you must store it outside the virtual machine. Google Drive can be used for that purpose.  We use it later in this notebook to store the OpenAI and Cohere API keys.

In [None]:
if running_in_colab():
    from google.colab import drive

    drive.mount("drive")

Mounted at drive


## API Keys

In [None]:
install_if_needed("python-dotenv")

python-dotenv has been installed.


In [None]:
import os
import getpass

from dotenv import load_dotenv, find_dotenv

def env_file_path(
    colab_path="/content/drive/MyDrive/.env", other_path=find_dotenv()
):
    """
    Returns the appropriate file path for the environment variables file (.env) based on the execution environment.

    This function is designed to determine the correct path for the environment variables file
    depending on whether the code is running in Google Colab or in a different environment.

    Args:
        colab_path (str, optional): The file path for the environment variables file in Google Colab.
            Default is '/content/drive/MyDrive/.env'.

        other_path (str, optional): The file path for the environment variables file in other environments.
            Default is the path found by `find_dotenv()`.

    Returns:
        str: The file path for the environment variables file (.env).
    """

    return colab_path if running_in_colab() else other_path

In [None]:
load_dotenv(env_file_path())
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
COHERE_API_KEY = os.environ["COHERE_API_KEY"]
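
Note that `os.environ["OPENAI_API_KEY"]` raises a `KeyError` if the variable is missing from the `.env` file. A minimal sketch of a fallback that prompts for the key instead is shown below; `get_api_key` is a hypothetical helper, `EXAMPLE_API_KEY` is a made-up variable name, and the `prompt_fn` parameter exists only so the function can be exercised without a terminal.

```python
import getpass
import os


def get_api_key(name, prompt_fn=getpass.getpass):
    """Return the key from the environment, prompting for it only if it is missing."""
    value = os.environ.get(name)
    if not value:
        value = prompt_fn(f"Enter {name}: ")
        os.environ[name] = value  # Cache for the rest of the session.
    return value


# Demonstration with a stand-in prompt function and a made-up variable name.
os.environ.pop("EXAMPLE_API_KEY", None)
key = get_api_key("EXAMPLE_API_KEY", prompt_fn=lambda _: "not-a-real-key")
print(key)
```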

## LangChain

In [None]:
install_if_needed("langchain")

langchain has been installed.


## GPU Support (Optional)

In [None]:
import tensorflow as tf

print("GPU Available:", tf.config.list_physical_devices("GPU"))

GPU Available: []


In [None]:
install_if_needed("torch")
import torch

print("CUDA Available:", torch.cuda.is_available())

torch is already installed.
CUDA Available: False


## Embeddings

Instantiate the embeddings model.  

In [None]:
packages = [
    "openai",
    "cohere",
    "tiktoken",
    "transformers",
    "sentence_transformers",
]

install_if_needed(packages)

import openai, tiktoken

import cohere
from langchain.chat_models import ChatOpenAI
from langchain.llms import Cohere

from langchain.embeddings import HuggingFaceEmbeddings
from transformers import AutoTokenizer

openai has been installed.
cohere has been installed.
tiktoken has been installed.

In [None]:
st_model_name = "multi-qa-mpnet-base-cos-v1"
st_embeddings_model = HuggingFaceEmbeddings(model_name=st_model_name)
st_tokenizer = AutoTokenizer.from_pretrained(f"sentence-transformers/{st_model_name}")

embeddings_model = st_embeddings_model


## Vector Database

In [None]:
install_if_needed("qdrant-client")
from langchain.vectorstores import Qdrant

import qdrant_client


# Generative Model

OpenAI trial accounts expire after three months and provide access to `gpt-3.5-turbo` but not `gpt-4`.  Paid OpenAI accounts permit use of `gpt-4` as well and do not expire.  Cohere trial accounts do not expire, but the API rate limiting is more significant than OpenAI trial account rate limiting.

Set the `LLM` variable below to reflect the generative model you want to use.  

The results in most of the examples below will vary with your choice.


In [None]:
# LLM = Cohere(model="command", temperature=0)
LLM = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
# LLM = ChatOpenAI(model="gpt-4", temperature=0)

# Load Documents, Split Into Chunks, Create Vector Database

**If you saved your Qdrant database when you worked on the vector database notebook, you can skip this section and use the Load Persistent Vector Database section below.**

Otherwise, we use the following cells to download the privacy policies we have been working with and split them into chunks to be stored in the vector database.

In [None]:
install_if_needed(["pypdf", "unstructured"])

import textwrap
from langchain.document_loaders import PyPDFLoader, UnstructuredURLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

pypdf has been installed.
unstructured has been installed.


In [None]:
import pandas as pd

policy_data = [
    ("Apple",
     "Privacy Policy",
     "https://www.apple.com/legal/privacy/pdfs/apple-privacy-policy-en-ww.pdf",
    ),
    ("Cohere", "Privacy Policy", "https://cohere.com/privacy"),
    ("Google",
     "Privacy Policy",
     "https://static.googleusercontent.com/media/www.google.com/en//intl/en/policies/privacy/google_privacy_policy_en.pdf",
    ),
    ("Hugging Face", "Privacy Policy", "https://huggingface.co/privacy"),
    ("Meta",
     "Privacy Policy",
     "https://about.fb.com/wp-content/uploads/2022/07/Privacy-Within-Metas-Integrity-Systems.pdf",
    ),
    ("Threads", "Privacy Policy", "https://terms.threads.com/privacy-policy"),
    ("TikTok",
     "Privacy Policy",
     "https://www.tiktok.com/legal/page/us/privacy-policy/en",
    ),]

columns = ["organization", "title", "url"]

policy_df = pd.DataFrame(policy_data, columns=columns)

In [None]:
def get_chunks(url, organization, title, chunk_size=2000, chunk_overlap=500):
    """
    This function takes a url to an organization's web page, organization name,
    and document title and returns chunks constructed from the target url.
    The function adds the url, the organization name and the document title
    as metadata to the chunks.

    Parameters:
    url (string): Target page.
    organization (string): Organization name.
    title (string): Document title.
    chunk_size (int, optional): Chunk size, default is 2000 characters.
    chunk_overlap (int, optional): Chunk overlap, default is 500 characters.

    Returns:
    list of chunks
    """

    # Use PyPDFLoader for pdf targets, otherwise UnstructuredURLLoader
    if os.path.splitext(url)[1] == ".pdf":
        loader = PyPDFLoader(url)
    else:
        loader = UnstructuredURLLoader([url])

    documents = loader.load()
    for document in documents:
        metadata = document.metadata
        metadata["url"] = url
        metadata["organization"] = organization
        metadata["title"] = title
        if metadata.get("page", None) is not None:
            metadata["page"] += 1

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )

    return text_splitter.split_documents(documents)


def explore_documents(documents):
    block_indent = "   "
    metadata = documents[0].metadata
    content = documents[0].page_content[:300] + ". . ."
    print(f"{metadata['organization']} {metadata['title']} {len(documents)} chunks")
    print("Truncated First chunk:")
    print(
        textwrap.fill(
            content,
            initial_indent=block_indent,
            subsequent_indent=block_indent,
            replace_whitespace=True,
        )
    )
    print()
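
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks rather than at exact character offsets, and `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text.
    return chunks


sample_chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(sample_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one. The overlap serves the same purpose in `get_chunks`: a sentence that straddles a boundary still appears whole in at least one chunk.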

In [None]:
chunks = []

for row in policy_df.itertuples(index=False):
    policy_chunks = get_chunks(row.url, row.organization, row.title)
    explore_documents(policy_chunks)
    chunks += policy_chunks

Apple Privacy Policy 18 chunks
Truncated First chunk:
   Apple Privacy Policy Apple’s Privacy Policy describes how Apple
   collects, uses, and shares your personal data. Updated December 22,
   2022 In addition to this Privacy Policy, we provide data and
   privacy information embedded in our products and certain features
   that ask to use your personal data. This . . .



Cohere Privacy Policy 10 chunks
Truncated First chunk:
   Products  For Developers  For Business  Pricing  Blog  Company  Try
   now  Cohere Privacy Policy  Last Update: Aug 4, 2023  Cohere Inc.
   (“Cohere”) values and respects your privacy. We have prepared this
   privacy policy to explain the manner in which we collect, use, and
   disclose personal information th. . .

Google Privacy Policy 20 chunks
Truncated First chunk:
   Privacy Policy Last modified: December 18, 2017 ( view archived
   versions ) (The hyperlinked examples are available at the end of
   this document.) There are many different ways you can use our
   services – to search for and share information, to communicate with
   other people or to create new content. Wh. . .

Hugging Face Privacy Policy 12 chunks
Truncated First chunk:
   Terms of Service  Privacy Policy  Content Policy  Code of Conduct
   Hugging Face Privacy Policy  🗓 Effective Date: March 28, 2023  We
   have implemented this Privacy Policy because

In [None]:
print(f"There are {len(chunks)} chunks.")

There are 140 chunks.


In [None]:
%%time

collection_name = "my_collection"

vectordb = Qdrant.from_documents(
    documents=chunks,
    embedding=embeddings_model,
    location=":memory:",
    collection_name=collection_name,
)

CPU times: user 5.9 s, sys: 890 ms, total: 6.79 s
Wall time: 12.1 s


In [None]:
# Confirm that we have the same number of vectors in the vector database as we have chunks.

assert vectordb.client.get_collection(collection_name).vectors_count == len(chunks)

# Load Persistent Vector Database

**If you saved your Qdrant database when you worked on the vector database notebook, you can use this section instead of the Load Documents, Split Into Chunks, Create Vector Database section above.**

If you execute this code block more than once in a session, you are likely to get an error indicating that your vector database is already accessed by another instance of the Qdrant client and that, if you require concurrent access, you should use Qdrant server instead.

The prior section should always work in lieu of this section.

In [None]:
collection_name = "my_collection"
qdrant_database_location = "/content/drive/MyDrive/my_qdrant"

client = qdrant_client.QdrantClient(path=qdrant_database_location)

vectordb = Qdrant(
    client=client,
    collection_name=collection_name,
    embeddings=embeddings_model,
)

In [None]:
assert vectordb.client.get_collection(collection_name).vectors_count == 140

# Prompting a Model

Before we introduce working with the vector database, let's experiment with some simple model prompts.  We will pass a string to Cohere's `command` model, which is its default generative model, and see how it responds.  We will do the same thing with OpenAI's `gpt-3.5-turbo`.

The responses are based on information the models were trained on.  We don't know if they are accurate, and sources are not presented.  The Cohere response for Threads doesn't seem to relate to the Threads social media platform, and the OpenAI model isn't able to find Threads information.

In [None]:
from langchain.schema import HumanMessage
import textwrap

cohere_llm = Cohere(model="command", temperature=0)
openai_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

In [None]:
query = "Does Apple sell my personal data?"
result = cohere_llm(query)
print(textwrap.fill(result.strip()))

Apple has a strong commitment to protecting the privacy of its
customers. The company does not sell personal data to third parties.
However, it does collect and use data to provide services and improve
products.  For example, Apple collects data about how people use their
devices, such as which apps are used and how often. This data is used
to improve the user experience and develop new features.  Apple also
collects data about customers' purchasing habits, which is used to
improve marketing and develop new products.  In addition, Apple
collects data about customers' location, which is used to provide
location-based services and improve maps.  Apple's commitment to
privacy is a key selling point for many customers. The company's
privacy policies are designed to protect customers' personal data and
ensure that it is used only for legitimate purposes.


In [None]:
query = "Does Threads sell my personal data?"
result = cohere_llm(query)
print(textwrap.fill(result.strip()))

Threads does not sell your personal data. We are committed to
protecting your privacy and will never share your personal information
with third parties without your permission.  We may collect and use
your personal information for the following purposes:  - To provide
you with the services you request - To improve our services and
products - To communicate with you about our services and products -
To protect our rights and property - To comply with legal obligations
We will never share your personal information with third parties
without your permission, except as required by law. We will never sell
your personal information to third parties.  If you have any questions
or concerns about how we use your personal information, please contact
us at support@threadscanada.com.


In [None]:
messages = [HumanMessage(content="Does Apple sell my personal data?")]
print(textwrap.fill(openai_llm(messages).content))

Apple has a strong commitment to privacy and has stated that it does
not sell personal data to third parties. Apple's business model
primarily relies on selling hardware, software, and services rather
than monetizing user data. However, it is important to note that Apple
does collect some user data for various purposes, such as improving
its products and services, but it is typically anonymized and used in
an aggregated form to protect user privacy.


In [None]:
messages = [HumanMessage(content="Does Threads sell my personal data?")]
print(textwrap.fill(openai_llm(messages).content))

As an AI language model, I don't have access to specific company
policies or practices. However, it is important to note that I am
developed by OpenAI and designed to respect user privacy and
confidentiality. My primary function is to provide information and
answer questions to the best of my knowledge and abilities. If you
have concerns about data privacy, it is recommended to review the
privacy policy of Threads or contact the company directly for more
information.


# LangChain PromptTemplate

A prompt template is a reproducible way to generate prompts. It's essentially a text string that can take in a set of parameters from the end user and generate a prompt accordingly.  Let's shift to LangChain chains by using the simplest of templates.  In these examples, we use the large language model you selected above.  Remember, the model is generating responses based on its training data.
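
Conceptually, a prompt template is a string with named placeholders. The following sketch uses plain Python string formatting to show the idea; it is not LangChain's implementation, and `fill_template` is a hypothetical helper.

```python
template = "Question: {question} Answer:"


def fill_template(template, **variables):
    """Substitute the named placeholders in the template with the supplied values."""
    return template.format(**variables)


filled = fill_template(template, question="Does Apple sell my personal data?")
print(filled)  # Question: Does Apple sell my personal data? Answer:
```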

In [None]:
from langchain import PromptTemplate, LLMChain

In [None]:
template = """Question: {question} Answer:"""
prompt = PromptTemplate(template=template, input_variables=["question"])

chain = LLMChain(prompt=prompt, llm=LLM)

Let's inspect the prompt included inside the chain.

In [None]:
chain.prompt

PromptTemplate(input_variables=['question'], output_parser=None, partial_variables={}, template='Question: {question} Answer:', template_format='f-string', validate_template=True)

In [None]:
query = "Does Apple sell my personal data?"
print(textwrap.fill(chain.run(query)).strip())

No, Apple does not sell your personal data. Apple has a strong
commitment to privacy and has implemented various measures to protect
user data.


In [None]:
query = "Does Threads sell my personal data?"
print(textwrap.fill(chain.run(query)).strip())

No, Threads does not sell your personal data.


# LangChain RetrievalQA Chain

Now we introduce our vector database and the LangChain Retrieval QA chain, a chain for question answering against a database of information.  We will also supply our own prompt. It would be reassuring if sources were identified.

In [None]:
from langchain.chains import RetrievalQA

In [None]:
template = """Use the following pieces of context to answer the question at the end.
Your answer should be as concise as possible and ideally not more than one sentence.
If you don't know the answer, just say that you don't know.

{context}

Question: {question}

Answer:"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

chain = RetrievalQA.from_chain_type(
    LLM, retriever=vectordb.as_retriever(), chain_type_kwargs={"prompt": prompt}
)

Before we call the chain, let's inspect our template and the retriever it includes.

In [None]:
print(chain.combine_documents_chain.llm_chain.prompt.template)

Use the following pieces of context to answer the question at the end.
Your answer should be as concise as possible and ideally not more than one sentence.
If you don't know the answer, just say that you don't know.

{context}

Question: {question}

Answer:


In [None]:
chain.retriever

VectorStoreRetriever(tags=['Qdrant', 'HuggingFaceEmbeddings'], metadata=None, vectorstore=<langchain.vectorstores.qdrant.Qdrant object at 0x7eb51418e680>, search_type='similarity', search_kwargs={})
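
With the `stuff` chain type, which `from_chain_type` uses when no `chain_type` is given, the text of the retrieved chunks is placed into the `{context}` slot of the prompt. Here is a minimal sketch of that step, assuming the chunks are joined with blank lines. The sample chunk texts and the `stuff_prompt` helper are hypothetical, and the real chain may format documents differently.

```python
TEMPLATE = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know.

{context}

Question: {question}

Answer:"""


def stuff_prompt(template, chunk_texts, question):
    """Join the retrieved chunk texts and insert them, with the question, into the template."""
    context = "\n\n".join(chunk_texts)
    return template.format(context=context, question=question)


# Made-up chunk texts standing in for the retriever's results.
retrieved = ["Chunk one about cookies.", "Chunk two about data sharing."]
final_prompt = stuff_prompt(TEMPLATE, retrieved, "Does Apple use cookies?")
print(final_prompt)
```

Because every retrieved chunk is pasted into a single prompt, the number and size of chunks must fit within the model's context window.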

In [None]:
query = "Does Apple sell my personal data?"
result = chain.run(query)
print(textwrap.fill(result.strip()))

No, Apple does not sell your personal data.


In [None]:
query = "Does TikTok sell my personal data?"
result = chain.run(query)
print(textwrap.fill(result.strip()))

No.


Let's use the same prompt and get source documents too.

In [None]:
chain = RetrievalQA.from_chain_type(
    llm=LLM,
    retriever=vectordb.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)

In [None]:
query = "Does Apple sell my personal data?"
result = chain(query)

Now, instead of returning a string, the result is a dictionary with three keys, `query`, `result` and `source_documents`.

In [None]:
result.keys()

dict_keys(['query', 'result', 'source_documents'])

In [None]:
print(textwrap.fill(result["result"].strip()))

No, Apple does not sell your personal data.


Let's examine the `organization` field for the source documents from that result.

In [None]:
[source.metadata["organization"] for source in result["source_documents"]]

['Apple', 'Apple', 'Apple', 'Apple']
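
Since each source document carries the metadata we attached in `get_chunks`, an answer can be shown together with its citations. The following sketch uses plain dictionaries in place of LangChain documents; the `format_sources` helper and the sample metadata values are hypothetical.

```python
def format_sources(source_metadata):
    """Return one citation line per unique (organization, title, url, page) combination."""
    seen = set()
    lines = []
    for metadata in source_metadata:
        key = (
            metadata["organization"],
            metadata["title"],
            metadata["url"],
            metadata.get("page"),
        )
        if key in seen:
            continue  # Skip chunks that come from the same page of the same document.
        seen.add(key)
        page = f", page {metadata['page']}" if metadata.get("page") is not None else ""
        lines.append(
            f"{metadata['organization']} {metadata['title']}{page} ({metadata['url']})"
        )
    return lines


# Made-up metadata shaped like the fields this notebook stores on each chunk.
sample = [
    {"organization": "Apple", "title": "Privacy Policy", "url": "https://example.com/a.pdf", "page": 3},
    {"organization": "Apple", "title": "Privacy Policy", "url": "https://example.com/a.pdf", "page": 3},
    {"organization": "Cohere", "title": "Privacy Policy", "url": "https://example.com/c"},
]
citations = format_sources(sample)
print("\n".join(citations))
```

With the chain above, the input would be something like `[source.metadata for source in result["source_documents"]]`.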

Let's ask a general question and then say we only care about Apple and Hugging Face.  

In [None]:
query = "Do companies use cookies?  I only care about Apple and Hugging Face."
result = chain(query)

In [None]:
print(textwrap.fill(result["result"].strip()))

Yes, both Apple and Hugging Face use cookies.


In [None]:
[source.metadata["organization"] for source in result["source_documents"]]

['Apple', 'Hugging Face', 'Apple', 'Hugging Face']

Let's ask about Microsoft.  Remember, we have not loaded the Microsoft policy.  Notice that the answer only mentions Apple, even though we also asked about Cohere, and that the source documents include chunks from Hugging Face and Threads, which aren't relevant.  This really isn't a great answer, and the sources used do not inspire confidence.

In [None]:
query = "Do companies use cookies?  I only care about Apple, Cohere and Microsoft."
result = chain(query)

In [None]:
print(textwrap.fill(result["result"].strip()))

Yes, Apple uses cookies.


In [None]:
[source.metadata["organization"] for source in result["source_documents"]]

['Apple', 'Apple', 'Hugging Face', 'Threads']

This seems like a good place to make use of our metadata.  LangChain provides one way to do that, which we explore in the next section.

# Self-Querying Retriever

A self-querying retriever is one that, as the name suggests, has the ability to query itself. Specifically, given any natural language query, the retriever uses a query-constructing LLM chain to write a structured query and then applies that structured query to its underlying vector store. This allows the retriever not only to use the user's query for semantic similarity comparison with the contents of stored documents, but also to extract filters on the metadata of stored documents from the query and to execute those filters.

The self-querying retriever's arguments include descriptions of the metadata fields and the document content.
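
To illustrate what applying a structured query means, the sketch below filters toy records on their metadata, as would happen before any similarity ranking. The tuple-based filter format and the `matches` function are hypothetical simplifications, not LangChain's own filter classes.

```python
def matches(metadata, structured_filter):
    """Evaluate a tiny filter language: ("eq", field, value) or ("or", [filters...])."""
    if structured_filter is None:
        return True  # No filter: every document is a candidate.
    operator = structured_filter[0]
    if operator == "eq":
        _, field, value = structured_filter
        return metadata.get(field) == value
    if operator == "or":
        return any(matches(metadata, sub_filter) for sub_filter in structured_filter[1])
    raise ValueError(f"Unsupported operator: {operator}")


# Toy metadata records standing in for stored chunks.
records = [
    {"organization": "Apple"},
    {"organization": "Meta"},
    {"organization": "Threads"},
]

meta_only = [r for r in records if matches(r, ("eq", "organization", "Meta"))]
apple_or_microsoft = [
    r
    for r in records
    if matches(
        r,
        ("or", [("eq", "organization", "Apple"), ("eq", "organization", "Microsoft")]),
    )
]
print(meta_only, apple_or_microsoft)
```

A filter can only keep documents that exist: an `or` filter that names an organization with no stored documents simply returns the other organization's records.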

In [None]:
install_if_needed("lark")

lark has been installed.


In [None]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="organization",
        description="The company or organization that created the document.  It describes that company's policy.",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="title",
        description="The title of the document",
        type="string",
    ),
    AttributeInfo(
        name="url",
        description="The url for the document",
        type="string",
    ),
]
document_content_description = "A policy"

retriever = SelfQueryRetriever.from_llm(
    LLM,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True,
    enable_limit=True,
)

We will use the `get_relevant_documents` method provided by the `SelfQueryRetriever` class and examine both the metadata filter generated and the organization for the relevant documents retrieved.

As you look at the examples below and substitute your own, you should discover that the approach is not consistently reliable.  In some cases, the system seems to fail to understand that a word in a query is an organization.  Sometimes revising the prompt a little to make that distinction more apparent helps.

Does this suggest that using our generative models to do entity extraction is perhaps not the best way to proceed?

In the next example, the system does not identify Meta as an organization, so no filter is generated.  Nevertheless, three of the four retrieved documents come from Meta, and the first comes from Threads, one of Meta's businesses.

In [None]:
documents = retriever.get_relevant_documents("How does Meta protect my data?")
[document.metadata["organization"] for document in documents]



query='Meta protect my data' filter=None limit=None


['Threads', 'Meta', 'Meta', 'Meta']

However, when we change the query to make more explicit that we are talking about the company named Meta, the filter we want is generated and the documents are limited to Meta.

In [None]:
documents = retriever.get_relevant_documents("How does the company named Meta protect my data?")
[document.metadata["organization"] for document in documents]

query='Meta data protection' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Meta') limit=None


['Meta', 'Meta', 'Meta', 'Meta']

One of the subtle and even more interesting things about LangChain's self-querying retriever is that the query itself can specify the number of documents to fetch.  We enabled that by passing `enable_limit=True` to `SelfQueryRetriever.from_llm`.  See the relevant documentation [here](https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/qdrant_self_query).

In the next example, the prompt asks for five examples, the query has `limit=5` and we get five results instead of the default of four.

In [None]:
documents = retriever.get_relevant_documents("How does the company named Meta protect my data? I want five examples.")
[document.metadata["organization"] for document in documents]

query='Meta protect my data' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Meta') limit=5


['Meta', 'Meta', 'Meta', 'Meta', 'Meta']

But we don't know if the relevant documents say the same thing.  Let's take a look.

In [None]:
for document in documents:
    print(textwrap.fill(document.page_content))
    print()

July 2022   Privacy within Meta’s   Integrity Systems   Why user
rights are at the center   of our safety and security approach

and integrity issues we see across Meta, 2) what people and
governments are asking social   media companies to do on both privacy
and safety, and 3) the process where we assess privacy   concerns and
ensure adequate protections in tools built for safety.   Meta is
committed to reducing bad experiences on our services.

The kind of harms and negative experiences that Meta seeks to prevent
on our services through   our Community Standards are not new, not
unique to the internet, and not unique to Meta.  8   Academics,
regulators, and non-profit organizations have been tackling questions
of safety   5

Privacy is a core value in safety and security enforcement.  5   Meta
is committed to reducing bad experiences on our services.  5   The
regulatory environment for privacy, free speech, and safety is
shifting.  7   Meta’s Privacy Review offers a process to analyze

Those responses seem reasonably distinct.  But what if they were too similar?  This is where the concept of maximal marginal relevance ("MMR") is useful.  MMR diversifies the results returned by a search algorithm by selecting items that are both relevant to the query and different from each other.  A discussion of MMR is beyond the scope of this notebook.  Moreover, the set of documents we have included consists of a single policy from each organization, so there is relatively little redundancy.
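
Although a full treatment is out of scope, the core selection loop is short.  The following is a minimal sketch of the idea in plain Python, not the implementation used by LangChain or Qdrant.  The function names, the toy two-dimensional vectors, and the `lambda_mult` weighting parameter are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents, trading relevance against redundancy.

    lambda_mult=1.0 ranks purely by relevance to the query; lower values
    penalize documents that resemble ones already selected.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data (hypothetical): documents 0 and 1 are near-duplicates.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]

print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # relevance only
print(mmr_select(query, docs, k=2, lambda_mult=0.3))  # favors diversity
```

With relevance alone, the two near-duplicates are both returned.  With a lower `lambda_mult`, the second pick shifts to the less similar document.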

Now, let's ask about two companies.  The filter seems to be doing the right thing.  However, the set of documents returned is limited to just one of the companies. Maybe that is ok since our question asks if either organization uses cookies.

In [None]:
documents = retriever.get_relevant_documents("Do Apple or Microsoft use cookies?")
[document.metadata["organization"] for document in documents]

query='cookies' filter=Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Apple'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Microsoft')]) limit=None


['Apple', 'Apple', 'Apple', 'Apple']

Even when we ask about each company separately within a single query, the retriever still returns only Apple documents.  The structured query is unchanged: the search text is `cookies`, the filter admits both organizations, and the four closest passages all happen to come from Apple.  This is a hard problem.  Our vector database stores information about single documents, none of which reference other companies (ignoring that Threads is a Meta business).  We need a more sophisticated approach.

In [None]:
documents = retriever.get_relevant_documents("Does Apple use cookies?  Does Microsoft use cookies?")
[document.metadata["organization"] for document in documents]

query='cookies' filter=Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Apple'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Microsoft')]) limit=None


['Apple', 'Apple', 'Apple', 'Apple']
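
One possible direction, sketched here under stated assumptions rather than as the approach this notebook adopts, is to run one retrieval per organization and keep the results separate so that no single organization crowds out the others.  The helper `retrieve_per_organization`, the `toy_search` stand-in, and the placeholder passages are all hypothetical.  In the notebook, `search` could wrap a call such as `retriever.get_relevant_documents`.

```python
from typing import Callable, Dict, List

def retrieve_per_organization(
    question_template: str,
    organizations: List[str],
    search: Callable[[str], List[dict]],
) -> Dict[str, List[dict]]:
    """Issue one search per organization and return the results keyed by organization."""
    return {
        org: search(question_template.format(organization=org))
        for org in organizations
    }

# Hypothetical stand-in for a real retriever: it returns placeholder passages
# whose organization is named in the question.
TOY_CORPUS = [
    {"organization": "Apple", "text": "placeholder passage A"},
    {"organization": "Apple", "text": "placeholder passage B"},
    {"organization": "Microsoft", "text": "placeholder passage C"},
]

def toy_search(question: str) -> List[dict]:
    return [doc for doc in TOY_CORPUS if doc["organization"] in question]

results = retrieve_per_organization(
    "Does {organization} use cookies?", ["Apple", "Microsoft"], toy_search
)
for org, docs in results.items():
    print(org, [doc["text"] for doc in docs])
```

The trade-off is one retrieval call per organization instead of one overall, in exchange for guaranteed coverage of every organization named in the question.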

Up to this point, none of our prompts have tried to exclude information.  Let's try some below.

In the first example, we get responses related to Threads, Cohere and TikTok.

In [None]:
documents = retriever.get_relevant_documents("How do companies protect my data.")
[document.metadata["organization"] for document in documents]

query='data protection' filter=None limit=None


['Threads', 'Cohere', 'TikTok', 'Cohere']

Now we revise the prompt to exclude Threads.

In [None]:
documents = retriever.get_relevant_documents("How do companies protect my data.  I am not interested in information about Threads.")
[document.metadata["organization"] for document in documents]

query='data protection' filter=Operation(operator=<Operator.NOT: 'not'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Threads')]) limit=None


['Cohere', 'TikTok', 'Cohere', 'Apple']

Finally, we revise the prompt to exclude both Threads and Apple.

In [None]:
documents = retriever.get_relevant_documents("How do companies protect my data.  I do not care about Threads or Apple.")
[document.metadata["organization"] for document in documents]

query='data protection' filter=Operation(operator=<Operator.NOT: 'not'>, arguments=[Operation(operator=<Operator.OR: 'or'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Threads'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='organization', value='Apple')])]) limit=None


['Cohere', 'TikTok', 'Cohere', 'Meta']
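
The nested filter in the last structured query reads as "not (Threads or Apple)".  The sketch below mimics how such a filter tree could be evaluated against a document's metadata.  It is a plain-Python illustration, not LangChain's or Qdrant's implementation, and the tuple encoding and the `matches` helper are assumptions made for this example.

```python
def matches(filter_expr, metadata):
    """Evaluate a small filter tree against a metadata dictionary.

    Filters are encoded as tuples (an assumption for this sketch):
      ("eq", attribute, value), ("not", sub_filter),
      ("or", sub_filter, ...), ("and", sub_filter, ...)
    """
    op = filter_expr[0]
    if op == "eq":
        _, attribute, value = filter_expr
        return metadata.get(attribute) == value
    if op == "not":
        return not matches(filter_expr[1], metadata)
    if op == "or":
        return any(matches(arg, metadata) for arg in filter_expr[1:])
    if op == "and":
        return all(matches(arg, metadata) for arg in filter_expr[1:])
    raise ValueError(f"unsupported operator: {op}")

# Mirrors the structured query above: NOT(OR(organization == Threads, organization == Apple))
exclude_threads_and_apple = (
    "not",
    ("or", ("eq", "organization", "Threads"), ("eq", "organization", "Apple")),
)

organizations = ["Threads", "Cohere", "TikTok", "Apple", "Meta"]
kept = [
    org for org in organizations
    if matches(exclude_threads_and_apple, {"organization": org})
]
print(kept)
```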

If you experiment with the self query mechanism, you will likely conclude that it is fragile.  The implementations also seem to vary by vector database.  For example, when this notebook was created, Chroma, a popular vector database for simple LangChain examples, did not support self query operations that result in the use of the `NOT` operator, even though the query parser recognized when that operator should be used and Chroma supported that operator directly.