# Purpose
This notebook explores using metadata filters to retrieve more relevant context within a RAG workflow.  I'll use a [self-querying retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/) from `langchain` to automatically apply metadata filters where appropriate.

# Vector Database

## Document Loading
For this example, I've pulled 8 of my publications from my [Google Scholar profile](https://scholar.google.com/citations?user=tqJqwA4AAAAJ&hl=en) and saved them as PDFs.  The filenames encode metadata in the following format: `<date>_<publication_type>_<journal>_<authorship>.pdf`.  I'll write a function that extracts the metadata from the filename and attaches it to each document during import.  This way the metadata stays attached to each document and, eventually, to each vector in the index.

In [1]:
def get_metadata(filepath):
    # Extract the base filename (without directories)
    base_filename = filepath.split("/")[-1]

    # Split the base filename using the "_" separator
    parts = base_filename.split("_")
    
    # Check if there are exactly four parts
    if len(parts) == 4:
        date, publication_type, journal = parts[0:3]
        # extract year
        year, __, __ = date.split("-")
        # remove file extension from the last part
        authorship, __ = parts[-1].rsplit('.', 1)
        return {
            "year":int(year), 
            "publication_type":publication_type, 
            "journal":journal, 
            "authorship":authorship
        }
    else:
        # Handle the case where the input string doesn't have the expected format
        raise ValueError("Input string doesn't match the expected format")
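
As a quick check of the filename convention, here is a minimal sketch of the same parsing idea using `pathlib`. The function name `parse_publication_filename` is hypothetical (it is not part of the notebook), and the sample filename is taken from the `source` field that appears in the retriever output later on.

```python
from pathlib import Path


def parse_publication_filename(filepath):
    """Parse '<date>_<publication_type>_<journal>_<authorship>.pdf' into a dict.

    Hypothetical helper for illustration; mirrors get_metadata above.
    """
    # Path.stem drops both the directories and the file extension
    stem = Path(filepath).stem
    parts = stem.split("_")
    if len(parts) != 4:
        raise ValueError(
            f"Expected 4 underscore-separated parts, got {len(parts)}: {filepath!r}"
        )
    date, publication_type, journal, authorship = parts
    # The date is assumed to start with the year, e.g. 2017-03-07
    year = int(date.split("-")[0])
    return {
        "year": year,
        "publication_type": publication_type,
        "journal": journal,
        "authorship": authorship,
    }


example = parse_publication_filename(
    "data/docs/2017-03-07_paper_Environmental Science and Technology_second.pdf"
)
print(example)
```

Note that a journal name containing an underscore would break the four-part assumption, which is why both versions raise a `ValueError` instead of guessing.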

In [9]:
import os

def list_files_in_directory(directory_path):
    # Get a list of all files in the specified directory
    files = os.listdir(directory_path)
    
    # Filter out only the files (not directories) with a .pdf extension
    pdf_files = [file for file in files if os.path.isfile(os.path.join(directory_path, file)) and file.lower().endswith(".pdf")]
    
    return pdf_files

In [10]:
data_dir = "data/docs/"
files = list_files_in_directory(data_dir)

In [None]:
# Optional sanity check: parse the metadata for each file found in data_dir

# Loop through each file and apply the get_metadata function
for file in files:
    try:
        metadata = get_metadata(file)
        print(f"File: {file}, Metadata: {metadata}")
    except ValueError as e:
        print(f"Error processing file {file}: {e}")


In [19]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

all_pages = []
for file in files:
    try:
        metadata = get_metadata(file)
        loader = PyPDFLoader(data_dir + file)
        pages = loader.load_and_split()
        # add metadata to each page
        for page in pages:
            page.metadata.update(metadata)
        all_pages.extend(pages)
    except ValueError as e:
        print(f"Error processing file {file}: {e}")

# split pages on blank lines into chunks of about 1000 characters, with 200 characters of overlap
text_splitter = CharacterTextSplitter(        
    separator = "\n\n",
    chunk_size = 1000,
    chunk_overlap  = 200,
    length_function = len,
    is_separator_regex = False,
)
docs = text_splitter.split_documents(all_pages)
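
To make the roles of `chunk_size` and `chunk_overlap` more concrete, here is a simplified pure-Python sketch of separator-based chunking with overlap. It is an illustration only and not LangChain's implementation; the function name `split_with_overlap` and the dict-based document format are assumptions made for this example. It also shows the property that matters most here: every chunk carries a copy of its parent's metadata.

```python
def split_with_overlap(text, metadata, separator="\n\n", chunk_size=1000, chunk_overlap=200):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    Simplified illustration; a single piece longer than chunk_size is kept whole.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The current chunk is full: emit it
            chunks.append(separator.join(current))
            # Keep trailing pieces as overlap, dropping from the front until the
            # overlap is small enough and leaves room for the next piece
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    # Every chunk gets its own copy of the parent document's metadata
    return [{"page_content": c, "metadata": dict(metadata)} for c in chunks]


text = "\n\n".join(["a" * 40, "b" * 40, "c" * 40, "d" * 40])
chunks = split_with_overlap(text, {"year": 2017}, chunk_size=100, chunk_overlap=50)
for chunk in chunks:
    print(len(chunk["page_content"]), chunk["metadata"])
```

In this toy run, each 40-character paragraph appears in two neighbouring chunks, which is the effect the overlap setting is meant to produce.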

## Embed and Store

Many vector stores allow metadata to be used as filters, but I'm going with [Pinecone](https://www.pinecone.io/) here because it offers a very smooth user experience through a managed UI.  Note that the free tier hosts only a single index at a time.  I'll also use OpenAI for the embedding model.

In [21]:
import os
import toml
import pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

secrets = toml.load("secrets.toml")  # read the secrets file once
os.environ["PINECONE_API_KEY"] = secrets["PINECONE_API_KEY"]
os.environ["OPENAI_API_KEY"] = secrets["OPENAI_API_TOKEN"]

index_name = "publications-mcnew"
pinecone.init(environment=secrets["PINECONE_ENV"])

if index_name not in pinecone.list_indexes():
    pinecone.create_index(name=index_name, dimension=1536, metric="cosine")
pinecone_index = pinecone.Index(index_name)

embeddings = OpenAIEmbeddings()
vectorstore = Pinecone.from_documents(
    docs, embeddings, index_name=index_name
)



In [22]:
# load the existing index (skip the cell above if the index is already built)
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vectorstore = Pinecone.from_existing_index(index_name=index_name, embedding=embeddings)

# Self Query Retriever
Now I can build the [self-query retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/), which uses an LLM to decide whether, and how, to apply metadata filters when retrieving documents.  First, I need to describe each metadata field I want the LLM to have access to.  The LLM relies on these descriptions to make its decisions, so they should be descriptive yet concise.

In [23]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info=[
    AttributeInfo(
        name="year",
        description="The year the publication was published", 
        type="integer", 
    ),
    AttributeInfo(
        name="publication_type",
        description="The type of publication, one of [paper, dissertation]", 
        type="string", 
    ),
    AttributeInfo(
        name="journal",
        description="The journal in which the publication was published", 
        type="string", 
    ),
]
document_content_description = "Text content and metadata from all of the user's publications"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info, verbose=True)

## Test Retriever
### No filter
If the LLM finds nothing in the query that maps to a metadata field, no filter is applied and the retriever runs a standard similarity search.
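
Conceptually, the unfiltered case is just a nearest-neighbour ranking over all stored vectors. The sketch below illustrates this with toy 2-D vectors and a cosine-similarity ranking in pure Python. The `search` function, the record layout, and the vectors are all hypothetical; a real vector store such as Pinecone handles this internally and at much larger scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, records, k=4, metadata_filter=None):
    """Rank records by similarity; an optional predicate narrows the candidates first."""
    candidates = [
        r for r in records
        if metadata_filter is None or metadata_filter(r["metadata"])
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query_vec, r["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Toy records standing in for embedded chunks (values are made up)
records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"year": 2016}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"year": 2020}},
    {"id": "c", "vector": [0.0, 1.0], "metadata": {"year": 2015}},
]

unfiltered = [r["id"] for r in search([1.0, 0.0], records, k=2)]
filtered = [
    r["id"]
    for r in search([1.0, 0.0], records, k=2, metadata_filter=lambda m: m["year"] < 2019)
]
print(unfiltered, filtered)
```

With `metadata_filter=None` every record competes on similarity alone; passing a predicate shrinks the candidate set before ranking, so a less similar but eligible record can make it into the top `k`.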

In [24]:
retriever.get_relevant_documents("What are some things I wrote about DNA?")



query='DNA' filter=None limit=None


[Document(page_content='Appendix B. DNA sequences\nTable B.1: Nucleotide sequences of the 4 DNA-labels, primers, and probes used in this study.\nBold and underlined segments indicate forward and reverse primer locations.\nT3 5’- AA A GTA AAG CAG CAG AGG TGG ACA GAG GAA\nGAG CAG AAG AAG GAA AGA ATG CTG GGA AGA\nGGA AGA ACG CAA GGC AAA GCG GA G GTA - 3’\nT3\nProbe5’- /56-FAM/AGC AGA AGA /ZEN/AGG AAA GAA TGC TGG\nGA/3IABkFQ/ - 3’\nT4 5’- AC A CGG ATC AAT CGG ATG TCA GGA TTC CCA\nGCT CGC AAC TTA CCG ACC TGG ATG AGG AGT GGC CGT\nGAA AG C ACA GAC ACC GTA GAA AAG ACA ACC CT\n- 3’\nT4\nProbe5’- /5HEX/CGC AAC TTA /ZEN/CCG ACC TGG ATG AGG\n/3IABkFQ/ -3’\nT10 5’ - G GC TCT CAC TGT GTA CAT GTG TTA T CT GCC\nTTT CGT CGG GGC GGT AAT TCT TGG TGC ACA\nGAC AAT CTT AAT AAG AGT CAG GAC TGG GT C - 3’\nT12 5’- CCG TAG AGA TCT CCC ATC TGT CCT TTG CTG\nAAG GTT AAA ACC CCG GAC CGC CTA GAA TAT\nTCT TTC TTT AGC TCC AAA ATG GCC TCT C - 3’\nAppendix C. Additional Characterization Details\nAn aliquot of particle s

### Metadata Filter(s) Applied
If the LLM determines the query contains information relevant to the metadata, it will automatically apply one or more filters prior to the similarity search.
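
The structured filters printed in the outputs of this section (comparisons such as `lt` and `eq`, optionally combined with `and`) ultimately have to be expressed in the vector store's own filter syntax. The sketch below uses small stand-in dataclasses (hypothetical, not LangChain's own classes) to show how such a filter tree could map onto a Pinecone-style filter dictionary with `$`-prefixed operators, and how it would be evaluated against a chunk's metadata. The exact translation LangChain performs may differ.

```python
import operator
from dataclasses import dataclass
from typing import Any, List, Union


@dataclass
class Comparison:
    comparator: str  # e.g. "eq", "lt", "gte"
    attribute: str
    value: Any


@dataclass
class Operation:
    operator: str  # "and" or "or"
    arguments: List[Union["Comparison", "Operation"]]


def to_filter_dict(node):
    """Translate a filter tree into a Pinecone-style filter dictionary."""
    if isinstance(node, Comparison):
        return {node.attribute: {f"${node.comparator}": node.value}}
    return {f"${node.operator}": [to_filter_dict(arg) for arg in node.arguments]}


_COMPARATORS = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "lte": operator.le,
    "gt": operator.gt, "gte": operator.ge,
}


def matches(node, metadata):
    """Evaluate a filter tree against one metadata dict."""
    if isinstance(node, Comparison):
        if node.attribute not in metadata:
            return False
        return _COMPARATORS[node.comparator](metadata[node.attribute], node.value)
    results = (matches(arg, metadata) for arg in node.arguments)
    return all(results) if node.operator == "and" else any(results)


# "in a paper, prior to 2018"
paper_before_2018 = Operation("and", [
    Comparison("eq", "publication_type", "paper"),
    Comparison("lt", "year", 2018),
])
filter_dict = to_filter_dict(paper_before_2018)
print(filter_dict)
print(matches(paper_before_2018, {"publication_type": "paper", "year": 2017.0}))
```

Because the filter is evaluated against stored metadata, it only works for fields that were attached at load time and described to the LLM.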

In [25]:
retriever.get_relevant_documents("What are some things I said about DNA prior to 2019?")

query='DNA' filter=Comparison(comparator=<Comparator.LT: 'lt'>, attribute='year', value=2019) limit=None


[Document(page_content='Appendix B. DNA sequences\nTable B.1: Nucleotide sequences of the 4 DNA-labels, primers, and probes used in this study.\nBold and underlined segments indicate forward and reverse primer locations.\nT3 5’- AA A GTA AAG CAG CAG AGG TGG ACA GAG GAA\nGAG CAG AAG AAG GAA AGA ATG CTG GGA AGA\nGGA AGA ACG CAA GGC AAA GCG GA G GTA - 3’\nT3\nProbe5’- /56-FAM/AGC AGA AGA /ZEN/AGG AAA GAA TGC TGG\nGA/3IABkFQ/ - 3’\nT4 5’- AC A CGG ATC AAT CGG ATG TCA GGA TTC CCA\nGCT CGC AAC TTA CCG ACC TGG ATG AGG AGT GGC CGT\nGAA AG C ACA GAC ACC GTA GAA AAG ACA ACC CT\n- 3’\nT4\nProbe5’- /5HEX/CGC AAC TTA /ZEN/CCG ACC TGG ATG AGG\n/3IABkFQ/ -3’\nT10 5’ - G GC TCT CAC TGT GTA CAT GTG TTA T CT GCC\nTTT CGT CGG GGC GGT AAT TCT TGG TGC ACA\nGAC AAT CTT AAT AAG AGT CAG GAC TGG GT C - 3’\nT12 5’- CCG TAG AGA TCT CCC ATC TGT CCT TTG CTG\nAAG GTT AAA ACC CCG GAC CGC CTA GAA TAT\nTCT TTC TTT AGC TCC AAA ATG GCC TCT C - 3’\nAppendix C. Additional Characterization Details\nAn aliquot of particle s

In [26]:
retriever.get_relevant_documents("What did I say about machine learning in a paper, prior to 2018?")



query='machine learning' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='publication_type', value='paper'), Comparison(comparator=<Comparator.LT: 'lt'>, attribute='year', value=2018)]) limit=None


[Document(page_content='(note that the breakthrough proﬁle alone is notdiagnostic in determining if current PTMs can be applied).8Articlepubs.acs.org/estThis is an open access article published under an ACS AuthorChoice License, which permitscopying and redistribution of the article or any adaptations for non-commercial purposes.', metadata={'authorship': 'second', 'journal': 'Environmental Science and Technology', 'page': 0.0, 'publication_type': 'paper', 'source': 'data/docs/2017-03-07_paper_Environmental Science and Technology_second.pdf', 'year': 2017.0}),
 Document(page_content='What Factors Determine the Retention Behavior of EngineeredNanomaterials in Saturated Porous Media?Eli  Goldberg,†Coy  McNew,‡Martin  Scheringer,*,§,†Thomas  D.  Bucheli,∥Peter  Nelson,⊥and  Konrad  Hungerbühler††Institute for Chemical and Bioengineering, ETH Zürich, 8093 Zürich, Switzerland‡Department of Land, Air, and Water Resources, University of California, Davis, California 95616, United States§RE

# RAG Chat
This retriever can now be used like any other.  Here I'll build it into a RAG workflow using a stuff chain.

## Setup
Similar to the previous notebook (`02-conversation-retrievalqa.ipynb`), I'll build a RAG workflow by injecting context from the retriever and chat history into a prompt template along with a system message and the user input.

In [54]:
from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
from langchain.memory import ConversationSummaryBufferMemory

# models
llm = OpenAI()
llm_summary = OpenAI()

# prompt engineering
template = """You are a chatbot having a conversation with a human. 
You are an expert on the human's publications.
Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use three sentences maximum and keep the answer as concise as possible. 

{context}

{chat_history}

Question: {human_input}

Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(input_variables=['context', 'human_input', 'chat_history'], template=template)

# history
memory = ConversationSummaryBufferMemory(
    llm=llm_summary, 
    memory_key="chat_history", 
    input_key="human_input", 
    max_token_limit=100, 
    human_prefix = "", 
    ai_prefix = ""
)

# stuff chain
qa_chain = load_qa_chain(llm=llm, chain_type="stuff", prompt=QA_CHAIN_PROMPT, verbose=True, memory=memory)
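
For intuition, "stuffing" amounts to concatenating the retrieved chunks into a single context string and substituting it into the prompt together with the chat history and the question. The following is a minimal sketch of that idea in plain Python, not LangChain's implementation; `build_stuff_prompt`, the shortened template, and the dict-based documents are assumptions made for illustration.

```python
# Shortened version of the prompt template, for illustration only
TEMPLATE = """You are a chatbot having a conversation with a human.
Use the following pieces of context to answer the question at the end.

{context}

{chat_history}

Question: {human_input}

Helpful Answer:"""


def build_stuff_prompt(documents, chat_history, human_input, template=TEMPLATE, separator="\n\n"):
    """Join all document texts into one context block and fill in the template."""
    context = separator.join(doc["page_content"] for doc in documents)
    return template.format(
        context=context,
        chat_history=chat_history,
        human_input=human_input,
    )


# Stand-in documents; in the notebook these would come from the retriever
documents = [
    {"page_content": "First retrieved chunk."},
    {"page_content": "Second retrieved chunk."},
]
prompt = build_stuff_prompt(
    documents,
    chat_history="(summary of the conversation so far)",
    human_input="What did I write about DNA?",
)
print(prompt)
```

Since every retrieved chunk goes into one prompt, this approach is limited by the model's context window, which is one reason to keep chunks small and filters precise.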

## Test RAG Chat
Now I'll try a few prompts to see if I can get the workflow to operate as expected.

In [55]:
from IPython.display import HTML, display

def display_result(question, result, similar_docs, chat_history):
    result_html = f"<p><blockquote style=\"font-size:24px\">{question}</blockquote></p>"
    result_html += f"<p><blockquote style=\"font-size:18px\">{result}</blockquote></p>"
    result_html += "<p><hr/></p>"
    for d in similar_docs:
        source_id = d.metadata["source"]
        result_html += f"<p><blockquote>{d.page_content}<br/>(Source: {source_id})</blockquote></p>"
    result_html += "<p><hr/></p>"
    result_html += "<p><blockquote style=\"font-size:24px\">Summarized Chat History</blockquote></p>"
    result_html += f"<p><blockquote>{chat_history}</blockquote></p>"

    display(HTML(result_html))

In [56]:
query = "What have I written about machine learning applied to environmental applications?"
similar_docs = retriever.get_relevant_documents(query)
result = qa_chain({"input_documents":similar_docs, "human_input":query})
display_result(result["human_input"], result["output_text"], result["input_documents"], result["chat_history"])

query='machine learning environmental applications' filter=None limit=None


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
You are a chatbot having a conversation with a human. 
You are an expert on the human's publications.
Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use three sentences maximum and keep the answer as concise as possible. 

has been developed from these studies. Without explicitly linking the physicochemical
properties of the particle and system to the transport and attachment, these models remain
descriptive tools rather than powerful predictive models.
Machine learning allows us to develop empirical models from complex systems where
the underlying relationships between the data are too complex to develop by hand [83].
Machine learning has been successfully applie

In [57]:
query = "And what did I say about machine learning in a paper after 2015?"
similar_docs = retriever.get_relevant_documents(query)
result = qa_chain({"input_documents":similar_docs, "human_input":query})
display_result(result["human_input"], result["output_text"], result["input_documents"], result["chat_history"])

query='machine learning' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='publication_type', value='paper'), Comparison(comparator=<Comparator.GTE: 'gte'>, attribute='year', value=2015)]) limit=None


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
You are a chatbot having a conversation with a human. 
You are an expert on the human's publications.
Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use three sentences maximum and keep the answer as concise as possible. 

Figure  3. 541

help and support. 485
24

loamy sand by fitting to the measured DNA flux, and 3) assigning the best fit attachment and
detachment rates obtained for loamy sand to the corresponding parameters of gravel.  Then, a
simple local sensitivity analy