## Setting up the environment

In [1]:
!pip install -qU langchain langchain-core langchain-community langchain-openai

In [2]:
!pip install -qU qdrant-client

In [3]:
!pip install -qU tiktoken pymupdf

In [4]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [5]:
from langchain_openai import ChatOpenAI

openai_chat_model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

## Loading the data

In [6]:
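# Document loaders live in langchain-community (installed above); the old langchain.document_loaders path is deprecated
from langchain_community.document_loaders import PyMuPDFLoader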

docs = PyMuPDFLoader("https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf").load()

## Chunking the data

In [7]:
# Text splitters ship in the langchain-text-splitters package, a dependency of langchain
from langchain_text_splitters import RecursiveCharacterTextSplitter
import tiktoken

def tiktoken_len(text):
    tokens = tiktoken.encoding_for_model("gpt-3.5-turbo").encode(
        text,
    )
    return len(tokens)


In [8]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 200,
    chunk_overlap = 50,
    length_function = tiktoken_len,
)

split_chunks = text_splitter.split_documents(docs)

In [9]:
len(split_chunks)

765
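
To make the interaction between `chunk_size` and `chunk_overlap` more concrete, here is a minimal sketch of fixed-window chunking over a token list. It is a simplified model, not what `RecursiveCharacterTextSplitter` does internally (the real splitter cuts on separators such as paragraphs and line breaks before merging pieces). The function `sliding_window_chunks` and the toy token list are assumptions made for illustration.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks

tokens = list(range(500))  # stand-in for 500 token ids
chunks = sliding_window_chunks(tokens, chunk_size=200, chunk_overlap=50)
print([len(c) for c in chunks])  # [200, 200, 200]
```

With a step of 150 tokens, the last 50 tokens of each chunk reappear at the start of the next one, which is why the number of chunks is higher than the token count divided by `chunk_size`.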

## Embedding and vector storage

In [10]:
from langchain_openai.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

In [11]:
from langchain_community.vectorstores import Qdrant

qdrant_vectorstore = Qdrant.from_documents(
    split_chunks,
    embedding_model,
    location=":memory:",
    collection_name="Meta 10-k Filings",
)

In [12]:
qdrant_retriever = qdrant_vectorstore.as_retriever()
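
Conceptually, a similarity retriever embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with plain Python and hand-written 3-dimensional vectors. It assumes cosine similarity and a top-k cutoff; the helper names `cosine_similarity` and `top_k`, the toy vectors, and the value of `k` are all illustrative and are not taken from Qdrant or LangChain.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k=4):
    """Return the texts of the k entries most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy "vector store": (chunk text, made-up embedding)
toy_index = [
    ("cash and cash equivalents", [0.9, 0.1, 0.0]),
    ("board of directors", [0.1, 0.9, 0.0]),
    ("employee benefits", [0.0, 0.2, 0.9]),
]
toy_query = [0.2, 0.8, 0.1]  # made-up embedding of a question about directors

results = top_k(toy_query, toy_index, k=2)
print(results)  # ['board of directors', 'cash and cash equivalents']
```

Only the top-ranked chunks reach the prompt, so a relevant chunk that ranks just below the cutoff is invisible to the model.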

## RAG Prompt

In [13]:
from langchain_core.prompts import ChatPromptTemplate

In [14]:
RAG_PROMPT = """
CONTEXT:
{context}

QUERY:
{question}

Answer the query if the context is related to it; otherwise, answer: 'Sorry, the context is unrelated to the query, I can't answer.'
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)
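
The template has two placeholders, `{context}` and `{question}`, that are filled in at invocation time. The sketch below reproduces that substitution with plain `str.format` on a shortened stand-in template. `format_docs` is a hypothetical helper, not part of this notebook, that joins the text of the retrieved chunks; the toy documents only mimic the `page_content` attribute of LangChain's `Document`.

```python
from types import SimpleNamespace

# Shortened stand-in for the RAG_PROMPT template defined above.
TEMPLATE = "CONTEXT:\n{context}\n\nQUERY:\n{question}\n"

def format_docs(docs):
    """Hypothetical helper: join the text of each retrieved chunk."""
    return "\n\n".join(doc.page_content for doc in docs)

# Toy stand-ins for retrieved Document objects.
toy_docs = [
    SimpleNamespace(page_content="first chunk"),
    SimpleNamespace(page_content="second chunk"),
]

filled = TEMPLATE.format(
    context=format_docs(toy_docs),
    question="Who are Meta's directors?",
)
print(filled)
```

Printing the filled prompt is a quick way to see exactly what the model receives for a given question.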

## RAG Chain

In [15]:
from operator import itemgetter
# Import from langchain_core (installed above) rather than the deprecated langchain.schema paths
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retrieval_augmented_qa_chain = (

    {"context": itemgetter("question") | qdrant_retriever, "question": itemgetter("question")}

    | RunnablePassthrough.assign(context=itemgetter("context"))

    | {"response": rag_prompt | openai_chat_model, "context": itemgetter("context")}
)

## Response generation

In [16]:
response_1 = retrieval_augmented_qa_chain.invoke({"question" : "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?"})
response_1["response"].content

"The total value of 'Cash and cash equivalents' as of December 31, 2023, was $41,862."

In [17]:
response_2 = retrieval_augmented_qa_chain.invoke({"question" : "Who are Meta's 'Directors' (i.e., members of the Board of Directors)?"})
response_2["response"].content

"The context provided does not directly mention the names of Meta's Directors (members of the Board of Directors). Therefore, I cannot answer the query based on the given information. Sorry, the context is unrelated to the query, I can't answer."

In [18]:
response_2["context"]

[Document(page_content='to having a skilled, inclusive and diverse workforce because we believe cognitive diversity fuels innovation. To aid in this effort, we have taken steps to reduce\nbias from our hiring processes and performance management systems, as well as offering learning and development courses for our employees.\nCorporate Information\nWe were incorporated in Delaware in July 2004. We completed our initial public offering in May 2012 and our Class\xa0A common stock is currently listed\non the Nasdaq Global Select Market under the symbol "META." Our principal executive offices are located at 1 Meta Way, Menlo Park, California 94025, and\nour telephone number is (650) 543-4800.\nMeta, the Meta logo, Meta Quest, Meta Horizon, Facebook, FB, Instagram, Oculus, WhatsApp, Reels, and our other registered or common law', metadata={'source': 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf', 'file_path': 'https://d18rn0p25nwr6d.cloudfron

## First pipeline results analysis

- The answer to the first question is correct (the 10-K reports the figure in millions of dollars).
- The answer to the second question is wrong: the retrieved context does not contain the directors' names, so the pipeline fails to retrieve what it needs to answer.
- I need to improve the context retrieval part of the pipeline.
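
One way to check this diagnosis is to scan the retrieved chunks for a keyword that the answer should contain. The sketch below is a minimal version of that check; `chunks_mentioning` is a hypothetical helper, and the toy chunks reuse two snippets from the outputs in this notebook, wrapped in objects that only mimic the `page_content` attribute.

```python
from types import SimpleNamespace

def chunks_mentioning(docs, keyword):
    """Return the indices of the chunks whose text contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [i for i, doc in enumerate(docs) if keyword in doc.page_content.lower()]

# Toy stand-ins for a retrieved context.
toy_context = [
    SimpleNamespace(page_content="We were incorporated in Delaware in July 2004."),
    SimpleNamespace(page_content="/s/ Peggy Alford\nDirector\nFebruary 1, 2024"),
]

hits = chunks_mentioning(toy_context, "director")
print(hits)  # [1]
```

In the notebook, the same helper could be called as `chunks_mentioning(response_2["context"], "Director")`; an empty list would indicate that no retrieved chunk mentions the term.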

## Upgrading the chunking strategy
- Since this is a large PDF that includes tables, let's try a larger chunk size: 800 tokens with an overlap of 100, instead of 200 tokens with an overlap of 50.

In [19]:
text_splitter_2 = RecursiveCharacterTextSplitter(
    chunk_size = 800,
    chunk_overlap = 100,
    length_function = tiktoken_len,
)

split_chunks_2 = text_splitter_2.split_documents(docs)

In [20]:
len(split_chunks_2)

220

In [21]:
qdrant_vectorstore_2 = Qdrant.from_documents(
    split_chunks_2,
    embedding_model,
    location=":memory:",
    collection_name="Meta 10-k Filings 2",
)

In [22]:
qdrant_retriever_2 = qdrant_vectorstore_2.as_retriever()

In [23]:
retrieval_augmented_qa_chain_2 = (

    {"context": itemgetter("question") | qdrant_retriever_2, "question": itemgetter("question")}

    | RunnablePassthrough.assign(context=itemgetter("context"))

    | {"response": rag_prompt | openai_chat_model, "context": itemgetter("context")}
)

In [24]:
response_1b = retrieval_augmented_qa_chain_2.invoke({"question" : "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?"})
response_1b["response"].content

"The total value of 'Cash and cash equivalents' as of December 31, 2023, was $41.862 billion."

In [27]:
response_2b = retrieval_augmented_qa_chain_2.invoke({"question" : "Who are Meta's 'Directors' (i.e., members of the Board of Directors)?"})
response_2b["response"].content

"Sorry, the context is unrelated to the query, I can't answer."

In [28]:
response_2b["context"]

[Document(page_content='Table of Contents\nCompensation, Benefits, Health, and Well-being\nWe offer competitive compensation to attract and retain the best people, and we help care for our people so they can focus on our mission. Our\nemployees\' total compensation package includes market-competitive salary, bonuses or sales incentives, and equity. We generally offer full-time employees\nequity at the time of hire and through annual equity grants because we want them to be owners of the company and committed to our long-term success. We\nhave conducted pay equity analyses for many years, and continue to be committed to pay equity. For example, in July 2023, we announced that our analyses\nconfirm that we continue to have pay equity across genders globally and by race in the United States for people in similar jobs, accounting for factors such as\nlocation, role, and level.\nThrough Life@ Meta, our holistic approach to benefits, we continue to provide our employees and their dependents 

## Second pipeline results analysis
- The second pipeline gives the same results as the first: the first question is answered and the second is not. Context retrieval still fails for the directors question, despite the larger chunk size.

## Upgrading the retrieval strategy
- Keeping the chunking strategy of the first pipeline, I will try to improve retrieval by using the MultiQueryRetriever, which asks the LLM for several rephrasings of the question and merges the chunks retrieved for each of them.

In [35]:
from langchain.retrievers import MultiQueryRetriever

multiquery_retriever = MultiQueryRetriever.from_llm(retriever=qdrant_retriever, llm=openai_chat_model)
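
The idea behind multi-query retrieval can be sketched without any LLM: generate several rephrasings of the question, run the base retriever once per rephrasing, and keep the unique union of the results. Everything in the sketch below is a stand-in: `fake_variants` replaces the LLM call, `fake_retrieve` is a naive keyword-overlap retriever, and the toy chunks are invented. It is not the library's implementation.

```python
TOY_CHUNKS = [
    "compensation and benefits for employees",
    "peggy alford director",
    "board chair and chief executive officer",
]

def fake_retrieve(query):
    """Stand-in retriever: return chunks sharing at least one word with the query."""
    words = set(query.lower().split())
    return [text for text in TOY_CHUNKS if words & set(text.lower().split())]

def fake_variants(question):
    """Stand-in for the LLM step that rephrases the question."""
    return ["members of the board", "who signed as director"]

def multi_query_retrieve(question, generate_variants, retrieve):
    """Retrieve for every variant and merge the results, dropping duplicates."""
    seen = set()
    merged = []
    for query in generate_variants(question):
        for chunk in retrieve(query):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged

question = "who are the directors"
print(fake_retrieve(question))  # []
merged = multi_query_retrieve(question, fake_variants, fake_retrieve)
print(merged)  # ['board chair and chief executive officer', 'peggy alford director']
```

In this toy setup the original wording matches nothing, while the rephrasings each hit a different chunk, which is the effect the multi-query strategy relies on.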

In [36]:
retrieval_augmented_qa_chain_3 = (

    {"context": itemgetter("question") | multiquery_retriever, "question": itemgetter("question")}

    | RunnablePassthrough.assign(context=itemgetter("context"))

    | {"response": rag_prompt | openai_chat_model, "context": itemgetter("context")}
)

In [37]:
response_1c = retrieval_augmented_qa_chain_3.invoke({"question" : "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?"})
response_1c["response"].content

"The total value of 'Cash and cash equivalents' as of December 31, 2023, was $41,862."

In [38]:
response_2c = retrieval_augmented_qa_chain_3.invoke({"question" : "Who are Meta's 'Directors' (i.e., members of the Board of Directors)?"})
response_2c["response"].content

'Signature\nTitle\nDate\n/s/ Mark Zuckerberg\nBoard Chair and Chief Executive Officer\n(Principal Executive Officer)\nFebruary 1, 2024\nMark Zuckerberg\n/s/ Susan Li\nChief Financial Officer\n(Principal Financial Officer)\nFebruary 1, 2024\nSusan Li\n/S/ Aaron Anderson\nChief Accounting Officer\n(Principal Accounting Officer)\nFebruary 1, 2024\nAaron Anderson\n/s/ Peggy Alford\nDirector\nFebruary 1, 2024\nPeggy Alford\n/s/ Marc L. Andreessen\nDirector\nFebruary 1, 2024\nMarc L. Andreessen\n/s/ Andrew W. Houston\nDirector\nFebruary 1, 2024\nAndrew W. Houston\n/s/ Nancy Killefer\nDirector\nFebruary 1, 2024\nNancy Killefer\n/s/ Robert M. Kimmitt\nDirector\nFebruary 1, 2024'

In [34]:
response_2c["context"]

[Document(page_content='Table of Contents\nCompensation, Benefits, Health, and Well-being\nWe offer competitive compensation to attract and retain the best people, and we help care for our people so they can focus on our mission. Our\nemployees\' total compensation package includes market-competitive salary, bonuses or sales incentives, and equity. We generally offer full-time employees\nequity at the time of hire and through annual equity grants because we want them to be owners of the company and committed to our long-term success. We\nhave conducted pay equity analyses for many years, and continue to be committed to pay equity. For example, in July 2023, we announced that our analyses\nconfirm that we continue to have pay equity across genders globally and by race in the United States for people in similar jobs, accounting for factors such as\nlocation, role, and level.\nThrough Life@ Meta, our holistic approach to benefits, we continue to provide our employees and their dependents 

## Third pipeline results analysis
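- The third pipeline gives good results and answers both questions. For the second question, the response reproduces the signature page of the filing, which lists the directors. The improved retrieval strategy now surfaces the context needed to answer.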