# Local RAG with PDF and Llama 3.1

## Steps

1. Load the PDF
2. Chunk the PDF (split it into pieces)
3. Embed each chunk
4. Create the vector database (index)
5. Query: retrieve relevant chunks from the vector database and answer with a Llama 3.1 model

## Chunk size?

![](./assets-resources/optimal-chunk-size.png)
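
To get a feel for the trade-off, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is not the splitter LangChain uses; `chunk_text` and the sample text are hypothetical, and sizes are counted in characters. Smaller chunks give more, finer-grained pieces to retrieve; overlap repeats some text so that a sentence cut at a boundary still appears whole in one chunk.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "word " * 200  # 1000 characters of stand-in text
for size, overlap in [(100, 0), (100, 20), (500, 50)]:
    pieces = chunk_text(sample, size, overlap)
    print(f"chunk_size={size}, overlap={overlap} -> {len(pieces)} chunks")
```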

In [None]:
!pip install -U langchain langchain-community langchain-core langchain-ollama chromadb langchainhub pypdf

In [1]:
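# Loaders live in langchain_community and splitters in langchain_text_splitters;
# the old `langchain.document_loaders` / `langchain.text_splitter` paths are deprecated.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Path to the single PDF file to index (a file, despite the variable name)
folder_path = "./pdfs/llama3-80k.pdf"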

In [2]:
# load_and_split() loads the pages and chunks them with the default RecursiveCharacterTextSplitter
pdf_docs = PyPDFLoader(folder_path).load_and_split()
pdf_docs[0]

Document(metadata={'source': './pdfs/llama3-80k.pdf', 'page': 0}, page_content='Extending Llama-3’s Context Ten-Fold Overnight\nPeitian Zhang1,2, Ninglu Shao1,2, Zheng Liu1∗, Shitao Xiao1, Hongjin Qian1,2,\nQiwei Ye1, Zhicheng Dou2\n1Beijing Academy of Artificial Intelligence\n2Gaoling School of Artificial Intelligence, Renmin University of China\nnamespace.pt@gmail.com zhengliu1026@gmail.com\nAbstract\nWe extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA\nfine-tuning2. The entire training cycle is super efficient, which takes 8 hours on one\n8xA800 (80G) GPU machine. The resulted model exhibits superior performances\nacross a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-\ncontext language understanding; meanwhile, it also well preserves the original\ncapability over short contexts. The dramatic context extension is mainly attributed\nto merely 3.5K synthetic training samples generated by GPT-4 , which indicates\nthe LLMs’ inherent (y

In [3]:
len(pdf_docs)

6

# Embedding and Vector Database

In [4]:
from langchain_ollama import OllamaEmbeddings

In [21]:
embedding = OllamaEmbeddings(model="nomic-embed-text")

In [22]:
from langchain_community.vectorstores import Chroma

In [24]:
vector_db = Chroma.from_documents(pdf_docs, embedding=embedding, persist_directory="rag-pdf")
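
Conceptually, the vector database stores one vector per chunk and answers a query by returning the chunks whose vectors are closest to the query's vector. The toy sketch below illustrates that idea only: `toy_embed` uses word counts as a stand-in for `nomic-embed-text`, a plain list stands in for Chroma, and all names are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (not what a real embedding model does)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k chunk texts most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


chunks = [
    "QLoRA fine-tuning extends the context length from 8K to 80K",
    "Training takes 8 hours on one 8xA800 GPU machine",
    "Pancakes are a popular breakfast",
]
index = [(c, toy_embed(c)) for c in chunks]  # "index" = chunk text paired with its vector
print(top_k("how was the context length extended", index, k=1))
```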

In [26]:
from langchain_ollama import ChatOllama

# `chat_format` is not a ChatOllama parameter, so it is dropped here
llm = ChatOllama(model="llama3.1:8b")

In [27]:
llm.invoke("Hi tell me 5 reasons why pancakes are the best breakfast for AI engineers like myself.")

AIMessage(content="A delicious and relevant question! As an AI engineer, you likely appreciate a good pancake breakfast. Here are 5 reasons why pancakes stand out as the ultimate breakfast choice:\n\n1. **Flexibility**: Pancakes can be topped with a wide variety of sweet and savory ingredients, making them adaptable to individual tastes. From classic buttermilk and maple syrup to fresh fruits, nuts, or even bacon - the possibilities are endless!\n2. **Satisfying carbs**: As AI engineers (or anyone who's spent hours staring at code), you likely appreciate a carb-boost to start your day. Pancakes provide a satisfying dose of complex carbohydrates, which can help with focus and mental clarity.\n3. **Quick energy boost**: A stack of fluffy pancakes can be whipped up in no time, making them an ideal breakfast for those with busy schedules (ahem... AI engineers). They'll give you the quick energy boost you need to tackle a morning of coding or project work.\n4. **Mood-boosting goodness**: Le

# Q&A with Retrieval

In [45]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# Convert loaded documents into strings by concatenating their content
# and ignoring metadata
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

RAG_TEMPLATE = """
You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 
Use three sentences maximum and keep the answer concise.

<context>
{context}
</context>

Answer the following question:

{question}"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

retriever = vector_db.as_retriever()

qa_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)
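
To make the data flow of the `|` pipeline explicit: the dict step sends the question to the retriever (whose results go through `format_docs`) and also passes the question through unchanged; the prompt fills the template; the model answers. The plain-Python sketch below mirrors that flow with stubs. `Doc`, `fake_retriever`, `fake_llm`, `build_prompt` and `run_chain` are hypothetical stand-ins, not LangChain's implementation, and the template is shortened.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    page_content: str


TEMPLATE = "<context>\n{context}\n</context>\n\nAnswer the following question:\n\n{question}"


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


def fake_retriever(question):
    # Hypothetical stub: a real retriever would query the vector database.
    return [Doc("chunk A"), Doc("chunk B")]


def fake_llm(prompt):
    # Hypothetical stub: a real model would generate an answer from the prompt.
    return f"(stub answer for a {len(prompt)}-character prompt)"


def build_prompt(question):
    # Same shape as {"context": retriever | format_docs, "question": RunnablePassthrough()}
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    return TEMPLATE.format(**inputs)


def run_chain(question):
    return fake_llm(build_prompt(question))


print(build_prompt("What method was used?"))
print(run_chain("What method was used?"))
```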

In [46]:
query = "In the reported experiments, how does the performance of the extended context model (Llama-3-8B-Instruct-80K-QLoRA) compare to both the original Llama-3-8B-Instruct model and the community model Llama-3-8B-Instruct-262K on the LongBench and InfBench evaluation tasks?"
qa_chain.invoke(query)

'According to the text, on the LongBench evaluation task, the extended context model (Llama-3-8B-Instruct-80K-QLoRA) significantly and consistently outperforms all baselines except on the code completion task. However, its zero-shot performance is lower than that of the original Llama-3-8B-Instruct model.\n\nOn the InfBench evaluation task, specifically the Long-Book QA task, Llama-3-8B-Instruct-80K-QLoRA excels in answering questions based on long context. However, its zero-shot performance is lower than that of the original Llama-3-8B-Instruct model.\n\nThe text does not provide a direct comparison to the community model Llama-3-8B-Instruct-262K on LongBench and InfBench evaluation tasks.'

In [40]:
import time
import numpy as np

queries = [
    "What method did the authors use to extend the context length of Llama-3-8B-Instruct from 8K to 80K?",
    "What are the three long-context tasks covered in the training data synthesized by GPT-4?",
    "What contributions do the authors highlight in their work on extending the context length of Llama-3-8B-Instruct?",
    "How does the performance of Llama-3-8B-Instruct-80K-QLoRA compare with other long-context models on popular benchmarks?"
]

In [41]:
outputs = []
latencies = []
for query in queries:
    # perf_counter() is a monotonic, high-resolution clock intended for timing intervals
    start = time.perf_counter()
    output = qa_chain.invoke(query)
    latencies.append(time.perf_counter() - start)
    outputs.append(output)

mean_latency = np.mean(latencies)
print(f"Mean latency in seconds: {mean_latency}")

Mean latency in seconds: 6.165865898132324
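
A mean over four queries can hide a single slow call, so it may help to look at the median and the maximum as well. The sketch below is one way to do that with the standard library; `time_calls`, `summarize` and `fake_chain` are hypothetical names, and `fake_chain` only sleeps briefly in place of a real model call.

```python
import statistics
import time


def time_calls(fn, inputs):
    """Return per-call wall-clock latencies in seconds for fn(x) over inputs."""
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Summarize latencies with mean, median and max."""
    return {
        "mean": statistics.mean(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
    }


def fake_chain(query):
    time.sleep(0.01)  # hypothetical stand-in for qa_chain.invoke(query)
    return "answer"


stats = summarize(time_calls(fake_chain, ["q1", "q2", "q3"]))
print({name: round(value, 3) for name, value in stats.items()})
```

With the notebook's objects, the equivalent call would be `summarize(time_calls(qa_chain.invoke, queries))`.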


In [57]:
from typing import List

from langchain_core.runnables import RunnablePassthrough
from typing_extensions import Annotated, TypedDict

# llm = ChatOllama(model="llama3.1:8b", temperature=0)

# Incorporate the retriever into a question-answering chain.
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)


# Desired schema for response
class AnswerWithSources(TypedDict):
    """An answer to the question, with sources."""

    answer: str
    sources: Annotated[
        List[str],
        ...,
        "List of sources (author + year) used to answer the question",
    ]


# Our rag_chain_from_docs has the following changes:
# - add `.with_structured_output` to the LLM;
# - remove the output parser
rag_chain_from_docs = (
    {
        "input": lambda x: x["input"],
        "context": lambda x: format_docs(x["context"]),
    }
    | prompt
    | llm.with_structured_output(AnswerWithSources)
)

# Retrieve documents for the "input" key and keep them in the result under "context"
retrieve_docs = (lambda x: x["input"]) | retriever

chain = RunnablePassthrough.assign(context=retrieve_docs).assign(
    answer=rag_chain_from_docs
)

response = chain.invoke({"input": "How did the Llama-3-8B-Instruct-80K-QLoRA model compare to the original Llama-3-8B-Instruct model?"})
response

{'input': 'How did the Llama-3-8B-Instruct-80K-QLoRA model compare to the original Llama-3-8B-Instruct model?',
 'context': [Document(metadata={'page': 1, 'source': './pdfs/llama3-80k.pdf'}, page_content='800014315 20631 26947 33263 39578 45894 52210 58526 64842 71157 77473 83789 90105 96421102736 109052 115368 121684 128000\nContext Length0\n11\n22\n33\n44\n55\n66\n77\n88\n100Depth Percent1.0Needle In A HayStack\n12345678910\nAccuracy Score from GPT3.5Figure 1: The accuracy score of Llama-3-8B-Instruct-80K-QLoRA on Needle-In-A-HayStack task.\nThe blue vertical line indicates the training length, i.e. 80K.\nthe same cluster to form each heterogeneous context. Therefore, the grouped texts share\nsome semantic similarity. We then prompt GPT-4 to ask about the similarities/dissimilarities\nacross these texts.\n3.Biography Summarization : we prompt GPT-4 to write a biography for each main character\nin a given book.\nFor all three tasks, the length of context is between 64K to 80K. Note th

In [55]:
import json

print(json.dumps(response["answer"], indent=2))

null
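
`null` is how `json.dumps` renders `None`, so `response["answer"]` was empty in this run. A defensive sketch for printing the result, assuming the response dict has the keys shown in the output above; `render_answer` is a hypothetical helper name and the sample dicts are made up for illustration.

```python
import json


def render_answer(response: dict) -> str:
    """Pretty-print response["answer"], or say so when it is missing."""
    answer = response.get("answer")
    if answer is None:
        return "No structured answer was returned; inspect response['context'] or retry."
    return json.dumps(answer, indent=2)


# Made-up responses that only mimic the shape of the chain's output
print(render_answer({"input": "q", "context": [], "answer": None}))
print(render_answer({"input": "q", "context": [], "answer": {"answer": "a", "sources": ["s"]}}))
```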
