![](./assets-resources/optimal-chunk-size.png)

1. Load the PDF
2. Chunk the PDF (split it into pieces)
3. Embed each piece
4. Create the vector database (the index)
5. Query (retrieve from the vector database and answer with a llama3 model)

In [1]:
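from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Path to the PDF file to index (a single file, not a folder)
pdf_path = "./pdfs/llama3-80k.pdf"

In [2]:
# load_and_split() loads the PDF and chunks it with a default RecursiveCharacterTextSplitter
pdf_docs = PyPDFLoader(pdf_path).load_and_split()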
pdf_docs[0]

Document(page_content='Extending Llama-3’s Context Ten-Fold Overnight\nPeitian Zhang1,2, Ninglu Shao1,2, Zheng Liu1∗, Shitao Xiao1, Hongjin Qian1,2,\nQiwei Ye1, Zhicheng Dou2\n1Beijing Academy of Artificial Intelligence\n2Gaoling School of Artificial Intelligence, Renmin University of China\nnamespace.pt@gmail.com zhengliu1026@gmail.com\nAbstract\nWe extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA\nfine-tuning2. The entire training cycle is super efficient, which takes 8 hours on one\n8xA800 (80G) GPU machine. The resulted model exhibits superior performances\nacross a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-\ncontext language understanding; meanwhile, it also well preserves the original\ncapability over short contexts. The dramatic context extension is mainly attributed\nto merely 3.5K synthetic training samples generated by GPT-4 , which indicates\nthe LLMs’ inherent (yet largely underestimated) potential to extend its origin

In [3]:
len(pdf_docs)

6
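
Since `load_and_split()` was called without an explicit splitter, the chunk boundaries above come from default settings. To make the idea of chunk size and overlap concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. The function `chunk_text` and its parameters are hypothetical names for illustration; this is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines).

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(sample_chunks)
```

Smaller chunks give the retriever more precise pieces to match against, while larger chunks keep more surrounding context in each piece; the overlap reduces the chance that a sentence is cut in half at a boundary.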

Embedding and Vector Database

In [4]:
from langchain.embeddings import OllamaEmbeddings

In [6]:
embedding = OllamaEmbeddings(model="llama3")

In [7]:
from langchain.vectorstores import Chroma

In [8]:
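# persist_directory="." writes the index into the current directory;
# chunks already stored there are kept alongside the new ones
vector_db = Chroma.from_documents(pdf_docs, embedding=embedding, persist_directory=".")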
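
Conceptually, the vector database stores one embedding per chunk and, at query time, returns the chunks whose embeddings are closest to the query embedding. The sketch below illustrates that lookup with cosine similarity over plain Python lists. The names `cosine_similarity` and `top_k` are hypothetical, and cosine similarity is only one common choice; it is not necessarily the distance metric this Chroma collection is configured with.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 2-dimensional "embeddings" standing in for real model outputs
toy_doc_vecs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], toy_doc_vecs, k=2))
```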

In [9]:
# building blocks in langchain to connect llama3 to documents
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOllama

In [10]:
llm = ChatOllama(model="llama3", chat_format=True)

In [11]:
llm.invoke("Hi tell me 5 reasons why pancakes are the best breakfast for AI engineers like myself.")

AIMessage(content="A pancake lover, I see! As an AI engineer, you likely appreciate efficiency, versatility, and a boost of energy to tackle complex problems. Here are five reasons why pancakes are the perfect breakfast for AI engineers like yourself:\n\n1. **Fuel for Problem-Solving**: Pancakes provide a quick and satisfying source of carbohydrates, which is essential for fueling your brain's problem-solving abilities. The complex sugars in pancakes will keep you focused and energized throughout your morning coding sessions or meetings.\n2. **Customization Station**: AI engineers value flexibility and adaptability. Pancakes offer the perfect canvas for customization! Add your favorite toppings, from sweet fruits to savory nuts, and create a flavor profile that suits your unique taste buds. This mirrors the iterative process of developing AI models – you can experiment with different approaches until you find what works best.\n3. **Structured Complexity**: Just as AI engineers work wit

In [12]:
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vector_db.as_retriever(), return_source_documents=True)

In [13]:
# Source documents are already returned because the chain was built with return_source_documents=True
output = qa.invoke("How did the authors increased the context length of the llama3 model? Cite the specific section source of the paper.")

In [14]:
output

{'query': 'How did the authors increased the context length of the llama3 model? Cite the specific section source of the paper.',
 'result': 'According to the text, the authors increased the context length of Llama 3 by using Grouped-Query Attention (GQA), which is an efficient representation that should help with longer contexts.\n\nSource: "small model: it goes from 7B in Llama 2 to 8B in Llama 3. In addition, the 8B version of the model now uses Grouped-Query Attention (GQA), which is an efficient representation that should help with longer contexts."',
 'source_documents': [Document(page_content='Regarding the licensing terms, Llama 3 comes with a permissive license that allows redistribution, fine-tuning, and derivative works. The requirement for explicit attribution is new in the Llama 3 license and was not present in Llama 2. Derived models, for instance, need to include "Llama 3" at the beginning of their name, and you also need to mention "Built with Meta Llama 3" in derivativ
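
The answer above points to Grouped-Query Attention, and the first retrieved chunk discusses Llama 3 licensing, which does not look like text from the paper loaded earlier. One way to investigate is to tally where the retrieved chunks came from. The sketch below assumes each retrieved document carries a `metadata` dict with a `"source"` key (an assumption about what the loader populated); `StubDocument` and `count_sources` are hypothetical names used so the example runs on its own.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Hypothetical stand-in for a retrieved document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def count_sources(source_documents) -> Counter:
    """Tally retrieved chunks by the 'source' entry in their metadata."""
    return Counter(doc.metadata.get("source", "<unknown>") for doc in source_documents)


# Made-up documents for illustration only
retrieved = [
    StubDocument("chunk a", {"source": "./pdfs/llama3-80k.pdf", "page": 0}),
    StubDocument("chunk b", {"source": "some-other-file.pdf", "page": 3}),
    StubDocument("chunk c", {}),
]
source_counts = count_sources(retrieved)
print(source_counts)
```

With the real chain output, this would be called as `count_sources(output["source_documents"])`. If files other than the paper show up, the index likely contains chunks from more than one document.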

Now, let's run a simple semi-manual benchmark of the quality of this RAG setup.

In [18]:
import time
import numpy as np

queries = [
    "What method did the authors use to extend the context length of Llama-3-8B-Instruct from 8K to 80K?",
    "How long did the entire training cycle take, and what hardware was used?",
    "What are the three long-context tasks covered in the training data synthesized by GPT-4?",
    "What contributions do the authors highlight in their work on extending the context length of Llama-3-8B-Instruct?",
    "How does the performance of Llama-3-8B-Instruct-80K-QLoRA compare with other long-context models on popular benchmarks?"
]

In [19]:
outputs = []
latencies = []
for query in queries:
    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    output = qa.invoke(query)
    latencies.append(time.perf_counter() - start)  # time only the chain call
    outputs.append(output)

mean_latency = np.mean(latencies)
print(f"Mean latency in seconds: {mean_latency}")

Mean latency in seconds: 5.006371784210205
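
With only five queries, a single slow call can dominate the mean, so it can help to look at the median and the range as well. A minimal sketch using the standard library follows; `summarize_latencies` is a hypothetical helper, and the sample numbers are made up for illustration rather than taken from the run above.

```python
import statistics


def summarize_latencies(latencies: list[float]) -> dict[str, float]:
    """Return mean, median, min, and max for a non-empty list of latencies in seconds."""
    if not latencies:
        raise ValueError("latencies must not be empty")
    ordered = sorted(latencies)
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "min": ordered[0],
        "max": ordered[-1],
    }


# Made-up sample values; in the notebook this would be summarize_latencies(latencies)
latency_summary = summarize_latencies([1.0, 2.0, 6.0])
print(latency_summary)
```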


In [20]:
i = 0  # index of the benchmark result to inspect

In [21]:
print(outputs[i]["query"])
print(outputs[i]["result"])
print(outputs[i]["source_documents"])
i += 1  # re-run this cell to step through the remaining results

What method did the authors use to extend the context length of Llama-3-8B-Instruct from 8K to 80K?
According to the text, the authors used a method called QLoRA (Quantization, Loosening, and Reconstruction) fine-tuning to extend the context length of Llama-3-8B-Instruct from 8K to 80K. They also synthesized 3.5K long-context training data using GPT-4 to cover three long-context tasks: Single-Detail QA and Multi-Detail QA.
[Document(page_content='I. Molybog, Y . Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M.\nSmith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan,\nI. Zarov, Y . Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and\nT. Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023.\n[16] P. Zhang, Z. Liu, S. Xiao, N. Shao, Q. Ye, and Z. Dou. Soaring from 4k to 400k: Extending\nllm’s context with activation beacon, 2024.\n[17] X. Zhang, Y . Chen, S. Hu, Z. Xu, J. Che