# Build a RAG System on the “Leave No Context Behind” Paper

In [1]:
pip install pypdf

Collecting pypdf
  Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
     -------------------------------------- 290.4/290.4 kB 1.4 MB/s eta 0:00:00
Installing collected packages: pypdf
Successfully installed pypdf-4.2.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install langchain_google_genai

Collecting langchain_google_genai
  Downloading langchain_google_genai-1.0.2-py3-none-any.whl (28 kB)
Collecting langchain-core<0.2,>=0.1.27
  Downloading langchain_core-0.1.45-py3-none-any.whl (291 kB)
     -------------------------------------- 291.3/291.3 kB 1.4 MB/s eta 0:00:00
Collecting google-generativeai<0.6.0,>=0.5.0
  Downloading google_generativeai-0.5.2-py3-none-any.whl (146 kB)
     -------------------------------------- 146.8/146.8 kB 1.5 MB/s eta 0:00:00
Collecting google-auth>=2.15.0
  Downloading google_auth-2.29.0-py2.py3-none-any.whl (189 kB)
     -------------------------------------- 189.2/189.2 kB 1.6 MB/s eta 0:00:00
Collecting google-api-python-client
  Downloading google_api_python_client-2.127.0-py2.py3-none-any.whl (12.7 MB)
     ---------------------------------------- 12.7/12.7 MB 4.4 MB/s eta 0:00:00
Collecting google-api-core
  Downloading google_api_core-2.18.0-py3-none-any.whl (138 kB)
     -------------------------------------- 138.3/138.3 kB 2.7 MB/s 

In [3]:
from langchain_google_genai import ChatGoogleGenerativeAI
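# Set up the API key (read from a local text file)
with open(r"C:\Users\ANKITHA\OneDrive\Desktop\RAG.txt") as f:
    # strip() drops any trailing newline so the key is passed cleanly
    GOOGLE_API_KEY = f.read().strip()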

chat_model = ChatGoogleGenerativeAI(google_api_key=GOOGLE_API_KEY, model="gemini-1.5-pro-latest")
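
Reading the key from a hard-coded path ties the notebook to one machine. A minimal sketch of an alternative is shown here, assuming the key may also be supplied through an environment variable. The helper name `load_api_key`, the environment variable name, and the `key_file` parameter are all hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def load_api_key(env_var="GOOGLE_API_KEY", key_file=None):
    """Return the API key from the environment, falling back to a text file.

    `env_var` and `key_file` are illustrative defaults for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    if key_file is not None and Path(key_file).is_file():
        # strip() removes the trailing newline most editors add to the file
        return Path(key_file).read_text(encoding="utf-8").strip()
    raise RuntimeError(f"No API key found in ${env_var} or {key_file!r}")
```

The returned string could then be passed as `google_api_key=` in the cell above, which keeps the key itself out of the notebook.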

# Loading the Document

In [4]:
pip install --upgrade langchain

Collecting langchain
  Downloading langchain-0.1.16-py3-none-any.whl (817 kB)
     -------------------------------------- 817.7/817.7 kB 3.2 MB/s eta 0:00:00
Collecting async-timeout<5.0.0,>=4.0.0
  Downloading async_timeout-4.0.3-py3-none-any.whl (5.7 kB)
Collecting aiohttp<4.0.0,>=3.8.3
  Downloading aiohttp-3.9.5-cp39-cp39-win_amd64.whl (371 kB)
     -------------------------------------- 371.6/371.6 kB 4.6 MB/s eta 0:00:00
Collecting dataclasses-json<0.7,>=0.5.7
  Downloading dataclasses_json-0.6.4-py3-none-any.whl (28 kB)
Collecting langchain-text-splitters<0.1,>=0.0.1
  Downloading langchain_text_splitters-0.0.1-py3-none-any.whl (21 kB)
Collecting langchain-community<0.1,>=0.0.32
  Downloading langchain_community-0.0.34-py3-none-any.whl (1.9 MB)
     ---------------------------------------- 1.9/1.9 MB 5.3 MB/s eta 0:00:00
Collecting multidict<7.0,>=4.5
  Downloading multidict-6.0.5-cp39-cp39-win_amd64.whl (28 kB)
Collecting frozenlist>=1.1.1
  Downloading frozenlist-1.4.1-cp39-cp

In [26]:
# Load a document

from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(r"C:\Users\ANKITHA\Downloads\2404.07143.pdf")
data = loader.load_and_split()

data[:5]

[Document(page_content='Preprint. Under review.\nLeave No Context Behind:\nEfficient Infinite Context Transformers with Infini-attention\nTsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal\nGoogle\ntsendsuren@google.com\nAbstract\nThis work introduces an efficient method to scale Transformer-based Large\nLanguage Models (LLMs) to infinitely long inputs with bounded memory\nand computation. A key component in our proposed approach is a new at-\ntention technique dubbed Infini-attention. The Infini-attention incorporates\na compressive memory into the vanilla attention mechanism and builds\nin both masked local attention and long-term linear attention mechanisms\nin a single Transformer block. We demonstrate the effectiveness of our\napproach on long-context language modeling benchmarks, 1M sequence\nlength passkey context block retrieval and 500K length book summarization\ntasks with 1B and 8B LLMs. Our approach introduces minimal bounded\nmemory parameters and enables fast strea

# Splitting the document into chunks

In [27]:
# Splitting the document into chunks
from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=500, chunk_overlap=100)

chunks = text_splitter.split_documents(data)

print(len(chunks))


Created a chunk of size 568, which is longer than the specified 500
Created a chunk of size 506, which is longer than the specified 500
Created a chunk of size 633, which is longer than the specified 500


110
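
The warnings above suggest that the splitter works on whole sentences: sentences are packed into a chunk until `chunk_size` would be exceeded, and a single sentence longer than the limit is kept whole rather than cut. The sketch below illustrates that idea in plain Python. It is a simplified stand-in under that assumption, not the actual `NLTKTextSplitter` implementation, and the function names and regex-based sentence split are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence split on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def pack_sentences(sentences, chunk_size, chunk_overlap):
    """Pack sentences into chunks of at most chunk_size characters where possible.

    Trailing sentences of a finished chunk are carried over as overlap,
    as long as they fit within chunk_overlap characters.
    """
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > chunk_size:
            # The current chunk is full: emit it (it may be oversized if it
            # consists of one very long sentence).
            chunks.append(" ".join(current))
            # Drop leading sentences until the remainder fits the overlap
            # budget and leaves room for the next sentence.
            while current and (
                len(" ".join(current)) > chunk_overlap
                or len(" ".join(current + [sentence])) > chunk_size
            ):
                current.pop(0)
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With `chunk_size=32` and `chunk_overlap=16`, the text "One two three. Four five six. Seven eight nine." yields two chunks that share the middle sentence, which is the role `chunk_overlap=100` plays in the cell above.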


# Creating Chunk Embeddings

In [28]:
# Creating the chunk embeddings
# Here we only load the Google Generative AI embedding model

from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding_model = GoogleGenerativeAIEmbeddings(google_api_key=GOOGLE_API_KEY, model="models/embedding-001")

# vectors = embedding_model.embed_documents([chunk.page_content for chunk in chunks])

# Storing the chunks in a vector store

In [29]:
!pip install chromadb
# Store the chunks in the vector store
from langchain_community.vectorstores import Chroma

# Embed each chunk and load it into the vector store
db = Chroma.from_documents(chunks, embedding_model, persist_directory="Downloads")

# Persist the database to disk (newer Chroma releases may do this automatically)
db.persist()



In [30]:
# Setting up a connection to the persisted Chroma DB
connection = Chroma(persist_directory="Downloads", embedding_function=embedding_model)

# Setting up the Vector Store as a Retriever

In [31]:
# Converting the Chroma connection to a retriever object
retriever = connection.as_retriever(search_kwargs={"k": 5})

print(type(retriever))

<class 'langchain_core.vectorstores.VectorStoreRetriever'>
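
Conceptually, a retriever with `k=5` embeds the query, scores every stored chunk against it, and returns the five best matches. The sketch below shows that idea with a toy word-count "embedding" and cosine similarity. It is an illustration only: the notebook does not state which distance metric or index Chroma uses, and `embed`, `cosine`, and `top_k` are hypothetical helper names.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query, chunks, k=5):
    """Return the k chunks most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]
```

Because ranking depends only on vector similarity, two differently phrased queries can return the same top chunk when their embeddings are close.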


Now let’s write the actual application logic. We want to create a simple application that takes a user question, searches for documents relevant to that question, passes the retrieved documents and initial question to a model, and returns an answer.

# Retrieving the context based on the user's query
# Query - 1

In [32]:
user_query = "What is LLMs?"

In [33]:
retrieved_docs = retriever.invoke(user_query)

In [34]:
len(retrieved_docs)

5

In [35]:
print(retrieved_docs[0].page_content)

However, the LLMs in their current state
have yet to see an effective, practical compres-
sive memory technique that balances simplicity along with quality.

1arXiv:2404.07143v1  [cs.CL]  10 Apr 2024


# Query - 2

In [36]:
user_query = "Tell me about LLMs?"

In [37]:
retrieved_docs = retriever.invoke(user_query)

In [38]:
len(retrieved_docs)

5

In [39]:
print(retrieved_docs[0].page_content)

However, the LLMs in their current state
have yet to see an effective, practical compres-
sive memory technique that balances simplicity along with quality.

1arXiv:2404.07143v1  [cs.CL]  10 Apr 2024


# Passing the context and question to the LLM

In [40]:
# Only the classes used below are imported
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate

chat_template = ChatPromptTemplate.from_messages([
    # System Message Prompt Template
    SystemMessage(content="""You are a Helpful AI Bot. 
    You take the context and question from user. Your answer should be based on the specific context."""),
    # Human Message Prompt Template
    HumanMessagePromptTemplate.from_template("""Answer the question based on the given context.
    Context:
    {context}
    Question: 
    {question}
    
    Answer: """)
])
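
The template above carries two placeholders, `{context}` and `{question}`, which are filled in when the chain runs. A plain-Python sketch of that substitution step follows, assuming simple string formatting. The names `SYSTEM_TEXT`, `HUMAN_TEMPLATE`, and `build_messages` are hypothetical, and the real `ChatPromptTemplate` produces message objects rather than tuples.

```python
SYSTEM_TEXT = "You are a helpful AI bot. Answer using only the given context."
HUMAN_TEMPLATE = (
    "Answer the question based on the given context.\n"
    "Context:\n{context}\n"
    "Question:\n{question}\n\n"
    "Answer: "
)


def build_messages(context, question):
    """Return (role, text) pairs with both placeholders filled in."""
    human_text = HUMAN_TEMPLATE.format(context=context, question=question)
    return [("system", SYSTEM_TEXT), ("human", human_text)]
```

Printing the filled-in human message for a sample question is a quick way to check what the model will actually receive.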

In [41]:
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()

In [42]:
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | chat_template
    | chat_model
    | output_parser
)
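
The `|` operators chain the steps so that each one receives the previous step's output. The sketch below traces the same data flow with ordinary functions and stand-ins for the retriever and the model, so it runs without an API key. Every name in it (`fake_retriever`, `fake_model`, `run_chain`, and so on) is hypothetical, and it only approximates how the chain composes its steps.

```python
def fake_retriever(question):
    """Stand-in for `retriever`: returns a fixed list of chunk texts."""
    return ["Chunk A about attention.", "Chunk B about memory."]


def format_chunks(chunks):
    """Join chunk texts with blank lines, like `format_docs` above."""
    return "\n\n".join(chunks)


def fake_model(prompt):
    """Stand-in for `chat_model`: reports the prompt length instead of calling an API."""
    return {"content": f"(model reply to a {len(prompt)}-character prompt)"}


def parse_output(message):
    """Stand-in for `StrOutputParser`: extract the plain text of the reply."""
    return message["content"]


def run_chain(question):
    # Step 1: the question fans out to two branches, retrieval and passthrough.
    inputs = {
        "context": format_chunks(fake_retriever(question)),
        "question": question,
    }
    # Step 2: the prompt template fills both placeholders.
    prompt = "Context:\n{context}\nQuestion:\n{question}\nAnswer: ".format(**inputs)
    # Steps 3 and 4: the model answers and the parser returns a string.
    return parse_output(fake_model(prompt))
```

Swapping the stand-ins for the real retriever and model gives the same sequence of steps that `rag_chain.invoke(...)` performs.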

# Query - 1

In [43]:
from IPython.display import Markdown as markdown
response = rag_chain.invoke("What is Scaled Dot-product Attention?")

markdown(response)

## Scaled Dot-Product Attention in the Context of Local Attention

The provided text doesn't explicitly define "Scaled Dot-Product Attention," but it does offer clues about its application within the context of local attention mechanisms. Let's break down what we can understand:

**Key Points:**

* **Dot-product attention is used within segments:** The text mentions "standard causal dot-product attention context within each segment." This suggests the attention mechanism is applied to individual segments of text rather than the entire sequence.
* **Local attention restricts the scope:** The computation covers a specific number (N) of tokens within the current segment (S). This limits the attention to a local context, ignoring tokens outside the segment.
* **Comparison to previous local attention:** The text contrasts this approach with the local attention proposed by Dai et al. (2019), which discards attention states from previous segments when processing the next one. This implies the current method might retain or utilize information from past segments in some way.

**Inference about Scaled Dot-Product Attention:**

While the specifics of "Scaled Dot-Product Attention" remain unclear, we can infer that it likely involves:

1. **Dot-product calculation:** This is a common way to compute attention scores, measuring the similarity between query and key vectors.
2. **Scaling:** The "scaled" part suggests there might be a scaling factor applied to the dot product, potentially to prevent the values from becoming too large, which can cause vanishing gradients during training.
3. **Causal masking:** The mention of "causal" implies that the attention mechanism only attends to past tokens and the current token, preventing "cheating" by looking into the future.

**Connection to System-Level Optimization:**

The text also mentions system-level optimization techniques used to improve the efficiency of exact attention computation. This suggests that Scaled Dot-Product Attention, while effective, might be computationally expensive, prompting the need for optimization, especially when dealing with long sequences.

**In conclusion,** while the text doesn't explicitly define Scaled Dot-Product Attention, we can infer its core mechanics and its role within the local attention context. The specifics of scaling and its relation to system-level optimizations would require further information or investigation into the referenced papers. 


# Query - 2

In [44]:
response = rag_chain.invoke("Memory and Effective Context Window?")

markdown(response)

## Memory and Effective Context Window: Balancing Power and Practicality

The provided text discusses the concept of memory and effective context window in the context of large language models (LLMs). Let's break down what this means:

**Context Window:** This refers to the amount of information an LLM can "remember" and consider while processing new data.  A larger context window allows the model to understand and respond to complex situations with greater accuracy.

**Memory Footprint:**  This refers to the amount of storage and processing power required to maintain the context window.  As the context window grows, so does the memory footprint, leading to potential limitations in practicality and efficiency. 

**The Challenge:**  LLMs strive to achieve an unbounded context window (remembering everything) while maintaining a bounded memory footprint (using resources efficiently).

**Existing Solutions:** The text mentions "segment-level memory models" which attempt to address this challenge.  However, it points out that these models often lack a balance between simplicity and quality.

**Infini-Transformer:** The proposed solution, "Infini-Transformer," aims to enable an unbounded context window with a bounded memory footprint.  This suggests a more efficient and scalable approach to managing memory in LLMs. 

**Overall:**  The text highlights the importance of balancing memory capacity and practicality in LLMs. The "Infini-Transformer" is presented as a potential solution to achieve this balance, offering the ability to process vast amounts of information without excessive resource demands. 
