# **Build a RAG System on “Leave No Context Behind” Paper**

In [1]:
pip install pypdf

Note: you may need to restart the kernel to use updated packages.




In [2]:
pip install langchain_google_genai

Note: you may need to restart the kernel to use updated packages.




In [3]:
from langchain_google_genai import ChatGoogleGenerativeAI

# Set up the API key: read it from a local file, close the file, and strip the trailing newline
with open(r"C:\Users\HP\OneDrive\Desktop\projects\Langchain\keys\Lang_Chain_Key.txt") as f:
    GOOGLE_API_KEY = f.read().strip()

chat_model = ChatGoogleGenerativeAI(google_api_key=GOOGLE_API_KEY, model="gemini-1.5-pro-latest")
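
Reading the key from a fixed path ties the notebook to one machine. A minimal sketch of an alternative, assuming the key is exported in an environment variable (the name `GOOGLE_API_KEY` and the helper `load_api_key` are illustrative choices, not part of the original notebook), with an optional file fallback:

```python
import os
from pathlib import Path


def load_api_key(env_var="GOOGLE_API_KEY", key_file=None):
    """Return the API key from the environment, falling back to a text file.

    Both `env_var` and `key_file` are hypothetical parameters for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    if key_file is not None and Path(key_file).is_file():
        return Path(key_file).read_text().strip()
    raise RuntimeError(f"No API key found in ${env_var} or {key_file!r}")
```

The returned string could then be passed as `google_api_key=load_api_key()` when constructing the chat model.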

# Loading the Document

In [4]:
pip install --upgrade langchain

Note: you may need to restart the kernel to use updated packages.




In [5]:
# Load a document

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(r"C:\Users\HP\OneDrive\Desktop\projects\Langchain\2404.07143.pdf")

data = loader.load_and_split()

data[:5]

[Document(page_content='Preprint. Under review.\nLeave No Context Behind:\nEfficient Infinite Context Transformers with Infini-attention\nTsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal\nGoogle\ntsendsuren@google.com\nAbstract\nThis work introduces an efficient method to scale Transformer-based Large\nLanguage Models (LLMs) to infinitely long inputs with bounded memory\nand computation. A key component in our proposed approach is a new at-\ntention technique dubbed Infini-attention. The Infini-attention incorporates\na compressive memory into the vanilla attention mechanism and builds\nin both masked local attention and long-term linear attention mechanisms\nin a single Transformer block. We demonstrate the effectiveness of our\napproach on long-context language modeling benchmarks, 1M sequence\nlength passkey context block retrieval and 500K length book summarization\ntasks with 1B and 8B LLMs. Our approach introduces minimal bounded\nmemory parameters and enables fast strea

# Splitting the document into chunks

In [6]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
# Splitting the document into chunks
from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=500, chunk_overlap=100)

chunks = text_splitter.split_documents(data)

print(len(chunks))

Created a chunk of size 568, which is longer than the specified 500
Created a chunk of size 506, which is longer than the specified 500
Created a chunk of size 633, which is longer than the specified 500


110
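
The warnings above appear when a single sentence is longer than `chunk_size`: a sentence-based splitter keeps the sentence whole rather than cutting it. The following is a simplified sketch of that packing idea, not LangChain's actual implementation; the function `merge_sentences` is a hypothetical name and sizes are counted in characters:

```python
def merge_sentences(sentences, chunk_size=500, chunk_overlap=100, sep=" "):
    """Greedily pack sentences into chunks, carrying some sentences over as overlap.

    A chunk exceeds chunk_size only when one sentence alone is longer than
    chunk_size, because sentences are never cut in the middle.
    """
    chunks, current = [], []
    for sentence in sentences:
        candidate = sep.join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(sep.join(current))
            # Carry trailing sentences forward while they fit in the overlap budget.
            carried = []
            for prev in reversed(current):
                if len(sep.join([prev] + carried)) > chunk_overlap:
                    break
                carried.insert(0, prev)
            # Drop carried sentences if they would push the next chunk over the limit.
            while carried and len(sep.join(carried + [sentence])) > chunk_size:
                carried.pop(0)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(sep.join(current))
    return chunks


# Small demonstration with tiny sizes so the overlap is visible.
print(merge_sentences(["aaaa", "bbbb", "cccc"], chunk_size=10, chunk_overlap=5))
```

With these toy inputs the second chunk starts with the last sentence of the first, which is the role `chunk_overlap=100` plays in the cell above.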


# Creating Chunk Embeddings

In [8]:
# Creating chunk embeddings
# We load the Google Generative AI embedding model here

from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding_model = GoogleGenerativeAIEmbeddings(google_api_key=GOOGLE_API_KEY, model="models/embedding-001")

# vectors = embedding_model.embed_documents([chunk.page_content for chunk in chunks])

# Storing the chunks in a vector store

In [9]:
pip install chromadb

Note: you may need to restart the kernel to use updated packages.




In [10]:
# Store the chunks in vector store
from langchain_community.vectorstores import Chroma

# Embed each chunk and load it into the vector store
db = Chroma.from_documents(chunks, embedding_model, persist_directory="Downloads")

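# Persist the database to disk
# (recent chromadb releases persist automatically, so this call may be unnecessary or deprecated there)
db.persist()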

In [11]:
# Set up a connection to the persisted Chroma database
connection = Chroma(persist_directory="Downloads", embedding_function=embedding_model)

# Setting up the Vector Store as a Retriever

In [12]:
# Convert the Chroma connection to a retriever object that returns the top 5 chunks
retriever = connection.as_retriever(search_kwargs={"k": 5})

print(type(retriever))

<class 'langchain_core.vectorstores.VectorStoreRetriever'>
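
Conceptually, a retriever with `k=5` embeds the query and returns the five stored chunks whose vectors are closest to it. The sketch below illustrates this with cosine similarity on toy vectors; this is an assumption made for illustration, since the distance metric Chroma actually uses depends on its configuration, and the helper names are hypothetical:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=5):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy 2-dimensional "embeddings"; real embedding vectors have far more dimensions.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], doc_vecs, k=2))
```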


Now let’s write the actual application logic. We want to create a simple application that takes a user question, searches for documents relevant to that question, passes the retrieved documents and initial question to a model, and returns an answer.

# Retrieving the context based on the user's query

### Query - 1

In [13]:
user_query = "What is LLMs?"

In [14]:
retrieved_docs = retriever.invoke(user_query)

In [15]:
len(retrieved_docs)

5

In [16]:
print(retrieved_docs[0].page_content)

However, the LLMs in their current state
have yet to see an effective, practical compres-
sive memory technique that balances simplicity along with quality.

1arXiv:2404.07143v1  [cs.CL]  10 Apr 2024


### Query - 2

In [17]:
user_query = "Tell me about LLMs?"

In [18]:
retrieved_docs = retriever.invoke(user_query)

In [19]:
len(retrieved_docs)

5

In [20]:
print(retrieved_docs[0].page_content)

However, the LLMs in their current state
have yet to see an effective, practical compres-
sive memory technique that balances simplicity along with quality.

1arXiv:2404.07143v1  [cs.CL]  10 Apr 2024


# Passing the context and question to the LLM

In [21]:
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate

chat_template = ChatPromptTemplate.from_messages([
    # System Message Prompt Template
    SystemMessage(content="""You are a Helpful AI Bot.
    You take the context and question from user. Your answer should be based on the specific context."""),
    # Human Message Prompt Template
    HumanMessagePromptTemplate.from_template("""Answer the question based on the given context.
    Context:
    {context}
    Question:
    {question}

    Answer: """)
])

In [22]:
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()

In [23]:
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | chat_template
    | chat_model
    | output_parser
)
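
The pipe syntax above composes four steps: retrieve chunks, format them into one context string, fill the prompt, and call the model. A rough plain-Python equivalent is sketched below. The `retrieve` and `generate` callables, along with the `fake_*` stand-ins, are hypothetical placeholders so the sketch runs without an API key, and the system message is omitted for brevity:

```python
def format_docs(docs):
    """Join retrieved chunk texts with blank lines (here the chunks are plain strings)."""
    return "\n\n".join(docs)


PROMPT = (
    "Answer the question based on the given context.\n"
    "Context:\n{context}\n"
    "Question:\n{question}\n\n"
    "Answer: "
)


def answer(question, retrieve, generate):
    """Retrieve context, fill the prompt, and return the model's text reply."""
    context = format_docs(retrieve(question))
    prompt = PROMPT.format(context=context, question=question)
    return generate(prompt)


# Hypothetical stand-ins for the retriever and the chat model.
def fake_retrieve(question):
    return ["Infini-attention adds a compressive memory.", "Memory is bounded."]


def fake_generate(prompt):
    return f"(model reply to a prompt of {len(prompt)} characters)"


print(answer("What is Infini-attention?", fake_retrieve, fake_generate))
```

In the real chain, `RunnablePassthrough()` plays the role of forwarding the unchanged question into the `question` slot while the retriever fills the `context` slot.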

## Query - 1

In [24]:
from IPython.display import Markdown as markdown
response = rag_chain.invoke("What is Scaled Dot-product Attention?")

markdown(response)

## Scaled Dot-Product Attention Explained

Unfortunately, the provided context doesn't specifically define "Scaled Dot-Product Attention." However, based on the information given and the general knowledge of attention mechanisms in transformers, we can infer and explain its likely meaning:

**Dot-Product Attention:**

This is a fundamental attention mechanism where relevance between elements in a sequence is computed using the dot product. In simpler terms, it measures how much each element "attends to" or is relevant to every other element. 

**Scaling:**

The dot product can lead to large values, especially with longer sequences, causing gradients to become small during training. To mitigate this, the dot product is scaled by dividing it by the square root of the dimension of the key vectors (typically denoted as d_k). This scaling ensures more stable gradients and better training behavior.

**Putting it Together:**

Therefore, Scaled Dot-Product Attention likely refers to the dot-product attention mechanism with the additional scaling factor to stabilize training. This is a common and crucial component in transformer models for various sequence-to-sequence tasks like machine translation, text summarization, and question answering.

**Connection to the Context:**

The context mentions "causal dot-product attention" within segments. This implies the use of masked attention, where each token can only attend to tokens preceding it in the sequence. This is particularly relevant for tasks like language modeling where the model predicts the next token based on the preceding context. 

**Additional Notes:**

* The context also mentions "local attention" which restricts attention to a smaller window of tokens instead of the entire sequence. This can improve efficiency and is useful for long sequences.
* The paper seems to propose a method that combines aspects of both local and global attention, potentially using segments to achieve this.

While the exact details of Scaled Dot-Product Attention within this specific context remain unclear, the explanation above provides a general understanding of the concept and its likely application within the paper. 
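
For reference, the model's answer above falls back on general knowledge because the retrieved chunks did not contain the definition. The standard formulation from the Transformer literature, with queries $Q$, keys $K$, values $V$, and key dimension $d_k$, is:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

The division by $\sqrt{d_k}$ is the scaling step the response describes.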


## Query - 2

In [25]:
response = rag_chain.invoke("Memory and Effective Context Window?")

markdown(response)

## Memory and Effective Context Window: Balancing Power and Efficiency

Based on the context, it seems you're exploring the concept of memory and effective context window within the realm of large language models (LLMs) and, specifically, the Infini-Transformer model. Here's a breakdown of what we can understand:

**The Challenge of Memory in LLMs:**

*   Traditional LLMs struggle with limited context window sizes, meaning they can only process a certain amount of information at once. This restricts their ability to handle long-range dependencies and complex tasks requiring extensive context.

**Infini-Transformer and Unbounded Context Window:**

*   The Infini-Transformer introduces a novel approach, enabling an "unbounded context window." This suggests it can theoretically handle input of any length without being constrained by a predetermined limit.

**Bounded Memory Footprint:**

*   Despite the unbounded context window, the Infini-Transformer maintains a "bounded memory footprint." This implies that it employs efficient memory management techniques to avoid excessive memory consumption, even when processing vast amounts of data.

**Comparison with Previous Models:**

*   Table 1 likely provides a comparison of the Infini-Transformer with earlier segment-level memory models, highlighting its advantages in terms of memory efficiency and context window size.

**The Quest for Compressive Memory Techniques:**

*   The context mentions the ongoing search for "effective, practical compressive memory techniques." This indicates a desire to further optimize memory usage in LLMs while maintaining performance and simplicity. 

**In essence, the Infini-Transformer tackles the memory limitations of traditional LLMs by offering an unbounded context window with a controlled memory footprint. This advancement holds significant potential for processing lengthy sequences and complex tasks that demand a broader understanding of context.**
