# **Build a RAG System on the “Leave No Context Behind” Paper**

In [46]:
#pip install pypdf

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [49]:
#pip install langchain_google_genai

In [50]:
from langchain_google_genai import ChatGoogleGenerativeAI

# Set up the API key; the context manager closes the file, and strip() drops any trailing newline
with open(r"C:\Users\Admin\OneDrive\Desktop\Lang_chain_key.txt") as f:
    GOOGLE_API_KEY = f.read().strip()

chat_model = ChatGoogleGenerativeAI(google_api_key=GOOGLE_API_KEY, model="gemini-1.5-pro-latest")
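Reading the key from a fixed path only works on the machine that has that file. A minimal sketch of a more portable loader follows, assuming the key may instead be supplied through an environment variable. The helper name `load_api_key`, the variable name `GOOGLE_API_KEY`, and the fallback order are illustrative assumptions, not part of the original notebook.

```python
import os
from pathlib import Path


def load_api_key(env_var="GOOGLE_API_KEY", key_file=None):
    """Return an API key from an environment variable, else from a text file.

    Hypothetical helper: the name, the env-var name, and the fallback order
    are assumptions made for illustration.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    if key_file is not None:
        # strip() drops the trailing newline most editors add to the file
        key = Path(key_file).read_text(encoding="utf-8").strip()
        if key:
            return key
    raise RuntimeError(f"No API key found in ${env_var} or {key_file!r}")
```

It could be called as `GOOGLE_API_KEY = load_api_key(key_file=r"C:\Users\Admin\OneDrive\Desktop\Lang_chain_key.txt")`, which keeps the notebook's file-based behaviour when the environment variable is not set.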

# Loading the Document

In [51]:
pip install --upgrade langchain




In [52]:
# Load a document

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(r"C:\Users\Admin\Downloads\2404.07143.pdf")

data = loader.load_and_split()

data[:5]

[Document(page_content='Preprint. Under review.\nLeave No Context Behind:\nEfficient Infinite Context Transformers with Infini-attention\nTsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal\nGoogle\ntsendsuren@google.com\nAbstract\nThis work introduces an efficient method to scale Transformer-based Large\nLanguage Models (LLMs) to infinitely long inputs with bounded memory\nand computation. A key component in our proposed approach is a new at-\ntention technique dubbed Infini-attention. The Infini-attention incorporates\na compressive memory into the vanilla attention mechanism and builds\nin both masked local attention and long-term linear attention mechanisms\nin a single Transformer block. We demonstrate the effectiveness of our\napproach on long-context language modeling benchmarks, 1M sequence\nlength passkey context block retrieval and 500K length book summarization\ntasks with 1B and 8B LLMs. Our approach introduces minimal bounded\nmemory parameters and enables fast strea

# Splitting the document into chunks

In [11]:
# Split the document into chunks of roughly 500 characters with a 100-character overlap
from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=500, chunk_overlap=100)

chunks = text_splitter.split_documents(data)

print(len(chunks))

Created a chunk of size 568, which is longer than the specified 500
Created a chunk of size 506, which is longer than the specified 500
Created a chunk of size 633, which is longer than the specified 500


110
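The warnings above appear because a sentence-based splitter does not cut sentences in half: when one sentence is longer than `chunk_size`, it becomes an oversized chunk. The sketch below illustrates that behaviour with a naive regex sentence splitter. It is a simplified stand-in, not the `NLTKTextSplitter` implementation, and the function name `pack_sentences` is hypothetical.

```python
import re


def pack_sentences(text, chunk_size=500, chunk_overlap=100):
    """Greedily pack sentences into chunks of at most `chunk_size` characters.

    Simplified illustration, NOT the NLTKTextSplitter implementation: sentences
    are found with a naive regex, and a sentence is never cut in half, so a
    single long sentence yields a chunk larger than `chunk_size`.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences over as overlap, up to `chunk_overlap` chars.
            overlap = []
            for prev in reversed(current):
                if len(" ".join([prev] + overlap)) > chunk_overlap:
                    break
                overlap.insert(0, prev)
            # Drop the overlap if it would push the next chunk over the limit.
            if len(" ".join(overlap + [sentence])) > chunk_size:
                overlap = []
            current = overlap
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With a small `chunk_size`, a text containing one very long sentence produces a chunk that exceeds the limit, which is the situation the splitter warns about.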


# Creating Chunk Embeddings

In [53]:
# Create the embedding model for the chunks
# Here we load GoogleGenerativeAIEmbeddings

from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding_model = GoogleGenerativeAIEmbeddings(google_api_key=GOOGLE_API_KEY, model="models/embedding-001")

# vectors = embedding_model.embed_documents([chunk.page_content for chunk in chunks])

# Storing the chunks in a vector store

In [54]:
# Store the chunks in a vector store
from langchain_community.vectorstores import Chroma

# Embed each chunk and load it into the vector store
db = Chroma.from_documents(chunks, embedding_model, persist_directory="Downloads")

# Persist the database to disk (recent Chroma versions persist automatically, so this call may be unnecessary)
db.persist()

In [55]:
# Set up a connection to the persisted Chroma database
connection = Chroma(persist_directory="Downloads", embedding_function=embedding_model)

# Setting up the Vector Store as a Retriever

In [56]:
# Convert the Chroma connection into a retriever object that returns the 5 most similar chunks
retriever = connection.as_retriever(search_kwargs={"k": 5})

print(type(retriever))

<class 'langchain_core.vectorstores.VectorStoreRetriever'>
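Setting `search_kwargs={"k": 5}` asks the retriever for the five chunks whose embeddings are closest to the query embedding. The sketch below shows the idea with toy vectors and cosine similarity. The metric and index that Chroma actually uses depend on its configuration and are not reproduced here; the names `cosine_similarity`, `top_k`, and `toy_index` are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k=5):
    """Return the k (text, score) pairs most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in indexed_chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real embedding vectors have hundreds of dimensions.
toy_index = [
    ("compressive memory", [0.9, 0.1, 0.0]),
    ("passkey retrieval", [0.1, 0.9, 0.0]),
    ("book summarization", [0.0, 0.2, 0.9]),
]
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
```

Here `results` holds the two entries closest to the query direction, ordered from most to least similar.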


Now let’s write the actual application logic. We want to create a simple application that takes a user question, searches for documents relevant to that question, passes the retrieved documents and initial question to a model, and returns an answer.

# Retrieving the context based on the user's query

### Query - 1

In [57]:
user_query = "What is LLMs?"

In [58]:
retrieved_docs = retriever.invoke(user_query)

In [59]:
len(retrieved_docs)

5

In [60]:
print(retrieved_docs[0].page_content)

However, the LLMs in their current state
have yet to see an effective, practical compres-
sive memory technique that balances simplicity along with quality.

1arXiv:2404.07143v1  [cs.CL]  10 Apr 2024


### Query - 2

In [61]:
user_query = "Tell me about LLMs?"

In [62]:
retrieved_docs = retriever.invoke(user_query)

In [63]:
len(retrieved_docs)

5

In [64]:
print(retrieved_docs[0].page_content)

However, the LLMs in their current state
have yet to see an effective, practical compres-
sive memory technique that balances simplicity along with quality.

1arXiv:2404.07143v1  [cs.CL]  10 Apr 2024


# Passing the context and question to the LLM

In [65]:
# Import only the message and prompt classes used below
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate

chat_template = ChatPromptTemplate.from_messages([
    # System Message Prompt Template
    SystemMessage(content="""You are a Helpful AI Bot. 
    You take the context and question from user. Your answer should be based on the specific context."""),
    # Human Message Prompt Template
    HumanMessagePromptTemplate.from_template("""Answer the question based on the given context.
    Context:
    {context}
    Question: 
    {question}
    
    Answer: """)
])

In [66]:
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()

In [67]:
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | chat_template
    | chat_model
    | output_parser
)
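The chain sends the same question down two paths: `retriever | format_docs` fills `{context}`, while `RunnablePassthrough()` forwards the question unchanged into `{question}`. A rough plain-Python equivalent of that data flow is sketched below. `Doc`, `run_rag`, `fake_retrieve`, and `fake_model` are hypothetical stand-ins so the sketch runs without an API key, and `str(...)` only approximates what `StrOutputParser` does with a chat message.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a LangChain Document (only the field used here)."""
    page_content: str


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


PROMPT = (
    "Answer the question based on the given context.\n"
    "Context:\n{context}\nQuestion:\n{question}\n\nAnswer: "
)


def run_rag(question, retrieve, model):
    """Mirror the chain's data flow with ordinary function calls."""
    inputs = {
        "context": format_docs(retrieve(question)),  # retriever | format_docs
        "question": question,                        # RunnablePassthrough()
    }
    prompt = PROMPT.format(**inputs)                 # chat_template
    return str(model(prompt))                        # chat_model | output_parser


# Hypothetical stand-ins for the retriever and the chat model.
def fake_retrieve(question):
    return [Doc("Infini-attention adds a compressive memory."), Doc("Memory stays bounded.")]


def fake_model(prompt):
    return f"(model saw {len(prompt)} characters)"


answer = run_rag("What is Infini-attention?", fake_retrieve, fake_model)
```

Swapping `fake_retrieve` and `fake_model` for real components is what the LangChain pipe syntax does declaratively.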

## Query - 1

In [68]:
from IPython.display import Markdown as markdown
response = rag_chain.invoke("What is Scaled Dot-product Attention?")

markdown(response)

## Scaled Dot-Product Attention: A Missing Piece

While the provided text discusses attention mechanisms within the context of segment processing, it doesn't directly explain Scaled Dot-Product Attention. However, I can provide you with a general understanding of this mechanism based on my knowledge up to November 2023.

**Scaled Dot-Product Attention** is a specific type of attention mechanism commonly used in Transformer models. It allows the model to attend to different parts of the input sequence and weigh their importance when making predictions. Here's how it works:

1. **Dot Product**: For each word in the sequence, its representation (vector) is compared to the representations of all other words using the dot product operation. This results in a score that reflects the similarity between the words.
2. **Scaling**: The dot products are scaled down by dividing them by the square root of the dimension of the word vectors. This scaling helps to prevent the values from becoming too large, which can cause issues during training.
3. **Softmax**: The scaled dot products are then passed through a softmax function. This normalizes the scores into a probability distribution, where each word is assigned a weight between 0 and 1, indicating its relative importance. 
4. **Weighted Sum**: Finally, the word representations are multiplied by their respective weights and summed up. This creates a context vector that represents the relevant information from the entire input sequence for the current word.

**Benefits of Scaled Dot-Product Attention:**

* **Efficient Computation:** The dot product operation is computationally efficient, making it suitable for large sequences.
* **Parallelization:** The attention calculations can be parallelized across different words, leading to faster training and inference.
* **Learnable Relationships:** The model learns to focus on the most relevant parts of the input sequence based on the specific task.

**Limitations and Variations:**

* **Quadratic Complexity:** The attention mechanism has quadratic complexity with respect to the sequence length, which can be a bottleneck for very long sequences.
* **Alternatives:** Several variations of attention mechanisms exist, such as *multi-head attention* and *local attention*, which address some of the limitations of scaled dot-product attention.

**In the context of the provided text**, it seems the authors are using a local attention mechanism that restricts the attention calculations within each segment. This can help to improve efficiency and reduce the computational burden while still capturing relevant information within the segment. 
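The four steps listed in the response above (dot product, scaling, softmax, weighted sum) correspond to the standard formula softmax(QKᵀ / √d_k) V. A minimal pure-Python sketch for a single query vector follows. It illustrates the general mechanism only, not the Infini-attention variant from the paper, and the function names are illustrative.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(query, keys, values):
    """Attention output for one query vector over lists of key/value vectors."""
    d_k = len(query)
    # Steps 1-2: dot product of the query with every key, scaled by sqrt(d_k)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    # Step 3: softmax turns the scores into weights that sum to 1
    weights = softmax(scores)
    # Step 4: weighted sum of the value vectors
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights
```

When two keys are identical, the query weights them equally, so the output is the plain average of their value vectors.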


## Query - 2

In [71]:
response = rag_chain.invoke("Memory and Effective Context Window?")

markdown(response)

## Memory and Effective Context Window in Infini-Transformer

Based on the provided context, the Infini-Transformer model introduces a mechanism to handle an **unbounded context window** while maintaining a **bounded memory footprint**. This means the model can process and retain information from extensive input sequences without requiring excessive memory resources.

**Here's how it likely achieves this:**

* **Segment-level Memory Models:** The context mentions "previous segment-level memory models" and Table 1, which unfortunately isn't included in the provided text. However, it's safe to assume that Infini-Transformer utilizes a similar approach, dividing the input sequence into segments and employing memory techniques to efficiently store and access information from these segments.
* **Trainable Weights (WO):** The equation involving WO suggests the model learns to weigh and prioritize information within the segments, possibly focusing on the most relevant parts for the given task.

**Benefits of this approach:**

* **Handling Long Sequences:**  Infini-Transformer can effectively process and analyze lengthy sequences of data, such as long documents, time series, or code, which are often challenging for traditional transformer models due to memory limitations. 
* **Improved Efficiency:** By maintaining a bounded memory footprint, the model avoids excessive memory consumption, leading to better computational efficiency and potentially faster processing times.

**Further Information Needed:**

To fully understand the specifics of Infini-Transformer's memory and effective context window mechanism, it would be helpful to have access to:

* **Table 1:** This table likely provides a comparison of different segment-level memory models, shedding light on the specific approach used by Infini-Transformer and its advantages.
* **Details of the Memory Mechanism:**  A more detailed explanation of how the model stores and retrieves information from segments would provide a clearer understanding of its inner workings. 
