## Build a RAG System on “Leave No Context Behind” Paper

In [36]:
#!pip install pypdf
#!pip install langchain_google_genai
#!pip install langchain_community
#!pip install -U langchain-text-splitters

In [12]:
from langchain_google_genai import ChatGoogleGenerativeAI

# Setup API Key
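# Read the key inside a context manager so the file is closed,
# and strip the trailing newline that would otherwise end up in the key
with open(r"C:\Users\farheen\OneDrive\Desktop\gemini key.txt") as f:
    GOOGLE_API_KEY = f.read().strip()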

chat_model = ChatGoogleGenerativeAI(google_api_key=GOOGLE_API_KEY, model="gemini-1.5-pro-latest")

In [13]:
# Load a document

from langchain_community.document_loaders import PyPDFLoader

# Provide the path to your PDF file
pdf_path = r"C:\Users\farheen\Downloads\Leave_No_Context_Behind.pdf"

# Create a PyPDFLoader instance
loader = PyPDFLoader(pdf_path)

# Load and split the document
data = loader.load_and_split()

# Print the first 5 elements of the data
print(data[:5])

[Document(page_content='Preprint. Under review.\nLeave No Context Behind:\nEfficient Infinite Context Transformers with Infini-attention\nTsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal\nGoogle\ntsendsuren@google.com\nAbstract\nThis work introduces an efficient method to scale Transformer-based Large\nLanguage Models (LLMs) to infinitely long inputs with bounded memory\nand computation. A key component in our proposed approach is a new at-\ntention technique dubbed Infini-attention. The Infini-attention incorporates\na compressive memory into the vanilla attention mechanism and builds\nin both masked local attention and long-term linear attention mechanisms\nin a single Transformer block. We demonstrate the effectiveness of our\napproach on long-context language modeling benchmarks, 1M sequence\nlength passkey context block retrieval and 500K length book summarization\ntasks with 1B and 8B LLMs. Our approach introduces minimal bounded\nmemory parameters and enables fast strea

In [14]:
# Splitting the document into chunks

from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=500, chunk_overlap=100)

chunks = text_splitter.split_documents(data)

print(len(chunks))

Created a chunk of size 568, which is longer than the specified 500
Created a chunk of size 506, which is longer than the specified 500
Created a chunk of size 633, which is longer than the specified 500


110
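
The "Created a chunk of size ..." warnings above appear because the splitter packs whole sentences into chunks, so a single sentence longer than `chunk_size` is kept intact rather than cut. The block below is a simplified sketch of that behaviour, not the library's implementation: it assumes a naive regex sentence splitter in place of NLTK, ignores `chunk_overlap`, and `split_into_chunks` is a hypothetical helper.

```python
import re

def split_into_chunks(text, chunk_size=500):
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    Hypothetical helper for illustration: no overlap, naive sentence detection.
    """
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            # Adding this sentence would overflow, so close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# One sentence in this sample is far longer than the 40-character limit
sample = "Short one. " + "A very long sentence " * 5 + "ends here. Another short one."
chunks = split_into_chunks(sample, chunk_size=40)
oversized = [c for c in chunks if len(c) > 40]
print(len(chunks), "chunks,", len(oversized), "over the limit")
```

A sentence that exceeds the limit on its own still becomes one chunk, which is why a few chunks can come out longer than the requested size.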


In [15]:
# Creating chunk embeddings
# We are just loading the Google Generative AI embedding model here

from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding_model = GoogleGenerativeAIEmbeddings(google_api_key=GOOGLE_API_KEY, model="models/embedding-001")

# vectors = embedding_model.embed_documents([chunk.page_content for chunk in chunks])

In [16]:
# Store the chunks in vector store
from langchain_community.vectorstores import Chroma

# Embed each chunk and load it into the vector store
db = Chroma.from_documents(chunks, embedding_model, persist_directory="./chroma_db_rag")

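# Persist the database on disk (recent Chroma versions save automatically
# when persist_directory is set, so this call may be redundant or deprecated)
db.persist()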

In [17]:
# Setting up a connection to the persisted ChromaDB
connection = Chroma(persist_directory="./chroma_db_rag", embedding_function=embedding_model)

In [18]:
# Converting the Chroma connection to a retriever object
retriever = connection.as_retriever(search_kwargs={"k": 5})

print(type(retriever))

<class 'langchain_core.vectorstores.VectorStoreRetriever'>
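
With `k=5`, the retriever returns the five chunks whose embeddings are closest to the query embedding. The toy sketch below shows the idea of top-k similarity search in plain Python. The three-dimensional vectors and chunk labels are made up for illustration, and cosine similarity is an assumption here; the metric Chroma actually uses depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed_chunks, k=5):
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Hypothetical index: (chunk text, embedding) pairs with hand-made vectors
toy_index = [
    ("chunk about attention", [0.9, 0.1, 0.0]),
    ("chunk about datasets", [0.1, 0.9, 0.0]),
    ("chunk about memory", [0.7, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # stand-in for an embedded user query
results = top_k(query_vec, toy_index, k=2)
print(results)
```

A real vector store uses an approximate nearest-neighbour index instead of sorting every chunk, but the ranking principle is the same.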


Now let’s write the actual application logic. We want to create a simple application that takes a user question, searches for documents relevant to that question, passes the retrieved documents and initial question to a model, and returns an answer.
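
As a rough sketch of that flow, the function below wires the three steps together in plain Python. `fake_retrieve` and `fake_generate` are hypothetical stand-ins for the retriever and the chat model, so the example runs without an API key; the prompt wording mirrors the template used later in this notebook.

```python
def build_prompt(question, passages):
    """Combine the retrieved passages and the question into one prompt string."""
    context = "\n\n".join(passages)
    return (
        "Answer the question based on the given context.\n"
        f"Context:\n{context}\n"
        f"Question:\n{question}\n"
        "Answer: "
    )

def answer_question(question, retrieve, generate):
    """Retrieve passages, build the prompt, and return the model's answer."""
    passages = retrieve(question)              # step 1: search for relevant text
    prompt = build_prompt(question, passages)  # step 2: add context to the question
    return generate(prompt)                    # step 3: ask the model

# Hypothetical stand-ins for the real retriever and chat model
def fake_retrieve(question):
    return ["Infini-attention adds a compressive memory to attention."]

def fake_generate(prompt):
    return "context seen" if "compressive memory" in prompt else "no context"

answer = answer_question("What is Infini-attention?", fake_retrieve, fake_generate)
print(answer)
```

Passing the retriever and model in as arguments keeps the flow easy to test before plugging in the real components.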

In [19]:
# Query-1
user_query = "What is Infini-attention?"

In [20]:
retrieved_docs = retriever.invoke(user_query)

In [21]:
len(retrieved_docs)

5

In [22]:
print(retrieved_docs[0].page_content)

2.1 Infini-attention
As shown Figure 1, our Infini-attention computes both local and global context states and
combine them for its output.

Similar to multi-head attention (MHA), it maintains Hnumber
2


In [23]:
# Query-2
user_query = "Tell me about LLMs?"

In [24]:
retrieved_docs = retriever.invoke(user_query)

In [25]:
len(retrieved_docs)

5

In [26]:
print(retrieved_docs[0].page_content)

However, the LLMs in their current state
have yet to see an effective, practical compres-
sive memory technique that balances simplicity along with quality.

1arXiv:2404.07143v1  [cs.CL]  10 Apr 2024


In [27]:
# Passing the context and question to the LLM

In [28]:
# Import only the message and prompt classes this cell actually uses
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate

chat_template = ChatPromptTemplate.from_messages([
    # System Message Prompt Template
    SystemMessage(content="""You are a Helpful AI Bot. 
    You take the context and question from user. Your answer should be based on the specific context."""),
    # Human Message Prompt Template
    HumanMessagePromptTemplate.from_template("""Answer the question based on the given context.
    Context:
    {context}
    Question: 
    {question}
    
    Answer: """)
])

In [29]:
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()

In [30]:
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | chat_template
    | chat_model
    | output_parser
)
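
In the chain above, the leading dict sends the same input string down two branches (retrieval plus formatting, and a passthrough), and each `|` feeds one step's output into the next. The toy classes below imitate that composition in plain Python to make the data flow visible. `Step` and `fan_out` are hypothetical helpers written for this sketch, not LangChain classes, and the real runnables do considerably more.

```python
class Step:
    """Wraps a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # self | other: run self first, then pass its result to other
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)

def fan_out(branches):
    """Feed the same input to every branch and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})

# Hypothetical stand-ins for the retriever, format_docs, and the prompt template
toy_retriever = Step(lambda question: ["passage one", "passage two"])
join_passages = Step(lambda docs: "\n\n".join(docs))
passthrough = Step(lambda value: value)
fill_template = Step(lambda d: f"Context:\n{d['context']}\nQuestion:\n{d['question']}")

toy_chain = (
    fan_out({"context": toy_retriever | join_passages, "question": passthrough})
    | fill_template
)
prompt_text = toy_chain.invoke("What is Infini-attention?")
print(prompt_text)
```

The question reaches the template unchanged through the passthrough branch, while the other branch replaces it with the joined passages.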

In [31]:
from IPython.display import Markdown as markdown
response = rag_chain.invoke("What is LLMs?")

markdown(response)

## LLMs: Large Language Models

Based on the context you provided, **LLMs stands for Large Language Models**. These are complex AI systems trained on massive amounts of text data, allowing them to understand and generate human-like text in response to a wide range of prompts and questions. 


In [33]:
response = rag_chain.invoke("What is the main focus of the document regarding Transformer-based Large Language Models (LLMs)?")

markdown(response)

## Main Focus of the Document: Overcoming Memory Limitations in Transformer-based LLMs

The document primarily focuses on the challenges associated with **memory limitations in Transformer-based Large Language Models (LLMs)** and explores potential solutions to enable efficient processing of long sequences.

**Key points highlighted in the document:**

* **Limited Contextual Memory:**  Standard Transformer architectures struggle with long sequences due to the constraints of the attention mechanism. This leads to challenges in scaling LLMs to handle extensive contexts efficiently.
* **Scalability Issues:** Increasing sequence length significantly impacts memory footprint and computational cost, making it impractical for standard Transformers to manage extremely long inputs. 
* **Compressive Memory Systems:** The document proposes exploring compressive memory systems as a more scalable and efficient alternative to the attention mechanism for handling long sequences. 
* **Input Compression Techniques:**  Several methods, including utilizing Transformer LLMs themselves, are discussed for compressing input sequences to achieve efficient long-context modeling.
* **Infini-Transformer:** The document introduces Infini-Transformer, a novel approach inspired by Transformer-XL, designed to process infinitely long contexts with bounded memory and compute resources by employing a streaming fashion.

**Overall, the document emphasizes the need for innovative techniques to overcome memory limitations in Transformer-based LLMs and enable efficient processing of long sequences, paving the way for more powerful and versatile language models.** 


In [34]:
response = rag_chain.invoke("Explain about LLM Pre-training?")

markdown(response)

## LLM Pre-training for Long-Context Adaptation: Explained

Based on the context you provided, LLM pre-training, in this specific case, refers to the process of further training existing Large Language Models (LLMs) to handle **long-context information** more effectively. This is crucial because LLMs often struggle with maintaining context and coherence when dealing with lengthy inputs.

Here's a breakdown of the key points:

**Challenges with Long-Context:**

* Standard LLMs like transformers have limitations in handling long sequences due to the computational complexity of attention mechanisms.
* Maintaining context and relationships within extensive text inputs is difficult.

**Solutions Explored:**

* **Extending Attention Mechanisms:** Researchers have explored modifying attention layers (e.g., using "Infini-attention") to better manage long sequences.
* **Compressed Input Representations:** Techniques like using LLMs themselves to compress and summarize past segments of the input are being investigated. 

**LLM Continual Pre-training:**

* This approach involves taking an existing LLM and further training it on a dataset specifically designed for long-context scenarios.
* The provided context mentions using datasets like PG19, Arxiv-math, and C4 text with lengths exceeding 4K tokens.
* This pre-training aims to adapt the LLM's parameters to better handle and understand long-range dependencies within text.

**Benefits:**

* Improved performance on tasks requiring long-context understanding.
* Potential for better coherence and reduced information loss when processing lengthy inputs.

**Current State and Future Directions:**

* While promising, the research is ongoing. 
* The context highlights the need for finding a balance between simplicity and quality in compressive memory techniques.
* Further exploration is required to develop efficient and practical methods for long-context LLM adaptation. 
