# Build a RAG System on the “Leave No Context Behind” Paper

In [1]:
!pip install langchain_google_genai



In [2]:
!pip install langchain_community



In [3]:
!pip install -U langchain-text-splitters



In [4]:
!pip install pypdf



In [5]:
!pip install chromadb


In [6]:
from langchain_google_genai import ChatGoogleGenerativeAI

# Set up the API key: read it from a local text file and strip stray whitespace
with open(r"C:\Users\Arpan Ghosh\OneDrive\Desktop\Gemini\key\Gemini_key.txt") as f:
    GOOGLE_API_KEY = f.read().strip()

chat_model = ChatGoogleGenerativeAI(google_api_key=GOOGLE_API_KEY, model="gemini-1.5-pro-latest")
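
Reading the key from an absolute path ties the notebook to one machine. A minimal sketch of an alternative, assuming the key is exported in an environment variable named `GOOGLE_API_KEY`; the helper `load_api_key` and its `fallback_path` parameter are hypothetical names introduced here, not part of any library:

```python
import os
from pathlib import Path


def load_api_key(env_var="GOOGLE_API_KEY", fallback_path=None):
    """Return an API key from the environment, or from a text file as a fallback."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    if fallback_path is not None:
        # strip() drops the trailing newline most editors add to the file
        key = Path(fallback_path).read_text(encoding="utf-8").strip()
        if key:
            return key
    raise RuntimeError(f"No API key found in ${env_var} or {fallback_path!r}")
```

The result can be passed as `google_api_key=` exactly like the value read from the key file, with that file's path supplied as `fallback_path` if the environment variable is not set.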

In [8]:
# Load a document

from langchain_community.document_loaders import PyPDFLoader

# Provide the path to your PDF file
pdf_path = r"C:\Users\Arpan Ghosh\OneDrive\Desktop\langchain\Leave_No_Context_Behind.pdf"

# Create a PyPDFLoader instance
loader = PyPDFLoader(pdf_path)

# Load and split the document
data = loader.load_and_split()

# Print the first 5 elements of the data
print(data[:5])

[Document(page_content='Preprint. Under review.\nLeave No Context Behind:\nEfficient Infinite Context Transformers with Infini-attention\nTsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal\nGoogle\ntsendsuren@google.com\nAbstract\nThis work introduces an efficient method to scale Transformer-based Large\nLanguage Models (LLMs) to infinitely long inputs with bounded memory\nand computation. A key component in our proposed approach is a new at-\ntention technique dubbed Infini-attention. The Infini-attention incorporates\na compressive memory into the vanilla attention mechanism and builds\nin both masked local attention and long-term linear attention mechanisms\nin a single Transformer block. We demonstrate the effectiveness of our\napproach on long-context language modeling benchmarks, 1M sequence\nlength passkey context block retrieval and 500K length book summarization\ntasks with 1B and 8B LLMs. Our approach introduces minimal bounded\nmemory parameters and enables fast strea

In [9]:
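# Splitting the document into chunks
# NLTKTextSplitter splits on sentence boundaries; it needs the nltk package and its sentence tokenizer data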

from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=500, chunk_overlap=100)

chunks = text_splitter.split_documents(data)

print(len(chunks))

Created a chunk of size 568, which is longer than the specified 500
Created a chunk of size 506, which is longer than the specified 500
Created a chunk of size 633, which is longer than the specified 500


110
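
The warnings above report chunks larger than `chunk_size=500`. A sentence-based splitter does not cut inside a sentence, so a single sentence longer than the limit becomes an oversized chunk of its own. The sketch below is a simplified illustration of that behaviour, not `NLTKTextSplitter`'s actual implementation; `merge_sentences` is a hypothetical helper and sizes are counted in characters:

```python
def merge_sentences(sentences, chunk_size, chunk_overlap, sep=" "):
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    A sentence is never cut, so one that is longer than chunk_size
    ends up as an oversized chunk of its own.
    """
    chunks, current = [], []

    def length(parts):
        return len(sep.join(parts))

    for sentence in sentences:
        if current and length(current + [sentence]) > chunk_size:
            chunks.append(sep.join(current))
            # Keep trailing sentences as overlap while they fit the overlap
            # budget and still leave room for the incoming sentence.
            while current and (
                length(current) > chunk_overlap
                or length(current + [sentence]) > chunk_size
            ):
                current.pop(0)
        current.append(sentence)
    if current:
        chunks.append(sep.join(current))
    return chunks


# Overlap: the last sentence of one chunk is repeated at the start of the next.
print(merge_sentences(["aaaa.", "bbbb.", "cccc."], chunk_size=12, chunk_overlap=5))
# ['aaaa. bbbb.', 'bbbb. cccc.']

# Oversized chunk: the 31-character sentence cannot be split, so it exceeds the limit.
long_sentence = "c" * 30 + "."
sizes = [len(c) for c in merge_sentences(["aaaa.", "bbbb.", long_sentence, "dddd."], 12, 5)]
print(sizes)
# [11, 31, 5]
```

If strict size limits matter, a character-level fallback splitter is one option, at the cost of breaking sentences.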


In [10]:
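# Create the embedding model for the chunks
# This only loads GoogleGenerativeAIEmbeddings; the vectors are computed when the vector store is built

from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding_model = GoogleGenerativeAIEmbeddings(google_api_key=GOOGLE_API_KEY, model="models/embedding-001")

# To embed the chunks manually instead:
# vectors = embedding_model.embed_documents([chunk.page_content for chunk in chunks])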

In [11]:
# Store the chunks in vector store
from langchain_community.vectorstores import Chroma

# Embed each chunk and load it into the vector store
db = Chroma.from_documents(chunks, embedding_model, persist_directory="./chroma_db_rag")

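# Persist the database to disk
# (with Chroma 0.4 and later the data is persisted automatically, so this call is effectively a no-op)
db.persist()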
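
Re-running the cell above against an existing `./chroma_db_rag` directory may store the same chunks a second time if ids are generated fresh on each run. One possible guard is to derive deterministic ids from the chunk content. This is a sketch under assumptions: `stable_chunk_ids` is a hypothetical helper, and it assumes each chunk exposes `page_content` and a `metadata` dict with `source` and `page` keys, as `PyPDFLoader` documents normally do:

```python
import hashlib


def stable_chunk_ids(chunks):
    """Derive a deterministic id for each chunk from its source, page, position and text."""
    ids = []
    for position, chunk in enumerate(chunks):
        meta = chunk.metadata
        # The position keeps ids unique even if two chunks have identical text.
        key = f"{meta.get('source')}|{meta.get('page')}|{position}|{chunk.page_content}"
        ids.append(hashlib.sha256(key.encode("utf-8")).hexdigest())
    return ids
```

The ids could then be passed as `ids=stable_chunk_ids(chunks)` to `Chroma.from_documents`, assuming the installed `langchain_community` version accepts that argument; verify this against your version before relying on it.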

In [12]:
# Reconnect to the persisted Chroma database
connection = Chroma(persist_directory="./chroma_db_rag", embedding_function=embedding_model)

In [13]:
# Convert the Chroma connection into a retriever that returns the 5 most similar chunks
retriever = connection.as_retriever(search_kwargs={"k": 5})

print(type(retriever))

<class 'langchain_core.vectorstores.VectorStoreRetriever'>
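
Conceptually, the retriever embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below shows that idea with toy 3-dimensional vectors and cosine similarity. It is an illustration only: the texts and vectors are made up, `top_k` is a hypothetical helper, and the distance metric and index Chroma actually uses depend on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=5):
    """Return the k (text, score) pairs most similar to query_vec, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data: (chunk text, embedding) pairs invented for this example.
toy_store = [
    ("Infini-attention combines local and compressive memory", [0.9, 0.1, 0.0]),
    ("Book summarization results", [0.1, 0.9, 0.1]),
    ("Passkey retrieval at 1M tokens", [0.2, 0.2, 0.9]),
]

for text, score in top_k([1.0, 0.0, 0.1], toy_store, k=2):
    print(f"{score:.3f}  {text}")
```

With real embeddings the vectors have hundreds of dimensions, but the ranking step is the same in spirit.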


In [14]:
# Query-1
user_query = "What is Infini-attention?"

In [15]:
retrieved_docs = retriever.invoke(user_query)

In [16]:
len(retrieved_docs)

5

In [17]:
print(retrieved_docs[0].page_content)

2.1 Infini-attention
As shown Figure 1, our Infini-attention computes both local and global context states and
combine them for its output.

Similar to multi-head attention (MHA), it maintains Hnumber
2


In [18]:
# Query-2
user_query = "Tell me about LLMs?"

In [19]:
retrieved_docs = retriever.invoke(user_query)

In [20]:
len(retrieved_docs)

5

In [21]:
print(retrieved_docs[0].page_content)

However, the LLMs in their current state
have yet to see an effective, practical compres-
sive memory technique that balances simplicity along with quality.

1arXiv:2404.07143v1  [cs.CL]  10 Apr 2024


## Passing the context and question to the LLM

In [22]:
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate

chat_template = ChatPromptTemplate.from_messages([
    # System message: fixed instructions for the model
    SystemMessage(content="""You are a helpful AI bot.
    You receive a context and a question from the user. Your answer should be based on the given context."""),
    # Human message template: {context} and {question} are filled in at run time
    HumanMessagePromptTemplate.from_template("""Answer the question based on the given context.
    Context:
    {context}
    Question:
    {question}

    Answer: """)
])

In [23]:
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()

In [25]:
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | chat_template
    | chat_model
    | output_parser
)
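
The `|` operator chains the steps, and the dict at the head sends the incoming question down two branches: one retrieves and formats the context, the other passes the question through unchanged. The plain-Python sketch below mirrors that data flow without any API calls. `fake_retriever`, `fill_prompt`, `fake_model` and `run_chain` are hypothetical stand-ins, not LangChain functions, and the chunk texts are invented:

```python
def format_docs(texts):
    """Join chunk texts with blank lines, as the notebook's format_docs does for documents."""
    return "\n\n".join(texts)


def fake_retriever(question):
    # Stand-in for `retriever`: returns canned chunk texts regardless of the question.
    return ["Chunk A about Infini-attention.", "Chunk B about compressive memory."]


def fill_prompt(inputs):
    # Stand-in for `chat_template`: substitutes the context and the question.
    return (
        "Answer the question based on the given context.\n"
        f"Context:\n{inputs['context']}\n"
        f"Question:\n{inputs['question']}\n"
        "Answer: "
    )


def fake_model(prompt):
    # Stand-in for `chat_model | output_parser`: reports what it received as a string.
    return f"(model saw {len(prompt)} characters of prompt)"


def run_chain(question):
    # The dict step: build the context from retrieval, pass the question through as-is.
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    return fake_model(fill_prompt(inputs))


print(run_chain("What is Infini-attention?"))
```

Swapping the stand-ins for the real retriever, prompt template and chat model gives the behaviour of `rag_chain.invoke(...)`.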

In [26]:
from IPython.display import Markdown as markdown
response = rag_chain.invoke("What is LLMs?")

markdown(response)

## LLMs Explained

Based on the provided context, **LLMs stands for Large Language Models**. 

These models are a type of artificial intelligence focused on understanding and processing human language. They are trained on massive amounts of text data, allowing them to perform various tasks such as:

* **Text generation:** Creating coherent and contextually relevant text, like stories, articles, or poems.
* **Translation:** Converting text from one language to another.
* **Summarization:** Condensing large amounts of text into shorter summaries.
* **Question answering:** Providing answers to questions based on given information. 


In [27]:
response = rag_chain.invoke("What is the main focus of the document regarding Transformer-based Large Language Models (LLMs)?")

markdown(response)

## Main Focus of the Document: Addressing Context Limitations in Transformer-based LLMs

The document focuses on the limitations of Transformer-based LLMs regarding context length and proposes solutions to overcome these challenges.

**Key Points:**

* **Limited Context Memory:** Transformer-based LLMs, despite their success, struggle with long sequences due to the nature of the attention mechanism. This leads to challenges in both scalability and financial costs when dealing with extensive contexts.
* **Compressive Memory Systems:**  The document explores alternative approaches like compressive memory systems, which offer better scalability and efficiency for handling extremely long sequences compared to the traditional attention mechanism.
* **Input Compression Techniques:**  It discusses methods of compressing input representations as summaries of past sequence segments, including utilizing Transformer LLMs themselves for efficient long-context modeling.
* **Infini-Transformer:**  The document introduces Infini-Transformer as a solution to enable LLMs to handle infinitely long contexts with bounded memory and compute resources. This is achieved by processing long inputs in a streaming fashion, similar to Transformer-XL but with key differences.

**Overall, the document aims to address the limitations of current Transformer-based LLMs in handling long contexts and proposes solutions like compressive memory and the Infini-Transformer model to enable efficient processing of extensive sequences.** 


In [28]:
response = rag_chain.invoke("Explain about LLM Pre-training?")

markdown(response)

## LLM Pre-training for Long-Context Explained:

Based on the context you provided, here's an explanation of LLM pre-training, specifically focused on the **long-context adaptation** of existing LLMs:

**Challenge:** 
Standard LLMs struggle with long sequences of text due to limitations in attention mechanisms and memory.

**Solution:**
Researchers are exploring "continual pre-training" methods to adapt existing LLMs for long-context scenarios. This involves further training the models on extensive text sequences (over 4K tokens) to improve their ability to handle and process lengthy inputs effectively.

**Approaches:**

* **Extending Attention Layers:** Modifying the standard attention mechanisms (like dot-product attention) to better capture long-range dependencies within the text.
* **Compressed Input Representations:**  Employing techniques to summarize or compress past segments of the input sequence, allowing the model to retain essential information without being overwhelmed by the sheer volume of data.
* **Transformer-based Compression:**  Utilizing another Transformer LLM specifically for compressing the input sequence, enabling efficient processing of long contexts.

**Example:**
The context mentions using a 1B parameter LLM and replacing its standard Multi-Head Attention (MHA) with "Infini-attention" – a method designed for long sequences. This modified LLM was then further pre-trained on 4K token long inputs, aiming to enhance its long-context capabilities.

**Current State & Future Direction:**
While these methods show promise, the research is ongoing. The challenge lies in finding a balance between simplicity and effectiveness. The ideal "compressive memory technique" should be easy to implement and integrate into existing LLMs while maintaining high-quality performance on long-context tasks. 
