### **Building a QnA system with RAG (Retrieval-Augmented Generation)**

LLMs like GPT are trained on a fixed snapshot of data, so they lack information published after their training cutoff.

The primary way to fix this issue is Retrieval-Augmented Generation (RAG). In this process, external data is retrieved and then passed to the LLM as context during the generation step.

LangChain provides all the building blocks for RAG applications.

> **Retrieval:** Fetching relevant data from external sources  
> **Augmented:** Enhancing the prompt by incorporating the retrieved information  
> **Generation:** Using an LLM to generate human-like text

#### **Steps**
1. Load a document
2. Split the document into chunks
3. Create embeddings for the chunks
4. Store the chunks in a vector store
5. Set up the vector store as a retriever
6. Retrieve the context based on the user's query
7. Pass the context and question to the LLM
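
The steps above can be sketched end to end without any external service. This is an illustrative toy, not the LangChain implementation: word overlap stands in for embedding similarity, and `fake_llm` is a hypothetical placeholder for a real model call.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_into_chunks(document, sentences_per_chunk=1):
    """Steps 1-2: take a loaded document (a string) and split it into chunks."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def retrieve(query, chunks, k=2):
    """Steps 3-6: score every chunk against the query and return the top k.

    Word overlap is an assumption made for this toy; a real system compares
    embedding vectors stored in a vector store.
    """
    query_tokens = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_tokens & tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(context_chunks, question):
    """Step 7: combine the retrieved context and the question into one prompt."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"


def fake_llm(prompt):
    """Hypothetical placeholder for a real LLM call."""
    return f"(model answer based on a prompt of {len(prompt)} characters)"


document = (
    "Infini-attention adds a compressive memory to attention. "
    "Bananas are rich in potassium. "
    "The compressive memory stores information from earlier segments."
)
question = "What is the compressive memory?"

chunks = split_into_chunks(document)
top_chunks = retrieve(question, chunks, k=2)
prompt = build_prompt(top_chunks, question)

print(prompt)
print(fake_llm(prompt))
```

The unrelated sentence about bananas never reaches the prompt, which is the point of the retrieval step: the model sees only the chunks most relevant to the question.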

In [10]:
# ! pip install pypdf
# ! pip install langchain
# ! pip install langchain_google_genai
# ! pip install chromadb

In [5]:
from langchain_google_genai import ChatGoogleGenerativeAI

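# Set up the API key; the context manager closes the file,
# and strip() removes any trailing newline stored with the key
with open('keys/.gemini_API_key.txt') as f:
    GOOGLE_API_KEY = f.read().strip()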

chat_model = ChatGoogleGenerativeAI(google_api_key=GOOGLE_API_KEY, model="gemini-1.5-pro-latest")

In [11]:
# 1. Load a document

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("Data/Leave No Context Behind Paper.pdf")

data = loader.load()
data

[Document(page_content='Preprint. Under review.\nLeave No Context Behind:\nEfficient Infinite Context Transformers with Infini-attention\nTsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal\nGoogle\ntsendsuren@google.com\nAbstract\nThis work introduces an efficient method to scale Transformer-based Large\nLanguage Models (LLMs) to infinitely long inputs with bounded memory\nand computation. A key component in our proposed approach is a new at-\ntention technique dubbed Infini-attention. The Infini-attention incorporates\na compressive memory into the vanilla attention mechanism and builds\nin both masked local attention and long-term linear attention mechanisms\nin a single Transformer block. We demonstrate the effectiveness of our\napproach on long-context language modeling benchmarks, 1M sequence\nlength passkey context block retrieval and 500K length book summarization\ntasks with 1B and 8B LLMs. Our approach introduces minimal bounded\nmemory parameters and enables fast strea

In [12]:
# 2. Split the document into chunks
from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=500, chunk_overlap=100)

chunks = text_splitter.split_documents(data)

print(len(chunks))

print(type(chunks[0]))

Created a chunk of size 568, which is longer than the specified 500
Created a chunk of size 506, which is longer than the specified 500
Created a chunk of size 633, which is longer than the specified 500


110
<class 'langchain_core.documents.base.Document'>
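
The "Created a chunk of size ..." warnings above typically mean that a single sentence is longer than `chunk_size`: the splitter breaks the text at sentence boundaries, merges sentences into chunks, and does not cut a sentence in half. The sketch below is a simplified stand-in for that merging logic, not LangChain's actual code; `merge_sentences` and its greedy strategy are assumptions made for illustration.

```python
def merge_sentences(sentences, chunk_size, chunk_overlap, separator=" "):
    """Greedily pack sentences into chunks of at most chunk_size characters.

    Trailing sentences totalling at most chunk_overlap characters are carried
    into the next chunk. A single sentence longer than chunk_size is kept
    whole, so its chunk exceeds the limit.
    """
    chunks = []
    current = []  # sentences in the chunk being built

    def length(parts):
        return len(separator.join(parts))

    for sentence in sentences:
        if current and length(current + [sentence]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading sentences until the remainder fits the overlap
            # budget and leaves room for the incoming sentence.
            while current and (
                length(current) > chunk_overlap
                or length(current + [sentence]) > chunk_size
            ):
                current.pop(0)
        current.append(sentence)

    if current:
        chunks.append(separator.join(current))
    return chunks


sentences = [
    "Short one.",
    "Another short one.",
    "A third short one.",
    "This single sentence is deliberately much longer than the configured chunk size.",
    "Tail.",
]

result = merge_sentences(sentences, chunk_size=40, chunk_overlap=20)
for chunk in result:
    flag = " (over limit)" if len(chunk) > 40 else ""
    print(f"{len(chunk):3d}{flag}: {chunk}")
```

In this toy run, "Another short one." appears in two consecutive chunks (the overlap), and the long sentence becomes a chunk of its own that exceeds the limit, mirroring the warnings in the notebook output.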


In [32]:
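# 3. Create embeddings for the chunks
# This only loads the Google Generative AI embedding model; nothing is embedded yet

from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding_model = GoogleGenerativeAIEmbeddings(google_api_key=GOOGLE_API_KEY, model="models/embedding-001")

# embed_documents expects a list of strings, not Document objects
# vectors = embedding_model.embed_documents([chunk.page_content for chunk in chunks])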

In [17]:
# 4. Store the chunks in a vector store
from langchain_community.vectorstores import Chroma

# Embed each chunk and load it into the vector store
db = Chroma.from_documents(chunks, embedding_model, persist_directory="./chroma_db_")

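# Persist the database to disk
# (recent Chroma versions save automatically when persist_directory is set, so this call may be unnecessary)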
db.persist()

In [18]:
# 5. Set up the vector store as a retriever
# First, connect to the persisted Chroma database
db_connection = Chroma(persist_directory="./chroma_db_", embedding_function=embedding_model)

In [19]:
# Convert the Chroma connection into a retriever that returns the 5 most similar chunks
retriever = db_connection.as_retriever(search_kwargs={"k": 5})

print(type(retriever))

<class 'langchain_core.vectorstores.VectorStoreRetriever'>
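
Conceptually, a retriever with `k=5` embeds the query, compares it with every stored chunk vector, and returns the 5 closest chunks. A minimal sketch of that idea with made-up 3-dimensional vectors follows. Cosine similarity is used here for illustration only; the metric a vector store actually applies depends on its configuration, and real stores use indexes rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vector, vector)) for text, vector in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical embeddings; real embedding vectors have hundreds of dimensions.
store = [
    ("chunk about attention", [0.9, 0.1, 0.0]),
    ("chunk about memory", [0.7, 0.6, 0.1]),
    ("chunk about bananas", [0.0, 0.1, 0.9]),
]
query_vector = [1.0, 0.2, 0.0]

results = top_k(query_vector, store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```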


Now let’s write the actual application logic. We want to create a simple application that takes a user question, searches for documents relevant to that question, passes the retrieved documents and initial question to a model, and returns an answer.

In [22]:
user_input = "What is infini-attention"
retrieved_docs = retriever.invoke(user_input)

In [23]:
len(retrieved_docs)

5

In [24]:
print(retrieved_docs[0].page_content)

2.1 Infini-attention
As shown Figure 1, our Infini-attention computes both local and global context states and
combine them for its output.

Similar to multi-head attention (MHA), it maintains Hnumber
2


In [25]:
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate

chat_template = ChatPromptTemplate.from_messages([
    # System message: fixed instructions for the model
    SystemMessage(content="""You are a helpful AI bot.
    You take the context and question from the user. Your answer should be based on the specific context."""),
    # Human message template: filled in with the retrieved context and the user's question
    HumanMessagePromptTemplate.from_template("""Answer the question based on the given context.
    Context:
    {context}
    
    Question: 
    {question}
    
    Answer: """)
])

In [26]:

from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()

In [27]:
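from langchain_core.runnables import RunnablePassthrough


def format_docs(docs):
    """Join the text of the retrieved documents into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)


# The retriever output (formatted by format_docs) fills {context};
# RunnablePassthrough forwards the user's question unchanged to {question}
rag_chain = (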
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | chat_template
    | chat_model
    | output_parser
)
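
In the chain above, the `|` operator pipes each step's output into the next, and the leading dict sends the same input to each of its branches. The plain-Python sketch below mimics that data flow with hypothetical stand-ins (`toy_retriever`, `toy_prompt`, `toy_model`, `toy_parser`) instead of LangChain objects. It illustrates the idea only and is not how LangChain implements it.

```python
def pipe(*steps):
    """Compose callables left to right, in the spirit of `a | b | c`."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


def fan_out(**branches):
    """Feed the same input to every branch and collect the results in a dict."""
    return lambda value: {name: branch(value) for name, branch in branches.items()}


# Hypothetical stand-ins for the retriever, prompt template, model, and parser.
def toy_retriever(question):
    return ["Infini-attention adds a compressive memory.", "It handles long inputs."]


def toy_format_docs(docs):
    return "\n\n".join(docs)


def toy_prompt(inputs):
    return f"Context:\n{inputs['context']}\n\nQuestion:\n{inputs['question']}\n\nAnswer:"


def toy_model(prompt):
    return {"content": f"[answer generated from {len(prompt)} prompt characters]"}


def toy_parser(message):
    return message["content"]


toy_chain = pipe(
    fan_out(context=pipe(toy_retriever, toy_format_docs), question=lambda q: q),
    toy_prompt,
    toy_model,
    toy_parser,
)

answer = toy_chain("What is Infini-attention?")
print(answer)
```

The question string enters once and is used twice: one branch turns it into retrieved context, and the other passes it through untouched, which is the role `RunnablePassthrough` plays in the real chain.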

In [29]:
response = rag_chain.invoke("What is Infini-attention?")
response

"## Infini-attention: A Powerful Attention Mechanism for Long and Short Context\n\nBased on the context you provided, Infini-attention is a novel attention mechanism designed to handle both **long-range and short-range dependencies** within sequences efficiently. It achieves this by incorporating a **compressive memory** into the traditional attention mechanism, allowing it to retain and utilize information from much earlier parts of the sequence. \n\nHere's a breakdown of its key characteristics:\n\n* **Combines Local and Global Context:** Infini-attention considers both the immediate context (local) and information from further back in the sequence (global) to generate its output. \n* **Compressive Memory:** Unlike standard attention mechanisms that discard past key-value pairs, Infini-attention stores them in a compressive memory. This enables the model to access and leverage long-term dependencies effectively.\n* **Masked Local Attention and Long-Term Linear Attention:**  It integr

In [30]:
from IPython.display import Markdown as md

md(response)

## Infini-attention: A Powerful Attention Mechanism for Long and Short Context

Based on the context you provided, Infini-attention is a novel attention mechanism designed to handle both **long-range and short-range dependencies** within sequences efficiently. It achieves this by incorporating a **compressive memory** into the traditional attention mechanism, allowing it to retain and utilize information from much earlier parts of the sequence. 

Here's a breakdown of its key characteristics:

* **Combines Local and Global Context:** Infini-attention considers both the immediate context (local) and information from further back in the sequence (global) to generate its output. 
* **Compressive Memory:** Unlike standard attention mechanisms that discard past key-value pairs, Infini-attention stores them in a compressive memory. This enables the model to access and leverage long-term dependencies effectively.
* **Masked Local Attention and Long-Term Linear Attention:**  It integrates both masked local attention (focusing on recent context) and long-term linear attention (retrieving information from the compressive memory) within a single Transformer block.
* **Minimal Modification:**  Infini-attention introduces minimal changes to the standard scaled dot-product attention, making it easy to integrate into existing Transformer models.
* **Continual Learning:**  The design inherently supports continual pre-training and adaptation to long contexts, allowing the model to learn and evolve over time.

**In essence, Infini-attention enhances the capabilities of standard attention mechanisms by providing a memory of past information, enabling it to handle long sequences and capture complex dependencies more effectively.** 


In [31]:
response = rag_chain.invoke("Does infini-attention's real-world application face limitations for extremely long sequences?")
md(response)

## Infini-attention and Limitations for Extremely Long Sequences:

Based on the provided context, Infini-attention appears to be designed specifically to address the limitations of standard attention mechanisms when dealing with long sequences. However, there might still be practical limitations for **extremely** long sequences, depending on the specific implementation and hardware constraints. Here's an analysis:

**Points suggesting Infini-attention's ability to handle long sequences:**

* **Addresses attention sink and lost-in-the-middle issues:** The context explicitly states that Infini-attention tackles these problems, which are known to hinder performance in long sequences.
* **Segment-level streaming computation:** This approach allows processing long sequences with a fixed local attention window, avoiding the need to attend to the entire sequence at once.
* **Long-term compressive memory:** This component helps capture long-range dependencies efficiently, crucial for understanding context in extensive sequences.
* **Successful extrapolation to 1M input length:** The provided information indicates successful scaling of Infini-attention to very long sequences (1 million tokens) when trained on significantly shorter ones (32K or even 5K tokens).

**Potential limitations for extremely long sequences:**

* **Hardware constraints:**  Despite efficient design, processing extremely long sequences might still demand significant computational resources and memory, potentially leading to limitations on certain hardware configurations.
* **Unforeseen edge cases:**  While the provided information suggests excellent performance, there might be specific edge cases or extremely long sequence lengths where Infini-attention's efficiency or accuracy could degrade. 
* **Specific implementation details:** The efficiency and scalability of Infini-attention likely depend on the specific implementation choices and optimizations made. 

**Therefore, while Infini-attention demonstrates strong capabilities for handling long sequences, it's crucial to consider the specific implementation, hardware limitations, and potential edge cases when dealing with extremely long sequences in real-world applications.** 
