### Problem Statement

**Build a RAG System on “Leave No Context Behind” Paper**

LLMs such as Gemini lack company-specific and recent information, but that information is often available in PDFs, text files, and similar sources. If we connect the LLM to these sources, we can build a much better application.
Using the LangChain framework, build a RAG system that uses an LLM such as Gemini 1.5 Pro to answer questions on the “Leave No Context Behind” paper published by Google on 10th April 2024. In this process, external data (i.e. the Leave No Context Behind paper) is retrieved and then passed to the LLM during the generation step.
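
The retrieve-then-generate flow can be sketched without any framework. The snippet below is a toy illustration only: the word-overlap scoring, the `fake_llm` stub, and the sample passages are assumptions made for this sketch and are not part of LangChain or Gemini.

```python
import re
from typing import Callable, List


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: List[str], k: int = 2) -> List[str]:
    """Rank passages by word overlap with the question (toy retriever)."""
    q_words = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:k]


def answer(question: str, passages: List[str], llm: Callable[[str], str], k: int = 2) -> str:
    """Retrieve context, build a prompt, and pass it to the LLM callable."""
    context = "\n\n".join(retrieve(question, passages, k))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model: reports the prompt size."""
    return f"(model received {len(prompt)} characters of prompt)"


# Made-up passages, used only to exercise the functions above.
passages = [
    "Infini-attention adds a compressive memory to the attention layer.",
    "The weather in April is often rainy.",
    "Compressive memory stores past key-value states in a fixed-size matrix.",
]
top = retrieve("What is compressive memory?", passages, k=2)
print(top)
print(answer("What is compressive memory?", passages, fake_llm))
```

A real system replaces the word-overlap retriever with embedding search and the stub with an actual model call, which is what the cells below do.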


#### Paper Overview: 

**Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention**

Authors: Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal (Google)

Date: April 10, 2024

**Key Contributions**

Infini-attention mechanism

A modified Transformer attention block that integrates:

Compressive memory storing past key-value (KV) states instead of discarding them.

Local causal attention for handling recent tokens.

Long-term linear attention to retrieve compressed memory—combining both local and global context seamlessly. 

Scalable and efficient long-context modeling

Enables handling infinitely long inputs with fixed memory and compute costs.

Achieves 114× memory compression compared to standard attention architectures. 

Strong empirical performance

A 1B-parameter model with Infini-attention manages sequence lengths up to 1M tokens and successfully completes a passkey retrieval task.

An 8B-parameter model attains state-of-the-art results on a 500K-token book summarization task. 
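
To make the compressive-memory idea concrete, here is a minimal single-head sketch in plain Python. It follows my reading of the paper's linear update (memory plus sigma(K)^T V, normalizer plus the sum of sigma(K), with sigma = ELU + 1) and its sigmoid gate. It is a simplification under stated assumptions: no delta rule, memory is read after the update rather than before it, the local attention output is a zero placeholder, and all numbers are made up.

```python
import math
from typing import List

Matrix = List[List[float]]


def elu_plus_one(x: float) -> float:
    """Nonlinearity sigma(x) = ELU(x) + 1, which is always positive."""
    return x + 1.0 if x > 0 else math.exp(x)


def update_memory(M: Matrix, z: List[float], K: Matrix, V: Matrix) -> None:
    """Fold one segment's keys and values into the memory in place.

    M has shape (d_k, d_v) and z has length d_k; neither grows with
    the number of segments processed.
    """
    for k_row, v_row in zip(K, V):
        s = [elu_plus_one(x) for x in k_row]
        for i, s_i in enumerate(s):
            z[i] += s_i
            for j, v_j in enumerate(v_row):
                M[i][j] += s_i * v_j


def retrieve_memory(M: Matrix, z: List[float], q: List[float]) -> List[float]:
    """Read from memory with one query: sigma(q) M / (sigma(q) . z)."""
    s = [elu_plus_one(x) for x in q]
    denom = sum(s_i * z_i for s_i, z_i in zip(s, z))
    d_v = len(M[0])
    return [sum(s[i] * M[i][j] for i in range(len(s))) / denom for j in range(d_v)]


def gate(a_mem: List[float], a_local: List[float], beta: float) -> List[float]:
    """Mix memory and local attention outputs with a sigmoid gate."""
    g = 1.0 / (1.0 + math.exp(-beta))
    return [g * mem + (1.0 - g) * loc for mem, loc in zip(a_mem, a_local)]


d_k, d_v = 2, 3
M = [[0.0] * d_v for _ in range(d_k)]
z = [0.0] * d_k

# Stream three toy segments; the memory shape stays (d_k, d_v) throughout.
for segment in range(3):
    K = [[0.5, -0.2], [0.1, 0.3]]
    V = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
    update_memory(M, z, K, V)

a_mem = retrieve_memory(M, z, q=[0.2, 0.4])
mixed = gate(a_mem, a_local=[0.0, 0.0, 0.0], beta=0.0)  # placeholder local output
print(len(M), len(M[0]), mixed)
```

Because every weight is positive, the memory read is a weighted average of the stored value rows, and the storage cost depends only on `d_k` and `d_v`, not on how many segments were streamed.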

**Why This Matters**

Extends Transformer reach: Addresses the fundamental limitation of context length in standard Transformers and LLMs.

Allows real-time streaming inference on long sequences with bounded resources.

Minimal architecture changes: Infini-attention is plug-and-play, meaning you can adapt existing pre-trained or finetuned models with ease.

In [1]:
from pathlib import Path
import requests

In [2]:
# Direct PDF URL for the paper
url = "https://arxiv.org/pdf/2404.07143.pdf"  # Replace if different

# Save PDF to local file
path = Path("leave_no_context_behind.pdf")
response = requests.get(url, timeout=30)  # fail instead of hanging if the server stalls
response.raise_for_status()  # Check for errors

with path.open('wb') as f:
    f.write(response.content)

In [3]:
# Downloaded PDF size
path.stat().st_size

495619
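
A file size alone does not prove the download is a PDF, since a server can return an HTML error page with a 200 status. A small check on the file signature can catch that. The helper name `looks_like_pdf` is hypothetical; the only fact relied on is that PDF files begin with the bytes `%PDF-`.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the PDF signature '%PDF-'."""
    return data.startswith(b"%PDF-")


# Example with in-memory bytes; for the real file, pass path.read_bytes()[:8].
print(looks_like_pdf(b"%PDF-1.5\n..."))       # True
print(looks_like_pdf(b"<!DOCTYPE html>..."))  # False
```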

In [4]:
# pip install -U langchain-community pypdf

In [5]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [6]:
# load the pdf file
load = PyPDFLoader("leave_no_context_behind.pdf")

print(load)

<langchain_community.document_loaders.pdf.PyPDFLoader object at 0x000001B8211C49B0>


In [7]:
document = load.load()

document

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-08-13T00:09:01+00:00', 'author': '', 'keywords': '', 'moddate': '2024-08-13T00:09:01+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'leave_no_context_behind.pdf', 'total_pages': 14, 'page': 0, 'page_label': '1'}, page_content='Preprint. Under review.\nLeave No Context Behind:\nEfficient Infinite Context Transformers with Infini-attention\nTsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal\nGoogle\ntsendsuren@google.com\nAbstract\nThis work introduces an efficient method to scale Transformer-based Large\nLanguage Models (LLMs) to infinitely long inputs with bounded memory\nand computation. A key component in our proposed approach is a new at-\ntention technique dubbed Infini-attention. The Infini-attention incorporates\na compressive memory into the

In [8]:
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

print(splitter)

<langchain_text_splitters.character.RecursiveCharacterTextSplitter object at 0x000001B8235BFD70>
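
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that class first splits on separators such as paragraph breaks, newlines, and spaces, then merges the pieces). The function `sliding_chunks` is a hypothetical illustration of the two parameters only.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.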


In [9]:
splits = splitter.split_documents(document)
print(splits)

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-08-13T00:09:01+00:00', 'author': '', 'keywords': '', 'moddate': '2024-08-13T00:09:01+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'leave_no_context_behind.pdf', 'total_pages': 14, 'page': 0, 'page_label': '1'}, page_content='Preprint. Under review.\nLeave No Context Behind:\nEfficient Infinite Context Transformers with Infini-attention\nTsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal\nGoogle\ntsendsuren@google.com\nAbstract\nThis work introduces an efficient method to scale Transformer-based Large\nLanguage Models (LLMs) to infinitely long inputs with bounded memory\nand computation. A key component in our proposed approach is a new at-\ntention technique dubbed Infini-attention. The Infini-attention incorporates\na compressive memory into the

In [10]:
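# Drop chunks that mention "references" or "bibliography" anywhere in the text.
# This is a coarse substring filter: it also removes body-text chunks that merely use the word.
cleaned_splits = []
for d in splits:
    text = d.page_content.lower()
    if "references" not in text and "bibliography" not in text:
        cleaned_splits.append(d)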

In [11]:
print(f"Original chunks: {len(splits)}, After cleaning: {len(cleaned_splits)}")

Original chunks: 56, After cleaning: 55
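
The substring test above removes any chunk that contains the word "references", including ordinary body text. One possible refinement, sketched here with made-up strings, is to match only a heading that stands alone on a line. Whether the extracted PDF text actually puts the heading on its own line is an assumption that should be checked against the real chunks.

```python
import re

# Matches "References" or "Bibliography" only when it is the whole line.
HEADING = re.compile(r"^\s*(references|bibliography)\s*$", re.IGNORECASE | re.MULTILINE)


def has_reference_heading(text: str) -> bool:
    """True only when the heading word stands alone on a line."""
    return HEADING.search(text) is not None


# Toy chunks, not taken from the paper.
chunks = [
    "Related work references earlier memory-based models.",
    "5 Conclusion\nAn effective memory system is crucial.",
    "References\nAnonymous. Some cited work. 2020.",
]
substring_kept = [c for c in chunks if "references" not in c.lower()]
heading_kept = [c for c in chunks if not has_reference_heading(c)]
print(len(substring_kept), len(heading_kept))  # 1 2
```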


In [12]:
# pip install langchain_google_vertexai

In [13]:
from langchain_community.vectorstores import Chroma

In [14]:
from langchain_community.embeddings import HuggingFaceEmbeddings

In [15]:
# pip install sentence-transformers

In [16]:
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [17]:
# pip install chromadb

In [18]:
vectordb = Chroma.from_documents(cleaned_splits, embedding, persist_directory="./chroma_db")
vectordb.persist()

In [19]:
retriever = vectordb.as_retriever(search_kwargs={"k": 3})
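
Conceptually, the retriever embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below shows that idea with cosine similarity over toy 2-D vectors. It is an illustration under assumptions: the real store uses an index rather than a full scan and may use a different distance metric, and the vectors here are invented.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for real 384-dimensional sentence vectors.
doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
hits = top_k([1.0, 0.1], doc_vecs, k=3)
print([i for i, _ in hits])  # [0, 1, 2]
```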

In [23]:
from langchain.chains import RetrievalQA
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline

In [35]:
# `temperature` is ignored unless sampling is enabled with do_sample=True.
pipe = pipeline("text2text-generation", model="google/flan-t5-large", max_new_tokens=256, temperature=0.5)

Device set to use cpu
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
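
The warning appears because the pipeline decodes greedily by default, and temperature only matters when tokens are sampled. The sketch below, with made-up logits, shows why: temperature reshapes the probability distribution, but the most likely token stays the same, so a greedy decoder never notices it.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits divided by temperature (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # toy scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    greedy_choice = probs.index(max(probs))  # always token 0, whatever t is
    print(t, [round(p, 3) for p in probs], greedy_choice)
```

Lower temperatures concentrate probability on the top token and higher ones flatten the distribution, which changes sampled output only.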


In [36]:
llm = HuggingFacePipeline(pipeline=pipe)

In [37]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)

#### Questions you can ask

What problem in large language models does the Leave No Context Behind paper aim to solve?

What is the main contribution of the Infini-attention mechanism?

How does compressive memory work in Infini-attention?

What are the differences between local causal attention and long-term linear attention?

How does Infini-attention achieve scalability for infinitely long inputs?

What is the passkey retrieval task, and how did the model perform on it?

How did the 8B parameter model perform on the 500K-token book summarization task compared to previous models?

What memory efficiency gains (e.g., compression ratio) does Infini-attention provide?

Why is Infini-attention considered a plug-and-play architecture?

How could Infini-attention improve real-world applications of LLMs, like summarization or document analysis?

In [40]:
# Ask questions
query = "Why is Infini-attention considered a plug-and-play architecture?"


result = qa_chain.invoke({"query": query})

In [41]:
print("Answer : \n", result["result"])

Answer : 
 each layer has at least a single short-range head, allowing a forward-propagation of input signal up until the output layer


In [42]:
from langchain.prompts import PromptTemplate

In [72]:
prompt_template = """
You are a helpful research assistant.
Use the following context from the document to answer the question.
Do not just copy text — instead, synthesize a clear and complete answer in your own words.
If useful, combine multiple pieces of evidence.
Always give concise but informative answers.

context = {context}

question = {question}

answer :

"""

In [61]:
prompt = PromptTemplate(
    template=prompt_template,
    input_variables=['context', 'question']
)

In [67]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    chain_type="stuff",
    chain_type_kwargs={"prompt": prompt}
)
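
With `chain_type="stuff"`, the retrieved chunks are concatenated into a single string that fills the `{context}` slot of the prompt. A rough equivalent in plain Python is shown below. The shortened template, the `stuff_prompt` helper, and the sample chunks are assumptions for illustration; the blank-line separator mirrors what I understand to be the chain's default.

```python
from typing import List

# Shortened stand-in for the notebook's prompt template.
TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "context = {context}\n\n"
    "question = {question}\n\n"
    "answer :"
)


def stuff_prompt(chunks: List[str], question: str, separator: str = "\n\n") -> str:
    """Join all chunks into one context string and fill the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


chunks = ["First retrieved chunk.", "Second retrieved chunk.", "Third retrieved chunk."]
prompt_text = stuff_prompt(chunks, "What does the paper propose?")
print(prompt_text)
print(len(prompt_text), "characters")
```

With `chunk_size=1000` and `k=3`, the stuffed context can approach 3,000 characters, so it is worth comparing the final prompt length against the input limit of whichever model is used.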

In [68]:
query = "Why is Infini-attention considered a plug-and-play architecture?"

In [69]:
result = qa_chain.invoke({'query':query})

In [70]:
print(result)

{'query': 'Why is Infini-attention considered a plug-and-play architecture?', 'result': 'There are two types of heads emerged in Infini-attention after training: specialized heads with a gating score near 0 or 1 and mixer heads with a score close to 0.5. The specialized heads either process contextual information via the local attention computation or retrieve from the compressive memory whereas the mixer heads aggregate both current contextual information and long-term memory content together into a single output. Interestingly, each layer has at least a single short-range head, allowing a forward-propagation of input signal up until the output layer. We also Zero-shot 32K 128K 256K 512K 1M Infini-Transformer(Linear) 14/13/98 11/14/100 6/3/100 6/7/99 8/6/98 Infini-Transformer(Linear + Delta) 13/11/99 6/9/99 7/5/99 6/8/97 7/6/97 FT (400 steps) Infini-Transformer(Linear) 100/100/100 100/100/100 100/100/100 100/100/100 96/94/100 96/94/100 that are modified in with respect to', 'source_do

In [71]:
result['result']

'There are two types of heads emerged in Infini-attention after training: specialized heads with a gating score near 0 or 1 and mixer heads with a score close to 0.5. The specialized heads either process contextual information via the local attention computation or retrieve from the compressive memory whereas the mixer heads aggregate both current contextual information and long-term memory content together into a single output. Interestingly, each layer has at least a single short-range head, allowing a forward-propagation of input signal up until the output layer. We also Zero-shot 32K 128K 256K 512K 1M Infini-Transformer(Linear) 14/13/98 11/14/100 6/3/100 6/7/99 8/6/98 Infini-Transformer(Linear + Delta) 13/11/99 6/9/99 7/5/99 6/8/97 7/6/97 FT (400 steps) Infini-Transformer(Linear) 100/100/100 100/100/100 100/100/100 100/100/100 96/94/100 96/94/100 that are modified in with respect to'
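
Because the chain was built with `return_source_documents=True`, the result also carries the retrieved chunks, and listing them helps explain an answer like the one above. The helper below is a sketch: `Doc` is a hypothetical stand-in with the same `page_content` and `metadata` attributes as a LangChain document, and the sample data is invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (same two attributes)."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def describe_sources(docs: List[Doc], preview_chars: int = 60) -> List[str]:
    """One line per retrieved chunk: rank, page number, and a short preview."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        page = doc.metadata.get("page", "?")
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"[{rank}] page {page}: {preview}")
    return lines


# Toy data; with the real chain, pass result["source_documents"] instead.
sample = [
    Doc("Toy chunk about gating\nscores.", {"page": 8}),
    Doc("Toy chunk about a results table.", {"page": 6}),
]
for line in describe_sources(sample):
    print(line)
```

Note that the `page` value in the loader's metadata is zero-based, while `page_label` holds the printed page number.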

In [73]:
from langchain.chains import LLMChain, StuffDocumentsChain

In [74]:
prompt_template = """
You are a helpful research assistant. 
Use the following document context to answer the user’s question.

Instructions:
- Do not copy raw sentences from the text.
- Synthesize the key ideas into a clear, concise explanation.
- If multiple pieces of evidence are relevant, combine them.
- If numerical results or experiments are mentioned, summarize their significance.

context = {context}

question = {question}

answer :

"""

In [75]:
prompt = PromptTemplate(
    template=prompt_template,
    input_variables=['context', 'question']
)

In [76]:
llm_chain = LLMChain(llm=llm, prompt=prompt)

In [77]:
doc_chain = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="context")

In [78]:
qa_chain = RetrievalQA(combine_documents_chain=doc_chain, retriever=retriever, return_source_documents=True)

In [79]:
query = "Why is Infini-attention considered a plug-and-play architecture?"

In [80]:
result = qa_chain.invoke({"query": query})

In [81]:
print(result['result'])

Infini-attention is a recurrent attention mechanism that computes both local and global context states and combine them for its output. Similar to multi-head 3 Gating score visualization. Figure 3 visualizes the gating score, sigmoid() for the compres- sive memory for all attention heads in each layer. There are two types of heads emerged in Infini-attention after training: specialized heads with a gating score near 0 or 1 and mixer heads with a score close to 0.5. The specialized heads either process contextual information via the local attention computation or retrieve from the compressive memory whereas the mixer heads aggregate both current contextual information and long-term memory content together into a single output. Interestingly, each layer has at least a single short-range head, allowing a forward-propagation of input signal up until the output layer. We also Zero-shot 32K 128K 256K 512K 1M Infini-Transformer(Linear) 14/13/98 11/14/100 6/3/100 6/7/99 8/6/98 Infini-Transform

In [83]:
prompt_template = """
You are a helpful research assistant. 
Use the following document context to answer the user’s question.

Rules:
- Do not copy raw sentences.
- Summarize and rephrase into your own words.
- If multiple ideas are present, merge them into a single clear explanation.
- Be concise but complete.

context = {context}

question = {question}

answer :

"""

In [97]:
prompt = PromptTemplate(
    template=prompt_template,
    input_variables=['context', 'question']
)

In [98]:
llm_chain = LLMChain(llm=llm, prompt=prompt)

In [99]:
doc_chain = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="context")

In [95]:
# qa_chain = RetrievalQA(combine_documents_chain=doc_chain, retriever=retriever, return_source_documents=True)

In [100]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    chain_type="stuff",
    chain_type_kwargs={"prompt": prompt}
)

In [101]:
query = "Why is Infini-attention considered a plug-and-play architecture?"

In [102]:
result = qa_chain.invoke({"query": query})

In [103]:
print(result['result'])

Infini-attention is a recurrent attention mechanism that computes both local and global context states and combine them for its output. Similar to multi-head 3 Gating score visualization. Figure 3 visualizes the gating score, sigmoid() for the compres- sive memory for all attention heads in each layer. There are two types of heads emerged in Infini-attention after training: specialized heads with a gating score near 0 or 1 and mixer heads with a score close to 0.5. The specialized heads either process contextual information via the local attention computation or retrieve from the compressive memory whereas the mixer heads aggregate both current contextual information and long-term memory content together into a single output. Interestingly, each layer has at least a single short-range head, allowing a forward-propagation of input signal up until the output layer. We also Zero-shot 32K 128K 256K 512K 1M Infini-Transformer(Linear) 14/13/98 11/14/100 6/3/100 6/7/99 8/6/98 Infini-Transform
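
The output still reads like copied context even though the prompt asks for a rephrased answer. One simple way to quantify that, sketched here as an assumption rather than a standard metric, is the fraction of the answer's word 5-grams that also occur in the retrieved context. The `context` string reuses a phrase from the output above; the other strings are made up.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def copied_fraction(answer: str, context: str, n: int = 5) -> float:
    """Share of the answer's n-grams that appear verbatim in the context."""
    answer_grams = ngrams(answer, n)
    if not answer_grams:
        return 0.0
    return len(answer_grams & ngrams(context, n)) / len(answer_grams)


context = "the mixer heads aggregate both current contextual information and long-term memory content"
copied = "the mixer heads aggregate both current contextual information"
rephrased = "some heads blend recent context with stored memory"
print(copied_fraction(copied, context))     # 1.0
print(copied_fraction(rephrased, context))  # 0.0
```

With the real chain, the context would be the joined `page_content` of `result["source_documents"]`, and a value near 1.0 indicates the model is extracting text rather than synthesizing it.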