# RAG

## Overview

A typical RAG application has two main components:

**Indexing**: a pipeline for ingesting data from a source and indexing it. _This usually happens offline._

**Retrieval and generation**: the actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model.

The most common full sequence from raw data to answer looks like:

### Indexing

1. **Load**: First we need to load our data. This is done with [Document Loaders](https://python.langchain.com/docs/concepts/document_loaders/).
2. **Split**: [Text splitters](https://python.langchain.com/docs/concepts/text_splitters/) break large `Documents` into smaller chunks. This is useful both for indexing data and for passing it to a model, since large chunks are harder to search over and may not fit in a model's finite context window.
3. **Store**: We need somewhere to store and index our splits so that they can be searched later. This is often done using a [VectorStore](https://python.langchain.com/docs/concepts/vectorstores/) together with an [Embeddings](https://python.langchain.com/docs/concepts/embedding_models/) model.
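
The three indexing steps can be sketched without any library. This is only a toy illustration, not how LangChain implements them: `load_text`, `split_text`, and `embed` are hypothetical helpers, and the word-count "embedding" stands in for a real embedding model.

```python
from collections import Counter

def load_text() -> str:
    # Hypothetical loader: a real pipeline would read a file or a web page.
    return (
        "Retrieval augmented generation combines search with a language model. "
        "Documents are split into chunks. Each chunk is embedded and stored. "
        "At query time the most similar chunks are retrieved."
    )

def split_text(text: str, chunk_size: int) -> list[str]:
    # Naive fixed-size splitter that counts whitespace-separated words.
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def embed(chunk: str) -> Counter:
    # Toy "embedding": lowercase word counts instead of a dense vector.
    return Counter(word.strip(".,").lower() for word in chunk.split())

# Load -> Split -> Store
chunks = split_text(load_text(), chunk_size=8)
index = [(embed(chunk), chunk) for chunk in chunks]
print(len(index), "chunks indexed")
```

In this sketch the "store" is a plain list of `(vector, chunk)` pairs. A vector store plays the same role but adds persistence and efficient similarity search.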

Side note:

The context window (or “context length”) of a large language model (LLM) is the amount of text, in tokens, that the model can consider or “remember” at any one time. A larger context window enables an AI model to process longer inputs and incorporate a greater amount of information into each output.

> https://codingscape.com/blog/llms-with-largest-context-windows
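
A rough budget check shows why chunk size matters for the context window. The sketch below assumes about 4 characters per token, which is only a heuristic (real tokenizers vary by model), and the window sizes are made-up values for illustration. `estimate_tokens` and `fits_in_context` are hypothetical helpers.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    # Rough heuristic (assumption): about 4 characters per token for English text.
    # Real tokenizers differ by model, so treat the result as an estimate only.
    return int(len(text) / chars_per_token) + 1

def fits_in_context(chunks: list[str], question: str,
                    context_window: int, reserved_for_answer: int) -> bool:
    # The prompt (question plus retrieved chunks) must leave room for the answer.
    prompt_tokens = estimate_tokens(question) + sum(estimate_tokens(c) for c in chunks)
    return prompt_tokens + reserved_for_answer <= context_window

chunks = ["x" * 1000] * 5  # five chunks of 1000 characters each
question = "What is compressive memory?"

print(fits_in_context(chunks, question, context_window=8000, reserved_for_answer=1000))
print(fits_in_context(chunks, question, context_window=2000, reserved_for_answer=1000))
```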



![rag_indexing_phase](../assets/images/rag_indexing_phase.png)

### Retrieval and generation

4. **Retrieve**: Given a user input, relevant splits are retrieved from storage using a [Retriever](https://python.langchain.com/docs/concepts/retrievers/).
5. **Generate**: A [ChatModel](https://python.langchain.com/docs/concepts/chat_models/) / [LLM](https://python.langchain.com/docs/concepts/text_llms/) produces an answer using a prompt that includes both the question and the retrieved data.

![](../assets/images/rag_phase_2.png)
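
The retrieve and generate steps can be sketched in plain Python as well. This is a simplified illustration under stated assumptions: word-count vectors replace a real embedding model, the ranking is brute-force cosine similarity, and the final model call is left out. All function names here are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": lowercase word counts (a real system uses an embedding model).
    return Counter(word.strip(".,?!").lower() for word in text.split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, chunks: list[str], k: int) -> list[str]:
    # Rank chunks by similarity to the question and keep the top k.
    query_vector = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vector, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    # The prompt carries both the question and the retrieved text.
    context = "\n\n".join(context_chunks)
    return f"Question: {question}\nContext: {context}\nAnswer:"

chunks = [
    "Compressive memory keeps a fixed number of parameters.",
    "Passkey retrieval was evaluated at 1M sequence length.",
    "Book summarization was evaluated at 500K length.",
]
question = "What is compressive memory?"
top_chunks = retrieve(question, chunks, k=1)
prompt_text = build_prompt(question, top_chunks)
print(prompt_text)
```

The string in `prompt_text` is what would be sent to the model in the generate step.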

In [1]:
from langchain.chat_models import init_chat_model

llm = init_chat_model("gemini-2.5-flash", model_provider="google_genai")

In [2]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")

In [3]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="rag_langchain",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

In [None]:
# !pip install -q pypdf

In [4]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/LeaveNoContextBehind.pdf")
pages = loader.load_and_split()

In [5]:
pages[0].page_content

'Preprint. Under review.\nLeave No Context Behind:\nEfficient Infinite Context Transformers with Infini-attention\nTsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal\nGoogle\ntsendsuren@google.com\nAbstract\nThis work introduces an efficient method to scale Transformer-based Large\nLanguage Models (LLMs) to infinitely long inputs with bounded memory\nand computation. A key component in our proposed approach is a new at-\ntention technique dubbed Infini-attention. The Infini-attention incorporates\na compressive memory into the vanilla attention mechanism and builds\nin both masked local attention and long-term linear attention mechanisms\nin a single Transformer block. We demonstrate the effectiveness of our\napproach on long-context language modeling benchmarks, 1M sequence\nlength passkey context block retrieval and 500K length book summarization\ntasks with 1B and 8B LLMs. Our approach introduces minimal bounded\nmemory parameters and enables fast streaming inference for LLMs

In [6]:
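# Number of Document objects returned by the loader.
# load_and_split() may split long pages further, so this can exceed the page count.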
len(pages)

13

In [None]:
# !pip install nltk

In [19]:
# import nltk
# nltk.download('punkt')


In [20]:
# # Split the document into chunks

# from langchain_text_splitters import NLTKTextSplitter

# text_splitter = NLTKTextSplitter(chunk_size=500, chunk_overlap=100)

# chunks = text_splitter.split_documents(pages)

# print(len(chunks))

# print(type(chunks[0]))

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(pages)

# Index chunks
_ = vector_store.add_documents(documents=all_splits)
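
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified character-level sliding window. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which splits recursively on separators such as paragraphs, lines, and words and therefore produces different boundaries. `sliding_window_split` is a hypothetical helper.

```python
def sliding_window_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Simplified character-level splitter: each chunk starts
    # (chunk_size - chunk_overlap) characters after the previous one.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

text = "".join(str(i % 10) for i in range(2600))  # 2600 characters of sample text
splits = sliding_window_split(text, chunk_size=1000, chunk_overlap=200)

print([len(s) for s in splits])
print(splits[0][-200:] == splits[1][:200])  # consecutive chunks share 200 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.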

In [8]:
from langchain import hub
prompt = hub.pull("rlm/rag-prompt")
prompt



ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"), additional_kwargs={})])
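
The output above shows that the prompt has two input variables, `context` and `question`. As a minimal sketch of what filling them in amounts to, the template text can be formatted with ordinary Python string substitution. The sample question and context below are illustrative values, not output from the chain.

```python
# Template text copied from the `rlm/rag-prompt` output shown above.
template = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question} \nContext: {context} \nAnswer:"
)

filled = template.format(
    question="What is Infini-attention?",
    context="Infini-attention incorporates a compressive memory into the vanilla attention mechanism.",
)
print(filled)
```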

In [10]:
# Open the persisted Chroma collection that was populated during indexing
db_connection = Chroma(
    collection_name="rag_langchain",
    persist_directory="./chroma_langchain_db",
    embedding_function=embeddings,
)

In [11]:
# Wrap the Chroma vector store in a retriever that returns the 5 most similar chunks
retriever = db_connection.as_retriever(search_kwargs={"k": 5})

print(type(retriever))

<class 'langchain_core.vectorstores.base.VectorStoreRetriever'>


In [12]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | output_parser
)
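
The `|` operator in the chain above composes steps from left to right, and the leading dict runs each of its entries on the same input. The toy model below illustrates that idea in plain Python. It is not LangChain's implementation: `Step`, `parallel`, `fake_retriever`, and `fake_llm` are hypothetical stand-ins.

```python
class Step:
    """Wraps a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # a | b means: run a, then feed its output to b.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)

def parallel(steps: dict) -> Step:
    # Run every step on the same input and collect the results in a dict,
    # similar in spirit to the {"context": ..., "question": ...} mapping above.
    return Step(lambda value: {key: step.invoke(value) for key, step in steps.items()})

fake_retriever = Step(lambda question: ["chunk A", "chunk B"])  # stand-in retriever
toy_format_docs = Step(lambda docs: "\n\n".join(docs))
passthrough = Step(lambda value: value)
toy_prompt = Step(lambda d: f"Question: {d['question']}\nContext: {d['context']}\nAnswer:")
fake_llm = Step(lambda text: f"[model output for a {len(text)}-character prompt]")  # stand-in model

chain = (
    parallel({"context": fake_retriever | toy_format_docs, "question": passthrough})
    | toy_prompt
    | fake_llm
)
result = chain.invoke("What is compressive memory?")
print(result)
```

The question string enters once, is routed both to the retriever branch and to the passthrough branch, and the merged dict then flows through the prompt and the model in order.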

In [13]:
response = rag_chain.invoke("""Please summarize Leave No Context Behind:
                            Efficient Infinite Context Transformers with Infini-attention""")

response

'"Leave No Context Behind" introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. This is achieved through a new attention technique called Infini-attention, which incorporates a compressive memory, masked local attention, and long-term linear attention into a single Transformer block. The approach demonstrates effectiveness on long-context benchmarks, including 1M sequence length passkey retrieval and 500K length book summarization, enabling fast streaming inference.'

----


In [None]:

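# Alternative hand-written prompt with the same {context} and {question} variables as the hub prompt
from langchain_core.messages import SystemMessage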
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate



chat_template = ChatPromptTemplate.from_messages([
    # System Message Prompt Template
    SystemMessage(content="""You are a Helpful AI Bot.
                  Given a context and question from user,
                  you should answer based on the given context."""),
    # Human Message Prompt Template
    HumanMessagePromptTemplate.from_template("""Answer the question based on the given context.
    Context: {context}
    Question: {question}
    Answer: """)
])


----

In [14]:
from IPython.display import Markdown as md

md(response)

"Leave No Context Behind" introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. This is achieved through a new attention technique called Infini-attention, which incorporates a compressive memory, masked local attention, and long-term linear attention into a single Transformer block. The approach demonstrates effectiveness on long-context benchmarks, including 1M sequence length passkey retrieval and 500K length book summarization, enabling fast streaming inference.

In [None]:
response = rag_chain.invoke("""Please Explain Compressive Memory""")

response

'Compressive memory systems maintain a constant number of memory parameters, unlike Transformer KV memory which grows with input sequence length, to ensure computational efficiency and bounded storage. They store and retrieve information by modifying these fixed parameters via an update rule and a memory reading mechanism. This approach aims for greater scalability and efficiency for processing extremely long sequences compared to standard attention mechanisms.'

In [17]:
from IPython.display import Markdown as md

md(response)

Compressive memory systems maintain a constant number of memory parameters, unlike Transformer KV memory which grows with input sequence length, to ensure computational efficiency and bounded storage. They store and retrieve information by modifying these fixed parameters via an update rule and a memory reading mechanism. This approach aims for greater scalability and efficiency for processing extremely long sequences compared to standard attention mechanisms.