# RAG

## Overview

A typical RAG application has two main components:

**Indexing**: a pipeline for ingesting data from a source and indexing it. _This usually happens offline._

**Retrieval and generation**: the actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model.

The most common full sequence from raw data to answer looks like:

### Indexing

1. **Load**: First we need to load our data. This is done with [Document Loaders](https://python.langchain.com/docs/concepts/document_loaders/).
2. **Split**: [Text splitters](https://python.langchain.com/docs/concepts/text_splitters/) break large `Documents` into smaller chunks. This is useful both for indexing data and for passing it into a model, as large chunks are harder to search over and may not fit in a model's finite context window.
3. **Store**: We need somewhere to store and index our splits, so that they can be searched over later. This is often done using a [VectorStore](https://python.langchain.com/docs/concepts/vectorstores/) and an [Embeddings](https://python.langchain.com/docs/concepts/embedding_models/) model.

**Side note**: The context window (or “context length”) of a large language model (LLM) is the amount of text, measured in tokens, that the model can consider or “remember” at any one time. A larger context window lets a model process longer inputs and incorporate more information into each output.

> https://codingscape.com/blog/llms-with-largest-context-windows
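To make the idea of a finite context window concrete, here is a minimal sketch of keeping retrieved chunks within a token budget. The helper names (`approx_token_count`, `select_chunks`) are hypothetical, and counting one token per whitespace-separated word is a crude assumption; real models use their own tokenizers.

```python
def approx_token_count(text: str) -> int:
    # Assumption: one token per whitespace-separated word.
    # A real tokenizer would usually give a different count.
    return len(text.split())


def select_chunks(chunks: list[str], context_window: int, reserved_for_answer: int = 0) -> list[str]:
    """Keep chunks, in order, until the token budget is used up."""
    budget = context_window - reserved_for_answer
    selected, used = [], 0
    for chunk in chunks:
        cost = approx_token_count(chunk)
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return selected


chunks = ["one two three", "four five", "six seven eight nine"]
kept = select_chunks(chunks, context_window=8, reserved_for_answer=2)
print(kept)
```

With a budget of 6 tokens, the first two chunks fit and the third is dropped.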



![rag_indexing_phase](../assets/images/rag_indexing_phase.png)

### Retrieval and generation

4. **Retrieve**: Given a user input, relevant splits are retrieved from storage using a [Retriever](https://python.langchain.com/docs/concepts/retrievers/).
5. **Generate**: A [ChatModel](https://python.langchain.com/docs/concepts/chat_models/) / [LLM](https://python.langchain.com/docs/concepts/text_llms/) produces an answer using a prompt that includes both the question and the retrieved data.
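The five steps can be sketched end to end in plain Python, without any library. This is a toy illustration only: the sample text, the sentence-based splitter, and the word-overlap scoring are all assumptions made for the example, and the final step only builds the prompt rather than calling a model.

```python
import re


def load() -> str:
    # 1. Load: a hard-coded string stands in for a document loader.
    return (
        "Infini-attention adds a compressive memory to standard attention. "
        "The memory has a fixed size, so cost stays bounded. "
        "Chroma is a vector store that persists embeddings on disk."
    )


def split(text: str) -> list[str]:
    # 2. Split: one chunk per sentence (real splitters work with size limits).
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def build_index(chunks: list[str]) -> list[tuple[str, set[str]]]:
    # 3. Store: keep each chunk next to its word set (a stand-in for an embedding).
    return [(chunk, tokenize(chunk)) for chunk in chunks]


def retrieve(index, question: str, k: int = 1) -> list[str]:
    # 4. Retrieve: rank chunks by how many words they share with the question.
    q = tokenize(question)
    ranked = sorted(index, key=lambda item: len(q & item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    # 5. Generate: only the prompt is built here; a model call would follow.
    context = "\n\n".join(context_chunks)
    return f"Question: {question}\nContext: {context}\nAnswer:"


index = build_index(split(load()))
top = retrieve(index, "What is compressive memory?", k=1)
prompt_text = build_prompt("What is compressive memory?", top)
print(prompt_text)
```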

![](../assets/images/rag_phase_2.png)

![rag_p2](../assets/images/rag_p2.png)

In [1]:
from langchain.chat_models import init_chat_model

llm = init_chat_model("gemini-2.5-flash", model_provider="google_genai")

In [28]:
# from langchain_google_genai import GoogleGenerativeAIEmbeddings

# embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")


# https://python.langchain.com/docs/integrations/text_embedding/sentence_transformers/

EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

from langchain_huggingface import HuggingFaceEmbeddings

embedding_func = HuggingFaceEmbeddings(model_name=EMBED_MODEL)

In [29]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="rag_langchain",
    embedding_function=embedding_func,
    persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

In [9]:
from langchain_community.document_loaders import PyPDFLoader


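# load_and_split() loads the PDF and also applies a default text splitter,
# so the number of returned chunks can differ from the PDF's page count.
loader = PyPDFLoader("../data/LeaveNoContextBehind.pdf")
pages = loader.load_and_split()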

In [13]:
len(pages)

13

In [15]:
type(pages[0])

langchain_core.documents.base.Document

In [17]:
pages[0]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-11T01:01:12+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-11T01:01:12+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '../data/LeaveNoContextBehind.pdf', 'total_pages': 12, 'page': 0, 'page_label': '1'}, page_content='Preprint. Under review.\nLeave No Context Behind:\nEfficient Infinite Context Transformers with Infini-attention\nTsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal\nGoogle\ntsendsuren@google.com\nAbstract\nThis work introduces an efficient method to scale Transformer-based Large\nLanguage Models (LLMs) to infinitely long inputs with bounded memory\nand computation. A key component in our proposed approach is a new at-\ntention technique dubbed Infini-attention. The Infini-attention incorporates\na compressive memory into

In [18]:
print(pages[0].page_content)

Preprint. Under review.
Leave No Context Behind:
Efficient Infinite Context Transformers with Infini-attention
Tsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal
Google
tsendsuren@google.com
Abstract
This work introduces an efficient method to scale Transformer-based Large
Language Models (LLMs) to infinitely long inputs with bounded memory
and computation. A key component in our proposed approach is a new at-
tention technique dubbed Infini-attention. The Infini-attention incorporates
a compressive memory into the vanilla attention mechanism and builds
in both masked local attention and long-term linear attention mechanisms
in a single Transformer block. We demonstrate the effectiveness of our
approach on long-context language modeling benchmarks, 1M sequence
length passkey context block retrieval and 500K length book summarization
tasks with 1B and 8B LLMs. Our approach introduces minimal bounded
memory parameters and enables fast streaming inference for LLMs.
1 Introduction
M

In [19]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(pages)
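To show what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and lines); the function name `split_with_overlap` is hypothetical and only illustrates the two parameters.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk starts with the last 4 characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.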




In [20]:
type(all_splits)

list

In [21]:
type(all_splits[0])

langchain_core.documents.base.Document

In [26]:
print(all_splits[0].page_content)

Preprint. Under review.
Leave No Context Behind:
Efficient Infinite Context Transformers with Infini-attention
Tsendsuren Munkhdalai, Manaal Faruqui and Siddharth Gopal
Google
tsendsuren@google.com
Abstract
This work introduces an efficient method to scale Transformer-based Large
Language Models (LLMs) to infinitely long inputs with bounded memory
and computation. A key component in our proposed approach is a new at-
tention technique dubbed Infini-attention. The Infini-attention incorporates
a compressive memory into the vanilla attention mechanism and builds
in both masked local attention and long-term linear attention mechanisms
in a single Transformer block. We demonstrate the effectiveness of our
approach on long-context language modeling benchmarks, 1M sequence
length passkey context block retrieval and 500K length book summarization
tasks with 1B and 8B LLMs. Our approach introduces minimal bounded
memory parameters and enables fast streaming inference for LLMs.
1 Introduction


In [24]:
len(all_splits)

54

In [30]:
# Index chunks
_ = vector_store.add_documents(documents=all_splits)

In [55]:
from langchain import hub
prompt = hub.pull("rlm/rag-prompt")
prompt



ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"), additional_kwargs={})])

In [38]:
print(prompt.format(question="What is AI?", context='AI is ...'))

Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: What is AI? 
Context: AI is ... 
Answer:


In [42]:
c = {'question': 'What is AI?', 'context':'AI is ...'}

In [43]:
print(prompt.format(**c))

Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: What is AI? 
Context: AI is ... 
Answer:
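The formatting step above behaves much like Python's built-in `str.format`. The sketch below copies the template text from the prompt output shown earlier and adds the `Human: ` prefix by hand; it is a plain-string approximation, not the `ChatPromptTemplate` implementation.

```python
# Template text copied from the rlm/rag-prompt output above.
TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise."
    "\nQuestion: {question} \nContext: {context} \nAnswer:"
)

values = {"question": "What is AI?", "context": "AI is ..."}

# Keyword arguments and an unpacked dict are equivalent ways to fill the slots.
filled = "Human: " + TEMPLATE.format(**values)
print(filled)

# Leaving out a variable fails loudly instead of producing a half-filled prompt.
try:
    TEMPLATE.format(question="What is AI?")
    missing = None
except KeyError as err:
    missing = err.args[0]
print("missing variable:", missing)
```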


In [44]:
# Connect to the persisted Chroma collection created earlier
db_connection = Chroma(
    collection_name="rag_langchain",
    persist_directory="./chroma_langchain_db",
    embedding_function=embedding_func,
)

In [45]:
# Convert the Chroma connection into a retriever that returns the top 3 chunks
retriever = db_connection.as_retriever(search_kwargs={"k": 3})

print(type(retriever))

<class 'langchain_core.vectorstores.base.VectorStoreRetriever'>
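The `k` in `search_kwargs` is the number of chunks returned per query. A minimal sketch of top-k retrieval over stored vectors follows. It assumes cosine similarity and tiny hand-written 3-dimensional vectors; the metric actually used depends on how the vector store is configured, and real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], stored: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the texts of the k stored vectors most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, embedding) pairs standing in for indexed chunks.
stored = [
    ("chunk about attention", [1.0, 0.0, 0.0]),
    ("chunk about memory", [0.0, 1.0, 0.0]),
    ("chunk about both", [1.0, 1.0, 0.0]),
    ("unrelated chunk", [0.0, 0.0, 1.0]),
]

results = top_k([1.0, 0.9, 0.0], stored, k=3)
print(results)
```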


## HW: Understand the most common output parsers

In [52]:




# Toy example: join a list of strings with blank lines in between
docs = ['hii', 'how are you...', 'I am sleepy']

print("\n\n".join(docs))


hii

how are you...

I am sleepy


In [53]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [54]:
# The retriever returns a list of documents, so we stitch their contents together and pass a single string to the LLM as context.




In [56]:
rag_chain = (
    {'context': retriever | format_docs, 'question': RunnablePassthrough()}
    | prompt
    | llm
    | output_parser
)
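The `|` operator in the chain composes steps so that each step's output becomes the next step's input. The toy classes below imitate that behaviour with Python's `__or__` method. They are a hypothetical sketch, not LangChain's implementation: `Step`, `parallel`, and all the `fake_*` stubs are made up for illustration, and the explicit `parallel(...)` call stands in for the dict at the head of the chain.

```python
class Step:
    """Wraps a function so steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # self | other: run self first, then feed its output into other.
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(steps: dict) -> Step:
    # Every branch receives the same input; results are collected into a dict.
    return Step(lambda value: {name: step.invoke(value) for name, step in steps.items()})


# Stubs standing in for the retriever, prompt, model, and output parser.
passthrough = Step(lambda value: value)
fake_retriever = Step(lambda question: ["chunk one", "chunk two"])
join_docs = Step(lambda docs: "\n\n".join(docs))
fake_prompt = Step(lambda d: f"Question: {d['question']}\nContext: {d['context']}\nAnswer:")
fake_llm = Step(lambda text: {"content": "stub answer", "prompt": text})
parser = Step(lambda message: message["content"])

inputs = parallel({"context": fake_retriever | join_docs, "question": passthrough})
chain = inputs | fake_prompt | fake_llm | parser

answer = chain.invoke("What is AI?")
print(answer)
```

The question string flows into both branches: one retrieves and joins chunks, the other passes the question through unchanged, which mirrors the role of `RunnablePassthrough()` in the chain above.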




In [57]:
res = rag_chain.invoke("""Please summarize Leave No Context Behind:
                            Efficient Infinite Context Transformers with Infini-attention""")


res

'This work introduces an efficient method called Infini-attention to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. Infini-attention incorporates a compressive memory into the vanilla attention mechanism, combining masked local attention and long-term linear attention. This allows it to reuse old KV attention states from previous segments, maintaining the entire context history efficiently.'

----

In [58]:

from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate



chat_template = ChatPromptTemplate.from_messages([
    # System Message Prompt Template
    SystemMessage(content="""You are a Helpful AI Bot.
                  Given a context and question from user,
                  you should answer based on the given context."""),
    # Human Message Prompt Template
    HumanMessagePromptTemplate.from_template("""Answer the question based on the given context.
    Context: {context}
    Question: {question}
    Answer: """)
])
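The template above pairs a fixed system message with a human message that has `context` and `question` slots, the same two variables the hub prompt uses. A plain-Python sketch of that structure, using (role, text) tuples as a stand-in for message objects, might look like this; the shortened system text and the `build_messages` helper are assumptions for illustration.

```python
SYSTEM_TEXT = "You are a helpful AI bot. Answer based on the given context."

HUMAN_TEMPLATE = (
    "Answer the question based on the given context.\n"
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer: "
)


def build_messages(context: str, question: str) -> list[tuple[str, str]]:
    # The system message is constant; only the human message is filled in.
    return [
        ("system", SYSTEM_TEXT),
        ("human", HUMAN_TEMPLATE.format(context=context, question=question)),
    ]


messages = build_messages(context="AI is ...", question="What is AI?")
for role, text in messages:
    print(f"[{role}] {text}")
```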


In [60]:
from IPython.display import Markdown as md

md(res)

This work introduces an efficient method called Infini-attention to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. Infini-attention incorporates a compressive memory into the vanilla attention mechanism, combining masked local attention and long-term linear attention. This allows it to reuse old KV attention states from previous segments, maintaining the entire context history efficiently.

In [61]:
response = rag_chain.invoke("""Please Explain Compressive Memory""")

response

'Compressive memory maintains a fixed number of parameters to store and recall information, ensuring bounded storage and computation costs. New information is added by modifying these parameters with an objective for later recovery. Unlike memory arrays that grow with input sequence length, this approach offers computational efficiency.'

In [62]:
from IPython.display import Markdown as md

md(response)

Compressive memory maintains a fixed number of parameters to store and recall information, ensuring bounded storage and computation costs. New information is added by modifying these parameters with an objective for later recovery. Unlike memory arrays that grow with input sequence length, this approach offers computational efficiency.