# RAG: Retrieval-Augmented Generation

RAG is essentially an LLM paired with a retriever. It saves us the time of fine-tuning.

Fine-tuning vs. RAG:

In fine-tuning, we retrain the model on new data and its weights are updated, which changes how the model produces its results. In RAG, we do not retrain the model; instead, relevant information is fetched from the provided data and the model bases its output on that retrieved data.

The LLM in a RAG system simply takes the retrieved data as input and synthesizes the output, so the retrieval part is the only new component.

## Retrieval

The newly provided information (text, documents, etc.) is stored in a database (a vector DB, a graph DB, or a traditional SQL database).

Vector DB: The data is divided into chunks, each chunk is embedded, and the embeddings are stored in the vector DB. The query is vectorized the same way, the vectors closest to the query are found, and the top K matches are sent to the LLM for synthesis. It may sometimes fetch irrelevant data.
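As a minimal sketch of the top-K step, assuming embeddings are plain lists of floats and closeness is measured with cosine similarity (one common choice; these notes do not fix a metric), the selection could look like this. The vectors and the helper names `cosine_similarity` and `top_k` are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy 3-dimensional "embeddings" (hypothetical values, not from a real model).
chunk_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.8, 0.0, 0.1]

best = top_k(query_vec, chunk_vecs, k=2)
print(best)
```

A real vector DB replaces the linear scan with an approximate nearest-neighbour index, which is one reason the returned matches are not always the most relevant ones.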

Graph DB: The data is broken into entities and stored as a graph, with connections between entities capturing their relationships. The query is also converted into a graph, exact matches are searched for, and the LLM then synthesizes the answer. Because it requires exact matching, it may be unsuitable for some applications.
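A rough sketch of the exact-matching behaviour, assuming the graph is stored as (subject, relation, object) triples in a plain dictionary rather than a real graph database. The triples and the `lookup` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples extracted from a document.
triples = [
    ("Ada Lovelace", "worked_with", "Charles Babbage"),
    ("Charles Babbage", "designed", "Analytical Engine"),
    ("Ada Lovelace", "wrote_about", "Analytical Engine"),
]

# Adjacency list: entity -> list of (relation, neighbouring entity).
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))


def lookup(entity):
    """Exact-match lookup: return the outgoing edges of one entity."""
    return graph.get(entity, [])


facts = lookup("Ada Lovelace")
missing = lookup("ada lovelace")  # different casing, so nothing matches

print(facts)
print(missing)
```

The second lookup returns nothing even though a human would see the two names as the same entity, which illustrates why exact matching can be too strict.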

SQL: The traditional way to store data. Retrieval then relies on lexical scoring such as TF-IDF, so we lose semantic flexibility.
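To make the loss of semantic flexibility concrete, here is a small sketch of TF-IDF scoring in pure Python. It assumes whitespace tokenization and a smoothed IDF (one of several common variants); the documents and function names are invented for the example.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "stock prices fell sharply",
]
tokenized = [d.split() for d in docs]


def idf(term):
    """Smoothed inverse document frequency of a term over the corpus."""
    df = sum(1 for toks in tokenized if term in toks)
    return math.log((1 + len(tokenized)) / (1 + df)) + 1.0


def tfidf_score(query, toks):
    """Sum of term frequency * IDF over the query terms for one document."""
    counts = Counter(toks)
    return sum((counts[t] / len(toks)) * idf(t) for t in query.split())


scores = [tfidf_score("cat mat", toks) for toks in tokenized]
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

# A synonym that never appears verbatim scores zero everywhere.
feline_scores = [tfidf_score("feline", toks) for toks in tokenized]

print(ranked)
print(feline_scores)
```

The query "feline" gets a score of zero for every document, including the ones about cats, because the scoring only sees exact word overlap.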



Let’s look at two retrieval methods: standard retrieval, and sentence-window retrieval with auto-merging.

Standard: The chunk size is the same for both retrieval and synthesis. This keeps the data uniform across both steps, but it can be problematic because synthesis may need a longer chunk (more information) to produce an answer.

Sentence window and auto-merging: Here the text is broken into small units such as sentences or groups of sentences, and each match is expanded with something like ±2 sentences around the target sentence. Instead of returning only the exactly matched sentence, this method returns a window of surrounding context for synthesis.
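The window expansion can be sketched as follows, assuming a naive regex sentence splitter and that the index of the matched sentence is already known from the retrieval step. The helper names and the sample text are illustrative only.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_window(sentences, hit_index, window=2):
    """Return the matched sentence plus `window` neighbours on each side."""
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])


text = "A one. B two. C three. D four. E five. F six."
sentences = split_sentences(text)

# Pretend retrieval matched the sentence at index 3 ("D four.").
context = sentence_window(sentences, hit_index=3, window=1)
print(context)
```

Retrieval is done on single sentences, which keeps matching precise, while the LLM receives the wider window for synthesis.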


What is Retriever Ensembling?
Idea: Instead of sticking to one way of chunking your documents or using a single retriever strategy, you try multiple at once.

Chunk size matters: Different chunk sizes capture different amounts of context. Small chunks may be precise but lack context; large chunks have more context but might introduce noise.

The process is as follows:
1. Chunk up the same document in a bunch of different ways, say with chunk sizes: 128, 256, 512, and 1024.
2. During retrieval, fetch the relevant chunks from each retriever and pool them together; this pooling is the ensembling step.
3. Use a re-ranker to rank the pooled results according to their relevance to the query.

*Note: A re-ranker is a component used after retrieval in a RAG system to re-evaluate and reorder the initial set of retrieved documents or chunks based on how relevant they are to the query.*
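The ensembling steps above can be sketched in a few lines. This is a toy version under loud assumptions: chunk sizes are counted in words (and are tiny), and both the per-retriever scoring and the re-ranker are a simple word-overlap score standing in for real retrievers and a real re-ranking model.

```python
def chunk_words(words, size):
    """Split a word list into consecutive chunks of `size` words."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap_score(query, chunk):
    """Fraction of query words present in the chunk (stand-in relevance score)."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


def ensemble_retrieve(text, query, sizes=(4, 8), k=1):
    words = text.split()
    candidates = set()
    # Steps 1 and 2: chunk the same text at several sizes, keep the top-k of each.
    for size in sizes:
        chunks = chunk_words(words, size)
        chunks.sort(key=lambda ch: overlap_score(query, ch), reverse=True)
        candidates.update(chunks[:k])
    # Step 3: re-rank the pooled candidates; ties go to the shorter chunk.
    return sorted(
        candidates,
        key=lambda ch: (overlap_score(query, ch), -len(ch), ch),
        reverse=True,
    )


text = (
    "retrieval augmented generation pairs a retriever with a "
    "language model so answers cite fresh data"
)
results = ensemble_retrieve(text, "language model")
for chunk in results:
    print(chunk)
```

Both chunk sizes surface a candidate that contains the query terms; the re-ranking step then decides which one goes to the LLM first.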

Common re-ranking approaches:

1. Lexical Re-ranking:
Based on exact word matching.
Example techniques: BM25, TF-IDF cosine similarity.

2. Semantic Re-ranking:
Uses transformer-based models (e.g., BERT, DistilBERT) to understand semantic meaning, not just word overlap.
The model is asked: “Which of these chunks best answers this question?”

3. Learning to Rank (LTR):
You train a model specifically for ranking documents. There are three types:
    - Point-wise: Score each document independently.
    - Pair-wise: Compare pairs of documents to see which is better.
    - List-wise: Consider the whole list at once and reorder.
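The difference between the three types shows up mainly in the training loss. The sketch below uses one common loss per type (squared error, pairwise logistic loss, and a softmax cross-entropy over the list); these are illustrative choices, not the only options, and the scores and labels are invented.

```python
import math


def pointwise_loss(scores, labels):
    """Mean squared error between each score and its own label."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def pairwise_loss(scores, labels):
    """Logistic loss over every (more relevant, less relevant) pair."""
    losses = [
        math.log(1 + math.exp(-(si - sj)))
        for si, yi in zip(scores, labels)
        for sj, yj in zip(scores, labels)
        if yi > yj
    ]
    return sum(losses) / len(losses) if losses else 0.0


def listwise_loss(scores, labels):
    """Cross-entropy between softmax(scores) and the normalised labels."""
    z = sum(math.exp(s) for s in scores)
    total = sum(labels)
    return -sum(
        (y / total) * math.log(math.exp(s) / z) for s, y in zip(scores, labels)
    )


labels = [1.0, 0.0, 0.0]       # document 0 is the relevant one
good_scores = [2.0, 0.5, 1.0]  # relevant document scored highest
bad_scores = [0.5, 2.0, 1.0]   # relevant document scored lowest

for loss in (pointwise_loss, pairwise_loss, listwise_loss):
    print(loss.__name__, loss(good_scores, labels), loss(bad_scores, labels))
```

All three losses are lower for the ordering that puts the relevant document first; they differ in whether they look at one document, a pair, or the whole list at a time.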



In [1]:
%pip install --quiet --upgrade langchain-text-splitters langchain-community langgraph

# LangSmith

Many of the applications you build with LangChain will contain multiple steps with multiple invocations of LLM calls. As these applications get more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent. The best way to do this is with LangSmith.

In [2]:
import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()


## Chat Model - Google Gemini

In [3]:
%pip install -qU "langchain[google-genai]"

In [4]:
import getpass
import os

if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gemini-2.0-flash", model_provider="google_genai")


## Embedding - Google Gemini

In [5]:
import getpass
import os

if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")

from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

## Vector DB - Chroma

In [6]:
%pip install -qU langchain-chroma


In [7]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings,
    # persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

## Defining the Agent


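1. Define a custom state with three parts: the question (from the user), the context (retrieved from the given information), and the answer (from the LLM).
2. Define functions for retrieval and generation, which become the steps of the agent graph.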

In [45]:
# Libraries to scrape data from a website
import bs4
from typing import Literal
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.vectorstores import InMemoryVectorStore

# Libraries for making the agent graph
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict, Annotated

# Download and extract relevant content from a blog post.
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

# Split the loaded text into manageable, overlapping chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=250)
all_splits = text_splitter.split_documents(docs)

# Tag each chunk with a section label in its metadata (for illustration purposes)
total_documents = len(all_splits)
third = total_documents // 3

for i, document in enumerate(all_splits):
    if i < third:
        document.metadata["section"] = "beginning"
    elif i < 2 * third:
        document.metadata["section"] = "middle"
    else:
        document.metadata["section"] = "end"

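# Store the document chunks in a vector store,
# an embedding-based database for fast semantic search.
# Note: this rebinds `vector_store`, replacing the Chroma store created above
# with an in-memory store.
vector_store = InMemoryVectorStore(embeddings)
_ = vector_store.add_documents(documents=all_splits)

# Define the schema for the structured search query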
class Search(TypedDict):
    """Search query."""

    query: Annotated[str, ..., "Search query to run."]
    section: Annotated[
        Literal["beginning", "middle", "end"],
        ...,
        "Section to query.",
    ]

Prompt the chatbot uses to answer questions:

In [53]:
from langchain_core.prompts import PromptTemplate

# Define custom prompt template as a string
template = """
You are William Shakespeare. Answer the question based on the given context.

Context:
{context}

Question:
{question}

Answer:
"""

# Create a PromptTemplate object
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)

The agent does not remember your conversation history; each call is answered independently.

In [54]:
# Define state

class State(TypedDict):
    question: str
    query: Search
    context: List[Document]
    answer: str

def analyze_query(state: State):
    # Ask the LLM to turn the raw question into a structured Search query.
    structured_llm = llm.with_structured_output(Search)
    query = structured_llm.invoke(state["question"])
    return {"query": query}


# Define retriever
def retrieve(state: State):
    query = state["query"]
    # Restrict the similarity search to chunks from the requested section.
    retrieved_docs = vector_store.similarity_search(
        query["query"],
        filter=lambda doc: doc.metadata.get("section") == query["section"],
    )
    return {"context": retrieved_docs}


# Define generator
def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.format(context=docs_content, question=state["question"])
    response = llm.invoke(messages)
    return {"answer": response.content}


# Wire the three steps into a linear graph: analyze_query -> retrieve -> generate
graph = StateGraph(State).add_sequence([analyze_query, retrieve, generate])
graph.add_edge(START, "analyze_query")
app = graph.compile()

In [55]:
response = app.invoke({"question": "What is the article about? Explain points"})
print(response["answer"])

Hark, gentle audience, lend thine ears! This scroll, though writ in the dry language of scholars, doth speak of memory, both human and artificial, and how the one informs the other.

**Firstly, 'tis about Maximum Inner Product Search, a curious beast!** Imagine searching through a vast library of knowledge, not by title, but by how closely related each book is to a question in your mind. That, in essence, is MIPS. It allows a swift sifting through a sea of information, returning the closest matches, though perhaps not with perfect accuracy, for the sake of haste.

**Secondly, the author doth liken this process to the workings of the human mind!** He speaks of Short-Term Memory, a fleeting stage where thoughts reside but briefly, like actors upon a stage, before fading into the wings. Then, Long-Term Memory, a vast storehouse where knowledge is kept for years, like ancient tomes upon dusty shelves.

*   **Sensory memory** the learning embedding representations for raw inputs, including 