# **Retrieval-Augmented Generation (RAG)**  

**RAG is a technique for augmenting LLM knowledge with additional data.**  

A RAG application has two main components:  

1. **Indexing (Offline):** Ingests and indexes data for efficient retrieval.  
2. **Retrieval & Generation (Online):** Processes user queries in real time.  

## **Workflow**  

### **1. Indexing**  
- **Load:** Data is ingested via **Document Loaders**.  
- **Split:** Large documents are broken into smaller chunks using **Text Splitters**.  
- **Store:** Chunks are embedded with an **Embeddings model** and indexed in a **VectorStore**.  

### **2. Retrieval & Generation**  
- **Retrieve:** A **Retriever** fetches relevant chunks based on the user query.  
- **Generate:** A **ChatModel / LLM** generates a response using retrieved data.  

Grounding the response in retrieved chunks lets the model answer from the indexed data rather than from its training data alone, and the vector index keeps lookups fast.
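The two phases can be sketched end to end without any libraries. This is a toy illustration only: word overlap stands in for embeddings, and `build_index`, `retrieve`, `build_prompt`, and the sample documents are hypothetical names and data, not part of LangChain or this notebook.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(documents: list[str]) -> list[tuple[str, set[str]]]:
    """Indexing (offline): store each chunk next to its word set."""
    return [(doc, tokenize(doc)) for doc in documents]


def retrieve(index: list[tuple[str, set[str]]], query: str, k: int = 2) -> list[str]:
    """Retrieval (online): rank chunks by the number of words shared with the query."""
    query_words = tokenize(query)
    ranked = sorted(index, key=lambda item: len(item[1] & query_words), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Generation input: the retrieved chunks are pasted into the prompt."""
    context = "\n\n".join(context_chunks)
    return f"Question: {question}\nContext: {context}\nAnswer:"


# Made-up sample chunks, for illustration only.
toy_docs = [
    "Task decomposition breaks a large task into smaller steps.",
    "FAISS is a library for similarity search over vectors.",
    "Ollama runs language models on a local machine.",
]

toy_index = build_index(toy_docs)
top_chunks = retrieve(toy_index, "What is task decomposition?", k=1)
toy_prompt = build_prompt("What is task decomposition?", top_chunks)
print(toy_prompt)
```

In the notebook that follows, the word-overlap ranking is replaced by embedding similarity, and the finished prompt is sent to a chat model instead of being printed.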


**1. Indexing: Load**

In [47]:
! pip -q install langchain-community langchain-text-splitters langchain-ollama faiss-cpu beautifulsoup4

In [None]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Only keep post title, headers, and content from the full HTML.
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))

docs = WebBaseLoader(
    web_paths=["https://lilianweng.github.io/posts/2023-06-23-agent/"],
    bs_kwargs={"parse_only": bs4_strainer},
).load()

In [None]:
print("Length of document:", len(docs[0].page_content))
print(docs[0].page_content[:100])

**2. Chunking**

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

recursive_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=7500,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""],
)

# Extract the raw text of each loaded document
page_content = [doc.page_content for doc in docs]

# Split every document's text and collect the chunks in one flat list
chunks = []
for content in page_content:
    chunks.extend(recursive_text_splitter.split_text(content))

In [None]:
# Report the largest and smallest chunk sizes and the chunk count
chunk_sizes = [len(chunk) for chunk in chunks]
print('Maximum chunk size among all:', max(chunk_sizes))
print('Minimum chunk size among all:', min(chunk_sizes))
print('Total number of chunks is:', len(chunks))
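To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on the listed separators first and then merges pieces, so its chunks are often shorter than `chunk_size`); `sliding_window_chunks` is a hypothetical helper written only for this illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return windows


sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for piece in toy_chunks:
    print(len(piece), piece)
```

Each window repeats the last three characters of the one before it. That repetition is the purpose of overlap: a sentence that straddles a boundary still appears whole in at least one chunk.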

**3. Embedding & Store**

In [5]:
from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaEmbeddings

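# Embedding model served by a local Ollama instance
# (the model must be available first, e.g. `ollama pull nomic-embed-text`).
embed = OllamaEmbeddings(model="nomic-embed-text")
embeddings = embed.embed_documents(chunks)  # One vector per chunk

# Build an in-memory FAISS vector store from (text, vector) pairs
vectorstore = FAISS.from_embeddings(list(zip(chunks, embeddings)), embedding=embed)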
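Conceptually, a vector store keeps (text, vector) pairs and returns the texts whose vectors lie closest to the query vector. The sketch below does this by brute force over hand-written 3-dimensional vectors. The Euclidean metric, the `nearest_texts` helper, and the toy vectors are all assumptions for illustration; real embeddings have hundreds of dimensions, and FAISS uses optimized index structures rather than a Python sort.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_texts(store: list[tuple[str, list[float]]],
                  query_vector: list[float],
                  k: int = 2) -> list[str]:
    """Return the k texts whose vectors are closest to the query vector."""
    ranked = sorted(store, key=lambda pair: l2_distance(pair[1], query_vector))
    return [text for text, _ in ranked[:k]]


# Hand-made vectors standing in for real embeddings.
toy_store = [
    ("planning", [0.9, 0.1, 0.0]),
    ("memory", [0.1, 0.9, 0.0]),
    ("tool use", [0.0, 0.1, 0.9]),
]
toy_query = [0.8, 0.2, 0.0]

print(nearest_texts(toy_store, toy_query, k=2))
```

The `k` argument plays the same role as `search_kwargs={"k": 6}` in the retriever: it caps how many neighbors are returned.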

#### RAG Chain (Sequential Execution)

**4. Retrieve**

In [None]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
retrieved_docs = retriever.invoke("What are the approaches to Task Decomposition?")
print(len(retrieved_docs))

**5. Generation**

In [None]:
! ollama pull deepseek-r1:1.5b

In [7]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Load the LLM model with specific parameters
llm = ChatOllama(model="deepseek-r1:1.5b", temperature=0.8)

# Define the prompt template for RAG (Retrieval-Augmented Generation).
# The adjacent string literals are concatenated into a single human message.
prompt = ChatPromptTemplate.from_template(
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question} \n"
    "Context: {context} \n"
    "Answer:"
)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
for chunk in rag_chain.stream("What is an agent? Explain in short"):
    print(chunk, end="", flush=True)
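The `|` operator in the sequential chain feeds each step's output into the next, and the leading dict sends the same question down two branches. A plain-Python analogue of that data flow is sketched below. `pipe`, `fan_out`, and every `fake_*` function are hypothetical stand-ins, not LangChain APIs; the real components are runnables that also support streaming and batching.

```python
from typing import Callable


def pipe(*steps: Callable) -> Callable:
    """Compose steps left to right, similar in spirit to `a | b | c`."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


def fan_out(branches: dict[str, Callable]) -> Callable:
    """Feed the same input to every branch and collect the results in a dict."""
    return lambda value: {name: branch(value) for name, branch in branches.items()}


# Stand-ins for the retriever, prompt, model, and output parser.
def fake_retriever(question: str) -> list[str]:
    return ["chunk A", "chunk B"]


def join_chunks(chunks: list[str]) -> str:
    return "\n\n".join(chunks)


def fill_prompt(inputs: dict) -> str:
    return f"Question: {inputs['question']}\nContext: {inputs['context']}\nAnswer:"


def fake_llm(prompt_text: str) -> dict:
    # A real chat model returns a message object; a dict mimics that here.
    return {"content": "ECHO: " + prompt_text}


def extract_text(message: dict) -> str:
    return message["content"]


toy_chain = pipe(
    fan_out({"context": pipe(fake_retriever, join_chunks), "question": lambda q: q}),
    fill_prompt,
    fake_llm,
    extract_text,
)

toy_answer = toy_chain("What is an agent?")
print(toy_answer)
```

The question reaches the prompt twice: once unchanged (the role `RunnablePassthrough()` plays) and once transformed into context by the retriever branch.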

#### RAG Chain (Parallel Execution)

**4. Retrieve**

In [None]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
retrieved_docs = retriever.invoke("What is Task Decomposition? Explain in short")
print(len(retrieved_docs))

**5. Generation**

In [10]:
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

# Load the LLM model with specific parameters
llm = ChatOllama(model="deepseek-r1:1.5b", temperature=0.8)

# Define the prompt template for RAG (Retrieval-Augmented Generation).
# The adjacent string literals are concatenated into a single human message.
prompt = ChatPromptTemplate.from_template(
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question} \n"
    "Context: {context} \n"
    "Answer:"
)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

In [None]:
response = rag_chain_with_source.invoke("What is Task Decomposition? Explain in short")
print(response['answer'])
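What distinguishes this chain from the sequential one is that the retrieved documents survive in the output next to the answer. The pattern is "keep the input dict, add a computed key", sketched here in plain Python. `with_keys`, `fake_retriever`, and `answer_from_docs` are hypothetical stand-ins, and the dicts only mimic `Document` objects.

```python
def with_keys(inputs: dict, **computed) -> dict:
    """Return a copy of `inputs` plus keys computed from it,
    loosely mirroring the idea behind `.assign(...)`."""
    updates = {key: fn(inputs) for key, fn in computed.items()}
    return {**inputs, **updates}


def fake_retriever(question: str) -> list[dict]:
    # Each dict stands in for a retrieved document.
    return [{"page_content": "chunk A"}, {"page_content": "chunk B"}]


def answer_from_docs(inputs: dict) -> str:
    # Inside this step only, the document list is replaced by one string.
    formatted = with_keys(
        inputs,
        context=lambda x: "\n\n".join(d["page_content"] for d in x["context"]),
    )
    return f"answer based on {len(formatted['context'])} characters of context"


def chain_with_source(question: str) -> dict:
    gathered = {"context": fake_retriever(question), "question": question}
    # The outer dict keeps the original documents and gains an "answer" key.
    return with_keys(gathered, answer=answer_from_docs)


toy_response = chain_with_source("What is Task Decomposition?")
print(sorted(toy_response))
print(toy_response["answer"])
```

In the notebook's chain, `response["context"]` likewise holds the retrieved `Document` objects, so their `page_content` can be inspected alongside `response["answer"]` to check what the answer was based on.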