# **Simple RAG: Document Retrieval and Text Generation**

In the age of information overload, having tools that can efficiently retrieve, process, and generate insights is more valuable than ever. This blog will guide you through building a **Retrieval-Augmented Generation (RAG)** pipeline. With just a few lines of Python, you'll be able to transform complex documents into actionable knowledge. Let's get started!




## **Step 0: Setting Up the Essentials**

Before we dive into the pipeline, let's load the necessary libraries. These tools form the backbone of our RAG framework:



- **LangChain Libraries**: These power everything from document loading and splitting to embedding storage and output parsing.  
- **PyMuPDFLoader**: Loads a PDF as a sequence of documents, one per page.  
- **Chroma**: Manages the vector storage for semantic search.
- **LangSmith**: Traces each call in the pipeline, which makes debugging easier.  
- **python-dotenv**: Loads the API keys, including the one used for embeddings, from a `.env` file.

And of course, the **pprint** module ensures outputs are beautifully formatted for debugging and exploration.  




In [1]:
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_ollama.chat_models import ChatOllama
import pprint

import os
import sys
from dotenv import load_dotenv


# Load environment variables from a .env file
load_dotenv('D:/Code/AI/.env')



True

In [2]:
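import os

# Enable LangSmith tracing so every chain call is logged for debugging
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
# load_dotenv() has already put LANGCHAIN_API_KEY in the environment; fail early if it is missing
if not os.getenv('LANGCHAIN_API_KEY'):
    raise RuntimeError("LANGCHAIN_API_KEY is not set; add it to your .env file")
os.environ['LANGSMITH_PROJECT'] = "RAG_Simple"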



## **Step 1: Indexing Documents**

### **1.1 Loading Documents**
First, let's load our document. Imagine you have a PDF with key insights: this is where `PyMuPDFLoader` comes into play.  


- The `PyMuPDFLoader` processes the document lazily (page by page), saving memory when dealing with large files.  
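- Each page is appended to a list called `page`, preparing it for text splitting. Note that the list itself holds every page in memory, so for very large files you would process pages as they arrive instead.  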

**Why it matters**: Efficient loading ensures scalability for massive documents like research papers or legal contracts.  




In [3]:
loader = PyMuPDFLoader("short.pdf")  # replace with the path to any PDF you have

page = []
for doc in loader.lazy_load():
    page.append(doc)
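
As a rough illustration of lazy versus eager consumption, the sketch below uses a hypothetical generator, `fake_lazy_load`, in place of a real PDF loader. It is a stand-in for the idea only and does not reproduce how `PyMuPDFLoader` is implemented.

```python
from typing import Iterator


def fake_lazy_load(num_pages: int) -> Iterator[str]:
    """Hypothetical stand-in for loader.lazy_load(): yields one page at a time."""
    for i in range(num_pages):
        # Each page is only created when the consumer asks for it
        yield f"text of page {i + 1}"


# Eager: materialise every page in a list, so all pages are in memory at once
all_pages = list(fake_lazy_load(3))

# Lazy: handle one page at a time and keep only a running result
total_chars = 0
for page_text in fake_lazy_load(3):
    total_chars += len(page_text)

print(len(all_pages), total_chars)
```

For a short PDF the difference is negligible, which is why collecting pages in a list is fine for this walkthrough.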


### **1.2 Splitting the Text**
To manage large blocks of text effectively, we split them into smaller, manageable chunks using a **Recursive Character Text Splitter**.


- **Chunk Size**: Each chunk contains up to 1024 characters.  
- **Overlap**: A 200-character overlap between consecutive chunks keeps text near a boundary available in both neighbours, which reduces lost context.  

The splitter tries to break on paragraph boundaries first, then on line breaks, then on spaces, so chunks rarely cut through the middle of a sentence. The result is cleaner and more cohesive data for processing.

In [4]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=200)
splits = text_splitter.split_documents(page)
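
To see how chunk size and overlap interact, here is a deliberately simplified fixed-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers natural separators); the function name `sliding_chunks` and the tiny sizes are assumptions chosen to make the overlap visible.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Simplified fixed-window chunker (NOT the recursive splitter's algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, which is the same effect the 200-character overlap has on the real chunks.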


### **1.3 Creating Semantic Embeddings**
Once we have our text chunks, it's time to convert them into **embeddings**, numerical representations of text.



Here's what's happening:  
- **Embeddings**: The `GoogleGenerativeAIEmbeddings` model encodes text into high-dimensional vectors that capture semantic meaning.  
- **Vector Store**: Chroma stores these embeddings efficiently for quick retrieval.  
- **Retriever**: Acts as a bridge, finding the most relevant chunks for any given query. 

In [5]:
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=GoogleGenerativeAIEmbeddings(model="models/embedding-001"))
retriever = vectorstore.as_retriever()
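
To give a feel for what "finding the most relevant chunks" means, the sketch below ranks a few hand-made vectors by cosine similarity. Everything in it is an assumption for illustration: the three-dimensional vectors, the `toy_store` dictionary, and the choice of cosine similarity. Chroma's actual index and distance metric are configurable and may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings"; real models produce far more dimensions
toy_store = {
    "chunk about novels": [0.9, 0.1, 0.0],
    "chunk about prizes": [0.1, 0.9, 0.1],
    "chunk about cooking": [0.0, 0.1, 0.9],
}
query_vector = [0.2, 0.8, 0.0]  # pretend this encodes a question about awards


def top_k(query: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda text: cosine_similarity(query, store[text]),
        reverse=True,
    )
    return ranked[:k]


print(top_k(query_vector, toy_store))
```

A retriever does the same kind of ranking, except that both the chunks and the query are embedded by the model rather than written by hand.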


## **Step 2: Retrieval and Text Generation**

### **2.1 Loading a Prompt**
The system needs a blueprint to guide how it generates responses. That's where a **prompt** comes in.  



This prompt serves as a structured template, helping the language model generate coherent and useful answers.

In [6]:
# Pre-built template available in the LangChain Hub
prompt = hub.pull("rlm/rag-prompt")
pprint.pprint(prompt[0].prompt.template)


('You are an assistant for question-answering tasks. Use the following pieces '
 "of retrieved context to answer the question. If you don't know the answer, "
 "just say that you don't know. Use three sentences maximum and keep the "
 'answer concise.\n'
 'Question: {question} \n'
 'Context: {context} \n'
 'Answer:')
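
The two placeholders, `{question}` and `{context}`, are filled in at query time. As a minimal sketch of that substitution, the snippet below copies the template text printed above and fills it with plain `str.format`. The question and context strings are hypothetical, and the real `prompt` object produces chat messages rather than a bare string.

```python
template = (
    "You are an assistant for question-answering tasks. Use the following pieces "
    "of retrieved context to answer the question. If you don't know the answer, "
    "just say that you don't know. Use three sentences maximum and keep the "
    "answer concise.\n"
    "Question: {question} \n"
    "Context: {context} \n"
    "Answer:"
)

filled = template.format(
    question="Who wrote the report?",          # hypothetical user question
    context="The report was written by Ana.",  # hypothetical retrieved chunk
)
print(filled)
```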


### **2.2 Configuring the Language Model**
Next, we initialize our Large Language Model (LLM). In this example, we use **ChatOllama**, LangChain's wrapper for models served locally by Ollama, to run `deepseek-r1:14b`, a 14-billion-parameter model.  



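This model is good at understanding context and generating human-like responses, making it a reasonable fit for RAG pipelines. As a reasoning model, it writes out its thinking inside `<think>` tags before the final answer, which is visible in the output further down.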

In [7]:
# I am using a local (offline) model served by Ollama
llm = ChatOllama(model="deepseek-r1:14b")


### **2.3 Formatting Retrieved Documents**
Before feeding documents to the LLM, we join the retrieved chunks into a single block of text.  



This function concatenates the text of each chunk, separated by blank lines, into the single string that fills the prompt's `{context}` slot.


In [8]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
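
A quick way to check what `format_docs` returns is to call it on a couple of hand-made objects. The `FakeDoc` class below is a hypothetical stand-in that only mimics the `page_content` attribute of a LangChain document.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document; only page_content is used."""
    page_content: str


def format_docs(docs):
    # Same function as above: join chunk texts with a blank line between them
    return "\n\n".join(doc.page_content for doc in docs)


docs = [FakeDoc("First retrieved chunk."), FakeDoc("Second retrieved chunk.")]
context = format_docs(docs)
print(context)
```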

### **2.4 Building the RAG Chain**
Finally, we bring all the components together into a seamless **RAG chain**.

Here's how it works:  
1. **Retriever**: Fetches the most relevant chunks.  
2. **Formatting**: Combines chunks into a clean context.  
3. **Prompt**: Structures the input for the LLM.  
4. **LLM**: Generates an insightful response.  
5. **Output Parsing**: Extracts the text of the model's reply as a plain string.  


In [9]:
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
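
In the chain, the `|` operator passes each stage's output to the next, and the leading dictionary sends the same input to both of its branches. The plain-Python sketch below mirrors that data flow with hypothetical stand-ins (`fake_retriever`, `fake_prompt`, `fake_llm`, and so on). It illustrates the wiring only and does not call LangChain or a model.

```python
def fake_retriever(question: str) -> list[str]:
    # Hypothetical retriever: returns canned chunks regardless of the question
    return ["Chunk A.", "Chunk B."]


def join_chunks(chunks: list[str]) -> str:
    # Plays the role of format_docs
    return "\n\n".join(chunks)


def fake_prompt(inputs: dict) -> str:
    # Plays the role of the prompt template
    return f"Question: {inputs['question']}\nContext: {inputs['context']}\nAnswer:"


def fake_llm(prompt_text: str) -> dict:
    # Hypothetical model: wraps a fixed reply in a message-like dict
    return {"content": "A canned answer.", "prompt_seen": prompt_text}


def parse_output(message: dict) -> str:
    # Plays the role of StrOutputParser: keep only the reply text
    return message["content"]


def run_chain(question: str) -> str:
    # The dict stage feeds the same question to both branches
    inputs = {
        "context": join_chunks(fake_retriever(question)),  # retriever | format_docs
        "question": question,                              # RunnablePassthrough()
    }
    # Each remaining stage consumes the previous stage's output
    return parse_output(fake_llm(fake_prompt(inputs)))


print(run_chain("What is in the document?"))
```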

## **How It All Comes Together**

Imagine you have a PDF about pharmaceutical regulations, and you need answers to specific questions:  
1. **Input your query** into the RAG pipeline.  
2. **Retrieve relevant chunks** from the document.  
3. **Feed the context** and query to the LLM.  
4. **Generate a detailed response** based on your document.  

This pipeline transforms a static PDF into a dynamic, interactive knowledge source!


In [10]:
# Question
response=rag_chain.invoke("what are Garcia Marquez masterpieces and did he get any prize")
pprint.pprint(response)

('<think>\n'
 "Okay, so the user is asking about Gabriel García Márquez's masterpieces and "
 'whether he won any prizes. Let me start by looking through the context '
 'provided.\n'
 '\n'
 'First, I see that the context mentions he was a novelist, short-story '
 'writer, and journalist. It also states that he is considered one of the '
 'greatest Latin American masters of narrative. That\'s a good sign he has some '
 'notable works.\n'
 '\n'
 'Looking further down, it explicitly says his two masterpieces are "One '
 'Hundred Years of Solitude" from 1967 and "Love in the Time of Cholera" from '
 '1985. So that answers the first part about his masterpieces.\n'
 '\n'
 'Then, the context mentions he won the Nobel Prize in Literature in 1982. '
 'That directly answers the second part of the question regarding any prizes '
 'he received.\n'
 '\n'
 'I should make sure to keep the answer concise, using three sentences maximum '
 'as instructed. I need to include both the masterpieces and m
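
The response above opens with the model's `<think>` block rather than the answer itself. If you only want the final answer, one option is to strip that block with a regular expression. This is a minimal sketch: the helper name `strip_think` and the sample string are hypothetical, and it assumes the model closes the block with `</think>`.

```python
import re


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks and any surrounding whitespace."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


# Hypothetical model output in the same shape as the response above
raw = "<think>\nLet me check the context first.\n</think>\n\nThe answer is in section 2."
print(strip_think(raw))
```

Because the function takes and returns a string, it can be applied directly to the `response` returned by `rag_chain.invoke(...)`.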

## **Applications**
- **Healthcare**: Extract insights from medical research papers.  
- **Legal**: Quickly answer questions based on contracts or case files.  
- **Customer Support**: Build smarter chatbots with real-time, document-based answers.  


## **Final Thoughts**
By combining document retrieval, embedding storage, and LLMs, this RAG pipeline lets you ask questions of complex documents in plain language and get answers grounded in their content. Whether you're in pharma, legal, or tech, this system simplifies the process of turning data into actionable knowledge.

With minimal code and maximum flexibility, building your own RAG system has never been easier. So, what are you waiting for? Dive into your documents and start generating insights today! 🚀