# 5.0 RAG over Documents for Reliable AI Responses

In this workshop, we will walk through the steps of building a retrieval-augmented generation (RAG) agent using LangChain and LangGraph. The agent will be able to answer user queries based on multiple documents (PDFs in this case) using text extraction, chunking, vector storage, and an LLM-based generation process.


### Step 1: Importing Required Libraries

In this step, we import essential libraries for **PDF text extraction, embeddings, vector storage, and AI-driven processing**:

- **`fitz` (PyMuPDF)**: Extracts text from PDFs.
- **`OpenAIEmbeddings` and `Chroma`**: Convert text chunks into embeddings and store them for similarity search.
- **`RecursiveCharacterTextSplitter`**: Splits text into overlapping chunks small enough to embed and retrieve.
- **`Document`**: Wraps each chunk as a LangChain document object.
- **`OpenAI`**: Interfaces with OpenAI’s language models to generate the answer.
- **`StateGraph`**: Defines the nodes and edges of the LangGraph workflow.
- **`TypedDict`, `List`**: Type hints used to declare the shape of the workflow state.

This step prepares the foundation for **processing documents, storing vector embeddings, and enabling AI-driven text analysis** in our agent.


In [1]:
import fitz  # PyMuPDF for PDF text extraction
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_openai import OpenAI
from langgraph.graph import StateGraph
from typing import TypedDict, List



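### Step 2: Initialize the LLM

Here we create the LLM client that will generate the final answer. Called with no arguments, `OpenAI()` uses the library's default completion model.
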
In [2]:
# Initialize LLM
llm = OpenAI()
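
`OpenAI()` called without an explicit key reads it from the `OPENAI_API_KEY` environment variable, so a missing variable only surfaces as an error when the client is created or first used. A minimal sketch of a pre-flight check follows; `require_env` is a hypothetical helper introduced here for illustration and is not part of LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for this workshop, not a LangChain function.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Report the setup status without ever printing the key itself.
try:
    require_env("OPENAI_API_KEY")
    print("OPENAI_API_KEY found")
except RuntimeError as err:
    print(f"Setup problem: {err}")
```

Running this cell before `llm = OpenAI()` gives a readable message instead of a stack trace from inside the client library.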

### Step 3: Extract, Chunk, and Index Text from Multiple PDFs
In this step, we create a function that extracts text from multiple PDF documents. The function `extract_text_from_pdfs` accepts a list of PDF file paths and extracts the text content of each PDF using the PyMuPDF library (imported as `fitz`).

The function works by:

1. Iterating over each PDF in the list of file paths.
2. Opening the PDF with `fitz.open(pdf_path)`.
3. Extracting the text of each page with `page.get_text("text")`.
4. Joining the text of all pages of a document into a single string.
5. Collecting one string per document into a list, which is returned as the final output.

The same cell then splits each document's text into 500-character chunks with a 100-character overlap, wraps each chunk in a `Document`, and embeds the chunks into a Chroma vector store.

In [3]:
# ✅ Extract text from multiple PDFs
def extract_text_from_pdfs(pdf_paths):
    """Extracts text from multiple PDFs and returns a list of documents."""
    all_text = []
    for pdf_path in pdf_paths:
        # Open with a context manager so the file handle is released after reading
        with fitz.open(pdf_path) as doc:
            text = "\n".join(page.get_text("text") for page in doc)
        all_text.append(text)
    return all_text

# Example: List of PDFs to process
pdf_files = ["./data_source/ietf-srv6.pdf", "./data_source/SRv6-Mig-BP.pdf"]  # Add your PDF paths
documents_text = extract_text_from_pdfs(pdf_files)

# ✅ Split text into smaller chunks for better retrieval
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
all_chunks = []
for text in documents_text:
    all_chunks.extend(text_splitter.split_text(text))

# ✅ Convert chunks into Document objects
documents = [Document(page_content=chunk) for chunk in all_chunks]

# ✅ Initialize vector database
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)
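
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping windows over a string. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph breaks, newlines, and spaces); `sliding_window_chunks` is a hypothetical function that only illustrates why consecutive chunks share text.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.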

### Step 4: Define State for LangGraph

In this step, we define a **state class** to manage and store data within **LangGraph** for a **Retrieval-Augmented Generation (RAG) process**. This state will track key elements, ensuring the agent has structured information for processing queries and generating responses.

#### Why Define a State?
LangGraph requires a way to store and update data as the agent moves through different steps. By using Python’s **TypedDict**, we create a structured state with predefined keys and their expected data types.

In [4]:
# ✅ Define state for LangGraph
class RAGState(TypedDict):
    query: str
    documents: List[Document]
    response: str  # Holds the final answer
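
The hand-off of state between steps can be imitated in plain Python. The sketch below is an assumption-laden stand-in: `DemoState`, `fake_retrieve`, and `fake_generate` are hypothetical names, strings replace `Document` objects, and a simple loop replaces LangGraph's own scheduling.

```python
from typing import List, TypedDict


class DemoState(TypedDict):
    query: str
    documents: List[str]  # plain strings stand in for Document objects
    response: str


def fake_retrieve(state: DemoState) -> DemoState:
    # Stand-in for a vector store lookup: fills in the "documents" key
    return {**state, "documents": [f"passage about {state['query']}"]}


def fake_generate(state: DemoState) -> DemoState:
    # Stand-in for an LLM call: fills in the "response" key
    return {**state, "response": f"Answer built from {len(state['documents'])} passage(s)"}


state: DemoState = {"query": "SRv6", "documents": [], "response": ""}
for node in (fake_retrieve, fake_generate):
    state = node(state)  # each step receives the state and returns an updated copy

print(state["response"])  # Answer built from 1 passage(s)
```

Each step reads the keys it needs and writes the keys it owns, which is the same contract the real nodes follow.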


### Step 4 (continued): LangGraph Workflow for RAG (Retrieval-Augmented Generation)
In this step, we create a LangGraph workflow for a Retrieval-Augmented Generation (RAG) system. The workflow involves two key functions: retrieving documents relevant to a query and generating an answer using those documents with an LLM. We also build a stateful graph that connects these two functions and compiles the process into an executable application.

In [5]:

# ✅ Retrieval function (search across both PDFs)
def retrieve_documents(state: RAGState) -> RAGState:
    docs = vectorstore.similarity_search(state["query"])
    return {"query": state["query"], "documents": docs, "response": ""}

# ✅ Answer generation function (LLM-based response)
def generate_answer(state: RAGState) -> RAGState:
    context = "\n".join([doc.page_content for doc in state["documents"]])
    prompt = f"Based on the following context, answer the question:\n\n{context}\n\nQuestion: {state['query']}"
    answer = llm.invoke(prompt)
    return {"query": state["query"], "documents": state["documents"], "response": answer}

# ✅ Build LangGraph
graph = StateGraph(RAGState)
graph.add_node("retrieval", retrieve_documents)
graph.add_node("generation", generate_answer)
graph.add_edge("retrieval", "generation")

# ✅ Define entry and finish points
graph.set_entry_point("retrieval")
graph.set_finish_point("generation")  # make the end of the run explicit

# ✅ Compile the graph
app = graph.compile()
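
Conceptually, the retrieval node ranks stored chunks by how close they are to the query and keeps the top few. The toy sketch below shows that idea with word-count vectors and cosine similarity. It is only an illustration: the real pipeline compares OpenAI embedding vectors inside Chroma, and `toy_similarity_search` plus the sample passages are made up for this example.

```python
import math
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Very rough 'embedding': lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_similarity_search(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Return the k passages most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        passages,
        key=lambda p: cosine_similarity(query_vec, bag_of_words(p)),
        reverse=True,
    )
    return ranked[:k]


passages = [
    "srv6 encodes segments as ipv6 addresses",
    "the weather today is sunny",
    "migration from sr-mpls to srv6 can be done in phases",
]
top = toy_similarity_search("how are srv6 segments encoded", passages, k=1)
print(top)  # ['srv6 encodes segments as ipv6 addresses']
```

Real embeddings capture meaning rather than exact word overlap, so they would also match "encoded" with "encodes", which this toy version does not.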





### Step 5: Defining the Prompt and Query Handling

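In this step, we define a **prompt** to guide the agent in answering queries related to **SRv6, micro-SID, and SRv6 migrations from SR-MPLS**, then read a query from the user and run the compiled graph. The prompt is intended to keep answers grounded in the provided documents. Note that `RAGState` declares no `prompt` key and `generate_answer` does not read one, so as written only the instruction hard-coded inside `generate_answer` reaches the LLM.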

In [6]:
# Define the prompt for the agent
agent_prompt = "You are an expert on SRv6, its micro-SID and SRv6 migrations from SR-MPLS. Please answer the query based on the provided documents."

# Read the user query interactively
user_query = input("What would you like to know about SRv6, SRv6-uSID or SRv6 migration best practices? ")

# Invoke the agent with the user query
response = app.invoke({
    "query": user_query,
    "documents": [],  # Filled in by the retrieval node
    "response": "",
    "prompt": agent_prompt  # Not declared in RAGState and not read by generate_answer
})

# Print the response from the agent
print(response["response"])


Answer: The SRv6 microSID is a 128-bit value used for routing to a specific node responsible for performing a specific function in an SRv6 network. It is represented as an IPv6 address and consists of three parts: the locator, uSID block, and set ID and node ID. The uSID block is the portion of the SRv6 microSID that is used for identifying the specific node responsible for performing the function. It is allocated from a block specifically designated for service plane addresses in an SRv6 network.
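
For the agent prompt to influence the answer, the state has to carry it and the generation step has to read it. The sketch below shows one possible way, under stated assumptions: `PromptedState`, `build_prompt`, `generate_with_prompt`, and `echo_llm` are hypothetical names, strings stand in for `Document` objects, and a stub replaces the real LLM call so the example runs offline.

```python
from typing import Callable, List, TypedDict


class PromptedState(TypedDict):
    query: str
    documents: List[str]  # plain strings stand in for Document objects
    response: str
    prompt: str           # instruction carried alongside the query


def build_prompt(state: PromptedState) -> str:
    """Combine the instruction, the retrieved context, and the question."""
    context = "\n".join(state["documents"])
    return (
        f"{state['prompt']}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {state['query']}"
    )


def generate_with_prompt(state: PromptedState, llm_call: Callable[[str], str]) -> PromptedState:
    # llm_call is any function that maps a prompt string to an answer string
    return {**state, "response": llm_call(build_prompt(state))}


def echo_llm(prompt: str) -> str:
    # Stub LLM: reports the first line it received
    return "received: " + prompt.splitlines()[0]


demo_state: PromptedState = {
    "query": "What is a uSID?",
    "documents": ["chunk one", "chunk two"],
    "response": "",
    "prompt": "You are an expert on SRv6.",
}
result = generate_with_prompt(demo_state, echo_llm)
print(result["response"])  # received: You are an expert on SRv6.
```

In the actual graph a node receives only the state, so the LLM would be bound in advance, for example with a closure or `functools.partial`, and `RAGState` would need the extra `prompt` key.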
