# Agentic RAG pipeline with NeMo Retriever and LLM NIMs

## Overview

Retrieval-augmented generation (RAG) has proven to be an effective strategy for keeping large language model (LLM) responses up to date and reducing hallucinations.

Various retrieval strategies have been proposed to improve the recall of documents for generation, and there is no one-size-fits-all approach. The right strategy (for example, chunk size, number of documents returned, or semantic search versus graph retrieval) depends on your data. Although retrieval strategies differ, it is becoming more common in modern RAG systems to build an agentic framework on top of the retrieval system that handles reasoning, decision-making, and reflection on the retrieved data. An agent can be described as a system that uses an LLM to reason through a problem, create a plan to solve it, and execute the plan with the help of a set of tools. For example, LLMs are notoriously bad at solving math problems. Giving an LLM a calculator “tool” that it can use for arithmetic while it reasons through a larger problem, such as calculating the year-over-year (YoY) increase in a company’s revenue, can be described as an agentic workflow.
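
The calculator example can be sketched in a few lines of plain Python. This is only an illustration of the tool-calling idea: the `TOOLS` registry, the `run_tool_call` helper, and the shape of the tool-call dictionary are hypothetical, and the call itself is hard-coded where a real agent would have the LLM emit it.

```python
from typing import Callable, Dict


def yoy_growth(current: float, previous: float) -> float:
    """Return the year-over-year growth as a percentage."""
    if previous == 0:
        raise ValueError("previous-year value must be non-zero")
    return (current - previous) / previous * 100.0


# Hypothetical tool registry: the agent selects a tool by name.
TOOLS: Dict[str, Callable[..., float]] = {"yoy_growth": yoy_growth}


def run_tool_call(tool_call: dict) -> float:
    """Execute a tool call of the (assumed) form {"name": ..., "args": {...}}."""
    tool = TOOLS[tool_call["name"]]
    return tool(**tool_call["args"])


# In a real agent the LLM would produce this structure; here it is hard-coded.
call = {"name": "yoy_growth", "args": {"current": 120.0, "previous": 100.0}}
result = run_tool_call(call)
print(f"{result:.1f}%")  # 20.0%
```

The LLM only has to decide which tool to call and with which arguments; the arithmetic itself is done by ordinary code.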

As generative AI systems start transitioning towards entities capable of performing "agentic" tasks, we need robust models that have been trained on the ability to break down tasks, act as central planners, and have multi-step reasoning capabilities with model and system-level safety checks. With the Llama 3.1 family, Meta is launching a suite of LLMs spanning  8B, 70B, and 405B parameters with these tool-calling capabilities for agentic workloads. NVIDIA has partnered with Meta to make sure the latest Llama models can be deployed optimally through NVIDIA NIMs.

Further, with the general availability of the NVIDIA NeMo Retriever collection of NIM microservices, enterprises have access to scalable software to customize their data-dependent RAG pipelines. The NeMo Retriever NIMs can be plugged into existing RAG pipelines and interface with open source LLM frameworks like LangChain or LlamaIndex, so you can easily integrate retriever models into generative AI applications.


### Set up the environment

First, let's install a few packages for interfacing with NVIDIA embedding, reranking, and LLM models, and with vector databases.

Install the following system dependencies if they are not already available on your system, for example with ```brew install``` on macOS. Depending on what document types you're parsing, you may not need all of these.
* poppler-utils (images and PDFs)
* tesseract-ocr (images and PDFs)
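
A quick way to check whether these dependencies are already present is to look for their command-line binaries on the `PATH`. The sketch below assumes that `pdftoppm` is installed by poppler-utils and `tesseract` by tesseract-ocr; the `find_system_deps` helper is hypothetical and not part of any library used in this notebook.

```python
import shutil
from typing import Dict, Optional

# Assumed binary names: pdftoppm ships with poppler-utils, tesseract with tesseract-ocr.
REQUIRED_BINARIES = {"poppler-utils": "pdftoppm", "tesseract-ocr": "tesseract"}


def find_system_deps(binaries: Dict[str, str] = REQUIRED_BINARIES) -> Dict[str, Optional[str]]:
    """Map each package name to the path of its binary, or None if it is not on PATH."""
    return {package: shutil.which(executable) for package, executable in binaries.items()}


for package, path in find_system_deps().items():
    print(f"{package}: {path or 'NOT FOUND'}")
```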

In [None]:
!pip install -U langchain_community "unstructured[all-docs]" langchain-nvidia-ai-endpoints langchain-openai langchainhub faiss-gpu langchain langgraph pandas rank_bm25

### NeMo Retriever NIMs

NeMo Retriever microservices can be used for embedding and reranking. These microservices can be deployed within the enterprise locally, and are packaged together with <a href="https://developer.nvidia.com/triton-inference-server">NVIDIA Triton Inference Server</a> and <a href="https://developer.nvidia.com/tensorrt">NVIDIA TensorRT</a> for optimized inference of text for embedding and reranking.  Additional enterprise benefits include:

**Scalable deployment**: Whether you're catering to a few users or millions, NeMo Retriever embedding and reranking NIMs can be scaled seamlessly to meet your demands.

**Flexible integration**: Easily incorporate NeMo Retriever embedding and reranking NIMs into existing workflows and applications, thanks to the OpenAI-compatible API endpoints, and deploy anywhere your data resides.

**Secure processing**: Your data privacy is paramount. NeMo Retriever embedding and reranking NIMs ensure that all inferences are processed securely, with rigorous data protection measures in place.

NeMo Retriever <a href="https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/overview.html">embedding</a> and <a href="https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/overview.html">reranking</a> NIM microservices are available today. Developers can download and deploy the Docker containers locally.

#### Access the Llama 3.1 405B model

The new Llama 3.1 set of models can be seen as the first big push of open-source models towards serious agentic capabilities. These models can now become part of a larger automation system, with LLMs doing the planning and picking the right tools to solve a larger problem. Since NVIDIA Llama 3.1 NIMs have the necessary support for OpenAI-style tool calling, libraries like LangChain can now be used with NIMs to bind LLMs to Pydantic classes and fill in objects/dictionaries. This combination makes it easier for developers to get structured outputs from NIM LLMs without having to resort to regex parsing. You can access Llama 3.1 405B at ai.nvidia.com. Follow <a href="https://nvidia.github.io/GenerativeAIExamples/latest/api-catalog.html#get-an-api-key-for-the-accessing-models-on-the-api-catalog">these</a> instructions to generate the API key.


### Architecture

Retrieving passages or documents within a RAG pipeline without further validation and self-reflection can often result in unhelpful responses and factual inaccuracies. Additionally, since the models aren't explicitly trained to follow facts from passages, post-generation verification is necessary.

Multi-agent frameworks, like LangGraph, enable developers to group LLM application-level logic into nodes and edges, for finer levels of control over agentic decision-making. LangGraph with NVIDIA LangChain OSS connectors can be used for embedding, reranking, and implementing the necessary agentic RAG techniques with LLMs (as discussed previously). 

To implement this, an application developer must add finer-grained decision-making on top of their RAG pipeline. The figure below shows one of many possible renditions of a router node, depending on the use case. Here, the router decides whether to rewrite the query with the help of an LLM, in the hope of better recall from the retriever.

![Agentic RAG architecture](./imgs/agentic_rag.png "Agentic RAG architecture")

**Query decomposer**: Breaks down the question into multiple smaller logical questions, and is helpful when a question needs to be answered using chunks from multiple documents.

**Router**: Decides if chunks need to be retrieved from the local retriever to answer the given question based on the relevancy of documents stored locally. Alternatively, the agent can be programmed to do a web search or simply answer with an ‘I don't know.’

**Retriever**: This is the internal implementation of the RAG pipeline. For example, a hybrid retriever that combines semantic search and keyword search.

**Grader**: Checks if the retrieved passages/chunks are relevant to the question at hand.

**Hallucination checker**: Checks if the LLM generation from each chunk is grounded in that chunk. Post-generation verification is necessary since the models are not explicitly trained to follow facts from passages.
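
How these components fit together can be sketched as a plain Python control loop. This is a simplified illustration under stated assumptions, not the LangGraph implementation used in this notebook: every component is passed in as a stub callable, the router is reduced to an "I don't know" fallback, and the `agentic_rag` function and its `max_attempts` budget are hypothetical.

```python
from typing import Callable, List


def agentic_rag(
    question: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    is_relevant: Callable[[str, str], bool],
    generate: Callable[[str, List[str]], str],
    is_grounded: Callable[[str, List[str]], bool],
    rewrite: Callable[[str], str],
    max_attempts: int = 3,
) -> str:
    """Decompose, retrieve, grade, generate, and verify, retrying a bounded number of times."""
    for _ in range(max_attempts):
        # Query decomposer + retriever: gather chunks for every sub-question.
        chunks = [chunk for sub in decompose(question) for chunk in retrieve(sub)]
        # Grader: keep only the chunks that are relevant to the question.
        relevant = [chunk for chunk in chunks if is_relevant(question, chunk)]
        if not relevant:
            question = rewrite(question)  # nothing useful found, rephrase and retry
            continue
        answer = generate(question, relevant)
        # Hallucination checker: accept only answers supported by the chunks.
        if is_grounded(answer, relevant):
            return answer
    return "I don't know."


# Toy stubs standing in for the LLM- and retriever-backed components.
corpus = ["TURBT is a bladder tumour procedure.", "URS is a ureteroscopy procedure."]
answer = agentic_rag(
    "What is TURBT?",
    decompose=lambda q: [q],
    retrieve=lambda q: corpus,
    is_relevant=lambda q, chunk: "TURBT" in chunk,
    generate=lambda q, docs: docs[0],
    is_grounded=lambda a, docs: a in docs,
    rewrite=lambda q: q,
)
print(answer)  # TURBT is a bladder tumour procedure.
```

The attempt budget matters because both the rewrite path and the regenerate path form cycles, and without a limit a question with no good answer would loop forever.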



### Download the dataset
Let's download the NIH clinical studies dataset from the Docugami repository.

In [None]:
!wget https://raw.githubusercontent.com/docugami/KG-RAG-datasets/main/nih-clinical-trial-protocols/download.csv
!wget https://raw.githubusercontent.com/docugami/KG-RAG-datasets/main/nih-clinical-trial-protocols/download.py
!python download.py

#### Step-1: Load and chunk the dataset

Use LangChain document loaders to load all the PDF files in the created directory and split them into chunks of 500 characters each, with a 100-character overlap.
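
To make the meaning of chunk size and overlap concrete, here is a minimal fixed-window chunker in plain Python. It is a deliberately simpler technique than `RecursiveCharacterTextSplitter`, which also tries to split on separators such as paragraph and line breaks; the `sliding_window_chunks` helper is hypothetical and only illustrates how consecutive chunks share their boundary characters.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> List[str]:
    """Split text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = "".join(chr(65 + i % 26) for i in range(1200))
chunks = sliding_window_chunks(sample)
print([len(chunk) for chunk in chunks])  # [500, 500, 400]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some text twice.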

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from langchain_nvidia_ai_endpoints import ChatNVIDIA

loader = DirectoryLoader('./docs', glob="**/*.pdf")
docs = loader.load()

In [None]:
docs[0].page_content

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=100
)
doc_splits = text_splitter.split_documents(docs)

### Step-2: Initialize the Embedding, Reranking and LLM connectors

#### Embedding and Reranking NIM
Use the NVIDIA OSS connectors for LangChain to initialize the embedding, reranking, and LLM models, after setting up the embedding and reranking NIMs locally using the instructions <a href="https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/overview.html">here</a> and <a href="https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/overview.html">here</a>. Point the ```base_url``` below to the IP address of your local machine.

#### Llama 3.1 405B LLM
The latest Llama 3.1 405B model is hosted on ai.nvidia.com. Use the instructions <a href="https://nvidia.github.io/GenerativeAIExamples/latest/api-catalog.html#get-an-api-key-for-the-accessing-models-on-the-api-catalog">here</a> to obtain the API key for access.
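
Rather than pasting the key into the notebook, one option is to read it from an environment variable. The sketch below assumes the key is stored in `NVIDIA_API_KEY`; the `resolve_api_key` helper is hypothetical, and its return value could be passed as `api_key` when constructing the LLM connector.

```python
import os
from typing import Mapping


def resolve_api_key(env: Mapping[str, str] = os.environ, var: str = "NVIDIA_API_KEY") -> str:
    """Return the API key stored in the given environment variable, or raise if it is missing."""
    key = env.get(var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var} environment variable before running the notebook")
    return key


# Example with an explicit mapping, so this cell runs without a real key:
print(resolve_api_key({"NVIDIA_API_KEY": "  nvapi-example  "}))  # nvapi-example
```

This keeps the secret out of the saved notebook and its version history.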

In [None]:
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings, NVIDIARerank
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Connect to the embedding NIM deployed locally (port 8000 in the URL below)
embeddings = NVIDIAEmbeddings(
    base_url="http://<REPLACE_WITH_LOCAL_MACHINE_IP>:8000/v1", 
    model="nvidia/nv-embedqa-e5-v5",
    truncate="END"
)

# Connect to the reranking NIM; if both NIMs run on the same host, each needs its own port
reranker = NVIDIARerank(
    base_url="http://<REPLACE_WITH_LOCAL_MACHINE_IP>:8000/v1", 
    model="nvidia/nv-rerankqa-mistral-4b-v3",
    truncate="END"
)

llm = ChatOpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="<REPLACE_WITH_GENERATED_API_KEY>",
    model="meta/llama-3.1-405b-instruct"
)

#### Step-3: Create a hybrid search retriever

Load the documents into a keyword search store and a semantic search FAISS vector database. We create a weighted hybrid of keyword and semantic search for better retrieval recall, and give a higher weight to the keyword search retriever because of the domain-specific medical jargon.
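
To build intuition for what the weights do, the sketch below fuses two ranked lists with weighted reciprocal rank fusion (RRF), which to my understanding is the approach `EnsembleRetriever` takes. The `weighted_rrf` helper, the document IDs, and the smoothing constant `c=60` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(rankings: Sequence[List[str]], weights: Sequence[float], c: int = 60) -> List[str]:
    """Fuse ranked lists: each document scores weight / (rank + c) per list, summed across lists."""
    if len(rankings) != len(weights):
        raise ValueError("rankings and weights must have the same length")
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["a", "b", "c"]   # keyword search results, best first
faiss_ranking = ["c", "a"]       # semantic search results, best first
fused = weighted_rrf([bm25_ranking, faiss_ranking], weights=[0.7, 0.3])
print(fused)  # ['a', 'c', 'b']
```

Document `a` wins because it ranks first in the more heavily weighted keyword list and second in the semantic list, while `b` appears in only one list.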

In [None]:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS

bm25_retriever = BM25Retriever.from_documents(doc_splits)
faiss_vectorstore = FAISS.from_documents(doc_splits, embeddings)

faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})

hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.7, 0.3]
)

In [None]:
question = "How does Get It Right First Time (GIRFT) Urology programme relate to TURBT and URS?"

#### Step-4: Query decomposition with structured generation

Here we use the tool-calling capability of the Llama 3.1 NIMs to split the initial query into sub-queries.
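
Conceptually, structured generation means the model returns JSON arguments that are validated into a typed object instead of free text that has to be parsed with regexes. The standard-library sketch below illustrates that idea only. `SubQuerySketch` and `parse_sub_queries` are hypothetical stand-ins, and the JSON string is a hand-written sample, not real model output.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class SubQuerySketch:
    """Typed container for the sub-questions returned by the model."""
    questions: List[str]


def parse_sub_queries(tool_call_arguments: str) -> SubQuerySketch:
    """Validate JSON tool-call arguments and convert them into a typed object."""
    payload = json.loads(tool_call_arguments)
    questions = payload.get("questions") if isinstance(payload, dict) else None
    if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
        raise ValueError("expected a JSON object with a 'questions' list of strings")
    return SubQuerySketch(questions=questions)


# Hand-written sample arguments (an assumption, not actual LLM output).
raw = '{"questions": ["What is the GIRFT Urology programme?", "What is TURBT?", "What is URS?"]}'
parsed = parse_sub_queries(raw)
print(parsed.questions)
```

Downstream code can then iterate over `parsed.questions` directly, and malformed output fails loudly at the validation step rather than silently producing bad sub-queries.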

In [None]:
from typing import Literal, Optional, Tuple, List
from langchain_core.pydantic_v1 import BaseModel, Field

class SubQuery(BaseModel):
    """Given a user question, break it down into distinct sub questions that \
    you need to answer in order to answer the original question."""

    questions: List[str] = Field(description="The list of sub questions")

sub_question_generator = llm.with_structured_output(SubQuery)
sub_question_generator.invoke(question)

#### Step-5: Create a simple RAG chain with hybrid retriever

In [None]:
from langchain import hub
from langchain_core.output_parsers import StrOutputParser

# Prompt
prompt = hub.pull("rlm/rag-prompt")

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Chain
rag_chain = prompt | llm | StrOutputParser()

# Run
docs = hybrid_retriever.get_relevant_documents(question)
generation = rag_chain.invoke({"context": format_docs(docs), "question": question})
print(generation)

#### Step-6: Create a Retrieval grader with structured generation

Checks if the retrieved passages/chunks are relevant to the question at hand.
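
The graph code later in this notebook compares the grader's `binary_score` with the literal string `"yes"`. A model may occasionally answer with different casing or trailing punctuation, so a defensive comparison can help. The `is_yes` helper below is a hypothetical sketch of such a normalization and is not used by the notebook itself.

```python
def is_yes(binary_score: str) -> bool:
    """Return True when the grader's answer is 'yes', ignoring case, quotes, and trailing periods."""
    normalized = binary_score.strip().strip("'\".").lower()
    return normalized == "yes"


for raw_score in ["yes", " Yes.", "'YES'", "no", "yes, mostly"]:
    print(repr(raw_score), "->", is_yes(raw_score))
```

Anything other than an unambiguous "yes" is treated as a rejection, which errs on the side of filtering out a document.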

In [None]:
### Retrieval Grader

# Data model
class GradeDocuments(BaseModel):
    """Binary score for relevance check on retrieved documents."""

    binary_score: str = Field(
        description="Documents are relevant to the question, 'yes' or 'no'"
    )


# LLM with function call

retrieval_grader = llm.with_structured_output(GradeDocuments)

# Prompt
system = """You are a grader assessing relevance of a retrieved document to a user question. \n 
    It does not need to be a stringent test. The goal is to filter out erroneous retrievals. \n
    If the document contains keyword(s) or semantic meaning related to the user question, grade it as relevant. \n
    Give a binary score, 'yes' or 'no', to indicate whether the document is relevant to the question."""

grade_prompt = ChatPromptTemplate.from_messages(
    [
     
        ("system", system),
        ("human", "Retrieved document: \n\n {document} \n\n User question: {question}"),
    ]
)

retrieval_grader = grade_prompt | retrieval_grader
docs = hybrid_retriever.get_relevant_documents(question)
doc_txt = docs[1].page_content
print(retrieval_grader.invoke({"question": question, "document": doc_txt}))

#### Step-7: Create a hallucination checker with structured generation
Checks if the LLM generation from each chunk is relevant to the chunk.  Post-generation verification is necessary since the models are not explicitly trained to follow facts from passages.

In [None]:
### Hallucination Grader

# Data model
class GradeHallucinations(BaseModel):
    """Binary score for hallucination present in generation answer."""

    binary_score: str = Field(
        description="Answer is grounded in the facts, 'yes' or 'no'"
    )


hallucination_grader = llm.with_structured_output(GradeHallucinations)

# Prompt
system = """You are a grader assessing whether an LLM generation is grounded in / supported by a set of retrieved facts. \n 
     Give a binary score 'yes' or 'no'. 'Yes' means that the answer is grounded in / supported by the set of facts."""
hallucination_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "Set of facts: \n\n {documents} \n\n LLM generation: {generation}"),
    ]
)

hallucination_grader = hallucination_prompt | hallucination_grader
hallucination_grader.invoke({"documents": format_docs(docs), "generation": generation})

#### Step-7: Create an answer grader with structured generation
Checks if the final answer resolves the supplied question.

In [None]:
### Answer Grader

# Data model
class GradeAnswer(BaseModel):
    """Binary score to assess answer addresses question."""

    binary_score: str = Field(
        description="Answer addresses the question, 'yes' or 'no'"
    )


generation_grader = llm.with_structured_output(GradeAnswer)

# Prompt
system = """You are a grader assessing whether an answer addresses / resolves a question \n 
     Give a binary score 'yes' or 'no'. 'Yes' means that the answer resolves the question."""
answer_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "User question: \n\n {question} \n\n LLM generation: {generation}"),
    ]
)

answer_grader = answer_prompt | generation_grader
answer_grader.invoke({"question": question, "generation": generation})

#### Step-8: Question rewriting
If none of the retrieved documents are relevant to the given question, we ask the LLM to rewrite the question for easier retrieval.

In [None]:
### Question Re-writer

# Prompt
system = """You are a question re-writer that converts an input question to a better version that is optimized \n 
     for vectorstore retrieval. Look at the input and try to reason about the underlying semantic intent / meaning."""
re_write_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        (
            "human",
            "Here is the initial question: \n\n {question} \n Formulate an improved question.",
        ),
    ]
)

question_rewriter = re_write_prompt | llm | StrOutputParser()
question_rewriter.invoke({"question": question})

#### Step-9: LangGraph setup

Capture the flow as a graph. Define the graph state, a data structure shared among the nodes of the graph. Each node modifies the graph state depending on its function.
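
The idea of a shared state that each node partially updates can be sketched without any framework. In the sketch below each node returns only the keys it changes and the update is merged over the previous state, which to my understanding matches how LangGraph treats state keys that have no custom reducer. The `apply_node` helper and the toy nodes are hypothetical.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]


def apply_node(state: State, node: Callable[[State], State]) -> State:
    """Run a node and merge the keys it returns over the existing state."""
    update = node(state)
    return {**state, **update}


def toy_decompose(state: State) -> State:
    # Stand-in for query decomposition: treat the question as its only sub-question.
    return {"sub_questions": [state["question"]]}


def toy_retrieve(state: State) -> State:
    # Stand-in for retrieval: fabricate one placeholder document per sub-question.
    return {"documents": [f"doc for: {q}" for q in state["sub_questions"]]}


state: State = {"question": "What is TURBT?"}
for node in (toy_decompose, toy_retrieve):
    state = apply_node(state, node)
print(state)
```

Keys that a node does not return, such as `question` here, carry over unchanged to the next node.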

In [None]:
from typing import List

from typing_extensions import TypedDict


class GraphState(TypedDict):
    """
    Represents the state of our graph.

    Attributes:
        question: question
        sub_questions: sub-questions produced by query decomposition
        generation: LLM generation
        documents: list of documents
    """

    question: str
    sub_questions:  List[str]
    generation: str
    documents: List[str]

#### Step-10: Define the nodes as functions
Using the LangChain constructs defined above for query decomposition, grading, retrieval, hallucination checking, and so on, we can write functions that act as nodes in the multi-agent graph.

In [None]:
### Nodes

def decompose(state):
    """
    Decompose the question into sub-questions

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): New key added to state, sub_questions, that contains the decomposed sub-questions
    """
    print("---QUERY DECOMPOSITION ---")
    question = state["question"]

    # Query decomposition
    sub_queries = sub_question_generator.invoke(question)
    return {"sub_questions": sub_queries.questions, "question": question}

def retrieve(state):
    """
    Retrieve documents

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): New key added to state, documents, that contains retrieved documents
    """
    print("---RETRIEVE---")
    sub_questions = state["sub_questions"]
    question = state["question"]

    # Retrieval
    documents = []
    for sub_question in sub_questions:
        docs = hybrid_retriever.get_relevant_documents(sub_question)
        documents.extend(docs)
    return {"documents": documents, "question": question}


def rerank(state):
    """
    Rerank the retrieved documents against the question

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): Updates documents key with the reranked documents
    """
    print("---RERANK---")
    question = state["question"]
    documents = state["documents"]

    # Reranking
    documents = reranker.compress_documents(query=question, documents=documents)
    return {"documents": documents, "question": question}

In [None]:
def generate(state):
    """
    Generate answer

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): New key added to state, generation, that contains LLM generation
    """
    print("---GENERATE---")
    question = state["question"]
    documents = state["documents"]

    # RAG generation (format the documents as plain text, as in Step-5)
    generation = rag_chain.invoke({"context": format_docs(documents), "question": question})
    return {"documents": documents, "question": question, "generation": generation}


def grade_documents(state):
    """
    Determines whether the retrieved documents are relevant to the question.

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): Updates documents key with only filtered relevant documents
    """

    print("---CHECK DOCUMENT RELEVANCE TO QUESTION---")
    question = state["question"]
    documents = state["documents"]

    # Score each doc
    filtered_docs = []
    for d in documents:
        score = retrieval_grader.invoke(
            {"question": question, "document": d.page_content}
        )
        grade = score.binary_score
        if grade == "yes":
            print("---GRADE: DOCUMENT RELEVANT---")
            filtered_docs.append(d)
        else:
            print("---GRADE: DOCUMENT NOT RELEVANT---")
            continue
    return {"documents": filtered_docs, "question": question}


def transform_query(state):
    """
    Transform the query to produce a better question.

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): Updates question key with a re-phrased question
    """

    print("---TRANSFORM QUERY---")
    question = state["question"]
    documents = state["documents"]

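    # Re-write question
    better_question = question_rewriter.invoke({"question": question})
    # Also reset sub_questions so the next retrieve step uses the rewritten
    # question instead of the sub-questions derived from the original one
    return {
        "documents": documents,
        "question": better_question,
        "sub_questions": [better_question],
    }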

#### Step-11: Define graph edges 
The nodes defined above are connected to each other through functional edges, defined programmatically. Based on the graph state, an edge might pass the state information to one of several different nodes.
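
A functional edge boils down to a function that inspects the state and returns a label, plus a mapping from labels to node names. The framework-free sketch below shows that mechanism. The `route` helper, `toy_decide`, and `PATHS` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict


def route(state: dict, decide: Callable[[dict], str], path_map: Dict[str, str]) -> str:
    """Return the name of the next node for the label chosen by the decision function."""
    label = decide(state)
    if label not in path_map:
        raise KeyError(f"decision function returned an unmapped label: {label!r}")
    return path_map[label]


def toy_decide(state: dict) -> str:
    # Generate when any documents survived grading, otherwise rewrite the query.
    return "generate" if state["documents"] else "transform_query"


PATHS = {"generate": "generate", "transform_query": "transform_query"}
print(route({"documents": []}, toy_decide, PATHS))         # transform_query
print(route({"documents": ["chunk"]}, toy_decide, PATHS))  # generate
```

Keeping the label-to-node mapping separate from the decision function lets the same decision logic be reused with different target nodes.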

In [None]:
### Edges


def decide_to_generate(state):
    """
    Determines whether to generate an answer, or re-generate a question.

    Args:
        state (dict): The current graph state

    Returns:
        str: Binary decision for next node to call
    """

    print("---ASSESS GRADED DOCUMENTS---")
    filtered_documents = state["documents"]

    if not filtered_documents:
        # Every retrieved document was filtered out by the relevance check,
        # so rewrite the query and retrieve again
        print(
            "---DECISION: ALL DOCUMENTS ARE NOT RELEVANT TO QUESTION, TRANSFORM QUERY---"
        )
        return "transform_query"
    # We have relevant documents, so generate answer
    print("---DECISION: GENERATE---")
    return "generate"
    
def grade_generation_v_documents_and_question(state):
    """
    Determines whether the generation is grounded in the document and answers question.

    Args:
        state (dict): The current graph state

    Returns:
        str: Decision for next node to call
    """

    print("---CHECK HALLUCINATIONS---")
    question = state["question"]
    documents = state["documents"]
    generation = state["generation"]

    score = hallucination_grader.invoke(
        {"documents": format_docs(documents), "generation": generation}
    )
    grade = score.binary_score

    # Check hallucination
    if grade == "yes":
        print("---DECISION: GENERATION IS GROUNDED IN DOCUMENTS---")
        # Check question-answering
        print("---GRADE GENERATION vs QUESTION---")
        score = answer_grader.invoke({"question": question, "generation": generation})
        grade = score.binary_score
        if grade == "yes":
            print("---DECISION: GENERATION ADDRESSES QUESTION---")
            return "useful"
        print("---DECISION: GENERATION DOES NOT ADDRESS QUESTION---")
        return "not useful"
    print("---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY---")
    return "not supported"

#### Step-12: Build the graph

We define the rules for how the nodes connect to each other. We also use conditional edges, which route to different nodes based on the output of the functional edge.

In [None]:
from langgraph.graph import END, StateGraph, START

workflow = StateGraph(GraphState)

# Define the nodes
workflow.add_node("decompose", decompose)  # query decomposition
workflow.add_node("retrieve", retrieve)  # retrieve
workflow.add_node("rerank", rerank)  # rerank
workflow.add_node("grade_documents", grade_documents)  # grade documents
workflow.add_node("generate", generate)  # generate
workflow.add_node("transform_query", transform_query)  # transform_query

# Build graph
workflow.add_edge(START, "decompose")
workflow.add_edge("decompose", "retrieve")
workflow.add_edge("retrieve", "rerank")
workflow.add_edge("rerank", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents",
    decide_to_generate,
    {
        "transform_query": "transform_query",
        "generate": "generate",
    },
)
workflow.add_edge("transform_query", "retrieve")
workflow.add_conditional_edges(
    "generate",
    grade_generation_v_documents_and_question,
    {
        "not supported": "generate",
        "useful": END,
        "not useful": "transform_query",
    },
)

# Compile
app = workflow.compile()

#### Step-13: Run the multi-agent RAG workflow

In [None]:
from pprint import pprint

# Run
inputs = {"question": question}
for output in app.stream(inputs):
    for key, value in output.items():
        # Node
        pprint(f"Node '{key}':")
        # Optional: print full state at each node
        # pprint(value, indent=2, width=80, depth=None)
    pprint("\n---\n")

# Final generation
pprint(value["generation"])