# RAG Agent: Intelligent Document Q&A System

A production-ready Retrieval-Augmented Generation (RAG) system powered by NVIDIA AI endpoints, LangChain, and FAISS vector store.

## Overview

This notebook demonstrates a complete RAG pipeline that:
- Loads and processes research papers from arXiv
- Creates semantic embeddings using NVIDIA's embedding models
- Stores documents in a FAISS vector database
- Retrieves relevant context for user queries
- Generates grounded, citation-backed responses using LLMs

## Features

- **Intelligent Retrieval**: Uses semantic search to find the most relevant document chunks
- **Context Reordering**: Applies long-context reordering so the most relevant chunks sit at the start and end of the context, where LLMs tend to use them best
- **Grounded Generation**: Responses are strictly based on retrieved documents
- **Streaming Output**: Real-time response generation for better UX
- **Production Ready**: Modular design for easy deployment and integration
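
The context-reordering feature can be illustrated without any dependencies. The sketch below assumes the retriever returns chunks sorted from most to least relevant, and moves the strongest chunks to the two ends of the list. `reorder_for_long_context` is a hypothetical helper written for illustration; LangChain's `LongContextReorder` may differ in detail.

```python
def reorder_for_long_context(docs):
    """Place the most relevant items at both ends, least relevant in the middle.

    Assumes `docs` is sorted from most to least relevant.
    """
    reordered = []
    # Walk from least to most relevant, alternating between front and back.
    for i, doc in enumerate(reversed(docs)):
        if i % 2 == 1:
            reordered.append(doc)
        else:
            reordered.insert(0, doc)
    return reordered


ranked = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(reorder_for_long_context(ranked))
```

With five ranked chunks, the top-ranked chunk ends up first, the second-ranked chunk last, and the weakest chunk in the middle.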

## 1. Environment Setup

Install required dependencies and configure the environment.

In [None]:
# Install required packages
%pip install -q langchain langchain-community langchain-nvidia-ai-endpoints gradio
%pip install -q arxiv pymupdf faiss-cpu

# Import core libraries
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
from langchain_core.runnables.passthrough import RunnableAssign
from langchain_community.document_transformers import LongContextReorder
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings
from langchain_community.vectorstores import FAISS

from operator import itemgetter
from functools import partial
import os

print("✓ Environment setup complete")
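
The NVIDIA endpoints used in the next section need an API key. Below is a minimal sketch of a pre-flight check, assuming the key is supplied through the `NVIDIA_API_KEY` environment variable and starts with `nvapi-` (verify both against the version of `langchain-nvidia-ai-endpoints` you have installed). `nvidia_key_is_set` is a hypothetical helper, not part of any library.

```python
import os


def nvidia_key_is_set(env=None):
    """Return True if NVIDIA_API_KEY is present and has the assumed 'nvapi-' prefix."""
    env = os.environ if env is None else env
    return env.get("NVIDIA_API_KEY", "").startswith("nvapi-")


if not nvidia_key_is_set():
    print("NVIDIA_API_KEY is missing or malformed; set it before initializing the models.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.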

## 2. Initialize AI Models

Configure NVIDIA AI endpoints for embeddings and language models.

In [None]:
# Initialize embedding model for semantic search
embedder = NVIDIAEmbeddings(
    model="nvidia/nv-embed-v1",
    truncate="END"
)

# Initialize LLM for response generation
instruct_llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")
llm = instruct_llm | StrOutputParser()

print("✓ AI models initialized successfully")
print("  - Embedding Model: nvidia/nv-embed-v1")
print("  - LLM: meta/llama-3.1-8b-instruct")
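
Semantic search ranks chunks by how close their embedding vectors are to the query vector. The sketch below shows the idea with hand-written 3-dimensional vectors and cosine similarity. The vectors, chunk names, and `top_k` helper are made up for illustration; real embeddings have far more dimensions, and FAISS may use a different distance metric depending on how the index was built.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy "embeddings": each axis loosely stands for one topic.
toy_chunks = {
    "transformers": [0.9, 0.1, 0.0],
    "retrieval": [0.1, 0.9, 0.1],
    "cooking": [0.0, 0.1, 0.9],
}
print(top_k([0.8, 0.3, 0.0], toy_chunks))
```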

## 3. Load Document Store

Load pre-built FAISS vector store containing embedded research papers.

In [None]:
# Extract and load the FAISS index
!tar xzvf docstore_index.tgz

# Load the vector store with document embeddings
docstore = FAISS.load_local(
    "docstore_index",
    embedder,
    allow_dangerous_deserialization=True
)

# Get all documents from the store
docs = list(docstore.docstore._dict.values())

print("✓ Document store loaded successfully")
print(f"  - Total documents: {len(docstore.docstore._dict)}")
print(f"  - Sample paper: {docs[0].metadata.get('Title', 'Unknown')[:80]}...")
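
To see how the store is distributed across papers, the chunks can be grouped by title. This is a minimal sketch, assuming each chunk carries a `Title` entry in its metadata as the cell above suggests. `FakeDoc` is a stand-in so the snippet runs without the index; the `docs` list from the cell above could be passed to `chunks_per_paper` instead.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a LangChain document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def chunks_per_paper(docs):
    """Count chunks per paper title, using 'Unknown' when no title is present."""
    return Counter(doc.metadata.get("Title", "Unknown") for doc in docs)


sample = [
    FakeDoc("chunk 1", {"Title": "Paper A"}),
    FakeDoc("chunk 2", {"Title": "Paper A"}),
    FakeDoc("chunk 3", {"Title": "Paper B"}),
    FakeDoc("chunk 4"),
]
for title, count in chunks_per_paper(sample).most_common():
    print(f"{count:3d}  {title}")
```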

## 4. Build RAG Pipeline

Create the complete RAG chain with retrieval and generation components.

In [None]:
# Utility: Format retrieved documents into readable context
def docs2str(docs, title="Document"):
    """Convert document list to formatted string with citations."""
    out_str = ""
    for doc in docs:
        doc_name = getattr(doc, 'metadata', {}).get('Title', title)
        if doc_name:
            out_str += f"[Quote from {doc_name}] "
        out_str += getattr(doc, 'page_content', str(doc)) + "\n"
    return out_str


# Define the chat prompt template
chat_prompt = ChatPromptTemplate.from_template(
    "You are a helpful document chatbot. Answer questions based solely on the provided context."
    " User question: {input}\n\n"
    "Retrieved Context:\n{context}\n\n"
    "Instructions:\n"
    "- Only use information from the retrieved context\n"
    "- Cite sources when making claims\n"
    "- Be conversational and clear\n"
    "- If context is insufficient, acknowledge limitations\n\n"
    "Question: {input}"
)


# Utility: Stream output from chain results
def output_puller(inputs):
    """Extract and yield the 'output' field from runnable results."""
    if isinstance(inputs, dict):
        inputs = [inputs]
    for token in inputs:
        if token.get('output'):
            yield token.get('output')

print("✓ Prompt templates configured")
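
The behavior of `output_puller` is easiest to see on plain Python inputs. The snippet below repeats the function from the cell above so it runs on its own, then feeds it a single dictionary and a list of dictionaries standing in for a token stream. The sample inputs are made up for illustration.

```python
def output_puller(inputs):
    """Extract and yield the 'output' field from runnable results."""
    if isinstance(inputs, dict):
        inputs = [inputs]
    for token in inputs:
        if token.get('output'):
            yield token.get('output')


# A single result dictionary is treated as a one-item stream.
single = list(output_puller({'output': 'hello'}))

# Items with an empty or missing 'output' are skipped.
streamed = list(output_puller([
    {'output': 'Retrieval'},
    {'output': ''},
    {'other': 'ignored'},
    {'output': ' works'},
]))

print(single)
print(streamed)
```

Note that empty-string tokens are dropped as well, because the check is on truthiness rather than on the key being present.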

In [None]:
# Step 1: Build the retrieval chain
# ====================================

# Reorder retrieved chunks so the most relevant ones sit at the start and end of the context
long_reorder = RunnableLambda(LongContextReorder().transform_documents)

# Create retriever from vector store (top-5 most relevant chunks)
doc_retriever = docstore.as_retriever(search_kwargs={'k': 5})


def _context_to_text(docs):
    """Convert retrieved documents to formatted text string."""
    if not docs:
        return (
            "No relevant passages found in the knowledge base. "
            "Please rephrase your question or ask about a different topic."
        )
    return docs2str(docs)


# Build context retrieval pipeline
context_getter = (
    itemgetter('input')
    | doc_retriever
    | long_reorder
    | RunnableLambda(_context_to_text)
)

# Complete retrieval chain: input -> {input, context}
retrieval_chain = (
    {'input': (lambda x: x)}
    | RunnableAssign({'context': context_getter})
)

print("✓ Retrieval chain built")
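
The retrieval chain's data flow can be mimicked in plain Python, which makes the shape of its output easier to see. This sketch uses a stand-in `fake_retriever` (hypothetical, keyword matching instead of vector search) and mirrors only the structure `input -> {input, context}`. It does not use LangChain and skips the reordering step.

```python
def fake_retriever(query, corpus, k=5):
    """Stand-in retriever: return up to k passages that share a word with the query."""
    words = set(query.lower().split())
    hits = [text for text in corpus if words & set(text.lower().split())]
    return hits[:k]


def context_to_text(passages):
    """Join passages into one string, with a fallback message when none were found."""
    if not passages:
        return "No relevant passages found in the knowledge base."
    return "\n".join(passages)


def retrieve(query, corpus):
    """Return the same dictionary shape the retrieval chain produces."""
    return {"input": query, "context": context_to_text(fake_retriever(query, corpus))}


corpus = ["FAISS stores dense vectors", "Transformers use attention"]
print(retrieve("how does attention work", corpus))
print(retrieve("unrelated", corpus))
```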

In [None]:
# Step 2: Build the generation chain
# ====================================

# Create response generation pipeline
response_chain = (
    {
        'input': itemgetter('input'),
        'context': itemgetter('context'),
    }
    | chat_prompt
    | llm
)

# Wrap output for streaming compatibility
generator_chain = {'output': response_chain} | RunnableLambda(output_puller)

print("✓ Generation chain built")
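
Before the LLM is called, the prompt template is filled from the `input` and `context` fields. The sketch below approximates that step with `str.format` on a shortened template. `TEMPLATE` and `render_prompt` are simplified stand-ins for illustration, and the sample question and context are invented; `ChatPromptTemplate` produces chat messages rather than a single string.

```python
# Shortened stand-in for the notebook's prompt; the real template has more instructions.
TEMPLATE = (
    "Answer using only the context below.\n\n"
    "Retrieved Context:\n{context}\n\n"
    "Question: {input}"
)


def render_prompt(inputs):
    """Fill the template from a dict with 'input' and 'context' keys."""
    return TEMPLATE.format(input=inputs["input"], context=inputs["context"])


example = {
    "input": "What is FAISS?",
    "context": "[Quote from Paper A] FAISS is a library for similarity search.",
}
print(render_prompt(example))
```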

In [None]:
# Step 3: Combine into complete RAG pipeline
# =============================================

rag_chain = retrieval_chain | generator_chain

print("✓ Complete RAG pipeline assembled")
print("\nPipeline Architecture:")
print("  1. User Query → Embedding")
print("  2. Semantic Search → Top-K Documents")
print("  3. Context Reordering → Optimized Context")
print("  4. LLM Generation → Grounded Response")

## 5. Test the RAG Agent

Run sample queries to validate the pipeline.

In [None]:
# Example 1: Basic query with streaming output
print("=" * 70)
print("QUERY: What are the latest developments in large language models?")
print("=" * 70)

for token in rag_chain.stream("What are the latest developments in large language models?"):
    print(token, end="", flush=True)

print("\n" + "=" * 70)

In [None]:
# Example 2: Custom query
query = input("Enter your question: ")

print(f"\n{'=' * 70}")
print(f"QUERY: {query}")
print("=" * 70)

for token in rag_chain.stream(query):
    print(token, end="", flush=True)

print("\n" + "=" * 70)
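
When validating a streaming pipeline it can help to record how long the first token takes, not just the full answer. The following is a minimal sketch, assuming any iterator of string tokens. `collect_stream` is a hypothetical helper and `fake_stream` stands in for `rag_chain.stream(query)`.

```python
import time


def collect_stream(token_iter):
    """Join streamed tokens; return (text, seconds to first token, total seconds)."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for token in token_iter:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(token)
    total = time.perf_counter() - start
    return "".join(parts), first_token_at, total


def fake_stream():
    """Stand-in for a streaming chain."""
    for word in ["Retrieval", " augmented", " generation"]:
        yield word


text, ttft, total = collect_stream(fake_stream())
print(text)
print(f"first token after {ttft:.6f}s, total {total:.6f}s")
```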

## 6. Interactive Chat Interface (Optional)

Launch a Gradio web interface for easier interaction with the RAG agent.

In [None]:
import gradio as gr


def chat_interface(message, history):
    """Process user message and return RAG response."""
    response = ""
    for token in rag_chain.stream(message):
        response += token
    return response


# Create Gradio interface
demo = gr.ChatInterface(
    fn=chat_interface,
    title="🤖 RAG Agent: Intelligent Document Q&A",
    description="Ask questions about research papers in the knowledge base. Responses are grounded in retrieved documents.",
    examples=[
        "What are the key findings in recent NLP research?",
        "Explain the concept of retrieval-augmented generation",
        "What improvements have been made to transformer models?",
    ],
    theme=gr.themes.Soft(),
)

# Launch the interface
demo.launch(share=False)
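
The interface above waits for the full answer before displaying it. `gr.ChatInterface` also accepts a generator function, in which case each yielded value replaces the message shown so far. Below is a minimal sketch of that pattern, with the token source passed in as `stream_fn` so the snippet runs without Gradio or the RAG chain. `streaming_chat` and `fake_stream` are hypothetical names used for illustration.

```python
def streaming_chat(message, history, stream_fn):
    """Yield the accumulated response after each token so the UI can update live."""
    response = ""
    for token in stream_fn(message):
        response += token
        yield response


def fake_stream(message):
    """Stand-in for rag_chain.stream: emits a few tokens."""
    yield from ["Echo", ": ", message]


print(list(streaming_chat("hi", [], fake_stream)))
```

In the notebook, a wrapper such as `def chat_interface(message, history): yield from streaming_chat(message, history, rag_chain.stream)` could be passed as `fn`.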