# LangChain Retriever Integration – NvidiaRAGRetriever

This notebook demonstrates how to use the **`NvidiaRAGRetriever`**, a LangChain-compatible retriever that connects to a running NVIDIA RAG Blueprint server.

The retriever acts as a thin HTTP client — ingestion, embedding, vector search, and reranking all happen server-side.

## Prerequisites

1. The **NVIDIA RAG Blueprint** is deployed and the RAG server is running (default port `8081`).
2. Documents have already been **ingested** into at least one collection via the ingestion API or [ingestion notebook](./ingestion_api_usage.ipynb).
3. Python **3.11+** is installed.

---

## Environment Setup (run once)

Open a **terminal**, `cd` into the `rag/` directory, and run the commands below to create a virtual environment with the `nvidia_rag` package (which includes the retriever module).

```bash
# 1. Navigate to the rag directory (parent of this notebooks/ folder)
cd /path/to/rag  # replace with the location of your own rag/ checkout

# 2. Create a virtual environment
python3 -m venv .venv

# 3. Activate it
source .venv/bin/activate

# 4. Install nvidia_rag in editable mode (includes all retriever dependencies)
pip install -e .

# 5. Register the venv as a Jupyter kernel so this notebook can use it
pip install ipykernel
python -m ipykernel install --user --name nvidia-rag --display-name "Python (nvidia-rag)"
```

After running the above, come back to this notebook and **select the `Python (nvidia-rag)` kernel**:
- Click the kernel name in the **top-right corner** of the notebook (e.g. "Python 3" or ".venv")
- Choose **Python (nvidia-rag)** from the list

Then run the cells below.

## 1. Install Dependencies

In [1]:
# If you followed the terminal setup above and selected the "Python (nvidia-rag)" kernel,
# nvidia_rag is already installed and you can skip this cell.
#
# Otherwise, uncomment and run the line below to install into the current kernel:
# %pip install -e ".." --quiet

# Quick check that the package is importable:
import nvidia_rag.retriever
print(f"nvidia_rag.retriever loaded from: {nvidia_rag.retriever.__file__}")

Note: you may need to restart the kernel to use updated packages.


## 2. Configuration

In [2]:
import os

IPADDRESS = (
    "rag-server" if os.environ.get("AI_WORKBENCH", "false") == "true" else "localhost"
)
RAG_SERVER_PORT = "8081"
BASE_URL = f"http://{IPADDRESS}:{RAG_SERVER_PORT}"

# Optional: set your API key if the server requires authentication
API_KEY = os.environ.get("NVIDIA_API_KEY", None)

# Collection to query (must already contain ingested documents)
COLLECTION_NAME = "test"

print(f"RAG Server URL : {BASE_URL}")
print(f"Collection     : {COLLECTION_NAME}")
print(f"Auth           : {'Enabled' if API_KEY else 'Disabled'}")

RAG Server URL : http://localhost:8081
Collection     : test
Auth           : Disabled
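
The host-selection rule in the configuration cell can be pulled into a small function so it is easy to test outside the notebook. This is a minimal sketch, not part of `nvidia_rag`: the `RAG_SERVER_HOST` and `RAG_SERVER_PORT` override variables are hypothetical names introduced only for illustration.

```python
from collections.abc import Mapping


def build_base_url(env: Mapping[str, str], port: str = "8081") -> str:
    """Return the RAG server URL using the same rule as the configuration cell."""
    # Same default as the notebook: "rag-server" inside AI Workbench, else localhost.
    host = "rag-server" if env.get("AI_WORKBENCH", "false") == "true" else "localhost"
    # Hypothetical overrides (assumed names, not read by the server or the package).
    host = env.get("RAG_SERVER_HOST", host)
    port = env.get("RAG_SERVER_PORT", port)
    return f"http://{host}:{port}"


print(build_base_url({}))                        # http://localhost:8081
print(build_base_url({"AI_WORKBENCH": "true"}))  # http://rag-server:8081
```

In the notebook you would call `build_base_url(os.environ)` and assign the result to `BASE_URL`.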


## 3. Health Check

Verify that the RAG server is reachable before using the retriever.

In [3]:
import httpx

resp = httpx.get(f"{BASE_URL}/v1/health")
print(f"Status: {resp.status_code}")
print(resp.json())

Status: 200
{'message': 'Service is up.', 'databases': [], 'object_storage': [], 'nim': []}
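
If the server is still starting, a single request may fail or return a non-200 status. The sketch below retries a caller-supplied check a few times before giving up. The helper name `wait_until_healthy` and its defaults are assumptions for illustration. The check is passed in as a callable so the example runs without a server.

```python
import time
from collections.abc import Callable


def wait_until_healthy(check: Callable[[], int], attempts: int = 5, delay: float = 2.0) -> bool:
    """Call `check` until it returns HTTP 200 or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if check() == 200:
                return True
        except Exception as exc:  # e.g. connection refused while the server starts
            print(f"attempt {attempt}: {exc}")
        if attempt < attempts:
            time.sleep(delay)
    return False


# Offline demonstration: two "not ready" responses followed by a healthy one.
responses = iter([503, 503, 200])
print(wait_until_healthy(lambda: next(responses), delay=0.0))  # True
```

Against a real deployment, the check could be `lambda: httpx.get(f"{BASE_URL}/v1/health").status_code`.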


## 4. Basic Usage – Instantiate the Retriever

`NvidiaRAGRetriever` subclasses LangChain's `BaseRetriever`, so it supports
`.invoke()`, `.ainvoke()`, `.batch()`, and can be composed in LangChain chains.

In [4]:
from nvidia_rag.retriever import NvidiaRAGRetriever

retriever = NvidiaRAGRetriever(
    base_url=BASE_URL,
    api_key=API_KEY,
    collection_name=COLLECTION_NAME,
    top_k=5,
)

print(f"Retriever ready — will query collection '{COLLECTION_NAME}' with top_k=5")


Retriever ready — will query collection 'test' with top_k=5


## 5. Retrieve Documents

Call `.invoke()` with a natural-language query. The server handles embedding,
vector search, and reranking. Results come back as LangChain `Document` objects.

In [5]:
query = "What is AI?"
docs = retriever.invoke(query)

print(f"Query: {query}")
print(f"Retrieved {len(docs)} documents\n")

for i, doc in enumerate(docs):
    print(f"--- Document {i+1} (score: {doc.metadata.get('score', 'N/A')}) ---")
    print(f"Source : {doc.metadata.get('source', 'unknown')}")
    print(f"Page   : {doc.metadata.get('page_number', 'N/A')}")
    print(f"Content: {doc.page_content[:200]}...")
    print()

INFO:httpx:HTTP Request: POST http://localhost:8081/v1/search "HTTP/1.1 200 OK"


Query: What is AI?
Retrieved 5 documents

--- Document 1 (score: 0.7130321352710473) ---
Source : sample_ai_article_extracted.pdf
Page   : 1
Content: Extracted Content: sample_ai_article.pdf
Total extracted items: 640
 - image: 513
 - structured: 20
 - text: 107
[Item 1] Type: text
Artificial intelligence
Artificial intelligence (AI) is the ...

--- Document 2 (score: 0.6892219530281222) ---
Source : sample_ai_article_extracted.pdf
Page   : 274
Content: Extracted Content: sample_ai_article.pdf
AI itself, with particular emphasis on influential papers and foundational texts that introduced or advanced key concepts in AI. Notably, McCarthy
(1999) and...

--- Document 3 (score: 0.6892219530281222) ---
Source : sample_ai_article_extracted.pdf
Page   : 156
Content: kowski, Nicole (November 2023). "What is Artificial Intelligence and How Does AI Work?
TechTarget" (https://www.techtarget.com/searchenterpriseai/definition/AI-Artificial-Intelligenc
e). Enterprise ...

--- Document 4 (score: 0.6
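
Because each result carries a `score` in its metadata, low-scoring chunks can be dropped on the client before they reach a prompt. The following is a minimal sketch under stated assumptions: `Doc` is a stand-in for LangChain's `Document` so the block runs on its own, and the `0.7` threshold is an arbitrary illustrative value, not a recommendation.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_score(docs, min_score):
    """Keep documents whose metadata 'score' is numeric and at least min_score."""
    kept = []
    for doc in docs:
        score = doc.metadata.get("score")
        if isinstance(score, (int, float)) and score >= min_score:
            kept.append(doc)
    return kept


sample = [Doc("a", {"score": 0.71}), Doc("b", {"score": 0.69}), Doc("c", {})]
print([d.page_content for d in filter_by_score(sample, 0.7)])  # ['a']
```

The function only reads `.metadata`, so it should work unchanged on the list returned by `retriever.invoke(query)`.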

## 6. Async Retrieval

Use `.ainvoke()` for non-blocking retrieval in async applications.

In [None]:
docs = await retriever.ainvoke("What is AI")

for doc in docs:
    print(f"[{doc.metadata.get('score', 0):.2f}] {doc.page_content[:120]}...")

INFO:httpx:HTTP Request: POST http://localhost:8081/v1/search "HTTP/1.1 200 OK"


[0.17] Extracted Content: sample_ai_article.pdf
[Item 47] Type: image
Caption: performing similar functions . The development...
[0.17] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQH/...
[0.17] /arxiv.org/archive/cs.LG)].
152. Franzen, Carl (8 August 2024). "Alibaba
claims no. 1 spot in AI math models with
Qwe...
[0.15] center for US$650 million.[232] Nvidia CEO Jensen Huang said
...lear power is a good option for the data centers.[233]
[0.15] outputs.[110] The multiple layers can progressively extract
higher-level features from the raw input. For example, in i...
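
The main benefit of `.ainvoke()` is that several queries can be in flight at once. The sketch below shows one way to do that with `asyncio.gather` and a semaphore that caps concurrency. `FakeRetriever` is an offline stand-in that only mimics the method name, and `retrieve_many` and its `limit` parameter are hypothetical.

```python
import asyncio


class FakeRetriever:
    """Offline stand-in exposing the same async method name as the retriever."""

    async def ainvoke(self, query: str) -> list[str]:
        await asyncio.sleep(0.01)  # simulate network latency
        return [f"result for {query!r}"]


async def retrieve_many(retriever, queries, limit: int = 4):
    """Run ainvoke for every query concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def one(query):
        async with semaphore:
            return await retriever.ainvoke(query)

    # gather returns results in the same order as the input queries
    return await asyncio.gather(*(one(q) for q in queries))


results = asyncio.run(retrieve_many(FakeRetriever(), ["What is AI?", "What is a GPU?"]))
print(results)
```

Inside a notebook, where an event loop is already running, use `await retrieve_many(retriever, queries)` instead of `asyncio.run(...)`.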


## 7. Using Metadata Filters

Pass a filter expression to narrow the search to specific documents or metadata.
The filter syntax depends on the vector store backend (Milvus or Elasticsearch).

### Milvus filter (string expression)

In [9]:
# Switch to a collection ingested with custom metadata.
# Note: later cells reuse this value of COLLECTION_NAME.
COLLECTION_NAME = "metatest"
filtered_retriever = NvidiaRAGRetriever(
    base_url=BASE_URL,
    api_key=API_KEY,
    collection_name=COLLECTION_NAME,
    top_k=3,
    filters='content_metadata["domain"] == "education"',
)

docs = filtered_retriever.invoke("deep learning optimization")
print(f"Retrieved {len(docs)} documents with domain == education")
for doc in docs:
    print(f"  Page {doc.metadata.get('page_number')}: {doc.page_content[:80]}...")

INFO:httpx:HTTP Request: POST http://localhost:8081/v1/search "HTTP/1.1 200 OK"


Retrieved 2 documents with domain == education
  Page 2: Parent Portal: Weekly progress reports, upcoming assessment alerts, and recommen...
  Page 1: Product Requirements Document — AI-Powered Adaptive Learning Platform
1. Execut...
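
Writing filter strings by hand is error-prone because of the nested quotes. The helpers below build the same expression shape used above and join clauses with Milvus's boolean `and` operator. This is a sketch under assumptions: the helper names are hypothetical, and whether the server accepts a given compound expression depends on your backend and metadata schema.

```python
import json


def metadata_equals(key: str, value: str) -> str:
    """Build a filter of the form content_metadata["key"] == "value"."""
    # json.dumps adds the double quotes and escapes embedded quotes and backslashes.
    quoted_key = json.dumps(key, ensure_ascii=False)
    quoted_value = json.dumps(value, ensure_ascii=False)
    return f"content_metadata[{quoted_key}] == {quoted_value}"


def all_of(*clauses: str) -> str:
    """Combine clauses so that every one of them must match."""
    return " and ".join(f"({clause})" for clause in clauses)


expr = metadata_equals("domain", "education")
print(expr)  # content_metadata["domain"] == "education"
```

The result can be passed as `filters=expr` when constructing the retriever.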


## 8. Configuring Reranker and Query Rewriting

You can control server-side reranking and query rewriting per retriever instance.

In [10]:
# High-recall retriever: fetch 200 candidates from the vector DB, rerank to the top 5
precise_retriever = NvidiaRAGRetriever(
    base_url=BASE_URL,
    api_key=API_KEY,
    collection_name=COLLECTION_NAME,
    top_k=5,
    vdb_top_k=200,
    enable_reranker=True,
    enable_query_rewriting=True,
)

docs = precise_retriever.invoke("GPU memory management best practices")
print(f"Retrieved {len(docs)} reranked documents")
for doc in docs:
    print(f"  [{doc.metadata.get('score', 0):.3f}] {doc.page_content[:100]}...")

INFO:httpx:HTTP Request: POST http://localhost:8081/v1/search "HTTP/1.1 200 OK"


Retrieved 5 reranked documents
  [0.114] Product Requirements Document — AI-Powered Adaptive Learning Platform
1. Executive Summary
This do...
  [0.114] Parent Portal: Weekly progress reports, upcoming assessment alerts, and recommended
home-practice a...
  [0.114] Product Requirements Document — Omnichannel Retail Commerce Platform
1. Executive Summary
This doc...
  [0.114] Product Requirements Document — Personal Finance & Wealth Management App
1. Executive Summary
This...
  [0.114] Product Requirements Document — Remote Patient Monitoring & Telehealth Platform
1. Executive Summar...


## 9. Use in a LangChain RAG Chain

Compose the retriever with an LLM in a standard LangChain chain.
This example uses `ChatNVIDIA` from `langchain-nvidia-ai-endpoints`.

In [11]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough


def format_docs(docs):
    """Join retrieved documents into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)


prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer the question based only on the following context:\n\n{context}",
        ),
        ("human", "{question}"),
    ]
)

# Uncomment and configure the LLM of your choice:
# from langchain_nvidia_ai_endpoints import ChatNVIDIA
# llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")

# chain = (
#     {"context": retriever | format_docs, "question": RunnablePassthrough()}
#     | prompt
#     | llm
#     | StrOutputParser()
# )

# response = chain.invoke("What are the benefits of using CUDA?")
# print(response)

print("Uncomment the chain code above after configuring your LLM.")

Uncomment the chain code above after configuring your LLM.
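
The `format_docs` helper above discards where each chunk came from. A variant that labels every chunk with its source and page, and stops once a rough character budget is reached, might look like the sketch below. `Doc` is a stand-in for LangChain's `Document`, and `max_chars` is a hypothetical parameter; the budget counts chunk text only, not the blank lines between chunks.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_with_sources(docs, max_chars: int = 4000) -> str:
    """Join chunks into one context string, labelling each with source and page."""
    parts = []
    used = 0
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page_number", "N/A")
        block = f"[{i}] {source} (page {page})\n{doc.page_content}"
        if used + len(block) > max_chars:
            break  # stop before exceeding the approximate budget
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


sample = [Doc("alpha", {"source": "a.pdf", "page_number": 1}), Doc("beta")]
print(format_docs_with_sources(sample))
```

It can replace `format_docs` in the chain, for example `retriever | format_docs_with_sources`.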


## 10. Batch Retrieval

LangChain's `BaseRetriever` provides `.batch()` for running multiple queries in one call. It returns one list of documents per query, in the same order as the input.

In [None]:
queries = [
    "What is AI?",
    "How does TensorRT work?",
    "Explain GPU parallel computing",
]

batch_results = retriever.batch(queries)

for query, docs in zip(queries, batch_results):
    print(f"Query: {query}")
    print(f"  → {len(docs)} documents retrieved")
    if docs:
        print(f"  → Top result: {docs[0].page_content[:80]}...")
    print()
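
By default, one failing query makes the whole `.batch()` call raise. LangChain's `Runnable.batch` also accepts `return_exceptions=True`, in which case a failed query yields the exception object in place of its result list. The sketch below separates the two cases; `split_batch_results` is a hypothetical helper, and the results are simulated so the block runs without a server.

```python
def split_batch_results(queries, results):
    """Separate successful result lists from exceptions returned in place."""
    succeeded, failed = {}, {}
    for query, result in zip(queries, results):
        if isinstance(result, Exception):
            failed[query] = result
        else:
            succeeded[query] = result
    return succeeded, failed


# Simulated output of retriever.batch(queries, return_exceptions=True)
queries = ["What is AI?", "How does TensorRT work?"]
results = [["doc-1", "doc-2"], TimeoutError("server did not respond")]

ok, errors = split_batch_results(queries, results)
print(f"{len(ok)} succeeded, {len(errors)} failed")  # 1 succeeded, 1 failed
```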

## 11. Inspecting Document Metadata

Each returned `Document` carries rich metadata from the RAG server.

In [None]:
import json

docs = retriever.invoke("neural network training")

if docs:
    doc = docs[0]
    print("=== Top Document ===")
    print(f"Content (first 300 chars):\n{doc.page_content[:300]}\n")
    print("Metadata:")
    print(json.dumps(doc.metadata, indent=2, default=str))
else:
    print("No documents found.")
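
A quick way to see which files dominate a result set is to count chunks per `source` value. This is a minimal sketch that works on plain metadata dictionaries. It assumes only the `source` key printed earlier in this notebook, and `count_by_source` is a hypothetical helper name.

```python
from collections import Counter


def count_by_source(metadatas):
    """Count retrieved chunks per 'source'; missing keys count as 'unknown'."""
    return Counter(m.get("source", "unknown") for m in metadatas)


sample = [
    {"source": "a.pdf", "page_number": 1},
    {"source": "a.pdf", "page_number": 274},
    {},
]
print(count_by_source(sample).most_common())  # [('a.pdf', 2), ('unknown', 1)]
```

With real results, call `count_by_source(doc.metadata for doc in docs)`.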

## 12. Use as a LangChain Tool (Agent Integration)

Wrap the retriever as a tool so a LangChain agent can decide when to search.

In [None]:
from langchain_core.tools import create_retriever_tool

search_tool = create_retriever_tool(
    retriever,
    name="nvidia_docs_search",
    description="Search NVIDIA documentation for technical information about GPUs, CUDA, TensorRT, and related topics.",
)

# Test the tool directly
result = search_tool.invoke("CUDA memory allocation")
print(f"Tool returned {len(result)} characters of context")
print(result[:300], "...")

## 13. Cleanup

Close the HTTP connections when done.

In [None]:
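# Close every retriever created in this notebook.
# Drop any name from the tuple if you skipped the section that creates it.
for r in (retriever, filtered_retriever, precise_retriever):
    r.close()
print("Retriever connections closed.")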
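
In scripts, it is easy to forget the explicit `close()` call, especially when an exception interrupts the run. The standard library's `contextlib.closing` calls `.close()` on exit from a `with` block, whether or not an error occurred. The sketch uses a stand-in class so it runs offline. It assumes only that the real retriever exposes `close()` as shown in the cleanup cell.

```python
from contextlib import closing


class FakeRetriever:
    """Offline stand-in with the invoke/close methods used in this notebook."""

    def __init__(self):
        self.closed = False

    def invoke(self, query):
        return [f"result for {query}"]

    def close(self):
        self.closed = True


fake = FakeRetriever()
with closing(fake) as r:
    docs = r.invoke("What is AI?")

print(docs, fake.closed)  # ['result for What is AI?'] True
```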

## API Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| `base_url` | `str` | *required* | Root URL of the NVIDIA RAG server |
| `api_key` | `str \| None` | `None` | Bearer token for authentication |
| `collection_name` | `str \| None` | `None` | Collection to search (server default if None) |
| `top_k` | `int` | `10` | Max documents after reranking (1–25) |
| `vdb_top_k` | `int` | `100` | Candidates from vector DB before reranking (1–400) |
| `filters` | `str \| list` | `""` | Vector store filter expression |
| `enable_reranker` | `bool \| None` | `None` | Enable server-side reranking |
| `enable_query_rewriting` | `bool \| None` | `None` | Enable server-side query rewriting |
| `embedding_model` | `str \| None` | `None` | Override embedding model name |
| `reranker_model` | `str \| None` | `None` | Override reranker model name |
| `timeout` | `float` | `60.0` | HTTP timeout in seconds |
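
The ranges in the table can be checked on the client before a request is sent. The sketch below encodes the documented bounds for `top_k` and `vdb_top_k`. The helper `validate_top_k` is hypothetical, and the rule that `vdb_top_k` should not be smaller than `top_k` is an assumption (the table does not state it).

```python
def validate_top_k(top_k: int = 10, vdb_top_k: int = 100) -> None:
    """Raise ValueError if the values fall outside the ranges in the table."""
    if not 1 <= top_k <= 25:
        raise ValueError(f"top_k must be in 1..25, got {top_k}")
    if not 1 <= vdb_top_k <= 400:
        raise ValueError(f"vdb_top_k must be in 1..400, got {vdb_top_k}")
    # Assumption: reranking cannot return more documents than were fetched.
    if vdb_top_k < top_k:
        raise ValueError("vdb_top_k should be at least top_k")


validate_top_k(top_k=5, vdb_top_k=200)  # passes silently

try:
    validate_top_k(top_k=30)
except ValueError as exc:
    print(exc)  # top_k must be in 1..25, got 30
```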