# RAG (Retrieval-Augmented Generation) Tutorial

This notebook demonstrates how to build a RAG pipeline using NVIDIA NeMo Microservices on OpenShift.

Full documentation: [NeMo Data Store](https://docs.nvidia.com/nemo/microservices/latest/datastore/overview.html), [NeMo Entity Store](https://docs.nvidia.com/nemo/microservices/latest/entity-store/overview.html)

## Overview

This example implements a complete RAG workflow:
1. **Document Ingestion**: Upload documents to NeMo Data Store
2. **Embedding Generation**: Create embeddings using NeMo Embedding NIM
3. **Vector Storage**: Store embeddings for similarity search (kept in memory in this tutorial; Step 3 explains why NeMo Entity Store is not used for this)
4. **Query Processing**: Retrieve relevant documents based on user queries
5. **Response Generation**: Generate answers using NeMo Chat NIM with retrieved context
6. **Optional Guardrails**: Apply safety guardrails to responses

**No API keys required!** The notebook uses your deployed NIM endpoints for both chat and embedding models.


## Prerequisites

- NeMo Data Store service deployed
- NeMo Entity Store service deployed
- NeMo Guardrails service deployed (optional but recommended)
- **Chat NIM**: `meta-llama3-1b-instruct` service
- **Embedding NIM**: `nv-embedqa-1b-v2` service


In [1]:
# Install required packages
# Note: langchain is not needed - we use direct HTTP requests to NIM services
%pip install requests jupyterlab python-dotenv numpy pandas

# If running locally (outside cluster), set RUN_LOCALLY before importing config
# Uncomment the line below if you're running this notebook locally:
# import os; os.environ["RUN_LOCALLY"] = "true"


Note: you may need to restart the kernel to use updated packages.


In [2]:
# Load configuration
from config import (
    NDS_URL, ENTITY_STORE_URL, GUARDRAILS_URL,
    NIM_CHAT_URL, NIM_EMBEDDING_URL,
    NIM_CHAT_URL_CLUSTER, NIM_EMBEDDING_URL_CLUSTER,
    NMS_NAMESPACE, DATASET_NAME, NDS_TOKEN,
    RAG_TOP_K, RAG_SIMILARITY_THRESHOLD,
    RUN_LOCALLY
)

print(f"✅ Configuration loaded")
print(f"Mode: {'Local (port-forward)' if RUN_LOCALLY else 'Cluster'}")
print(f"Data Store: {NDS_URL}")
print(f"Entity Store: {ENTITY_STORE_URL}")
print(f"Chat NIM: {NIM_CHAT_URL}")
print(f"Embedding NIM: {NIM_EMBEDDING_URL}")
print(f"Namespace: {NMS_NAMESPACE}")
print(f"Dataset: {DATASET_NAME}")

# Quick connectivity test
import requests
try:
    r = requests.get(f"{NDS_URL}/v1/datastore/namespaces", timeout=2)
    print(f"✅ Data Store connectivity: OK")
except Exception as e:
    print(f"⚠️  Data Store connectivity: FAILED - {e}")
    if RUN_LOCALLY:
        print(f"\n📡 Port-forward setup required for local mode:")
        print(f"   Run this in a terminal:")
        print(f"   ./port-forward.sh")
        print(f"\n   Or manually:")
        print(f"   oc port-forward -n {NMS_NAMESPACE} svc/nemodatastore-sample 8001:8000 &")
        print(f"   oc port-forward -n {NMS_NAMESPACE} svc/nemoentitystore-sample 8002:8000 &")
        print(f"   oc port-forward -n {NMS_NAMESPACE} svc/nemoguardrails-sample 8005:8000 &")
        print(f"   oc port-forward -n {NMS_NAMESPACE} svc/meta-llama3-1b-instruct 8006:8000 &")
        print(f"   oc port-forward -n {NMS_NAMESPACE} svc/nv-embedqa-1b-v2 8007:8000 &")
    else:
        print(f"   If running from outside cluster, set RUN_LOCALLY=true environment variable")
        print(f"   Or ensure you're running this notebook from within the cluster")


✅ Configuration loaded
Mode: Local (port-forward)
Data Store: http://localhost:8001
Entity Store: http://localhost:8002
Chat NIM: http://localhost:8006
Embedding NIM: http://localhost:8007
Namespace: anemo-rhoai
Dataset: rag-tutorial-documents
✅ Data Store connectivity: OK
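
The `config` module imported above is not shown in this notebook. As a rough illustration only, the sketch below shows one way such a module might switch between port-forwarded and in-cluster endpoints. The `resolve_urls` helper and the `SERVICES` table are hypothetical. The local ports and service names are taken from the port-forward commands in the cell above, and the in-cluster host pattern (`<service>.<namespace>.svc.cluster.local:8000`) is an assumption based on standard Kubernetes service DNS.

```python
import os

# Hypothetical sketch of endpoint resolution; this is NOT the real config.py.
# Each entry maps a config name to (service name, local port-forward port).
SERVICES = {
    "NDS_URL": ("nemodatastore-sample", 8001),
    "ENTITY_STORE_URL": ("nemoentitystore-sample", 8002),
    "GUARDRAILS_URL": ("nemoguardrails-sample", 8005),
    "NIM_CHAT_URL": ("meta-llama3-1b-instruct", 8006),
    "NIM_EMBEDDING_URL": ("nv-embedqa-1b-v2", 8007),
}


def resolve_urls(env):
    """Return a dict of base URLs for local (port-forward) or cluster mode."""
    run_locally = env.get("RUN_LOCALLY", "false").lower() == "true"
    namespace = env.get("NMS_NAMESPACE", "default")
    urls = {}
    for key, (service, local_port) in SERVICES.items():
        if run_locally:
            # Port-forwarded services are reachable on localhost
            urls[key] = f"http://localhost:{local_port}"
        else:
            # Assumed in-cluster DNS name; services listen on port 8000
            urls[key] = f"http://{service}.{namespace}.svc.cluster.local:8000"
    return urls


if __name__ == "__main__":
    for name, url in resolve_urls(dict(os.environ)).items():
        print(f"{name}: {url}")
```

Passing the environment in as a dictionary keeps the helper easy to exercise without touching the real process environment.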


## Step 1: Document Ingestion

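First, we'll prepare sample documents and make sure the target namespace exists in NeMo Data Store. To keep the tutorial simple, the documents themselves stay in memory rather than being uploaded; they are the corpus used for retrieval.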


In [3]:
# Sample documents for RAG tutorial
documents = [
    {
        "id": "doc1",
        "title": "Introduction to NeMo Microservices",
        "content": "NVIDIA NeMo Microservices is a platform for deploying AI models at scale. It provides infrastructure for training, inference, and evaluation of large language models. The platform includes components like Data Store, Entity Store, Customizer, Evaluator, and Guardrails."
    },
    {
        "id": "doc2",
        "title": "RAG Architecture",
        "content": "Retrieval-Augmented Generation (RAG) combines information retrieval with language generation. The process involves: 1) Storing documents in a vector database, 2) Embedding user queries, 3) Retrieving relevant documents, 4) Generating responses using retrieved context."
    },
    {
        "id": "doc3",
        "title": "OpenShift Deployment",
        "content": "NeMo Microservices can be deployed on OpenShift using Helm charts. The deployment includes infrastructure components (PostgreSQL, MLflow, Argo Workflows) and instance components (NeMo services, NIM services). All components are namespace-scoped for multi-tenant safety."
    },
    {
        "id": "doc4",
        "title": "NIM Services",
        "content": "NVIDIA Inference Microservices (NIM) provide optimized inference for AI models. NIM services support chat models, embedding models, and reranking models. They are containerized and can be deployed on Kubernetes/OpenShift clusters with GPU support."
    },
    {
        "id": "doc5",
        "title": "Vector Databases",
        "content": "Vector databases store embeddings for similarity search. NeMo Entity Store provides vector storage capabilities. Milvus is also available as an alternative vector database. Both support efficient similarity search for RAG applications."
    }
]

print(f"✅ Prepared {len(documents)} sample documents")
for doc in documents:
    print(f"  - {doc['title']}")


✅ Prepared 5 sample documents
  - Introduction to NeMo Microservices
  - RAG Architecture
  - OpenShift Deployment
  - NIM Services
  - Vector Databases


In [4]:
# Upload documents to NeMo Data Store
import json

# Create namespace if it doesn't exist
namespace_url = f"{NDS_URL}/v1/datastore/namespaces/{NMS_NAMESPACE}"
auth_headers = {"Authorization": f"Bearer {NDS_TOKEN}"}
try:
    response = requests.get(namespace_url, headers=auth_headers, timeout=10)
    if response.status_code == 404:
        # Namespace is missing: create it and fail loudly if creation is rejected
        response = requests.post(
            f"{NDS_URL}/v1/datastore/namespaces",
            json={"name": NMS_NAMESPACE},
            headers=auth_headers,
            timeout=10
        )
        response.raise_for_status()
        print(f"✅ Created namespace: {NMS_NAMESPACE}")
    else:
        # Surface auth or server errors instead of reporting them as success
        response.raise_for_status()
        print(f"✅ Namespace exists: {NMS_NAMESPACE}")
except Exception as e:
    print(f"⚠️  Error checking namespace: {e}")

# Prepare documents (the actual upload to Data Store is skipped in this tutorial)
uploaded_docs = []
for doc in documents:
    try:
        # Location the file would have in Data Store if it were uploaded (not used below)
        file_url = f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/{doc['id']}.json"
        doc_data = {
            "id": doc['id'],
            "title": doc['title'],
            "content": doc['content']
        }

        # In a real scenario, you would upload doc_data to Data Store here.
        # For this tutorial, we keep it in memory and use it for embedding.
        uploaded_docs.append(doc_data)
        print(f"✅ Prepared document: {doc['title']}")
    except Exception as e:
        print(f"⚠️  Error preparing {doc['id']}: {e}")

print(f"\n✅ Prepared {len(uploaded_docs)} documents for embedding")


✅ Namespace exists: anemo-rhoai
✅ Prepared document: Introduction to NeMo Microservices
✅ Prepared document: RAG Architecture
✅ Prepared document: OpenShift Deployment
✅ Prepared document: NIM Services
✅ Prepared document: Vector Databases

✅ Prepared 5 documents for embedding
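
The cell above keeps the documents in memory instead of uploading them. If you do want to upload them later, a common preparation step is to serialize the records into a single JSON Lines payload first. The sketch below covers only that step. The helper names `documents_to_jsonl` and `jsonl_to_documents` are hypothetical, and the upload call itself is left out on purpose because the Data Store upload API is not shown in this notebook.

```python
import json


def documents_to_jsonl(docs):
    """Serialize document dicts (id, title, content) to a JSON Lines string."""
    required = ("id", "title", "content")
    lines = []
    for doc in docs:
        missing = [key for key in required if key not in doc]
        if missing:
            raise ValueError(f"document is missing fields: {missing}")
        # Keep only the known fields so embeddings or other extras are not written
        lines.append(json.dumps({key: doc[key] for key in required}, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


def jsonl_to_documents(text):
    """Parse a JSON Lines string back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = [{"id": "doc1", "title": "Example", "content": "Hello"}]
    print(documents_to_jsonl(sample), end="")
```

One record per line makes it easy to append documents later without rewriting the whole file.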


## Step 2: Generate Embeddings

Now we'll generate embeddings for each document using the NeMo Embedding NIM service.


In [5]:
# Generate embeddings using NeMo Embedding NIM
def get_embedding(text, embedding_url, input_type="passage"):
    """Generate embedding for text using NeMo Embedding NIM"""
    try:
        response = requests.post(
            f"{embedding_url}/v1/embeddings",
            json={
                "input": text,
                "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
                "input_type": input_type
            },
            headers={"Content-Type": "application/json"},
            timeout=30
        )
        if response.status_code == 200:
            return response.json()["data"][0]["embedding"]
        else:
            print(f"⚠️  Error getting embedding: {response.status_code} - {response.text}")
            return None
    except Exception as e:
        print(f"⚠️  Exception getting embedding: {e}")
        return None

# Generate embeddings for all documents
print("Generating embeddings...")
documents_with_embeddings = []

for doc in uploaded_docs:
    # Combine title and content for embedding
    text_to_embed = f"{doc['title']}\n{doc['content']}"
    embedding = get_embedding(text_to_embed, NIM_EMBEDDING_URL)
    
    if embedding:
        doc['embedding'] = embedding
        documents_with_embeddings.append(doc)
        print(f"✅ Generated embedding for: {doc['title']}")
    else:
        print(f"⚠️  Failed to generate embedding for: {doc['title']}")

print(f"\n✅ Generated embeddings for {len(documents_with_embeddings)} documents")


Generating embeddings...
✅ Generated embedding for: Introduction to NeMo Microservices
✅ Generated embedding for: RAG Architecture
✅ Generated embedding for: OpenShift Deployment
✅ Generated embedding for: NIM Services
✅ Generated embedding for: Vector Databases

✅ Generated embeddings for 5 documents
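
`get_embedding` sends one request per document, which is fine for five documents but slow for larger sets. Endpoints that follow the OpenAI-style `/v1/embeddings` schema generally accept a list for `input` and return one item per text with an `index` field. Whether your deployed NIM behaves this way is an assumption you should verify. The sketch below only builds the batch payload and restores the input order from a response body. The helper names are hypothetical, and no request is sent.

```python
def build_embedding_payload(texts, model="nvidia/llama-3.2-nv-embedqa-1b-v2",
                            input_type="passage"):
    """Build a request body that embeds several texts in one call."""
    return {"input": list(texts), "model": model, "input_type": input_type}


def parse_embedding_response(body, expected_count):
    """Return embeddings in input order, using each item's `index` field."""
    items = sorted(body["data"], key=lambda item: item["index"])
    if len(items) != expected_count:
        raise ValueError(f"expected {expected_count} embeddings, got {len(items)}")
    return [item["embedding"] for item in items]


if __name__ == "__main__":
    payload = build_embedding_payload(["first text", "second text"])
    # Hand-written stand-in for a server response, deliberately out of order
    fake_body = {"data": [
        {"index": 1, "embedding": [0.0, 1.0]},
        {"index": 0, "embedding": [1.0, 0.0]},
    ]}
    print(parse_embedding_response(fake_body, expected_count=len(payload["input"])))
```

Sorting by `index` guards against a server that returns items in a different order than they were sent.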


## Step 3: Store Embeddings Locally

For this tutorial, we'll store embeddings in memory for local similarity search.
In production, you can use a vector database like Milvus or Pinecone.

**Note**: NeMo Entity Store is primarily designed for managing models, datasets, and namespaces,
not for storing arbitrary document embeddings. For production RAG, consider using a dedicated vector database.


In [6]:
# Embeddings are already stored in documents_with_embeddings list
# For this tutorial, we use in-memory storage for simplicity
print(f"✅ Stored {len(documents_with_embeddings)} documents with embeddings in memory")
print(f"   Documents ready for local similarity search")
print(f"   In production, use a vector database like Milvus or Pinecone")


✅ Stored 5 documents with embeddings in memory
   Documents ready for local similarity search
   In production, use a vector database like Milvus or Pinecone
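
The in-memory approach can be wrapped in a small class so that normalization happens once at insert time rather than on every query. The following is a minimal pure-Python sketch of that idea. `InMemoryVectorStore` is a hypothetical name, and it is not a substitute for a real vector database.

```python
import math


class InMemoryVectorStore:
    """Tiny in-memory store: keeps unit-length vectors and ranks by dot product."""

    def __init__(self):
        self._items = []  # list of (unit_vector, payload) pairs

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot normalize a zero vector")
        return [x / norm for x in vector]

    def add(self, embedding, payload):
        # Normalizing on insert makes the later dot product equal cosine similarity
        self._items.append((self._normalize(embedding), payload))

    def search(self, query_embedding, top_k=5):
        query = self._normalize(query_embedding)
        scored = [
            (sum(q * v for q, v in zip(query, vector)), payload)
            for vector, payload in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


if __name__ == "__main__":
    store = InMemoryVectorStore()
    store.add([1.0, 0.0], {"title": "first"})
    store.add([0.0, 1.0], {"title": "second"})
    print(store.search([2.0, 0.5], top_k=1))
```

Because stored vectors already have unit length, each query costs one normalization plus one dot product per document.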


## Step 4: Query and Retrieve

Now we'll process a user query: embed it, find similar documents, and retrieve the most relevant ones.


In [7]:
# User query
user_query = "What is RAG and how does it work?"

print(f"User Query: {user_query}\n")

# Generate embedding for the query
query_embedding = get_embedding(user_query, NIM_EMBEDDING_URL, input_type="query")

if query_embedding:
    print(f"✅ Generated query embedding (dimension: {len(query_embedding)})\n")
    
    # Use local similarity search
    import numpy as np
    retrieved_docs = []
    similarities = []
    
    for doc in documents_with_embeddings:
        similarity = np.dot(query_embedding, doc['embedding']) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc['embedding'])
        )
        similarities.append((similarity, doc))
    
    # Sort by similarity and get top_k
    similarities.sort(reverse=True, key=lambda x: x[0])
    
    print(f"✅ Found {len(similarities)} documents, showing top {RAG_TOP_K}:\n")
    
    for i, (similarity, doc) in enumerate(similarities[:RAG_TOP_K], 1):
        if similarity >= RAG_SIMILARITY_THRESHOLD:
            print(f"{i}. {doc['title']} (similarity: {similarity:.3f})")
            retrieved_docs.append({
                'title': doc['title'],
                'content': doc['content'],
                'id': doc['id']
            })
    
    print(f"\n✅ Retrieved {len(retrieved_docs)} documents above threshold ({RAG_SIMILARITY_THRESHOLD})\n")
else:
    print("⚠️  Failed to generate query embedding")
    retrieved_docs = []


User Query: What is RAG and how does it work?

✅ Generated query embedding (dimension: 2048)

✅ Found 5 documents, showing top 5:

1. RAG Architecture (similarity: 0.412)

✅ Retrieved 1 documents above threshold (0.3)
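
The retrieval cell scores each document with cosine similarity, the dot product of two vectors divided by the product of their lengths, then keeps the top `RAG_TOP_K` results that also clear `RAG_SIMILARITY_THRESHOLD`. The sketch below restates both steps in plain Python with a small worked example. The function names are hypothetical, and apart from the 0.412 score and the 0.3 threshold shown in the output above, the scores are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_documents(scored, top_k, threshold):
    """Rank (similarity, title) pairs, keep the top_k, then drop those below threshold."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [(score, title) for score, title in ranked[:top_k] if score >= threshold]


if __name__ == "__main__":
    # Worked example: dot = 1*2 + 2*1 + 2*2 = 8, both norms are 3, so 8 / 9
    print(cosine_similarity([1, 2, 2], [2, 1, 2]))

    # Illustrative scores; only 0.412 comes from the notebook output
    scored = [(0.25, "NIM Services"), (0.412, "RAG Architecture"), (0.10, "Vector Databases")]
    print(select_documents(scored, top_k=5, threshold=0.3))
```

Note the order of operations: the threshold is applied after the top-k cut, so fewer than `top_k` documents may be returned, as in the single-result output above.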



## Step 5: Generate Response

Now we'll use the retrieved documents as context to generate a response using the Chat NIM.


In [8]:
# Build context from retrieved documents
context = "\n\n".join([
    f"Document: {doc['title']}\n{doc['content']}"
    for doc in retrieved_docs
])

print("Retrieved Context:")
print("=" * 80)
print(context[:500] + "..." if len(context) > 500 else context)
print("=" * 80)
print()

# Generate response using Chat NIM
def generate_response(query, context, chat_url):
    """Generate response using Chat NIM with retrieved context"""
    try:
        # Build prompt with context
        system_prompt = "You are a helpful assistant. Answer the question based on the provided context. If the context doesn't contain enough information, say so."
        user_prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
        
        response = requests.post(
            f"{chat_url}/v1/chat/completions",
            json={
                "model": "meta/llama-3.2-1b-instruct",
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt}
                ],
                "temperature": 0.7,
                "max_tokens": 500
            },
            headers={"Content-Type": "application/json"},
            timeout=60
        )
        
        if response.status_code == 200:
            return response.json()["choices"][0]["message"]["content"]
        else:
            print(f"⚠️  Error generating response: {response.status_code} - {response.text}")
            return None
    except Exception as e:
        print(f"⚠️  Exception generating response: {e}")
        return None

# Generate response
print("Generating response...")
response_text = generate_response(user_query, context, NIM_CHAT_URL)

if response_text:
    print("\n" + "=" * 80)
    print("Generated Response:")
    print("=" * 80)
    print(response_text)
    print("=" * 80)
else:
    print("⚠️  Failed to generate response")


Retrieved Context:
Document: RAG Architecture
Retrieval-Augmented Generation (RAG) combines information retrieval with language generation. The process involves: 1) Storing documents in a vector database, 2) Embedding user queries, 3) Retrieving relevant documents, 4) Generating responses using retrieved context.

Generating response...

Generated Response:
Based on the provided context, I can answer the question as follows:

RAG (Retrieval-Augmented Generation) is a technique that combines information retrieval with language generation, where the process involves:

1. Storing documents in a vector database
2. Embedding user queries
3. Retrieving relevant documents
4. Generating responses using retrieved context

In simpler terms, RAG is a method that uses retrieval (finding relevant documents) and augmentation (using user input or context to improve the generated response) to create a more effective language model that can generate relevant and coherent responses.
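
With only one short document retrieved, the whole context fits easily into the prompt. With more or longer documents, the context can outgrow what a small model handles well. The sketch below shows one possible way to assemble the same system and user messages while capping the context size. `build_messages` and its `max_context_chars` parameter are hypothetical, and a character budget is only a rough stand-in for a token budget.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer the question based on the provided context. "
    "If the context doesn't contain enough information, say so."
)


def build_messages(query, docs, max_context_chars=2000):
    """Build chat messages, adding documents in rank order until the budget is used."""
    parts = []
    used = 0
    for doc in docs:
        block = f"Document: {doc['title']}\n{doc['content']}"
        # Always keep the best-ranked document, even if it alone exceeds the budget
        if parts and used + len(block) > max_context_chars:
            break
        parts.append(block)
        used += len(block) + 2  # account for the "\n\n" separator
    context = "\n\n".join(parts)
    user_prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]


if __name__ == "__main__":
    docs = [{"title": "RAG Architecture", "content": "RAG combines retrieval with generation."}]
    for message in build_messages("How does RAG work?", docs):
        print(message["role"], "->", message["content"][:60])
```

The returned list can be passed as the `messages` field of the chat completion request used in the cell above.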


## Step 7: Validate RAG with Test Queries

Let's test the RAG pipeline with several questions to check that it works end to end.
We'll ask about different topics from our documents and inspect both the retrieved documents and the responses.

In [9]:
# Test queries covering different topics from our documents
test_queries = [
    "What is NeMo Microservices?",
    "How does RAG work?",
    "What are vector databases used for?",
    "What components does NeMo Microservices include?"
]

print("=" * 80)
print("RAG VALIDATION TESTS")
print("=" * 80)
print(f"Testing {len(test_queries)} queries against {len(documents_with_embeddings)} documents\n")

RAG VALIDATION TESTS
Testing 4 queries against 5 documents



In [10]:
# Function to run a complete RAG query
def run_rag_query(query, show_context=True):
    """Run a complete RAG query and return the response"""
    print(f"\n{'='*80}")
    print(f"Query: {query}")
    print(f"{'='*80}")
    
    # Generate query embedding
    query_embedding = get_embedding(query, NIM_EMBEDDING_URL, input_type="query")
    
    if not query_embedding:
        print("⚠️  Failed to generate query embedding")
        return None
    
    # Local similarity search
    import numpy as np
    retrieved_docs = []
    similarities = []
    
    for doc in documents_with_embeddings:
        similarity = np.dot(query_embedding, doc['embedding']) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc['embedding'])
        )
        similarities.append((similarity, doc))
    
    # Sort by similarity and get top_k
    similarities.sort(reverse=True, key=lambda x: x[0])
    
    print(f"\n📊 Top {min(RAG_TOP_K, len(similarities))} retrieved documents:")
    for i, (similarity, doc) in enumerate(similarities[:RAG_TOP_K], 1):
        if similarity >= RAG_SIMILARITY_THRESHOLD:
            print(f"  {i}. {doc['title']} (similarity: {similarity:.3f})")
            retrieved_docs.append({
                'title': doc['title'],
                'content': doc['content'],
                'id': doc['id']
            })
    
    if not retrieved_docs:
        print(f"⚠️  No documents found above threshold ({RAG_SIMILARITY_THRESHOLD})")
        return None
    
    # Build context
    context = "\n\n".join([
        f"Document: {doc['title']}\n{doc['content']}"
        for doc in retrieved_docs
    ])
    
    if show_context:
        print(f"\n📄 Retrieved Context (first 300 chars):")
        # Only append an ellipsis when the context was actually truncated
        print(context[:300] + ("..." if len(context) > 300 else ""))
    
    # Generate response
    print(f"\n🤖 Generating response...")
    system_prompt = "You are a helpful assistant. Answer the question based on the provided context. If the context doesn't contain enough information, say so."
    user_prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
    
    try:
        response = requests.post(
            f"{NIM_CHAT_URL}/v1/chat/completions",
            json={
                "model": "meta/llama-3.2-1b-instruct",
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt}
                ],
                "temperature": 0.7,
                "max_tokens": 500
            },
            headers={"Content-Type": "application/json"},
            timeout=60
        )
        
        if response.status_code == 200:
            response_text = response.json()["choices"][0]["message"]["content"]
            print(f"\n✅ Response:")
            print(f"{response_text}")
            return response_text
        else:
            print(f"⚠️  Failed to generate response: {response.status_code}")
            return None
    except Exception as e:
        print(f"⚠️  Error generating response: {e}")
        return None

In [11]:
# Run all test queries
results = {}

for query in test_queries:
    response = run_rag_query(query, show_context=True)
    results[query] = response
    print("\n" + "-"*80 + "\n")

print("=" * 80)
print("VALIDATION SUMMARY")
print("=" * 80)
print(f"\nTotal queries tested: {len(test_queries)}")
print(f"Successful responses: {sum(1 for r in results.values() if r is not None)}")
print(f"Failed responses: {sum(1 for r in results.values() if r is None)}")

if all(r is not None for r in results.values()):
    print("\n✅ All queries returned responses! RAG pipeline is working correctly.")
else:
    print("\n⚠️  Some queries failed. Check the error messages above.")


Query: What is NeMo Microservices?

📊 Top 5 retrieved documents:
  1. Introduction to NeMo Microservices (similarity: 0.639)
  2. OpenShift Deployment (similarity: 0.459)
  3. NIM Services (similarity: 0.351)
  4. Vector Databases (similarity: 0.323)

📄 Retrieved Context (first 300 chars):
Document: Introduction to NeMo Microservices
NVIDIA NeMo Microservices is a platform for deploying AI models at scale. It provides infrastructure for training, inference, and evaluation of large language models. The platform includes components like Data Store, Entity Store, Customizer, Evaluator, a...

🤖 Generating response...

✅ Response:
NeMo Microservices is a platform for deploying AI models at scale, providing infrastructure for training, inference, and evaluation of large language models, including Data Store, Entity Store, Customizer, Evaluator, and Guardrails components.

--------------------------------------------------------------------------------


Query: How does RAG work?
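
The validation summary counts a query as successful whenever any response comes back, which says little about whether the right document was retrieved. A stricter check could compare the retrieved titles against the title you expect for each query. The sketch below is one hedged way to do that. `retrieval_hit_rate` is a hypothetical helper, and the expected titles are illustrative pairings based on the sample documents, not results produced by the notebook.

```python
def retrieval_hit_rate(retrieved_by_query, expected_by_query):
    """Fraction of queries whose expected title appears among the retrieved titles."""
    if not expected_by_query:
        return 0.0
    hits = 0
    for query, expected_title in expected_by_query.items():
        if expected_title in retrieved_by_query.get(query, []):
            hits += 1
    return hits / len(expected_by_query)


if __name__ == "__main__":
    # Illustrative expectations; adjust them to your own corpus
    expected = {
        "What is NeMo Microservices?": "Introduction to NeMo Microservices",
        "How does RAG work?": "RAG Architecture",
    }
    # Titles that a retrieval step returned for each query (hand-written here)
    retrieved = {
        "What is NeMo Microservices?": ["Introduction to NeMo Microservices", "OpenShift Deployment"],
        "How does RAG work?": ["Vector Databases"],
    }
    print(f"Hit rate: {retrieval_hit_rate(retrieved, expected):.2f}")
```

To use it with the pipeline above, `run_rag_query` would need to return the retrieved titles alongside the response text, which it does not do as written.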


## Summary

This tutorial demonstrated a complete RAG pipeline:
1. ✅ Document ingestion into NeMo Data Store
2. ✅ Embedding generation using NeMo Embedding NIM
3. ✅ Vector storage (local in-memory for this tutorial)
4. ✅ Query processing with similarity search
5. ✅ Response generation using NeMo Chat NIM
6. ✅ Optional guardrails validation
7. ✅ RAG validation with test queries

### Next Steps

- Add more documents to improve retrieval quality
- Experiment with different embedding models
- Adjust retrieval parameters (top_k, similarity threshold)
- Integrate with your own document sources
- Add multi-turn conversation support
- Use a production vector database (Milvus, Pinecone, etc.) for larger document sets