# RAG (Retrieval-Augmented Generation) Tutorial

This notebook demonstrates how to build a RAG pipeline using NVIDIA NeMo Microservices on OpenShift.

Full documentation: [NeMo Data Store](https://docs.nvidia.com/nemo/microservices/latest/datastore/overview.html), [NeMo Entity Store](https://docs.nvidia.com/nemo/microservices/latest/entity-store/overview.html)

## Overview

This example implements a complete RAG workflow:
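1. **Document Ingestion**: Prepare sample documents and register a namespace in NeMo Data Store
2. **Embedding Generation**: Create embeddings using NeMo Embedding NIM
3. **Vector Storage**: Keep embeddings in memory for local similarity search (use a dedicated vector database such as Milvus in production)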
4. **Query Processing**: Retrieve relevant documents based on user queries
5. **Response Generation**: Generate answers using NeMo Chat NIM with retrieved context
6. **Optional Guardrails**: Apply safety guardrails to responses

**No API keys required!** The notebook uses your deployed NIM endpoints for both chat and embedding models.


## Prerequisites

### Deployed Services
- NeMo Data Store service deployed
- NeMo Entity Store service deployed
- NeMo Guardrails service deployed (optional but recommended)
- **Chat NIM**: `meta/llama-3.2-1b-instruct` model (service name may vary, e.g., `meta-llama3-1b-instruct`)
- **Embedding NIM**: `nv-embedqa-1b-v2` service

### 🔒 Security Setup (REQUIRED FIRST STEP)

**IMPORTANT**: This notebook reads sensitive configuration (tokens, API keys) from an `env.donotcommit` file.

**Before running this notebook:**
1. Copy the template: `cp env.donotcommit.example env.donotcommit`
2. Edit `env.donotcommit` and add your `NMS_NAMESPACE` (and other values as needed)
3. The `env.donotcommit` file is git-ignored and will NOT be committed to version control

**Find your namespace:**
```bash
oc projects
```
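
Before running the configuration cell, it can help to confirm that the required values are actually set. The following is a minimal sketch, not part of the notebook's `config` module: the function names, the choice of which variables count as required, and the placeholder list are assumptions based on the template shown in this notebook.

```python
# Names come from the env.donotcommit template; treating exactly these two
# as "required" is an assumption made for this sketch.
REQUIRED_SETTINGS = ("NMS_NAMESPACE", "RUN_LOCALLY")
# Values that mean "not filled in yet" (the template ships "your-namespace").
PLACEHOLDER_VALUES = {"", "your-namespace"}


def find_missing_settings(env, required=REQUIRED_SETTINGS):
    """Return required names that are unset or still hold a placeholder."""
    return [
        name
        for name in required
        if env.get(name, "").strip() in PLACEHOLDER_VALUES
    ]


def parse_bool(value):
    """Interpret common truthy strings such as "true", "1", or "yes"."""
    return str(value).strip().lower() in {"1", "true", "yes"}


# Example with a dictionary standing in for os.environ.
sample_env = {"NMS_NAMESPACE": "your-namespace", "RUN_LOCALLY": "false"}
print(find_missing_settings(sample_env))  # ['NMS_NAMESPACE']
```

In a notebook you would pass `os.environ` instead of `sample_env` and stop early if the returned list is not empty.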


In [None]:
# ============================================================================
# CONFIGURATION: Load Environment Variables from env.donotcommit file
# ============================================================================
# 🔒 SECURITY: Never hardcode secrets in notebooks!
# All sensitive values (tokens, API keys) should be in env.donotcommit file
# 
# SETUP INSTRUCTIONS:
# 1. Copy env.donotcommit.example to env.donotcommit: cp env.donotcommit.example env.donotcommit
# 2. Edit env.donotcommit and fill in your values (especially NMS_NAMESPACE)
# 3. env.donotcommit is git-ignored and will NOT be committed to version control
#
# IMPORTANT: Run this cell FIRST before importing config!
# If you get connection errors, restart the kernel and run cells in order.
import os
import sys
from pathlib import Path

# Load env.donotcommit file from the notebook directory
try:
    from dotenv import load_dotenv
    # Find env.donotcommit file in the same directory as this notebook
    notebook_dir = Path().resolve()  # Current working directory (where notebook is run from)
    env_file = notebook_dir / "env.donotcommit"
    
    if env_file.exists():
        load_dotenv(env_file, override=False)  # override=False: don't overwrite existing env vars
        print(f"✅ Loaded env.donotcommit file from: {env_file}")
    else:
        print(f"⚠️  env.donotcommit file not found at: {env_file}")
        print(f"   Looking for env.donotcommit.example template...")
        # Check if env.donotcommit.example exists
        env_example = notebook_dir / "env.donotcommit.example"
        if env_example.exists():
            print(f"   ℹ️  env.donotcommit.example exists at: {env_example}")
            print(f"   📝 Please copy it to env.donotcommit and fill in your values:")
            print(f"      cp env.donotcommit.example env.donotcommit")
            print(f"      # Then edit env.donotcommit and add your NMS_NAMESPACE")
        else:
            print(f"   ⚠️  env.donotcommit.example not found - creating template...")
            env_example_content = """# NeMo Microservices Configuration
# Copy this file to env.donotcommit and fill in your values
# env.donotcommit is git-ignored and will NOT be committed

# REQUIRED: Namespace for cluster services
# Replace with your actual OpenShift namespace/project name
# Find your namespace: oc projects
NMS_NAMESPACE=your-namespace

# REQUIRED: Set to "false" when running in Workbench (uses cluster URLs)
# Set to "true" only if you're running locally with port-forwards
RUN_LOCALLY=false

# OPTIONAL: NeMo Data Store token
# Default is "token" - update if your deployment uses a different token
NDS_TOKEN=token

# OPTIONAL: Dataset name for RAG tutorial documents
DATASET_NAME=rag-tutorial-documents

# OPTIONAL: RAG Configuration
# Number of documents to retrieve
RAG_TOP_K=5
# Similarity threshold for retrieval
RAG_SIMILARITY_THRESHOLD=0.3

# OPTIONAL: API Keys (only needed if using external APIs as fallback)
# OPENAI_API_KEY=
# NVIDIA_API_KEY=
# HF_TOKEN=
"""
            env_example.write_text(env_example_content)
            print(f"   ✅ Created env.donotcommit.example template at: {env_example}")
            print(f"   📝 Please copy it to env.donotcommit and fill in your values:")
            print(f"      cp env.donotcommit.example env.donotcommit")
            print(f"      # Then edit env.donotcommit and add your NMS_NAMESPACE")
except ImportError:
    print("⚠️  python-dotenv not installed - install with: pip install python-dotenv")
    print("   Will use system environment variables only (not recommended)")

# Clear any cached config module to force reload
if 'config' in sys.modules:
    del sys.modules['config']
    print("⚠️  Cleared cached config module - will reload with new env vars")

# Set fallback defaults. setdefault() only fills variables that are still unset,
# so values loaded from env.donotcommit or the system environment take precedence.
os.environ.setdefault("NMS_NAMESPACE", "anemo-rhoai")
os.environ.setdefault("RUN_LOCALLY", "false")
os.environ.setdefault("NDS_TOKEN", "token")
os.environ.setdefault("DATASET_NAME", "rag-tutorial-documents")
os.environ.setdefault("RAG_TOP_K", "5")
os.environ.setdefault("RAG_SIMILARITY_THRESHOLD", "0.3")
# NIM_SERVICE_ACCOUNT_TOKEN should come from env.donotcommit file, not hardcoded here

print("\n✅ Environment variables loaded")
print(f"   NMS_NAMESPACE: {os.environ.get('NMS_NAMESPACE')}")
print(f"   RUN_LOCALLY: {os.environ.get('RUN_LOCALLY')} (cluster mode for Workbench)")
print(f"   DATASET_NAME: {os.environ.get('DATASET_NAME')}")
print(f"\n💡 If you see connection errors, restart the kernel and run cells in order!")



In [None]:
# Install llama-stack-client from GitHub main (same as llamastack demo)
# This ensures compatibility with the latest server version
%pip install --upgrade git+https://github.com/meta-llama/llama-stack-client-python.git@main


In [None]:
# Install required packages
# Note: langchain is not needed - we use LlamaStack client for chat completions and direct HTTP requests for embeddings
%pip install requests jupyterlab python-dotenv numpy pandas llama-stack-client

# If running locally (outside cluster), set RUN_LOCALLY before importing config
# Uncomment the line below if you're running this notebook locally:
# import os; os.environ["RUN_LOCALLY"] = "true"


In [None]:
# Load configuration
from config import (
    NDS_URL, ENTITY_STORE_URL, GUARDRAILS_URL,
    NIM_CHAT_URL, NIM_EMBEDDING_URL,
    NIM_CHAT_URL_CLUSTER, NIM_EMBEDDING_URL_CLUSTER,
    NMS_NAMESPACE, DATASET_NAME, NDS_TOKEN,
    RAG_TOP_K, RAG_SIMILARITY_THRESHOLD,
    RUN_LOCALLY, LLAMASTACK_URL, NIM_SERVICE_ACCOUNT_TOKEN
)

print(f"✅ Configuration loaded")
print(f"Mode: {'Local (port-forward)' if RUN_LOCALLY else 'Cluster'}")
print(f"Data Store: {NDS_URL}")
print(f"Entity Store: {ENTITY_STORE_URL}")
print(f"Chat NIM: {NIM_CHAT_URL}")
print(f"Embedding NIM: {NIM_EMBEDDING_URL}")
print(f"LlamaStack: {LLAMASTACK_URL}")
print(f"Namespace: {NMS_NAMESPACE}")
print(f"Dataset: {DATASET_NAME}")

# Quick connectivity test
import requests
try:
    r = requests.get(f"{NDS_URL}/v1/datastore/namespaces", timeout=2)
    print(f"✅ Data Store connectivity: OK")
except Exception as e:
    print(f"⚠️  Data Store connectivity: FAILED - {e}")
    if RUN_LOCALLY:
        print(f"\n📡 Port-forward setup required for local mode:")
        print(f"   Run this in a terminal:")
        print(f"   ./port-forward.sh")
        print(f"\n   Or manually:")
        print(f"   oc port-forward -n {NMS_NAMESPACE} svc/nemodatastore-sample 8001:8000 &")
        print(f"   oc port-forward -n {NMS_NAMESPACE} svc/nemoentitystore-sample 8002:8000 &")
        print(f"   oc port-forward -n {NMS_NAMESPACE} svc/nemoguardrails-sample 8005:8000 &")
        print(f"   oc port-forward -n {NMS_NAMESPACE} svc/meta-llama3-1b-instruct 8006:8000 &")
        print(f"   oc port-forward -n {NMS_NAMESPACE} svc/nv-embedqa-1b-v2 8007:8000 &")
        print(f"   oc port-forward -n {NMS_NAMESPACE} svc/llamastack 8321:8321 &")
    else:
        print(f"   If running from outside cluster, set RUN_LOCALLY=true environment variable")
        print(f"   Or ensure you're running this notebook from within the cluster")

# Initialize LlamaStack client
try:
    from llama_stack_client import LlamaStackClient
    import logging
    
    # Suppress httpx INFO logs (500 errors and connection attempts are expected during fallback)
    logging.getLogger("httpx").setLevel(logging.WARNING)
    
    client = LlamaStackClient(base_url=LLAMASTACK_URL)
    # Test connectivity
    # Note: 404 on root endpoint is expected - it just means the service is reachable
    try:
        server_info = client._client.get("/")
        print(f"✅ LlamaStack connectivity: OK")
        try:
            client_version = client._client._version
            print(f"   LlamaStack client version: {client_version}")
        except Exception:
            # The version attribute is a private detail; ignore it if unavailable
            pass
    except Exception as e:
        # 404 is OK - it means service is reachable but root endpoint doesn't exist
        if "404" in str(e) or "Not Found" in str(e):
            print(f"✅ LlamaStack connectivity: OK (service reachable)")
        else:
            print(f"⚠️  LlamaStack connectivity: FAILED - {e}")
            if RUN_LOCALLY:
                print(f"   Make sure port-forward is active: oc port-forward -n {NMS_NAMESPACE} svc/llamastack 8321:8321")
            else:
                print(f"   Make sure LlamaStack is deployed: oc get pods -n {NMS_NAMESPACE} | grep llamastack")
            client = None
except ImportError:
    print("⚠️  LlamaStack client not available - install with: %pip install --upgrade git+https://github.com/meta-llama/llama-stack-client-python.git@main")
    print("   Continuing without LlamaStack integration...")
    client = None
except Exception as e:
    print(f"⚠️  LlamaStack initialization failed: {e}")
    print("   Continuing without LlamaStack integration...")
    client = None


## Step 1: Document Ingestion

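First, we'll prepare sample documents and make sure the target namespace exists in NeMo Data Store. To keep the tutorial self-contained, the documents themselves are held in memory and passed directly to the embedding and retrieval steps.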
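
If you want the documents on disk before sending them anywhere, one option is to write each document to its own `<id>.json` file, mirroring the file naming used in the upload cell. The sketch below is illustrative only: `write_documents` is a hypothetical helper, the directory layout is an assumption, and the actual Data Store upload call is deliberately left out.

```python
import json
import tempfile
from pathlib import Path


def write_documents(docs, dataset_dir):
    """Write each document to <dataset_dir>/<id>.json and return the paths."""
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for doc in docs:
        path = dataset_dir / f"{doc['id']}.json"
        # ensure_ascii=False keeps any non-ASCII text readable in the file
        path.write_text(
            json.dumps(doc, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        paths.append(path)
    return paths


# Usage with a throwaway directory and a placeholder document.
with tempfile.TemporaryDirectory() as tmp:
    written = write_documents(
        [{"id": "doc1", "title": "Example", "content": "Example text."}],
        Path(tmp) / "rag-tutorial-documents",
    )
    print([p.name for p in written])  # ['doc1.json']
```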


In [None]:
# Sample documents for RAG tutorial
documents = [
    {
        "id": "doc1",
        "title": "Introduction to NeMo Microservices",
        "content": "NVIDIA NeMo Microservices is a platform for deploying AI models at scale. It provides infrastructure for training, inference, and evaluation of large language models. The platform includes components like Data Store, Entity Store, Customizer, Evaluator, and Guardrails."
    },
    {
        "id": "doc2",
        "title": "RAG Architecture",
        "content": "Retrieval-Augmented Generation (RAG) combines information retrieval with language generation. The process involves: 1) Storing documents in a vector database, 2) Embedding user queries, 3) Retrieving relevant documents, 4) Generating responses using retrieved context."
    },
    {
        "id": "doc3",
        "title": "OpenShift Deployment",
        "content": "NeMo Microservices can be deployed on OpenShift using Helm charts. The deployment includes infrastructure components (PostgreSQL, MLflow, Argo Workflows) and instance components (NeMo services, NIM services). All components are namespace-scoped for multi-tenant safety."
    },
    {
        "id": "doc4",
        "title": "NIM Services",
        "content": "NVIDIA Inference Microservices (NIM) provide optimized inference for AI models. NIM services support chat models, embedding models, and reranking models. They are containerized and can be deployed on Kubernetes/OpenShift clusters with GPU support."
    },
    {
        "id": "doc5",
        "title": "Vector Databases",
        "content": "Vector databases store embeddings for similarity search. NeMo Entity Store provides vector storage capabilities. Milvus is also available as an alternative vector database. Both support efficient similarity search for RAG applications."
    }
]

print(f"✅ Prepared {len(documents)} sample documents")
for doc in documents:
    print(f"  - {doc['title']}")


In [None]:
# Upload documents to NeMo Data Store
import json

# Create namespace if it doesn't exist
namespace_url = f"{NDS_URL}/v1/datastore/namespaces/{NMS_NAMESPACE}"
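auth_headers = {"Authorization": f"Bearer {NDS_TOKEN}"}
try:
    response = requests.get(namespace_url, headers=auth_headers, timeout=10)
    if response.status_code == 404:
        # Create namespace
        response = requests.post(
            f"{NDS_URL}/v1/datastore/namespaces",
            json={"name": NMS_NAMESPACE},
            headers=auth_headers,
            timeout=10
        )
        # Only report success if the service accepted the request
        if response.ok:
            print(f"✅ Created namespace: {NMS_NAMESPACE}")
        else:
            print(f"⚠️  Could not create namespace: {response.status_code} - {response.text}")
    elif response.ok:
        print(f"✅ Namespace exists: {NMS_NAMESPACE}")
    else:
        print(f"⚠️  Unexpected response checking namespace: {response.status_code}")
except Exception as e:
    print(f"⚠️  Error checking namespace: {e}")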

# Upload documents
uploaded_docs = []
for doc in documents:
    try:
        # Create dataset entry
        file_url = f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/{doc['id']}.json"
        doc_data = {
            "id": doc['id'],
            "title": doc['title'],
            "content": doc['content']
        }
        
        # In a real scenario, you would upload to Data Store
        # For this tutorial, we'll store locally and use for embedding
        uploaded_docs.append(doc_data)
        print(f"✅ Prepared document: {doc['title']}")
    except Exception as e:
        print(f"⚠️  Error uploading {doc['id']}: {e}")

print(f"\n✅ Prepared {len(uploaded_docs)} documents for embedding")


## Step 2: Generate Embeddings

Now we'll generate embeddings for each document using the NeMo Embedding NIM service.
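
The request and response shapes used by the embedding cell can be looked at in isolation, without a running service. This is a minimal sketch: `build_embedding_payload` and `extract_embedding` are hypothetical helpers, the response body is made up, and the check on `input_type` only covers the two values this notebook uses.

```python
def build_embedding_payload(text, input_type="passage",
                            model="nvidia/llama-3.2-nv-embedqa-1b-v2"):
    """Build the JSON body posted to <embedding_url>/v1/embeddings."""
    # "passage" is used for documents and "query" for user questions
    if input_type not in {"passage", "query"}:
        raise ValueError(f"unexpected input_type: {input_type!r}")
    return {"input": text, "model": model, "input_type": input_type}


def extract_embedding(body):
    """Return the first embedding vector from a response body, or None."""
    data = body.get("data") or []
    if not data or "embedding" not in data[0]:
        return None
    return data[0]["embedding"]


# A made-up response body with the same shape the tutorial code reads.
fake_body = {"data": [{"embedding": [0.1, -0.2, 0.3]}]}
print(build_embedding_payload("What is RAG?", input_type="query"))
print(extract_embedding(fake_body))  # [0.1, -0.2, 0.3]
```

Documents are embedded with `input_type="passage"` and queries with `input_type="query"`, which is why the function in the next cell takes that value as a parameter.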


In [None]:
# Generate embeddings using NeMo Embedding NIM
# Note: We use direct NIM calls for embeddings as LlamaStack may not expose embeddings API directly
# Future enhancement: If LlamaStack adds embeddings API support, we can use client.embeddings.create()
def get_embedding(text, embedding_url, input_type="passage"):
    """Generate embedding for text using NeMo Embedding NIM"""
    try:
        headers = {"Content-Type": "application/json"}
        # Add Authorization header if token is provided
        if NIM_SERVICE_ACCOUNT_TOKEN:
            headers["Authorization"] = f"Bearer {NIM_SERVICE_ACCOUNT_TOKEN}"
        
        response = requests.post(
            f"{embedding_url}/v1/embeddings",
            json={
                "input": text,
                "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
                "input_type": input_type
            },
            headers=headers,
            timeout=30
        )
        if response.status_code == 200:
            return response.json()["data"][0]["embedding"]
        else:
            print(f"⚠️  Error getting embedding: {response.status_code} - {response.text}")
            return None
    except Exception as e:
        print(f"⚠️  Exception getting embedding: {e}")
        return None

# Generate embeddings for all documents
print("Generating embeddings...")
documents_with_embeddings = []

for doc in uploaded_docs:
    # Combine title and content for embedding
    text_to_embed = f"{doc['title']}\n{doc['content']}"
    embedding = get_embedding(text_to_embed, NIM_EMBEDDING_URL)
    
    if embedding:
        doc['embedding'] = embedding
        documents_with_embeddings.append(doc)
        print(f"✅ Generated embedding for: {doc['title']}")
    else:
        print(f"⚠️  Failed to generate embedding for: {doc['title']}")

print(f"\n✅ Generated embeddings for {len(documents_with_embeddings)} documents")


## Step 3: Store Embeddings Locally

For this tutorial, we'll store embeddings in memory for local similarity search.
In production, you can use a vector database like Milvus or Pinecone.

**Note**: NeMo Entity Store is primarily designed for managing models, datasets, and namespaces,
not for storing arbitrary document embeddings. For production RAG, consider using a dedicated vector database.
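
For a slightly larger document set, the in-memory approach can be organized as a small store that keeps all embeddings in one NumPy matrix. The class below is a sketch under stated assumptions: `InMemoryVectorStore` is a hypothetical name, not a NeMo or Milvus API, and it normalizes vectors on insert so that a search is a single matrix-vector product.

```python
import numpy as np


class InMemoryVectorStore:
    """Keeps unit-length embeddings in one matrix for cosine search."""

    def __init__(self):
        self.docs = []
        self.matrix = None  # shape (num_docs, dim); each row has norm 1

    def add(self, doc, embedding):
        vec = np.asarray(embedding, dtype=np.float64)
        norm = np.linalg.norm(vec)
        if norm == 0:
            raise ValueError("cannot store a zero-length embedding")
        row = (vec / norm)[np.newaxis, :]
        self.matrix = row if self.matrix is None else np.vstack([self.matrix, row])
        self.docs.append(doc)

    def search(self, query_embedding, top_k=5):
        """Return up to top_k (score, doc) pairs, best match first."""
        if self.matrix is None:
            return []
        q = np.asarray(query_embedding, dtype=np.float64)
        q_norm = np.linalg.norm(q)
        if q_norm == 0:
            raise ValueError("cannot search with a zero-length embedding")
        # Rows are unit vectors, so the dot product equals cosine similarity
        scores = self.matrix @ (q / q_norm)
        order = np.argsort(-scores)[:top_k]
        return [(float(scores[i]), self.docs[i]) for i in order]


store = InMemoryVectorStore()
store.add({"id": "a"}, [1.0, 0.0])
store.add({"id": "b"}, [0.0, 1.0])
print(store.search([0.9, 0.1], top_k=1)[0][1])  # {'id': 'a'}
```

This trades a little work at insert time for simpler queries. The tutorial cells keep a plain list of dictionaries instead, which is fine for five documents.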


In [None]:
# Embeddings are already stored in documents_with_embeddings list
# For this tutorial, we use in-memory storage for simplicity
print(f"✅ Stored {len(documents_with_embeddings)} documents with embeddings in memory")
print(f"   Documents ready for local similarity search")
print(f"   In production, use a vector database like Milvus or Pinecone")


## Step 4: Query and Retrieve

Now we'll process a user query: embed it, find similar documents, and retrieve the most relevant ones.
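
The score computed in the next cell is the cosine similarity between the query embedding $q$ and a document embedding $d$. A document is kept when it ranks among the top `RAG_TOP_K` results and its score is at least `RAG_SIMILARITY_THRESHOLD`. The vectors in the worked example are made up for illustration.

```latex
\mathrm{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}

% Toy example: q = (1, 0), d_1 = (1, 1), d_2 = (0, 1)
\mathrm{sim}(q, d_1) = \frac{1}{1 \cdot \sqrt{2}} \approx 0.707,
\qquad
\mathrm{sim}(q, d_2) = \frac{0}{1 \cdot 1} = 0
```

With the default threshold of 0.3, $d_1$ would be retrieved and $d_2$ would be dropped.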


In [None]:
# User query
user_query = "What is RAG and how does it work?"

print(f"User Query: {user_query}\n")

# Generate embedding for the query
query_embedding = get_embedding(user_query, NIM_EMBEDDING_URL, input_type="query")

if query_embedding:
    print(f"✅ Generated query embedding (dimension: {len(query_embedding)})\n")
    
    # Use local similarity search
    import numpy as np
    retrieved_docs = []
    similarities = []
    
    for doc in documents_with_embeddings:
        similarity = np.dot(query_embedding, doc['embedding']) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc['embedding'])
        )
        similarities.append((similarity, doc))
    
    # Sort by similarity and get top_k
    similarities.sort(reverse=True, key=lambda x: x[0])
    
    print(f"✅ Found {len(similarities)} documents, showing top {RAG_TOP_K}:\n")
    
    for i, (similarity, doc) in enumerate(similarities[:RAG_TOP_K], 1):
        if similarity >= RAG_SIMILARITY_THRESHOLD:
            print(f"{i}. {doc['title']} (similarity: {similarity:.3f})")
            retrieved_docs.append({
                'title': doc['title'],
                'content': doc['content'],
                'id': doc['id']
            })
    
    print(f"\n✅ Retrieved {len(retrieved_docs)} documents above threshold ({RAG_SIMILARITY_THRESHOLD})\n")
else:
    print("⚠️  Failed to generate query embedding")
    retrieved_docs = []


## Step 5: Generate Response

Now we'll use the retrieved documents as context to generate a response using the Chat NIM.
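
The prompt assembly can be separated from the model call so it is easy to inspect and test. The following sketch reuses the prompt wording from the cell below. `build_messages` and its `max_context_chars` budget are assumptions added for illustration, not settings defined by this tutorial.

```python
def build_messages(query, docs, max_context_chars=4000):
    """Assemble system and user chat messages from retrieved documents.

    Documents are added in order until the next one would push the context
    past max_context_chars (a hypothetical budget for this sketch).
    """
    parts, used = [], 0
    for doc in docs:
        block = f"Document: {doc['title']}\n{doc['content']}"
        extra = len(block) + (2 if parts else 0)  # 2 chars for the "\n\n" separator
        if used + extra > max_context_chars:
            break
        parts.append(block)
        used += extra
    context = "\n\n".join(parts)
    system_prompt = (
        "You are a helpful assistant. Answer the question based on the "
        "provided context. If the context doesn't contain enough "
        "information, say so."
    )
    user_prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages(
    "What is RAG?",
    [{"title": "RAG Architecture", "content": "RAG combines retrieval with generation."}],
)
print(messages[1]["content"])
```

The returned list has the same shape as the `messages` argument passed to the chat completion call in the next cell.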


In [None]:
# Build context from retrieved documents
context = "\n\n".join([
    f"Document: {doc['title']}\n{doc['content']}"
    for doc in retrieved_docs
])

print("Retrieved Context:")
print("=" * 80)
print(context[:500] + "..." if len(context) > 500 else context)
print("=" * 80)
print()

# Generate response using LlamaStack client only (no fallback)
def generate_response(query, context):
    """Generate response using LlamaStack client with retrieved context"""
    # Validate LlamaStack client is available
    if client is None:
        raise ValueError(
            "LlamaStack client not available. "
            "Check LlamaStack deployment and connectivity."
        )
    
    # Build prompt with context
    system_prompt = "You are a helpful assistant. Answer the question based on the provided context. If the context doesn't contain enough information, say so."
    user_prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
    
    try:
        response = client.chat.completions.create(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            model="nvidia/meta/llama-3.2-1b-instruct",
            temperature=0.7,
            max_tokens=500
        )
        return response.choices[0].message.content
    except Exception as e:
        error_msg = str(e)
        print(f"❌ LlamaStack error: {error_msg}")
        print("\n🔍 Troubleshooting steps:")
        print(f"1. Check LlamaStack pod is running:")
        print(f"   oc get pods -n {NMS_NAMESPACE} | grep llamastack")
        print(f"2. Check LlamaStack pod status:")
        print(f"   oc get pods -n {NMS_NAMESPACE} -l app=nemo-llamastack")
        print(f"3. Check LlamaStack logs for errors:")
        print(f"   oc logs -n {NMS_NAMESPACE} deployment/llamastack -c llamastack-ctr --tail=50")
        print(f"4. Verify token secret exists and is populated:")
        print(f"   oc get secret -n {NMS_NAMESPACE} | grep token-secret")
        print(f"   oc get secret <sa-name>-token-secret -n {NMS_NAMESPACE} -o jsonpath='{{.data.token}}' | base64 -d | head -c 20")
        print(f"5. Check service account exists:")
        print(f"   oc get sa -n {NMS_NAMESPACE} | grep model")
        print(f"6. Verify LlamaStack can reach NIM (check initContainer logs):")
        print(f"   oc logs -n {NMS_NAMESPACE} <pod-name> -c wait-for-token")
        
        # Provide specific guidance based on error type
        if "500" in error_msg or "Internal server error" in error_msg:
            print(f"\n💡 This is a 500 error - LlamaStack is having internal issues.")
            print(f"   Most likely causes:")
            print(f"   - Token secret not populated (check step 4)")
            print(f"   - LlamaStack can't authenticate to NIM")
            print(f"   - Check LlamaStack logs (step 3) for detailed error")
        elif "401" in error_msg or "Unauthorized" in error_msg:
            print(f"\n💡 This is a 401 error - Authentication failed.")
            print(f"   Most likely causes:")
            print(f"   - Token secret missing or empty")
            print(f"   - Service account doesn't exist")
            print(f"   - Check token secret (step 4)")
        elif "404" in error_msg or "Not Found" in error_msg:
            print(f"\n💡 This is a 404 error - Endpoint not found.")
            print(f"   Most likely causes:")
            print(f"   - LlamaStack not fully started")
            print(f"   - Wrong URL configured")
            print(f"   - Check pod status (step 2)")
        
        # Chain the original exception so the full traceback is preserved
        raise RuntimeError(f"LlamaStack request failed: {error_msg}") from e

# Generate response
print("Generating response...")
response_text = generate_response(user_query, context)

if response_text:
    print("\n" + "=" * 80)
    print("Generated Response:")
    print("=" * 80)
    print(response_text)
    print("=" * 80)
else:
    print("⚠️  Failed to generate response")


## Step 7: Validate RAG with Test Queries

Let's test the RAG pipeline with multiple questions to validate it's working correctly.
We'll ask questions about different topics from our documents and verify the responses.
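
Beyond checking that every query returns some answer, you can also check that retrieval surfaces the document you expect. The sketch below computes a simple hit rate. `retrieval_hit_rate` is a hypothetical helper, and both the expected mapping and the retrieval output are made-up inputs that only illustrate the data shapes.

```python
def retrieval_hit_rate(retrieved_ids_by_query, expected_id_by_query):
    """Fraction of queries whose expected document id was retrieved."""
    if not expected_id_by_query:
        return 0.0
    hits = sum(
        1
        for query, expected_id in expected_id_by_query.items()
        if expected_id in retrieved_ids_by_query.get(query, [])
    )
    return hits / len(expected_id_by_query)


# Hypothetical expectations, using the sample document ids from Step 1.
expected = {
    "How does RAG work?": "doc2",
    "What are vector databases used for?": "doc5",
}
# Made-up retrieval output, only to show the input shape.
retrieved = {
    "How does RAG work?": ["doc2", "doc1"],
    "What are vector databases used for?": ["doc4"],
}
print(retrieval_hit_rate(retrieved, expected))  # 0.5
```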

In [None]:
# Test queries covering different topics from our documents
test_queries = [
    "What is NeMo Microservices?",
    "How does RAG work?",
    "What are vector databases used for?",
    "What components does NeMo Microservices include?"
]

print("=" * 80)
print("RAG VALIDATION TESTS")
print("=" * 80)
print(f"Testing {len(test_queries)} queries against {len(documents_with_embeddings)} documents\n")

In [None]:
# Function to run a complete RAG query
def run_rag_query(query, show_context=True):
    """Run a complete RAG query and return the response"""
    print(f"\n{'='*80}")
    print(f"Query: {query}")
    print(f"{'='*80}")
    
    # Generate query embedding
    query_embedding = get_embedding(query, NIM_EMBEDDING_URL, input_type="query")
    
    if not query_embedding:
        print("⚠️  Failed to generate query embedding")
        return None
    
    # Local similarity search
    import numpy as np
    retrieved_docs = []
    similarities = []
    
    for doc in documents_with_embeddings:
        similarity = np.dot(query_embedding, doc['embedding']) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc['embedding'])
        )
        similarities.append((similarity, doc))
    
    # Sort by similarity and get top_k
    similarities.sort(reverse=True, key=lambda x: x[0])
    
    print(f"\n📊 Top {min(RAG_TOP_K, len(similarities))} retrieved documents:")
    for i, (similarity, doc) in enumerate(similarities[:RAG_TOP_K], 1):
        if similarity >= RAG_SIMILARITY_THRESHOLD:
            print(f"  {i}. {doc['title']} (similarity: {similarity:.3f})")
            retrieved_docs.append({
                'title': doc['title'],
                'content': doc['content'],
                'id': doc['id']
            })
    
    if not retrieved_docs:
        print(f"⚠️  No documents found above threshold ({RAG_SIMILARITY_THRESHOLD})")
        return None
    
    # Build context
    context = "\n\n".join([
        f"Document: {doc['title']}\n{doc['content']}"
        for doc in retrieved_docs
    ])
    
    if show_context:
        print(f"\n📄 Retrieved Context (first 300 chars):")
        print(f"{context[:300]}...")

    # Generate response using LlamaStack client (via generate_response function)
    print(f"\n🤖 Generating response...")
    response_text = generate_response(query, context)
    
    if response_text:
        print(f"\n✅ Response:")
        print(f"{response_text}")
        return response_text
    else:
        print(f"⚠️  Failed to generate response")
        return None

In [None]:
# Run all test queries
results = {}

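for query in test_queries:
    try:
        response = run_rag_query(query, show_context=True)
    except (RuntimeError, ValueError) as e:
        # generate_response raises on failure; record it so the summary still runs
        print(f"⚠️  Query failed: {e}")
        response = None
    results[query] = response
    print("\n" + "-"*80 + "\n")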

print("=" * 80)
print("VALIDATION SUMMARY")
print("=" * 80)
print(f"\nTotal queries tested: {len(test_queries)}")
print(f"Successful responses: {sum(1 for r in results.values() if r is not None)}")
print(f"Failed responses: {sum(1 for r in results.values() if r is None)}")

if all(r is not None for r in results.values()):
    print("\n✅ All queries returned responses! RAG pipeline is working correctly.")
else:
    print("\n⚠️  Some queries failed. Check the error messages above.")

## Summary

This tutorial walked through a RAG pipeline:
1. ✅ Document preparation and namespace setup in NeMo Data Store
2. ✅ Embedding generation using NeMo Embedding NIM
3. ✅ Vector storage (local in-memory for this tutorial)
4. ✅ Query processing with similarity search
5. ✅ Response generation using NeMo Chat NIM through the LlamaStack client
6. Optional guardrails (listed as a prerequisite service, not exercised in this notebook)
7. ✅ RAG validation with test queries

### Next Steps

- Add more documents to improve retrieval quality
- Experiment with different embedding models
- Adjust retrieval parameters (top_k, similarity threshold)
- Integrate with your own document sources
- Add multi-turn conversation support
- Use a production vector database (Milvus, Pinecone, etc.) for larger document sets