# 🚀 NVIDIA NIM RAG Demo

**Build a RAG Pipeline with NVIDIA NIM API**

This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using NVIDIA NIM inference microservices.

## What You'll Learn
- Using NVIDIA NIM API for LLM inference
- Creating embeddings with NV-Embed-QA
- Building a simple RAG pipeline
- Measuring latency and performance

**Requirements**: NVIDIA API key from [build.nvidia.com](https://build.nvidia.com)

---

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/QbitLoop/nvidia-nim-rag-demo/blob/main/notebooks/nvidia_nim_rag_colab.ipynb)
[![GitHub](https://img.shields.io/badge/GitHub-QbitLoop-blue)](https://github.com/QbitLoop/nvidia-nim-rag-demo)

## 1️⃣ Setup & Dependencies

In [None]:
# Install dependencies (scikit-learn is not needed; similarity is computed with numpy)
!pip install -q openai numpy

In [None]:
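# Import libraries
import os
import time
from getpass import getpass

import numpy as np
from openai import OpenAI

# Get the API key from Colab secrets (key icon in the left sidebar),
# then the NVIDIA_API_KEY environment variable, then an interactive prompt.
try:
    from google.colab import userdata
    NVIDIA_API_KEY = userdata.get('NVIDIA_API_KEY')
except Exception:  # not running in Colab, or the secret is missing or not shared
    NVIDIA_API_KEY = os.environ.get('NVIDIA_API_KEY') or getpass("Enter your NVIDIA API key: ")

print("✅ API key configured")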

## 2️⃣ Initialize NVIDIA NIM Client

NVIDIA NIM exposes an OpenAI-compatible API, so the standard `openai` Python client works once `base_url` points at the NIM endpoint.

In [None]:
# Initialize NIM client (OpenAI-compatible)
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=NVIDIA_API_KEY
)

# Model configuration
LLM_MODEL = "meta/llama-3.1-70b-instruct"
EMBED_MODEL = "nvidia/nv-embedqa-e5-v5"

print("✅ NIM client initialized")
print(f"   LLM: {LLM_MODEL}")
print(f"   Embeddings: {EMBED_MODEL}")

## 3️⃣ Test LLM Inference

Let's test the NIM LLM endpoint and measure latency.

In [None]:
def chat_with_nim(prompt: str, max_tokens: int = 256) -> tuple[str, float]:
    """Send a prompt to NIM and return response with latency."""
    start = time.time()
    
    response = client.chat.completions.create(
        model=LLM_MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.7
    )
    
    latency_ms = (time.time() - start) * 1000
    return response.choices[0].message.content, latency_ms

# Test the LLM
response, latency = chat_with_nim("What is NVIDIA NIM in one sentence?")

print(f"📝 Response: {response}")
print(f"\n⏱️ Latency: {latency:.0f}ms")
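
A single call gives one latency sample, and network jitter can skew it. As a minimal sketch, assuming a hypothetical helper named `time_call` (not part of the notebook above), any callable can be repeated and its timings collected with `time.perf_counter`, a monotonic clock meant for measuring intervals:

```python
import statistics
import time
from typing import Any, Callable


def time_call(fn: Callable[[], Any], repeats: int = 3) -> tuple[Any, list[float]]:
    """Call fn `repeats` times; return the last result and per-call latencies in ms."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    latencies_ms = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return result, latencies_ms


# Stand-in workload so the sketch runs without an API key
result, samples = time_call(lambda: sum(range(10_000)), repeats=5)
print(f"median: {statistics.median(samples):.3f} ms over {len(samples)} runs")
```

With a live client, the same helper could wrap the LLM call, for example `time_call(lambda: chat_with_nim("What is NVIDIA NIM in one sentence?"))`.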

## 4️⃣ Create Embeddings

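NVIDIA's NV-Embed-QA model (`nvidia/nv-embedqa-e5-v5`) produces embeddings for retrieval. It is asymmetric: pass `input_type="passage"` when indexing documents and `input_type="query"` when embedding a search query.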

In [None]:
def get_embedding(text: str, input_type: str = "passage") -> tuple[list, float]:
    """Get embedding for text using NVIDIA NV-Embed-QA."""
    start = time.time()
    
    response = client.embeddings.create(
        model=EMBED_MODEL,
        input=[text],
        extra_body={"input_type": input_type, "truncate": "END"}
    )
    
    latency_ms = (time.time() - start) * 1000
    return response.data[0].embedding, latency_ms

# Test embedding
test_text = "NVIDIA NIM is an inference microservice for deploying AI models."
embedding, latency = get_embedding(test_text)

print(f"📊 Embedding dimension: {len(embedding)}")
print(f"⏱️ Latency: {latency:.0f}ms")
print(f"\n📈 First 5 values: {embedding[:5]}")

## 5️⃣ Build Simple RAG Pipeline

Now let's build a complete RAG system with:
- Document storage (in-memory)
- Semantic search
- Augmented generation

In [None]:
class SimpleRAG:
    """Simple RAG implementation using NVIDIA NIM."""
    
    def __init__(self):
        self.documents = []
        self.embeddings = []
    
    def add_documents(self, docs: list[str]):
        """Add documents to the knowledge base."""
        print(f"📚 Adding {len(docs)} documents...")
        for doc in docs:
            embedding, _ = get_embedding(doc, input_type="passage")
            self.documents.append(doc)
            self.embeddings.append(embedding)
        print(f"✅ {len(self.documents)} documents indexed")
    
    def search(self, query: str, top_k: int = 3) -> list[tuple[str, float]]:
        """Search for relevant documents."""
        # Get query embedding (use "query" type for asymmetric search)
        query_embedding, _ = get_embedding(query, input_type="query")
        
        # Compute cosine similarity
        query_np = np.array(query_embedding)
        scores = []
        for doc_emb in self.embeddings:
            doc_np = np.array(doc_emb)
            similarity = np.dot(query_np, doc_np) / (np.linalg.norm(query_np) * np.linalg.norm(doc_np))
            scores.append(similarity)
        
        # Get top-k
        top_indices = np.argsort(scores)[-top_k:][::-1]
        return [(self.documents[i], scores[i]) for i in top_indices]
    
    def query(self, question: str) -> tuple[str, list, float]:
        """Run RAG query: retrieve + generate."""
        start = time.time()
        
        # Retrieve relevant documents
        relevant_docs = self.search(question, top_k=3)
        
        # Build context
        context = "\n\n".join([f"[{i+1}] {doc}" for i, (doc, _) in enumerate(relevant_docs)])
        
        # Generate response with context
        prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: {question}

Answer (cite sources using [1], [2], etc.):"""
        
        response, _ = chat_with_nim(prompt, max_tokens=512)
        
        total_latency = (time.time() - start) * 1000
        return response, relevant_docs, total_latency

# Initialize RAG
rag = SimpleRAG()
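
The `search` method scores one document at a time in a Python loop. For larger collections, the same cosine similarity can be computed in one matrix-vector product. This is a sketch rather than a drop-in replacement: `top_k_cosine` is a hypothetical helper, and the 2-D vectors are toy stand-ins for real embeddings.

```python
import numpy as np


def top_k_cosine(query: np.ndarray, doc_matrix: np.ndarray, top_k: int = 3) -> list[tuple[int, float]]:
    """Return (row index, cosine similarity) for the top_k rows of doc_matrix."""
    # Normalize once so a single matrix-vector product yields cosine similarity
    doc_unit = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = doc_unit @ query_unit
    # Sort descending and keep the best top_k row indices
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-D vectors standing in for real embeddings
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.0])
hits = top_k_cosine(query, docs, top_k=2)
print(hits)
```

Inside `SimpleRAG`, `np.array(self.embeddings)` would play the role of `doc_matrix`, and the returned indices would map back into `self.documents`.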

In [None]:
# Add sample documents about NVIDIA
nvidia_docs = [
    "NVIDIA NIM is an inference microservice that enables easy deployment of AI models with optimized performance. It provides pre-built containers for popular models.",
    "NVIDIA DGX Cloud offers AI supercomputing in the cloud, providing access to NVIDIA's latest hardware like H100 GPUs for training and inference workloads.",
    "NVIDIA Nemotron is a family of large language models trained by NVIDIA, optimized for enterprise use cases and available through NIM.",
    "NVIDIA TensorRT is a high-performance deep learning inference optimizer and runtime that delivers low latency and high throughput for production deployments.",
    "NVIDIA CUDA is a parallel computing platform that enables developers to use NVIDIA GPUs for general-purpose computing, dramatically accelerating AI workloads.",
    "NVIDIA Triton Inference Server is an open-source software that simplifies deployment of AI models at scale, supporting multiple frameworks and model formats."
]

rag.add_documents(nvidia_docs)

## 6️⃣ Test RAG Queries

In [None]:
# Test RAG query
question = "What is NVIDIA NIM and how does it help with AI deployment?"

response, sources, latency = rag.query(question)

print(f"❓ Question: {question}")
print(f"\n💬 Answer:\n{response}")
print("\n📚 Sources:")
for i, (doc, score) in enumerate(sources, 1):
    print(f"   [{i}] (score: {score:.3f}) {doc[:80]}...")
print(f"\n⏱️ Total latency: {latency:.0f}ms")

In [None]:
# Try another query
question = "How can I optimize AI inference performance with NVIDIA?"

response, sources, latency = rag.query(question)

print(f"❓ Question: {question}")
print(f"\n💬 Answer:\n{response}")
print(f"\n⏱️ Total latency: {latency:.0f}ms")

## 7️⃣ Performance Summary

In [None]:
# Benchmark multiple queries
test_queries = [
    "What is NVIDIA NIM?",
    "How do I deploy models with NVIDIA?",
    "What hardware does NVIDIA offer for AI?"
]

latencies = []
for q in test_queries:
    _, _, lat = rag.query(q)
    latencies.append(lat)
    print(f"✅ '{q}' - {lat:.0f}ms")

print("\n📊 Performance Summary:")
print(f"   Avg Latency: {np.mean(latencies):.0f}ms")
print(f"   Min Latency: {np.min(latencies):.0f}ms")
print(f"   Max Latency: {np.max(latencies):.0f}ms")
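
Mean, min, and max can hide a slow tail, so percentiles are a common complement. A minimal sketch, assuming a hypothetical helper `summarize_latencies`; the sample values are made up for illustration and are not measured results:

```python
import numpy as np


def summarize_latencies(latencies_ms: list[float]) -> dict[str, float]:
    """Return the mean, median (p50), and 95th percentile of latency samples in ms."""
    arr = np.asarray(latencies_ms, dtype=float)
    if arr.size == 0:
        raise ValueError("need at least one latency sample")
    return {
        "mean": float(arr.mean()),
        "p50": float(np.percentile(arr, 50)),
        "p95": float(np.percentile(arr, 95)),
    }


# Made-up sample values, not measurements from NIM
stats = summarize_latencies([100.0, 200.0, 300.0, 400.0, 1000.0])
for name, value in stats.items():
    print(f"   {name}: {value:.0f}ms")
```

With only three benchmark queries the percentiles carry little information; they become useful once the list of samples grows.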

---

## 🎉 Summary

You've built a RAG pipeline using NVIDIA NIM with:

| Component | Technology |
|-----------|------------|
| **LLM** | Llama 3.1 70B via NIM |
| **Embeddings** | NV-Embed-QA E5 v5 |
| **Vector Store** | In-memory (numpy) |
| **Search** | Cosine similarity |

### Next Steps
- Add persistent vector storage (pgvector, FAISS)
- Implement document chunking
- Add citation tracking
- Deploy with FastAPI
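
As one possible starting point for the chunking step, the sketch below splits text into overlapping word windows. The helper name `chunk_words` and the default sizes are assumptions for illustration; production chunkers often split on sentences or tokens instead.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of up to chunk_size words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Ten placeholder words, chunked four at a time with one word of overlap
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk could then be passed to `rag.add_documents` in place of a whole document.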

### Resources
- [NVIDIA NIM Documentation](https://docs.nvidia.com/nim/)
- [Full Demo Repo](https://github.com/QbitLoop/nvidia-nim-rag-demo)
- [NVIDIA Build](https://build.nvidia.com)

---

*Built by [QbitLoop](https://github.com/QbitLoop) | MIT License*