# RAG with NVIDIA NIM Microservices

Welcome to this lab! In this notebook, you'll learn how to build a Retrieval-Augmented Generation (RAG) pipeline using NVIDIA NIM microservices, hosted on the [NVIDIA API Catalog](https://build.nvidia.com/models).

We'll walk through the following:

- Connecting to an NVIDIA-hosted LLM (Llama3.1-8b-instruct) using [LangChain's NVIDIA AI Endpoints integration](https://python.langchain.com/docs/integrations/chat/nvidia_ai_endpoints/).
- Creating a vector store from custom documents using FAISS and GPU-accelerated NVIDIA-hosted embedding models (NV-Embed-QA).
- Running intelligent, context-aware chat chains over the embedded documents.

This lab uses real-world data from the NVIDIA NIM documentation to demonstrate a practical RAG implementation.

Let's get started!

---

## Architecture Diagram

<img src="./RAG_WITH_NIMS.png" alt="RAG Architecture" width="600"/>

---

> **Note:**  
To run this notebook, you’ll need an account on both the [NVIDIA API Catalog (build.nvidia.com)](https://build.nvidia.com/) and the [NVIDIA NGC Catalog](https://catalog.ngc.nvidia.com/).

- **NVIDIA API Catalog** is a platform where you can access hosted NVIDIA models as API endpoints, including LLMs, embedding models, and more. You’ll generate a personal API key here to authenticate requests.
- **NGC (NVIDIA GPU Cloud) Catalog** provides access to enterprise-grade AI software, models, and containers. It’s required for enabling API access to certain models via NIM.

Make sure you're logged in to both and have generated your API key from the [NVIDIA API Keys page](https://build.nvidia.com/settings/api-keys) before continuing.


## Environment Setup
We’ll begin by installing the necessary Python packages for building our RAG pipeline.

- `langchain-community` and `langchain-nvidia-ai-endpoints` are used to integrate with NVIDIA’s hosted LLMs via LangChain.
- `faiss-cpu` is used for creating a local vector store. If you’re running on a GPU, you can optionally use `faiss-gpu`.
- `beautifulsoup4` is used to parse HTML content from web pages, which we’ll embed later.

> Note: Make sure your `pip` is up to date before installing the packages.

In [None]:
!pip install --upgrade pip
!pip install langchain-community==0.2.5 -q
!pip install langchain-nvidia-ai-endpoints==0.1.2 -q
!pip install faiss-cpu -q  # or faiss-gpu if you have a GPU
!pip install beautifulsoup4 -q

> Note: Make sure you restart the kernel after installing the pip packages.
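
After the restart, it can help to confirm which package versions were actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and a package that is not installed is reported as `None`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a dict mapping each package name to its version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print(installed_versions([
    "langchain-community",
    "langchain-nvidia-ai-endpoints",
    "faiss-cpu",
    "beautifulsoup4",
]))
```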

## Set Up Your NVIDIA API Key
To authenticate with the NVIDIA API Catalog, you need to set your personal API key as an environment variable. This key allows you to access the hosted models via LangChain.

If you haven’t already, generate your key from the [NVIDIA API Keys page](https://build.nvidia.com/settings/api-keys), and replace the placeholder below with your actual key.

> **Important:** Never share your API key publicly or commit it to source control.

In [12]:
import os
os.environ['NVIDIA_API_KEY'] = '<YOUR_NVIDIA_API_KEY_HERE>'  # paste the key from the NVIDIA API Keys page

### Verify the API Key
Let’s confirm that the `NVIDIA_API_KEY` environment variable is set. The command below prints the key stored in the variable; if it prints an empty line, double-check the previous step. Because the key appears in the cell output, clear that output before sharing or committing the notebook.

In [None]:
!echo $NVIDIA_API_KEY
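
Printing the full key leaves it in the notebook output. As an alternative, here is a minimal sketch that confirms the variable is set while showing only its first few characters. The helper name `mask_key` and the number of visible characters are hypothetical choices, not part of any NVIDIA tooling.

```python
import os


def mask_key(key, visible=4):
    """Show only the first `visible` characters of a secret and hide the rest."""
    if not key:
        return "<not set>"
    return key[:visible] + "*" * max(len(key) - visible, 0)


# Prints "<not set>" if the environment variable is missing or empty
print(mask_key(os.environ.get("NVIDIA_API_KEY", "")))
```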

## Test API Call to Llama Model
Let’s test the connection to the NVIDIA-hosted Llama model by making a simple API call. This will send a prompt ("What is API?") to the model and retrieve a response.

If successful, the model will return an answer based on the question, and you'll be able to confirm that your API key and setup are working correctly.

> If you encounter any errors, make sure that your API key is correctly set and has the proper permissions.
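
For reference, the request sent by the `curl` call in this section can be sketched in Python with the standard library. The helper names `build_chat_request` and `extract_reply` are hypothetical, and the reply parsing assumes the endpoint returns the OpenAI-style chat-completions shape (`choices[0].message.content`). The sample reply is hand-written for illustration and is not real model output.

```python
import json
import os
import urllib.request

API_URL = "https://integrate.api.nvidia.com/v1/chat/completions"


def build_chat_request(prompt, api_key):
    """Build (but do not send) the same POST request as the curl example."""
    payload = {
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 24,
        "top_p": 0.7,
        "temperature": 0.2,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_json):
    """Pull the assistant text out of a reply (assumed OpenAI-style shape)."""
    return response_json["choices"][0]["message"]["content"]


request = build_chat_request("what is api?", os.environ.get("NVIDIA_API_KEY", ""))

# To actually send it (needs network access and a valid key):
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(extract_reply(json.load(response)))

# Hand-written placeholder reply, used only to show the parsing step
sample_reply = {"choices": [{"message": {"role": "assistant", "content": "An API is ..."}}]}
print(extract_reply(sample_reply))
```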

In [None]:
!curl --request POST \
  --url https://integrate.api.nvidia.com/v1/chat/completions \
  --header "Authorization: Bearer $NVIDIA_API_KEY" \
  --header "Accept: application/json" \
  --header "Content-Type: application/json" \
  --data '{"messages": [{"role": "user", "content": "what is api?"}], \
"model": "meta/llama-3.1-8b-instruct", "max_tokens": 24, "top_p": 0.7, "temperature": 0.2}'
                                

## Initialize LLM with LangChain
Now, we'll use the `ChatNVIDIA` class from the `langchain_nvidia_ai_endpoints` package to connect to the NVIDIA-hosted Llama model. We specify the base URL, the model name, and parameters such as `temperature`, `max_tokens`, and `top_p` to control the model's output.

In this test, we’ll ask the model, "What is the capital of France?" and print the result.

If everything is set up correctly, the model should return "Paris" as the answer.

In [None]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(base_url="https://integrate.api.nvidia.com/v1", model="meta/llama-3.1-8b-instruct", temperature=0.1, max_tokens=1000, top_p=1.0)

result = llm.invoke("What is the capital of France?")
print(result.content)

## Importing Required Libraries

We’ll import a variety of libraries needed for this lab:

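- **LangChain Components**: These include the `ConversationalRetrievalChain` and `LLMChain` chain classes, the `load_qa_chain` helper, the `CONDENSE_QUESTION_PROMPT` and `QA_PROMPT` prompt templates, and `ConversationBufferMemory`, which together handle question answering over our embedded documents.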
- **FAISS**: This is used to create and manage our vector store, where we will store and query document embeddings.
- **Text Splitter**: We’ll use `RecursiveCharacterTextSplitter` to break down text into smaller chunks for embedding.
- **ChatNVIDIA and NVIDIAEmbeddings**: These allow us to interact with the NVIDIA LLM and to generate embeddings using NVIDIA's API.

These libraries will help us build the RAG pipeline with both retrieval and generation capabilities.


In [5]:
import os
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT, QA_PROMPT
from langchain.chains.question_answering import load_qa_chain
from langchain.memory import ConversationBufferMemory
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

## HTML Document Loader

In this step, we define a function `html_document_loader` that retrieves the plain text from an HTML document at a given URL.

The function does the following:
- Makes an HTTP request to load the HTML content using `requests`.
- Uses **BeautifulSoup** to parse the HTML, removing `script` and `style` tags that aren't relevant for text extraction.
- Extracts and cleans up the text by removing excess whitespace.

This function will be useful for fetching and processing web page content, which we’ll later embed into the vector store.


In [6]:
import re
from typing import List, Union

import requests
from bs4 import BeautifulSoup

def html_document_loader(url: Union[str, bytes]) -> str:
    """
    Load the HTML document at a given URL and return its visible text.

    Args:
        url: The URL of the document.

    Returns:
        The plain-text content of the document, or an empty string if the
        page could not be fetched or parsed (the error is printed, not raised).

    """
    try:
        # Time out instead of hanging, and treat HTTP error codes as failures
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        html_content = response.text
    except Exception as e:
        print(f"Failed to load {url} due to exception {e}")
        return ""

    try:
        # Create a Beautiful Soup object to parse html
        soup = BeautifulSoup(html_content, "html.parser")

        # Remove script and style tags
        for script in soup(["script", "style"]):
            script.extract()

        # Get the plain text from the HTML document
        text = soup.get_text()

        # Remove excess whitespace and newlines
        text = re.sub(r"\s+", " ", text).strip()

        return text
    except Exception as e:
        print(f"Exception {e} while loading document")
        return ""

## Create Embeddings from Web Pages

The `create_embeddings` function performs the following steps:

1. **Fetch Web Content**: We define a list of URLs (in this case, NVIDIA’s documentation) and use the previously defined `html_document_loader` function to fetch and process the HTML content.
2. **Text Splitting**: We use **RecursiveCharacterTextSplitter** to break the raw text into chunks, ensuring that each chunk is within a specified size for easier processing and embedding.
3. **Generate Embeddings**: After splitting the text into manageable chunks, we call `index_docs` to generate embeddings using the provided `embedding_model` and store them in a specified directory.

This process converts the textual data from web pages into vector embeddings that can later be queried in a vector store for RAG-based question answering.
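
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-width splitter in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph breaks and spaces). The function name `split_fixed` is hypothetical and only illustrates how the two parameters interact.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij"
print(split_fixed(sample, chunk_size=4))
# ['abcd', 'efgh', 'ij']
print(split_fixed(sample, chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With `chunk_overlap=0`, as in this lab, no text is shared between neighboring chunks; a small overlap can help when a sentence straddles a chunk boundary, at the cost of embedding some text twice.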


In [None]:
def create_embeddings(embedding_model, embedding_path: str = "./data/nv_embedding"):
    # Parameters without defaults must come first, so embedding_model leads
    print(f"Storing embeddings to {embedding_path}")

    # List of web pages containing NVIDIA NIM technical documentation
    urls = [
         "https://docs.nvidia.com/nim/large-language-models/latest/introduction.html",
    ]

    documents = []
    for url in urls:
        document = html_document_loader(url)
        # The loader returns "" on failure; skip pages that could not be fetched
        if document:
            documents.append(document)

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=0,
        length_function=len,
    )
    texts = text_splitter.create_documents(documents)
    # index_docs is defined in the next cell; the URL argument is informational only
    index_docs(urls[-1], embedding_model=embedding_model, splitter=text_splitter, documents=texts, dest_embed_dir=embedding_path)
    print("Generated embedding successfully")

## Index Documents and Create Embeddings

The `index_docs` function handles the creation and storage of embeddings for the given documents.

Here’s what happens in the function:
1. **Text Splitting**: The function splits the document into smaller chunks using the provided `splitter`.
2. **Embedding Generation**: For each chunk, embeddings are generated using the specified `embedding_model` (e.g., NV-Embed-QA).
3. **FAISS Vector Store**: 
   - If the destination directory for embeddings already exists, it loads the existing FAISS vector store and updates it with the new embeddings.
   - If the directory doesn’t exist, it creates a new FAISS index, stores the embeddings, and saves it locally.

The embeddings are stored in the specified directory, ready for querying later.


In [None]:
from langchain_core.documents import Document

def index_docs(url: Union[str, bytes], embedding_model, splitter, documents: List[Document], dest_embed_dir) -> None:
    """
    Split the documents into chunks and create embeddings for them.

    Args:
        url: Source URL for the documents (not used inside the function).
        embedding_model: Embedding model used to vectorize the chunks.
        splitter: Splitter used to split each document.
        documents: List of LangChain documents whose embeddings need to be created.
        dest_embed_dir: Destination directory for the FAISS index.

    Returns:
        None
    """
    embeddings = embedding_model

    for document in documents:
        texts = splitter.split_text(document.page_content)

        # attach the document's metadata to every chunk (one entry per chunk)
        metadatas = [document.metadata] * len(texts)

        # create embeddings and add to vector store
        if os.path.exists(dest_embed_dir):
            update = FAISS.load_local(folder_path=dest_embed_dir, embeddings=embeddings, allow_dangerous_deserialization=True)
            update.add_texts(texts, metadatas=metadatas)
            update.save_local(folder_path=dest_embed_dir)
        else:
            docsearch = FAISS.from_texts(texts, embedding=embeddings, metadatas=metadatas)
            docsearch.save_local(folder_path=dest_embed_dir)

## Generate and Store Embeddings

Here, we initialize the **NVIDIAEmbeddings** class with the "NV-Embed-QA" model and `truncate="END"`, so inputs longer than the model's maximum length are cut off at the end rather than rejected. NV-Embed-QA is an embedding model intended for retrieval in question-answering workloads.

After initializing the model, we call the `create_embeddings` function to fetch the content from the provided URLs, split the text into chunks, and generate embeddings for each chunk. These embeddings are then stored in a local directory for later use.

The embeddings are key to enabling effective retrieval from the vector store in our RAG pipeline.


In [None]:
embedding_model = NVIDIAEmbeddings(model="NV-Embed-QA", truncate="END")
create_embeddings(embedding_model=embedding_model)

## Load Pre-generated Embeddings

In this step, we load the pre-generated embeddings from the local FAISS vector store. The `FAISS.load_local` method is used to load the embeddings from the specified folder (`embedding_path`), which we created earlier.

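We pass the same `embedding_model` that was used to build the index so that queries are embedded in the same vector space. The `allow_dangerous_deserialization=True` argument is required because the saved index includes a pickle file, and unpickling can execute arbitrary code; enable it only for index files you created yourself or otherwise trust.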

Now that the embeddings are loaded, we can perform efficient similarity searches against them in the next steps.
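
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The toy sketch below uses hand-made 3-dimensional vectors and cosine similarity in plain Python. Real embeddings come from the embedding model and have many more dimensions, and FAISS may use a different distance metric and much faster index structures. The names `cosine` and `top_k` and the sample vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored vectors most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up (text, vector) pairs standing in for embedded chunks
store = [
    ("chunk about NIM containers", [0.9, 0.1, 0.0]),
    ("chunk about FAISS indexes", [0.1, 0.9, 0.0]),
    ("chunk about something unrelated", [0.0, 0.1, 0.9]),
]

print(top_k([1.0, 0.2, 0.0], store, k=2))
# ['chunk about NIM containers', 'chunk about FAISS indexes']
```

With the loaded store, the corresponding LangChain call would be along the lines of `docsearch.similarity_search("What is NIM?", k=2)`, which returns the matching documents.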


In [25]:
# Load the saved FAISS index from disk
embedding_path = "./data/nv_embedding"
docsearch = FAISS.load_local(folder_path=embedding_path, embeddings=embedding_model, allow_dangerous_deserialization=True)

## Set Up Conversational Retrieval Chain

In this step, we initialize the **ConversationalRetrievalChain** to enable question answering over our document store. Here’s what happens:

1. **LLM Initialization**: We use the **ChatNVIDIA** class to set up the Llama 3.1 model as our language model, specifying parameters like `temperature`, `max_tokens`, and `top_p` for controlling the output generation.
2. **Memory**: A **ConversationBufferMemory** is created to maintain the chat history across multiple interactions, allowing for more contextualized responses.
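3. **QA Prompt**: We reuse LangChain's built-in **`QA_PROMPT`** template, which asks the model to answer the question from the retrieved context.
4. **QA Chain**: `load_qa_chain` builds a standalone "stuff" question-answering chain from the LLM and the QA prompt. Note that `doc_chain` is not passed to the conversational chain below, which builds its own document chain from `combine_docs_chain_kwargs`.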
5. **Conversational Chain**: The **ConversationalRetrievalChain** combines the LLM and the document retriever (FAISS in our case). This chain will take user input, retrieve relevant documents, and generate answers based on both the documents and the chat history.

This setup allows us to perform intelligent, context-aware Q&A on our embedded documents.
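
The "stuff" strategy named in the steps above can be sketched without any model calls: the retrieved chunks are concatenated into one context block and placed in the prompt next to the question. The template wording and the function name `build_stuff_prompt` are hypothetical (LangChain's actual `QA_PROMPT` text differs), and the chunks are placeholder strings.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for what the retriever would return
chunks = ["First retrieved chunk (placeholder).", "Second retrieved chunk (placeholder)."]
print(build_stuff_prompt("What is NIM?", chunks))
```

Because every retrieved chunk goes into a single prompt, this strategy only works when the chunks fit in the model's context window, which is one reason the documents are split into small pieces before embedding.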


In [26]:
llm = ChatNVIDIA(base_url="https://integrate.api.nvidia.com/v1", model="meta/llama-3.1-8b-instruct", temperature=0.1, max_tokens=1000, top_p=1.0)

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

qa_prompt=QA_PROMPT

doc_chain = load_qa_chain(llm, chain_type="stuff", prompt=QA_PROMPT)

qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=docsearch.as_retriever(),
    chain_type="stuff",
    memory=memory,
    combine_docs_chain_kwargs={'prompt': qa_prompt},
)

## Test the Retrieval Chain

Now that we've set up the conversational retrieval chain, we test it by asking the model a question: **"What are NIMS?"**.

- The model will retrieve relevant documents from the FAISS vector store and generate an answer based on both the retrieved documents and the chat history.
- The result is printed, which should provide a response to the question.

This step demonstrates the full functionality of the Retrieval-Augmented Generation (RAG) pipeline: combining document retrieval with generative language models to answer user queries.


In [None]:
query = "What are NIMS?"
result = qa.invoke({"question": query})
print(result.get("answer"))

## Test the Retrieval Chain with Another Query

In [None]:
query = "Brief about its architecture"
result = qa.invoke({"question": query})
print(result.get("answer"))

The next query is a follow-up: "its" only makes sense in light of the earlier questions, so the chain relies on the chat history to interpret it. As before, the answer is grounded in the chunks retrieved from the vector store; the model was not trained on these documents.

In [None]:
query = "What about its applications?"
result = qa.invoke({"question": query})
print(result.get("answer"))

## Summary and Next Steps

In this lab, we demonstrated how to build a simple Retrieval-Augmented Generation (RAG) pipeline using **NVIDIA NIM microservices**. We integrated NVIDIA-hosted LLMs and embedding models via LangChain, created a local FAISS vector store using content from the NVIDIA NIM documentation, and built a conversational retrieval chain to interact with that knowledge base.

This setup showcases the power of combining hosted foundation models with custom data to deliver intelligent, context-aware responses.

### What's Next?
- Try different NIM-hosted models.
- Explore the [NVIDIA API Catalog](https://build.nvidia.com/models).