# Deploying The NVIDIA RAG Blueprint

In this notebook, we will deploy a Retrieval-Augmented Generation (RAG) powered AI Chatbot. This system enhances traditional large language models by incorporating external knowledge, allowing the model to provide more accurate and contextually relevant responses. The system works in two key stages:

**Data Ingestion Pipeline**
- Ingests and processes enterprise data: Parses user documents, including detecting and extracting graphic elements such as tables, charts, and infographics
- Creates embeddings: Converts text into vector representations that capture semantic meaning
- Builds vector database: Stores these embeddings in a searchable database for efficient retrieval

**Query / Response Pipeline**
- Embeds user queries: Converts user questions into vector embeddings
- Retrieves relevant context: Finds semantically similar documents in the vector database
- Reranks results: Prioritizes the most relevant information
- Generates responses: Uses an LLM to craft comprehensive responses based on retrieved context

Both data and queries are encoded as vectors through an embedding process, enabling efficient similarity search based on semantic meaning rather than simple keyword matching. We'll use the [NVIDIA RAG Blueprint](https://build.nvidia.com/nvidia/build-an-enterprise-rag-pipeline) to set up the RAG service. The service is built from the code in the official [GitHub repository](https://github.com/NVIDIA-AI-Blueprints/rag/tree/main?tab=readme-ov-file), and [this list](https://github.com/NVIDIA-AI-Blueprints/rag/tree/main?tab=readme-ov-file#software-components) shows the models used in the pipeline.
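To make the idea of similarity search concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and document names are invented for illustration; the actual pipeline relies on the embedding model and Milvus rather than this brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k document ids whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy "embeddings"; real models emit far larger vectors.
toy_store = {
    "gpu_guide": [0.9, 0.1, 0.0],
    "hr_policy": [0.0, 0.2, 0.9],
    "cuda_notes": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]
print(top_k(query, toy_store))
```

The query vector points in roughly the same direction as the two GPU-related documents, so they rank ahead of the unrelated one even though no keywords are compared.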

As an additional step, we'll use models hosted on [build.nvidia.com](https://build.nvidia.com/). By default, the microservices we deploy expect locally hosted NVIDIA NIMs. To simplify this playbook and to ensure users can run Tokkio + RAG on the same instance, we'll instead use models hosted on NVIDIA Foundation Endpoints through build.nvidia.com. The only service in the pipeline that uses GPU resources is Milvus, a GPU-accelerated vector database that stores the embedding vectors.

## Import Dependencies

In [1]:
import os

### Set Secrets

An NVIDIA API key is needed to access resources from NGC. It is also used to interact with the models deployed on [build.nvidia.com](https://build.nvidia.com/).

In [19]:
os.environ["NVIDIA_API_KEY"] = "<YOUR_NVIDIA_API_KEY>"
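If you would rather not paste the key into the notebook itself, one option is to prompt for it only when the environment does not already hold a real value. This is a sketch, not part of the blueprint; the helper name `resolve_api_key` is hypothetical.

```python
import os
from getpass import getpass

PLACEHOLDER = "<YOUR_NVIDIA_API_KEY>"


def resolve_api_key(env, prompt=getpass):
    """Return a usable key from env, prompting only if it is missing or a placeholder."""
    key = env.get("NVIDIA_API_KEY", "").strip()
    if not key or key == PLACEHOLDER:
        # getpass hides the typed value, so it never appears in the saved notebook.
        key = prompt("Enter your NVIDIA API key: ").strip()
    if not key:
        raise ValueError("NVIDIA_API_KEY is empty")
    return key


# Usage in the notebook (prompts interactively only when needed):
# os.environ["NVIDIA_API_KEY"] = resolve_api_key(os.environ)
```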

### Login To `nvcr.io` Docker Registry

We need to log in to the NGC registry to pull the container images that will be deployed later. The code block below logs in to the `nvcr.io` registry. Make sure you have a valid NGC key, or the login will fail:

In [3]:
%%bash
echo "${NVIDIA_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin


https://docs.docker.com/go/credential-store/



Login Succeeded


### Clone NVIDIA AI RAG Blueprint Repository

Once logged in, we can clone the NVIDIA AI RAG Blueprint code repository and store it in the `repos` directory:

In [None]:
%%bash
# clone the NVIDIA RAG Blueprint repo
git clone https://github.com/NVIDIA-AI-Blueprints/rag.git ../repos/nvidia_rag
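`git clone` refuses to write into a destination that already exists and is not empty, so re-running the clone cell after a first successful run will fail. A small guard, sketched below with hypothetical helper names, makes the step safe to repeat.

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/NVIDIA-AI-Blueprints/rag.git"


def needs_clone(dest):
    """True when dest is missing or an empty directory, so git clone can target it."""
    path = Path(dest)
    return not path.exists() or (path.is_dir() and not any(path.iterdir()))


def clone_if_missing(dest="../repos/nvidia_rag"):
    """Clone the repository only when the destination is not already populated."""
    if needs_clone(dest):
        subprocess.run(["git", "clone", REPO_URL, str(dest)], check=True)
        return True
    return False
```

Calling `clone_if_missing()` returns `True` when a clone was performed and `False` when the directory was already in place.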

### Deploying Milvus

Milvus is an open-source vector database designed specifically for storing, indexing, and managing large-scale vector data, such as embeddings generated by deep learning models.

It excels at efficient similarity search and is highly scalable, supporting billions of vectors and running across environments from laptops to large distributed systems. We deploy this service first because the other services depend on it.

#### Pull the Containers Needed For The Milvus Service

As a prerequisite, we'll need to pull the containers associated with Milvus before deploying the containers. The code block below will facilitate this process. Once the relevant images are pulled, we can move on to deploying the service.

In [5]:
%%bash
docker compose -f ../repos/nvidia_rag/deploy/compose/vectordb.yaml pull --quiet

 milvus Pulling 
 minio Pulling 
 etcd Pulling 
 minio Pulled 
 milvus Pulled 
 etcd Pulled 


#### Start The Containers For The Milvus Service

With the Milvus-related images (milvus, etcd, minio) now available locally, the next step is to start the containers with Docker Compose. This brings up the Milvus, etcd, and minio containers and makes the Milvus vector database service available:

In [6]:
%%bash
docker compose -f ../repos/nvidia_rag/deploy/compose/vectordb.yaml up -d

 Network nvidia-rag  Creating
 Network nvidia-rag  Created
 Container milvus-minio  Creating
 Container milvus-etcd  Creating
 Container milvus-etcd  Created
 Container milvus-minio  Created
 Container milvus-standalone  Creating
 Container milvus-standalone  Created
 Container milvus-minio  Starting
 Container milvus-etcd  Starting
 Container milvus-etcd  Started
 Container milvus-minio  Started
 Container milvus-standalone  Starting
 Container milvus-standalone  Started


#### Check Status Of Milvus Containers

After starting the Milvus service containers, verify that all required containers (milvus-standalone, milvus-etcd, milvus-minio) are running. Use the following command to list the active containers and their statuses:

In [7]:
%%bash
docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"

CONTAINER ID   NAMES               STATUS
3e1ac2226efd   milvus-standalone   Up 2 seconds
1cc93d91a4f3   milvus-minio        Up 2 seconds (health: starting)
9f1091c4de6a   milvus-etcd         Up 2 seconds (health: starting)


If all the services are healthy or running as expected, we can proceed with deploying the ingestor service!
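Reading the status table by eye works for three containers, but the same check can be scripted. The sketch below parses the table text printed by the `docker ps` command above and reports which required containers are not yet ready; the helper names are hypothetical, and treating `health: starting` as "not ready" is a choice made for this example.

```python
import re


def parse_status_table(text):
    """Map container name -> status string from `docker ps` table output."""
    statuses = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces.
        parts = re.split(r"\s{2,}", line.strip(), maxsplit=2)
        if len(parts) == 3:
            _, name, status = parts
            statuses[name] = status
    return statuses


def not_ready(statuses, required):
    """Return required containers that are missing, not up, or not yet healthy."""
    pending = []
    for name in required:
        status = statuses.get(name, "")
        if not status.startswith("Up") or "starting" in status or "unhealthy" in status:
            pending.append(name)
    return pending


# Sample copied from the output shown above.
sample = """CONTAINER ID   NAMES               STATUS
3e1ac2226efd   milvus-standalone   Up 2 seconds
1cc93d91a4f3   milvus-minio        Up 2 seconds (health: starting)
9f1091c4de6a   milvus-etcd         Up 2 seconds (health: starting)"""

required = ["milvus-standalone", "milvus-minio", "milvus-etcd"]
print(not_ready(parse_status_table(sample), required))
```

For the sample above, `milvus-minio` and `milvus-etcd` are reported as pending because their health checks are still starting; re-running the status cell a few seconds later should show them as `healthy`.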

### Deploying The Ingestor Service

[NVIDIA-Ingest (NV-Ingest)](https://github.com/NVIDIA/nv-ingest/tree/main) is leveraged for ingestion of files. NVIDIA-Ingest is a scalable, performance-oriented document content and metadata extraction microservice. Including support for parsing PDFs, Word and PowerPoint documents, it uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images for use in downstream generative applications.

#### Pull The Containers Needed For The Ingestor Microservice

As a prerequisite, we'll need to pull the containers associated with the ingestor pipeline before deploying the containers. The code block below will facilitate this process. Once the relevant images are pulled, we can move on to configuring parameters that will be used to customize deployment of the ingestor service.

In [8]:
%%bash
docker compose -f ../repos/nvidia_rag/deploy/compose/docker-compose-ingestor-server.yaml pull --quiet

 ingestor-server Pulling 
 nv-ingest-ms-runtime Pulling 
 redis Pulling 
 redis Pulled 
 ingestor-server Pulled 
 nv-ingest-ms-runtime Pulled 


#### Configure Parameters For The Ingestor Microservice

By default, the Ingestor Microservice expects to use locally hosted NVIDIA NIMs. To simplify this playbook and to ensure users can run Tokkio + RAG on the same instance, we'll use models hosted on NVIDIA Foundation Endpoints through [build.nvidia.com](https://build.nvidia.com/).

We'll configure the service to use the default LLM and Retriever NIMs enabled in the default RAG blueprint, and set the corresponding server URLs to point to the models hosted on build.nvidia.com. Note that changing a model name changes which model is invoked.

In [9]:
# retriever NIMs parameters
os.environ["APP_EMBEDDINGS_MODELNAME"] = "nvidia/llama-3.2-nv-embedqa-1b-v2"
os.environ["APP_EMBEDDINGS_SERVERURL"] = "https://integrate.api.nvidia.com/v1"

# NV-Ingest NIMs endpoints
os.environ["PADDLE_HTTP_ENDPOINT"] = "https://ai.api.nvidia.com/v1/cv/baidu/paddleocr"
os.environ["YOLOX_HTTP_ENDPOINT"] = "https://ai.api.nvidia.com/v1/cv/nvidia/nemoretriever-page-elements-v2"
os.environ["YOLOX_GRAPHIC_ELEMENTS_HTTP_ENDPOINT"] = "https://ai.api.nvidia.com/v1/cv/nvidia/nemoretriever-graphic-elements-v1"
os.environ["YOLOX_TABLE_STRUCTURE_HTTP_ENDPOINT"] = "https://ai.api.nvidia.com/v1/cv/nvidia/nemoretriever-table-structure-v1"
os.environ["VLM_CAPTION_ENDPOINT"] = "https://ai.api.nvidia.com/v1/gr/meta/llama-3.2-11b-vision-instruct/chat/completions"
os.environ["NEMORETRIEVER_PARSE_HTTP_ENDPOINT"] = "https://integrate.api.nvidia.com/v1/chat/completions"

# NV-Ingest NIMs protocol must be set to http
os.environ["PADDLE_INFER_PROTOCOL"] = "http"
os.environ["YOLOX_INFER_PROTOCOL"] = "http"
os.environ["YOLOX_GRAPHIC_ELEMENTS_INFER_PROTOCOL"] = "http"
os.environ["YOLOX_TABLE_STRUCTURE_INFER_PROTOCOL"] = "http"
os.environ["NEMORETRIEVER_PARSE_INFER_PROTOCOL"] = "http"
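Because each NV-Ingest NIM needs both an endpoint and a matching protocol setting, a typo in one variable name is easy to miss. The following sketch checks that the pairs set above are consistent before the containers are started. The function name is hypothetical, and the check only covers the variables that appear in this notebook.

```python
# Prefixes of the NV-Ingest variables configured in the cell above.
NIM_PREFIXES = [
    "PADDLE",
    "YOLOX",
    "YOLOX_GRAPHIC_ELEMENTS",
    "YOLOX_TABLE_STRUCTURE",
    "NEMORETRIEVER_PARSE",
]


def ingest_config_problems(env):
    """Return a list of human-readable problems found in the NV-Ingest settings."""
    problems = []
    for prefix in NIM_PREFIXES:
        endpoint = env.get(f"{prefix}_HTTP_ENDPOINT", "")
        protocol = env.get(f"{prefix}_INFER_PROTOCOL", "")
        if not endpoint.startswith(("http://", "https://")):
            problems.append(f"{prefix}_HTTP_ENDPOINT is missing or not a URL")
        if protocol != "http":
            problems.append(f"{prefix}_INFER_PROTOCOL should be 'http', got {protocol!r}")
    return problems


# Usage in the notebook, after the configuration cell has run:
# import os
# print(ingest_config_problems(os.environ) or "NV-Ingest settings look consistent")
```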

#### Start The Containers For The Ingestor Microservice

With the parameters for this deployment configured, we can start the service using the `docker compose` command:

In [10]:
%%bash
docker compose -f ../repos/nvidia_rag/deploy/compose/docker-compose-ingestor-server.yaml up -d

 Container compose-nv-ingest-ms-runtime-1  Creating
 Container ingestor-server  Creating
 Container compose-redis-1  Creating
 Container compose-redis-1  Created
 Container ingestor-server  Created
 Container compose-nv-ingest-ms-runtime-1  Created
 Container compose-nv-ingest-ms-runtime-1  Starting
 Container compose-redis-1  Starting
 Container ingestor-server  Starting
 Container compose-redis-1  Started
 Container ingestor-server  Started
 Container compose-nv-ingest-ms-runtime-1  Started


#### Verify Status Of Ingestor Containers

Once the containers have been started, we can verify the status below. Each service should either have a `healthy` status or a non-error status next to the appropriate service name:

In [11]:
%%bash
docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"

CONTAINER ID   NAMES                            STATUS
ac450966da38   compose-nv-ingest-ms-runtime-1   Up 26 seconds (healthy)
ae83680b62d2   ingestor-server                  Up 26 seconds
7dd3f3e440f7   compose-redis-1                  Up 26 seconds
3e1ac2226efd   milvus-standalone                Up 37 seconds
1cc93d91a4f3   milvus-minio                     Up 38 seconds (healthy)
9f1091c4de6a   milvus-etcd                      Up 38 seconds (healthy)


#### Verify Application Startup Is Successful

It's helpful to verify if the application started up as expected, even with the presence of the `healthy` status on the `compose-nv-ingest-ms-runtime-1` service. We can view the status of the service by checking the logs. If we see the following output towards the end without any error logs, the Ingestor Server is operating as expected:

```
====Building Segment Complete!====
```
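The log check described here can also be scripted. The sketch below looks for the readiness marker and collects lines that look like errors. The helper name is hypothetical, and matching on `ERROR` or `Traceback` is an assumed heuristic rather than a documented log format.

```python
def startup_report(log_text, marker):
    """Summarise whether the marker appears and which lines look like errors."""
    lines = log_text.splitlines()
    # Heuristic only: the services may log failures in other forms.
    error_lines = [ln for ln in lines if "ERROR" in ln or "Traceback" in ln]
    return {"ready": any(marker in ln for ln in lines), "errors": error_lines}


# Usage in the notebook. `docker logs` replays both stdout and stderr of the
# container, so both streams are combined before scanning:
# import subprocess
# proc = subprocess.run(["docker", "logs", "compose-nv-ingest-ms-runtime-1"],
#                       capture_output=True, text=True)
# print(startup_report(proc.stdout + proc.stderr, "====Building Segment Complete!===="))
```

The same function can be reused for the other services by passing a different container name and marker.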

In [12]:
%%bash
docker logs compose-nv-ingest-ms-runtime-1

INFO:     Uvicorn running on http://0.0.0.0:7670 (Press CTRL+C to quit)
INFO:     Started parent process [35]
INFO:     Started server process [66]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Started server process [44]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Started server process [38]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Started server process [48]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Started server process [67]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Started server process [51]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Started server process [65]
INFO:     Waiting for application startup.
INFO:     Started server process [55]
INFO:     Application startup complete.
INFO:

====Pipeline Pre-build====
====Pre-Building Segment: main====
====Pre-Building Segment Complete!====
====Pipeline Pre-build Complete!====
====Registering Pipeline====
====Building Pipeline====
====Building Pipeline Complete!====
====Registering Pipeline Complete!====
====Starting Pipeline====
====Pipeline Started====
====Building Segment: main====
Added source: <broker_listener-0; LinearModuleSourceStageCPU(module_config=<morpheus.utils.module_utils.ModuleLoader object at 0x7e028e43b3d0>, output_port_name=output, output_type=<class 'nv_ingest_api.primitives.ingest_control_message.IngestControlMessage'>)>
  └─> nv_ingest_api.IngestControlMessage
Added stage: <submitted_job_counter-1; LinearModuleStageCPU(module_config=<morpheus.utils.module_utils.ModuleLoader object at 0x7e028e43b490>, input_port_name=input, output_port_name=output, input_type=<class 'nv_ingest_api.primitives.ingest_control_message.IngestControlMessage'>, output_type=<class 'nv_ingest_api.primitives.ingest_control_messa

### Deploying The RAG Server

The RAG server in the NVIDIA RAG Blueprint is a core microservice that orchestrates the entire Retrieval-Augmented Generation (RAG) pipeline. It handles incoming user queries, coordinates retrieval of relevant information from the vector database, and manages the interaction with large language models to generate context-aware responses. 

The RAG server is based on LangChain, exposes APIs for user interaction, and integrates with other components such as retrievers, rerankers, and document ingestion services to deliver accurate, enterprise-ready generative AI solutions.

#### Pull The Containers Needed For The RAG Server Microservice

As a prerequisite, we'll need to pull the containers associated with the RAG pipeline before deploying the containers. The code block below will facilitate this process. Once the relevant images are pulled, we can move on to configuring parameters that will be used to customize deployment of the RAG services.

In [13]:
%%bash
docker compose -f ../repos/nvidia_rag/deploy/compose/docker-compose-rag-server.yaml pull --quiet

 rag-playground Pulling 
 rag-server Pulling 
 rag-server Pulled 
 rag-playground Pulled 


#### Configure Parameters For The RAG Server Microservice

By default, the RAG Server Microservice expects to use locally hosted NVIDIA NIMs. To simplify this playbook and to ensure users can run Tokkio + RAG on the same instance, we'll use models hosted on NVIDIA Foundation Endpoints through [build.nvidia.com](https://build.nvidia.com/).

We'll configure the service to use the default LLM and Retriever NIMs enabled in the default RAG blueprint. We'll also set the corresponding server URLs to empty strings because, for these containers, an empty server URL defaults to the models hosted on build.nvidia.com. Note that changing a model name changes which model is invoked.

In [14]:
os.environ["APP_LLM_MODELNAME"] = "meta/llama-3.1-70b-instruct"
os.environ["APP_LLM_SERVERURL"] = ""
os.environ["APP_EMBEDDINGS_MODELNAME"] = "nvidia/llama-3.2-nv-embedqa-1b-v2"
os.environ["APP_EMBEDDINGS_SERVERURL"] = ""
os.environ["APP_RANKING_MODELNAME"] = "nvidia/llama-3.2-nv-rerankqa-1b-v2"
os.environ["APP_RANKING_SERVERURL"] = ""
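Note that the cell above sets `APP_EMBEDDINGS_SERVERURL` to an empty string, whereas the ingestor configuration earlier set it to `https://integrate.api.nvidia.com/v1`. Containers that are already running keep the values they were started with, but re-running the ingestor `docker compose ... up` cell later in the same kernel would pick up the new value. One way to avoid this kind of cross-talk is to pass per-service overrides to each command instead of mutating `os.environ`. The sketch below shows that alternative; the helper names are hypothetical and this is not how the notebook itself is written.

```python
import os
import subprocess


def compose_env(overrides, base=None):
    """Return a copy of base (default: os.environ) with overrides applied."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


# Same values as the configuration cell above.
RAG_SERVER_OVERRIDES = {
    "APP_LLM_MODELNAME": "meta/llama-3.1-70b-instruct",
    "APP_LLM_SERVERURL": "",
    "APP_EMBEDDINGS_MODELNAME": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    "APP_EMBEDDINGS_SERVERURL": "",
    "APP_RANKING_MODELNAME": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
    "APP_RANKING_SERVERURL": "",
}


def compose_up(compose_file, overrides):
    """Run `docker compose up -d` with overrides visible only to this command."""
    subprocess.run(
        ["docker", "compose", "-f", compose_file, "up", "-d"],
        env=compose_env(overrides),
        check=True,
    )
```

With this approach, the notebook's global environment stays untouched, and each service sees only the settings intended for it.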

#### Start The Containers For The RAG Server Microservice

With the parameters for this deployment configured, we can start the service using the `docker compose` command:

In [15]:
%%bash
docker compose -f ../repos/nvidia_rag/deploy/compose/docker-compose-rag-server.yaml up -d

 Container rag-server  Creating
 Container rag-server  Created
 Container rag-playground  Creating
 Container rag-playground  Created
 Container rag-server  Starting
 Container rag-server  Started
 Container rag-playground  Starting
 Container rag-playground  Started


#### Verify Status Of RAG Server Containers

Once the containers have been started, we can verify the status below. Each service should either have a `healthy` status or a non-error status next to the appropriate service name:

In [16]:
%%bash
docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"

CONTAINER ID   NAMES                            STATUS
6cf111e897bf   rag-playground                   Up 3 seconds
92a651fdec80   rag-server                       Up 3 seconds
ac450966da38   compose-nv-ingest-ms-runtime-1   Up 46 seconds (healthy)
ae83680b62d2   ingestor-server                  Up 46 seconds
7dd3f3e440f7   compose-redis-1                  Up 46 seconds
3e1ac2226efd   milvus-standalone                Up 58 seconds
1cc93d91a4f3   milvus-minio                     Up 58 seconds (healthy)
9f1091c4de6a   milvus-etcd                      Up 58 seconds (healthy)


#### Verify Application Startup Is Successful

It's helpful to verify that the application started up as expected, even when `docker ps` reports the `rag-server` and `rag-playground` containers as up. We can check each service by viewing its logs, starting with the `rag-server` service. If the following output appears towards the end without any error logs, the RAG Server is operating as expected:

```
INFO:src.server:Initializing NVIDIA RAG server...
INFO:     Started server process [15]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```

In [17]:
%%bash
docker logs rag-server 

INFO:     Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit)
INFO:     Started parent process [1]
Optional nv_ingest_client module not installed.
Optional nv_ingest_client module not installed.
Optional nv_ingest_client module not installed.
Optional nv_ingest_client module not installed.
Optional nv_ingest_client module not installed.
Optional nv_ingest_client module not installed.
Optional nv_ingest_client module not installed.
Optional nv_ingest_client module not installed.
Collection 'multimodal_data' does not exist in Milvus. Aborting vectorstore creation.
Collection 'multimodal_data' does not exist in Milvus. Aborting vectorstore creation.
Collection 'multimodal_data' does not exist in Milvus. Aborting vectorstore creation.
Collection 'multimodal_data' does not exist in Milvus. Aborting vectorstore creation.
Collection 'multimodal_data' does not exist in Milvus. Aborting vectorstore creation.
Collection 'multimodal_data' does not exist in Milvus. Aborting vectorstore c

Similarly, if we observe the output below for the `rag-playground` service, the application startup was successful:

```
✓ Starting...
✓ Ready in 458ms
```

In [18]:
%%bash
docker logs rag-playground 


> rag-2@0.1.0 start
> next start

   ▲ Next.js 15.1.6
   - Local:        http://localhost:3000
   - Network:      http://172.18.0.9:3000

 ✓ Starting...
 ✓ Ready in 437ms


## Accessing The RAG Playground

The RAG playground can be accessed on the endpoint `http://<application-instance-ip>:8090`. Once this port has been exposed, you should be able to see the following:

![ace_rag-playground](../images/rag-playground.png)

Once you have access to the UI, you can create a Milvus collection. Think of a collection as a named store for documents on a specific subject. Multiple collections can be created, but as a start, create a collection called `dht_tokkio_rag`. You'll also be prompted to upload a document. The document can be a `.txt`, `.pdf`, or `.docx` file, since we're using the multimodal ingestion pipeline (NV-Ingest). Please refer to the [NV-Ingest documentation](https://docs.nvidia.com/nemo/retriever/extraction/overview/#what-nemo-retriever-extraction-is) page for the full list of supported file types.

You can interact with the service without specifying a collection; in that case the response is generated from the base model's foundational knowledge alone. To trigger a conversation using RAG, simply provide the collection you ingested your documents into. You can always create another collection or add to a preexisting collection using the `Add Source` option!

## Accessing The RAG Server

The Swagger UI for the RAG server can be accessed on the endpoint `http://<application-instance-ip>:8081/docs`. Once this port has been exposed, you should be able to see the following:

![ace_rag-server](../images/rag-server.png)
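If either page does not load, it can help to confirm from the notebook that the port is answering at all. The following sketch polls a URL until it responds with a 2xx status. The function names, retry count, and delay are arbitrary choices for this example.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5):
    """True if a GET request to url returns a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx HTTP error.
        return False


def wait_until_ready(url, attempts=30, delay=2.0, probe=http_ok, sleep=time.sleep):
    """Poll url until probe succeeds; return the attempt number, or 0 on timeout."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        sleep(delay)
    return 0


# Usage in the notebook (replace the placeholder with your instance address):
# wait_until_ready("http://<application-instance-ip>:8081/docs")
# wait_until_ready("http://<application-instance-ip>:8090")
```

A return value of `0` means the service never answered within the allotted attempts, which usually points to a port that has not been exposed or a container that failed to start.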

The Swagger UI is a great tool for understanding the RAG Server API schema and can be used to test and submit requests to the pipeline. The `/chat/completions` endpoint is particularly useful, as we'll use it to connect our Tokkio Avatar to our RAG service via the ACE configurator in the `customizing_tokkio.ipynb` notebook. For example, here's a default request made to the endpoint:

```
{
  "messages": [
    {
      "role": "user",
      "content": "Hello! What can you help me with?"
    }
  ],
  "use_knowledge_base": true,
  "temperature": 0.2,
  "top_p": 0.7,
  "max_tokens": 1024,
  "reranker_top_k": 10,
  "vdb_top_k": 100,
  "vdb_endpoint": "http://milvus:19530",
  "collection_name": "multimodal_data",
  "enable_query_rewriting": false,
  "enable_reranker": true,
  "enable_guardrails": false,
  "enable_citations": true,
  "model": "meta/llama-3.1-70b-instruct",
  "llm_endpoint": "",
  "embedding_model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
  "embedding_endpoint": "",
  "reranker_model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
  "reranker_endpoint": "",
  "stop": []
}
```

Take note of the parameters used here, as we'll leverage this same schema to connect to our RAG server!
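As a rough illustration of how the schema above maps onto an HTTP call, the sketch below builds the request with the standard library. Several things here are assumptions: the helper names are hypothetical, the base URL (including any version prefix in front of `/chat/completions`) should be taken from the Swagger UI, and omitting the remaining fields assumes the server falls back to its defaults.

```python
import json
import urllib.request


def build_chat_payload(question, collection_name, use_knowledge_base=True):
    """Build a request body using a subset of the fields shown above."""
    return {
        "messages": [{"role": "user", "content": question}],
        "use_knowledge_base": use_knowledge_base,
        "collection_name": collection_name,
        "temperature": 0.2,
        "top_p": 0.7,
        "max_tokens": 1024,
    }


def build_chat_request(base_url, payload):
    """Wrap the payload in a POST request aimed at the chat completions endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical base URL; confirm the exact prefix in the Swagger UI.
request = build_chat_request(
    "http://localhost:8081",
    build_chat_payload("Hello! What can you help me with?", "dht_tokkio_rag"),
)
print(request.get_method(), request.full_url)

# To actually send it once the server is running:
# with urllib.request.urlopen(request) as resp:
#     print(resp.read().decode("utf-8"))
```

Setting `use_knowledge_base` to `False` corresponds to chatting without a collection, as described for the playground.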

## Cleaning Up Services (Optional)

If you want to clean up all the services we deployed earlier, the code block below can be used:

In [151]:
%%bash
# clean up milvus service
docker compose -f ../repos/nvidia_rag/deploy/compose/vectordb.yaml down
# clean up ingestor service
docker compose -f ../repos/nvidia_rag/deploy/compose/docker-compose-ingestor-server.yaml down
# clean up rag service
docker compose -f ../repos/nvidia_rag/deploy/compose/docker-compose-rag-server.yaml down

 Container milvus-standalone  Stopping
 Container milvus-standalone  Stopped
 Container milvus-standalone  Removing
 Container milvus-standalone  Removed
 Container milvus-minio  Stopping
 Container milvus-etcd  Stopping
 Container milvus-etcd  Stopped
 Container milvus-etcd  Removing
 Container milvus-etcd  Removed
 Container milvus-minio  Stopped
 Container milvus-minio  Removing
 Container milvus-minio  Removed
 Network nvidia-rag  Removing
 Network nvidia-rag  Resource is still in use
 Container compose-nv-ingest-ms-runtime-1  Stopping
 Container compose-redis-1  Stopping
 Container ingestor-server  Stopping
 Container compose-redis-1  Stopped
 Container compose-redis-1  Removing
 Container compose-redis-1  Removed
 Container compose-nv-ingest-ms-runtime-1  Stopped
 Container compose-nv-ingest-ms-runtime-1  Removing
 Container compose-nv-ingest-ms-runtime-1  Removed
 Container ingestor-server  Stopped
 Container ingestor-server  Removing
 Container ingestor-server  Removed
 Network