# Retriever API Usage with Multimodal Query Support

This notebook demonstrates how to use the NVIDIA RAG retriever APIs with **multimodal queries** (text + images). You'll learn how to:

- 🔍 Search for relevant documents using queries that contain images
- 🤖 Generate AI responses using the end-to-end RAG API with vision-language models (VLMs)
- 📊 Work with multimodal embeddings and vector databases

**Use Case**: Query documents with images (e.g., "What is the price of this item?" + product image)

## 📦 Setting up the Dependencies

This section will guide you through:
1. Configuring your NGC API key for accessing NVIDIA services
2. Deploying the Milvus vector database
3. Setting up NVIDIA NIMs (NVIDIA Inference Microservices) for embeddings and VLM
4. Starting the NVIDIA Ingest runtime for document processing
5. Launching the RAG server

**Note**: This setup uses Docker Compose to orchestrate all services.

### 1. Setup the Default Configurations

Import necessary libraries for environment management.

In [None]:
# Install python-dotenv for environment variable management
! uv pip install python-dotenv

import os
from getpass import getpass

When you run the cell below, enter your `NGC_API_KEY` at the prompt. To obtain a key, follow the steps in the [quickstart guide](../docs/quickstart.md#obtain-an-api-key).

In [None]:
# Check if NGC_API_KEY is already set, otherwise prompt for it
# Uncomment the line below to reset your API key
# del os.environ['NGC_API_KEY']

if os.environ.get("NGC_API_KEY", "").startswith("nvapi-"):
    print("Valid NGC_API_KEY already in environment. Delete to reset")
else:
    candidate_api_key = getpass("NVAPI Key (starts with nvapi-): ")
    assert candidate_api_key.startswith("nvapi-"), (
        f"{candidate_api_key[:5]}... is not a valid key"
    )
    os.environ["NGC_API_KEY"] = candidate_api_key

Log in to nvcr.io, which is required to pull the dependency containers.

In [None]:
# Login to NVIDIA Container Registry (nvcr.io) to pull required containers
!echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin

### 2. Setup the Milvus Vector Database

Milvus is a high-performance vector database used to store and search multimodal embeddings.

**Configuration Notes**:
- By default, Milvus uses GPU indexing for faster performance
- Ensure you have provided the correct GPU ID below
- If you don't have a GPU available, you can switch to CPU-only Milvus by following the instructions in [milvus-configuration.md](../docs/milvus-configuration.md)

In [None]:
# Specify which GPU to use for Milvus (change if using a different GPU)
os.environ["VECTORSTORE_GPU_DEVICE_ID"] = "0"

In [None]:
# Start Milvus vector database service
# This will run in the background (-d flag)
!docker compose -f ../deploy/compose/vectordb.yaml up -d

### 3. Setup NVIDIA Inference Microservices (NIMs)

NIMs provide optimized inference for AI models. For multimodal RAG, we need:
- **VLM (Vision-Language Model)**: `nvidia/nemotron-nano-12b-v2-vl` for understanding images and generating responses
- **Embedding Model**: `llama-3.2-nemoretriever-1b-vlm-embed-v1` for creating multimodal embeddings

#### Deploy On-Premise Models

This section deploys NIMs locally using Docker. Models will be cached to avoid re-downloading.

In [None]:
# Create the model cache directory
!mkdir -p ~/.cache/model-cache

In [None]:
# Set the MODEL_DIRECTORY environment variable to specify where models are cached
import os

os.environ["MODEL_DIRECTORY"] = os.path.expanduser("~/.cache/model-cache")
print("MODEL_DIRECTORY set to:", os.environ["MODEL_DIRECTORY"])

In [None]:
# Deploy NIMs with VLM and embedding profiles
# ⚠️ WARNING: This may take 10-20 minutes as models download (~10GB+)
# If the kernel times out, just rerun this cell - it will resume where it left off
! USERID=$(id -u) docker compose --profile vlm-ingest --profile vlm-only -f ../deploy/compose/nims.yaml up -d

In [None]:
# Monitor the status of running containers
# Run this cell repeatedly to check if all services are healthy
# Look for STATUS showing "healthy" or "Up" for all containers
!docker ps
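
Rather than reading the `docker ps` table by eye, the status column can be checked programmatically. The sketch below assumes output produced with `docker ps --format '{{.Names}}\t{{.Status}}'` and treats any container that is not `Up`, or whose health check reports `unhealthy` or `health: starting`, as not ready. The function name `containers_not_ready` and the sample container names are illustrative, not part of the deployment.

```python
def containers_not_ready(ps_output: str) -> list[str]:
    """Return names of containers that are not up or not yet healthy.

    Expects one container per line in the form: name, a tab, then the status text.
    """
    pending = []
    for line in ps_output.strip().splitlines():
        name, _, status = line.partition("\t")
        status = status.strip()
        is_up = status.startswith("Up")
        # Docker appends "(health: starting)" or "(unhealthy)" when a health check exists
        is_unhealthy = "unhealthy" in status or "health: starting" in status
        if not is_up or is_unhealthy:
            pending.append(name.strip())
    return pending


# Hypothetical sample output; real container names depend on the compose files
sample = (
    "milvus-standalone\tUp 3 minutes (healthy)\n"
    "vlm-ms\tUp 20 seconds (health: starting)\n"
    "ingestor-server\tExited (1) 2 minutes ago\n"
)
print(containers_not_ready(sample))
```

In a notebook cell, the input could come from `subprocess.run([...], capture_output=True, text=True).stdout`; an empty result list means every listed container is up.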

In [None]:
# Configure the model names and service URLs for the RAG pipeline
# These settings tell the RAG server which models and endpoints to use

# VLM (Vision-Language Model) configuration
os.environ["APP_VLM_MODELNAME"] = "nvidia/nemotron-nano-12b-v2-vl"
os.environ["APP_VLM_SERVERURL"] = "http://vlm-ms:8000/v1"

# Multimodal embedding model configuration
os.environ["APP_EMBEDDINGS_MODELNAME"] = "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"
os.environ["APP_EMBEDDINGS_SERVERURL"] = "nemoretriever-vlm-embedding-ms:8000/v1"

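#### Cloud-Based Deployment
As an alternative to the on-premise NIMs, you can use NVIDIA-hosted cloud models. Run the cell below only in that case, because it overwrites the VLM and embedding settings configured above.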

In [None]:
# OCR and document processing endpoints - cloud hosted
os.environ["OCR_HTTP_ENDPOINT"] = "https://ai.api.nvidia.com/v1/cv/nvidia/nemoretriever-ocr"
os.environ["OCR_INFER_PROTOCOL"] = "http"
os.environ["OCR_MODEL_NAME"] = "scene_text_ensemble"
os.environ["YOLOX_HTTP_ENDPOINT"] = "https://ai.api.nvidia.com/v1/cv/nvidia/nemoretriever-page-elements-v2"
os.environ["YOLOX_INFER_PROTOCOL"] = "http"
os.environ["YOLOX_GRAPHIC_ELEMENTS_HTTP_ENDPOINT"] = "https://ai.api.nvidia.com/v1/cv/nvidia/nemoretriever-graphic-elements-v1"
os.environ["YOLOX_GRAPHIC_ELEMENTS_INFER_PROTOCOL"] = "http"
os.environ["YOLOX_TABLE_STRUCTURE_HTTP_ENDPOINT"] = "https://ai.api.nvidia.com/v1/cv/nvidia/nemoretriever-table-structure-v1"
os.environ["YOLOX_TABLE_STRUCTURE_INFER_PROTOCOL"] = "http"
os.environ["APP_NVINGEST_CAPTIONENDPOINTURL"] = "https://integrate.api.nvidia.com/v1/chat/completions"

# VLM Model configuration - cloud hosted
os.environ["APP_VLM_MODELNAME"] = "nvidia/nemotron-nano-12b-v2-vl"
os.environ["APP_VLM_SERVERURL"] = "https://integrate.api.nvidia.com/v1"
os.environ["APP_LLM_SERVERURL"] = ""

# Multimodal embedding model configuration - cloud hosted
os.environ["APP_EMBEDDINGS_MODELNAME"] = "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"
os.environ["APP_EMBEDDINGS_SERVERURL"] = "https://integrate.api.nvidia.com/v1"
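
The on-premise and cloud cells set the same variables, so whichever cell runs last wins. One way to make that choice explicit is sketched below. The `DEPLOYMENT_MODE` variable and the `model_endpoints` helper are hypothetical; the URLs and model names are copied from the cells above. The sketch covers only the VLM and embedding settings, not the OCR and page-element endpoints.

```python
import os

# Endpoint values copied from the on-premise and cloud cells of this notebook
ENDPOINTS = {
    "on_prem": {
        "APP_VLM_SERVERURL": "http://vlm-ms:8000/v1",
        "APP_EMBEDDINGS_SERVERURL": "nemoretriever-vlm-embedding-ms:8000/v1",
    },
    "cloud": {
        "APP_VLM_SERVERURL": "https://integrate.api.nvidia.com/v1",
        "APP_EMBEDDINGS_SERVERURL": "https://integrate.api.nvidia.com/v1",
    },
}

# Model names are the same in both modes
MODELS = {
    "APP_VLM_MODELNAME": "nvidia/nemotron-nano-12b-v2-vl",
    "APP_EMBEDDINGS_MODELNAME": "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1",
}


def model_endpoints(mode: str) -> dict[str, str]:
    """Return the environment variables for the chosen deployment mode."""
    if mode not in ENDPOINTS:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {sorted(ENDPOINTS)}")
    return {**MODELS, **ENDPOINTS[mode]}


# DEPLOYMENT_MODE is a hypothetical variable, not one the RAG server reads
mode = os.environ.get("DEPLOYMENT_MODE", "on_prem")
os.environ.update(model_endpoints(mode))
print(f"Using {mode} endpoints: {os.environ['APP_VLM_SERVERURL']}")
```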

### 4. Setup NVIDIA Ingest Runtime

NVIDIA Ingest processes documents to extract text, images, and other elements. We'll configure it to:
- Extract images from documents
- Handle multimodal content

In [None]:
# Configure NVIDIA Ingest to extract and process images from documents
os.environ["APP_NVINGEST_STRUCTURED_ELEMENTS_MODALITY"] = ""  # No special handling for structured elements
os.environ["APP_NVINGEST_IMAGE_ELEMENTS_MODALITY"] = "image"  # Process image elements as images
os.environ["APP_NVINGEST_EXTRACTIMAGES"] = "True"  # Extract images from documents

# Start the ingestor server with Redis
! docker compose -f ../deploy/compose/docker-compose-ingestor-server.yaml up -d --build

### 5. Setup the NVIDIA RAG Server

The RAG server provides the main API endpoints for search and generation. It orchestrates all the components (embeddings, vector DB, VLM) to deliver intelligent responses.

In [None]:
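# Start the RAG server (accessible at localhost:8081)
# The reranker URL is left empty; the search example in this notebook sets enable_reranker to False
os.environ["APP_RANKING_SERVERURL"] = ""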
! docker compose -f ../deploy/compose/docker-compose-rag-server.yaml up -d --build

---

## 📚 Document Ingestion Workflow

Now that all services are running, let's ingest documents into a collection.

### 6. Create a Collection

A collection is a logical grouping of documents in the vector database. Think of it as a database table optimized for similarity search.
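
To make "optimized for similarity search" concrete, here is a toy illustration of ranking stored vectors by cosine similarity to a query vector. It uses three-dimensional vectors and hypothetical document names; a vector database does the same kind of ranking over much larger embeddings (2048 dimensions in this notebook) and uses an index rather than the linear scan shown here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k stored items most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical toy "collection"
collection = {
    "purse_page": [0.9, 0.1, 0.0],
    "shoe_page": [0.1, 0.9, 0.0],
    "belt_page": [0.7, 0.3, 0.1],
}
for name, score in top_k([1.0, 0.0, 0.0], collection, k=2):
    print(f"{name}: {score:.3f}")
```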

In [None]:
# Install aiohttp for async HTTP requests
! uv pip install aiohttp

# Imports used by the print_response helper below
import json

import aiohttp

# Configure the ingestor server URL
# Use "ingestor-server" when running in AI Workbench, otherwise "localhost"
IPADDRESS = (
    "ingestor-server"
    if os.environ.get("AI_WORKBENCH", "false") == "true"
    else "localhost"
)
INGESTOR_SERVER_PORT = "8082"
BASE_URL = f"http://{IPADDRESS}:{INGESTOR_SERVER_PORT}"

async def print_response(response):
    """Helper function to pretty-print API responses."""
    try:
        response_json = await response.json()
        print(json.dumps(response_json, indent=2))
    except (aiohttp.ClientResponseError, json.JSONDecodeError):
        # Fall back to raw text when the body is not valid JSON
        print(await response.text())


In [None]:
# Define a unique name for your collection
# Change this if you want to create a different collection
collection_name = "multimodal_query"

In [None]:
import aiohttp
import json


async def create_collection(
    collection_name: str | None = None,
    embedding_dimension: int = 2048,
    metadata_schema: list | None = None,
):
    """
    Create a new collection in the vector database.
    
    Args:
        collection_name: Unique identifier for the collection
        embedding_dimension: Size of the embedding vectors (2048 for llama-3.2-nemoretriever-1b-vlm-embed-v1)
        metadata_schema: Optional schema for metadata fields (defaults to an empty list)
    """
    data = {
        "collection_name": collection_name,
        "embedding_dimension": embedding_dimension,
        # Avoid a mutable default argument; fall back to an empty schema
        "metadata_schema": metadata_schema or [],
    }

    HEADERS = {"Content-Type": "application/json"}

    async with aiohttp.ClientSession() as session:
        try:
            async with session.post(
                f"{BASE_URL}/v1/collection", json=data, headers=HEADERS
            ) as response:
                await print_response(response)
        except aiohttp.ClientError as e:
            return 500, {"error": str(e)}


# Create the collection
# The embedding dimension is 2048 for the multimodal embedding model we're using
await create_collection(
    collection_name=collection_name,
)

In [None]:
# Specify the documents to upload
# This PDF contains product images with pricing information
FILEPATHS = [
    "../data/multimodal/product_catalog.pdf",
]

async def upload_documents(collection_name: str = ""):
    """
    Upload and process documents into the collection.
    
    This will:
    1. Extract text and images from the PDFs
    2. Chunk the content for optimal retrieval
    3. Generate multimodal embeddings
    4. Store everything in the vector database
    """
    data = {
        "collection_name": collection_name,
        "blocking": False,  # Async upload - use status API to check progress
        "split_options": {
            "chunk_size": 512,        # Characters per chunk
            "chunk_overlap": 150      # Overlap between chunks for context
        },
        "generate_summary": False  # Set to True to generate document summaries
    }

    form_data = aiohttp.FormData()
    
    # Add all PDF files to the form data
    # Read each file into memory so that no file handle is left open
    for file_path in FILEPATHS:
        with open(file_path, "rb") as pdf_file:
            form_data.add_field("documents", pdf_file.read(),
                              filename=os.path.basename(file_path),
                              content_type="application/pdf")

    form_data.add_field("data", json.dumps(data), content_type="application/json")

    async with aiohttp.ClientSession() as session:
        try:
            # Use POST for new uploads, PATCH for re-ingesting existing documents
            async with session.post(f"{BASE_URL}/v1/documents", data=form_data) as response:
                await print_response(response)
                response_json = await response.json()
                return response_json
        except aiohttp.ClientError as e:
            print(f"Error uploading documents: {e}")
            return None

# Upload the documents and get the task ID for tracking progress
upload_response = await upload_documents(collection_name=collection_name)
task_id = upload_response.get("task_id") if upload_response else None
print(f"\nTask ID for tracking: {task_id}")
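
The `chunk_size` and `chunk_overlap` options describe a sliding window over the extracted text. The sketch below illustrates the idea on a plain string, counting characters as the comments in the cell above do. It is not the ingestor's actual splitter, which may count tokens or respect sentence boundaries; `sliding_chunks` is an illustrative name.

```python
def sliding_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 150) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk after the first starts with the last four characters of its predecessor, which is what lets a sentence that straddles a boundary appear whole in at least one chunk.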

In [None]:
async def get_task_status(task_id: str):
    """
    Check the status of an asynchronous ingestion task.
    
    Possible statuses:
    - "pending": Task is queued
    - "processing": Currently processing documents
    - "completed": Successfully finished
    - "failed": Error occurred
    """
    params = {
        "task_id": task_id,
    }

    HEADERS = {"Content-Type": "application/json"}

    async with aiohttp.ClientSession() as session:
        try:
            async with session.get(
                f"{BASE_URL}/v1/status", params=params, headers=HEADERS
            ) as response:
                await print_response(response)
        except aiohttp.ClientError as e:
            return 500, {"error": str(e)}


# Check the ingestion status
# Run this cell multiple times until status shows "completed"
# Pass the task ID as a string, matching the function signature
await get_task_status(
    task_id=task_id
)

---

## 🔍 Querying with Multimodal Inputs

Now that documents are ingested, let's query them using both text and images!

### 7. Using the Search and Generate APIs

We'll demonstrate two approaches:
1. **Search API**: Find relevant documents without generating a response
2. **Generate API**: Get an AI-generated answer with citations

#### Prepare a Multimodal Query

To query with an image, we need to:
1. Convert the image to base64 encoding
2. Format it according to the OpenAI vision API format
3. Combine it with a text prompt

In [None]:
! uv pip install requests httpx
import base64
import requests
from IPython.display import Image, Markdown, display

def get_base64_image(image_source: str) -> str:
    """
    Convert an image to base64 encoding.
    
    Args:
        image_source: Local file path or URL to the image
        
    Returns:
        Base64 encoded string of the image
    """
    if image_source.startswith(('http://', 'https://')):
        # Download image from URL; fail fast on HTTP errors or a stalled connection
        response = requests.get(image_source, timeout=30)
        response.raise_for_status()
        return base64.b64encode(response.content).decode()
    else:
        # Read local file
        with open(image_source, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode()

# Convert the query image to base64
# Try different images to test different queries:
query_image_path = "../data/multimodal/Creme_clutch_purse1-small.jpg"
image_b64 = get_base64_image(query_image_path)

# Display the query image for reference
print("📷 Query Image:")
display(Image(filename=query_image_path, width=300))

# Format as a data URL (the MIME type matches the JPEG query image)
image_input = f"data:image/jpeg;base64,{image_b64}"

# Create the multimodal query with text + image
# This follows the OpenAI vision API format
query_1 = "What material is this made of?"
image_query = [
    {"type": "text", "text": query_1},
    {
        "type": "image_url",
        "image_url": {
            "url": image_input,
            "detail": "auto"  # Let the model decide the appropriate detail level
        }
    }
]
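
The MIME type in a data URL should match the actual image format. A minimal sketch that derives it from the file name with the standard-library `mimetypes` module follows; `to_data_url` and `build_image_query` are illustrative helpers, and the three placeholder bytes stand in for real image data.

```python
import base64
import mimetypes


def to_data_url(image_bytes: bytes, filename: str) -> str:
    """Build a base64 data URL whose MIME type is guessed from the file name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_query(text: str, image_bytes: bytes, filename: str) -> list[dict]:
    """Combine a text prompt and an image in the content-list format used above."""
    return [
        {"type": "text", "text": text},
        {
            "type": "image_url",
            "image_url": {"url": to_data_url(image_bytes, filename), "detail": "auto"},
        },
    ]


# Placeholder bytes (the JPEG magic number) stand in for a real image file
query = build_image_query("What material is this made of?", b"\xff\xd8\xff", "purse.jpg")
print(query[1]["image_url"]["url"])
```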

In [None]:
import httpx
import json
from IPython.display import Image, Markdown, display

RAG_BASE_URL = "http://localhost:8081"

async def search_documents(payload):
    """
    Search for relevant documents using a multimodal query.
    
    This performs similarity search in the vector database and optionally
    reranks results for better relevance.
    """
    search_url = f"{RAG_BASE_URL}/v1/search"
    
    async with httpx.AsyncClient(timeout=300.0) as client:
        try:
            response = await client.post(url=search_url, json=payload)
            response.raise_for_status()
            
            search_results = response.json()
            print("Search Results:")
            
            # Display search results with nice formatting
            if "results" in search_results:
                for idx, result in enumerate(search_results["results"]):
                    doc_type = result.get("document_type", "text")
                    content = result.get("content", "")
                    doc_name = result.get("document_name", f"Result {idx + 1}")
                    score = result.get("score", "N/A")
                    
                    display(Markdown(f"**Result {idx + 1}: {doc_name} (Score: {score})**"))
                    try:
                        if doc_type == "image":
                            # Display image results
                            image_bytes = base64.b64decode(content)
                            display(Image(data=image_bytes))
                        else:
                            # Display text results
                            display(Markdown(f"```\n{content}\n```"))
                    except Exception as e:
                        print(f"Error displaying content: {e}")
                        display(Markdown(f"```\n{content}\n```"))
            
            return search_results
            
        except httpx.HTTPStatusError as e:
            print(f"HTTP error occurred: {e.response.status_code} - {e.response.text}")
        except httpx.RequestError as e:
            print(f"An error occurred while requesting {e.request.url!r}: {e}")
        except Exception as e:
            print(f"An error occurred: {e}")

# Configure the search parameters
search_payload = {
    "query": image_query,                      # Our multimodal query (text + image)
    "messages": [],                            # No conversation history
    "use_knowledge_base": True,                # Search the vector database
    "collection_names": [collection_name],     # Which collection to search
    "vdb_top_k": 5,                           # Retrieve top 5 results from vector DB
    "vdb_endpoint": "http://milvus:19530",    # Milvus connection string
    "enable_reranker": False,                  # Set to True for better relevance (slower)
    "reranker_top_k": 3,                      # If reranker enabled, return top 3
    "filter_expr": "",                        # Optional metadata filter
}

# Execute the search
print("🔍 Searching for documents matching the query...\n")
search_result = await search_documents(search_payload)

In [None]:
import base64
import json
from IPython.display import Image, Markdown, display


async def print_streaming_response_and_citations(response_generator):
    """
    Helper function to display streaming responses with citations.
    
    This function:
    1. Streams the AI-generated response token by token
    2. Extracts citations from the first chunk
    3. Displays citations (text or images) after the response completes
    """
    first_chunk_data = None
    
    async for chunk in response_generator:
        # Parse Server-Sent Events (SSE) format
        if chunk.startswith("data: "):
            chunk = chunk[len("data: ") :].strip()
        if not chunk:
            continue
            
        try:
            data = json.loads(chunk)
        except Exception as e:
            print(f"JSON decode error: {e}")
            continue
            
        choices = data.get("choices", [])
        if not choices:
            continue
            
        # Save the first chunk with citations
        if first_chunk_data is None and data.get("citations"):
            first_chunk_data = data
            
        # Print streaming text
        delta = choices[0].get("delta", {})
        text = delta.get("content")
        if not text:
            message = choices[0].get("message", {})
            text = message.get("content", "")
        print(text, end="", flush=True)
        
    print()  # Newline after streaming

    # Display citations after streaming is done
    if first_chunk_data and first_chunk_data.get("citations"):
        print("\n📚 Citations:")
        citations = first_chunk_data["citations"]
        for idx, citation in enumerate(citations.get("results", [])):
            doc_type = citation.get("document_type", "text")
            content = citation.get("content", "")
            doc_name = citation.get("document_name", f"Citation {idx + 1}")
            display(Markdown(f"**Citation {idx + 1}: {doc_name}**"))
            try:
                # Try to display as image; validate=True rejects plain text
                # that would otherwise be silently mis-decoded
                image_bytes = base64.b64decode(content, validate=True)
                display(Image(data=image_bytes))
            except Exception:
                # Fall back to text display
                display(Markdown(f"```\n{content}\n```"))

In [None]:
import httpx

# Configure RAG server URL
IPADDRESS = "rag-server" if os.environ.get("AI_WORKBENCH", "false") == "true" else "localhost"
RAG_SERVER_PORT = "8081"
RAG_BASE_URL = f"http://{IPADDRESS}:{RAG_SERVER_PORT}"
generate_url = f"{RAG_BASE_URL}/v1/generate"

async def generate_answer(payload):
    """
    Generate an AI answer using the RAG pipeline.
    
    This function:
    1. Sends the query to the RAG server
    2. Retrieves relevant context from the vector database
    3. Streams the AI-generated response
    4. Displays citations (sources) used to generate the answer
    """
    rag_response = ""
    citations = []
    is_first_token = True

    async with httpx.AsyncClient(timeout=300.0) as client:
        try:
            async with client.stream("POST", url=generate_url, json=payload) as response:
                # Raise an exception for bad status codes like 4xx or 5xx
                response.raise_for_status()

                # Iterate over the streaming response
                async for line in response.aiter_lines():
                    if line.startswith("data: "):
                        json_str = line[6:].strip()
                        if not json_str:
                            continue

                        try:
                            data = json.loads(json_str)

                            # Accumulate the streamed response text (returned when streaming completes)
                            message = data.get("choices", [{}])[0].get("message", {}).get("content", "")
                            if message:
                                rag_response += message

                            # Extract and display citations from the first chunk
                            if is_first_token and data.get("citations"):
                                print("\n📚 Citations:")
                                citations = data["citations"]
                                for idx, citation in enumerate(citations.get("results", [])):
                                    doc_type = citation.get("document_type", "text")
                                    content = citation.get("content", "")
                                    doc_name = citation.get("document_name", f"Citation {idx + 1}")
                                    display(Markdown(f"**Citation {idx + 1}: {doc_name}**"))
                                    try:
                                        # Display image citations; validate=True rejects plain text
                                        # that would otherwise be silently mis-decoded
                                        image_bytes = base64.b64decode(content, validate=True)
                                        display(Image(data=image_bytes))
                                    except Exception:
                                        # Display text citations
                                        display(Markdown(f"```\n{content}\n```"))
                                is_first_token = False

                            # Check if streaming is complete
                            finish_reason = data.get("choices", [{}])[0].get("finish_reason")
                            if finish_reason == "stop":
                                print("\n✅ Response complete!")
                                return rag_response

                        except json.JSONDecodeError:
                            print(f"Skipping malformed JSON line: {json_str}")
                            continue
        
        except httpx.HTTPStatusError as e:
            print(f"HTTP error occurred: {e.response.status_code} - {e.response.text}")
        except httpx.RequestError as e:
            print(f"An error occurred while requesting {e.request.url!r}: {e}")
        except Exception as e:
            print(f"An error occurred: {e}")

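    # Reached only after an error or if the stream ended without a "stop"
    # finish reason; return whatever text was collected
    return rag_response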

In [None]:
# Format the query as a chat message
messages = [
    {
        "role": "user",
        "content": image_query  # Our multimodal query (text + image)
    }
]

# Configure the generate API parameters
payload = {
    "messages": messages,                      # Chat conversation
    "use_knowledge_base": True,                # Enable RAG - use vector DB for context
    "temperature": 0.2,                        # Lower = more deterministic, higher = more creative
    "top_p": 0.7,                             # Nucleus sampling parameter
    "max_tokens": 1024,                       # Maximum response length
    "reranker_top_k": 2,                      # Keep top 2 results after reranking
    "vdb_top_k": 10,                          # Retrieve top 10 from vector DB initially
    "vdb_endpoint": "http://milvus:19530",    # Milvus connection
    "collection_names": [collection_name],     # Which collection to search
    "enable_query_rewriting": True,            # Improve query before searching
    "enable_citations": True,                  # Include source citations in response
    "stop": [],                               # Optional stop sequences
    "filter_expr": "",                        # Optional metadata filter    
}

# Generate the answer with RAG
print("🤖 Generating answer with RAG...\n")
await generate_answer(payload)
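
The `vdb_top_k` and `reranker_top_k` parameters describe a two-stage funnel: a wide, cheap vector search followed by a narrower, more careful reranking. The toy sketch below shows only that control flow. All scores are made up, `retrieve_then_rerank` is an illustrative name, and a real reranker is a model rather than a precomputed field.

```python
def retrieve_then_rerank(
    candidates: list[dict],
    vdb_top_k: int,
    reranker_top_k: int,
) -> list[str]:
    """Keep the vdb_top_k best by vector score, then the reranker_top_k best by rerank score."""
    # Stage 1: fast ranking by vector similarity
    shortlist = sorted(candidates, key=lambda c: c["vector_score"], reverse=True)[:vdb_top_k]
    # Stage 2: slower, more precise ranking over the shortlist only
    reranked = sorted(shortlist, key=lambda c: c["rerank_score"], reverse=True)[:reranker_top_k]
    return [c["id"] for c in reranked]


# Made-up scores for illustration
candidates = [
    {"id": "chunk_a", "vector_score": 0.91, "rerank_score": 0.40},
    {"id": "chunk_b", "vector_score": 0.88, "rerank_score": 0.95},
    {"id": "chunk_c", "vector_score": 0.75, "rerank_score": 0.80},
    {"id": "chunk_d", "vector_score": 0.30, "rerank_score": 0.99},
]
print(retrieve_then_rerank(candidates, vdb_top_k=3, reranker_top_k=2))
```

Note that `chunk_d` never reaches the second stage even though it has the highest rerank score, which is why the first-stage cutoff is usually kept generous relative to the final one.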

---

## 🎉 Summary

Congratulations! You've successfully:

✅ **Set up the infrastructure**: Deployed Milvus vector DB, NVIDIA NIMs, and RAG services  
✅ **Ingested multimodal documents**: Uploaded PDFs with images and extracted their content  
✅ **Created multimodal queries**: Combined text and images in your search queries  
✅ **Retrieved relevant context**: Used semantic search to find matching documents  
✅ **Generated AI responses**: Got intelligent answers with source citations  

### Next Steps

- **Try different queries**: Change the query text or use different query images
- **Upload more documents**: Add more PDFs to enrich your knowledge base
- **Experiment with parameters**: Adjust `temperature`, `top_p`, `vdb_top_k`, `reranker_top_k`, and `enable_reranker`
- **Build applications**: Integrate these APIs into your own applications

### Cleanup

To stop all services and free up resources:

```bash
cd ../deploy/compose
docker compose -f docker-compose-rag-server.yaml down
docker compose -f docker-compose-ingestor-server.yaml down
docker compose -f nims.yaml down
docker compose -f vectordb.yaml down
```
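
The same teardown can be prepared from a notebook cell. The sketch below only builds the commands, in the reverse of the startup order used in this notebook; passing each list to `subprocess.run(command, check=True)` would execute it. The `compose_down_commands` helper is illustrative.

```python
from pathlib import PurePosixPath

# Compose files in the order this notebook started them
STARTUP_ORDER = [
    "vectordb.yaml",
    "nims.yaml",
    "docker-compose-ingestor-server.yaml",
    "docker-compose-rag-server.yaml",
]


def compose_down_commands(compose_dir: str = "../deploy/compose") -> list[list[str]]:
    """Build `docker compose ... down` commands, stopping services in reverse startup order."""
    return [
        ["docker", "compose", "-f", str(PurePosixPath(compose_dir) / name), "down"]
        for name in reversed(STARTUP_ORDER)
    ]


for command in compose_down_commands():
    print(" ".join(command))
```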
