### Retriever API Usage

This notebook shows how to use the NVIDIA RAG retriever APIs to fetch relevant document passages for a user query, and how to generate responses with the end-to-end RAG APIs.


- Ensure the rag-server container is running before executing the notebook by [following steps in the readme](../docs/quickstart.md#start-the-containers-for-rag-microservices).
- Please run the [ingestion notebook](./ingestion_api_usage.ipynb) as a prerequisite to using this notebook.
- Replace the value of `IPADDRESS` in the configuration cell with the server's IP address or hostname if the API is hosted on another system.

You can now execute each cell in sequence to test the API.

#### 1. Install Dependencies

In [None]:
!pip install aiohttp
import aiohttp
import os
import json

#### 2. Setup Base Configuration

In [None]:
IPADDRESS = "localhost"  # Replace with the server's IP address or hostname
RAG_SERVER_PORT = "8081"
BASE_URL = f"http://{IPADDRESS}:{RAG_SERVER_PORT}"  # Base URL used by every request below

async def print_response(response):
    """Helper to print API response."""
    try:
        response_json = await response.json()
        print(json.dumps(response_json, indent=2))
    except (aiohttp.ClientResponseError, json.JSONDecodeError):
        # The body is not JSON: fall back to printing the raw text
        print(await response.text())

#### 3. Health Check Endpoint

**Purpose:**
This endpoint performs a health check on the server. It returns a 200 status code if the server is operational. It also returns the status of the dependent services.

In [None]:
async def fetch_health_status():
    """Fetch health status asynchronously."""
    url = f"{BASE_URL}/v1/health"
    params = {"check_dependencies": "True"}  # Also check the health of dependent services
    async with aiohttp.ClientSession() as session:
        async with session.get(url, params=params) as response:
            await print_response(response)

# Run the async function
await fetch_health_status()
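
Since the endpoint returns a 200 status code when the server is operational, a readiness poll can gate the rest of the notebook. The following is a minimal sketch and not part of the original notebook: `wait_until_healthy` and its `get_status` parameter are hypothetical names, and the retry count and delay are arbitrary choices.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_until_healthy(
    get_status: Callable[[], Awaitable[int]],
    retries: int = 5,
    delay_s: float = 2.0,
) -> bool:
    """Poll a status coroutine until it returns HTTP 200 or retries run out.

    `get_status` is a hypothetical hook: any zero-argument coroutine function
    that returns an HTTP status code, for example one that wraps the
    /v1/health request shown above and returns `response.status`.
    """
    for attempt in range(retries):
        try:
            if await get_status() == 200:
                return True
        except OSError:
            # Connection-level failures (e.g. connection refused while the
            # server is still starting) are treated as "not ready yet".
            pass
        if attempt < retries - 1:
            await asyncio.sleep(delay_s)
    return False
```

In a notebook cell, this could be called as `await wait_until_healthy(my_status_fn)`, where `my_status_fn` is a coroutine function you define around `session.get`.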

#### 4. Generate Answer Endpoint

**Purpose:**
This endpoint generates a streaming AI response to a given user message. The system message is specified in the [prompts.yaml](src/prompt.yaml) file. The API retrieves the chunks relevant to the query from the knowledge base, adds them to the LLM prompt, and returns a streaming response. It supports parameters such as `temperature`, `top_p`, and `use_knowledge_base`, and generates its answer from the vector collection named in `collection_name`.

If a cited source is an image, the endpoint also returns it as base64-encoded data within the returned document chunks. The `citations` field is always populated in the first chunk of the streaming response.

In [None]:
url = f"{BASE_URL}/v1/generate"
payload = {
  "messages": [
    {
      "role": "user",
      "content": "How does the price of bluetooth speaker compare with hammer?"
    }
  ],
  "use_knowledge_base": True,
  "temperature": 0.2,
  "top_p": 0.7,
  "max_tokens": 1024,
  "reranker_top_k": 2,
  "vdb_top_k": 10,
  "vdb_endpoint": "http://milvus:19530",
  "collection_name": "multimodal_data",
  "enable_query_rewriting": True,
  "enable_reranker": True,
  "enable_citations": True,
  "model": "meta/llama-3.1-70b-instruct",
  "reranker_model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
  "embedding_model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
  # Provide the URLs of the model endpoints if they are deployed elsewhere
  # "llm_endpoint": "",
  # "embedding_endpoint": "",
  # "reranker_endpoint": "",
  "stop": []
}

async def generate_answer(payload):
    """POST the payload to /v1/generate and print the response."""
    async with aiohttp.ClientSession() as session:
        try:
            async with session.post(url=url, json=payload) as response:
                await print_response(response)
        except aiohttp.ClientError as e:
            print(f"Error: {e}")

await generate_answer(payload)
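
Because the endpoint streams its response, `print_response` may be unable to parse the body as a single JSON document and will then print the raw text. A small parser can reassemble the answer and pick up the citations instead. The sketch below rests on assumptions that this notebook does not confirm: that each streamed event is a line of the form `data: {...}`, that the generated text sits at `choices[0].message.content`, and that citations arrive under a top-level `citations` key. `collect_stream` is a hypothetical helper name.

```python
import json
from typing import Any, Iterable, Optional, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[Any]]:
    """Join streamed text pieces and capture the first citations payload.

    ASSUMPTIONS (not confirmed by this notebook): events look like
    `data: {...}`, text lives at choices[0].message.content, and citations
    appear under a top-level "citations" key. Adjust to the real schema.
    """
    text_parts = []
    citations = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything unexpected
        body = line[len("data:"):].strip()
        if not body or body == "[DONE]":
            continue
        try:
            event = json.loads(body)
        except json.JSONDecodeError:
            continue  # this sketch ignores malformed events
        if not isinstance(event, dict):
            continue
        if citations is None and event.get("citations"):
            citations = event["citations"]
        for choice in event.get("choices", []):
            content = (choice.get("message") or {}).get("content")
            if content:
                text_parts.append(content)
    return "".join(text_parts), citations
```

With aiohttp, the input lines can be gathered inside the `session.post` block by iterating `async for raw in response.content` and decoding each `raw` value as UTF-8.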

#### 5. Document Search Endpoint

**Purpose:**
This endpoint searches for the most relevant documents in the vector store based on a query. You can specify the maximum number of documents to retrieve using `reranker_top_k`.  

The `content` of each document is returned as well. For images that represent charts or tables, `content` is a base64 representation, which developers can use to render multimodal citations for end users. The textual representation of this content is available in the `description` field of `metadata`.

In [None]:
url = f"{BASE_URL}/v1/search"
payload = {
  "query": "Tell me about Robert Frost's poems",
  "reranker_top_k": 2,
  "vdb_top_k": 10,
  "vdb_endpoint": "http://milvus:19530",
  "collection_name": "multimodal_data",
  "messages": [],
  "enable_query_rewriting": True,
  "enable_reranker": True,
  "embedding_model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
  # Provide the URLs of the model endpoints if they are deployed elsewhere
  # "embedding_endpoint": "",
  # "reranker_endpoint": "",
  "reranker_model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
}

async def document_search(payload):
    """POST the payload to /v1/search and print the matching documents."""
    async with aiohttp.ClientSession() as session:
        try:
            async with session.post(url=url, json=payload) as response:
                await print_response(response)
        except aiohttp.ClientError as e:
            print(f"Error: {e}")

await document_search(payload)
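
To render a chart or table citation, the base64 `content` has to be decoded back into image bytes. The sketch below shows one way this might look. It assumes a single search result is a dict with a base64 string under `content` and the text form under `metadata["description"]`. The full response schema is not shown in this notebook, so treat those keys and the helper name `decode_image_result` as hypothetical.

```python
import base64
import binascii
from typing import Any, Dict, Optional, Tuple


def decode_image_result(result: Dict[str, Any]) -> Tuple[Optional[bytes], str]:
    """Return (image_bytes, description) for one search result.

    ASSUMPTIONS: `result` holds a base64 string under "content" and the
    textual form under metadata["description"]. Adjust the keys to match
    the actual /v1/search output.
    """
    description = (result.get("metadata") or {}).get("description", "")
    try:
        # validate=True rejects characters outside the base64 alphabet
        image_bytes = base64.b64decode(result.get("content", ""), validate=True)
    except (binascii.Error, ValueError):
        # Content was not valid base64 (e.g. text with spaces or punctuation)
        image_bytes = None
    return image_bytes, description
```

The returned bytes can be written to a file or handed to an image viewer, while the description serves as alternative text when the image cannot be displayed.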