# Guide to Real-time Inference with NVIDIA Cloud APIs

This guide walks through setting up a real-time inference system with the NVIDIA MONAI Cloud APIs. It covers creating and configuring an experiment, making on-the-fly predictions, and handling the outputs.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NVIDIA/monai-cloud-api/blob/main/notebooks/Perform%20Real-time%20Inference.ipynb)

## Table of Contents

- Introduction
- Setup
- Configuring Experiment to Enable the Real-Time Inference
- Triggering Inference on a Specified Image
- Stopping the Experiment from Real-Time Inference Mode
- Cleaning up
- Conclusion

## Introduction

Transitioning to real-time inference can substantially elevate the responsiveness and applicability of AI models in healthcare. Analyzing and interpreting medical images as they are generated, and instantly providing insights, can be transformative, offering benefits such as improved patient outcomes and more efficient use of medical resources.

## Setup

In [None]:
import json
import os

import requests
from requests_toolbelt.multipart.decoder import MultipartDecoder

#### Required Parameters

In [None]:
# API Endpoint and Credentials
host_url = "https://api.monai.ngc.nvidia.com"
ngc_api_key = os.environ.get("MONAI_API_KEY", "<YOUR_API_KEY>")  # we recommend using environment variables for API keys, but you can also hardcode them here
inference_image_url = "<inference image url>"  # replace with your inference image url
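
Before logging in, it can help to fail fast if a placeholder value was left unchanged. The following is a minimal sketch, not part of the MONAI Cloud API: `check_not_placeholder` is a hypothetical helper, and it assumes placeholders are written in angle brackets as in the cell above.

```python
def check_not_placeholder(name: str, value: str) -> str:
    """Return value unchanged, or raise if it is empty or still looks like <...>."""
    stripped = value.strip()
    if not stripped or (stripped.startswith("<") and stripped.endswith(">")):
        raise ValueError(f"{name} is not set; replace the placeholder before continuing.")
    return value
```

It could be applied right after the parameters are defined, for example `ngc_api_key = check_not_placeholder("ngc_api_key", ngc_api_key)`.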

#### Log in to NGC and Set Up the API

In [None]:
# Exchange NGC_API_KEY for JWT
api_url = f"{host_url}/api/v1"
response = requests.post(f"{api_url}/login", json={"ngc_api_key": ngc_api_key})
response.raise_for_status()
assert "user_id" in response.json(), "user_id is not in response."
assert "token" in response.json(), "token is not in response."
user_id = response.json()["user_id"]
token = response.json()["token"]

# Construct the URL and Headers
ngc_org = "iasixjqzw1hj"
base_url = f"{api_url}/orgs/{ngc_org}"
headers = {"Authorization": f"Bearer {token}"}
print("API Calls will be forwarded to", base_url)

## Configuring Experiment to Enable the Real-Time Inference

#### Find the base experiment for VISTA-3D

In [None]:
endpoint = f"{base_url}/experiments"
response = requests.get(endpoint, headers=headers)
assert response.status_code == 200, f"List experiment failed, got {response.json()}."
res = response.json()

# VISTA-3D
vista3d_base_exps = [p for p in res["experiments"] if p["network_arch"] == "monai_vista3d" and not p["base_experiment"]]
assert len(vista3d_base_exps) > 0, "No base experiment found for VISTA-3D."
print("List of available base experiments for VISTA-3D:")
for exp in vista3d_base_exps:
    print(f"  {exp['id']}: {exp['name']} v{exp['version']}")

# Take the latest version
base_experiment = sorted(vista3d_base_exps, key=lambda x: x["version"])[-1]
base_exp_vista = base_experiment["id"]
print("-----------------------------------------------------------------------------------------")
print(f"Base experiment ID for '{base_experiment['name']}' v{base_experiment['version']}: {base_exp_vista}")
print("-----------------------------------------------------------------------------------------")
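
The cell above picks the latest experiment by sorting on `version` directly. If versions are returned as dotted strings (an assumption; the response format is not shown here), a plain string sort can put `"1.9"` after `"1.10"`. A minimal sketch of a numeric sort key follows; `version_key` is a hypothetical helper and the sample records are made up for illustration.

```python
import re


def version_key(version) -> tuple:
    """Turn a value such as '1.10' or 2 into a tuple of ints for numeric ordering."""
    return tuple(int(n) for n in re.findall(r"\d+", str(version)))


# Made-up sample records, shaped like the entries filtered in the cell above.
experiments = [{"id": "a", "version": "1.9"}, {"id": "b", "version": "1.10"}]
latest = max(experiments, key=lambda e: version_key(e["version"]))
print(latest["id"])  # b
```

With real data, the same key could be passed to `sorted(vista3d_base_exps, key=lambda x: version_key(x["version"]))`.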

#### Create a new experiment and bootstrap it for real-time inference

**Note:** Setting the `realtime_infer` parameter when creating the experiment automatically loads the experiment and prepares it for the real-time inference workflow.

In [None]:
data = {
    "name": "my_vista",
    "description": "based on vista",
    "network_arch": "monai_vista3d",
    "base_experiment": [base_exp_vista],
    "realtime_infer": True,  # Auto loads MONAI bundle and enables real-time inference
}

endpoint = f"{base_url}/experiments"
response = requests.post(endpoint, json=data, headers=headers)
assert response.status_code == 201, f"Create experiment failed, got {response.json()}."
res = response.json()
experiment_id = res["id"]
print("Experiment creation succeeded with experiment ID:", experiment_id)
print("---------------------------------\n")
print(json.dumps(res, indent=2))

## Triggering Inference on a Specified Image

Run inference on a specific image within the experiment. The predicted label is returned in the response.

In [None]:
data = {
    "action": "inference",
    "specs": {
        "image": inference_image_url,
        "bundle_params": {
            "label_prompt": list(range(1, 118))  # run inference on all 117 classes (IDs 1 to 117)
        },
    }
}

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"
response = requests.post(endpoint, json=data, headers=headers)
assert response.status_code == 201, f"Run inference failed, got {response.json()}."
print("Inference successful. The label is returned in the response body.")
print(response.headers)
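
To segment only some classes, the same request body can be built with a shorter `label_prompt`. The sketch below is a hypothetical helper, not part of the API; it assumes valid class IDs are exactly 1 to 117, as the cell above implies, and it does not know which ID maps to which structure.

```python
def build_inference_request(image_url: str, label_ids) -> dict:
    """Build the JSON body for an inference job restricted to the given class IDs."""
    labels = sorted({int(i) for i in label_ids})  # de-duplicate and order the IDs
    if not labels:
        raise ValueError("At least one class ID is required.")
    if labels[0] < 1 or labels[-1] > 117:
        raise ValueError("Class IDs must be between 1 and 117.")
    return {
        "action": "inference",
        "specs": {
            "image": image_url,
            "bundle_params": {"label_prompt": labels},
        },
    }
```

The returned dictionary can be posted in place of `data`, for example `requests.post(endpoint, json=build_inference_request(inference_image_url, [1, 3]), headers=headers)`; the IDs 1 and 3 are arbitrary here.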

`MultipartDecoder`, from the `requests_toolbelt` package, is used to decode the response data. If the package is not installed, install it with:

```Bash
pip install requests_toolbelt==1.0.0
```

In [None]:
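multipart_data = MultipartDecoder.from_response(response)
for part in multipart_data.parts:
    # Assumes the filename is the second ";"-separated field of Content-Disposition
    filename = part.headers[b"Content-Disposition"].decode().split(";")[1].split("=")[1].strip('"')

    # Write each part to disk and report it; printing inside the loop covers every part
    with open(filename, "wb") as f:
        f.write(part.content)
    print(f"Inference result downloaded to {filename}")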
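
The filename extraction in the cell above depends on the position of the `filename` field within the `Content-Disposition` header. A more tolerant sketch uses the standard library's header parser instead. `filename_from_part_headers` and the default name are hypothetical, and the exact header the service returns is not shown here, so treat this as an illustration.

```python
import os
from email.message import Message


def filename_from_part_headers(headers, default: str = "inference_output.bin") -> str:
    """Read the filename parameter of Content-Disposition from bytes-keyed headers."""
    raw = headers.get(b"Content-Disposition")
    if raw is None:
        return default
    msg = Message()
    msg["Content-Disposition"] = raw.decode()
    name = msg.get_filename()  # None when no filename parameter is present
    # Keep only the final path component so the file is written to the current directory
    safe = os.path.basename(name or "")
    return safe or default
```

Inside the loop it would replace the chained `split` calls: `filename = filename_from_part_headers(part.headers)`.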

## Stopping the Experiment from Real-Time Inference Mode

When the experiment is created with `realtime_infer` as `True`, it will reserve one GPU to process the inference requests.

Once inference is finished, release the GPU so that it can be used for other tasks.

To do this, switch `realtime_infer` from `True` to `False`.

Note: this step is irreversible. `realtime_infer` cannot be switched back from `False` to `True`, so running real-time inference again requires creating a new experiment.

In [None]:
data = {
    "realtime_infer": False,
}

endpoint = f"{base_url}/experiments/{experiment_id}"
response = requests.patch(endpoint, json=data, headers=headers)
assert response.status_code == 200, f"Stop real-time inference failed, got {response.json()}."
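
Because the experiment holds a GPU until `realtime_infer` is switched off, it may be worth making sure the release runs even when an inference call raises. One way to sketch this is a context manager around the same PATCH request. `realtime_inference` is a hypothetical wrapper, and `http` is assumed to be the `requests` module or a `requests.Session`.

```python
from contextlib import contextmanager


@contextmanager
def realtime_inference(http, base_url: str, headers: dict, experiment_id: str):
    """Yield the experiment ID, then always switch realtime_infer off on exit."""
    try:
        yield experiment_id
    finally:
        # Same request as the cell above; runs on normal exit and on exceptions
        response = http.patch(
            f"{base_url}/experiments/{experiment_id}",
            json={"realtime_infer": False},
            headers=headers,
        )
        response.raise_for_status()
```

Usage would look like `with realtime_inference(requests, base_url, headers, experiment_id): ...`, with the inference requests placed inside the `with` block. Since the switch is irreversible, the block should wrap all inference calls planned for that experiment.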

## Cleaning up

Delete the experiment once all jobs are done.

In [None]:
endpoint = f"{base_url}/experiments/{experiment_id}"
response = requests.delete(endpoint, headers=headers)
assert response.status_code == 200, f"Delete experiment failed, got {response.json()}."
print(response)

## Conclusion

This tutorial showed a streamlined approach to real-time inference with the NVIDIA MONAI Cloud APIs: creating an experiment with `realtime_infer` enabled, running inference on a specified image, saving the returned label, releasing the GPU, and deleting the experiment. With the service handling model loading and inference, users can focus on model refinement and analysis when integrating AI into real-time decision-making workflows.