Copyright (c) MONAI Consortium  
Licensed under the Apache License, Version 2.0 (the "License");  
you may not use this file except in compliance with the License.  
You may obtain a copy of the License at  
&nbsp;&nbsp;&nbsp;&nbsp;http://www.apache.org/licenses/LICENSE-2.0  
Unless required by applicable law or agreed to in writing, software  
distributed under the License is distributed on an "AS IS" BASIS,  
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  
See the License for the specific language governing permissions and  
limitations under the License.

# Fast Inference with MONAI features

This tutorial compares the performance of a standard PyTorch inference program with a MONAI-optimized inference program. The key features include:

1. **Direct Data Loading**: Load data directly from disk to GPU memory, minimizing data transfer time and improving efficiency.
2. **GPU-based Preprocessing**: Execute preprocessing transforms directly on the GPU, leveraging its computational power for faster data preparation.
3. **TensorRT Inference**: Utilize TensorRT for running inference, which optimizes the model for high-performance execution on NVIDIA GPUs.

This tutorial is modified from the `TensorRT_inference_acceleration` tutorial.

## Setup environment

Loading data directly from disk to GPU memory requires the `kvikio` library. This tutorial also depends on `monai`, `torch`, `torch_tensorrt`, `numpy`, `ignite`, `pandas`, and `matplotlib`, among others. We recommend running it in the [MONAI Docker](https://docs.monai.io/en/latest/installation.html#from-dockerhub) image, which includes pre-configured dependencies and lets you skip manual installation.

If not using MONAI Docker, install `kvikio` using one of these methods:

- **PyPI Installation**  
  Use the appropriate package for your CUDA version:
  ```bash
  pip install kvikio-cu12  # For CUDA 12
  pip install kvikio-cu11  # For CUDA 11
  ```

- **Conda/Mamba Installation**  
  Follow the official [KvikIO installation guide](https://docs.rapids.ai/api/kvikio/nightly/install/) for Conda/Mamba installations.

For convenience, the cell below installs all the dependencies. Adjust the `kvikio` package to match your CUDA version; only CUDA 11 and CUDA 12 are currently supported.

In [None]:
!python -c "import monai" || pip install -q "monai-weekly[nibabel, pydicom, tqdm]"
!python -c "import matplotlib" || pip install -q matplotlib
!python -c "import torch_tensorrt" || pip install torch_tensorrt
!python -c "import kvikio" || pip install kvikio-cu12
!python -c "import ignite" || pip install pytorch-ignite
!python -c "import pandas" || pip install pandas
!python -c "import requests" || pip install requests
!python -c "import fire" || pip install fire
!python -c "import onnx" || pip install onnx
%matplotlib inline
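
The install cell above hardcodes `kvikio-cu12`. As a minimal sketch of how the package name could be derived from a CUDA version string instead, the hypothetical helper below (`kvikio_package_for` is not part of MONAI or KvikIO) maps the major version to one of the two wheel names listed earlier. In a notebook you might pass it `torch.version.cuda`, assuming PyTorch was built with CUDA support.

```python
def kvikio_package_for(cuda_version: str) -> str:
    """Map a CUDA version string such as "12.4" to a kvikio wheel name.

    Hypothetical helper for illustration; only CUDA 11 and 12 are handled,
    matching the packages named in this tutorial.
    """
    major = cuda_version.strip().split(".")[0]
    supported = {"11": "kvikio-cu11", "12": "kvikio-cu12"}
    if major not in supported:
        raise ValueError(f"Unsupported CUDA major version: {major!r}")
    return supported[major]


print(kvikio_package_for("12.4"))  # kvikio-cu12
```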

## Setup imports

In [None]:
import copy
import os

import matplotlib.pyplot as plt
import monai
import numpy as np
import torch
import torch_tensorrt
from monai.config import print_config
from monai.data import Dataset, ThreadDataLoader
from monai.transforms import (
    Compose,
    EnsureChannelFirstd,
    EnsureTyped,
    LoadImaged,
    Orientationd,
    ScaleIntensityRanged,
    Spacingd,
)

print(f"Torch-TensorRT version: {torch_tensorrt.__version__}.")

print_config()

## Prepare Test Data, Bundle, and TensorRT Model

We provide a helper script, [`prepare_data.py`](./prepare_data.py), to simplify the setup process. This script performs the following tasks:

- **Test Data**: Downloads and extracts the [Medical Segmentation Decathlon Task09 Spleen dataset](http://medicaldecathlon.com/).
- **Bundle**: Downloads the required `spleen_ct_segmentation` bundle.
- **TensorRT Model**: Exports the downloaded bundle model to a TensorRT engine-based TorchScript model. By default, the script exports the model using `fp16` precision, but you can modify it to use `fp32` precision if desired.

The script automatically checks for existing data, bundles, and exported models before downloading or exporting. This ensures that repeated executions of the notebook do not result in redundant operations.
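
The check-before-work behaviour described above can be sketched as follows. This is not the helper script's actual code: `ensure_artifact` and the placeholder build function are hypothetical names, and the real script may use different criteria to decide whether an artifact already exists.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_artifact(path: Path, build: Callable[[Path], None]) -> bool:
    """Run build(path) only if path is missing; return True if build ran."""
    if path.exists():
        return False  # already prepared, skip the expensive step
    path.parent.mkdir(parents=True, exist_ok=True)
    build(path)
    return True


# Demonstration in a temporary directory with a placeholder "export" step.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "models" / "model_trt.ts"
    first = ensure_artifact(target, lambda p: p.write_bytes(b"placeholder"))
    second = ensure_artifact(target, lambda p: p.write_bytes(b"placeholder"))
    print(first, second)  # True False
```

The second call returns `False` because the file already exists, which is the property that keeps repeated notebook runs cheap.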

In [None]:
from utils import prepare_test_datalist, prepare_test_bundle, prepare_tensorrt_model

root_dir = "."

train_files = prepare_test_datalist(root_dir)
bundle_path = prepare_test_bundle(bundle_dir=root_dir, bundle_name="spleen_ct_segmentation")
trt_model_name = "model_trt.ts"
prepare_tensorrt_model(bundle_path, trt_model_name)

## Benchmark the end-to-end bundle inference

A variable `benchmark_type` is defined to specify the type of benchmark to run. To have a fair comparison, each benchmark type should be run after restarting the notebook kernel.

`benchmark_type` can be one of the following:

- `"original"`: benchmark the original bundle inference.
- `"trt"`: benchmark the TensorRT accelerated bundle inference.
- `"trt_gds"`: benchmark the TensorRT accelerated bundle inference with GPU data loading and GPU transforms.
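
Because each benchmark cell in this notebook is guarded by an `if benchmark_type == ...` check, a misspelled value would silently skip every benchmark. A small, optional guard such as the sketch below can catch that early; `check_benchmark_type` is a hypothetical helper, not something provided by `utils.py`.

```python
VALID_BENCHMARK_TYPES = ("original", "trt", "trt_gds")


def check_benchmark_type(value: str) -> str:
    """Return value unchanged if it names a known benchmark, else raise."""
    if value not in VALID_BENCHMARK_TYPES:
        raise ValueError(
            f"benchmark_type must be one of {VALID_BENCHMARK_TYPES}, got {value!r}"
        )
    return value


checked = check_benchmark_type("trt_gds")
```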

In [4]:
benchmark_type = "trt_gds"

A `TimerHandler` is defined to benchmark every part of the inference process.

Please refer to `utils.py` for the implementation of `CUDATimer` and `TimerHandler`.
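
For readers who want a feel for how GPU-aware timing generally works before opening `utils.py`, here is a minimal sketch. It is not the `CUDATimer` or `TimerHandler` implementation from this tutorial; `SketchTimer` is a hypothetical class that uses `torch.cuda.Event` when CUDA is available and falls back to `time.perf_counter` otherwise.

```python
import time
from contextlib import contextmanager

try:  # torch is optional for this sketch
    import torch

    _USE_CUDA = torch.cuda.is_available()
except ImportError:
    _USE_CUDA = False


class SketchTimer:
    """Collects per-stage latencies in milliseconds (illustrative only)."""

    def __init__(self):
        self.records = {}

    @contextmanager
    def track(self, name):
        if _USE_CUDA:
            # CUDA events are recorded on the GPU stream, so they measure
            # asynchronous kernels that wall-clock timing would miss.
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            try:
                yield
            finally:
                end.record()
                torch.cuda.synchronize()  # wait until both events completed
                self.records.setdefault(name, []).append(start.elapsed_time(end))
        else:
            t0 = time.perf_counter()
            try:
                yield
            finally:
                elapsed_ms = (time.perf_counter() - t0) * 1000.0
                self.records.setdefault(name, []).append(elapsed_ms)


timer = SketchTimer()
with timer.track("sleep"):
    time.sleep(0.01)  # stand-in for a preprocessing or inference step
```

After the `with` block, `timer.records["sleep"]` holds one latency value in milliseconds.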

In [5]:
from utils import TimerHandler, prepare_workflow, benchmark_workflow

### Benchmark the Original Bundle Inference

In this section, the `workflow` runs several iterations to benchmark the latency.

In [6]:
model_weight = os.path.join(bundle_path, "models", "model.pt")
meta_config = os.path.join(bundle_path, "configs", "metadata.json")
inference_config = os.path.join(bundle_path, "configs", "inference.json")

override = {
    "dataset#data": [{"image": i} for i in train_files],
    "output_postfix": benchmark_type,
}

In [7]:
if benchmark_type == "original":

    workflow = prepare_workflow(inference_config, meta_config, bundle_path, override)
    torch_timer = TimerHandler()
    benchmark_df = benchmark_workflow(workflow, torch_timer, benchmark_type)

### Benchmark the TensorRT Accelerated Bundle Inference

In this part, the TensorRT-accelerated model is loaded into the `workflow`. The updated `workflow` runs the same iterations as before to benchmark the latency difference. Because the TensorRT-accelerated model cannot be loaded through the `CheckpointLoader` and does not support `amp` mode, the `CheckpointLoader` in the `initialize` section of the `workflow` must be disabled and the `amp` parameter of the `evaluator` must be set to `False`.

In [8]:
if benchmark_type == "trt":
    trt_model_path = os.path.join(bundle_path, "models", "model_trt.ts")
    trt_model = torch.jit.load(trt_model_path)

    override["load_pretrain"] = False
    override["network_def"] = trt_model
    override["evaluator#amp"] = False

    workflow = prepare_workflow(inference_config, meta_config, bundle_path, override)
    trt_timer = TimerHandler()
    benchmark_df = benchmark_workflow(workflow, trt_timer, benchmark_type)

### Benchmark the TensorRT Accelerated Bundle Inference with GPU Data Loading and GPU-based Transforms

In the previous section, the inference workflow utilized CPU-based transforms. In this section, we enhance performance by leveraging GPU acceleration:

- **GPU Direct Storage (GDS)**: The `LoadImaged` transform enables GDS for `.nii` and `.dcm` files when `to_gpu=True` is specified.
- **GPU-based Transforms**: After GDS, subsequent preprocessing transforms are executed directly on the GPU.
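
Since direct-to-GPU loading depends on `kvikio`, it can be useful to check that the library is importable before relying on it. The sketch below is an optional guard, not part of the tutorial's workflow: `kvikio_available` is a hypothetical helper, and what to do when the check fails (for example, using one of the other benchmark types) is left to the reader.

```python
import importlib.util


def kvikio_available() -> bool:
    """Return True if the kvikio package can be imported in this environment."""
    return importlib.util.find_spec("kvikio") is not None


use_gds = kvikio_available()
print(f"kvikio importable: {use_gds}")
```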

In [15]:
transforms = Compose(
    [
        # to_gpu=True loads the image from disk directly into GPU memory (GDS).
        LoadImaged(keys="image", reader="NibabelReader", to_gpu=True),
        EnsureTyped(keys="image", device=torch.device("cuda:0")),
        EnsureChannelFirstd(keys="image"),
        Orientationd(keys="image", axcodes="RAS"),
        Spacingd(keys="image", pixdim=[1.5, 1.5, 2.0], mode="bilinear"),
        ScaleIntensityRanged(keys="image", a_min=-57, a_max=164, b_min=0, b_max=1, clip=True),
    ]
)

dataset = Dataset(data=[{"image": i} for i in train_files], transform=transforms)
dataloader = ThreadDataLoader(dataset, batch_size=1, shuffle=False, num_workers=0)

In [None]:
if benchmark_type == "trt_gds":

    trt_model_path = os.path.join(bundle_path, "models", "model_trt.ts")
    trt_model = torch.jit.load(trt_model_path)
    override = {
        "output_postfix": benchmark_type,
        "load_pretrain": False,
        "network_def": trt_model,
        "evaluator#amp": False,
        "preprocessing": transforms,
        "dataset": dataset,
        "dataloader": dataloader,
    }

    workflow = prepare_workflow(inference_config, meta_config, bundle_path, override)
    trt_gpu_trans_timer = TimerHandler()
    benchmark_df = benchmark_workflow(workflow, trt_gpu_trans_timer, benchmark_type)