# Training a Custom MONAI Bundle on NVIDIA DGX Cloud

This guide walks through training a custom MONAI Bundle on NVIDIA DGX Cloud for medical imaging applications: creating a dataset from remote object storage, registering a bundle, submitting a training job, and monitoring and downloading the results.

## Table of Contents

1. [Dataset Creation](#Dataset-Creation)
1. [Custom MONAI Bundle Creation](#Custom-MONAI-Bundle-Creation)
1. [Training on DGX Cloud](#Training-on-DGX-Cloud)
1. [Monitoring and Downloading](#Monitoring-and-Downloading)
1. [Conclusion](#Conclusion)

## Introduction

Training a custom MONAI Bundle on NVIDIA DGX Cloud lets you run deep learning workloads for medical imaging on DGX's high-performance computing resources. This guide covers each step, from creating a dataset and registering a custom bundle to running and monitoring a training job on DGX Cloud.

If you haven't generated your NGC API key yet, or are unsure how to, follow the step-by-step guide to [Generating and Managing Your Credentials](./Generating%20and%20Managing%20Your%20Credentials.ipynb). The cell below exchanges that key for the user ID and token used in all later requests.


In [1]:
import json
import requests

# API Endpoint and Credentials
monai_cloud_api = "<MONAI Cloud API URL>"
api_url = f"{monai_cloud_api}/api/v1"
ngc_api_key = "<NGC API Key>"

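# Log in with the NGC API key to obtain a user ID and bearer token
response = requests.get(f"{api_url}/login/{ngc_api_key}")
response.raise_for_status()  # stop here if the login request failed
credentials = response.json()
uid = credentials["user_id"]
token = credentials["token"]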

# Construct the URL and Headers
base_url = f"{api_url}/user/{uid}"
headers = {"Authorization": f"Bearer {token}"}
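
Hard-coding the API key in a notebook makes it easy to leak. As a minimal sketch, the endpoint and key can instead be read from environment variables. The variable names `MONAI_CLOUD_API_URL` and `NGC_API_KEY` and the helper `load_credentials` are hypothetical choices for this example, not names the platform requires.

```python
import os


def load_credentials(env=None):
    """Return (api_url, ngc_api_key) read from environment variables.

    MONAI_CLOUD_API_URL and NGC_API_KEY are hypothetical variable names
    chosen for this sketch; the platform does not mandate them.
    """
    env = os.environ if env is None else env
    required = ("MONAI_CLOUD_API_URL", "NGC_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Same "/api/v1" suffix as the cell above, without a doubled slash.
    base = env["MONAI_CLOUD_API_URL"].rstrip("/")
    return f"{base}/api/v1", env["NGC_API_KEY"]
```

Calling `api_url, ngc_api_key = load_credentials()` would then replace the two placeholder assignments in the cell above.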

## Dataset Creation

### **1. Remote Object as Data Sources**

The MONAI Cloud platform supports several cloud object storage services, including Azure Blob Storage, Google Cloud Storage (GCS), and Amazon S3, so you can choose the one that best fits your project. The example below uses Azure:

**Steps:**
1. Creating a Storage Account and Container
   - **Storage Account**: Start by creating a new storage account in your Azure portal. This account will host your blob storage containers.
   - **Container Creation**: Within your storage account, create a new container. This container will hold your datasets.

2. Container URL
   - Once the container is created, you will be provided with a unique URL that can be used to access it. This URL will be essential for accessing your data.

## Obtaining Credentials

- **Access Keys**: Access your storage account and navigate to the 'Access keys' section. Here, you will find the necessary credentials to access your Blob Storage programmatically.
- **Shared Access Signature (SAS)**: Alternatively, you can create a SAS for more granular control over permissions and access duration.

## Creating a Manifest JSON File

In the root of your Azure container, create a manifest JSON file to keep track of your datasets. The file format is shown below; each additional entry in `data` follows the same structure as the first:

```json
{
    "root_path": "https://[your-storage-account-name].blob.core.windows.net/[your-container-name]",
    "data": [
        {
            "image": {
                "path": ["path/to/your/image_1"],
                "id": "unique-uuid-1"
            },
            "label": {
                "path": ["path/to/your/label_1"],
                "id": "unique-uuid-2"
            }
        }
    ]
}
```

- Each dataset (training, testing, etc.) should have its own root directory
- All the data for a dataset should be under its root directory
- The root directory should contain a `manifest.json` file
- The `manifest.json` file should contain a "data" field, which is a list of all the data entries
- Each data entry should contain "image" and "label" fields
- Each "image"/"label" field should contain a "path" field, which is a list of paths to the image/label files, relative to the root directory
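
For more than a handful of files, the manifest can be generated rather than written by hand. The sketch below builds the structure described above from a list of image/label path pairs. The helper name `build_manifest`, the storage account and container names, and the file paths are placeholders, and using `uuid4` values for the `id` fields is an assumption based on the `unique-uuid` placeholders in the format.

```python
import json
import uuid


def build_manifest(root_path, pairs):
    """Build a manifest dict from (image_path, label_path) pairs.

    Paths are relative to root_path. Each image and label receives a
    fresh random UUID as its "id" (an assumption of this sketch).
    """
    data = []
    for image_path, label_path in pairs:
        data.append({
            "image": {"path": [image_path], "id": str(uuid.uuid4())},
            "label": {"path": [label_path], "id": str(uuid.uuid4())},
        })
    return {"root_path": root_path.rstrip("/"), "data": data}


manifest = build_manifest(
    # Placeholder account and container names.
    "https://mystorageaccount.blob.core.windows.net/mycontainer",
    [
        ("images/spleen_01.nii.gz", "labels/spleen_01.nii.gz"),
        ("images/spleen_02.nii.gz", "labels/spleen_02.nii.gz"),
    ],
)

# Serialize; the resulting text is what goes into manifest.json.
manifest_text = json.dumps(manifest, indent=4)
print(manifest_text)
```

The printed text can be saved as `manifest.json` and uploaded to the root of the container.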

In [3]:
container_url = "<remote object storage address>"
access_id = "<user id>"
access_secret = "<storage secret>"

## Using the Remote Object to Create Datasets

After completing the steps above, call the API to create your dataset. The example request below passes the container URL and storage credentials along with the dataset name, description, type, and format.

In [None]:
data = {
    "name": "MONAI_CLOUD",
    "description":"Object storage dataset",
    "type": "semantic_segmentation",
    "format": "monai",
    "location": container_url,
    "client_id": access_id,
    "client_secret": access_secret,
}

endpoint = f"{base_url}/dataset"
response = requests.post(endpoint, json=data, headers=headers)

if response.status_code == 201:
    res = response.json()
    dataset_id = res["id"]
    print("Dataset creation succeeded with dataset ID： ", dataset_id)
    print("---------------------------------\n")
    print(json.dumps(res, indent=2))
else:
    print(response.json())
    print(response)
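
The same pattern (POST a payload, expect status 201, read the `id` field) recurs when creating models. A possible way to factor it out is sketched below. `create_resource` is a hypothetical helper, not part of the MONAI Cloud API; it takes the `post` callable as an argument so that `requests.post` can be passed in directly.

```python
def create_resource(post, endpoint, payload, headers, timeout_s=60):
    """POST a JSON payload and return the "id" of the created resource.

    `post` is any callable with the signature of `requests.post`.
    Raises RuntimeError unless the server answers 201 (Created).
    """
    response = post(endpoint, json=payload, headers=headers, timeout=timeout_s)
    if response.status_code != 201:
        raise RuntimeError(
            f"POST {endpoint} returned {response.status_code}: {response.text}"
        )
    return response.json()["id"]
```

With this helper, the cell above reduces to `dataset_id = create_resource(requests.post, f"{base_url}/dataset", data, headers)`, and a failed request raises an error instead of leaving `dataset_id` undefined.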

## Custom MONAI Bundle Creation

1. **MONAI Bundle**: This example uses the Spleen Segmentation bundle. Choose the bundle that fits your application from the MONAI Model Zoo.
2. **Dataset Setup**: In this demo, a single dataset ID is used for both training and evaluation. Adjust this to match your data structure.
3. **Pretrained Weights**: Official MONAI bundles include pretrained weights.

In [7]:
bundle_url = "https://github.com/Project-MONAI/model-zoo/releases/download/hosting_storage_v1/spleen_ct_segmentation_v0.5.3.zip"

data = {
  "name": "my_spleen_seg",
  "description": "from MONAI model zoo",
  "network_arch": "monai_custom",  # must be "monai_custom" for a custom bundle
  "eval_dataset": dataset_id,
  "train_datasets": [dataset_id],
  "bundle_url": bundle_url,
}

endpoint = f"{base_url}/model"
response = requests.post(endpoint, json=data, headers=headers)
if response.status_code == 201:
    res = response.json()
    model_id = res["id"]
    print("Model creation succeeded with model ID： ", model_id)
    print("---------------------------------\n")
    print(json.dumps(res, indent=2))
else:
    print(response.json())
    print(response)

Model creation succeeded with model ID：  e98ad72f-7f47-4ed0-844f-a242ecb414d8
---------------------------------

{
  "actions": [
    "train"
  ],
  "additional_id_info": null,
  "automl_add_hyperparameters": "",
  "automl_algorithm": null,
  "automl_enabled": false,
  "automl_remove_hyperparameters": "",
  "calibration_dataset": null,
  "created_on": "2023-11-22T12:13:54.837278",
  "dataset_type": "user_custom",
  "description": "from MONAI model zoo",
  "encryption_key": "tlt_encode",
  "eval_dataset": "3a02d5dd-f940-4ddc-b593-e7d4aae07e23",
  "id": "e98ad72f-7f47-4ed0-844f-a242ecb414d8",
  "inference_dataset": null,
  "jobs": [],
  "last_modified": "2023-11-22T12:13:54.837296",
  "logo": "https://www.nvidia.com",
  "metric": null,
  "model_params": {},
  "name": "my_spleen_seg",
  "network_arch": "monai_custom",
  "ngc_path": "",
  "ptm": [
    "3f31f0e4-76f2-4fbb-a543-0189d06ea9c2"
  ],
  "public": false,
  "read_only": false,
  "realtime_infer": false,
  "realtime_infer_request_ti

## Training on DGX Cloud

1. You can submit training jobs directly through the cloud API.
1. You can modify the job submission payload to pass additional parameters that tailor the run to your requirements.
1. The payload format follows the MONAI bundle configuration conventions, so parameters are structured consistently with the bundle itself.

In [8]:
data = {
  "epochs": 10,
}

endpoint = f"{base_url}/model/{model_id}/job/train"
response = requests.post(endpoint, json=data, headers=headers)

if response.status_code == 201:
    job_id = response.json()[0]
    print(f"Job submitted successfully with {job_id}.")
else:
    print(response.json())
    print(response)

Job submitted successfully with ce111b2c-c1d9-4fcd-85d6-c402df4484d7.


## Monitoring and Downloading

Monitoring job status is an essential part of managing training workflows. The platform offers two levels of detail:

1. **Basic Status Overview**: The job endpoint reports whether a job is pending, running, or in an error state, so you can quickly assess progress and spot problems that need attention.

2. **Detailed Logging Through Download API**: The Download API provides logs, model checkpoints, and data outputs for troubleshooting and in-depth analysis. It is particularly useful when a job fails or when you need to understand its performance and behavior in detail.

In [9]:
endpoint = f"{base_url}/model/{model_id}/job/{job_id}"
response = requests.get(endpoint, headers=headers)

if response.status_code == 200:
    for k, v in response.json().items():
        if k != "result":
            print(f"{k}: {v}")
        else:
            print("result:")
            for k1, v1 in v.items():
                print(f"    {k1}: {v1}")
else:
    print(response.json())
    print(response)

action: train
created_on: 2023-11-22T12:14:19.577818
id: ce111b2c-c1d9-4fcd-85d6-c402df4484d7
last_modified: 2023-11-22T12:15:58.604297
parent_id: None
result:
    categorical: []
    cur_iter: None
    detailed_status: {'date': '2023-11-22', 'message': 'MONAI job completed. Please download model artifacts to check.', 'status': 'SUCCESS', 'time': '2023-11-22T12:15:44.269236'}
    epoch: 7
    eta: None
    graphical: []
    key_metric: 0.05019168183207512
    kpi: []
    max_epoch: None
    time_per_epoch: None
    time_per_iter: None
spec: {'cluster': 'local', 'epochs': 10, 'num_gpu': 1}
status: Done
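
Because a training job runs asynchronously, the status request usually has to be repeated until the job finishes. The following is a minimal polling sketch. `wait_for_job` is a hypothetical helper; the `"Done"` status is taken from the output above, while `"Error"` is an assumed name for the failure state that may differ on the real platform.

```python
import time

# "Done" appears in the sample output above; "Error" is an assumed name
# for the failure state and may differ on the real platform.
TERMINAL_STATUSES = {"Done", "Error"}


def wait_for_job(fetch_status, interval_s=30.0, timeout_s=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns a terminal status.

    Raises TimeoutError if no terminal status is seen within timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Job still '{status}' after {timeout_s} seconds")
        sleep(interval_s)
```

Here `fetch_status` could be a small function that repeats the GET request from the cell above and returns the `status` field of the JSON response.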


**Downloading**

In [10]:
endpoint = f"{base_url}/model/{model_id}/job/{job_id}/download"
response = requests.get(endpoint, headers=headers)

In [12]:
if response.status_code == 200:
    # Save the downloaded archive to a file
    attachment_data = response.content
    with open(f"{job_id}.tar.gz", 'wb') as f:
        f.write(attachment_data)
    print(f"Bundle training results are downloaded as {job_id}.tar.gz")
else:
    print(response)

Bundle training results are downloaded as ce111b2c-c1d9-4fcd-85d6-c402df4484d7.tar.gz
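
Before unpacking the archive, it can help to check what it contains. The sketch below lists the regular files in a tar archive using Python's standard `tarfile` module. `list_archive_files` is a hypothetical helper, and the actual layout of the downloaded results is not specified here.

```python
import tarfile


def list_archive_files(archive_path):
    """Return the sorted names of regular files inside a tar archive."""
    # "r:*" lets tarfile detect the compression instead of assuming gzip.
    with tarfile.open(archive_path, "r:*") as archive:
        return sorted(
            member.name for member in archive.getmembers() if member.isfile()
        )
```

For example, `list_archive_files(f"{job_id}.tar.gz")` returns the file names in the archive saved by the cell above, which makes it easy to locate logs and checkpoints.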


## Conclusion

Congratulations on reaching this milestone! With your dataset created, your custom bundle trained, and the results downloaded, you're now equipped to use the NVIDIA MONAI Cloud APIs for your medical imaging projects.