# Training a Custom MONAI Bundle on NVIDIA DGX Cloud

This guide walks through training a custom MONAI Bundle on NVIDIA DGX Cloud for medical imaging applications, using a dataset held in remote cloud storage.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NVIDIA/monai-cloud-api/blob/main/notebooks/Training%20a%20Custom%20MONAI%20Bundle.ipynb)

## Table of Contents

- Introduction
- Setup
- Dataset Creation
- Custom MONAI Bundle Creation
- Run a Batch Training Job
- Monitoring Job Status and Logging
- Cleaning Up
- Conclusion

## Introduction

NVIDIA DGX Cloud provides cloud clusters for deep learning workloads such as medical imaging. This guide covers the full workflow for a custom MONAI bundle on DGX Cloud, from the initial API setup to a completed training job.

### What You Can Expect to Learn

In this guide, you will learn how to use NVIDIA DGX Cloud to train a custom MONAI Bundle for your medical imaging task. We cover the entire process: setting up your environment, creating a dataset, and running and monitoring a batch training job. By the end of this tutorial, you will have trained a new MONAI Bundle on a dataset stored in remote cloud storage.

If you have not generated your NGC API key or are unsure about the process, follow our step-by-step guide for [Generating and Managing Your Credentials](./Generating%20and%20Managing%20Your%20Credentials.ipynb).


## Setup

In [None]:
!python -c "import requests" || pip install -q "requests"

import json
import os
import time

import requests

#### Required Parameters

In [None]:
# API Endpoint and Credentials
host_url = "https://api.monai.ngc.nvidia.com"
ngc_api_key = os.environ.get("MONAI_API_KEY", "<YOUR_API_KEY>")  # we recommend using environment variables for API keys, but you can also hardcode them here

# The cloud storage provider used in this notebook. Only `aws` and `azure` are currently supported.
cloud_type = "azure"
# Key names under which the storage credentials are passed in the experiment payload.
cloud_account = "account_name"  # use "access_key" if cloud_type == "aws"
cloud_secret = "access_key"  # use "secret_key" if cloud_type == "aws"

# Cloud storage credentials. Needed for storing the data and results of the experiments.
access_id = "<user name for the remote storage object>"  # Please fill it with the actual Access ID
access_secret = "<secret for the remote storage object>"  # Please fill it with the actual Access Secret

# Dataset Cloud Storage URL. This is the cloud storage where the dataset is stored.
container_url = "<remote storage object address>"

# Experiment Cloud Storage. This is the storage where your jobs and experiments data will be stored.
cs_bucket = "<bucket or container name to push experiment job data to>"  # Please fill it with the actual bucket name
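
The comments on `cloud_account` and `cloud_secret` imply that the credential key names depend on the storage provider. The following is a minimal sketch of that mapping, plus a check for parameters that still hold a `<...>` placeholder. Both `credential_key_names` and `find_unfilled` are hypothetical helpers written for illustration; they are not part of the MONAI Cloud API.

```python
# Hypothetical helpers. The mapping mirrors the comments in the parameters cell:
# "account_name"/"access_key" for Azure, "access_key"/"secret_key" for AWS.
CREDENTIAL_KEYS = {
    "azure": ("account_name", "access_key"),
    "aws": ("access_key", "secret_key"),
}


def credential_key_names(cloud_type):
    """Return (account_key_name, secret_key_name) for a supported provider."""
    try:
        return CREDENTIAL_KEYS[cloud_type.lower()]
    except KeyError:
        raise ValueError(
            f"Unsupported cloud_type {cloud_type!r}; expected one of {sorted(CREDENTIAL_KEYS)}"
        ) from None


def find_unfilled(params):
    """Return the names of parameters that still hold a '<...>' placeholder."""
    return [
        name
        for name, value in params.items()
        if isinstance(value, str) and value.startswith("<") and value.endswith(">")
    ]


account_key, secret_key = credential_key_names("aws")
print(account_key, secret_key)
print(find_unfilled({"access_id": "<user name>", "cs_bucket": "my-bucket"}))
```

Running such a check before the first API call can surface a forgotten placeholder earlier than a failed request would.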

#### Log In to NGC and Set Up the API

In [None]:
# Exchange NGC_API_KEY for JWT
api_url = f"{host_url}/api/v1"
response = requests.post(f"{api_url}/login", json={"ngc_api_key": ngc_api_key})
response.raise_for_status()
assert "user_id" in response.json(), "user_id is not in response."
assert "token" in response.json(), "token is not in response."
user_id = response.json()["user_id"]
token = response.json()["token"]

# Construct the URL and Headers
ngc_org = "iasixjqzw1hj"  # This is the default org for MONAI users. Please select the correct org if you are not using the default one.
base_url = f"{api_url}/orgs/{ngc_org}"
headers = {"Authorization": f"Bearer {token}"}
print("API Calls will be forwarded to", base_url)

# MLFlow server
use_mlflow = False
mlflow_server_address = "" # For example "http://127.0.0.1:5000".
mlflow_experiment_name = "" # For example "my_experiment"
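
The login cell checks the response with `assert`, which Python skips when run with the `-O` flag. As a hedged alternative, the sketch below validates the decoded response body with explicit exceptions. The helper names `parse_login_response` and `build_auth_headers` are hypothetical; the field names `user_id` and `token` and the bearer header come from the cell above, and the example payload is made up.

```python
def parse_login_response(payload):
    """Extract (user_id, token) from the decoded login response body."""
    missing = [key for key in ("user_id", "token") if key not in payload]
    if missing:
        raise KeyError(f"Login response is missing: {', '.join(missing)}")
    return payload["user_id"], payload["token"]


def build_auth_headers(token):
    """Build the bearer-token header sent with every later request."""
    return {"Authorization": f"Bearer {token}"}


# Example with a made-up payload; real values would come from response.json().
example_user_id, example_token = parse_login_response(
    {"user_id": "example-user", "token": "example-token"}
)
print(build_auth_headers(example_token))
```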

## Dataset Creation

Refer to [Training a MONAI Segmentation Bundle](./Training%20a%20MONAI%20Segmentation%20Bundle.ipynb) for creating a dataset on remote cloud storage. To keep the example simple, this tutorial uses the same dataset for both training and evaluation.

In [None]:
data = {
    "name": "MONAI_CLOUD",
    "description":"Remote storage object dataset",
    "type": "semantic_segmentation",
    "format": "monai",
    "client_url": container_url,
    "client_id": access_id,
    "client_secret": access_secret,
}

endpoint = f"{base_url}/datasets"
response = requests.post(endpoint, json=data, headers=headers)

assert response.status_code == 201, f"Create dataset failed, got {response.json()}."

res = response.json()
dataset_id = res["id"]
print("Dataset creation succeeded with dataset ID: ", dataset_id)
print("---------------------------------\n")
print(json.dumps(res, indent=2))

## Custom MONAI Bundle Creation

1. **MONAI Bundle**: Use the Spleen Segmentation MONAI bundle from the MONAI Model Zoo. Customize bundles to fit your applications.
2. **Dataset Setup**: Use one dataset ID for this demo. Adjust according to your data structure.
3. **Pretrained Weights**: Official MONAI bundles come with pretrained weights.

Here are some notes about the payload used to create the experiment:

- `name`: A user-defined name for the training experiment, here `"my_spleen_seg"`.
- `description`: An optional brief description of the experiment.
- `network_arch`: The network architecture. The value `"monai_custom"` indicates a custom architecture, which the user must supply through `bundle_url`.
- `train_datasets`: A list of dataset IDs used for training the model. Only one dataset is supported for a MONAI bundle, hence `[dataset_id]`.
- `eval_dataset`: The dataset ID used for evaluating the model. It can differ from the training dataset; here it is the same `dataset_id`.
- `bundle_url`: The location of the MONAI bundle used in this training experiment.

In this example, we use the same dataset for training and metrics validation. Users can also create two different datasets and use different dataset ids for `train_datasets` and `eval_dataset`.
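
As a sketch of the two-dataset case, the payload can be assembled by a small function that takes separate training and evaluation IDs. `build_experiment_payload` is a hypothetical helper, and the IDs, URL, and cloud details in the usage lines are placeholders; only the payload keys come from the notes above.

```python
def build_experiment_payload(name, train_dataset_id, eval_dataset_id,
                             bundle_url, cloud_details, description=""):
    """Assemble the experiment-creation payload from the fields described above."""
    payload = {
        "name": name,
        "network_arch": "monai_custom",        # required for a custom bundle
        "train_datasets": [train_dataset_id],  # exactly one training dataset
        "eval_dataset": eval_dataset_id,       # may differ from the training dataset
        "bundle_url": bundle_url,
        "cloud_details": cloud_details,
    }
    if description:  # the description is optional
        payload["description"] = description
    return payload


# Placeholder values for illustration only.
example_payload = build_experiment_payload(
    name="my_spleen_seg",
    train_dataset_id="<train dataset id>",
    eval_dataset_id="<eval dataset id>",
    bundle_url="<bundle url>",
    cloud_details={},
)
print(example_payload["train_datasets"], example_payload["eval_dataset"])
```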

In [None]:
bundle_url = "https://api.ngc.nvidia.com/v2/models/nvidia/monaihosting/spleen_ct_segmentation/versions/0.5.3/files/spleen_ct_segmentation_v0.5.3.zip"

experiment_cloud_details = {
    "cloud_type": cloud_type,
    "cloud_file_type": "folder",  # use "file" if the data is a tar.gz archive, otherwise "folder"
    "cloud_specific_details": {
        "cloud_bucket_name": cs_bucket,  # bucket or container that receives the job data
        cloud_account: access_id,  # key names depend on cloud_type (see Required Parameters)
        cloud_secret: access_secret,
    }
}

data = {
    "name": "my_spleen_seg",
    "description": "from MONAI model zoo",
    "network_arch": "monai_custom",  # must be "monai_custom" for a custom bundle
    "train_datasets": [dataset_id],  # only one dataset is supported for a MONAI bundle
    "eval_dataset": dataset_id,  # can be a different dataset
    "bundle_url": bundle_url,
    "cloud_details": experiment_cloud_details,
}

endpoint = f"{base_url}/experiments"
response = requests.post(endpoint, json=data, headers=headers)

assert response.status_code == 201, f"Create experiment failed, got {response.json()}."

res = response.json()
experiment_id = res["id"]
base_experiment_ids = res["base_experiment"]
print("Experiment creation succeeded with experiment ID: ", experiment_id)
print("---------------------------------\n")
print(json.dumps(res, indent=2))


## Run a Batch Training Job

1. Submit jobs directly through the cloud API. Modify the job submission payload to add specific parameters.
2. Ensure that the payload format complies with the MONAI bundle configuration standards.

In [None]:
train_spec = {
    "epochs": 2,
}

if use_mlflow:
    mlflow_spec = {
        "tracking": "mlflow",
        "tracking_uri": f"{mlflow_server_address}",
        "experiment_name": f"{mlflow_experiment_name}",
        "save_execute_config": False
    }
    train_spec.update(mlflow_spec)

data = {
  "name": "my_spleen_seg",
  "action": "train",
  "specs": train_spec
}

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"
response = requests.post(endpoint, json=data, headers=headers)

assert response.status_code == 201, f"Run dgx train job failed, got {response.json()}."

job_id = response.json()
print(f"Job submitted successfully with {job_id}.")

By default, training with a MONAI bundle uses the configuration file at `configs/train.json`. You can override specific settings in this file by including key-value pairs in the training payload.

**Customizing Configuration File in Payload**

If your training scenario requires different or additional configuration files, you can specify this in the payload. For example, if the bundle relies on a different configuration file or multiple files, you can define them as follows:

**Using an Alternate Single Configuration File**
```python
train_spec = {
    # ... other training settings ...
    "config_file": "configs/train_autoencoder.json",
}
```

In this example, training uses the settings defined in `train_autoencoder.json` instead of the default `train.json`.

**Specifying Multiple Configuration Files**
```python
train_spec = {
    # ... other training settings ...
    "config_file": ["configs/train.json", "configs/train_continual.json"],
}
```
Here, both `train.json` and `train_continual.json` are used, allowing for a more complex training setup that combines settings from multiple files.

**Important Notes**

- Adaptability: Selecting configuration files per job lets the same bundle serve different training setups.
- Payload Customization: Customize the payload carefully so that training matches your model and dataset.
- File Paths: Ensure that the file paths in the payload point to the configuration files inside the bundle structure.
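
The notes above can be sketched as a small builder that accepts one or several configuration files and rejects paths that cannot lie inside the bundle. `make_train_spec` is a hypothetical helper, and the path check is an assumption about what "within the bundle structure" means (relative paths without `..`); the `config_file` and `epochs` keys come from the examples in this section.

```python
from pathlib import PurePosixPath


def make_train_spec(config_files=None, **overrides):
    """Build a training spec from optional config files and key-value overrides."""
    spec = dict(overrides)
    if config_files:
        files = [config_files] if isinstance(config_files, str) else list(config_files)
        for path in files:
            parts = PurePosixPath(path)
            # Assumption: bundle-relative paths are not absolute and never climb upward.
            if parts.is_absolute() or ".." in parts.parts:
                raise ValueError(f"Config path must stay inside the bundle: {path}")
        # A single file is passed as a string, several files as a list.
        spec["config_file"] = files[0] if len(files) == 1 else files
    return spec


print(make_train_spec("configs/train_autoencoder.json", epochs=2))
print(make_train_spec(["configs/train.json", "configs/train_continual.json"], epochs=2))
```

The resulting dictionary can be used as the `specs` value of the job submission payload.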

## Monitoring Job Status and Logging

Query the job endpoint to check the current state of a submitted job. The helper below polls that endpoint until the job reaches a target status or a timeout expires.

In [None]:
def wait_for_job(endpoint, headers, timeout=1800, interval=5, target_status="Done"):
    """Helper function to wait for job to reach target status."""
    expected = ["Pending", "Running", "Done"]
    assert target_status in expected, f"Invalid target status: {target_status}"
    status_before_target = expected[:expected.index(target_status)]
    start_time = time.time()
    print(f"Waiting for job to reach state {target_status} ...")
    status = None
    while True:
        response = requests.get(endpoint, headers=headers)
        response.raise_for_status()
        status_new = response.json()["status"].title()
        if time.time() - start_time > timeout:
            print(f"\nJob timeout after {timeout} seconds with last status {status_new}.")
            break
        elif status_new not in status_before_target:
            assert status_new == target_status, f"Job failed with status: {status_new}"
            print(f"\nJob reached target status: {status_new}")
            break
        if status_new != status:
            # Print each new status once, then dots while it stays unchanged
            print(f"\n{status_new}", end="", flush=True)
        else:
            print(".", end="", flush=True)
        status = status_new
        time.sleep(interval)


# While the job is running
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"
response = requests.get(endpoint, headers=headers)

assert response.status_code == 200, f"Failed to get job status, got {response.json()}."
for k, v in response.json().items():
    print(f"{k}: {v}")

print("------------------------------------------------------------------------")
wait_for_job(endpoint, headers, timeout=1800)
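
The polling logic can also be written so that it is testable without a network connection. The sketch below is a variant, not the helper used in this notebook: `poll_status` is a hypothetical function that takes the status lookup as a callable, returns the final status, and raises on failure or timeout instead of printing. The status names are the ones used by `wait_for_job`.

```python
import time


def poll_status(fetch_status, target="Done", in_progress=("Pending", "Running"),
                timeout=1800, interval=5, sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns target; raise on failure or timeout."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status == target:
            return status
        if status not in in_progress:
            raise RuntimeError(f"Job ended with unexpected status: {status}")
        if clock() >= deadline:
            raise TimeoutError(f"Job still {status} after {timeout} seconds")
        sleep(interval)


# Simulated status sequence; no request is sent and no real waiting happens.
fake_statuses = iter(["Pending", "Running", "Running", "Done"])
print(poll_status(lambda: next(fake_statuses), sleep=lambda seconds: None))
```

Against the real API, `fetch_status` could be a function that performs the `requests.get` call shown above and returns `response.json()["status"].title()`.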

### Job Log Download

Access and download job logs to troubleshoot or assess performance. The job log is available when the job status is `Running`, `Error`, or `Done`.

Note that the log is not available immediately after the status changes to `Running`, because preparing the environment for the job takes some time.

In [None]:
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"
response = requests.get(endpoint, headers=headers)
assert response.status_code == 200, f"Failed to get job status, got {response.json()}."
status = response.json()["status"].title()
if status in ["Running", "Done", "Error"]:
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}/logs"
    response = requests.get(endpoint, headers=headers)
    assert response.status_code == 200, f"Failed to get job logs, got {response.text}."
    print(response.text)
else:
    print(f"Job status: {status}, logs are not available.")

### Job Bundle Download

Download the completed job bundle once training is finished successfully.

In [None]:
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"
response = requests.get(endpoint, headers=headers)
# The bundle can only be downloaded after training has finished
if response.json()["status"].title() == "Done":
    # Download the trained bundle from the job
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:download"
    response = requests.get(endpoint, data=json.dumps({"export_type": "monai_bundle"}), headers=headers)
    assert response.status_code == 200, f"Failed to download bundle, got {response.json()}."
    with open(f"{job_id}.tar.gz", "wb") as fp:
        fp.write(response.content)
    print(f"The trained bundle is downloaded as {job_id}.tar.gz")
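
Once the archive is on disk, it can be unpacked with Python's standard `tarfile` module. The following is a minimal sketch, assuming the download is a gzip-compressed tar archive as its `.tar.gz` name suggests; the internal layout of the bundle is not assumed. `extract_bundle` is a hypothetical helper that refuses links and entries whose paths would escape the destination directory.

```python
import tarfile
from pathlib import Path


def extract_bundle(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Refuse links and any entry that would land outside dest.
            if member.issym() or member.islnk():
                raise ValueError(f"Links are not extracted: {member.name}")
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        archive.extractall(dest)
        return archive.getnames()
```

For example, `extract_bundle(f"{job_id}.tar.gz", "trained_bundle")` would unpack the download into a local `trained_bundle` directory, where the directory name is an arbitrary choice.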

## Cleaning Up

Delete the experiment after all jobs are done.

In [None]:
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"
response = requests.get(endpoint, headers=headers)
# A job that is not done must be canceled before the experiment is deleted
if response.json()["status"].title() != "Done":
    endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:cancel"
    response = requests.post(endpoint, headers=headers)
    assert response.status_code == 200, f"Cancel job failed, got {response.json()}."
    print(response)

endpoint = f"{base_url}/experiments/{experiment_id}"
response = requests.delete(endpoint, headers=headers)
assert response.status_code == 200, f"Delete experiment failed, got {response.json()}."
print(response)

If base experiments were created, delete them as well before deleting the datasets.

In [None]:
for base_experiment_id in base_experiment_ids:
    endpoint = f"{base_url}/experiments/{base_experiment_id}"
    response = requests.delete(endpoint, headers=headers)
    assert response.status_code == 200, f"Delete base experiment failed, got {response.json()}."
    print(response)

Delete datasets after the experiment is done.

In [None]:
# train dataset
endpoint = f"{base_url}/datasets/{dataset_id}"
response = requests.delete(endpoint, headers=headers)
assert response.status_code == 200, f"Delete dataset failed, got {response.json()}."
print(response)
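
The deletion order used in this section (experiment, then base experiments, then datasets) can be captured in one place. The sketch below only builds the list of URLs and sends no requests; `cleanup_plan` is a hypothetical helper, and the URL patterns are the ones used in the cells above.

```python
def cleanup_plan(base_url, experiment_id, base_experiment_ids, dataset_ids):
    """Return the DELETE URLs in order: experiment, base experiments, datasets."""
    urls = [f"{base_url}/experiments/{experiment_id}"]
    urls += [f"{base_url}/experiments/{base_id}" for base_id in base_experiment_ids]
    urls += [f"{base_url}/datasets/{dataset_id}" for dataset_id in dataset_ids]
    return urls


# Placeholder IDs for illustration only.
for url in cleanup_plan("https://example.invalid/api/v1/orgs/my-org", "exp-1", ["base-1"], ["ds-1"]):
    print("DELETE", url)
```

Each URL could then be passed to `requests.delete(url, headers=headers)` in turn, stopping at the first failure so that dependent resources are never deleted out of order.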

## Conclusion

Congratulations on completing this guide! You created a dataset, defined an experiment from a custom MONAI bundle, ran and monitored a batch training job on NVIDIA DGX Cloud, and downloaded the trained bundle. You are now equipped to apply the customization features of the NVIDIA MONAI Cloud APIs to your own medical imaging projects.