# Training a Custom MONAI Bundle on NVIDIA DGX Cloud

This guide walks through training a custom MONAI Bundle on NVIDIA DGX Cloud using the MONAI Cloud API: creating a dataset, registering the bundle as an experiment, submitting a training job, and monitoring and downloading the results.

## Table of Contents

- Introduction
- Setup
- Dataset Creation
- Custom MONAI Bundle Creation
- Training on DGX Cloud
- Monitoring and Downloading
- Cleaning Up
- Conclusion

## Introduction

Training a custom MONAI Bundle on NVIDIA DGX Cloud lets you use DGX high-performance computing for deep learning in medical imaging projects. This guide covers the steps from creating an experiment for your custom bundle to running and monitoring training on DGX Cloud.

If you haven't already generated your key or if you're unsure about the process, follow our step-by-step guide to [Generating and Managing Your Credentials](./Generating%20and%20Managing%20Your%20Credentials.ipynb).


<a id='Setup'></a>

### Setup

In [None]:
import requests
import json
import time

In [None]:
# API Endpoint and Credentials
host_url = "<MONAI Cloud API URL>"
ngc_api_key = "<NGC API Key>"
# Object storage info
container_url = "<remote object storage address>"
access_id = "<user id>"
access_secret = "<storage secret>"
# training parameters
epochs = 2

In [None]:
# Exchange NGC_API_KEY for JWT
data = json.dumps({"ngc_api_key": ngc_api_key})
response = requests.post(f"{host_url}/api/v1/login", data=data)
print(response.status_code)
assert response.status_code == 201, f"Login failed, got status code: {response.status_code}."
assert "user_id" in response.json().keys(), "user_id is not in response."
user_id = response.json()["user_id"]
print("User ID",user_id)
assert "token" in response.json().keys(), "token is not in response."
token = response.json()["token"]
print("JWT",token)

# Construct the URL and Headers
base_url = f"{host_url}/api/v1/users/{user_id}"
print("API Calls will be forwarded to", base_url)

headers = {"Authorization": f"Bearer {token}"}

# MLFlow server
use_mlflow = False
mlflow_server_address = "" # For example "http://127.0.0.1:5000".
mlflow_experiment_name = "" # For example "my_experiment"
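
The placeholders above can also be filled from environment variables so that secrets do not end up saved in the notebook. The following is a minimal sketch of that idea; the environment variable names and the `load_settings` helper are hypothetical and not part of the MONAI Cloud API.

```python
import os


def load_settings(env=None):
    """Collect connection settings from a mapping (defaults to os.environ).

    The environment variable names are assumptions made for this sketch;
    rename them to match your own setup.
    """
    env = os.environ if env is None else env
    required = {
        "host_url": "MONAI_CLOUD_HOST_URL",
        "ngc_api_key": "NGC_API_KEY",
        "container_url": "STORAGE_CONTAINER_URL",
        "access_id": "STORAGE_ACCESS_ID",
        "access_secret": "STORAGE_ACCESS_SECRET",
    }
    # Report every missing variable at once instead of failing one by one.
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    settings = {name: env[var] for name, var in required.items()}
    # Strip a trailing slash so f"{host_url}/api/v1/..." does not contain "//".
    settings["host_url"] = settings["host_url"].rstrip("/")
    return settings
```

With the variables exported in your shell, `settings = load_settings()` returns a dictionary whose keys match the variable names used in the setup cell (`host_url`, `ngc_api_key`, and so on).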

## Dataset Creation

Please refer to [Training a MONAI Segmentation Bundle](./Training%20a%20MONAI%20Segmentation%20Bundle.ipynb) for how to create a dataset on remote cloud storage. For simplicity, this tutorial uses the same dataset for training and evaluation.

## Using the Remote Object to Create Datasets

After you've completed the steps above, call the API to create your dataset. Below is an example request with its parameters.

In [None]:
data = {
    "name": "MONAI_CLOUD",
    "description":"Object storage dataset",
    "type": "semantic_segmentation",
    "format": "monai",
    "client_url": container_url,
    "client_id": access_id,
    "client_secret": access_secret,
}

endpoint = f"{base_url}/datasets"
response = requests.post(endpoint, json=data, headers=headers)

assert response.status_code == 201, f"Create dataset failed, got {response.json()}."

res = response.json()
dataset_id = res["id"]
print("Dataset creation succeeded with dataset ID: ", dataset_id)
print("---------------------------------\n")
print(json.dumps(res, indent=2))

## Custom MONAI Bundle Creation

1. **MONAI Bundle**: We're using the Spleen Segmentation bundle as an example. Choose the one fitting your application from the MONAI Model Zoo.
2. **Dataset Setup**: All data is under one dataset ID for this demo. Adjust as per your data structure.
3. **Pretrained Weights**: The official MONAI bundles include pretrained weights.

Here are some notes about the payload used to create the experiment:

- name: A user-defined name for the training experiment, here named "my_spleen_seg".
- description: A brief description of the experiment. Optional.
- network_arch: Specifies the architecture of the network. The value "monai_custom" indicates that a custom network architecture is being used, in which case you must provide the `bundle_url` of the bundle that defines it.
- train_datasets: A list of dataset IDs used for training the model. For a MONAI bundle this list supports only one dataset, shown here as [ dataset_id ].
- eval_dataset: The dataset ID used for evaluating the model. It can be different from the training dataset. Here, it is dataset_id.
- bundle_url: The location of the MONAI bundle to be used in this training experiment.

In this example, we use the same dataset for training and metrics validation. Users can also create two different datasets and use different dataset ids for `train_datasets` and `eval_dataset`.
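
The payload rules described above can be captured in a small helper. This is only a sketch: `build_experiment_payload` is a hypothetical function written for this guide, and it uses only the field names listed in the notes above.

```python
def build_experiment_payload(name, train_dataset_id, bundle_url,
                             eval_dataset_id=None, description=""):
    """Build the experiment-creation payload for a custom MONAI bundle.

    Hypothetical helper: it only assembles the fields described in this
    guide and does not call the API.
    """
    if not bundle_url:
        raise ValueError("bundle_url is required when network_arch is 'monai_custom'")
    payload = {
        "name": name,
        "network_arch": "monai_custom",
        # Only one training dataset is supported for a MONAI bundle.
        "train_datasets": [train_dataset_id],
        # Fall back to the training dataset when no evaluation dataset is given.
        "eval_dataset": train_dataset_id if eval_dataset_id is None else eval_dataset_id,
        "bundle_url": bundle_url,
    }
    if description:
        payload["description"] = description
    return payload
```

The returned dictionary can be passed as `json=` to the same `requests.post` call used in the cell below.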

In [None]:
bundle_url = "https://api.ngc.nvidia.com/v2/models/nvidia/monaihosting/spleen_ct_segmentation/versions/0.5.3/files/spleen_ct_segmentation_v0.5.3.zip"

data = {
  "name": "my_spleen_seg",
  "description": "from MONAI model zoo",
  "network_arch": "monai_custom",  # must be using monai_custom
  "train_datasets": [ dataset_id ],  # only one dataset is supported for MONAI bundle
  "eval_dataset": dataset_id,  # it can be a different dataset
  "bundle_url": bundle_url,
}

endpoint = f"{base_url}/experiments"
response = requests.post(endpoint, json=data, headers=headers)

assert response.status_code == 201, f"Create experiment failed, got {response.json()}."

res = response.json()
experiment_id = res["id"]
base_experiment_ids = res["base_experiment"]
print("Experiment creation succeeded with experiment ID: ", experiment_id)
print("---------------------------------\n")
print(json.dumps(res, indent=2))


## Training on DGX Cloud

1. You can submit jobs directly through the cloud API.
1. You can modify the job submission payload to include additional parameters that tailor the run to your requirements.
1. The payload format follows the MONAI bundle configuration conventions, so parameters are structured the same way as in the bundle's own configuration.

In [None]:
train_spec = {
  "epochs": epochs,
}

if use_mlflow:
    mlflow_spec = {
        "tracking": "mlflow",
        "tracking_uri": f"{mlflow_server_address}",
        "experiment_name": f"{mlflow_experiment_name}",
        "train#handlers#-1#artifacts": None
    }
    train_spec.update(mlflow_spec)

data = {
  "action": "train",
  "specs": train_spec
}

endpoint = f"{base_url}/experiments/{experiment_id}/jobs"
response = requests.post(endpoint, json=data, headers=headers)

assert response.status_code == 201, f"Run dgx train job failed, got {response.json()}."

job_id = response.json()
print(f"Job submitted successfully with {job_id}.")

By default, training with a MONAI bundle uses the configuration file at `configs/train.json`. You can override specific settings in this file by including key-value pairs in the training payload.

**Customizing Configuration File in Payload**

If your training scenario requires different or additional configuration files, you can specify this in the payload. For example, if the bundle relies on a different configuration file or multiple files, you can define them as follows:

**Using an Alternate Single Configuration File**
```python
# Use a single alternate configuration file instead of configs/train.json.
train_spec = {
    "epochs": epochs,
    "config_file": "configs/train_autoencoder.json",
}
```

In this example, the training will be based on the settings defined in `train_autoencoder.json` instead of the default `train.json`.

**Specifying Multiple Configuration Files**
```python
# Combine settings from several configuration files.
train_spec = {
    "epochs": epochs,
    "config_file": ["configs/train.json", "configs/train_continual.json"],
}
```
Here, both `train.json` and `train_continual.json` are used, allowing for a more complex training setup that combines settings from multiple files.

**Important Notes**
- Adaptability: This method offers adaptability in training, catering to diverse and complex model training requirements.
- Payload Customization: Carefully customize the payload to ensure that the training aligns with your specific model needs and dataset characteristics.
- File Paths: Ensure that the file paths provided in the payload correctly point to the respective configuration files within the bundle structure.
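
The pieces above (epochs, an optional `config_file`, and extra key-value overrides) can be combined in one place. The sketch below uses a hypothetical `build_train_spec` helper; the only keys it relies on are the ones that already appear in this guide, and which override keys the service accepts for your bundle is an assumption you should verify.

```python
def build_train_spec(epochs, config_file=None, overrides=None):
    """Assemble the "specs" part of a training job payload.

    Hypothetical helper for this guide. `config_file` may be a single path
    or a list of paths inside the bundle; `overrides` holds extra
    key-value pairs to merge into the spec.
    """
    if epochs < 1:
        raise ValueError("epochs must be at least 1")
    spec = {"epochs": epochs}
    if config_file is not None:
        spec["config_file"] = config_file
    # Overrides are applied last so they take precedence over the defaults.
    spec.update(overrides or {})
    return spec


# Example: two configuration files plus the MLflow-related override used earlier.
example_spec = build_train_spec(
    2,
    config_file=["configs/train.json", "configs/train_continual.json"],
    overrides={"train#handlers#-1#artifacts": None},
)
```

The result can be used as the `"specs"` value of the job submission payload, next to `"action": "train"`.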

## Monitoring and Downloading

Monitoring the status of your jobs is a crucial aspect of managing workflows effectively, especially for bundle customization. Here's what you need to know:

1. **Basic Status Overview**: The monitoring functionality in our system is designed to inform you whether your jobs are in a pending, running, done, or error state. This status update allows you to quickly assess the overall progress and detect any immediate issues that may require attention.

Status interpretation:
- "Pending": MONAI Cloud is looking for resources and preparing the datasets. This can take a while, depending on the size of the dataset.
- "Running": MONAI Cloud has submitted the job to the DGX.
- "Done": The training is complete.
- "Error": The job failed. Download the job as a `.tar.gz` archive and inspect the detailed log.

2. **Detailed Logging Through Download API**: For a more comprehensive view and detailed logging of your jobs, our platform offers a Download API. This API enables you to access in-depth logs, model checkpoints, and data outputs, which are instrumental for troubleshooting, in-depth analysis, and gaining insights into the specifics of your job's execution. The Download API is particularly useful if your job encounters an error or if you need to understand the performance and behavior of your job in greater detail.
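
The status interpretation above can be written as a small lookup that decides what to do next. This is an illustrative sketch only: `interpret_status` and the action names it returns are hypothetical, and the status strings are the four listed in this section.

```python
def interpret_status(raw_status):
    """Map a job status string to a suggested next step.

    The status is normalized with str.title() so that "pending",
    "PENDING", and "Pending" are treated the same way.
    """
    status = raw_status.title()
    if status in ("Pending", "Running"):
        return "wait"
    if status == "Done":
        return "download_results"
    if status == "Error":
        return "download_logs"
    # Any status not described in this guide is reported as unknown.
    return "unknown"
```

For example, a polling loop could keep sleeping while `interpret_status(status) == "wait"` and call the Download API for either of the two download actions.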

In [None]:
# Helper functions for running jobs
def wait_for_job(endpoint, headers, timeout):
    start_time = time.time()
    response = requests.get(endpoint, headers=headers)
    assert response.status_code == 200, f"Failed to get job status, got {response.json()}."
    status = response.json()["status"].title()
    print("Waiting for job to complete...")
    print(status, end="", flush=True)
    while True:
        if status not in ["Pending", "Running"]:
            assert status == "Done", f"Job failed with status: {status}"
            break
        time.sleep(5)
        response = requests.get(endpoint, headers=headers)
        assert response.status_code == 200, f"Failed to get job status, got {response.json()}."
        status_new = response.json()["status"].title()
        if status_new != status:
            status = status_new
            print(f"\n{status}", end="", flush=True)
        else:
            print(".", end="", flush=True)
        if time.time() - start_time > timeout:
            # Raise explicitly; a bare assert is skipped when Python runs with -O.
            raise TimeoutError(f"Job timeout after {timeout} seconds.")
    print("\nJob completed successfully!")

# While the job is running
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}"
response = requests.get(endpoint, headers=headers)

assert response.status_code == 200, f"Failed to get job status, got {response.json()}."
for k, v in response.json().items():
    if k != "result":
        print(f"{k}: {v}")
    else:
        print("result:")
        for k1, v1 in v.items():
            print(f"    {k1}: {v1}")

print("------------------------------------------------------------------------")
wait_for_job(endpoint, headers, 600)

**Downloading**

In [None]:
endpoint = f"{base_url}/experiments/{experiment_id}/jobs/{job_id}:download"
response = requests.get(endpoint, headers=headers)

In [None]:
assert response.status_code == 200, f"Failed to download job, got {response.json()}."

# Save to file
attachment_data = response.content
with open(f"{job_id}.tar.gz", 'wb') as f:
    f.write(attachment_data)
print(f"Bundle training results are downloaded as {job_id}.tar.gz")
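
Once the archive is saved, the standard library `tarfile` module can list its contents without unpacking everything. The sketch below is a minimal example; the `find_log_files` helper is hypothetical, and the file suffixes it looks for are an assumption, since the exact layout of the downloaded archive is not described in this guide.

```python
import tarfile


def find_log_files(archive_path, suffixes=(".log", ".txt")):
    """Return the sorted names of regular files in a .tar.gz archive
    whose names end with one of the given suffixes.

    The default suffixes are a guess at what log files may be called.
    """
    suffixes = tuple(suffixes)
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(
            member.name
            for member in tar.getmembers()
            if member.isfile() and member.name.endswith(suffixes)
        )
```

For example, `find_log_files(f"{job_id}.tar.gz")` returns the candidate log files to inspect when a job ends in the "Error" state.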

## Cleaning Up

Delete the experiment after all jobs are done.

In [None]:
endpoint = f"{base_url}/experiments/{experiment_id}"
response = requests.delete(endpoint, headers=headers)
assert response.status_code == 200, f"Delete experiment failed, got {response.json()}."
print(response)

If base experiments were created, delete them as well before deleting the datasets.

In [None]:
for base_experiment_id in base_experiment_ids:
    endpoint = f"{base_url}/experiments/{base_experiment_id}"
    response = requests.delete(endpoint, headers=headers)
    assert response.status_code == 200, f"Delete base experiment failed, got {response.json()}."
    print(response)

Delete datasets after the experiment is done.

In [None]:
# train dataset
endpoint = f"{base_url}/datasets/{dataset_id}"
response = requests.delete(endpoint, headers=headers)
assert response.status_code == 200, f"Delete dataset failed, got {response.json()}."
print(response)
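
The deletion order used in this section (experiment first, then base experiments, then datasets) can be expressed as a single ordered list of endpoints. This is a sketch with a hypothetical `cleanup_endpoints` helper; it only builds URLs from the patterns shown above and does not send any requests.

```python
def cleanup_endpoints(base_url, experiment_id, base_experiment_ids=None, dataset_ids=()):
    """Return the DELETE endpoints in the order used by this guide.

    Experiments are listed before datasets because the datasets are
    deleted only after the experiments that use them are gone.
    """
    urls = [f"{base_url}/experiments/{experiment_id}"]
    # base_experiment_ids may be None or empty when no base experiment exists.
    urls += [f"{base_url}/experiments/{base_id}" for base_id in base_experiment_ids or ()]
    urls += [f"{base_url}/datasets/{dataset_id}" for dataset_id in dataset_ids]
    return urls
```

Each returned URL can then be passed to `requests.delete(url, headers=headers)` in sequence.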

## Conclusion

Congratulations! You have created a dataset, registered a custom MONAI bundle as an experiment, trained it on DGX Cloud, and downloaded the results. You can now adapt the payloads shown here to use the customization features of the NVIDIA MONAI Cloud APIs in your own medical imaging projects.