Skip to content

Conversation

@Nic-Ma
Copy link
Contributor

@Nic-Ma Nic-Ma commented Apr 25, 2022

Fixes #670 .

Description

This PR added multi-gpu training example in the spleen bundle.

Status

Ready

Checks

  • Notebook runs automatically ./runner [-p <regex_pattern>]

Nic-Ma added 4 commits April 25, 2022 16:30
Signed-off-by: Nic Ma <nma@nvidia.com>
Signed-off-by: Nic Ma <nma@nvidia.com>
Signed-off-by: Nic Ma <nma@nvidia.com>
Signed-off-by: Nic Ma <nma@nvidia.com>
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 26, 2022

Depends on Project-MONAI/MONAI#4168.

Thanks.

@Nic-Ma Nic-Ma changed the title [WIP] 670 add bundle example for multi-gpu training 670 add bundle example for multi-gpu training Apr 26, 2022
@Nic-Ma Nic-Ma marked this pull request as ready for review April 26, 2022 16:35
@Nic-Ma Nic-Ma requested review from ericspod, rijobro and wyli April 26, 2022 16:36
@Nic-Ma Nic-Ma changed the title 670 add bundle example for multi-gpu training [WIP] 670 add bundle example for multi-gpu training Apr 27, 2022
Signed-off-by: Nic Ma <nma@nvidia.com>
@Nic-Ma Nic-Ma force-pushed the 670-support-multigpu branch from 67d1d33 to a7e0ed1 Compare April 27, 2022 11:40
@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 27, 2022

Hi @wyli @ericspod ,

I added 2 methods to support multi-gpu training in this PR:
(1) Newly defined config train_multi_gpu.json which contains all the logic.
(2) A much simpler config multi_gpu_train.json which only overrides the existing train.json config.

Which method do you prefer?

Thanks in advance.

@ericspod
Copy link
Member

I think providing the override and an example of its usage would be a better idea. The override mechanism is quite powerful but I imagine for beginners it's not clear how it's meant to be used.

Nic-Ma added 2 commits April 28, 2022 10:56
Signed-off-by: Nic Ma <nma@nvidia.com>
Signed-off-by: Nic Ma <nma@nvidia.com>
@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 28, 2022

Depends on: Project-MONAI/MONAI#4192.

Thanks.

@Nic-Ma Nic-Ma changed the title [WIP] 670 add bundle example for multi-gpu training 670 add bundle example for multi-gpu training Apr 28, 2022
@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 28, 2022

Hi @ericspod @wyli ,

Thanks for your review and comments.
I updated the PR to make it clearer with the multiple configs method instead of the "implicit" args_file method:
--config_file [configs/train.json, configs/multi_gpu_train.json]
It depends on this minor PR: Project-MONAI/MONAI#4192.
Could you please help review it again?

Thanks in advance.

@Nic-Ma Nic-Ma changed the title 670 add bundle example for multi-gpu training [WIP] 670 add bundle example for multi-gpu training Apr 28, 2022
@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 28, 2022

I will also add the evaluation config in this PR soon.

Thanks.

Nic-Ma and others added 2 commits April 28, 2022 19:26
@ericspod
Copy link
Member

I approved and committed Project-MONAI/MONAI#4192

Signed-off-by: Nic Ma <nma@nvidia.com>
@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 28, 2022

I approved and committed Project-MONAI/MONAI#4192

Thanks @ericspod ,

@wyli @ericspod Then this PR is ready for review now.
I added multi-gpu training, evaluate configs based on override, and also simplified the inference config as we have completed the configs of this bundle now.

Thanks.

@Nic-Ma Nic-Ma changed the title [WIP] 670 add bundle example for multi-gpu training 670 add bundle example for multi-gpu training Apr 28, 2022
@ericspod
Copy link
Member

I tried reviewing through ReviewNB, I'm not sure it's as coherent as I'd hoped.

When you have a structure like:

"train":{
...some defs...
}

The "train" value is interpreted as a dict by the parser correct? So in places where you have "train#loss" or something like it that sets that entry in that dictionary. I think that needs to be explained in a bit more detail to make sense to the reader and to understand what to use it for.

Copy link
Member

@ericspod ericspod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had commented on a few changes to text and a question on references to members of dictionaries (I think) like "validate#postprocessing" that we should explain at little more.

Signed-off-by: Nic Ma <nma@nvidia.com>
@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 28, 2022

I tried reviewing through ReviewNB, I'm not sure it's as coherent as I'd hoped.

When you have a structure like:

"train":{
...some defs...
}

The "train" value is interpreted as a dict by the parser correct? So in places where you have "train#loss" or something like it that sets that entry in that dictionary. I think that needs to be explained in a bit more detail to make sense to the reader and to understand what to use it for.

Good point, I added more description for it in the dataset and dataloader section.

Thanks.

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 28, 2022

Hi @ericspod ,

Thanks for your review and comments, I updated the PR according to them.
Could you please help review again?

Thanks in advance.

Copy link
Member

@ericspod ericspod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@wyli wyli merged commit b66354a into master Apr 29, 2022
@wyli wyli deleted the 670-support-multigpu branch April 29, 2022 11:52
boneseva pushed a commit to boneseva/MONAI-tutorials that referenced this pull request Apr 21, 2024
* [DLMED] draft config

Signed-off-by: Nic Ma <nma@nvidia.com>

* [DLMED] update for test

Signed-off-by: Nic Ma <nma@nvidia.com>

* [DLMED] update based on enhancement

Signed-off-by: Nic Ma <nma@nvidia.com>

* [DLMED] update tutorial

Signed-off-by: Nic Ma <nma@nvidia.com>

* [DLMED] simplify to override

Signed-off-by: Nic Ma <nma@nvidia.com>

* [DLMED] update according to comments

Signed-off-by: Nic Ma <nma@nvidia.com>

* [DLMED] remove test file

Signed-off-by: Nic Ma <nma@nvidia.com>

* [DLMED] add evaluation config

Signed-off-by: Nic Ma <nma@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [DLMED] simplify inference

Signed-off-by: Nic Ma <nma@nvidia.com>

* [DLMED] update according to comments

Signed-off-by: Nic Ma <nma@nvidia.com>

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add multi-gpu example in the spleen segmentation bundle

4 participants