
Weights & Biases handlers for MonAI #6305

Closed
wants to merge 33 commits into from

Conversation


@soumik12345 soumik12345 commented Apr 5, 2023

Features Contributed

  • WandbStatsHandler defines a set of Ignite event handlers for all the Weights & Biases logging logic. It can be used with any Ignite Engine (trainer, validator, or evaluator) and supports both epoch-level and iteration-level logging. The expected data sources are Ignite's engine.state.output and engine.state.metrics. Default behaviors:
    • When EPOCH_COMPLETED, write each dictionary item in engine.state.metrics to Weights & Biases.
    • When ITERATION_COMPLETED, write each dictionary item in self.output_transform(engine.state.output) to Weights & Biases.
  • WandbModelCheckpointHandler inherits from ignite.handlers.ModelCheckpoint and can be used to periodically save objects as Weights & Biases artifacts.
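The two default behaviors above can be sketched with a stand-in `log` function in place of `wandb.log` and a dummy engine. Everything below is an illustrative mock, not the PR's actual code:

```python
# Stand-in for wandb.log: collect logged dicts in a list.
logged = []

def log(data: dict) -> None:
    logged.append(data)

class State:
    def __init__(self):
        self.metrics = {}
        self.output = None

class DummyEngine:
    """Minimal stand-in for an Ignite Engine with a .state attribute."""
    def __init__(self):
        self.state = State()

def on_epoch_completed(engine) -> None:
    # When EPOCH_COMPLETED: write each item in engine.state.metrics.
    log(dict(engine.state.metrics))

def on_iteration_completed(engine, output_transform=lambda x: x) -> None:
    # When ITERATION_COMPLETED: write each item in
    # output_transform(engine.state.output).
    out = output_transform(engine.state.output)
    if isinstance(out, dict):
        log(out)
    else:
        log({"Loss": out})  # scalar output falls back to the tag_name default

engine = DummyEngine()
engine.state.metrics = {"val_mean_dice": 0.91}
on_epoch_completed(engine)
engine.state.output = 0.25
on_iteration_completed(engine)
```

In the real handler these two functions would be attached to the engine via `engine.add_event_handler(Events.EPOCH_COMPLETED, ...)` and `Events.ITERATION_COMPLETED`.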

The following Colab notebook and Weights & Biases run demonstrate the usage of these handlers and their results, respectively:


Some additional Weights & Biases features:

  • When TensorBoardStatsHandler and TensorBoardImageHandler are used inside a wandb run, Weights & Biases automatically hosts the TensorBoard instance inside the run if sync_tensorboard is set to True during wandb.init().
  • When used with TensorBoardImageHandler, images and videos are automatically logged to the Weights & Biases media panel if sync_tensorboard is set to True during wandb.init().


Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

@soumik12345 soumik12345 marked this pull request as draft April 6, 2023 10:10
@soumik12345 soumik12345 marked this pull request as ready for review April 6, 2023 10:42
@Nic-Ma
Contributor

Nic-Ma commented Apr 6, 2023

Hi @soumik12345 ,

Thanks for your contribution here.
Please add the following:

  1. Unit tests.
  2. Add to docs: https://github.com/Project-MONAI/MONAI/blob/dev/docs/source/handlers.rst.
  3. Add new optional dependency for wandb, referring to: https://github.com/Project-MONAI/MONAI/blob/dev/CONTRIBUTING.md#adding-new-optional-dependencies.

Thanks in advance.
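For item 3, the optional-dependency pattern the CONTRIBUTING guide describes can be sketched as below. This is a simplified stand-in for monai.utils.optional_import (the real helper has more features, such as version checks and descriptive errors):

```python
import importlib

def optional_import(name: str):
    """Return (module, True) if importable, else (None, False):
    a simplified sketch of MONAI's optional-import pattern, which
    defers the failure until the module is actually needed."""
    try:
        return importlib.import_module(name), True
    except ImportError:
        return None, False

# wandb stays optional: code paths that need it check the flag first.
wandb, has_wandb = optional_import("wandb")
json_mod, has_json = optional_import("json")  # stdlib, always importable
```

Handler code would then raise a clear error only when instantiated without wandb installed, instead of failing at import time.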

monai/handlers/wandb_handlers.py (outdated)
tag_name: when iteration output is a scalar, tag_name is used to plot, defaults to `'Loss'`.
"""
if wandb.run is None:
raise wandb.Error("You must call `wandb.init()` before WandbStatsHandler()")
Contributor
Hi @soumik12345,
I think it would be better if the handler itself processed the initialization of wandb at the beginning of the experiment and kept the run inside the handler. @Nic-Ma, what do you think?
If the experiment run is inside the handler, please be careful about the multiprocessing and single-process-multi-task cases.

Thanks,
Bin

Author

Hi @binliunls, since there are 2 different handlers (and potentially a third one in the future, as per #6305 (comment)), I think it's best if the initialization of the wandb run is left independent of the handlers. Would of course love to know your thoughts as well.

cc: @Nic-Ma

@dzenanz
Contributor

dzenanz commented Apr 6, 2023

Looking at your run, the visualization of 3D images is hard to comprehend. The images overflow my screen (as if they were high-resolution) but appear very blocky. Is this expected?

Another question, can this tensorboard 3D visualization plugin be used with W&B, and how? If not, could it be ported/integrated? It is open source.

@soumik12345
Author

Looking at your run, the visualization of 3D images is hard to comprehend. The images overflow my screen (as if they were high-resolution) but appear very blocky. Is this expected?

Another question, can this tensorboard 3D visualization plugin be used with W&B, and how? If not, could it be ported/integrated? It is open source.

Hi @dzenanz, currently the 3D images are being logged as GIFs by the TensorBoardImageHandler. If you wish to exclude them from the wandb run, you can remove sync_tensorboard=True from wandb.init.

I can explore whether it is possible to integrate tensorboard-plugin-3d with Weights & Biases. However, I can also create a WandbImageHandler, similar to the TensorBoardImageHandler, that would use image overlays for segmentation maps combined with the media-panel step slider to visualize 3D volumetric images and compare visualizations across multiple runs.
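A rough sketch of what such a hypothetical WandbImageHandler could log: each depth slice of a volume paired with its segmentation mask via wandb.Image's masks argument, so the media-panel step slider scrubs through depth. The data, class labels, and handler shape here are all illustrative, and the wandb import is guarded so the sketch has no hard dependency:

```python
import numpy as np

try:
    import wandb  # optional: only needed for the actual logging call
except ImportError:
    wandb = None

# Fake depth x H x W image volume and a matching label map.
volume = np.random.rand(4, 64, 64)
seg = np.zeros((4, 64, 64), dtype=np.uint8)
seg[:, 16:48, 16:48] = 1  # a square "tumor" region on every slice

slices = []
for z in range(volume.shape[0]):
    if wandb is not None:
        slices.append(
            wandb.Image(
                volume[z],
                masks={"prediction": {
                    "mask_data": seg[z],
                    "class_labels": {0: "background", 1: "tumor"},
                }},
            )
        )
# Inside an active run, one would then call:
# wandb.log({"volume_slices": slices})
```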

@soumik12345
Author

Hi @soumik12345 ,

Thanks for your contribution here. Please add the following:

  1. Unit tests.
  2. Add to docs: https://github.com/Project-MONAI/MONAI/blob/dev/docs/source/handlers.rst.
  3. Add new optional dependency for wandb, referring to: https://github.com/Project-MONAI/MONAI/blob/dev/CONTRIBUTING.md#adding-new-optional-dependencies.

Thanks in advance.

Hi @Nic-Ma

Added the following as per your request:

  • Unit tests for the handlers
  • Documentation for the handlers
  • New optional dependency for wandb

@Nic-Ma
Contributor

Nic-Ma commented Apr 7, 2023

Please add your test file here so the min tests skip it, since it has an optional dependency:
https://github.com/Project-MONAI/MONAI/blob/dev/tests/min_tests.py#L203
And please fix the other CI failures.

Thanks.
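The requested change amounts to adding the new test module's name to the exclusion list in tests/min_tests.py, roughly as below (the wandb test name is hypothetical; the real file lists many more entries):

```python
# Sketch of the exclusion list in tests/min_tests.py: tests whose modules
# need optional dependencies are listed here so the minimal-dependency
# CI run skips them instead of failing on import.
exclude_cases = [
    "test_handler_tb_stats",   # existing tensorboard-dependent test
    "test_wandb_handlers",     # new: depends on the optional wandb package
]

def should_skip(test_name: str) -> bool:
    return test_name in exclude_cases
```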

myron and others added 19 commits April 7, 2023 09:30
Issues: #6291

Allows skipping the already-trained algos and continuing training only for
the non-trained ones.

After this PR, the default option AutoRunner(train=None) will have this
behavior, whereas manually setting AutoRunner(train=True/False) will
always train all or skip all training. Previously we could only train all
or skip all (without any option to resume).

---------

Signed-off-by: myron <amyronenko@nvidia.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
for more information, see https://pre-commit.ci

Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
for more information, see https://pre-commit.ci

Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Reverts #6290
fixes #6294

this commit is not compatible with the integration tests

---------

Signed-off-by: Wenqi Li <wenqil@nvidia.com>
Signed-off-by: Mingxin Zheng <18563433+mingxin-zheng@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Mingxin Zheng <18563433+mingxin-zheng@users.noreply.github.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Fixes #6184  .

### Description

Enable bundle_gen to skip an algorithm if the BundleAlgo
pre_check_skip_algo() function sets skip_bundlegen=True. SegResNet2D
overrides pre_check_skip_algo() and will be skipped if the data is not
highly anisotropic.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u
--net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick
--unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/`
folder.

---------

Signed-off-by: heyufan1995 <heyufan1995@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
…6254)

Fixes #6193

### Description

- adds `buffer_step` (code adapted from @myron's idea/implementation)
- clean up multioutput resizing logic
- perf optimize
- `overlap` to support different values along the spatial dimensions

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [x] New tests added to cover the changes.
- [x] Integration tests passed locally by running `./runtests.sh -f -u
--net --coverage`.
- [x] Quick tests passed locally by running `./runtests.sh --quick
--unittests --disttests`.
- [x] In-line docstrings updated.
- [x] Documentation updated, tested `make html` command in the `docs/`
folder.

---------

Signed-off-by: Wenqi Li <wenqil@nvidia.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
to solve: #6285

Currently ClassesToIndicesd can be used to pre-compute and cache the
locations of all background/foreground classes.
This significantly speeds up the cropping transform
RandCropByLabelClassesd, since the coordinates of class voxels do not
need to be computed every time.

Unfortunately, for an average dataset it requires ~80 GB of extra RAM,
with the average cache (per image) on the order of the image size itself
(most of it due to background coordinates, or to some large segmentation
classes).

This PR (optionally) clips the number of coordinates in each class to a
parameter max_indices_per_class (e.g. 5000), which greatly reduces the
cache size requirements to only a few MB. Generally these coordinates
are used to randomly choose a cropping center at each epoch, e.g. 1000
epochs -> 1000 centers per image, so we don't need to store ALL the
foreground coordinates; it's sufficient to store just some large random
subset.
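The clipping described above can be sketched as a per-class random subsample (the function and variable names here are illustrative, not the PR's code):

```python
import numpy as np

def clip_class_indices(per_class_indices, max_indices_per_class, seed=0):
    """Keep at most a fixed-size random subset of each class's voxel
    indices, bounding the cache size while leaving enough coordinates
    to draw cropping centers from."""
    rng = np.random.default_rng(seed)
    clipped = []
    for idx in per_class_indices:
        idx = np.asarray(idx)
        if idx.size > max_indices_per_class:
            idx = rng.choice(idx, size=max_indices_per_class, replace=False)
        clipped.append(idx)
    return clipped

background = np.arange(1_000_000)  # e.g. a huge background coordinate list
tumor = np.arange(200)             # small classes are kept whole
clipped = clip_class_indices([background, tumor], max_indices_per_class=5000)
```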

PS: I wasn't sure whether it's better to also change ClassesToIndices (not
ClassesToIndicesD); feel free to edit this PR.

thank you


### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u
--net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick
--unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/`
folder.

---------

Signed-off-by: myron <amyronenko@nvidia.com>
Signed-off-by: Wenqi Li <wenqil@nvidia.com>
Co-authored-by: Wenqi Li <wenqil@nvidia.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
To solve: #6299

### Description

You can find the description of the changes in issue #6299

Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
…6309)

Fixes # .
#6297

### Description

The PyTorch ONNX exporter API changed in 1.10 to remove
example_outputs as a required input argument. Special handling is added
in this PR to support PyTorch versions older than 1.10. The unit test is
also extended to cover this case.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [x] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u
--net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick
--unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/`
folder.

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Second PR for issue #6291

Since the previous PR #6290
was reverted #6295

Allows skipping the already-trained algos and continuing training only for
the non-trained ones.

After this PR, the default option AutoRunner(train=None) will have this
behavior, whereas manually setting AutoRunner(train=True/False) will
always train all or skip all training. Previously we could only train all
or skip all (without any option to resume).

I changed import_bundle_algo_history() to return a better algo_dict.

Previously it returned "list[dict(name: algo)]": a list of dicts, where
each dict had to have a single key mapping "name => algo". Now it returns
a list of dicts, each with several keys: dict(AlgoEnsembleKeys.ID: name,
AlgoEnsembleKeys.ALGO: algo, "is_trained": bool, etc.).
This allows putting additional metadata inside each algo_dict, and
it's easier to read it back.

Previously, to get a name we had to use "name = history[0].keys()[0]";
now it's more elegant: "name = history[0][AlgoEnsembleKeys.ID]".
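The shape change can be illustrated as follows (the key string values assigned to AlgoEnsembleKeys below are assumptions for the sketch; only the access pattern matters here):

```python
# Stand-in for MONAI's AlgoEnsembleKeys enum (string values assumed).
class AlgoEnsembleKeys:
    ID = "identifier"
    ALGO = "algo_instance"

algo = object()  # stand-in for a trained BundleAlgo

# Old history: a list of single-key dicts; getting the name is awkward.
old_history = [{"segresnet_0": algo}]
name = list(old_history[0].keys())[0]

# New history: explicit keys, with room for extra metadata per entry.
new_history = [{
    AlgoEnsembleKeys.ID: "segresnet_0",
    AlgoEnsembleKeys.ALGO: algo,
    "is_trained": True,
}]
name2 = new_history[0][AlgoEnsembleKeys.ID]  # direct, readable access
```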

This, however, required changing many files: everywhere
import_bundle_algo_history and export_bundle_algo_history were used.

All the tests have passed, except for the "integration GPU utilization
tests", but those errors seem unrelated.

After this PR, tutorials need to be updated too
Project-MONAI/tutorials#1288

---------

Signed-off-by: myron <amyronenko@nvidia.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
pre-commit-ci bot and others added 11 commits April 7, 2023 09:30
for more information, see https://pre-commit.ci

Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
for more information, see https://pre-commit.ci

Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
@soumik12345
Author

soumik12345 commented Apr 7, 2023

Please add your test file here to skip the min tests as it has optional dependency: https://github.com/Project-MONAI/MONAI/blob/dev/tests/min_tests.py#L203 And please fix the other CI failures.

Thanks.

Hi @Nic-Ma, I attempted to fix the CI failures, but ran into a couple of issues:

  • The reason this test is failing seems to be an inheritance issue with PyTorch Ignite itself and how the ignite.handlers.checkpoint.ModelCheckpoint class is created. Could you please advise how we can resolve this?

  • A few CI checks (such as this one) are also failing due to the following error:

Traceback (most recent call last):
  File "/Users/runner/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/runner/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/runner/work/MONAI/MONAI/tests/min_tests.py", line 229, in <module>
    assert not err_mod, f"err_mod={err_mod} not empty"
AssertionError: err_mod=['monai.handlers'] not empty

Could you please advise how we can resolve these issues?

Signed-off-by: Soumik Rakshit <19soumik.rakshit96@gmail.com>
@Nic-Ma
Contributor

Nic-Ma commented Apr 7, 2023

Hi @vfdev-5 ,

Could you please help give some guidance for the inheritance issue of ignite handler?

Thanks in advance.

@soumik12345
Author

Hi @vfdev-5 ,

Could you please help give some guidance for the inheritance issue of ignite handler?

Thanks in advance.

Hi @vfdev-5
Could you please guide me in resolving the CI issues?

Thanks in advance.

@vfdev-5
Member

vfdev-5 commented Apr 11, 2023

@soumik12345

The first issue is a mypy issue:

mypy
mypy 1.2.0 (compiled: yes)
monai/handlers/wandb_handlers.py:134:13: error: Too few arguments  [call-arg]
monai/handlers/wandb_handlers.py:147:13: error: Too few arguments  [call-arg]
monai/handlers/wandb_handlers.py:305:5: error: Signature of "__call__" incompatible with supertype "Checkpoint"  [override]
monai/handlers/wandb_handlers.py:305:5: note:      Superclass:
monai/handlers/wandb_handlers.py:305:5: note:          def __call__(self, engine: Engine) -> None
monai/handlers/wandb_handlers.py:305:5: note:      Subclass:
monai/handlers/wandb_handlers.py:305:5: note:          def __call__(self, engine: Engine, to_save: Mapping[Any, Any]) -> None
Found 3 errors in 1 file (checked 1071 source files)

We can fix it, but do we really need to subclass ignite's Checkpoint? Other experiment-tracking systems like ClearML and Neptune provide ClearMLSaver and NeptuneSaver to save checkpoints to their platforms/cloud.
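For reference, the [override] part of the mypy report asks that a subclass __call__ accept the same signature as the parent; one way to satisfy it is binding to_save at construction time instead of passing it per call. A simplified sketch with a stand-in Checkpoint class (not ignite's real implementation):

```python
from typing import Any, Mapping

class Checkpoint:
    """Stand-in for ignite.handlers.Checkpoint's call signature."""
    def __call__(self, engine: Any) -> None:
        raise NotImplementedError

class WandbModelCheckpoint(Checkpoint):
    def __init__(self, to_save: Mapping[str, Any]):
        self.to_save = to_save       # bound once, not passed per call
        self.saved: list[str] = []

    def __call__(self, engine: Any) -> None:  # matches the supertype
        for name in self.to_save:
            self.saved.append(name)  # real code would persist the object

handler = WandbModelCheckpoint({"model": object()})
handler(engine=None)
```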

@soumik12345
Author

soumik12345 commented Apr 11, 2023

@soumik12345

The first issue is a mypy issue:

mypy
mypy 1.2.0 (compiled: yes)
monai/handlers/wandb_handlers.py:134:13: error: Too few arguments  [call-arg]
monai/handlers/wandb_handlers.py:147:13: error: Too few arguments  [call-arg]
monai/handlers/wandb_handlers.py:305:5: error: Signature of "__call__" incompatible with supertype "Checkpoint"  [override]
monai/handlers/wandb_handlers.py:305:5: note:      Superclass:
monai/handlers/wandb_handlers.py:305:5: note:          def __call__(self, engine: Engine) -> None
monai/handlers/wandb_handlers.py:305:5: note:      Subclass:
monai/handlers/wandb_handlers.py:305:5: note:          def __call__(self, engine: Engine, to_save: Mapping[Any, Any]) -> None
Found 3 errors in 1 file (checked 1071 source files)

We can fix it, but do we really need to subclass ignite's Checkpoint? Other experiment-tracking systems like ClearML and Neptune provide ClearMLSaver and NeptuneSaver to save checkpoints to their platforms/cloud.

Hi @vfdev-5
I can refactor the Checkpoint Handler logic without subclassing an Ignite Handler.
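One possible shape for that refactor, in the style of ClearMLSaver/NeptuneSaver: a save-handler callable that ignite's Checkpoint can invoke, which writes the checkpoint to disk and would upload it as a W&B artifact. All names here are hypothetical, and the wandb calls are commented out so the sketch has no hard dependency:

```python
import os
import tempfile

class WandbSaver:
    """Hypothetical save handler: Checkpoint calls it with the state dict
    and a filename; it persists the file and logs it to the active run."""

    def __init__(self, dirname: str):
        self.dirname = dirname
        os.makedirs(dirname, exist_ok=True)

    def __call__(self, checkpoint: dict, filename: str, metadata=None) -> str:
        path = os.path.join(self.dirname, filename)
        with open(path, "w") as f:
            f.write(repr(checkpoint))  # real code would use torch.save
        # With wandb installed and a run active, one would then do:
        # artifact = wandb.Artifact(filename, type="model", metadata=metadata)
        # artifact.add_file(path)
        # wandb.log_artifact(artifact)
        return path

saver = WandbSaver(tempfile.mkdtemp())
saved_path = saver({"epoch": 3}, "net_checkpoint_3.pt")
```

This keeps the checkpointing logic in ignite's Checkpoint and isolates the W&B-specific behavior in the saver, which is what avoids the signature-override problem.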

@soumik12345 soumik12345 closed this by deleting the head repository May 15, 2023
@soumik12345
Author

soumik12345 commented May 16, 2023

Re-raised the PR #6519
