Refactor multi-node running command into dedicated functions #6623

Merged: 27 commits, Jul 5, 2023

mingxin-zheng
Contributor

@mingxin-zheng mingxin-zheng commented Jun 18, 2023

Fixes #6567 .

Description

This PR refactors how multi-node commands are prepared and run for Auto3DSeg.
In the initial draft, I assumed the functions are internal and used only by Auto3DSeg, so I put them in the Auto3DSeg utils. I am open to changes if we think the usage can be more general.

Some details:
To address the 3 variations of commands used in Auto3DSeg:

  1. python script.py <options>
  2. torchrun <specs> script.py <options>
  3. bcprun <specs> -c python script.py <options>

I split the passing of <options> and <specs> into different stages:

  • <options> belongs to the preparation stage, e.g. create_cmd in Auto3DSeg
  • <specs> relates to device configuration, so it belongs to the launch stage, e.g. run_cmd in Auto3DSeg

Each variation has its own version of the preparation and launch steps.
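
A minimal sketch of the two-stage split described above, not the code in this PR. The names prepare_cmd and launch_tokens, the --fold option, and the example spec values are hypothetical and chosen only for illustration. The torchrun flags --nnodes and --nproc_per_node are standard torchrun options; the bcprun spec flags and the choice to pass the text after -c as a single argument are assumptions.

```python
from typing import List, Optional, Sequence


def prepare_cmd(script: str, options: Sequence[str]) -> List[str]:
    """Preparation stage: combine the script with its <options> only."""
    return [script, *options]


def launch_tokens(
    prepared: Sequence[str],
    launcher: str = "python",
    specs: Optional[Sequence[str]] = None,
) -> List[str]:
    """Launch stage: wrap a prepared command with launcher-specific <specs>."""
    spec_tokens = list(specs or [])
    if launcher == "python":
        # 1. python script.py <options>
        return ["python", *prepared]
    if launcher == "torchrun":
        # 2. torchrun <specs> script.py <options>
        return ["torchrun", *spec_tokens, *prepared]
    if launcher == "bcprun":
        # 3. bcprun <specs> -c python script.py <options>
        # Assumption: everything after -c is passed as one argument.
        return ["bcprun", *spec_tokens, "-c", " ".join(["python", *prepared])]
    raise ValueError(f"unsupported launcher: {launcher}")


# Hypothetical script option, used only to make the output concrete.
prepared = prepare_cmd("script.py", ["--fold", "0"])
print(launch_tokens(prepared))
print(launch_tokens(prepared, "torchrun", ["--nnodes", "1", "--nproc_per_node", "2"]))
print(launch_tokens(prepared, "bcprun", ["-n", "2", "-p", "8"]))  # assumed bcprun flags
```

Keeping <specs> out of the preparation stage means the same prepared command can be reused across launchers, and only the launch stage needs to know about the device configuration.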

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

mingxin-zheng and others added 20 commits June 18, 2023 15:01
Signed-off-by: Mingxin <18563433+mingxin-zheng@users.noreply.github.com>
@mingxin-zheng mingxin-zheng changed the title [WIP] Refractor multi-node running command into dedicated functions Refactor multi-node running command into dedicated functions Jun 25, 2023
monai/auto3dseg/utils.py (outdated review thread, resolved)
Signed-off-by: Mingxin <18563433+mingxin-zheng@users.noreply.github.com>
@wyli
Member

wyli commented Jul 4, 2023

/build

@wyli wyli enabled auto-merge (squash) July 4, 2023 14:19
@mingxin-zheng
Contributor Author

Looks like the test error is not related to this PR:

#6696

I can take a look in a few hours.

@wyli
Member

wyli commented Jul 4, 2023

/build

@wyli
Member

wyli commented Jul 4, 2023

/build

@mingxin-zheng
Contributor Author

/integration-test

@mingxin-zheng
Contributor Author

@wyli blossom-ci looks "skipped" when I clicked into it. May I know if we can re-trigger it?

@wyli
Member

wyli commented Jul 5, 2023

the last two runs have the same errors:

[2023-07-04T16:02:34.328Z] ======================================================================
[2023-07-04T16:02:34.328Z] ERROR: test_run_optuna (tests.test_auto3dseg_hpo.TestHPO)
[2023-07-04T16:02:34.328Z] ----------------------------------------------------------------------
[2023-07-04T16:02:34.328Z] Traceback (most recent call last):
[2023-07-04T16:02:34.328Z]   File "/home/jenkins/agent/workspace/MONAI-premerge/monai/tests/test_auto3dseg_hpo.py", line 163, in test_run_optuna
[2023-07-04T16:02:34.328Z]     study.optimize(
[2023-07-04T16:02:34.328Z]   File "/usr/local/lib/python3.8/dist-packages/optuna/study/study.py", line 443, in optimize
[2023-07-04T16:02:34.328Z]     _optimize(
[2023-07-04T16:02:34.328Z]   File "/usr/local/lib/python3.8/dist-packages/optuna/study/_optimize.py", line 66, in _optimize
[2023-07-04T16:02:34.328Z]     _optimize_sequential(
[2023-07-04T16:02:34.328Z]   File "/usr/local/lib/python3.8/dist-packages/optuna/study/_optimize.py", line 163, in _optimize_sequential
[2023-07-04T16:02:34.328Z]     frozen_trial = _run_trial(study, func, catch)
[2023-07-04T16:02:34.328Z]   File "/usr/local/lib/python3.8/dist-packages/optuna/study/_optimize.py", line 251, in _run_trial
[2023-07-04T16:02:34.328Z]     raise func_err
[2023-07-04T16:02:34.328Z]   File "/usr/local/lib/python3.8/dist-packages/optuna/study/_optimize.py", line 200, in _run_trial
[2023-07-04T16:02:34.328Z]     value_or_values = func(trial)
[2023-07-04T16:02:34.328Z]   File "/home/jenkins/agent/workspace/MONAI-premerge/monai/monai/apps/auto3dseg/hpo_gen.py", line 336, in __call__
[2023-07-04T16:02:34.328Z]     self.run_algo(obj_filename, output_folder, template_path)
[2023-07-04T16:02:34.328Z]   File "/home/jenkins/agent/workspace/MONAI-premerge/monai/monai/apps/auto3dseg/hpo_gen.py", line 394, in run_algo
[2023-07-04T16:02:34.328Z]     self.algo.train(self.params)
[2023-07-04T16:02:34.328Z]   File "/home/jenkins/agent/workspace/MONAI-premerge/monai/monai/apps/auto3dseg/bundle_gen.py", line 277, in train
[2023-07-04T16:02:34.328Z]     return self._run_cmd(cmd)
[2023-07-04T16:02:34.328Z]   File "/home/jenkins/agent/workspace/MONAI-premerge/monai/monai/apps/auto3dseg/bundle_gen.py", line 254, in _run_cmd
[2023-07-04T16:02:34.328Z]     return run_cmd(cmd.split(), env=ps_environ, check=True)
[2023-07-04T16:02:34.328Z]   File "/home/jenkins/agent/workspace/MONAI-premerge/monai/monai/utils/misc.py", line 845, in run_cmd
[2023-07-04T16:02:34.328Z]     return subprocess.run(cmd_list, **kwargs)
[2023-07-04T16:02:34.328Z]   File "/usr/lib/python3.8/subprocess.py", line 493, in run
[2023-07-04T16:02:34.328Z]     with Popen(*popenargs, **kwargs) as process:
[2023-07-04T16:02:34.328Z]   File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
[2023-07-04T16:02:34.328Z]     self._execute_child(args, executable, preexec_fn, close_fds,
[2023-07-04T16:02:34.328Z]   File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child
[2023-07-04T16:02:34.328Z]     raise child_exception_type(errno_num, err_msg, err_filename)
[2023-07-04T16:02:34.328Z] FileNotFoundError: [Errno 2] No such file or directory: 'None'
[2023-07-04T16:02:34.328Z] 
[2023-07-04T16:02:34.328Z] ----------------------------------------------------------------------
  • MONAI-premerge/detail/MONAI-premerge/2625/pipeline
  • MONAI-premerge/detail/MONAI-premerge/2624/pipeline

I'll try to run it again..
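
The thread does not state the root cause, but the last frames show run_cmd(cmd.split(), ...) ending in FileNotFoundError for 'None'. One way to get that error is for the first token of the command string to be the literal text "None", for example when a None value is formatted into the string. The sketch below reproduces that failure mode in isolation; build_cmd and checked_split are hypothetical helpers, not MONAI functions, and the example assumes no program named "None" is on the PATH.

```python
import errno
import subprocess
from typing import List, Optional


def build_cmd(executable: Optional[str], script: str) -> str:
    # Formatting None into an f-string silently yields the text "None".
    return f"{executable} {script}"


def checked_split(cmd: str) -> List[str]:
    """Split a command string, rejecting a missing executable early."""
    tokens = cmd.split()
    if not tokens or tokens[0] == "None":
        raise ValueError(f"command has no valid executable: {cmd!r}")
    return tokens


cmd = build_cmd(None, "script.py")  # the string "None script.py"
caught = None
try:
    # Same call shape as the last MONAI frame: split the string, then run it.
    subprocess.run(cmd.split(), check=True)
except FileNotFoundError as err:
    caught = err
    print(type(err).__name__, err.errno == errno.ENOENT)
```

A guard such as checked_split turns the late subprocess failure into an early error that names the malformed command.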

@wyli
Member

wyli commented Jul 5, 2023

/black
/build

@wyli wyli disabled auto-merge July 5, 2023 07:26
@mingxin-zheng
Contributor Author

Thanks @wyli for pointing it out. I was looking at the wrong things. Let me fix this.

mingxin-zheng and others added 3 commits July 5, 2023 13:03
Signed-off-by: Mingxin <18563433+mingxin-zheng@users.noreply.github.com>
Signed-off-by: monai-bot <monai.miccai2019@gmail.com>
@wyli
Member

wyli commented Jul 5, 2023

/build

Signed-off-by: Wenqi Li <wenqil@nvidia.com>
@wyli
Member

wyli commented Jul 5, 2023

Thanks @mingxin-zheng, I modified the device_settings to remove the mypy warning (93d6e05). I think the PR can now pass all the tests; merging it soon.

@wyli
Member

wyli commented Jul 5, 2023

/build

@mingxin-zheng
Contributor Author

Thank you @wyli

@wyli wyli enabled auto-merge (squash) July 5, 2023 15:40
@wyli wyli merged commit c6c7ba9 into Project-MONAI:dev Jul 5, 2023
31 of 35 checks passed
@mingxin-zheng mingxin-zheng deleted the fix-6567 branch March 27, 2024 03:21
Development

Successfully merging this pull request may close these issues.

Auto3DSeg should separate bcprun as an unique execution method
3 participants