Skip to content

test_auto3dseg_hpo OOM frequently #5604

@wyli

Description

@wyli

run into this issue from time to time, would be great if the GPU memory footprint can be smaller cc @mingxin-zheng @Nic-Ma

2022-11-29T10:04:35.2204084Z ======================================================================
2022-11-29T10:04:35.2204811Z ERROR: test_get_history (tests.test_auto3dseg_hpo.TestHPO)
2022-11-29T10:04:35.2206107Z ----------------------------------------------------------------------
2022-11-29T10:04:35.2207238Z Traceback (most recent call last):
2022-11-29T10:04:35.2208564Z   File "/__w/MONAI/MONAI/tests/test_auto3dseg_hpo.py", line 212, in test_get_history
2022-11-29T10:04:35.2209724Z     NNIGen().run_algo(obj_filename, self.work_dir)
2022-11-29T10:04:35.2211043Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/hpo_gen.py", line 225, in run_algo
2022-11-29T10:04:35.2212392Z     self.algo.train(self.params)
2022-11-29T10:04:35.2213184Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 198, in train
2022-11-29T10:04:35.2213952Z     return self._run_cmd(cmd, devices_info)
2022-11-29T10:04:35.2214729Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 185, in _run_cmd
2022-11-29T10:04:35.2215606Z     normal_out = subprocess.run(cmd.split(), env=ps_environ, check=True)
2022-11-29T10:04:35.2216413Z   File "/opt/conda/lib/python3.8/subprocess.py", line 516, in run
2022-11-29T10:04:35.2217205Z     raise CalledProcessError(retcode, process.args,
2022-11-29T10:04:35.2221363Z subprocess.CalledProcessError: Command '['torchrun', '--nnodes=1', '--nproc_per_node=2', '/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/scripts/train.py', 'run', "--config_file='/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/configs/transforms_infer.yaml','/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/configs/transforms_validate.yaml','/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/configs/transforms_train.yaml','/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/configs/network.yaml','/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/configs/hyper_parameters.yaml'"]' returned non-zero exit status 1.
2022-11-29T10:04:35.2223560Z 
2022-11-29T10:04:35.2224128Z ======================================================================
2022-11-29T10:04:35.2224502Z 
2022-11-29T10:04:35.2224903Z tests finished, printing completed times >10.0s in ascending order...
2022-11-29T10:04:35.2225391Z 
2022-11-29T10:04:35.2225993Z test_read_patches_cucim_3 (tests.test_masked_inference_wsi_dataset.TestMaskedInferenceWSIDataset) (10.1s)
2022-11-29T10:04:35.2227214Z test_read_patches_cucim_2 (tests.test_masked_inference_wsi_dataset.TestMaskedInferenceWSIDataset) (10.1s)
2022-11-29T10:04:35.2228409Z test_read_patches_cucim_1 (tests.test_masked_inference_wsi_dataset.TestMaskedInferenceWSIDataset) (10.1s)
2022-11-29T10:04:35.2229387Z test_script_0 (tests.test_senet.TestSENET) (10.3s)
2022-11-29T10:04:35.2230192Z test_script (tests.test_highresnet.TestHighResNet) (10.3s)
2022-11-29T10:04:35.2231083Z ERROR: test_run_algo (tests.test_auto3dseg_hpo.TestHPO)
2022-11-29T10:04:35.2232292Z ----------------------------------------------------------------------
2022-11-29T10:04:35.2233047Z Traceback (most recent call last):
2022-11-29T10:04:35.2233821Z   File "/__w/MONAI/MONAI/tests/test_auto3dseg_hpo.py", line 133, in test_run_algo
2022-11-29T10:04:35.2234584Z     NNIGen().run_algo(obj_filename, self.work_dir)
2022-11-29T10:04:35.2235535Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/hpo_gen.py", line 225, in run_algo
2022-11-29T10:04:35.2236287Z     self.algo.train(self.params)
2022-11-29T10:04:35.2237016Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 198, in train
2022-11-29T10:04:35.2237778Z     return self._run_cmd(cmd, devices_info)
2022-11-29T10:04:35.2238562Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 185, in _run_cmd
2022-11-29T10:04:35.2239415Z     normal_out = subprocess.run(cmd.split(), env=ps_environ, check=True)
2022-11-29T10:04:35.2240248Z   File "/opt/conda/lib/python3.8/subprocess.py", line 516, in run
2022-11-29T10:04:35.2241038Z     raise CalledProcessError(retcode, process.args,
2022-11-29T10:04:35.2244859Z subprocess.CalledProcessError: Command '['torchrun', '--nnodes=1', '--nproc_per_node=2', '/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/scripts/train.py', 'run', "--config_file='/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/configs/transforms_infer.yaml','/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/configs/transforms_validate.yaml','/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/configs/transforms_train.yaml','/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/configs/network.yaml','/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/configs/hyper_parameters.yaml'"]' returned non-zero exit status 1.
2022-11-29T10:04:35.2246927Z 
2022-11-29T10:04:35.2247210Z ======================================================================
2022-11-29T10:04:35.2247931Z ERROR: test_run_algo_after_move_files (tests.test_auto3dseg_hpo.TestHPO)
2022-11-29T10:04:35.2249022Z ----------------------------------------------------------------------
2022-11-29T10:04:35.2249778Z Traceback (most recent call last):
2022-11-29T10:04:35.2250595Z   File "/__w/MONAI/MONAI/tests/test_auto3dseg_hpo.py", line 194, in test_run_algo_after_move_files
2022-11-29T10:04:35.2251527Z     NNIGen().run_algo(obj_filename_2, work_dir_2, template_path=algorithm_templates_2)
2022-11-29T10:04:35.2252438Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/hpo_gen.py", line 225, in run_algo
2022-11-29T10:04:35.2253175Z     self.algo.train(self.params)
2022-11-29T10:04:35.2253915Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 198, in train
2022-11-29T10:04:35.2254674Z     return self._run_cmd(cmd, devices_info)
2022-11-29T10:04:35.2255445Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 185, in _run_cmd
2022-11-29T10:04:35.2256314Z     normal_out = subprocess.run(cmd.split(), env=ps_environ, check=True)
2022-11-29T10:04:35.2257116Z   File "/opt/conda/lib/python3.8/subprocess.py", line 516, in run
2022-11-29T10:04:35.2258055Z     raise CalledProcessError(retcode, process.args,
2022-11-29T10:04:35.2261937Z subprocess.CalledProcessError: Command '['torchrun', '--nnodes=1', '--nproc_per_node=2', '/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/scripts/train.py', 'run', "--config_file='/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/configs/transforms_infer.yaml','/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/configs/transforms_validate.yaml','/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/configs/transforms_train.yaml','/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/configs/network.yaml','/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/configs/hyper_parameters.yaml'"]' returned non-zero exit status 1.
2022-11-29T10:04:35.2264139Z 
2022-11-29T10:04:35.2264413Z ======================================================================
2022-11-29T10:04:35.2265091Z ERROR: test_run_optuna (tests.test_auto3dseg_hpo.TestHPO)
2022-11-29T10:04:35.2266277Z ----------------------------------------------------------------------
2022-11-29T10:04:35.2267040Z Traceback (most recent call last):
2022-11-29T10:04:35.2267818Z   File "/__w/MONAI/MONAI/tests/test_auto3dseg_hpo.py", line 157, in test_run_optuna
2022-11-29T10:04:35.2268490Z     study.optimize(
2022-11-29T10:04:35.2269521Z   File "/opt/conda/lib/python3.8/site-packages/optuna/study/study.py", line 419, in optimize
2022-11-29T10:04:35.2270250Z     _optimize(
2022-11-29T10:04:35.2271236Z   File "/opt/conda/lib/python3.8/site-packages/optuna/study/_optimize.py", line 66, in _optimize
2022-11-29T10:04:35.2272017Z     _optimize_sequential(
2022-11-29T10:04:35.2273143Z   File "/opt/conda/lib/python3.8/site-packages/optuna/study/_optimize.py", line 160, in _optimize_sequential
2022-11-29T10:04:35.2274008Z     frozen_trial = _run_trial(study, func, catch)
2022-11-29T10:04:35.2275137Z   File "/opt/conda/lib/python3.8/site-packages/optuna/study/_optimize.py", line 234, in _run_trial
2022-11-29T10:04:35.2275881Z     raise func_err
2022-11-29T10:04:35.2276906Z   File "/opt/conda/lib/python3.8/site-packages/optuna/study/_optimize.py", line 196, in _run_trial
2022-11-29T10:04:35.2277667Z     value_or_values = func(trial)
2022-11-29T10:04:35.2278412Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/hpo_gen.py", line 333, in __call__
2022-11-29T10:04:35.2279253Z     self.run_algo(obj_filename, output_folder, template_path)
2022-11-29T10:04:35.2280062Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/hpo_gen.py", line 395, in run_algo
2022-11-29T10:04:35.2280790Z     self.algo.train(self.params)
2022-11-29T10:04:35.2281540Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 198, in train
2022-11-29T10:04:35.2282261Z     return self._run_cmd(cmd, devices_info)
2022-11-29T10:04:35.2283036Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 185, in _run_cmd
2022-11-29T10:04:35.2283910Z     normal_out = subprocess.run(cmd.split(), env=ps_environ, check=True)
2022-11-29T10:04:35.2284747Z   File "/opt/conda/lib/python3.8/subprocess.py", line 516, in run
2022-11-29T10:04:35.2285506Z     raise CalledProcessError(retcode, process.args,
2022-11-29T10:04:35.2289942Z subprocess.CalledProcessError: Command '['torchrun', '--nnodes=1', '--nproc_per_node=2', '/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/scripts/train.py', 'run', "--config_file='/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_infer.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_validate.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_train.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/network.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/hyper_parameters.yaml'", '--learning_rate=0.001']' returned non-zero exit status 1.
2022-11-29T10:04:35.2292326Z 
2022-11-29T10:04:35.2292937Z ----------------------------------------------------------------------
Details
2022-11-29T09:55:09.0551041Z Starting test: test_get_history (tests.test_auto3dseg_hpo.TestHPO)...
2022-11-29T09:55:09.0551548Z 
2022-11-29T09:55:10.1811180Z   0%|          | 0/12 [00:00<?, ?it/s]
2022-11-29T09:55:10.3612333Z   8%|▊         | 1/12 [00:01<00:12,  1.12s/it]
2022-11-29T09:55:10.5055004Z  17%|█▋        | 2/12 [00:01<00:05,  1.76it/s]
2022-11-29T09:55:10.6297577Z  42%|████▏     | 5/12 [00:01<00:01,  5.10it/s]
2022-11-29T09:55:10.7651096Z  67%|██████▋   | 8/12 [00:01<00:00,  8.48it/s]
2022-11-29T09:55:11.1991967Z  92%|█████████▏| 11/12 [00:01<00:00, 11.44it/s]
2022-11-29T09:55:11.5556705Z 100%|██████████| 12/12 [00:02<00:00,  5.60it/s]
2022-11-29T09:55:11.5557118Z 
2022-11-29T09:55:12.2148624Z algo_templates.tar.gz: 0.00B [00:00, ?B/s]
2022-11-29T09:55:12.3589082Z algo_templates.tar.gz:   3%|▎         | 8.00k/290k [00:00<00:23, 12.4kB/s]
2022-11-29T09:55:12.5044988Z algo_templates.tar.gz:  17%|█▋        | 48.0k/290k [00:00<00:03, 77.2kB/s]
2022-11-29T09:55:12.6539602Z algo_templates.tar.gz:  44%|████▍     | 128k/290k [00:00<00:00, 201kB/s]  
2022-11-29T09:55:12.6570675Z algo_templates.tar.gz:  94%|█████████▍| 272k/290k [00:01<00:00, 416kB/s]
2022-11-29T09:55:12.6576749Z 2022-11-29 09:55:12,657 - INFO - Downloaded: /tmp/tmpn_5xi132/algo_templates.tar.gz
2022-11-29T09:55:12.6578437Z 2022-11-29 09:55:12,657 - INFO - Expected md5 is None, skip md5 check for file /tmp/tmpn_5xi132/algo_templates.tar.gz.
2022-11-29T09:55:12.6581851Z 2022-11-29 09:55:12,657 - INFO - Writing into directory: /tmp/tmp8svrlhoq/workdir.
2022-11-29T09:55:13.2978999Z 2022-11-29 09:55:13,297 - INFO - /tmp/tmp8svrlhoq/workdir/segresnet2d_0
2022-11-29T09:55:13.7312508Z 2022-11-29 09:55:13,730 - INFO - /tmp/tmp8svrlhoq/workdir/segresnet2d_1
2022-11-29T09:55:14.2050683Z 2022-11-29 09:55:14,204 - INFO - /tmp/tmp8svrlhoq/workdir/dints_0
2022-11-29T09:55:14.6769312Z 2022-11-29 09:55:14,676 - INFO - /tmp/tmp8svrlhoq/workdir/dints_1
2022-11-29T09:55:15.0954867Z 2022-11-29 09:55:15,094 - INFO - /tmp/tmp8svrlhoq/workdir/swinunetr_0
2022-11-29T09:55:15.5124449Z 2022-11-29 09:55:15,511 - INFO - /tmp/tmp8svrlhoq/workdir/swinunetr_1
2022-11-29T09:55:15.9678003Z 2022-11-29 09:55:15,967 - INFO - /tmp/tmp8svrlhoq/workdir/segresnet_0
2022-11-29T09:55:16.4254530Z 2022-11-29 09:55:16,424 - INFO - /tmp/tmp8svrlhoq/workdir/segresnet_1
2022-11-29T09:55:16.5836417Z 2022-11-29 09:55:16,583 - INFO - /tmp/tmp8svrlhoq/workdir/segresnet2d_0_override
2022-11-29T09:55:16.5837786Z 2022-11-29 09:55:16,583 - INFO - ============================================================================================================================================
2022-11-29T09:55:16.5838810Z 2022-11-29 09:55:16,583 - INFO - If NNI will run in your local env: 
2022-11-29T09:55:16.5839931Z 2022-11-29 09:55:16,583 - INFO - 1. Add the following line to the trialCommand in your NNI config: 
2022-11-29T09:55:16.5841525Z 2022-11-29 09:55:16,583 - INFO - python -m monai.apps.auto3dseg NNIGen run_algo  /tmp/tmp8svrlhoq/workdir/segresnet2d_0_override/algo_object.pkl {result_dir}
2022-11-29T09:55:16.5843375Z 2022-11-29 09:55:16,583 - INFO - --------------------------------------------------------------------------------------------------------------------------------------------
2022-11-29T09:55:16.5845128Z 2022-11-29 09:55:16,583 - INFO - If NNI will run in a remote env: 
2022-11-29T09:55:16.5846598Z 2022-11-29 09:55:16,583 - INFO - 1. Copy the algorithm_templates folder /tmp/tmp8svrlhoq/workdir/algorithm_templates/segresnet2d to remote {remote_algorithm_templates_dir}
2022-11-29T09:55:16.5848229Z 2022-11-29 09:55:16,583 - INFO - 2. Copy the older /tmp/tmp8svrlhoq/workdir/segresnet2d_0_override to the remote machine {remote_algo_dir}
2022-11-29T09:55:16.5849572Z 2022-11-29 09:55:16,583 - INFO - Then add the following line to the trialCommand in your NNI config: 
2022-11-29T09:55:16.5851241Z 2022-11-29 09:55:16,583 - INFO - python -m monai.apps.auto3dseg NNIGen run_algo  {remote_algo_dir} {result_dir} {remote_algorithm_templates_dir}
2022-11-29T09:55:16.5852668Z 2022-11-29 09:55:16,583 - INFO - ============================================================================================================================================
2022-11-29T09:55:16.6948146Z 2022-11-29 09:55:16,694 - INFO - /tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None
2022-11-29T09:55:16.6951825Z 2022-11-29 09:55:16,694 - INFO - Launching: torchrun --nnodes=1 --nproc_per_node=2 /tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/scripts/train.py run --config_file='/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/configs/transforms_infer.yaml','/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/configs/transforms_validate.yaml','/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/configs/transforms_train.yaml','/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/configs/network.yaml','/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/configs/hyper_parameters.yaml'
2022-11-29T09:55:18.4222096Z algo_templates.tar.gz: 296kB [00:01, 275kB/s]                           WARNING:torch.distributed.run:
2022-11-29T09:55:18.4222874Z *****************************************
2022-11-29T09:55:18.4224086Z Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
2022-11-29T09:55:18.4225203Z *****************************************
2022-11-29T09:55:23.8582013Z [info] number of GPUs: 2
2022-11-29T09:55:23.8583735Z [info] number of GPUs: 2
2022-11-29T09:55:23.9202797Z 2022-11-29 09:55:23,919 - Added key: store_based_barrier_key:1 to store for rank: 1
2022-11-29T09:55:23.9203975Z 2022-11-29 09:55:23,920 - Added key: store_based_barrier_key:1 to store for rank: 0
2022-11-29T09:55:23.9205355Z 2022-11-29 09:55:23,920 - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-29T09:55:23.9206131Z [info] world_size: 2
2022-11-29T09:55:23.9218634Z train_files: 4
2022-11-29T09:55:23.9219140Z val_files: 2
2022-11-29T09:55:23.9306079Z 2022-11-29 09:55:23,930 - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-29T09:55:23.9306939Z [info] world_size: 2
2022-11-29T09:55:23.9318224Z train_files: 4
2022-11-29T09:55:23.9318684Z val_files: 2
2022-11-29T09:55:24.1687175Z Traceback (most recent call last):
2022-11-29T09:55:24.1688136Z   File "/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/scripts/train.py", line 439, in <module>
2022-11-29T09:55:24.1688872Z     fire.Fire()
2022-11-29T09:55:24.1689937Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
2022-11-29T09:55:24.1690851Z     component_trace = _Fire(component, args, parsed_flag_args, context, name)
2022-11-29T09:55:24.1691994Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
2022-11-29T09:55:24.1692821Z     component, remaining_args = _CallAndUpdateTrace(
2022-11-29T09:55:24.1694010Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
2022-11-29T09:55:24.1694840Z     component = fn(*varargs, **kwargs)
2022-11-29T09:55:24.1696123Z   File "/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/scripts/train.py", line 164, in run
2022-11-29T09:55:24.1696896Z     model = model.to(device)
2022-11-29T09:55:24.1698118Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 987, in to
2022-11-29T09:55:24.1698893Z     return self._apply(convert)
2022-11-29T09:55:24.1699926Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
2022-11-29T09:55:24.1700673Z     module._apply(fn)
2022-11-29T09:55:24.1701697Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
2022-11-29T09:55:24.1702411Z     module._apply(fn)
2022-11-29T09:55:24.1703414Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 662, in _apply
2022-11-29T09:55:24.1704364Z     param_applied = fn(param)
2022-11-29T09:55:24.1705422Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 985, in convert
2022-11-29T09:55:24.1706390Z     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2022-11-29T09:55:24.1707191Z RuntimeError: CUDA error: out of memory
2022-11-29T09:55:24.1708139Z CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
2022-11-29T09:55:24.1709083Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2022-11-29T09:55:25.7366762Z num_epochs 4
2022-11-29T09:55:25.7367400Z num_epochs_per_validation 2
2022-11-29T09:55:28.4527840Z WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 47061 closing signal SIGTERM
2022-11-29T09:55:28.6172946Z ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 47062) of binary: /opt/conda/bin/python
2022-11-29T09:55:28.6272926Z Traceback (most recent call last):
2022-11-29T09:55:28.6273687Z   File "/opt/conda/bin/torchrun", line 8, in <module>
2022-11-29T09:55:28.6274336Z     sys.exit(main())
2022-11-29T09:55:28.6275875Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
2022-11-29T09:55:28.6276736Z     return f(*args, **kwargs)
2022-11-29T09:55:28.6277775Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
2022-11-29T09:55:28.6278490Z     run(args)
2022-11-29T09:55:28.6279466Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
2022-11-29T09:55:28.6280170Z     elastic_launch(
2022-11-29T09:55:28.6281235Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
2022-11-29T09:55:28.6282208Z     return launch_agent(self._config, self._entrypoint, list(args))
2022-11-29T09:55:28.6283430Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
2022-11-29T09:55:28.6284268Z     raise ChildFailedError(
2022-11-29T09:55:28.6285244Z torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
2022-11-29T09:55:28.6286202Z ============================================================
2022-11-29T09:55:28.6286942Z /tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/scripts/train.py FAILED
2022-11-29T09:55:28.6287991Z ------------------------------------------------------------
2022-11-29T09:55:28.6288621Z Failures:
2022-11-29T09:55:28.6289077Z   <NO_OTHER_FAILURES>
2022-11-29T09:55:28.6289912Z ------------------------------------------------------------
2022-11-29T09:55:28.6290606Z Root Cause (first observed failure):
2022-11-29T09:55:28.6291106Z [0]:
2022-11-29T09:55:28.6291739Z   time      : 2022-11-29_09:55:28
2022-11-29T09:55:28.6292265Z   host      : 16e7a27492ca
2022-11-29T09:55:28.6292774Z   rank      : 1 (local_rank: 1)
2022-11-29T09:55:28.6293321Z   exitcode  : 1 (pid: 47062)
2022-11-29T09:55:28.6293850Z   error_file: <N/A>
2022-11-29T09:55:28.6294658Z   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
2022-11-29T09:55:28.6295928Z ============================================================
2022-11-29T09:55:28.8428026Z 
2022-11-29T09:55:28.8429721Z �[1m�[31mRunning trial code without runtime. Please check the tutorial if you are new to NNI: �[33mhttps://nni.readthedocs.io/en/stable/tutorials/hpo_quickstart_pytorch/main.html�[0m
2022-11-29T09:55:29.1969287Z EFinished test: test_get_history (tests.test_auto3dseg_hpo.TestHPO) (20.1s)
2022-11-29T09:55:29.1970251Z Starting test: test_run_algo (tests.test_auto3dseg_hpo.TestHPO)...
2022-11-29T09:55:29.1970742Z 
2022-11-29T09:55:30.4961914Z   0%|          | 0/12 [00:00<?, ?it/s]
2022-11-29T09:55:30.6012010Z   8%|▊         | 1/12 [00:01<00:14,  1.30s/it]
2022-11-29T09:55:30.7443006Z  25%|██▌       | 3/12 [00:01<00:03,  2.66it/s]
2022-11-29T09:55:30.8724339Z  50%|█████     | 6/12 [00:01<00:01,  5.73it/s]
2022-11-29T09:55:30.9907717Z  75%|███████▌  | 9/12 [00:01<00:00,  8.85it/s]
2022-11-29T09:55:31.3609564Z 100%|██████████| 12/12 [00:01<00:00, 11.97it/s]
2022-11-29T09:55:31.7145529Z 100%|██████████| 12/12 [00:02<00:00,  5.55it/s]
2022-11-29T09:55:31.7145935Z 
2022-11-29T09:55:32.1145141Z algo_templates.tar.gz: 0.00B [00:00, ?B/s]
2022-11-29T09:55:32.1406013Z algo_templates.tar.gz:   3%|▎         | 8.00k/290k [00:00<00:14, 20.5kB/s]
2022-11-29T09:55:32.1413636Z 2022-11-29 09:55:32,140 - INFO - Downloaded: /tmp/tmpqbuouysy/algo_templates.tar.gz
2022-11-29T09:55:32.1415662Z 2022-11-29 09:55:32,141 - INFO - Expected md5 is None, skip md5 check for file /tmp/tmpqbuouysy/algo_templates.tar.gz.
2022-11-29T09:55:32.1422130Z 2022-11-29 09:55:32,141 - INFO - Writing into directory: /tmp/tmpo5ptmej2/workdir.
2022-11-29T09:55:32.7890156Z 2022-11-29 09:55:32,788 - INFO - /tmp/tmpo5ptmej2/workdir/segresnet2d_0
2022-11-29T09:55:33.2201065Z 2022-11-29 09:55:33,219 - INFO - /tmp/tmpo5ptmej2/workdir/segresnet2d_1
2022-11-29T09:55:33.6893202Z 2022-11-29 09:55:33,688 - INFO - /tmp/tmpo5ptmej2/workdir/dints_0
2022-11-29T09:55:34.1584238Z 2022-11-29 09:55:34,157 - INFO - /tmp/tmpo5ptmej2/workdir/dints_1
2022-11-29T09:55:34.5752796Z 2022-11-29 09:55:34,574 - INFO - /tmp/tmpo5ptmej2/workdir/swinunetr_0
2022-11-29T09:55:34.9919902Z 2022-11-29 09:55:34,991 - INFO - /tmp/tmpo5ptmej2/workdir/swinunetr_1
2022-11-29T09:55:35.4527105Z 2022-11-29 09:55:35,452 - INFO - /tmp/tmpo5ptmej2/workdir/segresnet_0
2022-11-29T09:55:35.9293233Z 2022-11-29 09:55:35,928 - INFO - /tmp/tmpo5ptmej2/workdir/segresnet_1
2022-11-29T09:55:36.0874985Z 2022-11-29 09:55:36,086 - INFO - /tmp/tmpo5ptmej2/workdir/segresnet2d_0_override
2022-11-29T09:55:36.0876363Z 2022-11-29 09:55:36,086 - INFO - ============================================================================================================================================
2022-11-29T09:55:36.0877331Z 2022-11-29 09:55:36,087 - INFO - If NNI will run in your local env: 
2022-11-29T09:55:36.0878503Z 2022-11-29 09:55:36,087 - INFO - 1. Add the following line to the trialCommand in your NNI config: 
2022-11-29T09:55:36.0880069Z 2022-11-29 09:55:36,087 - INFO - python -m monai.apps.auto3dseg NNIGen run_algo  /tmp/tmpo5ptmej2/workdir/segresnet2d_0_override/algo_object.pkl {result_dir}
2022-11-29T09:55:36.0881905Z 2022-11-29 09:55:36,087 - INFO - --------------------------------------------------------------------------------------------------------------------------------------------
2022-11-29T09:55:36.0883229Z 2022-11-29 09:55:36,087 - INFO - If NNI will run in a remote env: 
2022-11-29T09:55:36.0884701Z 2022-11-29 09:55:36,087 - INFO - 1. Copy the algorithm_templates folder /tmp/tmpo5ptmej2/workdir/algorithm_templates/segresnet2d to remote {remote_algorithm_templates_dir}
2022-11-29T09:55:36.0886317Z 2022-11-29 09:55:36,087 - INFO - 2. Copy the older /tmp/tmpo5ptmej2/workdir/segresnet2d_0_override to the remote machine {remote_algo_dir}
2022-11-29T09:55:36.0887623Z 2022-11-29 09:55:36,087 - INFO - Then add the following line to the trialCommand in your NNI config: 
2022-11-29T09:55:36.0889076Z 2022-11-29 09:55:36,087 - INFO - python -m monai.apps.auto3dseg NNIGen run_algo  {remote_algo_dir} {result_dir} {remote_algorithm_templates_dir}
2022-11-29T09:55:36.0890897Z 2022-11-29 09:55:36,087 - INFO - ============================================================================================================================================
2022-11-29T09:55:36.1982921Z 2022-11-29 09:55:36,197 - INFO - /tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None
2022-11-29T09:55:36.1986967Z 2022-11-29 09:55:36,197 - INFO - Launching: torchrun --nnodes=1 --nproc_per_node=2 /tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/scripts/train.py run --config_file='/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/configs/transforms_infer.yaml','/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/configs/transforms_validate.yaml','/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/configs/transforms_train.yaml','/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/configs/network.yaml','/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/configs/hyper_parameters.yaml'
2022-11-29T09:55:37.9397084Z algo_templates.tar.gz: 296kB [00:00, 712kB/s]                             WARNING:torch.distributed.run:
2022-11-29T09:55:37.9397872Z *****************************************
2022-11-29T09:55:37.9399075Z Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
2022-11-29T09:55:37.9400186Z *****************************************
2022-11-29T09:55:43.3559678Z [info] number of GPUs: 2
2022-11-29T09:55:43.3561263Z [info] number of GPUs: 2
2022-11-29T09:55:43.4182811Z 2022-11-29 09:55:43,417 - Added key: store_based_barrier_key:1 to store for rank: 0
2022-11-29T09:55:43.4184683Z 2022-11-29 09:55:43,418 - Added key: store_based_barrier_key:1 to store for rank: 1
2022-11-29T09:55:43.4186102Z 2022-11-29 09:55:43,418 - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-29T09:55:43.4186848Z [info] world_size: 2
2022-11-29T09:55:43.4198933Z train_files: 4
2022-11-29T09:55:43.4199432Z val_files: 2
2022-11-29T09:55:43.4286886Z 2022-11-29 09:55:43,428 - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-29T09:55:43.4287705Z [info] world_size: 2
2022-11-29T09:55:43.4293980Z train_files: 4
2022-11-29T09:55:43.4294770Z val_files: 2
2022-11-29T09:55:43.6301642Z Traceback (most recent call last):
2022-11-29T09:55:43.6302684Z   File "/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/scripts/train.py", line 439, in <module>
2022-11-29T09:55:43.6303452Z     fire.Fire()
2022-11-29T09:55:43.6304558Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
2022-11-29T09:55:43.6305446Z     component_trace = _Fire(component, args, parsed_flag_args, context, name)
2022-11-29T09:55:43.6306625Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
2022-11-29T09:55:43.6307474Z     component, remaining_args = _CallAndUpdateTrace(
2022-11-29T09:55:43.6308630Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
2022-11-29T09:55:43.6309465Z     component = fn(*varargs, **kwargs)
2022-11-29T09:55:43.6310304Z   File "/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/scripts/train.py", line 164, in run
2022-11-29T09:55:43.6326024Z     model = model.to(device)
2022-11-29T09:55:43.6327297Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 987, in to
2022-11-29T09:55:43.6328106Z     return self._apply(convert)
2022-11-29T09:55:43.6329183Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
2022-11-29T09:55:43.6329923Z     module._apply(fn)
2022-11-29T09:55:43.6330962Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
2022-11-29T09:55:43.6331706Z     module._apply(fn)
2022-11-29T09:55:43.6332726Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 662, in _apply
2022-11-29T09:55:43.6333968Z     param_applied = fn(param)
2022-11-29T09:55:43.6335040Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 985, in convert
2022-11-29T09:55:43.6336076Z     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2022-11-29T09:55:43.6336866Z RuntimeError: CUDA error: out of memory
2022-11-29T09:55:43.6337992Z CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
2022-11-29T09:55:43.6338970Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2022-11-29T09:55:45.2301934Z num_epochs 4
2022-11-29T09:55:45.2302574Z num_epochs_per_validation 2
2022-11-29T09:55:47.9688382Z WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 47208 closing signal SIGTERM
2022-11-29T09:55:48.1335061Z ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 47209) of binary: /opt/conda/bin/python
2022-11-29T09:55:48.1436786Z Traceback (most recent call last):
2022-11-29T09:55:48.1437551Z   File "/opt/conda/bin/torchrun", line 8, in <module>
2022-11-29T09:55:48.1438211Z     sys.exit(main())
2022-11-29T09:55:48.1439869Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
2022-11-29T09:55:48.1440767Z     return f(*args, **kwargs)
2022-11-29T09:55:48.1441799Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
2022-11-29T09:55:48.1442540Z     run(args)
2022-11-29T09:55:48.1443507Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
2022-11-29T09:55:48.1444203Z     elastic_launch(
2022-11-29T09:55:48.1445260Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
2022-11-29T09:55:48.1446210Z     return launch_agent(self._config, self._entrypoint, list(args))
2022-11-29T09:55:48.1447478Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
2022-11-29T09:55:48.1448290Z     raise ChildFailedError(
2022-11-29T09:55:48.1449262Z torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
2022-11-29T09:55:48.1450206Z ============================================================
2022-11-29T09:55:48.1450951Z /tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/scripts/train.py FAILED
2022-11-29T09:55:48.1452005Z ------------------------------------------------------------
2022-11-29T09:55:48.1452630Z Failures:
2022-11-29T09:55:48.1453089Z   <NO_OTHER_FAILURES>
2022-11-29T09:55:48.1453926Z ------------------------------------------------------------
2022-11-29T09:55:48.1454635Z Root Cause (first observed failure):
2022-11-29T09:55:48.1455145Z [0]:
2022-11-29T09:55:48.1455761Z   time      : 2022-11-29_09:55:47
2022-11-29T09:55:48.1456284Z   host      : 16e7a27492ca
2022-11-29T09:55:48.1456802Z   rank      : 1 (local_rank: 1)
2022-11-29T09:55:48.1457512Z   exitcode  : 1 (pid: 47209)
2022-11-29T09:55:48.1458055Z   error_file: <N/A>
2022-11-29T09:55:48.1458888Z   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
2022-11-29T09:55:48.1459682Z ============================================================
2022-11-29T09:55:48.3577958Z 
2022-11-29T09:55:48.3579980Z �[1m�[31mRunning trial code without runtime. Please check the tutorial if you are new to NNI: �[33mhttps://nni.readthedocs.io/en/stable/tutorials/hpo_quickstart_pytorch/main.html�[0m
2022-11-29T09:55:48.7132707Z EFinished test: test_run_algo (tests.test_auto3dseg_hpo.TestHPO) (19.5s)
2022-11-29T09:55:48.7133738Z Starting test: test_run_algo_after_move_files (tests.test_auto3dseg_hpo.TestHPO)...
2022-11-29T09:55:48.7134280Z 
2022-11-29T09:55:50.0123203Z   0%|          | 0/12 [00:00<?, ?it/s]
2022-11-29T09:55:50.1349379Z   8%|▊         | 1/12 [00:01<00:14,  1.30s/it]
2022-11-29T09:55:50.2906597Z  25%|██▌       | 3/12 [00:01<00:03,  2.62it/s]
2022-11-29T09:55:50.4219264Z  50%|█████     | 6/12 [00:01<00:01,  5.57it/s]
2022-11-29T09:55:50.5415627Z  75%|███████▌  | 9/12 [00:01<00:00,  8.61it/s]
2022-11-29T09:55:50.9646172Z 100%|██████████| 12/12 [00:01<00:00, 11.68it/s]
2022-11-29T09:55:51.3342432Z 100%|██████████| 12/12 [00:02<00:00,  5.33it/s]
2022-11-29T09:55:51.3342851Z 
2022-11-29T09:55:51.5588108Z algo_templates.tar.gz: 0.00B [00:00, ?B/s]
2022-11-29T09:55:51.5839498Z algo_templates.tar.gz:   3%|▎         | 8.00k/290k [00:00<00:07, 36.6kB/s]
2022-11-29T09:55:51.5845945Z 2022-11-29 09:55:51,584 - INFO - Downloaded: /tmp/tmp14ob9n70/algo_templates.tar.gz
2022-11-29T09:55:51.5847918Z 2022-11-29 09:55:51,584 - INFO - Expected md5 is None, skip md5 check for file /tmp/tmp14ob9n70/algo_templates.tar.gz.
2022-11-29T09:55:51.5850845Z 2022-11-29 09:55:51,584 - INFO - Writing into directory: /tmp/tmpe03g0p2o/workdir.
2022-11-29T09:55:52.2333875Z 2022-11-29 09:55:52,232 - INFO - /tmp/tmpe03g0p2o/workdir/segresnet2d_0
2022-11-29T09:55:52.6650624Z 2022-11-29 09:55:52,664 - INFO - /tmp/tmpe03g0p2o/workdir/segresnet2d_1
2022-11-29T09:55:53.1359243Z 2022-11-29 09:55:53,135 - INFO - /tmp/tmpe03g0p2o/workdir/dints_0
2022-11-29T09:55:53.6065237Z 2022-11-29 09:55:53,605 - INFO - /tmp/tmpe03g0p2o/workdir/dints_1
2022-11-29T09:55:54.0267889Z 2022-11-29 09:55:54,026 - INFO - /tmp/tmpe03g0p2o/workdir/swinunetr_0
2022-11-29T09:55:54.4470347Z 2022-11-29 09:55:54,446 - INFO - /tmp/tmpe03g0p2o/workdir/swinunetr_1
2022-11-29T09:55:54.9021081Z 2022-11-29 09:55:54,901 - INFO - /tmp/tmpe03g0p2o/workdir/segresnet_0
2022-11-29T09:55:55.3591210Z 2022-11-29 09:55:55,358 - INFO - /tmp/tmpe03g0p2o/workdir/segresnet_1
2022-11-29T09:55:55.5195089Z 2022-11-29 09:55:55,518 - INFO - /tmp/tmpe03g0p2o/workdir/segresnet2d_0_override
2022-11-29T09:55:55.5196581Z 2022-11-29 09:55:55,519 - INFO - ============================================================================================================================================
2022-11-29T09:55:55.5197649Z 2022-11-29 09:55:55,519 - INFO - If NNI will run in your local env: 
2022-11-29T09:55:55.5198816Z 2022-11-29 09:55:55,519 - INFO - 1. Add the following line to the trialCommand in your NNI config: 
2022-11-29T09:55:55.5200374Z 2022-11-29 09:55:55,519 - INFO - python -m monai.apps.auto3dseg NNIGen run_algo  /tmp/tmpe03g0p2o/workdir/segresnet2d_0_override/algo_object.pkl {result_dir}
2022-11-29T09:55:55.5202203Z 2022-11-29 09:55:55,519 - INFO - --------------------------------------------------------------------------------------------------------------------------------------------
2022-11-29T09:55:55.5203552Z 2022-11-29 09:55:55,519 - INFO - If NNI will run in a remote env: 
2022-11-29T09:55:55.5205041Z 2022-11-29 09:55:55,519 - INFO - 1. Copy the algorithm_templates folder /tmp/tmpe03g0p2o/workdir/algorithm_templates/segresnet2d to remote {remote_algorithm_templates_dir}
2022-11-29T09:55:55.5206675Z 2022-11-29 09:55:55,519 - INFO - 2. Copy the older /tmp/tmpe03g0p2o/workdir/segresnet2d_0_override to the remote machine {remote_algo_dir}
2022-11-29T09:55:55.5207982Z 2022-11-29 09:55:55,519 - INFO - Then add the following line to the trialCommand in your NNI config: 
2022-11-29T09:55:55.5209452Z 2022-11-29 09:55:55,519 - INFO - python -m monai.apps.auto3dseg NNIGen run_algo  {remote_algo_dir} {result_dir} {remote_algorithm_templates_dir}
2022-11-29T09:55:55.5210877Z 2022-11-29 09:55:55,519 - INFO - ============================================================================================================================================
2022-11-29T09:55:55.6383539Z 2022-11-29 09:55:55,637 - INFO - /tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None
2022-11-29T09:55:55.6387319Z 2022-11-29 09:55:55,637 - INFO - Launching: torchrun --nnodes=1 --nproc_per_node=2 /tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/scripts/train.py run --config_file='/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/configs/transforms_infer.yaml','/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/configs/transforms_validate.yaml','/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/configs/transforms_train.yaml','/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/configs/network.yaml','/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/configs/hyper_parameters.yaml'
2022-11-29T09:55:57.3514953Z algo_templates.tar.gz: 296kB [00:00, 1.22MB/s]                            WARNING:torch.distributed.run:
2022-11-29T09:55:57.3515768Z *****************************************
2022-11-29T09:55:57.3516951Z Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
2022-11-29T09:55:57.3518031Z *****************************************
2022-11-29T09:56:02.7961617Z [info] number of GPUs: 2
2022-11-29T09:56:02.7962238Z [info] number of GPUs: 2
2022-11-29T09:56:02.8595629Z 2022-11-29 09:56:02,858 - Added key: store_based_barrier_key:1 to store for rank: 0
2022-11-29T09:56:02.8597275Z 2022-11-29 09:56:02,859 - Added key: store_based_barrier_key:1 to store for rank: 1
2022-11-29T09:56:02.8598590Z 2022-11-29 09:56:02,859 - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-29T09:56:02.8599358Z [info] world_size: 2
2022-11-29T09:56:02.8606836Z train_files: 4
2022-11-29T09:56:02.8607325Z val_files: 2
2022-11-29T09:56:02.8700912Z 2022-11-29 09:56:02,869 - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-29T09:56:02.8701745Z [info] world_size: 2
2022-11-29T09:56:02.8710618Z train_files: 4
2022-11-29T09:56:02.8711083Z val_files: 2
2022-11-29T09:56:03.0772150Z Traceback (most recent call last):
2022-11-29T09:56:03.0773100Z   File "/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/scripts/train.py", line 439, in <module>
2022-11-29T09:56:03.0773895Z     fire.Fire()
2022-11-29T09:56:03.0774939Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
2022-11-29T09:56:03.0775874Z     component_trace = _Fire(component, args, parsed_flag_args, context, name)
2022-11-29T09:56:03.0777020Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
2022-11-29T09:56:03.0778306Z     component, remaining_args = _CallAndUpdateTrace(
2022-11-29T09:56:03.0779446Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
2022-11-29T09:56:03.0780260Z     component = fn(*varargs, **kwargs)
2022-11-29T09:56:03.0781092Z   File "/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/scripts/train.py", line 164, in run
2022-11-29T09:56:03.0781833Z     model = model.to(device)
2022-11-29T09:56:03.0782855Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 987, in to
2022-11-29T09:56:03.0783639Z     return self._apply(convert)
2022-11-29T09:56:03.0784675Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
2022-11-29T09:56:03.0785424Z     module._apply(fn)
2022-11-29T09:56:03.0786450Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
2022-11-29T09:56:03.0787180Z     module._apply(fn)
2022-11-29T09:56:03.0788169Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 662, in _apply
2022-11-29T09:56:03.0788916Z     param_applied = fn(param)
2022-11-29T09:56:03.0789961Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 985, in convert
2022-11-29T09:56:03.0790923Z     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2022-11-29T09:56:03.0791714Z RuntimeError: CUDA error: out of memory
2022-11-29T09:56:03.0792658Z CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
2022-11-29T09:56:03.0793619Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2022-11-29T09:56:04.6903189Z num_epochs 4
2022-11-29T09:56:04.6903822Z num_epochs_per_validation 2
2022-11-29T09:56:07.3912570Z WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 47355 closing signal SIGTERM
2022-11-29T09:56:07.6058014Z ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 47356) of binary: /opt/conda/bin/python
2022-11-29T09:56:07.6161104Z Traceback (most recent call last):
2022-11-29T09:56:07.6161909Z   File "/opt/conda/bin/torchrun", line 8, in <module>
2022-11-29T09:56:07.6162564Z     sys.exit(main())
2022-11-29T09:56:07.6164129Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
2022-11-29T09:56:07.6164996Z     return f(*args, **kwargs)
2022-11-29T09:56:07.6166038Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
2022-11-29T09:56:07.6166755Z     run(args)
2022-11-29T09:56:07.6167697Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
2022-11-29T09:56:07.6168423Z     elastic_launch(
2022-11-29T09:56:07.6169810Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
2022-11-29T09:56:07.6170768Z     return launch_agent(self._config, self._entrypoint, list(args))
2022-11-29T09:56:07.6172000Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
2022-11-29T09:56:07.6172819Z     raise ChildFailedError(
2022-11-29T09:56:07.6173768Z torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
2022-11-29T09:56:07.6174705Z ============================================================
2022-11-29T09:56:07.6175466Z /tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/scripts/train.py FAILED
2022-11-29T09:56:07.6176532Z ------------------------------------------------------------
2022-11-29T09:56:07.6177140Z Failures:
2022-11-29T09:56:07.6177785Z   <NO_OTHER_FAILURES>
2022-11-29T09:56:07.6178610Z ------------------------------------------------------------
2022-11-29T09:56:07.6179296Z Root Cause (first observed failure):
2022-11-29T09:56:07.6179799Z [0]:
2022-11-29T09:56:07.6180401Z   time      : 2022-11-29_09:56:07
2022-11-29T09:56:07.6180911Z   host      : 16e7a27492ca
2022-11-29T09:56:07.6181415Z   rank      : 1 (local_rank: 1)
2022-11-29T09:56:07.6181959Z   exitcode  : 1 (pid: 47356)
2022-11-29T09:56:07.6182475Z   error_file: <N/A>
2022-11-29T09:56:07.6183268Z   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
2022-11-29T09:56:07.6184088Z ============================================================
2022-11-29T09:56:07.8314873Z 
2022-11-29T09:56:07.8317037Z /tmp/tmpe03g0p2o/workdir/segresnet2d_0_override does not contain /tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override/algo_object.pkl.Now override the Algo output_path with: /tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override
2022-11-29T09:56:08.1864414Z �[1m�[31mRunning trial code without runtime. Please check the tutorial if you are new to NNI: �[33mhttps://nni.readthedocs.io/en/stable/tutorials/hpo_quickstart_pytorch/main.html�[0m
2022-11-29T09:56:08.1865740Z EFinished test: test_run_algo_after_move_files (tests.test_auto3dseg_hpo.TestHPO) (19.5s)
2022-11-29T09:56:08.1866711Z Starting test: test_run_optuna (tests.test_auto3dseg_hpo.TestHPO)...
2022-11-29T09:56:08.1867205Z 
2022-11-29T09:56:09.3253527Z   0%|          | 0/12 [00:00<?, ?it/s]
2022-11-29T09:56:09.5007141Z   8%|▊         | 1/12 [00:01<00:12,  1.14s/it]
2022-11-29T09:56:09.6147742Z  17%|█▋        | 2/12 [00:01<00:05,  1.75it/s]
2022-11-29T09:56:09.7417871Z  33%|███▎      | 4/12 [00:01<00:01,  4.01it/s]
2022-11-29T09:56:09.8645233Z  58%|█████▊    | 7/12 [00:01<00:00,  7.66it/s]
2022-11-29T09:56:10.3256176Z  83%|████████▎ | 10/12 [00:01<00:00, 11.08it/s]
2022-11-29T09:56:10.6777263Z 100%|██████████| 12/12 [00:02<00:00,  5.61it/s]
2022-11-29T09:56:10.6777845Z 
2022-11-29T09:56:11.0249843Z algo_templates.tar.gz: 0.00B [00:00, ?B/s]
2022-11-29T09:56:11.0510174Z algo_templates.tar.gz:   3%|▎         | 8.00k/290k [00:00<00:12, 23.6kB/s]
2022-11-29T09:56:11.0515818Z 2022-11-29 09:56:11,051 - INFO - Downloaded: /tmp/tmp2xv7qsxd/algo_templates.tar.gz
2022-11-29T09:56:11.0517577Z 2022-11-29 09:56:11,051 - INFO - Expected md5 is None, skip md5 check for file /tmp/tmp2xv7qsxd/algo_templates.tar.gz.
2022-11-29T09:56:11.0519827Z 2022-11-29 09:56:11,051 - INFO - Writing into directory: /tmp/tmpwfj8ea33/workdir.
2022-11-29T09:56:11.6872359Z 2022-11-29 09:56:11,686 - INFO - /tmp/tmpwfj8ea33/workdir/segresnet2d_0
2022-11-29T09:56:12.1155475Z 2022-11-29 09:56:12,114 - INFO - /tmp/tmpwfj8ea33/workdir/segresnet2d_1
2022-11-29T09:56:12.5846868Z 2022-11-29 09:56:12,584 - INFO - /tmp/tmpwfj8ea33/workdir/dints_0
2022-11-29T09:56:13.0540510Z 2022-11-29 09:56:13,053 - INFO - /tmp/tmpwfj8ea33/workdir/dints_1
2022-11-29T09:56:13.4701130Z 2022-11-29 09:56:13,469 - INFO - /tmp/tmpwfj8ea33/workdir/swinunetr_0
2022-11-29T09:56:13.8861990Z 2022-11-29 09:56:13,885 - INFO - /tmp/tmpwfj8ea33/workdir/swinunetr_1
2022-11-29T09:56:14.3424143Z 2022-11-29 09:56:14,341 - INFO - /tmp/tmpwfj8ea33/workdir/segresnet_0
2022-11-29T09:56:14.7983048Z 2022-11-29 09:56:14,797 - INFO - /tmp/tmpwfj8ea33/workdir/segresnet_1
2022-11-29T09:56:14.9559007Z 2022-11-29 09:56:14,955 - INFO - /tmp/tmpwfj8ea33/workdir/segresnet2d_0_override
2022-11-29T09:56:14.9565366Z algo_templates.tar.gz: 296kB [00:00, 813kB/s]                             
2022-11-29T09:56:14.9566724Z �[32m[I 2022-11-29 09:56:14,956]�[0m A new study created in memory with name: no-name-3fd61c31-14c5-466c-b2d2-4e417c9f0423�[0m
2022-11-29T09:56:15.0677904Z 2022-11-29 09:56:15,067 - INFO - /tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001
2022-11-29T09:56:15.0682234Z 2022-11-29 09:56:15,067 - INFO - Launching: torchrun --nnodes=1 --nproc_per_node=2 /tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/scripts/train.py run --config_file='/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_infer.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_validate.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_train.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/network.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/hyper_parameters.yaml' --learning_rate=0.001
2022-11-29T09:56:16.8004670Z WARNING:torch.distributed.run:
2022-11-29T09:56:16.8005329Z *****************************************
2022-11-29T09:56:16.8006510Z Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
2022-11-29T09:56:16.8007590Z *****************************************
2022-11-29T09:56:22.2331343Z [info] number of GPUs: 2
2022-11-29T09:56:22.2331991Z [info] number of GPUs: 2
2022-11-29T09:56:22.2960001Z 2022-11-29 09:56:22,295 - Added key: store_based_barrier_key:1 to store for rank: 1
2022-11-29T09:56:22.2961832Z 2022-11-29 09:56:22,295 - Added key: store_based_barrier_key:1 to store for rank: 0
2022-11-29T09:56:22.2963169Z 2022-11-29 09:56:22,295 - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-29T09:56:22.2963950Z [info] world_size: 2
2022-11-29T09:56:22.2973657Z train_files: 4
2022-11-29T09:56:22.2974162Z val_files: 2
2022-11-29T09:56:22.3065064Z 2022-11-29 09:56:22,305 - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-29T09:56:22.3065837Z [info] world_size: 2
2022-11-29T09:56:22.3078168Z train_files: 4
2022-11-29T09:56:22.3078650Z val_files: 2
2022-11-29T09:56:22.5454617Z Traceback (most recent call last):
2022-11-29T09:56:22.5455751Z   File "/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/scripts/train.py", line 439, in <module>
2022-11-29T09:56:22.5456728Z     fire.Fire()
2022-11-29T09:56:22.5458071Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
2022-11-29T09:56:22.5459277Z     component_trace = _Fire(component, args, parsed_flag_args, context, name)
2022-11-29T09:56:22.5461291Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
2022-11-29T09:56:22.5462139Z     component, remaining_args = _CallAndUpdateTrace(
2022-11-29T09:56:22.5463292Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
2022-11-29T09:56:22.5464073Z     component = fn(*varargs, **kwargs)
2022-11-29T09:56:22.5464952Z   File "/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/scripts/train.py", line 164, in run
2022-11-29T09:56:22.5465797Z     model = model.to(device)
2022-11-29T09:56:22.5466825Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 987, in to
2022-11-29T09:56:22.5467571Z     return self._apply(convert)
2022-11-29T09:56:22.5468624Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
2022-11-29T09:56:22.5469578Z     module._apply(fn)
2022-11-29T09:56:22.5470583Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
2022-11-29T09:56:22.5471536Z     module._apply(fn)
2022-11-29T09:56:22.5473095Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 662, in _apply
2022-11-29T09:56:22.5473854Z     param_applied = fn(param)
2022-11-29T09:56:22.5474931Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 985, in convert
2022-11-29T09:56:22.5475928Z     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2022-11-29T09:56:22.5476739Z RuntimeError: CUDA error: out of memory
2022-11-29T09:56:22.5477643Z CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
2022-11-29T09:56:22.5478610Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2022-11-29T09:56:24.1101037Z num_epochs 4
2022-11-29T09:56:24.1101691Z num_epochs_per_validation 2
2022-11-29T09:56:26.8273388Z WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 47502 closing signal SIGTERM
2022-11-29T09:56:27.0419095Z ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 47503) of binary: /opt/conda/bin/python
2022-11-29T09:56:27.0522809Z Traceback (most recent call last):
2022-11-29T09:56:27.0523538Z   File "/opt/conda/bin/torchrun", line 8, in <module>
2022-11-29T09:56:27.0524174Z     sys.exit(main())
2022-11-29T09:56:27.0525800Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
2022-11-29T09:56:27.0526671Z     return f(*args, **kwargs)
2022-11-29T09:56:27.0527722Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
2022-11-29T09:56:27.0528449Z     run(args)
2022-11-29T09:56:27.0529438Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
2022-11-29T09:56:27.0530160Z     elastic_launch(
2022-11-29T09:56:27.0531236Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
2022-11-29T09:56:27.0532208Z     return launch_agent(self._config, self._entrypoint, list(args))
2022-11-29T09:56:27.0533431Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
2022-11-29T09:56:27.0534271Z     raise ChildFailedError(
2022-11-29T09:56:27.0535252Z torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
2022-11-29T09:56:27.0536184Z ============================================================
2022-11-29T09:56:27.0537020Z /tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/scripts/train.py FAILED
2022-11-29T09:56:27.0538800Z ------------------------------------------------------------
2022-11-29T09:56:27.0539448Z Failures:
2022-11-29T09:56:27.0539906Z   <NO_OTHER_FAILURES>
2022-11-29T09:56:27.0540792Z ------------------------------------------------------------
2022-11-29T09:56:27.0541519Z Root Cause (first observed failure):
2022-11-29T09:56:27.0542423Z [0]:
2022-11-29T09:56:27.0543058Z   time      : 2022-11-29_09:56:26
2022-11-29T09:56:27.0543594Z   host      : 16e7a27492ca
2022-11-29T09:56:27.0544113Z   rank      : 1 (local_rank: 1)
2022-11-29T09:56:27.0544668Z   exitcode  : 1 (pid: 47503)
2022-11-29T09:56:27.0545204Z   error_file: <N/A>
2022-11-29T09:56:27.0546004Z   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
2022-11-29T09:56:27.0546817Z ============================================================
2022-11-29T09:56:27.2555038Z �[33m[W 2022-11-29 09:56:27,251]�[0m Trial 0 failed because of the following error: CalledProcessError(1, ['torchrun', '--nnodes=1', '--nproc_per_node=2', '/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/scripts/train.py', 'run', "--config_file='/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_infer.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_validate.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_train.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/network.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/hyper_parameters.yaml'", '--learning_rate=0.001'])�[0m
2022-11-29T09:56:27.2557797Z Traceback (most recent call last):
2022-11-29T09:56:27.2558958Z   File "/opt/conda/lib/python3.8/site-packages/optuna/study/_optimize.py", line 196, in _run_trial
2022-11-29T09:56:27.2559777Z     value_or_values = func(trial)
2022-11-29T09:56:27.2560504Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/hpo_gen.py", line 333, in __call__
2022-11-29T09:56:27.2561361Z     self.run_algo(obj_filename, output_folder, template_path)
2022-11-29T09:56:27.2562209Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/hpo_gen.py", line 395, in run_algo
2022-11-29T09:56:27.2562935Z     self.algo.train(self.params)
2022-11-29T09:56:27.2563689Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 198, in train
2022-11-29T09:56:27.2564457Z     return self._run_cmd(cmd, devices_info)
2022-11-29T09:56:27.2565214Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 185, in _run_cmd
2022-11-29T09:56:27.2566096Z     normal_out = subprocess.run(cmd.split(), env=ps_environ, check=True)
2022-11-29T09:56:27.2566940Z   File "/opt/conda/lib/python3.8/subprocess.py", line 516, in run
2022-11-29T09:56:27.2567731Z     raise CalledProcessError(retcode, process.args,
2022-11-29T09:56:27.2572162Z subprocess.CalledProcessError: Command '['torchrun', '--nnodes=1', '--nproc_per_node=2', '/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/scripts/train.py', 'run', "--config_file='/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_infer.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_validate.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_train.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/network.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/hyper_parameters.yaml'", '--learning_rate=0.001']' returned non-zero exit status 1.
2022-11-29T09:56:27.2867415Z EFinished test: test_run_optuna (tests.test_auto3dseg_hpo.TestHPO) (19.4s)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions