2022-11-29T09:55:09.0551041Z Starting test: test_get_history (tests.test_auto3dseg_hpo.TestHPO)...
2022-11-29T09:55:09.0551548Z
2022-11-29T09:55:10.1811180Z 0%| | 0/12 [00:00<?, ?it/s]
2022-11-29T09:55:10.3612333Z 8%|▊ | 1/12 [00:01<00:12, 1.12s/it]
2022-11-29T09:55:10.5055004Z 17%|█▋ | 2/12 [00:01<00:05, 1.76it/s]
2022-11-29T09:55:10.6297577Z 42%|████▏ | 5/12 [00:01<00:01, 5.10it/s]
2022-11-29T09:55:10.7651096Z 67%|██████▋ | 8/12 [00:01<00:00, 8.48it/s]
2022-11-29T09:55:11.1991967Z 92%|█████████▏| 11/12 [00:01<00:00, 11.44it/s]
2022-11-29T09:55:11.5556705Z 100%|██████████| 12/12 [00:02<00:00, 5.60it/s]
2022-11-29T09:55:11.5557118Z
2022-11-29T09:55:12.2148624Z algo_templates.tar.gz: 0.00B [00:00, ?B/s]
2022-11-29T09:55:12.3589082Z algo_templates.tar.gz: 3%|▎ | 8.00k/290k [00:00<00:23, 12.4kB/s]
2022-11-29T09:55:12.5044988Z algo_templates.tar.gz: 17%|█▋ | 48.0k/290k [00:00<00:03, 77.2kB/s]
2022-11-29T09:55:12.6539602Z algo_templates.tar.gz: 44%|████▍ | 128k/290k [00:00<00:00, 201kB/s]
2022-11-29T09:55:12.6570675Z algo_templates.tar.gz: 94%|█████████▍| 272k/290k [00:01<00:00, 416kB/s]
2022-11-29T09:55:12.6576749Z 2022-11-29 09:55:12,657 - INFO - Downloaded: /tmp/tmpn_5xi132/algo_templates.tar.gz
2022-11-29T09:55:12.6578437Z 2022-11-29 09:55:12,657 - INFO - Expected md5 is None, skip md5 check for file /tmp/tmpn_5xi132/algo_templates.tar.gz.
2022-11-29T09:55:12.6581851Z 2022-11-29 09:55:12,657 - INFO - Writing into directory: /tmp/tmp8svrlhoq/workdir.
2022-11-29T09:55:13.2978999Z 2022-11-29 09:55:13,297 - INFO - /tmp/tmp8svrlhoq/workdir/segresnet2d_0
2022-11-29T09:55:13.7312508Z 2022-11-29 09:55:13,730 - INFO - /tmp/tmp8svrlhoq/workdir/segresnet2d_1
2022-11-29T09:55:14.2050683Z 2022-11-29 09:55:14,204 - INFO - /tmp/tmp8svrlhoq/workdir/dints_0
2022-11-29T09:55:14.6769312Z 2022-11-29 09:55:14,676 - INFO - /tmp/tmp8svrlhoq/workdir/dints_1
2022-11-29T09:55:15.0954867Z 2022-11-29 09:55:15,094 - INFO - /tmp/tmp8svrlhoq/workdir/swinunetr_0
2022-11-29T09:55:15.5124449Z 2022-11-29 09:55:15,511 - INFO - /tmp/tmp8svrlhoq/workdir/swinunetr_1
2022-11-29T09:55:15.9678003Z 2022-11-29 09:55:15,967 - INFO - /tmp/tmp8svrlhoq/workdir/segresnet_0
2022-11-29T09:55:16.4254530Z 2022-11-29 09:55:16,424 - INFO - /tmp/tmp8svrlhoq/workdir/segresnet_1
2022-11-29T09:55:16.5836417Z 2022-11-29 09:55:16,583 - INFO - /tmp/tmp8svrlhoq/workdir/segresnet2d_0_override
2022-11-29T09:55:16.5837786Z 2022-11-29 09:55:16,583 - INFO - ============================================================================================================================================
2022-11-29T09:55:16.5838810Z 2022-11-29 09:55:16,583 - INFO - If NNI will run in your local env:
2022-11-29T09:55:16.5839931Z 2022-11-29 09:55:16,583 - INFO - 1. Add the following line to the trialCommand in your NNI config:
2022-11-29T09:55:16.5841525Z 2022-11-29 09:55:16,583 - INFO - python -m monai.apps.auto3dseg NNIGen run_algo /tmp/tmp8svrlhoq/workdir/segresnet2d_0_override/algo_object.pkl {result_dir}
2022-11-29T09:55:16.5843375Z 2022-11-29 09:55:16,583 - INFO - --------------------------------------------------------------------------------------------------------------------------------------------
2022-11-29T09:55:16.5845128Z 2022-11-29 09:55:16,583 - INFO - If NNI will run in a remote env:
2022-11-29T09:55:16.5846598Z 2022-11-29 09:55:16,583 - INFO - 1. Copy the algorithm_templates folder /tmp/tmp8svrlhoq/workdir/algorithm_templates/segresnet2d to remote {remote_algorithm_templates_dir}
2022-11-29T09:55:16.5848229Z 2022-11-29 09:55:16,583 - INFO - 2. Copy the older /tmp/tmp8svrlhoq/workdir/segresnet2d_0_override to the remote machine {remote_algo_dir}
2022-11-29T09:55:16.5849572Z 2022-11-29 09:55:16,583 - INFO - Then add the following line to the trialCommand in your NNI config:
2022-11-29T09:55:16.5851241Z 2022-11-29 09:55:16,583 - INFO - python -m monai.apps.auto3dseg NNIGen run_algo {remote_algo_dir} {result_dir} {remote_algorithm_templates_dir}
2022-11-29T09:55:16.5852668Z 2022-11-29 09:55:16,583 - INFO - ============================================================================================================================================
2022-11-29T09:55:16.6948146Z 2022-11-29 09:55:16,694 - INFO - /tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None
2022-11-29T09:55:16.6951825Z 2022-11-29 09:55:16,694 - INFO - Launching: torchrun --nnodes=1 --nproc_per_node=2 /tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/scripts/train.py run --config_file='/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/configs/transforms_infer.yaml','/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/configs/transforms_validate.yaml','/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/configs/transforms_train.yaml','/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/configs/network.yaml','/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/configs/hyper_parameters.yaml'
2022-11-29T09:55:18.4222096Z algo_templates.tar.gz: 296kB [00:01, 275kB/s] WARNING:torch.distributed.run:
2022-11-29T09:55:18.4222874Z *****************************************
2022-11-29T09:55:18.4224086Z Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
2022-11-29T09:55:18.4225203Z *****************************************
2022-11-29T09:55:23.8582013Z [info] number of GPUs: 2
2022-11-29T09:55:23.8583735Z [info] number of GPUs: 2
2022-11-29T09:55:23.9202797Z 2022-11-29 09:55:23,919 - Added key: store_based_barrier_key:1 to store for rank: 1
2022-11-29T09:55:23.9203975Z 2022-11-29 09:55:23,920 - Added key: store_based_barrier_key:1 to store for rank: 0
2022-11-29T09:55:23.9205355Z 2022-11-29 09:55:23,920 - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-29T09:55:23.9206131Z [info] world_size: 2
2022-11-29T09:55:23.9218634Z train_files: 4
2022-11-29T09:55:23.9219140Z val_files: 2
2022-11-29T09:55:23.9306079Z 2022-11-29 09:55:23,930 - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-29T09:55:23.9306939Z [info] world_size: 2
2022-11-29T09:55:23.9318224Z train_files: 4
2022-11-29T09:55:23.9318684Z val_files: 2
2022-11-29T09:55:24.1687175Z Traceback (most recent call last):
2022-11-29T09:55:24.1688136Z File "/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/scripts/train.py", line 439, in <module>
2022-11-29T09:55:24.1688872Z fire.Fire()
2022-11-29T09:55:24.1689937Z File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
2022-11-29T09:55:24.1690851Z component_trace = _Fire(component, args, parsed_flag_args, context, name)
2022-11-29T09:55:24.1691994Z File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
2022-11-29T09:55:24.1692821Z component, remaining_args = _CallAndUpdateTrace(
2022-11-29T09:55:24.1694010Z File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
2022-11-29T09:55:24.1694840Z component = fn(*varargs, **kwargs)
2022-11-29T09:55:24.1696123Z File "/tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/scripts/train.py", line 164, in run
2022-11-29T09:55:24.1696896Z model = model.to(device)
2022-11-29T09:55:24.1698118Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 987, in to
2022-11-29T09:55:24.1698893Z return self._apply(convert)
2022-11-29T09:55:24.1699926Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
2022-11-29T09:55:24.1700673Z module._apply(fn)
2022-11-29T09:55:24.1701697Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
2022-11-29T09:55:24.1702411Z module._apply(fn)
2022-11-29T09:55:24.1703414Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 662, in _apply
2022-11-29T09:55:24.1704364Z param_applied = fn(param)
2022-11-29T09:55:24.1705422Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 985, in convert
2022-11-29T09:55:24.1706390Z return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2022-11-29T09:55:24.1707191Z RuntimeError: CUDA error: out of memory
2022-11-29T09:55:24.1708139Z CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
2022-11-29T09:55:24.1709083Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2022-11-29T09:55:25.7366762Z num_epochs 4
2022-11-29T09:55:25.7367400Z num_epochs_per_validation 2
2022-11-29T09:55:28.4527840Z WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 47061 closing signal SIGTERM
2022-11-29T09:55:28.6172946Z ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 47062) of binary: /opt/conda/bin/python
2022-11-29T09:55:28.6272926Z Traceback (most recent call last):
2022-11-29T09:55:28.6273687Z File "/opt/conda/bin/torchrun", line 8, in <module>
2022-11-29T09:55:28.6274336Z sys.exit(main())
2022-11-29T09:55:28.6275875Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
2022-11-29T09:55:28.6276736Z return f(*args, **kwargs)
2022-11-29T09:55:28.6277775Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
2022-11-29T09:55:28.6278490Z run(args)
2022-11-29T09:55:28.6279466Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
2022-11-29T09:55:28.6280170Z elastic_launch(
2022-11-29T09:55:28.6281235Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
2022-11-29T09:55:28.6282208Z return launch_agent(self._config, self._entrypoint, list(args))
2022-11-29T09:55:28.6283430Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
2022-11-29T09:55:28.6284268Z raise ChildFailedError(
2022-11-29T09:55:28.6285244Z torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
2022-11-29T09:55:28.6286202Z ============================================================
2022-11-29T09:55:28.6286942Z /tmp/tmp8svrlhoq/workdir/segresnet2d_0_override_None/scripts/train.py FAILED
2022-11-29T09:55:28.6287991Z ------------------------------------------------------------
2022-11-29T09:55:28.6288621Z Failures:
2022-11-29T09:55:28.6289077Z <NO_OTHER_FAILURES>
2022-11-29T09:55:28.6289912Z ------------------------------------------------------------
2022-11-29T09:55:28.6290606Z Root Cause (first observed failure):
2022-11-29T09:55:28.6291106Z [0]:
2022-11-29T09:55:28.6291739Z time : 2022-11-29_09:55:28
2022-11-29T09:55:28.6292265Z host : 16e7a27492ca
2022-11-29T09:55:28.6292774Z rank : 1 (local_rank: 1)
2022-11-29T09:55:28.6293321Z exitcode : 1 (pid: 47062)
2022-11-29T09:55:28.6293850Z error_file: <N/A>
2022-11-29T09:55:28.6294658Z traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
2022-11-29T09:55:28.6295928Z ============================================================
2022-11-29T09:55:28.8428026Z
2022-11-29T09:55:28.8429721Z �[1m�[31mRunning trial code without runtime. Please check the tutorial if you are new to NNI: �[33mhttps://nni.readthedocs.io/en/stable/tutorials/hpo_quickstart_pytorch/main.html�[0m
2022-11-29T09:55:29.1969287Z EFinished test: test_get_history (tests.test_auto3dseg_hpo.TestHPO) (20.1s)
2022-11-29T09:55:29.1970251Z Starting test: test_run_algo (tests.test_auto3dseg_hpo.TestHPO)...
2022-11-29T09:55:29.1970742Z
2022-11-29T09:55:30.4961914Z 0%| | 0/12 [00:00<?, ?it/s]
2022-11-29T09:55:30.6012010Z 8%|▊ | 1/12 [00:01<00:14, 1.30s/it]
2022-11-29T09:55:30.7443006Z 25%|██▌ | 3/12 [00:01<00:03, 2.66it/s]
2022-11-29T09:55:30.8724339Z 50%|█████ | 6/12 [00:01<00:01, 5.73it/s]
2022-11-29T09:55:30.9907717Z 75%|███████▌ | 9/12 [00:01<00:00, 8.85it/s]
2022-11-29T09:55:31.3609564Z 100%|██████████| 12/12 [00:01<00:00, 11.97it/s]
2022-11-29T09:55:31.7145529Z 100%|██████████| 12/12 [00:02<00:00, 5.55it/s]
2022-11-29T09:55:31.7145935Z
2022-11-29T09:55:32.1145141Z algo_templates.tar.gz: 0.00B [00:00, ?B/s]
2022-11-29T09:55:32.1406013Z algo_templates.tar.gz: 3%|▎ | 8.00k/290k [00:00<00:14, 20.5kB/s]
2022-11-29T09:55:32.1413636Z 2022-11-29 09:55:32,140 - INFO - Downloaded: /tmp/tmpqbuouysy/algo_templates.tar.gz
2022-11-29T09:55:32.1415662Z 2022-11-29 09:55:32,141 - INFO - Expected md5 is None, skip md5 check for file /tmp/tmpqbuouysy/algo_templates.tar.gz.
2022-11-29T09:55:32.1422130Z 2022-11-29 09:55:32,141 - INFO - Writing into directory: /tmp/tmpo5ptmej2/workdir.
2022-11-29T09:55:32.7890156Z 2022-11-29 09:55:32,788 - INFO - /tmp/tmpo5ptmej2/workdir/segresnet2d_0
2022-11-29T09:55:33.2201065Z 2022-11-29 09:55:33,219 - INFO - /tmp/tmpo5ptmej2/workdir/segresnet2d_1
2022-11-29T09:55:33.6893202Z 2022-11-29 09:55:33,688 - INFO - /tmp/tmpo5ptmej2/workdir/dints_0
2022-11-29T09:55:34.1584238Z 2022-11-29 09:55:34,157 - INFO - /tmp/tmpo5ptmej2/workdir/dints_1
2022-11-29T09:55:34.5752796Z 2022-11-29 09:55:34,574 - INFO - /tmp/tmpo5ptmej2/workdir/swinunetr_0
2022-11-29T09:55:34.9919902Z 2022-11-29 09:55:34,991 - INFO - /tmp/tmpo5ptmej2/workdir/swinunetr_1
2022-11-29T09:55:35.4527105Z 2022-11-29 09:55:35,452 - INFO - /tmp/tmpo5ptmej2/workdir/segresnet_0
2022-11-29T09:55:35.9293233Z 2022-11-29 09:55:35,928 - INFO - /tmp/tmpo5ptmej2/workdir/segresnet_1
2022-11-29T09:55:36.0874985Z 2022-11-29 09:55:36,086 - INFO - /tmp/tmpo5ptmej2/workdir/segresnet2d_0_override
2022-11-29T09:55:36.0876363Z 2022-11-29 09:55:36,086 - INFO - ============================================================================================================================================
2022-11-29T09:55:36.0877331Z 2022-11-29 09:55:36,087 - INFO - If NNI will run in your local env:
2022-11-29T09:55:36.0878503Z 2022-11-29 09:55:36,087 - INFO - 1. Add the following line to the trialCommand in your NNI config:
2022-11-29T09:55:36.0880069Z 2022-11-29 09:55:36,087 - INFO - python -m monai.apps.auto3dseg NNIGen run_algo /tmp/tmpo5ptmej2/workdir/segresnet2d_0_override/algo_object.pkl {result_dir}
2022-11-29T09:55:36.0881905Z 2022-11-29 09:55:36,087 - INFO - --------------------------------------------------------------------------------------------------------------------------------------------
2022-11-29T09:55:36.0883229Z 2022-11-29 09:55:36,087 - INFO - If NNI will run in a remote env:
2022-11-29T09:55:36.0884701Z 2022-11-29 09:55:36,087 - INFO - 1. Copy the algorithm_templates folder /tmp/tmpo5ptmej2/workdir/algorithm_templates/segresnet2d to remote {remote_algorithm_templates_dir}
2022-11-29T09:55:36.0886317Z 2022-11-29 09:55:36,087 - INFO - 2. Copy the older /tmp/tmpo5ptmej2/workdir/segresnet2d_0_override to the remote machine {remote_algo_dir}
2022-11-29T09:55:36.0887623Z 2022-11-29 09:55:36,087 - INFO - Then add the following line to the trialCommand in your NNI config:
2022-11-29T09:55:36.0889076Z 2022-11-29 09:55:36,087 - INFO - python -m monai.apps.auto3dseg NNIGen run_algo {remote_algo_dir} {result_dir} {remote_algorithm_templates_dir}
2022-11-29T09:55:36.0890897Z 2022-11-29 09:55:36,087 - INFO - ============================================================================================================================================
2022-11-29T09:55:36.1982921Z 2022-11-29 09:55:36,197 - INFO - /tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None
2022-11-29T09:55:36.1986967Z 2022-11-29 09:55:36,197 - INFO - Launching: torchrun --nnodes=1 --nproc_per_node=2 /tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/scripts/train.py run --config_file='/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/configs/transforms_infer.yaml','/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/configs/transforms_validate.yaml','/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/configs/transforms_train.yaml','/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/configs/network.yaml','/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/configs/hyper_parameters.yaml'
2022-11-29T09:55:37.9397084Z algo_templates.tar.gz: 296kB [00:00, 712kB/s] WARNING:torch.distributed.run:
2022-11-29T09:55:37.9397872Z *****************************************
2022-11-29T09:55:37.9399075Z Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
2022-11-29T09:55:37.9400186Z *****************************************
2022-11-29T09:55:43.3559678Z [info] number of GPUs: 2
2022-11-29T09:55:43.3561263Z [info] number of GPUs: 2
2022-11-29T09:55:43.4182811Z 2022-11-29 09:55:43,417 - Added key: store_based_barrier_key:1 to store for rank: 0
2022-11-29T09:55:43.4184683Z 2022-11-29 09:55:43,418 - Added key: store_based_barrier_key:1 to store for rank: 1
2022-11-29T09:55:43.4186102Z 2022-11-29 09:55:43,418 - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-29T09:55:43.4186848Z [info] world_size: 2
2022-11-29T09:55:43.4198933Z train_files: 4
2022-11-29T09:55:43.4199432Z val_files: 2
2022-11-29T09:55:43.4286886Z 2022-11-29 09:55:43,428 - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-29T09:55:43.4287705Z [info] world_size: 2
2022-11-29T09:55:43.4293980Z train_files: 4
2022-11-29T09:55:43.4294770Z val_files: 2
2022-11-29T09:55:43.6301642Z Traceback (most recent call last):
2022-11-29T09:55:43.6302684Z File "/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/scripts/train.py", line 439, in <module>
2022-11-29T09:55:43.6303452Z fire.Fire()
2022-11-29T09:55:43.6304558Z File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
2022-11-29T09:55:43.6305446Z component_trace = _Fire(component, args, parsed_flag_args, context, name)
2022-11-29T09:55:43.6306625Z File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
2022-11-29T09:55:43.6307474Z component, remaining_args = _CallAndUpdateTrace(
2022-11-29T09:55:43.6308630Z File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
2022-11-29T09:55:43.6309465Z component = fn(*varargs, **kwargs)
2022-11-29T09:55:43.6310304Z File "/tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/scripts/train.py", line 164, in run
2022-11-29T09:55:43.6326024Z model = model.to(device)
2022-11-29T09:55:43.6327297Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 987, in to
2022-11-29T09:55:43.6328106Z return self._apply(convert)
2022-11-29T09:55:43.6329183Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
2022-11-29T09:55:43.6329923Z module._apply(fn)
2022-11-29T09:55:43.6330962Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
2022-11-29T09:55:43.6331706Z module._apply(fn)
2022-11-29T09:55:43.6332726Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 662, in _apply
2022-11-29T09:55:43.6333968Z param_applied = fn(param)
2022-11-29T09:55:43.6335040Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 985, in convert
2022-11-29T09:55:43.6336076Z return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2022-11-29T09:55:43.6336866Z RuntimeError: CUDA error: out of memory
2022-11-29T09:55:43.6337992Z CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
2022-11-29T09:55:43.6338970Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2022-11-29T09:55:45.2301934Z num_epochs 4
2022-11-29T09:55:45.2302574Z num_epochs_per_validation 2
2022-11-29T09:55:47.9688382Z WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 47208 closing signal SIGTERM
2022-11-29T09:55:48.1335061Z ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 47209) of binary: /opt/conda/bin/python
2022-11-29T09:55:48.1436786Z Traceback (most recent call last):
2022-11-29T09:55:48.1437551Z File "/opt/conda/bin/torchrun", line 8, in <module>
2022-11-29T09:55:48.1438211Z sys.exit(main())
2022-11-29T09:55:48.1439869Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
2022-11-29T09:55:48.1440767Z return f(*args, **kwargs)
2022-11-29T09:55:48.1441799Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
2022-11-29T09:55:48.1442540Z run(args)
2022-11-29T09:55:48.1443507Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
2022-11-29T09:55:48.1444203Z elastic_launch(
2022-11-29T09:55:48.1445260Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
2022-11-29T09:55:48.1446210Z return launch_agent(self._config, self._entrypoint, list(args))
2022-11-29T09:55:48.1447478Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
2022-11-29T09:55:48.1448290Z raise ChildFailedError(
2022-11-29T09:55:48.1449262Z torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
2022-11-29T09:55:48.1450206Z ============================================================
2022-11-29T09:55:48.1450951Z /tmp/tmpo5ptmej2/workdir/segresnet2d_0_override_None/scripts/train.py FAILED
2022-11-29T09:55:48.1452005Z ------------------------------------------------------------
2022-11-29T09:55:48.1452630Z Failures:
2022-11-29T09:55:48.1453089Z <NO_OTHER_FAILURES>
2022-11-29T09:55:48.1453926Z ------------------------------------------------------------
2022-11-29T09:55:48.1454635Z Root Cause (first observed failure):
2022-11-29T09:55:48.1455145Z [0]:
2022-11-29T09:55:48.1455761Z time : 2022-11-29_09:55:47
2022-11-29T09:55:48.1456284Z host : 16e7a27492ca
2022-11-29T09:55:48.1456802Z rank : 1 (local_rank: 1)
2022-11-29T09:55:48.1457512Z exitcode : 1 (pid: 47209)
2022-11-29T09:55:48.1458055Z error_file: <N/A>
2022-11-29T09:55:48.1458888Z traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
2022-11-29T09:55:48.1459682Z ============================================================
2022-11-29T09:55:48.3577958Z
2022-11-29T09:55:48.3579980Z �[1m�[31mRunning trial code without runtime. Please check the tutorial if you are new to NNI: �[33mhttps://nni.readthedocs.io/en/stable/tutorials/hpo_quickstart_pytorch/main.html�[0m
2022-11-29T09:55:48.7132707Z EFinished test: test_run_algo (tests.test_auto3dseg_hpo.TestHPO) (19.5s)
2022-11-29T09:55:48.7133738Z Starting test: test_run_algo_after_move_files (tests.test_auto3dseg_hpo.TestHPO)...
2022-11-29T09:55:48.7134280Z
2022-11-29T09:55:50.0123203Z 0%| | 0/12 [00:00<?, ?it/s]
2022-11-29T09:55:50.1349379Z 8%|▊ | 1/12 [00:01<00:14, 1.30s/it]
2022-11-29T09:55:50.2906597Z 25%|██▌ | 3/12 [00:01<00:03, 2.62it/s]
2022-11-29T09:55:50.4219264Z 50%|█████ | 6/12 [00:01<00:01, 5.57it/s]
2022-11-29T09:55:50.5415627Z 75%|███████▌ | 9/12 [00:01<00:00, 8.61it/s]
2022-11-29T09:55:50.9646172Z 100%|██████████| 12/12 [00:01<00:00, 11.68it/s]
2022-11-29T09:55:51.3342432Z 100%|██████████| 12/12 [00:02<00:00, 5.33it/s]
2022-11-29T09:55:51.3342851Z
2022-11-29T09:55:51.5588108Z algo_templates.tar.gz: 0.00B [00:00, ?B/s]
2022-11-29T09:55:51.5839498Z algo_templates.tar.gz: 3%|▎ | 8.00k/290k [00:00<00:07, 36.6kB/s]
2022-11-29T09:55:51.5845945Z 2022-11-29 09:55:51,584 - INFO - Downloaded: /tmp/tmp14ob9n70/algo_templates.tar.gz
2022-11-29T09:55:51.5847918Z 2022-11-29 09:55:51,584 - INFO - Expected md5 is None, skip md5 check for file /tmp/tmp14ob9n70/algo_templates.tar.gz.
2022-11-29T09:55:51.5850845Z 2022-11-29 09:55:51,584 - INFO - Writing into directory: /tmp/tmpe03g0p2o/workdir.
2022-11-29T09:55:52.2333875Z 2022-11-29 09:55:52,232 - INFO - /tmp/tmpe03g0p2o/workdir/segresnet2d_0
2022-11-29T09:55:52.6650624Z 2022-11-29 09:55:52,664 - INFO - /tmp/tmpe03g0p2o/workdir/segresnet2d_1
2022-11-29T09:55:53.1359243Z 2022-11-29 09:55:53,135 - INFO - /tmp/tmpe03g0p2o/workdir/dints_0
2022-11-29T09:55:53.6065237Z 2022-11-29 09:55:53,605 - INFO - /tmp/tmpe03g0p2o/workdir/dints_1
2022-11-29T09:55:54.0267889Z 2022-11-29 09:55:54,026 - INFO - /tmp/tmpe03g0p2o/workdir/swinunetr_0
2022-11-29T09:55:54.4470347Z 2022-11-29 09:55:54,446 - INFO - /tmp/tmpe03g0p2o/workdir/swinunetr_1
2022-11-29T09:55:54.9021081Z 2022-11-29 09:55:54,901 - INFO - /tmp/tmpe03g0p2o/workdir/segresnet_0
2022-11-29T09:55:55.3591210Z 2022-11-29 09:55:55,358 - INFO - /tmp/tmpe03g0p2o/workdir/segresnet_1
2022-11-29T09:55:55.5195089Z 2022-11-29 09:55:55,518 - INFO - /tmp/tmpe03g0p2o/workdir/segresnet2d_0_override
2022-11-29T09:55:55.5196581Z 2022-11-29 09:55:55,519 - INFO - ============================================================================================================================================
2022-11-29T09:55:55.5197649Z 2022-11-29 09:55:55,519 - INFO - If NNI will run in your local env:
2022-11-29T09:55:55.5198816Z 2022-11-29 09:55:55,519 - INFO - 1. Add the following line to the trialCommand in your NNI config:
2022-11-29T09:55:55.5200374Z 2022-11-29 09:55:55,519 - INFO - python -m monai.apps.auto3dseg NNIGen run_algo /tmp/tmpe03g0p2o/workdir/segresnet2d_0_override/algo_object.pkl {result_dir}
2022-11-29T09:55:55.5202203Z 2022-11-29 09:55:55,519 - INFO - --------------------------------------------------------------------------------------------------------------------------------------------
2022-11-29T09:55:55.5203552Z 2022-11-29 09:55:55,519 - INFO - If NNI will run in a remote env:
2022-11-29T09:55:55.5205041Z 2022-11-29 09:55:55,519 - INFO - 1. Copy the algorithm_templates folder /tmp/tmpe03g0p2o/workdir/algorithm_templates/segresnet2d to remote {remote_algorithm_templates_dir}
2022-11-29T09:55:55.5206675Z 2022-11-29 09:55:55,519 - INFO - 2. Copy the older /tmp/tmpe03g0p2o/workdir/segresnet2d_0_override to the remote machine {remote_algo_dir}
2022-11-29T09:55:55.5207982Z 2022-11-29 09:55:55,519 - INFO - Then add the following line to the trialCommand in your NNI config:
2022-11-29T09:55:55.5209452Z 2022-11-29 09:55:55,519 - INFO - python -m monai.apps.auto3dseg NNIGen run_algo {remote_algo_dir} {result_dir} {remote_algorithm_templates_dir}
2022-11-29T09:55:55.5210877Z 2022-11-29 09:55:55,519 - INFO - ============================================================================================================================================
2022-11-29T09:55:55.6383539Z 2022-11-29 09:55:55,637 - INFO - /tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None
2022-11-29T09:55:55.6387319Z 2022-11-29 09:55:55,637 - INFO - Launching: torchrun --nnodes=1 --nproc_per_node=2 /tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/scripts/train.py run --config_file='/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/configs/transforms_infer.yaml','/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/configs/transforms_validate.yaml','/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/configs/transforms_train.yaml','/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/configs/network.yaml','/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/configs/hyper_parameters.yaml'
2022-11-29T09:55:57.3514953Z algo_templates.tar.gz: 296kB [00:00, 1.22MB/s] WARNING:torch.distributed.run:
2022-11-29T09:55:57.3515768Z *****************************************
2022-11-29T09:55:57.3516951Z Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
2022-11-29T09:55:57.3518031Z *****************************************
2022-11-29T09:56:02.7961617Z [info] number of GPUs: 2
2022-11-29T09:56:02.7962238Z [info] number of GPUs: 2
2022-11-29T09:56:02.8595629Z 2022-11-29 09:56:02,858 - Added key: store_based_barrier_key:1 to store for rank: 0
2022-11-29T09:56:02.8597275Z 2022-11-29 09:56:02,859 - Added key: store_based_barrier_key:1 to store for rank: 1
2022-11-29T09:56:02.8598590Z 2022-11-29 09:56:02,859 - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-29T09:56:02.8599358Z [info] world_size: 2
2022-11-29T09:56:02.8606836Z train_files: 4
2022-11-29T09:56:02.8607325Z val_files: 2
2022-11-29T09:56:02.8700912Z 2022-11-29 09:56:02,869 - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-29T09:56:02.8701745Z [info] world_size: 2
2022-11-29T09:56:02.8710618Z train_files: 4
2022-11-29T09:56:02.8711083Z val_files: 2
2022-11-29T09:56:03.0772150Z Traceback (most recent call last):
2022-11-29T09:56:03.0773100Z File "/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/scripts/train.py", line 439, in <module>
2022-11-29T09:56:03.0773895Z fire.Fire()
2022-11-29T09:56:03.0774939Z File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
2022-11-29T09:56:03.0775874Z component_trace = _Fire(component, args, parsed_flag_args, context, name)
2022-11-29T09:56:03.0777020Z File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
2022-11-29T09:56:03.0778306Z component, remaining_args = _CallAndUpdateTrace(
2022-11-29T09:56:03.0779446Z File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
2022-11-29T09:56:03.0780260Z component = fn(*varargs, **kwargs)
2022-11-29T09:56:03.0781092Z File "/tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/scripts/train.py", line 164, in run
2022-11-29T09:56:03.0781833Z model = model.to(device)
2022-11-29T09:56:03.0782855Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 987, in to
2022-11-29T09:56:03.0783639Z return self._apply(convert)
2022-11-29T09:56:03.0784675Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
2022-11-29T09:56:03.0785424Z module._apply(fn)
2022-11-29T09:56:03.0786450Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
2022-11-29T09:56:03.0787180Z module._apply(fn)
2022-11-29T09:56:03.0788169Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 662, in _apply
2022-11-29T09:56:03.0788916Z param_applied = fn(param)
2022-11-29T09:56:03.0789961Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 985, in convert
2022-11-29T09:56:03.0790923Z return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2022-11-29T09:56:03.0791714Z RuntimeError: CUDA error: out of memory
2022-11-29T09:56:03.0792658Z CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
2022-11-29T09:56:03.0793619Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2022-11-29T09:56:04.6903189Z num_epochs 4
2022-11-29T09:56:04.6903822Z num_epochs_per_validation 2
2022-11-29T09:56:07.3912570Z WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 47355 closing signal SIGTERM
2022-11-29T09:56:07.6058014Z ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 47356) of binary: /opt/conda/bin/python
2022-11-29T09:56:07.6161104Z Traceback (most recent call last):
2022-11-29T09:56:07.6161909Z File "/opt/conda/bin/torchrun", line 8, in <module>
2022-11-29T09:56:07.6162564Z sys.exit(main())
2022-11-29T09:56:07.6164129Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
2022-11-29T09:56:07.6164996Z return f(*args, **kwargs)
2022-11-29T09:56:07.6166038Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
2022-11-29T09:56:07.6166755Z run(args)
2022-11-29T09:56:07.6167697Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
2022-11-29T09:56:07.6168423Z elastic_launch(
2022-11-29T09:56:07.6169810Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
2022-11-29T09:56:07.6170768Z return launch_agent(self._config, self._entrypoint, list(args))
2022-11-29T09:56:07.6172000Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
2022-11-29T09:56:07.6172819Z raise ChildFailedError(
2022-11-29T09:56:07.6173768Z torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
2022-11-29T09:56:07.6174705Z ============================================================
2022-11-29T09:56:07.6175466Z /tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override_None/scripts/train.py FAILED
2022-11-29T09:56:07.6176532Z ------------------------------------------------------------
2022-11-29T09:56:07.6177140Z Failures:
2022-11-29T09:56:07.6177785Z <NO_OTHER_FAILURES>
2022-11-29T09:56:07.6178610Z ------------------------------------------------------------
2022-11-29T09:56:07.6179296Z Root Cause (first observed failure):
2022-11-29T09:56:07.6179799Z [0]:
2022-11-29T09:56:07.6180401Z time : 2022-11-29_09:56:07
2022-11-29T09:56:07.6180911Z host : 16e7a27492ca
2022-11-29T09:56:07.6181415Z rank : 1 (local_rank: 1)
2022-11-29T09:56:07.6181959Z exitcode : 1 (pid: 47356)
2022-11-29T09:56:07.6182475Z error_file: <N/A>
2022-11-29T09:56:07.6183268Z traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
2022-11-29T09:56:07.6184088Z ============================================================
2022-11-29T09:56:07.8314873Z
2022-11-29T09:56:07.8317037Z /tmp/tmpe03g0p2o/workdir/segresnet2d_0_override does not contain /tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override/algo_object.pkl.Now override the Algo output_path with: /tmp/tmpe03g0p2o/workdir2/segresnet2d_0_override
2022-11-29T09:56:08.1864414Z �[1m�[31mRunning trial code without runtime. Please check the tutorial if you are new to NNI: �[33mhttps://nni.readthedocs.io/en/stable/tutorials/hpo_quickstart_pytorch/main.html�[0m
2022-11-29T09:56:08.1865740Z EFinished test: test_run_algo_after_move_files (tests.test_auto3dseg_hpo.TestHPO) (19.5s)
2022-11-29T09:56:08.1866711Z Starting test: test_run_optuna (tests.test_auto3dseg_hpo.TestHPO)...
2022-11-29T09:56:08.1867205Z
2022-11-29T09:56:09.3253527Z 0%| | 0/12 [00:00<?, ?it/s]
2022-11-29T09:56:09.5007141Z 8%|▊ | 1/12 [00:01<00:12, 1.14s/it]
2022-11-29T09:56:09.6147742Z 17%|█▋ | 2/12 [00:01<00:05, 1.75it/s]
2022-11-29T09:56:09.7417871Z 33%|███▎ | 4/12 [00:01<00:01, 4.01it/s]
2022-11-29T09:56:09.8645233Z 58%|█████▊ | 7/12 [00:01<00:00, 7.66it/s]
2022-11-29T09:56:10.3256176Z 83%|████████▎ | 10/12 [00:01<00:00, 11.08it/s]
2022-11-29T09:56:10.6777263Z 100%|██████████| 12/12 [00:02<00:00, 5.61it/s]
2022-11-29T09:56:10.6777845Z
2022-11-29T09:56:11.0249843Z algo_templates.tar.gz: 0.00B [00:00, ?B/s]
2022-11-29T09:56:11.0510174Z algo_templates.tar.gz: 3%|▎ | 8.00k/290k [00:00<00:12, 23.6kB/s]
2022-11-29T09:56:11.0515818Z 2022-11-29 09:56:11,051 - INFO - Downloaded: /tmp/tmp2xv7qsxd/algo_templates.tar.gz
2022-11-29T09:56:11.0517577Z 2022-11-29 09:56:11,051 - INFO - Expected md5 is None, skip md5 check for file /tmp/tmp2xv7qsxd/algo_templates.tar.gz.
2022-11-29T09:56:11.0519827Z 2022-11-29 09:56:11,051 - INFO - Writing into directory: /tmp/tmpwfj8ea33/workdir.
2022-11-29T09:56:11.6872359Z 2022-11-29 09:56:11,686 - INFO - /tmp/tmpwfj8ea33/workdir/segresnet2d_0
2022-11-29T09:56:12.1155475Z 2022-11-29 09:56:12,114 - INFO - /tmp/tmpwfj8ea33/workdir/segresnet2d_1
2022-11-29T09:56:12.5846868Z 2022-11-29 09:56:12,584 - INFO - /tmp/tmpwfj8ea33/workdir/dints_0
2022-11-29T09:56:13.0540510Z 2022-11-29 09:56:13,053 - INFO - /tmp/tmpwfj8ea33/workdir/dints_1
2022-11-29T09:56:13.4701130Z 2022-11-29 09:56:13,469 - INFO - /tmp/tmpwfj8ea33/workdir/swinunetr_0
2022-11-29T09:56:13.8861990Z 2022-11-29 09:56:13,885 - INFO - /tmp/tmpwfj8ea33/workdir/swinunetr_1
2022-11-29T09:56:14.3424143Z 2022-11-29 09:56:14,341 - INFO - /tmp/tmpwfj8ea33/workdir/segresnet_0
2022-11-29T09:56:14.7983048Z 2022-11-29 09:56:14,797 - INFO - /tmp/tmpwfj8ea33/workdir/segresnet_1
2022-11-29T09:56:14.9559007Z 2022-11-29 09:56:14,955 - INFO - /tmp/tmpwfj8ea33/workdir/segresnet2d_0_override
2022-11-29T09:56:14.9565366Z algo_templates.tar.gz: 296kB [00:00, 813kB/s]
2022-11-29T09:56:14.9566724Z �[32m[I 2022-11-29 09:56:14,956]�[0m A new study created in memory with name: no-name-3fd61c31-14c5-466c-b2d2-4e417c9f0423�[0m
2022-11-29T09:56:15.0677904Z 2022-11-29 09:56:15,067 - INFO - /tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001
2022-11-29T09:56:15.0682234Z 2022-11-29 09:56:15,067 - INFO - Launching: torchrun --nnodes=1 --nproc_per_node=2 /tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/scripts/train.py run --config_file='/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_infer.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_validate.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_train.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/network.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/hyper_parameters.yaml' --learning_rate=0.001
2022-11-29T09:56:16.8004670Z WARNING:torch.distributed.run:
2022-11-29T09:56:16.8005329Z *****************************************
2022-11-29T09:56:16.8006510Z Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
2022-11-29T09:56:16.8007590Z *****************************************
2022-11-29T09:56:22.2331343Z [info] number of GPUs: 2
2022-11-29T09:56:22.2331991Z [info] number of GPUs: 2
2022-11-29T09:56:22.2960001Z 2022-11-29 09:56:22,295 - Added key: store_based_barrier_key:1 to store for rank: 1
2022-11-29T09:56:22.2961832Z 2022-11-29 09:56:22,295 - Added key: store_based_barrier_key:1 to store for rank: 0
2022-11-29T09:56:22.2963169Z 2022-11-29 09:56:22,295 - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-29T09:56:22.2963950Z [info] world_size: 2
2022-11-29T09:56:22.2973657Z train_files: 4
2022-11-29T09:56:22.2974162Z val_files: 2
2022-11-29T09:56:22.3065064Z 2022-11-29 09:56:22,305 - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-29T09:56:22.3065837Z [info] world_size: 2
2022-11-29T09:56:22.3078168Z train_files: 4
2022-11-29T09:56:22.3078650Z val_files: 2
2022-11-29T09:56:22.5454617Z Traceback (most recent call last):
2022-11-29T09:56:22.5455751Z File "/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/scripts/train.py", line 439, in <module>
2022-11-29T09:56:22.5456728Z fire.Fire()
2022-11-29T09:56:22.5458071Z File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
2022-11-29T09:56:22.5459277Z component_trace = _Fire(component, args, parsed_flag_args, context, name)
2022-11-29T09:56:22.5461291Z File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
2022-11-29T09:56:22.5462139Z component, remaining_args = _CallAndUpdateTrace(
2022-11-29T09:56:22.5463292Z File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
2022-11-29T09:56:22.5464073Z component = fn(*varargs, **kwargs)
2022-11-29T09:56:22.5464952Z File "/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/scripts/train.py", line 164, in run
2022-11-29T09:56:22.5465797Z model = model.to(device)
2022-11-29T09:56:22.5466825Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 987, in to
2022-11-29T09:56:22.5467571Z return self._apply(convert)
2022-11-29T09:56:22.5468624Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
2022-11-29T09:56:22.5469578Z module._apply(fn)
2022-11-29T09:56:22.5470583Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
2022-11-29T09:56:22.5471536Z module._apply(fn)
2022-11-29T09:56:22.5473095Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 662, in _apply
2022-11-29T09:56:22.5473854Z param_applied = fn(param)
2022-11-29T09:56:22.5474931Z File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 985, in convert
2022-11-29T09:56:22.5475928Z return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2022-11-29T09:56:22.5476739Z RuntimeError: CUDA error: out of memory
2022-11-29T09:56:22.5477643Z CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
2022-11-29T09:56:22.5478610Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2022-11-29T09:56:24.1101037Z num_epochs 4
2022-11-29T09:56:24.1101691Z num_epochs_per_validation 2
2022-11-29T09:56:26.8273388Z WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 47502 closing signal SIGTERM
2022-11-29T09:56:27.0419095Z ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 47503) of binary: /opt/conda/bin/python
2022-11-29T09:56:27.0522809Z Traceback (most recent call last):
2022-11-29T09:56:27.0523538Z File "/opt/conda/bin/torchrun", line 8, in <module>
2022-11-29T09:56:27.0524174Z sys.exit(main())
2022-11-29T09:56:27.0525800Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
2022-11-29T09:56:27.0526671Z return f(*args, **kwargs)
2022-11-29T09:56:27.0527722Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
2022-11-29T09:56:27.0528449Z run(args)
2022-11-29T09:56:27.0529438Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
2022-11-29T09:56:27.0530160Z elastic_launch(
2022-11-29T09:56:27.0531236Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
2022-11-29T09:56:27.0532208Z return launch_agent(self._config, self._entrypoint, list(args))
2022-11-29T09:56:27.0533431Z File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
2022-11-29T09:56:27.0534271Z raise ChildFailedError(
2022-11-29T09:56:27.0535252Z torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
2022-11-29T09:56:27.0536184Z ============================================================
2022-11-29T09:56:27.0537020Z /tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/scripts/train.py FAILED
2022-11-29T09:56:27.0538800Z ------------------------------------------------------------
2022-11-29T09:56:27.0539448Z Failures:
2022-11-29T09:56:27.0539906Z <NO_OTHER_FAILURES>
2022-11-29T09:56:27.0540792Z ------------------------------------------------------------
2022-11-29T09:56:27.0541519Z Root Cause (first observed failure):
2022-11-29T09:56:27.0542423Z [0]:
2022-11-29T09:56:27.0543058Z time : 2022-11-29_09:56:26
2022-11-29T09:56:27.0543594Z host : 16e7a27492ca
2022-11-29T09:56:27.0544113Z rank : 1 (local_rank: 1)
2022-11-29T09:56:27.0544668Z exitcode : 1 (pid: 47503)
2022-11-29T09:56:27.0545204Z error_file: <N/A>
2022-11-29T09:56:27.0546004Z traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
2022-11-29T09:56:27.0546817Z ============================================================
2022-11-29T09:56:27.2555038Z �[33m[W 2022-11-29 09:56:27,251]�[0m Trial 0 failed because of the following error: CalledProcessError(1, ['torchrun', '--nnodes=1', '--nproc_per_node=2', '/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/scripts/train.py', 'run', "--config_file='/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_infer.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_validate.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_train.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/network.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/hyper_parameters.yaml'", '--learning_rate=0.001'])�[0m
2022-11-29T09:56:27.2557797Z Traceback (most recent call last):
2022-11-29T09:56:27.2558958Z File "/opt/conda/lib/python3.8/site-packages/optuna/study/_optimize.py", line 196, in _run_trial
2022-11-29T09:56:27.2559777Z value_or_values = func(trial)
2022-11-29T09:56:27.2560504Z File "/__w/MONAI/MONAI/monai/apps/auto3dseg/hpo_gen.py", line 333, in __call__
2022-11-29T09:56:27.2561361Z self.run_algo(obj_filename, output_folder, template_path)
2022-11-29T09:56:27.2562209Z File "/__w/MONAI/MONAI/monai/apps/auto3dseg/hpo_gen.py", line 395, in run_algo
2022-11-29T09:56:27.2562935Z self.algo.train(self.params)
2022-11-29T09:56:27.2563689Z File "/__w/MONAI/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 198, in train
2022-11-29T09:56:27.2564457Z return self._run_cmd(cmd, devices_info)
2022-11-29T09:56:27.2565214Z File "/__w/MONAI/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 185, in _run_cmd
2022-11-29T09:56:27.2566096Z normal_out = subprocess.run(cmd.split(), env=ps_environ, check=True)
2022-11-29T09:56:27.2566940Z File "/opt/conda/lib/python3.8/subprocess.py", line 516, in run
2022-11-29T09:56:27.2567731Z raise CalledProcessError(retcode, process.args,
2022-11-29T09:56:27.2572162Z subprocess.CalledProcessError: Command '['torchrun', '--nnodes=1', '--nproc_per_node=2', '/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/scripts/train.py', 'run', "--config_file='/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_infer.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_validate.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/transforms_train.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/network.yaml','/tmp/tmpwfj8ea33/optuna_test/segresnet2d_0_override_learning_rate_0.001/configs/hyper_parameters.yaml'", '--learning_rate=0.001']' returned non-zero exit status 1.
2022-11-29T09:56:27.2867415Z EFinished test: test_run_optuna (tests.test_auto3dseg_hpo.TestHPO) (19.4s)
run into this issue from time to time, would be great if the GPU memory footprint can be smaller cc @mingxin-zheng @Nic-Ma
Details