Describe the bug
The multi-GPU test fails randomly with a CUDA out-of-memory error:
tests/test_cumulative_average_dist.py
2023-01-04 21:15:57,236 - Added key: store_based_barrier_key:1 to store for rank: 1
2023-01-04 21:15:57,246 - Added key: store_based_barrier_key:1 to store for rank: 0
2023-01-04 21:15:57,246 - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-01-04 21:15:57,246 - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Process SpawnProcess-2:
Traceback (most recent call last):
File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/__w/MONAI/MONAI/tests/utils.py", line 489, in run_process
raise e
File "/__w/MONAI/MONAI/tests/utils.py", line 480, in run_process
func(*args, **kwargs)
File "/__w/MONAI/MONAI/tests/utils.py", line 648, in _call_original_func
return f(*args, **kwargs)
File "/__w/MONAI/MONAI/tests/test_cumulative_average_dist.py", line 38, in test_value
val = torch.as_tensor(rank + i, device=device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.