Clean up CUDA state between tests #2296
Jenkins build for acea51aab3ae9c443b19a072ff7fa8791afe58a6 commit finished as FAILURE
Jenkins build for b00cb058f98e7a9c06f7c3f1a556e2ff2ff2cc7f commit finished as FAILURE
Jenkins build for 7bff9c3008b162b45b6a53b96fdd3d3d1c2405cf commit finished as FAILURE
Jenkins build for 7bff9c3008b162b45b6a53b96fdd3d3d1c2405cf commit is in progress
Merged fc804c3 into ROCm:rocm7.0_internal_testing
This PR fixes the following unit test failure:
```
test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction FAILED [0.1163s]
Traceback (most recent call last):
File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```
This error is specific to gfx1101 arch.
The error comes from an integer overflow: another unit test,
`test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel`,
creates a tensor with a huge numel, which leaves behind an inflated
`torch.cuda.max_memory_reserved()` value. When
`test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction`
runs afterward, that stale value corrupts its size computation. To avoid
this, we introduced `torch.cuda.empty_cache()` and
`torch.cuda.reset_peak_memory_stats()` to clean up CUDA state between tests.
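To see why a stale peak-memory counter produces a negative tensor dimension, here is a minimal sketch of the arithmetic the failing test performs. The numbers are purely illustrative (not taken from an actual gfx1101 run), and the formula is a simplified stand-in for the test's real size computation:

```python
# Hypothetical sketch: test_set_per_process_memory_fraction sizes its
# probe tensor roughly as fraction * total_memory minus the peak
# reserved memory. All numbers below are made up for illustration.
total_memory = 8 * 1024**3          # pretend the device has 8 GiB
fraction = 0.5                      # memory fraction set by the test
stale_max_reserved = 9_976_226_816  # huge peak left behind by the earlier
                                    # test_randint_generation_for_large_numel

application = int(total_memory * fraction) - stale_max_reserved
print(application)  # negative, so torch.empty(application) would raise
                    # "Trying to create tensor with negative dimension"
```

Calling `torch.cuda.empty_cache()` and `torch.cuda.reset_peak_memory_stats()` between tests releases the cached blocks and resets the peak counter to the current usage, so the computed size stays positive.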
JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295
(cherry picked from commit fc804c3)