
Conversation

rraminen commented Jun 26, 2025

This PR fixes the following unit test failure:

```
test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction FAILED [0.1163s]

Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
    tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```

This error is specific to the gfx1101 arch.
It is caused by an integer overflow: an earlier unit test, `test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel`, creates a tensor with a huge numel, which leaves `torch.cuda.max_memory_reserved()` inflated; when `test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction` runs afterward, the allocation size it derives from that value overflows and goes negative. To avoid this, we added `torch.cuda.empty_cache()` and `torch.cuda.reset_peak_memory_stats()` to clean up the CUDA memory state, as sketched below.
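
A minimal sketch of the fix, assuming the test derives its allocation size from `torch.cuda.max_memory_reserved()` roughly as below; the `0.5` fraction and `0.499` multiplier are illustrative, not copied from the test:

```python
import torch

# Release cached allocator blocks left over from earlier tests and clear the
# peak-memory counters, so max_memory_reserved() reflects only this test.
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

# Cap this process at a fraction of device 0's memory, as the test exercises.
torch.cuda.set_per_process_memory_fraction(0.5, 0)
total_memory = torch.cuda.get_device_properties(0).total_memory

# Without the reset above, a huge max_memory_reserved() inherited from
# test_randint_generation_for_large_numel can push this below zero,
# producing the "negative dimension" RuntimeError in the traceback.
application = int(total_memory * 0.499) - torch.cuda.max_memory_reserved()
tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
```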

JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295


rocm-repo-management-api bot commented Jun 26, 2025

Jenkins build for acea51aab3ae9c443b19a072ff7fa8791afe58a6 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts


rocm-repo-management-api bot commented Jul 1, 2025

Jenkins build for acea51aab3ae9c443b19a072ff7fa8791afe58a6 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts


rocm-repo-management-api bot commented Jul 8, 2025

Jenkins build for b00cb058f98e7a9c06f7c3f1a556e2ff2ff2cc7f commit finished as FAILURE
Links: Blue Ocean view / Build artifacts


rocm-repo-management-api bot commented Jul 9, 2025

Jenkins build for 7bff9c3008b162b45b6a53b96fdd3d3d1c2405cf commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api bot commented

Jenkins build for 7bff9c3008b162b45b6a53b96fdd3d3d1c2405cf commit is in progress
Links: Blue Ocean view / Build artifacts

@pruthvistony pruthvistony merged commit fc804c3 into ROCm:rocm7.0_internal_testing Jul 14, 2025
0 of 2 checks passed
pragupta pushed a commit that referenced this pull request Oct 29, 2025
This PR fixes the following unit test failure:
```
test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction FAILED [0.1163s]

Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
    tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```
This error is specific to the gfx1101 arch.
It is caused by an integer overflow: an earlier unit test,
`test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel`,
creates a tensor with a huge numel, which leaves
`torch.cuda.max_memory_reserved()` inflated; when
`test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction`
runs afterward, the allocation size it derives from that value
overflows and goes negative. To avoid this, we added
`torch.cuda.empty_cache()` and `torch.cuda.reset_peak_memory_stats()`
to clean up the CUDA memory state.

JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295
(cherry picked from commit fc804c3)