Clean up CUDA state between tests #2296
Jenkins build for acea51aab3ae9c443b19a072ff7fa8791afe58a6 commit finished as FAILURE
Jenkins build for b00cb058f98e7a9c06f7c3f1a556e2ff2ff2cc7f commit finished as FAILURE
Jenkins build for 7bff9c3008b162b45b6a53b96fdd3d3d1c2405cf commit finished as FAILURE
Jenkins build for 7bff9c3008b162b45b6a53b96fdd3d3d1c2405cf commit is in progress
Merged fc804c3 into ROCm:rocm7.0_internal_testing
This PR fixes the following unit test failure:
```
test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction FAILED [0.1163s]
Traceback (most recent call last):
File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```
This error is specific to gfx1101 arch.
The error comes from an integer overflow: another unit test,
`test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel`,
creates a tensor with a huge numel, which leaves behind an inflated
`torch.cuda.max_memory_reserved()` value. When
`test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction`
runs afterward, that stale value corrupts its size computation. To avoid
this, we introduced `torch.cuda.empty_cache()` and
`torch.cuda.reset_peak_memory_stats()` to clean up CUDA state between tests.
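To see why a stale peak-memory counter produces a negative tensor dimension, here is a minimal sketch of the arithmetic the failing test performs. The numbers are purely illustrative (not taken from an actual gfx1101 run), and the formula is a simplified stand-in for the test's real size computation:

```python
# Hypothetical sketch: test_set_per_process_memory_fraction sizes its
# probe tensor roughly as fraction * total_memory minus the peak
# reserved memory. All numbers below are made up for illustration.
total_memory = 8 * 1024**3          # pretend the device has 8 GiB
fraction = 0.5                      # memory fraction set by the test
stale_max_reserved = 9_976_226_816  # huge peak left behind by the earlier
                                    # test_randint_generation_for_large_numel

application = int(total_memory * fraction) - stale_max_reserved
print(application)  # negative, so torch.empty(application) would raise
                    # "Trying to create tensor with negative dimension"
```

Calling `torch.cuda.empty_cache()` and `torch.cuda.reset_peak_memory_stats()` between tests releases the cached blocks and resets the peak counter to the current usage, so the computed size stays positive.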
JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295
(cherry picked from commit fc804c3)