
Not running tests when more memory is needed than available. #1011

Merged — 4 commits into ROCm:develop on Feb 25, 2020

Conversation

daineAMD (Contributor)

Potentially resolves SWDEV-223558 and SWDEV-220984.

Summary

This allows a test to compute how much memory it will allocate on the device and compare that against how much device memory is currently free. If the required memory exceeds the free memory, the test succeeds immediately, and the run logs that the test was skipped.
This seems a little hacky, so any advice is appreciated.

Implementation

In utility.cpp, I use hipMemGetInfo to query how much device memory is free. In each test, I calculate how much memory is needed; if it exceeds the free memory, I call googletest's SUCCEED() with a unique message. In rocblas_gtest_main.cpp, a listener detects this message and records the test as skipped. A sketch of this flow is below.
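For concreteness, here is a minimal sketch of the described flow. The helper, macro, marker string, and listener names are hypothetical, modeled on the description above, not the exact identifiers from this PR:

```cpp
// Sketch only: names below are illustrative, not the PR's actual identifiers.
#include <cstring>
#include <gtest/gtest.h>
#include <hip/hip_runtime.h>

// Unique marker string that the listener searches for.
static const char* const SKIP_MESSAGE = "skipped: insufficient device memory";

// Query free device memory via hipMemGetInfo (as done in utility.cpp).
inline size_t query_free_device_memory()
{
    size_t free_mem = 0, total_mem = 0;
    return hipMemGetInfo(&free_mem, &total_mem) == hipSuccess ? free_mem : 0;
}

// Used inside a TEST body: succeed early with the marker message if the
// planned allocations would not fit in free device memory.
#define SKIP_IF_INSUFFICIENT_DEVICE_MEMORY(bytes_needed)   \
    do                                                     \
    {                                                      \
        if((bytes_needed) > query_free_device_memory())    \
        {                                                  \
            SUCCEED() << SKIP_MESSAGE;                     \
            return;                                        \
        }                                                  \
    } while(0)

// In the gtest main: a listener that counts tests emitting the marker, so a
// [ SKIPPED ] summary can be printed at tear-down.
class SkipListener : public ::testing::EmptyTestEventListener
{
public:
    int skipped = 0;

private:
    void OnTestPartResult(const ::testing::TestPartResult& result) override
    {
        if(result.type() == ::testing::TestPartResult::kSuccess
           && result.message() && std::strstr(result.message(), SKIP_MESSAGE))
            ++skipped;
    }
};
// Registered with:
//   ::testing::UnitTest::GetInstance()->listeners().Append(new SkipListener);
```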

Example Output

[----------] Global test environment tear-down
[==========] 13697 tests from 2 test cases ran. (18047 ms total)
[ PASSED ] 13697 tests.
[ SKIPPED ] 592 tests.

Alternatives

Alternatively, GTEST_SKIP() would be a more elegant solution; I believe it is available in a newer googletest release (1.10.0), while we are currently on 1.8.0, and I'm not sure whether updating would be an issue. A sketch of that alternative is below.
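For comparison, a minimal sketch of the GTEST_SKIP() alternative; it reuses the hypothetical query_free_device_memory() helper from the sketch above and assumes googletest >= 1.10.0:

```cpp
// Sketch only: the test name and byte count are placeholders.
TEST(gemm_batched, skips_when_memory_is_short)
{
    size_t bytes_needed = size_t(1) << 30; // size of planned device allocations
    if(bytes_needed > query_free_device_memory())
        GTEST_SKIP() << "insufficient device memory";
    // ... run the test as usual ...
}
```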

Notes

I only added the memory check for the functions noted in the two tickets. I also updated ger_batched and syr_batched to the new device_batch_vector containers.
I haven't run this on a V340L machine yet to confirm it works there, but it passed my manual testing. I will do so and update.

@saadrahim (Member)

This should be the standard for PRs for everyone, impressive.

@saadrahim (Member)

Please revert 42f8f01 to test on V340L.

@TorreZuk (Contributor) left a comment


Good that you cleaned out some old-style allocators, though that could have been a separate PR.

Comment on lines 65 to 66
hipError_t err = hipMemGetInfo(&free_mem, &total_mem);


What is the cost of this in terms of time? I'm wondering if it is feasible to just move this into the device* allocators and return a failure that is handled by CHECK_HIP_ERROR on device memory, rather than all these special cases.
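A rough sketch of what this suggestion might look like, assuming an illustrative container type rather than rocBLAS's actual device_vector API:

```cpp
// Sketch only: class and member names are illustrative. The free-memory
// check moves into the allocator, and tests just check the reported status.
#include <hip/hip_runtime.h>

template <typename T>
class device_vector_sketch
{
    T*         data_   = nullptr;
    hipError_t status_ = hipSuccess;

public:
    explicit device_vector_sketch(size_t n)
    {
        size_t free_mem = 0, total_mem = 0;
        hipMemGetInfo(&free_mem, &total_mem);
        if(n * sizeof(T) > free_mem)
            status_ = hipErrorOutOfMemory; // report, don't even try hipMalloc
        else
            status_ = hipMalloc(&data_, n * sizeof(T));
    }
    ~device_vector_sketch()
    {
        if(data_)
            hipFree(data_);
    }
    // Tests feed this into a CHECK_HIP_ERROR-style macro.
    hipError_t memcheck() const { return status_; }
    operator T*() const { return data_; }
};
```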


You could also replace the CHECK_HIP_ERROR calls with a CHECK_HIP_DEV_ALLOC macro, if letting hipMalloc fail without first checking free memory is acceptable. A sketch of such a macro is below.
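The CHECK_HIP_DEV_ALLOC name comes from the comment above, but the body here is an illustrative guess at the intent:

```cpp
// Sketch only: treat an out-of-memory allocation result as a skip (via
// SUCCEED() plus early return) instead of a hard test failure.
#define CHECK_HIP_DEV_ALLOC(error)                            \
    do                                                        \
    {                                                         \
        hipError_t err_ = (error);                            \
        if(err_ == hipErrorOutOfMemory)                       \
        {                                                     \
            SUCCEED() << "skipped: device allocation failed"; \
            return;                                           \
        }                                                     \
        ASSERT_EQ(err_, hipSuccess);                          \
    } while(0)
```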

@leekillough (Contributor) left a comment


Placeholder to give me time to review changes

@daineAMD (Contributor, Author)

I addressed Torre's comments and moved the checking code into the device*.hpp files instead of individual tests, since the check takes only a few microseconds; this also makes it generic for future tests.
I added a new macro alongside CHECK_HIP_ERROR to perform this check. In the future, I think we could instead update CHECK_HIP_ERROR itself and create a separate macro for CPU-side testing and the like. A sketch of how a test might use such a check is below.
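To illustrate, here is how a test might use the allocation check once it lives in the device-side containers. This reuses the hypothetical device_vector_sketch and CHECK_HIP_DEV_ALLOC names from the sketches above, not the PR's final macro name:

```cpp
// Sketch only: the container and macro are the hypothetical ones above.
TEST(syr_batched, allocation_checked_example)
{
    device_vector_sketch<float> dx(1 << 24); // device memory for the test
    CHECK_HIP_DEV_ALLOC(dx.memcheck());      // succeeds-and-skips if memory is short
    // ... invoke the rocBLAS function under test and verify results ...
}
```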

daineAMD merged commit 5a3b468 into ROCm:develop on Feb 25, 2020.
eidenyoshida pushed a commit to eidenyoshida/rocBLAS that referenced this pull request on Feb 28, 2020.
ROCmMathLibrariesBot pushed a commit that referenced this pull request on May 11, 2023: merge staging 2852646 into master on Conditional GO from CQE #1011.