
Not running tests when more memory is needed than available. #1011

Merged — 4 commits into ROCm:develop on Feb 25, 2020

Conversation

daineAMD (Contributor)

Potentially resolves SWDEV-223558 and SWDEV-220984.

Summary

This allows a test to compute how much memory it will allocate on the device and compare that against how much device memory is currently free. If the required memory exceeds the free memory, the test succeeds immediately, and the run logs that the test was skipped.
This seems a little hacky, so any advice is appreciated.

Implementation

In utility.cpp, I use hipMemGetInfo to query how much device memory is free. In each test, I calculate how much memory is needed; if it exceeds the free memory, I call googletest's SUCCEED() with a unique message. In rocblas_gtest_main.cpp, a listener detects this message and records the test as skipped. A sketch of this flow is below.
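For concreteness, here is a minimal sketch of the described flow. The helper, macro, marker string, and listener names are hypothetical, modeled on the description above, not the exact identifiers from this PR:

```cpp
// Sketch only: names below are illustrative, not the PR's actual identifiers.
#include <cstring>
#include <gtest/gtest.h>
#include <hip/hip_runtime.h>

// Unique marker string that the listener searches for.
static const char* const SKIP_MESSAGE = "skipped: insufficient device memory";

// Query free device memory via hipMemGetInfo (as done in utility.cpp).
inline size_t query_free_device_memory()
{
    size_t free_mem = 0, total_mem = 0;
    return hipMemGetInfo(&free_mem, &total_mem) == hipSuccess ? free_mem : 0;
}

// Used inside a TEST body: succeed early with the marker message if the
// planned allocations would not fit in free device memory.
#define SKIP_IF_INSUFFICIENT_DEVICE_MEMORY(bytes_needed)   \
    do                                                     \
    {                                                      \
        if((bytes_needed) > query_free_device_memory())    \
        {                                                  \
            SUCCEED() << SKIP_MESSAGE;                     \
            return;                                        \
        }                                                  \
    } while(0)

// In the gtest main: a listener that counts tests emitting the marker, so a
// [ SKIPPED ] summary can be printed at tear-down.
class SkipListener : public ::testing::EmptyTestEventListener
{
public:
    int skipped = 0;

private:
    void OnTestPartResult(const ::testing::TestPartResult& result) override
    {
        if(result.type() == ::testing::TestPartResult::kSuccess
           && result.message() && std::strstr(result.message(), SKIP_MESSAGE))
            ++skipped;
    }
};
// Registered with:
//   ::testing::UnitTest::GetInstance()->listeners().Append(new SkipListener);
```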

Example Output

[----------] Global test environment tear-down
[==========] 13697 tests from 2 test cases ran. (18047 ms total)
[ PASSED ] 13697 tests.
[ SKIPPED ] 592 tests.

Alternatives

Alternatively, GTEST_SKIP() would be a more elegant solution; I believe it is available in a newer googletest release (1.10.0), while we are currently on 1.8.0, and I'm not sure whether updating would be an issue. A sketch of that alternative is below.
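For comparison, a minimal sketch of the GTEST_SKIP() alternative; it reuses the hypothetical query_free_device_memory() helper from the sketch above and assumes googletest >= 1.10.0:

```cpp
// Sketch only: the test name and byte count are placeholders.
TEST(gemm_batched, skips_when_memory_is_short)
{
    size_t bytes_needed = size_t(1) << 30; // size of planned device allocations
    if(bytes_needed > query_free_device_memory())
        GTEST_SKIP() << "insufficient device memory";
    // ... run the test as usual ...
}
```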

Notes

I only added the memory check for the functions noted in the two tickets. I also updated ger_batched and syr_batched to the new device_batch_vector containers.
I haven't run this on a V340L machine yet to confirm it works there, but it passed my manual testing. I will do so and update.

@saadrahim (Member)

This should be the standard for PRs for everyone, impressive.

@saadrahim (Member)

Please revert 42f8f01 to test on V340L.

@TorreZuk (Contributor) left a comment


Good that you cleaned out some old-style allocators, though that could have been a separate PR.

Comment on lines 65 to 66
hipError_t err = hipMemGetInfo(&free_mem, &total_mem);


What is the cost of this in terms of time? I'm wondering if it is feasible to just move this into the device* allocators and return a failure that is handled by CHECK_HIP_ERROR on device memory, rather than all these special cases.
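A rough sketch of what this suggestion might look like, assuming an illustrative container type rather than rocBLAS's actual device_vector API:

```cpp
// Sketch only: class and member names are illustrative. The free-memory
// check moves into the allocator, and tests just check the reported status.
#include <hip/hip_runtime.h>

template <typename T>
class device_vector_sketch
{
    T*         data_   = nullptr;
    hipError_t status_ = hipSuccess;

public:
    explicit device_vector_sketch(size_t n)
    {
        size_t free_mem = 0, total_mem = 0;
        hipMemGetInfo(&free_mem, &total_mem);
        if(n * sizeof(T) > free_mem)
            status_ = hipErrorOutOfMemory; // report, don't even try hipMalloc
        else
            status_ = hipMalloc(&data_, n * sizeof(T));
    }
    ~device_vector_sketch()
    {
        if(data_)
            hipFree(data_);
    }
    // Tests feed this into a CHECK_HIP_ERROR-style macro.
    hipError_t memcheck() const { return status_; }
    operator T*() const { return data_; }
};
```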


You could also replace the CHECK_HIP_ERROR calls with a CHECK_HIP_DEV_ALLOC macro, if letting hipMalloc fail without first checking free memory is acceptable. A sketch of such a macro is below.
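The CHECK_HIP_DEV_ALLOC name comes from the comment above, but the body here is an illustrative guess at the intent:

```cpp
// Sketch only: treat an out-of-memory allocation result as a skip (via
// SUCCEED() plus early return) instead of a hard test failure.
#define CHECK_HIP_DEV_ALLOC(error)                            \
    do                                                        \
    {                                                         \
        hipError_t err_ = (error);                            \
        if(err_ == hipErrorOutOfMemory)                       \
        {                                                     \
            SUCCEED() << "skipped: device allocation failed"; \
            return;                                           \
        }                                                     \
        ASSERT_EQ(err_, hipSuccess);                          \
    } while(0)
```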

@leekillough (Contributor) left a comment


Placeholder to give me time to review changes

@daineAMD (Contributor, Author)

I addressed Torre's comments and moved the checking code into the device*.hpp files instead of individual tests, since the check takes only a few microseconds; this also makes it generic for future tests.
I added a new macro alongside CHECK_HIP_ERROR to perform this check. In the future, I think we could instead update CHECK_HIP_ERROR itself and create a separate macro for CPU-side testing and the like. A sketch of how a test might use such a check is below.
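To illustrate, here is how a test might use the allocation check once it lives in the device-side containers. This reuses the hypothetical device_vector_sketch and CHECK_HIP_DEV_ALLOC names from the sketches above, not the PR's final macro name:

```cpp
// Sketch only: the container and macro are the hypothetical ones above.
TEST(syr_batched, allocation_checked_example)
{
    device_vector_sketch<float> dx(1 << 24); // device memory for the test
    CHECK_HIP_DEV_ALLOC(dx.memcheck());      // succeeds-and-skips if memory is short
    // ... invoke the rocBLAS function under test and verify results ...
}
```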

daineAMD merged commit 5a3b468 into ROCm:develop on Feb 25, 2020.
eidenyoshida pushed a commit to eidenyoshida/rocBLAS that referenced this pull request on Feb 28, 2020.
ROCmMathLibrariesBot pushed a commit that referenced this pull request on May 11, 2023: merge staging 2852646 into master on Conditional GO from CQE #1011.