-
Notifications
You must be signed in to change notification settings - Fork 609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GTest Event Listener with CUDA validation after TEST #2516
Conversation
672d4a5
to
ac77b50
Compare
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
ac77b50
to
7e8ce3d
Compare
!build |
CI MESSAGE: [1850302]: BUILD STARTED |
dali/test/dali_cuda_finalize_test.h
Outdated
*/ | ||
class CudaFinalizeEventListener : public ::testing::EmptyTestEventListener { | ||
void OnTestEnd(const ::testing::TestInfo& test_info) override { | ||
auto sync_result = cudaDeviceSynchronize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sync_result
is always cudaSuccess
IIRC
Also I'm wondering, could this actually catch the error from another thread or this synchronization secures this scenario
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs is pretty clear:
Blocks until the device has completed all preceding requested tasks. cudaDeviceSynchronize() returns an error if one of the preceding tasks has failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It theoretically can catch error from other thread (apparently use of threading is on by default), but this sync should limit the leaking of error to something close. And I observed a scenario when cudaDeviceSynchronize returned success when there was launch error and a case where both calls failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the error is reported only once. If the launch error appeared then it won't be reported again. @mzient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forced a kernel launch failure and cudaDeviceSynchronize didn't report it, but the cudaGetLastError did (and this clears it for further calls).
dali/test/dali_cuda_finalize_test.h
Outdated
class CudaFinalizeEventListener : public ::testing::EmptyTestEventListener { | ||
void OnTestEnd(const ::testing::TestInfo& test_info) override { | ||
auto sync_result = cudaDeviceSynchronize(); | ||
EXPECT_EQ(sync_result, cudaSuccess) << "CUDA error: \"" << cudaGetErrorName(sync_result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add also OnTestStart
and check cudaGetLastError
there also, if by any chance it's not cudaSuccess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would yield maybe an early exit in some cases, but still, the error will be propagated to the point where it is checked - so either the test will do it explicitly (some does) or it will be checked after.
I will leave the error checking in one point only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I had in mind is that with this some race conditions or something like that, that aren't easy to extract a single unit test that break, may be easier to debug. I'm fine either way
CI MESSAGE: [1850302]: BUILD FAILED |
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
!build |
CI MESSAGE: [1852620]: BUILD STARTED |
dali/test/dali_cuda_finalize_test.h
Outdated
*/ | ||
class CudaFinalizeEventListener : public ::testing::EmptyTestEventListener { | ||
void OnTestEnd(const ::testing::TestInfo& test_info) override { | ||
if (std::strstr(test_info.test_suite_name(), "CpuOnlyTest") == nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can do something like:
(cuDeviceGet(&dummy, 0) != CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND)
to make sure that driver is accessible. I'm just afraid that CpuOnlyTest
asking every test without GPU to stick to this name patter maybe too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed the pattern from our tests that assumes that substring.
CI MESSAGE: [1852620]: BUILD FAILED |
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
3846b12
to
61dd648
Compare
!build |
CI MESSAGE: [1854128]: BUILD STARTED |
CI MESSAGE: [1854128]: BUILD FAILED |
CI MESSAGE: [1854128]: BUILD PASSED |
*/ | ||
class CudaFinalizeEventListener : public ::testing::EmptyTestEventListener { | ||
void OnTestEnd(const ::testing::TestInfo& test_info) override { | ||
if (cuInitChecked()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add a comment why this is tested here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
!build |
CI MESSAGE: [1857925]: BUILD STARTED |
CI MESSAGE: [1857925]: BUILD PASSED |
Signed-off-by: Krzysztof Lecki klecki@nvidia.com
Why we need this PR?
What happened in this PR?
Used event listener to allow per test case error checking without the need to modify code.
GTest
N/A
CI, adjustment in the testing framework
N/A
JIRA TASK: [Use DALI-1724 or NA]