Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GTest Event Listener with CUDA validation after TEST #2516

Merged
merged 4 commits into from
Dec 3, 2020

Conversation

klecki
Copy link
Contributor

@klecki klecki commented Dec 1, 2020

Signed-off-by: Krzysztof Lecki klecki@nvidia.com

Why we need this PR?

  • It fixes a bug/It adds new feature: Automatic error checking for CUDA after each Test Case.

What happened in this PR?

  • What solution was applied:
    Used event listener to allow per test case error checking without the need to modify code.
  • Affected modules and functionalities:
    GTest
  • Key points relevant for the review:
    N/A
  • Validation and testing:
    CI, adjustment in the testing framework
  • Documentation (including examples):
    N/A

JIRA TASK: [Use DALI-1724 or NA]

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Copy link
Contributor Author

klecki commented Dec 1, 2020

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1850302]: BUILD STARTED

*/
class CudaFinalizeEventListener : public ::testing::EmptyTestEventListener {
void OnTestEnd(const ::testing::TestInfo& test_info) override {
auto sync_result = cudaDeviceSynchronize();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sync_result is always cudaSuccess IIRC

Also I'm wondering, could this actually catch the error from another thread or this synchronization secures this scenario

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs is pretty clear:

Blocks until the device has completed all preceding requested tasks. cudaDeviceSynchronize() returns an error if one of the preceding tasks has failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It theoretically can catch error from other thread (apparently use of threading is on by default), but this sync should limit the leaking of error to something close. And I observed a scenario when cudaDeviceSynchronize returned success when there was launch error and a case where both calls failed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the error is reported only once. If the launch error appeared then it won't be reported again. @mzient

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forced a kernel launch failure and cudaDeviceSynchronize didn't report it, but the cudaGetLastError did (and this clears it for further calls).

class CudaFinalizeEventListener : public ::testing::EmptyTestEventListener {
void OnTestEnd(const ::testing::TestInfo& test_info) override {
auto sync_result = cudaDeviceSynchronize();
EXPECT_EQ(sync_result, cudaSuccess) << "CUDA error: \"" << cudaGetErrorName(sync_result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add also OnTestStart and check cudaGetLastError there also, if by any chance it's not cudaSuccess

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would yield maybe an early exit in some cases, but still, the error will be propagated to the point where it is checked - so either the test will do it explicitly (some does) or it will be checked after.
I will leave the error checking in one point only.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I had in mind is that with this some race conditions or something like that, that aren't easy to extract a single unit test that break, may be easier to debug. I'm fine either way

@szalpal szalpal self-assigned this Dec 1, 2020
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1850302]: BUILD FAILED

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Copy link
Contributor Author

klecki commented Dec 2, 2020

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1852620]: BUILD STARTED

*/
class CudaFinalizeEventListener : public ::testing::EmptyTestEventListener {
void OnTestEnd(const ::testing::TestInfo& test_info) override {
if (std::strstr(test_info.test_suite_name(), "CpuOnlyTest") == nullptr) {
Copy link
Contributor

@JanuszL JanuszL Dec 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can do something like:

(cuDeviceGet(&dummy, 0) != CUDA_ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND)

to make sure that driver is accessible. I'm just afraid that CpuOnlyTest asking every test without GPU to stick to this name patter maybe too much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the pattern from our tests that assumes that substring.

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1852620]: BUILD FAILED

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Copy link
Contributor Author

klecki commented Dec 2, 2020

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1854128]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1854128]: BUILD FAILED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1854128]: BUILD PASSED

*/
class CudaFinalizeEventListener : public ::testing::EmptyTestEventListener {
void OnTestEnd(const ::testing::TestInfo& test_info) override {
if (cuInitChecked()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add a comment why this is tested here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Copy link
Contributor Author

klecki commented Dec 3, 2020

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1857925]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1857925]: BUILD PASSED

@klecki klecki merged commit 1f9329a into NVIDIA:master Dec 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants