Rework DeviceGuard to restore original context upon the exit #882
CI MESSAGE: [739710]: BUILD STARTED
CI MESSAGE: [739710]: BUILD FAILED
CI MESSAGE: [741836]: BUILD STARTED
CI MESSAGE: [741836]: BUILD FAILED
CI MESSAGE: [742169]: BUILD STARTED
CI MESSAGE: [742169]: BUILD PASSED
dali/util/cucontext.h
Outdated
@@ -22,7 +22,7 @@ namespace dali {
 class CUContext {
  public:
   CUContext();
-  explicit CUContext(CUdevice device, unsigned int flags = 0);
+  explicit CUContext(int device_id_, unsigned int flags = 0);
I don't like it. You're mixing runtime and driver APIs at CUContext API level. It's very likely to cause trouble unless CUdevice and device ordinal from CUDA runtime are the same (I don't think they are).
Removing CUContext as it is no longer needed.
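To illustrate the concern about mixing runtime ordinals with driver device handles, here is a minimal sketch that keeps the two in separate types so the translation has to be explicit. Every name in it (`RuntimeOrdinal`, `DriverDevice`, `ToDriverDevice`, `FakeContext`, the handle table) is hypothetical and not part of DALI or CUDA; in real code the translation step would go through the driver's `cuDeviceGet`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical strong types: neither converts implicitly to the other or to int.
struct RuntimeOrdinal { int value; };
struct DriverDevice { int handle; };

// Stand-in for the driver's device table. The handle values are arbitrary and
// only show that a handle need not be numerically equal to the ordinal.
inline const std::vector<int> &FakeDriverHandles() {
  static const std::vector<int> handles = {100, 101};
  return handles;
}

// Explicit translation from a runtime ordinal to a driver handle.
// Real code would call cuDeviceGet here; this sketch uses the fake table.
inline DriverDevice ToDriverDevice(RuntimeOrdinal ordinal) {
  const std::vector<int> &handles = FakeDriverHandles();
  if (ordinal.value < 0 ||
      static_cast<std::size_t>(ordinal.value) >= handles.size()) {
    throw std::out_of_range("no such device ordinal");
  }
  return DriverDevice{handles[static_cast<std::size_t>(ordinal.value)]};
}

// Hypothetical context wrapper that accepts only a driver handle,
// so a bare runtime ordinal cannot be passed by accident.
class FakeContext {
 public:
  explicit FakeContext(DriverDevice device) : device_(device) {}
  int device_handle() const { return device_.handle; }

 private:
  DriverDevice device_;
};
```

With this shape, `FakeContext ctx(ToDriverDevice(RuntimeOrdinal{1}));` compiles, while `FakeContext ctx(1);` does not.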
!build
CI MESSAGE: [759101]: BUILD STARTED
CI MESSAGE: [759101]: BUILD FAILED
CI MESSAGE: [759125]: BUILD STARTED
CI MESSAGE: [759125]: BUILD PASSED
}
CUDA_CALL(cudaEventDestroy(master_event_));
NVJPEG_CALL(nvjpegDestroy(handle_));
} catch (...) {
Everything we throw inherits from std::exception, so we can leverage it and print some meaningful diagnostic before aborting.
- } catch (...) {
+ } catch (const std::exception &e) {
+ std::cerr << "Fatal error: exception in ~nvJPEGDecoder():\n" << e.what() << std::endl;
Done
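For reference, a minimal sketch of the pattern suggested above: a destructor that catches `std::exception` and reports `what()` instead of swallowing everything with `catch (...)`. `FakeDecoder` and `FailingCleanup` are hypothetical stand-ins, not the real nvJPEGDecoder code, and the log stream is injected only so the behaviour can be exercised; the real destructor writes to `std::cerr` and then aborts.

```cpp
#include <cassert>
#include <exception>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical cleanup step standing in for a CUDA_CALL-style macro
// that throws when the underlying call reports an error.
inline void FailingCleanup() {
  throw std::runtime_error("cudaEventDestroy failed");
}

// Hypothetical class; only the destructor's error handling matters here.
class FakeDecoder {
 public:
  explicit FakeDecoder(std::ostream &log) : log_(log) {}

  ~FakeDecoder() {
    try {
      FailingCleanup();
    } catch (const std::exception &e) {
      // Print a meaningful diagnostic. The real code would abort afterwards;
      // that step is left out so the sketch can run to completion.
      log_ << "Fatal error: exception in ~FakeDecoder():\n"
           << e.what() << std::endl;
    }
  }

 private:
  std::ostream &log_;
};
```

The exception must not escape the destructor either way; the only difference from `catch (...)` is that the message is available for the diagnostic.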
}
#pragma GCC diagnostic pop
Remove?
Done
include/dali/core/dynlink_cuda.h
Outdated
@@ -1886,9 +1886,9 @@ extern tcuMemcpy *cuMemcpy;
extern tcuMemcpyPeer *cuMemcpyPeer;
#endif

extern tcuCtxCreate *cuCtxCreate;
Please move it to where other Ctx functions are.
Done
!build
CI MESSAGE: [761644]: BUILD STARTED
CI MESSAGE: [761644]: BUILD PASSED
dali/core/device_guard_test.cc
Outdated
int count = 1;

cuInitChecked();
cudaGetDeviceCount(&count);
Wouldn't it be beneficial to error check all the calls as well?
CUDA_CALL(cuda*) and CUDA_CALL(cu*)?
Done
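As a rough illustration of what wrapping every call in a checking macro buys in a test, here is a self-contained sketch. `CHECKED_CALL`, `FakeStatus` and `FakeGetDeviceCount` are hypothetical stand-ins; DALI's actual `CUDA_CALL` definition is not shown in this thread and may differ.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical status codes standing in for cudaError_t / CUresult.
enum class FakeStatus { kSuccess = 0, kInvalidValue = 1 };

// Hypothetical API call: reports two devices, fails on a null pointer.
inline FakeStatus FakeGetDeviceCount(int *count) {
  if (count == nullptr) return FakeStatus::kInvalidValue;
  *count = 2;
  return FakeStatus::kSuccess;
}

// Sketch of a checking macro: evaluates the call once and throws if it did
// not succeed. The do/while(0) wrapper makes it behave as a single statement.
#define CHECKED_CALL(expr)                                             \
  do {                                                                 \
    const FakeStatus status_ = (expr);                                 \
    if (status_ != FakeStatus::kSuccess) {                             \
      throw std::runtime_error(std::string("call failed: ") + #expr);  \
    }                                                                  \
  } while (0)
```

A failed call then stops the test at the point of failure with the offending expression in the message, instead of letting later assertions run against a stale `count`.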
CUDA_CALL(cudaGetDevice(&original_device_));
}
DeviceGuard();

/// @brief Saves current device id, sets a new one and switches back
You should write here that it's supposed to be no-op for -1
Done
Please update docs/comments, otherwise looks ok.
!build
CI MESSAGE: [762911]: BUILD STARTED
- Some libraries do not use the primary context, while DALI does. When cudaSetDevice is called, the primary context is created and set as the current one; the old one is lost, while other apps may still need it. DeviceGuard now saves the current context and restores it when it is destroyed.
- Removed CUContext, as it is not needed and can be replaced by DeviceGuard.
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
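The save-and-restore behaviour described in the commit message can be sketched as follows. This is not the DALI implementation: the driver state is mocked with a plain integer, and `CurrentContext`, `PrimaryContextFor` and `FakeDeviceGuard` are hypothetical names. It also assumes that "no-op for -1" means the guard neither switches nor restores anything for a negative device id.

```cpp
#include <cassert>

// Hypothetical stand-in for driver state: id of the context that is current
// on this thread (0 means "no context").
inline int &CurrentContext() {
  static thread_local int ctx = 0;
  return ctx;
}

// Hypothetical mapping from a device ordinal to its primary context id.
inline int PrimaryContextFor(int device_id) { return 1000 + device_id; }

class FakeDeviceGuard {
 public:
  // Saves the current context; restores it on destruction.
  FakeDeviceGuard() : saved_ctx_(CurrentContext()), restore_(true) {}

  // Saves the current context and switches to the primary context of
  // new_device. Assumption: a negative id (e.g. -1) makes the guard a no-op.
  explicit FakeDeviceGuard(int new_device)
      : saved_ctx_(CurrentContext()), restore_(new_device >= 0) {
    if (restore_) CurrentContext() = PrimaryContextFor(new_device);
  }

  // Puts back whatever context was current when the guard was created,
  // even if it was not a primary context.
  ~FakeDeviceGuard() {
    if (restore_) CurrentContext() = saved_ctx_;
  }

  FakeDeviceGuard(const FakeDeviceGuard &) = delete;
  FakeDeviceGuard &operator=(const FakeDeviceGuard &) = delete;

 private:
  int saved_ctx_;
  bool restore_;
};
```

The point of restoring a context rather than a device id is that a caller using its own non-primary context gets exactly that context back once the guard goes out of scope.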
!build
CI MESSAGE: [762930]: BUILD STARTED
CI MESSAGE: [762911]: BUILD FAILED
CI MESSAGE: [762930]: BUILD PASSED