Is your feature request related to a problem? Please describe.
As of #719, we have NVTX ranges in CUB device algorithms. Most CUB device algorithms also support graph capture. It is currently unclear whether NVTX works correctly in the presence of graph capture.
Describe the solution you'd like
We need to understand whether NVTX ranges work correctly when CUB is in graph capture mode. Since all of our *_.lid_2 tests run CUB algorithms in graph capture mode, one of these tests, say cub.cpp17.test.device_select_if.lid_2, can be used as an example. If NVTX ranges do not contain the kernels they surround, I'd prefer that no NVTX ranges be reported.
Describe alternatives you've considered
No response
Additional context
No response
Testing with catch2_test_device_histogram from #1695 shows:
You can see the cudaMalloc and cub::DeviceFor::Bulk from the thrust::device_vector<int8> setting up temporary storage. Then graph capture begins, the kernel is launched and reported by NVTX, and graph capture ends. When the graph is instantiated, launched, and synchronized, no NVTX ranges are reported. So NVTX ranges are shown when a kernel is captured, not when it is executed.
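For readers who want to reproduce the sequence described above outside the CUB test suite, here is a minimal sketch. It is not the code used in the test: the kernel `fill`, the helper `capture_then_replay`, and the `CHECK` macro are hypothetical names made up for illustration, and the raw NVTX C API stands in for whatever CUB uses internally. Based on the observation above, a profiler would be expected to show the range at capture time only, not during the graph launch.

```cuda
#include <cuda_runtime.h>
#include <nvtx3/nvToolsExt.h>

// Hypothetical kernel used only for this illustration.
__global__ void fill(int* out, int n, int value) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = value;
}

// Hypothetical helper macro: return early on the first CUDA error.
// Resources are not released on failure; this is a sketch, not production code.
#define CHECK(call)                              \
  do {                                           \
    cudaError_t err_ = (call);                   \
    if (err_ != cudaSuccess) return err_;        \
  } while (0)

// Captures one kernel launch (wrapped in an NVTX range) into a graph,
// then instantiates, launches, and synchronizes the graph.
cudaError_t capture_then_replay(int* d_out, int n, int value) {
  cudaStream_t stream = nullptr;
  cudaGraph_t graph = nullptr;
  cudaGraphExec_t exec = nullptr;

  CHECK(cudaStreamCreate(&stream));
  CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));

  // The range opens and closes on the host while the launch is only recorded.
  nvtxRangePushA("fill (captured)");
  fill<<<(n + 255) / 256, 256, 0, stream>>>(d_out, n, value);
  nvtxRangePop();

  CHECK(cudaGetLastError());
  CHECK(cudaStreamEndCapture(stream, &graph));

  // The kernel actually runs here, outside of any NVTX range.
  CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));
  CHECK(cudaGraphLaunch(exec, stream));
  CHECK(cudaStreamSynchronize(stream));

  CHECK(cudaGraphExecDestroy(exec));
  CHECK(cudaGraphDestroy(graph));
  CHECK(cudaStreamDestroy(stream));
  return cudaSuccess;
}
```

The range closes before the graph is ever launched, which is consistent with it being reported during capture and not during execution.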
@gevtushenko Would you like to have NVTX ranges disabled when stream capturing is active? That would require us to check the stream state on each invocation of a CUB device API.
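To make the cost of that option concrete, here is a rough sketch of what a per-invocation check could look like. This is an assumption about one possible shape, not CUB's implementation: `stream_is_capturing` and `capture_aware_range` are hypothetical names, and the raw NVTX C API is used for brevity. The only real runtime call involved is `cudaStreamIsCapturing`.

```cuda
#include <cuda_runtime.h>
#include <nvtx3/nvToolsExt.h>

// Hypothetical helper: true if `stream` is currently being captured into a graph.
// If the query itself fails, conservatively report "capturing" so that no range is emitted.
inline bool stream_is_capturing(cudaStream_t stream) {
  cudaStreamCaptureStatus status = cudaStreamCaptureStatusNone;
  if (cudaStreamIsCapturing(stream, &status) != cudaSuccess) {
    return true;
  }
  return status != cudaStreamCaptureStatusNone;
}

// Hypothetical RAII guard: opens an NVTX range only when the stream is not capturing.
class capture_aware_range {
public:
  capture_aware_range(const char* name, cudaStream_t stream)
      : active_(!stream_is_capturing(stream)) {
    if (active_) {
      nvtxRangePushA(name);
    }
  }

  ~capture_aware_range() {
    if (active_) {
      nvtxRangePop();
    }
  }

  capture_aware_range(const capture_aware_range&) = delete;
  capture_aware_range& operator=(const capture_aware_range&) = delete;

  // Whether a range was actually pushed; exposed only to make the sketch testable.
  bool active() const { return active_; }

private:
  bool active_;
};
```

Every device API entry point would have to construct such a guard, so the `cudaStreamIsCapturing` query would run on each call whether or not a profiler is attached. That per-call overhead is what would need to be measured.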
@bernhardmgruber thank you for taking a look! The results seem to match our intuition. Regarding the action item, investigating how much overhead is caused by checking whether the stream is in capture mode is a non-trivial amount of work. I'd just update the NVTX section of the developer overview to clarify this behavior.
Area
CUB