-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[To Disable LayerFusion] add copyOutputToHost function for only selected outputs #495
base: release/8.6
Are you sure you want to change the base?
Conversation
//! | ||
//! \brief Copy the selected contents of output device buffers by bidingIdx to output host buffers synchronously. | ||
//! | ||
void copyOutputToHost(const std::vector<int>& bindingIndices) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution!
To follow our coding convention, it would be great if you can make the following modifications:
- Use east-const style (i.e.
const std::vector<int>&
->std::vector<int> const&
) for all occurrences ofconst
. - Use
int32_t
instead ofint
. - Use 4-whitespace indentation instead of 2-whitespace indentation.
We appreciate it!
const cudaMemcpyKind memcpyType = cudaMemcpyDeviceToHost; | ||
if (!mEngine->bindingIsInput(i)) | ||
{ | ||
CHECK(cudaMemcpy(dstPtr, srcPtr, byteSize, memcpyType, stream)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cudaMemcpy
should be cudaMemcpyAsync
{ | ||
for (int i : bindingIndices) | ||
{ | ||
void* dstPtr = mManagedBuffers[i]->hostBuffer.data(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move these inside the if(){....}
clause?
When there is an bug(#380) in layer fusion or you just want to debug layer fusions,
using mark_output is the only solution to disable layer fusion now(#252).
https://github.com/NVIDIA/TensorRT/issues/252#issuecomment-577468499
But the latency increases because to use copyOutputToHost brings unintended outputs from device to host.
https://github.com/NVIDIA/TensorRT/issues/252#issuecomment-593716265
Therefore, added two functions in buffers.h to bring the selected outputs only(by index) to host.