Improve cast operator performance #3783
Signed-off-by: Konrad Litwiński <klitwinski41418@gmail.com>
!build
CI MESSAGE: [4381653]: BUILD STARTED
CI MESSAGE: [4381653]: BUILD FAILED
Co-authored-by: Michał Zientkiewicz <mzient@gmail.com>
dali/kernels/common/cast.cuh
    const CastSampleBlockDesc *params,
    int nsamples, int block_volume_scale) {
  int i = 0;
  for (int jump = (1 << (32 - __clz(nsamples) - 1)); jump; jump >>= 1) {
How about calculating `(1 << (32 - __clz(nsamples) - 1))` outside and passing it as a kernel parameter? You can use the `ilog2` function. I'm not saying this is mandatory, but I'm curious if that would yield a measurable change in performance (one way or the other).
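For reference, the expression in question is the largest power of two not exceeding `nsamples`. The following is a minimal host-side C++ sketch of computing it outside the kernel. The names `clz32`, `floor_log2` and `initial_jump` are hypothetical stand-ins written for this illustration (they are not the `__clz` intrinsic or the `ilog2` function themselves), and the sketch assumes `nsamples > 0`.

```cpp
#include <cassert>
#include <cstdint>

// Host-side stand-in for a count-leading-zeros operation on 32-bit values.
// Returns 32 when x == 0. (Hypothetical helper, for illustration only.)
inline int clz32(uint32_t x) {
  int n = 0;
  for (uint32_t bit = 0x80000000u; bit != 0 && !(x & bit); bit >>= 1)
    n++;
  return n;
}

// Floor of log2(x) for x > 0. (Hypothetical helper, for illustration only.)
inline int floor_log2(uint32_t x) {
  assert(x > 0);
  return 31 - clz32(x);
}

// Largest power of two not exceeding nsamples, i.e. the same value as
// (1 << (32 - clz(nsamples) - 1)) used as the initial `jump` in the loop.
inline int initial_jump(int nsamples) {
  assert(nsamples > 0);
  return 1 << floor_log2(static_cast<uint32_t>(nsamples));
}
```

With this, `initial_jump(5)` gives 4 and `initial_jump(8)` gives 8; the result could then be passed to the kernel as an extra argument.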
Passing `(1 << (32 - __clz(nsamples) - 1))` as a parameter would mean adding a fifth kernel parameter, since `nsamples` still has to be passed (it is used in another place in the kernel).
As for performance, moving this calculation outside the kernel doesn't change throughput in any significant way. AFAIK the whole binary search has almost no impact on performance: I tried removing it and choosing a random block to process instead, and throughput didn't improve.
!build
CI MESSAGE: [4500452]: BUILD STARTED
CI MESSAGE: [4500452]: BUILD FAILED
Signed-off-by: Konrad Litwiński <klitwinski41418@gmail.com>
Signed-off-by: Konrad Litwiński <klitwinski41418@gmail.com>
!build
CI MESSAGE: [4680741]: BUILD STARTED
CI MESSAGE: [4680741]: BUILD FAILED
Lint complaining:
!build
CI MESSAGE: [4683075]: BUILD STARTED
CI MESSAGE: [4683075]: BUILD FAILED
Signed-off-by: Konrad Litwiński <klitwinski41418@gmail.com>
!build
CI MESSAGE: [4687549]: BUILD STARTED
CI MESSAGE: [4687549]: BUILD FAILED
(This is detected via the clang-only build, which has more thorough error checking in CUDA code.)
Signed-off-by: Konrad Litwiński <klitwinski41418@gmail.com>
!build
CI MESSAGE: [4731433]: BUILD STARTED
CI MESSAGE: [4731433]: BUILD FAILED
!build
CI MESSAGE: [4747662]: BUILD STARTED
CI MESSAGE: [4747662]: BUILD PASSED
* Use binary search to find the sample to process
* Extracting params to CastSampleBlockDesc
Signed-off-by: Konrad Litwiński <klitwinski41418@gmail.com>
Signed-off-by: Konrad Litwiński <klitwinski41418@gmail.com>
Category:
Refactoring (Redesign of existing code that doesn't affect functionality)
Description:
The main motivation for this work was to improve Cast throughput for small batch sizes.
Originally, running the Cast kernel (BatchedCastKernel) required copying two arrays to the GPU -
Since the number of blocks grew linearly with the data size (it was roughly 1024 times smaller than the data size), copying the second array was a large part of the cost of running the Cast kernel.
The idea of this optimization is to stop copying the block array. Instead, we build an array that records how big each sample is and which block is the first one to process it, copy that array to the GPU, and then compute in the kernel which sample each block should work on. To do that efficiently, we use binary search over the sample descriptors.
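The lookup described above can be sketched on the host in plain C++. This is only an illustration under stated assumptions, not the code of `BinSearchCastKernel`: the `SampleDesc` layout and the helpers `make_descs`, `highest_pow2` and `find_sample` are hypothetical names, `block_volume` is assumed to be the number of elements handled by one block, and `block_id` stands in for the block index a kernel would read from `blockIdx.x`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-sample descriptor (the real CastSampleBlockDesc may differ).
struct SampleDesc {
  int64_t size;     // number of elements in the sample
  int first_block;  // index of the first block assigned to this sample
};

// Builds one descriptor per sample; first_block is a running sum of the
// number of blocks needed by the preceding samples.
std::vector<SampleDesc> make_descs(const std::vector<int64_t> &sizes,
                                   int64_t block_volume) {
  assert(block_volume > 0);
  std::vector<SampleDesc> descs;
  int next_block = 0;
  for (int64_t size : sizes) {
    descs.push_back({size, next_block});
    // ceil(size / block_volume) blocks are needed to cover this sample
    next_block += static_cast<int>((size + block_volume - 1) / block_volume);
  }
  return descs;
}

// Largest power of two not exceeding n (n > 0).
int highest_pow2(int n) {
  assert(n > 0);
  int p = 1;
  while (p * 2 <= n) p *= 2;
  return p;
}

// Returns the last sample whose first_block is <= block_id, using a
// branch-light binary search with power-of-two jumps.
int find_sample(const SampleDesc *samples, int nsamples, int block_id) {
  int i = 0;
  for (int jump = highest_pow2(nsamples); jump; jump >>= 1) {
    if (i + jump < nsamples && samples[i + jump].first_block <= block_id)
      i += jump;
  }
  return i;
}
```

For example, with samples of 4096, 1024 and 3000 elements and a `block_volume` of 1024, the first blocks are 0, 4 and 5, so blocks 0 to 3 map to sample 0, block 4 to sample 1, and blocks 5 to 7 to sample 2. Only one descriptor per sample has to be copied, regardless of how many blocks the data is split into.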
Additional information:
For an image size of 1000x1000 we achieved the following improvement:
![image](https://user-images.githubusercontent.com/55670462/161379657-198cf932-93ed-47e6-b052-c8bead7db068.png)
Affected modules and functionalities:
cast.cuh
cast.cu
Key points relevant for the review:
The key changes are in the newly added BinSearchCastKernel kernel.
Checklist
Tests
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A