
Improve cast operator performance #3783

Merged (6 commits, May 4, 2022)

Conversation

@MirazSpecial (Contributor) commented Apr 2, 2022

Signed-off-by: Konrad Litwiński klitwinski41418@gmail.com

Category:

Refactoring (Redesign of existing code that doesn't affect functionality)

Description:

The main motivation of this work was to improve Cast throughput for small batches of data.

Originally, running the Cast kernel (BatchedCastKernel) required copying two arrays to the GPU:

  • a samples array (descriptors of where each sample is placed in memory), using 8 * number_of_samples bytes,
  • a blocks array (descriptors of what each thread block should do), using 20 * number_of_blocks bytes.

Since the number of blocks grows linearly with the data size (the number of blocks is only about 1024 times smaller than the data size), copying the second array was a large part of the cost of running the Cast kernel.

The idea of this optimization is that, instead of copying the blocks array, we build an array describing how big each sample is and which block is the first one to process it, copy that array to the GPU, and then, in the kernel, compute which sample each block should work on. To compute that efficiently, we use a binary search over the sample descriptors.
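The block-to-sample lookup described above can be sketched on the host as follows. This is a simplified illustration, not the actual kernel code: the `SampleDesc` struct and `FindSample` name are hypothetical stand-ins for the real `CastSampleBlockDesc` and the device-side search in `BinSearchCastKernel`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified sample descriptor; the real
// CastSampleBlockDesc in cast.cuh carries more fields.
struct SampleDesc {
  int first_block;  // index of the first thread block assigned to this sample
  int size;         // number of elements in the sample
};

// Given a block index, find the sample whose block range contains it.
// This mirrors the descending-jump binary search in the kernel: start
// with the largest power of two <= nsamples and halve the jump each step.
int FindSample(const std::vector<SampleDesc> &samples, int block_idx) {
  int nsamples = static_cast<int>(samples.size());
  int i = 0;
  int jump = 1;
  while (jump * 2 <= nsamples)  // largest power of two not exceeding nsamples
    jump *= 2;
  for (; jump; jump >>= 1) {
    if (i + jump < nsamples && samples[i + jump].first_block <= block_idx)
      i += jump;
  }
  // i is now the largest index whose first_block does not exceed block_idx
  return i;
}
```

In the kernel, each thread block runs this search on `blockIdx.x`, so only the small per-sample descriptor array has to be copied to the GPU.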

Additional information:

For image size 1000x1000 we achieved the following improvement:
[benchmark results image attached in the PR]

Affected modules and functionalities:

cast.cuh
cast.cu

Key points relevant for the review:

The key changes are in the newly added BinSearchCastKernel kernel.

Checklist

Tests

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@MirazSpecial changed the title from "Cast operator optimization using binary searc" to "Cast operator optimization using binary search" on Apr 2, 2022
Signed-off-by: Konrad Litwiński <klitwinski41418@gmail.com>
@MirazSpecial changed the title from "Cast operator optimization using binary search" to "Improve cast operator performance" on Apr 2, 2022
@mzient mzient self-assigned this Apr 4, 2022
dali/kernels/common/cast.cuh (outdated review thread, resolved)
@szalpal (Member) commented Apr 4, 2022

!build

@dali-automaton (Collaborator)

CI MESSAGE: [4381653]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [4381653]: BUILD FAILED

Co-authored-by: Michał Zientkiewicz <mzient@gmail.com>
        const CastSampleBlockDesc *params,
        int nsamples, int block_volume_scale) {
      int i = 0;
      for (int jump = (1 << (32 - __clz(nsamples) - 1)); jump; jump >>= 1) {
@mzient (Contributor) commented Apr 4, 2022

How about calculating (1 << (32 - __clz(nsamples) - 1)) outside and passing it as a kernel parameter? You can use ilog2 function. I'm not saying this is mandatory, but I'm curious if that would yield a measurable change in performance (one way or the other).

@MirazSpecial (Contributor, Author) replied:

Passing (1 << (32 - __clz(nsamples) - 1)) as a parameter would mean adding a fifth kernel parameter (nsamples still needs to be passed, as it is used elsewhere in the kernel).

As for performance, moving this calculation outside the kernel doesn't change performance in any significant way. AFAIK the whole binary search has almost no impact on performance (I tried removing it and choosing a random block to parse, and it didn't improve throughput).
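For reference, the expression under discussion, `1 << (32 - __clz(nsamples) - 1)`, is simply the largest power of two not exceeding `nsamples`, i.e. `1 << ilog2(nsamples)`. A portable host-side sketch of that equivalence, where `ilog2` is written out explicitly and may differ from DALI's actual helper:

```cpp
#include <cassert>
#include <cstdint>

// Integer base-2 logarithm for n > 0: position of the highest set bit.
// Stands in for the ilog2 helper mentioned in the review; the exact
// DALI signature may differ.
int ilog2(uint32_t n) {
  int log = 0;
  while (n >>= 1)
    ++log;
  return log;
}

// Host-side equivalent of the device expression
//   1 << (32 - __clz(nsamples) - 1)
// i.e. the initial jump of the descending binary search.
int InitialJump(uint32_t nsamples) {
  return 1 << ilog2(nsamples);
}
```

Precomputing this on the host and passing it in would trade one extra kernel parameter for a few instructions per block, which, per the discussion above, makes no measurable difference here.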

@szalpal (Member) commented Apr 13, 2022

!build

@dali-automaton (Collaborator)

CI MESSAGE: [4500452]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [4500452]: BUILD FAILED

Signed-off-by: Konrad Litwiński <klitwinski41418@gmail.com>
Signed-off-by: Konrad Litwiński <klitwinski41418@gmail.com>
@szalpal (Member) commented Apr 26, 2022

!build

@dali-automaton (Collaborator)

CI MESSAGE: [4680741]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [4680741]: BUILD FAILED

@klecki (Contributor) commented Apr 26, 2022

Lint is complaining:

#14 58.59 /opt/dali/dali/kernels/common/cast.cuh:30:  At least two spaces is best between code and comments  [whitespace/comments] [2]
#14 58.59 /opt/dali/dali/kernels/common/cast.cuh:58:  At least two spaces is best between code and comments  [whitespace/comments] [2]

@szalpal (Member) commented Apr 26, 2022

!build

@dali-automaton (Collaborator)

CI MESSAGE: [4683075]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [4683075]: BUILD FAILED

Signed-off-by: Konrad Litwiński <klitwinski41418@gmail.com>
@szalpal (Member) commented Apr 27, 2022

!build

@dali-automaton (Collaborator)

CI MESSAGE: [4687549]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [4687549]: BUILD FAILED

@klecki (Contributor) commented Apr 27, 2022

#14 251.6 /opt/dali/dali/kernels/common/cast.cuh:39:45: error: comparison of integers of different signs: 'int' and 'unsigned int' [-Werror,-Wsign-compare]
#14 251.6   for (int x = threadIdx.x + block_start; x < block_end; x += blockDim.x) {
#14 251.6                                           ~ ^ ~~~~~~~~~

(This is detected by the clang-only build, which has more thorough error checking for CUDA code.)
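The error above comes from comparing a signed loop index (`int x`) with an unsigned bound (`block_end`). A minimal host-side reproduction, with one common fix (making both sides of the comparison signed); this illustrates the class of fix, not necessarily the exact change made in the PR:

```cpp
#include <cassert>

// Minimal reproduction of the -Wsign-compare issue from the CI log.
// The stride and descriptor details of the real kernel loop are omitted.
int CountElements(int start, unsigned int block_end) {
  int count = 0;
  // Before: for (int x = start; x < block_end; ++x)   // int vs unsigned
  // After:  cast the bound so both operands are signed
  for (int x = start; x < static_cast<int>(block_end); ++x)
    ++count;
  return count;
}
```

With `-Wsign-compare` promoted to an error (`-Werror`), the signed operand would otherwise be implicitly converted to unsigned, which silently misbehaves for negative values.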

Signed-off-by: Konrad Litwiński <klitwinski41418@gmail.com>
@klecki (Contributor) commented May 2, 2022

!build

@dali-automaton (Collaborator)

CI MESSAGE: [4731433]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [4731433]: BUILD FAILED

@szalpal (Member) commented May 4, 2022

!build

@dali-automaton (Collaborator)

CI MESSAGE: [4747662]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [4747662]: BUILD PASSED

@szalpal szalpal merged commit f492132 into NVIDIA:main May 4, 2022
cyyever pushed a commit to cyyever/DALI that referenced this pull request May 13, 2022
* Use binary search to find the sample to process
* Extracting params to CastSampleBlockDesc

Signed-off-by: Konrad Litwiński <klitwinski41418@gmail.com>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jun 7, 2022
* Use binary search to find the sample to process
* Extracting params to CastSampleBlockDesc

Signed-off-by: Konrad Litwiński <klitwinski41418@gmail.com>
7 participants