
Optimize the DCT GPU kernel. #2471

Merged (7 commits, Nov 20, 2020)
Conversation

banasraf
Collaborator

Why we need this PR?

  • Refactoring to improve the performance of the GPU DCT kernel. The current implementation is extremely slow when the transform runs over the inner axis, so this case is now handled separately. The existing CUDA kernel was also slightly optimized.

What happened in this PR?

  • What solution was applied:
    The case with the transform done over the inner axis is handled with a separate CUDA kernel. The existing kernel was optimized by employing shared memory.
  • Affected modules and functionalities:
    GPU DCT kernel.
  • Key points relevant for the review:
    A new CUDA kernel and the changes to the existing one.
  • Validation and testing:
    Existing tests still apply. I added a performance test.
  • Documentation (including examples):
    N/A
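As a rough illustration of the shared-memory approach described above (a hypothetical sketch, not DALI's actual `ApplyDctInner` kernel; all names and the block/row mapping are assumptions): one block processes one row, the row is first loaded cooperatively into shared memory with coalesced reads, and each thread then computes DCT-II coefficients from the fast on-chip copy instead of re-reading global memory.

```cuda
// Hypothetical sketch of an inner-axis DCT-II kernel; not the PR's code.
// One block per row; dynamic shared memory holds the whole input row.
extern __shared__ float row[];

__global__ void DctInnerSketch(const float *in, float *out, int n) {
  const float kPi = 3.14159265358979f;
  const float *src = in + static_cast<int64_t>(blockIdx.x) * n;
  float *dst = out + static_cast<int64_t>(blockIdx.x) * n;
  // Cooperative, coalesced load of the row into shared memory.
  for (int i = threadIdx.x; i < n; i += blockDim.x)
    row[i] = src[i];
  __syncthreads();
  // Each thread computes output coefficients k, k + blockDim.x, ...
  for (int k = threadIdx.x; k < n; k += blockDim.x) {
    float acc = 0.f;
    for (int i = 0; i < n; ++i)  // DCT-II: sum x_i * cos(pi/n * (i + 0.5) * k)
      acc += row[i] * cosf(kPi / n * (i + 0.5f) * k);
    dst[k] = acc;
  }
}
// Launch with the dynamic shared-memory size as the third launch parameter:
// DctInnerSketch<<<num_rows, 256, n * sizeof(float)>>>(in, out, n);
```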

JIRA TASK: DALI-1690

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
@banasraf
Collaborator Author

!build

@dali-automaton
Collaborator

CI MESSAGE: [1805832]: BUILD STARTED

@JanuszL
Contributor

JanuszL commented Nov 17, 2020

What is the speed now, and what used to be before that optimization?

@dali-automaton
Collaborator

CI MESSAGE: [1805832]: BUILD FAILED

__global__ void ApplyDctInner(const typename Dct1DGpu<OutputType, InputType>::SampleDesc *samples,
                              const BlockSetupInner::BlockDesc *blocks,
                              const float *lifter_coeffs) {
  extern __shared__ char *shm[];
Contributor

Suggested change
extern __shared__ char *shm[];
extern __shared__ char shm[];
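The suggestion matters beyond style: `extern __shared__ char *shm[];` declares the dynamic shared buffer as an array of `char *` pointers, so indexing it walks in pointer-sized strides over the wrong type. The conventional pattern is an untyped `char` array cast to the element type the kernel needs. A minimal sketch of that pattern (hypothetical kernel and names, not the PR's code):

```cuda
// Conventional dynamic shared memory: declare raw bytes, cast to the
// element type at the point of use.
extern __shared__ char shm[];

__global__ void ScaleRows(const float *in, float *out, int row_len) {
  // Hypothetical kernel: stage one row per block in shared memory.
  float *row = reinterpret_cast<float *>(shm);  // view raw bytes as floats
  int64_t base = static_cast<int64_t>(blockIdx.x) * row_len;
  for (int i = threadIdx.x; i < row_len; i += blockDim.x)
    row[i] = in[base + i];
  __syncthreads();
  for (int i = threadIdx.x; i < row_len; i += blockDim.x)
    out[base + i] = 2.0f * row[i];
}
// The dynamic size is the third launch parameter:
// ScaleRows<<<rows, 256, row_len * sizeof(float)>>>(in, out, row_len);
```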

Collaborator Author

done

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
struct BlockDesc {
  int64_t sample_idx;
  int64_t frame_start;
  int64_t frames_num;
Contributor

Suggested change
int64_t frames_num;
int64_t num_frames;

or

Suggested change
int64_t frames_num;
int64_t frame_count;

Collaborator Author

done

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
@banasraf
Collaborator Author

!build

@banasraf
Collaborator Author

banasraf commented Nov 17, 2020

@JanuszL

What is the speed now, and what used to be before that optimization?

For the planar layout it's ~550 GFLOPS -> ~630 GFLOPS and for interleaved it's ~30 GFLOPS -> ~315 GFLOPS

@dali-automaton
Collaborator

CI MESSAGE: [1806677]: BUILD STARTED

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
@@ -40,7 +40,7 @@ def define_graph(self):
test_data_root = get_dali_extra_path()
good_path = 'db/single'
missnamed_path = 'db/single/missnamed'
test_good_path = {'jpeg', 'mixed', 'png', 'tiff', 'pnm', 'bmp', 'jpeg2k'}
test_good_path = {'jpeg2k'}
Contributor

???

Collaborator Author

eeeh, good catch. I've committed too much

Collaborator Author

done

@dali-automaton
Collaborator

CI MESSAGE: [1806677]: BUILD FAILED

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
@banasraf
Collaborator Author

!build

@dali-automaton
Collaborator

CI MESSAGE: [1809572]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1809572]: BUILD PASSED

@banasraf banasraf merged commit d2f08b3 into NVIDIA:master Nov 20, 2020