Optimize the DCT GPU kernel. #2471
Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
!build
CI MESSAGE: [1805832]: BUILD STARTED
What is the speed now, and what was it before this optimization?
CI MESSAGE: [1805832]: BUILD FAILED
dali/kernels/signal/dct/dct_gpu.cu
__global__ void ApplyDctInner(const typename Dct1DGpu<OutputType, InputType>::SampleDesc *samples,
                              const BlockSetupInner::BlockDesc *blocks,
                              const float *lifter_coeffs) {
  extern __shared__ char *shm[];
- extern __shared__ char *shm[];
+ extern __shared__ char shm[];
done
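Why the suggestion matters: `extern __shared__ char *shm[];` declares an unsized array of pointers, so `shm[i]` has type `char *` and consecutive elements are `sizeof(char *)` bytes apart, whereas `char shm[]` is the plain byte buffer that dynamic shared memory is normally treated as. The sketch below is a host-side analogy in plain C++ (`__shared__` is CUDA-only) of the usual pattern of carving typed regions out of such a byte buffer. `Carve` is a hypothetical helper for illustration, not something taken from the PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the PR): reserve `count` elements of type T
// inside a raw byte buffer, starting at byte `offset`, and advance `offset`
// past the reserved region. Assumes `base` itself is suitably aligned for T.
template <typename T>
T *Carve(char *base, std::size_t &offset, std::size_t count) {
  assert(base != nullptr);
  const std::size_t align = alignof(T);
  // Round the running offset up to the next multiple of T's alignment.
  offset = (offset + align - 1) / align * align;
  T *region = reinterpret_cast<T *>(base + offset);
  offset += count * sizeof(T);
  return region;
}
```

With a byte buffer the offsets are counted in bytes, which is what the size passed at kernel launch for dynamic shared memory is also expressed in. With an array of pointers the same index arithmetic would step in units of `sizeof(char *)`.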
Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
dali/kernels/signal/dct/dct_gpu.h
struct BlockDesc {
  int64_t sample_idx;
  int64_t frame_start;
  int64_t frames_num;
- int64_t frames_num;
+ int64_t num_frames;
or
- int64_t frames_num;
+ int64_t frame_count;
done
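To make the role of these fields concrete, here is a minimal sketch of how a block setup could fill such descriptors, assuming each block covers a contiguous run of frames from one sample. The helper `MakeBlocks` and the parameter `frames_per_block` are hypothetical, and the field is named `frame_count` only because it is one of the two names suggested above; the thread does not say which one was adopted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors the fields visible in the reviewed snippet; `frame_count` is one of
// the two names suggested in review (an assumption about the final name).
struct BlockDesc {
  int64_t sample_idx;   // index of the sample this block belongs to
  int64_t frame_start;  // first frame of the sample covered by this block
  int64_t frame_count;  // number of frames covered by this block
};

// Hypothetical helper (not from the PR): split every sample's frames into
// chunks of at most `frames_per_block` frames, one BlockDesc per chunk.
std::vector<BlockDesc> MakeBlocks(const std::vector<int64_t> &frames_per_sample,
                                  int64_t frames_per_block) {
  assert(frames_per_block > 0);
  std::vector<BlockDesc> blocks;
  const int64_t num_samples = static_cast<int64_t>(frames_per_sample.size());
  for (int64_t s = 0; s < num_samples; s++) {
    const int64_t total = frames_per_sample[s];
    for (int64_t start = 0; start < total; start += frames_per_block) {
      // The last chunk of a sample may be shorter than frames_per_block.
      const int64_t count = std::min(frames_per_block, total - start);
      blocks.push_back({s, start, count});
    }
  }
  return blocks;
}
```

A kernel launched with one thread block per descriptor can then look up its sample and frame range from the descriptor alone, which is why the count is stored rather than an end index here; either choice would work.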
be2c955 to 1e59882
Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
1e59882 to f2ac7a0
!build
For the planar layout, throughput went from ~550 GFLOPS to ~630 GFLOPS; for the interleaved layout, it went from ~30 GFLOPS to ~315 GFLOPS.
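As a quick check of what these approximate figures imply, using plain arithmetic on the numbers quoted above:

$$\frac{630}{550} \approx 1.15, \qquad \frac{315}{30} = 10.5$$

That is roughly a 1.15x speedup for the planar layout and about 10.5x for the interleaved layout.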
CI MESSAGE: [1806677]: BUILD STARTED
Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
@@ -40,7 +40,7 @@ def define_graph(self):
 test_data_root = get_dali_extra_path()
 good_path = 'db/single'
 missnamed_path = 'db/single/missnamed'
-test_good_path = {'jpeg', 'mixed', 'png', 'tiff', 'pnm', 'bmp', 'jpeg2k'}
+test_good_path = {'jpeg2k'}
???
eeeh, good catch. I've committed too much
done
CI MESSAGE: [1806677]: BUILD FAILED
Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
!build
CI MESSAGE: [1809572]: BUILD STARTED
CI MESSAGE: [1809572]: BUILD PASSED
Why do we need this PR?
What happened in this PR?
The case with the transform done over the inner axis is handled with a separate CUDA kernel. The existing kernel was optimized by employing shared memory.
GPU DCT kernel.
A new CUDA kernel and the changes in the old one.
Existing tests still apply. I added a performance test.
N/A
JIRA TASK: DALI-1690
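For readers unfamiliar with the inner-axis case mentioned in the description, the following is a host-side reference sketch of a transform over the innermost axis. It is not the PR's kernel: it assumes a textbook unnormalized DCT-II, $X_k = \sum_{n} x_n \cos\left(\pi (n + 0.5) k / N\right)$, and the names `MakeCosTable` and `DctInnerReference` are hypothetical. The description does not say what the kernel keeps in shared memory; the cosine table is shown separately only because it is identical for every frame, which makes it a plausible candidate for that kind of reuse.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Precompute table[k * n_in + n] = cos(pi * (n + 0.5) * k / n_in).
// The table depends only on (k, n), so every frame can reuse it.
std::vector<float> MakeCosTable(int n_in, int n_out) {
  const double kPi = 3.14159265358979323846;
  std::vector<float> table(static_cast<std::size_t>(n_out) * n_in);
  for (int k = 0; k < n_out; k++) {
    for (int n = 0; n < n_in; n++) {
      table[static_cast<std::size_t>(k) * n_in + n] =
          static_cast<float>(std::cos(kPi * (n + 0.5) * k / n_in));
    }
  }
  return table;
}

// Unnormalized DCT-II over the innermost axis of a row-major [frames, n_in]
// array. Returns a row-major [frames, n_out] array.
std::vector<float> DctInnerReference(const std::vector<float> &in, int frames,
                                     int n_in, int n_out) {
  assert(in.size() == static_cast<std::size_t>(frames) * n_in);
  const std::vector<float> table = MakeCosTable(n_in, n_out);
  std::vector<float> out(static_cast<std::size_t>(frames) * n_out, 0.0f);
  for (int f = 0; f < frames; f++) {
    const float *x = in.data() + static_cast<std::size_t>(f) * n_in;
    float *y = out.data() + static_cast<std::size_t>(f) * n_out;
    for (int k = 0; k < n_out; k++) {
      const float *row = table.data() + static_cast<std::size_t>(k) * n_in;
      float acc = 0.0f;
      for (int n = 0; n < n_in; n++) acc += x[n] * row[n];
      y[k] = acc;
    }
  }
  return out;
}
```

A reference like this is mainly useful for checking a GPU implementation against known inputs. For a constant input, for example, only the `k = 0` coefficient of each frame is non-zero.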