-
Notifications
You must be signed in to change notification settings - Fork 609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add JPEG color conversion and chroma subsampling kernel #2771
Conversation
template <typename T> | ||
__inline__ __device__ vec<3, T> rgb_to_ycbcr(const vec<3, uint8_t> rgb) { | ||
vec<3, T> ycbcr; | ||
ycbcr.x = rgb_to_y<T>(rgb); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need here as you are not using it when rgb_to_ycbcr
is called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really. Fixed
8914fd2
to
6cac3ab
Compare
return ycbcr; | ||
} | ||
|
||
template <bool horz_subsample, bool vert_subsample, typename T = uint8_t, int in_nchannels = 3> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
template <bool horz_subsample, bool vert_subsample, typename T = uint8_t, int in_nchannels = 3> | |
template <bool horz_subsample, bool vert_subsample, typename T = uint8_t> |
I don't think that RGBToYCbCr makes sense for anything other than 3 channels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
fe4eddd
to
2fd52c8
Compare
Signed-off-by: Joaquin Anton <janton@nvidia.com>
Signed-off-by: Joaquin Anton <janton@nvidia.com>
Signed-off-by: Joaquin Anton <janton@nvidia.com>
Signed-off-by: Joaquin Anton <janton@nvidia.com>
Signed-off-by: Joaquin Anton <janton@nvidia.com>
2fd52c8
to
b6b8926
Compare
Signed-off-by: Joaquin Anton <janton@nvidia.com>
b6b8926
to
3180b51
Compare
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Joaquin Anton <janton@nvidia.com>
// chroma_subsample_params_t<uint8_t, false, true>, | ||
// chroma_subsample_params_t<uint8_t, true, false>, | ||
// chroma_subsample_params_t<uint8_t, false, false> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
7ced1fc
to
6416ca6
Compare
for (int i = 0; i < N; i++) | ||
*ptr++ = v[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (int i = 0; i < N; i++) | |
*ptr++ = v[i]; | |
for (int i = 0; i < N; i++) | |
ptr[i] = v[i]; |
I think that 2 kinds of indexing are more confusing both for the human reader and compiler.
// Assuming CUDA block has: | ||
// - width 32, leads to 4 horizontal blocks of 8 | ||
// - height 8, so a single block 8x8 fits vertically | ||
__shared__ T luma_blk[4][luma_blk_h][luma_blk_w]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__shared__ T luma_blk[4][luma_blk_h][luma_blk_w]; | |
__shared__ T luma_blk[4][luma_blk_h][luma_blk_strides[1]]; |
__shared__ T cb_blk[4][8][8]; | ||
__shared__ T cr_blk[4][8][8]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__shared__ T cb_blk[4][8][8]; | |
__shared__ T cr_blk[4][8][8]; | |
__shared__ T cb_blk[4][chroma_blk_sz[1]][chroma_blk_strides[1]]; | |
__shared__ T cr_blk[4][chroma_blk_sz[1]][chroma_blk_strides[1]]; |
rgb_to_ycbcr_chroma_subsample<horz_subsample, vert_subsample>( | ||
blk_offset, offset, luma, cb, cr, in); | ||
|
||
__syncthreads(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this kernel, the synchronization is not necessary - you're accessing only the elements produced by this thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am aware of that. I just put it here, because I was planning to build on top of that. I also don't need the shared memory.
__shared__ T luma_blk[4][luma_blk_h][luma_blk_w]; | ||
__shared__ T cb_blk[4][8][8]; | ||
__shared__ T cr_blk[4][8][8]; | ||
|
||
int blk_idx = threadIdx.x / 8; | ||
int local_x = threadIdx.x % 8; | ||
int local_y = threadIdx.y; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that having 8x8 blocks for both luma and chroma may simplify the DCT & quantization steps.
__shared__ T luma_blk[4][luma_blk_h][luma_blk_w]; | |
__shared__ T cb_blk[4][8][8]; | |
__shared__ T cr_blk[4][8][8]; | |
int blk_idx = threadIdx.x / 8; | |
int local_x = threadIdx.x % 8; | |
int local_y = threadIdx.y; | |
__shared__ float Cb[2][4][8][9]; // yes, 9 - to reduce bank conflicts! | |
__shared__ float Cr[2][4][8][9]; | |
__shared__ float Y[2 << vert_subsample][4 << horz_subsample][8][9]; | |
int cx = threadIdx.x; | |
int cy = threadIdx.y; | |
chroma_x = cx & 7; | |
chroma_y = cy & 7; | |
chroma_bx = cx >> 3; | |
chroma_by = cy >> 3; | |
int lx = threadIdx.x << horz_subsample; | |
int ly = threadIdx.y << vert_subsample; | |
luma_x = lx & 7; | |
luma_y = ly & 7; | |
luma_bx = lx >> 3; | |
luma_by = ly >> 3; |
With 16x32 block size we'd have
Cr 2489 * 4 bytes = 2304 B
Cb 2489 * 4 bytes = 2304 B
Y 488*9 * 4 bytes = 9216 B
total 13824 B - well within acceptable limits, we still have quite a lot of shared mem for DCT step
Signed-off-by: Joaquin Anton <janton@nvidia.com>
6416ca6
to
debc3bb
Compare
Signed-off-by: Joaquin Anton <janton@nvidia.com>
Signed-off-by: Joaquin Anton <janton@nvidia.com>
e41d1b6
to
3a2eac4
Compare
Signed-off-by: Joaquin Anton <janton@nvidia.com>
3a2eac4
to
81be4a9
Compare
if (horz_subsample && vert_subsample) { | ||
luma[luma_y][luma_x + 1] = ycbcr.luma[1]; | ||
luma[luma_y + 1][luma_x] = ycbcr.luma[2]; | ||
luma[luma_y + 1][luma_x + 1] = ycbcr.luma[3]; | ||
} else if (horz_subsample) { | ||
luma[luma_y][luma_x + 1] = ycbcr.luma[1]; | ||
} else if (vert_subsample) { | ||
luma[luma_y + 1][luma_x] = ycbcr.luma[1]; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be shorter - and hopefully the compiler would be smart enough to unroll it.
if (horz_subsample && vert_subsample) { | |
luma[luma_y][luma_x + 1] = ycbcr.luma[1]; | |
luma[luma_y + 1][luma_x] = ycbcr.luma[2]; | |
luma[luma_y + 1][luma_x + 1] = ycbcr.luma[3]; | |
} else if (horz_subsample) { | |
luma[luma_y][luma_x + 1] = ycbcr.luma[1]; | |
} else if (vert_subsample) { | |
luma[luma_y + 1][luma_x] = ycbcr.luma[1]; | |
} | |
for (int i = 0, k = 0; i < vert_subsample+1; i++) | |
for (int j = 0; j < horz_subsample+1; j++, k++) | |
luma[luma_y + i][luma_x + j] = ycbcr.luma[k]; |
ivec2 offset{x, y}; | ||
|
||
auto ycbcr = rgb_to_ycbcr_subsampled<horz_subsample, vert_subsample, T>(offset, in); | ||
luma[luma_y][luma_x] = ycbcr.luma[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this line closer to other luma
writes - or follow the suggestion below.
Signed-off-by: Joaquin Anton <janton@nvidia.com>
int chroma_x = threadIdx.x & 7; // % 8 | ||
int chroma_y = threadIdx.y & 7; // % 8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked and for both & 7
and % 8
NVCC generates the same code:
mov.u32 %r3, %tid.x;
and.b32 %r4, %r3, 7;
Maybe we should stick to what is intended here then? Unless there is something here I'm not getting.
This goes for every time this trick was used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mzient This was originally your suggestion. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same for /8
vs >>8
. For both it is:
mov.u32 %r3, %tid.x;
shr.u32 %r4, %r3, 3;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My original suggestion was inside a function where these values were passed as signed integers. For signed integers, the division rounds towards zero and gives a negative remainder. It's not only slower (additional math), but also potentially dangerous.
!build |
CI MESSAGE: [2167738]: BUILD STARTED |
CI MESSAGE: [2167738]: BUILD FAILED |
Signed-off-by: Joaquin Anton <janton@nvidia.com>
6d4cd64
to
e426f31
Compare
!build |
CI MESSAGE: [2168116]: BUILD STARTED |
CI MESSAGE: [2168116]: BUILD PASSED |
Signed-off-by: Joaquin Anton janton@nvidia.com
Why we need this PR?
Pick one, remove the rest
What happened in this PR?
Fill relevant points, put NA otherwise. Replace anything inside []
Added a CUDA kernel for JPEG RGB to YCbCr conversion plus chroma subsampling
New functionality
CUDA kernel, performance
C++ tests added
NA
JIRA TASK: [DALI-1905]