Equalize kernel #4565
Conversation
!build
CI MESSAGE: [7027870]: BUILD STARTED
CI MESSAGE: [7027870]: BUILD FAILED
!build
CI MESSAGE: [7028794]: BUILD STARTED
CI MESSAGE: [7028794]: BUILD FAILED
CI MESSAGE: [7028794]: BUILD PASSED
static constexpr int hist_range = 256;

/**
 * @brief Performs per-channel equalization.
This doesn't seem to do per-channel equalization, but equalization on a single-channel input.
The input can have multiple channels (the second extent); each of them will get a different histogram and lookup table.
__global__ void ZeroMem(const SampleDesc *sample_descs) {
  auto sample_desc = sample_descs[blockIdx.y];
  sample_desc.out[blockIdx.x * SampleDesc::range_size + threadIdx.x] = 0;
}
is this better than cudaMemsetAsync?
I didn't try it, likely not.
For a single sample it seems to perform slightly slower, but the difference is so slim that I am not sure it is a real effect. One concern is that I'd have to assume the tensor list is contiguous here (or make num_samples calls).
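For context, a host-side sketch of what a ZeroMem launch computes (a minimal sketch with a hypothetical SampleDesc layout: one 256-bin histogram per channel per sample; in the kernel, blockIdx.y picks the sample, blockIdx.x the channel, and threadIdx.x the bin):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-side model of the ZeroMem grid. The GPU grid dimensions
// become plain loops here; the indexing mirrors the kernel's
// blockIdx.x * range_size + threadIdx.x addressing.
constexpr int kRangeSize = 256;

struct SampleDesc {
  uint64_t *out;      // histogram storage: num_channels * kRangeSize bins
  int num_channels;
};

void ZeroMemHost(const std::vector<SampleDesc> &samples) {
  for (const auto &sample : samples)                // grid y dimension
    for (int c = 0; c < sample.num_channels; c++)   // grid x dimension
      for (int bin = 0; bin < kRangeSize; bin++)    // threads in a block
        sample.out[c * kRangeSize + bin] = 0;
}
```

A single cudaMemsetAsync could replace the kernel only if all per-sample histogram buffers were contiguous; otherwise it takes one call per sample, which is the contiguity concern mentioned in the thread.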
static constexpr int64_t kMaxGridSize = 128;
static constexpr int64_t kShmPerChannelSize = SampleDesc::range_size * sizeof(uint64_t);

HistogramKernelGpu() : shared_mem_limit_{GetSharedMemPerBlock()}, sample_descs_{} {}
Suggested change:
- HistogramKernelGpu() : shared_mem_limit_{GetSharedMemPerBlock()}, sample_descs_{} {}
+ HistogramKernelGpu() : shared_mem_limit_{GetSharedMemPerBlock()} {}

^^ redundant?
Right!
const uint64_t *in;
};

struct LutKernelGpu {
no DLL_PUBLIC here?
Added it.
__shared__ uint64_t workspace[SampleDesc::range_size];
auto sample_desc = sample_descs[blockIdx.x];
PrefixSum(workspace, sample_desc.in);
int32_t first_idx = FirstNonZero(workspace);
Do you need every single thread finding the first non-zero? I wonder if it makes a difference.
As discussed elsewhere, we need that value in each thread anyway, and there does not seem to be an obvious alternative solution that would outperform it.
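For reference, the lookup-table construction this code path performs can be sketched on the CPU: prefix-sum the histogram into a CDF, locate the first non-zero bin, then rescale the occupied range onto [0, 255]. This is a minimal sketch assuming the classic equalization formula; names and rounding in the actual kernel may differ.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <numeric>

// Build an equalization lookup table from a 256-bin histogram.
std::array<uint8_t, 256> MakeLut(const std::array<uint64_t, 256> &hist) {
  std::array<uint64_t, 256> cdf{};
  std::partial_sum(hist.begin(), hist.end(), cdf.begin());  // PrefixSum
  int first = 0;
  while (first < 256 && cdf[first] == 0) first++;           // FirstNonZero
  std::array<uint8_t, 256> lut{};
  uint64_t total = cdf[255];
  if (first == 256 || total == cdf[first]) {
    // Degenerate histogram (empty or single value): identity mapping.
    for (int v = 0; v < 256; v++) lut[v] = static_cast<uint8_t>(v);
    return lut;
  }
  uint64_t cdf_min = cdf[first];
  for (int v = 0; v < 256; v++) {
    // Bins before `first` have cdf == 0; clamp so they map to 0.
    uint64_t c = std::max(cdf[v], cdf_min);
    lut[v] = static_cast<uint8_t>((c - cdf_min) * 255 / (total - cdf_min));
  }
  return lut;
}
```

On a uniform histogram this reduces to the identity mapping, which is a handy sanity check.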
     idx += blockDim.x * gridDim.x) {
  const uint8_t *in = sample_desc.in;
  uint8_t *out = sample_desc.out;
  uint64_t channel_idx = idx % sample_desc.num_channels;
Why rely on the modulus to index the channels, when you could use the y dimension (threadIdx.y)?
To have strided accesses in the small lookup table rather than in the global input and output.
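The trade-off can be illustrated with a host-side sketch (hypothetical names; channel-interleaved uint8 data assumed): consecutive flat indices touch consecutive elements of the large input and output buffers, which in the kernel makes those accesses coalesced, while `idx % num_channels` strides only within the small per-channel tables.

```cpp
#include <cstdint>
#include <vector>

// Apply per-channel lookup tables to channel-interleaved data (e.g. RGBRGB...).
// The big in/out buffers are walked contiguously; only the small LUTs, which
// fit in fast memory, see the strided channel access pattern.
void ApplyLuts(const uint8_t *in, uint8_t *out, int64_t num_elements,
               int num_channels, const std::vector<std::vector<uint8_t>> &luts) {
  for (int64_t idx = 0; idx < num_elements; idx++) {
    int channel = static_cast<int>(idx % num_channels);
    out[idx] = luts[channel][in[idx]];
  }
}
```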
for (int64_t idx = 0; idx < batch_shape[0].num_elements(); idx++) {
  sample_in_view.data[idx] = 51 * sample_idx;
}
}
You are not running the test, just setting the data. Is that expected?
Right, I missed it, thanks.
Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
!build
CI MESSAGE: [7056329]: BUILD STARTED
CI MESSAGE: [7056329]: BUILD FAILED
CI MESSAGE: [7056329]: BUILD PASSED
* Adds equalization kernel for uint8 samples
* The kernel computes histogram, lookup table and performs the lookup.

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
Category:
New feature (non-breaking change which adds functionality)
Description:
This PR adds the equalize kernel. Equalization consists of the following steps: computing a histogram of the input, building a lookup table from it, and mapping the input through the table. The lookup table is different for different channels, so the existing lookup kernel did not seem to fit here.
The operation is needed for auto augment pipelines. For now it supports only images and videos of uint8 type.
Additional information:
Affected modules and functionalities:
No existing functionalities are affected.
Key points relevant for the review:
Tests:
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: DALI-3187