-
Notifications
You must be signed in to change notification settings - Fork 609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Convolution CPU kernel #1987
Conversation
!build |
CI MESSAGE: [1354581]: BUILD STARTED |
CI MESSAGE: [1354581]: BUILD PASSED |
!build |
CI MESSAGE: [1356721]: BUILD STARTED |
CI MESSAGE: [1356721]: BUILD PASSED |
::testing::Types<cpw_params<1, true>, cpw_params<3, true>, cpw_params<1, false>>; | ||
|
||
TYPED_TEST_P(CyclicPixelWrapperTest, FillAndCycle) { | ||
constexpr int size = 6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you test if the buffer actually holds up to 6 samples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will prevent you from pushing more only in debug configuration with asserts. Too many PushBack with not enough PopFronts can be considered undefined. I check if the pointers to elements map to offsets that I would expect, so if you don't abuse it, it's working ok.
8e2bd9c
to
961d701
Compare
!build |
CI MESSAGE: [1374247]: BUILD STARTED |
CI MESSAGE: [1374247]: BUILD FAILED |
961d701
to
f01cc37
Compare
!build |
CI MESSAGE: [1374667]: BUILD STARTED |
CI MESSAGE: [1374667]: BUILD PASSED |
out_axis[xout * channels + c] = ConvertSat<Out>(acc * scale); | ||
} | ||
} | ||
// hotfix? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned in PR description, there is one case with window > image, I think I can either stop here or run through the loops below. I don't know which is better and if that's all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Early exit sounds better.
// this is probably not true, after we skipped the loop above | ||
// x0 = axis_size - window_size + 1; | ||
// xout = x0 + radius; | ||
x0 = flat_x / channels; | ||
xout = flat_xout / channels; | ||
// we either stopped with (full window - 1) fitting before the end of image, | ||
// or we reached the end? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments are quite confusing.
} | ||
// hotfix? | ||
// if (xout == axis_size) { | ||
// return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not return, but break.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to exit here or to allow it to run through all the loops?
Do you see any other edge case than the one with window > image?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, not even break - this should be a continue
- there are still outer dimensions to process. However, it's an edge case and maybe it's not worth checking, since in 99% cases it just won't happen and the loops would be empty anyway - so on average there's a performance penalty of additional if
- and, to corroborate the hypothesis that this special case is a bad idea, we've both made a mistake here already (return/break instead of continue).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I let it go through all the loops, recovering the non-flattened index.
} | ||
} | ||
|
||
int NumLanes() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int NumLanes() { | |
int NumLanes() const noexcept { |
} | ||
|
||
int NumLanes() { | ||
return std::min(num_lanes_, max_lanes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's already enforcing it in the constructor... is this function even necessary?
return std::min(num_lanes_, max_lanes); | |
return num_lanes_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intention was to hint the compiler, that whenever I use the num_lanes_ it's less than max_lanes. But I didn't test if that helps much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe if it got unrolled... but I don't think any compiler would be that smart.
} | ||
} | ||
|
||
int Size() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int Size() { | |
int Size() const { |
return elements_; | ||
} | ||
|
||
bool Empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool Empty() { | |
bool Empty() const { |
} | ||
|
||
private: | ||
void WrapPosition(int& pos) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void WrapPosition(int& pos) { | |
void WrapPosition(int& pos) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put const
everywhere I could.
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
f01cc37
to
8ca08fa
Compare
!build |
CI MESSAGE: [1380989]: BUILD STARTED |
template <bool has_channels, typename Out, typename In, typename W, int ndim> | ||
void ConvolveInnerDim(Out* out, const In* in, const W* window, int window_size, int radius, | ||
const TensorShape<ndim>& shape, const TensorShape<ndim>& strides, W scale) { | ||
constexpr int last_dim = has_channels ? ndim - 2 : ndim - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe there should be some assert based on ndim
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static_assert(ndim >= (has_channels ? 2 : 1))
something like that
CI MESSAGE: [1380989]: BUILD PASSED |
int64_t xout = 0; | ||
Out* out_axis = &out[o * axis_stride]; | ||
const In* in_axis = &in[o * axis_stride]; | ||
for (; x0 < 0 && xout < axis_size; x0++, xout++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a comment that this is left border.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
template <bool has_channels, typename Out, typename In, typename W, int ndim> | ||
void ConvolveInnerDim(Out* out, const In* in, const W* window, int window_size, int radius, | ||
const TensorShape<ndim>& shape, const TensorShape<ndim>& strides, W scale) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between radius and window_size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
diameter = window_size
radius = (diameter - 1) / 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So do you need as the function argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't :P
out_axis[flat_xout] = ConvertSat<Out>(acc * scale); | ||
} | ||
// get back from flat coordinates | ||
x0 = flat_x / channels; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right border.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if (i + d >= 0 && i + d < len) { | ||
out[i * stride + c] += in[(i + d) * stride + c] * window[d + r]; | ||
} else { | ||
out[i * stride + c] += |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: I think that you could just use this and remove the if block and the condition, since the out-of-bounds check already happens inside idx_reflect_101
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, reworked it a bit with ConvertSat and accumulator, I need it in follow-up, might as well land here.
Kernel kernel; | ||
|
||
auto req = kernel.Setup(ctx, in_.shape, k_win_.num_elements()); | ||
// this is painful |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could always use the kernel manager
TYPED_TEST_P(CyclicWindowWrapperTest, FillAndCycle) { | ||
constexpr int size = 6; | ||
constexpr int num_lanes = TypeParam::num_lanes; | ||
int tmp_buffer[size * num_lanes]; // NOLINT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the linter complaining about here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It thinks that a value that is not named kSomethingSomething is not constexpr and will use a VLA.
} | ||
|
||
/** | ||
* @brief Drop one element consistine of `NumLanes()` lanes from the buffer, moving the start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @brief Drop one element consistine of `NumLanes()` lanes from the buffer, moving the start | |
* @brief Drop one element consisting of `NumLanes()` lanes from the buffer, moving the start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* @brief Drop one element consistine of `NumLanes()` lanes from the buffer, moving the start | ||
* element. | ||
*/ | ||
void PopElement() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void PopElement() { | |
void PopFront() { |
suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
for (int64_t o = 0; o < outer_elements; o++) { | ||
int64_t x0 = -radius; | ||
int64_t xout = 0; | ||
Out* out_axis = &out[o * axis_stride]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
o * axis_stride
could be calculated only once, or make o
incremented by axis_stride
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's more readable that way, it's one multiply per axis which will do window_size * axis_elements multiplies anyway.
out_axis[xout * channels + c] = ConvertSat<Out>(acc * scale); | ||
} | ||
} | ||
int64_t flat_x = x0 * channels; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add a note here that this central part won't take effect when the window is bigger than the extent of the axis
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
void ConvolveInplaceOuterLoop(Out* out, const In* in, const W* window, | ||
const TensorShape<ndim>& shape, const TensorShape<ndim>& strides, | ||
int diameter, In* input_window_buffer, W scale = 1) { | ||
int64_t outer_elements = volume(&shape[0], &shape[axis]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a static assert to check that axis is within the valid range? Unless you are sure it is because it's checked before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already in Kernel:
static_assert(0 <= axis && axis < (has_channels ? ndim - 1 : ndim),
"Selected axis must be in [0, ndim) when there is no channel axis, or in [0, ndim "
"- 1) for channel-last input");
KernelRequirements Setup(KernelContext& ctx, const TensorShape<ndim>& in_shape, int window_size) { | ||
KernelRequirements req; | ||
ScratchpadEstimator se; | ||
DALI_ENFORCE(window_size % 2 == 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is window_size = 1 valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be. It would basically scale the image by the single window value. It doesn't make much sense to use it, but I don't see the reason to forbid it.
} | ||
} | ||
|
||
void baseline_dot(span<int> result, span<const int> input, span<const int> window, int in_offset) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void baseline_dot(span<int> result, span<const int> input, span<const int> window, int in_offset) { | |
void BaselineDot(span<int> result, span<const int> input, span<const int> window, int in_offset) { |
nitpick: consistency with the rest of the file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I keep naming stuff with snake_case all the time, fixed.
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
!build |
CI MESSAGE: [1384404]: BUILD STARTED |
CI MESSAGE: [1384404]: BUILD FAILED |
CI MESSAGE: [1384404]: BUILD PASSED |
Signed-off-by: Krzysztof Lecki klecki@nvidia.com
Why we need this PR?
This is part of the effort for Gaussian Filter Operator
What happened in this PR?
A DALI kernel was implemented, that can convolve a kernel window with n-dimenstional tensor in selected axis.
The selected axis is traversed using a sliding window/cyclic buffer, adding only one new pixel for every output pixel - as we have the input in contiguous memory (almost) the convolution with kernel window should be fast even if the axis have bigger strides.Cyclic window used for non-innermost cases to allow "in-place" computation
with cyclic buffer memory requested through scratchpad. It now keeps lanes, as some contiguous interval of channel values that is used as i-th window element in convolution for all of them.
Non in-place innermost dimension uses flattened direct loop from @mzient,
needs additional check for big window case.
Kernels
Any more edge cases?
Where to apply the fix in ConvolveInnerDim, the problematic parts are mentioned in the comments there.
Gtest test added
A bit of docstrings
JIRA TASK: [DALI-1425]