Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Convolution CPU kernel #1987

Merged
merged 5 commits into from
Jun 10, 2020
Merged

Conversation

klecki
Copy link
Contributor

@klecki klecki commented May 28, 2020

Signed-off-by: Krzysztof Lecki klecki@nvidia.com

Why we need this PR?

This is part of the effort for Gaussian Filter Operator

What happened in this PR?

  • What solution was applied:
    A DALI kernel was implemented, that can convolve a kernel window with n-dimenstional tensor in selected axis.
    The selected axis is traversed using a sliding window/cyclic buffer, adding only one new pixel for every output pixel - as we have the input in contiguous memory (almost) the convolution with kernel window should be fast even if the axis have bigger strides.
    Cyclic window used for non-innermost cases to allow "in-place" computation
    with cyclic buffer memory requested through scratchpad. It now keeps lanes, as some contiguous interval of channel values that is used as i-th window element in convolution for all of them.
    Non in-place innermost dimension uses flattened direct loop from @mzient,
    needs additional check for big window case.
  • Affected modules and functionalities:
    Kernels
  • Key points relevant for the review:
    Any more edge cases?
    Where to apply the fix in ConvolveInnerDim, the problematic parts are mentioned in the comments there.
  • Validation and testing:
    Gtest test added
  • Documentation (including examples):
    A bit of docstrings

JIRA TASK: [DALI-1425]

@klecki
Copy link
Contributor Author

klecki commented May 28, 2020

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1354581]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1354581]: BUILD PASSED

@klecki
Copy link
Contributor Author

klecki commented May 29, 2020

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1356721]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1356721]: BUILD PASSED

::testing::Types<cpw_params<1, true>, cpw_params<3, true>, cpw_params<1, false>>;

TYPED_TEST_P(CyclicPixelWrapperTest, FillAndCycle) {
constexpr int size = 6;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you test if the buffer actually holds up to 6 samples?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will prevent you from pushing more only in debug configuration with asserts. Too many PushBack with not enough PopFronts can be considered undefined. I check if the pointers to elements map to offsets that I would expect, so if you don't abuse it, it's working ok.

@klecki
Copy link
Contributor Author

klecki commented Jun 5, 2020

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1374247]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1374247]: BUILD FAILED

@klecki
Copy link
Contributor Author

klecki commented Jun 5, 2020

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1374667]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1374667]: BUILD PASSED

out_axis[xout * channels + c] = ConvertSat<Out>(acc * scale);
}
}
// hotfix?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in PR description, there is one case with window > image, I think I can either stop here or run through the loops below. I don't know which is better and if that's all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early exit sounds better.

Comment on lines 219 to 225
// this is probably not true, after we skipped the loop above
// x0 = axis_size - window_size + 1;
// xout = x0 + radius;
x0 = flat_x / channels;
xout = flat_xout / channels;
// we either stopped with (full window - 1) fitting before the end of image,
// or we reached the end?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments are quite confusing.

}
// hotfix?
// if (xout == axis_size) {
// return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not return, but break.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to exit here or to allow it to run through all the loops?
Do you see any other edge case than the one with window > image?

Copy link
Contributor

@mzient mzient Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, not even break - this should be a continue - there are still outer dimensions to process. However, it's an edge case and maybe it's not worth checking, since in 99% cases it just won't happen and the loops would be empty anyway - so on average there's a performance penalty of additional if - and, to corroborate the hypothesis that this special case is a bad idea, we've both made a mistake here already (return/break instead of continue).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I let it go through all the loops, recovering the non-flattened index.

@klecki klecki mentioned this pull request Jun 8, 2020
}
}

int NumLanes() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int NumLanes() {
int NumLanes() const noexcept {

}

int NumLanes() {
return std::min(num_lanes_, max_lanes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already enforcing it in the constructor... is this function even necessary?

Suggested change
return std::min(num_lanes_, max_lanes);
return num_lanes_;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention was to hint the compiler, that whenever I use the num_lanes_ it's less than max_lanes. But I didn't test if that helps much.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe if it got unrolled... but I don't think any compiler would be that smart.

}
}

int Size() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int Size() {
int Size() const {

return elements_;
}

bool Empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool Empty() {
bool Empty() const {

}

private:
void WrapPosition(int& pos) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void WrapPosition(int& pos) {
void WrapPosition(int& pos) const {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put const everywhere I could.

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Keep in-place, use fixed-width of lanes for non-innermost dims

Add optimized version for innermost dim,

Fix in-place with right border so it uses cyclic buffer
for border values that are already overwritten

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Copy link
Contributor Author

klecki commented Jun 9, 2020

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1380989]: BUILD STARTED

template <bool has_channels, typename Out, typename In, typename W, int ndim>
void ConvolveInnerDim(Out* out, const In* in, const W* window, int window_size, int radius,
const TensorShape<ndim>& shape, const TensorShape<ndim>& strides, W scale) {
constexpr int last_dim = has_channels ? ndim - 2 : ndim - 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe there should be some assert based on ndim?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static_assert(ndim >= (has_channels ? 2 : 1))
something like that

@NVIDIA NVIDIA deleted a comment from klecki Jun 9, 2020
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1380989]: BUILD PASSED

int64_t xout = 0;
Out* out_axis = &out[o * axis_stride];
const In* in_axis = &in[o * axis_stride];
for (; x0 < 0 && xout < axis_size; x0++, xout++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a comment that this is left border.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


template <bool has_channels, typename Out, typename In, typename W, int ndim>
void ConvolveInnerDim(Out* out, const In* in, const W* window, int window_size, int radius,
const TensorShape<ndim>& shape, const TensorShape<ndim>& strides, W scale) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between radius and window_size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

diameter = window_size
radius = (diameter - 1) / 2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So do you need as the function argument?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't :P

out_axis[flat_xout] = ConvertSat<Out>(acc * scale);
}
// get back from flat coordinates
x0 = flat_x / channels;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right border.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if (i + d >= 0 && i + d < len) {
out[i * stride + c] += in[(i + d) * stride + c] * window[d + r];
} else {
out[i * stride + c] +=
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: I think that you could just use this and remove the if block and the condition, since the out-of-bounds check already happens inside idx_reflect_101

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, reworked it a bit with ConvertSat and accumulator, I need it in follow-up, might as well land here.

Kernel kernel;

auto req = kernel.Setup(ctx, in_.shape, k_win_.num_elements());
// this is painful
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could always use the kernel manager

TYPED_TEST_P(CyclicWindowWrapperTest, FillAndCycle) {
constexpr int size = 6;
constexpr int num_lanes = TypeParam::num_lanes;
int tmp_buffer[size * num_lanes]; // NOLINT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the linter complaining about here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It thinks that a value that is not named kSomethingSomething is not constexpr and will use a VLA.

}

/**
* @brief Drop one element consistine of `NumLanes()` lanes from the buffer, moving the start
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @brief Drop one element consistine of `NumLanes()` lanes from the buffer, moving the start
* @brief Drop one element consisting of `NumLanes()` lanes from the buffer, moving the start

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* @brief Drop one element consistine of `NumLanes()` lanes from the buffer, moving the start
* element.
*/
void PopElement() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void PopElement() {
void PopFront() {

suggestion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

for (int64_t o = 0; o < outer_elements; o++) {
int64_t x0 = -radius;
int64_t xout = 0;
Out* out_axis = &out[o * axis_stride];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

o * axis_stride could be calculated only once, or make o incremented by axis_stride

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more readable that way, it's one multiply per axis which will do window_size * axis_elements multiplies anyway.

out_axis[xout * channels + c] = ConvertSat<Out>(acc * scale);
}
}
int64_t flat_x = x0 * channels;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a note here that this central part won't take effect when the window is bigger than the extent of the axis

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

void ConvolveInplaceOuterLoop(Out* out, const In* in, const W* window,
const TensorShape<ndim>& shape, const TensorShape<ndim>& strides,
int diameter, In* input_window_buffer, W scale = 1) {
int64_t outer_elements = volume(&shape[0], &shape[axis]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a static assert to check that axis is within the valid range? Unless you are sure it is because it's checked before

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already in Kernel:

  static_assert(0 <= axis && axis < (has_channels ? ndim - 1 : ndim),
                "Selected axis must be in [0, ndim) when there is no channel axis, or in [0, ndim "
                "- 1) for channel-last input");

KernelRequirements Setup(KernelContext& ctx, const TensorShape<ndim>& in_shape, int window_size) {
KernelRequirements req;
ScratchpadEstimator se;
DALI_ENFORCE(window_size % 2 == 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is window_size = 1 valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be. It would basically scale the image by the single window value. It doesn't make much sense to use it, but I don't see the reason to forbid it.

}
}

void baseline_dot(span<int> result, span<const int> input, span<const int> window, int in_offset) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void baseline_dot(span<int> result, span<const int> input, span<const int> window, int in_offset) {
void BaselineDot(span<int> result, span<const int> input, span<const int> window, int in_offset) {

nitpick: consistency with the rest of the file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I keep naming stuff with snake_case all the time, fixed.

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Copy link
Contributor Author

klecki commented Jun 10, 2020

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1384404]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1384404]: BUILD FAILED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1384404]: BUILD PASSED

@klecki klecki merged commit 05f9eec into NVIDIA:master Jun 10, 2020
@klecki klecki deleted the convolution-cpu-kernel branch June 10, 2020 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants