
Add SeparableConvolutionGPU kernel #2311

Merged
5 commits merged into NVIDIA:master on Oct 1, 2020

Conversation

klecki
Contributor

@klecki klecki commented Sep 30, 2020

Supports frames-first and channels-last layouts; wraps
ConvolutionGPU, applying several passes,
for 1 to 3 data axes.
Simple sanity test; the convolution itself is already tested.

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

Why do we need this PR?

It adds a new kernel that will be used in the GaussianBlur GPU operator.

What happened in this PR?

  • What solution was applied:
Same as on the CPU: wrap the ConvolutionGpu kernel and apply several passes
  • Affected modules and functionalities:
    Kernels
  • Key points relevant for the review:
    Nothing special
  • Validation and testing:
    Simple sanity gtest
  • Documentation (including examples):
    NA

JIRA TASK: [Use DALI-1588 or NA]
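The idea the PR wraps, composing several 1-D passes instead of running one full N-D convolution, can be sketched on the CPU. This is a minimal illustration with hypothetical helper names, not DALI's ConvolutionGpu API: for a separable (outer-product) kernel, convolving rows and then columns gives the same result as the full 2-D convolution.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Image = std::vector<std::vector<float>>;

// Convolve each row with a 1-D window (zero padding, "same"-sized output).
Image ConvRows(const Image& in, const std::vector<float>& w) {
  int H = in.size(), W = in[0].size(), R = w.size() / 2;
  Image out(H, std::vector<float>(W, 0.f));
  for (int y = 0; y < H; y++)
    for (int x = 0; x < W; x++)
      for (int i = 0; i < (int)w.size(); i++) {
        int xx = x + i - R;
        if (xx >= 0 && xx < W) out[y][x] += w[i] * in[y][xx];
      }
  return out;
}

// Convolve each column with a 1-D window (zero padding, "same"-sized output).
Image ConvCols(const Image& in, const std::vector<float>& w) {
  int H = in.size(), W = in[0].size(), R = w.size() / 2;
  Image out(H, std::vector<float>(W, 0.f));
  for (int y = 0; y < H; y++)
    for (int x = 0; x < W; x++)
      for (int i = 0; i < (int)w.size(); i++) {
        int yy = y + i - R;
        if (yy >= 0 && yy < H) out[y][x] += w[i] * in[yy][x];
      }
  return out;
}

// Reference: full 2-D convolution with the outer-product kernel wy (x) wx.
Image Conv2D(const Image& in, const std::vector<float>& wy,
             const std::vector<float>& wx) {
  int H = in.size(), W = in[0].size(), Ry = wy.size() / 2, Rx = wx.size() / 2;
  Image out(H, std::vector<float>(W, 0.f));
  for (int y = 0; y < H; y++)
    for (int x = 0; x < W; x++)
      for (int j = 0; j < (int)wy.size(); j++)
        for (int i = 0; i < (int)wx.size(); i++) {
          int yy = y + j - Ry, xx = x + i - Rx;
          if (yy >= 0 && yy < H && xx >= 0 && xx < W)
            out[y][x] += wy[j] * wx[i] * in[yy][xx];
        }
  return out;
}
```

The two-pass route does O(H*W*(Ky+Kx)) work instead of O(H*W*Ky*Kx), which is why the kernel is worth having.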

Supports Frames-first, Channel-last, wraps
ConvolutionGPU applying several passes,
for number of data axes in [1, 3].
Simple sanity test, convolution already tested.

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
/**
* @brief Apply convolution in all spatial axes, starting from the innermost to outermost.
* If channel axis is present, the convolution is not applied there.
* If it is marqed as sequence, the first data axis is considered as temporal axis
Contributor

Suggested change
* If it is marqed as sequence, the first data axis is considered as temporal axis
* If it is marked as sequence, the outermost dimension denotes frames and
* convolution is not applied to it.

Contributor Author

Done

Comment on lines 97 to 98
req.AddInputSet(req_inner, false);
req.AddInputSet(req_outer, false);
Contributor

Suggested change
req.AddInputSet(req_inner, false);
req.AddInputSet(req_outer, false);
req.AddInputSet(req_inner, true);
req.AddInputSet(req_outer, true);

Can't you reuse the scratchpad? It should be possible.

Contributor

I see that the SeparableConvolutionXXX kernels are the only ones using this function. And... it's bugged. It's also used incorrectly, since AddInputSet concatenates the list of outputs.

Contributor Author

Huh, I missed that append. I think it was also used somewhere else; I just wanted a way of accumulating the requirements without dealing with the internals of the KernelRequirements representation. Will fix.

Contributor Author

Done

Comment on lines 142 to 144
req.AddInputSet(req_inner, false);
req.AddInputSet(req_middle, false);
req.AddInputSet(req_outer, false);
Contributor

Suggested change
req.AddInputSet(req_inner, false);
req.AddInputSet(req_middle, false);
req.AddInputSet(req_outer, false);
req.AddInputSet(req_inner, true);
req.AddInputSet(req_middle, true);
req.AddInputSet(req_outer, true);

?

static constexpr int axes = 1;
static constexpr int sequence_axes = static_cast<int>(is_sequence);
static constexpr int channel_axes = static_cast<int>(has_channels);
static constexpr int ndim = sequence_axes + axes + channel_axes;
Contributor

It seems to be the same for all specializations; can we extract this?

Contributor Author

Do you have any idea how to make it better? I can maybe put it in some

template <int axes, bool is_sequence, bool has_channels>
struct calc_ndim {
  static constexpr int ndim =
      static_cast<int>(is_sequence) + axes + static_cast<int>(has_channels);
};

and skip the sequence_axes & channel_axes members, but I'm not sure how much better that is.

Contributor

Just an observation; if you don't see a better way, leave it as it is.

Contributor Author

Maybe a constexpr function could also work.

Contributor Author

Hmm, actually 3 of those member variables are used directly in some kernel parameter definitions or calls, so I will just leave it.
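For illustration, the constexpr-function alternative floated in this thread might look like the sketch below. This is not the code that was merged (the struct with member constants was kept), just a hedged sketch of the idea:

```cpp
#include <cassert>

// Sketch: total ndim is the number of data axes plus one optional frame
// (sequence) axis and one optional channel axis.
constexpr int calc_ndim(int axes, bool is_sequence, bool has_channels) {
  return static_cast<int>(is_sequence) + axes + static_cast<int>(has_channels);
}

static_assert(calc_ndim(1, false, false) == 1, "plain 1D data");
static_assert(calc_ndim(2, false, true) == 3, "HWC image");
static_assert(calc_ndim(3, true, true) == 5, "frames-first video with channels");
```

As the author notes, the struct form wins here because sequence_axes and channel_axes are themselves used in kernel parameter definitions, which a single function would not expose.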

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Contributor Author

klecki commented Sep 30, 2020

!build

@dali-automaton
Collaborator

CI MESSAGE: [1666358]: BUILD STARTED

@klecki klecki mentioned this pull request Sep 30, 2020
int num_samples_;

void SetDataShape() {
TensorShape<> target_shape = {64, 64, 64};
Contributor

Use non-cubic data.

Contributor Author

Done

target_shape = shape_cat(target_shape, 3);
if (kFrames)
target_shape = shape_cat(20, target_shape);
data_shape_ = uniform_list_shape<kNdim>(1, target_shape.to_static<kNdim>());
Contributor

Use non-uniform list shape in the batch.

Contributor Author

Done

Comment on lines 110 to 122
auto req = kernel_gpu.Setup(ctx_gpu, data_shape_, window_dims_);

ScratchpadAllocator scratch_alloc;
scratch_alloc.Reserve(req.scratch_sizes);
auto scratchpad = scratch_alloc.GetScratchpad();
ctx_gpu.scratchpad = &scratchpad;

kernel_gpu.Run(ctx_gpu, out_gpu_v, in_gpu_v, window_v);

auto out_cpu_v = output_.cpu(0);
cudaDeviceSynchronize();
CUDA_CALL(cudaGetLastError());
Check(out_cpu_v, baseline_out_v);
Contributor

Run Setup and Run at least twice with different data to make sure that there's no accumulation of state in internal vectors or something like that.

Contributor Author

done

@dali-automaton
Collaborator

CI MESSAGE: [1666358]: BUILD PASSED

return result;
}

inline scratch_sizes_t GetSumScratch(const scratch_sizes_t &a, const scratch_sizes_t &b) {
Contributor

Suggested change
inline scratch_sizes_t GetSumScratch(const scratch_sizes_t &a, const scratch_sizes_t &b) {
inline scratch_sizes_t AppendScratchSize(const scratch_sizes_t &a, const scratch_sizes_t &b, int alignment = 64) {

Contributor

Also, use this function in AddInputSet - the existing implementation is buggy.

Contributor Author

done

inline scratch_sizes_t GetSumScratch(const scratch_sizes_t &a, const scratch_sizes_t &b) {
scratch_sizes_t result;
for (size_t i = 0; i < result.size(); i++) {
result[i] = a[i] + b[i];
Contributor

Suggested change
result[i] = a[i] + b[i];
result[i] = align_up(a[i], alignment) + b[i];

Contributor Author

done
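The suggested fix (align each accumulated block up before adding the next one) can be illustrated as below. The align_up helper is shown inline here for self-containment; array size 4 stands in for AllocType::Count, and both are assumptions of this sketch rather than DALI's exact definitions:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of alignment (must be a power of two).
inline size_t align_up(size_t n, size_t alignment) {
  return (n + alignment - 1) & ~(alignment - 1);
}

// One size per allocation type; 4 is a placeholder for AllocType::Count.
using scratch_sizes_t = std::array<size_t, 4>;

// Sketch of the suggested AppendScratchSize: when two scratch blocks live
// in the same buffer one after another, the second one starts at an aligned
// offset, so the first size must be aligned up before the sizes are added.
inline scratch_sizes_t AppendScratchSize(const scratch_sizes_t &a,
                                         const scratch_sizes_t &b,
                                         size_t alignment = 64) {
  scratch_sizes_t result;
  for (size_t i = 0; i < result.size(); i++)
    result[i] = align_up(a[i], alignment) + b[i];
  return result;
}
```

Without the align_up, the plain a[i] + b[i] sum can under-reserve: a 100-byte first block followed by a 64-byte-aligned second block actually needs 128 + b[i] bytes, which is the bug the reviewer points at.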

@@ -25,13 +25,31 @@
namespace dali {
namespace kernels {

using scratch_sizes_t = std::array<size_t, static_cast<size_t>(AllocType::Count)>;

inline scratch_sizes_t GetMaxScratch(const scratch_sizes_t &a, const scratch_sizes_t &b) {
Contributor

nitpick

Suggested change
inline scratch_sizes_t GetMaxScratch(const scratch_sizes_t &a, const scratch_sizes_t &b) {
inline scratch_sizes_t MaxScratchSize(const scratch_sizes_t &a, const scratch_sizes_t &b) {

Contributor Author

done
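For contrast with the summing combiner, the max-combining function being renamed here serves the reuse case: when sequential passes share one scratchpad, only the larger of the two requirements has to be satisfied. A self-contained sketch (array size 4 again standing in for AllocType::Count):

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// One size per allocation type; 4 is a placeholder for AllocType::Count.
using scratch_sizes_t = std::array<size_t, 4>;

// When two passes run one after another and reuse the same scratchpad,
// the combined requirement is the element-wise maximum, not the sum.
inline scratch_sizes_t MaxScratchSize(const scratch_sizes_t &a,
                                      const scratch_sizes_t &b) {
  scratch_sizes_t result;
  for (size_t i = 0; i < result.size(); i++)
    result[i] = std::max(a[i], b[i]);
  return result;
}
```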

}

void FillData() {
ConstantFill(input_.cpu(), 1);
Contributor

This is not a particularly strong test...

Contributor Author

It's not; it's mostly a sanity test, as the actual testing is done in ConvolutionGpu and this is a simple wrapper. The operator brings the full load of Python tests as well.

Contributor Author

Eh, added yet another similar test.

void FillData() {
ConstantFill(input_.cpu(), 1);
for (int i = 0; i < kAxes; i++) {
ConstantFill(kernel_window_[i].cpu(), 1);
Contributor

How about filling the windows with a pattern:
1 2 3 4 3 2 1
?
With a constant input you could still calculate the reference as a product of the sums of the windows.

Contributor Author

done

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Contributor Author

klecki commented Sep 30, 2020

!build

@dali-automaton
Collaborator

CI MESSAGE: [1666944]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1666944]: BUILD PASSED

@klecki klecki merged commit 1491278 into NVIDIA:master Oct 1, 2020
@klecki klecki deleted the separable-conv-gpu branch October 1, 2020 09:30
4 participants