Separable convolution #2009

Merged - klecki merged 5 commits into NVIDIA:master on Jun 19, 2020
Conversation

@klecki (Contributor) commented Jun 8, 2020

Why do we need this PR?

Adds a separable convolution that applies the convolution kernel in one pass per axis.

What happened in this PR?

  • What solution was applied:
    SeparableConvolution kernel built by running several passes of ConvolutionKernel, one per axis (see the sketch after this description).
  • Affected modules and functionalities:
    Kernels, kernels tests
  • Key points relevant for the review:
    Nothing fancy, mostly boilerplate
  • Validation and testing:
    Gtest for kernel, baseline implementation moved to separate file.
  • Documentation (including examples):
    [ Describe here if documentation and examples were updated. ]

JIRA TASK: [DALI-1425]
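For readers new to the technique, below is a minimal, self-contained sketch of the idea behind the kernel: a separable filter is applied as one 1D convolution pass per axis, with an intermediate buffer between the passes. This is illustrative only - the function name separable_conv_2d, the row-major layout, and the zero-padding border are assumptions for the example, not the DALI kernel API.

#include <cstddef>
#include <vector>

// Minimal illustration: a 2D separable convolution done as two 1D passes.
// `window_h` is applied along rows (axis 1), `window_v` along columns (axis 0).
// Out-of-range samples are treated as zero; a real kernel would offer border modes.
std::vector<float> separable_conv_2d(const std::vector<float>& in, int height, int width,
                                     const std::vector<float>& window_h,
                                     const std::vector<float>& window_v) {
  std::vector<float> tmp(in.size(), 0.f);  // intermediate image between the passes
  std::vector<float> out(in.size(), 0.f);
  int rh = static_cast<int>(window_h.size()) / 2;
  int rv = static_cast<int>(window_v.size()) / 2;

  // Pass 1: convolve every row with window_h, writing to the intermediate buffer.
  for (int y = 0; y < height; y++)
    for (int x = 0; x < width; x++) {
      float acc = 0.f;
      for (int k = 0; k < static_cast<int>(window_h.size()); k++) {
        int xs = x + k - rh;
        if (xs >= 0 && xs < width) acc += in[y * width + xs] * window_h[k];
      }
      tmp[y * width + x] = acc;
    }

  // Pass 2: convolve every column of the intermediate result with window_v.
  for (int y = 0; y < height; y++)
    for (int x = 0; x < width; x++) {
      float acc = 0.f;
      for (int k = 0; k < static_cast<int>(window_v.size()); k++) {
        int ys = y + k - rv;
        if (ys >= 0 && ys < height) acc += tmp[ys * width + x] * window_v[k];
      }
      out[y * width + x] = acc;
    }
  return out;
}

With this structure a KxK filter costs roughly 2K multiply-adds per output pixel instead of K*K, which is the main motivation for separable convolution.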

@klecki force-pushed the separable-convolution branch 2 times, most recently from 398df59 to 206941e on June 15, 2020 14:26
@klecki marked this pull request as ready for review on June 15, 2020 14:27
@klecki (Author) commented Jun 15, 2020:

!build

@dali-automaton (Collaborator): CI MESSAGE: [1395861]: BUILD STARTED

Wraps Convolution CPU kernel

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki (Author) commented Jun 15, 2020:

!build

@dali-automaton (Collaborator): CI MESSAGE: [1395876]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [1395928]: BUILD STARTED

void Run(KernelContext& ctx, const TensorView<StorageCPU, Out, ndim> out,
         const TensorView<StorageCPU, const In, ndim>& in,
         const std::array<TensorView<StorageCPU, const W, 1>, axes>& windows,
         const std::array<W, axes>& scales = uniform_array<axes, W>(1.f)) {
Reviewer (Contributor):
Any real life use case for per-axis scale? They will be multiplied anyway.

@klecki (Author):
Not really, got carried away, will simplify.

Reviewer (Contributor):
It's still there - simply missed or intentional?

@klecki (Author):
Missed, now it's here
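For context on the reviewer's point: each convolution pass is linear, so a scale applied per axis is equivalent to one scale equal to the product of the per-axis scales, and separate per-axis scales add no expressive power. A tiny illustration (plain C++, not the kernel code; the 1x1 "image" and 1-tap windows are chosen so the arithmetic stays exact):

#include <cassert>

int main() {
  // Scaling each 1D pass by s0 and s1 is the same as scaling the combined
  // result once by s0 * s1, because each pass is linear in its input.
  float x = 3.f, w0 = 2.f, w1 = 5.f;   // a 1x1 "image" and two 1-tap windows
  float s0 = 0.5f, s1 = 4.f;
  float per_axis = ((x * w0) * s0) * w1 * s1;  // scale applied after each pass
  float fused    = (x * w0) * w1 * (s0 * s1);  // single fused scale
  assert(per_axis == fused);
  return 0;
}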

KernelRequirements req;

ScratchpadEstimator se;
se.add<W>(AllocType::Host, volume(in_shape));
@mzient (Contributor) commented Jun 15, 2020:
Why do you want the intermediate image to be of type W? What if you have float input/output and integer weights?

Reviewer (Contributor):
Also, you don't need this buffer when your intermediate element type is the same as Out.

@klecki (Author):

> Why do you want the intermediate image to be of type W? What if you have float input/output and integer weights?

For now I assume that W is float. It can be generalized as well, and we can parametrize every possible step here with a configurable type. Do you want me to do it?

> Also, you don't need this buffer when your intermediate element type is the same as Out.

Yes, I don't need it, but the in-place first step will be slower, and for everything other than W == Out it's still needed, so I just opted for that.

@klecki (Author):
As for the intermediate data, maybe it would indeed be better to just store the result of the arithmetic operation. Will do that.
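A rough sketch of the direction settled on above - illustrative only, not the actual DALI code: derive the intermediate element type from the arithmetic performed in a pass rather than from the window type W, and skip the scratch buffer only when that type matches Out. The helper names acc_t and run_separable are made up for this example.

#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical: the intermediate element type follows the arithmetic done in a
// single pass (input sample * window coefficient, accumulated in float), instead
// of being hard-coded to the window type W.
template <typename In, typename W>
using acc_t = decltype(std::declval<In>() * std::declval<W>() * 1.0f);

template <typename Out, typename In, typename W>
void run_separable(std::vector<Out>& out, const std::vector<In>& in) {
  using Acc = acc_t<In, W>;
  if (std::is_same<Acc, Out>::value) {
    // No extra scratch buffer is strictly required: the intermediate result can
    // live in `out` itself (with the slower in-place first pass noted above).
  } else {
    // Otherwise keep an intermediate image in the accumulator type between passes:
    // pass 0: in -> tmp, pass 1: tmp -> out (2D case), and so on.
    std::vector<Acc> tmp(in.size());
    (void)tmp;
  }
  (void)out;
  (void)in;
}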

Comment on lines 175 to 197
template <typename Out, typename In, typename W>
struct SeparableConvolutionCpu<Out, In, W, 1, false>
: public SeparableConvolutionCpuImpl<Out, In, W, 1, false> {};

template <typename Out, typename In, typename W>
struct SeparableConvolutionCpu<Out, In, W, 2, true>
: public SeparableConvolutionCpuImpl<Out, In, W, 1, true> {};

template <typename Out, typename In, typename W>
struct SeparableConvolutionCpu<Out, In, W, 2, false>
: public SeparableConvolutionCpuImpl<Out, In, W, 2, false> {};

template <typename Out, typename In, typename W>
struct SeparableConvolutionCpu<Out, In, W, 3, true>
: public SeparableConvolutionCpuImpl<Out, In, W, 2, true> {};

template <typename Out, typename In, typename W>
struct SeparableConvolutionCpu<Out, In, W, 3, false>
: public SeparableConvolutionCpuImpl<Out, In, W, 3, false> {};

template <typename Out, typename In, typename W>
struct SeparableConvolutionCpu<Out, In, W, 4, true>
: public SeparableConvolutionCpuImpl<Out, In, W, 3, true> {};
Reviewer (Contributor):
What is it for? Why not just rename XxxImpl to Xxx?

@klecki (Author):

So the kernel is parameterized with the total number of dimensions, not only the data dimensions plus a bool for the channels.

Reviewer (Contributor):

It's a personal opinion, but I find it harder to use this way - I find it more intuitive to parameterize a 2D convolution with and without channels as <2>, rather than <2> or <3>. From the operator standpoint, you still need to handle it explicitly, either in a value switch or an if, so there's no difference.

@klecki (Author):

Removed; now it's less code, which has some benefits.
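To illustrate the trade-off being discussed (a hypothetical operator-side dispatch, not code from this PR): with the kernel parameterized by the number of data axes plus a has_channels flag, the caller still resolves the runtime layout with an explicit switch, so the ndim-based wrapper specializations removed here did not save any dispatch code.

// Hypothetical kernel template, parameterized by data axes plus a channel flag.
template <typename Out, typename In, typename W, int axes, bool has_channels>
struct SeparableConvolutionCpu { /* ... */ };

// The operator resolves the runtime layout explicitly either way; parameterizing
// the kernel by total ndim instead of (axes, has_channels) would not remove this switch.
template <typename Out, typename In, typename W>
void dispatch(int sample_ndim, bool has_channels) {
  int axes = has_channels ? sample_ndim - 1 : sample_ndim;
  switch (axes) {
    case 1:
      if (has_channels) { SeparableConvolutionCpu<Out, In, W, 1, true> k; (void)k; /* setup & run */ }
      else              { SeparableConvolutionCpu<Out, In, W, 1, false> k; (void)k; /* setup & run */ }
      break;
    case 2:
      if (has_channels) { SeparableConvolutionCpu<Out, In, W, 2, true> k; (void)k; }
      else              { SeparableConvolutionCpu<Out, In, W, 2, false> k; (void)k; }
      break;
    case 3:
      if (has_channels) { SeparableConvolutionCpu<Out, In, W, 3, true> k; (void)k; }
      else              { SeparableConvolutionCpu<Out, In, W, 3, false> k; (void)k; }
      break;
  }
}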

Comment on lines 42 to 44
* TODO(klecki): For more dimensions, fusing a permute step when writing the result
* could allow for processing all steps with innermost, contiguous dimension.
* For example DHWC->DWHC->HWDC->DHWC, while applying convolutions for W, H, D respectively.
Reviewer (Contributor):
I wouldn't mark it as TODO - it doesn't seem like a good idea to transpose on the fly; when we employ some automatic vectorization, it will most certainly be defeated by transposition.

@klecki (Author):
Done

KernelRequirements req;

ScratchpadEstimator se;
se.add<W>(AllocType::Host, volume(in_shape));
@mzient (Contributor) commented Jun 15, 2020:
Likewise - you may get by without this buffer.

@klecki (Author):

I know, but then I would probably get a comment that it can be faster. I also left it the same for all variants for simplicity.

@dali-automaton (Collaborator): CI MESSAGE: [1395876]: BUILD PASSED

@dali-automaton (Collaborator): CI MESSAGE: [1395928]: BUILD PASSED

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki (Author) commented Jun 18, 2020:

!build

@dali-automaton (Collaborator): CI MESSAGE: [1406047]: BUILD STARTED

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@dali-automaton (Collaborator): CI MESSAGE: [1406047]: BUILD PASSED

@klecki (Author) commented Jun 18, 2020:

!build

@dali-automaton (Collaborator): CI MESSAGE: [1406394]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [1406394]: BUILD FAILED

@klecki (Author) commented Jun 19, 2020:

!build

@dali-automaton (Collaborator): CI MESSAGE: [1408269]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [1408269]: BUILD PASSED

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki (Author) commented Jun 19, 2020:

!build

@dali-automaton (Collaborator): CI MESSAGE: [1408534]: BUILD STARTED

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki (Author) commented Jun 19, 2020:

!build

@dali-automaton (Collaborator): CI MESSAGE: [1408572]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [1408572]: BUILD PASSED

@klecki merged commit d2ebb14 into NVIDIA:master on Jun 19, 2020
@klecki deleted the separable-convolution branch on June 19, 2020 15:14