Separable convolution #2009
Force-pushed from 398df59 to 206941e.
!build
CI MESSAGE: [1395861]: BUILD STARTED
Wraps Convolution CPU kernel
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Force-pushed from 206941e to 3ee9b3e.
!build
CI MESSAGE: [1395876]: BUILD STARTED
CI MESSAGE: [1395928]: BUILD STARTED
void Run(KernelContext& ctx, const TensorView<StorageCPU, Out, ndim> out,
         const TensorView<StorageCPU, const In, ndim>& in,
         const std::array<TensorView<StorageCPU, const W, 1>, axes>& windows,
         const std::array<W, axes>& scales = uniform_array<axes, W>(1.f)) {
Any real life use case for per-axis scale? They will be multiplied anyway.
Not really, got carried away, will simplify.
It's still there - simply missed or intentional?
Missed, now it's here
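For illustration, a standalone sketch (not DALI code) of the point above: since the separable passes compose multiplicatively, per-axis scales are equivalent to a single scale equal to their product, so one scalar parameter suffices.

// Standalone sketch, not DALI code: per-axis scales s0, s1, ... applied in
// successive separable passes end up multiplied together, so they can be
// collapsed into a single scale factor.
#include <array>
#include <cstdio>
#include <functional>
#include <numeric>

int main() {
  std::array<float, 3> per_axis_scales = {0.5f, 2.0f, 0.25f};
  float combined = std::accumulate(per_axis_scales.begin(), per_axis_scales.end(),
                                   1.0f, std::multiplies<float>());
  std::printf("equivalent single scale: %f\n", combined);  // prints 0.250000
}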
KernelRequirements req;

ScratchpadEstimator se;
se.add<W>(AllocType::Host, volume(in_shape));
Why do you want the intermediate image to be of type W? What if you have float input/output and integer weights?
Also, you don't need this buffer when your intermediate element type is the same as Out.
Why do you want the intermediate image to be of type W? What if you have float input/output and integer weights?

For now I assume that W is float. It can be generalized as well, and we can parametrize every possible step here with a configurable type. Do you want me to do it?

Also, you don't need this buffer when your intermediate element type is the same as Out.

Yes, I don't need it, but doing the first step in place would be slower, and for everything other than W == Out the buffer is still needed, so I just opted for that.
As for the intermediate data, maybe it would indeed be better to just store the result of the arithmetic operation. Will do that.
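A minimal standalone sketch (not the actual DALI code) of the approach agreed on above, i.e. deriving the intermediate element type from "the result of the arithmetic operation" rather than hard-coding W, so float input with integer weights keeps a floating-point intermediate:

// Standalone sketch, not DALI code: derive the intermediate element type from
// the multiplication of input and window types instead of hard-coding W.
#include <type_traits>
#include <utility>

template <typename In, typename W>
using Intermediate = decltype(std::declval<In>() * std::declval<W>());

// float input with integer weights -> float intermediate, not int.
static_assert(std::is_same<Intermediate<float, int>, float>::value, "");
// uint8 input with float weights -> float intermediate.
static_assert(std::is_same<Intermediate<unsigned char, float>, float>::value, "");

int main() {}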
template <typename Out, typename In, typename W>
struct SeparableConvolutionCpu<Out, In, W, 1, false>
    : public SeparableConvolutionCpuImpl<Out, In, W, 1, false> {};

template <typename Out, typename In, typename W>
struct SeparableConvolutionCpu<Out, In, W, 2, true>
    : public SeparableConvolutionCpuImpl<Out, In, W, 1, true> {};

template <typename Out, typename In, typename W>
struct SeparableConvolutionCpu<Out, In, W, 2, false>
    : public SeparableConvolutionCpuImpl<Out, In, W, 2, false> {};

template <typename Out, typename In, typename W>
struct SeparableConvolutionCpu<Out, In, W, 3, true>
    : public SeparableConvolutionCpuImpl<Out, In, W, 2, true> {};

template <typename Out, typename In, typename W>
struct SeparableConvolutionCpu<Out, In, W, 3, false>
    : public SeparableConvolutionCpuImpl<Out, In, W, 3, false> {};

template <typename Out, typename In, typename W>
struct SeparableConvolutionCpu<Out, In, W, 4, true>
    : public SeparableConvolutionCpuImpl<Out, In, W, 3, true> {};
What is it for? Why not just rename XxxImpl to Xxx?
So the kernel is parametrized with the number of actual dimensions, and not only the data dimensions plus a bool for the channels.
It's a personal opinion, but I find it harder to use this way. I mean, I find it more intuitive to parameterize 2D convolution with and without channels as <2>, not as <2> or <3>. From the operator standpoint, you still need to handle it explicitly, either in a value switch or an if, so there's no difference.
Removed; now it's less code, which has some benefits.
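For illustration, a standalone mock (not the DALI operator code; all names here are made up) of the point raised above: however the kernel template is specialized, the operator still has to branch explicitly on whether a channel dimension is present, e.g. with an if or a value switch.

// Standalone mock, not DALI code: the operator-side dispatch looks essentially
// the same whether the kernel is parametrized by total dimensions or by
// spatial axes plus a channel flag.
#include <cstdio>

template <int axes, bool has_channels>
struct MockSeparableConv {
  void Run() const {
    std::printf("axes=%d, channels=%d\n", axes, static_cast<int>(has_channels));
  }
};

void RunOperator(bool has_channels) {
  if (has_channels) {
    MockSeparableConv<2, true> k;   // 2D data with an innermost channel dimension
    k.Run();
  } else {
    MockSeparableConv<2, false> k;  // plain 2D data
    k.Run();
  }
}

int main() {
  RunOperator(true);
  RunOperator(false);
}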
* TODO(klecki): For more dimensions, fusing a permute step when writing the result
* could allow processing all steps with the innermost, contiguous dimension.
* For example DHWC->DWHC->HWDC->DHWC, while applying convolutions for W, H, D respectively.
I wouldn't mark it as TODO - it doesn't seem like a good idea to transpose on the fly; when we employ some automatic vectorization, it will most certainly be defeated by transposition.
Done
KernelRequirements req;

ScratchpadEstimator se;
se.add<W>(AllocType::Host, volume(in_shape));
Likewise - you may get by without this buffer.
I know, but then I would probably get a comment that it can be faster. I also left it the same for all variants for simplicity.
CI MESSAGE: [1395876]: BUILD PASSED
CI MESSAGE: [1395928]: BUILD PASSED
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
!build
CI MESSAGE: [1406047]: BUILD STARTED
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
CI MESSAGE: [1406047]: BUILD PASSED
!build
CI MESSAGE: [1406394]: BUILD STARTED
CI MESSAGE: [1406394]: BUILD FAILED
!build
CI MESSAGE: [1408269]: BUILD STARTED
CI MESSAGE: [1408269]: BUILD PASSED
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
!build
CI MESSAGE: [1408534]: BUILD STARTED
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
!build
CI MESSAGE: [1408572]: BUILD STARTED
CI MESSAGE: [1408572]: BUILD PASSED
Why we need this PR?
Adds separable convolution using the Convolution kernel, with one pass over each axis.
What happened in this PR?
SeparableConvolution kernel built by using several passes of the Convolution kernel (a standalone sketch of the idea is included below).
Kernels, kernels tests
Nothing fancy, mostly boilerplate
GTest for the kernel; the baseline implementation moved to a separate file.
[ Describe here if documentation and examples were updated. ]
JIRA TASK: [DALI-1425]
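For readers new to the idea, a standalone sketch (not the DALI kernel, and much simpler than it) of what "several passes of the Convolution kernel" means: a separable 2D convolution is computed as two 1D convolution passes, one per axis, with the intermediate image feeding the second pass.

// Standalone sketch, not DALI code: separable 2D convolution as two 1D passes.
#include <algorithm>
#include <cstdio>
#include <vector>

// Naive 1D convolution along one axis of a row-major H x W image (borders clamped).
std::vector<float> ConvolveAxis(const std::vector<float>& in, int H, int W,
                                const std::vector<float>& window, bool along_rows) {
  std::vector<float> out(in.size());
  int r = static_cast<int>(window.size()) / 2;
  for (int y = 0; y < H; y++) {
    for (int x = 0; x < W; x++) {
      float acc = 0.f;
      for (int k = -r; k <= r; k++) {
        int yy = y, xx = x;
        if (along_rows)
          xx = std::min(std::max(x + k, 0), W - 1);
        else
          yy = std::min(std::max(y + k, 0), H - 1);
        acc += in[yy * W + xx] * window[k + r];
      }
      out[y * W + x] = acc;
    }
  }
  return out;
}

int main() {
  const int H = 4, W = 4;
  std::vector<float> img(H * W, 1.f);
  std::vector<float> win = {0.25f, 0.5f, 0.25f};                  // normalized 1D window
  auto tmp = ConvolveAxis(img, H, W, win, /*along_rows=*/true);   // pass over the W axis
  auto out = ConvolveAxis(tmp, H, W, win, /*along_rows=*/false);  // pass over the H axis
  std::printf("out[0] = %f\n", out[0]);  // constant image stays constant: 1.000000
}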