Add SeparableConvolutionGPU kernel #2311
Conversation
Supports frames-first and channel-last layouts; wraps ConvolutionGPU, applying several passes, for 1 to 3 data axes. Simple sanity test; the convolution itself is already tested. Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
/**
 * @brief Apply convolution in all spatial axes, starting from the innermost to outermost.
 * If channel axis is present, the convolution is not applied there.
 * If it is marqed as sequence, the first data axis is considered as temporal axis
Suggested change:
- * If it is marqed as sequence, the first data axis is considered as temporal axis
+ * If it is marked as sequence, the outermost dimension denotes frames and
+ * convolution is not applied to it.
Done
req.AddInputSet(req_inner, false);
req.AddInputSet(req_outer, false);
Suggested change:
- req.AddInputSet(req_inner, false);
- req.AddInputSet(req_outer, false);
+ req.AddInputSet(req_inner, true);
+ req.AddInputSet(req_outer, true);
Can't you reuse the scratchpad? It should be possible.
I see that SeparableConvolutionXXX are the only kernels using this function. And... it's buggy. Also, it's used incorrectly, since AddInputSet concatenates the list of outputs.
Huh, missed that append. I think it was also used in something else; I just wanted a way of accumulating the requirements without dealing with the internals of the KernelRequirements representation. Will fix.
Done
req.AddInputSet(req_inner, false);
req.AddInputSet(req_middle, false);
req.AddInputSet(req_outer, false);
Suggested change:
- req.AddInputSet(req_inner, false);
- req.AddInputSet(req_middle, false);
- req.AddInputSet(req_outer, false);
+ req.AddInputSet(req_inner, true);
+ req.AddInputSet(req_middle, true);
+ req.AddInputSet(req_outer, true);
?
static constexpr int axes = 1;
static constexpr int sequence_axes = static_cast<int>(is_sequence);
static constexpr int channel_axes = static_cast<int>(has_channels);
static constexpr int ndim = sequence_axes + axes + channel_axes;
It seems to be the same for all specializations; can we extract this?
Do you have any idea how to make it better? I can maybe put it in some

template <int axes, bool is_sequence, bool has_channels>
struct calc_ndim {
  static constexpr int ndim = static_cast<int>(is_sequence) + axes + static_cast<int>(has_channels);
};

and skip the sequence_axes & channel_axes members, but I'm not sure how much better it is.
Just an observation; if you don't see any better way, leave it as it is.
Maybe a constexpr function could also work.
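A minimal sketch of such a constexpr helper, assuming the same template parameters as the specializations above (hypothetical, not part of this PR):

// Hypothetical helper: is_sequence and has_channels each contribute one extra axis.
constexpr int calc_ndim(int axes, bool is_sequence, bool has_channels) {
  return static_cast<int>(is_sequence) + axes + static_cast<int>(has_channels);
}

// Usage in a specialization (sketch):
// static constexpr int ndim = calc_ndim(1, is_sequence, has_channels);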
Hmm, actually 3 of those member variables are used directly in some kernel parameter definitions or calls, so I will just leave it.
!build
CI MESSAGE: [1666358]: BUILD STARTED
int num_samples_;

void SetDataShape() {
  TensorShape<> target_shape = {64, 64, 64};
Use non-cubic data.
Done
target_shape = shape_cat(target_shape, 3);
if (kFrames)
  target_shape = shape_cat(20, target_shape);
data_shape_ = uniform_list_shape<kNdim>(1, target_shape.to_static<kNdim>());
Use a non-uniform list shape in the batch.
Done
auto req = kernel_gpu.Setup(ctx_gpu, data_shape_, window_dims_);

ScratchpadAllocator scratch_alloc;
scratch_alloc.Reserve(req.scratch_sizes);
auto scratchpad = scratch_alloc.GetScratchpad();
ctx_gpu.scratchpad = &scratchpad;

kernel_gpu.Run(ctx_gpu, out_gpu_v, in_gpu_v, window_v);

auto out_cpu_v = output_.cpu(0);
cudaDeviceSynchronize();
CUDA_CALL(cudaGetLastError());
Check(out_cpu_v, baseline_out_v);
Run Setup and Run at least twice with different data to make sure that there's no accumulation of state in some internal vectors or something like that.
done
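A minimal sketch of what the doubled Setup/Run could look like in the test above; FillData(iter) is a hypothetical overload that varies the input per iteration, and the baseline would be recomputed accordingly:

for (int iter = 0; iter < 2; iter++) {
  FillData(iter);  // hypothetical: fill input and windows differently each iteration
  auto req = kernel_gpu.Setup(ctx_gpu, data_shape_, window_dims_);
  ScratchpadAllocator scratch_alloc;
  scratch_alloc.Reserve(req.scratch_sizes);
  auto scratchpad = scratch_alloc.GetScratchpad();
  ctx_gpu.scratchpad = &scratchpad;
  kernel_gpu.Run(ctx_gpu, out_gpu_v, in_gpu_v, window_v);
  cudaDeviceSynchronize();
  CUDA_CALL(cudaGetLastError());
  Check(output_.cpu(0), baseline_out_v);  // baseline recomputed per iteration
}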
CI MESSAGE: [1666358]: BUILD PASSED
dali/kernels/kernel_req.h
Outdated
  return result;
}

inline scratch_sizes_t GetSumScratch(const scratch_sizes_t &a, const scratch_sizes_t &b) {
Suggested change:
- inline scratch_sizes_t GetSumScratch(const scratch_sizes_t &a, const scratch_sizes_t &b) {
+ inline scratch_sizes_t AppendScratchSize(const scratch_sizes_t &a, const scratch_sizes_t &b, int alignment = 64) {
Also, use this function in AddInputSet - the existing implementation is buggy.
done
dali/kernels/kernel_req.h
Outdated
inline scratch_sizes_t GetSumScratch(const scratch_sizes_t &a, const scratch_sizes_t &b) {
  scratch_sizes_t result;
  for (size_t i = 0; i < result.size(); i++) {
    result[i] = a[i] + b[i];
Suggested change:
-     result[i] = a[i] + b[i];
+     result[i] = align_up(a[i], alignment) + b[i];
done
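For reference, a self-contained sketch of the two helpers after the renames and the alignment fix suggested above. The AllocType values are stand-ins to make the sketch compile on its own, and align_up mirrors what an alignment helper typically does (alignment assumed to be a power of two); this is an illustration, not the file's final contents:

#include <array>
#include <cstddef>

// Stand-in for DALI's allocation types, just to keep the sketch self-contained:
enum class AllocType { Host, Pinned, GPU, Unified, Count };
using scratch_sizes_t = std::array<size_t, static_cast<size_t>(AllocType::Count)>;

// Round x up to a multiple of alignment (alignment must be a power of two).
constexpr size_t align_up(size_t x, size_t alignment) {
  return (x + alignment - 1) & ~(alignment - 1);
}

// Element-wise maximum - for passes that can reuse the same scratchpad.
inline scratch_sizes_t MaxScratchSize(const scratch_sizes_t &a, const scratch_sizes_t &b) {
  scratch_sizes_t result;
  for (size_t i = 0; i < result.size(); i++)
    result[i] = a[i] > b[i] ? a[i] : b[i];
  return result;
}

// Aligned sum - for scratch blocks that must live side by side; the first
// block is aligned up so the second one starts at an aligned offset.
inline scratch_sizes_t AppendScratchSize(const scratch_sizes_t &a, const scratch_sizes_t &b,
                                         size_t alignment = 64) {
  scratch_sizes_t result;
  for (size_t i = 0; i < result.size(); i++)
    result[i] = align_up(a[i], alignment) + b[i];
  return result;
}

// In AddInputSet, the scratch sizes of two requirements could then be merged as
// (sketch, assuming a bool that selects scratchpad reuse):
//   scratch_sizes = reuse_scratch ? MaxScratchSize(scratch_sizes, other.scratch_sizes)
//                                 : AppendScratchSize(scratch_sizes, other.scratch_sizes);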
dali/kernels/kernel_req.h
Outdated
@@ -25,13 +25,31 @@
namespace dali {
namespace kernels {

using scratch_sizes_t = std::array<size_t, static_cast<size_t>(AllocType::Count)>;

inline scratch_sizes_t GetMaxScratch(const scratch_sizes_t &a, const scratch_sizes_t &b) {
nitpick
Suggested change:
- inline scratch_sizes_t GetMaxScratch(const scratch_sizes_t &a, const scratch_sizes_t &b) {
+ inline scratch_sizes_t MaxScratchSize(const scratch_sizes_t &a, const scratch_sizes_t &b) {
done
}

void FillData() {
  ConstantFill(input_.cpu(), 1);
This is not a particularly strong test...
It's not; it's mostly a sanity test, as the actual testing is done in ConvolutionGpu and this is a simple wrapper. The Operator brings the full load of tests in Python as well.
Eh, added yet another similar test.
void FillData() {
  ConstantFill(input_.cpu(), 1);
  for (int i = 0; i < kAxes; i++) {
    ConstantFill(kernel_window_[i].cpu(), 1);
How about filling the windows with a pattern: 1 2 3 4 3 2 1? With constant input you could still calculate the reference as a product of the sums of the windows.
done
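As a side note, the reference for that setup has a closed form: each 1D pass over a constant signal multiplies it by the sum of its window, assuming border handling that preserves constants (e.g. clamp or reflect). A self-contained sketch, not taken from the PR:

#include <vector>

// With constant input `fill`, separable convolution yields a constant output
// equal to fill times the product of the window sums.
float ReferenceValue(float fill, const std::vector<std::vector<float>> &windows) {
  float result = fill;
  for (const auto &window : windows) {
    float sum = 0.0f;
    for (float w : window)
      sum += w;
    result *= sum;  // each 1D pass scales a constant signal by its window sum
  }
  return result;
}

For the window 1 2 3 4 3 2 1 the sum is 16, so with fill = 1 and three axes the expected output is 16 * 16 * 16 = 4096 everywhere.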
!build
CI MESSAGE: [1666944]: BUILD STARTED
CI MESSAGE: [1666944]: BUILD PASSED
Why do we need this PR?
It adds a new kernel that will be used in GaussianBlur GPU.
What happened in this PR?
Same as on the CPU: wrap the ConvolutionGpu kernel and apply several passes.
Kernels
Nothing special
Simple sanity gtest
NA
JIRA TASK: [Use DALI-1588 or NA]
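For readers unfamiliar with the approach, a minimal standalone illustration of separable convolution as successive 1D passes: plain C++ over a 2D row-major buffer with clamp-to-edge borders. This shows the idea only, not the kernel's actual GPU implementation:

#include <vector>

// One 1D convolution pass along a chosen axis of a 2D image (row-major).
// A separable 2D convolution is a pass along axis 1 followed by a pass along axis 0.
std::vector<float> ConvolvePass(const std::vector<float> &in, int rows, int cols,
                                const std::vector<float> &window, int axis) {
  int radius = static_cast<int>(window.size()) / 2;
  int extent = axis == 0 ? rows : cols;
  std::vector<float> out(in.size());
  for (int r = 0; r < rows; r++) {
    for (int c = 0; c < cols; c++) {
      float acc = 0.0f;
      for (int k = 0; k < static_cast<int>(window.size()); k++) {
        int pos = (axis == 0 ? r : c) + k - radius;
        pos = pos < 0 ? 0 : (pos >= extent ? extent - 1 : pos);  // clamp to edge
        int rr = axis == 0 ? pos : r;
        int cc = axis == 0 ? c : pos;
        acc += window[k] * in[rr * cols + cc];
      }
      out[r * cols + c] = acc;
    }
  }
  return out;
}

// Usage: two passes implement the separable 2D convolution,
// innermost axis first, as in the kernel's doc comment above.
// auto tmp = ConvolvePass(img, rows, cols, window_inner, 1);
// auto result = ConvolvePass(tmp, rows, cols, window_outer, 0);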