
Erase GPU operator #1971

Merged: 5 commits into NVIDIA:master, May 22, 2020

Conversation

@banasraf (Collaborator):

Why do we need this PR?


  • It adds a GPU erase operator needed for audio support.

What happened in this PR?


  • What solution was applied:
    The GPU erase kernel was modified to support a different number of erase regions for each sample. The operator implementation is mostly just instantiating the kernel (a hedged sketch of this pattern is included below, after the JIRA reference).
  • Affected modules and functionalities:
    The GPU erase kernel and a new file with the GPU operator.
  • Key points relevant for the review:
    Instantiating the kernel.
  • Validation and testing:
    The existing Python test was extended to cover the GPU backend. The kernel test was extended to cover a varying number of erase regions per sample.
  • Documentation (including examples):
    N/A

JIRA TASK: [DALI-1245]
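
The operator side is thin; for context, here is a minimal, hypothetical sketch of the usual "operator instantiates the kernel" pattern. It is not the code from this PR: the class name EraseImplGpu, the EraseGpu template parameters, and the exact KernelManager/workspace calls are assumptions based on the typical DALI operator/kernel layout.

```cpp
// Hypothetical sketch only; names and signatures are assumptions, not the PR code.
template <typename T, int Dims>
class EraseImplGpu : public OpImplBase<GPUBackend> {
 public:
  explicit EraseImplGpu(const OpSpec &spec) : spec_(spec) {
    kmgr_.Resize<Kernel>(1, 1);  // one thread, one kernel instance
  }

  bool SetupImpl(std::vector<OutputDesc> &output_desc, const DeviceWorkspace &ws) override {
    const auto &input = ws.InputRef<GPUBackend>(0);
    kernels::KernelContext ctx;
    auto &req = kmgr_.Setup<Kernel>(0, ctx, view<const T, Dims>(input), args_);
    output_desc = {{req.output_shapes[0], input.type()}};  // erase preserves the input shape
    return true;
  }

  void RunImpl(DeviceWorkspace &ws) override {
    const auto &input = ws.InputRef<GPUBackend>(0);
    auto &output = ws.OutputRef<GPUBackend>(0);
    kernels::KernelContext ctx;
    ctx.gpu.stream = ws.stream();
    // The operator does little beyond forwarding views and arguments to the kernel.
    kmgr_.Run<Kernel>(0, 0, ctx, view<T, Dims>(output), view<const T, Dims>(input), args_);
  }

 private:
  using Kernel = kernels::EraseGpu<T, Dims>;       // assumed kernel name and parameters
  const OpSpec &spec_;
  kernels::KernelManager kmgr_;
  std::vector<kernels::EraseArgs<T, Dims>> args_;  // per-sample erase regions
};
```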

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
@klecki (Contributor) left a comment:

Rather minor things, otherwise looks ok.

}

private:
OpSpec spec_;
@klecki (Contributor):

That smells of bad design: the base Operator class already has this field. Maybe we should rework OpImplBase into something that doesn't require this kind of ugly trick.

@banasraf (Collaborator, Author):

changed to const OpSpec&
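
For clarity, a minimal sketch of that resolution, assuming the impl keeps a reference to the OpSpec owned by the enclosing operator (names are illustrative, not the PR code):

```cpp
// Illustrative only: hold a reference to the operator's OpSpec instead of a copy.
class EraseImpl : public OpImplBase<GPUBackend> {
 public:
  explicit EraseImpl(const OpSpec &spec) : spec_(spec) {}

 private:
  const OpSpec &spec_;  // refers to the spec stored in the base Operator, no duplicate
};
```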

@NVIDIA deleted a comment from banasraf on May 20, 2020
Rafal added 2 commits May 21, 2020 16:19
Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
@banasraf (Collaborator, Author):

!build

@@ -309,7 +310,7 @@ struct do_copy_or_erase {

template <int channel_dim = -1, typename T, int ndim = 2>
__global__ void erase_gpu_impl(erase_sample_desc<T, ndim> *samples, ivec<ndim> region_shape,
@klecki (Contributor):

Suggested change
__global__ void erase_gpu_impl(erase_sample_desc<T, ndim> *samples, ivec<ndim> region_shape,
__global__ void erase_gpu_impl(const erase_sample_desc<T, ndim> *samples, ivec<ndim> region_shape,

@banasraf (Collaborator, Author):

done

@dali-automaton (Collaborator):

CI MESSAGE: [1337975]: BUILD STARTED

Comment on lines 512 to 514
auto *sample_desc_gpu = ctx.scratchpad->ToGPU(stream, make_span(sample_desc_cpu, num_samples));
auto* fill_values_gpu =
ctx.scratchpad->ToGPU(stream, make_span(fill_values_cpu, num_fill_values));
@klecki (Contributor):

A better option would be to use ToContiguousGPU; it will issue just one cudaMemcpy.

It's not in the scope of this task, I guess; however, if we keep some sane limit on the number of channels, the fill values could be copied to a __constant__ buffer. That should improve performance, since the fill values won't have to be read from global memory and won't compete for cache with the input.
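
The __constant__ idea was left out of this PR; a rough CUDA illustration of what it describes might look like the following (kMaxChannels, the kernel, and the float element type are assumptions for the sake of the example):

```cpp
// Illustration only, not part of this PR: keep fill values in constant memory.
static constexpr int kMaxChannels = 16;         // assumed "sane limit" on channels
__constant__ float const_fill_values[kMaxChannels];

__global__ void fill_from_constant(float *out, int n, int channel) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = const_fill_values[channel];        // served from the constant cache, not global memory
}

// Host side, before the kernel launch:
//   cudaMemcpyToSymbol(const_fill_values, fill_values_cpu,
//                      num_fill_values * sizeof(float));
```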

@banasraf (Collaborator, Author):

Used ToContiguousGPU
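
A sketch of the resulting call, assuming Scratchpad::ToContiguousGPU accepts the same collections and returns a tuple of device pointers populated with a single cudaMemcpy (variable names follow the snippet above):

```cpp
// Assumed shape of the change, not the exact PR code.
auto gpu_ptrs = ctx.scratchpad->ToContiguousGPU(
    stream,
    make_span(sample_desc_cpu, num_samples),
    make_span(fill_values_cpu, num_fill_values));
auto *sample_desc_gpu = std::get<0>(gpu_ptrs);  // erase_sample_desc<T, ndim> *
auto *fill_values_gpu = std::get<1>(gpu_ptrs);  // T *
```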

Comment on lines 72 to 73
const T *in = nullptr;
T* out = nullptr;
@klecki (Contributor):

Suggested change
const T *in = nullptr;
T* out = nullptr;
const T *__restrict__ in = nullptr;
T *__restrict__ out = nullptr;

@banasraf (Collaborator, Author):

done

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
return {reinterpret_cast<ptrs_t>(tlv.data.data()), new_shape};
}

template <int ndim, typename Storage>
@klecki (Contributor):

I don't think this overload is necessary; a non-const TensorListView should convert implicitly to a const one.
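
A small example of the conversion in question, assuming TensorListView's usual converting constructor; the consume function is hypothetical and only illustrates the call site:

```cpp
// Hypothetical consumer taking a view over const elements.
void consume(const TensorListView<StorageGPU, const float, 2> &tlv);

void example(const TensorListView<StorageGPU, float, 2> &mutable_view) {
  consume(mutable_view);  // non-const element view converts implicitly; no extra overload needed
}
```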

@@ -126,8 +126,8 @@ std::tuple<std::remove_cv_t<element_t<Collections>>*...>
ToContiguousGPUMem(Scratchpad &scratchpad, cudaStream_t stream, const Collections &... c) {
const size_t N = sizeof...(Collections);
static_assert(
all_of<std::is_pod<std::remove_cv_t<element_t<Collections>>>::value...>::value,
"ToContiguousGPUMem must be used with collections of POD types");
all_of<std::is_trivially_copyable<std::remove_cv_t<element_t<Collections>>>::value...>::value,
@klecki (Contributor):

There's one more is_pod in this file - please change it, too.

regions_shape.set_tensor_shape(i, {n_regions, 2, Dims});
}
TensorList<CPUBackend> regions_cpu;
regions_cpu.set_type(TypeTable::GetTypeInfo(TypeTable::GetTypeID<int32_t>()));
@klecki (Contributor):

Why not:

Suggested change
regions_cpu.set_type(TypeTable::GetTypeInfo(TypeTable::GetTypeID<int32_t>()));
regions_cpu.set_type(TypeInfo::Create<ibox<Dims>>());

?

@banasraf (Collaborator, Author):

done

TensorList<CPUBackend> regions_cpu;
regions_cpu.set_type(TypeTable::GetTypeInfo(TypeTable::GetTypeID<int32_t>()));
regions_cpu.Resize(regions_shape);
auto regions_tlv = detail::as_boxes<Dims>(view<int32_t, 3>(regions_cpu));
@klecki (Contributor):

And you don't need this as_boxes thing at all, you can take a view<ibox<Dims>>(regions_cpu) and place the data in it directly. AFAIK it should work.

@banasraf (Collaborator, Author):

done
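
A rough sketch of the adopted approach, under the assumption that the regions TensorList is typed as ibox<Dims> with one box per region and that ibox can be brace-initialized from {lo, hi}:

```cpp
// Sketch only; the element type and shapes follow the suggestion above.
TensorList<CPUBackend> regions_cpu;
regions_cpu.set_type(TypeInfo::Create<ibox<Dims>>());
regions_cpu.Resize(regions_shape);                        // now one ibox<Dims> per region
auto regions_tlv = view<ibox<Dims>, 1>(regions_cpu);      // typed view, no as_boxes reinterpretation
regions_tlv[0].data[0] = {ivec<Dims>(0), ivec<Dims>(1)};  // hypothetical region {lo, hi}
```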

Signed-off-by: Rafal <Banas.Rafal97@gmail.com>
@banasraf (Collaborator, Author):

!build

@dali-automaton (Collaborator):

CI MESSAGE: [1340155]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [1340155]: BUILD PASSED

@banasraf banasraf merged commit 175e9ff into NVIDIA:master May 22, 2020