
Normalize operator for GPU backend #1986

Merged: 10 commits into NVIDIA:master, Jun 2, 2020

Conversation

@mzient (Contributor) commented May 28, 2020

Why do we need this PR?

  • It adds a new feature: a normalization operator for the GPU backend (a formula sketch follows this description)

What happened in this PR?

  • What solution was applied:
    • Split the old operator into a common base and a CPU-specific part
    • Apply some fixes to the normalize kernel
    • Add the operator ;)
      • does not use the kernel manager
      • reuses scratch memory across three different kernels
      • allocates temporary tensors in the scratchpad
    • Add a TensorListView copy function with sample merging
  • Affected modules and functionalities:
    • Normalize operator and kernel
    • dali/core utilities (fix), kernels/copy
  • Key points relevant for the review:
    • N/A
  • Validation and testing:
    • Python tests extended for GPU backend
  • Documentation (including examples):
    • Old docs apply
    • TODO: add the GPU backend to the Jupyter example

JIRA TASK: DALI-1243
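
For reviewers unfamiliar with the operator, here is a rough per-element sketch of what normalization computes, based on the existing operator documentation and the RefNormalize test helper used further down; this is illustrative only, not the kernel code:

// Illustrative per-element reference (assumed semantics, not the actual kernel):
// inv_stddev is either a plain scale factor or the (epsilon-regularized)
// reciprocal of a standard deviation, depending on scale_is_stddev.
float NormalizeElement(float in, float mean, float inv_stddev,
                       float global_scale, float shift) {
  return (in - mean) * inv_stddev * global_scale + shift;
}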

@mzient requested a review from a team May 28, 2020 19:25
@mzient changed the title from "[WIP] Normalize operator for GPU backend" to "Normalize operator for GPU backend" on May 28, 2020
@mzient (Contributor, Author) commented May 28, 2020

!build

@@ -270,7 +270,7 @@ class NormalizeImplGPU {

   // this condition is false when the other Setup overload was used
   if (axes_.data() != axes.data())
-    axes_ = { axes_.begin(), axes_.end() };
+    axes_ = { axes.begin(), axes.end() };
Contributor Author:

This is a bug fix.
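
To make the one-character fix above concrete, here is a standalone illustration (plain std::vector, not the DALI container) of why rebuilding axes_ from its own iterators was wrong: the incoming axes were never stored.

#include <cassert>
#include <vector>

int main() {
  std::vector<int> axes_ = {0, 1};   // previously stored axes
  std::vector<int> axes  = {2, 3};   // new axes passed to Setup

  // this condition is false when the other Setup overload was used
  if (axes_.data() != axes.data())
    axes_ = { axes.begin(), axes.end() };   // fixed line: copy from the argument

  // with the old line (axes_.begin(), axes_.end()), axes_ would still be {0, 1}
  assert(axes_ == axes);
  return 0;
}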

@dali-automaton (Collaborator): CI MESSAGE: [1354384]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [1354384]: BUILD PASSED

Comment on lines +145 to +187
void Normalize<CPUBackend>::AllocTempStorage() {
  const TypeInfo &float_type = TypeTable::GetTypeInfo(DALI_FLOAT);
  int n = data_shape_.num_samples();
  const TensorListShape<> &tmp_shape = batch_norm_
      ? uniform_list_shape(n, param_shape_[0])  // extend to all samples, to enable parallelism
      : param_shape_;

  if (ShouldCalcMean()) {
    mean_.Resize(tmp_shape);
  } else if (has_tensor_mean_) {
    assert(!batch_norm_);
    // use mean as-is
    assert(param_shape_ == mean_input_.shape);
  } else if (has_scalar_mean_) {
    // need to broadcast mean to match required shape
    if (is_uniform(param_shape_)) {
      // if param_shape_ is uniform, we need only one tensor
      mean_.Resize(TensorListShape<>({ param_shape_[0] }));
    } else {
      mean_.Resize(param_shape_);
    }
  }
  if (ShouldCalcStdDev()) {
    inv_stddev_.Resize(tmp_shape);
  } else if (has_tensor_stddev_) {
    assert(!batch_norm_);
    // we need space to calculate inverse stddev
    inv_stddev_.Resize(stddev_input_.shape);
  } else {
    assert(has_scalar_stddev_);
    if (!IsFullReduction()) {
      // need to broadcast stddev to match required shape
      if (is_uniform(param_shape_)) {
        // if param_shape_ is uniform, we need only one tensor
        inv_stddev_.Resize(TensorListShape<>({ param_shape_[0] }));
      } else {
        inv_stddev_.Resize(param_shape_);
      }
    }
  }
  mean_.set_type(float_type);
  inv_stddev_.set_type(float_type);
}
Contributor Author:

This function is just moved from the header to .cc, do not look too much into this.

Contributor:

You should write this at the beginning of the function.

@mzient (Contributor, Author), May 29, 2020:

A GitHub comment on multiple lines appears at the bottom - I marked the whole function and it was highlighted, so I thought it would stay that way :(

@mzient (Contributor, Author) commented May 29, 2020

!build

Comment on lines +249 to +273
  for (int iter = 0; iter < 3; iter++) {
    auto req = kmgr_.Setup<Kernel>(0, ctx, data_shape_, param_shape_,
                                   use_scalar_base_, use_scalar_scale_, scale_is_stddev_);
    ASSERT_EQ(req.output_shapes.size(), 1u);
    ASSERT_EQ(req.output_shapes[0], data_shape_);
    out_.reshape(data_shape_);
    ref_.reshape(data_shape_);

    Launch(ctx);

    int param_samples = param_shape_.num_samples();
    auto ref_base = use_scalar_base_
        ? ScalarTLV(scalar_base_, param_samples, data_shape_.sample_dim())
        : base_.cpu();
    auto ref_scale = use_scalar_scale_
        ? ScalarTLV(scalar_scale_, param_samples, data_shape_.sample_dim())
        : scale_.cpu();
    RefNormalize(ref_.cpu(), in_.cpu(), ref_base, ref_scale,
                 global_scale_, shift_, scale_is_stddev_, epsilon_);

    if (scale_is_stddev_ && !std::is_integral<Out>::value)
      Check(out_.cpu(), ref_.cpu(), EqualEpsRel(1e-6, 1e-6));
    else
      Check(out_.cpu(), ref_.cpu(), EqualUlp(4));
  }
Contributor Author:

Just added the loop - no changes inside.

Comment on lines +215 to +223
  for (int iter = 0; iter < 3; iter++) {
    test.Setup(in_shape, ref_out_shape, make_span(axes), false, true);
    EXPECT_GE(test.kernel.GetNumStages(), 4);  // both reduced axes must be split
    test.FillData(0, 255);
    test.Run();

    RefMean<int64_t>(test.ref.cpu(), test.in.cpu(), make_span(axes), false, true);
    test.Check(EqualEpsRel(1e-5, 1e-6));
  }
Contributor Author:

Just added the loop - no changes inside.

Comment on lines +251 to +260
  for (int iter = 0; iter < 3; iter++) {
    test.Setup(in_shape, ref_out_shape, make_span(axes), true, false);
    EXPECT_GE(test.kernel.GetNumStages(), 4);  // both reduced axes must be split
    test.FillData(-100, 100);

    test.ref = RefStdDev(test.in.cpu(), mean_cpu);
    test.Run(fake_mean.gpu());

    test.Check(EqualEpsRel(1e-5, 1e-6));
    test.ref = RefStdDev(test.in.cpu(), mean_cpu);
    test.Check(EqualEpsRel(1e-5, 1e-6));
  }
Contributor Author:

Just added the loop - no changes inside.

Comment on lines +285 to +294
  for (int iter = 0; iter < 3; iter++) {
    test.Setup(in_shape, ref_out_shape, make_span(axes), true, false);
    EXPECT_GE(test.kernel.GetNumStages(), 2);  // both reduced axes must be split
    test.FillData(-100, 100);

    test.ref = RefStdDev(test.in.cpu(), mean_cpu);
    test.Run(fake_mean.gpu());

    test.Check(EqualEpsRel(1e-5, 1e-6));
    test.ref = RefStdDev(test.in.cpu(), mean_cpu);
    test.Check(EqualEpsRel(1e-5, 1e-6));
  }
Contributor Author:

Just added the loop - no changes inside.

Comment on lines +351 to +360
  for (int iter = 0; iter < 3; iter++) {
    test.Setup(in_shape, ref_out_shape, make_span(axes), true, true);
    EXPECT_GE(test.kernel.GetNumStages(), 2);  // both reduced axes must be split
    test.FillData(-100, 100);

    test.ref = RefStdDev(test.in.cpu(), mean_cpu, 1, 12000, true);
    test.Run(fake_mean.gpu(), 1, 12000);

    test.Check(EqualEpsRel(1e-5, 1e-6));
    test.ref = RefStdDev(test.in.cpu(), mean_cpu, 1, 12000, true);
    test.Check(EqualEpsRel(1e-5, 1e-6));
  }
Contributor Author:

Just added the loop - no changes inside.

@@ -743,6 +745,7 @@ class ReduceImplGPU {
   void InitStages() {
     const int nsamples = in_shape_.num_samples();
     const int in_dim = in_shape_.sample_dim();
+    stages_.clear();
Contributor Author:

This is an important bug fix.
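
A minimal standalone sketch of the failure mode this clear() guards against (hypothetical class, not the actual ReduceImplGPU): if Setup()/InitStages() runs more than once on the same object, as the new multi-iteration tests do, stages from the previous run would otherwise accumulate.

#include <cassert>
#include <vector>

struct ReduceImplSketch {
  std::vector<int> stages_;
  void InitStages(int num_stages) {
    stages_.clear();                  // the fix: start from an empty stage list
    for (int i = 0; i < num_stages; i++)
      stages_.push_back(i);
  }
};

int main() {
  ReduceImplSketch r;
  r.InitStages(4);
  r.InitStages(2);                    // e.g. a second test iteration reusing the kernel
  assert(r.stages_.size() == 2);      // without clear(), the size would be 6
  return 0;
}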

Comment on lines +41 to +47
  for (int iter = 0; iter < 3; iter++) {
    test.Setup(in_shape, ref_out_shape, make_span(axes), false, true);
    test.FillData(0, 255);
    test.Run();
    RefReduce(test.ref.cpu(), test.in.cpu(), make_span(axes), false, true, reductions::sum());
    test.Check();
  }
Contributor Author:

Just added the loop - no changes inside.


def axes2names(axes, layout='abcdefghijklmnopqrstuvwxyz'):
    return "".join([layout[axis] for axis in axes])

def _test_up_to_5D_all_axis_combinations(device):
-    batch_size = 10
+    batch_size = 5
Contributor Author:

Batch size is reduced, but there are two iterations now.

@dali-automaton (Collaborator): CI MESSAGE: [1356273]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [1356273]: BUILD FAILED

/**
* @brief Copy in to out, merging contiguous samples.
*
* Contiguous samples are merged to reduce number of copies issuead and, at the same time,
Contributor:

Suggested change
- * Contiguous samples are merged to reduce number of copies issuead and, at the same time,
+ * Contiguous samples are merged to reduce number of copies issued and, at the same time,
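
For context, a simplified standalone sketch of the sample-merging idea described in the doc comment above; the Sample struct and function name are made up for illustration, and the real helper works on TensorListViews and also handles input/output runs of different lengths (hence the asserts discussed further down):

#include <cstring>
#include <vector>

struct Sample { const char *in; char *out; size_t size; };

void CopyMergingContiguous(const std::vector<Sample> &samples) {
  size_t i = 0;
  while (i < samples.size()) {
    const char *in_start = samples[i].in;
    char *out_start = samples[i].out;
    size_t len = samples[i].size;
    size_t j = i + 1;
    // extend the run while the next sample is contiguous in both buffers
    while (j < samples.size() &&
           samples[j].in == in_start + len &&
           samples[j].out == out_start + len) {
      len += samples[j].size;
      j++;
    }
    std::memcpy(out_start, in_start, len);  // one copy per contiguous run
    i = j;
  }
}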

dali/kernels/common/copy.h: outdated review comment (resolved)
@dali-automaton (Collaborator): CI MESSAGE: [1356273]: BUILD PASSED

}

// We can't just Clear() the scratchpad to reuse it, because temporary buffers are also
// stored there - so let's make a snapshot of current allocation state and restore it
Contributor:

Why snapshot before, instead of cleaning up after?

Contributor Author:

Because the scratchpad is not empty at this point - it contains temporary tensor lists.

Contributor:

Maybe it is not the most efficient in terms of the number of copies, but how about adding a scoped state saver so you can nest it?

Contributor Author:

That's a jolly good idea, done.
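
A rough sketch of the scoped state saver agreed on here, using a hypothetical bump allocator rather than the actual DALI scratchpad API: the saver records the allocator's current usage on construction and restores it on scope exit, so per-kernel scratch is released while the temporary tensors allocated earlier survive, and nested savers compose naturally.

#include <cstddef>
#include <vector>

// Hypothetical bump allocator standing in for the scratchpad.
struct BumpScratchpad {
  std::vector<char> storage = std::vector<char>(1 << 20);
  size_t used = 0;
  void *Alloc(size_t n) { void *p = storage.data() + used; used += n; return p; }
};

// RAII guard: restores the allocation watermark when it goes out of scope.
class ScratchpadStateSaver {
 public:
  explicit ScratchpadStateSaver(BumpScratchpad &s) : scratch_(s), saved_used_(s.used) {}
  ~ScratchpadStateSaver() { scratch_.used = saved_used_; }
 private:
  BumpScratchpad &scratch_;
  size_t saved_used_;
};

void RunThreeKernels(BumpScratchpad &scratch) {
  void *tmp_tensors = scratch.Alloc(1024);   // temporaries that must outlive all kernels
  for (int k = 0; k < 3; k++) {
    ScratchpadStateSaver saver(scratch);     // snapshot before this kernel's allocations
    void *kernel_scratch = scratch.Alloc(4096);
    (void)kernel_scratch;
  }                                          // watermark restored here on each iteration
  (void)tmp_tensors;
}

int main() {
  BumpScratchpad scratch;
  RunThreeKernels(scratch);
  return 0;
}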

@mzient (Contributor, Author) commented May 29, 2020

!build

@dali-automaton (Collaborator): CI MESSAGE: [1356695]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [1356695]: BUILD PASSED

@jantonguirao (Contributor) left a comment:

LGTM, but please check my comments here, and a couple of comments in https://app.reviewnb.com/NVIDIA/DALI/pull/1986/files/


namespace {

template <typename ToUpdate, typename Other>
Contributor:

Suggested change
- template <typename ToUpdate, typename Other>
+ template <typename InOut, typename Other>

just a suggestion

  if (in.data[i] != in_start + in_len || out.data[o] != out_start + out_len) {
    // discontinuity detected
    if (in_len < out_len) {
      assert(in.data[i] != in_start + in_len);
Contributor:

this assert seems a bit redundant to me, since this condition is checked 3 lines above

      continue;
    }
    if (out_len < in_len) {
      assert(out.data[o] != out_start + out_len);
Contributor:

same here

      continue;
    }
    assert(in_len == out_len && "Groups of contiguous samples must have equal length");
    if (out_len)
Contributor:

Suggested change
-    if (out_len)
+    if (out_len > 0)

reads better in my opinion

  assert(i == M && o == N);
  assert(in_len == out_len && "Groups of contiguous samples must have equal length");

  if (out_len)
Contributor:

Suggested change
-  if (out_len)
+  if (out_len > 0)

Just a suggestion

mzient added 10 commits June 2, 2020 10:15, all Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>. Commit messages include:

* Extend reduction tests to multiple iterations.
* Add multiple iterations to python tests for normalize operator.
* Change Normalize tutorial notebook to use GPU backend.
@dali-automaton (Collaborator): CI MESSAGE: [1363718]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [1363718]: BUILD PASSED

@mzient merged commit 841ab73 into NVIDIA:master Jun 2, 2020