
Normalize GPU kernel #1974

Merged · 7 commits · May 27, 2020
Conversation

@mzient (Contributor) commented May 21, 2020

Why we need this PR?


  • It adds NormalizeImplGPU because we want a GPU-based normalization

What happened in this PR?


  • What solution was applied:
    • DropDims as a way to recalculate data/param coordinates
    • trivial slicing by data chunks
    • broadcasting using DropDims
    • full broadcasting (from host-side scalars) is a special case
  • NOTE: the pImpl front-end kernel is not part of this PR; it's in "Normalize GPU - pImpl + Bessel's corrections" #1981
  • Affected modules and functionalities:
    • Reductions - ReduceDims now uses fast_div
  • Key points relevant for the review:
    • The kernel?
  • Validation and testing:
    • Unit tests (GTest)
  • Documentation (including examples):
    • Doxygen
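To illustrate the DropDims idea mentioned above - recalculating parameter coordinates from data coordinates by dropping the reduced axes - here is a host-side sketch. This is hypothetical illustration code, not DALI's actual `DropDims` implementation (which uses `fast_div` on the GPU):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Maps a flat offset in the data tensor to the corresponding flat offset
// in a parameter tensor whose reduced axes have extent 1.
// Walks axes innermost-first, peeling coordinates off with div/mod and
// accumulating only the coordinates of non-reduced axes.
int64_t ParamOffset(int64_t data_offset,
                    const std::vector<int64_t> &data_shape,
                    const std::vector<bool> &reduced) {
  int64_t param_offset = 0, param_stride = 1;
  for (int a = data_shape.size() - 1; a >= 0; a--) {
    int64_t coord = data_offset % data_shape[a];
    data_offset /= data_shape[a];
    if (!reduced[a]) {
      param_offset += coord * param_stride;
      param_stride *= data_shape[a];
    }
  }
  return param_offset;
}
```

For example, in a data tensor of shape {2, 3, 4} with axis 1 reduced, the element at coordinates (1, 2, 3) (flat offset 23) maps to parameter coordinates (1, 3) in shape {2, 4} (flat offset 7).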

JIRA TASK: DALI-1267

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>

Tests - begun.
@mzient mzient requested a review from a team May 21, 2020 17:21
@mzient (Contributor Author) commented May 21, 2020

!build

@dali-automaton (Collaborator):
CI MESSAGE: [1338269]: BUILD STARTED

@dali-automaton (Collaborator):
CI MESSAGE: [1338269]: BUILD FAILED

}

std::pair<dim3, dim3> GetLaunchParams(const TensorListShape<> &data_shape) const {
int64_t block = 1024;
Contributor:

Maybe we should query the GPU for its capabilities - or rather, the user should, and provide the necessary values to the kernel?
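For context, querying the device limit instead of hard-coding 1024 could look roughly like this (plain CUDA runtime API; `MaxBlockSize` is a hypothetical helper name, not DALI code):

```cuda
#include <cuda_runtime.h>

// Query the current device's hard limit on threads per block
// instead of assuming 1024.
int MaxBlockSize() {
  int dev = 0, max_threads = 0;
  cudaGetDevice(&dev);
  cudaDeviceGetAttribute(&max_threads, cudaDevAttrMaxThreadsPerBlock, dev);
  return max_threads;  // 1024 on all recent architectures, but not guaranteed
}
```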

Add performance tests.

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@mzient (Contributor Author) commented May 22, 2020

!build

@dali-automaton (Collaborator):
CI MESSAGE: [1340822]: BUILD STARTED

@dali-automaton (Collaborator):
CI MESSAGE: [1340822]: BUILD PASSED

Launch(ctx);
cudaEventRecord(end, ctx.gpu.stream);
float time;
cudaDeviceSynchronize();
Contributor:

nitpick: isn't it enough to synchronize with ctx.gpu.stream?
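The suggested change could look roughly like this (a fragment, assuming the `start`/`end` events and `ctx` from the surrounding test code):

```cuda
// Synchronize only on the recorded event rather than the whole device;
// cudaEventSynchronize blocks until work preceding the event on
// ctx.gpu.stream has completed.
cudaEventRecord(start, ctx.gpu.stream);
Launch(ctx);
cudaEventRecord(end, ctx.gpu.stream);
cudaEventSynchronize(end);  // instead of cudaDeviceSynchronize()
float time_ms = 0;
cudaEventElapsedTime(&time_ms, start, end);
```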

int64_t base_size = scalar_base_ ? 0 : param_shape_.num_elements() * sizeof(float);
int64_t scale_size = scalar_scale_ ? 0 : param_shape_.num_elements() * sizeof(float);
int64_t data_size = out_size + in_size + base_size + scale_size;
std::cerr << "Throughput: " << data_size / time << " GB/s\n";
Contributor:

Those perf tests aren't really checking anything, just printing the throughput. I am wondering:

  1. How long do those tests take?
  2. Do we want to have those run as part of our unit tests?

If they don't take a lot of time, I wouldn't mind leaving them here anyway.

Contributor Author:

I don't check anything because generating reference data took far too long. The tests take under 5 s on a machine equipped with a Tesla P40; on an RTX 2080 Super I measured 3.6 s for all NormalizeGPU tests.
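As a side note on the printed figure: `cudaEventElapsedTime` reports milliseconds, so converting a byte count into GB/s might look like this (hypothetical helper, not the test's actual code):

```cpp
#include <cassert>
#include <cstdint>

// cudaEventElapsedTime yields milliseconds, so:
//   bytes / (ms * 1e6) = bytes / (s * 1e9) = GB/s  (1 GB = 1e9 bytes)
double ThroughputGBps(int64_t bytes, float elapsed_ms) {
  return bytes / (elapsed_ms * 1e6);
}
```

For example, moving 2e9 bytes in 1000 ms comes out to 2.0 GB/s.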

* be regularized and inversed.
*
* The output elements are calculated as:
* mul = 1 / sqrt(sqr(stddev[param_offset]) + epsilon)
Contributor:

Suggested change
* mul = 1 / sqrt(sqr(stddev[param_offset]) + epsilon)
* mul = 1 / sqrt(square(stddev[param_offset]) + epsilon)
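The `mul` formula in the doc comment, with the suggested `square` spelling, corresponds to something like this (illustrative host-side helper; the kernel's device code may differ, e.g. by using `rsqrtf`):

```cpp
#include <cassert>
#include <cmath>

// Regularized reciprocal of the standard deviation:
// epsilon keeps the result finite when stddev is (near) zero.
float InvStdDev(float stddev, float epsilon) {
  return 1.0f / std::sqrt(stddev * stddev + epsilon);
}
```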

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton (Collaborator):
CI MESSAGE: [1345384]: BUILD STARTED

@dali-automaton (Collaborator):
CI MESSAGE: [1345384]: BUILD PASSED


template <typename Desc, typename KernelFunc>
std::pair<dim3, dim3>
GetLaunchParams(const TensorListShape<> &data_shape, KernelFunc func) const {
Contributor:

I wonder how much we can generalize and reuse across all kernel launches?
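One possible direction for such generalization (an assumption on my part, not something this PR does) is the CUDA occupancy API, which picks a block size tuned to a given kernel's register and shared-memory usage:

```cuda
#include <cuda_runtime.h>

// Let CUDA suggest a block size for `func` instead of hard-coding one;
// min_grid_size is the minimum grid size for full occupancy.
int min_grid_size = 0, block_size = 0;
cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, func);
```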

@banasraf (Collaborator) left a comment:

No critical comments, only formatting and a typo.

Comment on lines +178 to +179
__global__ void NormalizeKernel(const NormalizeParams *sample_params,
float scale, float shift) {
Collaborator:

Suggested change
__global__ void NormalizeKernel(const NormalizeParams *sample_params,
float scale, float shift) {
__global__ void NormalizeKernel(const NormalizeParams *sample_params,
                                float scale, float shift) {

Comment on lines +455 to +458
void RunInvStdDev(KernelContext &ctx,
const OutListGPU<Out> &out, const InListGPU<In> &in,
const BaseParam &base, const ScaleParam &scale,
float epsilon, float global_scale, float shift) {
Collaborator:

Suggested change
void RunInvStdDev(KernelContext &ctx,
const OutListGPU<Out> &out, const InListGPU<In> &in,
const BaseParam &base, const ScaleParam &scale,
float epsilon, float global_scale, float shift) {
void RunInvStdDev(KernelContext &ctx,
                  const OutListGPU<Out> &out, const InListGPU<In> &in,
                  const BaseParam &base, const ScaleParam &scale,
                  float epsilon, float global_scale, float shift) {

@@ -43,14 +45,58 @@ namespace reduce_impl {
*
* The reindexing is done by either dividing and multiplying by old/new strides or by takind modulo.
Collaborator:

Suggested change
* The reindexing is done by either dividing and multiplying by old/new strides or by takind modulo.
* The reindexing is done by either dividing and multiplying by old/new strides or by taking modulo.
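An illustrative version of the stride/modulo reindexing described in that comment (a hypothetical helper; DALI's actual code replaces the divisions with `fast_div`):

```cpp
#include <cassert>
#include <cstdint>

// Move one axis (old stride `old_stride`, extent `extent`) to a position
// with stride `new_stride`: peel the flat index into outer part, the
// axis coordinate (modulo path), and inner part, then rebuild.
int64_t Reindex(int64_t idx, int64_t old_stride, int64_t extent,
                int64_t new_stride) {
  int64_t outer = idx / (old_stride * extent);
  int64_t coord = (idx / old_stride) % extent;  // taking modulo
  int64_t inner = idx % old_stride;
  return outer * (new_stride * extent) + coord * new_stride + inner;
}
```

With `new_stride == old_stride` this is the identity, which makes a convenient sanity check.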

Contributor Author:

I'll fix it in the follow-up.

@mzient mzient merged commit d6df3de into NVIDIA:master May 27, 2020

5 participants