
Tune Arithmetic Op launch specification #2137

Merged: 5 commits merged into NVIDIA:master on Jul 22, 2020

Conversation

@klecki (Contributor) commented Jul 21, 2020

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

Why we need this PR?

Arithmetic Ops were not fast enough: the GPU was underutilized because the tiles were not big enough.

What happened in this PR?

  • What solution was applied:
    The tile size was increased and the grid size was reduced, so each thread performs a few more iterations (see the sketch after this list).
    Pointer usage was adjusted slightly, which helps a bit with small inputs.
    A benchmark was added.

  • Affected modules and functionalities:
    Arithmetic Ops

  • Key points relevant for the review:

  • Validation and testing:
    Benchmark

  • Documentation (including examples):
    NA
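
For context, the following is a minimal, self-contained sketch of the grid-stride pattern this PR tunes: the grid is capped so that each thread iterates over several elements, and the input/output pointers are offset once instead of being re-derived every iteration. All identifiers (demo_mul_kernel, kBlockSize, kMaxGrid) and the constants are illustrative and are not the actual DALI launch specification.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;   // illustrative block size
constexpr int kMaxGrid = 1024;    // capping the grid forces several iterations per thread

// Multiply a tensor by a constant (tensor-op-constant case) with a grid-stride loop.
__global__ void demo_mul_kernel(float *result, const float *in, float c, int64_t extent) {
  int64_t start = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  int64_t stride = static_cast<int64_t>(blockDim.x) * gridDim.x;
  // Offset the pointers once, then advance them together with the index.
  result += start;
  in += start;
  for (int64_t i = start; i < extent; i += stride, result += stride, in += stride) {
    *result = *in * c;
  }
}

int main() {
  const int64_t extent = int64_t{1} << 22;
  float *in = nullptr, *out = nullptr;
  cudaMallocManaged(&in, extent * sizeof(float));
  cudaMallocManaged(&out, extent * sizeof(float));
  for (int64_t i = 0; i < extent; i++) in[i] = static_cast<float>(i);
  // A smaller grid than "one thread per element" makes every thread loop.
  int64_t blocks_needed = (extent + kBlockSize - 1) / kBlockSize;
  int grid = static_cast<int>(std::min<int64_t>(blocks_needed, kMaxGrid));
  demo_mul_kernel<<<grid, kBlockSize>>>(out, in, 2.0f, extent);
  cudaDeviceSynchronize();
  printf("out[123] = %f\n", out[123]);  // expected 246.0
  cudaFree(in);
  cudaFree(out);
  return 0;
}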

OLD:

test_operator_arithmetic_ops.test_arithmetic_ops_perf(('gpu', 'const'), (<class 'numpy.float32'>, <class 'numpy.float32'>), <function test_arithmetic_ops_perf.<locals>.<lambda> at 0x7f5fd44f6ae8>, [(1024, 1024)] * 256, '*') ... Throughput: 377.034 GB/s
Throughput: 375.006 GB/s
ok
test_operator_arithmetic_ops.test_arithmetic_ops_perf(('gpu', 'const'), (<class 'numpy.float32'>, <class 'numpy.float32'>), <function test_arithmetic_ops_perf.<locals>.<lambda> at 0x7f5fd44f6ae8>, [(16384, 1024)] * 64, '*') ... Throughput: 391.476 GB/s
Throughput: 405.331 GB/s
ok
test_operator_arithmetic_ops.test_arithmetic_ops_perf(('gpu', 'const'), (<class 'numpy.float32'>, <class 'numpy.float32'>), <function test_arithmetic_ops_perf.<locals>.<lambda> at 0x7f5fd44f6ae8>, [(400, 400)] * 64, '*') ... Throughput: 334.448 GB/s
Throughput: 353.607 GB/s
ok
test_operator_arithmetic_ops.test_arithmetic_ops_perf(('gpu', 'gpu'), (<class 'numpy.float32'>, <class 'numpy.float32'>), <function test_arithmetic_ops_perf.<locals>.<lambda> at 0x7f5fd3e962f0>, [(1024, 1024)] * 256, '*') ... Throughput: 490.778 GB/s
Throughput: 490.198 GB/s
ok
test_operator_arithmetic_ops.test_arithmetic_ops_perf(('gpu', 'gpu'), (<class 'numpy.float32'>, <class 'numpy.float32'>), <function test_arithmetic_ops_perf.<locals>.<lambda> at 0x7f5fd3e962f0>, [(16384, 1024)] * 64, '*') ... Throughput: 499.623 GB/s
Throughput: 507.385 GB/s
ok
test_operator_arithmetic_ops.test_arithmetic_ops_perf(('gpu', 'gpu'), (<class 'numpy.float32'>, <class 'numpy.float32'>), <function test_arithmetic_ops_perf.<locals>.<lambda> at 0x7f5fd3e962f0>, [(400, 400)] * 64, '*') ... Throughput: 358.166 GB/s
Throughput: 402.577 GB/s
ok

----------------------------------------------------------------------
Ran 6 tests in 10.587s

OK

NEW:

test_operator_arithmetic_ops.test_arithmetic_ops_perf(('gpu', 'const'), (<class 'numpy.float32'>, <class 'numpy.float32'>), <function test_arithmetic_ops_perf.<locals>.<lambda> at 0x7f96c80bbae8>, [(1024, 1024)] * 256, '*') ... Throughput: 533.855 GB/s
Throughput: 531.948 GB/s
ok
test_operator_arithmetic_ops.test_arithmetic_ops_perf(('gpu', 'const'), (<class 'numpy.float32'>, <class 'numpy.float32'>), <function test_arithmetic_ops_perf.<locals>.<lambda> at 0x7f96c80bbae8>, [(16384, 1024)] * 64, '*') ... Throughput: 576.293 GB/s
Throughput: 576.419 GB/s
ok
test_operator_arithmetic_ops.test_arithmetic_ops_perf(('gpu', 'const'), (<class 'numpy.float32'>, <class 'numpy.float32'>), <function test_arithmetic_ops_perf.<locals>.<lambda> at 0x7f96c80bbae8>, [(400, 400)] * 64, '*') ... Throughput: 462.535 GB/s
Throughput: 498.008 GB/s
ok
test_operator_arithmetic_ops.test_arithmetic_ops_perf(('gpu', 'gpu'), (<class 'numpy.float32'>, <class 'numpy.float32'>), <function test_arithmetic_ops_perf.<locals>.<lambda> at 0x7f96c7a5c2f0>, [(1024, 1024)] * 256, '*') ... Throughput: 571.668 GB/s
Throughput: 572.734 GB/s
ok
test_operator_arithmetic_ops.test_arithmetic_ops_perf(('gpu', 'gpu'), (<class 'numpy.float32'>, <class 'numpy.float32'>), <function test_arithmetic_ops_perf.<locals>.<lambda> at 0x7f96c7a5c2f0>, [(16384, 1024)] * 64, '*') ... Throughput: 604.232 GB/s
Throughput: 603.718 GB/s
ok
test_operator_arithmetic_ops.test_arithmetic_ops_perf(('gpu', 'gpu'), (<class 'numpy.float32'>, <class 'numpy.float32'>), <function test_arithmetic_ops_perf.<locals>.<lambda> at 0x7f96c7a5c2f0>, [(400, 400)] * 64, '*') ... Throughput: 447.778 GB/s
Throughput: 459.841 GB/s
ok

----------------------------------------------------------------------
Ran 6 tests in 10.426s

OK

JIRA TASK: [Use DALI-1514 or NA]

Add benchmark, adjust test to new tile size

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki (Contributor Author) commented Jul 21, 2020

!build

@dali-automaton (Collaborator)

CI MESSAGE: [1484299]: BUILD STARTED

for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < extent; i += blockDim.x * gridDim.x) {
result[i] = meta::impl(l[i], r);
*result = meta::impl(*l, r);
@mzient (Contributor) Jul 21, 2020:

Why? Does using pointer arithmetic instead of indexing help in any way?

Contributor Author (klecki):

It seems so with smaller inputs, with bigger ones it doesn't make much difference.

Contributor:

In that case, can we store the offset and step in variables to avoid repeating?

Contributor Author (klecki):

Done I guess.
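
A hedged sketch of what the requested refactor could look like for the tensor-op-constant variant quoted above; arithm_meta, GPUBackend, and meta::impl come from the surrounding diff, while the exact body below is illustrative rather than the merged DALI code:

template <ArithmeticOp op, typename Result, typename Left, typename Right>
__device__ void ExecuteBinOp(Result *result, const Left *l, Right r, int64_t extent) {
  using meta = arithm_meta<op, GPUBackend>;
  // Compute the thread's starting offset and the grid-wide step once...
  int64_t start = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  int64_t stride = static_cast<int64_t>(blockDim.x) * gridDim.x;
  // ...offset the pointers once...
  result += start;
  l += start;
  // ...and advance the pointers and the index together instead of re-deriving them.
  for (int64_t i = start; i < extent; i += stride, result += stride, l += stride) {
    *result = meta::impl(*l, r);
  }
}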

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
for (int sample_id = 0; sample_id < TestConfig::batch_size; sample_id++) {
for (int extent_id = 0; extent_id < TestConfig::tiles_per_sample; extent_id++) {
int tile_id = sample_id * TestConfig::tiles_per_sample + extent_id;
tiles_cpu(tile_id)->desc = tile_descs[tile_id];
Contributor:

Can't you fill tile_descs here as well?

Contributor Author (klecki):

I could, but implementing that lambda would probably be a bit more of a hassle than a direct loop.

@@ -33,8 +33,12 @@ namespace dali {
template <ArithmeticOp op, typename Result, typename Input>
__device__ void ExecuteUnOp(Result *result, const Input *in, int64_t extent) {
using meta = arithm_meta<op, GPUBackend>;
result += blockIdx.x * blockDim.x + threadIdx.x;
in += blockIdx.x * blockDim.x + threadIdx.x;
for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < extent; i += blockDim.x * gridDim.x) {
Contributor:

extract blockDim.x * gridDim.x to a variable?

for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < extent; i += blockDim.x * gridDim.x) {
result[i] = meta::impl(in[i]);
Contributor:

actually it seemed more readable before. Why the change?

Contributor Author (klecki):

For a small perf gain as well as correctness with int64 offset.
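
To illustrate the correctness point (with hypothetical sizes, not values taken from this PR): if the offset or loop index is kept in a 32-bit type, an extent larger than INT_MAX elements overflows, so the loop can terminate early or index out of bounds, whereas int64_t keeps the arithmetic exact.

#include <cstdint>
#include <cstdio>

int main() {
  int64_t extent = int64_t{3} << 30;                 // 3 Gi elements -- a legal 64-bit extent
  int32_t truncated = static_cast<int32_t>(extent);  // wraps around: no 32-bit index reaches it
  printf("extent as int64_t: %lld, truncated to int32_t: %d\n",
         static_cast<long long>(extent), truncated);
  return 0;
}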

for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < extent; i += blockDim.x * gridDim.x) {
result[i] = meta::impl(l[i], r[i]);
*result = meta::impl(*l, *r);
Contributor:

same applies here. Why the change?

auto left = static_cast<const Left *>(tile.args[0]);
auto right = static_cast<const Right *>(tile.args[1]);
auto *output = static_cast<Result *>(tile.output);
const auto *left = static_cast<const Left *>(tile.args[0]);
Contributor:

I'd go with either

Suggested change
const auto *left = static_cast<const Left *>(tile.args[0]);
const Left *left = static_cast<const Left *>(tile.args[0]);

or

Suggested change
const auto *left = static_cast<const Left *>(tile.args[0]);
auto left = static_cast<const Left *>(tile.args[0]);

Contributor Author (klecki):

I can go back to auto left

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@@ -44,8 +51,17 @@ __device__ void ExecuteUnOp(Result *result, const Input *in, int64_t extent) {
template <ArithmeticOp op, typename Result, typename Left, typename Right>
__device__ void ExecuteBinOp(Result *result, const Left *l, const Right *r, int64_t extent) {
using meta = arithm_meta<op, GPUBackend>;
for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < extent; i += blockDim.x * gridDim.x) {
result[i] = meta::impl(l[i], r[i]);
uint32_t start_ofs = (blockDim.x) * blockIdx.x + threadIdx.x;
Contributor:

in the previous implementation, you used int64_t. Can we use int64_t everywhere?

Contributor Author (klecki):

Added by mistake, will revert to previous int64_t change.

@@ -55,8 +71,15 @@ __device__ void ExecuteBinOp(Result *result, const Left *l, const Right *r, int6
template <ArithmeticOp op, typename Result, typename Left, typename Right>
__device__ void ExecuteBinOp(Result *result, Left l, const Right *r, int64_t extent) {
using meta = arithm_meta<op, GPUBackend>;
for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < extent; i += blockDim.x * gridDim.x) {
result[i] = meta::impl(l, r[i]);
uint32_t start_ofs = (blockDim.x) * blockIdx.x + threadIdx.x;
Contributor:

same here

@@ -66,16 +89,23 @@ __device__ void ExecuteBinOp(Result *result, Left l, const Right *r, int64_t ext
template <ArithmeticOp op, typename Result, typename Left, typename Right>
__device__ void ExecuteBinOp(Result *result, const Left *l, Right r, int64_t extent) {
using meta = arithm_meta<op, GPUBackend>;
for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < extent; i += blockDim.x * gridDim.x) {
result[i] = meta::impl(l[i], r);
uint32_t start_ofs = (blockDim.x) * blockIdx.x + threadIdx.x;
Contributor:

and here

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki (Contributor Author) commented Jul 21, 2020

!build

@dali-automaton (Collaborator)

CI MESSAGE: [1484540]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1484540]: BUILD FAILED

@klecki (Contributor Author) commented Jul 21, 2020

!build

@dali-automaton (Collaborator)

CI MESSAGE: [1484876]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1484876]: BUILD FAILED

@dali-automaton (Collaborator)

CI MESSAGE: [1484876]: BUILD PASSED

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki (Contributor Author) commented Jul 22, 2020

!build

@dali-automaton (Collaborator)

CI MESSAGE: [1486961]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1486961]: BUILD FAILED

@klecki (Contributor Author) commented Jul 22, 2020

!build

@dali-automaton (Collaborator)

CI MESSAGE: [1487392]: BUILD STARTED

@klecki (Contributor Author) commented Jul 22, 2020

!build

@dali-automaton (Collaborator)

CI MESSAGE: [1487604]: BUILD STARTED

@dali-automaton (Collaborator)

CI MESSAGE: [1487392]: BUILD FAILED

@dali-automaton (Collaborator)

CI MESSAGE: [1487604]: BUILD PASSED

@klecki merged commit 498a22e into NVIDIA:master on Jul 22, 2020
@klecki deleted the arithm-op-perf branch on July 22, 2020, 17:09