
Make Slice kernel tiling adaptive #3557

Merged
6 commits merged into NVIDIA:main on Dec 9, 2021

Conversation


@szkarpinski szkarpinski commented Dec 3, 2021

Description

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactoring (Redesign of existing code that doesn't affect functionality)
  • Other (e.g. Documentation, Tests, Configuration)

What happened in this PR

When SliceGPU runs the kernel, each thread processes a hardcoded count of 64 pixels:

```cpp
static constexpr int64_t kBlockSize = 64 * kBlockDim;
```

This limits performance for small batches, because for small data not enough blocks are launched to keep all SMs busy. This PR addresses the issue by computing the number of pixels per thread adaptively.

This increases the throughput by up to 60% for certain configurations.

The solution

The solution tries to create at least 4 * (number of SMs) tiles to improve overall GPU occupancy.

It starts from the original value of 64 pixels per thread and halves it until the estimated number of tiles reaches 4 * (number of SMs) or the lower limit of 4 pixels per thread is hit. For bigger data the behaviour is therefore unchanged: the original value of 64 pixels per thread is kept.
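The halving procedure can be sketched as follows (a hypothetical reconstruction of the heuristic described above, not the actual DALI code; the 256-thread block dimension is an assumption):

```cpp
#include <cstdint>

// Sketch of the adaptive heuristic described above (illustrative only).
// kBlockDim is assumed to be 256 threads per block.
constexpr int64_t kBlockDim = 256;

int64_t AdaptivePixelsPerThread(int64_t total_pixels, int sm_count) {
  const int64_t min_tiles = 4 * static_cast<int64_t>(sm_count);
  int64_t ppt = 64;  // original hardcoded value
  // Halve until the estimated tile count reaches 4 * (number of SMs),
  // but never go below 4 pixels per thread.
  while (ppt > 4 &&
         (total_pixels + ppt * kBlockDim - 1) / (ppt * kBlockDim) < min_tiles)
    ppt /= 2;
  return ppt;
}
```

For large inputs the loop exits immediately and the original value of 64 is kept; for a single 250x250x3 crop on an 80-SM GPU it bottoms out at the lower limit of 4.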

Benchmarks

Benchmark. I measured Slice's performance when cropping images of size 500x500x3, 1000x1000x3 and 2000x2000x3 to 250x250x3, 500x500x3 and 1000x1000x3 respectively. The measurements were taken for batches of 1, 2, 4, 8, 16, 32, 64, 128 and 256 images. As the change is not specific to a particular input shape, I would expect a similar performance impact on other, more complex shapes.

The GPU. The benchmarks were run on Titan V, which has 80 SMs.
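To see why small batches underutilize an 80-SM GPU, one can count the blocks launched per sample (a rough sketch under the assumption of 256 threads per block; function name is illustrative):

```cpp
#include <cstdint>

// Blocks launched for one sample of `pixels` pixels, given `ppt` pixels
// per thread and `block_dim` threads per block (illustrative values).
int64_t BlocksPerSample(int64_t pixels, int64_t ppt, int64_t block_dim = 256) {
  return (pixels + ppt * block_dim - 1) / (ppt * block_dim);
}

// One 250x250x3 crop (187500 pixels):
//   ppt = 64 -> 12 blocks  (far below 80 SMs, most of the GPU idles)
//   ppt =  4 -> 184 blocks (enough work to occupy all SMs)
```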

Performance for various pixels per thread. I measured the performance of SliceGPU for values of pixels per thread (abbreviated to ppt on the plots) other than the original 64. For small batch sizes, increased throughput can be observed at smaller values of pixels per thread.

[Plot: throughput for constant values of pixels per thread (constant_ppt)]

Performance of the adaptive method. The second plot presents the results achieved by the adaptive method described above. As it shows, no fixed value of pixels per thread performs better than the one chosen by the adaptive method.

[Plot: throughput of the adaptive method (adaptive_ppt)]

Additional information

Affected modules and functionalities:

  • SliceGPU kernel tiling is affected, but only for small data

Key points relevant for the review:

  • The magic constant 4 used when computing the minimal number of tiles (4 * number of SMs). This value performed best in the benchmarks, probably because the kernel uses 48 registers, which means roughly 5 blocks fit on an SM. It could probably be computed at runtime from the GPU properties, but that would add a lot of complexity for a small performance gain, so I decided to leave the magic constant. I'm not entirely convinced, though.

  • The placement of GetSMCount method. I'm not very familiar with DALI, so I'm not sure if utils.h is a good place for the new GetSMCount function.
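The register-pressure reasoning behind the constant can be reproduced with simple arithmetic (a sketch; 65536 registers per SM and 256 threads per block are assumptions based on the Volta architecture, and the function name is hypothetical):

```cpp
// Rough register-limited occupancy estimate behind the "4 * SM count" choice.
// Assumes 65536 registers per SM (Volta) and 256 threads per block.
int MaxResidentBlocks(int regs_per_thread, int threads_per_block,
                      int regs_per_sm = 65536) {
  return regs_per_sm / (regs_per_thread * threads_per_block);
}

// With the kernel's 48 registers per thread: 65536 / (48 * 256) = 5,
// so roughly 5 blocks fit on an SM, making ~4 tiles per SM a sensible floor.
```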

Checklist

Tests

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

Computes the optimal pixels-per-thread for the Slice GPU kernel based on the SM count instead of using a hardcoded constant.

Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
```diff
@@ -58,6 +59,16 @@ OutShape GetStrides(const Shape& shape) {
   return strides;
 }
 
+inline int64_t GetSMCount() {
```
@mzient mzient Dec 6, 2021


This function should go to cuda_utils.h, right next to MaxThreadsPerBlock. Please move it there and revert this file to the original version.

Also, make it return an int - I don't expect we're going to surpass 2^31 SMs any time soon and cudaDeviceProp::multiProcessorCount is declared as int.

Collaborator Author

Moved to cuda_utils.h and changed to int

Comment on lines 257 to 258

```diff
 block_count_ += std::ceil(
-    sample_size / static_cast<float>(kBlockSize));
+    sample_size / static_cast<float>(blockSize));
```

@mzient mzient Dec 6, 2021


This is prone to rounding errors. Use div_ceil instead.

Suggested change:

```diff
-block_count_ += std::ceil(
-    sample_size / static_cast<float>(blockSize));
+block_count_ += div_ceil(sample_size, block_size);
```

Collaborator Author

Changed it to div_ceil, thank you
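The rounding hazard is easy to demonstrate: float has a 24-bit mantissa, so large sample sizes lose precision before the division even happens (div_ceil below is a local stand-in for DALI's helper, not its actual definition, and float_blocks is a hypothetical name for the old approach):

```cpp
#include <cmath>
#include <cstdint>

// Local stand-in for DALI's div_ceil helper.
int64_t div_ceil(int64_t a, int64_t b) { return (a + b - 1) / b; }

// The old float-based rounding: for sample_size = 2^24 + 1 and
// block_size = 2^24, casting to float rounds sample_size down to 2^24,
// so std::ceil yields 1 block while the exact answer is 2.
int64_t float_blocks(int64_t sample_size, int64_t block_size) {
  return static_cast<int64_t>(
      std::ceil(sample_size / static_cast<float>(block_size)));
}
```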

```cpp
  total_volume += volume(args.shape);
}

auto minBlocks = 4 * GetSMCount();
```
Contributor

We don't use camelCase. Also, don't use auto for trivial types.

Suggested change:

```diff
-auto minBlocks = 4 * GetSMCount();
+int min_blocks = 4 * GetSMCount();
```

Collaborator Author

Fixed

Comment on lines 199 to 200

```cpp
int64 blockSize;
```
@mzient mzient Dec 6, 2021


Suggested change:

```diff
-int64 blockSize;
+int64_t block_size_ = 256;
```

  1. snake_case
  2. trailing _ for a member (I don't like it, but that's what we use everywhere...)
  3. stick it to the block_count_ field rather than to the constants.
  4. some default would be nice

Collaborator Author

Sorry for that, I had block_count_ just below and didn't notice the style difference :/

Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
@jantonguirao jantonguirao left a comment


@hugo213 Thanks for this contribution, and for the thorough explanation in the PR description. Good work!

@jantonguirao

!build

@dali-automaton

CI MESSAGE: [3542688]: BUILD STARTED

@dali-automaton

CI MESSAGE: [3542688]: BUILD FAILED

@JanuszL

JanuszL commented Dec 7, 2021

@hugo213 it seems that clang is unhappy:

```
#14 299.2 /opt/dali/dali/kernels/slice/slice_gpu.cuh:332:35: error: comparison of integers of different signs: 'uint64_t' (aka 'unsigned long') and 'int64_t' (aka 'long') [-Werror,-Wsign-compare]
#14 299.2         uint64_t size = remaining < block_size_ ? remaining : block_size_;
```
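The warning comes from comparing a uint64_t with an int64_t. Since both values are guaranteed non-negative here, making the types agree resolves it (a minimal illustration with a hypothetical function name, not the actual kernel code):

```cpp
#include <algorithm>
#include <cstdint>

// Minimal illustration of the -Wsign-compare fix: both operands are
// guaranteed non-negative, so giving them the same unsigned type
// makes the comparison well-defined and silences the warning.
uint64_t TileSize(uint64_t remaining, uint64_t block_size) {
  return std::min(remaining, block_size);
}
```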

As clang was complaining about comparing signed and unsigned values, I've changed the variables which are guaranteed to be non-negative to unsigned types.

Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
Google Codestyle recommends treating abbreviations as words and CamelCasing them.

Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
@szkarpinski

I've fixed the signedness issues, Clang should be happy now. Also, @szalpal pointed out that Google Codestyle recommends (https://google.github.io/styleguide/cppguide.html#General_Naming_Rules) treating abbreviations as words, so I've changed GetSMCount to GetSmCount.

@JanuszL

JanuszL commented Dec 7, 2021

!build

@dali-automaton

CI MESSAGE: [3544727]: BUILD STARTED

@dali-automaton

CI MESSAGE: [3544727]: BUILD FAILED

Comment on lines 96 to 104

```cpp
inline int GetSmCount() {
  static int count = 0;
  if (!count) {
    cudaDeviceProp prop;
    CUDA_CALL(cudaGetDeviceProperties(&prop, 0));
    count = prop.multiProcessorCount;
  }
  return count;
}
```
@mzient mzient Dec 7, 2021


Sorry, but that's not sufficient. There can be more than one device - and indeed, more than one type of device.

Suggested change:

```cpp
inline int GetSmCount(int device_id = -1) {
  if (device_id < 0)
    CUDA_CALL(cudaGetDevice(&device_id));
  static int dev_count = []() {
    int ndevs = 0;
    CUDA_CALL(cudaGetDeviceCount(&ndevs));
    return ndevs;
  }();
  static unique_ptr<int[]> count(new int[dev_count]());  // this should be zero-initialized
  if (!count[device_id]) {
    cudaDeviceProp prop;
    CUDA_CALL(cudaGetDeviceProperties(&prop, device_id));
    count[device_id] = prop.multiProcessorCount;
  }
  return count[device_id];
}
```

@szkarpinski szkarpinski Dec 7, 2021


I've fixed this as you suggested in the latest commit, with one small difference: I used a vector instead of a unique_ptr to an array, because I find it more readable. If the unique_ptr approach has some advantage, of course I'll change it.

@JanuszL

JanuszL commented Dec 7, 2021

!build

@dali-automaton

CI MESSAGE: [3545912]: BUILD STARTED

@dali-automaton

CI MESSAGE: [3545912]: BUILD PASSED

Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
@JanuszL

JanuszL commented Dec 8, 2021

!build

@dali-automaton

CI MESSAGE: [3549166]: BUILD STARTED

@dali-automaton

CI MESSAGE: [3549166]: BUILD PASSED

@jantonguirao jantonguirao merged commit 9901d7f into NVIDIA:main Dec 9, 2021
@szkarpinski szkarpinski mentioned this pull request Dec 31, 2021
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Feb 21, 2022
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
@JanuszL JanuszL mentioned this pull request Mar 30, 2022
cyyever pushed a commit to cyyever/DALI that referenced this pull request May 13, 2022
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jun 7, 2022
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>