
Make Slice kernel tiling adaptive #3557

Merged
6 commits merged into NVIDIA:main on Dec 9, 2021

Conversation


@szkarpinski szkarpinski commented Dec 3, 2021

Description

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactoring (Redesign of existing code that doesn't affect functionality)
  • Other (e.g. Documentation, Tests, Configuration)

What happened in this PR

When SliceGPU runs the kernel, each thread processes a hardcoded count of 64 pixels:

```cpp
static constexpr int64_t kBlockSize = 64 * kBlockDim;
```

This limits performance for small batches, because for small data not enough blocks are launched to keep all SMs busy. This PR addresses the issue by computing the number of pixels per thread adaptively.

This increases the throughput by up to 60% for certain configurations.

The solution

The solution tries to create at least 4 * (number of SMs) tiles to improve overall GPU occupancy.

It starts from the original value of 64 pixels per thread and halves it until the estimated number of tiles reaches 4 * (number of SMs) or the lower limit of 4 pixels per thread is hit. For bigger data the behaviour is therefore unchanged: the original value of 64 pixels per thread is kept.
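The halving procedure can be sketched as follows (a hypothetical reconstruction of the heuristic described above, not the actual DALI code; the 256-thread block dimension is an assumption):

```cpp
#include <cstdint>

// Sketch of the adaptive heuristic described above (illustrative only).
// kBlockDim is assumed to be 256 threads per block.
constexpr int64_t kBlockDim = 256;

int64_t AdaptivePixelsPerThread(int64_t total_pixels, int sm_count) {
  const int64_t min_tiles = 4 * static_cast<int64_t>(sm_count);
  int64_t ppt = 64;  // original hardcoded value
  // Halve until the estimated tile count reaches 4 * (number of SMs),
  // but never go below 4 pixels per thread.
  while (ppt > 4 &&
         (total_pixels + ppt * kBlockDim - 1) / (ppt * kBlockDim) < min_tiles)
    ppt /= 2;
  return ppt;
}
```

For large inputs the loop exits immediately and the original value of 64 is kept; for a single 250x250x3 crop on an 80-SM GPU it bottoms out at the lower limit of 4.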

Benchmarks

Benchmark. I measured Slice's performance when cropping images of size 500x500x3, 1000x1000x3 and 2000x2000x3 to 250x250x3, 500x500x3 and 1000x1000x3 respectively. The measurements were taken for batches of 1, 2, 4, 8, 16, 32, 64, 128 and 256 images. As the change is not specific to a particular input shape, I would expect a similar performance impact on other, more complex shapes.

The GPU. The benchmarks were run on Titan V, which has 80 SMs.
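To see why small batches underutilize an 80-SM GPU, one can count the blocks launched per sample (a rough sketch under the assumption of 256 threads per block; function name is illustrative):

```cpp
#include <cstdint>

// Blocks launched for one sample of `pixels` pixels, given `ppt` pixels
// per thread and `block_dim` threads per block (illustrative values).
int64_t BlocksPerSample(int64_t pixels, int64_t ppt, int64_t block_dim = 256) {
  return (pixels + ppt * block_dim - 1) / (ppt * block_dim);
}

// One 250x250x3 crop (187500 pixels):
//   ppt = 64 -> 12 blocks  (far below 80 SMs, most of the GPU idles)
//   ppt =  4 -> 184 blocks (enough work to occupy all SMs)
```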

Performance for various pixels per thread. I measured the performance of SliceGPU for values of pixels per thread (abbreviated to ppt on the plots) other than the original 64. For small batch sizes, increased throughput can be observed at smaller values of pixels per thread.

[Plot: throughput for constant values of pixels per thread (constant_ppt)]

Performance of the adaptive method. The second plot presents the results achieved by the adaptive method described above. As it shows, no fixed value of pixels per thread performs better than the one chosen by the adaptive method.

[Plot: throughput of the adaptive method (adaptive_ppt)]

Additional information

Affected modules and functionalities:

  • SliceGPU kernel tiling is affected, but only for small data

Key points relevant for the review:

  • The magic constant 4 used when computing the minimal number of tiles (4 * number of SMs). This value performed best in the benchmarks, probably because the kernel uses 48 registers, which means roughly 5 blocks fit on an SM. It could probably be computed at runtime from the GPU properties, but that would add a lot of complexity for a small performance gain, so I decided to leave the magic constant. I'm not entirely convinced, though.

  • The placement of GetSMCount method. I'm not very familiar with DALI, so I'm not sure if utils.h is a good place for the new GetSMCount function.
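The register-pressure reasoning behind the constant can be reproduced with simple arithmetic (a sketch; 65536 registers per SM and 256 threads per block are assumptions based on the Volta architecture, and the function name is hypothetical):

```cpp
// Rough register-limited occupancy estimate behind the "4 * SM count" choice.
// Assumes 65536 registers per SM (Volta) and 256 threads per block.
int MaxResidentBlocks(int regs_per_thread, int threads_per_block,
                      int regs_per_sm = 65536) {
  return regs_per_sm / (regs_per_thread * threads_per_block);
}

// With the kernel's 48 registers per thread: 65536 / (48 * 256) = 5,
// so roughly 5 blocks fit on an SM, making ~4 tiles per SM a sensible floor.
```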

Checklist

Tests

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

Computes the optimal pixels-per-thread for the Slice GPU kernel based on the SM count instead of using a hardcoded constant.

Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
```diff
@@ -58,6 +59,16 @@ OutShape GetStrides(const Shape& shape) {
   return strides;
 }
 
+inline int64_t GetSMCount() {
```
@mzient mzient Dec 6, 2021


This function should go to cuda_utils.h, right next to MaxThreadsPerBlock. Please move it there and revert this file to the original version.

Also, make it return an int - I don't expect we're going to surpass 2^31 SMs any time soon and cudaDeviceProp::multiProcessorCount is declared as int.

Collaborator Author

Moved to cuda_utils.h and changed to int

Comment on lines 257 to 258

```diff
 block_count_ += std::ceil(
-    sample_size / static_cast<float>(kBlockSize));
+    sample_size / static_cast<float>(blockSize));
```

@mzient mzient Dec 6, 2021


This is prone to rounding errors. Use div_ceil instead.

Suggested change:

```diff
-block_count_ += std::ceil(
-    sample_size / static_cast<float>(blockSize));
+block_count_ += div_ceil(sample_size, block_size);
```

Collaborator Author

Changed it to div_ceil, thank you
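The rounding hazard is easy to demonstrate: float has a 24-bit mantissa, so large sample sizes lose precision before the division even happens (div_ceil below is a local stand-in for DALI's helper, not its actual definition, and float_blocks is a hypothetical name for the old approach):

```cpp
#include <cmath>
#include <cstdint>

// Local stand-in for DALI's div_ceil helper.
int64_t div_ceil(int64_t a, int64_t b) { return (a + b - 1) / b; }

// The old float-based rounding: for sample_size = 2^24 + 1 and
// block_size = 2^24, casting to float rounds sample_size down to 2^24,
// so std::ceil yields 1 block while the exact answer is 2.
int64_t float_blocks(int64_t sample_size, int64_t block_size) {
  return static_cast<int64_t>(
      std::ceil(sample_size / static_cast<float>(block_size)));
}
```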

```cpp
  total_volume += volume(args.shape);
}

auto minBlocks = 4 * GetSMCount();
```
Contributor

We don't use camelCase. Also, don't use auto for trivial types.

Suggested change:

```diff
-auto minBlocks = 4 * GetSMCount();
+int min_blocks = 4 * GetSMCount();
```

Collaborator Author

Fixed

Comment on lines 199 to 200

```cpp
int64 blockSize;
```
@mzient mzient Dec 6, 2021


Suggested change:

```diff
-int64 blockSize;
+int64_t block_size_ = 256;
```

  1. snake_case
  2. trailing _ for a member (I don't like it, but that's what we use everywhere...)
  3. stick it to the block_count_ field rather than to the constants.
  4. some default would be nice

Collaborator Author

Sorry for that, I had block_count_ just below and didn't notice the style difference :/

Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
@jantonguirao jantonguirao left a comment


@hugo213 Thanks for this contribution, and for the thorough explanation in the PR description. Good work!

@jantonguirao

!build

@dali-automaton

CI MESSAGE: [3542688]: BUILD STARTED

@dali-automaton

CI MESSAGE: [3542688]: BUILD FAILED

@JanuszL

JanuszL commented Dec 7, 2021

@hugo213 it seems that clang is unhappy:

```
#14 299.2 /opt/dali/dali/kernels/slice/slice_gpu.cuh:332:35: error: comparison of integers of different signs: 'uint64_t' (aka 'unsigned long') and 'int64_t' (aka 'long') [-Werror,-Wsign-compare]
#14 299.2         uint64_t size = remaining < block_size_ ? remaining : block_size_;
```
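The warning comes from comparing a uint64_t with an int64_t. Since both values are guaranteed non-negative here, making the types agree resolves it (a minimal illustration with a hypothetical function name, not the actual kernel code):

```cpp
#include <algorithm>
#include <cstdint>

// Minimal illustration of the -Wsign-compare fix: both operands are
// guaranteed non-negative, so giving them the same unsigned type
// makes the comparison well-defined and silences the warning.
uint64_t TileSize(uint64_t remaining, uint64_t block_size) {
  return std::min(remaining, block_size);
}
```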

As clang was complaining about comparing signed and unsigned values, I've changed the variables which are guaranteed to be non-negative to unsigned types.

Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
Google Codestyle recommends treating abbreviations as words and CamelCasing them.

Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
@szkarpinski

I've fixed the signedness issues, Clang should be happy now. Also, @szalpal pointed out that Google Codestyle recommends (https://google.github.io/styleguide/cppguide.html#General_Naming_Rules) treating abbreviations as words, so I've changed GetSMCount to GetSmCount.

@JanuszL

JanuszL commented Dec 7, 2021

!build

@dali-automaton

CI MESSAGE: [3544727]: BUILD STARTED

@dali-automaton

CI MESSAGE: [3544727]: BUILD FAILED

Comment on lines 96 to 104

```cpp
inline int GetSmCount() {
  static int count = 0;
  if (!count) {
    cudaDeviceProp prop;
    CUDA_CALL(cudaGetDeviceProperties(&prop, 0));
    count = prop.multiProcessorCount;
  }
  return count;
}
```
@mzient mzient Dec 7, 2021


Sorry, but that's not sufficient. There can be more than one device - and indeed, more than one type of device.

Suggested change:

```cpp
inline int GetSmCount(int device_id = -1) {
  if (device_id < 0)
    CUDA_CALL(cudaGetDevice(&device_id));
  static int dev_count = []() {
    int ndevs = 0;
    CUDA_CALL(cudaGetDeviceCount(&ndevs));
    return ndevs;
  }();
  static unique_ptr<int[]> count(new int[dev_count]());  // this should be zero-initialized
  if (!count[device_id]) {
    cudaDeviceProp prop;
    CUDA_CALL(cudaGetDeviceProperties(&prop, device_id));
    count[device_id] = prop.multiProcessorCount;
  }
  return count[device_id];
}
```

@szkarpinski szkarpinski Dec 7, 2021


I've fixed this as you suggested in the latest commit, with one small difference: I used a vector instead of a unique_ptr to an array, because I find it more readable. If the unique_ptr approach has some advantage, of course I'll change it.

@JanuszL

JanuszL commented Dec 7, 2021

!build

@dali-automaton

CI MESSAGE: [3545912]: BUILD STARTED

@dali-automaton

CI MESSAGE: [3545912]: BUILD PASSED

Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
@JanuszL

JanuszL commented Dec 8, 2021

!build

@dali-automaton

CI MESSAGE: [3549166]: BUILD STARTED

@dali-automaton

CI MESSAGE: [3549166]: BUILD PASSED

@jantonguirao jantonguirao merged commit 9901d7f into NVIDIA:main Dec 9, 2021
@szkarpinski szkarpinski mentioned this pull request Dec 31, 2021
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jan 23, 2022
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Feb 21, 2022
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
@JanuszL JanuszL mentioned this pull request Mar 30, 2022
cyyever pushed a commit to cyyever/DALI that referenced this pull request May 13, 2022
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
cyyever pushed a commit to cyyever/DALI that referenced this pull request Jun 7, 2022
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>