Make Slice kernel tiling adaptive #3557
Conversation
Computes the optimal pixels-per-thread for the Slice GPU kernel based on the SM count instead of using a hardcoded constant value. Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
dali/kernels/common/utils.h
@@ -58,6 +59,16 @@ OutShape GetStrides(const Shape& shape) {
  return strides;
}

inline int64_t GetSMCount() {
This function should go to cuda_utils.h, right next to MaxThreadsPerBlock. Please move it there and revert this file to the original version. Also, make it return an int - I don't expect we're going to surpass 2^31 SMs any time soon, and cudaDeviceProp::multiProcessorCount is declared as int.
Moved to cuda_utils.h and changed to int.
dali/kernels/slice/slice_gpu.cuh
  block_count_ += std::ceil(
-     sample_size / static_cast<float>(kBlockSize));
+     sample_size / static_cast<float>(blockSize));
This is prone to rounding errors. Use div_ceil instead.
Suggested change:
- block_count_ += std::ceil(
-     sample_size / static_cast<float>(blockSize));
+ block_count_ += div_ceil(sample_size, block_size);
Changed it to div_ceil, thank you.
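For reference, integer ceiling division in the spirit of div_ceil can be sketched as follows (an illustrative definition, not necessarily DALI's exact implementation). Unlike std::ceil on a float quotient, it stays exact for large values - float cannot represent all integers above 2^24, so the float-based version can round to a wrong block count.

```cpp
#include <cstdint>

// Illustrative integer ceiling division, in the spirit of div_ceil.
// Assumes a >= 0 and b > 0; avoids float rounding entirely.
inline int64_t div_ceil(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}
```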
dali/kernels/slice/slice_gpu.cuh
  total_volume += volume(args.shape);
}

auto minBlocks = 4 * GetSMCount();
We don't use camelCase. Also, don't use auto for trivial types.
Suggested change:
- auto minBlocks = 4 * GetSMCount();
+ int min_blocks = 4 * GetSMCount();
Fixed
dali/kernels/slice/slice_gpu.cuh
int64 blockSize;
Suggested change:
- int64 blockSize;
+ int64_t block_size_ = 256;
- snake_case
- trailing _ for a member (I don't like it, but that's what we use everywhere...) - stick it to the block_count_ field rather than to the constants
- some default would be nice
Sorry for that, I had block_count_ just below and didn't notice the style difference :/
Force-pushed from e7fd9a5 to 5a70853
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
Force-pushed from 5a70853 to 9d0fb91
@hugo213 Thanks for this contribution, and for the thorough explanation in the PR description. Good work!
!build
CI MESSAGE: [3542688]: BUILD STARTED
CI MESSAGE: [3542688]: BUILD FAILED
@hugo213 it seems that clang is unhappy:
As clang was complaining about comparing signed and unsigned, I've changed variables which are guaranteed to be non-negative to unsigned. Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
Google Codestyle recommends treating abbreviations as words and CamelCasing them. Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
I've fixed the signedness issues, Clang should be happy now. Also, @szalpal pointed out that Google Codestyle recommends (https://google.github.io/styleguide/cppguide.html#General_Naming_Rules) treating abbreviations as words, so I've changed
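A minimal illustration of the kind of comparison clang's -Wsign-compare flags (hypothetical code, not the actual DALI snippet):

```cpp
#include <cstdint>
#include <vector>

// Comparing a signed counter against vector::size() (which is unsigned)
// triggers -Wsign-compare; making the counter unsigned silences the
// warning when the value is guaranteed to be non-negative.
bool all_processed(const std::vector<int>& v, uint64_t processed) {
  return processed >= v.size();  // both sides unsigned: no warning
}
```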
!build
CI MESSAGE: [3544727]: BUILD STARTED
CI MESSAGE: [3544727]: BUILD FAILED
include/dali/core/cuda_utils.h
inline int GetSmCount() {
  static int count = 0;
  if (!count) {
    cudaDeviceProp prop;
    CUDA_CALL(cudaGetDeviceProperties(&prop, 0));
    count = prop.multiProcessorCount;
  }
  return count;
}
Sorry, but that's not sufficient. There can be more than one device - and indeed, more than one type of device.
Suggested change:
inline int GetSmCount(int device_id = -1) {
  if (device_id < 0)
    CUDA_CALL(cudaGetDevice(&device_id));
  static int dev_count = []() {
    int ndevs = 0;
    CUDA_CALL(cudaGetDeviceCount(&ndevs));
    return ndevs;
  }();
  static unique_ptr<int[]> count(new int[dev_count]());  // this should be zero-initialized
  if (!count[device_id]) {
    cudaDeviceProp prop;
    CUDA_CALL(cudaGetDeviceProperties(&prop, device_id));  // query this device, not device 0
    count[device_id] = prop.multiProcessorCount;
  }
  return count[device_id];
}
I've fixed this as you suggested in the latest commit, with a small difference. I decided to use vector instead of unique_ptr to an array, because I think it's more readable then. If there's some advantage of the unique_ptr approach, of course I'll change it.
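For illustration, the per-device lazy cache with std::vector looks roughly like this (QuerySmCount stands in for the real cudaGetDeviceProperties call so the sketch is self-contained; names and signature are illustrative, not DALI's actual code):

```cpp
#include <vector>

// Stub for the real CUDA query; e.g. a Titan V reports 80 SMs.
static int QuerySmCount(int device_id) { return 80; }

// Lazily caches the SM count per device: the vector is sized once and
// entries are filled on first use, mirroring the unique_ptr variant.
int GetSmCountCached(int device_id, int device_count) {
  static std::vector<int> cache;
  if (cache.empty()) cache.resize(device_count, 0);  // zero-initialized
  if (!cache[device_id]) cache[device_id] = QuerySmCount(device_id);
  return cache[device_id];
}
```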
!build
CI MESSAGE: [3545912]: BUILD STARTED
CI MESSAGE: [3545912]: BUILD PASSED
Signed-off-by: Szymon Karpiński <hugo@staszic.waw.pl>
!build
CI MESSAGE: [3549166]: BUILD STARTED
CI MESSAGE: [3549166]: BUILD PASSED
Description
What happened in this PR
When SliceGPU runs the kernel, each thread processes a hardcoded count of 64 pixels (DALI/dali/kernels/slice/slice_gpu.cuh, line 197 in d7951e5). This change increases the throughput by up to 60% for certain configurations.
The solution
This solution tries to create at least 4 * number of SMs tiles to improve total GPU occupancy. It starts with the original value of 64 pixels per thread and then divides it by 2 until the estimated number of tiles reaches 4 * number of SMs or a lower limit of 4 pixels per thread is reached. This means that for bigger data the behaviour is unchanged, as the original value of 64 pixels per thread is used.

Benchmarks
Benchmark. I have measured Slice's performance when cropping images of size 500x500x3, 1000x1000x3 and 2000x2000x3 to 250x250x3, 500x500x3 and 1000x1000x3, respectively. The measurements were taken for batches of 1, 2, 4, 8, 16, 32, 64, 128 and 256 images. As the change is not specific to a particular shape of input data, I would expect a similar performance impact on other, more complex shapes.
The GPU. The benchmarks were run on Titan V, which has 80 SMs.
Performance for various pixels per thread. I have measured the performance of SliceGPU for values of pixels per thread (abbreviated to ppt on the plots) other than the original 64. Increased throughput can be observed for smaller values of pixels per thread at small batch sizes.
Performance of the adaptive method. The second picture presents the results achieved by the adaptive method described above. As you can see, no other value of pixels per thread performs better than the one chosen by the adaptive method.
Additional information
Affected modules and functionalities:
Key points relevant for the review:
The magic constant 4 in computing the minimal number of tiles (4 * number of SMs). This constant 4 turned out to be the best during benchmarks. The reason for that is probably that the kernel uses 48 registers, which means ~5 blocks can fit on an SM. This could probably be computed from the GPU properties at runtime, but that would add a lot of complexity to the code while having a small effect on performance, so I decided to leave the magic constant here. I'm not really convinced, though.

The placement of the GetSMCount method. I'm not very familiar with DALI, so I'm not sure if utils.h is a good place for the new GetSMCount function.

Checklist
Tests
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A