Tune Arithmetic Op launch specification #2137
Conversation
Add benchmark, adjust test to new tile size Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
!build
CI MESSAGE: [1484299]: BUILD STARTED
for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < extent; i += blockDim.x * gridDim.x) {
  result[i] = meta::impl(l[i], r);
*result = meta::impl(*l, r);
Why? Does using pointer arithmetic instead of indexing help in any way?
It seems so with smaller inputs; with bigger ones it doesn't make much difference.
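To make the two variants concrete, here is a CPU-side sketch in plain C++ of the indexed and pointer-bumped forms of the grid-stride loop being discussed. The names and the `+` operation are illustrative stand-ins (not DALI's actual kernels or `meta::impl`); `start` and `step` model `blockIdx.x * blockDim.x + threadIdx.x` and `blockDim.x * gridDim.x`:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Indexed form: each simulated "thread" computes result[i] = l[i] + r
// for every i it owns in the grid-stride pattern.
void IndexedLoop(float *result, const float *l, float r, int64_t extent,
                 int64_t start, int64_t step) {
  for (int64_t i = start; i < extent; i += step)
    result[i] = l[i] + r;  // '+' stands in for meta::impl
}

// Pointer-bumped form: advance the pointers by the stride instead of
// recomputing base + i on every access.
void PointerLoop(float *result, const float *l, float r, int64_t extent,
                 int64_t start, int64_t step) {
  result += start;
  l += start;
  for (int64_t i = start; i < extent; i += step, result += step, l += step)
    *result = *l + r;
}
```

Both forms visit exactly the same elements; the pointer-bumped one merely trades indexing for pointer increments, which may save a little address arithmetic on small inputs (a compiler will often do this transformation anyway).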
In that case, can we store the offset and step in variables to avoid repeating?
Done I guess.
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
for (int sample_id = 0; sample_id < TestConfig::batch_size; sample_id++) {
  for (int extent_id = 0; extent_id < TestConfig::tiles_per_sample; extent_id++) {
    int tile_id = sample_id * TestConfig::tiles_per_sample + extent_id;
    tiles_cpu(tile_id)->desc = tile_descs[tile_id];
Can't you fill tile_descs here as well?
I could, but implementing that lambda would probably be a bit more of a hassle than a direct loop.
@@ -33,8 +33,12 @@ namespace dali {
template <ArithmeticOp op, typename Result, typename Input>
__device__ void ExecuteUnOp(Result *result, const Input *in, int64_t extent) {
  using meta = arithm_meta<op, GPUBackend>;
  result += blockIdx.x * blockDim.x + threadIdx.x;
  in += blockIdx.x * blockDim.x + threadIdx.x;
  for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < extent; i += blockDim.x * gridDim.x) {
extract blockDim.x * gridDim.x to a variable?
for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < extent; i += blockDim.x * gridDim.x) {
  result[i] = meta::impl(in[i]);
actually it seemed more readable before. Why the change?
For a small perf gain as well as correctness with int64 offset.
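The correctness point about the int64 offset can be illustrated on the CPU. A 32-bit offset accumulated with the grid stride wraps around once it exceeds 2^32 - 1, so for buffers larger than ~4 billion elements some indices would repeat and others would never be visited. The helper names and numbers below are illustrative, not from the PR:

```cpp
#include <cstdint>

// Computes start + iters * step the way a 32-bit loop counter would,
// i.e. modulo 2^32 (unsigned overflow wraps silently).
uint32_t Offset32(uint32_t start, uint32_t step, int iters) {
  uint32_t ofs = start;
  for (int k = 0; k < iters; k++)
    ofs += step;
  return ofs;
}

// The same accumulation with a 64-bit counter never wraps for any
// realistic extent.
int64_t Offset64(int64_t start, int64_t step, int iters) {
  int64_t ofs = start;
  for (int k = 0; k < iters; k++)
    ofs += step;
  return ofs;
}
```

With a stride of 2^20 (e.g. 1024 blocks of 1024 threads), the 32-bit counter diverges from the 64-bit one after roughly 4096 iterations, well within reach of a large tensor.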
for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < extent; i += blockDim.x * gridDim.x) {
  result[i] = meta::impl(l[i], r[i]);
*result = meta::impl(*l, *r);
same applies here. Why the change?
auto left = static_cast<const Left *>(tile.args[0]);
auto right = static_cast<const Right *>(tile.args[1]);
auto *output = static_cast<Result *>(tile.output);
const auto *left = static_cast<const Left *>(tile.args[0]);
I'd go with either
const auto *left = static_cast<const Left *>(tile.args[0]);
const Left *left = static_cast<const Left *>(tile.args[0]);
or
const auto *left = static_cast<const Left *>(tile.args[0]);
auto left = static_cast<const Left *>(tile.args[0]);
I can go back to auto left
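For the record, all three spellings in the suggestion declare the same type, `const float *` in the sketch below (types and names illustrative); `const auto *` keeps the pointer-ness and constness visible at the declaration, while plain `auto` deduces both from the `static_cast`:

```cpp
#include <cassert>
#include <type_traits>

static float sample = 1.f;
static const void *erased = &sample;  // stands in for tile.args[0]

auto a = static_cast<const float *>(erased);          // deduced: const float *
const auto *b = static_cast<const float *>(erased);   // same type, pointer explicit
const float *c = static_cast<const float *>(erased);  // fully spelled out

static_assert(std::is_same<decltype(a), const float *>::value, "same type");
static_assert(std::is_same<decltype(a), decltype(b)>::value, "same type");
static_assert(std::is_same<decltype(b), decltype(c)>::value, "same type");
```

Since the types are identical, the choice is purely a readability/style question, which is why the review leaves it open between the two consistent options.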
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@@ -44,8 +51,17 @@ __device__ void ExecuteUnOp(Result *result, const Input *in, int64_t extent) {
template <ArithmeticOp op, typename Result, typename Left, typename Right>
__device__ void ExecuteBinOp(Result *result, const Left *l, const Right *r, int64_t extent) {
  using meta = arithm_meta<op, GPUBackend>;
  for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < extent; i += blockDim.x * gridDim.x) {
    result[i] = meta::impl(l[i], r[i]);
  uint32_t start_ofs = (blockDim.x) * blockIdx.x + threadIdx.x;
In the previous implementation you used int64_t. Can we use int64_t everywhere?
Added by mistake, will revert to the previous int64_t version.
@@ -55,8 +71,15 @@ __device__ void ExecuteBinOp(Result *result, const Left *l, const Right *r, int6
template <ArithmeticOp op, typename Result, typename Left, typename Right>
__device__ void ExecuteBinOp(Result *result, Left l, const Right *r, int64_t extent) {
  using meta = arithm_meta<op, GPUBackend>;
  for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < extent; i += blockDim.x * gridDim.x) {
    result[i] = meta::impl(l, r[i]);
  uint32_t start_ofs = (blockDim.x) * blockIdx.x + threadIdx.x;
same here
@@ -66,16 +89,23 @@ __device__ void ExecuteBinOp(Result *result, Left l, const Right *r, int64_t ext
template <ArithmeticOp op, typename Result, typename Left, typename Right>
__device__ void ExecuteBinOp(Result *result, const Left *l, Right r, int64_t extent) {
  using meta = arithm_meta<op, GPUBackend>;
  for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < extent; i += blockDim.x * gridDim.x) {
    result[i] = meta::impl(l[i], r);
  uint32_t start_ofs = (blockDim.x) * blockIdx.x + threadIdx.x;
and here
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
!build
CI MESSAGE: [1484540]: BUILD STARTED
CI MESSAGE: [1484540]: BUILD FAILED
!build
CI MESSAGE: [1484876]: BUILD STARTED
CI MESSAGE: [1484876]: BUILD FAILED
CI MESSAGE: [1484876]: BUILD PASSED
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
!build
CI MESSAGE: [1486961]: BUILD STARTED
CI MESSAGE: [1486961]: BUILD FAILED
!build
CI MESSAGE: [1487392]: BUILD STARTED
!build
CI MESSAGE: [1487604]: BUILD STARTED
CI MESSAGE: [1487392]: BUILD FAILED
CI MESSAGE: [1487604]: BUILD PASSED
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Why we need this PR?
Arithmetic Op is not fast enough; the GPU was underutilized because the tiles were not big enough.
What happened in this PR?
What solution was applied:
The tile was made bigger and the grid smaller, allowing a few more iterations per thread.
Pointer usage was adjusted; it helps a bit with small inputs.
A benchmark was added.
Affected modules and functionalities:
Arithmetic Ops
Key points relevant for the review:
Validation and testing:
Benchmark
Documentation (including examples):
NA
OLD:
NEW:
JIRA TASK: [Use DALI-1514 or NA]