Reduce branching in `__brick_shift_left` implementation in SYCL backend

This issue is being filed based on the following review comment: https://github.com/uxlfoundation/oneDPL/pull/1976#discussion_r1928840673

**Potential Performance Issue**
The current implementation of `__brick_shift_left` in the SYCL backend performs strided accesses within a loop, with a conditional check at each iteration to ensure we stay within bounds:

```cpp
const _DiffType __i = __idx - __n; //loop invariant
for (_DiffType __k = __n; __k < __size; __k += __n)
{
    if (__k + __idx < __size)
        __rng[__k + __i] = ::std::move(__rng[__k + __idx]);
}
```
The proposed vectorization path in `https://github.com/uxlfoundation/oneDPL/pull/1976` follows essentially the same implementation, with the same branching. This likely costs some performance, particularly on GPU architectures, as they lack branch prediction. Instead, we should precompute the number of full iterations outside the loop and hoist the last iteration out of the loop, keeping the bounds check only there, since the last iteration may be partial (not every work-item has an element left to move).
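
A minimal host-side sketch of the idea, not the oneDPL implementation: it assumes `0 < n <= size` and that each work-item index satisfies `0 <= idx < n`. The names `DiffT`, `shift_left_item_branchy`, `shift_left_item_hoisted`, `full_end` and `run_shift` are hypothetical and exist only for illustration. Under those assumptions, every multiple of `n` below `(size / n) * n` is in bounds for all work-items, so only the single iteration at `full_end` needs a check.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using DiffT = std::ptrdiff_t;

// Per-work-item body as quoted in the issue: bounds check on every iteration.
template <typename Range>
void shift_left_item_branchy(DiffT idx, Range& rng, DiffT size, DiffT n)
{
    const DiffT i = idx - n; // loop invariant
    for (DiffT k = n; k < size; k += n)
    {
        if (k + idx < size)
            rng[k + i] = std::move(rng[k + idx]);
    }
}

// Hypothetical restructuring: unchecked full iterations, one checked tail.
template <typename Range>
void shift_left_item_hoisted(DiffT idx, Range& rng, DiffT size, DiffT n)
{
    // Assumed preconditions (not taken from the oneDPL sources).
    assert(n > 0 && n <= size && idx >= 0 && idx < n);

    const DiffT i = idx - n;               // loop invariant
    const DiffT full_end = (size / n) * n; // first k that may be out of bounds

    // For k < full_end: k + idx <= full_end - 1 < size, so no check is needed.
    for (DiffT k = n; k < full_end; k += n)
        rng[k + i] = std::move(rng[k + idx]);

    // The last, possibly partial, iteration keeps the bounds check.
    if (full_end + idx < size)
        rng[full_end + i] = std::move(rng[full_end + idx]);
}

// Host-side stand-in for the kernel launch: runs the n work-items one by one.
template <typename Body>
std::vector<int> run_shift(std::vector<int> data, DiffT n, Body body)
{
    const DiffT size = static_cast<DiffT>(data.size());
    for (DiffT idx = 0; idx < n; ++idx)
        body(idx, data, size, n);
    return data;
}
```

Calling `run_shift(v, n, shift_left_item_hoisted<std::vector<int>>)` on a host vector should give the same result as the branchy variant. Whether the restructuring actually improves GPU performance would still need to be measured.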

This optimization should be a follow-up to the mentioned PR and should adjust both scalar and vector implementations.
