Add optimized variant of CMN for HWC to HWC pad FP16 case #4993
Conversation
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Very nice comments and docs. This is quite hard code - additional info makes it significantly easier to understand.
Category: New feature, Refactoring
Description:
Introduce an optimized kernel for the HWC -> HWC, pad case for FP16.
With current DALI allocation patterns, all outputs are aligned to a multiple of 4 bytes,
allowing the output loop to be implemented, for each pixel, as a write of the pair of two lower channels
followed by the pair of two higher channels using vectorized FP16 instructions, resulting in 4-byte accesses to output gmem.
This yields speedups of up to 3x compared to the previous version used by DALI for this particular case.
There is a safety fallback in case the memory is not actually aligned.
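The paired-write pattern described above can be sketched on the host side as follows. This is a hypothetical illustration, not the PR's kernel: it uses `uint16_t` as a stand-in for FP16 (a real CUDA kernel would use `__half2` vector stores), and the function name `pad_hwc_to_hwc`, the 3-to-4 channel counts, and the pad value are assumptions chosen for the example. It shows the two 4-byte stores per pixel on the aligned fast path and the scalar fallback for unaligned output.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using fp16_bits = uint16_t;  // stand-in for a 16-bit FP16 value

// Packs two 16-bit values into one 32-bit word (low value in low half),
// mimicking a __half2 register on a little-endian target.
inline uint32_t pack2(fp16_bits lo, fp16_bits hi) {
  return static_cast<uint32_t>(lo) | (static_cast<uint32_t>(hi) << 16);
}

// Pads HWC pixels with 3 input channels to 4 output channels.
void pad_hwc_to_hwc(const fp16_bits *in, fp16_bits *out, int num_pixels,
                    fp16_bits pad_value) {
  // Fast path: the output base is 4-byte aligned; with 4 output channels
  // per pixel, every subsequent channel pair stays 4-byte aligned too.
  if (reinterpret_cast<uintptr_t>(out) % alignof(uint32_t) == 0) {
    auto *out32 = reinterpret_cast<uint32_t *>(out);
    for (int p = 0; p < num_pixels; p++) {
      out32[2 * p]     = pack2(in[3 * p],     in[3 * p + 1]);  // lower pair
      out32[2 * p + 1] = pack2(in[3 * p + 2], pad_value);      // upper pair + pad
    }
  } else {
    // Safety fallback: scalar 2-byte stores when the output is unaligned.
    for (int p = 0; p < num_pixels; p++) {
      out[4 * p]     = in[3 * p];
      out[4 * p + 1] = in[3 * p + 1];
      out[4 * p + 2] = in[3 * p + 2];
      out[4 * p + 3] = pad_value;
    }
  }
}
```

In the real kernel the fast path emits vectorized FP16 instructions, so each pixel costs two 4-byte global-memory transactions instead of four 2-byte ones, which is where the reported speedup comes from.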
Additional information:
Affected modules and functionalities:
CMN
Key points relevant for the review:
Tests:
This is already covered by existing tests; the current implementation gains a more specialized
variant for this one particular case.
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A