Add optimized variant of CMN for HWC to HWC pad FP16 case #4993
Conversation
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Very nice comments and docs. This is quite hard code - additional info makes it significantly easier to understand.
Category: New feature, Refactoring
Description:
Introduce an optimized kernel for the HWC -> HWC, pad case for FP16.
With current DALI allocation patterns, all outputs are aligned to a multiple of 4 bytes,
allowing the output loop to be implemented, for each pixel, as a write of the pair of two lower channels
followed by the pair of two higher channels using vectorized FP16 instructions, resulting in 4-byte accesses to output gmem.
This yields speedups of up to 3x compared to the previous version used by DALI for this particular case.
There is a safety fallback in case the memory is not actually aligned.
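The paired-write pattern described above can be sketched on the host side as follows. This is a hypothetical illustration, not the PR's kernel: it uses `uint16_t` as a stand-in for FP16 (a real CUDA kernel would use `__half2` vector stores), and the function name `pad_hwc_to_hwc`, the 3-to-4 channel counts, and the pad value are assumptions chosen for the example. It shows the two 4-byte stores per pixel on the aligned fast path and the scalar fallback for unaligned output.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using fp16_bits = uint16_t;  // stand-in for a 16-bit FP16 value

// Packs two 16-bit values into one 32-bit word (low value in low half),
// mimicking a __half2 register on a little-endian target.
inline uint32_t pack2(fp16_bits lo, fp16_bits hi) {
  return static_cast<uint32_t>(lo) | (static_cast<uint32_t>(hi) << 16);
}

// Pads HWC pixels with 3 input channels to 4 output channels.
void pad_hwc_to_hwc(const fp16_bits *in, fp16_bits *out, int num_pixels,
                    fp16_bits pad_value) {
  // Fast path: the output base is 4-byte aligned; with 4 output channels
  // per pixel, every subsequent channel pair stays 4-byte aligned too.
  if (reinterpret_cast<uintptr_t>(out) % alignof(uint32_t) == 0) {
    auto *out32 = reinterpret_cast<uint32_t *>(out);
    for (int p = 0; p < num_pixels; p++) {
      out32[2 * p]     = pack2(in[3 * p],     in[3 * p + 1]);  // lower pair
      out32[2 * p + 1] = pack2(in[3 * p + 2], pad_value);      // upper pair + pad
    }
  } else {
    // Safety fallback: scalar 2-byte stores when the output is unaligned.
    for (int p = 0; p < num_pixels; p++) {
      out[4 * p]     = in[3 * p];
      out[4 * p + 1] = in[3 * p + 1];
      out[4 * p + 2] = in[3 * p + 2];
      out[4 * p + 3] = pad_value;
    }
  }
}
```

In the real kernel the fast path emits vectorized FP16 instructions, so each pixel costs two 4-byte global-memory transactions instead of four 2-byte ones, which is where the reported speedup comes from.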
Additional information:
Affected modules and functionalities:
CMN
Key points relevant for the review:
Tests:
This is already covered by existing tests; the current implementation gains a more specialized
variant for this one particular case.
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A