Add optimized variant of CMN for HWC to HWC pad FP16 case #4993

Merged: 3 commits into NVIDIA:main from hwc2hwc_planar_specialized on Aug 24, 2023

Conversation

@klecki (Contributor) commented Aug 11, 2023

Category: New feature, Refactoring

Description:

Introduce an optimized kernel for the HWC -> HWC, pad case for FP16.
With the current DALI allocation patterns, all outputs are aligned to a multiple of 4 bytes,
which allows the output loop to be implemented as two paired stores per pixel (the pair of the
two lower channels, then the pair of the two upper channels) using vectorized fp16 instructions,
resulting in 4-byte accesses to output gmem.
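As an illustration, here is a minimal CUDA sketch of the store pattern described above, not DALI's actual kernel: the kernel name and the `fill` parameter are hypothetical, and the crop/mirror/normalize arithmetic of CMN is omitted. Three fp16 input channels are padded to four on output, and each pixel is written as two `__half2` stores:

```cuda
#include <cuda_fp16.h>

// Hypothetical sketch: HWC fp16 image, 3 channels padded to 4 on output.
// Each pixel is written as two __half2 values, i.e. two 4-byte accesses
// to global memory. Assumes `out` is 4-byte aligned.
__global__ void PadHwc3ToHwc4Fp16(__half2 *out, const __half *in,
                                  int num_pixels, __half fill) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= num_pixels) return;
  const __half *px = in + idx * 3;
  out[idx * 2 + 0] = __halves2half2(px[0], px[1]);  // lower pair: channels 0, 1
  out[idx * 2 + 1] = __halves2half2(px[2], fill);   // upper pair: channel 2 + pad value
}
```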

This yields up to a 3x speedup compared to the version previously used by DALI for this particular case.
There is a safety fallback in case the memory is not actually aligned.
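That fallback can be driven by a runtime alignment check on the host. A hedged sketch of such a check (`CanUseVectorizedPath` is not a DALI function):

```cuda
#include <cstdint>
#include <cuda_fp16.h>

// Hypothetical dispatch helper: take the vectorized __half2 path only when
// the output pointer is in fact 4-byte aligned; otherwise fall back to a
// scalar per-value path.
inline bool CanUseVectorizedPath(const __half *out) {
  return reinterpret_cast<std::uintptr_t>(out) % alignof(__half2) == 0;
}
```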

Additional information:

Affected modules and functionalities:

CMN

Key points relevant for the review:

Tests:

  • Existing tests apply
    This is already covered by existing tests; the current implementation simply gets a more
    specialized variant for one particular case.
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@klecki (Contributor, Author) commented Aug 11, 2023

!build

@dali-automaton (Collaborator): CI MESSAGE: [9337391]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [9337391]: BUILD FAILED

@klecki changed the title from "Add HWC -> HWC pad optimized variant for fp16 in CMN" to "Add optimized variant of CMN for HWC to HWC pad FP16 case" on Aug 17, 2023
@klecki (Contributor, Author) commented Aug 17, 2023

!build

@dali-automaton (Collaborator): CI MESSAGE: [9412867]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [9412867]: BUILD FAILED

@klecki (Contributor, Author) commented Aug 18, 2023

!build

@dali-automaton (Collaborator): CI MESSAGE: [9424292]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [9424292]: BUILD FAILED

@dali-automaton (Collaborator): CI MESSAGE: [9424292]: BUILD PASSED

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki marked this pull request as ready for review on August 22, 2023 at 09:20
@klecki (Contributor, Author) commented Aug 22, 2023

!build

@awolant self-assigned this on Aug 22, 2023
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@dali-automaton (Collaborator): CI MESSAGE: [9464965]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [9464965]: BUILD PASSED

@awolant (Contributor) left a comment:

Very nice comments and docs. This is quite hard code; the additional info makes it significantly easier to understand.

@klecki merged commit c9d603e into NVIDIA:main on Aug 24, 2023
5 checks passed
@klecki deleted the hwc2hwc_planar_specialized branch on August 24, 2023 at 11:54
JanuszL pushed a commit to JanuszL/DALI that referenced this pull request on Oct 13, 2023.