Add optimized variant of CMN for HWC to CHW case (#4972)
Add an optimized version of the SliceFlipNormalize kernel for the HWC to CHW layout switch.
The kernel has several variants that allow for:
* cropping (only the X dimension is relevant, as the Y dimension is handled via tiling)
* mirroring the X coordinate (as required by the Crop Mirror Normalize operator)
* padding the channel dimension
It assumes uint8_t inputs and supports float16 and float32 outputs.
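
For orientation, the semantics listed above can be sketched as a plain NumPy reference. This is not the optimized kernel and not DALI code: the function name `cmn_hwc_to_chw`, its parameters, and the choice to fill padded channels with zeros are assumptions made for illustration only.

```python
import numpy as np

def cmn_hwc_to_chw(image, x_start, width, mean, inv_std,
                   mirror=False, pad_channels=0, dtype=np.float32):
    """Reference crop + mirror + normalize, HWC uint8 -> CHW float (hypothetical helper)."""
    assert image.dtype == np.uint8 and image.ndim == 3
    # Crop along X only; Y cropping is assumed to be handled elsewhere (tiling).
    roi = image[:, x_start:x_start + width, :]
    if mirror:
        roi = roi[:, ::-1, :]  # flip the X coordinate
    # Per-channel normalization, broadcast over H and W: (x - mean) * inv_std.
    mean = np.asarray(mean, dtype=np.float32)
    inv_std = np.asarray(inv_std, dtype=np.float32)
    out = (roi.astype(np.float32) - mean) * inv_std
    out = out.transpose(2, 0, 1)  # HWC -> CHW
    if pad_channels > 0:
        # Assumption: extra channels are filled with zeros.
        pad = np.zeros((pad_channels,) + out.shape[1:], dtype=np.float32)
        out = np.concatenate([out, pad], axis=0)
    return np.ascontiguousarray(out, dtype=dtype)

# Usage: crop two columns, mirror them, pad 3 channels to 4, output float16.
img = np.arange(2 * 4 * 3, dtype=np.uint8).reshape(2, 4, 3)
out = cmn_hwc_to_chw(img, x_start=1, width=2, mean=[0, 0, 0],
                     inv_std=[1, 1, 1], mirror=True, pad_channels=1,
                     dtype=np.float16)
```

A reference of this kind is mainly useful as a baseline to compare an optimized implementation against.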

The algorithm is described in the docstring.
Additionally, because the tiling is linear, the binary search for the tile index is ported over from the Cast kernel.
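
As a rough illustration of the general technique (not the actual Cast kernel code), a flat tile index can be mapped back to its sample with a binary search over a prefix sum of per-sample tile counts. The names `tile_offsets` and `find_sample` and the example sizes are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

def tile_offsets(sample_sizes, tile_size):
    """Exclusive prefix sum of the number of tiles each sample needs."""
    counts = [-(-n // tile_size) for n in sample_sizes]  # ceil division
    return [0] + list(accumulate(counts))

def find_sample(offsets, tile_idx):
    """Return (sample index, tile index within that sample) for a flat tile index."""
    # Last sample whose first tile is <= tile_idx; empty samples are skipped.
    sample = bisect_right(offsets, tile_idx) - 1
    return sample, tile_idx - offsets[sample]

# Usage: three samples of 10, 3 and 8 elements, tiles of 4 elements.
offsets = tile_offsets([10, 3, 8], tile_size=4)
```

On a GPU, each block would run the same search with its own block index, so no per-tile lookup table has to be stored.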

The CropMirrorNormalize operator setup is generalized to support both the new and old versions
of the Slice kernel (most notably their setup).
Selection of the appropriate implementation, based on the inputs and parameters, was added.
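
A minimal sketch of what such a selection could look like, using only the constraints stated in this message (uint8 input, HWC to CHW, float16 or float32 output). The function name and the string-based arguments are hypothetical, and the real operator may check further conditions.

```python
def can_use_hwc_to_chw_kernel(input_dtype, input_layout, output_layout, output_dtype):
    """Hypothetical predicate: True if the optimized HWC -> CHW variant applies."""
    return (
        input_dtype == "uint8"
        and input_layout == "HWC"
        and output_layout == "CHW"
        and output_dtype in ("float16", "float32")
    )

# Usage: anything that does not match falls back to the generic kernel.
use_optimized = can_use_hwc_to_chw_kernel("uint8", "HWC", "CHW", "float16")
```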

Testing is done via the Python layer for simplicity.

The pure kernel (disregarding the setup) achieves 2.6 TB/s vs 1.6 TB/s for the previous variant.

A simple benchmark using a DALI pipeline gives 2 TB/s for the new kernel vs 1.5 TB/s for the old one.
Note that here we are limited by the previous iteration; overlapping the compute with
training may help further.

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
klecki committed Aug 7, 2023
1 parent 09ebd09 commit dbb79d4
Showing 4 changed files with 951 additions and 88 deletions.