Add optimized variant of CMN for HWC to CHW case (#4972) · NVIDIA/DALI@dbb79d4

Add optimized variant of CMN for HWC to CHW case (#4972)

Add an optimized version of the SliceFlipNormalize kernel for the HWC to CHW layout switch.
The kernel has several variants that allow for:
* cropping (only the X dimension is relevant, as the Y dimension is handled via tiling)
* mirroring the X coordinate (as required by the Crop Mirror Normalize operator)
* padding the channel dimension
It assumes uint8_t inputs and supports float16 and float32 outputs.
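
For orientation, the result the kernel variants have to produce can be written as a slow pure-Python reference. This is only a sketch of the semantics, not the CUDA kernel from this commit: the function name, the parameter names (`crop_x`, `crop_w`, `inv_std`, `pad_channels`, `fill`) and the nested-list data layout are all hypothetical and chosen for illustration.

```python
def crop_mirror_normalize_hwc_to_chw(img, crop_x, crop_w, mean, inv_std,
                                     mirror=False, pad_channels=0, fill=0.0):
    """Reference HWC -> CHW conversion with X crop, X mirror,
    per-channel normalization and channel padding.

    img is a nested list indexed as img[y][x][c] with uint8 values.
    The result is a nested list indexed as out[c][y][x] with floats.
    All names here are hypothetical, not taken from the DALI sources.
    """
    height = len(img)
    channels = len(mean)
    out = []
    for c in range(channels):
        plane = []
        for y in range(height):
            row = []
            for x in range(crop_w):
                # Mirroring reverses the column order inside the crop window.
                src_x = crop_x + (crop_w - 1 - x if mirror else x)
                row.append((img[y][src_x][c] - mean[c]) * inv_std[c])
            plane.append(row)
        out.append(plane)
    # Padded channels are filled with a constant value.
    for _ in range(pad_channels):
        out.append([[fill] * crop_w for _ in range(height)])
    return out
```

The Y dimension has no crop parameter in this sketch, mirroring the note above that only the X crop is relevant to the kernel itself.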

The algorithm is described in the docstring.
Additionally, due to the linear tiling, the binary search for the tile index is ported over from the Cast kernel.
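
The idea behind a binary search over linear tiling can be sketched on the host side as follows. This is an illustration under assumptions, not the code ported from the Cast kernel: it assumes each sample is split into fixed-size tiles, that blocks are numbered consecutively across samples, and the helper names `tile_offsets` and `find_sample` are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def tile_offsets(sample_sizes, tile_size):
    """Exclusive prefix sum of per-sample tile counts.

    offsets[i] is the flat index of the first tile of sample i and
    offsets[-1] is the total number of tiles.
    """
    counts = [(n + tile_size - 1) // tile_size for n in sample_sizes]
    return [0] + list(accumulate(counts))


def find_sample(offsets, block_idx):
    """Map a flat block index to (sample index, tile index in that sample)."""
    # The last offset that is <= block_idx belongs to the owning sample;
    # empty samples share an offset with their successor and are skipped.
    sample = bisect_right(offsets, block_idx) - 1
    return sample, block_idx - offsets[sample]
```

With a flat grid like this, every block can locate its sample in O(log N) steps from the offsets table alone, without a per-block lookup array.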

The CropMirrorNormalize operator setup is generalized to support both the new and old versions
of the Slice kernel (most notably their setup).
Selection of the appropriate implementation based on the inputs and parameters was added.
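
A selection check of this kind could look roughly like the sketch below. It only encodes the constraints stated in this message (uint8 input, float16 or float32 output, HWC to CHW); the function name and the string-based type and layout arguments are hypothetical, and the actual operator may check further conditions.

```python
def can_use_hwc2chw_variant(input_dtype, output_dtype,
                            input_layout, output_layout):
    """Return True when the optimized HWC -> CHW path could apply.

    Hypothetical helper: covers only the constraints named in the
    commit message, not the operator's real selection logic.
    """
    return (input_dtype == "uint8"
            and output_dtype in ("float16", "float32")
            and input_layout == "HWC"
            and output_layout == "CHW")
```

Any input that fails the check would fall back to the previous, more general Slice kernel.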

Testing is done via the Python layer for simplicity.

The pure kernel (disregarding the setup) achieves 2.6 TB/s vs. 1.6 TB/s for the previous variant.

A simple benchmark utilizing a DALI pipeline gives 2 TB/s for the new variant vs. 1.5 TB/s for the old one.
Note that here we are self-restricted by the previous iteration; overlapping the compute with
training may help further.

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

klecki committed Aug 7, 2023

1 parent 09ebd09 commit dbb79d4
