Commit
Add optimized variant of CMN for HWC to CHW case (#4972)
Add an optimized version of the SliceFlipNormalize kernel for the HWC to CHW layout switch. The kernel has several variants that allow for:
* cropping (only the X dimension is relevant, as the Y dimension is handled via tiling)
* mirroring the X coordinate (as required by the Crop Mirror Normalize operator)
* padding the channel dimension

It assumes uint8_t inputs and allows for float16 and float32 outputs. The algorithm is described in the docstring. Additionally, due to the linear tiling, the binary search for the tile index is ported over from the Cast kernel.

The CropMirrorNormalize operator setup is generalized to support both the new and old versions of the Slice kernel (most notably their setup). Selection of the appropriate implementation based on the inputs and parameters was added.

Testing is done via the Python layer for simplicity.

The pure kernel (disregarding the setups) achieves 2.6 TB/s vs 1.6 TB/s for the previous variant. A simple benchmark utilizing a DALI pipeline gives 2 TB/s for the new one vs 1.5 TB/s for the old one. Note that here we are self-restricted by the previous iteration; overlapping the compute with training may help further.

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
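
For readers unfamiliar with the two ideas mentioned above, here is a minimal pure-Python sketch. It is not the CUDA kernel from this commit and makes no attempt to reproduce its optimizations; it only illustrates (a) how a flat tile index could be mapped back to a sample with a binary search when tiles are laid out linearly, and (b) the per-sample semantics of crop + mirror + normalize + channel padding with an HWC to CHW switch. All names (build_tile_offsets, find_sample, x0, inv_std, fill) are hypothetical, and the choice to mirror within the cropped window and to pad with a constant fill value is an assumption.

```python
from bisect import bisect_right
from itertools import accumulate


def build_tile_offsets(sample_sizes, tile_size):
    """Return cumulative tile counts; offsets[i] is the first flat tile index of sample i."""
    tiles_per_sample = [-(-size // tile_size) for size in sample_sizes]  # ceil division
    return [0] + list(accumulate(tiles_per_sample))


def find_sample(offsets, tile_idx):
    """Binary-search the offsets; return (sample index, tile index within that sample)."""
    sample = bisect_right(offsets, tile_idx) - 1
    return sample, tile_idx - offsets[sample]


def crop_mirror_normalize_hwc_to_chw(image, x0, crop_w, mirror, mean, inv_std,
                                     out_channels, fill=0.0):
    """Reference semantics for one sample: image[y][x][c] -> out[c][y][x].

    Only the X crop is applied here; in this sketch the caller is assumed to
    pass just the rows it wants (the commit handles Y via tiling).
    """
    in_channels = len(mean)
    out = []
    for c in range(out_channels):
        plane = []
        for row in image:
            if c >= in_channels:
                # Padded channel: assumed to be filled with a constant.
                plane.append([fill] * crop_w)
                continue
            out_row = []
            for x in range(crop_w):
                # Mirroring is assumed to flip X inside the cropped window.
                src_x = x0 + (crop_w - 1 - x if mirror else x)
                out_row.append((row[src_x][c] - mean[c]) * inv_std[c])
            plane.append(out_row)
        out.append(plane)
    return out
```

Usage: with sample sizes [10, 5, 16] and a tile size of 4, build_tile_offsets gives [0, 3, 5, 9], so flat tile 3 belongs to sample 1 (its first tile) and flat tile 8 is the last tile of sample 2. A real kernel would run the equivalent search per thread block instead of on the host.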