Use __shfl_down for cuda affine interp backward #7

jacobhinkle · 2018-11-20T17:33:20Z

cf. https://devblogs.nvidia.com/faster-parallel-reductions-kepler/

Shuffle intrinsics on NVIDIA GPUs can dramatically speed up custom reductions. The method I currently use relies on a lot of thread synchronization, so there is probably a lot of room for improvement. This should probably come after we start a benchmarking suite.
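For reference, here is a minimal sketch of the kind of shuffle-based reduction the linked post describes. It is not the affine interp backward kernel from this repository: `warp_reduce_sum`, `sum_kernel`, and `sum_on_device` are hypothetical names used only for illustration, and the kernel just sums a flat array. It uses `__shfl_down_sync` rather than `__shfl_down`, since the unsuffixed intrinsic has been deprecated since CUDA 9. It assumes the block size is a multiple of 32 so that every lane of each warp takes part in the shuffle, and it leaves out error checking for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Sum `val` across the 32 lanes of a warp. After the loop, lane 0 holds the
// warp total; the other lanes hold partial sums that should be ignored.
__inline__ __device__ float warp_reduce_sum(float val) {
    // Full mask: assumes all 32 lanes of the warp reach this call.
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    return val;
}

// Hypothetical kernel: adds in[0..n) into *out, which must be zeroed first.
__global__ void sum_kernel(const float* in, float* out, int n) {
    float acc = 0.0f;
    // Grid-stride loop: no thread returns early, so whole warps stay active.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        acc += in[i];
    }
    // Register-to-register exchange within the warp, no __syncthreads needed.
    acc = warp_reduce_sum(acc);
    // One atomic per warp instead of one per thread.
    if ((threadIdx.x & (warpSize - 1)) == 0) {
        atomicAdd(out, acc);
    }
}

// Hypothetical host helper: copies the data over, launches the kernel,
// and returns the sum. CUDA error checking is omitted for brevity.
float sum_on_device(const std::vector<float>& host) {
    const int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));
    const int threads = 256;  // multiple of 32, as assumed above
    const int blocks = (n + threads - 1) / threads;
    sum_kernel<<<blocks, threads>>>(d_in, d_out, n);
    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The within-warp step needs neither shared memory nor `__syncthreads()`, which is where the savings over a synchronization-heavy reduction would be expected to come from. Whether that helps the backward pass here is something the benchmarking suite would have to confirm.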

jacobhinkle added the enhancement label Nov 20, 2018

jacobhinkle mentioned this issue Nov 20, 2018

Benchmarking suite #8

Open
