I've come up with a new fused tiled matrix multiplication algorithm!
It accesses the input indices M*K times,
accesses the output indices M*N times,
and calls atomicAdd K*N*K/shared_memory_size times.
The original one in MinkowskiEngine
accesses the input indices M*K*K/shared_memory_size times,
accesses the output indices M*N times,
and calls atomicAdd M*N times.
The counts above are for a single kernel. For a convolution with Kernel Volume kernels, both sets of counts should be multiplied by the Kernel Volume.
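To make the comparison concrete, here is a rough sketch that plugs sizes into the counts listed above. It only evaluates those formulas as written and is not a performance model. The function names, the `OpCounts` container, and the example sizes are made up for illustration, and `smem` is treated as a plain divisor because its unit is not specified above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpCounts:
    in_index_reads: float   # accesses to the input indices
    out_index_reads: float  # accesses to the output indices
    atomic_adds: float      # atomicAdd calls


def proposed_counts(M, N, K, smem, kernel_volume=1):
    """Counts for the proposed fused tiled kernel, scaled by kernel volume."""
    return OpCounts(
        in_index_reads=kernel_volume * M * K,
        out_index_reads=kernel_volume * M * N,
        atomic_adds=kernel_volume * K * N * K / smem,
    )


def original_counts(M, N, K, smem, kernel_volume=1):
    """Counts for the original MinkowskiEngine kernel, scaled by kernel volume."""
    return OpCounts(
        in_index_reads=kernel_volume * M * K * K / smem,
        out_index_reads=kernel_volume * M * N,
        atomic_adds=kernel_volume * M * N,
    )


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to exercise the formulas.
    sizes = dict(M=1000, N=64, K=64, smem=1024, kernel_volume=27)
    print("proposed:", proposed_counts(**sizes))
    print("original:", original_counts(**sizes))
```

With these made-up sizes the proposed version issues 6912 atomicAdd calls against 1,728,000 for the original, while reading the input indices 1,728,000 times against 108,000. Which side wins therefore depends on how expensive an atomicAdd is relative to an index read on the target GPU.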
Which one do you think is better?
Can you elaborate on the latency of atomicAdd?