A new fused tiled matrix multiplication algorithm #166

I've come up with a new fused tiled matrix multiplication algorithm!
```
access in indices   M*K times
access out indices  M*N times
use atomicAdd       K*N*K/shared_memory_size times
```
The original one in `MinkowskiEngine`:
```
access in indices   M*K*K/shared_memory_size times
access out indices  M*N times
use atomicAdd       M*N times
```

The counts above are for a single kernel; for `Kernel Volume` kernels, both sets of counts should be multiplied by `Kernel Volume`.
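To make the comparison easier to play with, here is a small sketch that evaluates the two sets of counts for concrete sizes. It only transcribes the formulas listed above. The function name `access_counts` and the `kernel_volume` parameter are made up for illustration, and I'm assuming `shared_memory_size` is measured in the same units as the matrix dimensions (elements, not bytes).

```python
def access_counts(M, N, K, shared_memory_size, kernel_volume=1):
    """Return (proposed, original) operation counts.

    The formulas are copied from the lists in this issue; nothing here
    models actual latency, it is only the arithmetic.
    Assumption: shared_memory_size is expressed in elements.
    """
    proposed = {
        "in_index": M * K,
        "out_index": M * N,
        "atomic_add": K * N * K / shared_memory_size,
    }
    original = {
        "in_index": M * K * K / shared_memory_size,
        "out_index": M * N,
        "atomic_add": M * N,
    }

    # Both variants scale linearly with the kernel volume.
    def scale(counts):
        return {name: value * kernel_volume for name, value in counts.items()}

    return scale(proposed), scale(original)


if __name__ == "__main__":
    # Example sizes, chosen arbitrarily for illustration.
    proposed, original = access_counts(M=1024, N=64, K=64,
                                       shared_memory_size=4096,
                                       kernel_volume=27)
    for name in proposed:
        print(f"{name:>10}: proposed={proposed[name]:.0f} "
              f"original={original[name]:.0f}")
```

This says nothing about which variant is faster in practice, since that depends on how expensive each `atomicAdd` is relative to an index access, which is exactly the question below.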

Which one do you think is better?
Can you elaborate on the latency of `atomicAdd`?

