The performance of the new batched GEMM (develop cb87b04) is ~2 TFLOPS lower than that of the old one (b53e9d0). This can be reproduced by running:
./bin/ckProfiler batched_gemm 0 0 1 2 0 5 1024 512 2048 -1 -1 -1 8
Other observations include:
- Both versions have exactly the same number of scalar (s) and vector (v) instructions and almost the same number of VGPRs in the main K0 loop.
- The old version uses scratch memory when HasMainK0BlockLoop=true; the new version does not.
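To put the ~2 TFLOPS gap in context, here is a rough sketch of how batched GEMM throughput is usually derived from a measured kernel time. It assumes that the trailing profiler arguments mean M=1024, N=512, K=2048 and a batch count of 8, which this report does not state explicitly, and that one multiply-add counts as 2 FLOPs. The function names are hypothetical and not part of ckProfiler.

```python
def batched_gemm_flops(batch, m, n, k):
    """Total floating-point operations, counting one multiply-add as 2 FLOPs."""
    return 2 * batch * m * n * k


def batched_gemm_tflops(batch, m, n, k, time_ms):
    """Throughput in TFLOPS for a kernel time given in milliseconds."""
    return batched_gemm_flops(batch, m, n, k) / (time_ms * 1e-3) / 1e12


if __name__ == "__main__":
    # Assumed problem size (see the note above): batch=8, M=1024, N=512, K=2048.
    print(batched_gemm_flops(8, 1024, 512, 2048))  # 17179869184 (2**34)
```

Since the problem is only about 1.7e10 FLOPs under these assumptions, small differences in kernel time translate into large TFLOPS differences, so the measured kernel times from both commits are worth comparing directly.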