Batched GEMM performance issue #139

@j4yan

Description

Batched GEMM performance on develop (cb87b04) is about 2 TFLOPS lower than with the old version (b53e9d0). This can be reproduced by running:

./bin/ckProfiler batched_gemm 0 0 1 2 0 5 1024 512 2048 -1 -1 -1 8

Other observations include:

  1. Both versions have exactly the same number of scalar (s) and vector (v) instructions and almost the same number of VGPRs in the main K0 loop.
  2. The old version uses scratch memory when HasMainK0BlockLoop=true; the new version does not.
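
For anyone comparing numbers, here is a minimal sketch of how a TFLOPS figure relates to kernel time. It assumes the usual 2*B*M*N*K operation count for batched GEMM, and it assumes that 1024 512 2048 and the trailing 8 in the command above are M, N, K and the batch count (the argument meanings are not spelled out in this issue). The function names are hypothetical and are not part of ckProfiler.

```python
def batched_gemm_flops(batch: int, m: int, n: int, k: int) -> int:
    """Floating-point operations for `batch` independent (m x k) @ (k x n) products."""
    # Each of the m*n outputs needs k multiplies and k adds.
    return 2 * batch * m * n * k


def tflops(flops: int, elapsed_ms: float) -> float:
    """Achieved throughput in TFLOPS for a kernel that took `elapsed_ms` milliseconds."""
    return flops / (elapsed_ms * 1e-3) / 1e12


def elapsed_ms_at(flops: int, target_tflops: float) -> float:
    """Kernel time in milliseconds that corresponds to `target_tflops`."""
    return flops / (target_tflops * 1e12) * 1e3


# Assumed problem shape, read off the reproduction command (see the note above).
BATCH, M, N, K = 8, 1024, 512, 2048
TOTAL_FLOPS = batched_gemm_flops(BATCH, M, N, K)  # 2**34 operations

if __name__ == "__main__":
    print(f"total flops: {TOTAL_FLOPS}")
    # Illustrative timing only, not a measurement from either commit.
    print(f"TFLOPS at 1.0 ms: {tflops(TOTAL_FLOPS, 1.0):.2f}")
```

Under these assumptions the problem is 2^34 operations, so a 2 TFLOPS difference corresponds to a small change in per-launch kernel time. Comparing averaged kernel times directly may be easier than comparing the rounded TFLOPS values.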

Labels: Performance Issue (performance issue due to regression or something fishy)