fix max BF16 flop rate on CDNA2 #155

skyreflectedinmirrors · 2023-07-28T16:31:37Z

According to the AMD Matrix Calculator, some BF16 MFMA instructions reach up to 1024 FLOPs/CU/cycle on CDNA2, e.g.:

$ ./matrix_calculator.py --architecture CDNA2 --instruction v_mfma_f32_32x32x8bf16_1k --detail-instruction
Architecture: CDNA2
Instruction: V_MFMA_F32_32X32X8BF16_1K
    Encoding: VOP3P-MAI
    VOP3P Opcode: 0x66
    VOP3P-MAI Opcode: 0x26
    Matrix Dimensions:
        M: 32
        N: 32
        K: 8
        blocks: 1
    Execution statistics:
        FLOPs: 16384
        Execution cycles: 64
        **FLOPs/CU/cycle: 1024**
        Can co-execute with VALU: True
        VALU co-execution cycles possible: 60
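
As a rough cross-check of the numbers in that output, the arithmetic can be sketched in a few lines of Python. This is not necessarily how `matrix_calculator.py` computes its statistics. The helper names are hypothetical, the factor of 4 matrix cores per CU is an assumption about CDNA2, and the 110 CUs and 1.7 GHz clock are illustrative values only.

```python
def mfma_flops(m, n, k, blocks=1):
    """FLOPs for one MFMA instruction: one multiply and one add per (m, n, k) element."""
    return 2 * m * n * k * blocks


def flops_per_cu_per_cycle(flops, cycles, matrix_cores_per_cu=4):
    """Per-CU rate, assuming every matrix core in the CU issues the same instruction.

    matrix_cores_per_cu=4 is an assumption for CDNA2, not taken from the tool output.
    """
    return flops * matrix_cores_per_cu // cycles


def peak_tflops(rate_per_cu_per_cycle, num_cus, clock_hz):
    """Theoretical device peak in TFLOPs for a given per-CU, per-cycle rate."""
    return rate_per_cu_per_cycle * num_cus * clock_hz / 1e12


# v_mfma_f32_32x32x8bf16_1k: M=32, N=32, K=8, blocks=1, 64 execution cycles
flops = mfma_flops(32, 32, 8)
rate = flops_per_cu_per_cycle(flops, 64)
print(flops, rate)  # 16384 1024

# Illustrative device parameters (assumed): 110 CUs at 1.7 GHz
print(peak_tflops(rate, 110, 1.7e9))  # peak using the calculator's rate of 1024
print(peak_tflops(512, 110, 1.7e9))   # peak using the old assumption of 512
```

With these assumed device parameters, a rate of 512 gives exactly half the peak that a rate of 1024 does.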

Therefore, our peak calculations, which currently assume a maximum of 512 FLOPs/CU/cycle, are wrong. This can be verified experimentally with a simple (unoptimized) kernel that spams these ops, e.g.:

After this fix, we get:

Signed-off-by: Nicholas Curtis <nicurtis@amd.com>

coleramos425 · 2023-08-03T22:36:16Z

Looks good to me

fix max BF16 flop rate on CDNA2

42248d5

Signed-off-by: Nicholas Curtis <nicurtis@amd.com>

skyreflectedinmirrors force-pushed the bf16_flops branch from 0431abb to 42248d5 Compare July 28, 2023 16:38

coleramos425 merged commit c809346 into ROCm:dev Aug 3, 2023
9 checks passed

coleramos425 mentioned this pull request Aug 9, 2023

Update MFMA peaks to reflect CDNA3 #159

Merged
