fix max BF16 flop rate on CDNA2 #155

Merged
merged 1 commit into from
Aug 3, 2023


skyreflectedinmirrors
Contributor

According to the AMD Matrix Calculator, some BF16 MFMA instructions reach up to 1024 FLOPs/CU/cycle on CDNA2, e.g.:

$ ./matrix_calculator.py --architecture CDNA2 --instruction v_mfma_f32_32x32x8bf16_1k --detail-instruction
Architecture: CDNA2
Instruction: V_MFMA_F32_32X32X8BF16_1K
    Encoding: VOP3P-MAI
    VOP3P Opcode: 0x66
    VOP3P-MAI Opcode: 0x26
    Matrix Dimensions:
        M: 32
        N: 32
        K: 8
        blocks: 1
    Execution statistics:
        FLOPs: 16384
        Execution cycles: 64
        **FLOPs/CU/cycle: 1024**
        Can co-execute with VALU: True
        VALU co-execution cycles possible: 60

Therefore, our peak calculations, which currently assume a maximum of 512 FLOPs/CU/cycle for BF16, are wrong. This can be verified experimentally with a simple (unoptimized) kernel that spams these ops, e.g.:

image

After this fix, we get:

image
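For reference, the arithmetic behind the calculator output above can be reproduced with a short sketch. It is not the code changed in this PR; the helper names are hypothetical, and it assumes 4 matrix cores (one per SIMD) per CU, which is what relates 16384 FLOPs in 64 cycles to 1024 FLOPs/CU/cycle. The CU count and clock used in the usage example (110 CUs at 1700 MHz, roughly one MI250X GCD) are illustrative assumptions as well.

```python
# Sketch of the peak-rate arithmetic; helper names are hypothetical.

def mfma_flops(m: int, n: int, k: int, blocks: int = 1) -> int:
    """FLOPs for one MFMA instruction: one multiply and one add per (m, n, k) term."""
    return 2 * m * n * k * blocks


def flops_per_cu_per_cycle(flops: int, cycles: int, simds_per_cu: int = 4) -> int:
    """Per-CU rate, assuming each of the CU's SIMDs issues the instruction back to back."""
    return flops * simds_per_cu // cycles


def peak_tflops(rate_per_cu_cycle: int, num_cus: int, clock_mhz: float) -> float:
    """Theoretical peak in TFLOP/s for a given CU count and clock."""
    return rate_per_cu_cycle * num_cus * clock_mhz * 1e6 / 1e12


# v_mfma_f32_32x32x8bf16_1k: M=32, N=32, K=8, blocks=1, 64 execution cycles
flops = mfma_flops(32, 32, 8)
rate = flops_per_cu_per_cycle(flops, 64)

# Assumed hardware figures, for illustration only.
old_peak = peak_tflops(512, num_cus=110, clock_mhz=1700)
new_peak = peak_tflops(rate, num_cus=110, clock_mhz=1700)
print(flops, rate, old_peak, new_peak)
```

With the previous assumption of 512 FLOPs/CU/cycle, the computed peak is exactly half of what it is with 1024, so a kernel that saturates these instructions would appear to run at close to 200% of peak.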

Signed-off-by: Nicholas Curtis <nicurtis@amd.com>
@coleramos425
Collaborator

Looks good to me

@coleramos425 coleramos425 merged commit c809346 into ROCm:dev Aug 3, 2023
9 checks passed