
Occupancy Calculator for AMD in Excel (similar to CUDA) #1689

Open
nikl-i opened this issue Feb 22, 2022 · 7 comments

@nikl-i

nikl-i commented Feb 22, 2022

Hello! I've recently made an Occupancy Calculator for AMD GPUs, similar to the [CUDA Occupancy Calculator](https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html), and would like to share it somehow.
It's an Excel file (like the CUDA calculator) and includes several plots and summary information about occupancy factors, with links to documentation and other materials.

In case you find it useful, it would be great if you could suggest a way to make it more available to the community (add it to the docs or something).
Thanks!

Screenshots:
occupancy-calc-screenshot-1, occupancy-calc-screenshot-2, occupancy-calc-screenshot-3 (attached images)

Calculator itself (*.xlsx file in *.zip archive):
ROCm-Occupancy-Calculator.zip
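
For readers unfamiliar with this kind of tool: occupancy calculators generally take the minimum over several per-resource wave limits. Below is a minimal Python sketch of that idea, with purely illustrative constants; it is not the attached spreadsheet's logic, and the numbers do not describe any specific AMD GPU:

```python
# Minimal occupancy sketch -- illustrative constants only, not the attached
# spreadsheet's formulas and not values for any specific AMD GPU.
MAX_WAVES_PER_SIMD = 8          # hypothetical per-SIMD wave-slot cap
VGPRS_PER_SIMD = 256            # hypothetical vector register file size
LDS_BYTES_PER_CU = 64 * 1024    # hypothetical LDS (shared memory) per CU
SIMDS_PER_CU = 4

def occupancy(vgprs_per_wave: int, lds_per_workgroup: int, waves_per_workgroup: int) -> float:
    """Fraction of per-SIMD wave slots a kernel can keep occupied,
    considering only register and LDS pressure."""
    vgpr_limit = VGPRS_PER_SIMD // vgprs_per_wave
    if lds_per_workgroup:
        groups_per_cu = LDS_BYTES_PER_CU // lds_per_workgroup
        lds_limit = groups_per_cu * waves_per_workgroup / SIMDS_PER_CU
    else:
        lds_limit = float(MAX_WAVES_PER_SIMD)
    waves = min(MAX_WAVES_PER_SIMD, vgpr_limit, lds_limit)
    return waves / MAX_WAVES_PER_SIMD

# Example: 64 VGPRs/wave, 16 KiB LDS per 4-wave workgroup -> 0.5 (50%)
print(occupancy(64, 16 * 1024, 4))
```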

@ROCmSupport

ROCmSupport commented Feb 23, 2022

Hi @nikowleye,
Thanks for reaching out.
We very much appreciate you sharing this document and information with us.
I will discuss it internally and give you an update as soon as possible. Thank you.

@Rmalavally
Collaborator

Thank you for sharing your document, @nikowleye. We are discussing internally how to proceed with your submission and hope to give you an update soon.

Regards,
AMD ROCm Documentation Team

@yugr

yugr commented Aug 12, 2022

@Rmalavally hi, just curious, are there plans to integrate this into ROCm?

@mazhaojia123

Hi, @nikl-i.

I have some questions about the VGPRs-per-CU value. Is there any evidence that one CU has 65,536 VGPRs?

Based on the information provided in the amd-cdna-whitepaper, each CU of the MI100 has a 512 KB VGPR file. If there are 65,536 VGPRs per CU, does that mean the VGPRs are all 64 bits in size?

@mazhaojia123

Hi, @ROCmSupport @Rmalavally.
It seems like the MI100 has 65,536 * 2 32-bit VGPRs per CU? Are there any official materials or other evidence?
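
For clarity, here is the arithmetic behind both readings of the whitepaper's 512 KB figure; this is just a worked check of the question's reasoning, not an official statement:

```python
# Two readings of a 512 KiB register file per MI100 CU.
reg_file_bytes = 512 * 1024
print(reg_file_bytes // 65536)  # 8 bytes/register -> 64-bit, if 65,536 VGPRs/CU
print(reg_file_bytes // 4)      # 131,072 = 65,536 * 2 registers, if VGPRs are 32-bit
```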

@jlgreathouse
Collaborator

gfx9 "GCN" GPUs (i.e., not CDNA accelerators such as MI100 or MI200) have a compute unit architecture that:

  • Has 4 separate SIMD16 units, each with their own VGPR file
  • Each SIMD16 unit has 256x 64-wide vector general-purpose registers. Each lane of the vector register holds a 4-byte value.
  • So a gfx9 GCN CU has: 4 bytes / lane * 64 lanes / VGPR * 256 VGPRs / SIMD * 4 SIMDs / CU = 256 KiB / CU
    • Depending on how you define a VGPR (e.g., if you're counting lanes), then yes, you could say that a CU has 64 * 256 * 4 = 65,536 GPRs. But note that this is really 256 vector registers; you can't individually access all of those GPRs from every lane (the snippet after this list spells out the arithmetic).
    • Please note that waves are scheduled to a SIMD16, and any individual wave can only access at most 256 VGPRs.
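
Spelling out the arithmetic from the list above as a runnable snippet (the numbers are taken directly from the bullets):

```python
# gfx9 GCN per-CU register storage, from the figures above.
bytes_per_lane = 4       # each VGPR lane holds a 4-byte value
lanes_per_vgpr = 64      # wave64: 64 lanes per vector register
vgprs_per_simd = 256     # VGPRs per SIMD16 unit
simds_per_cu = 4         # 4x SIMD16 per compute unit

bytes_per_cu = bytes_per_lane * lanes_per_vgpr * vgprs_per_simd * simds_per_cu
print(bytes_per_cu // 1024)                             # 256 KiB per CU
print(lanes_per_vgpr * vgprs_per_simd * simds_per_cu)   # 65,536 when counting lanes
```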

You can learn a bit more about the general "GCN" architecture in this presentation: https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah, e.g., slide 19 describes the "4x SIMD16" architecture of our GCN CUs and shows the VGPR size. See also https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf.

On the MI100 CDNA1 accelerator, the compute unit architecture:

  • Also has 4 separate SIMD16 units, each with their own vector register file
  • Each SIMD16 unit has 512x 64-wide vector registers.
    • Please note that I did not say general-purpose registers. The register file in MI100 is split in half. There are 256x general-purpose vector registers, and 256x "accumulation" registers that are for use by the matrix multiplication instructions.
    • "Normal" code cannot easily use these AccVGPRs, because the only VALU instructions that can use them are mov instructions. Our compiler can use these registers for spills & fills from traditional "ArchVGPRs", but you can't use them to e.g. feed an add operation.
    • As described in the CDNA architecture guide (Section 3.6.4 https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf), on MI100 you always allocate an identical number of ArchVGPRs and AccVGPRs. So if you make a kernel that uses 256 VGPRs, you also spend 256 AccVGPRs.
  • So from a storage perspective, Figure 5 of the CDNA whitepaper (https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf) is correct.
    • 4 bytes / lane * 64 lanes / register * 512 registers / SIMD = 128 KiB of registers per SIMD (and with 4 SIMDs / CU, this yields your calculation of 512 KiB / CU)
  • However, from an occupancy calculation perspective, doubling the GPR count on MI100 does not give you extra occupancy for an otherwise identical kernel compared to GCN, because those new registers can't be used as "normal" architected VGPRs (see the sketch after this list).
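
A hedged sketch of that allocation rule (ignoring VGPR allocation granularity and all non-register occupancy limits):

```python
# MI100: a kernel's AccVGPR allocation mirrors its ArchVGPR allocation, so only
# the 256 architected registers per SIMD constrain register-limited occupancy.
ARCH_VGPRS_PER_SIMD = 256   # usable by normal VALU instructions
# (the other 256 AccVGPRs per SIMD are reserved in matching amounts)

def mi100_reg_limited_waves(arch_vgprs_per_wave: int) -> int:
    # Ignores allocation granularity and non-register occupancy limits.
    return ARCH_VGPRS_PER_SIMD // arch_vgprs_per_wave

print(mi100_reg_limited_waves(64))  # 4 waves/SIMD -- same as gfx9 GCN
```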

On the MI200 CDNA2 accelerator, the compute unit architecture:

  • Also has 4 separate SIMD16 units, each with their own vector register file
  • Each SIMD16 unit has 512x 64-wide vector general-purpose registers.
  • So from both a storage and an occupancy perspective, MI200 has:
    • 4 bytes / lane * 64 lanes / register * 512 VGPRs / SIMD = 128 KiB of VGPRs per SIMD (and with 4 SIMDs / CU, this yields 512 KiB / CU; the sketch below contrasts this with MI100)
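
Continuing the sketch above under the same assumptions, MI200's unified register file shows up directly in register-limited occupancy:

```python
# MI200: all 512 registers per SIMD are architected VGPRs, so the same kernel
# can reach higher register-limited occupancy than on MI100.
MI200_VGPRS_PER_SIMD = 512

def mi200_reg_limited_waves(vgprs_per_wave: int) -> int:
    # Again ignoring allocation granularity, the hardware wave-slot cap,
    # and non-register occupancy limits.
    return MI200_VGPRS_PER_SIMD // vgprs_per_wave

print(mi200_reg_limited_waves(64))  # 8 waves/SIMD vs. 4 on MI100
```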

For a bit more coverage of the MI200 topic, you can see the media reports from our HotChips talk: https://chipsandcheese.com/2022/09/18/hot-chips-34-amds-instinct-mi200-architecture/ which gives some extra commentary from the Q&A session of the talk itself (https://hc34.hotchips.org/assets/program/conference/day1/GPU%20HPC/HC2022.AMD.AlanSmith.v14.Final.20220820.pdf), where this was covered in 1 bullet point on slide 8.
Thanks!

@mazhaojia123

Hi, @jlgreathouse.
Thank you very much, your answer is very helpful to me!
