.. Workflow rename: ``name: Lint Documentation`` changed to ``name: Linting``.
**********************
Command processor (CP)
**********************

The command processor (CP) is responsible for interacting with the AMDGPU
kernel driver (part of the Linux kernel) on the CPU and with user-space HSA
clients when they submit commands to HSA queues. Basic tasks of the CP include
reading commands (for example, corresponding to a kernel launch) out of `HSA
queues <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-Runtime-1.2.pdf>`__
(Sec. 2.5), scheduling work to subsequent parts of the scheduler pipeline, and
marking kernels complete for synchronization events on the host.
The command processor is composed of two sub-components:

- Fetcher (CPF): fetches commands out of memory to hand them over to the CPC
  for processing.
- Packet processor (CPC): the micro-controller running the command processing
  firmware that decodes the fetched commands and (for kernels) passes them to
  the `workgroup processors <SPI>`__ for scheduling.
Before scheduling work to the accelerator, the command processor can first
acquire a memory fence to ensure system consistency `(Sec. 2.6.4)
<http://hsafoundation.com/wp-content/uploads/2021/02/HSA-Runtime-1.2.pdf>`__.
After the work is complete, the command processor can apply a memory-release
fence. Depending on the AMD CDNA accelerator in question, either of these
operations *may* initiate a cache write-back or invalidation.
Analyzing command processor performance is most interesting for kernels that
the user suspects to be scheduling- or launch-rate limited. The command
processor's metrics therefore focus on reporting, for example:

- Utilization of the fetcher
- Utilization of the packet processor, and of its packet decoding stage
- Fetch/processing stalls
Command Processor Fetcher (CPF) metrics
=======================================

.. list-table::
   :header-rows: 1
   :widths: 20 65 15

   * - Metric
     - Description
     - Unit
   * - CPF Utilization
     - Percent of total cycles where the CPF was busy actively doing any work.
       The ratio of CPF busy cycles over total cycles counted by the CPF.
     - Percent
   * - CPF Stall
     - Percent of CPF busy cycles where the CPF was stalled for any reason.
     - Percent
   * - CPF-L2 Utilization
     - Percent of total cycles counted by the CPF-`L2 <L2>`__ interface where
       the CPF-L2 interface was active doing any work. The ratio of CPF-L2
       busy cycles over total cycles counted by the CPF-L2.
     - Percent
   * - CPF-L2 Stall
     - Percent of CPF-L2 busy cycles where the CPF-`L2 <L2>`__ interface was
       stalled for any reason.
     - Percent
   * - CPF-UTCL1 Stall
     - Percent of CPF busy cycles where the CPF was stalled by address
       translation.
     - Percent
Command Processor Packet Processor (CPC) metrics
================================================

.. list-table::
   :header-rows: 1
   :widths: 20 65 15

   * - Metric
     - Description
     - Unit
   * - CPC Utilization
     - Percent of total cycles where the CPC was busy actively doing any work.
       The ratio of CPC busy cycles over total cycles counted by the CPC.
     - Percent
   * - CPC Stall
     - Percent of CPC busy cycles where the CPC was stalled for any reason.
     - Percent
   * - CPC Packet Decoding Utilization
     - Percent of CPC busy cycles spent decoding commands for processing.
     - Percent
   * - CPC-Workgroup Manager Utilization
     - Percent of CPC busy cycles spent dispatching workgroups to the
       `workgroup manager <SPI>`__.
     - Percent
   * - CPC-L2 Utilization
     - Percent of total cycles counted by the CPC-`L2 <L2>`__ interface where
       the CPC-L2 interface was active doing any work.
     - Percent
   * - CPC-UTCL1 Stall
     - Percent of CPC busy cycles where the CPC was stalled by address
       translation.
     - Percent
   * - CPC-UTCL2 Utilization
     - Percent of total cycles counted by the CPC's L2 address translation
       interface where the CPC was busy doing address translation work.
     - Percent
*****************
Compute unit (CU)
*****************

The compute unit (CU) is responsible for executing a user's kernels on
CDNA-based accelerators. All :ref:`wavefronts` of a :ref:`workgroup` are
scheduled on the same CU.

.. image:: ../data/performance-model/gcn_compute_unit.png
   :alt: AMD CDNA accelerator compute unit diagram

The CU consists of several independent execution pipelines and functional
units.
* The :ref:`desc-valu` is composed of multiple SIMD (single instruction,
  multiple data) vector processors, vector general purpose registers (VGPRs)
  and instruction buffers. The VALU is responsible for executing much of the
  computational work on CDNA accelerators, including but not limited to
  floating-point operations (FLOPs) and integer operations (IOPs).
* The *vector memory (VMEM)* unit is responsible for issuing loads, stores and
  atomic operations that interact with the memory system.
* The :ref:`desc-salu` is shared by all threads in a :ref:`wavefront`, and is
  responsible for executing instructions that are known to be uniform across
  the wavefront at compile-time. The SALU has a memory unit (SMEM) for
  interacting with memory, but it cannot issue separately from the SALU.
* The :ref:`desc-lds` is an on-CU software-managed scratchpad memory that can
  be used to efficiently share data between all threads in a :ref:`workgroup`.
* The :ref:`desc-scheduler` is responsible for issuing and decoding
  instructions for all the :ref:`wavefronts <wavefront>` on the compute unit.
* The *vector L1 data cache (vL1D)* is the first level cache local to the
  compute unit. On current CDNA accelerators, the vL1D is write-through. The
  vL1D caches from multiple compute units are kept coherent with one another
  through software instructions.
* CDNA accelerators -- that is, AMD Instinct MI100 and newer -- contain
  specialized matrix-multiplication accelerator pipelines known as the
  :ref:`desc-mfma`.
For a more in-depth description of a compute unit on a CDNA accelerator, see
slides 22 to 28 in
`An introduction to AMD GPU Programming with HIP <https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf>`_
and slide 27 in
`The AMD GCN Architecture - A Crash Course (Layla Mah) <https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah>`_.

:ref:`pipeline-desc` details the various execution pipelines (VALU, SALU, LDS,
scheduler, and so on). The metrics presented by Omniperf for these pipelines
are described in :ref:`pipeline-metrics`. Finally, the `vL1D <vL1D>`__ cache
and `LDS <LDS>`__ are described in their own sections.
.. include:: ./includes/pipeline-descriptions.rst

.. include:: ./includes/pipeline-metrics.rst