Commit

more reorg
Signed-off-by: Peter Jun Park <peter.park@amd.com>

clean up

Signed-off-by: Peter Jun Park <peter.park@amd.com>

reorg images

move profile mode

reorg

reorg

reorg more

fix formatting
peterjunpark committed Jun 26, 2024
1 parent 18effce commit 2a4206b
Showing 82 changed files with 6,180 additions and 1,149 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/docs.yml
@@ -1,4 +1,4 @@
name: Lint Documentation
name: Linting

on:
push:
96 changes: 96 additions & 0 deletions docs/conceptual/command-processor.rst
@@ -0,0 +1,96 @@
**********************
Command processor (CP)
**********************

The command processor (CP) is responsible for interacting with the AMDGPU
kernel driver (part of the Linux kernel) on the CPU, and with user-space
HSA clients when they submit commands to HSA queues. Basic tasks of the CP
include reading commands (for example, those corresponding to a kernel
launch) out of `HSA
queues <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-Runtime-1.2.pdf>`__
(Sec. 2.5), scheduling work to subsequent parts of the scheduler
pipeline, and marking kernels complete for synchronization events on the
host.

The command processor is composed of two sub-components:

- Fetcher (CPF): fetches commands out of memory and hands them over to
the CPC for processing.
- Packet processor (CPC): the microcontroller running the command
processing firmware that decodes the fetched commands and, for
kernels, passes them to the `workgroup processors <SPI>`__ for
scheduling.
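
A minimal sketch of the host-side activity the CP services appears below.
The kernel and sizes are illustrative only: launching a kernel enqueues a
kernel-dispatch packet onto an HSA queue for the CPF to fetch and the CPC
to decode, and the host-side synchronization waits on the completion
signal the CP updates when the kernel finishes.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void scale(float* x, float a, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) x[i] *= a;
   }

   int main() {
     int n = 1 << 20;
     float* x;
     (void)hipMalloc(&x, n * sizeof(float));

     // Enqueues a kernel-dispatch packet onto an HSA queue; the CPF
     // fetches the packet and the CPC decodes it and schedules the
     // workgroups.
     hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, 0,
                        x, 2.0f, n);

     // Completes once the CP marks the kernel done and the completion
     // signal is satisfied.
     (void)hipDeviceSynchronize();
     (void)hipFree(x);
     return 0;
   }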

Before scheduling work to the accelerator, the command processor can
first acquire a memory fence to ensure system consistency `(Sec
2.6.4) <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-Runtime-1.2.pdf>`__.
After the work is complete, the command processor can apply a
memory-release fence. Depending on the AMD CDNA accelerator in
question, either of these operations *may* initiate a cache write-back
or invalidation.

Analyzing command processor performance is most interesting for kernels
that the user suspects to be scheduling- or launch-rate limited (see the
sketch after this list for a pattern that tends to exhibit this). The
command processor's metrics therefore focus on reporting, for example:

- Utilization of the fetcher
- Utilization of the packet processor, and the fraction of cycles spent
decoding packets
- Fetch/processing stalls
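
As a hedged illustration, the following hypothetical pattern tends to be
launch-rate limited: many back-to-back dispatches of a kernel that does
almost no work, so the per-launch packet fetch/decode/scheduling overhead
in the CP, rather than kernel execution, sets the throughput.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void tiny(int* out) {
     // Nearly empty body: execution time is dwarfed by launch overhead.
     if (blockIdx.x == 0 && threadIdx.x == 0) ++(*out);
   }

   int main() {
     int* out;
     (void)hipMalloc(&out, sizeof(int));
     (void)hipMemset(out, 0, sizeof(int));

     // Thousands of small dispatches on one stream: each launch is
     // another packet for the CPF to fetch and the CPC to decode.
     for (int i = 0; i < 10000; ++i) {
       hipLaunchKernelGGL(tiny, dim3(1), dim3(64), 0, 0, out);
     }
     (void)hipDeviceSynchronize();
     (void)hipFree(out);
     return 0;
   }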

Command Processor Fetcher (CPF) metrics
=======================================

.. list-table::
:header-rows: 1
:widths: 20 65 15

* - Metric
- Description
- Unit
* - CPF Utilization
- Percent of total cycles where the CPF was busy actively doing any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
- Percent
* - CPF Stall
- Percent of CPF busy cycles where the CPF was stalled for any reason.
- Percent
* - CPF-L2 Utilization
- Percent of total cycles counted by the CPF-:ref:`L2 <def-l2>` interface where the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles over total cycles counted by the CPF-L2.
- Percent
* - CPF-L2 Stall
- Percent of CPF-L2 busy cycles where the CPF-:ref:`L2 <def-l2>` interface was stalled for any reason.
- Percent
* - CPF-UTCL1 Stall
- Percent of CPF busy cycles where the CPF was stalled by address translation.
- Percent

Command Processor Packet Processor (CPC) metrics
================================================

.. list-table::
:header-rows: 1
:widths: 20 65 15

* - Metric
- Description
- Unit
* - CPC Utilization
- Percent of total cycles where the CPC was busy actively doing any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
- Percent
* - CPC Stall
- Percent of CPC busy cycles where the CPC was stalled for any reason.
- Percent
* - CPC Packet Decoding Utilization
- Percent of CPC busy cycles spent decoding commands for processing.
- Percent
* - CPC-Workgroup Manager Utilization
- Percent of CPC busy cycles spent dispatching workgroups to the :ref:`workgroup manager <def-spi>`.
- Percent
* - CPC-L2 Utilization
- Percent of total cycles counted by the CPC-[L2](L2) interface where the CPC-L2 interface was active doing any work.
- Percent
* - CPC-UTCL1 Stall
- Percent of CPC busy cycles where the CPC was stalled by address translation.
- Percent
* - CPC-UTCL2 Utilization
- Percent of total cycles counted by the CPC's L2 address translation interface where the CPC was busy doing address translation work.
- Percent
54 changes: 54 additions & 0 deletions docs/conceptual/compute-unit.rst
@@ -0,0 +1,54 @@
*****************
Compute unit (CU)
*****************

The compute unit (CU) is responsible for executing a user's kernels on
CDNA-based accelerators. All :ref:`wavefronts` of a :ref:`workgroup` are
scheduled on the same CU.

.. image:: ../data/performance-model/gcn_compute_unit.png
:alt: AMD CDNA accelerator compute unit diagram

The CU consists of several independent execution pipelines and functional units.

* The :ref:`desc-valu` is composed of multiple SIMD (single
instruction, multiple data) vector processors, vector general purpose
registers (VGPRs) and instruction buffers. The VALU is responsible for
executing much of the computational work on CDNA accelerators, including but
not limited to floating-point operations (FLOPs) and integer operations
(IOPs).
* The *vector memory (VMEM)* unit is responsible for issuing loads, stores and
atomic operations that interact with the memory system.
* The :ref:`desc-salu` is shared by all threads in a
:ref:`wavefront <wavefronts>`, and is responsible for executing instructions that are
known to be uniform across the wavefront at compile time. The SALU has a
memory unit (SMEM) for interacting with memory, but SMEM cannot issue
instructions independently of the SALU.
* The :ref:`desc-lds` is an on-CU software-managed scratchpad memory
that can be used to efficiently share data between all threads in a
:ref:`workgroup`.
* The :ref:`desc-scheduler` is responsible for issuing and decoding instructions for all
the :ref:`wavefronts` on the compute unit.
* The *vector L1 data cache (vL1D)* is the first level cache local to the
compute unit. On current CDNA accelerators, the vL1D is write-through. The
vL1D caches from multiple compute units are kept coherent with one another
through software instructions.
* CDNA accelerators -- that is, AMD Instinct MI100 and newer -- contain
specialized matrix-multiplication accelerator pipelines known as the
:ref:`desc-mfma`.
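
The short kernel below sketches how common HIP constructs map onto these
units; the mapping comments follow the descriptions above, and the kernel
itself (a block-level reduction, assumed to be launched with 256 threads
per workgroup) is illustrative.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void blockSum(const float* in, float* out, int n) {
     // LDS: software-managed scratchpad shared by the workgroup.
     __shared__ float lds[256];

     // blockIdx/blockDim are uniform across the wavefront, so this
     // address arithmetic is a natural fit for the SALU.
     int base = blockIdx.x * blockDim.x;

     // Per-thread load: issued by the VMEM unit through the vL1D cache.
     int i = base + threadIdx.x;
     lds[threadIdx.x] = (i < n) ? in[i] : 0.0f;
     __syncthreads();

     // Per-thread floating-point work on distinct data: VALU FLOPs.
     for (int s = blockDim.x / 2; s > 0; s >>= 1) {
       if (threadIdx.x < s) lds[threadIdx.x] += lds[threadIdx.x + s];
       __syncthreads();
     }
     if (threadIdx.x == 0) out[blockIdx.x] = lds[0];
   }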

For a more in-depth description of a compute unit on a CDNA accelerator, see
slides 22 to 28 in
`An introduction to AMD GPU Programming with HIP <https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf>`_
and slide 27 in
`The AMD GCN Architecture - A Crash Course (Layla Mah) <https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah>`_.

:ref:`pipeline-desc` details the various
execution pipelines (VALU, SALU, LDS, scheduler, and so forth). The metrics
presented by Omniperf for these pipelines are described in
:ref:`pipeline-metrics`. Finally, the `vL1D <vL1D>`__ cache and
`LDS <LDS>`__ are described in their own sections.

.. include:: ./includes/pipeline-descriptions.rst

.. include:: ./includes/pipeline-metrics.rst
130 changes: 125 additions & 5 deletions docs/conceptual/glossary.rst
@@ -6,7 +6,129 @@
Glossary
********

.. include:: ./includes/terms.rst
The following table briefly defines some terminology used in Omniperf interfaces
and in this documentation.

.. list-table::
:header-rows: 1

* - Name
- Description
- Unit

* - Kernel time
- The number of seconds the accelerator was executing a kernel, from the
:ref:`command processor <def-cp>`'s (CP) start-of-kernel
timestamp (a number of cycles after the CP begins processing the packet)
to the CP's end-of-kernel timestamp (a number of cycles before the CP
stops processing the packet).
- Seconds

* - Kernel cycles
- The number of cycles the accelerator was active doing *any* work, as
measured by the :ref:`command processor <def-cp>` (CP).
- Cycles

* - Total CU cycles
- The number of cycles the accelerator was active doing *any* work
(that is, kernel cycles), multiplied by the number of
:doc:`compute units <compute-unit>` on the accelerator. A
measure of the total possible active cycles the compute units could be
doing work, useful for the normalization of metrics inside the CU.
- Cycles

* - Total active CU cycles
- The number of cycles a CU on the accelerator was active doing *any*
work, summed over all :ref:`compute units <def-cu>` on the
accelerator.
- Cycles

* - Total SIMD cycles
- The number of cycles the accelerator was active doing *any* work (that
is, kernel cycles), multiplied by the number of
:ref:`SIMDs <def-cu>` on the accelerator. A measure of the
total possible active cycles the SIMDs could be doing work, useful for
the normalization of metrics inside the CU.
- Cycles

* - Total L2 cycles
- The number of cycles the accelerator was active doing *any* work (that
is, kernel cycles), multiplied by the number of :ref:`L2 <def-l2>`
channels on the accelerator. A measure of the total possible active
cycles the L2 channels could be doing work, useful for the normalization
of metrics inside the L2.
- Cycles

* - Total active L2 cycles
- The number of cycles a channel of the L2 cache was active doing *any*
work, summed over all :ref:`L2 <def-l2>` channels on the accelerator.
- Cycles

* - Total sL1D cycles
- The number of cycles the accelerator was active doing *any* work (that
is, kernel cycles), multiplied by the number of
:ref:`scalar L1 data caches <def-sl1d>` on the accelerator. A measure of
the total possible active cycles the sL1Ds could be doing work, useful
for the normalization of metrics inside the sL1D.
- Cycles

* - Total L1I cycles
- The number of cycles the accelerator was active doing *any* work (that
is, kernel cycles), multiplied by the number of
:ref:`L1 instruction caches <def-l1i>` (L1I) on the accelerator. A
measure of the total possible active cycles the L1Is could be doing
work, useful for the normalization of metrics inside the L1I.
- Cycles

* - Total scheduler-pipe cycles
- The number of cycles the accelerator was active doing *any* work (that
is, kernel cycles), multiplied by the number of
:ref:`scheduler pipes <def-cp>` on the accelerator. A measure of the
total possible active cycles the scheduler-pipes could be doing work,
useful for the normalization of metrics inside the
:ref:`workgroup manager <def-spi>` and :ref:`command processor <def-cp>`.
- Cycles

* - Total shader-engine cycles
- The total number of cycles the accelerator was active doing *any* work,
multiplied by the number of :ref:`shader engines <def-se>` on the
accelerator. A measure of the total possible active cycles the shader
engines could be doing work, useful for the normalization of
metrics inside the :ref:`workgroup manager <def-spi>`.
- Cycles

* - Thread-requests
- The number of unique memory addresses accessed by a single memory
instruction. On AMD Instinct accelerators, this has a maximum of 64
(that is, the size of the :ref:`wavefront <def-wavefront>`).
- Addresses

* - Work-item
- A single *thread*, or lane, of execution that executes in lockstep with
the rest of the work-items comprising a :ref:`wavefront <def-wavefront>`
of execution.
- N/A

* - Wavefront
- A group of work-items, or threads, that execute in lockstep on the
:ref:`compute unit <def-cu>`. On AMD Instinct accelerators, the
wavefront size is always 64 work-items.
- N/A

* - Workgroup
- A group of wavefronts that execute on the same
:ref:`compute unit <def-cu>`, and can cooperatively execute and share
data via the use of synchronization primitives, :ref:`LDS <def-lds>`,
atomics, and others.
- N/A

* - Divergence
- Divergence within a wavefront occurs when not all work-items are active
when executing an instruction, that is, due to non-uniform control flow
within a wavefront. It can reduce execution efficiency by causing,
for instance, the :ref:`VALU <def-valu>` to execute both
branches of a conditional with different sets of work-items active (see
the sketch following this table).
- N/A
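
The following hypothetical kernel illustrates divergence: even and odd
lanes of each 64-wide wavefront take different branches, so the two branch
bodies execute one after the other, each with only half of the work-items
active.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void divergent(float* x, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i >= n) return;
     // Lanes disagree on this predicate, so the VALU executes both paths
     // serially with complementary sets of lanes active.
     if (i % 2 == 0) {
       x[i] = x[i] * 2.0f;
     } else {
       x[i] = x[i] + 1.0f;
     }
   }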

.. include:: ./includes/normalization-units.rst
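
To make the normalization terms concrete, the snippet below works through
the arithmetic for a hypothetical accelerator with 120 compute units, 4
SIMDs per CU, and 32 L2 channels; the device parameters and the measured
cycle count are illustrative, not a statement about any particular product.

.. code-block:: cpp

   #include <cinttypes>
   #include <cstdint>
   #include <cstdio>

   int main() {
     const uint64_t kernel_cycles = 1000000;  // as measured by the CP
     const uint64_t num_cus = 120, simds_per_cu = 4, l2_channels = 32;

     // Each "total" scales kernel cycles by the number of unit
     // instances, per the glossary definitions above.
     const uint64_t total_cu_cycles   = kernel_cycles * num_cus;
     const uint64_t total_simd_cycles = total_cu_cycles * simds_per_cu;
     const uint64_t total_l2_cycles   = kernel_cycles * l2_channels;

     // Dividing a per-unit event count by the matching total yields a
     // normalized per-cycle rate suitable for comparing kernels.
     std::printf("total CU cycles:   %" PRIu64 "\n", total_cu_cycles);
     std::printf("total SIMD cycles: %" PRIu64 "\n", total_simd_cycles);
     std::printf("total L2 cycles:   %" PRIu64 "\n", total_l2_cycles);
     return 0;
   }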

@@ -81,14 +203,12 @@ of LLVM:
will always have the most up-to-date information, and the interested reader is
referred to this source for a more complete explanation.

.. _def-memory-type:

Memory type
===========

AMD Instinct(tm) accelerators contain a number of different memory allocation
AMD Instinct accelerators contain a number of different memory allocation
types to enable the HIP language's
:doc:`memory coherency model <hip:how-to/programming-manual>`.
:doc:`memory coherency model <hip:how-to/programming_manual>`.
These memory types are broadly similar between AMD Instinct accelerator
generations, but may differ in exact implementation.

44 changes: 0 additions & 44 deletions docs/conceptual/includes/compute-unit.rst
@@ -1,45 +1 @@
.. _def-compute-unit:

Compute unit
============

The compute unit (CU) is responsible for executing a user's kernels on
CDNA-based accelerators. All :ref:`wavefronts` of a :ref:`workgroup` are
scheduled on the same CU.

.. image:: ../data/performance-model/gcn_compute_unit.png
:alt: AMD CDNA accelerator compute unit diagram

The CU consists of several independent pipelines and functional units.

* The *vector arithmetic logic unit (VALU)* is composed of multiple SIMD (single
instruction, multiple data) vector processors, vector general purpose
registers (VGPRs) and instruction buffers. The VALU is responsible for
executing much of the computational work on CDNA accelerators, including but
not limited to floating-point operations (FLOPs) and integer operations
(IOPs).
* The *vector memory (VMEM)* unit is responsible for issuing loads, stores and
atomic operations that interact with the memory system.
* The *scalar arithmetic logic unit (SALU)* is shared by all threads in a
[wavefront](wavefront), and is responsible for executing instructions that are
known to be uniform across the wavefront at compile-time. The SALU has a
memory unit (SMEM) for interacting with memory, but it cannot issue separately
from the SALU.
* The *local data share (LDS)* is an on-CU software-managed scratchpad memory
that can be used to efficiently share data between all threads in a
[workgroup](workgroup).
* The *scheduler* is responsible for issuing and decoding instructions for all
the [wavefronts](wavefront) on the compute unit.
* The *vector L1 data cache (vL1D)* is the first level cache local to the
compute unit. On current CDNA accelerators, the vL1D is write-through. The
vL1D caches from multiple compute units are kept coherent with one another
through software instructions.
* CDNA accelerators -- that is, AMD Instinct MI100 and newer -- contain
specialized matrix-multiplication accelerator pipelines known as the
[matrix fused multiply-add (MFMA)](mfma).

For a more in-depth description of a compute unit on a CDNA accelerator, see
slides 22 to 28 in
`An introduction to AMD GPU Programming with HIP <https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf>`_
and slide 27 in
`The AMD GCN Architecture - A Crash Course (Layla Mah) <https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah>`_.
32 changes: 0 additions & 32 deletions docs/conceptual/includes/high-level-design.rst

This file was deleted.

