Commit

more reorg
Signed-off-by: Peter Jun Park <peter.park@amd.com>

clean up

Signed-off-by: Peter Jun Park <peter.park@amd.com>

reorg images

move profile mode

reorg

reorg

reorg more

fix formatting
peterjunpark committed Jun 26, 2024
1 parent 18effce commit 2a4206b
Showing 82 changed files with 6,180 additions and 1,149 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/docs.yml
@@ -1,4 +1,4 @@
name: Lint Documentation
name: Linting

on:
push:
96 changes: 96 additions & 0 deletions docs/conceptual/command-processor.rst
@@ -0,0 +1,96 @@
**********************
Command processor (CP)
**********************

The command processor (CP) is responsible for interacting with the AMDGPU
kernel driver (part of the Linux kernel) on the CPU, and with user-space
HSA clients when they submit commands to HSA queues. Basic tasks of the CP
include reading commands (for example, those corresponding to a kernel
launch) out of `HSA
queues <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-Runtime-1.2.pdf>`__
(Sec. 2.5), scheduling work to subsequent parts of the scheduler
pipeline, and marking kernels complete for synchronization events on the
host.

The command processor is composed of two sub-components:

- Fetcher (CPF): fetches commands out of memory and hands them over to
the CPC for processing.
- Packet processor (CPC): the microcontroller running the command
processing firmware that decodes the fetched commands and, for
kernels, passes them to the `workgroup processors <SPI>`__ for
scheduling.
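
A minimal sketch of the host-side activity the CP services appears below.
The kernel and sizes are illustrative only: launching a kernel enqueues a
kernel-dispatch packet onto an HSA queue for the CPF to fetch and the CPC
to decode, and the host-side synchronization waits on the completion
signal the CP updates when the kernel finishes.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void scale(float* x, float a, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) x[i] *= a;
   }

   int main() {
     int n = 1 << 20;
     float* x;
     (void)hipMalloc(&x, n * sizeof(float));

     // Enqueues a kernel-dispatch packet onto an HSA queue; the CPF
     // fetches the packet and the CPC decodes it and schedules the
     // workgroups.
     hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, 0,
                        x, 2.0f, n);

     // Completes once the CP marks the kernel done and the completion
     // signal is satisfied.
     (void)hipDeviceSynchronize();
     (void)hipFree(x);
     return 0;
   }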

Before scheduling work to the accelerator, the command processor can
first acquire a memory fence to ensure system consistency `(Sec
2.6.4) <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-Runtime-1.2.pdf>`__.
After the work is complete, the command processor can apply a
memory-release fence. Depending on the AMD CDNA accelerator in
question, either of these operations *may* initiate a cache write-back
or invalidation.

Analyzing command processor performance is most interesting for kernels
that the user suspects to be scheduling- or launch-rate limited (see the
sketch after this list for a pattern that tends to exhibit this). The
command processor's metrics therefore focus on reporting, for example:

- Utilization of the fetcher
- Utilization of the packet processor, and the fraction of cycles spent
decoding packets
- Fetch/processing stalls
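
As a hedged illustration, the following hypothetical pattern tends to be
launch-rate limited: many back-to-back dispatches of a kernel that does
almost no work, so the per-launch packet fetch/decode/scheduling overhead
in the CP, rather than kernel execution, sets the throughput.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void tiny(int* out) {
     // Nearly empty body: execution time is dwarfed by launch overhead.
     if (blockIdx.x == 0 && threadIdx.x == 0) ++(*out);
   }

   int main() {
     int* out;
     (void)hipMalloc(&out, sizeof(int));
     (void)hipMemset(out, 0, sizeof(int));

     // Thousands of small dispatches on one stream: each launch is
     // another packet for the CPF to fetch and the CPC to decode.
     for (int i = 0; i < 10000; ++i) {
       hipLaunchKernelGGL(tiny, dim3(1), dim3(64), 0, 0, out);
     }
     (void)hipDeviceSynchronize();
     (void)hipFree(out);
     return 0;
   }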

Command Processor Fetcher (CPF) metrics
=======================================

.. list-table::
:header-rows: 1
:widths: 20 65 15

* - Metric
- Description
- Unit
* - CPF Utilization
- Percent of total cycles where the CPF was busy actively doing any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
- Percent
* - CPF Stall
- Percent of CPF busy cycles where the CPF was stalled for any reason.
- Percent
* - CPF-L2 Utilization
- Percent of total cycles counted by the CPF-:ref:`L2 <def-l2>` interface where the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles over total cycles counted by the CPF-L2.
- Percent
* - CPF-L2 Stall
- Percent of CPF-L2 busy cycles where the CPF-:ref:`L2 <def-l2>` interface was stalled for any reason.
- Percent
* - CPF-UTCL1 Stall
- Percent of CPF busy cycles where the CPF was stalled by address translation.
- Percent

Command Processor Packet Processor (CPC) metrics
================================================

.. list-table::
:header-rows: 1
:widths: 20 65 15

* - Metric
- Description
- Unit
* - CPC Utilization
- Percent of total cycles where the CPC was busy actively doing any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
- Percent
* - CPC Stall
- Percent of CPC busy cycles where the CPC was stalled for any reason.
- Percent
* - CPC Packet Decoding Utilization
- Percent of CPC busy cycles spent decoding commands for processing.
- Percent
* - CPC-Workgroup Manager Utilization
- Percent of CPC busy cycles spent dispatching workgroups to the :ref:`workgroup manager <def-spi>`.
- Percent
* - CPC-L2 Utilization
- Percent of total cycles counted by the CPC-[L2](L2) interface where the CPC-L2 interface was active doing any work.
- Percent
* - CPC-UTCL1 Stall
- Percent of CPC busy cycles where the CPC was stalled by address translation.
- Percent
* - CPC-UTCL2 Utilization
- Percent of total cycles counted by the CPC's L2 address translation interface where the CPC was busy doing address translation work.
- Percent
54 changes: 54 additions & 0 deletions docs/conceptual/compute-unit.rst
@@ -0,0 +1,54 @@
*****************
Compute unit (CU)
*****************

The compute unit (CU) is responsible for executing a user's kernels on
CDNA-based accelerators. All :ref:`wavefronts` of a :ref:`workgroup` are
scheduled on the same CU.

.. image:: ../data/performance-model/gcn_compute_unit.png
:alt: AMD CDNA accelerator compute unit diagram

The CU consists of several independent execution pipelines and functional units.

* The :ref:`desc-valu` is composed of multiple SIMD (single
instruction, multiple data) vector processors, vector general purpose
registers (VGPRs) and instruction buffers. The VALU is responsible for
executing much of the computational work on CDNA accelerators, including but
not limited to floating-point operations (FLOPs) and integer operations
(IOPs).
* The *vector memory (VMEM)* unit is responsible for issuing loads, stores and
atomic operations that interact with the memory system.
* The :ref:`desc-salu` is shared by all threads in a
:ref:`wavefront <wavefronts>`, and is responsible for executing instructions that are
known to be uniform across the wavefront at compile time. The SALU has a
memory unit (SMEM) for interacting with memory, but SMEM cannot issue
instructions independently of the SALU.
* The :ref:`desc-lds` is an on-CU software-managed scratchpad memory
that can be used to efficiently share data between all threads in a
:ref:`workgroup`.
* The :ref:`desc-scheduler` is responsible for issuing and decoding instructions for all
the :ref:`wavefronts` on the compute unit.
* The *vector L1 data cache (vL1D)* is the first level cache local to the
compute unit. On current CDNA accelerators, the vL1D is write-through. The
vL1D caches from multiple compute units are kept coherent with one another
through software instructions.
* CDNA accelerators -- that is, AMD Instinct MI100 and newer -- contain
specialized matrix-multiplication accelerator pipelines known as the
:ref:`desc-mfma`.
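
The short kernel below sketches how common HIP constructs map onto these
units; the mapping comments follow the descriptions above, and the kernel
itself (a block-level reduction, assumed to be launched with 256 threads
per workgroup) is illustrative.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void blockSum(const float* in, float* out, int n) {
     // LDS: software-managed scratchpad shared by the workgroup.
     __shared__ float lds[256];

     // blockIdx/blockDim are uniform across the wavefront, so this
     // address arithmetic is a natural fit for the SALU.
     int base = blockIdx.x * blockDim.x;

     // Per-thread load: issued by the VMEM unit through the vL1D cache.
     int i = base + threadIdx.x;
     lds[threadIdx.x] = (i < n) ? in[i] : 0.0f;
     __syncthreads();

     // Per-thread floating-point work on distinct data: VALU FLOPs.
     for (int s = blockDim.x / 2; s > 0; s >>= 1) {
       if (threadIdx.x < s) lds[threadIdx.x] += lds[threadIdx.x + s];
       __syncthreads();
     }
     if (threadIdx.x == 0) out[blockIdx.x] = lds[0];
   }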

For a more in-depth description of a compute unit on a CDNA accelerator, see
slides 22 to 28 in
`An introduction to AMD GPU Programming with HIP <https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf>`_
and slide 27 in
`The AMD GCN Architecture - A Crash Course (Layla Mah) <https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah>`_.

:ref:`pipeline-desc` details the various
execution pipelines (VALU, SALU, LDS, scheduler, and so forth). The metrics
presented by Omniperf for these pipelines are described in
:ref:`pipeline-metrics`. Finally, the `vL1D <vL1D>`__ cache and
`LDS <LDS>`__ are described in their own sections.

.. include:: ./includes/pipeline-descriptions.rst

.. include:: ./includes/pipeline-metrics.rst
130 changes: 125 additions & 5 deletions docs/conceptual/glossary.rst
@@ -6,7 +6,129 @@
Glossary
********

.. include:: ./includes/terms.rst
The following table briefly defines some terminology used in Omniperf interfaces
and in this documentation.

.. list-table::
:header-rows: 1

* - Name
- Description
- Unit

* - Kernel time
- The number of seconds the accelerator was executing a kernel, from the
:ref:`command processor <def-cp>`'s (CP) start-of-kernel
timestamp (a number of cycles after the CP begins processing the packet)
to the CP's end-of-kernel timestamp (a number of cycles before the CP
stops processing the packet).
- Seconds

* - Kernel cycles
- The number of cycles the accelerator was active doing *any* work, as
measured by the :ref:`command processor <def-cp>` (CP).
- Cycles

* - Total CU cycles
- The number of cycles the accelerator was active doing *any* work
(that is, kernel cycles), multiplied by the number of
:doc:`compute units <compute-unit>` on the accelerator. A
measure of the total possible active cycles the compute units could be
doing work, useful for the normalization of metrics inside the CU.
- Cycles

* - Total active CU cycles
- The number of cycles a CU on the accelerator was active doing *any*
work, summed over all :ref:`compute units <def-cu>` on the
accelerator.
- Cycles

* - Total SIMD cycles
- The number of cycles the accelerator was active doing *any* work (that
is, kernel cycles), multiplied by the number of
:ref:`SIMDs <def-cu>` on the accelerator. A measure of the
total possible active cycles the SIMDs could be doing work, useful for
the normalization of metrics inside the CU.
- Cycles

* - Total L2 cycles
- The number of cycles the accelerator was active doing *any* work (that
is, kernel cycles), multiplied by the number of :ref:`L2 <def-l2>`
channels on the accelerator. A measure of the total possible active
cycles the L2 channels could be doing work, useful for the normalization
of metrics inside the L2.
- Cycles

* - Total active L2 cycles
- The number of cycles a channel of the L2 cache was active doing *any*
work, summed over all :ref:`L2 <def-l2>` channels on the accelerator.
- Cycles

* - Total sL1D cycles
- The number of cycles the accelerator was active doing *any* work (that
is, kernel cycles), multiplied by the number of
:ref:`scalar L1 data caches <def-sl1d>` on the accelerator. A measure of
the total possible active cycles the sL1Ds could be doing work, useful
for the normalization of metrics inside the sL1D.
- Cycles

* - Total L1I cycles
- The number of cycles the accelerator was active doing *any* work (that
is, kernel cycles), multiplied by the number of
:ref:`L1 instruction caches <def-l1i>` (L1I) on the accelerator. A
measure of the total possible active cycles the L1Is could be doing
work, useful for the normalization of metrics inside the L1I.
- Cycles

* - Total scheduler-pipe cycles
- The number of cycles the accelerator was active doing *any* work (that
is, kernel cycles), multiplied by the number of
:ref:`scheduler pipes <def-cp>` on the accelerator. A measure of the
total possible active cycles the scheduler-pipes could be doing work,
useful for the normalization of metrics inside the
:ref:`workgroup manager <def-spi>` and :ref:`command processor <def-cp>`.
- Cycles

* - Total shader-engine cycles
- The total number of cycles the accelerator was active doing *any* work,
multiplied by the number of :ref:`shader engines <def-se>` on the
accelerator. A measure of the total possible active cycles the shader
engines could be doing work, useful for the normalization of
metrics inside the :ref:`workgroup manager <def-spi>`.
- Cycles

* - Thread-requests
- The number of unique memory addresses accessed by a single memory
instruction. On AMD Instinct accelerators, this has a maximum of 64
(that is, the size of the :ref:`wavefront <def-wavefront>`).
- Addresses

* - Work-item
- A single *thread*, or lane, of execution that executes in lockstep with
the rest of the work-items comprising a :ref:`wavefront <def-wavefront>`
of execution.
- N/A

* - Wavefront
- A group of work-items, or threads, that execute in lockstep on the
:ref:`compute unit <def-cu>`. On AMD Instinct accelerators, the
wavefront size is always 64 work-items.
- N/A

* - Workgroup
- A group of wavefronts that execute on the same
:ref:`compute unit <def-cu>`, and can cooperatively execute and share
data via the use of synchronization primitives, :ref:`LDS <def-lds>`,
atomics, and others.
- N/A

* - Divergence
- Divergence within a wavefront occurs when not all work-items are active
when executing an instruction, that is, due to non-uniform control flow
within a wavefront. It can reduce execution efficiency by causing,
for instance, the :ref:`VALU <def-valu>` to execute both
branches of a conditional with different sets of work-items active (see
the sketch following this table).
- N/A
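
The following hypothetical kernel illustrates divergence: even and odd
lanes of each 64-wide wavefront take different branches, so the two branch
bodies execute one after the other, each with only half of the work-items
active.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void divergent(float* x, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i >= n) return;
     // Lanes disagree on this predicate, so the VALU executes both paths
     // serially with complementary sets of lanes active.
     if (i % 2 == 0) {
       x[i] = x[i] * 2.0f;
     } else {
       x[i] = x[i] + 1.0f;
     }
   }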

.. include:: ./includes/normalization-units.rst
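
To make the normalization terms concrete, the snippet below works through
the arithmetic for a hypothetical accelerator with 120 compute units, 4
SIMDs per CU, and 32 L2 channels; the device parameters and the measured
cycle count are illustrative, not a statement about any particular product.

.. code-block:: cpp

   #include <cinttypes>
   #include <cstdint>
   #include <cstdio>

   int main() {
     const uint64_t kernel_cycles = 1000000;  // as measured by the CP
     const uint64_t num_cus = 120, simds_per_cu = 4, l2_channels = 32;

     // Each "total" scales kernel cycles by the number of unit
     // instances, per the glossary definitions above.
     const uint64_t total_cu_cycles   = kernel_cycles * num_cus;
     const uint64_t total_simd_cycles = total_cu_cycles * simds_per_cu;
     const uint64_t total_l2_cycles   = kernel_cycles * l2_channels;

     // Dividing a per-unit event count by the matching total yields a
     // normalized per-cycle rate suitable for comparing kernels.
     std::printf("total CU cycles:   %" PRIu64 "\n", total_cu_cycles);
     std::printf("total SIMD cycles: %" PRIu64 "\n", total_simd_cycles);
     std::printf("total L2 cycles:   %" PRIu64 "\n", total_l2_cycles);
     return 0;
   }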

@@ -81,14 +203,12 @@ of LLVM:
will always have the most up-to-date information, and the interested reader is
referred to this source for a more complete explanation.

.. _def-memory-type:

Memory type
===========

AMD Instinct(tm) accelerators contain a number of different memory allocation
AMD Instinct accelerators contain a number of different memory allocation
types to enable the HIP language's
:doc:`memory coherency model <hip:how-to/programming-manual>`.
:doc:`memory coherency model <hip:how-to/programming_manual>`.
These memory types are broadly similar between AMD Instinct accelerator
generations, but may differ in exact implementation.

44 changes: 0 additions & 44 deletions docs/conceptual/includes/compute-unit.rst
@@ -1,45 +1 @@
.. _def-compute-unit:

Compute unit
============

The compute unit (CU) is responsible for executing a user's kernels on
CDNA-based accelerators. All :ref:`wavefronts` of a :ref:`workgroup` are
scheduled on the same CU.

.. image:: ../data/performance-model/gcn_compute_unit.png
:alt: AMD CDNA accelerator compute unit diagram

The CU consists of several independent pipelines and functional units.

* The *vector arithmetic logic unit (VALU)* is composed of multiple SIMD (single
instruction, multiple data) vector processors, vector general purpose
registers (VGPRs) and instruction buffers. The VALU is responsible for
executing much of the computational work on CDNA accelerators, including but
not limited to floating-point operations (FLOPs) and integer operations
(IOPs).
* The *vector memory (VMEM)* unit is responsible for issuing loads, stores and
atomic operations that interact with the memory system.
* The *scalar arithmetic logic unit (SALU)* is shared by all threads in a
[wavefront](wavefront), and is responsible for executing instructions that are
known to be uniform across the wavefront at compile-time. The SALU has a
memory unit (SMEM) for interacting with memory, but it cannot issue separately
from the SALU.
* The *local data share (LDS)* is an on-CU software-managed scratchpad memory
that can be used to efficiently share data between all threads in a
[workgroup](workgroup).
* The *scheduler* is responsible for issuing and decoding instructions for all
the [wavefronts](wavefront) on the compute unit.
* The *vector L1 data cache (vL1D)* is the first level cache local to the
compute unit. On current CDNA accelerators, the vL1D is write-through. The
vL1D caches from multiple compute units are kept coherent with one another
through software instructions.
* CDNA accelerators -- that is, AMD Instinct MI100 and newer -- contain
specialized matrix-multiplication accelerator pipelines known as the
[matrix fused multiply-add (MFMA)](mfma).

For a more in-depth description of a compute unit on a CDNA accelerator, see
slides 22 to 28 in
`An introduction to AMD GPU Programming with HIP <https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf>`_
and slide 27 in
`The AMD GCN Architecture - A Crash Course (Layla Mah) <https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah>`_.
32 changes: 0 additions & 32 deletions docs/conceptual/includes/high-level-design.rst

This file was deleted.

