Update documentation for V1.8
chesik-amd committed Sep 14, 2020
1 parent 0a1e44b commit 016720b
Showing 20 changed files with 139 additions and 118 deletions.
30 changes: 14 additions & 16 deletions Known_Issues.md
@@ -7,31 +7,29 @@ All platforms
3) Radeon Developer Panel cannot capture profiles from non-AMD GPUs.
4) Radeon Developer Panel will NOT capture profiles from Windows Insider Editions.
5) Anti-virus may impede key-based capture (Ctrl+Shift+C)
6) Applications that call Present() from the async compute queue are not supported. A fix will be available soon.
7) When using RGP with RenderDoc, please make sure that RenderDoc is terminated between RenderDoc capture sessions (generating a RenderDoc capture file or loading a RenderDoc capture file is considered a session for the purpose here). While it is possible to take multiple RGP profiles of a RenderDoc capture file, it is not possible to take RGP profiles between RenderDoc sessions. If this is attempted, RenderDoc will pop up an error dialog box indicating that an RGP profile can't be taken and to restart RenderDoc
6) Applications that call Present() from the async compute queue are not supported.
7) When using RGP with RenderDoc, please make sure that RenderDoc is terminated between RenderDoc capture sessions (generating a RenderDoc capture file or loading a RenderDoc capture file is considered a session for the purpose here). While it is possible to take multiple RGP profiles of a RenderDoc capture file, it is not possible to take RGP profiles between RenderDoc sessions. If this is attempted, RenderDoc will show an error dialog box indicating that an RGP profile can't be taken and to restart RenderDoc.
8) If an instance of Radeon GPU Profiler is spawned from RenderDoc, it must be closed before restarting RenderDoc. The menu option to create new RGP profiles will not be enabled otherwise.
9) OpenCL captures may include an extra DMA command buffer in the Profile Summary. This issue will be fixed in a future version.
10) Launching the RadeonDeveloperPanel, clicking "connect" and starting an application may cause a hang or reboot when using 3 or more attached monitors (especially if they are 4K). Please use a dual-monitor configuration at most to avoid this from happening. This issue is currently being investigated.
9) OpenCL captures may include an extra DMA command buffer in the Profile Summary.
10) Launching the RadeonDeveloperPanel, clicking "connect" and starting an application may cause a hang or reboot when using 3 or more attached monitors (especially if they are 4K). To prevent this from happening, please use at most a dual-monitor configuration.
11) Detailed instruction timing is not yet supported on OpenCL.
12) The last 2 arguments to vkDrawIndexed may contain incorrect values on Vega and Radeon 7 GPU's. This issue is currently under investigation.

Windows
-------
1) Queue synchronization data will be missing from DirectX12 apps running on Windows 10 Home.
2) D3D12 command list calls of ExecuteIndirect() may show in RGP as multiple compute events. This will be corrected in a future release after obtaining more information from the driver.
3) The Radeon Overlay hotkey, Alt+R, conflicts with the Radeon GPU Profiler shortcut key used to select the Pipeline state pane. The Radeon Overlay hotkey can be reconfigured by opening the Radeon Settings panel (from the system tray), selecting the Preferences tab then clicking the "Toggle Radeon Overlay Hotkey" button.
4) Radeon Developer Panel is unable to capture on Windows 7. If running on Windows 7, please use Radeon Developer Panel from the RGP 1.6 release.
2) D3D12 command list calls of ExecuteIndirect() may show in RGP as multiple compute events.
3) Some Radeon Software hotkeys may conflict with Radeon GPU Profiler shortcut keys. The Radeon Software hotkeys can be reconfigured by opening the Radeon Software panel (from the system tray), selecting the Hotkeys tab under Settings then changing or unbinding any conflicting hotkeys.

Linux
-----
1) If the Developer Panel or the Developer Service crash while running with the root account, it may be necessary to restart/exit them again with the root account in order to cleanup shared memory.
2) The Radeon Developer Service and Panel are only officially supported using the standard desktop managers (GDM and Unity). Other desktop managers should work but a dialog box indicating that the service is running in headless mode may pop up. However, it should still be possible to capture profiles.
3) If the RadeonDeveloperServiceCLI application crashes, shared memory may need to be cleaned up by running the RemoveSharedMemory.sh script located in the script folder of the RGP release kit. Run the script with elevated privileges using sudo.
4) The Radeon Developer Panel may fail to start the Radeon Developer Service when the Connect button is clicked. If this occurs, manually start the Radeon Developer Service, select localhost from the the Recent connections list and click the Connect button again.
5) Capture on Radeon RX 5500 and Radeon RX 5300 series graphics cards may cause crashes or hangs on Linux. This is expected to be fixed in a future driver release.
1) After an amdgpu-pro driver install on Ubuntu 20.04, the default Vulkan ICD may be the RADV ICD. In order to capture a profile, Vulkan applications must be using the amdgpu-pro Vulkan ICD. The default Vulkan ICD can be overridden by setting the following environment variable before launching a Vulkan application: VK_ICD_FILENAMES=/etc/vulkan/icd.d/amd_icd64.json
2) After launching RGP from the Developer Panel to view a captured profile, the panel may fail to connect the next time it is launched. The workaround is to close RGP before relaunching the panel.
3) If the Developer Panel or the Developer Service crashes while running with the root account, it may be necessary to restart/exit them again with the root account in order to clean up shared memory.
4) When running with the root account, the Developer Panel may output error or warning messages to the terminal. These should not prevent the panel from functioning properly.
5) The Radeon Developer Service and Panel are only officially supported using the standard desktop managers (GDM and Unity). Other desktop managers should work but a dialog box indicating that the service is running in headless mode may pop up. However, it should still be possible to capture profiles.
6) If the RadeonDeveloperServiceCLI application crashes, shared memory may need to be cleaned up by running the RemoveSharedMemory.sh script located in the script folder of the RGP release kit. Run the script with elevated privileges using sudo.
7) The Radeon Developer Panel may fail to start the Radeon Developer Service when the Connect button is clicked. If this occurs, manually start the Radeon Developer Service, select localhost from the Recent connections list and click the Connect button again.
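
The ICD override described in Linux item 1 can also be applied per process rather than for the whole shell session. The following is a minimal sketch, assuming the amdgpu-pro ICD manifest lives at the path quoted in that item; the helper names `icd_environment` and `launch_with_amd_icd` are hypothetical and are not part of the RGP release kit.

```python
import os
import subprocess

# Path quoted in the known issue above; adjust it if your install differs.
AMD_ICD_MANIFEST = "/etc/vulkan/icd.d/amd_icd64.json"


def icd_environment(base_env=None, icd_path=AMD_ICD_MANIFEST):
    """Return a copy of the environment with VK_ICD_FILENAMES overridden."""
    env = dict(os.environ if base_env is None else base_env)
    env["VK_ICD_FILENAMES"] = icd_path
    return env


def launch_with_amd_icd(command, icd_path=AMD_ICD_MANIFEST):
    """Start a Vulkan application (hypothetical helper) with the override set."""
    return subprocess.Popen(command, env=icd_environment(icd_path=icd_path))
```

For example, `launch_with_amd_icd(["./my_vulkan_app"])` would start a (hypothetical) application with the override in place while leaving the parent shell's environment untouched.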

RDNA
----
1) Some shaders that write to the execute mask register may not have instruction timing data.
2) The Device configuration does not show the correct Work group processor per Shader engine for certain parts with harvested CUs.
3) OpenCL is not yet supported for RX 5500 and RX 5300 series graphics cards. This support will be added in a future driver release.
1) The Device configuration does not show the correct Work group processor per Shader engine for certain parts with harvested CUs.
11 changes: 4 additions & 7 deletions README.md
@@ -15,12 +15,12 @@ In order to use the latest features of RGP, it is strongly recommended that user
* RadeonDeveloperServiceCLI (RDS headless)
* Radeon Developer Panel (RDP)
* Radeon GPU Profiler (RGP)
3. To gather a profile from a game run the Radeon Developer Panel and follow the instructions in the Help. Help can be found in the following locations:
3. To capture a profile from a game, run the Radeon Developer Panel and follow the instructions in the Help. Help can be found in the following locations:
* Help web pages exist in the "docs" sub directory
* Help web pages can be accessed from the **Help** button in the Developer Panel
* Help web pages can be accessed from the Welcome screen in the Radeon GPU Profiler, or from the **Help** menu
* The documentation is hosted publicly at:
* http://devdrivertools.readthedocs.io/en/latest/
* https://radeon-developer-panel.readthedocs.io/en/latest/
* http://radeon-gpuprofiler.readthedocs.io/en/latest/

## Supported ASICs
@@ -38,11 +38,8 @@ In order to use the latest features of RGP, it is strongly recommended that user
* DirectX12
* Vulkan

### Windows7
* Vulkan
* User must install latest VC 2015 redistributables from https://www.microsoft.com/en-us/download/details.aspx?id=53840

### Ubuntu 18.04.3 LTS
### Ubuntu 20.04.1 LTS
* Vulkan

## Supported compute APIs, ASICs, and operating systems
@@ -57,5 +54,5 @@ In order to use the latest features of RGP, it is strongly recommended that user

### Supported Operating Systems
* Windows 10
* Windows 7
* Ubuntu 18.04.3 LTS
* Ubuntu 20.04.1 LTS
15 changes: 11 additions & 4 deletions docs/source/Barriers.rst
@@ -17,7 +17,13 @@ barriers and layout transitions will be shown as 'N/A'.

The summary at the top left of the UI quickly lets
the developer know if there is an issue with barrier usage in the frame.
In the case above the barrier usage is taking up 0% of the frame.
When calculating the percentage, only portions of a barrier's duration
which are not overlapped by one or more events from any queue are taken
into consideration. For instance, if a barrier has a duration of 100 ns,
but 80 ns of that barrier's duration are overlapped by other events (on
the same queue or on a different queue), then only 20 ns of that
particular barrier contributes to the percentage calculation.
In the case shown above, the barrier usage is taking up 0% of the frame.
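
The overlap rule described above can be sketched as an interval calculation. This is only an illustration under stated assumptions: barriers and events are modeled as `(start, end)` pairs in nanoseconds, the function names are hypothetical, and dividing by the frame duration is an assumed denominator. It is not RGP's actual implementation.

```python
def exposed_barrier_time(barrier, events):
    """Portion of a barrier's duration not overlapped by any event."""
    start, end = barrier
    # Clip every event to the barrier's window and process in start order.
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in events
        if e > start and s < end
    )
    covered = 0
    cursor = start  # everything before the cursor is already accounted for
    for s, e in clipped:
        if e <= cursor:
            continue  # fully inside an interval that was already counted
        covered += e - max(s, cursor)
        cursor = e
    return (end - start) - covered


def barrier_percentage(barriers, events, frame_duration):
    """Share of the frame taken by non-overlapped barrier time (assumed formula)."""
    exposed = sum(exposed_barrier_time(b, events) for b in barriers)
    return 100.0 * exposed / frame_duration
```

With a barrier spanning 0 to 100 ns and events covering 0 to 50 ns and 30 to 80 ns (from any queue), `exposed_barrier_time` yields 20 ns, matching the example in the text.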

This summary also displays the average number of barriers
per draw or dispatch and the average number of
@@ -38,8 +44,9 @@ The table shows the following information:
in the graphics pipe we need the work to drain from

#. **Layout transitions** - A blue check box indicates if the barrier is
associated with a layout transition. There are 6 columns indicating the
type of layout transition
associated with a layout transition. There are six columns indicating the
type of layout transition. These are described in the Layout transition
section below.

#. **Invalidated** - A list of invalidated caches

@@ -127,4 +134,4 @@ As we see, the time taken due to barriers is typically very small since inter-di
It should be noted that the meaning of barriers in RGP for OpenCL is different from OpenCL's synchronization
APIs and is not related to the OpenCL synchronization APIs based on cl_event or cl_barrier.
For this reason, the barriers seen in OpenCL profiles are known as cmdBarrier() which is not a part of the OpenCL API.
For OpenCL profiles, RGP does not presently show OpenCL events or host synchronization.
For OpenCL profiles, RGP does not presently show OpenCL events or host synchronization.
67 changes: 22 additions & 45 deletions docs/source/InstructionTiming.rst
@@ -3,8 +3,8 @@ Instruction Timing

The Instruction Timing pane shows the average issue latency of each instruction of a single shader.
The Instruction Timing information is generated using hardware support on AMD GCN GPUs. Generating
Instruction Timing does not require recompilation of shaders or insert any instrumentation into
shaders.
Instruction Timing does not require recompilation of shaders or insertion of any instrumentation
into shaders.

The Instruction Timing pane shows GCN ISA. For a description of GCN ISA, refer to the shader
programming guide at
@@ -20,7 +20,7 @@ instruction and the one after that. To provide information on what Average Laten
sample ISA statements are shown below.

**Best Case Instruction Issue:** In the below image, we see three instructions. The *4 clocks*
denotes the latency between the issue of the *s_mov_b32* instruction and the issue of the following
denote the latency between the issue of the *s_mov_b32* instruction and the issue of the following
*v_lshlrev_b32_e32* instruction. Similarly, the interval between the issue of *v_lshlrev_b32_e32*
and *s_lshr_b32* instruction is 6 clocks. This example shows the best performance case where each
instruction is issued at an interval of 4 clocks.
@@ -32,18 +32,18 @@ instruction is issued at an interval of 4 clocks.
*exp pos0* instruction's issue can be delayed for reasons such as unavailable memory resources
which may be in use by other wavefronts. As a result, there is a long duration in the instruction.
Since the latency waiting for memory resources was seen for the first export instruction,
subsequent exports have much shorter duration.
subsequent exports have a much shorter duration.

.. image:: media_rgp/RGP_Instruction_Timing_Example_2.png

**Waitcounts and Instruction Issue:** In the below image, we see seven instructions. The
*v_mov_b32_e32* and the *v_perm_b32* instructions issue in 4 clocks as expected. We then see a
*tbuffer_load* followed by a *s_waitcnt*. The *s_waitcnt* has a longer issue interval of 721
clocks. The *tbuffer_load* instruction also has a relatively short latency of only 19 cycles which
may seen counter intuitive since its a memory load instruction. However, this is expected as
may seem counterintuitive since it's a memory load instruction. However, this is expected as
*s_waitcnt* is a shader instruction used for synchronization to wait for previous instructions such
as the previous buffer load to finish. The *s_waitcnt* instruction will issue and then wait (in this
case 721 clocks) until the next instruction instruction which is the *ds_write2_b32* can be issued.
case 721 clocks) until the next instruction which is the *ds_write2_b32* can be issued.

.. image:: media_rgp/RGP_Instruction_Timing_Example_3.png

@@ -64,56 +64,31 @@ the shader)*

\ **Instruction Timing Capture Granularity**

Instruction Timing information is generated for a part of the RGP profile rather than for the whole
RGP profile. The granularity is configured using the API PSO hash. Instruction Timing information
is generated by providing an API PSO hash to Radeon Developer Panel. The API PSO hash can be copied
from RGP and pasted into Radeon Developer Panel. Please see the Radeon Developer Panel for more
information on how to capture instruction information.
Instruction Timing information is generated for the whole RGP profile, but data is limited to a
single shader engine. Only waves executed by a single shader engine contribute to the hit counts
and timing information shown in the Instruction timing pane. Please see the Radeon Developer Panel
documentation for more information on how to capture instruction timing information.

It is important to note that the instruction trace information can also be present for events that
use a different PSO than the one that was provided to RDP. The main reasons for this behavior are:

- Using the Radeon Developer Panel, instruction tracing will be enabled until the end of the
command buffer. So when an event using the selected PSO starts execution, detailed tracing should
be visible for events until the end of the command buffer.

- The ability to capture Instruction Trace is enabled globally for all the events executing on the
GPU once it has been enabled.

Due to these reasons, it is very likely that Instruction Timing information will be available for
events that have a different API PSO from the one that was provided to the Radeon Developer Panel.
To view all the events that have Instruction Timing information, the developer can choose the
"Color by Instruction Timing" option in the Wavefront Occupancy or the Event Timing views.

\ **Availability of Instruction Timing**

The Instruction Timing will be available for the draws / dispatches that used the PSO that was
chosen. However, in certain cases it is possible that the Instruction Timing information may not be
there for the selected PSO. The main reasons why Instruction Timing information may not be present
for a selected API PSO are described below.
In certain cases it is possible that the Instruction Timing information may not be available for
all events. The main reasons why Instruction Timing information may not be present
for an event are described below.

\ **Hardware Architecture and Draw Scheduling**: Instruction Timing information is only sampled
from some of the compute units of the GPU. As a result, it is possible for events with very few
waves to not have instruction data even if the API PSO hash was selected. This can happen if the
GPU schedules the waves on a compute unit that doesn't have instruction trace enabled.

\ **Pipeline Used in Captured Frame**: The developer should pass the API PSO hash to the Radeon
Developer Panel before starting the application and taking a new profile. The developer should
ensure that the API PSO intended to be captured is used in a command buffer that is actually
executed in the frame.

\ **Changes in API PSO Hash**: If any state within API PSO has been changed, it is important to
note that the API PSO hash will also likely have changed. This can occur if the shaders within
the API PSO were edited. If that is the case, the user should gather an RGP trace without
Instruction Trace first to get the updated API PSO hash and then capture a new profile with
Instruction Trace.
from some of the compute units on a single shader engine of the GPU. As a result, it is possible
for events with very few waves to not have instruction data. This can happen if the
GPU schedules the waves on a shader engine or compute unit that doesn't have instruction trace enabled.

\ **Internal Events**: It should be noted that it is not possible to view Instruction Timing
information for internal events such as Clear().

\ **Navigation**

The Instruction Timing for an event can be accessed by by right clicking on that event and choosing
The Instruction Timing for an event can be accessed by right-clicking on that event and choosing
the "View In Instruction Timing" option. Since it is common to use the same shader in multiple
events, RGP provides an easy way to toggle between multiple events that use the same shader using
the event drop down shown below.
@@ -202,13 +177,15 @@ long shaders.
- Branches: This denotes the number of branch instructions in the shader and the percentage of
the total number of branches that were taken by the shader.

- Theoretical Occupancy: From the register information and knowledge about the GPU architecture we
can calculate the theoretical maximum wavefront occupancy for the shader.

- Vector and Scalar Registers: The register values indicate the number of registers that the shader
is using. The value in parentheses is the number of registers that have been allocated for the
shader.

- Theoretical Occupancy: From the register information and knowledge about the GCN architecture we
can calculate the theoretical maximum wavefront occupancy for the shader.

- Local Data Share Size: This value indicates how many bytes of local data share are used by the
shader. This is only displayed for Compute Shaders.
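
As a rough illustration of how a theoretical occupancy can follow from register usage, the sketch below considers vector registers only. The defaults (10 wavefronts per SIMD and 256 vector registers available per SIMD lane) are assumptions based on commonly cited GCN limits; scalar registers, local data share usage, and RDNA parts impose further limits that are not modeled here. The function name is hypothetical and this is not RGP's actual calculation.

```python
def theoretical_occupancy(vgprs_allocated, max_waves_per_simd=10, vgprs_per_simd=256):
    """Upper bound on wavefronts per SIMD implied by vector register allocation.

    The default limits are assumed GCN values, not taken from the RGP documentation.
    """
    if vgprs_allocated <= 0:
        # No vector register pressure: only the hardware wave limit applies.
        return max_waves_per_simd
    return min(max_waves_per_simd, vgprs_per_simd // vgprs_allocated)
```

Under these assumptions, a shader allocated 64 vector registers would be limited to 4 wavefronts per SIMD, while one allocated 24 or fewer would reach the assumed maximum of 10.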

\ **Instruction Timing for RDNA**
