@@ -1,21 +1,17 @@
---
title: Optimizing graphics vertex efficiency for Arm GPUs

draft: true
cascade:
draft: true
title: Optimize graphics vertex efficiency for Arm GPUs

minutes_to_complete: 10

who_is_this_for: This is an advanced topic for Android graphics application developers.
who_is_this_for: This is an advanced topic for Android graphics application developers aiming to enhance GPU performance through smarter vertex optimization.

learning_objectives:
- Optimize vertex representations on Arm GPUs
- How to interpret Vertex Memory Efficiency in Arm Frame Advisor
- Optimize vertex representations on Arm GPUs.
- Analyze Vertex Memory Efficiency using Arm Frame Advisor.

prerequisites:
- An understanding of vertex attributes
- Familiarity with Arm Frame Advisor, part of Arm Performance Studio
- Understanding of vertex attributes.
- Familiarity with Arm Frame Advisor (part of Arm Performance Studio).

author:
- Andrew Kilroy
@@ -43,13 +39,17 @@ further_reading:
link: https://developer.arm.com/documentation/102693/latest/
type: documentation
- resource:
title: Analyse a Frame with Frame Advisor
title: Analyze a Frame with Frame Advisor
link: https://learn.arm.com/learning-paths/mobile-graphics-and-gaming/analyze_a_frame_with_frame_advisor/
type: blog
- resource:
title: Arm Performance Studio
link: https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Studio%20for%20Mobile
type: website
- resource:
title: Attribute Layouts
link: https://developer.arm.com/documentation/101897/0304/Vertex-shading/Attribute-layout
type: website



@@ -6,45 +6,39 @@ weight: 5
layout: learningpathall
---

# Optimizing graphics vertex efficiency for Arm GPUs
## Diagnosing poor Vertex Memory Efficiency with Frame Advisor

You are writing a graphics application targeting an Arm Immortalis
GPU, and not hitting your desired performance. When running the Arm
Frame Advisor tool, you spot that the draw calls in your shadow map
creation pass have poor Vertex Memory Efficiency (VME) scores. How
should you go about improving this?
Imagine you're developing a graphics application targeting an Arm Immortalis GPU, but you're seeing subpar performance.

![Frame Advisor screenshot](fa-found-bad-vme-in-content-metrics.png)
After profiling your frame with Arm Frame Advisor, you might notice that the shadow map draw calls have low Vertex Memory Efficiency (VME), as shown in the image below.

In this Learning Path, you will learn about a common source of rendering
inefficiency, how to spot the issue using Arm Frame Advisor, and how
to rectify it.
This raises an important question: what's causing the inefficiency, and how can you fix it?

![Frame Advisor screenshot#center](fa-found-bad-vme-in-content-metrics.png "Arm Frame Advisor showing poor Vertex Memory Efficiency (VME) in shadow map draw calls.")

## Finding a solution

This Learning Path shows you approaches to addressing this problem, by demonstrating:

* Common sources of rendering inefficiencies.
* How to identify and rectify issues using Arm Frame Advisor.

## Shadow mapping

In this scenario, draw calls in the shadow map render pass are the
source of our poor VME scores. Let's start by reviewing exactly what
these draws are doing.
In this scenario, draw calls in the shadow map render pass are responsible for the low Vertex Memory Efficiency (VME) scores. To understand why, let's begin by reviewing what these draws are doing.

Shadow mapping is the mechanism that decides, for every visible pixel,
whether it is lit or in shadow. A shadow map is a texture that is
created as the first part of this process. It is rendered from the
point of view of the light source, and stores the distance to all of
the objects that light can see. Parts of a surface that are visible
to the light are lit, and any part that is occluded must be in shadow.
*Shadow mapping* is the mechanism that decides whether each visible pixel is lit or in shadow. The process begins by creating a shadow map, a texture rendered from the point of view of the light source. This texture stores the distance to the nearest surfaces visible to the light.

During the final render pass, the GPU compares each pixel's distance from the light source to the corresponding value stored in the shadow map. If the pixel is farther away than what the light "sees," it’s considered occluded and rendered in shadow. Otherwise, it is lit.
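
The depth comparison described above can be sketched on the CPU as follows. This is only an illustration: in a real renderer the test runs in a fragment shader, and the function name `isInShadow` and the `bias` parameter (a small offset commonly added to avoid self-shadowing artifacts) are assumptions, not something taken from this Learning Path.

```cpp
#include <cassert>

// Hypothetical helper illustrating the shadow test.
// depthFromLight: distance from the light to the surface point being shaded.
// shadowMapDepth: nearest distance the light recorded in that direction.
// bias:           assumed small tolerance that avoids self-shadowing.
bool isInShadow(float depthFromLight, float shadowMapDepth, float bias) {
    assert(bias >= 0.0f);
    // The point is occluded if something closer to the light was recorded.
    return depthFromLight > shadowMapDepth + bias;
}
```

A point at the same depth as the stored value is treated as lit, while a point clearly behind the stored value is treated as shadowed.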

## Mesh layout

The primary input into shadow map creation is the object geometry for
all of the objects that cast shadows. In this scenario, let's
assume that the vertex data for each object is stored in memory as an
array structure, which is a commonly used layout in many applications:
The primary input for shadow map creation is the geometry of all objects that cast shadows. In this scenario, assume that each object’s vertex data is stored in memory as an array structure, a layout commonly used in many applications:

``` C++
struct Vertex {
float position[3];
float color[3].
float color[3];
float normal[3];
};

@@ -54,67 +48,46 @@ std::vector<Vertex> mesh {

```

This would give the mesh the following layout in memory:
This gives the mesh the following layout in memory:

![Initial memory layout](initial-memory-layout.png)
![Initial memory layout#center](initial-memory-layout.png "Initial memory layout")

## Why is this sub-optimal?
## Why is this suboptimal?

This looks like a standard way of passing mesh data into a GPU,
At first glance, this looks like a standard way of passing mesh data into a GPU,
so where is the inefficiency coming from?

The vertex data that is defined contains all of the attributes that
you need for your object, including those that are needed to compute
color in the main lighting pass. When generating the shadow map,
you only need to compute the position of the object, so most
of your vertex attributes will be unused by the shadow map generation
draw calls.

The inefficiency comes from how hardware gets the data it needs from
main memory so that computation can proceed. Processors do not fetch
single values from DRAM, but instead fetch a small neighborhood of
data, because this is the most efficient way to read from DRAM. For Arm
GPUs, the hardware will read an entire 64 byte cache line at a time.
The vertex data that is defined contains all of the attributes that you need for your object, including those that are needed to compute color in the main lighting pass. When generating the shadow map, you only need to compute the position of the object, so most of your vertex attributes will be unused by the shadow map generation draw calls.

In this example, an attempt to fetch a vertex position during shadow
map creation would also load the nearby color and normal values,
even though you do not need them.
The inefficiency comes from how GPUs fetch vertex data from main memory. GPUs don't retrieve individual values from DRAM. Instead, they fetch a small neighborhood of data at once, which is more efficient for memory access. On Arm GPUs, this typically means reading an entire 64-byte cache line at a time.

In this example, fetching a vertex position for shadow map rendering also loads the adjacent color and normal attributes into cache, even though they're not needed. This wastes memory bandwidth and contributes to poor Vertex Memory Efficiency (VME).
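
To make the cost concrete, the following sketch counts how many 64-byte cache lines are touched when only the position is read from each vertex. It is a deliberately simplified model: it assumes the buffer starts on a cache-line boundary, that vertices are read once in order, and that position sits at the start of each vertex. The helper name `cacheLinesTouched` is hypothetical, and real hardware behavior also depends on caching and index reuse.

```cpp
#include <cassert>
#include <cstddef>

// Assumption taken from the text: whole 64-byte cache lines are fetched.
constexpr std::size_t kCacheLineBytes = 64;

// Hypothetical helper: counts the distinct cache lines touched when reading
// the first `usedBytes` of each vertex, for `vertexCount` vertices spaced
// `strideBytes` apart in a cache-line-aligned buffer.
std::size_t cacheLinesTouched(std::size_t vertexCount,
                              std::size_t strideBytes,
                              std::size_t usedBytes) {
    assert(usedBytes > 0 && usedBytes <= strideBytes);
    std::size_t lines = 0;
    std::size_t nextUncounted = 0;  // first cache line not yet counted
    for (std::size_t v = 0; v < vertexCount; ++v) {
        const std::size_t firstByte = v * strideBytes;
        const std::size_t lastByte = firstByte + usedBytes - 1;
        std::size_t firstLine = firstByte / kCacheLineBytes;
        const std::size_t lastLine = lastByte / kCacheLineBytes;
        // Skip lines that an earlier vertex already pulled in.
        if (firstLine < nextUncounted) {
            firstLine = nextUncounted;
        }
        if (lastLine >= firstLine) {
            lines += lastLine - firstLine + 1;
            nextUncounted = lastLine + 1;
        }
    }
    return lines;
}
```

Under these assumptions, reading positions for 64 vertices touches 36 cache lines with the 36-byte interleaved stride, but only 12 when the positions are tightly packed with a 12-byte stride.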

## Detecting a sub-optimal layout
## Detecting a suboptimal layout

Arm Frame Advisor analyzes the attribute memory layout for each draw
call the application makes, and provides the Vertex Memory Efficiency
(VME) metric to show how efficiently that attribute layout is working.
Arm Frame Advisor analyzes the vertex attribute memory layout for each draw call and reports a Vertex Memory Efficiency (VME) metric to show how efficiently the GPU accesses vertex data.

![Location of vertex memory efficiency in FA](fa-navigate-to-call.png)
![Location of vertex memory efficiency in FA#center](fa-navigate-to-call.png "Location of vertex memory efficiency in Frame Advisor")

A VME of 1.0 would indicate that the draw call is making an optimal
use of the memory bandwidth, with no unnecessary data fetches.
A VME of 1.0 indicates that the draw call is making optimal use of memory bandwidth, with no unnecessary data fetches.

A VME of less than one indicates that unnecessary data is being loaded
from memory, wasting bandwidth on data that is not being used in the
computation on the GPU.
A VME score below 1.0 indicates that unnecessary data is being loaded from memory, wasting bandwidth on attributes not being used in the computation on the GPU.

In this mesh layout you are only using 12 bytes for the `position`
field, out of a total vertex size of 36 bytes, so your VME score would
be only 0.33.
In this mesh layout you are only using 12 bytes for the `position` field, out of a 36-byte vertex, resulting in a VME score of 0.33.
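
The arithmetic behind that score can be sketched as a ratio of used bytes to fetched bytes. This is an approximation of the metric as described here, not necessarily how Frame Advisor computes it internally, and the helper name `vertexMemoryEfficiency` is hypothetical. It also assumes 4-byte floats and no struct padding, which holds on common platforms.

```cpp
#include <cassert>
#include <cstddef>

// Interleaved vertex layout from the example above.
struct Vertex {
    float position[3];
    float color[3];
    float normal[3];
};

// Hypothetical helper: fraction of the fetched bytes that the draw call uses.
inline double vertexMemoryEfficiency(std::size_t usedBytes,
                                     std::size_t fetchedBytes) {
    assert(fetchedBytes > 0 && usedBytes <= fetchedBytes);
    return static_cast<double>(usedBytes) / static_cast<double>(fetchedBytes);
}

// Shadow pass on the interleaved layout: only position is consumed.
inline double shadowPassVme() {
    return vertexMemoryEfficiency(sizeof(Vertex::position), sizeof(Vertex));
}
```

With 12 bytes used out of 36, `shadowPassVme()` evaluates to one third, which matches the 0.33 score quoted above.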

## Fixing a suboptimal layout

## Fixing a sub-optimal layout
Shadow mapping only needs to load position, so to fix this issue you need to use a memory layout that allows position to be fetched in isolation from the other data. It is still preferable to leave the other attributes interleaved.

Shadow mapping only needs to load position, so to fix this issue you
need to use a memory layout that allows position to be fetched in
isolation from the other data. It is still preferable to leave the
other attributes interleaved. On the CPU, this would look like the following:
On the CPU, the split layout looks like this:

``` C++
struct VertexPart1 {
float position[3];
};

struct VertexPart2 {
float color[3].
float color[3];
float normal[3];
};

@@ -127,35 +100,14 @@ std::vector<VertexPart2> mesh {
};
```

This allows the shadow map creation pass to read only useful position
data, without any waste. The main lighting pass that renders the full
object will then read from both memory regions.

The good news is that this technique is actually a useful one to apply
all of the time, even for the main lighting pass! Many mobile GPUs,
including Arm GPUs, process geometry in two passes. The first pass
computes only the primitive position, and second pass will process
the remainder of the vertex shader only for the primitives that are
visible after primitive culling has been performed. By splitting
the position attributes into a separate stream, you avoid wasting
memory bandwidth fetching non-position data for primitives that are
ultimately discarded by primitive culling tests.
This allows the shadow map creation pass to read only useful position data, without any waste. The main lighting pass that renders the full object will then read from both memory regions.
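
If your assets are already stored in the interleaved form, one possible way to produce the two streams is a one-off conversion at load time. The sketch below reuses the `Vertex`, `VertexPart1`, and `VertexPart2` structs from this page. The `SplitMesh` struct and `splitStreams` function are hypothetical names introduced only for illustration, and uploading each stream to its own GPU buffer is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

struct Vertex {
    float position[3];
    float color[3];
    float normal[3];
};

struct VertexPart1 {
    float position[3];
};

struct VertexPart2 {
    float color[3];
    float normal[3];
};

// Hypothetical container for the two de-interleaved streams.
struct SplitMesh {
    std::vector<VertexPart1> positions;   // read by the shadow and main passes
    std::vector<VertexPart2> attributes;  // read by the main pass only
};

// Hypothetical helper: copies an interleaved mesh into two separate streams.
SplitMesh splitStreams(const std::vector<Vertex>& mesh) {
    SplitMesh out;
    out.positions.reserve(mesh.size());
    out.attributes.reserve(mesh.size());
    for (const Vertex& v : mesh) {
        VertexPart1 p{};
        VertexPart2 a{};
        std::copy(std::begin(v.position), std::end(v.position), std::begin(p.position));
        std::copy(std::begin(v.color), std::end(v.color), std::begin(a.color));
        std::copy(std::begin(v.normal), std::end(v.normal), std::begin(a.normal));
        out.positions.push_back(p);
        out.attributes.push_back(a);
    }
    // Both streams must stay index-compatible with the original mesh.
    assert(out.positions.size() == out.attributes.size());
    return out;
}
```

Because both streams keep the original vertex order, an existing index buffer can continue to address them without changes.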

The good news is that this technique is actually a useful one to apply all of the time, even for the main lighting pass! Many mobile GPUs, including Arm GPUs, process geometry in two passes: an initial pass that computes only primitive positions, followed by a second pass that runs the full vertex shader only for primitives that survive culling. By placing position data in a separate buffer or stream, you reduce memory bandwidth wasted on fetching attributes like color or normals for primitives that are ultimately discarded.

# Conclusion
## Conclusion

Arm Frame Advisor can give you actionable metrics that can identify
specific inefficiencies in your application to optimize.
Arm Frame Advisor provides actionable metrics that can help identify specific inefficiencies in your graphics application. The Vertex Memory Efficiency metric measures how efficiently you are using your input vertex memory bandwidth, indicating what proportion of the input data is actually consumed by the shader program. You can improve VME by adjusting your vertex memory layout to separate attribute data into distinct streams, ensuring that each render pass only loads the data it needs. Avoid packing unused attributes into memory regions accessed by draw calls, as this wastes bandwidth and reduces performance.

The VME metric shows how efficiently you are using your input
vertex memory bandwidth, indicating what proportion of the input
data is actually used by the shader program. VME can be improved by
changing vertex memory layout to separate the different streams of
data such that only the data needed for type of computation is packed
together. Try not to mix data in that a computation would not use.

# Other links

Arm's advice on [attribute layouts][2]

[2]: https://developer.arm.com/documentation/101897/0304/Vertex-shading/Attribute-layout