diff --git a/content/learning-paths/laptops-and-desktops/mca-godbolt/_index.md b/content/learning-paths/laptops-and-desktops/mca-godbolt/_index.md index db64069da1..f26bf792e7 100644 --- a/content/learning-paths/laptops-and-desktops/mca-godbolt/_index.md +++ b/content/learning-paths/laptops-and-desktops/mca-godbolt/_index.md @@ -1,27 +1,26 @@ --- -title: Running MCA with Arm assembly +title: Use LLVM Machine Code Analyzer to understand code performance minutes_to_complete: 60 -who_is_this_for: This is an introductory topic for Arm developers who want to diagnose performance issues of Arm programs using MCA and Compiler Explorer. +who_is_this_for: This is an introductory topic for Arm developers who want to diagnose performance issues of Arm programs using LLVM Machine Code Analyzer (MCA) and Compiler Explorer. learning_objectives: - - Estimate the hardware resource pressure and the number of cycles taken to execute your code snippet using llvm-mca - - Understand how this estimate can help diagnose possible performance issues - - Use Compiler Explorer to run llvm-mca + - Estimate the hardware resource pressure and the number of cycles taken to execute your code snippet using llvm-mca. + - Understand how this estimate can help diagnose possible performance issues. + - Use Compiler Explorer to run llvm-mca. prerequisites: - - Familiarity with Arm assembly - - clang compiler or access to Compiler Explorer + - Familiarity with Arm assembly. + - LLVM version 16 or newer (to include Neoverse V2 support). 
author_primary: Rin Dobrescu ### Tags skilllevels: Introductory -subjects: Performance +subjects: Performance and Architecture armips: - Neoverse - - Armv8 tools_software_languages: - assembly - llvm-mca diff --git a/content/learning-paths/laptops-and-desktops/mca-godbolt/mca_on_godbolt.md b/content/learning-paths/laptops-and-desktops/mca-godbolt/mca_on_godbolt.md index b85909c581..ffaaa63768 100644 --- a/content/learning-paths/laptops-and-desktops/mca-godbolt/mca_on_godbolt.md +++ b/content/learning-paths/laptops-and-desktops/mca-godbolt/mca_on_godbolt.md @@ -1,19 +1,29 @@ --- -title: Using MCA with Compiler Explorer +title: Use MCA with Compiler Explorer weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- ### What is Compiler Explorer? -Compiler Explorer is an interactive online compiler that lets you edit code in C/C++, Java, Python and many other programming languages. It then allows you to see what the code looks like after being compiled in real time. It supports multiple compilers and has many tools available, including llvm-mca. + +Compiler Explorer is an interactive online compiler that lets you enter code in C/C++, Java, Python and many other programming languages. It allows you to see what the code looks like after being compiled in real time. + +Compiler Explorer supports multiple compilers and has many tools available, including `llvm-mca`. ### Running MCA in Compiler Explorer -To access Compiler Explorer, open a browser and go to https://godbolt.org. This leads you to a page that looks as in Figure 1 below: + +To access Compiler Explorer, open a browser and go to https://godbolt.org. + +This leads you to the page shown below in Figure 1. Your view may be slightly different. + ![godbolt open alt-text#center](open.png "Figure 1. Compiler Explorer") -On the left side of the page is the source code. In Figure 1 it is set to C++, you can click on the programming language to select a different language for the source code. 
Now copy this code and use it as C++ source: -``` +The left side of the page contains the source code. In Figure 1, the language is set to C++, but you can click on the programming language to select a different language for the source code. + +Copy the code below and paste it into Compiler Explorer as C++ source code: + +```C int func(int a, int b, int c, int d, int e, int f) { a = a + b; a = a + c; @@ -24,10 +34,28 @@ int func(int a, int b, int c, int d, int e, int f) { } ``` -On the right side of the page is the disassembly output from the compiler. You can change the compiler by clicking on it and selecting a different one. You can try this now and select `armv8-a clang(trunk)` as the compiler. Then add some compiler flags by typing `-O3` in the `Compiler options` box. You can view the full set of options passed to the compiler by clicking on the green tick next to the compiler. You can now add llvm-mca from the `Add tool` dropdown button, as shown in Figure 2 below: +The right side of the page contains the disassembly output from the compiler. + +You can change the compiler by clicking on it and selecting a different one. + +Select `armv8-a clang(trunk)` as the compiler to see Arm instructions. + +Next, update the compiler flags by typing `-O3` in the `Compiler options` box. + +You can view the full set of options passed to the compiler by clicking on the green tick next to the compiler. + +Click the `Add tool` dropdown button to add `llvm-mca` as a tool as shown in Figure 2 below: + ![tool mca alt-text#center](tool-mca.png "Figure 2. Assembly in Compiler Explorer") -To add more flags to `llvm-mca`, click on the `Arguments` button and type them in. Specify the CPU name to `llvm-mca` by using the `-mcpu` flag. To find what CPUs are supported you can check the [clang documentation](https://clang.llvm.org/docs/CommandGuide/clang.html#cmdoption-print-supported-cpus). 
You can try this now and add `-mcpu=neoverse-v2`, as well as any other flags you choose to pass to `llvm-mca`. On the right side of the page is the output from running `llvm-mca` on the disassembly of the source code, as shown in Figure 3 below: +To add more flags to `llvm-mca`, click on the `Arguments` button and type them in. + +Add `-mcpu=neoverse-v2`, as well as any other flags you choose to pass to `llvm-mca`. + +To find what CPUs are supported you can check the [clang documentation](https://clang.llvm.org/docs/CommandGuide/clang.html#cmdoption-print-supported-cpus). + +The right side of the page now contains the output from running `llvm-mca` on the disassembly of the source code, as shown in Figure 3 below: + ![argument mca alt-text#center](mca-arguments.png "Figure 3. MCA in Compiler Explorer") -You are now able to run `llvm-mca` using Compiler Explorer. \ No newline at end of file +You are now able to run `llvm-mca` using Compiler Explorer. This is helpful when you want to try different compiler versions without installing them. \ No newline at end of file diff --git a/content/learning-paths/laptops-and-desktops/mca-godbolt/running_mca.md b/content/learning-paths/laptops-and-desktops/mca-godbolt/running_mca.md index 0781e23345..50f416c13b 100644 --- a/content/learning-paths/laptops-and-desktops/mca-godbolt/running_mca.md +++ b/content/learning-paths/laptops-and-desktops/mca-godbolt/running_mca.md @@ -1,29 +1,47 @@ --- -title: Running MCA with Arm assembly +title: Run MCA with Arm assembly weight: 2 ### FIXED, DO NOT MODIFY layout: learningpathall --- -### Before you start -Familiarise yourself with the terms below: -- Instruction scheduling: If two instructions appear in a sequence in a program, but are independent from each other, the compiler can swap them without affecting the program's behaviour. 
The goal of instruction scheduling is to find a valid permutation of the program instructions that also optimises the program's performance, by making use of processor resources. -- Pipeline: A pipeline is the mechanism used by the processor to execute instructions. Pipelining makes efficient use of processor resources by dividing instructions into stages that can overlap and be processed in parallel, reducing the time it takes for instructions to execute. Instructions can only be executed if the required data is available, otherwise this leads to a delay in execution called a pipeline stall. -- Resource pressure: Resources refer to the hardware units used to execute instructions. If instructions in a program all rely on the same resources, then it leads to pressure. Execution is slowed down as instructions must wait until the unit they need becomes available. -- Data dependency: Data dependency refers to the relationship between instructions. When an instruction requires data from a previous instruction this creates a data dependency. +### Terminology +Before you get started, familiarize yourself with the terms below: -### What is MCA? -MCA stands for Machine Code Analyser. It is a performance analysis tool that uses information available in [LLVM](https://github.com/llvm/llvm-project) to measure performance on a specific CPU. +- **Instruction scheduling**: If two instructions appear in a sequence in a program, but are independent from each other, the compiler can swap them without affecting the program's behavior. The goal of instruction scheduling is to find a valid permutation of the program instructions that also optimizes the program's performance, by making use of processor resources. + +- **Pipeline**: A pipeline is the mechanism used by the processor to execute instructions. 
Pipelining makes efficient use of processor resources by dividing instructions into stages that can overlap and be processed in parallel, reducing the time it takes for instructions to execute. Instructions can only be executed if the required data is available, otherwise this leads to a delay in execution called a pipeline stall. + +- **Resource pressure**: Resources refer to the hardware units used to execute instructions. If instructions in a program all rely on the same resources, then it leads to pressure. Execution is slowed down as instructions must wait until the unit they need becomes available. + +- **Data dependency**: Data dependency refers to the relationship between instructions. When an instruction requires data from a previous instruction this creates a data dependency. + + +### What is Machine Code Analyzer (MCA)? + +Machine Code Analyzer (MCA) is a performance analysis tool that uses information available in [LLVM](https://github.com/llvm/llvm-project) to measure performance on a specific CPU. ### How can MCA be useful? -MCA takes as input a snippet of assembly code and then simulates the execution of that code in a loop of iterations (default is 100). MCA then outputs a performance report, which contains information such as the latency and throughput of the assembly block and the resource usage for each instruction. Using this information, you can identify bottlenecks in performance such as resource pressure and data dependencies. There are many options you can give MCA to get performance metrics. Those are explained in the [llvm-mca documentation](https://llvm.org/docs/CommandGuide/llvm-mca.html). + +MCA takes as input a snippet of assembly code and then simulates the execution of that code in a loop of iterations (default is 100). + +MCA then outputs a performance report, which contains information such as the latency and throughput of the assembly block and the resource usage for each instruction. 
+ +Using this information, you can identify bottlenecks in performance such as resource pressure and data dependencies. There are many options you can give MCA to get performance metrics. The options are explained in the [llvm-mca documentation](https://llvm.org/docs/CommandGuide/llvm-mca.html). ### MCA example with Arm assembly -You have learned what MCA is and what kind of information it can provide. Now you are going to use MCA to identify a performance issue and improve a snippet of Arm assembly. The example below demonstrates how to run llvm-mca, what the expected output is and what sort of conclusions you can come to using the performance metrics MCA provides. -Write assembly code to compute the sum of 6 numbers. Use a text editor to save the program below in a file named `sum_test1.s`: + +You have learned what MCA is and what kind of information it can provide. Now you are going to use MCA to identify a performance issue and improve a snippet of Arm assembly. + +The example below demonstrates how to run `llvm-mca`, what the expected output is, and the conclusions you can draw using the performance metrics MCA provides. + +The example below computes the sum of 6 numbers. + +Use a text editor to save the program below in a file named `sum_test1.s`: + ``` add x1, x1, x2 add x1, x1, x3 @@ -32,11 +50,15 @@ add x1, x1, x5 add x1, x1, x6 ``` -Now run llvm-mca on this program using: -`llvm-mca -mtriple=aarch64 -mcpu=neoverse-v2 sum_test1.s` +Now run `llvm-mca` on this code: -The produced output looks like below: +```console +llvm-mca -mtriple=aarch64 -mcpu=neoverse-v2 sum_test1.s ``` + +The printed output is similar to: + +```output Iterations: 100 Instructions: 500 Total Cycles: 503 @@ -97,18 +119,29 @@ Resource pressure by instruction: - - - - - - - 0.16 0.17 0.17 0.17 0.17 0.16 - - - - add x1, x1, x6 ``` -The MCA output shows a lot of information. Let us discuss the most relevant parts. 
For further details, you can look at the [llvm-mca documentation](https://llvm.org/docs/CommandGuide/llvm-mca.html#how-llvm-mca-works). -The first part of the output, up to the `Instruction Info` section, is general information about the loop and the hardware. MCA simulated the execution of your code in a loop for 100 iterations. It executed a total of 500 instructions in 503 cycles. If you calculate the instructions per cycle (IPC) on average you get 500/503≈0.99 IPC. The dispatch width of 16 means the CPU is capable of dispatching 16 instructions per cycle. +The MCA output shows a lot of information. The most relevant parts are covered below. For further details, you can look at the [llvm-mca documentation](https://llvm.org/docs/CommandGuide/llvm-mca.html#how-llvm-mca-works). + +The first part of the output, up to the `Instruction Info` section, is general information about the loop and the hardware. MCA simulated the execution of the code in a loop for 100 iterations. It executed a total of 500 instructions in 503 cycles. If you calculate the instructions per cycle (IPC) on average you get 500/503≈0.99 IPC. The dispatch width of 16 means the CPU is capable of dispatching 16 instructions per cycle. The second part of the output, up to the `Resources` section, gives information about each individual instruction. Latency represents how many cycles each instruction takes to execute. Throughput represents the rate at which instructions are executed per cycle. Reciprocal throughput (RThroughput) is the inverse of throughput (1/throughput) and represents cycles per instruction. An important part of this output is the `Resource pressure by instruction` section. It shows which instructions are executed on which pipelines. You can see that the add instructions use resources `[4]-[9]` and that pressure is equally spread through the available resources. 
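The IPC figure quoted above can be reproduced from the summary numbers that `llvm-mca` prints. The snippet below is a minimal sketch of that arithmetic, not part of `llvm-mca` itself; the function name `ipc` and the `summary` dictionary are illustrative names, and the values are copied from the report for `sum_test1.s`.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Average instructions per cycle over the whole simulated run."""
    return instructions / cycles


# Values taken from the llvm-mca summary for sum_test1.s shown above.
summary = {"iterations": 100, "instructions": 500, "total_cycles": 503}

value = ipc(summary["instructions"], summary["total_cycles"])
print(f"IPC: {value:.2f}")  # prints "IPC: 0.99"
```

An IPC close to 1 on a core with a dispatch width of 16 is the first hint that something other than dispatch bandwidth is limiting the code.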
-The [Software Optimisation Guide](https://developer.arm.com/documentation/109898/latest/) shows which pipelines are used by which instructions. -After going through the MCA output, we can conclude that our `sum_test1.s` program is not achieving the throughput the processor is capable of. It can only compute 1 instruction per cycle, despite putting a lot of pressure on resources `[4]-[9]`. We are therefore not using all the available resources. -In order to understand what causes this behavior, we can look into how the instruction state changes throughout the execution pipeline. We can do that by running MCA with the `-timeline` flag: `llvm-mca -mtriple=aarch64 -mcpu=neoverse-v2 -timeline sum_test1.s`. MCA now adds a timeline view of execution to the output, which looks like this: +The [Arm Neoverse V2 Software Optimization Guide](https://developer.arm.com/documentation/109898/latest/) shows which pipelines are used by which instructions. + +After going through the MCA output, you can conclude that the `sum_test1.s` program is not achieving the throughput the processor is capable of. It can only compute 1 instruction per cycle, despite putting a lot of pressure on resources `[4]-[9]`. + +To understand what causes this behavior, you can look into how the instruction state changes throughout the execution pipeline. + +Run `llvm-mca` again, this time with the `-timeline` flag: +```console +llvm-mca -mtriple=aarch64 -mcpu=neoverse-v2 -timeline sum_test1.s ``` + +The MCA output now includes a timeline view of execution, which looks like this: + +```output Timeline view: 0123456789 0123456789 012 Index 0123456789 0123456789 0123456789 @@ -188,16 +221,24 @@ These states are represented by the following characters: - = : Instruction already dispatched, waiting to be executed. - \- : Instruction executed, waiting to be retired. 
-Looking at the `Index` in the timeline view, on the horizontal axis we have cycles and on the vertical axis we have a pair of indices representing iterations and instructions. Since we did not pass the `-timeline-max-iterations` flag to specify an iteration number to be used for the timeline view, `llvm-mca` used its default (10 iterations) so the iteration indices range from 0-9 inclusively. Since there are 5 instructions in `sum_test1`, the instruction indices range from 0-4 inclusively. -From the timeline view of `sum_test1.s` we have the following: +Looking at the `Index` in the timeline view, on the horizontal axis you have cycles and on the vertical axis you have a pair of indices representing iterations and instructions. Since you did not pass the `-timeline-max-iterations` flag to specify an iteration number to be used for the timeline view, `llvm-mca` used its default (10 iterations) so the iteration indices range from 0-9 inclusively. Since there are 5 instructions in `sum_test1`, the instruction indices range from 0-4 inclusively. + +From the timeline view of `sum_test1.s` you can see the following: + - Instruction `[0, 4]` corresponds to the first iteration of the fifth instruction `add x1, x1, x6`. This instruction was dispatched on cycle 0, it started execution on cycle 5, finished execution at cycle 6 and retired at cycle 7. + - Instruction `[2, 2]` corresponds to the third iteration of the third instruction `add x1, x1, x4`. This instruction was dispatched on cycle 0, it started execution on cycle 13, finished execution at cycle 14 and retired at cycle 15. -The iterations timeline shows that in subsequent iterations, instructions spend a longer time waiting to start the execution. That is because all add instructions in the code block are in a Read After Write (RAW) dependency chain. Register x1 written by the first instruction `add x1, x1, x2` is immediately used by the next instruction `add x1, x1, x3` and so on. 
Long register dependencies negatively impact performance. The `Average Wait times` section of the timeline view also highlights this. The number of cycles spent in the ready state is very small compared to the number of cycles spent waiting in a scheduler's queue. +The iterations timeline shows that in subsequent iterations, instructions spend a longer time waiting to start the execution. That is because all add instructions in the code block are in a Read After Write (RAW) dependency chain. -After analysing the information provided by MCA, we now understand that a long chain of dependencies is affecting the performance of the program. With this understanding, we can now write assembly code to compute the sum of 6 numbers, this time avoiding register dependencies. +Register x1 written by the first instruction `add x1, x1, x2` is immediately used by the next instruction `add x1, x1, x3` and so on. Long register dependencies negatively impact performance. The `Average Wait times` section of the timeline view also highlights this. The number of cycles spent in the ready state is very small compared to the number of cycles spent waiting in a scheduler's queue. + +After analyzing the information provided by MCA, you now understand that a long chain of dependencies is affecting the performance of the program. + +With this understanding, you can write new assembly code to compute the sum of 6 numbers, this time avoiding register dependencies. 
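The Read After Write chain described above can be made concrete with a few lines of Python. This is a rough sketch under simplifying assumptions: it only understands three-operand instructions of the form `add xd, xa, xb`, it looks at a single iteration, and the helper name `longest_raw_chain` is hypothetical. It does not model what `llvm-mca` does internally.

```python
def longest_raw_chain(instructions):
    """Return the length, in instructions, of the longest RAW dependency chain."""
    depth_of = {}  # register name -> chain depth of the instruction that last wrote it
    longest = 0
    for line in instructions:
        _mnemonic, operands = line.split(None, 1)
        dest, *sources = [reg.strip() for reg in operands.split(",")]
        # An instruction sits one step deeper than the deepest producer it reads from.
        depth = 1 + max((depth_of.get(src, 0) for src in sources), default=0)
        depth_of[dest] = depth
        longest = max(longest, depth)
    return longest


# The five instructions of sum_test1.s, as listed and discussed in the text.
sum_test1 = [
    "add x1, x1, x2",
    "add x1, x1, x3",
    "add x1, x1, x4",
    "add x1, x1, x5",
    "add x1, x1, x6",
]

print(longest_raw_chain(sum_test1))  # prints 5
```

Every instruction in `sum_test1.s` reads the `x1` value produced by the one before it, so the chain is as long as the block and no two adds can execute in the same cycle.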
Use a text editor to save the program below in a file named `sum_test2.s`: + ``` add x10, x1, x2 add x11, x3, x4 @@ -206,11 +247,15 @@ add x13, x10, x11 add x14, x12, x13 ``` -Now run llvm-mca on this program using: -`llvm-mca -mtriple=aarch64 -mcpu=neoverse-v2 sum_test2.s` +Run `llvm-mca` on the new program using: -The produced output looks like below: +```console +llvm-mca -mtriple=aarch64 -mcpu=neoverse-v2 sum_test2.s ``` + +The new output is shown below: + +```output Iterations: 100 Instructions: 500 Total Cycles: 88 @@ -271,7 +316,9 @@ Resource pressure by instruction: - - - - - - - 0.01 - - - 0.18 0.81 - - - - add x14, x12, x13 ``` -We can immediately see an improvement in the performance of our program by looking a the number of total cycles it took to execute and the IPC. Below you can see a comparison between the MCA performance metrics of `sum_test1.s` and `sum_test2.s`: +You can immediately see an improvement in the performance of the program by looking at the number of total cycles it took to execute and the IPC. + +Below you can see a comparison between the MCA performance metrics of `sum_test1.s` and `sum_test2.s`: | | sum_test1 | sum_test2 | | ----------------- | --------- | --------- | @@ -283,11 +330,22 @@ We can immediately see an improvement in the performance of our program by looki | uOps Per Cycle | 0.99 | 5.68 | | IPC | 0.99 | 5.68 | | Block RThroughput | 0.8 | 0.8 | -The improved version of the code now has a higher IPC and takes fewer cycles to run. We can also see that there is less pressure on resources `[4]-[9]`, which allows for instructions to execute in parallel. -Look at the timeline view and see how the lack of register dependencies improved performance. Run llvm-mca using: `llvm-mca -mtriple=aarch64 -mcpu=neoverse-v2 -timeline sum_test2.s ` -The produced output looks like below: +The improved version of the code now has a higher IPC and takes fewer cycles to run. 
+ +You can also see that there is less pressure on resources `[4]-[9]`, which allows for instructions to execute in parallel. + +Look at the timeline view and see how the lack of register dependencies improved performance. + +Run `llvm-mca` again using: + +```console +llvm-mca -mtriple=aarch64 -mcpu=neoverse-v2 -timeline sum_test2.s ``` + +The produced output is: + +```output Timeline view: 012 Index 0123456789 @@ -358,6 +416,9 @@ Average Wait times (based on the timeline view): 4. 10 4.9 0.0 0.0 add x14, x12, x13 10 3.6 1.9 0.7 ``` -You can see by looking at the timeline view that instructions no longer depend on each other and can execute in parallel. Instructions also spend less time waiting in the scheduler's queue. This explains why the performance of `sum_test2.s` is so much better than `sum_test1.s`. -In the next section you can try running `llvm-mca` with Compiler Explorer. \ No newline at end of file +You can see by looking at the timeline view that instructions no longer depend on each other and can execute in parallel. + +Instructions also spend less time waiting in the scheduler's queue. This explains why the performance of `sum_test2.s` is so much better than `sum_test1.s`. + +In the next section, you can try running `llvm-mca` with Compiler Explorer. \ No newline at end of file
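
As a quick check of the comparison above, the estimated speedup can be computed from the two `Total Cycles` values reported by `llvm-mca` (503 for `sum_test1.s` and 88 for `sum_test2.s`). The snippet is a small illustrative sketch; the variable names are arbitrary, and the result is a simulator estimate rather than a measurement on hardware.

```python
# Total Cycles reported by llvm-mca for 100 iterations of each block.
cycles_sum_test1 = 503
cycles_sum_test2 = 88
instructions = 500  # both blocks execute 5 instructions x 100 iterations

speedup = cycles_sum_test1 / cycles_sum_test2
ipc_sum_test2 = instructions / cycles_sum_test2

print(f"Estimated speedup: {speedup:.2f}x")   # prints "Estimated speedup: 5.72x"
print(f"sum_test2 IPC: {ipc_sum_test2:.2f}")  # prints "sum_test2 IPC: 5.68"
```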