Build and validate DynamoRIO on AArch64 SVE hardware #5365

AssadHashmi · 2022-02-17T11:04:51Z

We need to fix build and runtime issues now that SVE support is becoming available on AArch64 hardware.
This ticket should only track incomplete test and runtime core/engine SVE support on the current master.

Other issues should track the addition of full SVE and later SVE2 instruction support, e.g. #3044 for the codec.

AssadHashmi · 2022-02-17T11:05:28Z

User issue raised when running on A64FX https://groups.google.com/g/dynamorio-users/c/_7H9NZXh3wc

This patch adds Arm's Scalable Vector Extension vector length support. The vector length is determined at runtime on startup in get_processor_specific_info() and is available using proc_get_vector_length(). Cleancall, machine and signal context code have been updated to handle SVE registers, as have API functions like reg_get_size(), which will return the hardware's vector size rather than OPSZ_SCALABLE.

The SVE specification allows for a maximum vector length of 2048 bits. We currently support 512 bits maximum due to DR's stack size limitation. There is currently no stock SVE hardware with vector lengths greater than 512 bits.

There will be follow on patches to add:
- Predicate registers.
- Handling of First Fault Register (FFR).
- Targeted SVE tests.

Issue: #5365, #3044

For the current decode/encode functions of:
```
LDR <Zt>, [<Xn|SP>{, #<imm>, MUL VL}]
LDR <Pt>, [<Xn|SP>{, #<imm>, MUL VL}]
STR <Zt>, [<Xn|SP>{, #<imm>, MUL VL}]
STR <Pt>, [<Xn|SP>{, #<imm>, MUL VL}]
PRFB <prfop>, <Pg>, [<Xn|SP>{, #<imm>, MUL VL}]
PRFH <prfop>, <Pg>, [<Xn|SP>{, #<imm>, MUL VL}]
PRFW <prfop>, <Pg>, [<Xn|SP>{, #<imm>, MUL VL}]
PRFD <prfop>, <Pg>, [<Xn|SP>{, #<imm>, MUL VL}]
```
vector indexing is used in the memory operand at the IR level. However, the IR must always refer to the address in terms of the base register value plus a byte offset displacement. This patch changes the decode/encode functions for these instructions to expect byte offsets at the IR level, converting to vector length offsets within the codec.

Issues #3044, #5365
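The conversion between the IR's byte displacement and the encoded `MUL VL` immediate can be sketched as below. This is a minimal illustration, not the codec's actual code: the function names and the parameter layout are hypothetical, and the immediate range is passed in by the caller rather than looked up per instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from DynamoRIO's codec.
 * Encoder direction: turn the IR's byte displacement into the signed
 * "MUL VL" immediate. unit_bytes is how many bytes one step of the
 * immediate stands for (the vector length in bytes for Z-register LDR/STR,
 * one eighth of that for P-register LDR/STR).
 * Returns false if the displacement is not a whole multiple of unit_bytes
 * or the immediate falls outside [imm_min, imm_max]. */
bool
byte_disp_to_mul_vl_imm(int32_t disp_bytes, int32_t unit_bytes, int32_t imm_min,
                        int32_t imm_max, int32_t *imm_out)
{
    assert(unit_bytes > 0 && imm_min <= imm_max && imm_out != NULL);
    if (disp_bytes % unit_bytes != 0)
        return false;
    int32_t imm = disp_bytes / unit_bytes;
    if (imm < imm_min || imm > imm_max)
        return false;
    *imm_out = imm;
    return true;
}

/* Decoder direction: turn the encoded immediate back into a byte displacement. */
int32_t
mul_vl_imm_to_byte_disp(int32_t imm, int32_t unit_bytes)
{
    assert(unit_bytes > 0);
    return imm * unit_bytes;
}
```

With a 256-bit vector length (32 bytes), a Z-register load with an IR displacement of 64 bytes would encode an immediate of 2, while a displacement of 48 bytes cannot be encoded at all because it is not a multiple of the vector length.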

This patch adds Arm AArch64 Scalable Vector Extension (SVE) support to the core including related changes to the codec, IR and relevant clients.

SVE and SVE2 are major extensions to Arm's 64 bit architecture. Developers and users should reference the relevant documentation at developer.arm.com (currently https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions).

The architecture allows hardware implementations to support vector lengths from 128 to 2048 bits. This patch supports up to 512 bits due to DynamoRIO's stack size limitation. There is currently no stock SVE hardware with vector lengths greater than 512 bits.

The vector length is determined by get_processor_specific_info() at runtime on startup and is available by calling proc_get_vector_length(). For Z registers, reg_get_size() will return the vector size implemented by the hardware rather than OPSZ_SCALABLE.

There will be follow up patches for:
- SVE scatter/gather emulation
- Full SVE signal context support
- Complete SVE support in sample clients and drcachesim tracer.

Issues: #5365, #3044

Co-authored-by: Cam Mannett <camden.mannett@arm.com>

Add BUILD_TESTS_SVE build option to compile with SVE flags and high optimisation (-O3). Add some error checking to allow the -O3 build and consequently update a template (expected output) file. Issue: #5365

Build most core tests with SVE flags and high optimisation (-O3), if building on an AARCH64 SVE machine. Tests which fail when built with -O3 are not included. Add some error checking to a few tests to allow the -O3 build and update template (expected output) files as necessary. Issue #6429 raised to cover making the removal of optimization flags more granular. Issue: #5365

drcachesim's tracer.cpp, sample clients memtrace_simple.c and memval_simple.c have checks to avoid handling SVE scatter/gather memory instructions, i.e. use of Z registers in memory address operands. Now that a significant number of scatter/gather instructions have been implemented, these checks can be removed. Issues: #5036, #5365, #3044

- client.drsyms-test and client.drwrap-test-detach: The tests expect to observe a certain function call a certain sub-function, but this doesn't happen when built with optimisation on because the sub-function gets inlined. This is fixed by marking the sub-functions as NOINLINE.
- client.drx-scattergather and client.drx-scattergather-bbdup: The test clients used with these tests count the number of scatter/gather instructions that are expanded and print the number at the end of the test, which gets checked against a reference value. Building the test app with -O3 causes some code to be auto-vectorized, so there are additional scatter/gather instructions which throw off the count. I removed these tests from the sve_tests list so they won't be built with -O3.

Issue: #5365
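The NOINLINE fix in the first bullet can be sketched as follows. The macro definition here is an assumption that covers GCC and Clang only, and the function names are hypothetical stand-ins for the test's own functions.

```c
#include <assert.h>

/* Stand-in for the test suite's NOINLINE macro. This definition is an
 * assumption and only covers GCC and Clang. */
#define NOINLINE __attribute__((noinline))

/* Without the attribute, -O3 is free to fold this body into its caller,
 * leaving no separate call for a tool to observe. */
NOINLINE int
sub_function(int x)
{
    return x + 1;
}

/* The test expects a tool to see this function call sub_function(). */
int
outer_function(int x)
{
    assert(x >= 0);
    return sub_function(x) * 2;
}
```

The attribute does not change the result of the call; it only asks the compiler to keep `sub_function` as a separate function so that the call remains visible to symbol lookup and function wrapping.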

derekbruening · 2024-01-18T02:12:56Z

#5036 covers expanding scatter/gather instructions for easier instrumentation

When debugging i#6499 we noticed that drcachesim was producing 0 byte read/write records for some SVE load/store instructions:
```
ifetch 4 byte(s) @ 0x0000000000405b3c a54a4681 ld1w (%x20,%x10,lsl #2) %p1/z -> %z1.s
read 0 byte(s) @ 0x0000000000954e80 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e84 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e88 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e8c by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e90 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e94 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e98 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e9c by PC 0x0000000000405b3c
ifetch 4 byte(s) @ 0x0000000000405b4
```
This turned out to be due to drdecode being linked into drcachesim twice: once into the drcachesim executable, once into libdynamorio. drdecode uses a global variable to store the SVE vector length to use when decoding, so we end up with two copies of that variable and only one was being initialized.

To fix this properly we would need to refactor the libraries so that there is only one copy of the sve_veclen global variable, or change the way that the decoder gets the vector length so it's no longer stored in a global variable. In the meantime we have a workaround which makes sure both copies of the variable get initialized.

With that workaround in place, however, the results were still wrong. For expanded scatter/gather instructions when you are using an offline trace, raw2trace doesn't have access to the load/store instructions from the expansion, only the original app scatter/gather instruction. It has to create the read/write records using only information from the original scatter/gather instruction, and it uses the size of the memory operand to determine the size of each read/write. This works for x86 because the x86 IR uses the per-element data size as the size of the memory operand of scatter/gather instructions. This doesn't work for AArch64 because the AArch64 codec uses the maximum data transferred (per-element data size * number of elements), like other SIMD load/store instructions.

We plan to make the AArch64 IR consistent with x86 by changing it to use the same convention as x86 for scatter/gather instructions, but in the meantime we can work around the inconsistency by fixing the size in raw2trace based on the instruction's opcode.

Issues: #6499, #5365, #5036
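The difference between the two memory operand size conventions can be illustrated with a small sketch. The struct and function names are hypothetical, and the sketch assumes the element width in memory equals the vector lane width, which is a simplification.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types for illustration only; not DynamoRIO's IR. */
typedef struct {
    uint32_t num_elements;   /* Lanes in one vector register. */
    uint32_t per_element;    /* Bytes per lane: the x86 convention. */
    uint32_t total_transfer; /* Most bytes the whole instruction can move. */
} sg_sizes_t;

/* Derives both candidate memory operand sizes from the vector length and
 * the element width. Assumes the memory element and the lane are the same
 * width. */
sg_sizes_t
scatter_gather_sizes(uint32_t vl_bytes, uint32_t elem_bytes)
{
    assert(elem_bytes > 0 && vl_bytes % elem_bytes == 0);
    sg_sizes_t sizes;
    sizes.num_elements = vl_bytes / elem_bytes;
    sizes.per_element = elem_bytes;
    sizes.total_transfer = sizes.num_elements * elem_bytes;
    return sizes;
}
```

For a 32-byte vector with 4-byte elements this gives 8 elements, a per-element size of 4 bytes and a total transfer of 32 bytes. The eight read addresses in the trace above step by 4 bytes, which is consistent with per-element records of 4 bytes each; stamping the 32-byte total onto every record would overstate each access.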

#6574) Make the AArch64 IR consistent with x86 which already uses the per-element transfer size for the scatter/gather memory operand size. This changes the AArch64 codec for the scatter/gather and predicated contiguous load/store instructions to use the per-element access size for the memory operand instead of the maximum total transfer size that it used previously, and updates the tests accordingly. Issues: #5365, #5036, #6561

Remove linux.fib-conflict from the list of tests to be built with -O3. The cause of the infinite loop is hard to determine, being caused by the linker script. As this is not a DynamoRIO issue, and this test often fails at the moment anyway, just build it without -O3 for now. Issue: #5365

Some of the SVE tests are written assuming a 256-bit vector length so that we get consistent output from the codec regardless of the hardware vector length that the test is run on. This was previously achieved by hard coding DynamoRIO's vector length to 256 bits when built with BUILD_TESTS=1. This worked fine for the codec tests (api.ir_sve, api.dis-a64-sve) but it breaks tests such as client.drx-scattergather which need the vector length to match the hardware.

This patch tweaks two things so that all tests should now work on all vector lengths:
1. get_processor_specific_info() now initializes the vector length to the correct hardware value whether or not BUILD_TESTS=1. This enables the client tests to work on all vector lengths.
2. The AArch64 codec now uses dr_get_sve_vector_length() to get the vector length when built with BUILD_TESTS=1. This allows the api tests to override the vector length used by the codec by calling dr_set_sve_vector_length().

The api tests already call enable_all_test_cpu_features() which itself calls dr_set_sve_vector_length(256), so no changes to the tests themselves were needed.

Issue: #5365
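The override pattern described in the two points above can be modelled roughly as below. Every name in this sketch is hypothetical and it is not the DynamoRIO implementation: it only shows a core-side value that always tracks the hardware alongside a codec-side getter that tests may override. The validity check uses the 128 to 2048 bit architectural range in multiples of 128 bits.

```c
#include <assert.h>
#include <stdbool.h>

/* All names below are hypothetical; they only model the pattern. */
static int hw_vl_bits = 0;       /* Filled in once at startup. */
static int override_vl_bits = 0; /* 0 means no override is active. */

static bool
is_valid_sve_vl(int bits)
{
    return bits >= 128 && bits <= 2048 && bits % 128 == 0;
}

/* Models startup detection: always records the hardware value. */
void
model_init_from_hardware(int bits)
{
    assert(is_valid_sve_vl(bits));
    hw_vl_bits = bits;
}

/* Models a test-only setter that pins the codec's vector length. */
bool
model_set_codec_vector_length(int bits)
{
    if (!is_valid_sve_vl(bits))
        return false;
    override_vl_bits = bits;
    return true;
}

/* What the codec would use: the override if set, else the hardware value. */
int
model_codec_vector_length(void)
{
    return override_vl_bits != 0 ? override_vl_bits : hw_vl_bits;
}

/* What client-facing code would use: always the hardware value. */
int
model_core_vector_length(void)
{
    return hw_vl_bits;
}
```

On hypothetical 512-bit hardware, a codec test that pins the length to 256 bits would see 256 from the codec-side getter while the core-side value stays at 512, which is the separation the patch relies on.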

This patch adds SVE support for signals in the core. It is the follow on patch from the SVE core work part 1, in PR #5835 (f646a63) and includes vector address computation for SVE scatter/gather, enabling first-fault load handling. Issue: #5365, #5036 Co-authored-by: Jack Gallagher <jack.gallagher@arm.com>

Currently runsuite.cmake assumes that "origin/master" is the branch to diff against. However sometimes this is not the case, e.g. for internal CI systems using their own branches. Add a "branch" parameter to runsuite.cmake, defaulting to "master", allowing a different source branch to be specified. Issue: #5365

Fixes the slot used to save and restore FP regs at fcache enter and return events. PR #6725 adjusted the slots used during signal handling in core/unix/signal_linux_aarch64.c but did not adjust the same in fcache enter/return and attach events. Prior to that PR, the FP regs were simply stored in a contiguous manner in signal handling code and fcache enter/return routines (instead of in their respective dr_simd_t struct member), which was a bit confusing. The mismatch between slot usage in signal handling and fcache enter/return code caused failures in the signalNNN1 tests on some systems. Verified that those tests pass with this fix. Also fixes the same issue in save_priv_mcontext_helper which is used in the dr_app_start API. Unit tests for this scenario will be added as part of #6759. Issue: #5036, #6755, #5365 Fixes #6758

PR #6757 fixed the way we read/write SVE register slots but unfortunately it is now broken on systems with 128-bit vector length.

Both SVE vector and predicate registers use dr_simd_t slots, which is a 64-byte type meant to store up to 512-bit vector registers. SVE predicate registers are always 1/8 the size of the vector register, so for 512-bit vector length systems we only really need 64 / 8 = 8 bytes to store predicate registers.

The ldr/str instructions we use to read and write the predicate register slots have a base+offset memory operand where the offset is a value in the range [-256, 255] scaled by the predicate register length. We read and write the registers by setting the base address to the address of the first slot, and setting the offset to n * sizeof(dr_simd_t) for each register Pn. For systems with 128-bit vector length, the predicate registers are 16 / 8 = 2 bytes, so the maximum offset we can reach is 2 * 255 = 510 bytes. This means on 128-bit VL systems we can only reach the first 8 predicate registers, because the slot for P8 sits at 8 * sizeof(dr_simd_t) = 512 bytes.

By changing the predicate register and FFR slots to use a new type dr_svep_t, which is 1/8 the size of dr_simd_t, we can fix this bug and save space. dr_svep_t is currently 8 bytes to correspond to 64-byte vectors, but even if we extend DynamoRIO to support the maximum SVE vector length of 2048 bits (256 bytes), dr_svep_t will only need to be increased to 256 / 8 = 32 bytes, so the maximum offset (15 * 32 = 480 bytes) will always be in range.

As this changes the size of the predicate register and FFR slots, this changes the size of the dr_mcontext_t structure and breaks backwards compatibility with earlier versions of DynamoRIO, so the version number is increased to 10.90.

Issues: #6760, #5365 Fixes: #6760
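The offset arithmetic above can be checked with a short sketch. The function names are hypothetical; the constants (an immediate range of [-256, 255], 16 predicate registers, and slot sizes of 64, 8 and 32 bytes) are taken from the description in this thread.

```c
#include <assert.h>
#include <stdbool.h>

#define PRED_IMM_MIN (-256)
#define PRED_IMM_MAX 255
#define NUM_PRED_REGS 16

/* Returns true if a predicate ldr/str based at slot 0 can reach the slot of
 * Pn, given the vector length and the size of each slot. The encoded
 * immediate counts predicate-register-sized units, not bytes. */
bool
pred_slot_reachable(int vl_bits, int slot_bytes, int n)
{
    assert(vl_bits >= 128 && vl_bits % 128 == 0 && slot_bytes > 0 && n >= 0);
    int pred_bytes = vl_bits / 64; /* One eighth of the vector length in bytes. */
    int offset_bytes = n * slot_bytes;
    if (offset_bytes % pred_bytes != 0)
        return false;
    int imm = offset_bytes / pred_bytes;
    return imm >= PRED_IMM_MIN && imm <= PRED_IMM_MAX;
}

/* Counts how many of P0..P15 are reachable from the first slot. */
int
count_reachable_pred_regs(int vl_bits, int slot_bytes)
{
    int count = 0;
    for (int n = 0; n < NUM_PRED_REGS; n++) {
        if (pred_slot_reachable(vl_bits, slot_bytes, n))
            count++;
    }
    return count;
}
```

With 64-byte slots on a 128-bit system, the immediate for Pn is 32 * n, so only P0 to P7 fit within 255. With 8-byte slots the immediate is 4 * n and all 16 registers are reachable, and the same holds for 32-byte slots at both 128-bit and 2048-bit vector lengths.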

AssadHashmi added OpSys-Linux Bug-AppFail OpSys-AArch64 labels Feb 17, 2022

AssadHashmi self-assigned this Feb 17, 2022

This was referenced Jan 24, 2023

i#5365: Add AArch64 SVE support to the core (part 1) #5835

Merged

Add scatter/gather support for AArch64's Scalable Vector Extension (SVE) #5844

Closed

AssadHashmi mentioned this issue Jul 26, 2023

i#3044 AArch64 SVE codec: change LDR/STR and PRF to use byte offsets #6230

Merged

philramsey-arm mentioned this issue Oct 16, 2023

i#5365: Build core unit tests with SVE enabled #6371

Merged

AssadHashmi mentioned this issue Nov 3, 2023

i#5365 Update aarch64 workflow #6293

Closed

AssadHashmi mentioned this issue Nov 9, 2023

i#5036 AArch64: Remove Z register checks in tracer and sample clients #6431

Merged

AssadHashmi mentioned this issue Nov 13, 2023

OpenMP fails on A64FX (AArch64 SVE) #6451

Open

jackgallagher-arm mentioned this issue Dec 4, 2023

i#5365 Fix some tests which fail when build with -O3 #6492

Merged

jackgallagher-arm mentioned this issue Jan 8, 2024

i#5365 AArch64: Fix 0 size read/write records in drmemtrace #6544

Merged

joshua-warburton mentioned this issue Jan 10, 2024

i#5365 Update ci-aarch64-native workflow with new runner #6549

Merged

jackgallagher-arm mentioned this issue Jan 23, 2024

i#5365 AArch64: Change scatter/gather instructions to per-element size #6574

Merged

philramsey-arm mentioned this issue Feb 9, 2024

i#5365: Infinite loop in linux.fib-conflict test when built with -O3 #6645

Merged

jackgallagher-arm mentioned this issue Feb 14, 2024

i#5365 AArch64: Fix tests on non 256-bit VL hardware #6652

Merged

AssadHashmi mentioned this issue Mar 26, 2024

i#5365 AArch64 SVE core, part 2: add signals support #6725

Merged

philramsey-arm mentioned this issue Mar 28, 2024

i#5365: Add "branch" parameter to runsuite.cmake #6740

Merged

AssadHashmi mentioned this issue Apr 5, 2024

AArch64: Fix P register save/restore on 128-bit vector length systems #6760

Closed

abhinav92003 mentioned this issue Apr 5, 2024

i#6758: Fix AArch64 FP state at fcache events and attach #6757

Merged

jackgallagher-arm mentioned this issue Apr 12, 2024

i#6760 AArch64: Use smaller data types for SVE P and FFR registers #6774

Merged
