Build and validate DynamoRIO on AArch64 SVE hardware #5365

Open
AssadHashmi opened this issue Feb 17, 2022 · 2 comments

Comments


AssadHashmi commented Feb 17, 2022

We need to fix build and runtime issues now that SVE support is becoming available on AArch64 hardware.
This ticket should only track incomplete test and runtime core/engine SVE support on the current master.

Other issues should track the addition of full SVE and later SVE2 instruction support, e.g. #3044 for the codec.

@AssadHashmi

User issue raised when running on A64FX https://groups.google.com/g/dynamorio-users/c/_7H9NZXh3wc

AssadHashmi added a commit that referenced this issue Jan 24, 2023
This patch adds Arm's Scalable Vector Extension vector length support.
The vector length is determined at runtime on startup in
get_processor_specific_info() and available using
proc_get_vector_length().

Cleancall, machine and signal context code have been updated to handle
SVE registers, as have API functions like reg_get_size(), which will
return the hardware's vector size rather than OPSZ_SCALABLE.

The SVE specification allows for a maximum vector length of 2048 bits.
We currently support 512 bits maximum due to DR's stack size limitation.
There is currently no stock SVE hardware with vector lengths greater
than 512 bits.

There will be follow-on patches to add:
- Predicate registers.
- Handling of First Fault Register (FFR).
- Targeted SVE tests.

Issue: #5365, #3044
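
The size limits described in this commit message can be sketched in standalone C. This is only an illustration under stated assumptions: every `sketch_` name and `SKETCH_` constant below is hypothetical and is not DynamoRIO API. The 128-bit granularity and 2048-bit maximum come from the SVE architecture, and the 512-bit cap comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Architectural SVE limits: the vector length is a multiple of 128 bits,
 * from 128 up to 2048 bits. */
#define SKETCH_SVE_MIN_VL_BITS 128
#define SKETCH_SVE_MAX_VL_BITS 2048
/* Cap described in the commit message (hypothetical constant name). */
#define SKETCH_MAX_SUPPORTED_VL_BITS 512

/* True if vl_bits is a vector length the architecture permits. */
static bool
sketch_vl_is_architectural(int vl_bits)
{
    return vl_bits >= SKETCH_SVE_MIN_VL_BITS && vl_bits <= SKETCH_SVE_MAX_VL_BITS &&
        vl_bits % 128 == 0;
}

/* True if vl_bits is also within the 512-bit cap described above. */
static bool
sketch_vl_is_supported(int vl_bits)
{
    return sketch_vl_is_architectural(vl_bits) &&
        vl_bits <= SKETCH_MAX_SUPPORTED_VL_BITS;
}

/* Size in bytes of one Z register for a given vector length. */
static int
sketch_z_reg_bytes(int vl_bits)
{
    assert(sketch_vl_is_architectural(vl_bits));
    return vl_bits / 8;
}
```

With these helpers, a 512-bit machine needs 64 bytes per Z register, while a 1024-bit vector length is architecturally valid but falls outside the supported range.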
AssadHashmi added a commit that referenced this issue Jul 26, 2023
For the current decode/encode functions of:
```
LDR <Zt>, [<Xn|SP>{, #<imm>, MUL VL}]
LDR <Pt>, [<Xn|SP>{, #<imm>, MUL VL}]
STR <Zt>, [<Xn|SP>{, #<imm>, MUL VL}]
STR <Pt>, [<Xn|SP>{, #<imm>, MUL VL}]
PRFB <prfop>, <Pg>, [<Xn|SP>{, #<imm>, MUL VL}]
PRFH <prfop>, <Pg>, [<Xn|SP>{, #<imm>, MUL VL}]
PRFW <prfop>, <Pg>, [<Xn|SP>{, #<imm>, MUL VL}]
PRFD <prfop>, <Pg>, [<Xn|SP>{, #<imm>, MUL VL}]
```
vector indexing is used in the memory operand at the IR level. However,
the IR must always express the address as the base register value plus
a byte-offset displacement. This patch changes the decode/encode
functions for these instructions to expect byte offsets at the IR
level, converting to and from vector-length-scaled offsets within the codec.

Issues #3044, #5365
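
The conversion between an IR byte displacement and a `MUL VL` immediate can be sketched as below. This is a simplified illustration, not the codec's actual code: the `sketch_` function names are hypothetical, only the LDR/STR forms with a signed 9-bit immediate are modelled (the prefetch forms use a narrower immediate), and the caller is assumed to pass the scale in bytes (the vector register size for `Zt`, the predicate register size for `Pt`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Signed 9-bit immediate range of the LDR/STR (vector, predicate) forms. */
#define SKETCH_IMM9_MIN (-256)
#define SKETCH_IMM9_MAX 255

/* Decode direction: scaled immediate -> byte displacement for the IR. */
static int32_t
sketch_imm_to_byte_disp(int32_t imm, int32_t scale_bytes)
{
    assert(scale_bytes > 0);
    return imm * scale_bytes;
}

/* Encode direction: byte displacement from the IR -> scaled immediate.
 * Fails if the displacement is not a multiple of the scale or if the
 * resulting immediate does not fit in the signed 9-bit field. */
static bool
sketch_byte_disp_to_imm(int32_t disp_bytes, int32_t scale_bytes, int32_t *imm_out)
{
    assert(scale_bytes > 0 && imm_out != NULL);
    if (disp_bytes % scale_bytes != 0)
        return false;
    int32_t imm = disp_bytes / scale_bytes;
    if (imm < SKETCH_IMM9_MIN || imm > SKETCH_IMM9_MAX)
        return false;
    *imm_out = imm;
    return true;
}
```

For example, with a 256-bit vector length (32-byte Z registers), a displacement of 64 bytes encodes as `#2, MUL VL`, whereas a displacement of 48 bytes cannot be encoded at all.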
AssadHashmi added a commit that referenced this issue Jul 27, 2023
…6230)

ivankyluk pushed a commit to ivankyluk/dynamorio that referenced this issue Jul 28, 2023
…ynamoRIO#6230)

AssadHashmi added a commit that referenced this issue Aug 14, 2023
This patch adds Arm AArch64 Scalable Vector Extension (SVE) support to
the core including related changes to the codec, IR and relevant
clients.

SVE and SVE2 are major extensions to Arm's 64-bit architecture.
Developers and users should refer to the relevant documentation at
developer.arm.com (currently
https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions).

The architecture allows hardware implementations to support vector
lengths from 128 to 2048 bits. This patch supports up to 512 bits due
to DynamoRIO's stack size limitation. There is currently no stock SVE
hardware with vector lengths greater than 512 bits. The vector length
is determined by get_processor_specific_info() at runtime on startup
and is available by calling proc_get_vector_length(). For Z registers,
reg_get_size() will return the vector size implemented by the hardware
rather than OPSZ_SCALABLE.

There will be follow up patches for:
- SVE scatter/gather emulation
- Full SVE signal context support
- Complete SVE support in sample clients and drcachesim tracer.

Issues: #5365, #3044

---------

Co-authored-by: Cam Mannett <camden.mannett@arm.com>
derekbruening pushed a commit that referenced this issue Aug 15, 2023
philramsey-arm added a commit that referenced this issue Oct 16, 2023
Add BUILD_TESTS_SVE build option to compile with SVE flags and high
optimisation (-O3).

Add some error checking to allow the -O3 build and consequently
update a template (expected output) file.

Issue: #5365
philramsey-arm added a commit that referenced this issue Oct 18, 2023
philramsey-arm added a commit that referenced this issue Nov 8, 2023
Build most core tests with SVE flags and high
optimisation (-O3), if building on an AARCH64 SVE machine.

Tests which fail when built with -O3 are not included.

Add some error checking to a few tests to allow the -O3 build
and update template (expected output) files as necessary.

Issue: #5365
philramsey-arm added a commit that referenced this issue Nov 9, 2023
Build most core tests with SVE flags and high optimisation (-O3), if building
on an AARCH64 SVE machine.

Tests which fail when built with -O3 are not included.

Add some error checking to a few tests to allow the -O3 build and update
template (expected output) files as necessary.

Issue #6429 raised to cover making the removal of optimization flags more
granular.

Issue: #5365
philramsey-arm added a commit that referenced this issue Nov 9, 2023
philramsey-arm added a commit that referenced this issue Nov 9, 2023
AssadHashmi added a commit that referenced this issue Nov 9, 2023
drcachesim's tracer.cpp, sample clients memtrace_simple.c and
memval_simple.c have checks to avoid handling SVE scatter/gather
memory instructions, i.e. use of Z registers in memory address
operands. Now that a significant number of scatter/gather instructions
have been implemented, these checks can be removed.

Issues: #5036, #5365, #3044
philramsey-arm added a commit that referenced this issue Nov 10, 2023
philramsey-arm added a commit that referenced this issue Nov 10, 2023
AssadHashmi added a commit that referenced this issue Nov 15, 2023
…#6431)

brettcoon pushed a commit that referenced this issue Nov 16, 2023
…#6431)

jackgallagher-arm added a commit that referenced this issue Dec 4, 2023
- client.drsyms-test and client.drwrap-test-detach:
    The tests expect to observe a certain function call a certain
    sub-function but it doesn't happen when built with optimisation on
    because the sub-function gets inlined.
    This is fixed by marking the sub-functions as NOINLINE.

- client.drx-scattergather and client.drx-scattergather-bbdup:
    The test clients used with these tests count the number of
    scatter/gather instructions that are expanded and print the number
    at the end of the test, which gets checked against a reference
    value. Building the test app with -O3 causes some code to be
    auto-vectorized, so there are additional scatter/gather instructions
    which throw off the count. I removed these tests from the sve_tests
    list so they won't be built with -O3.

Issue: #5365
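
The NOINLINE fix can be illustrated with a minimal sketch. The macro and function names below are hypothetical stand-ins, not the ones used in the DynamoRIO tests; the attribute spellings are the standard GCC/Clang and MSVC ones.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical portable "do not inline" marker. */
#if defined(_MSC_VER)
#    define SKETCH_NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#    define SKETCH_NOINLINE __attribute__((noinline))
#else
#    define SKETCH_NOINLINE
#endif

/* Without the marker, an optimising build may inline this into its caller,
 * so a tool watching for a call to this function would never see one. */
static SKETCH_NOINLINE int
sketch_sub_function(int x)
{
    assert(x <= INT_MAX / 2 && x >= INT_MIN / 2); /* Avoid overflow below. */
    return x * 2;
}

/* The caller that a test expects to observe calling the sub-function. */
static int
sketch_outer(int x)
{
    return sketch_sub_function(x) + 1;
}
```

The marker does not change the computed result; it only keeps the call as a real call instruction at -O3, which is what instrumentation that hooks function entry relies on.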
AssadHashmi pushed a commit that referenced this issue Dec 4, 2023
jackgallagher-arm added a commit that referenced this issue Jan 8, 2024
When debugging i#6499 we noticed that drcachesim was producing 0 byte
read/write records for some SVE load/store instructions:

```
 ifetch       4 byte(s) @ 0x0000000000405b3c a54a4681   ld1w   (%x20,%x10,lsl #2) %p1/z -> %z1.s
 read         0 byte(s) @ 0x0000000000954e80 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e84 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e88 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e8c by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e90 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e94 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e98 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e9c by PC 0x0000000000405b3c
 ifetch       4 byte(s) @ 0x0000000000405b4
```

This turned out to be due to drdecode being linked into drcachesim
twice: once into the drcachesim executable and once into libdynamorio.
drdecode uses a global variable to store the SVE vector length to use
when decoding, so we end up with two copies of that variable, and only
one was being initialized.
To fix this properly we would need to refactor the libraries so that
there is only one copy of the sve_veclen global variable, or change the
way that the decoder gets the vector length so it is no longer stored in
a global variable. In the meantime we have a workaround which
makes sure both copies of the variable get initialized and drcachesim
produces correct results.

With that workaround in place, however, the results were still wrong.
For expanded scatter/gather instructions in an offline trace,
raw2trace doesn't have access to the load/store instructions
from the expansion, only the original app scatter/gather instruction.
It has to create the read/write records using only information from the
original scatter/gather instruction, and it uses the size of the memory
operand to determine the size of each read/write. This works for x86
because the x86 IR uses the per-element data size as the size of the
memory operand of scatter/gather instructions. This doesn't work for
AArch64 because the AArch64 codec uses the maximum data transferred
(per-element data size * number of elements), as it does for other SIMD
load/store instructions.

We plan to make the AArch64 IR consistent with x86 by changing it to
use the same convention as x86 for scatter/gather instructions, but in
the meantime we can work around the inconsistency by fixing the size
in raw2trace based on the instruction's opcode.

Issues: #6499, #5365
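
The difference between the two memory operand size conventions can be shown with a small worked example. This is only an arithmetic sketch with hypothetical helper names; it assumes a 256-bit vector length, which is consistent with the eight 4-byte-strided records in the trace above, and it is not the raw2trace implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Number of lanes in a vector of vl_bytes with lanes of lane_bytes each. */
static uint32_t
sketch_num_elements(uint32_t vl_bytes, uint32_t lane_bytes)
{
    assert(lane_bytes > 0 && vl_bytes % lane_bytes == 0);
    return vl_bytes / lane_bytes;
}

/* "Maximum data transferred" convention: per-element size * element count. */
static uint32_t
sketch_total_transfer_bytes(uint32_t elem_bytes, uint32_t num_elements)
{
    return elem_bytes * num_elements;
}

/* Per-element convention: recover the size of one access from the total. */
static uint32_t
sketch_per_element_bytes(uint32_t total_bytes, uint32_t num_elements)
{
    assert(num_elements > 0 && total_bytes % num_elements == 0);
    return total_bytes / num_elements;
}
```

For `ld1w` into `%z1.s` on a 32-byte vector there are 8 lanes of 4 bytes each, so the total-transfer convention reports 32 bytes while each individual read record should be 4 bytes.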
@derekbruening

#5036 covers expanding scatter/gather instructions for easier instrumentation

jackgallagher-arm added a commit that referenced this issue Jan 18, 2024
Issues: #6499, #5365, #5036
jackgallagher-arm added a commit that referenced this issue Jan 23, 2024
This makes the IR consistent with x86 which already uses the per-element
transfer size for the scatter/gather memory operand size.

Issues: #5365, #5036, #6561
jackgallagher-arm added a commit that referenced this issue Jan 24, 2024
#6574)

Make the AArch64 IR consistent with x86 which already uses the
per-element transfer size for the scatter/gather memory operand
size.

This changes the AArch64 codec for the scatter/gather and predicated
contiguous load/store instructions to use the per-element access size
for the memory operand instead of the maximum total transfer size
that it used previously, and updates the tests accordingly.
    
Issues: #5365, #5036, #6561
philramsey-arm added a commit that referenced this issue Feb 9, 2024
Remove linux.fib-conflict from the list of tests to be built with -O3.

The cause of the infinite loop is hard to determine, being caused by the
linker script.

As this is not a DynamoRIO issue, and this test often fails at the
moment anyway, just build it without -O3 for now.

Issue: #5365
philramsey-arm added a commit that referenced this issue Feb 12, 2024
…6645)

jackgallagher-arm added a commit that referenced this issue Feb 14, 2024
Some of the SVE tests are written assuming a 256-bit vector length so
that we get consistent output from the codec regardless of the hardware
vector length that the test is run on. This was previously achieved by
hard-coding DynamoRIO's vector length to 256 bits when built with
BUILD_TESTS=1.
This worked fine for the codec tests (api.ir_sve, api.dis-a64-sve) but
this breaks tests such as client.drx-scattergather which need the
vector length to match the hardware.

This patch tweaks two things so that all tests should now work on all
vector lengths:
 1. get_processor_specific_info() now initializes the vector length
    to the correct hardware value whether or not BUILD_TESTS=1.
    This enables the client tests to work on all vector lengths.
 2. The AArch64 codec now uses dr_get_sve_vector_length() to get the
    vector length when built with BUILD_TESTS=1. This allows the api
    tests to override the vector length used by the codec by calling
    dr_set_sve_vector_length().

The api tests already call enable_all_test_cpu_features() which itself
calls dr_set_sve_vector_length(256) so no changes to the tests
themselves were needed.

Issue: #5365
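
The split between a hardware-detected vector length and a codec vector length that tests can override might look roughly like the sketch below. All names are hypothetical and the logic is an assumption made for illustration; it is not how dr_get_sve_vector_length() or dr_set_sve_vector_length() are implemented.

```c
#include <assert.h>
#include <stdbool.h>

/* Vector length detected from the hardware at startup, in bits. */
static int sketch_core_vl_bits;
/* Vector length used by the codec, in bits; tests may override it. */
static int sketch_codec_vl_bits;

static bool
sketch_vl_is_valid(int vl_bits)
{
    return vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0;
}

/* Startup: both values begin as the real hardware vector length. */
static void
sketch_init_from_hardware(int hw_vl_bits)
{
    assert(sketch_vl_is_valid(hw_vl_bits));
    sketch_core_vl_bits = hw_vl_bits;
    sketch_codec_vl_bits = hw_vl_bits;
}

/* Test hook: changes only the codec's view, leaving the core untouched. */
static bool
sketch_set_codec_vl(int vl_bits)
{
    if (!sketch_vl_is_valid(vl_bits))
        return false;
    sketch_codec_vl_bits = vl_bits;
    return true;
}

static int
sketch_get_codec_vl(void)
{
    return sketch_codec_vl_bits;
}

static int
sketch_get_core_vl(void)
{
    return sketch_core_vl_bits;
}
```

In this arrangement a codec test can force 256 bits for reproducible disassembly on a 512-bit machine, while client tests that depend on the real hardware length still see 512 bits.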
jackgallagher-arm added a commit that referenced this issue Feb 14, 2024
AssadHashmi added a commit that referenced this issue Mar 26, 2024
This patch adds SVE support for signals in the core. It is the follow
on patch from the SVE core work part 1, in PR #5835 (f646a63) and
includes vector address computation for SVE scatter/gather, enabling
first-fault load handling.

Issue: #5365, #5036

Co-authored-by: Jack Gallagher <jack.gallagher@arm.com>
philramsey-arm added a commit that referenced this issue Mar 28, 2024
Currently runsuite.cmake assumes that "origin/master" is the branch to
diff against. However sometimes this is not the case.

Add a "branch" parameter to runsuite.cmake, defaulting to "master",
allowing a different source branch to be specified.

Issue: #5365
philramsey-arm added a commit that referenced this issue Mar 28, 2024
Currently runsuite.cmake assumes that "origin/master" is the branch to
diff against.

However sometimes this is not the case, e.g. for internal CI systems
using their own branches.

Add a "branch" parameter to runsuite.cmake, defaulting to "master",
allowing a different source branch to be specified.

Issue: #5365
AssadHashmi added a commit that referenced this issue Apr 3, 2024
abhinav92003 added a commit that referenced this issue Apr 5, 2024
Fixes the slot used to save and restore FP regs at fcache enter and
return events. PR #6725 adjusted the slots used during signal handling
in core/unix/signal_linux_aarch64.c but did not adjust the same in
fcache enter/return and attach events. Prior to that PR, the FP regs
were simply stored in a contiguous manner in signal handling code and
fcache enter/return routines (instead of in their respective dr_simd_t
struct member), which was a bit confusing.

The mismatch between slot usage in signal handling and fcache
enter/return code caused failures in the signalNNN1 tests on some
systems. Verified that those tests pass with this fix.

Also fixes the same issue in save_priv_mcontext_helper which is used in
the dr_app_start API. Unit tests for this scenario will be added as part
of #6759.

Issue: #5036, #6755, #5365
Fixes #6758
jackgallagher-arm added a commit that referenced this issue Apr 12, 2024
PR #6757 fixed the way we read/write SVE register slots but
unfortunately it is now broken on systems with 128-bit vector length.

Both SVE vector and predicate registers use dr_simd_t slots which is a
64-byte type meant to store up to 512-bit vector registers. SVE
predicate registers are always 1/8 the size of the vector register so
for 512-bit vector length systems we only really need 64 / 8 = 8 bytes
to store predicate registers.

The ldr/str instructions we use to read and write the predicate
register slots have a base+offset memory operand where the offset is
a value in the range [-256, 255] scaled by the predicate register
length. We read and write the registers by setting the base address
to the address of the first slot, and setting the offset to
n * sizeof(dr_simd_t) for each register Pn.
For systems with 128-bit vector length, this means the predicate
registers are 16 / 8 = 2 bytes so the maximum offset we can reach
is 2 * 255 = 510 bytes. This means on 128-bit VL systems we can only
reach the first 8 predicate registers (P8 would need an offset of
8 * sizeof(dr_simd_t) = 512 bytes).

By changing the predicate register and FFR slots to use a new type
dr_svep_t which is 1/8 the size of dr_simd_t we can fix this bug and
save space.

dr_svep_t is currently 8 bytes to correspond to 64-byte vectors, but
even if we extend DynamoRIO to support the maximum SVE vector length of
2048 bits (256 bytes) dr_svep_t will only need to be increased to
256 / 8 = 32 bytes so the maximum offset (15 * 32 = 480 bytes) will
always be in range.

As this changes the size of the predicate register and FFR slots, this
changes the size of the dr_mcontext_t structure and breaks backwards
compatibility with earlier versions of DynamoRIO so the version number
is increased to 10.90.

Issues: #6760, #5365
Fixes: #6760
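
The reachability arithmetic in this commit message can be checked with a short sketch. The helper names are hypothetical; the [-256, 255] immediate range, the 64-byte and 8-byte slot sizes, and the 1/8 predicate-to-vector size ratio are taken from the text above, and 16 predicate registers (P0 to P15) is the architectural count.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NUM_PRED_REGS 16
#define SKETCH_IMM9_MIN (-256)
#define SKETCH_IMM9_MAX 255

/* Predicate registers are 1/8 the size of the vector registers. */
static int
sketch_pred_bytes(int vl_bits)
{
    assert(vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);
    return vl_bits / 8 / 8;
}

/* True if slot n, located at n * slot_bytes from the base address, can be
 * addressed with an immediate that is scaled by the predicate size. */
static bool
sketch_slot_reachable(int n, int slot_bytes, int vl_bits)
{
    int pl_bytes = sketch_pred_bytes(vl_bits);
    int offset = n * slot_bytes;
    if (offset % pl_bytes != 0)
        return false;
    int imm = offset / pl_bytes;
    return imm >= SKETCH_IMM9_MIN && imm <= SKETCH_IMM9_MAX;
}

/* Number of predicate register slots P0..P15 that are reachable. */
static int
sketch_count_reachable(int slot_bytes, int vl_bits)
{
    int count = 0;
    for (int n = 0; n < SKETCH_NUM_PRED_REGS; n++) {
        if (sketch_slot_reachable(n, slot_bytes, vl_bits))
            count++;
    }
    return count;
}
```

With 64-byte slots and a 128-bit vector length only 8 of the 16 slots are reachable, which matches the bug described above; with 8-byte slots all 16 are reachable, and the same holds for 32-byte slots at a 2048-bit vector length.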
jackgallagher-arm added a commit that referenced this issue Apr 15, 2024
…6774)
