i#5365 AArch64: Fix 0 size read/write records in drmemtrace #6544

jackgallagher-arm · 2024-01-08T13:00:52Z

When debugging i#6499 we noticed that drcachesim was producing 0 byte read/write records for some SVE load/store instructions:

 ifetch       4 byte(s) @ 0x0000000000405b3c a54a4681   ld1w   (%x20,%x10,lsl #2) %p1/z -> %z1.s
 read         0 byte(s) @ 0x0000000000954e80 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e84 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e88 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e8c by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e90 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e94 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e98 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e9c by PC 0x0000000000405b3c
 ifetch       4 byte(s) @ 0x0000000000405b4

This turned out to be due to drdecode being linked into drcachesim twice: once into the drcachesim executable, once into libdynamorio. drdecode uses a global variable to store the SVE vector length to use when decoding so we end up with two copies of that variable and only one was being initialized.
To fix this properly we would need to refactor the libraries so that there is only one copy of the sve_veclen global variable, or change the way that the decoder gets the vector length so its no longer stored in a global variable. In the mean time we have a workaround which makes sure both copies of the variable get initialized and drcachesim produces correct results.

With that workaround in place however, the results were still wrong. For expanded scatter/gather instructions when you are using an offline trace, raw2trace doesn't have access to the load/store instructions from the expansion, only the original app scatter/gather instruction. It has to create the read/write records using only information from the original scatter/gather instruction and it uses the size of the memory operand to determine the size of each read/write. This works for x86 because the x86 IR uses the per-element data size as for the memory operand of scatter/gather instructions. This doesn't work for AArch64 because the AArch64 codec uses the maximum data transferred (per-element data size * number of elements) like other SIMD load/store instructions.

We plan to make the AArch64 IR consistent with x86 by changing it to use the same convention as x86 for scatter/gather instructions but in the mean time we can work around the inconsistency by fixing the size in raw2trace based on the instruction's opcode.

Issues: #6499, #5365, #5036

When debugging i#6499 we noticed that drcachesim was producing 0 byte read/write records for some SVE load/store instructions: ``` ifetch 4 byte(s) @ 0x0000000000405b3c a54a4681 ld1w (%x20,%x10,lsl #2) %p1/z -> %z1.s read 0 byte(s) @ 0x0000000000954e80 by PC 0x0000000000405b3c read 0 byte(s) @ 0x0000000000954e84 by PC 0x0000000000405b3c read 0 byte(s) @ 0x0000000000954e88 by PC 0x0000000000405b3c read 0 byte(s) @ 0x0000000000954e8c by PC 0x0000000000405b3c read 0 byte(s) @ 0x0000000000954e90 by PC 0x0000000000405b3c read 0 byte(s) @ 0x0000000000954e94 by PC 0x0000000000405b3c read 0 byte(s) @ 0x0000000000954e98 by PC 0x0000000000405b3c read 0 byte(s) @ 0x0000000000954e9c by PC 0x0000000000405b3c ifetch 4 byte(s) @ 0x0000000000405b4 ``` This turned out to be due to drdecode being linked into drcachesim twice: once into the drcachesim executable, once into libdynamorio. drdecode uses a global variable to store the SVE vector length to use when decoding so we end up with two copies of that variable and only one was being initialized. To fix this properly we would need to refactor the libraries so that there is only one copy of the sve_veclen global variable, or change the way that the decoder gets the vector length so its no longer stored in a global variable. In the mean time we have a workaround which makes sure both copies of the variable get initialized and drcachesim produces correct results. With that workaround in place however, the results were still wrong. For expanded scatter/gather instructions when you are using an offline trace, raw2trace doesn't have access to the load/store instructions from the expansion, only the original app scatter/gather instruction. It has to create the read/write records using only information from the original scatter/gather instruction and it uses the size of the memory operand to determine the size of each read/write. This works for x86 because the x86 IR uses the per-element data size as for the memory operand of scatter/gather instructions. This doesn't work for AArch64 because the AArch64 codec uses the maximum data transferred (per-element data size * number of elements) like other SIMD load/store instructions. We plan to make the AArch64 IR consistent with x86 by changing it to use the same convention as x86 for scatter/gather instructions but in the mean time we can work around the inconsistency by fixing the size in raw2trace based on the instruction's opcode. Issues: #6499, #5365

jackgallagher-arm · 2024-01-08T14:45:18Z

Hi @derekbruening @abhinav92003, the problems I found debugging the 0-byte read/write issue we spotted in #6499 raise a few questions that I would value your input on:

Can you confirm how the memory operand size works for x86 scatter/gather instructions? AArch64 currently uses the maximum number of bytes transferred by the instruction (the amount of data that would be read/written if all elements are active, i.e per-element-size * number-of-elements) but raw2trace appears to expect it to be expecting it to be the per-element size. We want to make sure that the AArch64 IR is consistent with the x86 IR so I want to make sure I understand how it works in x86.
The drcachesim process ends up with two copies of drdecode.a: one linked in to drcachesim, one linked in to libdynamorio.so. Is this intended?
This patch sets the vector length used when decoding instructions in an offline raw trace to the VL for the current process. Most of the time if you are processing the trace on the same machine the trace was produced on this will be correct, but if you are using a different machine with a different VL (or you are using the same machine but have configured a different VL), it won't be correct. raw2trace really needs to be configuring the decoder to use the VL of the traced process, not the current process. Is there a way we can record the VL in the raw trace so that we can read it back later in raw2trace and call dr_set_sve_vector_length() to configure it correctly?

abhinav92003 · 2024-01-08T16:42:38Z

We want to make sure that the AArch64 IR is consistent with the x86 IR so I want to make sure I understand how it works in x86.

On x86, the IR has a fixed operand size for each scatter-gather opcode, which is equal to the per-element size. E.g. 2, 3 where for vpgatherdd the operand size is set to 4 bytes.

Is there a way we can record the VL in the raw trace so that we can read it back later in raw2trace and call dr_set_sve_vector_length() to configure it correctly?

We can perhaps add a new marker4 for vector size in the thread raw trace header 5, similar to page and cache sizes.

Question:
Can an application use multiple vector lengths throughout its execution by setting ZCR_Elx.LEN1 to different values? Or is the vector length guaranteed to be fixed throughout for an app?

derekbruening · 2024-01-08T17:31:31Z

The drcachesim process ends up with two copies of drdecode.a: one linked in to drcachesim, one linked in to libdynamorio.so. Is this intended?

No. We have put effort into supporting the same libdynamorio used for both standalone decoding and managed mode (at different time of course) in the same process, so the separate copy should not be needed and it is surprising to see it. Is it easy to remove it?

jackgallagher-arm · 2024-01-09T10:21:06Z

On x86, the IR has a fixed operand size for each scatter-gather opcode, which is equal to the per-element size. E.g. 2, 3 where for vpgatherdd the operand size is set to 4 bytes.

Thanks. We will adjust the AArch64 IR so SVE scatter/gather instructions use the same convention.

We can perhaps add a new marker4 for vector size in the thread raw trace header 5, similar to page and cache sizes.

Thanks. I'll have a look in to it.

Question: Can an application use multiple vector lengths throughout its execution by setting ZCR_Elx.LEN1 to different values? Or is the vector length guaranteed to be fixed throughout for an app?

Technically at a hardware level, yes you can change the vector length by writing to ZCR_Elx and Linux provides a system call that allows you to change the vector length for the current thread, but as I understand it doing so during a running application is considered undefined behaviour and DynamoRIO doesn't support it. DynamoRIO samples the VL once during initialization and assumes that it stays the same for the duration of the application.

jackgallagher-arm · 2024-01-09T11:00:33Z

The drcachesim process ends up with two copies of drdecode.a: one linked in to drcachesim, one linked in to libdynamorio.so. Is this intended?

No. We have put effort into supporting the same libdynamorio used for both standalone decoding and managed mode (at different time of course) in the same process, so the separate copy should not be needed and it is surprising to see it.

Right now there are two functions used by the codec which are not exported by libdynamorio:
dr_set_sve_vector_length(int vl) and dr_get_sve_vector_length(). This means when using libdynamorio you can't call dr_set_sve_vector_length(vl) to override the vector length that the codec uses. If we want 'libdynamorio' and 'drdecode' to be equivalent we will need to export those functions or rethink how it gets configured.

Is it easy to remove it?

I was able to make it to link correctly with one copy of drdecode.a by manually modifying the linker command, but I wasn't able to get CMake to generate that link command. I think we will need to look at the graph of dependency libraries to get this to build robustly with one copy of 'drdecode.a'

abhinav92003 · 2024-01-09T14:22:44Z

Technically at a hardware level, yes you can change the vector length by writing to ZCR_Elx and Linux provides a system call that allows you to change the vector length for the current thread, but as I understand it doing so during a running application is considered undefined behaviour and DynamoRIO doesn't support it.

Do you mean the hardware considers it undefined behaviour or DynamoRIO does? I assume you mean the latter. Well, we do have pre and post system call control points in DynamoRIO where we can potentially add some special handling -- maybe invoke some hook in drcachesim which adds a marker to the raw trace denoting a change in vector size; in raw2trace we can then make the required adjustments when these markers are found.

derekbruening · 2024-01-09T16:16:45Z

Right now there are two functions used by the codec which are not exported by libdynamorio:
dr_set_sve_vector_length(int vl) and dr_get_sve_vector_length(). This means when using libdynamorio you can't call dr_set_sve_vector_length(vl) to override the vector length that the codec uses. If we want 'libdynamorio' and 'drdecode' to be equivalent we will need to export those functions or rethink how it gets configured.

All standalone decode/encode functions should be available in libdynamorio. Looking at dr_set_sve_vector_length() and dr_get_sve_vector_length() in encode_api.h: they are missing DR_API specifiers and this is why they are not exported when in a shared library.

derekbruening · 2024-01-09T16:19:24Z

they are missing DR_API specifiers and this is why they are not exported when in a shared library

It may be worth a quick audit of any other new interfaces added recently to ensure they all have DR_API.

jackgallagher-arm · 2024-01-10T13:54:45Z

Do you mean the hardware considers it undefined behaviour or DynamoRIO does? I assume you mean the latter.

I mean that language runtimes treat the vector length as a runtime constant. For example LLVM's vector type IR uses a vscale constant to represent scalable vectors:

For scalable vectors, the total number of elements is a constant multiple (called vscale) of the specified number of elements; vscale is a positive integer that is unknown at compile time and the same hardware-dependent constant for all scalable vectors at run time. The size of a specific scalable vector type is thus constant within IR, even if the exact size in bytes cannot be determined until run time.

So although an application author could call prctl(...) and change the vector length, the compiler isn't expecting it to change and you might end up in a bad state if you do so. For example, if you had spilled a vector register to memory before changing the vector length, and try to restore it after.

Well, we do have pre and post system call control points in DynamoRIO where we can potentially add some special handling -- maybe invoke some hook in drcachesim which adds a marker to the raw trace denoting a change in vector size; in raw2trace we can then make the required adjustments when these markers are found.

Thanks. The current plan is to keep things as they are and assume that the vector length doesn't change but its worth knowing there is a possible way forward if that plan changes.

derekbruening · 2024-01-10T16:54:17Z

Generally, we want to be able to run any program that the hardware supports, not just those that follow software conventions in compilers or ABIs, since there always seem to be real programs that violate those conventions, and there are cases where DR is used to run deliberately violating programs (analyzing malware, etc.). Only when such support is intractable do we reluctantly relax that and assume conventions: such as some rseq corner cases. So if it is practical to support it changing we would want to do that. We should add a handler for the prctl call that at least detects a change and provides a warning or error with a TODO to actually handle it.

clients/drcachesim/tracer/raw2trace.cpp

Change-Id: Idd84efa6be879af2b12f849c2143212cb48dacc1

jackgallagher-arm · 2024-01-12T11:17:00Z

Generally, we want to be able to run any program that the hardware supports, not just those that follow software conventions in compilers or ABIs, since there always seem to be real programs that violate those conventions, and there are cases where DR is used to run deliberately violating programs (analyzing malware, etc.). Only when such support is intractable do we reluctantly relax that and assume conventions: such as some rseq corner cases. So if it is practical to support it changing we would want to do that. We should add a handler for the prctl call that at least detects a change and provides a warning or error with a TODO to actually handle it.

That makes sense. We will create a PR to detect vector length changes and issue a warning. Thanks.

clients/drcachesim/tracer/raw2trace.cpp

Change-Id: I82d48ab7fc69e485a9f92dc2d85254723bf2f51a

jackgallagher-arm · 2024-01-23T11:27:05Z

Proper fix for the memory operand sizes: #6574

derekbruening · 2024-01-23T14:30:21Z

This PR has broken our internal build: we get undefined references to dr_set_sve_vector_length, due to what was noted above. Filed #6575 as this is blocking us. I will add the missing DR_API to expedite the fix.

derekbruening · 2024-01-25T07:09:36Z

We have hit two more problems around vector lengths: a buffer overflow #6584, and incorrectly using the underlying machine's length when processing a trace gathered on a different architecture, which was added by this PR: #6585.

jackgallagher-arm added 2 commits January 8, 2024 12:54

Fix formatting

f5f3189

jackgallagher-arm requested review from derekbruening and abhinav92003 January 10, 2024 10:04

derekbruening reviewed Jan 10, 2024

View reviewed changes

clients/drcachesim/tracer/raw2trace.cpp Show resolved Hide resolved

derekbruening reviewed Jan 10, 2024

View reviewed changes

clients/drcachesim/tracer/raw2trace.cpp Show resolved Hide resolved

jackgallagher-arm mentioned this pull request Jan 11, 2024

drdecode.a is linked in to drcachesim twice #6556

Open

Update issue refs

bb024bb

Change-Id: Idd84efa6be879af2b12f849c2143212cb48dacc1

jackgallagher-arm mentioned this pull request Jan 12, 2024

What should the memory operand size be for SVE predicated contiguous loads and stores #6561

Open

jackgallagher-arm requested a review from derekbruening January 17, 2024 09:30

derekbruening approved these changes Jan 18, 2024

View reviewed changes

clients/drcachesim/tracer/raw2trace.cpp Show resolved Hide resolved

clients/drcachesim/tracer/raw2trace.cpp Show resolved Hide resolved

jackgallagher-arm added 2 commits January 18, 2024 08:54

Merge branch 'master' into i5365-fix-drcachesim-0-size-read-write

59a8cb3

Change-Id: I82d48ab7fc69e485a9f92dc2d85254723bf2f51a

Update comment

c41ca0e

jackgallagher-arm merged commit 0ddaa0f into master Jan 18, 2024
14 of 15 checks passed

jackgallagher-arm deleted the i5365-fix-drcachesim-0-size-read-write branch January 18, 2024 09:50

derekbruening mentioned this pull request Jan 23, 2024

dr_set_sve_vector_length is not exported, causing link errors when used from libdynamorio #6575

Open

derekbruening mentioned this pull request Jan 25, 2024

raw2trace incorrectly tries to use the host machine's vector length when processing a trace from another architecture #6585

Closed

jackgallagher-arm mentioned this pull request Feb 5, 2024

Handle changes in SVE vector length on AArch64 #6625

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

i#5365 AArch64: Fix 0 size read/write records in drmemtrace #6544

i#5365 AArch64: Fix 0 size read/write records in drmemtrace #6544

jackgallagher-arm commented Jan 8, 2024 •

edited

Loading

jackgallagher-arm commented Jan 8, 2024

abhinav92003 commented Jan 8, 2024

derekbruening commented Jan 8, 2024

jackgallagher-arm commented Jan 9, 2024

jackgallagher-arm commented Jan 9, 2024

abhinav92003 commented Jan 9, 2024

derekbruening commented Jan 9, 2024

derekbruening commented Jan 9, 2024

jackgallagher-arm commented Jan 10, 2024

derekbruening commented Jan 10, 2024

jackgallagher-arm commented Jan 12, 2024

jackgallagher-arm commented Jan 23, 2024

derekbruening commented Jan 23, 2024

derekbruening commented Jan 25, 2024

i#5365 AArch64: Fix 0 size read/write records in drmemtrace #6544

i#5365 AArch64: Fix 0 size read/write records in drmemtrace #6544

Conversation

jackgallagher-arm commented Jan 8, 2024 • edited Loading

jackgallagher-arm commented Jan 8, 2024

abhinav92003 commented Jan 8, 2024

derekbruening commented Jan 8, 2024

jackgallagher-arm commented Jan 9, 2024

jackgallagher-arm commented Jan 9, 2024

abhinav92003 commented Jan 9, 2024

derekbruening commented Jan 9, 2024

derekbruening commented Jan 9, 2024

jackgallagher-arm commented Jan 10, 2024

derekbruening commented Jan 10, 2024

jackgallagher-arm commented Jan 12, 2024

jackgallagher-arm commented Jan 23, 2024

derekbruening commented Jan 23, 2024

derekbruening commented Jan 25, 2024

jackgallagher-arm commented Jan 8, 2024 •

edited

Loading