Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

i#5365 AArch64: Fix 0 size read/write records in drmemtrace #6544

Merged
merged 5 commits into from
Jan 18, 2024

Conversation

jackgallagher-arm
Copy link
Collaborator

@jackgallagher-arm jackgallagher-arm commented Jan 8, 2024

When debugging i#6499 we noticed that drcachesim was producing 0 byte read/write records for some SVE load/store instructions:

 ifetch       4 byte(s) @ 0x0000000000405b3c a54a4681   ld1w   (%x20,%x10,lsl #2) %p1/z -> %z1.s
 read         0 byte(s) @ 0x0000000000954e80 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e84 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e88 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e8c by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e90 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e94 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e98 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e9c by PC 0x0000000000405b3c
 ifetch       4 byte(s) @ 0x0000000000405b4

This turned out to be due to drdecode being linked into drcachesim twice: once into the drcachesim executable, once into libdynamorio. drdecode uses a global variable to store the SVE vector length to use when decoding so we end up with two copies of that variable and only one was being initialized.
To fix this properly we would need to refactor the libraries so that there is only one copy of the sve_veclen global variable, or change the way that the decoder gets the vector length so its no longer stored in a global variable. In the mean time we have a workaround which makes sure both copies of the variable get initialized and drcachesim produces correct results.

With that workaround in place however, the results were still wrong. For expanded scatter/gather instructions when you are using an offline trace, raw2trace doesn't have access to the load/store instructions from the expansion, only the original app scatter/gather instruction. It has to create the read/write records using only information from the original scatter/gather instruction and it uses the size of the memory operand to determine the size of each read/write. This works for x86 because the x86 IR uses the per-element data size as for the memory operand of scatter/gather instructions. This doesn't work for AArch64 because the AArch64 codec uses the maximum data transferred (per-element data size * number of elements) like other SIMD load/store instructions.

We plan to make the AArch64 IR consistent with x86 by changing it to use the same convention as x86 for scatter/gather instructions but in the mean time we can work around the inconsistency by fixing the size in raw2trace based on the instruction's opcode.

Issues: #6499, #5365, #5036

When debugging i#6499 we noticed that drcachesim was producing 0 byte
read/write records for some SVE load/store instructions:

```
 ifetch       4 byte(s) @ 0x0000000000405b3c a54a4681   ld1w   (%x20,%x10,lsl #2) %p1/z -> %z1.s
 read         0 byte(s) @ 0x0000000000954e80 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e84 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e88 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e8c by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e90 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e94 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e98 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e9c by PC 0x0000000000405b3c
 ifetch       4 byte(s) @ 0x0000000000405b4
```

This turned out to be due to drdecode being linked into drcachesim
twice: once into the drcachesim executable, once into libdynamorio.
drdecode uses a global variable to store the SVE vector length to use
when decoding so we end up with two copies of that variable and only
one was being initialized.
To fix this properly we would need to refactor the libraries so that
there is only one copy of the sve_veclen global variable, or change the
way that the decoder gets the vector length so its no longer stored in
a global variable. In the mean time we have a workaround which
makes sure both copies of the variable get initialized and drcachesim
produces correct results.

With that workaround in place however, the results were still wrong.
For expanded scatter/gather instructions when you are using an offline
trace, raw2trace doesn't have access to the load/store instructions
from the expansion, only the original app scatter/gather instruction.
It has to create the read/write records using only information from the
original scatter/gather instruction and it uses the size of the memory
operand to determine the size of each read/write. This works for x86
because the x86 IR uses the per-element data size as for the memory
operand of scatter/gather instructions. This doesn't work for AArch64
because the AArch64 codec uses the maximum data transferred
(per-element data size * number of elements) like other SIMD load/store
instructions.

We plan to make the AArch64 IR consistent with x86 by changing it to
use the same convention as x86 for scatter/gather instructions but in
the mean time we can work around the inconsistency by fixing the size
in raw2trace based on the instruction's opcode.

Issues: #6499, #5365
@jackgallagher-arm
Copy link
Collaborator Author

Hi @derekbruening @abhinav92003, the problems I found debugging the 0-byte read/write issue we spotted in #6499 raise a few questions that I would value your input on:

  1. Can you confirm how the memory operand size works for x86 scatter/gather instructions? AArch64 currently uses the maximum number of bytes transferred by the instruction (the amount of data that would be read/written if all elements are active, i.e per-element-size * number-of-elements) but raw2trace appears to expect it to be expecting it to be the per-element size. We want to make sure that the AArch64 IR is consistent with the x86 IR so I want to make sure I understand how it works in x86.
  2. The drcachesim process ends up with two copies of drdecode.a: one linked in to drcachesim, one linked in to libdynamorio.so. Is this intended?
  3. This patch sets the vector length used when decoding instructions in an offline raw trace to the VL for the current process. Most of the time if you are processing the trace on the same machine the trace was produced on this will be correct, but if you are using a different machine with a different VL (or you are using the same machine but have configured a different VL), it won't be correct. raw2trace really needs to be configuring the decoder to use the VL of the traced process, not the current process. Is there a way we can record the VL in the raw trace so that we can read it back later in raw2trace and call dr_set_sve_vector_length() to configure it correctly?

@abhinav92003
Copy link
Contributor

We want to make sure that the AArch64 IR is consistent with the x86 IR so I want to make sure I understand how it works in x86.

On x86, the IR has a fixed operand size for each scatter-gather opcode, which is equal to the per-element size. E.g. 2, 3 where for vpgatherdd the operand size is set to 4 bytes.

Is there a way we can record the VL in the raw trace so that we can read it back later in raw2trace and call dr_set_sve_vector_length() to configure it correctly?

We can perhaps add a new marker4 for vector size in the thread raw trace header 5, similar to page and cache sizes.

Question:
Can an application use multiple vector lengths throughout its execution by setting ZCR_Elx.LEN1 to different values? Or is the vector length guaranteed to be fixed throughout for an app?

@derekbruening
Copy link
Contributor

  1. The drcachesim process ends up with two copies of drdecode.a: one linked in to drcachesim, one linked in to libdynamorio.so. Is this intended?

No. We have put effort into supporting the same libdynamorio used for both standalone decoding and managed mode (at different time of course) in the same process, so the separate copy should not be needed and it is surprising to see it. Is it easy to remove it?

@jackgallagher-arm
Copy link
Collaborator Author

On x86, the IR has a fixed operand size for each scatter-gather opcode, which is equal to the per-element size. E.g. 2, 3 where for vpgatherdd the operand size is set to 4 bytes.

Thanks. We will adjust the AArch64 IR so SVE scatter/gather instructions use the same convention.

We can perhaps add a new marker4 for vector size in the thread raw trace header 5, similar to page and cache sizes.

Thanks. I'll have a look in to it.

Question: Can an application use multiple vector lengths throughout its execution by setting ZCR_Elx.LEN1 to different values? Or is the vector length guaranteed to be fixed throughout for an app?

Technically at a hardware level, yes you can change the vector length by writing to ZCR_Elx and Linux provides a system call that allows you to change the vector length for the current thread, but as I understand it doing so during a running application is considered undefined behaviour and DynamoRIO doesn't support it. DynamoRIO samples the VL once during initialization and assumes that it stays the same for the duration of the application.

@jackgallagher-arm
Copy link
Collaborator Author

  1. The drcachesim process ends up with two copies of drdecode.a: one linked in to drcachesim, one linked in to libdynamorio.so. Is this intended?

No. We have put effort into supporting the same libdynamorio used for both standalone decoding and managed mode (at different time of course) in the same process, so the separate copy should not be needed and it is surprising to see it.

Right now there are two functions used by the codec which are not exported by libdynamorio:
dr_set_sve_vector_length(int vl) and dr_get_sve_vector_length(). This means when using libdynamorio you can't call dr_set_sve_vector_length(vl) to override the vector length that the codec uses. If we want 'libdynamorio' and 'drdecode' to be equivalent we will need to export those functions or rethink how it gets configured.

Is it easy to remove it?

I was able to make it to link correctly with one copy of drdecode.a by manually modifying the linker command, but I wasn't able to get CMake to generate that link command. I think we will need to look at the graph of dependency libraries to get this to build robustly with one copy of 'drdecode.a'

@abhinav92003
Copy link
Contributor

Technically at a hardware level, yes you can change the vector length by writing to ZCR_Elx and Linux provides a system call that allows you to change the vector length for the current thread, but as I understand it doing so during a running application is considered undefined behaviour and DynamoRIO doesn't support it.

Do you mean the hardware considers it undefined behaviour or DynamoRIO does? I assume you mean the latter. Well, we do have pre and post system call control points in DynamoRIO where we can potentially add some special handling -- maybe invoke some hook in drcachesim which adds a marker to the raw trace denoting a change in vector size; in raw2trace we can then make the required adjustments when these markers are found.

@derekbruening
Copy link
Contributor

Right now there are two functions used by the codec which are not exported by libdynamorio:
dr_set_sve_vector_length(int vl) and dr_get_sve_vector_length(). This means when using libdynamorio you can't call dr_set_sve_vector_length(vl) to override the vector length that the codec uses. If we want 'libdynamorio' and 'drdecode' to be equivalent we will need to export those functions or rethink how it gets configured.

All standalone decode/encode functions should be available in libdynamorio. Looking at dr_set_sve_vector_length() and dr_get_sve_vector_length() in encode_api.h: they are missing DR_API specifiers and this is why they are not exported when in a shared library.

@derekbruening
Copy link
Contributor

they are missing DR_API specifiers and this is why they are not exported when in a shared library

It may be worth a quick audit of any other new interfaces added recently to ensure they all have DR_API.

@jackgallagher-arm
Copy link
Collaborator Author

Do you mean the hardware considers it undefined behaviour or DynamoRIO does? I assume you mean the latter.

I mean that language runtimes treat the vector length as a runtime constant. For example LLVM's vector type IR uses a vscale constant to represent scalable vectors:

For scalable vectors, the total number of elements is a constant multiple (called vscale) of the specified number of elements; vscale is a positive integer that is unknown at compile time and the same hardware-dependent constant for all scalable vectors at run time. The size of a specific scalable vector type is thus constant within IR, even if the exact size in bytes cannot be determined until run time.

So although an application author could call prctl(...) and change the vector length, the compiler isn't expecting it to change and you might end up in a bad state if you do so. For example, if you had spilled a vector register to memory before changing the vector length, and try to restore it after.

Well, we do have pre and post system call control points in DynamoRIO where we can potentially add some special handling -- maybe invoke some hook in drcachesim which adds a marker to the raw trace denoting a change in vector size; in raw2trace we can then make the required adjustments when these markers are found.

Thanks. The current plan is to keep things as they are and assume that the vector length doesn't change but its worth knowing there is a possible way forward if that plan changes.

@derekbruening
Copy link
Contributor

Generally, we want to be able to run any program that the hardware supports, not just those that follow software conventions in compilers or ABIs, since there always seem to be real programs that violate those conventions, and there are cases where DR is used to run deliberately violating programs (analyzing malware, etc.). Only when such support is intractable do we reluctantly relax that and assume conventions: such as some rseq corner cases. So if it is practical to support it changing we would want to do that. We should add a handler for the prctl call that at least detects a change and provides a warning or error with a TODO to actually handle it.

Change-Id: Idd84efa6be879af2b12f849c2143212cb48dacc1
@jackgallagher-arm
Copy link
Collaborator Author

Generally, we want to be able to run any program that the hardware supports, not just those that follow software conventions in compilers or ABIs, since there always seem to be real programs that violate those conventions, and there are cases where DR is used to run deliberately violating programs (analyzing malware, etc.). Only when such support is intractable do we reluctantly relax that and assume conventions: such as some rseq corner cases. So if it is practical to support it changing we would want to do that. We should add a handler for the prctl call that at least detects a change and provides a warning or error with a TODO to actually handle it.

That makes sense. We will create a PR to detect vector length changes and issue a warning. Thanks.

@jackgallagher-arm jackgallagher-arm merged commit 0ddaa0f into master Jan 18, 2024
14 of 15 checks passed
@jackgallagher-arm jackgallagher-arm deleted the i5365-fix-drcachesim-0-size-read-write branch January 18, 2024 09:50
@jackgallagher-arm
Copy link
Collaborator Author

Proper fix for the memory operand sizes: #6574

@derekbruening
Copy link
Contributor

This PR has broken our internal build: we get undefined references to dr_set_sve_vector_length, due to what was noted above. Filed #6575 as this is blocking us. I will add the missing DR_API to expedite the fix.

@derekbruening
Copy link
Contributor

We have hit two more problems around vector lengths: a buffer overflow #6584, and incorrectly using the underlying machine's length when processing a trace gathered on a different architecture, which was added by this PR: #6585.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants