Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ASSERT: Invariant test failure: Too many reads (AArch64+sve) #6499

Closed
joshua-warburton opened this issue Dec 8, 2023 · 9 comments
Closed

Comments

@joshua-warburton
Copy link
Collaborator

joshua-warburton commented Dec 8, 2023

Describe the bug
When running various applications, including fleetbench, we run into the following assert:

Trace invariant failure in T1909 at ref # 19484 (957 instrs since timestamp 13345924302819833): Too many read records

This failing test compares a predicted number of reads with the actual number of reads recorded by dynamorio.

The above case is failing at this point, when the second read is considered:

       19482       15318:        1909 ifetch       4 byte(s) @ 0x0000000000405b3c a54a4681   ld1w   (%x20,%x10,lsl #2) %p1/z -> %z1.s
       19483       15318:        1909 read         0 byte(s) @ 0x0000000000954e80 by PC 0x0000000000405b3c
       19484       15318:        1909 read         0 byte(s) @ 0x0000000000954e84 by PC 0x0000000000405b3c
       19485       15318:        1909 read         0 byte(s) @ 0x0000000000954e88 by PC 0x0000000000405b3c
       19486       15318:        1909 read         0 byte(s) @ 0x0000000000954e8c by PC 0x0000000000405b3c
       19487       15318:        1909 read         0 byte(s) @ 0x0000000000954e90 by PC 0x0000000000405b3c
       19488       15318:        1909 read         0 byte(s) @ 0x0000000000954e94 by PC 0x0000000000405b3c
       19489       15318:        1909 read         0 byte(s) @ 0x0000000000954e98 by PC 0x0000000000405b3c
       19490       15318:        1909 read         0 byte(s) @ 0x0000000000954e9c by PC 0x0000000000405b3c
       19491       15319:        1909 ifetch       4 byte(s) @ 0x0000000000405b4 

As this is run on a machine with a 256 bit vector length, we see 8 reads as expected for this instruction.

The invariant checker predicts the reads using the following function

uint
instr_num_memory_read_access(instr_t *instr)
{
    int i;
    opnd_t curop;
    int count = 0;
    const int opc = instr_get_opcode(instr);
    if (opc_is_not_a_real_memory_load(opc))
        return 0;
    for (i = 0; i < instr_num_srcs(instr); i++) {
        curop = instr_get_src(instr, i);
        if (opnd_is_memory_reference(curop)) {
            ++count;
        }
    }
    return count;
}

This means that we can only ever expect one read per memory accessing operand (effectively one per instruction).

Should we expect sve instructions to "roll up" the memory accesses into a single entry, or do we have to do a better job at predicting the number of reads and writes?

Versions

  • This is run against the HEAD
@derekbruening
Copy link
Contributor

Side question: why do the read records in the trace snippet all have 0 bytes? Something looks off there.

Main question: this is definitely an issue with the invariant checker. One solution is to mark scatter/gather instructions as predicated since with their masks for each load/store being optional they are dynamically variable in whether they do each load/store. We marked x86 rep string instructions this way: #1556, #6284. (We would then also mark x86 scatter/gather as well.). The invariant checker gives up on this check for predicated instructions.

If there are aspects that can be checked by the invariant checker, we should try to add them: e.g., checking the sizes of any loads/stores -- that's statically known from the decoding? That would then report an error on these 0-byte loads.

@joshua-warburton
Copy link
Collaborator Author

For the side question, I suspect it may be due to the predicate being all zeros.

We can probably know the number of loads and stores by looking at the destination register and the vector length, we're also likely to know the sizes of the loads and stores as that's usually directly related to the element size.

@derekbruening
Copy link
Contributor

For the side question, I suspect it may be due to the predicate being all zeros.

If a predicate bit is zero, then no load is done, so there should be no read record in the trace for that element. If the scatter/gather expansion is leaving an app load instr in place for a zero predicate, that is incorrect: that load needs to be skipped.

@joshua-warburton
Copy link
Collaborator Author

I've been looking into it and considering your previous comments the obvious solution is to update instr_is_predicated to correctly detect predicated AArch64 instructions.

I had a basic go at this and it does seem to resolve the problem of the assert in question but it doesn't remove all of the related "too many writes" assert.

This is due to using this method to detect if something is predicated

for (int i = 0; i < instr_num_srcs(instr); i++) {
        opnd = instr_get_src(instr, i);
        if (opnd_is_predicate_zero(opnd) || opnd_is_predicate_merge(opnd))
            return true;
    }

This will catch most predicated instructions.

AArch64 has several ways of using predicates, and to have the correct return for instr_is_predicated we have to account for all of them.

The simplest is what is caught by the above snippet:

LD1W    { <Zt1>.S-<Zt4>.S }, <PNg>/Z, [<Xn|SP>, <Xm>, LSL #2]
FMAXNMP <Zdn>.<Ts>, <Pg>/M, <Zdn>.<Ts>, <Zm>.<Ts>
et al

These are predicates with modes which directly imply that these are governing predicates
and are thus predicated instructions.

We can also reasonably assume that predicates in ld/st instructions like

ST1D    { <Zt1>.D-<Zt2>.D }, <PNg>, [<Xn|SP>{, #<simm>, MUL VL}]

imply that they are predicated instructions.

The problem comes with instructions like

FADDQV  <Vd>.<Ts>, <Pg>, <Zn>.<Tb>

which have governing predicates, and are thus predicated instructions, but have no
features in the current version of the codec that distinguish them from instructions like

SEL     <Zd>.<Ts>, <Pv>, <Zn>.<Ts>, <Zm>.<Ts>

where the predicates are merely being manipulated rather than modifying the reads or writes of the instruction. (There are a very limited number of this type of instruction).

If we want to make sure that instr_is_predicated always returns the correct we will either have to:

  • Add something to the codec to mark predicated instructions with a flag or similar when decodeing/being created
  • Use a set of general rules based off the properties of the operand to determine if something is a predicated or not, with a list of exceptions to check against.
  • Something else

Do you have any opinion on how we should proceed with this?

@derekbruening
Copy link
Contributor

For x86 we have a flag in the codec/table for predicates of various types, which turns into a DR_PRED_* instr_t-level flag. I would vote for a similar approach for aarch64 as well.

@derekbruening
Copy link
Contributor

Filed #6510 for x86.

@joshua-warburton
Copy link
Collaborator Author

Hey @derekbruening, I ended up modifying the codec so that properties can be added to operands with a dot notation. For example p10_lo.gov uses p10_lo to decode the operand but it is then marked as a governing operand.

Instructions with at least one governing operand are then marked as predicated by modifying the decode function generator to apply the DR_PRED_GOVERNING predicate value. This means we can then skip those instructions during the invariant checker and both the instructions which are predicated and the governing predicate are marked in the API for anyone who is interested in that information

@derekbruening
Copy link
Contributor

For the aarch64-internal codec that sounds fine. For the API-visible part, since the x86 and SVE concepts are so similar, we should try to share the interface to make it easier on clients. I'll add such a discussion to PR #6535.

jackgallagher-arm added a commit that referenced this issue Jan 8, 2024
When debugging i#6499 we noticed that drcachesim was producing 0 byte
read/write records for some SVE load/store instructions:

```
 ifetch       4 byte(s) @ 0x0000000000405b3c a54a4681   ld1w   (%x20,%x10,lsl #2) %p1/z -> %z1.s
 read         0 byte(s) @ 0x0000000000954e80 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e84 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e88 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e8c by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e90 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e94 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e98 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e9c by PC 0x0000000000405b3c
 ifetch       4 byte(s) @ 0x0000000000405b4
```

This turned out to be due to drdecode being linked into drcachesim
twice: once into the drcachesim executable, once into libdynamorio.
drdecode uses a global variable to store the SVE vector length to use
when decoding so we end up with two copies of that variable and only
one was being initialized.
To fix this properly we would need to refactor the libraries so that
there is only one copy of the sve_veclen global variable, or change the
way that the decoder gets the vector length so its no longer stored in
a global variable. In the mean time we have a workaround which
makes sure both copies of the variable get initialized and drcachesim
produces correct results.

With that workaround in place however, the results were still wrong.
For expanded scatter/gather instructions when you are using an offline
trace, raw2trace doesn't have access to the load/store instructions
from the expansion, only the original app scatter/gather instruction.
It has to create the read/write records using only information from the
original scatter/gather instruction and it uses the size of the memory
operand to determine the size of each read/write. This works for x86
because the x86 IR uses the per-element data size as for the memory
operand of scatter/gather instructions. This doesn't work for AArch64
because the AArch64 codec uses the maximum data transferred
(per-element data size * number of elements) like other SIMD load/store
instructions.

We plan to make the AArch64 IR consistent with x86 by changing it to
use the same convention as x86 for scatter/gather instructions but in
the mean time we can work around the inconsistency by fixing the size
in raw2trace based on the instruction's opcode.

Issues: #6499, #5365
AssadHashmi pushed a commit that referenced this issue Jan 8, 2024
…predicated (#6535)

This patch adds the DR_PRED_MASKED flag to instructions containing any
governing predicate registers as well as tagging the governing register
with DR_OPND_IS_MASKED.

To do this, it extends the codec functionality such that operands may
have "flags" which allow us to add some special code generation when
they are encountered. In this case we amend the instr decode/encode
functions to both tag the governing predicate and the instr itself.

issue: #6499
@joshua-warburton
Copy link
Collaborator Author

joshua-warburton commented Jan 9, 2024

With the above merged patch, this issue is no longer occurring in our tests

jackgallagher-arm added a commit that referenced this issue Jan 18, 2024
When debugging i#6499 we noticed that drcachesim was producing 0 byte
read/write records for some SVE load/store instructions:

```
 ifetch       4 byte(s) @ 0x0000000000405b3c a54a4681   ld1w   (%x20,%x10,lsl #2) %p1/z -> %z1.s
 read         0 byte(s) @ 0x0000000000954e80 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e84 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e88 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e8c by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e90 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e94 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e98 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e9c by PC 0x0000000000405b3c
 ifetch       4 byte(s) @ 0x0000000000405b4
```

This turned out to be due to drdecode being linked into drcachesim
twice: once into the drcachesim executable, once into libdynamorio.
drdecode uses a global variable to store the SVE vector length to use
when decoding so we end up with two copies of that variable and only one
was being initialized.
To fix this properly we would need to refactor the libraries so that
there is only one copy of the sve_veclen global variable, or change the
way that the decoder gets the vector length so its no longer stored in a
global variable. In the mean time we have a workaround which makes sure
both copies of the variable get initialized and drcachesim produces
correct results.

With that workaround in place however, the results were still wrong. For
expanded scatter/gather instructions when you are using an offline
trace, raw2trace doesn't have access to the load/store instructions from
the expansion, only the original app scatter/gather instruction. It has
to create the read/write records using only information from the
original scatter/gather instruction and it uses the size of the memory
operand to determine the size of each read/write. This works for x86
because the x86 IR uses the per-element data size as for the memory
operand of scatter/gather instructions. This doesn't work for AArch64
because the AArch64 codec uses the maximum data transferred (per-element
data size * number of elements) like other SIMD load/store instructions.

We plan to make the AArch64 IR consistent with x86 by changing it to use
the same convention as x86 for scatter/gather instructions but in the
mean time we can work around the inconsistency by fixing the size in
raw2trace based on the instruction's opcode.

Issues: #6499, #5365, #5036
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants