Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

i#2390: Replace add+ldar with ldr+dmb in AArch64 HT lookup #4375

Closed
wants to merge 907 commits into from
Closed

i#2390: Replace add+ldar with ldr+dmb in AArch64 HT lookup #4375

wants to merge 907 commits into from

Conversation

merckhung
Copy link

Replace the add+ldar with a ldr+dmb pair in HT lookup and results in
-3% to -28% (reduction) in overhead (instrument vs. native) ratios
of SPECInt 2006 and 2017. Although, found a regression in 657.xz_s
model (+7%).

An add+ldar pair was used to prevent memory-access instructions from
being reordered to ensure a hash mask is always loaded before a
hash_table is loaded. The same ordering restriction is also imposed
in the corresponding update_lookuptable_tls() routine.

With the ldr+dmb pair replacement, an add instruction is eliminated
from the critical path and contributed majorly to overhead reductions.

In addition, a frequently taken branch of the inner-loop epilogue is
converged into a more condensed (4 instr.) and smaller inner-loop,
with the cost of adding 1 more sub instruction in both 2 exit paths.

Since exit paths are not as hot as HT lookup, the impact of adding
a sub instruction in both is quite trivial (not seen).

In order to condense the inner-loop down to 4 instructions, a
pre-index ldr is replaced with a post-index version in the prologue.

On the hit-exit path, register use is carefully swapped and results in a
move instruction elimination. The number of instructions on the path
remains unchanged since a sub instruction is added to its entry.

Issue: #2390
Tests: SPECInt 2006 and 2017 ran on ARM Juno r2 (w/ 8GB RAM, Debian)

derekbruening and others added 30 commits November 10, 2019 13:01
Adds configuration to disable blank Github issues, so users will
always use our templates (and in particular stop using the issue
tracker to just ask questions).

Also adds a contact link pointing at the users list.
Suggests first starting with the users list in the feature template.
Adds users list link requests to both templates.
… register state. (#3928)

Adds a test to check for the correct state of the clobbered mask register if an
asynchronous event hits the sequence within the small window that clobbers it.
Note that the restore event is not implemented yet, and the test is currently
correctly reporting an error.

A future patch that will add support for restoring the mask register will also
include an update to this test.

Minor refactoring of the test to use drmgr_register_bb_instrumentation_ex_event()
instead of drmgr_register_bb_instrumentation_event().

Adds some missing __AVX512F__ and __AVX__ ifdefs.

Issue: #2985
…ter state. (#3932)

Adds a test to check for the correct state of the gather instruction's mask register if an
asynchronous event hits the sequence after a completed load, but before the mask register
update. Note that the restore event is not implemented yet, and the test is currently
expected to report an error.

Re-factors the tests and puts the functions test_avx512_restore_mask_fault,
test_avx512_restore_mask_clobber, and test_avx512_restore_mask_update
in one common macro.

A future patch that will add support for restoring the mask register will also
include an update to this test.

Issue: #2985
Updates the NULL defines used in DR to be pointer-sized to avoid
potential issues such as in PR #3920.
Avoids C++ compiler warnings by defining NULL as nullptr for __cplusplus.

Fixes a format string error that showed up along with the NULL cast
errors for C++ (prior to defining as nullptr).

Fixes #3930
…ter test variants. (#3933)

Adds all scatter variants for previously added gather expansion tests. More currently
expected error cases have been added to the test's template, and will be removed once
support for the restore event in drx has been implemented.

Does some minor refactoring that was missed in #3932.

Issue: #2985
Adds printing of timestamp, core, and kernel transfer markers to the
drcachesim trace view tool.  The markers embedded in the disassembly
make it much easier to see signal handlers and thread transitions.

Adds a sanity check to the view test and updates the docs.
Updates the prefix order to properly decode crc32w.

Fixes the destination size codes for OP_crc32: the destination can
never shrink to 16 bits.  Resolves an old comment expressing confusion
about this where the Intel manual had a separate entry for a data
prefix, resulting in different codes than we need in our table.

Enables all the crc32w test cases in the binutils decenc tests.
Updates the expected test patterns.

Fixes #2431
…x. (#3941)

We need to close it in cases of successful/failed fd_priv_dup, so just
promote the existing close to before the success/failure check.

Fixes #3940
… test variants. (#3942)

Adds an test for updating the destination mask to client.drx-scattergather for AVX2. As
with previous tests of this kind, the error is currently expected until a drx restore event
has been implemented.

Does some renaming into AVX2/AVX-512 function names.

Initializes the pc variables with INT_MAX in order to add robustness to the test.

Issue: #2985
Removes the problematic Linux approach for handling an invalid
application instruction in a basic block, where we would isolate it to
its own bb and then copy 17 bytes for native execution, assuming the
processor would raise a fault.  This is very fragile and caused
real-world problems with the crc32w decoder bug (#2431).  Those 17
bytes often end mid-instruction, causing the exit cti to not exist and
DR to execute junk instructions, leading to all kinds of bad behavior.

Now, Linux does what Windows does: we immediately forge an illegal
instruction fault (SIGILL for Linux).  While it might be nicer to try
to continue execution for instructions whose length we can guess at,
there are just too many risks here as described above.  It's better to
have a clear fault that someone can identify and debug easily to
address decoder issues, instead of mysterious behavior and strange
crashes that take a lot of effort to track back to the decoder.

Tested on crc32w with the #2431 removed locally for a
valid-but-we-think-it's-not test.  The common.decode-bad test has
several cases of legitimately-invalid instructions.

Adds error handling of a failed cache decode in
recreate_app_state_from_info() so we're more robust there.  Tested
with a local revert of the above removal of invalid_instr_hack.

Fixes #3212
Replaces {E,G,M}d_q with {E,G,M}y.  The 'y' code matches what Intel
uses for OPSZ_4_rex8; matches our existing {B,E,G,R}y codes; and
conforms to the convention that an underscore between two size letters
indicates a partial size out of a larger whole.

Renames the F{x,y,z} codes used for particular floating-point sizes to
Ff{x,y,z}, to avoid confusion with Intel's 'x', 'y', and 'z' size
codes.
Fixes a bug where drcachesim's raw2trace places delayed branches
*after* signal markers, leading to confusing trace orderings.

Improves the checking for markers in trace_invariants, but
unfortunately it still will not catch this case as it does not know
the targets of branches nor the addresses of signal handlers.

Adds log_instruction() to trace_converter_t and uses it to disassemble
memtrace instructions on *every* instance, instead of just the first
time encountered.  This is essential for debugging these types of
issues.

Issue: #3937
…es. (#3952)

The bug led to an invalid encoding error on machines with no support for AVX512BW,
but AVX512F.

Tested manually by setting proc_avx512_enabled to true on a machine w/o AVX-512.

Issue: #1312
…IX and 32-bit. (#3954)

As pointed out in cf1ec32, AVX-512 context management by DynamoRIO is not
fully supported or is untested on non-UNIX as well as on 32-bit builds. Yet processor
detection and certain AVX-512 functionality was enabled on those builds when running on
machines with AVX-512 support enabled by the OS. This was causing problems (xref #3949).
For now, this patch completely disables all AVX-512 features on unsupported and untested
builds.

Fixes #3949
Issue: #1312
…3957)

In dadb8ff, we disabled support for AVX-512 for all 32-bit UNIX builds completely.
Yet those builds are only missing support for AVX-512 state in signal frames. Various
tests already exist for existing context switching in 32-bit. In order to keep these tests
working, we are re-enabling partial AVX-512 support for UNIX 32-bit. A warning is issued
if AVX-512 code is encountered.

Issue: #1312
…mulation code. (#3955)

Adds a drx restore event that will eventually handle fixing AVX-512 gather, AVX-512
scatter and AVX2 gather emulation sequence application state. This patch handles
the AVX-512 gather sequence.

The patch adds drx support to detect and restore the scratch mask as well as the
destination mask state in the AVX-512 gather emulation sequence. This is necessary if
a translation event hits the AVX-512 gather emulation sequence in certain parts of the
emulation code. Detection is done by a state machine that attempts to match a known
pattern of the emulation sequence.

AVX-512 scatter and AVX2 gather is not yet supported.

Removes the relevant errors from client.drx-scattergather.

Issue: #2985
…ate machines. (#3961)

Converts calling the state machine into a function pointer call in order to be able to
more easily share code when adding an AVX-512 scatter and an AVX2 gather state machine.

The patch also fixes 2 minor assertion bugs and another minor state machine bug where we
did not skip unknown instructions.

Issue: #2985
Adds the module offset of the interruption point PC to the kernel
event marker to record when a signal happened.  Refactors
raw2trace to check for signals after every insruction and not
just every memref, and to look at the offset for correct
placement.

Adds a new test application signal_invariants which uses assembly
for precise control over tests around signal marker placement.
Changes the trace_invariants tests to use this new application.
The test uses prefetch instructions as annotations to communicate
with the trace_invariants reader.  Uses the new test to add
several additional invariant checks to trace_invariants: signal
placement mid-bb, placement mid-memref, handler resumption,
handler immediacy.

Updates the docs to show an example of a signal handler
interruption in the view tool.

Documents the multi-memref fault issue (#3958) and the online
trace missing instruction issue (#3739).

Relaxes the invariant that a branch target must follow the branch, when
a signal arrives in between (updates the docs for this too).

Passes -verbose to the IPC reader for online trace debugging.

Solving signal placement in the middle of online trace
instruction bundles is left for future work.  Comments in the
code and docs make this clear.

Issue: #3937
Label callbacks are called every time an instruction is destroyed iff the instruction is
a label instruction and the field is valid. When cloning an instruction, we can't blindly
copy the label callback field causing it to be called when the cloned instruction is
destroyed. We are adding a note to the instr_clone() docs and prevent the field from
being copied.

Fixes #3926
Issue: #3962
… AVX2 gather. (#3964)

Adds support for restoring the destination mask in the AVX2 gather emulation sequence. As
before, detection is done by a state machine that attempts to match a known pattern of the
emulation sequence.

Removes the relevant error from client.drx-scattergather and fixes a bug in the test for AVX2
gather in 32-bit mode.

Fixes a bug in the AVX-512 state machine if index is OPSZ_8. Only OPSZ_4 is currently
tested in drx.scattergather.

Issue: #2985
…orted value types in droption (#3965)

Adds default implementations of convert_from_string and default_as_string to avoid crashes when using unsupported option types.

Adds droption_t implementations for long, long long, unsigned long, and unsigned long long.

Lists supported types in the droption documentation.

Uses strtol and strtoll for error detection in the pre-existing code and the new code.

Fixes #3959
… AVX-512 scatter. (#3970)

The patch adds drx support to detect and restore the scratch mask as well as the
destination mask state in the AVX-512 scatter emulation sequence.

Adds two more states to the AVX2 gather and the AVX-512 gather state machine making
them more robust detecting the emulation sequence.

Fixes a bug where we read register values from mcontext instead of raw_mcontext
when restoring it, which matters for example if the register is the stack pointer.

Fixes some bugs in the illustrating comments.

Removes the relevant errors from client.drx-scattergather.

Issue: #2985
For 32-bit, when the private loader calls functions in potentially
third-party libraries, it needs to align the stack to 16 to account
for the mismatch in DR's 4-byte alignment (i#847) and the new gcc ABI.

Tested on libdrmemtrace which with gcc 8.3.0 has SIMD code which
crashes without alignment.

This is a short-term fix.  More issues remain with clean calls and we
should perhaps consider changing our base alignment.

Issue: #847, #3966
Documents that dr_mutex_unlock must only be used on mutexes locked by the current thread.
Fixes a bug in decode_cti() with a (e)vex prefix with its 2nd or 3rd
byte also looking like the first byte of a (e)vex prefix where the
instruction is considered invalid, crashing the application.

Adds testing of decode_cti() to api.ir by export decode_cti() in DEBUG
and BUILD_TESTS builds.  Adds a test case that triggers the observed
bug without the fix.

Fixes #3978
Splits the largest of our IR testing headers into two pieces to avoid
a compiler error with VS2013 "out of heap space".

Fixes #3992
Updates a reference to a split header in the cmake code which was missed by PR #3993.

Issue: #3992
dkim-virsec and others added 7 commits July 7, 2020 11:15
Fixes the thread handle leak when thread_handle_to_pid() in win32/syscall.c is called.

Fixes: #4352
When raw files are gzipped and named xxx.raw.gz, the corresponding
final trace files end up named xxx.raw.trace.gz, i.e., with an extra
"raw." in there.  We fix that here.

Issue: #4318
Fixes an out of memory error due to allocation of more than 2GB of
unreachable heap by fixing a bug where beyond-vmm allocations
required reachability constraints even for unreachable heap.

Adds a simple test of many large allocs to the client.alloc test.

Co-authored-by: Derek Bruening <bruening@google.com>

Fixes #4335
For the view tool, the original wrong-arch check was placed *after*
the entry skipping for -skip_refs, which can result in never checking
the architecture.  We fix that here by moving the check up front.

Issue: #4318
Removes the stale workaround put in place that disabled zlib
altogether for cross-compiling before we had CMAKE_FIND_ROOT_PATH set
properly in all of our toolchain files.  Now that we're seeing the
find-root properly we need to remove the workaround to allow finding a
target zlib when cross-compiling.

Tested manually on an aarch64 cross-compilation.

The core of i#2285 was fixed earlier by PR #2551; this finishes off
the issue.

Fixes #2285
…mbols (#4371)

Fix compilation on old Linux which does not have EM_ARM/EM_AARCH64 symbols.
Replace the add+ldar with a ldr+dmb pair in HT lookup and results in
-3% to -28% (reduction) in overhead (instrument vs. native) ratios
of SPECInt 2006 and 2017. Although, found a regression in 657.xz_s
model (+7%).

An add+ldar pair was used to prevent memory-access instructions from
being reordered to ensure a hash mask is always loaded before a
hash_table is loaded. The same ordering restriction is also imposed
in the corresponding update_lookuptable_tls() routine.

With the ldr+dmb pair replacement, an add instruction is eliminated
from the critical path and contributed majorly to overhead reductions.

In addition, a frequently taken branch of the inner-loop epilogue is
converged into a more condensed (4 instr.) and smaller inner-loop,
with the cost of adding 1 more sub instruction in both 2 exit paths.

Since exit paths are not as hot as HT lookup, the impact of adding
a sub instruction in both is quite trivial (not seen).

In order to condense the inner-loop down to 4 instructions, a
pre-index ldr is replaced with a post-index version in the prologue.

On the hit-exit path, register use is carefully swapped and results in a
move instruction elimination. The number of instructions on the path
remains unchanged since a sub instruction is added to its entry.

Issue: #2390
Tests: SPECInt 2006 and 2017 ran on ARM Juno r2 (w/ 8GB RAM, Debian)
@merckhung merckhung marked this pull request as ready for review July 18, 2020 19:02
Copy link
Contributor

@derekbruening derekbruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a quick look. Probably best for an AArch64 expert and/or those who wrote the original code to take a deeper look.

APP(&ilist,
INSTR_CREATE_ldar(dc, opnd_create_reg(DR_REG_X1),
OPND_CREATE_MEMPTR(DR_REG_X1, 0)));
instr_create_0dst_1src(dc, OP_dmb, OPND_CREATE_INT(/* DR_DMB_ISHLD */ 9)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd prefer to use INSTR_CREATE_macros: ARM and x86 are complete but it seems AArch64 is missing some. Could you add the ones you need? IIRC they are manually done instead of automated like the ARM ones are.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I have created INSTR_CREATE_dmb macro for AArch64 in the 2nd commit.

APP(&ilist,
INSTR_CREATE_ldar(dc, opnd_create_reg(DR_REG_X1),
OPND_CREATE_MEMPTR(DR_REG_X1, 0)));
instr_create_0dst_1src(dc, OP_dmb, OPND_CREATE_INT(/* DR_DMB_ISHLD */ 9)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use DR_DMB_ISHLD directly?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. I have made a copy in AArch64's instr_create.h instead of including ARM's header, to avoid macro definition conflicts across architectures.

dc, opnd_create_reg(DR_REG_X0),
OPND_CREATE_MEMPTR(DR_REG_X1, offsetof(fragment_entry_t, tag_fragment))));
/* ldr x0, [x1], #tag_fragment_offset */
instr_t *instr = instr_create_2dst_3src(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd again prefer an INSTR_CREATE_.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. INSTR_CREATE_ldr_imm has been created to serve this purpose.

APP(&ilist, instr_create_save_to_tls(dc, DR_REG_X0, TLS_REG3_SLOT));
/* Load hash mask before loading hash_table, with DMB to prevent memory-access
* instructions from being reordered. A corresponding store-release is in
* update_lookuptable_tls() to ensure a new hast_table will be written after a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: hash_table

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected. Thank you!

@derekbruening
Copy link
Contributor

I took a quick look. Probably best for an AArch64 expert and/or those who wrote the original code to take a deeper look.

Requesting @egrimley to take a look (or delegate)

@AssadHashmi
Copy link
Contributor

run arm tests

1 similar comment
@AssadHashmi
Copy link
Contributor

run arm tests

Merck Hung and others added 8 commits July 20, 2020 14:49
Add INSTR_CREATE_ldr_imm definition to create an LDR immediate instruction.
There is an existing INSTR_CREATE_ldr that emittes LDR register version, it
takes 2dst/1src operands. But the AArch64 encoder requires a 2dst/3src
operands to encode an LDR immediate version. Hence, a new and separated
INSTR_CREATE_ldr_imm definition is created.

The LDR immediate instruction also has post-index and pre-index variants.
The original code has been fixed to pre-index encoding. To allow selecting,
one more parameter (bool post_index) appends to opnd_create_far_base_disp_ex().
Changes related to opnd_create_far_base_disp_ex() are updated across arches.

INSTR_CREATE_dmb has been defined in ARM but in order to avoid mixing macros
across 2 architectures. A copy of INSTR_CREATE_dmb and its enums is added
to AArch64's instr_create.h.

AArch64's indirect branch lookup has been updated to use new defintions.

Issue: #2390
Test: test suites, 400.perlbench
Add INSTR_CREATE_ldr_imm definition to create an LDR immediate instruction.
There is an existing INSTR_CREATE_ldr that emittes LDR register version, it
takes 2dst/1src operands. But the AArch64 encoder requires a 2dst/3src
operands to encode an LDR immediate version. Hence, a new and separated
INSTR_CREATE_ldr_imm definition is created.

The LDR immediate instruction also has post-index and pre-index variants.
The original code has been fixed to pre-index encoding. To allow selecting,
one more parameter (bool post_index) appends to opnd_create_far_base_disp_ex().
Changes related to opnd_create_far_base_disp_ex() are updated across arches.

INSTR_CREATE_dmb has been defined in ARM but in order to avoid mixing macros
across 2 architectures. A copy of INSTR_CREATE_dmb and its enums is added
to AArch64's instr_create.h.

AArch64's indirect branch lookup has been updated to use new defintions.

Issue: #2390
Test: test suites, 400.perlbench
…tion

Notice the build failures in x86_64, update ext/drbbdup/drbbdup.c to using
the new definitions of opnd_create_far_base_disp_ex()
…h instrs (#4341)

Adds new architecture agnostic trace types for prefetch instructions. These instructions have multiple variants based on:
- prefetch target: this may be the L1, L2 or L3 cache
- prefetch policy: either temporal or non-temporal
- operation type: prefetch may be for load, instruction or store

Uses these trace types for variants of AArch64 prfm and prfum prefetch instructions. Also marks some of the new trace types as aliases for existing x86 specific ones as they have same meaning. Other existing architecture specific trace types have different meaning and cannot be mapped to the new trace types.

Also makes changes to account for these new prefetch trace types in drcachesim.

Adds a custom analyzer test to verify prefetch operation counts for each variant.

Issue: #4328
@derekbruening
Copy link
Contributor

@AssadHashmi -- trying to get a review from an ARM expert

Copy link
Contributor

@AssadHashmi AssadHashmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this @merckhung.

Based on my (limited) experience the change looks functionally correct.

I can't comment on performance improvements until we've run more tests with this change.
@merckhung 's SPECInt figures look promising.

The reduction of instructions in the inner loops is good but I'm surprised replacing add+ldar with ldr+dmb improves performance as dmb affects all memory operations whereas ldar affects only those touching the same cache line, AIUI. The SPECInt figures tell a different story, so in this context that must not be the case. In any case, we'll keep an eye on performance after this change is merged.

*/
#define INSTR_CREATE_dmb(dc, imm) instr_create_0dst_1src((dc), OP_dmb, (imm))

/** Immediate values for INSTR_CREATE_dmb(). */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like an exact copy of the enum in arm/. To avoid duplicating code, please instead move it to a shared header. Maybe instr_create_shared.h, or maybe a new ir/aarchxx/instr_create_aarchxx.h, or possibly instr.h or opnd.h.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Wondering what happens to the duplicated enum values in the Doxygen generated docs? Is there a warning from Doxygen?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I will change the code accordingly. It will take a while on my end, but I will submit a version by the end of Wed. I didn't look at the generated Doxygen docs, let me check it out. I don't remember it warns, perhaps I didn't notice one back then.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have had a look of those aforementioned files. Thank you!
instr.h: defines architectural values WRT instruction encoding/decoding.
opnd.h: defines architectural values WRT operand and register encoding/decoding.
instr_create_shared.h: seems like defining architecturally independent macros and definitions.

For the DMB macro and its IMM definitions, I feel creating a new ir/aarchxx/instr_create_aarchxx.h is a way to go.

Because DMB's IMM enum doesn't share the exactly same semantics with the other arch (e.g. x86), so the instr_create_shared.h is ruled out;

instr.h and opnd.h are more related to architectural definitions, rather than instruction-level definitions. So I think I will go with creating a new ir/aarchxx/instr_create_aarchxx.h and move both ARM and AArch64's DMB macro and IMM enum to the new file. If no objection, then I will go with this path forward. Thank you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ir/aarchxx/instr_create_aarchxx.h seems good to me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created instr_create_aarchxx.h in the latest commit, but I am struggling with making it appear in the Doxygen htmls.
Do you know how to make instr_create_aarchxx.h show up in the doxygen updates? Thank you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be added to the file list in genapi.pl.

/**
* Creates a DMB instruction.
* \param dc The void * dcontext used to allocate memory for the instr_t.
* \param imm The integer constant opnd_t operand.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Point at one of the DR_DMB_OSHLD values, using # to make a doxygen link, so the user knows there are named constants.

Copy link
Author

@merckhung merckhung Aug 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done the change accordingly in the latest commit.
Since the instr_create_aarchxx.h doxygen is not generated in the latest commit, will defer the verification.

core/ir/opnd.h Outdated
@@ -1702,7 +1702,7 @@ DR_API
opnd_t
opnd_create_far_base_disp_ex(reg_id_t seg, reg_id_t base_reg, reg_id_t index_reg,
int scale, int disp, opnd_size_t size, bool encode_zero_disp,
bool force_full_disp, bool disp_short_addr);
bool force_full_disp, bool disp_short_addr, bool post_index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking interface change. This needs to be accompanied by documentation changes announcing the break. Alternatively, instead of changing this function, a separate opnd_set_zero_offset_pre_index() could be added, which would be better for compatibility.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess opnd_set_zero_offset_post_index() since the default is pre.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done the change accordingly in the latest commit.

@@ -539,6 +539,12 @@
#define INSTR_CREATE_ldp(dc, rt1, rt2, mem) \
instr_create_2dst_1src(dc, OP_ldp, rt1, rt2, mem)
#define INSTR_CREATE_ldr(dc, Rd, mem) instr_create_1dst_1src((dc), OP_ldr, (Rd), (mem))
#define INSTR_CREATE_ldr_imm(dc, Rt, Rn, imm, post) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why this needs the "post": that is just an encoding detail that doesn't affect semantics, and it only applies to a zero offset, right?

Copy link
Contributor

@derekbruening derekbruening Jul 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment is based on only passing this to opnd_create_far_base_disp_ex(). See the comment for the pre_index field in opnd.h: it only affects a zero offset, since any other will just be a different offset value in the memop.

For creating semantically-different variants of the instruction itself: not sure what is currently done for other A64 instrs, but for A32 we have different variants of the macro itself for different write-back options, rather than a bool flag.

Copy link
Author

@merckhung merckhung Jul 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should've read the explanation of the "pre_index" bitfield. Thank you.
I did gdb tracing and found pre_index is hard-coded to true, so I immediately figured the way to workaround it is to post-process it (manually changing to false, after instr_t is constructed).

Today, I've tried passing a negative offset to opnd_create_far_base_disp_ex() and I found it doesn't encode correctly.
Report: Internal Error: DynamoRIO debug check failure: /home/merck/Code/dynamorio_orig/core/arch/emit_utils_shared.c:1061 nxt_pc != NULL

** I will look into it tomorrow to see what actually happens, but I guess the AArch64 encoder doesn't catch it.

I learned that when a pre_index LDR (immediate) is encoded the instr_t will be handled by encode_opndsgen_38400c00() in encode_gen.h (line: 8085). The AArch64 encoder tries each possible encoding function of LDR until one of those catches it.

LDR (immediate) encoding differs pre- or post-index by looking at bit 11 and 12 (ARMv8 manual p.1000).
When it comes to encoding a post-index LDR, the encoder will first call encode_opndsgen_38400400() function (bit11 = 0) (before calling the pre-index function), and the instr_t is supposed to be caught by it.
In the first commit of this pull request, I found toggling the pre_index bitfield does affect how AArch64 encodes LDR (immediate). But passing a negative offset to opnd_create_far_base_disp_ex() doesn't seem will be encoded correctly in AArch64.

Let me have a deeper look tmr, and I will come back with a better change for reviews. Thank you!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, thank you for laying out the design principles.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today, I've tried passing a negative offset to opnd_create_far_base_disp_ex() and I found it doesn't encode correctly.
Report: Internal Error: DynamoRIO debug check failure: /home/merck/Code/dynamorio_orig/core/arch/emit_utils_shared.c:1061 nxt_pc != NULL

I remember something off in the A64 branch linking code: #4273, but that's a different assert. @AssadHashmi might know more?

Copy link
Author

@merckhung merckhung Jul 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done some investigations about the post-LDR encoding problem.

Summary:
0) I think the explanation of pre_index bit-field in the opnd.h is described in terms of A64 decoding.
For recording a decoded LDR is pre- or post-index, it uses the pre_index bit-field to distinguish it.
The value of the pre_index will be written by A64 decoder according to the input app's code sequence,
and the JIT is supposed to emit a correct LDR version by looking at the value of the pre_index field.

It's not meant to say a post-index LDR cannot have an offset other than 0, but both pre- and post-index
LDR could both have a zero offset encoded. So the pre_index field is needed for distinguishing them in DR at all time, not only when offset is 0.

Therefore, for manually encoding an LDR in the DR program itself, the pre_index field is hard-coded to true, if a developer calls OPND_CREATE_MEMPTR()->opnd_create_far_base_disp_ex() in DR. The A64 encoder will never have a chance to emit a correct post-LDR instr. Because the A64 encoder has been determining pre- or post-LDR by looking at the pre_index field in JIT decoding/encoding process.

Tl;Dr

  1. Only LDR/STR family uses encode_opnd_mem9/encode_opnd_mem9post/encode_opnd_mem9_bytes functions to encode the signed offset imm9 field (bit 20:12). If a change happens in those will only affect LDR/STR family.

  2. In the encode_opnd_mem9post() and the following encode_opnd_mem9_bytes() function, if post==true, then the disp is requested to be 0, otherwise it will fail the encoding (codec.c, line: 662). This is where it fails.
    if (post ? (disp != 0) : (disp < -256 || disp > 255))
    return false;

  3. Based on point instrument_post_syscall() should be called after post_system_call() processes the real result #1, regardless of the indexing direction, per A64 SPEC. both pre- and post-index LDR/STR support encoding a signed 0 imm9 offset (AARM, p. 1000). But in the point update makefiles to build with local tools instead of /build/toolchain #2, current impl. limits an input of post==true to have a zero disp and allows post==false to have a disp of range from -256 to +255. This impl. is contradicting with the A64 SPEC.

  4. Because the imm9 field's type is a signed integer. If passing an negative offset to opnd_create_far_base_disp_ex() function for the purpose of determining the pre- or post-index attribute of an LDR/STR encoding will result in prohibiting negative offsets to be correctly encoded in the further pre-index LDR/STR.
    Therefore, will not go this way, but will think about setting FLAG_NEGATIVE and converting value to positive

  5. The later AArch64 instruction encoding process doesn't determine a pre- or post-index LDR by directly looking at the pre_index bit-field. Instead, the A64 encoder tries each possible LDR encoding functions until one of those reports successful. A bit inefficient in A64 encoding process. Furthermore, if a preceding encoding function returns TRUE, it may cascadingly block a later more correct one. This actually complicates the constructions of opnd_t variables, because the relationship among those opnd_t variables determines the selection of the later A64 encoding functions. Rather than simply determining by looking at the values of individually opnd_t vars at first.

Conclusion:
6) I will study more about ARM's implementation (NEGATIVE FLAG, positive conversion), before I implement a change for A64. Will come back with a more considerate commit that is more aligned with current ARM port.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted the changes of opnd_create_far_base_disp_ex(). Refined INSTR_CREATE_ldr_imm() macros to align with DR design (moved away from AArch64 mnemonics style to DR style). If the developers want a post-indexed LDR, they are supposed to call opnd_set_zero_offset_pre_index() on the just generated memop to set the pre_index field from true to false (indicating post-indexed), see an example in core/arch/aarch64/emit_utils.c.

Per discussions, revert changes to opnd_create_far_base_disp_ex().
A new opnd_set_zero_offset_post_index() function is created to set
the pre_index field of a memop operand to false (post-indexed).

INSTR_CREATE_ldr_imm macro is refined to take Rd, memop, and imm
operands (DR way), instead of Rd, Rt, Imm (assembly mnemonics).
The opnd_set_zero_offset_post_index() on memop allows selecting a
post-indexed encoding if needed (pre-indexed by default).

core/ir/aarchxx/instr_create_aarchxx.h is created to share instr.
macros between AArch64 and ARM ports. INSTR_CREATE_dmb and its enums
is moved into the new file.

Issue: #2390
Test: 400.perlbench, suite/runsuite_wrapper.pl
@merckhung
Copy link
Author

Close this pull request since I messed up with my fork with a force-push. A clean and new pull request has been created at #4390 4390.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet