i#2390: Replace add+ldar with ldr+dmb in AArch64 HT lookup #4375

merckhung · 2020-07-18T18:29:13Z

Replace the add+ldar with a ldr+dmb pair in HT lookup and results in
-3% to -28% (reduction) in overhead (instrument vs. native) ratios
of SPECInt 2006 and 2017. Although, found a regression in 657.xz_s
model (+7%).

An add+ldar pair was used to prevent memory-access instructions from
being reordered to ensure a hash mask is always loaded before a
hash_table is loaded. The same ordering restriction is also imposed
in the corresponding update_lookuptable_tls() routine.

With the ldr+dmb pair replacement, an add instruction is eliminated
from the critical path and contributed majorly to overhead reductions.

In addition, a frequently taken branch of the inner-loop epilogue is
converged into a more condensed (4 instr.) and smaller inner-loop,
with the cost of adding 1 more sub instruction in both 2 exit paths.

Since exit paths are not as hot as HT lookup, the impact of adding
a sub instruction in both is quite trivial (not seen).

In order to condense the inner-loop down to 4 instructions, a
pre-index ldr is replaced with a post-index version in the prologue.

On the hit-exit path, register use is carefully swapped and results in a
move instruction elimination. The number of instructions on the path
remains unchanged since a sub instruction is added to its entry.

Issue: #2390
Tests: SPECInt 2006 and 2017 ran on ARM Juno r2 (w/ 8GB RAM, Debian)

Adds configuration to disable blank Github issues, so users will always use our templates (and in particular stop using the issue tracker to just ask questions). Also adds a contact link pointing at the users list.

Suggests first starting with the users list in the feature template.

Adds users list link requests to both templates.

… register state. (#3928) Adds a test to check for the correct state of the clobbered mask register if an asynchronous event hits the sequence within the small window that clobbers it. Note that the restore event is not implemented yet, and the test is currently correctly reporting an error. A future patch that will add support for restoring the mask register will also include an update to this test. Minor refactoring of the test to use drmgr_register_bb_instrumentation_ex_event() instead of drmgr_register_bb_instrumentation_event(). Adds some missing __AVX512F__ and __AVX__ ifdefs. Issue: #2985

Adds a missing "a".

…ter state. (#3932) Adds a test to check for the correct state of the gather instruction's mask register if an asynchronous event hits the sequence after a completed load, but before the mask register update. Note that the restore event is not implemented yet, and the test is currently expected to report an error. Re-factors the tests and puts the functions test_avx512_restore_mask_fault, test_avx512_restore_mask_clobber, and test_avx512_restore_mask_update in one common macro. A future patch that will add support for restoring the mask register will also include an update to this test. Issue: #2985

Updates the NULL defines used in DR to be pointer-sized to avoid potential issues such as in PR #3920. Avoids C++ compiler warnings by defining NULL as nullptr for __cplusplus. Fixes a format string error that showed up along with the NULL cast errors for C++ (prior to defining as nullptr). Fixes #3930

…ter test variants. (#3933) Adds all scatter variants for previously added gather expansion tests. More currently expected error cases have been added to the test's template, and will be removed once support for the restore event in drx has been implemented. Does some minor refactoring that was missed in #3932. Issue: #2985

Adds printing of timestamp, core, and kernel transfer markers to the drcachesim trace view tool. The markers embedded in the disassembly make it much easier to see signal handlers and thread transitions. Adds a sanity check to the view test and updates the docs.

Updates the prefix order to properly decode crc32w. Fixes the destination size codes for OP_crc32: the destination can never shrink to 16 bits. Resolves an old comment expressing confusion about this where the Intel manual had a separate entry for a data prefix, resulting in different codes than we need in our table. Enables all the crc32w test cases in the binutils decenc tests. Updates the expected test patterns. Fixes #2431

…x. (#3941) We need to close it in cases of successful/failed fd_priv_dup, so just promote the existing close to before the success/failure check. Fixes #3940

… test variants. (#3942) Adds an test for updating the destination mask to client.drx-scattergather for AVX2. As with previous tests of this kind, the error is currently expected until a drx restore event has been implemented. Does some renaming into AVX2/AVX-512 function names. Initializes the pc variables with INT_MAX in order to add robustness to the test. Issue: #2985

Removes the problematic Linux approach for handling an invalid application instruction in a basic block, where we would isolate it to its own bb and then copy 17 bytes for native execution, assuming the processor would raise a fault. This is very fragile and caused real-world problems with the crc32w decoder bug (#2431). Those 17 bytes often end mid-instruction, causing the exit cti to not exist and DR to execute junk instructions, leading to all kinds of bad behavior. Now, Linux does what Windows does: we immediately forge an illegal instruction fault (SIGILL for Linux). While it might be nicer to try to continue execution for instructions whose length we can guess at, there are just too many risks here as described above. It's better to have a clear fault that someone can identify and debug easily to address decoder issues, instead of mysterious behavior and strange crashes that take a lot of effort to track back to the decoder. Tested on crc32w with the #2431 removed locally for a valid-but-we-think-it's-not test. The common.decode-bad test has several cases of legitimately-invalid instructions. Adds error handling of a failed cache decode in recreate_app_state_from_info() so we're more robust there. Tested with a local revert of the above removal of invalid_instr_hack. Fixes #3212

Replaces {E,G,M}d_q with {E,G,M}y. The 'y' code matches what Intel uses for OPSZ_4_rex8; matches our existing {B,E,G,R}y codes; and conforms to the convention that an underscore between two size letters indicates a partial size out of a larger whole. Renames the F{x,y,z} codes used for particular floating-point sizes to Ff{x,y,z}, to avoid confusion with Intel's 'x', 'y', and 'z' size codes.

Fixes a bug where drcachesim's raw2trace places delayed branches *after* signal markers, leading to confusing trace orderings. Improves the checking for markers in trace_invariants, but unfortunately it still will not catch this case as it does not know the targets of branches nor the addresses of signal handlers. Adds log_instruction() to trace_converter_t and uses it to disassemble memtrace instructions on *every* instance, instead of just the first time encountered. This is essential for debugging these types of issues. Issue: #3937

…es. (#3952) The bug led to an invalid encoding error on machines with no support for AVX512BW, but AVX512F. Tested manually by setting proc_avx512_enabled to true on a machine w/o AVX-512. Issue: #1312

…IX and 32-bit. (#3954) As pointed out in cf1ec32, AVX-512 context management by DynamoRIO is not fully supported or is untested on non-UNIX as well as on 32-bit builds. Yet processor detection and certain AVX-512 functionality was enabled on those builds when running on machines with AVX-512 support enabled by the OS. This was causing problems (xref #3949). For now, this patch completely disables all AVX-512 features on unsupported and untested builds. Fixes #3949 Issue: #1312

…3957) In dadb8ff, we disabled support for AVX-512 for all 32-bit UNIX builds completely. Yet those builds are only missing support for AVX-512 state in signal frames. Various tests already exist for existing context switching in 32-bit. In order to keep these tests working, we are re-enabling partial AVX-512 support for UNIX 32-bit. A warning is issued if AVX-512 code is encountered. Issue: #1312

…mulation code. (#3955) Adds a drx restore event that will eventually handle fixing AVX-512 gather, AVX-512 scatter and AVX2 gather emulation sequence application state. This patch handles the AVX-512 gather sequence. The patch adds drx support to detect and restore the scratch mask as well as the destination mask state in the AVX-512 gather emulation sequence. This is necessary if a translation event hits the AVX-512 gather emulation sequence in certain parts of the emulation code. Detection is done by a state machine that attempts to match a known pattern of the emulation sequence. AVX-512 scatter and AVX2 gather is not yet supported. Removes the relevant errors from client.drx-scattergather. Issue: #2985

…ate machines. (#3961) Converts calling the state machine into a function pointer call in order to be able to more easily share code when adding an AVX-512 scatter and an AVX2 gather state machine. The patch also fixes 2 minor assertion bugs and another minor state machine bug where we did not skip unknown instructions. Issue: #2985

Adds the module offset of the interruption point PC to the kernel event marker to record when a signal happened. Refactors raw2trace to check for signals after every insruction and not just every memref, and to look at the offset for correct placement. Adds a new test application signal_invariants which uses assembly for precise control over tests around signal marker placement. Changes the trace_invariants tests to use this new application. The test uses prefetch instructions as annotations to communicate with the trace_invariants reader. Uses the new test to add several additional invariant checks to trace_invariants: signal placement mid-bb, placement mid-memref, handler resumption, handler immediacy. Updates the docs to show an example of a signal handler interruption in the view tool. Documents the multi-memref fault issue (#3958) and the online trace missing instruction issue (#3739). Relaxes the invariant that a branch target must follow the branch, when a signal arrives in between (updates the docs for this too). Passes -verbose to the IPC reader for online trace debugging. Solving signal placement in the middle of online trace instruction bundles is left for future work. Comments in the code and docs make this clear. Issue: #3937

Label callbacks are called every time an instruction is destroyed iff the instruction is a label instruction and the field is valid. When cloning an instruction, we can't blindly copy the label callback field causing it to be called when the cloned instruction is destroyed. We are adding a note to the instr_clone() docs and prevent the field from being copied. Fixes #3926 Issue: #3962

… AVX2 gather. (#3964) Adds support for restoring the destination mask in the AVX2 gather emulation sequence. As before, detection is done by a state machine that attempts to match a known pattern of the emulation sequence. Removes the relevant error from client.drx-scattergather and fixes a bug in the test for AVX2 gather in 32-bit mode. Fixes a bug in the AVX-512 state machine if index is OPSZ_8. Only OPSZ_4 is currently tested in drx.scattergather. Issue: #2985

…orted value types in droption (#3965) Adds default implementations of convert_from_string and default_as_string to avoid crashes when using unsupported option types. Adds droption_t implementations for long, long long, unsigned long, and unsigned long long. Lists supported types in the droption documentation. Uses strtol and strtoll for error detection in the pre-existing code and the new code. Fixes #3959

… AVX-512 scatter. (#3970) The patch adds drx support to detect and restore the scratch mask as well as the destination mask state in the AVX-512 scatter emulation sequence. Adds two more states to the AVX2 gather and the AVX-512 gather state machine making them more robust detecting the emulation sequence. Fixes a bug where we read register values from mcontext instead of raw_mcontext when restoring it, which matters for example if the register is the stack pointer. Fixes some bugs in the illustrating comments. Removes the relevant errors from client.drx-scattergather. Issue: #2985

For 32-bit, when the private loader calls functions in potentially third-party libraries, it needs to align the stack to 16 to account for the mismatch in DR's 4-byte alignment (i#847) and the new gcc ABI. Tested on libdrmemtrace which with gcc 8.3.0 has SIMD code which crashes without alignment. This is a short-term fix. More issues remain with clean calls and we should perhaps consider changing our base alignment. Issue: #847, #3966

Documents that dr_mutex_unlock must only be used on mutexes locked by the current thread.

Fixes a bug in decode_cti() with a (e)vex prefix with its 2nd or 3rd byte also looking like the first byte of a (e)vex prefix where the instruction is considered invalid, crashing the application. Adds testing of decode_cti() to api.ir by export decode_cti() in DEBUG and BUILD_TESTS builds. Adds a test case that triggers the observed bug without the fix. Fixes #3978

Splits the largest of our IR testing headers into two pieces to avoid a compiler error with VS2013 "out of heap space". Fixes #3992

Updates a reference to a split header in the cmake code which was missed by PR #3993. Issue: #3992

Fixes the thread handle leak when thread_handle_to_pid() in win32/syscall.c is called. Fixes: #4352

When raw files are gzipped and named xxx.raw.gz, the corresponding final trace files end up named xxx.raw.trace.gz, i.e., with an extra "raw." in there. We fix that here. Issue: #4318

Fixes an out of memory error due to allocation of more than 2GB of unreachable heap by fixing a bug where beyond-vmm allocations required reachability constraints even for unreachable heap. Adds a simple test of many large allocs to the client.alloc test. Co-authored-by: Derek Bruening <bruening@google.com> Fixes #4335

For the view tool, the original wrong-arch check was placed *after* the entry skipping for -skip_refs, which can result in never checking the architecture. We fix that here by moving the check up front. Issue: #4318

Removes the stale workaround put in place that disabled zlib altogether for cross-compiling before we had CMAKE_FIND_ROOT_PATH set properly in all of our toolchain files. Now that we're seeing the find-root properly we need to remove the workaround to allow finding a target zlib when cross-compiling. Tested manually on an aarch64 cross-compilation. The core of i#2285 was fixed earlier by PR #2551; this finishes off the issue. Fixes #2285

…mbols (#4371) Fix compilation on old Linux which does not have EM_ARM/EM_AARCH64 symbols.

Replace the add+ldar with a ldr+dmb pair in HT lookup and results in -3% to -28% (reduction) in overhead (instrument vs. native) ratios of SPECInt 2006 and 2017. Although, found a regression in 657.xz_s model (+7%). An add+ldar pair was used to prevent memory-access instructions from being reordered to ensure a hash mask is always loaded before a hash_table is loaded. The same ordering restriction is also imposed in the corresponding update_lookuptable_tls() routine. With the ldr+dmb pair replacement, an add instruction is eliminated from the critical path and contributed majorly to overhead reductions. In addition, a frequently taken branch of the inner-loop epilogue is converged into a more condensed (4 instr.) and smaller inner-loop, with the cost of adding 1 more sub instruction in both 2 exit paths. Since exit paths are not as hot as HT lookup, the impact of adding a sub instruction in both is quite trivial (not seen). In order to condense the inner-loop down to 4 instructions, a pre-index ldr is replaced with a post-index version in the prologue. On the hit-exit path, register use is carefully swapped and results in a move instruction elimination. The number of instructions on the path remains unchanged since a sub instruction is added to its entry. Issue: #2390 Tests: SPECInt 2006 and 2017 ran on ARM Juno r2 (w/ 8GB RAM, Debian)

derekbruening

I took a quick look. Probably best for an AArch64 expert and/or those who wrote the original code to take a deeper look.

derekbruening · 2020-07-20T02:04:14Z

core/arch/aarch64/emit_utils.c

    APP(&ilist,
-        INSTR_CREATE_ldar(dc, opnd_create_reg(DR_REG_X1),
-                          OPND_CREATE_MEMPTR(DR_REG_X1, 0)));
+        instr_create_0dst_1src(dc, OP_dmb, OPND_CREATE_INT(/* DR_DMB_ISHLD */ 9)));


We'd prefer to use INSTR_CREATE_macros: ARM and x86 are complete but it seems AArch64 is missing some. Could you add the ones you need? IIRC they are manually done instead of automated like the ARM ones are.

Got it. I have created INSTR_CREATE_dmb macro for AArch64 in the 2nd commit.

derekbruening · 2020-07-20T02:04:42Z

core/arch/aarch64/emit_utils.c

    APP(&ilist,
-        INSTR_CREATE_ldar(dc, opnd_create_reg(DR_REG_X1),
-                          OPND_CREATE_MEMPTR(DR_REG_X1, 0)));
+        instr_create_0dst_1src(dc, OP_dmb, OPND_CREATE_INT(/* DR_DMB_ISHLD */ 9)));


Could we use DR_DMB_ISHLD directly?

Sure thing. I have made a copy in AArch64's instr_create.h instead of including ARM's header, to avoid macro definition conflicts across architectures.

derekbruening · 2020-07-20T02:06:45Z

core/arch/aarch64/emit_utils.c

-            dc, opnd_create_reg(DR_REG_X0),
-            OPND_CREATE_MEMPTR(DR_REG_X1, offsetof(fragment_entry_t, tag_fragment))));
+    /* ldr x0, [x1], #tag_fragment_offset */
+    instr_t *instr = instr_create_2dst_3src(


We'd again prefer an INSTR_CREATE_.

I see. INSTR_CREATE_ldr_imm has been created to serve this purpose.

derekbruening · 2020-07-20T02:07:22Z

core/arch/aarch64/emit_utils.c

+    APP(&ilist, instr_create_save_to_tls(dc, DR_REG_X0, TLS_REG3_SLOT));
+    /* Load hash mask before loading hash_table, with DMB to prevent memory-access
+     * instructions from being reordered. A corresponding store-release is in
+     * update_lookuptable_tls() to ensure a new hast_table will be written after a


typo: hash_table

Corrected. Thank you!

derekbruening · 2020-07-20T02:10:41Z

I took a quick look. Probably best for an AArch64 expert and/or those who wrote the original code to take a deeper look.

Requesting @egrimley to take a look (or delegate)

AssadHashmi · 2020-07-20T08:50:14Z

run arm tests

AssadHashmi · 2020-07-20T09:21:39Z

run arm tests

Add INSTR_CREATE_ldr_imm definition to create an LDR immediate instruction. There is an existing INSTR_CREATE_ldr that emittes LDR register version, it takes 2dst/1src operands. But the AArch64 encoder requires a 2dst/3src operands to encode an LDR immediate version. Hence, a new and separated INSTR_CREATE_ldr_imm definition is created. The LDR immediate instruction also has post-index and pre-index variants. The original code has been fixed to pre-index encoding. To allow selecting, one more parameter (bool post_index) appends to opnd_create_far_base_disp_ex(). Changes related to opnd_create_far_base_disp_ex() are updated across arches. INSTR_CREATE_dmb has been defined in ARM but in order to avoid mixing macros across 2 architectures. A copy of INSTR_CREATE_dmb and its enums is added to AArch64's instr_create.h. AArch64's indirect branch lookup has been updated to use new defintions. Issue: #2390 Test: test suites, 400.perlbench

…tion Notice the build failures in x86_64, update ext/drbbdup/drbbdup.c to using the new definitions of opnd_create_far_base_disp_ex()

…h instrs (#4341) Adds new architecture agnostic trace types for prefetch instructions. These instructions have multiple variants based on: - prefetch target: this may be the L1, L2 or L3 cache - prefetch policy: either temporal or non-temporal - operation type: prefetch may be for load, instruction or store Uses these trace types for variants of AArch64 prfm and prfum prefetch instructions. Also marks some of the new trace types as aliases for existing x86 specific ones as they have same meaning. Other existing architecture specific trace types have different meaning and cannot be mapped to the new trace types. Also makes changes to account for these new prefetch trace types in drcachesim. Adds a custom analyzer test to verify prefetch operation counts for each variant. Issue: #4328

derekbruening · 2020-07-24T04:31:50Z

@AssadHashmi -- trying to get a review from an ARM expert

AssadHashmi

Thanks for doing this @merckhung.

Based on my (limited) experience the change looks functionally correct.

I can't comment on performance improvements until we've run more tests with this change.
@merckhung 's SPECInt figures look promising.

The reduction of instructions in the inner loops is good but I'm surprised replacing add+ldar with ldr+dmb improves performance as dmb affects all memory operations whereas ldar affects only those touching the same cache line, AIUI. The SPECInt figures tell a different story, so in this context that must not be the case. In any case, we'll keep an eye on performance after this change is merged.

derekbruening · 2020-07-24T15:16:13Z

core/ir/aarch64/instr_create.h

+ */
+#define INSTR_CREATE_dmb(dc, imm) instr_create_0dst_1src((dc), OP_dmb, (imm))
+
+/** Immediate values for INSTR_CREATE_dmb(). */


This looks like an exact copy of the enum in arm/. To avoid duplicating code, please instead move it to a shared header. Maybe instr_create_shared.h, or maybe a new ir/aarchxx/instr_create_aarchxx.h, or possibly instr.h or opnd.h.

(Wondering what happens to the duplicated enum values in the Doxygen generated docs? Is there a warning from Doxygen?)

Got it. I will change the code accordingly. It will take a while on my end, but I will submit a version by the end of Wed. I didn't look at the generated Doxygen docs, let me check it out. I don't remember it warns, perhaps I didn't notice one back then.

I have had a look of those aforementioned files. Thank you!
instr.h: defines architectural values WRT instruction encoding/decoding.
opnd.h: defines architectural values WRT operand and register encoding/decoding.
instr_create_shared.h: seems like defining architecturally independent macros and definitions.

For the DMB macro and its IMM definitions, I feel creating a new ir/aarchxx/instr_create_aarchxx.h is a way to go.

Because DMB's IMM enum doesn't share the exactly same semantics with the other arch (e.g. x86), so the instr_create_shared.h is ruled out;

instr.h and opnd.h are more related to architectural definitions, rather than instruction-level definitions. So I think I will go with creating a new ir/aarchxx/instr_create_aarchxx.h and move both ARM and AArch64's DMB macro and IMM enum to the new file. If no objection, then I will go with this path forward. Thank you.

ir/aarchxx/instr_create_aarchxx.h seems good to me.

I have created instr_create_aarchxx.h in the latest commit, but I am struggling with making it appear in the Doxygen htmls.
Do you know how to make instr_create_aarchxx.h show up in the doxygen updates? Thank you.

It needs to be added to the file list in genapi.pl.

derekbruening · 2020-07-24T15:16:48Z

core/ir/aarch64/instr_create.h

+/**
+ * Creates a DMB instruction.
+ * \param dc      The void * dcontext used to allocate memory for the instr_t.
+ * \param imm     The integer constant opnd_t operand.


Point at one of the DR_DMB_OSHLD values, using # to make a doxygen link, so the user knows there are named constants.

Done the change accordingly in the latest commit.
Since the instr_create_aarchxx.h doxygen is not generated in the latest commit, will defer the verification.

derekbruening · 2020-07-24T15:20:16Z

core/ir/opnd.h

@@ -1702,7 +1702,7 @@ DR_API
 opnd_t
 opnd_create_far_base_disp_ex(reg_id_t seg, reg_id_t base_reg, reg_id_t index_reg,
                             int scale, int disp, opnd_size_t size, bool encode_zero_disp,
-                             bool force_full_disp, bool disp_short_addr);
+                             bool force_full_disp, bool disp_short_addr, bool post_index);


This is a breaking interface change. This needs to be accompanied by documentation changes announcing the break. Alternatively, instead of changing this function, a separate opnd_set_zero_offset_pre_index() could be added, which would be better for compatibility.

I guess opnd_set_zero_offset_post_index() since the default is pre.

Done the change accordingly in the latest commit.

derekbruening · 2020-07-24T15:24:50Z

core/ir/aarch64/instr_create.h

@@ -539,6 +539,12 @@
 #define INSTR_CREATE_ldp(dc, rt1, rt2, mem) \
    instr_create_2dst_1src(dc, OP_ldp, rt1, rt2, mem)
 #define INSTR_CREATE_ldr(dc, Rd, mem) instr_create_1dst_1src((dc), OP_ldr, (Rd), (mem))
+#define INSTR_CREATE_ldr_imm(dc, Rt, Rn, imm, post)                                      \


Not sure why this needs the "post": that is just an encoding detail that doesn't affect semantics, and it only applies to a zero offset, right?

My comment is based on only passing this to opnd_create_far_base_disp_ex(). See the comment for the pre_index field in opnd.h: it only affects a zero offset, since any other will just be a different offset value in the memop.

For creating semantically-different variants of the instruction itself: not sure what is currently done for other A64 instrs, but for A32 we have different variants of the macro itself for different write-back options, rather than a bool flag.

I should've read the explanation of the "pre_index" bitfield. Thank you.
I did gdb tracing and found pre_index is hard-coded to true, so I immediately figured the way to workaround it is to post-process it (manually changing to false, after instr_t is constructed).

Today, I've tried passing a negative offset to opnd_create_far_base_disp_ex() and I found it doesn't encode correctly.
Report: Internal Error: DynamoRIO debug check failure: /home/merck/Code/dynamorio_orig/core/arch/emit_utils_shared.c:1061 nxt_pc != NULL

** I will look into it tomorrow to see what actually happens, but I guess the AArch64 encoder doesn't catch it.

I learned that when a pre_index LDR (immediate) is encoded the instr_t will be handled by encode_opndsgen_38400c00() in encode_gen.h (line: 8085). The AArch64 encoder tries each possible encoding function of LDR until one of those catches it.

LDR (immediate) encoding differs pre- or post-index by looking at bit 11 and 12 (ARMv8 manual p.1000).
When it comes to encoding a post-index LDR, the encoder will first call encode_opndsgen_38400400() function (bit11 = 0) (before calling the pre-index function), and the instr_t is supposed to be caught by it.
In the first commit of this pull request, I found toggling the pre_index bitfield does affect how AArch64 encodes LDR (immediate). But passing a negative offset to opnd_create_far_base_disp_ex() doesn't seem will be encoded correctly in AArch64.

Let me have a deeper look tmr, and I will come back with a better change for reviews. Thank you!

BTW, thank you for laying out the design principles.

Today, I've tried passing a negative offset to opnd_create_far_base_disp_ex() and I found it doesn't encode correctly.
Report: Internal Error: DynamoRIO debug check failure: /home/merck/Code/dynamorio_orig/core/arch/emit_utils_shared.c:1061 nxt_pc != NULL

I remember something off in the A64 branch linking code: #4273, but that's a different assert. @AssadHashmi might know more?

I've done some investigations about the post-LDR encoding problem.

Summary:
0) I think the explanation of pre_index bit-field in the opnd.h is described in terms of A64 decoding.
For recording a decoded LDR is pre- or post-index, it uses the pre_index bit-field to distinguish it.
The value of the pre_index will be written by A64 decoder according to the input app's code sequence,
and the JIT is supposed to emit a correct LDR version by looking at the value of the pre_index field.

It's not meant to say a post-index LDR cannot have an offset other than 0, but both pre- and post-index
LDR could both have a zero offset encoded. So the pre_index field is needed for distinguishing them in DR at all time, not only when offset is 0.

Therefore, for manually encoding an LDR in the DR program itself, the pre_index field is hard-coded to true, if a developer calls OPND_CREATE_MEMPTR()->opnd_create_far_base_disp_ex() in DR. The A64 encoder will never have a chance to emit a correct post-LDR instr. Because the A64 encoder has been determining pre- or post-LDR by looking at the pre_index field in JIT decoding/encoding process.

Tl;Dr

Only LDR/STR family uses encode_opnd_mem9/encode_opnd_mem9post/encode_opnd_mem9_bytes functions to encode the signed offset imm9 field (bit 20:12). If a change happens in those will only affect LDR/STR family.

In the encode_opnd_mem9post() and the following encode_opnd_mem9_bytes() function, if post==true, then the disp is requested to be 0, otherwise it will fail the encoding (codec.c, line: 662). This is where it fails.
if (post ? (disp != 0) : (disp < -256 || disp > 255))
return false;

Based on point instrument_post_syscall() should be called after post_system_call() processes the real result #1, regardless of the indexing direction, per A64 SPEC. both pre- and post-index LDR/STR support encoding a signed 0 imm9 offset (AARM, p. 1000). But in the point update makefiles to build with local tools instead of /build/toolchain #2, current impl. limits an input of post==true to have a zero disp and allows post==false to have a disp of range from -256 to +255. This impl. is contradicting with the A64 SPEC.

Because the imm9 field's type is a signed integer. If passing an negative offset to opnd_create_far_base_disp_ex() function for the purpose of determining the pre- or post-index attribute of an LDR/STR encoding will result in prohibiting negative offsets to be correctly encoded in the further pre-index LDR/STR.
Therefore, will not go this way, but will think about setting FLAG_NEGATIVE and converting value to positive

The later AArch64 instruction encoding process doesn't determine a pre- or post-index LDR by directly looking at the pre_index bit-field. Instead, the A64 encoder tries each possible LDR encoding functions until one of those reports successful. A bit inefficient in A64 encoding process. Furthermore, if a preceding encoding function returns TRUE, it may cascadingly block a later more correct one. This actually complicates the constructions of opnd_t variables, because the relationship among those opnd_t variables determines the selection of the later A64 encoding functions. Rather than simply determining by looking at the values of individually opnd_t vars at first.

Conclusion:
6) I will study more about ARM's implementation (NEGATIVE FLAG, positive conversion), before I implement a change for A64. Will come back with a more considerate commit that is more aligned with current ARM port.

Reverted the changes of opnd_create_far_base_disp_ex(). Refined INSTR_CREATE_ldr_imm() macros to align with DR design (moved away from AArch64 mnemonics style to DR style). If the developers want a post-indexed LDR, they are supposed to call opnd_set_zero_offset_pre_index() on the just generated memop to set the pre_index field from true to false (indicating post-indexed), see an example in core/arch/aarch64/emit_utils.c.

Per discussions, revert changes to opnd_create_far_base_disp_ex(). A new opnd_set_zero_offset_post_index() function is created to set the pre_index field of a memop operand to false (post-indexed). INSTR_CREATE_ldr_imm macro is refined to take Rd, memop, and imm operands (DR way), instead of Rd, Rt, Imm (assembly mnemonics). The opnd_set_zero_offset_post_index() on memop allows selecting a post-indexed encoding if needed (pre-indexed by default). core/ir/aarchxx/instr_create_aarchxx.h is created to share instr. macros between AArch64 and ARM ports. INSTR_CREATE_dmb and its enums is moved into the new file. Issue: #2390 Test: 400.perlbench, suite/runsuite_wrapper.pl

merckhung · 2020-08-02T08:29:26Z

Close this pull request since I messed up with my fork with a force-push. A clean and new pull request has been created at #4390 4390.

derekbruening and others added 30 commits November 10, 2019 13:01

Add configuration for Github issue templates

b101ff1

Adds configuration to disable blank Github issues, so users will always use our templates (and in particular stop using the issue tracker to just ask questions). Also adds a contact link pointing at the users list.

Add users list pointer to feature template

46057f6

Suggests first starting with the users list in the feature template.

Add users list link requests to templates

7a79e63

Adds users list link requests to both templates.

Fix a missing word in the bug template

8024906

Adds a missing "a".

i#3940 FD leak: close the leaked file descriptor with -satisfy_w_xor_…

6d9ee1f

…x. (#3941) We need to close it in cases of successful/failed fd_priv_dup, so just promote the existing close to before the success/failure check. Fixes #3940

i#1312 AVX-512 support: Fix OPSZ_SAVED_OPMASK for non-AVX512BW machin…

b26f803

…es. (#3952) The bug led to an invalid encoding error on machines with no support for AVX512BW, but AVX512F. Tested manually by setting proc_avx512_enabled to true on a machine w/o AVX-512. Issue: #1312

Document that mutexes must be unlocked by the acquiring thread (#3975)

62ee222

Documents that dr_mutex_unlock must only be used on mutexes locked by the current thread.

i#3992: Work around VS2013 compiler OOM (#3993)

831606e

Splits the largest of our IR testing headers into two pieces to avoid a compiler error with VS2013 "out of heap space". Fixes #3992

i#3992 VS2013 OOM: Update stale header reference (#3994)

1db071a

Updates a reference to a split header in the cmake code which was missed by PR #3993. Issue: #3992

dkim-virsec and others added 7 commits July 7, 2020 11:15

i#4352: Fix thread handle leak in thread_handle_to_pid() (#4361)

520ecc4

Fixes the thread handle leak when thread_handle_to_pid() in win32/syscall.c is called. Fixes: #4352

i#4318 xarch memtrace: Remove raw from final name (#4365)

677e2e8

When raw files are gzipped and named xxx.raw.gz, the corresponding final trace files end up named xxx.raw.trace.gz, i.e., with an extra "raw." in there. We fix that here. Issue: #4318

i#4318 xarch memtrace: Detect view arch mismatch earlier (#4368)

4f953a9

For the view tool, the original wrong-arch check was placed *after* the entry skipping for -skip_refs, which can result in never checking the architecture. We fix that here by moving the check up front. Issue: #4318

Fix compilation on old Linux which does not have EM_ARM/EM_AARCH64 sy…

7ac898c

…mbols (#4371) Fix compilation on old Linux which does not have EM_ARM/EM_AARCH64 symbols.

merckhung marked this pull request as ready for review July 18, 2020 19:02

derekbruening reviewed Jul 20, 2020

View reviewed changes

derekbruening requested a review from egrimley July 20, 2020 02:09

Merck Hung and others added 8 commits July 20, 2020 14:49

i#2390: Update opnd_create_far_base_disp_ex() call-site to new defini…

ebd3454

…tion Notice the build failures in x86_64, update ext/drbbdup/drbbdup.c to using the new definitions of opnd_create_far_base_disp_ex()

i#2390: Apply correct indent to INSTR_CREATE_dmb description

6ea8c26

i#2390: Apply code format according to the lint tool

f52200e

Merge branch 'master' into b2

8788295

Merge branch 'b2' of https://github.com/merckhung/dynamorio into b2

0da198e

derekbruening requested a review from AssadHashmi July 24, 2020 04:31

AssadHashmi approved these changes Jul 24, 2020

View reviewed changes

derekbruening requested changes Jul 24, 2020

View reviewed changes

derekbruening reviewed Jul 24, 2020

View reviewed changes

merckhung closed this Aug 2, 2020

merckhung mentioned this pull request Aug 9, 2020

i#2390: Replace add+ldar with ldr+dmb in AArch64 HT lookup #4390

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

i#2390: Replace add+ldar with ldr+dmb in AArch64 HT lookup #4375

i#2390: Replace add+ldar with ldr+dmb in AArch64 HT lookup #4375

merckhung commented Jul 18, 2020

derekbruening left a comment

derekbruening Jul 20, 2020

merckhung Jul 20, 2020

derekbruening Jul 20, 2020

merckhung Jul 20, 2020

derekbruening Jul 20, 2020

merckhung Jul 20, 2020

derekbruening Jul 20, 2020

merckhung Jul 20, 2020

derekbruening commented Jul 20, 2020

AssadHashmi commented Jul 20, 2020

AssadHashmi commented Jul 20, 2020

derekbruening commented Jul 24, 2020

AssadHashmi left a comment

derekbruening Jul 24, 2020

derekbruening Jul 24, 2020

merckhung Jul 27, 2020

merckhung Jul 28, 2020

derekbruening Jul 28, 2020

merckhung Aug 2, 2020

derekbruening Aug 2, 2020

derekbruening Jul 24, 2020

merckhung Aug 2, 2020 •

edited

Loading

derekbruening Jul 24, 2020

derekbruening Jul 24, 2020

merckhung Aug 2, 2020

derekbruening Jul 24, 2020

derekbruening Jul 24, 2020 •

edited

Loading

merckhung Jul 28, 2020 •

edited

Loading

merckhung Jul 28, 2020

derekbruening Jul 28, 2020

merckhung Jul 30, 2020 •

edited

Loading

merckhung Aug 2, 2020

merckhung commented Aug 2, 2020

i#2390: Replace add+ldar with ldr+dmb in AArch64 HT lookup #4375

i#2390: Replace add+ldar with ldr+dmb in AArch64 HT lookup #4375

Conversation

merckhung commented Jul 18, 2020

derekbruening left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

derekbruening commented Jul 20, 2020

AssadHashmi commented Jul 20, 2020

AssadHashmi commented Jul 20, 2020

derekbruening commented Jul 24, 2020

AssadHashmi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

merckhung Aug 2, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

derekbruening Jul 24, 2020 • edited Loading

Choose a reason for hiding this comment

merckhung Jul 28, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

merckhung Jul 30, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

merckhung commented Aug 2, 2020

merckhung Aug 2, 2020 •

edited

Loading

derekbruening Jul 24, 2020 •

edited

Loading

merckhung Jul 28, 2020 •

edited

Loading

merckhung Jul 30, 2020 •

edited

Loading