New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support ARM/Thumb interworking #35
Conversation
This fixes the BXF_ARCH_ARM and BXF_ARCH_ARM64 macros as well.
Hm, I'm not sure what happened with the |
I'm not even seeing the failure logs on Cirrus, maybe the ARM runner is broken. In any case, if I'd had to guess, I'd assume it crashes trying to call main if the trampoline is not compatible? |
Oh I think I have identified the exact problem. ldr r3, addr_data_alt
bx r3
addr_data_alt:
.fill 4, 1, 0 The assembler generates the following from the above: (gdb) x/3i opcodes
0xfffef608: ldr r3, [pc, #0] ; (0xfffef60c)
0xfffef60a: bx r3
0xfffef60c: address of new_main
But when PC is used as a base register for addressing operations, it uses a word-aligned value, which means the offset is variable, so this can't be used reliably for patching. (gdb) x/3i main
0x4012e6: ldr r3, [pc, #0] ; (0x4012e8)
0x4012e8: bx r3
0x4012ea: address of new_main |
I'm testing an idea that could make this truly position independent. |
85c6adc
to
05216c9
Compare
BX exchanges the instruction set if necessary.
61fe8af
to
b688d20
Compare
Well, I'm not proud of the end result, but it finally works. |
@@ -179,7 +187,8 @@ int bxfi_exe_patch_main(bxfi_exe_fn *new_main) | |||
size_t len = align2_up(offset + sizeof (opcodes), PAGE_SIZE); | |||
|
|||
mprotect(base, len, PROT_READ | PROT_WRITE | PROT_EXEC); | |||
memcpy(nonstd (void *) addr, opcodes, sizeof (opcodes)); | |||
size_t prelude_len = bxfi_exe_inject_prelude(addr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be simplified; I think the non-alignment is only an issue for the thumb trampoline, right? i.e. everything other than arm thumb is already aligned, and it's just the thumb trampoline that's misaligned.
If so, I think we should just preface the thumb trampoline with a 2-byte nop, and then align2_up(opcodes, 4) so that the nop gets ignored during the copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried implementing this, but I haven't been able to succeed.
My experience was that I can't really insert nops into the original trampoline, because it interferes with how pseudo-instructions are generated by the assembler. For example, adding a nop just before the ldr
instruction would generate a different pc-relative jump, which then I would have to patch somehow in the C code when injecting it to a word-aligned Thumb address.
Just for the record, when I tried the above, timeout.c began to work (the halfword-aligned one), but all word-aligned tests failed:
if (bxfi_exe_is_arm_func_word_aligned(func_to_patch))
*trampoline = (void*) ((uintptr_t) &bxfi_trampoline_thumb + 2);
1/4 nested.c FAIL 0.01s killed by signal 6 SIGABRT
>>> MALLOC_PERTURB_=72 /tmp/cirrus-ci-build/build/sample/nested.c.bin
2/4 callback.c FAIL 0.01s killed by signal 6 SIGABRT
>>> MALLOC_PERTURB_=179 /tmp/cirrus-ci-build/build/sample/callback.c.bin
3/4 context.c FAIL 0.01s killed by signal 6 SIGABRT
>>> MALLOC_PERTURB_=195 /tmp/cirrus-ci-build/build/sample/context.c.bin
4/4 timeout.c OK 2.00s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. It's a bit disappointing that we have to maintain an extra pair of symbols just to delimit a nop, but I can live with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, I wonder if we could hardcode the nop opcode as a constant for ARM and pre-fill the opcodes buffer with a nop slide, and then copying the actual opcodes at an aligned index. I don't prefer one over the other, so your call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could inject the actual opcode from C, but I thought that would be even more terrible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure which way to go, my problem with the hard-coded opcode was portability.
ARM docs state that NOP may be a pseudo-instruction, and the assembler might replace it with different instructions for different processors.
For example, MOV r8, r8
or MOV r0, r0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, my idea was essentially to do this:
char opcodes[4096];
bxfi_nop_fill(opcodes, sizeof(opcodes));
uintptr_t offset = (uintptr_t) base - (uintptr_t) align2((uintptr_t) base, arch_word_size);
memcpy(opcodes + offset, trampoline, sizeof (opcodes) - offset);
and have bxfi_nop_fill fill the buffer with nops depending on the actual architecture, so that we ensure that regardless of the architecture, the trampoline start is word aligned. That said it assumes that the base can't be misaligned by an offset smaller than the size of the smallest nop, but I think the current PR makes the same assumption anyway (with the thumb nop being something like mov r8, r8 and being 2-bytes-long).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds good to me.
I'm a bit afraid of the implementation of bxfi_nop_fill
, so I'd leave this PR as is for now, if you don't mind. :)
When adding the WIndows/iOS support, I'm going to try implementing this as you suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I'm fine with the current code. I'm always happier the better the code is, so if it looks like my suggestion could end up being overall better, of course I'd prefer it, but this is perfectly fine as-is too. And better yet, it works.
Thumb is an ARM instruction set encoding, it uses 16-bit instructions instead of 32. When patching such a function, we should use a Thumb-encoded trampoline.
bxfi_exe_trampoline_fixup() can be used to make small modifications on the trampoline code or on the function being patched.
https://developer.arm.com/documentation/dui0471/m/interworking-arm-and-thumb/pointers-to-functions-in-thumb-state Pointers to Thumb functions have their least significant bit set to ensure that interworking works correctly. That bit is just an indicator, we need to get rid of it before patching the function with our trampoline.
Thumb is a halfword-aligned instruction set encoding, but our trampoline code uses a PC-relative jump, which needs to be positioned on a word aligned address.
b688d20
to
f35ffa7
Compare
c1cf9cc
to
f4f83a7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM on my end; is this OK to merge?
Thanks for the thorough review. I think it's ready now. |
Great, thanks for the awesome work! |
Had to rebase on top of master, so the commit hash will need an update in the Criterion PR |
My understanding is that we have to support all possible scenarios, where both the original
main()
andnew_main()
can either be Thumb or ARM.Note: I could not find a way to make the Thumb code generation optional, but it seems Thumb is a pretty old thing.
TODO: