Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support ARM/Thumb interworking #35

Merged
merged 10 commits into from Jan 3, 2022
Merged

Conversation

MrAnno
Copy link
Collaborator

@MrAnno MrAnno commented Jan 2, 2022

My understanding is that we have to support all possible scenarios, where both the original main() and new_main() can either be Thumb or ARM.

Note: I could not find a way to make the Thumb code generation optional, but it seems Thumb is a pretty old thing.

TODO:

  • Retest on real armv6
  • Retest on real armv8
  • "Provide your own main" tests: arm-calls-arm, thumb-calls-thumb, arm-calls-thumb, thumb-calls-arm

This fixes the BXF_ARCH_ARM and BXF_ARCH_ARM64 macros as well.
@MrAnno
Copy link
Collaborator Author

MrAnno commented Jan 2, 2022

Hm, I'm not sure what happened with the timeout.c sample, it crashes somewhere in libc, before calling our patched main.

@Snaipe
Copy link
Owner

Snaipe commented Jan 2, 2022

I'm not even seeing the failure logs on Cirrus, maybe the ARM runner is broken. In any case, if I'd had to guess, I'd assume it crashes trying to call main if the trampoline is not compatible?

@MrAnno
Copy link
Collaborator Author

MrAnno commented Jan 2, 2022

Oh I think I have identified the exact problem.

ldr     r3, addr_data_alt
bx      r3

addr_data_alt:
.fill 4, 1, 0

The assembler generates the following from the above:

(gdb) x/3i opcodes
   0xfffef608:  ldr     r3, [pc, #0]    ; (0xfffef60c)
   0xfffef60a:  bx      r3
   0xfffef60c:  address of new_main

pc is 2 instructions ahead, which is 4 bytes in Thumb mode.

But when PC is used as a base register for addressing operations, it uses a word-aligned value, which means the offset is variable, so this can't be used reliably for patching. timeout.c is failing because its main function's address is not word-aligned (it's halfword-aligned), so the trampoline jumps to the wrong location:

(gdb) x/3i main
   0x4012e6:    ldr     r3, [pc, #0]    ; (0x4012e8)
   0x4012e8:    bx      r3
   0x4012ea:    address of new_main

@MrAnno
Copy link
Collaborator Author

MrAnno commented Jan 2, 2022

I'm testing an idea that could make this truly position independent.

@MrAnno MrAnno force-pushed the fix-arm32v7-trampoline branch 6 times, most recently from 85c6adc to 05216c9 Compare January 3, 2022 00:43
BX exchanges the instruction set if necessary.
@MrAnno MrAnno force-pushed the fix-arm32v7-trampoline branch 2 times, most recently from 61fe8af to b688d20 Compare January 3, 2022 01:37
@MrAnno
Copy link
Collaborator Author

MrAnno commented Jan 3, 2022

Well, I'm not proud of the end result, but it finally works.

src/asm/trampoline-arm.S Outdated Show resolved Hide resolved
src/asm/trampoline-arm.S Show resolved Hide resolved
src/asm/trampoline-arm.S Show resolved Hide resolved
src/asm/trampoline-arm.S Show resolved Hide resolved
@@ -179,7 +187,8 @@ int bxfi_exe_patch_main(bxfi_exe_fn *new_main)
size_t len = align2_up(offset + sizeof (opcodes), PAGE_SIZE);

mprotect(base, len, PROT_READ | PROT_WRITE | PROT_EXEC);
memcpy(nonstd (void *) addr, opcodes, sizeof (opcodes));
size_t prelude_len = bxfi_exe_inject_prelude(addr);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be simplified; I think the non-alignment is only an issue for the thumb trampoline, right? i.e. everything other than arm thumb is already aligned, and it's just the thumb trampoline that's misaligned.

If so, I think we should just preface the thumb trampoline with a 2-byte nop, and then align2_up(opcodes, 4) so that the nop gets ignored during the copy.

Copy link
Collaborator Author

@MrAnno MrAnno Jan 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried implementing this, but I haven't been able to succeed.

My experience was that I can't really insert nops into the original trampoline, because it interferes with how pseudo-instructions are generated by the assembler. For example, adding a nop just before the ldr instruction would generate a different pc-relative jump, which then I would have to patch somehow in the C code when injecting it to a word-aligned Thumb address.

Just for the record, when I tried the above, timeout.c began to work (the halfword-aligned one), but all word-aligned tests failed:

if (bxfi_exe_is_arm_func_word_aligned(func_to_patch))
    *trampoline = (void*) ((uintptr_t) &bxfi_trampoline_thumb + 2);
1/4 nested.c          FAIL            0.01s   killed by signal 6 SIGABRT
>>> MALLOC_PERTURB_=72 /tmp/cirrus-ci-build/build/sample/nested.c.bin

2/4 callback.c        FAIL            0.01s   killed by signal 6 SIGABRT
>>> MALLOC_PERTURB_=179 /tmp/cirrus-ci-build/build/sample/callback.c.bin

3/4 context.c         FAIL            0.01s   killed by signal 6 SIGABRT
>>> MALLOC_PERTURB_=195 /tmp/cirrus-ci-build/build/sample/context.c.bin

4/4 timeout.c         OK              2.00s

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. It's a bit disappointing that we have to maintain an extra pair of symbols just to delimit a nop, but I can live with it.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, I wonder if we could hardcode the nop opcode as a constant for ARM and pre-fill the opcodes buffer with a nop slide, and then copying the actual opcodes at an aligned index. I don't prefer one over the other, so your call.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could inject the actual opcode from C, but I thought that would be even more terrible.

Copy link
Collaborator Author

@MrAnno MrAnno Jan 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure which way to go, my problem with the hard-coded opcode was portability.
ARM docs state that NOP may be a pseudo-instruction, and the assembler might replace it with different instructions for different processors.

For example, MOV r8, r8 or MOV r0, r0.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my idea was essentially to do this:

    char opcodes[4096];
    bxfi_nop_fill(opcodes, sizeof(opcodes));
    uintptr_t offset = (uintptr_t) base - (uintptr_t) align2((uintptr_t) base, arch_word_size);
    memcpy(opcodes + offset, trampoline, sizeof (opcodes) - offset);

and have bxfi_nop_fill fill the buffer with nops depending on the actual architecture, so that we ensure that regardless of the architecture, the trampoline start is word aligned. That said it assumes that the base can't be misaligned by an offset smaller than the size of the smallest nop, but I think the current PR makes the same assumption anyway (with the thumb nop being something like mov r8, r8 and being 2-bytes-long).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds good to me.
I'm a bit afraid of the implementation of bxfi_nop_fill, so I'd leave this PR as is for now, if you don't mind. :)
When adding the WIndows/iOS support, I'm going to try implementing this as you suggested.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I'm fine with the current code. I'm always happier the better the code is, so if it looks like my suggestion could end up being overall better, of course I'd prefer it, but this is perfectly fine as-is too. And better yet, it works.

Thumb is an ARM instruction set encoding, it uses 16-bit instructions
instead of 32.

When patching such a function, we should use a Thumb-encoded trampoline.
bxfi_exe_trampoline_fixup() can be used to make small modifications on the
trampoline code or on the function being patched.
https://developer.arm.com/documentation/dui0471/m/interworking-arm-and-thumb/pointers-to-functions-in-thumb-state

Pointers to Thumb functions have their least significant bit set to
ensure that interworking works correctly.

That bit is just an indicator, we need to get rid of it before patching
the function with our trampoline.
Thumb is a halfword-aligned instruction set encoding, but our trampoline
code uses a PC-relative jump, which needs to be positioned on a word
aligned address.
src/exe-elf-arm-fixup.h Outdated Show resolved Hide resolved
Copy link
Owner

@Snaipe Snaipe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on my end; is this OK to merge?

@MrAnno
Copy link
Collaborator Author

MrAnno commented Jan 3, 2022

Thanks for the thorough review.

I think it's ready now.

@Snaipe
Copy link
Owner

Snaipe commented Jan 3, 2022

Great, thanks for the awesome work!

@Snaipe Snaipe merged commit 46cfaeb into Snaipe:master Jan 3, 2022
@Snaipe
Copy link
Owner

Snaipe commented Jan 3, 2022

Had to rebase on top of master, so the commit hash will need an update in the Criterion PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants