
native_exec: regain control when calling from a native module into a non-native module #978

derekbruening opened this issue Nov 28, 2014 · 5 comments

From rnk@google.com on November 14, 2012 16:36:32

For hybrid instrumentation tools, the dream is to be able to execute the main module natively and dynamically instrument everything else. This should drastically improve startup time for such tools.

Our current -native_exec support is all about calling out to a module that will run natively, and then regaining control when it returns or from our syscall hooks. This issue covers extending it to support regaining control when the native module calls to code that we want to instrument.

There are three ways we can regain control:

  • Patch the PLT for direct calls
  • NX protect the non-native modules
  • Have the compiler annotate indirect calls in the native code

In the PLT, we can basically create a little thunk that takes control. There are details to work out, especially with delay-load imports on Windows and the equivalent RTLD_LAZY on Linux. Ideally, we can find a way to fit the thunk into the existing PLT code sequences, and then we don't have to allocate memory for stubs.

The NX protection is nice because it provides perfect error checking for us, but I'd rather avoid faults if possible. It makes debugging harder, and could have bad performance in a tight loop.
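
As a sketch of that fault-based variant (names here are hypothetical, not DR's real interfaces): drop PROT_EXEC from the non-native module's text and treat an instruction-fetch SIGSEGV inside that range as the transition point.

#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>

/* Text range of a module we want to instrument (tracked elsewhere). */
static uintptr_t instr_text_start, instr_text_end;

/* Remove execute permission so any call/jmp into the module faults. */
static void
protect_instrumented_text(void *base, size_t size)
{
    instr_text_start = (uintptr_t)base;
    instr_text_end = instr_text_start + size;
    mprotect(base, size, PROT_READ); /* readable for decoding, not executable */
}

/* If the fault is an instruction fetch inside the protected range, this is
 * a native-to-instrumented transition: hand the target PC to the dispatcher
 * instead of re-executing it natively. */
static void
nx_fault_handler(int sig, siginfo_t *info, void *ucxt)
{
    uintptr_t pc = (uintptr_t)info->si_addr;
    if (pc >= instr_text_start && pc < instr_text_end) {
        /* retakeover_at(pc, ucxt);  -- hypothetical entry into DR */
        return;
    }
    /* Otherwise it is a real app fault: deliver it as usual. */
}

static void
install_nx_handler(void)
{
    struct sigaction act = { 0 };
    act.sa_sigaction = nx_fault_handler;
    act.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &act, NULL);
}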

I've actually implemented the LLVM side of the annotations in msan, although I haven't sent it for review. Mostly I just wanted to verify that it is doable. We can do this last once we've got the initial re-takeover support.

For the actual transition point, I'm thinking we want to find a way to quickly jump into an optimized IBL without spilling the full mcontext or switching stacks. We'll also have to swap the library segment.

Original issue: http://code.google.com/p/dynamorio/issues/detail?id=978

From rnk@google.com on January 29, 2013 11:39:04

Some thoughts on hooking the ELF PLT.

We want this code to work as expected:

#include <stdio.h>
int main(void) {
    printf("&printf: %p\n", &printf);
    return 0;
}

The printed address should be from libc, not some DR-related stub. I was worried that there would be only one GOT entry for both the address reference and the actual call, but there are actually two relocations:

$ gcc -fPIE -pie ./printf.c -o printf && ./printf
&printf: 0x7f2f9af80840

$ objdump -R ./printf
...
DYNAMIC RELOCATION RECORDS
OFFSET TYPE VALUE
...
0000000000200fb8 R_X86_64_GLOB_DAT printf
...
0000000000201000 R_X86_64_JUMP_SLOT printf
...

The assembly confirms that the data relocation is used to take the address while the PLT uses the jump slot entry:
$ disas_func.py ./printf printf@plt
0000000000000620 <printf@plt>:
620: ff 25 da 09 20 00 jmpq *0x2009da(%rip) # 201000 <_GLOBAL_OFFSET_TABLE_+0x18>
626: 68 00 00 00 00 pushq $0x0
62b: e9 e0 ff ff ff jmpq 610 <_init+0x20>

$ disas_func.py ./printf main
000000000000074c <main>:
74c: 55 push %rbp
74d: 48 89 e5 mov %rsp,%rbp
750: 48 8d 05 15 01 00 00 lea 0x115(%rip),%rax # 86c <_IO_stdin_used+0x4>
757: 48 8b 15 5a 08 20 00 mov 0x20085a(%rip),%rdx # 200fb8 <_DYNAMIC+0x198>
75e: 48 89 d6 mov %rdx,%rsi
761: 48 89 c7 mov %rax,%rdi
764: b8 00 00 00 00 mov $0x0,%eax
769: e8 b2 fe ff ff callq 620 <printf@plt>
76e: b8 00 00 00 00 mov $0x0,%eax
773: 5d pop %rbp
774: c3 retq

In conclusion, we should be able to overwrite values in .plt.got without breaking things.
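
A rough sketch of that GOT rewrite (assuming the slots are already resolved, i.e. eager binding, and using a hypothetical create_plt_stub): walk DT_JMPREL/DT_PLTRELSZ and swap only the R_X86_64_JUMP_SLOT entries, leaving the R_X86_64_GLOB_DAT entry that backs &printf untouched.

#include <elf.h>
#include <link.h>
#include <stddef.h>

/* Hypothetical: returns a DR-generated stub that regains control and
 * eventually tail-calls the real target.  Placeholder body so this
 * sketch compiles; the identity mapping obviously hooks nothing. */
static void *
create_plt_stub(void *resolved_target)
{
    return resolved_target;
}

/* Replace each R_X86_64_JUMP_SLOT GOT entry of a loaded module with a stub.
 * 64-bit shown; ia32 would use ELF32_R_TYPE and R_386_JMP_SLOT (and Rel,
 * not Rela).  Assumes d_ptr is module-relative; some loaders rewrite it in
 * place to an absolute address, which would need a check.  Also assumes the
 * GOT is still writable (i.e. RELRO has not been applied yet). */
static void
hook_plt_got(struct link_map *map)
{
    ElfW(Rela) *rela = NULL;
    size_t relasz = 0;
    for (ElfW(Dyn) *dyn = map->l_ld; dyn->d_tag != DT_NULL; dyn++) {
        if (dyn->d_tag == DT_JMPREL)
            rela = (ElfW(Rela) *)(map->l_addr + dyn->d_un.d_ptr);
        else if (dyn->d_tag == DT_PLTRELSZ)
            relasz = dyn->d_un.d_val;
    }
    if (rela == NULL)
        return;
    for (size_t i = 0; i < relasz / sizeof(ElfW(Rela)); i++) {
        if (ELF64_R_TYPE(rela[i].r_info) != R_X86_64_JUMP_SLOT)
            continue;
        void **got_slot = (void **)(map->l_addr + rela[i].r_offset);
        *got_slot = create_plt_stub(*got_slot);
    }
}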

From rnk@google.com on January 29, 2013 12:22:53

Another observation is that modern Ubuntu has been defaulting to marking the PLTGOT readonly since 10.04: https://wiki.ubuntu.com/Security/Features (search for RELRO)

In other words, gcc defaults to linking with -z relro. I can get lazy linking with '-z norelro -z lazy'; -z lazy might imply norelro.

I thought I was observing lazy linking when I last looked at native_exec in November, but it must not have been the case, since I was on Precise at the time.

For lazy linking, the PLT looks kind of like this:

shared_trampoline:
push loader_link_map(%rip)
jmp _dl_runtime_resolve(%rip)

PLT stubs:
__errno_location@plt:
jmp __errno_location@plt.got(%rip)
push $reloc_index
jmp shared_trampoline

__errno_location@plt.got points to "push $reloc_index" initially

The _dl_runtime_resolve slot is at a well-known location defined by the SysV AMD64 psABI docs. I believe the situation is similar for i386, but I didn't see the PLT discussed in the i386 ABI docs.

The resolver is the third slot in the array referred to by the DT_PLTGOT pointer from the .dynamic section. We can overwrite it with our own stub, which will call into DR's C code, call _dl_fixup, and then insert our hook if desired.
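
Concretely, something along these lines (dr_runtime_resolve is a stand-in for whatever generated stub DR would install; GOT[0..2] are the reserved .dynamic / link_map / resolver slots):

#include <elf.h>
#include <link.h>

/* Placeholder for the generated stub that would save scratch regs, call
 * into DR's C code and _dl_fixup, optionally plant a hook, then jump to
 * the resolved target. */
static void
dr_runtime_resolve(void)
{
}

/* Take over a module's lazy-resolution path by replacing GOT[2].
 * Same d_ptr caveat as elsewhere: assumed module-relative. */
static void
hook_lazy_resolver(struct link_map *map)
{
    void **pltgot = NULL;
    for (ElfW(Dyn) *dyn = map->l_ld; dyn->d_tag != DT_NULL; dyn++) {
        if (dyn->d_tag == DT_PLTGOT) {
            pltgot = (void **)(map->l_addr + dyn->d_un.d_ptr);
            break;
        }
    }
    if (pltgot != NULL)
        pltgot[2] = (void *)dr_runtime_resolve;
}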

For eager resolution, things are different. If we detect an mprotect(readonly) call on the module that matches the PT_RELRO info, we can assume the module has been relocated, and we can kick off a scan of the PLTGOT to intercept any cross-module calls. We could even reprotect it.

If we don't detect the mprotect, then it could still be eagerly bound or it could be lazy. Installing our resolver replacement is good enough for modules present at LD_PRELOAD init time, but for dlopen'd modules, we fire the module_load event post-mmap. If we insert our resolver then, it will be overwritten by the loader. We may need a post-relocation module load control point.
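
The detection might look roughly like this, assuming we keep the module's program headers around and get called from our mprotect syscall handling (the surrounding hook is hypothetical):

#include <elf.h>
#include <link.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Heuristic: an mprotect that removes write/exec and overlaps the module's
 * PT_GNU_RELRO segment is the loader sealing the GOT after relocation, so
 * it is now safe to scan the PLTGOT for cross-module targets. */
static bool
is_relro_reprotect(struct link_map *map, const ElfW(Phdr) *phdrs, size_t nphdr,
                   void *addr, size_t len, int prot)
{
    if (prot & (PROT_WRITE | PROT_EXEC))
        return false;
    for (size_t i = 0; i < nphdr; i++) {
        if (phdrs[i].p_type != PT_GNU_RELRO)
            continue;
        uintptr_t relro_start = map->l_addr + phdrs[i].p_vaddr;
        uintptr_t relro_end = relro_start + phdrs[i].p_memsz;
        uintptr_t mp_start = (uintptr_t)addr;
        uintptr_t mp_end = mp_start + len;
        /* Deliberately loose: the loader page-aligns the range it seals. */
        return mp_start < relro_end && mp_end > relro_start;
    }
    return false;
}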

From rnk@google.com on January 29, 2013 12:44:31

For the record, if we couldn't overwrite the pointers in .plt.got, then the only way to intercept PLT calls would be to overwrite the PLT code stubs themselves. I don't like this because it feels too invasive. We'd be making some of .text RWX. Each PLT stub is 16 bytes, for both ia32 and x64. We'd have to turn it into something like:

sym@plt:
mov sym@plt.got(%rip/%ebx) -> %r11/%eax ; 6-ish bytes?
jmp dr_native_plt_call ; how to make this x64-reachable?

For x64, we'd have to find some unused 8-byte slot in the module to put a pointer to dr_native_plt_call, or come up with another sequence. There are 3 reserved slots in .plt.got: .dynamic, link_map, and _dl_runtime_resolve. Some of these might not be used, but it's dicey. There is a 4-byte nop at the end of the shared PLT trampoline, but that's not enough.

We don't have enough space to use the 'jmp 0(%rip) ; <8-byte target>' trick.

We could also do something evil like 'call place(%rip)', but it doesn't really save any bytes over 'push imm32 ; jmp place(%rip)'.

In summary, I'd much rather overwrite the pointers in the GOT, and with the info from comment #1, it seems we can do this without breaking transparency of function pointer values.

From rnk@google.com on February 03, 2013 08:24:01

I have PLT interception (without takeover) pretty much implemented for 32- and 64-bit ELFs. For each function pointer I replace in the GOT, I need to create a stub of code that contains the real function pointer resolved by the loader.

How should I allocate code for these stubs? Right now I'm just allocating a fixed-size buffer of RWX code up front. In theory, the stubs should live as long as the module being intercepted, so that a reloaded module doesn't cause leaks. Eventually, it might be nice to create linkstubs for each PLT entry so that we can go straight from native code into the fcache, so long as the target fragment hasn't been unlinked.
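
One possible shape for the allocation, as a sketch (the size and embedded-pointer layout are illustrative only): a per-module arena of fixed-size stub slots, each recording its loader-resolved target, freed as a unit when the owning module is unloaded so reloads don't leak.

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define STUB_SIZE 32 /* illustrative; room for a short code sequence + pointer */

/* One arena per intercepted module, so its lifetime matches the module's. */
typedef struct plt_stub_pool {
    uint8_t *base;   /* RWX arena (or W^X-managed in a hardened build) */
    size_t capacity;
    size_t used;
} plt_stub_pool_t;

static int
stub_pool_init(plt_stub_pool_t *pool, size_t num_stubs)
{
    pool->capacity = num_stubs * STUB_SIZE;
    pool->used = 0;
    pool->base = mmap(NULL, pool->capacity, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return pool->base == MAP_FAILED ? -1 : 0;
}

/* Carve out one slot and stash the real target at its end so the stub code
 * (emitted elsewhere) and any unhook path can recover it. */
static uint8_t *
stub_pool_alloc(plt_stub_pool_t *pool, void *real_target)
{
    if (pool->used + STUB_SIZE > pool->capacity)
        return NULL; /* a real allocator would grow or chain another pool */
    uint8_t *stub = pool->base + pool->used;
    pool->used += STUB_SIZE;
    memcpy(stub + STUB_SIZE - sizeof(void *), &real_target, sizeof(void *));
    return stub;
}

/* Released with the module, so a reloaded module doesn't accumulate stubs. */
static void
stub_pool_free(plt_stub_pool_t *pool)
{
    munmap(pool->base, pool->capacity);
}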

From rnk@google.com on February 14, 2013 08:00:09

I'm at the point where I need to handle nested calls into native modules, like in this stack:

...

Currently there is only one native_exec retaddr slot in the dcontext.

In order to handle this, I need to make some assumptions about the application, which I want to discuss and document. I propose the following assumptions:

  1. When calling a native module, we assume we can write past the app TOS. Currently we do a large clean call to entering_native. I don't have a profiler that can tell me whether this is the bottleneck, but a normal 'drrun clang hello.c' run is faster than the same run with '-native_exec_list clang', and the only extra overhead imposed is that of the native transitions.

The current state is that we only touch the retaddr slot, which has already been written. The risk is that we might trigger guard page faults, which we'd have to translate.

I'm not concerned about disturbing apps that write data past the TOS. This is a cross-module call where presumably the callee will clobber the stack.

  2. The application may longjmp or unwind past frames with back_from_native return addresses, but it will not re-enter unwound frames after that. Re-entry could happen as part of some kind of split-stack continuation-passing scheme where functions can return multiple times. It's possible that the Go language does something like this, but Go binaries are totally static, so running them under DR with native_exec is not very interesting.

As a consequence, we can maintain a native retaddr stack in the dcontext. When we hit back_from_native, and the current SP does not match the SP from the top of the retaddr stack, we can scan backwards until the first matching SP and throw away everything that doesn't match. This is similar to how drwrap works, I believe.
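
As a sketch of that retaddr stack (names and the fixed depth are hypothetical; the real thing would live in the dcontext):

#include <stdint.h>

#define MAX_NATIVE_RETSTACK 16 /* illustrative fixed depth */

/* One entry per pending native return: the app SP at the cross-module call
 * plus the original return address that back_from_native displaced. */
typedef struct {
    uintptr_t sp;
    void *retaddr;
} native_ret_t;

typedef struct {
    native_ret_t entries[MAX_NATIVE_RETSTACK];
    int top; /* number of live entries */
} native_retstack_t;

static void
retstack_push(native_retstack_t *rs, uintptr_t sp, void *retaddr)
{
    if (rs->top == MAX_NATIVE_RETSTACK)
        return; /* a real implementation would grow or flush */
    rs->entries[rs->top].sp = sp;
    rs->entries[rs->top].retaddr = retaddr;
    rs->top++;
}

/* On back_from_native: pop until the saved SP matches the current SP,
 * discarding entries for frames that were longjmp'd or unwound past
 * (assumption 2 above says they will not be re-entered). */
static void *
retstack_pop_matching(native_retstack_t *rs, uintptr_t cur_sp)
{
    while (rs->top > 0) {
        native_ret_t *e = &rs->entries[--rs->top];
        if (e->sp == cur_sp)
            return e->retaddr;
    }
    return NULL; /* no match: lost control point, handle as an error */
}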


I'm unsure about making stack assumptions, but I'm trying to come up with ideas to make the native transitions as fast as possible. There are API boundaries in Chrome that are very hot.
