Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fastfail fix introduced a bug -- unreachable code #830

Open
expend20 opened this issue Oct 12, 2022 · 26 comments
Open

fastfail fix introduced a bug -- unreachable code #830

expend20 opened this issue Oct 12, 2022 · 26 comments
Labels
bug Something isn't working

Comments

@expend20
Copy link
Contributor

expend20 commented Oct 12, 2022

fastfail fix introduced next bug, it assumes UnhandledExceptionFilter should not return, but in the case it's called for the exception which is not treated as a crash (like c++'s throw) then we need to return from UnhandledExceptionFilter and continue execution. I don't know yet, if Frida is capable of unwinding in instrumented code, if not, then it's kind of bad, but let's talk about that later.

Here is the harness which reproduces the bug:

struct MyException : public std::exception {
    const char *what() const throw() { return "C++ Exception"; }
};

extern "C" __declspec(dllexport) void __cdecl crash()
{
    *(size_t *)0 = 0;
}

extern "C" __declspec(dllexport) size_t
    FuzzMeCPPEH(const char *data, unsigned int len)
{
    try {
        printf("Throwing CPPEH ...\n");
        DoNothing(1);
        throw MyException();
        DoNothing(2);
    }
    catch (MyException &e) {
        printf("In CPPEH handler\n");
        DoNothing(3);
    }

    printf("The end of the function\n");
    if (*(size_t *)data == 0x37333331) {
        crash();
    }
    return 1;
}

example run frida_gdiplus.exe -H C:\git\boxer_cpp\build\src\tests\RelWithDebInfo\acc_test.dll -F FuzzMeCPPEH -l acc_test.dll
the log:

Child process file stdio is not supported on Windows yet. Dumping to stdout instead...
spawning on cores: Cores { cmdline: "0", ids: [CoreId { id: 0 }] }
I am broker!!.
Connected to port 1337
New connection: 127.0.0.1:53743/127.0.0.1:53743
Setting core affinity to CoreId { id: 0 }
Spawning next client (id 0)
First run. Let's set it all up
We're a client, let's fuzz :)
excluding range: 0-7ff7b00c0000
excluding range: 7ff7b265f000-7ffc29f80000
excluding range: 7ffc29f95000-ffffffffffffffff
Loading file "corpus/not_kitty.png" ...
Throwing CPPEH ...
The application panicked (crashed).
Message:  internal error: entered unreachable code: handle_exception should not return
Location: C:\git\libafl_f\libafl_frida\src\windows_hooks.rs:53

Backtrace omitted.

Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
The application panicked (crashed).
Message:  called `Option::unwrap()` on a `None` value
Location: C:\git\libafl_f\libafl\src\executors\inprocess.rs:1154

Backtrace omitted.

the solution would be to call the original function

@expend20 expend20 added the bug Something isn't working label Oct 12, 2022
@expend20
Copy link
Contributor Author

btw, if you just remove the initialize() in windows_hook.rs (no hooks of IsProcessorFeaturePresent and UnhandledExceptionFilter) you'll end up in some weird crash like this:

AccTest: loaded(2cb0.674): Unknown exception - code 00000000 (first chance)
(2cb0.674): Unknown exception - code 00000000 (!!! second chance !!!)
ntdll!RtlRaiseStatus+0x3d:
00007ffc`392ef86d 65488b0c2560000000 mov   rcx,qword ptr gs:[60h] gs:00000000`00000060=????????????????
3:015> k
 # Child-SP          RetAddr               Call Site
00 000000e2`5aacc0b0 000002a3`d92b4cd2     ntdll!RtlRaiseStatus+0x3d
01 000000e2`5aacc650 000000e2`5aaccce0     0x000002a3`d92b4cd2
02 000000e2`5aacc658 000000e2`5aacc690     0x000000e2`5aaccce0
03 000000e2`5aacc660 00000000`00000000     0x000000e2`5aacc690

which may indicate that Frida doesn't like c++ exceptions in instrumented code

@expend20
Copy link
Contributor Author

so, basically calling the original function in the hook led us to this ^ issue, and the next possible steps toward solution in my opinion would be:

  • Try to investigate why Frida issues this, is that completely not supported or we just have misused Frida in a weird way? would appreciate any examples of using Frida for code instrumentations like this but I need to instrument external module
  • Redirect execution to the start of the next iteration if we experience any exception (even expected) and mark this coverage as insignificant.
    • Not really sure if that's the desired behavior, because even OutputDebugString could produce the exception if I'm not mistaken.
    • Not sure Frida allows this, if anyone has an example of something similar would appreciate it.

@WorksButNotTested
Copy link
Collaborator

I can only help with the last point I’m afraid. Here’s my implementation for AFL_FRIDA_PERSISTENT_RET.

https://github.com/AFLplusplus/AFLplusplus/blob/d1e1bbc713b22d620956143634ecdf97223aa59f/frida_mode/src/persistent/persistent_x64.c#L321

This file has the inline assembly generation code to support persistent mode. The epilogue is inserted when we want to terminate an iteration early.

@s1341
Copy link
Collaborator

s1341 commented Oct 13, 2022

I don't really have the ability to test or debug on windows. I know that the unwinding of stack/exceptions through stalked code in frida is a bit iffy at best, so it is entirely possible that this is the issue. We need to bring @oleavr into the conversation, I'm afraid.

@oleavr
Copy link
Collaborator

oleavr commented Oct 13, 2022

We'll need to improve Stalker so it augments the native unwinding like it does on some of the other ABIs. This boils down to (something like) hooking an internal function to translate the JITed address back to the original address.

Stalker recompiles the code being run on the fly, and executes the recompiled copy instead. While doing this it ensures it has identical side-effects as the original code -- e.g. making sure the recompiled code for a CALL instruction results in the same return address being pushed as if the original CALL was the instruction being executed.

Without having spent any time researching this, I suspect the solution might be to hook the user-space callback that the kernel calls, and modify the CONTEXT's EIP/RIP so it looks like the exception happened in the original code, and not the recompiled version that's actually being executed.

Frida already hooks the user-space callback actually, as this is how its Exceptor API is implemented on Windows. So in theory all Stalker needs to do is call gum_exceptor_add() to register a handler which does this work. It should return TRUE if the EIP/RIP was translated (signaling that it handled the exceptions, and no other handlers should be called), and FALSE otherwise.

Anyone interested in taking a stab at implementing this? I should be able to pitch in in the not-too-distant future, just a bit too much on my plate right now. Definitely happy to assist though if anybody's keen on diving in right away.

@expend20
Copy link
Contributor Author

Thanks @oleavr for the response!

Well, without promising anything I could try.

Maybe we could discuss in brief, what is the plan. In theory when exception happend in instrumented code, you need to walk all the current stack and replace all the retun addresses. Only the latest one is not enough, because there could be chained exceptoins and unwinds. So you change the pointers in the stack from the instrumented code, to the original code and continue execution. In that way windows will be able to unwind the stack properly, but you would need to expect the original code to continue execution pretty much anywhere. Which leads us to the next question.

How do you hook the original code? By removing the executable attribute from the memory and handling the exception, right?

Just before I dive into the code, maybe there are additional context/information which may be beneficial?

@WorksButNotTested
Copy link
Collaborator

Take a quick look at the Linux exception handling support. I’d guess the approach would be pretty similar.

https://github.com/frida/frida-gum/blob/187492a38f0a2ecef3915765a2917e53b0a6e303/gum/backend-x86/gumstalker-x86.c#L876

Rather than modifying the stack itself, it’s likely to be a case of hooking some functions involved in the exception handling process and modifying the arguments or return values.

@oleavr
Copy link
Collaborator

oleavr commented Oct 17, 2022

In theory when exception happend in instrumented code, you need to walk all the current stack and replace all the retun addresses.

There's luckily nothing that needs replacing on the stack -- Stalker ensures that the side-effects of the recompiled code are identical to the original code. This means that anything pushed on the stack will be exactly as if Stalker wasn't there.

When something goes wrong though, the exception details coming from the kernel will have an instruction pointer that is in instrumented code. This needs to be translated back to the corresponding original address. Implementing that may require adding some extra book-keeping in Stalker. However, I suspect there's a simpler solution -- by hooking the exception handling internals similar to what we do on Linux, so I would recommend taking a look at the code that @WorksButNotTested pointed out, and compare the Windows implementation's internals to that.

@WorksButNotTested
Copy link
Collaborator

Exception handling in Linux is explained in excruciating detail here. https://monkeywritescode.blogspot.com/p/c-exceptions-under-hood.html

It’s a very comprehensive and thorough explanation.

@expend20
Copy link
Contributor Author

thanks guys, I'll try to allocate time on this

@expend20
Copy link
Contributor Author

Hello, I tried to reproduce it on frida-gum's example, and it... just worked. So, Frida actually supports C++ exceptions on Windows, at least on this minimalistic example.

Attaching the reproducer.
frida-ex-repro.zip

Now, the most interesting question, why doesn't it work on libafl? No clue yet.

@WorksButNotTested
Copy link
Collaborator

I would test playing around with excluded ranges. Try including everything and see if it works. Then try excluding everything. I suspect the issue might relate to an exception occurring in an excluded range with a handler in an included region.

@expend20
Copy link
Contributor Author

You were right, in the reproducer if I call stalker.exclude() for every virtual address except the instrumented module, it stops working.

I wonder why do we have both: excluded ranges and explicit check in the instrumentation callback? Sound like if we have a check in the instrumentation callback, we don't need any exclude ranges, and vice versa, mutually exclusive in short.

@WorksButNotTested
Copy link
Collaborator

@domenukk
Copy link
Member

Doesn't that just mean we should include the LibAFL error Handler in Stalker and that may be enough? Or maybe I'm not understanding something correctly

@tokatoka
Copy link
Member

tokatoka commented Nov 3, 2022

I think if we excluded irrelevant ranges then the program doesn't even jump into these callbacks when they are executed

let mut first = true;

therefore we can achieve more speed (?)

@expend20
Copy link
Contributor Author

expend20 commented Nov 3, 2022

therefore we can achieve more speed (?)

Instrumentation phase is one time cost, so it potentially will lead to only faster startup, fuzzing speed is not going to be affected.

I think we need solid understanding why these excluded ranges needed for Windows in the first place, if it's just an artifact from porting code from Linux, then perhaps we can move it behind cfg(not(windows)) and this will resolve the exceptions issue.

@tokatoka
Copy link
Member

tokatoka commented Nov 3, 2022

@s1341
do you have any ideas?
can we remove these excludes ?

@expend20
Copy link
Contributor Author

expend20 commented Nov 3, 2022

Meanwhile I can try if Frida's exeptor can be of help

@s1341
Copy link
Collaborator

s1341 commented Nov 3, 2022

The issue is also present on macOS.

It's a fundamental issue with how frida stalker 'jits' the code, and how exception unwinding works on each platform. The correct solution is to implement fixes upstream in frida.

@domenukk
Copy link
Member

domenukk commented Nov 3, 2022

cough @oleavr

@tokatoka
Copy link
Member

tokatoka commented Nov 4, 2022

Instrumentation phase is one time cost, so it potentially will lead to only faster startup, fuzzing speed is not going to be affected.

as long as I checked with frida_libpng the speed dropped from 33k/sec to 22k/sec if I removed stalker.exclude(). (on linux)
so i'd say it's still meaningful for performance reason

@expend20
Copy link
Contributor Author

expend20 commented Nov 4, 2022

as long as I checked with frida_libpng the speed dropped from 33k/sec to 22k/sec if I removed stalker.exclude(). (on linux)
so i'd say it's still meaningful for performance reason

wow, that's weird as for me, what could be the reason?

@WorksButNotTested
Copy link
Collaborator

If we tell FRIDA stalker to exclude a range, then when we call a function in it, it will push a return address onto the stack to jump back to FRIDA and then jump to the excluded function. FRIDA will only get control back when that function returns, no matter if that function in turn calls other functions not in excluded ranges.

In AFL++, since we don’t want to lose control of execution before we’ve even started our target, we don’t instruct FRiDA to exclude any ranges until we hit the entrypoint. This means that we will be instrumenting blocks which may actually be excluded until we hit the entrypoint and these will be cached for later use.

Backpatching (https://medium.com/@oleavr/anatomy-of-a-code-tracer-b081aadb0df8) means that at times transition between instrumented blocks can happen without transitioning back to FRIDA to find the next block (a significant performance improvement). Thus we won’t check for an excluded range when this happens. So we may actually run instrumented code for excluded ranges for short periods. Therefore by checking for exclusions when generating the instrumented code, we can skip emitting coverage code at the start of these blocks to avoid the additional overhead and polluting the coverage map when they run.

The issue with exception handling is conceptually quite simple, if execution transitions from an included range to an excluded range and encounters and triggers an exception, then if that exception is handled back in the included range we hit a problem.

When running instrumented code, we can replace any RET instruction with alternative code. Therefore, we can continue to put genuine return instructions on the stack and have the instrumented code perform any translation when we encounter the RET (by substituting different instructions) to work out where to branch to next.

However, for excluded ranges we vector to the original code itself, this is obviously much more performant. But as we are executing real code, when we hit the final RET instruction we will jump to whatever address is at the top of the stack. Therefore to get back into stalker and instrumented code, we cannot simply put the genuine return address there. We have to put an address which returns back into FRIDA or an instrumented block.

This causes problems with exception handlers as they walk the stack to find the exception handlers and expect to find the correct return addresses in there. For Linux, we hook the unwinding mechanism to correct these addresses. I expect a similar approach may be needed for other operating systems.

@tokatoka
Copy link
Member

So for now can we just remove the exclude() temporarily for windows..?

@WorksButNotTested
Copy link
Collaborator

I would make it the default to skip telling FRIDA to exclude ranges for all platforms and architectures except for Linux on x86/64. But I would make it configurable so that users with targets without exception handling can re-enable it. I would also have it print a warning about what it is doing and information on how to re-configure it.

So the default behaviour ensures everything works, but at the expense of performance for many targets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants