APP CRASH drmem full on Chromium net_unittests #1723

Closed
derekbruening opened this issue May 4, 2015 · 3 comments
@derekbruening (Contributor)

I ran the full net_unittests suite under drmemory_full via:

RUNNING_ON_VALGRIND=yes tools/valgrind/chrome_tests.sh -b out/Release -t net --tool drmemory_full 2>&1 | tee OUT-net

and:

[ RUN      ] DiskCacheBackendTest.InvalidRankingsFailure
~~Dr.M~~ WARNING: application exited with abnormal code 0x40010006

Strange:

#define DBG_PRINTEXCEPTION_C ((NTSTATUS)0x40010006L) // winnt

Light mode seems to work.
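
For context (a known Windows mechanism, not something specific to this bug): OutputDebugStringA delivers its string by raising exactly this exception, swallowing it via SEH when no debugger consumes it. A hedged sketch, not the real kernel32 code:

#include <windows.h>
#include <string.h>

/* Hedged sketch of OutputDebugStringA's core mechanism: raise
 * DBG_PRINTEXCEPTION_C carrying the string, and swallow the exception
 * if no debugger handles it.  Illustrative only. */
static void
debug_print_sketch(const char *msg)
{
    ULONG_PTR args[2] = { (ULONG_PTR)strlen(msg) + 1, (ULONG_PTR)msg };
    __try {
        RaiseException(0x40010006L /* DBG_PRINTEXCEPTION_C */, 0, 2, args);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        /* no debugger attached: the message is dropped */
    }
}

This is why broken exception dispatch can turn 0x40010006 into an exit code: the raise is supposed to be caught and discarded.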

Excluding that one test, we hit:

[ RUN      ] DiskCacheBackendTest.Enumerations
Dr.M WARNING: application exited with abnormal code 0x80000001

Hmm, xref #1690.

Back to no exclusions, but after fixing the i#1685 assert (see below), which is
the only problem I can hit if I run just one of these DiskCacheBackendTest
subtests, I still see problems:

[ RUN      ] DiskCacheBackendTest.InvalidEntry2
Backtrace:
        RtlUlonglongByteSwap [0x77EE0B85+79237]
        IsNLSDefinedString [0x759072FF+2013]
        IsNLSDefinedString [0x759070E7+1477]
        RtlMoveMemory [0x77EAB63D+1373]
        RtlMoveMemory [0x77EAB60F+1327]
        KiUserExceptionDispatcher [0x77E60133+15]
        OutputDebugStringA [0x758E34F0+91]
        logging::LogMessage::~LogMessage [0x67F36CC0+160] (z:\derek\chromium\src\base\logging.cc:582)
        disk_cache::BackendImpl::NewEntry [0x66C8B139+665] (z:\derek\chromium\src\net\disk_cache\blockfile\backend_impl.cc:1547)
<...>
~~Dr.M~~ WARNING: application exited with abnormal code 0xc0000028

http://build.chromium.org/p/client.drmemory/builds/drmemory-windows-r2075-sfx.exe
=> seems to work fine, so it looks like a regression.

Excluding all of the DiskCacheBackendTest.* tests, later in the run we hit:

[ RUN      ] ProxyServiceMojoTest.DnsResolution
~~Dr.M~~ WARNING: application exited with abnormal code 0x80000001

With -no_share_xl8: still crashes.
With -stack_size 128K: still the crash w/ OutputDebugStringA.

I reproduced w/:

# ~/drmemory/git/build_x86_dbg/bin/drmemory.exe -debug $DRMEM_CHROME_ARGS -pause_at_assert -dr d:/derek/dr/git/exports  -- out/Release/net_unittests.exe --single-process-tests --gtest_filter=DiskCacheBackendTest.*

But I haven't found any smaller subset of tests that reproduces. The
0xc0000028 is the most common, often with no backtrace printed, and
-pause_at_exit won't pause (so an uncaught-by-DR crash?), so I'm having a
hard time getting information about the crash.

With -dr_debug we see early on:
<Out of vmheap reservation - reserving 256KB.Falling back onto OS allocation>
but then the tests all pass. So is the problem somehow limited to the DR release build?

@derekbruening (Contributor, Author)

Although I never got as far as finding precisely what was triggering the
guard page here (very difficult to track), the theory is:

Some client code, while the TEB stack fields point to dstack, touched an
app guard page and triggered app stack expansion. The dstack happened to
be at a lower address than the app stack, and thus the kernel did not
update TEB.StackLimit. The StackLimit field is checked explicitly by
RtlDispatchException, and the app calling OutputDebugStringA raises
exceptions relatively frequently.
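
For reference, the TEB stack fields in question are publicly visible via NT_TIB; a minimal standalone sketch of reading them:

#include <windows.h>
#include <stdio.h>

/* Minimal sketch: print the stack bounds that the kernel maintains and
 * that RtlDispatchException checks against the faulting esp. */
int
main(void)
{
    NT_TIB *tib = (NT_TIB *)NtCurrentTeb();
    printf("StackBase=%p StackLimit=%p\n", tib->StackBase, tib->StackLimit);
    /* The failure mode above: a guard-page hit while esp is on a dstack
     * *below* the app stack leaves StackLimit stale (too high). */
    return 0;
}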

Somehow I can't reproduce this in a DR debug build, nor in anything other
than DrMem full mode: the guard page trigger or the app thread stack
location (due to address space perturbation) must somehow be limited to
those configurations.

I'm dumping my notes below:

*** TODO soln #1: always keep dstack higher than app stack and don't swap StackLimit?
**** DONE keep dstack higher than app stack, always
CLOSED: [2015-05-05 Tue 18:25]
- State "DONE" from "TODO" [2015-05-05 Tue 18:25]

if esp is lower than TEB.StackLimit, the kernel updates it:

0:000> !teb
TEB at 7efdd000
ExceptionList: 003dfb7c
StackBase: 003e0000
StackLimit: 003de000
0:000> ed 7efdd008 12340000
0:000> ed 7efdd004 12350000
0:000> !teb
TEB at 7efdd000
ExceptionList: 003dfb7c
StackBase: 12350000
StackLimit: 12340000

after continuing execution (esp is below the fake StackLimit), the kernel has updated it:

0:000> !teb
TEB at 7efdd000
ExceptionList: 003dfb7c
StackBase: 12350000
StackLimit: 003d7000

xref DynamoRIO/dynamorio#1102:
"why haven't we seen this before? related to DRi#921 where pre-win8 stacks
were always lower than our dstack?"

So this bug requires an app stack at a higher address than its dstack, which
explains some of the nondeterminism. That suggests one possible solution:
ensure the dstack is higher than the app stack and don't swap StackLimit,
though that seems unworkable on x64? Or does the dstack not have any
reachability concerns, if we were to split it from gencode? Xref DRi#921
where win8 x64 stacks are up high.

-stack_shares_gencode isn't active anyway due to the large client stacks:
Not enough room to allocate 0x00005000 bytes post stack of size 0x0000e000

and should_swap_peb_pointer() is true only if there's a client:
return (INTERNAL_OPTION(private_peb) &&
!IS_INTERNAL_STRING_OPTION_EMPTY(client_lib));

so kind of mutually exclusive.

impl: call os_heap_reserve_in_region() with the app stack as the lower bound
and the top of the address space as the upper bound. dstacks are then never
inside vm_reserve (or we could first check whether the app stack is low
enough for that). Is that a problem, having them scattered around? We may
run out of address space, and will have a longer dynamo_areas vector.
A sketch of the idea follows.
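
A hedged Win32-level sketch of the idea (DR itself would go through os_heap_reserve_in_region() and its own memory iterator; this standalone version just illustrates finding a free region above the app stack):

#include <windows.h>

/* Hedged sketch: reserve a dstack-sized region at an address above the
 * app stack by scanning upward for a free region.  Standalone stand-in
 * for reserving within [app stack, top of address space). */
static void *
reserve_above(void *app_stack_top, size_t size)
{
    MEMORY_BASIC_INFORMATION mbi;
    char *probe = (char *)app_stack_top;
    while (VirtualQuery(probe, &mbi, sizeof(mbi)) == sizeof(mbi)) {
        if (mbi.State == MEM_FREE && mbi.RegionSize >= size) {
            void *p = VirtualAlloc(mbi.BaseAddress, size,
                                   MEM_RESERVE, PAGE_NOACCESS);
            if (p != NULL)
                return p;   /* dstack region now sits above the app stack */
        }
        probe = (char *)mbi.BaseAddress + mbi.RegionSize;  /* next region */
    }
    return NULL;   /* ran out of address space */
}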

could have a 2nd -vm_stack_reserve up high.

Could make this best-effort and combine it with #2's various checks for
StackLimit being off: though it could still fail, esp. on win8+.

Looks like many instances of low dstacks occur once we're beyond the vm reservation, on win7:
@@@@ dstack=0x07a7f000 vs app xsp=0x11e9fa7c
@@@@ dstack=0x07a7f000 vs app xsp=0x11dcfa34
@@@@ dstack=0x07a7f000 vs app xsp=0x11d3fd64
@@@@ dstack=0x07a7f000 vs app xsp=0x11dbfb70
@@@@ dstack=0x07a7f000 vs app xsp=0x11e3f83c
@@@@ dstack=0x07a7f000 vs app xsp=0x11f8fd74
@@@@ dstack=0x0126f000 vs app xsp=0x1238fd90
@@@@ dstack=0x0126f000 vs app xsp=0x1231fa00
@@@@ dstack=0x07a7f000 vs app xsp=0x121ef980
@@@@ dstack=0x07a7f000 vs app xsp=0x1212ff38

Though remember that on win8 app xsp is much higher.

Implementing this soln, here's where we go off vm_reserve:

T3344 dstack=0x27bcf000, app esp=0x1181f87c, TEB lim=0x1181f000, TEB base=0x11820000
T1584 dstack=0x2816f000, app esp=0x1198fc80, TEB lim=0x1198f000, TEB base=0x11990000
T5952 dstack=0x2840f000, app esp=0x1141fd34, TEB lim=0x1141f000, TEB base=0x11420000
T5172 dstack=0x2852f000, app esp=0x11c7fb18, TEB lim=0x11c7f000, TEB base=0x11c80000
T7024 dstack=0x285cf000, app esp=0x11b5fae8, TEB lim=0x11b5f000, TEB base=0x11b60000
T6064 dstack=0x11dff000, app esp=0x11defcbc, TEB lim=0x11def000, TEB base=0x11df0000
T7544 dstack=0x11fdf000, app esp=0x11fcfc28, TEB lim=0x11fcf000, TEB base=0x11fd0000
T6412 dstack=0x1226f000, app esp=0x1225f804, TEB lim=0x1225f000, TEB base=0x12260000
T7696 dstack=0x1213f000, app esp=0x1212fc24, TEB lim=0x1212f000, TEB base=0x12130000
T6676 dstack=0x124ef000, app esp=0x124dfd1c, TEB lim=0x124df000, TEB base=0x124e0000
T3860 dstack=0x1265f000, app esp=0x1264fa18, TEB lim=0x1264f000, TEB base=0x12650000
T4604 dstack=0x127bf000, app esp=0x127afdbc, TEB lim=0x127af000, TEB base=0x127b0000
T6400 dstack=0x1295f000, app esp=0x1294fafc, TEB lim=0x1294f000, TEB base=0x12950000
[ OK ] DiskCacheBackendTest.NewEvictionDisableSuccess4 (2231 ms)

**** DONE don't swap StackLimit
CLOSED: [2015-05-05 Tue 18:27]
- State "DONE" from "TODO" [2015-05-05 Tue 18:27]

A high-dstack soln doesn't work by itself: sure, the kernel updates
TEB.StackLimit while we're in the priv state, but we throw that update out
when we go back to the app: we use the stored (stale) value from before we
switched to the priv state in the first place. So we have to not swap
StackLimit.

An alternative could be to swap StackLimit and have the priv-to-app context
switch compare the current StackLimit to the bottom of the dstack: if they
don't match, save the current value as the app limit. That cmp can use
lea+jecxz so as not to clobber the flags; see the sketch below.
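
A hedged C rendering of that check (names illustrative; DR would emit this inline, using lea to compute the difference into ecx and jecxz to branch on zero, leaving the arithmetic flags untouched):

#include <windows.h>

/* Illustrative priv-to-app switch check: if StackLimit no longer holds
 * the dstack bottom, the kernel must have updated it while we were in
 * priv mode, so capture the new value as the app-side StackLimit. */
static void
check_stacklimit_on_switch(NT_TIB *tib, void *dstack_bottom,
                           void **saved_app_stacklimit)
{
    if (tib->StackLimit != dstack_bottom) {
        *saved_app_stacklimit = tib->StackLimit;  /* app stack grew */
    }
}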

Seems better to essentially combine this with soln #3, if we have the
property that dstack > app stack: we never swap StackLimit and leave it with
the app value all the time, except at thread exit where we swap it to avoid
DRi#1102.

For thread exit: really we need to swap after the app stack is deallocated.
On XP+ that happens out-of-process, but where is the IPC notification
made? Somewhere in ExitThread, but it can't happen until the thread is
done executing user-mode code, so it seems like it should be fine to wait
for the NtTerminateThread syscall. But if so, why did DRi#921 crash? I
guess it was at process exit, where many threads had already exited.

A priv lib _chkstk will check whether esp is >= TEB.StackLimit and, if so,
it's done. Else it takes TEB.StackLimit, subtracts one page, and touches it
(sketched below). This should all work fine: when on the dstack, esp will be
well above StackLimit, and when on the app stack it will behave normally.
When we're on the dstack and the client touches the app guard page (for the
current thread), the kernel will adjust StackLimit.
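
A hedged C rendering of that probing behavior (the real _chkstk is hand-written assembly; this just restates the description above):

#include <windows.h>

#define PROBE_PAGE 0x1000

/* Illustrative _chkstk logic: if the new esp is already above
 * StackLimit there is nothing to do; otherwise touch one page at a
 * time below StackLimit so the kernel commits it via the guard page. */
static void
chkstk_sketch(char *new_esp)
{
    NT_TIB *tib = (NT_TIB *)NtCurrentTeb();
    char *limit = (char *)tib->StackLimit;
    if (new_esp >= limit)
        return;                   /* already committed: done */
    do {
        limit -= PROBE_PAGE;      /* one page below StackLimit */
        *(volatile char *)limit;  /* touch: guard page commits it */
    } while (new_esp < limit);
}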

i#1676 should still be fine unless it has some max stack size or something
built in.

So the only remaining problem is the client touching the guard page of a
different thread. For that I guess all we can do is rely on soln #2's
fallback checks. So we end up essentially using all 3 solutions.

*** TODO soln #2: we check and update TEB.StackLimit ourselves prior to any harm

Many things could trigger the guard page: the leak scan, the is_retaddr safe
read, reading of syscall args, etc. It could be from a clean call, so we
can't just check in dispatch or something (we could be in steady-state
anyway). Checking in the context switch would require memquery syscalls to
find where the guard page is, which would be expensive.

If the consequence were always a raised guard page fault (0x80000001), we
could handle it there, plus check at NtRaiseException. Actually, how does
the app die with 0x80000001? That requires a low StackLimit while _chkstk
tries to trigger the guard page: but our problem is a too-high StackLimit.

What about _chkstk itself? Can it hit an AV? It might fail to trigger the
right page? But that page was already triggered: so nothing negative, right?

But haven't we seen more than 2 exit codes (0x80000001 and 0xc0000028)?
Definitely 0x40010006: how did that end up as an exit code?
And somewhere I saw a wacky one: from where?!?
What about the dumped Cr backtraces?

I put in checks on NtRaiseException and KiUserExceptionDispatcher and they
trigger a lot:

[ RUN      ] DiskCacheBackendTest.CreateBackend_MissingFile
T6916 dstack=0x26c3f000, app esp=0x0992fb30, TEB lim=0x0992f000, TEB base=0x09930000
updated stored TEB.StackLimit from 0x0992e000 to 0x0992d000
xxxxxx T6916 RaiseException 0x40010006
dc lim=0x0992d000, dc base=0x09930000, TEB lim=0x26c31000, TEB base=0x26c3f000
updated stored TEB.StackLimit from 0x0992d000 to 0x0975d000

This requires a little memory-query syscall loop to find the guard page (see
the sketch below), but since it's limited to a particular syscall and fault
it should be ok.
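
A hedged sketch of that query loop (Win32 VirtualQuery here; DR would use its own memquery on top of NtQueryVirtualMemory):

#include <windows.h>

/* Walk down from the stale StackLimit to find the PAGE_GUARD region;
 * the true StackLimit is the page just above it.  Illustrative only. */
static void *
find_true_stack_limit(char *stale_limit)
{
    MEMORY_BASIC_INFORMATION mbi;
    char *pc = stale_limit - 1;       /* start just below the stale value */
    while (VirtualQuery(pc, &mbi, sizeof(mbi)) == sizeof(mbi)) {
        if (mbi.State == MEM_COMMIT && (mbi.Protect & PAGE_GUARD) != 0)
            return (char *)mbi.BaseAddress + mbi.RegionSize;
        if (mbi.State == MEM_FREE)
            break;                    /* walked off the stack: give up */
        pc = (char *)mbi.BaseAddress - 1;   /* step below this region */
    }
    return NULL;
}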

What about i#1676 where kernel now checks precise esp bounds? That
requires updating StackLimit prior to next fault => this technique is not
sufficient.

What about asking the client, for any read of the app stack, to call some DR
routine? If everything is wrapped in dr_safe_read we're ok: but what about
inside DR_TRY? Does DR_TRY have to go do a query loop every time in case it
hit an app guard page? Or does the client have to call a DR routine at the
end of its DR_TRY accesses (sketched below)? But what if that thread touches
a different thread's guard page?
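
A hedged sketch of that proposal, using DR's real DR_TRY_EXCEPT macro; the post-read hook dr_note_app_stack_read() is HYPOTHETICAL, named only to show where the client would hand control back to DR to re-sync StackLimit:

#include <string.h>
#include "dr_api.h"

void dr_note_app_stack_read(void);  /* HYPOTHETICAL hook, for illustration */

/* Client-side read of app memory that may include an app stack: wrap
 * the access in DR_TRY_EXCEPT, then notify DR in case we tripped a
 * guard page so it can run its query loop and fix TEB.StackLimit. */
static bool
client_read_app_stack(void *drcontext, byte *addr, size_t sz, byte *buf)
{
    bool ok = true;
    DR_TRY_EXCEPT(drcontext, {
        memcpy(buf, addr, sz);        /* may touch an app guard page */
    }, {
        ok = false;                   /* unreadable memory */
    });
    dr_note_app_stack_read();         /* hypothetical: re-sync StackLimit */
    return ok;
}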

*** CANCELED soln #3: can we store a fake range in TEB field instead of dstack range? => impossible
CLOSED: [2015-05-05 Tue 11:19]
- State "CANCELED" from "TODO" [2015-05-05 Tue 11:19]

for DynamoRIO/dynamorio#921: need StackLimit page mapped, or have no _chkstk in DR+tool exit
for #1676: need cur xsp to be in [StackLimit..StackBase)
for https://github.com/DynamoRIO/dynamorio/issues/1102: need: ?!? seems just like i#921 but w/ _chkstk in priv lib.
for #1723: need cur xsp to be < StackLimit

so #1676 and #1723 have conflicting requirements.

@derekbruening (Contributor, Author)

We hit another related bug while testing:

**** TODO ASSERT

[ RUN      ] DiskCacheBackendTest.ShaderCacheOnlyDoomAll
T5456 dstack=0x2f0cf000, app esp=0x1090f98c, TEB lim=0x1090f000, TEB base=0x10910000
5456 ASSERT FAILURE (thread 5456): D:\derek\drmemory\git\src\drmemory\drmemory.c:1159: mc.xsp <= (reg_t)teb->StackBase && mc.xsp > (reg_t)teb->StackLimit (initial xsp for thread invalid)
~~Dr.M~~ WARNING: application exited with abnormal code 0xffffffff

another w/ more info (took 29 iters to hit it):

[ RUN      ] DiskCacheBackendTest.SimpleCacheLoad
T5000 dstack=0x227ef000, app esp=0x1214fbe0, TEB lim=0x1214f000, TEB base=0x12150000
XXXXXX xsp=0x1214fbe0 not in 0x1214e000-0x00000000 for T5000
5000 ASSERT FAILURE (thread 5000): D:\derek\drmemory\git\src\drmemory\drmemory.c:1162: mc.xsp <= (reg_t)teb->StackBase && mc.xsp > (reg_t)teb->StackLimit (initial xsp for thread invalid)
~~Dr.M~~ WARNING: application exited with abnormal code 0xffffffff

DrMem is calling:
dr_switch_to_app_state(drcontext);

and swap_peb_pointer_ex() does nothing on to_priv for the stack since
!dynamo_initialized, and on to_app it should do nothing b/c the stack is not
a DR address. So how did the upper bound end up NULL? It must have thought
0x12150000 was a DR address? Maybe it is, and we need to subtract 1: we
shouldn't query the exclusive top address itself. See the sketch below.
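
If that's the bug, the fix is a one-byte adjustment before the region query; a hedged sketch (is_dr_address() is a stand-in name for DR's internal range check):

#include <stdbool.h>

bool is_dr_address(unsigned char *addr);  /* stand-in for DR's range check */

/* StackBase is the exclusive top of the stack, so querying that address
 * itself lands in whatever region starts there (possibly DR-owned).
 * Query the last byte actually inside the stack instead. */
static bool
stack_is_dr_memory(unsigned char *stack_base /* exclusive top */)
{
    return is_dr_address(stack_base - 1);
}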

@derekbruening (Contributor, Author)

The adopted solution is a combination of the 3 proposed solutions above.
It is summarized here:

/* We need to meet these requirements:
 * + DrMi#1676: cur esp is in [StackLimit..StackBase) at all times on Win8.1.
 * + i#921, i#1102: StackLimit is a valid page at all times.
 * + DrMi#1723: StackLimit must be updated for app guard page hits, even when
 *   in client code.
 * Our solution is:
 * A) Ensure dstack > app xsp.
 * B) StackLimit holds app value at all times, except on thread exit where
 *    loader_pre_client_thread_exit() swaps it.
 * C) Swap StackBase on context switches between priv and app values.
 */
#define SWAP_TEB_STACKLIMIT() (false)
#define SWAP_TEB_STACKBASE() (true)
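
A hedged sketch of how those two policies play out on a context switch (function and field handling illustrative; the real logic lives in DR's Windows loader):

#include <windows.h>
#include <stdbool.h>

#define SWAP_TEB_STACKLIMIT() (false)
#define SWAP_TEB_STACKBASE() (true)

/* Illustrative priv<->app swap honoring the policy above: StackBase is
 * swapped in both directions (C), while StackLimit always keeps the app
 * value (B), so kernel guard-page updates are never thrown away. */
static void
swap_teb_stack_fields(NT_TIB *tib, bool to_priv,
                      void *priv_base, void *app_base)
{
    if (SWAP_TEB_STACKBASE())
        tib->StackBase = to_priv ? priv_base : app_base;
    if (SWAP_TEB_STACKLIMIT()) {
        /* never reached: the app StackLimit stays put except at thread
         * exit, where loader_pre_client_thread_exit() swaps it (B). */
    }
}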

*** TODO what about fibers or other stack swaps? need dstack above those too!

ryumiel pushed a commit to ltilve/chromium that referenced this issue May 7, 2015
This re-lands fixes for problems handling HeapWalk().

This also fixes several other bugs:
DynamoRIO/dynamorio#1690,
DynamoRIO/drmemory#1718,
DynamoRIO/drmemory#1722, and
DynamoRIO/drmemory#1723.

BUG=481231
TBR=thakis@chromium.org
NOTRY=true

Review URL: https://codereview.chromium.org/1128923003

Cr-Commit-Position: refs/heads/master@{#328548}