-
Notifications
You must be signed in to change notification settings - Fork 6.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ASan builds for glibc 2.36+ #44811
Conversation
@azat What if we will, instead, do the following:
If we do this, we will be a static binary while still using GLibc as a dynamic library (the specific version of our choice and under our control). It can be an intermediate solution - a substitute for using Musl. |
@alexey-milovidov this sounds like an interesting approach, but for this particular problem it does not wroth it in my opinion, plus this will require some time to implement it and fix all the bugs :) Also maybe making fully static binaries with Musl will be even easier then doing this, however ASan does not support static linkage, so shared binaries will be still used for ASan. So I wound say that this should be done not "instead" but in addition to. |
Ok, good to me. |
fd0ea0a
to
99d69bb
Compare
…ceptors) Recently I noticed that clickhouse compiled with ASan does not work with newer glibc 2.36+, before I though that this was only about compiling with old but using new, however that was not correct, ASan simply does not work with glibc 2.36+. Here is a simple reproducer [1]: $ cat > test-asan.cpp <<EOL #include <pthread.h> int main() { // something broken in ASan in interceptor for __pthread_mutex_lock // and only since glibc 2.36, and for pthread_mutex_lock everything is OK pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; return __pthread_mutex_lock(&mutex); } EOL $ clang -g3 -o test-asan test-asan.cpp -fsanitize=address $ ./test-asan AddressSanitizer:DEADLYSIGNAL ================================================================= ==15659==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0x7fffffffccb0 sp 0x7fffffffcb98 T0) ==15659==Hint: pc points to the zero page. ==15659==The signal is caused by a READ memory access. ==15659==Hint: address points to the zero page. #0 0x0 (<unknown module>) ClickHouse#1 0x7ffff7cda28f (/usr/lib/libc.so.6+0x2328f) (BuildId: 1e94beb079e278ac4f2c8bce1f53091548ea1584) AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV (<unknown module>) ==15659==ABORTING [1]: https://gist.github.com/azat/af073e57a248e04488b21068643f079e I've started observing glibc code, there was some changes in glibc, that moves pthread functions out from libpthread.so.0 into libc.so.6 (somewhere between 2.31 and 2.35), but the problem pops up only with 2.36, 2.35 works fine. After this I've looked into changes between 2.35 and 2.36, and found this patch [2] - "dlsym: Make RTLD_NEXT prefer default version definition [BZ ClickHouse#14932]", that fixes this bug [3]. [2]: https://sourceware.org/git/?p=glibc.git;a=commit;h=efa7936e4c91b1c260d03614bb26858fbb8a0204 [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=14932 The problem with using DL_LOOKUP_RETURN_NEWEST flag for RTLD_NEXT is that it does not resolve hidden symbols (and __pthread_mutex_lock is indeed hidden). Here is a sample that will show the difference [4]: $ cat > test-dlsym.c <<EOL #define _GNU_SOURCE #include <dlfcn.h> #include <stdio.h> int main() { void *p = dlsym(RTLD_NEXT, "__pthread_mutex_lock"); printf("__pthread_mutex_lock: %p (via RTLD_NEXT)\n", p); return 0; } EOL # glibc 2.35: __pthread_mutex_lock: 0x7ffff7e27f70 (via RTLD_NEXT) # glibc 2.36: __pthread_mutex_lock: (nil) (via RTLD_NEXT) [4]: https://gist.github.com/azat/3b5f2ae6011bef2ae86392cea7789eb7 But ThreadFuzzer uses internal symbols to wrap pthread_mutex_lock/pthread_mutex_unlock, which are intercepted by ASan and this leads to NULL dereference. The fix was obvious - just use dlsym(RTLD_NEXT), however on older glibc's this leads to endless recursion (see commits in the code). But only for jemalloc [5], and even though sanitizers does not uses jemalloc the code of ThreadFuzzer is generic and I don't want to guard it with more preprocessors macros. [5]: https://gist.github.com/azat/588d9c72c1e70fc13ebe113197883aa2 So we have to use RTLD_NEXT only for ASan. There is also one more interesting issue, if you will compile with clang that itself had been compiled with newer libc (i.e. 2.36), you will get the following error: $ podman run --privileged -v $PWD/.cmake-asan/programs:/root/bin -e PATH=/bin:/root/bin -e --rm -it ubuntu-dev-v3 clickhouse ==1==ERROR: AddressSanitizer failed to allocate 0x0 (0) bytes of SetAlternateSignalStack (error code: 22) ... ==1==End of process memory map. AddressSanitizer: CHECK failed: sanitizer_common.cpp:53 "((0 && "unable to mmap")) != (0)" (0x0, 0x0) (tid=1) <empty stack> The problem is that since GLIBC_2.31, `SIGSTKSZ` is a call to `getconf(_SC_MINSIGSTKSZ)`, but older glibc does not have it, so `-1` will be returned and used as `SIGSTKSZ` instead. The workaround to disable alternative stack: $ podman run --privileged -v $PWD/.cmake-asan/programs:/root/bin -e PATH=/bin:/root/bin -e ASAN_OPTIONS=use_sigaltstack=0 --rm -it ubuntu-dev-v3 clickhouse client --version ClickHouse client version 22.13.1.1. Fixes: ClickHouse#43426 Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
99d69bb
to
bdeb551
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too cool for me🎉.
Just in case, I've tested the binary from CI, at it works:
|
Recently I noticed that clickhouse compiled with ASan does not work with
newer glibc 2.36+, before I though that this was only about compiling
with old but using new, however that was not correct, ASan simply does
not work with glibc 2.36+.
Here is a simple reproducer 1:
I've started observing glibc code, there was some changes in glibc, that
moves pthread functions out from libpthread.so.0 into libc.so.6
(somewhere between 2.31 and 2.35), but
the problem pops up only with 2.36, 2.35 works fine.
After this I've looked into changes between 2.35 and 2.36, and found
this patch 2 - "dlsym: Make RTLD_NEXT prefer default version
definition [BZ https://github.com//pull/14932]", that fixes this bug 3.
The problem with using DL_LOOKUP_RETURN_NEWEST flag for RTLD_NEXT is
that it does not resolve hidden symbols (and __pthread_mutex_lock is
indeed hidden).
Here is a sample that will show the difference 4:
But ThreadFuzzer uses internal symbols to wrap
pthread_mutex_lock/pthread_mutex_unlock, which are intercepted by ASan
and this leads to NULL dereference.
The fix was obvious - just use dlsym(RTLD_NEXT), however on older
glibc's this leads to endless recursion (see commits in the code). But
only for jemalloc 5, and even though sanitizers does not uses jemalloc
the code of ThreadFuzzer is generic and I don't want to guard it with
more preprocessors macros.
So we have to use RTLD_NEXT only for ASan.
There is also one more interesting issue, if you will compile with clang
that itself had been compiled with newer libc (i.e. 2.36), you will get
the following error:
The problem is that since GLIBC_2.31,
SIGSTKSZ
is a call togetconf(_SC_MINSIGSTKSZ)
, but older glibc does not have it, so-1
will be returned and used as
SIGSTKSZ
instead.The workaround to disable alternative stack:
P.S. I think that I will submit a patch to llvm to intercept
__pthread_mutex_lock/__pthread_mutex_unlock correctly anyway (since this
can hit someone else, although it is pretty specific piece of code).
Fixes: #43426
Changelog category (leave one):