Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ASan builds for glibc 2.36+ #44811

Merged
merged 1 commit into from Jan 23, 2023

Conversation

azat
Copy link
Collaborator

@azat azat commented Dec 31, 2022

Recently I noticed that clickhouse compiled with ASan does not work with
newer glibc 2.36+, before I though that this was only about compiling
with old but using new, however that was not correct, ASan simply does
not work with glibc 2.36+.

Here is a simple reproducer 1:

$ cat > test-asan.cpp <<EOL
#include <pthread.h>
int main()
{
    // something broken in ASan in interceptor for __pthread_mutex_lock
    // and only since glibc 2.36, and for pthread_mutex_lock everything is OK
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    return __pthread_mutex_lock(&mutex);
}
EOL
$ clang -g3 -o test-asan test-asan.cpp -fsanitize=address
$ ./test-asan
AddressSanitizer:DEADLYSIGNAL
=================================================================
==15659==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0x7fffffffccb0 sp 0x7fffffffcb98 T0)
==15659==Hint: pc points to the zero page.
==15659==The signal is caused by a READ memory access.
==15659==Hint: address points to the zero page.
    #0 0x0  (<unknown module>)
    https://github.com/ClickHouse/ClickHouse/pull/1 0x7ffff7cda28f  (/usr/lib/libc.so.6+0x2328f) (BuildId: 1e94beb079e278ac4f2c8bce1f53091548ea1584)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (<unknown module>)
==15659==ABORTING

I've started observing glibc code, there was some changes in glibc, that
moves pthread functions out from libpthread.so.0 into libc.so.6
(somewhere between 2.31 and 2.35), but
the problem pops up only with 2.36, 2.35 works fine.

After this I've looked into changes between 2.35 and 2.36, and found
this patch 2 - "dlsym: Make RTLD_NEXT prefer default version
definition [BZ https://github.com//pull/14932]", that fixes this bug 3.

The problem with using DL_LOOKUP_RETURN_NEWEST flag for RTLD_NEXT is
that it does not resolve hidden symbols (and __pthread_mutex_lock is
indeed hidden).

Here is a sample that will show the difference 4:

$ cat > test-dlsym.c <<EOL
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main()
{
    void *p = dlsym(RTLD_NEXT, "__pthread_mutex_lock");
    printf("__pthread_mutex_lock: %p (via RTLD_NEXT)\n", p);
    return 0;
}
EOL

# glibc 2.35: __pthread_mutex_lock: 0x7ffff7e27f70 (via RTLD_NEXT)
# glibc 2.36: __pthread_mutex_lock: (nil) (via RTLD_NEXT)

But ThreadFuzzer uses internal symbols to wrap
pthread_mutex_lock/pthread_mutex_unlock, which are intercepted by ASan
and this leads to NULL dereference.

The fix was obvious - just use dlsym(RTLD_NEXT), however on older
glibc's this leads to endless recursion (see commits in the code). But
only for jemalloc 5, and even though sanitizers does not uses jemalloc
the code of ThreadFuzzer is generic and I don't want to guard it with
more preprocessors macros.

So we have to use RTLD_NEXT only for ASan.

There is also one more interesting issue, if you will compile with clang
that itself had been compiled with newer libc (i.e. 2.36), you will get
the following error:

$ podman run --privileged -v $PWD/.cmake-asan/programs:/root/bin -e PATH=/bin:/root/bin -e --rm -it ubuntu-dev-v3 clickhouse
==1==ERROR: AddressSanitizer failed to allocate 0x0 (0) bytes of SetAlternateSignalStack (error code: 22)
...
==1==End of process memory map.
AddressSanitizer: CHECK failed: sanitizer_common.cpp:53 "((0 && "unable to mmap")) != (0)" (0x0, 0x0) (tid=1)
    <empty stack>

The problem is that since GLIBC_2.31, SIGSTKSZ is a call to
getconf(_SC_MINSIGSTKSZ), but older glibc does not have it, so -1
will be returned and used as SIGSTKSZ instead.

The workaround to disable alternative stack:

$ podman run --privileged -v $PWD/.cmake-asan/programs:/root/bin -e PATH=/bin:/root/bin -e ASAN_OPTIONS=use_sigaltstack=0 --rm -it ubuntu-dev-v3 clickhouse client --version
ClickHouse client version 22.13.1.1.

P.S. I think that I will submit a patch to llvm to intercept
__pthread_mutex_lock/__pthread_mutex_unlock correctly anyway (since this
can hit someone else, although it is pretty specific piece of code).

Fixes: #43426

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-not-for-changelog This PR should not be mentioned in the changelog label Dec 31, 2022
@azat azat marked this pull request as draft December 31, 2022 22:47
@alexey-milovidov
Copy link
Member

@azat What if we will, instead, do the following:

  1. Implement our own dynamic loader similar to GLibc's one.
  2. Embed this dynamic loader directly into the ClickHouse binary, so it will not depend on ld.so and load as a static binary.
  3. Embed GLibc into the ClickHouse binary as a resource. Teach our own dynamic loader to link to it.

If we do this, we will be a static binary while still using GLibc as a dynamic library (the specific version of our choice and under our control).

It can be an intermediate solution - a substitute for using Musl.

@azat
Copy link
Collaborator Author

azat commented Jan 1, 2023

@alexey-milovidov this sounds like an interesting approach, but for this particular problem it does not wroth it in my opinion, plus this will require some time to implement it and fix all the bugs :)

Also maybe making fully static binaries with Musl will be even easier then doing this, however ASan does not support static linkage, so shared binaries will be still used for ASan.

So I wound say that this should be done not "instead" but in addition to.

@alexey-milovidov
Copy link
Member

Ok, good to me.

…ceptors)

Recently I noticed that clickhouse compiled with ASan does not work with
newer glibc 2.36+, before I though that this was only about compiling
with old but using new, however that was not correct, ASan simply does
not work with glibc 2.36+.

Here is a simple reproducer [1]:

    $ cat > test-asan.cpp <<EOL
    #include <pthread.h>
    int main()
    {
        // something broken in ASan in interceptor for __pthread_mutex_lock
        // and only since glibc 2.36, and for pthread_mutex_lock everything is OK
        pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
        return __pthread_mutex_lock(&mutex);
    }
    EOL
    $ clang -g3 -o test-asan test-asan.cpp -fsanitize=address
    $ ./test-asan
    AddressSanitizer:DEADLYSIGNAL
    =================================================================
    ==15659==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0x7fffffffccb0 sp 0x7fffffffcb98 T0)
    ==15659==Hint: pc points to the zero page.
    ==15659==The signal is caused by a READ memory access.
    ==15659==Hint: address points to the zero page.
        #0 0x0  (<unknown module>)
        ClickHouse#1 0x7ffff7cda28f  (/usr/lib/libc.so.6+0x2328f) (BuildId: 1e94beb079e278ac4f2c8bce1f53091548ea1584)

    AddressSanitizer can not provide additional info.
    SUMMARY: AddressSanitizer: SEGV (<unknown module>)
    ==15659==ABORTING

  [1]: https://gist.github.com/azat/af073e57a248e04488b21068643f079e

I've started observing glibc code, there was some changes in glibc, that
moves pthread functions out from libpthread.so.0 into libc.so.6
(somewhere between 2.31 and 2.35), but
the problem pops up only with 2.36, 2.35 works fine.

After this I've looked into changes between 2.35 and 2.36, and found
this patch [2] - "dlsym: Make RTLD_NEXT prefer default version
definition [BZ ClickHouse#14932]", that fixes this bug [3].

  [2]: https://sourceware.org/git/?p=glibc.git;a=commit;h=efa7936e4c91b1c260d03614bb26858fbb8a0204
  [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=14932

The problem with using DL_LOOKUP_RETURN_NEWEST flag for RTLD_NEXT is
that it does not resolve hidden symbols (and __pthread_mutex_lock is
indeed hidden).

Here is a sample that will show the difference [4]:

    $ cat > test-dlsym.c <<EOL
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>

    int main()
    {
        void *p = dlsym(RTLD_NEXT, "__pthread_mutex_lock");
        printf("__pthread_mutex_lock: %p (via RTLD_NEXT)\n", p);
        return 0;
    }
    EOL

    # glibc 2.35: __pthread_mutex_lock: 0x7ffff7e27f70 (via RTLD_NEXT)
    # glibc 2.36: __pthread_mutex_lock: (nil) (via RTLD_NEXT)

  [4]: https://gist.github.com/azat/3b5f2ae6011bef2ae86392cea7789eb7

But ThreadFuzzer uses internal symbols to wrap
pthread_mutex_lock/pthread_mutex_unlock, which are intercepted by ASan
and this leads to NULL dereference.

The fix was obvious - just use dlsym(RTLD_NEXT), however on older
glibc's this leads to endless recursion (see commits in the code). But
only for jemalloc [5], and even though sanitizers does not uses jemalloc
the code of ThreadFuzzer is generic and I don't want to guard it with
more preprocessors macros.

  [5]: https://gist.github.com/azat/588d9c72c1e70fc13ebe113197883aa2

So we have to use RTLD_NEXT only for ASan.

There is also one more interesting issue, if you will compile with clang
that itself had been compiled with newer libc (i.e. 2.36), you will get
the following error:

    $ podman run --privileged -v $PWD/.cmake-asan/programs:/root/bin -e PATH=/bin:/root/bin -e --rm -it ubuntu-dev-v3 clickhouse
    ==1==ERROR: AddressSanitizer failed to allocate 0x0 (0) bytes of SetAlternateSignalStack (error code: 22)
    ...
    ==1==End of process memory map.
    AddressSanitizer: CHECK failed: sanitizer_common.cpp:53 "((0 && "unable to mmap")) != (0)" (0x0, 0x0) (tid=1)
        <empty stack>

The problem is that since GLIBC_2.31, `SIGSTKSZ` is a call to
`getconf(_SC_MINSIGSTKSZ)`, but older glibc does not have it, so `-1`
will be returned and used as `SIGSTKSZ` instead.

The workaround to disable alternative stack:

    $ podman run --privileged -v $PWD/.cmake-asan/programs:/root/bin -e PATH=/bin:/root/bin -e ASAN_OPTIONS=use_sigaltstack=0 --rm -it ubuntu-dev-v3 clickhouse client --version
    ClickHouse client version 22.13.1.1.

Fixes: ClickHouse#43426
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
@azat azat marked this pull request as ready for review January 20, 2023 12:09
Copy link
Member

@alesapin alesapin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too cool for me🎉.

@azat
Copy link
Collaborator Author

azat commented Jan 20, 2023

Just in case, I've tested the binary from CI, at it works:

azat@s1:~/ch/tmp/44811$ apt-cache policy libc6
libc6:
  Installed: 2.36-5
  Candidate: 2.36-8
  Version table:
     2.36-8 700
        -10 http://ftp.debian.org/debian testing/main amd64 Packages
 *** 2.36-5 100
        100 /var/lib/dpkg/status
azat@s1:~/ch/tmp/44811$ ./clickhouse-asan local -q 'select 1'
1

@alexey-milovidov alexey-milovidov self-assigned this Jan 23, 2023
@alexey-milovidov alexey-milovidov merged commit 62a8de3 into ClickHouse:master Jan 23, 2023
@azat azat deleted the build/glibc2.36-fix branch January 23, 2023 21:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-not-for-changelog This PR should not be mentioned in the changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ASan binaries does not work with glibc 2.36
4 participants