Implementation of gwp-asan technique #36826

Closed
wants to merge 6 commits into from

Conversation

l1tsolaiki
Contributor

@l1tsolaiki l1tsolaiki commented Apr 30, 2022

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Implementation of GWP-Asan (https://www.chromium.org/Home/chromium-security/articles/gwp-asan/) in ClickHouse.

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

Checklist:

  • guarded allocator (implement guarded allocator + use it from new/delete)
  • record allocation metadata for diagnostics on crash (record stack traces on allocation and deallocation)
  • provide report on failure (implement a function to check if error is related to memory allocated with GWP)
  • make GWP configurable (i.e. pass parameters to init)
  • add build option to not include GWP-Asan in build altogether – we discussed with @alexey-milovidov that we do not need that option, it will always be enabled.
  • tests – we will just run regular tests with GWP enabled
  • remove debug stuff
  • code review changes

@CLAassistant

CLAassistant commented Apr 30, 2022

CLA assistant check
All committers have signed the CLA.

@robot-clickhouse robot-clickhouse added the pr-feature Pull request with new product feature label Apr 30, 2022
@l1tsolaiki l1tsolaiki marked this pull request as draft April 30, 2022 13:06
@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Apr 30, 2022
@l1tsolaiki
Contributor Author

@azat You agreed to be tagged in PR. This is a draft PR, but basic allocate/deallocate seem to work locally on my machine.

@qoega
Member

qoega commented Apr 30, 2022

How to enable/use it?

@l1tsolaiki
Contributor Author

@qoega

How to enable/use it?

Right now it is enabled by default. It is initialized at.

It is automatically used for allocations made with new/delete. For every allocation, GWP-Asan's shouldAllocate method is called to randomly sample allocations (although in this draft it currently always returns true). If it returns true, newImpl allocates memory with GWP-Asan's method allocate here.

So overall it is automatically used for every allocation made through new/delete, for which shouldSample returns true.
GWP-Asan also has a fixed number of slots which it allocates on program start. This number is configured with maxSimultaneousAllocations parameter.
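
To illustrate the two mechanisms described above (sampling and a fixed number of slots), here is a minimal sketch. It is not the code from this PR: the class names `SamplingCounter` and `SlotTable` are hypothetical, the countdown is deterministic rather than random, and only the method name `shouldSample` and the idea behind `maxSimultaneousAllocations` are taken from the comment.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the sampling decision: roughly one allocation
// out of every `sample_rate` is routed to the guarded pool.
class SamplingCounter
{
public:
    explicit SamplingCounter(uint32_t sample_rate_)
        : sample_rate(sample_rate_), countdown(sample_rate_) {}

    // Returns true once every `sample_rate` calls. A real implementation
    // would randomize the countdown; this one is deterministic for clarity.
    bool shouldSample()
    {
        if (sample_rate == 0)
            return false; // sampling disabled
        if (--countdown > 0)
            return false;
        countdown = sample_rate;
        return true;
    }

private:
    uint32_t sample_rate;
    uint32_t countdown;
};

// Hypothetical fixed-size slot table: the number of slots is chosen once at
// start-up (cf. maxSimultaneousAllocations) and never grows.
class SlotTable
{
public:
    explicit SlotTable(size_t max_simultaneous_allocations)
        : used(max_simultaneous_allocations, false) {}

    // Returns a free slot index, or -1 when every slot is taken; the caller
    // would then fall back to the regular allocator.
    long reserve()
    {
        for (size_t i = 0; i < used.size(); ++i)
        {
            if (!used[i])
            {
                used[i] = true;
                return static_cast<long>(i);
            }
        }
        return -1;
    }

    void release(long slot) { used[static_cast<size_t>(slot)] = false; }

private:
    std::vector<bool> used;
};
```

With a sample rate of 1 the counter returns true on every call, which matches the "always returns true" behaviour mentioned for the draft.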

At the moment there are also a number of debug printf outputs, which give insight into everything going on inside GWP-Asan.

I should note that right now the Linux build seems to segfault on start, because for some reason (I am in the process of figuring this out) GWP is initialized too late: new/delete are called before its initialization. On my machine (Apple M1), however, everything seems to work correctly.

There is also a simple commented-out test in main.cpp which results in a segmentation fault (as expected) upon reaching the guard page.

@l1tsolaiki
Contributor Author

l1tsolaiki commented May 1, 2022

Ok, looks like I fixed it for linux: had to set __attribute__((init_priority(102))) and __attribute__((init_priority(103))) to __attribute__((init_priority(101))).
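
For readers unfamiliar with the attribute: with GCC and Clang on Linux, namespace-scope objects that carry `init_priority` are constructed in ascending order of the number, and 101 is the smallest value available to user code. The sketch below only demonstrates that ordering; `Recorder`, `early_object` and `late_object` are hypothetical names and have nothing to do with the objects changed in this PR.

```cpp
#include <cstddef>

// Plain arrays and integers are constant-initialized, so they are safe to
// touch from any dynamic initializer regardless of its priority.
static char init_order[4] = {};
static size_t init_count = 0;

struct Recorder
{
    explicit Recorder(char tag) { init_order[init_count++] = tag; }
};

// Declared first, but with a larger number, so it is constructed later.
Recorder late_object __attribute__((init_priority(103))) = Recorder('B');

// Priority 101 is constructed before every other prioritized object, which
// is the position something like an allocator state would need.
Recorder early_object __attribute__((init_priority(101))) = Recorder('A');
```

After static initialization `init_order` starts with 'A' followed by 'B', even though `late_object` is declared first.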

Collaborator

@azat azat left a comment

I see that this is a draft, so just a few brief comments.

Also can you add a checklist into the PR description, to see the progress, something like:

  • guard allocator
  • server settings (like pool size and so on)
  • provide report on failure
  • tests
  • remove debug stuff

src/Common/GuardedPoolAllocator.cpp (outdated, resolved)
src/Common/GuardedPoolAllocator.h (outdated, resolved)
src/Common/GuardedPoolAllocatorState.h (outdated, resolved)
src/Common/GuardedPoolAllocatorState.h (outdated, resolved)
src/Common/GuardedPoolAllocator.h (outdated, resolved)
src/Common/GuardedPoolAllocator.cpp (outdated, resolved)
src/Common/GuardedPoolAllocator.cpp (outdated, resolved)
@l1tsolaiki
Contributor Author

l1tsolaiki commented May 5, 2022

Updated PR. Addressed several comments above in it + added allocation metadata collecting (stack traces on alloc/dealloc).
Note: this code is only for debug/testing purposes, to make sure that stack traces are collected correctly. However, printing the stack trace here causes SIGSEGV on server shutdown for a reason unknown to me. Of course this code will not be present in the final version, but I am a little worried that I will have the same issue when printing the stack trace in the signal handler.

@l1tsolaiki l1tsolaiki marked this pull request as ready for review May 8, 2022 12:56
@l1tsolaiki
Contributor Author

l1tsolaiki commented May 8, 2022

This PR is now ready for review for the most part. There is still work to be done to make it configurable, which I would consider a minor task. The backbone of GWP is done.
Note: I recommend building it with clang-13, because clang-14 uses the DWARF-5 format and AFAIK ClickHouse uses DWARF-4. This causes problems when trying to symbolize (and subsequently output) the stack trace in the report.

As for tests, I am not sure how exactly to test this component. I am open to advice on this part.

Note: this code contains an intentional memory error in programs/Server.cpp to test the report. If your local build does not produce SIGSEGV and the subsequent report, I would recommend increasing the maxSimultaneousAllocations parameter.

CC @azat @qoega

@l1tsolaiki l1tsolaiki changed the title Draft implementation of gwp-asan technique Implementation of gwp-asan technique May 8, 2022
@azat
Collaborator

azat commented May 9, 2022

@l1tsolaiki can you please rebase on top of upstream/master? (there were some changes in some basic headers, and now CI fails to compile your code, because it uses merge head).

@l1tsolaiki
Contributor Author

l1tsolaiki commented May 9, 2022

@l1tsolaiki can you please rebase on top of upstream/master? (there were some changes in some basic headers, and now CI fails to compile your code, because it uses merge head).

Yes, I did that, but apparently I did not push it =/

Just now pushed the rebased version + added tuning of GWP with options passed through the env variable CLICKHOUSE_GWP_ASAN_OPTIONS. This marks completion of this PR as I see it.
You can launch ClickHouse like so: CLICKHOUSE_GWP_ASAN_OPTIONS=help=true ./programs/clickhouse-server to see all available options. Other options are passed in the same way (example: CLICKHOUSE_GWP_ASAN_OPTIONS=sample_rate=8,max_simultaneous_allocations=15000 ./programs/clickhouse-server)
There are also warnings printed if you input some garbage in these options. However, those will probably be removed, because we should probably not use printf this early in execution. In the final version there will probably only be a generic "Error while parsing GWP-ASan options\n".
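
The option string format shown above (comma-separated key=value pairs) could be parsed roughly as follows. This is a sketch under assumptions, not the parser from the PR: `GwpOptions` and `parseGwpOptions` are hypothetical names, the zero defaults are placeholders rather than the PR's defaults, only numeric options named in this thread are recognized, and `help=true` is not handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical option set; the defaults are placeholders, not the PR's values.
struct GwpOptions
{
    bool enabled = true;
    uint64_t sample_rate = 0;
    uint64_t max_simultaneous_allocations = 0;
    uint64_t slot_size = 0;
};

// Parses "key=value,key=value". Returns false on any malformed or unknown
// entry, which a caller could turn into one generic error message.
bool parseGwpOptions(const std::string & text, GwpOptions & out)
{
    size_t pos = 0;
    while (pos < text.size())
    {
        size_t comma = text.find(',', pos);
        if (comma == std::string::npos)
            comma = text.size();
        const std::string item = text.substr(pos, comma - pos);
        pos = comma + 1;

        const size_t eq = item.find('=');
        if (eq == std::string::npos || eq == 0 || eq + 1 == item.size())
            return false;
        const std::string key = item.substr(0, eq);
        const std::string value = item.substr(eq + 1);

        // Accept only plain decimal digits as values.
        if (value.find_first_not_of("0123456789") != std::string::npos)
            return false;
        const uint64_t number = std::strtoull(value.c_str(), nullptr, 10);

        if (key == "enabled")
            out.enabled = (number != 0);
        else if (key == "sample_rate")
            out.sample_rate = number;
        else if (key == "max_simultaneous_allocations")
            out.max_simultaneous_allocations = number;
        else if (key == "slot_size")
            out.slot_size = number;
        else
            return false; // unknown option name
    }
    return true;
}
```

A caller would typically read the string with `std::getenv("CLICKHOUSE_GWP_ASAN_OPTIONS")` and keep the defaults when the variable is unset.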

Also we discussed with @alexey-milovidov that no specific tests are expected here. We'll just run all existing tests and make sure that none of them fail with GWP enabled.

UPD: updated env variable name

@azat
Collaborator

azat commented May 11, 2022

You can launch ClickHouse like so: GWP_ASAN_OPTIONS=help=true ./programs/clickhouse-server to see all available options. Other options are passed in the same way (example: GWP_ASAN_OPTIONS=sample_rate=8,max_simultaneous_allocations=15000 ./programs/clickhouse-server)

The name of the environment variable conflicts with GWP-ASan from LLVM; maybe it should be changed to something like CLICKHOUSE_GWP_ASAN_OPTIONS?

@l1tsolaiki
Contributor Author

You can launch ClickHouse like so: GWP_ASAN_OPTIONS=help=true ./programs/clickhouse-server to see all available options. Other options are passed in the same way (example: GWP_ASAN_OPTIONS=sample_rate=8,max_simultaneous_allocations=15000 ./programs/clickhouse-server)

The name of the environment variable conflicts with GWP-ASan from LLVM; maybe it should be changed to something like CLICKHOUSE_GWP_ASAN_OPTIONS?

Agreed. Changed that in the most recent commit.

Collaborator

@azat azat left a comment

In general LGTM, apart from my comments.

Also, I still think that it is worth adding some tests, i.e. a few examples with different errors (double free, overflow, underflow, ...); later/separately they can be integrated into CI.
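
As a rough illustration of the error kinds listed above, the sketch below classifies an access or a free against per-slot metadata. It is an assumption about how such a check could look, not code from this PR: `ErrorKind`, `AllocationInfo`, `classifyAccess` and `classifyFree` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

enum class ErrorKind { Unknown, UseAfterFree, BufferOverflow, BufferUnderflow, DoubleFree };

// Hypothetical per-slot metadata; field names are illustrative only.
struct AllocationInfo
{
    uintptr_t begin;      // first byte handed to the user
    size_t size;          // requested size in bytes
    bool is_deallocated;  // set once the slot has been freed
};

// Classifies a faulting access that landed in or next to the slot.
ErrorKind classifyAccess(const AllocationInfo & info, uintptr_t fault_addr)
{
    if (info.is_deallocated)
        return ErrorKind::UseAfterFree;
    if (fault_addr < info.begin)
        return ErrorKind::BufferUnderflow;
    if (fault_addr >= info.begin + info.size)
        return ErrorKind::BufferOverflow;
    return ErrorKind::Unknown; // inside a live allocation: not a pool error
}

// Classifies a call to deallocate for the slot.
ErrorKind classifyFree(const AllocationInfo & info)
{
    return info.is_deallocated ? ErrorKind::DoubleFree : ErrorKind::Unknown;
}
```

Each test case could then trigger one error on purpose and compare the reported kind with the expected one.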

programs/server/Server.cpp (outdated, resolved)
src/Common/memory.h (outdated, resolved)
src/Common/GuardedPoolAllocatorOptions.inc (outdated, resolved)
src/Common/GuardedPoolAllocatorOptions.inc (outdated, resolved)
src/Common/GuardedPoolAllocator.cpp (outdated, resolved)
src/Common/GuardedPoolAllocator.cpp (outdated, resolved)
src/Common/GuardedPoolAllocatorCommon.cpp (outdated, resolved)
src/Common/GuardedPoolAllocator.cpp (outdated, resolved)
src/Common/GuardedPoolAllocator.cpp (outdated, resolved)
src/Common/GuardedPoolAllocator.cpp (outdated, resolved)
// printf("user_ptr=%p, slot_end=%p\n", reinterpret_cast<void *>(user_ptr), reinterpret_cast<void *>(slot_end));
// printf("Distance to guard page: %zu\n", slot_end - user_ptr);

// printf("\n");
Collaborator

Let's also add profile events for guarded allocations; this way we can measure the difference between queries with them and without (this will help ensure that this pool does not affect performance tests, for instance).

For this you need to add new events to

M(QueryMemoryLimitExceeded, "Number of times when memory limit exceeded for query.") \

    M(GuardedPoolAllocations, "Number of times when allocation was satisfied from guarded pool.") \
    M(GuardedPoolDeallocations, "Number of times when deallocation was from guarded pool.") \

And then use them here like this (somewhere at the beginning of this file):

#include <Common/ProfileEvents.h>

namespace ProfileEvents
{
    extern const Event GuardedPoolAllocations;
    extern const Event GuardedPoolDeallocations;
}

And in this function:

ProfileEvents::increment(ProfileEvents::GuardedPoolAllocations);

And don't forget to do the same for deallocations.

Contributor Author

Done this, now waiting for result.

@azat
Collaborator

azat commented May 14, 2022

Also, to measure the effect of this in performance tests you can tune the guarded allocator by adding export CLICKHOUSE_GWP_ASAN_OPTIONS=sample_rate=100,slot_size=512 here

export MALLOC_CONF="confirm_conf:true"

Afterwards I can help with interpreting the results of the performance tests.

NOTE: I've set slot_size to 512 to cover allocations up to 2MB, since they should be pretty hot (#12142 (comment))
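
The arithmetic behind the note can be checked at compile time. The sketch assumes that slot_size is counted in pages and that a page is 4 KiB; neither is stated explicitly in the thread, but both are consistent with 512 covering 2 MB. `kAssumedPageSize` and `slotBytes` are hypothetical names.

```cpp
#include <cstddef>

// Assumption: slot_size is a number of pages, and a page is 4 KiB.
constexpr size_t kAssumedPageSize = 4096;

// Bytes covered by one slot of the given size in pages.
constexpr size_t slotBytes(size_t slot_size_in_pages)
{
    return slot_size_in_pages * kAssumedPageSize;
}

static_assert(slotBytes(512) == 2 * 1024 * 1024, "512 pages of 4 KiB are exactly 2 MiB");
static_assert(slotBytes(1) == 4096, "a one-page slot covers 4 KiB");
```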


static constexpr uint64_t kInvalidThreadID = UINT64_MAX;

struct GuardedPoolAllocatorState
Collaborator

Also, it would be great if you added the layout in a comment, including:

  • user pointers
  • guard pages for those user pointers
  • metadata
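
One possible layout, sketched here as address arithmetic, is a leading guard page followed by slots that are each followed by their own guard page, with metadata kept in a separate array indexed by slot number. This follows the general GWP-ASan idea and is an assumption about this PR's layout; `PoolLayout` and its members are hypothetical names, and alignment of user pointers is ignored.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed layout: [guard][slot 0][guard][slot 1][guard] ... [slot N-1][guard]
// Metadata (sizes, stack traces) would live in a separate array indexed by slot.
struct PoolLayout
{
    uintptr_t pool_begin;
    size_t page_size;
    size_t slot_pages; // pages per slot
    size_t num_slots;

    // Distance between the starts of two consecutive slots (slot + its guard).
    size_t slotStride() const { return (slot_pages + 1) * page_size; }
    uintptr_t slotBegin(size_t i) const { return pool_begin + page_size + i * slotStride(); }
    uintptr_t slotEnd(size_t i) const { return slotBegin(i) + slot_pages * page_size; }
    uintptr_t poolEnd() const { return pool_begin + page_size + num_slots * slotStride(); }

    // True if the address falls on one of the inaccessible guard pages.
    bool isGuardPage(uintptr_t addr) const
    {
        if (addr < pool_begin || addr >= poolEnd())
            return false;
        if (addr < pool_begin + page_size)
            return true; // leading guard page
        const size_t offset = (addr - pool_begin - page_size) % slotStride();
        return offset >= slot_pages * page_size;
    }

    // User pointer placed so that the allocation ends exactly at the guard
    // page, which makes the first out-of-bounds byte fault immediately.
    uintptr_t userPtrRightAligned(size_t i, size_t size) const { return slotEnd(i) - size; }
};
```

For example, with 4 KiB pages, one page per slot and a pool starting at 0x10000, slot 0 spans 0x11000 to 0x12000 and the page starting at 0x12000 is a guard page.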

@azat
Collaborator

azat commented May 18, 2022

Performance Comparison (actions) [1/4] — Failed to parse the report.

2022-05-17 03:30:43      right/scripts/compare.sh: line 19:   536 Segmentation fault      (core dumped) clickhouse-client --port "$1" --query "select 1"

@l1tsolaiki
Contributor Author

Performance Comparison (actions) [1/4] — Failed to parse the report.

2022-05-17 03:30:43      right/scripts/compare.sh: line 19:   536 Segmentation fault      (core dumped) clickhouse-client --port "$1" --query "select 1"

Yes, I noticed that too. Will try to investigate tomorrow.

@l1tsolaiki
Contributor Author

l1tsolaiki commented May 25, 2022

Also, to measure the effect of this in performance tests you can tune the guarded allocator by adding export CLICKHOUSE_GWP_ASAN_OPTIONS=sample_rate=100,slot_size=512 here

export MALLOC_CONF="confirm_conf:true"

Afterwards I can help with interpreting the results of the performance tests.

NOTE: I've set slot_size to 512 to cover allocations up to 2MB, since they should be pretty hot (#12142 (comment))

@azat I've added ProfileEvents, where do we look at results now?
P.S. I see errors in tests, but those are not connected to GWP as far as I can see.
P.P.S. The previous segmentation fault was due to the fact that GWP was initialized before ProfileEvents, so ProfileEvents::increment() caused an error.

@l1tsolaiki l1tsolaiki force-pushed the gwp-asan branch 2 times, most recently from 98f5c19 to 85d88e7 Compare May 30, 2022 14:42
Draft implementation of gwp-asan technique

Update

Remove test example from main

Uncomment one line

Uncomment another line :(

Fix init_priotiy: set it to 101

Update

Add report

Add GWP-ASan options through env and comment debug printfs

Remove unused member variable

Remove error

Include missing headers

Remove intentional error from Server.cpp; maybe it causes error on FastRun?

Uncomment some debug output

Ahhh I see, I need to remove all debug output for tests to work

Remove some more debug output

Uncommone isPowerOfTwo and add [[maybe_unused]] + fix one more problem

Create class CrashReporter and move diagnostics functions into it + fix style

Set allocator_ptr in constructor

Fix typo

Add debug printf in case of nullptr

Add check for nullptr

Tune default sample_rate and fix typos

Address comments from PR + some fixes

Remove use after free demonstratioin

Remove almost all debug stuff

Remove some more stuff + add checks for options in init

Apply suggestions from code review

Co-authored-by: Azat Khuzhin <a3at.mail@gmail.com>

Fix mistake in option description

First bulk of review changes

Second bulk of review changes

Third bulk of review changes

Fourth bulk of review changes

Fifth bulk of review changes

One more

Small fix

Small fix

Rewrite gwp options without double includes

Address some review comments

Add profile events + minor review comments addressed

Set aggressive settings for gwp in perf comparison tests

Save progress

Remove redundant addrToMetadata

One try

Add memory tracking for gwp

Style

BuilderBinTidy fixes

Minor review fixes

Move stop of alloator, change snprintf -> fmt::format
@azat
Collaborator

azat commented Jun 2, 2022

Here are some numbers with different options (CLICKHOUSE_GWP_ASAN_OPTIONS)
I see ~2% overhead for the default setup. It does not look significant, but I did not expect the default sampling settings to give >1% overhead.
So maybe it should be turned OFF by default for now, until the performance is investigated.

TL;DR;

| quantile | server | env | avg (sec) | dev |
| --- | --- | --- | --- | --- |
| 90.000% | upstream | | 3.21 | 0.000036 |
| 90.000% | pr | enabled=0 | 3.07 | 0.000274 |
| 90.000% | pr | sample_rate=100,slot_size=512,max_simultaneous_allocations=512,enabled=1 | 3.36 | 0.003614 |
| 90.000% | pr | sample_rate=100000,slot_size=1,max_simultaneous_allocations=32,enabled=1 | 3.12 | 0.000469 |
| 95.000% | upstream | | 3.23 | 0.000149 |
| 95.000% | pr | enabled=0 | 3.09 | 0.000352 |
| 95.000% | pr | sample_rate=100,slot_size=512,max_simultaneous_allocations=512,enabled=1 | 3.4 | 0.002172 |
| 95.000% | pr | sample_rate=100000,slot_size=1,max_simultaneous_allocations=32,enabled=1 | 3.16 | 0.000397 |

It is also interesting that upstream is slower here; maybe my environment was not stable, I need to verify.

@@ -113,6 +113,9 @@ function restart
# https://github.com/jemalloc/jemalloc/wiki/Getting-Started
export MALLOC_CONF="confirm_conf:true"

# Temporary to measure effect GWP has on performance
export CLICKHOUSE_GWP_ASAN_OPTIONS="sample_rate=100,slot_size=512,max_simultaneous_allocations=512"
Collaborator

I have not looked into the performance tests deeply (will do this later), but it seems that they are fine even with a higher sample rate.
Let's remove this line for now then and see how they differ.

@hanfei1991 hanfei1991 self-assigned this Dec 7, 2022
@Enna1

Enna1 commented Jan 9, 2023

Hi, what's the status of this PR? Is there any plan or progress on enabling this gwp-asan technique? :)

@alexey-milovidov
Member

This PR is unfinished, but it is on the list of our tasks for hardening and reliability - we want to continue and merge it.

@hanfei1991
Member

I will continue the work in a few days. Thanks for your patience!

@l1tsolaiki
Contributor Author

l1tsolaiki commented Jan 9, 2023

As far as I remember, the only thing left to do here was to fix style

@alexey-milovidov
Member

Superseded by #45226.

Labels
can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature

8 participants