New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementation of gwp-asan technique #36826
Conversation
@azat You agreed to be tagged in PR. This is a draft PR, but basic allocate/deallocate seem to work locally on my machine. |
How to enable/use it? |
Right now it is enabled by default. It is initialized at. It is automatically used for allocations with new/delete. For every allocation GWP-Asan's So overall it is automatically used for every allocation made through new/delete, for which At this moment threre are also a number of debug printf outputs, which give an insight into everything going on inside of GWP-Asan. I should note that as of right now linux build seems to produce segmentation fault on start, because for some reason (I am in process of figuring this out) GWP is initialized too late: new/delete are called before its initialization. Although on my machine (Apple M1) everything seems to be working correctly. There is also a simple commented out test in main.cpp which results in segmentation fault (as expected) upon reaching guard page. |
Ok, looks like I fixed it for linux: had to set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that this is a draft, so just a few brief comments.
Also can you add a checklist into the PR description, to see the progress, something like:
- guard allocator
- server settings (like pool size and so on)
- provide report on failure
- tests
- remove debug stuff
Updated PR. Addressed several comments above in it + added allocation metadata collecting (stack traces on alloc/dealloc). |
This PR is now ready for review for the most part. There is still work to be done to make it configurable, which I would consider to be a minor task. Backbone of GWP is done. As for tests, I am not sure how exactly to test this component. I am open to advice on this part. Note: this code contains intentional memory error in programs/Server.cpp to test report. If your local build does not produce SIGSEGV and subsequent report, I would recommend increasing maxSimultaneousAllocations parameter. |
@l1tsolaiki can you please rebase on top |
Yes, I did that, but apparently I did not push it =/ Just now pushed rebased version + added tuning GWP with options passed through env variable Also we discussed with @alexey-milovidov that no specific tests are expected here. We'll just run all existing tests and make sure that none of them fail with GWP enabled. UPD: updated env variable name |
Name of the environment variable conflicts with GWP ASan from LLVM, maybe it should be changed to something like |
Agreed. Changed that in the most recent commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general LGTM, apart from my comments.
Also I still think that it worth adding some tests, i.e. few examples with different errors (double free, overflow, underflow, ...), later/separately they can be integrated to CI.
src/Common/GuardedPoolAllocator.cpp
Outdated
// printf("user_ptr=%p, slot_end=%p\n", reinterpret_cast<void *>(user_ptr), reinterpret_cast<void *>(slot_end)); | ||
// printf("Distance to guard page: %zu\n", slot_end - user_ptr); | ||
|
||
// printf("\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also add profile events for guarded allocations, this way we can measure the difference between queries with them and w/o (this will help ensuring that this pool does not affects performance tests for instance).
For this you need to add new events to
ClickHouse/src/Common/ProfileEvents.cpp
Line 261 in ef35fa6
M(QueryMemoryLimitExceeded, "Number of times when memory limit exceeded for query.") \ |
M(GuardedPoolAllocations, "Number of times when allocation was satisfied from guarded pool.") \
M(GuardedPoolDeallocations, "Number of times when deallocation was from guarded pool.") \
And them use them here like (somewhere in the beginning of this file):
#include <Common/ProfileEvents.h>
namespace ProfileEvents
{
extern const Event GuardedPoolAllocations;
extern const Event GuardedPoolDeallocations;
}
And in this function:
ProfileEvents::increment(ProfileEvents::GuardedPoolAllocations);
And don't forget to do the same for deallocations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done this, now waiting for result.
Also to measure the effect of this in performance tests you can tune guard allocator, by adding
After I can help with interpreting the results of performance tests. NOTE: I've set |
|
||
static constexpr uint64_t kInvalidThreadID = UINT64_MAX; | ||
|
||
struct GuardedPoolAllocatorState |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also it will be great if you will add layout in the comment, that will include:
- user pointers
- guard pages for that user pointers
- metadata
|
Yes, I noticed that too. Will try to investigate tomorrow. |
@azat I've added ProfileEvents, where do we look at results now? |
98f5c19
to
85d88e7
Compare
Draft implementation of gwp-asan technique Update Remove test example from main Uncomment one line Uncomment another line :( Fix init_priotiy: set it to 101 Update Add report Add GWP-ASan options through env and comment debug printfs Remove unused member variable Remove error Include missing headers Remove intentional error from Server.cpp; maybe it causes error on FastRun? Uncomment some debug output Ahhh I see, I need to remove all debug output for tests to work Remove some more debug output Uncommone isPowerOfTwo and add [[maybe_unused]] + fix one more problem Create class CrashReporter and move diagnostics functions into it + fix style Set allocator_ptr in constructor Fix typo Add debug printf in case of nullptr Add check for nullptr Tune default sample_rate and fix typos Address comments from PR + some fixes Remove use after free demonstratioin Remove almost all debug stuff Remove some more stuff + add checks for options in init Apply suggestions from code review Co-authored-by: Azat Khuzhin <a3at.mail@gmail.com> Fix mistake in option description First bulk of review changes Second bulk of review changes Third bulk of review changes Fourth bulk of review changes Fifth bulk of review changes One more Small fix Small fix Rewrite gwp options without double includes Address some review comments Add profile events + minor review comments addressed Set aggressive settings for gwp in perf comparison tests Save progress Remove redundant addrToMetadata One try Add memory tracking for gwp Style BuilderBinTidy fixes Minor review fixes Move stop of alloator, change snprintf -> fmt::format
Here are some numbers with different options ( TL;DR;
Also interesting moment that upstream is slower here, maybe my environment was not stable, need to verify. |
@@ -113,6 +113,9 @@ function restart | |||
# https://github.com/jemalloc/jemalloc/wiki/Getting-Started | |||
export MALLOC_CONF="confirm_conf:true" | |||
|
|||
# Temporary to measure effect GWP has on performance | |||
export CLICKHOUSE_GWP_ASAN_OPTIONS="sample_rate=100,slot_size=512,max_simultaneous_allocations=512" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not looked into performance tests deeply (will do this later), but seems that they are fine even with more higher sample rate.
Let's remove this line for now them and see how they will differs.
Hi, what's the status about this PR? Is there any plan or progress to enable this gwp-asan technique? :) |
This PR is unfinished, but it is on the list of our tasks for hardening and reliability - we want to continue and merge it. |
I will continue the work in a few days. Thanks for patience! |
As far as I remember, the only thing left to do here was to fix style |
Superseded by #45226. |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Implementation of GWP-Asan (https://www.chromium.org/Home/chromium-security/articles/gwp-asan/) in ClickHouse.
Checklist:
guarded allocator (implement guarded allocator + use it from new/delete)record allocation metadata for diagnostics on crash (record stack traces on allocation and deallallocation)provide report on failure (implement a function to check if error is related to memory allocated with GWP)make GWP configurable (i.e pass parameters to init)add build option to not include GWP-Asan in build altoughether– we discussed @alexey-milovidov that we do not need that option, it will always be enabled.tests– we will just run regular tests with GWP enabledremove debug stuff