Added a crash log flush to disk after an unexpected crash. #51720
Conversation
Please review the changes and add the "can be tested" label.
This is an automated comment for commit 7b4e7cd with a description of existing statuses. It's updated for the latest CI run.
Not a bug fix.
src/Interpreters/CrashLog.cpp (Outdated)

```cpp
@@ -84,5 +84,6 @@ void collectCrashLog(Int32 signal, UInt64 thread_id, const String & query_id, co

    CrashLogElement element{static_cast<time_t>(time / 1000000000), time, signal, thread_id, query_id, trace, trace_full};
    crash_log_owned->add(element);
+   crash_log_owned->flush(true);
```
This is not the best option, because a crash can happen in multiple threads in a short time.
Missing a comment.
I'd rather keep it as is, or, at most, lower the flush interval to one second.
I will try to test with a flush interval of 1 second, although I prefer the blocking wait over hoping that the crash log has time to reach disk before the forced quit.

Update:

```xml
<crash_log>
    <database>system</database>
    <table>crash_log</table>
    <flush_interval_milliseconds>1000</flush_interval_milliseconds>
</crash_log>
```
Let's check - what are the mechanics, and why does it reproduce? I like the idea in general, but for the fatal error handler, it makes sense to keep it loose - it is intended to have "best effort" behavior, and adding extra logic there can have side effects, e.g., subsequent errors.
When a signal is received, the signal handler runs this loop:

```cpp
for (size_t i = 0; i < 300; ++i)
{
    /// We will synchronize with the thread printing the messages via an atomic variable to finish earlier.
    if (fatal_error_printed.test())
        break;

    /// The signal handler thread waits 1 second per iteration, but that is not enough, because flushing
    /// the logs actually takes more than 1 second (it creates the file, writes it, and closes it).
    sleepForSeconds(1);
}
```
Thank you! Now everything is clear. Until recently, the logic was simpler - sleep for 10 seconds and call the default signal handler. Then it was changed to waiting and calling the default signal handler as soon as the fatal handling thread finished its work. Let's think about what we can do without flushing too aggressively... |
The 10-second wait was originally intended not for the crash_log but for query_log flushing, which means we have to bring it back anyway. Let's modify the signal handler so it sleeps an additional 10 seconds after it learns that the fatal handling thread has finished its work. PS. Dealing with this via sleep might sound stupid, but I think it is the best fit for this particular code - a fatal signal handler...
10 seconds should be enough in most user cases. Perhaps it is possible to improve my solution without blocking:

```cpp
// the code is simplified for readability
template <typename LogElement>
void SystemLogBase<LogElement>::notifyFlush()
{
    std::lock_guard lock(mutex);
    // request an immediate flush, but do not block the calling thread
    requested_flush_up_to = std::max(requested_flush_up_to, this_thread_requested_offset);
    flush_event.notify_all();
    // here we don't wait for the flush to complete
}
```

and in
Update:
What is happening with the integration tests for PostgreSQL? |
I tried to read these logs: https://s3.amazonaws.com/clickhouse-test-reports/51720/2b77196a470e3181d9812e4a7f5ffd769ab6663e/integration_tests__asan__[6_6].html, but I don't see anything to latch on to. The failed test_postgresql_replica_database_engine_1 test has no published logs at all. It is the only failing test (all its test cases fail); the other PostgreSQL tests pass.
The PR is good, and we need to merge it, but we have to address the issue with this test regardless (it cannot be merged without investigation). |
And we still have to figure out which queries in the stress test consume a large amount of memory.
The test failures seem flaky. Now other integration tests fail:
We cannot proceed with failed tests. It is most likely our problem, not yours... but we can hold this PR hostage until we fix all the tests.
Mergeable check passed: |
Closes #51574
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Added a crash log flush to disk after an unexpected crash.