DatabaseReplicated: fix DDL query timeout after recovering a replica #56796
@@ -1083,12 +1083,14 @@ void DatabaseReplicated::recoverLostReplica(const ZooKeeperPtr & current_zookeep
}
LOG_INFO(log, "All tables are created successfully");

if (max_log_ptr_at_creation != 0)
chassert(max_log_ptr_at_creation || our_log_ptr);
@tavplubix sorry for the dumb question, but if a new replica died between the database being created and this line, then the next time the replica comes up it will always be stuck on this line, right? max_log_ptr_at_creation is always 0 (because the replica now attaches the database instead of creating it) and our_log_ptr is also 0. I ran into this situation while trying out a code change, but I think it could also happen in practice?
It's a good question, you have found a bug! Yes, debug builds and builds with sanitizers will always crash on this line with SIGABRT. Release builds ignore assertions, so they will continue executing this code and will not create the nodes in ZooKeeper. As a result, some DDL queries (those that started while the replica was being created) may throw Watching task {} is executing longer than distributed_ddl_task_timeout ...
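To make the three cases in this thread concrete, here is a minimal standalone sketch of how the two pointers interact with the assertion. It is not ClickHouse code: `Outcome`, `afterRecovery`, and `checkScenarios` are hypothetical names, and the rule that a release build creates the nodes only when it has a non-zero pointer to start from is an assumption based on the explanation above, not something taken from the source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the step discussed above; none of these names are ClickHouse APIs.
struct Outcome
{
    bool debug_build_aborts;    // would a debug or sanitizer build fail the assertion?
    bool release_creates_nodes; // would a release build go on to create the ZooKeeper nodes?
};

Outcome afterRecovery(std::uint32_t max_log_ptr_at_creation, std::uint32_t our_log_ptr)
{
    // The condition from the diff: at least one of the two pointers is expected to be non-zero.
    const bool assertion_holds = max_log_ptr_at_creation || our_log_ptr;

    // Assumption: a release build skips the assertion and creates the nodes only
    // when it has some non-zero log pointer to start from.
    return Outcome{!assertion_holds, assertion_holds};
}

// Usage: the three situations mentioned in the discussion.
inline void checkScenarios()
{
    // Freshly created replica: max_log_ptr_at_creation is known.
    assert(!afterRecovery(5, 0).debug_build_aborts);

    // Stale replica recovering: it already has its own log pointer.
    assert(!afterRecovery(0, 7).debug_build_aborts);

    // Replica that died right after creation and now attaches the database: both are 0.
    assert(afterRecovery(0, 0).debug_build_aborts);
    assert(!afterRecovery(0, 0).release_creates_nodes);
}
```

The last case is the one raised in the question: with both values at 0, the model aborts in a debug build and silently skips node creation in a release build, which matches the timeout symptom described above.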
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Some distributed DDL queries might show an error like
Watching task ... is executing longer than distributed_ddl_task_timeout ... There are 1 unfinished hosts (0 of them are currently active), they are going to execute the query in background
if some replica was recovering from a stale state when the query started. It's fixed.
Fixes test_replicated_database/test.py::test_alters_from_different_replicas: https://pastila.nl/?0024df70/980a4c7882649f60e9174d3d46f85cc0#EB/kiQAvQYMP+9CfoC4LTg==