Improvements for failover of Distributed queries #6399

Merged
merged 3 commits into from Sep 7, 2019


Enmk
Contributor

@Enmk Enmk commented Aug 8, 2019

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

For changelog. Remove if this is non-significant change.

Category (leave one):

  • Improvement

Short description (up to few sentences):
Improvements for failover of Distributed queries. Replica recovery time is shorter, is now configurable, and is visible in system.clusters.
...

Detailed description (optional):

  • Added a limit on how many errors a replica can accumulate
  • Decreased the default error halving time to 60 seconds
  • Made both configurable via settings: replica_error_max_count and replica_error_decrease_period
  • The error count and estimated recovery time for each replica are now shown in system.clusters
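
To illustrate how a cap and a halving period interact, here is a minimal Python sketch of the idea, not the ClickHouse implementation. The class and field names are hypothetical, the cap of 1000 is an assumed placeholder value, and only the 60-second period comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class ReplicaErrors:
    """Toy model of a per-replica error counter (hypothetical, for illustration)."""
    max_count: int = 1000      # assumed cap, stands in for replica_error_max_count
    decrease_period: int = 60  # seconds per halving, stands in for replica_error_decrease_period
    count: int = 0
    last_decrease: int = 0     # time (seconds) up to which halvings were applied

    def decay(self, now: int) -> None:
        # Halve the counter once for every full period elapsed since the last decrease.
        periods = (now - self.last_decrease) // self.decrease_period
        if periods > 0:
            self.count >>= periods
            self.last_decrease += periods * self.decrease_period

    def record_error(self, now: int) -> None:
        # Apply pending halvings first, then count the new error without exceeding the cap.
        self.decay(now)
        self.count = min(self.count + 1, self.max_count)


replica = ReplicaErrors(max_count=4, decrease_period=60)
for _ in range(10):
    replica.record_error(now=0)   # ten failures, but the counter stops at the cap
capped = replica.count
replica.decay(now=60)             # one period later the counter is halved
halved = replica.count
```

In this model the cap bounds the recovery time: a counter that cannot exceed max_count reaches zero after at most max_count.bit_length() halvings, however long the replica was unreachable.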

Example

SELECT *
FROM system.clusters

┌─cluster──────┬─shard_num─┬─shard_weight─┬─replica_num─┬─host_name─┬─host_address─┬─port─┬─is_local─┬─user────┬─default_database─┬─errors_count─┬─estimated_recovery_time─┐
│ test_cluster │         1 │            1 │           1 │ localhost │ 127.0.0.1    │ 9000 │        1 │ default │                  │            6 │                     166 │
│ test_cluster │         1 │            1 │           2 │ localhost │ 127.0.0.1    │ 9001 │        1 │ default │                  │            0 │                       0 │
│ test_cluster │         1 │            1 │           3 │ localhost │ 127.0.0.1    │ 9002 │        1 │ default │                  │            6 │                     166 │
└──────────────┴───────────┴──────────────┴─────────────┴───────────┴──────────────┴──────┴──────────┴─────────┴──────────────────┴──────────────┴─────────────────────────┘

  • errors_count is the number of times this host tried to reach the replica and failed
  • estimated_recovery_time is the number of seconds left until the replica's error count is zeroed and the replica is considered back to normal

Please note that errors_count is updated once per query to the cluster, whereas estimated_recovery_time is recalculated on demand. It is therefore possible to see a non-zero errors_count together with a zero estimated_recovery_time. In that case the next query will reset the error count to zero and will try to use the replica as if it had no errors.
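
One way to read the numbers in the example table is sketched below in Python. It assumes the counter is halved by integer division once per period, so a count of 6 needs three halvings (6, 3, 1, 0). The function name, its parameters, and the 14 seconds of elapsed time are assumptions chosen to reproduce the value 166, and this is not the actual ClickHouse code.

```python
def estimated_recovery_time(errors_count: int, last_decrease: int, now: int,
                            decrease_period: int = 60) -> int:
    """Seconds until the stored error count would reach zero (hypothetical model)."""
    if errors_count <= 0:
        return 0
    # Integer halving reaches zero after bit_length() steps: 6 -> 3 -> 1 -> 0.
    halvings = errors_count.bit_length()
    remaining = halvings * decrease_period - (now - last_decrease)
    return max(0, remaining)


# 14 seconds after the last decrease, 6 errors need 3 * 60 - 14 seconds to clear.
fresh = estimated_recovery_time(errors_count=6, last_decrease=0, now=14)

# No query has run for 200 seconds: the stored count is still 6,
# but the on-demand estimate is already 0.
stale = estimated_recovery_time(errors_count=6, last_decrease=0, now=200)
```

The second call mirrors the case described above: the stored errors_count is stale until the next query updates it, while the estimate is computed from the current time.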
...

Fixes #5317

@alexey-milovidov alexey-milovidov changed the title Replica recovery fixes Replica failover fixes Aug 8, 2019
@alexey-milovidov alexey-milovidov changed the title Replica failover fixes Improvements for failover of Distributed queries Aug 8, 2019
@alexey-milovidov
Member

Let's merge with master, because we have fixes for integration and performance tests.

@Enmk Enmk force-pushed the replica_recovery_interval branch 3 times, most recently from 0408f0e to 77256fd Compare August 22, 2019 09:20
Review comment on dbms/src/Core/Settings.h (outdated, resolved)
@Enmk Enmk force-pushed the replica_recovery_interval branch 5 times, most recently from c807551 to c1c1d1d Compare August 29, 2019 14:22
* Added a limit on how many errors can replica accumulate
* Decreased default error halving time to 60 seconds
* Made both configurable via settings
* Showing errors count and estimated recovery time for each replica in system.clusters
* Actually using the replica recovery settings for cluster
* A bit of doc on DBMS_CONNECTION_POOL_WITH_FAILOVER_MAX_ERROR_COUNT
* StorageDistributedDirectoryMonitor using settings for ConnectionPoolWithFailover
* Using SettingSeconds instead of SettingUInt64 for replica_error_decrease_period
Review comment on dbms/src/Core/Settings.h (outdated, resolved)
@akuzm
Contributor

akuzm commented Sep 4, 2019

The docs should be updated to describe the new settings and the new fields in system.clusters.

@Enmk
Contributor Author

Enmk commented Sep 4, 2019

Docs updated, settings renamed.

Renamed settings, updated docs.
@akuzm akuzm self-requested a review September 5, 2019 15:45
@alexey-milovidov alexey-milovidov merged commit 25de2e1 into ClickHouse:master Sep 7, 2019
@KochetovNicolai KochetovNicolai added the pr-improvement Pull request with some product improvements label Sep 19, 2019
@Enmk Enmk deleted the replica_recovery_interval branch October 1, 2019 16:36
Successfully merging this pull request may close these issues.

Increase the speed of getting replica back to healthy state