Improvements for failover of Distributed queries #6399

Enmk · 2019-08-08T10:26:06Z

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

For changelog. Remove if this is non-significant change.

Category (leave one):

Improvement

Short description (up to few sentences):
Improved failover of Distributed queries: replica recovery time is shorter, is now configurable, and is visible in system.clusters.
...

Detailed description (optional):

Added a limit on how many errors a replica can accumulate.
Decreased the default error halving time to 60 seconds.
Made both configurable via settings: replica_error_max_count and replica_error_decrease_period.
Added the error count and estimated recovery time for each replica to system.clusters.
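
The behaviour listed above can be illustrated with a minimal sketch. It is not the ClickHouse implementation: the class name, field names, and the illustrative cap value are assumptions made for this example, and it assumes only that the counter is capped at a maximum and halved once per elapsed decrease period.

```python
from dataclasses import dataclass


@dataclass
class ReplicaErrorTracker:
    """Hypothetical per-replica error bookkeeping (illustration only)."""
    max_count: int = 1000        # illustrative cap, stands in for replica_error_max_count
    decrease_period: int = 60    # seconds per halving, stands in for replica_error_decrease_period
    errors_count: int = 0
    last_decrease_time: int = 0  # timestamp (seconds) of the last applied halving

    def decay(self, now: int) -> None:
        """Halve the counter once for every full period elapsed since the last halving."""
        if self.decrease_period <= 0:
            # Assumption: a zero period means errors are forgotten immediately.
            self.errors_count = 0
            self.last_decrease_time = now
            return
        periods = (now - self.last_decrease_time) // self.decrease_period
        if periods > 0:
            self.errors_count >>= periods
            # Move the timestamp only when a halving was applied; otherwise
            # frequent calls would keep postponing the decay.
            self.last_decrease_time = now

    def record_failure(self, now: int) -> None:
        """Apply pending decay, then count one more failure without exceeding the cap."""
        self.decay(now)
        self.errors_count = min(self.errors_count + 1, self.max_count)


tracker = ReplicaErrorTracker(max_count=8, decrease_period=60)
for _ in range(20):
    tracker.record_failure(now=0)
print(tracker.errors_count)  # capped at 8 despite 20 failures
tracker.decay(now=60)
print(tracker.errors_count)  # one period elapsed, so the count is halved to 4
```

With a cap in place, a replica that failed many times still needs only a bounded number of halvings before it is tried again, which is what shortens recovery.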

Example

SELECT *
FROM system.clusters

┌─cluster──────┬─shard_num─┬─shard_weight─┬─replica_num─┬─host_name─┬─host_address─┬─port─┬─is_local─┬─user────┬─default_database─┬─errors_count─┬─estimated_recovery_time─┐
│ test_cluster │         1 │            1 │           1 │ localhost │ 127.0.0.1    │ 9000 │        1 │ default │                  │            6 │                     166 │
│ test_cluster │         1 │            1 │           2 │ localhost │ 127.0.0.1    │ 9001 │        1 │ default │                  │            0 │                       0 │
│ test_cluster │         1 │            1 │           3 │ localhost │ 127.0.0.1    │ 9002 │        1 │ default │                  │            6 │                     166 │
└──────────────┴───────────┴──────────────┴─────────────┴───────────┴──────────────┴──────┴──────────┴─────────┴──────────────────┴──────────────┴─────────────────────────┘

errors_count is the number of times this host tried to reach the replica but failed.
estimated_recovery_time is the number of seconds left until the replica's error count is zeroed and the replica is considered back to normal.

Note that errors_count is updated once per query to the cluster, whereas estimated_recovery_time is recalculated on demand. A row can therefore show a non-zero errors_count together with a zero estimated_recovery_time; this means the next query will zero the error count and use the replica as if it had no errors.
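
One way to read the two columns together is sketched below. The helper name and the formula are assumptions for illustration, not ClickHouse code: the sketch assumes the stored count is halved (integer division) once per decrease period, so a count n needs n.bit_length() halvings to reach zero, and the estimate is that many periods minus the time already elapsed. Under these assumptions, the rows with errors_count 6 and estimated_recovery_time 166 in the example are consistent with a 60-second period read 14 seconds after the last update.

```python
def estimated_recovery_time(errors_count: int, decrease_period: int, elapsed: int) -> int:
    """Seconds until a stored error count would decay to zero (hypothetical helper).

    errors_count    -- value stored at the last query to the cluster
    decrease_period -- seconds per halving
    elapsed         -- seconds since the stored count was last updated
    """
    # Each halving drops one bit, so a count n reaches zero after n.bit_length() halvings.
    halvings_needed = errors_count.bit_length()
    return max(0, halvings_needed * decrease_period - elapsed)


# Stored count of 6 with a 60-second period, read 14 seconds after the last update.
print(estimated_recovery_time(6, 60, 14))   # prints 166

# No query has refreshed the stored count, so it is still 6, yet the estimate is 0:
# the next query would zero the counter and treat the replica as healthy.
print(estimated_recovery_time(6, 60, 200))  # prints 0
```

The second call reproduces the case described in the note: the count is only refreshed by a query, while the estimate depends on the current time.
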
...

Fixes #5317

alexey-milovidov · 2019-08-11T02:19:08Z

Let's merge with master, because we have fixes for integration and performance tests.

* Added a limit on how many errors can replica accumulate
* Decreased default error halving time to 60 seconds
* Made both configurable via settings
* Showing errors count and estimated recovery time for each replica in system.clusters

* Actually using the replica recovery settings for cluster
* A bit of doc on DBMS_CONNECTION_POOL_WITH_FAILOVER_MAX_ERROR_COUNT
* StorageDistributedDirectoryMonitor using settings for ConnectionPoolWithFailover
* Using SettingSeconds instead of SettingUInt64 for replica_error_decrease_period

akuzm · 2019-09-04T12:48:23Z

The docs should be updated to describe the new settings and the new fields in system.clusters.

Enmk · 2019-09-04T19:13:30Z

Docs updated, settings renamed.

Renamed settings, updated docs.

alexey-milovidov added the can be tested label Aug 8, 2019

alexey-milovidov changed the title ~~Replica recovery fixes~~ Replica failover fixes Aug 8, 2019

alexey-milovidov changed the title ~~Replica failover fixes~~ Improvements for failover of Distributed queries Aug 8, 2019

Enmk force-pushed the replica_recovery_interval branch 3 times, most recently from 0408f0e to 77256fd on August 22, 2019 09:20

akuzm reviewed Aug 22, 2019

dbms/src/Core/Settings.h (outdated)

akuzm reviewed Aug 22, 2019

dbms/src/Core/Defines.h

akuzm reviewed Aug 23, 2019

dbms/src/Client/ConnectionPoolWithFailover.h

Enmk force-pushed the replica_recovery_interval branch 5 times, most recently from c807551 to c1c1d1d on August 29, 2019 14:22

Enmk added 2 commits September 2, 2019 17:26

Replica recovery fixes

84fc4ba

Enmk force-pushed the replica_recovery_interval branch from c1c1d1d to f98c488 on September 2, 2019 15:18

akuzm reviewed Sep 4, 2019

dbms/src/Core/Settings.h (outdated)

Enmk force-pushed the replica_recovery_interval branch from 12f82af to f248726 on September 4, 2019 15:20

akuzm reviewed Sep 4, 2019

docs/en/operations/system_tables.md (outdated)

Post-PR fixes

c2fc71b

Enmk force-pushed the replica_recovery_interval branch from f248726 to c2fc71b on September 5, 2019 10:36

akuzm self-requested a review September 5, 2019 15:45

akuzm approved these changes Sep 5, 2019

alexey-milovidov merged commit 25de2e1 into ClickHouse:master Sep 7, 2019

alexey-milovidov approved these changes Sep 7, 2019

KochetovNicolai added the pr-improvement label ("Pull request with some product improvements") Sep 19, 2019

Enmk deleted the replica_recovery_interval branch October 1, 2019 16:36

filimonov added the altinity label Oct 2, 2019

azat mentioned this pull request May 13, 2020

Balance the query load between replicas in a shard #10564

Closed
