Description
A 6-node Keeper ensemble (latest version) with this configuration: the first 3 nodes are voting members (leader and followers) and the last 3 are non-voting members (learners, in NuRaft terms).
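For reference, the raft_configuration section of keeper_config.xml looked roughly like this. This is a sketch reconstructed from the hostnames and ports in the logs below (the actual file is not included in the report); the can_become_leader flag is what marks servers 4-6 as learners:

```xml
<keeper_server>
    <raft_configuration>
        <!-- Voting members -->
        <server><id>1</id><hostname>keeper-01</hostname><port>9234</port></server>
        <server><id>2</id><hostname>keeper-02</hostname><port>9234</port></server>
        <server><id>3</id><hostname>keeper-03</hostname><port>9234</port></server>
        <!-- Learners (non-voting) -->
        <server><id>4</id><hostname>keeper-04</hostname><port>9234</port><can_become_leader>false</can_become_leader></server>
        <server><id>5</id><hostname>keeper-05</hostname><port>9234</port><can_become_leader>false</can_become_leader></server>
        <server><id>6</id><hostname>keeper-06</hostname><port>9234</port><can_become_leader>false</can_become_leader></server>
    </raft_configuration>
</keeper_server>
```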
Description of the problem
When attempting a manual DR failover that switches over to the 3 learners, the rcvr command does not work as expected: it fails to force recovery on the last 3 nodes.
Steps to reproduce
Initiate the DR reconfiguration by removing/stopping the first 3 nodes (keeper-01 to keeper-03) and reconfiguring the 3 learners as voting members (removing the can_become_leader tag and reloading the config file). This is the log of keeper-04, which was sent the rcvr command:
<Information> Application: Ready for connections.
2024.04.17 12:27:31.000711 [ 40 ] {} <Trace> AsynchronousMetrics: MemoryTracking: was 26.93 MiB, peak 31.92 MiB, free memory in arenas 7.54 MiB, will set to 30.93 MiB (RSS), difference: 4.00 MiB
2024.04.17 12:28:40.424033 [ 19 ] {} <Trace> KeeperTCPHandlerFactory: Keeper request. Address: 172.19.0.1:49394
2024.04.17 12:28:40.424168 [ 19 ] {} <Debug> KeeperTCPHandler: Receive four letter command mntr
2024.04.17 12:29:03.969696 [ 34 ] {} <Debug> KeeperStateMachine: Session ID response 1 with timeout 10000
2024.04.17 12:29:14.000411 [ 40 ] {} <Trace> AsynchronousMetrics: MemoryTracking: was 31.07 MiB, peak 34.07 MiB, free memory in arenas 3.77 MiB, will set to 26.04 MiB (RSS), difference: -5.02 MiB
2024.04.17 12:32:13.980674 [ 31 ] {} <Error> RaftInstance: session 1 failed to read rpc header from socket ::ffff:172.19.0.6:39562 due to error 2, End of file, ref count 1
2024.04.17 12:32:30.002436 [ 40 ] {} <Trace> AsynchronousMetrics: Scanning /sys/block
2024.04.17 12:33:30.486563 [ 41 ] {} <Debug> ConfigReloader: Loading config '/etc/clickhouse-keeper/keeper_config.xml'
2024.04.17 12:33:30.486689 [ 41 ] {} <Debug> ConfigProcessor: Processing configuration file '/etc/clickhouse-keeper/keeper_config.xml'.
2024.04.17 12:33:30.487735 [ 41 ] {} <Debug> ConfigProcessor: Saved preprocessed configuration to '/var/lib/clickhouse-keeper/preprocessed_configs/keeper_config.xml'.
2024.04.17 12:33:30.487765 [ 41 ] {} <Debug> ConfigReloader: Loaded config '/etc/clickhouse-keeper/keeper_config.xml', performing update on configuration
2024.04.17 12:33:30.487850 [ 41 ] {} <Information> Application: keeper_server.max_memory_usage_soft_limit is set to 13.77 GiB
2024.04.17 12:33:30.488955 [ 41 ] {} <Warning> KeeperDispatcher: Configuration changed for more than one server (3) from cluster, it's strictly not recommended
2024.04.17 12:33:30.489006 [ 37 ] {} <Information> KeeperServer: Will try to wait for (Remove server 3)
2024.04.17 12:33:30.489041 [ 41 ] {} <Information> CertificateReloader: One of paths is empty. Cannot apply new configuration for certificates. Fill all paths and try again.
2024.04.17 12:33:30.489051 [ 41 ] {} <Debug> ConfigReloader: Loaded config '/etc/clickhouse-keeper/keeper_config.xml', performed update on configuration
2024.04.17 12:33:31.000386 [ 40 ] {} <Trace> AsynchronousMetrics: MemoryTracking: was 26.08 MiB, peak 34.07 MiB, free memory in arenas 5.07 MiB, will set to 31.20 MiB (RSS), difference: 5.13 MiB
2024.04.17 12:35:15.491090 [ 37 ] {} <Information> KeeperDispatcher: Cannot wait for configuration update, maybe we became leader or maybe update is invalid, will try to wait one more time
2024.04.17 12:35:15.491227 [ 37 ] {} <Information> KeeperServer: Will try to wait for (Remove server 3)
2024.04.17 12:35:26.082910 [ 19 ] {} <Trace> KeeperTCPHandlerFactory: Keeper request. Address: 172.19.0.1:36670
2024.04.17 12:35:26.083025 [ 19 ] {} <Debug> KeeperTCPHandler: Receive four letter command rcvr
2024.04.17 12:35:26.083041 [ 19 ] {} <Warning> KeeperServer: This instance is in recovery mode. Until the quorum is restored, no requests should be sent to any of the cluster instances. This instance will start accepting requests only when the recovery is finished.
2024.04.17 12:35:26.083089 [ 19 ] {} <Information> RaftInstance: parameters: election timeout range 1000 - 2000, heartbeat 500, leadership expiry 10000, max batch 100, backoff 50, snapshot distance 100000, enable randomized snapshot creation NO, log sync stop gap 99999, reserved logs 100000, client timeout 10000, auto forwarding on, API call type async, custom commit quorum size 1, custom election quorum size 1, snapshot receiver included, leadership transfer wait time 0, grace period of lagging state machine 0, snapshot IO: blocking, parallel log appending: on
2024.04.17 12:35:26.083115 [ 19 ] {} <Information> RaftInstance: new election timeout range: 1000 - 2000
2024.04.17 12:35:29.491931 [ 37 ] {} <Information> KeeperDispatcher: Cannot wait for configuration update, maybe we became leader or maybe update is invalid, will try to wait one more time
2024.04.17 12:35:29.492049 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:35:34.492210 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:35:39.492427 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:35:44.492637 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:35:49.492813 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:35:54.493029 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:35:59.493241 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:36:04.493452 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:36:09.493664 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:36:14.493874 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:36:19.494083 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:36:24.494297 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:36:29.494510 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:36:34.494703 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:36:39.494918 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:36:44.495096 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:36:49.495299 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:36:54.495504 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:36:59.495720 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:37:04.495907 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:37:09.496090 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:37:14.496299 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:37:19.496507 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:37:24.496693 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
2024.04.17 12:37:29.496851 [ 37 ] {} <Information> KeeperDispatcher: Server is recovering, will not apply configuration until recovery is finished
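For context, the rcvr four-letter command above was sent over Keeper's client port (9181, per the mntr request earlier in the log). Assuming nc is available, the invocation looks roughly like this:

```shell
# Ask keeper-04 to enter recovery mode: it should drop the old cluster
# configuration and wait to form a new quorum from the servers in its
# (reloaded) config file.
echo rcvr | nc keeper-04 9181

# Afterwards, poll recovery/quorum state with other four-letter commands:
echo stat | nc keeper-04 9181
echo mntr | nc keeper-04 9181
```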
After this, keeper-05 and keeper-06 are restarted normally, one by one. This is the log of keeper-05/06:
2024.04.17 12:37:12.318899 [ 1 ] {} <Information> KeeperStateManager: Read state from /var/lib/clickhouse-keeper/state
2024.04.17 12:37:12.318933 [ 1 ] {} <Information> RaftInstance: parameters: election timeout range 1000 - 2000, heartbeat 500, leadership expiry 10000, max batch 100, backoff 50, snapshot distance 100000, enable randomized snapshot creation NO, log sync stop gap 99999, reserved logs 100000, client timeout 10000, auto forwarding on, API call type async, custom commit quorum size 0, custom election quorum size 0, snapshot receiver included, leadership transfer wait time 0, grace period of lagging state machine 0, snapshot IO: blocking, parallel log appending: on
2024.04.17 12:37:12.318945 [ 1 ] {} <Information> RaftInstance: new election timeout range: 1000 - 2000
2024.04.17 12:37:12.318961 [ 1 ] {} <Information> RaftInstance: === INIT RAFT SERVER ===
commit index 0
term 1
election timer allowed
log store start 1, end 49
config log idx 1, prev log idx 0
2024.04.17 12:37:12.318971 [ 1 ] {} <Information> RaftInstance: detect a configuration change that is not committed yet at index 1
2024.04.17 12:37:12.319185 [ 1 ] {} <Information> RaftInstance: peer 1: DC ID 0, keeper-01:9234, voting member, 1
peer 2: DC ID 0, keeper-02:9234, voting member, 1
peer 3: DC ID 0, keeper-03:9234, voting member, 1
peer 4: DC ID 0, keeper-04:9234, learner, 1
peer 5: DC ID 0, keeper-05:9234, learner, 1
peer 6: DC ID 0, keeper-06:9234, learner, 1
my id: 5, learner
num peers: 5
2024.04.17 12:37:12.319208 [ 1 ] {} <Information> RaftInstance: global manager does not exist. will use local thread for commit and append
2024.04.17 12:37:12.319267 [ 1 ] {} <Information> RaftInstance: wait for HB, for 50 + [1000, 2000] ms
2024.04.17 12:37:12.319319 [ 34 ] {} <Information> RaftInstance: bg append_entries thread initiated
2024.04.17 12:37:12.369421 [ 1 ] {} <Debug> KeeperDispatcher: Server initialized, waiting for quorum
2024.04.17 12:40:12.369574 [ 1 ] {} <Warning> KeeperServer: Failed to wait for RAFT initialization in 180000ms, will continue in background
2024.04.17 12:40:12.369691 [ 1 ] {} <Debug> KeeperDispatcher: Quorum initialized
2024.04.17 12:40:12.369881 [ 1 ] {} <Debug> KeeperDispatcher: Dispatcher initialized
2024.04.17 12:40:12.370063 [ 1 ] {} <Information> Application: Listening for Keeper (tcp): 0.0.0.0:9181
2024.04.17 12:40:12.370091 [ 36 ] {} <Information> KeeperDispatcher: Server still not initialized, will not apply configuration until initialization finished
2024.04.17 12:40:12.370377 [ 1 ] {} <Information> Application: Listening for HTTP Control: http://0.0.0.0:9182
2024.04.17 12:40:12.370588 [ 1 ] {} <Trace> AsynchronousMetrics: MemoryTracking: was 50.47 KiB, peak 50.47 KiB, free memory in arenas 1.87 MiB, will set to 26.96 MiB (RSS), difference: 26.91 MiB
2024.04.17 12:40:12.372908 [ 1 ] {} <Debug> ConfigReloader: Loading config '/etc/clickhouse-keeper/keeper_config.xml'
2024.04.17 12:40:12.372933 [ 1 ] {} <Debug> ConfigProcessor: Processing configuration file '/etc/clickhouse-keeper/keeper_config.xml'.
2024.04.17 12:40:12.373401 [ 1 ] {} <Debug> ConfigProcessor: Saved preprocessed configuration to '/var/lib/clickhouse-keeper/preprocessed_configs/keeper_config.xml'.
2024.04.17 12:40:12.373423 [ 1 ] {} <Debug> ConfigReloader: Loaded config '/etc/clickhouse-keeper/keeper_config.xml', performing update on configuration
2024.04.17 12:40:12.373481 [ 1 ] {} <Information> Application: keeper_server.max_memory_usage_soft_limit is set to 13.77 GiB
2024.04.17 12:40:12.374436 [ 1 ] {} <Warning> KeeperDispatcher: Configuration changed for more than one server (3) from cluster, it's strictly not recommended
2024.04.17 12:40:12.374466 [ 1 ] {} <Information> CertificateReloader: One of paths is empty. Cannot apply new configuration for certificates. Fill all paths and try again.
2024.04.17 12:40:12.374475 [ 1 ] {} <Debug> ConfigReloader: Loaded config '/etc/clickhouse-keeper/keeper_config.xml', performed update on configuration
2024.04.17 12:40:12.375031 [ 1 ] {} <Information> Application: Ready for connections.
2024.04.17 12:40:13.000180 [ 39 ] {} <Trace> AsynchronousMetrics: MemoryTracking: was 28.49 MiB, peak 33.48 MiB, free memory in arenas 6.78 MiB, will set to 32.52 MiB (RSS), difference: 4.04 MiB
2024.04.17 12:40:17.370267 [ 36 ] {} <Information> KeeperDispatcher: Server still not initialized, will not apply configuration until initialization finished
2024.04.17 12:40:22.370393 [ 36 ] {} <Information> KeeperDispatcher: Server still not initialized, will not apply configuration until initialization finished
2024.04.17 12:40:27.370785 [ 36 ] {} <Information> KeeperDispatcher: Server still not initialized, will not apply configuration until initialization finished
2024.04.17 12:40:32.370947 [ 36 ] {} <Information> KeeperDispatcher: Server still not initialized, will not apply configuration until initialization finished
No quorum is established and the rcvr command does not work as expected: recovery is not forced, and the last 3 nodes cannot form a stable ensemble. It seems the recovering keeper-04 is not receiving any requests.
Expected behaviour
When using the --force-recovery flag instead of the rcvr command, everything works as expected and the last 3 nodes form a stable ensemble. The rcvr command should likewise force the recovery of the last 3 nodes, just as the --force-recovery flag does.
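For comparison, the working path restarts the node with the flag instead of sending the four-letter command. The flag name is as in the report and the config path matches the logs; the exact invocation may differ per deployment:

```shell
# Restart keeper-04 in forced recovery mode: it overrides the stored
# cluster configuration with the one from the config file and waits for
# the reconfigured quorum of keeper-04..06 to form.
clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml --force-recovery
```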
As a side note, it seems that changing the configuration for multiple nodes in the cluster at once (stopping nodes and reconfiguring learners to followers) is not recommended:
2024.04.17 12:33:30.488955 [ 41 ] {} <Warning> KeeperDispatcher: Configuration changed for more than one server (3) from cluster, it's strictly not recommended
Also, after changing the config, the log output still prints the non-updated configuration:
2024.04.17 12:37:12.319185 [ 1 ] {} <Information> RaftInstance: peer 1: DC ID 0, keeper-01:9234, voting member, 1
peer 2: DC ID 0, keeper-02:9234, voting member, 1
peer 3: DC ID 0, keeper-03:9234, voting member, 1
peer 4: DC ID 0, keeper-04:9234, learner, 1
peer 5: DC ID 0, keeper-05:9234, learner, 1
peer 6: DC ID 0, keeper-06:9234, learner, 1
my id: 5, learner
num peers: 5
So could this be related to a configuration issue, i.e. that the keepers do not correctly load the configuration changes when rcvr is used? @antonio2368 😄