Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Keydb hang with full sync (active-replica mode) #619

Open
chicknsoup opened this issue Mar 31, 2023 · 1 comment
Open

[BUG] Keydb hang with full sync (active-replica mode) #619

chicknsoup opened this issue Mar 31, 2023 · 1 comment

Comments

@chicknsoup
Copy link

Describe the bug

I'm running a pair of KeyDB servers in active-replica mode. When restarting one of the node it cannot be synced with the other.

To reproduce

  1. Start both nodes in clean slate (A & B)
  2. Push data (around 16m keys)
  3. Restart A
  4. A requests full sync with B
  5. A stream between A & B created and the diskless-sync process started
  6. A hangs and becomes unresponsive (unable to connect with keydb-cli)

Expected behavior

Two nodes are in-sync.

Additional information
Configuration:

10.192.0.41:6379> config get *
  1) "rdbchecksum"
  2) "yes"
  3) "daemonize"
  4) "yes"
  5) "lua-replicate-commands"
  6) "yes"
  7) "always-show-logo"
  8) "yes"
  9) "enable-motd"
 10) "yes"
 11) "protected-mode"
 12) "no"
 13) "rdbcompression"
 14) "yes"
 15) "rdb-del-sync-files"
 16) "no"
 17) "activerehashing"
 18) "yes"
 19) "stop-writes-on-bgsave-error"
 20) "no"
 21) "set-proc-title"
 22) "yes"
 23) "dynamic-hz"
 24) "yes"
 25) "lazyfree-lazy-eviction"
 26) "no"
 27) "lazyfree-lazy-expire"
 28) "no"
 29) "lazyfree-lazy-server-del"
 30) "no"
 31) "lazyfree-lazy-user-del"
 32) "no"
 33) "lazyfree-lazy-user-flush"
 34) "no"
 35) "repl-disable-tcp-nodelay"
 36) "no"
 37) "repl-diskless-sync"
 38) "yes"
 39) "aof-rewrite-incremental-fsync"
 40) "yes"
 41) "no-appendfsync-on-rewrite"
 42) "no"
 43) "cluster-require-full-coverage"
 44) "yes"
 45) "rdb-save-incremental-fsync"
 46) "yes"
 47) "aof-load-truncated"
 48) "yes"
 49) "aof-use-rdb-preamble"
 50) "yes"
 51) "cluster-replica-no-failover"
 52) "no"
 53) "cluster-slave-no-failover"
 54) "no"
 55) "replica-lazy-flush"
 56) "no"
 57) "slave-lazy-flush"
 58) "no"
 59) "replica-serve-stale-data"
 60) "yes"
 61) "slave-serve-stale-data"
 62) "yes"
 63) "replica-read-only"
 64) "no"
 65) "slave-read-only"
 66) "no"
 67) "replica-ignore-maxmemory"
 68) "yes"
 69) "slave-ignore-maxmemory"
 70) "yes"
 71) "multi-master"
 72) "yes"
 73) "jemalloc-bg-thread"
 74) "yes"
 75) "activedefrag"
 76) "no"
 77) "syslog-enabled"
 78) "no"
 79) "cluster-enabled"
 80) "no"
 81) "appendonly"
 82) "no"
 83) "cluster-allow-reads-when-down"
 84) "no"
 85) "delete-on-evict"
 86) "no"
 87) "use-fork"
 88) "no"
 89) "io-threads-do-reads"
 90) "no"
 91) "time-thread-priority"
 92) "no"
 93) "prefetch-enabled"
 94) "yes"
 95) "allow-rdb-resize-op"
 96) "yes"
 97) "crash-log-enabled"
 98) "yes"
 99) "crash-memcheck-enabled"
100) "yes"
101) "use-exit-on-panic"
102) "no"
103) "disable-thp"
104) "yes"
105) "cluster-allow-replica-migration"
106) "yes"
107) "replica-announced"
108) "yes"
109) "enable-async-commands"
110) "no"
111) "multithread-load-enabled"
112) "no"
113) "active-client-balancing"
114) "yes"
115) "aclfile"
116) ""
117) "unixsocket"
118) ""
119) "pidfile"
120) "/var/run/keydb/keydb-server.pid"
121) "replica-announce-ip"
122) ""
123) "slave-announce-ip"
124) ""
125) "masteruser"
126) ""
127) "cluster-announce-ip"
128) ""
129) "syslog-ident"
130) "redis"
131) "dbfilename"
132) "backup.rdb"
133) "appendfilename"
134) "appendonly.aof"
135) "server_cpulist"
136) ""
137) "bio_cpulist"
138) ""
139) "aof_rewrite_cpulist"
140) ""
141) "bgsave_cpulist"
142) ""
143) "storage-provider-options"
144) ""
145) "ignore-warnings"
146) ""
147) "proc-title-template"
148) "{title} {listen-addr} {server-mode}"
149) "masterauth"
150) ""
151) "requirepass"
152) ""
153) "supervised"
154) "no"
155) "syslog-facility"
156) "local0"
157) "repl-diskless-load"
158) "disabled"
159) "loglevel"
160) "debug"
161) "maxmemory-policy"
162) "noeviction"
163) "appendfsync"
164) "everysec"
165) "storage-cache-mode"
166) "writethrough"
167) "oom-score-adj"
168) "no"
169) "acl-pubsub-default"
170) "allchannels"
171) "sanitize-dump-payload"
172) "no"
173) "databases"
174) "16"
175) "port"
176) "6379"
177) "auto-aof-rewrite-percentage"
178) "100"
179) "cluster-replica-validity-factor"
180) "10"
181) "cluster-slave-validity-factor"
182) "10"
183) "list-max-ziplist-size"
184) "-2"
185) "tcp-keepalive"
186) "300"
187) "cluster-migration-barrier"
188) "1"
189) "active-defrag-cycle-min"
190) "1"
191) "active-defrag-cycle-max"
192) "25"
193) "active-defrag-threshold-lower"
194) "10"
195) "active-defrag-threshold-upper"
196) "100"
197) "lfu-log-factor"
198) "10"
199) "lfu-decay-time"
200) "1"
201) "replica-priority"
202) "100"
203) "slave-priority"
204) "100"
205) "repl-diskless-sync-delay"
206) "5"
207) "maxmemory-samples"
208) "16"
209) "maxmemory-eviction-tenacity"
210) "10"
211) "timeout"
212) "1500"
213) "replica-announce-port"
214) "0"
215) "slave-announce-port"
216) "0"
217) "tcp-backlog"
218) "1024"
219) "cluster-announce-bus-port"
220) "0"
221) "cluster-announce-port"
222) "0"
223) "cluster-announce-tls-port"
224) "0"
225) "repl-timeout"
226) "1500"
227) "repl-ping-replica-period"
228) "10"
229) "repl-ping-slave-period"
230) "10"
231) "list-compress-depth"
232) "0"
233) "rdb-key-save-delay"
234) "0"
235) "key-load-delay"
236) "0"
237) "active-expire-effort"
238) "1"
239) "hz"
240) "10"
241) "min-replicas-to-write"
242) "0"
243) "min-slaves-to-write"
244) "0"
245) "min-replicas-max-lag"
246) "10"
247) "min-slaves-max-lag"
248) "10"
249) "min-clients-per-thread"
250) "20"
251) "storage-flush-period"
252) "500"
253) "replica-quorum"
254) "-1"
255) "replica-weighting-factor"
256) "2"
257) "maxclients"
258) "10000"
259) "loading-process-events-interval-keys"
260) "8192"
261) "active-defrag-max-scan-fields"
262) "1000"
263) "slowlog-max-len"
264) "128"
265) "acllog-max-len"
266) "128"
267) "lua-time-limit"
268) "5000"
269) "cluster-node-timeout"
270) "15000"
271) "slowlog-log-slower-than"
272) "10000"
273) "latency-monitor-threshold"
274) "0"
275) "proto-max-bulk-len"
276) "536870912"
277) "stream-node-max-entries"
278) "100"
279) "repl-backlog-size"
280) "1048576"
281) "repl-backlog-disk-reserve"
282) "0"
283) "max-snapshot-slip"
284) "400"
285) "maxmemory"
286) "0"
287) "maxstorage"
288) "0"
289) "hash-max-ziplist-entries"
290) "512"
291) "set-max-intset-entries"
292) "512"
293) "zset-max-ziplist-entries"
294) "128"
295) "active-defrag-ignore-bytes"
296) "104857600"
297) "hash-max-ziplist-value"
298) "64"
299) "stream-node-max-bytes"
300) "4096"
301) "zset-max-ziplist-value"
302) "64"
303) "hll-sparse-max-bytes"
304) "3000"
305) "tracking-table-max-keys"
306) "1000000"
307) "client-query-buffer-limit"
308) "1073741824"
309) "repl-backlog-ttl"
310) "3600"
311) "auto-aof-rewrite-min-size"
312) "67108864"
313) "loading-process-events-interval-bytes"
314) "2097152"
315) "multi-master-no-forward"
316) "no"
317) "allow-write-during-load"
318) "no"
319) "force-backlog-disk-reserve"
320) "no"
321) "soft-shutdown"
322) "no"
323) "tls-port"
324) "0"
325) "tls-session-cache-size"
326) "20480"
327) "tls-session-cache-timeout"
328) "300"
329) "tls-cluster"
330) "no"
331) "tls-replication"
332) "no"
333) "tls-auth-clients"
334) "yes"
335) "tls-prefer-server-ciphers"
336) "no"
337) "tls-session-caching"
338) "yes"
339) "tls-rotation"
340) "no"
341) "tls-cert-file"
342) ""
343) "tls-key-file"
344) ""
345) "tls-key-file-pass"
346) ""
347) "tls-client-cert-file"
348) ""
349) "tls-client-key-file"
350) ""
351) "tls-client-key-file-pass"
352) ""
353) "tls-dh-params-file"
354) ""
355) "tls-ca-cert-file"
356) ""
357) "tls-ca-cert-dir"
358) ""
359) "tls-protocols"
360) ""
361) "tls-ciphers"
362) ""
363) "tls-ciphersuites"
364) ""
365) "logfile"
366) "/var/log/keydb/keydb-server.log"
367) "watchdog-period"
368) "0"
369) "dir"
370) "/Data/keydb/data"
371) "save"
372) "900 1"
373) "client-output-buffer-limit"
374) "normal 0 0 0 slave 1073741824 536870912 60 pubsub 33554432 8388608 60"
375) "unixsocketperm"
376) "0"
377) "slaveof"
378) "10.192.0.8 6379"
379) "notify-keyspace-events"
380) ""
381) "bind"
382) "0.0.0.0"
383) "oom-score-adj-values"
384) "0 200 800"
385) "active-replica"
386) "yes"
387) "tls-allowlist"
388) (empty list or set)

Info:

10.192.0.41:6379> info
# Server
redis_version:6.3.2
redis_git_sha1:d1dff8c2
redis_git_dirty:0
redis_build_id:451d58beb6ae007f
redis_mode:standalone
os:Linux 3.10.0-1160.36.2.el7.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:7.3.1
process_id:17112
process_supervised:no
run_id:d89c30d3338da887ad9a4b700fc1df7231186619
tcp_port:6379
server_time_usec:1680228752191632
uptime_in_seconds:60018
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:2507152
executable:/usr/bin/keydb-server
config_file:/etc/keydb/keydb.conf

# Clients
connected_clients:1
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:32
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
current_client_thread:0
thread_0_clients:1
thread_1_clients:0
thread_2_clients:0
thread_3_clients:0

# Memory
used_memory:2288561552
used_memory_human:2.13G
used_memory_rss:4017909760
used_memory_rss_human:3.74G
used_memory_peak:3479369672
used_memory_peak_human:3.24G
used_memory_peak_perc:65.78%
used_memory_overhead:1129184184
used_memory_startup:3365320
used_memory_dataset:1159377368
used_memory_dataset_perc:50.73%
allocator_allocated:2289769552
allocator_active:3594039296
allocator_resident:4029018112
total_system_memory:37779304448
total_system_memory_human:35.18G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.57
allocator_frag_bytes:1304269744
allocator_rss_ratio:1.12
allocator_rss_bytes:434978816
rss_overhead_ratio:1.00
rss_overhead_bytes:-11108352
mem_fragmentation_ratio:1.76
mem_fragmentation_bytes:1729390960
mem_not_counted_for_evict:1048576
mem_replication_backlog:1048576
mem_clients_slaves:0
mem_clients_normal:20512
mem_aof_buffer:0
mem_allocator:jemalloc-5.2.1
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:0
storage_provider:none

# Persistence
loading:0
current_cow_size:4209373184
current_cow_size_age:8
current_fork_perc:0.00
current_save_keys_processed:15082497
current_save_keys_total:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1680228748
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:67
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:4209369088
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0

# Stats
total_connections_received:32987
total_commands_processed:264246123
instantaneous_ops_per_sec:0
total_net_input_bytes:65635275109
total_net_output_bytes:2372140621
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:1
sync_partial_ok:0
sync_partial_err:1
expired_keys:10445475
expired_stale_perc:0.00
expired_time_cap_reached_count:3592
expire_cycle_cpu_milliseconds:0
evicted_keys:0
keyspace_hits:382534
keyspace_misses:131908752
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
total_forks:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:1381
dump_payload_sanitizations:0
total_reads_processed:264280530
total_writes_processed:266470797
instantaneous_lock_contention:1
avg_lock_contention:0.046875
storage_provider_read_hits:0
storage_provider_read_misses:0

# Replication
role:master
connected_slaves:0
master_failover_state:no-failover
master_replid:3c7da7d7d78188d8a05647cb949cc181b001d600
master_replid2:4c21ed6eb1a72855461f1298444bae8cef804073
master_repl_offset:266737940869
second_repl_offset:266116889834
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:266736892294
repl_backlog_histlen:1048576

# CPU
used_cpu_sys:7478.335665
used_cpu_user:9441.184143
used_cpu_sys_children:0.000000
used_cpu_user_children:0.000000
server_threads:4
long_lock_waits:5708
used_cpu_sys_main_thread:5628.715563
used_cpu_user_main_thread:4270.121643

# Modules

# Errorstats
errorstat_ERR:count=1
errorstat_LOADING:count=1380

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=15214933,expires=15214896,avg_ttl=79689933,cached_keys=15214933

# KeyDB
mvcc_depth:0

Here is the log:

7578:7598:S 31 Mar 2023 08:44:47.008 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '4c21ed6eb1a72855461f1298444bae8cef804073', my replication IDs are 'e50cd6c93bf5ed0635d3c9b55f5fe7b2b8b05f2b' and '0000000000000000000000000000000000000000')
7578:7598:S 31 Mar 2023 08:44:47.008 * Delay next BGSAVE for diskless SYNC
7578:7598:S 31 Mar 2023 08:44:49.144 - DB 0: 15091496 keys (15091459 volatile) in 16777216 slots HT.
7578:7598:S 31 Mar 2023 08:44:49.145 . 2 clients connected (1 replicas), 2199275328 bytes in use
7578:7598:S 31 Mar 2023 08:44:52.160 * Starting BGSAVE for SYNC with target: replicas sockets
7578:7598:S 31 Mar 2023 08:44:52.184 * Background RDB transfer started
7578:8020:S 31 Mar 2023 08:44:52.203 - RDB: 2290 MB of memory used by copy-on-write
7578:8020:S 31 Mar 2023 08:44:54.224 - RDB: 2395 MB of memory used by copy-on-write
7578:7598:S 31 Mar 2023 08:44:54.256 - DB 0: 15091496 keys (15091459 volatile) in 4 slots HT.
7578:7598:S 31 Mar 2023 08:44:54.256 . 2 clients connected (1 replicas), 2329470016 bytes in use
7578:8020:S 31 Mar 2023 08:44:57.234 - RDB: 2555 MB of memory used by copy-on-write
7578:7598:S 31 Mar 2023 08:44:59.446 - DB 0: 15091496 keys (15091459 volatile) in 4 slots HT.
7578:7598:S 31 Mar 2023 08:44:59.446 . 2 clients connected (1 replicas), 2554825808 bytes in use
7578:8020:S 31 Mar 2023 08:45:00.240 - RDB: 2637 MB of memory used by copy-on-write
7578:8020:S 31 Mar 2023 08:45:03.419 - RDB: 2690 MB of memory used by copy-on-write
^C

@nickb937
Copy link

nickb937 commented Apr 6, 2023

I experience this too. However I found that the problem seemed related to an excessive growth in open file-descriptors. Once the server got to around 11K-12K open, the process would hang. Therefore I'm wondering if it is solved by this #579

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants