Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keydb crash right after REPLICA sync is completed (5.3.1) #137

Closed
jamzed opened this issue Feb 1, 2020 · 5 comments
Closed

Keydb crash right after REPLICA sync is completed (5.3.1) #137

jamzed opened this issue Feb 1, 2020 · 5 comments

Comments

@jamzed
Copy link

jamzed commented Feb 1, 2020

Replica node running on KeyDB 5.3.1 crash right after REPLICA sync process is finished.


18427:S 01 Feb 2020 21:33:13.467 * MASTER <-> REPLICA sync started
18427:S 01 Feb 2020 21:33:13.467 * Non blocking connect for SYNC fired the event.
18427:S 01 Feb 2020 21:33:13.468 * Master replied to PING, replication can continue...
18427:S 01 Feb 2020 21:33:13.468 * Trying a partial resynchronization (request 9a78168835ed94040983034a391379d3dbf96fb4:17230809381).
18427:S 01 Feb 2020 21:33:19.686 * Full resync from master: 9a78168835ed94040983034a391379d3dbf96fb4:17745271325
18427:S 01 Feb 2020 21:33:19.686 * Discarding previously cached master state.
18427:S 01 Feb 2020 21:33:19.798 * MASTER <-> REPLICA sync: receiving streamed RDB from master
18427:S 01 Feb 2020 21:34:05.080 * MASTER <-> REPLICA sync: Flushing old data
18427:S 01 Feb 2020 21:34:09.053 * MASTER <-> REPLICA sync: Loading DB in memory
18427:S 01 Feb 2020 21:34:09.053 * Loading RDB produced by version 0.0.0
18427:S 01 Feb 2020 21:34:09.053 * RDB age 50 seconds
18427:S 01 Feb 2020 21:34:09.053 * RDB memory usage when created 4018.63 Mb
18427:S 01 Feb 2020 21:34:22.563 * MASTER <-> REPLICA sync: Finished with success

=== KEYDB BUG REPORT START: Cut & paste starting from here ===
18427:S 01 Feb 2020 21:34:22.685 # === ASSERTION FAILED OBJECT CONTEXT ===
18427:S 01 Feb 2020 21:34:22.685 # Object type: 0
18427:S 01 Feb 2020 21:34:22.685 # Object encoding: 0
18427:S 01 Feb 2020 21:34:22.685 # Object refcount: 1
18427:S 01 Feb 2020 21:34:22.685 # Object raw string len: 91
18427:S 01 Feb 2020 21:34:22.685 # Object raw string content: "X-XXXX||XXXX||XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX||XXXX-XX-XX XX:XX:XX"
18427:S 01 Feb 2020 21:34:22.685 # === ASSERTION FAILED ===
18427:S 01 Feb 2020 21:34:22.685 # ==> db.cpp:226 'retval == DICT_OK' is not true
18427:S 01 Feb 2020 21:34:22.685 # (forcing SIGSEGV to print the bug report.)
18427:S 01 Feb 2020 21:34:22.685 # KeyDB 0.0.0 crashed by signal: 11
18427:S 01 Feb 2020 21:34:22.685 # Crashed running the instruction at: 0x55de3272811e
18427:S 01 Feb 2020 21:34:22.685 # Accessing address: 0xffffffffffffffff
18427:S 01 Feb 2020 21:34:22.685 # Failed assertion: retval == DICT_OK (db.cpp:226)

------ STACK TRACE ------
EIP:
/usr/bin/keydb-server 0.0.0.0:3379 cluster [0x55de3272811e]

Backtrace:
/usr/bin/keydb-server 0.0.0.0:3379 cluster [0x55de327298ab]
/usr/bin/keydb-server 0.0.0.0:3379 [cluster](sigsegvHandler(int, siginfo_t*, void*)+0xc9) [0x55de3272a049]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890) [0x7f48a331d890]
/usr/bin/keydb-server 0.0.0.0:3379 cluster [0x55de3272811e]
/usr/bin/keydb-server 0.0.0.0:3379 [cluster](dbAdd(redisDb*, redisObject*, redisObject*)+0x2d) [0x55de326ef06d]
/usr/bin/keydb-server 0.0.0.0:3379 [cluster](incrDecrCommand(client*, long long)+0x19c) [0x55de32708acc]
/usr/bin/keydb-server 0.0.0.0:3379 [cluster](call(client*, int)+0xcd) [0x55de326d1bbd]
/usr/bin/keydb-server 0.0.0.0:3379 [cluster](processCommand(client*, int)+0x7f4) [0x55de326d6334]
/usr/bin/keydb-server 0.0.0.0:3379 [cluster](processCommandAndResetClient(client*, int)+0x21) [0x55de326e0161]
/usr/bin/keydb-server 0.0.0.0:3379 [cluster](processInputBuffer(client*, int)+0x18b) [0x55de326e4edb]
/usr/bin/keydb-server 0.0.0.0:3379 cluster [0x55de326e4fdd]
/usr/bin/keydb-server 0.0.0.0:3379 [cluster](readQueryFromClient(aeEventLoop*, int, void*, int)+0x370) [0x55de326e7b50]
/usr/bin/keydb-server 0.0.0.0:3379 cluster [0x55de326c9fec]
/usr/bin/keydb-server 0.0.0.0:3379 cluster [0x55de326ca285]
/usr/bin/keydb-server 0.0.0.0:3379 cluster [0x55de326ca655]
/usr/bin/keydb-server 0.0.0.0:3379 cluster [0x55de326ceb7e]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f48a33126db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f48a303b88f]

------ INFO OUTPUT ------

Server

redis_version:0.0.0
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:8cbd3f419723bc42
redis_mode:cluster
os:Linux 4.15.0-1058-aws x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:7.4.0
process_id:18427
run_id:968f8ba21de5ab6fc6beaf76cda683d11dfabfe0
tcp_port:3379
uptime_in_seconds:83
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:3534558
executable:/usr/bin/keydb-server
config_file:/etc/keydb/keydb.conf

Clients

connected_clients:1
client_recent_max_input_buffer:4
client_recent_max_output_buffer:0
blocked_clients:0
current_client_thread:0
thread_0_clients:1
thread_1_clients:0
thread_2_clients:0
thread_3_clients:0
thread_4_clients:0
thread_5_clients:0
thread_6_clients:0
thread_7_clients:0

Memory

used_memory:3377198168
used_memory_human:3.15G
used_memory_rss:6412812288
used_memory_rss_human:5.97G
used_memory_peak:3377198168
used_memory_peak_human:3.15G
used_memory_peak_perc:100.05%
used_memory_overhead:220930326
used_memory_startup:55757616
used_memory_dataset:3156267842
used_memory_dataset_perc:95.03%
allocator_allocated:3377446912
allocator_active:3393327104
allocator_resident:6464962560
total_system_memory:66008489984
total_system_memory_human:61.48G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:56107216486
maxmemory_human:52.25G
maxmemory_policy:noeviction
allocator_frag_ratio:1.00
allocator_frag_bytes:15880192
allocator_rss_ratio:1.91
allocator_rss_bytes:3071635456
rss_overhead_ratio:0.99
rss_overhead_bytes:-52150272
mem_fragmentation_ratio:1.90
mem_fragmentation_bytes:3037280688
mem_not_counted_for_evict:0
mem_replication_backlog:1048576
mem_clients_slaves:0
mem_clients_normal:49902
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0

Persistence

loading:0
rdb_changes_since_last_save:5434
rdb_bgsave_in_progress:0
rdb_last_save_time:1580592779
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0

Stats

total_connections_received:5
total_commands_processed:5443
instantaneous_ops_per_sec:0
total_net_input_bytes:1598678728
total_net_output_bytes:76341
instantaneous_input_kbps:29944.01
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:6592
expired_stale_perc:0.00
expired_time_cap_reached_count:1
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

Replication

role:slave
Master 0:
master_host:192.168.111.222
master_port:3379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:17750817651
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:9a78168835ed94040983034a391379d3dbf96fb4
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:17750793895
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:17749745320
repl_backlog_histlen:1048576

CPU

used_cpu_sys:9.258006
used_cpu_user:68.288977
used_cpu_sys_children:0.000000
used_cpu_user_children:0.000000
server_threads:8
long_lock_waits:13

Commandstats

cmdstat_setex:calls=3347,usec=6284,usec_per_call=1.88
cmdstat_del:calls=904,usec=3529,usec_per_call=3.90
cmdstat_cluster:calls=4,usec=854,usec_per_call=213.50
cmdstat_incr:calls=563,usec=640,usec_per_call=1.14
cmdstat_select:calls=1,usec=1,usec_per_call=1.00
cmdstat_expire:calls=554,usec=371,usec_per_call=0.67
cmdstat_set:calls=66,usec=60,usec_per_call=0.91
cmdstat_command:calls=4,usec=6500,usec_per_call=1625.00

Cluster

cluster_enabled:1

Keyspace

db0:keys=2239931,expires=1278519,avg_ttl=393820

------ CLIENT LIST OUTPUT ------
id=16 addr=192.168.111.222:3379 fd=104 name= age=0 idle=0 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=24260 qbuf-free=8508 obl=0 oll=0 omem=0 events=r cmd=incr

------ CURRENT CLIENT INFO ------
id=16 addr=192.168.111.222:3379 fd=104 name= age=0 idle=0 flags=M db=0 sub=0 psub=0 multi=-1 qbuf=24260 qbuf-free=8508 obl=0 oll=0 omem=0 events=r cmd=incr
argv[0]: 'INCR'
argv[1]: 'X-XXXX||XXXX||XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX||XXXX-XX-XX XX:XX:XX'
18427:S 01 Feb 2020 21:34:22.686 # key 'X-XXXX||XXXX||XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX||XXXX-XX-XX XX:XX:XX' found in DB containing the following object:
18427:S 01 Feb 2020 21:34:22.686 # Object type: 0
18427:S 01 Feb 2020 21:34:22.686 # Object encoding: 1
18427:S 01 Feb 2020 21:34:22.686 # Object refcount: 1

------ REGISTERS ------
18427:S 01 Feb 2020 21:34:22.686 #
RAX:0000000000000000 RBX:00000000000000e2
RCX:0000000000000002 RDX:00007f47d00008d0
RDI:00007f47d0001e00 RSI:00007f47d00009d8
RBP:000055de3281e4b7 RSP:00007f47d79fd2d0
R8 :00007f47d79ff700 R9 :00007f47d79ff700
R10:00000000ffffffd6 R11:0000000000000206
R12:000055de3281b16e R13:000055de32a71a10
R14:00007f46c3d64680 R15:0000000000000000
RIP:000055de3272811e EFL:0000000000010206
CSGSFS:002b000000000033
18427:S 01 Feb 2020 21:34:22.686 # (00007f47d79fd2df) -> 00007f47d79fd350
18427:S 01 Feb 2020 21:34:22.686 # (00007f47d79fd2de) -> 0000000000000002
18427:S 01 Feb 2020 21:34:22.686 # (00007f47d79fd2dd) -> 000000000000153a
18427:S 01 Feb 2020 21:34:22.686 # (00007f47d79fd2dc) -> 0000000000000002
18427:S 01 Feb 2020 21:34:22.686 # (00007f47d79fd2db) -> 000055de326d1bbd
18427:S 01 Feb 2020 21:34:22.686 # (00007f47d79fd2da) -> 00059d8a741d6752
18427:S 01 Feb 2020 21:34:22.686 # (00007f47d79fd2d9) -> 000000000000000f
18427:S 01 Feb 2020 21:34:22.686 # (00007f47d79fd2d8) -> 00007f46c3d64730
18427:S 01 Feb 2020 21:34:22.686 # (00007f47d79fd2d7) -> 62bcc93369011500
18427:S 01 Feb 2020 21:34:22.686 # (00007f47d79fd2d6) -> 0000000000000001
18427:S 01 Feb 2020 21:34:22.686 # (00007f47d79fd2d5) -> 000055de32708acc
18427:S 01 Feb 2020 21:34:22.686 # (00007f47d79fd2d4) -> 00007f46c3d64680
18427:S 01 Feb 2020 21:34:22.686 # (00007f47d79fd2d3) -> 000055de326ef06d
18427:S 01 Feb 2020 21:34:22.686 # (00007f47d79fd2d2) -> 0000000000000001
18427:S 01 Feb 2020 21:34:22.686 # (00007f47d79fd2d1) -> 00007f48a2a4c3f8
18427:S 01 Feb 2020 21:34:22.686 # (00007f47d79fd2d0) -> 00007f47a33af7f0

------ FAST MEMORY TEST ------
18427:S 01 Feb 2020 21:34:22.686 # Bio thread for job type #0 terminated
18427:S 01 Feb 2020 21:34:22.686 # Bio thread for job type #1 terminated
18427:S 01 Feb 2020 21:34:22.686 # Bio thread for job type #2 terminated
*** Preparing to test memory region 55de32a77000 (2260992 bytes)
*** Preparing to test memory region 55de33174000 (135168 bytes)
*** Preparing to test memory region 55de33195000 (131072 bytes)
*** Preparing to test memory region 7f46c1400000 (4001366016 bytes)
*** Preparing to test memory region 7f47afc80000 (3805184 bytes)
*** Preparing to test memory region 7f47b4000000 (135168 bytes)
*** Preparing to test memory region 7f47b8000000 (135168 bytes)
*** Preparing to test memory region 7f47bc080000 (12058624 bytes)
*** Preparing to test memory region 7f47bcd00000 (13631488 bytes)
*** Preparing to test memory region 7f47bdb00000 (13631488 bytes)
*** Preparing to test memory region 7f47be97e000 (6815744 bytes)
*** Preparing to test memory region 7f47befff000 (8388608 bytes)
*** Preparing to test memory region 7f47bf800000 (8388608 bytes)
*** Preparing to test memory region 7f47c0000000 (135168 bytes)
*** Preparing to test memory region 7f47c4000000 (135168 bytes)
*** Preparing to test memory region 7f47c8000000 (135168 bytes)
*** Preparing to test memory region 7f47cc000000 (135168 bytes)
*** Preparing to test memory region 7f47d0000000 (135168 bytes)
*** Preparing to test memory region 7f47d417e000 (2621440 bytes)
*** Preparing to test memory region 7f47d43ff000 (8388608 bytes)
*** Preparing to test memory region 7f47d4c00000 (8388608 bytes)
*** Preparing to test memory region 7f47d5400000 (4194304 bytes)
*** Preparing to test memory region 7f47d59fd000 (8388608 bytes)
*** Preparing to test memory region 7f47d61fe000 (8388608 bytes)
*** Preparing to test memory region 7f47d69ff000 (8388608 bytes)
*** Preparing to test memory region 7f47d7200000 (8388608 bytes)
*** Preparing to test memory region 7f47d7a00000 (2789212160 bytes)
*** Preparing to test memory region 7f487df8a000 (520093696 bytes)
*** Preparing to test memory region 7f489cf8b000 (8388608 bytes)
*** Preparing to test memory region 7f489d78c000 (8388608 bytes)
*** Preparing to test memory region 7f489df8d000 (8388608 bytes)
*** Preparing to test memory region 7f489e78d000 (63963136 bytes)
*** Preparing to test memory region 7f48a2600000 (8388608 bytes)
*** Preparing to test memory region 7f48a3307000 (16384 bytes)
*** Preparing to test memory region 7f48a3526000 (16384 bytes)
*** Preparing to test memory region 7f48a3e65000 (16384 bytes)
*** Preparing to test memory region 7f48a468f000 (49152 bytes)
*** Preparing to test memory region 7f48a46a5000 (4096 bytes)
.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.18737:C 01 Feb 2020 21:34:35.248 # oO0OoO0OoO0Oo KeyDB is starting oO0OoO0OoO0Oo

To Reproduce
Kill the replica node under high traffic to force resync from master.

@JohnSully
Copy link
Collaborator

Just to better understand the scenario here, are you combining clustering with traditional replication? Is Active Replication enabled?

@jamzed
Copy link
Author

jamzed commented Feb 6, 2020

It is a pure cluster mode, (3 masters, 3 slaves), and it seems similar to #132 when INCR was executed.

@JohnSully
Copy link
Collaborator

Thanks,

Yes that was my thought as well however Active Rep was the only way to get into that case. It seems like something else might be going on here. I will investigate.

@JohnSully
Copy link
Collaborator

Got a repro, it's definitely related in that it's an expire issue. I was convinced this was only possible with Active Replication so I apologize it wasn't fixed along with that older issue.

JohnSully added a commit that referenced this issue Feb 7, 2020
JohnSully added a commit that referenced this issue Feb 7, 2020
@JohnSully
Copy link
Collaborator

The issue was we were using the result of expireIfNeeded to determine if lookupKeyWrite should return an object. This results in a situation where a key is in the underlying database but the high level logic doesn't see it. When the high level command tries to readd the key it fails because it already exists.

Commit 49d8674 adds tests to detect this
Commit 050d580 fixes the actual issue.

This only repros on replicas, not masters because expireIfNeeded will delete the actual key on a master (replicas wait for the master to send a DEL).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants