[CRASH] The server crashed when resizing the replication backlog size and running info at the same time #477

paulmchen · 2022-07-27T14:06:46Z

Steps to reproduce this issue:

Start KeyDB server by specifying repl-backlog-size and repl-backlog-disk-reserve as the following:

save ""
server-threads 4
repl-backlog-size 1mb
repl-backlog-disk-reserve 100mb
At the meantime, run the following info command every 100 second:

watch -n 0.1 -c 'date;./keydb-cli -h 192.168.0.221 -p 7001 info'
Run the following benchmark command:

./keydb-benchmark -h 192.168.0.221 -c 200 -p 7001 -r 10000 -t set -d 500 -l -q

Note: it looks like the parallel executing the replication backlog resize causing deadlock and crash the server. However, it may not happen all the time.

object.cpp https://github.com/Snapchat/KeyDB/blob/main/src/object.cpp (line 1094-1095), it crashed when += mem size for the repl_backlog

if (g_pserver->repl_backlog)
mem += zmalloc_size(g_pserver->repl_backlog);

Do we need to "Lock" mem += operation here?

Crash report

423364:423390:M 27 Jul 2022 17:47:25.977 * Synchronization with replica 192.168.0.221:7002 succeeded
423364:423391:M 27 Jul 2022 17:47:26.966 # Replication backlog is too small, resizing to: 2048 bytes
423364:423391:M 27 Jul 2022 17:47:26.966 * Switching to disk backed replication backlog due to exceeding memory limits


=== KEYDB BUG REPORT START: Cut & paste starting from here ===
423364:423392:M 27 Jul 2022 17:47:27.011 # KeyDB 255.255.255 crashed by signal: 11, si_code: 1
423364:423392:M 27 Jul 2022 17:47:27.011 # Accessing address: 0x18e000
423364:423392:M 27 Jul 2022 17:47:27.011 # Crashed running the instruction at: 0xab26ec

------ STACK TRACE ------
EIP:
./keydb-server-init 192.168.0.221:7001(je_malloc_usable_size+0x5c) [0xab26ec]

Backtrace:
/usr/lib64/libpthread.so.0(+0x135a0) [0x7fb1ab9085a0]
./keydb-server-init 192.168.0.221:7001(je_malloc_usable_size+0x5c) [0xab26ec]
./keydb-server-init 192.168.0.221:7001(getMemoryOverheadData()+0x1af) [0x65a56f]
./keydb-server-init 192.168.0.221:7001(genRedisInfoString(client*, char const*)+0x48f) [0x69256f]
./keydb-server-init 192.168.0.221:7001(infoCommand(client*)+0x33) [0x6941e3]
./keydb-server-init 192.168.0.221:7001(call(client*, int)+0xa1) [0x695621]
./keydb-server-init 192.168.0.221:7001(processCommand(client*, int)+0x820) [0x696650]
./keydb-server-init 192.168.0.221:7001(processCommandAndResetClient(client*, int)+0x65) [0x5b2255]
./keydb-server-init 192.168.0.221:7001(processInputBuffer(client*, bool, int)+0x174) [0x5b4e74]
./keydb-server-init 192.168.0.221:7001(processClients()+0xc9) [0x5b6369]
./keydb-server-init 192.168.0.221:7001() [0x6aa4f2]
./keydb-server-init 192.168.0.221:7001(beforeSleep(aeEventLoop*)+0x16e) [0x59bc5e]
./keydb-server-init 192.168.0.221:7001(aeProcessEvents+0xe8) [0x597868]
./keydb-server-init 192.168.0.221:7001(aeMain+0x37) [0x59b957]
./keydb-server-init 192.168.0.221:7001(workerThreadMain(void*)+0x64) [0x6aa584]
/usr/lib64/libpthread.so.0(+0x8f3b) [0x7fb1ab8fdf3b]
/usr/lib64/libc.so.6(clone+0x40) [0x7fb1ab835810]

------ REGISTERS ------
423364:423392:M 27 Jul 2022 17:47:27.013 # 
RAX:000000000018e000 RBX:00007fb171c00000
RCX:00007fb171c00000 RDX:0000000000000001
RDI:00007fb194efbbd0 RSI:00007fb194efbb10
RBP:0000000000000004 RSP:00007fb194efa820
R8 :0000000000000000 R9 :00007fb140000000
R10:0000000000000006 R11:00000000000000b9
R12:0000000000394b88 R13:00000000011b0700
R14:00000000011b0700 R15:0000000000000002
RIP:0000000000ab26ec EFL:0000000000010206
CSGSFS:002b000000000033
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa82f) -> 00007fb19627f005
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa82e) -> 0000002c00000001
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa82d) -> 00007fb194efaf38
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa82c) -> 000000000000b000
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa82b) -> 00007fb194000900
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa82a) -> 0000000000b17223
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa829) -> 0000000000000000
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa828) -> 0000000000b3f8ed
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa827) -> 0000000000b3f8f6
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa826) -> 0000000000b17481
423364:423392:M 27 Jul 2022 17:47:27.014 # (00007fb194efa825) -> 0000000000f94160
423364:423392:M 27 Jul 2022 17:47:27.014 # (00007fb194efa824) -> 00007fb194efba90
423364:423392:M 27 Jul 2022 17:47:27.014 # (00007fb194efa823) -> 00007fb194010f00
423364:423392:M 27 Jul 2022 17:47:27.014 # (00007fb194efa822) -> 00007fb194efa840
423364:423392:M 27 Jul 2022 17:47:27.014 # (00007fb194efa821) -> 00007fb194efa9f0
423364:423392:M 27 Jul 2022 17:47:27.014 # (00007fb194efa820) -> 0000003000000030

------ INFO OUTPUT ------

<pre>
Core dump:

(gdb) bt
#0  atomic_load_p (mo=atomic_memory_order_relaxed, a=0x18e000) at include/jemalloc/internal/atomic.h:62
#1  rtree_leaf_elm_bits_read (dependent=true, elm=0x18e000, rtree=, tsdn=) at include/jemalloc/internal/rtree.h:175
#2  rtree_leaf_elm_szind_read (dependent=true, elm=0x18e000, rtree=, tsdn=) at include/jemalloc/internal/rtree.h:227
#3  rtree_szind_read (dependent=true, key=140400094347264, rtree_ctx=, rtree=, tsdn=) at include/jemalloc/internal/rtree.h:434
#4  arena_salloc (ptr=, tsdn=) at include/jemalloc/internal/arena_inlines_b.h:191
#5  isalloc (ptr=, tsdn=) at include/jemalloc/internal/jemalloc_internal_inlines_c.h:38
#6  je_malloc_usable_size (ptr=0x7fb171c00000) at src/jemalloc.c:3740
#7  0x000000000065a56f in _Z21getMemoryOverheadDatav () at object.cpp:1094
#8  0x000000000069256f in genRedisInfoString (c=0x0, section=0xb2383c "all") at server.cpp:5687
#9  0x00000000005ee9e1 in logServerInfo () at debug.cpp:1748
#10 0x00000000005eea43 in printCrashReport () at debug.cpp:2013
#11 0x00000000005eeaec in sigsegvHandler (sig=11, info=0x7fb194ef9c70, secret=0x7fb194ef9b40) at debug.cpp:1999
#12 
#13 atomic_load_p (mo=atomic_memory_order_relaxed, a=0x18e000) at include/jemalloc/internal/atomic.h:62
#14 rtree_leaf_elm_bits_read (dependent=true, elm=0x18e000, rtree=, tsdn=) at include/jemalloc/internal/rtree.h:175
#15 rtree_leaf_elm_szind_read (dependent=true, elm=0x18e000, rtree=, tsdn=) at include/jemalloc/internal/rtree.h:227
#16 rtree_szind_read (dependent=true, key=140400094347264, rtree_ctx=, rtree=, tsdn=) at include/jemalloc/internal/rtree.h:434
#17 arena_salloc (ptr=, tsdn=) at include/jemalloc/internal/arena_inlines_b.h:191
#18 isalloc (ptr=, tsdn=) at include/jemalloc/internal/jemalloc_internal_inlines_c.h:38
#19 je_malloc_usable_size (ptr=0x7fb171c00000) at src/jemalloc.c:3740
#20 0x000000000065a56f in _Z21getMemoryOverheadDatav () at object.cpp:1094
#21 0x000000000069256f in genRedisInfoString (c=c@entry=0x7fb185c56600, section=0xb26e30 "default") at server.cpp:5687
#22 0x00000000006941e3 in infoCommand (c=, c=) at server.cpp:6358
#23 0x0000000000695621 in call (c=c@entry=0x7fb185c56600, flags=flags@entry=31) at server.cpp:4488
#24 0x0000000000696650 in processCommand (c=0x7fb185c56600, callFlags=31) at server.cpp:5067
#25 0x00000000005b2255 in processCommandAndResetClient (c=c@entry=0x7fb185c56600, flags=flags@entry=31) at networking.cpp:2616
#26 0x00000000005b4e74 in processInputBuffer (c=c@entry=0x7fb185c56600, fParse=fParse@entry=false, callFlags=callFlags@entry=31) at networking.cpp:2772
#27 0x00000000005b6369 in processClients () at networking.cpp:2922
#28 0x00000000006aa4f2 in _Z25runAndPropogateToReplicasIFvvEJEEvPT_DpT0_ (pfn=) at server.h:3902
#29 0x000000000059bc5e in beforeSleep (eventLoop=0x7fb1ab232900) at server.cpp:2807
#30 0x0000000000597868 in aeProcessEvents (eventLoop=eventLoop@entry=0x7fb1ab232900, flags=flags@entry=27) at ae.cpp:710
#31 0x000000000059b957 in aeMain (eventLoop=) at ae.cpp:770
#32 0x00000000006aa584 in _Z16workerThreadMainPv (parg=0x2) at server.cpp:7324
#33 0x00007fb1ab8fdf3b in ?? () from /usr/lib64/libpthread.so.0
#34 0x00007fb1ab835810 in clone () from /usr/lib64/libc.so.6
</pre>

The text was updated successfully, but these errors were encountered:

paulmchen · 2022-07-27T14:08:15Z

@JohnSully add John

JohnSully · 2022-08-01T07:18:26Z

Fix change a2e6e35

JohnSully closed this as completed Aug 1, 2022

msotheeswaran-sc mentioned this issue Jan 9, 2023

KeyDB development status #533

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CRASH] The server crashed when resizing the replication backlog size and running info at the same time #477

[CRASH] The server crashed when resizing the replication backlog size and running info at the same time #477

paulmchen commented Jul 27, 2022

paulmchen commented Jul 27, 2022

JohnSully commented Aug 1, 2022

[CRASH] The server crashed when resizing the replication backlog size and running info at the same time #477

[CRASH] The server crashed when resizing the replication backlog size and running info at the same time #477

Comments

paulmchen commented Jul 27, 2022

paulmchen commented Jul 27, 2022

JohnSully commented Aug 1, 2022