Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CRASH] The server crashed when resizing the replication backlog size and running info at the same time #477

Closed
paulmchen opened this issue Jul 27, 2022 · 2 comments

Comments

@paulmchen
Copy link
Contributor

Steps to reproduce this issue:

  1. Start KeyDB server by specifying repl-backlog-size and repl-backlog-disk-reserve as the following:

    save ""
    server-threads 4
    repl-backlog-size 1mb
    repl-backlog-disk-reserve 100mb

  2. At the meantime, run the following info command every 100 second:

    watch -n 0.1 -c 'date;./keydb-cli -h 192.168.0.221 -p 7001 info'

  3. Run the following benchmark command:

    ./keydb-benchmark -h 192.168.0.221 -c 200 -p 7001 -r 10000 -t set -d 500 -l -q

Note: it looks like the parallel executing the replication backlog resize causing deadlock and crash the server. However, it may not happen all the time.

object.cpp https://github.com/Snapchat/KeyDB/blob/main/src/object.cpp (line 1094-1095), it crashed when += mem size for the repl_backlog

if (g_pserver->repl_backlog)
mem += zmalloc_size(g_pserver->repl_backlog);

Do we need to "Lock" mem += operation here?

Crash report

423364:423390:M 27 Jul 2022 17:47:25.977 * Synchronization with replica 192.168.0.221:7002 succeeded
423364:423391:M 27 Jul 2022 17:47:26.966 # Replication backlog is too small, resizing to: 2048 bytes
423364:423391:M 27 Jul 2022 17:47:26.966 * Switching to disk backed replication backlog due to exceeding memory limits


=== KEYDB BUG REPORT START: Cut & paste starting from here ===
423364:423392:M 27 Jul 2022 17:47:27.011 # KeyDB 255.255.255 crashed by signal: 11, si_code: 1
423364:423392:M 27 Jul 2022 17:47:27.011 # Accessing address: 0x18e000
423364:423392:M 27 Jul 2022 17:47:27.011 # Crashed running the instruction at: 0xab26ec

------ STACK TRACE ------
EIP:
./keydb-server-init 192.168.0.221:7001(je_malloc_usable_size+0x5c) [0xab26ec]

Backtrace:
/usr/lib64/libpthread.so.0(+0x135a0) [0x7fb1ab9085a0]
./keydb-server-init 192.168.0.221:7001(je_malloc_usable_size+0x5c) [0xab26ec]
./keydb-server-init 192.168.0.221:7001(getMemoryOverheadData()+0x1af) [0x65a56f]
./keydb-server-init 192.168.0.221:7001(genRedisInfoString(client*, char const*)+0x48f) [0x69256f]
./keydb-server-init 192.168.0.221:7001(infoCommand(client*)+0x33) [0x6941e3]
./keydb-server-init 192.168.0.221:7001(call(client*, int)+0xa1) [0x695621]
./keydb-server-init 192.168.0.221:7001(processCommand(client*, int)+0x820) [0x696650]
./keydb-server-init 192.168.0.221:7001(processCommandAndResetClient(client*, int)+0x65) [0x5b2255]
./keydb-server-init 192.168.0.221:7001(processInputBuffer(client*, bool, int)+0x174) [0x5b4e74]
./keydb-server-init 192.168.0.221:7001(processClients()+0xc9) [0x5b6369]
./keydb-server-init 192.168.0.221:7001() [0x6aa4f2]
./keydb-server-init 192.168.0.221:7001(beforeSleep(aeEventLoop*)+0x16e) [0x59bc5e]
./keydb-server-init 192.168.0.221:7001(aeProcessEvents+0xe8) [0x597868]
./keydb-server-init 192.168.0.221:7001(aeMain+0x37) [0x59b957]
./keydb-server-init 192.168.0.221:7001(workerThreadMain(void*)+0x64) [0x6aa584]
/usr/lib64/libpthread.so.0(+0x8f3b) [0x7fb1ab8fdf3b]
/usr/lib64/libc.so.6(clone+0x40) [0x7fb1ab835810]

------ REGISTERS ------
423364:423392:M 27 Jul 2022 17:47:27.013 # 
RAX:000000000018e000 RBX:00007fb171c00000
RCX:00007fb171c00000 RDX:0000000000000001
RDI:00007fb194efbbd0 RSI:00007fb194efbb10
RBP:0000000000000004 RSP:00007fb194efa820
R8 :0000000000000000 R9 :00007fb140000000
R10:0000000000000006 R11:00000000000000b9
R12:0000000000394b88 R13:00000000011b0700
R14:00000000011b0700 R15:0000000000000002
RIP:0000000000ab26ec EFL:0000000000010206
CSGSFS:002b000000000033
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa82f) -> 00007fb19627f005
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa82e) -> 0000002c00000001
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa82d) -> 00007fb194efaf38
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa82c) -> 000000000000b000
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa82b) -> 00007fb194000900
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa82a) -> 0000000000b17223
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa829) -> 0000000000000000
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa828) -> 0000000000b3f8ed
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa827) -> 0000000000b3f8f6
423364:423392:M 27 Jul 2022 17:47:27.013 # (00007fb194efa826) -> 0000000000b17481
423364:423392:M 27 Jul 2022 17:47:27.014 # (00007fb194efa825) -> 0000000000f94160
423364:423392:M 27 Jul 2022 17:47:27.014 # (00007fb194efa824) -> 00007fb194efba90
423364:423392:M 27 Jul 2022 17:47:27.014 # (00007fb194efa823) -> 00007fb194010f00
423364:423392:M 27 Jul 2022 17:47:27.014 # (00007fb194efa822) -> 00007fb194efa840
423364:423392:M 27 Jul 2022 17:47:27.014 # (00007fb194efa821) -> 00007fb194efa9f0
423364:423392:M 27 Jul 2022 17:47:27.014 # (00007fb194efa820) -> 0000003000000030

------ INFO OUTPUT ------
<pre>
Core dump:

(gdb) bt
#0  atomic_load_p (mo=atomic_memory_order_relaxed, a=0x18e000) at include/jemalloc/internal/atomic.h:62
#1  rtree_leaf_elm_bits_read (dependent=true, elm=0x18e000, rtree=, tsdn=) at include/jemalloc/internal/rtree.h:175
#2  rtree_leaf_elm_szind_read (dependent=true, elm=0x18e000, rtree=, tsdn=) at include/jemalloc/internal/rtree.h:227
#3  rtree_szind_read (dependent=true, key=140400094347264, rtree_ctx=, rtree=, tsdn=) at include/jemalloc/internal/rtree.h:434
#4  arena_salloc (ptr=, tsdn=) at include/jemalloc/internal/arena_inlines_b.h:191
#5  isalloc (ptr=, tsdn=) at include/jemalloc/internal/jemalloc_internal_inlines_c.h:38
#6  je_malloc_usable_size (ptr=0x7fb171c00000) at src/jemalloc.c:3740
#7  0x000000000065a56f in _Z21getMemoryOverheadDatav () at object.cpp:1094
#8  0x000000000069256f in genRedisInfoString (c=0x0, section=0xb2383c "all") at server.cpp:5687
#9  0x00000000005ee9e1 in logServerInfo () at debug.cpp:1748
#10 0x00000000005eea43 in printCrashReport () at debug.cpp:2013
#11 0x00000000005eeaec in sigsegvHandler (sig=11, info=0x7fb194ef9c70, secret=0x7fb194ef9b40) at debug.cpp:1999
#12 
#13 atomic_load_p (mo=atomic_memory_order_relaxed, a=0x18e000) at include/jemalloc/internal/atomic.h:62
#14 rtree_leaf_elm_bits_read (dependent=true, elm=0x18e000, rtree=, tsdn=) at include/jemalloc/internal/rtree.h:175
#15 rtree_leaf_elm_szind_read (dependent=true, elm=0x18e000, rtree=, tsdn=) at include/jemalloc/internal/rtree.h:227
#16 rtree_szind_read (dependent=true, key=140400094347264, rtree_ctx=, rtree=, tsdn=) at include/jemalloc/internal/rtree.h:434
#17 arena_salloc (ptr=, tsdn=) at include/jemalloc/internal/arena_inlines_b.h:191
#18 isalloc (ptr=, tsdn=) at include/jemalloc/internal/jemalloc_internal_inlines_c.h:38
#19 je_malloc_usable_size (ptr=0x7fb171c00000) at src/jemalloc.c:3740
#20 0x000000000065a56f in _Z21getMemoryOverheadDatav () at object.cpp:1094
#21 0x000000000069256f in genRedisInfoString (c=c@entry=0x7fb185c56600, section=0xb26e30 "default") at server.cpp:5687
#22 0x00000000006941e3 in infoCommand (c=, c=) at server.cpp:6358
#23 0x0000000000695621 in call (c=c@entry=0x7fb185c56600, flags=flags@entry=31) at server.cpp:4488
#24 0x0000000000696650 in processCommand (c=0x7fb185c56600, callFlags=31) at server.cpp:5067
#25 0x00000000005b2255 in processCommandAndResetClient (c=c@entry=0x7fb185c56600, flags=flags@entry=31) at networking.cpp:2616
#26 0x00000000005b4e74 in processInputBuffer (c=c@entry=0x7fb185c56600, fParse=fParse@entry=false, callFlags=callFlags@entry=31) at networking.cpp:2772
#27 0x00000000005b6369 in processClients () at networking.cpp:2922
#28 0x00000000006aa4f2 in _Z25runAndPropogateToReplicasIFvvEJEEvPT_DpT0_ (pfn=) at server.h:3902
#29 0x000000000059bc5e in beforeSleep (eventLoop=0x7fb1ab232900) at server.cpp:2807
#30 0x0000000000597868 in aeProcessEvents (eventLoop=eventLoop@entry=0x7fb1ab232900, flags=flags@entry=27) at ae.cpp:710
#31 0x000000000059b957 in aeMain (eventLoop=) at ae.cpp:770
#32 0x00000000006aa584 in _Z16workerThreadMainPv (parg=0x2) at server.cpp:7324
#33 0x00007fb1ab8fdf3b in ?? () from /usr/lib64/libpthread.so.0
#34 0x00007fb1ab835810 in clone () from /usr/lib64/libc.so.6
</pre>

@paulmchen
Copy link
Contributor Author

@JohnSully add John

@JohnSully
Copy link
Collaborator

Fix change a2e6e35

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants