"epoll_ctl failed: Bad file descriptor" when testing keydb cluster with multi-threading #125

Closed
hengku opened this issue Dec 19, 2019 · 4 comments

Comments

@hengku
Contributor

hengku commented Dec 19, 2019

I tested a KeyDB cluster of 4 keydb servers (compiled under Ubuntu 18.04) running on 4 machines (56 cores, 384GB RAM each). Each keydb server runs 4 threads pinned to cores, and both RDB and AOF are enabled on every server. Each server printed "epoll_ctl failed: Bad file descriptor" many times when I ran the following test:

memtier_benchmark -s -p -d 256 -t 50 -c 1 --key-pattern=P:P -n 50000000 --key-minimum=1 --key-maximum=2500000000 --ratio 1:0 --cluster-mode --hide-histogram

I didn't see those messages when I ran the same test against a Redis cluster in the same environment.

Here is the sample output:

Starting automatic rewriting of AOF on 101% growth
Background append only file rewriting started by pid 39718
AOF rewrite child asks to stop sending diffs.
Parent agreed to stop sending diffs. Finalizing AOF...
Concatenating 62.81 MB of AOF diff received from parent.
SYNC append only file rewrite performed
AOF rewrite: 337 MB of memory used by copy-on-write
Background AOF rewrite terminated with success
Residual parent diff successfully flushed to the rewritten AOF (29.51 MB)
Background AOF rewrite finished successfully
epoll_ctl failed: Bad file descriptor
epoll_ctl failed: Bad file descriptor
epoll_ctl failed: Bad file descriptor
epoll_ctl failed: Bad file descriptor
epoll_ctl failed: Bad file descriptor
epoll_ctl failed: Bad file descriptor
epoll_ctl failed: Bad file descriptor
epoll_ctl failed: Bad file descriptor
epoll_ctl failed: Bad file descriptor
epoll_ctl failed: Bad file descriptor
epoll_ctl failed: Bad file descriptor
epoll_ctl failed: Bad file descriptor
epoll_ctl failed: Bad file descriptor
epoll_ctl failed: Bad file descriptor
epoll_ctl failed: Bad file descriptor
epoll_ctl failed: Bad file descriptor
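
For readers unfamiliar with the message: "Bad file descriptor" is the text for errno EBADF, which epoll_ctl returns when either descriptor passed to it is not a valid open descriptor. The sketch below reproduces the same log line on Linux by registering a pipe end that has already been closed. It only illustrates the error itself and is not taken from KeyDB; try_register and register_closed_fd are hypothetical helper names.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Hypothetical helper: register fd for read events on epfd.
// Returns 0 on success, otherwise the errno reported by epoll_ctl.
int try_register(int epfd, int fd) {
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        int err = errno;
        // Same shape as the line seen in the server output.
        std::fprintf(stderr, "epoll_ctl failed: %s\n", std::strerror(err));
        return err;
    }
    return 0;
}

// Hypothetical helper: close a pipe's read end, then try to register it.
// Returns the errno from the failed registration, or -1 if setup failed.
int register_closed_fd() {
    int epfd = epoll_create1(0);
    if (epfd == -1) return -1;
    int fds[2];
    if (pipe(fds) == -1) {
        close(epfd);
        return -1;
    }
    close(fds[0]);                         // descriptor is now invalid
    int err = try_register(epfd, fds[0]);  // expected to fail with EBADF
    close(fds[1]);
    close(epfd);
    return err;
}
```

In a multi-threaded server the usual way to hit this is a race in which one thread closes a descriptor while another thread is still adding or modifying it in its epoll set. Whether that is the cause here is an assumption, not something the output above proves.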

@JohnSully
Collaborator

@hengku Are you running v5.2? I could have sworn I fixed an issue like this already.

@hengku
Contributor Author

hengku commented Jan 13, 2020

Hi @JohnSully, I retested with the latest source code from the RELEASE_5 branch (KeyDB 5.3). With AOF enabled (appendonly yes), keydb-server crashed after I started memtier_benchmark. With AOF disabled it worked well. Below is the error output from the crash.

7439:M 12 Jan 2020 23:40:24.612 # Cluster state changed: ok
7439:M 12 Jan 2020 23:41:07.673 * 10000 changes in 60 seconds. Saving...
7439:M 12 Jan 2020 23:41:07.673 * Background saving started by pid 9116
9116:C 12 Jan 2020 23:41:07.716 * DB saved on disk
9116:C 12 Jan 2020 23:41:07.717 * RDB: 4 MB of memory used by copy-on-write
7439:M 12 Jan 2020 23:41:07.773 * Background saving terminated with success
7439:M 12 Jan 2020 23:41:08.675 * Starting automatic rewriting of AOF on 7254515300% growth
7439:M 12 Jan 2020 23:41:08.677 * Background append only file rewriting started by pid 9139
AE_ASSERT FAILURE ae.cpp: 299

=== KEYDB BUG REPORT START: Cut & paste starting from here ===
7439:M 12 Jan 2020 23:41:08.702 # KeyDB 5.3.0 crashed by signal: 11
7439:M 12 Jan 2020 23:41:08.702 # Crashed running the instruction at: 0x434955
7439:M 12 Jan 2020 23:41:08.702 # Accessing address: 0x1
7439:M 12 Jan 2020 23:41:08.702 # Failed assertion: (:0)

------ STACK TRACE ------
EIP:
src/keydb-server 0.0.0.0:11000 [cluster](aePostFunction(aeEventLoop*, std::function<void ()>, bool)+0x205) [0x434955]

Backtrace:
src/keydb-server 0.0.0.0:11000 cluster [0x48f139]
src/keydb-server 0.0.0.0:11000 [cluster](sigsegvHandler(int, siginfo_t*, void*)+0xae) [0x48f7be]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390) [0x7f3299d47390]
src/keydb-server 0.0.0.0:11000 [cluster](aePostFunction(aeEventLoop*, std::function<void ()>, bool)+0x205) [0x434955]
src/keydb-server 0.0.0.0:11000 [cluster](aofRewriteBufferAppend(unsigned char*, unsigned long)+0x127) [0x487527]
src/keydb-server 0.0.0.0:11000 [cluster](feedAppendOnlyFile(redisCommand*, int, redisObject**, int)+0x154) [0x488394]
src/keydb-server 0.0.0.0:11000 [cluster](propagate(redisCommand*, int, redisObject**, int, int)+0x55) [0x43c965]
src/keydb-server 0.0.0.0:11000 [cluster](call(client*, int)+0x52c) [0x43d01c]
src/keydb-server 0.0.0.0:11000 [cluster](processCommand(client*, int)+0x7b0) [0x440ec0]
src/keydb-server 0.0.0.0:11000 [cluster](processCommandAndResetClient(client*, int)+0x21) [0x44aa01]
src/keydb-server 0.0.0.0:11000 [cluster](processInputBuffer(client*, int)+0x7b) [0x44f07b]
src/keydb-server 0.0.0.0:11000 [cluster](readQueryFromClient(aeEventLoop*, int, void*, int)+0x328) [0x451a78]
src/keydb-server 0.0.0.0:11000 cluster [0x435639]
src/keydb-server 0.0.0.0:11000 cluster [0x4358c5]
src/keydb-server 0.0.0.0:11000 cluster [0x435c4d]
src/keydb-server 0.0.0.0:11000 cluster [0x439e46]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f3299d3d6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f3299a7341d]

@JohnSully
Collaborator

This should be fixed with the following change: aa6409f

This change was made 26 days ago and your last attempt was 25 days ago, but I don't remember when the change was pushed to GitHub, so it's hard to say whether your test actually ran this code. Could you give it one more try?

@JohnSully
Collaborator

I added fix 16a019d which handles the error case without crashing. Getting into the error case should be near impossible after the earlier change, but this makes certain it's non-fatal.
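
The commit is not reproduced in this thread, so the following is only a sketch of the general idea of turning a fatal assertion into a recoverable error. The stack trace above shows the crash inside aePostFunction, reached from aofRewriteBufferAppend, right after "AE_ASSERT FAILURE". The sketch assumes a design where a callback is handed to an event-loop thread by writing a pointer to a pipe; that mechanism, and the names post_function, run_one_posted, POST_OK and POST_ERR, are hypothetical and not KeyDB's actual code.

```cpp
#include <unistd.h>
#include <functional>
#include <utility>

// Hypothetical result codes for this sketch.
enum PostResult { POST_OK = 0, POST_ERR = -1 };

// Hypothetical: hand a callback to the loop thread by writing a pointer
// to a heap-allocated std::function into a pipe. If the write fails
// (for example because the descriptor is no longer valid), report the
// error to the caller instead of asserting.
PostResult post_function(int write_fd, std::function<void()> fn) {
    auto *heap_fn = new std::function<void()>(std::move(fn));
    ssize_t n = write(write_fd, &heap_fn, sizeof(heap_fn));
    if (n != static_cast<ssize_t>(sizeof(heap_fn))) {
        delete heap_fn;   // never delivered, so free it here
        return POST_ERR;  // non-fatal: the caller decides what to do
    }
    return POST_OK;
}

// Loop-thread side: read one posted pointer, run it, and free it.
// Returns false if nothing valid could be read.
bool run_one_posted(int read_fd) {
    std::function<void()> *heap_fn = nullptr;
    ssize_t n = read(read_fd, &heap_fn, sizeof(heap_fn));
    if (n != static_cast<ssize_t>(sizeof(heap_fn))) return false;
    (*heap_fn)();
    delete heap_fn;
    return true;
}
```

With this shape, a caller such as an AOF buffer append could check for POST_ERR and skip or retry the notification rather than bringing the server down.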
