[BUG] rdb_bgsave_in_progress for a long time, aof file growing non-stop #675
Comments
I have a strace (from before the kill) if you need it; tell me where I can send it.
Hi @zas, the AOF file growing is caused by the background save process hanging. If this happens again, sharing a stack trace of the background save process (it is a different process created with fork(), named "keydb-rdb-bgsave"), or an strace of it, would be very helpful!
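For anyone else hitting this, a rough sketch of how one could capture that stack trace, assuming gdb is installed on the host or inside the container (these commands are an editorial addition, not from the thread; <pid> stands for the stuck child's PID):

    pgrep -af keydb-rdb-bgsave                      # find the forked child's PID
    gdb -p <pid> -batch -ex 'thread apply all bt'   # user-space backtraces of all threads
    cat /proc/<pid>/stack                           # kernel-side stack (needs root)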
It happened again. The log output stopped at:
The strace doesn't show much; the process is stuck. A simple kill has no effect, so:
After that, log resumed:
The issue is clearly that the rdb process is hanging, but in order to fix it I need to understand why. If you can't get the stack trace, can you provide repro instructions (your config, what commands you ran in order to see this issue) so that I can repro it and fix it?
Strace shows the bgsave process gets stuck, but as shown on the graphs above it only happens once every X days. Actually it happened 4 times in 20 days, at different times, so it is hard to correlate with anything. It happened once on one instance, and three times on the other one (so it doesn't look to depend on the machine). As explained above, we run keydb in a docker container. One instance (10.2.2.30) has the following options:
The other (10.2.2.60):
It started to happen after we upgraded to 6.3.1 (and now 6.3.3). AFAIK we never had the issue with versions prior to 6.3.x. Here is the output of one instance (the other is the same except for the IP):
@zas can you also share the output of
On rex:
On rudi:
Note: that's after a restart because of the stuck process, so normal operation at the moment. EDIT: it happened again (rex bgsave stuck); here are the outputs. On rex:
On rudi:
Hey guys, just had this issue. I'm also using KeyDB 6.3.3 on docker (official image) and discovered this might be a bug. I will see what's going on as soon as I find some spare time to dig in. Meanwhile, here is the output of strace -i for some commands:
Hello.
Once the stuck bgsave process is killed, the next bgsave runs successfully. Log file:
Hi! Attaching more logs. Please take a look at the time.
By the way.
Hi! Just wanted to confirm ekexcello's comment: I ran a test server with appendonly disabled and the same thing happened.
On further investigation and analysis of what's going on, we also found that similar behavior was described in the Redis bug tracker. Is anybody aware whether this applies to (or was fixed in) KeyDB 6.3?
It seems to be an issue in the redis tests; I'm not sure it is similar to the one described here (which happens at runtime).
It is still unfixed, happening randomly here with keydb 6.3.3 (the latest stable as of today).
Happens randomly. Sometimes only the AOF of one node is growing. The last time we observed the issue, all three nodes were affected at the same time and KeyDB was not working properly anymore. We had to clean the AOF and recreate the K8s pods. Is there any update about a fix, or at least any workaround?
We're affected by the same bug. Eagerly waiting for a fix.
I just realized that I have an instance that has been running for 88 days, which I maintain every two days by running kill -9 {pid} whenever it stops working during bgsave.
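Until a fixed release is out, a crude watchdog along these lines could automate that manual kill. This is only a sketch, not from this thread: it assumes keydb-cli is on the PATH, that the forked child keeps the "keydb-rdb-bgsave" process title mentioned earlier, and that INFO persistence reports rdb_current_bgsave_time_sec as stock Redis does; the one-hour threshold is arbitrary:

    #!/bin/sh
    # Hypothetical watchdog sketch: kill the forked save child if a bgsave
    # has been running for over an hour. rdb_current_bgsave_time_sec is -1
    # when no save is in progress, so the check is skipped in that case.
    ELAPSED=$(keydb-cli INFO persistence | tr -d '\r' |
              awk -F: '/^rdb_current_bgsave_time_sec/ {print $2}')
    if [ -n "$ELAPSED" ] && [ "$ELAPSED" -gt 3600 ]; then
        pkill -9 -f keydb-rdb-bgsave    # same effect as the manual kill -9
    fi

Run from cron every minute or so; as noted earlier in the thread, once the stuck child is killed the next bgsave proceeds normally.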
#720 says it contains a fix for this issue. Is there a time window for the next stable release? Also, what is the actual fix for this issue among all those merged commits?
The issue was a race condition on a lock when forking; the fix was 596c513: don't touch the lock in the forked process. The next release should be within this month; unfortunately I have been busy with internal work and haven't been able to get it out yet.
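For anyone curious about the mechanics: fork() copies only the calling thread, so a lock held by any other thread at fork time stays locked forever in the child, and whether that happens depends on timing, which matches the once-every-few-days randomness reported above. A minimal C sketch of this failure class (illustrative only, not KeyDB's actual code):

    /* deadlock_sketch.c -- illustrative only, not KeyDB code.
     * A worker thread constantly takes and releases a mutex; the main
     * thread forks. If the fork lands while the worker holds the mutex,
     * the child inherits a locked mutex with no owner thread and hangs,
     * just like the stuck keydb-rdb-bgsave child.
     * Build: cc deadlock_sketch.c -lpthread -o deadlock_sketch */
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (;;) {                 /* hammer the lock from another thread */
            pthread_mutex_lock(&lock);
            usleep(100);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        usleep(1000);              /* give the worker time to start */

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: only this thread exists here. If the worker held the
             * lock at fork time, this blocks forever. The fix is the same
             * idea as 596c513: never touch such a lock in the forked child. */
            pthread_mutex_lock(&lock);
            printf("child acquired the lock (got lucky this run)\n");
            _exit(0);
        }
        waitpid(pid, NULL, 0);     /* may wait forever if the child hung */
        return 0;
    }

Run it in a loop and only some runs hang, which is exactly why the stuck save was so hard to reproduce on demand.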
Soo many "thank yous" sent in a huge array occupying like 192GB of RAM to you guys.
The issue hasn't appeared since the fix. Thanks a lot.
This just happened again for us on the latest version, v6.3.4.
Describe the bug
We are running a master<->master KeyDB setup, using keydb docker images from eqalpha/keydb:x86_64_v6.3.3.
After some time without any problem, the AOF file starts to grow on ONE instance.
Currently that's the instance where most clients connect, but we had the issue with the other one some time ago.
Here is the info output for the instance with the growing AOF (rex):
Here is the info output for the instance without the issue (rudi):
It should be noted that the rudi instance is less active, but we had the problem with it too a week ago; when it happened, we upgraded from 6.3.1 to 6.3.3 in the hope it would fix the issue.
On rex, we can see:
Killing the stuck process:
After a few seconds, on rex: