Unexpected Crash when booting up (Segmentation Fault) #1176
Hi @namirsab

Hey @OfirMos I could show you the crash that happened before:

But regarding commands, we have the feeling that the problem might be related to TS.DEL. We called TS.DEL on a non-existing time series, that's for sure. This normally shouldn't cause any issues, but my feeling is that if you call TS.DEL with a big enough time range, it might crash Redis. I couldn't reproduce this, though, as we are currently dealing with the original problem: our database doesn't boot up.
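To illustrate the pattern being suspected here: in RedisTimeSeries 1.6, the command takes the form `TS.DEL key fromTimestamp toTimestamp` with millisecond bounds. The sketch below is hypothetical; the key name and bounds are invented for illustration and do not come from the thread.

```shell
# Hypothetical sketch of the suspected call: TS.DEL over a very wide range,
# possibly on a key that no longer exists. Key name and bounds are made up.
FROM_TS=0
TO_TS=9999999999999   # far-future bound, in milliseconds
echo "TS.DEL some_deleted_series $FROM_TS $TO_TS"
```

Piping that generated line into `redis-cli` against a throwaway test instance would be one way to probe whether a wide range on a missing key misbehaves.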
In case this helps, this also happens a lot when reading the AOF, even though this happened before and nothing crashed:

@namirsab Is it possible for you to send me the AOF or some other reproduction script?
We could, but the AOF is quite big: 100 GB. Our database is quite big, unfortunately.
@namirsab So maybe a small reproduction script?
As I said, I cannot reproduce it; I've told you all the information I have. Sorry :(
@namirsab is it a cluster or a single-shard (single-node) Redis?
@OfirMos it's a single node using Sentinel, with 3 replicas. Unfortunately there are no more logs. Before the crashes, the only logs we had were similar to this one:

Also, I see other previous crashes, which I don't know if they are related. I'll post one, but I think they are mostly related to TS.MADD:

I'll try to get you the log files, but I think they might be lost; so far I can only see the logs in Kibana. Do you have an idea what this could be related to?
@namirsab I see that you're using Redis 6.2.4; how did you install/build it?
Hey @gkorland, sorry, but we are using 6.2.6; I made a mistake in the first comment. The exact image we are using is
Since we have an AOF file that can reliably reproduce the crash, we could provide you with a dedicated Ubuntu machine (by authorising your public SSH key) that has sufficient RAM and the AOF file in place for debugging this. Would that work for you?
Sounds great!
I spent a few minutes looking at the disassembly.

The attempt to access the

As for the other crash, the one in

P.S. The stack traces seem a bit messed up and out of order, as if several processes are printing to the same log file and crashed at the same time. I'd also like to draw attention to these:

and

These seem like they could cause some corruption.
Hey @oranagra We are trying to figure out why that

Here we can see that it's an

We are checking manually what happens if we run

Regarding the missing disassembly dump, unfortunately that data is not there; the bug report from Redis stopped early somehow... 🤷

@OfirMos we have been running 1.6.10 since yesterday, after the crash. Do you think upgrading to 1.6.11 would have a big impact?
@namirsab the CRITICAL error is indeed a bug which will be fixed in the next version (I was able to reproduce it).
@namirsab Can you share what types of commands you are using, and also give me access to the dedicated Ubuntu machine on which it reproduces?
@OfirMos I can't think of anything else. As I said, it looks like the command arguments are freed before the module command returns. Maybe that only happens in some error flow; e.g. the cleanup code releases (decrements the refcount of) a string on error, but in some flows it was not retained (the refcount wasn't incremented). I think it would be a good idea to disassemble the other crashes (the one in
Hey @OfirMos
EDIT: We can reproduce it 100% of the time; it seems we were accidentally using a different version. My colleague will write a follow-up message with more details.
In the end, we did manage to set up an Ubuntu machine that reliably crashes when loading an AOF file present there, which is the file that originally took our production down. The crash happens with RedisTimeSeries 1.6.7 and with 1.6.10; in both cases we used Redis 6.2.6. If you provide your SSH public keys, I can authorise them on the machine. If you prefer, you can also join a space on our company's Google Chat (by Google account email), so we can have more real-time communication on the matter. Or we could join a Slack space or similar.
@WeaselScience @namirsab I merged a fix that will probably fix the bug; it's in origin/1.6. Can you try upgrading to this branch? You probably still won't be able to load the AOF, and if it's still needed we will have to edit it.
@OfirMos that's great, thanks. We found a probably related issue with our database that might indicate either corruption or a bug in TS.RANGE. We are testing this with version 1.6.7, because we are trying to reproduce the original problem we had. Querying the same key, first for the whole month and then for just a subset of the month, gives us different results. For the whole month we run

and we only get results up to the 19th of the month, which is not correct because data is missing. We know that because when we run a query for a subset of the month, we get data:

And this is correct. Even weirder, when we use

Also

You can find the results attached. You can just check that the timestamps in the results from the 22nd to the 26th are missing in the
We have been able to export the affected time series. The script below demonstrates the issue:

```shell
# Import
cat energyDelta_6193a943383f940007c0b782.bin.txt | redis-cli -x RESTORE test_energyDelta_6193a943383f940007c0b782 0

# Create check directory
mkdir check

# Entire month
echo "TS.RANGE test_energyDelta_6193a943383f940007c0b782 1640991600000 1643670000000" | redis-cli > check/range_full.txt

# 22nd-26th
echo "TS.RANGE test_energyDelta_6193a943383f940007c0b782 1642806000000 1643151600000" | redis-cli > check/range_part.txt

# Entire month (reverse)
echo "TS.REVRANGE test_energyDelta_6193a943383f940007c0b782 1640991600000 1643670000000" | redis-cli > check/revrange_full.txt

# Show file sizes
ls -lh check
```

We hope this helps!
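For context, the epoch-millisecond bounds used in those TS.RANGE calls decode to calendar dates as follows (assuming GNU `date` is available):

```shell
# Decode the epoch-millisecond bounds from the TS.RANGE calls above (GNU date).
for ms in 1640991600000 1643670000000 1642806000000 1643151600000; do
  date -u -d "@$((ms / 1000))" +"$ms = %Y-%m-%d %H:%M UTC"
done
```

So the "whole month" query spans midnight 1 Jan to midnight 1 Feb 2022 in CET (UTC+1), and the subset spans the 22nd to the 26th, matching the comments in the script.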
I am able to reproduce the above-mentioned issue in

```dockerfile
FROM ubuntu:18.04 AS builder
RUN apt-get update && apt-get install -y git build-essential
RUN git clone --recursive https://github.com/RedisTimeSeries/RedisTimeSeries.git
RUN cd RedisTimeSeries && git checkout "origin/1.6" && ./deps/readies/bin/getpy3 && ./system-setup.py && make build

FROM redis:6.2.6
COPY --from=builder /RedisTimeSeries/bin/redistimeseries.so /redistimeseries.so
CMD redis-server --loadmodule /redistimeseries.so DUPLICATE_POLICY LAST
```

Running the commands in the above post by @namirsab indeed produced unexpectedly inconsistent datasets.
We've been able to reproduce the issue using a fresh Redis 6.2.6 instance with no data, with the latest commit of Redis TimeSeries branch 1.6. Reproduction steps:
We've packaged the reproduction into a repository: https://gitlab.com/weasel.science/redis-timeseries-tsdel-bug-reproduction It includes a self-contained Dockerfile that compiles RedisTimeSeries and reproduces the bug. In this reproduction, the behavior of TS.RANGE and TS.REVRANGE appears consistently broken.
Thank you for the patch. I've rerun the reproduction on the
Yes, it's going to be released hopefully this week as 1.6.12.
@WeaselScience Please rebase over the latest master/1.6 again; there is another PR that is relevant to the bug:
@OfirMos is it safe to use the latest tag 1.6.13, or should we wait for the official release? Thanks!
@namirsab you can use it on an Intel CPU. It has yet to be tested on M1 before release.
We've retested on 1.6.13 and can no longer reproduce the issue.
Redis Version: 6.2.6
RedisTimeSeries version: 1.6.7
After a crash, we tried bringing Redis up and it keeps crashing in `RedisModuleCommandDispatcher`, and the only module we have is TimeSeries, hence I think this bug belongs here. The trace:

We tried fixing the AOF with the `redis-check-aof` tool, but it didn't work out. Finally we had to restore from a backup, unfortunately.

Any clue why this happened?