Crash in actx_get_by_node() #113
Comments
Interesting, do you have any idea what events led up to this?
Unfortunately no.
Hi @bjosv, I have some updates from the core analysis, posting them here. (All data is picked from either frame 0 or 1.) It looks like both the table and slots pointers are holding corrupted data at the time of the crash. In the hiredis code, the node calculation is done as table[slot_number], and the slot_number seen in the core file is 8535. table[8535] is the same corrupted cluster_node pointer seen at the time of the crash, so the corrupted node is pulled from the corrupted table data. However, the nodes dictionary object (of type struct dictEntry) seems to be holding valid data.
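For readers following along, here is a minimal sketch of the table lookup being described, assuming the table is a 16384-entry array of cluster_node pointers indexed by slot number; the names CLUSTER_SLOTS and node_for_slot are illustrative, not the actual hircluster internals:

```c
/* Minimal sketch of the slot-to-node lookup described above, assuming a
 * 16384-entry table of cluster_node pointers indexed by slot number.
 * Names are illustrative; the real lookup lives in hircluster.c. */
#define CLUSTER_SLOTS 16384

struct cluster_node;    /* opaque for this sketch */

static struct cluster_node *
node_for_slot(struct cluster_node **table, unsigned int slot)
{
    if (table == NULL || slot >= CLUSTER_SLOTS)
        return NULL;
    /* In the core file, slot was 8535 and table[8535] already held a
     * corrupted pointer, so whatever is returned here is garbage. */
    return table[slot];
}
```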
So, there are a total of 7 nodes available at the time of the crash, as indicated by the nodes dictionary. All 7 (valid) cluster_node pointers are decoded below.
Each cluster_node holds information about the slots it serves. The data below captures the slot information for each of the above 7 nodes.
Slot-to-cluster_node mapping from the above data:
So, this indicates 6 valid slot ranges, which means there is 1 extra node ($125 from gdb) that has no slots distributed to it. One more interesting observation is that contiguous slots of the corrupted table object seem to hold the same cluster_node address (even though that address is invalid), which is how it should look when a range of slots is assigned to a single cluster_node.
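To illustrate why contiguous entries sharing one address is expected, here is a hedged sketch of how a slot range could be written into the table when it is assigned to a node; assign_slot_range is a made-up helper for illustration, not the actual hircluster code:

```c
/* Hedged sketch: when a node owns the slot range [start, end], every table
 * entry in that range is set to the same cluster_node pointer, which is why
 * contiguous entries are expected to hold identical addresses. */
struct cluster_node;

static void
assign_slot_range(struct cluster_node **table, unsigned int start,
                  unsigned int end, struct cluster_node *node)
{
    for (unsigned int s = start; s <= end && s < 16384; s++)
        table[s] = node;
}
```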
6 unique cluster_node addresses came up (listed below) across the entire 16384 slots, indicating that there were 6 nodes available at some point in time (versus 7 at the time of the crash).
I'm not really sure what path was taken for the corruption to happen. Please share any thoughts you have on this.
Just as a reference, what
And also, what called
Was it from within a response callback, and in that case, was it a NULL response?
@bjosv No, that data looks all corrupted.
No, this was called directly from the application trying to send out a command to the Redis server.
You state that you run multi-threaded, and the question is: in what way? When a response is sent from a Redis node, the event system in use will notice that data has been received on the socket. If the event system is able to process replies in multiple threads at the same time (likely from different nodes), there is a chance that the replies can run the callback that updates the dict and table concurrently.
Hi @bjosv, all access into hiredis is restricted using one common mutex. Any thread trying to call into hiredis first acquires the lock and then makes the hiredis command call. Once the command execution is completed, the lock is released, and another thread can acquire it and invoke its own command call. The event system can only process one reply at a time (it always runs in a single-threaded context). It takes the lock before calling into hiredis for any read/write event, which then executes redisProcessCallbacks().
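For clarity, here is a minimal sketch of the locking pattern just described, assuming one pthread mutex guards every call into the async context; the names cluster_lock, send_command_locked, and reply_cb are illustrative, only redisClusterAsyncCommand and its callback signature come from the hiredis-cluster API:

```c
#include <pthread.h>
#include <hircluster.h>

/* One common mutex guards all calls into hiredis-cluster, as described
 * above. Names other than the hircluster API calls are illustrative. */
static pthread_mutex_t cluster_lock = PTHREAD_MUTEX_INITIALIZER;
static redisClusterAsyncContext *acc;  /* created and attached elsewhere */

static void reply_cb(redisClusterAsyncContext *cc, void *r, void *privdata)
{
    (void)cc; (void)r; (void)privdata;
    /* Runs from the (single-threaded) event loop, which has already
     * taken cluster_lock before driving the read/write handlers. */
}

/* Called from application threads. */
static int send_command_locked(const char *key)
{
    int rc;
    pthread_mutex_lock(&cluster_lock);
    rc = redisClusterAsyncCommand(acc, reply_cb, NULL, "GET %s", key);
    pthread_mutex_unlock(&cluster_lock);
    return rc;
}
```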
We were thinking about how the above stack trace could be related to the one in ticket 124. One theory that came up in our internal discussion is that:
If the above scenario is a possibility, do you agree that we could end up in a case like this, and that the two are linked?
I believe your scenario was a possibility and was likely fixed by #125.
Hello,
We are using hiredis-cluster release 0.8.1, and this issue concerns a crash we see in our performance testing.
We use the async hiredis APIs and invoke them both from the application and from hiredis callbacks.
Our application is multi-threaded, and the hiredis context object and hiredis API calls are mutex protected.
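As an illustration of this usage pattern (not our actual code), here is a minimal sketch of issuing a follow-up command from inside a hiredis-cluster reply callback; the callback names and the key are made up, only redisClusterAsyncCommand and the callback signature are from the library API:

```c
#include <hircluster.h>

/* Illustrative sketch of sending a command from within a reply callback,
 * as our application does. Callback names and the key are made up. */
static void follow_up_cb(redisClusterAsyncContext *acc, void *r, void *privdata)
{
    (void)acc; (void)privdata;
    if (r == NULL)       /* NULL reply indicates an error or disconnect */
        return;
    /* ... handle the follow-up reply ... */
}

static void first_cb(redisClusterAsyncContext *acc, void *r, void *privdata)
{
    (void)privdata;
    if (r == NULL)
        return;
    /* A new command issued from inside a hiredis callback. */
    redisClusterAsyncCommand(acc, follow_up_cb, NULL, "GET %s", "somekey");
}
```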
We noticed the following crash a few hours into our test:
Crash reason: SIGSEGV /0x00000080
Crash address: 0x0
Process uptime: not available
Thread 13 (crashed)
0 libhiredis_cluster.so.0.8!actx_get_by_node [hircluster.c : 3630 + 0x0]
1 libhiredis_cluster.so.0.8!redisClusterAsyncFormattedCommand [hircluster.c : 4117 + 0x8]
2 libhiredis_cluster.so.0.8!redisClustervAsyncCommand [hircluster.c : 4244 + 0x16]
3 libhiredis_cluster.so.0.8!redisClusterAsyncCommand [hircluster.c : 4258 + 0x5]
Kindly help resolve this crash.
Thank you.