Problem explanation of stalled and re-queued jobs #1355
answering your questions above:
I think you should look into …
thanks @manast for answering.
as you can see from the execution logs, a worker takes a few seconds (<30) to complete a job. So, even if we don't renew the job's lock, it wouldn't be marked as stalled.
so it does not matter how long it takes for a job to complete, it can be marked temporarily as "stalled", since …
actually, if the execution of a job is faster than …
We have noticed that in the …, as you can see in the logs, the …
Can you check that the clock in the different servers is synchronised, at least to second level?
Yes, we already checked the clocks days ago and installed the same NTP server on each server in order to sync them all.
So I am a bit puzzled by this. I wonder if you could use REDIS MONITOR to get more info when this strange case happens. It seems as if the …
in the beginning of the script we do this check:

```lua
-- Check if we need to check for stalled jobs now.
if rcall("EXISTS", KEYS[5]) == 1 then
  return {{}, {}}
end
```

The key is a key that expires after 30 seconds, and only when it is absent do we move jobs to wait and mark jobs as potentially stalled.
so basically, no matter what, a job that is moved to active should always have at least 30 seconds before it can potentially be marked as stalled.
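In other words, the stalled check itself is gated by an expiring key. A minimal sketch of that gating pattern, written here in plain Node with ioredis purely for illustration (Bull actually does this atomically inside its Lua scripts, and the key name below is made up):

```js
const Redis = require('ioredis');
const redis = new Redis('redis://127.0.0.1:6379');

// Hypothetical key; Bull uses its own internal key for this.
const STALL_CHECK_KEY = 'myqueue:stalled-check';

async function maybeCheckStalled() {
  // SET NX EX: only the first caller in each 30-second window gets through;
  // everyone else sees the key still alive and returns immediately.
  const acquired = await redis.set(STALL_CHECK_KEY, Date.now(), 'EX', 30, 'NX');
  if (!acquired) return;

  // ...only here would stalled jobs be detected and moved back to wait...
}
```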
That is the same hypothesis we arrived at, but the problem is real and it happens. It isn't possible for us to launch REDIS MONITOR when the problem happens, but we are setting up a test environment in order to reproduce and track it. Thanks @manast, I will keep you posted on this. If you have any other ideas please tell me.
We have made some tests on the problem. We inserted pre-recorded production data and sent it to Redis using Bull during the test. The system had about …
we succeeded in causing the issue and tracking it using REDIS MONITOR. Here's the grep of the logs of one single stalled job:
here are the logs from the two consumers that consume the job:
we cross-referenced the redis monitor log and the consumers' logs to understand which server executed each redis command:
We noticed a strange thing: … Is this information useful @manast?
thanks for the detailed information, I am analysing it, trying to understand why this edge case occurs.
can you provide info about your current redis setup?
Seems like we will need the previous logs of what you are showing in the issue, because the following logs are super strange:
It would seem as if the EXISTS call is the one in …
since it is a set there cannot be duplicates, so how can it get 3 identical elements? Furthermore, the call to SMEMBERS is not part of these logs, which I also find very interesting. Since LUA scripts are atomic, it should not be possible for the commands to be interleaved like what you are displaying; I mean, between this call …
I think they are not duplicates, since they have 3 different timestamps (30s between each other), so they probably are from 3 different executions of the …
here is the redis conf. It is pretty much the default file besides the eviction part and the save part:
Is it just one redis server, or do you have a master-slave setup? Regarding the three EXISTS: even if they are three different calls to the script, where are all the previous commands? It is impossible that they get interleaved like this; redis executes the LUA commands atomically.
so it would be quite interesting if you could get more logs from before the event, and also to see whether the missing logs have timestamps that actually imply the LUA script was executed correctly but redis monitor is somehow outputting the logs out of order. That would be quite helpful to understand what is going on.
just a single server. Redis is bound to a tun0 interface created by OpenVPN, through which each server that needs Redis connects. About logs: give me some minutes and I will provide a more detailed log file.
ok, so after some analysis, what I can see is that the … I do not think this is strange. Angel4 locks the job because it just had the "luck" to fetch it from the queue. For some reason Angel4 is not extending the lock after 45 seconds (since you have a 90s lockDuration, the lock will renew after half that time). Since it is not renewing the lock, the job is marked as stalled, and when this happens any other worker can pick up the job, in this case Angel5. Could it be that Angel4 is overloaded and incapable of renewing the lock?
also, in any case, make sure that the stalledInterval is the same as the lockDuration. Actually, I may remove one of those settings so that it is not possible to make mistakes.
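As a concrete illustration of that advice, a Bull 3.x queue configured with matching values could look like this (the queue name and Redis URL are placeholders; 90000ms mirrors the lockDuration mentioned above):

```js
const Queue = require('bull');

const queue = new Queue('myqueue', 'redis://127.0.0.1:6379', {
  settings: {
    lockDuration: 90000,    // how long a worker holds the job lock (ms)
    lockRenewTime: 45000,   // renew at half the lock duration (the default)
    stalledInterval: 90000, // keep equal to lockDuration, per the advice above
    maxStalledCount: 1,     // how many stalls are tolerated before the job fails
  },
});
```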
It is not possible that a consumer froze for 45s; we have a monitoring application that records system statistics for each server. The only high statistic was TCP retransmission.
the modification of …
I will try to scale down the number of consumers on Angel4. Thanks @manast, I will keep you posted.
facing the same issue after moving to the default redis docker image instead of bitnami (https://github.com/bitnami/bitnami-docker-redis), so it could be related to the redis config.
My team and I are currently noticing similar behavior to this. @Gappa88 did you ever find a solution to your problem?
I don't really remember what we did to address this issue. For sure, we got rid of the OpenVPN structure, because it created a bottleneck in communications, replaced it with a ufw firewall and some custom rules, and scaled down the system. Now, after 2 years, I still hope this will be useful. If you find any solution please share it with us @beardedtim
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I'm creating a new, complete issue/question in order to gather the information needed to solve and understand the problem we actually have.
We have already tried to get some information by posting separate questions, but I would like to lay out the complete structural overview in order to solve the problem. Here are the previous questions:
#1349
#1341
#1352
We have a Pub/Sub system in which there are about 50 subscribers (split between 5 distinct servers) and 50 publishers (split between 3 distinct servers) that read/write data from/to Redis using Bull.
The load is about 400k jobs created in 3 hours; each job's size ranges from a few hundred KB to a few tens of MB.
Each subscriber gets one job at a time and processes it in anywhere from a few hundred milliseconds to a few tens of seconds (< 30s).
Every x seconds, each publisher checks whether Redis is 85% full; if it is, the publisher will not write to Redis, otherwise it will (sketched below).
Redis is installed on another server with an upper bound on RAM utilisation, and the maxmemory-policy is set to noeviction.
All servers are connected via VPN.
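A minimal sketch of the fill-level check described above, assuming ioredis and that maxmemory is configured on the Redis server (the 0.85 threshold mirrors the 85% figure; everything else is illustrative):

```js
const Redis = require('ioredis');
const redis = new Redis('redis://127.0.0.1:6379');

// Returns true when used memory exceeds 85% of the configured maxmemory.
async function redisAlmostFull() {
  const info = await redis.info('memory'); // plain-text INFO section
  const used = Number(info.match(/used_memory:(\d+)/)[1]);
  const max = Number(info.match(/maxmemory:(\d+)/)[1]);
  return max > 0 && used / max > 0.85;
}
```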
We are experiencing some strange behaviours regarding `stalled` jobs and events. I would like to ask some questions first:

1. Are `lua` scripts executed atomically?
2. Can a `lua` script crash? If so, what happens? Is the error managed or logged somewhere?
3. When is the `moveToActive-8.lua` script called?

here is the subscriber's structure:
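(The original snippet is not shown here; purely as a stand-in, a minimal Bull subscriber that takes one job at a time, as described above, could look roughly like this — queue name and handler are hypothetical:)

```js
const Queue = require('bull');

const queue = new Queue('myqueue', 'redis://127.0.0.1:6379');

// Concurrency 1: each subscriber handles a single job at a time.
queue.process(1, async (job) => {
  await handle(job.data); // hypothetical processing function
});
```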
here's an example of a bad job execution:
log explanation: the job is received by one server, called 'Angel5', and after a few fractions of a second the job is marked as `stalled`, but the event is collected by another server, called 'Paganini'. Meanwhile the job is correctly executed and completed by 'Angel5', but because of the stalled event the job is re-queued and processed a second time by another server, called 'Angel4', which, at the end of its execution, fails to finish the job because it was already deleted by 'Angel5'.
This sequence happens a few hundred times over the whole 3 hours, so not much, but we would like to know where the real cause comes from.
We didn't find any point of failure in our scripts (which doesn't mean there aren't bugs, just that we didn't find them) and tried to debug the whole Bull library. From what we noticed, considering our timing (less than 30 seconds), the only moment in which a job could be marked as `stalled`, and its event fired, is between the shift of the job from the `wait` queue to the `active` queue and the moment the lock is acquired. Those things happen inside the `moveToActive-8.lua` script.

Bull version: 3.10
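For reference, these are the events in question as seen from the Node side in Bull 3.x; a small sketch (queue name and Redis URL are illustrative):

```js
const Queue = require('bull');

const queue = new Queue('myqueue', 'redis://127.0.0.1:6379');

// Fired on the local worker when one of its own jobs is detected as stalled.
queue.on('stalled', (job) => {
  console.warn(`job ${job.id} stalled on this worker`);
});

// Fired on any listener when a job stalls anywhere in the cluster — which
// matches the report above: 'Paganini' collected the event for a job
// that 'Angel5' was processing.
queue.on('global:stalled', (jobId) => {
  console.warn(`job ${jobId} stalled somewhere in the cluster`);
});
```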
@manast could you please help guide us through all this?