Problem explanation of stalled and re-queued jobs #1355

Closed
Gappa88 opened this issue Jun 20, 2019 · 28 comments

@Gappa88 commented Jun 20, 2019

I'm opening a new, complete issue/question to gather all the information needed to understand and solve the problem we are having.
We have already tried to get some answers by posting separate questions, but I would like to lay out the complete picture here in order to solve the problem. Here are the previous questions:
#1349
#1341
#1352

We have a Pub/Sub system with about 50 subscribers (split across 5 distinct servers) and 50 publishers (split across 3 distinct servers) that read/write data from/to Redis using Bull.
The load is about 400k jobs created in 3 hours; each job's size ranges from a few hundred KB to a few tens of MB.
Each subscriber gets one job at a time and processes it in anywhere from a few hundred milliseconds to a few tens of seconds (< 30 s).
Each publisher checks every x seconds whether Redis is 85% full; if it is, the publisher does not write to Redis, otherwise it does.
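A rough sketch of that publisher-side memory check (assuming ioredis; the connection details and the helper name are placeholders, and the 85% threshold is the one described above):

// publisher-side backpressure check: is Redis below 85% of maxmemory?
const Redis = require('ioredis');
const client = new Redis({ host: '127.0.0.1', port: 6379 });

async function redisHasRoom() {
  const info = await client.info('memory');
  const read = (field) => Number((info.match(new RegExp(`^${field}:(\\d+)`, 'm')) || [])[1] || 0);
  const used = read('used_memory');
  const max = read('maxmemory');
  // maxmemory 0 means "no limit"; otherwise publish only while usage is under 85%
  return max === 0 || used / max < 0.85;
}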

Redis is installed on another server with an upper bound on RAM utilization, and maxmemory-policy is set to noeviction.

All servers are connected via VPN.

We are experiencing some strange behaviours regarding stalled jobs and events.

I would like to ask some questions first:

  • are lua scripts executed atomically?
  • could a lua script crash? If so, what happens? Is the error handled or logged somewhere?
  • when a subscriber gets a job, is the moveToActive-8.lua script called?

Here is the subscriber's structure:

// master.js

const cp = require('child_process');
const nconf = require('nconf');
const Bull = require('bull');

// map of job id -> done callback, so the child process result can complete the right job
const jobid_dones = {};

const processor = cp.fork(`${__dirname}/processor.js`, []);

const redis_conf = nconf.get("redis_angl:connection_parameters");
const queue = new Bull(nconf.get("redis_angl:prefix"),
  {
    settings: {
      maxStalledCount: 1
    },
    redis: redis_conf
  }
);

queue.process(1, (job, done) => {
  // save the done callback and hand the job over to the child process
  jobid_dones[job.id] = done;
  processor.send({ cmd: 'proc', job });
});

processor.on('message', (obj) => {
  switch (obj.cmd) {
    case 'processed': {
      // complete the job with the saved callback
      jobid_dones[obj.job_id]();
      delete jobid_dones[obj.job_id];
      break;
    }
  }
});

queue.on('completed', async (job, result) => { /* .... do stuff ... */ });
queue.on('error', (err) => { /* .... do stuff ... */ });
queue.on('failed', function (job, err) { /* .... do stuff ... */ });
queue.on('stalled', function (job) { logger(`job #${job.id} stalled`); });

// processor.js

process.on('message', async (obj) => {
  switch (obj.cmd) {
    case 'proc': {
      const job = obj.job;
      // execute the job
      logger(`job #${job.id}`, 'EXECUTE_JOB');
      // ..... business logic producing `result` .....
      logger(`job #${job.id}`, 'JOB_EXECUTED');
      // return the result so the master can complete the job
      process.send({ cmd: 'processed', job_id: job.id, result });
      break;
    }
  }
});

Here's an example of a bad job execution:

datetime,hostname,"ip_address","job_id",message,"message_type"
"Jun 20 2019 @ 01:53:15.682",Angel5,"405,284","job #405284","EXECUTE_JOB"
"Jun 20 2019 @ 01:53:15.759",Paganini,"405,284","job #405284 stalled",ERROR
"Jun 20 2019 @ 01:53:15.869",Angel5,"405,284","job #405284 op...",MESS
"Jun 20 2019 @ 01:53:15.873",Angel5,"405,284","job #405284 op... in: 4ms",MESS
"Jun 20 2019 @ 01:53:15.877",Angel5,"405,284","job #405284 done in 8ms",MESS
"Jun 20 2019 @ 01:53:15.889",Angel5,"405,284",208,"JOB_EXECUTED"
"Jun 20 2019 @ 01:53:15.894",Angel5,"405,284","job #405284 completed","JOB_COMPLETED"
"Jun 20 2019 @ 01:53:22.569",Angel4,"405,284","job #405284","EXECUTE_JOB"
"Jun 20 2019 @ 01:53:22.696",Angel4,"405,284","job #405284 op...",MESS
"Jun 20 2019 @ 01:53:22.702",Angel4,"405,284","job #405284 op...",MESS
"Jun 20 2019 @ 01:53:22.702",Angel4,"405,284","job #405284 done 5ms",MESS
"Jun 20 2019 @ 01:53:22.703",Angel4,"405,284",134,"JOB_EXECUTED"
"Jun 20 2019 @ 01:53:22.702",Angel4,"405,284",""error : {"stack":"Error: Missing lock for job 405284 failed
    at Object.finishedErrors (/path/to/script/node_modules/bull/lib/scripts.js:182:16)
    at multi.exec.then.results (/path/to/script/node_modules/bull/lib/job.js:268:31)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:189:7)","message":"Missing lock for job 405284 failed"}"}",MESS

Log explanation:
the job is received by one server, called 'Angel5', and a fraction of a second later the job is marked
as stalled, but the stalled event is collected by another server, called 'Paganini'.
Meanwhile the job is correctly executed and completed by 'Angel5', but because of the stalled event the job is re-queued and processed a second time by another server, called 'Angel4', which, at the end of its execution, fails to finish the job because it has already been deleted by 'Angel5'.

This sequence happens a few hundred times over the whole 3 hours, so not often, but we would like to know where it really comes from.
We didn't find any point of failure in our scripts (that doesn't mean there are no bugs, just that we didn't find any), and we tried to debug the whole Bull library. From what we observed, given our timing (less than 30 seconds), the only moment at which a job could be marked as stalled, and its event fired, is between the moment the job is moved from the wait queue to the active queue and the moment the lock is acquired.
Both of these happen inside the moveToActive-8.lua script.

BULL Version: 3.10

@manast could you please guide us through this?

@manast (Member) commented Jun 21, 2019

answering your questions above:

  • are lua scripts executed atomically?
    Yes.

  • could a lua script crash? If so, what happens? The error is managed or logged somewhere?
    In theory they should never crash, but since the calls are promise based you should either see the error handled as a job failure or surface as an unhandled exception.

  • when a subscriber gets a job, is the moveToActive-8.lua script called ?
    Not always; if the wait list is not empty, moveToFinish will move the next job to the active state, to avoid an extra roundtrip.

I think you should look into moveStalledJobsToWait-7. That is the script that periodically checks for stalled jobs. Basically, workers need to remove the jobs they are processing from the stalled set; if they fail to do so in time, the job is handled as stalled. So somehow, in your case, your workers are not calling extendLock-2 often enough, which ends with the jobs being marked as stalled.
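For reference, the timing knobs mentioned here are Bull v3 advanced settings and can be set explicitly when the queue is created. A minimal sketch with the (approximate) defaults spelled out; the queue name and values are only illustrative:

const Bull = require('bull');

const queue = new Bull('my_queue', {
  redis: { host: '127.0.0.1', port: 6379 },
  settings: {
    lockDuration: 30000,    // how long a worker's lock on a job lives (ms)
    lockRenewTime: 15000,   // how often the worker renews the lock (defaults to lockDuration / 2)
    stalledInterval: 30000, // how often each worker runs the stalled-jobs check (ms)
    maxStalledCount: 1      // how many stalls are tolerated before the job is failed
  }
});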

@Gappa88 (Author) commented Jun 21, 2019

thanks @manast for answering.

So somehow, in your case, your workers are not calling extendLock-2 often enough, which ends with the jobs being marked as stalled.

As you can see from the execution logs, a worker takes a few seconds (< 30) to complete a job, so even if we don't renew the job's lock it shouldn't be marked as stalled.
One more thing: in the logs you can also see that the job is marked as stalled just a few milliseconds after it started.
By the way, we haven't modified the default values except for maxStalledCount.

@manast (Member) commented Jun 21, 2019

It does not matter how long a job takes to complete; it can still be marked temporarily as "stalled". Since moveStalledJobsToWait runs every 30 seconds, a job could be moved to active right before moveStalledJobsToWait is called.

@Gappa88 (Author) commented Jun 21, 2019

Actually, if the execution of a job is faster than lockDuration, code freezes or not, we cannot understand how it is possible that just a few milliseconds after the job is taken and locked by a worker it gets marked as stalled and its event fired (meaning it is not just marked as temporarily stalled).

We have noticed that in the moveStalledJobsToWait script, if a job still has its lock it will not be marked as permanently stalled and moved to the wait queue, so the problem seems to point to the lock handling.

as you can see in the logs:

"Jun 20 2019 @ 01:53:15.682",Angel5,"405,284","job #405284","EXECUTE_JOB"
"Jun 20 2019 @ 01:53:15.759",Paganini,"405,284","job #405284 stalled",ERROR"

EXECUTE_JOB means that the worker got job 405284, and a few milliseconds later another worker, installed on another server, marked this very job as stalled.

  • does this mean that somehow the lock is not correctly set by the worker?
  • could the lock duration be set incorrectly?

@manast (Member) commented Jun 21, 2019

Can you check that the clocks on the different servers are synchronised, at least to the second?

@Gappa88 (Author) commented Jun 21, 2019

Yes, we already checked the clocks days ago and configured the same NTP server on each machine in order to keep them in sync.
They all have the same time.

@manast (Member) commented Jun 22, 2019

So I am a bit puzzled by this. I wonder if you could use REDIS MONITOR to get more info when this strange case happens. It seems as if the moveStalledJobsToWait script is executed much more often than every 30 s, which should be impossible.

@manast (Member) commented Jun 22, 2019

At the beginning of the script we do this check:

-- Check if we need to check for stalled jobs now.
if rcall("EXISTS", KEYS[5]) == 1 then
  return {{}, {}}
end

That key expires after 30 seconds, and only once it has expired do we move jobs to wait and mark jobs as potentially stalled.
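In client-side terms the guard behaves like a "run at most once per interval" gate keyed on a TTL; roughly equivalent to the following sketch (ioredis, not the actual Bull code, which performs the check inside the Lua script):

// returns true for at most one caller per intervalMs window; everyone else skips the stalled check
async function shouldRunStalledCheck(client, guardKey, intervalMs) {
  const acquired = await client.set(guardKey, '1', 'PX', intervalMs, 'NX');
  return acquired === 'OK';
}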

@manast (Member) commented Jun 22, 2019

So basically, no matter what, a job that is moved to active should always have at least 30 seconds before it can potentially be marked as stalled.

@Gappa88 (Author) commented Jun 24, 2019

That is the same hypothesis we arrived at, but the problem is real and it happens.

It isn't possible for us to run REDIS MONITOR exactly when the problem happens, but we are setting up a test environment in order to reproduce and track it.
Otherwise we could try to keep REDIS MONITOR active for a whole day, hoping it doesn't slow the system down too much.

Thanks @manast , I will keep you posted on this. If you have any other ideas please tell me

@Gappa88 (Author) commented Jun 27, 2019

We have run some tests on the problem.
We stripped the program of our business logic, keeping just the Bull connection and adding 1 second of simulated work using await new Promise(r => setTimeout(r, 1000)) in the queue.process handler, roughly as sketched below.
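A minimal sketch of that stripped-down consumer (the queue name, connection details and concurrency here are placeholders):

const Bull = require('bull');

const queue = new Bull('html_pusher', { redis: { host: '127.0.0.1', port: 6379 } });

// business logic replaced by a fixed 1 second of simulated work
queue.process(1, async (job) => {
  await new Promise((r) => setTimeout(r, 1000));
  return { ok: true };
});

queue.on('stalled', (job) => console.warn(`job #${job.id} stalled`));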

We inserted pre-recorded production data and sent it to Redis using Bull during the test.

The system had about 60 consumers spread over 5 servers and 40 publishers spread over 3 servers.
The data consists of a JSON object with some short fields and one large text field.
The data had the following length distribution:

mean length = 250 KB 
max length = 1.2 MB
min length = 600 B

We managed to reproduce the issue and track it using REDIS MONITOR. Here's a grep of the logs for one single stalled job:

1561471034.290475 [0 lua] "SET" "bull:html_pusher:29512:lock" "8c4564ad-842a-4165-8aa3-51b6362b3ab6" "PX" "90000"
1561471034.290491 [0 lua] "ZREM" "bull:html_pusher:priority" "29512"
1561471034.290498 [0 lua] "PUBLISH" "bull:html_pusher:active@8c4564ad-842a-4165-8aa3-51b6362b3ab6" "29512"
1561471034.290509 [0 lua] "HSET" "bull:html_pusher:29512" "processedOn" "1561471034258"
1561471034.290525 [0 lua] "HGETALL" "bull:html_pusher:29512"

1561471066.807329 [0 lua] "EXISTS" "bull:html_pusher:29512:lock"
1561471097.302379 [0 lua] "EXISTS" "bull:html_pusher:29512:lock"
1561471128.313215 [0 lua] "EXISTS" "bull:html_pusher:29512:lock"

1561471128.313226 [0 lua] "LREM" "bull:html_pusher:active" "1" "29512"
1561471128.313247 [0 lua] "HINCRBY" "bull:html_pusher:29512" "stalledCounter" "1"
1561471128.313262 [0 lua] "RPUSH" "bull:html_pusher:wait" "29512"
1561471128.313275 [0 lua] "PUBLISH" "bull:html_pusher:stalled@" "29512"

1561471128.624921 [0 lua] "SET" "bull:html_pusher:29512:lock" "492808c8-5ad4-4413-bff2-634b4b6dd630" "PX" "90000"
1561471128.624934 [0 lua] "ZREM" "bull:html_pusher:priority" "29512"
1561471128.624940 [0 lua] "PUBLISH" "bull:html_pusher:active@492808c8-5ad4-4413-bff2-634b4b6dd630" "29512"
1561471128.624958 [0 lua] "HSET" "bull:html_pusher:29512" "processedOn" "1561471128360"
1561471128.624975 [0 lua] "HGETALL" "bull:html_pusher:29512"

1561471129.895860 [0 lua] "EXISTS" "bull:html_pusher:29512"
1561471129.895870 [0 lua] "GET" "bull:html_pusher:29512:lock"
1561471129.895878 [0 lua] "DEL" "bull:html_pusher:29512:lock"
1561471129.895887 [0 lua] "LREM" "bull:html_pusher:active" "-1" "29512"
1561471129.895898 [0 lua] "DEL" "bull:html_pusher:29512" "bull:html_pusher:29512:logs"
1561471129.895923 [0 lua] "PUBLISH" "bull:html_pusher:completed" "{\"jobId\":\"29512\"}"


1561471155.079505 [0 10.8.0.52:39112] "hmset" "bull:html_pusher:29512" "attemptsMade" "1" "stacktrace" "[\"Error: Missing key for job 29512 finished\\n    at Object.finishedErrors (/path/to/script/node_modules/bull/lib/scripts.js:180:16)\\n    at job.queue.client.moveToFinished.then.result (/path/to/script/node_modules/bull/lib/scripts.js:169:23)\\n    at <anonymous>\\n    at process._tickCallback (internal/process/next_tick.js:189:7)\"]" "failedReason" "Missing key for job 29512 finished"

1561471155.079653 [0 lua] "EXISTS" "bull:html_pusher:29512"
1561471155.079661 [0 lua] "GET" "bull:html_pusher:29512:lock"

Here are the logs from the two consumers that consumed the job:

Server Angel5 logs:
{"name":"solar","hostname":"Angel5","pid":31224,"level":30,"msg":"job arrived #29512","time":"2019-06-25T13:58:48.639Z","v":0}
{"name":"solar","hostname":"Angel5","pid":31224,"level":30,"msg":"job #29512 completed","time":"2019-06-25T13:58:49.922Z","v":0}


Server Angel4 logs:
{"name":"solar","hostname":"Angel4","pid":5752,"level":30,"msg":"job arrived #29512","time":"2019-06-25T13:59:13.702Z","v":0}
{"name":"solar","hostname":"Angel4","pid":5752,"level":50,"msg":"error : {\"stack\":\"Error: Missing lock for job 29512 failed\\n    at Object.finishedErrors (/path/to/script//node_modules/bull/lib/scripts.js:182:16)\\n    at multi.exec.then.results (/path/to/script/node_modules/bull/lib/job.js:268:31)\\n    at <anonymous>\\n    at process._tickCallback (internal/process/next_tick.js:189:7)\",\"message\":\"Missing lock for job 29512 failed\"}","time":"2019-06-25T13:59:15.082Z","v":0}

We cross-referenced the redis monitor log and the consumer logs to understand which server executed each redis command:

8c4564ad-842a-4165-8aa3-51b6362b3ab6 = 10.8.0.52 = Angel4
492808c8-5ad4-4413-bff2-634b4b6dd630 = 10.8.0.56 = Angel5
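This token-to-host mapping can also be logged directly: in Bull v3 each queue instance generates one uuid and uses it as the value of the locks it takes (it is exposed as the internal, undocumented queue.token property), so printing it at startup gives the same correlation without grepping:

const os = require('os');
const Bull = require('bull');

const queue = new Bull('html_pusher', { redis: { host: '127.0.0.1', port: 6379 } });

// NOTE: queue.token is a Bull v3 internal; it is the value written to bull:<queue>:<id>:lock
// and the suffix of the active@<token> pub/sub channel seen in the MONITOR output above.
console.log(`host=${os.hostname()} bull worker token=${queue.token}`);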

We noticed a strange thing:
from the Redis MONITOR log you can see that Angel4 was the first one to lock the job, and after 90000 ms (we set lockDuration to 90 s just for the test) the job is locked again, but by Angel5.
The strange thing is that, from the consumer logs, you can see that the one that executes the job first is not Angel4 but Angel5.
It seems that the first lock, the one set by Angel4, was useless because the job didn't reach the queue.process handler until after Angel5 had already processed it; in fact Angel4 failed to complete the job because it had already been deleted by Angel5.

Is this information useful, @manast?

@manast (Member) commented Jun 27, 2019

Thanks for the detailed information, I am analysing it and trying to understand why this edge case occurs.

@manast (Member) commented Jun 27, 2019

can you provide info about your current redis setup?

@manast (Member) commented Jun 27, 2019

It seems we will need the logs preceding what you are showing in the issue, because the following logs are super strange:

1561471066.807329 [0 lua] "EXISTS" "bull:html_pusher:29512:lock"
1561471097.302379 [0 lua] "EXISTS" "bull:html_pusher:29512:lock"
1561471128.313215 [0 lua] "EXISTS" "bull:html_pusher:29512:lock"

1561471128.313226 [0 lua] "LREM" "bull:html_pusher:active" "1" "29512"
1561471128.313247 [0 lua] "HINCRBY" "bull:html_pusher:29512" "stalledCounter" "1"
1561471128.313262 [0 lua] "RPUSH" "bull:html_pusher:wait" "29512"
1561471128.313275 [0 lua] "PUBLISH" "bull:html_pusher:stalled@" "29512"

It would seem that the EXISTS call is the one in moveStalledJobsToWait-7;
however, it seems impossible that it is repeated three times for the same jobId, because the list of jobIds is taken from this call:

local stalling = rcall('SMEMBERS', KEYS[1])

Since it is a set, there cannot be duplicates, so how can it return 3 identical elements? Furthermore, the call to SMEMBERS is not part of these logs, which I also find very interesting. Since LUA scripts are atomic, it should not be possible for the commands to be interleaved the way you are displaying them; I mean, between this call 1561471034.290525 [0 lua] "HGETALL" "bull:html_pusher:29512" and 1561471066.807329 [0 lua] "EXISTS" "bull:html_pusher:29512:lock" there should be 5 other calls:

if rcall("EXISTS", KEYS[5]) == 1 then
rcall("SET", KEYS[5], ARGV[3], "PX", ARGV[4])
local stalling = rcall('SMEMBERS', KEYS[1])
  if rcall("EXISTS", KEYS[6]) ~= 1 then
  rcall('DEL', KEYS[1])

@Gappa88 (Author) commented Jun 27, 2019

[...] since it is a set there cannot be duplicates [...]

I think they are not duplicates, since they have 3 different timestamps (30 s apart from each other), so they are probably from 3 different executions of the moveStalledJobsToWait-7 script.

can you provide info about your current redis setup?

Here is the redis conf. It is pretty much the default file besides the eviction part and the save part:

protected-mode no
tcp-backlog 511
timeout 0
tcp-keepalive 300
daemonize no

save 10 1
stop-writes-on-bgsave-error no
rdbcompression yes
rdbchecksum yes

repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no

requirepass <password>

maxmemory 11000mb
maxmemory-policy noeviction

lazyfree-lazy-eviction no
lazyfree-lazy-expire no
lazyfree-lazy-server-del no
slave-lazy-flush no

appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
aof-use-rdb-preamble no

lua-time-limit 5000
latency-monitor-threshold 0
activerehashing yes

client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60

hz 10
aof-rewrite-incremental-fsync yes

@manast (Member) commented Jun 27, 2019

Is it just one redis server or do you have a master-slave setup?

Regarding the three EXISTS calls: even if they are three different calls to the script, where are all the previous commands? It is impossible for them to be interleaved like this; redis executes the LUA commands atomically.

@manast (Member) commented Jun 27, 2019

So it would be quite interesting if you could get more logs from before the event, and also to see whether the missing entries have timestamps that actually imply the LUA script was executed correctly but redis monitor is somehow outputting the logs out of order. That would be quite helpful for understanding what is going on.

@Gappa88 (Author) commented Jun 27, 2019

Is it just one redis server or do you have a master-slave setup?

Just a single server. Redis is bound to a tun0 interface created by OpenVPN, through which each server that needs Redis connects.

About the logs:
we grepped the original log file to keep only the job id that causes the issue.

Give me a few minutes and I will provide a more detailed log file.

@Gappa88 (Author) commented Jun 27, 2019

here it is @manast

redis_log

the affected job_id was 29512

@manast (Member) commented Jun 27, 2019

OK, so after some analysis, what I can see is that the extendLock script is not being called for that particular job. This script renews the lock and also removes the job from the stalled set.
You wrote above:
"We noticed a strange thing:
from the Redis MONITOR log you can see that Angel4 was the first one to lock the job, and after 90000 ms (we set lockDuration to 90 s just for the test) the job is locked again, but by Angel5.
The strange thing is that, from the consumer logs, you can see that the one that executes the job first is not Angel4 but Angel5.
It seems that the first lock, the one set by Angel4, was useless because the job didn't reach the queue.process handler until after Angel5 had already processed it; in fact Angel4 failed to complete the job because it had already been deleted by Angel5."

I do not think this is strange. Angel4 locks the job because it just had the "luck" to fetch it from the queue. For some reason Angel4 is not extending the lock after 45 seconds (since you have a 90 s lockDuration, the lock is renewed after half that time). Since it is not renewing the lock, the job is marked as stalled, and when this happens any other worker can pick the job up, in this case Angel5. Could it be that Angel4 is overloaded and incapable of renewing the lock?
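A simple way to test that hypothesis on Angel4 is to measure event-loop lag in the worker process: if the loop is blocked for long stretches, Bull's lock-renewal timer cannot fire on time. A rough sketch (interval and threshold are arbitrary):

// event-loop lag monitor: warns when a 1 s timer fires noticeably late
const INTERVAL_MS = 1000;
let last = Date.now();

setInterval(() => {
  const now = Date.now();
  const lag = now - last - INTERVAL_MS;
  if (lag > 500) {
    console.warn(`event loop blocked for ~${lag} ms`);
  }
  last = now;
}, INTERVAL_MS);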

@manast (Member) commented Jun 27, 2019

Also, in any case, make sure that stalledInterval is the same as lockDuration. Actually, I may remove one of those settings so that it is not possible to make this mistake.
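Concretely, if lockDuration is raised for a test, stalledInterval should be raised with it, e.g. (illustrative values matching the 90 s test above):

const Bull = require('bull');

const queue = new Bull('html_pusher', {
  settings: {
    lockDuration: 90000,    // 90 s lock, as in the test above
    stalledInterval: 90000  // keep the stalled check aligned with the lock duration
  }
});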

@Gappa88 (Author) commented Jun 27, 2019

Could it be that Angel4 is overloaded and incapable of renewing the lock?

It is not possible that a consumer froze for 45 s; we have a monitoring application that records system statistics for each server. The only statistic that was high was TCP retransmissions.
It is possible for a consumer to crash, but that is not the case here; we checked.

make sure that the stalledInterval is the same as the lockDuration. Actually I may remove one of those settings so that it is not possible to make mistakes

The lockDuration change was made only for the tests; in production we use the default values.

@manast (Member) commented Jun 27, 2019

My bet would be that something is wrong with the Angel4 server. For instance, look how tight all the calls to create/check the lock are on one of the servers (568fe403-04b4-48c3-955d-b506bc2a6fe3):
[screenshot]
Or on Angel5 (492808c8-5ad4-4413-bff2-634b4b6dd630):
[screenshot]
Whereas Angel4 looks extremely sparse:
[screenshot]

@Gappa88 (Author) commented Jun 28, 2019

I will try to scale down the number of consumers on Angel4.

Thanks @manast , I will keep you posted

@ifokeev commented Jul 21, 2020

We're facing the same issue after moving to the default redis docker image instead of bitnami (https://github.com/bitnami/bitnami-docker-redis).

So it could be related to the redis config.

@beardedtim commented

My team and I are currently noticing behavior similar to this. @Gappa88, did you ever find a solution to your problem?

@Gappa88 (Author) commented Feb 21, 2021

I don't really remember what we did in the end to address this issue.

For sure, we got rid of the OpenVPN setup, because it created a bottleneck in communications, replaced it with the ufw firewall plus some custom rules, and scaled the system down.

Now, after 2 years, we still have the Missing lock issue, but it is not as frequent and we can handle it. I admit I didn't spend any more time investigating this after we obtained that result, because we had other things to do.

I hope this is useful.

If you find a solution, please share it with us @beardedtim.
Thanks

@stale bot commented Jul 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 12, 2021
@stale stale bot closed this as completed Jul 19, 2021