Stuck & Orphaned jobs #688

Closed
emilsedgh opened this issue Sep 7, 2017 · 31 comments


@emilsedgh

Hi there.

It seems 3.0 is very stable. Thank you.

I process tens of thousands of jobs every day. I always have removeOnComplete set to true on them, so ideally, if the queue is empty, redis should be empty.
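For reference, the jobs are added roughly like this (a minimal sketch; the queue name and options mirror the job dump further down in this thread, while the payload shown here is just a placeholder):

// Minimal sketch of how these jobs are added. The options mirror the job
// dump later in this thread; the payload is illustrative.
const Queue = require('bull');

const uploadQueue = new Queue('upload photo');

uploadQueue.add(
  { id: 'some-photo-id' },                     // illustrative payload
  {
    attempts: 100,
    backoff: { type: 'fixed', delay: 5000 },
    priority: 7,
    timeout: 180000,
    removeOnComplete: true                     // completed jobs should not stay in Redis
  }
);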

Today I looked and there are thousands (~30K) of jobs that are just sitting there. Stuck.

Some of them have had 0 attempts made.

This is an example:

[screenshot: spectacle bw8263]

First I thought maybe I'm not doing a graceful shutdown or something like that. But this job is ~12 hours old and I haven't restarted the queue worker for the past 48 hours or so.

Any ideas how I can get more clues to find the problem?

@manast
Member

manast commented Sep 7, 2017

I am not really sure what the stuck state means. I guess this is using bull-ui, right? Jobs should always be in one of the following states: wait, active, delayed, completed or failed. So the first thing I would like to know is which state these jobs are in.
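For example, one way to check (a small sketch using Bull's getJobCounts() and Job#getState() helpers; the queue name and job id below are placeholders):

// Sketch: inspect per-state counts and a single job's state via Bull's API.
const Queue = require('bull');
const queue = new Queue('some queue');         // placeholder queue name

queue.getJobCounts().then(counts => {
  // e.g. { waiting, active, completed, failed, delayed }
  console.log(counts);
});

queue.getJob('some-job-id').then(job => {      // placeholder job id
  if (!job) return console.log('job not found');
  return job.getState().then(state => console.log('state:', state));
});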

@emilsedgh
Author

Oh well, that's exactly the problem. To me, it seems like it's in none of them. Look:

[screenshot: spectacle ch8263]

(I'm using Arena)

Also I got the job using queue.getJob:

Q.getJob(1243080).then(j => console.log('Job', j));
Promise {
  _bitField: 0,
  _fulfillmentHandler0: undefined,
  _rejectionHandler0: undefined,
  _promise0: undefined,
  _receiver0: undefined }
> Job Job {
  opts: 
   { attempts: 100,
     backoff: { type: 'fixed', delay: 5000 },
     priority: 7,
     removeOnComplete: true,
     timeout: 180000,
     delay: 0,
     timestamp: 1504728193122 },
  name: '__default__',
  queue: 
   Queue {
     name: 'upload photo',
     token: 'ae44a146-9e0f-4b02-9f22-9a3962c7a5fa',
     keyPrefix: 'bull',
     clients: [ [Object] ],
     _initializing: 
      Promise {
        _bitField: 33554432,
        _fulfillmentHandler0: undefined,
        _rejectionHandler0: undefined,
        _promise0: undefined,
        _receiver0: undefined },
     handlers: {},
     processing: [],
     retrieving: 0,
     settings: 
      { lockDuration: 30000,
        stalledInterval: 30000,
        maxStalledCount: 1,
        guardInterval: 5000,
        retryProcessDelay: 5000,
        drainDelay: 5,
        lockRenewTime: 15000 },
     _events: { error: [Function] },
     _eventsCount: 1,
     timers: TimerManager { idle: true, listeners: [], timers: {} },
     moveUnlockedJobsToWait: [Function: bound ],
     processJob: [Function: bound ],
     getJobFromId: [Function: bound ],
     keys: 
      { '': 'bull:upload photo:',
        wait: 'bull:upload photo:wait',
        paused: 'bull:upload photo:paused',
        'meta-paused': 'bull:upload photo:meta-paused',
        active: 'bull:upload photo:active',
        id: 'bull:upload photo:id',
        delayed: 'bull:upload photo:delayed',
        priority: 'bull:upload photo:priority',
        'stalled-check': 'bull:upload photo:stalled-check',
        completed: 'bull:upload photo:completed',
        failed: 'bull:upload photo:failed',
        stalled: 'bull:upload photo:stalled',
        repeat: 'bull:upload photo:repeat' } },
  data: 
   { id: '5eb6a67c-89a0-11e7-ae13-f23c91b0d077',
     created_at: '2017-08-25T14:19:13.489Z',
     value: 
      { Type: 'Image',
        Order: '24',
        Table_MUI: '73264519',
        UploadDate: '2017-08-25T09:18:58.710',
        Description: '',
        ModifiedDate: '2017-08-25T09:40:32.460',
        matrix_unique_id: '73284771' },
     class: 'Media',
     resource: 'Media',
     matrix_unique_id: 73284771,
     revision: 7,
     processed: 
      { matrix_unique_id: 73284771,
        listing_mui: 73264519,
        description: '',
        order: 24 } },
  _progress: 0,
  delay: 0,
  timestamp: 1504728193122,
  stacktrace: [],
  returnvalue: undefined,
  attemptsMade: 0,
  toKey: [Function: wrapper],
  id: 1243080,
  failedReason: undefined }

@manast
Member

manast commented Sep 7, 2017

that is very strange actually. It should not be possible for a job not to be in one of the states; jobs are always moved atomically from one state to another.

@emilsedgh
Author

Some more clues:

While the UI shows the number of items in the priority queue as 0, I'm getting this in Redis:

zcard 'bull:upload photo:priority'
(integer) 20777

They are not getting processed, as the active queue is completely empty and the user interface shows 0 items in the queue.

Also I can confirm that this specific job 1243080 is indeed in the priority queue's sorted set.
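Roughly, the checks look like this (a sketch using ioredis against the key names from the job dump above; it assumes wait/active are plain lists and priority is a sorted set, as in Bull 3):

// Sketch: check which of the queue's Redis structures reference job 1243080.
// Key names come from the `keys` object in the job dump above; ioredis is
// used here because it is what Bull 3 uses internally.
const Redis = require('ioredis');
const redis = new Redis();

const prefix = 'bull:upload photo';
const id = '1243080';

Promise.all([
  redis.zscore(`${prefix}:priority`, id),    // non-null => still in the priority set
  redis.lrange(`${prefix}:wait`, 0, -1),     // waiting job ids
  redis.lrange(`${prefix}:active`, 0, -1),   // active job ids
  redis.exists(`${prefix}:${id}`)            // the job hash itself
]).then(([priorityScore, wait, active, hashExists]) => {
  console.log({
    priorityScore,
    inWait: wait.includes(id),
    inActive: active.includes(id),
    jobHashExists: hashExists === 1
  });
});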

@manast
Member

manast commented Sep 7, 2017

do you have any way to check if the job has actually been processed correctly?

@manast
Member

manast commented Sep 7, 2017

yes, it seems as if the jobs have not been processed at all, actually. What priority values are you using?

@emilsedgh
Author

It has not been.
The first thing I do when I process them is to log some key data.

[screenshot: spectacle ed8263]

This photo has been queued to be uploaded twice (2 different jobs). But it has not been processed at all. Only queued.

This is an example of something that was processed fine:

[screenshot: spectacle gw8263]

@emilsedgh
Author

The priority for this job was 7 I think.

@manast
Member

manast commented Sep 7, 2017

can you verify the jobId is really not in the wait list?

@emilsedgh
Author

I can confirm that it's not in the 'bull:upload photo:wait' list.

What I did was this:

lrange 'bull:upload photo:wait' 0 10000

It has 3K items in it. None of them are this job.

(I have a load of jobs now, which will be processed in a few minutes)

@manast
Member

manast commented Sep 7, 2017

According to the logic, when a job leaves the wait status (which is the first status a job gets when being added to the queue), it gets the processedOn property with a timestamp of when the job starts to be processed, and it is also removed from the priority set. So it seems impossible for a job to be removed from wait while it is still in the priority set and does not have a processedOn property. I am clueless right now.
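Something like this sketch could confirm that contradiction directly in Redis (it assumes the job data lives in a hash at bull:&lt;queue name&gt;:&lt;job id&gt;, matching the key prefix shown in the dump above):

// Sketch: does the stuck job have a processedOn timestamp, and is it still
// in the priority set? Assumes the job hash lives at bull:<queue>:<job id>.
const Redis = require('ioredis');
const redis = new Redis();

Promise.all([
  redis.hget('bull:upload photo:1243080', 'processedOn'),
  redis.zscore('bull:upload photo:priority', '1243080')
]).then(([processedOn, priorityScore]) => {
  // A null processedOn together with a non-null priorityScore would match the
  // contradiction described above: gone from wait, yet apparently never processed.
  console.log({ processedOn, priorityScore });
});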

@emilsedgh
Author

oh wait, could it be that my jobs had a TTL and their TTL was reached before I processed them?

@manast
Member

manast commented Sep 7, 2017

you mean the timeout option? That is only considered while the job is already being processed. So if it had timed out, the job would be in the failed status, and not in the priority set.

@manast
Member

manast commented Sep 7, 2017

from which version did you upgrade? We would need to find out starting from which version you get this issue.

@emilsedgh
Author

I'm on bull 3.0.0 according to npm ls on the production machine.

I was using 2.x but when I upgraded last week I flushed the DB.

If you want, I think I can arrange access for you so you can take a deeper look, if it's worrying. Feel free to contact me directly.

@manast
Member

manast commented Sep 7, 2017

are you able to reproduce the issue on a development machine?
What would be useful is to upgrade from 2.x to 3.0 using a binary search through the 10 or so release candidates, to see if we can pinpoint in which release you start getting this issue.

@emilsedgh
Author

I don't think I can reproduce it easily, as I have no idea how it happened.
I'll keep an eye on it and will let you know if I get more information.

@emilsedgh
Author

Here is another clue I got.
Sometimes we restart the worker scripts while there are thousands of items in the queue.
During the restart, suddenly everything gets dropped to 0. It's like the 'wait' list gets removed or something like that.

Of course the job keys are still there.

While the workers are shutting down, we're not pausing the queues.
But when they start we do call a resume.

This might be related to this issue or might be irrelevant. I'll try to find that out.

@manast
Member

manast commented Sep 11, 2017

Maybe you are calling clean in your shutdown or startup process? In any case, you do not need to call resume when starting up; the workers will keep processing the queue as soon as they start.
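For example, a call like the following in a shutdown or startup script would empty the wait list (a sketch of queue.clean(grace, type); treating 'wait' as an accepted type here is an assumption used to illustrate the point):

// Sketch: a clean() with zero grace on 'wait' would drop every waiting job.
// This is the kind of call worth looking for in shutdown/startup scripts.
const Queue = require('bull');
const queue = new Queue('upload photo');    // queue name as in this thread

queue.clean(0, 'wait').then(removed => {
  console.log('removed', removed.length, 'waiting jobs');
});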

@Nelrohd

Nelrohd commented Sep 12, 2017

I have the same issue: thousands of jobs are "stuck" in the queue. In fact they are processed, and done() is called, but the job is not marked as completed or removed, even with removeOnComplete set to true.
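For reference, the pattern described is roughly this (a minimal sketch; the queue name, payload and processor body are illustrative):

// Minimal sketch of the pattern described: done() is called and the job was
// added with removeOnComplete, yet the job reportedly lingers in Redis.
const Queue = require('bull');
const queue = new Queue('my queue');               // illustrative name

const doWork = data => Promise.resolve(data);      // stand-in for the real work

queue.process((job, done) => {
  doWork(job.data)
    .then(() => done())                            // job should move to completed...
    .catch(err => done(err));
});

queue.add({ some: 'payload' }, { removeOnComplete: true });   // ...and then be removed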

I'm using bull 3.0.0-rc.10 and redis server 3.0.6.

@manast
Member

manast commented Sep 12, 2017

@Nelrohd it would be good if you could upgrade to 3.2.0.
When you say stuck, do you mean the same as @emilsedgh, i.e. that the job is not on any list but the job key is still there?

@manast
Member

manast commented Sep 22, 2017

@emilsedgh any news regarding this issue?

@emilsedgh
Author

No, no news. This keeps happening though.

I haven't looked into it in more detail. I probably will within the next few weeks.

Does bull have any debug output I can enable with env vars that would help?

@jagged3dge

Does bull have any debug output I can enable with env vars that would help?

According to the docs, it seems NODE_DEBUG=bull node ./your-script.js should enable bull's internal debugging.

@nitzanav

nitzanav commented Aug 27, 2018

This issue sounds troubling to run into in a sensitive production environment. How probable is it? Has anyone else hit this issue?

@manast
Member

manast commented Aug 27, 2018

@nitzanav Unfortunately the poster is not able to reproduce it consistently, so I am going to close this. For all we know, it could be an issue in the poster's process function.

@manast manast closed this as completed Aug 27, 2018
@emilsedgh
Author

It's only fair to close this.

I was not able to reproduce this consistently, and our load has decreased a lot, so we're not seeing it anymore.

@macedd

macedd commented Oct 30, 2019

I'm experiencing the same issues: jobs are created with a stuck status, or get into it right away.
I did instrument the job creation, and many of these new jobs are stuck from the beginning.

If the reason is that those IDs have already been taken, what would be the right approach to clean up the state?

For context, this queue uses custom job IDs, and the jobs have to be added with removeOnComplete: true.
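If a stale job hash with the same custom jobId is what blocks the new add (Bull documents that add() does nothing when a job with that jobId already exists), one possible cleanup is a sketch like this (the queue name and helper name are illustrative):

// Sketch: clear a leftover job that is blocking a custom jobId from being
// reused, then re-add it. Queue name and helper name are illustrative.
const Queue = require('bull');
const queue = new Queue('upload photo');

async function requeue(jobId, data) {
  const stale = await queue.getJob(jobId);
  if (stale) {
    await stale.remove();                 // may throw if the job is currently locked
  }
  return queue.add(data, { jobId, removeOnComplete: true });
}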

@kand617

kand617 commented Nov 27, 2019

Actually I am seeing similar behaviour.
For some reason I am seeing jobs that do not show up in the job counts; however, I am able to retrieve them by job id, and job.getState() returns true.

Also I had some blocked clients... I am not sure if that provides any insight.
[screenshot]

Surprisingly, I was unable to .remove() the job either.

@W4G1

W4G1 commented May 24, 2022

I'm experiencing the same issues: jobs are created with a stuck status, or get into it right away. I did instrument the job creation, and many of these new jobs are stuck from the beginning.

If the reason is that those IDs have already been taken, what would be the right approach to clean up the state?

For context, this queue uses custom job IDs, and the jobs have to be added with removeOnComplete: true.

Two years later and I'm still seeing the same behavior of jobs getting the "stuck" status, even when explicitly calling done() and returning. Any updates?

@manast
Member

manast commented May 24, 2022

@W4G1 this is a closed issue. If you have more information and/or a case that reproduces it, you are more than welcome to create a new issue and we will look into it.
