Stuck & Orphaned jobs #688

Closed
emilsedgh opened this issue Sep 7, 2017 · 31 comments


@emilsedgh

Hi there.

It seems 3.0 is very stable. Thank you.

I process tens of thousands of jobs every day. I always have removeOnComplete set to true on them, so ideally, if the queue is empty, redis should be empty.
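For reference, the jobs are added roughly like this (a minimal sketch; the queue name and options mirror the job dump further down in this thread, while the payload shown here is just a placeholder):

// Minimal sketch of how these jobs are added. The options mirror the job
// dump later in this thread; the payload is illustrative.
const Queue = require('bull');

const uploadQueue = new Queue('upload photo');

uploadQueue.add(
  { id: 'some-photo-id' },                     // illustrative payload
  {
    attempts: 100,
    backoff: { type: 'fixed', delay: 5000 },
    priority: 7,
    timeout: 180000,
    removeOnComplete: true                     // completed jobs should not stay in Redis
  }
);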

Today I looked and there are thousands (~30K) of jobs that are just sitting there. Stuck.

Some of them have had 0 attempts made.

This is an example:

[screenshot: spectacle bw8263]

First I thought maybe I'm not doing a graceful shutdown or something like that. But this job is ~12 hours old and I haven't restarted the queue worker for the past 48 hours or so.

Any ideas how I can get more clues to find the problem?

@manast
Member

manast commented Sep 7, 2017

I am not really sure what the stuck state means. I guess this is using bull-ui, right? Jobs should always be in one of the following states: wait, active, delayed, completed or failed. So the first thing I would like to know is which state these jobs are in.
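For example, one way to check (a small sketch using Bull's getJobCounts() and Job#getState() helpers; the queue name and job id below are placeholders):

// Sketch: inspect per-state counts and a single job's state via Bull's API.
const Queue = require('bull');
const queue = new Queue('some queue');         // placeholder queue name

queue.getJobCounts().then(counts => {
  // e.g. { waiting, active, completed, failed, delayed }
  console.log(counts);
});

queue.getJob('some-job-id').then(job => {      // placeholder job id
  if (!job) return console.log('job not found');
  return job.getState().then(state => console.log('state:', state));
});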

@emilsedgh
Author

Oh well, that's exactly the problem. To me, it seems like it's in none of them. Look:

[screenshot: spectacle ch8263]

(I'm using Arena)

Also I got the job using queue.getJob:

Q.getJob(1243080).then(j => console.log('Job', j));
Promise {
  _bitField: 0,
  _fulfillmentHandler0: undefined,
  _rejectionHandler0: undefined,
  _promise0: undefined,
  _receiver0: undefined }
> Job Job {
  opts: 
   { attempts: 100,
     backoff: { type: 'fixed', delay: 5000 },
     priority: 7,
     removeOnComplete: true,
     timeout: 180000,
     delay: 0,
     timestamp: 1504728193122 },
  name: '__default__',
  queue: 
   Queue {
     name: 'upload photo',
     token: 'ae44a146-9e0f-4b02-9f22-9a3962c7a5fa',
     keyPrefix: 'bull',
     clients: [ [Object] ],
     _initializing: 
      Promise {
        _bitField: 33554432,
        _fulfillmentHandler0: undefined,
        _rejectionHandler0: undefined,
        _promise0: undefined,
        _receiver0: undefined },
     handlers: {},
     processing: [],
     retrieving: 0,
     settings: 
      { lockDuration: 30000,
        stalledInterval: 30000,
        maxStalledCount: 1,
        guardInterval: 5000,
        retryProcessDelay: 5000,
        drainDelay: 5,
        lockRenewTime: 15000 },
     _events: { error: [Function] },
     _eventsCount: 1,
     timers: TimerManager { idle: true, listeners: [], timers: {} },
     moveUnlockedJobsToWait: [Function: bound ],
     processJob: [Function: bound ],
     getJobFromId: [Function: bound ],
     keys: 
      { '': 'bull:upload photo:',
        wait: 'bull:upload photo:wait',
        paused: 'bull:upload photo:paused',
        'meta-paused': 'bull:upload photo:meta-paused',
        active: 'bull:upload photo:active',
        id: 'bull:upload photo:id',
        delayed: 'bull:upload photo:delayed',
        priority: 'bull:upload photo:priority',
        'stalled-check': 'bull:upload photo:stalled-check',
        completed: 'bull:upload photo:completed',
        failed: 'bull:upload photo:failed',
        stalled: 'bull:upload photo:stalled',
        repeat: 'bull:upload photo:repeat' } },
  data: 
   { id: '5eb6a67c-89a0-11e7-ae13-f23c91b0d077',
     created_at: '2017-08-25T14:19:13.489Z',
     value: 
      { Type: 'Image',
        Order: '24',
        Table_MUI: '73264519',
        UploadDate: '2017-08-25T09:18:58.710',
        Description: '',
        ModifiedDate: '2017-08-25T09:40:32.460',
        matrix_unique_id: '73284771' },
     class: 'Media',
     resource: 'Media',
     matrix_unique_id: 73284771,
     revision: 7,
     processed: 
      { matrix_unique_id: 73284771,
        listing_mui: 73264519,
        description: '',
        order: 24 } },
  _progress: 0,
  delay: 0,
  timestamp: 1504728193122,
  stacktrace: [],
  returnvalue: undefined,
  attemptsMade: 0,
  toKey: [Function: wrapper],
  id: 1243080,
  failedReason: undefined }

@manast
Member

manast commented Sep 7, 2017

that is very strange actually. It should not be possible for a job not to be in one of the states; jobs are always moved atomically from one state to another.

@emilsedgh
Author

Some more clues:

While the UI shows the number of items in the priority queue as 0, I'm getting this in Redis:

zcard 'bull:upload photo:priority'
(integer) 20777

They are not getting processed, as the active queue is completely empty and the user interface shows 0 items in the queue.

Also I can confirm that this specific job 1243080 is indeed in the priority queue's sorted set.
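Roughly, the checks look like this (a sketch using ioredis against the key names from the job dump above; it assumes wait/active are plain lists and priority is a sorted set, as in Bull 3):

// Sketch: check which of the queue's Redis structures reference job 1243080.
// Key names come from the `keys` object in the job dump above; ioredis is
// used here because it is what Bull 3 uses internally.
const Redis = require('ioredis');
const redis = new Redis();

const prefix = 'bull:upload photo';
const id = '1243080';

Promise.all([
  redis.zscore(`${prefix}:priority`, id),    // non-null => still in the priority set
  redis.lrange(`${prefix}:wait`, 0, -1),     // waiting job ids
  redis.lrange(`${prefix}:active`, 0, -1),   // active job ids
  redis.exists(`${prefix}:${id}`)            // the job hash itself
]).then(([priorityScore, wait, active, hashExists]) => {
  console.log({
    priorityScore,
    inWait: wait.includes(id),
    inActive: active.includes(id),
    jobHashExists: hashExists === 1
  });
});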

@manast
Member

manast commented Sep 7, 2017

do you have any way to check if the job has actually been processed correctly?

@manast
Member

manast commented Sep 7, 2017

yes, it seems as if the jobs have not been processed at all, actually. What priority values are you using?

@emilsedgh
Author

It has not been.
The first thing I do when I process them is to log some key data.

[screenshot: spectacle ed8263]

This photo has been queued to be uploaded twice (2 different jobs). But it has not been processed at all. Only queued.

This is an example of something that was processed fine:

[screenshot: spectacle gw8263]

@emilsedgh
Author

The priority for this job was 7 I think.

@manast
Member

manast commented Sep 7, 2017

can you verify the jobId is really not in the wait list?

@emilsedgh
Author

I can confirm that it's not in the 'bull:upload photo:wait' list.

What I did was this:

lrange 'bull:upload photo:wait' 0 10000

It has 3K items in it. None of them are this job.

(I have a load of jobs now, which will be processed in a few minutes)

@manast
Member

manast commented Sep 7, 2017

According to the logic, when a job leaves the wait status (which is the first status a job gets when being added to the queue), it gets the processedOn property with a timestamp of when the job starts to be processed, and it is also removed from the priority set. So it seems impossible for a job to be removed from wait while it is still in the priority set and does not have a processedOn property. I am clueless right now.
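Something like this sketch could confirm that contradiction directly in Redis (it assumes the job data lives in a hash at bull:&lt;queue name&gt;:&lt;job id&gt;, matching the key prefix shown in the dump above):

// Sketch: does the stuck job have a processedOn timestamp, and is it still
// in the priority set? Assumes the job hash lives at bull:<queue>:<job id>.
const Redis = require('ioredis');
const redis = new Redis();

Promise.all([
  redis.hget('bull:upload photo:1243080', 'processedOn'),
  redis.zscore('bull:upload photo:priority', '1243080')
]).then(([processedOn, priorityScore]) => {
  // A null processedOn together with a non-null priorityScore would match the
  // contradiction described above: gone from wait, yet apparently never processed.
  console.log({ processedOn, priorityScore });
});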

@emilsedgh
Author

oh wait, could it be that my jobs had a TTL and their TTL was reached before I processed them?

@manast
Member

manast commented Sep 7, 2017

you mean the timeout option? That is only considered while the job is already being processed. So if it had timed out, the job would be in the failed status, and not in the priority set.

@manast
Member

manast commented Sep 7, 2017

from which version did you upgrade? We would need to find out starting from which version you get this issue.

@emilsedgh
Author

I'm on bull 3.0.0 according to npm ls on the production machine.

I was using 2.x but when I upgraded last week I flushed the DB.

If you want, I think I can arrange access for you so you can take a deeper look, if it's worrying. Feel free to contact me directly.

@manast
Member

manast commented Sep 7, 2017

are you able to reproduce the issue on a development machine?
What would be useful is to upgrade from 2.x to 3.0 using a binary search through the 10 or so release candidates, to see if we can pinpoint in which release you start getting this issue.

@emilsedgh
Author

I don't think I can reproduce it easily, as I have no idea how it happened.
I'll keep an eye on it and will let you know if I get more information.

@emilsedgh
Author

Here is another clue I got.
Sometimes we restart the worker scripts while there are thousands of items in the queue.
During the restart, suddenly everything gets dropped to 0. It's like the 'wait' list gets removed or something like that.

Of course the job keys are still there.

While the workers are shutting down, we're not pausing the queues.
But when they start we do call a resume.

This might be related to this issue or might be irrelevant. I'll try to find that out.

@manast
Member

manast commented Sep 11, 2017

Maybe you are calling clean in your shutdown or startup process? In any case, you do not need to call resume when starting up; the workers will keep processing the queue as soon as they start.
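For example, a call like the following in a shutdown or startup script would empty the wait list (a sketch of queue.clean(grace, type); treating 'wait' as an accepted type here is an assumption used to illustrate the point):

// Sketch: a clean() with zero grace on 'wait' would drop every waiting job.
// This is the kind of call worth looking for in shutdown/startup scripts.
const Queue = require('bull');
const queue = new Queue('upload photo');    // queue name as in this thread

queue.clean(0, 'wait').then(removed => {
  console.log('removed', removed.length, 'waiting jobs');
});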

@Nelrohd

Nelrohd commented Sep 12, 2017

I have the same issue: thousands of jobs are "stuck" in the queue. In fact they are processed, and done() is called, but the job is not marked as completed or removed, even with removeOnComplete set to true.
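For reference, the pattern described is roughly this (a minimal sketch; the queue name, payload and processor body are illustrative):

// Minimal sketch of the pattern described: done() is called and the job was
// added with removeOnComplete, yet the job reportedly lingers in Redis.
const Queue = require('bull');
const queue = new Queue('my queue');               // illustrative name

const doWork = data => Promise.resolve(data);      // stand-in for the real work

queue.process((job, done) => {
  doWork(job.data)
    .then(() => done())                            // job should move to completed...
    .catch(err => done(err));
});

queue.add({ some: 'payload' }, { removeOnComplete: true });   // ...and then be removed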

I'm using bull 3.0.0-rc.10 and redis server 3.0.6.

@manast
Member

manast commented Sep 12, 2017

@Nelrohd it would be good if you could upgrade to 3.2.0.
When you say stuck, do you mean the same as @emilsedgh, i.e. that the job is not on any list but the job key is still there?

@manast
Member

manast commented Sep 22, 2017

@emilsedgh any news regarding this issue?

@emilsedgh
Author

No, no news. This keeps happening though.

I haven't looked into it in more detail. I probably will within the next few weeks.

Does bull have any debug output I can enable with env vars that would help?

@jagged3dge

Does bull have any debug output I can enable with env vars that would help?

According to the docs, it seems NODE_DEBUG=bull node ./your-script.js should enable bull's internal debugging.

@nitzanav

nitzanav commented Aug 27, 2018

This issue sounds troubling to run into in a sensitive production environment. How probable is it? Has anyone else hit this issue?

@manast
Member

manast commented Aug 27, 2018

@nitzanav Unfortunately the poster is not able to reproduce it consistently, so I am going to close this. For all we know, it could be an issue in the poster's process function.

@manast manast closed this as completed Aug 27, 2018
@emilsedgh
Author

It's only fair to close this.

I was not able to reproduce this consistently, and our load has decreased a lot, so we're not seeing it anymore.

@macedd

macedd commented Oct 30, 2019

I'm experiencing the same issues: jobs are created with a stuck status, or get into it right away.
I did instrument the job creation, and many of these new jobs are stuck from the beginning.

If the reason is that those IDs have already been taken, what would be the right approach to clean up the state?

For context, this queue uses custom job IDs, and the jobs have to be added with removeOnComplete: true.
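If a stale job hash with the same custom jobId is what blocks the new add (Bull documents that add() does nothing when a job with that jobId already exists), one possible cleanup is a sketch like this (the queue name and helper name are illustrative):

// Sketch: clear a leftover job that is blocking a custom jobId from being
// reused, then re-add it. Queue name and helper name are illustrative.
const Queue = require('bull');
const queue = new Queue('upload photo');

async function requeue(jobId, data) {
  const stale = await queue.getJob(jobId);
  if (stale) {
    await stale.remove();                 // may throw if the job is currently locked
  }
  return queue.add(data, { jobId, removeOnComplete: true });
}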

@kand617

kand617 commented Nov 27, 2019

Actually I am seeing similar behaviour.
For some reason I am seeing jobs that do not show up in the job counts; however, I am able to retrieve them by job id, and job.getState() returns true.

Also I had some blocked clients... I am not sure if that provides any insight.
[screenshot]

Surprisingly, I was unable to .remove() the job either.

@W4G1

W4G1 commented May 24, 2022

I'm experiencing the same issues: jobs are created with a stuck status, or get into it right away. I did instrument the job creation, and many of these new jobs are stuck from the beginning.

If the reason is that those IDs have already been taken, what would be the right approach to clean up the state?

For context, this queue uses custom job IDs, and the jobs have to be added with removeOnComplete: true.

Two years later and I'm still seeing the same behavior of jobs getting the "stuck" status, even when explicitly calling done() and returning. Any updates?

@manast
Member

manast commented May 24, 2022

@W4G1 this is a closed issue. If you have more information and/or a case that reproduces it, you are more than welcome to create a new issue and we will look into it.
