
Possible infinite loop if BRPOPLPUSH fails ? #1304

Open
smennesson opened this issue May 2, 2019 · 25 comments

@smennesson

Description

Hello, today we faced a problem that appears to be an infinite loop in the Bull library.
Our service is hosted on Heroku with the Redis add-on, and today we hit the memory quota of the Redis DB.
What happened is that we got an enormous stack of logs saying this:

BRPOPLPUSH { ReplyError: OOM command not allowed when used memory > 'maxmemory'. 
    at parseError (/app/node_modules/ioredis/node_modules/redis-parser/lib/parser.js:179:12) 
    at parseType (/app/node_modules/ioredis/node_modules/redis-parser/lib/parser.js:302:14) 
  command: 
   { name: 'brpoplpush', 
     args: 
      [ 'bull:<name-of-our-job>:wait', 
        'bull:<name-of-our-job>:active', 
        '5' ] } } 

The logs grew to about 1 GB in a few minutes, until we fixed the quota issue.

Looking at the code in lib/queue.js, it seems the error from BRPOPLPUSH is ignored in Queue.prototype.getNextJob. So I guess the loop searching for new jobs to process kept hitting and logging the error endlessly.

I don't know enough about how Bull works internally to propose a fix, but I think this is something that should be handled, maybe by detecting repeated BRPOPLPUSH errors and adding a waiting period when they happen too frequently.
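
For illustration, a rough sketch of the kind of guard I mean (the names backoffOnRepeatedErrors and fetchNextJob, the threshold, and the 5 s delay are all made up, not Bull internals):

async function backoffOnRepeatedErrors(fetchNextJob) {
  let consecutiveErrors = 0;
  // loop forever, like Bull's own job-fetching loop
  for (;;) {
    try {
      await fetchNextJob();      // e.g. the BRPOPLPUSH-based fetch
      consecutiveErrors = 0;     // reset after a successful fetch
    } catch (err) {
      consecutiveErrors += 1;
      console.error('BRPOPLPUSH failed:', err.message);
      if (consecutiveErrors > 3) {
        // several failures in a row: pause before polling again
        await new Promise(resolve => setTimeout(resolve, 5000));
      }
    }
  }
}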

Bull version

v3.7.0

(I've just seen that 3.8 is out; reading the changelog, it doesn't look like this is fixed in that version.)

@balexandre

balexandre commented May 9, 2019

Exactly the same issue under the same circumstances: Heroku, Redis (Premium 0 instance, v3.2.12), BRPOPLPUSH :(


What are the steps to prevent Redis from reaching the 50 MB limit? What can I do after a job is done?

P.S. I've upgraded my Redis instance to 4.0.14; I'll report back if I get the same issue.

@smennesson
Author

@balexandre actually, reaching the Redis limit is not the real problem here. There could be multiple reasons causing your Redis DB to be full, and there are a few tools to show you which keys are taking up the most memory, so you can either clean things up a bit or upgrade your plan.

The problem is that the Bull library ends up in an infinite loop when this occurs, generating a huge volume of logs while ignoring the error, so we have no way to be warned that this is happening.
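
For reference, Bull queues are EventEmitters and surface Redis-level failures through an 'error' event when they are not swallowed, so an application-level listener is the usual way to at least get warned; a minimal sketch (the queue name 'my-queue' is a placeholder):

const Queue = require('bull');

const queue = new Queue('my-queue', process.env.REDIS_URL);

// Errors that Bull does not swallow (connection failures, command errors)
// are emitted here; without a listener an 'error' event would crash the process.
queue.on('error', err => {
  // in practice, forward this to your alerting/monitoring instead of the console
  console.error('Bull queue error:', err.message);
});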

@kand617

kand617 commented Sep 14, 2019

I have the same issue...

Is there some way to catch this error and log a different message?

I am not even sure where it's thrown from.

@mmelvin0

Seems to have been introduced in this commit:

bull/lib/queue.js

Lines 893 to 896 in f8ff368

// Swallow error
if(err.message !== 'Connection is closed.'){
console.error('BLPOP', err);
}

Perhaps the fix is simply to not swallow any errors here? Wouldn't this fire an error event on the queue if not swallowed?
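
A sketch of what that change might look like, mirroring the snippet above (Bull's Queue inherits from EventEmitter, so this.emit is available at that point):

// swallow only the expected shutdown error, surface everything else
if (err.message !== 'Connection is closed.') {
  // fires queue.on('error', ...) listeners instead of writing to the console
  this.emit('error', err);
}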

@masiamj

masiamj commented Mar 9, 2020

@mmelvin0 Did you receive a response or have any updates on this? Thanks!

@manast
Member

manast commented Mar 9, 2020

Actually, I think it would be as easy as only swallowing the "Connection is closed." error, since that error happens when we force-close the connection to Redis. Should be an easy fix.

@manast manast added the bug label Mar 9, 2020
@tnolet

tnolet commented Mar 11, 2020

This blew up our logging pipeline too. It filled up our Papertrail daily limit in two minutes. I love Bull and would like to move on this quickly. @manast, how can I or my team help fix this? We actually have an open ticket about this on our backlog, because it can potentially break our production at https://checklyhq.com when this happens and we don't intervene manually.

@manast
Member

manast commented Mar 11, 2020

If you provide a PR I will merge it quickly and make a release. As mentioned, it is as simple as only swallowing the error in the connection-closed case, and maybe adding a small 5-second delay before trying again.

@tnolet

tnolet commented Mar 11, 2020

@manast I will make a PR

@tnolet

tnolet commented Mar 11, 2020

@manast I think this will do it. I refrained from adding a 5 s delay because its impact is unclear; I think not spamming the log is the first issue here. #1674

@tnolet

tnolet commented Mar 27, 2020

Would love feedback or insights on the PR I mentioned above. This is still a significant risk of overwhelming any logging pipeline when things go south.

@manast-apsis

This is high priority, I will try to merge it today.

@tnolet

tnolet commented Mar 31, 2020

@manast-apsis thanks, let me know if I can help anywhere.

@tnolet

tnolet commented Apr 6, 2020

@manast @manast-apsis Let me know what I need to do to get this merged. Thanks

@tnolet

tnolet commented Apr 20, 2020

@manast @manast-apsis hope all is well, let me know if I can assist with merging the PR for this?

@manast
Member

manast commented Apr 20, 2020

The thing is that the current PR does not do anything valuable. I am not sure which solution is best here. Basically, I think we need to just delay some seconds in case of an error, and maybe make that configurable. But sometimes the connection will fail with a "Connection is closed." error because we are manually closing the queue; in that case we do not want to delay, so we need to check whether the closing promise is set and, if so, skip the delay.
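
Roughly, as a sketch only (getNextJobWithBackoff and retryDelayMs are made-up names; queue.getNextJob and the queue.closing promise are the Bull internals mentioned above):

async function getNextJobWithBackoff(queue, retryDelayMs = 5000) {
  try {
    return await queue.getNextJob();               // the BRPOPLPUSH-based fetch
  } catch (err) {
    if (err.message !== 'Connection is closed.') {
      queue.emit('error', err);                    // surface the error to listeners
      if (!queue.closing) {
        // not shutting down: give Redis some breathing room before retrying
        await new Promise(resolve => setTimeout(resolve, retryDelayMs));
      }
    }
    return null;                                   // no job fetched this round
  }
}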

@tnolet

tnolet commented Apr 21, 2020

@manast what is your suggestion for moving this ticket forward? I have the feeling we are going in circles a bit. I would be happy to remove/close my PR in favour of a better solution.

The reason I'm pushing this is that it is still a risk for many folks using logging solutions with hard limits.

As mentioned, I'm ready to help or have my team also look at it.

@danielkv

Same problem here, only in production, using Docker.

@tnolet

tnolet commented May 18, 2020

The thing is that the current PR does not do anything valuable. I am not sure which solution is best here. Basically, I think we need to just delay some seconds in case of an error, and maybe make that configurable. But sometimes the connection will fail with a "Connection is closed." error because we are manually closing the queue; in that case we do not want to delay, so we need to check whether the closing promise is set and, if so, skip the delay.

@manast what do you mean when you say "we need to just delay some seconds"?

  • What are we delaying? The console.log()
  • What happens after the delay? We just log? Or do we take some other action?

I think I'm just really confused about the "delay" part. It will not solve spamming the logging channel; it will just spam with a delay...

@dobesv
Contributor

dobesv commented May 23, 2020

I posted a possible solution to this problem in #1746

@stale

stale bot commented Jul 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 12, 2021
@manast manast removed the wontfix label Jul 14, 2021
@manast
Member

manast commented Jul 14, 2021

@tnolet not sure if you are aware, but this issue was addressed some time ago in #1974.

Let me know if that is not the case.

@tnolet

tnolet commented Jul 15, 2021

@manast I was not aware of this. Thanks for mentioning and fixing this! I will remove my thumbs down.

@Jimmycon210

Hello, I am currently running into this. I'm finding that if Redis memory is full, BRPOPLPUSH fails, spamming the logs. This also prevents our app from processing jobs in the queue, so the queue stays full.

Seems to be failing here: https://github.com/OptimalBits/bull/blob/develop/lib/queue.js#L1202

And the catch block doesn't catch anything; no error is thrown. We get stuck in an infinite loop.

@manast
Member

manast commented Apr 12, 2023

Hello, I am currently running into this. I'm finding that if Redis memory is full, BRPOPLPUSH fails, spamming the logs. This also prevents our app from processing jobs in the queue, so the queue stays full.

Seems to be failing here: https://github.com/OptimalBits/bull/blob/develop/lib/queue.js#L1202

And the catch block doesn't catch anything; no error is thrown. We get stuck in an infinite loop.

If Redis is full, the best thing is to delete some data by hand or upgrade the instance to one with more memory; most hosting providers offer seamless upgrade procedures.
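
For anyone needing to find out what is actually taking the space, a quick sketch using ioredis (assumes the default 'bull:' key prefix; adjust the pattern to your queue prefix):

const Redis = require('ioredis');

const redis = new Redis(process.env.REDIS_URL);

// Print overall memory usage and count the keys under Bull's prefix,
// to see whether the queues themselves are filling the instance.
async function inspectMemory() {
  const info = await redis.info('memory');
  const used = info.match(/used_memory_human:(\S+)/);
  console.log('used_memory:', used ? used[1] : 'unknown');

  let bullKeys = 0;
  const stream = redis.scanStream({ match: 'bull:*', count: 1000 });
  for await (const keys of stream) {
    bullKeys += keys.length;
  }
  console.log('keys under bull:*:', bullKeys);

  await redis.quit();
}

inspectMemory().catch(console.error);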
