
Possible infinite loop if BRPOPLPUSH fails ? #1304

Open
smennesson opened this issue May 2, 2019 · 25 comments

@smennesson

Description

Hello, today we faced a problem that appears to be an infinite loop in the Bull library.
Our service is hosted on Heroku with the Redis add-on, and today we hit the memory quota of the Redis DB.
What happened is that we got an enormous stack of logs saying this:

BRPOPLPUSH { ReplyError: OOM command not allowed when used memory > 'maxmemory'. 
    at parseError (/app/node_modules/ioredis/node_modules/redis-parser/lib/parser.js:179:12) 
    at parseType (/app/node_modules/ioredis/node_modules/redis-parser/lib/parser.js:302:14) 
  command: 
   { name: 'brpoplpush', 
     args: 
      [ 'bull:<name-of-our-job>:wait', 
        'bull:<name-of-our-job>:active', 
        '5' ] } } 

The logs grew to about 1 GB in a few minutes, until we fixed the quota issue.

Looking at the code in lib/queue.js, it seems the error from BRPOPLPUSH is ignored in Queue.prototype.getNextJob. So I guess the loop searching for new jobs to process kept hitting and logging the error endlessly.

I don't know enough about how Bull works internally to propose a fix, but I think this is something that should be handled, maybe by detecting repeated BRPOPLPUSH errors and adding a waiting period when they happen too frequently.
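
For illustration, a rough sketch of the kind of guard I mean (the names backoffOnRepeatedErrors and fetchNextJob, the threshold, and the 5 s delay are all made up, not Bull internals):

async function backoffOnRepeatedErrors(fetchNextJob) {
  let consecutiveErrors = 0;
  // loop forever, like Bull's own job-fetching loop
  for (;;) {
    try {
      await fetchNextJob();      // e.g. the BRPOPLPUSH-based fetch
      consecutiveErrors = 0;     // reset after a successful fetch
    } catch (err) {
      consecutiveErrors += 1;
      console.error('BRPOPLPUSH failed:', err.message);
      if (consecutiveErrors > 3) {
        // several failures in a row: pause before polling again
        await new Promise(resolve => setTimeout(resolve, 5000));
      }
    }
  }
}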

Bull version

v3.7.0

(I've just seen that 3.8 is out; reading the changelog, it doesn't look like this is fixed in that version.)

@balexandre

balexandre commented May 9, 2019

Exactly the same issue under the same circumstances: Heroku, Redis (Premium 0 instance, v3.2.12), BRPOPLPUSH :(


What are the steps to prevent Redis from reaching the 50 MB limit? What can I do after a job is done?

P.S. I've upgraded my Redis instance to 4.0.14; I'll report back if I get the same issue.

@smennesson
Author

@balexandre actually, reaching the Redis limit is not the real problem here. There could be multiple reasons causing your Redis DB to be full, and there are a few tools to show you which keys are taking up the most memory, so you can either clean things up a bit or upgrade your plan.

The problem is that the Bull library ends up in an infinite loop when this occurs, generating a huge volume of logs while ignoring the error, so we have no way to be warned that this is happening.
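
For reference, Bull queues are EventEmitters and surface Redis-level failures through an 'error' event when they are not swallowed, so an application-level listener is the usual way to at least get warned; a minimal sketch (the queue name 'my-queue' is a placeholder):

const Queue = require('bull');

const queue = new Queue('my-queue', process.env.REDIS_URL);

// Errors that Bull does not swallow (connection failures, command errors)
// are emitted here; without a listener an 'error' event would crash the process.
queue.on('error', err => {
  // in practice, forward this to your alerting/monitoring instead of the console
  console.error('Bull queue error:', err.message);
});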

@kand617

kand617 commented Sep 14, 2019

I have the same issue...

Is there some way to catch this error and log a different message?

I am not even sure where it's thrown from.

@mmelvin0

Seems to have been introduced in this commit:

bull/lib/queue.js

Lines 893 to 896 in f8ff368

// Swallow error
if(err.message !== 'Connection is closed.'){
console.error('BLPOP', err);
}

Perhaps the fix is simply to not swallow any errors here? Wouldn't this fire an error event on the queue if not swallowed?
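
A sketch of what that change might look like, mirroring the snippet above (Bull's Queue inherits from EventEmitter, so this.emit is available at that point):

// swallow only the expected shutdown error, surface everything else
if (err.message !== 'Connection is closed.') {
  // fires queue.on('error', ...) listeners instead of writing to the console
  this.emit('error', err);
}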

@masiamj

masiamj commented Mar 9, 2020

@mmelvin0 Did you receive a response or have any updates on this? Thanks!

@manast
Member

manast commented Mar 9, 2020

Actually, I think it would be as easy as only swallowing the "Connection is closed." error, since that error happens when we force-close the connection to Redis. Should be an easy fix.

@manast manast added the bug label Mar 9, 2020
@tnolet

tnolet commented Mar 11, 2020

This blew up our logging pipeline too. It filled up our Papertrail daily limit in two minutes. I love Bull and would like to move on this quickly. @manast, how can I or my team help fix this? We actually have an open ticket about this on our backlog, because it can potentially break our production at https://checklyhq.com when this happens and we don't intervene manually.

@manast
Member

manast commented Mar 11, 2020

If you provide a PR I will merge it quickly and make a release. As mentioned, it is as simple as only swallowing the error in the connection-closed case, and maybe adding a small 5-second delay before trying again.

@tnolet

tnolet commented Mar 11, 2020

@manast I will make a PR

@tnolet

tnolet commented Mar 11, 2020

@manast I think this will do it. I refrained from adding a 5 s delay because its impact is unclear; I think not spamming the log is the first issue here. #1674

@tnolet

tnolet commented Mar 27, 2020

Would love feedback or insights on the PR I mentioned above. This is still a significant risk of overwhelming any logging pipeline when things go south.

@manast-apsis

This is high priority, I will try to merge it today.

@tnolet

tnolet commented Mar 31, 2020

@manast-apsis thanks, let me know if I can help anywhere.

@tnolet

tnolet commented Apr 6, 2020

@manast @manast-apsis Let me know what I need to do to get this merged. Thanks

@tnolet

tnolet commented Apr 20, 2020

@manast @manast-apsis hope all is well, let me know if I can assist with merging the PR for this?

@manast
Member

manast commented Apr 20, 2020

The thing is that the current PR does not do anything valuable. I am not sure which solution is best here. Basically, I think we need to just delay some seconds in case of an error, and maybe make that configurable. But sometimes the connection will fail with a "Connection is closed." error because we are manually closing the queue; in that case we do not want to delay, so we need to check whether the closing promise is set and, if so, skip the delay.
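
Roughly, as a sketch only (getNextJobWithBackoff and retryDelayMs are made-up names; queue.getNextJob and the queue.closing promise are the Bull internals mentioned above):

async function getNextJobWithBackoff(queue, retryDelayMs = 5000) {
  try {
    return await queue.getNextJob();               // the BRPOPLPUSH-based fetch
  } catch (err) {
    if (err.message !== 'Connection is closed.') {
      queue.emit('error', err);                    // surface the error to listeners
      if (!queue.closing) {
        // not shutting down: give Redis some breathing room before retrying
        await new Promise(resolve => setTimeout(resolve, retryDelayMs));
      }
    }
    return null;                                   // no job fetched this round
  }
}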

@tnolet

tnolet commented Apr 21, 2020

@manast what is your suggestion for moving this ticket forward? I have the feeling we are going in circles a bit. I would be happy to remove/close my PR in favour of a better solution.

The reason I'm pushing this is that it is still a risk for many folks using logging solutions with hard limits.

As mentioned, I'm ready to help or have my team also look at it.

@danielkv

Same problem here, only in production, using Docker.

@tnolet

tnolet commented May 18, 2020

The thing is that the current PR does not do anything valuable. I am not sure which solution is best here. Basically, I think we need to just delay some seconds in case of an error, and maybe make that configurable. But sometimes the connection will fail with a "Connection is closed." error because we are manually closing the queue; in that case we do not want to delay, so we need to check whether the closing promise is set and, if so, skip the delay.

@manast what do you mean when you say "we need to just delay some seconds"?

  • What are we delaying? The console.log()
  • What happens after the delay? We just log? Or do we take some other action?

I think I'm just really confused about the "delay" part. It will not solve spamming the logging channel; it will just spam with a delay...

@dobesv
Contributor

dobesv commented May 23, 2020

I posted a possible solution to this problem in #1746

@stale

stale bot commented Jul 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 12, 2021
@manast manast removed the wontfix label Jul 14, 2021
@manast
Member

manast commented Jul 14, 2021

@tnolet not sure if you are aware, but this issue was addressed some time ago in #1974.

Let me know if that is not the case.

@tnolet

tnolet commented Jul 15, 2021

@manast I was not aware of this. Thanks for mentioning and fixing this! I will remove my thumbs down.

@Jimmycon210

Hello, I am currently running into this. I'm finding that if Redis memory is full, BRPOPLPUSH fails, spamming the logs. This also prevents our app from processing jobs in the queue, so the queue stays full.

Seems to be failing here: https://github.com/OptimalBits/bull/blob/develop/lib/queue.js#L1202

And the catch block doesn't catch anything; no error is thrown. We get stuck in an infinite loop.

@manast
Member

manast commented Apr 12, 2023

Hello, I am currently running into this. I'm finding that if Redis memory is full, BRPOPLPUSH fails, spamming the logs. This also prevents our app from processing jobs in the queue, so the queue stays full.

Seems to be failing here: https://github.com/OptimalBits/bull/blob/develop/lib/queue.js#L1202

And the catch block doesn't catch anything; no error is thrown. We get stuck in an infinite loop.

If Redis is full, the best thing is to delete some data by hand or upgrade the instance to one with more memory; most hosting providers offer seamless upgrade procedures.
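
For anyone needing to find out what is actually taking the space, a quick sketch using ioredis (assumes the default 'bull:' key prefix; adjust the pattern to your queue prefix):

const Redis = require('ioredis');

const redis = new Redis(process.env.REDIS_URL);

// Print overall memory usage and count the keys under Bull's prefix,
// to see whether the queues themselves are filling the instance.
async function inspectMemory() {
  const info = await redis.info('memory');
  const used = info.match(/used_memory_human:(\S+)/);
  console.log('used_memory:', used ? used[1] : 'unknown');

  let bullKeys = 0;
  const stream = redis.scanStream({ match: 'bull:*', count: 1000 });
  for await (const keys of stream) {
    bullKeys += keys.length;
  }
  console.log('keys under bull:*:', bullKeys);

  await redis.quit();
}

inspectMemory().catch(console.error);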
