
REDIS Timeout Issue #1602

Open
elucidsoft opened this issue Jan 2, 2020 · 8 comments

@elucidsoft

I have no idea why this happens, but it does. When the Redis server gets rebooted and your Queue loses its connection, it doesn't reconnect. We have discussed this before; you believe it's an issue with ioredis. But I'm not so sure now: after spending several hours playing with this, I can consistently reproduce the issue.

Some observations:

  • The only method that throws an exception is queue.add.
  • All other methods seem to work fine: they reconnect to Redis and return data without problems.
  • queue.client.ping() returns 'PONG' fine, but queue.add fails with an exception.

I created the following rather ugly code for my healthcheck as a temporary workaround until I can find the root cause of this issue.

const job = await this.queueService.queue.add(null, { removeOnComplete: true, delay: 9999 });
await job.remove();

This code will ALWAYS throw an exception in this scenario, so it's a pretty good check to see...
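For reference, that add-then-remove probe can be factored into a small health-check helper. This is a hedged sketch: `QueueLike` and `JobLike` are hypothetical minimal interfaces standing in for Bull's real types, so the helper can be exercised without a live Redis.

```typescript
// Minimal shapes of the Bull objects the probe touches (hypothetical
// interfaces; Bull's real Queue and Job satisfy them structurally).
interface JobLike {
  remove(): Promise<void>;
}

interface QueueLike {
  add(
    data: unknown,
    opts: { removeOnComplete: boolean; delay: number },
  ): Promise<JobLike>;
}

// Returns true when the queue can enqueue jobs, false when add() throws —
// the failure mode described above after a Redis restart. The delay keeps
// the probe job from being processed before we remove it.
async function isQueueHealthy(queue: QueueLike): Promise<boolean> {
  try {
    const job = await queue.add(null, { removeOnComplete: true, delay: 9999 });
    await job.remove();
    return true;
  } catch {
    return false;
  }
}
```

A liveness endpoint can then return `await isQueueHealthy(this.queueService.queue)` and let the orchestrator restart the pod on failure.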

Environment:

Kubernetes cluster. I have tried the following setups, and it seems to happen in each with different errors:

Single Redis instance: same behavior, but you get an ECONNREFUSED error from ioredis.

Sentinel Redis with 3 instances (master/slave): if you kill all of them simultaneously, you get an "ALL SENTINELS are down" error from ioredis.

Both appear to behave exactly the same. If you reboot your app, everything works again.

@elucidsoft
Author

taskforcesh/bullmq#83

@stansv
Contributor

stansv commented Jan 3, 2020

Hi @elucidsoft, thanks for your investigation!

I believe you got these results, but I'm confused that they don't exactly match what I observed earlier when I did my own research.

I cannot explain why only the add() method stops working. I suppose this could be related to the Lua script executed on Redis inside add(); any other Bull API call that uses Lua internally would fail too.

The only issue I know of in Bull relates to the internal initializing promise. It is used to indicate that the internal ioredis instance has been created and initialized, which happens once all Lua scripts are registered with the defineCommand API method provided by ioredis. The problem is that if this promise is rejected, the Queue instance methods won't work anymore, since internally they always make sure the initializing promise is resolved. However, if you execute a Redis command manually via queue.client, for example 'SADD', it resolves once Redis is up.

What I can suggest is to implement some retry logic when creating the queue instance; it might look like this:

const createQueue = async (name: string): Promise<Bull.Queue> => {
    let attempts = 100; // could be Infinity, but I'm not sure about memory leaks
    let queue;

    while (attempts-- > 0) {
        try {
            queue = new Bull(name);
            await queue.isReady();
            return queue;
        } catch (error) {
            // log the error if you want
            if (queue) {
                try {
                    await queue.close(true); // force-close the half-initialized instance
                } catch (_) {
                    // ignore close errors
                }
                queue = undefined;
            }
            await new Promise((resolve) => { setTimeout(resolve, 1000); });
        }
    }
    throw new Error(`Could not create queue "${name}" after 100 attempts`);
};
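The wait-and-retry pattern in that snippet can also be pulled out into a generic helper, so the same logic is reusable for any factory that may fail while Redis is down. This is a sketch, not Bull API; `retry` is a hypothetical name.

```typescript
// Retries an async factory until it succeeds or attempts run out,
// sleeping delayMs between attempts. Rethrows the last error on failure.
async function retry<T>(
  factory: () => Promise<T>,
  attempts: number,
  delayMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await factory();
    } catch (error) {
      lastError = error;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

With this helper, createQueue reduces to `retry(() => makeReadyQueue(name), 100, 1000)`, where the factory creates a Bull instance, awaits isReady(), and closes it on failure.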

@stansv
Contributor

stansv commented Jan 3, 2020

Also, can you please share a more complete set of steps to reproduce the behavior you observed? I haven't tried k8s; my Redis runs in Docker, but the node process runs natively on the host.

@manast
Member

manast commented Jan 3, 2020

so maybe the concept of initializing must be rethought in BullMQ, since when a Redis instance is completely rebooted, all loaded commands must be reloaded. A better approach would be lazy loading of the commands, but then we may need to use our own code for handling commands rather than the one provided by ioredis.
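One hedged sketch of that idea, in generic TypeScript rather than actual BullMQ code: memoize the initializing promise, but forget it on rejection, so the next call re-runs initialization (e.g. re-registering Lua scripts) instead of failing forever. `lazyInit` is a hypothetical helper name.

```typescript
// Wraps an async initializer so it runs lazily and is cached on success,
// but is retried on the next call if it failed — unlike a one-shot
// initializing promise that stays rejected for the lifetime of the queue.
function lazyInit<T>(init: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (!cached) {
      cached = init().catch((err) => {
        cached = undefined; // forget the rejected promise so we retry later
        throw err;
      });
    }
    return cached;
  };
}
```

A queue method would then `await ensureInitialized()` (a `lazyInit`-wrapped script loader) before each command: while Redis is down the call fails, but once Redis is back the next call reinitializes and succeeds.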

@elucidsoft
Author

ahh, I didn't think about recreating the Bull instance. This is a good solution. Also, is there an easier way to detect this scenario than calling add?

@elucidsoft
Author

elucidsoft commented Jan 3, 2020

@manast This behavior is not fixed by using Sentinel either. When a Redis instance goes down, ioredis switches to a new master via Sentinel, which seems to work OK until the original master comes back up and gets voted back to master; then we start seeing the same issue.

@faller

faller commented Aug 31, 2020

I have encountered the timeout issue twice when the master crashed in Sentinel mode.

@stale

stale bot commented Jul 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 12, 2021
@manast manast added bug and removed wontfix labels Jul 13, 2021