Consumers do not resume consuming tasks after broker connection error #207
What OS are you using?
What version of Dramatiq are you using?
What did you do?
With several tasks still in the queue, I killed the master Redis process to trigger a failover to the slave. I observed the following during the failover period of about 5 seconds.
What did you expect would happen?
In our own code, we handle connection errors gracefully as in:
I expected the consumer (worker) to resume consuming tasks from its queue once the connection problem was resolved (which it eventually was).
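For illustration, here is a minimal stdlib-only sketch of the behavior I expected: a polling loop that treats a connection error as transient, backs off, and then resumes. This is not Dramatiq's consumer code or our own code. `consume`, `FlakyQueue`, and the backoff parameters are hypothetical, and Python's built-in `ConnectionError` stands in for whatever exception the Redis client actually raises.

```python
import time
from typing import Callable, List, Optional


def consume(
    fetch: Callable[[], Optional[str]],
    handle: Callable[[str], None],
    max_messages: int,
    backoff: float = 0.01,
    max_backoff: float = 0.5,
) -> List[str]:
    """Poll `fetch` until `max_messages` messages have been handled.

    Connection errors are caught and retried with exponential backoff
    instead of propagating and silently ending the consumer loop.
    """
    handled: List[str] = []
    delay = backoff
    while len(handled) < max_messages:
        try:
            message = fetch()
        except ConnectionError:
            # Broker unavailable (e.g. mid-failover): wait, then retry.
            time.sleep(delay)
            delay = min(delay * 2, max_backoff)
            continue
        delay = backoff  # connection is healthy again, so reset the backoff
        if message is None:
            time.sleep(backoff)  # queue momentarily empty; avoid a hot loop
            continue
        handle(message)
        handled.append(message)
    return handled


class FlakyQueue:
    """Hypothetical broker stand-in that is down for the first `outage` calls."""

    def __init__(self, messages: List[str], outage: int) -> None:
        self.messages = list(messages)
        self.outage = outage
        self.calls = 0

    def fetch(self) -> Optional[str]:
        self.calls += 1
        if self.calls <= self.outage:
            raise ConnectionError("simulated failover")
        return self.messages.pop(0) if self.messages else None


queue = FlakyQueue(["a", "b", "c"], outage=3)
print(consume(queue.fetch, lambda m: None, max_messages=3))  # ['a', 'b', 'c']
```

The `try` wraps only the fetch, so a transient broker error costs one iteration instead of ending the whole loop.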
Instead, I got the following error, and the consumers stopped consuming tasks from the queue:
In other cases, where no unhandled exception is raised (because no low-level Dramatiq code happens to run during the connection error), the workers resume consuming tasks immediately once the connection is re-established.
Additional info: Here is the output of
So, it seems that the workers stop checking the main queue itself for tasks.
Here is the rest of the logs (these lines came after I restarted dramatiq via
This is what happens: Dramatiq does restart the workers, and I can see their heartbeats in Redis, but they do not start consuming tasks from the main queue again. They only return to normal after a manual restart.
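One possible reading of "heartbeats are visible but nothing is consumed", offered as an assumption rather than a description of Dramatiq's internals: if heartbeats are sent by a different thread than the one polling the queue, an unhandled exception kills only the polling thread, and the process still looks alive from the outside. The stdlib-only sketch below simulates that. `run_worker` and its two-thread layout are hypothetical.

```python
import threading
import time
from typing import Tuple


def run_worker(fail_on_poll: int, duration: float = 0.2) -> Tuple[int, int, bool]:
    """Run a hypothetical worker with separate heartbeat and consumer threads.

    Returns (heartbeats, polls, consumer_alive) observed after `duration` seconds.
    """
    stop = threading.Event()
    counts = {"heartbeats": 0, "polls": 0}

    def heartbeat() -> None:
        # Keeps reporting "alive" regardless of the consumer thread's state.
        while not stop.is_set():
            counts["heartbeats"] += 1
            time.sleep(0.01)

    def consumer() -> None:
        while not stop.is_set():
            counts["polls"] += 1
            if counts["polls"] == fail_on_poll:
                # Unhandled error: this thread ends and nothing restarts it.
                raise ConnectionError("simulated broker failover")
            time.sleep(0.01)

    hb = threading.Thread(target=heartbeat, daemon=True)
    cs = threading.Thread(target=consumer, daemon=True)
    hb.start()
    cs.start()
    time.sleep(duration)
    consumer_alive = cs.is_alive()  # checked while the heartbeat is still running
    stop.set()
    hb.join()
    cs.join()
    return counts["heartbeats"], counts["polls"], consumer_alive


heartbeats, polls, alive = run_worker(fail_on_poll=3)
print(f"heartbeats={heartbeats} polls={polls} consumer_alive={alive}")
```

Python prints the consumer thread's traceback to stderr, but the heartbeat counter keeps climbing long after polling has stopped, which would match the symptom described above.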
So, between the logs in the last comment and these logs, it looks like you waited about 15 minutes before restarting. Is that correct?
If that's the case, I'll try to reproduce and fix this sometime this weekend. We do have unit tests that check what happens when RMQ suddenly goes away, but I don't remember whether we have anything like that for Redis.
Let me point out that, during those 15 minutes of waiting,