
Change the messenger queue layout #461

Merged
BentiGorlich merged 10 commits into main from improve-messenger-config on Feb 14, 2024

Conversation

BentiGorlich
Member

This change should improve the message handling. Messages are retried in case of a temporary error, the failed queue is now actively worked on, and a new dead queue stores dead messages, which can probably be cleaned out on a regular basis.
The normal message handler will try 5 times before sending a message to the failed queue. It starts with a 10 min delay and multiplies it by 10 each time the message fails:
0 -> 10 min -> 1h 40min -> 16h 40min -> ~1 week -> 1 week. The delay is capped at 1 week.

The failed queue keeps the old delay of 30 minutes and the multiplier of 2. We may want to change that; the numbers are arbitrary, I just copied them from the old settings. If a message fails again in the failed queue, it goes to the dead queue.
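
To illustrate the retry settings described above, here is a rough sketch of what they could look like in Symfony's messenger.yaml. The DSNs, the exact transport definitions and the failed queue's retry count are assumptions for illustration, not necessarily Mbin's actual configuration:

framework:
    messenger:
        # messages that exhaust their retries on the normal transports land here
        failure_transport: failed
        transports:
            async:
                dsn: '%env(MESSENGER_TRANSPORT_DSN)%'
                retry_strategy:
                    max_retries: 5
                    delay: 600000        # 10 min, in milliseconds
                    multiplier: 10       # 10 min -> 1h 40min -> 16h 40min -> ...
                    max_delay: 604800000 # capped at 1 week
            failed:
                dsn: 'doctrine://default?queue_name=failed'
                # messages that also fail here are moved to the dead queue
                failure_transport: dead
                retry_strategy:
                    delay: 1800000       # the old 30 min
                    multiplier: 2        # retry count left at the default here; the real value may differ
            dead:
                dsn: 'doctrine://default?queue_name=dead'

The async_ap transport would presumably get the same retry strategy as async.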

The messenger supervisor config changed: it now has just one worker that handles all the queues in descending priority. That means the normal async queue is worked first, then async_ap and then the failed queue.
The Docker Compose setup is changed accordingly.
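
As a sketch of that single-worker layout (the path, user and time limit below are placeholders rather than Mbin's actual supervisor config), the important part is that all transports are listed on one messenger:consume command in descending priority:

[program:messenger]
command=php /var/www/mbin/bin/console messenger:consume async async_ap failed --time-limit=3600
user=www-data
autostart=true
autorestart=true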

@BentiGorlich added the "enhancement" and "backend" labels Jan 28, 2024
@BentiGorlich self-assigned this Jan 28, 2024
@melroy89 self-requested a review January 29, 2024 15:50
@melroy89
Member

melroy89 commented Jan 29, 2024

The new queue system seems to work fine ("ready" in the RabbitMQ graph is indeed misleading and should probably be documented properly in the Mbin docs):

We also need to make a special release out of this, since it's not a simple post-upgrade release. More than only checking "Admin guide changes" in our release table...

[screenshots: RabbitMQ queue graphs]

@melroy89
Member

melroy89 commented Jan 29, 2024

We also need to warn people (and document this) that this is now considered "normal" behavior:

[screenshot]

@melroy89
Member

It seems to be getting more and more... Will the queue ever go lower? Maybe the max retries and duration are a bit too long?

[screenshots: RabbitMQ queue graphs]

@melroy89

This comment was marked as off-topic.

@BentiGorlich
Member Author

About the queue size: yes, it is annoying that the retried messages waiting out their delay count as "ready messages". I put the delays up top: 0 -> 10 min -> 1h 40min -> 16h 40min -> ~1 week -> 1 week. So a message that constantly returns a timeout, for example, will stay in the "ready messages" for about 2 weeks and one day. I do think that is reasonable: it does not hurt to have messages sit there and wait a long time to be retried.

As for the huge publish rates: that comes from a circulating message that is constantly dispatching itself again. I'd be surprised if your publish rate has gone down after reverting back to main, as I do not see a reason why it should...

@melroy89
Member

melroy89 commented Jan 30, 2024

So a message that constantly returns a timeout, for example, will stay in the "ready messages" for about 2 weeks and one day. I do think that is reasonable: it does not hurt to have messages sit there and wait a long time to be retried.

I agree, as long as it makes sense for errors which might get resolved over time. But errors that are fatal, where you know a retry will not help, should not enter this queue imo.

Is there a cronjob way to clean-up dead automatically?

@BentiGorlich
Member Author

BentiGorlich commented Jan 31, 2024

So a message that constantly returns a timeout, for example, will stay in the "ready messages" for about 2 weeks and one day. I do think that is reasonable: it does not hurt to have messages sit there and wait a long time to be retried.

I agree, as long as it makes sense for errors which might get resolved over time. But errors that are fatal, where you know a retry will not help, should not enter this queue imo.

Fatal errors can now properly use the "UnrecoverableMessageException" and will end up in the dead queue very quickly.
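
For handler authors, a minimal sketch of how a fatal error can be marked as not worth retrying, assuming the exception meant here is Symfony Messenger's UnrecoverableMessageHandlingException; the handler and message below are made-up placeholders, not Mbin code:

use Symfony\Component\Messenger\Exception\UnrecoverableMessageHandlingException;

// Hypothetical handler: ExampleActivityMessage and its payload field are placeholders.
final class ExampleActivityHandler
{
    public function __invoke(ExampleActivityMessage $message): void
    {
        if ($message->payload === null) {
            // Retrying cannot fix a missing payload, so skip the retry delays;
            // the message goes to the failure transport and on to dead quickly.
            throw new UnrecoverableMessageHandlingException('Activity has no payload');
        }

        // ... normal handling ...
    }
}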

Is there a cronjob way to clean-up dead automatically?

Yes. The dead queue is stored in the SQL table and there is a queue_name column, so you could just do:

DELETE FROM messenger_messages WHERE queue_name='dead';
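
If an automated clean-up is wanted, a minimal sketch of a cron entry could look like this, assuming PostgreSQL, a database named mbin and a system user that can run psql non-interactively (names and schedule are placeholders, adjust to your own setup):

# purge the dead queue every Sunday at 03:00
0 3 * * 0 psql -d mbin -c "DELETE FROM messenger_messages WHERE queue_name='dead';"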

@melroy89
Member

melroy89 commented Feb 1, 2024

I'm OK with the changes as long as you document them. Extend, for example, https://github.com/MbinOrg/mbin/blob/main/FAQ.md#how-to-clean-up-all-failed-messages with your SQL query. But I was actually hoping for some kind of automated clean-up without running this DELETE query manually (but I digress).

Also try to explain in the admin / Docker / FAQ guides what the expected behavior now is (retries, a higher number of messages in 'ready'), so we can point admins to those docs.

TLDR: Explain your changes to server admins, so we do not get overwhelmed with support. Please :)

@nobodyatroot
Member

Agree with @melroy89's comments, and we need to put a note in big RED letters about updating the supervisor config in the next release. Otherwise, looks good!

nobodyatroot previously approved these changes Feb 4, 2024
melroy89 previously approved these changes Feb 12, 2024
@melroy89
Member

LGTM, but we need to add a special notice on the release page. This is a change for server admins.

@BentiGorlich merged commit d909a80 into main Feb 14, 2024
7 checks passed
@BentiGorlich deleted the improve-messenger-config branch February 14, 2024 12:56
@TheVillageGuy
Contributor

TheVillageGuy commented Feb 15, 2024

Since updating, my CPU load is at 100% constantly, something I am not used to.

[screenshots]

@BentiGorlich
Member Author

Since updating, my CPU load is at 100% constantly, something I am not used to.

Have you updated the supervisor config and restarted all the processes? It looks to me like your messengers are constantly crashing. Supervisor should not be at the top in htop 😄
You can check that by running supervisorctl status: if the uptime is not above a minute, they are probably crashing in a loop...

@TheVillageGuy
Contributor

It turns out the problem was that my config was set to /mbin/ while my setup still uses /kbin.
