Message posting retry for tx_commit() hang #732
Comments
This could be fixed in sr3 and v2. I'm not sure how to reproduce the issue.
Could we hack on it now... git cloning and trying things with a dev tree... because of the issue with reproducing...
yeah I'll try
prefer to fix sr3 first... only goal for now is to print a helpful message sooner... so folks know there is something wrong quicker.
adding a log message confirms it's getting stuck
Adding the timeout to basic_publish doesn't help with this issue. I added another log line between the basic_publish and tx_commit calls, which shows that tx_commit is where we get stuck.
This thread discusses reasons that could cause tx_commit to hang, and lists some commands we could run to troubleshoot: https://rabbitmq-users.narkive.com/Lq4Pdf6J/rabbitmq-discuss-hanging-on-message-sending-with-py-amqplib I'm using the pre-existing timeout option in basic_publish and added the extra log line in commit 7157aef. The tx_commit method doesn't support a timeout; is it worth trying to add our own?
Does the timeout on the basic_publish apply to the tx_commit()? What I understood is happening is that the publish returns, and then the process hangs on tx_commit(). Ideally, it would not hang, but would time out and put the message in the publish retry queue. Hanging is bad.
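To make the desired behaviour concrete, here is a minimal sketch of "publish, commit, and on failure keep the message for retry". The function name `post_or_requeue`, the `FailingChannel` stand-in, and the deque used as a retry queue are all hypothetical and are not Sarracenia's actual code. Note that this only helps if tx_commit() raises; it does nothing for a call that blocks forever.

```python
import collections
import logging

logger = logging.getLogger("post_sketch")


def post_or_requeue(channel, message, retry_queue,
                    exchange="", routing_key="", timeout=10):
    """Publish and commit; on any failure keep the message for a later retry.

    Returns True when the commit succeeded, False when the message was requeued.
    """
    try:
        channel.basic_publish(message, exchange=exchange,
                              routing_key=routing_key, timeout=timeout)
        channel.tx_commit()
    except Exception as err:  # covers timeouts and broker errors alike
        logger.error("post failed, queuing for retry: %r", err)
        retry_queue.append(message)
        return False
    return True


class FailingChannel:
    """Hypothetical stand-in for a channel whose commit fails."""

    def basic_publish(self, msg, exchange="", routing_key="", timeout=None):
        pass  # pretend the publish itself returns immediately

    def tx_commit(self):
        raise TimeoutError("tx_commit did not complete")


retry_queue = collections.deque()
ok = post_or_requeue(FailingChannel(), "hello", retry_queue)
print(ok, list(retry_queue))
```

With the failing stand-in, the message ends up in `retry_queue` and the function returns `False`, which is the outcome we would want instead of a hang.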
I'm not sure how it's supposed to work. I tested it with a 10 second timeout on the basic_publish: basic_publish returned right away, but it got stuck at tx_commit, which had still not timed out after more than a minute.
discussion from meeting: try to implement our own timeout for the tx_commit so we can retry posting.
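One way to add our own timeout around a call that has no timeout parameter is to run it in a daemon thread and give up waiting after a deadline. This is only a sketch under assumptions: `call_with_timeout` and `CommitTimeout` are hypothetical names, `slow_commit` stands in for a hanging tx_commit, and the stuck thread is abandoned rather than cancelled. After a timeout the connection should probably be treated as unusable and re-established before retrying.

```python
import threading
import time


class CommitTimeout(Exception):
    """Raised when the wrapped call does not return within the deadline."""


def call_with_timeout(fn, timeout, *args, **kwargs):
    """Run fn(*args, **kwargs) in a daemon thread, waiting at most `timeout` seconds."""
    outcome = {}

    def runner():
        try:
            outcome["value"] = fn(*args, **kwargs)
        except Exception as exc:
            outcome["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        # The call is still blocked; the thread is left behind, not killed.
        raise CommitTimeout(f"call did not return within {timeout} s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome.get("value")


def slow_commit():
    """Stand-in for a tx_commit() that blocks on a misbehaving broker."""
    time.sleep(0.5)


try:
    call_with_timeout(slow_commit, 0.05)
    timed_out = False
except CommitTimeout:
    timed_out = True
print(timed_out)
```

The caller can catch `CommitTimeout` the same way it catches other publish errors and hand the message to the retry path.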
any progress on this? I see the branch sitting there... I just did a PR for needed fixes for scheduled flowcb plugins... and you hit real problems with those that are addressed by what is in v03_wip now, so might want to release soon.
no progress, I haven't had time to spend on it :( The changes in the branch don't fix the problem we experienced (
Sure, sounds good... err... what is the benefit of what's there already?
It adds a timeout to publishing messages, which might help prevent hangs and improve error recovery when a broker isn't working. If the publish times out, it will raise an exception instead of hanging. The additional logging also helps troubleshooting - it was unclear where the publishing was getting stuck before. The only thing missing is adding a timeout to the
We have run into a case where we're connected to a broker and try to post a message and Sarra gets stuck. This seems to be caused by a misbehaving broker and is hard to troubleshoot.
Writing a debug log message here:
sarracenia/sarracenia/moth/amqp.py
Lines 647 to 648 in 459cc4e
before posting a message would be helpful for debugging issues where the broker doesn't respond.
I think we should also add a timeout option to the basic_publish. The default timeout is None, so it's not clear whether a post will ever fail. The existing error handling doesn't run because basic_publish or tx_commit seems to block forever. https://docs.celeryq.dev/projects/amqp/en/latest/reference/amqp.channel.html#amqp.channel.Channel.basic_publish
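As a rough illustration of the two suggestions (debug logging around the post, and passing a timeout to basic_publish), something like the following could work. `publish_with_logging` and `RecordingChannel` are hypothetical names for this sketch, and the stand-in channel only mimics the `timeout` keyword described in the py-amqp documentation linked above.

```python
import logging

logger = logging.getLogger("publish_sketch")


def publish_with_logging(channel, message, exchange, routing_key, timeout=10):
    """Log before each blocking step so a hang can be located in the log."""
    logger.debug("publishing to exchange=%s routing_key=%s timeout=%s",
                 exchange, routing_key, timeout)
    channel.basic_publish(message, exchange=exchange,
                          routing_key=routing_key, timeout=timeout)
    logger.debug("basic_publish returned, calling tx_commit")
    channel.tx_commit()
    logger.debug("tx_commit returned")


class RecordingChannel:
    """Hypothetical stand-in that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def basic_publish(self, msg, exchange="", routing_key="", timeout=None):
        self.calls.append(("basic_publish", timeout))

    def tx_commit(self):
        self.calls.append(("tx_commit", None))


logging.basicConfig(level=logging.DEBUG)
channel = RecordingChannel()
publish_with_logging(channel, "hello", "xpublic", "v03.post.test")
print(channel.calls)
```

If the last line in the log is "basic_publish returned, calling tx_commit", the hang is in the commit rather than in the publish.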