Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Message posting retry for tx_commit() hang #732

Open
reidsunderland opened this issue Jul 26, 2023 · 14 comments
Open

Message posting retry for tx_commit() hang #732

reidsunderland opened this issue Jul 26, 2023 · 14 comments
Assignees
Labels
Discussion_Needed developers should discuss this issue. enhancement New feature or request Priority 3 - Important worrisome impact... should study carefully

Comments

@reidsunderland
Copy link
Member

reidsunderland commented Jul 26, 2023

We have run into a case where we're connected to a broker and try to post a message and Sarra gets stuck. This seems to be caused by a misbehaving broker and is hard to troubleshoot.

Writing a debug log message here:

self.channel.basic_publish(AMQP_Message, exchange, topic)
self.channel.tx_commit()

before posting a message would be helpful for debugging issues where the broker doesn't respond.

I think we should also add a timeout option to the basic_publish. The default is None, and it's not clear if the post will ever fail. The existing error handling doesn't run because the basic_publish or tx_commit seems to get stuck forever. https://docs.celeryq.dev/projects/amqp/en/latest/reference/amqp.channel.html#amqp.channel.Channel.basic_publish

@reidsunderland reidsunderland added v3 issue deferred to or affects version 3 Sugar nice to have features... not super important. labels Jul 26, 2023
@reidsunderland reidsunderland self-assigned this Jul 26, 2023
@reidsunderland reidsunderland changed the title Add debug log message before posting message Message posting improvements for sr3 Jul 26, 2023
@reidsunderland reidsunderland removed the v3 issue deferred to or affects version 3 label Jul 26, 2023
@reidsunderland reidsunderland changed the title Message posting improvements for sr3 Message posting improvements Jul 26, 2023
@reidsunderland reidsunderland added enhancement New feature or request Priority 3 - Important worrisome impact... should study carefully Discussion_Needed developers should discuss this issue. and removed Sugar nice to have features... not super important. labels Jul 26, 2023
@reidsunderland
Copy link
Member Author

This could be fixed in sr3 and v2. I'm not sure how to reproduce the issue.

@petersilva
Copy link
Contributor

Could we hack on it now... git cloning and trying things with a dev tree... because of the issue with reproducing...
try publishing to the sick broker and coaxing it to generate a message, or surround it with a try/except and/or add a timeout.

@reidsunderland
Copy link
Member Author

yeah I'll try

@petersilva
Copy link
Contributor

prefer to fix sr3 first... only goal for now is to print a helpful message sooner... so folks know there is something wrong quicker.

@reidsunderland
Copy link
Member Author

2023-07-26 15:57:43,549 [DEBUG] 343300 sarracenia.transfer read_write sr_proto read_write
2023-07-26 15:57:43,945 [DEBUG] 343300 sarracenia.flow do_download downloaded ok: /apps/sarra/public_data/20230726/MSC-DMS-OP/NINJO/METAR/15/METAR_1.0-001_202307261500_htmp_orig_man_wmo.xml
2023-07-26 15:57:43,945 [DEBUG] 343300 sarracenia.flow.sarra do processing 1 messages worked!
2023-07-26 15:57:43,945 [INFO] 343300 sarracenia.flowcb.log after_work downloaded ok: /apps/sarra/public_data/20230726/MSC-DMS-OP/NINJO/METAR/15/METAR_1.0-001_202307261500_htmp_orig_man_wmo.xml
2023-07-26 15:57:43,945 [DEBUG] 343300 sarracenia.moth.amqp putNewMessage trying to publish body: {"sundew_extension": "DMS:CMC:METAR:XML:", "to_clusters": "ALL", "x-delay": 0, "source": "MSC-DMS-OP", "from_cluster": "DDSR.CMC", "pubTime": "20230726T153640.118", "baseUrl": "http://removed/", "relPath": "20230726/MSC-DMS-OP/NINJO/METAR/15/METAR_1.0-001_202307261500_htmp_orig_man_wmo.xml", "identity": {"method": "arbitrary", "value": "0f9159f55187458e3616b0f4bbbbf407"}, "size": 1673, "contentType": "text/xml"} headers: {} to xpublic under: v02.post.20230726.MSC-DMS-OP.NINJO.METAR.15

adding a log message confirms it's getting stuck

@reidsunderland
Copy link
Member Author

adding the timeout to basic_publish doesn't help this issue. I added another log in between the basic_publish and tx_commit which reveals that it's the tx_commit where we're getting stuck:

2023-07-26 16:05:42,553 [DEBUG] 346216 sarracenia.moth.amqp putNewMessage trying to publish body:  removed
2023-07-26 16:05:42,553 [DEBUG] 346216 sarracenia.moth.amqp putNewMessage published, trying to tx_commit

@reidsunderland
Copy link
Member Author

This discusses reasons that could cause tx_commit to hang, and has some commands we could run to troubleshoot. https://rabbitmq-users.narkive.com/Lq4Pdf6J/rabbitmq-discuss-hanging-on-message-sending-with-py-amqplib

@junhu3

I'm using the pre-existing timeout option in the basic_publish and added the extra log line in this commit 7157aef. The tx_commit method doesn't support a timeout, is it worth trying to add our own timeout?

@petersilva
Copy link
Contributor

Does the timeout on the basic_publish apply to the tx_commit()? What I understood is happenning is the publish is returning, and then the process is hanging on tx_commit(). Ideally, it would not hang, but would timeout, and put the message in the publish retry queue. Hanging is bad.

@reidsunderland
Copy link
Member Author

I'm not sure how it's supposed to work. I tested it using a 10 second timeout on the basic_publish and basic_publish returned right away but it got stuck at tx_commit, which definitely didn't timeout after >1 minute.

@petersilva petersilva changed the title Message posting improvements Message posting retry for tx_commit() hang Jul 31, 2023
@reidsunderland
Copy link
Member Author

discussion from meeting: try to implement our own timeout for the tx_commit so we can retry posting.

@petersilva
Copy link
Contributor

any progress on this? I see the branch sitting there... I just did a PR for needed fixes for scheduled flowcb plugins... and you hit real problems with those that are addressed by what is in v03_wip now, so might want to release soon.

@reidsunderland
Copy link
Member Author

no progress, I haven't had time to spend on it :(

The changes in the branch don't fix the problem we experienced (tx_commit hanging forever), but I think they're still good to have. Do you want me to create a PR?

@petersilva
Copy link
Contributor

Sure, sounds good... err... what is the benefit of what's there already?

@reidsunderland
Copy link
Member Author

It adds a timeout to publishing messages, which might help prevent hangs and improve error recovery when a broker isn't working. If the publish times out, it will raise an exception instead of hanging. The additional logging also helps troubleshooting - it was unclear where the publishing was getting stuck before.

The only thing missing is adding a timeout to the self.channel.tx_commit() call. We had talked about doing something similar to how timeouts are implemented here:

def alarm_raise(n, f):

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Discussion_Needed developers should discuss this issue. enhancement New feature or request Priority 3 - Important worrisome impact... should study carefully
Projects
None yet
Development

No branches or pull requests

2 participants