Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve warning message about truncated messages #605

Merged
merged 3 commits into from
Mar 30, 2022

Conversation

kwen2501
Copy link
Contributor

@kwen2501 kwen2501 commented Dec 3, 2021

The "truncated message" error is most often triggered by a mismatch in collective size or env settings between ranks.

For better interpretability, the patch displays hints of cause so that it would be easier for user to debug.

It also changes the error type from InternalError to InvalidUsage to reflect the above most-likely causes.

Display hints of cause so that it would be easier for user to debug.
Also change the error type from InternalError to InvalidUsage as most
of time this is caused by a mismatch in collective size or env settings.
@sjeaugey
Copy link
Member

sjeaugey commented Feb 8, 2022

Thanks for the PR. I can agree with the intent but shouldn't we do the same for IB, and split the tests here in two parts, and return invalid usage for the first check (size > slots[r].size), keeping internal error for the others?

First part on collective mismatch, second part on internal errors
@kwen2501
Copy link
Contributor Author

kwen2501 commented Feb 8, 2022

Sylvain, that's a great idea, thanks! I added a commit to implement what you refer to.

@kwen2501
Copy link
Contributor Author

Hi @sjeaugey @AddyLaddy @jbachan , I was wondering if you have more comments about this PR. We recently observed more incidents where these paths got hit -- both the socket one and the IB one. It would be great if we could use a different code than ncclInternalError so that users can have a correct first direction when debugging the issue. Admittedly, nothing is perfect, just looking for a most likely match here. Thanks!

@sjeaugey sjeaugey merged commit 2dfd837 into NVIDIA:master Mar 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants