Any way to enable TRACE logging without recompiling from source? #197
Comments
In order to get the TRACE() output you have to recompile NCCL with TRACE=1. There have been issues reported with NCCL attempting to connect using IPC across nodes when the hostnames are not unique. However, that issue should be fixed in newer versions of NCCL. Which version of NCCL are you using?
@AddyLaddy Thanks for the suggestion. This is with 2.4.2. The full hostnames in question are unique, but not if you look only at the first component when splitting on '.'. @ngoyal2707 first reported this in #187 but it appears to persist. Are there any call sites other than logging that split a hostname on the period character?
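To make the hostname concern concrete, here is a minimal sketch of how two unique fully qualified hostnames can collapse to the same value once only the first '.'-separated component is kept. The helper names `firstLabel` and `collidesAfterTruncation` and the example hostnames are hypothetical and are not taken from the NCCL sources.

```cpp
#include <string>

// Hypothetical helper: keep only the text before the first '.'.
// If there is no '.', find() returns npos and substr() yields the whole string.
std::string firstLabel(const std::string& host) {
  return host.substr(0, host.find('.'));
}

// True when two different full hostnames become indistinguishable
// after truncation to the first label.
bool collidesAfterTruncation(const std::string& a, const std::string& b) {
  return a != b && firstLabel(a) == firstLabel(b);
}
```

With these helpers, `node1.rack-a.example.com` and `node1.rack-b.example.com` compare as different hosts in full, but both truncate to `node1`, which is the kind of collision that could make a peer on another machine look local.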
+1 to what pietern said.
@ngoyal2707 Perhaps runtime overhead of the calls? Trace may produce a LOT of output.
Yes the TRACE() output can get very long and noisy, and there are some TRACE() calls in the critical data paths too, so we wanted to avoid any performance overheads. With NCCL going open source, we felt it would be easy for customers to recompile with TRACE=1 and also add their own additional TRACE() calls.
I see, that makes sense. I will close this issue then.
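The overhead argument can be illustrated with a rough sketch of a compile-time gated macro. This is not NCCL's actual definition: the macro body, `expensiveDetail`, and `hotPath` are assumptions made for illustration, and only the `ENABLE_TRACE` symbol comes from this thread.

```cpp
#include <cstdio>

// Sketch only: when ENABLE_TRACE is not defined, TRACE() expands to an empty
// statement, so its arguments are never evaluated and the call costs nothing.
#ifdef ENABLE_TRACE
#define TRACE(...) std::fprintf(stderr, __VA_ARGS__)
#else
#define TRACE(...) do { } while (0)
#endif

int sideEffects = 0;

// Hypothetical stand-in for work done only to build a trace message.
int expensiveDetail() {
  ++sideEffects;
  return 42;
}

// Hypothetical hot path: returns how many times the trace argument was evaluated.
int hotPath() {
  sideEffects = 0;
  TRACE("detail=%d\n", expensiveDetail());
  return sideEffects;
}
```

Built without `-DENABLE_TRACE`, `hotPath()` never calls `expensiveDetail()`. A runtime-only check would still pay for a level comparison on every call in the critical path, which is presumably the cost being avoided.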
We are facing some issue where NCCL is trying to communicate across hosts with IPC.
Looking at the code, `INFO` logs are not detailed enough for this issue but `TRACE` would have been helpful.
It seems the code does have `NCCL_DEBUG=TRACE`, but looking at the code it seems that functionality is hidden behind `ifdef ENABLE_TRACE` and there doesn't seem to be any way to enable `TRACE` logging without recompiling.
Is that correct? (The docs also say accepted values are just `VERSION`, `WARN` and `INFO`.)
cc: @pietern
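The situation described above, where a runtime setting names a level that the build may not contain, can be sketched as follows. The enum, the function names, and the exact-match parsing are all assumptions for illustration; only the value names and `ENABLE_TRACE` appear in this issue.

```cpp
#include <string>

enum class DebugLevel { None, Version, Warn, Info, Trace };

// Hypothetical parser for a debug-level string such as the value of NCCL_DEBUG.
// Exact, case-sensitive matching is an assumption of this sketch.
DebugLevel parseDebugLevel(const std::string& value) {
  if (value == "VERSION") return DebugLevel::Version;
  if (value == "WARN") return DebugLevel::Warn;
  if (value == "INFO") return DebugLevel::Info;
  if (value == "TRACE") return DebugLevel::Trace;
  return DebugLevel::None;
}

// Reports whether this translation unit was built with trace support.
bool traceCompiledIn() {
#ifdef ENABLE_TRACE
  return true;
#else
  return false;
#endif
}

// Trace output can only appear when the level is requested at runtime
// AND the trace calls were compiled in.
bool traceActive(const std::string& value) {
  return parseDebugLevel(value) == DebugLevel::Trace && traceCompiledIn();
}
```

In a build without `ENABLE_TRACE`, `traceActive("TRACE")` is false even though the string parses successfully, which matches the observation that setting the level alone produces no extra output.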