Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Anyways to enable TRACE logging without recompiling from source #197

Closed
ngoyal2707 opened this issue Mar 20, 2019 · 6 comments
Closed

Anyways to enable TRACE logging without recompiling from source #197

ngoyal2707 opened this issue Mar 20, 2019 · 6 comments

Comments

@ngoyal2707
Copy link

ngoyal2707 commented Mar 20, 2019

We are facing some issue where NCCL is trying to communicate across hosts with IPC.
Looking at the code INFO logs are not detailed enough for this issue but TRACE would have been helpful.

It seems the code does have NCCL_DEBUG=TRACE but looking at the code it seems that functionality is hidden behind ifdef ENABLE_TRACE and there doesn't seem to be anyway to enable TRACE logging without recompiling.

Is that correct? (the docs also says accepted values are just VERSION, WARN and INFO).

cc: @pietern

@AddyLaddy
Copy link
Collaborator

In order to get the TRACE() output you have to recompile NCCL using make TRACE=1
Then you need to enable it at runtime with NCCL_DEBUG_FLAGS=TRACE
You can enable further TRACE() output by setting NCCL_DEBUG_SUBSYS=[INIT|COLL|P2P|SHM|NET|ALL]

There have been issues reported with NCCL attempting to connect using IPC across nodes when the hostnames are not unique. However that issue should be fixed in newer versions of NCCL.

Which version of NCCL are you using ?

@pietern
Copy link

pietern commented Mar 20, 2019

@AddyLaddy Thanks for the suggestion. This is with 2.4.2. The full hostnames in question are unique, but not if you look only at the first component when splitting on '.'.

@ngoyal2707 first reported this in #187 but it appears to persist. Are there any call sites other than logging that split a hostname on the period character?

@ngoyal2707
Copy link
Author

+1 to what pietern said.
Also I didn’t get the reason of requiring to compile for TRACE. Why not remove the Ifdef ENABLE_TRACE and just check for env[“NCCL_DEBUG”]. As in, this particular issue was more from enhancement perspective. Not requiring to recompile for trace logging, will be helpful

@pietern
Copy link

pietern commented Mar 20, 2019

@ngoyal2707 Perhaps runtime overhead of the calls? Trace may produce a LOT of output.

@AddyLaddy
Copy link
Collaborator

Yes the TRACE() output can get very long and noisy, and there are some TRACE() calls in the critical data paths too, so we wanted to avoid any performance overheads.

With NCCL going open source, we felt it would be easy for customers to recompile with TRACE=1 and also add their own additional TRACE() calls.

@ngoyal2707
Copy link
Author

I see, that makes sense. I will close this issue then.
Will continue to discuss the original issue in #187 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants