question about fine-grained transport selection for multi-node env #9560

Open
qelk123 opened this issue Dec 24, 2023 · 1 comment
qelk123 commented Dec 24, 2023

Hi all,
I am using UCX for inter-process communication across multiple nodes, and I've observed that UCX selects different transport configurations depending on the communication pattern. During my testing, the following configurations were logged:

ucp_worker.c:1783 UCX INFO ep_cfg[1]: tag(sysv/memory cma/memory cuda_copy/cuda)

ucp_worker.c:1783 UCX INFO ep_cfg[2]: tag(sysv/memory cma/memory tcp/ib1)

ucp_worker.c:1783 UCX INFO ep_cfg[3]: tag(tcp/ib1 tcp/ib0)

My first question is about how to interpret these ep_cfg entries. I presume that ep_cfg[0] is the transport a process uses to reach itself, ep_cfg[1] is the transport between processes on different CPU cores or GPUs within one server node, and ep_cfg[2] is for inter-node communication between different server nodes. Am I interpreting these correctly?
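To compare the entries side by side, the log lines can be split into their index, feature, and transport/device pairs. The following is only a rough sketch: the regular expression is inferred from the three lines quoted above, `parse_ep_cfg` is a hypothetical helper and not part of UCX, and other UCX versions may format this message differently.

```python
import re

# Pattern inferred from the three log lines quoted in this issue only;
# it is an assumption, not a documented UCX log format.
EP_CFG_RE = re.compile(r"ep_cfg\[(\d+)\]:\s*(\w+)\((.*?)\)")


def parse_ep_cfg(line):
    """Return (index, feature, [(transport, device), ...]) or None."""
    match = EP_CFG_RE.search(line)
    if match is None:
        return None
    index, feature, lanes = match.groups()
    # Each lane looks like "transport/device", separated by spaces.
    pairs = [tuple(lane.split("/", 1)) for lane in lanes.split()]
    return int(index), feature, pairs


log = """\
ucp_worker.c:1783 UCX INFO ep_cfg[1]: tag(sysv/memory cma/memory cuda_copy/cuda)
ucp_worker.c:1783 UCX INFO ep_cfg[2]: tag(sysv/memory cma/memory tcp/ib1)
ucp_worker.c:1783 UCX INFO ep_cfg[3]: tag(tcp/ib1 tcp/ib0)
"""

# Keep only the lines that matched the pattern.
configs = [cfg for cfg in map(parse_ep_cfg, log.splitlines()) if cfg]
for index, feature, pairs in configs:
    print(index, feature, pairs)
```

Note that the quoted log contains entries 1 through 3, so the parsed indices may not line up exactly with the 0 to 2 numbering used in the question.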

Furthermore, my main concern is whether I can specify transport constraints separately for each communication pattern (that is, for each ep_cfg entry). At present I can only set constraints globally with UCX_TLS, which affects all ep_cfg entries and can lead to suboptimal configurations for some communication patterns. Is there a way to configure transports at a finer granularity for distinct communication patterns?
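For reference, the global constraint I am describing looks roughly like the sketch below. `UCX_TLS` and `UCX_LOG_LEVEL` are real UCX environment variables; the helper name `ucx_env`, the transport list, and the application binary are placeholders chosen for illustration.

```python
import os


def ucx_env(transports, base=None):
    """Copy an environment and apply one global UCX transport list."""
    env = dict(os.environ if base is None else base)
    # UCX_TLS is process-wide: every endpoint configuration is restricted
    # to this single list, whatever the peer's location.
    env["UCX_TLS"] = ",".join(transports)
    # Info-level logging is what produced the ep_cfg lines quoted above.
    env["UCX_LOG_LEVEL"] = "info"
    return env


env = ucx_env(["sysv", "cma", "tcp"])  # illustrative transport list
print(env["UCX_TLS"])  # prints "sysv,cma,tcp"

# Hypothetical launch; "./my_app" is a placeholder for the real program:
# subprocess.run(["./my_app"], env=env)
```

Because the list applies to the whole process, dropping a transport here (for example `cuda_copy`) removes it from every ep_cfg entry at once, which is the limitation I am asking about.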

Regards,
Micheal

Contributor

yosefe commented Dec 25, 2023

The different configurations usually correspond to self, intra-node, and inter-node transports, but not always; they can also depend on endpoint creation parameters, on which MPI components use UCX, etc. In newer versions this log message also prints the type of configuration.
Currently there is no way to set transport constraints per process topology.
In which use case do you see a suboptimal selection by default?
