Hello. I'm having difficulties running the code provided. First of all, I have a question: is it possible to run your code without InfiniBand? I'm running as follows:
And get the following error:
Traceback (most recent call last):
File "/root/PipeTransformer/examples/text_classification/main_tc.py", line 261, in <module>
pipe_transformer = PipeTransformer(config, tc_data_manager, model_config, model)
File "/root/PipeTransformer/pipe_transformer/pipe_transformer.py", line 15, in __init__
self.auto_dp = AutoDataParallel(config)
File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 46, in __init__
self.init_rpc()
File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 117, in init_rpc
rpc.init_rpc(
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
_init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
rpc_agent = backend_registry.init_backend(
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
return backend.value.init_backend_handler(*args, **kwargs)
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 360, in _tensorpipe_init_backend_handler
api._all_gather(None, timeout=rpc_backend_options.rpc_timeout)
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 82, in wrapper
return func(*args, **kwargs)
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 224, in _all_gather
rpc_sync(
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 82, in wrapper
return func(*args, **kwargs)
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 809, in rpc_sync
return fut.wait()
RuntimeError: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
Do you have any idea what could be causing this?
I was thinking that maybe it's because I haven't turned InfiniBand on, but when I change 0 "lo" to 1 "ib0" in both scripts, I get another error message:
Traceback (most recent call last):
File "/root/PipeTransformer/examples/text_classification/main_tc.py", line 261, in <module>
pipe_transformer = PipeTransformer(config, tc_data_manager, model_config, model)
File "/root/PipeTransformer/pipe_transformer/pipe_transformer.py", line 15, in __init__
self.auto_dp = AutoDataParallel(config)
File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 45, in __init__
self.init_ddp()
File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 86, in init_ddp
dist.init_process_group(init_method='tcp://' + str(self.config.master_addr) + ':' + str(self.config.master_port),
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 761, in init_process_group
default_pg = _new_process_group_helper(
File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 862, in _new_process_group_helper
pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1670525541990/work/third_party/gloo/gloo/transport/tcp/device.cc:80] ifa != nullptr. Unable to find address for: ib0
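The "Unable to find address for: ib0" message suggests that Gloo could not find a usable interface named ib0 on this machine. As a quick sanity check, a minimal sketch like the one below lists the interface names the OS reports. It uses only Python's standard `socket.if_nameindex()`; the helper names `list_interfaces` and `interface_exists` are made up for illustration and are not part of PipeTransformer or PyTorch. Note that it only checks that a name exists, not that the interface has an IP address assigned.

```python
import socket


def list_interfaces():
    """Return the names of the network interfaces the OS reports."""
    # socket.if_nameindex() yields (index, name) pairs.
    return [name for _index, name in socket.if_nameindex()]


def interface_exists(name):
    """Return True if an interface with exactly this name is present."""
    return name in list_interfaces()


if __name__ == "__main__":
    # "lo" and "ib0" are the two names tried in the scripts above.
    for candidate in ("lo", "ib0"):
        status = "present" if interface_exists(candidate) else "missing"
        print(f"{candidate}: {status}")
```

If ib0 shows up as missing, that would be consistent with the second traceback and would mean the machine has no InfiniBand interface under that name.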
Any help would be appreciated.