Runtime error #6

Open
RealAntonVoronov opened this issue Dec 29, 2022 · 0 comments

RealAntonVoronov commented Dec 29, 2022

Hello. I'm having difficulty running the provided code. First, a question: is it possible to run your code without InfiniBand? I'm launching it as follows:

nohup sh run_tc_pipetransformer.sh 8 2 0 65.108.32.147 11111 0 "lo" 1e-5 8 0 freeze 2 > ./PipeTransformer-TC.log 2>&1 &
nohup sh run_tc_pipetransformer.sh 8 2 1 65.108.32.147 11111 0 "lo" 1e-5 8 0 freeze 2 > ./PipeTransformer-TC.log 2>&1 &

And get the following error:

Traceback (most recent call last):
  File "/root/PipeTransformer/examples/text_classification/main_tc.py", line 261, in <module>
    pipe_transformer = PipeTransformer(config, tc_data_manager, model_config, model)
  File "/root/PipeTransformer/pipe_transformer/pipe_transformer.py", line 15, in __init__
    self.auto_dp = AutoDataParallel(config)
  File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 46, in __init__
    self.init_rpc()
  File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 117, in init_rpc
    rpc.init_rpc(
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 360, in _tensorpipe_init_backend_handler
    api._all_gather(None, timeout=rpc_backend_options.rpc_timeout)
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 224, in _all_gather
    rpc_sync(
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/rpc/api.py", line 809, in rpc_sync
    return fut.wait()
RuntimeError: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)

Do you have any idea what might be causing it?
I thought it might be because I haven't turned InfiniBand on, but when I change 0 "lo" to 1 "ib0" in both commands, I get a different error message:

Traceback (most recent call last):
  File "/root/PipeTransformer/examples/text_classification/main_tc.py", line 261, in <module>
    pipe_transformer = PipeTransformer(config, tc_data_manager, model_config, model)
  File "/root/PipeTransformer/pipe_transformer/pipe_transformer.py", line 15, in __init__
    self.auto_dp = AutoDataParallel(config)
  File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 45, in __init__
    self.init_ddp()
  File "/root/PipeTransformer/pipe_transformer/dp/auto_dp.py", line 86, in init_ddp
    dist.init_process_group(init_method='tcp://' + str(self.config.master_addr) + ':' + str(self.config.master_port),
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 761, in init_process_group
    default_pg = _new_process_group_helper(
  File "/root/anaconda3/envs/pipe/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 862, in _new_process_group_helper
    pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1670525541990/work/third_party/gloo/gloo/transport/tcp/device.cc:80] ifa != nullptr. Unable to find address for: ib0
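
The last line of this traceback says Gloo could not find an address for an interface named ib0. A minimal sketch for checking which interface names the node actually exposes is below. It uses only the Python standard library; the helper names list_interfaces and interface_exists are made up for illustration and are not part of PipeTransformer or PyTorch.

```python
import socket


def list_interfaces():
    """Return the names of the network interfaces reported by the OS."""
    # socket.if_nameindex() yields (index, name) pairs.
    return [name for _index, name in socket.if_nameindex()]


def interface_exists(name, available=None):
    """Check whether `name` is among the available interface names.

    `available` can be passed explicitly, mainly to make testing easier;
    by default the live interface list is queried.
    """
    if available is None:
        available = list_interfaces()
    return name in available


if __name__ == "__main__":
    names = list_interfaces()
    print("available interfaces:", names)
    # "lo" and "ib0" are the two names used in the commands above.
    for candidate in ("lo", "ib0"):
        status = "found" if interface_exists(candidate, names) else "missing"
        print(candidate, status)
```

If ib0 shows up as missing, the node has no interface with that name, which would be consistent with the "Unable to find address for: ib0" message. The check says nothing about the cause of the first (shm eof) error.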

Any help would be appreciated.
