train error #95

Open
bageheyalu opened this issue Sep 20, 2019 · 3 comments

Comments

@bageheyalu

bageheyalu commented Sep 20, 2019

Does anybody know the reason?
Here is the log.

```
  0%|                                                | 0/480000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 212, in <module>
    train(training_dbs, validation_db, args.start_iter)
  File "train.py", line 141, in train
    training_loss, focal_loss, pull_loss, push_loss, regr_loss = nnet.train(**training)
  File "/home/wlk/Center/CenterNet/nnet/py_factory.py", line 82, in train
    loss_kp = self.network(xs, ys)
  File "/home/wlk/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wlk/Center/CenterNet/models/py_utils/data_parallel.py", line 66, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids, self.chunk_sizes)
  File "/home/wlk/Center/CenterNet/models/py_utils/data_parallel.py", line 77, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim, chunk_sizes=self.chunk_sizes)
  File "/home/wlk/Center/CenterNet/models/py_utils/scatter_gather.py", line 30, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim, chunk_sizes) if inputs else []
  File "/home/wlk/Center/CenterNet/models/py_utils/scatter_gather.py", line 25, in scatter
    return scatter_map(inputs)
  File "/home/wlk/Center/CenterNet/models/py_utils/scatter_gather.py", line 18, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/wlk/Center/CenterNet/models/py_utils/scatter_gather.py", line 20, in scatter_map
    return list(map(list, zip(*map(scatter_map, obj))))
  File "/home/wlk/Center/CenterNet/models/py_utils/scatter_gather.py", line 15, in scatter_map
    return Scatter.apply(target_gpus, chunk_sizes, dim, obj)
  File "/home/wlk/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 89, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "/home/wlk/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/torch/cuda/comm.py", line 148, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: Expected the device associated with the stream at index 4 (was 24371) to match the device supplied at that index (expected 48) (scatter at /opt/conda/conda-bld/pytorch_1544202130060/work/torch/csrc/cuda/comm.cpp:199)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f922e9ebcc5 in /home/wlk/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: torch::cuda::scatter(at::Tensor const&, c10::ArrayRef<long>, c10::optional<std::vector<long, std::allocator<long> > > const&, long, c10::optional<std::vector<c10::optional<at::cuda::CUDAStream>, std::allocator<c10::optional<at::cuda::CUDAStream> > > > const&) + 0x85d (0x7f926f55552d in /home/wlk/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0x4fae71 (0x7f926f55ae71 in /home/wlk/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x112176 (0x7f926f172176 in /home/wlk/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #11: THPFunction_apply(_object*, _object*) + 0x5a1 (0x7f926f36dbf1 in /home/wlk/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

Exception ignored in: <function tqdm.__del__ at 0x7f922ad38a60>
Traceback (most recent call last):
  File "/home/wlk/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/tqdm/_tqdm.py", line 885, in __del__
    self.close()
  File "/home/wlk/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/tqdm/_tqdm.py", line 1090, in close
    self._decr_instances(self)
  File "/home/wlk/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/tqdm/_tqdm.py", line 454, in _decr_instances
    cls.monitor.exit()
  File "/home/wlk/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/tqdm/_monitor.py", line 52, in exit
    self.join()
  File "/home/wlk/anaconda3/envs/CornerNet_Lite/lib/python3.7/threading.py", line 1029, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread
terminate called without an active exception
```
@Dawn-LX

Dawn-LX commented Dec 10, 2019

I ran into this problem too.

@Dawn-LX

Dawn-LX commented Dec 10, 2019

Has anyone solved it?

@kehuantiantang

I use pytorch=1.0.0 and modified the code in 'nnet/py_factory.py'
to import 'from torch.nn import DataParallel' instead of 'from models.py_utils.data_parallel import DataParallel'.
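For anyone hitting the same scatter error, this is roughly what that change amounts to. A minimal sketch, not the repo's exact code: torch.nn.DataParallel splits each batch evenly across the visible GPUs and does not accept the chunk_sizes keyword that models.py_utils.data_parallel takes, so any chunk_sizes=... argument passed when wrapping the network in py_factory.py has to be dropped as well.

```python
import torch
from torch.nn import DataParallel  # instead of models.py_utils.data_parallel

# Stand-in for the detection network that nnet/py_factory.py wraps.
model = torch.nn.Linear(16, 4)

if torch.cuda.is_available():
    model = model.cuda()
    # The stock wrapper has no chunk_sizes argument: it splits the batch
    # evenly across device_ids instead of using per-GPU chunk sizes.
    model = DataParallel(model)

x = torch.randn(8, 16)
if torch.cuda.is_available():
    x = x.cuda()

out = model(x)    # scatter/gather handled by torch.nn.DataParallel
print(out.shape)  # torch.Size([8, 4])
```

Note the trade-off: the repo ships its own DataParallel precisely to allow uneven per-GPU batch sizes, so this workaround gives that up in exchange for the stock scatter path that matches your installed PyTorch version.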
