How to use multi-gpus? #25

Open
flytoylf opened this issue Dec 14, 2020 · 0 comments

image

As shown in the figure above, I want to run my job on multiple GPUs, but it fails with the following error:

Traceback (most recent call last):
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
**merge_kwargs)
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 193, in _init_from_checkpoint
ckpt_file = _get_checkpoint_filename(ckpt_dir_or_file)
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 280, in _get_checkpoint_filename
if gfile.IsDirectory(ckpt_dir_or_file):
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 467, in is_directory
return pywrap_tensorflow.IsDirectory(compat.as_bytes(dirname), status)
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 61, in as_bytes
(bytes_or_text,))
TypeError: Expected binary or unicode string, got PerDevice({'/replica:0/task:0/device:GPU:0': '/data/yangxiaohan/.eztransfer_modelzoo/bert/google-bert-base-zh/model.ckpt', '/replica:0/task:0/device:GPU:1': '/data/yangxiaohan/.eztransfer_modelzoo/bert/google-bert-base-zh/model.ckpt'})
Traceback (most recent call last):
File "src/fit.py", line 179, in <module>
tf.app.run()
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "src/fit.py", line 168, in main
train()
File "src/fit.py", line 24, in train
app.run_train(reader=train_reader)
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/easytransfer/engines/model.py", line 616, in run_train
max_steps=self.train_steps)
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1205, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1316, in _train_model_distributed
self.config)
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/python/training/distribute.py", line 721, in call_for_each_tower
return self._call_for_each_tower(fn, *args, **kwargs)
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 556, in _call_for_each_tower
return _call_for_each_tower(self, fn, *args, **kwargs)
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 183, in _call_for_each_tower
coord.join(threads)
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
**merge_kwargs)
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 193, in _init_from_checkpoint
ckpt_file = _get_checkpoint_filename(ckpt_dir_or_file)
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 280, in _get_checkpoint_filename
if gfile.IsDirectory(ckpt_dir_or_file):
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 467, in is_directory
return pywrap_tensorflow.IsDirectory(compat.as_bytes(dirname), status)
File "/data/yangxiaohan/tool/python3.6/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 61, in as_bytes
(bytes_or_text,))
TypeError: Expected binary or unicode string, got PerDevice({'/replica:0/task:0/device:GPU:0': '/data/yangxiaohan/.eztransfer_modelzoo/bert/google-bert-base-zh/model.ckpt', '/replica:0/task:0/device:GPU:1': '/data/yangxiaohan/.eztransfer_modelzoo/bert/google-bert-base-zh/model.ckpt'})
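
Reading the `TypeError` above: `_get_checkpoint_filename` expects a plain string, but under the mirrored strategy it receives a `PerDevice` wrapper that holds the same checkpoint path once per GPU. The sketch below only illustrates that mismatch in plain Python. It is not EasyTransfer's or TensorFlow's fix: `unwrap_checkpoint_path` is a hypothetical helper, and a plain `dict` stands in for the `PerDevice` object, whose real interface is not shown in the traceback and may differ between TensorFlow versions.

```python
from collections.abc import Mapping


def unwrap_checkpoint_path(value):
    """Collapse a per-device mapping of checkpoint paths into one path.

    Hypothetical helper for illustration only. A plain dict stands in for
    the PerDevice wrapper seen in the traceback.
    """
    # A plain path is already what the checkpoint loader expects.
    if isinstance(value, (str, bytes)):
        return value
    # A per-device mapping is only safe to collapse if every device
    # points at the same checkpoint.
    if isinstance(value, Mapping):
        paths = set(value.values())
        if len(paths) != 1:
            raise ValueError(
                "devices disagree on the checkpoint path: %s"
                % sorted(map(repr, paths))
            )
        return paths.pop()
    raise TypeError("unsupported checkpoint path type: %r" % type(value))


if __name__ == "__main__":
    # Same shape as the value printed in the error message.
    per_device = {
        "/replica:0/task:0/device:GPU:0": "/data/yangxiaohan/.eztransfer_modelzoo/bert/google-bert-base-zh/model.ckpt",
        "/replica:0/task:0/device:GPU:1": "/data/yangxiaohan/.eztransfer_modelzoo/bert/google-bert-base-zh/model.ckpt",
    }
    print(unwrap_checkpoint_path(per_device))
```

Whether the path should be unwrapped like this, or the checkpoint initialization moved so that it never receives a per-device value, is a decision for the library maintainers; the traceback alone does not settle it.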
