
Multi-GPU training is not working #9

Closed
dlazesz opened this issue Jun 24, 2020 · 6 comments · Fixed by #10
Comments

dlazesz commented Jun 24, 2020

In a multi-GPU environment (e.g. on lambda), the training stops with the following error:

Traceback (most recent call last):
  File "emBERT/scripts/train_embert.py", line 502, in <module>
    main()
  File "emBERT/scripts/train_embert.py", line 460, in main
    trainer.train()
  File "emBERT/scripts/train_embert.py", line 239, in train
    self.train_step(stats)
  File "emBERT/scripts/train_embert.py", line 260, in train_step
    label_ids, valid_ids, l_mask)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/emBERT/embert/model.py", line 24, in forward
    device=next(self.parameters()).device
StopIteration

self.parameters() seems to yield an empty iterator. The same setup runs flawlessly if only one GPU is used with CUDA_VISIBLE_DEVICES="1".
Did you manage to run it in such an environment? Do you have any idea what could be wrong and how to fix this error?
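
For reference, the traceback points at the line in embert/model.py that looks up the device via next(self.parameters()).device inside forward(). One common workaround for this pattern under DataParallel is to take the device from an input tensor instead; below is a minimal sketch with illustrative names (not emBERT's actual classes), not the fix adopted in the project:

import torch
import torch.nn as nn

class TokenClassifier(nn.Module):
    """Illustrative module reproducing the failing pattern."""
    def __init__(self, hidden=8, num_labels=3):
        super().__init__()
        self.linear = nn.Linear(hidden, num_labels)

    def forward(self, hidden_states):
        # Breaks under nn.DataParallel with torch 1.5.x: inside a replica,
        # self.parameters() yields nothing, so next() raises StopIteration.
        # device = next(self.parameters()).device

        # Safer: take the device from a tensor that is guaranteed to exist.
        device = hidden_states.device
        mask = torch.ones(hidden_states.size(0), dtype=torch.long, device=device)
        return self.linear(hidden_states), mask

model = TokenClassifier()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model.cuda())
logits, mask = model(torch.randn(4, 8))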

DavidNemeskey (Owner) commented Jun 25, 2020

I trained all my models on all 3 GPUs of lambda. How did you invoke the training script?

dlazesz (Author) commented Jun 25, 2020

The following Makefile contains the commands I used:

setup:
	rm -rf embert_venv/
	virtualenv -p python3 embert_venv
	cd embert_venv && git clone https://github.com/DavidNemeskey/emBERT.git  # Models not needed
	./embert_venv/bin/pip install wheel
	./embert_venv/bin/pip install -r embert_venv/emBERT/requirements.txt

train:
	cd embert_venv && PYTHONPATH=`pwd`/emBERT ./bin/python3 emBERT/scripts/train_embert.py --data_dir ../corpus --bert_model bert-base-multilingual-cased --task_name szeged_chunk --data_format tsv --output_dir out  --do_train

train-one-gpu:
	cd embert_venv && PYTHONPATH=`pwd`/emBERT CUDA_VISIBLE_DEVICES="1" ./bin/python3 emBERT/scripts/train_embert.py --data_dir ../corpus --bert_model bert-base-multilingual-cased --task_name szeged_chunk --data_format tsv --output_dir out  --do_train

I used the two commands above to set up and run the training. The ../corpus directory contains the corpus you supplied (train.txt, valid.txt, test.txt); the out directory is empty.

Did the underlying libraries change, or am I missing something?

Thank you in advance for your help!

DavidNemeskey (Owner) commented:

I tried the command you used (NOT the Makefile) and it works for me without a hitch, running on 3 GPUs. What environment do you use? On my side, I have:

  • python 3.7.4 (from miniconda)
  • torch 1.3.0
  • transformers 2.9.1

From the error, I would suspect the torch version first. Would you do me a favor and run the script by hand to see if the error manifests that way as well? I mean:

  • create a virtualenv
  • pip install -r requirements.txt
  • train_embert.py ...

If that doesn't work, would you install the whole package with pip install -e . instead of just the requirements, and see if it fixes the issue? Thanks, looking forward to your results.
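
As a quick way to compare the two environments before re-running, a small version check like the following (hypothetical snippet, no emBERT imports needed) prints the pieces most relevant to this error:

import sys
import torch
import transformers

print("python:", sys.version.split()[0])
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("visible GPUs:", torch.cuda.device_count())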

dlazesz (Author) commented Jun 25, 2020

Pinning torch to 1.3.0-1.4.0 yields the following warning, but the training starts:

/home/dlazesz/bert_szeged_maxnp/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all 

With torch 1.5.0 as well as 1.5.1, the training stops prematurely at the beginning with the aforementioned StopIteration exception.

I cannot judge whether the warning above is serious, or whether the StopIteration could be fixed in emBERT itself.
The easiest solution (for the exception) would be to pin all versions in requirements.txt. (This would leave the warning untouched.)
Feel free to fix the issue in your own way! BTW, I really like this piece of software. :)

PS: In any case, my environment is:

Python 3.6.9 (system, virtualenv)

certifi==2020.6.20
chardet==3.0.4
click==7.1.2
dataclasses==0.7
Deprecated==1.2.10
filelock==3.0.12
future==0.18.2
idna==2.9
joblib==0.15.1
numpy==1.19.0
packaging==20.4
pkg-resources==0.0.0
progressbar==2.5
PyGithub==1.51
PyJWT==1.7.1
pyparsing==2.4.7
PyYAML==5.3.1
regex==2020.6.8
requests==2.24.0
sacremoses==0.0.43
sentencepiece==0.1.91
seqeval==0.0.5
six==1.15.0
tokenizers==0.7.0
torch==1.4.0
tqdm==4.46.1
transformers==2.11.0
urllib3==1.25.9
wrapt==1.12.1

dlazesz (Author) commented Jun 25, 2020

The error is a known torch issue: pytorch/pytorch#40457
I am still not sure about the warning, though.

DavidNemeskey (Owner) commented:

@dlazesz Thanks for investigating the issue. I am locking torch < 1.5 in setup.py and requirements.txt.

As for the warning, I think it's nothing to worry about. CrossEntropyLoss returns a scalar, and the Gather function raises a warning in this case for some reason. But it still handles the data correctly.
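
To illustrate what the warning describes (illustrative names, not emBERT's code): each DataParallel replica returns a 0-dimensional loss, Gather unsqueezes those scalars into a vector, and reducing that vector afterwards gives back a single scalar to backpropagate.

import torch
import torch.nn as nn

class LossModule(nn.Module):
    """Computes CrossEntropyLoss inside forward(), so each replica returns a scalar."""
    def __init__(self, hidden=8, num_labels=3):
        super().__init__()
        self.linear = nn.Linear(hidden, num_labels)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x, y):
        return self.loss_fn(self.linear(x), y)  # 0-dim tensor per replica

model = LossModule()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model.cuda())  # gathering the scalar losses triggers the warning

x, y = torch.randn(16, 8), torch.randint(0, 3, (16,))
loss = model(x, y).mean()  # collapse the gathered per-GPU vector back to a scalar
loss.backward()

Taking .mean() of the gathered losses is the usual way to consume a DataParallel loss; on a single device it is a no-op on an already scalar value.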
