
Multi-GPU training is not working #9

Closed
dlazesz opened this issue Jun 24, 2020 · 6 comments · Fixed by #10
Comments

dlazesz commented Jun 24, 2020

In a multi-GPU environment (e.g. on lambda), the training stops with the following error:

Traceback (most recent call last):
  File "emBERT/scripts/train_embert.py", line 502, in <module>
    main()
  File "emBERT/scripts/train_embert.py", line 460, in main
    trainer.train()
  File "emBERT/scripts/train_embert.py", line 239, in train
    self.train_step(stats)
  File "emBERT/scripts/train_embert.py", line 260, in train_step
    label_ids, valid_ids, l_mask)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dlazesz/bert_szeged_maxNP/embert_venv/emBERT/embert/model.py", line 24, in forward
    device=next(self.parameters()).device
StopIteration

self.parameters() seems to yield an empty iterator. The same setup runs flawlessly if only one GPU is used with CUDA_VISIBLE_DEVICES="1".
Did you manage to run it in such an environment? Do you have any idea what could be wrong and how to fix this error?
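
For reference, the traceback points at the line in embert/model.py that looks up the device via next(self.parameters()).device inside forward(). One common workaround for this pattern under DataParallel is to take the device from an input tensor instead; below is a minimal sketch with illustrative names (not emBERT's actual classes), not the fix adopted in the project:

import torch
import torch.nn as nn

class TokenClassifier(nn.Module):
    """Illustrative module reproducing the failing pattern."""
    def __init__(self, hidden=8, num_labels=3):
        super().__init__()
        self.linear = nn.Linear(hidden, num_labels)

    def forward(self, hidden_states):
        # Breaks under nn.DataParallel with torch 1.5.x: inside a replica,
        # self.parameters() yields nothing, so next() raises StopIteration.
        # device = next(self.parameters()).device

        # Safer: take the device from a tensor that is guaranteed to exist.
        device = hidden_states.device
        mask = torch.ones(hidden_states.size(0), dtype=torch.long, device=device)
        return self.linear(hidden_states), mask

model = TokenClassifier()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model.cuda())
logits, mask = model(torch.randn(4, 8))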

DavidNemeskey (Owner) commented Jun 25, 2020

I trained all my models on all 3 GPUs of lambda. How did you invoke the training script?

dlazesz (Author) commented Jun 25, 2020

The following Makefile contains the commands I used:

setup:
	rm -rf embert_venv/
	virtualenv -p python3 embert_venv
	cd embert_venv && git clone https://github.com/DavidNemeskey/emBERT.git  # Models not needed
	./embert_venv/bin/pip install wheel
	./embert_venv/bin/pip install -r embert_venv/emBERT/requirements.txt

train:
	cd embert_venv && PYTHONPATH=`pwd`/emBERT ./bin/python3 emBERT/scripts/train_embert.py --data_dir ../corpus --bert_model bert-base-multilingual-cased --task_name szeged_chunk --data_format tsv --output_dir out  --do_train

train-one-gpu:
	cd embert_venv && PYTHONPATH=`pwd`/emBERT CUDA_VISIBLE_DEVICES="1" ./bin/python3 emBERT/scripts/train_embert.py --data_dir ../corpus --bert_model bert-base-multilingual-cased --task_name szeged_chunk --data_format tsv --output_dir out  --do_train

I used the two commands above to set up and run the training. The ../corpus directory contains the corpus you supplied (train.txt, valid.txt, test.txt); the out directory is empty.

Did the underlying libraries change, or am I missing something?

Thank you in advance for your help!

DavidNemeskey (Owner) commented:

I tried the command you used (NOT the Makefile) and it works for me without a hitch, running on 3 GPUs. What environment do you use? On my side, I have:

  • python 3.7.4 (from miniconda)
  • torch 1.3.0
  • transformers 2.9.1

From the error, I would suspect the torch version first. Would you do me a favor and run the script by hand to see if the error manifests that way as well? I mean:

  • create a virtualenv
  • pip install -r requirements.txt
  • train_embert.py ...

If that doesn't work, would you install the whole package with pip install -e . instead of just the requirements, and see if it fixes the issue? Thanks, looking forward to your results.
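
As a quick way to compare the two environments before re-running, a small version check like the following (hypothetical snippet, no emBERT imports needed) prints the pieces most relevant to this error:

import sys
import torch
import transformers

print("python:", sys.version.split()[0])
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("visible GPUs:", torch.cuda.device_count())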

dlazesz (Author) commented Jun 25, 2020

Pinning torch to 1.3.0-1.4.0 yields the following warning, but the training starts:

/home/dlazesz/bert_szeged_maxnp/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all 

With torch 1.5.0 as well as 1.5.1, the training stops prematurely at the beginning with the aforementioned StopIteration exception.

I cannot judge whether the warning above is serious, or whether the StopIteration could be fixed in emBERT itself.
The easiest solution (for the exception) would be to pin all versions in requirements.txt. (This would leave the warning untouched.)
Feel free to fix the issue in your own way! BTW, I really like this piece of software. :)

PS: In any case, my environment is:

Python 3.6.9 (system, virtualenv)

certifi==2020.6.20
chardet==3.0.4
click==7.1.2
dataclasses==0.7
Deprecated==1.2.10
filelock==3.0.12
future==0.18.2
idna==2.9
joblib==0.15.1
numpy==1.19.0
packaging==20.4
pkg-resources==0.0.0
progressbar==2.5
PyGithub==1.51
PyJWT==1.7.1
pyparsing==2.4.7
PyYAML==5.3.1
regex==2020.6.8
requests==2.24.0
sacremoses==0.0.43
sentencepiece==0.1.91
seqeval==0.0.5
six==1.15.0
tokenizers==0.7.0
torch==1.4.0
tqdm==4.46.1
transformers==2.11.0
urllib3==1.25.9
wrapt==1.12.1

dlazesz (Author) commented Jun 25, 2020

The error is a known torch issue: pytorch/pytorch#40457
I am still not sure about the warning, though.

DavidNemeskey (Owner) commented:

@dlazesz Thanks for investigating the issue. I am locking torch < 1.5 in setup.py and requirements.txt.

As for the warning, I think it's nothing to worry about. CrossEntropyLoss returns a scalar, and the Gather function raises a warning in this case for some reason. But it still handles the data correctly.
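
To illustrate what the warning describes (illustrative names, not emBERT's code): each DataParallel replica returns a 0-dimensional loss, Gather unsqueezes those scalars into a vector, and reducing that vector afterwards gives back a single scalar to backpropagate.

import torch
import torch.nn as nn

class LossModule(nn.Module):
    """Computes CrossEntropyLoss inside forward(), so each replica returns a scalar."""
    def __init__(self, hidden=8, num_labels=3):
        super().__init__()
        self.linear = nn.Linear(hidden, num_labels)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x, y):
        return self.loss_fn(self.linear(x), y)  # 0-dim tensor per replica

model = LossModule()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model.cuda())  # gathering the scalar losses triggers the warning

x, y = torch.randn(16, 8), torch.randint(0, 3, (16,))
loss = model(x, y).mean()  # collapse the gathered per-GPU vector back to a scalar
loss.backward()

Taking .mean() of the gathered losses is the usual way to consume a DataParallel loss; on a single device it is a no-op on an already scalar value.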
