Multi-GPU training is not working #9
Comments
I trained all my models on all 3 GPUs of …
I use the following setup:
rm -rf embert_venv/
virtualenv -p python3 embert_venv
cd embert_venv && git clone https://github.com/DavidNemeskey/emBERT.git # Models not needed
./embert_venv/bin/pip install wheel
./embert_venv/bin/pip install -r embert_venv/emBERT/requirements.txt
train:
cd embert_venv && PYTHONPATH=`pwd`/emBERT ./bin/python3 emBERT/scripts/train_embert.py --data_dir ../corpus --bert_model bert-base-multilingual-cased --task_name szeged_chunk --data_format tsv --output_dir out --do_train
train-one-gpu:
cd embert_venv && PYTHONPATH=`pwd`/emBERT CUDA_VISIBLE_DEVICES="1" ./bin/python3 emBERT/scripts/train_embert.py --data_dir ../corpus --bert_model bert-base-multilingual-cased --task_name szeged_chunk --data_format tsv --output_dir out --do_train
I used the two commands above to set up and run the training. Did the underlying libraries change, or am I missing something? Thank you for your help in advance!
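For reference, a minimal sketch (not part of emBERT) to confirm what each of the two invocations actually sees; with CUDA_VISIBLE_DEVICES="1" torch should report a single device, while the multi-GPU target should report all of them:

# check_gpus.py -- illustrative only, not part of the emBERT repository.
# Run it with and without CUDA_VISIBLE_DEVICES="1" to confirm how many
# devices torch sees in each case.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("visible GPUs:", torch.cuda.device_count())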
I tried the command (NOT the Makefile) you have and it works for me without a hitch, running on 3 GPUs. What environment do you use? On my side, I have …
From the error, I would suspect the torch version first. Would you do me a favor and run the script by hand to see if the error manifests that way as well? I mean: …
If that doesn't work, would you install the whole package with …?
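A quick way to report the relevant versions for comparison (a minimal sketch, not emBERT code; whether the BERT layer comes from pytorch_transformers or transformers is an assumption that depends on the pinned requirements):

# report_env.py -- illustrative only; prints the versions most relevant here.
import torch

print("torch:", torch.__version__)
print("CUDA runtime torch was built with:", torch.version.cuda)

try:
    import pytorch_transformers  # assumption: the BERT library pulled in by requirements.txt
    print("pytorch_transformers:", pytorch_transformers.__version__)
except ImportError:
    import transformers
    print("transformers:", transformers.__version__)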
Pinning torch 1.3.0-1.4.0 yields the following warning, but the training starts:
/home/dlazesz/bert_szeged_maxnp/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all …
torch 1.5.0 as well as 1.5.1 prematurely stops at the beginning of the training with the aforementioned StopIteration exception. I cannot judge whether the warning above is serious, or whether the StopIteration could be fixed somehow in emBERT. PS. In any case, my environment is: …
The error is a known issue in torch: pytorch/pytorch#40457
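For readers landing here later, the failure pattern in that ticket boils down to next(self.parameters()) being called inside forward() while the module is a DataParallel replica: on torch 1.5.x the replica's parameter iterator is empty, so next() raises StopIteration. A minimal sketch (illustrative only, not emBERT or transformers code):

# stopiteration_repro.py -- minimal sketch of pytorch/pytorch#40457
# (illustrative only). With 2+ GPUs and torch 1.5.0/1.5.1 the forward()
# below raises StopIteration, because DataParallel replicas expose no
# parameters through .parameters(); with torch < 1.5 it runs fine.
import torch
import torch.nn as nn

class DtypeProbingModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 2)

    def forward(self, x):
        # Pattern used by older BERT implementations to find the model's
        # dtype; the iterator is empty on a torch 1.5 DataParallel replica.
        dtype = next(self.parameters()).dtype
        return self.linear(x.to(dtype)).sum()

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(DtypeProbingModel().cuda())
    loss = model(torch.randn(4, 8).cuda())  # StopIteration on torch 1.5.x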
@dlazesz Thanks for investigating the issue. I am locking torch < 1.5 in …
As for the warning, I think it's nothing to worry about. CrossEntropyLoss returns a scalar, and the …
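Presumably the cut-off sentence refers to DataParallel's gather step: each replica returns a 0-dimensional loss, DataParallel unsqueezes them into a vector of length n_gpu (hence the warning), and the training loop only has to average that vector before calling backward(). A minimal sketch of that behaviour (illustrative only, not emBERT code):

# gather_warning.py -- illustrative sketch of why the UserWarning is benign.
import torch
import torch.nn as nn

class LossReturningModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 3)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x, labels):
        return self.loss_fn(self.linear(x), labels)  # 0-dim tensor per replica

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(LossReturningModel().cuda())
    x = torch.randn(8, 8).cuda()
    labels = torch.randint(0, 3, (8,)).cuda()
    loss = model(x, labels)  # shape (n_gpu,); emits the gather UserWarning
    loss = loss.mean()       # reduce back to a scalar before backward()
    loss.backward()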
Original issue description:
In a multi-GPU environment (e.g. at Lambda) the training stops with the following error: …
self.parameters() seems to yield an empty iterator. The same setup runs flawlessly if only one GPU is used with CUDA_VISIBLE_DEVICES="1". Did you manage to run it in such an environment? Do you have any idea what could be wrong, and how to fix this error?
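For completeness, the usual way to sidestep the empty parameter iterator (a hedged workaround sketch; this is not necessarily how emBERT handles it) is to read the dtype off a concrete weight attribute, or off the input tensor, instead of iterating self.parameters() inside forward():

# workaround_sketch.py -- illustrative only; avoids next(self.parameters())
# inside forward(), which is the call that breaks under DataParallel on
# torch 1.5.x. Attribute access to a specific weight still works on replicas.
import torch
import torch.nn as nn

class ReplicaSafeModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 2)

    def forward(self, x):
        dtype = self.linear.weight.dtype  # instead of next(self.parameters()).dtype
        return self.linear(x.to(dtype)).sum()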