Closed
Labels: accelerator: tpu, bug, help wanted, priority: 0
Description
🐛 Bug
When I run trainer.test(model) on a pre-trained model using a Colab TPU instance, the following exception is thrown.
NB: trainer.fit(model) works.
Stack trace
Traceback (most recent call last):
File "run_pl_ged.py", line 217, in <module>
trainer.test(model)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 958, in test
self.fit(model)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 777, in fit
xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 182, in spawn
start_method=start_method)
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 116, in _start_fn
_setup_replication()
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 109, in _setup_replication
xm.set_replication(str(device), [str(device)])
File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 194, in set_replication
replication_devices = xla_replication_devices(devices)
File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 181, in xla_replication_devices
.format(len(local_devices), len(kind_devices)))
RuntimeError: Cannot replicate if number of devices (1) is different from 8
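For context, the failing check lives in torch_xla's replication setup: each spawned worker must see all eight TPU cores it is asked to replicate across. One known way to trigger this exact error (a sketch under that assumption, not taken from this repro) is initializing an XLA device in the parent process before calling xmp.spawn(..., nprocs=8) with the fork start method, which leaves each forked child with a single visible device:
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# Assumption: same failure mode as in this issue. Touching the TPU in the
# parent before forking workers leaves each child with one visible device,
# so 8-way replication fails during _setup_replication().
_ = xm.xla_device()  # initializes XLA in the parent process

def _mp_fn(index):
    pass  # never reached; replication setup raises first

xmp.spawn(_mp_fn, nprocs=8, start_method='fork')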
Code sample
import pytorch_lightning as pl
model = MyLightningModule()
trainer = pl.Trainer(num_tpu_cores=8)
model = model.load_from_checkpoint(checkpoint)
model.prepare_data() # See https://github.com/PyTorchLightning/pytorch-lightning/issues/1562
trainer.test(model)
Environment
Colab TPU instance with XLA 1.5
* CUDA:
- GPU:
- available: False
- version: None
* Packages:
- numpy: 1.18.3
- pyTorch_debug: False
- pyTorch_version: 1.5.0a0+ab660ae
- pytorch-lightning: 0.7.5
- tensorboard: 2.2.1
- tqdm: 4.38.0
* System:
- OS: Linux
- architecture: 64bit
- processor: x86_64
- python: 3.6.9
- version: #1 SMP Wed Feb 19 05:26:34 PST 2020
Possibly related: #1019
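A possible workaround in the meantime (a sketch, not verified against 0.7.5): run the test pass on a single TPU core, so no 8-way replication is needed.
# Hypothetical workaround: a fresh single-core Trainer sidesteps the
# 8-way replication that fails above. Unverified on 0.7.5.
test_trainer = pl.Trainer(num_tpu_cores=1)
test_trainer.test(model)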