
[TPU-Colab] RuntimeError: Cannot replicate if number of devices (1) is different from 8 #1703

Description

@simonepri

🐛 Bug

When I run trainer.test(model) on a pre-trained model using a Colab TPU instance, the following exception is thrown.

NB: trainer.fit(model) works.

Stack trace

Traceback (most recent call last):
  File "run_pl_ged.py", line 217, in <module>
    trainer.test(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 958, in test
    self.fit(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 777, in fit
    xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 182, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 116, in _start_fn
    _setup_replication()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 109, in _setup_replication
    xm.set_replication(str(device), [str(device)])
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 194, in set_replication
    replication_devices = xla_replication_devices(devices)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 181, in xla_replication_devices
    .format(len(local_devices), len(kind_devices)))
RuntimeError: Cannot replicate if number of devices (1) is different from 8

Code sample

import pytorch_lightning as pl

model = MyLightningModule()
trainer = pl.Trainer(num_tpu_cores=8)
model = model.load_from_checkpoint(checkpoint)
model.prepare_data() # See https://github.com/PyTorchLightning/pytorch-lightning/issues/1562
trainer.test(model)
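
As a possible workaround sketch (an assumption on my side, not verified on this exact setup): loading the checkpoint and running the test loop with a separate CPU-only Trainer should avoid the 8-core xmp.spawn replication that trainer.test(model) triggers on TPU.

import pytorch_lightning as pl

# Hypothetical workaround: evaluate the checkpoint on CPU with a single-process Trainer.
model = MyLightningModule.load_from_checkpoint(checkpoint)  # same checkpoint path as above
model.prepare_data()
cpu_trainer = pl.Trainer()  # no num_tpu_cores -> defaults to CPU, single process
cpu_trainer.test(model)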

Environment

Colab TPU instance with XLA 1.5

* CUDA:
	- GPU:
	- available:         False
	- version:           None
* Packages:
	- numpy:             1.18.3
	- pyTorch_debug:     False
	- pyTorch_version:   1.5.0a0+ab660ae
	- pytorch-lightning: 0.7.5
	- tensorboard:       2.2.1
	- tqdm:              4.38.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- 
	- processor:         x86_64
	- python:            3.6.9
	- version:           #1 SMP Wed Feb 19 05:26:34 PST 2020

Possibly related: #1019

Labels

accelerator: tpu, bug, help wanted, priority: 0
