DDP breaks LR finder #1831

Closed
s-rog opened this issue May 14, 2020 · 2 comments · Fixed by #2029
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments

s-rog (Contributor) commented May 14, 2020

🐛 Bug

DDP breaks LR finder

To Reproduce

finder = trainer.lr_find(model)
print(finder.suggestion())
Traceback (most recent call last):
  File "./training.py", line 107, in <module>
    main(hparam_trial)
  File "./training.py", line 97, in main
    finder = trainer.lr_find(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/lr_finder.py", line 153, in lr_find
    self.fit(model, train_dataloader=train_dataloader)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 751, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_LRFinder._get_new_optimizer.<locals>.configure_optimizers'

At first I thought it was because configure_optimizers returns [opt], [sched], but returning just opt still causes the error. Training works correctly with the same code.
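
For context (not part of the original report): mp.spawn has to pickle the model to send it to each DDP worker, and pickle cannot serialize a function defined inside another function, which is what _LRFinder._get_new_optimizer attaches to the model as configure_optimizers. Below is a minimal sketch of the same failure, independent of Lightning; all names are illustrative.

```python
import pickle


class DummyModel:
    pass


def patch_model(model):
    # Mimics what _LRFinder._get_new_optimizer does: attach a function that is
    # defined locally (inside another function) onto the model instance.
    def configure_optimizers():
        return None

    model.configure_optimizers = configure_optimizers
    return model


model = patch_model(DummyModel())

# DDP's mp.spawn pickles the model for each worker process; the same pickling
# fails here with:
# AttributeError: Can't pickle local object 'patch_model.<locals>.configure_optimizers'
try:
    pickle.dumps(model)
except AttributeError as err:
    print(err)
```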

s-rog added the bug (Something isn't working) and help wanted (Open to be worked on) labels on May 14, 2020
williamFalcon (Contributor) commented

@SkafteNicki

Alikerin commented
I also face a similar issue with the TensorBoard logger whenever the logger flag is left at its default, on both the GPU and TPU Colab runtimes. It throws the following exception on the TPU runtime:

Exception in device=TPU:0: dictionary update sequence element #0 has length 1; 2 is required
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 531, in tpu_train
    self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 980, in run_pretrain_routine
    self.train()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
    self.run_training_epoch()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 465, in run_training_epoch
    self.log_metrics(batch_step_metrics, grad_norm_dic)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/logging.py", line 74, in log_metrics
    self.logger.save()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py", line 10, in wrapped_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/loggers/tensorboard.py", line 161, in save
    save_hparams_to_yaml(hparams_file, self.hparams)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/core/saving.py", line 151, in save_hparams_to_yaml
    yaml.dump(hparams, fp)
  File "/usr/local/lib/python3.6/dist-packages/yaml/__init__.py", line 200, in dump
    return dump_all([data], stream, Dumper=Dumper, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/yaml/__init__.py", line 188, in dump_all
    dumper.represent(data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 26, in represent
    node = self.represent_data(data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 47, in represent_data
    node = self.yaml_representers[data_types[0]](self, data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 205, in represent_dict
    return self.represent_mapping('tag:yaml.org,2002:map', data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 116, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 51, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
ValueError: dictionary update sequence element #0 has length 1; 2 is required

Similarly, on the GPU runtime it throws an exception saying it can't pickle _thread.lock objects.
I resolved the issue by setting logger=False.
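
For context: the workaround above amounts to passing logger=False when constructing the Trainer, so the default TensorBoardLogger is never created and the save_hparams_to_yaml path is never reached in the spawned processes. A minimal sketch, assuming model is an existing LightningModule:

```python
from pytorch_lightning import Trainer

# Workaround mentioned above: disable the default TensorBoardLogger so the
# hparams-to-YAML save path is never hit in the spawned processes.
# `model` is assumed to be defined elsewhere.
trainer = Trainer(logger=False)
trainer.fit(model)
```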
