DDP breaks LR finder #1831

Closed
s-rog opened this issue May 14, 2020 · 2 comments · Fixed by #2029
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments

s-rog (Contributor) commented May 14, 2020

🐛 Bug

DDP breaks LR finder

To Reproduce

finder = trainer.lr_find(model)
print(finder.suggestion())
Traceback (most recent call last):
  File "./training.py", line 107, in <module>
    main(hparam_trial)
  File "./training.py", line 97, in main
    finder = trainer.lr_find(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/lr_finder.py", line 153, in lr_find
    self.fit(model, train_dataloader=train_dataloader)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 751, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_LRFinder._get_new_optimizer.<locals>.configure_optimizers'

At first I thought it was because configure_optimizers returns [opt], [sched], but returning just opt still causes the error. Training works correctly with the same code.
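
For context (not part of the original report): mp.spawn has to pickle the model to send it to each DDP worker, and pickle cannot serialize a function defined inside another function, which is what _LRFinder._get_new_optimizer attaches to the model as configure_optimizers. Below is a minimal sketch of the same failure, independent of Lightning; all names are illustrative.

```python
import pickle


class DummyModel:
    pass


def patch_model(model):
    # Mimics what _LRFinder._get_new_optimizer does: attach a function that is
    # defined locally (inside another function) onto the model instance.
    def configure_optimizers():
        return None

    model.configure_optimizers = configure_optimizers
    return model


model = patch_model(DummyModel())

# DDP's mp.spawn pickles the model for each worker process; the same pickling
# fails here with:
# AttributeError: Can't pickle local object 'patch_model.<locals>.configure_optimizers'
try:
    pickle.dumps(model)
except AttributeError as err:
    print(err)
```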

s-rog added the bug (Something isn't working) and help wanted (Open to be worked on) labels on May 14, 2020
williamFalcon (Contributor) commented

@SkafteNicki

Alikerin commented
I also face a similar issue with the TensorBoard logger whenever the logger flag is left at its default, on both the GPU and TPU Colab runtimes. It throws the following exception on the TPU runtime:

Exception in device=TPU:0: dictionary update sequence element #0 has length 1; 2 is required
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 531, in tpu_train
    self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 980, in run_pretrain_routine
    self.train()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
    self.run_training_epoch()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 465, in run_training_epoch
    self.log_metrics(batch_step_metrics, grad_norm_dic)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/logging.py", line 74, in log_metrics
    self.logger.save()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py", line 10, in wrapped_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/loggers/tensorboard.py", line 161, in save
    save_hparams_to_yaml(hparams_file, self.hparams)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/core/saving.py", line 151, in save_hparams_to_yaml
    yaml.dump(hparams, fp)
  File "/usr/local/lib/python3.6/dist-packages/yaml/__init__.py", line 200, in dump
    return dump_all([data], stream, Dumper=Dumper, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/yaml/__init__.py", line 188, in dump_all
    dumper.represent(data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 26, in represent
    node = self.represent_data(data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 47, in represent_data
    node = self.yaml_representers[data_types[0]](self, data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 205, in represent_dict
    return self.represent_mapping('tag:yaml.org,2002:map', data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 116, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 51, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
ValueError: dictionary update sequence element #0 has length 1; 2 is required

Similarly, on the GPU runtime it throws an exception saying it can't pickle _thread.lock objects.
I resolved the issue by setting logger=False.
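
For context: the workaround above amounts to passing logger=False when constructing the Trainer, so the default TensorBoardLogger is never created and the save_hparams_to_yaml path is never reached in the spawned processes. A minimal sketch, assuming model is an existing LightningModule:

```python
from pytorch_lightning import Trainer

# Workaround mentioned above: disable the default TensorBoardLogger so the
# hparams-to-YAML save path is never hit in the spawned processes.
# `model` is assumed to be defined elsewhere.
trainer = Trainer(logger=False)
trainer.fit(model)
```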
