
slow tpu train #2033

Merged

Conversation

lezwon
Contributor

@lezwon lezwon commented May 31, 2020

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #2016.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team May 31, 2020 19:26
@Borda Borda changed the title Bugfix/2016 slow tpu train slow tpu train May 31, 2020
@Borda Borda added the bug Something isn't working label May 31, 2020
@Borda Borda added this to the 0.8.0 milestone May 31, 2020
@mergify mergify bot requested a review from a team May 31, 2020 20:33
@rohitgr7
Contributor

Working fine on a single tpu_core and a specific tpu_core, but now getting an error RuntimeError: Cannot replicate if number of devices (1) is different from 8 when training on all 8 cores (tpu_cores=8). This might be the reason.

@williamFalcon
Contributor

Once this is merged:
#2029

let's get rid of .spawn and replace it with the same method.

@lezwon
Contributor Author

lezwon commented Jun 1, 2020

Working fine on a single tpu_core and a specific tpu_core, but now getting an error RuntimeError: Cannot replicate if number of devices (1) is different from 8 when training on all 8 cores (tpu_cores=8). This might be the reason.

@rohitgr7 This usually happens when you have already selected a core, i.e. tpu_cores=[1], and later try to train on all cores with tpu_cores=8. If you select tpu_cores=8 on the first attempt, it should work fine. :]
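
For reference, a minimal sketch of the sequence being described, assuming a fresh TPU runtime and this version's Trainer API (the single-core-then-eight-cores order is only shown in the comments, not executed):

# Hypothetical sketch, not code from this PR: once a single core has been
# selected in a process, e.g. Trainer(tpu_cores=[1]), a later run with
# tpu_cores=8 in the same session fails during fit() with
# "Cannot replicate if number of devices (1) is different from 8".
from pytorch_lightning import Trainer

trainer = Trainer(tpu_cores=8)  # select all 8 cores on the first attempt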

@mergify mergify bot requested a review from a team June 1, 2020 05:59
@Borda Borda added priority: 0 High priority task ready PRs ready to be merged and removed priority: 0 High priority task labels Jun 1, 2020
@Borda
Member

Borda commented Jun 1, 2020

@lezwon mind adding a note to the changelog?

-        if self.use_tpu and self.tpu_id is None:
-            device = xm.xla_device()
+        if self.use_tpu:
+            device = xm.xla_device(self.tpu_id) if self.tpu_id is not None else xm.xla_device()
Member

According to the docs, the None case does not need to be handled explicitly:
https://pytorch.org/xla/release/1.5/index.html#torch_xla.core.xla_model.xla_device
device = xm.xla_device(self.tpu_id) is fine even if self.tpu_id = None
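
As a sanity check, a minimal sketch of the simplified selection under that reading of the docs (assumes torch_xla is installed; the helper name is made up for illustration):

import torch_xla.core.xla_model as xm

def resolve_tpu_device(tpu_id=None):
    # Per the linked docs, xm.xla_device(None) returns the default XLA device,
    # so no separate None branch is needed.
    return xm.xla_device(tpu_id)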

Contributor Author

@awaelchli Yea, it does work without it too. We could refactor these instances across the project.

Member

I have refactored it here: #1756, so if you do it here too, then I think that's pretty much all of them.

Member

Does it also work for older versions... =)

Contributor Author

@Borda Do you mean older versions of Lightning? I've noticed xm.xla_device(None) working earlier too. The docs @awaelchli pointed out, though, confirm that it should work well.

Member

I was thinking of older versions of XLA, but there is no reason why it should not work :]

@mergify mergify bot requested a review from a team June 1, 2020 06:11
@mergify
Contributor

mergify bot commented Jun 1, 2020

This pull request is now in conflict... :(

@lezwon lezwon force-pushed the bugfix/2016_slow_tpu_train branch from ce7f0b9 to c09f2ec Compare June 1, 2020 06:30
@codecov

codecov bot commented Jun 1, 2020

Codecov Report

Merging #2033 into master will not change coverage.
The diff coverage is 50%.

@@          Coverage Diff           @@
##           master   #2033   +/-   ##
======================================
  Coverage      88%     88%           
======================================
  Files          74      74           
  Lines        4666    4666           
======================================
  Hits         4084    4084           
  Misses        582     582           

@rohitgr7
Contributor

rohitgr7 commented Jun 1, 2020

@lezwon I ran directly with 8 cores and that error occurred. I see you have made some changes in the PR. I will try again with the updated version tonight and let you know.

@williamFalcon williamFalcon removed the ready PRs ready to be merged label Jun 1, 2020
@rohitgr7
Contributor

rohitgr7 commented Jun 1, 2020

Getting this error now in all cases, even when checkpoint_callback=False is set.

KeyError                                  Traceback (most recent call last)
<ipython-input-17-31113c96baf3> in <module>
      1 # Train on specific tpu_core
----> 2 train_validate(0, [7])

<ipython-input-16-7fc43d3b07cf> in train_validate(fold_i, tpu_cores)
     21 #         fast_dev_run=True
     22     )
---> 23     trainer.fit(model, train_dl, valid_dl)
     24     torch.cuda.empty_cache()

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders)
    910 
    911             # load weights if not interrupted
--> 912             self.load_spawn_weights(model)
    913             self.model = model
    914 

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py in load_spawn_weights(self, original_model)
    415             # load weights saved in ddp
    416             path = os.path.join(self.default_root_dir, '__temp_weight_ddp_end.ckpt')
--> 417             loaded_model = original_model.__class__.load_from_checkpoint(path)
    418 
    419             # copy loaded weights to old model

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py in load_from_checkpoint(cls, checkpoint_path, map_location, hparams_file, tags_csv, *args, **kwargs)
   1561 
   1562         # override the module_arguments with values that were passed in
-> 1563         checkpoint[CHECKPOINT_KEY_MODULE_ARGS].update(kwargs)
   1564 
   1565         model = cls._load_model_state(checkpoint, *args, **kwargs)

KeyError: 'module_arguments'

@awaelchli
Member

Checkpoints trained with an old version (before the hparams PR) are not compatible with the latest change to how hparams are stored in checkpoints. This is a pending issue that needs to be resolved.
You would have to retrain the model, or go and manually modify the checkpoint to set the module_arguments key.
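
For anyone who prefers patching over retraining, a rough sketch of the manual fix mentioned above (the checkpoint path and the empty-dict default are assumptions, not something from this PR):

import torch

ckpt_path = "old_model.ckpt"  # hypothetical pre-hparams-PR checkpoint
checkpoint = torch.load(ckpt_path, map_location="cpu")
# load_from_checkpoint expects a 'module_arguments' entry; old checkpoints
# lack it, so add one before reloading.
checkpoint.setdefault("module_arguments", {})
torch.save(checkpoint, ckpt_path)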

@rohitgr7
Contributor

rohitgr7 commented Jun 1, 2020

I retrained the model from scratch. Even when setting checkpoint_callback=False, I am getting this error. Just to be sure I am installing this PR using !pip install -U git+https://github.com/lezwon/pytorch-lightning.git@bugfix/2016_slow_tpu_train. This is the correct way, right?

@pvnieo

pvnieo commented Jun 1, 2020

Hi @rohitgr7 @lezwon @awaelchli
I updated my pytorch-lightning package to the version available yesterday on master, but I just noticed that the hparams.yaml file is now empty, contrary to the previous default behavior. Also, when I wanted to test my model, I got the following error:

File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1563, in load_from_checkpoint
    checkpoint[CHECKPOINT_KEY_MODULE_ARGS].update(kwargs)
KeyError: 'module_arguments'

Is this issue solved yet?

@williamFalcon
Contributor

@Borda FYI

We're working on fixing the hparams thing.

CHANGELOG.md (review thread, outdated, resolved)
@Borda Borda added the ready PRs ready to be merged label Jun 2, 2020
@mergify mergify bot requested a review from a team June 2, 2020 14:10
@williamFalcon williamFalcon merged commit 943c4b2 into Lightning-AI:master Jun 2, 2020
@pvnieo

pvnieo commented Jun 2, 2020

Hi @Borda @williamFalcon
Are the issue of hparams not being saved and the KeyError: 'module_arguments' issue that arises when testing now solved on master?

@rohitgr7 rohitgr7 mentioned this pull request Jun 7, 2020
justusschock pushed a commit that referenced this pull request Jun 29, 2020
* use parallel loader

* Revert "use parallel loader"

This reverts commit ed6e758

* select tpu id for pl

* condition if tpu_id is None

* added info to changelog

* Revert "condition if tpu_id is None"

This reverts commit 1fb6e58

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Labels
bug (Something isn't working), ready (PRs ready to be merged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

specifying the tpu_core speed-up TPU training
6 participants