
slow tpu train #2033

Merged

Conversation

lezwon
Contributor

@lezwon lezwon commented May 31, 2020

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #2016.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team May 31, 2020 19:26
@Borda Borda changed the title Bugfix/2016 slow tpu train slow tpu train May 31, 2020
@Borda Borda added the bug Something isn't working label May 31, 2020
@Borda Borda added this to the 0.8.0 milestone May 31, 2020
@mergify mergify bot requested a review from a team May 31, 2020 20:33
@rohitgr7
Contributor

Working fine on a single tpu_core and a specific tpu_core, but now getting an error RuntimeError: Cannot replicate if number of devices (1) is different from 8 when training on all 8 cores (tpu_cores=8). This might be the reason.

@williamFalcon
Contributor

Once this is merged:
#2029

let's get rid of .spawn and replace it with the same method.

@lezwon
Contributor Author

lezwon commented Jun 1, 2020

Working fine on a single tpu_core and a specific tpu_core, but now getting an error RuntimeError: Cannot replicate if number of devices (1) is different from 8 when training on all 8 cores (tpu_cores=8). This might be the reason.

@rohitgr7 This usually happens when you have already selected a core, i.e. tpu_cores=[1], and later try to train on all cores with tpu_cores=8. If you select tpu_cores=8 on the first attempt, it should work fine. :]
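
For reference, a minimal sketch of the sequence being described, assuming a fresh TPU runtime and this version's Trainer API (the single-core-then-eight-cores order is only shown in the comments, not executed):

# Hypothetical sketch, not code from this PR: once a single core has been
# selected in a process, e.g. Trainer(tpu_cores=[1]), a later run with
# tpu_cores=8 in the same session fails during fit() with
# "Cannot replicate if number of devices (1) is different from 8".
from pytorch_lightning import Trainer

trainer = Trainer(tpu_cores=8)  # select all 8 cores on the first attempt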

@mergify mergify bot requested a review from a team June 1, 2020 05:59
@Borda Borda added priority: 0 High priority task ready PRs ready to be merged and removed priority: 0 High priority task labels Jun 1, 2020
@Borda
Member

Borda commented Jun 1, 2020

@lezwon mind adding a note to the changelog?

-        if self.use_tpu and self.tpu_id is None:
-            device = xm.xla_device()
+        if self.use_tpu:
+            device = xm.xla_device(self.tpu_id) if self.tpu_id is not None else xm.xla_device()
Member

According to the docs, the None case does not need to be handled explicitly:
https://pytorch.org/xla/release/1.5/index.html#torch_xla.core.xla_model.xla_device
device = xm.xla_device(self.tpu_id) is fine even if self.tpu_id = None
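
As a sanity check, a minimal sketch of the simplified selection under that reading of the docs (assumes torch_xla is installed; the helper name is made up for illustration):

import torch_xla.core.xla_model as xm

def resolve_tpu_device(tpu_id=None):
    # Per the linked docs, xm.xla_device(None) returns the default XLA device,
    # so no separate None branch is needed.
    return xm.xla_device(tpu_id)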

Contributor Author

@awaelchli Yea, it does work without it too. We could refactor these instances across the project.

Member

I have refactored it here: #1756, so if you do it here too, then I think that's pretty much all of them.

Member

Does it also work for older versions... =)

Contributor Author

@Borda Do you mean older versions of Lightning? I've noticed xm.xla_device(None) working earlier too. The docs @awaelchli pointed out, though, confirm that it should work well.

Member

I was thinking of older versions of XLA, but there is no reason why it should not work :]

@mergify mergify bot requested a review from a team June 1, 2020 06:11
@mergify
Contributor

mergify bot commented Jun 1, 2020

This pull request is now in conflict... :(

@lezwon lezwon force-pushed the bugfix/2016_slow_tpu_train branch from ce7f0b9 to c09f2ec Compare June 1, 2020 06:30
@codecov

codecov bot commented Jun 1, 2020

Codecov Report

Merging #2033 into master will not change coverage.
The diff coverage is 50%.

@@          Coverage Diff           @@
##           master   #2033   +/-   ##
======================================
  Coverage      88%     88%           
======================================
  Files          74      74           
  Lines        4666    4666           
======================================
  Hits         4084    4084           
  Misses        582     582           

@rohitgr7
Contributor

rohitgr7 commented Jun 1, 2020

@lezwon I ran directly with 8 cores and that error occurred. I see you have made some changes in the PR. I will try again with the updated version tonight and let you know.

@williamFalcon williamFalcon removed the ready PRs ready to be merged label Jun 1, 2020
@rohitgr7
Contributor

rohitgr7 commented Jun 1, 2020

Getting this error now in all cases, even when checkpoint_callback=False is set.

KeyError                                  Traceback (most recent call last)
<ipython-input-17-31113c96baf3> in <module>
      1 # Train on specific tpu_core
----> 2 train_validate(0, [7])

<ipython-input-16-7fc43d3b07cf> in train_validate(fold_i, tpu_cores)
     21 #         fast_dev_run=True
     22     )
---> 23     trainer.fit(model, train_dl, valid_dl)
     24     torch.cuda.empty_cache()

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders)
    910 
    911             # load weights if not interrupted
--> 912             self.load_spawn_weights(model)
    913             self.model = model
    914 

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py in load_spawn_weights(self, original_model)
    415             # load weights saved in ddp
    416             path = os.path.join(self.default_root_dir, '__temp_weight_ddp_end.ckpt')
--> 417             loaded_model = original_model.__class__.load_from_checkpoint(path)
    418 
    419             # copy loaded weights to old model

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py in load_from_checkpoint(cls, checkpoint_path, map_location, hparams_file, tags_csv, *args, **kwargs)
   1561 
   1562         # override the module_arguments with values that were passed in
-> 1563         checkpoint[CHECKPOINT_KEY_MODULE_ARGS].update(kwargs)
   1564 
   1565         model = cls._load_model_state(checkpoint, *args, **kwargs)

KeyError: 'module_arguments'

@awaelchli
Member

Checkpoints trained with an old version (before the hparams PR) are not compatible with the latest change to how hparams are stored in checkpoints. This is a pending issue that needs to be resolved.
You would have to retrain the model, or go and manually modify the checkpoint to set the module_arguments key.
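
For anyone who prefers patching over retraining, a rough sketch of the manual fix mentioned above (the checkpoint path and the empty-dict default are assumptions, not something from this PR):

import torch

ckpt_path = "old_model.ckpt"  # hypothetical pre-hparams-PR checkpoint
checkpoint = torch.load(ckpt_path, map_location="cpu")
# load_from_checkpoint expects a 'module_arguments' entry; old checkpoints
# lack it, so add one before reloading.
checkpoint.setdefault("module_arguments", {})
torch.save(checkpoint, ckpt_path)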

@rohitgr7
Contributor

rohitgr7 commented Jun 1, 2020

I retrained the model from scratch. Even when setting checkpoint_callback=False, I am getting this error. Just to be sure I am installing this PR using !pip install -U git+https://github.com/lezwon/pytorch-lightning.git@bugfix/2016_slow_tpu_train. This is the correct way, right?

@pvnieo

pvnieo commented Jun 1, 2020

Hi @rohitgr7 @lezwon @awaelchli
I updated my pytorch-lightning package to the version available yesterday on master, but I just noticed that the hparams.yaml file is now empty, contrary to the previous default behavior. Also, when I wanted to test my model, I got the following error:

File "/home/.local/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1563, in load_from_checkpoint
    checkpoint[CHECKPOINT_KEY_MODULE_ARGS].update(kwargs)
KeyError: 'module_arguments'

Is this issue solved yet?

@williamFalcon
Contributor

@Borda FYI

We're working on fixing the hparams thing.

CHANGELOG.md (review thread, outdated, resolved)
@Borda Borda added the ready PRs ready to be merged label Jun 2, 2020
@mergify mergify bot requested a review from a team June 2, 2020 14:10
@williamFalcon williamFalcon merged commit 943c4b2 into Lightning-AI:master Jun 2, 2020
@pvnieo

pvnieo commented Jun 2, 2020

Hi @Borda @williamFalcon
Are the issue of hparams not being saved and the KeyError: 'module_arguments' issue that arises when testing now solved on master?

@rohitgr7 rohitgr7 mentioned this pull request Jun 7, 2020
justusschock pushed a commit that referenced this pull request Jun 29, 2020
* use parallel loader

* Revert "use parallel loader"

This reverts commit ed6e758

* select tpu id for pl

* condition if tpu_id is None

* added info to changelog

* Revert "condition if tpu_id is None"

This reverts commit 1fb6e58

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Labels
bug (Something isn't working), ready (PRs ready to be merged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

specifying the tpu_core speed-up TPU training
6 participants