
TPU hangs when using only a train loop (i.e., no val loop) #2498

Closed
williamFalcon opened this issue Jul 4, 2020 · 3 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on)

williamFalcon (Contributor) commented on Jul 4, 2020

I think it's somehow related to checkpointing.

The easiest way to debug this is to get on Colab.
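For reference, a minimal sketch of the setup that reportedly hangs: a LightningModule defining only a training loop, with no validation_step or val_dataloader. The model, data, and hyperparameters below are hypothetical, not taken from the issue:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TrainOnlyModel(pl.LightningModule):
    """Defines only the training loop: no validation_step, no val_dataloader."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.layer(x), y)
        return {"loss": loss}  # dict-return style of the PL 0.8.x era

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        data = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
        return DataLoader(data, batch_size=8)


if __name__ == "__main__":
    # tpu_cores was the Trainer flag for TPU training at the time.
    trainer = pl.Trainer(tpu_cores=8, max_epochs=1)
    trainer.fit(TrainOnlyModel())
```

If checkpointing is indeed the culprit, passing checkpoint_callback=False to the Trainer of that era should make the hang disappear, which would be a quick way to confirm the suspicion.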

williamFalcon added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Jul 4, 2020
github-actions bot commented on Jul 4, 2020

Hi! Thanks for your contribution, great first issue!

awaelchli (Member) commented:

@williamFalcon I'm struggling to reproduce this. PL does not even recognize the TPUs in the runtime; XLA_AVAILABLE gets set to False (MNIST TPU colab). I tried different PyTorch and XLA versions, all with the same result.
Could you share the Colab in which you observed the hangs?
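For anyone hitting the same detection problem, a quick sanity check (a sketch using the standard torch_xla API; the printed values are illustrative) to confirm the runtime actually exposes a TPU before involving Lightning at all:

```python
import torch_xla.core.xla_model as xm

device = xm.xla_device()    # raises if no XLA device is present
print(device)               # e.g. "xla:1" on a TPU runtime
print(xm.xrt_world_size())  # number of TPU cores visible to this process
```

If xm.xla_device() raises here, the problem lies with the Colab runtime or the torch_xla install rather than with Lightning's train loop.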

awaelchli (Member) commented:

@williamFalcon I checked again, and now I am able to run the MNIST colab without validation; it no longer hangs (latest master). Not sure what fixed it.

awaelchli mentioned this issue on Jul 27, 2020