
specifying the tpu_core speed-up TPU training #2016

Closed
rohitgr7 opened this issue May 30, 2020 · 4 comments · Fixed by #2033
Labels
feature (Is an improvement or enhancement), help wanted (Open to be worked on)
Milestone
0.8.0
Comments

@rohitgr7
Contributor

🐛 Bug

  1. There is a huge time difference between training a model on a specific TPU core (tpu_cores=[1]) and training it on just one TPU core (tpu_cores=1). What is the reason for that? Aren't both configurations the same, except that the first assigns a specific TPU core while the second only specifies how many cores to use? Also, in the second case I am getting an error. With tpu_cores=[1] an epoch takes 17 seconds; with tpu_cores=1 it takes just 5 seconds (see the sketch below).
  2. Running on Colab gives me an error, but there is no error on Kaggle kernels. The time-difference issue is the same on both platforms.
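
For reference, a minimal sketch of the two configurations being compared, assuming a standard LightningModule; `MyModel` and `train_loader` are placeholders, not the actual code from the linked notebook:

```python
import pytorch_lightning as pl

# Placeholders for the model and dataloader from the notebook.
model = MyModel()

# Case 1: pin training to a specific TPU core (index 1) -> ~17 s/epoch reported
trainer = pl.Trainer(tpu_cores=[1], max_epochs=1)
trainer.fit(model, train_loader)

# Case 2: train on any single TPU core -> ~5 s/epoch reported, but errors on Colab
trainer = pl.Trainer(tpu_cores=1, max_epochs=1)
trainer.fit(model, train_loader)
```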

To Reproduce

Code sample

Colab Notebook

Expected behavior

As far as I know, the training time should be the same in both cases, regardless of whether training runs on a single core or on a specific core.

Environment

  • PyTorch Version (e.g., 1.0): 1.5.0
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.7
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: Tesla P100-PCIE-16GB
  • Any other relevant information:

Additional context

@rohitgr7 added the help wanted (Open to be worked on) label on May 30, 2020
@rohitgr7
Contributor Author

@williamFalcon @dlibenzi

@lezwon mentioned this issue on May 31, 2020
@lezwon
Contributor

lezwon commented May 31, 2020

@dlibenzi I recall that when training on a single core with ParallelLoader, I used to receive an error; hence the self.tpu_id is None condition. However, I rechecked and it seems to be working fine with ParallelLoader now. Made a PR for the same. :)
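
For context, a minimal sketch of the single-core ParallelLoader pattern referred to above; `train_loader` is a placeholder for an ordinary torch.utils.data.DataLoader, and this is not the exact Lightning internal code:

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl_loader

# Wrap a regular DataLoader in ParallelLoader even though only a single
# TPU core is used; this is the pattern the comment above refers to.
device = xm.xla_device()
para_loader = pl_loader.ParallelLoader(train_loader, [device])

for batch in para_loader.per_device_loader(device):
    # Batch tensors are already placed on the TPU device here.
    ...
```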

@Borda changed the title from "Training time on tpu is less when specifying the tpu_core" to "specifying the tpu_core speed-up TPU training" on Jun 2, 2020
@Borda added the feature (Is an improvement or enhancement) label on Jun 2, 2020
@Borda added this to the 0.8.0 milestone on Jun 2, 2020
@Borda reopened this on Jun 10, 2020
@Borda
Member

Borda commented Jun 16, 2020

not sure why this was reopened...

@Borda closed this as completed on Jun 16, 2020