
Progress bar doesn't show up on Kaggle TPU with num_workers greater than 0. #9814

Open
RahulBhalley opened this issue Oct 4, 2021 · 33 comments
Labels
accelerator: tpu, bug, help wanted

Comments

@RahulBhalley

RahulBhalley commented Oct 4, 2021

🐛 Bug

As the issue title says: the progress bar doesn't show up on Kaggle TPU with num_workers greater than 0.

Disclaimer: I haven't tested this program on Google Colab TPU.

To Reproduce

Set num_workers to any number greater than zero, up to the number of CPU cores. On Kaggle, the following code sets it to 4 via multiprocessing.cpu_count().

import multiprocessing

from torch.utils.data import DataLoader

train_dataset_loader = DataLoader(train_dataset,
                                  batch_size=BATCH_SIZE,
                                  shuffle=True,
                                  num_workers=multiprocessing.cpu_count(),  # 4 on Kaggle
                                  drop_last=True)
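
For context, a minimal sketch of the kind of Trainer setup assumed here (PL 1.4.x TPU API; MyModel and EPOCHS are placeholders, not taken from the original notebook):

# Hypothetical training setup; MyModel stands in for any LightningModule.
import pytorch_lightning as pl

model = MyModel()
trainer = pl.Trainer(tpu_cores=8, max_epochs=EPOCHS)
trainer.fit(model, train_dataset_loader)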

Training succeeds, but instead of showing a progress bar, the following output is shown:

2021-10-04 04:35:45.588313: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_hook.py:103: LightningDeprecationWarning: The signature of `Callback.on_train_epoch_end` has changed in v1.3. `outputs` parameter has been removed. Support for the old signature will be removed in v1.5
  "The signature of `Callback.on_train_epoch_end` has changed in v1.3."
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py:192: UserWarning: cleaning up tpu spawn environment...
  rank_zero_warn("cleaning up tpu spawn environment...")

Expected behavior

The progress bar should show up.

Environment

  • PyTorch Lightning Version: 1.4.9
  • PyTorch Version: 1.8.0a0+6e9f2c8
  • Python version: 3.7.10
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version: N/A
  • GPU models and configuration: N/A
  • How you installed PyTorch (conda, pip, source): pip
  • If compiling from source, the output of torch.__config__.show(): N/A
  • Any other relevant information: TPU on Kaggle

Additional context

N/A

cc @kaushikb11 @rohitgr7 @tchaton

@RahulBhalley added the bug and help wanted labels on Oct 4, 2021
@Programmer-RD-AI
Contributor

OK, I will check the issue and report back.

@Programmer-RD-AI
Contributor

Hi, can you please send me the "train_dataset" that you are using? Thank you.

@RahulBhalley
Author

Here is the data loader function:

import multiprocessing

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def get_dataset_loader():
    transform = transforms.Compose([
        transforms.Resize(IMAGE_SIZE),           # the shorter side is resized to match IMAGE_SIZE
        transforms.CenterCrop(IMAGE_SIZE),
        transforms.ToTensor(),                   # to tensor in [0, 1]
        transforms.Lambda(lambda x: x.mul(255))  # scale back to [0, 255]
    ])
    train_dataset = datasets.ImageFolder(DATASET, transform)
    train_loader = DataLoader(train_dataset,
                              batch_size=BATCH_SIZE,
                              shuffle=True,
                              num_workers=multiprocessing.cpu_count(),
                              drop_last=True)
    return train_loader

# Load train dataset
train_dataloader = get_dataset_loader()

Here, DATASET="../input/coco-2017/train2017/". This dataset can be found at this link.

@Programmer-RD-AI
Contributor

OK, thank you. I will check it.

@Programmer-RD-AI
Contributor

Hi, I also tried this train loader function with a GPU and a Kaggle TPU, but even when num_workers is 0 it doesn't show a progress bar. https://www.kaggle.com/ranugadisansagamage/notebook353e8aa184 is the Kaggle notebook I used. I also installed PyTorch 1.8. Are there any differences in this code?

@tchaton added the priority: 1 and accelerator: tpu labels on Oct 6, 2021
@RahulBhalley
Author

RahulBhalley commented Oct 6, 2021

I can't see the notebook; you haven't saved a version with execution. By the way, GPU execution is not a problem for me, even if I set num_workers to the maximum number of CPU cores, i.e. 4 on Kaggle.

@Programmer-RD-AI
Contributor

Hi, I updated the Kaggle notebook. Can you please check it and confirm that the code is correct? https://www.kaggle.com/ranugadisansagamage/notebook353e8aa184

@Programmer-RD-AI
Contributor

notebook353e8aa184.zip


@RahulBhalley
Author

RahulBhalley commented Oct 6, 2021

notebook353e8aa184.zip

There seems to be a training loop there.

My notebook's output when I set num_workers=4 is the following. It looks like all the RAM is consumed, whereas on GPU this is not an issue at all.

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py:192: UserWarning: cleaning up tpu spawn environment...
  rank_zero_warn("cleaning up tpu spawn environment...")

@Programmer-RD-AI
Contributor

OK, I will check this error.

@RahulBhalley
Author

Please try running this notebook on a TPU instance. Try incrementally changing the value of num_workers, e.g. with a sweep like the sketch below. I think above 2, RAM starts to get consumed completely, the TPU releases its resources, and the program gets canceled.
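
For instance, a hypothetical sweep over worker counts (run_one_epoch is a placeholder for one short training run, not a function from the notebook):

# Hypothetical sweep to find the num_workers value at which RAM is exhausted.
import multiprocessing

from torch.utils.data import DataLoader

for nw in range(multiprocessing.cpu_count() + 1):
    loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True,
                        num_workers=nw, drop_last=True)
    print(f"num_workers={nw}")
    run_one_epoch(loader)  # placeholder: watch RAM usage and whether the bar shows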

@Programmer-RD-AI
Contributor

OK, I will check it.

@Programmer-RD-AI
Contributor

Programmer-RD-AI commented Oct 7, 2021

Hi, I found some similar issues. Can you check them?
#1112
#207
#Example Repo

Regards

@RahulBhalley
Author

Hi

  1. Can't test #207. The code is incomplete.
  2. Modified the example notebook (new one) for proper installation of PL, XLA, and other packages. It shows the progress bar for NUM_WORKERS from 0 to 2 but not above, so capping it at 2 works around the problem (see the sketch after this list).
  3. Regarding #1112: I don't use Google Colab much.
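
A minimal sketch of that cap, assuming the 0-to-2 observation above holds in general:

# Hypothetical workaround: the progress bar shows for num_workers 0 to 2
# on Kaggle TPU, so cap the worker count at 2.
import multiprocessing

NUM_WORKERS = min(2, multiprocessing.cpu_count())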

I hope this information is helpful.

Regards,
Rahul Bhalley

@Programmer-RD-AI
Contributor

OK, I will check the notebook.

Regards

@Programmer-RD-AI
Contributor

The notebook doesn't exist; it's a 404 error.

@RahulBhalley
Author

The notebook doesn't exist; it's a 404 error.

I forgot to save the changes after setting the visibility to public. Please check it now.

@Programmer-RD-AI
Contributor

OK.

@RahulBhalley
Author

RahulBhalley commented Oct 8, 2021

@Programmer-RD-AI By the way, I am stuck on a problem where I need to implement torch.linalg.svd from scratch in PyTorch. Do you know how to do that? Or are you aware of any correct implementation in NumPy or PyTorch (from scratch)?

Sorry, this is not the right place for this question, but I feel like I've searched everywhere on the Internet and haven't found anything correct.

@Programmer-RD-AI
Contributor

OK, I will check if I can find anything.

Regards

@Programmer-RD-AI
Contributor

Hi, I have checked but I couldn't find anything, sorry. If I find any resource, I will add it to this issue.

Regards,
Ranuga

@RahulBhalley
Author

Thank you so much!

@Programmer-RD-AI
Contributor

@tchaton

Can you please help with this issue? Thank you.

Regards.

@RahulBhalley
Author

Hi, I have checked but I couldn't find anything, sorry. If I find any resource, I will add it to this issue.

Regards, Ranuga

By the way, please look no further if you are still trying. I have found some potentially helpful implementations (although not perfect, i.e. they don't give the same results in terms of signs, but I'll keep searching).
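
For reference, a minimal sketch of one such approach, building the SVD from torch.linalg.eigh of the Gram matrix (my rough version, not the exact implementation I found; it assumes a full-rank input with at least as many rows as columns):

import torch

def svd_from_scratch(A):
    # SVD of A (m x n, m >= n, full rank assumed) via the Gram matrix A^T A.
    eigvals, V = torch.linalg.eigh(A.T @ A)        # eigenvalues in ascending order
    idx = torch.argsort(eigvals, descending=True)  # SVD convention: descending
    eigvals, V = eigvals[idx], V[:, idx]
    S = eigvals.clamp(min=0).sqrt()                # singular values
    U = (A @ V) / S                                # u_i = A v_i / s_i
    return U, S, V

# Note: torch.linalg.svd returns Vh (= V transposed), and the columns of U and V
# here may differ in sign from it. Both factorizations are valid: U S V^T is
# unchanged by flipping the sign of a matching (u_i, v_i) pair.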

@Programmer-RD-AI
Contributor

OK. I am sorry I couldn't solve the problem. I will try to find a resource for the error.

Regards

@RahulBhalley
Author

Hi, I've gone through all those references, but it doesn't look like they relate to this issue. 🤔 This issue might have something to do with the library itself.

@Programmer-RD-AI
Contributor

@tchaton Can you please check this issue?

Thanks.

With best regards,
Ranuga

@Programmer-RD-AI
Contributor

Hi, I've gone through all those references, but it doesn't look like they relate to this issue. 🤔 This issue might have something to do with the library itself.

I am new to PyTorch Lightning, so I am not sure; maybe @tchaton can help.

With best regards,
Ranuga

@kaushikb11
Contributor

@RahulBhalley @Programmer-RD-AI I will take a stab at this with GCP TPUs.

@Programmer-RD-AI
Contributor

OK, thank you, @kaushikb11.

With best regards,
Ranuga

@RahulBhalley
Author

This issue doesn't really affect my work, but I'm just curious whether there's been any progress on resolving it.

@kaushikb11 self-assigned this on Nov 8, 2021
@tchaton added the priority: 0 label and removed the priority: 1 label on Nov 29, 2021
@Borda added the priority: 1 label and removed the priority: 0 label on Aug 8, 2022
@Borda self-assigned this on Nov 7, 2022
@carmocca removed the priority: 1 label on Jul 19, 2023