
Progress bar doesn't show up on Kaggle TPU with num_workers greater than 0. #9814

Open
RahulBhalley opened this issue Oct 4, 2021 · 33 comments
Labels
accelerator: tpu, bug, help wanted

Comments

@RahulBhalley

RahulBhalley commented Oct 4, 2021

🐛 Bug

As the issue title says: the progress bar doesn't show up on Kaggle TPU with num_workers greater than 0.

Disclaimer: I haven't tested this program on Google Colab TPU.

To Reproduce

Set num_workers to any number greater than zero, up to the number of CPU cores. On Kaggle, the following code sets it to 4 via multiprocessing.cpu_count().

import multiprocessing

from torch.utils.data import DataLoader

train_dataset_loader = DataLoader(train_dataset,
                                  batch_size=BATCH_SIZE,
                                  shuffle=True,
                                  num_workers=multiprocessing.cpu_count(),  # 4 on Kaggle
                                  drop_last=True)
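
For context, a minimal sketch of the kind of Trainer setup assumed here (PL 1.4.x TPU API; MyModel and EPOCHS are placeholders, not taken from the original notebook):

# Hypothetical training setup; MyModel stands in for any LightningModule.
import pytorch_lightning as pl

model = MyModel()
trainer = pl.Trainer(tpu_cores=8, max_epochs=EPOCHS)
trainer.fit(model, train_dataset_loader)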

Training succeeds, but instead of showing a progress bar, the following output is shown:

2021-10-04 04:35:45.588313: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_hook.py:103: LightningDeprecationWarning: The signature of `Callback.on_train_epoch_end` has changed in v1.3. `outputs` parameter has been removed. Support for the old signature will be removed in v1.5
  "The signature of `Callback.on_train_epoch_end` has changed in v1.3."
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py:192: UserWarning: cleaning up tpu spawn environment...
  rank_zero_warn("cleaning up tpu spawn environment...")

Expected behavior

The progress bar should show up.

Environment

  • PyTorch Lightning Version: 1.4.9
  • PyTorch Version: 1.8.0a0+6e9f2c8
  • Python version: 3.7.10
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version: N/A
  • GPU models and configuration: N/A
  • How you installed PyTorch (conda, pip, source): pip
  • If compiling from source, the output of torch.__config__.show(): N/A
  • Any other relevant information: TPU on Kaggle

Additional context

N/A

cc @kaushikb11 @rohitgr7 @tchaton

@RahulBhalley added the bug and help wanted labels on Oct 4, 2021
@Programmer-RD-AI
Contributor

OK, I will check the issue and report back.

@Programmer-RD-AI
Contributor

Hi, can you please send me the "train_dataset" that you are using? Thank you.

@RahulBhalley
Author

Here is the data loader function:

import multiprocessing

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def get_dataset_loader():
    transform = transforms.Compose([
        transforms.Resize(IMAGE_SIZE),           # the shorter side is resized to match IMAGE_SIZE
        transforms.CenterCrop(IMAGE_SIZE),
        transforms.ToTensor(),                   # to tensor in [0, 1]
        transforms.Lambda(lambda x: x.mul(255))  # scale back to [0, 255]
    ])
    train_dataset = datasets.ImageFolder(DATASET, transform)
    train_loader = DataLoader(train_dataset,
                              batch_size=BATCH_SIZE,
                              shuffle=True,
                              num_workers=multiprocessing.cpu_count(),
                              drop_last=True)
    return train_loader

# Load train dataset
train_dataloader = get_dataset_loader()

Here, DATASET="../input/coco-2017/train2017/". This dataset can be found at this link.

@Programmer-RD-AI
Contributor

OK, thank you. I will check it.

@Programmer-RD-AI
Contributor

Hi, I also tried this train loader function with a GPU and a Kaggle TPU, but even when num_workers is 0 it doesn't show a progress bar. https://www.kaggle.com/ranugadisansagamage/notebook353e8aa184 is the Kaggle notebook I used. I also installed PyTorch 1.8. Are there any differences in this code?

@tchaton added the priority: 1 and accelerator: tpu labels on Oct 6, 2021
@RahulBhalley
Author

RahulBhalley commented Oct 6, 2021

I can't see the notebook; you haven't saved a version with execution. By the way, GPU execution is not a problem for me, even if I set num_workers to the maximum number of CPU cores, i.e. 4 on Kaggle.

@Programmer-RD-AI
Contributor

Hi, I updated the Kaggle notebook. Can you please check it and confirm that the code is correct? https://www.kaggle.com/ranugadisansagamage/notebook353e8aa184

@Programmer-RD-AI
Contributor

notebook353e8aa184.zip


@RahulBhalley
Author

RahulBhalley commented Oct 6, 2021

notebook353e8aa184.zip

There seems to be a training loop there.

My notebook's output when I set num_workers=4 is the following. It looks like all the RAM is consumed, whereas on GPU this is not an issue at all.

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py:192: UserWarning: cleaning up tpu spawn environment...
  rank_zero_warn("cleaning up tpu spawn environment...")

@Programmer-RD-AI
Contributor

OK, I will check this error.

@RahulBhalley
Author

Please try running this notebook on a TPU instance. Try incrementally changing the value of num_workers, e.g. with a sweep like the sketch below. I think above 2, RAM starts to get consumed completely, the TPU releases its resources, and the program gets canceled.
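
For instance, a hypothetical sweep over worker counts (run_one_epoch is a placeholder for one short training run, not a function from the notebook):

# Hypothetical sweep to find the num_workers value at which RAM is exhausted.
import multiprocessing

from torch.utils.data import DataLoader

for nw in range(multiprocessing.cpu_count() + 1):
    loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True,
                        num_workers=nw, drop_last=True)
    print(f"num_workers={nw}")
    run_one_epoch(loader)  # placeholder: watch RAM usage and whether the bar shows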

@Programmer-RD-AI
Contributor

OK, I will check it.

@Programmer-RD-AI
Contributor

Programmer-RD-AI commented Oct 7, 2021

Hi, I found some similar issues. Can you check them?
#1112
#207
#Example Repo

Regards

@RahulBhalley
Author

Hi

  1. Can't test #207. The code is incomplete.
  2. Modified the example notebook (new one) for proper installation of PL, XLA, and other packages. It shows the progress bar for NUM_WORKERS from 0 to 2 but not above, so capping it at 2 works around the problem (see the sketch after this list).
  3. Regarding #1112: I don't use Google Colab much.
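
A minimal sketch of that cap, assuming the 0-to-2 observation above holds in general:

# Hypothetical workaround: the progress bar shows for num_workers 0 to 2
# on Kaggle TPU, so cap the worker count at 2.
import multiprocessing

NUM_WORKERS = min(2, multiprocessing.cpu_count())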

I hope this information is helpful.

Regards,
Rahul Bhalley

@Programmer-RD-AI
Contributor

OK, I will check the notebook.

Regards

@Programmer-RD-AI
Contributor

The notebook doesn't exist; it's a 404 error.

@RahulBhalley
Author

The notebook doesn't exist; it's a 404 error.

I forgot to save the changes after setting the visibility to public. Please check it now.

@Programmer-RD-AI
Contributor

OK.

@RahulBhalley
Author

RahulBhalley commented Oct 8, 2021

@Programmer-RD-AI By the way, I am stuck on a problem where I need to implement torch.linalg.svd from scratch in PyTorch. Do you know how to do that? Or are you aware of any correct implementation in NumPy or PyTorch (from scratch)?

Sorry, this is not the right place for this question, but I feel like I've searched everywhere on the Internet and haven't found anything correct.

@Programmer-RD-AI
Contributor

OK, I will check if I can find anything.

Regards

@Programmer-RD-AI
Contributor

Hi, I have checked but I couldn't find anything, sorry. If I find any resource, I will add it to this issue.

Regards,
Ranuga

@RahulBhalley
Author

Thank you so much!

@Programmer-RD-AI
Contributor

@tchaton

Can you please help with this issue? Thank you.

Regards.

@RahulBhalley
Author

Hi, I have checked but I couldn't find anything, sorry. If I find any resource, I will add it to this issue.

Regards, Ranuga

By the way, please look no further if you are still trying. I have found some potentially helpful implementations (although not perfect, i.e. they don't give the same results in terms of signs, but I'll keep searching).
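
For reference, a minimal sketch of one such approach, building the SVD from torch.linalg.eigh of the Gram matrix (my rough version, not the exact implementation I found; it assumes a full-rank input with at least as many rows as columns):

import torch

def svd_from_scratch(A):
    # SVD of A (m x n, m >= n, full rank assumed) via the Gram matrix A^T A.
    eigvals, V = torch.linalg.eigh(A.T @ A)        # eigenvalues in ascending order
    idx = torch.argsort(eigvals, descending=True)  # SVD convention: descending
    eigvals, V = eigvals[idx], V[:, idx]
    S = eigvals.clamp(min=0).sqrt()                # singular values
    U = (A @ V) / S                                # u_i = A v_i / s_i
    return U, S, V

# Note: torch.linalg.svd returns Vh (= V transposed), and the columns of U and V
# here may differ in sign from it. Both factorizations are valid: U S V^T is
# unchanged by flipping the sign of a matching (u_i, v_i) pair.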

@Programmer-RD-AI
Contributor

OK. I am sorry I couldn't solve the problem. I will try to find a resource for the error.

Regards

@RahulBhalley
Author

Hi, I've gone through all those references, but it doesn't look like they relate to this issue. 🤔 This issue might have something to do with the library itself.

@Programmer-RD-AI
Contributor

@tchaton Can you please check this issue?

Thanks.

With best regards,
Ranuga

@Programmer-RD-AI
Contributor

Hi, I've gone through all those references, but it doesn't look like they relate to this issue. 🤔 This issue might have something to do with the library itself.

I am new to PyTorch Lightning, so I am not sure; maybe @tchaton can help.

With best regards,
Ranuga

@kaushikb11
Contributor

@RahulBhalley @Programmer-RD-AI I will take a stab at this with GCP TPUs.

@Programmer-RD-AI
Contributor

OK, thank you, @kaushikb11.

With best regards,
Ranuga

@RahulBhalley
Author

This issue doesn't really affect my work, but I'm just curious whether there's been any progress on resolving it.

@kaushikb11 self-assigned this on Nov 8, 2021
@tchaton added the priority: 0 label and removed the priority: 1 label on Nov 29, 2021
@Borda added the priority: 1 label and removed the priority: 0 label on Aug 8, 2022
@Borda self-assigned this on Nov 7, 2022
@carmocca removed the priority: 1 label on Jul 19, 2023