Data shard deletion with multi GPU does not work #140

Open
rakro101 opened this issue May 24, 2024 · 4 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@rakro101
Contributor

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

Create a litdata dataset, then stream its shards (224×224×3 images plus some text) on multiple GPUs with a BERT + ResNet model, setting max_cache_size="6GB".

Added a studio to reproduce the issue.

Code sample

Added a studio to reproduce the error.

Additional context

@rakro101 added the bug and help wanted labels on May 24, 2024
@tchaton
Collaborator

tchaton commented May 24, 2024

From the logs, it seems that 4 processes are downloading the same chunks, but one of them deletes a chunk before the others have finished with it.

DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
Sanity Checking DataLoader 0:   0%|                                                                                                                                       | 0/2 [00:00<?, ?it/s]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-7.bin
Epoch 0:   1%|| 280/20000 [05:13<6:08:00,  0.89it/s, v_num=10, train/loss=2.240, train/acc=0.124, train/f1=0.0799, train/recall=0.124, train/precision=0.0627]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-5-8.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-5-8.bin
Epoch 0:   1%|| 281/20000 [05:14<6:07:58,  0.89it/s, v_num=10, train/loss=2.210, train/acc=0.209, train/f1=0.124, train/recall=0.209, train/precision=0.116]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-0-15.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-0-15.bin
Epoch 0:   1%|| 282/20000 [05:15<6:07:54,  0.89it/s, v_num=10, train/loss=2.190, train/acc=0.130, train/f1=0.107, train/recall=0.130, train/precision=0.0993]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-5-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-0-15.bin
Epoch 0:   1%|| 283/20000 [05:16<6:07:52,  0.89it/s, v_num=10, train/loss=2.220, train/acc=0.105, train/f1=0.103, train/recall=0.105, train/precision=0.206]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
Epoch 0:   1%|| 284/20000 [05:17<6:07:48,  0.89it/s, v_num=10, train/loss=2.250, train/acc=0.0921, train/f1=0.0709, train/recall=0.0921, train/precision=0.0658]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
Epoch 0:   1%|| 285/20000 [05:18<6:07:45,  0.89it/s, v_num=10, train/loss=2.170, train/acc=0.120, train/f1=0.099, train/recall=0.120, train/precision=0.0995]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
Epoch 0:   1%|| 286/20000 [05:20<6:07:42,  0.89it/s, v_num=10, train/loss=2.190, train/acc=0.102, train/f1=0.0932, train/recall=0.102, train/precision=0.0884]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-11-10.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-11-10.bin
Epoch 0:   1%|| 287/20000 [05:21<6:07:39,  0.89it/s, v_num=10, train/loss=2.210, train/acc=0.102, train/f1=0.0824, train/recall=0.102, train/precision=0.0897]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-17-21.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-17-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-11-10.bin
Epoch 0:   1%|| 289/20000 [05:23<6:07:33,  0.89it/s, v_num=10, train/loss=2.190, train/acc=0.140, train/f1=0.107, train/recall=0.140, train/precision=0.148]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-17-21.bin
Epoch 0:   2%|| 347/20000 [06:26<6:05:01,  0.90it/s, v_num=10, train/loss=2.240, train/acc=0.105, train/f1=0.0605, t
Epoch 0:   2%| | 348/20000 [06:27<6:04:58,  0.90it/s, v_num=10, train/loss=2.240, train/acc=0.105, train/f1=0.0605, train/recall=0.105, train/precisio
Epoch 0:   2%| | 360/20000 [06:40<6:04:32,  0.90it/s, v_num=10, train/loss=2.190, train/acc=0.138, train/f1=0.0883, train/recall=0.138, train/precisio
Traceback (most recent call last):
  File "/teamspace/studios/this_studio/train.py", line 107, in <module>
...
RuntimeError: Waiting too long for the /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin to be ready
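
One way to check the diagnosis above against a captured log is to count, per chunk, how many DOWNLOADING and DELETING lines appear. In the log shown here, chunk-24-0.bin has four DOWNLOADING lines and a single DELETING line before the RuntimeError. The helper below is only a sketch for reading such logs: the function names are hypothetical and not part of litdata, and it assumes the log uses exactly the `DOWNLOADING <path>` and `DELETING <path>` format seen above.

```python
import re
from collections import Counter

# Matches the two event types as they appear in the log above, even when a
# progress bar is fused onto the front of the line.
LOG_EVENT = re.compile(r"(DOWNLOADING|DELETING) (\S+\.bin)")


def count_chunk_events(log_text):
    """Return (downloads, deletions) Counters keyed by chunk file name."""
    downloads, deletions = Counter(), Counter()
    for action, path in LOG_EVENT.findall(log_text):
        name = path.rsplit("/", 1)[-1]  # keep only e.g. "chunk-24-0.bin"
        if action == "DOWNLOADING":
            downloads[name] += 1
        else:
            deletions[name] += 1
    return downloads, deletions


def shared_chunks_deleted(log_text):
    """List chunks that were requested more than once and also deleted."""
    downloads, deletions = count_chunk_events(log_text)
    return sorted(name for name in deletions if downloads[name] > 1)


if __name__ == "__main__":
    # Hypothetical file name: save the training output there first.
    with open("train.log") as handle:
        for chunk in shared_chunks_deleted(handle.read()):
            print(chunk)
```

A chunk listed by `shared_chunks_deleted` is a candidate for the race described above; whether the deletion actually happened too early still has to be judged from the order of the lines.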

@rakro101
Contributor Author

Comment: When using multiple GPUs, avoid creating your datasets in the __init__ method of the DataModule. (Support will be added in the future.)
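
The comment above does not say where the dataset should be created instead. A minimal sketch of one option, assuming the dataset can be built lazily in a per-process hook: with PyTorch Lightning that hook would typically be `setup()` of a `LightningDataModule`, which runs in every process. The class below is a framework-free stand-in to show the pattern; `LazyDatasetHolder` and `dataset_factory` are hypothetical names, not litdata or Lightning APIs.

```python
from typing import Any, Callable, Optional


class LazyDatasetHolder:
    """Keeps only configuration in __init__ and builds the dataset in setup()."""

    def __init__(self, dataset_factory: Callable[[], Any]) -> None:
        # Store how to build the dataset, but do not build it yet.
        self._dataset_factory = dataset_factory
        self.train_dataset: Optional[Any] = None

    def setup(self, stage: Optional[str] = None) -> None:
        # Called once per process; repeated calls reuse the existing dataset.
        if self.train_dataset is None:
            self.train_dataset = self._dataset_factory()
```

In a real DataModule the factory would be whatever constructs the streaming dataset with the desired max_cache_size; that wiring is left out here because the thread does not show it.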

@tchaton
Collaborator

tchaton commented May 26, 2024

Hey @rakro101, do you think you could contribute an example with PyTorch Lightning to the repo?

@deeptimhe

Hey @rakro101 do you think you could contribute an example with PyTorch Lightning to the repo ?

Looking forward to the examples!


3 participants