Data shard deletion with multi GPU does not work #140
Comments
From the logs, it seems that 4 processes are downloading each chunk, but one of them deletes it before the others have finished with it.
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-7.bin
Epoch 0: 1%|▍ | 280/20000 [05:13<6:08:00, 0.89it/s, v_num=10, train/loss=2.240, train/acc=0.124, train/f1=0.0799, train/recall=0.124, train/precision=0.0627]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-5-8.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-5-8.bin
Epoch 0: 1%|▌ | 281/20000 [05:14<6:07:58, 0.89it/s, v_num=10, train/loss=2.210, train/acc=0.209, train/f1=0.124, train/recall=0.209, train/precision=0.116]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-0-15.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-0-15.bin
Epoch 0: 1%|▍ | 282/20000 [05:15<6:07:54, 0.89it/s, v_num=10, train/loss=2.190, train/acc=0.130, train/f1=0.107, train/recall=0.130, train/precision=0.0993]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-5-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-0-15.bin
Epoch 0: 1%|▌ | 283/20000 [05:16<6:07:52, 0.89it/s, v_num=10, train/loss=2.220, train/acc=0.105, train/f1=0.103, train/recall=0.105, train/precision=0.206]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
Epoch 0: 1%|▍ | 284/20000 [05:17<6:07:48, 0.89it/s, v_num=10, train/loss=2.250, train/acc=0.0921, train/f1=0.0709, train/recall=0.0921, train/precision=0.0658]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
Epoch 0: 1%|▍ | 285/20000 [05:18<6:07:45, 0.89it/s, v_num=10, train/loss=2.170, train/acc=0.120, train/f1=0.099, train/recall=0.120, train/precision=0.0995]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
Epoch 0: 1%|▍ | 286/20000 [05:20<6:07:42, 0.89it/s, v_num=10, train/loss=2.190, train/acc=0.102, train/f1=0.0932, train/recall=0.102, train/precision=0.0884]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-11-10.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-11-10.bin
Epoch 0: 1%|▍ | 287/20000 [05:21<6:07:39, 0.89it/s, v_num=10, train/loss=2.210, train/acc=0.102, train/f1=0.0824, train/recall=0.102, train/precision=0.0897]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-17-21.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-17-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-11-10.bin
Epoch 0: 1%|▌ | 289/20000 [05:23<6:07:33, 0.89it/s, v_num=10, train/loss=2.190, train/acc=0.140, train/f1=0.107, train/recall=0.140, train/precision=0.148]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-17-21.bin
Epoch 0: 2%|▌ | 347/20000 [06:26<6:05:01, 0.90it/s, v_num=10, train/loss=2.240, train/acc=0.105, train/f1=0.0605, t
Epoch 0: 2%| | 348/20000 [06:27<6:04:58, 0.90it/s, v_num=10, train/loss=2.240, train/acc=0.105, train/f1=0.0605, train/recall=0.105, train/precisio
Epoch 0: 2%| | 360/20000 [06:40<6:04:32, 0.90it/s, v_num=10, train/loss=2.190, train/acc=0.138, train/f1=0.0883, train/recall=0.138, train/precisio
Traceback (most recent call last):
File "/teamspace/studios/this_studio/train.py", line 107, in <module>
...
RuntimeError: Waiting too long for the /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin to be ready
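To check the "downloaded several times, deleted once" pattern without reading the log by eye, a small counting script can help. This is only a sketch: the `DOWNLOADING` / `DELETING` line format is taken from the log above, the function name `summarize_chunk_events` is hypothetical, and the sample is an excerpt filtered to `chunk-24-0.bin`. It counts events per chunk and does not by itself prove the cause of the timeout.

```python
import re
from collections import Counter

# Matches the two event types seen in the log, even when a progress bar
# is fused onto the front of the line.
EVENT = re.compile(r"(DOWNLOADING|DELETING) (\S+\.bin)")


def summarize_chunk_events(log_text):
    """Return {chunk_name: (download_count, delete_count)} for a log string."""
    downloads, deletes = Counter(), Counter()
    for action, path in EVENT.findall(log_text):
        name = path.rsplit("/", 1)[-1]  # keep only the file name
        if action == "DOWNLOADING":
            downloads[name] += 1
        else:
            deletes[name] += 1
    return {
        name: (downloads[name], deletes[name])
        for name in set(downloads) | set(deletes)
    }


# Excerpt from the log above, filtered to a single chunk.
sample = """\
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
"""

summary = summarize_chunk_events(sample)
print(summary)
```

For `chunk-24-0.bin`, the chunk named in the `RuntimeError`, this reports four download events and one delete event, which is consistent with four processes sharing one cache directory.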
Comment: When you are using multiple GPUs, avoid creating your datasets in the init method of the DataModule. (Support will be added in the future.)
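One way to follow that advice is to keep only configuration in `__init__` and build the dataset in `setup()`, which Lightning calls in every process. The sketch below is an assumption about what that could look like, not the maintainers' example. To stay runnable without extra packages it uses a plain class and a stand-in factory. In a real project the class would presumably subclass `lightning.LightningDataModule` and `dataset_factory` would be `litdata.StreamingDataset`. The class name and the path are hypothetical, and the thread does not confirm that this resolves the timeout.

```python
from typing import Any, Callable, Optional


class StreamingDataModuleSketch:
    """Stand-in for a LightningDataModule subclass (hypothetical name).

    Assumption: in real code this inherits from lightning.LightningDataModule
    and dataset_factory is litdata.StreamingDataset.
    """

    def __init__(
        self,
        input_dir: str,
        max_cache_size: str = "6GB",
        dataset_factory: Callable[..., Any] = dict,
    ) -> None:
        # Store plain configuration only; no dataset is created here.
        self.input_dir = input_dir
        self.max_cache_size = max_cache_size
        self.dataset_factory = dataset_factory
        self.train_dataset: Optional[Any] = None

    def setup(self, stage: str = "fit") -> None:
        # Lightning runs setup() in every process, so each rank builds
        # its own dataset object here instead of in __init__.
        if self.train_dataset is None:
            self.train_dataset = self.dataset_factory(
                input_dir=self.input_dir,
                max_cache_size=self.max_cache_size,
            )


# Usage with the default stand-in factory (dict just records the arguments).
dm = StreamingDataModuleSketch("/path/to/optimized/dataset")  # hypothetical path
print(dm.train_dataset)  # nothing has been built yet
dm.setup("fit")
print(dm.train_dataset)
```

The point of the design is that constructing the module has no side effects, so the dataset object only comes into existence inside the hook that each training process runs.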
Hey @rakro101, do you think you could contribute an example with PyTorch Lightning to the repo?
Looking forward to the examples!
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
Create a litdata dataset, stream the shards (224×224×3 images plus some text), and train on multiple GPUs with BERT + ResNet, setting max_cache_size="6GB".
Added a studio to reproduce the issue.
Code sample
Added a studio to reproduce the error.
Additional context