RAM increasing during first epoch of training #141
Comments
I also encountered a similar issue while training an ImageNet classification model on a resource-limited PC. Specifically, I found that the prefetch_factor of StreamingDataLoader defaults to 10 when num_workers > 0 (see litdata/src/litdata/streaming/dataloader.py, line 604 at d5eff39).
This default differs from what the StreamingDataLoader docstring describes and also from PyTorch's DataLoader default. Manually setting prefetch_factor to a smaller number, like 2, significantly reduced the RAM usage.
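For reference, a minimal sketch of the workaround (the dataset path and batch settings below are placeholders, not taken from this issue):

```python
from litdata import StreamingDataset, StreamingDataLoader

# Hypothetical dataset location; replace with your own optimized dataset.
dataset = StreamingDataset(input_dir="s3://my-bucket/imagenet-optimized")

# Passing prefetch_factor explicitly caps how many batches each worker keeps
# queued, so host RAM stays closer to what the training step actually needs.
dataloader = StreamingDataLoader(
    dataset,
    batch_size=64,
    num_workers=4,
    prefetch_factor=2,  # PyTorch's DataLoader default; litdata currently defaults to 10
)

for batch in dataloader:
    ...  # training_step(batch)
```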
Thanks for pointing out the wrong default value in the docstring; it will be fixed soon. Did you try values in the middle, like 5-6? If so, please share your experience. It will help in deciding whether something needs to change here.
I think the behavior of StreamingDataLoader is expected.
If the training_step consumes batches faster than the dataloader prepares them, the prefetch queue stays small and consumes little RAM. However, if the training_step slows down, the dataloader gradually fills the prefetch queue until it reaches its limit (prefetch_factor * num_workers), and an increase in host memory usage becomes visible. So I think the only real problem is that the default prefetch_factor is too high and out of sync with well-known defaults, which makes this issue easy to overlook.
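To make that bound concrete, a back-of-the-envelope calculation with illustrative numbers (not measurements from this issue):

```python
# Rough upper bound on host RAM held by the prefetch queue, assuming each
# prefetched batch stays fully materialized in host memory.
num_workers = 4
batch_size = 64
bytes_per_sample = 3 * 224 * 224 * 4  # float32 ImageNet-sized tensor, ~0.6 MB

for prefetch_factor in (2, 10):
    queued_bytes = num_workers * prefetch_factor * batch_size * bytes_per_sample
    print(f"prefetch_factor={prefetch_factor}: ~{queued_bytes / 1e9:.1f} GB queued")
# prefetch_factor=2:  ~0.3 GB queued
# prefetch_factor=10: ~1.5 GB queued
```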
Hey @jackcyc. Yes, it seems to be fine for LLM pre-training and ImageNet, but as you rightly stated, it is data specific. This was a tuning I made to accelerate training, as I noticed lower GPU utilization. If you feel the value should be put back to something lower, feel free to open a PR and we will merge it.
🐛 Bug
During training, RAM keeps increasing throughout the first epoch.
For example, it starts at 27 GB, and after about 10% of the first epoch it is already up to 40 GB.
I do not expect this RAM increase when streaming data.
To Reproduce
Steps to reproduce the behavior: see #140
Code sample
See #140
Expected behavior
RAM should not keep increasing throughout the entire first epoch.
Environment
See #140
Additional context
See #140