
RAM increasing during first epoch of training #141

Open
rakro101 opened this issue May 24, 2024 · 4 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@rakro101
Contributor

🐛 Bug

During training, RAM usage increases throughout the first epoch.
For example, it starts at 27 GB, and after 10% of the first epoch it is already at 40 GB ...
I do not expect this RAM increase when streaming data.

To Reproduce

See #140

Steps to reproduce the behavior:

  1. Use a dataset and convert it to the litdata format.
  2. Start a training run that iterates over every sample of the training dataset once per epoch.
  3. Watch the RAM usage increase.

Code sample

See #140
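
The actual code sample is in #140. As context for the steps above, a minimal hypothetical sketch could look like the following (the paths, the sample-loading function, and the loader settings are placeholders, not the reporter's actual setup):

import numpy as np
from litdata import optimize, StreamingDataset, StreamingDataLoader

# Step 1: convert an existing dataset to the litdata format (placeholder samples).
def load_sample(index):
    return {"x": np.random.rand(3, 224, 224).astype(np.float32), "y": index % 1000}

if __name__ == "__main__":
    optimize(
        fn=load_sample,
        inputs=list(range(10_000)),        # placeholder sample count
        output_dir="my_optimized_dataset",
        chunk_bytes="64MB",
    )

    # Step 2: stream the optimized dataset during training.
    dataset = StreamingDataset("my_optimized_dataset")
    loader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)

    # Step 3: iterate over every sample once per epoch and watch host RAM grow.
    for epoch in range(2):
        for batch in loader:
            pass  # training_step(batch) would go here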

Expected behavior

RAM should not keep increasing throughout the first epoch.

Environment

See #140

Additional context

See #140

@rakro101 added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on May 24, 2024
@jackcyc

jackcyc commented Jun 19, 2024

I also encountered a similar issue while training an ImageNet classification model on a resource-limited PC. Specifically, I found that the prefetch_factor of StreamingDataLoader is set to 10 by default when num_workers > 0.

prefetch_factor=(10 if num_workers > 0 else None) if prefetch_factor is None else prefetch_factor,

This default value appears to differ from what is described in the StreamingDataLoader docstring and also from PyTorch's DataLoader default.

Manually setting the prefetch_factor to a smaller number, like 2, significantly reduced the RAM usage.
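
A minimal sketch of that workaround, assuming an already-optimized dataset (the path, batch size, and worker count below are illustrative):

from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset("my_optimized_dataset")  # illustrative path

# Cap prefetching explicitly instead of relying on the implicit default of 10.
# With prefetch_factor=2, at most 2 * num_workers batches are buffered across all workers.
loader = StreamingDataLoader(
    dataset,
    batch_size=256,
    num_workers=8,
    prefetch_factor=2,
)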

@deependujha
Contributor

deependujha commented Jun 20, 2024

Thanks for pointing out the wrong default value in the docstring; it'll be fixed soon.

https://github.com/Lightning-AI/litdata/blob/d5eff393cd17ba4f789fa846788f40b5ca4d0779/src/litdata/streaming/dataloader.py#L533C1-L538C1

Did you try values in between, like 5 or 6? If yes, please share your experience.

It'll help in deciding whether something needs to change.

@jackcyc

jackcyc commented Jun 21, 2024

I think the behavior of StreamingDataLoader is expected.

prefetch_factor (int, optional, keyword-only arg) – Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default value depends on the set value for num_workers. If value of num_workers=0 default is None. Otherwise, if value of num_workers > 0 default is 2).

If the training_step consumes batches faster than the dataloader prepares them, the prefetch queue stays small and consumes little RAM. However, if the training_step slows down, the dataloader gradually fills the prefetch queue up to its limit (prefetch_factor * num_workers), and an increase in host memory usage can be observed. Therefore, I think the only problem is that the default prefetch_factor is too high and out of sync with the well-known defaults, which makes this issue easy to overlook.
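
As a rough back-of-envelope illustration of why the default matters (the batch size, image shape, and worker count are assumptions, not measurements from this issue):

# Upper bound on host RAM held by the prefetch queues alone.
num_workers = 8
batch_size = 256
bytes_per_sample = 224 * 224 * 3 * 4        # float32 ImageNet-sized tensor, ~0.57 MB

def prefetch_ram_gb(prefetch_factor):
    batches = prefetch_factor * num_workers
    return batches * batch_size * bytes_per_sample / 1024**3

print(f"prefetch_factor=10: ~{prefetch_ram_gb(10):.1f} GB buffered")  # ~11.5 GB
print(f"prefetch_factor=2:  ~{prefetch_ram_gb(2):.1f} GB buffered")   # ~2.3 GB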

@tchaton
Collaborator

tchaton commented Jun 21, 2024

Hey @jackcyc. Yes, it seems to be fine for LLM pre-training and ImageNet, but as you rightly stated, it is data-specific. This was a tuning I made to accelerate training, as I had noticed lower GPU utilization.

If you feel like the value should be put back to something lower, feel free to make a PR and we will merge it.
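
For reference, such a PR would essentially be a one-line change to the default quoted above, mirroring PyTorch's documented behavior; a hypothetical sketch, not a merged patch:

# src/litdata/streaming/dataloader.py -- hypothetical default, aligned with
# torch.utils.data.DataLoader (2 batches per worker when num_workers > 0):
prefetch_factor=(2 if num_workers > 0 else None) if prefetch_factor is None else prefetch_factor,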
