
RAM increasing during first epoch of training #141

Open
rakro101 opened this issue May 24, 2024 · 4 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@rakro101
Contributor

🐛 Bug

During training, RAM usage increases throughout the first epoch.
For example, it starts at 27 GB, and after 10% of the first epoch it is already at 40 GB ...
I do not expect this RAM increase when streaming data.

To Reproduce

See #140

Steps to reproduce the behavior:

  1. Use a dataset and convert it to the litdata format.
  2. Start a training run that iterates over every sample of the training dataset once per epoch.
  3. Watch the RAM usage increase.

Code sample

See #140
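
The actual code sample is in #140. As context for the steps above, a minimal hypothetical sketch could look like the following (the paths, the sample-loading function, and the loader settings are placeholders, not the reporter's actual setup):

import numpy as np
from litdata import optimize, StreamingDataset, StreamingDataLoader

# Step 1: convert an existing dataset to the litdata format (placeholder samples).
def load_sample(index):
    return {"x": np.random.rand(3, 224, 224).astype(np.float32), "y": index % 1000}

if __name__ == "__main__":
    optimize(
        fn=load_sample,
        inputs=list(range(10_000)),        # placeholder sample count
        output_dir="my_optimized_dataset",
        chunk_bytes="64MB",
    )

    # Step 2: stream the optimized dataset during training.
    dataset = StreamingDataset("my_optimized_dataset")
    loader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)

    # Step 3: iterate over every sample once per epoch and watch host RAM grow.
    for epoch in range(2):
        for batch in loader:
            pass  # training_step(batch) would go here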

Expected behavior

RAM should not keep increasing throughout the first epoch.

Environment

See #140

Additional context

See #140

@rakro101 added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on May 24, 2024
@jackcyc

jackcyc commented Jun 19, 2024

I also encountered a similar issue while training an ImageNet classification model on a resource-limited PC. Specifically, I found that the prefetch_factor of StreamingDataLoader is set to 10 by default when num_workers > 0.

prefetch_factor=(10 if num_workers > 0 else None) if prefetch_factor is None else prefetch_factor,

This default value appears to differ from what is described in the StreamingDataLoader docstring and also from PyTorch's DataLoader default.

Manually setting the prefetch_factor to a smaller number, like 2, significantly reduced the RAM usage.
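
A minimal sketch of that workaround, assuming an already-optimized dataset (the path, batch size, and worker count below are illustrative):

from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset("my_optimized_dataset")  # illustrative path

# Cap prefetching explicitly instead of relying on the implicit default of 10.
# With prefetch_factor=2, at most 2 * num_workers batches are buffered across all workers.
loader = StreamingDataLoader(
    dataset,
    batch_size=256,
    num_workers=8,
    prefetch_factor=2,
)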

@deependujha
Contributor

deependujha commented Jun 20, 2024

Thanks for pointing out the wrong default value in the docstring; it'll be fixed soon.

https://github.com/Lightning-AI/litdata/blob/d5eff393cd17ba4f789fa846788f40b5ca4d0779/src/litdata/streaming/dataloader.py#L533C1-L538C1

Did you try values in between, like 5 or 6? If yes, please share your experience.

It'll help in deciding whether something needs to change.

@jackcyc

jackcyc commented Jun 21, 2024

I think the behavior of StreamingDataLoader is expected.

prefetch_factor (int, optional, keyword-only arg) – Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default value depends on the set value for num_workers. If value of num_workers=0 default is None. Otherwise, if value of num_workers > 0 default is 2).

If the training_step consumes batches faster than the dataloader prepares them, the prefetch queue stays small and consumes little RAM. However, if the training_step slows down, the dataloader gradually fills the prefetch queue up to its limit (prefetch_factor * num_workers), and an increase in host memory usage can be observed. Therefore, I think the only problem is that the default prefetch_factor is too high and out of sync with the well-known defaults, which makes this issue easy to overlook.
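
As a rough back-of-envelope illustration of why the default matters (the batch size, image shape, and worker count are assumptions, not measurements from this issue):

# Upper bound on host RAM held by the prefetch queues alone.
num_workers = 8
batch_size = 256
bytes_per_sample = 224 * 224 * 3 * 4        # float32 ImageNet-sized tensor, ~0.57 MB

def prefetch_ram_gb(prefetch_factor):
    batches = prefetch_factor * num_workers
    return batches * batch_size * bytes_per_sample / 1024**3

print(f"prefetch_factor=10: ~{prefetch_ram_gb(10):.1f} GB buffered")  # ~11.5 GB
print(f"prefetch_factor=2:  ~{prefetch_ram_gb(2):.1f} GB buffered")   # ~2.3 GB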

@tchaton
Collaborator

tchaton commented Jun 21, 2024

Hey @jackcyc. Yes, it seems to be fine for LLM pre-training and ImageNet, but as you rightly stated, it is data-specific. This was a tuning I made to accelerate training, as I had noticed lower GPU utilization.

If you feel like the value should be put back to something lower, feel free to make a PR and we will merge it.
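
For reference, such a PR would essentially be a one-line change to the default quoted above, mirroring PyTorch's documented behavior; a hypothetical sketch, not a merged patch:

# src/litdata/streaming/dataloader.py -- hypothetical default, aligned with
# torch.utils.data.DataLoader (2 batches per worker when num_workers > 0):
prefetch_factor=(2 if num_workers > 0 else None) if prefetch_factor is None else prefetch_factor,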
