
Memory leak #344

Closed
zkdfbb opened this issue Dec 5, 2018 · 5 comments
Labels
bug Something isn't working

Comments

@zkdfbb

zkdfbb commented Dec 5, 2018

When I upgraded to 0.5.0 I noticed a huge growth in memory usage. With version 0.2.0 (and also 0.3) the memory grows as:
4826, 5096, 5302
while with 0.5.0 (and also 0.4.1) it grows as:
6754, 8970, 11114

Because the last batch may be smaller when running evaluation, I re-read the dataset (TFRecords) on every evaluation; with that, the result under version 0.2.0 is:
4826, 5294, 5510

How can I solve this problem?
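
For reference, a minimal sketch of how per-epoch figures like the ones above can be collected; it assumes the psutil package and that the numbers are resident-set sizes in MB, neither of which is stated in the report:

    import psutil

    proc = psutil.Process()

    def rss_mb():
        # Resident set size of the current process, in megabytes.
        return proc.memory_info().rss / (1024 * 1024)

    # Hypothetical loop: print memory after every epoch to spot growth.
    for epoch in range(3):
        # run_one_epoch(pipeline)  # placeholder for the actual DALI training/eval step
        print("epoch %d: %.0f MB" % (epoch, rss_mb()))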

@JanuszL JanuszL added bug Something isn't working question Further information is requested labels Dec 5, 2018
@JanuszL
Contributor

JanuszL commented Dec 5, 2018

Hi,
This problem was reported in #328 as well. At the moment we don't have a good, general solution for it.
Could you tell us more about the pipeline you use, which data set, etc.?
Maybe this will help us narrow down the problem you are experiencing.
Tracked as DALI-425.
Br,
Janusz

@zkdfbb
Author

zkdfbb commented Dec 5, 2018

@JanuszL It's my own dataset, not public, read with a TFRecordPipeline. I'm afraid a bug was introduced between 0.3.0 and 0.4.0.
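
For illustration, a rough sketch of what such a TFRecord pipeline looks like in the class-based API that DALI 0.x used; the paths, feature names, and decoder choice are assumptions, not the reporter's actual code:

    import nvidia.dali.ops as ops
    import nvidia.dali.types as types
    import nvidia.dali.tfrecord as tfrec
    from nvidia.dali.pipeline import Pipeline

    class TFRecordPipeline(Pipeline):
        def __init__(self, batch_size, num_threads, device_id):
            super(TFRecordPipeline, self).__init__(batch_size, num_threads, device_id)
            # Placeholder paths and feature names.
            self.input = ops.TFRecordReader(
                path="train.tfrecord",
                index_path="train.idx",
                features={
                    "image/encoded": tfrec.FixedLenFeature((), tfrec.string, ""),
                    "image/class/label": tfrec.FixedLenFeature([1], tfrec.int64, -1),
                })
            # nvJPEGDecoder was the GPU decoder op in DALI 0.x (later renamed ImageDecoder).
            self.decode = ops.nvJPEGDecoder(device="mixed", output_type=types.RGB)

        def define_graph(self):
            inputs = self.input()
            images = self.decode(inputs["image/encoded"])
            return images, inputs["image/class/label"]

    pipe = TFRecordPipeline(batch_size=32, num_threads=2, device_id=0)
    pipe.build()
    images, labels = pipe.run()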

@JanuszL
Contributor

JanuszL commented Dec 5, 2018

Could you check this with some publicly available data set, so that we can fully reproduce it here as well?
Otherwise, even if we fix something, we cannot tell whether it works for you.

@JanuszL JanuszL removed the question Further information is requested label Jan 21, 2020
@mzient
Contributor

mzient commented Feb 4, 2020

Regarding host memory usage:
We're pleased to say that we've changed the allocation strategy for (non-pinned) CPU buffers. It reduces memory consumption in RN50 training in PyTorch by almost 50%. Please check the latest master (or the next successful nightly) and see if your issue is resolved.
Memory is now freed when a requested tensor is smaller than a given percentage of the actual allocation. You can tweak this by setting the environment variable DALI_HOST_BUFFER_SHRINK_THRESHOLD=0.xx. The default value is 0.9. You can also set it in Python using nvidia.dali.backend.SetHostBufferShrinkThreshold(threshold).
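
For reference, both ways of setting the threshold mentioned above, as a small sketch (the 0.5 value here is only an example, not a recommendation):

    import os

    # Option 1: set the environment variable before DALI allocates host buffers.
    os.environ["DALI_HOST_BUFFER_SHRINK_THRESHOLD"] = "0.5"

    # Option 2: set it from Python at runtime (the default is 0.9).
    import nvidia.dali.backend as backend
    backend.SetHostBufferShrinkThreshold(0.5)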

@JanuszL JanuszL added this to the Release_0.19.0 milestone Feb 4, 2020
@JanuszL
Contributor

JanuszL commented Mar 2, 2020

0.19 is out and should address this. Please reopen if it doesn't work for you.

@JanuszL JanuszL closed this as completed Mar 2, 2020