Dali leaks memory on invalid images #4740

Closed
joey-trigo opened this issue Mar 23, 2023 · 5 comments
Labels: bug (Something isn't working), perf (Issues related to DALI performance)

Comments

@joey-trigo

joey-trigo commented Mar 23, 2023

Hi!

Due to an issue earlier in my pipeline, DALI received invalid JPEGs. When that happened, DALI leaked memory on each request until my process was killed by the OOM killer. I don't necessarily expect DALI to handle invalid JPEGs elegantly, but it definitely shouldn't leak memory.

Here is a minimal test case to reproduce this behaviour:

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

image_dir = "./"

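# Rebuild and run the pipeline in a loop so that any per-run leak accumulates.
while True:
    @pipeline_def
    def simple_pipeline():
        # Read the encoded files under image_dir, then decode them with the mixed decoder.
        jpegs, labels = fn.readers.file(file_root=image_dir)
        images = fn.decoders.image(jpegs, device='mixed', output_type=types.BGR)
        return images, labels

    pipe = simple_pipeline(batch_size=1, num_threads=1, device_id=0)
    pipe.build()

    pipe_out = pipe.run()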

Running this with a folder foo/ containing a regular JPEG (such as good.jpg) works fine: the script produces no output and memory usage stays stable. Running it with a corrupted JPEG (such as 0.jpg) causes a rapid memory increase until the process is killed.
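
A bounded variant of the loop above can make the growth easier to quantify. The sketch below is not part of the original report: `measure_growth` and `rss_bytes_self` are hypothetical helper names, and the default reader assumes Linux, where the second field of /proc/self/statm is the resident set size in pages.

```python
import os


def rss_bytes_self():
    """Return this process's resident set size in bytes (Linux only)."""
    with open("/proc/self/statm") as f:
        # statm fields: size resident shared text lib data dt (all in pages)
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def measure_growth(step, iterations, read_rss=rss_bytes_self, warmup=5):
    """Call step() repeatedly and return the change in RSS, in bytes.

    A few warm-up calls run first so that one-time allocations
    (imports, caches, pools) are not counted as growth.
    """
    for _ in range(warmup):
        step()
    before = read_rss()
    for _ in range(iterations):
        step()
    return read_rss() - before


# Hypothetical usage with the pipeline from the report:
#
#     def step():
#         pipe = simple_pipeline(batch_size=1, num_threads=1, device_id=0)
#         pipe.build()
#         pipe.run()
#
#     print(measure_growth(step, iterations=200))
```

Growth that keeps scaling with the number of iterations would point to a leak. A small, constant offset is more likely allocator caching, so it helps to compare runs against the good and the corrupted image.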

@JanuszL
Contributor

JanuszL commented Mar 23, 2023

Hi @joey-trigo,

Thank you for reporting this problem. I remember we fixed a similar one some time ago in #4138.
I tested your code with these two samples and the level of the GPU memory consumption remains stable.
Can you tell me what DALI version you use?
What I did was:

 docker run --rm -ti --gpus all nvidia/cuda:12.1.0-devel-ubuntu20.04

apt update && apt install -y nano python3-pip wget && \
pip install --upgrade pip && \
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda120 && \
mkdir -p test && cd test && mkdir -p foo && cd foo  && \
wget https://user-images.githubusercontent.com/125357480/227272721-f7945ccb-1835-4a90-8103-fb74910977b9.jpg && \
wget https://user-images.githubusercontent.com/125357480/227272969-ba005433-e869-471a-b09e-10c85d39f921.jpg && \
cd .. && \
python3 test.py

where test.py is your sample.

JanuszL added the bug and lack_of_repro labels on Mar 23, 2023
jantonguirao assigned JanuszL and unassigned klecki on Mar 24, 2023
@joey-trigo
Author

Hi,
The leak is in CPU memory, not GPU memory. I ran the exact commands you did and can see the memory climbing with, for example:

while true; do awk '{ print $24 }' /proc/$(pgrep python3)/stat; sleep 1; done

The bug you referenced does sound very similar. Is it possible that only the GPU memory is freed there and not the CPU memory?
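
For reference, the awk one-liner above prints field 24 of /proc/<pid>/stat, which proc(5) documents as the resident set size in pages. A rough Python equivalent is sketched below, assuming Linux. The helper names `parse_rss_pages` and `rss_bytes` are made up for illustration. Unlike the awk version, it splits after the closing parenthesis of the command name, so a process name containing spaces does not shift the field positions.

```python
import os


def parse_rss_pages(stat_line):
    """Extract field 24 (rss, in pages) from a /proc/<pid>/stat line."""
    # Field 2 (comm) is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before tokenizing the remaining fields.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state), so field 24 sits at index 21.
    return int(after_comm[21])


def rss_bytes(pid="self"):
    """Return the resident set size of a process in bytes (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        pages = parse_rss_pages(f.read())
    return pages * os.sysconf("SC_PAGE_SIZE")
```

Calling `rss_bytes(pid)` once per second and printing the result gives the same trend as the shell loop, in bytes rather than pages.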

JanuszL removed the lack_of_repro label on Mar 27, 2023
@JanuszL
Contributor

JanuszL commented Mar 27, 2023

Thanks, @joey-trigo, the repro is very helpful. I believe I managed to fix the issue in #4748.
Please check the nightly build that follows the merge of the mentioned commit.

@joey-trigo
Author

Will do, thanks! That fix looks good; I'll test it as soon as it merges.

stiepan added this to the Release_1.25.0 milestone on Apr 12, 2023
@joey-trigo
Author

Hi, tested the change and it looks like it's working. Thanks again!

klecki added the perf label on Apr 17, 2023

4 participants