
AudioText leaks memory for large-scale training datasets #1467

Closed
vadimkantorov opened this issue Nov 18, 2020 · 9 comments
Assignees
Labels
bug Something isn't working

Comments

@vadimkantorov
Contributor

vadimkantorov commented Nov 18, 2020

If I understand correctly, it appends and stores paths to audio files in a list:

class AudioText(_Collection):

If this object is then scattered to all DataLoader workers, it leads to leaking Linux shared memory (as measured by PSS), and some data loader worker processes get killed by the OOM killer, halting the training process. A full explanation of the bug is in pytorch/pytorch#13246.

Large-scale training datasets contain many audio files and hence many audio path strings, which exacerbates the problem.
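To illustrate the mechanism (a minimal, hedged sketch, not NeMo code; the paths are made up): CPython reference counting means that merely reading a string from the manifest list writes to the memory page holding that string's object header, so in a fork()ed worker the copy-on-write pages backing those objects are gradually duplicated.

```python
# Minimal illustration of why a big list of path strings leaks under fork:
# in CPython, merely taking a reference to a str bumps its refcount, which
# dirties the copy-on-write page holding that object in a forked worker.
import sys

# stand-in for a large manifest of audio paths (hypothetical names)
audio_paths = [f"/data/audio_{i:07d}.wav" for i in range(1000)]

before = sys.getrefcount(audio_paths[0])
ref = audio_paths[0]          # any read in a worker does this implicitly
after = sys.getrefcount(audio_paths[0])

assert after == before + 1    # the object's header page was written to
```

With millions of manifest entries, every worker eventually touches most of these pages, so PSS grows toward num_workers copies of the manifest.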

One solution is to use Manager objects, as proposed by @snakers4 in pytorch/pytorch#13246 (comment).
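A hedged sketch of that workaround: keep the manifest in a multiprocessing.Manager list, which lives in a separate server process, so workers fetch entries over IPC instead of touching refcounted objects in copy-on-write pages (the paths here are illustrative):

```python
from multiprocessing import Manager

if __name__ == "__main__":
    manager = Manager()
    # the manifest lives in the manager's server process, not in the
    # parent's heap, so forked workers do not dirty its pages
    shared_paths = manager.list(["/data/a.wav", "/data/b.wav"])
    print(shared_paths[0])
    manager.shutdown()
```

The trade-off is that every access is an IPC round-trip to the manager process, which is noticeably slower than local list indexing.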

Another solution is to pack strings into tensors as done in my gist: https://gist.github.com/vadimkantorov/86c3a46bf25bed3ad45d043ae86fff57#file-tensorbackeddictarray-py-L32
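The core of that gist can be sketched as follows (a simplified, pure-Python stand-in: the gist packs the buffers into torch tensors, while here plain bytes plus an array('q') of offsets play the same role of keeping all strings in two flat buffers instead of millions of refcounted str objects):

```python
from array import array

class StringArray:
    """Pack many strings into two flat buffers: bytes + int64 offsets."""

    def __init__(self, strings):
        encoded = [s.encode("utf-8") for s in strings]
        self.data = b"".join(encoded)        # one contiguous byte buffer
        offsets = [0]
        for chunk in encoded:
            offsets.append(offsets[-1] + len(chunk))
        self.offsets = array("q", offsets)   # one contiguous int64 buffer

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        return self.data[self.offsets[i]:self.offsets[i + 1]].decode("utf-8")
```

Usage: `StringArray(["/data/a.wav", "/data/b.wav"])[1]` decodes a fresh string on demand, so forked workers share the two read-only buffers instead of copy-on-write-duplicating per-string pages.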

@vadimkantorov vadimkantorov added the bug Something isn't working label Nov 18, 2020
@nithinraok nithinraok self-assigned this Nov 18, 2020
@nithinraok
Collaborator

Thanks, @vadimkantorov, for reporting this. I will look into the solution and test it out.

@vadimkantorov
Contributor Author

vadimkantorov commented Nov 20, 2020

Indeed, when @Squire-tomsk tried stubbing out .data in the UserList with our share-safe list holder from https://gist.github.com/vadimkantorov/86c3a46bf25bed3ad45d043ae86fff57, the memory leak went away. The other solution, with Manager'd ShareableDict objects, may require Python 3.8, but it should probably be fine.

Also, if supporting the full UserList interface is not necessary, it would be best to get rid of it, since slicing requires special care with these shareable-list hacks.

@vadimkantorov
Contributor Author

vadimkantorov commented Nov 23, 2020

The leak will be even worse when persistent workers are used.

@vadimkantorov
Contributor Author

@titu1994 Is the issue fixed, or superseded by another issue? Could you please clarify why it was closed? Thank you!

@roman-vygon
Contributor

@vadimkantorov Were you able to solve the issue using your tensor-backed arrays? I tried replacing audio_files in the init function with a StringArray, but it crashes due to high CPU and memory load, probably while iterating through it.

@samehraban

Another option is to switch to tarred datasets.

@vadimkantorov
Contributor Author

vadimkantorov commented May 17, 2023

@samehraban By the way, do you know whether tarred datasets (from webdataset / in NeMo) support caching an index file, so that one does not always need a linear scan to discover what's stored in the tarball?

@samehraban

@vadimkantorov Do you mean a single cache for all the tar files? I don't think so.

@vadimkantorov
Contributor Author

A single index for all tar files, or even an index per tar file... If we have an offset+count index, individual files can be read from the tarball with just a seek.
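A hedged sketch of that index idea using the standard tarfile module (the function names are illustrative, not NeMo/webdataset API): record each member's data offset and size once, persist the mapping however you like, and later read any file with a single seek:

```python
import tarfile

def build_index(tar_path):
    """One linear scan: map member name -> (data offset, size) in bytes."""
    index = {}
    with tarfile.open(tar_path) as tf:
        for member in tf:
            if member.isfile():
                # offset_data points at the member's payload, past its header
                index[member.name] = (member.offset_data, member.size)
    return index

def read_member(tar_path, index, name):
    """Random access: seek straight to the payload, no scanning."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

Since tar stores members uncompressed back to back (headers padded to 512-byte blocks), this works on plain .tar archives; a compressed .tar.gz would need to be decompressed or recompressed per-member first.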
