AudioText leaks memory for large-scale training datasets #1467
Comments
Thanks, @vadimkantorov, for reporting this. Will look into the solution and test it out.
Indeed, @Squire-tomsk investigated stubbing this out. Also, if supporting the full UserList interface is not necessary, it would be best to get rid of it, since slicing would require special care with these shareable-list hacks.
The leak will be even worse when persistent workers are used.
@titu1994 Is this issue fixed, or superseded by another issue? Could you please clarify why it was closed? Thank you!
@vadimkantorov were you able to solve the issue using your tensor-backed arrays? I tried replacing audio_files in the init function with a StringArray, but it crashes due to high CPU and memory load, probably while iterating through it.
Another option is to switch to tarred datasets.
@samehraban by the way, do you know if tarred datasets (from webdataset, or in NeMo) support caching an index file, so that one does not need to do a linear scan every time to discover what is stored in the tarball?
@vadimkantorov Do you mean a single cache for all the tar files? I don't think so.
A single index for all tar files, or even an index per tarfile, would work. If we have an offset+count index, individual files can be read from the tarfile by just doing a seek, as in the sketch below.
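To make the idea concrete, here is a minimal sketch of such an index. This is not an existing webdataset or NeMo feature that I can point to; the helper names, file paths, and the JSON cache format are made up for illustration, and it assumes uncompressed tar shards (seek into a compressed archive would not work this way):

```python
import json
import tarfile

def build_tar_index(tar_path, index_path):
    """One linear pass over an uncompressed tar file; caches name -> (offset, size)."""
    index = {}
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if member.isfile():
                # offset_data points at the first byte of the member's payload
                index[member.name] = (member.offset_data, member.size)
    with open(index_path, "w") as f:
        json.dump(index, f)
    return index

def read_member(tar_path, index, name):
    """Read a single member directly with seek, without rescanning the tarball."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)

# hypothetical usage:
# idx = build_tar_index("shard-000000.tar", "shard-000000.index.json")
# wav_bytes = read_member("shard-000000.tar", idx, "sample_0001.wav")
```

After the one-time scan in build_tar_index, any worker can open the tar file and jump straight to the member it needs.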
If I understand correctly, it appends and stores paths to audio files in a Python list:
NeMo/nemo/collections/asr/parts/collections.py
Line 90 in 8d506d5
If this object is then scattered to all DataLoader workers, it leads to leaking Linux shared memory (as measured by PSS) and gets some DataLoader worker processes killed by the OOM killer, halting the training process. A full explanation of the bug is in pytorch/pytorch#13246.
Large-scale training datasets contain a lot of audio files and hence a lot of audio path strings, which exacerbates the problem; a minimal sketch of the failure mode follows.
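For illustration only (the class and file names below are made up, not NeMo's actual code): the dataset keeps a plain Python list of path strings, each forked worker inherits it copy-on-write, and every element access bumps that string's refcount, which dirties the page and forces the kernel to copy it into the worker, so per-worker resident memory creeps up toward the full size of the list.

```python
from torch.utils.data import Dataset, DataLoader

class AudioPathDataset(Dataset):
    """Illustrative only: a dataset that keeps all audio paths in a Python list."""

    def __init__(self, audio_files):
        # Millions of small Python string objects; forked workers "share" them
        # only until the first access touches each object's refcount.
        self.audio_files = list(audio_files)

    def __len__(self):
        return len(self.audio_files)

    def __getitem__(self, idx):
        # Reading the element updates its refcount, which dirties the memory
        # page and makes the kernel copy it into the worker process.
        return self.audio_files[idx]

if __name__ == "__main__":
    paths = [f"/data/audio_{i:09d}.wav" for i in range(10_000_000)]
    loader = DataLoader(AudioPathDataset(paths), batch_size=256, num_workers=8)
    for batch in loader:  # PSS of each worker grows toward the full list size
        pass
```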
One solution is to use Manager objects, as proposed by @snakers4 in pytorch/pytorch#13246 (comment)
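A rough sketch of that workaround, under the assumption that the dataset only holds a proxy to a Manager-hosted list (again, illustrative names, not a tested NeMo patch): the strings live in the manager's server process, so workers never hold, and never copy-on-write, the big Python list, at the cost of an IPC round trip per access.

```python
import multiprocessing as mp
from torch.utils.data import Dataset, DataLoader

class ManagedAudioPathDataset(Dataset):
    """Illustrative sketch: the dataset holds only a proxy to a Manager-hosted list."""

    def __init__(self, shared_audio_files):
        # shared_audio_files is a multiprocessing.Manager().list(...) proxy;
        # the actual strings live in the manager's server process.
        self.audio_files = shared_audio_files

    def __len__(self):
        return len(self.audio_files)

    def __getitem__(self, idx):
        # Each access is an IPC round trip to the manager process,
        # trading latency for bounded per-worker memory growth.
        return self.audio_files[idx]

if __name__ == "__main__":
    manager = mp.Manager()  # keep this object alive for the whole run
    paths = manager.list([f"/data/audio_{i}.wav" for i in range(1000)])
    loader = DataLoader(ManagedAudioPathDataset(paths), num_workers=4)
```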
Another solution is to pack strings into tensors, as done in my gist: https://gist.github.com/vadimkantorov/86c3a46bf25bed3ad45d043ae86fff57#file-tensorbackeddictarray-py-L32
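For reference, a stripped-down sketch of the idea (not the gist verbatim): all path strings are packed into one flat uint8 tensor plus an int64 offsets tensor, so each worker touches two tensors instead of millions of refcounted Python string objects.

```python
import torch

class StringArray:
    """Simplified sketch of a tensor-backed, read-only string array."""

    def __init__(self, strings, encoding="utf-8"):
        self.encoding = encoding
        encoded = [s.encode(encoding) for s in strings]
        # offsets[i] .. offsets[i + 1] delimit the i-th string inside `data`
        self.offsets = torch.tensor([0] + [len(b) for b in encoded],
                                    dtype=torch.int64).cumsum(dim=0)
        # one contiguous byte buffer holding all strings back to back
        self.data = torch.frombuffer(bytearray(b"".join(encoded)),
                                     dtype=torch.uint8)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, idx):
        lo, hi = self.offsets[idx].item(), self.offsets[idx + 1].item()
        return self.data[lo:hi].numpy().tobytes().decode(self.encoding)

# usage: paths = StringArray(["/data/a.wav", "/data/b.wav"]); paths[1] == "/data/b.wav"
```

Replacing the audio_files list with such an array keeps the path data in flat tensor storage that forked workers can read without triggering per-object copy-on-write.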