
AudioText leaks memory for large-scale training datasets #1467

Closed
vadimkantorov opened this issue Nov 18, 2020 · 9 comments
Assignees
Labels
bug Something isn't working

Comments

@vadimkantorov
Contributor

vadimkantorov commented Nov 18, 2020

If I understand correctly, it appends and stores paths to audio files in a list:

class AudioText(_Collection):

If this object is then scattered to all DataLoader workers, it leads to leaking Linux shared memory (as measured by PSS), and some data loader worker processes get killed by the OOM killer, halting the training process. A full explanation of the bug is in pytorch/pytorch#13246.

Large-scale training datasets contain many audio files and hence many audio path strings, which exacerbates the problem.
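To illustrate the mechanism (a minimal, hedged sketch, not NeMo code; the paths are made up): CPython reference counting means that merely reading a string from the manifest list writes to the memory page holding that string's object header, so in a fork()ed worker the copy-on-write pages backing those objects are gradually duplicated.

```python
# Minimal illustration of why a big list of path strings leaks under fork:
# in CPython, merely taking a reference to a str bumps its refcount, which
# dirties the copy-on-write page holding that object in a forked worker.
import sys

# stand-in for a large manifest of audio paths (hypothetical names)
audio_paths = [f"/data/audio_{i:07d}.wav" for i in range(1000)]

before = sys.getrefcount(audio_paths[0])
ref = audio_paths[0]          # any read in a worker does this implicitly
after = sys.getrefcount(audio_paths[0])

assert after == before + 1    # the object's header page was written to
```

With millions of manifest entries, every worker eventually touches most of these pages, so PSS grows toward num_workers copies of the manifest.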

One solution is to use Manager objects, as proposed by @snakers4 in pytorch/pytorch#13246 (comment).
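A hedged sketch of that workaround: keep the manifest in a multiprocessing.Manager list, which lives in a separate server process, so workers fetch entries over IPC instead of touching refcounted objects in copy-on-write pages (the paths here are illustrative):

```python
from multiprocessing import Manager

if __name__ == "__main__":
    manager = Manager()
    # the manifest lives in the manager's server process, not in the
    # parent's heap, so forked workers do not dirty its pages
    shared_paths = manager.list(["/data/a.wav", "/data/b.wav"])
    print(shared_paths[0])
    manager.shutdown()
```

The trade-off is that every access is an IPC round-trip to the manager process, which is noticeably slower than local list indexing.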

Another solution is to pack strings into tensors as done in my gist: https://gist.github.com/vadimkantorov/86c3a46bf25bed3ad45d043ae86fff57#file-tensorbackeddictarray-py-L32
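The core of that gist can be sketched as follows (a simplified, pure-Python stand-in: the gist packs the buffers into torch tensors, while here plain bytes plus an array('q') of offsets play the same role of keeping all strings in two flat buffers instead of millions of refcounted str objects):

```python
from array import array

class StringArray:
    """Pack many strings into two flat buffers: bytes + int64 offsets."""

    def __init__(self, strings):
        encoded = [s.encode("utf-8") for s in strings]
        self.data = b"".join(encoded)        # one contiguous byte buffer
        offsets = [0]
        for chunk in encoded:
            offsets.append(offsets[-1] + len(chunk))
        self.offsets = array("q", offsets)   # one contiguous int64 buffer

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        return self.data[self.offsets[i]:self.offsets[i + 1]].decode("utf-8")
```

Usage: `StringArray(["/data/a.wav", "/data/b.wav"])[1]` decodes a fresh string on demand, so forked workers share the two read-only buffers instead of copy-on-write-duplicating per-string pages.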

@vadimkantorov vadimkantorov added the bug Something isn't working label Nov 18, 2020
@nithinraok nithinraok self-assigned this Nov 18, 2020
@nithinraok
Collaborator

Thanks, @vadimkantorov, for reporting this. I will look into the solution and test it out.

@vadimkantorov
Contributor Author

vadimkantorov commented Nov 20, 2020

Indeed, when @Squire-tomsk tried stubbing out .data in the UserList with our share-safe list holder from https://gist.github.com/vadimkantorov/86c3a46bf25bed3ad45d043ae86fff57, the memory leak went away. The other solution, with Manager'd ShareableDict objects, may require Python 3.8, but it should probably be fine.

Also, if supporting the full UserList interface is not necessary, it would be best to get rid of it, since slicing requires special care with these shareable-list hacks.

@vadimkantorov
Contributor Author

vadimkantorov commented Nov 23, 2020

The leak will be even worse when persistent workers are used.

@vadimkantorov
Contributor Author

@titu1994 Is the issue fixed, or superseded by another issue? Could you please clarify why it was closed? Thank you!

@roman-vygon
Contributor

@vadimkantorov Were you able to solve the issue using your tensor-backed arrays? I tried replacing audio_files in the init function with a StringArray, but it crashes due to high CPU and memory load, probably while iterating through it.

@samehraban

Another option is to switch to tarred datasets.

@vadimkantorov
Contributor Author

vadimkantorov commented May 17, 2023

@samehraban By the way, do you know whether tarred datasets (from webdataset / in NeMo) support caching an index file, so that one does not always need a linear scan to discover what's stored in the tarball?

@samehraban

@vadimkantorov Do you mean a single cache for all the tar files? I don't think so.

@vadimkantorov
Contributor Author

A single index for all tar files, or even an index per tar file... If we have an offset+count index, individual files can be read from the tarball with just a seek.
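A hedged sketch of that index idea using the standard tarfile module (the function names are illustrative, not NeMo/webdataset API): record each member's data offset and size once, persist the mapping however you like, and later read any file with a single seek:

```python
import tarfile

def build_index(tar_path):
    """One linear scan: map member name -> (data offset, size) in bytes."""
    index = {}
    with tarfile.open(tar_path) as tf:
        for member in tf:
            if member.isfile():
                # offset_data points at the member's payload, past its header
                index[member.name] = (member.offset_data, member.size)
    return index

def read_member(tar_path, index, name):
    """Random access: seek straight to the payload, no scanning."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

Since tar stores members uncompressed back to back (headers padded to 512-byte blocks), this works on plain .tar archives; a compressed .tar.gz would need to be decompressed or recompressed per-member first.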
