
reduce S3 uploader mem usage #1639

Closed
wants to merge 9 commits into
base: master
from

Conversation

@darcatron
Contributor

darcatron commented Oct 25, 2017

Originally, we'd recurse through the directory tree to find files to upload. Based on the heap dump, the recursion depth could get large when many subdirectories existed and had to be checked, and the biggest piece of retained memory seemed to be the lists of paths themselves.

The order in which we check the files shouldn't matter here; we just want to check every file within a dir. So this updates the logic to collect all the file paths we need to check into a single list and then loop through each path to check the file. This removes the overhead of the recursive method calls and the per-call copies of references to Set<Path> synchronizedToUpload and List<Path> toUpload.

However, it's possible that the bigger problem is the number of Paths we hold in the toUpload list, unrelated to how it's handled within the recursion. One solution to that would be to upload in limited batch sizes (e.g. 100 items at a time). Let me know your thoughts.
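For reference, a minimal sketch of the iterative approach described above (this is not the actual Singularity uploader code; the class, method names like `collectFilesToUpload`, and the glob filter are hypothetical):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class S3UploadScanner {
  // Collect every regular file under baseDir that matches the glob into a flat list,
  // instead of recursing per subdirectory and carrying the collections down each call.
  public static List<Path> collectFilesToUpload(Path baseDir, String glob) throws IOException {
    PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
    List<Path> toUpload = new ArrayList<>();
    try (Stream<Path> paths = Files.walk(baseDir)) {
      paths.filter(Files::isRegularFile)
           .filter(p -> matcher.matches(p.getFileName()))
           .forEach(toUpload::add);
    }
    return toUpload;
  }

  public static void uploadAll(Path baseDir, String glob) throws IOException {
    // Single pass over the flat list; no recursive method frames holding extra references.
    for (Path file : collectFilesToUpload(baseDir, glob)) {
      upload(file);
    }
  }

  private static void upload(Path file) {
    // placeholder for the actual S3 upload call
    System.out.println("uploading " + file);
  }
}
```

If batching turned out to be the better fix, the same flat list could be processed in fixed-size chunks (e.g. 100 paths per round of uploads) without changing the traversal.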

@darcatron darcatron requested a review from ssalinas Oct 25, 2017

@ssalinas ssalinas added this to the 0.19.0 milestone Nov 3, 2017

darcatron added some commits Jan 31, 2018

@darcatron


Contributor

darcatron commented Feb 11, 2018

I've tested various spreads of files and directory depths, and the benefits aren't as large as expected. The good news is that the changes upload properly.
I've given the code another read and realized we're passing in a set of every file we've uploaded during the run. The uploader only uses that set to check whether a file was already uploaded so that it doesn't upload it again. Singularity essentially guarantees no uploader clash because the unique request name is part of the file path, so I've removed this piece since it's a lot of added overhead (rough sketch below).
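Roughly, the simplification looks like this (hypothetical names and signatures; the real uploader differs):

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Set;

public class UploaderSketch {
  // Before: callers passed a shared, synchronized Set<Path> around just to dedupe uploads.
  void uploadBefore(List<Path> toUpload, Set<Path> alreadyUploaded) {
    for (Path file : toUpload) {
      if (alreadyUploaded.add(file)) { // only upload paths not seen this run
        upload(file);
      }
    }
  }

  // After: the unique request name embedded in each file's path already prevents
  // uploader clashes, so the set (and the cost of keeping it in sync) can be dropped.
  void uploadAfter(List<Path> toUpload) {
    for (Path file : toUpload) {
      upload(file);
    }
  }

  private void upload(Path file) {
    // placeholder for the S3 upload
  }
}
```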

@ssalinas ssalinas modified the milestones: 0.19.0, 0.20.0 Feb 15, 2018

@darcatron darcatron referenced this pull request Feb 15, 2018

Closed

removed unused path set #1714

@ssalinas


Member

ssalinas commented Feb 20, 2018

Going to close this in favor of just the unused path set change (#1714). I'll be pulling that into another bug fix as well.

@ssalinas ssalinas closed this Feb 20, 2018

@ssalinas ssalinas removed this from the 0.20.0 milestone Jun 11, 2018
