Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reduce S3 uploader mem usage #1639

Closed
wants to merge 9 commits into from
Closed

reduce S3 uploader mem usage #1639

wants to merge 9 commits into from

Conversation

matush-v
Copy link
Contributor

@matush-v matush-v commented Oct 25, 2017

Originally, we'd recurse through the directory to find files to upload. Based on the heap dump, the depth could get large if many subdirectories existed and were checked. The biggest piece seemed to be the list of paths themselves.

The order we check the files shouldn't matter here; we just want to check all files within a dir. So this updates the logic so we get all file paths we need to check, put them in a list, and loop through each path to check the file. This removes the overhead of the recursive method calls and the copies of references to Set<Path> synchronizedToUpload and List<Path> toUpload.

However, it's possible that the bigger problem is in the number Paths we have in the toUpload list unrelated to how it's handled within the recursion. One solution to that would be to just upload in limited batch sizes (e.g. 100 items at a time). Let me know your thoughts

@matush-v matush-v requested a review from ssalinas October 25, 2017 18:47
@ssalinas ssalinas added this to the 0.19.0 milestone Nov 3, 2017
@matush-v
Copy link
Contributor Author

I've tested various spreads of files and directory recursion and the benefits aren't as plentiful as expected. Good news is the changes are uploading properly.
I've given the code another read and realized we're passing in a set of all files we've uploaded during the run. The uploader only used the set to check if we've already uploaded a file so that it doesn't upload it again. Singularity essentially guarantees no uploader clash with the unique request name being in the file path. I've removed this piece since it's a lot of added overhead

@ssalinas ssalinas modified the milestones: 0.19.0, 0.20.0 Feb 15, 2018
@ssalinas
Copy link
Member

Going to close this in favor of just the unused path set one. I'll be pulling that into another bug fix as well.

@ssalinas ssalinas closed this Feb 20, 2018
@ssalinas ssalinas removed this from the 0.20.0 milestone Jun 11, 2018
@ssalinas ssalinas deleted the s3-mem branch September 5, 2019 12:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants