[DOP-7051] Support parallel operations in file classes #57
Merged
Codecov Report
@@             Coverage Diff             @@
##           develop      #57      +/-   ##
===========================================
+ Coverage    92.70%   92.75%   +0.05%
===========================================
  Files          126      126
  Lines         6000     6101     +101
  Branches      1109     1144      +35
===========================================
+ Hits          5562     5659      +97
  Misses         345      345
- Partials        93       97       +4
andy-takker reviewed Jul 5, 2023
dmitry-pedchenko approved these changes Jul 5, 2023
andy-takker approved these changes Jul 5, 2023
Change Summary
Uploading, downloading, or moving files one by one may be OK for a small number of files, but definitely not for many. For example, uploading 1k files to HDFS takes about 13 minutes: the namenode ensures that each file's blocks are replicated to the required number of data nodes and responds to the client only after that, while the rest of the files wait their turn.
All these operations are IO-bound, and Python releases the GIL while a thread waits on I/O, so we can use threads to run them in parallel.
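To illustrate the idea, here is a minimal sketch, not the code from this PR. `FakeClient`, `get_client`, `upload_one`, and `upload_all` are hypothetical names, and the simulated upload is only a `time.sleep`. The sketch assumes that a `workers` value of 1 means a plain loop and that anything larger means a thread pool. Each thread lazily creates and caches its own client via `threading.local`, so a client that is not thread-safe is never shared between threads.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeClient:
    """Hypothetical stand-in for a non-thread-safe client (FTP, HDFS, ...)."""

    def __init__(self):
        # Remember which thread created this client.
        self.owner = threading.get_ident()

    def upload(self, path):
        # Fail loudly if the client is ever used from another thread.
        assert self.owner == threading.get_ident(), "client shared between threads"
        time.sleep(0.02)  # simulated network wait; the GIL is released here
        return path, self.owner


# Per-thread storage: every thread sees its own `client` attribute.
_local = threading.local()


def get_client():
    """Return the client cached for the current thread, creating it on first use."""
    client = getattr(_local, "client", None)
    if client is None:
        client = _local.client = FakeClient()
    return client


def upload_one(path):
    return get_client().upload(path)


def upload_all(paths, workers=1):
    """Upload sequentially when workers <= 1, otherwise through a thread pool."""
    if workers <= 1:
        return [upload_one(path) for path in paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input paths.
        return list(pool.map(upload_one, paths))
```

Calling `upload_all(paths, workers=4)` returns the same paths in the same order as `upload_all(paths, workers=1)`, but the simulated waits overlap instead of adding up.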
What was changed:
- `FileMover`/`FileDownloader`/`FileUploader` were enhanced to use `ThreadPoolExecutor` for parallel file operations.
- `FileMover.Options`/`FileDownloader.Options`/`FileUploader.Options` classes got a new option `workers: int`, which the user can set to switch between a plain old `for` loop and a `ThreadPoolExecutor` with a specific number of workers.
- `FileConnection.client` and the related `_get_client`/`_is_client_closed`/`_close_client` were updated to create and cache a separate client for each thread, avoiding issues with non-thread-safe client implementations (most of them are not thread-safe).
- `source_file`/`local_file` existence check: it was performed both before and during the file handling process.
Related issue number
Checklist
- `docs/changelog/next_release/<pull request or issue id>.<change type>.rst` file added describing the change (see CONTRIBUTING.rst for details).