[DOP-7051] Support parallel operations in file classes#57

Merged
dolfinus merged 1 commit into develop from feature/DOP-7051 on Jul 5, 2023

@dolfinus dolfinus commented Jul 4, 2023

Change Summary

Uploading, downloading, or moving files one by one may be OK for a small number of files, but definitely not for many. For example, uploading 1k files to HDFS takes about 13 minutes: the namenode ensures that file blocks are replicated to the proper number of data nodes and responds to the client only after that, while the rest of the files wait their turn.

All these operations are IO-bound, so Python releases the GIL while waiting on IO, and we can use threads to run them in parallel.
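The idea can be illustrated with a minimal sketch. Here `fake_upload` and `upload_all` are hypothetical names, and `time.sleep` stands in for a blocking network call; this is not the code from this PR, only an assumption-laden illustration of running IO-bound work in a thread pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_upload(name: str) -> str:
    # Hypothetical stand-in for a blocking upload. time.sleep releases
    # the GIL, much like waiting on a socket does.
    time.sleep(0.05)
    return f"uploaded:{name}"


def upload_all(names, max_workers=4):
    # Each call runs in a pool thread, so the waits overlap.
    # executor.map returns results in the order of the input.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fake_upload, names))
```

With 8 names and 4 workers, the waits overlap in two batches instead of running as 8 sequential sleeps.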

What was changed:

  • The internal implementation of FileMover/FileDownloader/FileUploader now uses ThreadPoolExecutor for parallel file operations.
  • The FileMover.Options / FileDownloader.Options / FileUploader.Options classes got a new option, workers: int, which the user can set to switch between a plain old for loop and a ThreadPoolExecutor with a specific number of workers.
  • The internal implementation of FileConnection.client and the related _get_client/_is_client_closed/_close_client was updated to create and cache a separate client for each thread, avoiding issues with client implementations that are not thread-safe (most of them are not).
  • Removed the duplicated check for source_file/local_file existence, which was performed both before and during file handling.
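The two central changes listed above, a `workers` switch and a per-thread client cache, can be sketched as follows. Everything here (`FakeClient`, `Connection`, `process_files`, and the default `workers=1`) is a hypothetical illustration, not the actual onETL code, which may be organized differently.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeClient:
    """Hypothetical stand-in for a client that is not thread-safe."""

    def __init__(self):
        # Remember which thread created this client.
        self.owner = threading.get_ident()


class Connection:
    """Hypothetical connection that caches one client per thread."""

    def __init__(self):
        self._local = threading.local()

    @property
    def client(self):
        # Attributes on threading.local are visible only to the thread
        # that set them, so each thread lazily gets its own client.
        client = getattr(self._local, "client", None)
        if client is None:
            client = FakeClient()
            self._local.client = client
        return client


def process_files(connection, files, workers=1):
    def handle(file):
        # Report which thread's client handled the file.
        return file, connection.client.owner

    if workers <= 1:
        # Plain for loop in the calling thread.
        return [handle(file) for file in files]

    # Thread pool with the requested number of workers.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(handle, files))
```

With `workers=1` every file is handled by the caller's own client; with more workers, each pool thread creates and reuses its own client, so no client object is shared between threads.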

Related issue number

Checklist

  • Commit message and PR title is comprehensive
  • Keep the change as small as possible
  • Unit and integration tests for the changes exist
  • Tests pass on CI and coverage does not decrease
  • Documentation reflects the changes where applicable
  • docs/changelog/next_release/<pull request or issue id>.<change type>.rst file added describing change
    (see CONTRIBUTING.rst for details.)
  • My PR is ready to review.

@dolfinus dolfinus self-assigned this Jul 4, 2023
@dolfinus dolfinus force-pushed the feature/DOP-7051 branch from d8c38f1 to 42275ce Compare July 4, 2023 10:35
@dolfinus dolfinus temporarily deployed to test-pypi July 4, 2023 10:35 — with GitHub Actions Inactive
@dolfinus dolfinus force-pushed the feature/DOP-7051 branch from 42275ce to 56cf579 Compare July 4, 2023 11:41
@dolfinus dolfinus temporarily deployed to test-pypi July 4, 2023 11:41 — with GitHub Actions Inactive
@dolfinus dolfinus force-pushed the feature/DOP-7051 branch from 56cf579 to c4d50db Compare July 4, 2023 11:50
@dolfinus dolfinus temporarily deployed to test-pypi July 4, 2023 11:50 — with GitHub Actions Inactive
@dolfinus dolfinus force-pushed the feature/DOP-7051 branch from c4d50db to 75917e3 Compare July 4, 2023 12:01
@dolfinus dolfinus temporarily deployed to test-pypi July 4, 2023 12:01 — with GitHub Actions Inactive

codecov bot commented Jul 4, 2023

Codecov Report

Merging #57 (0be9631) into develop (0210d99) will increase coverage by 0.05%.
The diff coverage is 97.76%.

@@             Coverage Diff             @@
##           develop      #57      +/-   ##
===========================================
+ Coverage    92.70%   92.75%   +0.05%     
===========================================
  Files          126      126              
  Lines         6000     6101     +101     
  Branches      1109     1144      +35     
===========================================
+ Hits          5562     5659      +97     
  Misses         345      345              
- Partials        93       97       +4     

Impacted Files | Coverage Δ
...netl/connection/file_connection/file_connection.py | 93.68% <94.11%> (-0.18%) ⬇️
onetl/file/file_mover/file_mover.py | 97.19% <97.61%> (-0.15%) ⬇️
onetl/file/file_uploader/file_uploader.py | 94.81% <97.87%> (+0.39%) ⬆️
onetl/file/file_downloader/file_downloader.py | 94.93% <97.91%> (+0.23%) ⬆️
onetl/connection/file_connection/ftp.py | 96.34% <100.00%> (ø)
onetl/connection/file_connection/hdfs.py | 95.92% <100.00%> (ø)
onetl/connection/file_connection/s3.py | 97.12% <100.00%> (ø)
onetl/connection/file_connection/sftp.py | 86.29% <100.00%> (ø)
onetl/connection/file_connection/webdav.py | 97.27% <100.00%> (ø)
onetl/log.py | 85.71% <100.00%> (ø)

@dolfinus dolfinus marked this pull request as ready for review July 4, 2023 12:15
@dolfinus dolfinus force-pushed the feature/DOP-7051 branch from 75917e3 to 121129f Compare July 4, 2023 12:25
@dolfinus dolfinus temporarily deployed to test-pypi July 4, 2023 12:25 — with GitHub Actions Inactive
@dolfinus dolfinus force-pushed the feature/DOP-7051 branch from 121129f to 363fdc6 Compare July 4, 2023 12:47
@dolfinus dolfinus temporarily deployed to test-pypi July 4, 2023 12:47 — with GitHub Actions Inactive
@dolfinus dolfinus force-pushed the feature/DOP-7051 branch from 363fdc6 to 0be9631 Compare July 4, 2023 12:57
@dolfinus dolfinus temporarily deployed to test-pypi July 4, 2023 12:57 — with GitHub Actions Inactive
@dolfinus dolfinus requested a review from andy-takker July 5, 2023 13:44
@dolfinus dolfinus merged commit d6edc7c into develop Jul 5, 2023
@dolfinus dolfinus deleted the feature/DOP-7051 branch July 5, 2023 13:50