Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Add translation of IOConfig to PyArrow filesystem arguments #1592

Merged
merged 18 commits into from
Nov 13, 2023

Conversation

jaychia
Copy link
Contributor

@jaychia jaychia commented Nov 10, 2023

Closes: #1589

  1. Deprecates the passing around of fsspec filesystems in PythonStorageConfig. This was actually never used already since we deprecated the user-facing fsspec APIs in Daft 0.2.0
  2. Significant refactors to our pyarrow filesystem utilities in filesystem.py:
    • Instead of taking in an optional filesystem arg, our _resolve_paths_and_filesystem utility now takes in an optional IOConfig
    • When inferring a filesystem (in the now-renamed _infer_filesystem) we now inmplement our own logic to translate IOConfig information into pyarrow filesystem kwargs
  3. We now correctly propagate io_config into _resolve_paths_and_filesystem for all cases where we are still using PythonStorageConfig in our read APIs (i.e. when users opt to use use_native_downloader=False
  4. In addition to the above, we add new functionality to propagate an IOConfig in our write_* APIs as well

@github-actions github-actions bot added the enhancement New feature or request label Nov 10, 2023
Copy link

codecov bot commented Nov 10, 2023

Codecov Report

Merging #1592 (7f24a74) into main (a0aa639) will decrease coverage by 2.12%.
Report is 3 commits behind head on main.
The diff coverage is 58.92%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1592      +/-   ##
==========================================
- Coverage   85.29%   83.17%   -2.12%     
==========================================
  Files          54       54              
  Lines        5161     5189      +28     
==========================================
- Hits         4402     4316      -86     
- Misses        759      873     +114     
Files Coverage Δ
daft/dataframe/dataframe.py 89.58% <100.00%> (ø)
daft/execution/execution_step.py 92.85% <100.00%> (+0.01%) ⬆️
daft/execution/physical_plan.py 92.97% <ø> (-2.03%) ⬇️
daft/execution/rust_physical_plan_shim.py 87.50% <ø> (-11.12%) ⬇️
daft/io/_csv.py 94.73% <100.00%> (ø)
daft/io/_json.py 100.00% <100.00%> (ø)
daft/io/_parquet.py 100.00% <100.00%> (ø)
daft/logical/builder.py 86.72% <100.00%> (-2.66%) ⬇️
daft/table/schema_inference.py 98.14% <100.00%> (+3.70%) ⬆️
daft/table/table_io.py 95.80% <100.00%> (-0.73%) ⬇️
... and 1 more

... and 5 files with indirect coverage changes

daft/filesystem.py Show resolved Hide resolved
daft/filesystem.py Outdated Show resolved Hide resolved
daft/filesystem.py Outdated Show resolved Hide resolved
daft/filesystem.py Show resolved Hide resolved
src/daft-plan/src/logical_ops/sink.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@clarkzinzow clarkzinzow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, LGTM!

Comment on lines +219 to +220
_set_if_not_none(translated_kwargs, "access_key", s3_config.key_id)
_set_if_not_none(translated_kwargs, "secret_key", s3_config.access_key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This naming asymmetry is a bit spooky to me, any particular reason for it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we just have it named differently :(

PyArrow:

  1. $AWS_ACCESS_KEY_ID: access_key
  2. $AWS_SECRET_ACCESS_KEY: secret_key

Daft:

  1. $AWS_ACCESS_KEY_ID: key_id
  2. $AWS_SECRET_ACCESS_KEY: access_key

Maybe @samster25 knows where we inherited our naming from?

Copy link
Member

@samster25 samster25 Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaychia Ah yeah, I based it off the aws sdk and boto3 suffix rather than what pyarrow did. What do you think we should use instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I wasn't proposing a change, just wondering where we got it from 😛

@jaychia jaychia merged commit 819dcd8 into main Nov 13, 2023
35 of 37 checks passed
@jaychia jaychia deleted the jay/io-config-pyarrow branch November 13, 2023 18:39
jaychia added a commit that referenced this pull request Nov 15, 2023
The change in #1592 introduced a bug where the inferred PyArrow
S3FileSystem doesn't correctly infer the region of the bucket

To work-around this, we can manually specify the `region_name` in our
IOConfig. This problem should be less common once we move towards
all-native reads.

Co-authored-by: Jay Chia <jaychia94@gmail.com@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Writing to S3-compatible location
3 participants