-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] Enable feature-flagged native downloader in daft.read_parquet #1190
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #1190 +/- ##
==========================================
+ Coverage 88.37% 88.40% +0.02%
==========================================
Files 54 55 +1
Lines 5576 5606 +30
==========================================
+ Hits 4928 4956 +28
- Misses 648 650 +2
|
…file metadata retrieval
e118673
to
1e5309e
Compare
parquet_statistics = Table.read_parquet_statistics( | ||
list(filepaths_to_infos.keys()), source_info.io_config | ||
).to_pydict() | ||
for path, num_rows in zip(parquet_statistics["uris"], parquet_statistics["row_count"]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: This code relies on hardcoded strings which are the names of the table's series. A little error prone perhaps?
@@ -364,6 +364,6 @@ def read_parquet_statistics( | |||
io_config: IOConfig | None = None, | |||
) -> Table: | |||
if not isinstance(paths, Series): | |||
paths = Series.from_pylist(paths) | |||
|
|||
paths = Series.from_pylist(paths, name="uris") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: Our Series objects default to the name "list_col"
. Modified here to have a better name, but needs to be consistent across calls and so I added an assert after.
src/daft-io/src/python.rs
Outdated
@@ -34,6 +34,20 @@ impl PyIOConfig { | |||
config: self.config.s3.clone(), | |||
}) | |||
} | |||
|
|||
pub fn __setstate__(&mut self, py: Python, state: PyObject) -> PyResult<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had to make this class pickleable for our optimizer's deepcopy
Enables changes to
daft.read_parquet
such that when we specifyuse_native_downloader=True
, this uses our new Rust-based native Parquet downloading and parsing for:Table