Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] [Scan Operator] Integrate size_bytes with ScanOperators #1586

Merged
merged 4 commits into from
Nov 10, 2023

Conversation

clarkzinzow
Copy link
Contributor

This PR integrates size_bytes, which is used for eventual resource requests for the Ray runner, with the ScanOperator implementations (currently just GlobScanOperator) and MicroPartition.

@github-actions github-actions bot added the enhancement New feature or request label Nov 9, 2023
Copy link

codecov bot commented Nov 9, 2023

Codecov Report

Merging #1586 (3408af6) into main (a0aa639) will decrease coverage by 0.11%.
The diff coverage is 66.66%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1586      +/-   ##
==========================================
- Coverage   85.29%   85.19%   -0.11%     
==========================================
  Files          54       54              
  Lines        5161     5180      +19     
==========================================
+ Hits         4402     4413      +11     
- Misses        759      767       +8     
Files Coverage Δ
daft/execution/rust_physical_plan_shim.py 98.61% <ø> (ø)
daft/runners/partitioning.py 81.19% <100.00%> (ø)
daft/table/micropartition.py 89.58% <100.00%> (ø)
daft/execution/physical_plan.py 92.97% <63.15%> (-2.03%) ⬇️

... and 1 file with indirect coverage changes

} else {
// if the table is not loaded and we dont have stats, just return 0.
// if the table is not loaded, we dont have stats, and we don't have the file size in bytes, just return 0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return None in that case as in... "I don't know"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can change this to return a DaftResult<Option<usize>>, which should be fine for resource request usage, but I'll have to double-check other usages to make sure that a None won't blow things up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out that we sum size_bytes() results in a few places in physical_plan.py, so we'd need to do this 0 fallback there anyways. How about we keep this 0 fallback for now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me, or we could also do None to 0 conversion in the physical
planner and add a loud comment there.

The case I’m worried about is if we don’t know the size of a file, and tell
the planner that it is 0, but the file in fact turns out to be huge 😝

At least if the planner knows the file size is unknown it can maybe handle
it a little differently in the future

src/daft-scan/src/glob.rs Outdated Show resolved Hide resolved
src/daft-scan/src/lib.rs Outdated Show resolved Hide resolved
@clarkzinzow clarkzinzow merged commit cdc1b94 into main Nov 10, 2023
36 of 37 checks passed
@clarkzinzow clarkzinzow deleted the clark/size-bytes-globbing branch November 10, 2023 23:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants