[FEAT] [Scan Operator] Add `ChunkSpec` for specifying format-specific per-file row subset selection for `ScanTask`s. #1590

clarkzinzow · 2023-11-10T03:05:35Z

This PR adds a `ChunkSpec` abstraction, plus its integration, that allows format-specific, per-file row subset selection for `ScanTask`s. The first concrete implementation is for Parquet, via index-based row group selection. Pushing this down to be a `ScanTask` concept makes it easier to specify such selections at a per-file level without polluting user-facing abstractions like the file format config, and should make merging/splitting `ScanTask`s along the row dimension easier.
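To illustrate the idea, here is a minimal Python sketch of per-file row group selection. It is not the code in this PR: `ParquetChunkSpec`, `DataFileSource`, `select_row_groups`, and `split_by_row_group` are hypothetical names, and this `ScanTask` is a toy stand-in for the real one.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ParquetChunkSpec:
    """Hypothetical Parquet variant: indices of the row groups to read."""
    row_groups: Tuple[int, ...]


@dataclass(frozen=True)
class DataFileSource:
    """Hypothetical per-file entry; chunk_spec=None means read the whole file."""
    path: str
    chunk_spec: Optional[ParquetChunkSpec] = None


@dataclass(frozen=True)
class ScanTask:
    """Toy stand-in: a scan task is just a tuple of per-file sources."""
    sources: Tuple[DataFileSource, ...]


def select_row_groups(num_row_groups: int, spec: Optional[ParquetChunkSpec]) -> List[int]:
    """Resolve a chunk spec against a file that has num_row_groups row groups."""
    if spec is None:
        return list(range(num_row_groups))
    for idx in spec.row_groups:
        if not 0 <= idx < num_row_groups:
            raise IndexError(f"row group {idx} out of range (file has {num_row_groups})")
    return list(spec.row_groups)


def split_by_row_group(source: DataFileSource, num_row_groups: int) -> List[ScanTask]:
    """Split one file along the row dimension: one task per selected row group."""
    selected = select_row_groups(num_row_groups, source.chunk_spec)
    return [
        ScanTask((DataFileSource(source.path, ParquetChunkSpec((idx,))),))
        for idx in selected
    ]
```

Because the selection lives on each file entry rather than on a format-wide config, two files in the same task can carry different row group selections, and splitting a task only requires rewriting the per-file specs.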

In the future, other formats such as CSV and NDJSON could support selecting by byte range ending at row boundaries.
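As a rough sketch of what byte-range selection could look like for a newline-delimited format, the hypothetical helper below snaps a requested byte range to row boundaries. It assumes one row per line, which holds for NDJSON but not for CSV files with newlines inside quoted fields. `align_to_rows` is an illustrative name and not part of this PR.

```python
from typing import Tuple


def align_to_rows(data: bytes, start: int, end: int) -> Tuple[int, int]:
    """Snap [start, end) to row boundaries.

    A row belongs to the range if its first byte falls inside [start, end),
    so adjacent requested ranges map to disjoint, gap-free aligned ranges.
    """

    def next_row_start(pos: int) -> int:
        if pos <= 0:
            return 0
        if pos >= len(data):
            return len(data)
        # If the byte before pos is a newline, pos already starts a row;
        # otherwise skip ahead to the byte after the next newline.
        newline = data.find(b"\n", pos - 1)
        return len(data) if newline == -1 else newline + 1

    return next_row_start(start), next_row_start(end)


# Usage: three 8-byte rows, split at an arbitrary offset (byte 10).
data = b'{"a":1}\n{"a":2}\n{"a":3}\n'
first = align_to_rows(data, 0, 10)
second = align_to_rows(data, 10, len(data))
```

With this convention, every row lands in exactly one of the two ranges even though the requested split point falls in the middle of a row.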

codecov · 2023-11-10T03:18:39Z

Codecov Report

Merging #1590 (3c858f8) into main (cdc1b94) will not change coverage.
The diff coverage is n/a.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1590   +/-   ##
=======================================
  Coverage   85.19%   85.19%           
=======================================
  Files          54       54           
  Lines        5180     5180           
=======================================
  Hits         4413     4413           
  Misses        767      767

| Files | Coverage Δ |
|---|---|
| daft/table/micropartition.py | `89.58% <ø> (ø)` |
| daft/table/table.py | `81.97% <ø> (ø)` |

clarkzinzow requested review from samster25 and jaychia November 10, 2023 03:05

github-actions bot added the enhancement label Nov 10, 2023

jaychia approved these changes Nov 10, 2023

Review comment on src/daft-micropartition/src/micropartition.rs (outdated, resolved)

clarkzinzow force-pushed the clark/row-group-selector-per-file branch from d9a32d3 to af75f86 November 10, 2023 19:38

clarkzinzow added 3 commits November 10, 2023 15:12

Add support for per-file row group selectors.

8541ff0

Use ChunkSpec to specify format-specific row subset selection for a ScanTask.

1c81e17

Use shared utility for getting row groups.

3c858f8

clarkzinzow force-pushed the clark/row-group-selector-per-file branch from af75f86 to 3c858f8 November 10, 2023 23:15

clarkzinzow merged commit ef4d2fd into main Nov 11, 2023
37 checks passed

clarkzinzow deleted the clark/row-group-selector-per-file branch November 11, 2023 00:15
