Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] parquet reader refactor, add parquet_stats_reader and parquet_schema_reader (1/2) #1191

Merged
merged 13 commits into from
Jul 27, 2023

Conversation

samster25
Copy link
Member

@samster25 samster25 commented Jul 26, 2023

  • Refactors Native parquet reader (without perf improvements)
  • adds ability to add offset and limit to number of rows
  • adds python: Table.read_parquet_statistics to grab basic stats from a series of urls
  • adds python: Schema.from_parquet to generate Schema from parquet file

@samster25 samster25 changed the title Sammy/parquet reader refactor [FEAT] parquet reader refactor, add parquet_stats_reader and parquet_schema_reader Jul 26, 2023
@github-actions github-actions bot added the enhancement New feature or request label Jul 26, 2023
@samster25 samster25 changed the title [FEAT] parquet reader refactor, add parquet_stats_reader and parquet_schema_reader [FEAT] parquet reader refactor, add parquet_stats_reader and parquet_schema_reader (1/2) Jul 26, 2023
@samster25 samster25 requested a review from jaychia July 26, 2023 05:04
@codecov
Copy link

codecov bot commented Jul 26, 2023

Codecov Report

Merging #1191 (0e5a9c2) into main (b13bdf8) will decrease coverage by 0.07%.
Report is 2 commits behind head on main.
The diff coverage is 61.53%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1191      +/-   ##
==========================================
- Coverage   88.44%   88.37%   -0.07%     
==========================================
  Files          54       54              
  Lines        5564     5576      +12     
==========================================
+ Hits         4921     4928       +7     
- Misses        643      648       +5     
Files Changed Coverage Δ
daft/table/table.py 87.50% <50.00%> (-0.97%) ⬇️
daft/logical/schema.py 92.30% <71.42%> (-1.81%) ⬇️

Copy link
Contributor

@jaychia jaychia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

max_request_size: 16 * 1024 * 1024,
split_threshold: 24 * 1024 * 1024,
}));
pub fn read_parquet_statistics(uris: &Series, io_client: Arc<IOClient>) -> DaftResult<Table> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a requirement for us to get the result of this operation as a Table (a Vec<ParquetFileStatstics> or Vec<(int, int, str)> could suffice) but if this avoids more Python glue code then should be fine

io_client,
)?
.into())
})
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python-facing functions look good, should work!

row_groups: list[int] | None = None,
file_size: None | int = None,
start_offset: int | None = None,
num_rows: int | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@samster25 samster25 merged commit 440b8b3 into main Jul 27, 2023
17 of 18 checks passed
@samster25 samster25 deleted the sammy/parquet-reader-refactor branch July 27, 2023 00:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants