Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: enable file statistics #1789

Merged
merged 8 commits into from
Sep 20, 2023
Merged

Conversation

universalmind303
Copy link
Contributor

@universalmind303 universalmind303 commented Sep 19, 2023

now we can get some metadata only queries like this

> explain select count(*) from 'hits.parquet';
┌───────────────┬───────────────────────────────────────────────────────────────────────────────┐
│ plan_type     │ plan                                                                          │
│ ──            │ ──                                                                            │
│ Utf8          │ Utf8                                                                          │
╞═══════════════╪═══════════════════════════════════════════════════════════════════════════════╡
│ logical_plan  │ Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]↵                            │
│               │   TableScan: /Users/xxxxxxxxxxxxx/Downloads/hits.parquet projection=[WatchID] │
│ physical_plan │ ProjectionExec: expr=[99997497 as COUNT(UInt8(1))]↵                           │
│               │   EmptyExec: produce_one_row=true↵                                            │
│               │                                                                               │
└───────────────┴───────────────────────────────────────────────────────────────────────────────┘

Copy link
Collaborator

@tychoish tychoish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code looks reasonable enough to me 😀

I think regression tests for:

  • ndjson extension checking
  • ensure that we do metadata only scans for counts when we expect? (it's easy to regress on these things and we don't know catch it, but users often do.)

Other things:

  • tighter commit message on the PR (I think either the conventional thing, or <component>: <user impact>) if you plan to squash. (also you/we'd called this "metadata-only operations" or something when we talked yesterday, which is a great way to talk about this.
  • when using glaredb on local files, there's also (maybe) some metadata in files that we could use for quick counts/etc?

@universalmind303
Copy link
Contributor Author

universalmind303 commented Sep 19, 2023

ensure that we do metadata only scans for counts when we expect? (it's easy to regress on these things and we don't know catch it, but users often do.)

so this is actually just plugging in to the object store & datafusions native handling, we were just missing out on a lot of these optimizations because we were not fetching the file statistics. For fileformats other than parquet, the infer_file_statistics is a noop & just returns the default values.

tighter commit message on the PR (I think either the conventional thing, or : ) if you plan to squash. (also you/we'd called this "metadata-only operations" or something when we talked yesterday, which is a great way to talk about this.

yeah i'm terrible at commit messages. I prefer doing squash commits & using the PR title as the commit. That way you don't even have to think about it.

when using glaredb on local files, there's also (maybe) some metadata in files that we could use for quick counts/etc?

this'll be the case for parquet files as they have all the metadata needed in the footer. See the explain statement in my first comment, that is from disk. Since this is acting at the objectstore instead of at the individual implementation (http,fs,aws,...) it'll apply those statistics based optimizations wherever possible.

@tychoish
Copy link
Collaborator

yeah i'm terrible at commit messages. I prefer doing squash commits & using the PR title as the commit. That way you don't even have to think about it.

I definitely agree and am a squash/merge kind of guy. I thought your ndjson and use file statistics commits were better than the PR title. Anyway, I think as a (very loose) guideline PR titles that:

  • have the kind of commit (fix/feature/etc. I actually care least about this, but it seems like other folks do it, so...)
  • the major subsystem
  • what the impact to users is.

Sorry for perhaps being too terse earlier, I was thinking about having this optimization in local parquet files? It's perhaps ok if we handle this as part of future work.

@universalmind303
Copy link
Contributor Author

partially closes #1776

We can still make some further optimizations such as row group parallelization & column parallelization

@universalmind303 universalmind303 changed the title use object-store file statistics feat: enable file statistics Sep 19, 2023
@universalmind303
Copy link
Contributor Author

we should follow up this PR with #1791

@universalmind303
Copy link
Contributor Author

dropping this in here to show the perf gains
this branch: 125ms
main: 459ms

glaredb  universalmind303/file-statistics ❯ timeit {./target/release/glaredb -q "select count(*) from 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet'"}
┌─────────────────┐
│ COUNT(UInt8(1)) │
│              ── │
│           Int64 │
╞═════════════════╡
│         3066766 │
└─────────────────┘
125ms 269µs 291ns
glaredb  universalmind303/file-statistics ❯ timeit {./glaredb-main -q "select count(*) from 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet'"}
┌─────────────────┐
│ COUNT(UInt8(1)) │
│              ── │
│           Int64 │
╞═════════════════╡
│         3066766 │
└─────────────────┘
459ms 394µs 167ns

@universalmind303 universalmind303 enabled auto-merge (squash) September 20, 2023 13:21
@universalmind303 universalmind303 merged commit 327d185 into main Sep 20, 2023
7 checks passed
@universalmind303 universalmind303 deleted the universalmind303/file-statistics branch September 20, 2023 13:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

non-optimal physical plan from explain select count(*) from 'hits.parquet';
2 participants