feat: enable file statistics #1789
the code looks reasonable enough to me 😀
I think we should add regression tests for:
- ndjson extension checking
- ensuring that we do metadata-only scans for counts when we expect to (it's easy to regress on these things and we don't always catch it, but users often do)
Other things:
- tighter commit message on the PR (I think either the conventional thing, or <component>: <user impact>) if you plan to squash. (Also, you/we'd called this "metadata-only operations" or something when we talked yesterday, which is a great way to talk about this.)
- when using glaredb on local files, there's also (maybe) some metadata in files that we could use for quick counts/etc?
so this is actually just plugging into the object store & DataFusion's native handling; we were just missing out on a lot of these optimizations because we were not fetching the file statistics. For file formats other than parquet, the
yeah i'm terrible at commit messages. I prefer doing squash commits & using the PR title as the commit. That way you don't even have to think about it.
this'll be the case for parquet files as they have all the metadata needed in the footer. See the
I definitely agree and am a squash/merge kind of guy. I thought your ndjson and
Sorry for perhaps being too terse earlier, I was thinking about having this optimization in local parquet files? It's perhaps ok if we handle this as part of future work.
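For anyone skimming the thread, here is a rough, library-free sketch of the "metadata-only" idea being discussed: answer a row count from per-file statistics when they are available, and only fall back to scanning a file when they are not. Everything here (`FileStats`, `count_rows`, the `scan` callback) is a hypothetical name made up for illustration; it is not GlareDB's or DataFusion's actual API. Returning the number of files scanned is one possible way a regression test could assert that a count stayed metadata-only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class FileStats:
    """Hypothetical per-file statistics; num_rows is None when unknown."""
    path: str
    num_rows: Optional[int]


def count_rows(
    files: Iterable[FileStats],
    scan: Callable[[str], int],
) -> Tuple[int, int]:
    """Return (total_rows, files_scanned).

    Files with known statistics are counted without reading any data;
    `scan` is only called for files whose row count is unknown.
    """
    total = 0
    scanned = 0
    for f in files:
        if f.num_rows is not None:
            # Metadata-only path: trust the recorded statistic.
            total += f.num_rows
        else:
            # Fallback path: no statistic, so the file must be read.
            total += scan(f.path)
            scanned += 1
    return total, scanned


# Illustrative usage with made-up numbers: two files carry statistics,
# one does not and has to be scanned (the fake scan reports 7 rows).
files = [
    FileStats("a.parquet", 100),
    FileStats("b.parquet", 250),
    FileStats("c.ndjson", None),
]
print(count_rows(files, scan=lambda path: 7))  # (357, 1)
```

A test in this style would pass a `scan` callback that fails if called and check that the second element of the result is 0 for inputs where every file has statistics.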
partially closes #1776. We can still make some further optimizations, such as row group parallelization & column parallelization.
we should follow up this PR with #1791
dropping this in here to show the perf gains
now we can get some metadata only queries like this