Optimize count from data in most formats, better work with _file/_path virtual columns #53174
Conversation
…l columns in file/s3/url/hdfs/azure functions
This is an automated comment for commit 98706ce with a description of existing statuses. It's updated for the latest CI run.
Just for clarity, and I believe this is the case: this is already implemented in the s3 table function, correct?
Yes, for s3 and azureBlobStorage it already worked. I also implemented it for file/url/hdfs.
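For context, a sketch of the kinds of queries this covers — the bucket and file paths below are hypothetical examples, not from the PR:

```sql
-- Count-only query: the row count can come from format metadata
-- (e.g. Parquet) instead of reading the actual data.
SELECT count() FROM s3('https://bucket.s3.amazonaws.com/data/*.parquet');

-- Only virtual columns are requested: the file data itself does not
-- need to be read, only the number of rows per file.
SELECT _path, _file FROM file('data/*.csv');
```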
I think I will split this PR into 4 separate PRs:
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Optimize count from files in most input formats. Don't read actual data when only `_file`/`_path` columns are requested, and just count the number of rows. Use filters on `_file`/`_path` before reading data in file/url/hdfs functions; fix issues with the `_path`/`_file` virtual columns. Use a cache for the number of rows in files that checks the file's last modification time (just like the schema inference cache). Optimize GROUP BY with all constant keys (optimizes queries like `SELECT count() FROM file(...) GROUP BY _file/_path`).

Closes #44334
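The changelog items above can be illustrated with a couple of example queries; these are a hedged sketch, with hypothetical paths:

```sql
-- Filter on _file/_path is applied before any data is read,
-- so non-matching files are skipped entirely.
SELECT count()
FROM url('https://example.com/data/{1..100}.csv', 'CSV')
WHERE _file = '42.csv';

-- GROUP BY with all-constant keys per file: can be answered with a
-- per-file row count instead of a full aggregation over the data.
SELECT _file, count() FROM file('logs/*.csv') GROUP BY _file;
```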
Documentation entry for user-facing changes