feat: enable file statistics #1789

universalmind303 · 2023-09-19T21:30:13Z

Now some queries can be answered from metadata alone, like this:

> explain select count(*) from 'hits.parquet';
┌───────────────┬───────────────────────────────────────────────────────────────────────────────┐
│ plan_type     │ plan                                                                          │
│ ──            │ ──                                                                            │
│ Utf8          │ Utf8                                                                          │
╞═══════════════╪═══════════════════════════════════════════════════════════════════════════════╡
│ logical_plan  │ Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]↵                            │
│               │   TableScan: /Users/xxxxxxxxxxxxx/Downloads/hits.parquet projection=[WatchID] │
│ physical_plan │ ProjectionExec: expr=[99997497 as COUNT(UInt8(1))]↵                           │
│               │   EmptyExec: produce_one_row=true↵                                            │
│               │                                                                               │
└───────────────┴───────────────────────────────────────────────────────────────────────────────┘
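
One way to check mechanically that a plan like the one above is metadata-only is to look at the operator names in the physical plan text. This is a minimal sketch, assuming the plan is available as a string; `is_metadata_only_plan` is a hypothetical helper, not part of glaredb or DataFusion, and it only knows the two operators shown in the output above.

```rust
/// Hypothetical helper: returns true when the physical plan text contains
/// only the operators seen in the metadata-only plan above
/// (`ProjectionExec` over `EmptyExec`), and at least one `EmptyExec`.
fn is_metadata_only_plan(physical_plan: &str) -> bool {
    let mut saw_empty_exec = false;
    for line in physical_plan.lines() {
        // The operator name is the text before the first ':' on each line.
        let op = line.trim().split(':').next().unwrap_or("").trim();
        match op {
            "" => continue,
            "EmptyExec" => saw_empty_exec = true,
            "ProjectionExec" => {}
            // Any other operator (for example a file scan) means data is read.
            _ => return false,
        }
    }
    saw_empty_exec
}
```

Feeding it the two `physical_plan` lines from the explain output returns `true`; a plan that contains any scan or aggregate operator returns `false`.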

crates/datasources/src/object_store/http.rs

crates/datasources/src/object_store/mod.rs

tychoish

the code looks reasonable enough to me 😀

I think we should add regression tests for:

- ndjson extension checking
- ensuring that we do metadata-only scans for counts when we expect to (it's easy to regress on these things and we don't always catch it, but users often do)
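
The ndjson extension check could be covered by a small unit test around extension matching. This is a sketch only: `FileType` and `file_type_from_path` are hypothetical names, and treating "ndjson" the same as "json" is an assumption made for illustration, not a statement about how glaredb maps extensions.

```rust
use std::path::Path;

/// Hypothetical stand-in for a file type enum; the names are illustrative.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum FileType {
    Parquet,
    Csv,
    Json,
}

/// Maps a path's extension to a file type, ignoring ASCII case.
/// Assumption for this sketch: "ndjson" is handled like "json".
fn file_type_from_path(path: &str) -> Option<FileType> {
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "parquet" => Some(FileType::Parquet),
        "csv" => Some(FileType::Csv),
        "json" | "ndjson" => Some(FileType::Json),
        _ => None,
    }
}
```

A regression test would then assert that a path such as `events.ndjson` resolves to a known type and that an unknown extension resolves to `None`.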

Other things:

- a tighter commit message on the PR (I think either the conventional thing, or <component>: <user impact>) if you plan to squash. (Also, you/we'd called this "metadata-only operations" or something when we talked yesterday, which is a great way to talk about this.)
- when using glaredb on local files, there's also (maybe) some metadata in files that we could use for quick counts/etc?

universalmind303 · 2023-09-19T22:08:48Z

> ensure that we do metadata only scans for counts when we expect? (it's easy to regress on these things and we don't always catch it, but users often do.)

So this is actually just plugging into the object store and DataFusion's native handling; we were missing out on a lot of these optimizations because we were not fetching the file statistics. For file formats other than parquet, infer_file_statistics is a no-op and just returns the default values.
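
To make the fallback concrete, here is a toy model of that behavior. Everything in it is hypothetical (`Statistics`, `Format`, `infer_stats_sketch`, `plan_count`); it only mirrors the idea that parquet can report a row count while other formats return default (unknown) statistics, and it is not DataFusion's or glaredb's actual API.

```rust
/// Hypothetical, simplified statistics: `None` means "unknown".
#[derive(Debug, Default, Clone, PartialEq)]
struct Statistics {
    num_rows: Option<u64>,
}

/// Hypothetical file formats; only parquet carries a row count here.
#[derive(Debug, Clone, Copy)]
enum Format {
    Parquet { footer_rows: u64 },
    Csv,
    NdJson,
}

/// Sketch of the no-op behavior: non-parquet formats return the defaults.
fn infer_stats_sketch(format: Format) -> Statistics {
    match format {
        Format::Parquet { footer_rows } => Statistics {
            num_rows: Some(footer_rows),
        },
        Format::Csv | Format::NdJson => Statistics::default(),
    }
}

/// How a `count(*)` could be answered given the statistics.
#[derive(Debug, PartialEq)]
enum CountPlan {
    FromMetadata(u64),
    FullScan,
}

/// With a known row count the scan can be skipped; otherwise fall back.
fn plan_count(stats: &Statistics) -> CountPlan {
    match stats.num_rows {
        Some(n) => CountPlan::FromMetadata(n),
        None => CountPlan::FullScan,
    }
}
```

In this model a parquet file yields `CountPlan::FromMetadata`, while csv and ndjson yield `CountPlan::FullScan` because their statistics stay at the default.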

> tighter commit message on the PR (I think either the conventional thing, or <component>: <user impact>) if you plan to squash. (also you/we'd called this "metadata-only operations" or something when we talked yesterday, which is a great way to talk about this.)

yeah i'm terrible at commit messages. I prefer doing squash commits & using the PR title as the commit. That way you don't even have to think about it.

> when using glaredb on local files, there's also (maybe) some metadata in files that we could use for quick counts/etc?

This'll be the case for parquet files, as they have all the metadata needed in the footer. See the explain statement in my first comment; that one is from disk. Since this acts at the object store level instead of in the individual implementations (http,fs,aws,...), it'll apply those statistics-based optimizations wherever possible.
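
As a rough illustration of why the footer is enough for a count: a parquet footer records a row count per row group, so the total is a sum over that metadata and no data pages need to be read. The sketch below assumes a hypothetical `RowGroupMeta` type standing in for whatever the parquet reader exposes; it is not the real reader API.

```rust
/// Hypothetical model of the per-row-group metadata stored in the footer.
struct RowGroupMeta {
    num_rows: u64,
}

/// Total row count computed from footer metadata only.
/// Returns `None` if the sum would overflow `u64`.
fn row_count_from_footer(row_groups: &[RowGroupMeta]) -> Option<u64> {
    row_groups
        .iter()
        .try_fold(0u64, |acc, rg| acc.checked_add(rg.num_rows))
}
```

The same function works regardless of where the bytes came from (local disk, http, or a cloud bucket), which is the point of doing this at the object store level.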

tychoish · 2023-09-19T22:29:28Z

> yeah i'm terrible at commit messages. I prefer doing squash commits & using the PR title as the commit. That way you don't even have to think about it.

I definitely agree and am a squash/merge kind of guy. I thought your ndjson and use file statistics commits were better than the PR title. Anyway, I think as a (very loose) guideline, PR titles should have:

- the kind of commit (fix/feature/etc. I actually care least about this, but it seems like other folks do it, so...)
- the major subsystem
- what the impact to users is

Sorry for perhaps being too terse earlier: I was asking about having this optimization for local parquet files. It's perhaps OK if we handle that as part of future work.

universalmind303 · 2023-09-19T22:30:11Z

partially closes #1776

We can still make some further optimizations such as row group parallelization & column parallelization

universalmind303 · 2023-09-19T23:02:08Z

we should follow up this PR with #1791

universalmind303 · 2023-09-20T00:37:27Z

dropping this in here to show the perf gains
this branch: 125ms
main: 459ms

glaredb  universalmind303/file-statistics ❯ timeit {./target/release/glaredb -q "select count(*) from 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet'"}
┌─────────────────┐
│ COUNT(UInt8(1)) │
│              ── │
│           Int64 │
╞═════════════════╡
│         3066766 │
└─────────────────┘
125ms 269µs 291ns
glaredb  universalmind303/file-statistics ❯ timeit {./glaredb-main -q "select count(*) from 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet'"}
┌─────────────────┐
│ COUNT(UInt8(1)) │
│              ── │
│           Int64 │
╞═════════════════╡
│         3066766 │
└─────────────────┘
459ms 394µs 167ns
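
For reference, the two timings above can be turned into a speedup ratio. This is a small sketch using only the standard library; `speedup` is a hypothetical helper, and since each number comes from a single run, the result is a rough indication rather than a benchmark.

```rust
use std::time::Duration;

/// Ratio of the baseline time to the new time; values above 1.0 mean the
/// new build is faster.
fn speedup(baseline: Duration, new: Duration) -> f64 {
    baseline.as_secs_f64() / new.as_secs_f64()
}
```

Calling `speedup(Duration::new(0, 459_394_167), Duration::new(0, 125_269_291))` with the main and branch timings gives a ratio between 3.6 and 3.7.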

universalmind303 added 3 commits September 19, 2023 16:29

add "ndjson" to file extensions

6dc9a6e

remove file

54d1a5e

utilize file statistics

4d4e510

universalmind303 commented Sep 19, 2023

crates/datasources/src/object_store/http.rs (outdated)

universalmind303 commented Sep 19, 2023

crates/datasources/src/object_store/mod.rs

clippy

907713c

tychoish reviewed Sep 19, 2023

tests & clippy

1d127a8

clippy

4babe88

universalmind303 linked an issue Sep 19, 2023 that may be closed by this pull request

non-optimal physical plan from explain select count(*) from 'hits.parquet'; #1787

Closed

universalmind303 changed the title ~~use object-store file statistics~~ feat: enable file statistics Sep 19, 2023

Merge branch 'main' into universalmind303/file-statistics

5243f8a

universalmind303 enabled auto-merge (squash) September 20, 2023 13:21

Merge branch 'main' into universalmind303/file-statistics

6542be0

universalmind303 merged commit 327d185 into main Sep 20, 2023
7 checks passed

universalmind303 deleted the universalmind303/file-statistics branch September 20, 2023 13:31
