Parquet filter pushdown #23297

filimonov · 2021-04-19T08:58:19Z

Limit reads from a Parquet file when filters exist, similar to Apache Drill: https://drill.apache.org/docs/parquet-filter-pushdown/

filimonov · 2023-07-27T13:36:07Z

That was implemented for Hive (only) some time ago. Same is needed for file(..) and s3(...)

danthegoodman1 · 2023-07-27T20:33:36Z

Reviving this because it's super useful. DuckDB has this, and it makes an insane difference for more selective queries on larger files.

minguyen9988 · 2023-07-28T03:24:20Z

Seconding this. On SELECT queries with a WHERE clause, DuckDB vastly outperforms ClickHouse (10+ times).

danthegoodman1 · 2023-07-28T19:44:07Z

> That was implemented for Hive (only) some time ago. Same is needed for file(..) and s3(...)

#34631

Don't forget url() (and *cluster versions!)

danthegoodman1 · 2023-07-28T19:45:15Z

> Seconding this. On SELECT queries with a WHERE clause, DuckDB vastly outperforms ClickHouse (10+ times).

For reference, you can see ClickHouse is faster until the queries become selective enough that the file's min/max index can be used. In my experience I am always filtering, so I am forced to use DuckDB for its performance gains on Parquet files.
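To illustrate why the min/max index matters for selective queries, here is a minimal sketch in plain Python. It does not use any real Parquet library, and it is not how ClickHouse or DuckDB implement pushdown. `RowGroup`, `make_row_groups`, and `scan_with_pushdown` are hypothetical names invented for this example. The sketch assumes each row group stores the minimum and maximum of one integer column.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group with min/max statistics."""
    min_value: int
    max_value: int
    rows: List[int]


def make_row_groups(values: List[int], size: int) -> List[RowGroup]:
    """Split values into fixed-size chunks and record min/max for each chunk."""
    groups = []
    for i in range(0, len(values), size):
        chunk = values[i:i + size]
        groups.append(RowGroup(min(chunk), max(chunk), chunk))
    return groups


def scan_with_pushdown(row_groups: List[RowGroup], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return values in [lo, hi] and the number of row groups actually read."""
    matched: List[int] = []
    groups_read = 0
    for rg in row_groups:
        # The statistics prove no value in this group can match, so skip it
        # without touching its rows.
        if rg.max_value < lo or rg.min_value > hi:
            continue
        groups_read += 1
        matched.extend(v for v in rg.rows if lo <= v <= hi)
    return matched, groups_read


# Sorted data, 10 row groups of 100 values each.
row_groups = make_row_groups(list(range(1000)), 100)
matched, groups_read = scan_with_pushdown(row_groups, 250, 260)
print(f"read {groups_read} of {len(row_groups)} row groups, {len(matched)} rows matched")
```

With sorted data and a narrow filter, only one of the ten row groups overlaps the range, so the other nine are never read. Without a filter, or with unsorted data whose min/max ranges all overlap the predicate, every row group is still read and the statistics give no benefit.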

danthegoodman1 · 2023-08-21T22:26:52Z

Don't forget to update ClickBench!

filimonov added feature good independant issue labels Apr 19, 2021

alexey-milovidov removed the good independent task label Jun 22, 2021

danthegoodman1 mentioned this issue Jul 27, 2023

Support embedded indexes in Parquet #48725

Closed

This was referenced Jul 30, 2023

IceDB v3 danthegoodman1/icedb#50

Closed

Parquet files should be able to count from metadata #44334

Closed

al13n321 self-assigned this Jul 31, 2023

danthegoodman1 mentioned this issue Aug 2, 2023

Parquet filter pushdown #52951

Merged

al13n321 closed this as completed in #52951 Aug 21, 2023
