Support orc filter push down (file + stripe + rowgroup level) #55330

taiyang-li · 2023-10-08T11:50:35Z

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Support orc filter push down (rowgroup level)

related pr of contrib/orc: ClickHouse/orc#11

robot-clickhouse-ci-1 · 2023-10-08T11:57:58Z

This is an automated comment for commit ef4b5d5 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Successful checks

Check name	Description	Status
AST fuzzer	Runs randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help	✅ success
ClickHouse build check	Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process	✅ success
Compatibility check	Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help	✅ success
Docker image for servers	The check to build and optionally push the mentioned image to docker hub	✅ success
Fast test	Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here	✅ success
Flaky tests	Checks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integrational tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc	✅ success
Install packages	Checks that the built packages are installable in a clear environment	✅ success
Integration tests	The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests	✅ success
Mergeable Check	Checks if all other necessary checks are successful	✅ success
Performance Comparison	Measure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests	✅ success
Push to Dockerhub	The check for building and pushing the CI related docker images to docker hub	✅ success
SQLTest	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
SQLancer	Fuzzing tests that detect logical bugs with SQLancer tool	✅ success
Sqllogic	Run clickhouse on the sqllogic test set against sqlite and checks that all statements are passed	✅ success
Stateful tests	Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc	✅ success
Stateless tests	Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc	✅ success
Style Check	Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report	✅ success
Unit tests	Runs the unit tests for different release types	✅ success
Upgrade check	Runs stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts	✅ success

Check name	Description	Status
CI running	A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR	⏳ pending
Stress test	Runs stateless functional tests concurrently from several clients to detect concurrency-related errors	❌ failure

taiyang-li · 2023-10-10T10:11:26Z

@alexey-milovidov can you review it, thanks!

al13n321

Cool, thanks for the SourceWithKeyCondition refactoring!

Please add a test and remove the leftover debug logging (once you don't need it).

src/Processors/Formats/Impl/NativeORCBlockInputFormat.cpp

taiyang-li · 2023-10-13T03:11:39Z

@al13n321 can you also review this related pr: ClickHouse/orc#11

taiyang-li · 2023-10-13T04:27:10Z

Generating orc file

echo "select number as a, cast(number as String) as b from numbers(100000000) format ORC"  | clickhouse-client > seq.orc

First test was improved by 8.7x:

Query: select a % 10, length(b) % 10, count(1) from file('seq.orc') where a > 90000000 group by a % 10, length(b) % 10

Without orc filter push down (set input_format_orc_filter_push_down = false)

10 rows in set. Elapsed: 2.086 sec. Processed 100.00 million rows, 791.82 MB (47.93 million rows/s., 379.55 MB/s.)
Peak memory usage: 15.29 MiB.

10 rows in set. Elapsed: 2.187 sec. Processed 100.00 million rows, 791.82 MB (45.72 million rows/s., 362.02 MB/s.)
Peak memory usage: 16.54 MiB.

10 rows in set. Elapsed: 2.143 sec. Processed 100.00 million rows, 791.82 MB (46.67 million rows/s., 369.57 MB/s.)
Peak memory usage: 14.39 MiB.

With orc filter push down (set input_format_orc_filter_push_down = true)

10 rows in set. Elapsed: 0.241 sec. Processed 10.00 million rows, 80.33 MB (41.56 million rows/s., 333.68 MB/s.)
Peak memory usage: 14.89 MiB.

10 rows in set. Elapsed: 0.238 sec. Processed 10.00 million rows, 80.33 MB (42.06 million rows/s., 337.68 MB/s.)
Peak memory usage: 14.83 MiB.

10 rows in set. Elapsed: 0.243 sec. Processed 10.00 million rows, 80.33 MB (41.24 million rows/s., 331.16 MB/s.)
Peak memory usage: 16.42 MiB.

Second test was improved by 117x

Query: select a % 10, length(b) % 10, count(1) from file('seq.orc') where a in (90000000, 1000) group by a % 10, length(b) % 10

Without orc filter push down (set input_format_orc_filter_push_down = false)

2 rows in set. Elapsed: 2.182 sec. Processed 100.00 million rows, 791.82 MB (45.84 million rows/s., 362.95 MB/s.)
Peak memory usage: 8.19 MiB.

2 rows in set. Elapsed: 2.175 sec. Processed 100.00 million rows, 791.82 MB (45.99 million rows/s., 364.14 MB/s.)
Peak memory usage: 8.20 MiB.

2 rows in set. Elapsed: 2.118 sec. Processed 100.00 million rows, 791.82 MB (47.21 million rows/s., 373.83 MB/s.)
Peak memory usage: 8.25 MiB.

With orc filter push down (set input_format_orc_filter_push_down = true)

2 rows in set. Elapsed: 0.025 sec. Processed 20.00 thousand rows, 149.44 KB (798.89 thousand rows/s., 5.97 MB/s.)
Peak memory usage: 758.29 KiB.

2 rows in set. Elapsed: 0.023 sec. Processed 20.00 thousand rows, 149.44 KB (882.51 thousand rows/s., 6.59 MB/s.)
Peak memory usage: 754.80 KiB.

2 rows in set. Elapsed: 0.018 sec. Processed 20.00 thousand rows, 149.44 KB (1.09 million rows/s., 8.15 MB/s.)
Peak memory usage: 757.79 KiB.
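
The processed-row counts in both tests are consistent with min/max pruning at row-group granularity. Below is a minimal sketch of that reasoning, not the code from this PR. It assumes the file uses ORC's default row index stride of 10,000 rows and that column a is stored in ascending order, as numbers() produces it. The names rows_read and may_match are hypothetical.

```python
ROWS = 100_000_000   # rows written by numbers(100000000)
STRIDE = 10_000      # assumed row-group size (ORC's default row index stride)


def rows_read(may_match):
    """Count rows in row groups that min/max statistics cannot rule out."""
    total = 0
    for start in range(0, ROWS, STRIDE):
        # Column a is sequential, so each group's min/max are its first/last value.
        lo, hi = start, min(start + STRIDE, ROWS) - 1
        if may_match(lo, hi):
            total += hi - lo + 1
    return total


# WHERE a > 90000000: a group can match only if its max exceeds the bound.
greater = rows_read(lambda lo, hi: hi > 90_000_000)

# WHERE a IN (90000000, 1000): a group can match only if a listed value
# falls inside its [min, max] range.
in_set = rows_read(lambda lo, hi: any(lo <= v <= hi for v in (90_000_000, 1_000)))

print(greater, in_set)
```

Under these assumptions the sketch gives 10,000,000 and 20,000 rows, which matches the "Processed 10.00 million rows" and "Processed 20.00 thousand rows" lines above.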

al13n321 · 2023-10-13T08:07:29Z

Lgtm, now just to figure out what to do with the orc PR (see comment there).

al13n321 · 2023-10-18T23:59:23Z

> I can't figure out why the new perf test orc_filter_push_down.xml failed in https://s3.amazonaws.com/clickhouse-test-reports/55330/de22fdcaea2e12c96f300e95f59beba84401712d/performance_comparison_[1_4]/report.html#run-errors @al13n321 need your help?

Maybe it's just from removing inline from Range::equals()? Looks like the parallel_index test does a few million Field comparisons (index granularity: 2, rows: 1M, WHERE has 6 equalities) and got slower by 100ms. That's on the order of 10 ns per Field comparison - kind of the right order of magnitude for function call overhead. Try making it inline again?

(If that's correct, I'm surprised that a simple function call overhead is so significant compared to all the Field visitor stuff and the index stuff and KeyCondition etc.)
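
The estimate above can be reproduced as back-of-envelope arithmetic. This is a rough sketch using the figures from the comment (index granularity 2, 1M rows, 6 equalities, about 100 ms slowdown). The number of Field comparisons per equality per granule is an assumption made here for illustration. The report does not state it.

```python
ROWS = 1_000_000
INDEX_GRANULARITY = 2
EQUALITIES = 6
SLOWDOWN_NS = 100 * 1_000_000   # the ~100 ms regression, in nanoseconds


def ns_per_comparison(comparisons_per_equality):
    """Slowdown divided by the estimated number of Field comparisons."""
    granules = ROWS // INDEX_GRANULARITY                # 500,000 index granules
    comparisons = granules * EQUALITIES * comparisons_per_equality
    return SLOWDOWN_NS / comparisons


# Assumed: one or two comparisons per equality per granule
# (for example, one against each bound of the range).
for n in (1, 2):
    print(n, round(ns_per_comparison(n), 1))
```

With one or two comparisons per equality this gives roughly 33 ns and 17 ns per comparison, the same order of magnitude as the estimate of about 10 ns in the comment.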

taiyang-li · 2023-10-19T02:49:25Z

But why does the perf test orc_filter_push_down have an error:

al13n321 · 2023-10-19T03:24:15Z

The error is: Unknown setting input_format_orc_filter_push_down (from "Test output" at the bottom of the report page). It probably happens when running clickhouse without this PR (for the "old" side of the comparison). Maybe remove the settings input_format_orc_filter_push_down = 1 from the query. (Then we'll get a "regression" if the default value of that setting is changed in the future, but the PR that changes it can just add the settings input_format_orc_filter_push_down = 1 to the test again.)

taiyang-li · 2023-10-19T08:40:35Z

src/Processors/Formats/Impl/NativeORCBlockInputFormat.cpp

@@ -312,6 +797,9 @@ Chunk NativeORCBlockInputFormat::generate()
    if (is_stopped)
        return {};

+    /// TODO: figure out why reuse batch would cause asan fatals in https://s3.amazonaws.com/clickhouse-test-reports/55330/be39d23af2d7e27f5ec7f168947cf75aeaabf674/stateless_tests__asan__[4_4].html
+    /// Not sure if it is a false positive case. Notice that reusing batch will speed up reading ORC by 1.15x.


I don't understand why asan fatals appear if the batch is reused. Maybe it is a false positive? @al13n321

taiyang-li · 2023-10-20T08:29:59Z

Failed ut https://s3.amazonaws.com/clickhouse-test-reports/55330/0f4a3c8b40f0eff543b0931a39776bbd5b3efcca/stateless_tests__tsan__s3_storage__[3_5].html seems not related to this pr.

taiyang-li · 2023-10-20T08:32:37Z

It is faster after using orc filter push down feature.

taiyang-li · 2023-10-23T07:50:40Z

@al13n321 do you think it is ok to merge this pr now?

al13n321 · 2023-10-24T03:13:39Z

> Failed ut https://s3.amazonaws.com/clickhouse-test-reports/55330/0f4a3c8b40f0eff543b0931a39776bbd5b3efcca/stateless_tests__tsan__s3_storage__[3_5].html seems not related to this pr.

Looks like a TSAN error in setKeyConditionImpl(), please take a look:
https://s3.amazonaws.com/clickhouse-test-reports/55330/0f4a3c8b40f0eff543b0931a39776bbd5b3efcca/stateless_tests__tsan__s3_storage__[3_5]/gdb.log

EDIT: A more complete report in https://s3.amazonaws.com/clickhouse-test-reports/55330/0f4a3c8b40f0eff543b0931a39776bbd5b3efcca/stateless_tests__tsan__s3_storage__[3_5]/run.log (EDIT 2: This has 2 TSAN errors, the first one seems unrelated, the second is KeyCondition)

taiyang-li · 2023-10-24T07:58:05Z

> > Failed ut https://s3.amazonaws.com/clickhouse-test-reports/55330/0f4a3c8b40f0eff543b0931a39776bbd5b3efcca/stateless_tests__tsan__s3_storage__[3_5].html seems not related to this pr.
>
> Looks like a TSAN error in setKeyConditionImpl(), please take a look: https://s3.amazonaws.com/clickhouse-test-reports/55330/0f4a3c8b40f0eff543b0931a39776bbd5b3efcca/stateless_tests__tsan__s3_storage__[3_5]/gdb.log
>
> EDIT: A more complete report in https://s3.amazonaws.com/clickhouse-test-reports/55330/0f4a3c8b40f0eff543b0931a39776bbd5b3efcca/stateless_tests__tsan__s3_storage__[3_5]/run.log (EDIT 2: This has 2 TSAN errors, the first one seems unrelated, the second is KeyCondition)

It is fixed. Let's wait for ci.

taiyang-li · 2023-10-24T10:51:45Z

https://s3.amazonaws.com/clickhouse-test-reports/55330/eacc41ebc9d871e9308dbcc76cf11b329a0ec0a4/stateless_tests_flaky_check__asan_.html

The failed flaky test can't be reproduced in my local environment.

canhld94 · 2023-11-02T05:40:02Z

This PR creates a dangling commit in the orc submodule.

alexey-milovidov · 2024-03-05T04:50:03Z

@taiyang-li, according to the test result it works even better on AArch64, but I don't know why: https://s3.amazonaws.com/clickhouse-test-reports/58061/7f170575229e41ba73102d6d7fb3718d04c604b8/stateless_tests__aarch64_.html

Quite soon, we will not allow any tests without AArch64: #58061

taiyang-li · 2024-03-05T08:22:29Z

> @taiyang-li, according to the test result it works even better on AArch64, but I don't know why: https://s3.amazonaws.com/clickhouse-test-reports/58061/7f170575229e41ba73102d6d7fb3718d04c604b8/stateless_tests__aarch64_.html
>
> Quite soon, we will not allow any tests without AArch64: #58061

Interesting, I'll see.

taiyang-li · 2024-03-05T08:34:58Z

@alexey-milovidov sorry, I don't have a development environment with AArch64. Could you add set input_format_orc_filter_push_down = false in tests/queries/0_stateless/02892_orc_filter_pushdown.sql and test whether the outputs differ between X86-64 and AArch64? If they do, then the issue may be related to ORC filter push down.

taiyang-li added 2 commits October 8, 2023 19:49

support orc filter push down

20d1eb4

update orc lib version

bd51d21

robot-clickhouse-ci-1 added the pr-performance (Pull request with some performance improvements) and submodule changed (At least one submodule changed in this PR) labels Oct 8, 2023

replace setqueryinfo with setkeycondition

bd011a3

alexey-milovidov added the can be tested (Allows running workflows for external contributors) label Oct 8, 2023

fix issue ClickHouse#53536

c482ad8

taiyang-li mentioned this pull request Oct 9, 2023

Make KeyCondition usable outside *MergeTree when analyzer is enabled #53536

Open

taiyang-li added 4 commits October 9, 2023 17:41

refactor source with key condition

31bd247

fix building error

2f5f2ba

remove std::cout

0d3213d

update orc

7b328df

taiyang-li marked this pull request as ready for review October 9, 2023 10:52

taiyang-li added 4 commits October 10, 2023 11:18

update orc version

507620b

fix bugs

f05511b

improve code

7acefea

upgrade orc lib

aa7d89f

al13n321 self-assigned this Oct 10, 2023

fix code style

d7b1257

taiyang-li mentioned this pull request Oct 11, 2023

[CH] utilize ORC filter push down to reduce remote read IO apache/incubator-gluten#3297

Closed

al13n321 requested changes Oct 12, 2023

taiyang-li added 2 commits October 13, 2023 14:05

change as requested

1bb2a22

add performance tests for orc filter push down

72c11c4

al13n321 approved these changes Oct 13, 2023

add performance tests for orc filter push down

c01a206

taiyang-li added 4 commits October 18, 2023 17:24

fix failed uts

5b47432

fix ast fuzzer tests

1c4530d

Merge branch 'master' into ch_orc_filter_push_down

6e9ca51

fix bug of uint64 overflow in https://s3.amazonaws.com/clickhouse-tes…

be39d23

…t-reports/55330/de22fdcaea2e12c96f300e95f59beba84401712d/fuzzer_astfuzzerubsan/report.html

taiyang-li added 5 commits October 19, 2023 16:28

fix asan fatal caused by reused column vector batch in native orc inp…

fab596d

…ut format. refer to https://s3.amazonaws.com/clickhouse-test-reports/55330/be39d23af2d7e27f5ec7f168947cf75aeaabf674/stateless_tests__asan__[4_4].htm

fix wrong performance tests

bb0a5a9

disable 02892_orc_filter_pushdown on aarch64. https://s3.amazonaws.co…

21f1db6

…m/clickhouse-test-reports/55330/be39d23af2d7e27f5ec7f168947cf75aeaabf674/stateless_tests__aarch64_.html

add some comments

d3e89f7

add some comments

d36e3e1

taiyang-li commented Oct 19, 2023

taiyang-li added 2 commits October 19, 2023 21:50

Merge branch 'master' into ch_orc_filter_push_down

822965d

inline range::equals and range::less

0f4a3c8

fix data race of key condition

eacc41e

trigger ci

ef4b5d5

al13n321 merged commit 465962d into ClickHouse:master Oct 24, 2023
271 of 274 checks passed

baibaichen mentioned this pull request Oct 25, 2023

[GLUTEN-1632][CH]Daily Update Clickhouse Version (20231025) apache/incubator-gluten#3518

Merged

taiyang-li mentioned this pull request Oct 26, 2023

[GLUTEN-3297][CH] Refactor filter push down framework in gluten, support orc FPD and reuse parquet FPD in CH. apache/incubator-gluten#3301

Merged
