Convert hashSets in parallel before merge #50748
Conversation
lastest upstream master
This is an automatic comment. The PR description does not match the template. Please edit it accordingly. The error is: Changelog entry required for category 'Performance Improvement'
This is an automated comment for commit 635e9d7 with a description of existing statuses. It's updated for the latest CI run.
Please take a look.
@@ -147,6 +148,10 @@ class IAggregateFunction : public std::enable_shared_from_this<IAggregateFunctio
    /// Default values must be at the 0-th positions in columns.
    virtual void addManyDefaults(AggregateDataPtr __restrict place, const IColumn ** columns, size_t length, Arena * arena) const = 0;

    virtual bool isParallelizeMergePrepareNeeded() const { return false; }

    virtual void parallelizeMergePrepare(AggregateDataPtrs & /*places*/, ThreadPool & /*thread_pool*/) const {}
`throw Exception(ErrorCodes::NOT_IMPLEMENTED, ...);` here, and override only where we actually do something.
Done. Added the exception in the virtual function.
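The pattern agreed on above can be sketched as follows. This is a minimal illustration, not ClickHouse's actual classes: `AggregateFunctionBase`, `UniqExactLike`, `SumLike`, and `tryPrepare` are hypothetical names, and `std::logic_error` stands in for ClickHouse's `Exception(ErrorCodes::NOT_IMPLEMENTED, ...)`. The base class throws by default; only functions that support the prepare phase report it via the flag and override the method.

```cpp
#include <stdexcept>
#include <string>

struct AggregateFunctionBase
{
    virtual ~AggregateFunctionBase() = default;

    // Capability flag: false by default, overridden by supporting functions.
    virtual bool isParallelizeMergePrepareNeeded() const { return false; }

    // Default implementation throws, standing in for NOT_IMPLEMENTED.
    virtual void parallelizeMergePrepare() const
    {
        throw std::logic_error("parallelizeMergePrepare is not implemented for " + name());
    }

    virtual std::string name() const = 0;
};

struct UniqExactLike : AggregateFunctionBase
{
    bool isParallelizeMergePrepareNeeded() const override { return true; }
    void parallelizeMergePrepare() const override { /* convert hash sets here */ }
    std::string name() const override { return "uniqExact-like"; }
};

struct SumLike : AggregateFunctionBase
{
    // No override: calling parallelizeMergePrepare() on this type throws.
    std::string name() const override { return "sum-like"; }
};

// Callers check the flag first, so the exception only fires on misuse.
bool tryPrepare(const AggregateFunctionBase & f)
{
    if (!f.isParallelizeMergePrepareNeeded())
        return false;
    f.parallelizeMergePrepare();
    return true;
}
```

This keeps the guard condition (`isParallelizeMergePrepareNeeded`) and the guarded operation in the same interface, and the throwing default turns any missed guard into a loud failure instead of a silent no-op.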
src/Interpreters/Aggregator.cpp
Outdated
@@ -2601,6 +2601,21 @@ void NO_INLINE Aggregator::mergeWithoutKeyDataImpl(

    AggregatedDataVariantsPtr & res = non_empty_data[0];

    for (size_t i = 0; i < params.aggregates_size; ++i)
    {
        if (aggregate_functions[i]->isAbleToParallelizeMerge() &&
`isAbleToParallelizeMerge` looks irrelevant here; `isParallelizeMergePrepareNeeded` should be enough.
The flag `isAbleToParallelizeMerge` is for the parallel merge of UniqExact. Even if the merge stage is not parallel (`isAbleToParallelizeMerge = false`), we could still do the hashset conversion in parallel. The merge stage and the hashset conversion stage are independent, so I agree `isAbleToParallelizeMerge` can be removed here.
/// In merge, if one of lhs and rhs is a two-level set and the other is a single-level set, the single-level set will need to convertToTwoLevel().
/// That conversion is not parallel and costs a lot of extra time if the thread count is large.
/// This method converts all the single-level sets to two-level sets in parallel if the hashsets are neither all single-level nor all two-level.
static void parallelizeMergePrepare(const std::vector<UniqExactSet *> & data_vec, ThreadPool * thread_pool = nullptr)
Could we make sure that `ThreadPool` is always not null here and pass it as a reference?
From `void parallelizeMergePrepare(AggregateDataPtrs & places, ThreadPool & thread_pool) const override` in AggregateFunctionUniq.h, the `ThreadPool` is always not null. The latest code uses a reference instead of a pointer.
    }
};
for (size_t i = 0; i < std::min<size_t>(thread_pool->getMaxThreads(), single_level_set_num); ++i)
    thread_pool->scheduleOrThrowOnError(thread_func);
If an exception is thrown here, we will not reach `wait()`. All this code should be wrapped into a try-catch that also does `wait` in case of an exception (refer to #50590).
Added a try-catch in the latest code.
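The exception-safety pattern the reviewer asks for can be sketched with a toy pool (ClickHouse's real `ThreadPool` and `scheduleOrThrowOnError` are not used here; `ToyPool` and `runJobs` are illustrative names). The point is that if scheduling throws partway through the loop, the already-scheduled jobs must still be waited on before the exception propagates.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <thread>
#include <vector>

// Toy stand-in for a thread pool whose schedule call can throw.
struct ToyPool
{
    std::vector<std::thread> threads;
    size_t capacity;
    explicit ToyPool(size_t cap) : capacity(cap) {}

    // Mimics scheduleOrThrowOnError: throws when the pool is full.
    void scheduleOrThrow(std::function<void()> job)
    {
        if (threads.size() >= capacity)
            throw std::runtime_error("pool is full");
        threads.emplace_back(std::move(job));
    }

    void wait()
    {
        for (auto & t : threads)
            if (t.joinable())
                t.join();
        threads.clear();
    }
};

// If scheduling throws mid-loop, the catch block still waits for the
// jobs that were already scheduled, then rethrows.
void runJobs(ToyPool & pool, size_t job_count, std::atomic<int> & done)
{
    try
    {
        for (size_t i = 0; i < job_count; ++i)
            pool.scheduleOrThrow([&done] { done.fetch_add(1); });
        pool.wait();
    }
    catch (...)
    {
        pool.wait();  // join already-running jobs so none outlive this scope
        throw;
    }
}
```

Without the catch-and-wait, the in-flight threads would still be referencing locals of the caller when the stack unwinds, which is exactly the hazard the review comment (and #50590) points at.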
It would be good to find perf-tests that show a speed-up in the addressed case.
void parallelizeMergePrepare(AggregateDataPtrs & places, ThreadPool & thread_pool) const override
{
    std::vector<DataSet *> data_vec;
data_vec.resize(places.size());
I have tried adding the line `data_vec.resize(places.size());` after the initialization of `std::vector<DataSet *> data_vec;`, but the ClickHouse server crashed. I will check that.
Added the `resize()` for `data_vec` in `parallelizeMergePrepare`.
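A plausible illustration of the crash discussed above (the actual cause is not confirmed in the thread): `resize()` pairs with indexed assignment, while `reserve()` pairs with `push_back`. Mixing `resize()` with `push_back` leaves null pointers at the front of the vector, which crash when dereferenced later. The helper names below are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Correct: reserve capacity, then append with push_back.
std::vector<int *> fillWithPushBack(std::vector<int> & storage)
{
    std::vector<int *> vec;
    vec.reserve(storage.size());
    for (auto & x : storage)
        vec.push_back(&x);
    return vec;
}

// Also correct: resize to the final length, then assign by index.
std::vector<int *> fillWithIndexing(std::vector<int> & storage)
{
    std::vector<int *> vec;
    vec.resize(storage.size());
    for (size_t i = 0; i < storage.size(); ++i)
        vec[i] = &storage[i];
    return vec;
}

// The buggy mix: resize() first creates storage.size() null pointers,
// then push_back appends after them, doubling the length and leaving
// nulls at the front.
std::vector<int *> fillBuggy(std::vector<int> & storage)
{
    std::vector<int *> vec;
    vec.resize(storage.size());
    for (auto & x : storage)
        vec.push_back(&x);
    return vec;
}
```

If the original loop filled `data_vec` with `push_back`, adding `resize()` on top of it would produce exactly this kind of half-null vector; switching either the fill loop to indexed assignment or `resize()` to `reserve()` resolves it.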
@nickitat @devcrafter Thanks for your kind review. I will think about that and let you know once I finish.
The latest code has included the header file.
I have used ClickBench for the performance test. Do you mean we should find the "distinct/uniq" operations in the tests and check their performance?
Force-pushed from 9a19224 to e42a62f
Hi @nickitat @devcrafter, thanks for your previous code review. I have updated the patch code and comments according to the review comments.
Update to the master
Before merge, if one of lhs and rhs is a singleLevelSet and the other is a twoLevelSet, the singleLevelSet will call convertToTwoLevel(). The conversion is not parallel and costs many cycles while it consumes all the singleLevelSets. The idea of the patch is to convert all the singleLevelSets to twoLevelSets in parallel if the hashsets are neither all single-level nor all two-level. I have tested the patch on an Intel 2 x 112 vCPU SPR server with ClickBench and the latest upstream ClickHouse. Q5 got a big 264% performance improvement, and 24 queries got at least a 5% performance gain. The overall geomean of 43 queries gained 7.4% over the base code. Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
@nickitat The failing CI check has succeeded in the latest attempt, but the status doesn't show that.
Update the
@nickitat @devcrafter @kitaisreal
SELECT COUNT(DISTINCT Title) FROM hits_v1 SETTINGS max_threads = 24
Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Renamed the data set of the performance test from hits_v1 to test.hits to fit the CI rule.
Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
@@ -66,6 +66,8 @@ class ThetaSketchData : private boost::noncopyable
        return 0;
    }

    static void parallelizeMergePrepare(const std::vector<ThetaSketchData *> & /*places*/, ThreadPool & /*thread_pool*/) {}
If I'm not missing something, every `parallelizeMergePrepare` could be one of:
- not defined/declared at all
- throwing an exception
- doing meaningful work (an empty implementation is not meaningful)
I have some prejudice towards empty methods, what do you think?
Agree.
- For these empty methods in the other data sets, we could add the throwing exception.
- Or we may just delete these empty methods in the other data sets, call `DataSet::parallelizeMergePrepare()` only when `is_parallelize_merge_prepare_needed` is true (`UniqExactSet`), and use `constexpr` to avoid the compile error.
I would suggest the 2nd way.
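The 2nd way suggested above can be sketched with `if constexpr`: only the set type that needs the prepare phase defines the member, and the discarded branch is never instantiated for the other types, so no empty stubs are required. All type and function names here (`UniqExactSetLike`, `ThetaSketchLike`, `prepareIfNeeded`) are hypothetical stand-ins, not ClickHouse's actual classes.

```cpp
#include <vector>

struct UniqExactSetLike
{
    static constexpr bool is_parallelize_merge_prepare_needed = true;
    bool prepared = false;

    // Stands in for the single-to-two-level hashset conversion.
    static void parallelizeMergePrepare(const std::vector<UniqExactSetLike *> & sets)
    {
        for (auto * s : sets)
            s->prepared = true;
    }
};

struct ThetaSketchLike
{
    static constexpr bool is_parallelize_merge_prepare_needed = false;
    // No parallelizeMergePrepare member at all -- no empty stub needed.
};

// The false branch is discarded at compile time for types without the
// member, because the call is dependent on the template parameter Set.
template <typename Set>
bool prepareIfNeeded(std::vector<Set *> & sets)
{
    if constexpr (Set::is_parallelize_merge_prepare_needed)
    {
        Set::parallelizeMergePrepare(sets);
        return true;
    }
    else
        return false;
}
```

This is why the compile error the comment mentions goes away: a plain `if` would still require `parallelizeMergePrepare` to exist on every instantiated type, while `if constexpr` only instantiates the taken branch.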
Co-authored-by: Nikita Taranov <nickita.taranov@gmail.com>
Force-pushed from 53fa19d to 635e9d7
…repare() Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
In PR ClickHouse#50748, a new phase `parallelizeMergePrepare` was added before merge for the case where the hashSets are neither all single-level nor all two-level; it converts all the singleLevelSets to twoLevelSets in parallel, which increases CPU utilization and QPS. But when all the hashtables are single-level, they can also benefit from the `parallelizeMergePrepare` optimization in most cases, as long as the hashtable sizes are not too small. By tuning the query `SELECT COUNT(DISTINCT SearchPhase) FROM hits_v1` at different thread counts, we have found the mild threshold 6,000.

Tested the patch with the query `SELECT COUNT(DISTINCT Title) FROM hits_v1` on a 2x80 vCPU server. With fewer than 48 threads, the hashSets are all two-level or a mix of single-level and two-level. With more than 56 threads, all the hashSets are single-level, and QPS gains at most 2.35x.

Threads  Opt/Base
8        100.0%
16       99.4%
24       110.3%
32       99.9%
40       99.3%
48       99.8%
56       183.0%
64       234.7%
72       233.1%
80       229.9%
88       224.5%
96       229.6%
104      235.1%
112      229.5%
120      229.1%
128      217.8%
136      222.9%
144      217.8%
152      204.3%
160      203.2%

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
…52973) Optimize the merge if all hashSets are singleLevel (#52973); add the comment and explanation for PR#52973. Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Before merge, if one of lhs and rhs is a singleLevelSet and the other is a twoLevelSet, the singleLevelSet will call convertToTwoLevel(). The conversion is serial, not parallel, and it costs many cycles before all the singleLevelSets are consumed.
The idea of the patch is to convert all the singleLevelSets to twoLevelSets in parallel before merge.
I have tested the patch on an Intel 2 x 112 vCPU SPR server with ClickBench and the latest upstream ClickHouse. Q5 got a big 264% performance improvement, and 24 queries got at least a 5% performance gain. The overall geomean of the 43 queries gained 7.4% over the base code.
Changelog category (leave one): Performance Improvement
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
This patch provides a method to convert all the hashsets in parallel before merge.
Details of the performance issue and how it is resolved:
First, we found a performance drop for Q5 of ClickBench as the core count increases.
Then we collected pipeline visualizations with the thread pool's max_threads set to 80 and 112.
max_threads = 80
max_threads = 112
When max_threads increases from 80 to 112, the merge stage does not shrink; the merge time is 3.2x longer.
When merging two twoLevelHash sets, there is already an optimization that starts a thread pool and merges in parallel. However, when merging a singleLevelHash with a twoLevelHash, the singleLevelHash has to be converted to a twoLevelHash first, and that conversion is serial.
If there is at least one singleLevelHash and one twoLevelHash, all the singleLevelHashes have to be converted to twoLevelHash before merging with the other twoLevelHashes. We could add a new stage before merge, called Prepare_Hash_before_Merge, where all the hashsets are processed before merging: all the singleLevelHashes are converted to twoLevelHash in parallel in this stage, instead of serially inside the merge stage.
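The prepare stage described above can be modeled as follows. This is a toy sketch, not ClickHouse's `UniqExactSet`: `ToySet`, its `two_level` flag, and `parallelizeMergePrepare` are illustrative, and one thread is spawned per set where the real code bounds the worker count by the pool size.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

struct ToySet
{
    bool two_level = false;
    void convertToTwoLevel() { two_level = true; }  // the expensive serial step
};

// If the inputs mix single-level and two-level sets, convert every
// single-level set to two-level in parallel before the (unshown) merge.
void parallelizeMergePrepare(std::vector<ToySet *> & sets)
{
    size_t two_level_count = 0;
    for (auto * s : sets)
        two_level_count += s->two_level ? 1 : 0;

    // All-single-level or all-two-level inputs need no conversion here.
    if (two_level_count == 0 || two_level_count == sets.size())
        return;

    // Mixed levels: convert in parallel instead of serially during merge.
    std::vector<std::thread> workers;
    for (auto * s : sets)
        if (!s->two_level)
            workers.emplace_back([s] { s->convertToTwoLevel(); });
    for (auto & w : workers)
        w.join();
}
```

The early return is the key behavioral choice of this PR: homogeneous inputs skip the stage entirely, so only the mixed case (where the merge stage would otherwise convert serially) pays for the extra threads. (A follow-up, #52973, later extends the conversion to the all-single-level case above a size threshold.)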
With this patch, Q5 got a 2.64x performance improvement on a 2 x 112 vCPU system (max_threads = 112).