
Convert hashSets in parallel before merge #50748

Merged: 11 commits merged into ClickHouse:master from UniqExactSet on Jul 27, 2023

Conversation

@jiebinn (Contributor) commented Jun 9, 2023

Before merge, if one of lhs and rhs is a singleLevelSet and the other is a twoLevelSet, the singleLevelSet will call convertToTwoLevel(). The conversion runs serially, not in parallel, and it costs many cycles before all the singleLevelSets are consumed.

The idea of the patch is to convert all the singleLevelSets to twoLevelSets in parallel before merge.

I have tested the patch on an Intel 2 x 112 vCPU SPR server with clickbench and the latest upstream ClickHouse. Q5 got a large 2.64x performance improvement, and 24 queries gained at least 5%. The overall geomean across the 43 queries is 7.4% better than the base code.

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

This patch provides a method to process all the hash sets in parallel before merge.

How the performance issue was identified and resolved:

First, we found that clickbench Q5 drops in performance as the core count increases.
[chart: Q5 performance vs. core count]
Then we collected pipeline visualizations with the thread pool's max_threads set to 80 and 112.
max_threads = 80
[pipeline visualization]
max_threads = 112
[pipeline visualization]
When max_threads increases from 80 to 112, the merge stage does not get shorter; instead, the merge time grows 3.2x.
When merging two twoLevelHash sets, there is already an optimization that starts a thread pool and merges in parallel. However, when merging one singleLevelHash with one twoLevelHash, the singleLevelHash has to be converted to a twoLevelHash, and that conversion is serial.
If there is at least one singleLevelHash and one twoLevelHash, all the singleLevelHash sets have to be converted to twoLevelHash before being merged with the other twoLevelHash sets. We can add a new stage before merge, Prepare_Hash_before_Merge, in which all the hash sets are processed before merging: all the singleLevelHash sets are converted to twoLevelHash in parallel in this stage instead of serially during the merge stage.
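The conversion stage can be sketched as follows. This is a minimal sketch rather than the exact PR code: UniqExactSet, ThreadPool, getMaxThreads(), scheduleOrThrowOnError(), and convertToTwoLevel() come from the PR discussion, while the function name prepareHashSetsBeforeMerge, isSingleLevel(), and the atomic work index are illustrative assumptions.

#include <algorithm>
#include <atomic>
#include <vector>

/// Sketch of the Prepare_Hash_before_Merge stage: when the per-thread states are a
/// mix of single-level and two-level sets, convert every single-level set to a
/// two-level set on the thread pool before the merge begins.
void prepareHashSetsBeforeMerge(const std::vector<UniqExactSet *> & data_vec, ThreadPool & thread_pool)
{
    size_t single_level_set_num = 0;
    for (const auto * set : data_vec)
        if (set->isSingleLevel())
            ++single_level_set_num;

    /// Nothing to convert if the sets are already homogeneous.
    if (single_level_set_num == 0 || single_level_set_num == data_vec.size())
        return;

    std::atomic<size_t> next_idx{0};
    auto thread_func = [&]
    {
        /// Each worker repeatedly claims the next unprocessed set.
        for (size_t i = next_idx.fetch_add(1); i < data_vec.size(); i = next_idx.fetch_add(1))
            if (data_vec[i]->isSingleLevel())
                data_vec[i]->convertToTwoLevel();
    };

    /// No point in scheduling more workers than there are single-level sets.
    for (size_t i = 0; i < std::min<size_t>(thread_pool.getMaxThreads(), single_level_set_num); ++i)
        thread_pool.scheduleOrThrowOnError(thread_func);
    thread_pool.wait();
}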
With this patch, Q5 got a 2.64x performance improvement on a 2 x 112 vCPU system (max_threads = 112).
[chart: Q5 result with the patch]

@clickhouse-ci bot commented Jun 9, 2023

This is an automatic comment. The PR description does not match the template.

Please edit it accordingly.

The error is: Changelog entry required for category 'Performance Improvement'

1 similar comment from the clickhouse-ci bot.

@alexey-milovidov added the "can be tested" label (allows running workflows for external contributors) on Jun 9, 2023
@robot-ch-test-poll3 added the "pr-performance" label (pull request with some performance improvements) on Jun 9, 2023
@robot-ch-test-poll3 (Contributor) commented Jun 9, 2023

This is an automated comment for commit 635e9d7 with a description of existing statuses. It's updated for the latest CI run.
The full report is available here
The overall status of the commit is 🔴 failure

Check name | Description | Status
AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parentheses. If it fails, ask a maintainer for help | 🟢 success
CI running | A meta-check that indicates the running CI. Normally it's in success or pending state. The failed status indicates some problems with the PR | 🟢 success
ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often have enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log by grepping for cmake. Use these options and follow the general build process | 🟢 success
Compatibility check | Checks that the clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | 🟢 success
Docker image for servers | The check to build and optionally push the mentioned image to Docker Hub | 🟢 success
Fast test | Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | 🟢 success
Flaky tests | Checks whether newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer and additional randomization of thread scheduling. Integration tests are run up to 10 times. If a new test fails at least once, or runs too long, this check will be red. We don't allow flaky tests; read the doc | 🟢 success
Install packages | Checks that the built packages are installable in a clean environment | 🟢 success
Integration tests | The integration tests report. The package type is given in parentheses, and the optional part/total tests in square brackets | 🟢 success
Mergeable Check | Checks that all other necessary checks are successful | 🟢 success
Performance Comparison | Measures changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests | 🟢 success
Push to Dockerhub | The check for building and pushing the CI-related docker images to Docker Hub | 🟢 success
SQLancer | Fuzzing tests that detect logical bugs with the SQLancer tool | 🟢 success
Sqllogic | Runs clickhouse on the sqllogic test set against sqlite and checks that all statements pass | 🟢 success
Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | 🟢 success
Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | 🟢 success
Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | 🔴 failure
Style Check | Runs a set of checks to keep the code style clean. If some of the tests fail, see the related log from the report | 🟢 success
Unit tests | Runs the unit tests for different release types | 🟢 success
Upgrade check | Runs stress tests on the server version from the last release and then tries to upgrade it to the version from the PR. It checks whether the new server can start up successfully without errors, crashes, or sanitizer asserts | 🟢 success

@nickitat self-assigned this on Jun 10, 2023
@nickitat (Member) commented

pls take a look

Jun 11 02:52:18 FAILED: src/Interpreters/examples/CMakeFiles/two_level_hash_map.dir/two_level_hash_map.cpp.o
Jun 11 02:52:18 [full clang-tidy/clang++ invocation elided]
Jun 11 02:52:18 /build/src/AggregateFunctions/UniquesHashSet.h:369:91: error: unknown type name 'ThreadPool' [clang-diagnostic-error]
Jun 11 02:52:18     static void parallelizeMergePrepare(const std::vector<UniquesHashSet *> & /*places*/, ThreadPool * /*thread_pool = nullptr*/) {}

@@ -147,6 +148,10 @@ class IAggregateFunction : public std::enable_shared_from_this<IAggregateFunctio
/// Default values must be at the 0-th positions in columns.
virtual void addManyDefaults(AggregateDataPtr __restrict place, const IColumn ** columns, size_t length, Arena * arena) const = 0;

virtual bool isParallelizeMergePrepareNeeded() const { return false; }

virtual void parallelizeMergePrepare(AggregateDataPtrs & /*places*/, ThreadPool & /*thread_pool*/) const {}
nickitat (Member):

throw Exception(ErrorCodes::NOT_IMPLEMENTED, ...); here, and override only where we actually do something.

jiebinn (Contributor, author):

Done. Added the exception in the virtual function.
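For reference, a minimal sketch of the resolved base-class method under this suggestion; the exact message text is illustrative:

virtual void parallelizeMergePrepare(AggregateDataPtrs & /*places*/, ThreadPool & /*thread_pool*/) const
{
    /// Base class refuses; only aggregate functions that actually support the
    /// preparation stage override this method with meaningful work.
    throw Exception(ErrorCodes::NOT_IMPLEMENTED,
                    "parallelizeMergePrepare() with thread pool parameter isn't implemented for {}",
                    getName());
}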

@@ -2601,6 +2601,21 @@ void NO_INLINE Aggregator::mergeWithoutKeyDataImpl(

AggregatedDataVariantsPtr & res = non_empty_data[0];

for (size_t i = 0; i < params.aggregates_size; ++i)
{
if (aggregate_functions[i]->isAbleToParallelizeMerge() &&
nickitat (Member):

isAbleToParallelizeMerge looks irrelevant here; isParallelizeMergePrepareNeeded should be enough.

jiebinn (Contributor, author):

The flag isAbleToParallelizeMerge is for the parallel merge of uniqExact. Even if the merge stage is not parallel (isAbleToParallelizeMerge = false), we can still convert the hash sets in parallel. The merge stage and the hash-set conversion stage are independent, so I think isAbleToParallelizeMerge can be removed here.
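A sketch of how the check in Aggregator::mergeWithoutKeyDataImpl might look after dropping isAbleToParallelizeMerge; gathering the places follows the usual without_key layout, but the details here are assumptions rather than the exact diff:

for (size_t i = 0; i < params.aggregates_size; ++i)
{
    if (aggregate_functions[i]->isParallelizeMergePrepareNeeded())
    {
        /// Collect the state of aggregate function i from every non-empty variant ...
        AggregateDataPtrs places(non_empty_data.size());
        for (size_t j = 0; j < non_empty_data.size(); ++j)
            places[j] = non_empty_data[j]->without_key + offsets_of_aggregate_states[i];

        /// ... and let the function convert its hash sets in parallel before the merge.
        aggregate_functions[i]->parallelizeMergePrepare(places, thread_pool);
    }
}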

/// In merge, if one of lhs and rhs is a twoLevelSet and the other is a singleLevelSet, the singleLevelSet will need convertToTwoLevel().
/// The conversion is serial and can cost a lot of extra time if the thread count is large.
/// This method converts all the singleLevelSets to twoLevelSets in parallel if the hash sets are neither all single-level nor all two-level.
static void parallelizeMergePrepare(const std::vector<UniqExactSet *> & data_vec, ThreadPool * thread_pool = nullptr)
nickitat (Member):

Could we make sure that the ThreadPool is always not null here and pass it as a reference?

jiebinn (Contributor, author):

Given void parallelizeMergePrepare(AggregateDataPtrs & places, ThreadPool & thread_pool) const override in AggregateFunctionUniq.h, the ThreadPool is always not null. The latest code uses a reference instead of a pointer.

}
};
for (size_t i = 0; i < std::min<size_t>(thread_pool->getMaxThreads(), single_level_set_num); ++i)
thread_pool->scheduleOrThrowOnError(thread_func);
nickitat (Member):

If an exception is thrown here, we will not reach wait(). All this code should be wrapped in a try-catch that also does wait() in case of an exception (refer to #50590).

jiebinn (Contributor, author) commented Jun 20, 2023:

Added try/catch in the latest code.
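The wrapping pattern the reviewer refers to (see #50590) looks roughly like this sketch:

try
{
    for (size_t i = 0; i < std::min<size_t>(thread_pool.getMaxThreads(), single_level_set_num); ++i)
        thread_pool.scheduleOrThrowOnError(thread_func);
}
catch (...)
{
    /// If scheduling fails part-way, already-scheduled workers may still be running
    /// and referencing local state, so wait for them before rethrowing.
    thread_pool.wait();
    throw;
}
thread_pool.wait();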

@nickitat (Member) commented

It would be good to find perf tests that show the speed-up in the addressed case.


void parallelizeMergePrepare(AggregateDataPtrs & places, ThreadPool & thread_pool) const override
{
std::vector<DataSet *> data_vec;
nickitat (Member):

data_vec.resize(places.size());

jiebinn (Contributor, author):

I tried adding data_vec.resize(places.size()); after the initialization of std::vector<DataSet *> data_vec;, but the ClickHouse server crashed. I will check that.

jiebinn (Contributor, author):

Added the resize() for data_vec in parallelizeMergePrepare.

@jiebinn (Contributor, author) commented Jun 15, 2023

@nickitat @devcrafter Thanks for your kind review. I will think about it and let you know once I finish.

@jiebinn (Contributor, author) commented Jun 20, 2023

Regarding the build failure above:

Jun 11 02:52:18 FAILED: src/Interpreters/examples/CMakeFiles/two_level_hash_map.dir/two_level_hash_map.cpp.o
Jun 11 02:52:18 /build/src/AggregateFunctions/UniquesHashSet.h:369:91: error: unknown type name 'ThreadPool' [clang-diagnostic-error]
Jun 11 02:52:18     static void parallelizeMergePrepare(const std::vector<UniquesHashSet *> & /*places*/, ThreadPool * /*thread_pool = nullptr*/) {}

The latest code includes the header <Common/ThreadPool_fwd.h> in the data sets, such as ThetaSketchData, HyperLogLogWithSmallSetOptimization, and UniquesHashSet.
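A stripped-down sketch of why the forward-declaration header is enough here: the signature only mentions ThreadPool by reference, so the compiler never needs the full class definition in this header.

#include <vector>
#include <Common/ThreadPool_fwd.h>   /// declares ThreadPool without pulling in its full definition

class UniquesHashSet
{
public:
    /// Only a reference to ThreadPool appears in the signature, so the forward
    /// declaration above is sufficient for this header to compile.
    static void parallelizeMergePrepare(const std::vector<UniquesHashSet *> & places, ThreadPool & thread_pool);
};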

@jiebinn (Contributor, author) commented Jun 20, 2023

Re "would be good to find perf-tests that shows speed-up in the addressed case": I used clickbench for the performance test. Do you mean we should find the 'distinct/uniq' operations in existing tests and check their performance?

@jiebinn force-pushed the UniqExactSet branch 5 times, most recently from 9a19224 to e42a62f on June 26, 2023 06:56
@jiebinn (Contributor, author) commented Jun 26, 2023

Hi @nickitat @devcrafter, thanks for your previous code review. I have updated the patch code and comments according to the review of the PR.

Before merge, if one of lhs and rhs is a singleLevelSet and the other is a twoLevelSet,
the singleLevelSet will call convertToTwoLevel(). The conversion is not done in parallel
and it costs many cycles as it consumes all the singleLevelSets.

The idea of the patch is to convert all the singleLevelSets to twoLevelSets in parallel if
the hash sets are neither all singleLevel nor all twoLevel.

I have tested the patch on an Intel 2 x 112 vCPU SPR server with clickbench and the latest
upstream ClickHouse. Q5 got a large 2.64x performance improvement, and 24 queries gained at
least 5%. The overall geomean across the 43 queries is 7.4% better than the base code.

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
@jiebinn (Contributor, author) commented Jul 9, 2023

@nickitat The failing CI check has succeeded in the latest attempt, but the status doesn't show that.
[screenshot: CI status]

@jiebinn (Contributor, author) commented Jul 19, 2023

Updated the clickbench Q5 core-scaling performance test and the pipeline figures at the top of the PR, which show how the issue was identified and how the merge was made more scalable.

@jiebinn (Contributor, author) commented Jul 21, 2023

@nickitat @devcrafter @kitaisreal
I have added performance tests for the hits_v1 data set, which show 2.1x and 1.8x performance improvements on my machine:

SELECT COUNT(DISTINCT Title) FROM hits_v1 SETTINGS max_threads = 24
SELECT COUNT(DISTINCT Referer) FROM hits_v1 SETTINGS max_threads = 22

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
@jiebinn (Contributor, author) commented Jul 24, 2023

Renamed the performance-test data set from hits_v1 to test.hits to fit the CI rule.

@@ -66,6 +66,8 @@ class ThetaSketchData : private boost::noncopyable
return 0;
}

static void parallelizeMergePrepare(const std::vector<ThetaSketchData *> & /*places*/, ThreadPool & /*thread_pool*/) {}
nickitat (Member):

If I'm not missing something, every parallelizeMergePrepare could be one of:

  • not defined/declared at all
  • throwing an exception
  • doing meaningful work (an empty implementation is not meaningful)

I have some prejudice towards empty methods, what do you think?

jiebinn (Contributor, author) commented Jul 26, 2023:

Agree.

  • For these empty methods in the other data sets, we could add the throw of an exception.

  • Or we may just delete these empty methods in the other data sets: call DataSet::parallelizeMergePrepare() only when is_parallelize_merge_prepare_needed is true (UniqExactSet) and use constexpr to avoid a compile error, as sketched below.

I would suggest the second way.
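A sketch of the second option, assuming the Data type carries a compile-time flag (called is_parallelize_merge_prepare_needed here, matching the name of the runtime check) so the call only compiles for UniqExactSet-based data:

void parallelizeMergePrepare(AggregateDataPtrs & places, ThreadPool & thread_pool) const override
{
    if constexpr (Data::is_parallelize_merge_prepare_needed)
    {
        std::vector<DataSet *> data_vec;
        data_vec.resize(places.size());

        for (size_t i = 0; i < places.size(); ++i)
            data_vec[i] = &this->data(places[i]).set;

        DataSet::parallelizeMergePrepare(data_vec, thread_pool);
    }
    /// For other data sets isParallelizeMergePrepareNeeded() returns false,
    /// so this branch is never taken and no empty method is needed.
}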

src/AggregateFunctions/UniqExactSet.h (outdated review thread, resolved)
Co-authored-by: Nikita Taranov <nickita.taranov@gmail.com>
@jiebinn force-pushed the UniqExactSet branch 3 times, most recently from 53fa19d to 635e9d7 on July 26, 2023 07:34
…repare()

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
@jiebinn (Contributor, author) commented Jul 27, 2023

The failing check, stress_test (msan), succeeded in the latest attempt; CI just doesn't show the latest status on the webpage.
[screenshot: CI status]

@nickitat merged commit 78f3a57 into ClickHouse:master on Jul 27, 2023
271 of 272 checks passed
jiebinn added a commit to jiebinn/ClickHouse that referenced this pull request Aug 4, 2023

PR ClickHouse#50748 added a new phase, `parallelizeMergePrepare`, before merge
for the case where the hashSets are neither all singleLevel nor all twoLevel. It
converts all the singleLevelSets to twoLevelSets in parallel, which increases CPU
utilization and QPS.

But if all the hashtables are singleLevel, they can also benefit from the
`parallelizeMergePrepare` optimization in most cases, as long as the hashtables are
not too small. By tuning the query `SELECT COUNT(DISTINCT SearchPhase) FROM hits_v1`
with different thread counts, we arrived at a threshold of 6,000.

The patch was tested with the query 'SELECT COUNT(DISTINCT Title) FROM hits_v1' on a
2x80 vCPU server. With fewer than 48 threads, the hashSets are all twoLevel or a mix
of singleLevel and twoLevel. With more than 56 threads, all the hashSets are
singleLevel. QPS gains up to 2.35x:

Threads | Opt/Base
8 | 100.0%
16 | 99.4%
24 | 110.3%
32 | 99.9%
40 | 99.3%
48 | 99.8%
56 | 183.0%
64 | 234.7%
72 | 233.1%
80 | 229.9%
88 | 224.5%
96 | 229.6%
104 | 235.1%
112 | 229.5%
120 | 229.1%
128 | 217.8%
136 | 222.9%
144 | 217.8%
152 | 204.3%
160 | 203.2%

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
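The follow-up logic described in the commit message above can be sketched as below; the constant name and the helper are hypothetical, with 6,000 taken from the tuning reported in the message:

/// Hypothetical constant: the tuned threshold reported in the commit message.
static constexpr size_t parallel_prepare_min_set_size = 6000;

const bool all_single_level = (single_level_set_num == data_vec.size());

/// Convert in parallel either when levels are mixed, or when every set is single-level
/// but large enough that it would be converted to two-level during the merge anyway.
const bool worth_preparing =
    (single_level_set_num > 0 && !all_single_level)
    || (all_single_level
        && std::all_of(data_vec.begin(), data_vec.end(),
                       [](const auto * set) { return set->size() > parallel_prepare_min_set_size; }));

if (worth_preparing)
    convertAllSingleLevelToTwoLevelInParallel(data_vec, thread_pool);  /// hypothetical helper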
nickitat pushed a commit that referenced this pull request Aug 30, 2023

…52973)

* Optimize the merge if all hashSets are singleLevel

(The commit message and thread-scaling table are identical to the Aug 4 commit above.)

* Add the comment and explanation for PR#52973

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>