New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor CapnProto format to improve input/output performance #49752
Conversation
This is an automated comment for commit 24d70a2 with description of existing statuses. It's updated for the latest CI running
|
Benchmark for
After this PR:
x5 improvement for output, x3 improvement for input. But still 10s is too long for 1000000 rows, but it's already the limitation of |
@@ -0,0 +1,298 @@ | |||
#include <Formats/CapnProtoSchema.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code from this file is not new, it's copied from old file CapnProtoUtils.cpp
. The new code is in CapnProtoSerializer.cpp
By updating capnp library (just add additional arguments for some methods to avoid copying, will test it more and make a patch) I can achive x1.5-x2 performance improvement. But the main problem with CapnProto input/output format is that we first serialize/deserialize message in memory and then read data from it. Perfect solution will be to read/write message directly from/into buffer, but it will require writing our own CapnProto message serializer/deserializer (as it's done for Protobuf format). But I am not ready for it now. |
…o better-capnproto-3
New results:
Total improvement now ~15x for output and ~10x for input. @al13n321 I think it's ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable, nice results!
@@ -959,7 +959,7 @@ class IColumn; | |||
M(Bool, output_format_orc_string_as_string, false, "Use ORC String type instead of Binary for String columns", 0) \ | |||
M(ORCCompression, output_format_orc_compression_method, "lz4", "Compression method for ORC output format. Supported codecs: lz4, snappy, zlib, zstd, none (uncompressed)", 0) \ | |||
\ | |||
M(EnumComparingMode, format_capn_proto_enum_comparising_mode, FormatSettings::EnumComparingMode::BY_VALUES, "How to map ClickHouse Enum and CapnProto Enum", 0) \ | |||
M(CapnProtoEnumComparingMode, format_capn_proto_enum_comparising_mode, FormatSettings::CapnProtoEnumComparingMode::BY_VALUES, "How to map ClickHouse Enum and CapnProto Enum", 0) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
format_capn_proto_enum_comparising_mode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, this setting exists for a long time and we cannot change it's name(
tests/queries/0_stateless/02735_capnp_case_insensitive_names_matcing.reference
Outdated
Show resolved
Hide resolved
src/Formats/CapnProtoSerializer.cpp
Outdated
if (enum_comparing_mode == FormatSettings::CapnProtoEnumComparingMode::BY_VALUES) | ||
assert_cast<ColumnVector<EnumType> &>(column).insertValue(static_cast<EnumType>(capnp_enum_value)); | ||
else | ||
assert_cast<ColumnVector<EnumType> &>(column).insertValue(capnp_to_ch_values[capnp_enum_value]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
std::unordered_map<EnumType, UInt16> ch_to_capnp_values; | ||
std::unordered_map<UInt16, EnumType> capnp_to_ch_values; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These could be std::vectors, for speed. Since all indices are at most 16 bits, it'll be at most 128 KB (+ maybe a 64 KB null mask to detect unexpected indices), so don't event need a fallback to hashtable for sparse enums.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I will try to change to vectors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to leave it as map, because by using vectors we won't be able to determine if we have invalid enum value
f39862c
to
47b0c2a
Compare
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Rewrite CapnProto input/output format to improve its performance. Map column names and CapnProto fields case insensitive, fix reading/writing of nested structure fields.
Documentation entry for user-facing changes