New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce String Hash Map to speed up aggregation over short string keys. #6243
Conversation
744e913
to
5f3057f
Compare
79ce3e1
to
2bb68c2
Compare
16c9554
to
c5cc239
Compare
c00ddeb
to
ce3671e
Compare
No significant difference on this query:
|
That's what I got from string_hash_map test, with ArenaKeyHolder, 49 runs:
All short strings improve, and for long strings there is 5% regression which shouldn't be noticeable in real workload. We'll see what the CI performance test says. |
The next thing I want to do is to study the coverage info -- e.g. |
This query regresses x1.23 in performance test, but I can't reproduce the difference. According to perf report, the time is dominated by reading the table, applying the prewhere clause and calculating |
It speeds up aggregation over short string keys. Use it as a default aggregation method for string keys.
switch (sz) | ||
{ | ||
case 0: | ||
keyHolderDiscardKey(key_holder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, we don't persist yet, why discard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emplace() guarantees that it will either persist or discard the key. It doesn't pass the zero key_holder forward, and instead passes the empty key0
, so it doesn't need the key from key_holder and should discard it here.
Actually I'm not sure when leaving it out would have any visible consequences -- discard is only defined for SerializedKeyHolder, and if it is zero, it's going to have zero size, and the discard would be a noop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emplace() guarantees that it will either persist or discard the key.
Hmm, but dispatch is also used for find() and potentially hash().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For uses other than emplace, it should always be the NoopKeyHolder, for which this call does nothing. I noticed that the signature of FindCallable::operator () is somewhat misleading now, because it accepts a template KeyHolder. I'll change it to only accept the NoopKeyHolder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, we don't have the noop key holder anymore. It's represented by just the plain key now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized I've answered the question about OnExistingKey and OnNewKey to you #5417 (comment). Heh, now it confuses me. Those names do convey more than they should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I think the persist/discard is better that onNew/onExisting, but maybe still bad indeed. Don't know how to rename it. Basically, it is a way for a hash table to say that it either needs a persistent version of this key, or it doesn't. Something like makeKeyPersistent
/dontMakeKeyPersistent
? This also may be bad, because for SerializedKeyHolder, discardKey
actually destroys it, and you can't use it anymore.
The ubsan failure is in |
With the latest dispatch improvement (
|
|
This PR introduces the string hash map that speeds up the aggregation over short string keys, finishing the series of patches from #5417.
Category:
Changelog entry:
The performance of aggregation over short string keys is improved.