New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide fallback to Python/Spark-like splitting in splitBy*() functions #54518
Conversation
This is an automated comment for commit 774c4b5 with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page Successful checks
|
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
00ac14c
to
b583b80
Compare
@@ -38,6 +38,8 @@ The behavior of parameter `max_substrings` changed starting with ClickHouse v22. | |||
For example, | |||
- in v22.10: `SELECT splitByChar('=', 'a=b=c=d', 2); -- ['a','b','c=d']` | |||
- in v22.11: `SELECT splitByChar('=', 'a=b=c=d', 2); -- ['a','b']` | |||
|
|||
The previous behavior can be restored by setting [splitby_max_substring_behavior](../../operations/settings/settings.md#splitby-max-substring-behavior) = 'python'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can probably use the compatibility setting to set it to correct behaviour and mention here that it can be used also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new behavior with splitby_max_substrings_includes_remaining_string = 1
doesn't match 100% the pre-v22.11 behavior (one needs to adjust max_substrings
argument by 1). But I feel that's an acceptable compromise which reduces implementation complexity and understanding for users. So I better refrain from compatibility settings, to keep surprises minimal.
src/Functions/FunctionsStringArray.h
Outdated
if (max_splits) | ||
{ | ||
switch (max_substring_behavior) | ||
{ | ||
case MaxSubstringBehavior::LikeClickHouse: | ||
{ | ||
if (splits == *max_splits) | ||
return false; | ||
break; | ||
} | ||
case MaxSubstringBehavior::LikeSpark: | ||
{ | ||
if (splits == *max_splits - 1) | ||
{ | ||
token_end = end; | ||
pos = nullptr; | ||
return true; | ||
} | ||
break; | ||
} | ||
case MaxSubstringBehavior::LikePython: | ||
{ | ||
if (splits == *max_splits) | ||
{ | ||
token_end = end; | ||
pos = nullptr; | ||
return true; | ||
} | ||
break; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar chunk seems to be repeated often
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will hopefully become simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some subtle differences between places where the chunk is used. Can't pull it out easily. It is much simpler in the last commit, guess we can keep it as is.
Added a setting
splitby_max_substrings_includes_remaining_string
which controls if functionssplitBy*()
withmax_substring
argument > 0 include the remaining string in the result array (Python/Spark semantics) or not. To keep existing use cases working, I did not touch the default behavior, i.e.splitby_max_substrings_includes_remaining_string
by default is0
.For reference, please see the discussions
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Added a setting "splitby_max_substrings_includes_remaining_string" which controls if functions "splitBy*()" with argument "max_substring" > 0 include the remaining string (if any) in the result array (Python/Spark semantics) or not. The default behavior does not change.