Merge pull request #54518 from rschu1ze/split-better
Provide fallback to Python/Spark-like splitting in splitBy*() functions
rschu1ze committed Sep 22, 2023
2 parents 1639611 + 774c4b5 commit be1e92a
Showing 13 changed files with 699 additions and 410 deletions.
11 changes: 11 additions & 0 deletions docs/en/operations/settings/settings.md
@@ -4067,6 +4067,17 @@ Result:
└─────┴─────┴───────┘
```

## splitby_max_substrings_includes_remaining_string {#splitby_max_substrings_includes_remaining_string}

Controls whether function [splitBy*()](../../sql-reference/functions/splitting-merging-functions.md) with argument `max_substrings` > 0 will include the remaining string in the last element of the result array.

Possible values:

- `0` - The remaining string will not be included in the last element of the result array.
- `1` - The remaining string will be included in the last element of the result array. This is the behavior of Spark's [`split()`](https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.split.html) function and Python's [`str.split()`](https://docs.python.org/3/library/stdtypes.html#str.split) method.

Default value: `0`
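
The two values of the setting can be illustrated with a minimal standalone sketch. It is not the ClickHouse implementation: `splitByCharSketch` and its `include_remainder` parameter are hypothetical names, with `include_remainder` standing in for `splitby_max_substrings_includes_remaining_string`. Note that Python's `maxsplit` counts splits rather than substrings, so `max_substrings = 2` corresponds to `'a=b=c=d'.split('=', 1)`.

```cpp
#include <cassert>   // only needed for checks like the ones in the note below
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper, not ClickHouse code.
// max_substrings <= 0 means "no limit"; include_remainder mimics the setting.
std::vector<std::string> splitByCharSketch(
    char separator, std::string_view s, int64_t max_substrings, bool include_remainder)
{
    std::vector<std::string> result;
    size_t pos = 0;

    while (true)
    {
        const size_t next = s.find(separator, pos);
        // True when the element about to be added is the last one allowed.
        const bool is_last_allowed
            = max_substrings > 0 && result.size() + 1 == static_cast<size_t>(max_substrings);

        if (next == std::string_view::npos || (is_last_allowed && include_remainder))
        {
            // No separator left, or the last slot takes everything that remains.
            result.emplace_back(s.substr(pos));
            return result;
        }

        result.emplace_back(s.substr(pos, next - pos));
        if (is_last_allowed)
            return result; // the rest of the string is dropped
        pos = next + 1;
    }
}
```

With the input used in the documentation, `splitByCharSketch('=', "a=b=c=d", 2, false)` yields `{"a", "b"}`, and passing `true` as the last argument yields `{"a", "b=c=d"}`.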

## enable_extended_results_for_datetime_functions {#enable-extended-results-for-datetime-functions}

Enables or disables returning results of type:
16 changes: 15 additions & 1 deletion docs/en/sql-reference/functions/splitting-merging-functions.md
@@ -21,7 +21,7 @@ splitByChar(separator, s[, max_substrings]))

- `separator` — The separator which should contain exactly one character. [String](../../sql-reference/data-types/string.md).
- `s` — The string to split. [String](../../sql-reference/data-types/string.md).
- `max_substrings` — An optional `Int64` defaulting to 0. When `max_substrings` > 0, the returned substrings will be no more than `max_substrings`, otherwise the function will return as many substrings as possible.
- `max_substrings` — An optional `Int64` defaulting to 0. If `max_substrings` > 0, the returned array will contain at most `max_substrings` substrings, otherwise the function will return as many substrings as possible.

**Returned value(s)**

@@ -38,6 +38,10 @@ The behavior of parameter `max_substrings` changed starting with ClickHouse v22.
For example,
- in v22.10: `SELECT splitByChar('=', 'a=b=c=d', 2); -- ['a','b','c=d']`
- in v22.11: `SELECT splitByChar('=', 'a=b=c=d', 2); -- ['a','b']`

Behavior similar to that of ClickHouse before v22.11 can be achieved by setting
[splitby_max_substrings_includes_remaining_string](../../operations/settings/settings.md#splitby_max_substrings_includes_remaining_string):
`SELECT splitByChar('=', 'a=b=c=d', 2) SETTINGS splitby_max_substrings_includes_remaining_string = 1 -- ['a', 'b=c=d']`
:::

**Example**
@@ -80,6 +84,8 @@ Type: [Array](../../sql-reference/data-types/array.md)([String](../../sql-refere
- There are multiple consecutive non-empty separators;
- The original string `s` is empty while the separator is not empty.

Setting [splitby_max_substrings_includes_remaining_string](../../operations/settings/settings.md#splitby_max_substrings_includes_remaining_string) (default: 0) controls if the remaining string is included in the last element of the result array when argument `max_substrings` > 0.

**Example**

``` sql
@@ -133,6 +139,8 @@ Returns an array of selected substrings. Empty substrings may be selected when:

Type: [Array](../../sql-reference/data-types/array.md)([String](../../sql-reference/data-types/string.md)).

Setting [splitby_max_substrings_includes_remaining_string](../../operations/settings/settings.md#splitby_max_substrings_includes_remaining_string) (default: 0) controls if the remaining string is included in the last element of the result array when argument `max_substrings` > 0.

**Example**

``` sql
@@ -182,6 +190,8 @@ Returns an array of selected substrings.

Type: [Array](../../sql-reference/data-types/array.md)([String](../../sql-reference/data-types/string.md)).

Setting [splitby_max_substrings_includes_remaining_string](../../operations/settings/settings.md#splitby_max_substrings_includes_remaining_string) (default: 0) controls if the remaining string is included in the last element of the result array when argument `max_substrings` > 0.

**Example**

``` sql
@@ -219,6 +229,8 @@ Returns an array of selected substrings.

Type: [Array](../../sql-reference/data-types/array.md)([String](../../sql-reference/data-types/string.md)).

Setting [splitby_max_substrings_includes_remaining_string](../../operations/settings/settings.md#splitby_max_substrings_includes_remaining_string) (default: 0) controls if the remaining string is included in the last element of the result array when argument `max_substrings` > 0.

**Example**

``` sql
@@ -279,6 +291,8 @@ Returns an array of selected substrings.

Type: [Array](../../sql-reference/data-types/array.md)([String](../../sql-reference/data-types/string.md)).

Setting [splitby_max_substrings_includes_remaining_string](../../operations/settings/settings.md#splitby_max_substrings_includes_remaining_string) (default: 0) controls if the remaining string is included in the last element of the result array when argument `max_substrings` > 0.

**Example**

``` sql
1 change: 1 addition & 0 deletions src/Core/Settings.h
@@ -503,6 +503,7 @@ class IColumn;
M(Bool, reject_expensive_hyperscan_regexps, true, "Reject patterns which will likely be expensive to evaluate with hyperscan (due to NFA state explosion)", 0) \
M(Bool, allow_simdjson, true, "Allow using simdjson library in 'JSON*' functions if AVX2 instructions are available. If disabled rapidjson will be used.", 0) \
M(Bool, allow_introspection_functions, false, "Allow functions for introspection of ELF and DWARF for query profiling. These functions are slow and may impose security considerations.", 0) \
M(Bool, splitby_max_substrings_includes_remaining_string, false, "Functions 'splitBy*()' with 'max_substrings' argument > 0 include the remaining string as last element in the result", 0) \
\
M(Bool, allow_execute_multiif_columnar, true, "Allow execute multiIf function columnar", 0) \
M(Bool, formatdatetime_f_prints_single_zero, false, "Formatter '%f' in function 'formatDateTime()' produces a single zero instead of six zeros if the formatted value has no fractional seconds.", 0) \
2 changes: 1 addition & 1 deletion src/Functions/FunctionHelpers.cpp
@@ -104,7 +104,7 @@ void validateArgumentType(const IFunction & func, const DataTypes & arguments,

const auto & argument = arguments[argument_index];
if (!validator_func(*argument))
throw Exception(ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT, "Illegal type {} of {} argument of function {} expected {}",
throw Exception(ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT, "Illegal type {} of {} argument of function {}, expected {}",
argument->getName(), std::to_string(argument_index), func.getName(), expected_type_description);
}

54 changes: 42 additions & 12 deletions src/Functions/FunctionsStringArray.cpp
@@ -5,23 +5,53 @@ namespace DB
{
namespace ErrorCodes
{
extern const int ILLEGAL_TYPE_OF_ARGUMENT;
extern const int NUMBER_OF_ARGUMENTS_DOESNT_MATCH;
extern const int ILLEGAL_COLUMN;
}

DataTypePtr FunctionArrayStringConcat::getReturnTypeImpl(const DataTypes & arguments) const
template <typename DataType>
std::optional<Int64> extractMaxSplitsImpl(const ColumnWithTypeAndName & argument)
{
if (arguments.size() != 1 && arguments.size() != 2)
throw Exception(ErrorCodes::NUMBER_OF_ARGUMENTS_DOESNT_MATCH,
"Number of arguments for function {} doesn't match: passed {}, should be 1 or 2.",
getName(), arguments.size());
const auto * col = checkAndGetColumnConst<ColumnVector<DataType>>(argument.column.get());
if (!col)
return std::nullopt;

const DataTypeArray * array_type = checkAndGetDataType<DataTypeArray>(arguments[0].get());
if (!array_type)
throw Exception(ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT, "First argument for function {} must be an array.", getName());
auto value = col->template getValue<DataType>();
return static_cast<Int64>(value);
}

std::optional<size_t> extractMaxSplits(const ColumnsWithTypeAndName & arguments, size_t max_substrings_argument_position)
{
if (max_substrings_argument_position >= arguments.size())
return std::nullopt;

std::optional<Int64> max_splits;
if (!((max_splits = extractMaxSplitsImpl<UInt8>(arguments[max_substrings_argument_position])) || (max_splits = extractMaxSplitsImpl<Int8>(arguments[max_substrings_argument_position]))
|| (max_splits = extractMaxSplitsImpl<UInt16>(arguments[max_substrings_argument_position])) || (max_splits = extractMaxSplitsImpl<Int16>(arguments[max_substrings_argument_position]))
|| (max_splits = extractMaxSplitsImpl<UInt32>(arguments[max_substrings_argument_position])) || (max_splits = extractMaxSplitsImpl<Int32>(arguments[max_substrings_argument_position]))
|| (max_splits = extractMaxSplitsImpl<UInt64>(arguments[max_substrings_argument_position])) || (max_splits = extractMaxSplitsImpl<Int64>(arguments[max_substrings_argument_position]))))
throw Exception(
ErrorCodes::ILLEGAL_COLUMN,
"Illegal column {}, which is {}-th argument",
arguments[max_substrings_argument_position].column->getName(),
max_substrings_argument_position + 1);

if (*max_splits <= 0)
return std::nullopt;

return max_splits;
}

DataTypePtr FunctionArrayStringConcat::getReturnTypeImpl(const ColumnsWithTypeAndName & arguments) const
{
FunctionArgumentDescriptors mandatory_args{
{"arr", &isArray<IDataType>, nullptr, "Array"},
};

FunctionArgumentDescriptors optional_args{
{"separator", &isString<IDataType>, isColumnConst, "const String"},
};

if (arguments.size() == 2 && !isString(arguments[1]))
throw Exception(ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT, "Second argument for function {} must be constant string.", getName());
validateFunctionArgumentTypes(*this, arguments, mandatory_args, optional_args);

return std::make_shared<DataTypeString>();
}
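
The normalization that `extractMaxSplits` performs (accept any integer width, widen to `Int64`, and treat non-positive values as "no limit") can be sketched outside ClickHouse as follows. This is an illustration under assumptions, not the code from this commit: `ConstIntSketch` and `normalizeMaxSplitsSketch` are hypothetical names, and a `std::variant` stands in for the constant-column checks done with `checkAndGetColumnConst`.

```cpp
#include <cassert>   // only needed for checks like the ones in the note below
#include <cstddef>
#include <cstdint>
#include <optional>
#include <variant>

// Hypothetical stand-in for a constant integer argument of any width.
// ClickHouse's ColumnConst / ColumnVector<T> are not modelled here.
using ConstIntSketch
    = std::variant<uint8_t, int8_t, uint16_t, int16_t, uint32_t, int32_t, uint64_t, int64_t>;

// Returns std::nullopt when the argument is absent or non-positive,
// otherwise the limit as size_t.
std::optional<size_t> normalizeMaxSplitsSketch(const std::optional<ConstIntSketch> & argument)
{
    if (!argument)
        return std::nullopt; // argument not supplied

    // Widen whichever alternative is held to int64_t. As with the
    // static_cast<Int64> in the diff, very large uint64_t values wrap.
    const int64_t value = std::visit([](auto v) { return static_cast<int64_t>(v); }, *argument);

    if (value <= 0)
        return std::nullopt; // 0 or negative means "as many substrings as possible"

    return static_cast<size_t>(value);
}
```

For example, `normalizeMaxSplitsSketch(ConstIntSketch{uint16_t{3}})` holds `3`, while a missing argument, `0`, or a negative value all produce an empty optional. One difference from the diff: the real function throws `ILLEGAL_COLUMN` for a non-integer or non-constant column, a case the variant type rules out at compile time.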
