Fix extractAll and similar functions on Nullable inputs#104326
Fix extractAll and similar functions on Nullable inputs#104326alexey-milovidov wants to merge 3 commits intomasterfrom
Conversation
For functions that use the default Nulls implementation but return a type that cannot be Nullable (e.g. `Array`, `Tuple`, `Map`), the framework previously called `makeNullable` on the return type, which threw "Nested type is not allowed inside Nullable type". This made `extractAll`, `splitByChar`, `splitByRegexp`, `extractAllGroups`, and similar functions unusable on Nullable columns. Switch `IFunctionOverloadResolver::getReturnTypeWithoutLowCardinality` to use `makeNullableSafe`, and skip the `wrapInNullable` step in `IExecutableFunction::defaultImplementationForNulls` when the result type is not Nullable. For null input rows the function is evaluated over the default value of the nested column (e.g. an empty string), producing the natural default of the result type (e.g. an empty array) instead of raising a type-check error. Closes #56977 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Workflow [PR], commit [40f3891] Summary: ❌
AI ReviewSummaryThis PR fixes default null-handling for functions that use ClickHouse Rules
Final VerdictStatus: ✅ Approve |
|
The comment in IFunction.h above useDefaultImplementationForNulls() says that the function may be executed with garbage input instead of null values, and this PR removes the wrapInNullable() and just returns res without applying the null_map. I think there is a possibility of returning the output over these garbage values in some cases, reproducer (the first line should just output null for extractAll but it outputs |
…esult
`IExecutableFunction::defaultImplementationForNulls` had three
optimization paths that returned `default(result_type)` for null input
rows: the early return when a const-null argument is detected, the
all-null block early return, and the short-circuit `filter`/`expand`
branch. For non-Nullable result types this differs from the
`f(default(input))` semantics described in the PR — for example,
`splitByChar(',', toNullable(''))` should produce `['']`, but the
short-circuit branch (with
`short_circuit_function_evaluation_for_nulls_threshold = 0.5`) would
produce `[]` because the default of `Array(String)` is the empty array.
Disable each of these optimizations when the result type is not
Nullable so null rows fall through to the regular evaluation path,
which runs the function on the nested column where null rows hold the
default of the input type.
Spotted by clickhouse-gh review on
#104326.
Extend `04209_extractAll_nullable` with a short-circuit case and update
the all-null reference outputs to reflect the corrected semantics
(`splitByChar(',', NULL)` = `['']`, `splitByWhitespace(NULL)` = `[]`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before, `arrayAutocorrelation([1, 2, 3], CAST(2, 'Nullable(UInt32)'))` failed with `ILLEGAL_TYPE_OF_ARGUMENT` only because `makeNullable(Array(Float64))` rejected the wrapping in the framework's `defaultImplementationForNulls`. With the framework fix in #104326 the function legitimately accepts a Nullable lag (the framework strips `Nullable` before invoking `getReturnTypeImpl`), so the assertion no longer holds. Replace the negative-test expectation with the actual result `[1,0]`. The Nullable element-type case (`arrayAutocorrelation(CAST([1,2,3], 'Array(Nullable(UInt32))'))`) is rejected by `getReturnTypeImpl` itself and continues to error. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104326&sha=f5d225718dcbb18021b0230351e79f836c31395c&name_0=PR&name_1=Fast%20test Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
george-larionov
left a comment
There was a problem hiding this comment.
I have some concerns with the handling of null inputs containing garbage vals, see comments.
| /// return it as-is. For null input rows, the function will be evaluated | ||
| /// over the default values of the nested column and produce the corresponding default result | ||
| /// (e.g. an empty array), instead of failing the type check. | ||
| return makeNullableSafe(return_type); |
There was a problem hiding this comment.
makeNullableSafe() returns the return_type untouched if it cannot be made Nullable. Since useDefaultImplementationForNulls() is true by default for derived classes, this means that functions that return non-nullable types used to fail and force the author to think about whether they should be using useDefaultImplementationForNulls or not. With this change, they will just run with potentially incorrect results. Maybe we should set useDefaultImplementationForNulls to false by default then?
| /// Each row should be evaluated if there are no nulls or short circuiting is disabled. | ||
| auto res = executeWithoutLowCardinalityColumns(temporary_columns, temporary_result_type, input_rows_count, dry_run); | ||
| if (!result_is_nullable) | ||
| return res; |
There was a problem hiding this comment.
This res may contain outputs from garbage inputs (there is no guarantee that the null rows actually contain default vals AFAIK). Unlike in the code on/after line 322, there is no handling of this. Maybe in the case of !result_is_nullable it should always go through the logic on line 322-339? See this comment.
|
|
||
| auto res = executeWithoutLowCardinalityColumns(temporary_columns, temporary_result_type, input_rows_count, dry_run); | ||
| if (!result_is_nullable) | ||
| return res; |
There was a problem hiding this comment.
This res may contain outputs from garbage inputs (there is no guarantee that the null rows actually contain default vals AFAIK). Unlike in the code on/after line 322, there is no handling of this. Maybe in the case of !result_is_nullable it should always go through the logic on line 322-339? See this comment.
|
Another example with incorrect output, should evaluate to |
LLVM Coverage Report
Changed lines: 97.37% (37/38) · Uncovered code |
For functions that use the default Nulls implementation but return a type that cannot be Nullable (e.g.
Array,Tuple,Map), the framework previously calledmakeNullableon the return type, which threw "Nested type is not allowed inside Nullable type". This madeextractAll,splitByChar,splitByRegexp,extractAllGroups, and similar functions unusable onNullablecolumns.IFunctionOverloadResolver::getReturnTypeWithoutLowCardinalitynow usesmakeNullableSafe, andIExecutableFunction::defaultImplementationForNullsskips thewrapInNullablestep when the result type is notNullable. Null input rows are evaluated over the default value of the nested column (e.g. an empty string), producing the natural default of the result type (e.g.[]forextractAll).Closes #56977
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Functions that return a non-
Nullabletype (such asArray,Tuple, orMap) now acceptNullablearguments. Affected functions includeextractAll,extractAllGroups,extractAllGroupsHorizontal,extractAllGroupsVertical,extractGroups,splitByChar,splitByString,splitByRegexp,splitByWhitespace,splitByNonAlpha, andalphaTokens.NULLinput rows produce the default value of the result type (e.g. an empty array) instead of raising "Nested type is not allowed inside Nullable type".Documentation entry for user-facing changes