[Enhancement] `files()` type promotion #40959

rickif · 2024-02-07T09:57:28Z

Why I'm doing:

`files()` merges the schemas of the files it reads, but the current merge rule is too simple.

What I'm doing:

This PR improves type promotion in `files()` with the following changes:

Promote differing decimal types to the larger decimal type.
Promote differing string types to the larger string type.
Promote float and integer types to double.
Promote conflicting complex types to varchar.
Fixes #38992: [schema auto-detection merge] error when querying the file table function if the Parquet data files have different decimal types.
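The decimal and string rules listed above can be illustrated with a minimal, self-contained sketch. It does not use the StarRocks `TypeDescriptor`; `DecimalType`, `promote_decimal`, `promote_string_length`, and the bounds `kMaxPrecision` and `kMaxScale` are hypothetical names and values chosen for illustration only.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical, simplified stand-in for a decimal column type.
// This is NOT the StarRocks TypeDescriptor.
struct DecimalType {
    int precision;
    int scale;
};

// Assumed upper bounds, for illustration only.
constexpr int kMaxPrecision = 38;
constexpr int kMaxScale = 38;

// Merge two decimal types by taking the larger precision and the larger
// scale, each clamped to its assumed upper bound.
DecimalType promote_decimal(const DecimalType& a, const DecimalType& b) {
    DecimalType merged;
    merged.precision = std::min(std::max(a.precision, b.precision), kMaxPrecision);
    merged.scale = std::min(std::max(a.scale, b.scale), kMaxScale);
    return merged;
}

// Merge two string types of the same kind by keeping the larger length.
int promote_string_length(int len_a, int len_b) {
    return std::max(len_a, len_b);
}

// Small usage example for the two helpers above.
void example_usage_decimal_and_string() {
    DecimalType merged = promote_decimal({9, 2}, {18, 4});
    assert(merged.precision == 18 && merged.scale == 4);
    assert(promote_string_length(10, 255) == 255);
}
```

In this sketch the two files' column types are reconciled field by field, so the merged schema can hold values from either file.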

What type of PR is this:

Does this PR entail a change in behavior?

Yes, this PR will result in a change in behavior.
No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

Interface/UI changes: syntax, type conversion, expression evaluation, display information
Parameter changes: default values, similar parameters but with different default values
Policy changes: use new policy to replace old one, functionality automatically enabled
Feature removed
Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

I have added test cases for my bug fix or my new feature
This PR needs user documentation (for new or modified features or behaviors)
- I have added documentation for my new feature or new function
This is a backport PR

Bugfix cherry-pick branch check:

starrocks-cr · 2024-02-07T09:59:54Z

be/src/runtime/types.h

+        // treat other conflicted types as varchar.
+        return TypeDescriptor::create_varchar_type(TypeDescriptor::MAX_VARCHAR_LENGTH);
+    }
+
 private:
    /// Used to create a possibly nested type from the flattened Thrift representation.
    ///


The most risky bug in this code is:
incorrect or missing handling for type conflicts between integer and float types in the promote_types method

You can modify the code like this:

static TypeDescriptor promote_types(const TypeDescriptor& type1, const TypeDescriptor& type2) {
    DCHECK(type1 != type2);
    if (type1.is_integer_type() && type2.is_integer_type()) {
        // promote integer type. Larger enum values mean larger value ranges.
        auto tp = type1.type > type2.type ? type1.type : type2.type;
        return TypeDescriptor::from_logical_type(tp);
    } else if ((type1.is_float_type() && type2.is_integer_type()) ||
               (type1.is_integer_type() && type2.is_float_type())) {
        // if one is float and other is integer, promote to double
        return TypeDescriptor::from_logical_type(TYPE_DOUBLE);
    } else if (type1.is_float_type() && type2.is_float_type()) {
        // promote all float to double.
        return TypeDescriptor::from_logical_type(TYPE_DOUBLE);
    } else if (type1.is_decimal_type() && type2.is_decimal_type()) {
        // decimal v3 only
        auto tp = type1.type > type2.type ? type1.type : type2.type;
        if (tp > TYPE_DECIMAL128) tp = TYPE_DECIMAL128;
        if (tp < TYPE_DECIMAL32) tp = TYPE_DECIMAL32;
        auto precision = type1.precision > type2.precision ? type1.precision : type2.precision;
        if (precision > MAX_PRECISION) precision = MAX_PRECISION;
        auto scale = type1.scale > type2.scale ? type1.scale : type2.scale;
        if (scale > MAX_SCALE) scale = MAX_SCALE;
        return TypeDescriptor::create_decimalv3_type(tp, precision, scale);
    } else if (type1.type == TYPE_VARCHAR && type2.type == TYPE_VARCHAR) {
        auto len = type1.len > type2.len ? type1.len : type2.len;
        return TypeDescriptor::create_varchar_type(len);
    } else if (type1.type == TYPE_CHAR && type2.type == TYPE_CHAR) {
        auto len = type1.len > type2.len ? type1.len : type2.len;
        return TypeDescriptor::create_char_type(len);
    } else if (type1.type == TYPE_VARBINARY && type2.type == TYPE_VARBINARY) {
        auto len = type1.len > type2.len ? type1.len : type2.len;
        return TypeDescriptor::create_varbinary_type(len);
    }
    // treat other conflicted types as varchar.
    return TypeDescriptor::create_varchar_type(TypeDescriptor::MAX_VARCHAR_LENGTH);
}

This modification aims to address the risk by adding a case to handle the scenario where one type is an integer and the other is a float, promoting the result to double, thus preventing potential loss of precision or incorrect type promotion.
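To make the integer/float case concrete, here is a minimal standalone sketch of the same decision order. It is a simplified model, not the StarRocks implementation: `SimpleType`, `is_integer`, `is_float`, and `promote_simple` are hypothetical names, and unlike the snippet above (which asserts that the two inputs differ) this sketch simply returns the input when both types are equal.

```cpp
#include <cassert>

// Hypothetical, simplified type tags. Integer tags are declared in order of
// increasing range so that a larger enum value means a wider integer.
enum class SimpleType { TinyInt, SmallInt, Int, BigInt, Float, Double, Varchar };

bool is_integer(SimpleType t) {
    return t >= SimpleType::TinyInt && t <= SimpleType::BigInt;
}

bool is_float(SimpleType t) {
    return t == SimpleType::Float || t == SimpleType::Double;
}

// Pick a common type for two column types found in different files.
SimpleType promote_simple(SimpleType a, SimpleType b) {
    if (a == b) return a;  // assumption of this sketch: equal types need no promotion
    if (is_integer(a) && is_integer(b)) {
        return a > b ? a : b;  // the wider integer wins
    }
    bool a_numeric = is_integer(a) || is_float(a);
    bool b_numeric = is_integer(b) || is_float(b);
    if (a_numeric && b_numeric) {
        return SimpleType::Double;  // integer/float and float/double both become double
    }
    return SimpleType::Varchar;  // any other conflict falls back to varchar
}

// Small usage example: the rule is symmetric in its arguments.
void example_usage_numeric() {
    assert(promote_simple(SimpleType::Int, SimpleType::Float) == SimpleType::Double);
    assert(promote_simple(SimpleType::Float, SimpleType::Int) == SimpleType::Double);
}
```

Checking both argument orders matters here because the merge may encounter the files in any order.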

Does this comment help you?

@dirtysalt Yeah. I've changed the PR according to the comment.

dirtysalt · 2024-02-17T00:26:49Z

be/test/formats/orc/orc_chunk_reader_test.cpp

@@ -2241,7 +2241,7 @@ TEST_F(OrcChunkReaderTest, get_file_schema) {
              {"col_float", TypeDescriptor::from_logical_type(TYPE_FLOAT)},
              {"col_double", TypeDescriptor::from_logical_type(TYPE_DOUBLE)},
              {"col_string", TypeDescriptor::create_varchar_type(1048576)},
-              {"col_char", TypeDescriptor::create_char_type(10)},
+              {"col_char", TypeDescriptor::create_char_type(255)},


Would it be better to use MAX_CHAR_LENGTH here?

rickif · 2024-02-19T05:58:39Z

@Mergifyio rebase

mergify · 2024-02-19T05:59:01Z

rebase

✅ Branch has been successfully rebased

rickif · 2024-02-19T08:11:33Z

@Mergifyio rebase

mergify · 2024-02-19T08:11:54Z

rebase

✅ Branch has been successfully rebased

github-actions · 2024-02-23T12:55:31Z

[FE Incremental Coverage Report]

✅ pass : 0 / 0 (0%)

github-actions · 2024-02-23T13:04:18Z

[BE Incremental Coverage Report]

✅ pass : 38 / 38 (100.00%)

file detail

| | path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|---|
| 🔵 | be/src/exec/file_scanner.cpp | 4 | 4 | 100.00% | [] |
| 🔵 | be/src/runtime/types.h | 5 | 5 | 100.00% | [] |
| 🔵 | be/src/runtime/types.cpp | 29 | 29 | 100.00% | [] |

github-actions · 2024-02-28T02:09:14Z

@Mergifyio backport branch-3.2

mergify · 2024-02-28T02:09:18Z

backport branch-3.2

✅ Backports have been created

#41782 [Enhancement] files() type promotion (backport #40959) has been created for branch branch-3.2

rickif requested a review from a team as a code owner February 7, 2024 09:57

github-actions bot added the 3.2 label Feb 7, 2024

mergify bot assigned rickif Feb 7, 2024

starrocks-cr bot reviewed Feb 7, 2024

rickif requested a review from a team as a code owner February 10, 2024 13:44

rickif force-pushed the fix/promote-types branch from ebea1c2 to 044afba Compare February 13, 2024 03:18

dirtysalt reviewed Feb 17, 2024

rickif force-pushed the fix/promote-types branch 4 times, most recently from 9c6e553 to 34bd7e9 Compare February 19, 2024 02:10

rickif force-pushed the fix/promote-types branch from 34bd7e9 to b076f91 Compare February 19, 2024 05:59

rickif added 8 commits February 19, 2024 08:11

test: add test_promote_types

5170e0b

Signed-off-by: ricky <rickif@qq.com>

types: handle string type

915e787

Signed-off-by: ricky <rickif@qq.com>

test: update test_promote_types

833d160

Signed-off-by: ricky <rickif@qq.com>

orc_schema_builder: infer orc VARCHAR/CHAR as max length

cd0595d

Signed-off-by: ricky <rickif@qq.com>

*: format

f019e56

Signed-off-by: ricky <rickif@qq.com>

make: rename test file name

c0708be

Signed-off-by: ricky <rickif@qq.com>

Revert "make: rename test file name"

e40d3a7

This reverts commit dbb3d59. Signed-off-by: ricky <rickif@qq.com>

types: move implement to cpp

5e3d677

Signed-off-by: ricky <rickif@qq.com>

rickif force-pushed the fix/promote-types branch from b076f91 to 5e3d677 Compare February 19, 2024 08:11

wyb requested review from wyb and meegoo February 23, 2024 06:53

*: update OrcChunkReaderTest

ab3a21e

Signed-off-by: ricky <rickif@qq.com>

wyb approved these changes Feb 23, 2024

wyb enabled auto-merge (squash) February 23, 2024 11:31

meegoo approved these changes Feb 27, 2024

satanson approved these changes Feb 27, 2024

dirtysalt approved these changes Feb 28, 2024

wyb merged commit bb00982 into StarRocks:main Feb 28, 2024
47 checks passed

github-actions bot removed the 3.2 label Feb 28, 2024

mergify bot pushed a commit that referenced this pull request Feb 28, 2024

[Enhancement] files() type promotion (#40959)

545a361

Signed-off-by: ricky <rickif@qq.com> (cherry picked from commit bb00982)

mergify bot mentioned this pull request Feb 28, 2024

[Enhancement] files() type promotion (backport #40959) #41782

Merged

18 tasks

wanpengfei-git pushed a commit that referenced this pull request Feb 28, 2024

[Enhancement] files() type promotion (backport #40959) (#41782)

65ef3f6

Co-authored-by: ricky <rickif@qq.com>

github-actions bot added the 3.2-merged label Feb 28, 2024

Seaven pushed a commit to Seaven/starrocks that referenced this pull request Feb 28, 2024

[Enhancement] files() type promotion (StarRocks#40959)

b8d29da

Signed-off-by: ricky <rickif@qq.com> Signed-off-by: Seaven <seaven_7@qq.com>

rickif deleted the fix/promote-types branch March 6, 2024 07:18
