[Enhancement] files() type promotion #40959

Merged
merged 9 commits into from Feb 28, 2024

rickif
Contributor

@rickif rickif commented Feb 7, 2024

Why I'm doing:

The files() table function merges the schemas of the files it reads. The current merging rule is too simple.
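
To make the schema merge concrete, here is a minimal sketch of folding several per-file schemas into one. It is not the StarRocks implementation: `Kind`, `Schema`, `promote`, and `merge_schemas` are hypothetical names, and matching columns by name is an assumption made for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily reduced set of column types for this sketch.
enum class Kind { Int, BigInt, Double, Varchar };

// A schema is an ordered list of (column name, type) pairs.
using Schema = std::vector<std::pair<std::string, Kind>>;

bool is_int(Kind k) { return k == Kind::Int || k == Kind::BigInt; }

// Reconcile two types seen for the same column.
Kind promote(Kind a, Kind b) {
    if (a == b) return a;
    // Two different integer kinds: one of them must be BigInt.
    if (is_int(a) && is_int(b)) return Kind::BigInt;
    bool a_num = is_int(a) || a == Kind::Double;
    bool b_num = is_int(b) || b == Kind::Double;
    // Any other numeric mix widens to Double.
    if (a_num && b_num) return Kind::Double;
    // Everything else falls back to Varchar.
    return Kind::Varchar;
}

// Merge schemas file by file, assuming columns are matched by name.
Schema merge_schemas(const std::vector<Schema>& files) {
    Schema merged;
    std::map<std::string, std::size_t> index;  // column name -> position in merged
    for (const auto& schema : files) {
        for (const auto& col : schema) {
            auto it = index.find(col.first);
            if (it == index.end()) {
                // First time this column is seen: keep its type as is.
                index[col.first] = merged.size();
                merged.push_back(col);
            } else {
                // Seen before: reconcile the two types.
                Kind& current = merged[it->second].second;
                current = promote(current, col.second);
            }
        }
    }
    return merged;
}
```

With a rule that is too simple, the `promote` step is where conflicting files would end up with an error or an unhelpful type; the rest of the fold stays the same.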

What I'm doing:

This PR improves type promotion in files() with the following changes:

  1. promote decimal types to the larger decimal type
  2. promote string types to the larger string type
  3. promote float and integer types to the double type
  4. promote different complex types to varchar

Fixes [schema auto-detection merge] when query file table function the data file has different decimal type of parquet format, error #38992
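
The promotion rules listed above can be sketched in a self-contained form. This is an illustration only: `SimpleType`, `Kind`, `promote`, and `kMaxVarcharLen` are hypothetical stand-ins rather than StarRocks' `TypeDescriptor` API, the varchar cap is an assumed value, and clamping of decimal precision and scale is left out.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical type kinds. Within each family, a larger enumerator means a wider type.
enum class Kind {
    TinyInt, SmallInt, Int, BigInt,    // integers
    Float, Double,                     // floating point
    Decimal32, Decimal64, Decimal128,  // decimals
    Char, Varchar,                     // strings
    Struct                             // stands in for any complex type
};

// Hypothetical stand-in for a column type; not StarRocks' TypeDescriptor.
struct SimpleType {
    Kind kind;
    int len;        // used by Char and Varchar
    int precision;  // used by decimals
    int scale;      // used by decimals
};

constexpr int kMaxVarcharLen = 1048576;  // assumed cap for this sketch

bool is_integer(Kind k) { return k >= Kind::TinyInt && k <= Kind::BigInt; }
bool is_float(Kind k) { return k == Kind::Float || k == Kind::Double; }
bool is_decimal(Kind k) { return k >= Kind::Decimal32 && k <= Kind::Decimal128; }

SimpleType promote(const SimpleType& a, const SimpleType& b) {
    // Precondition: the caller only asks for promotion when the types differ.
    assert(a.kind != b.kind || a.len != b.len ||
           a.precision != b.precision || a.scale != b.scale);

    if (is_integer(a.kind) && is_integer(b.kind)) {
        // Two integers: keep the wider one.
        return SimpleType{std::max(a.kind, b.kind), 0, 0, 0};
    }
    if ((is_integer(a.kind) || is_float(a.kind)) &&
        (is_integer(b.kind) || is_float(b.kind))) {
        // Any remaining integer/float mix widens to Double.
        return SimpleType{Kind::Double, 0, 0, 0};
    }
    if (is_decimal(a.kind) && is_decimal(b.kind)) {
        // Decimals: widest storage, largest precision, largest scale.
        return SimpleType{std::max(a.kind, b.kind), 0,
                          std::max(a.precision, b.precision),
                          std::max(a.scale, b.scale)};
    }
    if (a.kind == b.kind && (a.kind == Kind::Char || a.kind == Kind::Varchar)) {
        // Same string kind: keep the longer length.
        return SimpleType{a.kind, std::max(a.len, b.len), 0, 0};
    }
    // Every other conflict falls back to the widest Varchar.
    return SimpleType{Kind::Varchar, kMaxVarcharLen, 0, 0};
}
```

For example, in this sketch a `Decimal32` column with precision 9 and scale 2 merged with a `Decimal64` column with precision 18 and scale 4 yields `Decimal64` with precision 18 and scale 4, while a struct column merged with an integer column falls back to the widest varchar.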

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.2
    • 3.1
    • 3.0
    • 2.5

@rickif rickif requested a review from a team as a code owner February 7, 2024 09:57
@github-actions github-actions bot added the 3.2 label Feb 7, 2024
@mergify mergify bot assigned rickif Feb 7, 2024
// treat other conflicted types as varchar.
return TypeDescriptor::create_varchar_type(TypeDescriptor::MAX_VARCHAR_LENGTH);
}

private:
/// Used to create a possibly nested type from the flattened Thrift representation.
///

The riskiest bug in this code is:
Incorrect or missing handling of type conflicts between integer and float types in the promote_types method.

You can modify the code like this:

    static TypeDescriptor promote_types(const TypeDescriptor& type1, const TypeDescriptor& type2) {
        DCHECK(type1 != type2);
        if (type1.is_integer_type() && type2.is_integer_type()) {
            // promote integer type. Larger enum values mean larger value ranges.
            auto tp = type1.type > type2.type ? type1.type : type2.type;
            return TypeDescriptor::from_logical_type(tp);
        } else if ((type1.is_float_type() && type2.is_integer_type()) || (type1.is_integer_type() && type2.is_float_type())) {
            // if one is float and other is integer, promote to double
            return TypeDescriptor::from_logical_type(TYPE_DOUBLE);
        } else if (type1.is_float_type() && type2.is_float_type()) {
            // promote all float to double.
            return TypeDescriptor::from_logical_type(TYPE_DOUBLE);
        } else if (type1.is_decimal_type() && type2.is_decimal_type()) {
            // decimal v3 only
            auto tp = type1.type > type2.type ? type1.type : type2.type;
            if (tp > TYPE_DECIMAL128) tp = TYPE_DECIMAL128;
            if (tp < TYPE_DECIMAL32) tp = TYPE_DECIMAL32;
            auto precision = type1.precision > type2.precision ? type1.precision : type2.precision;
            if (precision > MAX_PRECISION) precision = MAX_PRECISION;
            auto scale = type1.scale > type2.scale ? type1.scale : type2.scale;
            if (scale > MAX_SCALE) scale = MAX_SCALE;
            return TypeDescriptor::create_decimalv3_type(tp, precision, scale);
        } else if (type1.type == TYPE_VARCHAR && type2.type == TYPE_VARCHAR) {
            auto len = type1.len > type2.len ? type1.len : type2.len;
            return TypeDescriptor::create_varchar_type(len);
        } else if (type1.type == TYPE_CHAR && type2.type == TYPE_CHAR) {
            auto len = type1.len > type2.len ? type1.len : type2.len;
            return TypeDescriptor::create_char_type(len);
        } else if (type1.type == TYPE_VARBINARY && type2.type == TYPE_VARBINARY) {
            auto len = type1.len > type2.len ? type1.len : type2.len;
            return TypeDescriptor::create_varbinary_type(len);
        }
        // treat other conflicted types as varchar.
        return TypeDescriptor::create_varchar_type(TypeDescriptor::MAX_VARCHAR_LENGTH);
    }

This modification aims to address the risk by adding a case to handle the scenario where one type is an integer and the other is a float, promoting the result to double, thus preventing potential loss of precision or incorrect type promotion.
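
One plausible reason to promote an integer/float mix to double rather than to float is precision: a 32-bit float has a 24-bit significand, so not every 32-bit integer survives the conversion, whereas a double represents every 32-bit integer exactly. The sketch below illustrates this; the helper names `survives_float` and `survives_double` are hypothetical and not part of the PR.

```cpp
#include <cstdint>

// True if a 32-bit integer keeps its exact value after conversion to float.
bool survives_float(std::int32_t v) {
    float as_float = static_cast<float>(v);
    // Compare in double, which holds both values exactly.
    return static_cast<double>(as_float) == static_cast<double>(v);
}

// True if a 32-bit integer keeps its exact value after conversion to double.
bool survives_double(std::int32_t v) {
    double as_double = static_cast<double>(v);
    return static_cast<std::int32_t>(as_double) == v;
}
```

Here `survives_float(16777217)` is false because 2^24 + 1 is not representable as a float, while `survives_double(16777217)` is true. Note that double is not lossless for all integers either: 64-bit values above 2^53 can still be rounded.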

Contributor

Does this comment help you?

Contributor Author

@dirtysalt Yeah. I've changed the PR according to the comment.

@@ -2241,7 +2241,7 @@ TEST_F(OrcChunkReaderTest, get_file_schema) {
 {"col_float", TypeDescriptor::from_logical_type(TYPE_FLOAT)},
 {"col_double", TypeDescriptor::from_logical_type(TYPE_DOUBLE)},
 {"col_string", TypeDescriptor::create_varchar_type(1048576)},
-{"col_char", TypeDescriptor::create_char_type(10)},
+{"col_char", TypeDescriptor::create_char_type(255)},
Contributor

Would it be better to use MAX_CHAR_LENGTH here?

@rickif rickif force-pushed the fix/promote-types branch 4 times, most recently from 9c6e553 to 34bd7e9 Compare February 19, 2024 02:10
Contributor Author

rickif commented Feb 19, 2024

@Mergifyio rebase

Contributor

mergify bot commented Feb 19, 2024

rebase

✅ Branch has been successfully rebased

Contributor Author

rickif commented Feb 19, 2024

@Mergifyio rebase

Signed-off-by: ricky <rickif@qq.com>
Signed-off-by: ricky <rickif@qq.com>
Signed-off-by: ricky <rickif@qq.com>
Signed-off-by: ricky <rickif@qq.com>
Signed-off-by: ricky <rickif@qq.com>
Signed-off-by: ricky <rickif@qq.com>
This reverts commit dbb3d59.

Signed-off-by: ricky <rickif@qq.com>
Signed-off-by: ricky <rickif@qq.com>
Contributor

mergify bot commented Feb 19, 2024

rebase

✅ Branch has been successfully rebased

Signed-off-by: ricky <rickif@qq.com>
@wyb wyb enabled auto-merge (squash) February 23, 2024 11:31

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)


[BE Incremental Coverage Report]

pass : 38 / 38 (100.00%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| 🔵 be/src/exec/file_scanner.cpp | 4 | 4 | 100.00% | [] |
| 🔵 be/src/runtime/types.h | 5 | 5 | 100.00% | [] |
| 🔵 be/src/runtime/types.cpp | 29 | 29 | 100.00% | [] |

@wyb wyb merged commit bb00982 into StarRocks:main Feb 28, 2024
47 checks passed

@Mergifyio backport branch-3.2

@github-actions github-actions bot removed the 3.2 label Feb 28, 2024
Contributor

mergify bot commented Feb 28, 2024

backport branch-3.2

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Feb 28, 2024
Signed-off-by: ricky <rickif@qq.com>
(cherry picked from commit bb00982)
wanpengfei-git pushed a commit that referenced this pull request Feb 28, 2024
Co-authored-by: ricky <rickif@qq.com>
Seaven pushed a commit to Seaven/starrocks that referenced this pull request Feb 28, 2024
Signed-off-by: ricky <rickif@qq.com>
Signed-off-by: Seaven <seaven_7@qq.com>
@rickif rickif deleted the fix/promote-types branch March 6, 2024 07:18