New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Enhancement] files()
type promotion
#40959
Conversation
// treat other conflicted types as varchar. | ||
return TypeDescriptor::create_varchar_type(TypeDescriptor::MAX_VARCHAR_LENGTH); | ||
} | ||
|
||
private: | ||
/// Used to create a possibly nested type from the flattened Thrift representation. | ||
/// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most risky bug in this code is:
incorrect or missing handling for type conflicts between integer and float types in the promote_types
method
You can modify the code like this:
static TypeDescriptor promote_types(const TypeDescriptor& type1, const TypeDescriptor& type2) {
DCHECK(type1 != type2);
if (type1.is_integer_type() && type2.is_integer_type()) {
// promote integer type. Larger enum values mean larger value ranges.
auto tp = type1.type > type2.type ? type1.type : type2.type;
return TypeDescriptor::from_logical_type(tp);
} else if ((type1.is_float_type() && type2.is_integer_type()) || (type1.is_integer_type() && type2.is_float_type())) {
// if one is float and other is integer, promote to double
return TypeDescriptor::from_logical_type(TYPE_DOUBLE);
} else if (type1.is_float_type() && type2.is_float_type()) {
// promote all float to double.
return TypeDescriptor::from_logical_type(TYPE_DOUBLE);
} else if (type1.is_decimal_type() && type2.is_decimal_type()) {
// decimal v3 only
auto tp = type1.type > type2.type ? type1.type : type2.type;
if (tp > TYPE_DECIMAL128) tp = TYPE_DECIMAL128;
if (tp < TYPE_DECIMAL32) tp = TYPE_DECIMAL32;
auto precision = type1.precision > type2.precision ? type1.precision : type2.precision;
if (precision > MAX_PRECISION) precision = MAX_PRECISION;
auto scale = type1.scale > type2.scale ? type1.scale : type2.scale;
if (scale > MAX_SCALE) scale = MAX_SCALE;
return TypeDescriptor::create_decimalv3_type(tp, precision, scale);
} else if (type1.type == TYPE_VARCHAR && type2.type == TYPE_VARCHAR) {
auto len = type1.len > type2.len ? type1.len : type2.len;
return TypeDescriptor::create_varchar_type(len);
} else if (type1.type == TYPE_CHAR && type2.type == TYPE_CHAR) {
auto len = type1.len > type2.len ? type1.len : type2.len;
return TypeDescriptor::create_char_type(len);
} else if (type1.type == TYPE_VARBINARY && type2.type == TYPE_VARBINARY) {
auto len = type1.len > type2.len ? type1.len : type2.len;
return TypeDescriptor::create_varbinary_type(len);
}
// treat other conflicted types as varchar.
return TypeDescriptor::create_varchar_type(TypeDescriptor::MAX_VARCHAR_LENGTH);
}
This modification aims to address the risk by adding a case to handle the scenario where one type is an integer and the other is a float, promoting the result to double, thus preventing potential loss of precision or incorrect type promotion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this comment help you ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dirtysalt Yeah. I've changed the PR according to the comment.
ebea1c2
to
044afba
Compare
@@ -2241,7 +2241,7 @@ TEST_F(OrcChunkReaderTest, get_file_schema) { | |||
{"col_float", TypeDescriptor::from_logical_type(TYPE_FLOAT)}, | |||
{"col_double", TypeDescriptor::from_logical_type(TYPE_DOUBLE)}, | |||
{"col_string", TypeDescriptor::create_varchar_type(1048576)}, | |||
{"col_char", TypeDescriptor::create_char_type(10)}, | |||
{"col_char", TypeDescriptor::create_char_type(255)}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use MAX_CHAR_LENGTH better?
9c6e553
to
34bd7e9
Compare
@Mergifyio rebase |
✅ Branch has been successfully rebased |
34bd7e9
to
b076f91
Compare
@Mergifyio rebase |
Signed-off-by: ricky <rickif@qq.com>
Signed-off-by: ricky <rickif@qq.com>
Signed-off-by: ricky <rickif@qq.com>
Signed-off-by: ricky <rickif@qq.com>
Signed-off-by: ricky <rickif@qq.com>
This reverts commit dbb3d59. Signed-off-by: ricky <rickif@qq.com>
Signed-off-by: ricky <rickif@qq.com>
✅ Branch has been successfully rebased |
b076f91
to
5e3d677
Compare
Signed-off-by: ricky <rickif@qq.com>
[FE Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
[BE Incremental Coverage Report]✅ pass : 38 / 38 (100.00%) file detail
|
@Mergifyio backport branch-3.2 |
✅ Backports have been created
|
Signed-off-by: ricky <rickif@qq.com> (cherry picked from commit bb00982)
Co-authored-by: ricky <rickif@qq.com>
Signed-off-by: ricky <rickif@qq.com> Signed-off-by: Seaven <seaven_7@qq.com>
Why I'm doing:
The
files()
would merge file schemas. The current rule of merging is too simple.What I'm doing:
This PR makes some improvements to the
files
type promotion, including the following changes:Fixes [schema auto-detection merge] when query file table function the data file has different decimal type of parquet format, error #38992
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: