New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes and improvements for Variant type #60198
Fixes and improvements for Variant type #60198
Conversation
This is an automated comment for commit a7eabbb with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page Successful checks
|
|
||
The result of operator `<` for values `v1` with underlying type `T1` and `v2` with underlying type `T2` of a type `Variant(..., T1, ... T2, ...)` is defined as follows: | ||
- If `T1 = T2 = T`, the result will be `v1.T < v2.T` (underlying values will be compared). | ||
- If `T1 != T2`, the result will be `T1 < T2` (type names will be compared). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be so confusing that (100 :: Int32) < (1 :: UInt32)
:
:) create table t(x Variant(UInt32, Int64)) ENGINE = Memory;
Ok.
0 rows in set. Elapsed: 0.015 sec.
:) insert into t VALUES (1),(101-1),(100);
Ok.
3 rows in set. Elapsed: 0.003 sec.
:) select * from t ORDER BY x;
┌─x───┐
│ 100 │
│ 1 │
│ 100 │
└─────┘
3 rows in set. Elapsed: 0.003 sec.
I see that we just want to have a total ordering on Variant
, but maybe we can keep cast to Variant monotonic:
SELECT
CAST('100', 'Int32') AS a,
CAST('1', 'UInt32') AS b,
a < b,
CAST(a, 'Variant(UInt32, Int32)') < CAST(b, 'Variant(UInt32, Int32)')
Row 1:
──────
a: 100
b: 1
less(CAST('100', 'Int32'), CAST('1', 'UInt32')): 0
less(CAST(CAST('100', 'Int32'), 'Variant(UInt32, Int32)'), CAST(CAST('1', 'UInt32'), 'Variant(UInt32, Int32)')): 1
For example, we can Imagine situation when user has Variant(T1, T2)
(e.g. Variant(UInt32, Int32)
) but then wants to convert it to leastSupertype(T1, T2)
(e.g. Int64) (or vise versa) and result ordering will be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that we just want to have a total ordering on Variant, but maybe we can keep cast to Variant monotonic:
Yes, having total ordering is the only reason. In general case Variant can have any data types inside that can be not comparable between each other, and I am not sure it is good idea to add extra logic for comparing between some subset of types. And actually, it is not trivial to implement as we can have a lot of different combinations of types that can be compared, like all integers, floats and decimals, dates and datetimes, arrays of integers, etc. And all this logic should be implemented in ColumnVariant::compareAt()
where we don't have the data types of variants but only columns (so we even cannot use implementation of function less
).
So, I decided to use the simpliest comparing rule that will allow to have total ordering
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the user always can use specific variants in order by to make desired order, or cast the whole Variant to some common type:
avogar-dev :) select * from t ORDER BY if(variantType(x) == 'UInt32', x.UInt32, x.Int64);
SELECT *
FROM t
ORDER BY if(variantType(x) = 'UInt32', x.UInt32, x.Int64) ASC
Query id: 45153e4b-75e1-4363-b611-3dbf4a145e8d
┌─x───┐
│ 1 │
│ 100 │
│ 100 │
└─────┘
3 rows in set. Elapsed: 0.027 sec.
avogar-dev :) select * from t ORDER BY x::Int64;
SELECT *
FROM t
ORDER BY CAST(x, 'Int64') ASC
Query id: 6d987031-8aec-4bed-b54c-bb3a0d3dca57
┌─x───┐
│ 1 │
│ 100 │
│ 100 │
└─────┘
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would keep current simple comparison as I don't see how we can properly implement something more advanced. Maybe we can improve it later as this type is still experimental and we can change its behaviour
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's just add a warning to doc then? Also we may just forbid such variants that contains several different numeric types (e.g. both UInt32 and Int64) and allow it under setting similarly to allow_suspicious_lowcardinality_types
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's just add a warning to doc then?
Sure, I will add it.
Also we may just forbid such variants that contains several different numeric types (e.g. both UInt32 and Int64) and allow it under setting similarly to allow_suspicious_lowcardinality_types.
Sounds like a good idea, even maybe not only for different numeric types , but more general - types that have common supertype.
I see only 1 problem, we will have the same problem with Dynamic type that I am working on right now, and there we won't be able to check for similar types, because initially we will have 0 variants and will extend them dynamically, so it may happen that 1 part will have Dynamic with inner type UInt32 and 2 part with Int64, and when we will read from such table with ORDER BY we will have the same problem.
But Dynamic is abother story, I think we can add such setting for Variant, I will add it in this PR
bool tryInsert(const DB::Field & x) override | ||
{ | ||
NearestFieldType<T> value; | ||
if (!x.tryGet<NearestFieldType<T>>(value)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, tryInsert
doesn't insert Int32
into ColumnVector<Int64>
, is it desired behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But Field cannot store Int32
value, it stores only Int64/UInt64
values. For Int8/Int16/Int32/Int64
NearestFieldType<T> = Int64
, for UInt8/UInt16/UInt32/UInt64
NearestFieldType<T> = UInt64
. We cannot insert signed value into unsigned and vice versa, other conversions are possible.
|
||
The result of operator `<` for values `v1` with underlying type `T1` and `v2` with underlying type `T2` of a type `Variant(..., T1, ... T2, ...)` is defined as follows: | ||
- If `T1 = T2 = T`, the result will be `v1.T < v2.T` (underlying values will be compared). | ||
- If `T1 != T2`, the result will be `T1 < T2` (type names will be compared). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's just add a warning to doc then? Also we may just forbid such variants that contains several different numeric types (e.g. both UInt32 and Int64) and allow it under setting similarly to allow_suspicious_lowcardinality_types
.
/// Create temporary column and check if we can insert this field to the variant. | ||
/// If we can insert, no need to convert anything. | ||
auto col = type_variant->createColumn(); | ||
if (col->tryInsert(src)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it the only usage of tryInsert
? If it's used only here, can't we introduce more specific method that checks type of the Field, but do not insert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main usage of tryInsert
is ColumnVariant::insert
, when we cannot determine to which variant we need to insert the field, so we just try to insert it to all variants in order
|
||
for (size_t i = 0; i != variants.size(); ++i) | ||
{ | ||
if (variants[i]->tryInsert(x)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's possible to insert into several variants, shouldn't we choose the most specific? But it would work if Filed had specific numeric type inside. But since it's only (U)Int64, we will just insert to first sub-column with matching sign, probably also ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's possible to insert into several variants, shouldn't we choose the most specific?
It's actually the main problem with interaction between Field and Variant, sometimes Field with the same inner type can be inserted into many data types and in most cases it's not clear which Variant is most specific (Field with UInt64 value can actually store Date value, but we cannot know about it as Field loses this information). Initially I just forbid inserting a Field intp ColumnVariant as it could lead to ambiguity, but in the code in several places we still use Fields, so we need at least some implementation of this method.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Implement comparison operator for Variant values and proper Field inserting into Variant column. Don't allow creating
Variant
type with similar variant types by default (allow uder a settingallow_suspicious_variant_types
)Closes #59996. Closes #59850
Documentation entry for user-facing changes