Versioning of AggregateFunction states. #12552
Comments
We should also support having a single type of state for slightly different aggregate functions, for example
Can we do without this part? If we speak about the version of serialized data, it exists only during storage or transfer of data. The in-memory representation, on the other hand, exists only in run time. It can arbitrarily change between server revisions, and doesn't need versioning. So the version of serialized data is not really a property of a deserialized column, and should not be reflected in its type.
I'd prefer to avoid supporting serialization to old versions, but probably we can't do that if we want to support shard-wise rolling upgrades of the cluster. How should we tie the protocol version to the serialization version of aggregate functions? Maybe we can use the protocol version number as a serialization version number? It's kind of big now, the current version is 54226. We'd want to use a tightly packed small number to minimize overhead, so probably we'll have to use some kind of lookup tables... Another part we have to be careful about is transitioning from no versions (now) to some versions. Probably we'll have to have some versioning in the column metadata. Do we have it already? It would be solved by your
Thinking more about it, we don't have to put the version number into the function state itself -- we can't have mixed versions of states within a block. If we put the version into column metadata, using the protocol version is acceptable.
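A minimal sketch of the "version in column metadata" idea, under stated assumptions: the column layout here (one version byte and a row count in the header, then length-prefixed state blobs) is hypothetical and is not ClickHouse's Native format. It only shows that the version is written once per column, not once per state.

```python
import io
import struct

def write_state_column(out, version, states):
    """Write the version once in the column header, then every state blob."""
    out.write(struct.pack("<BI", version, len(states)))  # header: version, row count
    for blob in states:
        out.write(struct.pack("<I", len(blob)))  # length prefix (assumed layout)
        out.write(blob)

def read_state_column(inp):
    """Read the header, then return (version, list of raw state blobs)."""
    version, count = struct.unpack("<BI", inp.read(5))
    states = []
    for _ in range(count):
        (size,) = struct.unpack("<I", inp.read(4))
        states.append(inp.read(size))
    return version, states

buf = io.BytesIO()
write_state_column(buf, 1, [b"\x01\x02", b"\x03"])
buf.seek(0)
column_version, column_states = read_state_column(buf)
```

With this layout the per-column overhead is a single byte, however many states the block holds.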
Version in type name is needed to correctly deserialize data dumps in Native and other formats (scenarios 5 and 6), and to read tables with stored AggregateFunctions (scenarios 3 and 4). Data type is what drives serialization/deserialization. (There is a task to move binary ser/de from DataType to Column and to enable multiple Column representations for a single DataType, but it's another story)
It changes more frequently than binary formats of aggregate functions. So, we can use it to negotiate supported version. But not to persist as a version number. |
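To illustrate using the protocol revision only for negotiation, here is a sketch of a lookup table from client revision to the newest state serialization version that client can read. The table contents and the function name are assumptions made up for the example; only the small serialization version would ever be persisted.

```python
import bisect

# Hypothetical table: (first protocol revision that understands it, serialization version).
# The revision numbers are invented for illustration and must be sorted by revision.
REVISION_TO_STATE_VERSION = [(0, 0), (54300, 1), (54400, 2)]

def max_supported_state_version(client_revision):
    """Return the newest state serialization version a client revision can read."""
    if client_revision < 0:
        raise ValueError("client revision must be non-negative")
    revisions = [rev for rev, _ in REVISION_TO_STATE_VERSION]
    index = bisect.bisect_right(revisions, client_revision) - 1
    return REVISION_TO_STATE_VERSION[index][1]
```

The protocol revision can keep growing with every release while the persisted number only changes when a binary format actually changes.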
This may be true for our code, but in general I'd say that a data type is a set of allowed values + operations on them. Serialization is an orthogonal concept. For example, we have a lot of output formats in which we can read and write an Int64, but it's still the same data type. The cases 3 and 4, when there is an aggregate function state column, we can solve by adding metadata with version to this column. But the cases 5, 6 of text formats are indeed problematic, because there is no place to put this metadata, except the type. But it is more of a hack around the deficiency of the format. I think we might be setting ourselves up for trouble if we deepen the confusion between data type and serialization. It breaks a basic expectation -- the type is the same, regardless of how it was serialized. Arguably,
@akuzm I understand, but using different data types for different binary formats of the same aggregate function still looks very natural to me. And AggregateFunction is a parametric data type. The types will be almost the same but differ in the version parameter. PS. @KochetovNicolai has proposed another solution that requires implementing "different representations for the same data type" first. But it is much harder to implement. E.g. I can implement what's proposed here in one week. And "different representations for the same data type" is likely not going to be implemented this year.
BTW, how do you want to deal with:
I think that we can convert the user-provided data type as if it is a synonym for another data type.
Another example:
But I did not think about it in all detail.
Yes, I think this would work. Maybe we don't even need to rewrite the type, just determine the underlying type of aggregate function state when we parse it. E.g. for both
Looks like this might be a task independent from versioning.
If you think you can implement the original proposal fast, maybe we should try. I don't think it is going to lead to serious maintenance problems, we will probably be able to change it to some other schema when we wish.
Yes, but in case of an aggregate function state column, the runtime format of binary data is the same. What differs is serialized data, which is arguably a different data type. If we had explicit types for serialized data, like
Yes, if we tweak overload resolution for AggregateFunctionState to ignore the Version parameter, it will work. Maybe it's even logically OK, we might have other type parameters that require special treatment, like the underlying data type we mentioned above. The simplistic overload resolution that just matches the type for equality won't work, but maybe it's not expressive enough for us anyway.
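A small sketch of what "ignore the Version parameter" could look like. The class and function names are hypothetical and only model the idea; this is not ClickHouse's overload resolution code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AggregateFunctionType:
    version: int
    function: str
    argument_types: tuple

    def signature(self):
        """Everything that matters for overload resolution; version is left out."""
        return (self.function, self.argument_types)

def matches_for_overload(expected, actual):
    """Compare two state types while ignoring the serialization version."""
    return expected.signature() == actual.signature()

old_avg = AggregateFunctionType(0, "avg", ("UInt64",))
new_avg = AggregateFunctionType(1, "avg", ("UInt64",))
uniq = AggregateFunctionType(1, "uniq", ("UInt64",))
```

Plain equality still distinguishes `old_avg` from `new_avg`, which is what the "match the type for equality" resolution would get wrong, while `matches_for_overload` treats them as the same argument type.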
I think that it is not very difficult to separate serialization format and data type. We may postulate that the name which is used for serialization is a "serialization format" name, not a "data type" name. So, while transferring data server to server, the new server will map format to type, and it will be possible to add more formats to the mapping. There will be a problem for MV in case we want to change the serialization format (e.g. an update to a new version may break reading of saved data). It could be solved in the following way:
If we use the default serialization format for a data type, we won't specify it in the table definition. For other cases the format is added automatically to the table definition. So, when we add a serialization format, we may use the new version by default, but it will be added to the table definition.
@KochetovNicolai It imposes the following complications:
But I don't understand why. The name won't change. I suppose that for old types it will work the same way.
I also don't understand. Why are column names affected at all? (ClickHouse/src/DataStreams/NativeBlockOutputStream.cpp, lines 108 to 112 in 06446b4)
Probably not. I don't see why.
It is possibly true, however, I expect that the main changes will be for data type serialization. It shouldn't change engines a lot. I just think that we have a misunderstanding about my proposal.
The user should be able to specify serialization format of the data. The only possible way to do it for these formats is - in type name.
Const / LowCardinality / Sparse ...
Where else can we store the serialization format (the name of the column type)?
I expect that we use serialization format instead of type name. And it will almost always be equal to type name.
Then it is not too different from what I have proposed...
It is not different if we talk about data serialization. But logically it is different, because the new server will know about the separation between data type and format. A single in-memory representation of a type may be serialized to different formats.
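To make "one in-memory representation, several serialized formats" concrete, here is a sketch for an avg-like state held as (sum, count). Both binary layouts are invented for the example and do not describe how ClickHouse actually serializes avg; the point is only that the version argument selects the wire format while the in-memory tuple stays the same.

```python
import struct

def serialize_avg(state, version):
    """state is (sum, count); the in-memory form is the same for every version."""
    total, count = state
    if version == 0:
        return struct.pack("<QQ", total, count)  # assumed old layout: two UInt64
    if version == 1:
        return struct.pack("<QI", total, count)  # assumed new layout: UInt32 count
    raise ValueError(f"unknown version {version}")

def deserialize_avg(data, version):
    """Decode either layout back into the same in-memory (sum, count) tuple."""
    if version == 0:
        return struct.unpack("<QQ", data)
    if version == 1:
        return struct.unpack("<QI", data)
    raise ValueError(f"unknown version {version}")

state = (10, 4)
old_bytes = serialize_avg(state, 0)
new_bytes = serialize_avg(state, 1)
```

Both byte strings decode to the same tuple, so code that merges or finalizes states never needs to know which format the data arrived in.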
Use case
Sometimes we have to change serialization format of AggregateFunction states due to bugs or inefficiencies. It should not compromise backward compatibility.
Currently we do it only in exceptional cases:
When the aggregate function was released only recently and is rarely used. In that case we have to put a warning about backward compatibility in the changelog.
Sometimes we just miss changes by mistake.
Proposal
Methods IAggregateFunction::serialize, IAggregateFunction::deserialize will take additional argument with version. These methods have to support serialization and deserialization with all known versions.
When user creates a data type AggregateFunction(...), e.g. AggregateFunction(avg, UInt64), we transform it adding parameter with version at front with the most recent version, e.g. AggregateFunction(v1, avg, UInt64).
The user will see the data type with version number in SHOW CREATE TABLE, DESCRIBE TABLE, etc.
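The type-name transformation described above could be sketched as follows. This is a string-level illustration only: `add_version` and `LATEST_VERSION` are hypothetical names, and a real implementation would work on the parsed type, not on text.

```python
import re

LATEST_VERSION = 1  # assumed most recent version, for illustration only

def add_version(type_name, version=LATEST_VERSION):
    """Prepend 'vN' to AggregateFunction(...) unless a version is already there."""
    match = re.fullmatch(r"AggregateFunction\((.*)\)", type_name.strip())
    if match is None:
        return type_name  # not an AggregateFunction type; leave untouched
    arguments = match.group(1).strip()
    if re.match(r"v\d+\s*,", arguments):
        return f"AggregateFunction({arguments})"  # version already present
    return f"AggregateFunction(v{version}, {arguments})"
```

For example, `add_version("AggregateFunction(avg, UInt64)")` yields `AggregateFunction(v1, avg, UInt64)`, and applying it again leaves the name unchanged.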
When data type AggregateFunction(...) without version is already specified in table definition or in serialization formats (Native), version 0 is assumed implicitly.
When sending data to the client with native protocol, the revision of the client is taken into account. IAggregateFunction should have a method to determine the maximum supported version according to the client revision. The version of AggregateFunction is changed that way. If AggregateFunction data type will have version zero, it is not printed in data type name.
Scenarios