
Versioning of AggregateFunction states. #12552

Closed
alexey-milovidov opened this issue Jul 16, 2020 · 17 comments · Fixed by #24820
@alexey-milovidov
Member

alexey-milovidov commented Jul 16, 2020

Use case

Sometimes we have to change the serialization format of AggregateFunction states due to bugs or inefficiencies. This should not compromise backward compatibility.

Currently we do it only in exceptional cases: when the aggregate function was released just recently and is rarely used. Then we have to put a warning about backward compatibility in the changelog.

Sometimes we simply miss such changes by mistake.

Proposal

The methods IAggregateFunction::serialize and IAggregateFunction::deserialize will take an additional version argument. These methods have to support serialization and deserialization for all known versions.

When a user creates a data type AggregateFunction(...), e.g. AggregateFunction(avg, UInt64), we transform it by prepending a parameter with the most recent version, e.g. AggregateFunction(v1, avg, UInt64).

The user will see the data type with the version number in SHOW CREATE TABLE, DESCRIBE TABLE, etc.

When the data type AggregateFunction(...) without a version is already specified in a table definition or in serialization formats (Native), version 0 is assumed implicitly.

When sending data to the client over the native protocol, the client's revision is taken into account. IAggregateFunction should have a method to determine the maximum supported version for a given client revision, and the version of the AggregateFunction is adjusted accordingly. If the AggregateFunction data type has version zero, the version is not printed in the data type name.

Scenarios

  1. A server sends data to an old client. Should work seamlessly.
  2. A server sends data to an old client that is actually another server that initiated a distributed query. Should work seamlessly.
  3. We have a table with columns of AggregateFunction data type stored inside; then we upgraded the server and continue to read and write that table. Should work seamlessly.
  4. We have a table with columns of AggregateFunction data type stored inside; then we upgraded the server and continued to read and write that table; then we downgraded the server and continued to read and write that table. Should work seamlessly.
  5. We created a dump in a format with type headers (TSVWithNamesAndTypes, CSVWithNamesAndTypes, etc.) on an old server and are trying to load it into a new server. Should work seamlessly.
  6. We created a dump in a format without data types (like TSV or RowBinary) on an old server and are trying to load it into a new server. This may require the user to explicitly specify the version in the data type when creating the table.
@alexey-milovidov added the labels feature and st-discussion ("The story requires discussion / research / expert help / design & decomposition before it will be taken") on Jul 16, 2020
@alexey-milovidov
Member Author

We should also support having a single state type for slightly different aggregate functions, for example sum and sumIf.

@akuzm
Contributor

akuzm commented Aug 26, 2020

When a user creates a data type AggregateFunction(...), e.g. AggregateFunction(avg, UInt64), we transform it by prepending a parameter with the most recent version, e.g. AggregateFunction(v1, avg, UInt64).

Can we do without this part? If we speak about the version of serialized data, it exists only during storage or transfer of data. The in-memory representation, on the other hand, exists only in run time. It can arbitrarily change between server revisions, and doesn't need versioning. So the version of serialized data is not really a property of a deserialized column, and should not be reflected in its type.

These methods have to support serialization and deserialization for all known versions.

I'd prefer to avoid supporting serialization to old versions, but we probably can't avoid it if we want to support shard-wise rolling upgrades of a cluster. How should we tie the protocol version to the serialization version of aggregate functions? Maybe we can use the protocol version number as the serialization version number? It's kind of big now, the current version is 54226. We'd want to use a tightly packed small number to minimize overhead, so probably we'll have to use some kind of lookup tables...

Another part we have to be careful about is transitioning from no versions (now) to some versions. Probably we'll have to have some versioning in the column metadata. Do we have it already? Your AggregateFunction(v1, ... proposal would solve it, but putting the version in the type doesn't feel right.

@akuzm
Contributor

akuzm commented Aug 26, 2020

We'd want to use a tightly packed small number to minimize overhead, so probably we'll have to use some kind of lookup tables...

Thinking more about it, we don't have to put the version number into the function state itself -- we can't have mixed versions of states within a block. If we put the version into the column metadata, using the protocol version is acceptable.

@alexey-milovidov
Member Author

alexey-milovidov commented Aug 26, 2020

Can we do without this part? If we speak about the version of serialized data, it exists only during storage or transfer of data.

Version in type name is needed to correctly deserialize data dumps in Native and other formats (scenarios 5 and 6), and to read tables with stored AggregateFunctions (scenarios 3 and 4).

The data type is what drives serialization/deserialization.

(There is a task to move binary ser/de from DataType to Column and to enable multiple Column representations for a single DataType, but that's another story.)

@alexey-milovidov
Member Author

Maybe we can use the protocol version number as a serialization version number? It's kind of big now, the current version is 54226.

It changes more frequently than the binary formats of aggregate functions. So we can use it to negotiate the supported version, but not persist it as the version number.

@akuzm
Contributor

akuzm commented Aug 27, 2020

Data type is what drives serialization/deserialization.

This may be true for our code, but in general I'd say that a data type is a set of allowed values + operations on them. Serialization is an orthogonal concept. For example, we have a lot of output formats in which we can read and write an Int64, but it's still the same data type.

Cases 3 and 4, where there is an aggregate function state column, we can solve by adding version metadata to that column. But cases 5 and 6, the text formats, are indeed problematic, because there is no place to put this metadata except the type. That is more of a hack around the deficiency of the format. I think we might be setting ourselves up for trouble if we deepen the confusion between data type and serialization. It breaks a basic expectation -- the type is the same, regardless of how it was serialized.

Arguably, AggregateFunctionState is some opaque data type that doesn't necessarily have a user-visible byte representation. To be represented as bytes it must be serialized. The state serialized with a particular version is indeed a separate data type -- a domain over Bytes that has de/serialization functions converting it to/from the AggregateFunctionState. So we could have explicit types like AggregateFunctionUniqSerializedV2 with ser/de functions, forbid conversion of AggregateFunctionState to Bytes or Text, and require explicit serialization. This is somewhat verbose, but perhaps a better model for what is actually happening.

@alexey-milovidov
Member Author

@akuzm I understand, but using different data types for different binary formats of the same aggregate function still looks very natural to me.

And AggregateFunction is a parametric data type. The types will be almost the same, differing only in the version parameter.
The user can write it without a version, and we will substitute the default version on table creation.
The version parameter will clearly indicate that the types differ only in version.

PS. @KochetovNicolai has proposed another solution that requires implementing "different representations for the same data type" first. But it is much harder to implement.

E.g. I can implement what's proposed here in one week, while "different representations for the same data type" is likely not going to be implemented this year.

@alexey-milovidov
Member Author

alexey-milovidov commented Aug 27, 2020

BTW, how do you want to deal with:

We should also support having a single state type for slightly different aggregate functions, for example sum and sumIf.

I think we can treat the user-provided data type as a synonym for another data type.
For example, the user provides

AggregateFunction(sumIf, UInt64, UInt8)

and we will write

AggregateFunction(v2, sum, UInt64)

Another example:

AggregateFunction(quantiles(0.5, 0.9), UInt64)

and we will write

AggregateFunction(v0, quantile, UInt64)

But I have not thought it through in detail.

@akuzm
Contributor

akuzm commented Aug 27, 2020

BTW, how do you want to deal with:

We should also support having a single state type for slightly different aggregate functions, for example sum and sumIf.

I think we can treat the user-provided data type as a synonym for another data type.

Yes, I think this would work. Maybe we don't even need to rewrite the type; we can just determine the underlying type of the aggregate function state when we parse it. E.g. for both sum and sumIf it is AggregateFunctionSumData or something like this. Then, in overload resolution, we will check not only the nominal data type but also the type of the aggregate function state, and accept the argument if it matches.

Looks like this might be a task independent of versioning.

@akuzm
Contributor

akuzm commented Aug 27, 2020

If you think you can implement the original proposal fast, maybe we should try. I don't think it is going to lead to serious maintenance problems; we will probably be able to change it to some other scheme when we wish.

@akuzm I understand, but using different data types for different binary formats of the same aggregate function still looks very natural to me.

Yes, but in the case of an aggregate function state column, the runtime format of the binary data is the same. What differs is the serialized data, which is arguably a different data type. If we had explicit types for serialized data, like AggregateFunctionGroupArraySerializedV5, it would match this idea perfectly.

And AggregateFunction is a parametric data type. The types will be almost the same, differing only in the version parameter.
The user can write it without a version, and we will substitute the default version on table creation.
The version parameter will clearly indicate that the types differ only in version.

Yes, if we tweak overload resolution for AggregateFunction states to ignore the version parameter, it will work. Maybe it's even logically OK: we might have other type parameters that require special treatment, like the underlying data type we mentioned above. The simplistic overload resolution that just matches types for equality won't work, but maybe it's not expressive enough for us anyway.

@KochetovNicolai
Member

I think it is not very difficult to separate the serialization format from the data type.

We may postulate that the name used for serialization is a "serialization format" name, not a "data type" name.
There will be a mapping "serialization format" -> "data type" (many to one), and currently this mapping is the identity.

So, when transferring data server to server, the new server will map the format to the type, and it will be possible to add more formats to the mapping.

There will be a problem for materialized views if we want to change a serialization format (e.g. an upgrade to a new version may break reading of saved data). It could be solved in the following way:

  • Add an optional serialization format to the table definition
  • Specify a default serialization format (which won't ever be changed) for each data type

If we use the default serialization format for a data type, we don't specify it in the table definition. In other cases the format is added to the table definition automatically. So when we add a new serialization format, we may use the new version by default, but it will be recorded in the table definition.

@alexey-milovidov
Member Author

@KochetovNicolai It imposes the following complications:

  • it won't work for RowBinary, TSV, etc. - the user will not be able to deserialize the format with an old version;
  • it will require adding compatibility logic (writing column names near data type names) for the Native format;
  • it will require changing columns.txt for data parts in MergeTree tables;
  • it will require modifying all other table engines like Log, TinyLog, StripeLog, Set/Join and the MergeTree WAL.

@KochetovNicolai
Member

it won't work for RowBinary, TSV, etc. - the user will not be able to deserialize the format with an old version

But I don't understand why. The name won't change, and I suppose that for old types it will work the same way.

it will require adding compatibility logic (writing column names near data type names) for the Native format

I also don't understand this. Why are column names affected at all?
We already write the column name and type name in the Native format:

/// Name
writeStringBinary(column.name, ostr);

/// Type
String type_name = column.type->getName();
writeStringBinary(type_name, ostr);

it will require changing columns.txt for data parts in MergeTree tables

Probably not. I don't see why.

it will require modifying all other table engines

It is possibly true; however, I expect that the main changes will be in data type serialization. It shouldn't change the engines much.

I just think that we have a misunderstanding about my proposal.

@alexey-milovidov
Member Author

But I don't understand why. The name won't change, and I suppose that for old types it will work the same way.

The user should be able to specify the serialization format of the data, and for these formats the only possible place to do it is in the type name.

Also don't understand. Why column names are affected at all?

Const / LowCardinality / Sparse ...

Probably not. I don't see why.

Where else can we store the serialization format (the name of the column type)?

@KochetovNicolai
Member

I expect that we will use the serialization format instead of the type name, and it will almost always be equal to the type name.
We will then use the mapping format -> type name to determine the type name.

@alexey-milovidov
Member Author

Then it is not too different from what I have proposed...

@KochetovNicolai
Member

It is not different if we talk about data serialization. But logically it is different, because a new server will know about the separation between data type and format. A single in-memory representation of a type may be serialized to different formats.
