Support JSON and Dynamic types in hash functions#87791
Support JSON and Dynamic types in hash functions#87791Avogar merged 10 commits intoClickHouse:masterfrom
Conversation
| 15346611575624920065 | ||
| 3232684886454240779 |
There was a problem hiding this comment.
Wait, the hash depends on max_dynamic_paths?
- This is different from our hash functions behavior on other types, where we try to produce the same hash for the "same" value of slightly different type, e.g. Int32 vs Int64 or String vs FixedString.
- I tried it, and this behavior seems to depend on the value:
{"a": 42}gives the same hash for anymax_dynamic_paths, while{"a" : [{"b": 42}]}gives different hashes formax_dynamic_paths=100vsmax_dynamic_paths=200, but same hash formax_dynamic_paths=100vsmax_dynamic_paths=101. Is it a bug?
There was a problem hiding this comment.
I tried it, and this behavior seems to depend on the value
That's because the array's SerializationDynamic::serializeBinary writes the type name, which looks like Array(JSON(max_dynamic_types=16, max_dynamic_paths=50)), where the max_dynamic_paths depends on the outer max_dynamic_paths (I guess it's outer max_dynamic_paths/4).
I guess we can:
- give up, include the outer type name in the hash as well, and say that JSON hash is sensitive to the type parameters, or
- add a new code path specifically for this instead of serializeDynamicPathsAndSharedDataIntoArena (put the code in a new method in IColumn, or directly in
FunctionAnyHash, or inSerializationObjectbehind a new flag in FormatSettings, or something).- This will also allow making the hash insensitive to typed_paths, which is nice.
- Maybe this can also be used for deterministically formatting JSON as string, with the same independence of typed_paths and such; maybe the current JSON->String formatting code can be refactored into this and reused? (I.e. have a template function that traverses the value's json tree in deterministic order and calls a given function for innermost values, which can either format to string or add to hash. Idk whether such unification makes sense.) (Or just actually format to string and hash that, though it'll limit performance.)
- Or maybe it can be done in
FunctionAnyHashin a fast column-oriented way, hashing/combining whole subcolumns instead of traversing the json tree for each row.
I don't like (1) because it seems important to make hash functions good on first try because they can't be fixed later without breaking compatibility (or burdening users with a new setting and a data migration, etc).
There was a problem hiding this comment.
Agree, I impleemnted 2 option as a new method ISerialization::serializeForHashCalculation. Please, take a look
…h calculation of the value in the column
| String removeJSONParametersFromTypeName(const String & name) | ||
| { | ||
| String result = name; | ||
| RE2::GlobalReplace(&result, RE2(R"(JSON\([^)]*\))"), "JSON"); |
There was a problem hiding this comment.
Some counterexamples: JSON(SKIP REGEXP '(oh no)'), Enum('JSON(', 'lol :)')
The whole value tree is serialized using serializeForHashCalculation at every level, right? Then can't we prepend something like TypeIndex in each serializeForHashCalculation implementation, without needing a representation for a whole tree of IDataTypes? Oh, I see ISerialization doesn't know the data type; and even the caller of serializeForHashCalculation doesn't always know the data type, e.g. TupleSerialization doesn't know the tuple element types. Ugh, such an artificial problem, in reality the ISerialization instance necessarily knows what kind of data it's serializing, this information is just encoded in a weird and unstable way (in form of vtable pointer). Maybe it would make sense to add virtual TypeIndex getTypeId() to ISerialization? Or think of something else.
There was a problem hiding this comment.
TypeIndex is actually not suitable for it as we don't guarantee the order of types in this enum. I decided to just add a new version of encodeDataType that encodes data type specifically for this use case where we skip all JSON/Dynamic types parameters. It should work ok.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Support JSON and Dynamic types in hash functions. Resolves #87734
Documentation entry for user-facing changes