
Support JSON and Dynamic types in hash functions #87791

Merged
Avogar merged 10 commits into ClickHouse:master from Avogar:json-in-hash-functions
Nov 10, 2025

Conversation

@Avogar
Member

@Avogar Avogar commented Sep 29, 2025

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Support JSON and Dynamic types in hash functions. Resolves #87734

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh
Contributor

clickhouse-gh bot commented Sep 29, 2025

Workflow [PR], commit [e8ca2f3]


@clickhouse-gh clickhouse-gh bot added the pr-improvement Pull request with some product improvements label Sep 29, 2025
@al13n321 al13n321 self-assigned this Sep 29, 2025
Comment on lines +2 to +3
15346611575624920065
3232684886454240779
Member

@al13n321 al13n321 Sep 29, 2025

Wait, the hash depends on max_dynamic_paths?

  • This is different from how our hash functions behave on other types, where we try to produce the same hash for the "same" value of a slightly different type, e.g. Int32 vs Int64 or String vs FixedString.
  • I tried it, and this behavior seems to depend on the value: {"a": 42} gives the same hash for any max_dynamic_paths, while {"a" : [{"b": 42}]} gives different hashes for max_dynamic_paths=100 vs max_dynamic_paths=200, but same hash for max_dynamic_paths=100 vs max_dynamic_paths=101. Is it a bug?
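The property described in the first bullet can be illustrated with a toy sketch. It assumes a stand-in FNV-1a hash and a hypothetical helper `hashIntegerValue`; neither is ClickHouse's actual implementation. The idea is only that the value is normalized to one canonical width before hashing, so the declared integer type does not leak into the result.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// FNV-1a over raw bytes. A stand-in for illustration, not a ClickHouse hash function.
inline std::uint64_t fnv1a(const unsigned char * data, std::size_t size)
{
    std::uint64_t h = 14695981039346656037ULL; // FNV offset basis
    for (std::size_t i = 0; i < size; ++i)
    {
        h ^= data[i];
        h *= 1099511628211ULL; // FNV prime
    }
    return h;
}

// Hypothetical helper: widen any signed integer to 64 bits before hashing,
// so that Int32(42) and Int64(42) feed identical bytes to the hash.
template <typename Int>
std::uint64_t hashIntegerValue(Int value)
{
    const std::int64_t widened = static_cast<std::int64_t>(value);
    unsigned char bytes[sizeof(widened)];
    std::memcpy(bytes, &widened, sizeof(widened));
    return fnv1a(bytes, sizeof(bytes));
}
```

With this normalization, `hashIntegerValue<std::int32_t>(42)` and `hashIntegerValue<std::int64_t>(42)` agree, which is the kind of type-insensitivity the JSON hash was expected to have with respect to `max_dynamic_paths`.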

Member

@al13n321 al13n321 Sep 29, 2025

> I tried it, and this behavior seems to depend on the value

That's because the array's SerializationDynamic::serializeBinary writes the type name, which looks like Array(JSON(max_dynamic_types=16, max_dynamic_paths=50)), where the max_dynamic_paths depends on the outer max_dynamic_paths (I guess it's outer max_dynamic_paths/4).
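A toy model of this effect, assuming (as guessed above) that the nested `max_dynamic_paths` is the outer value divided by 4 with integer division. The function names are hypothetical and the byte layout is deliberately simplified; the only point is that the nested type name ends up in the bytes being hashed.

```cpp
#include <cstddef>
#include <string>

// Assumption taken from the comment above: nested max_dynamic_paths is
// outer max_dynamic_paths / 4 (integer division). max_dynamic_types is
// fixed at 16 only to mirror the quoted type name.
inline std::string nestedArrayTypeName(std::size_t outer_max_dynamic_paths)
{
    return "Array(JSON(max_dynamic_types=16, max_dynamic_paths="
        + std::to_string(outer_max_dynamic_paths / 4) + "))";
}

// Toy stand-in for the bytes that get hashed: the type name, a separator,
// then the serialized value. The real serialization is more involved.
inline std::string bytesToHash(std::size_t outer_max_dynamic_paths, const std::string & value_bytes)
{
    return nestedArrayTypeName(outer_max_dynamic_paths) + '\0' + value_bytes;
}
```

Under that assumption, outer values 100 and 101 both yield a nested `max_dynamic_paths=25` and therefore identical bytes, while 200 yields `max_dynamic_paths=50` and different bytes. This matches the observation that 100 vs 101 hash the same and 100 vs 200 do not.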

I guess we can:

  1. give up, include the outer type name in the hash as well, and say that JSON hash is sensitive to the type parameters, or
  2. add a new code path specifically for this instead of serializeDynamicPathsAndSharedDataIntoArena (put the code in a new method in IColumn, or directly in FunctionAnyHash, or in SerializationObject behind a new flag in FormatSettings, or something).
    • This will also allow making the hash insensitive to typed_paths, which is nice.
    • Maybe this can also be used for deterministically formatting JSON as string, with the same independence of typed_paths and such; maybe the current JSON->String formatting code can be refactored into this and reused? (I.e. have a template function that traverses the value's json tree in deterministic order and calls a given function for innermost values, which can either format to string or add to hash. Idk whether such unification makes sense.) (Or just actually format to string and hash that, though it'll limit performance.)
    • Or maybe it can be done in FunctionAnyHash in a fast column-oriented way, hashing/combining whole subcolumns instead of traversing the json tree for each row.

I don't like (1) because it seems important to make hash functions good on first try because they can't be fixed later without breaking compatibility (or burdening users with a new setting and a data migration, etc).
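The "template function that traverses the value in deterministic order" idea from option 2 might be sketched as follows. Everything here is hypothetical: `ToyJsonRow` models a row whose paths may sit in typed paths, dynamic paths, or shared data, and `forEachLeafSorted` visits all of them in sorted path order so that the storage location does not influence the result. The hash is a stand-in FNV-1a, and the string formatting does no escaping.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <variant>

using Leaf = std::variant<std::int64_t, std::string>;
using Paths = std::map<std::string, Leaf>;

// Hypothetical row layout: where a path is stored depends on type parameters.
struct ToyJsonRow
{
    Paths typed_paths;
    Paths dynamic_paths;
    Paths shared_data;
};

// Visit every (path, leaf) pair in sorted path order, regardless of storage.
template <typename Visitor>
void forEachLeafSorted(const ToyJsonRow & row, Visitor && visit)
{
    Paths merged = row.typed_paths;
    merged.insert(row.dynamic_paths.begin(), row.dynamic_paths.end());
    merged.insert(row.shared_data.begin(), row.shared_data.end());
    for (const auto & [path, leaf] : merged)
        visit(path, leaf);
}

inline std::string leafToString(const Leaf & leaf)
{
    if (const auto * number = std::get_if<std::int64_t>(&leaf))
        return std::to_string(*number);
    return '"' + std::get<std::string>(leaf) + '"'; // no escaping in this toy
}

// Consumer 1: deterministic string formatting.
inline std::string formatRow(const ToyJsonRow & row)
{
    std::string out;
    forEachLeafSorted(row, [&](const std::string & path, const Leaf & leaf)
    {
        out += path + '=' + leafToString(leaf) + ';';
    });
    return out;
}

// Consumer 2: hashing, using the same traversal.
inline std::uint64_t hashRow(const ToyJsonRow & row)
{
    std::uint64_t h = 14695981039346656037ULL; // FNV-1a offset basis
    auto mix = [&](const std::string & bytes)
    {
        for (unsigned char c : bytes)
        {
            h ^= c;
            h *= 1099511628211ULL;
        }
        h ^= 0xFF; // terminator so adjacent fields do not run together
        h *= 1099511628211ULL;
    };
    forEachLeafSorted(row, [&](const std::string & path, const Leaf & leaf)
    {
        mix(path);
        mix(leafToString(leaf));
    });
    return h;
}
```

Two rows holding the same logical paths, one with `a` as a typed path and `b.c` in shared data and the other with both as dynamic paths, produce the same `formatRow` and `hashRow` results. A per-row traversal like this is simple but row-oriented; the column-oriented alternative mentioned in the last bullet would trade that simplicity for speed.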

Member Author

Agreed, I implemented option 2 as a new method, ISerialization::serializeForHashCalculation. Please take a look.

@Avogar Avogar requested a review from al13n321 October 31, 2025 14:26
String removeJSONParametersFromTypeName(const String & name)
{
    String result = name;
    RE2::GlobalReplace(&result, RE2(R"(JSON\([^)]*\))"), "JSON");
Member

Some counterexamples: JSON(SKIP REGEXP '(oh no)'), Enum('JSON(', 'lol :)')

The whole value tree is serialized using serializeForHashCalculation at every level, right? Then can't we prepend something like TypeIndex in each serializeForHashCalculation implementation, without needing a representation for a whole tree of IDataTypes? Oh, I see ISerialization doesn't know the data type; and even the caller of serializeForHashCalculation doesn't always know the data type, e.g. TupleSerialization doesn't know the tuple element types. Ugh, such an artificial problem, in reality the ISerialization instance necessarily knows what kind of data it's serializing, this information is just encoded in a weird and unstable way (in form of vtable pointer). Maybe it would make sense to add virtual TypeIndex getTypeId() to ISerialization? Or think of something else.
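The two counterexamples at the top of this comment can be reproduced with a small sketch. It uses the same pattern as the diff but runs it through `std::regex` instead of RE2 so that it needs only the standard library. That is a substitution, and the function name is hypothetical, though for a pattern this simple both engines should agree.

```cpp
#include <regex>
#include <string>

// Same pattern as in the diff, applied with std::regex rather than RE2.
inline std::string removeJSONParametersWithRegex(const std::string & name)
{
    static const std::regex json_with_params(R"(JSON\([^)]*\))");
    return std::regex_replace(name, json_with_params, "JSON");
}
```

For a plain `JSON(max_dynamic_paths=50)` the result is `JSON` as intended. For `JSON(SKIP REGEXP '(oh no)')` the match stops at the first `)`, leaving `JSON')`. For `Enum('JSON(', 'lol :)')` the pattern matches inside the enum's string literals and rewrites the type to `Enum('JSON')`, changing a type that has nothing to do with JSON.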

Member Author

TypeIndex is actually not suitable for this, as we don't guarantee the order of types in that enum. I decided to just add a new version of encodeDataType that encodes the data type specifically for this use case, skipping all parameters of JSON/Dynamic types. It should work OK.
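A minimal sketch of what such a parameter-skipping type encoding could look like. The tag values, `ToyType`, and the helper names are made up for illustration and are not ClickHouse's actual binary type encoding. The tags are numbered explicitly so they do not depend on enum declaration order, and the encoder stops at JSON/Dynamic without looking at their parameters.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tags with explicit, fixed values (illustrative only).
enum class HashTypeTag : std::uint8_t
{
    Int64 = 0x01,
    String = 0x02,
    Array = 0x03,
    JSON = 0x04,
    Dynamic = 0x05,
};

// Toy type tree: a tag, nested types (used by Array), and free-form parameters.
struct ToyType
{
    HashTypeTag tag;
    std::vector<ToyType> nested;
    std::string parameters;
};

inline ToyType makeSimple(HashTypeTag tag) { return ToyType{tag, {}, {}}; }
inline ToyType makeJSON(std::string parameters) { return ToyType{HashTypeTag::JSON, {}, std::move(parameters)}; }
inline ToyType makeArray(ToyType element) { return ToyType{HashTypeTag::Array, {std::move(element)}, {}}; }

// Encode the type as a sequence of tag bytes, ignoring JSON/Dynamic parameters.
inline std::string encodeTypeForHash(const ToyType & type)
{
    std::string out(1, static_cast<char>(type.tag));
    if (type.tag == HashTypeTag::JSON || type.tag == HashTypeTag::Dynamic)
        return out; // parameters such as max_dynamic_paths are deliberately skipped
    for (const auto & nested : type.nested)
        out += encodeTypeForHash(nested);
    return out;
}
```

In this toy, `Array(JSON(max_dynamic_paths=25))` and `Array(JSON(max_dynamic_paths=50))` encode to the same two bytes, while `Array(Int64)` and `Array(String)` still encode differently, so values of genuinely different types are kept apart.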

@Avogar Avogar requested a review from al13n321 November 5, 2025 18:29
@Avogar Avogar added this pull request to the merge queue Nov 10, 2025
Merged via the queue into ClickHouse:master with commit 8907f5e Nov 10, 2025
125 checks passed
@Avogar Avogar deleted the json-in-hash-functions branch November 10, 2025 20:50
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Nov 10, 2025

Labels

pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo

Development

Successfully merging this pull request may close these issues.

Support JSON type in hash functions

3 participants