Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support declared value for Nullable storage, avoiding separate bitmap performance impact #31837

Open
joshg opened this issue Nov 26, 2021 · 1 comment
Labels

Comments

@joshg
Copy link

joshg commented Nov 26, 2021

Use case

Would like to use an explicit encoding value for Nullable column types, keeping the clean semantics of NULL and compatibility with Null-friendly SQL code, but without taking the hit of the separate bitmap to record NULL records. The documentation is clear that there is almost always a performance and space impact of making a field Nullable. In cases where the full value space of the type is not being used, NULL semantics can be maintained without this impacting storage or performance.

Describe the solution you'd like

I would like to be able to specify the storage value for NULL within the type space of the Nullable type.
When specified, NULL will be stored inline with the column and transparently handled as a NULL value, but without the separate bitmaps storage or lookup.

For example:

Nullable(UInt32 AS 999999999) or Nullable(UInt32, NULL AS 999999999) or similar syntax.

The value specification would be completely optional, but when specified NULL would be stored as the specified value in the primary table. No separate NULL bitmap would be stored.

So in the example above, inserting 1, NULL, 2 into this table would result in:

1, 999999999, 2 actually being stored in the column and no separate NULL bitmap.

All of the SQL NULL handling (e.g., isNull, etc.) would be transparently mapped under the hood to comparisons to the specified value.

So isNull(v) of above would still return false, true, false. There would be no way to 'read' the 999999999. It is a declared storage-only value for the NULL representation.

When specified as the NULL value, it would no longer be possible to store that value in the column.
It would either ERROR on insert or transparently write, but read back as a NULL value (I don't care on this).

Describe alternatives you've considered

This can be handled explicitly in the logic above Clickhouse, but it's messy and inconsistent. It also makes it more difficult to translate and adapt alternate SQL code that cleanly and appropriately uses SQL NULL semantics.

As Clickhouse philosophically allows a great deal of storage control in the interest of performance and efficiency, it seems consistent to allow the encoding of NULL to be inline with the Nullable data type, in instances where the full value space of the underlying type is not in fact needed.

Additional context

Notably #16330 suggests that the lowCardinality type may also operate this way (assigning a value in the dictionary to represent the NULL, avoiding the separate mask). This effectively extends this approach to any data type where a value can be set aside for use in NULL encoding, restoring all the efficiency of the non-Nullable implementation.

Also, since this is completely optional, there would be no impact on storage or implementation for Nullable data types declared without an encoding value specified.

@joshg joshg added the feature label Nov 26, 2021
@den-crane
Copy link
Contributor

It does not solve performance issue. You cannot use sum(x), anyway you have to implement some additional logic if x = 999999999 then ... else ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants