bfloat16 histogram state merging is slow when compared with tdigest #32228

vmihailenco · 2021-12-04T08:41:56Z

If looks like bfloat16 state merging is at least 2x times slower than tdigest even though compressed bfloat16 histogram is smaller on disk and faster in some/most benchmarks. Does bfloat16 merging requires more computational resources or there is an inefficiency in current implementation?

For example, for this table:

CREATE TABLE test (
  td AggregateFunction(quantilesTDigest(0.5, 0.9, 0.99), UInt64),
  bf AggregateFunction(quantilesBFloat16(0.5, 0.9, 0.99), UInt64),
  grouping_key UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (grouping_key);

insert into test
select
  quantilesTDigestState(0.5, 0.9, 0.99)(number),
  quantilesBFloat16State(0.5, 0.9, 0.99)(number),
  number % 10000 as grouping_key
from numbers(10e7)
group by grouping_key;

optimize table test final;

TDigest:

SELECT quantilesTDigestMerge(0.5)(td)
FROM test

Query id: 2ec3e5e3-49fe-46be-9208-be8201b4775a

┌─quantilesTDigestMerge(0.5)(td)─┐
│ [100047170]                    │
└────────────────────────────────┘

1 rows in set. Elapsed: 0.172 sec. Processed 10.00 thousand rows, 2.43 MB (58.24 thousand rows/s., 14.16 MB/s.)

BFloat16:

SELECT quantilesBFloat16Merge(0.5)(bf)
FROM test

Query id: b72df84a-30db-4dbf-9fab-d24902e3bf2a

┌─quantilesBFloat16Merge(0.5)(bf)─┐
│ [99614720]                      │
└─────────────────────────────────┘

1 rows in set. Elapsed: 0.425 sec. Processed 10.00 thousand rows, 5.31 MB (23.51 thousand rows/s., 12.50 MB/s.)

The situation is similar with production data.

vmihailenco · 2021-12-09T07:21:20Z

CC @RedClusive in case you have any input / ideas...

vmihailenco · 2021-12-28T11:14:55Z

For what it is worth, I've tried to re-compile clickhouse using crc32 hash instead of trivial hash for bfloat16 HashMap. No change in performance.

UnamedRus · 2023-11-05T17:04:10Z

It's because BFloat16 histogram use UInt64 as Weight parameter, so each pair use 10 bytes, 2 for bfloat16 key and 8 for UInt64 Weight, so state size for quantileBFloat16 is up to 50%+ bigger than for TDigest, which use 2 Float32 values for it's Centroids

ClickHouse/src/AggregateFunctions/QuantileBFloat16Histogram.h

Line 47 in e2b66c9

using Weight = UInt64;

vmihailenco added the performance label Dec 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bfloat16 histogram state merging is slow when compared with tdigest #32228

bfloat16 histogram state merging is slow when compared with tdigest #32228

vmihailenco commented Dec 4, 2021 •

edited

vmihailenco commented Dec 9, 2021

vmihailenco commented Dec 28, 2021

UnamedRus commented Nov 5, 2023

bfloat16 histogram state merging is slow when compared with tdigest #32228

bfloat16 histogram state merging is slow when compared with tdigest #32228

Comments

vmihailenco commented Dec 4, 2021 • edited

vmihailenco commented Dec 9, 2021

vmihailenco commented Dec 28, 2021

UnamedRus commented Nov 5, 2023

vmihailenco commented Dec 4, 2021 •

edited