
Add support for auto-increment and skip_rows #63473

Open
tchaton opened this issue May 7, 2024 · 5 comments
tchaton commented May 7, 2024


Use case

My current use case is building a metrics logger for AI applications.

Users log metrics as dictionaries where the key is the metric name and the value is the metric value. Each user can create multiple time series with millions of points.

When inserting the metrics, I know they arrive ordered, so I want to track row_idx as an auto-increment column.

When reading, I want to sub-sample a given number of points by skipping rows. Currently I do this with a modulo filter, but it is too slow.
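For reference, the modulo read I do today looks roughly like this (a sketch; the table and column names are illustrative):

-- keep every 1000th point of one metric; too slow at millions of rows
SELECT value
FROM metrics
WHERE id = 42 AND metric_name = 'loss' AND row_idx % 1000 = 0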


Describe the solution you'd like

Example: I have 1M points for a given metric and want to retrieve 1k points, so I would take 1 point out of every 1000.

I am thinking the SQL API could look like this:

CREATE TABLE random_table
(
    id UInt32,
    metric_name String,
    row_idx AutoIncrementUInt32,
    ...
)
ENGINE = MergeTree() ORDER BY (id, metric_name, row_idx)

SELECT * FROM random_table SKIP ROWS 1000 OFFSETS 0, -1
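To make the intended semantics concrete (this is the proposed syntax, which does not exist yet; the filter values are illustrative):

-- hypothetical: over 1,000,000 rows this would return
-- row_idx = 0, 1000, 2000, ..., 999000, i.e. exactly 1000 points
SELECT * FROM random_table
WHERE id = 42 AND metric_name = 'loss'
SKIP ROWS 1000 OFFSETS 0, -1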


Describe alternatives you've considered

I have looked into a modulo filter on the rows, but it is too slow. I have also tried pre-computing modulos as an extra column and, at read time, performing a prime decomposition of a given modulo by shifting bits and applying a composed bitAnd operator. That is too slow as well.
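A sketch of that precomputed-modulo idea, for two prime factors (illustrative names, not my exact code):

-- bit 0 is set when row_idx % 2 = 0, bit 1 when row_idx % 5 = 0,
-- so bitAnd(mod_flags, 3) = 3 selects rows where row_idx % 10 = 0
ALTER TABLE metrics ADD COLUMN mod_flags UInt8
    MATERIALIZED toUInt8(row_idx % 2 = 0) + bitShiftLeft(toUInt8(row_idx % 5 = 0), 1);

SELECT value FROM metrics WHERE bitAnd(mod_flags, 3) = 3;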

I have tried SAMPLE BY, but this is deterministic and not really what I need.
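For comparison, the SAMPLE approach looks like this, assuming the table declares a sampling key:

-- requires a sampling expression in the table definition, e.g.
-- ORDER BY (id, metric_name, intHash32(row_idx)) SAMPLE BY intHash32(row_idx)
SELECT value FROM metrics SAMPLE 1 / 1000  -- roughly 1 row in 1000, but always the same rows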



tchaton added the feature label May 7, 2024
UnamedRus (Contributor) commented
> I would take 1 point every 1000.

You would need to put some constraints on which OFFSETS and STEP values are allowed, because in the general case this means random access to rows, and ClickHouse, as a block-compressed DBMS with a sparse index, is a bad fit for that.

tchaton (Author) commented May 7, 2024

Hey @UnamedRus. For sure, this is only a suggestion on the API side. I am pretty sure ClickHouse could check the OFFSETS and SKIP ROWS values and fail the query if they are out of bounds.

Regarding the sparse, block-compressed DBMS concern: for my use case there is a guarantee that the row index only increments by 1, so it won't be sparse.

On my side, I am tracking the number of rows as side metadata, so the offsets won't exceed it. But I am considering building my own storage mechanism right now, as ClickHouse isn't fast enough for my use case.

Alternatively, I would be happy to contribute pieces of this if I were given enough guidance.

nikitamikhaylov (Member) commented

From my understanding, auto-increment requires UniqueMergeTree to be implemented (#41817), and that in turn requires a built-in Raft implementation for maintaining the state.

tchaton (Author) commented May 7, 2024

Hey @nikitamikhaylov. Thanks for sharing the RFC. From what I can read, I would still need support for a primary key, which isn't shown in the proposed design.

tchaton (Author) commented May 9, 2024

Hey @nikitamikhaylov. Some extra information: there is a lot of speed to gain by implementing this natively in ClickHouse and making sampling more flexible and non-deterministic in general.

I have implemented my own storage mechanism and I am getting a 10x speedup compared to ClickHouse with 1/10 of the storage space.

Kind regards, T.C
