
Add support for auto-increment and skip_rows #63473

Open
tchaton opened this issue May 7, 2024 · 5 comments
tchaton commented May 7, 2024


Use case

My current use case is building a metrics logger for AI applications.

Users log metrics as dictionaries where the key is the metric name and the value is the metric value. Each user can create multiple time series with millions of points.

When inserting the metrics, I know they arrive ordered, so I want to track row_idx as an auto-increment column.

When reading, I want to sub-sample a given number of points by skipping rows. Currently I do this with a modulo filter, but it is too slow.
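For reference, the modulo read I do today looks roughly like this (a sketch; the table and column names are illustrative):

-- keep every 1000th point of one metric; too slow at millions of rows
SELECT value
FROM metrics
WHERE id = 42 AND metric_name = 'loss' AND row_idx % 1000 = 0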


Describe the solution you'd like

Example: I have 1M points for a given metric and want to retrieve 1k points, so I would take 1 point out of every 1000.

I am thinking the SQL API could look like this:

CREATE TABLE random_table
(
    id UInt32,
    metric_name String,
    row_idx AutoIncrementUInt32,
    ...
)
ENGINE = MergeTree() ORDER BY (id, metric_name, row_idx)

SELECT * FROM random_table SKIP ROWS 1000 OFFSETS 0, -1
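To make the intended semantics concrete (this is the proposed syntax, which does not exist yet; the filter values are illustrative):

-- hypothetical: over 1,000,000 rows this would return
-- row_idx = 0, 1000, 2000, ..., 999000, i.e. exactly 1000 points
SELECT * FROM random_table
WHERE id = 42 AND metric_name = 'loss'
SKIP ROWS 1000 OFFSETS 0, -1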


Describe alternatives you've considered

I have looked into a modulo filter on the rows, but it is too slow. I have also tried pre-computing modulos as an extra column and, at read time, performing a prime decomposition of a given modulo by shifting bits and applying a composed bitAnd operator. That is too slow as well.
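A sketch of that precomputed-modulo idea, for two prime factors (illustrative names, not my exact code):

-- bit 0 is set when row_idx % 2 = 0, bit 1 when row_idx % 5 = 0,
-- so bitAnd(mod_flags, 3) = 3 selects rows where row_idx % 10 = 0
ALTER TABLE metrics ADD COLUMN mod_flags UInt8
    MATERIALIZED toUInt8(row_idx % 2 = 0) + bitShiftLeft(toUInt8(row_idx % 5 = 0), 1);

SELECT value FROM metrics WHERE bitAnd(mod_flags, 3) = 3;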

I have tried SAMPLE BY, but this is deterministic and not really what I need.
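For comparison, the SAMPLE approach looks like this, assuming the table declares a sampling key:

-- requires a sampling expression in the table definition, e.g.
-- ORDER BY (id, metric_name, intHash32(row_idx)) SAMPLE BY intHash32(row_idx)
SELECT value FROM metrics SAMPLE 1 / 1000  -- roughly 1 row in 1000, but always the same rows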



tchaton added the feature label May 7, 2024
UnamedRus (Contributor) commented
> I would take 1 point every 1000.

You would need to put some constraints on which OFFSETS and STEP values are allowed, because in the general case this means random access to rows, and ClickHouse, as a block-compressed DBMS with a sparse index, is a bad fit for that.

tchaton (Author) commented May 7, 2024

Hey @UnamedRus. For sure, this is only a suggestion on the API side. I am pretty sure ClickHouse could check the OFFSETS and SKIP ROWS values and fail the query if they are out of bounds.

Regarding the sparse, block-compressed DBMS concern: for my use case there is a guarantee that the row index only increments by 1, so it won't be sparse.

On my side, I am tracking the number of rows as side metadata, so the offsets won't exceed it. But I am considering building my own storage mechanism right now, as ClickHouse isn't fast enough for my use case.

Alternatively, I would be happy to contribute pieces of this if I were given enough guidance.

nikitamikhaylov (Member) commented

From my understanding, auto-increment requires UniqueMergeTree to be implemented (#41817), and that in turn requires a built-in Raft implementation for maintaining the state.

tchaton (Author) commented May 7, 2024

Hey @nikitamikhaylov. Thanks for sharing the RFC. From what I can read, I would still need support for a primary key, which isn't shown in the proposed design.

tchaton (Author) commented May 9, 2024

Hey @nikitamikhaylov. Some extra information: there is a lot of speed to gain by implementing this natively in ClickHouse and making sampling more flexible and non-deterministic in general.

I have implemented my own storage mechanism and I am getting a 10x speedup compared to ClickHouse with 1/10 of the storage space.

Kind regards, T.C
