Sampling support in querying #12908
Comments
I've picked this task up now. Here are my initial thoughts:
Present vs. future
Benchmarking
UX
Query layer
Complementary queries
Very nice! One more thing UX-wise: the way I've been imagining a sampling toggle is a slider with a logarithmic scale. I assume you'd want to sample data only if the regular queries are too slow, and in that case the sampling rate you'd choose will greatly depend on how much data you have, what query you're running, etc. You'd choose whichever rate to make your query complete in I imagine users going "my query is too slow, let's trade accuracy for speed by moving the sampling slider".
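To make the logarithmic slider idea concrete, here is a minimal sketch of how a slider position could map to a sampling rate. The function name `slider_to_sampling_rate` and the `min_rate` floor are assumptions for illustration, not anything that exists in the codebase.

```python
import math


def slider_to_sampling_rate(position: float, min_rate: float = 0.001) -> float:
    """Map a slider position in [0, 1] to a sampling rate in [min_rate, 1].

    position 0.0 means no sampling (rate 1.0); position 1.0 means min_rate.
    Equal slider steps multiply the rate by a constant factor, which is
    what makes the scale logarithmic.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be in [0, 1]")
    if not 0.0 < min_rate <= 1.0:
        raise ValueError("min_rate must be in (0, 1]")
    # min_rate ** 0 == 1.0 and min_rate ** 1 == min_rate.
    return min_rate ** position


if __name__ == "__main__":
    for pos in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(pos, slider_to_sampling_rate(pos))
```

With a log scale, moving the slider a fixed distance always cuts the scanned data by the same factor, which matches the "trade accuracy for speed" mental model better than a linear slider would.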
RE changing the
https://clickhouse.com/docs/en/sql-reference/statements/alter/sample-by/
Not sure if you've also seen the linked issue: #12909
When I am exploring data, I want to see results in the fastest way possible.
One way to achieve this would be to sample data. ClickHouse supports sampling: the table declares a sampling key via the SAMPLE BY clause, and queries opt in with the SAMPLE clause.
When sampled, queries would ideally return data in the same shape as the main dataset but lose precision because they look at fewer people.
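As a rough illustration of "same shape, less precision", the sketch below samples by person and scales counts back up by 1 / rate. It is a conceptual stand-in written in plain Python, not how ClickHouse implements sampling; `in_sample`, `estimate_total_events`, and the synthetic events are all made up for the example.

```python
import hashlib


def in_sample(person_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of all persons.

    Hashing the person id means every event of a kept person is kept,
    and a smaller rate always selects a subset of a larger rate.
    """
    digest = hashlib.sha256(person_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def estimate_total_events(events: list[dict], rate: float) -> float:
    """Count events of sampled persons and scale up by 1 / rate."""
    sampled = [e for e in events if in_sample(e["person_id"], rate)]
    return len(sampled) / rate


# Synthetic data: 1,000 persons with 10 events each (illustrative only).
events = [{"person_id": f"p{i % 1000}", "event": "pageview"} for i in range(10_000)]

if __name__ == "__main__":
    print("exact:", len(events))
    print("estimate at 10%:", estimate_total_events(events, 0.1))
```

The estimate has the same meaning as the exact count but carries sampling error, and that error shrinks as the number of sampled persons grows.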
Implementation notes / Open topics
What to sample by?
The current table is set up to sample by distinct_id. While we could sample by event, for funnels/paths it makes more sense to sample by person_id.
Setting up the schema for sampling
To sample by person_id, we'd need to re-migrate our main dataset to include person_id in the table's ORDER BY, because ClickHouse requires the sampling expression to be part of the primary key. This is a heavy migration similar to 0002_events_sample_by.
This might be a good spot to make other changes to the ORDER BY as well.
UX for querying
There needs to be a toggle to turn sampling on and off when querying.
Ideally, for large customers building insights, sampling would always be on when querying large time windows, but it should turn off once the insight is saved or added to a dashboard.
The fact data is sampled should also be clearly indicated in the UI.
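The behaviour described above could be captured in a small policy function. This is only a sketch: `QueryContext`, `should_sample`, and the 30-day `LARGE_WINDOW_DAYS` threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold for what counts as a "large" time window.
LARGE_WINDOW_DAYS = 30


@dataclass
class QueryContext:
    is_saved_insight: bool  # saved, or added to a dashboard
    window_days: int        # length of the queried time window
    is_large_team: bool     # team with a lot of data


def should_sample(ctx: QueryContext, user_toggle: Optional[bool] = None) -> bool:
    """Decide whether a query should run with sampling enabled."""
    # Saved insights and dashboard items always run unsampled.
    if ctx.is_saved_insight:
        return False
    # An explicit choice from the UI toggle wins while exploring.
    if user_toggle is not None:
        return user_toggle
    # Default: sample only large teams querying large time windows.
    return ctx.is_large_team and ctx.window_days >= LARGE_WINDOW_DAYS
```

Whatever this returns would also need to be passed back to the frontend so the UI can show that the result is sampled.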
Setting sampling rate
We'll need a system for figuring out the default sampling rate for a given team.
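One possible heuristic, sketched below, is to pick the largest power-of-ten rate that keeps the estimated number of scanned rows under a budget. Everything here is an assumption for discussion: the function name, the 10 million row budget, and the 0.001 floor are placeholders, and the row estimate would have to come from somewhere else.

```python
import math


def default_sampling_rate(
    estimated_rows: int,
    target_rows: int = 10_000_000,  # hypothetical per-query row budget
    min_rate: float = 0.001,        # hypothetical lower bound on the rate
) -> float:
    """Pick a default sampling rate for a team or query.

    Returns 1.0 (no sampling) when the data already fits the budget,
    otherwise a power of ten that brings the scan at or under the budget,
    never going below min_rate.
    """
    if estimated_rows <= target_rows:
        return 1.0
    raw = target_rows / estimated_rows
    # Round down to a power of ten so rates stay easy to read (0.1, 0.01, ...).
    snapped = 10.0 ** math.floor(math.log10(raw))
    return max(snapped, min_rate)


if __name__ == "__main__":
    for rows in (5_000_000, 50_000_000, 500_000_000, 10**15):
        print(rows, default_sampling_rate(rows))
```

Snapping to powers of ten lines up with the logarithmic slider idea from the comments, so the default can simply be one of the slider's stops.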