Sampling support in querying #12908

Open
macobo opened this issue Nov 22, 2022 · 4 comments

Comments


macobo commented Nov 22, 2022

When I am exploring data, I want to see results in the fastest way possible.

One way to achieve this would be to sample data. ClickHouse supports sampling via the SAMPLE BY clause.

When sampled, queries would ideally return data in the same shape as queries over the full dataset, but lose precision due to looking at fewer people.
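
For concreteness, a rough sketch of what a sampled query could look like. Table and column names are illustrative, and the final query would come from our existing query builders rather than a hand-written string:

```python
# Rough sketch only: table/column names are illustrative.
SAMPLED_TREND_SQL = """
    SELECT toStartOfDay(timestamp) AS day, count() AS event_count
    FROM events
    SAMPLE 0.1                       -- ClickHouse reads ~10% of the sample key space
    WHERE team_id = 2
      AND timestamp >= now() - INTERVAL 30 DAY
    GROUP BY day
    ORDER BY day
"""
# The returned counts cover ~10% of people, so they'd be scaled up by 1/0.1 afterwards.
```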

Implementation notes / Open topics

What to sample by?

The current table is set up to sample by distinct_id. While we could sample by event, for funnels/paths it makes more sense to sample by person_id.

Setting up the schema for sampling

To sample by person_id, we'd need to re-migrate our main dataset to include person_id in the table's ORDER BY. This is a heavy migration, similar to 0002_events_sample_by.

This might be a good spot to make other changes to the ORDER BY as well.
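
As a purely illustrative sketch (not the final schema), the re-created table could look something like the following, with person_id in the sorting key so it can also serve as the sampling key. Columns are abbreviated and the engine/partitioning would mirror whatever the table uses today:

```python
# Illustrative only: a possible shape for the re-created table. The SAMPLE BY
# expression has to be part of the sorting key, hence cityHash64(person_id)
# appearing in ORDER BY.
EVENTS_TABLE_WITH_PERSON_SAMPLING_SQL = """
CREATE TABLE events_with_person_sampling
(
    uuid UUID,
    team_id Int64,
    event String,
    distinct_id String,
    person_id UUID,
    timestamp DateTime64(6, 'UTC')
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (team_id, toDate(timestamp), event, cityHash64(person_id), cityHash64(uuid))
SAMPLE BY cityHash64(person_id)
"""
```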

UX for querying

There needs to be a toggle to turn sampling on and off when querying.

Ideally, for large customers, sampling would always be on when building insights over large time windows, but it should turn off once the insight is saved or added to a dashboard.

The fact that data is sampled should also be clearly indicated in the UI.

Setting sampling rate

We'll need a system for figuring out the default sampling rate for a given team.
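
For example, one very rough heuristic could derive the default from the team's recent event volume. The thresholds and the helper below are made up, just to show the shape of a per-team default:

```python
# Sketch only: thresholds and get_monthly_event_count() are hypothetical.
def default_sampling_rate(team_id: int) -> float:
    event_count = get_monthly_event_count(team_id)  # hypothetical volume lookup
    if event_count < 10_000_000:
        return 1.0   # small teams: no sampling, results stay exact
    if event_count < 100_000_000:
        return 0.1   # 10% sample
    return 0.01      # very large teams: 1% sample
```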

macobo changed the title from "Build out sampling support for smoother data exploration" to "Sampling support" Nov 22, 2022
macobo added the performance, feature/insights, clickhouse and epic labels Nov 22, 2022
macobo changed the title from "Sampling support" to "Sampling support in querying" Nov 22, 2022
yakkomajuri self-assigned this Feb 16, 2023

yakkomajuri commented Feb 16, 2023

I've picked this task up now. Here are my initial thoughts:

Present vs. future

  • Currently our events table has SAMPLE BY distinct_id. This means sampling will work best/only for teams that deal with only one type of user (i.e. anonymous or identified but not both)
  • In the future we should sample by person_id. Unsure about the work needed to change a SAMPLE BY clause at this point.

Benchmarking

  • Sampling definitely speeds things up a lot (duh!) so I think we really should invest in it
  • Sampling can speed things up more than linearly, i.e. sampling by 10% can make things over 10x faster, because of the compounding effect of sorts, groupings, joins, etc. down the line. However, in my quick benchmarking (3 queries for our largest team), sampling by a factor of 0.1 gave speed improvements in the order of 1.5-6x.
  • Accuracy: 10% sampling was off by around ~1% on average for retention and DAUs, with the largest difference for a single data point being ~3%. For a lifecycle query though, 2 data points differed by about 12%.
  • Need to do further testing to find the optimal sampling rate (overall? per team? per insight?). We can also consider using SAMPLE n instead of SAMPLE k for appropriate team-level factors (see the sketch after this list).
  • Btw, my suggestion is to test this in prod (behind a feature flag with some opt-in teams ofc) as that'll give us the most/best data.
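
To make the SAMPLE k vs. SAMPLE n distinction concrete (query text is illustrative, not how we'd actually build it):

```python
# Illustrative only: the two sampling forms ClickHouse offers.
# SAMPLE k (0 < k < 1) reads roughly that fraction of the sample key space;
# SAMPLE n (n >= 1) reads enough data to cover ~n rows and exposes the
# effective fraction through the _sample_factor virtual column.
RELATIVE_SAMPLE_SQL = "SELECT count() FROM events SAMPLE 0.1 WHERE team_id = 2"

ABSOLUTE_SAMPLE_SQL = """
    SELECT sum(_sample_factor)   -- approximate row count, already scaled
    FROM events
    SAMPLE 1000000
    WHERE team_id = 2
"""
```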

UX

  • Initially at least let's make this toggle-able at the insight level without any stored state
  • Later there are two potential approaches. Note that it will become important to be able to save state to prevent us from doing unnecessary querying.
    1. Make this a team-level setting
    2. Make this a saved insight-level setting
  • Initial suggestion: just under the filters on the right

[Screenshot: proposed placement of the sampling toggle, just under the filters on the right]

Query layer

  • Sampling will return results for a sampling factor s, e.g. 0.1, and we then need to multiply the results by 1/s. I suggest we do this in Python rather than ClickHouse, as it's logic that is much better suited to that layer (see the sketch after this list).
  • We might not be able to turn on sampling for all types of insights, so need to consider our queries carefully. Math operations are particularly problematic. Notes on this:
    • avg: doesn't need any post-processing
    • min/max: should not allow sampling
    • median/p90/p95/p99: we should probably not allow sampling + the current approach using quantile is anyway effectively sampled
    • sum: multiply by 1/s as standard
  • Given all this handling, we might just turn off sampling for math operations as a start? Either way need to consider things carefully when implementing it
  • Need to consider how best to test sampling
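
A minimal sketch of the Python-side scaling described above. Names are placeholders, not our actual query runner; which aggregates get scaled follows the notes in the list (counts/sums multiplied by 1/s, averages left alone, min/max/percentiles simply not offered with sampling on):

```python
# Sketch only: aggregate names and the function itself are illustrative.
SCALED = {"total", "sum", "dau"}   # count-like aggregates, scaled by 1/s
UNSCALED = {"avg"}                 # averages need no post-processing

def correct_result_for_sampling(value: float, aggregate: str, sampling_factor: float) -> float:
    if sampling_factor >= 1 or aggregate in UNSCALED:
        return value
    if aggregate in SCALED:
        return value / sampling_factor   # multiply by 1/s
    raise ValueError(f"Sampling is not supported for aggregate {aggregate!r}")
```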

Complementary queries

  • In many cases we have queries that complement each other. The best example of this is the persons modal. Question: if the main query is sampled, should we sample the others as well? We can probably punt on this and leave the supplementary queries unsampled as they should run fewer times than the main query anyway.
  • What about multiple series where one doesn't support sampling? Probably ok to just "sample what we can" in any case when someone turns on sampling

mariusandra commented:

Very nice! One more thing waiting for person_id 😅

UX-wise, the way I've been imagining a sampling toggle is a slider with a logarithmic scale. I assume you'd want to sample data only if the regular queries are too slow, and in that case the sampling rate you'd choose will greatly depend on how much data you have, what query you're running, etc. You'd choose whichever rate makes your query complete in $reasonable_timerange.

I imagine users going "my query is too slow, let's trade accuracy for speed by moving the sampling slider".
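
For concreteness, one way the log-scale mapping could work (purely a sketch, the lower bound is a made-up number):

```python
# Sketch only: map slider position in [0, 1] to a sampling rate on a log scale,
# so most of the slider's travel is spent on the small rates.
MIN_RATE = 0.001  # leftmost position = 0.1% sample (made-up bound)

def slider_to_sampling_rate(position: float) -> float:
    # position 0.0 -> 0.001, position 0.5 -> ~0.032, position 1.0 -> 1.0 (sampling off)
    return MIN_RATE ** (1 - position)
```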

yakkomajuri commented:

RE changing the SAMPLE BY clause of events:

  • It's an easy operation in itself
  • But it depends on person_id being in the sorting key, which requires an async migration

https://clickhouse.com/docs/en/sql-reference/statements/alter/sample-by/
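
i.e. once cityHash64(person_id) (or whatever expression we settle on) is in the sorting key, the switch itself would be roughly the following. Sketch only, per the linked docs:

```python
# Sketch only: per the docs above, MODIFY SAMPLE BY only changes metadata,
# but the new expression must already be part of the table's sorting key.
ALTER_SAMPLE_BY_SQL = """
    ALTER TABLE events
    MODIFY SAMPLE BY cityHash64(person_id)
"""
```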

neilkakkar commented:

Not sure if you've also seen the linked issue: #12909
