Sampling support in querying #12908

Open
macobo opened this issue Nov 22, 2022 · 4 comments

Comments


macobo commented Nov 22, 2022

When I am exploring data, I want to see results in the fastest way possible.

One way to achieve this would be to sample data. ClickHouse supports sampling via the SAMPLE BY clause.

When sampled, queries would ideally return data in the same shape as queries over the full dataset, but lose precision due to looking at fewer people.
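
For concreteness, a rough sketch of what a sampled query could look like. Table and column names are illustrative, and the final query would come from our existing query builders rather than a hand-written string:

```python
# Rough sketch only: table/column names are illustrative.
SAMPLED_TREND_SQL = """
    SELECT toStartOfDay(timestamp) AS day, count() AS event_count
    FROM events
    SAMPLE 0.1                       -- ClickHouse reads ~10% of the sample key space
    WHERE team_id = 2
      AND timestamp >= now() - INTERVAL 30 DAY
    GROUP BY day
    ORDER BY day
"""
# The returned counts cover ~10% of people, so they'd be scaled up by 1/0.1 afterwards.
```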

Implementation notes / Open topics

What to sample by?

The current table is set up to sample by distinct_id. While we could sample by event, for funnels/paths it makes more sense to sample by person_id.

Setting up the schema for sampling

To sample by person_id, we'd need to re-migrate our main dataset to include person_id in the table's ORDER BY. This is a heavy migration, similar to 0002_events_sample_by.

This might be a good spot to make other changes to the ORDER BY as well.
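
As a purely illustrative sketch (not the final schema), the re-created table could look something like the following, with person_id in the sorting key so it can also serve as the sampling key. Columns are abbreviated and the engine/partitioning would mirror whatever the table uses today:

```python
# Illustrative only: a possible shape for the re-created table. The SAMPLE BY
# expression has to be part of the sorting key, hence cityHash64(person_id)
# appearing in ORDER BY.
EVENTS_TABLE_WITH_PERSON_SAMPLING_SQL = """
CREATE TABLE events_with_person_sampling
(
    uuid UUID,
    team_id Int64,
    event String,
    distinct_id String,
    person_id UUID,
    timestamp DateTime64(6, 'UTC')
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (team_id, toDate(timestamp), event, cityHash64(person_id), cityHash64(uuid))
SAMPLE BY cityHash64(person_id)
"""
```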

UX for querying

There needs to be a toggle to turn sampling on and off when querying.

Ideally, for large customers, sampling would always be on when building insights over large time windows, but it should turn off once the insight is saved or added to a dashboard.

The fact that data is sampled should also be clearly indicated in the UI.

Setting sampling rate

We'll need a system for figuring out the default sampling rate for a given team.
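
For example, one very rough heuristic could derive the default from the team's recent event volume. The thresholds and the helper below are made up, just to show the shape of a per-team default:

```python
# Sketch only: thresholds and get_monthly_event_count() are hypothetical.
def default_sampling_rate(team_id: int) -> float:
    event_count = get_monthly_event_count(team_id)  # hypothetical volume lookup
    if event_count < 10_000_000:
        return 1.0   # small teams: no sampling, results stay exact
    if event_count < 100_000_000:
        return 0.1   # 10% sample
    return 0.01      # very large teams: 1% sample
```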

macobo changed the title from "Build out sampling support for smoother data exploration" to "Sampling support" Nov 22, 2022
macobo added the performance, feature/insights, clickhouse and epic labels Nov 22, 2022
macobo changed the title from "Sampling support" to "Sampling support in querying" Nov 22, 2022
yakkomajuri self-assigned this Feb 16, 2023

yakkomajuri commented Feb 16, 2023

I've picked this task up now. Here are my initial thoughts:

Present vs. future

  • Currently our events table has SAMPLE BY distinct_id. This means sampling will work best/only for teams that deal with only one type of user (i.e. anonymous or identified but not both)
  • In the future we should sample by person_id. Unsure about the work needed to change a SAMPLE BY clause at this point.

Benchmarking

  • Sampling definitely speeds things up a lot (duh!) so I think we really should invest in it
  • Sampling can speed things up more than linearly, i.e. sampling by 10% can make things over 10x faster, because of the compounding effect of sorts, groupings, joins, etc. down the line. However, in my quick benchmarking (3 queries for our largest team), sampling by a factor of 0.1 gave speed improvements in the order of 1.5-6x.
  • Accuracy: 10% sampling was off by around ~1% on average for retention and DAUs, with the largest difference for a single data point being ~3%. For a lifecycle query though, 2 data points differed by about 12%.
  • Need to do further testing to find the optimal sampling rate (overall? per team? per insight?). We can also consider using SAMPLE n instead of SAMPLE k for appropriate team-level factors (see the sketch after this list).
  • Btw, my suggestion is to test this in prod (behind a feature flag with some opt-in teams ofc) as that'll give us the most/best data.
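
To make the SAMPLE k vs. SAMPLE n distinction concrete (query text is illustrative, not how we'd actually build it):

```python
# Illustrative only: the two sampling forms ClickHouse offers.
# SAMPLE k (0 < k < 1) reads roughly that fraction of the sample key space;
# SAMPLE n (n >= 1) reads enough data to cover ~n rows and exposes the
# effective fraction through the _sample_factor virtual column.
RELATIVE_SAMPLE_SQL = "SELECT count() FROM events SAMPLE 0.1 WHERE team_id = 2"

ABSOLUTE_SAMPLE_SQL = """
    SELECT sum(_sample_factor)   -- approximate row count, already scaled
    FROM events
    SAMPLE 1000000
    WHERE team_id = 2
"""
```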

UX

  • Initially at least let's make this toggle-able at the insight level without any stored state
  • Later there are two potential approaches. Note that it will become important to be able to save state to prevent us from doing unnecessary querying.
    1. Make this a team-level setting
    2. Make this a saved insight-level setting
  • Initial suggestion: just under the filters on the right

[Screenshot: proposed placement of the sampling toggle, just under the filters on the right]

Query layer

  • Sampling will return results for a sampling factor s, e.g. 0.1, and we then need to multiply the results by 1/s. I suggest we do this in Python rather than ClickHouse, as it's logic that is much better suited to that layer (see the sketch after this list).
  • We might not be able to turn on sampling for all types of insights, so need to consider our queries carefully. Math operations are particularly problematic. Notes on this:
    • avg: doesn't need any post-processing
    • min/max: should not allow sampling
    • median/p90/p95/p99: we should probably not allow sampling + the current approach using quantile is anyway effectively sampled
    • sum: multiply by 1/s as standard
  • Given all this handling, we might just turn off sampling for math operations as a start? Either way need to consider things carefully when implementing it
  • Need to consider how best to test sampling
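
A minimal sketch of the Python-side scaling described above. Names are placeholders, not our actual query runner; which aggregates get scaled follows the notes in the list (counts/sums multiplied by 1/s, averages left alone, min/max/percentiles simply not offered with sampling on):

```python
# Sketch only: aggregate names and the function itself are illustrative.
SCALED = {"total", "sum", "dau"}   # count-like aggregates, scaled by 1/s
UNSCALED = {"avg"}                 # averages need no post-processing

def correct_result_for_sampling(value: float, aggregate: str, sampling_factor: float) -> float:
    if sampling_factor >= 1 or aggregate in UNSCALED:
        return value
    if aggregate in SCALED:
        return value / sampling_factor   # multiply by 1/s
    raise ValueError(f"Sampling is not supported for aggregate {aggregate!r}")
```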

Complementary queries

  • In many cases we have queries that complement each other. The best example of this is the persons modal. Question: if the main query is sampled, should we sample the others as well? We can probably punt on this and leave the supplementary queries unsampled as they should run fewer times than the main query anyway.
  • What about multiple series where one doesn't support sampling? Probably ok to just "sample what we can" in any case when someone turns on sampling

mariusandra commented:

Very nice! One more thing waiting for person_id 😅

UX-wise, the way I've been imagining a sampling toggle is a slider with a logarithmic scale. I assume you'd want to sample data only if the regular queries are too slow, and in that case the sampling rate you'd choose will greatly depend on how much data you have, what query you're running, etc. You'd choose whichever rate makes your query complete in $reasonable_timerange.

I imagine users going "my query is too slow, let's trade accuracy for speed by moving the sampling slider".
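
For concreteness, one way the log-scale mapping could work (purely a sketch, the lower bound is a made-up number):

```python
# Sketch only: map slider position in [0, 1] to a sampling rate on a log scale,
# so most of the slider's travel is spent on the small rates.
MIN_RATE = 0.001  # leftmost position = 0.1% sample (made-up bound)

def slider_to_sampling_rate(position: float) -> float:
    # position 0.0 -> 0.001, position 0.5 -> ~0.032, position 1.0 -> 1.0 (sampling off)
    return MIN_RATE ** (1 - position)
```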

yakkomajuri commented:

RE changing the SAMPLE BY clause of events:

  • It's an easy operation in itself
  • But it depends on person_id being in the sorting key, which requires an async migration

https://clickhouse.com/docs/en/sql-reference/statements/alter/sample-by/
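
i.e. once cityHash64(person_id) (or whatever expression we settle on) is in the sorting key, the switch itself would be roughly the following. Sketch only, per the linked docs:

```python
# Sketch only: per the docs above, MODIFY SAMPLE BY only changes metadata,
# but the new expression must already be part of the table's sorting key.
ALTER_SAMPLE_BY_SQL = """
    ALTER TABLE events
    MODIFY SAMPLE BY cityHash64(person_id)
"""
```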

neilkakkar commented:

Not sure if you've also seen the linked issue: #12909
