Implement buffer for conversion events #9182
Curious what this will be used for?
Consider the backend problem I mentioned here: the idea is to have a staging table where events will wait for a cron job that selects from the table and inserts into the main events table. This helps us minimize the "backend problem" and ensures events around the "$identify edge" get the correct person_id.
Sorry I missed this issue and the description, but here's a proposal:
- First sign-up: if the identify event arrives within the conversion window of any other event for the same identified person, it gets properly tied to the previous anonymous person, same as the initial proposal here.
- Second login, backend usage, etc.: we're more likely to tie the people together (i.e. we will if the events fall within the conversion window).
- Downside: the buffer is used more.
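The "within conversion window" condition in this proposal could look something like the sketch below. Both the window length and the function name are assumptions for illustration, not the actual values or code.

```typescript
// Assumed window length; the real value would be a product decision.
const CONVERSION_WINDOW_MS = 60 * 60 * 1000 // 1 hour

// An identify event can be tied to a prior event for the same person only
// if it arrives within the conversion window of that event (and not before it).
function withinConversionWindow(identifyAtMs: number, priorEventAtMs: number): boolean {
    const delta = identifyAtMs - priorEventAtMs
    return delta >= 0 && delta <= CONVERSION_WINDOW_MS
}
```

A wider window ties more sessions together but routes more events through the buffer, which is exactly the trade-off noted above.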
I think this is a valid proposal. It solves a problem we've determined doesn't need solving per se (subsequent anonymous sessions), but I'm also with you in the sense that I would like to help solve it if possible.

The main issue is that it would probably significantly increase the size of the buffer, which worries me. Not only does that increase Kafka size, but it also increases the number of events we consume and process twice, which adds a lot of load to an already struggling plugin server.

Given we decided not to solve for this problem, the current buffer is fine, so there's no rush and we can always change it. However, the good news is that I built a no-op buffer that's in right now with some metrics, so we can see how many events we would have sent to the buffer if it were launched. I'll let this run for a bit and eventually implement this proposal, so we can see the difference in the number of events that would go to the buffer. That should give us a better basis for a decision here.

I've also not thought deeply about edge cases, so I'd need to do a bit of that before pulling the trigger.
Note to self: test how pausing works when you pause a partition and then pause the whole topic or the entire consumer.
We discussed this further with @yakkomajuri; the detailed discussion can be found here: https://docs.google.com/document/d/1ucnwo0QwQNCboDRaBP3PzTP0magEMlluHuTmZLqQ5XY

TL;DR on when to write an event to the buffer vs. process it immediately, from the end user's perspective:
- First login/signup: anon events before identify get tied to the identified person.
- Second login: anon events before identify will have a different person ID.
We'll be implementing a buffer using Kafka and the plugin server to ensure we associate events with the right distinct ID around the "identify edge", which we can also refer to as "conversion" events.
The solution will work as follows:
- Create a new Kafka topic, `events_buffer`, where messages contain event payloads as well as an extra field, `process_at`.
- Run `processEvent` and `onEvent` normally. However, at the very "end" of the `ingestEvent` task, decide whether to send the event to ClickHouse directly or to the buffer topic, based on a heuristic.

The consumer should work as follows: consume a message and check `process_at`. For `t = process_at - now`: if `t > 0`, don't commit the offset, finish the execution, stop the consumer, and sleep for `t`. If `t <= 0`, ingest the event now.

Won't waste a bunch of time making a graph that's a perfect representation of the world, but this should give a good overview of how this system will work:
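The consumer rule above can be sketched as a pure decision function. This is a minimal illustration with assumed names (`BufferMessage`, `handleBufferMessage`), not the plugin server's actual code; the real consumer would act on the returned decision by sleeping or ingesting.

```typescript
interface BufferMessage {
    processAt: number // epoch ms at which the event becomes ripe
    payload: string
}

type ConsumeResult =
    | { action: 'ingest' } // ingest now, then commit the offset
    | { action: 'sleep'; sleepMs: number } // offset is NOT committed

// For t = process_at - now: if t > 0, stop without committing and sleep
// for t so the message is re-delivered later; if t <= 0, ingest now.
function handleBufferMessage(message: BufferMessage, nowMs: number): ConsumeResult {
    const t = message.processAt - nowMs
    if (t > 0) {
        return { action: 'sleep', sleepMs: t }
    }
    return { action: 'ingest' }
}
```

Because the producer appends messages roughly in `process_at` order, hitting one unripe message means everything behind it on that partition is also unripe, so pausing the whole consumer is safe.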
This issue previously outlined a ClickHouse buffer solution that we've decided against. Click below to see its content.
Old issue content
The `staging_events` table will have the same schema as the events table.
Creating it could be done via a "normal" CH migration as it is a new table.
However, we want to create the materialized view and Kafka table on one server only. This is to ensure consistency when querying from this table to write to `writable_events`.

For this we will need some assistance from Team Infra (@guidoiaquinti @hazzadous), as we need a way for self-hosted users to also leverage `CLICKHOUSE_STABLE_HOST`. Effectively, we need a way to connect to one individual ClickHouse server for this.