
Implement buffer for conversion events #9182

Closed
Tracked by #9555
yakkomajuri opened this issue Mar 22, 2022 · 6 comments

Comments

@yakkomajuri
Contributor

yakkomajuri commented Mar 22, 2022

We'll be implementing a buffer using Kafka and the plugin server to ensure we associate events with the right distinct ID around the "identify edge", which we can also refer to as "conversion" events.

The solution will work as follows:

  1. Add a new Kafka topic for the buffer, called e.g. events_buffer, where messages contain event payloads as well as an extra field process_at.
  2. When processing an event, run processEvent and onEvent normally. However, at the very "end" of the ingestEvent task, decide whether to send the event to ClickHouse directly or to the buffer topic. The heuristic (sketched in code after this list) is as follows:
  • anonymous events -> ClickHouse
  • $identify -> ClickHouse
  • non-anonymous events without a person -> buffer
  • non-anonymous events within the conversion window -> buffer
  3. Set up a new consumer on the main thread to consume from the buffer topic and send to a worker task doing the following:
  • Look up the person ID
  • Add the ID to the event
  • Produce the event to the Kafka topic consumed by ClickHouse
  4. The consumer should work as follows (see the sketch after the diagram below):

    1. Pull a message from Kafka
    2. Check process_at. For t = process_at - now: if t > 0, don't commit the offset, finish the execution, stop the consumer, and sleep for t. If t <= 0, ingest the event now
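
Here's a minimal TypeScript sketch of the step-2 routing decision. The names (`shouldSendToBuffer`, `CONVERSION_WINDOW_MS`) and the event/person shapes are illustrative assumptions, not the actual plugin server code:

```typescript
// Sketch of the step-2 routing heuristic. All names and shapes here are
// illustrative assumptions, not the real plugin server implementation.
interface CandidateEvent {
    name: string // e.g. '$identify', '$pageview'
    isAnonymous: boolean // however the pipeline flags anonymous distinct IDs
    person?: { createdAt: Date } // undefined if no person exists for the distinct ID
}

// Hypothetical conversion window; the issue doesn't pin down a value.
const CONVERSION_WINDOW_MS = 60_000

function shouldSendToBuffer(event: CandidateEvent, now: Date): boolean {
    if (event.isAnonymous) return false // anonymous events -> ClickHouse
    if (event.name === '$identify') return false // $identify -> ClickHouse
    if (!event.person) return true // non-anonymous, no person -> buffer
    // Non-anonymous events within the conversion window -> buffer.
    return now.getTime() - event.person.createdAt.getTime() < CONVERSION_WINDOW_MS
}

// At the end of ingestEvent, the producer side could then set process_at, e.g.:
// { ...payload, process_at: Date.now() + CONVERSION_WINDOW_MS }
```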

Won't waste a bunch of time making a graph that's a perfect representation of the world, but this should give a good overview of how this system will work:

[Screenshot: overview diagram of the buffer system, 2022-03-25]
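
And a sketch of the consumer's delay logic from step 4. The helper signatures and message shape are assumptions; the real implementation would hook into the consumer's pause/resume machinery:

```typescript
// Sketch of the buffer consumer's delay logic (step 4 above). The helper
// signatures and the message shape are assumptions, not the real implementation.
interface BufferedMessage {
    processAt: number // epoch ms, from the process_at field on the buffer topic
    payload: unknown
}

async function handleBufferMessage(
    message: BufferedMessage,
    deps: {
        ingest: (payload: unknown) => Promise<void>
        commitOffset: () => Promise<void>
        pauseConsumerFor: (ms: number) => Promise<void> // stop consuming, sleep, resume
    }
): Promise<void> {
    const t = message.processAt - Date.now()
    if (t > 0) {
        // Not due yet: leave the offset uncommitted so the message is
        // redelivered, and pause consumption until it becomes due.
        await deps.pauseConsumerFor(t)
        return
    }
    // Due (or overdue): ingest now, then commit the offset.
    await deps.ingest(message.payload)
    await deps.commitOffset()
}
```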


This issue previously outlined a ClickHouse buffer solution that we've decided against. Click below to see its content.

Old issue content
The `staging_events` table will have the same schema as the events table.

Creating it could be done via a "normal" CH migration as it is a new table.

However, we want to only create the materialized view and Kafka table on one server. This is to ensure consistency when querying from this table to write to writable_events.

For this we will need some assistance from Team Infra (@guidoiaquinti @hazzadous), as we need a way for self-hosted users to also leverage CLICKHOUSE_STABLE_HOST. Effectively, we need a way to connect to one individual ClickHouse server for this.

@neilkakkar
Collaborator

Curious what this will be used for?

@yakkomajuri
Contributor Author

Consider the backend problem I mentioned here:

#9188 (comment)

The idea is to have a staging_events table which will get events in two scenarios:

  1. Events from a non-anonymous user with no person associated with them
  2. Events from a non-anonymous user within a few minutes of the person's created_at

Here events will wait for a cron job that will select from the table and insert into writable_events while doing a join with the person distinct id table, ensuring they have the latest person ID.

This helps us minimize the "backend problem" and ensure events around the "$identify edge" get the correct person_id.
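
For illustration, the cron job described here could have been a periodic query along these lines. This is a sketch only: the query shape and the `clickhouse.query` helper are assumptions, with most columns elided:

```typescript
// Sketch of the (since-rejected) cron job: move events from staging_events
// into writable_events, joining person_distinct_id for the latest person ID.
// Table names follow the issue; everything else is an illustrative assumption.
async function flushStagingEvents(clickhouse: { query: (sql: string) => Promise<void> }): Promise<void> {
    await clickhouse.query(`
        INSERT INTO writable_events
        SELECT e.uuid, e.event, e.properties, /* ...other columns..., */ pdi.person_id
        FROM staging_events AS e
        INNER JOIN person_distinct_id AS pdi ON pdi.distinct_id = e.distinct_id
    `)
}
```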

@yakkomajuri yakkomajuri changed the title from "Add the staging_events buffer table" to "Implement buffer for conversion events" on Mar 25, 2022
@tiina303
Contributor

Sorry I missed this issue and the description, but here's a proposal:

  1. If a person exists, we ingest directly; otherwise, add to the buffer (sketched below)
  2. prepareEvent handles $identify and $alias and creates a person if one doesn't exist - after that we ingest directly, since the person now exists
  3. The conversion window is tied to the event's arrival timestamp at PostHog (not the event's own timestamp - this works better when we're backlogged)
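
In code terms, the decision in (1) collapses to roughly this sketch (illustrative names; contrast with the four-branch heuristic above):

```typescript
// Sketch of the simplified routing: $identify/$alias are handled (and the
// person created) in prepareEvent, so only person existence matters here.
function shouldSendToBuffer(personExists: boolean): boolean {
    return !personExists // person exists -> ingest directly; otherwise -> buffer
}
```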

What happens for the first sign-up, if the identify event arrives within the conversion window of any other event for the same identified person: it gets properly tied to the previous anonymous person, same as in the initial proposal here.

What happens on a second login, backend usage, etc.: we're more likely to tie the people together (i.e. we will if it's within the conversion window).

Downside: the buffer is used more.
Upside: more events attached to the right person, and I'd argue it's simpler.
Pre-req: handling person property set/set_once based on timestamps (though, as discussed extensively, person properties shouldn't change often, so this is really a nice-to-have; the same applies to the original proposal here).

@yakkomajuri
Contributor Author

I think this is a valid proposal.

It solves a problem we've determined doesn't strictly need solving (subsequent anonymous sessions), but I'm also with you in that I'd like to help solve it if possible.

The main issue here is that it would probably significantly increase the size of the buffer, which worries me. Not only does that increase Kafka size, but it increases the number of events that we consume and process twice, which adds a lot of load to an already struggling plugin server.

Given we decided not to solve for this problem, the current buffer is fine, so no rush and we can always change it.

However, the good news is I built a no-op buffer that's live right now with some metrics, so we can see how many events we would have sent to the buffer had it been launched.

Thus, I'll let this run for a bit and eventually implement this proposal so we see the difference in the number of events that would go to the buffer.

That should give us a better sense to make a decision here.

I've also not thought deeply about edge cases here, so would need to do a bit of that before pulling the trigger.

@yakkomajuri
Contributor Author

Note to self:

Test how pausing works when you pause a partition and then pause the whole topic or the entire consumer
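
A minimal sketch of what that test could look like, assuming a kafkajs consumer (broker address, group ID, and topic are placeholders):

```typescript
import { Kafka } from 'kafkajs'

const kafka = new Kafka({ clientId: 'pause-test', brokers: ['localhost:9092'] })
const consumer = kafka.consumer({ groupId: 'pause-test' })

async function main(): Promise<void> {
    await consumer.connect()
    await consumer.subscribe({ topic: 'events_buffer' })
    await consumer.run({
        eachMessage: async ({ topic, partition }) => {
            // Pause only the partition this message came from...
            consumer.pause([{ topic, partitions: [partition] }])
            // ...versus pausing every partition of the topic:
            consumer.pause([{ topic }])
            // kafkajs has no single "pause the whole consumer" call;
            // consumer.stop() / consumer.disconnect() are the closest to compare.
        },
    })
}

main().catch(console.error)
```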

@tiina303
Contributor

tiina303 commented May 23, 2022

We discussed this further with @yakkomajuri; the detailed discussion can be found here: https://docs.google.com/document/d/1ucnwo0QwQNCboDRaBP3PzTP0magEMlluHuTmZLqQ5XY

TLDR:
Person creation for events that go to the buffer will happen when they are processed from the buffer (not when we receive the event, before putting it in the buffer). The benefit of doing person processing later is that we'll have fewer merges, removing an expensive operation and improving performance.

When to write an event to the buffer vs. process it immediately (sketched in code after this list):

person already exists for this distinctId -> clickhouse
$identify / $alias event -> clickhouse
anonymous events -> clickhouse
non-anonymous events without a person -> buffer
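
As a sketch (names and shapes illustrative, not the actual implementation), the updated decision becomes:

```typescript
// Sketch of the updated routing decision. Names and shapes are illustrative.
interface CandidateEvent {
    name: string
    isAnonymous: boolean
}

function shouldSendToBuffer(event: CandidateEvent, personExists: boolean): boolean {
    if (personExists) return false // person already exists -> ClickHouse
    if (event.name === '$identify' || event.name === '$alias') return false // -> ClickHouse
    if (event.isAnonymous) return false // anonymous events -> ClickHouse
    return true // non-anonymous events without a person -> buffer
}
```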

End user perspective for first logins/signup:

  1. If identify is the first event we see for Bob: all events will be tied to the same person ID
  2. Anonymous usage of any length, then identify arrives within the conversion window of Bob's first event: all events will be tied to the same person ID
  3. Long anonymous usage and long Bob usage before we see identify(Bob, anon): all of Bob's events will have the same person ID, and the previous anonymous events will have a different person ID (note: we can lose either the anonymous or the Bob history; losing the anonymous history is better)

Second login: anon events before identify will have a different person ID.
