
Implement buffer for conversion events #9182

Closed
Tracked by #9555
yakkomajuri opened this issue Mar 22, 2022 · 6 comments

Comments

@yakkomajuri
Contributor

yakkomajuri commented Mar 22, 2022

We'll be implementing a buffer using Kafka and the plugin server to ensure we associate events with the right distinct ID around the "identify edge", which we can also refer to as "conversion" events.

The solution will work as follows:

  1. Add a new Kafka topic for the buffer, called e.g. events_buffer, where messages contain event payloads as well as an extra field process_at.
  2. When processing an event, run processEvent and onEvent normally. However, at the very "end" of the ingestEvent task, decide whether to send the event to ClickHouse directly or to the buffer topic. The heuristic (sketched in code after this list) is as follows:
  • anonymous events -> ClickHouse
  • $identify -> ClickHouse
  • non-anonymous events without a person -> buffer
  • non-anonymous events within the conversion window -> buffer
  3. Set up a new consumer on the main thread to consume from the buffer topic and send to a worker task doing the following:
  • Look up the person ID
  • Add the ID to the event
  • Produce the event to the Kafka topic consumed by ClickHouse
  4. The consumer should work as follows (see the sketch after the diagram below):

    1. Pull a message from Kafka
    2. Check process_at. For t = process_at - now: if t > 0, don't commit the offset, finish the execution, stop the consumer, and sleep for t. If t <= 0, ingest the event now
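
Here's a minimal TypeScript sketch of the step-2 routing decision. The names (`shouldSendToBuffer`, `CONVERSION_WINDOW_MS`) and the event/person shapes are illustrative assumptions, not the actual plugin server code:

```typescript
// Sketch of the step-2 routing heuristic. All names and shapes here are
// illustrative assumptions, not the real plugin server implementation.
interface CandidateEvent {
    name: string // e.g. '$identify', '$pageview'
    isAnonymous: boolean // however the pipeline flags anonymous distinct IDs
    person?: { createdAt: Date } // undefined if no person exists for the distinct ID
}

// Hypothetical conversion window; the issue doesn't pin down a value.
const CONVERSION_WINDOW_MS = 60_000

function shouldSendToBuffer(event: CandidateEvent, now: Date): boolean {
    if (event.isAnonymous) return false // anonymous events -> ClickHouse
    if (event.name === '$identify') return false // $identify -> ClickHouse
    if (!event.person) return true // non-anonymous, no person -> buffer
    // Non-anonymous events within the conversion window -> buffer.
    return now.getTime() - event.person.createdAt.getTime() < CONVERSION_WINDOW_MS
}

// At the end of ingestEvent, the producer side could then set process_at, e.g.:
// { ...payload, process_at: Date.now() + CONVERSION_WINDOW_MS }
```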

Won't waste a bunch of time making a graph that's a perfect representation of the world, but this should give a good overview of how this system will work:

[Screenshot: overview diagram of the buffer system, 2022-03-25]
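
And a sketch of the consumer's delay logic from step 4. The helper signatures and message shape are assumptions; the real implementation would hook into the consumer's pause/resume machinery:

```typescript
// Sketch of the buffer consumer's delay logic (step 4 above). The helper
// signatures and the message shape are assumptions, not the real implementation.
interface BufferedMessage {
    processAt: number // epoch ms, from the process_at field on the buffer topic
    payload: unknown
}

async function handleBufferMessage(
    message: BufferedMessage,
    deps: {
        ingest: (payload: unknown) => Promise<void>
        commitOffset: () => Promise<void>
        pauseConsumerFor: (ms: number) => Promise<void> // stop consuming, sleep, resume
    }
): Promise<void> {
    const t = message.processAt - Date.now()
    if (t > 0) {
        // Not due yet: leave the offset uncommitted so the message is
        // redelivered, and pause consumption until it becomes due.
        await deps.pauseConsumerFor(t)
        return
    }
    // Due (or overdue): ingest now, then commit the offset.
    await deps.ingest(message.payload)
    await deps.commitOffset()
}
```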


This issue previously outlined a ClickHouse buffer solution that we've decided against. Click below to see its content.

Old issue content
The `staging_events` table will have the same schema as the events table.

Creating it could be done via a "normal" CH migration as it is a new table.

However, we want to only create the materialized view and Kafka table on one server. This is to ensure consistency when querying from this table to write to writable_events.

For this we will need some assistance from Team Infra (@guidoiaquinti @hazzadous), as we need a way for self-hosted users to also leverage CLICKHOUSE_STABLE_HOST. Effectively, we need a way to connect to one individual ClickHouse server for this.

@neilkakkar
Collaborator

Curious what this will be used for?

@yakkomajuri
Contributor Author

Consider the backend problem I mentioned here:

#9188 (comment)

The idea is to have a staging_events table which will get events in two scenarios:

  1. Events from a non-anonymous user with no person associated with them
  2. Events from a non-anonymous user within a few minutes of the person's created_at

Here events will wait for a cron job that will select from the table and insert into writable_events while doing a join with the person distinct id table, ensuring they have the latest person ID.

This helps us minimize the "backend problem" and ensure events around the "$identify edge" get the correct person_id.
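
For illustration, the cron job described here could have been a periodic query along these lines. This is a sketch only: the query shape and the `clickhouse.query` helper are assumptions, with most columns elided:

```typescript
// Sketch of the (since-rejected) cron job: move events from staging_events
// into writable_events, joining person_distinct_id for the latest person ID.
// Table names follow the issue; everything else is an illustrative assumption.
async function flushStagingEvents(clickhouse: { query: (sql: string) => Promise<void> }): Promise<void> {
    await clickhouse.query(`
        INSERT INTO writable_events
        SELECT e.uuid, e.event, e.properties, /* ...other columns..., */ pdi.person_id
        FROM staging_events AS e
        INNER JOIN person_distinct_id AS pdi ON pdi.distinct_id = e.distinct_id
    `)
}
```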

@yakkomajuri yakkomajuri changed the title from "Add the staging_events buffer table" to "Implement buffer for conversion events" on Mar 25, 2022
@tiina303
Contributor

Sorry I missed this issue and the description, but here's a proposal:

  1. If a person exists, we ingest directly; otherwise, add to the buffer (sketched below)
  2. prepareEvent handles $identify and $alias and creates a person if one doesn't exist - after that we ingest directly, since the person now exists
  3. The conversion window is tied to the event's arrival timestamp at PostHog (not the event's own timestamp - this works better when we're backlogged)
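
In code terms, the decision in (1) collapses to roughly this sketch (illustrative names; contrast with the four-branch heuristic above):

```typescript
// Sketch of the simplified routing: $identify/$alias are handled (and the
// person created) in prepareEvent, so only person existence matters here.
function shouldSendToBuffer(personExists: boolean): boolean {
    return !personExists // person exists -> ingest directly; otherwise -> buffer
}
```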

What happens for the first sign-up, if the identify event arrives within the conversion window of any other event for the same identified person: it gets properly tied to the previous anonymous person, same as in the initial proposal here.

What happens on a second login, backend usage, etc.: we're more likely to tie the people together (i.e. we will if it's within the conversion window).

Downside: the buffer is used more.
Upside: more events attached to the right person, and I'd argue it's simpler.
Pre-req: handling person property set/set_once based on timestamps (though, as discussed extensively, person properties shouldn't change often, so this is really a nice-to-have; the same applies to the original proposal here).

@yakkomajuri
Contributor Author

I think this is a valid proposal.

It solves a problem we've determined doesn't strictly need solving (subsequent anonymous sessions), but I'm also with you in that I'd like to help solve it if possible.

The main issue here is that it would probably significantly increase the size of the buffer, which worries me. Not only does that increase Kafka size, but it increases the number of events that we consume and process twice, which adds a lot of load to an already struggling plugin server.

Given we decided not to solve for this problem, the current buffer is fine, so no rush and we can always change it.

However, the good news is I built a no-op buffer that's live right now with some metrics, so we can see how many events we would have sent to the buffer had it been launched.

Thus, I'll let this run for a bit and eventually implement this proposal so we see the difference in the number of events that would go to the buffer.

That should give us a better sense to make a decision here.

I've also not thought deeply about edge cases here, so would need to do a bit of that before pulling the trigger.

@yakkomajuri
Contributor Author

Note to self:

Test how pausing works when you pause a partition and then pause the whole topic or the entire consumer
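
A minimal sketch of what that test could look like, assuming a kafkajs consumer (broker address, group ID, and topic are placeholders):

```typescript
import { Kafka } from 'kafkajs'

const kafka = new Kafka({ clientId: 'pause-test', brokers: ['localhost:9092'] })
const consumer = kafka.consumer({ groupId: 'pause-test' })

async function main(): Promise<void> {
    await consumer.connect()
    await consumer.subscribe({ topic: 'events_buffer' })
    await consumer.run({
        eachMessage: async ({ topic, partition }) => {
            // Pause only the partition this message came from...
            consumer.pause([{ topic, partitions: [partition] }])
            // ...versus pausing every partition of the topic:
            consumer.pause([{ topic }])
            // kafkajs has no single "pause the whole consumer" call;
            // consumer.stop() / consumer.disconnect() are the closest to compare.
        },
    })
}

main().catch(console.error)
```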

@tiina303
Contributor

tiina303 commented May 23, 2022

We discussed this further with @yakkomajuri; the detailed discussion can be found here: https://docs.google.com/document/d/1ucnwo0QwQNCboDRaBP3PzTP0magEMlluHuTmZLqQ5XY

TLDR:
Person creation for events that go to the buffer will happen when they are processed from the buffer (not when we receive the event, before putting it in the buffer). The benefit of doing person processing later is that we'll have fewer merges, removing an expensive operation and improving performance.

When to write an event to the buffer vs. process it immediately (sketched in code after this list):

person already exists for this distinctId -> clickhouse
$identify / $alias event -> clickhouse
anonymous events -> clickhouse
non-anonymous events without a person -> buffer
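
As a sketch (names and shapes illustrative, not the actual implementation), the updated decision becomes:

```typescript
// Sketch of the updated routing decision. Names and shapes are illustrative.
interface CandidateEvent {
    name: string
    isAnonymous: boolean
}

function shouldSendToBuffer(event: CandidateEvent, personExists: boolean): boolean {
    if (personExists) return false // person already exists -> ClickHouse
    if (event.name === '$identify' || event.name === '$alias') return false // -> ClickHouse
    if (event.isAnonymous) return false // anonymous events -> ClickHouse
    return true // non-anonymous events without a person -> buffer
}
```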

End user perspective for first logins/signup:

  1. If identify is the first event we see for Bob: all events will be tied to the same person ID
  2. Anonymous usage of any length, then identify arrives within the conversion window of Bob's first event: all events will be tied to the same person ID
  3. Long anonymous usage and long Bob usage before we see identify(Bob, anon): all of Bob's events will have the same person ID, and the previous anonymous events will have a different person ID (note: we can lose either the anonymous or the Bob history; losing the anonymous history is better)

Second login: anon events before identify will have a different person ID.
