This repository has been archived by the owner on Mar 7, 2024. It is now read-only.

Add bq_ingested_timestamp #9

Merged · 9 commits merged into main on Jun 15, 2021

Conversation

@mariusandra (Collaborator)

[screenshot]

@mariusandra (Collaborator, Author)

Locally as well it's kind of stuck:
[screenshot]

To explore further another day.

@neilkakkar (Contributor)

This is problematic, as it also leads to duplicates in BigQuery (implying the timeout perhaps doesn't happen with the BigQuery calls, and the retries thus insert things into BigQuery more than once).

Here's an example:

```sql
SELECT uuid, timestamp, bq_ingested_timestamp
FROM `posthog-301601.posthog_export_test.events`
WHERE uuid IN (
    '0179d742-b93e-0000-58fd-a0e7bd4f5e81',
    '0179d742-f416-0000-01b8-e1ceb2850e65',
    '0179d744-2bb8-0000-b363-69c3d883757a'
)
ORDER BY timestamp
LIMIT 1000
```

in: https://console.cloud.google.com/bigquery?orgonly=true&project=posthog-301601&supportedpurview=project&ws=!1m9!1m3!3m2!1sposthog-301601!2sposthog_export_test!1m4!4m3!1sposthog-301601!2sposthog_export_test!3sevents!1m5!1m4!1m3!1sposthog-301601!2sbquxjob_2e7771cf_179d7490989!3sUS&page=queryresults&j=bq:US:bquxjob_2e7771cf_179d7490989

@neilkakkar marked this pull request as ready for review on June 7, 2021, 13:07
@neilkakkar (Contributor)

So, the problem here is entirely with BigQuery: it's the specific upload (POST to /insertAll) that gets excruciatingly slow, on my local dev too (and the library retrying by itself doesn't help).

I'm unsure what to do about this, except make minor tweaks with the options to ensure things return quickly.

That's what I've done with this PR: disabled retrying by the BigQuery library (so only our retry queue retries) and made the dedupe option explicit (the library was inefficiently deduping earlier, generating a new uuid for every retry, which was plain wrong), which enables increasing throughput / quota limits if need be.

FWIW, with these tweaks there are no problems on cloud (it's been running for a few hours now, and not a single batch took more than 2 seconds, while on local, batches still take up to 60 seconds, shrug). It's not a question of how many events are sent, nor are we exceeding any quotas.
(See https://app.posthog.com/project/plugins -> BigQuery Debug for an example.) I will remove it from cloud once the PR gets approved.

Finally, there are intermittent problems on upgrade: since the schema is changing, the change requires time to trickle down on BigQuery, which means that in some cases users would have to disable/enable the plugin again to continue ingesting.
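
For illustration, here is a rough sketch of what the insert call might look like with these tweaks. This is not a quote from the PR diff; it assumes the `raw` and `partialRetries` options of `Table#insert` in `@google-cloud/bigquery`, and the surrounding names are invented:

```ts
import { Table } from '@google-cloud/bigquery'

// Row shape used with { raw: true }: the library forwards these objects
// verbatim to the insertAll API instead of generating a random insertId.
interface ExportRow {
    json: Record<string, any>
    insertId?: string // only present when deduplication is enabled
}

async function insertBatch(table: Table, rows: ExportRow[]): Promise<void> {
    await table.insert(rows, {
        raw: true,         // don't let the library invent a new insertId per attempt
        partialRetries: 0, // don't retry inside the library; our retry queue handles it
    })
}
```

With `raw: true`, any `insertId` we attach survives retries instead of being regenerated, which is what makes deduplication meaningful in the first place.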

@neilkakkar (Contributor) commented Jun 7, 2021

cc @mariusandra for review please :)


Another thing (for a separate companion PR - PostHog/plugin-server#465 ) I'm doing now is making it easier to debug timeout errors like these in plugin server, by telling what exactly timed out.

@yakkomajuri (Contributor) left a comment:

A few minor points but otherwise looks fine - didn't run it though

index.ts (outdated):
```ts
    }
}

if (global.deduplicateEvents) {
```
Contributor:

given we've disabled BQ retrying, is this used for anything?

Contributor:

We've disabled retrying via the bigquery library, not via the plugin-server, so yes, this does work as long as the retry doesn't happen more than a few minutes later.

Contributor:

I meant why do we need to dedupe?

Contributor:

oh, because exceeding timeouts via the async guard doesn't stop the request from succeeding (most times, depending on the case), so we're making two requests with the same data to bigquery, which leads to duplicates.

This helps resolve some of them. (https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency)

Contributor:

I see now, makes sense!

@neilkakkar (Contributor) left a comment:

LGTM 😂

plugin.json (outdated):

```diff
@@ -49,9 +49,10 @@
     {
         "key": "deduplicateEvents",
         "name": "Attempt deduplicating events",
```
Contributor:

Given we have yes and no, it might make sense to word this as a question?

Contributor:

You mean something like: "Attempt deduplicating events?"

Contributor:

Yeah, for example

@mariusandra (Collaborator, Author)

Regarding "deduplicateEvents", is this something we need to expose to the user? I'd assume my export to just work (TM), not that I need to know low level details like this.

Reading the question, as a plugin user I'm still not sure what I need to choose here:

"Attempt deduplicating events?" --> "Default 'Yes'. We try to deduplicate events being sent to Bigquery if set to 'Yes'. Set to 'No' to increase Bigquery insertion quota limit"

What quota limit?

Can't we just choose the best default for the user?

@neilkakkar (Contributor)

So, I wanted to expose this because the amount of data we can push into BigQuery depends on this parameter. We choose a default that works for most users, and the ones that have A LOT of data can switch this toggle.

Perhaps I can change the name to something like "high event volume?" but that feels pretty misleading.

@mariusandra (Collaborator, Author)

How does the amount of data we can push depend on this parameter? What is A LOT? 😀

(See Amazon writing styles tip #2 here)

@mariusandra (Collaborator, Author) commented Jun 9, 2021

That said, I still think the inclusion of this option as it is makes it feel as if the entire export is flimsy. ("What do you mean I now have to monitor my export logs for quota errors? Where do I see them? What should I choose?" etc)

Can't we make some safe assumptions that will work for everyone? Or perhaps just relax/clarify the guarantees we provide (exported "at least once") and push the burden forward?

Anything that makes users think is bad usability.

@neilkakkar (Contributor)

Given that:
(1) A higher quota limit is better than a lower one.
(2) Deduplication is best effort and doesn't work all the time.
(3) BigQuery has docs on how to remove duplicates: https://cloud.google.com/bigquery/streaming-data-into-bigquery#manually_removing_duplicates

I like the idea of defaulting to no deduplication on our side, and passing it all onto the user.

On the flip side, while this makes the user think less at the beginning, it may easily lead to them seeing duplicates - and thinking the export isn't working properly.


On the whole, the number of users who might see duplicates is a small fraction of the total, so, yep, what you say makes sense - let's not make issues of a minority a global concern.

I'll add a link to the deduplication docs in the README, but remove it as an option from the plugin. Thanks for pushing back on this :)

@mariusandra (Collaborator, Author)

Thanks!

Yet if I read that line in the README, the first questions are: Hmm... duplicate events? What's that? Is this a problem? How often do they happen? Should I set up alerts in BigQuery to monitor for this? Can I trust this export?

Can we add something to ease the mind of users? Something like "There's a very rare case that could lead to duplicates, when due to a network error, we see the export as failed, yet it actually reached BigQuery. We will then try the export again, but this may lead to duplicates in BQ"

And again, can we actually quantify how big of a problem this may be? The way it's set now, I find it hard to trust the plugin. :)

@neilkakkar (Contributor)

I think it's hard to quantify how big of a problem this may be. The BigQuery SDK also faces the same issue, so I'm not sure how to alleviate concerns here.

@mariusandra (Collaborator, Author) left a comment:

The new line in the readme is already a lot more reassuring. Just one more nit for the code, but otherwise looks good.

I'll still give it a quick test as well to make sure I'm not seeing the issues I saw earlier.

```diff
     },
 }
 
 export const onEvent: BigQueryPlugin['onEvent'] = (event, { global }) => {
     if (!global.exportEventsToIgnore.has(event.event)) {
-        global.exportEventsBuffer.add(event)
+        global.exportEventsBuffer.add(event, JSON.stringify(event).length)
```
Collaborator (Author):

💯
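
For context, a sketch of how that buffer is presumably wired up, assuming the `createBuffer({ limit, timeoutSeconds, onFlush })` helper from `@posthog/plugin-contrib`; the config field name matches the `exportEventsBufferSeconds` key mentioned later in this thread, while the byte limit and the export function are hypothetical:

```ts
import { createBuffer } from '@posthog/plugin-contrib'

// Hypothetical export function; the real one lives in index.ts.
declare function exportEventsToBigQuery(batch: any[], meta: { global: any }): Promise<void>

export function setupBuffer(global: any, config: Record<string, string>): void {
    global.exportEventsBuffer = createBuffer({
        // With add(event, JSON.stringify(event).length), `limit` acts as a byte
        // budget rather than an event count (1 MB here is an illustrative value).
        limit: 1024 * 1024,
        timeoutSeconds: parseInt(config.exportEventsBufferSeconds) || 10,
        onFlush: async (batch) => {
            await exportEventsToBigQuery(batch, { global })
        },
    })
}
```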

```
ip,
site_url,
timestamp: timestamp ? global.bigQueryClient.timestamp(timestamp) : null,
const object: {json: Record<string, any>, insertId?: string} = {
```
Collaborator (Author):

Can we simplify this object if not using insertId?

Contributor:

I could remove the insertId from the type, but other than that, nope, can't do, because the sdk by default inserts a random insertId unless I'm using the raw version, which requires the object to look like this^
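
To make that concrete, a minimal sketch of how the row might be built, assuming the event uuid is reused as the `insertId` (the `{ json, insertId? }` shape is what the insertAll API expects when the library's `raw` option is used; the field names inside `json` are illustrative):

```ts
function buildRow(
    uuid: string,
    eventName: string,
    timestamp: string | null,
    global: { bigQueryClient: any; deduplicateEvents: boolean }
): { json: Record<string, any>; insertId?: string } {
    // Raw row: the SDK forwards this object verbatim instead of generating
    // a random insertId for it.
    const object: { json: Record<string, any>; insertId?: string } = {
        json: {
            uuid,
            event: eventName,
            timestamp: timestamp ? global.bigQueryClient.timestamp(timestamp) : null,
            bq_ingested_timestamp: global.bigQueryClient.timestamp(new Date()),
        },
    }
    if (global.deduplicateEvents) {
        // Assumption: reuse the event uuid as the dedupe key, so a retried insert
        // of the same event can be dropped by BigQuery (best effort).
        object.insertId = uuid
    }
    return object
}
```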

@neilkakkar (Contributor)

It probably wouldn't fix all issues on your local: #9 (comment) - but just make them less frequent.

@mariusandra (Collaborator, Author)

I don't know why, but this plugin is still really unstable for me locally:

"Adding event" is printed in onEvent and Starting export in exportEventsToBigQuery. It takes more than 10sec to flush for some unknown reason, and then when it starts, it never finishes. Except sometimes.

[screenshot]

There's one big elephant in the room here: the createBuffer function in plugin-contrib is not tested at all. It's possible that it may also have some bugs inside? :)

@neilkakkar (Contributor) commented Jun 11, 2021

Hmm, where does it "take more than 10s to flush"? (I'm having a hard time understanding from the screenshot. Is it 12:07 -> 12:10?)

I thought there might be something wonky with createBuffer too, but I tested it out and it works fine, and it works fine on cloud too!

Regarding never finishing 👀 - was it because of a TimeoutError, or something else?


What I did find out when I raised the timeout limit (to sanity-check everything) was that it's the BigQuery client call that takes > 30s at times on local (I didn't dive deeper into why it's a local-specific problem).

@mariusandra (Collaborator, Author)

I've configured it to upload every 10sec, yet minutes pass before any upload starts, and then more minutes before I get a confirmation that the export was a success.

However sometimes it happens instantly.

The dev console logs are again full of timeouts of every calibre:

[screenshots]

@neilkakkar (Contributor)

Hmm, I'll look into this specific part, see if I run into the same thing. But since this issue is unrelated to this specific PR, can we still merge this one?

(i.e., if you run the existing plugin, you'll face these same issues on local dev, too)

@mariusandra (Collaborator, Author)

I'll test it one more time when using ClickHouse as the database. Until now I just tested with Postgres. Perhaps this will avoid all the problems with logging (logs go through Kafka, not straight to Postgres) and everything will just work... 🤔

@mariusandra (Collaborator, Author)

Switching to EE mode (no logs in postgres), those errors disappeared, but it's still a really flaky experience locally :'(

[screenshot]

A few things I just don't understand.

1. The buffer is set to flush every 10 sec, yet it takes minutes before it even starts doing anything ("Starting export" is logged at the beginning of exportEventsToBigQuery and "Adding event" in onEvent). Checking further, somehow no matter what I saved in the config, it would not update the actual values. The config key for seconds changed from uploadSeconds to exportEventsBufferSeconds (needed for exportEvents to work, preparing for the future here), but the config schema was just not updated. Reinstalling the plugin fixed this.

2. Now I stopped getting logs altogether in the interface (not sure why), and only see this in the console output:

[screenshots]

If this system is acting weird and irrationally on my Intel MacBook, it'll probably also act weird and irrationally for someone else, possibly in production. Now it just segfaulted for me as well (screenshots above).

Thus I suggest finding out what is actually happening and fixing it as priorities number 0, 1 and 2 for the next sprint... and for this one. :)


However, if it makes anything better, I'm experiencing similar problems with the latest main BigQuery plugin, so I doubt this PR is causing the issues. I even took versions of the plugin server from 1 and 2 months ago (0.21.6 and 0.19.0) and the problems were there.

At the end of the day, the retrying logic is also working, and eventually things are retried and sent... unless it crashes totally, which has also happened. I just don't know :D

@neilkakkar (Contributor)

Okay, wow, did not expect to see a segfault here.

So, to summarize: (1) isn't an issue anymore. (2) is maybe something to do with switching to EE? Definitely need to investigate this. And (3) BigQuery being flaky locally... is still a problem, and possibly one every self-hosted version would face? (Need to try a self-hosted deployment and test this out too.)

Agree on investigating bigquery slowness (3) and (2), both. (but I still think neither blocks this PR 😅 )

@mariusandra (Collaborator, Author)

Yeah, I think it's fine to merge this... but please test test test and keep an eye on it :)

@neilkakkar merged commit 902becb into main on Jun 15, 2021
@neilkakkar deleted the bq-timestamp branch on June 15, 2021, 09:22
Successfully merging this pull request may close these issues.

worth adding a bq_timestamp?
3 participants