Conversation

Contributor

@macobo macobo commented Jul 26, 2022

Problem

Kafka.produce calls are failing due to message batches exceeding Kafka's maximum message size.

Changes

  • Remove `Buffer.from` usage from the plugin-server produce path - this allows for more accurate message size estimation
  • Improve the buffer size estimation itself
  • Fix: immediately flush if enqueueing a too-large message. This way unrelated messages don't end up in the DLQ and Sentry reports have proper context (see the sketch after this list).
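
A minimal sketch of the flush-ordering fix, assuming a hypothetical wrapper class; the field and method names (`currentBatch`, `currentBatchSize`, `maxBatchSize`, and the limits) are illustrative rather than the plugin-server's exact API:

```ts
type ProducedMessage = { key?: string; value: string }

class KafkaProducerWrapperSketch {
    private currentBatch: ProducedMessage[] = []
    private currentBatchSize = 0
    private lastFlushTime = Date.now()

    // Illustrative limits, not PostHog's actual configuration.
    private maxBatchSize = 900_000
    private maxQueueSize = 1000
    private flushFrequencyMs = 500

    queueMessage(message: ProducedMessage): void {
        // Append first, then flush immediately if the batch is now too large,
        // so one oversized message cannot poison the next caller's flush.
        this.currentBatch.push(message)
        this.currentBatchSize +=
            Buffer.byteLength(message.value) + (message.key ? Buffer.byteLength(message.key) : 0)

        const timeSinceLastFlush = Date.now() - this.lastFlushTime
        if (
            this.currentBatchSize > this.maxBatchSize ||
            timeSinceLastFlush > this.flushFrequencyMs ||
            this.currentBatch.length >= this.maxQueueSize
        ) {
            void this.flush()
        }
    }

    async flush(): Promise<void> {
        const batch = this.currentBatch
        this.currentBatch = []
        this.currentBatchSize = 0
        this.lastFlushTime = Date.now()
        // ... produce `batch` to Kafka here ...
    }
}
```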

How did you test this code?

Tests, plus manually checking that things work.

macobo added 5 commits July 26, 2022 12:29
We're failing to send batches of messages to Kafka on a semi-regular
basis due to message sizes. It's unclear why this is the case, as we try
to limit the size of each message batch.

This PR adds information on these failed batches to Sentry error
messages.

Example error: https://sentry.io/organizations/posthog2/issues/3291755686/?project=6423401&query=is%3Aunresolved+level%3Aerror
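
A hedged sketch of what attaching batch information to the Sentry report could look like, using @sentry/node's capture context; the `flushWithContext` helper and its parameters are hypothetical:

```ts
import * as Sentry from '@sentry/node'
import type { Producer, TopicMessages } from 'kafkajs'

// Hypothetical flush helper: on failure, attach batch details to the Sentry
// report so 'message too large' errors carry context about what was queued.
async function flushWithContext(producer: Producer, topicMessages: TopicMessages[]): Promise<void> {
    try {
        await producer.sendBatch({ topicMessages })
    } catch (error) {
        Sentry.captureException(error, {
            extra: {
                topicCount: topicMessages.length,
                messageCounts: topicMessages.map((t) => t.messages.length),
            },
        })
        throw error
    }
}
```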
This allows us to be much more accurate when estimating message sizes,
hopefully eliminating a class of errors.
@macobo macobo requested a review from tiina303 July 26, 2022 12:02
@macobo macobo changed the title from "WIP: More accurate produce" to "chore(plugin-server): Improve kafka producer wrapper" Jul 26, 2022
@macobo macobo marked this pull request as ready for review July 26, 2022 12:02
macobo added a commit that referenced this pull request Jul 26, 2022
This helps avoid 'message too large' type errors (see
#10968) by compressing in-flight
messages.

I would have preferred to use zstd, but the libraries did not compile
cleanly on my machine.

  expect(producer.currentBatch.length).toEqual(1)
- expect(producer.currentBatchSize).toEqual(40)
+ expect(producer.currentBatchSize).toBeGreaterThan(40)
Contributor

Why not check for exactly what it's now supposed to be?

Contributor Author

Because the result is kind of arbitrary (73), since the estimate now accounts for the sizes of the keys and so on.

This gets the intent across better.
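
For context, a sketch of the kind of estimate being discussed - key and value bytes both counted - with a hypothetical `estimateMessageSize` helper rather than the literal plugin-server function:

```ts
// Hypothetical estimator: count both value and key bytes, so the estimate
// tracks what Kafka actually counts. Key sizes (among other overhead) are
// why the test's total lands at an unround number like 73 instead of 40.
function estimateMessageSize(message: { key?: string; value: string }): number {
    return (
        Buffer.byteLength(message.value, 'utf8') +
        (message.key ? Buffer.byteLength(message.key, 'utf8') : 0)
    )
}
```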

Contributor

Yup, but we might mess up that function in the future so that it returns something way too big.

  const timeSinceLastFlush = Date.now() - this.lastFlushTime
- if (timeSinceLastFlush > this.flushFrequencyMs || this.currentBatch.length >= this.maxQueueSize) {
+ if (
+     this.currentBatchSize > this.maxBatchSize ||
Contributor

> Fix: immediately flush if enqueueing a too-large message. This way unrelated messages don't end up in the DLQ and Sentry reports have proper context.

Does this help here because `this.flush()` threw before (line 55), making us miss the next message if it happened the next time we called `queueMessage`?

Contributor Author

No. Previously, if the batch was empty and a single too-large message came in, we would call `flush()` before appending, and then append it. When the next message came along, that flush would fail, since the message already in the queue was too large.

Now we append and flush immediately. Check the newly added test (sketched below) - it would have failed before.
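
Roughly the shape of the new test - a sketch using Jest and the hypothetical names from the sketch above; `producer` and `tooLargeMessage` are assumed to come from the test's setup:

```ts
it('flushes immediately when a single enqueued message is too large', () => {
    const flushSpy = jest.spyOn(producer, 'flush')

    // Batch starts empty; enqueue one message larger than maxBatchSize.
    producer.queueMessage(tooLargeMessage)

    // New behavior: the oversized message is appended and flushed right away,
    // instead of lingering in the queue and failing the *next* flush.
    expect(flushSpy).toHaveBeenCalled()
})
```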

Contributor

OK, so that's what I thought - clearly I didn't word the question well. Thanks for confirming.

@macobo macobo enabled auto-merge (squash) July 27, 2022 11:25
@macobo macobo merged commit d00d587 into master Jul 27, 2022
@macobo macobo deleted the more-accurate-produce branch July 27, 2022 11:26
macobo added a commit that referenced this pull request Jul 28, 2022
feat(plugin-server): Use Snappy compression codec for kafka production (#10974)

* feat(plugin-server): Use Snappy compression codec for kafka production

This helps avoid 'message too large' type errors (see
#10968) by compressing in-flight
messages.

I would have preferred to use zstd, but the libraries did not compile
cleanly on my machine.

* Update tests
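
For reference, wiring Snappy into kafkajs looks roughly like this; `kafkajs-snappy` is an assumption about which codec package was used, and the broker address and topic name are illustrative:

```ts
import { Kafka, CompressionTypes, CompressionCodecs } from 'kafkajs'
// Assumption: the community kafkajs-snappy package provides the codec.
const SnappyCodec = require('kafkajs-snappy')

// Register the codec once at startup so producers can compress with Snappy.
CompressionCodecs[CompressionTypes.Snappy] = SnappyCodec

async function produceCompressed(): Promise<void> {
    const producer = new Kafka({ brokers: ['localhost:9092'] }).producer()
    await producer.connect()
    // Compressing in-flight batches keeps them below broker size limits.
    await producer.send({
        topic: 'events',
        compression: CompressionTypes.Snappy,
        messages: [{ value: 'hello' }],
    })
    await producer.disconnect()
}
```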