
add failing test for edge case max_batch_size=max_entries #11378

Closed · wants to merge 4 commits

Conversation

@JensErat (Contributor) commented Aug 9, 2023

When a configuration hits the very special edge case of `max_batch_size` = `max_entries`, the queue can fail with an assertion error when removing the frontmost element. This happens especially when the callback fails repeatedly (e.g. an unavailable backend system receiving the data).

What happens:

  1. we add `max_batch_size` elements, all of which "post" semaphore resources
  2. the batch queue consumes all of those resources in `process_once` by `wait()`ing for them, but gets stuck processing/sending the batch
  3. as `process_once` is stuck until `max_retry_time` has passed, the function does not run `delete_frontmost_entry()`, which is what actually moves the `front` reference
  4. when enqueuing the next item, the queue tries to drop the oldest entry, but triggers the assertion in queue.lua because no resources are left (see the sketch after this list)
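
To make step 4 concrete, here is a minimal, hypothetical sketch of the failing path, assuming a queue that posts one `ngx.semaphore` resource per entry; `drop_oldest_entry` and `self.semaphore` are illustrative names, not the verbatim queue.lua source:

```lua
-- Hypothetical sketch, not the exact Kong code: assumes one semaphore
-- resource is "posted" per queued entry.
function Queue:drop_oldest_entry()
  -- In the edge case, process_once() has already wait()ed away all
  -- max_batch_size (= max_entries) resources but is stuck retrying the
  -- handler, so self.front has not moved and this assertion fires.
  assert(self.semaphore:count() > 0, "no semaphore resources left")
  self:delete_frontmost_entry()
end
```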

Kudos to @27ascii for discovering the edge case configuration.

Issue reference

Test case for #11377

Jens Erat <jens.erat@mercedes-benz.com>, Mercedes-Benz Tech Innovation GmbH

@JensErat (Contributor, Author) commented Aug 9, 2023

I think I found a reasonable bugfix (already pushed, tests are fine again):

potential fix for race condition

This commit might fix #11377 by taking the currently processed elements
out of the race condition window.

Two tests needed changes:

  1. "giving up sending after retrying" needed another (otherwise ignored)
    value, so that we can wait long enough in `wait_until_queue_done`
    (there might be a more elegant solution here)
  2. the new test required reactivating the handler to succeed, to finally
    clear the queue (see the sketch after this list)
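
For illustration, a hypothetical busted-style sketch of the new test's shape; `Queue.new`, `q:add`, and the handler signature are stand-ins, not Kong's actual queue API:

```lua
-- Hypothetical sketch only; Kong's real queue API differs.
local handler_should_fail = true

local q = Queue.new({
  name           = "edge-case",
  max_batch_size = 2,
  max_entries    = 2,  -- the edge case: equal to max_batch_size
  max_retry_time = 1,
}, function(_, batch)
  -- fail while "the backend is down", succeed after reactivation
  return not handler_should_fail
end)

q:add("one")
q:add("two")    -- the handler now gets stuck retrying this batch
q:add("three")  -- this enqueue used to trip the assertion in queue.lua

handler_should_fail = false  -- reactivate the handler ...
wait_until_queue_done(q)     -- ... so the queue can finally drain
```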

Why do I think this works?

  • immediately after the last call to `semaphore:wait()`, we'll start
    actually removing items from `entries` (sketched below)
  • the code cannot be interrupted by other light threads before we
    actually start the handler

These assumptions strongly need verification by some Lua experts!
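
A short sketch of that idea, reusing the batch-extraction line visible in the diff below; the deletion loop illustrates the described approach and is not the verbatim patch:

```lua
-- Sketch of the idea, not the verbatim patch: take the batch off the
-- queue immediately after the last semaphore:wait(), before any yield
-- point, so no other light thread can run in between.
local batch = {unpack(self.entries, self.front, self.front + entry_count - 1)}
for _ = 1, entry_count do
  self:delete_frontmost_entry()  -- moves self.front and frees the slot
end
-- Only now invoke the (potentially slow, retrying) handler on the batch.
```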

@samugi linked an issue Aug 11, 2023 that may be closed by this pull request
@hanshuebner (Contributor) left a comment

Nice catch! Can you address the one comment that I have put in and also add a CHANGELOG entry? Thank you!

```diff
@@ -251,11 +251,21 @@ function Queue:process_once()
     end
   end

   local batch = {unpack(self.entries, self.front, self.front + entry_count - 1)}
   -- Guard against queue shrinkage during handler invocation by using math.min below.
```
@hanshuebner (Contributor) commented on this line:
This is no longer needed. The comment kind of pointed at the bug that you've discovered without me actually realizing that it was a bug. With your change, we'll actually always be able to remove `entry_count` entries.
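
For context, a sketch of what dropping the guard could look like; the loop shape and the `self:count()` helper are reconstructed from the hunk and this comment, not copied from the actual patch:

```lua
-- Before (sketch): guard against the queue shrinking while the handler ran.
for _ = 1, math.min(entry_count, self:count()) do
  self:delete_frontmost_entry()
end

-- After (sketch): the batch is taken off the queue before the handler
-- runs, so entry_count entries are always present and the guard can go.
for _ = 1, entry_count do
  self:delete_frontmost_entry()
end
```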

@JensErat (Contributor, Author) commented
> Nice catch! Can you address the one comment that I have put in and also add a CHANGELOG entry? Thank you!

Removed the `math.min` statement and added the changelog entry. I bundled both queue changes on subsequent lines; I guess the changelog order is not relevant otherwise.

@hanshuebner (Contributor) commented
I've moved these changes to #11431 so that I can deal with the CI issues that we currently have.

Linked issue: New queue fails in edge case `max_batch_size=max_entries` with assertion error (#11377)