Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]Archiver throws exceptions when used with a Dynamo DB outbox #3108

Closed
dhickie opened this issue May 22, 2024 · 17 comments · Fixed by #3180
Closed

[Bug]Archiver throws exceptions when used with a Dynamo DB outbox #3108

dhickie opened this issue May 22, 2024 · 17 comments · Fixed by #3180
Assignees
Labels
Bug v9 Required for v9 release

Comments

@dhickie
Copy link
Contributor

dhickie commented May 22, 2024

Describe the bug

The outbox Archiver uses the DispatchedMessagesAsync method to get a list of messages which have successfully been dispatched and are safe to archive. There are two methods, one of which immediately throws a NotImplementedException in v9.8.0. This is the method that's used by the outbox archiver.

In the latest v10 code, this function has an implementation, but passes null as the value to the args parameter to the overloaded version. This would result in an immediate ArgumentException.

To Reproduce

Attempt to use an OutboxArchiver with a DynamoDB outbox.

Exceptions (if any)

NotImplementedException and ArgumentException

Further technical details

  • Brighter version: 9.8.0
@iancooper iancooper added Bug v9 Required for v9 release labels May 22, 2024
@iancooper iancooper self-assigned this May 22, 2024
@iancooper
Copy link
Member

@dhickie I'll take a look

@iancooper iancooper changed the title Archiver throws exceptions when used with a Dynamo DB outbox [Bug]Archiver throws exceptions when used with a Dynamo DB outbox May 22, 2024
@dhickie
Copy link
Contributor Author

dhickie commented May 29, 2024

@iancooper I could take a look at this if you're busy elsewhere? I'd just need a bit of context around the intent behind requiring a single topic name when fetching the list of dispatched messages, and whether we could just fetch the list of dispatched messages for all topics instead.

@iancooper
Copy link
Member

Happy of you want to pick this up @dhickie I can't recall the reasoning for the archiver needing to work by topic. We may be using topic for the outbox to batch up read and produce operations. I am looking at that batching (a little by accident from the OTel work) and writing an ADR about changes we might want.

@iancooper
Copy link
Member

@preardon You are closer to this than I am. Any thoughts for @dhickie ?

@dhickie
Copy link
Contributor Author

dhickie commented May 30, 2024

It looks like this is something specific to the Dynamo DB outbox implementation. I can see why it's there - the GSI for dispatched messages uses the topic name as the hash key and the dispatch date as the range key. That means that when it's fetching a list of messages which can be archived, it can do a Query operation and use a key expression to only return messages that are past the threshold that makes them eligible for archival.

The important thing here is that key expressions are evaluated server side, so only the data fulfilling the expression is sent over the wire. This, of course, requires that you know the topic name in the first place. As the topic name is the hash key, without this information you would need to do a Scan operation on the index and apply a filter expression instead. Filter expressions, however, are applied client side after the data has been returned. For a busy outbox, this could be a huge amount of data that has to be filtered in memory.

There are nonetheless a number of issues with the current Dynamo outbox implementation:

  • It is completely incompatible with the current archiver design and simply throws exceptions instead of archiving
  • Anyone using the Dynamo DB outbox manually from client code has to notice that they need to specify an optional args parameter, and populate it with a valid Topic key or it will throw an exception - this isn't very user friendly
  • It ignores the batch size parameter passed to the ArchiveAsync method and instead treats it as a page size parameter per query, but continues to page through all results before returning a list of messages. If, for whatever reason, the archiver hasn't run in a while, this could still be a large amount of data and has no upper bound. This could potentially be because pagination of scans and queries in Dynamo DB is incompatible with specifying a pageNumber parameter - pagination in DynamoDB works by returning a lastEvaluatedKey value along with the query/scan result which can be passed with the next query/scan for the next page, so one can't simply ask for "page 10". The current way of working around this is to simply return everything in one "page".

My proposal would be as follows:

In v10

  • Keep the logic allowing callers to specify a particular topic name, but enforce the provided batch size. If the query returns a result indicating that there are further pages available, only return the first page but keep a dictionary of the topic name to an object specifying:
    • The number of the next page that can be retrieved, nextPage
    • The lastEvaluatedKey to use when retrieving the next page
  • If a page number is provided which matches the next page number in the dictionary for that topic, then use the lastEvaluatedKey value to fetch the next page and set the next page key in the dictionary
  • If a page number is provided which doesn't match the next page number in the dictionary for that topic (and isn't page 1), then throw an exception
  • Whenever a message is added to the outbox, add the name of the topic as a key in a ConcurrentDictionary<string,byte> called topicNames (unfortunately .NET doesn't have a built in thread safe version of HashSet)
  • If no topic name is provided in the args parameter, then iterate through the keys in topicNames, performing a query for each:
    • If the page number is 1, then take a copy of the keys from topicNames and begin to iterate through them with queries until the batch size has been reached.
    • In a member variable called allTopicsQueryContext, store an object containing:
      • The number of the next page available, nextPage
      • The lastEvaluatedKey to use when retrieving the next page
      • The topic names that have not yet been fully queried, in the order they should be queried in, outstandingTopics
    • If a page number that matches the value in allTopicsQueryContext is provided, then continue to iterate through the topics as contained in outstandingTopics up to the batch size provided, updating allTopicsQueryContext if there are further pages available
    • If a page number that does not match the value of nextPage in allTopicsQueryContext (and isn't the first page) then throw an exception
    • If the final page is retrieved (no lastEvaluatedKey is returned and there are no remaining outstandingTopics), then set allTopicsQueryContext to null

This isn't the cleanest of solutions, but it's the best I can think of that:

  • Works with the way Dynamo deals with paging while adhering as closely as possible to the outbox interface definition for paging
  • Places an client customisable upper bound on the number of messages returned by DispatchedMessagesAsync to avoid potential memory issues
  • Fixes the issue of the outbox archiver not having a list of topic names it should be archiving
  • Provides thread safety as new topics could be published to while someone is paging through dispatched messages in the archiver or elsewhere, while guaranteeing consistent pagination

In v9

  • The same as v10, but without the changes to pagination for when providing a topic name in the args parameter, as this would be a breaking change to existing behaviour

I'm keen to get your thoughts @iancooper @preardon, this isn't as straightforward a problem as I'd originally hoped thanks to the quirks of Dynamo DB but is an urgent issue for us as it prevents us from adopting Brighter for our outbox implementation.

@iancooper
Copy link
Member

It looks like this is something specific to the Dynamo DB outbox implementation. I can see why it's there - the GSI for dispatched messages uses the topic name as the hash key and the dispatch date as the range key.

Ah yes, I think I wrote that. Old programmer you is like a different person sometimes. Thanks for digging

@iancooper
Copy link
Member

iancooper commented May 30, 2024

In v10

The plan sounds reasonable.

Interestingly in V10 we have surfaced RequestContext on most calls, we could actually pass that information there and drop args . I didn't do that because I would have to rewrite the dynamo usage, but if we are rewriting that anyway we could.

The RequestContext has a bag. We could add an explicit property although we don't ask you to provide a RequestContext, you just can, so we can't rely on it being there and I worry a little that folks might set a Topic even if they did not need to i.e. using DynamoDb.

Maybe creating an explicit DynamoDbOutbox record type that includes Topic and then adding that into the bag under a well-known key is one workaround to the optionality issues here?

@iancooper
Copy link
Member

iancooper commented May 30, 2024

Keep the logic allowing callers to specify a particular topic name, but enforce the provided batch size.

I assume we iterate to fill the batch, if the batch is larger than a page of results; we only pause iterating if we have reached the batch size?

What happens if the batch is smaller than a page? In that case we would be iterating over the page, and not across pages?

@iancooper
Copy link
Member

set the next page key in the dictionary
How long would we keep the dictionary in memory for future requests?

@iancooper
Copy link
Member

Attempt to use an OutboxArchiver with a DynamoDB outbox.

BTW @dhickie the archiver was written recently, so the DynamoDB outbox was tested against outbox/sweeper but probably not against archive when that came out. It suggests a gap somewhere in our approach. But that is why you are hitting issues with it. We used to treat archiving as an "out of band" issue for a user to resolve themselves

@preardon
Copy link
Member

@dhickie sorry for the slow reply, sound reasonable

@dhickie
Copy link
Contributor Author

dhickie commented May 30, 2024

@iancooper

Interestingly in V10 we have surfaced RequestContext on most calls, we could actually pass that information there and drop args . I didn't do that because I would have to rewrite the dynamo usage, but if we are rewriting that anyway we could.

That's an interesting one. I can't actually find any instances of a topic being provided in args in Brighter itself, which would suggest that's there for people to use in their own code if they need it, like it was before the archiver existed. With that in mind, I'm not sure whether RequestContext would be a good fit - it feels a bit like bleeding out an internal implementation detail for a user with a separate concern that doesn't relate to a request pipeline. Unless you disagree?

I assume we iterate to fill the batch, if the batch is larger than a page of results; we only pause iterating if we have reached the batch size?

Yep that's what I had in mind.

What happens if the batch is smaller than a page? In that case we would be iterating over the page, and not across pages?

The API for a Query operation supports an optional Limit parameter, so it would be a case of checking for the remaining number of entries left in the batch size, and then setting Limit to ensure the results don't go over that value. The only scenario where that would be an issue is if you were using a filter expression, as Limit is actually a limit on the number of entries read from the table, but since we're using a key expression that isn't a problem.

How long would we keep the dictionary in memory for future requests?

I was thinking indefinitely - if another request comes in for page 1 for a given topic then that would overwrite anything that was already in the dictionary, so the maximum number of keys in the dictionary would be the number of different topics a given process publishes to.

I'll make a start on a first draft PR. I'm on holiday next week so it probably won't be ready until the week after.

@preardon
Copy link
Member

I've never really seen args used for anything (that being said I mainly spend time on the Microsoft stacks)

@dhickie
Copy link
Contributor Author

dhickie commented May 31, 2024

While working on this, I've uncovered a further issue with the outbox sweeper when used with a Dynamo DB outbox. When setting up a timed outbox sweeper hosted service, one has to provide a Topic value in the outbox args in order to work with the topic indexed GSI mentioned above. In practice, this means that a sweeper can only ever clear unsent messages for a single topic, no matter how many are configured for the external bus, and the remainder go unsent.

There is configuration on each individual Publication object to provide a different Topic, but AFAICT the only use of this is for some logging which runs periodically to check how many outstanding messages there are, and even then it only does this for the Publication that appears first in the list - the remainder go unused.

Looking at the Dynamo outbox implementation, it has the same issues here as with querying for dispatched messages - you have to provide a topic name and the batch size is ignored.

While I'm in this bit of the code I'll do some updates for the OutstandingMessages and OutstandingMessagesAsync methods along similar lines to the approach outlined above:

  • In v10:
    • Enforce the batch size when a topic name is provided in args
    • Support fetching outstanding messages for all topics
  • In v9:
    • Support fetching outstanding messages for all topics

@iancooper
Copy link
Member

There is configuration on each individual Publication object to provide a different Topic, but AFAICT the only use of this is for some logging

FYI

In V10 this gets used to allow us to pass the Publication through the message mapper pipeline. It will end up being used by our default message mapper, so that you can pick up the topic that the message is being sent to and set it on the header. It is also likely to be used by our OTel support. So it is becoming more important as we have a producer registered against a given topic now and you can look it up.

@iancooper
Copy link
Member

Looking at the Dynamo outbox implementation, it has the same issues here as with querying for dispatched messages - you have to provide a topic name and the batch size is ignored.

I can't recall why I ignored the batch size. So give it a shot and we can see what the effects are.

@iancooper
Copy link
Member

iancooper commented Jun 1, 2024

and even then it only does this for the Publication that appears first in the list - the remainder go unused.

I would need to check but I believe I fixed this in V10 as it was a known limitation in V9 (due to not wanting to break interfaces)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug v9 Required for v9 release
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants