Add random sharding for the DynamoDb outbox#2813
Conversation
|
@iancooper I'm not sure how to trigger CI |
|
|
||
| private async Task<IEnumerable<MessageItem>> QueryAllOutstandingShardsAsync(string topic, DateTime minimumAge, CancellationToken cancellationToken = default) | ||
| { | ||
| using var semaphore = new SemaphoreSlim(20); |
There was a problem hiding this comment.
No reasoning behind this being 20, maybe should be configurable
There was a problem hiding this comment.
What is the reason for the semaphore?
There was a problem hiding this comment.
I now realise maybe doing a Parallel.ForEachAsync is more suitable? I was thinking we need to set MaxDegreeOfParallelism as max number of shards is unbounded
There was a problem hiding this comment.
I wonder if we could sensibly limit the number of shards. At some point we only want so many threads, so many open sockets etc. If someone exceeds that number in order for their outbox to work, I wonder if an outbox is the right solution to transactional messaging for them.
It is a good question though
There was a problem hiding this comment.
i think we should set an upper limit when you configure the number of partitions. Let's say 20 for now and throw an error if you use more than that.
I wonder if you need more than 20 partitions whether you should first think about using DAX over increasing the number of partitions further.
There was a problem hiding this comment.
That makes sense, will simplify to remove semaphores and just set max shards to 20. Can always be another PR if we find 20 isn't enough.
Speaking of DAX, I'm not sure how that would solve the hot partitioning issue, I think it would alleviate maybe some reads and reduce latency but underlying issue would still be there? (from someone who's only just read about DAX on the surface)
There was a problem hiding this comment.
Looks like we have reached agreement to limit at config, which is more explicit.
In principle a write through-cache will limit a hotspot as our partition would tend to be held in memory not on disk so access would not hit an RCU or WCU limit. For "hot spots" DAX is often a good solution, particularly in this case where you write and then read shortly after what you just read, and can evict older elements (which would be dispatched messages for us). I would guess that a write through cache would resolve this issue. So if our partitioning strategy gets too many partitions and thus requires a lot of threads on your Sweeper process it may be more efficient to move to an in-memory cache.
There was a problem hiding this comment.
Thanks for explaining Ian
| _dynamoOverwriteTableConfig = new DynamoDBOperationConfig | ||
| { | ||
| OverrideTableName = _configuration.TableName, | ||
| ConsistentRead = true |
There was a problem hiding this comment.
Setting ConsistentRead = true as default for non-GSI lookups to mitigate against NullReferenceExceptions we've been seeing when calling PostAsync
GitHub feature, one of us has to approve you when you have not triggered one before. It's to stop you uploading something that will start bitcoin mining or the like |
* Start to add random sharding for the DynamoDb outbox * Limit to 20 shards and remove semaphores --------- Co-authored-by: Ian Cooper <ian_hammond_cooper@yahoo.co.uk>
Addresses #2810