
[watermill-sql] Allow batch reading #218

Closed
breml opened this issue Feb 18, 2021 · 5 comments · Fixed by ThreeDotsLabs/watermill-sql#23
Labels
enhancement (New feature or request) · help wanted (Extra attention is needed) · M (medium issue) · module: subscriber

Comments

@breml
Contributor

breml commented Feb 18, 2021

The current implementation of watermill-sql reads exactly one message at a time (PostgreSQL implementation: https://github.com/ThreeDotsLabs/watermill-sql/blob/master/pkg/sql/schema_adapter_postgresql.go#L66).

I expect a significant subscriber performance improvement if the subscriber could read messages in batches (of configurable size), e.g. 10 or 100 items per SELECT.
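As a rough illustration of the idea, the batched query could look something like the sketch below. The table name and column list here are illustrative, not watermill-sql's actual schema adapter code; the only substantive change over a one-row read is parameterizing the LIMIT instead of fetching a single row:

```go
package main

import "fmt"

// buildBatchSelect sketches what a batched SELECT for the PostgreSQL
// schema adapter could look like. Column names and the table identifier
// are hypothetical stand-ins for watermill-sql's real query. $1 is the
// last consumed offset, $2 the configurable batch size (e.g. 10 or 100).
func buildBatchSelect(messagesTable string) string {
	return fmt.Sprintf(
		`SELECT "offset", uuid, payload, metadata FROM %s `+
			`WHERE "offset" > $1 ORDER BY "offset" ASC LIMIT $2 FOR UPDATE`,
		messagesTable,
	)
}

func main() {
	fmt.Println(buildBatchSelect(`"watermill_example_topic"`))
}
```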

My question is whether there are conceptual arguments against this idea.

@screwyprof

Following @breml, I have a similar question about the MySQL implementation: https://github.com/ThreeDotsLabs/watermill-sql/blob/a2768559a9c416c1d8b5fe506401dc51639abb63/pkg/sql/subscriber.go#L249. Is it possible to optimise the reads?

@breml
Contributor Author

breml commented Mar 4, 2021

Today I did some initial tests with batch reading and it is not that easy (and I am no longer sure it will bring a performance gain).

The problem is:
There are known problems with Go's database/sql package and the Postgres driver when trying to execute additional database queries while still reading a result set (sql.Rows). This confuses the Postgres driver, and it returns errors instead of executing the queries.

The workaround to the above problem would be to first read all rows (e.g. 100) into memory, close the sql.Rows and only then update the offset_consumed field in the DB. I did not implement or test this, because I am no longer convinced that this will bring the performance gain I was hoping for.
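The described workaround could be sketched as follows. The `rowIterator` interface and `fakeRows` stand-in are illustrative so the example runs without a database; with database/sql you would pass `*sql.Rows` directly:

```go
package main

import "fmt"

// Message is a minimal stand-in for a row from the messages table.
type Message struct {
	Offset  int64
	Payload []byte
}

// rowIterator abstracts the subset of *sql.Rows used below, so the
// draining logic can be shown without a live database.
type rowIterator interface {
	Next() bool
	Scan(dest ...interface{}) error
	Close() error
}

// drainBatch implements the workaround described above: read every row of
// the batch into memory and close the result set FIRST, so that the
// follow-up UPDATE of offset_consumed runs only after sql.Rows is closed
// and the Postgres driver no longer rejects queries on the connection.
func drainBatch(rows rowIterator) ([]Message, error) {
	defer rows.Close()
	var batch []Message
	for rows.Next() {
		var m Message
		if err := rows.Scan(&m.Offset, &m.Payload); err != nil {
			return nil, err
		}
		batch = append(batch, m)
	}
	return batch, nil
}

// fakeRows is a tiny in-memory iterator so the example runs standalone.
type fakeRows struct {
	offsets []int64
	i       int
}

func (f *fakeRows) Next() bool { f.i++; return f.i <= len(f.offsets) }
func (f *fakeRows) Scan(dest ...interface{}) error {
	*dest[0].(*int64) = f.offsets[f.i-1]
	*dest[1].(*[]byte) = []byte("payload")
	return nil
}
func (f *fakeRows) Close() error { return nil }

func main() {
	batch, err := drainBatch(&fakeRows{offsets: []int64{1, 2, 3}})
	if err != nil {
		panic(err)
	}
	// Only now, with the rows closed, would the caller update
	// offset_consumed to batch[len(batch)-1].Offset.
	fmt.Println(len(batch), batch[len(batch)-1].Offset)
}
```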

@roblaszczak
Member

> The workaround to the above problem would be to first read all rows (e.g. 100) into memory, close the sql.Rows and only then update the offset_consumed field in the DB. I did not implement or test this, because I am no longer convinced that this will bring the performance gain I was hoping for.

I think that in theory, it should help to achieve some performance gains. The downside is that we need to accept that we will lose exactly-once delivery. But I'd not expect Kafka-like performance - this was not a goal of this Pub/Sub 😉

What I can recommend is to do some TDD and experimentation with reading more rows at once and committing in a separate transaction. By making small changes and trusting the tests, you should be able to develop something that works :) Please keep in mind that you will need to change the test features a bit: https://github.com/ThreeDotsLabs/watermill-sql/blob/master/pkg/sql/pubsub_test.go#L175 (due to the loss of exactly-once delivery, for example).

@roblaszczak added the enhancement, help wanted, M (medium issue) and module: subscriber labels Apr 4, 2021
@m110 m110 changed the title Allow batch reading in watermill-sql [watermill-sql] Allow batch reading Jan 21, 2023
@condorcet

@roblaszczak @m110 Some thoughts based on my current experience with pub/subs :)

One of the arguments for batch reading is that it enables batch processing.
With the outbox pattern, we route messages to some broker strictly one by one, and this can be a bottleneck. With batch reading, we could publish the messages of a batch concurrently, e.g. read 100 messages and publish them with 100 goroutines (relatively speaking).

Of course, one-by-one processing provides delivery order by design. But in practice you may need to guarantee message order only in certain cases; most use cases don't care about message ordering.

In cases where we really need message ordering in SQL, it could be offered as a dedicated feature, like Google Pub/Sub message ordering: an ordering_key in the table schema / message could make it possible to publish messages sequentially within the group sharing the same ordering_key.

For example, we read a batch of messages:

offset  ordering_key
1       NULL
2       NULL
3       "key1"
4       "key2"
5       "key1"
6       "key2"
7       NULL

so we could send the messages concurrently as [1], [2], [7], [3,5], [4,6],
where [3,5] and [4,6] are groups of messages sharing an ordering_key that must be sent sequentially (first 3, then 5, etc.).
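This grouping rule could be implemented along the following lines. `Msg`, the empty-string-for-NULL convention and the function name are all illustrative; this is not an API that exists in watermill-sql today:

```go
package main

import "fmt"

// Msg is an illustrative message; an empty OrderingKey stands in for NULL.
type Msg struct {
	Offset      int64
	OrderingKey string
}

// groupForConcurrentPublish splits a batch into groups that may be
// published concurrently: every message without an ordering key gets its
// own group, while messages sharing a key stay together in offset order
// (and must be published sequentially within their group).
func groupForConcurrentPublish(batch []Msg) [][]Msg {
	var groups [][]Msg
	keyed := map[string]int{} // ordering key -> index into groups
	for _, m := range batch {
		if m.OrderingKey == "" {
			groups = append(groups, []Msg{m})
			continue
		}
		if i, ok := keyed[m.OrderingKey]; ok {
			groups[i] = append(groups[i], m)
		} else {
			keyed[m.OrderingKey] = len(groups)
			groups = append(groups, []Msg{m})
		}
	}
	return groups
}

func main() {
	batch := []Msg{
		{1, ""}, {2, ""}, {3, "key1"}, {4, "key2"},
		{5, "key1"}, {6, "key2"}, {7, ""},
	}
	for _, g := range groupForConcurrentPublish(batch) {
		var offsets []int64
		for _, m := range g {
			offsets = append(offsets, m.Offset)
		}
		fmt.Println(offsets)
	}
}
```

For the example batch from the table, this yields the groups [1], [2], [3 5], [4 6], [7] (in first-seen order), each of which can be handed to its own goroutine.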

Batch processing in general has downsides, like partial success and atomicity (when only part of the batch was processed):

offset  ordering_key  result
1       NULL          sent
2       NULL          failed  <-- this will be the actual acked offset for the consumer, so the other messages (3-7) will be redelivered on the next iteration in the current implementation
3       "key1"        sent
4       "key2"        sent
5       "key1"        sent
6       "key2"        sent
7       NULL          sent

but I think that with "at-least-once" semantics (which are fundamental for pub/subs) this should not be a problem.
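One common way to handle the partial-failure case is to acknowledge only the contiguous prefix of successfully sent messages; everything from the first failure onward gets redelivered, which at-least-once semantics permits. A small sketch of that rule (hypothetical names, not watermill-sql code):

```go
package main

import "fmt"

// result records the publish outcome for one message of a batch.
type result struct {
	Offset int64
	Sent   bool
}

// ackUpTo returns the highest offset that can safely be acknowledged: the
// last offset of the contiguous prefix of sent messages. Messages after
// the first failure are redelivered on the next iteration, even if some
// of them were already sent, which is acceptable under at-least-once
// delivery.
func ackUpTo(results []result) (offset int64, ok bool) {
	for _, r := range results {
		if !r.Sent {
			return offset, ok
		}
		offset, ok = r.Offset, true
	}
	return offset, ok
}

func main() {
	// The batch from the table above: offset 2 failed, the rest were sent.
	batch := []result{
		{1, true}, {2, false}, {3, true}, {4, true},
		{5, true}, {6, true}, {7, true},
	}
	off, ok := ackUpTo(batch)
	fmt.Println(off, ok) // only offset 1 is safely acknowledgeable
}
```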

@roblaszczak
Member

As part of fixing #311 I added experimental support for batch reading for Postgres here: ThreeDotsLabs/watermill-sql#22 (warning: this is a proof of concept so far).
