
[watermill-sql] Allow batch reading #218

Closed
breml opened this issue Feb 18, 2021 · 5 comments · Fixed by ThreeDotsLabs/watermill-sql#23
Labels
enhancement (New feature or request) · help wanted (Extra attention is needed) · M (medium issue) · module: subscriber

Comments

@breml
Contributor

breml commented Feb 18, 2021

The current implementation of watermill-sql reads exactly one message at a time (PostgreSQL implementation: https://github.com/ThreeDotsLabs/watermill-sql/blob/master/pkg/sql/schema_adapter_postgresql.go#L66).

I expect a significant subscriber performance improvement if the subscriber could read messages in batches (of configurable size), e.g. 10 or 100 items per SELECT.
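As a rough illustration of the idea, the batched query could look something like the sketch below. The table name and column list here are illustrative, not watermill-sql's actual schema adapter code; the only substantive change over a one-row read is parameterizing the LIMIT instead of fetching a single row:

```go
package main

import "fmt"

// buildBatchSelect sketches what a batched SELECT for the PostgreSQL
// schema adapter could look like. Column names and the table identifier
// are hypothetical stand-ins for watermill-sql's real query. $1 is the
// last consumed offset, $2 the configurable batch size (e.g. 10 or 100).
func buildBatchSelect(messagesTable string) string {
	return fmt.Sprintf(
		`SELECT "offset", uuid, payload, metadata FROM %s `+
			`WHERE "offset" > $1 ORDER BY "offset" ASC LIMIT $2 FOR UPDATE`,
		messagesTable,
	)
}

func main() {
	fmt.Println(buildBatchSelect(`"watermill_example_topic"`))
}
```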

My question is whether there are conceptual arguments against this idea.

@screwyprof

Following @breml, I have a similar question about the MySQL implementation: https://github.com/ThreeDotsLabs/watermill-sql/blob/a2768559a9c416c1d8b5fe506401dc51639abb63/pkg/sql/subscriber.go#L249. Is it possible to optimise the reads?

@breml
Contributor Author

breml commented Mar 4, 2021

Today I did some initial tests with batch reading and it is not that easy (and I am no longer sure it will bring a performance gain).

The problem is:
There are known problems with Go's database/sql package and the Postgres driver when trying to execute additional database queries while still reading a result set (sql.Rows). This confuses the Postgres driver, and it returns errors instead of executing the queries.

The workaround to the above problem would be to first read all rows (e.g. 100) into memory, close the sql.Rows and only then update the offset_consumed field in the DB. I did not implement or test this, because I am no longer convinced that this will bring the performance gain I was hoping for.
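The described workaround could be sketched as follows. The `rowIterator` interface and `fakeRows` stand-in are illustrative so the example runs without a database; with database/sql you would pass `*sql.Rows` directly:

```go
package main

import "fmt"

// Message is a minimal stand-in for a row from the messages table.
type Message struct {
	Offset  int64
	Payload []byte
}

// rowIterator abstracts the subset of *sql.Rows used below, so the
// draining logic can be shown without a live database.
type rowIterator interface {
	Next() bool
	Scan(dest ...interface{}) error
	Close() error
}

// drainBatch implements the workaround described above: read every row of
// the batch into memory and close the result set FIRST, so that the
// follow-up UPDATE of offset_consumed runs only after sql.Rows is closed
// and the Postgres driver no longer rejects queries on the connection.
func drainBatch(rows rowIterator) ([]Message, error) {
	defer rows.Close()
	var batch []Message
	for rows.Next() {
		var m Message
		if err := rows.Scan(&m.Offset, &m.Payload); err != nil {
			return nil, err
		}
		batch = append(batch, m)
	}
	return batch, nil
}

// fakeRows is a tiny in-memory iterator so the example runs standalone.
type fakeRows struct {
	offsets []int64
	i       int
}

func (f *fakeRows) Next() bool { f.i++; return f.i <= len(f.offsets) }
func (f *fakeRows) Scan(dest ...interface{}) error {
	*dest[0].(*int64) = f.offsets[f.i-1]
	*dest[1].(*[]byte) = []byte("payload")
	return nil
}
func (f *fakeRows) Close() error { return nil }

func main() {
	batch, err := drainBatch(&fakeRows{offsets: []int64{1, 2, 3}})
	if err != nil {
		panic(err)
	}
	// Only now, with the rows closed, would the caller update
	// offset_consumed to batch[len(batch)-1].Offset.
	fmt.Println(len(batch), batch[len(batch)-1].Offset)
}
```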

@roblaszczak
Member

> The workaround to the above problem would be to first read all rows (e.g. 100) into memory, close the sql.Rows and only then update the offset_consumed field in the DB. I did not implement or test this, because I am no longer convinced that this will bring the performance gain I was hoping for.

I think that in theory, it should help to achieve some performance gains. The downside is that we need to accept that we will lose exactly-once delivery. But I'd not expect Kafka-like performance - this was not a goal of this Pub/Sub 😉

What I can recommend is to do some TDD and experimentation with reading more rows at once and committing in a separate transaction. By making small changes and trusting the tests, you should be able to develop something that works :) Please keep in mind that you will need to change the test features a bit: https://github.com/ThreeDotsLabs/watermill-sql/blob/master/pkg/sql/pubsub_test.go#L175 (due to the loss of exactly-once delivery, for example).

@roblaszczak added the enhancement, help wanted, M (medium issue) and module: subscriber labels Apr 4, 2021
@m110 m110 changed the title Allow batch reading in watermill-sql [watermill-sql] Allow batch reading Jan 21, 2023
@condorcet

@roblaszczak @m110 Some thoughts based on my current experience with pub/subs :)

One of the arguments for batch reading is that it enables batch processing.
With the outbox pattern, we route messages to some broker strictly one by one, and this can be a bottleneck. With batch reading, we could publish the messages of a batch concurrently, e.g. read 100 messages and publish them with 100 goroutines (relatively speaking).

Of course, one-by-one processing provides delivery order by design. But in practice you may need to guarantee message order only in certain cases; most use cases don't care about message ordering.

In cases where we really need message ordering in SQL, it could be offered as a dedicated feature, like Google Pub/Sub message ordering: an ordering_key in the table schema / message could make it possible to publish messages sequentially within the group sharing the same ordering_key.

For example, we read a batch of messages:

offset  ordering_key
1       NULL
2       NULL
3       "key1"
4       "key2"
5       "key1"
6       "key2"
7       NULL

so we could send the messages concurrently as [1], [2], [7], [3,5], [4,6],
where [3,5] and [4,6] are groups of messages sharing an ordering_key that must be sent sequentially (first 3, then 5, etc.).
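This grouping rule could be implemented along the following lines. `Msg`, the empty-string-for-NULL convention and the function name are all illustrative; this is not an API that exists in watermill-sql today:

```go
package main

import "fmt"

// Msg is an illustrative message; an empty OrderingKey stands in for NULL.
type Msg struct {
	Offset      int64
	OrderingKey string
}

// groupForConcurrentPublish splits a batch into groups that may be
// published concurrently: every message without an ordering key gets its
// own group, while messages sharing a key stay together in offset order
// (and must be published sequentially within their group).
func groupForConcurrentPublish(batch []Msg) [][]Msg {
	var groups [][]Msg
	keyed := map[string]int{} // ordering key -> index into groups
	for _, m := range batch {
		if m.OrderingKey == "" {
			groups = append(groups, []Msg{m})
			continue
		}
		if i, ok := keyed[m.OrderingKey]; ok {
			groups[i] = append(groups[i], m)
		} else {
			keyed[m.OrderingKey] = len(groups)
			groups = append(groups, []Msg{m})
		}
	}
	return groups
}

func main() {
	batch := []Msg{
		{1, ""}, {2, ""}, {3, "key1"}, {4, "key2"},
		{5, "key1"}, {6, "key2"}, {7, ""},
	}
	for _, g := range groupForConcurrentPublish(batch) {
		var offsets []int64
		for _, m := range g {
			offsets = append(offsets, m.Offset)
		}
		fmt.Println(offsets)
	}
}
```

For the example batch from the table, this yields the groups [1], [2], [3 5], [4 6], [7] (in first-seen order), each of which can be handed to its own goroutine.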

Batch processing in general has downsides, like partial success and atomicity (when only part of the batch was processed):

offset  ordering_key  result
1       NULL          sent
2       NULL          failed  <-- this will be the actual acked offset for the consumer, so the other messages (3-7) will be redelivered on the next iteration in the current implementation
3       "key1"        sent
4       "key2"        sent
5       "key1"        sent
6       "key2"        sent
7       NULL          sent

but I think that with "at-least-once" semantics (which are fundamental for pub/subs) this should not be a problem.
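One common way to handle the partial-failure case is to acknowledge only the contiguous prefix of successfully sent messages; everything from the first failure onward gets redelivered, which at-least-once semantics permits. A small sketch of that rule (hypothetical names, not watermill-sql code):

```go
package main

import "fmt"

// result records the publish outcome for one message of a batch.
type result struct {
	Offset int64
	Sent   bool
}

// ackUpTo returns the highest offset that can safely be acknowledged: the
// last offset of the contiguous prefix of sent messages. Messages after
// the first failure are redelivered on the next iteration, even if some
// of them were already sent, which is acceptable under at-least-once
// delivery.
func ackUpTo(results []result) (offset int64, ok bool) {
	for _, r := range results {
		if !r.Sent {
			return offset, ok
		}
		offset, ok = r.Offset, true
	}
	return offset, ok
}

func main() {
	// The batch from the table above: offset 2 failed, the rest were sent.
	batch := []result{
		{1, true}, {2, false}, {3, true}, {4, true},
		{5, true}, {6, true}, {7, true},
	}
	off, ok := ackUpTo(batch)
	fmt.Println(off, ok) // only offset 1 is safely acknowledgeable
}
```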

@roblaszczak
Member

As part of fixing #311 I added experimental support for batch reading for Postgres here: ThreeDotsLabs/watermill-sql#22 (warning: this is a proof of concept so far).
