Partitioning support for Event Store #770

Open
jokokko opened this issue May 23, 2017 · 8 comments

@jokokko (Collaborator) commented May 23, 2017

To avoid table bloat in the Marten Event Store, the events table (mt_events by default) could be partitioned using PostgreSQL table partitioning.

While the table can already be manually partitioned after creation using table inheritance in PG 9.x, PG 10 will introduce native table partitioning with new prerequisites (and limitations): https://www.postgresql.org/docs/10/static/ddl-partitioning.html. Given that "It is not possible to turn a regular table into a partitioned table or vice versa.", it would be extremely useful to be able to define a partitioning scheme upfront, as Marten creates the necessary tables.
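For illustration, a hedged sketch of what defining such a scheme upfront could look like with PG 10's declarative partitioning. The column list is simplified and does not match Marten's real mt_events schema:

```sql
-- Sketch only: simplified stand-in for mt_events, RANGE-partitioned by
-- event timestamp. PG 10 requires the table to be declared partitioned
-- at creation time; a regular table cannot be converted later.
CREATE TABLE mt_events (
    seq_id    bigint      NOT NULL,
    id        uuid        NOT NULL,
    stream_id uuid        NOT NULL,
    data      jsonb       NOT NULL,
    timestamp timestamptz NOT NULL DEFAULT now()
) PARTITION BY RANGE (timestamp);

-- Partitions are not created automatically; each range must be attached.
CREATE TABLE mt_events_2017 PARTITION OF mt_events
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
CREATE TABLE mt_events_2018 PARTITION OF mt_events
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
```

Note that in PG 10 an insert that matches no declared range fails (a DEFAULT partition only arrives in PG 11), which is part of why partition maintenance would need to be managed by Marten.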

@cocowalla (Contributor) commented Jul 7, 2017

It would be great to also support this on arbitrary tables when using Marten as a document database, not just as an Event Store.

@jokokko (Collaborator, Author) commented Jul 8, 2017

Indeed. But if it's implemented for the Event Store tables (well, mainly mt_events), it should not be too big a step to generalize it to all tables. The hardest part is likely offering a flexible way to define the actual partitioning scheme. Then again, being able to do it via Marten would be a huge advantage, as, AFAIK, tables cannot be partitioned after the fact (talking about native partitioning in PG 10).

@turbo commented Sep 21, 2018

I'm relying on Marten for most persistence in my app, but there are two massive tables that are still raw PostgreSQL because they need partitioning support (by tenant and timestamp). It would be nice to handle this in Marten.

Although I'd urge you to wait for PG 11, which will have quite a few partitioning improvements.

@oskardudycz (Collaborator) commented Sep 24, 2018

@turbo currently you can use Conjoined Tenancy, which will at least give you the possibility of splitting data by tenant (including the stream and events tables). I'll have a look at PG 10's native partitioning.

@isen-ng (Contributor) commented Nov 15, 2018

Seems like PG 11 has been released: https://www.postgresql.org/about/news/1894/

@jeremydmiller (Contributor) commented Jul 8, 2019

My thoughts:

Just use what PostgreSQL 11 can do

  • Partitioning by tenant id should work with native PostgreSQL support, but I'd rather have someone test that before claiming so
  • Allow users to identify streams by long numbers backed by a sequence and then let native PostgreSQL do its thing?
  • See if PostgreSQL "range" partitioning can work with sequential UUIDs?
  • Allow users to partition streams by the creation date (not sure how valuable that would be, seems like you'd be mostly querying by stream Id or metadata)
  • If we did stream level metadata by duplicated columns, it seems like that would greatly aid in partitioning
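
As a sketch of the first bullet, with assumed column names (not Marten's actual schema), LIST partitioning by a tenant id duplicated onto the events table could look like:

```sql
-- Sketch only: LIST partitioning on an assumed tenant_id column.
CREATE TABLE mt_events (
    seq_id    bigint  NOT NULL,
    tenant_id varchar NOT NULL,
    data      jsonb   NOT NULL
) PARTITION BY LIST (tenant_id);

-- One partition per tenant, or per group of tenants.
CREATE TABLE mt_events_tenant_a PARTITION OF mt_events
    FOR VALUES IN ('tenant_a');
CREATE TABLE mt_events_tenant_bc PARTITION OF mt_events
    FOR VALUES IN ('tenant_b', 'tenant_c');

-- PG 11 adds a catch-all for tenants without a dedicated partition:
CREATE TABLE mt_events_default PARTITION OF mt_events DEFAULT;
```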

If we programmatically "route" events to separate tables

  • Gotta figure out how events/streams get assigned to a logical partition. Do you do it by stream type? That makes you require the stream type in StartStream(), which hasn't been popular. Could you do it by event type? That's easier to handle in the routing, in a way.
  • The "routing" would need to be part of appending events, the async daemon (totally hoses up the async daemon the way it is today, but we can beat that), and any and all event store queries
  • Route by metadata elements? That's more complexity
  • Do we require users to explicitly choose the logical event store? That's easier, but might cause usage problems. Worth asking folks who care about this
  • If we separate out the events, with today's design it's an event table, the stream table, a generated function to append to the table, and a sequence to track event numbers. If we partition, do we have one of all of these per logical event store?

Async Daemon

  • The async daemon might need to take advantage of the partitioning for improved performance. Rebuilds might go much faster if you use "range" partitioning by event sequence #
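
A hedged sketch of that last point, with illustrative names and bounds: RANGE partitions over the event sequence number line up with how the daemon pages through events.

```sql
-- Sketch only: RANGE partitioning by sequence number, so scans over a
-- seq_id window touch a single partition at a time.
CREATE TABLE mt_events (
    seq_id bigint NOT NULL,
    data   jsonb  NOT NULL
) PARTITION BY RANGE (seq_id);

CREATE TABLE mt_events_seq_0 PARTITION OF mt_events
    FOR VALUES FROM (1) TO (5000001);
CREATE TABLE mt_events_seq_1 PARTITION OF mt_events
    FOR VALUES FROM (5000001) TO (10000001);

-- A paging predicate like the daemon's, e.g.
--   SELECT ... FROM mt_events WHERE seq_id > $1 AND seq_id <= $2
-- can then be pruned by the planner to the matching partition(s).
```
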

@oskardudycz (Collaborator) commented Jul 26, 2019

@jeremydmiller I think that maybe we could start with native partitioning by stream type? IMHO this could be the easiest way to start investigating the topic. I could try to work on a PoC (e.g. by adding a new tenancy, "ByStreamType").

If we have the partitioning by stream type, then we could try to extend it with the next levels, like by date, stream id, etc.

I agree that a duplicated field would be helpful, and that it would be worth checking whether the metadata would help us with partitioning.

IMHO it would also be worth checking out TimescaleDB, as it might potentially simplify that process a lot (@cocowalla already did some initial investigations).

p.s. a nice introduction to partitioning: https://severalnines.com/blog/how-take-advantage-new-partitioning-features-postgresql-11

@jeremydmiller jeremydmiller added this to the 4.0 milestone Apr 28, 2020
@jeremydmiller (Contributor) commented Nov 17, 2020

@oskardudycz @mysticmind Getting back into this a little bit today. Some thoughts:

  • Haven't spoken w/ y'all about this yet, but I kinda like the idea of letting folks use additional event collections instead of the single, default event collection we have today. That segments things real fast, but it adds all kinds of complexity down the line. So rather than today's stream type (or in addition to), we allow you to define a completely separate event collection (stream & events table)
  • If you're using the async daemon in a polling mode, I'd vote to have this segmented by a range of the seq_id because that's how the async daemon hits the table.
  • I agree about duplicating information from the stream to the event table if that helps the partitioning work
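
One hedged way to sketch "separate event collections" at the storage level (assumed columns, not a committed design) is sub-partitioning: LIST by a collection column, then RANGE by seq_id inside each collection so the daemon keeps its sequential access pattern.

```sql
-- Sketch only: LIST by an assumed collection column, with each
-- collection itself RANGE-partitioned by seq_id.
CREATE TABLE mt_events (
    seq_id     bigint  NOT NULL,
    collection varchar NOT NULL,
    data       jsonb   NOT NULL
) PARTITION BY LIST (collection);

-- A partition can itself be a partitioned table.
CREATE TABLE mt_events_orders PARTITION OF mt_events
    FOR VALUES IN ('orders') PARTITION BY RANGE (seq_id);

CREATE TABLE mt_events_orders_p0 PARTITION OF mt_events_orders
    FOR VALUES FROM (1) TO (1000001);
```
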
6 participants