
Flexible Document Metadata #1337

Closed
oskardudycz opened this issue Aug 14, 2019 · 11 comments

@oskardudycz
Collaborator

oskardudycz commented Aug 14, 2019

Currently it's not possible to automatically map event metadata such as Version, Timestamp, StreamId etc. onto fields of the event classes.
Because of that, it's hard to implement an elegant Aggregate + Repository pattern without resorting to reflection or making compromises such as a mutable Version field or data duplication.
I think we should give users the ability to map those fields automatically from metadata.

See more in discussion on the Event Sourced Aggregate+Repository sample: #1299
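For example, the workaround being described forces an aggregate to look something like this (illustrative only; `Invoice` is an invented example type, not from the linked sample):

```csharp
public class Invoice
{
    public Guid Id { get; set; }

    // Has to be publicly mutable today so a repository can stamp it after
    // loading from the stream -- the compromise this issue wants to remove
    public long Version { get; set; }

    // ...domain behavior elided
}
```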

@jeremydmiller
Member

jeremydmiller commented Sep 18, 2020

I'm generalizing this a bit.

Usages:

  • Opt out of metadata columns
  • Have soft-deleted values set on documents as they're loaded, similar to versions today
  • Extensible metadata columns
  • Extensible metadata shows up in Metadata queries

@jeremydmiller
Member

How does custom Document Metadata work?

I think the main usage we've heard so far has been traceability, à la the OpenTelemetry standard: tracking correlation identifiers for document saves, as well as the active user.

I think these are the open questions:

  • Do we make the metadata be open-ended and user-defined, or do we try to guess what folks would need and hard code slots for common metadata items? (I lean open-ended, but hard-coded would make for an easier implementation).
  • How do we capture the metadata on the DocumentSession operations?
  • How do we store the metadata in Postgresql itself?
  • What facilities do we add for querying events or documents against the metadata values?
  • How do users configure the metadata storage if we don't opt for a Dictionary storage?

This is very closely related to #780 for the event metadata, and I think there's going to be some shared functionality here.

So let's talk about the mechanics, and this is a brainstorm w/ some of my own opinions called out:

Open-Ended w/ Dictionary<string, object>

The obvious "open-ended" way would be to persist a JSON serialized Dictionary<string, object> to capture the metadata. If you do that, I think you would add some Linq WHERE helper methods to aid in searching. I think we'd add some extension methods to try to standardize common elements like the responsible user and open telemetry correlation identifiers.

I'm a little concerned by this approach about performance because it forces you to do an extra serialization step within the unit of work operations.
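A sketch of how the dictionary-based approach might look in use (everything shown here is hypothetical -- neither `session.Metadata` nor a `MetadataValue<T>()` Linq helper exists in Marten today):

```csharp
// Hypothetical sketch of the open-ended, Dictionary-based metadata API
using (var session = store.LightweightSession())
{
    // Tag the whole unit of work with open-ended metadata
    session.Metadata["CorrelationId"] = correlationId;
    session.Metadata["UserName"] = "jeremy";

    session.Store(order);
    session.SaveChanges();
}

// A possible Linq helper for searching against the serialized metadata
var matches = session.Query<Order>()
    .Where(x => x.MetadataValue<Guid>("CorrelationId") == correlationId)
    .ToList();
```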

Strong-Typed Metadata Objects

You could have a user configure a DocumentStore to use a certain metadata type of their choosing like:

public class TransactionData
{
    public Guid TransactionId { get; set; }
    public Guid CorrelationId { get; set; }
    public string UserName { get; set; }
}

And this leads us to some mapping concerns. Is this stored as JSON serialized in each document/event record in one field? Do we split it out into separate fields? More on this later.
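Configuring the store with such a type might look roughly like this (hypothetical fluent API, shown only to make the idea concrete; `UseMetadataType<T>()` is invented):

```csharp
// Hypothetical configuration -- UseMetadataType<T>() is not a real Marten API
var store = DocumentStore.For(opts =>
{
    opts.Connection(connectionString);

    // Capture one TransactionData instance per unit of work
    opts.UseMetadataType<TransactionData>();
});
```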

Capturing Metadata on IDocumentSession

I think in this case you could capture the metadata by one of these methods on IDocumentSession:



// I think this fits well with application frameworks that open a session from DI and stick the
// known user, correlation, timestamp kind of stuff directly on the session before your application
// code starts using it

// Hard-coded
public object Metadata { get; set; }

// If we choose Dictionary for all the things
public Dictionary<string, object> Metadata { get; set; }

// If you don't do the metadata tagging at session creation time
SaveChanges(Dictionary<string, object> metadata = null);

I'm not enthusiastic about having folks set the metadata as part of individual Store() or Events.Append() calls.
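As a usage sketch, the session-creation-time tagging could fit naturally into DI registration, something like this (assuming the hypothetical `Metadata` property above and an ASP.NET Core style container):

```csharp
// Hypothetical sketch: stamp ambient request data on the session before
// any application code uses it
services.AddScoped<IDocumentSession>(sp =>
{
    var store = sp.GetRequiredService<IDocumentStore>();
    var user = sp.GetRequiredService<IHttpContextAccessor>()
        .HttpContext?.User.Identity?.Name;

    var session = store.LightweightSession();
    session.Metadata = new TransactionData
    {
        TransactionId = Guid.NewGuid(),
        CorrelationId = correlationId, // e.g. resolved from the incoming request
        UserName = user
    };
    return session;
});
```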

Storing the Metadata

I see three options:

  1. Serialized JSON headers in the document/event rows -- most flexible, probably the slowest, PITA to query unless you're good with the Postgres operators and most folks aren't, and this would require us to add some custom Linq helpers for querying. Probably also the most expensive on document querying if we're going to do the trick where we apply metadata to the documents as we load them.
  2. Separate database columns in the document/event rows. More complexity at runtime, but the code generation in v4 softens the blow from this considerably over the <=v3 Expression building mechanics. Less flexibility because we'd need to know the possible metadata values upfront -- but I think that's a good idea regardless. Possibly expensive in terms of storage
  3. Have a completely separate Transaction table that has all the known metadata values with some kind of transaction id, then just store the transaction id in the document/event tables. I think this is actually my (Jeremy) preferred approach at the moment. It's gonna cause some JOINs, but that shouldn't be a huge deal. Helps on storage costs. I'm thinking the JOIN is cheaper than the JSON serialization and parsing later
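To make option 3 concrete, the schema might look roughly like this (illustrative DDL only; all table and column names here are invented, not Marten's actual schema):

```sql
-- Hypothetical layout for option 3: one shared transaction table
CREATE TABLE mt_transactions (
    id             uuid PRIMARY KEY,
    correlation_id uuid,
    user_name      varchar,
    executed_at    timestamptz DEFAULT now()
);

-- Document/event rows carry only the foreign key
ALTER TABLE mt_doc_order
    ADD COLUMN mt_transaction_id uuid REFERENCES mt_transactions (id);

-- Querying by metadata is then a JOIN instead of JSON parsing
SELECT d.data
FROM mt_doc_order d
JOIN mt_transactions t ON t.id = d.mt_transaction_id
WHERE t.correlation_id = :correlation_id;
```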

Configuring the DocumentStore

  1. Anything goes in a Dictionary
  2. Define upfront the expected metadata keys by either supplying a strong typed class or one key/type at a time

@cocowalla
Contributor

My take on this is really from the viewpoint of #1557 - I personally have little use for document metadata, and above all would like the option not to bloat my table storage with them if they're not needed.

Having said that, I can understand use cases; for a simple example, a "last modified" timestamp is a really common thing to have, and you might want to use Marten's mt_last_modified column to avoid having 2 timestamps.

This is a hard problem, and I'm not sure you're going to find a method that suits everyone 😆, but I'll weigh in from the perspective of a document store user:

How does custom Document Metadata work?
Personally, I'm not fully convinced about the utility of automatically mapping metadata to models - it seems like a lot of work (especially mapping in the direction of documents to the database) that has the potential to negatively impact those who don't need it. I should say though, I don't use the event store functionality, only the document store functionality.

Storing the Metadata

  1. My least favourite, because of the perf hit
  2. This is how Marten v3 works, so might mean a more familiar upgrade path from v3 -> v4. I actually like that it provides less flexibility, because I want to avoid a proliferation of metadata in future versions 🙄
  3. I wonder if this would make it easier to add the plumbing to opt-out of all unnecessary metadata? I would be concerned about the perf if there is any metadata that is mandatory; imagine tables with hundreds of millions or billions of rows - that Transaction table is going to get very big (assuming you meant a single god table, and not an additional table-per-type)

I'm torn between (2) and (3)

@mdissel
Contributor

mdissel commented Sep 25, 2020

Capturing Metadata on IDocumentSession

Yes! This would make the mapping of CurrentUser / CorrelationId, etc in a DocumentSessionListener much easier

Storing the Metadata

The mt_last_modified column currently has no technical reason to exist, so it could be removed; if people need it, they can just define the property on the entity.

The mt_version and mt_is_deleted* columns are needed by Marten, so I would leave them as is.

To minimize the JSON (de)serialization I would suggest not adding a separate column to store the metadata. If people need additional metadata to be stored, just define the properties as part of the entity. The Linq parser will handle the queries if people need to query on those fields.
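In other words, the metadata would just live on the document type itself, e.g. (illustrative sketch; `Order` is an invented example type):

```csharp
public class Order
{
    public Guid Id { get; set; }

    // Ordinary document properties instead of dedicated metadata columns
    public DateTimeOffset LastModified { get; set; }
    public string LastModifiedBy { get; set; }
}

// Marten's existing Linq support already covers queries on such fields
var recent = session.Query<Order>()
    .Where(x => x.LastModified > DateTimeOffset.UtcNow.AddDays(-1))
    .ToList();
```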

Fewer features, less work to maintain 😃

@jeremydmiller
Member

@cocowalla We've committed to "opt out" of all the document metadata in this issue too, so you're covered.

@oskardudycz
Collaborator Author

From my perspective, metadata should definitely be opt-out functionality (or the user could even decide which set of fields to use).

I understand the query performance perspective, but I also don't see why it should be a key factor - from my perspective, querying based on metadata shouldn't be the common case. It would probably be used more for diagnostics, debugging etc., where it's usually fine to have higher latency than normal. E.g. what business logic would be querying by the correlation id? Fields like created, version etc. do have use cases though (e.g. time travelling in Event Sourcing).

I'm not a huge fan of joins with a separate table. I think it will make things more complex for the advanced scenarios we plan to add in the future, like Postgres partitioning etc. But I agree, it's an option.

I agree with the storage size perspective - we definitely need to keep an eye on that, e.g. by adding a feature toggle.

Maybe we could find some middle ground by selecting a set of columns that will be common (like correlation id, created, version etc.) and putting them into columns, then allowing the user to define their own set of data and storing it in a separate JSON column? I think that giving optional flexibility is a must-have, as we cannot cover all cases (and probably don't want to). E.g. I talked with @jacobpovar and they needed to write their own Aggregator and other internals - to support conflict resolution for event streams in a multi-master environment.
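That middle ground might look roughly like this at the storage level (illustrative DDL; table and column names are invented):

```sql
-- Hypothetical hybrid layout: hard-coded columns for the common metadata,
-- plus one JSONB column for user-defined extras
ALTER TABLE mt_doc_order
    ADD COLUMN mt_correlation_id uuid,
    ADD COLUMN mt_created        timestamptz,
    ADD COLUMN mt_headers        jsonb;  -- open-ended user metadata
```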

I'm not sure though if that wouldn't be too complex to implement.

Also, what about the possibility to automatically apply the metadata to the event or document class? That would be extremely useful (e.g. for version and timestamp columns in event sourcing).

@jeremydmiller thoughts?

@jeremydmiller
Member

Oh, definitely go all the way on metadata. Either they're all in separate columns or all in a JSON serialized field, but not a mixed set. We've already agreed to make the existing metadata fields be opt out for storage. Any new metadata fields would be "opt in".

"what about the possibility to automatically apply the metadata into the event or document class?" -- that's relatively easy if it's a public property/field. A tiny bit more work if we have to support private/internal setters.
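A minimal reflection-based sketch of writing metadata through a non-public setter (Marten v4 would generate this code rather than use reflection at runtime, so this only shows the shape of the problem; `SetMetadata` is an invented helper):

```csharp
using System.Reflection;

// Works even when the property has a private or internal setter
static void SetMetadata<T>(T document, string propertyName, object value)
{
    var prop = typeof(T).GetProperty(
        propertyName,
        BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Instance);

    prop?.SetValue(document, value);
}

// e.g. SetMetadata(order, "Version", 5L);
```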

I'm not very worried about the query performance against metadata, because as you said, it's probably going to be rare. Insert time though, that matters, and that's why I don't like the JSON serialized header idea.

The bigger question here was on how the metadata is captured. Any thoughts on some of the alternatives?

@oskardudycz
Collaborator Author

Fair point about insert performance. It's extremely important - especially for the event store. I think it'd be worth doing a PoC on those approaches to see how big the impact is and what improvements we could make there. What do you think?

Regarding the capturing - I believe the minimum we have to do is inject the metadata at session initialization - and this option should probably always be available even if we select other options - it's probably the most convenient way to inject telemetry-related params (as they're usually injected per scope).

I'm also not a huge fan of injecting them in SaveChanges; I'd probably prefer to allow passing them in the Insert, Store, AppendEvent methods - as then it'd be more explicit what we're trying to do, and it would still let the result of some custom business logic be injected.

I think we can start with injecting them during session initialization, then gather feedback and extend it later if needed.

P.S. I know that might be out of scope, but I'd also like to add the possibility of adding a stream type column to the events table - that can be useful for the outbox pattern and routing (e.g. to Kafka topics). We could consider using the same or a similar mechanism for that.

@jeremydmiller
Member

jeremydmiller commented Sep 27, 2020

Tasking

Refactorings

  • Use ISelectableColumn to determine what columns need to be fetched in each ISelector
  • Eliminate DocumentMapping.SelectFields()
  • DocumentStorage<,> base class needs to take in a StorageStyle argument to differentiate
  • SelectorBuilder first chooses the columns that are active, then generates everything off of that list.
  • Use a separate base type for DocumentStorage for dirty checking storage
  • Do a type sweep to mark things as internal wherever possible
  • Unit test DocumentMapping.SelectColumns()
  • Make the construction of DocumentTable be lazy on DocumentMapping so we stop building it over and over again

Goals

  • Version and other members are set on QueryOnly selectors
  • Opt out of metadata columns
  • Have soft-deleted values set on documents as they're loaded, similar to versions today
  • CausationId, CorrelationId, LastModifiedBy metadata columns
  • CausationId, CorrelationId, LastModifiedBy metadata are in the DocumentMetadata queries
  • Extensible "header" metadata column
  • Extensible "header" metadata shows up in Metadata queries
  • Test internal/private setters for metadata properties and Fields

Nice to dos

  • Modify Version when updated?

Dev Tasks

  • Convert the DocumentMappingExpression fluent interface to using Metadata<T>
  • Shorthand for disabling informational metadata columns
  • Add XML comments to Metadata
  • Add XML comments to DocumentMetadata
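One possible shape for the opt-out shorthand mentioned above (hypothetical fluent API, loosely modelled on the existing DocumentMappingExpression style; the exact member names are invented):

```csharp
// Sketch: per-document-type opt out of the informational metadata columns
var store = DocumentStore.For(opts =>
{
    opts.Connection(connectionString);

    opts.Schema.For<Order>().Metadata(m =>
    {
        m.LastModified.Enabled = false;
        m.Version.Enabled = false;
    });
});
```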

@jeremydmiller jeremydmiller added this to the 4.0 milestone Sep 28, 2020
@jeremydmiller
Member

@cocowalla I've got the ability to omit the informational metadata columns working in a local branch. It's document type by document type so far, but we'll add some policy support soon too.

@cocowalla
Contributor

@jeremydmiller 🎉 great to hear, thanks for the update!
