
Flexible Document Metadata #1337

Closed
oskardudycz opened this issue Aug 14, 2019 · 11 comments

@oskardudycz
Collaborator

oskardudycz commented Aug 14, 2019

Currently it's not possible to automatically map event metadata such as Version, Timestamp, StreamId etc. onto fields of the event classes.
Because of that, it's hard to implement an elegant Aggregate + Repository pattern without resorting to reflection or making compromises such as a mutable Version field or data duplication.
I think we should give users the ability to map those fields automatically from metadata.

See more in discussion on the Event Sourced Aggregate+Repository sample: #1299
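For example, the workaround being described forces an aggregate to look something like this (illustrative only; `Invoice` is an invented example type, not from the linked sample):

```csharp
public class Invoice
{
    public Guid Id { get; set; }

    // Has to be publicly mutable today so a repository can stamp it after
    // loading from the stream -- the compromise this issue wants to remove
    public long Version { get; set; }

    // ...domain behavior elided
}
```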

@jeremydmiller
Member

jeremydmiller commented Sep 18, 2020

I'm generalizing this a bit.

Usages:

  • Opt out of metadata columns
  • Have soft-deleted values set on documents as they're loaded, similar to versions today
  • Extensible metadata columns
  • Extensible metadata shows up in Metadata queries

@jeremydmiller
Member

How does custom Document Metadata work?

I think the main usage we've heard so far has been traceability, à la the OpenTelemetry standard: tracking correlation identifiers for document saves, as well as the active user.

I think these are the open questions:

  • Do we make the metadata be open-ended and user-defined, or do we try to guess what folks would need and hard code slots for common metadata items? (I lean open-ended, but hard-coded would make for an easier implementation).
  • How do we capture the metadata on the DocumentSession operations?
  • How do we store the metadata in Postgresql itself?
  • What facilities do we add for querying events or documents against the metadata values?
  • How do users configure the metadata storage if we don't opt for a Dictionary storage?

This is very closely related to #780 for the event metadata, and I think there's going to be some shared functionality here.

So let's talk about the mechanics, and this is a brainstorm w/ some of my own opinions called out:

Open-Ended w/ Dictionary<string, object>

The obvious "open-ended" way would be to persist a JSON serialized Dictionary<string, object> to capture the metadata. If you do that, I think you would add some Linq WHERE helper methods to aid in searching. I think we'd add some extension methods to try to standardize common elements like the responsible user and open telemetry correlation identifiers.

I'm a little concerned by this approach about performance because it forces you to do an extra serialization step within the unit of work operations.
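A sketch of how the dictionary-based approach might look in use (everything shown here is hypothetical -- neither `session.Metadata` nor a `MetadataValue<T>()` Linq helper exists in Marten today):

```csharp
// Hypothetical sketch of the open-ended, Dictionary-based metadata API
using (var session = store.LightweightSession())
{
    // Tag the whole unit of work with open-ended metadata
    session.Metadata["CorrelationId"] = correlationId;
    session.Metadata["UserName"] = "jeremy";

    session.Store(order);
    session.SaveChanges();
}

// A possible Linq helper for searching against the serialized metadata
var matches = session.Query<Order>()
    .Where(x => x.MetadataValue<Guid>("CorrelationId") == correlationId)
    .ToList();
```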

Strong-Typed Metadata Objects

You could have a user configure a DocumentStore to use a certain metadata type of their choosing like:

public class TransactionData
{
    public Guid TransactionId { get; set; }
    public Guid CorrelationId { get; set; }
    public string UserName { get; set; }
}

And this leads us to some mapping concerns. Is this stored as JSON serialized in each document/event record in one field? Do we split it out into separate fields? More on this later.
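Configuring the store with such a type might look roughly like this (hypothetical fluent API, shown only to make the idea concrete; `UseMetadataType<T>()` is invented):

```csharp
// Hypothetical configuration -- UseMetadataType<T>() is not a real Marten API
var store = DocumentStore.For(opts =>
{
    opts.Connection(connectionString);

    // Capture one TransactionData instance per unit of work
    opts.UseMetadataType<TransactionData>();
});
```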

Capturing Metadata on IDocumentSession

I think in this case you could capture the metadata by one of these methods on IDocumentSession:



// I think this fits well with application frameworks that open a session from DI and stick the
// known user, correlation, timestamp kind of stuff directly on the session before your application
// code starts using it

// Hard-coded
public object Metadata { get; set; }

// If we choose Dictionary for all the things
public Dictionary<string, object> Metadata { get; set; }

// If you don't do the metadata tagging at session creation time
SaveChanges(Dictionary<string, object> metadata = null);

I'm not enthusiastic about having folks set the metadata as part of individual Store() or Events.Append() calls.
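As a usage sketch, the session-creation-time tagging could fit naturally into DI registration, something like this (assuming the hypothetical `Metadata` property above and an ASP.NET Core style container):

```csharp
// Hypothetical sketch: stamp ambient request data on the session before
// any application code uses it
services.AddScoped<IDocumentSession>(sp =>
{
    var store = sp.GetRequiredService<IDocumentStore>();
    var user = sp.GetRequiredService<IHttpContextAccessor>()
        .HttpContext?.User.Identity?.Name;

    var session = store.LightweightSession();
    session.Metadata = new TransactionData
    {
        TransactionId = Guid.NewGuid(),
        CorrelationId = correlationId, // e.g. resolved from the incoming request
        UserName = user
    };
    return session;
});
```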

Storing the Metadata

I see three options:

  1. Serialized JSON headers in the document/event rows -- most flexible, probably the slowest, PITA to query unless you're good with the Postgres operators and most folks aren't, and this would require us to add some custom Linq helpers for querying. Probably also the most expensive on document querying if we're going to do the trick where we apply metadata to the documents as we load them.
  2. Separate database columns in the document/event rows. More complexity at runtime, but the code generation in v4 softens the blow from this considerably over the <=v3 Expression building mechanics. Less flexibility because we'd need to know the possible metadata values upfront -- but I think that's a good idea regardless. Possibly expensive in terms of storage
  3. Have a completely separate Transaction table that has all the known metadata values with some kind of transaction id, then just store the transaction id in the document/event tables. I think this is actually my (Jeremy) preferred approach at the moment. It's gonna cause some JOINs, but that shouldn't be a huge deal. Helps on storage costs. I'm thinking the JOIN is cheaper than the JSON serialization and parsing later
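To make option 3 concrete, the schema might look roughly like this (illustrative DDL only; all table and column names here are invented, not Marten's actual schema):

```sql
-- Hypothetical layout for option 3: one shared transaction table
CREATE TABLE mt_transactions (
    id             uuid PRIMARY KEY,
    correlation_id uuid,
    user_name      varchar,
    executed_at    timestamptz DEFAULT now()
);

-- Document/event rows carry only the foreign key
ALTER TABLE mt_doc_order
    ADD COLUMN mt_transaction_id uuid REFERENCES mt_transactions (id);

-- Querying by metadata is then a JOIN instead of JSON parsing
SELECT d.data
FROM mt_doc_order d
JOIN mt_transactions t ON t.id = d.mt_transaction_id
WHERE t.correlation_id = :correlation_id;
```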

Configuring the DocumentStore

  1. Anything goes in a Dictionary
  2. Define upfront the expected metadata keys by either supplying a strong typed class or one key/type at a time

@cocowalla
Contributor

My take on this is really from the viewpoint of #1557 - I personally have little use for document metadata, and above all would like the option not to bloat my table storage with them if they're not needed.

Having said that, I can understand use cases; for a simple example, a "last modified" timestamp is a really common thing to have, and you might want to use Marten's mt_last_modified column to avoid having 2 timestamps.

This is a hard problem, and I'm not sure you're going to find a method that suits everyone 😆, but I'll weigh in from the perspective of a document store user:

How does custom Document Metadata work?
Personally, I'm not fully convinced about the utility of automatically mapping metadata to models - it seems like a lot of work (especially mapping in the direction of documents to the database) that has the potential to negatively impact those who don't need it. I should say though, I don't use the event store functionality, only the document store functionality.

Storing the Metadata

  1. My least favourite, because of the perf hit
  2. This is how Marten v3 works, so might mean a more familiar upgrade path from v3 -> v4. I actually like that it provides less flexibility, because I want to avoid a proliferation of metadata in future versions 🙄
  3. I wonder if this would make it easier to add the plumbing to opt-out of all unnecessary metadata? I would be concerned about the perf if there is any metadata that is mandatory; imagine tables with hundreds of millions or billions of rows - that Transaction table is going to get very big (assuming you meant a single god table, and not an additional table-per-type)

I'm torn between (2) and (3)

@mdissel
Contributor

mdissel commented Sep 25, 2020

Capturing Metadata on IDocumentSession

Yes! This would make the mapping of CurrentUser / CorrelationId, etc in a DocumentSessionListener much easier

Storing the Metadata

The mt_last_modified column currently has no technical reason to exist, so it could be removed; if people need it, they can just define the property on the entity.

The mt_version and mt_is_deleted* columns are needed by Marten, so I would leave them as is.

To minimize the JSON (de)serialization I would suggest not adding a separate column to store the metadata. If people need additional metadata to be stored, just define the properties as part of the entity. The Linq parser will handle the queries if people need to query on those fields.
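In other words, the metadata would just live on the document type itself, e.g. (illustrative sketch; `Order` is an invented example type):

```csharp
public class Order
{
    public Guid Id { get; set; }

    // Ordinary document properties instead of dedicated metadata columns
    public DateTimeOffset LastModified { get; set; }
    public string LastModifiedBy { get; set; }
}

// Marten's existing Linq support already covers queries on such fields
var recent = session.Query<Order>()
    .Where(x => x.LastModified > DateTimeOffset.UtcNow.AddDays(-1))
    .ToList();
```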

Fewer features, less work to maintain 😃

@jeremydmiller
Member

@cocowalla We've committed to "opt out" of all the document metadata in this issue too, so you're covered.

@oskardudycz
Collaborator Author

From my perspective, metadata should definitely be opt-out functionality (or the user could even decide which set of fields to use).

I understand the query performance perspective, but I also don't see why it should be a key factor - from my perspective, querying based on metadata shouldn't be the common case. It would probably be used more for diagnostics, debugging etc., where it's usually fine to have higher latency than normal. E.g. what business logic would be querying by the correlation id? Fields like created, version etc. do have use cases though (e.g. time travelling in Event Sourcing).

I'm not a huge fan of joins with a separate table. I think it will make things more complex for the advanced scenarios we plan to add in the future, like Postgres partitioning etc. But I agree, it's an option.

I agree with the storage size perspective - we definitely need to keep an eye on that, e.g. by adding a feature toggle.

Maybe we could find some middle ground by selecting a set of columns that will be common (like correlation id, created, version etc.) and putting them into columns, then allowing the user to define their own set of data and storing it in a separate JSON column? I think that giving optional flexibility is a must-have, as we cannot cover all cases (and probably don't want to). E.g. I talked with @jacobpovar and they needed to write their own Aggregator and other internals - to support conflict resolution for event streams in a multi-master environment.
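That middle ground might look roughly like this at the storage level (illustrative DDL; table and column names are invented):

```sql
-- Hypothetical hybrid layout: hard-coded columns for the common metadata,
-- plus one JSONB column for user-defined extras
ALTER TABLE mt_doc_order
    ADD COLUMN mt_correlation_id uuid,
    ADD COLUMN mt_created        timestamptz,
    ADD COLUMN mt_headers        jsonb;  -- open-ended user metadata
```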

I'm not sure though if that wouldn't be too complex to implement.

Also, what about the possibility to automatically apply the metadata to the event or document class? That would be extremely useful (e.g. for version and timestamp columns in event sourcing).

@jeremydmiller thoughts?

@jeremydmiller
Member

Oh, definitely go all the way on metadata. Either they're all in separate columns or all in a JSON serialized field, but not a mixed set. We've already agreed to make the existing metadata fields be opt out for storage. Any new metadata fields would be "opt in".

"what about the possibility to automatically apply the metadata into the event or document class?" -- that's relatively easy if it's a public property/field. A tiny bit more work if we have to support private/internal setters.
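A minimal reflection-based sketch of writing metadata through a non-public setter (Marten v4 would generate this code rather than use reflection at runtime, so this only shows the shape of the problem; `SetMetadata` is an invented helper):

```csharp
using System.Reflection;

// Works even when the property has a private or internal setter
static void SetMetadata<T>(T document, string propertyName, object value)
{
    var prop = typeof(T).GetProperty(
        propertyName,
        BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Instance);

    prop?.SetValue(document, value);
}

// e.g. SetMetadata(order, "Version", 5L);
```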

I'm not very worried about the query performance against metadata, because as you said, it's probably going to be rare. Insert time though, that matters, and that's why I don't like the JSON serialized header idea.

The bigger question here was on how the metadata is captured. Any thoughts on some of the alternatives?

@oskardudycz
Collaborator Author

Fair point about insert performance. It's extremely important - especially for the event store. I think it'd be worth doing a PoC on those approaches to see how big the impact is and what improvements we could make there. What do you think?

Regarding the capturing - I believe the minimum we have to do is inject the metadata at session initialization - and this option should probably always be available even if we select other options - it's probably the most convenient way to inject telemetry-related params (as they're usually injected per scope).

I'm also not a huge fan of injecting them in SaveChanges; I'd probably prefer to allow passing them in the Insert, Store, AppendEvent methods - as then it'd be more explicit what we're trying to do, and it would still let the result of some custom business logic be injected.

I think we can start with injecting them during session initialization, then gather feedback and extend it later if needed.

P.S. I know that might be out of scope, but I'd also like to add the possibility of adding a stream type column to the events table - that can be useful for the outbox pattern and routing (e.g. to Kafka topics). We could consider using the same or a similar mechanism for that.

@jeremydmiller
Member

jeremydmiller commented Sep 27, 2020

Tasking

Refactorings

  • Use ISelectableColumn to determine what columns need to be fetched in each ISelector
  • Eliminate DocumentMapping.SelectFields()
  • DocumentStorage<,> base class needs to take in a StorageStyle argument to differentiate
  • SelectorBuilder first chooses the columns that are active, then generates everything off of that list.
  • Use a separate base type for DocumentStorage for dirty checking storage
  • Do a type sweep to mark things as internal wherever possible
  • Unit test DocumentMapping.SelectColumns()
  • Make the construction of DocumentTable be lazy on DocumentMapping so we stop building it over and over again

Goals

  • Version and other members are set on QueryOnly selectors
  • Opt out of metadata columns
  • Have soft-deleted values set on documents as they're loaded, similar to versions today
  • CausationId, CorrelationId, LastModifiedBy metadata columns
  • CausationId, CorrelationId, LastModifiedBy metadata are in the DocumentMetadata queries
  • Extensible "header" metadata column
  • Extensible "header" metadata shows up in Metadata queries
  • Test internal/private setters for metadata properties and Fields

Nice to dos

  • Modify Version when updated?

Dev Tasks

  • Convert the DocumentMappingExpression fluent interface to using Metadata<T>
  • Shorthand for disabling informational metadata columns
  • Add XML comments to Metadata
  • Add XML comments to DocumentMetadata
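One possible shape for the opt-out shorthand mentioned above (hypothetical fluent API, loosely modelled on the existing DocumentMappingExpression style; the exact member names are invented):

```csharp
// Sketch: per-document-type opt out of the informational metadata columns
var store = DocumentStore.For(opts =>
{
    opts.Connection(connectionString);

    opts.Schema.For<Order>().Metadata(m =>
    {
        m.LastModified.Enabled = false;
        m.Version.Enabled = false;
    });
});
```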

@jeremydmiller jeremydmiller added this to the 4.0 milestone Sep 28, 2020
@jeremydmiller
Member

@cocowalla I've got the ability to omit the informational metadata columns working in a local branch. It's document type by document type so far, but we'll add some policy support soon too.

@cocowalla
Contributor

@jeremydmiller 🎉 great to hear, thanks for the update!
