May inventories contain properties that aren't defined in the spec? #474

pwinckles · 2020-05-20T22:50:57Z

I was sure that the spec forbade this, but @neilsjefferies just pointed out to me that it in fact does not. Is this intentional?

If so, this would allow for something like, for example, a media type extension that augments inventory files like the following:

```json
{
    "id": "obj1",
    "type": "https://ocfl.io/1.0/spec/#inventory",
    "digestAlgorithm": "sha512",
    "head": "v1",
    "contentDirectory": "content",
    "fixity": {},
    "manifest": {
        "5a..e1": [
            "v1/content/test.txt"
        ]
    },
    "versions": {
        "v1": {
            "created": "2020-05-20T16:26:11Z",
            "message": "initial commit",
            "state": {
                "5a..e1": [
                    "test.txt"
                ]
            }
        }
    },
    "mediaTypes": {
        "5a..e1": "text/plain"
    }
}
```

ahankinson · 2020-05-21T06:28:55Z

It was certainly intended that no other keys can be present in the inventory other than those defined in the spec.

neilsjefferies · 2020-05-21T07:46:54Z

I'm pretty sure it was to enable subsequent versions of OCFL to make additions without breaking backwards compatibility.

ahankinson · 2020-05-21T07:55:10Z

If new keys are added to the inventory then that would necessitate a new minor version of the spec, but it wouldn't break backwards compatibility. If the current keys defined in the spec were to be renamed or removed, it would break backwards compatibility but it would also require a major version change.

But in both cases I don't believe it was ever intended that keys not defined in the spec are permitted in the inventory. I think the MUSTs in the inventory section are missing an explicit 'and MUST NOT contain any other keys'.
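As a rough illustration of what such a 'MUST NOT contain any other keys' rule would mean for a validator, the check below flags any top-level key outside a fixed set. It is a minimal sketch, not the OCFL validator: `SPEC_KEYS` is assumed from the keys used in the example inventory at the top of this thread, and `unknown_keys` is a hypothetical helper name.

```python
# Top-level keys assumed to be defined by the spec; this set is taken from
# the example inventory in this thread, not copied from the spec text.
SPEC_KEYS = frozenset({
    "id", "type", "digestAlgorithm", "head",
    "contentDirectory", "fixity", "manifest", "versions",
})


def unknown_keys(inventory):
    """Return the top-level keys not in SPEC_KEYS, sorted for stable output."""
    return sorted(set(inventory) - SPEC_KEYS)


# Abbreviated version of the example inventory with the extra mediaTypes key.
example = {
    "id": "obj1",
    "type": "https://ocfl.io/1.0/spec/#inventory",
    "digestAlgorithm": "sha512",
    "head": "v1",
    "mediaTypes": {"5a..e1": "text/plain"},
}

print(unknown_keys(example))  # -> ['mediaTypes']
```

Under this reading, the `mediaTypes` example would be rejected unless a later spec version added that key to the allowed set.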

neilsjefferies · 2020-05-21T08:16:43Z

If there is a MUST NOT clause then adding new keys in new versions breaks backwards compatibility. A V1.1 Inventory would fail V1.0 validation. This was something I thought we wanted to avoid as much as possible.

However, now we have the concept of an extensions mechanism, which did not exist when we worked on this bit, we have the additional possibility of having an "extension" key - which can contain all the extensions relevant to the object with their parameters. We can then require non-OCFL keys to be encapsulated in an extension. Thoughts...?

pwinckles · 2020-05-21T11:53:23Z

I had assumed that inventories were to be validated against the OCFL version specified in their type.

There are some ambiguities here that should probably be clarified in the spec. Here are some more questions:

- Can an object created as OCFL v1.0 later be changed to v1.1 in a similar fashion to changing digestAlgorithm?
- If yes, then which version does the object conformance declaration reflect?

If multiple OCFL spec versions exist with substantive differences in inventory serialization, it becomes problematic to deserialize inventory files to anything other than a basic map structure, because the format cannot be known without reading the inventory contents.

That aside, if you were to allow inventories to include keys that are not defined in the spec, I would certainly feel better about it if they were at the very least encapsulated in some way, as @neilsjefferies was suggesting.

bcail · 2020-05-21T11:57:44Z

A mediaType extension like @pwinckles's example would be interesting to me. mediaType/mimetype is one of the system properties handled by Fedora 3. I can move the mimetype to another file in OCFL, but if it were an option, I might put the mimetype in inventory.json. And I'd be fine with it being encapsulated in an "extensions" key. But I don't have to have it in inventory.json at all - I can handle it either way.

zimeon · 2020-05-21T12:17:22Z

My recollection is that it was intentional to specify only the necessary structure of the inventory, so that there could be evolution or extension anywhere else. Boxing within extensions is fine for extensions but doesn't allow good evolution to future versions.

pwinckles · 2020-05-21T12:25:23Z

Is the inventory_schema.json up to date? It does not allow undefined properties: https://github.com/OCFL/spec/blob/master/draft/spec/inventory_schema.json#L7

ahankinson · 2020-05-21T12:26:08Z

Validating a 1.1 inventory with a 1.0 validator is a break in forward compatibility (the object is newer than the code that is verifying it), but I think that's expected. (You can't build a validator that will predict the future...)

If the new keys added in the subsequent version were required keys I think it should mean a major version change (e.g. 2.x), but if they were optional it would only require a minor version bump. I don't see how a blanket 'MUST NOT' would break compatibility, though -- if new keys are added, a new version of the spec is created, and then the MUST NOT no longer forbids that key. If, for example, @pwinckles's suggestion of a mediaTypes key gets added to 1.1, then it is no longer forbidden and can be included in v1.1 inventories.
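The versioned reading of the MUST NOT described above can be sketched as a check that picks its allowed key set from the version declared in the inventory's `type`. This is only an illustration under stated assumptions: the `"1.1"` entry is hypothetical (no such key set is defined in this thread), the URI pattern is inferred from the `type` value in the example inventory, and `declared_version` and `forbidden_keys` are made-up helper names.

```python
import re

BASE_KEYS = frozenset({
    "id", "type", "digestAlgorithm", "head",
    "contentDirectory", "fixity", "manifest", "versions",
})

# Allowed top-level keys per spec version. The "1.1" entry is hypothetical:
# it assumes a future minor version adds an optional mediaTypes key.
ALLOWED_KEYS = {
    "1.0": BASE_KEYS,
    "1.1": BASE_KEYS | {"mediaTypes"},
}

# Pattern inferred from the example value "https://ocfl.io/1.0/spec/#inventory".
TYPE_PATTERN = re.compile(r"^https://ocfl\.io/(\d+\.\d+)/spec/#inventory$")


def declared_version(inventory):
    """Extract the spec version string from the inventory's type URI."""
    type_value = inventory.get("type")
    match = TYPE_PATTERN.match(type_value) if isinstance(type_value, str) else None
    if match is None:
        raise ValueError("unrecognised inventory type")
    return match.group(1)


def forbidden_keys(inventory):
    """Return keys not allowed by the version the inventory itself declares."""
    version = declared_version(inventory)
    if version not in ALLOWED_KEYS:
        # An older validator cannot know the rules of a newer spec version.
        raise ValueError(f"no rules known for OCFL {version}")
    return sorted(set(inventory) - ALLOWED_KEYS[version])


v10 = {"id": "obj1", "type": "https://ocfl.io/1.0/spec/#inventory", "mediaTypes": {}}
v11 = dict(v10, type="https://ocfl.io/1.1/spec/#inventory")

print(forbidden_keys(v10))  # -> ['mediaTypes']
print(forbidden_keys(v11))  # -> []
```

The same `mediaTypes` key is rejected in an inventory that declares 1.0 and accepted in one that declares the hypothetical 1.1, while a validator with no entry for the declared version fails outright rather than guessing.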

I think we should be clear about the purpose of the inventory file. In my mind, it contains only the data required to effectively track the changes to the files in the object. What I don't think it should be is a sort of config file for capturing different options and behaviours of the object. That would be my concern if we started putting keys, such as extensions, in it.

neilsjefferies · 2020-05-21T14:56:36Z

Yep, @ahankinson, I meant forwards. This does potentially matter though: an object version (and thus its inventory) should always be valid within all future versions of OCFL (since it is immutable). This actually places quite a few restrictions on what we can do with inventory entries. For example, it is not possible to require new keys without version-specific language. @pwinckles I think this answers your question on conformance and upgrading too. A V2.0 object can contain V<2.0 versions and they should be valid. Inventory versions can only ratchet upwards with new versions, obviously!

Being overly proscriptive about keys doesn't prevent any failure modes or add new capabilities as far as I can see - all it does is add an additional compatibility issue for no obvious benefit.

@zimeon Hence I said "non-OCFL" keys should be in extensions; future OCFL versions should be able to specify additional keys. This needs careful wording though.

In the case of digests and fixity outside the OCFL standards, it does make sense that some reference can be made in the inventory to the relevant extension.

ahankinson · 2020-05-21T15:20:58Z

Uhhh... I don't think I agree with @neilsjefferies. We haven't really made any declarations about whether OCFL Objects can have mixed versions, and the presence of the 0=ocfl_object_1.0 NamAsTe file in the root as a declaration of the OCFL Object version would make that difficult.

zimeon · 2020-05-21T17:15:26Z

To my mind, the handling of mixed-version objects is an issue to defer for now.

ahankinson · 2020-06-02T13:41:29Z

Editors' meeting: Decision was to disallow attributes not specified in the spec, under the principle that it would be better to restrict behaviours and then gradually open them up than to do the opposite if it becomes a problem. This will be open for community feedback and more use-case gathering post-1.0.

Fixes #474

bdwheele · 2020-06-11T20:45:39Z

Crud, it looks like I missed the ticket. At IU we're starting to look at OCFL as a potential storage format, and since we're using tapes as our storage, being able to hold some technical metadata (or other storage information) about the files in an object as part of the inventory would be a big help to reduce tape access if someone just wants to know the (rough) size of the object or duration information.

Until the 2.0 release cycle, is there an option to add a use-at-your-own-peril key that could be used to store that information and still validate? I'm thinking in the same vein as IANA's X-* MIME types.

ahankinson · 2020-06-11T21:35:26Z

Hi @bdwheele, the inventory wasn’t really designed to hold metadata about the files, it was primarily designed to make the versioning system in OCFL work. Since it’s trying to keep a level of compatibility across time and across clients, and because we didn’t feel like we had gathered enough use cases for this, we felt it was best to follow Postel’s law and “be conservative in what we validate”.

The equivalent to the “x-*” mimetype would probably be an extension that gathered the relevant metadata you need.

bdwheele · 2020-06-12T12:56:25Z

That's fair enough and I understand the rationale, but if I may, I'd like to offer a bit of background on where I'm coming from to provide some context.

Here at Indiana University we have several decades of scanned documents (photos & books) as well as a sizeable collection of A/V material (~14PB) all of which are stored on a proprietary tape system (HPSS). As a further fly in the ointment, we share the tape system with the rest of the university so we have to be good citizens and not inadvertently create denial-of-service to the other units.

We're currently looking at what our preservation situation is going to look like for the future since what we have is a mix of several different systems. OCFL has come up multiple times during our investigation and it looks interesting. We have not decided if we want to use someone else's management software (with modifications to use our storage) or if we want to write our own, or even a mix of the two.

Using a tape system creates a lot of headaches when managing the content due to latency issues, so we try to collect as much information about the objects as possible before they're committed to storage. The downside currently is that the information is kept in two separate places: a copy in our database(s) and one on the tape storage (in several cases). We would like to make sure that the metadata we've collected in both places is consistent, or could be rebuilt without reading the files since that's incredibly time consuming.

It seems to me that having the ability to store arbitrary (and explicitly separated) data in the inventory would be a good thing. With that (and RFC 760's "Liberal in its receiving" text) in mind, would it be possible to create a 'private' top-level key that would be used by management tools to store arbitrary data about the object, which would not be validated by the OCFL validator beyond being syntactically correct? Within the private node it would probably be wise to suggest an application ID (such as edu_indiana_dlib_archivemanager or something) to allow multiple applications to store metadata without interfering with either the OCFL content or content stored by other applications.

For IU we'd probably want to store technical metadata (size, mime type, stream information, etc), but a tar-like application that generates OCFL would likely include ownership and permissions. One assumes that "immutable" descriptive metadata would also be an option (ownership, alternate IDs, title, etc).

A solution of that nature would be forward compatible because it is up to the application to manage that content. Data stored in the private space would be ignored by parsers reading the 1.0(?) spec. If a future specification included fields for commonly used metadata, it would be up to the application to upgrade the package (because there are other files involved beyond the inventory) and it would be able to deal with backward compatibility by looking at both the future spec's location as well as its own private data.
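To make the shape of this proposal concrete, here is a minimal sketch of how a validator and an application might treat such a key. Everything in it is an assumption for illustration: the key name `private`, the per-application layout, the `sizes` payload, and the helper names `check_private` and `app_data` are hypothetical, and the decision recorded earlier in this thread means OCFL 1.0 inventories would not validate with such a key.

```python
def check_private(inventory):
    """Check only the outer shape of the hypothetical 'private' key.

    The payload of each application is deliberately not inspected.
    """
    if "private" not in inventory:
        return []
    private = inventory["private"]
    if not isinstance(private, dict):
        return ["'private' must be a JSON object"]
    return [
        f"'private.{app_id}' must be a JSON object"
        for app_id, payload in private.items()
        if not isinstance(payload, dict)
    ]


def app_data(inventory, app_id):
    """Return one application's data, or {} if it stored nothing.

    Assumes check_private() reported no problems for this inventory.
    """
    return inventory.get("private", {}).get(app_id, {})


inventory = {
    "id": "obj1",
    "head": "v1",
    "private": {
        # Application ID taken from the comment above; the payload is invented.
        "edu_indiana_dlib_archivemanager": {
            "sizes": {"5a..e1": 1024},
        },
    },
}

print(check_private(inventory))  # -> []
print(app_data(inventory, "edu_indiana_dlib_archivemanager")["sizes"])
```

Keying the data by application ID keeps each tool's metadata separate, so a reader that does not recognise an ID can skip that entry without touching the rest of the inventory.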

An additional benefit to adding this space is that it would provide real usage data for future directions for OCFL: if all of the management applications are storing file size, for example, that would lend credence to adding a size field for future versions of OCFL.

Thank you for your time.

pwinckles · 2020-06-12T13:31:36Z

@bdwheele In your use case, is there a prohibiting factor that makes storing metadata in an object content file unacceptable?

University of Technology Sydney is currently doing this by combining OCFL and ro-crate.

bdwheele · 2020-06-12T18:59:10Z

@pwinckles, no, there isn't anything absolutely preventing us from going that route and it could work. So it's not a blocker, but embedding the immutable metadata in with the inventory could offer advantages in some situations:

- Reading the information about the object would require only a single tape read. Due to the system we have, there's no guarantee that two files would end up on the same tape. In HPSS, files pushed at a single time are often copied to separate tapes to speed up storage. Even if files were stored on the same tape, some metadata (such as descriptive metadata for the object as a whole) would most likely incur a substantial seek time to grab the current version inventory and the metadata that was written at object creation time.
- If arbitrary metadata is allowed in the inventory then everything one needs to know about the object is in a single document, even if the consumer doesn't understand or care about all of it. Loading those documents into something like MongoDB and doing collection queries would be trivial. It's true that documents could be merged from multiple files, but which files? It may be different on objects from different sources or different types of objects.
- There's nowhere in the OCFL specification (at least that I see) that is the equivalent to BagIt's tags for organization, contact information, or description. If someone gets a tarball of an OCFL object there's no context (beyond the ID -- which may not be resolvable outside the creator's organization) about what the object is or where it came from. By allowing packaging tools to put information into the inventory, if that's something that's deemed important by the organization, there's a place to put it. It may not be a rigorously defined place, but it would get you into the self-documentation ballpark.

pwinckles mentioned this issue May 21, 2020

Object and inventory versioning #475

Closed

zimeon added this to the 1.0 milestone May 21, 2020

zimeon added Needs Discussion OCFL Object labels May 21, 2020

ahankinson self-assigned this Jun 2, 2020

zimeon added Decided An editorial decision that was decided and removed Needs Discussion labels Jun 2, 2020

ahankinson added a commit that referenced this issue Jun 9, 2020

Fixed: disallow arbitrary keys

ee81d1c

Fixes #474

ahankinson mentioned this issue Jun 9, 2020

Fixed: disallow arbitrary keys #478

Merged

julianmorley closed this as completed in #478 Jun 15, 2020

julianmorley pushed a commit that referenced this issue Jun 15, 2020

Fixed: disallow arbitrary keys (#478)

408d828

* Fixed: disallow arbitrary keys
  Fixes #474
* Fixed: Addressing review comments
  Moved MUST NOT constraint to section introduction

bcail mentioned this issue Aug 24, 2020

Adding data to inventory.json OCFL/Use-Cases#37

Closed

tomwrobel mentioned this issue May 10, 2023

Adding a file size key to the inventory #629

Closed
