Clarify whether an OCFL Object Root may contain other files/directories not specified #373

Closed
zimeon opened this issue Sep 18, 2019 · 52 comments · Fixed by #381

@zimeon
Contributor

zimeon commented Sep 18, 2019

By my reading of 3. OCFL Object it is not clear whether files or directories not specified in section 3 (conformance decl., version dirs, inventory, inventory checksum, logs dir) are permitted. I think we should add language to 3.1 Object Structure along the lines of either:

The OCFL Object Root MUST NOT contain files or directories other than those specified in the following sections.

or:

The OCFL Object Root MAY contain additional files or directories not specified in the following sections. OCFL-compliant tools (including any validators) MUST ignore any such additional files or directories.

(or one could allow just files, or just directories)
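To make the difference between the two wordings concrete, here is a minimal sketch of the check a validator might run under the stricter MUST NOT option. The file and directory name patterns are assumptions based on the items listed above (conformance declaration, version directories, inventory, inventory checksum, logs directory), and `unexpected_entries` is a hypothetical helper, not part of any OCFL tool.

```python
import re
from pathlib import Path

# Assumed name patterns for the entries section 3 specifies; illustrative only.
ALLOWED_FILES = re.compile(
    r"^(0=ocfl_object_\d+\.\d+|inventory\.json|inventory\.json\.[a-z0-9]+)$"
)
ALLOWED_DIRS = re.compile(r"^(v\d+|logs)$")


def unexpected_entries(object_root: Path) -> list[str]:
    """Return names in the object root that the strict wording would forbid."""
    extras = []
    for entry in sorted(object_root.iterdir(), key=lambda p: p.name):
        pattern = ALLOWED_DIRS if entry.is_dir() else ALLOWED_FILES
        if not pattern.match(entry.name):
            extras.append(entry.name)
    return extras
```

Under the MUST NOT wording a validator would report every returned name as an error; under the MAY wording it would skip the check and ignore those entries.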

@zimeon
Contributor Author

zimeon commented Sep 18, 2019

See discussion in #369 that raises this question.

Per discussion in #320 we came down against allowing a deposit directory in the Object Root which implies that no additional files are allowed.

@rosy1280 rosy1280 added this to the 1.0 milestone Oct 2, 2019
@rosy1280
Contributor

rosy1280 commented Oct 2, 2019

this issue also relates to discussion in #367

@rosy1280
Contributor

rosy1280 commented Oct 2, 2019

we decided that we would like to go with:

The OCFL Object Root MUST NOT contain files or directories other than those specified in the following sections.

this is similar to decisions that BagIt has made https://tools.ietf.org/html/rfc8493#section-3

For BagIt 1.0, every payload file MUST be listed in every payload manifest. Note that older versions of BagIt allowed payload files to be listed in just one of the manifests.

@rosy1280 rosy1280 self-assigned this Oct 2, 2019
ahankinson pushed a commit that referenced this issue Oct 2, 2019
* updates object structure section.  fixes #373

* line length

* fix paragraph

* pretty formatting
@rosy1280
Contributor

rosy1280 commented Oct 23, 2019

I'd like to revisit this ticket and discuss changing the language from

The OCFL Object Root MUST NOT contain files or directories other than those specified in the following sections.

to

The OCFL Object Root SHOULD NOT contain files or directories other than those specified in the following sections.

@rosy1280 rosy1280 reopened this Oct 23, 2019
@zimeon
Contributor Author

zimeon commented Oct 23, 2019

I wonder whether, if we make the change to allow additional directories, it might be better to allow a specific additional directory rather than any files or directories. If we wanted to avoid implying a suggested use then a generic name could be picked (e.g. ancillary, misc, extra, xyz etc.). If we wanted to emphasize that the content of an extra directory does not benefit from the preservation infrastructure of versioned content in the OCFL object then a name implying that might be chosen (e.g. unversioned, tmp etc.). However, in the latter case one risks tacitly blessing a particular use.

@neilsjefferies
Member

var - because it can be variable? cf linux filesystem hierarchy too. It also sorts at the end of the version directories for mutable head peeps.

@neilsjefferies
Member

...also abbreviation for version-at-risk :)

@zimeon
Contributor Author

zimeon commented Oct 23, 2019

var would be just soooo cute... and even sorts v1, v2, v3, var ;-)

@awoods
Member

awoods commented Oct 23, 2019

Specifying "OCFL Objects MAY include a var directory" sounds good.

@rosy1280
Contributor

Hi. Pot stirrer here. What's the difference between var and log?

In #320 there is discussion of using the log directory for this very purpose but then it's abandoned. I want to make sure that we don't end up back in this same loop of a conversation.

@zimeon
Contributor Author

zimeon commented Oct 24, 2019

I don't recall why we stopped talking about logs in #320 after #320 (comment) . I think the difference now is spelled out in the intended use of logs in the current draft:

... The base directory of an OCFL Object MAY contain a directory named logs, which MAY be empty. Implementers SHOULD use this for storing files that contain a record of actions taken on the object. ...

One could, I suppose, generalize the description of logs to say anything goes. But then we are diluting the clarity around purpose of the logs directory which we expect to be widely used. Whereas I suspect var would be used only in a smaller fraction of implementations.

@neilsjefferies
Member

My take is for logs to contain information that does NOT have any impact on versioning - fixity checks that were OK, copying of the object from one store to another etc. var contains information that IS pertinent to object versioning in some way (mutable head being the most common one). My approach would be to allow var in the spec with a brief description to this effect and then have mutable head, versionless objects and placeholder objects in a "special case objects" section of the implementation notes. Also, the presence of logs is permissible, I would expect var to generate a warning and anything else that's not a version to be an error.
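The three-way outcome proposed in this comment (logs permitted, var a warning, anything else an error) could be sketched as follows. The function name `classify_directory` and the string outcomes are illustrative assumptions; the thread does not define a validator API.

```python
import re

# Version directories are assumed to look like v1, v2, ... (optionally zero-padded).
VERSION_DIR = re.compile(r"^v\d+$")


def classify_directory(name: str) -> str:
    """Map a directory name found in the object root to a validation outcome."""
    if VERSION_DIR.match(name) or name == "logs":
        return "ok"
    if name == "var":
        # Allowed, but flags that something versioning-related sits outside the versions.
        return "warning"
    return "error"
```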

@rosy1280
Contributor

I like this language

var contains information that is pertinent to object versioning in some way.

Because it is more general and I'm really not interested in adding more and more directories every time an implementation comes up with some new directory they need for whatever reason.

That being said, I would like to talk about this in next week's meeting to make sure we are all on the same page because I recognize that @ahankinson and @julianmorley haven't weighed in yet.

@ahankinson
Contributor

ahankinson commented Oct 24, 2019

I'm not so keen on var. I think it would raise more problems than it solves because it's so well established in other contexts.

I'm less interested in what we call it than the mechanics of how it works. Logs is completely unstructured because we felt that it was important to keep a log of actions, but that those actions don't affect the content.

This is specifically about the content, so if we are going to do it I think we MUST specify its behaviours. I think there are some bigger questions that we need to decide on:

  1. Does it need to be a consistently valid OCFL version, or can it be 'miscellaneous' content that is unversioned until a versioning action is specifically taken?
  2. Do files in the 'unversioned' folder get digests at point of ingest, or at point of versioning?
    1. If point of ingest, where do these get tracked? inventory.json? Does that inventory always track the most recent version? Is there any danger that the unversioned inventory can ever be behind the versions?
    2. If point of versioning, where do the digests live when they are accessioned?

@awoods
Member

awoods commented Oct 24, 2019

My thinking on the mutable-head directory (var or whatever makes sense), is that its contents represents an internally valid OCFL Object.

In other words, its contents would pass a run of a validator, with the exception that its inventory.json file is not the same as the top-level inventory.json file that is found in the OCFL Object root.

I believe that also answers question (2.) above.

@neilsjefferies
Member

I really don't want OCFL to have to specify behaviour, it is a location for out-of-band, implementation specific stuff. It's there so when migration/recovery needs to happen, we know there are some dangling bits there that will need resolving. That being said, I've just posted some Implementation Notes that outline a not-completely-dumb thing to do with var in line with @awoods suggestion but that does not preclude other uses.

@ahankinson
Contributor

@awoods is there any danger that the inventory.json in the 'unversioned' directory would ever get behind the version in the vN directories?

@ahankinson
Contributor

and then what happens to resolve it?

@neilsjefferies
Member

Why would it matter, @ahankinson?

@ahankinson
Contributor

Situation:

  • Client A uses the inventory from v3 as the basis for the inventory in the unversioned directory. It adds files in the unversioned directory and writes the inventory with the hashes.
  • Client B adds files and creates v4.
  • Client A wants to mint a new version based on the unversioned directory contents, but it doesn't know anything about v4. Its inventory is invalid now because it doesn't contain the files from the new version.
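The situation above can be detected mechanically if the staged inventory remembers which version it was derived from. This is a minimal sketch under that assumption: `based_on` is a hypothetical number recorded by Client A when it started staging, and both function names are illustrative.

```python
import re


def head_version(version_dirs: list[str]) -> int:
    """Return the highest version number among names like v1, v2, ... (0 if none)."""
    numbers = [
        int(match.group(1))
        for name in version_dirs
        if (match := re.fullmatch(r"v(\d+)", name))
    ]
    return max(numbers, default=0)


def staged_inventory_is_stale(based_on: int, version_dirs: list[str]) -> bool:
    """True if a version newer than the staging baseline now exists in the object root."""
    return head_version(version_dirs) > based_on
```

In the scenario described, Client A would call `staged_inventory_is_stale(3, [...])` before minting, and the check would turn true once Client B has written v4.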

@neilsjefferies
Member

...and version forking is something we need to consider for V2.

@ahankinson
Contributor

ahankinson commented Oct 24, 2019

But if we haven't specified behaviours, it's entirely within the spec to do this. Some clients SHOULD support it, others don't have to.

It couldn't happen at any time now because versions are sequential and writing a new version writes a sequential number. Also the inventory.json is directly derived from the most immediately preceding version at time of accession.

Yes, multiple clients can work on the content. This is the point of the "Folder structure as API" idea.

@ahankinson
Contributor

If we're implementing "mutable head" then var does not communicate that idea very well. head communicates sequence (or next?) var communicates unstructured, unsequential stuff.

@neilsjefferies
Member

OCFL mandates no behaviours, just the eventual layout. The Implementation Notes are suggestions for rational behaviour. Mutable head is just one use for var, I can envisage others (e.g. a transaction log that contains multiple versioning events).

@ahankinson
Contributor

I am not saying that it mandates behaviours. OCFL as written, however, presents a structured way of managing change. This proposal throws those managed behaviours into disarray, since it injects a non-sequential method of producing versions -- basically a non-resolvable fork that may or may not resolve into some version.

If we want that to be the next version, then we need to specify that, and we should not call it var. If we want there to be an inventory in it, at the very least to store digests at point of accession, then we need to make that clear.

Otherwise we will end up with just a folder system that imposes spec overhead with no practical benefit. You can have an OCFL object with no versions, at which point there's not really much use in using OCFL, is there?

@neilsjefferies
Member

Hmm, maybe we reduce this to no more than... If a standard OCFL client encounters var then the object should be considered read-only (because we can't safely update) and reads should return the latest OCFL version as indicated by the inventory. Application specific clients that understand the contents of var may behave differently. And leave it at that.
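The reduced rule suggested here amounts to a single guard before any write. A minimal sketch, assuming the directory is literally named `var` and using a hypothetical `understands_var` flag to stand in for an application-specific client:

```python
from pathlib import Path


def may_update(object_root: Path, understands_var: bool = False) -> bool:
    """A standard client treats the object as read-only whenever var is present."""
    return understands_var or not (object_root / "var").is_dir()
```

Reads are unaffected by this check: a standard client would still return the latest version named in the root inventory.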

@pwinckles

pwinckles commented Oct 25, 2019

I hear what @ahankinson is saying about the complications that are introduced with the forking. I think if I were implementing this, I might do something like the following:

  1. When an object is updated without creating a new OCFL version and has no pre-existing staged changes, the changes are written to [object root]/var/head. The contents of this directory are the same as a valid OCFL version directory, including an inventory.json file. The staged content is recorded in the inventory as version n+1.
  2. When the staged content is "officially" versioned, the var/head directory is moved to the object root, using the version number it was assigned when var/head/inventory.json was originally created, and the inventory is moved into the object root.
  3. This operation fails if there is already a directory in the object root with the same version number. This indicates that the object was updated since the original fork was made. The HEAD changes could either be discarded (simplest), attempted to be applied on top of the current object state and discarded if there's a conflict that can't be resolved, or set aside for a user to resolve manually.
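Steps 2 and 3 above could look roughly like the sketch below. It assumes the staged content lives in `var/head` as described, takes the version name as a parameter rather than reading it from the staged inventory, and copies (rather than moves) the inventory into the object root so the version directory keeps its own copy. `promote_staged_head` and `StagedVersionConflict` are hypothetical names, and the move-then-copy sequence is not atomic.

```python
import shutil
from pathlib import Path


class StagedVersionConflict(Exception):
    """Raised when the version number assigned to the staged head is already taken."""


def promote_staged_head(object_root: Path, version_name: str) -> Path:
    """Move var/head into the object root as version_name and update the root inventory."""
    staged = object_root / "var" / "head"
    target = object_root / version_name
    if not staged.is_dir():
        raise FileNotFoundError(f"no staged content at {staged}")
    if target.exists():
        # Step 3: the object was updated after the staged head was forked.
        raise StagedVersionConflict(f"{version_name} already exists in {object_root}")
    shutil.move(str(staged), str(target))
    # Assumption: the root inventory mirrors the newest version's inventory.
    shutil.copy2(target / "inventory.json", object_root / "inventory.json")
    return target
```

How a caller reacts to `StagedVersionConflict` (discard, rebase, or set aside for manual resolution) is left open, matching the three options listed in step 3.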

To me, some of the issues @ahankinson has described that involve multiple clients mutating OCFL objects with different expectations are configuration issues. If you are using multiple different clients on the same OCFL repository that have different ways of doing things, of course you're going to run into problems. It's also a problem if they don't both map object ids to paths in the same way, for example. If you want your repository to support a mutable head, then you should only access the repository with clients that support it.

@pwinckles

That said, you are in a bad place if a new version is created while there is still unversioned content under var. When a client that is "mutable head aware" retrieves the current version of an object, what does it return? The mutable version is no longer the most recent. It would likely just need to purge the unversioned content. It also means that before accessing an object, such a client needs to compare the current state of the object to the object state in the head directory, looking for conflicts. None of this is impossible, but it is certainly more complicated.

@neilsjefferies
Member

Agreed, if you have multiple clients writing to your store and you don't know if they agree on how objects work you have far bigger problems. Even vanilla OCFL does not (can not) specify how simultaneous updates are managed between clients. Personally, I don't like the idea of mutable head or partially assembled versions in an OCFL object. However, var provides a location for the implementation of such functions in a way that an OCFL client can detect and work around (either back off, as I suggested earlier, or scrub back to a known good version as @pwinckles suggested). Perhaps the Extensions mechanism might be used to codify particular uses of var rather than being part of OCFL core?

@birkland
Contributor

A little late to the party, but this does seem like one plausible way forward among many:

Perhaps the Extensions mechanism might be used to codify particular uses of var rather than being part of OCFL core?

As it stands now, any extension along these lines would violate the spec, due to an absolute prohibition on additional files or directories in an object root, except those explicitly enumerated. Allowing a var at least would allow extensions to fill in the details, maybe becoming incorporated or referenced in the next spec version where that makes sense.

Altering the spec to say an object root SHOULD NOT have any additional directories except as defined in an extension (which is basically @rosy1280's original suggestion when reopening this issue) is even more flexible, and would mean the spec editors wouldn't have to commit to a specific var directory at this time.

@rosy1280
Contributor

Thanks @birkland for writing what I was just about to.

The conversation in this thread has convinced me we should take the simplest approach -- switch the language from MUST NOT to SHOULD NOT and leave how to handle everything else to the individuals implementing the spec (including writing an extension if they want).

@neilsjefferies
Member

I would prefer to keep a named directory...just so a vanilla client/person coming to the object can easily tell if there are version extensions in use or not and act accordingly.

@zimeon
Contributor Author

zimeon commented Oct 27, 2019

There seems to be rough consensus on allowing additional files in the object. Like @neilsjefferies, I would prefer allowing just a named directory for tighter validation and easier client checks.

I like language along the lines of @awoods suggestion "OCFL Objects MAY include a var directory" and then not saying anything/much about the contents but instead suggesting "Uses and conventions for the var directory may be defined in OCFL Community Extensions".

@ahankinson
Contributor

I am definitely NOT in favour of calling it var. IF we are going to allow it, it should follow terminology that makes it clear that it has a role in the versioning of the content.

@neilsjefferies
Member

How about version_extensions or v_extensions then? We want to avoid anything that indicates a particular application since that would be equally confusing.

@zimeon
Contributor Author

zimeon commented Oct 28, 2019

Hmmm... I thought var was rather good (though I'm certainly not stuck on that name) precisely because it doesn't say much about how it is used. I like the idea that usage patterns are defined in extensions and may be things nobody has yet thought of.

@awoods
Member

awoods commented Oct 28, 2019

One outstanding question is whether we want this "yet-unnamed-directory" to be specific to the mutable-head use case, or open to any, potentially yet unforeseen use case?

@ahankinson
Contributor

How about this?

  1. We define a new directory, extensions which MAY exist (similar to logs).
  2. We specify that any extended use of OCFL that does not fall into the 'normal' versioning workflow use this directory. Every extension MUST have a published description, and MUST specify the directory name they use within extensions. So, for example, extensions/fedora, extensions/dspace, extensions/oxford, extensions/cornell, extensions/mutable-head, etc. (Publishing and specifying would help prevent collisions, e.g., two different implementations of mutable-head).
  3. Implementers SHOULD publish a plain-text version of their extension in the OCFL Storage Root.
  4. Any requirements for validation SHOULD be specified in the extension. OCFL validators MUST ignore the extensions directory and any sub-directories. Here be dragons.
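As a rough illustration of points 1, 2 and 4, the sketch below lists the extension names an object declares and shows a validator skipping the `extensions` directory entirely. The directory name comes from the proposal above; both helper functions are hypothetical.

```python
from pathlib import Path

EXTENSIONS_DIR = "extensions"  # name taken from the proposal above


def extension_names(object_root: Path) -> list[str]:
    """Return the sub-directory names under extensions/, e.g. ['mutable-head']."""
    ext_dir = object_root / EXTENSIONS_DIR
    if not ext_dir.is_dir():
        return []
    return sorted(p.name for p in ext_dir.iterdir() if p.is_dir())


def entries_to_validate(object_root: Path) -> list[str]:
    """Return object-root entries a core validator would look at, ignoring extensions/."""
    return sorted(p.name for p in object_root.iterdir() if p.name != EXTENSIONS_DIR)
```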

@neilsjefferies
Member

@awoods Not just mutable-head so I like @ahankinson's suggestion for a namespace below. However, we can't completely ignore it - if it exists then clients that are not compatible must not attempt to update an object.

@ahankinson
Contributor

That leads us into murkier waters, since it is possible that there can be non-content extensions. It might be possible, for example, to have an extension to store lossy derivatives of lossless content. (I'm thinking on the fly here). That would exist outside of the versioning flow, so clients could safely ignore it.

@neilsjefferies
Member

...which would be why I suggested version_extensions rather than just extensions since the concern here is stuff that has an impact on versioning.

@ahankinson
Contributor

I don't think we can simultaneously propose a mechanism for extending the core functionality of the spec into non-core territory, while also specifying how clients should act when they see that. Either we're completely hands-off, including validation of extension components, or we specify the behaviours of the 'mutable head' in the core.

@zimeon
Contributor Author

zimeon commented Oct 28, 2019

I think there is possible middle ground along the lines of "implementations MAY use the ABC extension mechanism, clients that do not understand the specific extension MUST/MUST-NOT do XYZ"

@ahankinson
Contributor

I don't know that we'll be able to pre-determine the value of "XYZ" beforehand. It could be about versioning, or it could be something that we have not anticipated.

@zimeon
Contributor Author

zimeon commented Oct 28, 2019

Yup, that is the tricky bit. For digests we say "Optional fixity algorithms that are not supported by a client must be ignored by that client." (https://ocfl.io/draft/spec/#digests)

@ahankinson
Contributor

So since extensions are, by definition, not covered by the spec, I think we need to adopt a similar approach. "Extension behaviours not covered by the spec must be ignored by clients that do not implement those extension behaviours".

@ahankinson
Contributor

Which means that non-extension aware clients can't assume any additional behaviours that are dependent on specific extensions being present.

Especially since it's possible to have extensions in some OCFL objects in a storage root, but not in others. :-/
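The "ignore what you do not implement" rule, applied per object, might be sketched like this. The extension names and the `partition_extensions` helper are illustrative assumptions; the point is only that the decision is made from what each individual object contains, not from the storage root as a whole.

```python
def partition_extensions(
    present: list[str], supported: set[str]
) -> tuple[list[str], list[str]]:
    """Split an object's extensions into those this client handles and those it ignores."""
    handled = [name for name in present if name in supported]
    ignored = [name for name in present if name not in supported]
    return handled, ignored


# Example: a client that only implements a hypothetical "mutable-head" extension.
handled, ignored = partition_extensions(["fedora", "mutable-head"], {"mutable-head"})
```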

@awoods
Member

awoods commented Oct 29, 2019

Related issue: "Add 'extensions' support to spec" - #403

@awoods
Member

awoods commented Nov 5, 2019

Superseded by #403

@awoods awoods closed this as completed Nov 5, 2019
8 participants