Add to implementation notes a discussion of the idea of temporary space while building OCFL Objects #320

Closed
zimeon opened this issue Mar 14, 2019 · 21 comments

6 participants
@zimeon
Contributor

commented Mar 14, 2019

From discussion in 2019.03.13 Community Meeting there might be a need for a "draft" or "tmp" directory for active OCFL Objects.

@zimeon

Contributor Author

commented Mar 14, 2019

I wonder whether there might be a case for a normative (specification) addition that says some particular directory (maybe tmp) MAY be present but MUST be ignored. This would support validation of an object that is actively being worked on for all but the final creation of the new version (something along the lines of mv tmp vN; cp -p vN/inventory.json vN/inventory.json.sha512 .).
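A minimal sketch of that final step, assuming the working directory is literally named `tmp` and already holds the new version's `inventory.json` and `inventory.json.sha512`. The function name `commit_tmp_version` is hypothetical and only illustrates the `mv` followed by `cp -p` sequence described above.

```python
import shutil
from pathlib import Path


def commit_tmp_version(object_root: Path, version: str, tmp_name: str = "tmp") -> None:
    """Promote the working directory to a version directory (hypothetical helper)."""
    tmp_dir = object_root / tmp_name
    version_dir = object_root / version
    if version_dir.exists():
        raise FileExistsError(f"{version_dir} already exists")

    # Equivalent of `mv tmp vN`: a rename within one filesystem.
    tmp_dir.rename(version_dir)

    # Equivalent of `cp -p vN/inventory.json vN/inventory.json.sha512 .`
    # copy2 preserves timestamps and permission bits where the platform allows.
    for name in ("inventory.json", "inventory.json.sha512"):
        shutil.copy2(version_dir / name, object_root / name)
```

Until this function runs, a validator that ignores `tmp` would still see the object as it was at the previous version.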

@awoods

Member

commented Mar 14, 2019

Sounds like a similar description for the current logs directory.

@birkland

Contributor

commented Mar 14, 2019

I toyed with the idea of placing the in-progress directory somewhere under the logs directory (e.g. logs/.v3), vs simply building up a content/v3 that is unreferenced in any inventory files until the very end when the work is "committed". I ended up doing the latter, justifying it by the notion that there is always a period of time when the object is in an inconsistent state when copying it into place, or adding a new version - so why not embrace it. The purview of the spec is objects at rest anyway, and building up a version is motion.

A temp directory (or some sort of convention, like directories under content that begin with a dot . MUST be ignored) does seem cleaner, though.

@ahankinson

Contributor

commented Mar 16, 2019

What about .vX (in-progress) vs vX (committed)? So .v3 while the content is being assembled, and then moved to v3 when it's finished.
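To illustrate what a validator would have to recognise under this naming idea, here is a rough sketch. The patterns are simplified assumptions for illustration and do not capture every version-naming rule in the spec; `classify_entry` is a hypothetical name.

```python
import re

# Simplified patterns, assumed for illustration only.
COMMITTED = re.compile(r"v\d+")
IN_PROGRESS = re.compile(r"\.v\d+")


def classify_entry(name: str) -> str:
    """Classify a directory name found in an object root (hypothetical helper)."""
    if COMMITTED.fullmatch(name):
        return "committed"
    if IN_PROGRESS.fullmatch(name):
        return "in-progress"
    return "other"
```

Note that the in-progress pattern accepts any version number, so a validator using it would tolerate more than one such directory unless it also checked the number against the latest committed version.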

@zimeon

Contributor Author

commented Mar 18, 2019

Something along the lines of _vX or vX_tmp would be fine, but I don't see any advantage to hiding with a leading dot. Of course, you are making the validation less tight by admitting directories with any X rather than one specific name.

@rosy1280 added this to the Beta milestone Mar 20, 2019

@julianmorley

Contributor

commented Mar 20, 2019

A temporary space on the storage root, not named /tmp, for the assembly of future OCFL object versions.

@rosy1280

Contributor

commented Mar 20, 2019

we agreed to use the language in @zimeon's second comment, except the directory should be named deposit and be placed at the storage root.

@zimeon

Contributor Author

commented Mar 21, 2019

Hmmm... as @rosy1280 just pointed out on my abortive PR #324, the #320 (comment) above suggests one deposit directory on the storage root. I'm not opposed to such a directory which might be useful for the assembly of new objects, but I do not think it is a good solution for the issue debated on this ticket, which is primarily updating objects with new versions.

Having one directory per storage root suddenly couples updates to different objects under the root in a potentially awkward way. It also doesn't provide a standard solution for within-object manipulations where they perhaps aren't following the storage root approach. (I do understand that in a filesystem implementation a whole root would likely be on one filesystem, so a move from a single deposit directory would still be efficient, since it is just relinking.)

@birkland

Contributor

commented Mar 21, 2019

A global deposit directory seems workable, but as Simeon implies each application would need to establish its own conventions to correlate activities in ${STORAGE_ROOT}/deposit with the relevant object(s) the updates will eventually go into. If multiple tools are involved in building up that new version, they would need to agree on the same convention. It would be more straightforward, in my opinion, to define an object-scope work directory to serve use cases related to updates. Keeping the proposed root-level directory for staging new objects is fine.

@awoods

Member

commented Mar 21, 2019

Although not detailed in the 2019.03.20-Editors-Meeting notes, the concern in that meeting with having a deposit directory within the Object Root came from the likely result of having draft content intermingled with the preservation resources.

@zimeon

Contributor Author

commented Mar 21, 2019

I think my answer to that concern is that it is optional to use a workspace within the object. We have at least one example (see #320 (comment)) of a choice to do this.

@rosy1280

Contributor

commented Mar 21, 2019

@zimeon the benefit of the deposit directory at the storage root is that it would keep the object itself (and its versions) clean until a version can be moved out of the deposit directory. in the example you cite, we have to futz with the object to remove a version if processes stall midway through creation. with the deposit directory at the storage root you don't.

as for how the deposit directory would work at a global level, you would need to create the hierarchy that you create for the regular storage root in the deposit directory. what i mean by that is if you're creating a pair tree hierarchy, then create the pair tree hierarchy for the object that you put in the deposit directory. if you have no hierarchy, then don't create a hierarchy.
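a rough sketch of that mirroring, assuming the directory is named `deposit` and sits directly under the storage root. the helper name `deposit_path_for` and the example paths are made up for illustration.

```python
from pathlib import Path


def deposit_path_for(storage_root: Path, object_root: Path,
                     deposit_name: str = "deposit") -> Path:
    """Return the staging location that mirrors an object's place in the root.

    Hypothetical helper: whatever hierarchy the object uses under the storage
    root (pair tree or none at all) is repeated under the deposit directory.
    """
    relative = object_root.relative_to(storage_root)
    return storage_root / deposit_name / relative


# Pair-tree style layout (example paths are invented).
print(deposit_path_for(Path("/data/ocfl"), Path("/data/ocfl/ab/cd/ef/abcdef")))
# Flat layout: no hierarchy in the root, so none in the deposit directory.
print(deposit_path_for(Path("/data/ocfl"), Path("/data/ocfl/obj1")))
```

because the staging path is derived from the object's own path, two tools that both follow this convention would agree on where an update for a given object lives.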

as an fyi this is how Moab does deposits as well.

(note i edited this comment because now i see other comments that weren't appearing before)

@ahankinson

Contributor

commented Mar 21, 2019

Personally, my feeling is that the location of temporary space allocated for assembling new versions should be left up to the client to implement. There are too many use cases and edge cases for us to properly understand this.

Some implementers may be satisfied to use the /tmp directory; others may need larger assembly spaces. Some may need to do it on cloud storage, where they have no client-local object representation.

@awoods

Member

commented Mar 21, 2019

I agree, @ahankinson. From a validation perspective, however, we should include in the specification locations that should be ignored.

@ahankinson

Contributor

commented Mar 21, 2019

My first inclination would be to say that nothing is allowed in the Object Root, save for what we have specified. Any application-specific logic (and I would consider an 'in-flight' temporary directory application-specific) should not be stored with the content to be preserved.

@ahankinson

Contributor

commented Mar 21, 2019

I am less stuffy about the storage root, since we're looser on the validation part, but would agree with @zimeon about the relative perils of doing this.

@zimeon

Contributor Author

commented Mar 22, 2019

I don't think we should be prescriptive about how implementations do their manipulation of OCFL Objects alone or within an OCFL Storage Root. However, we should enable implementations to do it in the way they choose. I see three options and I advocate that all should be possible within spec:

  1. Implementations use /tmp or equivalent, entirely outside of an OCFL Storage Root. No spec support required
  2. Implementations use some named directory in an OCFL Storage Root, say deposit, to assemble new objects or updates. Spec would require explicit exclusion of this directory from validation processes (per #320 (comment))
  3. Implementations use some named directory in an OCFL Object, say deposit, to assemble updates to the object. Spec would require explicit exclusion of this directory from validation processes (perhaps as outlined in my erroneous #324). We already have one example of @birkland's implementation adopting an ad hoc solution to this in-object manipulation approach
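
For option 2, the exclusion could look roughly like the sketch below. It assumes the ignored directory is named `deposit`, that it is only special at the top level of the storage root, and that an object root can be recognised by a declaration file whose name starts with `0=ocfl_object_`. The function name `find_object_roots` is hypothetical.

```python
import os
from pathlib import Path
from typing import Iterator, Tuple


def find_object_roots(storage_root: Path,
                      ignored: Tuple[str, ...] = ("deposit",)) -> Iterator[Path]:
    """Yield object roots under a storage root, skipping ignored top-level dirs."""
    for dirpath, dirnames, filenames in os.walk(storage_root):
        if Path(dirpath) == storage_root:
            # Prune ignored directories in place so os.walk never enters them.
            dirnames[:] = [d for d in dirnames if d not in ignored]
        if any(f.startswith("0=ocfl_object_") for f in filenames):
            # Assumed object marker found: do not descend into the object.
            dirnames[:] = []
            yield Path(dirpath)
```

A validator built on this walk would never see partially assembled objects under `deposit`, while everything else in the storage root would still be checked.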

@ahankinson

Contributor

commented Mar 22, 2019

@julianmorley raised the problem though, and I agree with him, that allowing incomplete or failed 'commits' within the "preservation" storage would seriously gum up the works in the long term. It's not just validation, it's also clarity of purpose -- OCFL Objects are "object at rest", not "object in motion." We've made the distinction quite clear by having Spec and Implementation Notes; I think we would be making a big step backwards if we were to start muddying it up this close to the finish line.

So I would be big thumbs-down to 3, and little thumbs-down to 2.

@birkland

Contributor

commented Mar 22, 2019

To be clear, my implementation currently just writes content directly into the next vN directory (with the intent to commit when it's done - it's not treated as a scratch space), and writes the inventory files as a last step. Cloud storage (like S3) doesn't have a rename operation anyway, so copying files directly into the vN directory (and coping somehow if individual operations fail) is really the only strategy available there. In that light, I'm fine if the spec doesn't define an object-level deposit directory.
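As a rough illustration of that ordering (not the actual implementation, which is not shown in this thread), the sketch below writes content files first and the inventory last. The name `write_version_in_place` is hypothetical, the inventory is passed in as already-serialised bytes, and the `inventory.json.sha512` sidecar is left out to keep the example short.

```python
from pathlib import Path
from typing import Mapping


def write_version_in_place(object_root: Path, version: str,
                           files: Mapping[str, bytes],
                           inventory_json: bytes) -> None:
    """Write content straight into vN, then the inventories (hypothetical helper)."""
    version_dir = object_root / version
    content_dir = version_dir / "content"
    content_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: content files. Until the inventory is written, nothing
    # references them, so the object still describes the previous version.
    for relative_name, data in files.items():
        target = content_dir / relative_name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)

    # Step 2: inventories last, which acts as the "commit".
    (version_dir / "inventory.json").write_bytes(inventory_json)
    (object_root / "inventory.json").write_bytes(inventory_json)
```

If step 1 fails partway, the leftover files in `vN` are exactly the kind of incomplete state the caller has to cope with, since there is no rename to fall back on.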

In any case, I think best practices will emerge from experience.

I think the Fedora project might need to tweak or re-think their anticipated use cases for working with un-versioned content (where it is expected to change or otherwise be volatile before committed to a version), but that's neither here nor there.

In general, the possibility of failed or incomplete "commits" is unavoidable, because there is always some degree of motion in an object's lifetime as files fall into place (which can be mitigated to some extent by leveraging atomic renames). It is proper for the spec, as a description of the "object at rest", to be silent about that and just describe the expected states.

@zimeon

Contributor Author

commented Mar 26, 2019

It seems that the consensus is that we should not allow anything in the Object (no change to spec required) and allow a deposit directory in the Storage Root (change to spec, per original #320 (comment)) ... I'll make a new PR for that

@rosy1280

Contributor

commented Mar 27, 2019

I also wonder if we should add something to the implementation notes discussing this topic.
