Parts of the graph serialization are interspersed with the original payload data under /data directory #5
Comments
There are several reasons to support at least configuration:
|
Ruth backs Tim |
So neither of you considers domain objects, or metadata entered by the tool, to be custodial content of the package. In that case, the package tool would need something in the UI that indicates this. Would it be granular enough (or intuitive enough) to have a "do not package the metadata entered by this tool as custodial content" check box? |
Also, the choice the tool makes now is the conservative choice - the tool doesn't necessarily know a priori how the client views the domain objects created by the tool, so it assumes that everything is custodial content (packaged data) by default (i.e. the tool user may be creating new intellectual assets that are intended to be conveyed as part of a total, packaged work). So that's the perspective on why we made the tool put everything under /data by default |
We just need the ability to place domain objects outside of the payload. There are multiple ways to achieve this. I would add that the only use case I have at present that might require support for the current format (data/bin/, data/obj/) is the need to transform packages that we've already created into the desired format (data/). Certainly, we care about the metadata and domain objects, but both are distinct from the data produced by researchers, instruments, etc. While the assertions (in metadata, domain objects, graph) might change over time, the data is much less likely to do so in the use cases of the DMS (and probably NSIDC, as well, given their current use cases for DC packages/DCS pkg ingest). And when it does change, it usually means a new version (new ID, etc). Assuming that payload (and thus "custodial content") connotes the same concept as "payload" from the BagIt specification, then I agree with the definition of a Package in Section 2.2 (Terminology) of the DC Packaging Spec, which implies this same distinction:
|
Thanks for being so eloquent, Tim… Much better than how I would have put it! Ruth
|
So would a check box indicating your preference be sufficient? |
Minimally, a checkbox and a canonical location for the domain object serializations (what is currently under data/obj/). In addition, it would be useful to (1) add properties in bag-info.txt indicating both the payload and domain object locations and (2) modify the spec docs to capture these changes. Additionally, a configuration option and command-line parameter would be needed for the automated tool. |
Hi Tim, Can you be a little more specific about what is being proposed for inclusion in bagit.txt? Is it the user preference entered into the tool indicating the directory into which domain objects generated by the tool shall go? If so, it may not deserve mention in the spec, as it's just part of the internal function of one particular tool. There is another place in the bag used for storing PTG configuration, but I don't exactly remember where that is at the moment. The ReM manifest specifies the location for all resources considered to be domain objects, and is completely agnostic of any convention or policy of locating them (i.e. they could be in the payload section, outside the payload section, whatever. It only cares that they have URIs that can be resolved). |
Hi @birkland, My suggestion was for properties in bag-info.txt, rather than bagit.txt, but yes, you're right: There is probably no need to call out the default base location for the object serializations, since the locations of individual serializations can be extracted from META-INF/org.dataconservancy.packaging/PKG-INFO/ORE-REM/ORE-REM.ttl or whatever is pointed to by the already-specified Resource-Manifest bag-info.txt property. If we plan to support payload content in more than one location, then it would probably be a good idea to add a property in bag-info.txt that specifies where the payload can be found. For my current use cases, that property would always point to "data/", the canonical payload location in the existing BagIt specs. |
The assumption the spec makes is that /data is the one and only payload directory - but the spec is written in a way that domain object resources don't necessarily have to be payload. Whether they are payload in the BagIt sense depends on whether they are located in /data or not. If the domain objects are not in /data, then they are not payload, and not intended to be conveyed. They're just a specialized kind of metadata a client may safely choose to ignore. If I'm understanding correctly, you and Ruth want a UI option in the PTG to allow the user to control where domain objects created by the PTG are put. Your viewpoint on whether domain objects are payload or not is implicit in your choice of location; anything not under /data is not payload. Does that sound about right? |
While that is true, the PTG 1.0.x currently places the content payload in /data/bin. But I want that content at the top level of /data when the domain object serializations are not in the bag payload. I'm suggesting a property in bag-info.txt because, while I WOULD be able to get the locations of the domain object serializations by following the aforementioned Resource-Manifest bag-info.txt property, I would NOT be able to tell whether I should look in /data or /data/bin for the binary content. The property wouldn't necessarily have to have the directory location of the binary content as its value; it could be a flag that indicates which mode we're in. But I think the former would be a lot clearer and a lot more flexible in the long term. |
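To make the proposal above concrete, here is a minimal sketch of how a client might use such a property. It assumes a hypothetical bag-info.txt label, `Payload-Directory` (no such label is defined in the thread or, as far as this sketch knows, in the spec), and falls back to "data/" when the label is absent. The function names `parse_bag_info` and `payload_directory` are also illustrative only.

```python
def parse_bag_info(text):
    """Parse bag-info.txt style 'Label: value' lines into a dict.

    Lines that start with whitespace are treated as continuations of the
    previous value. If a label repeats, the last occurrence wins in this
    simplified sketch.
    """
    props = {}
    last_label = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line[0] in " \t" and last_label is not None:
            props[last_label] += " " + line.strip()
            continue
        label, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Label: value' line; skip it
        last_label = label.strip()
        props[last_label] = value.strip()
    return props


def payload_directory(props, default="data/"):
    """Return the directory holding the binary content.

    'Payload-Directory' is a HYPOTHETICAL label used for illustration;
    when it is missing, assume the canonical BagIt payload location.
    """
    return props.get("Payload-Directory", default)
```

With a property like this, a client that only wants the bytes would read one tag file and never need to parse the graph serializations.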
@tdilauro by virtue of the fact that you know you're dealing with a DC package, and you have the location of the ORE-ReM, you can parse the ReM to find the data, no matter where they are located (/data vs /data/bin), so do we really need to add a property? I would also add that you know what "mode" you are in by examining the package and determining whether or not the ORE-ReM is in the payload. I'm wondering if the spec needs to support this idea of a "mode" or if this is just an implementation detail of the PTG. |
@emetsger In the simple case, I think it should be possible to extract the payload from the package without having to know how to parse and understand the graph serializations. |
@tdilauro so in that instance you would examine the package, and determine that the ReM is not payload, so then you would expect to find payload under |
@emetsger The location of the ReM is given by the Resource-Manifest bag-info.txt property and seems to default to META-INF/org.dataconservancy.packaging/PKG-INFO/ORE-REM/ORE-REM.ttl, so the ReM is not currently in the payload, at least not for bags produced by the PTG. It is only the domain object serializations that were in the payload, as far as I can tell. So using its location to make such a determination would be problematic. Would adding such a property be difficult or problematic, assuming that we proceed with this work, for other reasons that I'm possibly not grokking? |
Hah right! Well, again, I think we have to decide if this is an implementation detail or does the spec need to be updated? Certainly the PTG could add any property it needs to a bag tag file, but it's a question of whether or not this is an issue that needs to be enumerated in the package spec. |
My personal opinion is that we make the simple case (e.g., "I just want to pull my bytes out") easy and consistent, no matter what tool creates the package. That last part seems like the role of the spec, which should be about interoperability. If it's not in the spec, then there will be PTG packages and other tool packages that are incompatible. |
Hm, I don't think I understand the problem? @tdilauro Is not the simple case "I just want to pull my bytes out" merely "grab all files out of /data" whilst ignoring everything else? @emetsger what are we trying to decide is an implementation detail vs in the spec? I don't think I understand what we're referring to any more. |
@birkland It's not, because we already have packages that have been created in the current version 1.0.x format. In the future, I hope to have data in a different location. If I'm processing a package, I might need to know where my data (vs. my graph) is. |
@birkland I'm suggesting that our spec should not say anything about how On Monday, June 20, 2016, birkland notifications@github.com wrote:
|
I see. For any package (created at any time, by any tool), we know that: 1. The resources in /data are the custodial content (payload) of the bag. So selecting the 'non-graph' payload is a matter of selecting everything out of /data, and removing anything aggregated by the manifest ReM. This is the general solution that will guarantee a correct answer without any a priori knowledge about the bag or its structure. The process of extracting non-graph data can be simplified (i.e. no parsing of the manifest) if you know that a certain file path in the bag contains exactly all the non-graph data. For that reason, @tdilauro suggested a property to indicate the directory that exclusively contains all the non-graph data. @emetsger suggests that the spec should remain agnostic of such issues, and that defining such a property complicates the spec and implementation of tools. @birkland thinks the problem can be worked around by configuring the tool to place graph resources outside of /data, and tools that ignore the spec completely should be happy. |
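The general solution described above can be sketched as a simple set difference. This is only an illustration under stated assumptions: the function name `non_graph_payload` is made up, paths are bag-relative with no leading slash, and the set of resources aggregated by the ReM is assumed to have been extracted already (parsing ORE-REM.ttl is deliberately left out).

```python
from pathlib import PurePosixPath


def non_graph_payload(bag_paths, rem_aggregated):
    """Return payload paths (under data/) that the ReM does not aggregate.

    bag_paths:      bag-relative paths of every file in the bag
    rem_aggregated: bag-relative paths of resources aggregated by the
                    manifest ReM (ASSUMED to be extracted beforehand)
    """
    aggregated = {PurePosixPath(p) for p in rem_aggregated}
    selected = []
    for raw in bag_paths:
        path = PurePosixPath(raw)
        # Payload in the BagIt sense: files below the top-level data/ directory.
        in_payload = len(path.parts) > 1 and path.parts[0] == "data"
        if in_payload and path not in aggregated:
            selected.append(str(path))
    return sorted(selected)
```

Because it filters on what the ReM aggregates rather than on directory names, the same call works whether the binary content sits in data/bin/ or directly in data/.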
This has caused some usage issues for the DMS team. A feature that allows the user to specify a different location (outside of the payload /data directory) would be useful.