Add concept Detached RO-Crate #189

Merged
merged 17 commits into from Mar 23, 2023

Conversation

Contributor

@stain stain commented Jan 27, 2022

.. to support #183 my logical conclusion is that we need the concept of a Detached RO-Crate.

Suggest definition (from this pull request's structure.md)

There are two classes of RO-Crate detailed below:

Regular RO-Crate
: A crate that has a well-defined RO-Crate Root directory and can carry an explicit payload of local data entities as regular files (combined with Web-based Data Entities where needed). This type of RO-Crate can be suitable for long-term preservation, transfer and publishing, as the RO-Crate Metadata File is stored alongside the crate's payload.

Detached RO-Crate
: A crate without a defined payload directory. In this kind of crate, all data references are absolute. This approach may be suitable for use with dynamic web service APIs and repositories that can't preserve file paths. As the data of these crates can only be Web-based Data Entities, the payload is implicit and must be preserved/transferred/archived independent of the RO-Crate Metadata File.

See further definition of detached RO-Crate

I think this is necessary because of #183 allowing @id to be any ID, as here proposed in new sub section Root Data Entity identifier - then

If the @id of the Root Data Entity is an absolute URI, the Crate SHOULD NOT contain data entities using relative URI references, but MAY contain Web-based Data Entities using absolute URIs.

And from that my logical conclusion is that the whole concept of "RO-Crate Root" and any relative URIs becomes ambiguous and difficult if we no longer have "@id: ./" for the Root Dataset and the URI that serves ro-crate-metadata.json is no longer grounded in something similar to a folder.

I would hope for some discussion on this in the RO-Crate meeting today 2022-01-27.

@stain
Contributor Author

stain commented Jan 27, 2022

From RO-Crate meeting 2022-01-27:

  • General agreement to keep the suggested split of all-relative or all-absolute.
  • Change terminology, “Packaged RO-Crate”? “Transferred RO-Crate”?
    → Peter to review this PR, then revise intro text accordingly.
  • How to write this back out?
    → Add to JSON-LD appendix on how to convert between absolute and relative using @base tricks.

@ptsefton
Contributor

ptsefton commented Feb 10, 2022

I suggest we use the term "Attached RO-Crate".

Suggest definition (from this pull request's structure.md)

There are two classes of RO-Crate detailed below:

Attached RO-Crate
: A crate that has a well-defined RO-Crate Root directory and can carry an explicit payload of local data entities as regular files (combined with Web-based Data Entities where needed) using relative URIs. This type of RO-Crate can be suitable for long-term preservation, transfer and publishing, as the RO-Crate Metadata File is stored alongside the crate's payload.

If a crate makes any relative references then it is considered an Attached RO-Crate and the Root Dataset ID MUST be "./".

Detached RO-Crate
: A crate without a defined payload directory. In this kind of crate, all data references are absolute. This approach may be suitable for use with dynamic web service APIs and repositories that can't preserve file paths. As the data of these crates can only be Web-based Data Entities, the payload is implicit and must be preserved/transferred/archived independent of the RO-Crate Metadata File.

See further definition of detached RO-Crate

I think this is necessary because of #183 allowing @id to be any ID, as here proposed in new sub section Root Data Entity identifier - then

If the @id of the Root Data Entity is an absolute URI, the Crate SHOULD NOT contain data entities using relative URI references, but MAY contain Web-based Data Entities using absolute URIs.
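
The all-relative or all-absolute rule quoted above could be checked mechanically. The following is a minimal sketch, assuming the `@graph` has already been parsed into a list of dicts and the caller supplies the Root Dataset `@id` (the `about` target of the metadata descriptor). The names `is_relative` and `classify_crate` are hypothetical, and ignoring `#local` ids, blank nodes and the descriptor itself is an assumption, not something the proposed text states.

```python
from urllib.parse import urlparse

# The descriptor is not a data entity, so its relative @id is not counted.
DESCRIPTOR_IDS = ("ro-crate-metadata.json", "ro-crate-metadata.jsonld")


def is_relative(ref):
    """True for references with no URI scheme, such as 'data.csv' or './'."""
    return urlparse(ref).scheme == ""


def classify_crate(graph, root_id):
    """Return (kind, problems) for a flattened @graph and its Root Dataset @id."""
    # Assumption: '#local' and '_:' ids name contextual entities, not payload.
    relative_ids = [
        e["@id"] for e in graph
        if is_relative(e["@id"])
        and e["@id"] != root_id
        and e["@id"] not in DESCRIPTOR_IDS
        and not e["@id"].startswith(("#", "_:"))
    ]
    if is_relative(root_id):
        # Any relative root makes the crate attached; it must then be './'.
        if root_id == "./":
            return "attached", []
        return "attached", [f"root @id {root_id!r} is relative but not './'"]
    # Absolute root: relative data references SHOULD NOT appear.
    return "detached", [f"relative @id {i!r} in a detached crate" for i in relative_ids]
```

A caller would pass the parsed graph and the root id, then treat a non-empty problem list as a warning rather than a hard failure, since the proposed wording is SHOULD NOT.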

@stain
Contributor Author

stain commented Feb 10, 2022

Terminology attached/detached RO-Crate agreed in RO-Crate meeting 2022-02-10.

@stain
Contributor Author

stain commented Feb 24, 2022

I started drafting a section Converting from attached to detached

just wanted to check if we are OK with what comes out of the JSON-LD flattening:

{
  "@context": [
    {"@base": "arcp://uuid,d6be5c9b-132a-4a93-9837-3e02e06c08e6/"},
    "https://w3id.org/ro/crate/1.1/context"
  ],
  "@graph": [
    {
      "@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/ro-crate-metadata.json",
      "@type": "CreativeWork",
      "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
      "about": {"@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/"},
      "creator": {"@id": "https://orcid.org/0000-0001-9842-9718"}
    },
    {
      "@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/",
      "@type": "Dataset",
      "hasPart": [
        {"@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/index.html"},
        {"@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/example/"}
      ],
      "name": "Workflow RO-Crate profile"
    },
    {
      "@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/ro-crate-metadata.json#include-ComputationalWorkflow",
      "@type": "Recommendation",
      "category": "MUST",
      "name": "Include Main Workflow",
      "itemReviewed": {
        "@id": "https://bioschemas.org/ComputationalWorkflow"
      }
    }
  ]
}

  1. ro-crate-metadata.json now has an absolute @id (breaks a SHOULD?)
  2. The root dataset matches the public Web presence (but then why did we need Detached?)
  3. #local identifiers become grounded in the original RO-Crate
  4. The identifier of the corresponding attached RO-Crate is preserved in @id

For anything more "proper" I think you would need manual processing, e.g. manual deposit and rewrite of each data entity file, manual UUID for each contextual entity.
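
The attached-to-detached conversion described above amounts to resolving every `@id` against the URL the metadata file is served from. The following is a minimal sketch of that step without a JSON-LD library, assuming the `@graph` is already parsed. The function name `absolutize` is hypothetical. Python's `urljoin` only resolves references for schemes it recognises, so this sketch uses an `https` base rather than an `arcp` one.

```python
from urllib.parse import urljoin


def absolutize(node, base):
    """Recursively resolve every "@id" value in a parsed @graph against base."""
    if isinstance(node, dict):
        resolved = {}
        for key, value in node.items():
            if key == "@id" and isinstance(value, str) and not value.startswith("_:"):
                # Absolute ids pass through urljoin unchanged; blank nodes are skipped.
                resolved[key] = urljoin(base, value)
            else:
                resolved[key] = absolutize(value, base)
        return resolved
    if isinstance(node, list):
        return [absolutize(item, base) for item in node]
    return node
```

Using the URL of `ro-crate-metadata.json` itself as the base means `./` resolves to the containing folder and `#local` identifiers become grounded in the metadata file, which matches points 2 and 3 in the list above.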

@simleo
Contributor

simleo commented Feb 24, 2022

I think we should recommend removing the {"@base": "arcp://uuid,.../"} from the converted output. It can be confusing (I was wondering what its purpose was until I read the section about converting to detached).
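
Dropping the `@base` entry from a converted crate's `@context` could look like the sketch below. The function name `strip_base` is hypothetical, and this only makes sense once every `@id` in the graph is already absolute, since otherwise relative ids would lose the base they were resolved against.

```python
def strip_base(context):
    """Return a copy of a JSON-LD @context with any "@base" declaration removed."""
    if isinstance(context, dict):
        return {key: value for key, value in context.items() if key != "@base"}
    if isinstance(context, list):
        cleaned = [strip_base(item) for item in context]
        # Drop inline context objects that became empty once @base was removed.
        cleaned = [item for item in cleaned if item != {}]
        # A single remaining entry can be written without the wrapping list.
        return cleaned[0] if len(cleaned) == 1 else cleaned
    return context
```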

@ptsefton
Contributor

Are the uuids intended to be unique? Cos people will copy and paste, or hardcode them into their crate.

Regarding attached crates, can we deal with the relativity of paths using base: "./" or similar (or is that not allowed)? I know I have base: null in crates to stop JSON-LD libraries from messing with my paths, but I would have to refresh my memory.

@jmfernandez
Contributor

If you leave the arcp based UUID, you should add a small recipe about using python's arcp or how to generate in a couple of programming languages those UUIDs in the namespace of URLs.

# UUID v5 derived from a URL (deterministic for the same URL)
import uuid
the_url = 'https://example.org'
the_uuid = uuid.uuid5(uuid.NAMESPACE_URL, the_url)
# str(the_uuid) is the canonical UUID string; the_uuid.hex is the same without dashes

# ARCP URI derived from a location (needs the third-party arcp package)
import arcp
the_arcp = arcp.arcp_location("http://example.com/data.zip", "/file.txt")
# the_arcp has the ARCP string representation

# Random UUID v4
import uuid
the_random_uuid = uuid.uuid4()
# str(the_random_uuid) is the canonical UUID string

# Random ARCP URI
import arcp
the_random_arcp = arcp.arcp_random()
# the_random_arcp has the ARCP string representation

@ptsefton
Contributor

ptsefton commented Mar 6, 2022

On reflection I don't think we need this attached/detached distinction. I think we should look at providing clear info about how to use relative and absolute paths for various resources.

Based on experience where we have implemented an API that uses the API URL as the @id but it is then not clear how to reconstitute a crate, I think that approach was a mistake. It might be better to go back to an approach where @ids are

  1. Relative URIs, to describe how resources would be laid out on disk as a set of relative paths, with ./ for the root
  2. Absolute URIs, for URL-addressable resources

For packaged crates-on-disk use @base: null with relative paths for data entities

For crates over an API, use the dcat:downloadURL property on data entities for the place where you can get a file and, as per (1) above, make its @id the filename it should have relative to the root. Use identifier for IDs like DOIs.

@ptsefton
Contributor

ptsefton commented Mar 9, 2022

Further to my last comment @stain & @simleo. I think I have found a neat solution to the problem we were having with letting the "@id" for a File be a URI - how would you save it to disk and reconstruct the relative path structure of a package?

Solution: In RO-Crate Metadata Documents served from a service, leave the @ids as relative paths but use DCAT accessURL (to point to RO-Crate Metadata served over an API) and downloadURL for the actual datastream. We can then recommend a process for reconstituting an RO-Crate by using the @id to create directories and write file contents.

I have written this up in the work I was doing on a new intro - this detail probably does not all belong in the intro though.

Here's a copy and paste from that Google doc.

When an RO-Crate Metadata Document is served from a service, use the following DCAT properties:

dcat:accessURL – RO-Crate Metadata Documents

dcat:downloadURL - Direct downloads of bitstreams (files)

Client software to construct RO-Crates SHOULD:
  • Save the RO-Crate Metadata File into an empty crate download directory.
  • For each Dataset that has a relative URI, make a subdirectory with the same path as the Dataset @id, e.g. /data/pictures.
  • For each File entity, fetch the datastream using downloadURL and write it to a relative path that corresponds to its "@id" (creating directories as needed, even if they are not described in the RO-Crate).

In the case where the RO-Crate Metadata Document is being served from a service:
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      "@type": "CreativeWork",
      "@id": "ro-crate-metadata.json",
      "conformsTo": {"@id": "https://w3id.org/ro/crate/2.0"},
      "about": {"@id": "./"}
    },
    {
      "@id": "./",
      "accessURL": "https://example.com/ro-crate/api/crate/000001",
      "@type": "Dataset",
      "datePublished": "2022-02-01",
      "name": "Example dataset for RO-Crate specification",
      "description": "If this were real data it would contain minute-by-minute rainfall readings for my weather gauge",
      "license": "CC BY-NC-SA 3.0",
      "hasPart": {"@id": "data.csv"}
    },
    {
      "@id": "data.csv",
      "downloadURL": "https://example.com/ro-crate/api/crate/000001?file=data.csv",
      "@type": "File",
      "encodingFormat": "text/csv",
      "name": "Rainfall data for Katoomba, NSW Australia, 2022-02-01",
      "license": "CC BY-NC-SA 3.0 AU"
    }
  ]
}
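
The client steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not an agreed algorithm: the names `reconstitute` and `safe_target` are hypothetical, `fetch` is a caller-supplied callable (URL in, bytes out) so that no particular HTTP library is implied, and refusing `..` segments is an added safety measure that the proposal does not mention.

```python
import json
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def is_relative(ref):
    """True for path-like ids; '#local' and blank-node ids are excluded."""
    return urlparse(ref).scheme == "" and not ref.startswith(("#", "_:"))


def safe_target(root, ref):
    """Map a relative @id onto a path below root, refusing '..' escapes."""
    # A leading '/' (as in '/data/pictures') is treated as relative to the root.
    parts = PurePosixPath(ref.lstrip("/")).parts
    if ".." in parts:
        raise ValueError(f"unsafe @id: {ref!r}")
    return root.joinpath(*parts)


def reconstitute(metadata, target_dir, fetch):
    """Rebuild a crate on disk from metadata served by an API."""
    root = Path(target_dir)
    root.mkdir(parents=True, exist_ok=True)
    # Step 1: save the metadata file into the download directory.
    (root / "ro-crate-metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8"
    )
    for entity in metadata["@graph"]:
        ref = entity["@id"]
        types = entity.get("@type", [])
        types = [types] if isinstance(types, str) else types
        if not is_relative(ref):
            continue
        if "Dataset" in types:
            # Step 2: one subdirectory per relative Dataset.
            safe_target(root, ref).mkdir(parents=True, exist_ok=True)
        elif "File" in types and "downloadURL" in entity:
            # Step 3: fetch the datastream and write it at the @id path.
            target = safe_target(root, ref)
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(fetch(entity["downloadURL"]))
```

File entities without a `downloadURL` are skipped here; whether a client should instead fail or warn in that case is left open in the discussion.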

@jmfernandez
Contributor

Looks nice, @ptsefton @stain @simleo !! I have several questions, some of them offtopic.

  • If the resource has a public id representation as a short URI using a scheme registered either at identifiers.org or at n2t.net in CURIE format, where should we put those public identifiers, as we are keeping @id to represent the internal, relative placement of the resource?
  • If the resource has a download URL which is not http based (for instance, an FTP or an Amazon S3 link), is the downloadURL predicate still the recommended one to provide it?
  • Even when it is not possible to provide a downloadURL, is there some proper way to declare that the resource is under controlled access? For instance, data from EGA (European Genome-phenome Archive) or dbGaP (NCBI's database of Genotypes and Phenotypes)

@simleo
Contributor

simleo commented Mar 9, 2022

I think I have found a neat solution to the problem we were having with letting the "@id" for a File be a URI - how would you save it to disk and reconstruct the relative path structure of a package?

The current spec already allows File @ids to be URIs. In ro-crate-py, this is handled via a fetch_remote keyword argument that allows the library user to decide what happens when the crate is written out to disk (the user might not want to download the file -- or be able to do so -- for various reasons):

from rocrate.rocrate import ROCrate

crate = ROCrate()
url = "http://example.com/foo.txt"
# Download file; it will be placed under <CRATE_DIR>/examples when the crate is written out
crate.add_file(url, "examples/foo.txt", fetch_remote=True)
# Don't download file; its @id will still be a URI in the output crate
crate.add_file(url, fetch_remote=False)

In the latter case, a "url": "http://example.com/foo.txt" is automatically added to the entity; however, we're currently not doing that in the former case, and I now realize that we should. But maybe we should use "downloadUrl" rather than "url".

@jmfernandez I don't think there's any requirement for URL schemes to be http[s] in Schema.org.

UPDATE: https://schema.org/downloadUrl is only used in SoftwareApplication, so I guess we should use url
UPDATE 2: on second (third?) thought, automatically adding a "url" property in ro-crate-py is a bad thing. Users might want to create a local copy that doesn't list any remote reference (it could be considered a different crate, in a way), or specify a different URI (e.g., from a mirror), which is always possible via properties anyway.

@@ -44,7 +44,7 @@ The _RO-Crate JSON-LD_ MUST contain a self-describing
**RO-Crate Metadata File Descriptor** with
the `@id` value `ro-crate-metadata.json` (or `ro-crate-metadata.jsonld` in legacy
crates) and `@type` [CreativeWork]. This descriptor MUST have an [about]
Contributor

@simleo simleo Mar 30, 2022

We need to clarify that the descriptor's @id can also be an absolute URI. For instance, by adding a sentence here like:

In a [detached RO-Crate](structure.md#detached-ro-crate), the descriptor's `@id` can be
an absolute URI; in this case, its _last path segment_ MUST be `ro-crate-metadata.json`
(or `ro-crate-metadata.jsonld` in legacy crates)

Contributor

@simleo simleo Mar 30, 2022

An alternative is rephrasing the previous sentence as:

The _RO-Crate JSON-LD_ MUST contain a self-describing **RO-Crate Metadata File Descriptor**
whose `@id` MUST have `ro-crate-metadata.json` (or `ro-crate-metadata.jsonld` in legacy crates)
as its last path segment, and `@type` [CreativeWork].
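
A check for the "last path segment" wording suggested above might look like the following sketch. The function name `is_descriptor_id` is hypothetical, and rejecting ids that carry a `#fragment` is an assumption made here so that `#local` identifiers grounded in the metadata file are not mistaken for the descriptor.

```python
from urllib.parse import urlparse

METADATA_NAMES = ("ro-crate-metadata.json", "ro-crate-metadata.jsonld")


def is_descriptor_id(ref):
    """True if the last path segment of ref is an RO-Crate metadata file name."""
    parts = urlparse(ref)
    # Assumption: an id with a #fragment names something inside the file.
    if parts.fragment:
        return False
    # Query strings are not part of the path, so '?file=...' never matches.
    return parts.path.rsplit("/", 1)[-1] in METADATA_NAMES
```

This accepts both the relative `ro-crate-metadata.json` of an attached crate and an absolute URI ending in that file name.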

@stain
Contributor Author

stain commented Aug 25, 2022

UPDATE: https://schema.org/downloadUrl is only used in SoftwareApplication, so I guess we should use url

@simleo url is what we've specified in https://www.researchobject.org/ro-crate/1.1/data-entities.html#embedded-data-entities-that-are-also-on-the-web - but how about https://schema.org/contentUrl, which is defined for MediaObject (aka our File) and would be the schema.org way of doing dcat:downloadURL, rather than url, which is just "URL of the item" (and therefore hard to understand when it differs from @id)?
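
Since the thread mentions several candidate properties for the place to fetch a file from, a consumer might need to look at more than one. The following is a rough sketch only: the function name `download_location` is hypothetical, and the preference order (contentUrl, then downloadURL, then url, then an absolute @id) is an assumption for illustration, not something this discussion settles.

```python
from urllib.parse import urlparse

# Assumption: illustrative preference order, not an agreed recommendation.
CANDIDATE_PROPERTIES = ("contentUrl", "downloadURL", "url")


def download_location(entity):
    """Return the URL to fetch an entity's bytes from, or None if none is given."""
    for prop in CANDIDATE_PROPERTIES:
        value = entity.get(prop)
        # A property may hold a plain string or a {"@id": ...} reference.
        if isinstance(value, dict):
            value = value.get("@id")
        if isinstance(value, str) and value:
            return value
    # Fall back to the @id itself when it is already an absolute URI.
    ref = entity.get("@id", "")
    return ref if urlparse(ref).scheme else None
```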

@stain
Contributor Author

stain commented Mar 23, 2023

Call 2023-03-23 agreed to merge all outstanding PRs.

Still outstanding is how to reconstruct the relative path. @simleo may also have thoughts on this now from the Workflow Run profile perspective, which also needed to do this.

@stain stain merged commit cf5d1cd into master Mar 23, 2023
@stain stain deleted the issue-183-nonslash-root branch March 23, 2023 21:24
@stain stain added this to the RO-Crate 1.2 milestone Mar 23, 2023
@stain stain mentioned this pull request Apr 28, 2023
29 tasks
4 participants