CQ1 - Container image #9

Closed
simleo opened this issue Jun 23, 2022 · 11 comments · Fixed by #64
Labels
Requirement Something we want to capture in the spec

Comments

@simleo
Collaborator

simleo commented Jun 23, 2022

What container images (e.g., Docker) were used by the run?

  • Source entity: CreateAction
  • Target entity? It could be File if the image is a tarball from docker save
  • Property? Overload image?
This was referenced Jun 23, 2022
@simleo simleo added the Requirement Something we want to capture in the spec label Jul 6, 2022
@mr-c

mr-c commented Jul 21, 2022

For containers, capturing the digest (checksum) of the actual container that was run should be the minimum, along with the host OS and its version (previous Linux kernel versions have had math bugs) and CPU info (basically the contents of /proc/cpuinfo).
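
A minimal sketch of how a tool might collect the host details mentioned here, using only the Python standard library. The function name `host_info` and the returned keys are illustrative and not part of any spec; `/proc/cpuinfo` only exists on Linux, so the sketch returns `None` for it elsewhere.

```python
import platform
from pathlib import Path


def host_info() -> dict:
    """Collect host OS name, version and raw CPU info (hypothetical helper)."""
    uname = platform.uname()
    cpuinfo = Path("/proc/cpuinfo")
    return {
        "system": uname.system,    # e.g. "Linux"
        "release": uname.release,  # kernel release string
        "version": uname.version,  # kernel build/version string
        "machine": uname.machine,  # e.g. "x86_64"
        # Raw file contents on Linux, None on systems without /proc
        "cpuinfo": cpuinfo.read_text() if cpuinfo.is_file() else None,
    }


info = host_info()
print(info["system"], info["release"], info["machine"])
```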

@ilveroluca
Contributor

A File seems too restrictive.

One should be able to reference a container image in any remote repository. Also, it seems handy to be able to define what type of container image it is (e.g., Singularity, Docker, etc.).

Both these requirements could be satisfied by using a full URI, where the scheme is used to identify the image type. This approach is also used by Snakemake.

@simleo
Collaborator Author

simleo commented Oct 13, 2022

Thanks @ilveroluca. Following your link, it looks like Snakemake, in turn, accepts what's supported by Singularity. So the spec could say something like "values for image SHOULD be in the format accepted by Singularity, e.g. docker://quay.io/calico/node". Formats that respect the SHOULD could then be used by tooling that wants to enable reproducibility, with the ability to actually pull the image. More "informational" (non-pullable) URLs, e.g. of a web page that describes the image, would still be useful for traceability.
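
To illustrate how a consumer could tell the image type from the URI scheme, here is a small sketch based on the standard library's `urllib.parse`. The helper name `split_image_uri` is hypothetical, and the sketch only separates the scheme from the rest; it does not validate that the scheme is one Singularity actually supports.

```python
from urllib.parse import urlsplit


def split_image_uri(uri: str):
    """Return (scheme, reference) for a Singularity-style image URI."""
    parts = urlsplit(uri)
    if not parts.scheme:
        raise ValueError(f"no scheme in image URI: {uri!r}")
    # netloc holds the first component (here the registry host),
    # path holds the rest of the image reference
    return parts.scheme, parts.netloc + parts.path


scheme, reference = split_image_uri("docker://quay.io/calico/node")
print(scheme)     # the image type hint
print(reference)  # what remains to be passed to the pull tool
```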

@dgarijo
Contributor

dgarijo commented Oct 13, 2022

The registry where the container is (e.g., Dockerhub, GitHub, etc.) is quite important here as well. I propose capturing it (in case a file is not used, just the id in that registry)

@simleo
Collaborator Author

simleo commented Oct 14, 2022

> The registry where the container is (e.g., Dockerhub, GitHub, etc.) is quite important here as well. I propose capturing it (in case a file is not used, just the id in that registry)

I think the idea discussed yesterday was to capture, using separate properties:

  • Image type (Docker, Singularity, ...). We could define values for the most popular types in ro-terms, like we did in Workflow Testing RO-Crate for CI service types (JenkinsService, GithubService, TravisService)
  • Registry (Docker Hub, Quay.io, ...). Values here could be strings, but probably not full URIs with a scheme: we want this representation to be machine actionable, so this field should map to the first field in the REGISTRY/ORG/IMAGE:TAG scheme used by docker pull. E.g. "quay.io" rather than "https://quay.io/". An alternative is pointing to more articulate objects that would have a generic URL pointing to a descriptive web page and a specialized property for mapping to the appropriate field in the image pull command.
  • Organization within the registry ("crs4", "biocontainers", ...)
  • Image name ("postgres", "samtools", ...)
  • Tag ("latest", "1.0", ...)
  • Digest, e.g. "sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855". Like for Registry, though, how to map to the pull syntax (e.g. docker pull ubuntu@sha256:82becede498899ec668628e7cb0ad87b6e1c371cb8a1e597d83a47fac21d6af3) should be made straightforward.

@simleo
Collaborator Author

simleo commented Aug 3, 2023

Since the question is what container images were used by the run, the source entity should be CreateAction. I've updated the issue description, which listed SoftwareApplication instead.

As for the property used to link to the image, reusing image does not feel quite right, since it's meant for pictures. We could define a containerImage property in ro-terms and make it point to a ContainerImage entity (this pattern is not uncommon in Schema.org, e.g. contactPoint pointing to ContactPoint). The question then would be how to structure ContainerImage.

For the image type we could use additionalType, and define DockerImage and SIFImage (https://github.com/apptainer/sif) in ro-terms for the values for now.

For the registry we should define a custom property, which could be registry and take textual values. This is implicitly "docker.io" when not specified in docker pull. Singularity (both SingularityCE and Apptainer) seems to allow pulling from an arbitrary http(s) URL: in this case the ContainerImage should probably have a url property instead, listing the full image URL.

Referring to the previous comment, the problem with the "organization" bit is that it's not always an organization. Taking the docker pull scheme and Docker Hub as reference, that field could represent a user, or be missing in the case of an official image. In practice, for official images it defaults to "library", so that docker pull debian is short for docker pull docker.io/library/debian:latest. So it's probably better not to have such a field and to consider this part of the image name instead, which is consistent with the docker pull [OPTIONS] NAME[:TAG|@DIGEST] syntax.
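
The defaulting described here can be sketched as a small function that expands a short docker pull name to its explicit form. This is a simplified illustration, not Docker's own parser: the function name `expand_docker_reference` is hypothetical, and the rule used to spot a registry (first component contains "." or ":" or equals "localhost") is an approximation of Docker's behaviour.

```python
def expand_docker_reference(ref: str) -> str:
    """Expand NAME[:TAG|@DIGEST] with the docker.io / library / latest defaults."""
    # Separate an optional @DIGEST suffix
    name, _, digest = ref.partition("@")
    # A ":" after the last "/" is a tag; one before it would be a registry port
    tag = None
    if ":" in name.rsplit("/", 1)[-1]:
        name, tag = name.rsplit(":", 1)
    parts = name.split("/")
    # Assumption: the first component is a registry only if it looks like a host
    if len(parts) > 1 and ("." in parts[0] or ":" in parts[0] or parts[0] == "localhost"):
        registry, path = parts[0], parts[1:]
    else:
        registry, path = "docker.io", parts
    # Official images on Docker Hub live under "library"
    if registry == "docker.io" and len(path) == 1:
        path = ["library"] + path
    full = registry + "/" + "/".join(path)
    if digest:
        return f"{full}@{digest}"
    return f"{full}:{tag or 'latest'}"


print(expand_docker_reference("debian"))
print(expand_docker_reference("biocontainers/samtools:v1.9-4-deb_cv1"))
```

Because the user or organization part stays inside the name, the function never has to decide whether "biocontainers" is an organization or a user.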

For the image name we can use name, mapping to text like "debian", "biocontainers/samtools", etc. Note that the terminology is not always consistent in the Docker docs: e.g., what is referred to as "name" in the docker pull docs is called "repository" in the docker images docs.

The tag needs a new custom property that we can call tag, with textual values.

For the digest, we already have sha256 in ro-terms.

Here is a possible example:

{
    "@id": "#cb04c897-eb92-4c53-8a38-bcc1a16fd650",
    "@type": "CreateAction",
    "instrument": {"@id": "bam2fastq.cwl"},
    ...
    "containerImage": {"@id": "#samtools-image"}
},
{
    "@id": "#samtools-image",
    "@type": "ContainerImage",
    "additionalType": "DockerImage",
    "registry": "docker.io",
    "name": "biocontainers/samtools",
    "tag": "v1.9-4-deb_cv1",
    "sha256": "da61624fda230e94867c9429ca1112e1e77c24e500b52dfc84eaf2f5820b4a2a"
}
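
As a rough illustration of how a consumer could answer the original question ("what container images were used by the run?") from such metadata, the sketch below resolves containerImage references inside a flattened @graph. The helper name `images_used_by` is hypothetical, the graph is reduced to the two entities from the example (with the elided properties left out), and containerImage is still only a proposed property at this point in the discussion.

```python
graph = [
    {
        "@id": "#cb04c897-eb92-4c53-8a38-bcc1a16fd650",
        "@type": "CreateAction",
        "instrument": {"@id": "bam2fastq.cwl"},
        "containerImage": {"@id": "#samtools-image"},
    },
    {
        "@id": "#samtools-image",
        "@type": "ContainerImage",
        "additionalType": "DockerImage",
        "registry": "docker.io",
        "name": "biocontainers/samtools",
        "tag": "v1.9-4-deb_cv1",
        "sha256": "da61624fda230e94867c9429ca1112e1e77c24e500b52dfc84eaf2f5820b4a2a",
    },
]


def images_used_by(action_id: str, entities: list) -> list:
    """Return the ContainerImage entities referenced by one action."""
    by_id = {entity["@id"]: entity for entity in entities}
    refs = by_id[action_id].get("containerImage", [])
    # JSON-LD allows a single reference or a list of references
    if isinstance(refs, dict):
        refs = [refs]
    return [by_id[ref["@id"]] for ref in refs]


used = images_used_by("#cb04c897-eb92-4c53-8a38-bcc1a16fd650", graph)
print([image["name"] for image in used])
```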

@ilveroluca
Contributor

While I think your proposal for ContainerImage will work in practice, I have some considerations.

The only thing that really identifies the image is the checksum. On the other hand, it's possible that images are mirrored in multiple locations, or that over time they migrate across repositories. Tags can also be reused (while this is not a best practice, it can happen).

Also, I question the value added by splitting the image URL into its components (i.e., registry, name, tag).

I would therefore consider defining a ContainerImage that: 1) uses a "simple" URL to reference image locations; and 2) allows referencing secondary image locations. An example might look like this:

{
    "@id": "#samtools-image",
    "@type": "ContainerImage",
    "additionalType": "DockerImage",
    "sha256": "da61624fda230e94867c9429ca1112e1e77c24e500b52dfc84eaf2f5820b4a2a",
    "mainUrl": "https://docker.io/biocontainers/samtools:v1.9-4-deb_cv1",
    "alternativeUrls": [ "https://quay.io/repository/...." ]
}

@simleo
Collaborator Author

simleo commented Aug 3, 2023

One problem with "https://docker.io/biocontainers/samtools:v1.9-4-deb_cv1" is that it does not represent a resource on the web: it leads to a "page not found" if entered in a browser, and you cannot docker pull it ("invalid reference format"). What you can docker pull is:

  • docker.io/biocontainers/samtools:v1.9-4-deb_cv1
  • docker.io/biocontainers/samtools@sha256:da61624fda230e94867c9429ca1112e1e77c24e500b52dfc84eaf2f5820b4a2a

The separate fields would allow the consumer to build the preferred pull syntax easily by joining the relevant parts, and also to perform more articulate queries (e.g., all images from quay.io).

That's for Docker images at least, since Singularity allows pulling by URL.
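
A minimal sketch of what "joining the relevant parts" could look like for a consumer, assuming the registry, name, tag and sha256 fields proposed earlier in the thread. The function name `pull_reference` and the second image entry are hypothetical and only there to show the query.

```python
def pull_reference(image: dict, by_digest: bool = False) -> str:
    """Join ContainerImage fields into a string that docker pull accepts."""
    # docker.io is the implicit registry when none is recorded
    base = f"{image.get('registry', 'docker.io')}/{image['name']}"
    if by_digest:
        return f"{base}@sha256:{image['sha256']}"
    return f"{base}:{image.get('tag', 'latest')}"


samtools = {
    "registry": "docker.io",
    "name": "biocontainers/samtools",
    "tag": "v1.9-4-deb_cv1",
    "sha256": "da61624fda230e94867c9429ca1112e1e77c24e500b52dfc84eaf2f5820b4a2a",
}
# Hypothetical second entry, used only to illustrate filtering by registry
calico = {"registry": "quay.io", "name": "calico/node"}

print(pull_reference(samtools))
print(pull_reference(samtools, by_digest=True))

# "All images from quay.io" becomes a simple field comparison
from_quay = [img["name"] for img in (samtools, calico) if img.get("registry") == "quay.io"]
print(from_quay)
```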

@rsirvent
Contributor

rsirvent commented Aug 3, 2023

Here is a possible example:

{
    "@id": "#cb04c897-eb92-4c53-8a38-bcc1a16fd650",
    "@type": "CreateAction",
    "instrument": {"@id": "bam2fastq.cwl"},
    ...
    "containerImage": {"@id": "#samtools-image"}
},
{
    "@id": "#samtools-image",
    "@type": "ContainerImage",
    "additionalType": "DockerImage",
    "registry": "docker.io",
    "name": "biocontainers/samtools",
    "tag": "v1.9-4-deb_cv1",
    "sha256": "da61624fda230e94867c9429ca1112e1e77c24e500b52dfc84eaf2f5820b4a2a"
}

I'm more in favor of this approach, since it describes the image in more detail, so you get richer metadata that can be used later.

@simleo
Collaborator Author

simleo commented Aug 4, 2023

@stain any thoughts on this one?

jmfernandez added a commit to inab/WfExS-backend that referenced this issue Oct 5, 2023
…esearchObject/workflow-run-crate#9 (comment)

Also, common.py has been thinned, moving several declarations to their "natural" places.

This has led to a major code reorganization, which has raised an issue unmarshalling some instances from the working directory state files. So, yaml loader has been taught how to deal with this mismatch.
@simleo
Collaborator Author

simleo commented Oct 13, 2023

As noted by Stian, "additionalType": {"@id": "https://w3id.org/ro/terms/workflow-run#DockerImage"} is more correct.
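
For consumers that may meet both forms, a tolerant reader could look roughly like this. The helper name `image_type_iri` is hypothetical, and treating a bare string such as "DockerImage" as a term in the workflow-run namespace is an assumption made for this sketch, not something the spec states.

```python
WFRUN_NS = "https://w3id.org/ro/terms/workflow-run#"


def image_type_iri(entity: dict):
    """Return the image type as an IRI, accepting both forms of additionalType."""
    value = entity.get("additionalType")
    if isinstance(value, dict):
        # Preferred form: a reference such as {"@id": "...#DockerImage"}
        return value.get("@id")
    if isinstance(value, str) and "://" not in value:
        # Assumption: bare names refer to terms in the workflow-run namespace
        return WFRUN_NS + value
    return value


by_reference = {"additionalType": {"@id": WFRUN_NS + "DockerImage"}}
by_string = {"additionalType": "DockerImage"}
print(image_type_iri(by_reference))
print(image_type_iri(by_string))
```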

@stain stain closed this as completed in #64 Oct 13, 2023