
Define components, standards, requirements for tool discovery #31

Closed
proycon opened this issue Jan 4, 2022 · 8 comments
Assignees: proycon
Labels: discussion, FAIR Tool Discovery, planning

Comments

proycon commented Jan 4, 2022

We need to clearly define the software components, service components and data components for tool discovery, along with the standards we adopt and requirements we want to set for all CLARIAH participants.

All these will be formulated here as part of the Shared Development Roadmap v2: https://github.com/CLARIAH/clariah-plus/blob/main/shared-development-roadmap/epics/fair-tool-discovery.md

It contains an initial proposal, which was already discussed and positively received by the technical committee, but further details remain to be filled in. A workflow schema also needs to be added.

Further discussion can take place in this thread.

proycon added the planning and discussion labels on Jan 4, 2022
proycon commented Jan 5, 2022

Relevant short blog post from the Software Sustainability Institute: https://software.ac.uk/blog/2021-05-20-what-are-formats-tools-and-techniques-harvesting-metadata-software-repositories

proycon commented Jan 5, 2022

I also recommend this paper on FAIR research software: https://content.iospress.com/articles/data-science/ds190026

proycon commented Jan 6, 2022

The main premises I envision for software metadata harvesting are:

  • All software metadata is stored and maintained at the source as much as
    possible. The closest to the source you can get is having the metadata in the
    source code repository itself. The idea is that software developers themselves
    are best capable of expressing their own metadata; doing so will be a
    requirement in the CLARIAH Requirements for Infrastructure and Software/Services.
  • We adopt codemeta as a common vocabulary (it uses the schema.org vocabulary as
    much as possible). It is expressed as JSON-LD, so we can extend it with any
    additional linked data vocabulary we need (Define extra vocabulary for tool
    discovery #32). See the example below.
  • We automatically map to codemeta from various existing schemas (codemeta
    specifically defines crosswalks between schemas); we want to prevent any manual
    duplication. Almost all software ecosystems already have mechanisms in place for
    specifying basic metadata, and these should be used (PyPI, Java Maven, Rust
    Cargo, NPM, etc.); another source of metadata is CITATION.cff. Codemeta
    conversions already exist for several of these, so we use those existing tools
    for the codemeta mapping (codemetapy, codemetar, cffconvert) and extend them
    where no solutions exist yet.
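
To make the codemeta premise concrete, here is a minimal, purely illustrative codemeta.json. The property names are standard codemeta/schema.org terms; the tool name and values are placeholders, and the exact set of properties we will require is part of #32:

```json
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "mytool",
    "description": "A hypothetical CLARIAH tool, for illustration only",
    "codeRepository": "https://github.com/example/mytool",
    "license": "https://spdx.org/licenses/GPL-3.0-only",
    "programmingLanguage": "Python",
    "version": "0.1.0",
    "author": [
        {
            "@type": "Person",
            "givenName": "Jane",
            "familyName": "Doe"
        }
    ]
}
```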

The role of the harvester is to collect software metadata in one common vocabulary (codemeta, plus whatever extended vocabulary we need) from all CLARIAH software. The procedure is as follows:

  1. The input for the harvester is a list of source code repositories (and service endpoints, but more about this later).
     This I call the tool source registry; it can simply be a git repository holding the necessary simple configuration files.
  2. The harvester queries these source repositories (it simply git clones them and then looks for certain files).
  3. Ideally, there is a codemeta.json at the root of the source repo; if so, we collect that and are done.
  4. If not, we detect what other supported metadata is present and invoke the necessary tool(s) to convert it to codemeta.
  5. The harvester stores its results in the tool store (a minimal sketch of this loop follows below).
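
A minimal sketch of such a harvester loop in POSIX shell might look as follows. The file names, the registry format, and the convert_to_codemeta helper are purely illustrative; the actual implementation is being worked on in #33:

```sh
#!/bin/sh
# Minimal sketch of the harvester loop (illustrative only).
# tool-sources.txt is the "tool source registry": one git URL per line.
mkdir -p toolstore
while read -r repo_url; do
    name=$(basename "$repo_url" .git)
    git clone --depth 1 --quiet "$repo_url" "$name" </dev/null || continue
    if [ -f "$name/codemeta.json" ]; then
        # Ideal case: the metadata is already maintained at the source
        cp "$name/codemeta.json" "toolstore/$name.codemeta.json"
    else
        # Otherwise detect what other supported metadata is present
        # (setup.py, pom.xml, package.json, CITATION.cff, ...) and delegate
        # the conversion to dedicated tools such as codemetapy or cffconvert.
        # convert_to_codemeta is a hypothetical wrapper around those tools.
        convert_to_codemeta "$name" > "toolstore/$name.codemeta.json"
    fi
done < tool-sources.txt
```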

The role of the tool store is to hold and make available for querying the
aggregated collection of codemeta records (one per tool). The tool store would
then allow SPARQL and/or other queries on the data collection. Export functions
to other metadata formats (CMDI, OAI, Ineo's YAML) could either be built in
server-side, or in dedicated clients.

As for implementation, I'd like to aim for simplicity. I started a codemeta
harvester (#33) concept in the form of a POSIX shell script that probably won't
exceed 250 LoC. The real work of converting other metadata to codemeta is
delegated to dedicated tools (and that's where the work will be).

The tool store can probably also be kept quite simple. Just loading all triples
into memory (it'll be of very limited scope after all; we don't intend to scale
to thousands of tools) and allowing some kind of SPARQL query on it will also get us a long way.
Alternatively, existing triple stores like Virtuoso could be considered (but
might be overkill).
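
To illustrate the kind of query the tool store should support: assuming all codemeta records are loaded into a single graph, a SPARQL query along these lines would list every tool with its repository (the prefix and property names follow schema.org, as codemeta does):

```sparql
PREFIX schema: <http://schema.org/>

SELECT ?tool ?name ?repo WHERE {
    ?tool a schema:SoftwareSourceCode ;
          schema:name ?name .
    OPTIONAL { ?tool schema:codeRepository ?repo }
}
ORDER BY ?name
```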

Now there's one major aspect which I skipped over. I want to make a clear
distinction between Software and Software as a Service, and this proposal thus
far has neglected the service aspect. However, in CLARIAH we're quite
service-oriented and most software will be made available as a service, i.e. a
web application hosted at a particular institute and made available over the
web with proper federated authentication etc. We want to have these 'service
entrypoints' in our metadata as well, but they don't fit the paradigm of being
specified in the source code repository, because the source code repository
doesn't/shouldn't know where/when it is deployed.

Codemeta is more focussed on describing the source code, so already in 2018 I
proposed an extension to codemeta that would allow for also describing
entrypoints and specifying their interface type. This is limited and not
intended to be a full interface specification like what OpenAPI or CLAM offers;
the URL to such a full service specification is simply a field in this extension.

To accommodate software as a service I imagine that we also list service
endpoints as part of the tool source registry (and not just the source
repository). The harvester can then query these endpoints, convert the metadata
found there to codemeta (e.g. using my extension), and augment the metadata
obtained from the source repository with it. These endpoints could offer
OpenAPI, OAI-PMH, or simply Dublin Core metadata in HTML, as long as we have
some kind of tooling available to do a proper mapping (this is again where the
actual work is). We'll probably have to cope with some amount of diversity, but
should limit this to a manageable degree by formulating clear software/service
requirements for CLARIAH.
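
Purely as an illustration of the direction (using schema.org's targetProduct and WebAPI as stand-ins; the extra context URL is a placeholder and the actual extension proposal may use different properties), an augmented record for a tool that is also deployed as a service could look something like:

```json
{
    "@context": [
        "https://doi.org/10.5063/schema/codemeta-2.0",
        "https://example.org/contexts/software-types.jsonld"
    ],
    "@type": "SoftwareSourceCode",
    "name": "mytool",
    "targetProduct": {
        "@type": "WebAPI",
        "name": "mytool webservice",
        "url": "https://webservices.example.org/mytool",
        "documentation": "https://webservices.example.org/mytool/specification"
    }
}
```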

proycon commented Jan 6, 2022

Had a quick call with @menzowindhouwer about this, to discuss possible alignment with the FAIR Datasets track. He wants to expand the already established OAI Harvest Manager with further options to deal with non-OAI-PMH and non-XML-based metadata (of which codemeta would be one). Such functionality will be needed anyway for FAIR Datasets. I expressed some concerns regarding complexity when extending that harvest manager to do too much, although it looks well designed and fairly extensible. We decided to continue on both tracks: I'll implement the simple harvester because it will be easy and fast (and we need results quickly here), and Menzo will continue with the harvest manager because that will be needed in other scopes (FAIR Datasets) anyway. The harvester script I propose may also serve as an inspiration/example/proof-of-concept for further development of the OAI Harvest Manager. In the end we can always decide to replace the simpler solution with the more complex one if the latter proves more fruitful.

We'll eventually need further convergence regarding the tool store aspect as well, possibly using the same solution for both tools and data.

proycon self-assigned this on Jan 6, 2022
proycon commented Jan 12, 2022

Software metadata is often encoded in READMEs. If there is no more formal schema available, we can extract metadata from a README and convert it to codemeta. An existing tool is already available that does precisely this: https://github.com/KnowledgeCaptureAndDiscovery/somef

proycon commented Jan 12, 2022

As mentioned earlier, the current codemeta standard does not cover everything we need for a more service-oriented approach, as it focusses on describing the software source (schema:SoftwareSourceCode). We also want to be able to describe webservice and web application endpoints (in some generic terms) and make the distinction between software and software instance/deployment explicit in the metadata. I proposed an extension in codemeta/codemeta#183, but more work/thought may be required here. There is other ongoing work at schema.org and the W3C that may serve us here; it is described in schemaorg/schemaorg#2635 and schemaorg/schemaorg#1423.

ddeboer commented Jan 27, 2022

There’s a slight contradiction between:

> All software metadata is stored and maintained at the source as much as possible.

and

> We automatically map to codemeta from various existing schemas (…)
>
> 4. If not, we detect what other supported metadata is present, and invoke the necessary tool(s) to convert it to codemeta.

A way to solve this is to make the codemeta.json a hard requirement and offer tooling and documentation on how owners can generate a codemeta.json based on their current metadata (e.g. GitHub repo metadata, language-specific package metadata, etc.). I see two advantages of this approach:

  1. Software owners keep full ownership of their metadata; there’s no ‘magic’ extraction that they have no control over. Instead, they themselves generate the codemeta.json, giving them the chance to make manual corrections to it.
  2. It keeps the Harvester simpler because that only has to look for the codemeta.json file.

The question, of course, is whether we can ask this of software developers. We can mitigate this by offering a separate web service, a Software Metadata Extractor or CodeMeta Generator, where developers enter the URL of a repository, magic happens, and a codemeta.json is returned, which the developers copy, possibly modify and add to their repository. This way you make it easy for developers but still give them full ownership of the metadata.

A final problem is that of synchronising metadata: for example if developers change the LICENSE file or the license property in the language-specific package definition (e.g. package.json for NPM), how is that change propagated to the codemeta.json? Your approach has this same problem once owners have added a codemeta.json (because the Harvester would then ignore any other metadata).

proycon commented Jan 27, 2022

Those are very good points, yes. I was aware there was a bit of a contradiction and that the requirements might need some tweaking as the tool discovery task progresses. I was also a bit on the fence about how hard the requirement should be. The ownership argument you put forward is a good one, and for CLARIAH software it would be a fair demand to make. If we want to add some CLARIAH-specific vocabulary it might even be inevitable. But for possible external software, and for some flexibility, it helps if the harvester can do the conversion for the cases where it wasn't already provided; it also helps prevent the sync issue you describe later.

> We can mitigate this by offering a separate web service, a Software Metadata Extractor or CodeMeta Generator

Yes, the current harvester+conversion implementation I'm working on actually provides that function as well (without the webservice part though). The whole thing should remain simple enough.

> A final problem is that of synchronising metadata: for example if developers change the LICENSE file or the license property in the language-specific package definition (e.g. package.json for NPM), how is that change propagated to the codemeta.json? Your approach has this same problem once owners have added a codemeta.json (because the Harvester would then ignore any other metadata).

I think a part of the job of the harvester+conversion is to do some basic validation so blatant out-of-sync errors are reported.

But the syncing issue indeed remains: users may provide an explicit codemeta.json, later update their package-specific metadata, and neglect to update the codemeta.json accordingly. This is part of why I was on the fence about requiring codemeta.json vs auto-converting it every time. Generation of the codemeta.json can also be invoked automatically, e.g. from setup.py, in a git commit hook, or through a continuous deployment environment (but that might be overkill and complicate things).
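
For example, a minimal pre-commit hook could regenerate the file whenever the package metadata changes (a hypothetical sketch; the exact codemetapy invocation may differ):

```sh
#!/bin/sh
# .git/hooks/pre-commit -- keep codemeta.json in sync with package metadata
# (illustrative; exact codemetapy invocation may differ)
if git diff --cached --name-only | grep -qE '^(setup\.py|pyproject\.toml|CITATION\.cff)$'; then
    codemetapy setup.py > codemeta.json
    git add codemeta.json
fi
```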

proycon added the FAIR Tool Discovery label on Feb 10, 2022
proycon added a commit to proycon/codemeta-harvester that referenced this issue Mar 17, 2022
proycon added a commit to proycon/codemeta-harvester that referenced this issue Mar 17, 2022
proycon added a commit to proycon/codemeta-harvester that referenced this issue Mar 17, 2022
proycon added a commit to proycon/codemeta-harvester that referenced this issue Mar 18, 2022
proycon added a commit that referenced this issue Apr 6, 2022
proycon moved this from In Progress to Done in CLARIAH+ Shared Service: FAIR Tool Discovery on Apr 13, 2022
proycon closed this as completed on Apr 13, 2022