
Define components, standards, requirements for tool discovery #31

Closed
proycon opened this issue Jan 4, 2022 · 8 comments
Assignees: proycon
Labels: discussion, FAIR Tool Discovery, planning

Comments

proycon commented Jan 4, 2022

We need to clearly define the software components, service components and data components for tool discovery, along with the standards we adopt and requirements we want to set for all CLARIAH participants.

All these will be formulated here as part of the Shared Development Roadmap v2: https://github.com/CLARIAH/clariah-plus/blob/main/shared-development-roadmap/epics/fair-tool-discovery.md

It contains an initial proposal, which was already discussed and positively received by the technical committee, but further details remain to be filled in. A workflow schema also needs to be added.

Further discussion can take place in this thread.

proycon added the planning and discussion labels on Jan 4, 2022
proycon commented Jan 5, 2022

Relevant short blog post from the Software Sustainability Institute: https://software.ac.uk/blog/2021-05-20-what-are-formats-tools-and-techniques-harvesting-metadata-software-repositories

proycon commented Jan 5, 2022

I also recommend this paper on FAIR research software: https://content.iospress.com/articles/data-science/ds190026

proycon commented Jan 6, 2022

The main premises I envision for software metadata harvesting are:

  • All software metadata is stored and maintained at the source as much as
    possible. The closest to the source you can get is having the metadata in the
    source code repository itself. The idea is that software developers themselves
    are best capable of expressing their own metadata; doing so will be a
    requirement in the CLARIAH Requirements for Infrastructure and Software/Services.
  • We adopt codemeta as a common vocabulary (it uses the schema.org vocabulary as
    much as possible). It is expressed as JSON-LD, so we can extend it with any
    additional linked data vocabulary we need (Define extra vocabulary for tool
    discovery #32). See the example below.
  • We automatically map to codemeta from various existing schemas (codemeta
    specifically defines crosswalks between schemas); we want to prevent any manual
    duplication. Almost all software ecosystems already have mechanisms in place for
    specifying basic metadata, and these should be used (PyPI, Java Maven, Rust
    Cargo, NPM, etc.); another source of metadata is CITATION.cff. Codemeta
    conversions already exist for several of these, so we use those existing tools
    for the codemeta mapping (codemetapy, codemetar, cffconvert) and extend them
    where no solutions exist yet.
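
To make the codemeta premise concrete, here is a minimal, purely illustrative codemeta.json. The property names are standard codemeta/schema.org terms; the tool name and values are placeholders, and the exact set of properties we will require is part of #32:

```json
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "mytool",
    "description": "A hypothetical CLARIAH tool, for illustration only",
    "codeRepository": "https://github.com/example/mytool",
    "license": "https://spdx.org/licenses/GPL-3.0-only",
    "programmingLanguage": "Python",
    "version": "0.1.0",
    "author": [
        {
            "@type": "Person",
            "givenName": "Jane",
            "familyName": "Doe"
        }
    ]
}
```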

The role of the harvester is to collect software metadata in one common vocabulary (codemeta, plus whatever extended vocabulary we need) from all CLARIAH software. The procedure is as follows:

  1. The input for the harvester is a list of source code repositories (and service endpoints, but more about this later).
     This I call the tool source registry; it can simply be a git repository holding the necessary simple configuration files.
  2. The harvester queries these source repositories (it simply git clones them and then looks for certain files).
  3. Ideally, there is a codemeta.json at the root of the source repo; if so, we collect that and are done.
  4. If not, we detect what other supported metadata is present and invoke the necessary tool(s) to convert it to codemeta.
  5. The harvester stores its results in the tool store (a minimal sketch of this loop follows below).
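
A minimal sketch of such a harvester loop in POSIX shell might look as follows. The file names, the registry format, and the convert_to_codemeta helper are purely illustrative; the actual implementation is being worked on in #33:

```sh
#!/bin/sh
# Minimal sketch of the harvester loop (illustrative only).
# tool-sources.txt is the "tool source registry": one git URL per line.
mkdir -p toolstore
while read -r repo_url; do
    name=$(basename "$repo_url" .git)
    git clone --depth 1 --quiet "$repo_url" "$name" </dev/null || continue
    if [ -f "$name/codemeta.json" ]; then
        # Ideal case: the metadata is already maintained at the source
        cp "$name/codemeta.json" "toolstore/$name.codemeta.json"
    else
        # Otherwise detect what other supported metadata is present
        # (setup.py, pom.xml, package.json, CITATION.cff, ...) and delegate
        # the conversion to dedicated tools such as codemetapy or cffconvert.
        # convert_to_codemeta is a hypothetical wrapper around those tools.
        convert_to_codemeta "$name" > "toolstore/$name.codemeta.json"
    fi
done < tool-sources.txt
```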

The role of the tool store is to hold and make available for querying the
aggregated collection of codemeta records (one per tool). The tool store would
then allow SPARQL and/or other queries on the data collection. Export functions
to other metadata formats (CMDI, OAI, Ineo's YAML) could either be built in
server-side, or in dedicated clients.

As for implementation, I'd like to aim for simplicity. I started a codemeta
harvester (#33) concept in the form of a POSIX shell script that probably won't
exceed 250 LoC. The real work of converting other metadata to codemeta is
delegated to dedicated tools (and that's where the work will be).

The tool store can probably also be kept quite simple. Just loading all triples
into memory (it'll be of very limited scope after all; we don't intend to scale
to thousands of tools) and allowing some kind of SPARQL query on it will also get us a long way.
Alternatively, existing triple stores like Virtuoso could be considered (but
might be overkill).
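
To illustrate the kind of query the tool store should support: assuming all codemeta records are loaded into a single graph, a SPARQL query along these lines would list every tool with its repository (the prefix and property names follow schema.org, as codemeta does):

```sparql
PREFIX schema: <http://schema.org/>

SELECT ?tool ?name ?repo WHERE {
    ?tool a schema:SoftwareSourceCode ;
          schema:name ?name .
    OPTIONAL { ?tool schema:codeRepository ?repo }
}
ORDER BY ?name
```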

Now there's one major aspect which I skipped over. I want to make a clear
distinction between Software and Software as a Service, and this proposal thus
far has neglected the service aspect. However, in CLARIAH we're quite
service-oriented and most software will be made available as a service, i.e. a
web application hosted at a particular institute and made available over the
web with proper federated authentication etc. We want to have these 'service
entrypoints' in our metadata as well, but they don't fit the paradigm of being
specified in the source code repository, because the source code repository
doesn't/shouldn't know where/when it is deployed.

Codemeta is more focussed on describing the source code, so already in 2018 I
proposed an extension to codemeta that would allow for also describing
entrypoints and specifying their interface type. This is limited and not
intended to be a full interface specification like what OpenAPI or CLAM offers;
the URL to such a full service specification is simply a field in this extension.

To accommodate software as a service I imagine that we also list service
endpoints as part of the tool source registry (and not just the source
repository). The harvester can then query these endpoints, convert the metadata
found there to codemeta (e.g. using my extension), and augment the metadata
obtained from the source repository with it. These endpoints could offer
OpenAPI, OAI-PMH, or simply Dublin Core metadata in HTML, as long as we have
some kind of tooling available to do a proper mapping (this is again where the
actual work is). We'll probably have to cope with some amount of diversity, but
should limit this to a manageable degree by formulating clear software/service
requirements for CLARIAH.
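
Purely as an illustration of the direction (using schema.org's targetProduct and WebAPI as stand-ins; the extra context URL is a placeholder and the actual extension proposal may use different properties), an augmented record for a tool that is also deployed as a service could look something like:

```json
{
    "@context": [
        "https://doi.org/10.5063/schema/codemeta-2.0",
        "https://example.org/contexts/software-types.jsonld"
    ],
    "@type": "SoftwareSourceCode",
    "name": "mytool",
    "targetProduct": {
        "@type": "WebAPI",
        "name": "mytool webservice",
        "url": "https://webservices.example.org/mytool",
        "documentation": "https://webservices.example.org/mytool/specification"
    }
}
```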

proycon commented Jan 6, 2022

Had a quick call with @menzowindhouwer about this, to discuss possible alignment with the FAIR Datasets track. He wants to expand the already established OAI Harvest Manager with further options to deal with non-OAI-PMH and non-XML-based metadata (of which codemeta would be one). Such functionality will be needed anyway for FAIR Datasets. I expressed some concerns regarding complexity when extending that harvest manager to do too much, although it looks well designed and fairly extensible. We decided to continue on both tracks: I'll implement the simple harvester because it will be easy and fast (and we need results quickly here), and Menzo will continue with the harvest manager because that will be needed in other scopes (FAIR Datasets) anyway. The harvester script I propose may also serve as an inspiration/example/proof-of-concept for further development of the OAI Harvest Manager. In the end we can always decide to replace the simpler solution with the more complex one if the latter proves more fruitful.

We'll eventually need further convergence regarding the tool store aspect as well, possibly using the same solution for both tools and data.

proycon self-assigned this on Jan 6, 2022
proycon commented Jan 12, 2022

Software metadata is often encoded in READMEs. If there is no more formal schema available, we can extract metadata from a README and convert it to codemeta. An existing tool is already available that does precisely this: https://github.com/KnowledgeCaptureAndDiscovery/somef

proycon commented Jan 12, 2022

As mentioned earlier, the current codemeta standard does not cover everything we need for a more service-oriented approach, as it focusses on describing the software source (schema:SoftwareSourceCode). We also want to be able to describe webservice and web application endpoints (in some generic terms) and make the distinction between software and software instance/deployment explicit in the metadata. I proposed an extension in codemeta/codemeta#183, but more work/thought may be required here. There is other ongoing work at schema.org and the W3C that may serve us here; it is described in schemaorg/schemaorg#2635 and schemaorg/schemaorg#1423.

ddeboer commented Jan 27, 2022

There’s a slight contradiction between:

> All software metadata is stored and maintained at the source as much as possible.

and

> We automatically map to codemeta from various existing schemas (…)
>
> 4. If not, we detect what other supported metadata is present, and invoke the necessary tool(s) to convert it to codemeta.

A way to solve this is to make the codemeta.json a hard requirement and offer tooling and documentation on how owners can generate a codemeta.json based on their current metadata (e.g. GitHub repo metadata, language-specific package metadata, etc.). I see two advantages of this approach:

  1. Software owners keep full ownership of their metadata; there’s no ‘magic’ extraction that they have no control over. Instead, they themselves generate the codemeta.json, giving them the chance to make manual corrections to it.
  2. It keeps the Harvester simpler because that only has to look for the codemeta.json file.

The question, of course, is whether we can ask this of software developers. We can mitigate this by offering a separate web service, a Software Metadata Extractor or CodeMeta Generator, where developers enter the URL of a repository, magic happens, and a codemeta.json is returned, which the developers copy, possibly modify and add to their repository. This way you make it easy for developers but still give them full ownership of the metadata.

A final problem is that of synchronising metadata: for example if developers change the LICENSE file or the license property in the language-specific package definition (e.g. package.json for NPM), how is that change propagated to the codemeta.json? Your approach has this same problem once owners have added a codemeta.json (because the Harvester would then ignore any other metadata).

proycon commented Jan 27, 2022

Those are very good points, yes. I was aware there was a bit of a contradiction and that the requirements might need some tweaking as the tool discovery task progresses. I was also a bit on the fence about how hard the requirement should be. The ownership argument you put forward is a good one, and for CLARIAH software it would be a fair demand to make. If we want to add some CLARIAH-specific vocabulary it might even be inevitable. But for possible external software, and for some flexibility, it helps if the harvester can do the conversion for the cases where it wasn't already provided; it also helps prevent the sync issue you describe later.

> We can mitigate this by offering a separate web service, a Software Metadata Extractor or CodeMeta Generator

Yes, the current harvester+conversion implementation I'm working on actually provides that function as well (without the webservice part though). The whole thing should remain simple enough.

> A final problem is that of synchronising metadata: for example if developers change the LICENSE file or the license property in the language-specific package definition (e.g. package.json for NPM), how is that change propagated to the codemeta.json? Your approach has this same problem once owners have added a codemeta.json (because the Harvester would then ignore any other metadata).

I think a part of the job of the harvester+conversion is to do some basic validation so blatant out-of-sync errors are reported.

But the syncing issue indeed remains: users may provide an explicit codemeta.json, later update their package-specific metadata, and neglect to update the codemeta.json accordingly. This is part of why I was on the fence about requiring codemeta.json vs auto-converting it every time. Generation of the codemeta.json can also be invoked automatically, e.g. from setup.py, in a git commit hook, or through a continuous deployment environment (but that might be overkill and complicate things).
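
For example, a minimal pre-commit hook could regenerate the file whenever the package metadata changes (a hypothetical sketch; the exact codemetapy invocation may differ):

```sh
#!/bin/sh
# .git/hooks/pre-commit -- keep codemeta.json in sync with package metadata
# (illustrative; exact codemetapy invocation may differ)
if git diff --cached --name-only | grep -qE '^(setup\.py|pyproject\.toml|CITATION\.cff)$'; then
    codemetapy setup.py > codemeta.json
    git add codemeta.json
fi
```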

proycon added the FAIR Tool Discovery label on Feb 10, 2022
proycon added a commit to proycon/codemeta-harvester that referenced this issue Mar 17, 2022
proycon added a commit to proycon/codemeta-harvester that referenced this issue Mar 17, 2022
proycon added a commit to proycon/codemeta-harvester that referenced this issue Mar 17, 2022
proycon added a commit to proycon/codemeta-harvester that referenced this issue Mar 18, 2022
proycon added a commit that referenced this issue Apr 6, 2022
proycon moved this from In Progress to Done in CLARIAH+ Shared Service: FAIR Tool Discovery on Apr 13, 2022
proycon closed this as completed on Apr 13, 2022