Dan Brickley edited this page Nov 4, 2020 · 17 revisions

Introduction

This is a master page for information on all the technical aspects of publishing and consuming Bioschemas markup. This will grow over time. Please feel free

Topics

Format

Bioschemas strongly recommends using JSON-LD to publish markup, as also recommended by Google. schema.org also allows RDFa and Microdata, but standardizing on JSON-LD allows Bioschemas example markup to be simpler and more consistent. JSON-LD also separates its markup from the page HTML, which may be better for scientific sites publishing large volumes of markup that may change relatively infrequently compared to the human-readable webpage.
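As a rough illustration of how JSON-LD stays separate from the page HTML, the sketch below serializes a node into a `<script type="application/ld+json">` element that can be dropped into a page template. The helper name `jsonld_script_tag`, the `example.org` URL, and the choice of `Dataset` as the type are assumptions made for this example, not part of any Bioschemas tooling.

```python
import json


def jsonld_script_tag(node: dict) -> str:
    """Serialize a JSON-LD node into a <script> element for embedding in HTML."""
    body = json.dumps(node, indent=2)
    # Escape "</" so a string value cannot close the script element early.
    # "\/" is a valid JSON escape, so the parsed value is unchanged.
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Hypothetical node; the URL and values are placeholders.
node = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "@id": "https://example.org/dataset/1",
    "name": "Example dataset",
}
print(jsonld_script_tag(node))
```

Because the markup lives in its own element, it can be regenerated or cached independently of the human-readable part of the page.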

"@id" & "@context"

Bioschemas recommends using "@id" to assign the node a URL. This is best practice for Linked Data and prevents the creation of blank nodes.

The equivalent of "@id" in Microdata is the itemid attribute, and the equivalent in RDFa Lite is the resource attribute.

"@context" is routinely ignored by crawlers, which simply replace it with the default "@context": "https://schema.org". Best practice is to use this default in all markup.
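A publisher could check both recommendations before releasing markup. The following is a minimal sketch, assuming each node has already been parsed into a Python dict; the function name `check_node` and the deliberately strict comparison against the default context are choices made for this example only.

```python
DEFAULT_CONTEXT = "https://schema.org"


def check_node(node: dict) -> list:
    """Return a list of problems found in a top-level JSON-LD node."""
    problems = []
    # A custom context is likely to be replaced by crawlers anyway.
    if node.get("@context") != DEFAULT_CONTEXT:
        problems.append('"@context" is not the default "https://schema.org"')
    # Without a URL in "@id" the node becomes a blank node.
    node_id = node.get("@id")
    if not (isinstance(node_id, str) and node_id.startswith(("http://", "https://"))):
        problems.append('"@id" is missing or is not an http(s) URL')
    return problems


print(check_node({"@context": "https://schema.org", "@type": "Dataset"}))
```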

Publishing Bioschemas markup

Make Bioschemas markup reachable

In principle, you can publish Bioschemas markup on any of your webpages, just like any other schema.org markup. General search engines such as Bing and Google will find it there, as and when they process Bioschemas markup. However, so that life-sciences-specific search engines and other applications can also find it, we recommend that, where possible, all markup be reachable by crawling the website's sitemap.xml.

If this isn't possible, then Bioschemas markup should at least be discoverable through link following. This may reduce the number of consumers, since only those that implement link crawling will find it.

If markup isn't available through the sitemap.xml or link following, then its use will be restricted to applications that know to specifically crawl marked up page URLs (e.g. it won't be available for general life sciences search engines).
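To make the sitemap route concrete, here is a small sketch of how a consumer might list the page URLs in a sitemap.xml before fetching each page for markup. It assumes a plain `urlset` sitemap in the standard sitemaps.org namespace and does not handle sitemap index files; `sitemap_urls` and the `example.org` URLs are names invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> list:
    """Return every <loc> URL listed in a urlset sitemap."""
    root = ET.fromstring(sitemap_xml.strip())
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content.
example = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/dataset/1</loc></url>
  <url><loc>https://example.org/dataset/2</loc></url>
</urlset>
"""
for url in sitemap_urls(example):
    print(url)
```

Pages that are absent from this list can only be found by following links or by knowing their URLs in advance.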

Also, to make your markup crawlable, don't forget to make sure that your robots.txt allows search engines to crawl your site. You can check that your website allows this for Google using Google's robots.txt testing tool, but be sure to consider allowing other crawlers too.
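For crawlers other than Google, one way to check a robots.txt locally is Python's standard `urllib.robotparser`. The sketch below parses robots.txt text directly rather than fetching it; the rules, the user-agent names, and the URLs are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content.
rules = """
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""
print(can_crawl(rules, "ExampleLifeSciBot", "https://example.org/dataset/1"))
print(can_crawl(rules, "ExampleLifeSciBot", "https://example.org/private/draft"))
```

Note that Python's parser is only one interpretation of the robots.txt rules; individual crawlers may treat edge cases differently.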

If possible, don't publish Bioschemas markup dynamically (i.e. through a delayed fetch via JavaScript)

If possible, it's best to publish Bioschemas markup on statically rendered pages. This will make it available to the widest range of consumers. However, we understand that in various web frameworks and architectures this isn't realistically possible, and markup needs to be added dynamically through JavaScript. Please be aware that this makes the markup available only to crawlers that render pages in a headless browser before they scrape the markup.
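The difference can be seen with a crawler that does not execute JavaScript. The sketch below, built on the standard `html.parser` module, collects JSON-LD blocks from raw HTML only; the class and function names and both sample pages are invented for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect JSON-LD blocks present in the raw HTML (no JavaScript is run)."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.nodes.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # Skip blocks that are not valid JSON.


def extract_jsonld(html: str) -> list:
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes


# Markup present in the HTML itself versus markup fetched later by a script.
static_page = '<html><head><script type="application/ld+json">{"@context": "https://schema.org", "@type": "Dataset", "name": "x"}</script></head></html>'
dynamic_page = '<html><head><script>fetch("/markup.json")</script></head></html>'
print(len(extract_jsonld(static_page)), len(extract_jsonld(dynamic_page)))
```

A consumer of this kind sees the markup on the static page but nothing on the dynamic one, which is why static publishing reaches more consumers.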

Questions

Does Bioschemas markup need to be published on the same page as the human readable content?

In principle, since JSON-LD separates semantic markup from the human-readable HTML, it can be tempting to simplify publishing by aggregating markup for many different entities on pages separate from their human-readable representations. This may reduce the effort of publishing and improve the efficiency with which crawlers can find that markup, particularly if it is referenced from the sitemap.xml.

However, Google states in its guidelines that there must be human-readable content for that markup on the same page, in order to give the best possible general search experience. This requirement, which stems from the need to avoid misleading the user, may apply more to general Internet sites that might try to game the search system than to scientific data sites that aim to provide useful data. Nonetheless, in order to make content discoverable by Google and similar search engines, we recommend that the markup always be on the same page as the associated human-readable content.

Links

Adding profile specific relations to BioChemEntity and DataRecord

Bioschemas interoperability

Compact Identifiers

During the BioHackathon 2018 in Paris, participants explored whether Compact Identifiers can be used as the value of the identifier property. The valid value types for this property are Text and URL, so the question was whether Compact Identifiers are valid URLs. This seems to be the case:

  1. Compact Identifiers are CURIEs (see this paper)
  2. CURIEs are URIs (see this specification)
  3. The URL type includes URIs (see this page)

As a result, Bioschemas is interoperable with Compact Identifiers.
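In markup, this amounts to placing the Compact Identifier directly in the identifier property. The sketch below assumes a simple `prefix:accession` shape and uses a loose regular expression that is not the official Compact Identifier grammar; the helper name, the `Taxon` node, and the `taxonomy:9606` value are illustrative choices.

```python
import re

# Loose pattern for prefix:accession. This is an assumption for the sketch,
# not the official grammar; "(?!//)" rejects ordinary URLs such as https://...
COMPACT_ID = re.compile(r"^(?P<prefix>[A-Za-z0-9._]+):(?!//)(?P<accession>\S+)$")


def with_compact_identifier(node: dict, compact_id: str) -> dict:
    """Return a copy of node with identifier set to the Compact Identifier."""
    if not COMPACT_ID.match(compact_id):
        raise ValueError(f"not a prefix:accession identifier: {compact_id!r}")
    return {**node, "identifier": compact_id}


# Hypothetical node using an NCBI Taxonomy Compact Identifier.
taxon = with_compact_identifier(
    {"@context": "https://schema.org", "@type": "Taxon", "name": "Homo sapiens"},
    "taxonomy:9606",
)
print(taxon["identifier"])
```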

Future ideas

  • The Bioschemas common crawl - how to access a common crawl for applications that need a large amount of collected Bioschemas markup but don't want to operate their own crawling facilities. Either this is collected by commoncrawl.org or published by a search engine gathering this information anyway, such as Buzzbang.