Michel Dumontier edited this page Jul 11, 2013 · 26 revisions

version 1.1 (in progress v3 guidelines)

The linked data that forms part of Bio2RDF subscribes to a simple set of modeling patterns that permit our different datasets to interoperate syntactically. The best practices presented here have been inspired by the Banff Manifesto, Tim Berners-Lee's design principles and the collective experience of our community. This document provides a simple set of guidelines to help Bio2RDF users and contributors create and query our data.

This guide assumes that you have working experience in creating RDF documents programmatically. If this describes you, then read on!


Reusing known identifiers

More than 1800 biological databases are currently available, and they usually provide a unique identifier for every record that they contain. For example, the Protein Data Bank uses a four-character string to identify its entries (e.g. 1Y26); similarly, PubMed uses an integer to identify publication records (e.g. 22359647).

The linked data that forms part of Bio2RDF distinguishes between those identifiers that refer to the original records and any other auxiliary identifiers used in the creation of the linked data graph.

In order to maintain a clear link back to the original data provider's records, the Bio2RDF network makes use of a simple URI pattern that is composed of a unique namespace (one for every data provider) followed by a unique record identifier:

http://bio2rdf.org/namespace:identifier

For example, the Bio2RDF URI for the UniProt record with the identifier P26838 would be

http://bio2rdf.org/uniprot:P26838

The list of namespaces available to the Bio2RDF network can be found here. Because every URI in the Bio2RDF network is also a URL, all identifiers must be URL-safe. For example, if an identifier contains a round bracket, i.e. ( or ), then the URL encoding of that bracket (%28 or %29) must be used in the URI. A list of URL encodings for common characters is available here.
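
The URI pattern and the URL-safety requirement can be sketched in a few lines of Python. This is only an illustration, not the code used by the Bio2RDF scripts: the function name bio2rdf_uri is hypothetical, and the second call uses a made-up namespace and identifier to show the bracket encoding.

```python
from urllib.parse import quote

# Base URI taken from the examples in this guide.
BIO2RDF_BASE = "http://bio2rdf.org/"


def bio2rdf_uri(namespace: str, identifier: str) -> str:
    """Build http://bio2rdf.org/namespace:identifier with a URL-safe identifier."""
    # safe="" percent-encodes every reserved character, including "/" and
    # round brackets; letters, digits and "_", ".", "-", "~" are left as-is.
    return f"{BIO2RDF_BASE}{namespace}:{quote(identifier, safe='')}"


print(bio2rdf_uri("uniprot", "P26838"))
# Hypothetical namespace and identifier, used only to show the encoding.
print(bio2rdf_uri("example", "abc(1)"))
```

The first call reproduces the UniProt URI shown above; the second percent-encodes the brackets as %28 and %29.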

Creating auxiliary URIs

Now that you know how unique record-based URIs are created, we can demonstrate the URI patterns that Bio2RDF datasets use for metadata about these records. We have identified two categories of auxiliary namespaces for each dataset:

namespace_vocabulary

A namespace_vocabulary namespace is to be used when serializing additional data about a record with an existing identifier, but which requires new predicates or types. We use the namespace_vocabulary for dataset-specific types and predicates as follows:

http://bio2rdf.org/namespace_vocabulary:someUniqueString

For example, SGD describes genes and their protein products, and the Bio2RDF URI for SGD's 'Protein' type is:

http://bio2rdf.org/sgd_vocabulary:Protein

namespace_resource

If in the process of converting a dataset to RDF you create new identifiers that did not previously exist in the dataset being converted, then use a namespace_resource namespace. For example, PharmGKB describes associations between diseases, genes and drugs, but does not specify an identifier for these associations. Thus, the URI for the PharmGKB record describing interactions between the drug sertraline, with identifier PA451333, and gene ABCB1 would be:

http://bio2rdf.org/pharmgkb_resource:PA451333-ABCB1-assoc
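
As a rough sketch, the two auxiliary patterns could be generated as follows. The helper names are hypothetical, and joining the parts of a new identifier with hyphens is an assumption inferred from the PharmGKB example rather than a documented rule.

```python
# Base URI taken from the examples in this guide.
BIO2RDF_BASE = "http://bio2rdf.org/"


def vocabulary_uri(namespace: str, term: str) -> str:
    """URI for a dataset-specific type or predicate."""
    return f"{BIO2RDF_BASE}{namespace}_vocabulary:{term}"


def resource_uri(namespace: str, *parts: str) -> str:
    """URI for a resource that has no identifier in the source data."""
    # Assumption: the parts are joined with hyphens, mirroring the
    # PharmGKB association example.
    return f"{BIO2RDF_BASE}{namespace}_resource:{'-'.join(parts)}"


print(vocabulary_uri("sgd", "Protein"))
print(resource_uri("pharmgkb", "PA451333", "ABCB1", "assoc"))
```

Both calls reproduce the SGD and PharmGKB URIs given in this section.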

Annotating resources

Here we provide some recommendations to follow when creating Bio2RDF linked data.

  1. Annotate the resources you create with the following predicates:
 http://purl.org/dc/terms/title
 a human-readable title as it appears in the source data.
 http://purl.org/dc/terms/identifier
 a string that contains the identifier, using the pattern <namespace>:<identifier>
 http://www.w3.org/2000/01/rdf-schema#label
 a Bio2RDF-generated label containing the title followed by the identifier: "title [ns:id]"
 (used by convention in most RDF browsers to render the name of a resource instead of its URI)

Taken together, the set of Bio2RDF triples for the NCBI gene identifier 15275 would be:

  @prefix dc: <http://purl.org/dc/terms/> .
  @prefix geneid: <http://bio2rdf.org/geneid:> .
  @prefix geneid_vocabulary: <http://bio2rdf.org/geneid_vocabulary:> .
  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  geneid:15275
    rdfs:label "Hk1 [geneid:15275]";
    dc:title "Hk1";
    dc:identifier "geneid:15275" ;
    rdf:type geneid_vocabulary:Gene . 
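
The same annotations can be assembled programmatically. The following is a minimal sketch that returns plain (subject, predicate, object) tuples instead of using an RDF library; the function name annotate and its parameters are hypothetical.

```python
DC = "http://purl.org/dc/terms/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def annotate(namespace, identifier, title, type_name):
    """Return the recommended annotation triples for one record."""
    curie = f"{namespace}:{identifier}"          # e.g. "geneid:15275"
    subject = f"http://bio2rdf.org/{curie}"
    type_uri = f"http://bio2rdf.org/{namespace}_vocabulary:{type_name}"
    return [
        (subject, RDFS + "label", f"{title} [{curie}]"),  # "title [ns:id]"
        (subject, DC + "title", title),
        (subject, DC + "identifier", curie),
        (subject, RDF + "type", type_uri),
    ]


for triple in annotate("geneid", "15275", "Hk1", "Gene"):
    print(triple)
```

A real converter would pass these values to an RDF serializer, which also takes care of escaping quotes in literals.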

Adding provenance information

Every dataset that forms part of the Bio2RDF network contains metadata about the source data and the transformed linked dataset including, but not limited to: creation date, license information, URL of script used, publisher and SPARQL endpoint URL (see figure). The provenance metadata graph has as its central hub a void:Dataset resource that is linked to every unique record resource in the generated linked data through the void:inDataset predicate. Consider for example the following MeSH record:

  @prefix mesh: <http://bio2rdf.org/mesh:> .
  @prefix mesh_vocabulary: <http://bio2rdf.org/mesh_vocabulary:> .
  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix void: <http://rdfs.org/ns/void#> .
  mesh:8232b9b6cf938f834497a03ad4a8c0bd
    rdf:type mesh_vocabulary:descriptor_record ;
    rdfs:label "Calcimycin [mesh:8232b9b6cf938f834497a03ad4a8c0bd]" ;
    mesh_vocabulary:mesh-tree-number "D03.438.221.173" ;
    void:inDataset <http://bio2rdf.org/bio2rdf_dataset:bio2rdf-mesh-20120827> .

This MeSH record (mesh:8232b9b6cf938f834497a03ad4a8c0bd) directly makes use of the void:inDataset predicate to link to the following void:Dataset:

  @prefix dcterms: <http://purl.org/dc/terms/> .
  @prefix cc: <http://creativecommons.org/> .
  @prefix bio2rdf_dataset: <http://bio2rdf.org/bio2rdf_dataset:> .
  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix w3cprov: <http://www.w3.org/ns/prov#> .
  @prefix void: <http://rdfs.org/ns/void#> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
  bio2rdf_dataset:bio2rdf-mesh-20120827
    rdf:type void:Dataset ;
    rdfs:label "mesh dataset by Bio2RDF [bio2rdf_dataset:bio2rdf-mesh-20120827]" ;
    w3cprov:wasDerivedFrom bio2rdf_dataset:mesh ;
    void:sparqlEndpoint <http://mesh.bio2rdf.org/sparql> ;
    void:dataDump <http://download.bio2rdf.org/rdf/mesh> ;
    dcterms:date "2012-08-27"^^xsd:date ;
    dcterms:license cc:by-attribution ;
    dcterms:publisher <http://bio2rdf.org> ;
    dcterms:creator <https://github.com/bio2rdf/bio2rdf-scripts/blob/master/mesh/mesh_parser.php> ;
    dcterms:rights "use-share-modify by-attribution restricted-by-source-license" .
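
To show how a converter might attach each record to its dataset, here is a small sketch that emits the void:inDataset link as an N-Triples line. The function names are hypothetical, and the dataset identifier pattern bio2rdf-<namespace>-<YYYYMMDD> is an assumption inferred from the MeSH example above.

```python
from datetime import date

VOID_IN_DATASET = "http://rdfs.org/ns/void#inDataset"


def dataset_uri(namespace: str, release: date) -> str:
    """URI of the void:Dataset for one namespace and release date."""
    # Assumption: "bio2rdf-<namespace>-<YYYYMMDD>", as in the MeSH example.
    return f"http://bio2rdf.org/bio2rdf_dataset:bio2rdf-{namespace}-{release:%Y%m%d}"


def in_dataset_triple(record_uri: str, namespace: str, release: date) -> str:
    """N-Triples line linking a record to its dataset."""
    return f"<{record_uri}> <{VOID_IN_DATASET}> <{dataset_uri(namespace, release)}> ."


print(in_dataset_triple(
    "http://bio2rdf.org/mesh:8232b9b6cf938f834497a03ad4a8c0bd",
    "mesh",
    date(2012, 8, 27),
))
```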

Notably, this provenance graph lets users execute date-specific queries over a given Bio2RDF endpoint by selecting the Bio2RDF records that belong to a void:Dataset produced on a given date.
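
A date-specific query of this kind might look like the SPARQL text built below. This is a sketch only: the function name is hypothetical, the query assumes that the dataset date is stored with dcterms:date as an xsd:date (as in the example above), and the code only builds the query string without sending it to an endpoint.

```python
from datetime import date


def records_on_date_query(release: date, limit: int = 10) -> str:
    """Build a SPARQL query for records in datasets produced on a given date."""
    # Assumption: datasets carry dcterms:date typed as xsd:date,
    # as in the void:Dataset example in this guide.
    return f"""PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?record ?dataset
WHERE {{
  ?dataset a void:Dataset ;
           dcterms:date "{release.isoformat()}"^^xsd:date .
  ?record void:inDataset ?dataset .
}}
LIMIT {limit}"""


print(records_on_date_query(date(2012, 8, 27)))
```

The resulting text can be pasted into the query form of an endpoint such as the MeSH endpoint listed above.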