
Bio2RDF Dataset Provenance

skchrko edited this page Jan 13, 2015 · 58 revisions

Metadata about each Bio2RDF linked dataset is now accessible through a dataset-specific provenance graph. It includes details such as the date of linked data conversion, the licensing of the source data, the age of the source data, and a link to the script that generated the Bio2RDF dataset, among others. The following is a detailed description of our provenance modeling scheme and how you can make use of it.

Bio2RDF Provenance Model

Provenance data for each Bio2RDF dataset is stored in a separate named graph in the corresponding SPARQL endpoint. The provenance graph URI follows the pattern http://bio2rdf.org/bio2rdf-[dataset]-provenance, where 'dataset' is the preferred short name (or prefix) for a given source dataset as extracted from our Life Science Registry. For example, the Comparative Toxicogenomics Database (CTD) dataset provenance graph makes use of:

http://bio2rdf.org/bio2rdf-ctd-provenance

as its graph URI. Our provenance model uses the W3C Vocabulary of Interlinked Datasets (VoID), the W3C provenance vocabulary (PROV) and the Dublin Core vocabulary. Each data item is linked to a provenance object, unique to its linked dataset, that points to descriptions of:

  • source of the data (i.e. the provider's website)
  • date on which the RDF was generated
  • licensing of source data (if available)
  • SPARQL endpoint URL
  • URL to our download page
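The graph URI pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the Bio2RDF scripts: the function name `provenance_graph_uri` is hypothetical, and the lowercasing of the prefix is an assumption based on the `ctd` example.

```python
BIO2RDF_BASE = "http://bio2rdf.org/"


def provenance_graph_uri(prefix: str) -> str:
    """Build the provenance graph URI for a dataset prefix such as 'ctd'."""
    # Assumption: registry prefixes are lowercase short names.
    prefix = prefix.strip().lower()
    if not prefix:
        raise ValueError("dataset prefix must not be empty")
    return f"{BIO2RDF_BASE}bio2rdf-{prefix}-provenance"


print(provenance_graph_uri("ctd"))
# → http://bio2rdf.org/bio2rdf-ctd-provenance
```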

Each dataset provenance object has a unique IRI and label based on the dataset name and creation date, e.g.

http://bio2rdf.org/bio2rdf_dataset:bio2rdf-mesh-20120827

The date-specific dataset IRI is linked to a unique dataset IRI using the W3C PROV predicate ‘wasDerivedFrom’:

<http://bio2rdf.org/bio2rdf_dataset:bio2rdf-mesh-20120827> prov:wasDerivedFrom <http://bio2rdf.org/bio2rdf_dataset:mesh>

such that one can query the dataset SPARQL endpoint to retrieve all provenance records for datasets created on different dates. An example provenance record for the NLM Medical Subject Headings (MeSH) dataset can be seen below. Note that each subject IRI in the dataset is linked to the date-unique dataset IRI that is part of the provenance record using the VoID ‘inDataset’ predicate. Other important features of the provenance record include the use of the Dublin Core ‘creator’ term to link a dataset to the script on GitHub that was used to generate it, the VoID predicate ‘sparqlEndpoint’ to point to the dataset SPARQL endpoint, and the VoID predicate ‘dataDump’ to point to the data download URL.

[Figure: Bio2RDF Provenance Model]
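As a rough sketch of the IRI scheme and the ‘wasDerivedFrom’ link described above, the following Python builds the two dataset IRIs and a query that lists every dated version of a dataset. The helper names are hypothetical, and the IRI templates are generalized from the single MeSH example, so other datasets may not follow them exactly.

```python
from datetime import date

DATASET_NS = "http://bio2rdf.org/bio2rdf_dataset:"


def parent_dataset_iri(prefix: str) -> str:
    """IRI of the date-independent dataset, e.g. ...bio2rdf_dataset:mesh."""
    return f"{DATASET_NS}{prefix}"


def dated_dataset_iri(prefix: str, created: date) -> str:
    """IRI of one dated release; assumes the YYYYMMDD suffix seen for MeSH."""
    return f"{DATASET_NS}bio2rdf-{prefix}-{created:%Y%m%d}"


def versions_query(prefix: str) -> str:
    """SPARQL query listing all dated datasets derived from the parent IRI."""
    return (
        "PREFIX prov: <http://www.w3.org/ns/prov#>\n"
        "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
        "SELECT ?dataset ?creationDate WHERE {\n"
        f"  ?dataset prov:wasDerivedFrom <{parent_dataset_iri(prefix)}> .\n"
        "  ?dataset dcterms:created ?creationDate .\n"
        "}\n"
        "ORDER BY ?creationDate"
    )


print(dated_dataset_iri("mesh", date(2012, 8, 27)))
# → http://bio2rdf.org/bio2rdf_dataset:bio2rdf-mesh-20120827
```

The string returned by `versions_query("mesh")` can be pasted into the dataset's SPARQL endpoint as-is.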

Querying the provenance graph

The entry points for a provenance record are resources typed as

http://rdfs.org/ns/void#Dataset

Thus, to retrieve the provenance details for a dataset, one could use the following SPARQL query:

PREFIX dcterms: <http://purl.org/dc/terms/>  
PREFIX prov: <http://www.w3.org/ns/prov#>  
PREFIX void: <http://rdfs.org/ns/void#>  
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>  
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>  
SELECT * WHERE {  
?dataset rdf:type void:Dataset .  
?dataset rdfs:label ?datasetLabel .  
?dataset dcterms:created ?creationDate .  
?dataset dcterms:creator ?creationScript .  
?dataset dcterms:rights ?license .  
?dataset prov:wasDerivedFrom ?parentDataSet .  
?dataset void:dataDump ?downloadLink .  
?dataset void:sparqlEndpoint ?endpointLink .  
}
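The query above can also be restricted to a single provenance graph and sent to an endpoint programmatically. The sketch below uses only the Python standard library and makes several assumptions: the endpoint URL must be supplied by the caller, the endpoint is assumed to accept a GET request with a `query` parameter and to return the standard SPARQL JSON results format, and the function names are hypothetical. The license pattern is wrapped in OPTIONAL because licensing is only recorded when available.

```python
import json
import urllib.parse
import urllib.request


def provenance_query(graph_uri: str) -> str:
    """Return a provenance query scoped to one named graph."""
    return f"""\
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE {{
  GRAPH <{graph_uri}> {{
    ?dataset a void:Dataset ;
             rdfs:label ?datasetLabel ;
             dcterms:created ?creationDate .
    OPTIONAL {{ ?dataset dcterms:rights ?license . }}
  }}
}}"""


def bindings_to_rows(results: dict) -> list:
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run_query(endpoint_url: str, query: str, timeout: float = 30.0) -> list:
    """Send the query via HTTP GET and return the flattened result rows."""
    url = endpoint_url + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return bindings_to_rows(json.load(response))
```

A call would look like `run_query(endpoint_url, provenance_query("http://bio2rdf.org/bio2rdf-ctd-provenance"))`, where `endpoint_url` is the address of the dataset's SPARQL endpoint.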