jctoledo edited this page Feb 3, 2012 · 26 revisions

The linked data that forms part of Bio2RDF adheres to a simple set of modeling patterns that permit our different datasets to interoperate syntactically. The best practices presented here are inspired by the Banff Manifesto, Tim Berners-Lee's design principles, and the collective experience of our community. This document provides a clear set of guidelines to help Bio2RDF users and contributors understand how to use and create Bio2RDF-compatible linked data. Comments and suggestions are always welcome; join our mailing list to get more involved!

Table of Contents

Data principles

1) URIs follow a syntactic pattern and are dereferenceable

2) Authoritative public namespaces are used

3) Mandatory predicates are used

4) RDFizer programs must be open source

5) Every dataset must be shipped with an ontology file

Identifiers

The first step of the RDFization process is to adopt a consistent identifier scheme. Data providers such as NCBI and EBI use unique identifiers to refer to the entities that they host. The linked data that forms part of Bio2RDF distinguishes between identifiers that refer to the original hosted entities and any auxiliary identifiers used in the creation of the linked data graph.

Entities

For every unique entity corresponding to a record, the Bio2RDF identifier is given by the following URI pattern:

 <code>http://bio2rdf.org/''namespace'':''identifier''</code>

where the namespace is a short name listed in our dataset registry that uniquely identifies the source (dataset/database). The identifier is the (alpha)numeric string assigned to identify that entity. For instance, the gene identified by the number 15275 in the NCBI EntrezGene Database (namespace = geneid) has the following identifier:

 <code>http://bio2rdf.org/''geneid'':''15275''</code>
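The entity pattern above can be sketched as a small helper. This is only an illustration: the function name `entity_uri` and the empty-value check are assumptions, not part of any Bio2RDF tooling.

```python
BASE = "http://bio2rdf.org/"


def entity_uri(namespace: str, identifier: str) -> str:
    """Build an entity URI of the form http://bio2rdf.org/namespace:identifier.

    Hypothetical helper: the guidelines only define the pattern itself.
    """
    if not namespace or not identifier:
        raise ValueError("namespace and identifier must be non-empty")
    return f"{BASE}{namespace}:{identifier}"


# The Entrez Gene example from the text.
print(entity_uri("geneid", "15275"))
```

Keeping the base URI in one constant means every RDFizer script that reuses the helper produces identifiers with the same prefix.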

Vocabulary

The Bio2RDF URI scheme is applied not just to data entries but also to the vocabulary (types and relations) used to describe these entries. Vocabulary terms follow this pattern:

 <code>http://bio2rdf.org/''namespace''_term:''term''</code>

For example, the gene identified by geneid:15275 is a kind of Gene, as defined by Entrez Gene.

 <code>http://bio2rdf.org/''geneid''_term:''Gene''</code>
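As a rough sketch, vocabulary URIs can be generated the same way as entity URIs, with the `_term` suffix appended to the namespace. The function name `term_uri` is an assumption made for illustration.

```python
BASE = "http://bio2rdf.org/"


def term_uri(namespace: str, term: str) -> str:
    """Build a vocabulary URI of the form http://bio2rdf.org/namespace_term:term.

    Hypothetical helper illustrating the pattern described in the text.
    """
    if not namespace or not term:
        raise ValueError("namespace and term must be non-empty")
    return f"{BASE}{namespace}_term:{term}"


# The type used for Entrez Gene records in the example above.
print(term_uri("geneid", "Gene"))
```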

Descriptions

Minimum Annotations

Each resource should contain the following annotations:

 <code>http://purl.org/dc/terms/title</code> 
 a human readable title as it appears in the source data.
 <code>http://purl.org/dc/terms/identifier</code>
 a string that contains the identifier using the following pattern <namespace>:<identifier>
 <code>rdfs:label</code>
 a Bio2RDF-generated label containing the title followed by the identifier, in the form "title [ns:id]". By convention, most RDF browsers use this label to render the name of a resource instead of its URI.

Taken together, the annotations for geneid:15275 are:

 <code>
  geneid:15275 
   rdfs:label "Hk1 [geneid:15275]" ;
   dc:title "Hk1" ;
   dc:identifier "geneid:15275" ;
   rdf:type geneid_term:Gene .
 </code>
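A minimal sketch of how an RDFizer script might emit these annotations is shown below. The function names are hypothetical, and the prefixes (`rdfs:`, `dc:`, `rdf:`, and the dataset namespaces) are assumed to be declared elsewhere in the output file.

```python
def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for use inside a quoted literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def minimal_annotations(namespace: str, identifier: str,
                        title: str, type_term: str) -> str:
    """Return the minimum annotations for one resource as prefixed statements.

    Hypothetical helper; prefix declarations are assumed to exist elsewhere.
    """
    curie = f"{namespace}:{identifier}"
    # The label follows the "title [ns:id]" convention from the guidelines.
    label = f"{title} [{curie}]"
    return "\n".join([
        curie,
        f'  rdfs:label "{escape_literal(label)}" ;',
        f'  dc:title "{escape_literal(title)}" ;',
        f'  dc:identifier "{curie}" ;',
        f"  rdf:type {namespace}_term:{type_term} .",
    ])


print(minimal_annotations("geneid", "15275", "Hk1", "Gene"))
```

Deriving the label from the title and identifier in one place keeps the three annotations consistent with each other.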

Datasets, Records and Entities

We recognize a minimum of three kinds of entities in biological information resources: physical entities, records, and datasets.

1. Record

Records are information objects that contain a set of statements, primarily about the subject.

 <code>
  namespace_record:identifier
    bio2rdf_term:has-primary-subject namespace:identifier .
 </code>
 <code>
  namespace:identifier
   bio2rdf_term:is-described-by namespace_record:identifier .
 </code>
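The two reciprocal statements can be generated together so that a record and its subject never get out of sync. The sketch below is illustrative only; the function name `record_links` and the tuple representation of a statement are assumptions.

```python
def record_links(namespace: str, identifier: str) -> list[tuple[str, str, str]]:
    """Return the record-to-entity and entity-to-record statements.

    Each statement is a (subject, predicate, object) tuple of prefixed names.
    Hypothetical helper illustrating the pattern in the text.
    """
    entity = f"{namespace}:{identifier}"
    record = f"{namespace}_record:{identifier}"
    return [
        (record, "bio2rdf_term:has-primary-subject", entity),
        (entity, "bio2rdf_term:is-described-by", record),
    ]


for subject, predicate, obj in record_links("geneid", "15275"):
    print(f"{subject} {predicate} {obj} .")
```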

2. Dataset

Datasets are collections of records.

 <code>
  bio2rdf_dataset:<namespace>
    bio2rdf_term:has-item namespace_record:identifier .
 </code>

Since datasets can be versioned, we describe each version as its own resource that is part of the unversioned dataset:

 <code>
  bio2rdf_dataset:namespace.version
    dc:hasVersion "13" ;
    dc:partOf bio2rdf_dataset:namespace .
 </code>

Mappings

This section describes how to create mappings from your dataset-specific vocabulary to the Semanticscience Integrated Ontology (SIO).

Ontologies

Scripts

:Category:Scripts

Serialization

Loading

Loading the RDF database