hlapp edited this page May 21, 2011 · 9 revisions


Integrating loosely structured data into the Linked Open Data cloud
A DataONE Summer Internship Project

Synopsis

DataONE is a virtual network of data repositories covering ecological, evolutionary, and earth sciences. One of DataONE's objectives is to make the data held in these repositories broadly discoverable, accessible, and reusable. The Linked Open Data movement has created a set of conventions that allows data published on the web to form an interconnected, globally navigable web of data, the Linked Open Data (LOD) cloud. It would thus be desirable to expose data held within the DataONE network to the LOD cloud. While converting data objects into RDF syntax is not hard, doing so in a manner that allows the resulting RDF to fully integrate into and connect with the rest of the LOD cloud is difficult. The reason is that the majority of the data held in network repositories is highly heterogeneous, and typically represented in loosely structured or ad-hoc formats specific to an instrument or program.

People

Student: Aída Gándara, University of Texas at El Paso

Mentor: Hilmar Lapp, National Evolutionary Synthesis Center (NESCent)

Project idea description

The Linked Data conventions describe four principles that allow data of any kind and from any online source to form a global interconnected web of data: i) name every "thing" that has some data or information associated with it; ii) use HTTP URIs to do so; iii) provide useful information or data in Resource Description Framework (RDF) format to someone looking up such URIs; and iv) within information provided this way, link to other common "things", such as points or axes of reference, and use common vocabularies to attach meaning to links wherever possible. These seemingly simple principles have nonetheless been highly effective in facilitating the creation of large, globally distributed, and constantly growing aggregations of Linked Open Data (LOD). The result is a universally applicable framework for machines and users alike to integrate, navigate, and discover data by following links that are semantically of interest.
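As a rough illustration of the four principles, the sketch below names a data set with an HTTP URI and describes it with two RDF statements in N-Triples syntax, one of which links to an external resource. The `example.org` base URI, the data set identifier, the title, and the choice of Dublin Core Terms and a DBpedia resource as link targets are assumptions made for this example only; they are not taken from any DataONE member node.

```python
# Minimal sketch: describe one data set as N-Triples using only the standard library.
DCTERMS = "http://purl.org/dc/terms/"   # Dublin Core Terms namespace (common vocabulary)
BASE = "http://example.org/dataset/"    # hypothetical base URI for naming data sets


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(value):
    """Quote a plain string literal, escaping backslashes, quotes, and newlines."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def describe_dataset(dataset_id, title, subject_uri):
    """Return N-Triples lines naming a data set and linking it to a common 'thing'."""
    subject = uri(BASE + dataset_id)  # principles i and ii: name it with an HTTP URI
    return [
        # principle iii: useful information, expressed in RDF
        f"{subject} {uri(DCTERMS + 'title')} {literal(title)} .",
        # principle iv: link to an external resource using a shared vocabulary
        f"{subject} {uri(DCTERMS + 'subject')} {uri(subject_uri)} .",
    ]


lines = describe_dataset(
    "42", 'Leaf "area" survey', "http://dbpedia.org/resource/Ecology"
)
print("\n".join(lines))
```

Serving these statements when someone dereferences the data set URI is what would complete principle iii; the sketch only covers producing them.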

Applying the Linked Data principles to the data holdings of non-specialized digital repositories, such as DataONE and many of its member nodes, is challenging. These data are often highly heterogeneous, and are not natively expressed in RDF or in a format structured enough to lend itself to automatic conversion to RDF. Instead, they are typically represented in formats that are either loosely structured in an ad-hoc manner (such as spreadsheets), or follow one of a myriad of formats output by instruments or analysis programs. It is thus not clear what the universe of "things" to name is, what the common points or axes of reference are, what kinds (semantics) of links are needed, and how data archived in this way can be exposed in RDF such that the conversion can be automated, yet is still useful for science-motivated discovery and integration.

The idea of this project is to develop an exploratory prototype, and practical recommendations resulting from it, for how the heterogeneous and loosely structured data held in non-specialized DataONE member nodes can be exposed to the Linked (Open) Data cloud. The approach would consist of obtaining a sufficiently representative sample of data sets from DataONE's three initial member nodes (Dryad, KNB, and ORNL-DAAC), and using them as instance data for which to define the RDF predicate vocabularies, domain ontologies, resource URIs, and conversion mechanisms that are necessary to create a LOD representation of these data. This representation can then be uploaded, navigated, and queried either in one of the web-based LOD browsers (such as URIburner) or, for example, in a local installation of OpenLink Virtuoso.
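A conversion mechanism of the kind described above might start out as something like the following sketch, which turns rows of a spreadsheet-like CSV file into N-Triples through a hand-written mapping from column names to predicates. Everything specific here is an assumption for illustration: the sample data, the column names, the `example.org` URIs, and the decision to map the `species` column to the Darwin Core `scientificName` term. Columns without a mapping are reported rather than converted, which is where the vocabulary and ontology work of the project would come in.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical base URI for resources and local terms

# Hypothetical column-to-predicate mapping; a real one would come out of the
# vocabulary and ontology work described above.
PREDICATES = {
    "species": "http://rs.tdwg.org/dwc/terms/scientificName",  # Darwin Core term
    "site": BASE + "vocab/site",                               # made-up local term
}

# Made-up sample standing in for a loosely structured spreadsheet.
SAMPLE = "species,site,notes\nQuercus alba,plot-7,cloudy\nAcer rubrum,plot-2,\n"


def convert(text, dataset_id):
    """Convert CSV text to N-Triples lines; also return columns with no mapping."""
    triples, unmapped = [], set()
    reader = csv.DictReader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=1):
        # Each row becomes its own resource under the data set's URI.
        subject = f"<{BASE}dataset/{dataset_id}/row/{row_number}>"
        for column, value in row.items():
            if not value:
                continue  # skip empty cells
            predicate = PREDICATES.get(column)
            if predicate is None:
                unmapped.add(column)  # no agreed meaning for this column yet
                continue
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} <{predicate}> "{escaped}" .')
    return triples, sorted(unmapped)


triples, unmapped = convert(SAMPLE, "demo")
print("\n".join(triples))
print("Columns without a predicate mapping:", unmapped)
```

The resulting N-Triples text is a standard RDF serialization, so it could in principle be loaded into an RDF store for querying; the sketch does not attempt that step.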
