Use Case: Aggregated and Integrated Data Sets #18

pgroth opened this Issue Jan 25, 2015 · 1 comment



pgroth commented Jan 25, 2015

Contributors: Paul Groth, Mike Taylor

Goals and Summary

Many datasets are the product of the aggregation of data from multiple sources and often multiple parties. For example, the Open PHACTS platform provides integrated access to over 10 different databases. These databases, for example ChEMBL and UniProt, are amalgamations of data extracted both automatically and via human curation from other sources such as the literature. These databases in turn rely on data models and ontologies developed by still others, for example the GO or ChEBI ontologies. Additionally, integrators may slightly change datasets through format changes or the addition of links.

  • Issue 1: The provenance or credit chain of a single answer given by a data integration platform can be much bigger than the answer itself. How do we correctly ensure credit is given to all actors in the system? Furthermore, how do we ensure that these chains can be effectively traced back?
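To make the tracing problem in Issue 1 concrete, here is a minimal sketch that walks a derivation graph from an integrated answer back to every upstream source. The `DERIVED_FROM` mapping and the `credit_chain` function are hypothetical names, and the edges are only illustrative (the text does not say which ontology each database relies on); a real platform would need to record these links itself.

```python
from collections import deque

# Hypothetical derivation graph: each dataset maps to the sources it was built from.
# The edges loosely follow the examples in the text and are assumptions, not a real record.
DERIVED_FROM = {
    "Open PHACTS": ["ChEMBL", "UniProt"],
    "ChEMBL": ["literature", "ChEBI ontology"],
    "UniProt": ["literature", "GO ontology"],
}


def credit_chain(answer_source, derived_from):
    """Return every source reachable upstream of answer_source (breadth-first),
    mapped to the path that leads from the answer back to it."""
    paths = {answer_source: [answer_source]}
    queue = deque([answer_source])
    while queue:
        current = queue.popleft()
        for upstream in derived_from.get(current, []):
            if upstream not in paths:  # keep the first (shortest) path found
                paths[upstream] = paths[current] + [upstream]
                queue.append(upstream)
    return paths


chain = credit_chain("Open PHACTS", DERIVED_FROM)
for source, path in chain.items():
    # Read "A <- B" as "A was derived from B".
    print(" <- ".join(path))
```

Even in this toy graph a single answer drags in five upstream actors, which illustrates why the credit chain can outgrow the answer itself.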

An additional aspect of such integrations is that they are all in constant flux. For example, UniProt is released every 4 weeks, ChEMBL is released quarterly, and some, such as SureChEMBL, are updated hourly. How does a data integrator appropriately capture and expose this information? Currently, this is often done by providing versioned data dumps. However, this may not be allowed or supported in many cases due to licensing, technical or policy issues.

  • Issue 2: How do we cite developing data sets originating from multiple sources?
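One possible shape for an answer to Issue 2, sketched under assumptions: the integrator records which release of each source it used and the access date, and a citation string is generated from that record. `SourceRelease` and `format_citation` are hypothetical names, the release labels are placeholders rather than real release identifiers, and the citation format is not an agreed standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceRelease:
    name: str
    release: str  # release label as published by the source (placeholder values below)


def format_citation(integrator, accessed, sources):
    """Build a citation naming the integrator, the access date,
    and the specific release of each upstream source."""
    parts = ", ".join(
        f"{s.name} (release {s.release})"
        for s in sorted(sources, key=lambda s: s.name)
    )
    return f"{integrator}, accessed {accessed.isoformat()}; built from: {parts}."


citation = format_citation(
    "Open PHACTS",
    date(2015, 1, 25),  # placeholder access date
    [
        SourceRelease("UniProt", "R-EXAMPLE-2"),  # placeholder release label
        SourceRelease("ChEMBL", "R-EXAMPLE-1"),   # placeholder release label
    ],
)
print(citation)
```

This only pins down the time dimension when every source publishes stable release identifiers; an hourly-updated source would need a timestamp or similar instead.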

Why is it important and to whom?

  • Downstream data providers need to be given appropriate credit.
  • Usage is a key metric for continued funding of these databases.
  • Funders would like to know what data sources are being effectively used.
  • Curators need to be credited with the important work that they do.
  • Users would like to know best practice in terms of citation. The "how to cite us" pages vary across data sets. Furthermore, it is difficult to cite a specific time dimension, such as a particular release.
  • Encourages the fundamental work of data integration.

Why hasn’t it been solved yet?

  • No current agreement on what data integrators should supply in terms of citation
  • Need to be able to expand the entire provenance trace easily, which requires agreement across the data supply chain. While standards such as W3C PROV exist, they are still not widely used.
  • Inability to appropriately "ping back" or notify upstream data providers about usage
  • Need to combine data and software citation

Actionable Outcomes



janemrc commented Jan 29, 2015

I think this is really significant, and the interplay between PROV and data citation needs to be examined. j
