Use Case: Aggregated and Integrated Data Sets #18
Contributors: Paul Groth, Mike Taylor
Goals and Summary
Many datasets are the product of aggregating data from multiple sources and often multiple parties. For example, the Open PHACTS platform (http://dev.openphacts.org) provides integrated access to over 10 different databases. These databases, such as ChEMBL and UniProt, are themselves amalgamations of data extracted, both automatically and through human curation, from other sources such as the literature. They in turn rely on data models and ontologies developed by still others, for example the GO or ChEBI ontologies. Additionally, integrators may slightly change datasets through format changes or the addition of links.
An additional aspect of such integrations is that they are all in constant flux. UniProt is released every 4 weeks, ChEMBL is released quarterly, and some, such as SureChEMBL, are updated hourly. How does a data integrator appropriately capture and expose this information? Currently, this is often done by providing versioned data dumps. However, this may not be allowed or supported in many cases due to licensing, technical, or policy issues.
Why is it important and to whom?
Why hasn’t it been solved yet?