Add URIs & definitions for additional data formats (XLSX, HDF5, JSON) #52

Closed
matthiaskoenig opened this issue Sep 30, 2017 · 8 comments

@matthiaskoenig
Collaborator

matthiaskoenig commented Sep 30, 2017

Issue

In L1V3 only NuML, CSV and TSV are defined. We have to add a section to the spec describing additional formats.

  • XLSX
  • HDF5 (used a lot, big advantage of being binary and very compact, required for large datasets)
  • JSON (used a lot, especially in the context of transferring data on the web for web apps to work with SED-ML)

Proposal

Define the respective URIs

  • urn:sedml:format:xlsx
  • urn:sedml:format:hdf5
  • urn:sedml:format:json
    with the restriction of the allowed data and DimensionDescriptions.

This requires the ability to specify complex sources, i.e., nested files and parts of files (see #46).
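As a rough illustration of how a tool might dispatch on these URNs, here is a minimal Python sketch. The `hdf5`, `json` and `xlsx` entries are only the URNs proposed in this issue (XLSX spelled as in the issue title), not part of any released spec. The NuML/CSV/TSV entries assume the existing L1V3 URNs follow the same `urn:sedml:format:` pattern. `FORMAT_URNS` and `format_for` are hypothetical names made up for this example.

```python
# Hypothetical lookup table: data format URN -> short description.
PREFIX = "urn:sedml:format:"

FORMAT_URNS = {
    PREFIX + "numl": "NuML (assumed L1V3 URN)",
    PREFIX + "csv": "comma-separated values (assumed L1V3 URN)",
    PREFIX + "tsv": "tab-separated values (assumed L1V3 URN)",
    PREFIX + "xlsx": "Excel workbook (proposed in this issue)",
    PREFIX + "hdf5": "HDF5 (proposed in this issue)",
    PREFIX + "json": "JSON (proposed in this issue)",
}


def format_for(urn: str) -> str:
    """Return the short format name for a known URN, or raise ValueError."""
    if urn not in FORMAT_URNS:
        raise ValueError(f"unknown data format URN: {urn!r}")
    return urn[len(PREFIX):]


hdf5_name = format_for("urn:sedml:format:hdf5")
```

Rejecting unknown URNs up front keeps a reader from silently guessing a format from the file extension.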

@matthiaskoenig matthiaskoenig added this to the L1V4 release milestone Sep 30, 2017
@matthiaskoenig matthiaskoenig changed the title Add clarification and URIs for additional data formats (HDF5, JSON) Add URIs & definitions for additional data formats (XLSX, HDF5, JSON) Oct 3, 2017
@matthiaskoenig matthiaskoenig removed this from the L1V4 release milestone Jun 21, 2018
@jonrkarr
Contributor

jonrkarr commented Feb 2, 2021

EDAM (see #94) already has terms for all of these formats.

I second the use of HDF5. This is key for large datasets.

For structured datasets, another format that might make sense is SQLite.

luciansmith added a commit that referenced this issue Jun 11, 2021
Define the HDF5 format, and explain how to plot multidimensional data.
@luciansmith
Contributor

I added hdf5 as an option, as it's clearly already getting a ton of use. Here's what I put in for its section (after the CSV/TSV descriptions):

HDF5 (Hierarchical Data Format version 5)
The format HDF5 is defined at https://portal.hdfgroup.org/display/HDF5/HDF5. It supports the
storage of multidimensional data, and is therefore ideal for storing the SED-ML output of repeated
tasks; particularly nested repeated tasks.

Each dimension of SED-ML RepeatedTask output should be labeled according to the id of the SED-ML
object that describes that dimension, namely:
  • The id of the top-level RepeatedTask
  • The id of the SubTask
  • The id of any nested SubTask (for arbitrarily-deeply nested subtasks).
  • The dimension of the data itself (i.e. time for a UniformTimeCourse).
  • The id of the requested variable, or the infix representation of the Math from the DataGenerator.

Each dimension may also be annotated in this format, with some ontology such as the ‘Semanticscience
Integrated Ontology’ (SIO, https://bioportal.bioontology.org/ontologies/SIO).
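To make the labeling scheme above concrete, here is a minimal Python sketch that builds the ordered axis labels and the resulting array shape for a nested repeated task. It is only one reading of the text under stated assumptions. The `Axis` class, the `repeated_task_axes` helper, the example ids (`repeat1`, `subtask1`, `S1`, `S2`) and the placeholder label `"variables"` for the last axis are all hypothetical. No HDF5 library is used, so this says nothing about how the labels would actually be written to a file.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Axis:
    label: str  # id of the SED-ML object describing this dimension
    size: int   # number of entries along this dimension


def repeated_task_axes(
    nesting: Sequence[Tuple[str, int]],
    data_axis: Tuple[str, int],
    variable_ids: Sequence[str],
) -> List[Axis]:
    """Build axes in the order listed above.

    nesting: (id, size) pairs, outermost first: the top-level RepeatedTask,
             its SubTask, then any nested SubTasks.
    data_axis: (label, size) of the data itself, e.g. ("time", 101).
    variable_ids: ids of the requested variables (last axis).
    """
    axes = [Axis(label, size) for label, size in nesting]
    axes.append(Axis(*data_axis))
    # "variables" is a placeholder name for the axis indexed by variable ids.
    axes.append(Axis("variables", len(variable_ids)))
    return axes


axes = repeated_task_axes(
    nesting=[("repeat1", 10), ("subtask1", 1)],
    data_axis=("time", 101),
    variable_ids=["S1", "S2"],
)
labels = [axis.label for axis in axes]
shape = tuple(axis.size for axis in axes)
```

Here `labels` lists one id per dimension, outermost first, and `shape` is the shape a multidimensional dataset with those axes would have.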

@luciansmith
Contributor

I didn't add xlsx or JSON or SQLite. I can, though those might be more complicated?

@jonrkarr
Contributor

  • The id of the top-level RepeatedTask
  • The id of the SubTask
  • The id of any nested SubTask (for arbitrarily-deeply nested subtasks).

This information is only straightforward when a dataset derives from a single top-level task. Datasets that arise from computations spanning the results of multiple tasks won't have a single top-level task id or clear semantics for other dimensions.

There are multiple options around this:

  • Focus storage of raw results on variables rather than on reports
  • Disallow calculations involving multiple tasks
  • Particularly when calculations involve multiple tasks, allow investigators to annotate their meaning and copy this information into files which contain results (e.g., HDF, JSON, XLSX, etc.)

I think L1V4 could say something like "when data generators only contain results from a single task, we recommend that reports of their results contain the following metadata ...". Dealing with this properly could be punted to L2.
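The "only contain results from a single task" condition is easy to check mechanically. A minimal Python sketch, assuming each variable of a data generator carries one task reference. The helper name `single_task_id` and the example task ids are hypothetical.

```python
from typing import Iterable, Optional


def single_task_id(task_references: Iterable[str]) -> Optional[str]:
    """Return the task id if every variable references the same task.

    Returns None when the references span several tasks (or there are none),
    i.e. when the recommended per-dimension metadata would not clearly apply.
    """
    unique = set(task_references)
    if len(unique) == 1:
        return next(iter(unique))
    return None


same_task = single_task_id(["task1", "task1"])
mixed_tasks = single_task_id(["task1", "task2"])
```

A writer could emit the dimension metadata only when this returns an id, and fall back to investigator-supplied annotations otherwise.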

@jonrkarr
Contributor

If JSON is being used, I feel like that would benefit from its own explanation since there are multiple ways data could be encoded.
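To illustrate the ambiguity, here is a small Python sketch encoding the same two-column result in two plausible JSON layouts. Neither layout is defined by SED-ML. The variable names and values are made up for the example.

```python
import json

# Made-up example data: one time axis and one species trace.
time = [0.0, 1.0, 2.0]
s1 = [10.0, 8.0, 6.5]

# Layout 1: column-oriented, one array per variable id.
column_oriented = {"time": time, "S1": s1}

# Layout 2: row-oriented, one object per time point.
row_oriented = [{"time": t, "S1": v} for t, v in zip(time, s1)]

column_json = json.dumps(column_oriented)
row_json = json.dumps(row_oriented)

# Both carry the same information, but a reader must know which layout to expect.
decoded_rows = json.loads(row_json)
rebuilt = {key: [row[key] for row in decoded_rows] for key in decoded_rows[0]}
```

The two strings differ even though `rebuilt` equals the column-oriented data, which is why a spec section would need to pick one encoding.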

@luciansmith
Contributor

You're right that I should include a bit about the RemainingDimensions, but I don't know of any other way to reduce the dimensionality of SED-ML data through computation, given that we require all calculations to be element-by-element, and for cross-matrix data calculations to have identical dimensions.

I don't know of anyone using JSON; if anyone is, I would invite them to write about how they're using it to encode this data!

@luciansmith
Contributor

OK, I updated the HDF5 section to include:

"When a DependentVariable is used to reduce the dimensionality of a set of data, the ids of whatever
dimensions remain should be used (defined by its RemainingDimension children). The dimensions may
be annotated to describe the dimension reduction as well. When a DataGenerator contains a
DependentVariable that outputs a matrix, that matrix can also be labeled appropriately (such as with
species or reaction ids).

When output from multiple tasks are combined mathematically, their dimensions must match exactly,
so the ids from either (or a combination of both) may be used. Again, annotations are recommended to
describe how the data was combined."

I also added this bit to the DataGenerator class:

"When multidimensional data is output to a Report, information about the dimensions should be stored
in the output format chosen for the report, such as CSV or HDF5."

(Both CSV and HDF5 are links to the relevant sections.)
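A minimal Python sketch of the labeling rule for reduced data described in the HDF5 addition above: after a reduction, only the labels of the remaining dimensions are kept, in their original order. The helper name `remaining_labels` and the example ids are hypothetical, and how RemainingDimension children are actually parsed is out of scope here.

```python
from typing import List, Sequence


def remaining_labels(axis_labels: Sequence[str], remaining: Sequence[str]) -> List[str]:
    """Keep only the labels named as remaining, preserving the original axis order."""
    unknown = [label for label in remaining if label not in axis_labels]
    if unknown:
        raise ValueError(f"remaining dimensions not present in the data: {unknown}")
    keep = set(remaining)
    return [label for label in axis_labels if label in keep]


# Example: reduce over time (e.g. taking a maximum), keeping the repeat axes.
reduced = remaining_labels(["repeat1", "subtask1", "time"], ["subtask1", "repeat1"])
```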

@jonrkarr
Contributor

When output from multiple tasks are combined mathematically, their dimensions must match exactly,

I think this will conflict with making the number of steps optional. Calculations beyond the shape of the smallest input can be defined to be NaN.
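A minimal Python sketch of that NaN-padding idea for one-dimensional outputs, assuming element-by-element arithmetic. The `combine_padded` helper is hypothetical and not part of any SED-ML library.

```python
import math
from itertools import zip_longest
from typing import Callable, List, Sequence


def combine_padded(
    a: Sequence[float],
    b: Sequence[float],
    op: Callable[[float, float], float],
) -> List[float]:
    """Apply op element by element; positions past the shorter input see NaN."""
    nan = float("nan")
    return [op(x, y) for x, y in zip_longest(a, b, fillvalue=nan)]


# Two outputs with different numbers of steps.
result = combine_padded([1.0, 2.0, 3.0], [10.0, 20.0], lambda x, y: x + y)
tail_is_nan = math.isnan(result[2])
```

Because NaN propagates through ordinary arithmetic, the result has the length of the longer input and is NaN wherever the shorter input has no value.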

so the ids from either (or a combination of both) may be used.

I don't think this is needed. The results of calculations are assigned to data generators, which have ids. Users can set these ids to be meaningful strings as with all other ids.
