Add URIs & definitions for additional data formats (XLSX, HDF5, JSON) #52

matthiaskoenig · 2017-09-30T10:46:02Z

Issue

In L1V3 only NuML, CSV and TSV are defined. We have to add a section to the spec describing additional formats:

- XLSX
- HDF5 (used a lot; has the big advantage of being binary and very compact, which is required for large datasets)
- JSON (used a lot, especially for transferring data on the web so that web apps can work with SED-ML)

Proposal

Define the respective URIs:

- urn:sedml:format:xlsx
- urn:sedml:format:hdf5
- urn:sedml:format:json

with restrictions on the allowed data and DimensionDescriptions.

This requires the ability to specify complex sources, i.e., nested files and parts of files (see #46).


jonrkarr · 2021-02-02T07:06:37Z

EDAM (see #94) already has terms for all of these formats.

I second the use of HDF5. This is key for large datasets.

For structured datasets, another format that might make sense is SQLite.


luciansmith · 2021-06-11T20:42:10Z

I added hdf5 as an option, as it's clearly already getting a ton of use. Here's what I put in for its section (after the CSV/TSV descriptions):

HDF5 (Hierarchical Data Format version 5)
The HDF5 format is defined at https://portal.hdfgroup.org/display/HDF5/HDF5. It supports the
storage of multidimensional data, and is therefore ideal for storing the SED-ML output of repeated
tasks, particularly nested repeated tasks.

Each dimension of SED-ML RepeatedTask output should be labeled according to the id of the SED-ML
object that describes that dimension, namely:
- The id of the top-level RepeatedTask
- The id of the SubTask
- The id of any nested SubTask (for arbitrarily-deeply nested subtasks).
- The dimension of the data itself (i.e. time for a UniformTimeCourse).
- The id of the requested variable, or the infix representation of the Math from the DataGenerator.

Each dimension may also be annotated in this format, with some ontology such as the 'Semanticscience
Integrated Ontology' (SIO, https://bioportal.bioontology.org/ontologies/SIO).
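
As a rough illustration of the labeling scheme above, the following sketch builds the list of dimension labels for a nested repeated task and compares it with the shape of some result data. It uses only the standard library; the function names, the ids (`repeat1`, `subtask1`, `S1`), and the outermost-to-innermost ordering are assumptions made for this example, not part of the specification text.

```python
from typing import List, Sequence


def dimension_labels(repeated_task_id: str,
                     subtask_ids: Sequence[str],
                     data_dimension: str,
                     variable_label: str) -> List[str]:
    """Return one label per dimension, outermost first (assumed ordering)."""
    return [repeated_task_id, *subtask_ids, data_dimension, variable_label]


def shape_of(nested) -> List[int]:
    """Shape of a rectangular nested list (raggedness is not checked)."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return shape


# Hypothetical output: 3 iterations x 1 subtask x 5 time points x 1 variable.
data = [[[[0.0] for _ in range(5)] for _ in range(1)] for _ in range(3)]
labels = dimension_labels("repeat1", ["subtask1"], "time", "S1")

# There should be exactly one label per dimension of the data.
assert len(labels) == len(shape_of(data))
```

When writing real HDF5 files, a library such as h5py can attach a label to each dimension of a dataset (for example through `Dataset.dims[i].label`); whether that is the right place to store these ids is left open here.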

luciansmith · 2021-06-11T20:42:47Z

I didn't add xlsx or JSON or SQLite. I can, though those might be more complicated?

jonrkarr · 2021-06-11T20:53:38Z

- The id of the top-level RepeatedTask
- The id of the SubTask
- The id of any nested SubTask (for arbitrarily-deeply nested subtasks).

This information is only straightforward when a dataset derives from a single top-level task. Datasets that arise from computations spanning the results of multiple tasks won't have a single top-level task id or clear semantics for the other dimensions.

There are multiple options for handling this:

- Focus storage of raw results on variables rather than on reports
- Disallow calculations involving multiple tasks
- Particularly when calculations involve multiple tasks, allow investigators to annotate their meaning and copy this information into the files that contain results (e.g., HDF5, JSON, XLSX, etc.)

I think L1V4 could say something like "when data generators only contain results from a single task, we recommend that reports of their results contain the following metadata ...". Dealing with this properly could be punted to L2.
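
A minimal sketch of the "single task" condition, assuming each variable of a data generator exposes the id of the task it references (as the `taskReference` attribute does in SED-ML). The function name and the example ids are hypothetical.

```python
from typing import Iterable, Optional


def single_task_id(task_references: Iterable[str]) -> Optional[str]:
    """Return the task id if every variable references the same task, else None."""
    unique = set(task_references)
    return next(iter(unique)) if len(unique) == 1 else None


# All variables come from one task: dimension metadata is well defined.
same_task = single_task_id(["task1", "task1"])

# Variables span two tasks: no single top-level task id exists.
mixed_tasks = single_task_id(["task1", "task2"])
```

A tool could write the recommended dimension metadata only when this returns an id, and fall back to user-supplied annotations otherwise.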

jonrkarr · 2021-06-11T20:56:59Z

If JSON is being used, I feel like that would benefit from its own explanation, since there are multiple ways the data could be encoded.
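
To make the ambiguity concrete, here is a small sketch of two plausible JSON encodings of the same report. The labels and values are made up for illustration, and neither layout is proposed as the SED-ML encoding.

```python
import json

# Hypothetical report: two data sets sampled at three time points.
labels = ["time", "S1"]
columns = [[0.0, 1.0, 2.0], [10.0, 7.5, 5.0]]

# Option A: column-oriented, one array per data set label.
column_oriented = dict(zip(labels, columns))

# Option B: record-oriented, one object per row.
record_oriented = [dict(zip(labels, row)) for row in zip(*columns)]

encoded_a = json.dumps(column_oriented)
encoded_b = json.dumps(record_oriented)

# Both encodings carry the same values, but a reader has to know which
# layout was used in order to reconstruct the columns.
decoded_b = json.loads(encoded_b)
rebuilt = {key: [record[key] for record in decoded_b] for key in labels}
assert rebuilt == json.loads(encoded_a)
```

Multidimensional output adds further choices (nested arrays versus flat arrays plus a shape), which is another reason a dedicated explanation would help.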

luciansmith · 2021-06-11T21:04:35Z

You're right that I should include a bit about the RemainingDimensions, but I don't know of any other way to reduce the dimensionality of SED-ML data through computation, given that we require all calculations to be element-by-element, and for cross-matrix data calculations to have identical dimensions.

I don't know of anyone using JSON; if anyone is, I would invite them to write about how they're using it to encode this data!

luciansmith · 2021-06-11T21:49:29Z

OK, I updated the HDF5 section to include:

"When a DependentVariable is used to reduce the dimensionality of a set of data, the ids of whatever
dimensions remain should be used (defined by its RemainingDimension children). The dimensions may
be annotated to describe the dimension reduction as well. When a DataGenerator contains a
DependentVariable that outputs a matrix, that matrix can also be labeled appropriately (such as with
species or reaction ids).

When output from multiple tasks are combined mathematically, their dimensions must match exactly,
so the ids from either (or a combination of both) may be used. Again, annotations are recommended to
describe how the data was combined."

I also added this bit to the DataGenerator class:

"When multidimensional data is output to a Report, information about the dimensions should be stored
in the output format chosen for the report, such as CSV or HDF5."

(Both CSV and HDF5 are links to the relevant sections.)

jonrkarr · 2021-06-11T22:48:07Z

When output from multiple tasks are combined mathematically, their dimensions must match exactly,

I think this will conflict with making the number of steps optional. Calculations beyond the shape of the smallest input can be defined to be NaN.
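
A sketch of what that could look like for one-dimensional results, using only the standard library. The function name and the sample values are assumptions for illustration.

```python
import math
from itertools import zip_longest
from typing import Callable, List, Sequence


def combine_padded(a: Sequence[float], b: Sequence[float],
                   op: Callable[[float, float], float]) -> List[float]:
    """Apply op element by element; positions beyond the shorter input are NaN."""
    result = []
    for x, y in zip_longest(a, b, fillvalue=None):
        if x is None or y is None:
            result.append(math.nan)
        else:
            result.append(op(x, y))
    return result


# Two hypothetical task outputs with different numbers of steps.
summed = combine_padded([1.0, 2.0, 3.0], [10.0, 20.0], lambda x, y: x + y)
# The first two entries are real sums; the third is NaN.
```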

so the ids from either (or a combination of both) may be used.

I don't think this is needed. The results of calculations are assigned to data generators, which have ids. Users can set these ids to be meaningful strings as with all other ids.

matthiaskoenig added L1V4 specification labels Sep 30, 2017

matthiaskoenig added this to the L1V4 release milestone Sep 30, 2017

matthiaskoenig changed the title ~~Add clarification and URIs for additional data formats (HDF5, JSON)~~ Add URIs & definitions for additional data formats (XLSX, HDF5, JSON) Oct 3, 2017

matthiaskoenig added the feature label Oct 3, 2017

matthiaskoenig removed L1V4 labels Jun 21, 2018

matthiaskoenig removed this from the L1V4 release milestone Jun 21, 2018

luciansmith added the L1V4 label Mar 30, 2021

luciansmith added a commit that referenced this issue Jun 11, 2021

Fixes for #52, #103, and #58

517c093

Define the HDF5 format, and explain how to plot multidimensional data.

luciansmith added the draft fix label Jun 11, 2021

luciansmith mentioned this issue Jun 11, 2021

Clarify the semantics of the interaction of repeated tasks with reports and plots #103

Closed

luciansmith added a commit that referenced this issue Jun 11, 2021

Further fix for #52: discuss multidimensional output

a00242b

luciansmith closed this as completed Aug 3, 2021
