FAIR data decisions: Lossy or lossless #27

hrzepa · 2017-07-25T14:58:26Z

One of the issues often confronted by depositors of aspiring FAIR data is how much data loss to tolerate. I give just one example, crystallographic data in chemistry (often described as the gold standard in chemical data). The following categories form a hierarchy, with increasing data loss:

1. The raw instrument data
2. The processed instrument data, including "hkl" information
3. The processed instrument data, including rich structure information but excluding "hkl" data
4. The processed minimum dataset, which suffices for perhaps 90% of most users' needs
5. A graphical representation of the minimum dataset, as a JPEG or PDF, which itself can be lossy

So most consumers of, say, category 4 would find it adequately FAIR for their needs, but some specialist users would find it too lossy and might need to go as high as category 1. The trouble is that category 1 data might be as much as 10,000 times larger than the minimal set.

Unfortunately there is no easy way of specifying the degree of data loss in any aspiring FAIR dataset as metadata information. This, remember, is considered the "gold" standard. One finds similar situations in other types of chemical data.

evomellor · 2017-07-25T15:12:15Z

"Unfortunately there is no easy way of specifying the degree of data loss in any aspiring FAIR dataset as metadata information." Do you mean that once only the cleaned data are presented (e.g. category 4) it is impossible for another person to quantify the loss from category 1?

Though this would not preserve the lost information, metadata for a shared, cleaned dataset should ideally contain information about the cleaning process, up to and including any scripts that were used to do the cleaning. Besides scripts, a narrative description of the cleaning process and any reasonable explanation of what information has been lost is good practice.
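As a rough illustration of what such cleaning metadata could look like, here is a minimal Python sketch. The field names (`derived_from`, `steps`, `information_lost`, `script_sha256`) and the example script text are hypothetical, invented for this example; they are not taken from any existing metadata standard.

```python
import hashlib
import json


def describe_step(description: str, script_text: str, information_lost: str) -> dict:
    """Describe one cleaning step; all field names here are hypothetical."""
    return {
        # Narrative description of what the step did
        "description": description,
        # Fingerprint of the script so readers can check they hold the same version
        "script_sha256": hashlib.sha256(script_text.encode("utf-8")).hexdigest(),
        # Plain-language note on what can no longer be recovered
        "information_lost": information_lost,
    }


# Placeholder script text; a real record would read the actual cleaning script
script = "# hypothetical cleaning script\nkept = [r for r in rows if r['valid']]\n"

cleaning_log = {
    "derived_from": "raw instrument data (not deposited)",
    "steps": [
        describe_step(
            "Dropped rows flagged as invalid by the instrument",
            script,
            "Invalid rows and the reasons they were flagged",
        )
    ],
}

print(json.dumps(cleaning_log, indent=2))
```

A record like this does not restore what was lost, but it tells a reader which steps separate the shared dataset from the original.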

I'm a strong proponent of trying not to let the perfect get in the way of the practical (or any improvement upon the status quo). For situations where sharing and preserving large data sets are impractical, sharing category 4 is a vast improvement.

Do you recommend revising the categories or the standards defined for each?

band · 2017-07-25T15:32:44Z

NASA EOSDIS data products use a defined classification of Data Processing Levels. If such a classification is available for other data products then maybe it is enough to include that level specification in the metadata.
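To make the idea concrete, here is a minimal Python sketch of a level field in a metadata record. It assumes the five crystallographic categories from the start of this thread as the classification; the `LossLevel` names, the `processing_level` field, and the `satisfies` helper are all hypothetical and are not part of the NASA scheme or any existing standard.

```python
import json
from enum import IntEnum


class LossLevel(IntEnum):
    """Hypothetical levels mirroring the five categories in this thread (1 = least loss)."""
    RAW_INSTRUMENT = 1
    PROCESSED_WITH_HKL = 2
    PROCESSED_WITHOUT_HKL = 3
    MINIMUM_DATASET = 4
    GRAPHICAL = 5


def make_metadata(title: str, level: LossLevel) -> dict:
    """Build a metadata record carrying the level as a number and a readable label."""
    return {
        "title": title,
        "processing_level": int(level),
        "processing_level_label": level.name,
    }


def satisfies(record: dict, most_lossy_acceptable: LossLevel) -> bool:
    """True if the dataset is no more lossy than the consumer can accept."""
    return record["processing_level"] <= int(most_lossy_acceptable)


record = make_metadata("Example crystal structure", LossLevel.MINIMUM_DATASET)
print(json.dumps(record, indent=2))

# A typical consumer accepts category 4; a specialist needing "hkl" data does not
print(satisfies(record, LossLevel.MINIMUM_DATASET))
print(satisfies(record, LossLevel.PROCESSED_WITH_HKL))
```

A single ordered field like this only works where the levels really do form a hierarchy; it says nothing about what specifically was discarded between levels.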

CaroleGoble · 2017-07-31T15:27:45Z

I would say the whole point is that there is no one FAIR. FAIR is a landscape of degrees, or levels: "50 shades of FAIR". This is highly related to the metrics. The worst thing we can do is declare a single perspective.
FAIR means different things to different stakeholders for different purposes and that is to be celebrated and respected, not suppressed.
What counts as “rich metadata” varies per domain, and varies:
• Within and across disciplines
• Across Layers of the infrastructure stack: EOSC e-Infrastructures vs Research Infrastructures
• At the institutional level vs public archives level
• Depending on the purpose:

"F" may be feasible while "I" may not be (and, by the way, Reproducibility might be harder than Reuse).
FAIR across research boundaries means that for the native discipline the metadata may not be enough for reuse, but for the non-native discipline it is. I've seen this in Sys Bio models. The modellers won't reuse but the experimentalists will.

CaroleGoble · 2017-07-31T20:22:57Z

The challenge will be distilling the “in common” without enforcing one view or need.
