FAIR data decisions: Lossy or lossless #27

Open
hrzepa opened this issue Jul 25, 2017 · 4 comments

@hrzepa

hrzepa commented Jul 25, 2017

One of the issues often confronted by depositors of aspiring FAIR data is how much data loss to tolerate. I give just one example: crystallographic data in chemistry (often described as the gold standard in chemical data). The hierarchy runs as follows, with increasing data loss:

  1. The raw instrument data
  2. The processed instrument data, including "hkl" information
  3. The processed instrument data, including rich structure information but excluding "hkl" data
  4. The processed minimum dataset, which suffices for perhaps 90% of users' needs
  5. A graphical representation of the minimum dataset, as a JPEG or PDF...
  6. which itself can be lossy.

So most consumers of, say, category 4 would find it adequately FAIR for their needs, but some specialist users would find it too lossy, and might need to go as high as category 1. The trouble is that category 1 data might be as much as 10,000 times larger than the minimal set.

Unfortunately there is no easy way of specifying the degree of data loss in any aspiring FAIR dataset as metadata information. This, remember, is considered the "gold" standard. One finds similar situations in other types of chemical data.

@evomellor

"Unfortunately there is no easy way of specifying the degree of data loss in any aspiring FAIR dataset as metadata information." Do you mean that once only the cleaned data are presented (e.g. category 4) it is impossible for another person to quantify the loss from category 1?

Though this would not preserve the lost information, metadata for a shared, cleaned dataset should ideally contain information about the cleaning process, up to and including any scripts that were used to do the cleaning. Besides scripts, a narrative description of the cleaning process and any reasonable explanation of what information has been lost is good practice.
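As a rough illustration of what such cleaning metadata could look like, here is a minimal Python sketch. The class names, field names, and the script path are all hypothetical and not taken from any existing metadata standard; the sketch only shows a narrative description, a script reference, and a note on what was lost travelling together with the dataset.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class CleaningStep:
    """One step that turned less-processed data into the shared dataset."""
    description: str            # narrative account of what was done
    script: str = ""            # path or identifier of the script used, if any
    information_lost: str = ""  # plain-language note on what was discarded


@dataclass
class CleaningProvenance:
    """Hypothetical metadata block describing how a dataset was cleaned."""
    source_description: str
    steps: list[CleaningStep] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested CleaningStep dataclasses
        return json.dumps(asdict(self), indent=2)


provenance = CleaningProvenance(
    source_description="Raw instrument data (not deposited)",
    steps=[
        CleaningStep(
            description="Reduced raw frames to processed reflection data",
            script="scripts/reduce_frames.py",  # hypothetical path
            information_lost="Raw detector frames",
        ),
    ],
)
print(provenance.to_json())
```

A record like this does not recover the lost information, but it lets a later reader see what was discarded and how.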

I'm a strong proponent of trying not to let the perfect get in the way of the practical (or any improvement upon the status quo). For situations where sharing and preserving large data sets are impractical, sharing category 4 is a vast improvement.

Do you recommend revising the categories or the standards defined for each?

@band

band commented Jul 25, 2017

NASA EOSDIS data products use a defined classification of Data Processing Levels. If such a classification is available for other data products then maybe it is enough to include that level specification in the metadata.
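To make the idea of a level field concrete, here is a small Python sketch that encodes the six categories from the opening post as an ordered level and stores it in a metadata record. The names `ProcessingLevel`, `satisfies`, and the `processing_level` key are hypothetical, and treating the categories as a strict linear order is an assumption made for illustration; this is not the EOSDIS scheme itself.

```python
from enum import IntEnum


class ProcessingLevel(IntEnum):
    """Categories from the opening post; a higher number means more data loss."""
    RAW_INSTRUMENT = 1
    PROCESSED_WITH_HKL = 2
    PROCESSED_WITHOUT_HKL = 3
    MINIMUM_DATASET = 4
    GRAPHICAL = 5
    LOSSY_GRAPHICAL = 6


def satisfies(deposited: ProcessingLevel, required: ProcessingLevel) -> bool:
    """Return True if the deposited data is at least as complete as required.

    Assumes the levels form a strict order, so a lower number always
    contains everything a higher number does.
    """
    return deposited <= required


# Hypothetical metadata record carrying the level by name
metadata = {
    "title": "Example crystal structure",
    "processing_level": ProcessingLevel.MINIMUM_DATASET.name,
}

deposited = ProcessingLevel[metadata["processing_level"]]
print(satisfies(deposited, ProcessingLevel.MINIMUM_DATASET))  # typical consumer
print(satisfies(deposited, ProcessingLevel.RAW_INSTRUMENT))   # specialist user
```

With a field like this, a consumer could filter deposits by the minimum level they need without opening the data.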

@CaroleGoble

I would say the whole point is that there is no one FAIR. FAIR is a landscape of degrees - or levels.
"50 shades of FAIR" and this is highly related to the metrics. The worst thing we can do is declare a single perspective.
FAIR means different things to different stakeholders for different purposes and that is to be celebrated and respected, not suppressed.
What counts as “rich metadata” varies per domain, and varies:
• Within and across disciplines
• Across Layers of the infrastructure stack: EOSC e-Infrastructures vs Research Infrastructures
• At the institutional level vs public archives level
• Depending on the purpose:

  • "F" may be feasible, "I" may not be (and, by the way, Reproducibility might be harder than Reuse).
  • FAIR across research boundaries means that for the native discipline the metadata may not be enough for reuse, but for the non-native it is. I've seen this in Sys Bio models. The modellers won't reuse but the experimentalists will.

@CaroleGoble

The challenge will be distilling the “in common” without enforcing one view or need


4 participants