Use Case: Describe a tabular data file directly in RO-Crate metadata #27

Open
eocarragain opened this issue Aug 1, 2019 · 15 comments

@eocarragain eocarragain commented Aug 1, 2019

As a researcher working with tabular data, I want to be able to define the columns (description, data-type, valid values/ranges, etc.), so that I can provide a structured data dictionary.

Approaches elsewhere:

@eocarragain eocarragain added the use-case label Aug 1, 2019

@eocarragain eocarragain commented Aug 1, 2019

Note: we may need a more general use-case for how to express sub-file/variable level metadata. Some concrete non-tabular examples would be good though


@eocarragain eocarragain commented Aug 8, 2019

Discussed this on Editor's call 2019-08-08, and agreed it would be good to use the schema.org flavour if possible, e.g. Dataspice, Psych-DS


@eocarragain eocarragain commented Oct 3, 2019

The table below compares the Frictionless Data tabular data specs with the schema.org variableMeasured property. It also shows the additional fields that the psych-ds team have added on top of their use of variableMeasured.

| table_schema | schema:variableMeasured | psych-ds |
|---|---|---|
| dialect | | |
| name | name | schema:name |
| description | description | schema:name |
| title | alternateName | |
| type | | type |
| type>rdfType | propertyId | |
| format | | |
| constraints>required | | |
| constraints>unique | | |
| constraints>minLength | minValue | schema:minValue |
| constraints>maxLength | maxValue | schema:maxValue |
| constraints>minimum | ~minValue | |
| constraints>maximum | ~maxValue | |
| constraints>pattern | | |
| constraints>enum | | levels |
| missingValues | | na/naValues |
| primaryKey | | |
| foreignKeys | | |
| ~type>rdfType | unitCode | schema:unitCode |
| ~type>rdfType | unitText | schema:unitText |
| | | derivation |
| | | imputation |

Notes:

  • table schema enumerates types (e.g. string, Boolean, number) and formats (e.g. email address, ISO8601) for fields, see https://frictionlessdata.io/specs/table-schema#types-and-formats. There is no equivalent in schema.org
  • table_schema doesn't have a direct equivalent of unitCode or unitText but type>rdfType could probably be used
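
To make the comparison above more concrete, here is a minimal Python sketch that converts a Table Schema field descriptor into a schema.org PropertyValue. The function name `field_to_property_value` is hypothetical, and the choice to carry only `constraints.minimum`/`constraints.maximum` into `minValue`/`maxValue` is an assumption on my part; everything without a schema.org counterpart is reported back as unmapped rather than silently dropped.

```python
def field_to_property_value(field):
    """Map a Table Schema field descriptor (dict) to a schema.org PropertyValue.

    Returns (property_value, unmapped), where unmapped lists the keys that
    have no obvious schema.org equivalent. Illustrative only, not a spec.
    """
    # Keys that map one-to-one according to the comparison table
    direct = {
        "name": "name",
        "description": "description",
        "title": "alternateName",
        "rdfType": "propertyID",
    }
    pv = {"@type": "PropertyValue"}
    unmapped = []
    for key, value in field.items():
        if key in direct:
            pv[direct[key]] = value
        elif key == "constraints":
            for c_key, c_value in value.items():
                if c_key == "minimum":
                    pv["minValue"] = c_value
                elif c_key == "maximum":
                    pv["maxValue"] = c_value
                else:
                    # required, unique, pattern, enum, etc. have no equivalent
                    unmapped.append(f"constraints.{c_key}")
        else:
            # e.g. type and format, which schema.org does not enumerate
            unmapped.append(key)
    return pv, unmapped


# Illustrative field descriptor (column name and values are made up)
field = {
    "name": "wall_width",
    "title": "Wall width",
    "type": "number",
    "description": "The width of the wall in metres",
    "constraints": {"minimum": 0, "required": True},
}
pv, unmapped = field_to_property_value(field)
```

The `unmapped` list is the interesting part: it shows how much of a typical data dictionary would be lost if only schema.org terms were used.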

@dgarijo dgarijo commented Oct 3, 2019

This sounds a little like: https://www.w3.org/TR/tabular-data-primer/#string-restriction
Why not reuse it?
EDIT: Oh, I see you listed it above, but it covers all the constraints nicely...


@eocarragain eocarragain commented Oct 3, 2019

@dgarijo agreed "csvw" is probably the most complete rdf-friendly way to do this. It also has the benefit that Google seem to be adopting it in the dataset search. However, we received quite strong feedback at Open Repositories that CSVW was 'too complicated' for most researchers & coders to pick up and use easily.

There may be ways around this in terms of how we present it in the RO-Crate spec, i.e. just provide examples of the most common cases, more or less equivalent to table-schema?

EDIT: if we did this, the psych-ds community might be a good test group as they are clearly struggling with the fact that schema.org doesn't quite do what they need
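
For a sense of what a "most common cases" subset of CSVW might look like, here is a rough Python sketch that builds a small CSVW metadata document covering only column names, descriptions, datatypes and a couple of constraints. The helper name `csvw_metadata`, the file name and the column names are all hypothetical; the CSVW terms used (`tableSchema`, `columns`, `datatype`, `base`, `minimum`, `required`) come from the W3C tabular metadata vocabulary, but which of them RO-Crate would actually recommend is an open question.

```python
import json


def csvw_metadata(csv_url, columns):
    """Build a minimal CSVW metadata document describing one CSV file."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {"columns": columns},
    }


# Illustrative columns only; roughly what a table-schema user would expect
columns = [
    {
        "name": "wall_width",
        "dc:description": "The width of the wall in metres",
        "datatype": {"base": "number", "minimum": 0},
        "required": True,
    },
    {
        "name": "measured_at",
        "dc:description": "The date and time of the measurement",
        "datatype": "dateTime",
    },
]

doc = csvw_metadata("table.csv", columns)
print(json.dumps(doc, indent=2))
```

Restricting examples to this handful of terms would keep the result close to table-schema in size while staying valid CSVW.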


@dgarijo dgarijo commented Oct 3, 2019

I don't think you need to adopt all of it, just the parts that cover your use cases (as you point out). In PROV we had like 3 main concepts and 8 relationships among them and people still said it was complicated...


@eocarragain eocarragain commented Oct 3, 2019

Example of what the schema.org approach would look like in an RO-Crate context:

{
  "@context": "https://w3id.org/ro/crate/0.3-DRAFT/context",
  "@graph": [
    {
      "@id": "./",
      "@type": ["Dataset"],
      "hasPart": [
        { "@id": "./table.csv" }
      ]
    },
    {
      "@id": "./table.csv",
      "@type": ["File", "Dataset"],
      "contentSize": "383766",
      "description": "A table capturing all my data",
      "variableMeasured": [
        {
          "@type": "PropertyValue",
          "unitText": "metres",
          "name": "wall_width",
          "description": "The width of the wall in metres"
        },
        {
          "@type": "PropertyValue",
          "unitCode": "CMT",
          "name": "wall_height",
          "description": "The height of the wall in centimetres"
        },
        {
          "@type": "PropertyValue",
          "name": "datetime",
          "description": "The date and time of the measurement"
        }
      ]
    }
  ]
}

Issue: in schema.org variableMeasured is only defined as a property of schema:Dataset, i.e. it cannot be used on an RO-Crate file as this maps to schema:MediaObject

EDIT: made the file a Dataset in the example above following @dgarijo's comments below


@dgarijo dgarijo commented Oct 3, 2019

Are they disjoint (I don't see anything about that in schema.org)? If not, I don't see the problem in using them.


@eocarragain eocarragain commented Oct 3, 2019

Would that mean making all ro-crate "files" be both schema:MediaObject and schema:Dataset?


@dgarijo dgarijo commented Oct 3, 2019

not all of them, just the ones you want to describe with those properties. A research object may contain many files. Some of them may be datasets. Some may be Slides, workflows, SoftwareApplications...
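
A minimal sketch of that idea, assuming the crate's `@graph` has been loaded as a list of Python dicts: only entities that actually carry `variableMeasured` get `Dataset` added to their `@type`, and everything else is left alone. The function name and the sample entities are hypothetical.

```python
def add_dataset_type_where_needed(graph):
    """Add "Dataset" to @type of entities that use variableMeasured.

    Mutates the entities in place and returns the @ids that were changed.
    """
    changed = []
    for entity in graph:
        if "variableMeasured" not in entity:
            continue  # slides, workflows, etc. keep their existing types
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]  # normalise a single type to a list
        if "Dataset" not in types:
            entity["@type"] = types + ["Dataset"]
            changed.append(entity.get("@id"))
    return changed


# Illustrative graph: one described table, one file without column metadata
graph = [
    {
        "@id": "./table.csv",
        "@type": "File",
        "variableMeasured": [{"@type": "PropertyValue", "name": "wall_width"}],
    },
    {"@id": "./slides.pdf", "@type": "File"},
]
changed = add_dataset_type_where_needed(graph)
```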


@eocarragain eocarragain commented Oct 3, 2019

Ok - made that change in the example above. Fact remains that schema.org doesn't cover a lot of common use cases for describing tabular data, so should we look at providing a simplified subset of CSVW more or less corresponding to table_schema?


@jmfernandez jmfernandez commented Oct 3, 2019

I have a naive question: if the tabular format is a standard one, described in some ontology (but not at this granularity level), what should we do?


@stain stain commented Oct 3, 2019

@eocarragain eocarragain commented Oct 3, 2019

> I have a naive question: if the tabular format is a standard one, described in some ontology (but not at this granularity level), what should we do?

@Stian suggested conformsTo or schema:additionalType (or maybe schema:schemaVersion)
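
As a rough sketch of the `conformsTo` option, the Python snippet below builds a file entity that points at a separate entity describing the standard format, instead of listing columns one by one. The helper name, the example.org URL and the choice of `CreativeWork` as the type of the format entity are assumptions made for illustration only; nothing here has been agreed for the spec.

```python
def describe_standard_format(file_id, format_id, format_name):
    """Return two entities: a file and the standard format it conforms to."""
    file_entity = {
        "@id": file_id,
        "@type": "File",
        # Reference the format by @id rather than describing each column
        "conformsTo": {"@id": format_id},
    }
    format_entity = {
        "@id": format_id,
        "@type": "CreativeWork",  # assumed type, not confirmed by the thread
        "name": format_name,
    }
    return [file_entity, format_entity]


# Hypothetical identifiers for illustration
entities = describe_standard_format(
    "./samples.tsv",
    "https://example.org/formats/my-tabular-standard",
    "My tabular standard",
)
```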


@eocarragain eocarragain commented Oct 3, 2019

isatab is another example
