Check: physical format matches data file format #9

Closed
eeerika opened this issue Jul 22, 2022 · 2 comments

@eeerika
Collaborator

eeerika commented Jul 22, 2022

Purpose

This check verifies that the format name listed in the physical section of the metadata matches the actual format of the data file.

Components

  • format name from physical section
  • file format from data file itself

Result

SUCCESS: if the two formats match
FAILURE: if the two formats do not match
ERROR: on system error or exception in the check code, representing a bug in the check system
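
The three outcomes above could be sketched roughly as follows. This is only an illustration, not the check's actual code: the function name `check_format_match`, the returned dictionary shape, and the case-insensitive comparison are all assumptions.

```python
def check_format_match(metadata_format, detected_format):
    """Compare the format named in the metadata with the detected file format.

    Hypothetical helper: the name, arguments, and return shape are
    assumptions made for illustration only.
    """
    try:
        # Normalise whitespace and case before comparing the two names.
        matches = metadata_format.strip().lower() == detected_format.strip().lower()
    except Exception as exc:
        # Anything unexpected (e.g. a missing value) is reported as ERROR,
        # signalling a problem in the check itself rather than in the data.
        return {"status": "ERROR", "output": f"check failed to run: {exc}"}

    if matches:
        return {"status": "SUCCESS", "output": "metadata format matches file format"}
    return {
        "status": "FAILURE",
        "output": f"metadata says {metadata_format!r}, file looks like {detected_format!r}",
    }
```

A real check would compare mapped values rather than raw strings, since the metadata and the file inspection tool rarely use the same vocabulary.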

eeerika added the check label Jul 22, 2022
@jeanetteclark
Collaborator

jeanetteclark commented Dec 12, 2022

I'm starting to think about this check and wondering which pieces of information we should actually be checking. A file's data format can potentially be recorded in three places:

  • the file itself
  • sysmeta formatId
  • EML dataFormat OR ISO field

I'm envisioning this as a congruency check between the return value of `file test.csv` (bash, wrapped in an R system call) and either the formatId, the metadata field, or both. I see a couple of hurdles, though.

  1. Establishing a mapping between the format names used in each location. For example, `file test.csv` returns "CSV text", which should be "text/csv" in the sysmeta, and in EML could be described in a few ways but probably most reliably as `physical/dataFormat/textFormat/` with `fieldDelimiter` set to ",".
  2. I don't think any existing checks actually use sysmeta values as part of the check, so we need to review how this information could be made available to the R process.
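
To make the first hurdle concrete, here is a minimal sketch of such a mapping, written in Python rather than R purely for illustration. Only the "CSV text" to "text/csv" pair comes from the example above; the other table entries, the function names, and the prefix-matching strategy are assumptions.

```python
import subprocess

# Only the "CSV text" -> "text/csv" pair comes from the discussion above;
# the remaining entries are illustrative assumptions.
FILE_OUTPUT_TO_FORMAT_ID = {
    "CSV text": "text/csv",
    "PDF document": "application/pdf",
    "ASCII text": "text/plain",
}


def describe_file(path):
    """Return the description printed by `file --brief` for the given path."""
    result = subprocess.run(
        ["file", "--brief", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def format_id_for(description):
    """Map a `file` description to a formatId, or None if it is unmapped."""
    # `file` often appends details (versions, line terminators), so match
    # on the leading part of the description instead of the whole string.
    for prefix, format_id in FILE_OUTPUT_TO_FORMAT_ID.items():
        if description.startswith(prefix):
            return format_id
    return None
```

With this in place, `format_id_for(describe_file("test.csv"))` would give a value to compare against the sysmeta formatId, assuming the `file` utility is installed on the machine running the check.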

I think the steps needed to implement this check are, in order of difficulty:

  • determine how (if?) sysmeta values can be passed to check code in metadig-engine
  • add sysmetaXML as an argument to runCheck in metadig-R
  • establish mapping between common file types for the output of file commands and DataONE formatIds
  • add to the above mapping for EML spec
  • add to the above mapping for ISO spec
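
As a rough sketch of the first two steps, pulling the formatId out of a sysmeta document could look something like the following. It is written in Python for illustration only; the function name `extract_format_id` is hypothetical, and how the sysmeta XML actually reaches the check code in metadig-engine is still the open question here.

```python
import xml.etree.ElementTree as ET


def extract_format_id(sysmeta_xml):
    """Return the text of the first formatId element, or None if absent.

    Hypothetical helper: assumes the sysmeta document is passed in as a string.
    """
    root = ET.fromstring(sysmeta_xml)
    for element in root.iter():
        # Drop any "{namespace}" prefix so the lookup works whether or not
        # the element is namespace-qualified.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "formatId" and element.text:
            return element.text.strip()
    return None
```

Matching on the local element name sidesteps differences between sysmeta schema versions, at the cost of being less strict than a namespace-aware lookup.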

lmk if you see anything that seems amiss @mbjones @iannesbitt

@jeanetteclark
Collaborator

The initial version of this check is done, and is in this repo for now (inst/extdata).
