Check: Text file format valid #2

Closed
jeanetteclark opened this issue Jul 15, 2022 · 7 comments

Comments

@jeanetteclark
Collaborator

jeanetteclark commented Jul 15, 2022

Purpose

This check tests whether a tabular data file in a text format can be parsed.

Components

  • is a text format (boolean)
  • file name
  • distribution URL
  • number of header lines
  • delimiter

Result

SUCCESS: if one or more files are parsed correctly or no text files exist
FAILURE: if no files can be parsed
ERROR: if files cannot be accessed
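A minimal sketch of how the components and result rules above could fit together, assuming Python and the standard `csv` module. The names `parses_as_table` and `check_text_files`, the dict keys, and the `fetch` callable are hypothetical, and "parsed correctly" is assumed here to mean that every data row has the same number of fields; the actual check may define it differently.

```python
import csv
import io


def parses_as_table(text, header_lines, delimiter):
    """Return True if every non-blank data row has the same number of fields."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    data = [row for row in rows[header_lines:] if row]  # skip headers and blank lines
    if not data:
        return False
    width = len(data[0])
    return all(len(row) == width for row in data)


def check_text_files(entities, fetch):
    """Apply the result rules from this issue to a list of entity descriptions.

    entities: dicts with the hypothetical keys is_text, file_name, url,
              header_lines and delimiter (one per component listed above).
    fetch:    hypothetical callable that takes a URL and returns the file text.
    """
    text_entities = [e for e in entities if e["is_text"]]
    if not text_entities:
        return "SUCCESS"  # no text files exist
    for entity in text_entities:
        text = fetch(entity["url"])
        if parses_as_table(text, entity["header_lines"], entity["delimiter"]):
            return "SUCCESS"  # one or more files parsed correctly
    return "FAILURE"  # no files could be parsed
```

For example, `check_text_files([], fetch)` returns `"SUCCESS"` because there is nothing to parse, while a single file with ragged rows returns `"FAILURE"`.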

@jeanetteclark changed the title from "Check: Text file ormat valid" to "Check: Text file format valid" on Jul 15, 2022
@mbjones
Member

mbjones commented Jul 15, 2022

@jeanetteclark ERROR is reserved for when the test fails to run (e.g. the network is down). An ERROR indicates a bug in the system, not a data-driven failure. When a test runs to completion, it should always return SUCCESS or FAILURE based on the content evaluation. Happy to discuss.
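One way to picture this distinction is a small wrapper around the check, sketched here in Python. The name `run_check` and the `(status, message)` return shape are hypothetical and not taken from the project: an exception means the check could not run to completion, so it maps to ERROR, while a completed check always maps to SUCCESS or FAILURE.

```python
def run_check(check, *args):
    """Run a check and map its outcome to a (status, message) pair.

    check: hypothetical callable returning True (content passes) or
           False (content fails); it raises if it cannot run at all.
    """
    try:
        passed = check(*args)
    except Exception as exc:  # e.g. the network is down, or a bug in the check
        return "ERROR", str(exc)
    # The check ran to completion, so the result is driven by the data.
    return ("SUCCESS" if passed else "FAILURE"), ""
```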

@jeanetteclark
Collaborator Author

Okay, that makes sense. I'll move the "no text files exist" case to SUCCESS.

@jeanetteclark
Collaborator Author

This check is nearly done. I still need to make the mechanism for retrieving data pids (and thus URLs/paths) for data access consistent with what I did for the data format check.

@mbjones
Member

mbjones commented Feb 14, 2023

Great! Can you define 'text'? Do you mean ASCII? UTF-8? UTF-16? Other unicode encodings? Windows cp-1252?

@jeanetteclark
Collaborator Author

So I've been thinking the name should probably be changed, since this check is really about delimited text files (csv, tsv) and doesn't deal with encodings at all. Files are identified by looking in the metadata for entities with a physical/dataFormat/textFormat element. I think, though, that we should probably be checking formatId instead. Happy to hear your thoughts.
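As a rough illustration of the metadata lookup described here, the sketch below uses Python's `xml.etree.ElementTree` to find entities that have a physical/dataFormat/textFormat element. The function name `text_entities` and the returned dict keys are hypothetical. The sketch also assumes EML-style child elements (`objectName`, `numHeaderLines`, `simpleDelimited/fieldDelimiter`, `distribution/online/url`) without namespace prefixes, and it is not the check's actual code.

```python
import xml.etree.ElementTree as ET


def text_entities(eml_xml):
    """List entities whose physical section declares a textFormat."""
    root = ET.fromstring(eml_xml)
    found = []
    for entity in root.iter():
        physical = entity.find("physical")
        if physical is None:
            continue  # not an entity element
        text_format = physical.find("dataFormat/textFormat")
        if text_format is None:
            continue  # entity is not described as a text format
        found.append({
            "file_name": physical.findtext("objectName"),
            "url": physical.findtext("distribution/online/url"),
            # Assumption: a missing numHeaderLines is treated as 0.
            "header_lines": int(text_format.findtext("numHeaderLines", default="0")),
            "delimiter": text_format.findtext("simpleDelimited/fieldDelimiter"),
        })
    return found
```

Selecting on formatId instead would replace the `textFormat` test with a lookup in the system metadata, which this sketch does not attempt.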

@mbjones
Member

mbjones commented Feb 15, 2023

Aha, that makes sense. Yes, I think it makes sense to use the formatId (text/csv, for example) to decide when to apply this test. How about naming it something more like data.table-text-delimited.well-formed? See naming discussion in #15.

Some related tests might be metadata.formatId.congruent (to test if the formatId and the values inside the metadata format fields like physical/dataFormat match) and data.format.congruent (to test if the data format found in the file matches what is claimed in the metadata formatId).

@jeanetteclark
Collaborator Author

Check has been renamed and restructured @c47b03c8c

Going to close this one for now.


2 participants