
As a data manager, I want to see alerts when the data does not pass validation checks so that I can fix validation problems quickly. #72

Closed
5 tasks done
rlskoeser opened this issue Jun 22, 2020 · 15 comments

Comments

@rlskoeser
Contributor

rlskoeser commented Jun 22, 2020

dev notes

  • add slack notification to existing validation github actions
  • output errors in build for review
  • summarize errors when validation fails
  • link to build for error details
  • if no validation is done, check if schema is valid json and report as an error
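
For reference, a minimal sketch of the Slack-notification piece (not the actual implementation; the report.json filename, BUILD_URL placeholder, and SLACK_WEBHOOK_URL environment variable are assumptions, with the webhook URL presumably stored as a repository secret, and the top-level valid / error-count keys follow goodtables' report format as I recall it):

import json
import os

import requests

# load the goodtables validation report produced by the validation step
with open('report.json') as f:
    report = json.load(f)

if not report.get('valid', False):
    # summarize the failure and link back to the build for full details
    text = 'Validation failed with {} error(s). Details: {}'.format(
        report.get('error-count', 0), os.environ.get('BUILD_URL', ''))
    requests.post(os.environ['SLACK_WEBHOOK_URL'], json={'text': text})
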
@kmcelwee kmcelwee self-assigned this Jul 1, 2020
@kmcelwee
Contributor

kmcelwee commented Jul 2, 2020

Addressed in Princeton-CDH/pemm-data#6

@kmcelwee kmcelwee closed this as completed Jul 6, 2020
@kmcelwee kmcelwee reopened this Jul 6, 2020
@kmcelwee kmcelwee added the awaiting testing issue is ready for acceptance testing label Jul 7, 2020
@kmcelwee
Contributor

kmcelwee commented Jul 7, 2020

@elambrinaki This one is tricky to test. Don't feel limited by the checklist I've provided below.

Currently test messages are going to the #pemm channel (formerly #slack-webhook-test) in the CDH Slack.

  • Test that the link properly goes to the correct build in GitHub.
  • Download error-summary under "Artifacts" after following the link.
  • Ensure that the format of error-summary is clear and properly matches the output in the Slack message.

@elambrinaki

@kmcelwee To check that the format of error-summary matches the Slack message, I want to view the goodtables reports with the new setup of 20 error messages shown. How do I run the test? (In the #pemm Slack channel, there are links to reports with 5 error messages.)

@kmcelwee
Contributor

kmcelwee commented Jul 8, 2020

@elambrinaki I proposed the change to 20 on a branch, but I'm just waiting on approval from @rlskoeser. Rebecca, for something small like this, should I merge w/o your approval? Or is it always safe to just wait?

@rlskoeser
Contributor Author

@kmcelwee I apologize, I didn't see a review request. For trivial changes, I'm not even sure they need a PR. Changing a config value like this without changing any code logic doesn't require review.

@kmcelwee
Contributor

kmcelwee commented Jul 8, 2020

When goodtables is fed improperly formatted JSON, it exits as if everything passed, creates the output file saying it has 0 errors, and just outputs a warning in its output JSON:

"tables": [],
    "warnings": [
        "Data Package \"datapackage.json\" has a loading error \"Unable to parse JSON at \"datapackage.json\". Expecting ',' delimiter: line 596 column 21 (char 23272)\""
    ],

Option 1
Precede everything with the following command:

python -m json.tool < datapackage.json 

This would stop the validation from executing. However, no messages would be sent to Slack unless it was a pull request, and Evgeniia will probably only make direct commits.

The errors are output to the shell, and while they're viewable, you'd have to know where to look.
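
Roughly the same check could also live inside the Python reporting script rather than as a separate shell step; a minimal sketch (only the datapackage.json filename comes from this thread, the rest is illustrative):

import json
import sys

try:
    with open('datapackage.json') as f:
        json.load(f)
except json.JSONDecodeError as err:
    # exit nonzero with the parse error, like python -m json.tool would
    sys.exit('datapackage.json is not valid JSON: {}'.format(err))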

Option 2
Or we could look for the warning under "warnings" in the output JSON and send a Slack message that the JSON wasn't formatted correctly.

But I even went into goodtables's documentation and couldn't find where or how they list their warnings. It's going to be something messy like:

if error_json.get('warnings'):
    if any(warning.startswith('Data Package "datapackage.json" has a loading error')
           for warning in error_json['warnings']):
        slack_message = '\n'.join(error_json['warnings'])

@rlskoeser thoughts?

@rlskoeser
Contributor Author

rlskoeser commented Jul 8, 2020

Wow, their warning output really isn't that great, is it. And why is that a warning and not an outright error?

@kmcelwee could you revise your Python script to detect nonstandard JSON output (absence of any validation information, maybe) and then check the datapackage.json validity and send a Slack alert with the JSON parsing error? (Probably any case where we don't get validation output should be reported as an error / failure.)
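
Sketching what that could look like (the report.json filename is a stand-in for whatever the script already reads; the check itself reuses the json.tool command from option 1):

import json
import subprocess

with open('report.json') as f:
    report = json.load(f)

# an empty "tables" list means goodtables never actually validated anything
if not report.get('tables'):
    # rerun the option-1 check to capture the JSON parse error, if there is one
    check = subprocess.run(['python', '-m', 'json.tool', 'datapackage.json'],
                           capture_output=True, text=True)
    if check.returncode != 0:
        slack_message = 'datapackage.json is not valid JSON:\n' + check.stderr.strip()
    else:
        slack_message = 'goodtables produced no validation output for datapackage.json.'
    # ...then send slack_message and fail the build, as for ordinary validation errors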

@elambrinaki

@kmcelwee Thank you for finding my error!

@elambrinaki

@kmcelwee Thank you very much for setting up the goodtables reports! With their help, I corrected several errors on the Manuscript and Canonical Story sheets. For these two sheets, I confirm the error-summary is clear and matches the Slack message.

For the Story Instance sheet, with 765 errors, having 20 errors shown in the report is not enough. Is it possible to increase the number of errors shown for a specific sheet only? If so, I would like to view 1,000 errors on the Story Instance sheet and 20 errors on the other sheets. Alternatively, is it possible to change the order in which the errors are reported from
extra-header, duplicate-row, type-or-format-error, pattern-constraint
to
type-or-format-error, pattern-constraint, extra-header, duplicate-row?

A clarification regarding the large number of duplicate rows on the Story Instance sheet: for the mss that are currently being cataloged, I added identical rows with the ms name and formulas, but without unique information (ID, folios, etc.); that information will be filled in by catalogers. So duplicate rows are expected at the current stage of work.

@rlskoeser
Contributor Author

@elambrinaki I noticed this when I was starting to set up the validation. Two options have occurred to me so far:

  1. Is there a simple way for you to make these identical rows unique? e.g., could you fill in preliminary miracle numbers now that can be checked/refined when they are cataloged?
  2. We can turn off the duplicate row check in the goodtables validation — but as far as I can tell there's no way to do this for a single sheet, we would be turning it off for all sheets. (If we go this route, we'll want to make sure it's configurable so that we can easily turn it off later once those manuscripts have been cataloged.)

@kmcelwee have you thought of any other ways to handle this?

FWIW, I prefer option 1.
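
For reference on option 2, a rough sketch of what skipping the duplicate-row check might look like, assuming the validation runs through goodtables-py's validate() and that it accepts a skip_checks option (worth double-checking against the goodtables docs):

from goodtables import validate

# skip only the duplicate-row check; leave all other checks in place
# (gate this behind a config flag so it can be re-enabled once cataloging is done)
report = validate('datapackage.json', skip_checks=['duplicate-row'])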

@elambrinaki

@rlskoeser @kmcelwee Is increasing the number of reported errors to 1000 a possible solution?

@rlskoeser
Contributor Author

@elambrinaki oh, I didn't comment on that part of your request, but Kevin and I discussed it on Slack! Sorry. I don't think the number of errors reported in the build should be capped. Kevin's going to work on making that config optional and then turn it off for you, so you'll see all the errors. (FWIW, I don't want him spending additional dev time on sorting the errors.)

@kmcelwee
Contributor

kmcelwee commented Jul 9, 2020

@elambrinaki @rlskoeser I've just addressed ERROR_MAX in my latest commits to Princeton-CDH/pemm-data#12
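
The general shape is to make the cap optional, e.g. something like this (illustrative only, not necessarily how the commit does it; it assumes goodtables-py's error_limit option and an ERROR_MAX environment variable):

import os

from goodtables import validate

# only cap the number of reported errors when ERROR_MAX is set
options = {}
if os.environ.get('ERROR_MAX'):
    options['error_limit'] = int(os.environ['ERROR_MAX'])

report = validate('datapackage.json', **options)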

@elambrinaki

@rlskoeser @kmcelwee Perfect! Thank you very much!

@kmcelwee
Contributor

kmcelwee commented Jul 9, 2020

@kmcelwee kmcelwee closed this as completed Jul 9, 2020
@rlskoeser rlskoeser removed the awaiting testing issue is ready for acceptance testing label Jul 22, 2020