
As a data manager, I want to see alerts when the data does not pass validation checks so that I can fix validation problems quickly. #72

Closed
5 tasks done
rlskoeser opened this issue Jun 22, 2020 · 15 comments

Comments

@rlskoeser
Contributor

rlskoeser commented Jun 22, 2020

dev notes

  • add slack notification to existing validation github actions
  • output errors in build for review
  • summarize errors when validation fails
  • link to build for error details
  • if no validation is done, check if schema is valid json and report as an error
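
For reference, a minimal sketch of the Slack-notification piece (not the actual implementation; the report.json filename, BUILD_URL placeholder, and SLACK_WEBHOOK_URL environment variable are assumptions, with the webhook URL presumably stored as a repository secret, and the top-level valid / error-count keys follow goodtables' report format as I recall it):

import json
import os

import requests

# load the goodtables validation report produced by the validation step
with open('report.json') as f:
    report = json.load(f)

if not report.get('valid', False):
    # summarize the failure and link back to the build for full details
    text = 'Validation failed with {} error(s). Details: {}'.format(
        report.get('error-count', 0), os.environ.get('BUILD_URL', ''))
    requests.post(os.environ['SLACK_WEBHOOK_URL'], json={'text': text})
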
@kmcelwee kmcelwee self-assigned this Jul 1, 2020
@kmcelwee
Contributor

kmcelwee commented Jul 2, 2020

Addressed in Princeton-CDH/pemm-data#6

@kmcelwee kmcelwee closed this as completed Jul 6, 2020
@kmcelwee kmcelwee reopened this Jul 6, 2020
@kmcelwee kmcelwee added the awaiting testing issue is ready for acceptance testing label Jul 7, 2020
@kmcelwee
Contributor

kmcelwee commented Jul 7, 2020

@elambrinaki This one is tricky to test. Don't feel limited by the checklist I've provided below.

Currently test messages are going to the #pemm channel (formerly #slack-webhook-test) in the CDH Slack.

  • Test that the link properly goes to the correct build in GitHub.
  • Download error-summary under "Artifacts" after following the link.
  • Ensure that the format of error-summary is clear and properly matches the output in the Slack message.

@elambrinaki

@kmcelwee To check that the format of error-summary matches the Slack message, I want to view the goodtables reports with the new setup of 20 error messages shown. How do I run the test? (In the #pemm Slack channel, there are links to reports with 5 error messages.)

@kmcelwee
Contributor

kmcelwee commented Jul 8, 2020

@elambrinaki I proposed the change to 20 on a branch, but I'm just waiting on approval from @rlskoeser. Rebecca, for something small like this, should I merge w/o your approval? Or is it always safe to just wait?

@rlskoeser
Contributor Author

@kmcelwee I apologize, I didn't see a review request. For trivial changes, I'm not even sure they need a PR. Changing a config value like this without changing any code logic doesn't require review.

@kmcelwee
Contributor

kmcelwee commented Jul 8, 2020

When goodtables is fed improperly formatted JSON, it exits as if everything passed, creates the output file saying it has 0 errors, and just outputs a warning in its output JSON:

"tables": [],
    "warnings": [
        "Data Package \"datapackage.json\" has a loading error \"Unable to parse JSON at \"datapackage.json\". Expecting ',' delimiter: line 596 column 21 (char 23272)\""
    ],

Option 1
Precede everything with the following command:

python -m json.tool < datapackage.json 

This would stop the validation from executing. However, no messages would be sent to Slack unless it was a pull request, and Evgeniia will probably only make direct commits.

The errors are output to the shell, and while they're viewable, you'd have to know where to look.
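
Roughly the same check could also live inside the Python reporting script rather than as a separate shell step; a minimal sketch (only the datapackage.json filename comes from this thread, the rest is illustrative):

import json
import sys

try:
    with open('datapackage.json') as f:
        json.load(f)
except json.JSONDecodeError as err:
    # exit nonzero with the parse error, like python -m json.tool would
    sys.exit('datapackage.json is not valid JSON: {}'.format(err))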

Option 2
Or we could look for the warning under "warnings" in the output JSON and send a Slack message that the JSON wasn't formatted correctly.

But I even went into goodtables's documentation and couldn't find where or how they list their warnings. It's going to be something messy like:

if error_json.get('warnings'):
    if any(warning.startswith('Data Package "datapackage.json" has a loading error')
           for warning in error_json['warnings']):
        slack_message = '\n'.join(error_json['warnings'])

@rlskoeser thoughts?

@rlskoeser
Contributor Author

rlskoeser commented Jul 8, 2020

Wow, their warning output really isn't that great, is it. And why is that a warning and not an outright error?

@kmcelwee could you revise your Python script to detect nonstandard JSON output (absence of any validation information, maybe) and then check the datapackage.json validity and send a Slack alert with the JSON parsing error? (Probably any case where we don't get validation output should be reported as an error / failure.)
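
Sketching what that could look like (the report.json filename is a stand-in for whatever the script already reads; the check itself reuses the json.tool command from option 1):

import json
import subprocess

with open('report.json') as f:
    report = json.load(f)

# an empty "tables" list means goodtables never actually validated anything
if not report.get('tables'):
    # rerun the option-1 check to capture the JSON parse error, if there is one
    check = subprocess.run(['python', '-m', 'json.tool', 'datapackage.json'],
                           capture_output=True, text=True)
    if check.returncode != 0:
        slack_message = 'datapackage.json is not valid JSON:\n' + check.stderr.strip()
    else:
        slack_message = 'goodtables produced no validation output for datapackage.json.'
    # ...then send slack_message and fail the build, as for ordinary validation errors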

@elambrinaki

@kmcelwee Thank you for finding my error!

@elambrinaki

@kmcelwee Thank you very much for setting up the goodtables reports! With their help, I corrected several errors on the Manuscript and Canonical Story sheets. For these two sheets, I confirm the error-summary is clear and matches the Slack message.

For the Story Instance sheet, with 765 errors, having 20 errors shown in the report is not enough. Is it possible to increase the number of errors shown for a specific sheet only? If so, I would like to view 1,000 errors on the Story Instance sheet and 20 errors on the other sheets. Alternatively, is it possible to change the order in which the errors are reported from
extra-header, duplicate-row, type-or-format-error, pattern-constraint
to
type-or-format-error, pattern-constraint, extra-header, duplicate-row?

A clarification regarding the large number of duplicate rows on the Story Instance sheet: for the mss that are currently being cataloged, I added identical rows with the ms name and formulas, but without unique information (ID, folios, etc.); that information will be filled in by catalogers. So duplicate rows are expected at the current stage of work.

@rlskoeser
Contributor Author

@elambrinaki I noticed this when I was starting to set up the validation. Two options have occurred to me so far:

  1. Is there a simple way for you to make these identical rows unique? e.g., could you fill in preliminary miracle numbers now that can be checked/refined when they are cataloged?
  2. We can turn off the duplicate row check in the goodtables validation — but as far as I can tell there's no way to do this for a single sheet, we would be turning it off for all sheets. (If we go this route, we'll want to make sure it's configurable so that we can easily turn it off later once those manuscripts have been cataloged.)

@kmcelwee have you thought of any other ways to handle this?

FWIW, I prefer option 1.
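
For reference on option 2, a rough sketch of what skipping the duplicate-row check might look like, assuming the validation runs through goodtables-py's validate() and that it accepts a skip_checks option (worth double-checking against the goodtables docs):

from goodtables import validate

# skip only the duplicate-row check; leave all other checks in place
# (gate this behind a config flag so it can be re-enabled once cataloging is done)
report = validate('datapackage.json', skip_checks=['duplicate-row'])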

@elambrinaki

@rlskoeser @kmcelwee Is increasing the number of reported errors to 1000 a possible solution?

@rlskoeser
Contributor Author

@elambrinaki oh, I didn't comment on that part of your request, but Kevin and I discussed it on Slack! Sorry. I don't think the number of errors reported in the build should be capped. Kevin's going to work on making that config optional and then turn it off for you, so you'll see all the errors. (FWIW, I don't want him spending additional dev time on sorting the errors.)

@kmcelwee
Contributor

kmcelwee commented Jul 9, 2020

@elambrinaki @rlskoeser I've just addressed ERROR_MAX in my latest commits to Princeton-CDH/pemm-data#12
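
The general shape is to make the cap optional, e.g. something like this (illustrative only, not necessarily how the commit does it; it assumes goodtables-py's error_limit option and an ERROR_MAX environment variable):

import os

from goodtables import validate

# only cap the number of reported errors when ERROR_MAX is set
options = {}
if os.environ.get('ERROR_MAX'):
    options['error_limit'] = int(os.environ['ERROR_MAX'])

report = validate('datapackage.json', **options)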

@elambrinaki

@rlskoeser @kmcelwee Perfect! Thank you very much!

@kmcelwee
Contributor

kmcelwee commented Jul 9, 2020

@kmcelwee kmcelwee closed this as completed Jul 9, 2020
@rlskoeser rlskoeser removed the awaiting testing issue is ready for acceptance testing label Jul 22, 2020