Types of use case:
- Report and manage issues - requires sharing
- Error logging and analytics - why share?
- Data quality reporting for data publishers - requires sharing
- Establish transparency about errors in open data apps (data quality reports)
As a Data Wrangler I want to pipe my errors into an online system so that I can review them later
Notes:
- What are errors? Example: Cell B20 in sheet X is empty and should be float
As a .... I want to have an error or set of errors associated with an issue (automatically?) so that I can fix them in bulk
As a ... I want to generate an aggregate report of all the errors on a task and their associated issues so that I can see patterns
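A rough sketch of what such an aggregate report might compute, assuming each error is a plain dict with the `error_type` field from the field list in these notes. The `issue_id` key is hypothetical: the notes do not yet say how an error is linked to an issue.

```python
from collections import Counter


def aggregate_report(errors):
    """Summarise the errors of one task by type and by associated issue."""
    # Count how often each error_type occurs, to surface patterns.
    by_type = Counter(e["error_type"] for e in errors)
    # issue_id is a hypothetical link field; None means "not attached to an issue yet".
    by_issue = Counter(e.get("issue_id") for e in errors)
    return {
        "total": len(errors),
        "by_type": dict(by_type),
        "by_issue": dict(by_issue),
    }
```

Errors that share an `error_type` or an `issue_id` end up in the same bucket, which is the kind of pattern the story asks for.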
Create an Issue
------------
As a Data User I want to report a problem with a dataset so that it can be fixed by the owner and I can see that it was fixed (or not!)
Notes:
- An Issue can be an Error (as above) but also can be higher level - e.g. all dates are in yyyy-dd-mm format rather than yyyy-mm-dd
As a Task Owner I want to close an issue so that I can indicate it's fixed (or that it won't be fixed, etc.)
...
- Task
- Issue
- Error
Generating task:
- repo_url -
- dataset_url -
- triggering_user
- run_id -
Info on the actual error:
- timestamp
- record_id - row number in most cases
- source_path - input file name
- dest_path - output file name - ??
- source_field/attribute -
- dest_field - ??
- query (xpath, sql) - when you do scraping you have xpath or css selector etc
- value - erroneous value
- level - debug, info, error, warning
- error_type - ValidationError, TypeError, ValueError, ...
- publisher_name
- publisher_id
- dataset_url
- message - JSON structured message with more info?
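The fields above could be modelled roughly as follows. This is a minimal Python sketch, not a settled schema: the class name `ErrorRecord`, the choice of required versus optional fields, the types, and the `to_json` helper are all assumptions. The fields marked `??` in the notes are kept as optional, and `dataset_url` (listed twice above) appears once.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

LEVELS = ("debug", "info", "warning", "error")


def _now() -> str:
    # ISO 8601 UTC timestamp; the timestamp format is an assumption.
    return datetime.now(timezone.utc).isoformat()


@dataclass
class ErrorRecord:
    # Generating task
    repo_url: str
    dataset_url: str
    triggering_user: str
    run_id: str
    # Info on the actual error
    error_type: str                      # e.g. ValidationError, TypeError, ValueError
    level: str = "error"
    timestamp: str = field(default_factory=_now)
    record_id: Optional[int] = None      # row number in most cases
    source_path: Optional[str] = None    # input file name
    dest_path: Optional[str] = None      # output file name (open question in the notes)
    source_field: Optional[str] = None
    dest_field: Optional[str] = None     # open question in the notes
    query: Optional[str] = None          # xpath, sql, css selector, ...
    value: Optional[str] = None          # erroneous value
    publisher_name: Optional[str] = None
    publisher_id: Optional[str] = None
    message: dict = field(default_factory=dict)  # structured extra info

    def __post_init__(self):
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level!r}")

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```

The example error from the notes ("Cell B20 in sheet X is empty and should be float") might then be recorded like this, with placeholder URLs and user names:

```python
record = ErrorRecord(
    repo_url="https://example.org/repo",          # placeholder
    dataset_url="https://example.org/data.xlsx",  # placeholder
    triggering_user="alice",                      # placeholder
    run_id="run-1",                               # placeholder
    error_type="ValidationError",
    record_id=20,
    source_field="B",
    value="",
    message={"text": "Cell B20 in sheet X is empty and should be float"},
)
print(record.to_json())
```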
status = 'open', 'closed'
issue_comment
action = comment || closing || reopening
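One possible reading of how `action` drives `status`, sketched in Python. The `Issue` class and its `apply` method are hypothetical names, and rejecting a redundant transition (closing an already closed issue, reopening an open one) is an assumption the notes do not state.

```python
from dataclasses import dataclass, field

ACTIONS = ("comment", "closing", "reopening")
# Status produced by each action; "comment" leaves the status unchanged.
ACTION_TO_STATUS = {"closing": "closed", "reopening": "open"}


@dataclass
class Issue:
    status: str = "open"
    comments: list = field(default_factory=list)

    def apply(self, action: str, text: str = "") -> None:
        """Record an issue_comment and update the status accordingly."""
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        # Assumption: an action that would not change the status is an error.
        if ACTION_TO_STATUS.get(action) == self.status:
            raise ValueError(f"issue is already {self.status}")
        self.comments.append({"action": action, "text": text})
        self.status = ACTION_TO_STATUS.get(action, self.status)
```

Keeping every action as a comment entry gives the Data User a visible history of whether an issue was fixed or not.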