Implement more constraint checks in the Wikidata extension #2354
Closed
Labels
Type: Feature Request. Identifies requests for new features or enhancements. These involve proposing new improvements.
gsoc/outreachy. Projects proposed for internships. Please hold back from these tasks if you are not eligible.
wikibase. Related to wikidata/wikibase integration.
When uploading data to Wikidata, OpenRefine checks for common issues in the uploaded data, and reports these to the user before the upload. Many of these checks rely on Wikidata's own constraint system, which lets Wikidata contributors specify how each Wikidata property should be used (for instance by providing a regular expression for its format).
The Wikidata extension in OpenRefine supports only some of the constraints that Wikidata uses. As a result, some problems in data imports go undetected in OpenRefine and only surface later as constraint violations in Wikidata itself.
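To make the idea of a format check concrete, here is a minimal sketch in Python of validating values against a regular expression before upload. It is not OpenRefine's actual code (the extension is written in Java); the `Issue` class, the `check_format` function, the `P_EXAMPLE` property id and the pattern are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """A problem found in a value before upload (hypothetical structure)."""
    property_id: str
    value: str
    message: str


def check_format(property_id, pattern, values):
    """Return one Issue per value that does not fully match the pattern."""
    compiled = re.compile(pattern)
    issues = []
    for value in values:
        # The whole value must match, not just a prefix or a substring.
        if compiled.fullmatch(value) is None:
            issues.append(Issue(property_id, value, f"does not match {pattern}"))
    return issues


# Hypothetical property id and pattern, for illustration only.
issues = check_format("P_EXAMPLE", r"\d{4}-\d{3}[\dX]", ["1234-567X", "12345678"])
for issue in issues:
    print(issue.property_id, issue.value, issue.message)
```

A check of this kind needs only the pattern and the candidate values, so it can run locally on every edit once the pattern is known.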
Proposed solution
We could implement more constraint checks. This could include constraints defined in Wikidata but also other generic checks such as those implemented in #2103.
Additional context
Some constraints are expensive to check as they require communicating with Wikidata itself. Since constraint checks are run in real time (to provide quick feedback to the user), we should be careful not to add any expensive operations in new constraint checks.
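One possible way to keep real-time checks cheap is to fetch each property's constraint declarations at most once and reuse them. The sketch below assumes this approach; it is not a description of how OpenRefine currently works. `ConstraintCache`, `fake_fetch` and the dictionary layout of a constraint are hypothetical, and `fake_fetch` stands in for whatever network call would retrieve the declarations from Wikidata.

```python
from typing import Callable, Dict, List


class ConstraintCache:
    """Remembers constraint declarations per property (hypothetical helper)."""

    def __init__(self, fetch: Callable[[str], List[dict]]):
        self._fetch = fetch
        self._cache: Dict[str, List[dict]] = {}
        self.fetch_count = 0  # number of expensive fetches performed

    def constraints_for(self, property_id: str) -> List[dict]:
        # Only call the expensive fetcher the first time a property is seen.
        if property_id not in self._cache:
            self.fetch_count += 1
            self._cache[property_id] = self._fetch(property_id)
        return self._cache[property_id]


def fake_fetch(property_id: str) -> List[dict]:
    # Stand-in for a network request; the returned shape is an assumption.
    return [{"type": "format", "pattern": r"\d+"}]


cache = ConstraintCache(fake_fetch)
for _ in range(1000):
    cache.constraints_for("P_EXAMPLE")
print(cache.fetch_count)
```

Returning a list per property also leaves room for several declarations of the same constraint type on one property.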
The architecture of constraint checks in OpenRefine can evolve, for instance to accommodate more expensive checks transparently, to report better warnings to the user, or to better handle multiple constraint declarations of the same type on the same property. The current design is not set in stone.
There is also interest in developing a generic data validation system, not specific to Wikidata, where all sorts of issues could be reported (think validation against any tabular schema, for instance as defined by the Data Package or CSVW specs).
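As a rough illustration of what such generic validation might look like, the sketch below checks rows against a small schema with `required` and `pattern` constraints per field. The schema layout is loosely modeled on the field constraints found in the Data Package family of specs, but it is a simplified assumption, not an implementation of either spec; `validate_rows` and the sample data are hypothetical.

```python
import re

# Simplified, assumed schema layout: a list of fields with optional constraints.
schema = {
    "fields": [
        {"name": "id", "constraints": {"required": True, "pattern": r"Q\d+"}},
        {"name": "label", "constraints": {"required": True}},
        {"name": "note"},
    ]
}


def validate_rows(schema, rows):
    """Return (row index, field name, message) tuples for each problem found."""
    problems = []
    for row_index, row in enumerate(rows):
        for field in schema["fields"]:
            name = field["name"]
            constraints = field.get("constraints", {})
            value = row.get(name, "")
            if value == "":
                # Empty or absent cells only matter for required fields.
                if constraints.get("required"):
                    problems.append((row_index, name, "required value is missing"))
                continue
            pattern = constraints.get("pattern")
            if pattern and re.fullmatch(pattern, value) is None:
                problems.append((row_index, name, "value does not match pattern"))
    return problems


rows = [
    {"id": "Q42", "label": "Douglas Adams", "note": ""},
    {"id": "42", "label": ""},
]
problems = validate_rows(schema, rows)
for problem in problems:
    print(problem)
```

Reporting problems as plain (row, field, message) records keeps the checker independent of any particular target such as Wikidata.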
This is a proposed Outreachy project in 2020. If you are not planning to apply for an internship via Outreachy, we kindly ask that you do not work on this task yet, in order to leave the floor to potential interns.