Implement more constraint checks in the Wikidata extension #2354
Comments
Also, regarding validating against a Type or Statement and "checks": using Schemas and Shape Expressions (ShEx) for validation rules is something that many folks inside and outside the Wikidata ecosystem, including myself and @VladimirAlexiev, have been discussing and experimenting with for a long time. Example of a tool (other APIs, such as pyshexy, are mentioned in the discussion link above): There are of course tasks and issues in Wikidata Phabricator that would help (linking Schemas in statements), and some in the larger Epic (Adding new datatypes to Wikidata). |
Have we got an issue for more general validation in OR? |
General validation? Not really, but we could use #1727 (Design how type validation and other constraint validations should be handled), which was meant to start pushing along research on validation (we can cut out the Data Package mention within it). The rest of the many designs are in my head and in others' heads, but not formally written down for us. A lot of this kind of work I've done outside of OpenRefine as part of my job as a Data Architect, where the Data Modeling with 3GPP systems for Telecom was very strict and standards-based (otherwise your cell phone wouldn't even work). But here, with Wikidata and OpenRefine working with general Statements, Types, and Annotations, there is a need for conforming and validating Schema on both sides. ShEx is just one plausible solution, with lots of experimenting needed, and many folks worry about performance during verify/validation runs, depending on what is asked of the Wikidata backend and indexes. We already have, inside and outside the Wikidata community, some very well-defined use cases for various validation and constraint needs. The open questions are "where and how" we define them, "who" is going to be impacted by performance (user, service, or both), and then how to minimize that impact. I think it's worth having a general community meetup on this topic of validation at some Wikidata conference, or virtually online, to discuss "what is possible". This is a wide topic with system-level impacts for data producers like Wikidata, Google, GLAM, OCLC and Schema producers like Uni-Leipzig, etc. @danbri might even want to be slightly involved in the discussion here, dunno. |
I don't think we have got one yet - it might make sense to move some of the discussion there to keep this focused on Wikidata constraints. |
Is it possible to use WD's constraint system itself rather than reimplementing constraints? ShEx is not yet used on a large scale in WD... Also, some specialized constraints (e.g. Contemporary With) are not implementable in ShEx. |
That is something that @lucaswerkmeister and I have talked about when the Wikidata extension was first released. Currently, WD's constraint system can only check violations for statements that are already saved in WD: https://phabricator.wikimedia.org/T194194. Even if that limitation was lifted, we would need to think twice before migrating to it. Having a local implementation of the constraints makes it possible to check for issues in real time, even for relatively big edit batches, which would probably not be doable if we have to issue one HTTP request for each statement in the batch (see more discussion on the phabricator task above). Also, some of the issues we report do not correspond to any WD constraint (see for instance #2103 which added a bunch of new ones).
Yes, there has been some interest in "implementing ShEx in OpenRefine" but I have yet to see a clear use case for it. It is surely not a drop-in replacement for the WD constraint system at the moment, at least. |
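To make the point about local checks concrete, here is a rough sketch of how a conflicts-with style check could run over a whole edit batch in memory, without any HTTP request per statement. Everything in it is an assumption for illustration: the `Edit` class, the `findConflicts` method and the placeholder ids (`Q1`, `P1`, `P2`) are hypothetical and do not reflect the extension's actual classes, and the sketch only looks at statements inside the batch, not at what is already saved in Wikidata.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not OpenRefine's real scrutinizer API.
class LocalBatchCheckSketch {

    /** Minimal stand-in for one candidate statement: a subject item and a property id. */
    static final class Edit {
        final String item;
        final String property;

        Edit(String item, String property) {
            this.item = item;
            this.property = property;
        }
    }

    /**
     * Reports items in the batch that use a property together with another
     * property it is declared to conflict with. Runs entirely in memory.
     */
    static List<String> findConflicts(List<Edit> batch, Map<String, Set<String>> conflictsWith) {
        // Group the properties used on each item, keeping insertion order.
        Map<String, Set<String>> propertiesByItem = new LinkedHashMap<>();
        for (Edit edit : batch) {
            propertiesByItem.computeIfAbsent(edit.item, k -> new LinkedHashSet<>()).add(edit.property);
        }

        List<String> warnings = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : propertiesByItem.entrySet()) {
            Set<String> used = entry.getValue();
            for (String property : used) {
                Set<String> forbidden = conflictsWith.getOrDefault(property, Collections.emptySet());
                for (String other : forbidden) {
                    if (used.contains(other)) {
                        warnings.add(entry.getKey() + ": " + property + " conflicts with " + other);
                    }
                }
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        List<Edit> batch = new ArrayList<>();
        batch.add(new Edit("Q1", "P1"));
        batch.add(new Edit("Q1", "P2"));
        batch.add(new Edit("Q2", "P1"));

        // Placeholder rule: P1 must not appear together with P2 on the same item.
        Map<String, Set<String>> rules = new LinkedHashMap<>();
        rules.put("P1", Collections.singleton("P2"));

        System.out.println(findConflicts(batch, rules));
    }
}
```

The cost here grows with the size of the batch and the number of declared conflicts, not with network latency, which is the property that matters for real-time feedback.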
@wetneb I want to start my contribution for Outreachy, but it is a bit intimidating to start off with. Also, I couldn't comprehend how to proceed with this task. |
Hi @TejaswiKarasani, The idea behind the contribution phase is not that you start working on the Outreachy project directly - the intention is more that you get familiar with the environment, by making much smaller contributions which will help you get up to speed. We have a collection of good first issues: those are reasonably small tasks that you can tackle, this will give you the opportunity to set up your development environment and clarify any issues about the workflow to contribute to OpenRefine. To understand better what this task is about, I encourage you to try out OpenRefine and its Wikidata integration by yourself. Try following tutorials such as this one: Once you have done both of these things, you should be in a better position to apply for this task. Let me know if you have any specific questions in the process :) |
Ok @wetneb :) |
I first forked it (https://github.com/OpenRefine/OpenRefine) into my GitHub and then cloned it (https://github.com/TejaswiKarasani/OpenRefine) to my desktop. When I use refine.bat build, I can just see the following but can't install anything. I am facing the following errors while setting up the project in Eclipse, though I did install the Maven dependency too: Missing artifact com.codeberry.jdatapath:jdatapath:jar:alpha2 |
Hi everyone! I'm Hammad, a final year Software Engineering student from Pakistan and an Outreachy 2020 aspirant. I was looking at OpenRefine's project: "Implement more constraint checks in OpenRefine's Wikidata extension" and it seemed very interesting to me. Having an affinity for Java and some experience with Wikidata, I feel like I can do well. I'd already introduced myself on Gitter but thought I'd do it here too. Looking forward to working with you all! |
@TejaswiKarasani This seems to be due to the fact that @madham32 welcome to the project! |
Thanks for your response @wetneb . I will contact @thadguidry through email :) |
@TejaswiKarasani First things first... The Maven cache path that you have Maven cache folder on Windows should default to a So the above tells me your Maven installation on Windows is not set up properly. You might need to clean up and delete that folder Once Maven is cleaned up (you no longer have a Maven's installation should also set its Don't you love the simplicity that Windows brings to your life? :-) |
@thadguidry thanks a ton it did run :) |
I've set up OpenRefine and was looking over some good-first-issue issues, though I was having a hard time understanding the codebase and where to get started. I was wondering if there are any pointers to where I can start working, or some particularly easy-to-understand issues or code I can start with, because right now I'm feeling kind of lost. Thanks! |
@madham32 Try going through the wiki once, there is also this document there on how to write an extension, and they kinda explain the codebase structure in it. |
Implemented conflicts-with scrutinizer as part of #2354
* Add ItemRequires Constraint Implemented Item requires constraint as part of #2354 * done with implementation of ItemRequiresScrutinizer Class * Test class added with suitable test cases
* Add One-of qualifier value property constraint Implemented one-of qualifier value property constraint as part of #2354 * Test class added * Test cases updated and working fine * resolved merge conflicts
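For readers unfamiliar with this constraint, a minimal sketch of the idea follows: for a given qualifier property, only a fixed set of values is allowed. This is an illustration under assumptions, not the code from the commit; the class name, the method names and the ids `P10`, `Q100`, `Q200` are hypothetical placeholders.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a one-of qualifier value check; not the actual scrutinizer.
class OneOfQualifierValueSketch {
    private final String qualifierProperty;
    private final Set<String> allowedValues;

    OneOfQualifierValueSketch(String qualifierProperty, String... allowedValues) {
        this.qualifierProperty = qualifierProperty;
        this.allowedValues = new HashSet<>(Arrays.asList(allowedValues));
    }

    /**
     * Returns true when the qualifier is the constrained one and its value is
     * not in the allowed set. Other qualifier properties are never flagged.
     */
    boolean isViolation(String qualifierPid, String value) {
        return qualifierProperty.equals(qualifierPid) && !allowedValues.contains(value);
    }

    public static void main(String[] args) {
        OneOfQualifierValueSketch check = new OneOfQualifierValueSketch("P10", "Q100", "Q200");
        System.out.println(check.isViolation("P10", "Q100"));
        System.out.println(check.isViolation("P10", "Q999"));
        System.out.println(check.isViolation("P11", "Q999"));
    }
}
```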
Implemented Citation needed constraint as part of #2354. Test class added with appropriate test cases. Updated severity level to critical as well as the messages. Merged unsourced and citation-needed scrutinizers. Updated severity levels and warning messages.
Closing as @darecoder implemented all constraint checks that can be executed quickly. Great job! |
When uploading data to Wikidata, OpenRefine checks for common issues in the uploaded data, and reports these to the user before the upload. Many of these checks rely on Wikidata's own constraint system, which lets Wikidata contributors specify how each Wikidata property should be used (for instance by providing a regular expression for its format).
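As an illustration of the format example mentioned above, a check of this kind boils down to matching each value against the regular expression declared for the property. The sketch below is hypothetical: `FormatConstraintSketch` is not a class from the extension, and the regex is a made-up placeholder rather than one taken from a real Wikidata property.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a format constraint check; names are illustrative only.
class FormatConstraintSketch {
    private final Pattern pattern;

    FormatConstraintSketch(String regex) {
        // Compile once so the same pattern can be reused for every value in a batch.
        this.pattern = Pattern.compile(regex);
    }

    /** Returns true when the entire value matches the declared format. */
    boolean isValid(String value) {
        return pattern.matcher(value).matches();
    }

    public static void main(String[] args) {
        // Placeholder format: exactly four digits.
        FormatConstraintSketch check = new FormatConstraintSketch("[0-9]{4}");
        System.out.println(check.isValid("1234"));
        System.out.println(check.isValid("12a4"));
    }
}
```

Using `matches()` rather than `find()` requires the whole value to conform, so a value that merely contains a valid substring is still reported.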
The Wikidata extension in OpenRefine only supports some of the constraints that Wikidata uses. This means that some problems in data imports can go undetected and get flagged up as constraint violations later on in Wikidata itself.
Proposed solution
We could implement more constraint checks. This could include constraints defined in Wikidata but also other generic checks such as those implemented in #2103.
Additional context
Some constraints are expensive to check as they require communicating with Wikidata itself. Since constraint checks are run in real time (to provide quick feedback to the user), we should be careful not to add any expensive operations in new constraint checks.
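One possible way to keep checks cheap, sketched here purely as an assumption, is to fetch each property's constraint definition at most once and serve later lookups from memory. The `ConstraintCacheSketch` class and its loader function are hypothetical; the loader stands in for whatever call would actually retrieve the definition from Wikidata.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: memoizes constraint definitions so the expensive lookup runs once per property.
class ConstraintCacheSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> loader;
    private final AtomicInteger loads = new AtomicInteger();

    ConstraintCacheSketch(Function<String, String> loader) {
        this.loader = loader;
    }

    /** Returns the format regex for a property, calling the loader only on the first request. */
    String formatRegexFor(String pid) {
        return cache.computeIfAbsent(pid, id -> {
            loads.incrementAndGet();
            return loader.apply(id);
        });
    }

    /** Number of times the (assumed expensive) loader was actually invoked. */
    int loadCount() {
        return loads.get();
    }

    public static void main(String[] args) {
        // The lambda stands in for a remote fetch; here it returns a fixed placeholder regex.
        ConstraintCacheSketch cache = new ConstraintCacheSketch(pid -> "[0-9]+");
        cache.formatRegexFor("P1");
        cache.formatRegexFor("P1");
        System.out.println(cache.loadCount());
    }
}
```

Caching only helps for constraints whose definition is the expensive part; checks that need fresh data about existing statements would still require a different design.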
The architecture of constraint checks in OpenRefine can evolve - for instance to accommodate more expensive checks transparently, better warnings reported to the user, better handling of multiple constraint declarations of the same type on the same property… The current design is not set in stone.
There is also an interest in developing a generic data validation system, not specific to Wikidata, where all sorts of issues could be reported (think validation against any tabular schema, for instance as defined by the Data Package or CSVW specs).
This is a proposed Outreachy project in 2020. If you are not planning to apply for an internship via Outreachy, we kindly ask that you do not work on this task yet, in order to leave the floor to potential interns.