Ability to save "schema templates": schemas where some values are left empty #5043
The proposal of “schema templates”, especially for the Wikimedia Commons integration project, is interesting. I understand the requirement that users would propose “incomplete” schemas or “schemas with holes” suiting their particular or temporary requirements. We have a number of such use cases. For example, different language communities may describe a hospital in different ways, suiting their local needs. Hence, during WikiTechstorm 2019, we proposed to simplify shape expressions (ShEx) in a way that helps users easily create schemas for their needs and validate their data with their own proposed schema (for example, the local definition of a hospital with the required properties). This is also very important for lexicographical data, where we require different schemas for lexemes of multiple languages.

Now, coming to the Wikimedia Commons integration project: based on my current usage/understanding of structured data on Commons, I feel that the requirements for schemas describing various data are not very complex and can be handled by simple shape expressions. These use cases were the inspiration behind the proposition of ShExStatements, a tool with the possibility to create simple shape expressions using CSV files. Take, for example, the QuickStatements tool that helps users easily create/modify statements on Wikidata using CSV files. Personally, I feel that this simplicity of using CSV files has helped lots of users to use OpenRefine and QuickStatements.

The proposal of “schema templates” could be useful for the community in the long run if it is based on standards. Tools like ShExStatements, EntitySchema Generator, and ShExLite were proposed to simplify the ShEx creation process. These could be explored. My suggestion is that you could use ShEx as the underlying technology of your proposal and continue proposing CSV for OpenRefine users for defining “schema templates”. A number of examples of shape expressions based on ShExStatements are given here.

There are cases where the use of certain properties was marked optional. Some of the shape expressions created by ShExStatements can be found here. Finally, I would like to point out the Cradle tool, which also proposes ShEx for creating new Wikidata items.
Hi @johnsamuelwrites, thanks a lot for chiming in!
How would you overcome the hurdles I mentioned on Wikidata? Of course, if the schemas simply consist of a list of properties that should be on a given entity, that is not very hard to parse (but then ShEx is perhaps a bit overkill for such a use case). Consider the case where the data needs to be input as qualifiers or references of statements. The corresponding ShEx would follow the RDF serialization of the data model to express this. For OpenRefine to be able to parse this and coerce it back to a schema with missing values, a rather complex heuristic needs to be implemented. This scenario of using qualifiers is not fictional: it is used very commonly on Wikimedia Commons.
I am not sure what you mean. Currently OpenRefine does not have any notion of schema template, and the one I am planning to implement does not use CSV, but JSON.
Translation to RDF and simplification are two possible approaches (also covered in the above link). For the second, we tried to handle the problem of creating shape expressions by proposing some of the tools/language subsets posted above. However, I haven't tried the former. I have used this tool for validation, but it uses the complete RDF of every entity to be validated. Based on my understanding, you want to handle it at the level of statements; for example, something like E273.
I wanted to say that OpenRefine must continue to support CSV files, even for future and new proposed use-cases (in this case, schema templates). JSON is interesting from a developer's perspective, but I am not convinced so far about its generalized use.
But lighter versions of ShEx can also handle these cases. For example, this ShEx of outbreak in CSV format uses qualifiers. You can even split ShEx into multiple files: E273, E274, and E275.
I think this highlights a misunderstanding. The set of formats that OpenRefine supports to create a new project has nothing to do with the format chosen to represent schema templates (just like currently, the format used to represent Wikibase schemas in OpenRefine has no relationship to the data import formats supported by OpenRefine).
Yes - I am not saying that ShEx does not support qualifiers or references. What I am saying is that it seems difficult, in general, to parse a ShEx file and turn it into an OpenRefine schema where some values are left empty. There are plenty of constructs in ShEx that simply do not have an equivalent in our context. As an example of why this is difficult, see this discussion about why ShEx support in Cradle is experimental: https://twitter.com/MagnusManske/status/1227545532698169345

Now, I do not think it is impossible to come up with some import/export from/to ShEx which works in simple cases, but this is bound to be brittle in general. So I would rather not use ShEx as the primary data storage format for our schema templates.
Suppose Alice and Bob both want to upload CSVs about book editions. To cover more use cases, let's say that there's a constraint on ps:P629 that requires a P407 edition name and constrains a P1365 (replaces) to be another work:

```
<edition_of_a_written_work> {
  # title
  wdt:P1476 LITERAL ;
  # edition or translation of
  ps:P629 {
    # edition name
    p:P407 @<language> ;
    # replaces
    p:P1365 @<written_work> *
  } ? ;
  ...
}
```

Bob.csv has a title and a publication date, so he has rules that map column 1 to P1476. Alice.csv has detailed data about encyclopediae and maps column 5 to P1476, column 8 to ps:P629/p:P407, and column 9 to ps:P629/p:P1365. In my example, it was worth exchanging the structure of the mapping target but not the mapping rules, because the sources (CSVs) had different structures. Is that realistic?

I think the OpenRefine interpretation of ShEx depends on what you expect OpenRefine to do. If the goal is just to drive the user interface, it's easy to take the above schema (or its ShExJ equivalent) and grep through the expressions for the predicates in order to drive a picklist with ~10 items rather than the 10K predicates in WD. Further, it would be easy to pick P629 and follow that up with prompts to supply P407 and P1365. If the use case goes beyond UI prompts, it would be nice to work through a concrete example in order to see how the required info relates to the entity schema.
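The "grep through the expressions" idea can be sketched in a few lines of Python. This is a rough illustration only (a hypothetical ShExJ-like fragment, not a real parser or OpenRefine code): it walks the JSON tree and collects every `predicate` IRI it finds, which would be enough to drive a short picklist.

```python
import json

def collect_predicates(expr, found=None):
    """Recursively walk a ShExJ-like tree and collect predicate IRIs.

    No full ShEx semantics here - just enough 'grepping' to list the
    properties a schema mentions.
    """
    if found is None:
        found = []
    if isinstance(expr, dict):
        if "predicate" in expr:
            found.append(expr["predicate"])
        for value in expr.values():
            collect_predicates(value, found)
    elif isinstance(expr, list):
        for item in expr:
            collect_predicates(item, found)
    return found

# Minimal ShExJ-like fragment (hypothetical, trimmed for illustration)
shexj = json.loads("""
{ "shapes": [ { "expression": { "type": "EachOf", "expressions": [
  { "predicate": "http://www.wikidata.org/prop/P571" },
  { "predicate": "http://www.wikidata.org/prop/P170" }
] } } ] }
""")

print(collect_predicates(shexj))
```

A picklist built this way would offer only the handful of properties the schema mentions, rather than every property on the Wikibase instance.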
Hi Eric, thanks for joining the discussion :)

We can absolutely grep through ShEx files. That is probably a cheap way to implement an import from ShEx files to our schema templates. But if we are just grepping through stuff, that means it cannot be the native format we use to store these templates internally.

Perhaps it is worth making clear that I am not expecting any user to interact with the JSON representation of schema templates (just like users do not need to interact with the JSON representation of our schemas). Users interact with schema templates only graphically (not just to use them, but also to create them). See #5103 for an overview of what is proposed. Generally, in OpenRefine the intention is to match the user experience on Wikibase itself as much as possible, and lower the entry bar by avoiding the need to learn a textual syntax.
I'm not convinced that we can not, or would not want to (if you don't mind sorting through my double negatives). Improvising a bit from #5103's Artwork wireframe to add an example qualifier, we can invent a minimal template expression to capture the anticipated shared structure (ignoring the Wikitext and Captions 'cause I didn't understand them). If we wanted a minimal expression of OpenRefine's current expressivity, we could have something like:

```json
{
  "name": "Artwork",
  "properties": [
    {
      "property": "P571",
      "propertyLabel": { "@en": "inception" },
      "valuePrompt": "YYYY(-MM(-DD))"
    },
    {
      "property": "P170",
      "propertyLabel": { "@en": "creator" }
    },
    {
      "property": "P180",
      "propertyLabel": { "@en": "depicts" },
      "qualifiers": [
        {
          "required": true,
          "qualifier": "P518",
          "qualifierLabel": { "@en": "applies to part" }
        }
      ]
    }
  ]
}
```

You suggested that this could be generated from some community ShEx schema:

```json
{ "shapes": [
  { "id": "https://www.wikidata.org/wiki/EntitySchema:Artwork",
    "expression": {
      "type": "EachOf",
      "expressions": [
        {
          "predicate": "http://www.wikidata.org/prop/P571",
          "min": 0,
          "valueExpr": {
            "type": "NodeConstraint",
            "pattern": {
              "type": "ShapeOr",
              "shapeExprs": [
                "http://www.w3.org/2001/XMLSchema#date",
                "http://www.w3.org/2001/XMLSchema#gYearMonth",
                "http://www.w3.org/2001/XMLSchema#gYear"
              ],
              "annotations": [{
                "type": "Annotation",
                "predicate": "http://www.wikidata.org/prop/P8397",
                "object": { "value": "YYYY(-MM(-DD))" }
              }]
            }
          }
        },
        {
          "predicate": "http://www.wikidata.org/prop/P170",
          "min": 0
        },
        {
          "predicate": "http://www.wikidata.org/prop/statement/P180",
          "valueExpr": {
            "type": "EachOf",
            "expressions": [
              {
                "predicate": "http://www.wikidata.org/prop/P518",
                "min": 1
              }
            ]
          }
        }
      ]
    }
  }
]}
```

supplemented by some labels:

```turtle
wd:P571 rdfs:label "inception"@en ;
  skos:prefLabel "inception"@en ;
  schema:name "inception"@en ;
  rdfs:label "date de fondation ou de création"@fr ;
  skos:prefLabel "date de fondation ou de création"@fr ;
  schema:name "date de fondation ou de création"@fr .
```
I'd asked if the goal was to exchange "the mapping target but not the mapping rules", but because I wasn't able to restrain myself, I asked more questions afterwards. Is that an accurate expression of the goal, i.e. is the issue what OpenRefine's UI generates and shares with other instances of OpenRefine? If so, you of course expect that a single-issue zealot like myself will argue that it can and should use ShEx directly, but I'll make that argument in a later comment.
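For concreteness, the kind of coercion being debated, from a flat ShExJ shape to the invented template structure above, might be sketched like this. This is purely illustrative: the template field names (`property`, `qualifiers`, `required`) come from the invented example in this thread, not from OpenRefine's actual schema format, and a real converter would need to handle far more ShEx constructs.

```python
def shexj_to_template(shexj, name):
    """Best-effort coercion of a flat ShExJ shape into a minimal
    schema-template dict (field names are invented for illustration)."""
    properties = []
    shape = shexj["shapes"][0]
    for triple_expr in shape["expression"]["expressions"]:
        pid = triple_expr["predicate"].rsplit("/", 1)[-1]
        entry = {"property": pid}
        value_expr = triple_expr.get("valueExpr", {})
        # A nested EachOf is read as qualifiers on the statement,
        # as in the P180 / P518 (applies to part) example.
        if value_expr.get("type") == "EachOf":
            entry["qualifiers"] = [
                {"qualifier": q["predicate"].rsplit("/", 1)[-1],
                 "required": q.get("min", 0) >= 1}
                for q in value_expr["expressions"]
            ]
        properties.append(entry)
    return {"name": name, "properties": properties}

# Tiny ShExJ-like input mirroring part of the Artwork example
shexj = {"shapes": [{"expression": {"expressions": [
    {"predicate": "http://www.wikidata.org/prop/P571", "min": 0},
    {"predicate": "http://www.wikidata.org/prop/statement/P180",
     "valueExpr": {"type": "EachOf", "expressions": [
         {"predicate": "http://www.wikidata.org/prop/P518", "min": 1}]}},
]}}]}

print(shexj_to_template(shexj, "Artwork"))
```

This works on the simple shape above, but as noted earlier in the thread, such a heuristic is bound to be brittle on ShEx constructs that have no equivalent in a template.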
Hi @ericprud, thanks for digging into this more. As a developer, when I choose a serialization format for some class in OpenRefine, I have the following requirements:
I am not able to satisfy those requirements with ShEx. Perhaps you are. If you are interested in working on this yourself, the door is open (see my reply on-wiki as well).
I'd be glad to pair with you. Though I'd still like to know the objective, e.g. "the mapping target but not the mapping rules".
I do not understand what you mean by that. I was hoping that the issue statement is clear enough if you are familiar with OpenRefine, but I am not sure whether you have used the tool already (in particular, doing uploads to a Wikibase instance from OpenRefine)?
As I understand it, the wireframe in [#5103] has CSV column names across the top:
Sharing a pickled form of the output specification encourages folks with diverse inputs to converge on a common output structure (as well as reducing their effort). Sharing the column mapping would be useful if different folks were likely to import data from the same input CSV schema. My question is whether the goal is to share the output schema or the mapping rules (the assignment of CSV columns to output fields).
So, if I understand you correctly, what you call "output schema" corresponds to what this issue calls "schema templates" (which do not exist yet in OpenRefine and which this issue proposes to introduce), and what you call "mapping rules" is what OpenRefine currently calls a "(Wikibase) schema", which OpenRefine users are already able to share, using an in-house JSON format.
FYI, some updates re: conversations on (not) using ShEx here: https://www.wikidata.org/wiki/Wikidata_talk:Schemas#Feedback_from_the_%22Entity_Schema_build-out_and_integration%22_task_of_WMDE's_development_plan,_and_OpenRefine
Yes, and in particular, let me emphasize the following points:

* We are not against a ShEx integration: if someone has an idea of what it should look like and how to implement it, they can totally come and implement it here.
* On my side, I am planning to stop working on Wikibase integration (and therefore will not work on ShEx support), because I want to focus my energy on the core of OpenRefine (which will also benefit the Wikibase use cases of OpenRefine).
* I am available to review code and help onboard people, but will (soon) stop developing new features and fixing bugs in this part of the tool.
I think that answers my question. I was trying to figure out the scope of "schema templates". Your text suggests that it's a description of the structure to be emitted by OpenRefine. What I meant by "rules" is the pairing of some column in the CSV with an output field description in the schema template. I think the Wikibase schema is more of a meta-schema describing how the notional data structure described by the schema template is communicated in the Wikibase infrastructure. Again, all subject to my understanding being correct.
Great! Again, if you are interested in thinking about a ShEx integration in OpenRefine, I think it is worth getting familiar with the tool itself first, as a user. There are plenty of tutorials (textual, video) for that. Once you have this familiarity, it is much easier to relate to the actual needs of the users and come up with a meaningful integration.
Closes OpenRefine#5043. This adds the ability to start a schema from a previously saved schema template. A schema template is a schema with some missing values. It is also possible to pre-define some templates in the Wikibase manifest, so that those are added to OpenRefine when the manifest is first loaded. I think it would be useful to add an interface to list existing schema templates, which would enable:

* deleting schema templates
* exporting a schema template to JSON (useful to add it to the manifest afterwards)

Also includes some small UI improvements.
Wikibase communities generally establish some data modelling conventions, which users are asked to follow when importing data.
For instance, a book on Wikidata will generally have a certain set of statements on it (title, author, date of publication…).
When importing data in a Wikibase instance, users need to have the knowledge of those modelling conventions. If they do not have this knowledge, a blank schema page can be quite daunting.
Proposed solution
We should make it possible for users to define "schema templates" which represent a prototypical Wikibase schema for an import in a certain domain. Those schema templates would be incomplete Wikibase schemas, in the sense that some statement values could be left blank, for instance. After loading such a template, the user could then drag and drop the columns from their project to the appropriate locations in the schema.
For the implementation, we would just use the same classes as the ones that define schemas, but relax the validation in the constructors to allow for empty values. We would then add methods to check that a given schema is complete. This would be the opportunity to return more precise error messages when the schema is incomplete (#4724).
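The relaxed-validation idea can be sketched abstractly. This is a minimal illustration, not OpenRefine's actual Java classes: it assumes a schema represented as nested dicts/lists where a hole is `None`, and the completeness check returns the path of every hole, so those paths can double as precise error messages.

```python
def find_holes(node, path="schema"):
    """Return the paths of all empty values in a schema-like structure.

    Relaxed construction accepts holes (None); a template is complete
    when this list is empty, and each path says exactly what is missing.
    """
    holes = []
    if node is None:
        holes.append(path)
    elif isinstance(node, dict):
        for key, value in node.items():
            holes.extend(find_holes(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            holes.extend(find_holes(item, f"{path}[{i}]"))
    return holes

# A template: two statements whose values are left blank,
# to be filled by dragging project columns onto them.
template = {
    "item": "Q_NEW",
    "statements": [
        {"property": "P571", "value": None},
        {"property": "P170", "value": None},
    ],
}

print(find_holes(template))
```

A template with an empty `find_holes` result would be a complete, uploadable schema; a non-empty result pinpoints exactly which slots the user still needs to fill.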
Alternatives considered
Use ShEx or SHACL. Those formats are not designed for the Wikibase data model, but rather for RDF. This means that it would not be possible to take a schema represented in one of those formats and render it as a Wikibase schema with holes. Therefore this does not seem like a good fit here.
Additional context
Requested by @trnstlntk @lozanaross in the context of the Wikimedia Commons integration project.