Ability to save "schema templates": schemas where some values are left empty #5043

Closed
wetneb opened this issue Jul 8, 2022 · 16 comments · Fixed by #5191
Labels
Type: Feature Request (identifies requests for new features or enhancements) · wikibase (related to wikidata/wikibase integration) · wikicommons (related to Wikimedia Commons integration)

Comments

@wetneb
Member

wetneb commented Jul 8, 2022

Wikibase communities generally establish some data modelling conventions, which users are asked to follow when importing data.
For instance, a book on Wikidata will generally have a certain set of statements on it (title, author, date of publication…).

When importing data into a Wikibase instance, users need to know those modelling conventions. Without that knowledge, a blank schema page can be quite daunting.

Proposed solution

We should make it possible for users to define "schema templates" which represent a prototypical Wikibase schema for an import in a certain domain. Those schema templates would be incomplete Wikibase schemas, in the sense that some statement values could be left blank, for instance. After loading such a template, the user could then drag and drop the columns from their project to the appropriate locations in the schema.

For the implementation, we would reuse the same classes as the ones that define schemas, but relax the validation in the constructors to allow for empty values. We would then add methods to check that a given schema is complete. This would also be the opportunity to return more precise error messages when the schema is incomplete (#4724).
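
As a rough illustration of this approach, here is a minimal Java sketch of a statement slot whose value may be left blank. The class and method names (WbStatementTemplate, isComplete, validationErrors) are hypothetical and do not correspond to OpenRefine's actual Wikibase schema classes.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only, not OpenRefine's actual schema classes.
public class WbStatementTemplate {
    private final String propertyId; // e.g. "P571"
    private final String value;      // null when the template leaves the value blank

    public WbStatementTemplate(String propertyId, String value) {
        // Relaxed validation: the property is mandatory, but the value may be missing.
        if (propertyId == null || propertyId.isEmpty()) {
            throw new IllegalArgumentException("A statement needs a property");
        }
        this.propertyId = propertyId;
        this.value = value;
    }

    /** True when every field required for an actual upload is filled in. */
    public boolean isComplete() {
        return value != null && !value.isEmpty();
    }

    /** Precise error messages for the incomplete parts (cf. #4724). */
    public List<String> validationErrors() {
        List<String> errors = new ArrayList<>();
        if (!isComplete()) {
            errors.add("Missing value for property " + propertyId);
        }
        return errors;
    }
}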

Alternatives considered

Use ShEx or SHACL. Those formats are designed for RDF rather than for the Wikibase data model. This means that it would not be possible to take a schema represented in one of those formats and render it as a Wikibase schema with holes. Therefore this does not seem like a good fit here.

Additional context

Requested by @trnstlntk @lozanaross in the context of the Wikimedia Commons integration project.

@wetneb wetneb added Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements. wikibase Related to wikidata/wikibase integration wikicommons Related to Wikimedia Commons integration labels Jul 8, 2022
@trnstlntk trnstlntk moved this to To be triaged in Structured Data on Commons Jul 17, 2022
@trnstlntk trnstlntk moved this from To be triaged to WMF grant - to do in Structured Data on Commons Jul 17, 2022
@johnsamuelwrites

The proposal of “schema templates”, especially for the Wikimedia Commons integration project, is interesting. I understand the requirement that users would propose “incomplete” schemas or “schemas with holes” suiting their particular or temporary requirements. We have a number of such use cases. For example, different language communities may describe a hospital in different ways, suiting their local needs. Hence, during WikiTechstorm 2019, we proposed to simplify shape expressions (ShEx) in a way that helps users easily create schemas for their needs and validate their data with their own proposed schema (for example, a local definition of a hospital with the required properties). This is also very important for lexicographical data, where we require different schemas for lexemes of multiple languages.

Now coming to the Wikimedia Commons integration project, based on my current usage and understanding of structured data on Commons, I feel that the requirements for schemas describing various data are not very complex and can be handled by simple shape expressions.

These use cases were the inspiration behind ShExStatements, a tool that makes it possible to create simple shape expressions from CSV files. Take, for example, the QuickStatements tool, which helps users easily create or modify statements on Wikidata using CSV files. Personally, I feel that this simplicity of using CSV files has helped lots of users adopt OpenRefine and QuickStatements.

The proposal of “schema templates” could be useful for the community in the long run if it is based on standards. Tools like ShExStatements, EntitySchema Generator, and ShExLite have been proposed to simplify the ShEx creation process. These could be explored.

My suggestion is that you could use ShEx as the underlying technology of your proposal and continue proposing CSV for OpenRefine users for defining “schema templates”. A number of examples of shape expressions based on ShExStatements are given here. There are cases where the use of certain properties is marked optional. Some of the shape expressions created by ShExStatements can be found here.

Finally, I would like to point out the Cradle tool, which also proposes ShEx for creating new Wikidata items.

@wetneb
Member Author

wetneb commented Jul 25, 2022

Hi @johnsamuelwrites, thanks a lot for chiming in!

My suggestion is that you could use ShEx as the underlying technology of your proposal

How would you overcome the hurdles I mentioned on Wikidata?

Of course, if the schemas simply consist of a list of properties that should be on a given entity, that is not very hard to parse (but then ShEx is perhaps a bit overkill for such a use case).

Consider the case where the data needs to be input as qualifiers or references of statements. The corresponding ShEx would follow the RDF serialization of the data model to express this. For OpenRefine to be able to parse this and coerce it back to a schema with missing values, a rather complex heuristic needs to be implemented.

This scenario of using qualifiers is not fictional: it is used very commonly on Wikimedia Commons.

continue proposing CSV for OpenRefine users for defining “schema templates”.

I am not sure what you mean. Currently OpenRefine does not have any notion of schema template, and the one I am planning to implement does not use CSV, but JSON.

@johnsamuelwrites

How would you overcome the hurdles I mentioned on Wikidata?

Translation to RDF and simplification are two possible approaches (also covered in the above link). For the second, we tried to handle the problem of creating shape expressions by proposing some of the tools/language subsets posted above. However, I haven't tried the former. I have used this tool for validation, but it uses the complete RDF of every entity to be validated. Based on my understanding, you want to handle it at the level of statements. For example, something like E273.

continue proposing CSV for OpenRefine users for defining “schema templates”.

I wanted to say that OpenRefine must continue to support CSV files, even for future and new proposed use-cases (in this case, schema templates). JSON is interesting from a developer's perspective, but I am not convinced so far about its generalized use.

This scenario of using qualifiers is not fictional: it is used very commonly on Wikimedia Commons.

But lighter versions of ShEx can also handle these cases. For example, this ShEx of outbreak in CSV format uses qualifiers. You can even split a ShEx schema across multiple files: E273, E274, and E275.

@wetneb
Member Author

wetneb commented Jul 25, 2022

I wanted to say that OpenRefine must continue to support CSV files, even for future and new proposed use-cases (in this case, schema templates). JSON is interesting from a developer's perspective, but I am not convinced so far about its generalized use.

I think this highlights a misunderstanding. The set of formats that OpenRefine supports to create a new project has nothing to do with the format chosen to represent schema templates (just like currently, the format used to represent Wikibase schemas in OpenRefine has no relationship to the data import formats supported by OpenRefine).

But lighter versions of ShEx can also handle these cases. For example, this ShEx of outbreak in CSV format uses qualifiers. You can even split ShEx to multiple files: E273, E274, and E275.

Yes - I am not saying that ShEx does not support qualifiers or references. What I am saying is that it seems difficult, in general, to parse a ShEx file and turn it into an OpenRefine schema where some values are left empty. There are plenty of constructs in ShEx that simply do not have an equivalent in our context: for instance, the OR operator, cardinality constraints, and so on.

As an example of why this is difficult, see this discussion about why ShEx's support in Cradle is experimental: https://twitter.com/MagnusManske/status/1227545532698169345

Now I do not think it is impossible to come up with some import/export from/to ShEx which works in simple cases, but this is bound to be brittle in general. So I would rather not use ShEx as the primary data storage format for our schema templates.
If you (or anyone else) are interested in working on such an integration, I would be happy to advise and facilitate.

@ericprud

ericprud commented Jul 28, 2022

Suppose Alice and Bob both want to upload CSVs about book editions. To cover more use cases, let's say that there's a constraint on ps:P629 that requires a P407 edition name and constrains a P1365 replaces to be another work.


<edition_of_a_written_work> {
  # title
  wdt:P1476 LITERAL ;
  # edition or translation of
  ps:P629 {
    # edition name
    p:P407 @<language> ;
    # replaces
    p:P1365 @<written_work> *
  } ? ;
...

Bob.csv has a title and a publication date, so he has rules that map column 1 to P1476. Alice.csv has detailed data about encyclopaediae and maps column 5 to P1476, column 8 to ps:P629/p:P407, and column 9 to ps:P629/p:P1365.

In my example, it was worth exchanging the structure of the mapping target but not the mapping rules because the sources (CSVs) had different structures. Is that realistic?

I think the OpenRefine interpretation of ShEx depends on what you expect OpenRefine to do. If the goal is just to drive the user interface, it's easy to take the above schema (or its ShExJ equivalent) and grep through the expressions for the predicates in order to drive a picklist with ~10 items rather than the 10K predicates in WD. Further, it would be easy to pick P629 and follow that up with prompts to supply P407 and P1365.
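
For illustration, a naive Java sketch (using Jackson) of that "grep through the expressions" idea: it walks a ShExJ document and collects every "predicate" IRI it finds, which could then populate such a picklist. The file name and class name are hypothetical.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.util.LinkedHashSet;
import java.util.Set;

public class ShExPredicateScraper {

    /** Recursively collect every "predicate" IRI found in a ShExJ JSON tree. */
    static void collectPredicates(JsonNode node, Set<String> out) {
        if (node.isObject()) {
            JsonNode predicate = node.get("predicate");
            if (predicate != null && predicate.isTextual()) {
                out.add(predicate.asText());
            }
        }
        // For both objects and arrays, recurse into the child nodes.
        node.forEach(child -> collectPredicates(child, out));
    }

    public static void main(String[] args) throws Exception {
        // "edition_of_a_written_work.json" is a hypothetical ShExJ file.
        JsonNode schema = new ObjectMapper().readTree(new File("edition_of_a_written_work.json"));
        Set<String> predicates = new LinkedHashSet<>();
        collectPredicates(schema, predicates);
        // These few predicate IRIs could then populate a picklist in the UI.
        predicates.forEach(System.out::println);
    }
}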

If the use case goes beyond UI prompts, it would be nice to work through a concrete example in order to see how the required info relates to the entity schema.

@wetneb
Member Author

wetneb commented Jul 28, 2022

Hi Eric, thanks for joining the discussion :)
Yes, this issue is solely about creating a data input UI (although we can definitely create more issues if you have other ideas).

We can absolutely grep through ShEx files. That is probably a cheap way to implement an import from ShEx files to our schema templates. But if we are just grepping through stuff, that means it can not be the native format we use to store these templates internally.

Perhaps it is worth making clear that I am not expecting any user to interact with the JSON representation of schema templates (just like users do not need to interact with the JSON representation of our schemas). Users interact with schema templates only graphically (not just to use them, but also to create them). See #5103 for an overview of what is proposed. Generally, in OpenRefine the intention is to match the user experience on Wikibase itself as much as possible, and lower the entry bar by avoiding the need to learn a textual syntax.

@ericprud

ericprud commented Aug 3, 2022

But if we are just grepping through stuff, that means it can not be the native format we use to store these templates internally.

I'm not convinced that we can not, would not want to (if you don't mind sorting through my double-negatives). Improvising a bit from [#5103]'s Artwork wireframe to add an example qualifier, we can invent a minimal template expression to capture the anticipated shared structure (and ignoring the Wikitext and Captions 'cause I didn't understand them). If we wanted a minimal expression of OpenRefine's current expressivity, we could have something like:

{
  "name": "Artwork",
  "properties": [
    {
      "property": "P571",
      "propertyLabel": { "@en": "inception" },
      "valuePrompt": "YYYY(-MM(-DD))"
    },
    {
      "property": "P170",
      "propertyLabel": { "@en": "creator" }
    },
    {
      "property": "P180",
      "propertyLabel": { "@en": "depicts" },
      "qualifiers": [
        {
          "required": true,
          "qualifier": "P518",
          "qualifierLabel": {"@en": "applies to part" }
        }
      ]
    }
  ]
}

You suggested that this could be generated from some community ShEx schema:

{ "shapes": [
  { "id": "https://www.wikidata.org/wiki/EntitySchema:Artwork",
    "expression": {
      "type": "EachOf",
      "expressions": [  
        {
          "predicate": "http://www.wikidata.org/prop/P571",
          "min": 0,
          "valueExpr": {
            "type": "NodeConstraint",
            "pattern": {
              "type": "ShapeOr",
              "shapeExprs": [
                "http://www.w3.org/2001/XMLSchema#date",
                "http://www.w3.org/2001/XMLSchema#gYearMonth",
                "http://www.w3.org/2001/XMLSchema#gYear",
              ],
              "annotations": [{
                "type": "Annotatio",
                "predicate": "http://www.wikidata.org/prop/P8397",
                "object": { "value": "YYYY(-MM(-DD))" }
              }]
            }
          }
        },
        {
          "predicate": "http://www.wikidata.org/prop/P170",
          "min": 0
        },
        {
          "predicate": "http://www.wikidata.org/prop/statement/P180",
          "valueExpr": {
            "type": "EachOf",
            "expressions":{[
              {
                "predicate": "http://www.wikidata.org/prop/P518",
                "min": 1
              }
            ]
          }
        }
      ]
    }
  }
]}

supplemented by some labels:
P571:

wd:P571 rdfs:label "inception"@en ;
	skos:prefLabel "inception"@en ;
	schema:name "inception"@en ;
	rdfs:label "date de fondation ou de création"@fr ;
	skos:prefLabel "date de fondation ou de création"@fr ;
	schema:name "date de fondation ou de création"@fr ;

etc., for P170, P180, and P518.

not expecting any user to interact with the JSON representation of schema templates (just like users do not need to interact with the JSON representation of our schemas).

I'd asked if the goal was to exchange "the mapping target but not the mapping rules" but because I wasn't able to restrain myself, I asked more questions afterwards. Is that an accurate expression of the goal, i.e. is the issue what OpenRefine's UI generates and shares with other instances of OpenRefine? If so, you of course expect that a single-issue zealot like myself will argue that it can and should use ShEx directly but I'll make that argument in a later comment.

@wetneb
Member Author

wetneb commented Aug 4, 2022

Hi @ericprud, thanks for digging in this more.

As a developer, when I choose a serialization format for some class in OpenRefine, I have the following requirements:

  • Using a well-defined serialization format, which means having a simple and short definition of what is allowed (for instance formulated in XSD or JSON Schema)
  • Reliable serialization and deserialization code, which generally means avoiding writing deserialization code myself. If we are introducing a new format (and we would need really good reasons to do so), the parser should be developed with a parser generator such as yacc.
  • Serialization code should work for all possible Java values. Deserialization code should work for all possible serialized values. The two should be inverses of each other.
  • The format should ideally be extensible: if we realize later on that we need another data field somewhere, we should be able to introduce it, while still being able to deserialize values serialized in the older format.
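
To illustrate the extensibility point only, here is a minimal Jackson-based sketch; the class name SchemaTemplate and its fields are hypothetical, not OpenRefine's actual serialization code. Unknown fields introduced by later versions are ignored, so values serialized in a newer format can still be deserialized by older code.

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical template class: tolerant of fields added in future versions.
@JsonIgnoreProperties(ignoreUnknown = true)
public class SchemaTemplate {
    private final String name;
    private final String schemaJson; // the (possibly incomplete) Wikibase schema, as JSON

    @JsonCreator
    public SchemaTemplate(
            @JsonProperty("name") String name,
            @JsonProperty("schema") String schemaJson) {
        this.name = name;
        this.schemaJson = schemaJson;
    }

    @JsonProperty("name")
    public String getName() { return name; }

    @JsonProperty("schema")
    public String getSchemaJson() { return schemaJson; }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        SchemaTemplate template = new SchemaTemplate("Artwork", "{}");
        String json = mapper.writeValueAsString(template);
        // Round trip: deserialization should be the inverse of serialization.
        SchemaTemplate back = mapper.readValue(json, SchemaTemplate.class);
        System.out.println(back.getName() + " <- " + json);
    }
}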

I am not able to satisfy those requirements with ShEx. Perhaps you are able to. If you are interested in working on this yourself, the door is open (see my reply on-wiki as well).

@ericprud

ericprud commented Aug 6, 2022

I'd be glad to pair with you. Though I'd still like to know the objective, e.g. "the mapping target but not the mapping rules".

@wetneb
Member Author

wetneb commented Aug 6, 2022

Though I'd still like to know the objective, e.g. "the mapping target but not the mapping rules".

I do not understand what you mean by that. I was hoping that the issue statement would be clear enough if you are familiar with OpenRefine, but I am not sure you have used the tool yet (in particular to do uploads to a Wikibase instance from OpenRefine)?

@ericprud

ericprud commented Aug 7, 2022

As I understand it, the wireframe in [#5103] has CSV column names across the top:
["File path name", "File name", "Creator", "Date", "License", "Other data", "Other data 2"]
and controls for the output specification arranged vertically underneath:

Statements
* inception: YYYY(-MM(-DD)) · 0 references · remove · configure · add qualifier
* source of file: type entity or drag reconciled column header · 0 references · remove · configure · add qualifier
* creator: type entity or drag reconciled column header · 0 references · remove · configure · add qualifier
* copyright status: type entity or drag reconciled column header · 0 references · remove · configure · add qualifier
* copyright licence: type entity or drag reconciled column header · 0 references · remove · configure · add qualifier
* depicts: type entity or drag reconciled column header · 0 references · remove · configure · add qualifier

Sharing a pickled form of the output specification encourages folks with diverse inputs to converge on a common output structure (as well as reducing their effort). Sharing the column mapping would be useful if different folks were likely to import data from the same input CSV schema. My question is whether the goal is to share the output schema or the mapping rules (assignment of CSV columns to output fields).

@wetneb
Member Author

wetneb commented Aug 7, 2022

My question is whether the goal is to share the output schema or the mapping rules (assignment of CSV columns to output fields).

So, if I understand you correctly, what you call "output schema" corresponds to what this issue calls "schema templates" (which do not exist yet in OpenRefine and which this issue proposes to introduce), and what you call "mapping rules" is what OpenRefine currently calls a "(Wikibase) schema", which OpenRefine users are already able to share using an in-house JSON format.

@trnstlntk
Contributor

FYI, some updates re: conversations on (not) using ShEx here: https://www.wikidata.org/wiki/Wikidata_talk:Schemas#Feedback_from_the_%22Entity_Schema_build-out_and_integration%22_task_of_WMDE's_development_plan,_and_OpenRefine

@wetneb
Member Author

wetneb commented Aug 10, 2022

Yes, and in particular, let me emphasize the following points:

We are not against a ShEx integration: if someone has an idea of what it should look like and how to implement it, they can totally come and implement it here.

On my side, I am planning to stop working on Wikibase integration (and therefore will not work on ShEx support), because I want to focus my energy on the core of OpenRefine (which will also benefit the Wikibase use cases of OpenRefine). I am available to review code and help onboard people, but will (soon) stop developing new features and fixing bugs in this part of the tool.

@ericprud

My question is whether the goal is to share the output schema or the mapping rules (assignment of CSV columns to output fields).

So, if I understand you correctly, what you call "output schema" corresponds to what this issue calls "schema templates" (which does not exist yet in OpenRefine and which this issue proposes to introduce), and what you call "mapping rules" is what OpenRefine currently calls "(Wikibase) schema", which OpenRefine users are already able to share, using a in-house JSON format.

I think that answers my question. I was trying to figure out the scope of "schema templates". Your text suggests that it's a description of the structure to be emitted by OpenRefine. What I meant by "rules" is the pairing of some column in the CSV with an output field description in the schema template; I think the Wikibase schema is more a meta-schema describing how the notional data structure described by the schema template is communicated in the Wikibase infrastructure. Again, all subject to my understanding being correct.

@wetneb
Member Author

wetneb commented Aug 17, 2022

Great! Again, if you are interested in thinking about a ShEx integration in OpenRefine, I think it is worth getting familiar with the tool itself first, as a user. There are plenty of tutorials (textual, video) for that. Once you have this familiarity, it is much easier to relate to the actual needs of the users and come up with a meaningful integration.

wetneb added a commit to wetneb/OpenRefine that referenced this issue Aug 17, 2022
Closes OpenRefine#5043.

This adds the ability to start a schema from a previously saved schema template.
A schema template is a schema with some missing values.
It is also possible to pre-define some templates in the Wikibase manifest, so that
those are added to OpenRefine when the manifest is first loaded.

I think it would be useful to add an interface to list existing schema templates, which would enable:
* deleting schema templates
* exporting a schema template to JSON (useful to add it to the manifest afterwards)
Repository owner moved this from WMF grant - to do to WMF grant - done in Structured Data on Commons Aug 31, 2022
Repository owner moved this from 🤔 To do to ✅ Done in Wikibase support 2022: file upload, custom data types Aug 31, 2022
wetneb added a commit that referenced this issue Aug 31, 2022
* Initial support for schema templates.

Closes #5043.

This adds the ability to start a schema from a previously saved schema template.
A schema template is a schema with some missing values.
It is also possible to pre-define some templates in the Wikibase manifest, so that
those are added to OpenRefine when the manifest is first loaded.

I think it would be useful to add an interface to list existing schema templates, which would enable:
* deleting schema templates
* exporting a schema template to JSON (useful to add it to the manifest afterwards)

* Small UI improvements
Projects
Status: SDC (WMF) grant 2021-22 - done