
Use wikidata to provide skos:definition to owl:Class'es #125

Open
lewismc opened this issue Jul 16, 2019 · 71 comments

Comments

@lewismc
Member

lewismc commented Jul 16, 2019

Building on #20, this issue simply aims to provide rdfs:comment (and/or skos:definition or dct:description) text for all terms.
Open tasks involve us collectively agreeing upon which vocabulary we wish to use, e.g. rdfs:comment (and/or skos:definition or dct:description), and additionally whether we manually curate the comments or automate this by fetching them from Wikipedia/DBpedia/a dictionary or elsewhere.

Any comments here?

@brandonnodnarb
Member

I have candidate term definitions for ~2K SWEET terms/classes pulled from Earth science glossaries that we can sort through, though I'm not sure of the best way to do that at present.

@lewismc
Member Author

lewismc commented Jul 16, 2019

Excellent :)

@brandonnodnarb where do they exist? Do you have them in electronic format somewhere?

At lunch, @dr-shorthair and I were discussing possibly just providing a dct:description (although that would introduce a brand new namespace into SWEET) which is essentially a link to an alternate, maintained description which exists elsewhere e.g. DBPedia, ENVO, .... The keyword here is maintained. I think it would be a bad decision right now for us to go ahead and implement a whole bunch of descriptions which exist solely within SWEET. On the other hand, if they do link to other, better defined, maintained descriptions then it would make sense to link to them.

Any comments @brandonnodnarb ?

@rduerr
Contributor

rduerr commented Jul 16, 2019 via email

@lewismc
Member Author

lewismc commented Jul 16, 2019

I completely agree @rduerr

@brandonnodnarb
Member

@lewismc these are in a spreadsheet. I'll see if I can clean it up and post it somewhere for review.

Also, I didn't think of this until you mentioned it, but another option could be to push these things to wikipedia/dbpedia, or make sure they are included and cited (and maintained) there. Hmmm...let me think about this a bit.

@lewismc
Member Author

lewismc commented Jul 17, 2019

Yes, ideally we could even get to this on Thursday as well. I think pushing to DBPedia would be an excellent idea. It would be excellent for us to re-use and/or make available as much of this as possible to the wider audience. As this is a pretty large task, the best way may in fact be the easiest way, e.g. automating pulling comments from DBPedia. A very simple example SPARQL query follows:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
select distinct ?comment where {
  <http://dbpedia.org/resource/Gazetteer> rdfs:comment ?comment .
  FILTER (lang(?comment) = 'en')
}

Of course we would merely substitute the subject IRI with whatever term we get from ESIP and then experiment with FILTER, regex or other functions.
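Automating this lookup per term would be straightforward; here is a hypothetical, stdlib-only Python sketch that just builds the request URL for DBPedia's public SPARQL endpoint (the helper name is illustrative, not an agreed pipeline):

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"  # public DBpedia SPARQL endpoint

QUERY_TEMPLATE = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?comment WHERE {{
  <{iri}> rdfs:comment ?comment .
  FILTER (lang(?comment) = 'en')
}}
"""

def comment_query_url(resource_iri: str) -> str:
    """Build a GET URL asking DBpedia for the English rdfs:comment
    of the given resource, with results returned as JSON."""
    query = QUERY_TEMPLATE.format(iri=resource_iri)
    params = {"query": query, "format": "application/sparql-results+json"}
    return ENDPOINT + "?" + urlencode(params)

url = comment_query_url("http://dbpedia.org/resource/Gazetteer")
```

Fetching that URL with urllib.request and reading the comment binding out of the JSON results would yield the literal for review before it is committed into SWEET.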
Thoughts?

@graybeal
Collaborator

From a usability standpoint, an embedded description is much nicer (because it is there in front of you), and a little more confidence-inducing, because (a) it implies the authors of the ontology (SWEET) vouch for it, (b) it is likely to be coherent with the purposes of the ontology, and (c) it is unlikely to drift without explicit reason. Add to that the opportunity for providing definitions that specifically disambiguate the term from its siblings, and fill the term space.

Some of these sources may achieve some of these goals. @lewismc, is your particular proposal to find the comments and 'bring them in', or simply to reference them in their original location? If the former, will the process be re-run every few years, or will we freeze this moment in our definitions?

Is it worth considering both a local copy of the definition and a reference to the source comment?

I'll be OK with whatever approach y'all think is reasonable and achievable. If there is decision-making involved, an ideal presentation would be a Google spreadsheet with the SWEET ontology name, term name, label, and external descriptions from whatever sources we are considering. That would make it easy to review the whole set at once as well as comment or vote on sources for particular terms, should it come to that.

@lewismc
Member Author

lewismc commented Jul 17, 2019

Hi @graybeal excellent questions, thanks for jumping in. You make some good points which I appreciate.

is your particular proposal to find the comments and 'bring them in',

If we did this, we would be essentially duplicating the content (and it would be appropriate to use one of rdfs:comment, skos:definition or dct:description). This is not to say that the things being represented are equal but merely that the way the thing is described is identical at that point in time. As you state, the actual literal values (from where they were acquired and where they exist within SWEET) will most likely diverge over time. Is this OK? It may be... but it may not be. More below...

or simply to reference them in their original location?

We could also look into using rdfs:seeAlso to address the above issue (to clarify, this issue is the divergence between the content we bring in and encode as rdfs:comment, skos:definition or dct:description, and the canonical source from which we obtained the information, e.g. that same rdfs:comment over at DBPedia). Here rdfs:seeAlso would reference the original source, e.g. http://dbpedia.org/resource/Jet_Propulsion_Laboratory, from which the rdfs:comment literal was extracted, and rdfs:comment would hold the actual literal content.
If this explanation is not clear then please let me know. An example would be as follows

Consider the following

### http://sweetontology.net/reprDataProduct/Dataset
dprepr:Dataset rdf:type owl:Class ;
rdfs:subClassOf dprepr:DataProduct ;
rdfs:label "dataset"@en .

Once the above work was done it would look as follows

###  http://sweetontology.net/reprDataProduct/Dataset
dprepr:Dataset rdf:type owl:Class ;
               rdfs:subClassOf dprepr:DataProduct ;
               rdfs:label "dataset"@en ;
               rdfs:comment "A data set (or dataset, although this spelling is not present in many contemporary dictionaries) is a collection of data. Most commonly a data set corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the data set in question. The data set lists values for each of the variables, such as height and weight of an object, for each member of the data set. Each value is known as a datum. The data set may comprise data for one or more members, corresponding to the number of rows."@en ;
               rdfs:seeAlso <http://dbpedia.org/resource/Data_set> .
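Emitting that augmented snippet for each class could then be scripted. A hypothetical stdlib-only Python sketch (the function and its arguments are illustrative; real code would also need to escape quotes inside the literal):

```python
def augment_class(prefixed_name: str, superclass: str, label: str,
                  comment: str, see_also: str) -> str:
    """Emit a Turtle fragment declaring the class with the fetched
    rdfs:comment and an rdfs:seeAlso pointing at the source."""
    pad = " " * (len(prefixed_name) + 1)  # align continuation lines
    return (
        f"{prefixed_name} rdf:type owl:Class ;\n"
        f"{pad}rdfs:subClassOf {superclass} ;\n"
        f'{pad}rdfs:label "{label}"@en ;\n'
        f'{pad}rdfs:comment "{comment}"@en ;\n'
        f"{pad}rdfs:seeAlso <{see_also}> .\n"
    )

snippet = augment_class(
    "dprepr:Dataset", "dprepr:DataProduct", "dataset",
    "A data set (or dataset) is a collection of data.",  # shortened here
    "http://dbpedia.org/resource/Data_set",
)
```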

If the former, will the process be re-run every few years, or will we freeze this moment in our definitions?

I'm not sure about this. We need to think it through.

@lewismc lewismc modified the milestones: 3.3.0, 3.4.0 Jul 18, 2019
@cmungall
Collaborator

A bit confused by this discussion, as I am used to ontologies where the definitions are authored by the developers of that ontology (sometimes adapted from an external source, with attribution). Randomly bringing in dictionary definitions could lead to incoherence, and how do we know the definitions reflect the intended meaning?

Yes, I have my handy OntoTip blog post on text definitions as well:
https://douroucouli.wordpress.com/2019/07/08/ontotip-write-simple-concise-clear-operational-textual-definitions/

Regardless of who writes them and what pipeline you use, it's super-important to track the provenance of definitions, e.g. via axiom annotations.

@graybeal
Collaborator

OK, so who is the developer of SWEET these days? (Presumably the people who are currently maintaining it?)

And how does that developer now create appropriate definitions, if not by referencing existing expertise?

@rduerr
Contributor

rduerr commented Aug 12, 2019

OK, so for the developer question, I do think ENVO's micro-citation approach is useful. In other words, if I make a change to a term (any change) I annotate the change with my ORCID. I also like using DBXREFs to cite the original source of the definitions. I haven't looked at DBPedia; but perhaps that should be where I dump all the GCW terms and definitions?

While I do like having embedded definitions, I really hate the idea of having to update the same definition in more than one place. Would having them in DBPedia help with this problem?

Also, I note that all the cryospheric terms, definitions, and sources for those can be provided as a csv file if that is helpful.

Thoughts?

@lewismc
Member Author

lewismc commented Aug 15, 2019

@rduerr

Would having them in DBPedia help with this problem?

Yes it would, we would then look at the comment over in the DBPedia resource and determine whether we want a hard mapping.

@cmungall thanks for chiming in. I agree with @graybeal here: in response to your statement

...where the definitions are authored by the developers of that ontology

that is essentially us. Raskin et al. never added simple labels or verbose descriptions, so it is down to us to annotate and contextualize whatever we feel is necessary.

IMHO DBPedia is the best resource I've come across where we can leverage existing knowledge. We can even do this one Class at a time with one pull request. Then every one of the proposed augmentations could be scrutinized.

Does this sound logical or is it way off?

@graybeal
Collaborator

I'm wrestling with implications here, mostly because these external definitions are not versioned, are they? So please pardon my TLDR comments.

If we embed (copy) a definition we are then claiming it as our own, and ours won't track any changes to the original source (which may be the best thing); or if it does track changes, we'll have an ongoing monitoring task. In any case yes we'll need to evaluate each one.

If we link to definitions that live elsewhere, we still have the monitoring issue (what if that definition changes enough to make it wrong for SWEET?). And if we make the link a hard one (along the lines of sameAs or exactMatch) we are effectively claiming it as our own, and therefore still have to track any changes made to the original to see if we agree. So we're effectively back at the first option.

I don't think we can support either of these approaches, even if we could create a great first version. And SWEET is not an authoritative real-world model that can be used for detailed reasoning about the world, and we can't pretend we will be able to come up with all-knowing definitions for these terms. It makes more sense to me to give people pointers to helpful information, and maintain SWEET as a relatively minimalist description of these earth science concepts.

So I think it would be best to have the definitions be notional, not authoritative. The relationship would then be 'notionallyDescribedBy', or better words to that effect, and there could be several of them, even with some contradictions between them. This best reflects the real world of SWEET in my opinion.

With that approach they could be either embedded (with the definitions sourced in the provenance, and updated automatically from the original content); or referenced remotely (though that makes SWEET less handy to use).

I'd prefer the embedded option, where multiple embedded definitions have been pulled from other sources (with date, source citation, and process citation). That follows best practices as far as I'm concerned.

@rduerr
Contributor

rduerr commented Aug 15, 2019

@pbuttigieg @cmungall Your take on this?

@dr-shorthair
Collaborator

dr-shorthair commented Aug 15, 2019

How about

###  http://sweetontology.net/reprDataProduct/Dataset
dprepr:Dataset rdf:type owl:Class ;
               rdfs:subClassOf dprepr:DataProduct ;
               rdfs:label "dataset"@en ;
               skos:definition  [ 
                   rdfs:comment "A data set (or dataset, although this spelling is not present in many contemporary dictionaries) is a collection of data. Most commonly a data set corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the data set in question. The data set lists values for each of the variables, such as height and weight of an object, for each member of the data set. Each value is known as a datum. The data set may comprise data for one or more members, corresponding to the number of rows."@en ;
                   dct:source <http://dbpedia.org/resource/Data_set> ;
                   dct:created "2019-08-16T11:35:21.06Z"^^xsd:dateTimeStamp ;
             ] .

The range of skos:definition is rdfs:Resource. This gets you the text locally along with the citation and the date it was copied. Of course the downside is that it's now a property path skos:definition/rdfs:comment rather than just a simple property, but the complexity is no more than the problem being solved.
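For concreteness, reading the text back out under this pattern needs the two-step path; a hypothetical query, held in a Python string purely so it can be sanity-checked:

```python
# Hypothetical SPARQL illustrating the property-path cost of the
# blank-node pattern: the text now sits behind
# skos:definition/rdfs:comment rather than a single property.
# dcterms:source is the same DC Terms property as dct:source above.
PATH_QUERY = """
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?text ?source WHERE {
  <http://sweetontology.net/reprDataProduct/Dataset> skos:definition ?def .
  ?def rdfs:comment ?text .
  OPTIONAL { ?def dcterms:source ?source }
}
"""
```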

@lewismc
Member Author

lewismc commented Aug 15, 2019

This makes sense to me.
I could easily code up something which opens a new pull request for every hit that we get from DBPedia.
Let's see if we can get any more consensus...

@graybeal
Collaborator

graybeal commented Aug 15, 2019

Ignoring the temptation to comment on the definition :-), I like this. Presumably there can be multiple definitions, which I think is helpful to prevent people from trying to "reason over the definitions" (or argue over the definitions, equally to the point). Good general-purpose definitions are very hard to build, so most aren't that good; the meaning is in the interplay of definitions.

In the interest of rigor, can the date be an ISO 8601 date+time+time zone? How does RDFS feel about (read: tolerate) that format?

@lewismc What about doing all the pull requests automatically in a branch, then pushing them all to a Google table (or similar) for easier review/comment? (a) You don't want to give someone carpal tunnel approving pull requests, and (b) the likelihood should significantly favor acceptance, with a definition rejected only if there is agreement that it is clearly unacceptable or represents a different concept. (And in the former case, where it's just a poor definition, the disapproval could be represented by annotating the definition, rather than by not including it.) Some system to keep track of the issues and rejections for future updates would be very helpful to minimize future maintenance costs. But treating this as a "handy dandy reference" rather than a rigorous definition means reviews could be pretty superficial, just: Is it the right concept or the wrong concept?

@cmungall
Collaborator

cmungall commented Aug 19, 2019 via email

@brandonnodnarb brandonnodnarb pinned this issue Oct 4, 2019
@brandonnodnarb brandonnodnarb unpinned this issue Nov 13, 2019
@lewismc lewismc added this to the 3.5.0 milestone Nov 21, 2019
@lewismc
Member Author

lewismc commented Mar 13, 2020

Would it be worthwhile to put the terms and schema:descriptions in a spreadsheet for us to review?

I could do this... however, mapping from those determinations back to the correct logic for the pull request cannot be automated. We are talking about thousands of labels for which we are looking to obtain definitions. I completely understand why manual curation is your desire... but once that spreadsheet is populated, I am left to manually map all of those decisions back into source code... that would take me literally years...
I'm kinda torn here...

@wdduncan

Maybe a statistical sampling would work?
Perhaps you could put the content in a triple store, and then use a delete/update query to change the annotations?
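A delete/update query along those lines might look like this; a hypothetical sketch, with the query held in a Python template (the class IRI and replacement text are illustrative):

```python
# Hypothetical SPARQL UPDATE: swap any existing English rdfs:comment
# on a class for a reviewed definition, per the triple-store suggestion.
UPDATE_TEMPLATE = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
DELETE {{ <{cls}> rdfs:comment ?old }}
INSERT {{ <{cls}> rdfs:comment "{text}"@en }}
WHERE  {{
  OPTIONAL {{ <{cls}> rdfs:comment ?old . FILTER(lang(?old) = 'en') }}
}}
"""

update = UPDATE_TEMPLATE.format(
    cls="http://sweetontology.net/reprDataProduct/Dataset",
    text="A collection of data.",
)
```

If no English comment exists yet, the OPTIONAL leaves ?old unbound, the DELETE template matches nothing, and the INSERT still adds the new literal.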

@lewismc
Member Author

lewismc commented Mar 13, 2020

Thanks for the suggestions @wdduncan. It's looking more and more like this is not going to be a reproducible process... which is what I would have preferred.
I'll work on pulling together the pull request first (we can always close it without merging into master) such that we can evaluate the size of the peer review process.
After that I'll revisit your comment.

@smrgeoinfo
Collaborator

smrgeoinfo commented Mar 16, 2020

Seems like we could put the schema:description content from Wikidata in rdfs:comment, and then we wouldn't have to import another big namespace, especially considering that the content of Wikidata descriptions is likely to be pretty heterogeneous. If we're going to add definitions, I'd prefer using skos:definition directly or one of the approaches outlined above (#125 (comment), #125 (comment)).

@lewismc lewismc changed the title Provide rdfs:comment (and/or skos:definition or dct:description) text to all terms Use wikidata to provide skos:definition to owl:Class'es Jul 17, 2020
@dr-shorthair
Collaborator

dr-shorthair commented Aug 24, 2021

Reviving this issue following the SemTech meeting today:

here is a preliminary pattern for coordinating labels and definitions harvested from multiple external sources -

###  http://sweetontology.net/reprDataProduct/Dataset
dprepr:Dataset rdf:type owl:Class ;
               rdfs:subClassOf dprepr:DataProduct ;
               rdfs:label "dataset"@en ; # Up to here, this is from SWEET, the following is pulled from alternative external sources
               skos:definition  [ 
                   skos:definition "A data set (or dataset, although this spelling is not present in many contemporary dictionaries) is a collection of data. Most commonly a data set corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the data set in question. The data set lists values for each of the variables, such as height and weight of an object, for each member of the data set. Each value is known as a datum. The data set may comprise data for one or more members, corresponding to the number of rows."@en ;
                   skos:prefLabel "Data set"@en ;
                   dcterms:source <http://dbpedia.org/resource/Data_set> ;
                   dcterms:created "2019-08-16"^^xsd:date ;
             ]  ;
               skos:definition  [ 
                   skos:definition "collection of data"@en ;
                   skos:prefLabel "data set"@en ;
                   dcterms:source <https://www.wikidata.org/wiki/Q1172284> ;
                   dcterms:created "2021-08-25"^^xsd:date ;
             ]  ;
               skos:definition  [ 
                   skos:definition  "A collection of data, published or curated by a single agent, and available for access or download in one or more representations"@en ;
                   skos:prefLabel "Dataset"@en ;
                   dcterms:source <https://www.w3.org/TR/vocab-dcat/#Class:Dataset> ;
                   dcterms:created "2021-08-25"^^xsd:date ;
             ]  ;
.

The pattern above is not necessarily the ultimate solution. But it records separate labels and definitions - here embedded in blank-nodes - clearly identified by source, so data formatted this way could be transformed to another pattern with SPARQL queries if another pattern is preferred.

@rduerr does this help?

@rduerr
Contributor

rduerr commented Aug 24, 2021

Yup - absolutely!!!

@graybeal
Collaborator

@dr-shorthair why would you use dct:created instead of dcterms:created? I thought a best practice was to use dcterms for everything because dct was so, you know … old.

@dr-shorthair
Collaborator

My mistake. dct: and dcterms: have both been used as the prefix for DC Terms. I have transitioned to using the latter because it won't be confused with dctype: which is a different namespace. It was just a copy-paste-failed-to-update-it-all error. I've edited and fixed now.

@wdduncan

@dr-shorthair sorry ... I haven't been able to make the calls in a while. But I am not sure what you are trying to model by nesting the skos:definitions. Using Protege, you can annotate a class with multiple definitions; the "@" in the upper right-hand corner allows you to put annotations on the definitions. Here is a simple demo of what it looks like.

[screenshot: Protege annotation editor showing a class with multiple skos:definition annotations]

If you are interested in the turtle it looks like this:

:complex_object  a       owl:Class ;
        rdfs:label       "complex object" ;
        rdfs:subClassOf  :object ;
        skos:definition  "An object with more than one component." ,
                         "An object with many parts." .

[ a                      owl:Axiom ;
  dcterms:created "2020-07-02" ;
  dcterms:creator "William Duncan" ;
  owl:annotatedProperty  skos:definition ;
  owl:annotatedSource    :complex_object ;
  owl:annotatedTarget    "An object with many parts."
] .


[ a                      owl:Axiom ;
  dcterms:created "2021-01-01" ;
  dcterms:creator "Bill Duncan" ;
  owl:annotatedProperty  skos:definition ;
  owl:annotatedSource    :complex_object ;
  owl:annotatedTarget    "An object with more than one component."
] .

You can, of course, add other annotation properties like you have in the example above.
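Reading the definitions back out with their provenance under this pattern is then a straightforward query; a hypothetical sketch (held in a Python string; it assumes the example's ':' prefix is declared):

```python
# Hypothetical SPARQL for retrieving definitions together with their
# axiom-level provenance under the OWL reification pattern above.
AXIOM_QUERY = """
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?definition ?creator ?created WHERE {
  ?ax a owl:Axiom ;
      owl:annotatedSource   :complex_object ;
      owl:annotatedProperty skos:definition ;
      owl:annotatedTarget   ?definition .
  OPTIONAL { ?ax dcterms:creator ?creator }
  OPTIONAL { ?ax dcterms:created ?created }
}
"""
```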

@dr-shorthair
Collaborator

@wdduncan as I hinted in my comment after the example, the exact pattern is not final. And since these are just annotations it is not really important.

I was merely suggesting that, under the agreed plan to reimagine SWEET as a compilation of textual definitions, (i) label, (ii) textual definition, (iii) source and (iv) date are probably the minimum items needed to usefully associate each external definition with a SWEET class. I'm not wedded to any particular model.

(I do not routinely use Protege, so am not bound to its OWLy view of the world.)

@wdduncan

@dr-shorthair sorry, I didn't catch your drift at the end of the example :)

I think what I proposed satisfies your criteria. I understand that not everyone uses Protege, but it is good to stay within the OWLy realm if possible.

@dr-shorthair
Collaborator

This is basically a reification pattern. The downside is that a label and definition from the same source would appear in separate axioms. They can be linked through having the same source and date, but a small overhead.

@wdduncan

Yes. It is the OWL-reification pattern.

They can be linked through having the same source and date, but a small overhead.

Sorry, I'm not following this point. In the turtle, the axioms are tied to the class via the owl:annotatedSource predicate:

owl:annotatedSource    :complex_object ; 

@dr-shorthair
Collaborator

If you look up at my original proposal, the label and definition from a single source are part of the same object, rather than being in two separate objects.

@wdduncan

Yes. I noticed that. However, I think it better to stick with a well defined standard rather than making up a new way to do it.

@nicholascar
Collaborator

So that’s what the Protege @ is for! A decade of using Protege (sporadically) and I never knew…

Despite my happiness at learning the above, I think that where a “pure RDF” pattern and an OWL pattern can both be used to communicate something, the pure RDF pattern should be preferred, unless the OWL pattern is in wide use and this can be demonstrated, which is a step beyond just that the pattern is available in “a well defined standard” (OWL).

Having said that, it’s a bit cheeky to use skos:definition twice over, @dr-shorthair! That looks like the schema.org Qualified Relations pattern, but the sense in which the property is used changes because you include not just provenance properties (who and when) but also a prefLabel, which then changes what is being defined. Could some other property be used for the first skos:definition? If not, there may have to be a rule that no properties other than provenance annotations are allowed in the BN.

@dr-shorthair
Collaborator

Well we could stay within the DC world this way

###  http://sweetontology.net/reprDataProduct/Dataset
dprepr:Dataset rdf:type owl:Class ;
               rdfs:subClassOf dprepr:DataProduct ;
               rdfs:label "dataset"@en ; # Up to here, this is from SWEET, the following is pulled from alternative external sources
               skos:definition  [ 
                   dcterms:description "A data set (or dataset, although this spelling is not present in many contemporary dictionaries) is a collection of data. Most commonly a data set corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the data set in question. The data set lists values for each of the variables, such as height and weight of an object, for each member of the data set. Each value is known as a datum. The data set may comprise data for one or more members, corresponding to the number of rows."@en ;
                   dcterms:title "Data set"@en ;
                   dcterms:source <http://dbpedia.org/resource/Data_set> ;
                   dcterms:created "2019-08-16"^^xsd:date ;
             ]  ;
               skos:definition  [ 
                   dcterms:description "collection of data"@en ;
                   dcterms:title "data set"@en ;
                   dcterms:source <https://www.wikidata.org/wiki/Q1172284> ;
                   dcterms:created "2021-08-25"^^xsd:date ;
             ]  ;
               skos:definition  [ 
                   dcterms:description "A collection of data, published or curated by a single agent, and available for access or download in one or more representations"@en ;
                   dcterms:title "Dataset"@en ;
                   dcterms:source <https://www.w3.org/TR/vocab-dcat/#Class:Dataset> ;
                   dcterms:created "2021-08-25"^^xsd:date ;
             ]  ;
.

But I think @wdduncan is commenting on the fact that the blank-nodes are untyped - just a bag of properties, if you like. Fair enough.

OTOH, are owl:Axiom resources found much in the wild? It is really just another meta-model, with very weak semantics in itself.

@cmungall
Collaborator

I am not sure how much it's used outside the biosciences, but it's very widely used in OBO. Many of our ontologies have detailed axiom-level provenance. I will hopefully have a blog post about it in this series soon: https://douroucouli.wordpress.com/2020/09/11/edge-properties-part-1-reification/

@cmungall
Collaborator

I'd say there are zero semantics to owl:Axiom rather than weak semantics.

@wdduncan

wdduncan commented Aug 27, 2021

@dr-shorthair I understand that what you are proposing is completely valid in the SKOS world. The notes in the SKOS reference state:

Note that no domain is stated for the SKOS documentation properties. Thus, the effective domain for these properties is the class of all resources (rdfs:Resource). Therefore, using the SKOS documentation properties to provide information on any type of resource is consistent with the SKOS data model.

However, when users query for definitions, they will receive a complex object instead of text, and will then have to further process the complex object if they are only interested in the text of the definition.

In my experience, when I query for definitions, I am looking for (or expecting) text, and I am not interested in the provenance of the definition. If I am interested in the provenance, then I do a different query. But perhaps that is only my experience.

For what it is worth, when I encounter skos:definitions they have text values, but (again) that is just my experience.

@smrgeoinfo
Collaborator

If a SWEET class has a collection of definitions, then the class might represent different concepts (when the definitions are not consistent). If each definition might also have a distinct label, then the SWEET class does not represent a 'word' (lexical item) either. So what does the SWEET class represent?

I think that under the proposal adopted in #211 there must be only one prefLabel associated with the SWEET class, making it essentially a dictionary: a mapping between a lexical item (word) and possible meanings. altLabels might be associated with specific definitions, but there would need to be some clear logic on when the 'label' is different enough that it should be a different SWEET word class.

Use of blank nodes for the definitions is also problematic: what if someone wants to link to a particular definition?

@rduerr
Contributor

rduerr commented Sep 13, 2021

See discussion https://github.com/ESIPFed/sweet/discussions/259 for a proposal on this topic.

As for blank nodes within these definition blocks, does anyone want to tackle a proposal for that? I'd like to move forward on this stuff!!!

@lewismc
Member Author

lewismc commented Sep 17, 2021 via email

@brandonnodnarb brandonnodnarb modified the milestones: 3.5.0, 3.6.0 Jul 12, 2022