Add lexical matching to standard SOP for New Ontology Requests #2517

matentzn · 2024-01-24T09:16:46Z

@cthoyt wrote this script to check overlap with existing OBO ontologies; I think we should fold this into our NOR SOP asap.

I wrote a script that checks for lexical matches between this proposal and existing ontologies in the OBO Foundry. Note that OBO Principle 10 "Commitment To Collaboration" states:

It is expected that Foundry ontologies will collaborate with other Foundry ontologies, particularly in ensuring orthogonality of distinct ontologies, in re-using content from other ontologies in cross-product definitions where appropriate, and in establishing and evolving Foundry principles to advance the Foundry suite of ontologies to better serve the joint users.

Below are the results. A case can be made that it's okay to duplicate NCIT terms since this is just an obo export of a resource that does not actually participate in the open community.

Lexical matching returned results

CAROLIO:0000411 mild pain
- ncit:C136549 Neck Pain Score 2 (0.54)
CAROLIO:0000412 moderate pain
- ncit:C121394 Moderate Extremity Pain (0.54)
- ncit:C136551 Neck Pain Score 4 (0.54)
CAROLIO:0000413 no pain
- ncit:C119987 Had No Pain (0.54)
- ncit:C121390 No Extremity Pain (0.54)
- ncit:C136547 Neck Pain Score 0 (0.54)
CAROLIO:0000414 severe pain
- ncit:C121395 Severe Extremity Pain (0.54)
- ncit:C136553 Neck Pain Score 6 (0.54)
CAROLIO:0001000 caroli syndrome
- doid:0081394 Caroli syndrome (0.772)
- mondo:0018808 Caroli syndrome (0.772)
CAROLIO:0003100 endoscopic treatment
- ncit:C16546 Endoscopic Procedure (0.54)
CAROLIO:0003120 endoscopic retrograde cholangiopancreatography
- maxo:0035049 endoscopic retrograde cholangiopancreatography (0.778)
- ncit:C16430 Endoscopic Retrograde Cholangiopancreatography (0.762)
CAROLIO:0003200 interventional radiology procedure
- ncit:C63334 Interventional Radiology Procedure (0.762)
CAROLIO:0003210 locoregional therapy
- ncit:C25388 Local-Regional (0.54)
- ncit:C94796 Locally Recurrent Malignant Neoplasm (0.54)
CAROLIO:0003220 paracentesis
- maxo:0035106 paracentesis (0.778)
- ncit:C15310 Paracentesis (0.762)
CAROLIO:0003250 transjugular intrahepatic portosystemic shunt
- ncit:C126288 Transjugular Intrahepatic Portosystemic Shunt (0.762)
CAROLIO:0003300 pharmaceutical treatment
- maxo:0000058 pharmacotherapy (0.556)
CAROLIO:0003310 antibiotic treatment
- ncit:C258 Antibiotic (0.762)
- chebi:33281 antimicrobial agent (0.556)
- xco:0000482 antimicrobial agent (0.556)
CAROLIO:0003320 antiemetic treatment
- chebi:50919 antiemetic (0.778)
- xco:0001245 antiemetic (0.778)
- ncit:C267 Antiemetic Agent (0.556)
CAROLIO:0003330 bile acid treatment
- chebi:3098 bile acid (0.778)
- chebi:22868 bile salt (0.549)
- ncit:C74800 Bile Acid Measurement (0.54)
CAROLIO:0003340 chemotherapy
- maxo:0000647 chemotherapy (0.778)
- ncit:C15632 Chemotherapy (0.762)
CAROLIO:0003350 diuretics treatment
- chebi:35498 diuretic (0.778)
- xco:0000122 diuretic (0.778)
- ncit:C448 Diuretic (0.762)
CAROLIO:0003360 octreotide treatment
- chebi:7726 octreotide (0.778)
- ncit:C711 Octreotide (0.762)
CAROLIO:0003370 proton pump inhibitor treatment
- xco:0000577 proton pump inhibitor (0.778)
- ncit:C29723 Proton Pump Inhibitor (0.762)
- chebi:49200 EC 3.6.3.10 (H(+)/K(+)-exchanging ATPase) inhibitor (0.556)
CAROLIO:0003380 pruritus treatment
- hp:0000989 Pruritus (0.762)
- ncit:C3344 Pruritus (0.762)
- scdo:0000935 Pruritus (0.762)
- symp:0000432 itching (0.556)
- ncit:C58006 Pruritus, CTCAE (0.54)
CAROLIO:0003400 radiation therapy
- maxo:0000014 radiation therapy (0.778)
- ncit:C15313 Radiation Therapy (0.762)
CAROLIO:0003500 surgical treatment
- ncit:C15329 Surgical Procedure (0.54)
CAROLIO:0003510 organ transplant
- ncit:C122934 Organ Graft (0.54)
CAROLIO:0003520 roux-en-y
- ncit:C51756 Roux-en-Y Anastomosis (0.549)
CAROLIO:0003530 surgical resection
- maxo:0000448 surgical resection (0.778)
- ncit:C158758 Resection (0.54)

Lexical matching returned no results

CAROLIO:0000400 value partition
CAROLIO:0000410 pain scale
CAROLIO:0000420 symptom recurrence status
CAROLIO:0000421 non-recurrent symptom status
CAROLIO:0000422 recurrent symptom status
CAROLIO:0002000 variceal bleeding
CAROLIO:0003110 endoscopic band ligation
CAROLIO:0003121 biliary drainage
CAROLIO:0003122 biliary dilatation
CAROLIO:0003123 biliary stent placement
CAROLIO:0003124 gallstones removal
CAROLIO:0003230 percutaneous aspiration and drainage
CAROLIO:0003240 percutaneous transhepatic cholangiogram

However, these overlap heavily with MAXO and SYMP/HP, and should be considered for submission there.

Originally posted by @cthoyt in #2406 (comment)


addiehl · 2024-02-06T17:30:55Z

While many of these matches make sense, some are totally off, such as
CAROLIO:0003210 locoregional therapy
ncit:C25388 Local-Regional (0.54)
ncit:C94796 Locally Recurrent Malignant Neoplasm (0.54)

Thus, this should not be an automated review for the dashboard, but rather presented to the ontology submitter for their review, so that they can be encouraged to import classes rather than recreate them, where appropriate.

pfabry · 2024-02-08T14:40:57Z

@cthoyt
Could you please run the script for the other new ontologies, namely aFPO, GALLONT, and LSDAO? In addition, NCIT could be removed from the matches.
Thanks!

matentzn · 2024-02-08T14:43:34Z

Could you please run the script for other new ontologies?

@cthoyt so that this is not on your plate, can you share the script so we can set up a GitHub Action to do this?

cmungall · 2024-02-08T15:09:49Z

I’m wary of this approach. We should at least accept that accuracy will be wildly variable depending on multiple arbitrary factors. There will be many, many false negatives because reasons. It will take deep OBO knowledge to make the results actionable (example: many new ontologies will have concepts in OMIT. What then?). I do, however, think there are opportunities for LLMs to help with initial triage.


matentzn · 2024-02-08T15:21:00Z

Our thinking so far is this:

1. It is better to do it somewhat approximately than not to do it. A lot of matches reveal a pattern.
2. The burden is on the submitters. They see the match, they check it and say: "this is not the same thing".
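
To illustrate the "a lot of matches reveal a pattern" point, here is a minimal sketch that counts hits per target ontology prefix from a results listing in the format posted at the top of this thread. The function name `prefix_counts` is made up for illustration and is not part of the script discussed here.

```python
from collections import Counter


def prefix_counts(result_lines):
    """Count lexical hits per target ontology prefix.

    Assumes the listing format used in this thread: a term line,
    followed by hit lines that start with "- <curie> <label> (<score>)".
    """
    counts = Counter()
    for line in result_lines:
        line = line.strip()
        if line.startswith("- "):
            curie = line[2:].split()[0]           # e.g. "maxo:0035106"
            counts[curie.split(":", 1)[0]] += 1   # e.g. "maxo"
    return counts


# A few lines copied from the results above
sample = """\
CAROLIO:0003220 paracentesis
- maxo:0035106 paracentesis (0.778)
- ncit:C15310 Paracentesis (0.762)
CAROLIO:0003340 chemotherapy
- maxo:0000647 chemotherapy (0.778)
- ncit:C15632 Chemotherapy (0.762)
CAROLIO:0003350 diuretics treatment
- chebi:35498 diuretic (0.778)
""".splitlines()

counts = prefix_counts(sample)
for prefix, n in counts.most_common():
    print(prefix, n)
```

A per-ontology tally like this only hints at where to look; the submitter still has to check each hit.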

Re OMIT, I cannot say anything. We could push our "foundry status" back and define it as "passing the dashboard" - and only match against these. Just spitballing.

cmungall · 2024-02-08T15:49:29Z

Re OMIT, I cannot say anything. We could push our "foundry status" back and define it as "passing the dashboard" - and only match against these. Just spitballing.

You're wanting a deterministic quantitative solution where some qualitative aspect is required. FMA may fail the dashboard but we'd still want to know if a new ontology had massively overlapping content. There is probably a way to get OMIT to pass the dashboard but that doesn't solve the problem of its massively out of scope content.


matentzn · 2024-02-08T16:08:33Z

This is a bit of a complicated criticism.

(1) We do not want to let the 4th ontology defining Alzheimer's and glucose into OBO Foundry Ontology Library
(2) We have no good way to separate GUOBO modules (components of the Grand Unified OBO Ontology) from Application/Project ontologies.

We will not be able to define "COB-Branch owning" ontologies in any reasonable timeframe. However, we could, possibly, use "bottom-up" COB mapping curation here to say: For new ontologies, only matches against ontologies mapped in COB are relevant. This is a bit shady, to be clear (not unreasonable, just a bit shady), as we refer from one system (OBO Library membership) to another (COB mappings), but I would be ok with that as well.

But IMO the need to achieve (1) outweighs all other concerns you raised. We can get a touch of "qualitative" in there by adding an SOP step under which the ontology reviewer can apply judgement on whether some of these matches are blocking or not.

What is the alternative?

cmungall · 2024-02-08T16:18:49Z

This can be seen as a variant or subtype of the ontology recommendation problem (cc-ing Marcos by email as I don't know your github username!) https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-017-0128-y


matentzn · 2024-02-08T16:57:11Z

Well, I don't really think this is quite the same. We are looking for a way to ensure that ontologies wanting to join the OBO Foundry do not significantly overlap with (key?) ontologies in the OBO Foundry. Label matching is the only way I can think of right now to at least get started with this. What is the principle we should derive from a recommendation approach? I need to know at least how we should act in the short term: is your preference that, for the 5 ontologies currently under revision, we do not run lexical matching? If so, what exactly is your suggestion?

cmungall · 2024-02-14T20:08:56Z

Here is an example of the problem

#2522 (comment)

This would not have been found by text mining. But clearly there needs to be coordination between these two ontologies. The correct way to do this is by looking at the scope of the new ontology and the scope of existing ontologies. Currently this takes some very minimal knowledge of what is in OBO (I am not sure why we aren't doing this). This part could easily be semi-automated by e.g. LLMs. But frankly everyone reviewing ontologies should be aware of the scope of different ontologies in OBO, especially widely used ones like PO.

cthoyt · 2024-02-20T16:55:19Z

@matentzn I have a new repo where I am building up and storing various lexical indexes. It now has a pre-built one for OBO to make this much more user friendly (so you do not have to parse the resources yourself).

https://github.com/biopragmatics/biolexica/tree/main/lexica/obo

matentzn · 2024-02-21T19:49:22Z

Thank you @cthoyt - I am not too sure, though, what Chris's position is here :D As soon as there is some agreement somewhere, I will assign someone to work on this!

cthoyt · 2024-02-21T22:55:44Z

There doesn't necessarily have to be any agreement anywhere. OBO reviews are open and I can always re-run my script for each new ontology and post the results to the issue thread (I already sort of automated it). Any requester who disregards reasonable suggestions from this process has other bigger problems.

matentzn · 2024-02-22T08:34:29Z

@cthoyt I made a case at the last call that the OBO New Ontology Request Manager should be running your script as part of the official pipeline, so that you don't have to distribute your attention too much. If, while I am trying to make this overlap checking more official, you could keep making these "overlap" posts and add a sentence:

This is (also) for the OBO Ontology Reviewer to assess overlap between the proposed ontology and existing ontologies.

To make clear that the reviewer should actually consider this, I would be grateful!

Thanks a ton!

jonquet · 2024-02-22T13:03:46Z

Hello @matentzn @cmungall , all...
As usual, I just quickly went through the issue...

I am assuming you're familiar with the functionality in BioPortal (and OntoPortal) that automatically computes the "lexical matches" (using LOOM) with all the other ontologies in the portal. I do not often promote this as a "mapping" feature (because we all know lexical matches are very limited), but I often argue that OntoPortal is the only place where, when one drops in an ontology, he/she gets an automatic lexical overlap with all the other ontologies within the next hour.

Could it help you address your need here?
Would it make sense to build on this feature and improve it?

matentzn · 2024-02-22T13:07:44Z

@jonquet thanks for chiming in. The real problem lies in the fact that the ontologies we need to check are not loaded in any indexed infrastructure (including BioPortal). @cthoyt's idea is to basically have one massive lexical index covering all of OBO dumped, and have a script quickly compare an incoming ontology with that index. I am not sure if BioPortal should be covering this specific use case, as the use case is primarily concerned with ontologies outside of BioPortal.
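
The index-then-compare idea can be sketched roughly as follows. This is not @cthoyt's script and does not reproduce its scores; it is a minimal illustration that assumes exact matching on normalized labels, with hypothetical helper names (`normalize`, `build_index`, `match`) and toy data copied from the results posted earlier in this thread.

```python
import re
from collections import defaultdict


def normalize(label: str) -> str:
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", label.lower()).split())


def build_index(terms):
    """Map each normalized label to the CURIEs of existing terms carrying it."""
    index = defaultdict(list)
    for curie, label in terms:
        index[normalize(label)].append(curie)
    return index


def match(incoming, index):
    """Return {incoming CURIE: [existing CURIEs]} for exact normalized-label hits."""
    hits = {}
    for curie, label in incoming:
        found = index.get(normalize(label))
        if found:
            hits[curie] = list(found)
    return hits


# Stand-in for a pre-built index over all of OBO (toy data only)
existing = [
    ("mondo:0018808", "Caroli syndrome"),
    ("doid:0081394", "Caroli syndrome"),
    ("maxo:0035106", "paracentesis"),
]
# Stand-in for the labels of an incoming ontology
incoming = [
    ("CAROLIO:0001000", "caroli syndrome"),
    ("CAROLIO:0003110", "endoscopic band ligation"),
]

index = build_index(existing)
hits = match(incoming, index)
print(hits)
```

Because the index is built once and looked up by key, each incoming ontology can be checked without re-parsing the existing ontologies. Exact label matching will miss synonyms and near matches, which is one source of the false negatives raised earlier in the thread.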

matentzn · 2024-02-29T16:13:07Z

@pfabry cc @OBOFoundry/obo-foundry-operations-committee

#2522 (comment)

While I think there is value in doing lexical matching to assess overlap, I agree with Chris that it should be used a bit more wisely. The lexmatch should provide some evidence of non-reuse. But this is not just about using IRIs of existing ontologies. This should result in questions such as:

- GALLONT seems to define characteristics. Are these aligned with PATO? Why are they not added in PATO?
- Should we really tell people if there is overlap with BTO, NCIT? These are all ontologies that contain everything under the sun.

I would suggest if Paul continues creating these, that we:

1. Create an exclude list of ontologies including BTO, NCIT, perhaps OMIT
2. Not including matching results from these ontologies
3. Add a note to the comment by the NOR reviewer that "These matches are only for indication and do not constitute a formal part of the review. The ontology reviewer may refer to this information to illustrate patterns where re-use could be improved".
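
As a rough sketch of how these three steps could be applied to a set of results before posting: the exclude list below is illustrative only (its contents are still under discussion), and the function names `filter_matches` and `render` and the data layout are assumptions, not part of the existing script.

```python
# Illustrative exclude list; which ontologies belong here is not settled
EXCLUDED_PREFIXES = {"bto", "ncit", "omit"}

NOTE = (
    "These matches are only for indication and do not constitute a formal "
    "part of the review. The ontology reviewer may refer to this information "
    "to illustrate patterns where re-use could be improved."
)


def filter_matches(matches, excluded=EXCLUDED_PREFIXES):
    """Drop hits from excluded ontologies, and terms left with no hits."""
    kept = {}
    for term, hits in matches.items():
        remaining = [
            (curie, label, score)
            for curie, label, score in hits
            if curie.split(":", 1)[0].lower() not in excluded
        ]
        if remaining:
            kept[term] = remaining
    return kept


def render(matches):
    """Format the matches like the listings in this thread, followed by the note."""
    lines = []
    for term, hits in matches.items():
        lines.append(term)
        lines.extend(f"- {curie} {label} ({score})" for curie, label, score in hits)
    lines.append("")
    lines.append(NOTE)
    return "\n".join(lines)


# Two entries copied from the results at the top of this thread
matches = {
    "CAROLIO:0003220 paracentesis": [
        ("maxo:0035106", "paracentesis", 0.778),
        ("ncit:C15310", "Paracentesis", 0.762),
    ],
    "CAROLIO:0003200 interventional radiology procedure": [
        ("ncit:C63334", "Interventional Radiology Procedure", 0.762),
    ],
}

filtered = filter_matches(matches)
print(render(filtered))
```

Terms whose only hits come from excluded ontologies disappear from the listing entirely, so the reviewer may still want to know how many terms were filtered out.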

pfabry · 2024-02-29T19:00:20Z

@matentzn

I would suggest if Paul continues creating these, that we:

1. Create an exclude list of ontologies including BTO, NCIT, perhaps OMIT

2. Not including matching results from these ontologies

3. Add a note to the comment by the NOR reviewer that "These matches are only for indication and do not constitute a formal part of the review. The ontology reviewer may refer to this information to illustrate patterns where re-use could be improved".

I agree that the lexical match could be a valuable informative tool but should be used with caution as it is "only" a lexical match. I agree with the 3 propositions, but I think the lexical match could be done even earlier, at the pre-registration checklist stage.
One of the checks is the following:

For every term in my ontology, I checked whether another OBO Foundry ontology has one with the same meaning. If so, I re-used that term directly (not by cross-reference, by directly using the IRI).

Of course, label != meaning, but the lexical match could provide a general overview for the submitter.

@cthoyt

Thank you VERY much for the script. However, while I have been able to use it for GALLONT, I can't make it work for LSDAO and I really don't know why. I created an issue about this.

pfabry · 2024-03-12T19:23:41Z

Just a heads up and a question. Thanks to @cthoyt I have been able to run the lexmatch for the LSDAO ontology. @zhengj2007 as the reviewer of this ontology, how do you want to proceed? Do you want me to send you the file, post it directly in the issue, or not send it at all?

zhengj2007 · 2024-03-12T19:35:53Z

Thanks @cthoyt and @pfabry for lexmatch!

It would be nice to post it on the LSDAO new ontology request issue. Thanks @pfabry !

nlharris added the "automated validation of principles" label (Issues for the editorial WG pertinent to automating the validation of the Principles) on Jan 25, 2024
