hyphenated terms not found by annotator given non-hyphenated words #6

Open
graybeal opened this issue Sep 4, 2019 · 5 comments

graybeal commented Sep 4, 2019

Recently, we used our lookup tool, which calls the Annotator, and noticed a discrepancy between how the Annotator finds ontology class matches and what a manual search returns. Basically, the Annotator does not seem to have the flexibility to deal with hyphens or similar characters, while a manual search in an ontology will find matches. The example uses an input term of “sodium iodide symporter”. MESH has a “sodium-iodide symporter”, but the Annotator does not find it. Instead, the Annotator matches only “sodium iodide” (see Excel attachment). Is this an issue of which you are already aware? If so, is there a plan for an Annotator version update, or would a fix be simple enough to implement in the near future?


graybeal commented Sep 4, 2019

The reason the two processes are finding different strings is not so much (or not just) the hyphen, but that search also looks at the class description; the Annotator does not. The non-hyphenated string appears exactly in the description, and so search is finding that. (Search for "member 5 protein" and hit return to see the same result.)

So it is arguable whether this is a bug, a feature, or just a possible enhancement. I am pretty sure the mgrep method used by the annotator is quite strict about finding exact matches, which this is not. While we don't have any short-term plans for Annotator updates, we can look at whether a simple solution is available that would be smarter (or maybe, 'looser') about hyphens. Because we are using the mgrep algorithm, it may be either very easy or very time-consuming to update.
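
To make the idea of being 'looser' about hyphens concrete, here is a minimal sketch. It is not mgrep and not the Annotator's code: it is a plain dictionary lookup over word n-grams, and the class IDs (`EX:1`, `EX:2`) and function names are made up for illustration. The only point is that if hyphens are replaced by spaces on both sides (dictionary labels and input text), an otherwise exact matcher can find the hyphenated label.

```python
import re

def normalize(term: str) -> str:
    """Lowercase, replace ASCII and Unicode hyphens with spaces, collapse whitespace."""
    term = re.sub(r"[-\u2010\u2011]", " ", term.lower())
    return " ".join(term.split())

def build_dictionary(labels):
    """Map each normalized label to the set of class IDs that carry it."""
    index = {}
    for class_id, label in labels:
        index.setdefault(normalize(label), set()).add(class_id)
    return index

def annotate(text, index, max_words=6):
    """Return (phrase, class IDs) for every word n-gram found exactly in the index."""
    words = normalize(text).split()
    hits = []
    for start in range(len(words)):
        for end in range(start + 1, min(start + max_words, len(words)) + 1):
            phrase = " ".join(words[start:end])
            if phrase in index:
                hits.append((phrase, sorted(index[phrase])))
    return hits

# Hypothetical labels and IDs, for illustration only.
index = build_dictionary([
    ("EX:1", "sodium-iodide symporter"),
    ("EX:2", "sodium iodide"),
])
print(annotate("sodium iodide symporter", index))
```

With normalization on both sides, the input matches both the shorter "sodium iodide" label and the hyphenated "sodium-iodide symporter" label. Whether something like this is cheap to do inside the real pipeline is exactly the open question above.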


jonquet commented Sep 5, 2019

I know about this issue. Indeed, Mgrep is not flexible when matching: it is strict, which is why it does not catch the non-hyphenated version.
The search service finds it not so much because the non-hyphenated version appears in a synonym, but because it relies on Lucene, which allows flexible matching.

If you search for "sodium iodide symporter" in PR, you will get a match even if this exact string is not a synonym.
https://bioportal.bioontology.org/ontologies/PR?p=classes&conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPR_Q63008

Within SIFR, we partially fixed this by offering a "beta" version that uses a lemmatizer in the back end (and therefore a lemmatized dictionary too). This fixes plurals as well. A preliminary (not formal) study showed that the decrease in precision from lemmatization was not matched by the increase in recall. So it remains an option that users can enable or not, depending on whether they prefer precision or recall.
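
For readers unfamiliar with the trade-off, here is a toy sketch of how lemmatizing both the dictionary and the text changes precision and recall. It is not the SIFR implementation: the lemma table, labels, class IDs, and gold annotations are all invented, and the numbers only show the mechanism, not the result of the study mentioned above.

```python
# Hypothetical toy lemma table; a real lemmatizer would be far larger.
LEMMAS = {"symporters": "symporter", "glasses": "glass"}

def lemmatize(phrase):
    """Lowercase the phrase and map each word to its lemma when one is known."""
    return " ".join(LEMMAS.get(word, word) for word in phrase.lower().split())

def match(text, labels, use_lemmas):
    """Exact n-gram lookup; labels and text are lemmatized only if use_lemmas is True."""
    norm = lemmatize if use_lemmas else str.lower
    index = {norm(label): class_id for class_id, label in labels}
    words = norm(text).split()
    found = set()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            class_id = index.get(" ".join(words[i:j]))
            if class_id:
                found.add(class_id)
    return found

def precision_recall(found, gold):
    """Precision and recall of the found class IDs against a gold set."""
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall

# Invented example data: "glasses" here means eyeglasses, so EX:2 is a false positive.
labels = [("EX:1", "iodide symporter"), ("EX:2", "glass"), ("EX:3", "thyroid")]
text = "Thyroid iodide symporters seen through glasses"
gold = {"EX:1", "EX:3"}

print(precision_recall(match(text, labels, use_lemmas=False), gold))
print(precision_recall(match(text, labels, use_lemmas=True), gold))
```

In this toy case the strict matcher misses the plural "symporters" (lower recall), while the lemmatized matcher catches it but also maps "glasses" to "glass" (lower precision).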


jonquet commented Sep 5, 2019

As a side comment, this example also shows that MeSH's synonyms are not parsed correctly by the Annotator (they are not included in the dictionary):

Try to annotate "Nis protein, rat", which is an altLabel in MeSH of the same class:
https://bioportal.bioontology.org/ontologies/MESH?p=classes&conceptid=http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FMESH%2FC070626

It does not return any result. There is an issue in how MeSH's synonyms are defined. The metadata needs to be changed.


graybeal commented Sep 9, 2019

In your last comment @jonquet, are you observing that an altLabel with an embedded comma is a problem? If not, can you be a little more explicit about exactly the problem? It seems to me that annotating any string containing a comma is likely to fail, because the comma will be treated as a phrase delimiter by the annotation algorithm.


jonquet commented Sep 9, 2019

The problem is not the comma. Try to annotate "Hormones, Hormone Substitutes, and Hormone Antagonists" and you will get a match (via the preferred name).

The problem described in my previous comment is a synonym parsing issue:
Try to annotate "thyroid iodide transporter" with MeSH. You do not get any result, even though this expression is a synonym of "sodium-iodide symporter".

I believe the synonym property of MeSH is not well defined in the metadata. An admin needs to correct this.
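
To illustrate what I mean, here is a small sketch of how a missing synonym property in the ontology metadata would keep synonyms out of the dictionary. This is an assumption about the mechanism, not the actual BioPortal code: the function, the class record, the class ID, and the property names used as configuration are all hypothetical.

```python
def build_label_dictionary(classes, pref_property, synonym_properties):
    """Collect lowercase labels from the preferred-name property and from
    whichever synonym properties the ontology metadata declares."""
    entries = {}
    for cls in classes:
        labels = list(cls.get(pref_property, []))
        for prop in synonym_properties:
            labels.extend(cls.get(prop, []))
        for label in labels:
            entries[label.lower()] = cls["id"]
    return entries

# Hypothetical class record; the ID and property names are for illustration only.
classes = [{
    "id": "EX:symporter",
    "skos:prefLabel": ["sodium-iodide symporter"],
    "skos:altLabel": ["thyroid iodide transporter", "Nis protein, rat"],
}]

# Metadata that declares no synonym property: only the preferred name is indexed.
without_synonyms = build_label_dictionary(classes, "skos:prefLabel", [])
# Metadata that declares the property actually holding the synonyms.
with_synonyms = build_label_dictionary(classes, "skos:prefLabel", ["skos:altLabel"])

print("thyroid iodide transporter" in without_synonyms)
print("thyroid iodide transporter" in with_synonyms)
```

If the declared synonym property does not match the property where the synonyms actually live, the result is the same as the first case: the preferred name matches, the synonyms never do.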
