New synonym format leads to much worse querying #50

Closed
gaurav opened this issue Apr 26, 2023 · 2 comments

gaurav commented Apr 26, 2023

I've set up a NameRes instance on Sterling (accessible in the RENCI VPN only) at http://name-resolution-sri-dev.apps.renci.org/docs using the new synonym format we've built for NameRes (#46, helxplatform/translator-devops#634, TranslatorSRI/Babel#113).

You can also directly access the underlying Solr database by running:

$ kubectl port-forward -n translator-exp name-lookup-solr-dep-0 8983:8983

and then accessing http://localhost:8983/ on your computer.

The bad news is that querying Solr directly and querying it through the NameRes frontend both give significantly worse results than the old system. For example, querying https://name-resolution-sri.renci.org/docs for blood gives us UBERON:0000178, NCIT:C12434 and UMLS:C0851353 (all meaning "blood") followed by UMLS:C0851353 ("bloody"). But running the same query on http://name-resolution-sri-dev.apps.renci.org/docs gives us UMLS:C5169928 ("JWH-073 3-hydroxybutyl (synthetic cannabinoid metabolite) | Blood | Drug toxicology"), UMLS:C5171063 ("Lindane | Blood | Drug toxicology"), UMLS:C0312901 ("Blood group antigen IBH") and a bunch of others.

Searching with Solr gives slightly more relevant results, but not the really good results that https://name-resolution-sri.renci.org/docs gives.
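
For anyone who wants to compare the two paths, here is a minimal sketch of building a direct Solr query URL against the port-forwarded instance. The core name `name_lookup` and the helper `solr_select_url` are assumptions made for illustration, not something confirmed in this issue; `/select`, `q`, `rows` and `fl` are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Base URL of the port-forwarded Solr instance (see the kubectl command above).
SOLR_BASE = "http://localhost:8983/solr"
# ASSUMPTION: hypothetical core name; check the Solr admin UI for the real one.
CORE = "name_lookup"


def solr_select_url(q: str, rows: int = 10, base: str = SOLR_BASE, core: str = CORE) -> str:
    """Build a URL for Solr's standard /select handler.

    fl="*,score" asks Solr to return every stored field plus the relevance
    score, which makes it easier to see why one document outranks another.
    """
    params = urlencode({"q": q, "rows": rows, "fl": "*,score"})
    return f"{base}/{core}/select?{params}"


if __name__ == "__main__":
    # Paste the printed URLs into a browser (or curl) while the port-forward is active.
    print(solr_select_url("names:blood"))
    print(solr_select_url("names:blood*"))
```

Comparing the scores returned for `names:blood` and `names:blood*` should show whether the ranking difference comes from Solr itself or from the NameRes frontend.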

One possible reason for this is that I've indexed the names field as a multiValued field (since it contains multiple values). Changing it to a non-multiValued field definitely helps with the results in Solr, but it causes NameRes to no longer work. I'll try fixing that and see if that solves this bug. If not, I'll probably need some help with the Solr querying and indexing aspect of all this.

gaurav commented Apr 27, 2023

This seems to be caused by the query being names:{fragment}*. Removing the asterisk fixes this problem, and the query (preferred_name:{fragment}^10 OR names:{fragment} OR names:{fragment}*) works pretty well:

    filters = [
        # Boost the preferred name by a factor of 10.
        # Using names:{fragment}* causes Solr to prioritize some odd results;
        # using names:{fragment} OR names:{fragment}* should cause it to still
        # include those results while prioritizing complete fragments.
        f"(preferred_name:{fragment}^10 OR names:{fragment} OR names:{fragment}*)"
        for fragment in fragments if len(fragment) > 0
    ]

(The query (preferred_name:{fragment}^10 OR names:{fragment}*) still prioritizes odd results over anything that isn't a preferred-name match, and (preferred_name:{fragment}^10 OR names:{fragment}) fails to match when the fragment is incomplete, e.g. Alzheimer disease matches but Alzheimer's disease fails.)
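
As a self-contained illustration of the query shape described above, here is a minimal sketch. How `fragments` is derived from the user's input and how the per-fragment filters are combined are not shown in this issue, so splitting on whitespace and joining with `AND` are assumptions, as are the helper names `build_filters` and `build_query`. The sketch also does not escape Solr special characters in the input.

```python
def build_filters(query: str) -> list[str]:
    """Return one Solr clause per whitespace-separated fragment.

    Each clause boosts preferred-name matches by 10 and matches the names
    field both exactly and as a prefix, so complete fragments rank above
    prefix-only hits while prefix hits are still included.
    """
    # ASSUMPTION: fragments are produced by splitting on whitespace.
    fragments = query.split()
    return [
        f"(preferred_name:{fragment}^10 OR names:{fragment} OR names:{fragment}*)"
        for fragment in fragments
        if len(fragment) > 0
    ]


def build_query(query: str) -> str:
    # ASSUMPTION: clauses are combined with AND so every fragment must match.
    return " AND ".join(build_filters(query))


if __name__ == "__main__":
    print(build_query("blood"))
    print(build_query("Alzheimer disease"))
```

Keeping the exact-match clause and the wildcard clause side by side means a document that contains the whole fragment satisfies both and scores higher than one that only matches the prefix.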

gaurav commented Dec 3, 2023

This has now been significantly improved, and it's working well enough that it is now what the Translator UI uses. Closing.

gaurav closed this as completed Dec 3, 2023