New synonym format leads to much worse querying #50

gaurav · 2023-04-26T17:46:59Z

I've set up a NameRes instance on Sterling (accessible in the RENCI VPN only) at http://name-resolution-sri-dev.apps.renci.org/docs using the new synonym format we've built for NameRes (#46, helxplatform/translator-devops#634, TranslatorSRI/Babel#113).

You can also directly access the underlying Solr database by running:

$ kubectl port-forward -n translator-exp name-lookup-solr-dep-0 8983:8983

and then accessing http://localhost:8983/ on your computer.
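With the port-forward running, a query URL for Solr's standard select handler can be put together roughly as follows. This is only a sketch: the core name `name_lookup` is an assumption (use whatever the Solr admin UI at http://localhost:8983/ lists), and `solr_select_url` is a hypothetical helper, not part of NameRes.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"
CORE = "name_lookup"  # assumption: replace with the core shown in the Solr admin UI


def solr_select_url(query, rows=10):
    """Build a URL for Solr's /select handler with a URL-encoded query."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/{CORE}/select?{params}"


# Search the names field for "blood".
print(solr_select_url("names:blood"))
```

Opening the printed URL in a browser, or fetching it with curl, should return the raw Solr response as JSON, which makes it easier to compare rankings without going through the NameRes frontend.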

The bad news is that querying Solr directly and querying it through the NameRes frontend both return significantly worse results than the old system. For example, querying https://name-resolution-sri.renci.org/docs for blood gives us UBERON:0000178, NCIT:C12434 and UMLS:C0851353 (all meaning "blood") followed by UMLS:C0851353 ("bloody"). But running the same query on http://name-resolution-sri-dev.apps.renci.org/docs gives us UMLS:C5169928 ("JWH-073 3-hydroxybutyl (synthetic cannabinoid metabolite) | Blood | Drug toxicology"), UMLS:C5171063 ("Lindane | Blood | Drug toxicology"), UMLS:C0312901 ("Blood group antigen IBH") and a bunch of others.

Searching with Solr gives slightly more relevant results, but not the really good results that https://name-resolution-sri.renci.org/docs gives.

One possible reason for this is that I've indexed the names field as a multiValued field (since it contains multiple values). Changing it to a non-multiValued field definitely helps with the results in Solr, but it causes NameRes to no longer work. I'll try fixing that and see if that solves this bug. If not, I'll probably need some help with the Solr querying and indexing aspect of all this.
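For reference, the two configurations being compared differ in a single field attribute. A minimal sketch of what that change could look like as a Solr Schema API `replace-field` command is below. The field type `text_general` and the `indexed`/`stored` settings are assumptions, not the actual NameRes schema, and `names_field_definition` is a hypothetical helper that only builds the payload without sending it.

```python
import json


def names_field_definition(multi_valued):
    """Build a Schema API 'replace-field' command for the names field."""
    return {
        "replace-field": {
            "name": "names",
            "type": "text_general",  # assumption: the real field type may differ
            "multiValued": multi_valued,
            "indexed": True,  # assumption
            "stored": True,  # assumption
        }
    }


# The configuration described above (multiValued) and the alternative tried in Solr.
print(json.dumps(names_field_definition(True), indent=2))
print(json.dumps(names_field_definition(False), indent=2))
```

A payload like this would be POSTed to the core's /schema endpoint, and the data would typically need to be reindexed afterwards for the change to take effect.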


gaurav · 2023-04-27T02:19:05Z

This seems to be caused by the query being names:{fragment}*. Removing the asterisk fixes this problem, and the query (preferred_name:{fragment}^10 OR names:{fragment} OR names:{fragment}*) works pretty well:

NameResolution/api/server.py, lines 104 to 111 in 61fb6d2:

    filters = [
        # Boost the preferred name by a factor of 10.
        # Using names:{fragment}* causes Solr to prioritize some odd results;
        # using names:{fragment} OR names:{fragment}* should cause it to still
        # include those results while prioritizing complete fragments.
        f"(preferred_name:{fragment}^10 OR names:{fragment} OR names:{fragment}*)"
        for fragment in fragments if len(fragment) > 0
    ]

(The query (preferred_name:{fragment}^10 OR names:{fragment}*) still prioritizes odd results over anything that isn't a preferred-name match, and (preferred_name:{fragment}^10 OR names:{fragment}) fails to match when the fragment is incomplete, e.g. Alzheimer disease matches but Alzheimer's disease fails.)
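To compare the three query shapes discussed in this comment side by side, here is a small standalone sketch. The clause templates are taken from the comment, but `build_filters` is a hypothetical helper, and splitting the query on spaces is an assumption about how `fragments` is produced in server.py. It does not escape Solr special characters.

```python
# Clause templates for the three variants discussed above.
TEMPLATES = {
    "prefix_only": "(preferred_name:{f}^10 OR names:{f}*)",
    "exact_only": "(preferred_name:{f}^10 OR names:{f})",
    "combined": "(preferred_name:{f}^10 OR names:{f} OR names:{f}*)",
}


def build_filters(query, variant="combined"):
    """Return one Solr clause per non-empty fragment of the query."""
    fragments = query.split(" ")  # assumption: fragments are space-separated
    return [
        TEMPLATES[variant].format(f=fragment)
        for fragment in fragments
        if len(fragment) > 0
    ]


# Print the clauses each variant generates for the same input.
for variant in TEMPLATES:
    print(variant, build_filters("blood", variant))
```

Pasting each generated clause into the Solr admin query screen is a quick way to see how the ranking changes between variants.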

gaurav · 2023-12-03T04:10:55Z

This has now been significantly improved, and it's working well enough that it is what the Translator UI now uses. Closing.

gaurav added this to the Name Resolver v2.0.0 (with updated synonym format) milestone Jun 8, 2023

gaurav added the high priority label Jun 8, 2023

gaurav mentioned this issue Jun 22, 2023

Restore previous sorting method #65

Closed

gaurav modified the milestones: Name Resolver v2.0.0 (with updated synonym format), NameRes October 2023 Sep 27, 2023

gaurav closed this as completed Dec 3, 2023
