Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change syn file format #113

Merged
merged 12 commits into from
May 8, 2023
Merged

change syn file format #113

merged 12 commits into from
May 8, 2023

Conversation

cbizon
Copy link
Contributor

@cbizon cbizon commented Mar 10, 2023

Changes the structure of the synonyms files to

  1. Be in jsonl format
  2. include a preferred_name label
  3. include biolink types

This new style can be loaded into the prototype name resolver.

While working on this PR, we also fixed a bug in NCBIGene synonym generation.

@cbizon cbizon requested a review from gaurav March 10, 2023 21:38
Copy link
Collaborator

@gaurav gaurav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! This code doesn't currently run because synonyms is a set() and not a list() (see inline comment), but that should be a simple fix. I would also recommend changing name to names if it's going to be a set or list of names instead of a single name.

src/babel_utils.py Outdated Show resolved Hide resolved
src/babel_utils.py Outdated Show resolved Hide resolved
@cbizon
Copy link
Contributor Author

cbizon commented Mar 13, 2023

Can't fully test b/c of BMT bug, but I think that this fixes the problems.

@cbizon cbizon requested a review from gaurav March 13, 2023 17:20
@gaurav
Copy link
Collaborator

gaurav commented Mar 15, 2023

Thanks, Chris! I'll wait for the BMT bug (biolink/biolink-model-toolkit#111) to be fixed before reviewing this.

@gaurav
Copy link
Collaborator

gaurav commented Mar 23, 2023

I tested this out on "anatomy" and it works great! Let me know if you'd like to have a look at the generated synonym files. Otherwise, I'm planning to rerun all of Babel early next week and can set up a NameRes loaded from the updated synonym files then.

@gaurav
Copy link
Collaborator

gaurav commented Mar 23, 2023

One last thing that doesn't make sense: apparently this code (and my code) all worked fine even on bmt v0.9.0 today, not v1.0.4 as fixed by Sierra! I do not understand and do not like pip. I'll upgrade to v1.0.4 and recheck before approving this PR.

Copy link
Collaborator

@gaurav gaurav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've confirmed that this works fine on bmt 1.0.4. I think we're good to merge!

@gaurav gaurav self-requested a review April 20, 2023 04:40
@gaurav
Copy link
Collaborator

gaurav commented Apr 20, 2023

I tried loading this into Solr, but I found something unexpected: note that synonym_factory.get_synonyms(node) doesn't return a list of synonyms, but a list of tuples in the form ('http://www.geneontology.org/formats/oboInOwl#hasExactSynonym', 'Caudal articular process of eighteenth thoracic vertebra'), where the first element is either oboInOwl:hasExactSynonym or the relation included in the synonyms file. When saved into JSON, this is turned into a two dimensional array:

{ "names": [
  ["http://www.geneontology.org/formats/oboInOwl#hasExactSynonym", "Caudal articular process of eighteenth thoracic vertebra"],
  ["http://www.geneontology.org/formats/oboInOwl#hasExactSynonym", "Caudal articular process of eighteenth thoracic vertebra (body structure)]"]
] }

I assume we want to keep this format in Solr. However, when I try loading these JSON files into a Solr instance, it flattens this into a single list of strings:

{"names":[
  "[http://www.geneontology.org/formats/oboInOwl#hasExactSynonym, Caudal articular process of eighteenth thoracic vertebra]",
  "[http://www.geneontology.org/formats/oboInOwl#hasExactSynonym, Caudal articular process of eighteenth thoracic vertebra (body structure)]"]
}

This still works, since you can still search for "Caudal" and get back this result, but all the names also match "hasExactSynonym", which seems non-ideal.

Do we want to keep the synonym types in the synonym output? If not, I could remove that we could simplify this to just a list of names. If we do, I can look into whether we can include more complex data into Solr. @cbizon What do you think?

@cbizon
Copy link
Contributor Author

cbizon commented Apr 20, 2023

Hmm, it's nice to have in the file for backtracking, but it's of little use in the solr database. So I guess I would take it out of the synonyms files. I would say that we could have an intermediate step where we filter them out, so we have one set that we can use for reference and another that is for loading, but the files are already so large that doubling them seems like a bad idea.

So I'm leaning towards just producing the synonyms without the synonym type.

@gaurav
Copy link
Collaborator

gaurav commented Apr 26, 2023

  • There are fewer synonyms being emitted for some terms (e.g. UBERON:0000178 has 13 synonyms in 2022dec2-2 but only five with this code). I wonder if we previously considered labels to also be synonyms but don't do that any longer, or if there's some other general reason for this drop. (Synonyms missing in synonym output #128)
    • Most of those synonyms are associated with UMLS:C0005767, which is a secondary identifier in the UBERON:0000178 clique.
    • Aha, that clique has now been split. Don't know why, but it's not likely to be related to this PR.
  • Some leftover UMLS synonyms produced with this code only had a single synonym (e.g. UMLS:C0000715 only has "House, Slaughter") but in 2022dec2-2 these had many more synonyms (e.g. UMLS:C0000715 had ten, including various spellings of "abattoirs" and "slaughterhouses"). This is probably a bug in the UMLS export I wrote. (fixed in 83e83c7)

Copy link
Collaborator

@gaurav gaurav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure I've found and fixed all the immediate issues with this PR, so I think it's fine to merge it in. Some additional fixes are in PR #124.

@gaurav gaurav merged commit 50167e9 into master May 8, 2023
@gaurav gaurav deleted the synonym_prototype branch May 8, 2023 01:14
gaurav added a commit to TranslatorSRI/NameResolution that referenced this pull request Jun 12, 2023
…t-code

This PR adds a GitHub Action to build two Docker images for NameRes -- a `nameresolution` image based on `./Dockerfile`, which can act as a web host for a correctly configured Solr database (see setup.sh for code on building this), and a `nameresolution-data-loading` image based on the `./data-loading/Dockerfile`, which can be used to load a Solr database with the synonym files produced by Babel (after TranslatorSRI/Babel#113). You can see these Docker images at https://github.com/orgs/TranslatorSRI/packages?repo_name=NameResolution

Also contains a minor bug fix: we now allow a document to be missing a preferred_name, names, curie or types field and provide blank output in that case.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants