change syn file format #113
Looks good! This code doesn't currently run because synonyms is a set() and not a list() (see inline comment), but that should be a simple fix. I would also recommend changing name to names if it's going to be a set or list of names instead of a single name.
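The inline comment is not reproduced here, so the exact failure is unknown. As a minimal sketch of one common way a set() causes trouble where a list() is expected, assuming the synonyms end up in a JSON document (the `curie` value and the names below are made up for illustration):

```python
import json

# Hypothetical data; the real structure in Babel may differ.
synonyms = {"thoracic vertebra", "Thoracic vertebra", "dorsal vertebra"}

try:
    json.dumps({"curie": "EXAMPLE:1", "names": synonyms})
    serialized_set = True
except TypeError:
    # The json module has no encoding for set objects.
    serialized_set = False

# Sorting converts the set into a list with a deterministic order,
# which also keeps regenerated files stable between runs.
document = {"curie": "EXAMPLE:1", "names": sorted(synonyms)}
line = json.dumps(document)
print(line)
```

Using `sorted()` rather than `list()` is a design choice in this sketch, not something the PR is known to do.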
Can't fully test b/c of BMT bug, but I think that this fixes the problems.
Thanks, Chris! I'll wait for the BMT bug (biolink/biolink-model-toolkit#111) to be fixed before reviewing this.
I tested this out on "anatomy" and it works great! Let me know if you'd like to have a look at the generated synonym files. Otherwise, I'm planning to rerun all of Babel early next week and can set up a NameRes loaded from the updated synonym files then.
One last thing that doesn't make sense: apparently this code (and my code) all worked fine even on bmt v0.9.0 today, not v1.0.4 as fixed by Sierra! I do not understand and do not like pip. I'll upgrade to v1.0.4 and recheck before approving this PR.
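When it is unclear which version pip actually installed, the standard library can report it directly. A small sketch, assuming the distribution is published under the name `bmt` (the helper name `installed_version` is hypothetical):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version found in the current environment, or None if not installed.
print(installed_version("bmt"))
```

Running this inside the same virtual environment as the build avoids guessing which interpreter pip installed into.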
e3299ce to 45b6a85
I've confirmed that this works fine on bmt 1.0.4. I think we're good to merge!
I tried loading this into Solr, but I found something unexpected. Note that the synonym files contain entries like:
{ "names": [
  ["http://www.geneontology.org/formats/oboInOwl#hasExactSynonym", "Caudal articular process of eighteenth thoracic vertebra"],
  ["http://www.geneontology.org/formats/oboInOwl#hasExactSynonym", "Caudal articular process of eighteenth thoracic vertebra (body structure)"]
] }
I assume we want to keep this format in Solr. However, when I try loading these JSON files into a Solr instance, it flattens this into a single list of strings:
{ "names": [
  "[http://www.geneontology.org/formats/oboInOwl#hasExactSynonym, Caudal articular process of eighteenth thoracic vertebra]",
  "[http://www.geneontology.org/formats/oboInOwl#hasExactSynonym, Caudal articular process of eighteenth thoracic vertebra (body structure)]"
] }
This still works, since you can still search for "Caudal" and get back this result, but all the names also match "hasExactSynonym", which seems non-ideal. Do we want to keep the synonym types in the synonym output? If not, I could remove them and we could simplify this to just a list of names. If we do, I can look into whether we can include more complex data in Solr. @cbizon What do you think?
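The effect described above can be reproduced without Solr. The sketch below only mimics the flattened strings shown in the comment; it is not Solr's actual code, and the substring match is a crude stand-in for Solr's tokenized search. The helper names are hypothetical.

```python
PREDICATE = "http://www.geneontology.org/formats/oboInOwl#hasExactSynonym"

document = {
    "names": [
        [PREDICATE, "Caudal articular process of eighteenth thoracic vertebra"],
        [PREDICATE, "Caudal articular process of eighteenth thoracic vertebra (body structure)"],
    ]
}


def flatten_like_observed(pairs):
    """Turn each [type, name] pair into one string, as seen in the loaded output."""
    return ["[" + ", ".join(pair) + "]" for pair in pairs]


def matches(term, values):
    """Case-insensitive substring search (assumption: approximates a text query)."""
    return [value for value in values if term.lower() in value.lower()]


flattened = flatten_like_observed(document["names"])
names_only = [name for _synonym_type, name in document["names"]]

# Every flattened entry matches the predicate, so the match carries no signal.
print(len(matches("hasExactSynonym", flattened)))
# With plain names, the predicate no longer matches anything.
print(len(matches("hasExactSynonym", names_only)))
```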
Hmm, it's nice to have in the file for backtracking, but it's of little use in the Solr database. So I guess I would take it out of the synonyms files. I would say that we could have an intermediate step where we filter them out, so we have one set that we can use for reference and another that is for loading, but the files are already so large that doubling them seems like a bad idea. So I'm leaning towards just producing the synonyms without the synonym type.
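Producing the synonyms without the synonym type could look roughly like the following. This is a sketch under assumptions, not the code in this PR: the function name is hypothetical, the `curie` value is invented, and dropping duplicate names is an extra choice that the discussion does not mention.

```python
def strip_synonym_types(document):
    """Return a copy of the document whose "names" is a plain list of strings."""
    seen = set()
    names = []
    for _synonym_type, name in document["names"]:
        # Assumption: the same name under two synonym types is kept only once.
        if name not in seen:
            seen.add(name)
            names.append(name)
    return {**document, "names": names}


typed = {
    "curie": "EXAMPLE:1",  # invented identifier for illustration
    "names": [
        ["http://www.geneontology.org/formats/oboInOwl#hasExactSynonym", "Caudal articular process of eighteenth thoracic vertebra"],
        ["http://www.geneontology.org/formats/oboInOwl#hasExactSynonym", "Caudal articular process of eighteenth thoracic vertebra (body structure)"],
    ],
}

untyped = strip_synonym_types(typed)
print(untyped["names"])
```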
bb02514 to 13a13dd
This is all pretty hacky, but I'll clean it up in #126
I'm pretty sure I've found and fixed all the immediate issues with this PR, so I think it's fine to merge it in. Some additional fixes are in PR #124.
…t-code This PR adds a GitHub Action to build two Docker images for NameRes -- a `nameresolution` image based on `./Dockerfile`, which can act as a web host for a correctly configured Solr database (see setup.sh for code on building this), and a `nameresolution-data-loading` image based on the `./data-loading/Dockerfile`, which can be used to load a Solr database with the synonym files produced by Babel (after TranslatorSRI/Babel#113). You can see these Docker images at https://github.com/orgs/TranslatorSRI/packages?repo_name=NameResolution Also contains a minor bug fix: we now allow a document to be missing a preferred_name, names, curie or types field and provide blank output in that case.
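The minor bug fix mentioned above (tolerating a document that lacks preferred_name, names, curie or types) can be sketched as follows. The four field names come from the description; the function name and the choice of an empty string or empty list as "blank output" are assumptions, and the real NameRes code may differ.

```python
def to_output(document):
    """Build a result record, substituting blanks for any missing field."""
    return {
        "curie": document.get("curie", ""),
        "preferred_name": document.get("preferred_name", ""),
        "names": document.get("names", []),
        "types": document.get("types", []),
    }


# A document missing everything except its names still produces a full record.
partial = {"names": ["Caudal articular process of eighteenth thoracic vertebra"]}
print(to_output(partial))
```

Using `dict.get` with a default avoids a `KeyError` on incomplete documents without changing the shape of the output.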
Changes the structure of the synonyms files to
This new style can be loaded into the prototype name resolver.
While working on this PR, we also fixed a bug in NCBIGene synonym generation.