Merge entities that are associated with strains #38
Comments
If the indices get too big, then we can't use them for sorting in our combined metric value. Ref: Merge entities that are associated with strains #38
@metincansiper This would be a good issue to look at once you've finished the I've created all of the underlying API to support this feature, so all that remains is to implement (1), (2), and (3) as a post-processing step after your indexing code.
All the code is in the
@maxkfranz Shouldn't it be "Check if there is a descendant entry for the top-level taxon ID (e.g. 83333) with the same name (e.g. ccdB)"? From my understanding, "562" is the ancestor and "83333" is in its descendant list, as opposed to how it is stated. Am I right?
@maxkfranz
Yes, the example might not be great. For some reason, we were using 83333 as the "main" one, I guess because a lot of results come up for that particular strain. You're right that it's not the ancestor entry. All of this editing would have to happen after the normal parsing and indexing finishes. Basically,
You could do this process bottom-up:
Because of the bottom-up approach, there should be no non-root entries remaining.
@maxkfranz I am a bit confused about this. Okay, 83333 is not an ancestor, but does it need any special handling in the implementation? Is it correct that 562 and 4932 are the only ancestors (maybe just for now), and that their descendants are the lists associated with them (the E. coli strains list for 562 and the S. cerevisiae strains list for 4932), as I mentioned in my previous comment?
It's in the Organism class in the orgs-and-strains branch: https://github.com/PathwayCommons/grounding-search/blob/orgs-and-strains/src/server/datasource/organisms.js#L43-L53
Just use whatever's in the Organism class. I checked that the IDs are correct.
Okay, great. Actually, this is what I meant.
Okay, it looks like this approach is the same as the one that I mentioned.
It's similar, but it's important to start with the descendants rather than the root. There may be entries that exist only for the descendant but not the root, and we want everything to use the root ID.
@maxkfranz I realized that there may be more than one entry eligible to be the ancestor of an entry. What I am currently doing is choosing the first eligible entry as the ancestor (making sure to choose the same ancestor entry for all descendants with the same organism and name). How does this approach sound in general? Also, would it make sense to merge the root entries into one of them as well?
I am actually not sure whether there can be multiple candidates for the ancestor of an entry. I thought it might be possible, because there may be multiple entities with the same root organism and the same name. However, it may be that none of those entities has any descendants. I need to investigate that further. Do you have any prior information about it?
Every descendant or strain will have exactly one corresponding root organism. So if you check each entity that has a descendant organism ID (i.e. an ID not equal to any of the roots), then all descendant entities can be moved up to the root. If there is an existing match for the root to merge into, that's ideal. If not, then you have to create a new entity for the root into which all descendants are merged.
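A minimal sketch of the procedure described above, not the code from the orgs-and-strains branch. It assumes entries shaped like `{ id, name, organism, organisms, ids }`. The names `ROOT_TO_STRAINS`, `strainToRoot`, `addUnique`, and `mergeStrainEntries` are hypothetical, and the strain lists are a tiny illustrative subset (the real ones are in the Organism class). How a newly created root entity gets its `id` is also an assumption here.

```javascript
// Assumed root -> strain taxon IDs (illustrative subset only).
const ROOT_TO_STRAINS = new Map([
  ['562', ['83333']],   // E. coli root and one example strain
  ['4932', ['559292']]  // S. cerevisiae root and one example strain
]);

// Invert the lists so each strain ID resolves to exactly one root ID.
const strainToRoot = new Map();
for (const [root, strains] of ROOT_TO_STRAINS) {
  for (const strain of strains) strainToRoot.set(strain, root);
}

// Append values to target, skipping any value that is already present.
function addUnique(target, values) {
  for (const v of values) {
    if (!target.includes(v)) target.push(v);
  }
}

// Move every strain-level entry up to the root entry with the exact same name.
// Returns a new list; the input entries are not mutated.
function mergeStrainEntries(entries) {
  const keyOf = (organism, name) => JSON.stringify([organism, name]);
  const rootByKey = new Map();
  const result = [];

  // Keep non-strain entries as they are and index existing root entries by name.
  for (const e of entries) {
    if (strainToRoot.has(e.organism)) continue;
    const copy = { ...e, organisms: [...(e.organisms || [])], ids: [...(e.ids || [])] };
    result.push(copy);
    const k = keyOf(e.organism, e.name);
    if (ROOT_TO_STRAINS.has(e.organism) && !rootByKey.has(k)) rootByKey.set(k, copy);
  }

  // Walk the strain entries and fold each one into its root.
  for (const e of entries) {
    const root = strainToRoot.get(e.organism);
    if (root === undefined) continue;
    const k = keyOf(root, e.name);
    let target = rootByKey.get(k);
    if (target === undefined) {
      // No root entry exists yet: create one (assumption: it reuses this entry's id).
      target = { id: e.id, name: e.name, organism: root, organisms: [], ids: [] };
      rootByKey.set(k, target);
      result.push(target);
    }
    addUnique(target.organisms, [e.organism, ...(e.organisms || [])]);
    addUnique(target.ids, [e.id, ...(e.ids || [])]);
  }

  return result;
}
```

Because every strain entry is either merged into an existing root or replaced by a newly created one, no entry in the returned list keeps a strain-level `organism` value.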
@maxkfranz I created PR #47 to resolve this issue.
For E. coli and yeast strains, merge all entities that have the exact same name.
For example, there are many entries for "CcdB". Each entry is basically the same except for the taxonomy ID. Here is a sample:
For each entry from NCBI that is associated with a strain taxon ID (e.g. 562):
entry.organisms, avoiding duplicates.
entry.ids, avoiding duplicates.
organism field normalised.
Update the ranking algorithm: The ranking w.r.t. organismOrdering should consider the best match of entry.organism and entry.organisms.
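One way the ranking change could look is sketched below. It assumes `organismOrdering` is an array of taxon IDs with the most preferred organism first, which the issue does not spell out. `organismRank` and `sortByOrganism` are hypothetical names, not functions from the repository.

```javascript
// Best (lowest) position in organismOrdering among entry.organism and
// entry.organisms. Returns Infinity when none of them appears in the ordering.
function organismRank(entry, organismOrdering) {
  const candidates = [entry.organism, ...(entry.organisms || [])];
  let best = Infinity;
  for (const id of candidates) {
    const index = organismOrdering.indexOf(id);
    if (index !== -1 && index < best) best = index;
  }
  return best;
}

// Sort a copy of the entries by their best organism rank.
// Comparison is used instead of subtraction because Infinity - Infinity is NaN.
function sortByOrganism(entries, organismOrdering) {
  return [...entries].sort((a, b) => {
    const rankA = organismRank(a, organismOrdering);
    const rankB = organismRank(b, organismOrdering);
    if (rankA === rankB) return 0;
    return rankA < rankB ? -1 : 1;
  });
}
```

With this approach, a merged root entry that lists a preferred strain in `entry.organisms` ranks as well as the original strain-level entry would have.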