Merge entities that are associated with strains #38

Closed
maxkfranz opened this issue Jun 18, 2019 · 16 comments

Comments

@maxkfranz
Member

maxkfranz commented Jun 18, 2019

For E. coli and yeast strains, merge all entities that have the exact same name.

For example, there are many entries for "CcdB". Each entry is basically the same except for the taxonomy ID. Here is a sample:

{
    "namespace": "ncbi",
    "type": "protein",
    "id": "39521901",
    "organism": "562",
    "name": "ccdB",
    "synonyms": [
      "C7V14_00585",
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39524440",
    "organism": "562",
    "name": "ccdB",
    "synonyms": [
      "EJC48_00625",
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39529410",
    "organism": "562",
    "name": "ccdB",
    "synonyms": [
      "U14A_A00031",
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "8877686",
    "organism": "573",
    "name": "ccdB",
    "synonyms": [
      "CcdB toxin protein"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39650970",
    "organism": "621",
    "name": "ccdB",
    "synonyms": [
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39651896",
    "organism": "622",
    "name": "ccdB",
    "synonyms": [
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "9538168",
    "organism": "562",
    "name": "ccdB",
    "synonyms": [
      "plasmid maintenance protein",
      "toxin component",
      "plasmid maintenance protein; toxin component"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  }

For each entry from NCBI that is associated with a strain taxon ID (e.g. 562):

  1. Check if there is an ancestor entry for the top-level taxon ID (e.g. 83333) with the same name (e.g. ccdB).
  2. If an entry was found, then merge the descendant entry with the ancestor entry.
    1. Add the synonyms from the descendant entry into the ancestor entry, avoiding duplicates.
    2. Add the taxon IDs for the ancestor entry and the descendant entry into entry.organisms, avoiding duplicates.
    3. Add the grounding ID for the ancestor and the descendant to entry.ids, avoiding duplicates.
  3. If there is no ancestor entry, then replace the descendant entry taxon ID with the ancestor taxon ID to make the organism field normalised.
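
The merge in step 2 can be sketched as follows. This is only an illustration under assumptions: `union` and `mergeIntoAncestor` are hypothetical names, and the sketch assumes `entry.organisms` and `entry.ids` are plain arrays of strings as proposed above.

```javascript
// Hypothetical helper: union of two arrays, keeping first-seen order.
function union(a = [], b = []) {
  return [...new Set([...a, ...b])];
}

// Merge a descendant entry into its ancestor entry (step 2 above).
// Returns a new object; neither input is mutated.
function mergeIntoAncestor(ancestor, descendant) {
  return {
    ...ancestor,
    // 2.1: synonyms from both entries, without duplicates
    synonyms: union(ancestor.synonyms, descendant.synonyms),
    // 2.2: taxon IDs from both entries, without duplicates
    organisms: union(
      ancestor.organisms || [ancestor.organism],
      descendant.organisms || [descendant.organism]
    ),
    // 2.3: grounding IDs from both entries, without duplicates
    ids: union(ancestor.ids || [ancestor.id], descendant.ids || [descendant.id])
  };
}
```

The merged entry keeps the ancestor's own `id` and `organism` fields, so existing lookups by those fields are unaffected.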

Update the ranking algorithm: The ranking w.r.t. organismOrdering should consider the best match of entry.organism and entry.organisms.
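
A minimal sketch of that ranking rule, assuming `organismOrdering` is an array of taxon IDs with the most preferred organism first. The function name `bestOrganismIndex` is hypothetical, as is the choice to sort unmatched entries last.

```javascript
// Return the best (lowest) position in organismOrdering among
// entry.organism and every ID in entry.organisms.
function bestOrganismIndex(entry, organismOrdering) {
  const candidates = [entry.organism, ...(entry.organisms || [])];
  const indices = candidates
    .map(id => organismOrdering.indexOf(id))
    .filter(i => i >= 0);
  // Assumption: entries with no organism in the ordering sort after all others.
  return indices.length > 0 ? Math.min(...indices) : organismOrdering.length;
}
```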

maxkfranz added a commit that referenced this issue Jun 19, 2019
Ref:
- Merge entities that are associated with strains #38
- Use NCBI taxonomy database for `entry.organismName` property #43
maxkfranz added a commit that referenced this issue Jun 19, 2019
The strain information is stored in the `strains` dir.

Ref:
- Merge entities that are associated with strains #38
- Use NCBI taxonomy database for `entry.organismName` property #43
maxkfranz added a commit that referenced this issue Jun 19, 2019
If the indices get too big, then we can't use them for sorting in our combined metric value.

Ref: Merge entities that are associated with strains #38
@maxkfranz
Member Author

@metincansiper This would be a good issue to look at once you've finished the import code in factoid.

I've created all of the underlying API to support this feature, so all that remains is to implement (1), (2), and (3) as a post-processing step after your indexing code.

@maxkfranz
Member Author

All the code is in the orgs-and-strains branch.

@metincansiper
Contributor

Check if there is an ancestor entry for the top-level taxon ID (e.g. 83333) with the same name (e.g. ccdB).

@maxkfranz Shouldn't it be "Check if there is a descendant entry for the top-level taxon ID (e.g. 83333) with the same name (e.g. ccdB)."?

Because, from my understanding, "562" is the ancestor and "83333" is in its descendant list, as opposed to how it is stated. Am I right?

@metincansiper
Contributor

@maxkfranz
I am not sure I understand correctly what to do, so I am writing out my understanding (in a way that is perhaps closer to practice). Please correct me if something is wrong:

  • After the NCBI update is completed and the entries are inserted into the DB, query the DB to find the entries whose organism ID is a strain taxon ID (either 562 or 4932). This will most probably need to be searched and processed chunk by chunk.
  • For each search result (the search results are the ancestor entries):
    • Use the descendant list associated with the organism ID (the E. coli strains list for 562 and the S. cerevisiae strains list for 4932).
    • Make a new query to find the entries whose organism ID is in the related descendant list and whose name matches (the results will be the descendants).
    • Update the DB as described in the issue description. (As far as I understand, the properties of the descendant entries are supposed to be merged into those of the ancestor entries. Should the descendant entries then be removed from the DB?)

@maxkfranz
Member Author

Because, from my understanding, "562" is the ancestor and "83333" is in its descendant list, as opposed to how it is stated. Am I right?

Yes, the example might not be great. For some reason, we were using 83333 as the "main" one, I guess because a lot of results come up for that particular strain. You're right that it's not the ancestor entry.

All of this editing would have to happen after the normal parsing and indexing finishes. Basically,

  • Find all entries with the same name.
  • Filter the entry list to match only the descendant IDs.
  • Merge the descendant entries into the ancestor entry.
  • Delete the descendant entries from the index.

@maxkfranz
Member Author

You could do this process bottom-up:

  • Find all entries for each strain of a particular root organism, e.g. yeast.
  • For each entry, see if there is an entry for the root organism with the same name. If so, merge the descendant entry into the root one. If not, create a new entry for the root one and merge the descendant into it.

Because of the bottom-up approach, there should be no non-root entries remaining.
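
As a rough in-memory sketch of the bottom-up pass (the real implementation would work against the index, not an array): `mergeStrainsBottomUp` and `addUnique` are hypothetical names, `org` is assumed to look like `{ id, descendantIds }`, and entries are assumed to carry `id`, `organism`, `name`, and `synonyms`. When several root entries share a name, the sketch arbitrarily uses the first one as the merge target.

```javascript
// Hypothetical helper: append values to target, skipping duplicates.
function addUnique(target, values) {
  for (const v of values) {
    if (!target.includes(v)) target.push(v);
  }
}

function mergeStrainsBottomUp(entries, org) {
  const strainIds = new Set(org.descendantIds);
  const rootByName = new Map(); // name -> root entry used as merge target
  const result = [];

  // Pass 1: keep every non-strain entry; index root entries by name.
  for (const e of entries) {
    if (strainIds.has(e.organism)) continue;
    if (e.organism !== org.id) { result.push(e); continue; }
    const root = { ...e, synonyms: [...(e.synonyms || [])], organisms: [e.organism], ids: [e.id] };
    if (!rootByName.has(e.name)) rootByName.set(e.name, root);
    result.push(root);
  }

  // Pass 2: fold each strain entry into the root entry with the same name.
  for (const e of entries) {
    if (!strainIds.has(e.organism)) continue;
    let root = rootByName.get(e.name);
    if (!root) {
      // No root entry exists yet: create one from the strain entry,
      // normalised to the root organism ID.
      root = { ...e, organism: org.id, synonyms: [], organisms: [org.id], ids: [] };
      rootByName.set(e.name, root);
      result.push(root);
    }
    addUnique(root.synonyms, e.synonyms || []);
    addUnique(root.organisms, [e.organism]);
    addUnique(root.ids, [e.id]);
  }

  return result; // contains no entry whose organism is a strain ID
}
```

A root entry created in pass 2 keeps the first strain entry's `id` as its primary ID; that is a choice made for this sketch, not something settled in this thread.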

@metincansiper
Contributor

Yes, the example might not be great. For some reason, we were using 83333 as the "main" one, I guess because a lot of results come up for that particular strain. You're right that it's not the ancestor entry.

@maxkfranz I am a bit confused about this. Okay, 83333 is not an ancestor, but does it need any special handling in the implementation?

Is it correct that 562 and 4932 are the only ancestors (at least for now), and that their descendants are the descendant lists associated with them (the E. coli strains list for 562 and the S. cerevisiae strains list for 4932), as I mentioned in my previous comment?

@maxkfranz
Member Author

It's in the Organism class in the orgs-and-strains branch: https://github.com/PathwayCommons/grounding-search/blob/orgs-and-strains/src/server/datasource/organisms.js#L43-L53

org.id is the ID of the root organism. org.descendantIds contains all the IDs of the strains.
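
For illustration, those two properties are enough to build a strain-to-root lookup. This sketch assumes only the `{ id, descendantIds }` shape described above; `buildRootLookup` and `normaliseOrganism` are hypothetical names and are not part of the Organism class.

```javascript
// Map every strain ID (and each root ID itself) to its root organism ID.
function buildRootLookup(organisms) {
  const rootOf = new Map();
  for (const org of organisms) {
    rootOf.set(org.id, org.id);
    for (const strainId of org.descendantIds) {
      rootOf.set(strainId, org.id);
    }
  }
  return rootOf;
}

// Taxon IDs that belong to no known root are returned unchanged.
function normaliseOrganism(taxonId, rootOf) {
  return rootOf.get(taxonId) || taxonId;
}

// Usage with a tiny, illustrative subset of strain IDs:
const exampleLookup = buildRootLookup([
  { id: '562', descendantIds: ['83333'] }
]);
console.log(normaliseOrganism('83333', exampleLookup));
```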

@maxkfranz
Member Author

Just use whatever's in the Organism class. I checked that the IDs are correct.

@metincansiper
Contributor

org.id is the ID of the root organism. org.descendantIds contains all the IDs of the strains.

Okay, great. Actually, this is what I meant.

@metincansiper
Contributor

You could do this process bottom-up:

  • Find all entries for each strain of a particular root organism, e.g. yeast.
  • For each entry, see if there is an entry for the root organism with the same name. If so, merge the descendant entry into the root one. If not, create a new entry for the root one and merge the descendant into it.

Because of the bottom-up approach, there should be no non-root entries remaining.

Okay, it looks like this approach is the same as the one that I mentioned.

@maxkfranz
Member Author

It's similar, but it's important to start with the descendants rather than the root. There may be entries that exist only for the descendant but not the root, and we want everything to use the root ID.

@metincansiper
Contributor

metincansiper commented Jul 1, 2019

@maxkfranz I realized that there may be more than one entry that is eligible to be the ancestor of an entry. What I am currently doing is choosing the first of the eligible entries as the ancestor (I make sure to choose the same entry as the ancestor for all descendants with the same organism and name).

How does this approach sound in general? Also, would it make sense to merge the root entries into one of them as well?

@metincansiper
Contributor

I am actually not sure whether there can be multiple candidates for the ancestor of an entry. I thought it might be possible, because there may be multiple entities with the same root organism and the same name. However, it may be that none of those entities has any descendants. I may need to investigate this further. Do you have any prior information about that?

@maxkfranz
Member Author

Every descendant or strain will have exactly one corresponding root organism. So if you check each entity that has a descendant organism ID (i.e. an ID not equal to any of the roots), then all descendant entities can be moved up to the root.

If there is an existing match for the root to merge into, that's ideal. If not, then you have to create a new entity for the root into which all descendants are merged.

@metincansiper
Contributor

@maxkfranz I created PR #47 to resolve this issue.
