Merge entities that are associated with strains #38

maxkfranz · 2019-06-18T17:36:43Z

For E. coli and yeast strains, merge all entities that have the exact same name.

For example, there are many entries for "CcdB". The entries are essentially the same, apart from the taxonomy ID. Here is a sample:

{
    "namespace": "ncbi",
    "type": "protein",
    "id": "39521901",
    "organism": "562",
    "name": "ccdB",
    "synonyms": [
      "C7V14_00585",
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39524440",
    "organism": "562",
    "name": "ccdB",
    "synonyms": [
      "EJC48_00625",
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39529410",
    "organism": "562",
    "name": "ccdB",
    "synonyms": [
      "U14A_A00031",
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "8877686",
    "organism": "573",
    "name": "ccdB",
    "synonyms": [
      "CcdB toxin protein"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39650970",
    "organism": "621",
    "name": "ccdB",
    "synonyms": [
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39651896",
    "organism": "622",
    "name": "ccdB",
    "synonyms": [
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "9538168",
    "organism": "562",
    "name": "ccdB",
    "synonyms": [
      "plasmid maintenance protein",
      "toxin component",
      "plasmid maintenance protein; toxin component"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  }

For each entry from NCBI that is associated with a strain taxon ID (e.g. 562):

Check if there is an ancestor entry for the top-level taxon ID (e.g. 83333) with the same name (e.g. ccdB).
If an entry was found, then merge the descendant entry with the ancestor entry.
1. Add the synonyms from the descendant entry into the ancestor entry, avoiding duplicates.
2. Add the taxon IDs for the ancestor entry and the descendant entry into entry.organisms, avoiding duplicates.
3. Add the grounding IDs for the ancestor and the descendant to entry.ids, avoiding duplicates.
If there is no ancestor entry, then replace the descendant entry taxon ID with the ancestor taxon ID to make the organism field normalised.
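Steps (1) to (3) can be sketched roughly as follows. This is only an illustration, not the code in the repository: `mergeEntry` and the sample IDs `root-1` and `strain-1` are made up, and, following the clarification later in this thread, 562 is treated as the root and 83333 as one of its strains.

```javascript
// Hypothetical helper (not from the repository): merge a strain-level
// (descendant) entry into the root-level (ancestor) entry.
// The `organisms` and `ids` fields follow the issue text.
function mergeEntry(ancestor, descendant) {
  // Concatenate the lists and drop duplicates, keeping first-seen order.
  const union = (...lists) => [...new Set(lists.flat())];

  return {
    ...ancestor,
    synonyms: union(ancestor.synonyms || [], descendant.synonyms || []),
    organisms: union(
      ancestor.organisms || [ancestor.organism],
      descendant.organisms || [descendant.organism]
    ),
    ids: union(ancestor.ids || [ancestor.id], descendant.ids || [descendant.id])
  };
}

// Made-up sample entries; only the taxon IDs appear in this issue.
const rootEntry = { id: 'root-1', organism: '562', name: 'ccdB', synonyms: ['toxin CcdB'] };
const strainEntry = { id: 'strain-1', organism: '83333', name: 'ccdB', synonyms: ['toxin CcdB', 'CcdB toxin protein'] };

const merged = mergeEntry(rootEntry, strainEntry);
console.log(merged.synonyms, merged.organisms, merged.ids);
```

The merged entry keeps the ancestor's `id` and `organism`, so existing lookups by the root entry still work, while `ids` and `organisms` record everything that was folded in.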

Update the ranking algorithm: The ranking w.r.t. organismOrdering should consider the best match of entry.organism and entry.organisms.
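A minimal sketch of the "best match" idea, assuming `organismOrdering` is an array of taxon IDs with the most preferred organism first. The function name `bestOrganismIndex` and the fallback for unlisted organisms are assumptions, not the existing ranking code.

```javascript
// Assumption: `organismOrdering` lists taxon IDs, most preferred first.
// Returns the smallest (best) index among entry.organism and entry.organisms.
function bestOrganismIndex(entry, organismOrdering) {
  const candidates = [entry.organism, ...(entry.organisms || [])];
  const indices = candidates
    .map(id => organismOrdering.indexOf(id))
    .filter(i => i >= 0);

  // Assumption: entries with no listed organism sort after every listed one.
  return indices.length > 0 ? Math.min(...indices) : organismOrdering.length;
}

const ordering = ['9606', '562'];
const mergedEntry = { organism: '83333', organisms: ['83333', '562'] };
console.log(bestOrganismIndex(mergedEntry, ordering));
```

With this approach a merged entry is ranked by whichever of its taxon IDs the user's ordering prefers most, rather than only by its primary `organism` field.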

The strain information is stored in the `strains` dir. Ref: - Merge entities that are associated with strains #38 - Use NCBI taxonomy database for `entry.organismName` property #43

maxkfranz · 2019-06-19T20:27:53Z

@metincansiper This would be a good issue to look at once you've finished the import code in factoid.

I've created all of the underlying API to support this feature, so all that remains is to implement (1), (2), and (3) as a post-processing step after your indexing code.

maxkfranz · 2019-06-19T20:28:22Z

All the code is in the orgs-and-strains branch.

metincansiper · 2019-06-24T22:17:53Z

> Check if there is an ancestor entry for the top-level taxon ID (e.g. 83333) with the same name (e.g. ccdB).

@maxkfranz Shouldn't it be "Check if there is a descendant entry for the top-level taxon ID (e.g. 83333) with the same name (e.g. ccdB)."?

Because, from my understanding, "562" is the ancestor and "83333" is in its descendant list, as opposed to how it is stated. Am I right?

metincansiper · 2019-06-24T23:14:45Z

@maxkfranz
I am not sure that I understand correctly what needs to be done, so I am writing out my understanding (in terms closer to the implementation) and you can correct me if something is wrong:

After the NCBI update is completed and the entries are inserted into the DB, query the DB to find the entries whose organism ID is a strain taxon ID (either 562 or 4932). This will most probably need to be searched and processed chunk by chunk.
For each search result (the search results are the ancestor entries):
- Use the descendant list associated with the organism ID (the E. coli strains list for 562 and the S. cerevisiae strains list for 4932).
- Make a new query to find the entries whose organism ID is in the related descendant list and whose name matches (the results will be the descendants).
- Update the DB as described in the issue description. (As far as I understand, the properties of the descendant entries are supposed to be merged into those of the ancestor entries. Should the descendant entries then be removed from the DB?)

maxkfranz · 2019-06-25T16:03:09Z

> Because, from my understanding, "562" is the ancestor and "83333" is in its descendant list, as opposed to how it is stated. Am I right?

Yes, the example might not be great. For some reason, we were using 83333 as the "main" one. I guess since a lot of results come for that particular strain. You're right that it's not the ancestor entry.

All of this editing would have to happen after the normal parsing and indexing finishes. Basically,

1. Find all entries with the same name.
2. Filter the entry list to match only the descendant IDs.
3. Merge the descendant entries into the ancestor entry.
4. Delete the descendant entries from the index.

maxkfranz · 2019-06-25T16:41:03Z

You could do this process bottom-up:

1. Find all entries for each strain of a particular root organism, e.g. yeast.
2. For each entry, see if there is an entry for the root organism with the same name. If so, merge the descendant entry into the root one. If not, create a new entry for the root one and merge the descendant into it.

Because of the bottom-up approach, there should be no non-root entries remaining.
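The bottom-up pass could look roughly like the sketch below. It works on an in-memory array rather than the real index, and `mergeStrainsIntoRoots`, the sample entries, and their IDs are all hypothetical. The only assumption taken from this thread is that each organism exposes `id` (the root taxon ID) and `descendantIds` (the strain taxon IDs).

```javascript
// In-memory sketch of the bottom-up merge; not the repository's code.
function mergeStrainsIntoRoots(entries, organisms) {
  // Map every strain taxon ID to its single root taxon ID.
  const rootOf = new Map();
  for (const org of organisms) {
    for (const strainId of org.descendantIds) rootOf.set(strainId, org.id);
  }

  const union = (a, b) => [...new Set([...a, ...b])];
  const key = (organism, name) => `${organism}|${name}`;
  const result = [];
  const targets = new Map(); // "rootId|name" -> entry that strains merge into

  // Pass 1: keep every non-strain entry. The first one seen for a given
  // (organism, name) pair becomes the merge target for that pair.
  for (const e of entries) {
    if (rootOf.has(e.organism)) continue;
    const copy = { ...e, synonyms: [...(e.synonyms || [])], organisms: [e.organism], ids: [e.id] };
    result.push(copy);
    if (!targets.has(key(e.organism, e.name))) targets.set(key(e.organism, e.name), copy);
  }

  // Pass 2: move every strain entry up to its root.
  for (const e of entries) {
    const rootId = rootOf.get(e.organism);
    if (rootId === undefined) continue;
    let target = targets.get(key(rootId, e.name));
    if (!target) {
      // No root entry with this name: create one, normalised to the root ID.
      target = { ...e, organism: rootId, synonyms: [], organisms: [rootId], ids: [] };
      targets.set(key(rootId, e.name), target);
      result.push(target);
    }
    target.synonyms = union(target.synonyms, e.synonyms || []);
    target.organisms = union(target.organisms, [e.organism]);
    target.ids = union(target.ids, [e.id]);
  }

  return result; // strain entries are not carried over
}

// Illustrative data: the strain list is a one-element stand-in.
const ecoli = { id: '562', descendantIds: ['83333'] };
const sampleEntries = [
  { id: 'r1', organism: '562', name: 'ccdB', synonyms: ['toxin CcdB'] },
  { id: 's1', organism: '83333', name: 'ccdB', synonyms: ['CcdB toxin protein'] },
  { id: 's2', organism: '83333', name: 'ccdA', synonyms: [] }
];
const mergedEntries = mergeStrainsIntoRoots(sampleEntries, [ecoli]);
console.log(mergedEntries);
```

Starting from the strain entries means that `ccdA`, which exists only for the strain in this sample, still ends up as an entry under the root ID.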

metincansiper · 2019-06-25T16:42:07Z

> Yes, the example might not be great. For some reason, we were using 83333 as the "main" one. I guess since a lot of results come for that particular strain. You're right that it's not the ancestor entry.

@maxkfranz I am a bit confused about this. Okay, 83333 is not an ancestor, but does it need any special handling in the implementation?

Is it correct that 562 and 4932 are the only ancestors (maybe just for now) and that their descendants are the descendant lists associated with them (the E. coli strains list for 562 and the S. cerevisiae strains list for 4932), as I mentioned in my previous comment?

maxkfranz · 2019-06-25T16:49:40Z

It's in the Organism class in the orgs-and-strains branch: https://github.com/PathwayCommons/grounding-search/blob/orgs-and-strains/src/server/datasource/organisms.js#L43-L53

org.id is the ID of the root organism. org.descendantIds contains all the IDs of the strains.
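For illustration, normalising a taxon ID to its root with those two properties might look like this. Plain objects stand in for `Organism` instances here, `resolveRootId` is a made-up name, and the strain lists are placeholders rather than the real ones from the branch.

```javascript
// Stand-ins for Organism instances; only `id` and `descendantIds` are
// taken from the thread. The strain lists are illustrative placeholders.
const knownOrganisms = [
  { id: '562', descendantIds: ['83333'] },
  { id: '4932', descendantIds: [] }
];

// Hypothetical helper: return the root taxon ID for a strain, or the
// taxon ID unchanged if it is already a root or is not a known strain.
function resolveRootId(taxonId, orgs) {
  const org = orgs.find(o => o.id === taxonId || o.descendantIds.includes(taxonId));
  return org ? org.id : taxonId;
}

console.log(resolveRootId('83333', knownOrganisms));
```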

maxkfranz · 2019-06-25T16:51:11Z

Just use whatever's in the Organism class. I checked that the IDs are correct.

metincansiper · 2019-06-25T16:51:12Z

> org.id is the ID of the root organism. org.descendantIds contains all the IDs of the strains.

Okay, great. Actually, this is what I meant.

metincansiper · 2019-06-25T17:03:03Z

> You could do this process bottom-up:
>
> 1. Find all entries for each strain of a particular root organism, e.g. yeast.
> 2. For each entry, see if there is an entry for the root organism with the same name. If so, merge the descendant entry into the root one. If not, create a new entry for the root one and merge the descendant into it.
>
> Because of the bottom-up approach, there should be no non-root entries remaining.

Okay, it looks like this approach is the same as the one that I mentioned.

maxkfranz · 2019-06-25T18:28:25Z

It's similar, but it's important to start with the descendants rather than the root. There may be entries that exist only for the descendant but not the root, and we want everything to use the root ID.

metincansiper · 2019-07-01T22:54:47Z

@maxkfranz I realized that there may be more than one entry that is eligible to be the ancestor of an entry. What I am currently doing is choosing the first eligible entry as the ancestor (I am making sure to choose the same entry as the ancestor for all descendants with the same organism and name).

How about this approach in general? Also, would it make sense to merge the root entries to one of them as well?
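One way to guarantee that the same ancestor is always chosen, regardless of the order in which candidates come back from a query, is a fixed tie-break. The sketch below picks the candidate with the lowest numeric `id`. This rule and the name `pickAncestor` are assumptions for illustration, not necessarily what the implementation does.

```javascript
// Hypothetical tie-break: among several root entries with the same
// organism and name, always pick the one with the lowest numeric id.
// Assumes ids are numeric strings, as in the NCBI sample above.
function pickAncestor(candidates) {
  // Copy before sorting so the caller's array is left untouched.
  return [...candidates].sort((a, b) => Number(a.id) - Number(b.id))[0];
}

const candidates = [
  { id: '39524440', organism: '562', name: 'ccdB' },
  { id: '9538168', organism: '562', name: 'ccdB' },
  { id: '39521901', organism: '562', name: 'ccdB' }
];
const chosen = pickAncestor(candidates);
console.log(chosen.id);
```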

metincansiper · 2019-07-01T23:37:59Z

I am actually not sure whether there can be multiple candidates to be the ancestor of an entry. I thought it might be possible, because there may be multiple entities with the same root organism and the same name. However, maybe none of those entities has any descendants. I may need to investigate this further. Do you have any prior information about that?

maxkfranz · 2019-07-02T14:38:33Z

Every descendant or strain will have exactly one corresponding root organism. So if you check each entity that has a descendant organism ID (i.e. an ID not equal to any of the roots), then all descendant entities can be moved up to the root.

If there is an existing match for the root to merge into, that's ideal. If not, then you have to create a new entity for the root into which all descendants are merged.

metincansiper · 2019-07-05T23:15:47Z

@maxkfranz I created PR #47 to resolve this issue.

maxkfranz mentioned this issue Jun 19, 2019

Use NCBI taxonomy database for entry.organismName property #43

Closed

maxkfranz added a commit that referenced this issue Jun 19, 2019

Add an Organism class.

65de011

Ref: - Merge entities that are associated with strains #38 - Use NCBI taxonomy database for `entry.organismName` property #43

maxkfranz added a commit that referenced this issue Jun 19, 2019

Consolidate the indices of strains so that the length remains small.

e78b108

If the indices get too big, then we can't use them for sorting in our combined metric value. Ref: Merge entities that are associated with strains #38

metincansiper mentioned this issue Jul 5, 2019

Merge entities that are associated with strains #47

Merged

maxkfranz mentioned this issue Jul 8, 2019

Improve handling of organisms and strains #49

Merged

maxkfranz closed this as completed Jul 8, 2019

metincansiper mentioned this issue Jul 10, 2019

Update ranking algorithm for strains #52

Closed

metincansiper mentioned this issue Feb 9, 2021

Are we capturing all of the e coli strains? #105

Closed
