Do we want obsolete nodes? #995

Closed
edeutsch opened this issue Aug 1, 2020 · 27 comments
Labels: enhancement, kg2, question, sar-look (marks an issue that Steve needs to examine)

@edeutsch
Collaborator

edeutsch commented Aug 1, 2020

I've been meaning to mention this for a while but keep forgetting. I'm thinking it is not a good thing to include obsolete nodes in KG2. Should we filter these out? Perhaps just keep as a synonym to the new node?

grep -i obsolete NodeNamesDescriptions_KG2.tsv | wc -l
13147

grep -i obsolete NodeNamesDescriptions_KG2.tsv
CUI:C1646276	Flunisolide Anhydrous (obsolete)	unknown_category
GO:0070035	obsolete purine NTP-dependent helicase activity	named_thing
GO:0070036	obsolete GTP-dependent helicase activity	named_thing
GO:0008002	obsolete lamina lucida	named_thing
GO:0008004	obsolete lamina reticularis	named_thing
...
PR:000024298	obsolete ABC transporter arginine-binding protein 1, full-length form	protein
PR:000024288	obsolete L-arabinose-binding periplasmic protein, full-length form	protein
PR:000024285	obsolete periplasmic AppA protein, full-length form	protein
PR:000024293	obsolete lysine-arginine-ornithine-binding periplasmic protein, full-length form	protein
...
DOID:7009	obsolete adult diffuse astrocytoma	disease
DOID:7006	obsolete childhood cerebral diffuse astrocytoma	disease
DOID:7002	obsolete recurrent adenocarcinoma of lung	disease
DOID:8334	obsolete testicular intratubular germ cell neoplasia with extratubular extension	disease
...
MONDO:0003229	obsolete lymphedema	deprecated_node
MONDO:0003224	obsolete spindle cell hemangioma	deprecated_node
MONDO:0003226	obsolete Nelson syndrome	deprecated_node
MONDO:0005882	obsolete onchocerciasis	deprecated_node
MONDO:0003221	obsolete sclerosing hemangioma	deprecated_node
MONDO:0017873	obsolete Ebola hemorrhagic fever	deprecated_node
...
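
A rough Python equivalent of the grep above, broken down by CURIE prefix, might look like the sketch below. It assumes the file is tab-separated with the identifier in the first column and the name in the second, as the sample rows suggest. The function name `count_obsolete_by_prefix` is hypothetical. Unlike `grep -i`, it only inspects the name column, so the totals could differ slightly.

```python
from collections import Counter


def count_obsolete_by_prefix(lines):
    """Tally rows whose name contains "obsolete", keyed by CURIE prefix.

    Assumes tab-separated rows: id, name, category (an assumption based on
    the sample output in this thread).
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed or empty rows
        curie, name = fields[0], fields[1]
        if "obsolete" in name.lower():
            prefix = curie.split(":", 1)[0]
            counts[prefix] += 1
    return counts


sample = [
    "CUI:C1646276\tFlunisolide Anhydrous (obsolete)\tunknown_category\n",
    "GO:0070035\tobsolete purine NTP-dependent helicase activity\tnamed_thing\n",
    "GO:0008002\tobsolete lamina lucida\tnamed_thing\n",
    "DOID:7009\tobsolete adult diffuse astrocytoma\tdisease\n",
]
counts = count_obsolete_by_prefix(sample)
```

To run it on the real file, something like `count_obsolete_by_prefix(open("NodeNamesDescriptions_KG2.tsv", encoding="utf-8"))` should work, since file objects iterate line by line.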
@saramsey
Member

saramsey commented Aug 1, 2020

Good point. Thanks for bringing this issue to my attention.

As you implied, the advantage of deleting them is decrufting KG2.

But.... I wonder if there might conversely be an advantage to keeping them. Other nodes within KG2 (or outside of KG2) may connect to these nodes due to legacy datasets. At this point, this is speculation; I would have to check whether this is really the case.

Certainly, deleting them is possible during the KG2 build process.

@saramsey
Member

saramsey commented Aug 5, 2020

Another possibility is to keep them during the KG2 build process, but to drop deprecated nodes during canonicalization.

@saramsey
Member

saramsey commented Aug 16, 2020

Tagging @amykglen to solicit her thoughts on this

@amykglen
Member

hmm.. I agree it's a bit strange to have obsolete nodes in our knowledge graph, but I checked and it's true that at least some of them are connected to non-deprecated nodes... (about 3,500 out of 36,000 deprecated nodes in kg2endpoint). and some of them are very connected:

match(n {deprecated:"True"})-[e]-(m {deprecated:"False"}) return distinct n.id, count(distinct m) order by count(distinct m) desc
n.id count(distinct m)
"GO:0005620" 109991
"CHEBI:52332" 8019
"DOID:1492" 4283
"DOID:0050815" 1504
"HP:0001006" 724
"DOID:12252" 494
"HP:0002459" 379
"DOID:613" 335
"HP:0000833" 329
"GO:0004871" 311
"DOID:2481" 291
"GO:0001077" 286
"GO:0044212" 247
"DOID:10747" 242
"HP:0001724" 223
...

I guess I'd say it'd be ideal if we kept them in the regular KG2, but that the NodeSynonymizer was capable of 'synonymizing' them (i.e., looking past the obsolete, obsolete_ or (obsolete) in the name). then hopefully many of them would be collapsed into other nodes in the canonicalized KG2? (I see it looks like this may already be happening, at least for some obsolete nodes.)

although it looks like 21,000 of the 36,000 deprecated nodes don't have a name, which means they can't really be synonymized by the NodeSynonymizer... hmm
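
As a minimal sketch of what "looking past" the obsolete marker could mean, the helper below strips the spellings seen in this thread (`obsolete `, `obsolete_`, `OBSOLETE: `, and a trailing `(obsolete)`) from a name. The function `strip_obsolete_marker` is hypothetical and is not part of the NodeSynonymizer. Nodes with no name pass through unchanged, which matches the caveat above that they cannot be synonymized by name.

```python
import re

# Hypothetical patterns covering the spellings seen in this thread.
_OBSOLETE_PREFIX = re.compile(r"^\s*obsolete[\s_:]+", re.IGNORECASE)
_OBSOLETE_SUFFIX = re.compile(r"\s*\(obsolete\)\s*$", re.IGNORECASE)


def strip_obsolete_marker(name):
    """Return (cleaned_name, was_marked) for a node name.

    A name of None is returned as (None, False).
    """
    if name is None:
        return None, False
    cleaned = _OBSOLETE_PREFIX.sub("", name)
    cleaned = _OBSOLETE_SUFFIX.sub("", cleaned)
    cleaned = cleaned.strip()
    return cleaned, cleaned != name.strip()
```

For example, `strip_obsolete_marker("obsolete lamina lucida")` yields the cleaned name `"lamina lucida"` together with a flag saying a marker was found, so a caller can still record that the source node was obsolete.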

@saramsey
Member

saramsey commented Aug 18, 2020

OK, if a node satisfies:

  • deprecated = True
  • has null for the replaced_by field
  • has null for name, description, and full name

then I think it is a reasonable candidate for deletion at the filter_kg_and_remap_predicates.py stage. How does that sound?
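
The three criteria above can be sketched as a single predicate. This is only an illustration: the function name `is_deletion_candidate`, the dict-based node representation, and the property keys `replaced_by` and `full_name` are assumptions, and it is also an assumption that an empty string should count as null. It is not the code in filter_kg_and_remap_predicates.py.

```python
def is_deletion_candidate(node):
    """Return True if a node dict meets all three criteria listed above."""

    def is_null(value):
        # Assumption: treat both None and the empty string as "null".
        return value is None or value == ""

    # The Cypher queries in this thread match deprecated as the string
    # "True", so accept either the boolean or the string form.
    deprecated = node.get("deprecated") in (True, "True")
    return (
        deprecated
        and is_null(node.get("replaced_by"))
        and all(
            is_null(node.get(field))
            for field in ("name", "description", "full_name")
        )
    )
```

A deprecated node that still carries a name or a `replaced_by` pointer is kept by this test, since it may still be useful for synonymization.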

@edeutsch
Collaborator Author

The above from @saramsey seems reasonable to me.
But I also wonder:

  • Is there any reason to keep any node that has null for name, description, and full name? Is there value in that?

  • I started this issue considering nodes where the name contains a few different variants of "obsolete". I wonder how that is related to an attribute deprecated = True?

@saramsey
Member

> The above from @saramsey seems reasonable to me.
> But I also wonder:
>
> * Is there any reason to keep any node that has `null` for `name`, `description`, and `full name`? Is there value in that?

Yes, for example, a PathWhiz reaction node that connects to all of its reactants and products.

@saramsey
Member

saramsey commented Aug 21, 2020

> * I started this issue considering nodes where the _name_ contains a few different variants of "obsolete". I wonder how that is related to an attribute `deprecated = True`?

The multi_ont_to_json_kg.py script detects if a node's description field starts with obsolete and deletes it in that case.

If additional cases of uncaught obsolete nodes are pasted here, I can update our filters to catch them.

saramsey added a commit that referenced this issue Aug 21, 2020
@amykglen
Member

> I started this issue considering nodes where the name contains a few different variants of "obsolete". I wonder how that is related to an attribute deprecated = True?

in the latest KG2 (kg2-3-1), it looks like 96% of nodes with 'obsolete' in their name are marked as deprecated=True. about 486 have deprecated=False, and those look like they mainly come from ORPHANET:

match (n {deprecated:"False"}) where toLower(n.name) contains "obsolete" return n.provided_by, count(n) order by count(n) desc
n.provided_by count(n)
"ORPHANET:" 426
"GO:go-plus.owl" 19
"OBO:foodon.owl" 11
"identifiers_org_registry:umls" 6
"umls_source:HL7" 5
"umls_source:MTH" 3
"UMLS_STY:" 3
"umls_source:RXNORM" 2
"umls_source:SNOMEDCT" 2
"OBO:ro.owl" 2
"OBO:genepio.owl" 2
"OBO:ino.owl" 2
"umls_source:NCBITAXON" 1
"umls_source:NCI" 1
"OBO:go/extensions/go-plus.owl" 1

@saramsey
Member

Thanks, @amykglen. Would you mind pasting into this issue a couple of example name fields from obsolete Orphanet nodes (the ones that have deprecated=False in Neo4j for whatever reason)?

@amykglen
Member

sure! here are some examples:

n.id n.name n.deprecated
"ORPHANET:352699" "OBSOLETE: Cobblestone lissencephaly type C" "False"
"ORPHANET:352694" "OBSOLETE: Cobblestone lissencephaly type A" "False"
"ORPHANET:268874" "OBSOLETE: Congenital hydromyelia" "False"
"ORPHANET:206619" "OBSOLETE: Toxic or/and iatrogenic neuropathy" "False"
"ORPHANET:206606" "OBSOLETE: Other muscle weakness and/or chronic muscle pain" "False"

(obtained via this query in kg2-3-1: match (n {deprecated:"False", provided_by:"ORPHANET:"}) where toLower(n.name) contains "obsolete" return n.id, n.name, n.deprecated limit 100)

saramsey added a commit that referenced this issue Aug 27, 2020
@saramsey
Member

saramsey commented Aug 27, 2020

Thanks, I've pushed a possible fix for this (38a2e6a).

@saramsey
Member

Testing on kg2dev now....

@kvarforl
Contributor

As of KG2.3.5, Amy's ORPHANET query returned no records, so that particular issue seems to be fixed.

> (obtained via this query in kg2-3-1: match (n {deprecated:"False", provided_by:"ORPHANET:"}) where toLower(n.name) contains "obsolete" return n.id, n.name, n.deprecated limit 100)

However, when running match (n {deprecated:"False"}) where toLower(n.name) contains "obsolete" return n.provided_by, count(n) order by count(n) desc to find nodes with obsolete in the name and deprecated=False, GO:go-plus.owl is a big offender.

n.provided_by count(n)
"GO:go-plus.owl" 2227
"identifiers_org_registry:umls" 8
"umls_source:MTH" 6
"umls_source:HL7" 5
"umls_source:HCPCS" 3
"UMLS_STY:" 2
"umls_source:SNOMEDCT" 2
"OBO:ro.owl" 2

For that reason, I'm going to leave this issue open for now.

@saramsey
Member

saramsey commented Feb 2, 2021

Hi @kvarforl what's the status of this issue? I see that we had "verify this fix in the next KG2 build" but then that label was removed last October. Just wondering about what I should do.

@kvarforl
Contributor

kvarforl commented Feb 2, 2021

I haven't changed anything on this since October!

Did we decide that it's okay to have nodes with "obsolete" in the name and deprecated=False?

If so, we can probably close this issue. If not, I can try to track down the offending nodes in KG2.5.1.

@saramsey
Member

saramsey commented Feb 3, 2021

Hi @kvarforl ah, I missed where you wrote an explanation:

> However, when running match (n {deprecated:"False"}) where toLower(n.name) contains "obsolete" return n.provided_by, count(n) order by count(n) desc to find nodes with obsolete in the name and deprecated=False, GO:go-plus.owl is a big offender.
>
> n.provided_by count(n)
> "GO:go-plus.owl" 2227
> ...

@saramsey
Member

saramsey commented Feb 3, 2021

I wonder if, in multi_ont_to_json_kg.py, we can add code to set deprecated=True if the node's name starts with the substring "obsolete" (but not delete the node). How does that sound, @kvarforl?

Our test case for whether or not it worked could be this Cypher query in Neo4j:

match (n {deprecated:"False", provided_by: 'GO:go-plus.owl'}) where toLower(n.name) starts with "obsolete" return count(*)

It currently returns 2227, but after your fix, it should return a count of 0.
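
A minimal sketch of the proposed marking step follows. It assumes nodes are Python dicts with a `name` key and a boolean `deprecated` key at this stage of the build. The function name `mark_obsolete_as_deprecated` is hypothetical and this is not the code that was committed to multi_ont_to_json_kg.py.

```python
def mark_obsolete_as_deprecated(nodes):
    """Set deprecated=True on nodes whose name starts with "obsolete".

    Nodes are modified in place and never deleted. Returns the number of
    nodes whose flag was changed.
    """
    changed = 0
    for node in nodes:
        name = node.get("name") or ""  # tolerate missing or null names
        if name.lower().startswith("obsolete") and node.get("deprecated") is not True:
            node["deprecated"] = True
            changed += 1
    return changed


nodes = [
    {"id": "GO:0070035",
     "name": "obsolete purine NTP-dependent helicase activity",
     "deprecated": False},
    {"id": "MONDO:0003229", "name": "obsolete lymphedema", "deprecated": True},
    {"id": "EX:1", "name": "lymphedema", "deprecated": False},
    {"id": "EX:2", "name": None, "deprecated": False},
]
changed = mark_obsolete_as_deprecated(nodes)
```

Because the comparison is done on the lowercased name, this would also catch the `OBSOLETE: ...` style seen in the Orphanet examples earlier in the thread.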

saramsey added a commit that referenced this issue Feb 3, 2021
@saramsey
Member

saramsey commented Feb 3, 2021

@kvarforl I took a shot at a definitive solution to this issue; see commit a0ca122. If we want, we can always add an option somewhere in the build process to filter out the deprecated=True nodes.

@saramsey
Member

saramsey commented Feb 3, 2021

OK @edeutsch after thinking about it, I believe the best approach is as follows:

  • in the KG2 ETL process, use heuristics to catch all cases of obsolete nodes and always set deprecated=True for such nodes
  • a reasonable place to filter out the deprecated=True nodes may be in the KG2c build process, what do you think @amykglen ?
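
The second bullet could be sketched roughly as below: drop the deprecated=True nodes and any edge that would be left dangling. The function name `drop_deprecated` and the edge keys `subject` and `object` are assumptions for illustration, and this is not the actual KG2c build code.

```python
def drop_deprecated(nodes, edges):
    """Return (kept_nodes, kept_edges) with deprecated nodes filtered out.

    An edge is kept only if both of its endpoints survive the node filter.
    """
    kept_nodes = [n for n in nodes if n.get("deprecated") is not True]
    kept_ids = {n["id"] for n in kept_nodes}
    kept_edges = [
        e for e in edges
        if e["subject"] in kept_ids and e["object"] in kept_ids
    ]
    return kept_nodes, kept_edges
```

Given how heavily connected some deprecated nodes are (see the counts earlier in this thread), dropping them this way would also remove a noticeable number of edges, which may be worth measuring before adopting it.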

@edeutsch
Collaborator Author

edeutsch commented Feb 3, 2021

okay, fine with me. I do not recall a specific problem caused by obsolete nodes.

@amykglen
Member

amykglen commented Feb 4, 2021

sure! if we don't want obsolete nodes, that's fine with me to remove them during the KG2c build process. so in that case, they'd still be a part of the NodeSynonymizer's knowledge base (since it's built off the regular KG2).... I suppose one little hitch is that we would need to make sure the synonymizer never chooses an obsolete curie as the preferred_curie for a synonym group (if that synonym group has any non-obsolete members). otherwise things could get a little hairy. see any issues with that, @edeutsch? (I guess you would need to check the deprecated property for KG2 nodes in the synonymizer's build process?)
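
The constraint described above (never pick an obsolete curie as preferred_curie when the group has a non-obsolete member) might be sketched like this. The function `choose_preferred_curie` and the `(curie, is_obsolete)` tuple layout are hypothetical. The sketch assumes the group is already sorted in whatever priority order the synonymizer uses, and it does not reflect the synonymizer's real selection logic.

```python
def choose_preferred_curie(group):
    """Pick the first non-obsolete curie from a priority-ordered group.

    `group` is a list of (curie, is_obsolete) tuples. If every member is
    obsolete, fall back to the highest-priority one.
    """
    if not group:
        raise ValueError("synonym group must not be empty")
    for curie, is_obsolete in group:
        if not is_obsolete:
            return curie
    return group[0][0]
```

How `is_obsolete` gets populated is left open here. It could come from the deprecated property or from a name check, depending on what the synonymizer's build process has access to.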

@edeutsch
Collaborator Author

edeutsch commented Feb 4, 2021

I don't currently have a way of checking the deprecated property during the NodeSynonymizer build process. The process works from file dumps, not a live connection. One possibility is to include the word OBSOLETE in the name. That I can check, and it is already done to some extent.

@kvarforl
Contributor

kvarforl commented Mar 6, 2021

Okay, as of KG2.5.2 there are very few nodes that have obsolete in the name but deprecated as false:

match (n {deprecated:"False"}) where toLower(n.name) contains "obsolete" return n.provided_by, count(n) order by count(n) desc
n.provided_by count(n)
"identifiers_org_registry:umls" 8
"umls_source:MTH" 6
"UMLS_STY:" 4
"umls_source:HCPCS" 3
"umls_source:HL7" 3
"umls_source:SNOMEDCT" 2
"OBO:ro.owl" 2
"OBO:ino.owl" 2
"OBO:genepio.owl" 2
"umls_source:NCBITAXON" 1
"umls_source:NCI" 1
"umls_source:RXNORM" 1
"OBO:foodon.owl" 1
"OBO:hp.owl" 1

@saramsey is this sufficient, or should we put a catch somewhere to try to get them all to be marked as deprecated?

kvarforl added the sar-look label and removed the "verify this fix in next KG2 build" label on Mar 6, 2021
@saramsey
Member

saramsey commented Mar 12, 2021

Hi @kvarforl can I see the actual names for some example nodes, along with their id fields and category fields?

@saramsey
Member

Maybe something like this?

match (n {deprecated:"False"}) where toLower(n.name) contains "obsolete" return n.name, n.id, n.category limit 10;

@kvarforl
Contributor

I'm going to close this issue and move the discussion over to #1261 since they're discussing the same question/problem. I added examples in https://github.com/RTXteam/RTX/issues/1261#issuecomment-797736336 :)

4 participants