'subclass_of' cycles in KG2c #1367

Open
amykglen opened this issue Apr 12, 2021 · 9 comments

@amykglen
Member

while working on KP reasoning requirement 1) in #1268, I went to build an index for Plover that recursively finds all nodes that are biolink:subclass_of a given node in KG2c. (so that if someone is looking for 'diabetes' in their query graph, the query will also effectively consider 'type 2 diabetes', as well as anything that might be a subclass of 'type 2 diabetes', and so on...)
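
A minimal sketch of the kind of recursive index described above, assuming the subclass_of edges are available as (subclass, superclass) ID pairs. The function name `build_descendant_index` and the `EX:` IDs are hypothetical; this is not Plover's actual code. Tracking a `seen` set keeps the traversal finite even when the edges contain cycles.

```python
from collections import defaultdict, deque


def build_descendant_index(edges):
    """Map each node to the set of its direct and indirect subclasses.

    `edges` is an iterable of (subclass_id, superclass_id) pairs (assumed shape).
    """
    children = defaultdict(set)
    nodes = set()
    for sub, sup in edges:
        children[sup].add(sub)
        nodes.update((sub, sup))

    index = {}
    for node in nodes:
        seen = set()
        queue = deque(children.get(node, ()))
        while queue:
            current = queue.popleft()
            if current in seen:
                continue  # already expanded; this is what stops cycles from looping
            seen.add(current)
            queue.extend(children.get(current, ()))
        index[node] = seen
    return index


# Toy data with hypothetical IDs; the third edge deliberately closes a cycle.
toy_edges = [
    ("EX:t2d", "EX:diabetes"),
    ("EX:t2d_subtype", "EX:t2d"),
    ("EX:diabetes", "EX:t2d_subtype"),
    ("EX:t1d", "EX:diabetes"),
]
index = build_descendant_index(toy_edges)
```

Note that a node sitting on a cycle ends up listed among its own descendants, which is one quick way to spot the problem in the finished index.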

but it quickly became apparent that there are a lot of (directed) 'subclass_of' cycles in KG2c. a couple examples from http://kg2c-5-2.rtx.ai:7474/browser/:

match p=(n)-[:`biolink:subclass_of` *3..4]->(n) return p limit 3

[image: Screen Shot 2021-04-11 at 12 13 49 PM]

and there seem to be many such cycles - apparently 26,000 with up to 3 edges in them, but there are also larger cycles (they take forever to count, so I don't have a number, but I suspect it's large). for example, here's a 7-edge one involving Acetaminophen:

match p=(n {id:'CHEMBL.COMPOUND:CHEMBL112'})<-[:`biolink:subclass_of` *1..7]-(n) return p limit 1

[image: Screen Shot 2021-04-11 at 5 09 02 PM]

(acetaminophen alone is apparently part of about 700 subclass_of cycles with up to 7 edges)
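
Since enumerating cycles path by path is slow, one alternative (a swapped-in technique, not what was run above) is to compute strongly connected components, which finds every node that sits on at least one cycle in time linear in the number of edges. It does not count individual cycles. The sketch below uses Kosaraju's two-pass algorithm on an edge list of (source, target) pairs; `cyclic_components` and the single-letter node names are hypothetical.

```python
from collections import defaultdict


def cyclic_components(edges):
    """Return the strongly connected components that contain a cycle."""
    graph = defaultdict(set)
    reverse = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        graph[src].add(dst)
        reverse[dst].add(src)
        nodes.update((src, dst))

    # Pass 1: record nodes in order of DFS completion (iterative, so deep
    # hierarchies do not hit Python's recursion limit).
    order, visited = [], set()
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All neighbors explored: this node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in reverse completion order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for prev in reverse.get(node, ()):
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    stack.append(prev)
        # A component is cyclic if it has 2+ nodes or a self-loop.
        if len(component) > 1 or start in graph.get(start, ()):
            components.append(component)
    return components


toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("E", "E")]
components = cyclic_components(toy_edges)
```

In the toy graph, A, B and C form one cyclic component and E is a self-loop, while D is not on any cycle.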

this definitely makes my task much harder, though I'm sure I can find a way to work around it... but in the bigger picture, should we worry about reasoning using this data? it seems like a bit of the wild west... I'm guessing most of these are the result of little bugs in KG2, its upstream sources, and/or the node synonymizer (I haven't dived in to investigate)... and I imagine they'll be quite hard to totally eradicate.

maybe there's some cleaner way of getting 'subclass_of' info for this purpose (KP reasoning)? for example, should we only trust biolink:subclass_of edges from certain provided_bys? (e.g., maybe don't trust such edges from SEMMED?)
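
A rough sketch of the provided_by filtering idea, assuming each edge is a dict with `subject`, `object`, `predicate`, and a list-valued `provided_by` (an assumed shape, not necessarily the real KG2c export format). The function `trusted_subclass_edges` and the `EX:` node IDs are hypothetical. One design choice is made explicit here: an edge is kept if at least one of its sources is not on the untrusted list.

```python
def trusted_subclass_edges(edges, untrusted_sources=frozenset({"SEMMEDDB:"})):
    """Keep subclass_of edges that are backed by at least one trusted source."""
    kept = []
    for edge in edges:
        if edge["predicate"] != "biolink:subclass_of":
            continue
        sources = set(edge.get("provided_by", ()))
        # Edges with no provided_by at all are dropped along with
        # edges whose sources are all untrusted.
        if sources - untrusted_sources:
            kept.append(edge)
    return kept


toy_edges = [
    {"subject": "EX:a", "object": "EX:b", "predicate": "biolink:subclass_of",
     "provided_by": ["SEMMEDDB:"]},
    {"subject": "EX:b", "object": "EX:a", "predicate": "biolink:subclass_of",
     "provided_by": ["umls_source:GO:"]},
    {"subject": "EX:c", "object": "EX:a", "predicate": "biolink:subclass_of",
     "provided_by": ["SEMMEDDB:", "OBO:go/extensions/go-plus.owl"]},
    {"subject": "EX:c", "object": "EX:d", "predicate": "biolink:related_to",
     "provided_by": ["umls_source:GO:"]},
]
kept = trusted_subclass_edges(toy_edges)
kept_pairs = [(e["subject"], e["object"]) for e in kept]
```

Filtering this way removes the SEMMEDDB-only edge in the toy data, which also happens to break its two-node cycle, but it would not catch cycles made entirely of edges from trusted sources.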

@saramsey
Member

Agree this is a problem that we need to address.

I'm kind of thinking #1342 will be critical to addressing it. What do you think?

@saramsey
Member

saramsey commented Apr 12, 2021

> for example, should we only trust biolink:subclass_of edges from certain provided_bys? (e.g., maybe don't trust such edges from SEMMED?)

Seems like a promising idea. I also wonder if we should schedule a KG2 hackathon to work on this.

@amykglen
Member Author

> I'm kind of thinking #1342 will be critical to addressing it. What do you think?

yeah, totally agree! would be a bit of a nightmare trying to address it without that.. 😂

@amykglen
Member Author

amykglen commented May 8, 2021

this issue is now unblocked with KG2c.6.3 (http://kg2c-6-3.rtx.ai:7474/browser/), thanks to #1342

@amykglen
Member Author

amykglen commented May 13, 2021

also worth noting - some of these are in KG2 itself (vs. just KG2c):

match p=(n)-[:`biolink:subclass_of` *1..3]->(n) return count(p)

returns 1,240 in KG2.6.3

(only counted up to 3 hops as it takes quite a while to look for longer paths)

it seems many of them involve SEMMEDDB edges, but not all do... here's one example (in KG2.6.3) - one edge is from OBO:go/extensions/go-plus.owl and the other edge is from umls_source:GO:

match p=(n)-[e1:`biolink:subclass_of`]->(m)-[e2:`biolink:subclass_of`]->(n) where not "SEMMEDDB:" in e1.provided_by and not "SEMMEDDB:" in e2.provided_by return p limit 2

[image: Screen Shot 2021-05-13 at 11 36 00 AM]

(not sure if there should be a separate issue for these KG2 cycles, or if they'll just be addressed as part of work on this issue?)
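
To triage two-node cycles like the one above outside of Neo4j, something along these lines could work, assuming the subclass_of edges have been exported as (subclass, superclass, provided_by list) triples. The function `reciprocal_pairs` and the `EX:` node IDs are hypothetical. It reports, for each pair of nodes that are each asserted to be a subclass of the other, which sources back each direction.

```python
from collections import defaultdict


def reciprocal_pairs(edges):
    """Find node pairs asserted as subclasses of each other, with their sources."""
    sources = defaultdict(set)
    for sub, sup, provided_by in edges:
        sources[(sub, sup)].update(provided_by)

    pairs = {}
    for (sub, sup), forward_sources in sources.items():
        # `sub < sup` reports each pair once and skips self-loops.
        if sub < sup and (sup, sub) in sources:
            pairs[(sub, sup)] = (forward_sources, sources[(sup, sub)])
    return pairs


toy_edges = [
    ("EX:a", "EX:b", ["OBO:go/extensions/go-plus.owl"]),
    ("EX:b", "EX:a", ["umls_source:GO:"]),
    ("EX:b", "EX:c", ["SEMMEDDB:"]),
]
pairs = reciprocal_pairs(toy_edges)
```

Grouping the output by source combination might help show whether a handful of source pairings account for most of the non-SEMMEDDB cycles.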

@amykglen
Member Author

one more example to highlight how crazy the subclass_of situation is :) (which would also provide a good test case for whatever solution is worked up):

in KG2c, if you look for nodes that are subclass_of diabetes (MONDO:0005015) and go up to 6 levels deep, you wind up with over 250,000 distinct nodes:

match p=(n {id:"MONDO:0005015"})<-[:`biolink:subclass_of` *1..6]-(m) return count(distinct m)

returns 263,647 (on KG2c.6.3)

here's a random sample of some of these 263k nodes deemed 'subclasses' of diabetes:

match p=(n {id:"MONDO:0005015"})<-[:`biolink:subclass_of` *1..6]-(m) return distinct m.id, m.name order by rand() limit 200

m.id | m.name
-- | --
"OMIM:MTHU019953" | "Long phalanges"
"NCBITaxon:557599" | "Mycobacterium kansasii ATCC 12478"
"UMLS:C2881015" | "Bilateral acute angle-closure glaucoma"
"CHEMBL.COMPOUND:CHEMBL1561505" | "SID26666821"
"OMIM:MTHU032492" | "Defects in executive function"
"UniProtKB:P17098" | "ZNF8"
"MONDO:0001482" | "testicular leukemia"
"PR:O89110" | "caspase-8 (mouse)"
"UMLS:C3862265" | "Tendonitis of right wrist"
"CHEBI:165052" | "Tyr-Glu-Ala"
"UMLS:C0334601" | "Undifferentiated Retinoblastoma"
"MESH:D018092" | "Receptors, Kainic Acid"
"PR:P22725-1" | "protein Wnt-5a isoform 1 (mouse)"
"UMLS:C2228234" | "Episcleritis of left eye"
"UMLS:C3665458" | "Hypertensive heart AND chronic kidney disease with congestive heart failure"
"OMIM:MTHU037851" | "Short limbs (in some patients)"
"PR:P13405" | "retinoblastoma-associated protein (mouse)"
"OMIM:MTHU018855" | "Most remit by 6 weeks (1-6 months)"
"CHEMBL.COMPOUND:CHEMBL598951" | "BRAZILIN"
"UMLS:C3554724" | "Complete duplication of thumb phalanx"
"OMIM:MTHU005989" | "Progressive disorder due to secondary myopathy"
"VANDF:4023749" | "Fungi nail"
"UMLS:C2987267" | "Esophageal Synovial Sarcoma"
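
As a possible starting point for the test case mentioned above, here is a small depth-limited traversal that roughly mirrors the `*1..6` query: it collects the distinct nodes reachable from a root by following subclass_of edges backwards for at most `max_depth` hops. It assumes the edges are (subclass, superclass) ID pairs; `descendants_within` and the toy IDs are hypothetical.

```python
from collections import defaultdict


def descendants_within(edges, root, max_depth):
    """Return the distinct nodes within `max_depth` subclass_of hops below `root`."""
    children = defaultdict(set)
    for sub, sup in edges:
        children[sup].add(sub)

    seen = set()
    frontier = {root}
    for _ in range(max_depth):
        next_frontier = set()
        for node in frontier:
            for child in children.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    next_frontier.add(child)
        frontier = next_frontier
        if not frontier:
            break  # nothing new at this depth, so deeper levels are empty too
    return seen


# Toy chain with hypothetical IDs: c -> b -> a -> root.
toy_edges = [("EX:a", "EX:root"), ("EX:b", "EX:a"), ("EX:c", "EX:b")]
two_levels = descendants_within(toy_edges, "EX:root", 2)
```

A regression check could then compare `len(descendants_within(edges, "MONDO:0005015", 6))` against whatever upper bound the team decides is plausible for diabetes; that bound is not something this sketch can supply.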

@finnagin
Member

finnagin commented Feb 2, 2022

@amykglen @saramsey is this still relevant?

saramsey removed their assignment Feb 22, 2023
@saramsey
Member

@amykglen should we close out this issue, or transfer it to the PloverDB project area, or transfer it to the RTX-KG2 project area?

@amykglen
Member Author

hmm, I suppose we should probably keep this open. we have RTXteam/RTX-KG2#63 for tracking this problem in KG2pre, but this one is to track the issue in KG2c, whose code still lives in the RTX repo (and this isn't a Plover issue).

it might be relevant to entity resolution work as well (I suspect KG2c has some cycles that KG2pre does not, due to incorrect merging of concepts)
