'subclass_of' cycles in KG2c #1367
Comments
Agree this is a problem that we need to address. I'm kind of thinking #1342 will be critical to addressing it. What do you think? |
Seems like a promising idea. I also wonder if we should schedule a KG2 hackathon to work on this. |
yeah, totally agree! would be a bit of a nightmare trying to address it without that.. 😂 |
this issue is now unblocked with KG2c.6.3 (http://kg2c-6-3.rtx.ai:7474/browser/), thanks to #1342 |
also worth noting - some of these are in KG2 itself (vs. just KG2c):
returns 1,240 in KG2.6.3 (I only counted up to 3 hops, as it takes quite a while to look for longer paths). it seems many of them involve SEMMEDDB edges, but not all do... here's one example (in KG2.6.3) - one edge is from
(not sure if there should be a separate issue for these KG2 cycles, or if they'll just be addressed as part of work on this issue?) |
one more example to highlight how crazy things are in KG2c: if you look for nodes that are subclass_of diabetes (MONDO:0005015) and go up to 6 levels deep, you wind up with over 250,000 distinct nodes:
returns 263,647 (on KG2c.6.3). here's a random sample of some of these 263k nodes deemed 'subclasses' of diabetes:
|
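For anyone who wants to reproduce this kind of depth-limited count outside of Neo4j, here is a minimal Python sketch. It assumes the subclass_of edges have already been exported as `(child, parent)` pairs; the function name `descendants_within` and the toy node IDs are made up for illustration and are not from the KG2c codebase. The `seen` set is what keeps the traversal from looping forever when the edges contain cycles.

```python
from collections import defaultdict, deque


def descendants_within(edges, root, max_depth):
    """Return every node reachable from `root` by walking subclass_of
    edges backwards (parent -> child), at most `max_depth` hops deep.

    `edges` is an iterable of (child, parent) pairs; this layout is an
    assumption for the sketch, not the actual KG2c export format.
    """
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)

    seen = {root}                 # visited set: makes cycles harmless
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue              # do not expand past the depth limit
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))

    seen.discard(root)            # the root is not its own descendant
    return seen


# Toy data (hypothetical IDs) containing a 3-edge cycle: DM -> MODY -> T2D -> DM
toy_edges = [
    ("T2D", "DM"),    # T2D subclass_of DM
    ("MODY", "T2D"),  # MODY subclass_of T2D
    ("DM", "MODY"),   # DM subclass_of MODY (closes the cycle)
    ("X", "MODY"),    # X subclass_of MODY
]
print(len(descendants_within(toy_edges, "DM", 6)))
```

Note that a cycle means everything "above" a node on the cycle also becomes its descendant, which is one plausible way a single concept can end up with hundreds of thousands of 'subclasses'.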
@amykglen should we close out this issue, or transfer it to the PloverDB project area, or transfer it to the RTX-KG2 project area? |
hmm, I suppose we should probably keep this open. we have #RTXteam/RTX-KG2#63 for tracking this problem in KG2pre, but this one is to track the issue in KG2c, whose code still lives in the RTX repo (and this isn't a Plover issue). it might be relevant to entity resolution work as well (I suspect KG2c has some cycles that KG2pre does not, due to incorrect merging of concepts) |
in working on KP reasoning requirement 1) in #1268, I went to build an index for Plover that recursively finds all nodes that are `biolink:subclass_of` a given node in KG2c. (so that if someone is looking for 'diabetes' in their query graph, the query will also effectively consider 'type 2 diabetes', as well as anything that might be a subclass of 'type 2 diabetes', and so on...)
but it quickly became apparent that there are a lot of (directed) 'subclass_of' cycles in KG2c. a couple examples from http://kg2c-5-2.rtx.ai:7474/browser/:
and there seem to be many such cycles - apparently 26,000 with up to 3 edges in them, but there are also larger cycles (they take forever to count, so I don't have a number, but I suspect it's large). for example, here's a 7-edge one involving Acetaminophen:
(acetaminophen alone is apparently part of about 700 subclass_of cycles with up to 7 edges)
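As a rough illustration of what "cycles with up to N edges" means here, the following Python sketch counts distinct directed simple cycles of bounded length in an edge list. It is not the query used for the numbers above; the function name `count_cycles`, the `(child, parent)` edge layout, and the toy node IDs are all assumptions for the example. Each cycle is only counted from its smallest node ID so that rotations of the same cycle are not counted repeatedly.

```python
from collections import defaultdict


def count_cycles(edges, max_len):
    """Count distinct directed simple cycles with at most `max_len` edges.

    `edges` is an iterable of (child, parent) pairs (assumed layout).
    This is a brute-force search and will be slow on a large graph.
    """
    graph = defaultdict(set)
    for child, parent in edges:
        graph[child].add(parent)

    count = 0
    for start in list(graph):
        # Explore paths that begin at `start` and only pass through
        # nodes with a larger ID, so each cycle is found exactly once
        # (from its smallest member).
        stack = [(start, (start,))]
        while stack:
            node, path = stack.pop()
            for nxt in graph.get(node, ()):
                if nxt == start:
                    count += 1  # closing edge: cycle of len(path) edges
                elif nxt > start and nxt not in path and len(path) < max_len:
                    stack.append((nxt, path + (nxt,)))
    return count


# Toy data (hypothetical IDs): a 2-edge cycle, a 3-edge cycle, a self-loop
toy_edges = [
    ("A", "B"), ("B", "A"),   # A <-> B
    ("B", "C"), ("C", "A"),   # A -> B -> C -> A
    ("D", "D"),               # self-loop
]
print(count_cycles(toy_edges, 3))
```

The cost grows quickly with `max_len`, which is consistent with the longer cycles taking "forever to count".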
this definitely makes my task much harder, though I'm sure I can find a way to work around it... but in the bigger picture, should we worry about reasoning using this data? it's seeming like a bit of the wild west... I'm guessing most of these are the result of little bugs in KG2/its upstream sources and/or the node synonymizer (haven't dived in to investigate)... and I imagine they'll be quite hard to totally eradicate.
maybe there's some cleaner way of getting 'subclass_of' info for this purpose (KP reasoning)? for example, should we only trust `biolink:subclass_of` edges from certain `provided_by`s? (e.g., maybe don't trust such edges from SEMMED?)
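To make the source-filtering idea concrete, here is a minimal Python sketch. It assumes each edge is a dict with `subject`, `object`, `predicate`, and `provided_by` keys; that shape, the function name `trusted_subclass_edges`, and the source labels in the toy data are hypothetical and may not match how KG2c actually stores provenance.

```python
def trusted_subclass_edges(edges, untrusted_sources):
    """Keep only biolink:subclass_of edges whose source is not in
    `untrusted_sources`, returned as (child, parent) pairs.

    The edge dict keys used here are assumptions for this sketch.
    """
    kept = []
    for edge in edges:
        if edge["predicate"] != "biolink:subclass_of":
            continue  # not a subclass edge at all
        if edge["provided_by"] in untrusted_sources:
            continue  # drop edges from sources we do not trust
        kept.append((edge["subject"], edge["object"]))
    return kept


# Toy data: the IDs and source labels below are made up for illustration
toy_edges = [
    {"subject": "EX:1", "object": "EX:2",
     "predicate": "biolink:subclass_of", "provided_by": "ontology_source"},
    {"subject": "EX:2", "object": "EX:1",
     "predicate": "biolink:subclass_of", "provided_by": "semmed_source"},
    {"subject": "EX:3", "object": "EX:1",
     "predicate": "biolink:related_to", "provided_by": "ontology_source"},
]
print(trusted_subclass_edges(toy_edges, {"semmed_source"}))
```

In this toy case, dropping the one untrusted edge also removes the EX:1/EX:2 cycle, but filtering by source would not by itself guarantee an acyclic result.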